In this lab we'll explore regularized regression models to better understand how to handle the bias-variance trade-off in parameterized machine learning models.
Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).
To begin, you'll need the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
This lab's initial examples and questions will use the superconductivity dataset from the UCI machine learning repository:
### Read superconductivity data
sc = pd.read_csv("https://remiller1450.github.io/data/super.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_sc, test_sc = train_test_split(sc, test_size=0.2, random_state=5)
## Split predictors and outcome
train_y_sc = train_sc['critical_temp']
train_X_sc = train_sc.drop(['critical_temp'], axis=1)
train_X_sc.shape
(17010, 81)
The goal in this application was to predict the critical temperature of a material based upon its chemical properties. Critical temperature is the temperature at which the electrical resistivity of a material effectively vanishes. For most conductors, electrical resistivity gradually declines with temperature, but for superconductors it drops abruptly. The predictive features in this application were derived from the chemical formula of the material in question.
Part 1: Ridge and Lasso Regression in sklearn

Ridge and Lasso regression each have their own functions in sklearn, Ridge() and Lasso(), both of which behave like LinearRegression() but accept an additional argument, alpha, that controls the amount of regularization. However, it is unwise to simply throw these models into GridSearchCV() to tune alpha like we've done for other tuning parameters. To understand why, let's first create a regularization path plot:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
## Standardize
train_Xs_sc = StandardScaler().fit_transform(train_X_sc)
## Set up sequence of alphas
n_alphas = 200
alphas = np.logspace(-1, 5, n_alphas)
## Find coefficients at each alpha (Ridge regression)
coefs = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(train_Xs_sc, train_y_sc)
    coefs.append(ridge.coef_)
## Create Plot
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("alpha")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
Notice that the regularization path is smooth, meaning the estimated weights at nearby values of alpha are similar. Lasso and Elastic Net regression do not have closed-form solutions, so their optimal weights are found using iterative algorithms. Because the path is smooth, we'd like to avoid "starting from scratch" when estimating the optimal weights at each value of alpha, so we should use specialized functions (RidgeCV() and LassoCV()) that employ a warm-start strategy (initializing each fit using the results from an adjacent value of alpha):
from sklearn.linear_model import RidgeCV
## Set up sequence of alphas
n_alphas = 200
alphas = np.logspace(-1, 5, n_alphas)
## Evaluate using 5-fold CV and print the best alpha
ridge_cv_res = RidgeCV(cv=5, alphas = alphas, scoring='neg_mean_squared_error').fit(train_Xs_sc, train_y_sc)
best_alpha = ridge_cv_res.alpha_
print(best_alpha)
0.11489510001873092
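After the search, RidgeCV() refits the model on the full training set using the selected alpha, so the final weight estimates are available directly; a brief sketch:

## Weight estimates and intercept from the model refit at the selected alpha
print(ridge_cv_res.coef_[:5])
print(ridge_cv_res.intercept_)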
## Compare w/ what we'd get from LinearRegression() using 5-fold CV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lm_score = cross_val_score(LinearRegression(), train_Xs_sc, train_y_sc, scoring='neg_mean_squared_error', cv = 5)
ridge_score = cross_val_score(Ridge(alpha=best_alpha), train_Xs_sc, train_y_sc, scoring='neg_mean_squared_error', cv = 5)
print("OLS:", np.average(lm_score), "Ridge:", np.average(ridge_score))
OLS: -309.5888872062948 Ridge: -309.5656322375192
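Because cross_val_score() reports negative mean squared error, you may find it convenient to convert these scores to RMSE, which is on the same scale as the outcome; a brief sketch reusing the scores computed above:

## Convert average negative MSE to RMSE for easier interpretation
print("OLS RMSE:", np.sqrt(-np.average(lm_score)),
      "Ridge RMSE:", np.sqrt(-np.average(ridge_score)))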
Question #1: The Lasso() function in sklearn is directly comparable to Ridge().

Part A: Create a sequence of alpha values using np.logspace() going from 0.1 to 100. Then, using the example in this section as a template, create the regularization path plot for a Lasso model applied to the superconductivity data.
Part B: Using your plot from Part A, comment on the estimated weights of the Lasso model when alpha = 10.
Part C: Use LassoCV() to perform 5-fold cross-validation in order to find the optimal value of alpha from the sequence you created in Part A. Because the function does not have a scoring argument, you may rely upon the default scoring metric ($R^2$) to select the best alpha.
Part D: Use GridSearchCV() to compare the best performing Lasso model from Part C with the best Ridge model (found in the example) and an ordinary linear regression model using mean squared error as the scoring metric. The purpose of this comparison is to ensure the exact same CV folds are used for each model. Report the cross-validated RMSE of each model.
Part E: Briefly comment upon why you believe the Lasso model at the value of alpha identified in Part C performed better/worse than the Ridge model at its best value of alpha.

Part 2: Regularization in Logistic Regression

In a previous lab, we noted that sklearn's implementation of logistic regression applies regularization by default, which is why we provided the argument penalty='none'.
To further understand how regularization is implemented in this model, as well as why it is useful, we'll use the MNIST handwritten digit data set. Specifically, we'll see if we can use pixel intensities as features in logistic regression to predict whether or not a digit is a four.
### Read flattened MNIST data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_mnist, test_mnist = train_test_split(mnist, test_size=0.2, random_state=5)
### Separate the label column (outcome)
train_mnist_y = train_mnist['label']
train_mnist_X = train_mnist.drop(['label'], axis="columns")
### Create a binary label for 4's
train_mnist_y_binary = (train_mnist_y == 4).astype(int)
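Before modeling, it can also be worth checking how common the positive class is, since only about one in ten digits is a four; a quick check:

## Proportion (and counts) of training images labeled as a four
print(train_mnist_y_binary.mean())
print(train_mnist_y_binary.value_counts())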
To begin, let's consider a simple pipeline that fits a logistic regression model without any regularization:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('model', LogisticRegression(penalty=None, solver='saga'))
])
log_mod = pipe.fit(train_mnist_X, train_mnist_y_binary)
C:\Users\millerry\Anaconda3\lib\site-packages\sklearn\linear_model\_sag.py:350: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge warnings.warn(
This model fails to converge, meaning an optimal set of weight parameters, $\{w_0, w_1, \ldots, w_p\}$, could not be determined. There are a few reasons why this can happen that you should be aware of:

- The input features are on very different scales (or are highly correlated), which makes the optimization problem numerically difficult for the solver.
- The classes are perfectly (or nearly perfectly) separable, in which case the weight estimates can grow without bound and the algorithm never settles on a final solution.

As you might expect, both of these issues can be addressed via regularization, but your first strategy should be to rescale the input features and modify the optimization algorithm and its parameters (the type of solver, number of iterations, and tolerance).
## Modify and re-fit the logistic regression pipeline
from sklearn.preprocessing import MinMaxScaler
new_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(penalty=None, solver='lbfgs', max_iter=150))
])
new_log_mod = new_pipe.fit(train_mnist_X, train_mnist_y_binary)
These changes were enough to allow convergence, meaning the optimization algorithm got close enough to an optimal set of weights that the weight estimates were no longer changing by more than a certain level of tolerance.
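If you'd like to verify this yourself, a simple check (a brief sketch using the fitted pipeline above) is the model's n_iter_ attribute, which records how many iterations the solver actually used; a value below max_iter suggests the algorithm stopped because the tolerance was reached:

## Number of iterations used by the solver (should be below max_iter=150)
print(new_log_mod.named_steps['model'].n_iter_)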
To further understand what is happening here it can be helpful to plot a logistic regression model's linear predictors vs. its predicted probabilities:
## Extract the weights and bias from the original fitted model
w = log_mod.named_steps['model'].coef_
b = log_mod.named_steps['model'].intercept_

## Calculate the linear predictor using the dot product of X and w + the bias term
## (this pipeline had no scaler, so we use the raw pixel features it was fit on)
lin_predictor = np.dot(train_mnist_X, np.transpose(w)) + b

## Apply the sigmoid transformation to get predicted probabilities
pi = 1/(1 + np.exp(-lin_predictor))

## Plot the original model (which failed to converge), colored by the binary label
plt.scatter(lin_predictor, pi, c=train_mnist_y_binary)
plt.show()
## Plot the new model (which converged); this pipeline scales the features, so we scale here too
w = new_log_mod.named_steps['model'].coef_
b = new_log_mod.named_steps['model'].intercept_
lin_predictor = np.dot(MinMaxScaler().fit_transform(train_mnist_X), np.transpose(w)) + b
pi = 1/(1 + np.exp(-lin_predictor))
plt.scatter(lin_predictor, pi, c=train_mnist_y_binary)
plt.show()
C:\Users\millerry\AppData\Local\Temp\ipykernel_27380\4165323997.py:5: RuntimeWarning: overflow encountered in exp pi = 1/(1 + np.exp(-lin_predictor))
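As an aside, computing the linear predictor and the sigmoid by hand is purely for illustration; the fitted pipeline's decision_function() and predict_proba() methods return essentially the same quantities. A quick sketch using the converged pipeline:

## The pipeline applies its scaler automatically before the model, so these
## should match the manual calculation above (up to numerical precision)
print(new_log_mod.decision_function(train_mnist_X)[:5])   # linear predictor
print(new_log_mod.predict_proba(train_mnist_X)[:5, 1])    # predicted P(y = 1)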
Question #2: Use cross_val_predict() to obtain cross-validated predictions using 3-fold cross-validation for each training sample, using the pipeline that had converged with the full training set. What is the cross-validated accuracy score of this model? Why might it not be as high as the accuracy score you found in Parts A and B?
Now let's consider regularization as a way to handle the issues we encountered in this section. The LogisticRegression() function uses the argument C to control the inverse of the regularization strength. Thus, unlike how alpha works in Lasso and Ridge regression, smaller values of C impose a greater penalty upon the weight vector.
The code below demonstrates how we can use this argument to add a Ridge penalty to logistic regression and show the regularization path plot:
## Set up sequence of "C" values
n_c = 20
c_vals = np.logspace(-5, 0, n_c)
ridge_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(penalty="l2", solver='lbfgs', warm_start=True, max_iter=150))
])

## Fit the pipeline to each value of C
coefs = []
for c in c_vals:
    ridge_pipe.set_params(model__C=c)
    ridge_pipe.fit(train_mnist_X, train_mnist_y_binary)
    coefs.append(ridge_pipe.named_steps['model'].coef_.ravel())
## Create Plot
ax = plt.gca()
ax.plot(c_vals, coefs)
ax.set_xscale("log")
plt.xlabel("C")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
Similar to Part 1 of the lab, sklearn has a specialized function, LogisticRegressionCV(), that uses the "warm start" strategy for more efficient tuning of C. This function is demonstrated below:
## Perform cross-validated search for C
c_vals = np.logspace(-4, -1, 6)
from sklearn.linear_model import LogisticRegressionCV
cv_lr_mod = LogisticRegressionCV(cv=3, Cs = c_vals, penalty="l2", solver="lbfgs", max_iter=150).fit(train_mnist_X, train_mnist_y_binary)
## Print the best value of C
print(cv_lr_mod.C_)
## Print the accuracy of the refit model on the training data
print(cv_lr_mod.score(train_mnist_X, train_mnist_y_binary))
C:\Users\millerry\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result(
[0.0001]
0.9991666666666666
You can see that some of the models didn't converge, but we were still able to find an amount of regularization that produced a model with very good accuracy (higher than our unpenalized model from earlier).
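If you'd like to examine the search itself, the fitted LogisticRegressionCV object stores the per-fold accuracy for every candidate value of C in its scores_ attribute (a dictionary keyed by class label); a brief sketch:

## Per-fold accuracy at each candidate C for the positive class (label 1)
fold_scores = cv_lr_mod.scores_[1]
print(fold_scores.shape)           # (n_folds, number of candidate C values)
print(fold_scores.mean(axis=0))    # average cross-validated accuracy at each C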
Question #3:

Part A: Create a regularization path plot for a logistic regression model that uses an L1 (Lasso) penalty on the MNIST data; note that you will need to change the solver. Consider 25 different logarithmically spaced values of C ranging from $10^{-4}$ to 1 when creating your plot.
Part B: How many variables have non-zero weights in this L1-penalized model when C = $10^{-3}$? How many variables have non-zero weights in the model using L2 regularization when C = $10^{-3}$?
Part C: Use LogisticRegressionCV() to select an optimal value of C for the model you considered in Part A using 3-fold cross-validation. Report the optimal value and the corresponding cross-validated accuracy score.
Part D: LogisticRegressionCV() stores the estimated weights of the model with the best performing value of C. Use these estimated weights to report the sparsity of the best model. That is, what proportion of input features had weight estimates that were exactly zero?

Part 3: Application

This application will use many of the core machine learning steps we've engaged with throughout the semester up to this point.
You will work with the Ames Housing data set, which is slightly dated (it contains homes sold between 2006 and 2010) but remains one of the most comprehensive publicly available housing data sets.
The goal of the application is to accurately predict sale prices based upon features that would be available before a home is sold. The steps involved in Question #4 will also introduce a few new functions that are minor extensions of what we've seen in this lab and previous ones.
## Read the data
ames = pd.read_csv("https://remiller1450.github.io/data/AmesHousing.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_ames, test_ames = train_test_split(ames, test_size=0.2, random_state=9)
## Separate y and X
train_ames_y = train_ames['SalePrice']
train_ames_X = train_ames.drop('SalePrice', axis=1)
## Note the presence of missing data
print(train_ames_X.isna().sum().sort_values(ascending=False))
## We'll also remove any variables with more than 1000 missing values,
## since they have too much missing data to reliably impute
train_ames_X = train_ames_X[train_ames_X.columns[train_ames_X.isna().sum() < 1000]]
Pool.QC           2333
Misc.Feature      2263
Alley             2193
Fence             1903
Fireplace.Qu      1136
                  ...
Heating.QC           0
Central.Air          0
X1st.Flr.SF          0
X2nd.Flr.SF          0
Sale.Condition       0
Length: 81, dtype: int64
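One pandas method worth knowing before starting Question #4 is select_dtypes(), which filters a DataFrame's columns by data type; a minimal illustration on the Ames predictors:

## Numeric predictors (integer/float columns)
num_cols = train_ames_X.select_dtypes(include='number').columns
## Remaining predictors (object columns holding categorical features)
cat_cols = train_ames_X.select_dtypes(exclude='number').columns
print(len(num_cols), len(cat_cols))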
Question #4:

Part A: Inspect the variables in train_ames_X and drop anything that should not be used as a predictor. For example, you may assume that the property identification number has no predictive value since it is arbitrarily assigned by the county. Relatedly, variables like Yr.Sold would not be available for a home that has just gone on the market, so it isn't relevant for prediction in our context. Finally, you may remove the variable MS.SubClass for simplicity.
Part B: Separate the remaining numeric and categorical predictors so that each type can be preprocessed appropriately. You may use the select_dtypes() method to facilitate this.
Part C: Use ElasticNetCV() with 5-fold cross-validation and sequences of reasonably chosen alpha values and l1_ratio values (aim to use approximately 30 different alpha values and 3-5 different l1_ratio values).
Part D: Report the values of alpha and l1_ratio selected in Part C.
Part E: Use GridSearchCV() to find appropriate values of the n_neighbors and weights hyperparameters for a k-nearest neighbors model.