This is a short lab covering the topic of regularization in the context of regression and classification models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
In our regression examples, we'll use the superconductivity dataset from the UCI machine learning repository:
### Read superconductivity data
sc = pd.read_csv("https://remiller1450.github.io/data/super.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_sc, test_sc = train_test_split(sc, test_size=0.2, random_state=5)
## Split predictors and outcome
train_y_sc = train_sc['critical_temp']
train_X_sc = train_sc.drop(['critical_temp'], axis=1)
train_X_sc.shape
(17010, 81)
For some additional background, critical temperature is the temperature at which the electrical resistivity of a material vanishes. For most conductors, resistivity gradually declines with temperature, but for superconductors it drops abruptly. The 81 predictors in this application are characteristics derived from the chemical formulas of various materials. We will aim to predict critical temperature using the chemical features of a material.
In our classification examples, we'll return to the data from Sobar (2016) that has been used in several of our previous labs. Recall that this study aimed to accurately predict instances of cervical cancer using data from several non-invasive behavioral questionnaires.
## Read sobar data
sobar = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv')
train_sobar, test_sobar = train_test_split(sobar, test_size=0.2, random_state=5)
## Define outcome variable
train_sobar_y = train_sobar['ca_cervix']
## Define predictor matrix
train_sobar_X = train_sobar.drop('ca_cervix',axis=1)
Regularization is a penalty imposed on a model's estimated weights that shrinks them towards zero. For ridge regression, the penalty function has the basic form $\alpha \sum_j w_j^2$, while the Lasso penalty function is $\alpha \sum_j |w_j|$.
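For additional context, these penalties are added to the model's usual loss function. Ignoring the exact scaling constants that sklearn applies to the loss term, the penalized objectives being minimized are roughly:
$$\min_w \; \sum_i \big(y_i - x_i^\top w\big)^2 + \alpha \sum_j w_j^2 \quad \text{(ridge)} \qquad\qquad \min_w \; \sum_i \big(y_i - x_i^\top w\big)^2 + \alpha \sum_j |w_j| \quad \text{(lasso)}$$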
Ridge and Lasso regression each have their own functions in sklearn, and both functions are similar to LinearRegression with an additional argument, alpha, that controls the amount of regularization. However, it is unwise to jump straight into a cross-validated grid search to tune $\alpha$ using GridSearchCV() like we've done for other tuning parameters. To understand why, let's first look at a regularization path plot for a Lasso model.
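Before building that plot, here is a minimal sketch (not part of the original lab) of the basic interface: fitting each model once at a single, arbitrarily chosen alpha. The variable names ending in _demo are ours.
## Sketch: fit ridge and lasso once at an arbitrary alpha to show the LinearRegression-like interface
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
Xs_demo = StandardScaler().fit_transform(train_X_sc)     # standardize predictors first
ridge_demo = Ridge(alpha=1.0).fit(Xs_demo, train_y_sc)   # alpha chosen arbitrarily for illustration
lasso_demo = Lasso(alpha=1.0).fit(Xs_demo, train_y_sc)
print(ridge_demo.coef_[:5])   # weight estimates are stored in coef_, just like LinearRegression
print(lasso_demo.coef_[:5])
With that interface in mind, the code below traces the weight estimates across a grid of alpha values: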
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
## Standardize
train_Xs_sc = StandardScaler().fit_transform(train_X_sc)
## Set up sequence of alphas
n_alphas = 40
alphas = np.logspace(-1, 2, n_alphas)
## Find coefficients at each alpha (Lasso regression)
coefs = []
for a in alphas:
    lasso_mod = Lasso(alpha=a)
    lasso_mod.fit(train_Xs_sc, train_y_sc)
    coefs.append(lasso_mod.coef_)
## Create Plot
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("alpha")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
Notice that for a given value of alpha, the weight estimates at nearby values of alpha are very similar (i.e., the regularization path is smooth). Thus, for computational efficiency, we don't want to perform a large number of optimization calculations that each start "from scratch". Instead, it's more efficient (and numerically stable) to use a "warm start" strategy that finds the weight estimates for the current value of alpha using the estimates from a slightly larger alpha as the initial guess. The specialized functions RidgeCV() and LassoCV() use warm starts to facilitate easier tuning of alpha:
from sklearn.linear_model import RidgeCV
## Set up sequence of alphas
n_alphas = 200
alphas = np.logspace(-1, 5, n_alphas)
## Evaluate using 5-fold CV and print the best alpha
ridge_cv_res = RidgeCV(cv=5, alphas = alphas, scoring='neg_mean_squared_error').fit(train_Xs_sc, train_y_sc)
best_alpha = ridge_cv_res.alpha_
print(best_alpha)
0.11489510001873092
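To make the warm-start idea described above more concrete, here is a small sketch we've added (not part of the original lab) that reuses each solution as the starting point for the next, smaller alpha via Lasso's warm_start option; LassoCV() follows a similar path-wise strategy internally.
## Sketch: manually walk the path from strong to weak penalization with warm starts
lasso_ws = Lasso(warm_start=True)
for a in np.sort(alphas)[::-1]:            # largest alpha first
    lasso_ws.set_params(alpha=a)
    lasso_ws.fit(train_Xs_sc, train_y_sc)  # initialized at the previous fit's coef_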
Question 1:
Use the LassoCV() function with k=5 folds to find the value of $\alpha$ that leads to the best cross-validated performance on the superconductivity training data. Because LassoCV() does not currently support alternative scoring functions, you may use the default ($R^2$).
In our previous lab, we used the argument penalty=None (or penalty='none') whenever fitting a logistic (or softmax) regression model. This was because these functions are programmed to include a ridge penalty by default.
In the LogisticRegression() function, the argument C is the inverse of the regularization strength, so smaller values of C correspond with higher amounts of penalization (the opposite of how $\alpha$ works in the Ridge and Lasso functions we've used for numeric outcomes).
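As a quick illustration of this inverse relationship, here is a sketch we've added (the _demo names are ours) comparing a heavily and a lightly penalized logistic regression fit:
## Sketch: smaller C = stronger penalization = smaller weight estimates
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
sobar_Xs_demo = StandardScaler().fit_transform(train_sobar_X)
strong_demo = LogisticRegression(penalty="l2", C=0.01, max_iter=1000).fit(sobar_Xs_demo, train_sobar_y)   # heavy shrinkage
weak_demo = LogisticRegression(penalty="l2", C=100.0, max_iter=1000).fit(sobar_Xs_demo, train_sobar_y)    # light shrinkage
print(np.abs(strong_demo.coef_).max(), np.abs(weak_demo.coef_).max())   # weights shrink in magnitude as C decreases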
Below we demonstrate L1 regularized logistic regression on the Sobar data:
from sklearn.linear_model import LogisticRegression
## Standardize X
train_sobar_Xs = StandardScaler().fit_transform(train_sobar_X)
## Set up sequence of "C" values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize model
lr_mod = LogisticRegression(penalty="l1", solver="liblinear", warm_start=True, tol=1e-6)
coefs = []
for c in c_vals:
    lr_mod.set_params(C=c)
    lr_mod.fit(train_sobar_Xs, train_sobar_y)
    coefs.append(lr_mod.coef_.ravel())
## Create Plot
ax = plt.gca()
ax.plot(c_vals, coefs)
ax.set_xscale("log")
plt.xlabel("C")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
Notice how we changed the optimization algorithm to liblinear, an algorithm that supports both Lasso and Ridge penalties (the default algorithm, lbfgs, only supports L2 regularization). The documentation page for LogisticRegression() provides information on the solving algorithms that are compatible with each penalty.
Additionally, we needed to tighten the tolerance of the solving algorithm (using the tol parameter) to obtain stable weight estimates for some of the smaller amounts of penalization (i.e., large values of C). We can start to see signs of this estimation instability in the weight estimates towards the end of the regularization path.
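One simple check we can add here (not something the lab requires) is to inspect the fitted model's n_iter_ attribute after the loop; values at or near the max_iter limit suggest the solver struggled to converge.
## Sketch: how many iterations did the final (largest C) fit use?
print(lr_mod.n_iter_)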
If we wanted to use the Ridge penalty (L2 regularization) instead of the Lasso penalty (L1 regularization), we'd simply modify the penalty argument:
## Set up sequence of "C" values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize model
lr_mod = LogisticRegression(penalty="l2", solver="liblinear", warm_start=True, tol=1e-6)
coefs = []
for c in c_vals:
    lr_mod.set_params(C=c)
    lr_mod.fit(train_sobar_Xs, train_sobar_y)
    coefs.append(lr_mod.coef_.ravel())
## Create Plot
ax = plt.gca()
ax.plot(c_vals, coefs)
ax.set_xscale("log")
plt.xlabel("C")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
When choosing between regularization approaches, it is important to remember that L1 regularization promotes sparsity (weight estimates of exactly zero), while L2 regularization does not. However, L2 regularization can provide greater stability in the face of highly collinear predictors.
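To see this difference directly, here is a quick comparison we've added (not part of the original lab), refitting both penalties at the same moderately strong C and counting weights that are exactly zero:
## Sketch: L1 zeroes out weights, L2 generally does not
l1_demo = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(train_sobar_Xs, train_sobar_y)
l2_demo = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(train_sobar_Xs, train_sobar_y)
print(np.sum(l1_demo.coef_ == 0), "weights are exactly zero under L1")
print(np.sum(l2_demo.coef_ == 0), "weights are exactly zero under L2")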
To use cross-validation to tune the hyperparameter C, we should rely upon the LogisticRegressionCV() function. As discussed in the previous section, this specialized function uses the "warm start" strategy and is much more effective at tuning C than GridSearchCV().
from sklearn.linear_model import LogisticRegressionCV
## Set up seq of C values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize and fit
cv_lr_mod = LogisticRegressionCV(cv=10, Cs = c_vals, penalty="l1", solver="liblinear",
tol=1e-6, scoring = "f1").fit(train_sobar_Xs, train_sobar_y)
## Print the best value of C
print(cv_lr_mod.C_)
## Print the training-set score of the refitted model (uses the "f1" scoring given above)
print(cv_lr_mod.score(train_sobar_Xs, train_sobar_y))
[4.52035366]
1.0
In this example we provided a sequence of C values via the Cs argument. We then access the single value of C that resulted in the best cross-validated score using the .C_ attribute of our fitted model.
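If we later want to evaluate the tuned model on the held-out test set, remember that the test predictors must be standardized using a scaler fit on the training data. Below is a minimal sketch of this (the scaler and test variable names are ours; the scaler reproduces the same transformation applied to the training data above):
## Sketch: apply the training-data scaling to the test set before scoring
scaler_sobar = StandardScaler().fit(train_sobar_X)
test_sobar_y = test_sobar['ca_cervix']
test_sobar_Xs = scaler_sobar.transform(test_sobar.drop('ca_cervix', axis=1))
print(cv_lr_mod.score(test_sobar_Xs, test_sobar_y))   # scored with the "f1" metric set above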
Question 2
The Lasso penalty has the benefit of leading to sparse models; however, its behavior in scenarios with highly collinear predictors can be undesirable. Recall that if two predictors are very highly correlated, the Lasso penalty will select one of them, and the combined signal (impact on the outcome variable) will be absorbed into this single estimated weight. In these situations, introducing additional L2 regularization on top of the Lasso penalty can lead to both predictors being selected, and their shared signal being split between them (perhaps resulting in a more generalizable model).
This penalization scheme (combining L1 and L2 penalties) is known as the Elastic Net, and the property described above is known as the grouping effect of the Elastic Net. In the 2005 paper proposing the method (Zou and Hastie, 2005, Journal of the Royal Statistical Society, Series B, 67: 301-320), the authors describe the elastic net:
Similar to the lasso, the elastic net simultaneously does automatic variable selection and continuous shrinkage, and it can select groups of correlated variables. It is like a stretchable fishing net that retains ‘all the big fish’.
In sklearn, Elastic Net models use a common $\alpha$ value and the argument l1_ratio to control the proportion of that value used in the L1 penalty (with the remainder being used as the L2 penalty parameter). For example, l1_ratio = 1 is equivalent to the Lasso penalty, while l1_ratio = 0 is equivalent to the ridge penalty.
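Written out (following the form given in the sklearn documentation for ElasticNet), the combined penalty is approximately $\alpha \cdot \text{l1\_ratio} \sum_j |w_j| + \tfrac{\alpha}{2}(1 - \text{l1\_ratio}) \sum_j w_j^2$, which reduces to the lasso penalty when l1_ratio = 1 and to a ridge-style penalty when l1_ratio = 0.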
Similar to other regularized regression methods, elastic net has its own built-in cross-validation function that we can use to tune the parameters l1_ratio and alpha before going on to include an elastic net model in a larger comparison of different models.
from sklearn.linear_model import ElasticNetCV
## Set up sequence of alphas
n_alphas = 50
alphas = np.logspace(-1, 5, n_alphas)
## Set up a few different l1 ratios
rats = [.1, .5, .7, .9, .95, .99, 1]
## Evaluate using 5-fold CV and print the best alpha and l1_ratio
enet_cv_res = ElasticNetCV(cv=5, alphas = alphas, l1_ratio = rats).fit(train_Xs_sc, train_y_sc)
print([enet_cv_res.alpha_, enet_cv_res.l1_ratio_])
[0.1, 1.0]
Note that once ElasticNetCV has found the best performing combination of alpha and l1_ratio via cross-validation, it will fit the corresponding model to the entire data set. Thus, printing the alpha_ and l1_ratio_ attributes of the fitted model tells us the best combination of tuning parameters.
In this example, we see that a Lasso model (l1_ratio = 1.0) with alpha = 0.1 was found to be optimal.
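As a quick follow-up we've added (not required by the lab), we can check how many of the 81 weights the refitted model actually keeps:
## Sketch: count non-zero weights in the refitted elastic net (here, effectively lasso) model
print(np.sum(enet_cv_res.coef_ != 0), "of", len(enet_cv_res.coef_), "weights are non-zero")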
If we'd like to use elastic net regularization for classification, we simply need to change the penalty argument in LogisticRegression() (or LogisticRegressionCV()) to "elasticnet". When doing this, we must be careful to adjust the optimization algorithm that is used, as "saga" is currently the only solver that supports the elastic net penalty.
## Set up seq of C values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Fit
cv_elastic_lr = LogisticRegressionCV(cv=10, Cs = c_vals,
l1_ratios = [0.1,0.5],
penalty="elasticnet",solver="saga", tol=1e-4, max_iter = 1000,
scoring = "f1").fit(train_sobar_Xs, train_sobar_y)
## Print the best value of C
print(cv_elastic_lr.C_)
## Print the best l1_ratio
print(cv_elastic_lr.l1_ratio_)
## Print the training-set score of the refitted elastic net model (uses the "f1" scoring given above)
print(cv_elastic_lr.score(train_sobar_Xs, train_sobar_y))
[72.78953844]
[0.1]
1.0
It's worth noting that the tolerance, tol, and the maximum number of iterations, max_iter, were both specified explicitly in order to reach convergence for these data (and the tuning parameters used).
In case you aren't familiar with these terms, the tolerance reflects the maximum change in the weights that is acceptable for the estimates to be considered optimized. For example, setting tol=1e-4 indicates that if the largest change in the weight estimates from one iteration to the next is less than 0.0001, the algorithm has reached a sufficiently accurate solution.
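Relatedly, if the tolerance isn't met within max_iter passes, sklearn issues a ConvergenceWarning rather than an error, which is easy to overlook. One option (a sketch we've added, not something the lab requires) is to promote these warnings to errors while experimenting:
## Sketch: make convergence failures impossible to miss
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("error", category=ConvergenceWarning)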
Question #3:
Use LogisticRegressionCV() to tune an elastic net logistic regression on the Sobar training data, considering l1_ratio values that include the lasso and ridge penalties as special cases (including 0 and 1 in l1_ratios will achieve this). Report the best fitting model and state whether it uses the elastic net, lasso, or ridge penalty.