This lab will cover implementations of regularization (penalization) for generalized linear models in sklearn.
As usual, we'll begin by loading several familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
For regression examples, we'll use the superconductivity dataset from the UCI machine learning repository:
### Read superconductivity data
sc = pd.read_csv("https://remiller1450.github.io/data/super.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_sc, test_sc = train_test_split(sc, test_size=0.2, random_state=5)
## Split predictors and outcome
train_y_sc = train_sc['critical_temp']
train_X_sc = train_sc.drop(['critical_temp'], axis=1)
train_X_sc.shape
(17010, 81)
For some additional background, critical temperature is the temperature at which the electrical resistivity of a material vanishes. For most conductors, resistivity declines gradually as temperature decreases, but for superconductors it drops abruptly to zero at the critical temperature. The 81 predictors in this application are characteristics derived from the chemical formulas of various materials. We will aim to predict critical temperature using the chemical features of a material.
For classification examples, we'll return to the data from Sobar (2016), which was used in Lab #3, Part 1 and the corresponding lecture (on classifier performance metrics). Recall that this study aimed to accurately predict instances of cervical cancer using data from several non-invasive behavioral questionnaires.
## Read sobar data
sobar = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv')
train_sobar, test_sobar = train_test_split(sobar, test_size=0.2, random_state=5)
## Define outcome variable
train_sobar_y = train_sobar['ca_cervix']
## Define predictor matrix
train_sobar_X = train_sobar.drop('ca_cervix',axis=1)
Regularization describes a penalty imposed on a model's weight estimates that shrinks them toward zero. For ridge regression, the penalty function is $\alpha \sum_j w_j^2$, while the Lasso penalty function is $\alpha \sum_j |w_j|$.
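For reference, sklearn parameterizes these penalties slightly differently than the generic formulas above: per its documentation, Ridge minimizes $\|y - Xw\|_2^2 + \alpha \|w\|_2^2$, while Lasso minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \|w\|_1$. Because of the extra $\frac{1}{2n}$ factor on the Lasso's squared-error term, the $\alpha$ values used by the two functions are not directly comparable.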
Ridge and Lasso regression each have their own functions in sklearn, and both are similar to LinearRegression with an additional argument, alpha, that controls the amount of regularization. However, it is unwise to jump straight into a cross-validated grid search to tune $\alpha$ like we've done for other tuning parameters. To understand why, let's first look at a regularization path plot:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
## Standardize
train_Xs_sc = StandardScaler().fit_transform(train_X_sc)
## Set up sequence of alphas
n_alphas = 200
alphas = np.logspace(-1, 5, n_alphas)
## Find coefficients at each alpha (Ridge regression)
coefs = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(train_Xs_sc, train_y_sc)
    coefs.append(ridge.coef_)
## Create Plot
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("alpha")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
For a given value of alpha, the Ridge estimates at nearby values of alpha are similar (i.e., the regularization path is smooth).
For computational efficiency, we don't want to fit hundreds of regressions "from scratch". Instead, it's much more efficient (and numerically stable) to use a "warm start" strategy that finds the weight estimates for the current value of alpha using the estimates from a slightly larger alpha as initial values. The specialized functions RidgeCV and LassoCV use warm starts to facilitate easier tuning of alpha:
## Import
from sklearn.linear_model import RidgeCV
## Set up sequence of alphas
n_alphas = 200
alphas = np.logspace(-1, 5, n_alphas)
## Evaluate using 5-fold CV and print the best alpha
ridge_cv_res = RidgeCV(cv=5, alphas = alphas, scoring='neg_mean_squared_error').fit(train_Xs_sc, train_y_sc)
best_alpha = ridge_cv_res.alpha_
print(best_alpha)
0.11489510001873092
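To see what the warm-start strategy looks like mechanically, below is a minimal sketch using Lasso, which exposes a warm_start argument (plain Ridge does not). This is only an illustration of the idea, not how RidgeCV is implemented internally:
## Trace a regularization path with warm starts (illustrative sketch)
from sklearn.linear_model import Lasso
lasso_ws = Lasso(warm_start=True)
ws_coefs = []
for a in sorted(alphas, reverse=True):       # strongest penalty first
    lasso_ws.set_params(alpha=a)
    lasso_ws.fit(train_Xs_sc, train_y_sc)    # initialized at the previous solution
    ws_coefs.append(lasso_ws.coef_.copy())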
After finding an optimal amount of regularization (value of alpha), we can compare the cross-validated performance of this model with ordinary linear regression:
## Print the best cross-validated score from RidgeCV
print(ridge_cv_res.best_score_)
## Fit linear regression and evaluate via 5-fold CV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lm_score = cross_val_score(LinearRegression(), train_Xs_sc, train_y_sc, scoring='neg_mean_squared_error', cv = 5)
print(np.average(lm_score))
-309.56563223752676
-309.58888720629477
Here we see that by introducing a small penalty ($\alpha = 0.115$) on the L2 norm of the weight vector, we can achieve slightly better out-of-sample performance.
However, we might be curious whether L1 regularization is a better strategy. In situations where a large number of predictors are expected to be unrelated to the outcome, the Lasso tends to outperform ridge regression due to its variable selection properties. Let's see if that is the case for the superconductivity data:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
## Standardize
train_Xs_sc = StandardScaler().fit_transform(train_X_sc)
## Set up sequence of alphas
n_alphas = 50
alphas = np.logspace(-1, 2, n_alphas)
## Find coefficients at each alpha (Lasso regression)
coefs = []
for a in alphas:
    lasso = Lasso(alpha=a)
    lasso.fit(train_Xs_sc, train_y_sc)
    coefs.append(lasso.coef_)
## Create Plot
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale("log")
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel("alpha")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
Notice how the Lasso regularization path shows many variables with weights estimated at exactly zero for many values of $\alpha$. For example, at alpha = 10 only 4 predictors are actively contributing to the model; the remainder have weight estimates of exactly zero.
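We can check this kind of claim directly by refitting at a single value of alpha and counting the nonzero weights (a quick sketch using the objects defined above):
## Count the predictors with nonzero weight estimates at alpha = 10
lasso_10 = Lasso(alpha=10).fit(train_Xs_sc, train_y_sc)
print(np.sum(lasso_10.coef_ != 0))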
In the question below, I'll ask you to formally compare the performance of Lasso regression with Ridge regression and ordinary least squares using cross-validation.
Question: Use the LassoCV function with k=5 folds to find the value of $\alpha$ that leads to the best cross-validated performance on the superconductivity training data. Because LassoCV does not currently support alternative scoring functions, you may use the default ($R^2$).
In Lab 4 (part 2) we used the argument penalty='none' whenever fitting a logistic (or softmax) regression model. This was because these functions are programmed to include a ridge penalty by default to provide numerical stability in their weight estimates.
In the LogisticRegression function, the argument C is the inverse of the regularization strength, so smaller values of C correspond to higher amounts of penalization (the opposite of how $\alpha$ works in the Ridge and Lasso functions we've used for numeric outcomes).
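For reference, the sklearn user guide writes the L2-penalized binary logistic objective (with labels coded $y_i \in \{-1, 1\}$) as $\min_{w, b} \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i^T w + b)\right)\right)$. Here $C$ multiplies the data-fit term rather than the penalty, which is another way to see why larger values of $C$ correspond to weaker regularization.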
Below we demonstrate L1 regularized logistic regression on the Sobar data:
## Import
from sklearn.linear_model import LogisticRegression
## Standardize
train_sobar_Xs = StandardScaler().fit_transform(train_sobar_X)
## Set up sequence of "C" values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize model
lr_mod = LogisticRegression(penalty="l1", solver="liblinear", warm_start=True, tol=1e-6)
coefs = []
for c in c_vals:
    lr_mod.set_params(C=c)
    lr_mod.fit(train_sobar_Xs, train_sobar_y)
    coefs.append(lr_mod.coef_.ravel())
## Create Plot
ax = plt.gca()
ax.plot(c_vals, coefs)
ax.set_xscale("log")
plt.xlabel("C")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
Notice how we changed the optimization algorithm to liblinear, an algorithm that supports both Lasso and Ridge penalties (the default algorithm, lbfgs, only supports L2 regularization).
Additionally, we needed to tighten the solver's convergence tolerance (lowering the tol parameter from its default) to obtain stable weight estimates for the smaller amounts of penalization (i.e., large values of C). We can start to see signs of this estimation instability in the weight estimates toward the end of the regularization path.
If we wanted to use the Ridge penalty (L2 regularization) instead of the Lasso penalty (L1 regularization), we'd simply modify the penalty argument:
## Set up sequence of "C" values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize model
lr_mod = LogisticRegression(penalty="l2", solver="liblinear", warm_start=True, tol=1e-6)
coefs = []
for c in c_vals:
    lr_mod.set_params(C=c)
    lr_mod.fit(train_sobar_Xs, train_sobar_y)
    coefs.append(lr_mod.coef_.ravel())
## Create Plot
ax = plt.gca()
ax.plot(c_vals, coefs)
ax.set_xscale("log")
plt.xlabel("C")
plt.ylabel("weights")
plt.axis("tight")
plt.show()
When choosing between regularization approaches, it is important to remember that L1 regularization promotes sparsity (weight estimates of exactly zero), while L2 regularization does not. However, L2 regularization can provide greater stability in the face of highly collinear predictors.
To use cross-validation to tune the hyperparameter C, we should rely upon the LogisticRegressionCV function. As discussed in the previous section, this specialized function uses the "warm start" strategy and is much more effective at tuning C than generic functions like GridSearchCV.
## Import
from sklearn.linear_model import LogisticRegressionCV
## Set up seq of c-values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize and fit
cv_lr_mod = LogisticRegressionCV(cv=10, Cs = c_vals, penalty="l1", solver="liblinear", tol=1e-6, scoring = "f1").fit(train_sobar_Xs, train_sobar_y)
## Print the best value of C
print(cv_lr_mod.C_)
## Print the F1 score on the full training data
print(cv_lr_mod.score(train_sobar_Xs, train_sobar_y))
[4.52035366]
1.0
Notice how we can supply a sequence of C values to try via the Cs argument. We can then access the single value of C that resulted in the best cross-validated score using the .C_ attribute of our fitted model.
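If you'd like to see how the other candidate values of C fared, the fitted object also stores the per-fold scores in its scores_ attribute (a small sketch; scores_ is a dictionary keyed by class label, and each value has one row per fold and one column per candidate C):
## Average cross-validated F1 across folds for each candidate value of C
fold_scores = list(cv_lr_mod.scores_.values())[0]
print(fold_scores.mean(axis=0))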
Question: Use the LogisticRegressionCV model (fit in the previous code chunk) to construct and display a confusion matrix.
The Lasso penalty has the benefit of leading to sparse models; however, its behavior in scenarios with highly collinear predictors can be undesirable. Recall that if two predictors are very highly correlated, the Lasso penalty will tend to select only one of them, and their combined signal (impact on the outcome variable) will be absorbed into this single estimated weight. In these situations, introducing additional L2 regularization on top of the Lasso penalty can lead to both predictors being selected, with their shared signal being split between them (perhaps resulting in a more generalizable model).
This penalization scheme (combining L1 and L2 penalties) is known as the Elastic Net, and the property described above is known as the grouping effect of the Elastic Net. In the 2005 paper proposing the method, Zou and Hastie describe the elastic net as follows:
Similar to the lasso, the elastic net simultaneously does automatic variable selection and continuous shrinkage, and it can select groups of correlated variables. It is like a stretchable fishing net that retains ‘all the big fish’.
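To see the grouping effect in miniature, the sketch below uses a small synthetic data set (not the lab data) with two nearly identical predictors and compares the Lasso to sklearn's ElasticNet (its l1_ratio argument is explained below):
## Toy illustration of the grouping effect on two nearly identical predictors
from sklearn.linear_model import Lasso, ElasticNet
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)      # x2 is nearly a copy of x1
X_toy = np.column_stack([x1, x2])
y_toy = 3*x1 + 3*x2 + rng.normal(size=200)
print(Lasso(alpha=0.5).fit(X_toy, y_toy).coef_)                     # often concentrates the weight on one predictor
print(ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_toy, y_toy).coef_)  # tends to split the weight between them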
In sklearn, Elastic Net models use a common $\alpha$ value and the argument l1_ratio to control the proportion of that value used in the L1 penalty (with the remainder being used as the L2 penalty parameter). For example, l1_ratio = 1 is equivalent to the Lasso penalty, while l1_ratio = 0 is equivalent to the Ridge penalty.
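More precisely, sklearn's documentation gives the ElasticNet objective as $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \|w\|_1 + \frac{\alpha (1 - \text{l1\_ratio})}{2} \|w\|_2^2$, so l1_ratio simply splits the overall penalty budget $\alpha$ between the L1 and L2 terms.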
Similar to other regularized regression methods, elastic net has its own built-in cross-validation function that we can use to tune the parameters l1_ratio and alpha before going on to include an elastic net model in a larger comparison of different models.
## Import
from sklearn.linear_model import ElasticNetCV
## Set up sequence of alphas
n_alphas = 50
alphas = np.logspace(-1, 5, n_alphas)
## Set up a few different l1 ratios
rats = [.1, .5, .7, .9, .95, .99, 1]
## Evaluate using 5-fold CV and print the best alpha and l1_ratio
enet_cv_res = ElasticNetCV(cv=5, alphas = alphas, l1_ratio = rats).fit(train_Xs_sc, train_y_sc)
print([enet_cv_res.alpha_, enet_cv_res.l1_ratio_])
[0.1, 1.0]
Note that once ElasticNetCV has found the best-performing combination of alpha and l1_ratio via cross-validation, it will fit the corresponding model to the entire data set. Thus, printing the alpha_ and l1_ratio_ attributes of the fitted model tells us the best combination of tuning parameters.
In this example, we see that a Lasso model (l1_ratio = 1.0) with alpha = 0.1 was found to be optimal.
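If you're curious how close the competition was, the fitted object also stores the cross-validation error for every combination it tried in its mse_path_ attribute (a quick sketch, assuming the array is arranged with one entry per l1_ratio, alpha, and fold):
## Lowest average CV mean squared error achieved by each candidate l1_ratio
print(enet_cv_res.mse_path_.mean(axis=-1).min(axis=1))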
If we'd like to use elastic net regularization for classification, we simply need to change the penalty argument in LogisticRegression (or LogisticRegressionCV) to "elasticnet". When doing this, we must also be careful to adjust the optimization algorithm that is used, as only "saga" currently supports the elastic net penalty.
## Set up seq of c-values
n_c = 30
c_vals = np.logspace(-2, 5, n_c)
## Initialize
cv_elastic_lr = LogisticRegressionCV(cv=10, Cs = c_vals, l1_ratios = [0.1,0.5], penalty="elasticnet", solver="saga", tol=1e-4, max_iter = 1000, scoring = "f1")
## Fit
cv_elastic_lr.fit(train_sobar_Xs, train_sobar_y)
## Print the best value of C
print(cv_elastic_lr.C_)
## Print the best l1_ratio
print(cv_elastic_lr.l1_ratio_)
## Print the F1 score of the elastic net model on the full training data
print(cv_elastic_lr.score(train_sobar_Xs, train_sobar_y))
[72.78953844]
[0.1]
1.0
It's worth noting that the tolerance, tol, and the maximum number of iterations, max_iter, were changed from their defaults to reach convergence for these data (and tuning parameters).
In case you aren't familiar with these terms, the tolerance reflects the maximum change in the weights that is acceptable for the estimates to be considered optimized. For example, setting tol=1e-4 indicates that if the largest change in the weight estimates from one iteration to the next is less than 0.0001, the algorithm has reached a sufficiently accurate solution.
Question: Use LogisticRegressionCV to tune a penalized logistic regression model for the Sobar data, considering the ridge, lasso, and elastic net penalties (your choice of l1_ratios should achieve this). Report the best fitting model and state whether it uses the elastic net, lasso, or ridge penalty.
The Ames Housing data set includes comprehensive information on all residential properties sold in Ames, Iowa between 2006 and 2010. The goal of this application is to build a model that can accurately predict the sale price of a home in Ames that has yet to go on the market based upon the features of the property. Note: although it has become slightly dated, this data set remains one of the most comprehensive and interesting publicly available home price data sets.
A description of each variable contained in the Ames Housing data set can be found here.
A few early steps in your work with these data are given below:
## Read the data
ames = pd.read_csv("https://remiller1450.github.io/data/AmesHousing.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_ames, test_ames = train_test_split(ames, test_size=0.2, random_state=9)
## Separate y and X
train_ames_y = train_ames['SalePrice']
train_ames_X = train_ames.drop('SalePrice', axis=1)
## Note the presence of missing data
print(train_ames_X.isna().sum().sort_values(ascending=False))
## We'll remove any variables with more than 1000 missing values since they have too much missing data to reliably impute.
train_ames_X = train_ames_X[train_ames_X.columns[train_ames_X.isna().sum() < 1000]]
# train_ames_X.columns # In case you need to know some of the column names
Pool.QC           2333
Misc.Feature      2263
Alley             2193
Fence             1903
Fireplace.Qu      1136
                  ...
Heating.QC           0
Central.Air          0
X1st.Flr.SF          0
X2nd.Flr.SF          0
Sale.Condition       0
Length: 81, dtype: int64
The following parts of this question are meant to guide you through a start-to-finish application of machine learning on a new data set. This type of question should give you a reasonable starting point for the basic steps you'll need to engage in during your capstone project (if you choose to work with "flat" or "tabular" data; your steps might look very different if you're working with images or text). However, you should be aware that I'd expect you to try out more feature engineering strategies and modeling methods.
- Inspect train_ames_X and drop any columns that should not be used as predictors. For example, the property identification number can be assumed to have no predictive value since it's arbitrarily assigned. Additionally, variables like "Yr.Sold" will not be available for a home that hasn't yet gone on the market, so they also shouldn't be used. Further, you may also remove the variable "MS.SubClass" for simplicity.
- Identify the categorical predictors: you can use the .select_dtypes() method of pandas data frames with the argument include='object' to return only non-numeric columns, then store the columns attribute of this result.
- Use ElasticNetCV with 5-fold cross-validation and sequences of reasonably chosen alpha values and l1_ratio values (use approximately 30 different alpha values and 3-5 different l1_ratios).
- Use GridSearchCV to find a suitable combination of values for n_neighbors, weights, and p. You should try at least 5 different choices of n_neighbors.
- Set up ElasticNet() with the arguments you identified in Part F or KNeighborsRegressor() with the arguments you identified in Part G. Then, use GridSearchCV to find the cross-validated 'neg_root_mean_squared_error' of each model.