This lab covers cross-validation tools for model evaluation and hyperparameter tuning in sklearn.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# KNN will warn you about future updates, but we'll turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
We will continue using the Iowa City Home Sales data (introduced in our previous lab) in our examples:
## Read IC homes data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
## Separate the target from the predictors (only using numeric predictors)
train_y = train['sale.amount']
test_y = test['sale.amount']
train_X = train.select_dtypes("number").drop('sale.amount', axis = 1)
test_X = test.select_dtypes("number").drop('sale.amount', axis = 1)
Most of the time we'll want to use cross-validation in conjunction with a pipeline. Given below is a simple pipeline that applies a standardization step followed by a model fitting step:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
pipe = Pipeline([
('scaler', StandardScaler()),
('model', KNeighborsRegressor())
])
Here we did not modify any of the default values in either step. This is because we will be supplying our own combinations of tuning parameters stored in a dictionary object.
from sklearn.preprocessing import RobustScaler, MaxAbsScaler
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'model__n_neighbors': [10,20,30],
'model__weights': ['uniform','distance'],
'model__p': [1,2]
}
Recall that dictionaries are used to store key:value pairs. In this example, you should notice that the keys use the same names as those given to the steps in our pipeline, and the double underscore is used to reference an argument or parameter within a named step. That is, the name model__p references the argument p in the step named model, or the power parameter for Minkowski distance in KNeighborsRegressor().
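If you're ever unsure which names are valid, a pipeline's get_params() method lists every step__argument combination it will accept. A quick check using the pipe object defined above (just a sketch; the printed values are whatever defaults your sklearn version uses):
## Every key returned by get_params() is a valid key for the parameter dictionary
print('model__n_neighbors' in pipe.get_params())
## Look up the current value of the Minkowski power parameter for the 'model' step
print(pipe.get_params()['model__p'])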
After setting up a pipeline and a dictionary of parameters we'd like to search over, we can use the GridSearchCV() function to find the best combination of tuning parameters as determined using cross-validation.
from sklearn.model_selection import GridSearchCV
grid_res = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y)
print(grid_res.best_estimator_)
Pipeline(steps=[('scaler', StandardScaler()), ('model', KNeighborsRegressor(n_neighbors=10, p=1, weights='distance'))])
In this example we find that the best performing approach uses a standard scaler, $k=10$ neighbors, p=1 (Manhattan distance), and inverse-distance weighting. Notice that parameters left at their default values are not printed in this output; for instance, if p=2, the default used by KNeighborsRegressor(), had been selected, it would not be explicitly shown.
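Besides best_estimator_, the fitted search object exposes the chosen parameter combination and its cross-validated score directly. A short look using the grid_res object created above:
## Dictionary of the selected tuning parameter values
print(grid_res.best_params_)
## Mean cross-validated score of that combination (R-squared by default for regression models)
print(grid_res.best_score_)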
The complete grid search results are accessible through the cv_results_ attribute. However, these results are stored in a dictionary of numpy arrays, so we'll first convert them to a pandas DataFrame for easier use. The code given below shows the five best performing tuning parameter combinations and their cross-validated scores:
full_grid_results = pd.DataFrame(grid_res.cv_results_)
parms_to_show = ['param_model__n_neighbors', 'param_model__p', 'param_model__weights', 'param_scaler', 'mean_test_score']
full_grid_results.sort_values('mean_test_score', ascending=False)[parms_to_show].head(5)
| | param_model__n_neighbors | param_model__p | param_model__weights | param_scaler | mean_test_score |
|---|---|---|---|---|---|
| 3 | 10 | 1 | distance | StandardScaler() | 0.731008 |
| 9 | 10 | 2 | distance | StandardScaler() | 0.722000 |
| 5 | 10 | 1 | distance | MaxAbsScaler() | 0.718766 |
| 0 | 10 | 1 | uniform | StandardScaler() | 0.704226 |
| 6 | 10 | 2 | uniform | StandardScaler() | 0.698957 |
Question 1: Notice that the values of mean_test_score in the results shown above do not appear to be in the same units as the outcome variable (recall we're predicting home prices of hundreds of thousands of dollars). Use the documentation for GridSearchCV() found here and the list of predefined model evaluation metrics found here to redo this parameter search using out-of-sample RMSE as the performance evaluation criterion.
Finally, you should know that the character string 'passthrough' can be used to skip over a data preparation step. For example, we could consider models with and without PCA for dimension reduction:
from sklearn.decomposition import PCA
## Pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('reducer', PCA()),
('model', KNeighborsRegressor())
])
## Parameters to try
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'reducer': ['passthrough', PCA(n_components=4), PCA(n_components=8)],
'model__n_neighbors': [10,20,30]
}
## Results
GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y).best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('reducer', 'passthrough'), ('model', KNeighborsRegressor(n_neighbors=10))])
Here we see that dimension reduction via PCA was not used in the best estimator.
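If you'd like to see how the PCA options fared overall rather than only which one won, one approach is to store the fitted search and average the cross-validated scores by reducer. A rough sketch (grid_res2 is simply a name introduced here; the scores will depend on your data split):
## Refit the grid search, this time keeping the fitted object
grid_res2 = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y)
## Average cross-validated score for each dimension reduction option
pca_res = pd.DataFrame(grid_res2.cv_results_)
pca_res['param_reducer'] = pca_res['param_reducer'].astype(str)
print(pca_res.groupby('param_reducer')['mean_test_score'].mean())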
Question 2: The code below creates a new binary outcome variable that records whether a home was assessed at a higher value than what it sold for. We can use this variable to identify homes whose value was "over-assessed" by the county. Using this binary outcome, perform a cross-validated grid search over the following parameters/steps:
- a yeo-johnson normalization transformation or no normalizing transformation
## Create the binary outcome variable indicating over-assessment
train_y_binary = (train['assessed'] > train['sale.amount']).astype(int)
Grid search can be computationally expensive when the number of tuning parameters is large and you have little insight into which areas of the parameter space will be most promising. Randomized search is an alternative that allows you to search over a wide parameter space with a fixed computational cost.
The example search below will randomly sample tuning parameter values from the specified distributions. When a list of values is given, values will be sampled with equal probability from that list. When a distribution is given, such as the Poisson distribution with a mean of 20 given for the parameter n_neighbors, the search will randomly sample values from that distribution at each iteration of the search.
from scipy.stats import poisson
from sklearn.model_selection import RandomizedSearchCV
## Parameter dictionary, notice the use of 'poisson(20)'
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'model__n_neighbors': poisson(20),
'model__weights': ['uniform','distance']
}
## Results
RandomizedSearchCV(pipe, parms, cv =5, n_iter = 30, random_state=0).fit(train_X, train_y).best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('reducer', PCA()), ('model', KNeighborsRegressor(n_neighbors=13, weights='distance'))])
Here we found $k=13$ to be ideal, which is a value we'd have been unlikely to try in a grid search using equally spaced values.
The diagram below explains how random searching may be able to produce better results than grid search in certain circumstances:
## Diagram comparing grid search and random search over two tuning parameters
from IPython.display import Image
Image("C:\\Users\\millerry\\OneDrive - Grinnell College\\Documents\\STA-395_Intro_ML\\Spring24\\Labs\\rs.PNG")
In this hypothetical example, both searches consider 9 different tuning parameter combinations, but the uniform spacing of the grid search renders it unable to find a good value of the important parameter.
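To make this concrete with the distributions used above: nine random draws from the Poisson distribution will typically visit many distinct values of n_neighbors, whereas a nine-combination grid built from three scalers and three neighbor values only ever tries those three neighbor values. A small illustration (the random_state here is arbitrary):
## Values of n_neighbors that nine iterations of random search might sample
print(poisson(20).rvs(size=9, random_state=0))
## Values of n_neighbors a 3x3 grid search (three scalers by three k values) would try
print([10, 20, 30])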
Question 3: Modify the search you performed in Question 2 to now perform 20 iterations of random searching over the following parameters/steps:
- a yeo-johnson normalization transformation or no normalizing transformation
Briefly comment on how the best estimator from this search compares with the one you identified in Question 2.
Most of the time we'll want to use cross-validation within the context of a data preparation pipeline. However, occasionally we might want to perform cross-validation using a single model and set of data preparation steps for the purpose of creating data visualizations or doing our own "by hand" calculations (i.e., making a results table for a publication, etc.).
The code displayed below demonstrates how to obtain cross-validated predictions:
## Functions to perform k-fold CV, LOOCV, and cross-validated predictions
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict
## Initialize CV functions
k_folds = KFold(n_splits = 5)
loo = LeaveOneOut()
## Initialize a knn model
knn_mod = KNeighborsRegressor(n_neighbors=13)
## Get cross-validated predictions
kfold_pred = cross_val_predict(knn_mod, train_X, train_y, cv = k_folds)
loo_pred = cross_val_predict(knn_mod, train_X, train_y, cv = loo)
## Example visualization, predicted prices vs. cross-validated residual errors
plt.scatter(kfold_pred, train_y-kfold_pred)
plt.show()
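These cross-validated predictions also support the kind of "by hand" summaries mentioned above; for instance, a minimal sketch of the 5-fold cross-validated RMSE using the objects just created:
## Cross-validated RMSE from the k-fold predictions
print(np.sqrt(np.average((train_y - kfold_pred)**2)))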
A final thing to know about the GridSearchCV() and RandomizedSearchCV() functions is that they will automatically refit the best estimator to the entire training data set after finding it. Thus, we can evaluate the performance of the best approach selected by our tuning parameter search on the test set using the predict() method:
## Pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('model', KNeighborsRegressor())
])
## Parameters to try
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'model__n_neighbors': [10,20,30]
}
## Predictions
test_preds = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y).predict(test_X)
## Test set RMSE using the best model/steps (calculated "by hand"):
np.sqrt(np.average((test_y - test_preds)**2))
47298.88262157997
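Equivalently, one of sklearn's predefined metrics could replace the manual calculation; a minimal sketch using mean_squared_error (this should reproduce the same RMSE):
## Same test set RMSE via a built-in metric
from sklearn.metrics import mean_squared_error
print(np.sqrt(mean_squared_error(test_y, test_preds)))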
In Homework 1 you were introduced to the MNIST database, a collection of thousands of handwritten digits (0-9) recorded as 28x28 pixel grayscale images.
The code below loads a random sample of 6000 digits and displays a few of them:
### Read flattened, processed data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
### Separate the label column (outcome)
label = mnist['label']
mnist = mnist.drop(['label'], axis=1)
### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)
## Plot some examples
import matplotlib.cm as cm
fig, axs = plt.subplots(ncols=7)
for i in range(7):
axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
axs[i].title.set_text(f'label={label[i]}')
plt.show()
In this application you will create a machine learning pipeline to classify these digits.
Question 4: Create a machine learning pipeline to classify these digits, using random_state=10 wherever a random state is required.
Note: You'd normally want to consider standardization in a pipeline like this one, but because the features of the MNIST data set all involve the same units (grayscale color intensity of pixels) this isn't necessary.