This lab introduces the concepts of parameter tuning and model selection using cross-validation methods. More specifically, it demonstrates built-in cross-validation and model evaluation routines for $k$-nearest neighbors models using pipelines.
To begin, we'll use the same libraries as Part 1:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
# Unfortunately, knn functions currently raise FutureWarnings; the commands below suppress them
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The lab's examples use the same dataset (Iowa City home sales) as Part 1:
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
Next, remember that our first step should always be to perform a training-testing split in order to avoid data leakage:
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
We'll then select only numeric features, construct the two target variables described in Part 1, and create a matrix of predictors.
## Select numeric data
train_num = train.select_dtypes("number")
## Specify target vars
train_price = train_num['sale.amount']
train_over = (train_num['assessed'] > train_num['sale.amount']).astype(int)
## Create predictor matrix
train_X_price = train_num.drop('sale.amount',axis=1)
train_X_over = train_num.drop(['sale.amount', 'assessed'], axis=1)
In our previous lab, I expected you to calculate performance measures, such as classification accuracy or RMSE, using your own code. However, for greater compatibility with the methods covered in today's lab, we'll begin by covering a few built-in implementations.
The code below demonstrates the default scoring method implemented in KNeighborsClassifier:
from sklearn.neighbors import KNeighborsClassifier
knnc = KNeighborsClassifier()
knnc.fit(train_X_over, train_over)
knnc.score(train_X_over, train_over)
0.7616747181964574
We can see that this default is classification accuracy (as implemented in the function accuracy_score):
from sklearn.metrics import accuracy_score
y_pred_over = knnc.predict(train_X_over)
accuracy_score(train_over, y_pred_over)
0.7616747181964574
Next week we'll discuss alternative scoring metrics, so you should try to remember this module of sklearn.
During the remainder of this lab, I'll frequently use the term "score". For now, you can treat "score" as synonymous with "classification accuracy" (at least for KNeighborsClassifier), but that's something we'll be able to manipulate soon.
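As a quick illustration of what else lives in this module, every metric in sklearn.metrics follows the same (true values, predicted values) pattern. The sketch below uses balanced_accuracy_score, which is not part of this lab, purely as an example:
## Illustration only: a different metric computed from the same predictions
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(train_over, y_pred_over)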
Question: Fit a $k$-nearest neighbors regression model using the predictors in train_X_price and train_price as the outcome. Does root-mean squared error appear to be the default scoring metric? Justify your answer by comparing the output from score with a manually calculated RMSE (using the training data). For this question you do not need to standardize or rescale the data.

Pipelines are an important tool in sklearn that can help automate a machine learning workflow (potentially reducing data leakage). The basic idea of a pipeline is to chain together a series of steps that can then be applied to any input data of the proper format (including the same data repeatedly).
Shown below is a simple pipeline that first standardizes the input data using StandardScaler, then fits a KNeighborsClassifier to the scaled data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier())
])
We can utilize the object pipe via its fit method:
pipe.fit(train_X_over, train_over)
Pipeline(steps=[('scaler', StandardScaler()), ('classifier', KNeighborsClassifier())])
At this point, we can utilize any methods of the final estimator in the pipeline (which is KNeighborsClassifier with default arguments in this example). For example, the classification accuracy score (on the training data) of this classifier is 77.9%:
pipe.score(train_X_over, train_over)
0.7793880837359098
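Any other method of the final estimator works the same way through the pipeline (the scaling step is applied automatically before the classifier reaches the data), and individual steps can be inspected via named_steps. A minimal sketch:
## Predictions made through the pipeline (inputs are scaled before reaching the classifier)
pipe_pred = pipe.predict(train_X_over)
## Inspect a single step of the fitted pipeline (n_neighbors is 5, the default)
pipe.named_steps['classifier'].n_neighbors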
Recognize that our simple pipeline used default parameter values at each step, which is unlikely to produce an optimal classifier. The next section will cover how to use grid search and cross-validation with pipelines to help determine better combinations of tuning parameters.
Question:
Part A: Create a pipeline named pipe_final that uses robust scaling, uniform weighting, $k=30$, and $p=2$ (euclidean distance).
Part B: Fit this pipeline to the training data and report its classification accuracy using .score.
Part C: Create a predictor matrix, test_X_over, and target vector, test_over, using the test data. Then evaluate the knn model from Part B on the test data using your pipeline and .score. How does the accuracy on the test set compare to the accuracy on the training data?

In this section we'll use cross-validation to evaluate various models corresponding to different combinations of tuning parameters. We'll begin by defining a dictionary of parameter values we'd like to consider:
from sklearn.preprocessing import RobustScaler, MaxAbsScaler
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'classifier__n_neighbors': [10, 20, 30],
         'classifier__weights': ['uniform', 'distance'],
         'classifier__p': [1, 2]
        }
Notice how this dictionary matches keys (step names) from our pipeline with lists of objects/values for that key to consider.
A double-underscore, __, is used to specify parameters within an object. In our example, we are specifying parameters for KNeighborsClassifier (such as n_neighbors or weights). We do this by referencing the key named "classifier" in our pipeline followed by the double-underscore and the argument name of the parameter.
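If you're ever unsure which names are valid in this dictionary, the pipeline can list them for you; a quick check (not shown in the lab output) is:
## List every parameter name the pipeline will accept (e.g. 'classifier__n_neighbors')
print(sorted(pipe.get_params().keys()))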
After setting up a dictionary of parameter values, we're ready to use cross-validation. The code below uses the GridSearchCV function to evaluate each combination of the parameters in parms using 5-fold cross-validation, with cross-validated classification accuracy as the scoring metric. After running the search, we print the best_estimator_, which is the combination of tuning parameters that led to the best cross-validated score.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, parms, cv=5, scoring='accuracy').fit(train_X_over, train_over)
print(grid.best_estimator_)
Pipeline(steps=[('scaler', RobustScaler()), ('classifier', KNeighborsClassifier(n_neighbors=30))])
We can access more detailed information about the various parameter combinations considered during the search using cv_results_.
The code below creates a DataFrame containing the full set of search results, sorts the results by scoring rank, then prints the rows corresponding to the five best performing parameter combinations:
result = pd.DataFrame.from_dict(grid.cv_results_, orient='columns')
print(result.sort_values('rank_test_score').head(5))
    mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
31       0.003014      0.000031         0.002402    4.917636e-04
0        0.003801      0.001469         0.004399    2.332625e-03
13       0.003001      0.000632         0.003000    8.996946e-07
3        0.002599      0.000488         0.001800    4.001618e-04
25       0.003392      0.000496         0.003200    3.992559e-04

    param_classifier__n_neighbors param_classifier__p  \
31                             30                   2
0                              10                   1
13                             20                   1
3                              10                   1
25                             30                   1

   param_classifier__weights      param_scaler  \
31                   uniform    RobustScaler()
0                    uniform  StandardScaler()
13                   uniform    RobustScaler()
3                   distance  StandardScaler()
25                   uniform    RobustScaler()

                                               params  split0_test_score  \
31  {'classifier__n_neighbors': 30, 'classifier__p...              0.656
0   {'classifier__n_neighbors': 10, 'classifier__p...              0.736
13  {'classifier__n_neighbors': 20, 'classifier__p...              0.648
3   {'classifier__n_neighbors': 10, 'classifier__p...              0.720
25  {'classifier__n_neighbors': 30, 'classifier__p...              0.664

    split1_test_score  split2_test_score  split3_test_score  \
31           0.701613           0.701613           0.709677
0            0.693548           0.645161           0.677419
13           0.701613           0.669355           0.709677
3            0.685484           0.653226           0.685484
25           0.693548           0.669355           0.685484

    split4_test_score  mean_test_score  std_test_score  rank_test_score
31           0.653226         0.684426        0.024535                1
0            0.661290         0.682684        0.031158                2
13           0.677419         0.681213        0.022291                3
3            0.661290         0.681097        0.023328                4
25           0.685484         0.679574        0.011065                5
At this point we can continue tweaking our model, or we can settle on the top performer from our grid search and evaluate it on the test set.
To use the top performer on the test set, we can simply call best_estimator_ and use its predict method on the test set's $X$ matrix to get predicted outcomes.
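As a rough sketch (assuming the test predictor matrix test_X_over and target vector test_over are built from the test data the same way the training versions were built), this might look like:
## Hedged sketch: build the test-set versions of the predictor matrix and target
test_num = test.select_dtypes("number")
test_over = (test_num['assessed'] > test_num['sale.amount']).astype(int)
test_X_over = test_num.drop(['sale.amount', 'assessed'], axis=1)
## Predict with the best pipeline found by the grid search and score the result
test_pred = grid.best_estimator_.predict(test_X_over)
accuracy_score(test_over, test_pred)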
Question: Use GridSearchCV to find the best performing value of $k$ using 10-fold cross-validation. Include the argument scoring = 'neg_root_mean_squared_error' so that GridSearchCV evaluates models using cross-validated RMSE. Print the best performing parameter combination.

Grid search can be an effective way to handle tuning parameters in many settings; however, what if we had little idea regarding which tuning parameters would work well and our data were large enough to make model fitting computationally expensive?
Under these circumstances we might consider a randomized search:
from scipy.stats import poisson
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'classifier__n_neighbors': poisson(20),
         'classifier__weights': ['uniform', 'distance'],
         'classifier__p': [1, 2]
        }
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(pipe, parms, cv=5, n_iter=30, random_state=0).fit(train_X_over, train_over)
This setup is nearly identical to our earlier grid search; however, RandomizedSearchCV will randomly sample from the distributions specified in the parms dictionary at each search iteration (so 30 times in this example).
This is most obvious for the n_neighbors argument, which is sampled from a Poisson distribution with mean parameter $\mu = 20$. The other parameters used in this example are randomly sampled from the list of specified values (with each value being equally likely).
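To see what this sampling looks like, we can draw a few values from the frozen distribution directly (an illustration only, not part of the search itself):
## A few example draws from the Poisson(20) distribution used for n_neighbors
poisson(20).rvs(size=10, random_state=1)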
rs.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=28, p=1, weights='distance'))])
The randomized search finds a combination of tuning parameters that is reasonably similar to some of the best performers in the earlier grid search (a good sign).
In future applications, you might consider using a randomized search as part of a two-stage approach. First, a small-scale randomized search can locate promising regions of the parameter space, then a grid search explores those regions more closely.
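A rough sketch of that second stage, assuming we simply center a narrow grid on the randomized search's best n_neighbors (the other fixed choices below are assumptions for illustration, not part of the lab), might look like:
## Hedged sketch of a follow-up grid search centered on the randomized search's best k
best_k = rs.best_params_['classifier__n_neighbors']
stage2_parms = {'scaler': [StandardScaler()],
                'classifier__n_neighbors': [best_k - 2, best_k, best_k + 2],
                'classifier__weights': ['distance'],
                'classifier__p': [1]}
stage2 = GridSearchCV(pipe, stage2_parms, cv=5, scoring='accuracy').fit(train_X_over, train_over)
print(stage2.best_estimator_)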
The previous sections have covered k-fold cross-validation in the context of tuning parameter searches and pipelines, but you should also be aware of cross-validation implementations outside of these contexts.
The code displayed below demonstrates how to obtain cross-validated predictions. These can be used to visualize a model or for "by hand" calculations:
## Functions for k-fold CV and LOOCV
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict
k_folds = KFold(n_splits = 5)
loo = LeaveOneOut()
## Initialize a knn model
knn_class = KNeighborsClassifier(n_neighbors=25)
## Return cross-validated predictions
kfold_pred = cross_val_predict(knn_class, train_X_over, train_over, cv = k_folds)
loo_pred = cross_val_predict(knn_class, train_X_over, train_over, cv = loo)
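One possible "by hand" use of these predictions (not part of the original demonstration) is a confusion matrix built from the k-fold predictions:
## Example: cross-validated confusion matrix from the k-fold predictions
from sklearn.metrics import confusion_matrix
print(confusion_matrix(train_over, kfold_pred))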
The code below demonstrates how to obtain cross-validated model performance scores:
## Import
from sklearn.model_selection import cross_validate
## Evaluate (note that knn_class was defined previously)
cross_validate(knn_class, train_X_over, train_over, cv = 5, scoring = 'accuracy')
{'fit_time': array([0.00400138, 0.00199819, 0.00500059, 0.00300217, 0.00398231]), 'score_time': array([0.00600123, 0.00599766, 0.0055058 , 0.00601673, 0.00501132]), 'test_score': array([0.664 , 0.67741935, 0.67741935, 0.66129032, 0.67741935])}
Notice that cross_validate returns accuracy scores within each of the 5 folds. Typically these are averaged to provide a single performance measure for the model:
cv_results = cross_validate(knn_class, train_X_over, train_over, cv = 5, scoring = 'accuracy')
## Average (function is from numpy library)
np.average(cv_results['test_score'])
0.6715096774193547
## Same calculation using preds
np.average(kfold_pred == train_over)
0.6714975845410628
Note: the purpose of this section was to demonstrate several useful cross-validation functions in isolation. To do this efficiently, it skipped over many important steps (e.g., scaling, tuning, etc.) that you should include in a complete application.
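For reference, one way to fold the scaling step back in is to pass the pipeline (rather than the bare classifier) to cross_validate, so that the scaler is re-fit within each training fold; a minimal sketch:
## Hedged sketch: cross-validate the full pipeline so scaling is learned inside each fold
cross_validate(pipe, train_X_over, train_over, cv=5, scoring='accuracy')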
The data loaded below are a random sample of $n=6000$ images from the MNIST database, which contains thousands of handwritten digits (0-9) recorded as 28x28 pixel grayscale images. The original MNIST training set included 60,000 samples, so we'll be working with only 10% of the database. The goal of this application is to build a classification model that can accurately identify handwritten digits.
The data available at the link below have been flattened, meaning each 28x28 image is represented using a single row. In this format, there are 784 columns that represent the normalized color intensity at a particular pixel, with higher values indicating the presence of ink and zeros indicating the absence of ink.
For context, let's begin by displaying a few of the samples:
### Read flattened, processed data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
### Separate the label column (outcome)
label = mnist['label']
mnist = mnist.drop(['label'], axis=1)
### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)
### Import grayscale color map
import matplotlib.cm as cm
## Plot the first six samples
fig, axs = plt.subplots(ncols=6)
for i in range(6):
    axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
    axs[i].title.set_text(f'label={label[i]}')
plt.show()
Question: Perform a training-testing split of the MNIST data using random_state=10. Then use cross-validation to tune a $k$-nearest neighbors classifier, considering the distance metric, p, being either 1 (manhattan distance) or 2 (euclidean distance), in your search. You do not need to consider scaling/standardization in this application (since all pixels are measured on the same scale).