Lab #2 (part 2) - Pipelines, Cross-validation, and Tuning¶

This lab introduces the concepts of parameter tuning and model selection using cross-validation methods. More specifically, it will demonstrate built-in cross-validation and model evaluation routines for $k$-nearest neighbors models using pipelines.

To begin, we'll use the same libraries as Part 1:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

# Unfortunately, some knn functions raise FutureWarnings; the commands below turn them off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

The lab's examples use the same dataset (Iowa City home sales) as Part 1:

In [2]:
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

Next, remember that our first step should always be to perform a training-testing split in order to avoid data leakage:

In [3]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)

We'll then select only numeric features, construct the two target variables described in Part 1, and create a matrix of predictors.

In [4]:
## Select numeric data
train_num = train.select_dtypes("number")

## Specify target vars
train_price = train_num['sale.amount']
train_over = (train_num['assessed'] > train_num['sale.amount']).astype(int)

## Create predictor matrix
train_X_price = train_num.drop('sale.amount',axis=1)
train_X_over = train_num.drop(['sale.amount', 'assessed'], axis=1)

Part 1 - Scoring Methods¶

In our previous lab, I expected you to calculate performance measures, such as classification accuracy or RMSE, using your own code. However, for greater compatibility with the methods covered in today's lab, we'll begin by covering a few built-in implementations.

The code below demonstrates the default scoring method implemented in KNeighborsClassifier:

In [5]:
from sklearn.neighbors import KNeighborsClassifier
knnc = KNeighborsClassifier()
knnc.fit(train_X_over, train_over)
knnc.score(train_X_over, train_over)
Out[5]:
0.7616747181964574

We can see that this default is classification accuracy (as implemented in the function accuracy_score):

In [6]:
from sklearn.metrics import accuracy_score
y_pred_over = knnc.predict(train_X_over)
accuracy_score(train_over, y_pred_over)
Out[6]:
0.7616747181964574

Next week we'll discuss alternative scoring metrics, so you should try to remember this module of sklearn (sklearn.metrics).
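As a small preview of that module, the sketch below computes one alternative metric (balanced accuracy) for the same predictions; the choice of metric here is purely illustrative and not something you need for this lab:

## Illustration only: another metric available in sklearn.metrics
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(train_over, y_pred_over)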

During the remainder of this lab, I'll frequently use the term "score". For now, you can treat "score" as synonymous with "classification accuracy" (at least for KNeighborsClassifier), but that's something we'll be able to manipulate soon.

Question #1 (scoring)¶

  • Fit a $k$-nearest neighbors regression model using train_price as the outcome. Does root mean squared error (RMSE) appear to be the default scoring metric? Justify your answer by comparing the output from score with a manually calculated RMSE (using the training data). For this question you do not need to standardize or rescale the data.

Part 2 - Pipelines¶

Pipelines are an important tool in sklearn that can help automate a machine learning workflow (potentially reducing data leakage). The basic idea of a pipeline is to chain together a series of steps that can then be applied to any input data of the proper format (including the same data repeatedly).

Shown below is a simple pipeline that first standardizes the input data using StandardScaler, then fits a KNeighborsClassifier to the scaled data:

In [7]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier())
])

We can utilize the object pipe via its fit method:

In [8]:
pipe.fit(train_X_over, train_over)
Out[8]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier', KNeighborsClassifier())])

At this point, we can utilize any methods of the final estimator in the pipeline (which is KNeighborsClassifier with default arguments in this example). For example, the classification accuracy score (on the training data) of this classifier is 77.9%:

In [9]:
pipe.score(train_X_over, train_over)
Out[9]:
0.7793880837359098
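As a brief illustration, the sketch below shows a couple of other things the fitted pipeline exposes; the particular attribute inspected here is chosen arbitrarily:

## Predictions from the final estimator in the pipeline
pipe.predict(train_X_over)

## Access a fitted step by name and inspect one of its parameters
pipe.named_steps['classifier'].n_neighbors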

Recognize that our simple pipeline used default parameter values at each step, which is unlikely to produce an optimal classifier. The next section will cover how to use grid search and cross-validation with pipelines to help determine better combinations of tuning parameters.

Question #2 (pipelines)¶

  • Part A) Create a pipeline named pipe_final that uses robust scaling, uniform weighting, $k=30$ and $p=2$ (euclidean distance).
  • Part B) Use the pipeline created in Part A to fit the requested knn model to the training data. Find the classification accuracy of this model on the training data using .score.
  • Part C) Create the matrix of predictors, test_X_over, and target vector, test_over, using the test data. Then evaluate the knn model from Part B on the test data using your pipeline and .score. How does the accuracy on the test set compare to the accuracy on the training data?

Part 3 - Cross-validation and Grid Search¶

In this section we'll use cross-validation to evaluate various models corresponding to different combinations of tuning parameters. We'll begin by defining a dictionary of parameter values we'd like to consider:

In [10]:
from sklearn.preprocessing import RobustScaler, MaxAbsScaler
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'classifier__n_neighbors': [10,20,30],
         'classifier__weights': ['uniform','distance'],
         'classifier__p': [1,2]
        }

Notice how this dictionary matches keys (step names) from our pipeline with lists of objects/values for that key to consider.

A double-underscore, __, is used to specify parameters within an object. In our example, we are specifying parameters for KNeighborsClassifier (such as n_neighbors or weights). We do this by referencing the key named "classifier" in our pipeline followed by the double-underscore and the argument name of the parameter.
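The same double-underscore convention can also be used to set parameters directly on an existing pipeline via set_params. A minimal sketch (the values shown are arbitrary):

## Update two parameters of the 'classifier' step in-place
pipe.set_params(classifier__n_neighbors=30, classifier__weights='uniform')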

After setting up a dictionary of parameter values, we're ready to use cross-validation. The code below uses the GridSearchCV function to evaluate each combination of the parameters in parms using 5-fold cross-validation, with cross-validated classification accuracy as the scoring metric. After running the search, we print the best_estimator_, which is the combination of tuning parameters that led to the best cross-validated score.

In [11]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, parms, cv=5, scoring='accuracy').fit(train_X_over, train_over)
print(grid.best_estimator_)
Pipeline(steps=[('scaler', RobustScaler()),
                ('classifier', KNeighborsClassifier(n_neighbors=30))])

We can access more detailed information about the various parameter combinations considered during the search using cv_results_.

The code below creates a DataFrame containing the full set of search results, sorts the results by scoring rank, then prints the rows corresponding to the five best performing parameter combinations:

In [12]:
result = pd.DataFrame.from_dict(grid.cv_results_, orient='columns')
print(result.sort_values('rank_test_score').head(5))
    mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
31       0.003014      0.000031         0.002402    4.917636e-04   
0        0.003801      0.001469         0.004399    2.332625e-03   
13       0.003001      0.000632         0.003000    8.996946e-07   
3        0.002599      0.000488         0.001800    4.001618e-04   
25       0.003392      0.000496         0.003200    3.992559e-04   

   param_classifier__n_neighbors param_classifier__p  \
31                            30                   2   
0                             10                   1   
13                            20                   1   
3                             10                   1   
25                            30                   1   

   param_classifier__weights      param_scaler  \
31                   uniform    RobustScaler()   
0                    uniform  StandardScaler()   
13                   uniform    RobustScaler()   
3                   distance  StandardScaler()   
25                   uniform    RobustScaler()   

                                               params  split0_test_score  \
31  {'classifier__n_neighbors': 30, 'classifier__p...              0.656   
0   {'classifier__n_neighbors': 10, 'classifier__p...              0.736   
13  {'classifier__n_neighbors': 20, 'classifier__p...              0.648   
3   {'classifier__n_neighbors': 10, 'classifier__p...              0.720   
25  {'classifier__n_neighbors': 30, 'classifier__p...              0.664   

    split1_test_score  split2_test_score  split3_test_score  \
31           0.701613           0.701613           0.709677   
0            0.693548           0.645161           0.677419   
13           0.701613           0.669355           0.709677   
3            0.685484           0.653226           0.685484   
25           0.693548           0.669355           0.685484   

    split4_test_score  mean_test_score  std_test_score  rank_test_score  
31           0.653226         0.684426        0.024535                1  
0            0.661290         0.682684        0.031158                2  
13           0.677419         0.681213        0.022291                3  
3            0.661290         0.681097        0.023328                4  
25           0.685484         0.679574        0.011065                5  

At this point we can continue tweaking our model, or we can settle on the top performer from our grid search and evaluate it on the test set.

To use the top performer on the test set, we can simply call best_estimator_ and use its predict method on the test set's $X$ matrix to get predicted outcomes.
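A minimal sketch of that final step is shown below; it assumes test_X_over and test_over have been constructed from the test split in the same way train_X_over and train_over were built from the training split:

## Evaluate the top performer from the grid search on the held-out test set
best_knn = grid.best_estimator_
test_preds = best_knn.predict(test_X_over)         ## predicted classes for the test set
test_acc = best_knn.score(test_X_over, test_over)  ## classification accuracy on the test set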

Question #3 (grid search)¶

  • Part A) Create a pipeline that will be used to determine the tuning parameters of a kNN regression model for the sale price outcome. Your pipeline should use min-max scaling prior to fitting the model. The model should use euclidean distance and uniform weighting, and you may leave the number of neighbors at the default value.
  • Part B) Create a parameter dictionary containing five different values of $k$ (the number of neighbors), then use GridSearchCV to find the best performing value of $k$ using 10-fold cross validation. Include the argument scoring = 'neg_root_mean_squared_error' so that GridSearchCV evaluates models using cross-validated RMSE. Print the best performing parameter combination.

Part 4 - Cross-validation and randomized search¶

Grid search can be an effective way to handle tuning parameters in many settings; however, what if we had little idea of which tuning parameters would work well and our data were large enough to make model fitting computationally expensive?

Under these circumstances we might consider a randomized search:

In [13]:
from scipy.stats import poisson
parms = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'classifier__n_neighbors': poisson(20),
         'classifier__weights': ['uniform','distance'],
         'classifier__p': [1,2]
        }

from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(pipe, parms, cv=5, n_iter=30, random_state=0).fit(train_X_over, train_over)

This setup is nearly identical to our earlier grid search; however, RandomizedSearchCV will randomly sample from the distributions specified in the parms dictionary at each search iteration (so 30 times in this example).

This is most obvious for the n_neighbors argument, which is sampled from a Poisson distribution with mean parameter $\mu = 20$. The other parameters used in this example are randomly sampled from the list of specified values (with each value being equally likely).
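To see what is being sampled, the sketch below draws a few candidate values of n_neighbors from the same frozen distribution; the number of draws and the seed are arbitrary:

## Five illustrative draws from the Poisson(20) distribution used for n_neighbors
poisson(20).rvs(size=5, random_state=0)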

In [14]:
rs.best_estimator_
Out[14]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier',
                 KNeighborsClassifier(n_neighbors=28, p=1,
                                      weights='distance'))])

The randomized search finds a combination of tuning parameters that is reasonably similar to some of the best performers in the earlier grid search (a good sign).

In future applications, you might consider using a randomized search as part of a two-stage approach. First, a small-scale randomized search can locate promising regions of the parameter space, then a grid search explores those regions more closely.
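A minimal sketch of what that second (grid search) stage might look like is shown below, narrowing n_neighbors around the region suggested by the randomized search; the specific values are illustrative:

## Follow-up grid search over a narrow band of n_neighbors values
followup_parms = {'scaler': [StandardScaler()],
                  'classifier__n_neighbors': [26, 27, 28, 29, 30],
                  'classifier__weights': ['distance'],
                  'classifier__p': [1]}
followup = GridSearchCV(pipe, followup_parms, cv=5, scoring='accuracy').fit(train_X_over, train_over)
print(followup.best_params_)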

Part 5 - Other Cross-validation Functions¶

The previous sections have covered k-fold cross-validation in the context of tuning parameter searches and pipelines, but you should also be aware of cross-validation implementations outside of these contexts.

The code displayed below demonstrates how to obtain cross-validated predictions. These can be used for visualizing a model or for "by hand" calculations:

In [15]:
## Functions for k-fold CV and LOOCV
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict
k_folds = KFold(n_splits = 5)
loo = LeaveOneOut()

## Initialize a knn model
knn_class = KNeighborsClassifier(n_neighbors=25)

## Return cross-validated predictions
kfold_pred = cross_val_predict(knn_class, train_X_over, train_over, cv = k_folds)
loo_pred = cross_val_predict(knn_class, train_X_over, train_over, cv = loo)

The code below demonstrates how to obtain cross-validated model performance scores:

In [16]:
## Import
from sklearn.model_selection import cross_validate

## Evaluate (note that knn_class was defined previously)
cross_validate(knn_class, train_X_over, train_over, cv = 5, scoring = 'accuracy')
Out[16]:
{'fit_time': array([0.00400138, 0.00199819, 0.00500059, 0.00300217, 0.00398231]),
 'score_time': array([0.00600123, 0.00599766, 0.0055058 , 0.00601673, 0.00501132]),
 'test_score': array([0.664     , 0.67741935, 0.67741935, 0.66129032, 0.67741935])}

Notice that cross_validate returns accuracy scores within each of the 5 folds. Typically these are averaged to provide a single performance measure for the model:

In [17]:
cv_results = cross_validate(knn_class, train_X_over, train_over, cv = 5, scoring = 'accuracy')

## Average (function is from numpy library)
np.average(cv_results['test_score'])
Out[17]:
0.6715096774193547
In [18]:
## Same calculation using preds
np.average(kfold_pred == train_over)
Out[18]:
0.6714975845410628

Note: the purpose of this section was to demonstrate several useful cross-validation functions in isolation. To do this efficiently, it skipped over many important steps (e.g., scaling, tuning) that you should include in a complete application.
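For example, a minimal sketch of how scaling could be folded back in by passing a pipeline (rather than a bare estimator) to cross_validate; n_neighbors=25 is kept only to match the example above:

## Cross-validate a pipeline so that scaling is re-fit within each training fold
scaled_knn = Pipeline([('scaler', StandardScaler()),
                       ('classifier', KNeighborsClassifier(n_neighbors=25))])
cv_scaled = cross_validate(scaled_knn, train_X_over, train_over, cv=5, scoring='accuracy')
np.average(cv_scaled['test_score'])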

Part 6 - Application¶

The data loaded below are a random sample of $n=6000$ images from the MNIST database, which contains thousands of handwritten digits (0-9) recorded as 28x28 pixel grayscale images. The original MNIST training set included 60,000 samples, so we'll be working with only 10% of the database. The goal of this application is to build a classification model that can accurately identify handwritten digits.

The data available at the link below have been flattened, meaning each 28x28 image is represented using a single row. In this format, there are 784 columns, each representing the normalized color intensity at a particular pixel, with higher values indicating the presence of ink and zeros indicating the absence of ink.

For context, let's begin by displaying a few of the samples:

In [19]:
### Read flattened, processed data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")

### Separate the label column (outcome)
label = mnist['label']
mnist = mnist.drop(['label'], axis=1)

### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)

### Import grayscale color map
import matplotlib.cm as cm

## Plot the first six samples
fig, axs = plt.subplots(ncols=6)
for i in range(6):
    axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
    axs[i].title.set_text(f'label={label[i]}')
plt.show()

Question #4 (Application)¶

  • Part A) Create a 90-10 training-testing split of these data using random_state=10.
  • Part B) Create a pipeline and use a combination of random search and grid search in conjunction with 5-fold cross-validation to find a well-performing combination of tuning parameters for a $k$-nearest neighbors classifier. You should consider different numbers of neighbors, 'uniform' or 'distance' weights, and the power parameter, p, being either 1 (manhattan distance) or 2 (euclidean distance) in your search. You do not need to consider scaling/standardization in this application (since all pixels are measured on the same scale).
  • Part C) Find the classification accuracy of your best classifier from Part B on the test set (which you created in Part A).
  • Part D) Write a 1-paragraph summary (4-5 sentences) of your model building process and results. This paragraph should be written in your own words, and it should be concise but complete. You should envision that you are writing it for a scientific journal in an applied field (i.e., not math/cs/stats, but something like psychology/biology/neuroscience/etc.)