Lab 3 - Pipelines and Cross-Validation¶

This lab covers two main topics: pipelines, a tool used in sklearn to chain together multiple data handling steps (like preprocessing and modeling), and cross-validation, a method used to estimate out-of-sample model performance and assist in hyperparameter selection.

Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).

To begin, you'll need the following libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

## KNN functions may produce FutureWarnings; the command below suppresses them.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

The examples in this lab will continue using the Iowa City Home Sales data from our previous two labs. We'll begin by splitting it into training and testing segments:

In [2]:
from sklearn.model_selection import train_test_split
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_train, ic_test = train_test_split(ic_homes, test_size=0.2, random_state=7)

Part 1 - Pipelines¶

As mentioned in the introduction, pipelines chain together several data handling steps so that they can more easily be run sequentially, meaning the output from a step is used as the input to the subsequent step. Below is an example pipeline that includes two steps:

In [3]:
## Imports
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer

## Set up the steps of the pipeline
my_pipeline = Pipeline([
    ('transformer',  PowerTransformer(method = 'yeo-johnson')),
    ('scaler', MinMaxScaler())
])

Each pipeline step is given as a tuple, a native Python data type that is similar to a list but immutable, meaning its elements cannot be altered or removed. Each tuple's first element is the name you give to the step, and the second element is the actual transformer or model estimator.

Like data preprocessing and modeling functions, pipelines have fit() and transform() methods:

In [4]:
## Fit the pipeline
my_variables = ['area.lot','area.bsmt']
fitted_pipe = my_pipeline.fit(ic_train[my_variables])

## Apply the fitted pipeline to visualize a transformation followed by min-max scaling
pd.DataFrame(fitted_pipe.transform(ic_train[my_variables]), columns = my_variables).hist()
plt.show()

In the above example the variables "area.lot" and "area.bsmt" were first normalized using a power transformer, then re-scaled using min-max scaling, all in a single command.

We can take our pipeline a step further by adding a model as the final step. You should note that pipelines can accommodate any class with fit() and transform() methods as intermediate steps, and the final step only needs to have a fit() method.

In [5]:
## New pipeline with a modeling step at the end
from sklearn.neighbors import KNeighborsRegressor
my_pipeline = Pipeline([
    ('transformer',  PowerTransformer(method = 'yeo-johnson')),
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsRegressor(n_neighbors = 12))
])

## Fit to training data (using only area.lot and area.bsmt as predictive features)
ic_train_X = ic_train[my_variables]
ic_train_y = ic_train['sale.amount']
fitted_pipe = my_pipeline.fit(X = ic_train_X, y = ic_train_y)

## Predict on test data
from sklearn.metrics import mean_squared_error
ic_test_X = ic_test[my_variables]
ic_test_y = ic_test['sale.amount']
np.sqrt(mean_squared_error(ic_test_y, fitted_pipe.predict(ic_test_X)))
Out[5]:
68356.27227228682

A few additional things to know about pipelines:

  1. Steps in a pipeline can be accessed using the named_steps attribute, for example my_pipeline.named_steps['scaler']
  2. The steps of an existing pipeline can be modified using the set_params() method.
  3. Arguments within a step can be accessed using double underscores, for example my_pipeline.set_params(transformer__method = 'box-cox') would change the power transformer to a box-cox transformation.
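
As a quick illustration of these features, here is a minimal sketch using the pipeline defined above (the printed output will depend on your sklearn version):

In [ ]:
## Access the scaler step by its name
print(my_pipeline.named_steps['scaler'])

## Use the step name, two underscores, then the argument name to modify an argument within a step
my_pipeline.set_params(transformer__method = 'box-cox')
print(my_pipeline.named_steps['transformer'])

## Switch back to the original transformation before continuing with the later examples
my_pipeline.set_params(transformer__method = 'yeo-johnson')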

Question #1:

  • Part A: Using the same data and variables as this section's example, create a pipeline that first applies quantile mapping to a uniform distribution then fits a KNN regressor with $k=5$ and distance weighting.
  • Part B: Use a for loop to iteratively change the value of $k$ in your pipeline's model fitting step, checking all values in the set [4,8,12,16,20]. Print the RMSE on the test set for each value of $k$.

Part 2 - Cross-Validation and Pipelines¶

Most of the time we'll use cross-validation in conjunction with a pipeline to compare the performance of different combinations of pre-processing steps and models. To accomplish this we will store the hyperparameter choices and models we'd like to evaluate in a dictionary:

In [6]:
## Dictionary of hyperparameters to try
from sklearn.preprocessing import StandardScaler, RobustScaler, MaxAbsScaler
my_params = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'model__n_neighbors': [10,20,30],
         'model__weights': ['uniform','distance'],
         'model__p': [1,2]}

Recall that dictionaries use key:value pairs to store data. Here, the keys are given names that reference specific entities within our pipeline: for example, 'scaler' references the scaler step, while 'model__n_neighbors' references the n_neighbors argument within the modeling step.
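
If you're unsure which key names are valid, every pipeline can list them for you via its get_params() method. A minimal sketch:

In [ ]:
## Print the parameter names a search dictionary can reference
## (the list also includes entries like 'steps', 'memory', and 'verbose')
print(list(my_pipeline.get_params().keys()))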

After setting up a pipeline and an associated dictionary of parameters/models to evaluate, we can use the GridSearchCV() function to perform a grid search to find the best combination as determined by cross-validated performance.

In [7]:
## Perform grid search using 5-fold cross-validation to assess performance
from sklearn.model_selection import GridSearchCV
grid_results = GridSearchCV(my_pipeline, my_params, cv=5, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)

In this example you should note that the argument cv=5 was used to execute 5-fold cross-validation, and the scoring argument was used to set the performance metric that was tracked and returned.
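
Before digging into the full results, note that the fitted search object also stores the best cross-validated score and the hyperparameter combination that achieved it. A minimal sketch (your printed values will differ):

In [ ]:
## Best mean score across the 5 folds (negative RMSE, so values closer to zero are better)
print(grid_results.best_score_)

## Hyperparameter combination that achieved this score
print(grid_results.best_params_)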

Results of the grid search can be accessed via the cv_results_ attribute; however, it is often helpful to convert these results to a pandas Data Frame before doing anything with them:

In [8]:
## Example showing the top 5 combinations of hyperparameters/models
params_to_print = ['rank_test_score','param_scaler','param_model__n_neighbors', 'param_model__weights', 'param_model__p', 'mean_test_score']
pd.DataFrame(grid_results.cv_results_).sort_values(by='rank_test_score')[params_to_print].head(5)
Out[8]:
rank_test_score param_scaler param_model__n_neighbors param_model__weights param_model__p mean_test_score
23 1 MaxAbsScaler() 20 distance 2 -65998.837551
17 2 MaxAbsScaler() 20 distance 1 -66044.547147
11 3 MaxAbsScaler() 10 distance 2 -66227.991350
15 4 StandardScaler() 20 distance 1 -66298.240297
29 5 MaxAbsScaler() 30 distance 1 -66357.280200

Sometimes we might not be sure whether a preprocessing step is beneficial. We can explore what happens when a step is skipped by including the string 'passthrough' among the options for that step in our dictionary, as demonstrated below for the 'scaler' step:

In [9]:
## Example using passthrough
my_params = {'scaler': ['passthrough',StandardScaler(), RobustScaler()],
         'model__n_neighbors': [10,20,30],
         'model__weights': ['uniform','distance'],
         'model__p': [1,2]}
grid_results = GridSearchCV(my_pipeline, my_params, cv=5, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)
pd.DataFrame(grid_results.cv_results_).sort_values(by='rank_test_score')[params_to_print].head(5)
Out[9]:
rank_test_score param_scaler param_model__n_neighbors param_model__weights param_model__p mean_test_score
16 1 StandardScaler() 20 distance 1 -66298.240297
15 2 passthrough 20 distance 1 -66298.240297
9 3 passthrough 10 distance 2 -66466.373189
10 3 StandardScaler() 10 distance 2 -66466.373189
3 5 passthrough 10 distance 1 -66468.831129

A final thing you should know about GridSearchCV() is that the best combination is automatically refit to the entire training set, and this fitted model can be accessed via the best_estimator_ attribute. The code below illustrates how sklearn displays a neat interactive flowchart of your best estimator:

In [10]:
## Diagram of best estimator - click on each step for hyperparameter values
grid_results.best_estimator_
Out[10]:
Pipeline(steps=[('transformer', PowerTransformer()),
                ('scaler', StandardScaler()),
                ('model',
                 KNeighborsRegressor(n_neighbors=20, p=1, weights='distance'))])

And the code below demonstrates how you can use your GridSearchCV() results to see how the best estimator performs on the test set without needing to re-fit the pipeline/model:

In [11]:
## Performance of best estimator on the test set
np.sqrt(mean_squared_error(ic_test_y, grid_results.best_estimator_.predict(ic_test_X)))
Out[11]:
66086.58814048783

Question #2:

  • Part A: Create a pipeline that includes a re-scaling step and a model-fitting step that uses a decision tree.
  • Part B: Perform a cross-validated grid search using 10-fold cross-validation to check all combinations of the parameters described below. Print a sorted Data Frame showing the top 5 best performing approaches (using RMSE as the scoring criterion).
    • The scaler could be min-max scaling, max-absolute scaling, or no re-scaling
    • The maximum tree depth could be 2, 4, or 6
    • The minimum samples in a node for it to be split could be either 100 or 2
  • Part C: Calculate the RMSE on the test set for the best performing approach found during your grid search in Part B.
  • Part D: Given your knowledge of the decision tree algorithm, how could you reduce the complexity of the grid search performed in Part B without reducing the chances of finding a well-fitting modeling pipeline?

Part 3 - Randomized Search¶

The computational cost of grid search can be problematic when you are considering many different modeling approaches and hyperparameter values. Randomized search provides an alternative that allows you to perform a rough search over a wide parameter space with a fixed computational cost.

The example below demonstrates how to search the hyperparameter n_neighbors using values drawn from a Poisson distribution with a mean of 20. Additionally, you should note that if a list of values is given, each will be sampled with equal probability during the search.

In [12]:
from scipy.stats import poisson
from sklearn.model_selection import RandomizedSearchCV

## Parameter dictionary, notice the use of 'poisson(20)'
my_params = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
         'model__n_neighbors': poisson(20),
         'model__weights': ['uniform','distance']
        }

## Results
RandomizedSearchCV(my_pipeline, my_params, cv=5, n_iter=30, random_state=0).fit(ic_train_X, ic_train_y).best_estimator_
Out[12]:
Pipeline(steps=[('transformer', PowerTransformer()), ('scaler', MaxAbsScaler()),
                ('model',
                 KNeighborsRegressor(n_neighbors=18, weights='distance'))])
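
You should also know that RandomizedSearchCV() draws candidate values by calling the rvs() method of any scipy distribution object it is given; a minimal sketch of such draws:

In [ ]:
## Five example draws from the frozen distribution used for n_neighbors above
print(poisson(20).rvs(size=5, random_state=0))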

Question #3: In DecisionTreeRegressor(), the hyperparameter min_samples_split is set up to expect either integer or float input, where a float represents the fraction of the overall sample size that must be present in a node in order for it to be eligible for splitting. Modify your pipeline from Question #2 to remove the re-scaling step, then perform 25 iterations of randomized search, sampling max_depth from a Poisson distribution with a mean of 4 and sampling min_samples_split from a uniform distribution with endpoints of 0 and 0.8. Print a sorted Data Frame showing the top 5 best performing combinations (using RMSE as the scoring criterion).

Part 4 - Comparing Models While Tuning Hyperparameters¶

A strength of GridSearchCV() is that it allows us to compare models and parameters using a consistent set of cross-validation folds, thereby minimizing a potential source of variability in our estimates of out-of-sample performance. Thus, we should strive to use a single call to GridSearchCV() whenever possible.
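
If separate searches are unavoidable (for example, because two candidate models require entirely different pipelines), one way to keep the comparison fair is to pass the same cross-validation splitter object to each call. The sketch below is only an illustration; the two parameter dictionaries are hypothetical, not part of this lab's examples:

In [ ]:
## Hypothetical illustration: two separate searches evaluated on identical folds
from sklearn.model_selection import KFold
shared_folds = KFold(n_splits=5, shuffle=True, random_state=0)
params_a = {'model__n_neighbors': [10, 20]}
params_b = {'model__weights': ['uniform', 'distance']}
search_a = GridSearchCV(my_pipeline, params_a, cv=shared_folds, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)
search_b = GridSearchCV(my_pipeline, params_b, cv=shared_folds, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)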

We can handle scenarios where we'd like to evaluate different models and tune their hyperparameters in the same search by providing a list of dictionaries to GridSearchCV(). The example below compares decision tree and KNN models while tuning the hyperparameters of each as well as the scaling method for KNN models.

In [13]:
## In the parameter list provide two distinct model dictionaries
from sklearn.tree import DecisionTreeRegressor
my_params = [
    {"model": [DecisionTreeRegressor()],
        "model__max_depth": [2, 4, 6]},
    {"scaler": [MinMaxScaler(), RobustScaler()],
        "model": [KNeighborsRegressor()],
        "model__n_neighbors": [15, 30]}
]

## Simplified pipeline for this example
my_pipeline = Pipeline([
    ('scaler', 'passthrough'),
    ('model', KNeighborsRegressor())
])

## Perform grid search and print some of the results
grid_results = GridSearchCV(my_pipeline, my_params, cv=5, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)
pd.DataFrame(grid_results.cv_results_).sort_values(by='rank_test_score').head(5)
Out[13]:
mean_fit_time std_fit_time mean_score_time std_score_time param_model param_model__max_depth param_model__n_neighbors param_scaler params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
4 0.001403 0.000492 0.001391 0.000788 KNeighborsRegressor() NaN 15 RobustScaler() {'model': KNeighborsRegressor(), 'model__n_nei... -59109.156275 -70532.619049 -84820.363697 -48117.551024 -79083.658197 -68332.669648 13311.587487 1
6 0.002015 0.000020 0.000987 0.000021 KNeighborsRegressor() NaN 30 RobustScaler() {'model': KNeighborsRegressor(), 'model__n_nei... -56897.004249 -72368.464393 -84381.254113 -49428.363823 -82010.447426 -69017.106801 13760.314373 2
3 0.001613 0.000501 0.001179 0.000416 KNeighborsRegressor() NaN 15 MinMaxScaler() {'model': KNeighborsRegressor(), 'model__n_nei... -60327.869493 -72951.890271 -81030.432834 -50383.239622 -82301.617629 -69399.009970 12314.831320 3
1 0.001220 0.000392 0.000599 0.000489 DecisionTreeRegressor() 4 NaN NaN {'model': DecisionTreeRegressor(), 'model__max... -60868.253606 -73236.644144 -79723.617926 -57738.032191 -78215.168794 -69956.343332 9013.864060 4
5 0.001210 0.000413 0.000982 0.000018 KNeighborsRegressor() NaN 30 MinMaxScaler() {'model': KNeighborsRegressor(), 'model__n_nei... -60031.415423 -76715.241488 -84762.218723 -49547.016705 -85723.090527 -71355.796573 14274.028359 5

Question #4: Suppose we want to use cross-validation to evaluate our two re-scaling methods, robust scaling and min-max scaling, separately for the decision tree and KNN models considered in the previous example. Could we do this by adding another dictionary to our list, {"scaler": [MinMaxScaler(), RobustScaler()]}? Try this out and comment upon the results. If this approach does not produce the intended results explain what went wrong and modify the approach to properly evaluate the re-scaling choice separately for each candidate modeling method.

Part 5 - More Complex Preprocessing¶

Ideally we'd like to perform all of our data handling steps, including steps like one-hot encoding, using pipelines; however, this is complicated by the fact that OneHotEncoder() should only be applied to categorical features, while the pipeline examples we've seen so far apply the same preprocessing steps to every predictor passed to the modeling step at the end of the pipeline.

Column transformers, implemented via ColumnTransformer(), allow us to handle numeric and categorical features differently within a pipeline. The example given below applies a power transformer and standard scaler to numeric features, and a one-hot encoder to categorical features.

In [14]:
## Imports
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

## Set up data w/ mixed feature types 
my_variables = ['area.lot', 'area.living', 'style']
ic_train_X = ic_train[my_variables]
ic_train_y = ic_train['sale.amount']

## Get names of numeric and categorical columns
num_cols = ic_train_X.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = ic_train_X.select_dtypes(include=['object']).columns.tolist()

## Make a separate pipeline for numeric vs. cat vars
num_transformer = Pipeline([('transformer',  PowerTransformer(method = 'yeo-johnson')),
                            ("scaler", StandardScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])

## Put these pipelines together using ColumnTransformer()
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols)
])

## Create the final pipeline
final_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", KNeighborsRegressor())
])

## Fit the entire pipeline
final_fitted = final_pipe.fit(ic_train_X, ic_train_y)

## Calculate test set performance
ic_test_X = ic_test[my_variables]
ic_test_y = ic_test['sale.amount']
np.sqrt(mean_squared_error(ic_test_y, final_fitted.predict(ic_test_X)))
Out[14]:
49036.37811937593

Question #5: For the data handling pipeline given in the example above, use a cross-validated grid search (5-fold CV) to compare performance using all combinations of:

  • Standard scaling or robust scaling
  • $k=10$ or $k=20$ neighbors

Hint: You will need to think about the names used to define each handling step, working backwards from 'preprocessor'. Print the best estimator.

Part 6 - Application¶

The data used in this application are a random sample of $n=6000$ grayscale images from the MNIST database, a large collection of handwritten digits (0-9). These are provided in a "flattened" format, meaning the grayscale pixel intensities for each 28x28 image are represented using a single row and 784 columns. Note that higher values indicate the presence of ink, and zeroes indicate the absence of ink.

The code below displays a few sample images and their labels:

In [16]:
### Read flattened, processed data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")

### Separate the target variable (label column)
label = mnist['label']
mnist = mnist.drop(['label'], axis=1)

### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)

## Plot some examples
import matplotlib.cm as cm
fig, axs = plt.subplots(ncols=7)
for i in range(7):
    axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
    axs[i].title.set_text(f'label={label[i]}')
plt.show()

Question #6:

  • Part A: Create a 90-10 training-testing split using random_state=10
  • Part B: Set up a pipeline that includes at least one preprocessing step and a modeling step. You can choose any steps and model for this part.
  • Part C: Use a grid search with classification accuracy as the scoring metric to find the best performing modeling approach while satisfying the following criteria:
    1. Compare at least two scalers while allowing the possibility of neither scaler to be used
    2. Compare both KNN and Decision Tree models
    3. Compare at least three values of n_neighbors and at least three values of max_depth that reflect different locations on the bias-variance continuum of each model
  • Part D: Report the classification accuracy of the best performing model on the test set created in Part A.
  • Part E: Find the first digit in the test set that was incorrectly classified and display it as an image with the predicted label and correct label as title text.