This lab covers two main topics: pipelines, a tool used in sklearn to chain together multiple data handling steps (such as preprocessing and modeling), and cross-validation, a method used to estimate out-of-sample model performance and assist in hyperparameter selection.
Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).
To begin, you'll need the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
## KNN functions may produce FutureWarnings; the command below suppresses them.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The examples in this lab will continue using the Iowa City Home Sales data from our previous two labs. We'll begin by splitting it into training and testing segments:
from sklearn.model_selection import train_test_split
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_train, ic_test = train_test_split(ic_homes, test_size=0.2, random_state=7)
As mentioned in the introduction, pipelines chain together several data handling steps so that they can more easily be run sequentially, meaning the output from a step is used as the input to the subsequent step. Below is an example pipeline that includes two steps:
## Imports
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
## Set up the steps of the pipeline
my_pipeline = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler())
])
Each pipeline step is given as a tuple, a native data storage type that is similar to a list but immutable, meaning its elements cannot be altered or removed. Each tuple's first element is the name you give to the step, and the second element is the actual transformer or model estimator function.
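If you'd like to verify this structure for yourself, a Pipeline also stores its steps in a steps attribute as a list of these ('name', estimator) tuples. The lines below are an optional sketch (not part of the lab's required code) that simply prints them:
## Optional: the steps attribute holds the ('name', estimator) tuples
print(my_pipeline.steps)        # [('transformer', PowerTransformer(...)), ('scaler', MinMaxScaler())]
print(my_pipeline.steps[0][0])  # 'transformer', the name given to the first step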
Like data preprocessing and modeling functions, pipelines have fit() and transform() methods:
## Fit the pipeline
my_variables = ['area.lot','area.bsmt']
fitted_pipe = my_pipeline.fit(ic_train[my_variables])
## Apply the fitted pipeline to visualize a transformation followed by min-max scaling
pd.DataFrame(fitted_pipe.transform(ic_train[my_variables]), columns = my_variables).hist()
plt.show()
In the above example the variables "area.lot" and "area.bsmt" were first normalized using a power transformer, then re-scaled using min-max scaling, all in a single command.
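As a side note, pipelines also provide a fit_transform() method that combines these two steps; the optional sketch below (using the same variables) would produce the same transformed values in a single call:
## Optional: fit and transform the training data in one call
transformed_train = my_pipeline.fit_transform(ic_train[my_variables])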
We can take our pipeline a step further by adding a model as the final step. You should note that pipelines can accommodate any class with fit() and transform() methods as intermediate steps, while the final step only needs to have a fit() method.
## New pipeline with a modeling step at the end
from sklearn.neighbors import KNeighborsRegressor
my_pipeline = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler()),
('model', KNeighborsRegressor(n_neighbors = 12))
])
## Fit to training data (using only area.lot and area.bsmt as predictive features)
ic_train_X = ic_train[my_variables]
ic_train_y = ic_train['sale.amount']
fitted_pipe = my_pipeline.fit(X = ic_train_X, y = ic_train_y)
## Predict on test data
from sklearn.metrics import mean_squared_error
ic_test_X = ic_test[my_variables]
ic_test_y = ic_test['sale.amount']
np.sqrt(mean_squared_error(ic_test_y, fitted_pipe.predict(ic_test_X)))
68356.27227228682
A few additional things to know about pipelines:

* The individual steps of a fitted pipeline can be accessed using the named_steps attribute, for example my_pipeline.named_steps['scaler'].
* Parameters of any step can be changed using the set_params() method; for example, my_pipeline.set_params(transformer__method = 'box-cox') would change the power transformer to a box-cox transformation (a short sketch of both features follows this list).
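The optional sketch below illustrates both of these features using the pipeline fitted earlier in this lab (it is not needed for the exercises):
## Optional: inspect a fitted step by name, then change a hyperparameter
print(fitted_pipe.named_steps['scaler'])            # the fitted MinMaxScaler step
print(fitted_pipe.named_steps['scaler'].data_max_)  # attributes of a fitted step are accessible
my_pipeline.set_params(model__n_neighbors = 12)     # returns the pipeline with the updated setting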
Question #1: Fit the pipeline from the previous example using each of the n_neighbors values in [4,8,12,16,20]. Print the RMSE on the test set for each value of $k$.

Most of the time we'll use cross-validation in conjunction with a pipeline to compare the performance of different combinations of pre-processing steps and models. To accomplish this we will store the hyperparameter choices and models we'd like to evaluate in a dictionary:
## Dictionary of hyperparameters to try
from sklearn.preprocessing import StandardScaler, RobustScaler, MaxAbsScaler
my_params = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'model__n_neighbors': [10,20,30],
'model__weights': ['uniform','distance'],
'model__p': [1,2]}
Recall that dictionaries use key:value pairs to store data. Here, the keys are given names that reference specific entities within our pipeline; for example, 'scaler' references the scaler step, while 'model__n_neighbors' references the n_neighbors argument within the modeling step (a double underscore separates a step's name from one of its arguments).
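If you're ever unsure which key names are valid, the pipeline's get_params() method lists every parameter you can reference; the optional sketch below prints them for the current pipeline:
## Optional: list the parameter names that a search dictionary can reference
print(list(my_pipeline.get_params().keys()))
## the keys include step names like 'scaler' and nested names like 'model__n_neighbors'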
After setting up a pipeline and an associated dictionary of parameters/models to evaluate, we can use the GridSearchCV() function to perform a grid search to find the best combination as determined by cross-validated performance.
## Perform grid search using 5-fold cross-validation to assess performance
from sklearn.model_selection import GridSearchCV
grid_results = GridSearchCV(my_pipeline, my_params, cv=5, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)
In this example you should note that the argument cv=5 was used to execute 5-fold cross-validation, and the scoring argument was used to set the performance metric that was tracked and returned.
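Two other attributes of the fitted search object are often convenient: best_params_ holds the best combination found, and best_score_ holds its mean cross-validated score (a negative RMSE, given the scoring argument used here). The lines below are an optional sketch:
## Optional: quick summaries of the search results
print(grid_results.best_params_)  # best combination of hyperparameters/steps
print(grid_results.best_score_)   # its mean cross-validated score (negative RMSE here)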
Results of the grid search can be accessed via the cv_results_ attribute; however, it is often helpful to convert these results to a pandas Data Frame before doing anything with them:
## Example showing the top 5 combinations of hyperparameters/models
params_to_print = ['rank_test_score','param_scaler','param_model__n_neighbors', 'param_model__weights', 'param_model__p', 'mean_test_score']
pd.DataFrame(grid_results.cv_results_).sort_values(by='rank_test_score')[params_to_print].head(5)
|    | rank_test_score | param_scaler | param_model__n_neighbors | param_model__weights | param_model__p | mean_test_score |
|----|-----------------|--------------|--------------------------|----------------------|----------------|-----------------|
| 23 | 1 | MaxAbsScaler() | 20 | distance | 2 | -65998.837551 |
| 17 | 2 | MaxAbsScaler() | 20 | distance | 1 | -66044.547147 |
| 11 | 3 | MaxAbsScaler() | 10 | distance | 2 | -66227.991350 |
| 15 | 4 | StandardScaler() | 20 | distance | 1 | -66298.240297 |
| 29 | 5 | MaxAbsScaler() | 30 | distance | 1 | -66357.280200 |
Sometimes we might not be sure whether a preprocessing step is beneficial. We can explore what happens when that step is skipped by providing the character string 'passthrough' in our dictionary for a preprocessing step, like the 'scaler' step. This is demonstrated below:
## Example using passthrough
my_params = {'scaler': ['passthrough',StandardScaler(), RobustScaler()],
'model__n_neighbors': [10,20,30],
'model__weights': ['uniform','distance'],
'model__p': [1,2]}
grid_results = GridSearchCV(my_pipeline, my_params, cv=5, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)
pd.DataFrame(grid_results.cv_results_).sort_values(by='rank_test_score')[params_to_print].head(5)
|    | rank_test_score | param_scaler | param_model__n_neighbors | param_model__weights | param_model__p | mean_test_score |
|----|-----------------|--------------|--------------------------|----------------------|----------------|-----------------|
| 16 | 1 | StandardScaler() | 20 | distance | 1 | -66298.240297 |
| 15 | 2 | passthrough | 20 | distance | 1 | -66298.240297 |
| 9  | 3 | passthrough | 10 | distance | 2 | -66466.373189 |
| 10 | 3 | StandardScaler() | 10 | distance | 2 | -66466.373189 |
| 3  | 5 | passthrough | 10 | distance | 1 | -66468.831129 |
A final thing you should know about GridSearchCV() is that the best combination is refit to the entire training set, and this fitted model can be accessed via the best_estimator_ attribute. The code below illustrates how sklearn will print a neat interactive flowchart of your best estimator:
## Diagram of best estimator - click on each step for hyperparameter values
grid_results.best_estimator_
Pipeline(steps=[('transformer', PowerTransformer()), ('scaler', StandardScaler()), ('model', KNeighborsRegressor(n_neighbors=20, p=1, weights='distance'))])
And the code below demonstrates how you can use your GridSearchCV() results to see how the best estimator performs on the test set without needing to re-fit the pipeline/model:
## Performance of best estimator on the test set
np.sqrt(mean_squared_error(ic_test_y, grid_results.best_estimator_.predict(ic_test_X)))
66086.58814048783
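Equivalently, the fitted GridSearchCV object has a score() method that applies the refit best estimator using the same scoring metric, so the optional sketch below would return the negative of the RMSE printed above:
## Optional: score the test set with the best estimator and the search's scoring metric
grid_results.score(ic_test_X, ic_test_y)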
Question #2:
The computational cost of grid search can be problematic when you are considering many different modeling approaches and hyperparameter values. Randomized search provides an alternative that allows you to perform a rough search over a wide parameter space with a fixed computational cost.
The example below demonstrates how to search the hyperparameter n_neighbors using values drawn from a Poisson distribution with a mean of 20. Additionally, you should note that if a list of values is given, each of its elements will be sampled with equal probability during the search.
from scipy.stats import poisson
from sklearn.model_selection import RandomizedSearchCV
## Parameter dictionary, notice the use of 'poisson(20)'
my_params = {'scaler': [StandardScaler(), RobustScaler(), MaxAbsScaler()],
'model__n_neighbors': poisson(20),
'model__weights': ['uniform','distance']
}
## Results
RandomizedSearchCV(my_pipeline, my_params, cv=5, n_iter=30, random_state=0).fit(ic_train_X, ic_train_y).best_estimator_
Pipeline(steps=[('transformer', PowerTransformer()), ('scaler', MaxAbsScaler()), ('model', KNeighborsRegressor(n_neighbors=18, weights='distance'))])
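If you're curious what values a frozen scipy distribution such as poisson(20) provides, you can sample from it directly; the optional sketch below is purely illustrative (the search draws its own values internally):
## Optional: preview a few draws from a frozen Poisson distribution
from scipy.stats import poisson
poisson(20).rvs(size=5, random_state=1)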
Question #3: In DecisionTreeRegressor(), the hyperparameter min_samples_split is set up to expect either integer or float input, where a float represents the fraction of the overall sample size that must be present in a node in order for it to be eligible for splitting. Modify your pipeline from Question #2 to remove the re-scaling step, then perform 25 iterations of randomized search, sampling max_depth from a Poisson distribution with a mean of 4 and sampling min_samples_split from a Uniform distribution with endpoints of 0 and 0.8. Print a sorted Data Frame showing the top 5 best-performing combinations (using RMSE as the scoring criterion).
A strength of GridSearchCV() is that it allows us to compare models and parameters using a consistent set of cross-validation folds, thereby minimizing a potential source of variability in our estimates of out-of-sample performance. Thus, we should strive to use a single call to GridSearchCV() whenever possible.

We can handle scenarios where we'd like to evaluate different models and tune their hyperparameters in the same search by providing a list of dictionaries to GridSearchCV(). The example below compares decision tree and KNN models while tuning the hyperparameters of each, as well as the scaling method for KNN models.
## In the parameter list provide two distinct model dictionaries
from sklearn.tree import DecisionTreeRegressor
my_params = [
{"model": [DecisionTreeRegressor()],
"model__max_depth": [2, 4, 6]},
{"scaler": [MinMaxScaler(), RobustScaler()],
"model": [KNeighborsRegressor()],
"model__n_neighbors": [15, 30]}
]
## Simplified pipeline for this example
my_pipeline = Pipeline([
('scaler', 'passthrough'),
('model', KNeighborsRegressor())
])
## Perform grid search and print some of the results
grid_results = GridSearchCV(my_pipeline, my_params, cv=5, scoring='neg_root_mean_squared_error').fit(ic_train_X, ic_train_y)
pd.DataFrame(grid_results.cv_results_).sort_values(by='rank_test_score').head(5)
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model | param_model__max_depth | param_model__n_neighbors | param_scaler | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0.001403 | 0.000492 | 0.001391 | 0.000788 | KNeighborsRegressor() | NaN | 15 | RobustScaler() | {'model': KNeighborsRegressor(), 'model__n_nei... | -59109.156275 | -70532.619049 | -84820.363697 | -48117.551024 | -79083.658197 | -68332.669648 | 13311.587487 | 1 |
| 6 | 0.002015 | 0.000020 | 0.000987 | 0.000021 | KNeighborsRegressor() | NaN | 30 | RobustScaler() | {'model': KNeighborsRegressor(), 'model__n_nei... | -56897.004249 | -72368.464393 | -84381.254113 | -49428.363823 | -82010.447426 | -69017.106801 | 13760.314373 | 2 |
| 3 | 0.001613 | 0.000501 | 0.001179 | 0.000416 | KNeighborsRegressor() | NaN | 15 | MinMaxScaler() | {'model': KNeighborsRegressor(), 'model__n_nei... | -60327.869493 | -72951.890271 | -81030.432834 | -50383.239622 | -82301.617629 | -69399.009970 | 12314.831320 | 3 |
| 1 | 0.001220 | 0.000392 | 0.000599 | 0.000489 | DecisionTreeRegressor() | 4 | NaN | NaN | {'model': DecisionTreeRegressor(), 'model__max... | -60868.253606 | -73236.644144 | -79723.617926 | -57738.032191 | -78215.168794 | -69956.343332 | 9013.864060 | 4 |
| 5 | 0.001210 | 0.000413 | 0.000982 | 0.000018 | KNeighborsRegressor() | NaN | 30 | MinMaxScaler() | {'model': KNeighborsRegressor(), 'model__n_nei... | -60031.415423 | -76715.241488 | -84762.218723 | -49547.016705 | -85723.090527 | -71355.796573 | 14274.028359 | 5 |
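When a single search spans multiple model types like this, you can pull the winning model family directly out of the results; the optional sketch below shows two equivalent ways:
## Optional: identify which model family won the combined search
print(grid_results.best_params_['model'])                 # winning estimator as listed in the parameter grid
print(grid_results.best_estimator_.named_steps['model'])  # the refit copy inside the best pipeline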
Question #4: Suppose we want to use cross-validation to evaluate our two re-scaling methods, robust scaling and min-max scaling, separately for the decision tree and KNN models considered in the previous example. Could we do this by adding another dictionary to our list, {"scaler": [MinMaxScaler(), RobustScaler()]}? Try this out and comment upon the results. If this approach does not produce the intended results, explain what went wrong and modify the approach to properly evaluate the re-scaling choice separately for each candidate modeling method.
Ideally we'd like to perform all of our data handling steps, including steps like one-hot encoding, using pipelines; however, this is complicated by the fact that we only want to apply OneHotEncoder() to categorical features, whereas the pipeline examples we've seen so far have applied the same preprocessing steps to all of the predictors used in the modeling step at the end of the pipeline.

Column transformers, implemented via ColumnTransformer(), allow us to handle numeric and categorical features differently within a pipeline. The example given below applies a power transformer and standard scaler to numeric features, and a one-hot encoder to categorical features.
## Imports
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
## Set up data w/ mixed feature types
my_variables = ['area.lot', 'area.living', 'style']
ic_train_X = ic_train[my_variables]
ic_train_y = ic_train['sale.amount']
## Get names of numeric and categorical columns
num_cols = ic_train_X.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = ic_train_X.select_dtypes(include=['object']).columns.tolist()
## Make a separate pipeline for numeric vs. cat vars
num_transformer = Pipeline([('transformer', PowerTransformer(method = 'yeo-johnson')),
("scaler", StandardScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])
## Put these pipelines together using ColumnTransformer()
preprocessor = ColumnTransformer([
('num', num_transformer, num_cols),
('cat', cat_transformer, cat_cols)
])
## Create the final pipeline
final_pipe = Pipeline([
("preprocessor", preprocessor),
("model", KNeighborsRegressor())
])
## Fit the entire pipeline
final_fitted = final_pipe.fit(ic_train_X, ic_train_y)
## Calculate test set performance
ic_test_X = ic_test[my_variables]
ic_test_y = ic_test['sale.amount']
np.sqrt(mean_squared_error(ic_test_y, final_fitted.predict(ic_test_X)))
49036.37811937593
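If you're running a reasonably recent version of sklearn (1.0 or later), you can also ask the fitted column transformer which features it produced, including the one-hot encoded columns; an optional sketch:
## Optional (sklearn >= 1.0): feature names produced by the fitted preprocessor
final_fitted.named_steps['preprocessor'].get_feature_names_out()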
Question #5: For the data handling pipeline given in the example above, use a cross-validated grid search (5-fold CV) to compare performance using all combinations of:

Hint: You will need to think about the names used to define each handling step, working backwards from 'preprocessor'. Print the best estimator.
The data used in this application are a random sample of $n=6000$ grayscale images from the MNIST database, a large collection of handwritten digits (0-9). These are provided in a "flattened" format, meaning the grayscale pixel intensities for each 28x28 image are represented using a single row and 784 columns. Note that higher values indicate the presence of ink, and zeroes indicate the absence of ink.
The code below displays a few sample images and their labels:
### Read flattened, processed data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
### Separate the target variable (label column)
label = mnist['label']
mnist = mnist.drop(['label'], axis=1)
### Convert to numpy array and reshape to 28 by 28
mnist_unflattened = mnist.to_numpy()
mnist_unflattened = mnist_unflattened.reshape(6000,28,28)
## Plot some examples
import matplotlib.cm as cm
fig, axs = plt.subplots(ncols=7)
for i in range(7):
axs[i].imshow(mnist_unflattened[i], cmap=cm.Greys)
axs[i].title.set_text(f'label={label[i]}')
plt.show()
Question #6: Using the MNIST data, perform a cross-validated search comparing KNN and decision tree models (use random_state=10 wherever a random state is needed). Your search should include at least three values of n_neighbors and at least three values of max_depth that reflect different locations on the bias-variance continuum of each model.