This lab provides an introduction to some of the basic steps involved in applying machine learning methods using Scikit-learn (commonly known as sklearn), a popular machine learning library containing implementations of various classical algorithms.
Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic, you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test them out).
To begin, you'll need the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
## KNN functions may produce FutureWarnings; the command below suppresses them.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The examples in this lab will use the Iowa City Home Sales data introduced in Lab #1:
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_homes.head(3)
| | sale.amount | sale.date | occupancy | style | built | bedrooms | bsmt | ac | attic | area.base | area.add | area.bsmt | area.garage1 | area.garage2 | area.living | area.lot | lon | lat | assessed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172500 | 1/3/2005 | 116 (Zero Lot Line) | 1 Story Frame | 1993 | 3 | Full | Yes | None | 1102 | 0 | 925 | 418 | 0 | 1102 | 5520 | -91.509129 | 41.651160 | 173040 |
| 1 | 90000 | 1/5/2005 | 113 (Condominium) | 1 Story Frame | 2001 | 2 | None | Yes | None | 878 | 0 | 0 | 0 | 264 | 878 | 3718 | -91.522964 | 41.673240 | 89470 |
| 2 | 168500 | 1/12/2005 | 101 (Single-Family / Owner Occupied) | Split Foyer Frame | 1976 | 4 | Full | Yes | None | 1236 | 0 | 700 | 576 | 0 | 1236 | 8800 | -91.482311 | 41.658488 | 164230 |
A fundamental step in machine learning is to split your data into independent training and testing segments. To prevent data leakage, it is important to perform this step as early as possible and refrain from using the test set until we are ready to perform a final evaluation of our model(s) of interest.
To create independent training and testing segments we'll use the train_test_split function from the model_selection module of sklearn:
from sklearn.model_selection import train_test_split
ic_train, ic_test = train_test_split(ic_homes, test_size=0.2, random_state=7)
The code given above randomly assigns 80% of the rows in ic_homes to a DataFrame named ic_train and the remaining 20% to a DataFrame named ic_test. Note that the argument random_state=7 makes this split reproducible, so we will all have the exact same training and testing observations.
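If you'd like to verify that the split worked as intended, a quick check of the resulting row counts is shown below:
## Quick check: ic_train and ic_test should contain roughly 80% and 20% of the original rows
print(ic_homes.shape, ic_train.shape, ic_test.shape)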
All new applications of machine learning should begin with a thoughtful investigation of the contents of the training set. In this step, we might seek to address the following questions:
Question #1:

- Part A: Use the isna() and sum() methods to confirm that there are no missing values in the ic_train DataFrame.
- Part B: Create histograms (or similar visualizations) of the numeric variables in the ic_train DataFrame. Briefly comment upon whether any variables have unusual values and/or highly skewed distributions.
- Part C: Explore the pairwise relationships among the variables in the ic_train DataFrame. Outside of any relationships between predictors and the target variable, sale.amount, are there any pairs of variables that seem so strongly related that they should be combined prior to model training? Note: if you aren't sure of your visual assessment you may consider using the corr() method to look for pairwise correlations above 0.95 or 0.99.

Fitting Models in sklearn
In sklearn, models are trained using the fit() method of the corresponding object. Thus, a simple way to train a model is to first instantiate an object of the desired model class, then pass in the training data using this object's fit() method. This workflow is demonstrated for a KNN regression model below:
## Import and instantiate
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=10, weights='uniform', p=2)  ## p=2 corresponds to Euclidean distance
## Fit using the training data
ic_train_y = ic_train['sale.amount']
ic_train_X = ic_train.select_dtypes("number").drop('sale.amount', axis=1)
knn_model.fit(ic_train_X, ic_train_y)
## Evaluate on the test data
from sklearn.metrics import mean_squared_error
ic_test_X = ic_test.select_dtypes("number").drop('sale.amount', axis=1) ## Create X for the test set
ic_test_y = ic_test['sale.amount']
np.sqrt(mean_squared_error(ic_test_y, knn_model.predict(ic_test_X))) ## Calculate RMSE w/ test data
24086.67584498422
This RMSE value summarizes the typical size of the model's prediction errors (in dollars) on the test set.
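For additional context, you might (optionally) compare this error against a naive baseline that always predicts the mean sale price from the training set; a minimal sketch reusing the objects created above is shown below:
## Naive baseline: predict the training set's mean sale price for every test home
baseline_preds = np.full(len(ic_test_y), ic_train_y.mean())
print(np.sqrt(mean_squared_error(ic_test_y, baseline_preds)))  ## Baseline RMSE for comparison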
A few additional comments:

- If your background is in R/statistical modeling, it might seem mysterious how knn_model was able to make predictions when we seemingly never stored the fitted model anywhere. sklearn uses an object-oriented programming paradigm that allows methods like fit() to modify existing objects in place.

Question #2:
- Part A: Fit a decision tree regression model with a max_depth of 3 using the same training and testing sets created in the previous example. Print the RMSE of this model for the test data.
- Part B: Use the plot_tree() function from the tree module of sklearn to display the decision tree you trained in Part A.
- Part C: Fit decision tree models with values of max_depth ranging from 1 to 10. Construct a line chart showing RMSE as a function of max_depth, using separate lines to delineate performances from the training and testing sets.

As we discussed during lecture, some methods like KNN are sensitive to the measurement scale of the predictor variables, while others, such as decision trees, will produce the same results regardless of the scale used.
The preprocessing module of sklearn provides many different options for re-scaling, which you can find at this link.
Regardless of the re-scaling approach used, the main methods to be aware of are:

- fit() - computes the necessary statistics from the provided data to be used later for re-scaling
- transform() - re-scales the provided data using parameters estimated when the scaler was 'fit'
- fit_transform() - fits to the provided data, then transforms it

Note that fit_transform() is typically used during exploratory analysis of the training data. For model evaluation, always use fit() on the training data and transform() on the test data.
The example below demonstrates how to apply min-max scaling to a subset of predictors:
## Variables to re-scale
example_predictors = ['area.lot', 'area.living', 'area.garage1']
## Import scaler function and re-scale training data
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler().fit(ic_train[example_predictors]) # Fit the scaler
train_area_lot_scaled = mm_scaler.transform(ic_train[example_predictors]) # Transform training data
## Show the original and re-scaled data
ic_train[example_predictors].hist(layout = (1,3))
pd.DataFrame(train_area_lot_scaled, columns = example_predictors).hist(layout = (1,3))
plt.show()
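Following the guidance above, the test set should then be transformed using the scaler that was fit on the training data; a brief sketch reusing mm_scaler is shown below:
## Re-scale the test data using the scaler fit on the training data (no re-fitting)
test_area_scaled = mm_scaler.transform(ic_test[example_predictors])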
Question #3:

- Part A: Suppose we had used fit_transform() to apply a re-scaling procedure to all of our data before using train_test_split(). Would you recommend this approach? Briefly explain your reasoning.
- Part B: Fit a KNN model that predicts sale.amount using all other numeric variables as predictors, with a standardization pre-processing step (via StandardScaler()). Your KNN model should use $k=5$ neighbors, uniform weighting, and Euclidean distance. Report the RMSE on the same test set (ic_test) that was used in Part 2. Did re-scaling lead to better performance on the test data?

In algorithms like KNN, re-scaling is generally considered to be an essential step, as some predictors will arbitrarily be given more weight in distance calculations when the predictors are not all on comparable scales.
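To make this concrete, here is a small illustrative sketch (the two columns were chosen arbitrarily for demonstration) showing how a feature recorded on a large scale, such as area.lot, dominates the Euclidean distance between two homes until the features are re-scaled:
## Euclidean distance between the first two training homes, before and after min-max scaling
from sklearn.preprocessing import MinMaxScaler
demo_cols = ['area.lot', 'bedrooms']
raw_pair = ic_train[demo_cols].iloc[:2].to_numpy()
scaled_pair = MinMaxScaler().fit_transform(ic_train[demo_cols])[:2]
print(np.sqrt(((raw_pair[0] - raw_pair[1])**2).sum()))        ## Driven almost entirely by area.lot
print(np.sqrt(((scaled_pair[0] - scaled_pair[1])**2).sum()))  ## Both features now contribute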
Re-scaling preserves the distributional shape of the data, but another related pre-processing step, data transformation, intentionally changes that shape, typically to improve model performance or reduce the influence of outliers. Depending on the context, data transformations may be applied to the predictors, the target variable, or both.
In addition to simple procedures like log-transformations, you should be aware of the following pre-processing transformations:

- Power transformations (such as the Yeo-Johnson transformation, available via PowerTransformer())
- Quantile transformations (available via QuantileTransformer())
The basic steps involved in using one of these data transformations are the same as those we saw when re-scaling. A brief example is shown below:
## Example - Power transformation
from sklearn.preprocessing import PowerTransformer
vars_to_transform = ['sale.amount', 'assessed']
yj_trans = PowerTransformer(method = 'yeo-johnson').fit(ic_train[vars_to_transform])
yj_trans_output = yj_trans.transform(ic_train[vars_to_transform])
## Display the results
pd.DataFrame(ic_train[vars_to_transform]).hist()
pd.DataFrame(yj_trans_output, columns = vars_to_transform).hist()
plt.show()
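Questions #4 and #5 below refer to quantile transformations, which follow the same workflow; a minimal sketch is shown below (the n_quantiles and output_distribution arguments are illustrative choices rather than requirements):
## Example - Quantile transformation (same fit/transform pattern as above)
from sklearn.preprocessing import QuantileTransformer
qt_trans = QuantileTransformer(n_quantiles=100, output_distribution='normal').fit(ic_train[vars_to_transform])
qt_trans_output = qt_trans.transform(ic_train[vars_to_transform])
pd.DataFrame(qt_trans_output, columns = vars_to_transform).hist()
plt.show()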
Question #4: Without relying upon any empirical results, would you expect applying a quantile transformation on all input features to improve the performance of a decision tree model? Explain your reasoning.
Question #5: Create a new binary target variable named over_assessed that takes on a value of "yes" if a home's assessed value is greater than its sale price and "no" otherwise. Then create a KNN model to predict over_assessed from all other numeric features (excluding sale price and assessed value) using $k=10$ neighbors, distance weighting, and Euclidean distance, with QuantileTransformer() as a pre-processing step. Report the accuracy of this model on the test set.
So far, we've ignored the categorical predictors in our data for simplicity, but categorical predictors can be included in most machine learning models using an approach known as one-hot encoding (or "dummy variables" in contexts like statistical modeling).
The OneHotEncoder() function from the preprocessing module of sklearn can be used to one-hot encode categorical data. An example involving two categorical features is shown below:
from sklearn.preprocessing import OneHotEncoder
example_data = pd.DataFrame([['A', 'Male'], ['A', 'Female'], ['B', 'Female'], ['C', 'Male']])
oh_trans = OneHotEncoder(sparse_output=False).fit(example_data)
print(oh_trans.transform(example_data))
[[1. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]
Notice how OneHotEncoder() represents the first variable, which has the categorical values 'A', 'B', and 'C', using three binary variables: the first encodes the category 'A', the second encodes 'B', and the third encodes 'C'. Similarly, the second variable, which has the categorical values 'Male' and 'Female', is represented by two binary variables. Because categories are ordered alphabetically, the first of these encodes 'Female' and the second encodes 'Male'.
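If you're ever unsure which one-hot column corresponds to which category, recent versions of sklearn allow you to inspect the generated column names:
## Inspect the names of the one-hot encoded columns (categories are ordered alphabetically)
print(oh_trans.get_feature_names_out())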
A few additional things to note:

- OneHotEncoder() returns a "sparse matrix" object by default, which isn't helpful unless we're using a machine learning model that can benefit from sparse matrix input. The argument sparse_output=False prevents this return behavior.
- Another useful argument is handle_unknown='ignore', which instructs OneHotEncoder() to ignore (not represent) any new categories that weren't present in the training data.
- OneHotEncoder() will treat any numeric columns you give it as categorical, which can be a major problem because numeric variables typically have lots of unique values. You should be careful to avoid one-hot encoding numeric variables.
- OneHotEncoder() will create a matrix of one-hot variables that is not full rank ("full rank" is a linear algebra term indicating that every column of the matrix is linearly independent). If we wanted a representation without any dependencies, we could use the argument drop='first' to drop the first dummy variable for each feature. You might note that this is the behavior of the model.matrix() function in R (see the brief sketch below).
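As an optional sketch of that last point, reusing example_data from above:
## Dropping the first category of each feature yields a full-rank encoding
oh_drop = OneHotEncoder(drop='first', sparse_output=False).fit(example_data)
print(oh_drop.transform(example_data))  ## Remaining columns encode 'B', 'C', and 'Male'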
Question #6: Starting with your model from Question #5, incorporate one-hot encoded predictors representing the original variables "style", "ac", and "bsmt" in addition to the numeric predictors you were already using. You do not need to worry about re-scaling the new one-hot features as they already exist on a 0 to 1 scale. Use the same hyperparameters that were specified in Question #5 and report the performance of this new approach on the test set. Did adding these additional features improve the model's performance?