This lab covers $k$-nearest neighbors classification and regression implementations in the sklearn
library, including the related topics of standardization/scaling, tuning parameters, and train/test splits.
To begin, you'll need the following libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
# Unfortunately, knn functions prompt "future warnings", so the commands below turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Most examples in this lab will use data scraped from the Johnson County Assessor documenting all recorded home sales in Iowa City, IA between 2005 and 2007. To illustrate various concepts, we will consider two different analysis goals: predicting the sale price of a home (a regression task), and predicting whether a home's assessed value exceeds its sale price (a classification task).
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
In any application, the very first thing we should do is create a training-testing split. It's critical to make this split as soon as possible, since anything done using the full data set (including exploration, pre-processing, etc.) can lead to data leakage, or information from the testing set influencing the final model (either by virtue of your decisions, or directly through the sharing of information).
To make a training-testing split, we'll use the train_test_split function contained in the model_selection module of sklearn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
print(train.shape)
print(test.shape)
(621, 19)
(156, 19)
The code above verifies the split by checking the shape (dimensions) of the resulting objects. Notice that the test object has 156 observations (approximately 20% of the full dataset). Also note that the argument random_state sets a randomization seed used to make this split repeatable (so that you can copy my code and get the same "random" split that I did).
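As a quick optional check (not part of the original code), you can verify that re-using the same random_state reproduces the split exactly; the names train2 and test2 below are just placeholders for this demonstration.
## Re-running the split with the same seed should select exactly the same rows
train2, test2 = train_test_split(ic, test_size=0.2, random_state=7)
print(train.index.equals(train2.index))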
In the early stages of a new machine learning application it is important to understand the contents of your data. This includes how the data were collected, what each variable represents, whether any data are missing, and how the variables are distributed.
The collection of the Iowa City home sales data was previously described, the available variables are easily understood by their names, and there are no missing data. So, your only task will be exploring the distributions of the variables.
For now, we'll only use numeric predictors in our models (next week we will discuss strategies for using categorical predictors).
Finally, remember that all steps should be performed using only the training data at this point. Any exploration of the test data can contaminate the entire model-building process.
## Select only numeric variables
train_num = train.select_dtypes("number")
## Specify target vars
train_price = train_num['sale.amount']
train_over = (train_num['assessed'] > train_num['sale.amount']).astype(int)
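As a hedged sketch of how you might explore these distributions (the exact summaries and plots are up to you), pandas can produce numeric summaries and histograms of the training variables:
## Numeric summaries of every numeric variable in the training data
print(train_num.describe())
## Histograms of every numeric variable (the figure size is only a suggestion)
train_num.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()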
Notice in Question #1 that the available predictors have very different scales (units of measurement), which will cause problems for the distance-based calculations that $k$-nearest neighbors models rely upon.
Several functions for standardization and scaling are contained in the preprocessing module of sklearn:
## Import StandardScaler
from sklearn.preprocessing import StandardScaler
## Drop outcomes so they aren't used as predictors
train_X_price = train_num.drop('sale.amount',axis=1)
train_X_over = train_num.drop(['sale.amount', 'assessed'], axis=1)
## Apply standardization
train_XStd_price = StandardScaler().fit_transform(train_X_price)
train_XStd_over = StandardScaler().fit_transform(train_X_over)
## Check results
train_XStd_price[0:3,0:5]
array([[-1.47845332,  0.97303154, -0.74892431,  0.84993783,  0.51301889],
       [ 1.05381583, -0.0340561 , -0.49593874, -0.46853281, -0.26695141],
       [-0.25921262,  0.97303154,  1.98665591, -0.46853281,  1.21397258]])
The code above takes train_num, which contains all numeric variables in the training data, and drops columns related to the outcome variables used in our previously defined goals.
Next, standardization is applied across each column of the predictor data frame using StandardScaler and the fit_transform method.
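If you'd like an optional sanity check of what StandardScaler did, each standardized column should now have a mean of approximately 0 and a standard deviation of approximately 1:
## Column means and standard deviations of the standardized predictors
print(np.round(train_XStd_price.mean(axis=0), 3))
print(np.round(train_XStd_price.std(axis=0), 3))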
Implementations of the other scaling methods described in our lecture slides are shown below:
## Robust Scaling
from sklearn.preprocessing import RobustScaler
train_XRob_price = RobustScaler().fit_transform(train_X_price)
train_XRob_over = RobustScaler().fit_transform(train_X_over)
## Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
train_XMM_price = MinMaxScaler().fit_transform(train_X_price)
train_XMM_over = MinMaxScaler().fit_transform(train_X_over)
## Max Absolute Scaling
from sklearn.preprocessing import MaxAbsScaler
train_XMabs_price = MaxAbsScaler().fit_transform(train_X_price)
train_XMabs_over = MaxAbsScaler().fit_transform(train_X_over)
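Similarly, as an optional check, min-max scaled columns should span the interval [0, 1], while max-absolute scaled columns should have a maximum absolute value of 1:
## Min-max scaled columns should have a minimum of 0 and a maximum of 1
print(train_XMM_price.min(axis=0))
print(train_XMM_price.max(axis=0))
## Max-absolute scaled columns should have a maximum absolute value of 1
print(np.abs(train_XMabs_price).max(axis=0))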
Shown below are a few scaler methods that you should be aware of:
## Fit the scaler to training data so that its parameters (ie: mean, std, min, max, etc.) can be applied to new data
my_scaler = RobustScaler()
my_scaler.fit(train_X_price)
## Transform data using a previously fit scaler (useful for evaluating a model on the test set)
transformed_data = my_scaler.transform(train_X_price)
## Undo a transformation (ie: go back to the unscaled data)
original_data = my_scaler.inverse_transform(transformed_data)
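Because these scalers return numpy arrays, it is often convenient to convert the scaled output back into a pandas DataFrame using the column names of the unscaled data; a minimal sketch is shown below (the name transformed_df is arbitrary).
## Wrap the scaled array in a DataFrame so the columns keep their original names
transformed_df = pd.DataFrame(transformed_data, columns=train_X_price.columns)
print(transformed_df.head())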
At this point we're ready to implement a $k$-nearest neighbors model. As a starting point, we'll use $k = \sqrt{n}$, Euclidean distance ($p=2$), and uniform weighting. Additionally, you should recognize that sklearn has different implementations of $k$-nearest neighbors for classification and regression tasks.
## Get n, k
n = train_num.shape[0]
k = round(math.sqrt(n))
## Setup knn classifier
from sklearn.neighbors import KNeighborsClassifier
knn_class = KNeighborsClassifier(n_neighbors=k,weights='uniform',p=2)
## Setup knn reg
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=k,weights='uniform',p=2)
The code above simply sets up an object corresponding to each of our models; we still need to use the fit method of each object to fit the model to the training data:
## Fit the classifier
knn_class.fit(train_XStd_over, train_over)
## Fit the reg
knn_reg.fit(train_XStd_price, train_price)
KNeighborsRegressor(n_neighbors=25)
Two other methods to be aware of are:
- predict - returns predicted labels (classification) or values (regression)
- predict_proba - returns predicted probabilities (classification only)
For additional attributes/methods, see the documentation on KNeighborsClassifier and the documentation on KNeighborsRegressor.
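A brief, illustrative use of these methods with the models fit above is sketched below (the slice of the first 5 rows is arbitrary and only keeps the printout small):
## Predicted labels for the first 5 training observations (classification)
print(knn_class.predict(train_XStd_over[0:5]))
## Predicted class probabilities for those same observations
print(knn_class.predict_proba(train_XStd_over[0:5]))
## Predicted sale prices for the first 5 training observations (regression)
print(knn_reg.predict(train_XStd_price[0:5]))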
Standardization/scaling is an essential step in most $k$-nearest neighbors applications, with the exception being data that are universally measured on the same scale (such as an image's pixel intensities).
Data transformations are another step that can potentially improve model performance. The preprocessing module contains several transformation options; for now we'll demonstrate the Box-Cox transformation (which requires strictly positive data) and a similar approach known as the Yeo-Johnson transformation (which can tolerate both positive and negative data).
## Import PowerTransformer
from sklearn.preprocessing import PowerTransformer
## Box-Cox example
bc_trans = PowerTransformer(method = 'box-cox')
bc_trans.fit(train_num[['sale.amount', 'assessed']])
bc_trans_vars = bc_trans.transform(train_num[['sale.amount', 'assessed']])
pd.DataFrame(bc_trans_vars).hist()
## Yeo-Johnson (everything else can remain the same)
yj_trans = PowerTransformer(method = 'yeo-johnson')
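For completeness, a short sketch of applying the Yeo-Johnson transformer is shown below; it follows the same fit/transform pattern used for Box-Cox above.
## Fit and apply the Yeo-Johnson transformation to the same two variables
yj_trans.fit(train_num[['sale.amount', 'assessed']])
yj_trans_vars = yj_trans.transform(train_num[['sale.amount', 'assessed']])
pd.DataFrame(yj_trans_vars).hist()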
In a typical application, you'll iteratively work with the training data to determine what you believe to be the optimal model. For $k$-nearest neighbors, this might involve exploring different types of standardization/scaling, feature engineering/selection/transformation, different values of $k$, different weighting schemes, and different distance measures.
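One simple (and by no means definitive) way to compare a few candidate values of $k$ is to hold out a small validation set from the training data; the sketch below is just one possible approach, and the candidate values of $k$ are arbitrary.
## Hold out part of the training data as a validation set
## (ideally the scaler would also be re-fit within this split; that is skipped here for brevity)
sub_train_X, val_X, sub_train_y, val_y = train_test_split(
    train_XStd_price, train_price, test_size=0.2, random_state=7)
## Compare a few arbitrary choices of k using R-squared on the validation set
for k_try in [5, 15, 25, 45]:
    knn_try = KNeighborsRegressor(n_neighbors=k_try, weights='uniform', p=2)
    knn_try.fit(sub_train_X, sub_train_y)
    print(k_try, knn_try.score(val_X, val_y))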
After identifying a suitable model (or a small number of candidates that you deem equally viable), the final step is to evaluate the chosen model on the test data. This serves as a final check of whether your model is good enough for dissemination/deployment, or whether the project should return to the data collection phase. To properly test the final model, you must replicate every pre-processing step on the test data (ie: repeat the same scaling procedures, transformations, etc.), then apply the model that was fit to the training data to the test set and calculate its performance.
In the evaluation step it is important that you only use fitted objects resulting from the training data, as the training data is all that is available when your model encounters new data after it is deployed. For our previous examples, this means you must first fit the scaler to the training data, then use its transform method to preprocess the test data. Next, you must use the predict or predict_proba method of the model that was fit to the preprocessed training data.
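To make this concrete, below is a hedged sketch of evaluating the regression model on the test set; it assumes the same preprocessing choices used earlier (numeric variables only, StandardScaler), and it uses the score method (R-squared for regression) as just one way to summarize performance.
## Repeat the same pre-processing steps on the test data
test_num = test.select_dtypes("number")
test_price = test_num['sale.amount']
test_X_price = test_num.drop('sale.amount', axis=1)
## Fit the scaler to the TRAINING predictors, then transform the TEST predictors
price_scaler = StandardScaler()
price_scaler.fit(train_X_price)
test_XStd_price = price_scaler.transform(test_X_price)
## Generate predictions and summarize performance on the test set
test_preds = knn_reg.predict(test_XStd_price)
print(knn_reg.score(test_XStd_price, test_price))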
To conclude this lab, you will apply the workflow from the example to a new dataset on your own. The data for this application come from the UC-Irvine Machine Learning Repository; you can read about the Wisconsin Breast Cancer Diagnostic Dataset here.
Each observation consists of the mean values for various cell characteristics of a patient. The goal is to predict the patient's diagnosis (recorded as "Label").
## Read data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
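As an optional first step, you might check the class balance of the outcome; this assumes the diagnosis column is named 'Label' as described above.
## Check the class balance of the diagnosis variable
print(wbc['Label'].value_counts())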
When creating your training-testing split for these data, use random_state = 5.