Lab #2 (part 1) - Introduction to $k$-Nearest Neighbors

This lab covers $k$-nearest neighbors classification and regression implementations in the sklearn library, including the related topics of standardization/scaling, tuning parameters, and train/test splits.

To begin, you'll need the following libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

# Unfortunately, knn functions prompt "future warnings", so the commands below turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

Most examples in this lab will use data scraped from the Johnson County Assessor documenting all recorded home sales in Iowa City, IA between 2005 and 2007. To illustrate various concepts, we will consider two different analysis goals:

  1. Regression - Predicting a home's sale price using its attributes
  2. Classification - Identifying homes that are over-assessed (i.e., assessed > sale.amount)

In [2]:
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

In any application, the very first thing we should do is create a training-testing split. It's critical to make this split as soon as possible. Anything done using the full data set (including exploration, pre-processing, etc.) can lead to data leakage, meaning that information from the testing set influences the final model (either indirectly through your decisions, or directly through the sharing of information).

To make a training-testing split, we'll use the train_test_split function contained in the model_selection module of sklearn:

In [3]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
print(train.shape)
print(test.shape)
(621, 19)
(156, 19)

The code above verifies the split by checking the shape (dimensions) of the resulting objects. Notice that the test object has 156 observations (approximately 20% of the full dataset). Also note that the argument random_state sets a randomization seed used to make this split repeatable (so that you can copy my code and get the same "random" split that I did).
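As a minimal sketch of what random_state provides, the example below splits a small made-up data frame twice with the same seed and checks that both splits select the same rows. The toy data frame and its column names are hypothetical stand-ins, not the Iowa City data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the home sales data
toy = pd.DataFrame({"x": np.arange(50), "y": np.arange(50) * 2})

# Splitting twice with the same seed should select exactly the same rows
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=7)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=7)
same_split = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)

# No row should appear in both the training and testing sets
overlap = set(train_a.index) & set(test_a.index)

print(train_a.shape, test_a.shape)
print(same_split, len(overlap))
```

Changing random_state to a different integer would generally produce a different (but equally valid) split.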

Part 1 - Data Exploration

In the early stages of a new machine learning application it is important to understand the contents of your data. This includes:

  1. Understanding the data source, manner of collection, and available variables
  2. Exploring the distributions of relevant variables for peculiarities or errors
  3. Assessing the prevalence of missing/incomplete data

The collection of the Iowa City home sales data was previously described, the available variables are easily understood by their names, and there are no missing data. So, your only task will be exploring the distributions of the variables.
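A claim such as "there are no missing data" is easy to verify. The sketch below shows one way to count missing values per column and summarize numeric variables with pandas. The toy_train data frame and its columns are hypothetical (it deliberately contains one missing value), so the printed results do not describe the Iowa City data.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with one missing sale amount
toy_train = pd.DataFrame({
    "sale_amount": [150000.0, 210000.0, np.nan, 98000.0],
    "bedrooms": [3, 4, 2, 3],
})

# Count missing values in each column
missing_counts = toy_train.isna().sum()
print(missing_counts)

# Basic numeric summaries (count, mean, std, min, quartiles, max)
print(toy_train.describe())
```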

For now, we'll only use numeric predictors in our models (next week we will discuss strategies for using categorical predictors).

Finally, remember that all steps should be performed only using the training data at this point. Any exploration of the test data can contaminate the entire building process.

In [4]:
## Select only numeric variables
train_num = train.select_dtypes("number") ## Remove later

## Specify target vars
train_price = train_num['sale.amount']
train_over = (train_num['assessed'] > train_num['sale.amount']).astype(int)

Question #1 (Data Exploration):

  • Create a histogram of each numeric variable. Do any of these variables appear improperly coded or contain suspicious values?

Part 2 - Standardization and Scaling

Notice in Question #1 that the available predictors have very different scales (units of measurement), which will cause problems for the distance-based calculations that $k$-nearest neighbors models rely upon.
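To see why differing scales are a problem, consider the sketch below. It uses three made-up homes described by living area (in square feet) and number of bedrooms; the values and the euclid helper are hypothetical and chosen only for illustration. Before standardization, Euclidean distance is driven almost entirely by living area.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical homes: [living area in sq ft, number of bedrooms]
X = np.array([
    [1500.0, 2.0],
    [1520.0, 5.0],
    [1900.0, 2.0],
])

def euclid(a, b):
    """Euclidean distance between two 1-D arrays (illustrative helper)."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Unscaled: the 3-bedroom gap between homes 0 and 1 barely registers
d01_raw = euclid(X[0], X[1])
d02_raw = euclid(X[0], X[2])

# Standardized: both predictors contribute on a comparable scale
X_std = StandardScaler().fit_transform(X)
d01_std = euclid(X_std[0], X_std[1])
d02_std = euclid(X_std[0], X_std[2])

print(d01_raw, d02_raw)
print(d01_std, d02_std)
```

On the raw data, home 2 is many times farther from home 0 than home 1 is; after standardization the two distances are of similar size, because the bedroom difference is no longer swamped by square footage.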

Several functions for standardization and scaling are contained in the preprocessing module of sklearn:

In [5]:
## Import StandardScaler
from sklearn.preprocessing import StandardScaler

## Drop outcomes so they aren't used as predictors
train_X_price = train_num.drop('sale.amount',axis=1)
train_X_over = train_num.drop(['sale.amount', 'assessed'], axis=1)

## Apply standardization
train_XStd_price = StandardScaler().fit_transform(train_X_price)
train_XStd_over = StandardScaler().fit_transform(train_X_over)

## Check results
train_XStd_price[0:3,0:5]
Out[5]:
array([[-1.47845332,  0.97303154, -0.74892431,  0.84993783,  0.51301889],
       [ 1.05381583, -0.0340561 , -0.49593874, -0.46853281, -0.26695141],
       [-0.25921262,  0.97303154,  1.98665591, -0.46853281,  1.21397258]])

The code above takes train_num, which contains all numeric variables in the training data, and drops columns related to the outcome variables used in our previously defined goals.

Next, standardization is applied to each column of the predictor data frame using StandardScaler and its fit_transform method.
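For intuition, the sketch below reproduces StandardScaler by hand on a small made-up array: each column has its mean subtracted and is then divided by its standard deviation (computed with a divisor of $n$, which is NumPy's default). The array values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor matrix with two columns on different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [6.0, 700.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Manual version: subtract the column mean, divide by the column std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0), X_std.std(axis=0))
```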

Implementations of the other scaling methods described in our lecture slides are shown below:

In [6]:
## Robust Scaling
from sklearn.preprocessing import RobustScaler
train_XRob_price = RobustScaler().fit_transform(train_X_price)
train_XRob_over = RobustScaler().fit_transform(train_X_over)

## Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
train_XMM_price = MinMaxScaler().fit_transform(train_X_price)
train_XMM_over = MinMaxScaler().fit_transform(train_X_over)

## Max Absolute Scaling
from sklearn.preprocessing import MaxAbsScaler
train_XMabs_price = MaxAbsScaler().fit_transform(train_X_price)
train_XMabs_over = MaxAbsScaler().fit_transform(train_X_over)
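To compare how these scalers behave, the sketch below applies each one to a single made-up column containing an outlier. The comments describe each scaler's default behavior; the data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, MinMaxScaler, MaxAbsScaler

# One hypothetical predictor with an outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

rob = RobustScaler().fit_transform(x)    # (x - median) / IQR
mm = MinMaxScaler().fit_transform(x)     # (x - min) / (max - min)
mabs = MaxAbsScaler().fit_transform(x)   # x / max(|x|)

print(rob.ravel())
print(mm.ravel())
print(mabs.ravel())
```

Notice that Min-Max and Max-Absolute scaling squeeze the four typical values into a narrow range near zero because of the outlier, while robust scaling keeps them spread out (the median and IQR are barely affected by a single extreme value).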

Shown below are a few scaler methods that you should be aware of:

In [7]:
## Fit the scaler to training data so that its parameters (ie: mean, std, min, max, etc.) can be applied to new data
my_scaler = RobustScaler()
my_scaler.fit(train_X_price)

## Transform data using a previously fit scaler (useful for evaluating a model on the test set)
transformed_data = my_scaler.transform(train_X_price)

## Undo a transformation (i.e., recover the unscaled data from the scaled data)
original_data = my_scaler.inverse_transform(transformed_data)
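The fit / transform / inverse_transform cycle can be sketched on a small made-up example. Here the scaler learns its parameters (median and IQR) from hypothetical training data, applies them to a new observation, and then recovers the original training values from the scaled ones. Note that inverse_transform expects scaled data as its input.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical training data and a single new observation
train_X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
new_X = np.array([[6.0, 0.0]])

# Learn the median and IQR of each column from the training data only
scaler = RobustScaler().fit(train_X)

# Apply the training-data parameters to both old and new data
train_scaled = scaler.transform(train_X)
new_scaled = scaler.transform(new_X)

# Going back: inverse_transform takes *scaled* data and returns the original units
recovered = scaler.inverse_transform(train_scaled)

print(scaler.center_, scaler.scale_)
print(new_scaled)
print(np.allclose(recovered, train_X))
```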

Question #2 (Standardization and Scaling):

  • Part A) Create a histogram of the variable "assessed" after it has been rescaled using Min-Max scaling (use the code given above). How do these results compare to the distribution of the original variable you saw in Question #1? Hint: convert to a pandas DataFrame using the column names from the unscaled data (so that you can more easily identify 'assessed')
  • Part B) Conceptually, why is it not necessary to rescale the outcome prior to performing $k$-nearest neighbors regression? Hint: no coding is needed for this question.

Part 3 - $k$-Nearest Neighbors

At this point we're ready to implement a $k$-nearest neighbors model. As a starting point, we'll use $k = \sqrt{n}$, Euclidean distance ($p=2$), and uniform weighting. Additionally, you should recognize that sklearn has different implementations of $k$-nearest neighbors for classification and regression tasks.
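Before using sklearn's implementations, it may help to see what a uniform-weighted, Euclidean $k$-nearest neighbors regression prediction amounts to. The sketch below computes one prediction by hand on simulated data and compares it with KNeighborsRegressor. The simulated data, the new point x_new, and the variable names are all hypothetical.

```python
import math
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Simulated (hypothetical) training data: 30 observations, 2 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=30)
x_new = np.array([[0.25, -0.5]])

# k is the rounded square root of n
k = round(math.sqrt(X.shape[0]))

# By hand: Euclidean distance to every training point, then average
# the outcomes of the k closest points (uniform weighting)
dists = np.sqrt(((X - x_new) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:k]
manual_pred = y[nearest].mean()

# Same model using sklearn
knn = KNeighborsRegressor(n_neighbors=k, weights="uniform", p=2).fit(X, y)
sk_pred = knn.predict(x_new)[0]

print(k, manual_pred, sk_pred)
```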

In [8]:
## Get n, k
n = train_num.shape[0]
k = round(math.sqrt(n))

## Setup knn classifier
from sklearn.neighbors import KNeighborsClassifier
knn_class = KNeighborsClassifier(n_neighbors=k,weights='uniform',p=2)

## Setup knn reg
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=k,weights='uniform',p=2)

The code above simply sets up an object corresponding to each of our models. We still need to use the fit method of each object to fit the model to the training data:

In [9]:
## Fit the classifier
knn_class.fit(train_XStd_over, train_over)

## Fit the reg
knn_reg.fit(train_XStd_price, train_price)
Out[9]:
KNeighborsRegressor(n_neighbors=25)

Two other methods to be aware of are:

  1. predict - returns predicted labels (classification) or values (regression)
  2. predict_proba - returns predicted probabilities (classification only)

For additional attributes/methods, see the documentation on KNeighborsClassifier and the documentation on KNeighborsRegressor.
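As a small illustration of predict and predict_proba, the sketch below fits a classifier with $k=3$ to six made-up one-dimensional observations and then predicts two new points. The data are hypothetical and chosen so that the nearest neighbors are easy to identify by eye.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-predictor training data with binary labels
X = np.array([[0.0], [0.3], [0.5], [1.0], [1.4], [1.9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3, weights="uniform", p=2).fit(X, y)

# Two new observations to classify
x_new = np.array([[1.5], [0.7]])

labels = clf.predict(x_new)        # predicted class labels
probs = clf.predict_proba(x_new)   # share of the k neighbors in each class

print(labels)
print(probs)
```

With uniform weighting, each predicted probability is simply the fraction of the $k$ nearest neighbors belonging to that class. For the point at 0.7, the three nearest neighbors are 0.5, 1.0, and 0.3, so two of three neighbors belong to class 0.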

Question #3 (kNN):

  • Part A) Fit a $k$-nearest neighbors model with $k=20$, "uniform" weighting, and Euclidean distance ($p=2$) to the data scaled using Max-Absolute scaling for the "over" outcome. Then, calculate and print the classification accuracy of this model on the training data. Hint: you can use sum on a boolean variable to tally the number of True entries.
  • Part B) Repeat the steps in Part A, but using "distance" weighting. How does changing this tuning parameter appear to impact in-sample classification accuracy?
  • Part C) Fit a $k$-nearest neighbors model with $k=20$, "uniform" weighting, and Euclidean distance ($p=2$) to the data scaled using Max-Absolute scaling for the "price" outcome. Then, calculate the root mean squared error of this model. Hint: you'll need to use functions from the math library.
  • Part D) Repeat the steps in Part C, but using Manhattan distance ($p=1$). How does changing this tuning parameter appear to impact in-sample prediction performance?

Part 4 - Transformations

Standardization/scaling is an essential step in most $k$-nearest neighbors applications, with the exception being data that are universally measured on the same scale (such as an image's pixel intensities).

Data transformations are another step that can potentially improve model performance. The preprocessing module contains several transformation options. For now we'll demonstrate the Box-Cox transformation (which requires strictly positive data) and a similar approach known as the Yeo-Johnson transformation (which can tolerate both positive and negative data).

In [10]:
## Import PowerTransformer
from sklearn.preprocessing import PowerTransformer

## Box-Cox example
bc_trans = PowerTransformer(method = 'box-cox')
bc_trans.fit(train_num[['sale.amount', 'assessed']])
bc_trans_vars = bc_trans.transform(train_num[['sale.amount', 'assessed']])
pd.DataFrame(bc_trans_vars).hist()

## Yeo-Johnson (everything else can remain the same)
yj_trans = PowerTransformer(method = 'yeo-johnson')

Part 5 - Evaluation

In a typical application, you'll iteratively work with the training data to determine what you believe to be the optimal model. For $k$-nearest neighbors, this might involve exploring different types of standardization/scaling, feature engineering/selection/transformation, different values of $k$, different weighting schemes, and different distance measures.

After identifying a suitable model (or a small number of candidates that you deem equally viable), the final step is to evaluate the chosen model on the test data. This serves as a final check of whether your model is good enough for dissemination/deployment, or whether the project should return to the data collection phase. To properly test the final model, you must apply every pre-processing step (i.e., the same scaling procedures, transformations, etc.) to the test data, then use the model you fit to the training data to generate predictions for the test set and calculate its performance.

In the evaluation step, it is important that you only use fitted objects resulting from the training data, as the training data are all that is available when your model encounters new data after it is deployed. For our previous examples, this means you must first fit the scaler to the training data, then use its transform method to preprocess the test data. Next, you must use the predict or predict_proba method of the model that was fit to the preprocessed training data.
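The evaluation workflow described above can be sketched end-to-end on simulated data. Everything in this example (the simulated predictors, the outcome, the choice of StandardScaler, and the value of $k$) is a hypothetical stand-in; the point is only the order of operations: fit the scaler on the training data, transform both sets with it, fit the model on the training data, and predict on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Simulated (hypothetical) data: 3 predictors on very different scales
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 50.0 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Fit the scaler using the training data only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)

# Re-use the training-data means and standard deviations on the test set
X_test_std = scaler.transform(X_test)

# Fit on the preprocessed training data, then predict the test set
knn = KNeighborsClassifier(n_neighbors=13).fit(X_train_std, y_train)
test_pred = knn.predict(X_test_std)

# Out-of-sample classification accuracy
test_accuracy = (test_pred == y_test).sum() / len(y_test)
print(test_accuracy)
```

Note that the scaled test set will generally not have column means of exactly zero, because it was centered using the training means. That is expected and is what the deployed model would experience with new data.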

Question #4 (Evaluation):

  • Suppose the best performing model on the training data used robust scaling, $k=22$, "distance" weighting, and Manhattan distance. Evaluate this model, for the classification outcome "over", on the test set. Hint: first, you should fit this model to the proper version of the training data, then you should properly re-scale the test set before inputting it into the fitted model and using the predict method.

Part 6 - Application

To conclude this lab, you will apply the workflow from the example to a new dataset on your own. The data for this application come from the UC Irvine Machine Learning Repository, where you can read about the Wisconsin Breast Cancer Diagnostic Dataset.

Each observation consists of the mean values for various cell characteristics of a patient. The goal is to predict the patient's diagnosis (recorded as "Label").

In [11]:
## Read data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

Question #5 (Application)

  • Part A: Create a 70% training, 30% testing data split using random_state = 5.
  • Part B: Create histograms displaying the distributions of the 10 numeric predictors, then use them to decide upon standardization, scaling, and/or transformation methods.
  • Part C: Explore at least three different $k$-nearest neighbors models using $k \approx \sqrt{n}$, either "uniform" or "distance" weighting, and either Manhattan or Euclidean distance, then decide upon a final model.
  • Part D: Evaluate your final model on the test set.