Lab 4 - Machine Learning Workflow and $k$-nearest neighbors

This lab will cover essential concepts and steps in the machine learning workflow using $k$-nearest neighbors models and tools from the sklearn library.

In [1]:
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# sklearn's KNN functions may emit FutureWarnings, so we'll use the commands below to turn them off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

The lab's examples will use data scraped from the Johnson County assessor between 2005 and 2008. These data were originally used by a University of Iowa faculty member to evaluate whether newly listed homes had list prices below the values suggested by their attributes, thereby representing good deals.

In [2]:
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sale.amount   777 non-null    int64  
 1   sale.date     777 non-null    object 
 2   occupancy     777 non-null    object 
 3   style         777 non-null    object 
 4   built         777 non-null    int64  
 5   bedrooms      777 non-null    int64  
 6   bsmt          777 non-null    object 
 7   ac            777 non-null    object 
 8   attic         777 non-null    object 
 9   area.base     777 non-null    int64  
 10  area.add      777 non-null    int64  
 11  area.bsmt     777 non-null    int64  
 12  area.garage1  777 non-null    int64  
 13  area.garage2  777 non-null    int64  
 14  area.living   777 non-null    int64  
 15  area.lot      777 non-null    int64  
 16  lon           777 non-null    float64
 17  lat           777 non-null    float64
 18  assessed      777 non-null    int64  
dtypes: float64(2), int64(11), object(6)
memory usage: 115.5+ KB

Training vs. Testing

In order to prevent information leakage, it is important to split off a subset of data to be used for validation as soon as possible. All data exploration (visualization, descriptive stats, etc.), preprocessing (rescaling, transformation, dimension reduction, etc.), and modeling (model choice, hyperparameter tuning, etc.) should be done using only the training data.

To create separate training and testing data sets, we'll use the train_test_split() function from the model_selection module of sklearn. The code below reserves a randomly chosen subset containing 20% of the available data for validation.

In [3]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
print(train.shape)
print(test.shape)
(621, 19)
(156, 19)

The argument random_state=7 sets a randomization seed that ensures the same split will occur each time this code is run, thereby allowing all of us to work with the same validation data set. The number 7 is completely arbitrary.

As a final step we can separate the predictors from the outcome sale.amount. In this lab we'll focus exclusively on numeric predictors, and the next lab will cover strategies for including categorical predictors in methods like $k$-nearest neighbors.
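If you're unsure where to start, below is a minimal sketch of one common pattern (shown for the training set only, with placeholder names rather than the names requested in Question 1). It uses pandas' select_dtypes() method to isolate the numeric columns:

## Sketch: keep only numeric columns, then split off the outcome
numeric_train = train.select_dtypes(include='number')
example_X = numeric_train.drop(columns=['sale.amount'])  ## numeric predictors only
example_y = numeric_train['sale.amount']                 ## outcome only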

Question 1: Create objects train_X, train_y, test_X, and test_y such that train_y and test_y contain the outcome variable 'sale.amount' and train_X and test_X contain all numeric predictors from the training and testing data (respectively).

Data Preparation

Now that we've got a separate validation set, we're free to explore the training data and make any modeling choices that we deem appropriate. A good first step is to perform a quick assessment of the numeric predictors using data visualizations and descriptive statistics. Some code to do this is given below:

In [4]:
## Descriptive stats
print(train.describe())

## Distributions
train.hist()
plt.show()
         sale.amount        built    bedrooms    area.base     area.add  \
count     621.000000   621.000000  621.000000   621.000000   621.000000   
mean   180461.838969  1971.291465    3.033816   997.391304    76.402576   
std     90454.257944    32.012907    0.993763   359.994278   163.199173   
min     38250.000000  1873.000000    1.000000   240.000000     0.000000   
25%    130000.000000  1957.000000    2.000000   729.000000     0.000000   
50%    157000.000000  1979.000000    3.000000   947.000000     0.000000   
75%    207500.000000  1998.000000    4.000000  1200.000000    32.000000   
max    815000.000000  2007.000000    7.000000  3440.000000  1192.000000   

         area.bsmt  area.garage1  area.garage2  area.living       area.lot  \
count   621.000000    621.000000    621.000000   621.000000     621.000000   
mean    348.731079    210.342995     79.138486  1349.797101    8937.497585   
std     392.638897    251.577814    162.963058   508.534575    8966.536367   
min       0.000000      0.000000      0.000000   312.000000     137.000000   
25%       0.000000      0.000000      0.000000  1008.000000    5398.000000   
50%     250.000000      0.000000      0.000000  1232.000000    7500.000000   
75%     600.000000    440.000000      0.000000  1526.000000    9902.000000   
max    2500.000000   1065.000000    780.000000  4988.000000  158123.000000   

              lon         lat       assessed  
count  621.000000  621.000000     621.000000  
mean   -91.522192   41.652605  174009.645733  
std      0.033560    0.011406   84160.526541  
min    -91.604721   41.628040   38590.000000  
25%    -91.547302   41.645714  125970.000000  
50%    -91.515492   41.652497  154710.000000  
75%    -91.495742   41.658984  198450.000000  
max    -91.463069   41.690921  778000.000000  

Transformations

Nothing looks too out of the ordinary, but we might notice that some predictors, notably 'area.lot', are extremely right-skewed with some large outliers. We might consider applying a normalizing transformation to this predictor:

In [5]:
## Import PowerTransformer
from sklearn.preprocessing import PowerTransformer

## Box-Cox transformation
bc_trans = PowerTransformer(method = 'box-cox').fit(train[['area.lot']])
bc_trans_arealot = bc_trans.transform(train[['area.lot']])

## Histogram of area.lot after transformation
pd.DataFrame(bc_trans_arealot).hist()
plt.show()

This code demonstrates a Box-Cox transformation using PowerTransformer(). You should note that the example uses the fit() method to fit the transformation to the provided data, which amounts to determining and storing the parameters of the transformation. Next, the transform() method is used to apply these parameters to actually transform the input data.
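For reference, the Box-Cox transformation maps each strictly positive value $x$ to

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$

where the parameter $\lambda$ is estimated from the provided data during the fit() step.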

If the histogram shown above were all that we desired, we could have used the fit_transform() method. However, when applying our final machine learning approach to new data, we'll want to fit the transformation on the training data and then apply it to new data.
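For example, the transformation we fit on the training data can be applied directly to the test homes without re-fitting:

## Apply the training-fitted Box-Cox transformation to the test data
## (the stored lambda is reused; nothing is re-estimated from the test set)
bc_test_arealot = bc_trans.transform(test[['area.lot']])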

Question 2: Consider the use of a Box-Cox transformation as part of a machine learning approach. Identify and briefly explain the problem that arises in each of the following scenarios:

  • Part A: We apply the transformation using fit_transform() to the entire data set before performing the training and testing split. Hint: Think about information leakage.
  • Part B: We apply the transformation to the training data using fit_transform() during our model training/selection process, and our final model applies fit_transform() separately to only the new data we seek to make predictions on. Hint: Think about what might happen if we want to predict on a single new data-point, such as a new listing that just came on the market.

Scaling

Based upon our initial data exploration, you should also have noticed that our predictors exist on very different scales. Thus, we must rescale them before they are used in $k$-nearest neighbors to ensure that variables measured on larger scales do not dominate the distance calculations.
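For reference, standardization converts each value $x$ of a predictor into a $z$-score using the predictor's mean $\bar{x}$ and standard deviation $s$, while min-max scaling maps each value onto the interval $[0, 1]$:

$$z = \frac{x - \bar{x}}{s} \qquad\qquad x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$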

Each of the rescaling approaches discussed in our lecture slides is briefly demonstrated below on a few of the available predictors:

In [6]:
## A few predictors to show
example_pred_vars = ['area.lot', 'area.living', 'area.garage1']

## Standardization via StandardScaler
from sklearn.preprocessing import StandardScaler
ss_trans = StandardScaler().fit(train[example_pred_vars])
ss_trans_out = ss_trans.transform(train[example_pred_vars])
pd.DataFrame(ss_trans_out, columns = example_pred_vars).hist(layout = (1,3))
plt.show()

## Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
rs_trans = MinMaxScaler().fit(train[example_pred_vars])
rs_trans_out = rs_trans.transform(train[example_pred_vars])
pd.DataFrame(rs_trans_out, columns = example_pred_vars).hist(layout = (1,3))
plt.show()

## Two more to know about that aren't shown - robust scaling and max absolute scaling
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MaxAbsScaler

Pipelines

One of the most attractive features of sklearn is the pipeline: a sequence of data handling steps that can easily be applied to a variety of machine learning tasks such as training, validation, and hyperparameter tuning.

The example below demonstrates a simple pipeline that applies a Yeo-Johnson transformation followed by min-max rescaling:

In [7]:
from sklearn.pipeline import Pipeline 

## Set up the steps of the pipeline
my_pipeline = Pipeline([
('transformer',  PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler())
])

## Fit the pipeline
fitted_pipe = my_pipeline.fit(train[example_pred_vars])

## Apply the fitted pipeline to visualize a transformation followed by min-max scaling
pd.DataFrame(fitted_pipe.transform(train[example_pred_vars]), columns = example_pred_vars).hist(layout = (1,3))
plt.show()

While the choices are not consequential in this simple example, notice that the pipeline's steps are given names ("transformer" and "scaler" in our case). These are names that we choose and can subsequently use to refer to specific steps within the pipeline.
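For example, an individual step can be retrieved by its name through the pipeline's named_steps attribute:

## Look up a step by the name we assigned it
print(my_pipeline.named_steps['scaler'])  ## prints: MinMaxScaler()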

Additionally, the intermediate steps of a pipeline must be "transforms", meaning they must have both a fit() and transform() method, while the final step of a pipeline can be a model training step. In the next section we'll add a $k$-nearest neighbors model to this pipeline.

$k$-nearest Neighbors

We're now ready to use a $k$-nearest neighbors model to predict the sale prices of homes. Because this is a regression task, we'll need to use KNeighborsRegressor(). You should note that KNeighborsClassifier() is used for classification tasks.

The code below initializes a KNeighborsRegressor using $k=10$ neighbors, uniform weighting, and Euclidean distance (Minkowski distance with $p=2$):

In [8]:
## Setup knn reg
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=10, weights='uniform', p=2)
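For reference, the Minkowski distance between two points $x$ and $y$ is

$$d(x, y) = \left( \sum_{j} |x_j - y_j|^p \right)^{1/p}$$

which reduces to the familiar Euclidean distance $\sqrt{\sum_j (x_j - y_j)^2}$ when $p = 2$.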

Like many of the other sklearn objects we've worked with, we can then fit this model to the training data, but this time we need to provide both the predictors and the target outcome:

In [9]:
## Fit the model
knn_reg.fit(X = train[example_pred_vars], y = train[['sale.amount']])

## Make predictions for the test data
knn_preds = knn_reg.predict(X = test[example_pred_vars])

In this example you'll notice that we did not need to store a separate fitted model; instead, the fit() method modifies our existing KNeighborsRegressor object. That said, it's not really an issue if you want to store the fitted model in a separate object, especially for the small models we'll be using early in this course.
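Because fit() returns the estimator itself, the two styles are interchangeable; for instance, the assignment below stores a reference to the same fitted object:

## fit() returns the (now fitted) estimator, so this stores a reference
## to the same object rather than a copy
stored_model = knn_reg.fit(X = train[example_pred_vars], y = train[['sale.amount']])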

Next, we can evaluate the performance of this model by looking at the root mean squared error (RMSE) on the test data.
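For the $n$ test observations with observed sale prices $y_i$ and predicted prices $\hat{y}_i$:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$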

In [10]:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_true = test[['sale.amount']], y_pred = knn_preds))
Out[10]:
54017.72706078141

We see that on average this model's predictions are off by around $54,000. Whether this is "good" or "bad" will depend upon the application. However, we can gain some additional perspective by looking at the amount of error present when we predict every new data-point as the average sale amount observed in the training data. This error rate provides a baseline level of performance that can be achieved by a model that did not learn anything meaningful.

In [11]:
## Baseline error
training_avg = np.average(train[['sale.amount']])  ## Average sale amount in training data

## RMSE using the average sale amount
np.sqrt(mean_squared_error(y_true = test[['sale.amount']], y_pred = np.repeat(training_avg, test.shape[0])))
Out[11]:
91453.58553396251

From this we can conclude that our model did appear to learn something, but being off by an average of approximately $54,000 might still reflect an unacceptable amount of error.

Finally, you should recognize that we can perform the same basic steps for a classification task. A simple example is shown below; the major difference is the use of classification accuracy to evaluate the model.

In [12]:
## Set up a classifier
from sklearn.neighbors import KNeighborsClassifier
knn_class = KNeighborsClassifier(n_neighbors=10, weights='uniform', p=2)

## Create a binary outcome - was the assessed value higher than the sale price?
train_y_binary = (train['assessed'] > train['sale.amount']).astype(int)
knn_class.fit(X = train[example_pred_vars], y = train_y_binary)

## Make predictions
knn_class_preds = knn_class.predict(X = test[example_pred_vars])

## Accuracy score
from sklearn.metrics import accuracy_score
test_y_binary = (test['assessed'] > test['sale.amount']).astype(int)
accuracy_score(test_y_binary, knn_class_preds)
Out[12]:
0.6410256410256411

We can go on to find a "baseline" level of accuracy by predicting each new sample as the most common category of the outcome in the training data:

In [13]:
## Baseline accuracy
most_common_y = np.argmax(np.bincount(train_y_binary))
accuracy_score(test_y_binary, np.repeat(most_common_y, test.shape[0]))

Should this baseline accuracy exceed the 64.1% achieved above, we would conclude that the KNN model performs worse than always guessing "under-assessed" (or the integer label 0) for every home in the validation data.

Question 3:

  • Part A: Considering the outcome 'sale.amount' and all numeric predictors available in the Iowa City home sales data, create a machine learning pipeline using a Yeo-Johnson normalizing transformation followed by Min-Max scaling and a KNeighborsRegressor as the final step. Use inverse-distance weighting and Euclidean distance.
  • Part B: Use looping to fit your pipeline for each of the following values of n_neighbors: [3,5,10,15]. Find the RMSE on the test set corresponding to each of these values of n_neighbors, and report the best performing choice. You may ignore any overflow warnings (they simply reflect precision difficulties due to our outcome values being large numbers). Hint: You can set a parameter of one of the pipeline's named steps using the syntax given below.
In [14]:
## Simple pipeline
my_pipeline = Pipeline([
('transformer',  PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler())
])

## Change the method argument of the 'transformer' step to the value 'box-cox'
my_pipeline.set_params(transformer__method = 'box-cox')
Out[14]:
Pipeline(steps=[('transformer', PowerTransformer(method='box-cox')),
                ('scaler', MinMaxScaler())])

Notice how a double underscore is used to access an argument within a named step of the pipeline.
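As a loose sketch of how set_params() can be combined with looping, consider the hypothetical pipeline below, whose final step is named 'knn' (the fitting and evaluation details are left for you to fill in):

## Hypothetical sketch for Question 3, Part B
hint_pipe = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler()),
('knn', KNeighborsRegressor(weights = 'distance'))
])

for k in [3, 5, 10, 15]:
    hint_pipe.set_params(knn__n_neighbors = k)
    ## ... fit hint_pipe on the training data, then compute the test RMSE ...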

Application

To conclude this lab, you will apply the workflow from the example to a new dataset on your own. The data for this application come from the UC-Irvine machine learning repository; you can read more about these data at this link.

Each observation consists of the mean values for various cell characteristics of a patient. The goal is to predict the patient's diagnosis (recorded as "Label").

In [15]:
## Read data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
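Before answering, it may be helpful to inspect the columns and the class balance; a quick sketch is shown below (assuming the outcome column is named 'Label', as described above):

## Inspect column types and the balance of the diagnosis classes
wbc.info()
print(wbc['Label'].value_counts())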

Question 4:

  • Part A: Create a 70% training, 30% testing data split using random_state = 5.
  • Part B: Create histograms showing the distributions of the 10 numeric predictors, then decide upon whether a data preparation pipeline should include transformations, rescaling, or both.
  • Part C: Use the plotting.scatter_matrix() function in pandas (you may click here for documentation) to help you decide if dimension reduction via PCA is appropriate for these data.
  • Part D: Regardless of your decision in Part C, create a scree plot of the variance explained by various numbers of retained components when PCA is applied to the numeric predictors.
  • Part E: Set up a machine learning pipeline that includes the following steps:
    1. A normalizing transformation
    2. Rescaling
    3. Dimension reduction via PCA (selecting the number of components using the results found in Part D)
    4. A $k$-nearest neighbors model using $k=10$, Manhattan distance, and inverse-distance weighting
  • Part F: Find the accuracy of your approach on the test data set.
  • Part G: Compare the accuracy on the test set to the proportion of training data in the majority class. Briefly comment upon whether the model appears to have learned something meaningful.