This lab will cover essential concepts and steps in the machine learning workflow using $k$-nearest neighbors models and tools from the sklearn library.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# KNN will warn you of future updates, so we'll use the commands below to turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The lab's examples will use data scraped from the Johnson County assessor between 2005 and 2008. These data were originally used by a University of Iowa faculty member to evaluate whether newly listed homes had a list price below what their attributes suggested they should be worth, thereby representing a good deal.
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sale.amount   777 non-null    int64
 1   sale.date     777 non-null    object
 2   occupancy     777 non-null    object
 3   style         777 non-null    object
 4   built         777 non-null    int64
 5   bedrooms      777 non-null    int64
 6   bsmt          777 non-null    object
 7   ac            777 non-null    object
 8   attic         777 non-null    object
 9   area.base     777 non-null    int64
 10  area.add      777 non-null    int64
 11  area.bsmt     777 non-null    int64
 12  area.garage1  777 non-null    int64
 13  area.garage2  777 non-null    int64
 14  area.living   777 non-null    int64
 15  area.lot      777 non-null    int64
 16  lon           777 non-null    float64
 17  lat           777 non-null    float64
 18  assessed      777 non-null    int64
dtypes: float64(2), int64(11), object(6)
memory usage: 115.5+ KB
In order to prevent information leakage, it is important to split off a subset of data to be used for validation as soon as possible. All data exploration (visualization, descriptive stats, etc.), preprocessing (rescaling, transformation, dimension reduction, etc.), and modeling (model choice, hyperparameter tuning, etc.) should be done using only the training data.
To create separate training and testing data sets, we'll use the train_test_split() function from the model_selection module of sklearn. The code below reserves a randomly chosen subset containing 20% of the available data for validation.
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
print(train.shape)
print(test.shape)
(621, 19)
(156, 19)
The argument random_state=7 sets a randomization seed that ensures the same split will occur each time this code is run, thereby allowing all of us to work with the same validation data set. The number 7 is completely arbitrary.
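As a minimal sketch (the objects t1 and t2 below are illustrative names, not part of the lab's required code), we can check that re-running the split with the same seed reproduces exactly the same rows:
## Optional check: the same random_state reproduces the identical split
t1, _ = train_test_split(ic, test_size=0.2, random_state=7)
t2, _ = train_test_split(ic, test_size=0.2, random_state=7)
print(t1.index.equals(t2.index))  ## Should print True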
As a final step we can separate the predictors from the outcome, sale.amount. In this lab we'll focus exclusively on numeric predictors, and the next lab will cover strategies for including categorical predictors in methods like $k$-nearest neighbors.
Question 1: Create objects train_X, train_y, test_X, and test_y such that train_y and test_y contain the outcome variable 'sale.amount' and train_X and test_X contain all numeric predictors from the training and testing data (respectively).
Now that we've got a separate validation set, we're free to explore the training data and make any modeling choices that we deem appropriate. A good first step is to perform a quick assessment of the numeric predictors using data visualizations and descriptive statistics. Some code to do this is given below:
## Descriptive stats
print(train.describe())
## Distributions
train.hist()
plt.show()
         sale.amount        built    bedrooms    area.base     area.add  \
count     621.000000   621.000000  621.000000   621.000000   621.000000
mean   180461.838969  1971.291465    3.033816   997.391304    76.402576
std     90454.257944    32.012907    0.993763   359.994278   163.199173
min     38250.000000  1873.000000    1.000000   240.000000     0.000000
25%    130000.000000  1957.000000    2.000000   729.000000     0.000000
50%    157000.000000  1979.000000    3.000000   947.000000     0.000000
75%    207500.000000  1998.000000    4.000000  1200.000000    32.000000
max    815000.000000  2007.000000    7.000000  3440.000000  1192.000000

         area.bsmt  area.garage1  area.garage2  area.living       area.lot  \
count   621.000000    621.000000    621.000000   621.000000     621.000000
mean    348.731079    210.342995     79.138486  1349.797101    8937.497585
std     392.638897    251.577814    162.963058   508.534575    8966.536367
min       0.000000      0.000000      0.000000   312.000000     137.000000
25%       0.000000      0.000000      0.000000  1008.000000    5398.000000
50%     250.000000      0.000000      0.000000  1232.000000    7500.000000
75%     600.000000    440.000000      0.000000  1526.000000    9902.000000
max    2500.000000   1065.000000    780.000000  4988.000000  158123.000000

              lon         lat       assessed
count  621.000000  621.000000     621.000000
mean   -91.522192   41.652605  174009.645733
std      0.033560    0.011406   84160.526541
min    -91.604721   41.628040   38590.000000
25%    -91.547302   41.645714  125970.000000
50%    -91.515492   41.652497  154710.000000
75%    -91.495742   41.658984  198450.000000
max    -91.463069   41.690921  778000.000000
Nothing looks too out of the ordinary, but we might notice that some predictors, notably 'area.lot', are extremely right-skewed with some large outliers. We might consider applying a normalizing transformation to this predictor:
## Import PowerTransformer
from sklearn.preprocessing import PowerTransformer
## Box-Cox transformation
bc_trans = PowerTransformer(method = 'box-cox').fit(train[['area.lot']])
bc_trans_arealot = bc_trans.transform(train[['area.lot']])
## Histogram of area.lot after transformation
pd.DataFrame(bc_trans_arealot).hist()
plt.show()
This code demonstrates a Box-Cox transformation using PowerTransformer(). You should note that the example uses the fit() method to fit the transformation to the provided data, which amounts to determining and storing the parameters of the transformation. Next, the transform() method is used to apply these parameters to actually transform the input data.
If the histogram shown above were all that we desired, we could have used the fit_transform() method. However, when applying our final machine learning approach to new data, we'll want to be able to fit the transformation on the training data and then apply it to new data.
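For instance, a minimal sketch of this pattern (the object bc_trans_arealot_test is just an illustrative name) applies the transformation that was fit above on the training data to the held-out test rows:
## Apply the transformation fitted on the training data to the test data
bc_trans_arealot_test = bc_trans.transform(test[['area.lot']])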
Question 2: Consider the use of a Box-Cox transformation as part of a machine learning approach. Identify and briefly explain the problem that arises in each of the following scenarios:

1. Applying fit_transform() to the entire data set before performing the training and testing split. Hint: Think about information leakage.
2. Using fit_transform() during our model training/selection process, and then having our final model apply fit_transform() separately to only the new data we seek to make predictions on. Hint: Think about what might happen if we want to predict on a single new data-point, such as a new listing that just came on the market.

Based upon our initial data exploration, you should also have noticed that our predictors exist on very different scales. Thus, we must rescale them before they are used in $k$-nearest neighbors to ensure the variables with larger units do not dominate.
Each of the rescaling approaches discussed in our lecture slides is briefly demonstrated below on a few of the available predictors:
## A few predictors to show
example_pred_vars = ['area.lot', 'area.living', 'area.garage1']
## Standardization via StandardScaler
from sklearn.preprocessing import StandardScaler
ss_trans = StandardScaler().fit(train[example_pred_vars])
ss_trans_out = ss_trans.transform(train[example_pred_vars])
pd.DataFrame(ss_trans_out, columns = example_pred_vars).hist(layout = (1,3))
plt.show()
## Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
rs_trans = MinMaxScaler().fit(train[example_pred_vars])
rs_trans_out = rs_trans.transform(train[example_pred_vars])
pd.DataFrame(rs_trans_out, columns = example_pred_vars).hist(layout = (1,3))
plt.show()
## Two more to know about that aren't shown - robust scaling and max absolute scaling
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MaxAbsScaler
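These two scalers follow the same fit() and transform() pattern as the others; a minimal sketch (the object names rob_trans and ma_trans are illustrative, not part of the original lab code):
## Robust scaling (centers on the median and scales by the interquartile range)
rob_trans = RobustScaler().fit(train[example_pred_vars])
rob_trans_out = rob_trans.transform(train[example_pred_vars])
## Max absolute scaling (divides each column by its maximum absolute value)
ma_trans = MaxAbsScaler().fit(train[example_pred_vars])
ma_trans_out = ma_trans.transform(train[example_pred_vars])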
One of the most attractive features of sklearn is its pipelines: sequences of data handling steps that can easily be applied to a variety of machine learning tasks such as training, validation, hyperparameter tuning, and more.
The example below demonstrates a simple pipeline that applies a Yeo-Johnson transformation (a variant of Box-Cox that also accommodates zero and negative values) followed by min-max rescaling:
from sklearn.pipeline import Pipeline
## Set up the steps of the pipeline
my_pipeline = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler())
])
## Fit the pipeline
fitted_pipe = my_pipeline.fit(train[example_pred_vars])
## Apply the fitted pipeline to visualize a transformation followed by min-max scaling
pd.DataFrame(fitted_pipe.transform(train[example_pred_vars]), columns = example_pred_vars).hist(layout = (1,3))
plt.show()
While the names are not consequential in this simple example, notice that the pipeline's steps are named ("transformer" and "scaler" in our case). These are names that we choose and can subsequently use to refer to specific steps within the pipeline.
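As a small illustration of referring to a step by its name (the printed result is simply the MinMaxScaler object we supplied when building the pipeline):
## Access an individual step of the pipeline by its name
print(my_pipeline.named_steps['scaler'])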
Additionally, the intermediate steps of a pipeline must be "transforms", meaning they must have both a fit() and a transform() method, while the final step of a pipeline can be a model training step. In the next section we'll add a $k$-nearest neighbors model to this pipeline.
We're now ready to use a $k$-nearest neighbors model to predict the sale prices of homes. Because this is a regression task, we'll need to use KNeighborsRegressor(). You should note that KNeighborsClassifier() is used for classification tasks.
The code below initializes a KNeighborsRegressor using $k=10$ neighbors, uniform weighting, and Euclidean distance (Minkowski distance with $p=2$):
## Setup knn reg
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=10, weights='uniform', p=2)
Like many of the other sklearn objects we've worked with, we can then fit this model to the training data, but this time we need to provide both the predictors and the target outcome:
## Fit the model
knn_reg.fit(X = train[example_pred_vars], y = train[['sale.amount']])
## Make predictions for the test data
knn_preds = knn_reg.predict(X = test[example_pred_vars])
In this example you'll notice that we did not need to store a separate fitted model; instead, the fit() method modifies our existing KNeighborsRegressor object. That said, it's not really an issue if you want to store the fitted model in a separate object, especially for the small models we'll be using early on in this course.
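For instance, since fit() returns the fitted estimator, the following sketch is equivalent to the code above (the names fitted_knn and knn_preds_alt are illustrative):
## Equivalent sketch: fit() returns the estimator, so it can be stored under a new name
fitted_knn = knn_reg.fit(X = train[example_pred_vars], y = train[['sale.amount']])
knn_preds_alt = fitted_knn.predict(X = test[example_pred_vars])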
Next, we can evaluate the performance of this model by looking at the $RMSE$ on the test data:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_true = test[['sale.amount']], y_pred = knn_preds))
54017.72706078141
We see that on average this model's predictions are off by around $54,000. Whether this is "good" or "bad" will depend upon the application. However, we can gain some additional perspective by looking at the amount of error present when we predict every new data-point as the average sale amount observed in the training data. This error rate provides a baseline level of performance that can be achieved by a model that did not learn anything meaningful.
## Baseline error
training_avg = np.average(train[['sale.amount']]) ## Average sale amount in training data
## RMSE using the average sale amount
np.sqrt(mean_squared_error(y_true = test[['sale.amount']], y_pred = np.repeat(training_avg, test.shape[0])))
91453.58553396251
From this we can conclude that our model did appear to learn something, but being off by an average of approximately $54,000 might still reflect an unacceptable amount of error.
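One way to quantify this comparison (a small sketch, not required by the lab) is to express the model's RMSE as a fraction of the baseline RMSE using the quantities computed above:
## Model RMSE relative to the baseline RMSE (smaller is better)
knn_rmse = np.sqrt(mean_squared_error(y_true = test[['sale.amount']], y_pred = knn_preds))
baseline_rmse = np.sqrt(mean_squared_error(y_true = test[['sale.amount']], y_pred = np.repeat(training_avg, test.shape[0])))
print(knn_rmse / baseline_rmse)  ## Roughly 0.59 for this split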
Finally, you should recognize that we can perform the same basic steps for a classification task. A simple example is shown below; the major difference is the use of classification accuracy to evaluate the model.
## Set up a classifier
from sklearn.neighbors import KNeighborsClassifier
knn_class = KNeighborsClassifier(n_neighbors=10, weights='uniform', p=2)
## Create a binary outcome - was the assessed value higher than the sale price?
train_y_binary = (train['assessed'] > train['sale.amount']).astype(int)
knn_class.fit(X = train[example_pred_vars], y = train_y_binary)
## Make predictions
knn_class_preds = knn_class.predict(X = test[example_pred_vars])
## Accuracy score
from sklearn.metrics import accuracy_score
test_y_binary = (test['assessed'] > test['sale.amount']).astype(int)
accuracy_score(test_y_binary, knn_class_preds)
0.6410256410256411
We can go on to find a "baseline" level of accuracy by predicting each new sample as the most common category of the outcome in the training data:
## Baseline accuracy
most_common_y = np.argmax(np.bincount(train_y_binary))
accuracy_score(test_y_binary, np.repeat(most_common_y, test.shape[0]))
0.8461538461538461
Here we can see that this KNN model performs worse than always guessing "under-assessed" (or the integer label 0) for every home in the validation data.
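Note that the predictors in this quick classification example were used on their raw scales. As a sketch (not part of the original lab's code), we could instead reuse the MinMaxScaler fitted earlier, rs_trans, to rescale the inputs before fitting the classifier; the name knn_class_scaled is illustrative:
## Fit the classifier on min-max rescaled versions of the same predictors
knn_class_scaled = KNeighborsClassifier(n_neighbors=10, weights='uniform', p=2)
knn_class_scaled.fit(X = rs_trans.transform(train[example_pred_vars]), y = train_y_binary)
knn_class_scaled_preds = knn_class_scaled.predict(X = rs_trans.transform(test[example_pred_vars]))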
Question 3:

1. Using the outcome 'sale.amount' and all numeric predictors available in the Iowa City home sales data, create a machine learning pipeline using a Yeo-Johnson normalizing transformation followed by Min-Max scaling and a KNeighborsRegressor as the final step. Use inverse-distance weighting and Euclidean distance.
2. Consider the following values of n_neighbors: [3,5,10,15]. Find the RMSE on the test set corresponding to each of these values of n_neighbors, and report the best performing choice. You may ignore any overflow warnings (they simply reflect precision difficulties due to our outcome values being large numbers). Hint: You can set a parameter of one of the pipeline's named steps using the syntax given below.

## Simple pipeline
my_pipeline = Pipeline([
('transformer', PowerTransformer(method = 'yeo-johnson')),
('scaler', MinMaxScaler())
])
## Change the method argument of the 'transformer' step to the value 'box-cox'
my_pipeline.set_params(transformer__method = 'box-cox')
Pipeline(steps=[('transformer', PowerTransformer(method='box-cox')), ('scaler', MinMaxScaler())])
Notice how a double underscore is used to access an argument within a named step of the pipeline.
To conclude this lab, you will apply the workflow from the example to a new data set on your own. The data for this application come from the UC-Irvine machine learning repository; you can read more about these data at this link.
Each observation consists of the mean values for various cell characteristics of a patient. The goal is to predict the patient's diagnosis (recorded as "Label").
## Read data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
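A sensible first step, mirroring the workflow above (this check is an addition, not part of the original lab code), is to inspect the available columns and their types:
## Inspect the structure of the breast cancer data
wbc.info()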
Question 4: Use the plotting.scatter_matrix() function in pandas (you may click here for documentation) to help you decide if dimension reduction via PCA is appropriate for these data.