Lab 2 - Introduction to Scikit-learn

This lab provides an introduction to some of the basic steps involved in applying machine learning methods using Scikit-learn (commonly known as sklearn), a popular machine learning library containing implementations of various classical algorithms.

Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).

To begin, you'll need the following libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# KNN functions may produce FutureWarnings; the command below suppresses them.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

The examples in this lab will use the Iowa City Home Sales data introduced in Lab #1:

In [2]:
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_homes.head(3)
Out[2]:
sale.amount sale.date occupancy style built bedrooms bsmt ac attic area.base area.add area.bsmt area.garage1 area.garage2 area.living area.lot lon lat assessed
0 172500 1/3/2005 116 (Zero Lot Line) 1 Story Frame 1993 3 Full Yes None 1102 0 925 418 0 1102 5520 -91.509129 41.651160 173040
1 90000 1/5/2005 113 (Condominium) 1 Story Frame 2001 2 None Yes None 878 0 0 0 264 878 3718 -91.522964 41.673240 89470
2 168500 1/12/2005 101 (Single-Family / Owner Occupied) Split Foyer Frame 1976 4 Full Yes None 1236 0 700 576 0 1236 8800 -91.482311 41.658488 164230

A fundamental step in machine learning is to split your data into independent training and testing segments. To prevent data leakage, it is important to perform this step as early as possible and refrain from using the test set until we are ready to perform a final evaluation of our model(s) of interest.

To create independent training and testing segments we'll use the train_test_split function from the model_selection module of sklearn:

In [3]:
from sklearn.model_selection import train_test_split
ic_train, ic_test = train_test_split(ic_homes, test_size=0.2, random_state=7)

The code given above randomly assigns 80% of the rows in ic_homes to a DataFrame named ic_train and the remaining 20% to a DataFrame named ic_test. Note that the argument random_state=7 makes this split reproducible, so we will all have the exact same training and testing observations.
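The behavior described above can be checked on a small, made-up DataFrame. This is only a sketch: the `toy` data and all variable names below are hypothetical and are not part of the lab's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows (not the Iowa City data)
toy = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

# Two splits using the same random_state
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=7)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=7)

print(train_a.shape, test_a.shape)             # 8 training rows, 2 testing rows
print(test_a.index.equals(test_b.index))       # same rows selected both times
print(set(train_a.index) & set(test_a.index))  # no row appears in both segments
```

Changing or omitting random_state would generally produce a different split each time the code is run.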

Part 1 - Data Exploration

All new applications of machine learning should begin with a thoughtful investigation of the contents of the training set. In this step, we might seek to address the following questions:

  1. Do any features have substantial amounts of missing data?
  2. Do any features have odd or peculiar distributions that suggest some values were incorrectly recorded?
  3. Should any features be rescaled or transformed before the model is trained? (if the model is sensitive to scale or distributional shape)
  4. Should any features be combined or removed?

Question #1:

  • Part A: Use the isna() and sum() methods to confirm that there are no missing values in the ic_train DataFrame.
  • Part B: Create a histogram of each numeric variable in the ic_train DataFrame. Briefly comment upon whether any variables have unusual values and/or highly skewed distributions.
  • Part C: Create a scatter plot matrix relating all numeric variables in the ic_train DataFrame. Outside of any relationships between predictors and the target variable, sale.amount, are there any pairs of variables that seem so strongly related that they should be combined prior to model training? Note: if you aren't sure of your visual assessment you may consider using the corr() method to look for pairwise correlations above 0.95 or 0.99.

Part 2 - Models in sklearn

In sklearn, models are trained using the fit() method of the corresponding object. Thus, a simple way to train a model is to first instantiate an object of the desired model class, then pass the training data to this object's fit() method. This workflow is demonstrated for a KNN regression model below:

In [4]:
## Import and instantiate
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=10,weights='uniform',p=2)

## Fit using the training data
ic_train_y = ic_train['sale.amount']
ic_train_X = ic_train.select_dtypes("number").drop('sale.amount', axis=1)
knn_model.fit(ic_train_X, ic_train_y)

## Evaluate on the test data
from sklearn.metrics import mean_squared_error
ic_test_X = ic_test.select_dtypes("number").drop('sale.amount', axis=1) ## Create X for the test set
ic_test_y = ic_test['sale.amount']
np.sqrt(mean_squared_error(ic_test_y, knn_model.predict(ic_test_X))) ## Calculate RMSE w/ test data
Out[4]:
24086.67584498422

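This RMSE value summarizes the typical size of the prediction errors on the test set, expressed in the same units as sale.amount (dollars). Because the errors are squared before they are averaged, large errors count more heavily than they would in a simple average of absolute errors.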
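To see how RMSE relates to a plain average of the errors, the sketch below computes RMSE by hand on four made-up sale prices and compares it with the mean absolute error (MAE). The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical true and predicted sale prices (not from the lab's data)
y_true = np.array([100, 200, 300, 400])
y_pred = np.array([110, 190, 300, 440])

# RMSE by hand: square the errors, average them, then take the square root
errors = y_true - y_pred
manual_rmse = np.sqrt(np.mean(errors ** 2))

# The same quantity using sklearn, plus MAE for comparison
sk_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

print(manual_rmse, sk_rmse)  # both are sqrt(450), roughly 21.2
print(mae)                   # 15.0
```

The single error of 40 pulls RMSE well above MAE, which is why RMSE is more sensitive to a few badly predicted homes.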

A few additional comments:

  1. If your background is in R/statistical modeling, it might seem mysterious how knn_model was able to make predictions when we seemingly never stored the fitted model anywhere. sklearn uses an object-oriented programming paradigm that allows methods like fit() to modify existing objects in place, so the fitted model is stored inside knn_model itself.
  2. We only considered numeric predictors in this example, but we'll soon learn how to include categorical predictors using one-hot encoding.
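The first comment above can be checked directly. In the sketch below (toy data and hypothetical variable names), fit() returns the very same object it was called on, and that object now carries information learned from the training data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: one predictor, four observations
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0.0, 1.0, 2.0, 3.0])

toy_model = KNeighborsRegressor(n_neighbors=2)
returned = toy_model.fit(toy_X, toy_y)

print(returned is toy_model)      # True: fit() modified toy_model in place
print(toy_model.n_samples_fit_)   # 4 training observations were stored
print(toy_model.predict([[1.4]])) # [1.5], the mean of the two nearest targets
```

Because fit() returns the object itself, calls such as KNeighborsRegressor().fit(X, y) can also be written on a single line.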

Question #2:

  • Part A: Fit a decision tree regressor with a max_depth of 3 using the same training and testing sets created in the previous example. Print the RMSE of this model for the test data.
  • Part B: Use the plot_tree() function from the tree module of sklearn to display the decision tree you trained in Part A.
  • Part C: Use the decision tree diagram you created in Part B to predict the sale price of a home with an assessed value of $185,000. Provide a brief explanation describing how you used the tree diagram to arrive at this predicted value.
  • Part D: Use a for loop to track the train set RMSE and the test set RMSE for different values of max_depth ranging from 1 to 10. Construct a line chart showing RMSE as a function of max_depth using separate lines to delineate performances from the training and testing sets.

Part 3 - Standardization and Scaling

As we discussed during lecture, some methods like KNN are sensitive to the measurement scale of the predictor variables, while others, such as decision trees, will produce the same results regardless of the scale used.

The preprocessing module of sklearn provides many different options for re-scaling, which are described in the module's official documentation.

Regardless of the re-scaling approach used, the main methods to be aware of are:

  • fit() - computes the necessary statistics from the provided data to be used later for re-scaling
  • transform() - re-scales the provided data using parameters estimated when the scaler was 'fit'
  • fit_transform() - fits to the provided data, then transforms it

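Note that fit_transform() is a convenient shortcut for fitting and transforming the same data, such as the training data during exploratory analysis. For model evaluation, always call fit() using only the training data, then use transform() to apply that same fitted scaler to both the training data and the test data.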

The example below demonstrates how to apply min-max scaling to a subset of predictors:

In [5]:
## Variables to re-scale
example_predictors = ['area.lot', 'area.living', 'area.garage1']

## Import scaler function and re-scale training data
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler().fit(ic_train[example_predictors]) # Fit the scaler
train_area_lot_scaled = mm_scaler.transform(ic_train[example_predictors]) # Transform training data

## Show the original and re-scaled data
ic_train[example_predictors].hist(layout = (1,3))
pd.DataFrame(train_area_lot_scaled, columns = example_predictors).hist(layout = (1,3))
plt.show()
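To make the distinction between fit() and transform() concrete, here is a minimal sketch using made-up numbers rather than the Iowa City data. The scaler learns its minimum and maximum from the toy training values only, then re-uses them on the toy test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: a single predictor
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[5.0], [20.0]])

# fit() stores the training minimum and maximum
toy_scaler = MinMaxScaler().fit(toy_train)
print(toy_scaler.data_min_, toy_scaler.data_max_)  # [0.] [10.]

# transform() applies those stored values to any data it is given
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)
print(toy_train_scaled.ravel())  # [0.  0.5 1. ]
print(toy_test_scaled.ravel())   # [0.5 2. ]
```

Notice that a test value can land outside the 0 to 1 range when it is more extreme than anything in the training data. This is expected, since the scaler never saw the test data.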

Question #3:

  • Part A: Consider using fit_transform() to apply a re-scaling procedure to all of our data before using train_test_split(). Would you recommend this approach? Briefly explain your reasoning.
  • Part B: Use KNN regression to predict sale.amount from all other numeric variables, with standardization (via StandardScaler()) as a pre-processing step. Your KNN model should use $k=5$ neighbors, uniform weighting, and Euclidean distance. Report the RMSE on the same test set (ic_test) that was used in Part 2. Did re-scaling lead to better performance on the test data?

Part 4 - Data Transformations

In algorithms like KNN, re-scaling is generally considered to be an essential step, as some predictors will arbitrarily be given more weight in distance calculations when the predictors are not all on comparable scales.
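The effect of scale on a distance calculation can be illustrated with a small sketch. The three homes below are made up, and the helper function squared_share is hypothetical. It reports the fraction of the squared Euclidean distance between two homes that comes from each feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical homes: [lot area in square feet, number of bedrooms]
homes = np.array([[5000.0, 2.0],
                  [5100.0, 5.0],
                  [8000.0, 2.0]])

def squared_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq_diff = (a - b) ** 2
    return sq_diff / sq_diff.sum()

# Raw scale: homes 0 and 1 differ by 100 sq ft and 3 bedrooms
raw_share = squared_share(homes[0], homes[1])

# Standardized scale: each feature has mean 0 and standard deviation 1
homes_std = StandardScaler().fit_transform(homes)
std_share = squared_share(homes_std[0], homes_std[1])

print(raw_share)  # lot area accounts for nearly all of the distance
print(std_share)  # bedrooms now account for nearly all of the distance
```

On the raw scale a 3-bedroom difference is almost invisible next to a 100 square foot difference, simply because lot area is recorded in larger numbers.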

Re-scaling preserves the distributional shape of the data, but another related pre-processing step, data transformation, intentionally changes the shape, often to diminish the impact of outliers. Depending on the context, data transformations may be applied to predictors, the target variable, or both in an effort to improve model performance.

In addition to simple procedures like log-transformations, you should be aware of the following pre-processing transformations:

  • Power transformations (Yeo-Johnson in particular) - Power transforms produce output distributions that look "approximately Normal" by reducing skew and bringing outliers closer to the center.
  • Quantile transformations - Quantile transformations map data to a uniform distribution, which can help stabilize variance and make features more comparable.
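For comparison with the transformers listed above, here is a minimal sketch of the simpler log-transformation applied to simulated, right-skewed values. The simulated prices and all variable names are assumptions made for illustration and are not drawn from the lab's data.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed "prices" (hypothetical, log-normally distributed)
rng = np.random.default_rng(0)
sim_prices = pd.Series(rng.lognormal(mean=12, sigma=0.8, size=1000))

# A log-transformation compresses the long right tail
log_prices = np.log(sim_prices)

skew_before = sim_prices.skew()
skew_after = log_prices.skew()
print(skew_before, skew_after)  # skewness should shrink toward 0 after the log
```

A log-transformation only works for strictly positive values, which is one reason the Yeo-Johnson power transformation is often preferred when zeros or negative values are present.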

The basic steps involved in using one of these data transformations are the same as those we saw when re-scaling. A brief example is shown below:

In [6]:
## Example - Power transformation
from sklearn.preprocessing import PowerTransformer
vars_to_transform = ['sale.amount', 'assessed']
yj_trans = PowerTransformer(method = 'yeo-johnson').fit(ic_train[vars_to_transform])
yj_trans_output = yj_trans.transform(ic_train[vars_to_transform])

## Display the results
pd.DataFrame(ic_train[vars_to_transform]).hist()
pd.DataFrame(yj_trans_output, columns = vars_to_transform).hist()
plt.show()
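A quantile transformation follows the same fit() then transform() pattern. The sketch below uses simulated data and hypothetical variable names, and it sets n_quantiles=100 only so that the number of quantiles does not exceed the number of simulated observations.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Simulated right-skewed feature (hypothetical, not the Iowa City data)
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=50_000, size=(500, 1))

# Fit the transformer, then map the values onto a uniform distribution
qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform',
                         random_state=0).fit(skewed)
uniform_like = qt.transform(skewed)

print(uniform_like.min(), uniform_like.max())  # values lie between 0 and 1
print(np.median(uniform_like))                 # close to 0.5
```

Because the output depends only on the rank order of the inputs, extreme values end up near 0 or 1 instead of far from the rest of the data.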

Question #4: Without relying upon any empirical results, would you expect applying a quantile transformation on all input features to improve the performance of a decision tree model? Explain your reasoning.

Question #5: Create a new binary target variable named over_assessed that takes on a value of "yes" if a home's assessed value is greater than its sale price and "no" otherwise. Then create a KNN model to predict over_assessed from all other numeric features (excluding sale price and assessed value) using $k=10$ neighbors, distance weighting, and Euclidean distance, with QuantileTransformer() as a pre-processing step. Report the accuracy of this model on the test set.

Part 5 - One-hot Encoding

So far, we've ignored the categorical predictors in our data for simplicity, but categorical predictors can be included in most machine learning models using an approach known as one-hot encoding (or "dummy variables" in contexts like statistical modeling).

The OneHotEncoder() function from the preprocessing module of sklearn can be used to one-hot encode categorical data. An example involving two categorical features is shown below:

In [7]:
from sklearn.preprocessing import OneHotEncoder
example_data = pd.DataFrame([['A', 'Male'], ['A', 'Female'], ['B', 'Female'], ['C', 'Male']])
oh_trans = OneHotEncoder(sparse_output=False).fit(example_data)
print(oh_trans.transform(example_data))
[[1. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]

Notice how OneHotEncoder() represents the first variable, which has the categorical values of 'A', 'B', and 'C' using three binary variables, the first of which encodes the category 'A', the second encodes 'B', and the third encodes 'C'.

Similarly, the second variable, which has the categorical values 'Male' and 'Female', is represented by two binary variables. Because OneHotEncoder() orders the categories of each variable alphabetically, the first of these encodes 'Female' and the second encodes 'Male'.

A few additional things to note:

  1. OneHotEncoder() returns a "sparse matrix" object by default, which isn't helpful unless we're using a machine learning model that can benefit from sparse matrix input. The argument sparse_output=False (named sparse in sklearn versions before 1.2) prevents this return behavior.
  2. We typically will fit an encoder to training data and use that fit to transform the test data. However, it is possible that the test data contains categories that were not present in the training data. We can address this using the argument handle_unknown='ignore', which instructs OneHotEncoder() to ignore (not represent) any new categories that weren't present in the training data.
  3. OneHotEncoder() will treat any numeric columns you give it as categorical, which can be a major problem because numeric variables typically have lots of unique values. You should be careful to avoid one-hot encoding numeric variables.
  4. By default, OneHotEncoder() creates one binary column for every category, so the columns belonging to each feature always sum to 1. The resulting matrix is therefore not full rank (a linear algebra term meaning that every column is linearly independent of the others) once it contains two or more encoded features or is paired with an intercept. If we wanted a representation without these dependencies, we could use the argument drop = 'first' to drop the first dummy variable for each feature. You might note that this is the behavior of the model.matrix() function in R.

Question #6: Starting with your model from Question #5, incorporate one-hot encoded predictors representing the original variables "style", "ac", and "bsmt" in addition to the numeric predictors you were already using. You do not need to worry about re-scaling the new one-hot features as they already exist on a 0 to 1 scale. Use the same hyperparameters that were specified in Question #5 and report the performance of this new approach on the test set. Did adding these additional features improve the model's performance?