Lab 5 - Regression in Machine Learning¶

In this lab we'll expand our modeling toolbox by exploring regression methods and how they compare with the KNN and decision tree models we've previously been relying upon.

Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).

To begin, you'll need the following libraries:

In [1]:
## Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

Throughout this lab we'll rely upon a few simulated data sets so that we can better understand the behavior of certain methods for specific types of patterns:

  1. "Circles" - two predictive features, $X_1$ and $X_2$, and a binary outcome where the positive class is most likely to occur in a circular region of intermediate values in both predictors.
  2. "Friedman" - a well-established data generation model for regression tasks involving non-linearity: $Y = 10sin(\pi X_1 X_2) + 20(X_3 - 0.5)^2 + 10X_4 + 5X_5 + \text{Noise}$

These data sets are created and displayed (in the $X_1$ and $X_2$ dimensions) below:

In [2]:
## Create data sets
from sklearn.datasets import make_circles, make_friedman1
circles_X, circles_y = make_circles(n_samples=200, shuffle=True, noise=0.2, random_state=11, factor=0.3)
friedman_X, friedman_y = make_friedman1(n_samples=200, n_features=10, noise=0.2, random_state=11)

## Train-test splits
from sklearn.model_selection import train_test_split
circles_X_train, circles_X_test, circles_y_train, circles_y_test = train_test_split(circles_X, circles_y, test_size=0.2, random_state=0)
friedman_X_train, friedman_X_test, friedman_y_train, friedman_y_test = train_test_split(friedman_X, friedman_y, test_size=0.2, random_state=0)

## Display training data (Circles)
plt.scatter(circles_X_train[:,0], circles_X_train[:,1], c = circles_y_train)
plt.title("Circles")
plt.show()

## Display training data (Friedman)
plt.scatter(friedman_X_train[:,0], friedman_X_train[:,1], c = friedman_y_train)
plt.title("Friedman")
plt.colorbar(label='Y Value')
plt.show()

Part 1 - Basic Regression Models for each Dataset¶

The linear_model module of sklearn contains the functions needed to fit linear and logistic regression models. We will begin by including all available predictors in our models without any feature expansion steps:

In [3]:
from sklearn.linear_model import LogisticRegression
fitted_model = LogisticRegression(penalty='none').fit(circles_X_train, circles_y_train)

The argument penalty = 'none' is included because sklearn applies regularization to logistic regression weights by default. We will discuss the topic of regularization later in the course, but for now we'll avoid it.
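If you're running a newer version of scikit-learn (the string 'none' was deprecated in version 1.2 and removed in 1.4), pass the Python value None instead; an equivalent call is shown below:

In [ ]:
## Equivalent call for scikit-learn >= 1.2, where penalty=None replaces penalty='none'
fitted_model = LogisticRegression(penalty=None).fit(circles_X_train, circles_y_train)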

To see what the predicted probabilities of our fitted model look like, we'll create a dense grid of $X_1$ and $X_2$ values and visualize the model's predictions over this grid:

In [4]:
## Create the grid
X_grid = np.array(np.meshgrid(np.linspace(-1.5,1.5,100), np.linspace(-1.5,1.5,100))).reshape(2, 100*100).T

## Get predicted probs
grid_preds = fitted_model.predict_proba(X_grid)

## Plot the prediction surface
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.title("Logistic Regression on 'Circles' Data")
plt.colorbar(label='Predicted Prob')
plt.show()

The logistic regression model fails miserably, as it imposes a monotonic relationship between each predictor and the predicted probability of the positive class.
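One way to see this is to inspect the fitted weights: the model's log-odds is simply $\beta_0 + \beta_1 X_1 + \beta_2 X_2$, so the predicted probability can only rise or fall monotonically along each axis. A quick check, assuming the fitted_model object from above is still in scope:

In [ ]:
## The log-odds is linear in X1 and X2, so probabilities change monotonically
print(fitted_model.intercept_, fitted_model.coef_)

## Test-set accuracy of this unexpanded model (useful as a baseline later)
print(fitted_model.score(circles_X_test, circles_y_test))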

Question #1: For this question you'll fit a basic linear regression model to the Friedman data and create a prediction surface in the $X_1$ and $X_2$ dimensions.

  • Part A: Import the LinearRegression function from the linear_model module and fit a linear regression model to the Friedman training data.
  • Part B: Similar to this section's example, create a two-dimensional grid of 100 linearly spaced values between 0 and 1 in each dimension; this will represent values of $X_1$ and $X_2$. Next, create an array of zeros to use as values for the other features present in the Friedman data set. Use np.hstack to combine your grid and array of zeros into an object that can be passed to the predict() method of your fitted model from Part A.
  • Part C: Similar to the example from this section, plot the prediction surface of your fitted model over the grid of $X_1$ and $X_2$ values created in Part B. The lower left corner of this plot should be dark blue, and the upper right should be bright yellow.
  • Part D: Consider the true relationship between $X_1$, $X_2$, and $Y$ in the Friedman data, as given by the data generating function and the plot in the lab's introduction. How do you feel about the bias and variance of this model when it comes to representing this pattern?
  • Part E: Create another prediction surface over your grid of $X_1$ and $X_2$ values using a decision tree regressor with a maximum depth of 4. How do you feel about the bias and variance of this model when it comes to representing the true relationship between $X_1$, $X_2$, and $Y$ in the Friedman data? Compared to your linear regression model, would you expect this decision tree model to have higher or lower error on new data?
  • Part F: Compare the RMSE of your linear regression and decision tree models on the test set.

Part 2 - Feature Expansion¶

The feature expansion strategies from our lecture slides can be integrated in machine learning pipelines using the following functions from the preprocessing module of sklearn:

  • KBinsDiscretizer() - splits each numeric feature into categorical bins and represents them using one-hot encoding
  • PolynomialFeatures() - creates a polynomial expansion of each feature
  • SplineTransformer() - creates a b-spline expansion of each feature

The code below demonstrates the use of KBinsDiscretizer() on the "Circles" data:

In [5]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import KBinsDiscretizer
disc_pipe = Pipeline([('expander', KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')),
                  ('model', LogisticRegression(penalty='none'))])
                  
fitted_disc_pipe = disc_pipe.fit(circles_X_train, circles_y_train)
X_grid = np.array(np.meshgrid(np.linspace(-1.5,1.5,100), np.linspace(-1.5,1.5,100))).reshape(2, 100*100).T
grid_preds = fitted_disc_pipe.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()
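PolynomialFeatures() and SplineTransformer() follow the same transformer interface and could be swapped into the 'expander' step above. A minimal sketch on a toy column (the values here are illustrative, not part of the lab data):

In [ ]:
## Minimal sketch of the other two expanders applied to a single toy column
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer
toy_X = np.linspace(0, 1, 5).reshape(-1, 1)

## Polynomial expansion: columns are x and x^2
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(toy_X))

## B-spline expansion: one column per basis function
print(SplineTransformer(degree=2, n_knots=3).fit_transform(toy_X))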

At this point in the course, you should be familiar enough with sklearn to be comfortable using a function's official documentation to determine the appropriate arguments and values.

Question #2: For this question you will use the "circles" data set.

  • Part A: Create a pipeline that applies a spline transformer as a feature expansion step before fitting a logistic regression model, then use GridSearchCV() to tune the degree (1 or 2), number of knots (2 or 3), and knot spacing (uniform or quantile-based). You may use the default scoring metric (classification accuracy). Print the five best performing approaches.
  • Part B: Display the prediction surface of the best estimator (or one of the best estimators in the case of a tie) using the examples from earlier in the lab. Briefly comment on how this model compares to the logistic regression without feature expansion from before.
  • Part C: Without relying on an empirical justification, provide a conceptual argument for why spline expansion is likely to outperform discretization for the "circles" data set. You should consider the scatter plot of the training data, as well as the prediction surfaces you've seen so far.

Part 3 - Residual Analysis¶

For the "circles" data set we were able to decide upon reasonable feature expansion strategies using a single scatter plot of the training data. However, many real data applications involve too many predictive features for this approach to be practical.

A more effective approach is to train a model on the unexpanded data and use residual plots to determine which features should be expanded. As an example, consider the variable $X_3$ in the "Friedman" data, which has a quadratic relationship with $Y$ in the data generating model:

In [6]:
## Fit basic linear regression to the Friedman data
from sklearn.linear_model import LinearRegression
fitted_model = LinearRegression().fit(friedman_X_train, friedman_y_train)

## Plot residual vs. X3, which had a quadratic relationship w/ Y in the data generating model
training_resids = friedman_y_train - fitted_model.predict(friedman_X_train)
plt.scatter(friedman_X_train[:,2], training_resids)
plt.axhline(y=0, color='black', linestyle='--')

## Add a lowess smoother
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed_data = lowess(training_resids, friedman_X_train[:,2], frac=0.3, it=0)
x_smoothed, y_smoothed = smoothed_data[:, 0], smoothed_data[:, 1]
plt.plot(x_smoothed, y_smoothed, color='red', linewidth=2)
plt.show()

Notice how the plot shows a quadratic relationship between $X_3$ and the residuals. This is no coincidence; rather, it indicates that the linear relationship imposed by the model is insufficient to capture how $X_3$ is related to $Y$.

We can make the same plot after applying a spline transformer to $X_3$ to assess whether feature expansion improves model fit:

In [8]:
## Apply the spline transformer to just one column
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
friedman_XDF_train = pd.DataFrame(friedman_X_train, columns=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10'])
target_column = 'X3'
preprocessor = ColumnTransformer(
        transformers=[('spline', SplineTransformer(degree=2, n_knots=2), [target_column])],
        remainder='passthrough' )
expanded_X_train = preprocessor.fit_transform(friedman_XDF_train)
fitted_model = LinearRegression().fit(expanded_X_train, friedman_y_train)

## Plot residual vs. X3, which had a quadratic relationship w/ Y in the data generating model
training_resids = friedman_y_train - fitted_model.predict(expanded_X_train)
plt.scatter(friedman_X_train[:,2], training_resids)
plt.axhline(y=0, color='black', linestyle='--')

## Add a lowess smoother
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed_data = lowess(training_resids, friedman_X_train[:,2], frac=0.3, it=0)
x_smoothed, y_smoothed = smoothed_data[:, 0], smoothed_data[:, 1]
plt.plot(x_smoothed, y_smoothed, color='red', linewidth=2)
plt.show()

We now see only random variation in the smoothing line, which suggests we've properly integrated the feature into the model.
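If you'd like a numeric check to accompany the residual plot, one option is to compare cross-validated RMSE with and without the expansion. A sketch, assuming the objects defined above (preprocessor and friedman_XDF_train) are still in scope:

In [ ]:
## Compare 5-fold cross-validated RMSE before and after the spline expansion of X3
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
plain_rmse = -cross_val_score(LinearRegression(), friedman_X_train, friedman_y_train,
                              scoring='neg_root_mean_squared_error', cv=5).mean()
spline_pipe = Pipeline([('expander', preprocessor), ('model', LinearRegression())])
spline_rmse = -cross_val_score(spline_pipe, friedman_XDF_train, friedman_y_train,
                               scoring='neg_root_mean_squared_error', cv=5).mean()
print(plain_rmse, spline_rmse)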

Note: For logistic regression the concept of a residual is more ambiguous, but the general approach described in this section can still be applied by calculating residuals using the binary encoding of the outcome variable and the model's predicted probabilities.
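A minimal sketch of that idea using the fitted pipeline from Part 2 (fitted_disc_pipe), with the residual defined as the observed 0/1 outcome minus the predicted probability of the positive class:

In [ ]:
## "Residuals" for a classifier: observed 0/1 outcome minus predicted probability
circles_resids = circles_y_train - fitted_disc_pipe.predict_proba(circles_X_train)[:, 1]
plt.scatter(circles_X_train[:, 0], circles_resids)
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('X1')
plt.ylabel('Observed - Predicted Prob')
plt.show()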

Question #3:

  • Part A: Using the code from this section as a starting point, create a for loop that generates a scatter plot of each feature in the "Friedman" data set versus the residuals of a linear regression model trained on the unexpanded data. Similar to the example code, make sure your plots include a smoothing line.
  • Part B: Based upon your results from Part A, which features appear to be promising candidates for feature expansion? How do the features you identified compare with the data generation process used to create the "Friedman" data set?
  • Part C: Use 5-fold cross-validation to compare the performance of two different modeling approaches - one that applies a degree-3 spline transformer with 2 knots and uniform spacing to all features, and another that applies the same transformer only to the features you identified in Part B.

Part 4 - Application¶

In this application you'll revisit the Iowa City Home Sales data from previous labs:

In [9]:
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_homes.head(3)
Out[9]:
sale.amount sale.date occupancy style built bedrooms bsmt ac attic area.base area.add area.bsmt area.garage1 area.garage2 area.living area.lot lon lat assessed
0 172500 1/3/2005 116 (Zero Lot Line) 1 Story Frame 1993 3 Full Yes None 1102 0 925 418 0 1102 5520 -91.509129 41.651160 173040
1 90000 1/5/2005 113 (Condominium) 1 Story Frame 2001 2 None Yes None 878 0 0 0 264 878 3718 -91.522964 41.673240 89470
2 168500 1/12/2005 101 (Single-Family / Owner Occupied) Split Foyer Frame 1976 4 Full Yes None 1236 0 700 576 0 1236 8800 -91.482311 41.658488 164230

The goal will be to compare a linear regression model that uses feature expansion strategies as necessary with a KNN model that naturally accommodates non-linear relationships between numeric features and sale price.

Question #4:

  • Part A: Perform an 80-20 training-testing split using random_state = 42 and separate the outcome, sale.amount, from the predictive features of interest, which are ['bedrooms', 'area.base', 'area.bsmt','area.lot', 'area.living', 'area.garage1', 'area.garage2']
  • Part B: Fit a linear regression model without any feature expansion to the training data, then use the residuals of this model to identify candidate features that might benefit from feature expansion.
  • Part C: Use cross-validation to compare a linear regression model that uses discretization as a feature expansion strategy for the candidate features you identified in Part B with at least 3 different KNN models. Report the cross-validated RMSE of each model you consider.
  • Part D: Find and report the RMSE of the best model from Part C on the test set.