In this lab we'll expand our modeling toolbox by exploring regression methods and how they compare with the KNN and decision tree models we've previously been relying upon.
Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercise and record your answer in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).
To begin, you'll need the following libraries:
## Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Throughout this lab we'll rely upon two simulated data sets so that we can better understand the behavior of certain methods for specific types of patterns:

- "Circles" - a binary classification data set generated by make_circles(), where the two classes form noisy concentric rings in the $X_1$ and $X_2$ dimensions
- "Friedman" - a regression data set generated by make_friedman1() containing 10 numeric features, only the first five of which influence the numeric outcome $Y$

These data sets are created and displayed (in the $X_1$ and $X_2$ dimensions) below:
## Create data sets
from sklearn.datasets import make_circles, make_friedman1
circles_X, circles_y = make_circles(n_samples=200, shuffle=True, noise=0.2, random_state=11, factor=0.3)
friedman_X, friedman_y = make_friedman1(n_samples=200, n_features=10, noise=0.2, random_state=11)
## Train-test splits
from sklearn.model_selection import train_test_split
circles_X_train, circles_X_test, circles_y_train, circles_y_test = train_test_split(circles_X, circles_y, test_size=0.2, random_state=0)
friedman_X_train, friedman_X_test, friedman_y_train, friedman_y_test = train_test_split(friedman_X, friedman_y, test_size=0.2, random_state=0)
## Display training data (Circles)
plt.scatter(circles_X_train[:,0], circles_X_train[:,1], c = circles_y_train)
plt.title("Circles")
plt.show()
## Display training data (Friedman)
plt.scatter(friedman_X_train[:,0], friedman_X_train[:,1], c = friedman_y_train)
plt.title("Friedman")
plt.colorbar(label='Y Value')
plt.show()
The linear_model module of sklearn contains the functions needed to fit linear and logistic regression models. We will begin by including all available predictors in our models without any feature expansion steps:
from sklearn.linear_model import LogisticRegression
fitted_model = LogisticRegression(penalty='none').fit(circles_X_train, circles_y_train)
The argument penalty = 'none' is included because sklearn applies regularization to logistic regression weights by default. We will discuss the topic of regularization later in the course, but for now we'll avoid it.
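Note that newer versions of sklearn deprecate the string 'none' in favor of penalty=None. If the code above produces an error or warning about the penalty argument, the following equivalent call should work:

## Equivalent unpenalized fit for newer versions of sklearn
fitted_model = LogisticRegression(penalty=None).fit(circles_X_train, circles_y_train)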
To see what the predicted probabilities of our fitted model look like, we'll create a dense grid of $X_1$ and $X_2$ values and visualize the model's predictions over this grid:
## Create the grid
X_grid = np.array(np.meshgrid(np.linspace(-1.5,1.5,100), np.linspace(-1.5,1.5,100))).reshape(2, 100*100).T
## Get predicted probs
grid_preds = fitted_model.predict_proba(X_grid)
## Plot the prediction surface
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.title("Logistic Regression on 'Circles' Data")
plt.colorbar(label='Predicted Prob')
plt.show()
The logistic regression model fails miserably, as it imposes a monotonic relationship between each predictor and the predicted probability of the positive class.
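To quantify just how poorly it performs, we can check the fitted model's accuracy on the held-out test data; an accuracy near 0.5 would indicate the model does no better than random guessing on this pattern:

## Accuracy on the held-out circles test data
print(fitted_model.score(circles_X_test, circles_y_test))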
Question #1: For this question you'll fit a basic linear regression model to the Friedman data and create a prediction surface in the $X_1$ and $X_2$ dimensions.

- Part A: Use the LinearRegression function from the linear_model module and fit a linear regression model to the Friedman training data.
- Part B: Create a grid of $X_1$ and $X_2$ values like the one used in the example above, then use np.hstack to combine your grid and an array of zeros (for the remaining eight features) to create an object that can be used as an input to the predict() method of your fitted model from Part A.
- Part C: Use the predictions over this grid to display the model's prediction surface in the $X_1$ and $X_2$ dimensions.

The feature expansion strategies from our lecture slides can be integrated into machine learning pipelines using the following functions from the preprocessing module of sklearn:

- KBinsDiscretizer() - splits each numeric feature into categorical bins and represents them using one-hot encoding
- PolynomialFeatures() - creates a polynomial expansion of each feature
- SplineTransformer() - creates a b-spline expansion of each feature

The code below demonstrates the use of KBinsDiscretizer() on the "Circles" data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
disc_pipe = Pipeline([('expander', KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')),
('model', LogisticRegression(penalty='none'))])
fitted_disc_pipe = disc_pipe.fit(circles_X_train, circles_y_train)
X_grid = np.array(np.meshgrid(np.linspace(-1.5,1.5,100), np.linspace(-1.5,1.5,100))).reshape(2, 100*100).T
grid_preds = fitted_disc_pipe.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()
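The other expansion functions follow the same pipeline pattern. For example, here is a quick sketch that swaps in PolynomialFeatures(), with the degree of 2 chosen purely for illustration:

## Same pipeline structure, but with a degree-2 polynomial expansion of each feature
from sklearn.preprocessing import PolynomialFeatures
poly_pipe = Pipeline([('expander', PolynomialFeatures(degree=2, include_bias=False)),
                      ('model', LogisticRegression(penalty='none'))])
fitted_poly_pipe = poly_pipe.fit(circles_X_train, circles_y_train)
## Visualize the predicted probabilities over the same grid
grid_preds = fitted_poly_pipe.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()

Because the classes in the "Circles" data are separated by a roughly circular boundary, which depends on $X_1^2 + X_2^2$, a degree-2 expansion should give logistic regression enough flexibility to capture it.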
At this point in the course, you should be familiar enough with sklearn
to be comfortable using a function's official documentation to determine the appropriate arguments and values.
Question #2: For this question you will use the "circles" data set. Build a pipeline that applies a b-spline expansion (via SplineTransformer()) before fitting a logistic regression model, then use GridSearchCV() to tune the degree (1 or 2), number of knots (2 or 3), and knot spacing (uniform or quantile-based). You may use the default scoring metric (classification accuracy). Print the top five best performing approaches.

For the "circles" data set we were able to decide upon reasonable feature expansion strategies using a single scatter plot of the training data. However, many real data applications involve too many predictive features for this approach to be practical.
A more effective approach is to train a model on the unexpanded data and use residual plots to determine which features should be expanded. As an example, consider the variable $X_3$ in the "Friedman" data, which has a quadratic relationship with $Y$ in the data generating model:
## Fit basic linear regression to the Friedman data
from sklearn.linear_model import LinearRegression
fitted_model = LinearRegression().fit(friedman_X_train, friedman_y_train)
## Plot residual vs. X3, which had a quadratic relationship w/ Y in the data generating model
training_resids = friedman_y_train - fitted_model.predict(friedman_X_train)
plt.scatter(friedman_X_train[:,2], training_resids)
plt.axhline(y=0, color='black', linestyle='--')
## Add a lowess smoother
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed_data = lowess(training_resids, friedman_X_train[:,2], frac=0.3, it=0)
x_smoothed, y_smoothed = smoothed_data[:, 0], smoothed_data[:, 1]
plt.plot(x_smoothed, y_smoothed, color='red', linewidth=2)
plt.show()
Notice how the plot shows a quadratic relationship between $X_3$ and the residuals. This is no coincidence; rather, it indicates that the linear relationship imposed by the model is insufficient to capture how $X_3$ is related to $Y$.
We can make the same plot after applying a spline transformer to $X_3$ to assess whether feature expansion improves model fit:
## Apply the spline transformer to just one column
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
friedman_XDF_train = pd.DataFrame(friedman_X_train, columns=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10'])
target_column = 'X3'
preprocessor = ColumnTransformer(
transformers=[('spline', SplineTransformer(degree=2, n_knots=2), [target_column])],
remainder='passthrough' )
expanded_X_train = preprocessor.fit_transform(friedman_XDF_train)
fitted_model = LinearRegression().fit(expanded_X_train, friedman_y_train)
## Plot residual vs. X3, which had a quadratic relationship w/ Y in the data generating model
training_resids = friedman_y_train - fitted_model.predict(expanded_X_train)
plt.scatter(friedman_X_train[:,2], training_resids)
plt.axhline(y=0, color='black', linestyle='--')
## Add a lowess smoother
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed_data = lowess(training_resids, friedman_X_train[:,2], frac=0.3, it=0)
x_smoothed, y_smoothed = smoothed_data[:, 0], smoothed_data[:, 1]
plt.plot(x_smoothed, y_smoothed, color='red', linewidth=2)
plt.show()
We now see only random variation in the smoothing line, which suggests we've properly integrated the feature into the model.
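If we'd like evidence beyond the residual plot, one quick check (a sketch reusing the objects defined above) is to compare the held-out mean squared error of the original model and the expanded model:

## Compare held-out MSE before and after expanding X3
from sklearn.metrics import mean_squared_error
friedman_XDF_test = pd.DataFrame(friedman_X_test, columns=friedman_XDF_train.columns)
## Refit the baseline model, since fitted_model now refers to the expanded version
baseline_model = LinearRegression().fit(friedman_X_train, friedman_y_train)
print(mean_squared_error(friedman_y_test, baseline_model.predict(friedman_X_test)))
print(mean_squared_error(friedman_y_test, fitted_model.predict(preprocessor.transform(friedman_XDF_test))))

A lower error for the expanded model would corroborate what the residual plot suggests.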
Note: For logistic regression the concept of a residual is more ambiguous, but the general approach described in this section can still be applied by calculating residuals using the binary encoding of the outcome variable and the model's predicted probabilities.
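As a rough sketch of this idea using the "circles" data and the discretizer pipeline fit earlier, we could subtract the predicted probability of the positive class from the 0/1 outcome and smooth these residuals against a feature to look for leftover structure:

## Crude residuals for a classifier: observed 0/1 outcome minus predicted probability
from statsmodels.nonparametric.smoothers_lowess import lowess
train_probs = fitted_disc_pipe.predict_proba(circles_X_train)[:,1]
class_resids = circles_y_train - train_probs
## Smooth the residuals against X1 to look for remaining patterns
smoothed_data = lowess(class_resids, circles_X_train[:,0], frac=0.3, it=0)
plt.scatter(circles_X_train[:,0], class_resids)
plt.plot(smoothed_data[:,0], smoothed_data[:,1], color='red', linewidth=2)
plt.axhline(y=0, color='black', linestyle='--')
plt.show()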
Question #3:
In this application you'll revisit the Iowa City Home Sales data from previous labs:
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_homes.head(3)
| | sale.amount | sale.date | occupancy | style | built | bedrooms | bsmt | ac | attic | area.base | area.add | area.bsmt | area.garage1 | area.garage2 | area.living | area.lot | lon | lat | assessed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172500 | 1/3/2005 | 116 (Zero Lot Line) | 1 Story Frame | 1993 | 3 | Full | Yes | None | 1102 | 0 | 925 | 418 | 0 | 1102 | 5520 | -91.509129 | 41.651160 | 173040 |
| 1 | 90000 | 1/5/2005 | 113 (Condominium) | 1 Story Frame | 2001 | 2 | None | Yes | None | 878 | 0 | 0 | 0 | 264 | 878 | 3718 | -91.522964 | 41.673240 | 89470 |
| 2 | 168500 | 1/12/2005 | 101 (Single-Family / Owner Occupied) | Split Foyer Frame | 1976 | 4 | Full | Yes | None | 1236 | 0 | 700 | 576 | 0 | 1236 | 8800 | -91.482311 | 41.658488 | 164230 |
The goal will be to compare a linear regression model that uses feature expansion strategies as necessary with a KNN model that naturally accommodates non-linear relationships between numeric features and sale price.
Question #4: Begin by splitting the data into training and testing sets (using random_state = 42) and separating the outcome, sale.amount, from the predictive features of interest, which are ['bedrooms', 'area.base', 'area.bsmt', 'area.lot', 'area.living', 'area.garage1', 'area.garage2'].