This lab concludes our study of regression methods by introducing a few functions that can be used during pre-processing to enhance a model's flexibility. More specifically, it covers polynomial and spline expansions, as well as discretization. These strategies can be used with any type of model, but they are generally most useful for highly structured models like linear/logistic/softmax regression.
As usual, we'll begin by loading several familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
For illustration purposes, we'll use a simulated dataset created using the make_circles function in sklearn. These data contain two predictors and a binary outcome, and they are generated such that the relationship between the predictors and the outcome is non-linear:
## Create data
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=200, shuffle=True, noise=0.2, random_state=11, factor=0.3)
## Train-test split
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
## Display training data
plt.scatter(train_X[:,0], train_X[:,1], c = train_y)
plt.show()
To begin, let's fit a logistic regression model to these data without any preprocessing steps. Next, we'll construct a dense 2-dimensional grid of $X_1$ and $X_2$ values to predict over, then we'll visualize our model's predictions over that grid by using color to represent the model's predicted probabilities:
from sklearn.linear_model import LogisticRegression
fitted_lr = LogisticRegression(penalty='none').fit(train_X, train_y)
## Grid to predict over
X_grid = np.array(np.meshgrid(np.linspace(-1.5,1.5,100), np.linspace(-1.5,1.5,100))).reshape(2, 100*100).T
grid_preds = fitted_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()
We can see that logistic regression fails to learn the true relationship between these two predictors and the outcome. However, this shouldn't be surprising, since logistic regression specifies a linear relationship between the predictors and the log-odds of the positive class (or, equivalently, a monotonic relationship between each predictor and the probability of belonging to the positive class).
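In other words, with two predictors the model being fit here has the form
$$\log\left(\frac{P(Y=1 \mid X_1, X_2)}{1 - P(Y=1 \mid X_1, X_2)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
so its decision boundary is a straight line in the $(X_1, X_2)$ plane, which cannot separate an inner circle of points from the ring that surrounds it.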
A simple strategy that introduces enough flexibility to detect non-linear patterns in models like linear/logistic regression is discretization, or splitting a numeric predictor into categorical bins. The code below demonstrates this approach using KBinsDiscretizer:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
ds_lr = Pipeline([('expander', KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')),
('model', LogisticRegression(penalty='none'))])
fitted_ds_lr = ds_lr.fit(train_X, train_y)
grid_preds = fitted_ds_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()
We can see that this model is far more effective at capturing the true pattern in our data.
We can confirm this by looking at the training data accuracy of each model:
## Original features
print(fitted_lr.score(train_X, train_y))
## Discretized features
print(fitted_ds_lr.score(train_X, train_y))
0.69375
0.875
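If you'd like to see exactly where the cut-points were placed, the fitted discretizer can be pulled out of the pipeline by its step name. Here is a minimal sketch (the 'expander' name matches the pipeline defined above):
## Inspect the cut-points chosen by the fitted KBinsDiscretizer
## (one array of bin edges per predictor)
print(fitted_ds_lr.named_steps['expander'].bin_edges_)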
Question #1: Set the strategy argument of KBinsDiscretizer to 'quantile' and recreate the graphic shown above. Briefly comment on the impact of this change on the graphic.
Question #2: For KBinsDiscretizer, briefly explain the differences between the uniform and quantile strategies. Which strategy do you think is more appropriate for these data? Briefly explain.
Question #3: Would you expect the uniform and quantile strategies to be more similar or less similar? Briefly explain.
Discretization can be effective and interpretable, but the abrupt changes that occur at cut-points can be problematic when trying to generalize a model to new data. Polynomials and splines address this shortcoming by accommodating non-linear relationships using smooth transitions.
Both polynomials and splines are easily incorporated into a pipeline in a similar manner to KBinsDiscretizer. The code below uses PolynomialFeatures to construct second-degree polynomial expansions:
from sklearn.preprocessing import PolynomialFeatures
poly_lr = Pipeline([('expander', PolynomialFeatures(degree=2)),
('model', LogisticRegression(penalty='none'))])
fitted_poly_lr = poly_lr.fit(train_X, train_y)
grid_preds = fitted_poly_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()
We can see that second-degree polynomials do a good job of allowing our model to capture the true relationship, and our visual assessment is further supported by the training data accuracy:
## Accuracy with polynomials
print(fitted_poly_lr.score(train_X, train_y))
0.95625
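To see exactly which columns the expansion creates, you can ask the fitted transformer for its output feature names. A quick sketch, assuming a version of sklearn recent enough to provide get_feature_names_out:
## List the columns produced by the degree-2 expansion
## (bias term, X1, X2, X1^2, X1*X2, X2^2)
print(fitted_poly_lr.named_steps['expander'].get_feature_names_out(['x1', 'x2']))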
Splines can be included in a preprocessing pipeline using SplineTransformer. Be aware that both the degree and the number of knots have an impact on the performance of a spline:
from sklearn.preprocessing import SplineTransformer
sp_lr = Pipeline([('expander', SplineTransformer(degree=2, n_knots=2)),
('model', LogisticRegression(penalty='none'))])
fitted_sp_lr = sp_lr.fit(train_X, train_y)
grid_preds = fitted_sp_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()
## Accuracy with splines
print(fitted_sp_lr.score(train_X, train_y))
0.95625
While we see the same training data accuracy for the polynomial and spline approaches, one might argue that splines were more effective at learning the true underlying pattern that generated these data (circles).
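Because the degree and the number of knots both matter, one option is to tune them rather than guess. The sketch below is only an illustration of how that search could be set up (it is not part of the lab's required code), using the 'step__parameter' naming convention for pipeline parameters:
## Hypothetical tuning sketch: search over the spline's degree and number of knots
from sklearn.model_selection import GridSearchCV
sp_search = GridSearchCV(sp_lr, {'expander__degree': [2, 3], 'expander__n_knots': [2, 4, 6]},
                         cv=5, scoring='accuracy')
sp_search.fit(train_X, train_y)
print(sp_search.best_params_, sp_search.best_score_)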
Question: Use cross_validate to find the cross-validated accuracy of a k-nearest neighbors model that uses $k=10$ and distance weighting. You may or may not choose to scale the predictors before using KNN (since both predictors are already on roughly the same scale). Compare the cross-validated accuracy of this model to the best score found in Part A.
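If you need a refresher on the cross_validate API before attempting this, here is a minimal sketch applied to the spline pipeline from above (building and evaluating the KNN model itself is left to you):
## Reminder of the cross_validate API, illustrated on the spline pipeline
from sklearn.model_selection import cross_validate
cv_results = cross_validate(sp_lr, train_X, train_y, cv=5, scoring='accuracy')
print(np.mean(cv_results['test_score']))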