Lab #5 (part 2) - Feature Expansions and Transformations¶

This lab will conclude our study of regression methods by introducing a few functions that can be used during pre-processing to enhance the flexibility of highly structured models like linear/logistic/softmax regression.

More specifically, it will cover polynomial and spline expansions, as well as discretization. You should note that these strategies can be used for any type of model, but they are generally most useful for highly structured models like linear/logistic/softmax regression.

As usual, we'll begin by loading several familiar libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

For illustration purposes, we'll use a simulated dataset created using the make_circles function in sklearn. These data contain two predictors and a binary outcome, and they are generated such that the relationship between the predictors and the outcome is non-linear:

In [2]:
## Create data
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=200, shuffle=True, noise=0.2, random_state=11, factor=0.3)

## Train-test split
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

## Display training data
plt.scatter(train_X[:,0], train_X[:,1], c = train_y)
plt.show()

Part 1 - Discretization¶

To begin, let's fit a logistic regression model to these data without any preprocessing steps. Next, we'll construct a dense 2-dimensional grid of $X_1$ and $X_2$ values to predict over, then we'll visualize our model's predictions over that grid by using color to represent the model's predicted probabilities:

In [3]:
from sklearn.linear_model import LogisticRegression
fitted_lr = LogisticRegression(penalty='none').fit(train_X, train_y)

## Grid to predict over
X_grid = np.array(np.meshgrid(np.linspace(-1.5,1.5,100), np.linspace(-1.5,1.5,100))).reshape(2, 100*100).T
grid_preds = fitted_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()

We can see that logistic regression fails to learn the true relationship between these two predictors and the outcome. However, this shouldn't be surprising, since logistic regression specifies a linear relationship between the predictors and the log-odds of the positive class (equivalently, a monotonic relationship between each predictor and the probability of belonging to the positive class).
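
To make this concrete, a two-predictor logistic regression can only model a log-odds surface of the form $\log\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, so its decision boundary is always a straight line in the $(X_1, X_2)$ plane, and no single line can separate an inner cluster from a surrounding ring. If you'd like to see the fitted coefficients for yourself, here's a quick sketch using standard scikit-learn attributes:

## Fitted linear form of the log-odds: intercept_ is beta_0, coef_ holds beta_1 and beta_2
print(fitted_lr.intercept_, fitted_lr.coef_)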

A simple strategy that introduces enough flexibility to detect non-linear patterns in models like linear/logistic regression is discretization, or splitting a numeric predictor into categorical bins. The code below demonstrates this approach using KBinsDiscretizer:

In [4]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import KBinsDiscretizer
ds_lr = Pipeline([('expander', KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')),
                  ('model', LogisticRegression(penalty='none'))])
                  
fitted_ds_lr = ds_lr.fit(train_X, train_y)
grid_preds = fitted_ds_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()

We can see that this model is far more effective at capturing the true pattern in our data.

We can confirm this by looking at the training data accuracy of each model:

In [5]:
## Original features
print(fitted_lr.score(train_X, train_y))

## Discretized features
print(fitted_ds_lr.score(train_X, train_y))
0.69375
0.875
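
Before moving on, it can also be helpful to check exactly where the discretizer placed its cut-points. A quick sketch using the pipeline's named_steps attribute and KBinsDiscretizer's bin_edges_ attribute:

## Bin edges learned for each predictor (one array per column of X)
print(fitted_ds_lr.named_steps['expander'].bin_edges_)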

Question #1¶

  • Part A: Change the strategy argument to 'quantile' and recreate the graphic shown above. Briefly comment on the impact of this change on the graphic.
  • Part B: Using the documentation for KBinsDiscretizer, briefly explain the differences between the uniform and quantile strategies. Which strategy do you think is more appropriate for these data? Briefly explain.
  • Part C: If you used more bins, would you expect the uniform and quantile strategies to be more similar or less similar? Briefly explain.
  • Part D: Explain one downside of using more bins.

Part 2 - Polynomials and Splines¶

Discretization can be effective and interpretable, but the abrupt changes that occur at the cut-points can be problematic when trying to generalize a model to new data. Polynomials and splines address this shortcoming by accommodating non-linear relationships through smooth transitions.

Both polynomials and splines are easily incorporated into a pipeline in the same manner as KBinsDiscretizer. The code below uses PolynomialFeatures to construct a second-degree polynomial expansion:

In [6]:
from sklearn.preprocessing import PolynomialFeatures
poly_lr = Pipeline([('expander', PolynomialFeatures(degree=2)),
                  ('model', LogisticRegression(penalty='none'))])
                  
fitted_poly_lr = poly_lr.fit(train_X, train_y)
grid_preds = fitted_poly_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()

We can see that a second-degree polynomial expansion does a good job of allowing our model to capture the true relationship, and our visual assessment is further supported by the training data accuracy:

In [7]:
## Accuracy with polynomials
print(fitted_poly_lr.score(train_X, train_y))
0.95625
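
If you're curious about which columns the expansion actually created, you can ask the fitted transformer for its output feature names. A minimal sketch (get_feature_names_out is available in recent versions of scikit-learn; the 'X1'/'X2' labels are just placeholders for our two predictors):

## Expanded feature names: bias, X1, X2, X1^2, the X1-X2 interaction, and X2^2
print(fitted_poly_lr.named_steps['expander'].get_feature_names_out(['X1', 'X2']))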

Splines can be included in a preprocessing pipeline using SplineTransformer. Be aware that both the degree and the number of knots affect the performance of a spline expansion:

In [8]:
from sklearn.preprocessing import SplineTransformer
sp_lr = Pipeline([('expander', SplineTransformer(degree=2, n_knots=2)),
                  ('model', LogisticRegression(penalty='none'))])
                  
fitted_sp_lr = sp_lr.fit(train_X, train_y)
grid_preds = fitted_sp_lr.predict_proba(X_grid)
plt.scatter(X_grid[:,0], X_grid[:,1], c = grid_preds[:,1])
plt.colorbar()
plt.show()

In [9]:
## Accuracy with splines
print(fitted_sp_lr.score(train_X, train_y))
0.95625
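
As with the polynomial expansion, it can be useful to check how many basis features the spline transformer produces, since that number grows with both degree and n_knots. A quick sketch using the fitted pipeline from above:

## Shape of the expanded design matrix: (number of training rows, number of spline basis columns)
print(fitted_sp_lr.named_steps['expander'].transform(train_X).shape)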

While we see the same training data accuracy for the polynomial and spline approaches, one might argue that splines were more effective at learning the true underlying pattern that generated these data (circles).
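
One way to eyeball this claim is to draw the estimated decision boundary (predicted probability of 0.5) for each fitted pipeline over the same grid we used earlier. The sketch below is just one possible visualization; the subplot layout and titles are my own choices:

## Compare the 0.5 probability contours of the polynomial and spline pipelines
x_vals = np.linspace(-1.5, 1.5, 100)
y_vals = np.linspace(-1.5, 1.5, 100)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, fitted, title in [(axes[0], fitted_poly_lr, 'Polynomial (degree 2)'),
                          (axes[1], fitted_sp_lr, 'Spline (degree 2, 2 knots)')]:
    probs = fitted.predict_proba(X_grid)[:, 1].reshape(100, 100)
    ax.contour(x_vals, y_vals, probs, levels=[0.5])
    ax.scatter(train_X[:, 0], train_X[:, 1], c=train_y, s=10)
    ax.set_title(title)
plt.show()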

Question #2¶

  • Part A - Use 5-fold cross-validation and classification accuracy to properly determine the feature expansion strategy that works best for these data (using unpenalized logistic regression as your model). You should explore a few different polynomial degrees (and numbers of knots for splines) using cross-validated grid search; a starter sketch for wiring a pipeline into a grid search appears after this list.
  • Part B - Use cross_validate to find the cross-validated accuracy of a k-nearest neighbors model that uses $k=10$ and distance weighting. You may or may not choose to scale the predictors before using KNN (since both predictors are already on roughly the same scale). Compare the cross-validated accuracy of this model to the best score found in Part A.
  • Part C - Evaluate the best performing model identified in Part B on the test data by reporting its classification accuracy.
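
The grid-search setup for Part A has one wrinkle worth previewing: when tuning a step inside a Pipeline, parameter names are prefixed with the step name and a double underscore (for example, expander__degree). The sketch below is only a starting point, not a complete answer; it tunes the spline settings only, and the candidate values, cv=5, and scoring='accuracy' are my own choices (polynomials follow the same pattern):

from sklearn.model_selection import GridSearchCV

## One possible skeleton for comparing spline settings with cross-validated grid search
sp_search = GridSearchCV(
    Pipeline([('expander', SplineTransformer()),
              ('model', LogisticRegression(penalty='none'))]),
    param_grid={'expander__degree': [1, 2, 3],
                'expander__n_knots': [2, 3, 4]},
    cv=5, scoring='accuracy')
sp_search.fit(train_X, train_y)
print(sp_search.best_params_, sp_search.best_score_)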