This lab will introduce both logistic and softmax regression in sklearn. It will also provide a demonstration and some practice implementing stochastic gradient descent to solve for optimal weights in a logistic regression model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
As it turns out, logistic (and softmax) regression can be challenging to work with for a few different reasons. We'll use the MNIST handwritten digits data (introduced in a few earlier labs) to illustrate some of these challenges:
### Read flattened, processed MNIST data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(mnist, test_size=0.2, random_state=5)
### Separate the label column (outcome)
train_y = train['label']
train_X = train.drop(['label'], axis="columns")
### Create a binary label for 8's
train_y_binary = (train_y == 8).astype(int)
To demonstrate logistic regression, which involves a binary outcome variable, we've created train_y_binary to provide a separate label for digits that are an "8" versus all other digits.
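If you'd like to see how common the positive class is in the training data, a quick optional check (an addition, not part of the lab's original code) is shown below:
## Optional: count how many training digits are (and are not) labeled as an "8"
train_y_binary.value_counts()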
Logistic Regression in sklearn
The LogisticRegression() function in sklearn is largely similar to LinearRegression() with a few key differences:
- It applies a regularization penalty by default, so to fit an ordinary (unpenalized) logistic regression model you must set penalty = None (or penalty = 'none' in older versions of sklearn).
- Because there is no closed-form solution, the weights are found by an iterative optimization algorithm chosen via the solver argument.
Below is a simple attempt to use logistic regression to learn how to predict whether a digit is an "8". We see that the default optimization algorithm does not work...
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('model', LogisticRegression(penalty=None))
])
log_mod = pipe.fit(train_X, train_y_binary)
C:\Users\millerry\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result(
This happens somewhat frequently with logistic regression, and there are several reasons why, with the major problems usually being either (1) numerical instability, often caused by poorly scaled predictors, or (2) perfect separation of the classes in the training data.
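The warning message itself suggests increasing the iteration limit or re-scaling the data. For reference, the iteration limit is controlled by the max_iter argument; the sketch below (an addition, not part of the lab's original code) raises it, though on these data the solver may still fail to converge:
## Optional: give the default solver more iterations via max_iter
## (this may still produce a ConvergenceWarning on these data)
pipe_more_iters = Pipeline([
    ('model', LogisticRegression(penalty=None, max_iter=1000))
])
pipe_more_iters.fit(train_X, train_y_binary)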
If we suspect numerical instability, we can try re-scaling the data and using an algorithm that is more precise. Without getting too far into the details, we can change our solver to "newton-cg", which calculates the entire Hessian matrix of second-order partial derivatives rather than trying to approximate it (as is done by the default choice of "lbfgs").
## Try a different solver w/ scaling
from sklearn.preprocessing import MinMaxScaler
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty=None, solver='newton-cg'))
])
log_mod = pipe.fit(train_X, train_y_binary)
log_mod.score(train_X, train_y_binary)
C:\Users\millerry\Anaconda3\lib\site-packages\scipy\optimize\_linesearch.py:305: LineSearchWarning: The line search algorithm did not converge warn('The line search algorithm did not converge', LineSearchWarning) C:\Users\millerry\Anaconda3\lib\site-packages\sklearn\utils\optimize.py:204: UserWarning: Line Search failed warnings.warn("Line Search failed")
1.0
Unfortunately these changes still do not allow us to converge to a set of optimal weights.
However, by checking the in-sample classification accuracy we can see that the weights in the final iteration were such that perfect classification of the training data was achieved. Thus, "perfect separation" was likely the reason for logistic regression struggling to converge in this application.
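One informal way to see this separation (an optional check, not part of the lab's original code) is to compare the fitted probabilities across the two classes; because the training data were classified perfectly, every "8" should receive a higher predicted probability than every non-"8":
## Optional: with perfect in-sample classification, the smallest fitted probability
## among the 8's exceeds the largest fitted probability among the non-8's
probs = log_mod.predict_proba(train_X)[:, 1]
print(probs[train_y_binary == 1].min(), probs[train_y_binary == 0].max())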
Question 1: The previous demonstration checked the in-sample classification accuracy of our model (which we know didn't converge). We might wonder how generalizable this model is given the uncertainty surrounding the estimated weights. Assess the out-of-sample performance when predicting train_y_binary using the model that was trained above. In doing so, consider the outcome being an "8" as the positive class.

To perform softmax regression, which is used for multi-class classification, we simply use LogisticRegression() with the additional argument multi_class='multinomial':
## Modify pipeline to fit softmax regression
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty=None, multi_class='multinomial'))
])
sm_mod = pipe.fit(train_X, train_y)
sm_mod.score(train_X, train_y)
1.0
While the perfect classification accuracy we see in the training data might seem suspicious, cross-validation suggests this is just overfitting:
from sklearn.model_selection import cross_validate
cv_res = cross_validate(sm_mod, train_X, train_y, cv = 5, scoring = 'accuracy')
np.average(cv_res['test_score'])
0.8652083333333334
Question 2: Use pipelines and 5-fold cross-validation to compare the performance of softmax regression with $k$-nearest neighbors using n_neighbors=20, distance weighting, and Euclidean distance calculations on the MNIST data. Use min-max scaling as a preprocessing step in your pipeline to promote numerical stability. Report a cross-validated F1-score calculated using macro-averaging for each model.
In logistic regression, the prediction scores (ie: $\mathbf{x}_i^T\mathbf{w}$ for the $i^{th}$ observation) are linked to probabilities of the positive class via the sigmoid function:

$$log\big(\tfrac{\pi_i}{1-\pi_i}\big) = \mathbf{x}_i^T\mathbf{w}$$

In this expression $Pr(Y_i=1)$ is denoted by $\pi_i$. For future calculations, we can denote $z_i = \mathbf{x}_i^T\mathbf{w}$ and recognize:

$$\pi_i = g(z_i) = \tfrac{1}{1 + exp(-\mathbf{x}_i^T\mathbf{w})}$$

We can verify this relationship by manually calculating prediction scores/probabilities and comparing them with the output returned by LogisticRegression:
## Refit logistic regression
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty=None, solver='newton-cg'))
])
log_mod = pipe.fit(train_X, train_y_binary)
## Store the weights and bias
w = log_mod.named_steps['model'].coef_
b = log_mod.named_steps['model'].intercept_
## Calculate z manually using the dot product of X and w + the bias term
train_Xs = MinMaxScaler().fit_transform(train_X)
z = np.dot(train_Xs, np.transpose(w)) + b
## Use z to calculate probabilities of the positive class
pi = 1/(1 + np.exp(-z))
## Display relationship between z and pi
plt.scatter(z, pi, c=train_y_binary.astype('category'))
plt.show()
C:\Users\millerry\Anaconda3\lib\site-packages\scipy\optimize\_linesearch.py:305: LineSearchWarning: The line search algorithm did not converge warn('The line search algorithm did not converge', LineSearchWarning) C:\Users\millerry\Anaconda3\lib\site-packages\sklearn\utils\optimize.py:204: UserWarning: Line Search failed warnings.warn("Line Search failed")
We can compare this graph to one created using predicted probabilities found using the predict_proba()
method of our fitted model:
## predicted probabilities using `predict_proba`:
pi_m = log_mod.named_steps['model'].predict_proba(train_Xs)
plt.scatter(z, pi_m[:,1], c=train_y_binary.astype('category'))
plt.show()
These graphs should give us confidence that the manner by which we calculated predicted probabilities "by hand" is correct.
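As an additional numerical check (a small addition to the lab's code), we can confirm that the two sets of probabilities agree:
## The manually calculated probabilities should match those from predict_proba()
print(np.allclose(pi.ravel(), pi_m[:, 1]))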
However, due to the problems with perfect separation in these data, the sigmoid relationship between the linear predictor and the predicted probability of the positive class isn't clearly visible. Thus, we'll quickly look at a second example with a more typical relationship between $z$ and $Pr(Y=1)$:
## Load data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create X and y
train_num = train_ic.select_dtypes("number")
train_ic_y = (train_num['assessed'] > train_num['sale.amount']).astype(int)
train_ic_X = train_num.drop(['assessed','sale.amount'],axis=1)
## Fit logistic regression
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty=None, solver='newton-cg'))
])
ic_log_mod = pipe.fit(train_ic_X, train_ic_y)
## Store the weights and bias
w = ic_log_mod.named_steps['model'].coef_
b = ic_log_mod.named_steps['model'].intercept_
## Calculate z manually using the dot product of X and w + the bias term
train_Xs = MinMaxScaler().fit_transform(train_ic_X)
z = np.dot(train_Xs, np.transpose(w)) + b
## Use z to calculate probabilities of the positive class
pi = 1/(1 + np.exp(-z))
## Display
plt.scatter(z, pi, c=train_ic_y.astype('category'))
plt.xlabel("z")
plt.ylabel("Pi")
plt.show()
Question 3: In our second application we did not calculate the in-sample classification accuracy of the model, while in our first example we found an in-sample classification accuracy of 100%. Using just the graphs of $z$ vs. $Pr(Y=1)$, briefly explain how you can tell that the classifier in the second application (Iowa City home assessments) will achieve substantially lower in-sample classification accuracy. Do not justify your answer by calculating the classification accuracy; instead, your justification should only involve the relationship you see between $z$, $Pr(Y=1)$, and $\mathbf{y}$.
A strength of logistic regression relative to other binary classifiers is the ability to interpret and understand each of the estimated weights. By virtue of the model's form, we see:
$$log\big(\tfrac{\pi}{1-\pi}\big) = w_0 + w_1X_1 + \ldots + w_pX_p$$

$$\implies \tfrac{\pi}{1-\pi} = exp(w_0 + w_1X_1 + \ldots + w_pX_p)$$

$$\implies \tfrac{\pi}{1-\pi} = exp(w_0)*exp(w_1X_1)* \ldots *exp(w_pX_p)$$

The left-hand side of this equation, $\tfrac{\pi}{1-\pi}$, is the odds of an observation belonging to the positive class.
For example, suppose a predictor's weight is 0.44. We can calculate $e^{0.44} = 1.55$, indicating that a 1-unit change in this predictor corresponds with an estimated 55% increase in the odds of an observation belonging to the positive class.
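The 55% figure can be verified with a quick calculation:
## Exponentiate the weight to get the multiplicative change in the odds
np.exp(0.44)   # approximately 1.55, ie: a 55% increase in the odds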
When interpreting weights, it is important to consider whether you've re-scaled the data during the model building process. If you've applied standardization, a 1-unit change is actually a 1-standard-deviation change in the predictor in terms of its original units. If you've applied min-max scaling, a 1-unit change reflects going from the minimum observed value to the maximum observed value.
Question #4: Using the logistic regression model fit to the Iowa City Home Sales data in the example at the end of Part 3 (ie: the model stored in ic_log_mod), find the estimated weight for the predictor 'bedrooms' and interpret this weight after exponentiation.
Unlike linear regression, neither logistic nor softmax regression has a closed-form solution, so weights must be estimated using iterative algorithms. As you've already seen in this lab, there are many different algorithms capable of solving for an optimal set of weights. Note that an optimal set of weights exists so long as there isn't perfect separation in the data; otherwise, there might be several different, equally effective solutions. Even if there are many solutions, the cross-entropy cost function itself has a single minimum value; there just might be several ways to achieve that minimum.
In this section of the lab, we'll explore the process of solving for these weights. To begin, we need to define the cost function:
$$Cost = -\tfrac{1}{n}\sum_{i=1}^{n}\big(y_ilog(g(z_i)) + (1-y_i)log(1-g(z_i))\big)$$

In our lecture slides, we saw how to differentiate this function for a single observation, as is done in stochastic gradient descent:

$$Gradient = (g(z_i)-y_i)\mathbf{x}_i$$

Here $\mathbf{x}_i$ is the row vector corresponding to the single observation we're considering in the update step.
This expression leads to the following update scheme (note that the weights move in the direction opposite the gradient):

$$\mathbf{w}^{(j)} = \mathbf{w}^{(j-1)} - \alpha\big(g(z_i^{(j-1)})-y_i\big)\mathbf{x}_i$$

A stochastic gradient descent update scheme of this type is typically implemented by randomly reordering the data, looping through it one observation at a time, and then repeating this entire process as necessary until convergence is reached. In this context, each loop through the entire data set is called a training epoch.
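To make the update rule concrete, here is a minimal single-update sketch (an illustrative addition, not part of the lab's original code; it reuses the scaled Iowa City features train_Xs and labels train_ic_y from the previous part and ignores the bias term for simplicity):
## One stochastic gradient descent update using a single observation
w_demo = np.zeros(train_Xs.shape[1])              # start from all-zero weights
x_i, y_i = train_Xs[0, :], train_ic_y.iloc[0]     # a single (arbitrarily chosen) observation
z_i = np.dot(x_i, w_demo)                         # prediction score under the current weights
g_zi = 1/(1 + np.exp(-z_i))                       # sigmoid of the prediction score
w_demo = w_demo - 0.005*(g_zi - y_i)*x_i          # update with learning rate alpha = 0.005
The full implementation below repeats this update for every observation in each training epoch (and includes a bias term).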
The code below demonstrates stochastic gradient descent for the Iowa City Home Sales data. You should reference this code during Question #5 (which appears at the end of this section).
## First, reload IC data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Train-test split
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Recreate X and Y
train_num = train_ic.select_dtypes("number")
train_ic_y = (train_num['assessed'] > train_num['sale.amount']).astype(int)
train_ic_X = train_num.drop(['assessed','sale.amount'],axis=1)
train_Xs = MinMaxScaler().fit_transform(train_ic_X)
## Define functions to calculate the cost and perform stochastic gradient descent
import random
def cost_function(x, y, w):
    ## Cross-entropy cost for a single observation
    eta = np.dot(x, w)
    pred_y = 1/(1 + np.exp(-eta))
    cost = -y*np.log(pred_y) - (1-y)*np.log(1-pred_y)
    return cost

def grad_descent(X, y, w, alpha, n_iter):
    n = len(y)
    costs = np.zeros(n_iter)
    ## Each pass through the outer loop is one training epoch
    for i in range(n_iter):
        idx = random.sample(range(n), n)
        temp_cost = 0.0
        ## Update the weights using one observation at a time
        for si in range(n):
            random_y = y.iloc[idx[si]]
            random_x = np.append(1, X[idx[si],:])
            eta = np.dot(random_x, w)
            pred_y = 1/(1 + np.exp(-eta))
            err = (pred_y - random_y).item()
            grad = random_x*err
            w = w - alpha*grad
            temp_cost += cost_function(random_x, random_y, w)
        ## Record the total cost accumulated during this epoch
        costs[i] = temp_cost
    return w, costs
## Run 500 epochs and plot the results
p = np.shape(train_Xs)[1]+1
n = len(train_ic_y)
nit = 500
gdres = grad_descent(X=train_Xs,y=train_ic_y,w=np.zeros(p),alpha=0.005, n_iter = nit)
plt.plot(np.linspace(0, nit, nit),gdres[1])
plt.show()
Here we see that stochastic gradient descent effectively minimizes the cost function despite using just a single observation each time it performs an update.
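If you'd like a single number summarizing the final fit (an optional addition to the lab's code), you could print the average per-observation cost at the final training epoch:
## Optional: average cross-entropy per observation at the final epoch
print(gdres[1][-1] / n)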
It's worth knowing that stochastic gradient descent is one of the most widely used algorithms for estimating the weights in neural networks and many other deep learning models. So having a working understanding of the method is crucial to understanding how more sophisticated machine learning methods learn from the data.
Question #5:
- Part A: The code above uses the command np.dot(x, w). Briefly explain what this command calculates and what the dimensions of the output are.
- Part B: What is the role of the command random.sample(..) in the grad_descent() function?
- Part C: What is accomplished by the command np.append(1,X[idx[si],:])? Why is a '1' included here?
- Part D: How do the weights found by this implementation of stochastic gradient descent compare with those found by LogisticRegression()? Briefly explain.