This lab will cover logistic regression and softmax regression. It will also touch on a few built-in optimization options, and provide a demonstration of stochastic gradient descent.
As usual, we'll begin by loading familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
import scipy
Several of our examples will use the MNIST handwritten digit dataset (introduced in Lab 2-B):
### Read flattened, processed MNIST data
mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(mnist, test_size=0.2, random_state=5)
### Separate the label column (outcome)
train_y = train['label']
train_X = train.drop(['label'], axis="columns")
### Create a binary label for 8's
train_y_binary = (train_y == 8).astype(int)
For illustrative purposes, we create an additional binary outcome, train_y_binary,
to reflect whether a digit is an eight, one of the more commonly misclassified digits in our earlier work with these data.
The sklearn
implementation of logistic regression operates similarly to LinearRegression
with a few noteworthy differences. The code below sets up a pipeline and attempts to fit an ordinary logistic regression model to the training data:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('model', LogisticRegression(penalty='none'))
])
log_mod = pipe.fit(train_X, train_y_binary)
C:\Users\millerry\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
As you can see, the default optimizer, "lbfgs", was unable to converge.
We won't go too far into the details of why this may have happened, but two strategies you should be aware of are scaling the data (to provide more numerical stability) and/or using a different optimization algorithm.
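As the warning message itself suggests, another option is simply to allow the solver more iterations via the max_iter argument. A minimal sketch of that approach is shown below (the value of 1000 and the names pipe_iter and log_mod_iter are arbitrary choices for illustration, and convergence still isn't guaranteed without scaling):
## Give the default "lbfgs" solver more iterations
pipe_iter = Pipeline([
    ('model', LogisticRegression(penalty='none', max_iter=1000))
])
log_mod_iter = pipe_iter.fit(train_X, train_y_binary)
In what follows, though, we'll focus on the two strategies described above.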
The code below implements both strategies using MinMaxScaler()
and the "newton-cg" algorithm, a method that calculates the entire hessian matrix of second order partial derivatives rather than approximating the matrix (as done by "lbfgs").
from sklearn.preprocessing import MinMaxScaler
## Try a different solver w/ scaling
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty='none', solver='newton-cg'))
])
log_mod = pipe.fit(train_X, train_y_binary)
log_mod.score(train_X, train_y_binary)
1.0
We're now able to achieve convergence, but the in-sample classification accuracy of 100% suggests extreme overfitting (which is likely why "lbfgs" struggled to converge).
Fortunately, cross-validation allows us to obtain a more reliable estimate of the model's performance on data it wasn't trained on:
from sklearn.model_selection import cross_validate
cv_res = cross_validate(log_mod, train_X, train_y_binary, cv = 5, scoring = 'accuracy')
np.average(cv_res['test_score'])
0.9227083333333332
Like all of the other models we've covered so far, logistic regression has a .fit
method, so we can still utilize all of the functions from the model_selection
and metrics
modules in sklearn
that were covered in Lab 3-A.
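As a brief illustration, the sketch below uses cross_val_score (a more compact relative of cross_validate) with precision as an arbitrarily chosen scoring metric:
from sklearn.model_selection import cross_val_score
## Cross-validated precision for the binary (eight vs. not eight) model
cross_val_score(log_mod, train_X, train_y_binary, cv = 5, scoring = 'precision').mean()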
Question #1: Use cross_val_predict and confusion_matrix to create a cross-validated confusion matrix for the logistic regression model described in the sections above. Then, calculate this model's F1 score "by hand" using this matrix. When calculating your score, consider the outcome representing the digit being 8 as the positive class.
Question #2: Use cross_val_predict and RocCurveDisplay to create a cross-validated ROC curve. Briefly comment on the performance of logistic regression as a classifier in this application using the ROC curve.
To perform softmax regression, we simply use LogisticRegression
with the additional argument multi_class='multinomial':
## Modify pipeline to fit softmax regression
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty='none', multi_class='multinomial'))
])
sm_mod = pipe.fit(train_X, train_y)
sm_mod.score(train_X, train_y)
1.0
As we saw in the previous section, softmax regression overfits the training data and achieves an in-sample classification accuracy of 100%.
However, cross-validation estimates the model's classification accuracy on new data will be substantially lower:
cv_res = cross_validate(sm_mod, train_X, train_y, cv = 5, scoring = 'accuracy')
np.average(cv_res['test_score'])
0.8758333333333332
Since softmax regression is used in classification problems with multiple outcome classes, we might consider using alternative performance metrics described in the "Multiple Classes" section of Lab 3-A (Part 5 of the lab).
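As a brief sketch of one such alternative (balanced accuracy, the macro-average of recall across classes, is an arbitrary choice here), we might compute:
## Cross-validated balanced accuracy for the softmax model
cv_bal = cross_validate(sm_mod, train_X, train_y, cv = 5, scoring = 'balanced_accuracy')
np.average(cv_bal['test_score'])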
Question #3: Compare softmax regression with a k-nearest neighbors classifier that uses n_neighbors=20, distance weighting, and Euclidean distance calculations on the MNIST data. Use min-max scaling as a preprocessing step in your pipeline to promote numerical stability. Report a cross-validated F1-score calculated using macro-averaging for each model.
In logistic regression, the prediction scores (i.e., $\mathbf{x}_i^T\mathbf{w}$ for the $i^{th}$ observation) are linked to probabilities of the positive class via the sigmoid function:
$$log\big(\tfrac{\pi_i}{1-\pi_i}\big) = \mathbf{x}_i^T\mathbf{w}$$
For future calculations, we can denote $\eta_i = \mathbf{x}_i^T\mathbf{w}$ and recognize:
$$\pi_i = g(\eta_i) = \tfrac{1}{1 + exp(-\mathbf{x}_i^T\mathbf{w})}$$
We can verify this relationship by manually calculating prediction scores/probabilities and comparing them with the output returned by LogisticRegression
:
## Refit logistic regression
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty='none', solver='newton-cg'))
])
log_mod = pipe.fit(train_X, train_y_binary)
## Store the weights and bias
w = log_mod.named_steps['model'].coef_
b = log_mod.named_steps['model'].intercept_
## Calculate eta manually using the dot product of X and w + the bias term
train_Xs = MinMaxScaler().fit_transform(train_X)
eta = np.dot(train_Xs, np.transpose(w)) + b
## Use eta to calculate probabilities of the positive class
pi = 1/(1 + np.exp(-eta))
## Display relationship between eta and pi
plt.scatter(eta, pi, c=train_y_binary.astype('category'))
plt.show()
C:\Users\millerry\AppData\Local\Temp\ipykernel_15620\1742544608.py:17: RuntimeWarning: overflow encountered in exp
pi = 1/(1 + np.exp(-eta))
## Now let's extract sklearn's predicted probabilities using `predict_proba`:
pi_m = log_mod.named_steps['model'].predict_proba(train_Xs)
plt.scatter(eta, pi_m[:,1], c=train_y_binary.astype('category'))
plt.show()
Comparing the two graphics created above, we can see that our manual calculation of $\eta$ and $\pi$ is correct (since both graphs are identical). However, our results in this example are somewhat atypical.
You might notice that we receive a warning message when trying to manually calculate $\pi$. This is because the estimated weights produce very large scores, pushing all of our probability estimates extremely close to zero or one, a reflection of (near) perfect separation in the data.
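If we wanted to avoid this overflow warning, one option (a side note, not something we rely on later in this lab) is SciPy's numerically stable sigmoid, scipy.special.expit(); the name pi_stable below is just an illustrative choice:
from scipy.special import expit
## expit(eta) evaluates 1/(1 + exp(-eta)) without overflowing for large-magnitude eta
pi_stable = expit(eta)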
Normally, we'd expect to see the sigmoid function's characteristic S-shape in plots like these. To demonstrate the relationship between $\eta$ and $\pi$ that we'd typically expect, let's quickly re-do this example using the Iowa City Home Sales data:
## Load data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create X and y
train_num = train_ic.select_dtypes("number")
train_ic_y = (train_num['assessed'] > train_num['sale.amount']).astype(int)
train_ic_X = train_num.drop(['assessed','sale.amount'],axis=1)
## Fit logistic regression
pipe = Pipeline([
('scaler', MinMaxScaler()),
('model', LogisticRegression(penalty='none', solver='newton-cg'))
])
ic_log_mod = pipe.fit(train_ic_X, train_ic_y)
## Store the weights and bias
w = ic_log_mod.named_steps['model'].coef_
b = ic_log_mod.named_steps['model'].intercept_
## Calculate eta manually using the dot product of X and w + the bias term
train_Xs = MinMaxScaler().fit_transform(train_ic_X)
eta = np.dot(train_Xs, np.transpose(w)) + b
## Use eta to calculate probabilities of the positive class
pi = 1/(1 + np.exp(-eta))
## Display
plt.scatter(eta, pi, c=train_ic_y.astype('category'))
plt.xlabel("Eta")
plt.ylabel("Pi")
plt.show()
In this new example, we see that higher values of $\eta_i$ (higher scores) correspond with higher predicted probabilities of a home being over-assessed, and this relationship follows a sigmoid curve.
A strength of logistic regression relative to other binary classification models is the ability to interpret and understand each of the estimated weights. By virtue of the model's form, we see:
$$log\big(\tfrac{\pi}{1-\pi}\big) = w_0 + w_1X_1 + \ldots + w_pX_p$$
$$\implies \tfrac{\pi}{1-\pi} = exp(w_0 + w_1X_1 + \ldots + w_pX_p)$$
$$\implies \tfrac{\pi}{1-\pi} = exp(w_0)*exp(w_1X_1)* \ldots *exp(w_pX_p)$$
The left-hand side of this equation, $\tfrac{\pi}{1-\pi}$, is the odds of an observation belonging to the positive class.
For example, suppose a predictor's weight is 0.44. We can calculate $e^{0.44} \approx 1.55$, indicating that a 1-unit change in this predictor corresponds with an estimated 55% increase in the odds of an observation belonging to the positive class.
When interpreting weights, it is important to consider whether you've rescaled the data during the model-building process. If you've applied standardization, a 1-unit change is actually a 1-standard-deviation change in the predictor in terms of its original units. If you've applied min-max scaling, a 1-unit change reflects going from the minimum observed value to the maximum observed value.
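As a quick numerical check of the example above (and of a hypothetical negative weight), we might compute:
## Exponentiate the example weight of 0.44 from the text
np.exp(0.44)   ## roughly 1.55, i.e., about a 55% increase in the odds per 1-unit change
## A negative weight shrinks the odds; for example:
np.exp(-0.44)  ## roughly 0.64, i.e., about a 36% decrease in the odds per 1-unit change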
Question #4: Using the logistic regression model fit to the Iowa City Home Sales data (ic_log_mod), find the estimated weight for the predictor 'bedrooms' and interpret this weight after exponentiation.
Unlike linear regression, neither logistic nor softmax regression has a closed-form solution, so weights must be estimated using iterative algorithms. As you've already seen in this lab, there are many different algorithms capable of solving for an optimal set of weights (note that an optimal set does exist, since the cross-entropy cost function has a single minimum).
In this section of the lab, we'll explore the process of solving for these weights. To begin, we need to define the cost function:
$$Cost = -\tfrac{1}{n}\sum_{i=1}^{n}\big(y_i log(g(\eta_i)) + (1-y_i)log(1-g(\eta_i))\big)$$
In our lecture slides, we saw how to differentiate this function using only a single observation, which is the basis for stochastic gradient descent:
$$Gradient = (g(\eta_i)-y_i)\mathbf{x}_i$$
Note that $\mathbf{x}_i$ is the row vector corresponding to the single observation we're considering.
This leads to the following update scheme:
$$\mathbf{w}^{(j)} = \mathbf{w}^{(j-1)} - \alpha(g(\eta_i^{(j-1)})-y_i)\mathbf{x}_i$$
We can implement this update scheme by randomly reordering the data, looping through the observations one at a time, and repeating as necessary.
The code below demonstrates stochastic gradient descent for the Iowa City Home Sales data. You should reference this code during Question #5 (which appears at the end of this section).
## First, reload IC data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Train-test split
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Recreate X and Y
train_num = train_ic.select_dtypes("number")
train_ic_y = (train_num['assessed'] > train_num['sale.amount']).astype(int)
train_ic_X = train_num.drop(['assessed','sale.amount'],axis=1)
train_Xs = MinMaxScaler().fit_transform(train_ic_X)
## Define Functions
import random
def cost_function(x, y, w):
    eta = np.dot(x,w)
    pred_y = 1/(1 + np.exp(-eta))
    cost = -y*np.log(pred_y) - (1-y)*np.log(1-pred_y)
    return cost
def grad_descent(X, y, w, alpha, n_iter):
    n = len(y)
    costs = np.zeros(n_iter)
    for i in range(n_iter):
        ## Randomly reorder the observations for this pass through the data
        idx = random.sample(range(n), n)
        temp_cost = 0.0
        for si in range(n):
            ## Select a single observation (prepending a 1 for the bias term)
            random_y = y.iloc[idx[si]]
            random_x = np.append(1,X[idx[si],:])
            eta = np.dot(random_x,w)
            pred_y = 1/(1 + np.exp(-eta))
            err = (pred_y - random_y).item()
            ## Update the weights using the gradient for this single observation
            grad = random_x*err
            w = w - alpha*grad
            temp_cost += cost_function(random_x, random_y, w)
        costs[i] = temp_cost
    return w, costs
## Plot of results
p = np.shape(train_Xs)[1]+1
n = len(train_ic_y)
nit = 500
gdres = grad_descent(X=train_Xs,y=train_ic_y,w=np.zeros(p),alpha=0.005, n_iter = nit)
plt.plot(np.linspace(0, nit, nit),gdres[1])
plt.show()
Here we see that stochastic gradient descent effectively optimizes the cost function despite using just a single observation each time it performs an update.
It's worth knowing that stochastic gradient descent is one of the most widely used algorithms for estimating the weights in neural networks, as the gradient can be computed using the chain rule.
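For reference, sklearn also offers a built-in stochastic gradient descent classifier. The sketch below (which assumes SGDClassifier with the logistic 'log_loss' objective and otherwise default settings, including a small L2 penalty that our earlier fits did not use) shows how it might be applied to the Iowa City data:
from sklearn.linear_model import SGDClassifier
## Built-in stochastic gradient descent with the logistic (cross-entropy) loss
## Note: older sklearn versions name this loss 'log' rather than 'log_loss'
sgd_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', SGDClassifier(loss='log_loss', random_state=0))
])
sgd_mod = sgd_pipe.fit(train_ic_X, train_ic_y)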
Question #5: The stochastic gradient descent demonstration above relies on several key pieces of code. Answer the following:
Part A: Consider the command np.dot(x, w). Briefly explain what this command calculates and what the dimension of the output is.
Part B: What is the purpose of random.sample(..)?
Part C: What is the purpose of np.append(1,X[idx[si],:])? Why is a '1' included here?
Part D: How do the results of our implementation compare with those of LogisticRegression()? Briefly explain.