Lab 11 - Gradient Descent Algorithms for Machine Learning¶

This lab focuses on gradient descent algorithms and related concepts. You'll need the following libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

Part 1 - Gradient Descent¶

Consider an $n$-dimensional vector of numeric outcomes, $\mathbf{y}$, and an $n$ by $p$ matrix of predictive features, $\mathbf{X}$. Recall that the squared-error cost function can be expressed using the following matrix-vector notation:

$$ \text{Cost} = \frac{1}{n}(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}})$$

For linear regression, $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}$, where $\hat{\mathbf{w}}$ is a $p$-dimensional vector of weight parameters that are estimated from the data with the goal of minimizing the cost function.

We can implement the squared-error cost function for linear regression ourselves:

In [2]:
## Define the cost function
def cost_function(X, y, w):
    n = len(y)
    pred_y = np.dot(X,w)
    err = y - pred_y
    cost = (1./n)*np.dot(np.transpose(err),err)                 
    return cost    
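
Before moving on, here is a quick sanity check of this function using a tiny made-up dataset (the numbers below are purely illustrative): with the weights that generated the outcomes the cost is exactly zero, and it grows as the weights move away from those values.

## Toy example (made-up numbers): 3 observations, 2 features
X_toy = np.array([[1., 2.], [3., 4.], [5., 6.]])
y_toy = np.array([5., 11., 17.])   # exactly X_toy @ [1, 2], so a perfect fit exists
print(cost_function(X_toy, y_toy, np.array([1., 2.])))  # 0.0
print(cost_function(X_toy, y_toy, np.array([0., 0.])))  # 145.0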

The following question aims to build intuition about how a model's current weight estimates relate to the cost function for a fixed set of data.

Question #1:

  • Part A: Consider an optimization of the squared-error cost function involving linear regression and the simple simulated data generated by the code provided below. What are the dimensions of $\mathbf{X}$, $\mathbf{y}$, and $\hat{\mathbf{w}}$ for this scenario?
  • Part B: Create a line graph displaying the cost (y-axis) in response to a sequence of 100 values of $\hat{w}_1$ ranging from 2 to 4 (you must create this sequence).
  • Part C: Suppose the current iteration of a gradient descent algorithm is using the value $\hat{w}_1 = 2$. Without calculating anything, what is the sign of the gradient at this value? How does this relate to whether the value of $\hat{w}_1$ should be increased or decreased to improve the cost?
  • Part D: Now suppose the current iteration of a gradient descent algorithm is using the value $\hat{w}_1 = 2.7$. Without calculating anything, is the magnitude of the gradient larger or smaller than it was at the value considered in Part C? Explain your answer.
In [3]:
## Simulated Data for Question 1
np.random.seed(0)
x = np.linspace(-10, 10, 100)
y = 3 * x + np.random.normal(0, 3, 100)
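
If you'd like to see the data you'll be working with before starting Question #1, an optional scatter plot (not required by the question) is:

## Optional: visualize the simulated data
plt.scatter(x, y, s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.show()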

Gradient Descent¶

Recall that in machine learning the gradient is a vector of partial derivatives of the cost function with respect to each unknown parameter (weight or bias) in the model.

The gradient descent algorithm iteratively finds better estimates of these unknown parameters (i.e., estimates yielding a lower cost) using the following update scheme: $$\hat{\mathbf{w}}^{(j)} = \hat{\mathbf{w}}^{(j-1)} - \alpha \cdot \frac{\partial \text{Cost}(\hat{\mathbf{w}}^{(j-1)})}{\partial \mathbf{w}}$$

For linear regression, the gradient (in matrix notation) is: $$\frac{-2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$$

You should be comfortable obtaining this result yourself, either by using the chain rule or by expanding the cost function and applying matrix derivative shortcuts before rearranging terms.
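
If you'd rather first convince yourself numerically that this matrix form is correct, the sketch below (using small made-up data and a central-difference approximation) compares the analytic gradient against finite differences of the cost function defined earlier:

## Numerical check of the gradient formula (made-up data)
rng = np.random.default_rng(1)
Xc = rng.normal(size=(20, 3))
yc = rng.normal(size=20)
wc = rng.normal(size=3)

## Analytic gradient: (-2/n) X^T (y - Xw)
analytic = (-2./len(yc))*np.dot(np.transpose(Xc), yc - np.dot(Xc, wc))

## Central-difference approximation of each partial derivative
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    w_hi, w_lo = wc.copy(), wc.copy()
    w_hi[j] += eps
    w_lo[j] -= eps
    numeric[j] = (cost_function(Xc, yc, w_hi) - cost_function(Xc, yc, w_lo))/(2*eps)

print(np.allclose(analytic, numeric))  # should print True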

Provided below is a crude gradient descent function that performs gradient descent updates for a fixed number of iterations:

In [4]:
## Define gradient descent function for linear regression
def grad_descent(X, y, w, alpha, n_iter):
    costs = np.zeros(n_iter)
    n = len(y) 
    
    for i in range(n_iter):
        pred_y = np.dot(X,w)
        grad = (-2./n)*np.dot(np.transpose(X), (y - pred_y))
        w = w - alpha*grad
        costs[i] = cost_function(X, y, w)
        
    return w, costs

We can try this function out on the Iowa City Home Sales data set, using all of the numeric features to predict sale.amount:

In [5]:
## Run gradient descent (for the original IC homes data)
from sklearn.preprocessing import StandardScaler
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_y = ic['sale.amount']
ic_X = ic.select_dtypes("number").drop('sale.amount',axis=1)
ic_Xs = StandardScaler().fit_transform(ic_X)
gdres = grad_descent(X=ic_Xs,y=ic_y,w=np.zeros(12),alpha=0.1, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

It seems like the cost is leveling out (reaching an optimal value) towards the end of this plot. So, let's check how our estimates compare with those given by the implementation of linear regression in sklearn:

In [6]:
## sklearn's linear regression
from sklearn.linear_model import LinearRegression
lmf = LinearRegression(fit_intercept=False).fit(ic_Xs, ic_y)
print(lmf.coef_)
print(gdres[0])
[-1571.29533112  2584.32172762  -225.92361186 -5778.1626521
  -608.85626531 -3022.00349719  1522.1430872    159.08016378
  1813.22175235   181.64294314  -520.80749986 90251.73234668]
[-2083.11880499  1076.37333675  -235.38323399 -6152.0121008
  1687.30812052 -3088.59226859  1724.58677251  5188.78489252
  1997.70355932   310.97837215  -466.9349181  85369.24932668]

You can see that our estimates are getting close to the least squares solution, but they aren't quite there yet. So, let's try again using more iterations of gradient descent:

In [7]:
## Run gradient descent (again)
gdres = grad_descent(X=ic_Xs,y=ic_y,w=np.zeros(12),alpha=0.1, n_iter=2000)
print(lmf.coef_)
print(gdres[0])
[-1571.29533112  2584.32172762  -225.92361186 -5778.1626521
  -608.85626531 -3022.00349719  1522.1430872    159.08016378
  1813.22175235   181.64294314  -520.80749986 90251.73234668]
[-1571.29533112  2584.32172762  -225.92361186 -5778.1626521
  -608.85626531 -3022.00349719  1522.1430872    159.08016378
  1813.22175235   181.64294314  -520.80749986 90251.73234668]

After 2000 iterations of gradient descent with a learning rate of $\alpha = 0.1$, you can see that our estimates match the least squares solution (at least to the precision displayed). However, you might be wondering if we could have gotten away with fewer iterations had we used a larger learning rate.

Below is what happens if we triple our learning rate to $\alpha = 0.3$:

In [8]:
## What happens with a larger learning rate?
gdres = grad_descent(X=ic_Xs,y=ic_y,w=np.zeros(12),alpha=0.3, n_iter=100)
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

When using $\alpha = 0.3$, each gradient descent update overshoots the minimum of the cost function by such a large degree that the new weights actually cause the cost function to increase. This problem compounds as more iterations are run, producing the divergent behavior seen in the cost curve.
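
The overshooting mechanism is easiest to see in one dimension. The sketch below uses the made-up cost function $w^2$ (gradient $2w$), for which the update is $w \leftarrow (1 - 2\alpha)w$: a small step size shrinks $w$ toward the minimum at 0, while a step size above 1 flips the sign and grows the magnitude at every update.

## One-dimensional illustration with the made-up cost w^2 (gradient 2w)
for alpha in [0.4, 1.1]:
    w = 1.0
    path = [w]
    for _ in range(5):
        w = w - alpha*(2*w)   # gradient descent update
        path.append(w)
    print(alpha, np.round(path, 3))   # converges for 0.4, diverges for 1.1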

Question #2: Consider a simplified model involving a single predictor and no bias/intercept (like what we saw in our lecture slides): $$\mathbf{y} = \mathbf{x}w_1 + \mathbf{\epsilon}$$

  • Part A: Write out the cost function (squared-error) and find its derivative with respect to $w_1$.
  • Part B: Create your own functions that calculate the cost and perform gradient descent for this scenario. You may use the examples in this section as a starting point, but you will have to make a few changes due to $w_1$ being a scalar.
  • Part C: Apply your gradient descent function to the Iowa City Home Sales dataset using assessed as the only predictor of sale price. Print the final weight estimate. Hint: For numerical stability you may want to use the standardized version of assessed created by the example code below.
In [9]:
## Standardizing just one predictor
X_assess = (ic_X['assessed'] - np.average(ic_X['assessed']))/np.std(ic_X['assessed'])
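
A quick check that the standardization behaved as intended (mean approximately 0, standard deviation approximately 1):

## Sanity check on the standardized predictor
print(np.average(X_assess), np.std(X_assess))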

Part 2 - Mini-batch Gradient Descent¶

In mini-batch gradient descent the main idea is to segment the training set into batches, then perform a gradient descent update using only a single batch at a time.

We can modify our previous gradient descent code to sample batches within the training loop:

In [10]:
## Slightly modified cost function
def cost_function(X, y, w):
    n = y.size
    pred_y = np.dot(X,w)
    err = y - pred_y
    cost = (1./n)*np.dot(np.transpose(err),err)                 
    return cost    

## Mini-batch gradient descent function
import random
def batch_grad_descent(X, y, w, alpha, n_iter, n_batches):
    costs = np.zeros(n_iter)
    n = len(y) 
    
    for i in range(n_iter):
        ## Shuffle all row indices, then split them into near-equal batches
        batch_idx = np.array_split(random.sample(range(n), n), n_batches)
        temp_cost = 0.0
        
        for bi in range(n_batches):
            batch_y = y.iloc[batch_idx[bi]]
            batch_X = X[batch_idx[bi],:]
            pred_y = np.dot(batch_X,w)
            grad = (-2./batch_y.size)*np.dot(np.transpose(batch_X), (batch_y - pred_y))
            w = w - alpha*grad
            temp_cost += cost_function(batch_X, batch_y, w)
        
        costs[i] = temp_cost
        
    return w, costs

## Run and plot the costs by epoch
bgdres = batch_grad_descent(X=ic_Xs,y=ic_y,w=np.zeros(12),n_batches = 5, alpha=0.001, n_iter=200)
plt.plot(np.linspace(0, 200, 200),bgdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
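
The batching line above relies on np.array_split, which divides a shuffled list of row indices into nearly equal pieces. A small standalone example (with made-up sizes) shows the idea:

## How the batching works: split a shuffled index list into 3 near-equal batches
idx = random.sample(range(10), 10)   # a random permutation of 0..9
print(np.array_split(idx, 3))        # three arrays of sizes 4, 3, 3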

Question #3: Using the functions provided in this section as a guide, modify the gradient descent functions you created in Question #2 to perform mini-batch gradient descent. Test these functions by displaying a cost curve and showing that it looks similar (with noise) to the one produced by the functions you created in Question #2. Note: You might need to modify the learning rates to get these curves to appear similar.

$~$

Part 3 - Stochastic Gradient Descent and Logistic Regression¶

Logistic regression is an example of a parametric model that we can easily optimize using stochastic gradient descent and the cross-entropy cost function. This approach is convenient because the cost function has the form: $$\text{Cost} = -\tfrac{1}{n}\sum_{i=1}^{n}\big(y_i\cdot \log(\hat{y}_i) + (1-y_i)\cdot \log(1-\hat{y}_i)\big)$$ Under stochastic gradient descent we do not need to worry about the summation, as only one training observation is considered at a time.

This scenario also affords us an opportunity to exploit the chain rule to simplify the gradient calculation, as $\hat{y}_i = g(z_i)$, where $g()$ is the sigmoid function and $z_i = \mathbf{x}_i^T\hat{\mathbf{w}}$.

Question #4: Use the chain rule to show that $\frac{\partial \text{Cost}_i}{\partial \mathbf{w}} = -y_i\mathbf{x}_i\big(\frac{g(z_i)(1-g(z_i))}{g(z_i)}\big) + (1-y_i)\mathbf{x}_i\big(\frac{g(z_i)(1-g(z_i))}{1-g(z_i)}\big)$. You may use the fact that the derivative of the sigmoid function, $g()$, is $g()\cdot(1-g())$ without establishing this result "from scratch". If you're curious, this Stack Exchange page provides a thorough derivation of the result.
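
Although Question #4 asks for a symbolic derivation, the sigmoid-derivative fact quoted above is easy to verify numerically. The sketch below (using an arbitrary evaluation point) compares a finite-difference estimate of $g'(z)$ to $g(z)(1-g(z))$:

## Numerical check: the derivative of the sigmoid g(z) is g(z)*(1 - g(z))
def g(z):
    return 1/(1 + np.exp(-z))

z0, eps = 0.7, 1e-6   # arbitrary evaluation point
numeric = (g(z0 + eps) - g(z0 - eps))/(2*eps)
print(np.isclose(numeric, g(z0)*(1 - g(z0))))  # should print True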

$~$

Implementing Stochastic Gradient Descent¶

The result you established in Question #4 simplifies to a gradient of $\mathbf{x}_i(\hat{y}_i - y_i)$ (you do not need to show this). This result is used to create the stochastic gradient descent function provided below:

In [11]:
## Cross-entropy cost for 1 data-point
def cost_function(x, y, w):
    z = np.dot(x,w)
    pred_y = 1/(1 + np.exp(-z))
    cost = -y*np.log(pred_y) - (1-y)*np.log(1-pred_y)               
    return cost               

## Stochastic Gradient Descent
def sgd(X, y, w, alpha, n_iter):
    n = len(y)
    costs = np.zeros(n_iter)
    
    for i in range(n_iter):
        idx = random.sample(range(n), n)
        temp_cost = 0.0
        
        for si in range(n):
            random_y = y.iloc[idx[si]]
            random_x = np.append(1,X[idx[si],:])
            z = np.dot(random_x,w)
            pred_y = 1/(1 + np.exp(-z))
            grad = random_x*(pred_y - random_y)
            w = w - alpha*grad
            temp_cost += cost_function(random_x, random_y, w)
        
        costs[i] = temp_cost
        
    return w, costs

We can test this function on a binary outcome derived from the Iowa City Home Sales data, and we can see that the cost is effectively minimized after relatively few training epochs.

In [12]:
## Set up binary outcome and scaled predictors
ic_y_bin = (ic['assessed'] > ic['sale.amount']).astype(int)
ic_Xs_bin = StandardScaler().fit_transform(ic.select_dtypes("number").drop(['sale.amount', 'assessed'],axis=1))

## Some prelims to input into SGD
p = np.shape(ic_Xs_bin)[1]+1
n = len(ic_y_bin)
nit = 100

## Run 100 epochs of SGD and display the results
gdres = sgd(X=ic_Xs_bin,y=ic_y_bin,w=np.zeros(p),alpha=0.0025, n_iter = nit)
plt.plot(np.linspace(0, nit, nit),gdres[1])
plt.show()
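
As an optional follow-up (a sketch, not part of the lab's required output), the fitted weights can be turned into predicted probabilities by prepending the same leading 1 that sgd() appends to each observation:

## Convert the fitted weights into predicted probabilities and a training accuracy
X_with_bias = np.column_stack([np.ones(n), ic_Xs_bin])   # leading 1 matches np.append(1, ...) in sgd()
pred_prob = 1/(1 + np.exp(-np.dot(X_with_bias, gdres[0])))
pred_class = (pred_prob > 0.5).astype(int)
print(np.mean(pred_class == ic_y_bin))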

Question #5: For this question you should use the sgd() function provided earlier in this section.

  • Part A: Consider an observation belonging to the positive class (so $y_i = 1$). What is the cost associated with this observation if the current model estimates its prediction score as $z_i = 2.5$? If the model's weights are updated such that the prediction score increases to $z_i = 3$, how would this impact the cost?
  • Part B: In the function provided in this section, sgd(), what is returned by random.sample(range(n), n)? Why doesn't the sgd() function just randomly sample the index of a single data point to use for each gradient update?
  • Part C: Notice the sgd() function includes the command np.append(1,X[idx[si],:]). Why is the value 1 being appended to the front of the vector of data?
  • Part D: This section's example ran 100 epochs of stochastic gradient descent. Will the resulting weight estimates perfectly match those produced by the LogisticRegression() function (without any regularization)? If not, will running more training epochs produce closer estimates? Briefly explain.
  • Part E: Change the learning rate from alpha = 0.0025 to alpha = 0.001 and notice the results. Why might this make the cost more consistent/smooth towards the right of the graph? Could there be a downside to this? Briefly explain.