Lab #12 - Linear Regression and Gradient Descent

This lab covers two major topics:

  1. The sklearn implementation of linear regression, which is straightforward and largely uses the same syntax as other algorithms we've studied.
  2. Gradient descent algorithms and their behavior.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

Examples throughout the lab will use the Iowa City home sales data that we've seen in several of our previous labs:

In [2]:
## Read data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## Split training and testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)

## Create target and keep only the numeric predictors (though we could use one-hot encoding if we wanted to)
train_y = train['sale.amount']
train_X = train.select_dtypes("number").drop('sale.amount',axis=1)

Part 1 - Linear Regression in sklearn

Linear regression is performed using LinearRegression() in sklearn. This estimator is compatible with pipelines and all of the pre-processing tools we've been using:

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline 

## Simple pipeline using linear regression
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LinearRegression())
])

Compared to other algorithms, a benefit of linear regression is that we can easily determine the impact of each individual predictor by looking at its weight in the model's linear combination:

In [4]:
## Fit the pipeline to the training data
lin_mod = pipe.fit(train_X, train_y)

## print the model's coefficients
lin_mod.named_steps['model'].coef_
Out[4]:
array([-1.25805804e+03,  1.92884548e+03,  6.72865255e+02, -1.92085650e+03,
       -7.95482914e+01, -3.12181937e+03,  1.97731725e+03,  8.24129242e+02,
        1.82409979e+03,  3.74282983e+02,  2.29862180e+00,  8.78235016e+04])

While standardization is not an essential step in linear regression, it has the benefit of making these estimated weights more directly comparable. That is, after standardizing, each weight represents the estimated effect of a one standard deviation increase in the corresponding predictor on the model's outcome.
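
If we'd rather see the weights in each predictor's original units, one option is to divide the standardized weights by the standard deviations stored in the fitted scaler. Below is a minimal sketch of this, assuming the fitted lin_mod pipeline and train_X from above:

## Convert the standardized weights back to each predictor's original units
scaler_step = lin_mod.named_steps['scaler']
orig_units_w = lin_mod.named_steps['model'].coef_ / scaler_step.scale_
print(pd.Series(orig_units_w, index=train_X.columns))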

You should also recognize that the model's estimated bias (intercept) is stored in a separate attribute:

In [5]:
## Print the estimated intercept (bias)
lin_mod.named_steps['model'].intercept_
Out[5]:
180461.83896940417

Linear regression does not have any tuning parameters, but we can still look at an example of it being used within a cross-validated grid search by exploring models with and without an intercept (bias):

In [6]:
## Simple grid search
from sklearn.model_selection import GridSearchCV
parms = {'model__fit_intercept': [True, False]}
grid_res = GridSearchCV(pipe, parms, cv=5, scoring='neg_root_mean_squared_error').fit(train_X, train_y)
grid_res.best_estimator_
Out[6]:
Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])
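
The printed pipeline is the refit best estimator. To see which intercept setting won and the cross-validated scores behind that choice, we could also inspect the grid search object directly; a short sketch using attributes of grid_res from above:

## Which parameter setting won, and the mean CV score (negative RMSE) for each candidate
print(grid_res.best_params_)
print(grid_res.cv_results_['params'])
print(grid_res.cv_results_['mean_test_score'])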

Question 1: Suppose we'd like to model a home's assessed value rather than its sale price.

  • Part A: Create two objects, train_X_assessed and train_y_assessed where the latter records each home's assessed value and the former is a matrix containing the predictors: area.living, bedrooms, and area.lot.
  • Part B: Create a pipeline that includes min-max scaling followed by a Yeo-Johnson normalizing transformation on the model's predictors as preprocessing steps prior to fitting a linear regression model in the pipeline's final step.
  • Part C: Use a cross-validated grid search with RMSE as the scoring metric to evaluate whether the use of a normalizing transformation on these predictors improves performance relative to an approach that doesn't involve any normalization.
  • Part D: Print the estimated weights of the best model in Part C. Is the predictor with the largest estimated weight the most important contributor in this model? Hint: think about the scale of each predictor and how that influences your interpretation of the corresponding weight parameter.

Question 2: Next you'll explore how linear regression compares with other methods we've learned. Throughout this question you should use RMSE as your scoring metric.

  • Part A: Using the objects you created in Part A of Question 1, use a cross-validated grid search to find an appropriate k-nearest neighbors model for these data. Be sure to include any preprocessing steps you believe to be necessary for this approach. A satisfactory answer should explore at least a few different combinations of tuning parameters.
  • Part B: Using the objects you created in Part A of Question 1, use a cross-validated grid search to find an appropriate xgboost model for these data. Be sure to include any preprocessing steps you believe to be necessary for this approach. A satisfactory answer should explore at least a few different combinations of tuning parameters.
  • Part C: Perform a final cross-validated grid search that compares the best models from Parts A and B of this question with the best model you identified in Question 1 (linear regression either with or without a normalizing transformation).
  • Part D: Find the RMSE of each model you considered in Part C on the test set.

Part 2 - Optimization and Gradient Descent

The weights and biases in regression are estimated such that they minimize a pre-determined cost function. For numeric outcomes, we will focus our attention on squared error cost:

$$Cost = \tfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

We can express the squared error cost function in matrix form:

$$Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}})$$

In this notation:

  • $\mathbf{y}$ is a vector of length $n$
  • $\mathbf{X}$ is an $n$ by $p$ matrix of predictors (including a column of 1s for the bias term)
  • $\hat{\mathbf{w}}$ is a vector of length $p$ representing the estimated weights

Below is a Python function that calculates the cost associated with a given vector of weights and data:

In [7]:
## Define cost function
def cost_function(X, y, w):
    n = len(y)
    pred_y = np.dot(X,w)
    err = y - pred_y
    cost = (1./n)*np.dot(np.transpose(err),err)                 
    return cost    

Question 3: To explore the cost function defined above, create a line graph showing the cost associated with a sequence of 20 different values of $w_1$ ranging from 2 to 4, using the simple data set created in the cell below.

In [8]:
## Simple data set used for Question 3
np.random.seed(0)
x = np.linspace(-10, 10, 100)
y = 3 * x + np.random.normal(0, 3, 100)

Recall that the gradient is a vector of partial derivatives with respect to each unknown parameter of a function. In our context, these unknowns are the weight parameters in the cost function defined above.

Gradient descent is an optimization algorithm that uses the gradient to sequentially update a model's unknown parameters in a way that reduces the cost function:

$$\hat{\mathbf{w}}^{(j)} = \hat{\mathbf{w}}^{(j-1)} - \alpha \tfrac{\partial}{\partial \mathbf{w}}\text{Cost}\big(\hat{\mathbf{w}}^{(j-1)}\big)$$

These updates are repeated (i.e., $j$ is incremented) until the cost function reaches a minimum (within an acceptable tolerance). However, for simplicity, we will implement gradient descent for a fixed number of iterations.

The gradient (in matrix form) for linear regression looks like:

$$\tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}})$$

This is found using linear algebra and matrix calculus:

$$Cost = \tfrac{1}{n}(\mathbf{y}^T\mathbf{y}-2\mathbf{y}^T\hat{\mathbf{y}} + \hat{\mathbf{y}}^T\hat{\mathbf{y}})$$

$$Cost = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}(-2\mathbf{y}^T\mathbf{X}\hat{\mathbf{w}} + (\mathbf{X}\hat{\mathbf{w}})^T\mathbf{X}\hat{\mathbf{w}})$$

$$Cost = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}(-2\hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{y} + \hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}})$$

The steps shown above simply rearrange the cost function. After we've reorganized it, we can differentiate with respect to $\hat{\mathbf{w}}$ (the first term doesn't involve $\hat{\mathbf{w}}$, so it drops out), which yields:

$$Gradient = \tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$$
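
If you'd like to convince yourself this formula is correct, one option is a quick finite-difference check against the cost_function() defined earlier. Below is a sketch on a small simulated data set; the data, seed, and step size are illustrative choices, not part of the lab:

## Finite-difference check of the analytic gradient (illustrative sketch)
rng = np.random.default_rng(1)
X_chk = rng.normal(size=(50, 3))                          # small simulated design matrix
y_chk = X_chk @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
w_chk = rng.normal(size=3)                                # arbitrary weight vector

## Analytic gradient from the formula above
analytic = (-2./50) * X_chk.T @ (y_chk - X_chk @ w_chk)

## Central-difference approximation of each partial derivative
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    w_hi, w_lo = w_chk.copy(), w_chk.copy()
    w_hi[j] += eps
    w_lo[j] -= eps
    numeric[j] = (cost_function(X_chk, y_chk, w_hi) - cost_function(X_chk, y_chk, w_lo)) / (2*eps)

print(np.allclose(analytic, numeric))    # the two should agree closely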

Below is a function that implements gradient descent (for a fixed number of update iterations) using this result:

In [9]:
def grad_descent(X, y, w, alpha, n_iter):
    costs = np.zeros(n_iter)
    n = len(y) 
    
    for i in range(n_iter):
        pred_y = np.dot(X,w)
        err = y - pred_y
        grad = (-2./n)*np.dot(np.transpose(X), err)
        w = w - alpha*grad
        costs[i] = cost_function(X, y, w)
        
    return w, costs
In [10]:
## Run gradient descent (for the original IC homes data)
train_y = train['sale.amount']
train_X = train.select_dtypes("number").drop('sale.amount',axis=1)
train_Xs = StandardScaler().fit_transform(train_X)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

The cost function appears to have reached a minimum. Let's now compare the weight estimates with those found by sklearn's linear regression implementation:

In [11]:
## sklearn's linear regression
lmf = LinearRegression(fit_intercept=False).fit(train_Xs, train_y)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03  1.92884548e+03  6.72865255e+02 -1.92085650e+03
 -7.95482914e+01 -3.12181937e+03  1.97731725e+03  8.24129242e+02
  1.82409979e+03  3.74282983e+02  2.29862181e+00  8.78235016e+04]
[-1853.74094527   423.88972963   648.24511176 -2379.89586956
  2067.56723688 -3190.55684732  2191.25895646  5493.89295523
  1897.97999934   489.9667822    140.69348536 83440.4859917 ]

Unfortunately, the weights didn't quite reach their closed-form solution (though many of them got fairly close). But if we increase the number of iterations, they eventually will:

In [12]:
## Run gradient descent (again)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=2000)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03  1.92884548e+03  6.72865255e+02 -1.92085650e+03
 -7.95482914e+01 -3.12181937e+03  1.97731725e+03  8.24129242e+02
  1.82409979e+03  3.74282983e+02  2.29862181e+00  8.78235016e+04]
[-1.25805804e+03  1.92884548e+03  6.72865255e+02 -1.92085650e+03
 -7.95482914e+01 -3.12181937e+03  1.97731725e+03  8.24129242e+02
  1.82409979e+03  3.74282983e+02  2.29862181e+00  8.78235016e+04]

Next, let's see what happens if we double the learning rate:

In [13]:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.2, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

As shown in the graph above, setting alpha=0.2 allows us to find weights that minimize the cost function using fewer gradient descent iterations (epochs). In general, setting an appropriate learning rate can involve some trial and error, especially if the predictors haven't been standardized.
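
One simple way to take some of the guesswork out of this is to run a handful of candidate learning rates side by side and compare their cost curves. A short sketch is below; the grid of alphas is an arbitrary illustrative choice:

## Compare cost curves for several candidate learning rates
for a in [0.05, 0.1, 0.2]:
    _, costs = grad_descent(X=train_Xs, y=train_y, w=np.zeros(12), alpha=a, n_iter=100)
    plt.plot(costs, label="alpha = " + str(a))
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.legend()
plt.show()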

For example, let's now increase the learning rate to something even more aggressive, such as $\alpha = 0.3$:

In [14]:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.3, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

For $\alpha = 0.3$, each gradient descent update step overshoots the minimum by such a large degree that the new weights actually cause the cost function to increase. This problem compounds as more iterations are run, eventually culminating in the divergent behavior shown on the right side of the above graph.
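
The learning rate at which this divergence kicks in isn't arbitrary: for this cost function, the update step only shrinks the error when $\alpha < n / \lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of $\mathbf{X}^T\mathbf{X}$. Below is a quick sketch that checks this threshold for our standardized predictors (it isn't needed for the questions that follow):

## Largest learning rate that still converges for these standardized predictors
n = train_Xs.shape[0]
lam_max = np.linalg.eigvalsh(np.dot(np.transpose(train_Xs), train_Xs)).max()
print(n / lam_max)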

Question 4: Consider a linear regression model with a single predictor and the bias term fixed at zero (i.e., no intercept):

$\mathbf{y} = w_1\mathbf{x}_1 + \epsilon$

Gradient descent was demonstrated for this model in our lecture slides.

  • Part A: Write out the cost function (squared error loss) for this model, and find its derivative with respect to $w_1$.
  • Part B: Using the previously given cost and gradient descent functions as examples, write your own Python functions that compute the cost and perform gradient descent for this model
  • Part C: Apply your gradient descent function to the Iowa City home sales data using assessed as the only predictor of sale price. You may use the code below to create this predictor. Print the final estimate of the weight, $\hat{w}_1$.
  • Part D: Compare your estimated weight from Part C with the results of LinearRegression(). Be sure to use the argument fit_intercept=False to avoid fitting an intercept/bias parameter. If these weights are substantially different, try manipulating your learning rate (or number of iterations) until they are similar.

Hint: The object X_assess created below is a pandas Series; you may want it to be a numpy array of a particular shape when using it in LinearRegression() in Part D.

In [15]:
## Standardized version of the only predictor "assessed"
X_assess = (train_X['assessed'] - np.average(train_X['assessed']))/np.std(train_X['assessed'])
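
For reference, one common way to give a Series the two-dimensional shape that LinearRegression() expects is shown below (the variable name X_assess_2d is just an illustrative choice):

## Reshape the Series into an (n, 1) numpy array for use with sklearn
X_assess_2d = X_assess.to_numpy().reshape(-1, 1)
print(X_assess_2d.shape)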