Lab #4 (part 1) - Linear Regression and Optimization¶

The focus of this lab will be on linear regression. Since the method and its sklearn implementation are fairly straightforward, we'll devote much of our attention to understanding the optimization of structural model parameters and the gradient descent algorithm. We will also compare the performance of linear regression and k-nearest neighbors (kNN) using pipelines.

As usual, we'll begin by loading several familiar libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

Our examples will use the Iowa City Home Sales data (first introduced in Lab 2, part 1):

In [2]:
## Read data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## Split to training and testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)

## Create target
train_y = train['sale.amount']

## Create predictor matrix (numeric predictors only for simplicity, but we could use OHE if we wanted to)
train_X = train.select_dtypes("number").drop('sale.amount',axis=1)

Part 1 - Linear Regression¶

To stay consistent with the workflow we developed when using kNN, we'll begin by setting up a pipeline.

Standardization/scaling aren't nearly as important for regression as they are for some methods, but we'll still add a scaling step to our pipeline since we'll later compare linear regression with kNN (where scaling is very important).

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline 

## Simple pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LinearRegression())
])

Next, let's fit a linear regression model to the entire training set to predict sale price. Then, let's print the estimated weights (coefficients):

In [4]:
## Fit the pipeline to the training data
lin_mod = pipe.fit(train_X, train_y)

## print the model's coefficients
lin_mod.named_steps['model'].coef_
Out[4]:
array([-1.25805804e+03,  1.92884548e+03,  6.72865255e+02, -1.92085650e+03,
       -7.95482914e+01, -3.12181937e+03,  1.97731725e+03,  8.24129242e+02,
        1.82409979e+03,  3.74282983e+02,  2.29862180e+00,  8.78235016e+04])

Because coef_ is an attribute of the object created by LinearRegression() (which was a single step in our pipeline), we must use the named_steps of our pipeline to access the regression coefficients (recognize that we gave the name 'model' to the pipeline step that uses LinearRegression()).
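
The same named_steps pattern works for any step in the pipeline, not just the model. For instance, a fitted StandardScaler stores the column means and standard deviations it used, so (as a quick aside, not part of the lab workflow) we could inspect them like this:

## Inspect the fitted scaler through named_steps (mean_ and scale_ are StandardScaler attributes)
print(lin_mod.named_steps['scaler'].mean_)    ## per-column means used for centering
print(lin_mod.named_steps['scaler'].scale_)   ## per-column standard deviations used for scaling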

Next, notice that 12 weight estimates are returned, one for each of the 12 columns in our predictor matrix, train_X. To find the model's bias term (intercept), we must ask for it separately:

In [5]:
## print the estimated intercept (bias)
lin_mod.named_steps['model'].intercept_
Out[5]:
180461.83896940417

In this simple example we did not use our pipeline to optimize any tuning parameters, but it's important to understand the syntax needed to access model coefficients (or other attributes) when we do.

For example, when GridSearchCV() is used for model tuning, we need to use the best_estimator_ attribute of the search results as the first step in obtaining the estimated weights:

In [6]:
## Set up a simple grid to compare two scaling methods
from sklearn.preprocessing import RobustScaler
parms = {'scaler': [StandardScaler(), RobustScaler()]}

## Perform grid search
from sklearn.model_selection import GridSearchCV
lin_mod2 = GridSearchCV(pipe, parms, cv=5, scoring='neg_root_mean_squared_error').fit(train_X, train_y)

## Print the coefficients of the top performing model
lin_mod2.best_estimator_.named_steps['model'].coef_
Out[6]:
array([-1.25805804e+03,  1.92884548e+03,  6.72865255e+02, -1.92085650e+03,
       -7.95482914e+01, -3.12181937e+03,  1.97731725e+03,  8.24129242e+02,
        1.82409979e+03,  3.74282983e+02,  2.29862180e+00,  8.78235016e+04])

It's also worth knowing that GridSearchCV (and other search functions that use cross-validation) will fit the best model to the entire training set after evaluation has been done via cross-validation (so these weights match the ones we saw earlier).
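
Because that refit has already happened, the search object itself can be used for prediction (it delegates to best_estimator_), and best_params_ shows which parameter combination won. A brief sketch (output not shown):

## Which parameter combination was selected by the grid search
print(lin_mod2.best_params_)

## Predictions from the refit best estimator (equivalent to lin_mod2.best_estimator_.predict)
print(lin_mod2.predict(train_X.head()))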

If we wanted to know the cross-validated root-mean squared error (RMSE) of our model, we could find it in the grid search results:

In [7]:
## Using the cv_results_ attribute of GridSearchCV
lin_mod2.cv_results_['mean_test_score'][0]
Out[7]:
-19833.449405441286

We could also find the same estimate using the cross_validate function:

In [8]:
## Estimate the cross-validated RMSE using cross_validate
from sklearn.model_selection import cross_validate
cv_res = cross_validate(lin_mod2, train_X, train_y, cv = 5, scoring='neg_root_mean_squared_error')
np.average(cv_res['test_score'])
Out[8]:
-19833.44940544128

Based upon these measures, we'd estimate this model's predictions will be off by an average of approximately $19,833 if it were applied to new data.
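
Keep in mind that 'neg_root_mean_squared_error' is reported as a negative number (so that larger values always mean better performance); flipping the sign recovers the RMSE on the original dollar scale. A small sketch using the cross_validate results from above:

## Flip the sign of the negated scores to get the RMSE in dollars
rmse = -np.average(cv_res['test_score'])
print(rmse)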

Question #1¶

  • Part A) Suppose we'd like to model a home's assessed value (rather than its sale price) using its attributes. Create the objects train_X_assessed and train_y_assessed where the latter is the home's assessed value and the former is a matrix containing the predictors: bedrooms, area.lot, and area.garage1.
  • Part B) Create a pipeline that applies the Yeo-Johnson normalizing transformation and min-max scaling prior to fitting a linear regression model that uses all of the features in train_X_assessed to predict the assessed value of a home.
  • Part C) Use 5-fold cross-validation to estimate the RMSE of the model produced by the pipeline you created in Part B.

Part 2 - Comparison with $k$-Nearest Neighbors¶

To compare different machine learning approaches, such as kNN and regression, we must use pipelines and cross-validation to ensure an equitable comparison:

In [9]:
## Set up simple pipeline that does scaling and modeling
from sklearn.neighbors import KNeighborsRegressor
pipe = Pipeline([
('scaler', StandardScaler()),
('model', KNeighborsRegressor())
])

## Set up parameter grid - notice we are considering two different models (and two different scalers)
parms = {'scaler': [StandardScaler(), RobustScaler()],
         'model': [KNeighborsRegressor(n_neighbors=15,weights='distance'), LinearRegression()],
        }

## Use cross-validated grid search and print the best model
mod_comp = GridSearchCV(pipe, parms, cv=5, scoring='neg_mean_squared_error').fit(train_X, train_y)
print(mod_comp.best_estimator_)
Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])

Since LinearRegression and KNeighborsRegressor have different tuning parameters, a fairer approach would first optimize the kNN tuning parameters and use the best combination in a subsequent comparison across model classes. For simplicity, our example used n_neighbors=15 and weights='distance' as if these were the optimal combination identified in an earlier grid search.
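
A sketch of that two-stage approach is shown below; the candidate values for n_neighbors and weights are illustrative choices, not values taken from an actual earlier search:

## Stage 1: tune the kNN hyperparameters on their own
knn_pipe = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsRegressor())])
knn_parms = {'model__n_neighbors': [5, 10, 15, 25, 50],
             'model__weights': ['uniform', 'distance']}
knn_search = GridSearchCV(knn_pipe, knn_parms, cv=5,
                          scoring='neg_root_mean_squared_error').fit(train_X, train_y)

## Stage 2: compare the tuned kNN model against linear regression
comp_parms = {'model': [knn_search.best_estimator_.named_steps['model'], LinearRegression()]}
comp_search = GridSearchCV(knn_pipe, comp_parms, cv=5,
                           scoring='neg_root_mean_squared_error').fit(train_X, train_y)
print(comp_search.best_estimator_)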

Additionally, it's worth knowing that the default scoring measure for numeric outcomes is the coefficient of determination, $R^2$. If you do not specify the scoring argument in GridSearchCV, this default will be used to determine the best estimator.

In [10]:
## Default scorer:
mod_comp = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y)
mod_comp.best_score_
Out[10]:
0.9440104734478318

Question #2¶

  • Part A) Set up a parameter grid and use a grid search to tune the values of the parameters n_neighbors and weights in a $k$-nearest neighbors model that uses train_X_assessed to predict train_y_assessed (these were created in Question #1). Your search should consider at least 5 different choices of n_neighbors that are sensibly chosen. You should use 'neg_root_mean_squared_error' as your scoring criterion.
  • Part B) Use pipelines and 5-fold cross-validation to determine whether the top performing $k$-nearest neighbors model from Part A does better than linear regression. You should again use 'neg_root_mean_squared_error' as your scoring criterion.
  • Part C) Print the cross-validated RMSE of the top performing model found in Part B.

Part 3 - Optimization and Gradient Descent¶

Regression is our first example of a model with a structured set of unknown parameters that must be learned from the data.

These parameters are estimated such that they minimize a pre-determined cost function. For numeric outcomes, this will be squared error loss:

$$Cost = \tfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

Realize that we can write squared error loss in matrix form:

$$Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}})$$

In this notation:

  • $\mathbf{y}$ is a vector of length $n$
  • $\mathbf{X}$ is an $n$ by $p$ matrix of predictors (including a column of 1s for the bias term; see the sketch below)
  • $\hat{\mathbf{w}}$ is a vector of length $p$ representing the estimated weights
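
If we wanted the bias term to be estimated as part of $\hat{\mathbf{w}}$, we would need to append that column of 1s to the predictor matrix ourselves. A minimal sketch is below; in the examples that follow we standardize the predictors and omit this column, so only the 12 weights are estimated:

## A design matrix with an explicit leading column of 1s for the bias term
X_with_bias = np.column_stack([np.ones(len(train_X)), train_X.to_numpy()])
print(X_with_bias.shape)   ## (n, 13): the bias column plus the 12 predictors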

Below is a Python function that calculates the cost associated with a given dataset and vector of weights:

In [11]:
## Define cost function
def cost_function(X, y, w):
    n = len(y)
    pred_y = np.dot(X,w)
    err = y - pred_y
    cost = (1./n)*np.dot(np.transpose(err),err)                 
    return cost               

Recall that the gradient is a vector of partial derivatives with respect to each unknown parameter of a function. In our context, these unknowns are the weight parameters in the cost function defined above.

Gradient descent is an optimization algorithm that uses the gradient to sequentially update a model's unknown parameters in a way that reduces the cost function:

$$\hat{\mathbf{w}}^{(j)} = \hat{\mathbf{w}}^{(j-1)} - \alpha \tfrac{\partial}{\partial \mathbf{w}}\text{Cost}\big(\hat{\mathbf{w}}^{(j-1)}\big)$$

These updates are repeated (ie: $j$ is incremented) until the cost function reaches a minimum (within an acceptable tolerance). However, for simplicity, we will implement gradient descent for a fixed number of iterations.

The gradient (in matrix form) for linear regression looks like:

$$\tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}})$$

This is found using linear algebra and matrix calculus:

$$Cost = \tfrac{1}{n}(\mathbf{y}^T\mathbf{y}-2\mathbf{y}^T\hat{\mathbf{y}} + \hat{\mathbf{y}}^T\hat{\mathbf{y}})$$

$$Cost = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}(-2\mathbf{y}^T\mathbf{X}\hat{\mathbf{w}} + (\mathbf{X}\hat{\mathbf{w}})^T\mathbf{X}\hat{\mathbf{w}})$$

$$Cost = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}(-2\hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{y} + \hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}})$$

The steps shown above simply rearrange the cost function. After we've reorganized it, we can differentiate it with respect to $\hat{\mathbf{w}}$, which yields:

$$Gradient = \tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$$

I encourage you to review this matrix calculus cheatsheet if these steps don't make sense.
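
If you'd like to double-check the matrix result numerically, one option is to compare it against a finite-difference approximation of the gradient. The sketch below is only a sanity check (it fabricates a tiny random dataset and reuses cost_function); it isn't part of the lab workflow:

## Numerically verify the gradient formula on a small simulated dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = rng.normal(size=20)
w_demo = np.zeros(3)

## Analytic gradient: (-2/n) * X^T (y - X w)
analytic = (-2./len(y_demo))*np.dot(np.transpose(X_demo), y_demo - np.dot(X_demo, w_demo))

## Central finite-difference approximation, one coordinate at a time
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    w_plus, w_minus = w_demo.copy(), w_demo.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (cost_function(X_demo, y_demo, w_plus) - cost_function(X_demo, y_demo, w_minus))/(2*eps)

print(np.round(analytic - numeric, 6))   ## differences should be near zero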

Below is a function that implements gradient descent (for a fixed number of update iterations) using this result:

In [12]:
def grad_descent(X, y, w, alpha, n_iter):
    costs = np.zeros(n_iter)
    n = len(y) 
    
    for i in range(n_iter):
        pred_y = np.dot(X,w)
        err = y - pred_y
        grad = (-2./n)*np.dot(np.transpose(X), err)
        w = w - alpha*grad
        costs[i] = cost_function(X, y, w)
        
    return w, costs

Let's confirm that the method works as intended (ie: it successfully estimates weights that minimize the cost function):

In [13]:
## Run gradient descent
train_Xs = StandardScaler().fit_transform(train_X)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

The cost function appears to have reached a minimum. Let's now compare the weight estimates with those found by sklearn's linear regression implementation:

In [14]:
lmf = LinearRegression(fit_intercept=False).fit(train_Xs, train_y)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03  1.92884548e+03  6.72865255e+02 -1.92085650e+03
 -7.95482914e+01 -3.12181937e+03  1.97731725e+03  8.24129242e+02
  1.82409979e+03  3.74282983e+02  2.29862181e+00  8.78235016e+04]
[-1853.74094527   423.88972963   648.24511176 -2379.89586956
  2067.56723688 -3190.55684732  2191.25895646  5493.89295523
  1897.97999934   489.9667822    140.69348536 83440.4859917 ]

Unfortunately, the weights didn't quite reach their closed-form solution (though many of them got fairly close).

Fortunately, we can run more iterations to confirm that they eventually will:

In [15]:
## Run gradient descent
train_Xs = StandardScaler().fit_transform(train_X)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=2000)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03  1.92884548e+03  6.72865255e+02 -1.92085650e+03
 -7.95482914e+01 -3.12181937e+03  1.97731725e+03  8.24129242e+02
  1.82409979e+03  3.74282983e+02  2.29862181e+00  8.78235016e+04]
[-1.25805804e+03  1.92884548e+03  6.72865255e+02 -1.92085650e+03
 -7.95482914e+01 -3.12181937e+03  1.97731725e+03  8.24129242e+02
  1.82409979e+03  3.74282983e+02  2.29862181e+00  8.78235016e+04]

Now let's explore what happens if we double the learning rate, $\alpha$:

In [16]:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.2, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()

As shown in the graph above, setting alpha=0.2 allows us to find weights that minimize the cost function using fewer gradient descent iterations (epochs). In general, setting an appropriate learning rate can involve some trial and error, especially if the predictors haven't been standardized.
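
One simple way to handle that trial and error is to run a few candidate learning rates and compare the cost each one reaches after the same number of iterations. A brief sketch (the candidate values are arbitrary):

## Compare several candidate learning rates after 100 iterations each
for alpha in [0.01, 0.05, 0.1, 0.2]:
    w_a, costs_a = grad_descent(X=train_Xs, y=train_y, w=np.zeros(12), alpha=alpha, n_iter=100)
    print(alpha, costs_a[-1])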

To further illustrate this notion, suppose we increase the learning rate to something even more aggressive, such as $\alpha = 0.3$:

In [17]:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.3, n_iter=100)

## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
Out[17]:
Text(0, 0.5, 'Cost')

The graph above illustrates that alpha=0.3 is a learning rate that is too large. Each gradient descent update step overshoots the minimum by such a large degree that the new weights actually cause the cost function to increase. This problem compounds as more iterations are run, eventually culminating in the divergent behavior shown on the right side of the above graph.
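
Earlier we noted that, in practice, gradient descent would be run until the cost stops improving (within an acceptable tolerance) rather than for a fixed number of iterations. Below is a minimal sketch of that stopping rule; the names grad_descent_tol, tol, and max_iter are illustrative and not part of the lab code:

## Gradient descent that stops once the relative improvement in cost falls below tol
def grad_descent_tol(X, y, w, alpha, tol=1e-8, max_iter=10000):
    n = len(y)
    prev_cost = cost_function(X, y, w)
    for i in range(max_iter):
        err = y - np.dot(X, w)
        w = w - alpha*(-2./n)*np.dot(np.transpose(X), err)
        new_cost = cost_function(X, y, w)
        if abs(prev_cost - new_cost) < tol*prev_cost:
            break
        prev_cost = new_cost
    return w, i + 1

## Example usage with the standardized predictors from earlier:
## w_tol, n_used = grad_descent_tol(train_Xs, train_y, np.zeros(12), alpha=0.1)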

Question #3 (gradient descent)¶

Consider a linear regression model with a single predictor and bias term fixed at zero (ie: no intercept).

$\mathbf{y} = w_1\mathbf{x}_1 + \epsilon$

Gradient descent was demonstrated for this model in our lecture slides.

  • Part A) Write out the cost function (squared error loss) for this model, and find its derivative with respect to $w_1$
  • Part B) Using the previously given cost and gradient descent functions as examples, write your own Python functions that compute the cost and perform gradient descent for this model
  • Part C) Apply your gradient descent function to the Iowa City Home Sales dataset using assessed as the only predictor of sale price. You may use the code below to create this predictor. Print the final estimate of the only estimated weight, $\hat{w}_1$.
  • Part D) Compare your estimated weight from Part C with the results of LinearRegression(). Be sure to use the argument fit_intercept=False to avoid fitting an intercept/bias parameter. If these weights are substantially different, try manipulating your learning rate (or number of iterations) until they are similar.

Hint: The object X_assess created below is a pandas Series; you may want it to be a numpy array of a particular shape when using it in LinearRegression() in Part D.

In [18]:
## Standardized version of the only predictor "assessed"
X_assess = (train_X['assessed'] - np.average(train_X['assessed']))/np.std(train_X['assessed'])