The focus of this lab will be on linear regression. Since the method and its sklearn
implementation are fairly straightforward, we'll devote much of our attention to understanding the optimization of structural model parameters and the gradient descent algorithm. We will also compare the performance of linear regression and k-nearest neighbors (kNN) using pipelines.
As usual, we'll begin by loading several familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
Our examples will use the Iowa City Home Sales data (first introduced in Lab 2, part 1):
## Read data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Split to training and testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
## Create target
train_y = train['sale.amount']
## Create predictor matrix (numeric predictors only for simplicity, but we could use OHE if we wanted to)
train_X = train.select_dtypes("number").drop('sale.amount',axis=1)
To stay consistent with the workflow we developed when using kNN, we'll begin by setting up a pipeline.
Standardization/scaling aren't nearly as important for regression as they are for some methods, but we'll still add a scaling step to our pipeline since we'll later compare linear regression with kNN (where scaling is very important).
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
## Simple pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LinearRegression())
])
Next, let's fit a linear regression model to the entire training set to predict sale price. Then, let's print the estimated weights (coefficients):
## Fit the pipeline to the training data
lin_mod = pipe.fit(train_X, train_y)
## print the model's coefficients
lin_mod.named_steps['model'].coef_
array([-1.25805804e+03, 1.92884548e+03, 6.72865255e+02, -1.92085650e+03, -7.95482914e+01, -3.12181937e+03, 1.97731725e+03, 8.24129242e+02, 1.82409979e+03, 3.74282983e+02, 2.29862180e+00, 8.78235016e+04])
Because coef_ is an attribute of the object created by LinearRegression() (which was a single step in our pipeline), we must use the named_steps attribute of our pipeline to access the regression coefficients (recognize that we gave the name 'model' to the pipeline step that uses LinearRegression()).
Next, notice that 12 weight estimates are returned and that our predictor matrix, train_X, has 12 columns (one weight per predictor). To find the model's bias term (intercept), we must ask for it separately:
## print the estimated intercept (bias)
lin_mod.named_steps['model'].intercept_
180461.83896940417
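To see how these pieces fit together, here's a quick check (a sketch using the objects created above) confirming that the pipeline's predictions are simply the scaled predictors multiplied by these weights, plus the intercept:
## Reconstruct the pipeline's predictions by hand
scaled_X = lin_mod.named_steps['scaler'].transform(train_X)
manual_preds = np.dot(scaled_X, lin_mod.named_steps['model'].coef_) + lin_mod.named_steps['model'].intercept_
print(np.allclose(manual_preds, lin_mod.predict(train_X)))   ## should print True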
In this simple example, we did not use our pipeline to optimize any tuning parameters, but it's still important to understand the syntax needed to access model coefficients (or other attributes) when we do. For example, when GridSearchCV() is used for model tuning, we'll need to use the best_estimator_ attribute of the search results as the first step in obtaining the estimated weights:
## Set up a simple grid to compare two scaling methods
from sklearn.preprocessing import RobustScaler
parms = {'scaler': [StandardScaler(), RobustScaler()]}
## Perform grid search
from sklearn.model_selection import GridSearchCV
lin_mod2 = GridSearchCV(pipe, parms, cv=5, scoring='neg_root_mean_squared_error').fit(train_X, train_y)
## Print the coefficients of the top performing model
lin_mod2.best_estimator_.named_steps['model'].coef_
array([-1.25805804e+03, 1.92884548e+03, 6.72865255e+02, -1.92085650e+03, -7.95482914e+01, -3.12181937e+03, 1.97731725e+03, 8.24129242e+02, 1.82409979e+03, 3.74282983e+02, 2.29862180e+00, 8.78235016e+04])
It's also worth knowing that GridSearchCV
(and other search functions that use cross-validation) will fit the best model to the entire training set after evaluation has been done via cross-validation (so these weights match the ones we saw earlier).
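We can verify this claim directly. The check below is a small sketch using the objects defined above; since the grid search selected StandardScaler (as shown by the matching output), the two sets of weights should be identical:
## Confirm the refit model's weights match the earlier fit
print(np.allclose(lin_mod.named_steps['model'].coef_, lin_mod2.best_estimator_.named_steps['model'].coef_))   ## should print True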
If we wanted to know the cross-validated root-mean squared error (RMSE) of our model, we could find it in the grid search results:
## Using the cv_results_ attribute of GridSearchCV
lin_mod2.cv_results_['mean_test_score'][0]
-19833.449405441286
We could also find the same estimate using the cross_validate function:
## Estimate cross-validated RMSE using cross_validate
from sklearn.model_selection import cross_validate
cv_res = cross_validate(lin_mod2, train_X, train_y, cv = 5, scoring='neg_root_mean_squared_error')
np.average(cv_res['test_score'])
-19833.44940544128
Based upon these measures, we'd estimate this model's predictions will be off by an average of approximately $19,833 if it were applied to new data.
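If you're curious how this cross-validated estimate compares with performance on the held-out test set, a sketch like the one below could be used (the names test_X and test_y are introduced here for illustration, and the exact RMSE you get will depend on the data split):
## Sketch: compute RMSE on the held-out test set for comparison
from sklearn.metrics import mean_squared_error
test_y = test['sale.amount']
test_X = test.select_dtypes("number").drop('sale.amount', axis=1)
test_preds = lin_mod2.best_estimator_.predict(test_X)
print(np.sqrt(mean_squared_error(test_y, test_preds)))   ## test set RMSE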
Question #1: Create train_X_assessed and train_y_assessed, where the latter is the home's assessed value and the former is a matrix containing the predictors: bedrooms, area.lot, and area.garage1. Then, fit a linear regression model that uses train_X_assessed to predict the assessed value of a home.

To compare different machine learning approaches, such as kNN and regression, we must use pipelines and cross-validation to ensure an equitable comparison:
## Set up simple pipeline that does scaling and modeling
from sklearn.neighbors import KNeighborsRegressor
pipe = Pipeline([
('scaler', StandardScaler()),
('model', KNeighborsRegressor())
])
## Set up parameter grid - notice we are considering two different models (and two different scalers)
parms = {'scaler': [StandardScaler(), RobustScaler()],
'model': [KNeighborsRegressor(n_neighbors=15,weights='distance'), LinearRegression()],
}
## Use cross-validated grid search and print the best model
mod_comp = GridSearchCV(pipe, parms, cv=5, scoring='neg_mean_squared_error').fit(train_X, train_y)
print(mod_comp.best_estimator_)
Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])
Since LinearRegression
and KNeighborsRegressor
have different tuning parameters, a fairer approach would first optimize the kNN tuning parameters and use the best combination in a subsequent comparison across model classes. For simplicity, our example used n_neighbors=15
and weights='distance'
as if these were the optimal combination identified in an earlier grid search.
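An alternative (shown below as a sketch, with hypothetical object names parms_nested and mod_comp_nested) is to tune kNN's parameters within the same search by supplying GridSearchCV with a list of parameter grids, so that several choices of n_neighbors are considered for kNN while LinearRegression is compared as-is:
## Sketch: a list of parameter grids tunes kNN while also comparing it with linear regression
parms_nested = [
    {'scaler': [StandardScaler(), RobustScaler()],
     'model': [KNeighborsRegressor(weights='distance')],
     'model__n_neighbors': [5, 10, 15, 20, 25]},
    {'scaler': [StandardScaler(), RobustScaler()],
     'model': [LinearRegression()]}
]
mod_comp_nested = GridSearchCV(pipe, parms_nested, cv=5, scoring='neg_root_mean_squared_error').fit(train_X, train_y)
print(mod_comp_nested.best_estimator_)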
Additionally, it's worth knowing that the default scoring measure for numeric outcomes is the coefficient of determination, $R^2$. If you do not specify the scoring argument in GridSearchCV, this default will be used to determine the best estimator.
## Default scorer:
mod_comp = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y)
mod_comp.best_score_
0.9440104734478318
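For instance, explicitly requesting $R^2$ via the scoring argument should reproduce the default behavior (a quick sketch; mod_comp_r2 is a hypothetical name):
## Explicitly requesting R-squared should match the default scorer
mod_comp_r2 = GridSearchCV(pipe, parms, cv=5, scoring='r2').fit(train_X, train_y)
print(mod_comp_r2.best_score_)   ## should match the value shown above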
Question #2: Use a cross-validated grid search to tune n_neighbors and weights in a $k$-nearest neighbors model that uses train_X_assessed to predict train_y_assessed (these were created in Question #1). Your search should consider at least 5 different choices of n_neighbors that are sensibly chosen. You should use 'neg_root_mean_squared_error' as your scoring criterion.

Regression is our first example of a model with a structured set of unknown parameters that must be learned from the data.
These parameters are estimated such that they minimize a pre-determined cost function. For numeric outcomes, this will be squared error loss:
$$\text{Cost} = \tfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

Realize that we can write squared error loss in matrix form:

$$\text{Cost} = \tfrac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}})$$

In this notation, $\mathbf{y}$ is the vector of observed outcomes, $\mathbf{X}$ is the matrix of predictors, and $\hat{\mathbf{w}}$ is the vector of estimated weights (so the predicted values are $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}$).
Below is a Python function that calculates the cost associated with a given dataset and vector of weights:
## Define cost function
def cost_function(X, y, w):
    n = len(y)
    pred_y = np.dot(X, w)                           ## predictions implied by the weights w
    err = y - pred_y                                ## residuals
    cost = (1./n)*np.dot(np.transpose(err), err)    ## average squared error
    return cost
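As a quick demonstration of this function (a sketch using objects introduced earlier; Xs and w_ls are hypothetical names), we can compare the cost at an arbitrary starting point, such as all-zero weights, with the cost at the least squares weights found by sklearn on the standardized predictors:
## Sketch: evaluate the cost at all-zero weights vs. the least squares weights
Xs = StandardScaler().fit_transform(train_X)
w_ls = LinearRegression(fit_intercept=False).fit(Xs, train_y).coef_
print(cost_function(Xs, train_y, np.zeros(Xs.shape[1])))   ## cost at the all-zero starting point
print(cost_function(Xs, train_y, w_ls))                    ## cost at the fitted weights (should be lower)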
Recall that the gradient is a vector of partial derivatives with respect to each unknown parameter of a function. In our context, these unknowns are the weight parameters in the cost function defined above.
Gradient descent is an optimization algorithm that uses the gradient to sequentially update a model's unknown parameters in a way that reduces the cost function:
$$\hat{\mathbf{w}}^{(j)} = \hat{\mathbf{w}}^{(j-1)} - \alpha \tfrac{\partial}{\partial \mathbf{w}}\text{Cost}\big(\hat{\mathbf{w}}^{(j-1)}\big)$$

These updates are repeated (i.e., $j$ is incremented) until the cost function reaches a minimum (within an acceptable tolerance). However, for simplicity, we will implement gradient descent for a fixed number of iterations.
The gradient (in matrix form) for linear regression looks like:
$$\tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}})$$

This is found using linear algebra and matrix calculus:

$$\text{Cost} = \tfrac{1}{n}(\mathbf{y}^T\mathbf{y}-2\mathbf{y}^T\hat{\mathbf{y}} + \hat{\mathbf{y}}^T\hat{\mathbf{y}})$$
$$\text{Cost} = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}\big(-2\mathbf{y}^T\mathbf{X}\hat{\mathbf{w}} + (\mathbf{X}\hat{\mathbf{w}})^T\mathbf{X}\hat{\mathbf{w}}\big)$$
$$\text{Cost} = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}\big(-2\hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{y} + \hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}}\big)$$

The steps shown above simply rearrange the cost function. After we've reorganized it, we can differentiate it with respect to $\hat{\mathbf{w}}$ (using the facts that $\tfrac{\partial}{\partial \mathbf{w}}\mathbf{w}^T\mathbf{a} = \mathbf{a}$ and $\tfrac{\partial}{\partial \mathbf{w}}\mathbf{w}^T\mathbf{A}\mathbf{w} = 2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$), which yields:

$$\text{Gradient} = \tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$$

I encourage you to review this matrix calculus cheatsheet if these steps don't make sense.
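If you'd like to convince yourself that this formula is correct, one option is to compare it against a finite-difference approximation of the cost function at an arbitrary set of weights. The sketch below uses hypothetical helper functions (analytic_grad and numeric_grad) along with the cost_function defined earlier:
## Sketch: numerically check the analytic gradient at a random set of weights
def analytic_grad(X, y, w):
    return (-2./len(y))*np.dot(np.transpose(X), y - np.dot(X, w))

def numeric_grad(X, y, w, h=1e-4):
    grad = np.zeros(len(w))
    for j in range(len(w)):
        w_hi, w_lo = w.copy(), w.copy()
        w_hi[j] += h                 ## nudge the j-th weight up
        w_lo[j] -= h                 ## nudge the j-th weight down
        grad[j] = (cost_function(X, y, w_hi) - cost_function(X, y, w_lo))/(2*h)
    return grad

Xs = StandardScaler().fit_transform(train_X)
w0 = np.random.default_rng(0).normal(size=Xs.shape[1])
a, b = analytic_grad(Xs, train_y, w0), numeric_grad(Xs, train_y, w0)
print(np.linalg.norm(a - b)/np.linalg.norm(a))   ## relative difference, should be very close to zero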
Below is a function that implements gradient descent (for a fixed number of update iterations) using this result:
def grad_descent(X, y, w, alpha, n_iter):
    costs = np.zeros(n_iter)
    n = len(y)
    for i in range(n_iter):
        pred_y = np.dot(X, w)                          ## current predictions
        err = y - pred_y                               ## residuals
        grad = (-2./n)*np.dot(np.transpose(X), err)    ## gradient of the cost at the current weights
        w = w - alpha*grad                             ## take a step in the negative gradient direction
        costs[i] = cost_function(X, y, w)              ## record the cost after this update
    return w, costs
Let's confirm that the method works as intended (ie: it successfully estimates weights that minimize the cost function):
## Run gradient descent
train_Xs = StandardScaler().fit_transform(train_X)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=100)
## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
The cost function appears to have reached a minimum. Let's now compare the weight estimates with those found by sklearn's linear regression implementation:
lmf = LinearRegression(fit_intercept=False).fit(train_Xs, train_y)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03 1.92884548e+03 6.72865255e+02 -1.92085650e+03 -7.95482914e+01 -3.12181937e+03 1.97731725e+03 8.24129242e+02 1.82409979e+03 3.74282983e+02 2.29862181e+00 8.78235016e+04]
[-1853.74094527 423.88972963 648.24511176 -2379.89586956 2067.56723688 -3190.55684732 2191.25895646 5493.89295523 1897.97999934 489.9667822 140.69348536 83440.4859917 ]
Unfortunately, the weights didn't quite reach their closed-form solution (though many of them got fairly close). Fortunately, we can run more iterations to confirm that they eventually will:
## Run gradient descent
train_Xs = StandardScaler().fit_transform(train_X)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=2000)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03 1.92884548e+03 6.72865255e+02 -1.92085650e+03 -7.95482914e+01 -3.12181937e+03 1.97731725e+03 8.24129242e+02 1.82409979e+03 3.74282983e+02 2.29862181e+00 8.78235016e+04]
[-1.25805804e+03 1.92884548e+03 6.72865255e+02 -1.92085650e+03 -7.95482914e+01 -3.12181937e+03 1.97731725e+03 8.24129242e+02 1.82409979e+03 3.74282983e+02 2.29862181e+00 8.78235016e+04]
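We can confirm the agreement programmatically:
## Confirm the gradient descent estimates now match the closed form solution
print(np.allclose(lmf.coef_, gdres[0]))   ## should print True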
Now let's explore what happens if we double the learning rate, $\alpha$:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.2, n_iter=100)
## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
As shown in the graph above, setting alpha=0.2
allows us to find weights that minimize the cost function using fewer gradient descent iterations (epochs). In general, setting an appropriate learning rate can involve some trial and error, especially if the predictors haven't been standardized.
To further illustrate this notion, suppose we increase the learning rate to something even more aggressive, such as $\alpha = 0.3$:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.3, n_iter=100)
## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
Text(0, 0.5, 'Cost')
The graph above illustrates that alpha=0.3 is a learning rate that is too large. Each gradient descent update step overshoots the minimum by such a large degree that the new weights actually cause the cost function to increase. This problem compounds as more iterations are run, eventually culminating in the divergent behavior shown on the right side of the above graph.
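To make this trial-and-error process a bit more systematic, one option (a sketch using the grad_descent function from above; the specific values tried here are just examples) is to run the algorithm at several candidate learning rates and compare the resulting cost curves:
## Sketch: compare cost trajectories across several candidate learning rates
for a in [0.01, 0.05, 0.1, 0.2]:
    w_a, costs_a = grad_descent(X=train_Xs, y=train_y, w=np.zeros(12), alpha=a, n_iter=100)
    plt.plot(np.arange(1, 101), costs_a, label="alpha = " + str(a))
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.legend()
plt.show()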
Consider a linear regression model with a single predictor and bias term fixed at zero (i.e., no intercept):
$\mathbf{y} = w_1\mathbf{x}_1 + \epsilon$
Gradient descent was demonstrated for this model in our lecture slides.
Question #3: Apply gradient descent to a model that uses assessed as the only predictor of sale price. You may use the code below to obtain this predictor. Print the final estimate of the only estimated weight, $\hat{w}_1$. Then, compare this estimate with the weight found by LinearRegression(). Be sure to use the argument fit_intercept=False to avoid fitting an intercept/bias parameter. If these weights are substantially different, try manipulating your learning rate (or number of iterations) until they are similar.

Hint: The object X_assess created below is a pandas Series; you may want it to be a numpy array of a particular shape when using it in LinearRegression() in Part D.
## Standardized version of the only predictor "assessed"
X_assess = (train_X['assessed'] - np.average(train_X['assessed']))/np.std(train_X['assessed'])
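If you find that LinearRegression() complains about the shape of this input, a reshaping step like the one below (X_assess_mat is a hypothetical name) converts the Series into the 2-D array that sklearn expects:
## Sketch: convert the standardized Series to a 2-D numpy array for LinearRegression()
X_assess_mat = X_assess.to_numpy().reshape(-1, 1)   ## shape (n, 1): one column, one row per home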