This lab covers two major topics: the sklearn implementation of linear regression, which is straightforward and largely uses the same syntax as other algorithms we've studied, and gradient descent, the optimization algorithm used to estimate a regression model's weights.
## Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
Examples throughout the lab will use the Iowa City home sales data that we've seen in several of our previous labs:
## Read data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Split training and testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(ic, test_size=0.2, random_state=7)
## Create target and keep only the numeric predictors (though we could use one-hot encoding if we wanted to)
train_y = train['sale.amount']
train_X = train.select_dtypes("number").drop('sale.amount',axis=1)
Linear regression in sklearn
Linear regression is performed using the LinearRegression() function in sklearn. This function is compatible with pipelines and all the pre-processing tools we've been using:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
## Simple pipeline using linear regression
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LinearRegression())
])
Compared to other algorithms, a benefit of linear regression is that we can easily determine the impact of each individual predictor by looking at its weight in the model's linear combination:
## Fit the pipeline to the training data
lin_mod = pipe.fit(train_X, train_y)
## print the model's coefficients
lin_mod.named_steps['model'].coef_
array([-1.25805804e+03, 1.92884548e+03, 6.72865255e+02, -1.92085650e+03, -7.95482914e+01, -3.12181937e+03, 1.97731725e+03, 8.24129242e+02, 1.82409979e+03, 3.74282983e+02, 2.29862180e+00, 8.78235016e+04])
While standardization is not an essential step in linear regression, it has the benefit of making these estimated weights more directly comparable. That is, after standardizing, each of these weights represents the estimated effect of a one standard deviation increase in the corresponding predictor on the model's predicted outcome.
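To make these weights easier to interpret, we could pair them with their predictor names (a small sketch using pandas; this table isn't part of the original output):
## Pair each estimated weight with the name of its predictor
coef_table = pd.Series(lin_mod.named_steps['model'].coef_, index=train_X.columns)
print(coef_table.sort_values())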
You should also recognize that the model's estimated bias (intercept) is stored as a different attribute:
## Print the estimated intercept (bias)
lin_mod.named_steps['model'].intercept_
180461.83896940417
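Although it isn't shown above, a natural next step is to use the fitted pipeline to predict the sale prices of the held-out homes and compute the test-set RMSE (a brief sketch; the names test_X, test_y, and pred_y are just illustrative):
## Prepare the test set the same way as the training set
test_y = test['sale.amount']
test_X = test.select_dtypes("number").drop('sale.amount', axis=1)
## Predict and compute RMSE for the held-out homes
from sklearn.metrics import mean_squared_error
pred_y = lin_mod.predict(test_X)
print(math.sqrt(mean_squared_error(test_y, pred_y)))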
Linear regression does not have any tuning parameters, but we can still look at an example of it being used within a cross-validated grid search by exploring models with and without an intercept (bias):
## Simple grid search
from sklearn.model_selection import GridSearchCV
parms = {'model__fit_intercept': [True, False]}
grid_res = GridSearchCV(pipe, parms, cv=5, scoring='neg_root_mean_squared_error').fit(train_X, train_y)
grid_res.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])
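We can also check which setting the search preferred and its cross-validated error (a small sketch; this inspection isn't shown in the original output):
## Inspect the winning parameter setting and its cross-validated RMSE
print(grid_res.best_params_)
print(-grid_res.best_score_)  ## negate because sklearn reports negative RMSE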
Question 1: Suppose we'd like to model a home's assessed value rather than its sale price. Create the objects train_X_assessed and train_y_assessed, where the latter records each home's assessed value and the former is a matrix containing the predictors: area.living, bedrooms, and area.lot.
Question 2: Next you'll explore how linear regression compares with other methods we've learned. Throughout this question you should use RMSE as your scoring metric. Fit and tune an xgboost model for these data. Be sure to include any preprocessing steps you believe to be necessary for this approach. A satisfactory answer should explore at least a few different combinations of tuning parameters.
The weights and biases in regression are estimated such that they minimize a pre-determined cost function. For numeric outcomes, we will focus our attention on squared error cost:
$$Cost = \tfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$
We can express the squared error cost function in matrix form:
$$Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T(\mathbf{y}-\mathbf{X}\hat{\mathbf{w}})$$
In this notation, $\mathbf{y}$ is the vector of observed outcomes, $\mathbf{X}$ is the matrix of predictors, and $\hat{\mathbf{w}}$ is the vector of estimated weights.
Below is a Python function that calculates the cost associated with a given vector of weights and data:
## Define cost function
def cost_function(X, y, w):
    n = len(y)
    pred_y = np.dot(X, w)
    err = y - pred_y
    cost = (1./n)*np.dot(np.transpose(err), err)
    return cost
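As a quick sanity check (a small illustrative example that isn't part of the original lab), we can evaluate this function on a tiny data set where the cost is easy to compute by hand:
## Tiny example: 3 observations, 2 predictors, candidate weights w = (2, 1)
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y_demo = np.array([5.0, 4.0, 7.0])
## Predictions are (4, 4, 7), so the errors are (1, 0, 0) and the cost is 1/3
print(cost_function(X_demo, y_demo, w=np.array([2.0, 1.0])))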
Question 3: To explore the cost function defined above, create a line graph showing the cost associated with a sequence of 20 different values of $w_1$ ranging from 2 to 4, using the simple data set created below.
## Simple data set used for Question 3
np.random.seed(0)
x = np.linspace(-10, 10, 100)
y = 3 * x + np.random.normal(0, 3, 100)
Recall that the gradient is a vector of partial derivatives with respect to each unknown parameter of a function. In our context, these unknowns are the weight parameters in the cost function defined above.
Gradient descent is an optimization algorithm that uses the gradient to sequentially update a model's unknown parameters in a way that reduces the cost function:
$$\hat{\mathbf{w}}^{(j)} = \hat{\mathbf{w}}^{(j-1)} - \alpha \tfrac{\partial}{\partial \mathbf{w}}\text{Cost}\big(\hat{\mathbf{w}}^{(j-1)}\big)$$
These updates are repeated (ie: $j$ is incremented) until the cost function reaches a minimum (within an acceptable tolerance). However, for simplicity, we will implement gradient descent for a fixed number of iterations.
The gradient (in matrix form) for linear regression looks like:
$$\tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}})$$
This is found using linear algebra and matrix calculus:
$$Cost = \tfrac{1}{n}(\mathbf{y}^T\mathbf{y}-2\mathbf{y}^T\hat{\mathbf{y}} + \hat{\mathbf{y}}^T\hat{\mathbf{y}})$$
$$Cost = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}\big(-2\mathbf{y}^T\mathbf{X}\hat{\mathbf{w}} + (\mathbf{X}\hat{\mathbf{w}})^T\mathbf{X}\hat{\mathbf{w}}\big)$$
$$Cost = \tfrac{1}{n}\mathbf{y}^T\mathbf{y} + \tfrac{1}{n}\big(-2\hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{y} + \hat{\mathbf{w}}^T\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}}\big)$$
The steps shown above simply rearrange the cost function. After we've reorganized it, we can differentiate it with respect to $\hat{\mathbf{w}}$, which yields:
$$Gradient = \tfrac{-2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$$
Below is a function that implements gradient descent (for a fixed number of update iterations) using this result:
def grad_descent(X, y, w, alpha, n_iter):
    costs = np.zeros(n_iter)
    n = len(y)
    for i in range(n_iter):
        ## Current predictions and errors
        pred_y = np.dot(X, w)
        err = y - pred_y
        ## Gradient of the squared error cost, followed by the update step
        grad = (-2./n)*np.dot(np.transpose(X), err)
        w = w - alpha*grad
        ## Record the cost after this update
        costs[i] = cost_function(X, y, w)
    return w, costs
## Run gradient descent (for the original IC homes data)
train_y = train['sale.amount']
train_X = train.select_dtypes("number").drop('sale.amount',axis=1)
train_Xs = StandardScaler().fit_transform(train_X)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=100)
## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
The cost function appears to have reached a minimum. Let's now compare the weight estimates with those found by sklearn's linear regression implementation:
## sklearn's linear regression
lmf = LinearRegression(fit_intercept=False).fit(train_Xs, train_y)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03 1.92884548e+03 6.72865255e+02 -1.92085650e+03 -7.95482914e+01 -3.12181937e+03 1.97731725e+03 8.24129242e+02 1.82409979e+03 3.74282983e+02 2.29862181e+00 8.78235016e+04]
[-1853.74094527 423.88972963 648.24511176 -2379.89586956 2067.56723688 -3190.55684732 2191.25895646 5493.89295523 1897.97999934 489.9667822 140.69348536 83440.4859917 ]
Unfortunately, the weights didn't quite reach their closed-form solution (though many of them got fairly close). But if we increase the number of iterations, they eventually will:
## Run gradient descent (again)
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.1, n_iter=2000)
print(lmf.coef_)
print(gdres[0])
[-1.25805804e+03 1.92884548e+03 6.72865255e+02 -1.92085650e+03 -7.95482914e+01 -3.12181937e+03 1.97731725e+03 8.24129242e+02 1.82409979e+03 3.74282983e+02 2.29862181e+00 8.78235016e+04]
[-1.25805804e+03 1.92884548e+03 6.72865255e+02 -1.92085650e+03 -7.95482914e+01 -3.12181937e+03 1.97731725e+03 8.24129242e+02 1.82409979e+03 3.74282983e+02 2.29862181e+00 8.78235016e+04]
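As a quick numerical check (a small addition that isn't part of the original output), we could confirm that the two sets of weights now agree to within floating point tolerance:
## Confirm gradient descent has converged to sklearn's closed-form weights
print(np.allclose(lmf.coef_, gdres[0]))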
Next, let's see what happens if we double the learning rate:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.2, n_iter=100)
## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
As shown in the graph above, setting alpha=0.2 allows us to find weights that minimize the cost function using fewer gradient descent iterations (epochs). In general, setting an appropriate learning rate can involve some trial and error, especially if the predictors haven't been standardized.
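If we wanted to compare several candidate learning rates at once, a sketch like the one below (not part of the original lab, and using the illustrative values 0.05, 0.1, and 0.2) overlays their cost curves on a single plot:
## Overlay cost curves for a few candidate learning rates
for a in [0.05, 0.1, 0.2]:
    _, costs = grad_descent(X=train_Xs, y=train_y, w=np.zeros(12), alpha=a, n_iter=100)
    plt.plot(costs, label="alpha = " + str(a))
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.legend()
plt.show()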
For example, let's now increase the learning rate to something even more aggressive, such as $\alpha = 0.3$:
## Run gradient descent
gdres = grad_descent(X=train_Xs,y=train_y,w=np.zeros(12),alpha=0.3, n_iter=100)
## Plot costs by iteration
plt.plot(np.linspace(0, 100, 100),gdres[1])
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.show()
For $\alpha = 0.3$, each gradient descent update step overshoots the minimum by such a large degree that the new weights actually cause the cost function to increase. This problem compounds as more iterations are run, eventually culminating in the divergent behavior shown on the right side of the above graph.
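One way to see this numerically (a small addition that isn't shown in the original output) is to compare the first few recorded costs with the last few; with $\alpha = 0.3$ the values at the end of the run should be far larger than those at the start:
## Costs early in the run versus at the end of the run
print(gdres[1][:3])
print(gdres[1][-3:])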
Question 4: Consider a linear regression model with a single predictor and a bias term fixed at zero (ie: no intercept):
$\mathbf{y} = w_1\mathbf{x}_1 + \epsilon$
Gradient descent was demonstrated for this model in our lecture slides. Apply gradient descent to this model using assessed as the only predictor of sale price; you may use the code below to standardize this predictor. Print the final estimate of the only estimated weight, $\hat{w}_1$. Then compare this estimate with the weight found by LinearRegression(). Be sure to use the argument fit_intercept=False to avoid fitting an intercept/bias parameter. If these weights are substantially different, try manipulating your learning rate (or number of iterations) until they are similar.
Hint: The object X_assess created below is a pandas Series; you may want it to be a numpy array of a particular shape when using it in LinearRegression().
## Standardized version of the only predictor "assessed"
X_assess = (train_X['assessed'] - np.average(train_X['assessed']))/np.std(train_X['assessed'])
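For reference, one way to address the hint above (a small sketch; the name X_assess_2d is just illustrative) is to convert the Series into the two-dimensional array that LinearRegression() expects:
## Convert the pandas Series into an n-by-1 numpy array
X_assess_2d = X_assess.to_numpy().reshape(-1, 1)
print(X_assess_2d.shape)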