This lab continues our introduction to supervised learning, the process of modeling an outcome variable. It focuses on numerical outcomes; future labs will cover categorical outcomes.
Directions (Please read before starting)
\(~\)
This lab will again use the \(k\)-nearest neighbors implementation in the FNN package.
# load the following packages
# install.packages("FNN")
library(FNN)
library(ggplot2)
library(dplyr)
\(~\)
In our previous lab, we introduced the general modeling framework: \[Y = f(\mathbf{X}) + \epsilon\]
In this framework, finding a good estimate of \(f()\) is of critical importance. As an illustration, let’s consider various polynomial regression models of increasing degree fit to the resting metabolism data:
Based upon this graph, you might expect the straight line or second degree polynomial to be best; however, here’s a chart of the in-sample \(RMSE\) of each model:
Does this mean we should use an eighth degree polynomial (lowest in-sample \(RMSE\)) to model this relationship?
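For reference, here is a minimal sketch (assuming the rmr data frame with the rate_kcal and weight_lbs variables that appear in the code examples later in this lab) showing how in-sample \(RMSE\) values like these could be computed:
## A sketch of in-sample RMSE for polynomial models of increasing degree
## (assumes the "rmr" data frame used throughout this lab)
degrees <- c(1, 2, 4, 6, 8)
in_rmse <- numeric(length(degrees))
for(i in seq_along(degrees)){
  ## Fit a polynomial regression of the given degree
  m <- lm(rate_kcal ~ poly(weight_lbs, degrees[i]), data = rmr)
  ## In-sample RMSE re-uses the training data when computing errors
  in_rmse[i] <- sqrt(mean((rmr$rate_kcal - predict(m))^2))
}
in_rmse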
\(~\)
Using the same data to train (i.e., estimate \(\{\beta_0, \beta_1, \ldots\}\)) and test (i.e., calculate performance measures like \(RMSE\)) a model can lead to over-fitting, or the systematic favoritism of models that are too specific to the available data and won't generalize very well to new data.
In our example, the eighth degree polynomial model is clearly over-fit - its predictions would fail miserably for a new data-point with a body weight of approximately 280 lbs.
To prevent over-fitting, we should evaluate (test) our model on data that was not used in the training process, thereby obtaining a measure of out-of-sample performance. Cross-validation is a method for estimating the out-of-sample performance of a model using non-overlapping subsets of the available data. We'll focus on k-fold cross-validation, which is summarized by the following algorithm:
1. Randomly divide the available data into \(k\) non-overlapping folds of roughly equal size.
2. For each fold, fit the model using the data in the other \(k - 1\) folds, then use that fitted model to predict the outcomes of the data-points in the held-out fold.
3. Compare the out-of-sample predictions with the observed outcomes to compute a performance measure such as \(RMSE\).
The diagram below illustrates 4-fold cross-validation:
Here’s a comparison of in-sample vs. out-of-sample performance of each polynomial regression model of the resting metabolism example:
| Degree           | 1   | 2   | 4   | 6    | 8     |
|------------------|-----|-----|-----|------|-------|
| RMSE (in-sample) | 154 | 153 | 148 | 145  | 140   |
| RMSE (CV)        | 164 | 165 | 206 | 3393 | 52010 |
The following code demonstrates how to implement 5-fold cross-validation:
## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(8) ## set.seed will now ensure reproducible fold assignments
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)
## Step 2 - Set up an empty vector to store predictions
preds <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Split the data into training and testing partitions
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit model on this iteration's training folds
m1 <- lm(rate_kcal ~ weight_lbs, data = train)
## Store predictions for the testing fold
preds[fold_id == k] <- predict(m1, newdata = test)
}
## Calculate out-of-sample RMSE
sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
## [1] 164.3637
A few things to be aware of:
- The model in each iteration must be fit using data = train. A common mistake is to use the entire data set here (which defeats the entire purpose of cross-validation).
\(~\)
Question #1: Our previous lab introduced the afqt data set, which is a random sample of \(n = 2584\) Americans who were first selected and tested in 1979, then re-interviewed in 2006 and asked about their education and annual income.
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Consider the use of k-fold cross-validation with \(k = 8\) to evaluate the simple linear regression model Income2005 ~ AFQT. How many different times are the model's parameters (\(\{\beta_0, \beta_1\}\)) estimated during the evaluation process? How many observations (data-points) contribute to each of these estimates?
Part B: Estimate the out-of-sample \(RMSE\) of the simple linear regression described in Part A using 8-fold cross-validation.
Part C: Now consider the multiple regression model Income2005 ~ AFQT + Educ. Use 8-fold cross-validation to estimate the out-of-sample \(RMSE\) of this model and compare it to the model you evaluated in Parts A and B. Hint: be sure to use the same fold assignments (which are determined by the randomization seed) for both models.
Part D: Consider a log-2 transformation of “Income2005”. Could cross-validation be used to determine whether this transformation yields a better fitting model than the ones evaluated in Parts B and C? Briefly explain. Hint: think about the meaning of \(RMSE\).
\(~\)
Using \(k\)-fold cross-validation requires a pre-specified value of \(k\), the number of cross-validation folds.
Larger values of \(k\) will lead to less variability in the estimates of out-of-sample \(RMSE\) across different sets of randomly generated folds; however, if \(k\) is too large the trained models used in each fold will be highly correlated (leading to correlated errors when applied to the testing data).
To understand this dilemma, it's useful to think about both extremes: leave-one-out cross-validation (LOOCV), where \(k = n\) and each fold contains a single data-point, and a single data split, where \(k = 2\) and the data are divided into two halves.
In LOOCV, many of the trained models are nearly identical, as only one data-point is left out of each model's estimation. In a single data split, the two models that are estimated can be very different from each other, which makes the \(RMSE\) estimate highly dependent on the random split that was drawn.
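To make the LOOCV extreme concrete, here is a minimal sketch (again assuming the rmr data from the earlier examples) in which every data-point serves as its own testing fold:
## Leave-one-out cross-validation: k = n, so each fold is a single data-point
## (a sketch assuming the rmr data frame used in the earlier examples)
n <- nrow(rmr)
preds_loo <- numeric(n)
for(i in 1:n){
  ## Train on everything except observation i
  m <- lm(rate_kcal ~ weight_lbs, data = rmr[-i, ])
  ## Predict the single held-out observation
  preds_loo[i] <- predict(m, newdata = rmr[i, , drop = FALSE])
}
## Out-of-sample RMSE from LOOCV
sqrt(mean((rmr$rate_kcal - preds_loo)^2))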
Question #2 (Part A): Without writing or running any code, briefly explain why set.seed() is not necessary to ensure reproducible results when using leave-one-out cross-validation.
Question #2 (Part B): For \(k=2\), use k-fold cross-validation with three different randomization seeds to come up with three different estimates of the out-of-sample \(RMSE\) of the simple linear regression model Income2005 ~ AFQT (for the afqt data described in Question #1). Does the variability displayed in these estimates seem problematic when considering the out-of-sample \(RMSE\)?
\(~\)
For the reasons discussed above, a more robust approach to estimating a model’s out-of-sample performance is repeated k-fold cross-validation. As the name suggests, this method uses an outer loop to repeat the process of k-fold cross-validation several times. Performance estimates from each repetition are then aggregated to produce a single estimate (usually by simple averaging).
The code below demonstrates repeated cross-validation:
## Set up
set.seed(8)
n <- nrow(rmr)
preds <- numeric(n)
reps <- 10
rmses <- numeric(reps)
## Outer loop
for(i in 1:reps){
fold_id <- sample(rep(1:5, length.out = n), size = n)
## Inner loop
for(k in 1:5){
## Split the data into training and testing partitions
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit model on this iteration's training folds
m1 <- lm(rate_kcal ~ weight_lbs, data = train)
## Store predictions for the testing fold
preds[fold_id == k] <- predict(m1, newdata = test)
}
# Store each outer loop iteration's estimate
rmses[i] <- sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
}
## Results
summary(rmses)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 160.0 161.8 162.4 163.4 164.9 169.0
Question #3 (Part A): In the example code provided above, how many repetitions of k-fold cross-validation were performed?
Question #3 (Part B): How many cross-validation folds were used in each repetition?
Question #3 (Part C): What are the range and interquartile range of the out-of-sample \(RMSE\) estimates? Briefly comment on whether or not you feel repeated cross-validation is useful in this application.
\(~\)
Remarks: When using repeated cross-validation, most practitioners recommend choosing \(k\) between 5 and 10, with the number of repetitions depending upon the application. In the initial stages of an analysis, a single repetition of k-fold cross-validation is generally sufficient. In the final stages of analysis, such as reporting the final performance of a chosen model, \(k\)-fold cross-validation should be repeated as many times as necessary to yield a reliable estimate.
\(~\)
A strength of cross-validation is that it can be used to compare any two models involving the same outcome variable. This enables us to compare the performance of completely different model classes, such as non-parametric methods like \(k\)-nearest neighbors and parametric models like linear regression - something that traditional statistical testing is unable to do.
The code below demonstrates the use of 5-fold cross-validation to compare \(k\)-nearest neighbors (with k = 7) and simple linear regression for the resting metabolism data:
# Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)
## Step 2
## Two empty numeric vectors to store out-of-sample predictions (one for each model)
preds_k7 <- preds_slr <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Subset the data
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit models on the data
mod_k7 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs),
y = train$rate_kcal, k = 7)
mod_slr <- lm(rate_kcal ~ weight_lbs, data = train)
## Store predictions
preds_k7[fold_id == k] <- mod_k7$pred
preds_slr[fold_id == k] <- predict(mod_slr, newdata = test)
}
## Out-of-sample RMSE for each model
c(sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2)),
sqrt(1/n*sum((rmr$rate_kcal - preds_slr)^2)))
## [1] 163.3975 166.3328
Cross-validation is also commonly used to select tuning parameters, or quantities that must be specified by the analyst before a model can be fit (such as the value \(k\) in \(k\)-nearest neighbors).
The code below uses 5-fold cross-validation to compare three different values of \(k\) (3, 7, and 15):
## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)
## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for the 3 choices of k)
preds_k3 <- preds_k7 <- preds_k15 <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Subset the data
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit models on the data
mod_k3 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 3)
mod_k7 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 7)
mod_k15 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 15)
## Store predictions
preds_k3[fold_id == k] <- mod_k3$pred
preds_k7[fold_id == k] <- mod_k7$pred
preds_k15[fold_id == k] <- mod_k15$pred
}
## Out-of-sample RMSE for each model
out_RMSE_k3 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k3)^2))
out_RMSE_k7 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2))
out_RMSE_k15 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k15)^2))
## Comparison
print(c(out_RMSE_k3, out_RMSE_k7, out_RMSE_k15))
## [1] 197.1123 163.3975 180.0256
This comparison suggests \(k = 7\) leads to a better model than either \(k = 3\) or \(k = 15\). Additionally, you should notice that out-of-sample \(RMSE\) is no longer monotonically improving as \(k\) decreases (something that happened with in-sample \(RMSE\)).
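The same idea extends beyond three candidates. The sketch below (reusing the rmr data and fold_id assignments created above; the grid of values is just an illustration) loops over a vector of candidate k values and reports the one with the lowest out-of-sample \(RMSE\):
## A sketch extending the comparison to a grid of candidate k values
## (uses the rmr data and fold_id assignments created above)
k_grid <- c(3, 5, 7, 9, 11, 15)
cv_rmse <- numeric(length(k_grid))
for(j in seq_along(k_grid)){
  preds <- numeric(n)
  for(fold in 1:5){
    train <- rmr[fold_id != fold, ]
    test <- rmr[fold_id == fold, ]
    mod <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs),
                   y = train$rate_kcal, k = k_grid[j])
    preds[fold_id == fold] <- mod$pred
  }
  cv_rmse[j] <- sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
}
## Candidate value of k with the lowest out-of-sample RMSE
k_grid[which.min(cv_rmse)]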
Question #4: For this question you will use the afqt data introduced in Question #1.
Part A: As mentioned in the previous lab, \(k = \sqrt{n}\) is a reasonable starting point when using \(k\)-nearest neighbors. For Part A of Question #4, use cross-validation to compare the out-of-sample performance of kNN models for three different values of \(k\): 25, 50, and 75 when using the variables “Educ” and “AFQT” to predict “Income2005”. Report a performance estimate for each of these possible values, and indicate which choice of \(k\) results in the best model.
Part B: An advantage of kNN is the flexibility to detect non-linearity in the relationship between a predictor and the outcome. For Part B of Question #4, compare the out-of-sample performance of the best kNN model you identified in Part A with that of a multiple linear regression model that uses third degree polynomials to represent the effects of "Educ" and "AFQT".
Part C: In Question #1 you considered two different linear models for these data and evaluated their performance using cross-validation. Considering results from Question #1 and the earlier parts of this question, do zero, one, or both of the models from Question #1 outperform the best model from Part B?
\(~\)
While cross-validation can be used to select which explanatory variables to use in a linear regression model, doing so quickly becomes untenable when the number of available variables is moderate to large.
The least absolute shrinkage and selection operator, or lasso, is an alternative to ordinary least squares regression that naturally performs variable selection and shrinks coefficient estimates towards zero, often leading to improved out-of-sample performance.
Using penalization to shrink a model’s estimated effects towards zero in order to avoid over-fitting is known as regularization, and this technique can be applied to many types of models.
Mathematically, the lasso solution is found by minimizing \(Q(\beta) = \sum_i (y_i - x_i^T\beta)^2 + \lambda ||\beta||_1\) with respect to \(\beta\), where \(||\beta||_1 = |\beta_1| + |\beta_2| + \ldots + |\beta_p|\) is the L1 norm of the coefficient vector. Here \(\lambda\) is a tuning parameter, or a numeric constant that needs to be specified prior to estimation.
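To make this objective concrete, the short sketch below evaluates \(Q(\beta)\) for a single candidate coefficient vector using small made-up data (not data from this lab):
## Toy illustration of the lasso objective Q(beta) using made-up data
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)  # 10 observations, 2 predictors
y <- rnorm(10)                               # made-up outcome
beta <- c(0.5, -0.25)                        # a candidate coefficient vector
lambda <- 1                                  # tuning parameter
## Squared-error loss plus the L1 penalty
Q <- sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
Q
Increasing \(\lambda\) raises the cost of non-zero coefficients, which is what pushes the minimizing \(\beta\) toward (and eventually exactly to) zero.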
Typically we try an entire path of different \(\lambda\) values, tracking the lasso solution for each. The code below demonstrates how to use the glmnet package to fit a lasso model that predicts the sale prices of homes sold in Iowa City, IA between 2005 and 2008 using several attributes:
ic <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_X <- ic %>% select(ac, area.bsmt, area.garage1, area.garage2,
                      area.living, area.lot, assessed)
## Pre-processing
ic_X <- model.matrix(~ -1 + ., data = ic_X) ## Notice how "ac" is handled
ic_Y <- ic$sale.amount
## Modeling
library(glmnet)
lasso_fit <- glmnet(x = ic_X, y = ic_Y)
Note: glmnet() doesn't allow formulas; you must provide a matrix of explanatory variables and a vector of outcomes separately. When dummy variables are needed, the model.matrix() function should be used to construct an appropriate design matrix.
The formula supplied to model.matrix() is very similar to setting up a model formula. Here, "-1" is used to indicate we don't want an intercept (glmnet will create an intercept for us), and "." indicates we want to use everything else in ic_X.
To understand a lasso model, it is useful to visualize the coefficient path:
plot(lasso_fit, xvar = "lambda", label = TRUE)
abline(v = 7.7, lty = 3)
The coefficient path displays the relationship between \(\lambda\) (the penalty parameter), the number of active variables, and their estimated coefficients.
Each vertical cross-section of the coefficient path represents a different lasso model. For illustration, we use abline() to add a vertical line at \(\log(\lambda) = 7.7\). Notice that the model characterized by this penalty contains only 3 active variables (everything else has an estimated coefficient of exactly zero). We can use the coef() function to print these variables and their estimated coefficients:
coef(lasso_fit, s = exp(7.7))
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.266824e+03
## acNo .
## acYes .
## area.bsmt .
## area.garage1 .
## area.garage2 2.522547e+00
## area.living .
## area.lot 3.063874e-03
## assessed 1.007374e+00
An optimal value of \(\lambda\) can be determined using cross-validation. In fact, glmnet contains its own cross-validation implementation in the function cv.glmnet().
cv_lasso_fit <- cv.glmnet(x = ic_X, y = ic_Y, nfolds = 5)
cv_lasso_fit$lambda.min
## [1] 923.3904
In this example, a \(\lambda\) value of 923.3904, or roughly 6.828 on the log-scale, corresponds with the best fitting model. Conveniently, cv.glmnet() also fits a lasso model to the full data set, so we can find the estimated parameters of the best fitting model using the following code:
coef(cv_lasso_fit$glmnet.fit, s = cv_lasso_fit$lambda.min)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 875.0476236
## acNo .
## acYes .
## area.bsmt 0.4728611
## area.garage1 -6.3536220
## area.garage2 7.6736881
## area.living .
## area.lot 0.1223852
## assessed 1.0252551
Thus, the estimated lasso model is: \[\small \widehat{SalePrice} = 875.05 + 0.47*\text{area.bsmt} - 6.35*\text{area.garage1} + 7.67*\text{area.garage2} + 0.12*\text{area.lot} + 1.025*\text{assessed}\]
Notice the variables “ac” and “area.living” were not selected to contribute to the model.
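One convenience of this workflow is that the fitted cv.glmnet object can generate predictions directly at the penalty chosen by cross-validation. A minimal sketch, continuing with the ic_X and ic_Y objects created above:
## Predictions from the cross-validated lasso fit at the selected penalty
## (continues with the ic_X and ic_Y objects created above)
lasso_preds <- predict(cv_lasso_fit, newx = ic_X, s = "lambda.min")
## In-sample RMSE of the lasso predictions against the observed sale prices
sqrt(mean((ic_Y - lasso_preds)^2))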
Question #5: The code given below loads the "colleges 2019" data set and filters these data to only include observations without any missing data. The goal in this analysis is to model the median 10-year salary of a college's alumni using the other variables contained in cd.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(complete.cases(colleges))
Part A: Recall that glmnet() requires an outcome vector, y, and a model matrix, X. For this question, set up a vector named cd_y that contains the outcome variable, and create a properly formatted model matrix cd_X that contains all other variables besides "Name" and "City", which we'll consider identifiers of a college. Hint: cd_X should contain 50 binary indicator variables representing each category of "State".
Part B: Fit a lasso model using 10-fold cross-validation as implemented by the cv.glmnet() function. Graph the coefficient path of this model and add a vertical line at the value of \(\lambda\) favored by cross-validation.
Part C: The model matrix used in Parts A and B should have contained 75 columns (predictor variables). At the value of \(\lambda\) favored by cross-validation, how many of these variables were not selected to contribute to the final model?
\(~\)
Question #6: The data loaded below, oz, record the daily ozone concentrations in New York City across 111 different days in the same year. The variable "Day" describes how many days after Jan 1st the measurement occurred.
oz <- read.csv("https://remiller1450.github.io/data/Ozone.csv")
Part A: Use the pairs() function to create a scatter plot matrix that displays the variables in oz. Which two variables appear to have the strongest linear relationships with the outcome variable, "Ozone"?
Part B: Fit a linear regression model that uses a day's wind speed and temperature (as linear effects) to predict the ozone concentration for that day. Then, use plotly to visualize this model using a surface overlaid on a 3-D scatter plot. Hint: see the plotly lab for several examples, and be careful to set up your "x" and "y" sequences using the range of the predictors used in this model.
Part C: Estimate the out-of-sample \(RMSE\) of the model you fit in Part B using 5-fold cross-validation.
Part D: Use 5-fold cross-validation to compare the out-of-sample \(RMSE\) of the model from Part B with that of a \(k\)-nearest neighbors model using \(k = 11\) (approximately \(\sqrt{n}\)). You should be careful to use the same cross-validation folds for both models, and you should report each model's estimated out-of-sample \(RMSE\).
Part E: Visualize the \(k\)-nearest neighbors model with \(k = 11\) from Part D using a surface overlaid on a 3-D plotly scatter plot. Hint: Use the same grid object you created in Part B as the input to the test argument in knn.reg().
Part F: If you were to fit a lasso model to these data using all available predictors, would you expect the surface display of the resulting model to be more similar to the surface in Part B or the surface in Part E? Briefly explain (no R code is necessary for this question).