This lab continues our introduction to supervised learning, the process of modeling an outcome variable. It focuses on numerical outcomes; future labs will cover categorical outcomes.

Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Packages and Datasets

This lab will again use the \(k\)-nearest neighbors implementation in the FNN package.

# load the following packages
# install.packages("FNN")
library(FNN)
library(ggplot2)
library(dplyr)

\(~\)

In-sample vs. Out-of-sample Performance

In our previous lab, we introduced the general modeling framework: \[Y = f(\mathbf{X}) + \epsilon\]

In this framework, finding a good estimate of \(f()\) is of critical importance. As an illustration, let’s consider various polynomial regression models of increasing degree fit to the resting metabolism data:

Based upon this graph, you might expect the straight line or second degree polynomial to be best; however, here’s a chart of the in-sample \(RMSE\) of each model:

Does this mean we should use an eighth degree polynomial (lowest in-sample \(RMSE\)) to model this relationship?
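If you're curious how in-sample \(RMSE\) values like those in the chart above could be computed, here is a minimal sketch. It assumes the resting metabolism data are loaded as the data frame rmr, with variables weight_lbs and rate_kcal, as used later in this lab:

## Sketch - in-sample RMSE for polynomial models of increasing degree (assumes `rmr` is loaded)
degrees <- c(1, 2, 4, 6, 8)
sapply(degrees, function(d){
  fit <- lm(rate_kcal ~ poly(weight_lbs, degree = d), data = rmr)
  sqrt(mean(residuals(fit)^2))   ## in-sample RMSE of the degree-d model
})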

\(~\)

Cross-validation

Using the same data to train (ie: estimate \(\{\beta_0, \beta_1, \ldots\}\)) and test (ie: calculate performance measures like \(RMSE\)) a model can lead to over-fitting, or the systematic favoritism of models that are too specific to the available data and won’t generalize very well to new data.

In our example, the eighth degree polynomial model is clearly over-fit - its predictions would fail miserably for a new data-point with a body weight of approximately 280 lbs.

To prevent over-fitting, we should evaluate (test) our model on data that was not used in the training process, thereby obtaining a measure of out-of-sample performance. Cross-validation is a method for estimating the out-of-sample performance of a model using non-overlapping subsets of the available data. We’ll focus on k-fold cross-validation, which is summarized by the following algorithm:

  1. Randomly divide the original data set into \(k\) equally sized, non-overlapping subsets
  2. Fit the candidate model using data from \(k - 1\) folds, and use this model to find predicted values (\(\hat{y}_i\)’s) for the observations in the \(k^{th}\) fold (the “left out” fold)
  3. Repeat step two until each fold has been left out exactly once, resulting in an out-of-sample prediction for each observation in the data set

The diagram below illustrates 4-fold cross-validation:

Here’s a comparison of in-sample vs. out-of-sample performance of each polynomial regression model of the resting metabolism example:

Degree             1     2     4     6      8
RMSE (in-sample)   154   153   148   145    140
RMSE (CV)          164   165   206   3393   52010

The following code demonstrates how to implement 5-fold cross-validation:

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(8)  ## set.seed will now ensure reproducible fold assignments
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Step 2 - Set up an empty vector to store predictions
preds <- numeric(n)

## Loop through the k=5 folds
for(k in 1:5){
  
  ## Split the data into training and testing partitions
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]

  ## Fit model on this iteration's training folds
  m1 <- lm(rate_kcal ~ weight_lbs, data = train)
  
  ## Store predictions for the testing fold
  preds[fold_id == k] <- predict(m1, newdata = test)
}

## Calculate out-of-sample RMSE
sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
## [1] 164.3637

A few things to be aware of:

  1. A key part of the model fitting step is the argument data = train. A common mistake is to use the entire data set here, which defeats the purpose of cross-validation.
  2. If you are using cross-validation to compare multiple models, it’s best to do so within the same loop so that identical fold assignments are used (see the sketch below).
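For instance, here is a minimal sketch of point 2: comparing two candidate models (a straight line and a second degree polynomial) within a single loop, assuming the rmr data and the fold_id vector created above:

## Sketch - compare two candidate models using the SAME fold assignments
preds_lin <- preds_quad <- numeric(n)

for(k in 1:5){
  
  ## Same training/testing split for both models
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit both candidate models on this iteration's training folds
  m_lin <- lm(rate_kcal ~ weight_lbs, data = train)
  m_quad <- lm(rate_kcal ~ poly(weight_lbs, 2), data = train)
  
  ## Store each model's predictions for the testing fold
  preds_lin[fold_id == k] <- predict(m_lin, newdata = test)
  preds_quad[fold_id == k] <- predict(m_quad, newdata = test)
}

## Out-of-sample RMSE of each model
c(sqrt(1/n*sum((rmr$rate_kcal - preds_lin)^2)),
  sqrt(1/n*sum((rmr$rate_kcal - preds_quad)^2)))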

\(~\)

Lab

Cross-validating linear regression

Question #1: Our previous lab introduced the afqt data set, which is a random sample of \(n = 2584\) Americans who were first selected and tested in 1979, then re-interviewed in 2006 and asked about their education and annual income.

afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")

Part A: Consider the use of k-fold cross-validation with \(k = 8\) to evaluate the simple linear regression model: Income2005 ~ AFQT. How many different times are the model’s parameters (\(\{\beta_0, \beta_1\}\)) estimated during the evaluation process? How many observations (data-points) contribute to each of these estimates?

Part B: Estimate the out-of-sample \(RMSE\) of the simple linear regression described in Part A using 8-fold cross-validation.

Part C: Now consider the multiple regression model: Income2005 ~ AFQT + Educ. Use 8-fold cross-validation to estimate the out-of-sample \(RMSE\) of this model and compare it to the model you evaluated in Parts A and B. Hint: be sure to use the same fold assignments (which are determined by the randomization seed) for both models.

Part D: Consider a log (base 2) transformation of “Income2005”. Could cross-validation be used to determine whether this transformation yields a better fitting model than the ones evaluated in Parts B and C? Briefly explain. Hint: think about the meaning of \(RMSE\).

\(~\)

How many folds?

Using \(k\)-fold cross-validation requires a pre-specified value of \(k\), the number of cross-validation folds.

Larger values of \(k\) will lead to less variability in the estimates of out-of-sample \(RMSE\) across different sets of randomly generated folds; however, if \(k\) is too large the trained models used in each fold will be highly correlated (leading to correlated errors when applied to the testing data).

To understand this dilemma, it’s useful to think about both extremes:

  1. \(k=n\), an approach known as leave-one-out cross-validation (LOOCV)
  2. \(k=2\), a single data split

In LOOCV, the trained models are nearly identical to one another, since each training set omits only a single data-point. In a single data split, the two models that are estimated can be very different from each other, which makes the \(RMSE\) estimate highly dependent upon the random split that was drawn.
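As a concrete illustration of the first extreme, here is a minimal sketch of LOOCV for the simple linear regression on the resting metabolism data (assuming rmr is loaded); each observation serves as its own testing fold:

## Sketch - leave-one-out cross-validation (k = n)
n <- nrow(rmr)
preds_loo <- numeric(n)

for(i in 1:n){
  m <- lm(rate_kcal ~ weight_lbs, data = rmr[-i, ])  ## train on everything except row i
  preds_loo[i] <- predict(m, newdata = rmr[i, ])     ## predict the single held-out row
}

## Out-of-sample RMSE
sqrt(1/n*sum((rmr$rate_kcal - preds_loo)^2))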

Question #2 (Part A): Without writing or running any code, briefly explain why set.seed() is not necessary to ensure reproducible results when using leave-one-out cross-validation.

Question #2 (Part B): For \(k=2\), use k-fold cross-validation with three different randomization seeds to come up with three different estimates of the out-of-sample \(RMSE\) of the simple linear regression model Income2005 ~ AFQT (for the afqt data described in Question #1). Does the variability displayed in these estimates seem problematic when considering the out-of-sample \(RMSE\)?

\(~\)

For the reasons discussed above, a more robust approach to estimating a model’s out-of-sample performance is repeated k-fold cross-validation. As the name suggests, this method uses an outer loop to repeat the process of k-fold cross-validation several times. Performance estimates from each repetition are then aggregated to produce a single estimate (usually by simple averaging).

The code below demonstrates repeated cross-validation:

## Set up
set.seed(8)
n <- nrow(rmr)
preds <- numeric(n)
reps <- 10
rmses <- numeric(reps)

## Outer loop
for(i in 1:reps){
  
  ## Randomly assign folds for this repetition
  fold_id <- sample(rep(1:5, length.out = n), size = n)
  
  ## Inner loop through the k=5 folds
  for(k in 1:5){
    
    ## Split the data into training and testing partitions
    train <- rmr[fold_id != k, ]
    test <- rmr[fold_id == k, ]
    
    ## Fit model on this iteration's training folds
    m1 <- lm(rate_kcal ~ weight_lbs, data = train)
    
    ## Store predictions for the testing fold
    preds[fold_id == k] <- predict(m1, newdata = test)
  }
  
  ## Store each outer loop iteration's RMSE estimate
  rmses[i] <- sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
}

## Results
summary(rmses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   160.0   161.8   162.4   163.4   164.9   169.0

Question #3 (Part A): In the example code provided above, how many repetitions of k-fold cross-validation were performed?

Question #3 (Part B): How many cross-validation folds were used in each repetition?

Question #3 (Part C): What are the range and interquartile range of the out-of-sample \(RMSE\) estimates? Briefly comment on whether or not you feel repeated cross-validation is useful in this application.

\(~\)

Remarks: When using repeated cross-validation, most practitioners recommend choosing \(k\) between 5 and 10, with the number of repetitions depending upon the application. In the initial stages of an analysis, a single repetition of k-fold cross-validation is generally sufficient. In the final stages of analysis, such as reporting the final performance of a chosen model, \(k\)-fold cross-validation should be repeated as many times as necessary to yield a reliable estimate.

\(~\)

Cross-validating \(k\)-nearest neighbors

A strength of cross-validation is that it can be used to compare any two models involving the same outcome variable. This enables us to compare the performance of completely different model classes, such as non-parametric methods like \(k\)-nearest neighbors and parametric models like linear regression - something that traditional statistical testing is unable to do.

The code below demonstrates the use of 5-fold cross-validation to compare \(k\)-nearest neighbors (with k = 7) and simple linear regression for the resting metabolism data:

# Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7) 
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Step 2
## Two empty numeric vectors to store out-of-sample predictions (kNN with k = 7 and simple linear regression)
preds_k7 <- preds_slr <- numeric(n)


## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit models on the data
  mod_k7 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), 
                    y = train$rate_kcal, k = 7)   
  mod_slr <- lm(rate_kcal ~ weight_lbs, data = train)
  
  
  ## Store predictions
  preds_k7[fold_id == k] <- mod_k7$pred
  preds_slr[fold_id == k] <- predict(mod_slr, newdata = test)
}

## Out-of-sample RMSE for each model
c(sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2)),
  sqrt(1/n*sum((rmr$rate_kcal - preds_slr)^2)))
## [1] 163.3975 166.3328

Cross-validation is also commonly used to select tuning parameters, or quantities that must be specified by the analyst before a model can be fit (such as the value \(k\) in \(k\)-nearest neighbors).

The code below uses 5-fold cross-validation to compare three different values of \(k\):

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7) 
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of k)
preds_k3 <- preds_k7 <- preds_k15 <- numeric(n)


## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit models on the data
  mod_k3 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 3)    
  mod_k7 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 7) 
  mod_k15 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 15)  
  
  ## Store predictions
  preds_k3[fold_id == k] <- mod_k3$pred
  preds_k7[fold_id == k] <- mod_k7$pred
  preds_k15[fold_id == k] <- mod_k15$pred
}

## Out-of-sample RMSE for each model
out_RMSE_k3 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k3)^2))
out_RMSE_k7 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2))
out_RMSE_k15 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k15)^2))

## Comparison
print(c(out_RMSE_k3, out_RMSE_k7, out_RMSE_k15))
## [1] 197.1123 163.3975 180.0256

This comparison suggests \(k = 7\) leads to a better model than either \(k = 3\) or \(k = 15\). Additionally, you should notice that out-of-sample \(RMSE\) is no longer monotonically improving as \(k\) decreases (something that happened with in-sample \(RMSE\)).

Question #4: For this question you will use the afqt data introduced in Question #1.

Part A: As mentioned in the previous lab, \(k = \sqrt{n}\) is a reasonable starting point when using \(k\)-nearest neighbors. For Part A of Question #4, use cross-validation to compare the out-of-sample performance of kNN models for three different values of \(k\): 25, 50, and 75 when using the variables “Educ” and “AFQT” to predict “Income2005”. Report a performance estimate for each of these possible values, and indicate which choice of \(k\) results in the best model.

Part B: An advantage of kNN is the flexibility to detect non-linearity in the relationship between a predictor and the outcome. For Part B of Question #4, compare the out-of-sample performance of the best kNN model you identified in Part A with a multiple linear regression model that uses third degree polynomials to represent the effects of “Educ” and “AFQT”.

Part C: In Question #1 you considered two different linear models for these data and evaluated their performance using cross-validation. Considering results from Question #1 and the earlier parts of this question, do zero, one, or both of the models from Question #1 outperform the best model from Part B?

\(~\)

Penalized regression

While cross-validation can be used to select which explanatory variables to use in a linear regression model, doing so quickly becomes untenable when the number of available variables is moderate to large.

The least absolute shrinkage and selection operator, or lasso, is an alternative to ordinary least squares regression that naturally performs variable selection and shrinks coefficient estimates towards zero, often leading to improved out-of-sample performance.

Using penalization to shrink a model’s estimated effects towards zero in order to avoid over-fitting is known as regularization, and this technique can be applied to many types of models.

Mathematically, the lasso solution is found by minimizing \(Q(\beta) = \sum_i (y_i - x_i^T\beta)^2 + \lambda ||\beta||_1\) with respect to \(\beta\), where \(||\beta||_1 = |\beta_1| + |\beta_2| + \ldots + |\beta_p|\) is the L1 norm of the coefficient vector. \(\lambda\) is a tuning parameter, or a numeric constant that needs to be specified prior to estimation.

  • When \(\lambda = 0\), the lasso solution is the same as the least squares solution.
  • For \(\lambda \rightarrow \infty\), the lasso solution is the zero vector (ie: no variables make non-zero contributions in the model).
  • Intermediate values of \(\lambda\) typically produce sparse solutions, where only a subset of variables make non-zero contributions to the model.

Typically we try an entire path of different \(\lambda\) values, tracking the lasso solution for each. The code below demonstrates how to use the glmnet package to fit a lasso model that predicts the sale prices of homes sold in Iowa City, IA between 2005 and 2008 using several attributes:

ic <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_X <- ic %>% select(ac,area.bsmt, area.garage1, area.garage2,
                      area.living, area.lot, assessed)

## Pre-processing
ic_X <- model.matrix(~ -1  + ., data = ic_X) ## Notice how "ac" is handled
ic_Y <- ic$sale.amount

## Modeling
library(glmnet)
lasso_fit <- glmnet(x = ic_X, y = ic_Y)

Note:

  • glmnet() doesn’t allow formulas; you must provide a matrix of explanatory variables and a vector of outcomes separately. When dummy variables are needed, the model.matrix() function should be used to construct an appropriate design matrix.
    • Using model.matrix() is very similar to setting up a model formula. Here, “-1” is used to indicate we don’t want an intercept (glmnet will create an intercept for us), and “.” indicates we want to use everything else in ic_X (see the quick check below).
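As a quick check of this pre-processing step (not required), you could inspect the design matrix that model.matrix() produced; the categorical variable “ac” should appear as two indicator columns (acNo and acYes), matching the coefficient printouts below:

## Quick check of the design matrix created by model.matrix()
colnames(ic_X)   ## indicator columns for "ac" plus the numeric predictors
dim(ic_X)        ## number of rows should match nrow(ic)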

To understand a lasso model, it is useful to visualize its coefficient path:

plot(lasso_fit, xvar = "lambda", label = TRUE)
abline(v = 7.7, lty = 3)

The coefficient path displays the relationship between the penalty parameter \(\lambda\), the number of active variables, and their estimated coefficients.

Each vertical cross-section of the coefficient path represents a different lasso model. For illustration, we use abline() to add a vertical line at \(log(\lambda) = 7.7\). Notice that the model characterized by this penalty contains only 3 active variables (everything else has an estimated coefficient of exactly zero). We can use the coef() function to print these variables and their estimated coefficients:

coef(lasso_fit, s = exp(7.7))
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  4.266824e+03
## acNo         .           
## acYes        .           
## area.bsmt    .           
## area.garage1 .           
## area.garage2 2.522547e+00
## area.living  .           
## area.lot     3.063874e-03
## assessed     1.007374e+00

An optimal value of \(\lambda\) can be determined using cross-validation. In fact, glmnet contains its own cross-validation implementation in the function cv.glmnet().

cv_lasso_fit <- cv.glmnet(x = ic_X, y = ic_Y, nfolds = 5)
cv_lasso_fit$lambda.min
## [1] 923.3904

In this example, a \(\lambda\) value of 923.3904, or roughly 6.828 on the log-scale, corresponds with the best fitting model. Conveniently, cv.glmnet() also fits a lasso model to the full data set, so we can find the estimated parameters of the best fitting model using the following code:

coef(cv_lasso_fit$glmnet.fit, s = cv_lasso_fit$lambda.min)
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  875.0476236
## acNo           .        
## acYes          .        
## area.bsmt      0.4728611
## area.garage1  -6.3536220
## area.garage2   7.6736881
## area.living    .        
## area.lot       0.1223852
## assessed       1.0252551

Thus, the estimated lasso model is: \[\small \widehat{SalePrice} = 875.05 + 0.47*\text{area.bsmt} - 6.35*\text{area.garage1} + 7.67*\text{area.garage2} + 0.12*\text{area.lot} + 1.025*\text{assessed}\]

Notice the variables “ac” and “area.living” were not selected to contribute to the model.
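If you'd prefer to count the selected (or unselected) variables programmatically rather than reading them off the printout, one possible approach is the sketch below:

## Sketch - count active vs. zeroed-out coefficients at the CV-chosen lambda
b <- as.matrix(coef(cv_lasso_fit$glmnet.fit, s = cv_lasso_fit$lambda.min))
sum(b[-1, ] != 0)   ## number of variables selected (intercept excluded)
sum(b[-1, ] == 0)   ## number of variables not selected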

Question #5: The code given below loads the “colleges 2019” data set and filters these data to only include observations without any missing data. The goal in this analysis is to model the median 10-year salary of each college’s alumni using the other variables contained in cd.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(complete.cases(colleges))

Part A: Recall that glmnet() requires an outcome vector, y, and a model matrix, X. For this question, set up a vector named cd_y that contains the outcome variable, and create a properly formatted model matrix cd_X that contains all other variables besides “Name” and “City”, which we’ll consider identifiers of a college. Hint: cd_X should contain 50 binary indicator variables representing each category of “State”.

Part B: Fit a lasso model using 10-fold cross-validation as implemented by the cv.glmnet() function. Graph the coefficient path of this model and add a vertical line at the value of \(\lambda\) favored by cross-validation.

Part C: The model matrix used in Parts A and B should have contained 75 columns (predictor variables). At the value of \(\lambda\) favored by cross-validation, how many of these variables were not selected to contribute to the final model?

\(~\)

Practice

Question #6: The data loaded below, oz, record the daily ozone concentrations in New York City across 111 different days in the same year. The variable “Day” describes how many days after Jan 1st the measurement occurred.

oz <- read.csv("https://remiller1450.github.io/data/Ozone.csv")

Part A: Use the pairs() function to create a scatter plot matrix that displays the variables in oz. Which two variables appear to have the strongest linear relationships with the outcome variable, “Ozone”?

Part B: Fit a linear regression model that uses a day’s wind speed and temperature (as linear effects) to predict the ozone concentration for that day. Then, use plotly to visualize this model using a surface overlaid on a 3-D scatter plot. Hint: see the plotly lab for several examples, and be careful to set up your “x” and “y” sequences using the range of the predictors used in this model.

Part C: Estimate the out-of-sample \(RMSE\) of the model you fit in Part B using 5-fold cross-validation.

Part D: Use 5-fold cross-validation to compare the out-of-sample \(RMSE\) of the model from Part B with that of a \(k\)-nearest neighbors model using \(k = 11\) (approximately \(\sqrt{n}\)). You should be careful to use the same cross-validation folds for both models, and you should report each model’s estimated out-of-sample \(RMSE\).

Part E: Visualize the \(k\)-nearest neighbors model using \(k = 11\) from Part D using a surface overlaid on a 3-D plotly scatter plot. Hint: Use the same grid object you created in Part B as the input to the test argument in knn.reg().

Part F: If you were to fit a lasso model to these data using all available predictors, would you expect the surface display of the resulting model to be more similar to the surface in Part B or the surface in Part E? Briefly explain (no R code is necessary for this question).