Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Packages and Datasets

In addition to familiar packages, this lab will use the implementation of \(k\)-nearest neighbors regression in the FNN package.

# load the following packages
# install.packages("FNN", "rpart", "rpart.plot")
library(FNN)
library(ggplot2)
library(dplyr)
library(rpart)
library(rpart.plot)

\(~\)

Motivation

We’ve been using the following framework for supervised learning: \[Y = f(\mathbf{X}) + \epsilon\]

  • When determining \(f()\), a major consideration is prediction error, which can be measured via \(RMSE\) (defined below)
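
Throughout this lab we'll calculate \(RMSE\) as \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\] where \(\hat{y}_i\) denotes a model's prediction for the \(i^{th}\) of \(n\) data-points.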

Shown below are various linear regression models for the RMR data set:

and shown below are various \(k\)-nearest neighbors models:

  • Which of these models do you think has the lowest RMSE?
  • Which of these models do you think is most useful in understanding the relationship between weight and metabolism? Which do you think will make the most accurate predictions on a new sample of data?

An obvious problem is that more complex/flexible models can always achieve lower error on the training data than less complex or less flexible models.

However, these models are simply learning patterns in observed errors that are inherent to the sample data. They aren’t finding more accurate representations of \(f()\).

\(~\)

Validation

Because any model has the potential to learn non-existent patterns from the noise/error in the training data, we should assess the error rate of any models we are considering on an independent set of data that was not used to train the model (ie: estimate \(f()\)).

A simple strategy is to separate off a random subset of our available data and use it only for calculating \(RMSE\) (and not for estimating \(f()\)):

rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")

## Randomly choose 10 data-points for validation
validation_ids <- sample(1:nrow(rmr), size = 10, replace = FALSE)

## Split the training and validation sets
rmr_validation = rmr[validation_ids, ]
rmr_training = rmr[-validation_ids, ]

We train our model(s) on the data that are not part of the validation set:

## Train using lm()
reg_degree1 <- lm(rate_kcal ~ poly(weight_lbs, degree = 1), data = rmr_training)
reg_degree8 <- lm(rate_kcal ~ poly(weight_lbs, degree = 8), data = rmr_training)

And we evaluate performance (ie: calculate RMSE) on the validation set (sometimes called a test set) that we previously separated from the training data:

## Use fitted models to get predictions on the validation set
yh_degree1 <- predict(reg_degree1, newdata = rmr_validation)
yh_degree8 <- predict(reg_degree8, newdata = rmr_validation)

## Calculate RMSE using those predictions
mse1 = 1/nrow(rmr_validation)*sum((rmr_validation$rate_kcal - yh_degree1)^2)
mse8 = 1/nrow(rmr_validation)*sum((rmr_validation$rate_kcal - yh_degree8)^2)

## Print results
print(sqrt(c(mse1,mse8)))
## [1] 216.1449 775.0366
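
Since we'll be computing \(RMSE\) this way repeatedly, you may find it convenient to wrap the calculation in a small helper function. The sketch below is optional, and the name rmse() is simply a suggestion (it isn't used elsewhere in this lab), but it produces the same values as the calculation shown above:

## Optional helper function - RMSE from observed outcomes and predictions
rmse <- function(observed, predicted){
  sqrt(mean((observed - predicted)^2))
}

## For example, rmse(rmr_validation$rate_kcal, yh_degree1) matches the first value printed above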

Because these \(RMSE\) values were calculated using data that were not used to train the models, we will describe each as an estimate of that model’s out-of-sample \(RMSE\) (or out-of-sample error more generally).

\(~\)

Cross-validation

In the previous section, we described RMSE calculated from the validation set as an “estimate”. This is important, because a different validation set would result in a different value.

When the size of training and validation sets are small (as they are here), the variability in our estimate of out-of-sample RMSE can be problematic. It might be difficult to confidently conclude that one model actually has lower error than another since smaller estimates could be explained by the particular validation set that was used.

A strategy to combat this limitation is cross-validation, which iteratively repeats the validation approach described in the previous section. We’ll focus on k-fold cross-validation, which is summarized by the following algorithm:

  1. Randomly split the sample data into \(k\) non-overlapping subsets (called “folds”)
  2. Hold out one of these subsets for validation, and train the model of interest using the data in the remaining \(k-1\) subsets
  3. Estimate the error of the model trained in Step 2 using the held out subset as a validation set
  4. Repeat Steps 2 and 3 until each fold has been held out exactly once.

The diagram below is an illustration of 4-fold cross-validation:

Lab

The Mechanics of Cross-Validation

The following code demonstrates a simple implementation of 5-fold cross-validation. While many R packages can perform cross-validation for you without so many lines of code, familiarizing yourself with an implementation like this one is valuable in understanding exactly what the approach is doing, and it provides you the flexibility to cross-validate any model you’re interested in.

## Step 0 - set up preliminary quantities
set.seed(8)          ## set.seed will now ensure reproducible fold assignments
n <- nrow(rmr)       ## Sample size
n_folds <- 5         ## Set number of folds

## Step 1 - Create an ID variable containing fold assignments
fold_id <- sample(rep(1:n_folds, length.out = n), size = n)

## Step 2 - Set up an empty vector to store predictions
preds <- numeric(n)

## Loop through the k=5 folds
for(k in 1:n_folds){
  
  ## Split the data into training and testing partitions
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]

  ## Fit model on this iteration's training folds
  m1 <- lm(rate_kcal ~ weight_lbs, data = train)
  
  ## Store predictions for the testing fold
  preds[fold_id == k] <- predict(m1, newdata = test)
}

## Calculate out-of-sample RMSE
sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
## [1] 164.3637

A few things to be aware of:

  1. When fitting the chosen model, the argument data = train is critically important. A common mistake is to use the entire data set here (which defeats the entire purpose of cross-validation)
  2. If you are using cross-validation to compare multiple models, it’s best to do so within the same loop so that identical fold assignments are used.

Question #1: Our previous lab introduced the afqt data, which are a random sample of \(n = 2584\) Americans who took an aptitude test in 1979, then were re-interviewed in 2006 and asked about their education and annual income.

afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
  • Part A: Consider the use of k-fold cross-validation with \(k = 8\) to evaluate the simple linear regression model: Income2005 ~ AFQT. How many different times are the model’s parameters (\(\{\beta_0, \beta_1\}\)) estimated during cross-validation? Additionally, how many observations (data-points) contribute towards each of these estimates?
  • Part B: Estimate the out-of-sample \(RMSE\) of the simple linear model described in Part A using 8-fold cross-validation.
  • Part C: Now consider the multiple regression model: Income2005 ~ AFQT + Educ. Use 8-fold cross-validation to estimate the out-of-sample \(RMSE\) of this model and compare it to the model you evaluated in Parts A and B. Hint: You should use the same fold assignments for both models.
  • Part D: Consider a log-2 transformation of “Income2005”. Could cross-validation be used to determine whether this transformation yields a better fitting model than the ones evaluated in Parts B and C? Briefly explain. Hint: Think about the meaning of \(RMSE\) and its units.

\(~\)

How many folds?

Cross-validation requires choosing a pre-specified number of folds. Generally speaking, this choice isn’t overly consequential, as we’re simply using cross-validation as a tool to come up with a reliable estimate of a model’s performance.

However, we ultimately need to specify \(k\), so we should be aware of what happens at each extreme:

  1. One extreme is setting \(k=n\), an approach known as leave-one-out cross-validation (LOOCV)
  2. The other extreme is setting \(k=2\), or performing a single data split

In LOOCV, most of the estimated models are nearly identical, as their training data differs by just a single data-point. Consequently, LOOCV may end up averaging errors that are correlated with each other. This doesn’t necessarily imply the LOOCV estimate is biased or “bad”, but it can be more sensitive to specific artifacts present in the sample data.

In contrast, the two estimated models involved in 2-fold cross-validation can be very different from one another, which can make the final error estimate sensitive to the specific random split that was used.

Question #2 (Part A): Without using any R code, briefly explain why the set.seed() command is not necessary to ensure reproducible results when using leave-one-out cross-validation.

Question #2 (Part B): For the rmr data, modify the cross-validation code that was previously provided to use 2-fold cross-validation. Run this loop 4 different times, setting a different randomization seed each time, and record the RMSE estimate for each randomization seed. Why might the variability in these different estimates be problematic? Briefly explain.

Remarks:

  • In practice, \(k=5\) and \(k=10\) are commonly used choices that tend to avoid the undesirable behaviors described above.
  • If you suspect high variability in a cross-validated error estimate, you can perform repeated k-fold cross-validation and average the error estimates of each repetition (a sketch of this approach is shown below).
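
The sketch below illustrates repeated cross-validation for the simple linear regression model on the rmr data. The choice of 10 repetitions is arbitrary, and the code assumes the rmr data are already loaded:

## Repeated 5-fold cross-validation (a sketch using 10 repetitions)
n <- nrow(rmr)
n_reps <- 10
rep_rmse <- numeric(n_reps)

for(r in 1:n_reps){
  
  ## New random fold assignment for each repetition
  fold_id <- sample(rep(1:5, length.out = n), size = n)
  preds <- numeric(n)
  
  ## Ordinary 5-fold cross-validation within this repetition
  for(k in 1:5){
    train <- rmr[fold_id != k, ]
    test <- rmr[fold_id == k, ]
    m1 <- lm(rate_kcal ~ weight_lbs, data = train)
    preds[fold_id == k] <- predict(m1, newdata = test)
  }
  
  ## RMSE estimate from this repetition
  rep_rmse[r] <- sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
}

## Average across repetitions for a more stable estimate
mean(rep_rmse)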

\(~\)

Cross-validation and \(k\)-nearest neighbors

A strength of cross-validation is that it can be used to compare any two models involving the same outcome variable. This allows us to objectively compare the performance of models built using completely different frameworks. For example, we can compare non-parametric methods like \(k\)-nearest neighbors with highly structured parametric models like linear regression.

The code below demonstrates the use of 5-fold cross-validation to compare \(k\)-nearest neighbors (with k = 7) and simple linear regression on the resting metabolism data:

## Setup folds
set.seed(7) 
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Empty vectors to store predictions
preds_k7 <- preds_slr <- numeric(n)

## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit models on the data
  mod_k7 <- knn.reg(train = train$weight_lbs,  
                    test = data.frame(test$weight_lbs), 
                    y = train$rate_kcal, k = 7)   
  mod_slr <- lm(rate_kcal ~ weight_lbs, data = train)
  
  
  ## Store predictions
  preds_k7[fold_id == k] <- mod_k7$pred
  preds_slr[fold_id == k] <- predict(mod_slr, newdata = test)
}

## Out-of-sample RMSE for each model
c(sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2)),
  sqrt(1/n*sum((rmr$rate_kcal - preds_slr)^2)))
## [1] 163.3975 166.3328

We can see that KNN slightly outperforms simple linear regression.

Cross-validation is also commonly used to select tuning parameters, or quantities that must be specified by the analyst before a model can be fit (such as the value \(k\) in \(k\)-nearest neighbors).

The code below uses 5-fold cross-validation to compare three different values of \(k\):

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7) 
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of k)
preds_k3 <- preds_k7 <- preds_k15 <- numeric(n)


## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit models on the data
  mod_k3 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 3)    
  mod_k7 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 7) 
  mod_k15 <- knn.reg(train = train$weight_lbs,  test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 15)  
  
  ## Store predictions
  preds_k3[fold_id == k] <- mod_k3$pred
  preds_k7[fold_id == k] <- mod_k7$pred
  preds_k15[fold_id == k] <- mod_k15$pred
}

## Out-of-sample RMSE for each model
out_RMSE_k3 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k3)^2))
out_RMSE_k7 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2))
out_RMSE_k15 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k15)^2))

## Comparison
print(c(out_RMSE_k3, out_RMSE_k7, out_RMSE_k15))
## [1] 197.1123 163.3975 180.0256

This comparison suggests \(k = 7\) leads to a better model than either \(k = 3\) or \(k = 15\). Recall that we’ve previously discussed the fact that in-sample RMSE will continually improve as \(k\) decreases, but this is clearly not the case for out-of-sample RMSE.
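
If you'd like to consider a larger set of candidate values of \(k\), one option is to store the predictions for each candidate in a separate column of a matrix and loop over the candidates within each fold. The sketch below reuses the fold assignments (fold_id) and sample size (n) created above, and the particular grid of candidate values is arbitrary:

## A sketch of tuning k over a grid of candidate values
k_grid <- c(3, 5, 7, 9, 11, 15)
pred_mat <- matrix(NA, nrow = n, ncol = length(k_grid))

## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit one kNN model per candidate value in the grid
  for(j in seq_along(k_grid)){
    mod <- knn.reg(train = train$weight_lbs,
                   test = data.frame(test$weight_lbs),
                   y = train$rate_kcal, k = k_grid[j])
    pred_mat[fold_id == k, j] <- mod$pred
  }
}

## Out-of-sample RMSE for each candidate value of k
apply(pred_mat, 2, function(p) sqrt(1/n*sum((rmr$rate_kcal - p)^2)))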

Question #4: For this question you will use the afqt data used in Question #1 and introduced in our previous lab.

  • Part A: Use 5-fold cross-validation to compare the out-of-sample performance of kNN models for three different values of \(k\): 15, 25, and 50 when using the variables “Educ” and “AFQT” to predict “Income2005”. Provide an RMSE estimate for each model, and indicate the best choice of \(k\) (among these three options). Hint: You should standardize these two predictors and store them in a separate data frame to be used by knn.reg().
  • Part B: Use 5-fold cross-validation to compare the best model from Part A with the linear regression model defined by the formula Income2005 ~ AFQT + Educ. Which model has better out-of-sample performance? What does this tell you about the relationship between AFQT scores, education, and income? Briefly explain.

\(~\)

Cross-validation and decision trees

Recall from our previous lab that an important aspect of the decision tree algorithm is its stopping point. If the algorithm is allowed to continue until every data-point is in its own node, the resulting tree will almost surely be over-fit to the training data and will not generalize well to new data.
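
For instance, allowing rpart() to split until (nearly) every data-point is in its own terminal node can be done by removing the complexity penalty and letting very small nodes remain eligible for splitting. The quick illustration below is not a model you'd actually want to use:

## An intentionally over-grown tree (no complexity penalty, tiny nodes eligible for splitting)
overfit_tree <- rpart(rate_kcal ~ weight_lbs, data = rmr,
                      control = rpart.control(cp = 0, minsplit = 2))

## Notice the large number of terminal nodes relative to the sample size
rpart.plot(overfit_tree)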

Cross-validation can be used to tune the following parameters of the decision tree algorithm to find values that optimize out-of-sample RMSE:

  1. Complexity parameter (cp)
  2. Tree depth (maxdepth)
  3. Minimum number of data-points in a node eligible for splitting (minsplit)

While you could explore different combinations of parameter values for all three using a grid search approach, it often makes sense to focus on tuning the maximum depth of the tree as a good proxy for its overall complexity. Maximum depth has the advantage of being discrete and having a relatively small number of values of interest (generally in the single digits).

For illustration purposes, the code below demonstrates how to tune the complexity parameter using cross-validation (as I’ll leave it to you to tune maxdepth on your own):

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7) 
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of cp)
preds1 <- preds2 <- preds3 <- numeric(n)


## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit models on the data
  mod_cp10 <-  rpart(rate_kcal ~ weight_lbs, data = train,
                 control = rpart.control(cp = 0.1, minsplit = 2))  
  mod_cp05 <-  rpart(rate_kcal ~ weight_lbs, data = train,
                 control = rpart.control(cp = 0.05, minsplit = 2))
  mod_cp01 <-  rpart(rate_kcal ~ weight_lbs, data = train,
                 control = rpart.control(cp = 0.01, minsplit = 2))
  
  ## Store predictions
  preds1[fold_id == k] <- predict(mod_cp10, test)
  preds2[fold_id == k] <- predict(mod_cp05, test)
  preds3[fold_id == k] <- predict(mod_cp01, test)
}

## Out-of-sample RMSE for each model
out_RMSE1 <- sqrt(1/n*sum((rmr$rate_kcal - preds1)^2))
out_RMSE2 <- sqrt(1/n*sum((rmr$rate_kcal - preds2)^2))
out_RMSE3 <- sqrt(1/n*sum((rmr$rate_kcal - preds3)^2))

## Comparison
print(c(out_RMSE1, out_RMSE2, out_RMSE3))
## [1] 190.3522 203.9304 208.7651

Question #5:

  • Part A: Modify the above code by removing minsplit = 2 from the control options. Why do you think all of the out-of-sample RMSE estimates are the same after making this change?
  • Part B: How many different decision trees are estimated during the entire 5-fold cross-validation for loop given in the above code? Briefly explain.

\(~\)

Computational Considerations

Algorithms like decision trees involve several tuning parameters, and each combination of tuning parameters that is to be assessed using 5-fold cross-validation requires the estimation of 5 different decision tree models.

Consider a search over 3 different values of maxdepth, 3 different values of minsplit, and 3 different values of cp. There are 27 different combinations of these tuning parameters in this search, meaning 135 different decision trees need to be estimated/trained if 5-fold cross-validation is used to select the best performing combination. If we wanted to consider 1 additional value of any of these tuning parameters, the number of models that need to be estimated increases to 180.
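
You can verify this type of counting using expand.grid(), which enumerates every combination of candidate values (the specific values below are arbitrary placeholders):

## Enumerate an example 3 x 3 x 3 grid of tuning parameter combinations
tune_grid <- expand.grid(maxdepth = c(2, 3, 4),
                         minsplit = c(2, 5, 10),
                         cp = c(0.01, 0.05, 0.10))

nrow(tune_grid)      ## number of combinations (27)
nrow(tune_grid)*5    ## trees trained during 5-fold cross-validation (135)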

Pruning is a strategy that can be used to reduce the number of decision tree models you’re training. This exploits the fact that a deeper tree can be “cut” into a shallower tree due to the greedy nature of the decision tree algorithm. Thus, the maxdepth and cp parameters can be tuned without rerunning the algorithm.

The code below demonstrates pruning by using the prune() function to help optimize cp using 5-fold cross-validation:

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7) 
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size = n)

## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of cp)
preds1 <- preds2 <- preds3 <- numeric(n)


## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- rmr[fold_id != k, ]
  test <- rmr[fold_id == k, ]
  
  ## Fit the deepest model
  mod_cp01 <-  rpart(rate_kcal ~ weight_lbs, data = train,
                 control = rpart.control(cp = 0.01, minsplit = 2))
  
  ## Cut the deepest model into shorter trees
  mod_cp05 <- prune(mod_cp01, cp = 0.05)
  mod_cp10 <- prune(mod_cp01, cp = 0.10)
  
  ## Store predictions
  preds1[fold_id == k] <- predict(mod_cp10, test)
  preds2[fold_id == k] <- predict(mod_cp05, test)
  preds3[fold_id == k] <- predict(mod_cp01, test)
}

## Out-of-sample RMSE for each model
out_RMSE1 <- sqrt(1/n*sum((rmr$rate_kcal - preds1)^2))
out_RMSE2 <- sqrt(1/n*sum((rmr$rate_kcal - preds2)^2))
out_RMSE3 <- sqrt(1/n*sum((rmr$rate_kcal - preds3)^2))

## Comparison
print(c(out_RMSE1, out_RMSE2, out_RMSE3))
## [1] 190.3522 203.9304 208.7651

Notice how these RMSE results are identical to those we previously saw, but this time we only needed to estimate 5 different trees instead of 15 different trees.

Question #6: Find a combination of tuning parameters that yields a lower out-of-sample RMSE than the best performance achieved in our example (ie: better than 190.35). You should explore different values for both cp and minsplit in your search. Be sure to use the same randomization seed as the example (ie: set.seed(7)).

\(~\)

Application

Question #7: The data loaded below, oz, record the daily ozone concentrations in New York City across 111 different days in the same year. The variable “Day” describes how many days after Jan 1st the measurement occurred. The goal of this application is to develop a model that can accurately predict the Ozone concentration on a given day based upon the wind speed, amount of solar radiation, and temperature for that day.

oz <- read.csv("https://remiller1450.github.io/data/Ozone.csv")

Part A: Use the pairs() function to create a scatter plot matrix displaying each combination of variables in oz. Which two variables appear to have the strongest linear relationships with the outcome variable, “Ozone”?

Part B: Fit a linear regression model that uses a day’s wind speed and temperature (as linear effects) to predict the ozone concentration for that day. Then, use plotly to visualize this model using a surface overlaid on a 3-D scatter plot. Hint: see the plotly lab for several examples, and be careful to set up your “x” and “y” sequences using the range of the predictors used in this model.

Part C: Estimate the out-of-sample \(RMSE\) of the model you fit in Part B using 5-fold cross-validation.

Part D: Starting with your code from Part C, use 5-fold cross-validation to compare the out-of-sample \(RMSE\) of the linear regression model Part B with a \(k\)-nearest neighbors model using \(k = 11\) (approximately \(\sqrt{n}\)). Both models should only use wind speed and temperature as predictors, and you should be careful to use the same cross-validation folds when comparing the models. Be sure that you report each model’s estimated out-of-sample \(RMSE\).

Part E: Visualize the \(k\)-nearest neighbors model using \(k = 11\) from Part D using a surface overlaid on a 3-D plotly scatter plot. Hint: Use the same grid object you created in Part B as the input to the test argument in knn.reg().

Part F: Use 5-fold cross-validation to find the optimum depth of a decision tree model that uses wind speed and temperature as predictors of ozone concentration. You should compare maximum depths of 2, 3, and 4, reporting the depth that you found to be most appropriate.

Part G: Visualize the model from Part F using a surface overlaid on a 3-D plotly scatter plot. Briefly describe how this model’s prediction surface compares with the ones from Part B and Part E.