Directions (Please read before starting)
\(~\)
In addition to familiar packages, this lab will use the
implementation of \(k\)-nearest
neighbors regression in the FNN
package.
# load the following packages
# install.packages(c("FNN", "rpart", "rpart.plot"))
library(FNN)
library(ggplot2)
library(dplyr)
library(rpart)
library(rpart.plot)
\(~\)
We’ve been using the following framework for supervised learning: \[Y = f(\mathbf{X}) + \epsilon\]
Shown below are various linear regression models for the RMR data set:
and shown below are various \(k\)-nearest neighbors models:
An obvious problem is that more complex/flexible models can always achieve lower error on the training data than less complex or less flexible models.
However, these models are simply learning patterns in observed errors that are inherent to the sample data. They aren’t finding more accurate representations of \(f()\).
\(~\)
Because any model has the potential to learn non-existent patterns from the noise/error in the training data, we should assess the error rate of any models we are considering on an independent set of data that was not used to train the model (ie: estimate \(f()\)).
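As a reminder, the error metric used throughout this lab is the root mean squared error, which matches the calculations performed in the code below: \[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\] where \(\hat{y}_i\) denotes a model's prediction for the \(i^{th}\) data-point and \(n\) is the number of data-points used in the calculation.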
A simple strategy is to separate off a random subset of our available data and use it only for calculating \(RMSE\) (and not for estimating \(f()\)):
rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")
## Randomly choose 10 data-points for validation
validation_ids <- sample(1:nrow(rmr), size = 10, replace = FALSE)
## Split the training and validation sets
rmr_validation = rmr[validation_ids, ]
rmr_training = rmr[-validation_ids, ]
We train our model(s) on the data that are not part of the validation set:
## Train using lm()
reg_degree1 <- lm(rate_kcal ~ poly(weight_lbs, degree = 1), data = rmr_training)
reg_degree8 <- lm(rate_kcal ~ poly(weight_lbs, degree = 8), data = rmr_training)
And we evaluate performance (ie: calculate RMSE) on the validation set (sometimes called a test set) that we previously separated from the training data:
## Use fitted models to get predictions on the validation set
yh_degree1 <- predict(reg_degree1, newdata = rmr_validation)
yh_degree8 <- predict(reg_degree8, newdata = rmr_validation)
## Calculate MSE using those predictions (the square roots taken below give RMSE)
mse1 = 1/nrow(rmr_validation)*sum((rmr_validation$rate_kcal - yh_degree1)^2)
mse8 = 1/nrow(rmr_validation)*sum((rmr_validation$rate_kcal - yh_degree8)^2)
## Print results
print(sqrt(c(mse1,mse8)))
## [1] 216.1449 775.0366
Because these \(RMSE\) values were calculated using data that were not used to train the models, we will describe each as an estimate of that model’s out-of-sample \(RMSE\) (or out-of-sample error more generally).
\(~\)
In the previous section, we described the RMSE calculated from the validation set as an "estimate". This is important, because a different validation set would result in a different value.
When the training and validation sets are small (as they are here), the variability in our estimate of out-of-sample RMSE can be problematic. It might be difficult to confidently conclude that one model actually has lower error than another, since a smaller estimate could be explained by the particular validation set that was used.
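To get a rough sense of this variability, the short sketch below (not part of the lab's required code) simply repeats the random validation split from the previous section under a few arbitrarily chosen seeds and recomputes the validation-set RMSE of the degree-1 model each time; the seeds 1 through 4 are just an illustrative choice.
## Illustration only: repeat the random validation split under a few arbitrary seeds
val_rmse <- sapply(1:4, function(s){
  set.seed(s)                                              ## arbitrary seeds 1-4
  ids <- sample(1:nrow(rmr), size = 10, replace = FALSE)   ## new random validation set
  fit <- lm(rate_kcal ~ poly(weight_lbs, degree = 1), data = rmr[-ids, ])
  yh  <- predict(fit, newdata = rmr[ids, ])
  sqrt(mean((rmr[ids, "rate_kcal"] - yh)^2))               ## validation-set RMSE
})
print(val_rmse)
The spread of these four values gives a sense of how much the estimate depends upon which data-points happen to land in the validation set.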
A strategy to combat this limitation is cross-validation, which iteratively repeats the validation approach described in the previous section. We’ll focus on k-fold cross-validation, which is summarized by the following algorithm:
1. Randomly divide the available data into \(k\) non-overlapping subsets ("folds") of roughly equal size.
2. For each fold, fit the model using the data in the other \(k - 1\) folds, then use the fitted model to predict the outcome of every data-point in the held-out fold.
3. Pool the held-out predictions from all \(k\) folds and use them to calculate an out-of-sample \(RMSE\).
The diagram below is an illustration of 4-fold cross-validation:
The following code demonstrates a simple implementation of 5-fold cross-validation. While many R packages can perform cross-validation for you without so many lines of code, familiarizing yourself with an implementation like this one is valuable for understanding exactly what the approach is doing, and it gives you the flexibility to cross-validate any model you’re interested in.
## Step 0 - set up preliminary quantities
set.seed(8) ## set.seed will now ensure reproducible fold assignments
n <- nrow(rmr) ## Sample size
n_folds <- 5 ## Set number of folds
## Step 1 - Create an ID variable containing fold assignments
fold_id <- sample(rep(1:n_folds, length.out = n), size = n)
## Step 2 - Set up an empty vector to store predictions
preds <- numeric(n)
## Loop through the k=5 folds
for(k in 1:n_folds){
## Split the data into training and testing partitions
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit model on this iteration's training folds
m1 <- lm(rate_kcal ~ weight_lbs, data = train)
## Store predictions for the testing fold
preds[fold_id == k] <- predict(m1, newdata = test)
}
## Calculate out-of-sample RMSE
sqrt(1/n*sum((rmr$rate_kcal - preds)^2))
## [1] 164.3637
A few things to be aware of:
- Using data = train when fitting the model inside the loop is critically important. A common mistake is to use the entire data set here (which defeats the entire purpose of cross-validation).
Question #1: Our previous lab introduced the afqt data, which are a random sample of \(n = 2584\) Americans who took an aptitude test in 1979, then were re-interviewed in 2006 and asked about their education and annual income.
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Use 8-fold cross-validation to estimate the out-of-sample \(RMSE\) of the linear regression model Income2005 ~ AFQT.
Part B: How many different times are the model’s parameters (\(\{\beta_0, \beta_1\}\)) estimated during cross-validation? Additionally, how many observations (data-points) contribute towards each of these estimates?
Part C: Now consider the model Income2005 ~ AFQT + Educ. Use 8-fold cross-validation to estimate the out-of-sample \(RMSE\) of this model and compare it to the model you evaluated in Parts A and B. Hint: You should use the same fold assignments for both models.
\(~\)
Cross-validation requires choosing a pre-specified number of folds. Generally speaking, this choice isn’t overly consequential, as we’re simply using cross-validation as a tool to come up with a reliable estimate of a model’s performance.
However, we ultimately need to specify the number of folds, so we should be aware of what happens at each extreme: using \(k = n\) folds, so that every fold contains a single data-point (known as leave-one-out cross-validation, or LOOCV), and using only \(k = 2\) folds.
In LOOCV, most of the estimated models are nearly identical, as their training data differ by just a single data-point. Consequently, LOOCV may end up averaging errors that are correlated with each other. This doesn’t necessarily imply the LOOCV estimate is biased or “bad”, but it can be more sensitive to specific artifacts present in the sample data.
In contrast, the two estimated models involved in 2-fold cross-validation can be very different from one another, which can make the final error estimate sensitive to the specific random split that was used.
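To make the \(k = n\) extreme concrete, here is a minimal sketch (assuming the rmr data are already loaded) of leave-one-out cross-validation for the simple linear regression model used earlier, where each iteration holds out exactly one data-point:
## Minimal LOOCV sketch: every fold contains exactly one data-point
n <- nrow(rmr)
preds_loo <- numeric(n)
for(i in 1:n){
  fit <- lm(rate_kcal ~ weight_lbs, data = rmr[-i, ])  ## train on all but point i
  preds_loo[i] <- predict(fit, newdata = rmr[i, ])     ## predict the held-out point
}
sqrt(1/n*sum((rmr$rate_kcal - preds_loo)^2))           ## LOOCV estimate of out-of-sample RMSE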
Question #2 (Part A): Without using any
R
code, briefly explain why the set.seed()
command is not necessary to ensure reproducible results when using
leave-one-out cross-validation.
Question #2 (Part B): For the rmr
data,
modify the cross-validation code that was previously provided to use
2-fold cross-validation. Run this loop 4 different times, setting a
different randomization seed each time, and record the RMSE estimate for
each randomization seed. Why might the variability in these different
estimates be problematic? Briefly explain.
\(~\)
A strength of cross-validation is that it can be used to compare any two models involving the same outcome variable. This allows us to objectively compare the performance of models built using completely different frameworks. For example, we can compare non-parametric methods like \(k\)-nearest neighbors with highly structured parametric models like linear regression.
The code below demonstrates the use of 5-fold cross-validation to compare \(k\)-nearest neighbors (with k = 7) and simple linear regression on the resting metabolism data:
## Setup folds
set.seed(7)
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size =n)
## Empty vectors to store predictions
preds_k7 <- preds_slr <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Subset the data
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit models on the data
mod_k7 <- knn.reg(train = train$weight_lbs,
test = data.frame(test$weight_lbs),
y = train$rate_kcal, k = 7)
mod_slr <- lm(rate_kcal ~ weight_lbs, data = train)
## Store predictions
preds_k7[fold_id == k] <- mod_k7$pred
preds_slr[fold_id == k] <- predict(mod_slr, newdata = test)
}
## Out-of-sample RMSE for each model
c(sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2)),
sqrt(1/n*sum((rmr$rate_kcal - preds_slr)^2)))
## [1] 163.3975 166.3328
We can see that KNN slightly outperforms simple linear regression.
Cross-validation is also commonly used to select tuning parameters, or quantities that must be specified by the analyst before a model can be fit (such as the value \(k\) in \(k\)-nearest neighbors).
The code below uses 5-fold cross-validation to compare three different values of \(k\):
## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size =n)
## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of k)
preds_k3 <- preds_k7 <- preds_k15 <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Subset the data
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit models on the data
mod_k3 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 3)
mod_k7 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 7)
mod_k15 <- knn.reg(train = train$weight_lbs, test = data.frame(test$weight_lbs), y = train$rate_kcal, k = 15)
## Store predictions
preds_k3[fold_id == k] <- mod_k3$pred
preds_k7[fold_id == k] <- mod_k7$pred
preds_k15[fold_id == k] <- mod_k15$pred
}
## Out-of-sample RMSE for each model
out_RMSE_k3 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k3)^2))
out_RMSE_k7 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k7)^2))
out_RMSE_k15 <- sqrt(1/n*sum((rmr$rate_kcal - preds_k15)^2))
## Comparison
print(c(out_RMSE_k3, out_RMSE_k7, out_RMSE_k15))
## [1] 197.1123 163.3975 180.0256
This comparison suggests \(k = 7\) leads to a better model than either \(k = 3\) or \(k = 15\). Recall that we’ve previously discussed the fact that in-sample RMSE will continually improve as \(k\) decreases, but this is clearly not the case for out-of-sample RMSE.
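If you’d like to verify the in-sample half of that claim, the brief sketch below (not required for this lab) computes in-sample RMSE for the same three choices of \(k\) by having each model predict its own training data; you should see the error shrink as \(k\) decreases, the opposite of the cross-validated pattern above.
## Illustration only: in-sample RMSE for k = 3, 7, 15 (each model predicts its own training data)
in_rmse <- sapply(c(3, 7, 15), function(kk){
  fit <- knn.reg(train = data.frame(rmr$weight_lbs),
                 test  = data.frame(rmr$weight_lbs),
                 y = rmr$rate_kcal, k = kk)
  sqrt(mean((rmr$rate_kcal - fit$pred)^2))
})
print(in_rmse)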
Question #4: For this question you will use the afqt data used in Question #1 and introduced in our previous lab. Use cross-validation to compare a \(k\)-nearest neighbors model fit using knn.reg() with the linear regression model Income2005 ~ AFQT + Educ. Which model has better out-of-sample performance? What does this tell you about the relationship between AFQT scores, education, and income? Briefly explain.
\(~\)
Recall from our previous lab that an important aspect of the decision tree algorithm is its stopping point. If the algorithm is allowed to continue until every data-point is in its own node the resulting tree will almost surely be over-fit to the training data and will not generalize well to new data.
Cross-validation can be used to tune the following parameters of the decision tree algorithm to find values that optimize out-of-sample RMSE:
- The complexity parameter (cp)
- The maximum depth of the tree (maxdepth)
- The minimum number of observations a node must contain for a split to be attempted (minsplit)
While you could explore combinations of values for all three of these parameters using a grid search approach, it often makes sense to focus on tuning the maximum depth of the tree as a good proxy for its overall complexity. Maximum depth has the advantage of being discrete and having a relatively small number of values of interest (generally in the single digits).
For illustration purposes, the code below demonstrates how to tune
the complexity parameter using cross-validation (as I’ll leave it to you
to tune maxdepth
on your own):
## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size =n)
## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of cp)
preds1 <- preds2 <- preds3 <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Subset the data
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit models on the data
mod_cp10 <- rpart(rate_kcal ~ weight_lbs, data = train,
control = rpart.control(cp = 0.1, minsplit = 2))
mod_cp05 <- rpart(rate_kcal ~ weight_lbs, data = train,
control = rpart.control(cp = 0.05, minsplit = 2))
mod_cp01 <- rpart(rate_kcal ~ weight_lbs, data = train,
control = rpart.control(cp = 0.01, minsplit = 2))
## Store predictions
preds1[fold_id == k] <- predict(mod_cp10, test)
preds2[fold_id == k] <- predict(mod_cp05, test)
preds3[fold_id == k] <- predict(mod_cp01, test)
}
## Out-of-sample RMSE for each model
out_RMSE1 <- sqrt(1/n*sum((rmr$rate_kcal - preds1)^2))
out_RMSE2 <- sqrt(1/n*sum((rmr$rate_kcal - preds2)^2))
out_RMSE3 <- sqrt(1/n*sum((rmr$rate_kcal - preds3)^2))
## Comparison
print(c(out_RMSE1, out_RMSE2, out_RMSE3))
## [1] 190.3522 203.9304 208.7651
Question #5: Modify the previous example by removing minsplit = 2 from the control options. Why do you think all of the out-of-sample RMSE estimates are the same after making this change?
\(~\)
Computational Considerations
Algorithms like decision trees involve several tuning parameters, and each combination of tuning parameters that is to be assessed using 5-fold cross-validation requires the estimation of 5 different decision tree models.
Consider a search involving 3 different values of
maxdepth
, 3 different values of minsplit
, and
3 different values of cp
. There are 27 different
combinations of these tuning parameters that are part of this search,
meaning 135 different decision trees need to be estimated/trained if
5-fold cross validation is used to select the best performing
combination. If we wanted to consider 1 additional value of any of these
tuning parameters the number of models that need to be estimated
increases to 180.
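A quick way to see this bookkeeping in code is to enumerate the grid with expand.grid(); the particular parameter values below are just an arbitrary example, not values recommended by this lab.
## Example only: an arbitrary 3 x 3 x 3 grid of tuning parameter values
tune_grid <- expand.grid(maxdepth = c(2, 3, 4),
                         minsplit = c(2, 10, 20),
                         cp       = c(0.01, 0.05, 0.10))
nrow(tune_grid)      ## 27 combinations of tuning parameters
nrow(tune_grid) * 5  ## 135 trees trained when using 5-fold cross-validation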
Pruning is a strategy that can be used to reduce the number
of decision tree models you’re training. This exploits the fact that a
deeper tree can be “cut” into a shallower tree due to the greedy nature
of the decision tree algorithm. Thus, the maxdepth
and
cp
parameters can be tuned without rerunning the
algorithm.
The code below demonstrates pruning by using the prune()
function to help optimize cp
using 5-fold cross
validation:
## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)
n <- nrow(rmr)
fold_id <- sample(rep(1:5, length.out = n), size =n)
## Step 2
## Three empty numeric vectors to store out-of-sample predictions (for 3 choices of cp)
preds1 <- preds2 <- preds3 <- numeric(n)
## Loop through the k=5 folds
for(k in 1:5){
## Subset the data
train <- rmr[fold_id != k, ]
test <- rmr[fold_id == k, ]
## Fit the deepest model
mod_cp01 <- rpart(rate_kcal ~ weight_lbs, data = train,
control = rpart.control(cp = 0.01, minsplit = 2))
## Cut the deepest model into shorter trees
mod_cp05 <- prune(mod_cp01, cp = 0.05)
mod_cp10 <- prune(mod_cp01, cp = 0.10)
## Store predictions
preds1[fold_id == k] <- predict(mod_cp10, test)
preds2[fold_id == k] <- predict(mod_cp05, test)
preds3[fold_id == k] <- predict(mod_cp01, test)
}
## Out-of-sample RMSE for each model
out_RMSE1 <- sqrt(1/n*sum((rmr$rate_kcal - preds1)^2))
out_RMSE2 <- sqrt(1/n*sum((rmr$rate_kcal - preds2)^2))
out_RMSE3 <- sqrt(1/n*sum((rmr$rate_kcal - preds3)^2))
## Comparison
print(c(out_RMSE1, out_RMSE2, out_RMSE3))
## [1] 190.3522 203.9304 208.7651
Notice how these RMSE results are identical to those we previously saw, but this time we only needed to estimate 5 different trees instead of 15.
Question #6: Find a combination of tuning parameters
that yields a lower out-of-sample RMSE than the best performance
achieved in our example (ie: better than 190.35). You should explore
different values for both cp
and minsplit
in
your search. Be sure to use the same randomization seed as the example
(ie: set.seed(7)
).
\(~\)
Question #7: The data loaded below, oz
,
record the daily ozone concentrations in New York City across 111
different days in the same year. The variable “Day” describes how many
days after Jan 1st the measurement occurred. The goal of this
application is to develop a model that can accurately predict the Ozone
concentration on a given day based upon the wind speed, amount of solar
radiation, and temperature for that day.
oz <- read.csv("https://remiller1450.github.io/data/Ozone.csv")
Part A: Use the pairs()
function to
create a scatter plot matrix displaying each combination of variables in
oz
. Which two variables appear to have the strongest linear
relationships with the outcome variable, “Ozone”?
Part B: Fit a linear regression model that uses a
day’s wind speed and temperature (as linear effects)
to predict the ozone concentration for that day. Then, use
plotly
to visualize this model using a surface overlaid on
a 3-D scatter plot. Hint: see the plotly
lab for
several examples, and be careful to set up your “x” and “y” sequences
using the range of the predictors used in this model.
Part C: Estimate the out-of-sample \(RMSE\) of the model you fit in Part B using 5-fold cross-validation.
Part D: Starting with your code from Part C, use 5-fold cross-validation to compare the out-of-sample \(RMSE\) of the linear regression model from Part B with a \(k\)-nearest neighbors model using \(k = 11\) (approximately \(\sqrt{n}\)). Both models should only use wind speed and temperature as predictors, and you should be careful to use the same cross-validation folds when comparing the models. Be sure that you report each model’s estimated out-of-sample \(RMSE\).
Part E: Visualize the \(k\)-nearest neighbors model with \(k = 11\) from Part D using a surface
overlaid on a 3-D plotly
scatter plot. Hint: Use
the same grid
object you created in Part B as the input to
the test
argument in knn.reg()
.
Part F: Use 5-fold cross-validation to find the optimum depth of a decision tree model that uses wind speed and temperature as predictors of ozone concentration. You should compare maximum depths of 2, 3, and 4, reporting the depth that you found to be most appropriate.
Part G: Visualize the model from Part F using a
surface overlaid on a 3-D plotly
scatter plot. Briefly
describe how this model’s prediction surface compares with the ones from
Part B and Part E.