Directions:

As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.

You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.

You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.

\(~\)

K-Nearest Neighbors

K-nearest neighbors (KNN) is a popular non-parametric model used for classification and prediction. The model does not specify any particular relationship between the explanatory and response variables. This makes the method very flexible, as it can detect non-linearity and localized patterns in the data. We’ll view KNN as an alternative to linear regression.

\(~\)

Distance

Consider a dataset containing \(n\) observations of the form \(\{x_i, y_i\}\). The goal is to model \(E(y)\) as a function of \(x\); to do so, the KNN algorithm uses the \(k\) nearest data-points. However, identifying these data-points depends on how we define distance.

So far, we’ve focused on models that use only a single explanatory variable, and in this special situation it doesn’t really matter how we define the distance between data-points. For example, using \(|x_i - x|\) or \((x_i - x)^2\) will lead us to identify the same set of neighbors.

More generally, distance is an important topic in modeling, with statisticians and data scientists employing a wide variety of distance metrics (we’ll look at a few different ones later). For now, we’ll focus on Euclidean distance, which is defined below:

\[\text{dist}_{a,b} = \sqrt{\sum_{j=1}^{p} (x_{aj} - x_{bj})^2}\]

  • In words, the Euclidean distance between data-points \(a\) and \(b\) is the square-root of the sum of the squared differences in each dimension of \(x\)
    • Right now, we’re only considering models involving a single predictor, so \(x\) only has one dimension
    • Soon, we’ll consider models involving many predictors, so each data-point might exist in \(p\) dimensions (\(x_1, x_2, \ldots, x_p\))
  • In most applications, it is common practice to standardize the predictor variables (i.e., transform them into z-scores) before doing anything that uses Euclidean distance. A small example of the distance calculation is sketched below.
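As a quick illustration, the snippet below is a minimal sketch of the Euclidean distance calculation using two made-up data-points measured on \(p = 2\) predictors (these values are hypothetical and not from any of the lab’s datasets):

## Two hypothetical data-points, each measured on p = 2 predictors
a <- c(2, 5)
b <- c(4, 1)

## Square-root of the sum of squared differences in each dimension
sqrt(sum((a - b)^2))
## [1] 4.472136

## With several predictors on different scales, you'd typically standardize
## each column first, for example using scale() on the full predictor matrix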

Question #1: Consider the RMR dataset (loaded below) and suppose we’d like to use K-Nearest Neighbors to predict the resting metabolic rate of a woman who weighs 130 lbs. For this question, write code that calculates the Euclidean distance between each data-point and the value we’re interested in (i.e., 130 lbs).

rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")

\(~\)

K-Nearest Neighbors Algorithm

Once distance is defined, the K-nearest neighbors algorithm is straightforward (a small sketch on toy data follows these steps):

  1. Calculate the distance between each observed data-point and the target value
  2. Sort these distances from smallest to largest
  3. Select the data-points with the \(k\) smallest distances
  4. Use the observed \(y\) values of these data-points to make a prediction (either as a simple average, or as an inverse distance weighted average)
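To make these steps concrete, here is a minimal sketch that carries them out by hand on a few made-up \((x, y)\) pairs (the values below are hypothetical and not from any of the lab’s datasets):

## Toy data (hypothetical values)
x <- c(1, 3, 4, 7, 9)
y <- c(2.0, 2.5, 3.5, 5.0, 6.5)
target <- 5   ## the x-value we want a prediction for
k <- 3

## Step 1 - distance between each observed x-value and the target
d <- abs(x - target)

## Steps 2/3 - sort the distances and keep the indices of the k smallest
nearest <- order(d)[1:k]

## Step 4 - simple average of the neighbors' y-values
mean(y[nearest])
## [1] 3.666667

## An inverse distance weighted alternative: weighted.mean(y[nearest], w = 1/d[nearest])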

Question #2: Use the order function to locate the \(k = 3\) data-points closest to \(x = 130\) lbs. Then, use a simple average of the \(y\)-values of these three data-points to generate the K-Nearest Neighbors prediction for \(x = 130\) lbs.

\(~\)

Choosing K

To apply the KNN algorithm, \(k\) must be defined ahead of time by the data analyst. There is no universally best choice of \(k\); instead, the optimal value will depend upon the data. In general, larger values of \(k\) lead to models that are more stable but also more biased, while smaller values of \(k\) lead to models that are more flexible but less stable.

The code below demonstrates how to fit and graph KNN models using the FNN package. The code creates and displays predictions for KNN models using \(k = 2\), \(k = 8\), and \(k = 20\).

## Load Data
tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")

## Setup x-grid to predict over
xgrid <- data.frame(TotBill = seq(min(tips$TotBill),  
                                  max(tips$TotBill), by = 1))

## Fitting KNN using the "knn.reg" function in the "FNN" package
library(FNN)  ## Be sure to install this package

## Fit the model
km2 <- knn.reg(train = tips$TotBill,   ## Observed x-values
               test = xgrid,             ## New values to predict over
               y = tips$Tip,        ## Observed y-values
               k = 2)                    ## Choice of k

## Fit k = 8 and k = 20 models
km8 <- knn.reg(train = tips$TotBill, test = xgrid, y = tips$Tip, k = 8)    
km20 <- knn.reg(train = tips$TotBill, test = xgrid, y = tips$Tip, k = 20)    

## Create base Scatterplot
plot(tips$TotBill, tips$Tip, xlab = "Total Bill", ylab = "Tip")

## Add model predictions as "lines" to the previous scatterplot
lines(xgrid$TotBill, km2$pred, col = "red")
lines(xgrid$TotBill, km8$pred, col = "green")
lines(xgrid$TotBill, km20$pred, col = "purple")

In the next section, we’ll cover some quantitative ways to compare different models. This will provide us with an objective framework for making decisions like choosing \(k\).

Question #3: Modify the example given above to fit and graph two different K-Nearest Neighbors models (\(k = 3\) and \(k = 10\)) that predict the resting metabolic rate of the women in the RMR dataset based upon their bodyweight. Plot the \(k=3\) model using a red line, and the \(k = 10\) model using a blue line.

\(~\)

Comparing Models

Now that we’ve introduced the concept of distance, we can use it to quantitatively evaluate how close a model’s predictions are to the actual data. The subsections that follow focus on using distance to compare two or more models in an objective fashion.

RMSE

One of the most common ways to summarize the accuracy of a model’s predictions is root mean squared error, or \(RMSE\): \[RMSE = \sqrt{\tfrac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}\] As shown above, \(RMSE\) roughly corresponds to the average prediction error made by the model on the observed data. Ideally, we’d like the prediction errors made by our model to be as small as possible, so lower \(RMSE\) suggests a better model.
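Since we’ll be computing this quantity repeatedly, it can be convenient to wrap the formula in a small helper function. The sketch below defines one (the name rmse is our own choice, not a function from any package); after fitting a model, rmse(tips$Tip, m1$fitted.values) would give the same value as the longer expressions used in the example that follows.

## Helper function that computes RMSE from observed and predicted values
rmse <- function(y, y_hat){
  sqrt(mean((y - y_hat)^2))
}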

Unfortunately, relying upon in-sample \(RMSE\) to choose a model is flawed because it will always favor larger, more complex models. The code example below provides a simple illustration of this by adding a few nonsensical explanatory variables (random values drawn from a Normal distribution) to a model that predicts “Tip”.

## Add random values as three extra variables to Tips
tips$R1 <- rnorm(nrow(tips))
tips$R2 <- rnorm(nrow(tips))
tips$R3 <- rnorm(nrow(tips))

## Build bigger and bigger models using these nonsensical variables
m1 <- lm(Tip ~ TotBill, data = tips)
m2 <- lm(Tip ~ TotBill + R1, data = tips)
m3 <- lm(Tip ~ TotBill + R1 + R2, data = tips)
m4 <- lm(Tip ~ TotBill + R1 + R2 + R3, data = tips)

## Calculate RMSE for each model
rmse1 <- sqrt(1/nrow(tips)*sum((tips$Tip - m1$fitted.values)^2))
rmse2 <- sqrt(1/nrow(tips)*sum((tips$Tip - m2$fitted.values)^2))
rmse3 <- sqrt(1/nrow(tips)*sum((tips$Tip - m3$fitted.values)^2))
rmse4 <- sqrt(1/nrow(tips)*sum((tips$Tip - m4$fitted.values)^2))

## Notice how prediction errors get smaller! (try re-running multiple times)
print(c(rmse1,rmse2,rmse3,rmse4))
## [1] 1.017850 1.017049 1.016270 1.003809

Question #4: Consider two models for a woman’s resting metabolic rate using the RMR dataset. One model uses only bodyweight as a predictor, the other model accidentally includes the subject ID as another predictor (so it uses two predictors, bodyweight and subject ID). Which model will have a lower \(RMSE\)? (Hint: you don’t need to write any code or fit any models to answer this question)

\(~\)

Cross-Validation

The bias of in-sample \(RMSE\) towards larger, more complex models can be avoided by using out-of-sample \(RMSE\). One way of estimating the out-of-sample performance of a model is a method known as cross-validation.

In this class, we’ll focus on \(k\)-fold cross-validation, which involves the following steps:

  1. Randomly divide the original dataset into \(k\) equally sized, non-overlapping subsets
  2. Fit the candidate model using data from \(k - 1\) folds (the “training” folds), then calculate the model’s prediction error on the data in the \(k^{th}\) fold (the “left out” fold)
  3. Repeat step two a total of \(k\) times until each fold has been left out exactly once, then aggregate the prediction errors across the different folds

The code below demonstrates 5-fold cross-validation:

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)  ## It's good practice to set a randomization seed so that your results are repeatable
fold_id <- sample(rep(1:5, length.out = nrow(tips)), size = nrow(tips))

## Step 2/3
## First set up an empty numeric vector to store out-of-sample predictions
preds <- numeric(nrow(tips))

## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- tips[fold_id != k, ]
  test <- tips[fold_id == k, ]
  
  ## Fit model on the data in this iteration's training folds
  mod <- lm(Tip ~ TotBill, data = train)
  
  ## Store predictions
  preds[fold_id == k] <- predict(mod, newdata = test)
}

## Out-of-sample RMSE
out_RMSE <- sqrt(1/nrow(tips)*sum((tips$Tip - preds)^2))

## In-sample RMSE (for comparison)
m <- lm(Tip ~ TotBill, data = tips)
in_RMSE <- sqrt(1/nrow(tips)*sum((tips$Tip - m$fitted.values)^2))

## Compare
print(c(out_RMSE, in_RMSE))
## [1] 1.038133 1.017850

To demonstrate the advantage of cross-validated, out-of-sample \(RMSE\) in comparing models, let’s return to the example that used random values as extra predictors. Recall that adding these meaningless predictors led to slight decreases in the in-sample RMSE.

## Add random values as three extra variables to Tips
tips$R1 <- rnorm(nrow(tips))
tips$R2 <- rnorm(nrow(tips))
tips$R3 <- rnorm(nrow(tips))

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)  ## This seed will now ensure we get the same fold assignments as before!!
fold_id <- sample(rep(1:5, length.out = nrow(tips)), size = nrow(tips))

## Step 2/3
## First set up an empty numeric vector to store out-of-sample predictions
preds <- numeric(nrow(tips))

## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- tips[fold_id != k, ]
  test <- tips[fold_id == k, ]
  
  ## Fit model on the data in this iteration's training folds
  mod <- lm(Tip ~ TotBill + R1 + R2 + R3, data = train)
  
  ## Store predictions
  preds[fold_id == k] <- predict(mod, newdata = test)
}

## Out-of-sample RMSE
out_RMSE <- sqrt(1/nrow(tips)*sum((tips$Tip - preds)^2))

## In-sample RMSE (for comparison)
m <- lm(Tip ~ TotBill + R1 + R2 + R3, data = tips)
in_RMSE <- sqrt(1/nrow(tips)*sum((tips$Tip - m$fitted.values)^2))

## Compare
print(c(out_RMSE, in_RMSE))
## [1] 1.045775 1.014966

As shown above, these extra variables increase the out-of-sample error from 1.038 to 1.046 (correctly suggesting they make the model worse), despite slightly decreasing the in-sample error from 1.018 to 1.015 (misleadingly suggesting their addition makes the model better).

\(~\)

Choosing \(k\) in KNN

One useful application of cross-validation is determining the optimal value of \(k\) in \(k\)-nearest neighbors. The code below demonstrates this process on the Tips dataset, comparing three different choices of \(k\) (5, 15, and 30):

## Step 1 - Create an ID variable used to assign each data-point to a fold
set.seed(7)  ## This seed will now ensure we get the same fold assignments as before!!
fold_id <- sample(rep(1:5, length.out = nrow(tips)), size = nrow(tips))

## Step 2/3
## First set up empty numeric vectors to store out-of-sample predictions
preds_k5 <- numeric(nrow(tips))
preds_k15 <- numeric(nrow(tips))
preds_k30 <- numeric(nrow(tips))


## Loop through the k=5 folds
for(k in 1:5){
  
  ## Subset the data
  train <- tips[fold_id != k, ]
  test <- tips[fold_id == k, ]
  
  ## Fit models on the data
  mod_k5 <- knn.reg(train = train$TotBill,  test = data.frame(test$TotBill), y = train$Tip, k = 5)    
  mod_k15 <- knn.reg(train = train$TotBill,  test = data.frame(test$TotBill), y = train$Tip, k = 15) 
  mod_k30 <- knn.reg(train = train$TotBill,  test = data.frame(test$TotBill), y = train$Tip, k = 30)  
  
  ## Store predictions
  preds_k5[fold_id == k] <- mod_k5$pred
  preds_k15[fold_id == k] <- mod_k15$pred
  preds_k30[fold_id == k] <- mod_k30$pred
}

## Out-of-sample RMSE for each model
out_RMSE_k5 <- sqrt(1/nrow(tips)*sum((tips$Tip - preds_k5)^2))
out_RMSE_k15 <- sqrt(1/nrow(tips)*sum((tips$Tip - preds_k15)^2))
out_RMSE_k30 <- sqrt(1/nrow(tips)*sum((tips$Tip - preds_k30)^2))

## Compare
print(c(out_RMSE_k5, out_RMSE_k15, out_RMSE_k30))
## [1] 1.128374 1.099606 1.085273

Comparing the out-of-sample \(RMSE\)s, it appears that \(k = 30\) yields the most accurate model (of the choices of \(k\) we considered, not necessarily of all possible choices of \(k\)).

Question #5: Using the code given above as a starting point, use 5-fold cross-validation to compare two k-nearest neighbors models that use \(k = 30\) and \(k = 45\). Does the \(k = 45\) model appear to be more accurate? Justify your answer by reporting the out-of-sample \(RMSE\) of each model.

\(~\)

Practice

Question #6: For this question you will revisit the AFQT dataset that was introduced in Lab #3. For your reference, the description of these data is copied below:

The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to gauge whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, scores are reported as percentiles and are frequently used as a measure of general intelligence.

The data you will use in the questions that follow comes from a random sample of 2584 Americans who were first selected and tested in 1979. In this analysis, you’ll determine whether KNN or linear regression yields more accurate predictions of future income.

## Load the data
afqt <- read.csv("https://remiller1450.github.io/data/AFTQ.csv")

Part A: Fit a simple linear regression model that uses “AFQT” scores to predict “Income2005” (as you did in Lab #3). Calculate the in-sample \(RMSE\) of this model.

Part B: Use cross-validation to compare 3 different models that use “AFQT” to predict “Income2005” using out-of-sample \(RMSE\). The first of these models should be the simple linear regression model from Part A, and the second and third should be k-nearest neighbors models with \(k = 15\) and \(k = 40\). Report which model is most accurate, justifying your answer using cross-validated \(RMSE\) (use 5-fold cross-validation).

Part C: Fit a simple linear regression that uses “Educ” (the years of formal education an individual received) to predict “Income2005”. Next, compare the estimated slope coefficient and \(t\)-test of this model with the estimated slope coefficient and \(t\)-test of the model in Part A (that used “AFQT” as a predictor). Do both models appear to be predictive? Can you determine which model is superior using only the information discussed in this question (i.e., using only the slope estimates and their corresponding \(t\)-tests)?

Part D: Use cross-validation to compare the 2 different simple linear models mentioned in Part C. Report which model is most accurate, justifying your answer using cross-validated \(RMSE\) (use 5-fold cross-validation).