Directions (Please read before starting)
\(~\)
In addition to familiar packages, this lab will use the
implementation of \(k\)-nearest
neighbors regression in the FNN
package.
# load the following packages
# install.packages(c("FNN", "rpart", "rpart.plot"))
library(FNN)
library(ggplot2)
library(dplyr)
library(rpart)
library(rpart.plot)
\(~\)
In contrast with unsupervised methods (like clustering and PCA), supervised learning focuses on modeling a pre-determined outcome variable via the following framework:
\[Y = f(\mathbf{X}) + \epsilon\]
Thus, the outcome variable is modeled by a systematic component, which uses information from \(\mathbf{X}\) (sometimes referred to as signal), and error, which is sometimes referred to as noise.
The overarching goal of supervised learning is to estimate f() from the observed data.
Because supervised learning centers around a single outcome variable, tailoring the approach to the type of this variable is extremely important.
To understand the supervised learning framework, let’s consider a simple data set containing the body weight (lbs) and resting metabolic rate (RMR) of \(n=44\) adult women. The goal is to model RMR as a function of body weight:
\[\text{RMR} = f(\text{Weight}) + \epsilon\]
One approach to estimating \(f()\) is to specify a structured parametric model; for example, we might consider a straight line:
\[RMR = \beta_0 + \beta_1*\text{Weight} + \epsilon\] Similarly, we might increase the flexibility of our structured model by adding parameters. One way to do this is to perform a polynomial expansion on “Weight”; for example, we might use a degree 3 polynomial expansion:
\[RMR = \beta_0 + \beta_1*\text{Weight} + \beta_2*\text{Weight}^2 + \beta_3*\text{Weight}^3 + \epsilon\]
Alternatively, another approach is to use an algorithm that doesn’t impose a parametric structure. For example, we might use the following rule (a method known as \(k\)-nearest neighbors): predict the RMR at a given weight as the average RMR of the \(k\) observed data-points whose weights are closest to it.
These different approaches are shown below:
This lab will focus on understanding these different models, and our next lab will address how we might choose between them.
\(~\)
Parametric models, such as linear regression, have the advantage of being easily interpretable. However, they also involve an estimation problem.
That is, you say that you’ll model the relationship between RMR and Weight using a straight line - but how do you decide which straight line?
To illustrate this point, the plot below shows three different straight line models. The blue line clearly doesn’t capture the pattern in the data, but the red and green lines both seem to do a decent job, so we need to decide which is better.
One solution to the estimation question is to set up a mathematical criterion by which the model’s success is measured, and then solve for the parameter estimates (ie: a slope and intercept) that optimize that criterion. For regression problems, the criterion of choice is the model’s root mean-squared error, or \(RMSE\):
\[RMSE = \sqrt{\tfrac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}\]
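To make this criterion concrete, the sketch below computes the \(RMSE\) of one arbitrary candidate line (an assumed intercept of 900 and slope of 3, chosen purely for illustration and not one of the three models compared in Question #1 below):
## Load the RMR data (it is loaded again later in the lab)
rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")
## Predicted RMR from the candidate line: 900 + 3*weight
y_hat <- 900 + 3 * rmr$weight_lbs
## RMSE: square root of the average squared prediction error
sqrt(mean((rmr$rate_kcal - y_hat)^2))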
Question #1: For the 44 women in these data, Model #1 has an RMSE of 154.27, Model #2 has an RMSE of 167.07, and Model #3 has an RMSE of 312.83. What do these numbers mean (in practical terms)? Which model should we prefer?
\(~\)
In Question #1, you compared the \(RMSE\) of 3 different models (each of which is defined by its own slope and intercept). A method known as ordinary least squares can be used to solve for the combination of slope and intercept that minimizes \(RMSE\) for the observed data. Additional mathematical details can be found here.
Once the optimal slope and intercept have been estimated from the sample data, we can provide an estimated model using numbers (rather than symbols). We refer to any quantity involving estimates based upon sample data using a \(\hat{ }\) or “hat”. For example:
We can state a population-level model in generic terms:
\[Y = \beta_0 + \beta_1 X_1 + \epsilon\]
Then, after using least squares estimation on the observed data, we can provide an estimated model:
\[\hat{Y} = 810 + 3.4X\] The code below demonstrates this process:
## Sample data
rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")
## Model specification
my_form <- formula(rate_kcal ~ weight_lbs)
## Estimation of the model
lm_fit <- lm(my_form, data = rmr)
coefficients(lm_fit)
## (Intercept) weight_lbs
## 811.22667 3.36168
Many models in R use a formula to define the model’s functional form. In this example, our formula specifies “rate_kcal” as the outcome variable and “weight_lbs” as the predictor variable. We could include additional predictor variables in the formula using a +. For example, a generic formula involving three predictors could be y ~ x1 + x2 + x3.
The lm() function takes the model formula and estimates the model parameters using the given data. In this example, the coefficients() function is used to extract the estimated slope and intercept from the fitted model.
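As a quick illustration of how these estimates translate into predictions (using a hypothetical body weight of 150 lbs, chosen only for demonstration), we can compute a predicted RMR either “by hand” or with the predict() function:
## Extract the estimated intercept and slope
b <- coefficients(lm_fit)
## Predicted RMR for a hypothetical 150-lb woman, computed by hand
b[1] + b[2] * 150
## The same prediction using predict()
predict(lm_fit, newdata = data.frame(weight_lbs = 150))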
Finally, you should note that we aren’t limited to linear models with a single predictor. Linear regression generalizes to the form:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \epsilon\]
Recall that we visualized linear regression in 3 dimensions (2 predictors) in our plotly lab.
Question #2: For this question, consider a
multiple linear regression model where the median 10-year
salary of a college’s alumni is modeled by a linear combination of a
college’s admissions rate and its net tuition. We will estimate this
model using college scorecard data from 2019, stored in cd
(created by the code given below).
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
## Remove colleges missing the relevant variables
cd <- filter(colleges, !is.na(Salary10yr_median), !is.na(Adm_Rate), !is.na(Net_Tuition))
Part A: Write out the population-level model that will be estimated in this question (using generic parameters and variable names). R is not necessary for this part.
Part B: Use the lm() function to estimate the parameters of this model from the sample data stored in cd. Print these parameter estimates.
Part C: The lm() function returns a list object that contains a vector named “fitted.values” (ie: predicted \(y\)-values). Use this vector to calculate the model’s in-sample \(RMSE\). Hint: use functions like sum() and sqrt().
Note: In Part C, the in-sample \(RMSE\) will be close to the “Residual standard error” reported when using the summary() function on the fitted model, but \(RMSE\) uses \(\tfrac{1}{n}\) while the residual standard error uses \(\tfrac{1}{n-p}\) (where \(p\) is the number of model parameters).
\(~\)
Standard linear regression models are too rigid to capture non-linear
relationships. However, we can use a strategy known as feature
expansion to overcome this limitation. For example, let’s expand
the “weight” variable in the RMR data by creating new columns
representing “weight”, “weight squared”, and “weight cubed” using the
poly()
function:
## Create rmr2, an augmented data frame
rmr2 = data.frame(poly(rmr$weight_lbs, degree = 3, raw = TRUE), rmr$rate_kcal)
colnames(rmr2) = c("weight","weight_squared","weight_cubed","rate_kcal")
head(rmr2)
## weight weight_squared weight_cubed rate_kcal
## 1 104.79 10980.94 1150693 1079
## 2 106.68 11380.62 1214085 1146
## 3 108.78 11833.09 1287203 1115
## 4 110.46 12201.41 1347768 1161
## 5 120.96 14631.32 1769805 1325
## 6 128.94 16625.52 2143695 1351
We can then fit a linear regression model using these three predictors:
poly3_mod <- lm(rate_kcal ~ weight + weight_squared + weight_cubed, data = rmr2)
coefficients(poly3_mod)
## (Intercept) weight weight_squared weight_cubed
## 4.005720e+02 9.541122e+00 -2.756257e-02 3.664196e-05
Importantly, we could also perform the polynomial expansion inside of
the lm()
function using the original rmr
data
frame as the model’s input data:
poly3_mod_alt <- lm(rate_kcal ~ poly(weight_lbs, degree = 3, raw = TRUE), data = rmr)
coefficients(poly3_mod_alt)
## (Intercept)
## 4.005720e+02
## poly(weight_lbs, degree = 3, raw = TRUE)1
## 9.541122e+00
## poly(weight_lbs, degree = 3, raw = TRUE)2
## -2.756257e-02
## poly(weight_lbs, degree = 3, raw = TRUE)3
## 3.664196e-05
The names lm()
gives to these coefficients are a little
messy, but putting poly()
directly in the model formula
makes visualization of the model much easier.
To facilitate visualization of a wide variety of models, we’ll adopt the following general approach:
1. Create a sequence of closely spaced values spanning the observed range of the predictor.
2. Use the fitted model to generate a predicted outcome for each value in this sequence.
3. Plot the predictions as a line (via geom_line()) over a scatter plot of the observed data.
These steps are demonstrated below:
## Step 1 - sequence of 200 values from the min to max weight:
xx = with(rmr, seq(min(weight_lbs), max(weight_lbs), length.out = 200))
## Step 2 - predictions for the sequence
yy = predict(poly3_mod_alt, data.frame(weight_lbs = xx))
## Step 3 - plot predictions using geom_line()
ggplot() +
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = yy), col = "red", size = 2)
Here we see that the polynomial expansion makes our linear model more flexible, allowing its predictions to be closer to the observed y-values.
Two additional feature expansion strategies, splines and discretization, are demonstrated below:
## Discretize into 3 categories using cut()
des_mod = lm(rate_kcal ~ cut(weight_lbs, breaks = 3), data = rmr)
## Spline expansion using bs()
library(splines)
spline_mod = lm(rate_kcal ~ bs(weight_lbs, degree = 2, knots = c(150,200,250)), data = rmr)
## New seq predictions for each
yd = predict(des_mod, data.frame(weight_lbs = xx))
ys = predict(spline_mod, data.frame(weight_lbs = xx))
## Visualizations of each
ggplot()+
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = ys), col = "green", size = 2) +
geom_line(aes(x = xx, y = yd), col = "purple", size = 2)
Without getting too far into the details:
- Discretization (via cut()) separates a numeric predictor into categories that are either equally spaced (if an integer is given for the breaks argument) or split at certain cut-points (if a vector is given for the breaks argument); see the small sketch after this list.
- Splines (via bs() and other similar functions) fit piece-wise polynomials of a specified degree that are joined together at locations called knots. Additional information can be found here.
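To make the two uses of the breaks argument concrete, here is a small standalone sketch using a short made-up vector (not the lab’s data):
## A made-up numeric vector
x <- c(3, 8, 14, 22, 27)
## An integer for breaks: three equally spaced intervals across the range of x
cut(x, breaks = 3)
## A vector for breaks: intervals split at the supplied cut-points
cut(x, breaks = c(0, 10, 20, 30))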
Question #3: Modify the spline model fit to the rmr data (given above) to use degree = 1. How does this change the model?
\(~\)
Even with strategies like feature expansion, regression can still be limited in its ability to detect highly localized patterns. In these circumstances, we might consider modeling algorithms that do not assume a parametric functional form when estimating \(f()\), with \(k\)-nearest neighbors (KNN) being a simple example.
KNN Algorithm: to generate a prediction at a given point, find the \(k\) observed data-points whose predictor values are closest (in distance) to that point, then predict the outcome as the average outcome of those \(k\) neighbors.
Standardizing \(\mathbf{X}\) is an important preliminary step in \(k\)-nearest neighbors. Because neighbors are found using distance, all variables should be on the same scale.
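As a small standalone illustration, the scale() function standardizes each column to have mean 0 and standard deviation 1 so that every predictor contributes comparably to the distance calculation (the height_in column below is hypothetical and not part of the rmr data):
## Two made-up predictors measured on very different scales
X <- data.frame(weight_lbs = c(120, 150, 180, 210),
                height_in  = c(62, 65, 68, 71))
## Each column of the result has mean 0 and standard deviation 1
scale(X)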
Next, consider a grid of target values, or points in the \(p\)-dimensional space corresponding to \(\mathbf{X}\) that we’d like the model to predict.
Question #4: Use the mutate() and arrange() functions to find the \(k = 3\) data-points nearest to a target value of \(x = 130\) lbs. Hint: since the target exists in 1-dimensional space, the Euclidean distance between the target value and \(x_i\) will be \(d_i =\sqrt{(x_i - \text{target})^2}\).
\(~\)
How to visualize KNN:
KNN models can be “fit” using the knn.reg() function in the FNN library. You should note that there’s also a knn() function that we’ll use in the future for classification tasks (predicting categorical outcomes).
library(FNN)
knn_fit = knn.reg(train = rmr$weight_lbs, ## Observed Predictors
test = data.frame(xx), ## Targets to Predict
y = rmr$rate_kcal, ## Observed Outcome
k = 5) ## Neighbors
Models from knn.reg()
are not compatible with the
predict()
function, so we need to slightly adjust our
previous visualization approach. knn.reg()
returns
predicted y-values for the data given in the test
argument.
So, we can supply our contrived sequence xx
as data via the
test
argument to get an approximation of the model’s
functional form.
The pred
component of the fitted model stores these
predictions, and below we use them to visualize this KNN model’s
estimate of \(f()\):
ggplot() +
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = knn_fit$pred), col = "blue", size = 2)
How to choose k:
Similar to \(k\)-means clustering, the data analyst using KNN must specify a value of \(k\) at the start of the algorithm, and the optimal value of \(k\) will vary across different applications. Smaller values of \(k\) yield flexible models that closely track the observed data but are sensitive to noise, while larger values of \(k\) yield smoother, more stable models that can miss localized patterns.
This tension is commonly referred to as the bias-variance trade-off, and it applies to all modeling approaches.
In our next lab, we’ll learn how to navigate the bias-variance trade-off using cross-validation as a guiding framework. For now, you should choose a reasonable value of \(k\) using visual inspection.
Question #5: Modify the KNN code (given as an
example of knn.reg()
) so that resting metabolic rates are
predicted using the nearest \(k=15\)
neighbors. Then, briefly describe the visual changes you observe in this
new model compared to models that used smaller values of \(k\).
\(~\)
A downside of KNN is that the way predictions are generated is very localized and thus difficult to describe or articulate. Put differently, there’s little you can do to explain why a body weight of X leads to a predicted RMR of Y short of saying “because that’s the average RMR of the neighbors of X”.
Decision tree modeling is an unstructured algorithmic approach that works by recursively partitioning the \(p\)-dimensional space (of the predictor variables) to improve purity.
For the RMR application, a simple decision tree splits the range of “Weight” into segments that are determined by splitting rules:
my_tree = rpart(rate_kcal ~ weight_lbs, data = rmr)
rpart.plot(my_tree)
The initial split separates the data-points with a body weight below 190 lbs into their own segment, or node. This node is further split to separate the data-points with a body weight below 125, giving us a decision tree with 3 terminal nodes. Within each node, the top line shows the predicted outcome (RMR in this example), and the second line displays the percentage of data-points that belong to that segment.
We can visualize the functional form of this decision tree using methods described earlier in the lab:
## Seq to predict over
xx = seq(min(rmr$weight_lbs), max(rmr$weight_lbs), length.out = 200)
## Predictions across sequence
yt = predict(my_tree, data.frame(weight_lbs = xx))
## Visualization
ggplot()+
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = yt), col = "green", size = 2)
Notice the tree’s two splitting rules partition the range of “Weight” into 3 distinct regions. The tree’s predictions are the average outcomes within each region, hence the 3 horizontal lines present on this visualization.
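To connect this back to the tree, here is a small check (a sketch that re-uses the rmr data and the cut-points of 125 and 190 displayed by rpart.plot above; because the displayed values are rounded, the match may be approximate): the average RMR within each region should line up with the predictions shown in the terminal nodes.
## Average RMR within each region implied by the tree's splitting rules
rmr %>%
  mutate(region = cut(weight_lbs, breaks = c(-Inf, 125, 190, Inf))) %>%
  group_by(region) %>%
  summarize(avg_rmr = mean(rate_kcal), n = n())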
How does the algorithm determine the splitting rules? For regression trees (the “anova” method used by rpart() here), the algorithm evaluates every possible split of each predictor and chooses the split that produces the largest reduction in the sum of squared errors (equivalently, the largest improvement in R-squared).
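As a rough numerical sketch (assuming the rmr data and re-using the first split, a body weight of 190 lbs, from the tree shown above), we can verify that splitting a node reduces the total squared error relative to predicting a single overall average:
## Sum of squared errors around a group's mean
sse <- function(y) sum((y - mean(y))^2)
## SSE when every data-point is predicted by one overall average (no split)
sse(rmr$rate_kcal)
## SSE after splitting at 190 lbs: each side is predicted by its own average
left  <- rmr$rate_kcal[rmr$weight_lbs < 190]
right <- rmr$rate_kcal[rmr$weight_lbs >= 190]
sse(left) + sse(right)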
When does the algorithm stop making splits?
Decision trees can involve several different stopping criteria, though they generally depend upon one or more of the following factors: the improvement in fit gained by an additional split, the number of data-points in the node being considered, and the depth of the tree.
By default, a decision tree fit using the “anova” method will stop splitting when:
- an additional split would not improve the model’s R-squared by at least 1% (governed by the complexity parameter, cp)
- the node being considered for a split contains fewer than 20 data-points (governed by minsplit)
- the tree has reached a depth of 30 (governed by maxdepth)
These parameters are controlled using the control argument in rpart() in conjunction with the rpart.control() function. This is demonstrated below:
### Fit a new tree w/ different stopping criteria
my_tree = rpart(rate_kcal ~ weight_lbs, data = rmr,
control = rpart.control(cp = 0.005, minsplit = 10))
### Visualize the tree
rpart.plot(my_tree)
In this example, you should notice that relaxing the stopping criteria leads to a deeper, more complex tree.
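If you’d like to see how much each split improved the fit, the printcp() function from the rpart package prints the fitted tree’s complexity table (one row per tree size in the pruning sequence); this is optional and shown only as a small illustration:
## Complexity table for the tree fit above
printcp(my_tree)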
Question #6: Fit a decision tree using the
cd
dataset created below (and used earlier in the lab) that
predicts “Salary10yr_median” using the variables “Adm_Rate” and
“Net_Tuition”. Set the stopping criteria such that a 2% improvement in
R-squared is required to make an additional split.
Part A: Visualize the fitted tree and briefly describe the characteristics of colleges that belong to the tree’s smallest terminal node (in terms of the number of colleges contained in that node).
Part B: Report the predicted salary of colleges that belong to the node you identified in Part A.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
## Remove colleges missing the relevant variables
cd <- filter(colleges, !is.na(Salary10yr_median), !is.na(Adm_Rate), !is.na(Net_Tuition))
\(~\)
Question #7: For this question you will conduct a comprehensive analysis using the AFQT data introduced below.
The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to measure whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, percentile scores are frequently used as a measure of general intelligence.
The afqt
data set is a random sample of \(n = 2584\) Americans who were first
selected and tested in 1979, then later re-interviewed in 2006 and asked
about their education and annual income. The focus of this analysis will
be modeling 2005 incomes using AFQT scores (from 1979) with and without
adjusting for educational attainment.
## Load the data
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Create a scatter plot displaying the relationship between AFQT percentile scores and 2005 income while coloring each point by educational attainment. Write 1-2 sentences describing the relationship (or lack thereof) between these variables.
Part B: Without considering educational attainment, use \(k\)-nearest neighbors with \(k = 50\) (approximately \(\sqrt{n}\)) to generate predicted incomes based upon AFQT percentile. Create a graphical display of this model and report the in-sample \(RMSE\).
Part C: Without considering educational attainment, use simple linear regression to model income in response to AFQT percentile. Create a graphical display of this model and report the in-sample \(RMSE\).
Part D: Report and interpret the slope coefficient from the model in Part C.
Part E: Fit a new linear regression model that also uses “Educ” as a predictor and report its slope coefficient for AFQT. Why might this slope differ from the one in Part D? How should its interpretation change? Hint: remember that the slope coefficient for “AFQT” in this model reflects changes in the outcome as “AFQT” varies while “Educ” is held constant.
Part F: Fit a decision tree model to these data using the default stopping criteria and both educational attainment and AFQT as predictors. Report the \(RMSE\) of this model and state whether you think it fits the data more closely than the models you fit in Parts B and C.