Directions (Please read before starting)
\(~\)
In addition to familiar packages, this lab will use the
implementation of \(k\)-nearest
neighbors regression in the FNN
package.
# load the following packages
# install.packages(c("FNN", "rpart", "rpart.plot"))
library(FNN)
library(ggplot2)
library(dplyr)
library(rpart)
library(rpart.plot)
\(~\)
In contrast with unsupervised methods (like clustering and PCA), supervised learning focuses on modeling a pre-determined outcome variable via the following framework:
\[Y = f(\mathbf{X}) + \epsilon\]
Thus, the outcome variable is modeled by a systematic component, which uses information from \(\mathbf{X}\) (sometimes referred to as signal), and error, which is sometimes referred to as noise.
The overarching goal of supervised learning is to estimate f() from the observed data.
Because supervised learning centers around a single outcome variable, tailoring the approach to the type of this variable is extremely important.
To understand the supervised learning framework, let’s consider a simple data set containing the body weight (lbs) and resting metabolic rate (RMR) of \(n=44\) adult women. The goal is to model RMR as a function of body weight:
\[\text{RMR} = f(\text{Weight}) + \epsilon\]
One approach to estimating \(f()\) is to specify a structured parametric model; for example, we might consider a straight line:
\[RMR = \beta_0 + \beta_1*\text{Weight} + \epsilon\] Similarly, we might increase the flexibility of our structured model by adding parameters. One way to do this is to perform a polynomial expansion on “Weight”; for example, we might use a degree 3 polynomial expansion:
\[RMR = \beta_0 + \beta_1*\text{Weight} + \beta_2*\text{Weight}^2 + \beta_3*\text{Weight}^3 + \epsilon\]
Alternatively, another approach is to use an algorithm that doesn’t impose a parametric structure. For example, we might use the following rule (a method known as \(k\)-nearest neighbors): predict the RMR at a given weight as the average RMR of the \(k\) observed data-points whose weights are closest to it.
These different approaches are shown below:
This lab will focus on understanding these different models, and our next lab will address how we might choose between them.
\(~\)
Parametric models, such as linear regression, have the advantage of being easily interpretable. However, they also involve an estimation problem.
That is, you say that you’ll model the relationship between RMR and Weight using a straight line - but how do you decide which straight line?
To illustrate this point, the plot below shows three different straight line models. The blue line clearly doesn’t capture the pattern in the data, but the red and green lines both seem to do a decent job, so we need to decide which is better.
One solution to the estimation question is to set up a mathematical criterion by which the model’s success is measured, and then solve for the parameter estimates (ie: a slope and intercept) that optimize that criterion. For regression problems, the criterion of choice is the model’s root mean-squared error, or \(RMSE\):
\[RMSE = \sqrt{\tfrac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}\]
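To make this criterion concrete, the sketch below computes the \(RMSE\) of one arbitrary candidate line (an assumed intercept of 900 and slope of 3, chosen purely for illustration and not one of the three models compared in Question #1 below):
## Load the RMR data (it is loaded again later in the lab)
rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")
## Predicted RMR from the candidate line: 900 + 3*weight
y_hat <- 900 + 3 * rmr$weight_lbs
## RMSE: square root of the average squared prediction error
sqrt(mean((rmr$rate_kcal - y_hat)^2))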
Question #1: For the 44 women in these data, Model #1 has an RMSE of 154.27, Model #2 has an RMSE of 167.07, and Model #3 has an RMSE of 312.83. What do these numbers mean (in practical terms)? Which model should we prefer?
\(~\)
In Question #1, you compared the \(RMSE\) of 3 different models (each of which is defined by its own slope and intercept). A method known as ordinary least squares can be used to solve for the combination of slope and intercept that minimizes \(RMSE\) for the observed data. Additional mathematical details can be found here.
Once the optimal slope and intercept have been estimated from the sample data, we can provide an estimated model using numbers (rather than symbols). We refer to any quantity involving estimates based upon sample data using a \(\hat{ }\) or “hat”. For example:
We can state a population-level model in generic terms:
\[Y = \beta_0 + \beta_1 X_1 + \epsilon\]
Then, after using least squares estimation on the observed data, we can provide an estimated model:
\[\hat{Y} = 810 + 3.4X\] The code below demonstrates this process:
## Sample data
rmr <- read.csv("https://remiller1450.github.io/data/RMR.csv")
## Model specification
my_form <- formula(rate_kcal ~ weight_lbs)
## Estimation of the model
lm_fit <- lm(my_form, data = rmr)
coefficients(lm_fit)
## (Intercept) weight_lbs
## 811.22667 3.36168
Many models in R use a formula to define the model’s functional form. In this example, our formula specifies “rate_kcal” as the outcome variable and “weight_lbs” as the predictor variable. We could include additional predictor variables in the formula using a +. For example, a generic formula involving three predictors could be y ~ x1 + x2 + x3.
The lm() function takes the model formula and estimates the model parameters using the given data. In this example, the coefficients() function is used to extract the estimated slope and intercept from the fitted model.
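As a quick illustration of how these estimates translate into predictions (using a hypothetical body weight of 150 lbs, chosen only for demonstration), we can compute a predicted RMR either “by hand” or with the predict() function:
## Extract the estimated intercept and slope
b <- coefficients(lm_fit)
## Predicted RMR for a hypothetical 150-lb woman, computed by hand
b[1] + b[2] * 150
## The same prediction using predict()
predict(lm_fit, newdata = data.frame(weight_lbs = 150))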
Finally, you should note that we aren’t limited to linear models with a single predictor. Linear regression generalizes to the form:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \epsilon\]
Recall that we visualized linear regression in 3 dimensions (2 predictors) in our plotly lab.
Question #2: For this question, consider a
multiple linear regression model where the median 10-year
salary of a college’s alumni is modeled by a linear combination of a
college’s admissions rate and its net tuition. We will estimate this
model using college scorecard data from 2019, stored in cd
(created by the code given below).
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
## Remove colleges missing the relevant variables
cd <- filter(colleges, !is.na(Salary10yr_median), !is.na(Adm_Rate), !is.na(Net_Tuition))
Part A: Write out the population-level model that will be estimated in this question (using generic parameters and variable names). R is not necessary for this part.
Part B: Use the lm() function to estimate the parameters of this model from the sample data stored in cd. Print these parameter estimates.
Part C: The lm() function returns a list object that contains a vector named “fitted.values” (ie: predicted \(y\)-values). Use this vector to calculate the model’s in-sample \(RMSE\). Hint: use functions like sum() and sqrt().
Note: In Part C, the in-sample \(RMSE\) will be close to the “Residual standard error” reported when using the summary() function on the fitted model, but \(RMSE\) uses \(\tfrac{1}{n}\) while the residual standard error uses \(\tfrac{1}{n-p}\) (where \(p\) is the number of model parameters).
\(~\)
Standard linear regression models are too rigid to capture non-linear
relationships. However, we can use a strategy known as feature
expansion to overcome this limitation. For example, let’s expand
the “weight” variable in the RMR data by creating new columns
representing “weight”, “weight squared”, and “weight cubed” using the
poly()
function:
## Create rmr2, an augmented data frame
rmr2 = data.frame(poly(rmr$weight_lbs, degree = 3, raw = TRUE), rmr$rate_kcal)
colnames(rmr2) = c("weight","weight_squared","weight_cubed","rate_kcal")
head(rmr2)
## weight weight_squared weight_cubed rate_kcal
## 1 104.79 10980.94 1150693 1079
## 2 106.68 11380.62 1214085 1146
## 3 108.78 11833.09 1287203 1115
## 4 110.46 12201.41 1347768 1161
## 5 120.96 14631.32 1769805 1325
## 6 128.94 16625.52 2143695 1351
We can then fit a linear regression model using these three predictors:
poly3_mod <- lm(rate_kcal ~ weight + weight_squared + weight_cubed, data = rmr2)
coefficients(poly3_mod)
## (Intercept) weight weight_squared weight_cubed
## 4.005720e+02 9.541122e+00 -2.756257e-02 3.664196e-05
Importantly, we could also perform the polynomial expansion inside of
the lm()
function using the original rmr
data
frame as the model’s input data:
poly3_mod_alt <- lm(rate_kcal ~ poly(weight_lbs, degree = 3, raw = TRUE), data = rmr)
coefficients(poly3_mod_alt)
## (Intercept)
## 4.005720e+02
## poly(weight_lbs, degree = 3, raw = TRUE)1
## 9.541122e+00
## poly(weight_lbs, degree = 3, raw = TRUE)2
## -2.756257e-02
## poly(weight_lbs, degree = 3, raw = TRUE)3
## 3.664196e-05
The names lm()
gives to these coefficients are a little
messy, but putting poly()
directly in the model formula
makes visualization of the model much easier.
To facilitate visualization of a wide variety of models, we’ll adopt the following general approach:
1. Create a sequence of closely spaced values spanning the observed range of the predictor.
2. Use the fitted model to generate a predicted outcome for each value in this sequence.
3. Plot the predictions as a line (via geom_line()) over a scatter plot of the observed data.
These steps are demonstrated below:
## Step 1 - sequence of 200 values from the min to max weight:
xx = with(rmr, seq(min(weight_lbs), max(weight_lbs), length.out = 200))
## Step 2 - predictions for the sequence
yy = predict(poly3_mod_alt, data.frame(weight_lbs = xx))
## Step 3 - plot predictions using geom_line()
ggplot() +
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = yy), col = "red", size = 2)
Here we see that the polynomial expansion makes our linear model more flexible, allowing its predictions to be closer to the observed y-values.
Two additional feature expansion strategies, splines and discretization, are demonstrated below:
## Discretize into 3 categories using cut()
des_mod = lm(rate_kcal ~ cut(weight_lbs, breaks = 3), data = rmr)
## Spline expansion using bs()
library(splines)
spline_mod = lm(rate_kcal ~ bs(weight_lbs, degree = 2, knots = c(150,200,250)), data = rmr)
## New seq predictions for each
yd = predict(des_mod, data.frame(weight_lbs = xx))
ys = predict(spline_mod, data.frame(weight_lbs = xx))
## Visualizations of each
ggplot()+
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = ys), col = "green", size = 2) +
geom_line(aes(x = xx, y = yd), col = "purple", size = 2)
Without getting too far into the details:
- Discretization (via cut()) separates a numeric predictor into categories that are either equally spaced (if an integer is given for the breaks argument) or split at certain cut-points (if a vector is given for the breaks argument); see the small sketch after this list.
- Splines (via bs() and other similar functions) fit piece-wise polynomials of a specified degree that are joined together at locations called knots. Additional information can be found here.
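To make the two uses of the breaks argument concrete, here is a small standalone sketch using a short made-up vector (not the lab’s data):
## A made-up numeric vector
x <- c(3, 8, 14, 22, 27)
## An integer for breaks: three equally spaced intervals across the range of x
cut(x, breaks = 3)
## A vector for breaks: intervals split at the supplied cut-points
cut(x, breaks = c(0, 10, 20, 30))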
Question #3: Modify the spline model fit to the rmr data (given above) to use degree = 1. How does this change the model?
\(~\)
Even with strategies like feature expansion, regression can still be limited in its ability to detect highly localized patterns. In these circumstances, we might consider modeling algorithms that do not assume a parametric functional form when estimating \(f()\), with \(k\)-nearest neighbors (KNN) being a simple example.
KNN Algorithm: to generate a prediction at a given point, find the \(k\) observed data-points whose predictor values are closest (in distance) to that point, then predict the outcome as the average outcome of those \(k\) neighbors.
Standardizing \(\mathbf{X}\) is an important preliminary step in \(k\)-nearest neighbors. Because neighbors are found using distance, all variables should be on the same scale.
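As a small standalone illustration, the scale() function standardizes each column to have mean 0 and standard deviation 1 so that every predictor contributes comparably to the distance calculation (the height_in column below is hypothetical and not part of the rmr data):
## Two made-up predictors measured on very different scales
X <- data.frame(weight_lbs = c(120, 150, 180, 210),
                height_in  = c(62, 65, 68, 71))
## Each column of the result has mean 0 and standard deviation 1
scale(X)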
Next, consider a grid of target values, or points in the \(p\)-dimensional space corresponding to \(\mathbf{X}\) that we’d like the model to predict.
Question #4: Use the mutate() and arrange() functions to find the \(k = 3\) data-points nearest to a target value of \(x = 130\) lbs. Hint: since the target exists in 1-dimensional space, the Euclidean distance between the target value and \(x_i\) will be \(d_i =\sqrt{(x_i - \text{target})^2}\).
\(~\)
How to visualize KNN:
KNN models can be “fit” using the knn.reg() function in the FNN library. You should note that there’s also a knn() function that we’ll use in the future for classification tasks (predicting categorical outcomes).
library(FNN)
knn_fit = knn.reg(train = rmr$weight_lbs, ## Observed Predictors
test = data.frame(xx), ## Targets to Predict
y = rmr$rate_kcal, ## Observed Outcome
k = 5) ## Neighbors
Models from knn.reg()
are not compatible with the
predict()
function, so we need to slightly adjust our
previous visualization approach. knn.reg()
returns
predicted y-values for the data given in the test
argument.
So, we can supply our contrived sequence xx
as data via the
test
argument to get an approximation of the model’s
functional form.
The pred
component of the fitted model stores these
predictions, and below we use them to visualize this KNN model’s
estimate of \(f()\):
ggplot() +
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = knn_fit$pred), col = "blue", size = 2)
How to choose k:
Similar to \(k\)-means clustering, the data analyst using KNN must specify a value of \(k\) at the start of the algorithm, and the optimal value of \(k\) will vary across different applications. Smaller values of \(k\) yield flexible models that closely track the observed data but are sensitive to noise, while larger values of \(k\) yield smoother, more stable models that can miss localized patterns.
This tension is commonly referred to as the bias-variance trade-off, and it applies to all modeling approaches.
In our next lab, we’ll learn how to navigate the bias-variance trade-off using cross-validation as a guiding framework. For now, you should choose a reasonable value of \(k\) using visual inspection.
Question #5: Modify the KNN code (given as an
example of knn.reg()
) so that resting metabolic rates are
predicted using the nearest \(k=15\)
neighbors. Then, briefly describe the visual changes you observe in this
new model compared to models that used smaller values of \(k\).
\(~\)
A downside of KNN is that the way predictions are generated is very localized and thus difficult to describe or articulate. Put differently, there’s little you can do to explain why a body weight of X leads to a predicted RMR of Y short of saying “because that’s the average RMR of the neighbors of X”.
Decision tree modeling is an unstructured algorithmic approach that works by recursively partitioning the \(p\)-dimensional space (of the predictor variables) to improve purity.
For the RMR application, a simple decision tree splits the range of “Weight” into segments that are determined by splitting rules:
my_tree = rpart(rate_kcal ~ weight_lbs, data = rmr)
rpart.plot(my_tree)
The initial split separates the data-points with a body weight below 190 lbs into their own segment, or node. This node is further split to separate the data-points with a body weight below 125, giving us a decision tree with 3 terminal nodes. Within each node, the top line shows the predicted outcome (RMR in this example), and the second line displays the percentage of data-points that belong to that segment.
We can visualize the functional form of this decision tree using methods described earlier in the lab:
## Seq to predict over
xx = seq(min(rmr$weight_lbs), max(rmr$weight_lbs), length.out = 200)
## Predictions across sequence
yt = predict(my_tree, data.frame(weight_lbs = xx))
## Visualization
ggplot()+
geom_point(aes(x = rmr$weight_lbs, y = rmr$rate_kcal)) +
geom_line(aes(x = xx, y = yt), col = "green", size = 2)
Notice the tree’s two splitting rules partition the range of “Weight” into 3 distinct regions. The tree’s predictions are the average outcomes within each region, hence the 3 horizontal lines present on this visualization.
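To connect this back to the tree, here is a small check (a sketch that re-uses the rmr data and the cut-points of 125 and 190 displayed by rpart.plot above; because the displayed values are rounded, the match may be approximate): the average RMR within each region should line up with the predictions shown in the terminal nodes.
## Average RMR within each region implied by the tree's splitting rules
rmr %>%
  mutate(region = cut(weight_lbs, breaks = c(-Inf, 125, 190, Inf))) %>%
  group_by(region) %>%
  summarize(avg_rmr = mean(rate_kcal), n = n())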
How does the algorithm determine the splitting rules? For regression trees (the “anova” method used by rpart() here), the algorithm evaluates every possible split of each predictor and chooses the split that produces the largest reduction in the sum of squared errors (equivalently, the largest improvement in R-squared).
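As a rough numerical sketch (assuming the rmr data and re-using the first split, a body weight of 190 lbs, from the tree shown above), we can verify that splitting a node reduces the total squared error relative to predicting a single overall average:
## Sum of squared errors around a group's mean
sse <- function(y) sum((y - mean(y))^2)
## SSE when every data-point is predicted by one overall average (no split)
sse(rmr$rate_kcal)
## SSE after splitting at 190 lbs: each side is predicted by its own average
left  <- rmr$rate_kcal[rmr$weight_lbs < 190]
right <- rmr$rate_kcal[rmr$weight_lbs >= 190]
sse(left) + sse(right)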
When does the algorithm stop making splits?
Decision trees can involve several different stopping criteria, though they generally depend upon one or more of the following factors: the improvement in fit gained by an additional split, the number of data-points in the node being considered, and the depth of the tree.
By default, a decision tree fit using the “anova” method will stop splitting when:
- an additional split would not improve the model’s R-squared by at least 1% (governed by the complexity parameter, cp)
- the node being considered for a split contains fewer than 20 data-points (governed by minsplit)
- the tree has reached a depth of 30 (governed by maxdepth)
These parameters are controlled using the control argument in rpart() in conjunction with the rpart.control() function. This is demonstrated below:
### Fit a new tree w/ different stopping criteria
my_tree = rpart(rate_kcal ~ weight_lbs, data = rmr,
control = rpart.control(cp = 0.005, minsplit = 10))
### Visualize the tree
rpart.plot(my_tree)
In this example, you should notice that relaxing the stopping criteria leads to a deeper, more complex tree.
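If you’d like to see how much each split improved the fit, the printcp() function from the rpart package prints the fitted tree’s complexity table (one row per tree size in the pruning sequence); this is optional and shown only as a small illustration:
## Complexity table for the tree fit above
printcp(my_tree)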
Question #6: Fit a decision tree using the
cd
dataset created below (and used earlier in the lab) that
predicts “Salary10yr_median” using the variables “Adm_Rate” and
“Net_Tuition”. Set the stopping criteria such that a 2% improvement in
R-squared is required to make an additional split.
Part A: Visualize the fitted tree and briefly describe the characteristics of colleges that belong to the tree’s smallest terminal node (in terms of the number of colleges contained in that node).
Part B: Report the predicted salary of colleges that belong to the node you identified in Part A.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
## Remove colleges missing the relevant variables
cd <- filter(colleges, !is.na(Salary10yr_median), !is.na(Adm_Rate), !is.na(Net_Tuition))
\(~\)
Question #7: For this question you will conduct a comprehensive analysis using the AFQT data introduced below.
The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to measure whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, percentile scores are frequently used as a measure of general intelligence.
The afqt
data set is a random sample of \(n = 2584\) Americans who were first
selected and tested in 1979, then later re-interviewed in 2006 and asked
about their education and annual income. The focus of this analysis will
be modeling 2005 incomes using AFQT scores (from 1979) with and without
adjusting for educational attainment.
## Load the data
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Create a scatter plot displaying the relationship between AFQT percentile scores and 2005 income while coloring each point by educational attainment. Write 1-2 sentences describing the relationship (or lack thereof) between these variables.
Part B: Without considering educational attainment, use \(k\)-nearest neighbors with \(k = 50\) (approximately \(\sqrt{n}\)) to generate predicted incomes based upon AFQT percentile. Create a graphical display of this model and report the in-sample \(RMSE\).
Part C: Without considering educational attainment, use simple linear regression to model income in response to AFQT percentile. Create a graphical display of this model and report the in-sample \(RMSE\).
Part D: Report and interpret the slope coefficient from the model in Part C.
Part E: Fit a new linear regression model that also uses “Educ” as a predictor and report its slope coefficient for AFQT. Why might this slope differ from the one in Part D? How should its interpretation change? Hint: remember that the slope coefficient for “AFQT” in this model reflects changes in the outcome as “AFQT” varies while “Educ” is held constant.
Part F: Fit a decision tree model to these data using the default stopping criteria and both educational attainment and AFQT as predictors. Report the \(RMSE\) of this model and state whether you think it fits the data more closely than the models you fit in Parts B and C.