caret
(Numeric Outcomes)Models are tools used to understand or reconstruct complex processes. When deciding upon a model some key considerations are:
In this lab we explore these considerations, focusing on interpretable linear models and addressing the trade-off between bias and overfitting using the caret
package to perform cross-validation.
library(caret) ## package for model comparisons
library(glmnet) ## package for fitting lasso models
library(mgcv) ## package for fitting GAM models
Linear regression is a type of parametric model, meaning there are two steps to applying a linear regression model. We must first specify the structure of the model, and then estimate the model’s parameters. A simple regression model includes only two parameters, an intercept and slope.
The simple linear regression model assumes the structure: \[ Y = \beta_0 + \beta_1 X_{1} + \epsilon\]
Where:
Once the model structure has been specified its parameters, \(\beta_0\), \(\beta_1\), and \(\sigma^2\), must be estimated.
Parameter estimation is sometimes called fitting or training the model, for linear regression these parameters are estimated via least squares. We typically denote estimates using a hat symbol, meaning \(\hat{\beta}_1\) is the estimated value (from our sample data) of \(\beta_1\) (which we assume is an unknown population level effect).
In R linear models are fit using the lm
function:
salaries <- read.csv("https://remiller1450.github.io/data/Salaries.csv")
fit1 <- lm(data = salaries, salary ~ yrs.service)
Within lm
, a model’s structure by a formula of the general form Outcome ~ Predictors
. In this example we model professor salary using years of service: \[\text{salary} = \beta_0 + \beta_1 \text{years service} + \epsilon\]
Coefficient estimates and their standard errors for this model are printed using the summary
function:
summary(fit1)
##
## Call:
## lm(formula = salary ~ yrs.service, data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81933 -20511 -3776 16417 101947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99974.7 2416.6 41.37 < 2e-16 ***
## yrs.service 779.6 110.4 7.06 7.53e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28580 on 395 degrees of freedom
## Multiple R-squared: 0.1121, Adjusted R-squared: 0.1098
## F-statistic: 49.85 on 1 and 395 DF, p-value: 7.529e-12
To summarize this example:
yrs.service
is zeroyrs.service
In the professor salaries example you should recoginize that yrs.service
is not the only factor related to a faculty member’s salary. Fortunately, the linear regression framework easily accommodates multiple explanatory variables: \[ Y = \beta_0 + \beta_1 X_{1} + \beta_2X_2 + \ldots + \beta_pX_p + \epsilon\] Here:
We can use multiple regression to develop a more complex model of professor salary that includes both yrs.service
and rank
.
fit2 <- lm(data = salaries, salary ~ yrs.service + relevel(rank, ref = "AsstProf"))
summary(fit2)
##
## Call:
## lm(formula = salary ~ yrs.service + relevel(rank, ref = "AsstProf"),
## data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64515 -16180 -1234 12181 107174
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 81151.3 2896.9 28.013
## yrs.service -158.1 115.0 -1.376
## relevel(rank, ref = "AsstProf")AssocProf 14615.4 4270.6 3.422
## relevel(rank, ref = "AsstProf")Prof 49228.8 3991.9 12.332
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## yrs.service 0.169708
## relevel(rank, ref = "AsstProf")AssocProf 0.000686 ***
## relevel(rank, ref = "AsstProf")Prof < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23610 on 393 degrees of freedom
## Multiple R-squared: 0.3972, Adjusted R-squared: 0.3926
## F-statistic: 86.3 on 3 and 393 DF, p-value: < 2.2e-16
In this model you should notice:
rank
is a categorical variable with 3 categories, so it must be represented using two different dummy variables and a pre-specified reference category. We use the relevel
function to change the reference category to “AsstProf”, the lowest rank in these data.rank
in the model the coefficient of yrs.service
became negative, this is due to the coefficient now representing the adjusted effect of salary within a fixed rank. Phrased differently, for two professors both at the same rank each additional year of service leads to an expected salary decrease of 158 dollars.Before moving on from multiple regression we’ll fit one more model, this time including all of the available predictors in the salaries
dataset. The .
symbol on the right side of the formula is shorthand for “all variables in the data not previously specified”.
fit3 <- lm(data = salaries, salary ~ . )
Question 1:
Interpet the effect of the dummy variable “Rank=Prof” in fit2
, be specific in your answer.
So far we’ve fit three different linear models for predicting a professor’s salary, but which of these models is best?
One way to evaluate the performance of different models is to look at how accurately they predict the outcome. We can summarize this accuracy using root mean squared error (RMSE) or mean absolute error (MAE): \[RMSE = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2} \qquad MAE = \tfrac{1}{n}\sum_i|y_i - \hat{y}_i|\] Both RMSE and MAE are ways to describe how far, on average, each of the models predicted values, \(\hat{y}_i\) is from the actual outcome, \(y_i\). We can easily calculate both metrics for each of our three models:
n <- nrow(salaries)
RMSE1 <- sqrt(1/n * sum((salaries$salary - fit1$fitted.values)^2))
RMSE2 <- sqrt(1/n * sum((salaries$salary - fit2$fitted.values)^2))
RMSE3 <- sqrt(1/n * sum((salaries$salary - fit3$fitted.values)^2))
cat("RMSE:",RMSE1, RMSE2, RMSE3)
## RMSE: 28505.66 23487.72 22155.48
MAE1 <- 1/n * sum(abs(salaries$salary - fit1$fitted.values))
MAE2 <- 1/n * sum(abs(salaries$salary - fit2$fitted.values))
MAE3 <- 1/n * sum(abs(salaries$salary - fit3$fitted.values))
cat(" MAE:",MAE1, MAE2, MAE3)
## MAE: 22317.68 18021.59 16173
Based upon these measures fit3
, the model which uses all of the available explanatory variables, yields predictions that are closest to the observed values.
This is troublesome because fit3
doesn’t appear to be a very good model, it includes a predictor X
that is just the row number of each observation! The reason for this superiority is that we used metrics which measured in-sample performance, something that will always improve as models grow more complex.
In most applications we are less concerned with how well a model performs on the data used to train it, and more concerned with how well it will generalize to an independent dataset. In other words, we usually don’t want models that are overly specific to a particular sample (overfit) because these models don’t actually help us explain the phenomenon of interest more broadly.
Cross-validation is a tool to estimate performance of a model on an independent sample. In R we can use cross-validation to compare the out-of-sample performance of different models using the caret
package.
In the caret
package models are fit using the train
function, and their performance is evaluated by instructions provided in a trainControl
object.
In the example below we specify a trainControl
object stating that our models will be evaluated using 10 repeats of 5-fold cross validation, we then train four different models of professor salary:
fit.control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
set.seed(123)
fit1 <- train(salary ~ yrs.service, data = salaries, method = "lm", trControl = fit.control)
set.seed(123)
fit2 <- train(salary ~ yrs.service + rank, data = salaries, method = "lm", trControl = fit.control)
set.seed(123)
fit3 <- train(salary ~ yrs.service + rank + discipline + sex, data = salaries, method = "lm", trControl = fit.control)
set.seed(123)
fit4 <- train(salary ~ ., data = salaries, method = "lm", trControl = fit.control)
Notice:
method = "lm"
within the train
function, a strength of the caret
package is its uniform syntax for hundreds of different model types, which we could request simply by changing this argument.set.seed
to set the same randomization seed prior to training each model, this is advised to ensure the same random fold assignments are used when cross-validating each model.The resamples
function allows us to compare any number of models trained using caret
:
resamps <- resamples(list(LM1 = fit1,
LM2 = fit2,
LM3 = fit3,
LM4 = fit4))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: LM1, LM2, LM3, LM4
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM1 19577.75 21412.67 22386.72 22402.14 23494.91 25084.35 0
## LM2 15456.89 17403.22 18053.85 18162.68 18830.40 20964.37 0
## LM3 13855.87 15721.93 16778.70 16691.86 17341.92 19797.06 0
## LM4 13923.25 15566.76 16488.27 16468.64 17392.02 19291.53 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM1 24142.28 27043.28 28694.28 28554.47 29986.45 32392.73 0
## LM2 19542.67 22061.55 23833.97 23582.51 24740.09 27955.03 0
## LM3 18992.35 21349.30 22400.41 22650.20 24127.02 26700.71 0
## LM4 19330.44 21066.15 22401.00 22503.02 23863.75 26142.43 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM1 0.01235389 0.0737127 0.1133501 0.1260101 0.1573254 0.2975567 0
## LM2 0.30243305 0.3506905 0.4019138 0.3959562 0.4326401 0.5095178 0
## LM3 0.33000516 0.4042301 0.4558864 0.4433776 0.4833220 0.5585006 0
## LM4 0.34541472 0.4054179 0.4652892 0.4511285 0.4884499 0.5715129 0
resamples
summarizes the out-of-sample performance within each “sample”. Our trainControl
specified 10 repeats of 5-fold cross-validation, so there are 50 different “samples” where our models can be compared, one for each fold within each repeat. In our professor salaries example we can see that models 3 and 4 are similar to each other and are both substantially better than models 1 and 2.
One visual way to compare similar models is the Bland-Altman plot:
xyplot(resamps, what = "BlandAltman", metric = "RMSE", models = c("LM3", "LM4"))
Each point represents a paired difference (paired within a “sample”) for the specified metric (RMSE in this case). The y-axis depicts that paired difference, while the x-axis depicts the average value of the metric for that sample. If the performance of the models is exactly equal the Bland-Altman plot will like random scatter.
Question 2:
Construct a plot comparing the out-of-sample RMSE versus in-sample RMSE using the four models (fit1, fit2, fit3, fit4) specified above. As a first step in constructing this plot, you should assemble an appropriate data frame using the data.frame
and c()
functions.
In some circumstances we might want to use a linear regression model, but desire an alternative to least squares estimation. Two reasons for not wanting to use least squares are:
The least absolute shrinkage and selection operation estimator or lasso is one such alternative. The lasso naturally performs variable selection and shrinks coefficient estimates towards zero, often leading to improved out-of-sample prediction. The idea that estimated effects should be shrunken towards zero to avoid overfitting is known as regularization, a concept that can be applied to many types of models.
The lasso solution is found by minimizing \(Q(\beta) =\sum_i (y_i - x_i^T\beta)^2 + \lambda ||\beta||_1\) with respect to \(\beta\), where \(||\beta||_1 = |\beta_1| + |\beta_2| + \ldots + |\beta_p|\) is the L1 norm of the coefficient vector. \(\lambda\) is a tuning parameter, or a numeric constant that needs to specified prior to estimation.
When \(\lambda = 0\), the lasso solution is the same as the least squares solution. For \(\lambda \rightarrow \infty\), the lasso solution is the zero vector (ie: no variables make non-zero contributions in the model). Each intermediate value of \(\lambda\) provide sparse solutions, where a subset of variables make non-zero contributions with coefficients that are shrunken towards zero (relative to their least squares estimates).
Typically we try an entire path of different \(\lambda\) values, tracking the lasso solution for each. The code below fits a lasso model on the salaries
data using the glmnet
package:
X = model.matrix(data = salaries[,-7], ~ -1 + .) ## Set up the matrix including dummy vars
y = salaries$salary
lasso <- glmnet(x = X, y = y)
Note:
glmnet
doesn’t allow formulas, you must provide the matrix of explanatory variables and the response vector separately. This is cumbersome when dummy variables are needed, requiring us to use the model.matrix
function to construct the appropriate design matrix
model.matrix
we provide the data and the right side of the model formula. “-1” is used to indicate we don’t want an intercept (glmnet
creates an intercept for us).To understand this a lasso model it is useful to visualize the coefficient path:
plot(lasso, xvar = "lambda")
abline(v = 8, lty = 2)
The coefficient path displays the relationship between \(\lambda\), the penalty parameter, the number of active variables, and their estimated coefficients.
Each point along the coefficient path represents a different lasso model. For illustration we use abline
to add a vertical line at \(log(\lambda) = 8\), the model characterized by this penalty contains only 3 variables. We can use the coef
function to print these selections and their estimated effects:
coef(lasso, s = exp(8))
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 91489.840
## X .
## rankAssocProf .
## rankAsstProf -8582.950
## rankProf 29641.312
## disciplineB 6992.899
## yrs.since.phd .
## yrs.service .
## sexMale .
In order to use caret
to compare the lasso with least squares regression we need to utilize the tuneGrid
argument to specify values of tuning parameters (ie: \(\lambda\), the lasso penalty parameter).
lams <- expand.grid(alpha = 1, lambda = lasso$lambda)
set.seed(123)
fit.lasso <- train(salary ~ ., data = salaries, method = "glmnet", trControl = fit.control, tuneGrid = lams)
set.seed(123)
fit.lm <- train(salary ~ ., data = salaries, method = "lm", trControl = fit.control)
Note:
glmnet
allows a second tuning parameter alpha
, that we do not consider in this lab. alpha
is used by the elastic net, a regularization approach similar to the lasso.tuneGrid
the train
function will still work; however, the default values used by train
tend to be sub-optimal.rs <- resamples(list(LM = fit.lm, LASSO = fit.lasso))
xyplot(rs, what = "BlandAltman", metric = "RMSE", models = c("LM", "LASSO"))
Continuing our ongoing professor salary example, we see that the lasso model provides roughly the same predictive performance as the least squares regression model.
This is not unexpected, in this example the number of predictors is relatively small compared to the sample size. As shown below, the best lasso model actually includes all of the available predictors:
coef(fit.lasso$finalModel, s = fit.lasso$bestTune$lambda)
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 73807.42881
## X 27.25719
## rankAsstProf -12981.97361
## rankProf 31931.45842
## disciplineB 16663.03244
## yrs.since.phd 468.80110
## yrs.service -447.23551
## sexMale 4016.01485
In some circumstances we might want to add flexibility to the linear regression to allow for changes in the explanatory variables to correspond with non-linear changes in the outcome. Generalized Additive Models (GAMs) modify the traditional linear model to include a smoothing term for the contribution of each predictor: \[ Y = \beta_0 + f_1(X_{1}) + f_2(X_2) + \ldots + f_p(X_p) + \epsilon\] Notice that if we take \(f_j(X_j) = \beta_j X_j\) for all \(j \in \{1, \ldots, p\}\) we arrive back at the familar multiple regression model.
To fit a GAM model we must specify the type of function used to relate \(X_j\) and \(Y\). While there are many possibilities, perhaps the most choice are splines which are specified in the model formula using the s
function:
gam1 <- gam(salary ~ s(yrs.service), data = salaries)
We can understand the spline function by plotting the relationship between the explanatory variable and the predicted outcome:
plot(gam1)
In this example we see that yrs.service
has a roughly linear effect until about 20 years, after which the variable has little impact (ignoring the erratic behavior at very large values where there isn’t much data).
Would using a GAM model lead to better salary predictions? We can investigate this using caret
.
fit.control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
set.seed(123)
fit1 <- train(salary ~ yrs.service, data = salaries, method = "lm", trControl = fit.control)
set.seed(123)
fit2 <- train(salary ~ yrs.service, data = salaries, method = "gam", trControl = fit.control)
rs <- resamples(list(LM = fit1, GAM = fit2))
xyplot(rs, what = "BlandAltman", metric = "RMSE", models = c("LM", "GAM"))
As evidenced by the Bland-Altman plot, the GAM model provides superior predictions (measured by RMSE). You should also notice that we didn’t need to specify the spline term s(yrs.service)
in the GAM model formula, by default caret
will use splines when you use method = "gam"
.
At this point we’ve seen how to implement and evaluate a handful of interpretable models using caret
. With that said, this lab isn’t intended to be a comprehensive guide to modeling, a vast topic that can requires years of study to master. A few of topics that weren’t covered include:
If you’re interested in these topics I encourage you to check out following books written by Max Kuhn and Kjell Johnson:
The birthwt
data frame contains data on 189 infants born at Baystate Medical Center in Springfield, MA in 1986. For this application we will use the variables: “age”, “lwt”, “race”, “smoke”, “ptl”, “ht”, “ui”, and ftv to predict “btw”, the infant’s birth weight. We will ignore the variable “low”, which is a direct function of “bwt”. To see a description these variables run the command ?birthwt
library(MASS)
data("birthwt")
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
Question #3:
Construct a scatterplot showing the relationship between “age”, plotted on the x-axis, and “bwt”, plotted on the y-axis. Then use the lm
function to fit a simple linear regression model. Interpret the coefficient estimates of this model, and write 1-2 sentences discussing whether you feel the model is appropriate given what you see in the scatterplot.
Question #4:
Use 10 repeats of 5-fold cross-validation in caret
to compare the simple linear regression model used in Question #3 with a multiple regression model that includes all of the available predictors (expect for “low”). Make sure that your model treats the variable “race” as a factor. Use the resamples
function to decide which model provides better prediction.
Question #5:
Use caret
and the resamples
function to compare a lasso regression model with the multiple regression model you used in Question #4. You should use glmnet
to explore an appropriate sequence of \(\lambda\) values to be given to the train
function as a tuning grid. You may use the code below to set up the X
and y
to be used glmnet
.
X = model.matrix(data = birthwt[,-c(1,10)], ~ -1 - race + factor(race) + .)
y = birthwt$bwt
Question #6:
Print the train
fit of your lasso regression model and identify the value of \(\lambda\) with the best out-of-sample performance. Print this model’s coefficients using code similar to that provided below and answer the following questions:
coef(myLassoTrainFit$finalModel, s = best_lambda)
Question #7:
Fit a GAM (do not use the variable “race”), then use caret
to compare its out-of-sample performance with the lasso regression model of Question #5 and Question #6 and answer the following questions:
plot.gam(myGAMTrainFit$finalModel, pages = 1)