1. Introduction

Models are tools used to understand or reconstruct complex processes. When deciding upon a model, some key considerations are:

  1. Bias - Is the modeling framework flexible enough to accurately represent the phenomenon of interest?
  2. Overfitting - Is the chosen model too specific to the sample used to fit it?
  3. Interpretability - Is the chosen model too difficult to explain to be useful?

In this lab we explore these considerations, focusing on interpretable linear models and addressing the trade-off between bias and overfitting using the caret package to perform cross-validation.

library(caret)  ## package for model comparisons
library(glmnet) ## package for fitting lasso models
library(mgcv)   ## package for fitting GAM models

2. Linear Regression

Linear regression is a type of parametric model, meaning there are two steps to applying it: we must first specify the structure of the model, and then estimate its parameters. A simple regression model includes only two parameters, an intercept and a slope.

The simple linear regression model assumes the structure: \[ Y = \beta_0 + \beta_1 X_{1} + \epsilon\]

Where:

  • \(Y\) is the outcome (response) variable
  • \(X_1\) is the explanatory (predictor) variable
  • \(\beta_0\) is the intercept and \(\beta_1\) is the slope
  • \(\epsilon\) is a random error term assumed to follow a \(N(0,\sigma^2)\) distribution

Once the model structure has been specified, its parameters, \(\beta_0\), \(\beta_1\), and \(\sigma^2\), must be estimated.

Parameter estimation is sometimes called fitting or training the model; for linear regression these parameters are estimated via least squares. We typically denote estimates using a hat symbol, meaning \(\hat{\beta}_1\) is the estimated value (from our sample data) of \(\beta_1\) (which we assume is an unknown population level effect).
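
To make the idea concrete, here is a minimal sketch (using simulated data, since the salaries data are loaded below) showing that the least squares slope and intercept can be computed directly from sample moments, and that R's lm function reproduces them:

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)             ## simulate data with beta0 = 2, beta1 = 3
b1_hat <- cov(x, y) / var(x)            ## least squares slope estimate
b0_hat <- mean(y) - b1_hat * mean(x)    ## least squares intercept estimate
c(b0_hat, b1_hat)
coef(lm(y ~ x))                         ## should match the estimates above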

In R linear models are fit using the lm function:

salaries <- read.csv("https://remiller1450.github.io/data/Salaries.csv")
fit1 <- lm(data = salaries, salary ~ yrs.service)

Within lm, a model’s structure is specified by a formula of the general form Outcome ~ Predictors. In this example we model professor salary using years of service: \[\text{salary} = \beta_0 + \beta_1 \text{years service} + \epsilon\]

Coefficient estimates and their standard errors for this model are printed using the summary function:

summary(fit1)
## 
## Call:
## lm(formula = salary ~ yrs.service, data = salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81933 -20511  -3776  16417 101947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  99974.7     2416.6   41.37  < 2e-16 ***
## yrs.service    779.6      110.4    7.06 7.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28580 on 395 degrees of freedom
## Multiple R-squared:  0.1121, Adjusted R-squared:  0.1098 
## F-statistic: 49.85 on 1 and 395 DF,  p-value: 7.529e-12

To summarize this example:

  • The estimated intercept, \(\hat{\beta}_0 \approx 99975\), is the expected salary of a professor with zero years of service
  • The estimated slope, \(\hat{\beta}_1 \approx 780\), indicates that each additional year of service is associated with an expected increase in salary of roughly 780 dollars
  • Years of service alone explains only about 11% of the variability in salary (Multiple R-squared = 0.1121)
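
Once fit, the model can also be used to generate predictions. Below is a small sketch using the fit1 object from above and a hypothetical professor with 10 years of service:

new_prof <- data.frame(yrs.service = 10)   ## hypothetical new observation
predict(fit1, newdata = new_prof)          ## roughly 99974.7 + 779.6 * 10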

Multiple Regression

In the professor salaries example you should recognize that yrs.service is not the only factor related to a faculty member’s salary. Fortunately, the linear regression framework easily accommodates multiple explanatory variables: \[ Y = \beta_0 + \beta_1 X_{1} + \beta_2X_2 + \ldots + \beta_pX_p + \epsilon\] Here:

  • Each variable, \(X_j\), makes its own separate and additive linear contribution to \(Y\)
  • Deviation from this linear relationship is captured by \(\epsilon\)
  • \(\epsilon\) follows a \(N(0,\sigma^2)\) distribution over the entire range of \(\{X_1, X_2, \ldots, X_p\}\)

We can use multiple regression to develop a more complex model of professor salary that includes both yrs.service and rank.

fit2 <- lm(data = salaries, salary ~ yrs.service + relevel(rank, ref = "AsstProf"))
summary(fit2)
## 
## Call:
## lm(formula = salary ~ yrs.service + relevel(rank, ref = "AsstProf"), 
##     data = salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64515 -16180  -1234  12181 107174 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                               81151.3     2896.9  28.013
## yrs.service                                -158.1      115.0  -1.376
## relevel(rank, ref = "AsstProf")AssocProf  14615.4     4270.6   3.422
## relevel(rank, ref = "AsstProf")Prof       49228.8     3991.9  12.332
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## yrs.service                              0.169708    
## relevel(rank, ref = "AsstProf")AssocProf 0.000686 ***
## relevel(rank, ref = "AsstProf")Prof       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23610 on 393 degrees of freedom
## Multiple R-squared:  0.3972, Adjusted R-squared:  0.3926 
## F-statistic:  86.3 on 3 and 393 DF,  p-value: < 2.2e-16

In this model you should notice:

  1. rank is a categorical variable with 3 categories, so it must be represented using two different dummy variables and a pre-specified reference category. We use the relevel function to change the reference category to “AsstProf”, the lowest rank in these data (see the sketch after this list).
  2. After including rank in the model, the coefficient of yrs.service became negative. This is because the coefficient now represents the adjusted effect of years of service on salary within a fixed rank. Phrased differently, for two professors at the same rank, each additional year of service is associated with an expected salary decrease of 158 dollars.
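
To see the dummy variables R creates behind the scenes, one rough check is to build the model matrix for rank directly (factor is used here in case rank was read in as a character vector):

rank_releveled <- relevel(factor(salaries$rank), ref = "AsstProf")
head(model.matrix(~ rank_releveled))   ## an intercept column plus two dummy columns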

Before moving on from multiple regression we’ll fit one more model, this time including all of the available predictors in the salaries dataset. The . symbol on the right side of the formula is shorthand for “all variables in the data not previously specified”.

fit3 <- lm(data = salaries, salary ~ . )

Question 1:

Interpret the effect of the dummy variable “Rank=Prof” in fit2. Be specific in your answer.

3. Comparing Models

So far we’ve fit three different linear models for predicting a professor’s salary, but which of these models is best?

One way to evaluate the performance of different models is to look at how accurately they predict the outcome. We can summarize this accuracy using root mean squared error (RMSE) or mean absolute error (MAE): \[RMSE = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2} \qquad MAE = \tfrac{1}{n}\sum_i|y_i - \hat{y}_i|\] Both RMSE and MAE describe how far, on average, each model’s predicted values, \(\hat{y}_i\), are from the actual outcomes, \(y_i\). We can easily calculate both metrics for each of our three models:

n <- nrow(salaries)
RMSE1 <- sqrt(1/n * sum((salaries$salary -  fit1$fitted.values)^2))
RMSE2 <- sqrt(1/n * sum((salaries$salary -  fit2$fitted.values)^2))
RMSE3 <- sqrt(1/n * sum((salaries$salary -  fit3$fitted.values)^2))
cat("RMSE:",RMSE1, RMSE2, RMSE3)
## RMSE: 28505.66 23487.72 22155.48
MAE1 <- 1/n * sum(abs(salaries$salary -  fit1$fitted.values))
MAE2 <- 1/n * sum(abs(salaries$salary -  fit2$fitted.values))
MAE3 <- 1/n * sum(abs(salaries$salary -  fit3$fitted.values))
cat(" MAE:",MAE1, MAE2, MAE3)
##  MAE: 22317.68 18021.59 16173

Based upon these measures fit3, the model which uses all of the available explanatory variables, yields predictions that are closest to the observed values.

This is troublesome because fit3 doesn’t appear to be a very good model: it includes a predictor, X, that is just the row number of each observation! The reason for its apparent superiority is that we used metrics that measure in-sample performance, something that will always improve as models grow more complex.
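
One rough way to see the problem (a sketch, not part of the cross-validation workflow introduced next) is to hold out a random 20% of the observations, refit the full model on the remaining 80%, and compute RMSE on the held-out rows:

set.seed(7)   ## arbitrary seed for the random split
train_rows <- sample(nrow(salaries), size = floor(0.8 * nrow(salaries)))
holdout_fit <- lm(salary ~ ., data = salaries[train_rows, ])
holdout_pred <- predict(holdout_fit, newdata = salaries[-train_rows, ])
sqrt(mean((salaries$salary[-train_rows] - holdout_pred)^2))   ## out-of-sample RMSE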

Cross-validation

In most applications we are less concerned with how well a model performs on the data used to train it, and more concerned with how well it will generalize to an independent dataset. In other words, we usually don’t want models that are overly specific to a particular sample (overfit) because these models don’t actually help us explain the phenomenon of interest more broadly.

Cross-validation is a tool for estimating the performance of a model on an independent sample. In R we can use cross-validation to compare the out-of-sample performance of different models using the caret package.

In the caret package models are fit using the train function, and their performance is evaluated by instructions provided in a trainControl object.

In the example below we specify a trainControl object stating that our models will be evaluated using 10 repeats of 5-fold cross-validation; we then train four different models of professor salary:

fit.control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

set.seed(123)
fit1 <- train(salary ~ yrs.service, data = salaries, method = "lm", trControl = fit.control)

set.seed(123)
fit2 <- train(salary ~ yrs.service + rank, data = salaries, method = "lm", trControl = fit.control)

set.seed(123)
fit3 <- train(salary ~ yrs.service + rank + discipline + sex, data = salaries, method = "lm", trControl = fit.control)

set.seed(123)
fit4 <- train(salary ~ ., data = salaries, method = "lm", trControl = fit.control)

Notice:

  1. We specified method = "lm" within the train function. A strength of the caret package is its uniform syntax for hundreds of different model types, which we can request simply by changing this argument (see the sketch after this list).
  2. We used set.seed to set the same randomization seed prior to training each model. This is advised to ensure the same random fold assignments are used when cross-validating each model.
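
As an illustration of point 1 (a sketch; k-nearest neighbors is just one of the many model types caret supports), the same call fits an entirely different model if we change the method argument:

set.seed(123)
fit.knn <- train(salary ~ yrs.service + rank, data = salaries,
                 method = "knn", trControl = fit.control)   ## k-nearest neighbors instead of lm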

The resamples function allows us to compare any number of models trained using caret:

resamps <- resamples(list(LM1 = fit1,
                          LM2 = fit2,
                          LM3 = fit3,
                          LM4 = fit4))

summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: LM1, LM2, LM3, LM4 
## Number of resamples: 50 
## 
## MAE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LM1 19577.75 21412.67 22386.72 22402.14 23494.91 25084.35    0
## LM2 15456.89 17403.22 18053.85 18162.68 18830.40 20964.37    0
## LM3 13855.87 15721.93 16778.70 16691.86 17341.92 19797.06    0
## LM4 13923.25 15566.76 16488.27 16468.64 17392.02 19291.53    0
## 
## RMSE 
##         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LM1 24142.28 27043.28 28694.28 28554.47 29986.45 32392.73    0
## LM2 19542.67 22061.55 23833.97 23582.51 24740.09 27955.03    0
## LM3 18992.35 21349.30 22400.41 22650.20 24127.02 26700.71    0
## LM4 19330.44 21066.15 22401.00 22503.02 23863.75 26142.43    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LM1 0.01235389 0.0737127 0.1133501 0.1260101 0.1573254 0.2975567    0
## LM2 0.30243305 0.3506905 0.4019138 0.3959562 0.4326401 0.5095178    0
## LM3 0.33000516 0.4042301 0.4558864 0.4433776 0.4833220 0.5585006    0
## LM4 0.34541472 0.4054179 0.4652892 0.4511285 0.4884499 0.5715129    0

resamples summarizes the out-of-sample performance within each “sample”. Our trainControl specified 10 repeats of 5-fold cross-validation, so there are 50 different “samples” where our models can be compared, one for each fold within each repeat. In our professor salaries example we can see that models 3 and 4 are similar to each other and are both substantially better than models 1 and 2.
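
Before focusing on pairwise comparisons, the distributions of the resampled metrics can also be compared graphically; for example, caret provides a box-and-whisker plot method for resamples objects (a sketch):

bwplot(resamps, metric = "RMSE")   ## distribution of out-of-sample RMSE for each model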

One visual way to compare similar models is the Bland-Altman plot:

xyplot(resamps, what = "BlandAltman", metric = "RMSE", models = c("LM3", "LM4"))

Each point represents a paired difference (paired within a “sample”) for the specified metric (RMSE in this case). The y-axis depicts that paired difference, while the x-axis depicts the average value of the metric for that sample. If the performance of the models is exactly equal, the Bland-Altman plot will look like random scatter.
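
The paired differences shown in this plot can also be summarized numerically using caret's diff method for resamples objects (a sketch):

resamp_diffs <- diff(resamps)   ## paired differences in each metric for every pair of models
summary(resamp_diffs)           ## summaries of the paired differences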

Question 2:

Construct a plot comparing the out-of-sample RMSE versus in-sample RMSE using the four models (fit1, fit2, fit3, fit4) specified above. As a first step in constructing this plot, you should assemble an appropriate data frame using the data.frame and c() functions.

4. Regularization

In some circumstances we might want to use a linear regression model, but desire an alternative to least squares estimation. Two reasons for not wanting to use least squares are:

  1. Prediction accuracy - The least squares estimates can be too specific to the effects seen in the sample data; prediction can often be improved by introducing bias and shrinking variable effects towards zero.
  2. Interpretability - Least squares estimates are non-zero for all parameters, but if we have a large number of explanatory variables we might want the model to use only a subset that have strong effects.

The least absolute shrinkage and selection operator, or lasso, is one such alternative. The lasso naturally performs variable selection and shrinks coefficient estimates towards zero, often leading to improved out-of-sample prediction. The idea that estimated effects should be shrunken towards zero to avoid overfitting is known as regularization, a concept that can be applied to many types of models.

The lasso solution is found by minimizing \(Q(\beta) =\sum_i (y_i - x_i^T\beta)^2 + \lambda ||\beta||_1\) with respect to \(\beta\), where \(||\beta||_1 = |\beta_1| + |\beta_2| + \ldots + |\beta_p|\) is the L1 norm of the coefficient vector. \(\lambda\) is a tuning parameter, or a numeric constant that needs to be specified prior to estimation.

When \(\lambda = 0\), the lasso solution is the same as the least squares solution. As \(\lambda \rightarrow \infty\), the lasso solution is the zero vector (i.e., no variables make non-zero contributions to the model). Intermediate values of \(\lambda\) provide sparse solutions, where a subset of variables make non-zero contributions with coefficients that are shrunken towards zero (relative to their least squares estimates).

Typically we try an entire path of different \(\lambda\) values, tracking the lasso solution for each. The code below fits a lasso model on the salaries data using the glmnet package:

X = model.matrix(data = salaries[,-7], ~ -1 + .) ## Set up the matrix including dummy vars
y = salaries$salary

lasso <- glmnet(x = X, y = y)

Note:

  • Unlike lm, glmnet does not accept a formula; it requires a numeric matrix of predictors, so model.matrix is used to create the necessary dummy variables (the -1 drops the intercept column, and salaries[,-7] removes the outcome, salary).
  • By default glmnet fits the model along an entire sequence of \(\lambda\) values, which is stored in lasso$lambda.

To understand this lasso model it is useful to visualize the coefficient path:

plot(lasso, xvar = "lambda")
abline(v = 8, lty = 2)

The coefficient path displays the relationship between the penalty parameter \(\lambda\), the number of active variables, and their estimated coefficients.

Each point along the coefficient path represents a different lasso model. For illustration we use abline to add a vertical line at \(\log(\lambda) = 8\); the model characterized by this penalty contains only 3 variables. We can use the coef function to print these selections and their estimated effects:

coef(lasso, s = exp(8))
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)   91489.840
## X                 .    
## rankAssocProf     .    
## rankAsstProf  -8582.950
## rankProf      29641.312
## disciplineB    6992.899
## yrs.since.phd     .    
## yrs.service       .    
## sexMale           .

In order to use caret to compare the lasso with least squares regression we need to utilize the tuneGrid argument to specify values of tuning parameters (i.e., \(\lambda\), the lasso penalty parameter).

lams <- expand.grid(alpha = 1, lambda = lasso$lambda)
set.seed(123)
fit.lasso <- train(salary ~ ., data = salaries, method = "glmnet", trControl = fit.control, tuneGrid = lams)

set.seed(123)
fit.lm <- train(salary ~ ., data = salaries, method = "lm", trControl = fit.control)

Note:

  • In glmnet, alpha is the elastic net mixing parameter; alpha = 1 corresponds to the lasso (alpha = 0 gives ridge regression), so the grid holds alpha fixed at 1.
  • The grid of \(\lambda\) values is taken from the earlier glmnet fit; caret evaluates each value by cross-validation and retains the one with the best performance.

rs <- resamples(list(LM = fit.lm, LASSO = fit.lasso))
xyplot(rs, what = "BlandAltman", metric = "RMSE", models = c("LM", "LASSO"))

Continuing the professor salary example, we see that the lasso model provides roughly the same predictive performance as the least squares regression model.

This is not unexpected: in this example the number of predictors is relatively small compared to the sample size. As shown below, the best lasso model actually includes all of the available predictors:

coef(fit.lasso$finalModel, s = fit.lasso$bestTune$lambda)
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)    73807.42881
## X                 27.25719
## rankAsstProf  -12981.97361
## rankProf       31931.45842
## disciplineB    16663.03244
## yrs.since.phd    468.80110
## yrs.service     -447.23551
## sexMale         4016.01485
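
To see which \(\lambda\) was selected, we can inspect the tuning results stored in the caret fit (a sketch using components of the train object):

fit.lasso$bestTune   ## the alpha and lambda values chosen by cross-validation
plot(fit.lasso)      ## cross-validated RMSE across the lambda grid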

5. Generalized Additive Models

In some circumstances we might want to add flexibility to the linear regression model so that changes in the explanatory variables can correspond with non-linear changes in the outcome. Generalized Additive Models (GAMs) modify the traditional linear model to include a smoothing term for the contribution of each predictor: \[ Y = \beta_0 + f_1(X_{1}) + f_2(X_2) + \ldots + f_p(X_p) + \epsilon\] Notice that if we take \(f_j(X_j) = \beta_j X_j\) for all \(j \in \{1, \ldots, p\}\) we arrive back at the familiar multiple regression model.

To fit a GAM we must specify the type of function used to relate \(X_j\) and \(Y\). While there are many possibilities, perhaps the most common choice is splines, which are specified in the model formula using the s function:

gam1 <- gam(salary ~ s(yrs.service), data = salaries)

We can understand the spline function by plotting the relationship between the explanatory variable and the predicted outcome:

plot(gam1)

In this example we see that yrs.service has a roughly linear effect until about 20 years, after which the variable has little impact (ignoring the erratic behavior at very large values where there isn’t much data).
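
The amount of non-linearity estimated for the smooth term can also be checked numerically; mgcv's summary reports the effective degrees of freedom (edf) of each smooth (a sketch):

summary(gam1)   ## an edf near 1 indicates a nearly linear effect of yrs.service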

Would using a GAM model lead to better salary predictions? We can investigate this using caret.

fit.control <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

set.seed(123)
fit1 <- train(salary ~ yrs.service, data = salaries, method = "lm", trControl = fit.control)

set.seed(123)
fit2 <- train(salary ~ yrs.service, data = salaries, method = "gam", trControl = fit.control)

rs <- resamples(list(LM = fit1, GAM = fit2))
xyplot(rs, what = "BlandAltman", metric = "RMSE", models = c("LM", "GAM"))

As evidenced by the Bland-Altman plot, the GAM model provides superior predictions (measured by RMSE). You should also notice that we didn’t need to specify the spline term s(yrs.service) in the GAM model formula; by default caret will use splines when you use method = "gam".
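
If you want to inspect the smooth terms caret actually fit, the underlying mgcv object is stored in the finalModel component (a sketch):

summary(fit2$finalModel)   ## the underlying mgcv GAM, including its smooth terms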

6. Loose Ends

At this point we’ve seen how to implement and evaluate a handful of interpretable models using caret. With that said, this lab isn’t intended to be a comprehensive guide to modeling, a vast topic that can require years of study to master. A few of the topics that weren’t covered include:

  1. Variable selection and feature engineering
  2. Statistical inference on model parameters
  3. Interactions between explanatory variables

If you’re interested in these topics I encourage you to check out the following books written by Max Kuhn and Kjell Johnson:

  • Applied Predictive Modeling
  • Feature Engineering and Selection: A Practical Approach for Predictive Models

7. Practice

The birthwt data frame contains data on 189 infants born at Baystate Medical Center in Springfield, MA in 1986. For this application we will use the variables “age”, “lwt”, “race”, “smoke”, “ptl”, “ht”, “ui”, and “ftv” to predict “bwt”, the infant’s birth weight. We will ignore the variable “low”, which is a direct function of “bwt”. To see a description of these variables run the command ?birthwt

library(MASS)
data("birthwt")
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Question #3:

Construct a scatterplot showing the relationship between “age”, plotted on the x-axis, and “bwt”, plotted on the y-axis. Then use the lm function to fit a simple linear regression model. Interpret the coefficient estimates of this model, and write 1-2 sentences discussing whether you feel the model is appropriate given what you see in the scatterplot.

Question #4:

Use 10 repeats of 5-fold cross-validation in caret to compare the simple linear regression model used in Question #3 with a multiple regression model that includes all of the available predictors (except for “low”). Make sure that your model treats the variable “race” as a factor. Use the resamples function to decide which model provides better prediction.

Question #5:

Use caret and the resamples function to compare a lasso regression model with the multiple regression model you used in Question #4. You should use glmnet to explore an appropriate sequence of \(\lambda\) values to be given to the train function as a tuning grid. You may use the code below to set up the X and y to be used with glmnet.

X = model.matrix(data = birthwt[,-c(1,10)], ~ -1 - race + factor(race) + .)
y = birthwt$bwt

Question #6:

Print the train fit of your lasso regression model and identify the value of \(\lambda\) with the best out-of-sample performance. Print this model’s coefficients using code similar to that provided below and answer the following questions:

  1. How many variables were excluded from the model by the lasso?
  2. How do you interpret the coefficient of “lwt”?
coef(myLassoTrainFit$finalModel, s = best_lambda)

Question #7:

Fit a GAM (do not use the variable “race”), then use caret to compare its out-of-sample performance with the lasso regression model of Question #5 and Question #6 and answer the following questions:

  1. How would you describe the effect of “lwt” in the GAM? (Hint: use the plot created by the code below)
  2. Based upon your answer to the previous part, why do you think the lasso model has better out-of-sample performance?
plot.gam(myGAMTrainFit$finalModel, pages = 1)