Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
The previous lab focused on conceptually understanding the role of various combinations of predictors within a multiple regression model. This lab will focus on statistical inference and on how to choose which predictors belong in a model.
To begin, let’s review the population-level multiple linear regression model: \[y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon\]
The random errors, \(\epsilon\), make multiple linear regression a statistical model. The main aspects of statistical inference for multiple linear regression are like those we discussed with simple linear regression, with one new area of importance. To summarize:

- Hypothesis tests and confidence intervals for the individual \(\beta\) coefficients, based upon the \(t\)-distribution
- Confidence bands for the mean response, \(E(y)\), and prediction bands for individual values of \(y\)
- \(F\)-tests (Analysis of Variance) for evaluating a group of predictors, or an entire model, all at once (this is the new area of importance)

The next few sections will cover each of these areas of statistical inference in greater detail.
\(~\)
Like simple linear regression, inference on the \(\beta\) parameters of a multiple regression model is based upon the \(t\)-distribution. The summary() and confint() functions are used to conduct statistical inference on these model parameters. Recognize that you’ve already done this in the earlier labs, but here’s a quick example using the Professor Salaries dataset (introduced in Lab #5):
## Load the data and fit a multiple regression model
ps <- read.csv("https://remiller1450.github.io/data/Salaries.csv")
m <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Hypothesis tests
summary(m)
##
## Call:
## lm(formula = yrs.service ~ yrs.since.phd + rank, data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.085 -2.273 0.884 3.674 17.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87838 0.81394 -3.536 0.000454 ***
## yrs.since.phd 0.95977 0.02952 32.514 < 2e-16 ***
## rankAsstProf 0.35237 0.99087 0.356 0.722315
## rankProf -1.46815 0.84118 -1.745 0.081706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.393 on 393 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.8281
## F-statistic: 636.8 on 3 and 393 DF, p-value: < 2.2e-16
## Confidence Intervals
confint(m)
## 2.5 % 97.5 %
## (Intercept) -4.4786112 -1.2781559
## yrs.since.phd 0.9017392 1.0178090
## rankAsstProf -1.5956868 2.3004321
## rankProf -3.1219318 0.1856227
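By default, confint() produces 95% intervals; the level argument can be used to request a different confidence level. For example:
## 99% confidence intervals for the same model
confint(m, level = 0.99)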
A couple of points of emphasis:

- Each hypothesis test and confidence interval above is adjusted for the other predictors in the model; for example, the test for “yrs.since.phd” describes its association with “yrs.service” after accounting for rank.
- The categorical predictor “rank” is represented by dummy variables, so the coefficients for “rankAsstProf” and “rankProf” each compare that rank to the reference category (associate professors).
Question #1: Using the Professor Salaries dataset and the model specified above (ie: yrs.service ~ yrs.since.phd + rank), test whether the slope coefficient of “yrs.since.phd” is statistically different from 1 when predicting “yrs.service” after adjusting for rank. Interpret the results of this test. (Hint: the pt() function is used to calculate tail area probabilities for the \(t\)-distribution.)
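As a reminder of how pt() works, here is a generic illustration (made-up numbers, not the answer to this question):
## Two-sided p-value for a hypothetical t statistic of 2.5 with 393 degrees of freedom
2 * pt(2.5, df = 393, lower.tail = FALSE)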
\(~\)
Similar to what we saw with simple linear regression, confidence bands can be used to express the statistical uncertainty in \(E(y)\), the mean response (according to the specified model), while prediction bands can be used to express the statistical uncertainty in individual values of \(y\) (again according to the specified model).
Both of these intervals can be obtained using the predict() function; however, for multiple linear regression we must supply \(x\)-values for all of the predictors in the model. When the model includes a categorical predictor this can be facilitated by the expand.grid() function:
## Model under consideration
m <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Sequences of x1 and x2 values to consider (note that x2 is categorical)
x1_grid <- seq(min(ps$yrs.since.phd), max(ps$yrs.since.phd), length.out = 100)
x2_grid <- levels(as.factor(ps$rank))
## All combinations of x1 and x2
x1x2_grid <- expand.grid(yrs.since.phd = x1_grid, rank = x2_grid)
## Add confidence band columns to the grid using column binding (cbind)
grid_ci <- cbind(x1x2_grid, predict(m, newdata = x1x2_grid, interval = "confidence"))
## Add prediction band columns
grid_pi <- cbind(x1x2_grid, predict(m, newdata = x1x2_grid, interval = "prediction"))
## Plot the results
plot(ps$yrs.since.phd, ps$yrs.service)
## Add Assoc Prof CI lines
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AssocProf"], grid_ci$lwr[grid_ci$rank == "AssocProf"], col = "blue")
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AssocProf"], grid_ci$upr[grid_ci$rank == "AssocProf"], col = "blue")
## Add Ast Prof CI lines
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AsstProf"], grid_ci$lwr[grid_ci$rank == "AsstProf"], col = "red")
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AsstProf"], grid_ci$upr[grid_ci$rank == "AsstProf"], col = "red")
## Add Full Prof CI lines
lines(grid_ci$yrs.since.phd[grid_ci$rank == "Prof"], grid_ci$lwr[grid_ci$rank == "Prof"], col = "yellow")
lines(grid_ci$yrs.since.phd[grid_ci$rank == "Prof"], grid_ci$upr[grid_ci$rank == "Prof"], col = "yellow")
## Add Assoc Prof PI lines
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AssocProf"], grid_pi$lwr[grid_pi$rank == "AssocProf"], col = "blue", lty = 2)
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AssocProf"], grid_pi$upr[grid_pi$rank == "AssocProf"], col = "blue", lty = 2)
## Add Ast Prof PI lines
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AsstProf"], grid_pi$lwr[grid_pi$rank == "AsstProf"], col = "red", lty = 2)
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AsstProf"], grid_pi$upr[grid_pi$rank == "AsstProf"], col = "red", lty = 2)
## Add Full Prof PI lines
lines(grid_pi$yrs.since.phd[grid_pi$rank == "Prof"], grid_pi$lwr[grid_pi$rank == "Prof"], col = "yellow", lty = 2)
lines(grid_pi$yrs.since.phd[grid_pi$rank == "Prof"], grid_pi$upr[grid_pi$rank == "Prof"], col = "yellow", lty = 2)
## Legend
legend("topleft", legend = c("Assoc", "Asst", "Full", "Prediction"), col = c("blue","red","yellow","black"), lty = c(1,1,1,2), bty = "n")
Note: The graph above is probably too busy to use in a presentation or publication; nevertheless, it is a good demonstration of how to use the predict() function with a multiple regression model.
Question #2: Using the Professor Salaries dataset, fit the model salary ~ sex + yrs.since.phd. Then create a scatterplot displaying the variables “yrs.since.phd” and “salary” with separate 95% prediction bands for males and females.
\(~\)
Because multiple regression models contain several different slopes, a \(t\)-test on a single slope parameter is no longer sufficient for evaluating certain associations.
As an example, a categorical predictor might involve several dummy variables, each with its own slope coefficient comparing that group to the reference category. However, what if you’d like to test for an overall association between the categorical predictor and the outcome variable? This requires a fundamentally different approach than testing for differences relative to the reference category.
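To see the dummy variables R creates for a categorical predictor, you can inspect the model matrix (a quick sketch using the Professor Salaries data loaded earlier; “AssocProf” is the reference category here):
## First few rows of the dummy coding used for "rank"
head(model.matrix(~ rank, data = ps))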
Analysis of Variance, using an \(F\)-test, is a method of comparing a model against a smaller, nested sub-model. This has two key uses:

- Testing whether a group of predictors (such as the dummy variables created from a single categorical predictor) is associated with the outcome after adjusting for the other variables in the model.
- Testing the overall utility of a model by comparing it against an intercept-only model.
The \(F\)-test statistic (given below) can be viewed as the standardized decrease in the sum of squared errors that occurs when a group of predictors is added to a smaller model:
\[ F = \frac{(SSE_{small} - SSE)/\delta_k}{SSE/(n - (k + 1))}\]
Here \(SSE_{small}\) is the sum of squared errors of the smaller, nested model, \(SSE\) is the sum of squared errors of the larger model (which contains \(k\) predictors), and \(\delta_k\) is the number of predictors that were added. In the special case where the smaller model contains only an intercept, \(SSE_{small}\) equals the total sum of squares, \(SS_{yy}\), and the test evaluates the overall utility of the larger model.
We can perform the \(F\)-test in R using the anova() function. The example below uses ANOVA to test whether “rank” is associated with “yrs.service” after adjusting for “yrs.since.phd”:
## Larger model of interest
m1 <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Smaller, nested submodel (ie: omit the variable(s) you're testing)
m2 <- lm(yrs.service ~ yrs.since.phd, data = ps)
## ANOVA Table
anova(m1,m2)
## Analysis of Variance Table
##
## Model 1: yrs.service ~ yrs.since.phd + rank
## Model 2: yrs.service ~ yrs.since.phd
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 393 11430
## 2 395 11558 -2 -128.07 2.2018 0.112
Based upon the \(p\)-value of this test (0.112), we conclude there is, at best, borderline statistical evidence that “rank” is associated with “yrs.service” after adjusting for “yrs.since.phd” (the association is not significant at the 0.05 level). This should make sense, since most of the variability in “yrs.service” can already be explained by the variable “yrs.since.phd”.
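As a quick check, the \(F\) statistic and \(p\)-value in this ANOVA table can be reproduced by hand from the printed residual sums of squares (the results match up to rounding):
## F = (drop in RSS / number of added predictors) / (RSS of larger model / its residual df)
F_stat <- ((11558 - 11430) / 2) / (11430 / 393)
F_stat
## Upper tail area of the F distribution with 2 and 393 degrees of freedom
pf(F_stat, df1 = 2, df2 = 393, lower.tail = FALSE)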
We could also use the anova() function to evaluate the overall utility of this model:
## Larger model of interest
m1 <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Smaller, nested submodel (1 is used to indicate an intercept only)
m2 <- lm(yrs.service ~ 1, data = ps)
## ANOVA Table
anova(m1,m2)
## Analysis of Variance Table
##
## Model 1: yrs.service ~ yrs.since.phd + rank
## Model 2: yrs.service ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 393 11430
## 2 396 66986 -3 -55556 636.76 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unsurprisingly, the model that involves “yrs.since.phd” and “rank” is highly predictive of “yrs.service”. Further, you might have noticed that this same \(F\)-test appears at the bottom of the output produced by the summary() function (applied to the larger model of interest):
summary(m1)
##
## Call:
## lm(formula = yrs.service ~ yrs.since.phd + rank, data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.085 -2.273 0.884 3.674 17.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87838 0.81394 -3.536 0.000454 ***
## yrs.since.phd 0.95977 0.02952 32.514 < 2e-16 ***
## rankAsstProf 0.35237 0.99087 0.356 0.722315
## rankProf -1.46815 0.84118 -1.745 0.081706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.393 on 393 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.8281
## F-statistic: 636.8 on 3 and 393 DF, p-value: < 2.2e-16
Question #4: For the Professor Salaries data, use an \(F\)-test to evaluate the importance of the variable “rank” in a model that uses “rank”, “yrs.since.phd”, and “discipline” to predict “salary”. Provide a one sentence conclusion summarizing what the test tells you about the role of “rank” in the model.
Question #5: Perform an overall \(F\)-test of the model that uses “rank”, “yrs.since.phd”, and “discipline” to predict “salary”. Provide a one sentence conclusion summarizing the result of this test regarding the overall efficacy of this model.
\(~\)
It is rare for all of the variables in a dataset to actually be useful in predicting an outcome. And as data become increasingly easy to collect, we should only expect the number of non-predictive variables we encounter to increase.
Hypothesis testing provides the data analyst one method for identifying and removing non-predictive variables from a model. However, using hypothesis tests to find the optimal model is a very challenging task. As we’ve already seen, the hypothesis testing results within a given model are adjusted for everything else in that model.
So, for example, just because “sex” is a significant predictor of “salary” by itself (in the model salary ~ sex) doesn’t mean it will be a significant predictor in the model salary ~ sex + rank. Similarly, even if “sex” is not a significant predictor in the model salary ~ sex + rank, it might be a significant predictor in the model salary ~ sex + rank + discipline.
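You can verify this kind of behavior directly by comparing the coefficient table for “sex” with and without the adjustment (a quick check; the specific conclusions depend on the fitted results):
## The t-test for "sex" changes once other predictors are adjusted for
summary(lm(salary ~ sex, data = ps))$coefficients
summary(lm(salary ~ sex + rank, data = ps))$coefficients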
Because of these challenges, statisticians are very deliberate in how they select variables, generally relying on systematic model selection procedures like the ones introduced in the remaining sections of this lab.
\(~\)
Stepwise model selection algorithms are one application of the \(F\)-test. Broadly speaking, these algorithms fall into one of three categories:

- Forward selection, which starts from an intercept-only model and adds predictors one at a time
- Backward elimination, which starts from a model containing all candidate predictors and removes them one at a time
- Stepwise selection, which considers both adding and removing predictors at each step
As you might expect, stepwise algorithms tend to be the most popular because they consider the most expansive set of possible models.
While the \(F\)-test is used by many software programs (such as Minitab or SPSS) to perform stepwise selection, the step() and stepAIC() functions in R use a model selection criterion known as AIC instead. So before we see any examples of stepwise selection, we’ll first need to introduce AIC.
Question #6: In general, Forward Stepwise Selection is more popular for data containing many predictors (ie: large \(p\)), while Backward Stepwise Elimination is more popular for data containing relatively few predictors. For this question, write 1-2 sentences explaining why you think this is.
\(~\)
Model selection criteria are calculable quantities that enable an easy comparison of two or more models. Akaike’s Information Criterion (AIC) is perhaps the most popular criterion among statisticians.
For a given model, \(AIC = -2\,\text{Log-Likelihood} + 2k\), where the log-likelihood measures goodness of fit (we won’t cover the details in this class) and \(k\) is the number of estimated parameters in the model.
A lower \(AIC\) value indicates a superior model. There are two possible causes of a high \(AIC\) value (indicating an inferior model): either the model includes too many parameters relative to the improvement in fit they provide (large \(k\)), or it fits the data poorly (small log-likelihood).
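To see this definition in action, you can compare R’s built-in AIC() function against the formula above (a small sketch; note that R’s parameter count includes the estimated error variance):
## Fit an example model and extract its log-likelihood
m_ex <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
ll <- logLik(m_ex)
## AIC computed by R
AIC(m_ex)
## AIC computed from the formula: -2*(log-likelihood) + 2*(number of estimated parameters)
-2 * as.numeric(ll) + 2 * attr(ll, "df")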
In R, \(AIC\) is used as the basis for stepwise selection:
## stepAIC() is contained in the MASS package (installed with R by default)
library(MASS)
## Stepwise selection in both directions, starting from a model containing all predictors
ps_initial <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = ps)
stepAIC(ps_initial,
        scope = list(upper = salary ~ rank + discipline + yrs.since.phd + yrs.service + sex,
                     lower = salary ~ 1),
        direction = "both")
## Start: AIC=7965.19
## salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
##
## Df Sum of Sq RSS AIC
## - sex 1 7.8068e+08 1.9890e+11 7964.8
## <none> 1.9812e+11 7965.2
## - yrs.since.phd 1 2.5041e+09 2.0062e+11 7968.2
## - yrs.service 1 2.7100e+09 2.0083e+11 7968.6
## - discipline 1 1.9237e+10 2.1735e+11 8000.0
## - rank 2 6.9508e+10 2.6762e+11 8080.6
##
## Step: AIC=7964.75
## salary ~ rank + discipline + yrs.since.phd + yrs.service
##
## Df Sum of Sq RSS AIC
## <none> 1.9890e+11 7964.8
## + sex 1 7.8068e+08 1.9812e+11 7965.2
## - yrs.since.phd 1 2.5001e+09 2.0140e+11 7967.7
## - yrs.service 1 2.5763e+09 2.0147e+11 7967.9
## - discipline 1 1.9489e+10 2.1839e+11 7999.9
## - rank 2 7.0679e+10 2.6958e+11 8081.5
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
## data = ps)
##
## Coefficients:
## (Intercept) rankAsstProf rankProf disciplineB yrs.since.phd
## 82700.5 -12831.5 32456.2 14505.2 534.6
## yrs.service
## -476.7
In this example, the only “step” was to eliminate the variable “sex” from the model. The final result was the model salary ~ rank + discipline + yrs.since.phd + yrs.service.
Other commonly used selection criteria include the Bayesian Information Criterion (BIC), which replaces AIC’s penalty of 2 per parameter with \(\log(n)\), and Adjusted \(R^2\), which penalizes \(R^2\) for the number of predictors in the model; both appear later in this lab.
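Base R provides the BIC() function alongside AIC(), so you can compare candidate models directly with either criterion (a quick sketch; lower values are better for both):
## Compare two candidate models using AIC and BIC
m_small <- lm(salary ~ rank + discipline, data = ps)
m_full  <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = ps)
AIC(m_small, m_full)
BIC(m_small, m_full)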
Question #7: Using the code provided above as a starting point, perform forward stepwise selection using \(AIC\) as the model selection criterion to find the model that best predicts “salary” in the Professor Salaries dataset. Report the final model.
Question #8: Using the code provided above as a starting point, perform backward stepwise selection using \(BIC\) as the model selection criterion to find the model that best predicts “salary” in the Professor Salaries dataset. You can add the argument k = log(nrow(ps)) to ensure that stepAIC() calculates \(BIC\) rather than \(AIC\) (which uses \(k = 2\) in its penalty term). Report your final model.
Question #9: Briefly comment upon the different models chosen by forward versus backward stepwise algorithms, as well as \(AIC\) versus \(BIC\). You should write 2-3 sentences comparing and contrasting the size (ie: number of predictors) of these models.
\(~\)
When the number of predictors is relatively small, it’s sometimes possible to exhaustively compare all possible models using a model selection criterion. In R, the regsubsets() function (located in the leaps package) can be used to perform best subsets regression, an exhaustive approach that will return the best model of each size (ie: each number of predictors).
The code below demonstrates how to use regsubsets():
## Load leaps package
library(leaps)
## Specify the largest model (formula) and the dataset
bsubs <- regsubsets(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = ps)
## Extract summary information
bsubs_summary <- summary(bsubs)
## Assemble the BIC/Adjusted R2 for the best model of each size
data.frame(bsubs_summary$which, BIC = bsubs_summary$bic, Adj_R2 = bsubs_summary$adjr2)
## X.Intercept. rankAsstProf rankProf disciplineB yrs.since.phd yrs.service
## 1 TRUE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE TRUE TRUE TRUE FALSE FALSE
## 4 TRUE TRUE TRUE TRUE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE
## sexMale BIC Adj_R2
## 1 FALSE -177.0373 0.3772158
## 2 FALSE -203.7709 0.4250270
## 3 FALSE -209.7988 0.4407437
## 4 TRUE -205.1838 0.4412470
## 5 FALSE -203.2665 0.4455269
## 6 TRUE -198.8438 0.4462870
In this example, the best model according to \(BIC\) is the three-predictor model containing the two rank dummy variables and the discipline indicator (ie: salary ~ rank + discipline), since it has the lowest \(BIC\) value in the table. Notice also that the best single-predictor model uses only the dummy variable that separates rank = Prof from the lower ranks. In contrast, Adjusted \(R^2\) favors the largest model, which includes all of the available predictors.
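If you’d rather extract the winning model programmatically than read it off the table, you can index the summary object (a small sketch; coef() applied to a regsubsets object returns the coefficients of the model with the requested id):
## Model size with the lowest BIC, and the coefficients of that model
best_size <- which.min(bsubs_summary$bic)
best_size
coef(bsubs, id = best_size)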
Note: Because the different model selection criteria penalize solely based upon model size, there is no need to specify which criterion you are interested in when running regsubsets() (the best model of each size depends only on the goodness-of-fit component of the selection criterion).
Due to their convenience and quick computation, model selection criteria are widely used by statisticians. However, the goodness-of-fit component is based upon the type of model you are using. The implication is that you cannot use something like \(AIC\) to compare a multiple linear regression model and a \(k\)-nearest neighbors model. For this task, you’ll need to use cross-validation (covered in Lab #4).
Cross-validation also has the advantage of providing an unbiased estimate of out-of-sample model performance, something none of the aforementioned methods offer.
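For reference, here is one minimal way to carry out 5-fold cross-validation for a single model (a sketch only; your Lab #4 template may differ, and the seed and fold assignment below are arbitrary choices):
## 5-fold cross-validation RMSE for a single candidate model
set.seed(123)                                      # reproducible fold assignment
folds <- sample(rep(1:5, length.out = nrow(ps)))   # randomly assign each row to a fold
rmse <- numeric(5)
for (f in 1:5) {
  train <- ps[folds != f, ]                        # training data (all folds except f)
  test  <- ps[folds == f, ]                        # held-out fold
  fit   <- lm(salary ~ rank + discipline, data = train)
  preds <- predict(fit, newdata = test)
  rmse[f] <- sqrt(mean((test$salary - preds)^2))   # out-of-sample RMSE for fold f
}
mean(rmse)                                         # average RMSE across the 5 folds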
Question #10: For the Professor Salaries dataset, use 5-fold cross-validation to compare the out-of-sample RMSE of the four models listed below (Hint: revisit Lab #4 if you need a template for how to implement cross-validation in R):

- salary ~ rank + discipline + yrs.since.phd + yrs.service (the model chosen by backward stepwise regression using \(AIC\))
- salary ~ rank + discipline (the model chosen by forward stepwise regression using \(AIC\))
- salary ~ rank==Prof (the best single-predictor model from best subsets regression)
- salary ~ rank + discipline + yrs.since.phd + yrs.service + sex (the model favored by Adjusted \(R^2\) in best subsets regression)