Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
The previous lab focused on conceptually understanding the role of various combinations of predictors within a multiple regression model. This lab will focus on statistical inference and on how to choose which predictors belong in a model.
To begin, let’s review the population-level multiple linear regression model: \[y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon\]
The random errors, \(\epsilon\), make multiple linear regression a statistical model. The main aspects of statistical inference for multiple linear regression are like those we discussed with simple linear regression, with one new area of importance. To summarize:

- Hypothesis tests and confidence intervals for the individual \(\beta\) coefficients, based upon the \(t\)-distribution
- Confidence bands for the mean response, \(E(y)\), and prediction bands for individual values of \(y\)
- \(F\)-tests (Analysis of Variance) for evaluating a group of predictors, or an entire model, all at once (this is the new area of importance)

The next few sections will cover each of these areas of statistical inference in greater detail.
\(~\)
Like simple linear regression, inference on the \(\beta\) parameters of a multiple regression model is based upon the \(t\)-distribution. The summary() and confint() functions are used to conduct statistical inference on these model parameters. Recognize that you’ve already done this in the earlier labs, but here’s a quick example using the Professor Salaries dataset (introduced in Lab #5):
## Load the data and fit a multiple regression model
ps <- read.csv("https://remiller1450.github.io/data/Salaries.csv")
m <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Hypothesis tests
summary(m)
##
## Call:
## lm(formula = yrs.service ~ yrs.since.phd + rank, data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.085 -2.273 0.884 3.674 17.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87838 0.81394 -3.536 0.000454 ***
## yrs.since.phd 0.95977 0.02952 32.514 < 2e-16 ***
## rankAsstProf 0.35237 0.99087 0.356 0.722315
## rankProf -1.46815 0.84118 -1.745 0.081706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.393 on 393 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.8281
## F-statistic: 636.8 on 3 and 393 DF, p-value: < 2.2e-16
## Confidence Intervals
confint(m)
## 2.5 % 97.5 %
## (Intercept) -4.4786112 -1.2781559
## yrs.since.phd 0.9017392 1.0178090
## rankAsstProf -1.5956868 2.3004321
## rankProf -3.1219318 0.1856227
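By default, confint() produces 95% intervals; the level argument can be used to request a different confidence level. For example:
## 99% confidence intervals for the same model
confint(m, level = 0.99)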
A couple of points of emphasis:

- Each hypothesis test and confidence interval above is adjusted for the other predictors in the model; for example, the test for “yrs.since.phd” describes its association with “yrs.service” after accounting for rank.
- The categorical predictor “rank” is represented by dummy variables, so the coefficients for “rankAsstProf” and “rankProf” each compare that rank to the reference category (associate professors).
Question #1: Using the Professor Salaries dataset and the model specified above (ie: yrs.service ~ yrs.since.phd + rank), test whether the slope coefficient of “yrs.since.phd” is statistically different from 1 when predicting “yrs.service” after adjusting for rank. Interpret the results of this test. (Hint: the pt() function is used to calculate tail area probabilities for the \(t\)-distribution.)
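As a reminder of how pt() works, here is a generic illustration (made-up numbers, not the answer to this question):
## Two-sided p-value for a hypothetical t statistic of 2.5 with 393 degrees of freedom
2 * pt(2.5, df = 393, lower.tail = FALSE)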
\(~\)
Similar to what we saw with simple linear regression, confidence bands can be used to express the statistical uncertainty in \(E(y)\), the mean response (according to the specified model), while prediction bands can be used to express the statistical uncertainty in individual values of \(y\) (again according to the specified model).
Both of these intervals can be obtained using the predict() function; however, for multiple linear regression we must supply \(x\)-values for all of the predictors in the model. When the model includes a categorical predictor this can be facilitated by the expand.grid() function:
## Model under consideration
m <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Sequences of x1 and x2 values to consider (note that x2 is categorical)
x1_grid <- seq(min(ps$yrs.since.phd), max(ps$yrs.since.phd), length.out = 100)
x2_grid <- levels(as.factor(ps$rank))
## All combinations of x1 and x2
x1x2_grid <- expand.grid(yrs.since.phd = x1_grid, rank = x2_grid)
## Add confidence band columns to the grid using column binding (cbind)
grid_ci <- cbind(x1x2_grid, predict(m, newdata = x1x2_grid, interval = "confidence"))
## Add prediction band columns
grid_pi <- cbind(x1x2_grid, predict(m, newdata = x1x2_grid, interval = "prediction"))
## Plot the results
plot(ps$yrs.since.phd, ps$yrs.service)
## Add Assoc Prof CI lines
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AssocProf"], grid_ci$lwr[grid_ci$rank == "AssocProf"], col = "blue")
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AssocProf"], grid_ci$upr[grid_ci$rank == "AssocProf"], col = "blue")
## Add Ast Prof CI lines
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AsstProf"], grid_ci$lwr[grid_ci$rank == "AsstProf"], col = "red")
lines(grid_ci$yrs.since.phd[grid_ci$rank == "AsstProf"], grid_ci$upr[grid_ci$rank == "AsstProf"], col = "red")
## Add Full Prof CI lines
lines(grid_ci$yrs.since.phd[grid_ci$rank == "Prof"], grid_ci$lwr[grid_ci$rank == "Prof"], col = "yellow")
lines(grid_ci$yrs.since.phd[grid_ci$rank == "Prof"], grid_ci$upr[grid_ci$rank == "Prof"], col = "yellow")
## Add Assoc Prof PI lines
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AssocProf"], grid_pi$lwr[grid_pi$rank == "AssocProf"], col = "blue", lty = 2)
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AssocProf"], grid_pi$upr[grid_pi$rank == "AssocProf"], col = "blue", lty = 2)
## Add Ast Prof PI lines
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AsstProf"], grid_pi$lwr[grid_pi$rank == "AsstProf"], col = "red", lty = 2)
lines(grid_pi$yrs.since.phd[grid_pi$rank == "AsstProf"], grid_pi$upr[grid_pi$rank == "AsstProf"], col = "red", lty = 2)
## Add Full Prof PI lines
lines(grid_pi$yrs.since.phd[grid_pi$rank == "Prof"], grid_pi$lwr[grid_pi$rank == "Prof"], col = "yellow", lty = 2)
lines(grid_pi$yrs.since.phd[grid_pi$rank == "Prof"], grid_pi$upr[grid_pi$rank == "Prof"], col = "yellow", lty = 2)
## Legend
legend("topleft", legend = c("Assoc", "Asst", "Full", "Prediction"), col = c("blue","red","yellow","black"), lty = c(1,1,1,2), bty = "n")
Note: The graph above is probably too busy to use in a presentation or publication; nevertheless, it is a good demonstration of how to use the predict() function with a multiple regression model.
Question #2: Using the Professor Salaries dataset, fit the model salary ~ sex + yrs.since.phd. Then create a scatterplot displaying the variables “yrs.since.phd” and “salary” with separate 95% prediction bands for males and females.
\(~\)
Because multiple regression models contain several different slopes, a \(t\)-test on a single slope parameter is no longer sufficient for evaluating certain associations.
As an example, a categorical predictor might involve several dummy variables, each with its own slope coefficient comparing that group to the reference category. However, what if you’d like to test for an overall association between the categorical predictor and the outcome variable? This requires a fundamentally different approach than testing for differences relative to the reference category.
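To see the dummy variables R creates for a categorical predictor, you can inspect the model matrix (a quick sketch using the Professor Salaries data loaded earlier; “AssocProf” is the reference category here):
## First few rows of the dummy coding used for "rank"
head(model.matrix(~ rank, data = ps))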
Analysis of Variance, using an \(F\)-test, is a method of comparing a model against a smaller, nested sub-model. This has two key uses:

- Testing whether a group of predictors (such as the dummy variables created from a single categorical predictor) is associated with the outcome after adjusting for the other variables in the model.
- Testing the overall utility of a model by comparing it against an intercept-only model.
The \(F\)-test statistic (given below) can be viewed as the standardized decrease in the sum of squared errors that occurs when a group of predictors is added to a smaller model:
\[ F = \frac{(SSE_{small} - SSE)/\delta_k}{SSE/(n - (k + 1))}\]
Here \(SSE_{small}\) is the sum of squared errors of the smaller, nested model, \(SSE\) is the sum of squared errors of the larger model (which contains \(k\) predictors), and \(\delta_k\) is the number of predictors that were added. In the special case where the smaller model contains only an intercept, \(SSE_{small}\) equals the total sum of squares, \(SS_{yy}\), and the test evaluates the overall utility of the larger model.
We can perform the \(F\)-test in R using the anova() function. The example below uses ANOVA to test whether “rank” is associated with “yrs.service” after adjusting for “yrs.since.phd”:
## Larger model of interest
m1 <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Smaller, nested submodel (ie: omit the variable(s) you're testing)
m2 <- lm(yrs.service ~ yrs.since.phd, data = ps)
## ANOVA Table
anova(m1,m2)
## Analysis of Variance Table
##
## Model 1: yrs.service ~ yrs.since.phd + rank
## Model 2: yrs.service ~ yrs.since.phd
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 393 11430
## 2 395 11558 -2 -128.07 2.2018 0.112
Based upon the \(p\)-value of this test (0.112), we conclude there is, at best, borderline statistical evidence that “rank” is associated with “yrs.service” after adjusting for “yrs.since.phd” (the association is not significant at the 0.05 level). This should make sense, since most of the variability in “yrs.service” can already be explained by the variable “yrs.since.phd”.
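As a quick check, the \(F\) statistic and \(p\)-value in this ANOVA table can be reproduced by hand from the printed residual sums of squares (the results match up to rounding):
## F = (drop in RSS / number of added predictors) / (RSS of larger model / its residual df)
F_stat <- ((11558 - 11430) / 2) / (11430 / 393)
F_stat
## Upper tail area of the F distribution with 2 and 393 degrees of freedom
pf(F_stat, df1 = 2, df2 = 393, lower.tail = FALSE)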
We could also use the anova() function to evaluate the overall utility of this model:
## Larger model of interest
m1 <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
## Smaller, nested submodel (1 is used to indicate an intercept only)
m2 <- lm(yrs.service ~ 1, data = ps)
## ANOVA Table
anova(m1,m2)
## Analysis of Variance Table
##
## Model 1: yrs.service ~ yrs.since.phd + rank
## Model 2: yrs.service ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 393 11430
## 2 396 66986 -3 -55556 636.76 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unsurprisingly, the model that involves “yrs.since.phd” and “rank” is highly predictive of “yrs.service”. Further, you might have noticed that this same \(F\)-test appears at the bottom of the output produced by the summary() function (applied to the larger model of interest):
summary(m1)
##
## Call:
## lm(formula = yrs.service ~ yrs.since.phd + rank, data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.085 -2.273 0.884 3.674 17.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87838 0.81394 -3.536 0.000454 ***
## yrs.since.phd 0.95977 0.02952 32.514 < 2e-16 ***
## rankAsstProf 0.35237 0.99087 0.356 0.722315
## rankProf -1.46815 0.84118 -1.745 0.081706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.393 on 393 degrees of freedom
## Multiple R-squared: 0.8294, Adjusted R-squared: 0.8281
## F-statistic: 636.8 on 3 and 393 DF, p-value: < 2.2e-16
Question #4: For the Professor Salaries data, use an \(F\)-test to evaluate the importance of the variable “rank” in a model that uses “rank”, “yrs.since.phd”, and “discipline” to predict “salary”. Provide a one sentence conclusion summarizing what the test tells you about the role of “rank” in the model.
Question #5: Perform an overall \(F\)-test of the model that uses “rank”, “yrs.since.phd”, and “discipline” to predict “salary”. Provide a one sentence conclusion summarizing the result of this test regarding the overall efficacy of this model.
\(~\)
It is rare for all of the variables in a dataset to actually be useful in predicting an outcome. And as data become increasingly easy to collect, we should only expect the number of non-predictive variables we encounter to increase.
Hypothesis testing provides the data analyst one method for identifying and removing non-predictive variables from a model. However, using hypothesis tests to find the optimal model is a very challenging task. As we’ve already seen, the hypothesis testing results within a given model are adjusted for everything else in that model.
So, for example, just because “sex” is a significant predictor of “salary” by itself (in the model salary ~ sex) doesn’t mean it will be a significant predictor in the model salary ~ sex + rank. Similarly, even if “sex” is not a significant predictor in the model salary ~ sex + rank, it might be a significant predictor in the model salary ~ sex + rank + discipline.
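You can verify this kind of behavior directly by comparing the coefficient table for “sex” with and without the adjustment (a quick check; the specific conclusions depend on the fitted results):
## The t-test for "sex" changes once other predictors are adjusted for
summary(lm(salary ~ sex, data = ps))$coefficients
summary(lm(salary ~ sex + rank, data = ps))$coefficients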
Because of these challenges, statisticians are very deliberate in how they select variables, generally relying on systematic model selection procedures like the ones introduced in the remaining sections of this lab.
\(~\)
Stepwise model selection algorithms are one application of the \(F\)-test. Broadly speaking, these algorithms fall into one of three categories:

- Forward selection, which starts from an intercept-only model and adds predictors one at a time
- Backward elimination, which starts from a model containing all candidate predictors and removes them one at a time
- Stepwise selection, which considers both adding and removing predictors at each step
As you might expect, stepwise algorithms tend to be the most popular because they consider the most expansive set of possible models.
While the \(F\)-test is used by many software programs (such as Minitab or SPSS) to perform stepwise selection, the step() and stepAIC() functions in R use a model selection criterion known as AIC instead. So before we see any examples of stepwise selection, we’ll first need to introduce AIC.
Question #6: In general, Forward Stepwise Selection is more popular for data containing many predictors (ie: large \(p\)), while Backward Stepwise Elimination is more popular for data containing relatively few predictors. For this question, write 1-2 sentences explaining why you think this is.
\(~\)
Model selection criteria are calculable quantities that enable an easy comparison of two or more models. Akaike’s Information Criterion (AIC) is perhaps the most popular criterion among statisticians.
For a given model, \(AIC = -2\,\text{Log-Likelihood} + 2k\), where the log-likelihood measures goodness of fit (we won’t cover the details in this class) and \(k\) is the number of estimated parameters in the model.
A lower \(AIC\) value indicates a superior model. There are two possible causes of a high \(AIC\) value (indicating an inferior model): either the model includes too many parameters relative to the improvement in fit they provide (large \(k\)), or it fits the data poorly (small log-likelihood).
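To see this definition in action, you can compare R’s built-in AIC() function against the formula above (a small sketch; note that R’s parameter count includes the estimated error variance):
## Fit an example model and extract its log-likelihood
m_ex <- lm(yrs.service ~ yrs.since.phd + rank, data = ps)
ll <- logLik(m_ex)
## AIC computed by R
AIC(m_ex)
## AIC computed from the formula: -2*(log-likelihood) + 2*(number of estimated parameters)
-2 * as.numeric(ll) + 2 * attr(ll, "df")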
In R, \(AIC\) is used as the basis for stepwise selection:
## stepAIC() is contained in the MASS package (installed with R by default)
library(MASS)
## Stepwise selection in both directions, starting from a model containing all predictors
ps_initial <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = ps)
stepAIC(ps_initial,
        scope = list(upper = salary ~ rank + discipline + yrs.since.phd + yrs.service + sex,
                     lower = salary ~ 1),
        direction = "both")
## Start: AIC=7965.19
## salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
##
## Df Sum of Sq RSS AIC
## - sex 1 7.8068e+08 1.9890e+11 7964.8
## <none> 1.9812e+11 7965.2
## - yrs.since.phd 1 2.5041e+09 2.0062e+11 7968.2
## - yrs.service 1 2.7100e+09 2.0083e+11 7968.6
## - discipline 1 1.9237e+10 2.1735e+11 8000.0
## - rank 2 6.9508e+10 2.6762e+11 8080.6
##
## Step: AIC=7964.75
## salary ~ rank + discipline + yrs.since.phd + yrs.service
##
## Df Sum of Sq RSS AIC
## <none> 1.9890e+11 7964.8
## + sex 1 7.8068e+08 1.9812e+11 7965.2
## - yrs.since.phd 1 2.5001e+09 2.0140e+11 7967.7
## - yrs.service 1 2.5763e+09 2.0147e+11 7967.9
## - discipline 1 1.9489e+10 2.1839e+11 7999.9
## - rank 2 7.0679e+10 2.6958e+11 8081.5
##
## Call:
## lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
## data = ps)
##
## Coefficients:
## (Intercept) rankAsstProf rankProf disciplineB yrs.since.phd
## 82700.5 -12831.5 32456.2 14505.2 534.6
## yrs.service
## -476.7
In this example, the only “step” was to eliminate the variable “sex” from the model. The final result was the model salary ~ rank + discipline + yrs.since.phd + yrs.service.
Other commonly used selection criteria include the Bayesian Information Criterion (BIC), which replaces AIC’s penalty of 2 per parameter with \(\log(n)\), and Adjusted \(R^2\), which penalizes \(R^2\) for the number of predictors in the model; both appear later in this lab.
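Base R provides the BIC() function alongside AIC(), so you can compare candidate models directly with either criterion (a quick sketch; lower values are better for both):
## Compare two candidate models using AIC and BIC
m_small <- lm(salary ~ rank + discipline, data = ps)
m_full  <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = ps)
AIC(m_small, m_full)
BIC(m_small, m_full)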
Question #7: Using the code provided above as a starting point, perform forward stepwise selection using \(AIC\) as the model selection criterion to find the model that best predicts “salary” in the Professor Salaries dataset. Report the final model.
Question #8: Using the code provided above as a starting point, perform backward stepwise selection using \(BIC\) as the model selection criterion to find the model that best predicts “salary” in the Professor Salaries dataset. You can add the argument k = log(nrow(ps)) to ensure that stepAIC() calculates \(BIC\) rather than \(AIC\) (which uses \(k = 2\) in its penalty term). Report your final model.
Question #9: Briefly comment upon the different models chosen by forward versus backward stepwise algorithms, as well as \(AIC\) versus \(BIC\). You should write 2-3 sentences comparing and contrasting the size (ie: number of predictors) of these models.
\(~\)
When the number of predictors is relatively small, it’s sometimes possible to exhaustively compare all possible models using a model selection criterion. In R, the regsubsets() function (located in the leaps package) can be used to perform best subsets regression, an exhaustive approach that will return the best model of each size (ie: each number of predictors).
The code below demonstrates how to use regsubsets():
## Load leaps package
library(leaps)
## Specify the largest model (formula) and the dataset
bsubs <- regsubsets(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = ps)
## Extract summary information
bsubs_summary <- summary(bsubs)
## Assemble the BIC/Adjusted R2 for the best model of each size
data.frame(bsubs_summary$which, BIC = bsubs_summary$bic, Adj_R2 = bsubs_summary$adjr2)
## X.Intercept. rankAsstProf rankProf disciplineB yrs.since.phd yrs.service
## 1 TRUE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE TRUE TRUE TRUE FALSE FALSE
## 4 TRUE TRUE TRUE TRUE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE
## sexMale BIC Adj_R2
## 1 FALSE -177.0373 0.3772158
## 2 FALSE -203.7709 0.4250270
## 3 FALSE -209.7988 0.4407437
## 4 TRUE -205.1838 0.4412470
## 5 FALSE -203.2665 0.4455269
## 6 TRUE -198.8438 0.4462870
In this example, the best model according to \(BIC\) is the three-predictor model containing the two rank dummy variables and the discipline indicator (ie: salary ~ rank + discipline), since it has the lowest \(BIC\) value in the table. Notice also that the best single-predictor model uses only the dummy variable that separates rank = Prof from the lower ranks. In contrast, Adjusted \(R^2\) favors the largest model, which includes all of the available predictors.
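If you’d rather extract the winning model programmatically than read it off the table, you can index the summary object (a small sketch; coef() applied to a regsubsets object returns the coefficients of the model with the requested id):
## Model size with the lowest BIC, and the coefficients of that model
best_size <- which.min(bsubs_summary$bic)
best_size
coef(bsubs, id = best_size)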
Note: Because the different model selection criteria penalize solely based upon model size, there is no need to specify which criterion you are interested in when running regsubsets() (the best model of each size depends only on the goodness-of-fit component of the selection criterion).
Due to their convenience and quick computation, model selection criteria are widely used by statisticians. However, the goodness-of-fit component is based upon the type of model you are using. The implication is that you cannot use something like \(AIC\) to compare a multiple linear regression model and a \(k\)-nearest neighbors model. For this task, you’ll need to use cross-validation (covered in Lab #4).
Cross-validation also has the advantage of providing an unbiased estimate of out-of-sample model performance, something none of the aforementioned methods offer.
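For reference, here is one minimal way to carry out 5-fold cross-validation for a single model (a sketch only; your Lab #4 template may differ, and the seed and fold assignment below are arbitrary choices):
## 5-fold cross-validation RMSE for a single candidate model
set.seed(123)                                      # reproducible fold assignment
folds <- sample(rep(1:5, length.out = nrow(ps)))   # randomly assign each row to a fold
rmse <- numeric(5)
for (f in 1:5) {
  train <- ps[folds != f, ]                        # training data (all folds except f)
  test  <- ps[folds == f, ]                        # held-out fold
  fit   <- lm(salary ~ rank + discipline, data = train)
  preds <- predict(fit, newdata = test)
  rmse[f] <- sqrt(mean((test$salary - preds)^2))   # out-of-sample RMSE for fold f
}
mean(rmse)                                         # average RMSE across the 5 folds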
Question #10: For the Professor Salaries dataset, use 5-fold cross-validation to compare the out-of-sample RMSE of the four models listed below (Hint: revisit Lab #4 if you need a template for how to implement cross-validation in R):

- salary ~ rank + discipline + yrs.since.phd + yrs.service (the model chosen by backward stepwise regression using \(AIC\))
- salary ~ rank + discipline (the model chosen by forward stepwise regression using \(AIC\))
- salary ~ rank==Prof (the best single-predictor model from best subsets regression)
- salary ~ rank + discipline + yrs.since.phd + yrs.service + sex (the model favored by Adjusted \(R^2\) in best subsets regression)