Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a "divide and conquer" strategy to rush through the lab's questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn-in this code.
\(~\)
Over the past few weeks, we’ve been learning about multiple linear regression models. As a reminder, the general form of a population-level multiple linear regression model is written below: \[y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon\]
The focus of this lab will be non-linear effects, meaning models structured to include predictors that share a non-linear relationship with the outcome (after adjusting for other variables), as well as interactions, situations in which the impact of one variable depends upon the value of another variable.
\(~\)
As we saw with simple linear regression, the residuals of a model are an important diagnostic tool. Not only do the residuals allow us to assess the statistical assumptions made by the model, they also allow us to assess whether certain predictors are properly included.
Plotting a model’s residuals vs. one of the model’s explanatory variables allows us to evaluate whether the predictor might have a non-linear relationship with the outcome.
The code below creates simulated data where one predictor, x2, has a non-linear relationship with the outcome, y:
## Generate two predictors, n = 70 values between 2 and 3
set.seed(123)
x1 <- runif(70, min = 2, max = 3)
x2 <- runif(70, min = 2, max = 3)
## Generate outcomes
y <- 3*x1 + x2^2 + rnorm(70, sd = 0.5)
## Scatterplots
par(mfrow = c(1, 2))
plot(x1, y)
plot(x2, y)
The non-linearity is obvious when looking at the population model, y = 3*x1 + x2^2, but it's difficult to see in the scatterplot unless you know ahead of time to look for it.
m <- lm(y ~ x1 + x2)
summary(m)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15718 -0.25254 -0.03011 0.26274 1.15579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.9665 0.7950 -7.505 1.88e-10 ***
## x1 2.8772 0.2053 14.014 < 2e-16 ***
## x2 5.0298 0.2107 23.870 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4892 on 67 degrees of freedom
## Multiple R-squared: 0.9106, Adjusted R-squared: 0.908
## F-statistic: 341.4 on 2 and 67 DF, p-value: < 2.2e-16
The output from lm() is also unlikely to help us detect this non-linear relationship, as the hypothesis test on the slope coefficient for x2 is highly significant (seemingly indicating a strong linear relationship).
However, if we plot the model's residuals versus each of our explanatory variables and add a loess smoothing line (a smoother variation of a moving average), we can see a clear quadratic pattern in the residuals of \(x_2\):
library(ggplot2)   ## par(mfrow) does not apply to ggplot objects, so arrange them with gridExtra instead
p1 <- ggplot() + geom_point(aes(x = x1, y = m$residuals)) + geom_smooth(aes(x = x1, y = m$residuals), method = "loess") + geom_abline(intercept = 0, slope = 0, lty = "dashed") + labs(title = "x1 vs. Residuals (no clear pattern)")
p2 <- ggplot() + geom_point(aes(x = x2, y = m$residuals)) + geom_smooth(aes(x = x2, y = m$residuals), method = "loess") + geom_abline(intercept = 0, slope = 0, lty = "dashed") + labs(title = "x2 vs. Residuals (U-shaped pattern)")
gridExtra::grid.arrange(p1, p2, nrow = 1)   ## display the two residual plots side-by-side
Based upon this pattern, we can use the poly() function to include x2 in our regression model as a second-degree polynomial. This corresponds to the population model: \[y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \epsilon\]
After including this non-linear effect, the confidence band in the x2 vs. residual plot entirely overlaps the horizontal line at zero (indicating no clear pattern remains in the residuals):
## Add quadratic term for x2
m2 <- lm(y ~ x1 + poly(x2, 2))
ggplot() + geom_point(aes(x = x2, y = m2$residuals)) + geom_smooth(aes(x = x2, y = m2$residuals), method = "loess") + geom_abline(intercept = 0, slope = 0, lty = "dashed") + labs(title = "x2 vs. Residuals (pattern is diminished)")
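By default, poly() constructs orthogonal polynomial terms, so the individual coefficients reported by summary(m2) will not line up with \(\beta_2\) and \(\beta_3\) in the population model written above (the fitted values and residuals are unaffected). If you want coefficients that map directly onto the \(\beta_2 x_2 + \beta_3 x_2^2\) form, you can request raw polynomials; a minimal sketch using the simulated data from this section:
## Raw (non-orthogonal) polynomial: coefficients correspond directly to x2 and x2^2
m2_raw <- lm(y ~ x1 + poly(x2, 2, raw = TRUE))
coef(m2_raw)   ## same fitted values as m2, but the x2 terms are on the original scale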
An ANOVA test can be used to statistically confirm the superiority of the model that includes a quadratic term:
## ANOVA test
m2 <- lm(y ~ x1 + poly(x2, 2))
anova(m, m2)
## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 + poly(x2, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 67 16.032
## 2 66 14.694 1 1.3374 6.0072 0.0169 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: It is important to recognize that ANOVA will generally show a statistically significant improvement in fit from adding a quadratic effect only when there truly is a non-linear relationship between the predictor and the outcome (after adjusting for other variables).
## Example: when the true effect of x2 is linear, the added quadratic term is not significant
set.seed(1234)
x1 <- runif(70, min = 2, max = 3)
x2 <- runif(70, min = 2, max = 3)
## Generate outcomes, notice x2 now has a linear effect
y <- 3*x1 + x2 + rnorm(70, sd = 0.5)
## ANOVA
m <- lm(y ~ x1 + x2)
m2 <- lm(y ~ x1 + poly(x2, 2))
anova(m, m2)
## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 + poly(x2, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 67 13.566
## 2 66 13.516 1 0.050244 0.2453 0.622
Question #1: The Ozone dataset (loaded below) records the daily ozone concentrations in New York City for approximately 100 different days in a single year. For this question, fit a model that uses a day’s temperature and wind speed (as linear effects) to predict the ozone concentration for that day. Next, create a residual vs. predictor plot and use it to qualitatively assess whether it’s worth considering a quadratic effect for the variable “Temp”.
oz <- read.csv("https://remiller1450.github.io/data/Ozone.csv")
Question #2: Fit a model that predicts “Ozone” using a linear effect for “Wind” and a quadratic polynomial effect for “Temp”. Then, use ANOVA to determine if there is statistically compelling evidence that this model provides a better fit than the model in Question #1.
\(~\)
Residual plots can also provide an indication of when a model might benefit from a transformation on a predictor.
The example below looks at the variable “yrs.since.phd” in the Professor Salaries dataset:
ps <- read.csv('https://remiller1450.github.io/data/Salaries.csv')
m1 <- lm(salary ~ sex + yrs.since.phd, data = ps)
m2 <- lm(salary ~ sex + poly(yrs.since.phd, 2) , data = ps)
m3 <- lm(salary ~ sex + log2(yrs.since.phd), data = ps)
## par(mfrow) does not apply to ggplot objects, so arrange them with gridExtra instead
p1 <- ggplot() + geom_point(aes(x = ps$yrs.since.phd, y = m1$residuals)) + geom_smooth(aes(x = ps$yrs.since.phd, y = m1$residuals), method = "loess") + geom_abline(intercept = 0, slope = 0, lty = "dashed") + labs(title = "Linear Effect")
p2 <- ggplot() + geom_point(aes(x = ps$yrs.since.phd, y = m2$residuals)) + geom_smooth(aes(x = ps$yrs.since.phd, y = m2$residuals), method = "loess") + geom_abline(intercept = 0, slope = 0, lty = "dashed") + labs(title = "Quadratic Effect")
p3 <- ggplot() + geom_point(aes(x = ps$yrs.since.phd, y = m3$residuals)) + geom_smooth(aes(x = ps$yrs.since.phd, y = m3$residuals), method = "loess") + geom_abline(intercept = 0, slope = 0, lty = "dashed") + labs(title = "Log2 Transformation")
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)   ## display the three residual plots side-by-side
As you can see, the residual vs. predictor plot shows a curved pattern when a linear effect is used. Using a quadratic effect eliminates this pattern entirely; however, a log-transformation also does a decent job addressing this pattern (notice how the error band around the loess smoother now includes zero for most values of “yrs.since.phd”).
In this situation, the log-transformation might be preferable (over both the linear and quadratic effect models), as it results in a model with a good balance of interpretability and proper fit (non-linear effects tend to be difficult to interpret since they change over the range of the predictor).
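One reason the log transformation is attractive here is that its coefficient has a clean reading: a one-unit increase in log2(yrs.since.phd) corresponds to a doubling of yrs.since.phd, so the slope is the estimated salary change associated with doubling the years since PhD. A minimal sketch using the m3 model fit above (the coefficient name follows lm()'s default labeling):
## Estimated change in salary associated with a doubling of yrs.since.phd
coef(m3)["log2(yrs.since.phd)"]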
Question #3: Briefly explain why ANOVA cannot be used to determine whether a data transformation significantly improves a model. (Hint: think about ANOVA's assumption of "nested models".) Note: you do not need any R code to answer this question.
Question #4: For the Ozone dataset described in the previous section, use AIC (found via the AIC() function) as a model selection criterion to compare the following models:
Ozone ~ Temp + Wind
Ozone ~ log2(Temp) + Wind
Ozone ~ log10(Temp) + Wind
Ozone ~ sqrt(Temp) + Wind
Ozone ~ poly(Temp,3) + Wind
Based upon these AIC values, which model has the optimal balance of accuracy and simplicity (objectively)? Which model has the optimal balance of fit and interpretability (your own subjective decision)?
\(~\)
Sometimes the effect of one predictor might depend upon another variable in the model. For example, in the Ames Housing data, we’ve previously seen that the effect of above ground living area, “Gr.Liv.Area”, likely depends upon the variable “House.Style”. That is, an increase in square footage for a 1 Story house has a greater impact on price than the same increase for a 2 Story house.
Within the lm() function, the * symbol can be used to specify an interaction between two variables:
ah <- read.csv('https://remiller1450.github.io/data/AmesHousing.csv')
ah <- dplyr::filter(ah, House.Style %in% c("1Story", "2Story"))
m1 <- lm(SalePrice ~ House.Style + Gr.Liv.Area, data = ah)
m2 <- lm(SalePrice ~ House.Style + Gr.Liv.Area + House.Style*Gr.Liv.Area, data = ah)
The code below visually depicts the two models specified above:
## No Interaction
plot(ah$Gr.Liv.Area, ah$SalePrice, xlab = "Above Ground sq. ft", ylab = "Sale Price",
col = ifelse(ah$House.Style == "1Story", "red", "blue"))
abline(a = m1$coefficients[1] + m1$coefficients[2], b = m1$coefficients[3], col = "blue", lwd = 2)
abline(a = m1$coefficients[1], b = m1$coefficients[3], col = "red", lwd = 2)
## Interaction
plot(ah$Gr.Liv.Area, ah$SalePrice, xlab = "Above Ground sq. ft", ylab = "Sale Price",
col = ifelse(ah$House.Style == "1Story", "red", "blue"))
abline(a = m2$coefficients[1] + m2$coefficients[2],
b = m2$coefficients[3]+ m2$coefficients[4], col = "blue", lwd = 2)
abline(a = m2$coefficients[1], b = m2$coefficients[3], col = "red", lwd = 2)
Let's look at the summary() output for the interaction model:
summary(m2)
##
## Call:
## lm(formula = SalePrice ~ House.Style + Gr.Liv.Area + House.Style *
## Gr.Liv.Area, data = ah)
##
## Residuals:
## Min 1Q Median 3Q Max
## -582644 -22111 790 21762 269975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26041.686 4822.701 -5.400 7.34e-08 ***
## House.Style2Story -3758.606 8377.306 -0.449 0.654
## Gr.Liv.Area 155.552 3.508 44.342 < 2e-16 ***
## House.Style2Story:Gr.Liv.Area -27.930 4.998 -5.588 2.56e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53580 on 2350 degrees of freedom
## Multiple R-squared: 0.5916, Adjusted R-squared: 0.5911
## F-statistic: 1135 on 3 and 2350 DF, p-value: < 2.2e-16
First, note that “1Story” is the reference category in this model. The coefficient on the dummy variable “House.Style2Story” determines how much the intercept should be adjusted for “2Story” homes.
Next, note that the interaction leads to two different slopes for the variable "Gr.Liv.Area". For the reference category, "1Story", the effect of "Gr.Liv.Area" is a $155.55 increase in price for each additional square foot of living area. For "2Story" homes, the effect of "Gr.Liv.Area" is a $155.55 - $27.93 = $127.62 increase in price for each additional square foot of living area.
Further, the statistically significant \(t\)-test on the model coefficient for the interaction term provides compelling statistical evidence the effect of “Gr.Liv.Area” on sale price differs for “1Story” and “2Story” homes. An ANOVA test could also be used to statistically confirm the superiority of the interaction model.
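The two implied slopes can also be computed directly from the fitted coefficients; the sketch below uses the m2 interaction model and the coefficient names shown in the summary() output above:
## Slope of Gr.Liv.Area for the reference category ("1Story" homes)
coef(m2)["Gr.Liv.Area"]
## Slope of Gr.Liv.Area for "2Story" homes (main effect + interaction adjustment)
coef(m2)["Gr.Liv.Area"] + coef(m2)["House.Style2Story:Gr.Liv.Area"]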
Question #5: Using the Professor Salaries dataset (provided below), fit the following two models: salary ~ sex + yrs.since.phd and salary ~ sex + yrs.since.phd + sex*yrs.since.phd. Using these two models, answer the following:
ps <- read.csv('https://remiller1450.github.io/data/Salaries.csv')
\(~\)
Interactions between quantitative predictors are more challenging to interpret; however, the fundamental idea remains the same - the slope coefficient of one predictor depends upon the value of the other predictor (so neither slope is constant). For the Ozone dataset, we can use a 3-dimensional plot of the regression plane to explore an interaction between the variables "Wind" and "Temp":
ozone <- read.csv("https://remiller1450.github.io/data/Ozone.csv")
m1 <- lm(Ozone ~ Wind + Temp, data = ozone)
m2 <- lm(Ozone ~ Wind + Temp + Wind*Temp, data = ozone)
## Create a 2D grid to predict over
xs <- seq(min(ozone$Temp), max(ozone$Temp), length.out = 100)
ys <- seq(min(ozone$Wind), max(ozone$Wind), length.out = 100)
grid <- expand.grid(xs,ys)
names(grid) <- c("Temp", "Wind")
## Use the predict function to generate predicted values
z <- predict(m1, newdata = grid)
## Populate a matrix using these predictions
m <- matrix(z, nrow = length(unique(grid$Temp)), ncol = length(unique(grid$Wind)), byrow = TRUE)
## Graph using the plotly package
library(plotly)
plot_ly() %>% add_surface(x = ~xs, y = ~ys, z = ~m) %>%
add_markers(x = ~ozone$Temp, y = ~ozone$Wind, z = ~ozone$Ozone, color = I("black")) %>%
layout(scene = list(xaxis = list(title = "Temp"),
yaxis = list(title = "Wind"),
zaxis = list(title = "Ozone")))
## Now let's look at m2, the model with an interaction
z <- predict(m2, newdata = grid)
## Populate a matrix using these predictions
m <- matrix(z, nrow = length(unique(grid$Temp)), ncol = length(unique(grid$Wind)), byrow = TRUE)
## Graph using the plotly package
library(plotly)
plot_ly() %>% add_surface(x = ~xs, y = ~ys, z = ~m) %>%
add_markers(x = ~ozone$Temp, y = ~ozone$Wind, z = ~ozone$Ozone, color = I("black")) %>%
layout(scene = list(xaxis = list(title = "Temp"),
yaxis = list(title = "Wind"),
zaxis = list(title = "Ozone")))
As you can see, the model without an interaction results in separate, constant slopes in each dimension, while the model with an interaction results in slopes that change according to the value of the other predictor.
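Writing the interaction model out algebraically makes this explicit, using the same \(\beta\) notation as the population model at the top of the lab: \[E(\text{Ozone}) = \beta_0 + \beta_1 \text{Wind} + \beta_2 \text{Temp} + \beta_3 (\text{Wind} \times \text{Temp})\] so at a fixed temperature the slope of "Wind" is \(\beta_1 + \beta_3 \text{Temp}\), and at a fixed wind speed the slope of "Temp" is \(\beta_2 + \beta_3 \text{Wind}\).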
Rather than plotting in three dimensions, we could use color to represent the value of \(E(y)\):
library(visreg) ## Package for visualizing regression models
par(mfrow = c(1,2))
visreg2d(m1, "Wind", "Temp", main = "Ozone Model #1 (no interaction)")
visreg2d(m2, "Wind", "Temp", main = "Ozone Model #2 (includes interaction)")
An ANOVA test indicates the interaction leads to a statistically significant improvement in the model fit:
anova(m1, m2)
## Analysis of Variance Table
##
## Model 1: Ozone ~ Wind + Temp
## Model 2: Ozone ~ Wind + Temp + Wind * Temp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 108 50989
## 2 107 44394 1 6594.8 15.895 0.0001227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
While it is possible to interpret the interaction coefficient quantitatively, my advice is to focus on providing a qualitative interpretation. In this example, the effect of either predictor ("Wind" or "Temp") is diminished for larger values of the other predictor. This means that the model estimates the effect of "Wind" is lower at higher temperatures, and the effect of "Temp" is lower at higher wind speeds.
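If you do want to inspect the interaction term itself, it can be pulled directly from the fitted model; a minimal sketch, assuming the m2 interaction model fit above (R labels the term "Wind:Temp" by default):
## The sign and size of this coefficient describe how the slope of Wind changes per one-degree increase in Temp (and vice versa)
coef(m2)["Wind:Temp"]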
Question #6: Using the Ozone dataset, fit the following two models: Ozone ~ Temp + Solar and Ozone ~ Temp + Solar + Temp*Solar. Using these two models, answer the following:
- Use visreg2d() to create separate graphs that display the interaction and no-interaction models.
- Using the graphs created by visreg2d(), and/or the coefficients in the interaction model, qualitatively describe the relationship between "Temp" and "Solar" in predicting "Ozone" (in the interaction model).
\(~\)
Predictor vs. residual plots can also be used to help identify outliers and data-points that are having a disproportionate impact on the fit of the regression model.
Your level of concern upon encountering an outlier should be based upon how much influence that data-point has on the estimated model coefficients. If the outlier is approximately as influential as the less extreme data-points, you might decide to move on with your analysis.
However, if an outlier is deemed to be highly influential, you might decide to take remedial action - options range from reporting your results with caution, to transforming the predictor, to removing the data-point entirely (if you determine it is a measurement error or a case that doesn't belong to the population of interest).
Below are a few definitions used in identifying influential observations:
- Outlier: a data-point whose outcome value is far from the value predicted by the model (i.e., it has an unusually large residual).
- Leverage: a measure of how unusual a data-point's predictor values are relative to the rest of the data; high-leverage points have the potential to pull the fitted model toward themselves.
- Influence: the extent to which a data-point actually changes the estimated model coefficients; highly influential points typically combine high leverage with a large residual.
There are many alternative approaches to identifying problematic data-points; however, we will focus on Cook's Distance, as it is an easy-to-use method for identifying highly influential outliers. In R, the cooks.distance() function can be used to calculate Cook's Distance for all of the data-points used in your regression model:
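For reference, Cook's Distance combines a data-point's residual with its leverage; one standard way to write it is \[D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}\] where \(e_i\) is the point's residual, \(h_{ii}\) its leverage, \(p\) the number of estimated coefficients, and \(\hat{\sigma}^2\) the estimated error variance.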
ps <- read.csv('https://remiller1450.github.io/data/Salaries.csv')
m1 <- lm(salary ~ sex + yrs.since.phd, data = ps)
cd <- cooks.distance(m1)
plot(cd)
abline(h = 4/nrow(ps), lty = 2) # Rule of thumb of 4/n
There are a variety of guidelines for interpreting Cook's Distance; however, one easy-to-use threshold treats any distance larger than \(4/n\) as indicating an influential observation. In the model salary ~ sex + yrs.since.phd, the two observations with distances larger than 0.04 might be worth investigating, though this type of distribution of Cook's Distances is typical (so I wouldn't be overly concerned). If we consider alternative ways of incorporating the variable "yrs.since.phd" into the model, we can see that a log-transformation on "yrs.since.phd" leads to a more equitable distribution of influence, while incorporating "yrs.since.phd" via a quadratic effect results in a single data-point having a huge amount of influence on the model's fit.
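The comparison described above can be reproduced with a few lines of code; this is a sketch assuming the ps data frame loaded above and the same log2() and poly() specifications used earlier in the lab:
## Cook's Distances under three ways of including yrs.since.phd
m_lin <- lm(salary ~ sex + yrs.since.phd, data = ps)
m_log <- lm(salary ~ sex + log2(yrs.since.phd), data = ps)
m_quad <- lm(salary ~ sex + poly(yrs.since.phd, 2), data = ps)
par(mfrow = c(1, 3))
plot(cooks.distance(m_lin), main = "Linear effect"); abline(h = 4/nrow(ps), lty = 2)
plot(cooks.distance(m_log), main = "Log2 transformation"); abline(h = 4/nrow(ps), lty = 2)
plot(cooks.distance(m_quad), main = "Quadratic effect"); abline(h = 4/nrow(ps), lty = 2)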
Question #7: For the Ozone dataset, fit the model Ozone ~ Temp + Solar and use a graph of Cook's Distance to determine whether there are any disproportionately influential data-points in this model.
\(~\)
Identifying influential data-points is one thing, but in order to properly determine whether any remediation steps are necessary, it is important to consider the degree of influence in practical terms. A straightforward way of assessing the practical impact of an influential point is to delete it from the dataset and record how much the coefficients in the model change following its removal.
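To see the delete-one-point idea concretely, the sketch below refits the salary model without its highest-Cook's-Distance observation and compares the coefficients (it assumes the ps data frame and the m1 model from the Cook's Distance example above are still in memory):
## Refit without the single most influential observation and compare coefficients
worst <- which.max(cooks.distance(m1))
m1_drop <- lm(salary ~ sex + yrs.since.phd, data = ps[-worst, ])
coef(m1) - coef(m1_drop)   ## change in each coefficient attributable to that one data-point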
A generalization of this procedure, which considers the impact of removing each observation in the data (one at a time), is known as DFBETA:
m1 <- lm(salary ~ sex + yrs.since.phd, data = ps)
dfb <- dfbeta(m1)
par(mfrow = c(1,2))
plot(dfb[,2], main = "Impact on sexMale Coefficient")
plot(dfb[,3], main = "Impact on yrs.since.phd Coefficient")
While many have put forth guidelines for interpreting the values of DFBETAs, my advice is to analyze the DFBETAs of your data qualitatively when determining whether remedial action (such as a data transformation) is necessary. That is, ask yourself: is it a problem, for the application I'm studying, that a single data-point can change \(\beta_k\) by \(X\)? If the answer is yes, you should consider taking action.
Question #8: For the Ozone dataset, fit the model Ozone ~ Temp + Solar and report the DFBETAs for the most influential day in the dataset. Are you concerned by the influence of this observation on the model's coefficients? Briefly explain.
\(~\)
For this section you’ll work with the Colleges 2019 dataset:
c19 <- read.csv('https://remiller1450.github.io/data/Colleges2019.csv')
Question #9: Filter the Colleges 2019 dataset to include only schools in the "Great Lakes" region (IL, IN, MI, OH, and WI) that have data for the variable "Salary10yr_median" (ie: exclude any colleges with an NA value for this variable). Next, rescale the variable "Net_Tuition" to be expressed in thousands of dollars by dividing the original values by 1,000. Then, fit a regression model that uses the variables "Private" and your rescaled version of "Net_Tuition" to predict "Salary10yr_median" (the median salary of students 10 years after graduation). Provide a 1-sentence interpretation of the coefficient of the variable "Net_Tuition".
Question #10: Use a residual vs. predictor plot to assess whether the model would benefit from a transformation or quadratic effect on the variable “Net_Tuition”. Briefly explain your assessment.
Question #11: Now fit a model that includes an interaction between the variables “Net_Tuition” and “Private”. Is this model superior to the one you fit in Question #9? You should justify your answer.
Question #12: Using Cook's Distance, which college is the most influential in this model? Report the name of this school and its DFBETAs (for each coefficient in the model). How do you interpret the DFBETA for the variable "Net_Tuition"?
Question #13: Apply a base-two log-transformation to the variable “Net_Tuition” and re-fit the model from Question #9 using this transformed variable. How does the influence of the school you identified in Question #12 change following this transformation?