Directions:

As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.

You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.

You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.

\(~\)

Introduction

Many of the questions in this lab will use the “Professor Salaries” dataset provided in the code below. This dataset contains the de-identified 9-month salaries of faculty members at a large, public US university. Our analysis will focus on exploring whether there is a gender-related bias in the salaries at this university that favors male faculty members.

ps <- read.csv("https://remiller1450.github.io/data/Salaries.csv")

The dataset contains the following variables:

  • Rank - a categorical variable with levels “AsstProf”, “AssocProf”, and “Prof”. New professors are usually hired at the rank of Assistant Professor and after several years of productivity they are promoted to Associate Professor or released. The promotion to Full Professor occurs after several additional years of productivity at the Associate Professor level.
  • Discipline - A binary variable with levels “A” (theoretical departments) and “B” (applied departments)
  • Yrs.since.phd - The number of years since the professor received their PhD
  • Yrs.service - The number of years the professor has been working for the university
  • Sex - Whether the professor is male or female
  • Salary - The 9-month salary of the professor (in US dollars)

\(~\)

Multiple Regression

Linear regression modeling is a general framework that can easily accommodate models that involve multiple predictors. There are many similarities between simple linear regression and multiple linear regression, so we can begin by considering the following population-level model: \[y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon\]

This model involves \(k\) different explanatory features. Often, each is a unique variable, but sometimes the values of \(x_1, x_2, \ldots, x_k\) can be derived as functions of a single variable. The next section will demonstrate this idea for categorical predictors. It’s also possible for \(x_1, x_2, \ldots, x_k\) to be a polynomial expansion of a predictor (which would allow the model to capture non-linear trends).
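
As a quick sketch of this second idea (using the ps data loaded above and the lowercase variable names that appear in this lab’s practice questions), a quadratic expansion of a single predictor can be written directly in an R formula. The particular model below is only for illustration:

## Here x1 = yrs.service and x2 = yrs.service^2 are both functions of one variable
quad_fit <- lm(salary ~ poly(yrs.service, 2, raw = TRUE), data = ps)
coef(quad_fit)  ## least squares estimates of beta_0, beta_1, beta_2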

Apart from including multiple predictors, many of the technical details of simple linear regression still apply:

  • \(\epsilon\) still represents random errors, which still follow a \(N(0, \sigma^2)\) distribution that is independent of the predictors
  • The parameters of the population-level model are still estimated from the data using least squares

The steps involved in analyzing data using a multiple regression model should also look familiar (a short R sketch follows this list):

  1. Model Specification - Specify the model at the population level
  2. Model Fitting - Use the sample data to estimate the model’s unknown parameters (ie: estimate \(\beta_0, \beta_1, \beta_2, \ldots, \beta_k\))
  3. Inference - Use the fitted model to draw conclusions about the population, or make predictions about new data
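
A minimal sketch of this workflow is given below; the data frame dat and the variables y, x1, and x2 are placeholders, not part of any lab dataset:

## Placeholder example only: dat, y, x1, and x2 are hypothetical names
fit <- lm(y ~ x1 + x2, data = dat)  ## Steps 1-2: specify the model as a formula and fit it by least squares
summary(fit)                        ## Step 3: inference (estimates, standard errors, p-values)
predict(fit, newdata = dat)         ## Step 3: predictions from the fitted model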

\(~\)

Categorical Predictors

Our entry point into multiple regression will be the addition of a binary categorical predictor to a simple linear regression model. Let’s begin by looking at a simple linear regression model based upon the Golden State Warriors’ record-setting 2015-16 NBA season: \[\text{Margin} = \beta_0 + \beta_1 \text{FG%} + \epsilon\]

This model assumes the Warriors’ performance, measured by the game’s margin of victory (+) or loss (-), is linearly related to the percentage of shots the Warriors made in that game.

gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")

## Create the outcome (point margin) and predictor (field goal percentage) variables
gsw$Margin <- gsw$Points - gsw$OppPoints
gsw$FGP <- 100*gsw$FG/gsw$FGA
m <- lm(Margin ~ FGP, data = gsw)

plot(gsw$FGP, gsw$Margin, xlab = "FG%", ylab = "Point Margin")
abline(a = m$coefficients[1], b = m$coefficients[2])

summary(m)
## 
## Call:
## lm(formula = Margin ~ FGP, data = gsw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.347  -6.469  -1.472   5.959  36.012 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.6493    10.9007  -5.197 1.52e-06 ***
## FGP           1.3799     0.2219   6.218 2.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.41 on 80 degrees of freedom
## Multiple R-squared:  0.3258, Adjusted R-squared:  0.3174 
## F-statistic: 38.67 on 1 and 80 DF,  p-value: 2.151e-08

Based upon the fitted model, there is compelling evidence that FG% is a strong predictor of the Warriors’ performance.

Let’s now add the binary variable “Location” as a predictor:

m <- lm(Margin ~ FGP + Location, data = gsw)
summary(m)
## 
## Call:
## lm(formula = Margin ~ FGP + Location, data = gsw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.067  -6.287  -0.161   5.990  33.278 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -56.5205    10.5365  -5.364 7.88e-07 ***
## FGP            1.3183     0.2158   6.108 3.56e-08 ***
## LocationHome   5.7554     2.2357   2.574   0.0119 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.06 on 79 degrees of freedom
## Multiple R-squared:  0.378,  Adjusted R-squared:  0.3623 
## F-statistic: 24.01 on 2 and 79 DF,  p-value: 7.148e-09

First, notice that R automatically generated a “dummy variable” called “LocationHome”. This variable is the output of a function that took the original variable “Location” and mapped Location = "Home" to a numeric value of 1 and Location = "Away" to a numeric value of 0.
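
You can inspect this coding yourself. The sketch below shows the dummy variable R builds behind the scenes; the hand-built HomeDummy column is just for illustration and isn’t used anywhere else in the lab:

## The design matrix contains the dummy variable R created from "Location"
head(model.matrix(~ Location, data = gsw))

## An equivalent hand-built version (illustration only): 1 for home games, 0 for away games
gsw$HomeDummy <- ifelse(gsw$Location == "Home", 1, 0)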

Because this dummy variable can only take on values of 0 and 1, we can view it as a modification of the intercept that depends upon whether the game was played at home or away:

plot(gsw$FGP, gsw$Margin, xlab = "FG%", ylab = "Point Margin")
abline(a = m$coefficients[1] + m$coefficients[3], b = m$coefficients[2], col = 2, lwd = 2)
abline(a = m$coefficients[1], b = m$coefficients[2], col = 4, lwd = 2)
legend("bottomright", legend = c("Home", "Away"), lwd = 2, col = c(2,4), bty = "n")

So, adding a binary predictor to simple linear regression yields a model that can be expressed as two parallel lines (ie: same slope, different intercepts). If we believed the slopes should also differ for home and away games, we’d need to include an interaction in our model (a topic we’ll discuss in the future).
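
For reference, allowing the slopes to differ would only require a small change to the model formula. The line below is a preview sketch only; interpreting interactions is left for a later lab:

## The * operator includes FGP, Location, and their interaction (different slopes by location)
m_int <- lm(Margin ~ FGP * Location, data = gsw)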

\(~\)

Regression Adjustment

Notice how the slope coefficient of “FGP” decreased once “Location” was included as a predictor in the model. This demonstrates one of the most valuable (and slightly controversial) uses of regression modeling - it is able to statistically adjust for factors such as confounding variables.

To understand why the adjustment occurs, recognize that the Warriors tend to shoot better at home, and they tend to win by larger margins at home.

boxplot(FGP ~ Location, data = gsw, horizontal = TRUE)

boxplot(Margin ~ Location, data = gsw, horizontal = TRUE)

In the original model, Margin ~ FGP, the slope coefficient of the variable “FGP” implicitly includes the influence of home-court advantage, which makes “FGP” appear to have a slightly stronger relationship with “Margin” than it actually does.

Put differently, the marginal effect of “FGP” is inflated because the Warriors tend to have higher field goal percentages at home, which is also where they tend to win by larger margins (home-court advantage in sports is very well-established statistically).
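
One way to see the adjustment concretely is to fit both models and compare their “FGP” coefficients side-by-side (a short sketch that simply repeats the models fit above):

unadj <- lm(Margin ~ FGP, data = gsw)           ## marginal (unadjusted) effect of FGP
adj <- lm(Margin ~ FGP + Location, data = gsw)  ## effect of FGP after adjusting for Location
coef(unadj)["FGP"]  ## approximately 1.38
coef(adj)["FGP"]    ## approximately 1.32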

\(~\)

Practice (part 1)

Question #1: Using the “Professor Salaries” dataset introduced at the start of this lab, perform a two-sample t-test (using the t.test function) to compare the average salaries of male and female professors. Based upon this test, are you convinced that male professors at this university are earning higher salaries, or could the male-female discrepancy be explained by random chance? Further, does this discrepancy prove that discrimination is happening at this university?

Question #2: Using the “Professor Salaries” dataset, fit a linear regression model using the formula salary ~ sex and print a summary of this model. Briefly explain how R generated the dummy variable that is used in this model. Then, compare the estimated coefficients in this model to the output of t.test from Question #1. Based upon this comparison, briefly explain how the average salaries of male and female professors are connected to the intercept and slope of this model.

Question #3: Fit a multiple linear regression model using the formula salary ~ sex + yrs.since.phd and print a summary of this model. Based upon the output, report the adjusted difference in male-female salaries (after adjusting for years of experience). After adjustment, could the male-female discrepancy at this university be explained by random chance?

Question #4: Briefly explain why the estimated effect of “sex” on “salary” decreased when “yrs.since.phd” was added to the model. Feel free to include any graphs that might be helpful in understanding this change.

\(~\)

Quantitative Predictors

To learn about multiple regression models that use more than one numeric predictor, we’ll look at data documenting daily ozone concentrations in New York City. As background, ozone is a harmful pollutant linked to respiratory ailments. Airborne ozone concentrations fluctuate on a day-to-day basis depending on numerous weather-related factors. Predicting daily ozone concentrations is helpful in protecting vulnerable individuals.

A useful first step when considering multiple numeric predictors is to look at a scatterplot matrix. By default, the plot function will generate this matrix if you supply a data.frame containing only numeric variables:

ozone <- read.csv("https://remiller1450.github.io/data/Ozone.csv")
plot(ozone)

Using this plot, you should recognize that each of these variables appears to have some relationship with “Ozone” (seen by looking across the top row of plots). However, many of these explanatory variables are also related to each other, which could be problematic if our goal is to understand the unique contribution each makes towards daily ozone concentrations.
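
If you’d like to quantify what the scatterplot matrix shows, one option is a correlation matrix (a quick sketch; the use = "complete.obs" argument is only a precaution in case any values are missing):

## Pairwise correlations among the variables, rounded for readability
round(cor(ozone, use = "complete.obs"), 2)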

\(~\)

Regression Planes

The code below creates a 3-D visualization of the multiple regression model that uses the explanatory variables “Temp” and “Wind” to predict “Ozone”. In this context, you should notice that \(E(y)\) is a plane rather than a line. This plane is uniquely defined by two different slopes, one in the “Temp” dimension and another in the “Wind” dimension.

## Predict Ozone using Temp and Wind
model <- lm(Ozone ~ Temp + Wind, data = ozone)

## Create a 2D grid to predict over
xs <- seq(min(ozone$Temp), max(ozone$Temp), length.out = 100)
ys <- seq(min(ozone$Wind), max(ozone$Wind), length.out = 100)
grid <- expand.grid(xs,ys)
names(grid) <- c("Temp", "Wind")

## Use the predict function to generate predicted values
z <- predict(model, newdata = grid)

## Populate a matrix using these predictions
m <- matrix(z, nrow = length(unique(grid$Temp)), ncol = length(unique(grid$Wind)), byrow = TRUE)

## Graph using the plotly package
library(plotly)
plot_ly() %>% add_surface(x = ~xs, y = ~ys, z = ~m) %>% 
  add_markers(x = ~ozone$Temp, y = ~ozone$Wind, z = ~ozone$Ozone, color = I("black")) %>% 
  layout(scene = list(xaxis = list(title = "Temp"), 
                      yaxis = list(title = "Wind"), 
                      zaxis = list(title = "Ozone")))

Similar to how it worked in simple linear regression, least squares estimation finds the plane (defined by an intercept and two slopes) that minimizes the sum of the model’s squared residuals. In this example, the residuals are deviations in the “Ozone” dimension between the observed \(y\)-value and the plane of expected \(y\)-values.
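
A brief sketch using the model object fit above shows the quantity that least squares minimizes:

## Each residual is an observed Ozone value minus the corresponding value on the fitted plane
res <- resid(model)
sum(res^2)  ## the sum of squared residuals minimized by least squares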

\(~\)

Regression Adjustment

Multiple linear regression can be used to statistically adjust for numeric confounding variables. This is very useful because methods like stratification don’t scale well when the data includes many confounds that need to be accounted for.

Similar to our previous example (involving the Golden State Warriors), we can explore the difference between adjusted and unadjusted effects by comparing the estimated model coefficients and reconciling the difference using the scatterplot matrix.

## Both models
model <- lm(Ozone ~ Temp + Wind, data = ozone)
unadj_model <- lm(Ozone ~ Temp, data = ozone)

## Compare their slopes
model$coefficients
## (Intercept)        Temp        Wind 
##  -67.321953    1.827554   -3.294839
unadj_model$coefficients
## (Intercept)        Temp 
##  -147.64607     2.43911

In this example, the unadjusted (or marginal) effect of “Temp” on “Ozone” is much larger than the effect of “Temp” after adjusting for “Wind”. We can interpret this as the simple linear regression model overestimating the true impact of “Temp” because the scatterplot matrix shows that days which are warmer also tend to be less windy, and days that are less windy tend to have higher ozone concentrations. So when “Wind” is omitted from the model, its influence on “Ozone” is getting absorbed into the “Temp” coefficient.

The meaning of an adjusted effect is very different from an unadjusted one. In this example, we interpret the unadjusted effect as a 2.44-unit increase in “Ozone” for each 1-degree increase in temperature. We interpret the adjusted effect as a 1.83-unit increase in “Ozone” for each 1-degree increase in temperature that is not accompanied by any change in “Wind”.
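
A short sketch of the two interpretations using the predict() function (the values 80, 81, and 10 below are arbitrary, chosen only to create a 1-degree increase in “Temp”):

## Adjusted effect: +1 degree of Temp while Wind is held fixed
diff(predict(model, newdata = data.frame(Temp = c(80, 81), Wind = c(10, 10))))  ## about 1.83

## Unadjusted (marginal) effect: +1 degree of Temp with Wind ignored entirely
diff(predict(unadj_model, newdata = data.frame(Temp = c(80, 81))))  ## about 2.44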

A potential criticism is that an adjusted effect might be purely hypothetical or the result of a model-based extrapolation. For example, you might argue that it doesn’t make much practical sense to consider temperature increases that aren’t also accompanied by changes in wind speed.

\(~\)

Practice (part 2)

Question #5: Using the “Professor Salaries” dataset, create a scatterplot matrix of the three numeric variables (“yrs.service”, “yrs.since.phd”, and “salary”).

Question #6: Using the “Professor Salaries” dataset, find and interpret the marginal effect of “yrs.service” on “salary”.

Question #7: Using the “Professor Salaries” dataset, find and interpret the adjusted effect of “yrs.service” on “salary”, after adjusting for “yrs.since.phd”. Why is this adjusted effect so different from the marginal one?

\(~\)

Randomized Experiments and Regression Adjustment

Regression adjustment is invaluable in observational studies, which frequently contain numerous confounding variables. However, regression adjustment is a controversial topic in the analysis of data from randomized experiments. The bullets below summarize the controversy:

Do Not Adjust:

  • Theoretically speaking, confounding variables should not skew the results of a randomized experiment, as the randomization will balance them across the treatment groups.

Adjust:

  • Randomization will never yield a perfect balance of other variables across the treatment groups
  • Including variables that explain variability in the outcome in a model can improve statistical efficiency (ie: there’s less unexplained “noise” in the data, so hypothesis tests might yield smaller \(p\)-values)

The code below illustrates these concepts using the “Infant Heart” dataset, a randomized experiment evaluating the impact of two different types of surgery on the long-term development (measured by Mental Development Index, or MDI, scores and Physical Development Index, or PDI, scores) of infants with congenital heart defects:

ih <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")

## Rely on Randomization
m1 <- lm(PDI ~ Treatment, data = ih)
summary(m1)
## 
## Call:
## lm(formula = PDI ~ Treatment, data = ih)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.771  -6.845   0.229  12.229  42.082 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                91.918      1.830  50.240   <2e-16 ***
## TreatmentLow-flow bypass    5.854      2.615   2.239   0.0268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.63 on 141 degrees of freedom
## Multiple R-squared:  0.03432,    Adjusted R-squared:  0.02747 
## F-statistic: 5.011 on 1 and 141 DF,  p-value: 0.02676

## Adjust for demographics
m2 <- lm(PDI ~ Treatment + Weight + Length + Age + Sex, data = ih)
summary(m2)
## 
## Call:
## lm(formula = PDI ~ Treatment + Weight + Length + Age + Sex, data = ih)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.878  -8.865   0.465  10.821  36.457 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)              88.261686  34.916539   2.528   0.0126 *
## TreatmentLow-flow bypass  5.869803   2.592389   2.264   0.0251 *
## Weight                    0.015831   0.008494   1.864   0.0645 .
## Length                   -0.531870   0.502403  -1.059   0.2916  
## Age                      -4.081807   1.865741  -2.188   0.0304 *
## SexMale                   0.879853   2.798724   0.314   0.7537  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.32 on 137 degrees of freedom
## Multiple R-squared:  0.09836,    Adjusted R-squared:  0.06546 
## F-statistic: 2.989 on 5 and 137 DF,  p-value: 0.01359

Both models suggest that PDI scores are on average around 5.8 points higher in the low-flow bypass group. However, the \(p\)-value is slightly smaller in the model that adjusts for the covariates of “Weight”, “Length”, “Age”, and “Sex”. This is because these covariates help explain some of the variability in the outcome.

Question #8: Using the data described in this section, use a regression model to test for a difference in MDI scores across the two surgery groups. How does the \(p\)-value of this test change when the covariates “Weight”, “Length”, “Age”, and “Sex” are included in the model?

\(~\)

Practice (part 3)

In 1905, RJ Gladstone published an article titled “A Study of the Relations of the Brain to the Size of the Head” in the scientific journal Biometrika. Gladstone analyzed data from a sample of 237 adults in an attempt to better understand the relationship between brain weight and head size among males and females of various ages. A description of the variables in this dataset is provided below:

  • BrainWeight - brain weight, measured in grams
  • HeadSize - head volume, measured in \(cm^3\)
  • Sex - recorded as male or female
  • Age - an age group, recorded as “45-” (45 and younger) or “45+” (46 and older)

Question #9: Compare the average “BrainWeight” of males and females in Gladstone’s data using a two-sample \(t\)-test. Report the unadjusted difference in means.

Question #10: Use a multiple regression model to find the adjusted difference in mean “BrainWeight” for males and females after adjusting for “HeadSize”. Report this adjusted difference and provide a 1-2 sentence explanation as to why it is so different from the difference you found in Question #9.

Question #11: Fit a multiple regression model using the formula BrainWeight ~ Sex + HeadSize + AgeGrp. Interpret the effect of “HeadSize” in this model. Be careful to recognize that this is an adjusted effect.

## Load the head size data used in Questions #9-11
hb <- read.csv("https://remiller1450.github.io/data/HeadSize.csv")