Lab #8 - Multivariable Regression (part 2)

Directions (read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Ask for help, clarification, or even just a check-in if anything seems unclear.

$~$

Onboarding

Our previous lab introduced multivariable regression models: \[y = b_0 + b_1x_1 + b_2x_2 + \ldots b_px_p + e\]

Here an outcome variable, $y$, is modeled by a linear combination of many different explanatory variables, $\{x_1, \ldots, x_p\}$, and error.

The previous lab focused on the implications of adding categorical predictors to a model with a single quantitative predictor. This lab will explore models involving multiple quantitative predictors.

$~$

Two quantitative predictors

For the colleges data set, consider modeling the response variable Net_Tuition using two explanatory variables, Cost and Avg_Fac_Salary.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

We can visualize these data using a 3-dimensional scatter plot:

plot_ly() %>%
  add_trace(type = "scatter3d", x = ~colleges$Cost, y = ~colleges$Avg_Fac_Salary, 
            z = ~colleges$Net_Tuition, color = I("black"), marker = list(size = 2)) %>%
  layout(scene = list(xaxis = list(title = "Cost"),
                      yaxis = list(title = "Faculty Salary"),
                      zaxis = list(title = "Net Tuition")))

Our regression model has the form: \[\text{Net Tuition} = b_0 + b_1\text{Cost} + b_2\text{Faculty Salary} + e\]
We can see that this model involves an intercept, $b_0$, and two slope coefficients, $b_1$ and $b_2$. Together these form a plane of best fit, which is similar to the line of best fit we get in simple linear regression:

We can rotate this 3-d visualization to see the estimated slope coefficients in each variable.

Both explanatory variables have a positive relationship with the outcome variable, as higher values are associated with increases in net tuition.
- This is after adjusting for the other variables in the model. So, we can say “Cost” is positively associated with “Net Tuition” after controlling for “Avg_Fac_Salary”.

We can confirm this by printing the estimated coefficients:

model <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary, data = colleges)
coef(model)

##    (Intercept)           Cost Avg_Fac_Salary 
##  -5.711352e+03   3.631916e-01   7.917021e-02

$~$

Adjusted effects and omitted variables

It’s extremely important to recognize that the estimated coefficients in a multivariable model are adjusted by the other variables in the model. This is because the regression plane is fit such that it minimizes the vertical distance between the plane and the observed values of the response variable:

Mathematically, this is not equivalent to minimizing the vertical distance between the simple linear regression line and the observed outcomes in each predictor separately (except for the extremely rare case where both predictors are perfectly independent).

We can see the difference between the adjusted and unadjusted effects by looking at the estimated coefficients of separate simple linear regressions:

## Simple linear reg w/ only "Cost" as a predictor
model_cost_only <- lm(Net_Tuition ~ Cost, data = colleges)
coef(model_cost_only)

##  (Intercept)         Cost 
## -949.6135998    0.3949237

## Simple linear reg w/ only "Adm_Rate" as a predictor
model_adm_only <- lm(Net_Tuition ~ Avg_Fac_Salary, data = colleges)
coef(model_adm_only)

##    (Intercept) Avg_Fac_Salary 
##   2338.7006625      0.1516073

The slope coefficient of “Cost” is slightly larger in the simple linear regression model (0.39 vs. 0.36)
The slope coefficient of “Avg_Fac_Salary” is substantially larger in the simple linear regression model (0.15 vs. 0.08)
- This is because “Cost” and “Avg_Fac_Salary” are correlated, and “Cost” is a good predictor of “Net_Tuition”
- Thus, when the variable “Cost” is omitted from the model it’s relationship with the response variable will be absorbed into the estimated effect of “Avg_Fac_Salary” due to their correlation.

Omitted variable bias describes the biased conclusions that can be reached when a relevant variable is left out of a model. The bias occurs by the model attributing some of the effects of the missing variable to explanatory variables that were included in the model that are correlated with the missing variable.

$~$

Lab

At this point you should begin working independently with your assigned partner(s).

Iowa City Home Sales Data

In this lab you’ll continue working with the Iowa City home sales data that were introduced in our previous lab:

IC_homes =  read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

Recall that this data set was assembled by a Statistics Professor at the University of Iowa using information scraped from the Johnson County Assessor for homes sold between Jan 1, 2005 and Sept 9, 2008.

You’ll be asked to use the following variables:

sale.amount - The amount the home sold for, which we’ll use as the response variable.
assessed - The assessed value of the home.
style - The style of the home (ie: is it 1-story or 2-story? is it brick or frame? etc.)
bedrooms - The number of bedrooms in the home
area.living - The total amount of livable space in the home in square-feet
ac - Whether the home has central air conditioning (either “Yes” or “No”)

The following code selects just these variables:

ICH = IC_homes %>% select(sale.amount, assessed, style, bedrooms, area.living, ac)

You should use this data frame, ICH, for the questions that follow.

$~$

Part 1 - Interpretting Adjusted Effects

Question 1: This question focuses on understanding the relationship between explanatory variables bedrooms and area.living, and the response variable sale.amount.

Part A: Fit a simple linear regression model using only the variable bedrooms to predict the response variable sale.amount. Interpret the effect of an additional bedroom on the expected sale price of a home.
Part B: Fit a multivariable regression model using the variables bedrooms and area.living to predict the response variable sale.amount. Interpret the adjusted effect of an additional bedroom on the expected sale price of a home after controlling for living area.
Part C: Conceptually, how would you explain the large difference between the unadjusted effect (Part A) and the adjusted effect controlling for living area (Part B) of an additional bedroom? Provide a 2-sentence explanation that would be appropriate for a non-statistician.
Part D: Suppose a person is considering a remodel of their home that will reconfigure the upper level to have 4 bedrooms instead of 3 bedrooms. The remodel will not change the overall amount of living space in the home. If this person is interested in estimating the change in value of their home after the remodel, will the model from Part A or Part B be more useful to them? Briefly explain.
Part E: Create a scatter plot using the variables bedrooms and area.living and include a “loess” smoother on the graph to more clearly illustrate the relationship between these variables. Additionally, suppose we know that area.living has a strong positive relationship with sale.amount. Given this information, how does the relationship you see in your scatter plot relate to the concept of omitted variable bias in the simple linear regression model you fit in Part A? Briefly explain.

Question 2: This question looks at the difference in price of 1-story and 2-story frame houses after adjusting for factors like living area, number of bedrooms, etc.

Part A: Use the filter() function (from the dplyr package) to filter the data to only include homes whose style is either "1 Story Frame" or "2 Story Frame".
Part B: Use the table() and prop.table() functions and to compare the conditional distributions of the variable bedrooms in 1-story frame and 2-story frame homes. Which style tends to have fewer bedrooms?
Part C: Use the group_by() and summarize() functions (from the dplyr package) to find and compare the conditional means and medians of the variable area.living in 1-story frame and 2-story frame homes. Which style tends to have smaller living areas?
Part D: Fit a regression model using style as the only explanatory variable and sale.amount as the response. Provide a 1-sentence interpretation of the estimated coefficient of the re-coded variable style2 Story Frame in this model.
Part E: Now fit a regression model that uses three explanatory variables: style, area.living, and bedrooms. Provide a 1-sentence interpretation of the estimated coefficient of the re-coded variable style2 Story Frame in this model.
Part F: Using what you learned about the relationships between style and the other explanatory variables (ie: area.living and bedrooms), along with the fact that more bedrooms and larger living areas are associated with higher sale amounts, explain why the model in Part D seems to suggest that 2-story homes are more expensive than 1-story homes while the model in Part E seems to suggest the opposite.
Part G: Suppose a prospective first-time home buyer is trying to decide whether they want a 1-story or a 2-story home and they are concerned about the total price they’d need to pay, but they do not care about the home’s size or number of bedrooms. Would the model from Part D or the model from Part E be more useful to this person? Briefly explain.

Part 2 - $R^2$ and Adjusted-$R^2$

Multivariable regression is a nuanced topic that we will revisit later in the course after we’ve covered statistical inference. For the moment, our primary focus is addressing the question:

Are we controlling for the necessary variables to negate the impacts of confounding/Simpson’s paradox and prevent omitted variable bias?

A few secondary questions we can begin to consider are:

How well does the entire set of explanatory variables in our model explain the response variable?
Are there additional explanatory variables we can add to the model to improve its fit? (even if they aren’t variables we need to control for)

These secondary questions can be addressed via $R^2$ and adjusted $R^2$, which were first introduced in Lab 5 (correlation and regression):

$R^2$ describes the fraction of the total variability in our response variable that is explained by our model.
Adjusted $R^2$ modifies $R^2$ to address the impacts of over fitting, or the fact that including extraneous explanatory variables can only increase $R^2$, albeit by small amounts if these variables are not associated with the response variable.

Recall that we can find a fitted model’s $R^2$ and adjusted $R^2$ using the summary() function:

## Example using colleges data
model <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary, data = colleges)
summary(model)

## 
## Call:
## lm(formula = Net_Tuition ~ Cost + Avg_Fac_Salary, data = colleges)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19991.8  -2130.7   -269.2   1812.6  19007.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.711e+03  4.270e+02  -13.37   <2e-16 ***
## Cost            3.632e-01  7.454e-03   48.73   <2e-16 ***
## Avg_Fac_Salary  7.917e-02  5.258e-03   15.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3606 on 1092 degrees of freedom
## Multiple R-squared:  0.7501, Adjusted R-squared:  0.7496 
## F-statistic:  1639 on 2 and 1092 DF,  p-value: < 2.2e-16

In this example, the explanatory variables Cost and Avg_Fac_Salary can explain 75% of the variation in the response variable Net_Tuition. This implies that these factors explain most of differences in net tuition that exist across colleges.

We use adjusted $R^2$ to decide whether including other explanatory variables can explain an even larger amount of the differences in net tuition:

## Example 1 - a variable that improves adj-R2
model_ex1 <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary + Private , data = colleges)
summary(model_ex1)

## 
## Call:
## lm(formula = Net_Tuition ~ Cost + Avg_Fac_Salary + Private, data = colleges)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21154.5  -2098.9   -264.3   1716.7  20333.4 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.224e+03  5.564e+02  -7.591 6.80e-14 ***
## Cost            3.055e-01  1.581e-02  19.324  < 2e-16 ***
## Avg_Fac_Salary  9.864e-02  7.035e-03  14.021  < 2e-16 ***
## PrivatePublic  -1.972e+03  4.778e+02  -4.128 3.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3580 on 1091 degrees of freedom
## Multiple R-squared:  0.7539, Adjusted R-squared:  0.7533 
## F-statistic:  1114 on 3 and 1091 DF,  p-value: < 2.2e-16

## Example 2 - a variable that doesn't improve adj-R2
model_ex2 <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary + FourYearComp_Males , data = colleges)
summary(model_ex2)

## 
## Call:
## lm(formula = Net_Tuition ~ Cost + Avg_Fac_Salary + FourYearComp_Males, 
##     data = colleges)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19813.7  -2156.1   -251.8   1774.6  18720.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.725e+03  4.273e+02 -13.398   <2e-16 ***
## Cost                3.591e-01  8.622e-03  41.645   <2e-16 ***
## Avg_Fac_Salary      7.540e-02  6.585e-03  11.451   <2e-16 ***
## FourYearComp_Males  9.448e+02  9.927e+02   0.952    0.341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3607 on 1091 degrees of freedom
## Multiple R-squared:  0.7503, Adjusted R-squared:  0.7496 
## F-statistic:  1093 on 3 and 1091 DF,  p-value: < 2.2e-16

In these examples, you should notice:

$R^2$ went up in both models compared to the simpler model Net_Tuition ~ Cost + Avg_Fac_Salary that was shown earlier.
- However, adjusted $R^2$ only increased in Example 1, suggesting the variable Private is worth including as a predictor, while the variable FourYearComp_Males is not worth including.

We can confirm this by looking at the coefficient of FourYearComp_Males in the fitted model from in Example 2.

coef(model_ex2)

##        (Intercept)               Cost     Avg_Fac_Salary FourYearComp_Males 
##      -5.725169e+03       3.590677e-01       7.539841e-02       9.447705e+02

After adjusting for cost and faculty salary, a 1-unit change in the 4-year completion rate of males (which is akin to going from a 0% completion rate to 100% completion rate!) only leads to an expected increase in net tuition of about $\$944$.
- Equivalently, a 1% increase in the 4-year completion rate of males leads to an expected increase in net tuition of only $\$9.44$ (after adjusting for cost and faculty salary).

Question 3: In this question you’ll use $R^2$ and adjusted $R^2$ to further explore the home price models from Part 1 of the lab.

Part A: Find and interpret the $R^2$ value of the linear regression model defined by the R formula sale.amount ~ style + area.living + bedrooms fit to the subset of data including only homes whose style is either "1 Story Frame" or "2 Story Frame" (recall that you already fit this model in Part E of Question 2).
Part B: Consider a hypothetical explanatory variable $X$ that is not related to sale amount. Would including this variable as a predictor in the model from Part A increase or decrease the $R^2$ value? Would it increase or decrease the adjusted $R^2$ value? Hint: The answer to this question can be found earlier in this section, make sure you’ve read everything carefully.
Part C: Consider the explanatory variable assessed. Add this variable to the model in Part A and briefly describe how it impacts $R^2$ and adjusted $R^2$. Is it worthwhile to include this variable as a predictor of sale price if our goal is to make the most accurate predictions we possibly can?