Directions (read before starting)
\(~\)
Our previous lab introduced multivariable regression models: \[y = b_0 + b_1x_1 + b_2x_2 + \ldots b_px_p + e\]
Here an outcome variable, \(y\), is modeled by a linear combination of many different explanatory variables, \(\{x_1, \ldots, x_p\}\), and error.
The previous lab focused on the implications of adding categorical predictors to a model with a single quantitative predictor. This lab will explore models involving multiple quantitative predictors.
\(~\)
For the colleges data set, consider modeling the response variable
Net_Tuition
using two explanatory variables,
Cost
and Avg_Fac_Salary
.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
We can visualize these data using a 3-dimensional scatter plot:
plot_ly() %>%
add_trace(type = "scatter3d", x = ~colleges$Cost, y = ~colleges$Avg_Fac_Salary,
z = ~colleges$Net_Tuition, color = I("black"), marker = list(size = 2)) %>%
layout(scene = list(xaxis = list(title = "Cost"),
yaxis = list(title = "Faculty Salary"),
zaxis = list(title = "Net Tuition")))
Our regression model has the form: \[\text{Net Tuition} = b_0 + b_1\text{Cost} +
b_2\text{Faculty Salary} + e\]
We can see that this model involves an intercept, \(b_0\), and two slope coefficients,
\(b_1\) and \(b_2\). Together these form a plane of
best fit, which is similar to the line of best fit we get in simple
linear regression:
We can rotate this 3-d visualization to see the estimated slope coefficients in each variable.
We can confirm this by printing the estimated coefficients:
model <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary, data = colleges)
coef(model)
## (Intercept) Cost Avg_Fac_Salary
## -5.711352e+03 3.631916e-01 7.917021e-02
\(~\)
It’s extremely important to recognize that the estimated coefficients in a multivariable model are adjusted by the other variables in the model. This is because the regression plane is fit such that it minimizes the vertical distance between the plane and the observed values of the response variable:
Mathematically, this is not equivalent to minimizing the vertical distance between the simple linear regression line and the observed outcomes in each predictor separately (except for the extremely rare case where both predictors are perfectly independent).
We can see the difference between the adjusted and unadjusted effects by looking at the estimated coefficients of separate simple linear regressions:
## Simple linear reg w/ only "Cost" as a predictor
model_cost_only <- lm(Net_Tuition ~ Cost, data = colleges)
coef(model_cost_only)
## (Intercept) Cost
## -949.6135998 0.3949237
## Simple linear reg w/ only "Adm_Rate" as a predictor
model_adm_only <- lm(Net_Tuition ~ Avg_Fac_Salary, data = colleges)
coef(model_adm_only)
## (Intercept) Avg_Fac_Salary
## 2338.7006625 0.1516073
Omitted variable bias describes the biased conclusions that can be reached when a relevant variable is left out of a model. The bias occurs by the model attributing some of the effects of the missing variable to explanatory variables that were included in the model that are correlated with the missing variable.
\(~\)
At this point you should begin working independently with your assigned partner(s).
In this lab you’ll continue working with the Iowa City home sales data that were introduced in our previous lab:
IC_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
Recall that this data set was assembled by a Statistics Professor at the University of Iowa using information scraped from the Johnson County Assessor for homes sold between Jan 1, 2005 and Sept 9, 2008.
You’ll be asked to use the following variables:
sale.amount
- The amount the home sold for, which we’ll
use as the response variable.assessed
- The assessed value of the home.style
- The style of the home (ie: is it 1-story or
2-story? is it brick or frame? etc.)bedrooms
- The number of bedrooms in the homearea.living
- The total amount of livable space in the
home in square-feetac
- Whether the home has central air conditioning
(either “Yes” or “No”)The following code selects just these variables:
ICH = IC_homes %>% select(sale.amount, assessed, style, bedrooms, area.living, ac)
You should use this data frame, ICH
, for the questions
that follow.
\(~\)
Question 1: This question focuses on understanding
the relationship between explanatory variables bedrooms
and
area.living
, and the response variable
sale.amount
.
bedrooms
to predict the
response variable sale.amount
. Interpret the effect of an
additional bedroom on the expected sale price of a home.bedrooms
and
area.living
to predict the response variable
sale.amount
. Interpret the adjusted effect of an
additional bedroom on the expected sale price of a home after
controlling for living area.bedrooms
and area.living
and include a “loess”
smoother on the graph to more clearly illustrate the relationship
between these variables. Additionally, suppose we know that
area.living
has a strong positive relationship with
sale.amount
. Given this information, how does the
relationship you see in your scatter plot relate to the concept of
omitted variable bias in the simple linear regression model you
fit in Part A? Briefly explain.Question 2: This question looks at the difference in price of 1-story and 2-story frame houses after adjusting for factors like living area, number of bedrooms, etc.
filter()
function
(from the dplyr
package) to filter the data to only include
homes whose style
is either "1 Story Frame"
or
"2 Story Frame"
.table()
and
prop.table()
functions and to compare the conditional
distributions of the variable bedrooms
in 1-story frame and
2-story frame homes. Which style tends to have fewer
bedrooms?group_by()
and
summarize()
functions (from the dplyr
package)
to find and compare the conditional means and medians of the variable
area.living
in 1-story frame and 2-story frame homes. Which
style tends to have smaller living areas?style
as the only explanatory variable and
sale.amount
as the response. Provide a 1-sentence
interpretation of the estimated coefficient of the re-coded variable
style2 Story Frame
in this model.style
,
area.living
, and bedrooms
. Provide a
1-sentence interpretation of the estimated coefficient of the re-coded
variable style2 Story Frame
in this model.style
and the other explanatory
variables (ie: area.living
and bedrooms
),
along with the fact that more bedrooms and larger living areas are
associated with higher sale amounts, explain why the model in Part D
seems to suggest that 2-story homes are more expensive than 1-story
homes while the model in Part E seems to suggest the opposite.Multivariable regression is a nuanced topic that we will revisit later in the course after we’ve covered statistical inference. For the moment, our primary focus is addressing the question:
A few secondary questions we can begin to consider are:
These secondary questions can be addressed via \(R^2\) and adjusted \(R^2\), which were first introduced in Lab 5 (correlation and regression):
Recall that we can find a fitted model’s \(R^2\) and adjusted \(R^2\) using the summary()
function:
## Example using colleges data
model <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary, data = colleges)
summary(model)
##
## Call:
## lm(formula = Net_Tuition ~ Cost + Avg_Fac_Salary, data = colleges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19991.8 -2130.7 -269.2 1812.6 19007.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.711e+03 4.270e+02 -13.37 <2e-16 ***
## Cost 3.632e-01 7.454e-03 48.73 <2e-16 ***
## Avg_Fac_Salary 7.917e-02 5.258e-03 15.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3606 on 1092 degrees of freedom
## Multiple R-squared: 0.7501, Adjusted R-squared: 0.7496
## F-statistic: 1639 on 2 and 1092 DF, p-value: < 2.2e-16
In this example, the explanatory variables Cost
and
Avg_Fac_Salary
can explain 75% of the variation in the
response variable Net_Tuition
. This implies that these
factors explain most of differences in net tuition that exist across
colleges.
We use adjusted \(R^2\) to decide whether including other explanatory variables can explain an even larger amount of the differences in net tuition:
## Example 1 - a variable that improves adj-R2
model_ex1 <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary + Private , data = colleges)
summary(model_ex1)
##
## Call:
## lm(formula = Net_Tuition ~ Cost + Avg_Fac_Salary + Private, data = colleges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21154.5 -2098.9 -264.3 1716.7 20333.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.224e+03 5.564e+02 -7.591 6.80e-14 ***
## Cost 3.055e-01 1.581e-02 19.324 < 2e-16 ***
## Avg_Fac_Salary 9.864e-02 7.035e-03 14.021 < 2e-16 ***
## PrivatePublic -1.972e+03 4.778e+02 -4.128 3.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3580 on 1091 degrees of freedom
## Multiple R-squared: 0.7539, Adjusted R-squared: 0.7533
## F-statistic: 1114 on 3 and 1091 DF, p-value: < 2.2e-16
## Example 2 - a variable that doesn't improve adj-R2
model_ex2 <- lm(Net_Tuition ~ Cost + Avg_Fac_Salary + FourYearComp_Males , data = colleges)
summary(model_ex2)
##
## Call:
## lm(formula = Net_Tuition ~ Cost + Avg_Fac_Salary + FourYearComp_Males,
## data = colleges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19813.7 -2156.1 -251.8 1774.6 18720.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.725e+03 4.273e+02 -13.398 <2e-16 ***
## Cost 3.591e-01 8.622e-03 41.645 <2e-16 ***
## Avg_Fac_Salary 7.540e-02 6.585e-03 11.451 <2e-16 ***
## FourYearComp_Males 9.448e+02 9.927e+02 0.952 0.341
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3607 on 1091 degrees of freedom
## Multiple R-squared: 0.7503, Adjusted R-squared: 0.7496
## F-statistic: 1093 on 3 and 1091 DF, p-value: < 2.2e-16
In these examples, you should notice:
Net_Tuition ~ Cost + Avg_Fac_Salary
that was shown earlier.
Private
is worth including as a predictor, while the
variable FourYearComp_Males
is not worth
including.We can confirm this by looking at the coefficient of
FourYearComp_Males
in the fitted model from in Example
2.
coef(model_ex2)
## (Intercept) Cost Avg_Fac_Salary FourYearComp_Males
## -5.725169e+03 3.590677e-01 7.539841e-02 9.447705e+02
Question 3: In this question you’ll use \(R^2\) and adjusted \(R^2\) to further explore the home price models from Part 1 of the lab.
R
formula
sale.amount ~ style + area.living + bedrooms
fit to the
subset of data including only homes whose style
is either
"1 Story Frame"
or "2 Story Frame"
(recall
that you already fit this model in Part E of Question 2).assessed
. Add this variable to the model in Part A and
briefly describe how it impacts \(R^2\)
and adjusted \(R^2\). Is it
worthwhile to include this variable as a predictor of sale price if our
goal is to make the most accurate predictions we possibly can?