Directions (read before starting)
\(~\)
Simple linear regression uses a straight line to model the relationship between a quantitative explanatory variable and a quantitative outcome: \[y = b_0 + b_1 x + e\]
This framework is easily extended to accommodate additional explanatory variables: \[y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_px_p + e\]
Here the outcome variable, \(y\), is modeled by a linear combination of many different explanatory variables, \(\{x_1, \ldots, x_p\}\), and error, \(e\).
\(~\)
One of the simplest multivariable regression models extends simple linear regression by adding a categorical predictor. For the colleges data, we can consider the model: \[\text{Net_Tuition} = b_0 + b_1\text{Cost} + b_2 \text{Private} + e\] Because the variable “Private” takes on the categorical values “Private” and “Public”, it must be re-coded in order to be used in a regression model (ie: multiplied by the coefficient \(b_2\) in the regression equation).
Shown below is how R will re-code this variable for use in the lm() function:
Name | Original | Recoded |
---|---|---|
Abilene Christian University | Private | 0 |
Adelphi University | Private | 0 |
Adrian College | Private | 0 |
AdventHealth University | Private | 0 |
Alabama A & M University | Public | 1 |
Alabama State University | Public | 1 |
We do not need to perform this re-coding ourselves; the lm() function will do it internally when we add a categorical variable to a model formula using +:
colleges = read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
my_model = lm(Net_Tuition ~ Cost + Private, data = colleges)
coef(my_model)
## (Intercept) Cost PrivatePublic
## -4344.9582991 0.4586175 2518.7113517
The name given to the re-coded variable might seem strange, but R takes the variable name “Private” and appends the name of the categorical value that was coded as “1”, which was “Public”, thereby leading to the label “PrivatePublic”.
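If you'd like to see the re-coding R performs, one option (a quick sketch, not something this lab requires) is to inspect the design matrix that R builds for the fitted model:
# The design matrix used by lm(); the "PrivatePublic" column contains the 0/1 re-coding shown above
head(model.matrix(my_model))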
Visually, this model produces two parallel lines, one showing the expected net tuition for private colleges (red) and another showing the expected net tuition for public colleges (blue). The plot is created by the code below. In this example, reporting the conditional effect of “Private” after adjusting for cost might be misleading because there isn’t much overlap in the costs of private and public colleges in the observed data:
library(ggplot2)

# Scatterplot of cost vs. net tuition, with the fitted line for each group overlaid
ggplot(my_model, aes(x = Cost, y = Net_Tuition, color = Private)) + geom_point() +
  geom_abline(slope = coef(my_model)[2], intercept = coef(my_model)[1], color = "red", lwd = 1.5) +
  geom_abline(slope = coef(my_model)[2], intercept = coef(my_model)[1] + coef(my_model)[3], color = "blue", lwd = 1.5) +
  theme_bw() +
  scale_color_manual(values = c("red", "blue"))
This is an example of hidden extrapolation, the act of using a model to draw conclusions about data that aren’t realistic or representative of the typical observation (ie: a private and a public college with the same cost).
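One way to see how little overlap there is (a brief sketch, not part of the original example) is to compare the observed range of costs within each group:
# Observed range of Cost for private and public colleges
tapply(colleges$Cost, colleges$Private, range)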
\(~\)
Next, let’s see how we’d use the fitted regression model to find the expected net tuition of a college with a cost of \(\$40,000\): \[-4344.96 + 0.459 \times 40000 = 14015.04\]

- Because private colleges are coded as 0 in the re-coded variable “PrivatePublic”, this value (\(\$14015.04\)) would be the predicted net tuition, since the coefficient of “PrivatePublic” is multiplied by zero in the regression equation when the college is “Private”.
- Because public colleges are coded as 1, we’d need to add the coefficient of “PrivatePublic”, or \(2518.71\), thereby making the predicted net tuition \(14015.04 + 2518.71 = 16533.75\) for a public college whose cost is \(\$40,000\).
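These same predictions can also be obtained in R using the predict() function. Below is a minimal sketch (the data frame name new_colleges is just chosen for this example); the results will differ from the hand calculations above by a few dollars because those calculations used rounded coefficients:
# Predicted net tuition for a private and a public college that each cost $40,000
new_colleges = data.frame(Cost = c(40000, 40000), Private = c("Private", "Public"))
predict(my_model, newdata = new_colleges)
\(~\)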
Our previous example considered a binary categorical predictor, which is easily re-coded as 0 or 1.
To demonstrate how a non-binary categorical variable can be used in a regression model, let’s filter the colleges data to include only schools in Arkansas, California, Iowa, and Florida:
library(dplyr)

colleges_subset = colleges %>% filter(State %in% c("AR", "CA", "IA", "FL"))
After filtering, the variable “State” now contains 4 possible categorical values. However, to re-code this variable for use in a multivariable regression model we’ll only need 3 re-coded variables:
Name | Original | StateCA | StateFL | StateIA |
---|---|---|---|---|
AdventHealth University | FL | 0 | 1 | 0 |
Arkansas State University-Main Campus | AR | 0 | 0 | 0 |
Ave Maria University | FL | 0 | 1 | 0 |
Azusa Pacific University | CA | 1 | 0 | 0 |
Barry University | FL | 0 | 1 | 0 |
Biola University | CA | 1 | 0 | 0 |
Buena Vista University | IA | 0 | 0 | 1 |
One value of the original variable, in this case “AR”, is used as a reference category. Its effect is contained in the model’s intercept.
The coefficients of the re-coded variables “StateCA”, “StateFL”, and “StateIA” describe the expected difference in net tuition of colleges in these states relative to the reference state:
my_model = lm(Net_Tuition ~ Cost + State, data = colleges_subset)
coef(my_model)
## (Intercept) Cost StateCA StateFL StateIA
## -4616.6671717 0.4373695 2851.8788624 2081.2957809 -591.5686924
In this example:

- Colleges in California are expected to have a net tuition that is \(\$2851.88\) higher than colleges in Arkansas with the same cost.
- Colleges in Florida are expected to have a net tuition that is \(\$2081.30\) higher than colleges in Arkansas with the same cost.
- Colleges in Iowa are expected to have a net tuition that is \(\$591.57\) lower than colleges in Arkansas with the same cost.
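By default R uses the first category in alphabetical order (“AR”) as the reference. If you wanted a different reference state, a minimal sketch is shown below (the name colleges_relevel is just chosen for this example, and a copy is used so the original data frame is unchanged); the coefficients would then describe differences relative to California:
# Copy the data, then make "CA" the reference category instead of "AR"
colleges_relevel = colleges_subset
colleges_relevel$State = relevel(factor(colleges_relevel$State), ref = "CA")
coef(lm(Net_Tuition ~ Cost + State, data = colleges_relevel))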
Below we can see that if we do not adjust for cost, the expected difference in net tuition between Arkansas and California is much larger:
## Model without Cost
my_model = lm(Net_Tuition ~ State, data = colleges_subset)
coef(my_model)
## (Intercept) StateCA StateFL StateIA
## 8030.214 9214.865 4891.337 5838.906
This is because colleges in California tend to cost more than colleges in Arkansas, and cost is strongly associated with net tuition. By including cost in the model, we are able to estimate a difference in net tuition where colleges in California and Arkansas are “forced” to have the same cost.
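If you'd like to check this, one quick sketch is to compare the average cost of colleges in each state within the filtered data:
# Average cost of colleges in each state (filtered data)
tapply(colleges_subset$Cost, colleges_subset$State, mean)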
\(~\)
At this point you should begin working independently with your assigned partner(s).
The questions during this lab will ask you to explore the “Iowa City Home Sales” data:
IC_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
This data set was assembled by a Professor of Biostatistics at the University of Iowa using information scraped from the Johnson County Assessor for homes sold between Jan 1, 2005 and Sept 9, 2008.
You’ll be asked to use the following variables:
- sale.amount - The amount the home sold for, which we’ll use as the response variable.
- assessed - The assessed value of the home.
- style - The style of the home (ie: is it 1-story or 2-story? is it brick or frame? etc.)
- built - The year the home was originally constructed.
- area.living - The total amount of livable space in the home, in square feet.
- ac - Whether the home has central air conditioning (either “Yes” or “No”).

The following code is used to select just these variables:
library(dplyr)
ICH = IC_homes %>% select(sale.amount, assessed, style, built, area.living, ac)
You may use this data frame, ICH, for the questions that follow.
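Before starting, it can be helpful to glance at the structure of this data frame (an optional check, not required by any question):
# Preview the variables, their types, and a few values
str(ICH)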
\(~\)
The following questions each involve a single explanatory variable and the response variable sale.amount. Statisticians typically build their models gradually, starting out simple and only including additional control variables or predictors as necessary.
Question #1: How much should you expect to pay per square-foot of living space in a home?

- Part A: Create a scatter plot that displays the explanatory variable area.living on the x-axis and the response variable sale.amount on the y-axis. Add a loess smoother and briefly describe the relationship between these two variables.
- Part B: Fit a simple linear regression model that uses area.living to predict sale.amount. Report and briefly interpret the \(R^2\) value of this model.
- Part C: Report and interpret the estimated slope coefficient describing the relationship between sale.amount and area.living.

\(~\)
Question #2: Do newer 1-story homes cost more than newer 2-story homes?

- Part A: Use the filter() function to filter the data to only include homes that were built in 1990 or later whose style is either "1 Story Frame" or "2 Story Frame". Display a set of side-by-side boxplots that show the conditional distributions of sale.amount for each home style in this filtered data set. Briefly describe the relationship you see in this graph.
- Part B: Fit a simple linear regression model that uses style to predict sale.amount using the filtered data set you created in Part A. Report and interpret both of the estimated coefficients in this model.
- Part C: Based upon your model from Part B, indicate which style you’d expect to have a larger mean.

\(~\)
All of the following questions will involve models that contain 1 quantitative predictor and 1 categorical predictor. Be careful to acknowledge that estimated coefficients in these models are conditional effects.
Question #3: What is the adjusted difference in sale price between newer 1-story and 2-story homes after controlling for differences in living area?

- Part A: Create a data set using the filter() function that contains only homes built in 1990 or later of the style "1 Story Frame" or "2 Story Frame". Using this same data set, fit a multivariable linear regression model that uses the explanatory variables style and area.living to predict sale.amount.
- Part B: Report and interpret the estimated coefficient of area.living. Be careful to acknowledge that this model is controlling for style.
- Part C: Report and interpret the estimated coefficient of the re-coded variable style2 Story Frame.
- Part D: Compare the estimated coefficient of style2 Story Frame from Part C of this question with the estimated coefficient of style2 Story Frame you found in Part B of Question #2 (using a single predictor regression). How do you explain the difference between these two coefficients?

\(~\)
Question #4: How does adjusting for year built impact the relationship between whether a home has central air conditioning and its sale price?

- Part A: Using the data frame ICH, fit a regression model with ac as the single explanatory variable and sale.amount as the response variable. Report and interpret the estimated coefficient for the re-coded variable acYes.
- Part B: Fit a multivariable regression model with both ac and built as explanatory variables and the response variable sale.amount. Interpret the coefficient of the re-coded variable acYes in this model, being careful to acknowledge that built is also included in the model.
- Part C: Briefly explain why the estimated coefficient of acYes is larger after controlling for built. Hint: Think about whether older or newer homes are more likely to have central air, and whether older or newer homes are more likely to sell for high prices.