Directions (read before starting)
\(~\)
Simple linear regression uses a straight line to model the relationship between a quantitative explanatory variable and a quantitative outcome: \[y = b_0 + b_1 x + e\]
This framework is easily extended to accommodate additional explanatory variables: \[y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_px_p + e\]
Here the outcome variable, \(y\), is modeled by a linear combination of many different explanatory variables, \(\{x_1, \ldots, x_p\}\), and error, \(e\).
\(~\)
One of the simplest multivariable regression models extends simple linear regression by adding a categorical predictor. For the colleges data, we can consider the model: \[\text{Net_Tuition} = b_0 + b_1\text{Cost} + b_2 \text{Private} + e\] Because the variable “Private” takes on the categorical values “Private” and “Public”, it must be re-coded in order to be used in a regression model (ie: multiplied by the coefficient \(b_2\) in the regression equation).
Shown below is how R will re-code this variable for use in the lm() function:
Name | Original | Recoded |
---|---|---|
Abilene Christian University | Private | 0 |
Adelphi University | Private | 0 |
Adrian College | Private | 0 |
AdventHealth University | Private | 0 |
Alabama A & M University | Public | 1 |
Alabama State University | Public | 1 |
We do not need to perform this re-coding ourselves; the lm() function will do it internally when we add a categorical variable to a model formula using +:
colleges = read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
my_model = lm(Net_Tuition ~ Cost + Private, data = colleges)
coef(my_model)
## (Intercept) Cost PrivatePublic
## -4344.9582991 0.4586175 2518.7113517
The name given to the re-coded variable might seem strange, but R takes the variable name “Private” and appends the name of the categorical value that was coded as “1”, which was “Public”, thereby leading to the label “PrivatePublic”.
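If you'd like to see the re-coding R performs, one option (a quick sketch, not something this lab requires) is to inspect the design matrix that R builds for the fitted model:
# The design matrix used by lm(); the "PrivatePublic" column contains the 0/1 re-coding shown above
head(model.matrix(my_model))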
Visually, this model produces two parallel lines, one showing the expected net tuition for private colleges (red) and another showing the expected net tuition for public colleges (blue). The plot is created by the code below. In this example, reporting the conditional effect of “Private” after adjusting for cost might be misleading because there isn’t much overlap in the costs of private and public colleges in the observed data:
library(ggplot2)

# Scatterplot of cost vs. net tuition, with the fitted line for each group overlaid
ggplot(my_model, aes(x = Cost, y = Net_Tuition, color = Private)) + geom_point() +
  geom_abline(slope = coef(my_model)[2], intercept = coef(my_model)[1], color = "red", lwd = 1.5) +
  geom_abline(slope = coef(my_model)[2], intercept = coef(my_model)[1] + coef(my_model)[3], color = "blue", lwd = 1.5) +
  theme_bw() +
  scale_color_manual(values = c("red", "blue"))
This is an example of hidden extrapolation, the act of using a model to draw conclusions about data that aren’t realistic or representative of the typical observation (ie: a private and a public college with the same cost).
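One way to see how little overlap there is (a brief sketch, not part of the original example) is to compare the observed range of costs within each group:
# Observed range of Cost for private and public colleges
tapply(colleges$Cost, colleges$Private, range)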
\(~\)
Next, let’s see how we’d use the fitted regression model to find the expected net tuition of a college with a cost of \(\$40,000\): \[-4344.96 + 0.459 \times 40000 = 14015.04\]

- Because private colleges are coded as 0 in the re-coded variable “PrivatePublic”, this value (\(\$14015.04\)) would be the predicted net tuition, since the coefficient of “PrivatePublic” is multiplied by zero in the regression equation when the college is “Private”.
- Because public colleges are coded as 1, we’d need to add the coefficient of “PrivatePublic”, or \(2518.71\), thereby making the predicted net tuition \(14015.04 + 2518.71 = 16533.75\) for a public college whose cost is \(\$40,000\).
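These same predictions can also be obtained in R using the predict() function. Below is a minimal sketch (the data frame name new_colleges is just chosen for this example); the results will differ from the hand calculations above by a few dollars because those calculations used rounded coefficients:
# Predicted net tuition for a private and a public college that each cost $40,000
new_colleges = data.frame(Cost = c(40000, 40000), Private = c("Private", "Public"))
predict(my_model, newdata = new_colleges)
\(~\)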
Our previous example considered a binary categorical predictor, which is easily re-coded as 0 or 1.
To demonstrate how a non-binary categorical variable can be used in a regression model, let’s filter the colleges data to include only schools in Arkansas, California, Iowa, and Florida:
library(dplyr)

colleges_subset = colleges %>% filter(State %in% c("AR", "CA", "IA", "FL"))
After filtering, the variable “State” now contains 4 possible categorical values. However, to re-code this variable for use in a multivariable regression model we’ll only need 3 re-coded variables:
Name | Original | StateCA | StateFL | StateIA |
---|---|---|---|---|
AdventHealth University | FL | 0 | 1 | 0 |
Arkansas State University-Main Campus | AR | 0 | 0 | 0 |
Ave Maria University | FL | 0 | 1 | 0 |
Azusa Pacific University | CA | 1 | 0 | 0 |
Barry University | FL | 0 | 1 | 0 |
Biola University | CA | 1 | 0 | 0 |
Buena Vista University | IA | 0 | 0 | 1 |
One value of the original variable, in this case “AR”, is used as a reference category. Its effect is contained in the model’s intercept.
The coefficients of the re-coded variables “StateCA”, “StateFL”, and “StateIA” describe the expected difference in net tuition of colleges in these states relative to the reference state:
my_model = lm(Net_Tuition ~ Cost + State, data = colleges_subset)
coef(my_model)
## (Intercept) Cost StateCA StateFL StateIA
## -4616.6671717 0.4373695 2851.8788624 2081.2957809 -591.5686924
In this example:

- Colleges in California are expected to have a net tuition that is \(\$2851.88\) higher than colleges in Arkansas with the same cost.
- Colleges in Florida are expected to have a net tuition that is \(\$2081.30\) higher than colleges in Arkansas with the same cost.
- Colleges in Iowa are expected to have a net tuition that is \(\$591.57\) lower than colleges in Arkansas with the same cost.
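By default R uses the first category in alphabetical order (“AR”) as the reference. If you wanted a different reference state, a minimal sketch is shown below (the name colleges_relevel is just chosen for this example, and a copy is used so the original data frame is unchanged); the coefficients would then describe differences relative to California:
# Copy the data, then make "CA" the reference category instead of "AR"
colleges_relevel = colleges_subset
colleges_relevel$State = relevel(factor(colleges_relevel$State), ref = "CA")
coef(lm(Net_Tuition ~ Cost + State, data = colleges_relevel))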
Below we can see that if we do not adjust for cost, the expected difference in net tuition between Arkansas and California is much larger:
## Model without Cost
my_model = lm(Net_Tuition ~ State, data = colleges_subset)
coef(my_model)
## (Intercept) StateCA StateFL StateIA
## 8030.214 9214.865 4891.337 5838.906
This is because colleges in California tend to cost more than colleges in Arkansas, and cost is strongly associated with net tuition. By including cost in the model, we are able to estimate a difference in net tuition where colleges in California and Arkansas are “forced” to have the same cost.
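If you'd like to check this, one quick sketch is to compare the average cost of colleges in each state within the filtered data:
# Average cost of colleges in each state (filtered data)
tapply(colleges_subset$Cost, colleges_subset$State, mean)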
\(~\)
At this point you should begin working independently with your assigned partner(s).
The questions during this lab will ask you to explore the “Iowa City Home Sales” data:
IC_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
This data set was assembled by a Professor of Biostatistics at the University of Iowa using information scraped from the Johnson County Assessor for homes sold between Jan 1, 2005 and Sept 9, 2008.
You’ll be asked to use the following variables:
- sale.amount - The amount the home sold for, which we’ll use as the response variable.
- assessed - The assessed value of the home.
- style - The style of the home (ie: is it 1-story or 2-story? is it brick or frame? etc.)
- built - The year the home was originally constructed.
- area.living - The total amount of livable space in the home, in square feet.
- ac - Whether the home has central air conditioning (either “Yes” or “No”).

The following code is used to select just these variables:
library(dplyr)
ICH = IC_homes %>% select(sale.amount, assessed, style, built, area.living, ac)
You may use this data frame, ICH, for the questions that follow.
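Before starting, it can be helpful to glance at the structure of this data frame (an optional check, not required by any question):
# Preview the variables, their types, and a few values
str(ICH)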
\(~\)
The following questions each involve a single explanatory variable and the response variable sale.amount. Statisticians typically build their models gradually, starting out simple and only including additional control variables or predictors as necessary.
Question #1: How much should you expect to pay per square-foot of living space in a home?

- Part A: Create a scatter plot that displays the explanatory variable area.living on the x-axis and the response variable sale.amount on the y-axis. Add a loess smoother and briefly describe the relationship between these two variables.
- Part B: Fit a simple linear regression model that uses area.living to predict sale.amount. Report and briefly interpret the \(R^2\) value of this model.
- Part C: Report and interpret the estimated slope coefficient describing the relationship between sale.amount and area.living.

\(~\)
Question #2: Do newer 1-story homes cost more than newer 2-story homes?

- Part A: Use the filter() function to filter the data to only include homes that were built in 1990 or later whose style is either "1 Story Frame" or "2 Story Frame". Display a set of side-by-side boxplots that show the conditional distributions of sale.amount for each home style in this filtered data set. Briefly describe the relationship you see in this graph.
- Part B: Fit a simple linear regression model that uses style to predict sale.amount using the filtered data set you created in Part A. Report and interpret both of the estimated coefficients in this model.
- Part C: Based upon your model from Part B, indicate which style you’d expect to have a larger mean.

\(~\)
All of the following questions will involve models that contain 1 quantitative predictor and 1 categorical predictor. Be careful to acknowledge that estimated coefficients in these models are conditional effects.
Question #3: What is the adjusted difference in sale price between newer 1-story and 2-story homes after controlling for differences in living area?

- Part A: Create a data set using the filter() function that contains only homes built in 1990 or later of the style "1 Story Frame" or "2 Story Frame". Using this same data set, fit a multivariable linear regression model that uses the explanatory variables style and area.living to predict sale.amount.
- Part B: Report and interpret the estimated coefficient of area.living. Be careful to acknowledge that this model is controlling for style.
- Part C: Report and interpret the estimated coefficient of the re-coded variable style2 Story Frame.
- Part D: Compare the estimated coefficient of style2 Story Frame from Part C of this question with the estimated coefficient of style2 Story Frame you found in Part B of Question #2 (using a single predictor regression). How do you explain the difference between these two coefficients?

\(~\)
Question #4: How does adjusting for year built impact the relationship between whether a home has central air conditioning and its sale price?

- Part A: Using the data frame ICH, fit a regression model with ac as the single explanatory variable and sale.amount as the response variable. Report and interpret the estimated coefficient for the re-coded variable acYes.
- Part B: Fit a multivariable regression model with both ac and built as explanatory variables and the response variable sale.amount. Interpret the coefficient of the re-coded variable acYes in this model, being careful to acknowledge that built is also included in the model.
- Part C: Briefly explain why the estimated coefficient of acYes is larger after controlling for built. Hint: Think about whether older or newer homes are more likely to have central air, and whether older or newer homes are more likely to sell for high prices.