Our lecture slides introduced the concept of one-hot
encoding as a strategy to include categorical predictors in a
regression model. Behind the scenes, the lm() function in
R uses the model.matrix() function to set
up dummy variables whenever we include a categorical predictor in a
model formula.
To demonstrate this, consider the following example:
## Hypothetical color data
colors = c("Red", "Red", "Blue", "Green", "Blue")
## Matrix of one-hot variables to be used in regression
model.matrix( ~ colors)
## (Intercept) colorsGreen colorsRed
## 1 1 0 1
## 2 1 0 1
## 3 1 0 0
## 4 1 1 0
## 5 1 0 0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$colors
## [1] "contr.treatment"
Notice a few things:

- The category “Blue” does not receive its own dummy variable. Because it is first alphabetically, it is absorbed into the intercept and serves as the reference category.
- For this model we could easily compare the expected outcomes for “Blue” vs. “Green” and “Blue” vs. “Red”, but we could not directly compare “Green” vs. “Red”.
Recall from a few previous labs that the factor()
function allows us to reorder the “levels” of a categorical variable. We
can use this to change which category is encoded as a reference:
## Reorder the levels of "colors"
ordered_colors = factor(colors, levels = c("Red", "Blue", "Green"))
## New matrix of one-hot variables
model.matrix( ~ ordered_colors)
## (Intercept) ordered_colorsBlue ordered_colorsGreen
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 0 1
## 5 1 1 0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ordered_colors
## [1] "contr.treatment"
The above example changed the reference category to “Red”.
To illustrate the implications for multivariable regression, let’s set up a hypothetical quantitative outcome variable and display the model coefficients for the originally ordered and re-ordered predictor:
## Example quantitative outcomes
example_y_values = c(10, 8, 5, 2, 4)
## Coefs for original order
lm(example_y_values ~ colors)
##
## Call:
## lm(formula = example_y_values ~ colors)
##
## Coefficients:
## (Intercept) colorsGreen colorsRed
## 4.5 -2.5 4.5
## Coefs for re-ordering to make "red" the reference
lm(example_y_values ~ ordered_colors)
##
## Call:
## lm(formula = example_y_values ~ ordered_colors)
##
## Coefficients:
## (Intercept) ordered_colorsBlue ordered_colorsGreen
## 9.0 -4.5 -7.0
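Although the coefficient estimates differ, the two orderings are just different parameterizations of the same underlying fit. As a quick sketch (reusing the toy data above), the fitted values from both models are identical:

```r
## Changing the reference level changes the parameterization,
## not the fitted model: both versions give identical predictions
colors = c("Red", "Red", "Blue", "Green", "Blue")
example_y_values = c(10, 8, 5, 2, 4)

fit_original  = lm(example_y_values ~ colors)
fit_reordered = lm(example_y_values ~ factor(colors, levels = c("Red", "Blue", "Green")))

all.equal(unname(fitted(fit_original)), unname(fitted(fit_reordered)))
## [1] TRUE
```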
For the original order, or the model using colors, we
interpret the following:

- The intercept, 4.5, is the expected outcome for the reference category, “Blue”.
- The coefficient of colorsGreen, -2.5, is the expected difference in outcome for “Green” relative to “Blue”.
- The coefficient of colorsRed, 4.5, is the expected difference in outcome for “Red” relative to “Blue”.

For the re-ordered data, or the model using
ordered_colors, we interpret the following:

- The intercept, 9.0, is the expected outcome for the new reference category, “Red”.
- The coefficient of ordered_colorsBlue, -4.5, is the expected difference in outcome for “Blue” relative to “Red”.
- The coefficient of ordered_colorsGreen, -7.0, is the expected difference in outcome for “Green” relative to “Red”.
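These interpretations can be checked directly against the group means: with a single categorical predictor, the intercept is the reference group’s mean and each dummy coefficient is a difference in group means.

```r
## Group means of the outcome for each color match the coefficient story:
## intercept (Blue) = 4.5, Blue + (-2.5) = Green mean, Blue + 4.5 = Red mean
colors = c("Red", "Red", "Blue", "Green", "Blue")
example_y_values = c(10, 8, 5, 2, 4)
tapply(example_y_values, colors, mean)
##  Blue Green   Red
##   4.5   2.0   9.0
```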
In this lab, you’ll use data collected by a faculty member in the University of Iowa Department of Biostatistics to help them in their housing search. The faculty member web scraped the Johnson County Assessor’s website to get data on all single-family homes sold in Iowa City, IA between Jan 1, 2005 and Sept 9, 2008. We will restrict our analysis to the two most common styles of homes appearing in the dataset: 1-story frame and 2-story frame.
library(ggplot2)
library(dplyr)
ic_homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>%
filter(style %in% c("1 Story Frame", "2 Story Frame"))
Let’s start by looking at a regression model involving a single categorical explanatory variable:
model1 = lm(sale.amount ~ bsmt, data = ic_homes)
The code given above fits a model that uses the explanatory variable
bsmt to predict the outcome sale.amount for
the ic_homes dataset. Printing the model will give us the
estimated coefficients for each dummy variable used to represent this
categorical explanatory variable.
model1
##
## Call:
## lm(formula = sale.amount ~ bsmt, data = ic_homes)
##
## Coefficients:
## (Intercept) bsmt3/4 bsmtCrawl bsmtFull bsmtNone
## 210250 94750 -30250 -11976 -87098
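Because these are treatment contrasts with “1/2” (the first alphanumeric category) as the reference, the model’s predicted mean sale price for each basement style is the intercept plus that style’s coefficient. As a quick sketch reconstructed from the printed estimates:

```r
## Predicted mean sale price per basement style, built from the
## printed coefficients (the intercept is the "1/2" reference group)
intercept = 210250
others = c("3/4" = 94750, "Crawl" = -30250, "Full" = -11976, "None" = -87098)
c("1/2" = intercept, intercept + others)
##    1/2    3/4  Crawl   Full   None
## 210250 305000 180000 198274 123152
```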
Question #1: For this question you should use the
example in this section involving the model
sale.amount ~ bsmt.

- Part A: Use the table() function to display the frequencies of each category of the explanatory variable bsmt. Which category is being used as the reference group in the regression model?
- Part B: The estimated coefficient of bsmtNone is -87098. Provide a brief interpretation of this coefficient.
- Part C: Use the factor() function to reorder bsmt such that "None" is used as the reference category. Print the coefficient estimates of the regression model using the reordered predictor. Hint: You may use the factor() function directly inside of the model formula given to lm().
- Part D: Consider the estimated coefficient of bsmtFull in the model you fit in Part C. Why is this coefficient estimate very different from the one found in the example model (which was -11976)?
On the surface, adding a second predictor (explanatory variable) to
a regression model seems simple, but the impact it has on how model
coefficients are estimated and should be interpreted is substantial. The
regression model below uses both bsmt (categorical) and
area.bsmt (quantitative) to predict a home’s sale
price:
model2 = lm(sale.amount ~ bsmt + area.bsmt, data = ic_homes)
model2$coefficients
## (Intercept) bsmt3/4 bsmtCrawl bsmtFull bsmtNone area.bsmt
## 197139.1366 107860.8634 -17139.1366 -47428.2182 -74379.8297 103.8484
Comparing the estimated coefficients of each dummy variable in this
model with those in model1 (from the previous section), we
notice that most of them are noticeably different. This is because they
are now adjusted effects: each coefficient has been adjusted for
differences in basement area that are present across different styles of
basement.
As an example, the estimated coefficient for bsmtNone of
-74379.83 suggests that a home without any basement is expected to sell
for $74k less than a home with a 1/2 basement after adjusting such
that both homes have the same basement area. From this
interpretation, we can see that adjusting for basement area might be
nonsensical here, as it is impossible for a home without a basement to
have the same basement area as a home with a 1/2 basement. Nevertheless,
this interpretation helps us understand why the adjusted difference is
smaller than the marginal difference.
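The marginal vs. adjusted distinction can be seen in a small sketch (toy data, not the ic_homes dataset): when two groups differ in a covariate that also predicts the outcome, the group coefficient changes once that covariate is added to the model.

```r
## Toy data: group "B" tends to have larger x, and y increases with x,
## so the marginal group difference overstates the adjusted difference
toy = data.frame(grp = rep(c("A", "B"), each = 4), x = 1:8)
toy$y = 2 * toy$x + ifelse(toy$grp == "B", 3, 0)

coef(lm(y ~ grp, data = toy))["grpB"]      ## marginal difference: 11
coef(lm(y ~ grp + x, data = toy))["grpB"]  ## adjusted difference: 3
```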
If you’d like to see what this model looks like visually, I’d
encourage you to use the visreg package. An example
visualization of our model with one categorical and one quantitative
explanatory variable is shown below:
# install.packages("visreg")
library(visreg)
visreg(fit = model2, xvar = "area.bsmt", by = "bsmt", layout = c(3,2))
The shaded grey areas around the fitted lines are 95% confidence interval bands that suggest a range of plausible lines that might be the underlying population-level model.
Question #2: This question focuses on understanding
the relationship between the variables style,
area.living, and sale.amount.

- Part A: Use the group_by() and summarize() functions to find the average sale price of homes in each category of style. Which style has a higher average selling price?
- Part B: Fit a regression model using style as the explanatory variable and sale.amount as the response variable. Interpret the coefficient of the dummy variable style2 Story Frame in this model.
- Part C: Create a graph displaying the relationship between style and area.living. Briefly describe how these variables are associated.
- Part D: Create a graph displaying the relationship between area.living and sale.amount. Briefly describe the relationship between these variables.
- Part E: Fit a regression model using style and area.living as explanatory variables and sale.amount as the response variable. Interpret the coefficient of style2 Story Frame in this model.
When multiple categorical explanatory variables are used in a regression model, the biggest change is how the reference category is defined.
Recall that R designates the reference group according
to the first alphanumeric category when a single categorical explanatory
variable is used. However, if multiple categorical explanatory variables
are used, then the reference group is defined by the combination
of the first alphanumeric categories of each variable.
For example, when the categorical variables bsmt and
style are used, the reference group is
1 Story Frame homes with a 1/2 basement. You
should recognize that this coincides with the upper left cell in the
two-way table relating these variables:
table(ic_homes$bsmt, ic_homes$style)
##
## 1 Story Frame 2 Story Frame
## 1/2 2 2
## 3/4 0 1
## Crawl 1 0
## Full 255 161
## None 89 20
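To see this combined reference group concretely, here is a sketch with two toy categorical variables: the observation in the first alphanumeric category of both variables is the only row with zeros in every dummy column, so it is the combined reference.

```r
## Two toy categorical variables; the "a"/"x" combination is the
## combined reference (no dummy column of its own)
f1 = c("a", "a", "b", "b")
f2 = c("x", "y", "x", "y")
model.matrix(~ f1 + f2)
##   (Intercept) f1b f2y
## 1           1   0   0
## 2           1   0   1
## 3           1   1   0
## 4           1   1   1
## (attributes omitted)
```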
The estimated coefficients in these models still describe expected differences between encoded categories and the reference group, but we need to be more careful about what the reference group entails, and we need to realize that the differences are now adjusted such that the other categorical variable is unchanged.
Question #3: This question focuses on understanding
the relationship between the variables ac,
bsmt, and sale.amount.

- Part A: Fit the regression model sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
- Part B: Interpret the estimated coefficient of bsmtCrawl in your model from Part A.
- Part C: Using the factor() function from the start of this lab, reorder the levels of the variables bsmt and ac as necessary so that the group of homes with no basement and no air conditioning are used as the reference group.
- Part D: Refit the regression model sale.amount ~ bsmt + ac using the reordered variables. Interpret the estimated intercept in this model.
- Part E: Interpret the estimated coefficient of bsmtCrawl in your model from Part D.