
Onboarding

Our lecture slides introduced the concept of one-hot encoding as a strategy to include categorical predictors in a regression model. Behind the scenes, the lm() function in R uses the model.matrix() function to set up dummy variables whenever a categorical predictor appears in a model formula.

To demonstrate this, consider the following example:

## Hypothetical color data
colors = c("Red", "Red", "Blue", "Green", "Blue")

## Matrix of one-hot variables to be used in regression
model.matrix( ~ colors)
##   (Intercept) colorsGreen colorsRed
## 1           1           0         1
## 2           1           0         1
## 3           1           0         0
## 4           1           1         0
## 5           1           0         0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$colors
## [1] "contr.treatment"

Notice a few things:

  1. The model’s intercept is represented by a column of all ones.
  2. The category that comes first alphanumerically (here, “Blue”) is designated as the reference group.

For this model we could easily compare the expected outcomes for “Blue” vs. “Green” and “Blue” vs. “Red”, but we could not directly compare “Green” vs. “Red”.
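
If you’re ever unsure which category will serve as the reference, you can check the ordering of the levels R will assign:

## Levels are sorted alphanumerically by default; the first ("Blue") is the reference
levels(factor(colors))
## [1] "Blue"  "Green" "Red"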

Recall from a few previous labs that the factor() function allows us to reorder the “levels” of a categorical variable. We can use this to change which category is encoded as a reference:

## Reorder the levels of "colors"
ordered_colors = factor(colors, levels = c("Red", "Blue", "Green"))

## New matrix of one-hot variables
model.matrix( ~ ordered_colors)
##   (Intercept) ordered_colorsBlue ordered_colorsGreen
## 1           1                  0                   0
## 2           1                  0                   0
## 3           1                  1                   0
## 4           1                  0                   1
## 5           1                  1                   0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ordered_colors
## [1] "contr.treatment"

The above example changed the reference category to “Red”.
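
If you only care about which category serves as the reference (rather than the full ordering), base R’s relevel() function is a convenient shortcut that moves a single level to the front:

## Equivalent way to make "Red" the reference; the remaining levels stay alphabetical
relevel(factor(colors), ref = "Red")
## [1] Red   Red   Blue  Green Blue
## Levels: Red Blue Green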

To illustrate the implications for multivariable regression, let’s set up a hypothetical quantitative outcome variable and display the model coefficients under both the original and the re-ordered predictor:

## Example quantitative outcomes
example_y_values = c(10, 8, 5, 2, 4)

## Coefs for original order
lm(example_y_values ~ colors)
## 
## Call:
## lm(formula = example_y_values ~ colors)
## 
## Coefficients:
## (Intercept)  colorsGreen    colorsRed  
##         4.5         -2.5          4.5
## Coefs for re-ordering to make "red" the reference
lm(example_y_values ~ ordered_colors)
## 
## Call:
## lm(formula = example_y_values ~ ordered_colors)
## 
## Coefficients:
##         (Intercept)   ordered_colorsBlue  ordered_colorsGreen  
##                 9.0                 -4.5                 -7.0

For the original ordering (i.e., the model using colors), we interpret the following:

  • 4.5 (the intercept) is the expected outcome for “Blue”
  • “Green” is expected to be 2.5 units less than “Blue”
  • “Red” is expected to be 4.5 units more than “Blue”

For the re-ordered data (i.e., the model using ordered_colors), we interpret the following:

  • 9.0 (the intercept) is the expected outcome for “Red”
  • “Blue” is expected to be 4.5 units less than “Red”
  • “Green” is expected to be 7.0 units less than “Red”
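
It’s worth emphasizing that re-ordering the levels changes only the parameterization, not the fit. Both models describe the same three group means, which we can compute directly:

## Group means of the outcome within each color
## (Blue = 4.5, Green = 2.0, Red = 9.0)
tapply(example_y_values, colors, mean)

Each coefficient in either model is just a difference between two of these means; for example, the intercept of the re-ordered model is the “Red” mean (9.0), and the ordered_colorsBlue coefficient is 4.5 − 9.0 = −4.5.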


Lab

In this lab, you’ll use data collected by a faculty member in the University of Iowa Department of Biostatistics to help them in their housing search. The faculty member web-scraped the Johnson County Assessor’s website to get data on all single-family homes sold in Iowa City, IA between Jan 1, 2005 and Sept 9, 2008. We will restrict our analysis to the two most common styles of homes appearing in the dataset: 1-story frame and 2-story frame.

library(ggplot2)
library(dplyr)

ic_homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>% 
              filter(style %in% c("1 Story Frame", "2 Story Frame"))


One Categorical Predictor

Let’s start by looking at a regression model involving a single categorical explanatory variable:

model1 = lm(sale.amount ~ bsmt, data = ic_homes)

The code given above fits a model that uses the explanatory variable bsmt to predict the outcome sale.amount for the ic_homes dataset. Printing the model will give us the estimated coefficients for each dummy variable used to represent this categorical explanatory variable.

model1
## 
## Call:
## lm(formula = sale.amount ~ bsmt, data = ic_homes)
## 
## Coefficients:
## (Intercept)      bsmt3/4    bsmtCrawl     bsmtFull     bsmtNone  
##      210250        94750       -30250       -11976       -87098
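
If you’d like to confirm exactly which dummy variables lm() constructed for this model, you can apply the model.matrix() function to the fitted model itself (output omitted here):

## First few rows of the dummy variable (design) matrix used by model1
head(model.matrix(model1))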

Question #1: For this question you should use the example in this section involving the model sale.amount ~ bsmt.

  • Part A: Use the table() function to display the frequencies of each category of the explanatory variable bsmt. Which category is being used as the reference group in the regression model?
  • Part B: The estimated coefficient of bsmtNone is -87098. Provide a brief interpretation of this coefficient.
  • Part C: Use the factor() function to reorder bsmt such that "None" is used as the reference category. Print the coefficient estimates of the regression model using the reordered predictor. Hint: You may use the factor() function directly inside of the model formula given to lm().
  • Part D: Interpret the estimated coefficient of bsmtFull in the model you fit in Part C. Why is this coefficient estimate very different from the one found in the example model (which was -11976)?


One Categorical and One Quantitative Predictor

On the surface, adding a second predictor (explanatory variable) to a regression model seems simple, but the impact it has on how model coefficients are estimated and should be interpreted is substantial. The regression model below uses both bsmt (categorical) and area.bsmt (quantitative) to predict a home’s sale price:

model2 = lm(sale.amount ~ bsmt + area.bsmt, data = ic_homes)
model2$coefficients
## (Intercept)     bsmt3/4   bsmtCrawl    bsmtFull    bsmtNone   area.bsmt 
## 197139.1366 107860.8634 -17139.1366 -47428.2182 -74379.8297    103.8484

Comparing the estimated coefficients of each dummy variable in this model with those in model1 (from the previous section), we notice that most of them are noticeably different. This is because they are now adjusted effects: each describes a difference after accounting for the differences in basement area that are present across the different styles of basement.

As an example, the estimated coefficient for bsmtNone of -74379.83 suggests that a home without any basement is expected to sell for roughly $74k less than a home with a 1/2 basement after adjusting such that both homes have the same basement area. From this interpretation, we can see that adjusting for basement area might be nonsensical here, as it is impossible for a home without a basement to have the same basement area as a home with a 1/2 basement. Nevertheless, this interpretation helps us understand why the adjusted difference is smaller than the marginal difference.
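
One way to see this adjustment in action is to compare the model’s predictions for two hypothetical homes that share the same basement area but differ in basement style. The data frame below is purely illustrative (the shared area of 0 is an arbitrary choice); because the model is linear, the difference between the two predictions equals the bsmtNone coefficient regardless of the common area used:

## Two hypothetical homes: identical basement area, different basement styles
hypothetical_homes = data.frame(bsmt = c("1/2", "None"), area.bsmt = c(0, 0))

## The difference between these predictions equals the bsmtNone coefficient
predict(model2, newdata = hypothetical_homes)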

If you’d like to see what this model looks like visually, I’d encourage you to use the visreg library. An example visualization of our model with one categorical and one quantitative explanatory variable is shown below:

# install.packages("visreg")
library(visreg)
visreg(fit = model2, xvar = "area.bsmt", by = "bsmt", layout = c(3,2))

The shaded grey areas around the fitted lines are 95% confidence interval bands, indicating a range of plausible positions for the underlying population-level regression line.
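
A related numeric summary comes from the confint() function, which reports 95% confidence intervals for each coefficient in the model (output omitted here):

## 95% confidence intervals for the coefficients of model2
confint(model2)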

Question #2: This question focuses on understanding the relationship between the variables style, area.living, and sale.amount.

  • Part A: Use the group_by() and summarize() functions to find the average sale price of homes in each category of style. Which style has a higher average selling price?
  • Part B: Fit a regression model using style as the explanatory variable and sale.amount as the response variable. Interpret the coefficient of the dummy variable style2 Story Frame in this model.
  • Part C: Create a data visualization showing the relationship between style and area.living. Briefly describe how these variables are associated.
  • Part D: Use Pearson’s correlation coefficient to quantify the strength of linear association between area.living and sale.amount. Briefly describe the relationship between these variables.
  • Part E: Fit a regression model that uses both style and area.living as explanatory variables and sale.amount as the response variable. Interpret the coefficient of style2 Story Frame in this model.
  • Part F: Explain why the model in Part E seems to suggest that 2-story homes are less expensive than 1-story homes, while the model in Part B suggests the opposite.
  • Part G: Suppose a prospective first-time home buyer is trying to decide whether they want a 1-story or a 2-story home and they are concerned about the total price they’d need to pay, but they do not care about the home’s size. Is the model from Part B or the model from Part E more useful to this person? Briefly explain.


Multiple Categorical Predictors

When multiple categorical explanatory variables are used in a regression model, the biggest change is how the reference category is defined.

Recall that R designates the reference group according to the first alphanumeric category when a single categorical explanatory variable is used. However, if multiple categorical explanatory variables are used, then the reference group is defined by the combination of the first alphanumeric categories of each variable.

For example, when the categorical variables bsmt and style are used, the reference group is 1 Story Frame homes with a 1/2 basement. You should recognize that this coincides with the upper left cell in the two-way table relating these variables:

table(ic_homes$bsmt, ic_homes$style)
##        
##         1 Story Frame 2 Story Frame
##   1/2               2             2
##   3/4               0             1
##   Crawl             1             0
##   Full            255           161
##   None             89            20

The estimated coefficients in these models still describe expected differences between encoded categories and the reference group, but we need to be more careful about what the reference group entails, and we need to recognize that the differences are now adjusted differences, holding the other categorical variable fixed.
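
As in the onboarding example, you can inspect the design matrix to see exactly which dummy variables such a model will use; each categorical variable contributes one fewer column than it has categories (output omitted here):

## Dummy variables created when both bsmt and style are used as predictors
colnames(model.matrix(~ bsmt + style, data = ic_homes))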

Question #3: This question focuses on understanding the relationship between the variables ac, bsmt, and sale.amount.

  • Part A: Fit the multivariable regression model: sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
  • Part B: Interpret the estimated coefficient for the dummy variable bsmtCrawl in your model from Part A.
  • Part C: Using information about the factor() function from the start of this lab, reorder the levels of the variables bsmt and ac as necessary so that the group of homes with no basement and no air conditioning are used as the reference group.
  • Part D: Using the reordered variables from Part C, refit the model: sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
  • Part E: Interpret the estimated coefficient for the dummy variable bsmtCrawl in your model from Part D.
  • Part F: Using the model from Part D, what is the expected difference in sale price between a home with no basement and no air conditioning and a home that has air conditioning and a full basement?
