

Today’s lecture introduced one-hot encoding, a way to represent categorical variables using a collection of binary numerical variables. Behind the scenes, lm() uses the model.matrix() function to set up these binary variables:

colors = c("Red", "Red", "Blue", "Green", "Blue")

model.matrix( ~ colors)
##   (Intercept) colorsGreen colorsRed
## 1           1           0         1
## 2           1           0         1
## 3           1           0         0
## 4           1           1         0
## 5           1           0         0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$colors
## [1] "contr.treatment"

By default, R sorts the categories of a character variable in alphanumeric order and uses the first one as the reference group, which is represented by the model’s intercept rather than by its own binary variable. Here “Blue” comes first, so the output above has no colorsBlue column. We can change this behavior using the “factor” data type, which behaves similarly to the “character” type but stores an explicit ordering of the variable’s “levels”. Below we assign the ordering "Red", "Blue", "Green":

ordered_colors = factor(colors, levels = c("Red", "Blue", "Green"))
model.matrix( ~ ordered_colors)
##   (Intercept) ordered_colorsBlue ordered_colorsGreen
## 1           1                  0                   0
## 2           1                  0                   0
## 3           1                  1                   0
## 4           1                  0                   1
## 5           1                  1                   0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ordered_colors
## [1] "contr.treatment"

Notice how this changed the reference category to “Red”.
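
If only the reference category needs to change, base R's relevel() can be a shorter alternative to listing every level. The sketch below reuses the colors vector from above; the names default_colors and releveled_colors are introduced here only for illustration.

```r
colors <- c("Red", "Red", "Blue", "Green", "Blue")

# factor() with no 'levels' argument sorts the levels alphanumerically
default_colors <- factor(colors)
levels(default_colors)

# relevel() moves one level to the front and keeps the others in their existing order
releveled_colors <- relevel(default_colors, ref = "Red")
levels(releveled_colors)

# The first level is the reference, so it gets no dummy column
colnames(model.matrix(~ releveled_colors))
```

The remaining columns should be the dummy variables for "Blue" and "Green", the same set produced by ordered_colors above.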

Below is an example of the same concept performed inside of the lm() function:

homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## Model w/ the default reference for "bsmt"
lm(sale.amount ~ bsmt, data = homes)
## Call:
## lm(formula = sale.amount ~ bsmt, data = homes)
## Coefficients:
## (Intercept)      bsmt3/4    bsmtCrawl     bsmtFull     bsmtNone  
##      199333       105667       -19333        -3892       -71149

## Change the reference to "None", the natural reference category
lm(sale.amount ~ factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full")), data = homes)
## Call:
## lm(formula = sale.amount ~ factor(bsmt, levels = c("None", "1/2", 
##     "3/4", "Crawl", "Full")), data = homes)
## Coefficients:
##                                                          (Intercept)  
##                                                               128184  
##   factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2  
##                                                                71149  
##   factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4  
##                                                               176816  
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl  
##                                                                51816  
##  factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full  
##                                                                67257

Notice how lm() produces long, hard-to-read coefficient names with this approach, because the entire factor() call becomes part of each name. An alternative is to assign the levels before using lm():

## Overwrite the existing "bsmt" variable
homes$bsmt = factor(homes$bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))

## Fit regression model
lm(sale.amount ~ bsmt, data = homes)
## Call:
## lm(formula = sale.amount ~ bsmt, data = homes)
## Coefficients:
## (Intercept)      bsmt1/2      bsmt3/4    bsmtCrawl     bsmtFull  
##      128184        71149       176816        51816        67257
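
Changing the reference category relabels the coefficients but does not change the fitted group means. As a quick check, the sketch below copies the coefficients from the first fit (reference "1/2") and recovers the coefficients of the releveled fit by arithmetic. The object names are illustrative only.

```r
# Coefficients copied from the output of the first fit (reference = "1/2")
intercept_half <- 199333
diff_from_half <- c("1/2" = 0, "3/4" = 105667, "Crawl" = -19333,
                    "Full" = -3892, "None" = -71149)

# Fitted mean sale price for each basement category
group_means <- intercept_half + diff_from_half

# With "None" as the reference, the intercept is the "None" mean and
# every other coefficient is a difference from that mean
group_means[["None"]]
group_means[c("1/2", "3/4", "Crawl", "Full")] - group_means[["None"]]
```

These values match the intercept (128184) and the coefficients (71149, 176816, 51816, 67257) printed for the releveled model.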



In this lab you’ll use a subset of the “Iowa City Homes” data set, which was obtained from the Johnson County assessor and documents all homes sold in Iowa City, IA between Jan 1, 2005 and Sept 9, 2008. The code below uses the filter() function to keep only the homes whose style was 1-story frame or 2-story frame, the two most prevalent styles of single-family detached homes:


library(dplyr)  ## provides filter() and the %>% pipe

ic_homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>% 
              filter(style %in% c("1 Story Frame", "2 Story Frame"))

You should use the data stored in ic_homes throughout the remainder of the lab.


One Categorical and One Quantitative Explanatory Variable

For the following questions you should use the ic_homes data frame created in the previous section. We already learned how to fit regression models using the lm() function. The only thing to be aware of is that when fitting a multivariable linear regression model we separate the explanatory variables using a + in the model formula.

As an example, the code below fits a model that uses the explanatory variables area.bsmt and area.garage1 and the response variable sale.amount. This model is stored in an object named example_model:

example_model = lm(sale.amount ~ area.bsmt + area.garage1, data = ic_homes)
coef(example_model) ## See the model's coefficients
##  (Intercept)    area.bsmt area.garage1 
## 127796.56509     64.71159    128.19038

This example also uses the coef() function to print the model’s estimated coefficients. We could write out the fitted model in this example as: \[\widehat{\text{sale.amount}} = 127796.57 + 64.71 \times \text{area.bsmt} + 128.19 \times \text{area.garage1}\]
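
To see how the fitted equation produces a prediction, the sketch below plugs in a hypothetical home with a 1000 square foot basement and a 400 square foot garage. These two values are made up for illustration; the coefficients are copied from the coef() output above.

```r
# Coefficients copied from the coef() output above
b <- c(intercept = 127796.56509, area.bsmt = 64.71159, area.garage1 = 128.19038)

# Hypothetical home (values chosen only for illustration)
bsmt_sqft   <- 1000
garage_sqft <- 400

# Intercept plus each slope times its explanatory variable
predicted <- b[["intercept"]] +
  b[["area.bsmt"]] * bsmt_sqft +
  b[["area.garage1"]] * garage_sqft
round(predicted, 2)  # 243784.31
```

With the fitted model in hand, predict(example_model, newdata = data.frame(area.bsmt = 1000, area.garage1 = 400)) should give the same value up to rounding of the copied coefficients.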

Question #1: This question focuses on understanding the relationship between the variables style, area.living, and sale.amount.

  • Part A: Use the group_by() and summarize() functions to find the average sale price of homes in each category of style. Which style has a higher average selling price?
  • Part B: Fit a regression model using style as the explanatory variable and sale.amount as the response variable. Interpret the coefficient of the dummy variable style2 Story Frame in this model.
  • Part C: Create a data visualization showing the relationship between style and area.living. Briefly describe how these variables are associated.
  • Part D: Use Pearson’s correlation coefficient to quantify the strength of linear association between area.living and sale.amount. Briefly describe the relationship between these variables.
  • Part E: Fit a regression model that uses both style and area.living as explanatory variables and sale.amount as the response variable. Interpret the coefficient of style2 Story Frame in this model.
  • Part F: Explain why the model in Part E seems to suggest that 2-story homes are less expensive than 1-story homes, while the model in Part B suggests the opposite.
  • Part G: Suppose a prospective first-time home buyer is trying to decide whether they want a 1-story or a 2-story home and they are concerned about the total price they’d need to pay, but they do not care about the home’s size. Is the model from Part B or the model from Part E more useful to this person? Briefly explain.
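
As a syntax reminder for the functions named in Parts A and D, here is a sketch on a small made-up data frame. The names toy, group, x, and y are hypothetical and unrelated to ic_homes, and the code assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data: one categorical column and two numeric columns
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 2, 3, 4, 5),
  y     = c(10, 14, 9, 13, 19)
)

# One row per group, with the mean of y and the group size
group_summary <- toy %>%
  group_by(group) %>%
  summarize(mean_y = mean(y), n = n())
group_summary

# Pearson's correlation between two numeric columns
r_xy <- cor(toy$x, toy$y)
r_xy
```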

Question #2: This question focuses on understanding the relationship between the variables bedrooms, style, and sale.amount.

  • Part A: Use the group_by() and summarize() functions to determine which style of home has more bedrooms on average.
  • Part B: Fit a simple linear regression model that uses bedrooms as the explanatory variable and sale.amount as the response. Interpret the slope and intercept of this model.
  • Part C: Fit a multivariable linear regression model that uses bedrooms and style as explanatory variables and sale.amount as the response. Interpret the coefficient of bedrooms in this model.
  • Part D: An individual is thinking about reconfiguring their home, which previously had 3 bedrooms, to now include 4 bedrooms. If their goal is to better understand the consequences of this reconfiguration on the value of their home, which model is more useful to them: the model from Part B or the model from Part C? Briefly explain.
  • Part E: Considering the goals of the individual in Part D, are there any other variables that you consider essential to adjust for? Hint: Think about what you learned in Question 1, in particular Part D of that question.


Several Categorical Explanatory Variables

As more categorical explanatory variables are incorporated into a regression model the definition of the reference category becomes increasingly complex.

Recall that when a single categorical explanatory variable is used, R designates the first category in alphanumeric order as the reference group. When multiple categorical explanatory variables are used, the reference group is defined by the combination of the first alphanumeric category of each variable.

For example, when the categorical variables bsmt and style are used, the reference group contains 1 Story Frame homes with a 1/2 basement. This corresponds to the upper left cell in the contingency table relating these variables, because table() sorts the categories in the same alphanumeric order:

table(ic_homes$bsmt, ic_homes$style)
##         1 Story Frame 2 Story Frame
##   1/2               2             2
##   3/4               0             1
##   Crawl             1             0
##   Full            255           161
##   None             89            20

The estimated coefficients in these models still describe expected differences from the reference group, but we need to be more careful about what the reference group includes.
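
The sketch below illustrates this on made-up data (the toy data frame and its values are hypothetical, not taken from ic_homes). It fits an additive model with two categorical variables and checks that the intercept equals the fitted value for the combination of first levels.

```r
# Toy data (hypothetical values) with two categorical explanatory variables
toy <- data.frame(
  y     = c(10, 12, 15, 18, 11, 20, 14, 17),
  bsmt  = c("Full", "None", "Full", "None", "None", "Full", "None", "Full"),
  style = c("1 Story", "1 Story", "2 Story", "2 Story",
            "1 Story", "2 Story", "2 Story", "1 Story")
)

fit <- lm(y ~ bsmt + style, data = toy)

# No dummy is created for the first level of each variable ("Full", "1 Story")
names(coef(fit))

# The intercept is the fitted value for the reference combination
reference_home <- data.frame(bsmt = "Full", style = "1 Story")
predict(fit, newdata = reference_home)
coef(fit)[["(Intercept)"]]
```

Note that in an additive model the intercept is the model's fitted value for the reference combination, which need not equal the raw average of the homes in that cell.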

Question #3: This question focuses on understanding the relationship between the variables ac, bsmt and sale.amount.

  • Part A: Fit the multivariable regression model: sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
  • Part B: Interpret the estimated coefficient for the dummy variable bsmtCrawl in your model from Part A.
  • Part C: Using information about the factor() function from the start of this lab, reorder the levels of the variables bsmt and ac as necessary so that homes without a basement and without air conditioning form the reference group.
  • Part D: Using the reordered variables from Part C, refit the model: sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
  • Part E: Interpret the estimated coefficient for the dummy variable bsmtCrawl in your model from Part D.
  • Part F: Using the model from Part D, what is the expected difference in sale price between a home with no basement and no air conditioning and a home that has air conditioning and a full basement?


Two Quantitative Explanatory Variables

Question #4: This question focuses on understanding the relationship between the variables area.living, assessed, and sale.amount.

  • Part A: Use the select() function to create a subset of ic_homes that only contains the variables area.living, assessed, and sale.amount. Then use this subset to create a correlation matrix showing Pearson’s correlation coefficient for each pairing of variables. Notice the strong correlations between all of these variables.
  • Part B: Fit a simple linear regression model using the explanatory variable area.living to predict sale.amount. Interpret the slope coefficient of the fitted model.
  • Part C: Fit a simple linear regression model using the explanatory variable assessed to predict sale.amount. Briefly explain why it makes sense for the estimated slope coefficient in this model to be close to 1.
  • Part D: Fit a multivariable linear regression model using both area.living and assessed to predict sale.amount. Why is the estimated coefficient for area.living negative in this model when we saw in Parts A and B that area.living is positively associated with sale price?
  • Part E: An individual is thinking about adding a new sunroom to their house that will increase its area.living by 160 square feet. This will require them to get a building permit, and the city will reassess the value of their home. If their goal is to understand how this addition might influence the market value of their home (future sale price), which of the models from this question would be most useful to them: the one from Part B, Part C, or Part D? Briefly explain.
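
As a syntax reminder for Part A, the sketch below builds a correlation matrix from selected columns of a made-up data frame. The names toy, id, x1, x2, and y are hypothetical, and the code assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data with an identifier column that should not be correlated
toy <- data.frame(
  id = 1:5,
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 4, 3, 6),
  y  = c(3, 5, 6, 9, 10)
)

# Keep only the numeric columns of interest, then correlate every pair
cor_matrix <- toy %>%
  select(x1, x2, y) %>%
  cor()
round(cor_matrix, 2)
```

The result is a symmetric matrix with 1s on the diagonal; each off-diagonal entry is Pearson's correlation coefficient for one pair of variables.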

Question #5: This question focuses on understanding the relationship between the variables area.lot, area.living, and sale.amount.

  • Part A: Find Pearson’s correlation coefficient for the variables area.lot and sale.amount. Briefly interpret the strength and direction of this correlation.
  • Part B: Create a scatter plot displaying area.lot on the x-axis and sale.amount on the y-axis. For the majority of homes, do you think the strength of linear association is similar, stronger, or weaker than what the correlation coefficient you reported in Part A suggests? Briefly explain.
  • Part C: Fit a simple linear regression using area.lot as the explanatory variable and sale.amount as the response. Interpret the estimated slope coefficient in this model.
  • Part D: One house in the Iowa City home sales data set had a lot size of 158123 square feet (3.63 acres). No other homes had a lot size larger than 1.3 acres, and the average lot size was 9388 square feet (0.21 acres). Considering this information, would the removal of this home from the data set greatly influence the estimated slope coefficient you reported in Part C? Briefly explain.
  • Part E: Fit a multivariable linear regression model using the explanatory variables area.lot and area.living and the response variable sale.amount. If two homes have the same living area, is having a larger lot still associated with a higher sale amount?
  • Part F: The coefficient of area.lot in the model you fit in Part E is much smaller than this variable’s coefficient in the model you fit in Part C. How do you explain this decrease?