Today’s lecture introduced one-hot encoding, a way to represent categorical variables using a collection of binary numerical variables. Behind the scenes, lm() uses the model.matrix() function to set up these binary variables:
colors = c("Red", "Red", "Blue", "Green", "Blue")
model.matrix( ~ colors)
##   (Intercept) colorsGreen colorsRed
## 1           1           0         1
## 2           1           0         1
## 3           1           0         0
## 4           1           1         0
## 5           1           0         0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$colors
## [1] "contr.treatment"
By default, R sorts the categories alphabetically and uses the first one as the reference group ("Blue" in this example). The reference group is represented by the model’s intercept rather than by its own binary variable. We can alter this behavior using the “factor” data type, which behaves similarly to the “character” type but assigns an ordering to the different “levels” of the variable. Below we assign the ordering "Red", "Blue", "Green":
ordered_colors = factor(colors, levels = c("Red", "Blue", "Green"))
model.matrix( ~ ordered_colors)
##   (Intercept) ordered_colorsBlue ordered_colorsGreen
## 1           1                  0                   0
## 2           1                  0                   0
## 3           1                  1                   0
## 4           1                  0                   1
## 5           1                  1                   0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ordered_colors
## [1] "contr.treatment"
Notice how this changed the reference category to “Red”.
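If only the reference category needs to change, base R's relevel() function is one possible shortcut: it moves a single level to the front and leaves the others in their existing order. The sketch below reuses the colors vector from above; the name releveled_colors is introduced here purely for illustration.

```r
# Same toy vector as above
colors = c("Red", "Red", "Blue", "Green", "Blue")

# relevel() needs a factor, so convert first, then make "Red" the reference
releveled_colors = relevel(factor(colors), ref = "Red")

# "Red" now comes first; the remaining levels keep their alphabetical order
levels(releveled_colors)

# The design matrix has dummy columns for "Blue" and "Green" only
colnames(model.matrix( ~ releveled_colors))
```

This avoids typing out every level, which can be convenient when a variable has many categories.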
Below is an example of the same concept applied inside the lm() function:
homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Model w/ the default reference for "bsmt"
lm(sale.amount ~ bsmt, data = homes)
##
## Call:
## lm(formula = sale.amount ~ bsmt, data = homes)
##
## Coefficients:
## (Intercept)      bsmt3/4    bsmtCrawl     bsmtFull     bsmtNone
##      199333       105667       -19333        -3892       -71149
## Change the reference to "None", the natural reference category
lm(sale.amount ~ factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full")), data = homes)
##
## Call:
## lm(formula = sale.amount ~ factor(bsmt, levels = c("None", "1/2",
## "3/4", "Crawl", "Full")), data = homes)
##
## Coefficients:
## (Intercept)
## 128184
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2
## 71149
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4
## 176816
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl
## 51816
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full
## 67257
Notice that lm() produces long, hard-to-read coefficient names with this approach. An alternative is to assign the levels before calling lm():
## Overwrite the existing "bsmt" variable
homes$bsmt = factor(homes$bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))
## Fit regression model
lm(sale.amount ~ bsmt, data = homes)
##
## Call:
## lm(formula = sale.amount ~ bsmt, data = homes)
##
## Coefficients:
## (Intercept)      bsmt1/2      bsmt3/4    bsmtCrawl     bsmtFull
##      128184        71149       176816        51816        67257
In this lab you’ll use a subset of the “Iowa City Homes” data set, which was obtained from the Johnson County assessor and documents all homes sold in Iowa City, IA between Jan 1, 2005 and Sept 9, 2008. The code below uses the filter() function to keep only the homes whose style was 1-story frame or 2-story frame, the two most prevalent styles of single-family detached homes:
library(ggplot2)
library(dplyr)
ic_homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>%
  filter(style %in% c("1 Story Frame", "2 Story Frame"))
You should use the data stored in ic_homes throughout the remainder of the lab.
For the following questions you should use the ic_homes data frame created in the previous section. We already learned how to fit regression models using the lm() function. The only thing to be aware of is that when fitting a multivariable linear regression model we separate the explanatory variables using a + in the model formula.
As an example, the code below fits a model that uses the explanatory variables area.bsmt and area.garage1 and the response variable sale.amount. This model is stored in an object named example_model:
example_model = lm(sale.amount ~ area.bsmt + area.garage1, data = ic_homes)
coef(example_model) ## See the model's coefficients
##  (Intercept)    area.bsmt area.garage1
## 127796.56509     64.71159    128.19038
This example also uses the coef() function to print the model’s estimated coefficients. We could write out the fitted model in this example as:
\[\widehat{\text{sale.amount}} = 127796.57 + 64.71 \cdot \text{area.bsmt} + 128.19 \cdot \text{area.garage1}\]
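To see how the fitted equation is used, the sketch below plugs a hypothetical home into it by hand. The coefficients are copied from the coef() output above. The home's values (area.bsmt = 1000 and area.garage1 = 400) and the names b, new_home, and predicted are made up for illustration and do not come from the data.

```r
# Coefficients copied from the coef(example_model) output above
b = c(intercept = 127796.56509, area.bsmt = 64.71159, area.garage1 = 128.19038)

# Hypothetical home (made-up values, not a row from ic_homes)
new_home = data.frame(area.bsmt = 1000, area.garage1 = 400)

# Predicted sale amount = intercept + (slope * value) for each explanatory variable
predicted = b[["intercept"]] +
  b[["area.bsmt"]] * new_home$area.bsmt +
  b[["area.garage1"]] * new_home$area.garage1
predicted
```

With the fitted model object available, predict(example_model, newdata = new_home) should carry out the same calculation without copying coefficients by hand.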
Question #1: This question focuses on understanding the relationship between the variables style, area.living, and sale.amount.
Part A: Use the group_by() and summarize() functions to find the average sale price of homes in each category of style. Which style has a higher average selling price?
Part B: Fit a regression model using style as the explanatory variable and sale.amount as the response variable. Interpret the coefficient of the dummy variable style2 Story Frame in this model.
Part C: Explore the relationship between style and area.living. Briefly describe how these variables are associated.
Part D: Explore the relationship between area.living and sale.amount. Briefly describe the relationship between these variables.
Part E: Fit a regression model using style and area.living as explanatory variables and sale.amount as the response variable. Interpret the coefficient of style2 Story Frame in this model.
Question #2: This question focuses on understanding
the relationship between the variables bedrooms, style, and sale.amount.
Part A: Use the group_by() and summarize() functions to determine which style of home has more bedrooms on average.
Part B: Fit a regression model using bedrooms as the explanatory variable and sale.amount as the response. Interpret the slope and intercept of this model.
Part C: Fit a regression model using bedrooms and style as explanatory variables and sale.amount as the response. Interpret the coefficient of bedrooms in this model.
As more categorical explanatory variables are incorporated into a regression model, the definition of the reference category becomes increasingly complex.
Recall that R designates the reference group as the first category in alphabetical order when a single categorical explanatory variable is used. When multiple categorical explanatory variables are used, the reference group is defined by the combination of the first alphabetical categories of each variable.
For example, when the categorical variables bsmt and style are used, the reference group contains 1 Story Frame homes with a 1/2 basement. This happens to coincide with the upper left cell in the contingency table relating these variables:
table(ic_homes$bsmt, ic_homes$style)
##
##         1 Story Frame 2 Story Frame
##   1/2               2             2
##   3/4               0             1
##   Crawl             1             0
##   Full            255           161
##   None             89            20
The estimated coefficients in these models still describe expected differences from the reference group, but we need to be more careful about what the reference group includes.
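One way to check which combination of categories forms the reference group is to look for the rows of the design matrix where every dummy variable equals 0. The sketch below does this with a small made-up data frame; the variables heat and roof and their categories are invented for illustration and are not part of the Iowa City data.

```r
# Made-up data: two categorical variables with invented categories
toy = data.frame(
  heat = c("Gas", "Electric", "Gas", "Electric"),
  roof = c("Metal", "Metal", "Asphalt", "Asphalt")
)

# One dummy column per non-reference category of each variable
X = model.matrix( ~ heat + roof, data = toy)
colnames(X)

# Rows where every dummy equals 0 belong to the reference group:
# the first alphabetical category of each variable ("Electric" and "Asphalt")
reference_rows = toy[rowSums(X[, -1, drop = FALSE]) == 0, ]
reference_rows
```

In a model fit to data like this, the intercept would describe the expected response for the homes in reference_rows, and each coefficient would describe a difference from that group with the other variable held fixed.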
Question #3: This question focuses on understanding the relationship between the variables ac, bsmt, and sale.amount.
Part A: Fit the regression model sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
Part B: Interpret the coefficient of bsmtCrawl in your model from Part A.
Part C: Using the factor() function from the start of this lab, reorder the levels of the variables bsmt and ac as necessary so that the group of homes without a basement and without air conditioning is used as the reference group.
Part D: Using the reordered variables, refit the model sale.amount ~ bsmt + ac. Interpret the estimated intercept in this model.
Part E: Interpret the coefficient of bsmtCrawl in your model from Part D.
Question #4: This question focuses on understanding the relationship between the variables area.living, assessed, and sale.amount.
Part A: Use the select() function to create a subset of ic_homes that only contains the variables area.living, assessed, and sale.amount. Then use this subset to create a correlation matrix showing Pearson’s correlation coefficient for each pairing of variables. Notice the strong correlations between all of these variables.
Part B: Fit a regression model that uses area.living to predict sale.amount. Interpret the slope coefficient of the fitted model.
Part C: Fit a regression model that uses assessed to predict sale.amount. Briefly explain why it makes sense for the estimated slope coefficient in this model to be close to 1.
Part D: Fit a regression model that uses area.living and assessed to predict sale.amount. Why is the estimated coefficient for area.living negative in this model when we saw in Parts A and B that area.living is positively associated with sale price?
Part E: Suppose a homeowner plans an addition that would increase their home’s area.living by 160 square feet. This will require them to get a building permit and the city will reassess the value of their home. If their goal is to understand how this addition might influence the market value of their home (future sale price), which of the models from this question would be most useful to them? The one from Part B, Part C, or Part D? Briefly explain.
Question #5: This question focuses on understanding
the relationship between the variables area.lot, area.living, and sale.amount.
Part A: Find the correlation between area.lot and sale.amount. Briefly interpret the strength and direction of this correlation.
Part B: Create a scatterplot with area.lot on the x-axis and sale.amount on the y-axis. For the majority of homes, do you think the strength of linear association is similar to, stronger than, or weaker than what the correlation coefficient you reported in Part A suggests? Briefly explain.
Part C: Fit a regression model using area.lot as the explanatory variable and sale.amount as the response. Interpret the estimated slope coefficient in this model.
Part D: Fit a regression model using the explanatory variables area.lot and area.living and the response variable sale.amount. If two homes have the same living area, is having a larger lot still associated with a higher sale amount?
Part E: The coefficient of area.lot in the model you fit in Part D is much smaller than this variable’s coefficient in the model you fit in Part C. How do you explain this decrease?