This lab contains no onboarding section; you should read through the sections below and work through them with your partners.
The lab’s examples will use the Iowa City home sales data set introduced earlier in the semester, focusing only on 1-story frame, split foyer frame, and 2-story frame homes, which are the three most common home types:
## Libraries and data
library(ggplot2)
library(dplyr)
homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>%
filter(style %in% c("1 Story Frame", "2 Story Frame", "Split Foyer Frame"))
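As a quick sanity check (a small addition, not part of the lab's original code), we can confirm that the filter kept only the three intended home styles:
## Optional check: only the three intended styles should remain
table(homes$style)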
Recall that two models are nested if the coefficients in one can be set to certain values to make it equivalent to the other model. In our example we'll look at the following nested models:
1. sale.amount ~ bedrooms + area.living
2. sale.amount ~ bedrooms + area.living + style
Writing out the form of each model makes it clear that they are nested. The reduced model (Model 1) is:
\[\text{sale.amount} = \beta_0 + \beta_1\text{bedrooms} + \beta_2\text{area.living} + \epsilon\]
The alternative model (Model 2) adds two indicator variables for style, which has three categories ("1 Story Frame" is the reference):
\[\text{sale.amount} = \beta_0 + \beta_1\text{bedrooms} + \beta_2\text{area.living} + \beta_3\text{style}_{\text{2 Story Frame}} + \beta_4\text{style}_{\text{Split Foyer Frame}} + \epsilon\]
We can see that if \(\beta_3 = 0\) and \(\beta_4 = 0\) then both models are equivalent, thus the reduced model is a special case of the alternative model.
To use an \(F\)-test on these nested models we first need to fit each model to our data:
## Fit both models, store as fit1 and fit2
fit1 = lm(sale.amount ~ bedrooms + area.living, data = homes)
fit2 = lm(sale.amount ~ bedrooms + area.living + style, data = homes)
Next, we can use the anova() function to perform the \(F\)-test:
## F-test
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ bedrooms + area.living
## Model 2: sale.amount ~ bedrooms + area.living + style
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 612 1.4457e+12
## 2 610 1.3225e+12 2 1.2319e+11 28.412 1.593e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
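To demystify the table above, here is a short sketch (not part of the original lab) showing how the \(F\)-statistic and \(p\)-value can be reconstructed from the two models' residual sums of squares:
## Reconstruct the F-statistic by hand from the residual sums of squares
rss_reduced = sum(resid(fit1)^2)
rss_full = sum(resid(fit2)^2)
df_extra = fit1$df.residual - fit2$df.residual  ## 2 extra coefficients for style
f_stat = ((rss_reduced - rss_full)/df_extra)/(rss_full/fit2$df.residual)
f_stat  ## approximately 28.4, matching the anova() output
pf(f_stat, df_extra, fit2$df.residual, lower.tail = FALSE)  ## the p-value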
Note that the anova() function expects the reduced model to be given first, followed by the alternative model. As an additional example, let's consider an even more complex model that also uses a home's latitude as a predictor.
## Fit an even more complex model with a predictor that's unlikely to matter
fit3 = lm(sale.amount ~ bedrooms + area.living + style + lat, data = homes)
## F-test
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ bedrooms + area.living + style
## Model 2: sale.amount ~ bedrooms + area.living + style + lat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 610 1.3225e+12
## 2 609 1.3224e+12 1 68353504 0.0315 0.8592
Here the \(p\)-value is large, so the data do not provide sufficient evidence that lat is a worthwhile predictor of sale.amount after the effects of bedrooms, area.living, and style have been accounted for.
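As an aside, the same comparison can be written more compactly with base R's update(), which drops lat from the larger model; this sketch is equivalent to the anova() call above:
## Equivalent test: drop 'lat' from fit3 and compare
anova(update(fit3, . ~ . - lat), fit3)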
Question #1: Consider the following three models that predict a home's assessed value:
1. assessed ~ area.living
2. assessed ~ ac
3. assessed ~ area.living + ac
Part A: Use an \(F\)-test to compare Model #1 (assessed ~ area.living) and Model #3 (assessed ~ area.living + ac). Provide the \(p\)-value and a 1-sentence summary of what you conclude from the test.
Part B: Use an \(F\)-test to compare Model #2 (assessed ~ ac) and Model #3 (assessed ~ area.living + ac). Provide the \(p\)-value and a 1-sentence summary of what you conclude from the test.
Recall that valid statistical inference on a regression model assumes the model's errors are independent and Normally distributed with a constant variance. These assumptions can be checked using the plot() function:
## Check independence and constant variance
plot(fit3, which = 1)
By changing the argument to which = 2 we can assess the Normality assumption using a Q-Q plot:
## Check for Normality
plot(fit3, which = 2)
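The same Q-Q plot can also be built directly from the model's residuals using base R; a minimal equivalent sketch:
## Q-Q plot constructed from the residuals directly
qqnorm(resid(fit3))
qqline(resid(fit3))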
If the Q-Q plot shows the residuals departing from Normality (for example, the right skew that is common with price data), a log-transformation of the outcome can often help:
## Re-fit with a log-transformed outcome
fit3_log = lm(log2(sale.amount) ~ bedrooms + area.living + style + lat, data = homes)
## Q-Q plot of log-transformed outcome
plot(fit3_log, which = 2)
If strategies like log-transformation are ineffective, non-parametric alternatives such as quantile regression exist. However, for the purposes of our class, you may simply report the results of your regression inferences with caution if a log-transformation is not sufficient to fulfill the assumptions for statistical inference.
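Quantile regression itself is beyond the scope of this class, but for the curious, a minimal sketch using the quantreg package (an extra dependency, not required for this lab) might look like:
## Median regression as one non-parametric alternative (requires 'quantreg')
library(quantreg)
rq(sale.amount ~ bedrooms + area.living + style + lat, tau = 0.5, data = homes)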
Question #2: Create the diagnostic plots described above for Model #3 (assessed ~ area.living + ac) from Question #1. Briefly comment on whether you believe the assumptions of the \(F\)-test are satisfied. Your answer should address all of the major assumptions (independence, constant variance, and Normality of the residuals). Note: If you decide a log-transformation is necessary you technically should revisit the model selection step (i.e., the tests performed in Question #1); however, it's relatively uncommon for a log-transformation to suggest a vastly different final model.
As mentioned in the lecture slides, improper fit can produce patterns in the residuals that violate the independence assumption. The most common example is a model that uses a linear relationship to represent one that is actually non-linear. If such a pattern exists in your data, you might consider using a polynomial effect to represent that variable in the model.
As an example, let's consider the area.bsmt variable in the Iowa City home sales data:
ggplot(homes, aes(x = area.bsmt, y = sale.amount)) + geom_point() + geom_smooth(se = FALSE)
We can propose two different linear regression models relating this variable and the outcome, sale.amount:
1. sale.amount ~ area.bsmt
2. sale.amount ~ poly(area.bsmt, degree=2)
The first is a simple linear regression model \(Y = \beta_0 + \beta_1X + \epsilon\), while the second uses a second degree polynomial of \(X\) and takes the form \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon\). We can see that these two models are nested, as when \(\beta_2 = 0\) they are equivalent.
Though it is subtle, the residuals of the reduced model exhibit the U-shaped pattern that we saw in our lecture slides:
bsmt_mod1 = lm(sale.amount ~ area.bsmt, data = homes)
plot(bsmt_mod1, which = 1)
Because the models are nested, we can use an \(F\)-test to compare them:
bsmt_mod2 = lm(sale.amount ~ poly(area.bsmt, degree=2), data = homes)
anova(bsmt_mod1, bsmt_mod2)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ area.bsmt
## Model 2: sale.amount ~ poly(area.bsmt, degree = 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 613 3.6001e+12
## 2 612 3.3180e+12 1 2.8214e+11 52.041 1.613e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we have strong evidence that using a quadratic effect for the predictor area.bsmt significantly improves model fit. Thus, we can conclude that there is a non-linear relationship between the area of a home's basement and its sale price.
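To see what the quadratic fit looks like, we can overlay it on the earlier scatterplot; in this sketch the formula argument tells geom_smooth() to fit the degree-2 model:
## Overlay the fitted quadratic curve on the scatterplot
ggplot(homes, aes(x = area.bsmt, y = sale.amount)) + geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, degree = 2), se = FALSE)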
Question #3: Use an \(F\)-test to determine whether the size of a home’s lot is better represented by a quadratic effect or a linear effect in a model that predicts sale price and adjusts for the home’s living area.
After using the \(F\)-test to determine a suitable final model, it is important to evaluate the roles of individual variables within that model. The summary() function provides \(t\)-test results for individual model coefficients. Each of these tests uses the null hypothesis \(H_0: \beta_j = 0\), which reflects that variable making no contribution towards the outcome.
## Fit 3 from earlier in the lab
fit3 = lm(sale.amount ~ bedrooms + area.living + style + lat, data = homes)
summary(fit3)
##
## Call:
## lm(formula = sale.amount ~ bedrooms + area.living + style + lat,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -207707 -23735 -2156 19434 352041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.201e+06 6.965e+06 0.172 0.863
## bedrooms 1.016e+04 2.346e+03 4.332 1.72e-05 ***
## area.living 1.432e+02 4.947e+00 28.945 < 2e-16 ***
## style2 Story Frame -3.643e+04 4.969e+03 -7.333 7.23e-13 ***
## styleSplit Foyer Frame 2.195e+03 5.722e+03 0.384 0.701
## lat -2.967e+04 1.672e+05 -0.177 0.859
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46600 on 609 degrees of freedom
## Multiple R-squared: 0.7179, Adjusted R-squared: 0.7156
## F-statistic: 309.9 on 5 and 609 DF, p-value: < 2.2e-16
Below we interpret the key information in the output given by summary():
- The \(p\)-value for the test that the coefficient of bedrooms is zero is very small, suggesting there is strong evidence of an association between bedrooms and sale.amount after adjusting for differences in living area, style, and latitude.
- The estimated coefficient of bedrooms is 10160, which tells us that each additional bedroom predicts an expected increase in home price of $10,160 assuming living area, style, and latitude remain unchanged.
- The variable style is categorical, which requires us to interpret it differently: each style coefficient estimates the expected difference in sale price between that category and the reference category ("1 Story Frame"), holding the other predictors fixed.
Notice that we aren't able to statistically test the difference between "Split Foyer Frame" and "2 Story Frame" using just this model. The easiest way to perform that test is to re-fit the model using a different reference category, which is described in Lab 7 and sketched below.
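Here is a minimal sketch of that re-fitting idea (the names style2 and fit3_alt are illustrative, not part of the lab):
## Make "2 Story Frame" the reference category, then re-fit
homes$style2 = relevel(factor(homes$style), ref = "2 Story Frame")
fit3_alt = lm(sale.amount ~ bedrooms + area.living + style2 + lat, data = homes)
summary(fit3_alt)  ## "Split Foyer Frame" is now compared against "2 Story Frame"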
In addition to hypothesis testing, the \(t\)-distribution is also used as the basis for confidence interval estimates for model coefficients:
## 99% confidence interval estimates
confint(fit3, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) -1.679673e+07 19199501.892
## bedrooms 4.102255e+03 16226.927
## area.living 1.304015e+02 155.966
## style2 Story Frame -4.927241e+04 -23594.640
## styleSplit Foyer Frame -1.259029e+04 16980.695
## lat -4.617859e+05 402445.000
For example, the estimated effect of area.living was $143 per additional square foot (seen in the previous summary table), and the 99% confidence interval above suggests the data are consistent with effects between roughly $130 and $156 per additional square foot.
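As a sketch of where these numbers come from, the interval for area.living can be reconstructed from its point estimate, its standard error, and a critical value from the \(t\)-distribution:
## Reconstruct the 99% CI for area.living by hand
est = coef(fit3)["area.living"]
se = summary(fit3)$coefficients["area.living", "Std. Error"]
t_crit = qt(0.995, df = fit3$df.residual)  ## 0.5% in each tail
est + c(-1, 1)*t_crit*se  ## matches the confint() row above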
Finally, you should note that coefficient interpretations change when the outcome has been log-transformed:
## Log-transformed model from earlier
fit3_log = lm(log2(sale.amount) ~ bedrooms + area.living + style + lat, data = homes)
summary(fit3_log)
##
## Call:
## lm(formula = log2(sale.amount) ~ bedrooms + area.living + style +
## lat, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62506 -0.18333 0.00771 0.20195 1.23185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.042e+01 4.781e+01 1.473 0.141
## bedrooms 1.378e-01 1.610e-02 8.560 < 2e-16 ***
## area.living 8.179e-04 3.395e-05 24.087 < 2e-16 ***
## style2 Story Frame -1.711e-01 3.410e-02 -5.018 6.86e-07 ***
## styleSplit Foyer Frame 7.400e-02 3.928e-02 1.884 0.060 .
## lat -1.310e+00 1.148e+00 -1.142 0.254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3198 on 609 degrees of freedom
## Multiple R-squared: 0.6928, Adjusted R-squared: 0.6903
## F-statistic: 274.6 on 5 and 609 DF, p-value: < 2.2e-16
The estimated adjusted effect of area.living now suggests an expected increase of 0.0008179 log-2 dollars for every one square foot increase in living area. This is meaningless to most people, so let's revisit the regression equation and apply the inverse of our log-transformation:
\[\widehat{\log_2(\text{price})} = 70.42 + 0.138*\text{Bedrooms} + 0.00082*\text{Area Living} + \ldots \\ \implies \widehat{\text{price}} = 2^{70.42 + 0.138*\text{Bedrooms} + 0.00082*\text{Area Living} + \ldots} \\ \implies \widehat{\text{price}} = 2^{70.42}*2^{0.138*\text{Bedrooms}}*2^{0.00082*\text{Area Living}}*\ldots\]
We can see that every 1 square foot increase in living area increases
the expected sale price by a multiplicative factor of \(2^{0.00082} = 1.000569\). Since a 1
square foot increase is very small this is still difficult to interpret,
so instead we might look at a 100 square foot increase: \[2^{0.00082*100} = 1.058\] Thus, a
reasonable interpretation of the coefficient of area.living
in the log-transformed model is: after adjusting for differences in
bedrooms, style and latitude, each additional 100 square feet is
expected to increase a home’s sale price by 5.8%.
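The same back-transformation is easy to do in code; this short sketch mirrors the 100-square-foot calculation above:
## Convert the log2-scale slope into a multiplicative effect on price
b_area = coef(fit3_log)["area.living"]
2^(b_area*100)  ## roughly 1.058, i.e. about a 5.8% increase per 100 sq ft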
Question #4: For this question you should begin with the model assessed ~ area.living + ac used in Question #2. Throughout the question you may ignore whether or not the assumptions for inference are met.
Part A: Interpret the coefficient of acYes in this model. You do not need to interpret the results of the associated statistical test at this time.
Part B: Re-fit the model using a log-transformed outcome, log2(assessed) ~ area.living + ac. Interpret the coefficient of acYes in this model. Hint: You should report the percentage change in expected assessed value for a home having central air.
Part C: Perform a statistical test to determine whether the coefficient of acYes in the log-transformed model is not zero.
Question #5: In this application you’ll analyze data from a major US university as part of an investigation into salary differences between male and female faculty:
fac_sal = read.csv("https://remiller1450.github.io/data/Salaries.csv")
Part A: Fit the model salary ~ sex and interpret the coefficient and corresponding \(t\)-test for the re-coded variable sexMale. Does there appear to be statistical evidence of a difference in salaries for male versus female faculty?
Part B: Fit the model salary ~ sex + rank and note that the model from Part A is nested within this model. Use an \(F\)-test to determine whether there is statistical evidence that this model better fits the data than the model from Part A.
Part C: Use an \(F\)-test to compare the model from Part B with the model salary ~ sex + rank + yrs.since.phd. Based upon this test as well as the one you performed in Part B, which of the three models you've looked at so far in this question do you deem most appropriate? Explain your reasoning.
Part D: It's possible that yrs.since.phd has a non-linear relationship with salary. Use an \(F\)-test to compare a model that uses a degree-2 polynomial effect for yrs.since.phd with the best performing model you identified in Part C.