Directions (read before starting)
\(~\)
Similar to some of our previous labs, the first section of this lab will walk through a regression analysis step by step. Each step will have an embedded question that you are expected to answer.
Our examples will use the Iowa City home sales data set that has been used in several of our previous labs:
ic_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
\(~\)
To compare the fit of two different models using an F-test, we first need to fit each model to our sample data using lm() and store each fit in an R object:
## Store the null model
mod0 = lm(sale.amount ~ 1, data = ic_homes)
## Store the alternative
mod1 = lm(sale.amount ~ assessed, data = ic_homes)
We then use the anova() function to perform the hypothesis test:
## F-test of no difference b/w models
anova(mod0, mod1)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ 1
## Model 2: sale.amount ~ assessed
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 776 6.3775e+12
## 2 775 3.4093e+11 1 6.0365e+12 13722 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not surprisingly, there’s overwhelming evidence (a \(p\)-value of essentially zero) that using a home’s assessed value to predict its sale price is better than predicting every home sold for the same amount, the overall average sale price.
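To see what anova() is doing behind the scenes, here is a minimal sketch (using R's built-in mtcars data rather than the homes data) that computes the F statistic directly from the two models' residual sums of squares:

```r
## Sketch using the built-in mtcars data: the F statistic from anova()
## is the per-parameter drop in RSS divided by the full model's residual MS
m0 = lm(mpg ~ 1, data = mtcars)    # null (intercept-only) model
m1 = lm(mpg ~ wt, data = mtcars)   # alternative model

rss0 = sum(resid(m0)^2)            # residual sum of squares, null model
rss1 = sum(resid(m1)^2)            # residual sum of squares, alternative
df_diff = m0$df.residual - m1$df.residual   # parameters added (here 1)

F_manual = ((rss0 - rss1) / df_diff) / (rss1 / m1$df.residual)
F_anova  = anova(m0, m1)$F[2]      # same statistic reported by anova()

all.equal(F_manual, F_anova)
```

The `all.equal()` call confirms the hand computation matches the table produced by anova().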
The nice thing about the F-test is that it can be used to statistically compare any two nested models. This means we can add additional explanatory variables to mod1 and see if they further improve the model:
## Add area.living as a second explanatory var
mod2 = lm(sale.amount ~ assessed + area.living, data = ic_homes)
## F-test comparing mod2 and mod1 (which only had assessed)
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ assessed
## Model 2: sale.amount ~ assessed + area.living
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 775 3.4093e+11
## 2 774 3.3975e+11 1 1181199358 2.6909 0.1013
Here the \(p\)-value is borderline, but high enough that we'd most likely conclude there is a lack of evidence that the more complicated model, which includes both assessed and area.living, is superior to the simpler model that only includes assessed as a predictor.
If we weren’t satisfied with the simpler model, we might continue searching for additional predictors that offer a statistically meaningful improvement in fit:
## Instead try 'built' as the second explanatory var
mod2b = lm(sale.amount ~ assessed + built, data = ic_homes)
## F-test comparing mod2b and mod1
anova(mod1, mod2b)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ assessed
## Model 2: sale.amount ~ assessed + built
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 775 3.4093e+11
## 2 774 3.3526e+11 1 5668530388 13.087 0.0003167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This time we see that using both assessed and built is superior to the simpler model that only uses assessed as a predictor of sale.amount.
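As a side note, anova() can also accept a whole sequence of nested models in one call, where each row of the table tests the improvement over the model on the previous row. A small sketch using the built-in mtcars data (not the homes data):

```r
## A sequence of nested models fit to the built-in mtcars data
m0 = lm(mpg ~ 1, data = mtcars)
m1 = lm(mpg ~ wt, data = mtcars)
m2 = lm(mpg ~ wt + hp, data = mtcars)

## One call compares m1 vs. m0, then m2 vs. m1
anova(m0, m1, m2)
```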
Question 1: Could you use an F-test to compare the models area.living ~ assessed + built and area.living ~ bedrooms + built? Briefly explain. Hint: Recall that the F-test is only applicable for nested models. Next, use F-tests to compare Model 1: area.living ~ 1 (intercept-only), Model 2: area.living ~ bedrooms, and Model 3: area.living ~ bedrooms + built.
\(~\)
After we’ve decided upon a model that we believe offers an acceptable fit to our data (relative to other reasonable models that we’ve considered), the next step is to verify that statistical inference using an F-test and/or t-tests is valid for this model. This requires us to check two primary assumptions: that the model's errors are Normally distributed, and that they are independent with constant variance.
Using the R object storing our fitted model, we can check these assumptions using graphs created by the plot() function:
## Residuals vs. Fitted values (check independence and constant variance)
plot(mod2b, which = 1)
## QQ-plot (check Normality assumption)
plot(mod2b, which = 2)
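If you prefer, the same diagnostics can be assembled by hand from the fitted model using resid(), fitted(), and the base QQ-plot functions. A sketch using the built-in mtcars data rather than the homes data:

```r
## Manual diagnostics for a model fit to the built-in mtcars data
m = lm(mpg ~ wt + hp, data = mtcars)

## Residuals vs. fitted values (look for curvature or a fan shape)
plot(fitted(m), resid(m),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

## Normal QQ-plot of the standardized residuals
qqnorm(rstandard(m))
qqline(rstandard(m))
```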
Unfortunately, we see that the errors of this model deviate from what would be expected if they were Normally distributed, though the units on the graph's axes suggest the deviation may not be as extreme as it visually appears. These data also exhibit heteroskedasticity (non-constant variance in the residuals), which is another cause for concern.
There are two options to consider in this scenario. Below is an example of how the log() function can be used to transform the outcome variable:
## Example of a base-2 log-transformation
mod2c = lm(log(sale.amount, base = 2) ~ assessed + built, data = ic_homes)
plot(mod2c, which = 1)
plot(mod2c, which = 2)
We can see that the log-transformation improved the Normality of the residuals, but it also introduced a clear quadratic pattern in them. This pattern suggests we incorporate quadratic effects into our model (for one or more of the predictors). This is demonstrated below:
## Example of a quadratic polynomial effect
mod2d = lm(log(sale.amount, base = 2) ~ poly(assessed, degree=2) + built, data = ic_homes)
plot(mod2d, which = 1)
plot(mod2d, which = 2)
Notice the improvement in both plots. Although neither is perfect, they are good enough that the model is now suitable for statistical inference, though its interpretability has been negatively impacted.
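Some interpretability can be recovered by back-transforming: in a base-2 log model, a one-unit increase in a predictor multiplies the expected outcome by 2 raised to that predictor's coefficient. A sketch using a simple linear log model fit to the built-in mtcars data (not the homes data):

```r
## Base-2 log model fit to the built-in mtcars data
m_log = lm(log(mpg, base = 2) ~ wt, data = mtcars)

beta = coef(m_log)["wt"]
## Multiplicative change in expected mpg per one-unit (1000 lb)
## increase in weight; a value below 1 indicates a decrease
2^beta
```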
Question 2: Check the Normality and constant variance assumptions for the model area.living ~ bedrooms + built.
The summary() function will provide a table that includes the results of \(t\)-tests of the null hypothesis \(H_0:\beta_j = 0\).
## One of our previous example models
mod2b = lm(sale.amount ~ assessed + built, data = ic_homes)
## The summary table
summary(mod2b)
##
## Call:
## lm(formula = sale.amount ~ assessed + built, data = ic_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150413 -7085 493 7827 149697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.705e+05 4.717e+04 3.615 0.000320 ***
## assessed 1.035e+00 8.761e-03 118.093 < 2e-16 ***
## built -8.657e+01 2.393e+01 -3.618 0.000317 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20810 on 774 degrees of freedom
## Multiple R-squared: 0.9474, Adjusted R-squared: 0.9473
## F-statistic: 6975 on 2 and 774 DF, p-value: < 2.2e-16
In this example, we have strong evidence that assessed is associated with sale price after adjusting for year built, and that built is associated with sale price after adjusting for the home’s assessed value.
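The numbers in the summary table can also be extracted programmatically, which is handy when you need the p-values or confidence intervals themselves. A brief sketch with the built-in mtcars data:

```r
## Extracting pieces of the summary table for a model fit to mtcars
m = lm(mpg ~ wt + hp, data = mtcars)

coef_table = coef(summary(m))   # estimates, SEs, t values, p-values
coef_table[, "Pr(>|t|)"]        # just the two-sided p-values

confint(m, level = 0.95)        # 95% CIs for each coefficient
```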
The example shown above was straightforward, but these tests become more nuanced for models involving one or more categorical explanatory variables:
## Filter to only include 3 different styles of home
library(dplyr)
ic_homes_filtered = ic_homes %>% filter(style %in% c("1 Story Frame", "2 Story Frame", "Split Foyer Frame"))
## Fit model and display summary table
mod3 = lm(sale.amount ~ area.living + style, data = ic_homes_filtered)
summary(mod3)
##
## Call:
## lm(formula = sale.amount ~ area.living + style, data = ic_homes_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -235703 -25762 -1134 20265 349086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17685.804 5619.853 -3.147 0.00173 **
## area.living 154.752 4.221 36.661 < 2e-16 ***
## style2 Story Frame -40398.035 4947.992 -8.165 1.85e-15 ***
## styleSplit Foyer Frame 5339.262 5752.208 0.928 0.35366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47230 on 611 degrees of freedom
## Multiple R-squared: 0.7092, Adjusted R-squared: 0.7078
## F-statistic: 496.7 on 3 and 611 DF, p-value: < 2.2e-16
In this example we should notice that “1 Story Frame” was designated as the reference category, so the significant t-test and negative coefficient for the re-coded variable style2 Story Frame indicate that 2-story homes are expected to sell for significantly lower prices than 1-story homes, after adjusting for living area.
Unfortunately, we cannot directly obtain a pairwise comparison between 2-story and split foyer homes from this output. The easiest way to obtain this comparison is to actively change the variable’s reference category using factor() and relevel(), which is shown below:
ic_homes_relvl = ic_homes_filtered %>% mutate(new_style = relevel(factor(style), ref = 'Split Foyer Frame'))
mod3b = lm(sale.amount ~ area.living + new_style, data = ic_homes_relvl)
summary(mod3b)
##
## Call:
## lm(formula = sale.amount ~ area.living + new_style, data = ic_homes_relvl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -235703 -25762 -1134 20265 349086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12346.542 6976.797 -1.770 0.0773 .
## area.living 154.752 4.221 36.661 <2e-16 ***
## new_style1 Story Frame -5339.262 5752.208 -0.928 0.3537
## new_style2 Story Frame -45737.297 6799.518 -6.727 4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47230 on 611 degrees of freedom
## Multiple R-squared: 0.7092, Adjusted R-squared: 0.7078
## F-statistic: 496.7 on 3 and 611 DF, p-value: < 2.2e-16
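One sanity check worth knowing: releveling changes which category serves as the baseline, and hence which coefficients are reported, but not the model itself. The fitted values and R-squared are identical. A sketch using the built-in chickwts data (a numeric outcome and a categorical predictor):

```r
## Fit the same model under two reference categories (built-in chickwts data)
m_a = lm(weight ~ feed, data = chickwts)

chick2 = chickwts
chick2$feed = relevel(chick2$feed, ref = "soybean")
m_b = lm(weight ~ feed, data = chick2)

## Different coefficients, identical fit
all.equal(fitted(m_a), fitted(m_b))
all.equal(summary(m_a)$r.squared, summary(m_b)$r.squared)
```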
A more generalizable alternative is to use contrasts, a topic we will not have time to cover, but you can learn more from the vignettes of the emmeans package.
Question 3: Fit the linear regression model sale.amount ~ area.living + built + ac and interpret the hypothesis tests on each coefficient in the model that are given by the summary() function. A satisfactory interpretation will mention whether you can or cannot be statistically confident that a predictor is associated with the outcome after adjusting for the other variables in the model, as well as an explanation of how that predictor is associated with the outcome (only for predictors you deem associated).
\(~\)
In this application you’ll analyze data from a major United States university that was collected as part of the university’s ongoing evaluation of salary differences between its male and female faculty:
Question 4:
Part A: Fit the model salary ~ sex to estimate the expected difference in salaries of male vs. female faculty. Then consider the model salary ~ sex + rank. Does this more complex model provide a significantly better fit to the data?
Part B: Add yrs.since.phd to the more complex model from Part A. Does this model provide a significantly better fit to the data than the model salary ~ sex + rank?
Part C: Add discipline to the best model from Part B (either salary ~ sex + rank or salary ~ sex + rank + yrs.since.phd depending upon your decision). Does adding this variable significantly improve the fit of the model?
\(~\)
The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to measure whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, percentile scores are frequently used as a measure of general intelligence.
The data in this application is a random sample of \(n=2584\) Americans who were first selected and tested in 1979, then later re-interviewed in 2006 and asked about their education and annual income. The focus of this analysis will be modeling 2005 incomes using AFQT scores (from 1979) with and without adjusting for educational attainment.
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
This analysis will use the following variables:
AFQT: the subject's percentile score on the AFQT; AFQT = 90 indicates they scored higher than 90% of those who took the test.
Educ: the subject's years of completed education; Educ = 12 reflects the subject completing high school.
Question 5:
Part A: Compare the models Income2005 ~ AFQT and Income2005 ~ 1 (the intercept-only/null model). Does the more complex model provide a superior fit to the data?
Part B: Compare the models Income2005 ~ AFQT and Income2005 ~ AFQT + Educ. Which model provides a better fit to the data?
Part C: Look at the coefficient of AFQT in both models considered in Part B. How would you explain the difference between these coefficients? That is, why does AFQT seem to have a larger effect on income in the smaller, less complex model?
Interpret the coefficient of AFQT in the model you fit in Part E, where the outcome variable had undergone a log-transformation. Do not use the units of log-dollars in your interpretation. Instead, try to come up with an interpretation involving dollars.