\(~\)
Our lecture slides introduced the \(F\)-test as a means for comparing two nested regression models. This can help us compare the efficacy of more complex models with simpler models that are nested within them.
After deciding upon a model, we might also want to make specific inferences about the adjusted effects of variables present in that model. We can do this by using a \(t\)-test to evaluate the null hypothesis \(H_0: \beta_j = 0\). When \(\beta_j\) is the only coefficient representing its predictor, this \(t\)-test is equivalent to the \(F\)-test comparing the model with and without that predictor. The equivalence breaks down when \(\beta_j\) corresponds to a single category of a nominal categorical variable represented using one-hot encoding, since removing one dummy coefficient is not the same as removing the entire variable.
Let’s see a quick example using the Iowa City homes sales data set:
library(ggplot2)
library(dplyr)
## Iowa City homes example data
ic_homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>%
filter(style %in% c("1 Story Frame", "2 Story Frame"))
We can fit two nested models and compare their fits using the \(F\)-test, which is performed using the
anova() function:
## Fit two nested models
mod1 = lm(sale.amount ~ area.living, data = ic_homes)
mod2 = lm(sale.amount ~ area.living + bsmt, data = ic_homes)
## Compare using ANOVA
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: sale.amount ~ area.living
## Model 2: sale.amount ~ area.living + bsmt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 529 1.4877e+12
## 2 525 1.3675e+12 4 1.2021e+11 11.537 5.514e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
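As an aside, when two nested models differ by a single coefficient, the \(F\)-statistic equals the square of the corresponding \(t\)-statistic. A quick sketch of this equivalence using mod1 and a hypothetical intercept-only baseline model (mod0 is not part of the example above):

```r
## Intercept-only model (hypothetical baseline for this check)
mod0 = lm(sale.amount ~ 1, data = ic_homes)

## F-statistic for adding area.living to the intercept-only model
f_stat = anova(mod0, mod1)$F[2]

## t-statistic for area.living reported by summary()
t_stat = summary(mod1)$coefficients["area.living", "t value"]

## For a single-df comparison these agree (up to rounding): t^2 == F
c(t_squared = t_stat^2, F_stat = f_stat)
```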
Because these models only differ by their inclusion of the
categorical predictor bsmt, we might wonder which
categories of bsmt differ from each other, a question we
can answer using the \(t\)-tests
performed by the summary() function:
## t-tests of beta=0 using summary
summary(mod2)
##
## Call:
## lm(formula = sale.amount ~ area.living + bsmt, data = ic_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195584 -30689 116 19651 393381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6456.705 26382.919 -0.245 0.807
## area.living 132.786 4.105 32.344 <2e-16 ***
## bsmt3/4 39511.036 57085.466 0.692 0.489
## bsmtCrawl 1220.275 57068.208 0.021 0.983
## bsmtFull 13120.015 25652.095 0.511 0.609
## bsmtNone -24816.672 26053.229 -0.953 0.341
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51040 on 525 degrees of freedom
## Multiple R-squared: 0.7026, Adjusted R-squared: 0.6997
## F-statistic: 248 on 5 and 525 DF, p-value: < 2.2e-16
Interestingly, we do not see any statistically significant differences between the encoded categories and the reference category of “1/2”. This is misleading, however: it does not mean there are no pairwise differences among the categories (in fact, we’d expect the opposite given the \(F\)-test results), because each \(t\)-test only compares a single category against the reference.
Below we change the reference category to “None” and show the \(t\)-test results:
## Change reference cat
mod2_reordered = lm(sale.amount ~ area.living + factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full")), data = ic_homes)
## t-tests
summary(mod2_reordered)
##
## Call:
## lm(formula = sale.amount ~ area.living + factor(bsmt, levels = c("None",
## "1/2", "3/4", "Crawl", "Full")), data = ic_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195584 -30689 116 19651 393381
##
## Coefficients:
## Estimate
## (Intercept) -31273.377
## area.living 132.786
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2 24816.672
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4 64327.708
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl 26036.947
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full 37936.687
## Std. Error
## (Intercept) 6833.152
## area.living 4.105
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2 26053.229
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4 51398.105
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl 51278.362
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full 5610.618
## t value
## (Intercept) -4.577
## area.living 32.344
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2 0.953
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4 1.252
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl 0.508
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full 6.762
## Pr(>|t|)
## (Intercept) 5.90e-06
## area.living < 2e-16
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2 0.341
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4 0.211
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl 0.612
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full 3.65e-11
##
## (Intercept) ***
## area.living ***
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51040 on 525 degrees of freedom
## Multiple R-squared: 0.7026, Adjusted R-squared: 0.6997
## F-statistic: 248 on 5 and 525 DF, p-value: < 2.2e-16
Here we see that there is overwhelming statistical evidence of a difference in sale amounts for homes with a full basement relative to homes with no basement (\(p < 0.001\)) after adjusting for differences in living area.
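As an aside, the same change of reference category can be done more concisely with relevel(), which moves a chosen level to the front without spelling out the full ordering (a sketch; the fitted model is identical to the one above, only the variable name differs):

```r
## relevel() makes "None" the reference category; other levels keep their order
ic_homes$bsmt_ref_none = relevel(factor(ic_homes$bsmt), ref = "None")
mod2_releveled = lm(sale.amount ~ area.living + bsmt_ref_none, data = ic_homes)
summary(mod2_releveled)  ## coefficients are now contrasts against "None"
```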
\(~\)
The examples in this lab will use the Iowa City home sales data provided in the introduction and discussed in our previous lab.
\(~\)
Recall that the \(F\)-test is built upon the assumption that the alternative model’s errors are independent and identically Normally distributed. We can check the appropriateness of these assumptions using two diagnostic plots:
In the first plot (residuals vs. fitted values), we are looking for no clear pattern in the residuals (the red smoothing line stays near zero) and roughly equal spread (in the y-dimension) across the range of the x-axis.
In the second plot (the Normal Q-Q plot), we are looking for the observed residuals to fall roughly on a 45-degree line, indicating they align with their corresponding expected percentiles from a Normal distribution.
Below is a realistic example where both assumptions should be considered met:
met_model = lm(log2(assessed) ~ area.living + built, data = ic_homes)
plot(met_model, which = c(1,2))
This example illustrates a key point: real data will rarely satisfy these assumptions perfectly, so we are looking for the absence of clear violations rather than an exact match.
Now let’s see an example where the assumptions are violated:
violated_model = lm(sale.amount ~ area.lot + area.bsmt + area.living + built, data = ic_homes)
plot(violated_model, which = c(1,2))
In the first plot, the residuals show a clear pattern and their spread increases with the fitted values rather than remaining roughly constant.
In the second plot, the residuals depart from the 45-degree line, particularly in the tails, suggesting the errors are not Normally distributed.
Recall that we might choose to address these issues via a log-transformation of the outcome, by including additional predictors, or by simply reporting our results with a degree of caution proportional to how severe we believe the violations to be.
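As a sketch of the first of these remedies, we could re-fit the violated model with a log-transformed outcome and re-examine the diagnostics (whether the violations are adequately addressed should be judged from the new plots):

```r
## Re-fit with a log2-transformed outcome, then re-check the diagnostics
violated_model_log = lm(log2(sale.amount) ~ area.lot + area.bsmt + area.living + built,
                        data = ic_homes)
plot(violated_model_log, which = c(1, 2))
```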
Question #1:
- Fit the following models: area.living ~ built, area.living ~ bedrooms + style, area.living ~ bedrooms + built, and area.living ~ bedrooms + style + built.
- Use the anova() function to compare the models area.living ~ bedrooms + style + built and area.living ~ built. Report a one-sentence conclusion indicating which model offers a better fit to the data.
- Use the log2() function to apply a log-transformation using a base of 2 to the outcome variable. After this transformation, do the assumptions you evaluated in Part C seem more reasonable? Briefly explain.
- Use the summary() function to determine whether 2 Story Frame homes have a significantly larger living area than 1 Story Frame homes after adjusting for the number of bedrooms and year built.
\(~\)
In this portion of the lab you’ll analyze data collected by a large public university as part of an investigation into salary differences between male and female faculty members:
fac_sal = read.csv("https://remiller1450.github.io/data/Salaries.csv")
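Before diving in, it can help to glance at the variables involved (a quick sketch; this assumes the columns named in the questions below — sex, rank, salary, yrs.since.phd, and discipline — are present in the file):

```r
## Inspect variable types and the group sizes before modeling
str(fac_sal)
table(fac_sal$sex, fac_sal$rank)  ## counts of faculty by sex and rank
```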
Question #2:
- Evaluate whether rank might confound the marginal association between sex and salary. You may use either visualizations or descriptive statistics to check both pieces of the definition. Hint: remember that the definition of confounding requires there to be an association between a third variable and both the explanatory and response variables.
- Compare the models salary ~ sex + rank and salary ~ sex. Report whether the more complex model seems to have a better fit.
- Determine whether yrs.since.phd and discipline should also be added to the model to further improve the fit (over the alternative model from Part D).
- Use the summary() function to perform a \(t\)-test investigating whether there is a difference in the expected salaries of male and female faculty after adjusting for the other variables in your final model. Report the adjusted difference in salaries as part of a one-sentence conclusion that cites the \(p\)-value of this \(t\)-test.