\(~\)

Onboarding

Our lecture slides introduced the \(F\)-test as a means for comparing two nested regression models. This allows us to assess whether a more complex model fits the data significantly better than a simpler model nested within it.

After deciding upon a model, we might also want to make specific inferences about the adjusted effects of the variables present in that model. We can do this by using a \(t\)-test to evaluate the null hypothesis \(H_0: \beta_j = 0\). Unless \(\beta_j\) corresponds to a single category of a nominal categorical variable represented using one-hot encoding, this \(t\)-test is equivalent to the \(F\)-test comparing the model with and without that predictor.
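As a quick sanity check of this equivalence, here is a minimal sketch using R's built-in mtcars data (the object names are our own); because the added predictor contributes a single degree of freedom, the two tests return identical \(p\)-values and \(F = t^2\):

## Sanity check: t-test and F-test agree for a single-df predictor
m_small = lm(mpg ~ wt, data = mtcars)
m_large = lm(mpg ~ wt + hp, data = mtcars)
summary(m_large)$coefficients["hp", ]  ## t-test of H0: beta_hp = 0
anova(m_small, m_large)                ## F-test; F equals t^2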

Let’s see a quick example using the Iowa City home sales data set:

library(ggplot2)
library(dplyr)


## Iowa City homes example data
ic_homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>% 
              filter(style %in% c("1 Story Frame", "2 Story Frame"))

We can fit two nested models and compare their fits using the \(F\)-test, which is performed using the anova() function:

## Fit two nested models
mod1 = lm(sale.amount ~ area.living, data = ic_homes)
mod2 = lm(sale.amount ~ area.living + bsmt, data = ic_homes)

## Compare using ANOVA
anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ area.living
## Model 2: sale.amount ~ area.living + bsmt
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    529 1.4877e+12                                   
## 2    525 1.3675e+12  4 1.2021e+11 11.537 5.514e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
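As a check on our understanding, the \(F\) statistic in this table can be reproduced by hand from the two residual sums of squares, \(F = \frac{(RSS_1 - RSS_2)/4}{RSS_2/525}\):

## Reproduce the F statistic from the table by hand
rss1 = sum(resid(mod1)^2)
rss2 = sum(resid(mod2)^2)
f_stat = ((rss1 - rss2) / 4) / (rss2 / 525)  ## df values from the table
f_stat                                       ## matches 11.537
pf(f_stat, df1 = 4, df2 = 525, lower.tail = FALSE)  ## matches Pr(>F)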

Because these models differ only in their inclusion of the categorical predictor bsmt, we might wonder which categories of bsmt differ from each other, a question we can begin to answer using the \(t\)-tests performed by the summary() function:

## t-tests of beta=0 using summary
summary(mod2)
## 
## Call:
## lm(formula = sale.amount ~ area.living + bsmt, data = ic_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -195584  -30689     116   19651  393381 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6456.705  26382.919  -0.245    0.807    
## area.living    132.786      4.105  32.344   <2e-16 ***
## bsmt3/4      39511.036  57085.466   0.692    0.489    
## bsmtCrawl     1220.275  57068.208   0.021    0.983    
## bsmtFull     13120.015  25652.095   0.511    0.609    
## bsmtNone    -24816.672  26053.229  -0.953    0.341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51040 on 525 degrees of freedom
## Multiple R-squared:  0.7026, Adjusted R-squared:  0.6997 
## F-statistic:   248 on 5 and 525 DF,  p-value: < 2.2e-16

Interestingly, we do not see any statistically significant differences between the encoded categories and the reference category of “1/2”. However, this is misleading and does not mean there are no pairwise differences (in fact, we’d expect the opposite given the \(F\)-test results).

Below we change the reference category to “None” and show the \(t\)-test results:

## Change reference cat
mod2_reordered = lm(sale.amount ~ area.living + factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full")), data = ic_homes)

## t-tests
summary(mod2_reordered)
## 
## Call:
## lm(formula = sale.amount ~ area.living + factor(bsmt, levels = c("None", 
##     "1/2", "3/4", "Crawl", "Full")), data = ic_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -195584  -30689     116   19651  393381 
## 
## Coefficients:
##                                                                        Estimate
## (Intercept)                                                          -31273.377
## area.living                                                             132.786
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2    24816.672
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4    64327.708
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl  26036.947
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full   37936.687
##                                                                      Std. Error
## (Intercept)                                                            6833.152
## area.living                                                               4.105
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2    26053.229
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4    51398.105
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl  51278.362
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full    5610.618
##                                                                      t value
## (Intercept)                                                           -4.577
## area.living                                                           32.344
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2     0.953
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4     1.252
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl   0.508
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full    6.762
##                                                                      Pr(>|t|)
## (Intercept)                                                          5.90e-06
## area.living                                                           < 2e-16
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2      0.341
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4      0.211
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl    0.612
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full  3.65e-11
##                                                                         
## (Intercept)                                                          ***
## area.living                                                          ***
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))1/2      
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))3/4      
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Crawl    
## factor(bsmt, levels = c("None", "1/2", "3/4", "Crawl", "Full"))Full  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51040 on 525 degrees of freedom
## Multiple R-squared:  0.7026, Adjusted R-squared:  0.6997 
## F-statistic:   248 on 5 and 525 DF,  p-value: < 2.2e-16

Here we see that there is overwhelming statistical evidence of a difference in sale amounts for homes with a full basement relative to homes with no basement (\(p < 0.001\)) after adjusting for differences in living area.
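As an aside, a tidier way to change the reference category is to relevel the factor before fitting, which keeps the coefficient labels short; a minimal sketch (the object names are our own):

## Tidier alternative: set the reference level before fitting
ic_homes_relevel = ic_homes %>% mutate(bsmt = relevel(factor(bsmt), ref = "None"))
mod2_relevel = lm(sale.amount ~ area.living + bsmt, data = ic_homes_relevel)
summary(mod2_relevel)  ## identical fit, shorter coefficient labels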

\(~\)

Lab

The examples in this lab will use the Iowa City home sales data provided in the introduction and discussed in our previous lab.

\(~\)

Assumptions for Inference

Recall that the \(F\)-test is built upon the assumption that the alternative model’s errors are independent and identically Normally distributed. We can check the plausibility of these assumptions using two plots:

  1. The model’s fitted values (predictions) vs. its residuals
  2. A Q-Q plot of the model’s residuals

In the first plot we are looking for no clear pattern in the residuals (the red smoothing line stays near zero) and roughly equal spread (in the y-dimension) across the range of the x-axis.

In the second plot we are looking for the observed residuals to roughly fall on a 45-degree line, indicating they align with their corresponding expected percentiles from a Normal distribution.

Below is a realistic example where both assumptions should be considered met:

met_model = lm(log2(assessed) ~ area.living + built, data = ic_homes)
plot(met_model, which = c(1,2))

This example illustrates a few key points:

  1. With real data you will never see perfection in these plots. It might be tempting to say there is a pattern in the residuals in the first plot, but the deviation of the red smoothing line from zero is driven by only \(\approx 3\) of the more than 500 data points.
  2. Similarly, on the Q-Q plot we see some points that aren’t on the 45-degree line, but the majority of observations are close to where they are expected to be.
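If you prefer ggplot2 (loaded earlier), the same two diagnostics can be rebuilt by hand; a minimal sketch:

## Manual versions of the two diagnostic plots using ggplot2
diag_df = data.frame(fit = fitted(met_model), res = resid(met_model))
ggplot(diag_df, aes(x = fit, y = res)) + geom_point() +
  geom_smooth(se = FALSE) + geom_hline(yintercept = 0, linetype = "dashed")
ggplot(diag_df, aes(sample = res)) + stat_qq() + stat_qq_line()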

Now let’s see an example where the assumptions are violated:

violated_model = lm(sale.amount ~ area.lot + area.bsmt + area.living + built, data = ic_homes)
plot(violated_model, which = c(1,2))

In the first plot:

  • The residuals become more spread out as predicted values increase, a pattern known as heteroskedasticity that implies the model’s errors do not follow the same distribution everywhere (instead they have unequal variance).
  • The smoothing line starts out above zero then trends below zero, suggesting it might not be random whether our model over-predicts or under-predicts the outcome, which is a violation of the Independence assumption.

In the second plot:

  • The residuals depart from the Normal distribution in both tails, suggesting the distribution of the actual errors is more spread out (heavier tails) than a Normal distribution.

Recall that we might choose to address these issues by applying a log-transformation, including additional predictors, or simply reporting our results with a degree of caution proportional to how severe we believe the violations to be.
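For instance, a quick sketch of the first remedy, refitting the model above with a log base-2 outcome and rechecking the diagnostics (the model name is our own):

## Possible remedy: refit with a log-transformed outcome and recheck
violated_model_log = lm(log2(sale.amount) ~ area.lot + area.bsmt + area.living + built, data = ic_homes)
plot(violated_model_log, which = c(1, 2))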

Question #1:

  • Part A: Recall that the \(F\)-test can only be used to compare nested models. Below are several different models one might consider for predicting the living area of a home. Which pairs of these models are nested and therefore can be compared using an \(F\)-test? List all pairings that can be compared.
    • Model 1: area.living ~ built
    • Model 2: area.living ~ bedrooms + style
    • Model 3: area.living ~ bedrooms + built
    • Model 4: area.living ~ bedrooms + style + built
  • Part B: Use an \(F\)-test performed by the anova() function to compare the models area.living ~ bedrooms + style + built and area.living ~ built. Report a one-sentence conclusion indicating which model offers a better fit to the data. (A starting sketch appears after this question.)
  • Part C: Check the assumptions of the alternative model in Part B. Briefly describe what you see in each plot in regard to these assumptions and whether you consider them to be reasonably satisfied or violated.
  • Part D: Use the log2() function to apply a log-transformation using a base of 2 to the outcome variable. After this transformation, do the assumptions you evaluated in Part C seem more reasonable? Briefly explain.
  • Part E: Using the log-transformed outcome model from Part D, use the \(t\)-test results provided by the summary() function to determine whether 2 Story Frame homes have a significantly larger living area than 1 Story Frame homes after adjusting for the number of bedrooms and year built.
  • Part F: Apply an appropriate inverse transformation to the Model from Part D to interpret the adjusted effect of a home having an additional bedroom (adjusted for style and year built). Hint: Write out the model equation and apply the inverse transformation to both sides.
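A starting sketch for Part B (the model object names are placeholders):

## Part B starting point: fit both models, then compare with anova()
mod_full = lm(area.living ~ bedrooms + style + built, data = ic_homes)
mod_reduced = lm(area.living ~ built, data = ic_homes)
anova(mod_reduced, mod_full)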

\(~\)

Application

In this portion of the lab you’ll analyze data collected by a large public university as part of an investigation into salary differences between male and female faculty members:

fac_sal = read.csv("https://remiller1450.github.io/data/Salaries.csv")
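A quick look at the available variables can help orient the analysis:

## Inspect the variables before modeling
str(fac_sal)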

Question #2:

  • Part A: Use a two-sample \(t\)-test to evaluate whether there is a statistically significant difference in the salaries of male vs. female faculty. Report a one-sentence conclusion that includes the sample means and \(p\)-value. (A starting sketch appears after this question.)
  • Part B: Did the \(t\)-test from Part A assess a marginal effect, a conditional effect, or an adjusted effect? Is this a problem? Identify which type of effect it assessed and briefly explain your reasoning.
  • Part C: Apply the definition of confounding to check whether the variable rank might confound the marginal association between sex and salary. You may use either visualizations or descriptive statistics to check both pieces of the definition. Hint: remember that the definition of confounding requires there to be an association between a third variable and both the explanatory and response variables.
  • Part D: Use an \(F\)-test to compare the models salary ~ sex + rank and salary ~ sex. Report whether the more complex model seems to have a better fit.
  • Part E: Now use a series of \(F\)-tests to determine whether yrs.since.phd and discipline should also be added to the model to further improve the fit (over the alternative model from Part D).
  • Part F: Check the assumptions of the final model you decided upon in Part E. Briefly comment on each assumption.
  • Part G: If you deemed it necessary in Part F, perform a log-transformation on the outcome variable; otherwise skip this step.
  • Part H: Use the summary() function to perform a \(t\)-test investigating whether there is a difference in the expected salaries of male and female faculty after adjusting for the other variables in your final model. Report the adjusted difference in salaries as part of a one-sentence conclusion that cites the \(p\)-value of this \(t\)-test.
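A starting sketch for Part A (assuming the salary and sex columns referenced above):

## Part A starting point: two-sample t-test comparing salaries by sex
t.test(salary ~ sex, data = fac_sal)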