Directions (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Lab

Similar to some of our previous labs, the first section of this lab will walk through a regression analysis including:

  1. Building a well-fitting model using F-tests
  2. Checking the assumptions of this model using plots of the residuals
  3. Evaluating the roles of individual variables using \(t\)-tests

Each step will have an embedded question that you are expected to answer.

Our examples will use the Iowa City home sales data set that has been used in several of our previous labs:

ic_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

\(~\)

Step 1 - Using the F-test

To compare the fit of two different models using an F-test, we first need to fit each model to our sample data using lm() and store that fit in an R object:

## Store the null model
mod0 = lm(sale.amount ~ 1, data = ic_homes)
  
## Store the alternative
mod1 = lm(sale.amount ~ assessed, data = ic_homes)

We then use the anova() function to perform the hypothesis test:

## F-test of no difference b/w models
anova(mod0, mod1)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ 1
## Model 2: sale.amount ~ assessed
##   Res.Df        RSS Df  Sum of Sq     F    Pr(>F)    
## 1    776 6.3775e+12                                  
## 2    775 3.4093e+11  1 6.0365e+12 13722 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not surprisingly, there’s overwhelming evidence (a \(p\)-value of essentially zero) that using a home’s assessed value to predict its sale price is better than predicting every home sold for the same amount, the overall average sale price.
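If you'd like to see what anova() is computing behind the scenes, the F statistic can be reproduced by hand from the residual sums of squares of the two models. The sketch below uses R's built-in mtcars data (not the lab's data set) so that it is self-contained:

```r
## Sketch using the built-in mtcars data (not the lab's data set):
## the F statistic compares the drop in RSS to the larger model's residual variance
mod0 <- lm(mpg ~ 1, data = mtcars)    # null (intercept-only) model
mod1 <- lm(mpg ~ wt, data = mtcars)   # alternative model

rss0 <- sum(resid(mod0)^2)            # RSS of the null model
rss1 <- sum(resid(mod1)^2)            # RSS of the alternative model
df_num <- mod0$df.residual - mod1$df.residual  # number of added parameters (here 1)
df_den <- mod1$df.residual                     # residual df of the larger model

F_hand <- ((rss0 - rss1) / df_num) / (rss1 / df_den)

## This matches the F column reported by anova(mod0, mod1)
F_hand
```

The numerator measures the improvement in fit per added parameter, while the denominator estimates the residual variance of the larger model, which is why only nested models can be compared this way.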

The nice thing about the F-test is that it can be used to statistically compare any two nested models. This means we can add additional explanatory variables to mod1 and see if they further improve the model:

## Add area.living as a second explanatory var
mod2 = lm(sale.amount ~ assessed + area.living, data = ic_homes)

## F-test comparing mod2 and mod1 (which only had assessed)
anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ assessed
## Model 2: sale.amount ~ assessed + area.living
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1    775 3.4093e+11                            
## 2    774 3.3975e+11  1 1181199358 2.6909 0.1013

Here the \(p\)-value (0.1013) is borderline, but high enough that we’d most likely conclude there is a lack of evidence that the more complicated model, which includes both assessed and area.living, is superior to the simpler model that only includes assessed as a predictor.

If we weren’t satisfied with the simpler model, we might continue searching for additional predictors that offer a statistically meaningful improvement in fit:

## Instead try 'built' as the second explanatory var
mod2b = lm(sale.amount ~ assessed + built, data = ic_homes)

## F-test comparing mod2b and mod1
anova(mod1, mod2b)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ assessed
## Model 2: sale.amount ~ assessed + built
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    775 3.4093e+11                                   
## 2    774 3.3526e+11  1 5668530388 13.087 0.0003167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This time we see that using both assessed and built is superior to the simpler model that only uses assessed as a predictor of sale.amount.

Question 1:

  • Part A: Is it acceptable to use an F-test to compare the following models: area.living ~ assessed + built and area.living ~ bedrooms + built? Briefly explain. Hint: Recall that the F-test is only applicable for nested models.
  • Part B: Use one or more F-tests to choose a model from the following three candidate models, favoring the simpler model whenever the test suggests there is not a significant difference in fit. Model 1: area.living ~ 1 (intercept-only), Model 2: area.living ~ bedrooms and Model 3: area.living ~ bedrooms + built.

\(~\)

Step 2 - Evaluating Assumptions

After we’ve decided upon a model that we believe offers an acceptable fit to our data (relative to other reasonable models that we’ve considered), the next step is to verify that statistical inference using an F-test and/or \(t\)-tests is valid for this model. This requires us to check two primary assumptions:

  1. The model’s errors are independent
  2. The model’s errors are Normally distributed with constant variance

Using the R object storing our fitted model, we can check these assumptions using graphs created by the plot() function:

## Residuals vs. Fitted values (check independence and constant variance)
plot(mod2b, which = 1)

## QQ-plot (check Normality assumption)
plot(mod2b, which = 2)

Unfortunately, we see that the errors of this model deviate from what would be expected if they were Normally distributed, though paying attention to the units of the graph suggests the deviation might not be as extreme as it visually appears to be. These data also exhibit heteroskedasticity (non-constant variance), which is another cause for concern.

There are two options to consider in this scenario:

  1. Reconfigure the model by transforming the outcome variable and/or including additional predictors to get a distribution of errors that appears more acceptable.
  2. Report your statistical results with caution. More specifically, you wouldn’t want to overstate the significance of a \(p\)-value of 0.04, but you can still be confident that a \(p\)-value of 0.001 is sufficient evidence of an association.

Below is an example of how the log() function can be used to transform the outcome variable:

## Example of a base-2 log-transformation
mod2c = lm(log(sale.amount, base = 2) ~ assessed + built, data = ic_homes)
plot(mod2c, which = 1)

plot(mod2c, which = 2)

We can see that the log-transformation improved the Normality of the residuals, but it also introduced a clear quadratic pattern in them. This pattern suggests incorporating quadratic effects (for one or more of the predictors) into our model, which is demonstrated below:

## Example of a quadratic polynomial effect
mod2d = lm(log(sale.amount, base = 2) ~ poly(assessed, degree=2) + built, data = ic_homes)
plot(mod2d, which = 1)

plot(mod2d, which = 2)

Notice the improvement in both plots. Although neither is perfect, they are good enough that the model is now suitable for statistical inference, though its interpretability has been negatively impacted.
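If you do use a log-transformed outcome, some interpretability can be recovered by back-transforming: with a base-2 log, a slope of \(b\) means a one-unit increase in the predictor multiplies the expected outcome by \(2^b\). Below is a self-contained sketch using the built-in mtcars data (not the lab's model):

```r
## Sketch using the built-in mtcars data: fit a model with a base-2
## log-transformed outcome, then back-transform the slope
fit_log <- lm(log(mpg, base = 2) ~ wt, data = mtcars)
b_wt <- coef(fit_log)["wt"]

## Each 1-unit increase in wt (1000 lbs) multiplies expected mpg by 2^b_wt;
## a value below 1 corresponds to a percentage decrease in mpg
mult <- 2^b_wt
```

For instance, if \(2^b\) came out to 0.8, each additional 1000 lbs of weight would be associated with an expected 20% drop in mpg.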

Question 2:

  • Part A: Regardless of the model you selected at the time, evaluate the assumptions of the most complex model you fit in Question 1: area.living ~ bedrooms + built.
  • Part B: Regardless of your findings in Part A, perform a log-transformation using a base of 2 on the model’s outcome variable and re-fit the model to the sample data. Do you believe this transformation led to the model’s errors being closer to following a Normal distribution?

Step 3 - Tests on individual variables

The summary() function will provide a table that includes the results of \(t\)-tests of the null hypothesis \(H_0:\beta_j = 0\).

## One of our previous example models
mod2b = lm(sale.amount ~ assessed + built, data = ic_homes)

## The summary table
summary(mod2b)
## 
## Call:
## lm(formula = sale.amount ~ assessed + built, data = ic_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150413   -7085     493    7827  149697 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.705e+05  4.717e+04   3.615 0.000320 ***
## assessed     1.035e+00  8.761e-03 118.093  < 2e-16 ***
## built       -8.657e+01  2.393e+01  -3.618 0.000317 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20810 on 774 degrees of freedom
## Multiple R-squared:  0.9474, Adjusted R-squared:  0.9473 
## F-statistic:  6975 on 2 and 774 DF,  p-value: < 2.2e-16

In this example, we have strong evidence that assessed is associated with sale price after adjusting for year built, and that built is associated with sale price after adjusting for the home’s assessed value.
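It can be illuminating to see where the columns of the summary() table come from: each t value is simply the coefficient estimate divided by its standard error, and the p-value comes from a \(t\)-distribution with the model's residual degrees of freedom. The sketch below uses the built-in mtcars data (not the lab's model) so that it is self-contained:

```r
## Sketch using the built-in mtcars data: reproduce summary()'s
## "t value" and "Pr(>|t|)" columns by hand for one coefficient
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

## t value = Estimate / Std. Error
t_hand <- coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]

## Two-sided p-value from a t-distribution with the residual df
p_hand <- 2 * pt(abs(t_hand), df = fit$df.residual, lower.tail = FALSE)
```

These values match the "t value" and "Pr(>|t|)" entries for wt in summary(fit).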

The example shown above was straightforward, but these tests become more nuanced for models involving one or more categorical explanatory variables:

## Load dplyr for filter() and the pipe operator
library(dplyr)

## Filter to only include 3 different styles of home
ic_homes_filtered = ic_homes %>% filter(style %in% c("1 Story Frame", "2 Story Frame", "Split Foyer Frame"))

## Fit model and display summary table
mod3 = lm(sale.amount ~ area.living + style, data = ic_homes_filtered)
summary(mod3)
## 
## Call:
## lm(formula = sale.amount ~ area.living + style, data = ic_homes_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -235703  -25762   -1134   20265  349086 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -17685.804   5619.853  -3.147  0.00173 ** 
## area.living               154.752      4.221  36.661  < 2e-16 ***
## style2 Story Frame     -40398.035   4947.992  -8.165 1.85e-15 ***
## styleSplit Foyer Frame   5339.262   5752.208   0.928  0.35366    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47230 on 611 degrees of freedom
## Multiple R-squared:  0.7092, Adjusted R-squared:  0.7078 
## F-statistic: 496.7 on 3 and 611 DF,  p-value: < 2.2e-16

In this example we should notice that “1 Story Frame” was designated as the reference category, so the significant test result and negative coefficient for the re-coded variable style2 Story Frame indicate that 2-story homes are expected to sell for significantly lower prices than 1-story homes, after adjusting for living area.

Unfortunately, we cannot directly obtain a pairwise comparison between 2-story and split foyer homes from this output. The easiest way to obtain this comparison is to change the variable’s reference category using factor() and relevel(), which is shown below:

ic_homes_relvl = ic_homes_filtered %>% mutate(new_style = relevel(factor(style), ref = 'Split Foyer Frame'))
mod3b = lm(sale.amount ~ area.living + new_style, data = ic_homes_relvl)
summary(mod3b)
## 
## Call:
## lm(formula = sale.amount ~ area.living + new_style, data = ic_homes_relvl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -235703  -25762   -1134   20265  349086 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -12346.542   6976.797  -1.770   0.0773 .  
## area.living               154.752      4.221  36.661   <2e-16 ***
## new_style1 Story Frame  -5339.262   5752.208  -0.928   0.3537    
## new_style2 Story Frame -45737.297   6799.518  -6.727    4e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47230 on 611 degrees of freedom
## Multiple R-squared:  0.7092, Adjusted R-squared:  0.7078 
## F-statistic: 496.7 on 3 and 611 DF,  p-value: < 2.2e-16

A more generalizable alternative is to use contrasts, a topic we will not have time to cover, but that you can learn about in the vignette for the emmeans package.
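To get a sense of what a contrast does, a pairwise comparison between two non-reference levels can also be built by hand from the fitted coefficients and their covariance matrix (obtained with vcov()), without refitting anything. The sketch below uses the built-in mtcars data with cyl as a factor (not the lab's housing model):

```r
## Sketch of a hand-built contrast using the built-in mtcars data:
## compare two non-reference levels of a factor without releveling.
## cyl has levels 4 (the reference), 6, and 8.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
b <- coef(fit)
V <- vcov(fit)

## Estimated difference: 8-cylinder vs 6-cylinder (both coded relative to 4)
est <- b["factor(cyl)8"] - b["factor(cyl)6"]

## Standard error of the difference uses the coefficient covariance matrix
se <- sqrt(V["factor(cyl)8", "factor(cyl)8"] +
           V["factor(cyl)6", "factor(cyl)6"] -
           2 * V["factor(cyl)8", "factor(cyl)6"])

## t statistic and two-sided p-value for this pairwise comparison
t_stat <- est / se
p_val <- 2 * pt(abs(t_stat), df = fit$df.residual, lower.tail = FALSE)
```

This reproduces exactly what releveling the factor and reading off the new coefficient would give, which is the idea packages like emmeans automate for every pair of levels.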

Question 3: Fit the linear regression model sale.amount ~ area.living + built + ac and interpret the hypothesis tests on each coefficient in the model that are given by the summary() function. A satisfactory interpretation will mention whether you can or cannot be statistically confident that a predictor is associated with the outcome after adjusting for the other variables in the model, as well as an explanation of how that predictor is associated with the outcome (only for predictors you deem associated).

\(~\)

Application #1

In this application you’ll analyze data from a major United States university that was collected as part of the university’s ongoing evaluation of salary differences between its male and female faculty:

https://remiller1450.github.io/data/Salaries.csv

Question 4:

  • Part A: A previous analysis of these data used the model salary ~ sex to estimate the expected difference in salaries of male vs. female faculty. Consider the model salary ~ sex + rank. Does this more complex model provide a significantly better fit to the data?
  • Part B: Now consider adding the variable yrs.since.phd to the more complex model from Part A. Does this model provide a significantly better fit to the data than the model salary ~ sex + rank?
  • Part C: Finally, consider adding the predictor discipline to the best model from Part B (either salary ~ sex + rank or salary ~ sex + rank + yrs.since.phd depending upon your decision). Does adding this variable significantly improve the fit of the model?
  • Part D: Check the assumptions of the model you identified as best in Parts A-C. Comment upon the appropriateness and reliability of statistical results involving this model based upon what you see.
  • Part E: Regardless of your answer to Part D, perform a test of \(H_0: \beta_{sex} = 0\) using the model you identified as best in Parts A-C. Interpret the results of this test. A proper interpretation should indicate whether there is an association between sex and salary, as well as the magnitude/direction of the association (if an association was found).

\(~\)

Application #2

The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to measure whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, percentile scores are frequently used as a measure of general intelligence.

The data in this application is a random sample of \(n=2584\) Americans who were first selected and tested in 1979, then later re-interviewed in 2006 and asked about their education and annual income. The focus of this analysis will be modeling 2005 incomes using AFQT scores (from 1979) with and without adjusting for educational attainment.

afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")

This analysis will use the following variables:

  • Income2005 - the subject’s reported income in 2005 in US dollars.
  • AFQT - the percentile ranking of the subject when they took the AFQT in 1979. For example, AFQT = 90 indicates they scored higher than 90% of those who took the test.
  • Educ - the number of years of educational attainment by the subject. For example, Educ = 12 reflects the subject completing high school.

Question 5:

  • Part A: Fit the models Income2005 ~ AFQT and Income2005 ~ 1 (the intercept only/null model). Does the more complex model provide a superior fit to the data?
  • Part B: Now compare the models Income2005 ~ AFQT and Income2005 ~ AFQT + Educ. Which model provides a better fit to the data?
  • Part C: Compare the coefficients of the explanatory variable AFQT in both models considered in Part B. How would you explain the difference between these coefficients? That is, why does AFQT seem to have a larger effect on income in the smaller, less complex model?
  • Part D: Evaluate the assumptions of the model you identified as best in Parts A-C.
  • Part E: In Part D you should have found a lack of Normality in the model’s residuals. To address this, perform a log-transformation on the outcome variable and re-fit the model using this transformation. Then, reassess the model assumptions and comment upon whether statistical inference is more justifiable using this new model.
  • Part F (optional, 0.25 pts extra credit): Interpret the effect of a 1-unit increase in AFQT in the model you fit in Part E, where the outcome variable had undergone a log-transformation. Do not use the units of log-dollars in your interpretation. Instead, try to come up with an interpretation involving dollars.