\(~\)

Lab

This lab contains no onboarding section; you should read through the sections below and work through them with your partners.

The lab’s examples will use the Iowa City home sales data set introduced earlier in the semester, focusing only on 1-story frame, split foyer frame, and 2-story frame homes, which are the three most common home types:

## Libraries and data
library(ggplot2)
library(dplyr)
homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>%
          filter(style %in% c("1 Story Frame", "2 Story Frame", "Split Foyer Frame"))
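
If you'd like to confirm the filtering worked as intended, a quick optional check is to tabulate the home styles that remain:

## Optional check: counts of each remaining home style
table(homes$style)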

\(~\)

Comparing Nested Models with the \(F\)-test

Recall that two models are nested if the coefficients in one can be set to certain values to make it equivalent to the other model. In our example we’ll look at the following nested models:

  1. sale.amount ~ bedrooms + area.living
  2. sale.amount ~ bedrooms + area.living + style

Writing out the form of each model makes it clear that they are nested:

  • The reduced model is: \(Y = \beta_0 + \beta_1\text{Bedrooms} + \beta_2\text{Area} + \epsilon\)
  • The alternative model is: \(Y = \beta_0 + \beta_1\text{Bedrooms} + \beta_2\text{Area} + \beta_3\text{(Style = 2 Story)} + \beta_4\text{(Style = Split)} + \epsilon\)

We can see that if \(\beta_3 = 0\) and \(\beta_4 = 0\), the two models are equivalent; thus, the reduced model is a special case of the alternative model.

To use an \(F\)-test on these nested models we first need to fit each model to our data:

## Fit both models, store as fit1 and fit2
fit1 = lm(sale.amount ~ bedrooms + area.living, data = homes)
fit2 = lm(sale.amount ~ bedrooms + area.living + style, data = homes)

Next, we can use the anova() function to perform the \(F\)-test:

## F-test
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ bedrooms + area.living
## Model 2: sale.amount ~ bedrooms + area.living + style
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    612 1.4457e+12                                   
## 2    610 1.3225e+12  2 1.2319e+11 28.412 1.593e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The anova() function expects the reduced model to be given first, followed by the alternative model
  • The \(p\)-value is very small in this example, which indicates strong evidence that the alternative model fits the data more closely than the reduced model
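
To see where these numbers come from, here is a minimal sketch that reconstructs the \(F\)-statistic and \(p\)-value directly from the two fitted models (it should match the anova() output above):

## Reconstruct the F-statistic by hand (should match anova() above)
rss1 = sum(resid(fit1)^2)   ## RSS of the reduced model
rss2 = sum(resid(fit2)^2)   ## RSS of the alternative model
df1 = fit1$df.residual      ## residual df of the reduced model
df2 = fit2$df.residual      ## residual df of the alternative model
F_stat = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val = pf(F_stat, df1 - df2, df2, lower.tail = FALSE)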

As an additional example, let’s consider an even more complex model that also uses a home’s latitude as a predictor.

## Fit an even more complex model with a predictor that's unlikely to matter
fit3 = lm(sale.amount ~ bedrooms + area.living + style + lat, data = homes)

## F-test
anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ bedrooms + area.living + style
## Model 2: sale.amount ~ bedrooms + area.living + style + lat
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1    610 1.3225e+12                           
## 2    609 1.3224e+12  1  68353504 0.0315 0.8592

Here the \(p\)-value is large, so the data do not provide sufficient evidence that lat is a worthwhile predictor of sale.amount after the effects of bedrooms, area.living, and style have been accounted for.

Question #1: Consider the following three models that predict a home’s assessed value:

  1. assessed ~ area.living
  2. assessed ~ ac
  3. assessed ~ area.living + ac
  • Part A: Which of these models can be compared using an \(F\)-test? In other words, which pairs of models involve one model that is nested within the other?
  • Part B: Perform an \(F\)-test to compare Model #1 (assessed ~ area.living) and Model #3 (assessed ~ area.living + ac). Provide the \(p\)-value and a 1-sentence summary of what you conclude from the test.
  • Part C: Perform an \(F\)-test to compare Model #2 (assessed ~ ac) and Model #3 (assessed ~ area.living + ac). Provide the \(p\)-value and a 1-sentence summary of what you conclude from the test.
  • Part D: Based upon results from the earlier parts of this question, which of these three models should be preferred? Briefly explain your reasoning.

\(~\)

Checking the \(F\)-test Assumptions

Recall that valid statistical inference on a regression model assumes the model’s errors are independent and Normally distributed with constant variance. These assumptions can be checked using the plot() function:

## Check independence and constant variance
plot(fit3, which = 1)

  • The independence assumption is supported when there is no apparent pattern in the residuals; here the red smoothing line is nearly horizontal, which is consistent with that assumption
  • The constant variance assumption is assessed by looking for the same spread of residuals across the entire x-axis; it might be questionable here, as the spread appears smaller for lower-priced homes (left side of the graph) than for middle-priced homes (middle of the graph)
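
If you'd like to see all four of R's built-in diagnostic plots at once, one option (a minimal sketch) is to change the plotting layout before calling plot():

## Optional: display all four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(fit3)
par(mfrow = c(1, 1))  ## reset the plotting layout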

By changing the argument to be which = 2 we can assess the Normality assumption using a Q-Q plot:

## Check for Normality
plot(fit3, which = 2)

  • Here we see a fairly substantial departure from Normality: the distribution of the residuals appears to be right-skewed
    • You should recall that this can be addressed by log-transforming the model’s outcome variable (shown below)
## Re-fit log-transformed outcome
fit3_log = lm(log2(sale.amount) ~ bedrooms + area.living + style + lat, data = homes)

## Q-Q plot of log-transformed outcome
plot(fit3_log, which = 2)

If strategies like log-transformation are ineffective, non-parametric alternatives such as quantile regression exist. However, for the purposes of our class, you may simply report the results of your regression inferences with caution if a log-transformation is not sufficient to fulfill the assumptions for statistical inference.
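
If you're curious, below is a minimal sketch of median regression using the quantreg package (an extra package that would need to be installed separately; not required for this class):

## Optional: median (quantile) regression via the quantreg package
## install.packages("quantreg")
library(quantreg)
fit3_med = rq(sale.amount ~ bedrooms + area.living + style + lat, tau = 0.5, data = homes)
summary(fit3_med)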

Question #2:

  • Part A: Check the assumptions of Model #3 (assessed ~ area.living + ac) from Question #1. Briefly comment on whether you believe the assumptions of the \(F\)-test are satisfied. Your answer should address all of the major assumptions (independence, constant variance, and Normality of the residuals).
  • Part B: Apply a log-2 transformation to the model’s outcome variable and re-check the assumptions for valid statistical inference.

Note: If you decide a log-transformation is necessary, you technically should revisit the model selection step (i.e., the tests performed in Question #1); however, it’s relatively uncommon for a log-transformation to suggest a vastly different final model.

\(~\)

Polynomials and Non-linear Effects

As mentioned in the lecture slides, an improperly specified model can produce patterns in the residuals that violate the independence assumption. The most common example is when the model uses a linear term to represent a relationship that is actually non-linear. If such a pattern exists in your data, you may consider using a polynomial effect to represent that variable in the model.

As an example, let’s consider the area.bsmt variable in the Iowa City home sales data:

ggplot(homes, aes(x = area.bsmt, y = sale.amount)) + geom_point() + geom_smooth(se = FALSE)

We can propose two different linear regression models relating this variable to the outcome, sale.amount:

  1. sale.amount ~ area.bsmt
  2. sale.amount ~ poly(area.bsmt, degree=2)

The first is a simple linear regression model \(Y = \beta_0 + \beta_1X + \epsilon\), while the second uses a second degree polynomial of \(X\) and takes the form \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon\). We can see that these two models are nested, as when \(\beta_2 = 0\) they are equivalent.
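
A brief aside: by default, poly() uses orthogonal polynomials, which improve numerical stability but make the individual coefficients harder to read; setting raw = TRUE fits the literal \(X\) and \(X^2\) terms instead. Both parameterizations produce the same fitted values and the same \(F\)-test result. A minimal sketch (the name bsmt_raw is ours):

## Equivalent fit using raw (non-orthogonal) polynomial terms
bsmt_raw = lm(sale.amount ~ poly(area.bsmt, degree = 2, raw = TRUE), data = homes)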

Though subtle, the residual plot of the reduced model hints at the U-shaped pattern that we saw in our lecture slides:

bsmt_mod1 = lm(sale.amount ~ area.bsmt, data = homes)
plot(bsmt_mod1, which = 1)

Because the models are nested, we can use an \(F\)-test to compare them:

bsmt_mod2 = lm(sale.amount ~ poly(area.bsmt, degree=2), data = homes)
anova(bsmt_mod1, bsmt_mod2)
## Analysis of Variance Table
## 
## Model 1: sale.amount ~ area.bsmt
## Model 2: sale.amount ~ poly(area.bsmt, degree = 2)
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    613 3.6001e+12                                   
## 2    612 3.3180e+12  1 2.8214e+11 52.041 1.613e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we have strong evidence that using a quadratic effect for the predictor area.bsmt significantly improves model fit. Thus, we can conclude that there is a non-linear relationship between the area of a home’s basement and its sale price.
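
To visualize what the quadratic model has captured, one option (a sketch using the ggplot2 package we loaded earlier) is to overlay the fitted quadratic curve on the scatterplot:

## Overlay the fitted quadratic curve on the raw data
ggplot(homes, aes(x = area.bsmt, y = sale.amount)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)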

Question #3: Use an \(F\)-test to determine whether the size of a home’s lot is better represented by a quadratic effect or a linear effect in a model that predicts sale price and adjusts for the home’s living area.

\(~\)

\(t\)-tests for Individual Coefficients

After using the \(F\)-test to determine a suitable final model, it is important to evaluate the roles of individual variables within that model. The summary() function provides \(t\)-test results for individual model coefficients. Each of these tests uses the null hypothesis \(H_0: \beta_j = 0\), which corresponds to that variable making no contribution toward the outcome.

## Fit 3 from earlier in the lab
fit3 = lm(sale.amount ~ bedrooms + area.living + style + lat, data = homes)
summary(fit3)
## 
## Call:
## lm(formula = sale.amount ~ bedrooms + area.living + style + lat, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -207707  -23735   -2156   19434  352041 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.201e+06  6.965e+06   0.172    0.863    
## bedrooms                1.016e+04  2.346e+03   4.332 1.72e-05 ***
## area.living             1.432e+02  4.947e+00  28.945  < 2e-16 ***
## style2 Story Frame     -3.643e+04  4.969e+03  -7.333 7.23e-13 ***
## styleSplit Foyer Frame  2.195e+03  5.722e+03   0.384    0.701    
## lat                    -2.967e+04  1.672e+05  -0.177    0.859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46600 on 609 degrees of freedom
## Multiple R-squared:  0.7179, Adjusted R-squared:  0.7156 
## F-statistic: 309.9 on 5 and 609 DF,  p-value: < 2.2e-16

Below we interpret the key information in the output given by summary():

  • The \(p\)-value for the \(t\)-test evaluating whether the coefficient for bedrooms is zero is very small, suggesting there is strong evidence of an association between bedrooms and sale.amount after adjusting for differences in living area, style, and latitude.
    • Furthermore, the coefficient of bedrooms is 10160, which tells us that each additional bedroom is associated with an expected increase in sale price of $10,160, assuming living area, style, and latitude remain unchanged
  • The variable style is categorical, which requires us to interpret it differently
    • The data provide significant evidence that the “2 Story Frame” style sells for less than the reference category of “1 Story Frame” after adjusting for differences in living area, number of bedrooms, and latitude.
    • The data do not provide evidence of a difference between the “Split Foyer Frame” and “1 Story Frame” styles after adjustment

Notice that we aren’t able to statistically test the difference between “Split Foyer Frame” and “2 Story Frame” using just this model. The easiest way to perform that test is to re-fit the model using a different reference category, which is described in Lab 7.
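
As a preview of that approach, here is a minimal sketch that changes the reference category using relevel() (the names style2 and fit3_alt_ref are ours, used only for illustration):

## Sketch: make "2 Story Frame" the reference category, then re-fit
homes$style2 = relevel(factor(homes$style), ref = "2 Story Frame")
fit3_alt_ref = lm(sale.amount ~ bedrooms + area.living + style2 + lat, data = homes)
summary(fit3_alt_ref)  ## t-tests now compare each style against "2 Story Frame"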

In addition to hypothesis testing, the \(t\)-distribution is also used as the basis for confidence interval estimates for model coefficients:

## 99% confidence interval estimates
confint(fit3, level = 0.99)
##                                0.5 %       99.5 %
## (Intercept)            -1.679673e+07 19199501.892
## bedrooms                4.102255e+03    16226.927
## area.living             1.304015e+02      155.966
## style2 Story Frame     -4.927241e+04   -23594.640
## styleSplit Foyer Frame -1.259029e+04    16980.695
## lat                    -4.617859e+05   402445.000
  • The estimated adjusted effect of area.living was $143 per additional square foot (seen in the previous summary table)
    • The 99% confidence interval estimate suggests that the true effect is plausibly somewhere between $130 and $156.
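
To connect this output with the \(t\)-distribution, here is a minimal sketch that reconstructs the 99% interval for area.living by hand using the familiar estimate plus/minus \(t^* \times SE\) formula:

## Reconstruct the 99% CI for area.living by hand
est = coef(fit3)["area.living"]
se = summary(fit3)$coefficients["area.living", "Std. Error"]
t_star = qt(0.995, df = fit3$df.residual)   ## critical value for 99% confidence
est + c(-1, 1) * t_star * se                ## should match confint() above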

Finally, you should note that coefficient interpretations change when the outcome has been log-transformed:

## Log-transformed model from earlier
fit3_log = lm(log2(sale.amount) ~ bedrooms + area.living + style + lat, data = homes)
summary(fit3_log)
## 
## Call:
## lm(formula = log2(sale.amount) ~ bedrooms + area.living + style + 
##     lat, data = homes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62506 -0.18333  0.00771  0.20195  1.23185 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7.042e+01  4.781e+01   1.473    0.141    
## bedrooms                1.378e-01  1.610e-02   8.560  < 2e-16 ***
## area.living             8.179e-04  3.395e-05  24.087  < 2e-16 ***
## style2 Story Frame     -1.711e-01  3.410e-02  -5.018 6.86e-07 ***
## styleSplit Foyer Frame  7.400e-02  3.928e-02   1.884    0.060 .  
## lat                    -1.310e+00  1.148e+00  -1.142    0.254    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3198 on 609 degrees of freedom
## Multiple R-squared:  0.6928, Adjusted R-squared:  0.6903 
## F-statistic: 274.6 on 5 and 609 DF,  p-value: < 2.2e-16

The estimated adjusted effect of area.living now suggests an expected increase of 0.0008179 log-2 dollars for every one square foot increase in living area. This is meaningless to most people, so let’s revisit the regression equation and apply the inverse of our log-transformation: \[\widehat{\log_2(\text{price})} = 70.42 + 0.138*\text{Bedrooms} + 0.00082*\text{Area Living} + \ldots \\ \implies \widehat{\text{price}} = 2^{70.42 + 0.138*\text{Bedrooms} + 0.00082*\text{Area Living} + \ldots} \\ \implies \widehat{\text{price}} = 2^{70.42}*2^{0.138*\text{Bedrooms}}*2^{0.00082*\text{Area Living}}*\ldots\]

We can see that every 1 square foot increase in living area increases the expected sale price by a multiplicative factor of \(2^{0.00082} = 1.000569\). Since a 1 square foot increase is very small, this is still difficult to interpret, so instead we might look at a 100 square foot increase: \[2^{0.00082*100} = 1.058\] Thus, a reasonable interpretation of the coefficient of Area.Living in the log-transformed model is: after adjusting for differences in bedrooms, style, and latitude, each additional 100 square feet is expected to increase a home’s sale price by 5.8%.
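
The arithmetic above can be reproduced directly from the fitted model; a minimal sketch (the name b_area is ours):

## Back-transform the area.living coefficient into a percentage change per 100 sq ft
b_area = coef(fit3_log)["area.living"]
2^(b_area * 100)               ## multiplicative factor, roughly 1.058
(2^(b_area * 100) - 1) * 100   ## expected percent increase, roughly 5.8%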

Question #4: For this question you should begin with the model assessed ~ area.living + ac used in Question #2. Throughout the question you may ignore whether or not the assumptions for inference are met.

  • Part A: Interpret the coefficient of the re-coded variable acYes in this model. You do not need to interpret the results of the associated statistical test at this time.
  • Part B: After adjusting for differences in living area, do homes with central air on average sell for more than homes without central air? Answer this question using a \(t\)-test on a coefficient from the regression model from Part A.
  • Part C: In Part B of Question #2 you applied a log-transformation and fit the model log2(assessed) ~ area.living + ac. Interpret the coefficient of acYes in this model. Hint: You should report the percentage change in expected sale price for a home having central air.
  • Part D: Using the results of the appropriate statistical test, determine whether there is sufficient statistical evidence that the coefficient of acYes in the log-transformed model is not zero.
  • Part E: Find a 95% confidence interval estimate for the percentage change in expected sale price for a home having central air. Hint: Use the appropriate function to find the confidence interval’s endpoints on the log-scale, then undo the log-transformation on each endpoint using its inverse.

\(~\)

Application

Question #5: In this application you’ll analyze data from a major US university as part of an investigation into salary differences between male and female faculty:

fac_sal = read.csv("https://remiller1450.github.io/data/Salaries.csv")
  • Part A: Fit the regression model salary ~ sex and interpret the coefficient and corresponding \(t\)-test for the re-coded variable sexMale. Does there appear to be statistical evidence of a difference in salaries for male versus female faculty?
  • Part B: Consider the model salary ~ sex + rank and note that the model from Part A is nested within this model. Use an \(F\)-test to determine whether there is statistical evidence that this model better fits the data than the model from Part A.
  • Part C: Now use an \(F\)-test to compare the model from Part B with the even more complex model salary ~ sex + rank + yrs.since.phd. Based upon this test, as well as the one you performed in Part B, which of the three models you’ve looked at so far in this question do you deem most appropriate? Explain your reasoning.
  • Part D: Perhaps yrs.since.phd has a non-linear relationship with salary. Use an \(F\)-test to compare a model that uses a degree-2 polynomial effect for yrs.since.phd with the best performing model you identified in Part C.
  • Part E: Check the assumptions for inference for the best model you identified in Part D (which might be an earlier model from Part A, B, or C). Briefly address each assumption and whether you consider it to be met or violated.
  • Part F: If you believe the conditions for inference are met in Part E, use a \(t\)-test to evaluate whether there is a statistically significant association between sex and salary after adjusting for differences in the other variables present in your best-fitting model. If you believe the conditions for inference are not met in Part E, perform a log-transformation on the outcome variable in the regression model, then use a \(t\)-test to evaluate whether there is a statistically significant association between sex and salary in the best-fitting model.