\(~\)
This lab contains no onboarding section; you should read through the sections below and work through them with your partners.
The lab’s examples will use the TSA data that we’ve now worked with many times. The lab’s questions involve a random sample of \(n=200\) games played in the National Football League (NFL) between 2018 and 2023. The outcome variable for this lab is score_diff, which is the difference between the number of points scored by the home and away teams, with larger positive values reflecting a greater margin of victory for the home team and more negative values reflecting a greater margin of victory for the away team.
nfl_data = read.csv("https://remiller1450.github.io/data/nfl_sample.csv")
\(~\)
One-way ANOVA uses an \(F\)-test to statistically evaluate whether there is a relationship between a quantitative response variable and a categorical explanatory variable, usually when the categorical variable has more than 2 categories.
The global null hypothesis in one-way ANOVA is: \(H_0: \mu_1 = \mu_2 = \ldots = \mu_k\)
In R, ANOVA can be performed using the aov() function, which behaves similarly to lm(), the function we used to fit regression models earlier in the semester:
## Load the example data
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
## Fit and store the ANOVA model
my_model = aov(Close_Amount ~ Claim_Site, data = tsa)
## Print the ANOVA table
summary(my_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Claim_Site 2 2949087 1474543 16.75 5.64e-08 ***
## Residuals 4997 439979039 88049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example, the \(F\)-statistic of 16.75 and the corresponding \(p\)-value of 0.0000000564 provide strong evidence that the alternative model fits the data better than the null model by an amount that cannot reasonably be explained by chance (sampling variability), thereby supporting an association between Claim_Site and Close_Amount.
Compared with our slides, you should notice that R’s version of the ANOVA table omits the “Total” row. This isn’t a problem, since the “Total” row is simply the column sum of the relevant quantities from the rows that are provided.
Question #1: For this question you should use the NFL data provided at the beginning of the lab.

Part A: Use the group_by() and summarize() functions from the dplyr library to find the mean score_diff for each home team. Recall that these functions were first introduced in Lab 5.

Part B: Consider the two games shown in the table below. What would the null model predict for the score_diff of each of these games? What would the alternative model predict for the score_diff of these games?

season | game_id | home_team | away_team | score_diff |
---|---|---|---|---|
2021 | 2021_15_WAS_PHI | PHI | WAS | 10 |
2021 | 2021_04_ARI_LA | LA | ARI | -17 |

Part C: In game 2021_15_WAS_PHI the actual score_diff was 10 (PHI won by 10 points) and in game 2021_04_ARI_LA the actual score_diff was -17 (ARI won by 17 points). Using this information, what are the residuals of each game when using the null model?

Part D: Use the aov() function to perform the ANOVA \(F\)-test on the variables described in Part A. Does this test provide statistical evidence of an association?

\(~\)
One-way ANOVA makes two main assumptions:

1. Normality: the residuals of the alternative model are Normally distributed.
2. Equal standard deviations: the data in each group have the same standard deviation.

Note that the first assumption (Normality) does not apply to the response variable itself; instead, it deals with the alternative model’s residuals. It can be assessed by graphing the residuals using either a histogram or Q-Q plot:
## Fit the ANOVA model (note we're using Claim Amount as the outcome this time)
my_model = aov(Claim_Amount ~ Claim_Site, data = tsa)
## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 150)
## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))
Both the histogram and the Q-Q plot suggest the Normality assumption is badly violated.
We will talk about the rationale and implications of this after Fall break, but taking the logarithm of a right-skewed variable is a commonly used strategy to make data more appropriate for methods like ANOVA (the \(+1\) inside the logarithm below avoids taking the log of zero for zero-dollar claims):
## Fitted ANOVA model using a log10 transformation
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 30)
## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))
After applying this log-transformation, the Normality assumption still might not be perfectly satisfied, but it is much more justified to use ANOVA after the transformation, and many would find the method to be a reasonable choice (assuming the equal standard deviations assumption is also met).
In addition to the residuals being Normally distributed, we must check that the data in each group has the same standard deviation:
library(dplyr)
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(Claim_Amount))
## # A tibble: 3 × 2
## Claim_Site std_dev
## <chr> <dbl>
## 1 Checked Baggage 29016.
## 2 Checkpoint 39672.
## 3 Other 1193.
Using the rule of thumb that the largest standard deviation should be no more than twice the smallest, this assumption is not met. However, you should recall that we ultimately decided to perform a log-transformation on these data, so we really should be considering the standard deviations of the log-transformed data:
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(log10(Claim_Amount+1)))
## # A tibble: 3 × 2
## Claim_Site std_dev
## <chr> <dbl>
## 1 Checked Baggage 0.646
## 2 Checkpoint 0.645
## 3 Other 0.597
Notice that the equal standard deviations assumption is met for the log-transformed data.
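If you’d like a single number for the rule of thumb, you can compute the ratio of the largest to the smallest group standard deviation directly (a sketch reusing the summaries from above; a ratio above 2 suggests the assumption is violated):

```r
library(dplyr)

tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

## Ratio of largest to smallest group SD on the raw scale
raw_sds = tsa %>% group_by(Claim_Site) %>% summarize(s = sd(Claim_Amount))
max(raw_sds$s) / min(raw_sds$s)   ## roughly 33, far above the cutoff of 2

## Same ratio after the log-transformation
log_sds = tsa %>% group_by(Claim_Site) %>% summarize(s = sd(log10(Claim_Amount + 1)))
max(log_sds$s) / min(log_sds$s)   ## roughly 1.08, comfortably below 2
```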
Question #2: You should continue using the NFL data set on this question. Use group_by() and summarize() to check the equal standard deviation assumption of the one-way ANOVA model you used in Question #1. Briefly comment upon whether this assumption appears to be met. Hint: You might find it useful to store the summarized results and then click on them in the Environment pane to sort by standard deviation.

\(~\)
Sometimes a log-transformation is not possible, or it might not be sufficient to ensure the data satisfy the assumptions of ANOVA. In these circumstances the Kruskal-Wallis Test provides a non-parametric alternative. The test is a generalization of the Wilcoxon rank-sum test that briefly appeared near the end of Lab 5.
The Kruskal-Wallis Test is performed using the kruskal.test() function, which uses the same syntax as aov():
kruskal.test(Claim_Amount ~ Claim_Site, data = tsa)
##
## Kruskal-Wallis rank sum test
##
## data: Claim_Amount by Claim_Site
## Kruskal-Wallis chi-squared = 40.065, df = 2, p-value = 1.995e-09
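One convenient property worth noting: because the Kruskal-Wallis test uses only the ranks of the observations, a strictly increasing transformation such as the log10 transform from earlier leaves the test statistic and \(p\)-value unchanged, so no transformation is needed before running it (a quick sketch verifying this):

```r
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

## Ranks are unchanged by a strictly increasing transformation, so the
## Kruskal-Wallis results match on the raw and log scales:
kruskal.test(Claim_Amount ~ Claim_Site, data = tsa)
kruskal.test(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
```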
Question #3: You should continue using the NFL data set on this question. Use kruskal.test() to evaluate the association between home_team and score_diff. Show the \(p\)-value of the test and provide a brief conclusion.

\(~\)
Any statistically significant ANOVA result should be followed by post-hoc pairwise testing. The phrase “post-hoc” comes from Latin, meaning “after this” or “after the fact”. So, as the name suggests, we only perform this step if the global test was significant. The goal of post-hoc testing is to determine which pairs of group differences produced the significant result.
There are many post-hoc tests that can be performed, but we’ll use Tukey’s Honest Significant Differences test:
## Fitted ANOVA using log10 transform from before
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## Tukey's test
TukeyHSD(my_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
##
## $Claim_Site
## diff lwr upr p adj
## Checkpoint-Checked Baggage 0.1626076 0.10018054 0.2250347 0.0000000
## Other-Checked Baggage 0.4227458 0.08350470 0.7619869 0.0097949
## Other-Checkpoint 0.2601382 -0.08325054 0.6035269 0.1777600
Note that, by default, the \(p\)-values from Tukey’s HSD test are adjusted to account for problems that arise when using the same data to perform many different hypothesis tests (we’ll discuss this issue soon).
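If you’re curious how much this adjustment matters, base R’s pairwise.t.test() function performs all pairwise comparisons and lets you toggle the adjustment method (a sketch for comparison only; this is a different procedure from Tukey’s HSD, so the \(p\)-values will not match it exactly):

```r
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
log_claims = log10(tsa$Claim_Amount + 1)

## All pairwise t-tests with Holm-adjusted p-values (the default adjustment)
pairwise.t.test(log_claims, tsa$Claim_Site)

## The same comparisons with no adjustment, for contrast
pairwise.t.test(log_claims, tsa$Claim_Site, p.adjust.method = "none")
```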
So, in our example we can conclude that the log-transformed means are different for claims made at Checkpoint and claims made at Checked Baggage, as well as for claims made at Other and Checked Baggage, but there is insufficient evidence of a difference in the mean claims at the Other and Checkpoint sites.
Question #4: For this question you’ll use the “tailgating” data set, which records the average following distances of regular users of different drugs during a driving simulation where each driver was asked to follow a lead vehicle to a certain destination. The researchers were interested in whether there was an association between drug type and average following distance, hypothesizing that regular users of harder drugs would be more prone to riskier behavior (very small following distances).
driving = read.csv("https://remiller1450.github.io/data/Tailgating.csv")
Part A: Use aov() to fit a one-way ANOVA model. Print the ANOVA table and interpret the \(p\)-value. Be sure to make a conclusion that involves the context of these data.

Part B: Perform post-hoc testing using the dunn.test() function in the dunn.test package. You can read more about this test here.

# install.packages("dunn.test")
library(dunn.test)
dunn.test(x = my_outcome_var, g = my_expl_var) ## Replace "my_outcome_var" and "my_expl_var" with the relevant variables from your data