\(~\)
This lab contains no onboarding section; you should read through the sections below and work through them with your partners.
The lab’s examples will use the TSA data that we’ve now worked with many times. The lab’s questions involve a random sample of \(n=200\) games played in the National Football League (NFL) between 2018 and 2023. The outcome variable for this lab is score_diff, which is the difference between the number of points scored by the home and away teams, with larger positive values reflecting a greater margin of victory for the home team and more negative values reflecting a greater margin of victory for the away team.
nfl_data = read.csv("https://remiller1450.github.io/data/nfl_sample.csv")
\(~\)
One-way ANOVA uses an \(F\)-test to statistically evaluate whether there is a relationship between a quantitative response variable and a categorical explanatory variable, usually when the categorical variable has more than 2 categories.
The global null hypothesis in one-way ANOVA is: \(H_0: \mu_1 = \mu_2 = \ldots = \mu_k\)
In R, ANOVA can be performed using the aov() function, which behaves similarly to lm(), the function we used to fit regression models earlier in the semester:
## Load the example data
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
## Fit and store the ANOVA model
my_model = aov(Close_Amount ~ Claim_Site, data = tsa)
## Print the ANOVA table
summary(my_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Claim_Site 2 2949087 1474543 16.75 5.64e-08 ***
## Residuals 4997 439979039 88049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example, the \(F\)-statistic of 16.75 and the corresponding \(p\)-value of 0.0000000564 provide strong evidence that the alternative model fits the data better than the null model by an amount that cannot reasonably be explained by chance (sampling variability), thereby supporting an association between Claim_Site and Close_Amount.
Compared with our slides, you should notice that R’s version of the ANOVA table omits the “Total” row. This isn’t a problem, since the “Total” row is simply the column sum of the relevant quantities from the rows that are provided.
Question #1: For this question you should use the NFL data provided at the beginning of the lab.

Part A: Use the group_by() and summarize() functions from the dplyr library to find the mean score_diff for each home team. Recall that these functions were first introduced in Lab 5.

Part B: Consider the two games shown in the table below. What would the null model predict for the score_diff of each of these games? What would the alternative model predict for the score_diff of these games?

season | game_id | home_team | away_team | score_diff |
---|---|---|---|---|
2021 | 2021_15_WAS_PHI | PHI | WAS | 10 |
2021 | 2021_04_ARI_LA | LA | ARI | -17 |

Part C: In game 2021_15_WAS_PHI the actual score_diff was 10 (PHI won by 10 points) and in game 2021_04_ARI_LA the actual score_diff was -17 (ARI won by 17 points). Using this information, what are the residuals of each game when using the null model?

Part D: Use the aov() function to perform the ANOVA \(F\)-test on the variables described in Part A. Does this test provide statistical evidence of an association?

\(~\)
One-way ANOVA makes two main assumptions:

1. Normality: the residuals of the alternative model are Normally distributed.
2. Equal standard deviations: the data in each group have the same standard deviation.

Note that the first assumption (Normality) does not apply to the response variable itself; instead, it deals with the alternative model’s residuals. It can be assessed by graphing the residuals using either a histogram or Q-Q plot:
## Fit the ANOVA model (note we're using Claim Amount as the outcome this time)
my_model = aov(Claim_Amount ~ Claim_Site, data = tsa)
## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 150)
## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))
Both the histogram and the Q-Q plot suggest the Normality assumption is badly violated.
We will talk about the rationale and implications of this after Fall break, but taking the logarithm of a right-skewed variable is a commonly used strategy to make data more appropriate for methods like ANOVA (the \(+1\) inside the logarithm below avoids taking the log of zero for zero-dollar claims):
## Fitted ANOVA model using a log10 transformation
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 30)
## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))
After applying this log-transformation, the Normality assumption still might not be perfectly satisfied, but it is much more justified to use ANOVA after the transformation, and many would find the method to be a reasonable choice (assuming the equal standard deviations assumption is also met).
In addition to the residuals being Normally distributed, we must check that the data in each group has the same standard deviation:
library(dplyr)
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(Claim_Amount))
## # A tibble: 3 × 2
## Claim_Site std_dev
## <chr> <dbl>
## 1 Checked Baggage 29016.
## 2 Checkpoint 39672.
## 3 Other 1193.
Using the rule of thumb that the largest standard deviation should be no more than twice the smallest, this assumption is not met. However, you should recall that we ultimately decided to perform a log-transformation on these data, so we really should be considering the standard deviations of the log-transformed data:
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(log10(Claim_Amount+1)))
## # A tibble: 3 × 2
## Claim_Site std_dev
## <chr> <dbl>
## 1 Checked Baggage 0.646
## 2 Checkpoint 0.645
## 3 Other 0.597
Notice that the equal standard deviations assumption is met for the log-transformed data.
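If you’d like a single number for the rule of thumb, you can compute the ratio of the largest to the smallest group standard deviation directly (a sketch reusing the summaries from above; a ratio above 2 suggests the assumption is violated):

```r
library(dplyr)

tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

## Ratio of largest to smallest group SD on the raw scale
raw_sds = tsa %>% group_by(Claim_Site) %>% summarize(s = sd(Claim_Amount))
max(raw_sds$s) / min(raw_sds$s)   ## roughly 33, far above the cutoff of 2

## Same ratio after the log-transformation
log_sds = tsa %>% group_by(Claim_Site) %>% summarize(s = sd(log10(Claim_Amount + 1)))
max(log_sds$s) / min(log_sds$s)   ## roughly 1.08, comfortably below 2
```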
Question #2: You should continue using the NFL data set on this question. Use group_by() and summarize() to check the equal standard deviation assumption of the one-way ANOVA model you used in Question #1. Briefly comment upon whether this assumption appears to be met. Hint: You might find it useful to store the summarized results and then click on them in the Environment pane to sort by standard deviation.

\(~\)
Sometimes a log-transformation is not possible, or it might not be sufficient to ensure the data satisfy the assumptions of ANOVA. In these circumstances the Kruskal-Wallis Test provides a non-parametric alternative. The test is a generalization of the Wilcoxon rank-sum test that briefly appeared near the end of Lab 5.
The Kruskal-Wallis Test is performed using the kruskal.test() function, which uses the same syntax as aov():
kruskal.test(Claim_Amount ~ Claim_Site, data = tsa)
##
## Kruskal-Wallis rank sum test
##
## data: Claim_Amount by Claim_Site
## Kruskal-Wallis chi-squared = 40.065, df = 2, p-value = 1.995e-09
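One convenient property worth noting: because the Kruskal-Wallis test uses only the ranks of the observations, a strictly increasing transformation such as the log10 transform from earlier leaves the test statistic and \(p\)-value unchanged, so no transformation is needed before running it (a quick sketch verifying this):

```r
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

## Ranks are unchanged by a strictly increasing transformation, so the
## Kruskal-Wallis results match on the raw and log scales:
kruskal.test(Claim_Amount ~ Claim_Site, data = tsa)
kruskal.test(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
```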
Question #3: You should continue using the NFL data set on this question. Use kruskal.test() to evaluate the association between home_team and score_diff. Show the \(p\)-value of the test and provide a brief conclusion.

\(~\)
Any statistically significant ANOVA result should be followed by post-hoc pairwise testing. The phrase “post-hoc” comes from Latin, meaning “after this” or “after the fact”. So, as the name suggests, we only perform this step if the global test was significant. The goal of post-hoc testing is to determine which pairs of group differences produced the significant result.
There are many post-hoc tests that can be performed, but we’ll use Tukey’s Honest Significant Differences test:
## Fitted ANOVA using log10 transform from before
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## Tukey's test
TukeyHSD(my_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
##
## $Claim_Site
## diff lwr upr p adj
## Checkpoint-Checked Baggage 0.1626076 0.10018054 0.2250347 0.0000000
## Other-Checked Baggage 0.4227458 0.08350470 0.7619869 0.0097949
## Other-Checkpoint 0.2601382 -0.08325054 0.6035269 0.1777600
Note that, by default, the \(p\)-values from Tukey’s HSD test are adjusted to account for problems that arise when using the same data to perform many different hypothesis tests (we’ll discuss this issue soon).
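If you’re curious how much this adjustment matters, base R’s pairwise.t.test() function performs all pairwise comparisons and lets you toggle the adjustment method (a sketch for comparison only; this is a different procedure from Tukey’s HSD, so the \(p\)-values will not match it exactly):

```r
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
log_claims = log10(tsa$Claim_Amount + 1)

## All pairwise t-tests with Holm-adjusted p-values (the default adjustment)
pairwise.t.test(log_claims, tsa$Claim_Site)

## The same comparisons with no adjustment, for contrast
pairwise.t.test(log_claims, tsa$Claim_Site, p.adjust.method = "none")
```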
So, in our example we can conclude that the log-transformed means are different for claims made at Checkpoint and claims made at Checked Baggage, as well as for claims made at Other and Checked Baggage, but there is insufficient evidence of a difference in the mean claims at the Other and Checkpoint sites.
Question #4: For this question you’ll use the “tailgating” data set, which records the average following distances of regular users of different drugs during a driving simulation where each driver was asked to follow a lead vehicle to a certain destination. The researchers were interested in whether there was an association between drug type and average following distance, hypothesizing that regular users of harder drugs would be more prone to riskier behavior (very small following distances).
driving = read.csv("https://remiller1450.github.io/data/Tailgating.csv")
Part A: Use aov() to fit a one-way ANOVA model. Print the ANOVA table and interpret the \(p\)-value. Be sure to make a conclusion that involves the context of these data.

Part B: Perform post-hoc testing using the dunn.test() function in the dunn.test package. You can read more about this test here.

# install.packages("dunn.test")
library(dunn.test)
dunn.test(x = my_outcome_var, g = my_expl_var) ## Replace "my_outcome_var" and "my_expl_var" with the relevant variables from your data