This lab contains no onboarding section; you should read through the sections below and work through them with your partners.
One-way ANOVA uses an \(F\)-test to statistically evaluate the relationship between a quantitative response variable and a categorical explanatory variable, usually when the categorical variable contains more than 2 categories. The global null hypothesis in one-way ANOVA is: \[H_0: \mu_1 = \mu_2 = \ldots = \mu_k\]
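To build intuition about the \(F\)-test, here is a minimal sketch (using simulated data, not the lab's tsa claims) showing that when \(H_0\) is true, i.e. every group shares the same mean, the \(F\)-statistic from R's aov() function (introduced below) tends to be close to 1:

```r
## Simulate data where the null hypothesis is TRUE: all three groups
## are drawn from the same Normal distribution (one common mean)
set.seed(1)
sim = data.frame(
  y = rnorm(300, mean = 50, sd = 10),
  grp = rep(c("A", "B", "C"), each = 100)
)

## Fit the one-way ANOVA model and extract the F-statistic
fit = aov(y ~ grp, data = sim)
f_stat = summary(fit)[[1]][["F value"]][1]
f_stat  ## typically near 1 when H0 is true
```

Large values of \(F\) (like the 16.75 in the example below) arise when the between-group variability is much larger than chance alone would produce.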
In R, ANOVA can be performed using the aov() function, which behaves similarly to lm(), the function we used to fit regression models earlier in the semester:
## Load sample data (5000 randomly selected claims against the TSA)
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
## Fit and store the ANOVA model
my_model = aov(Close_Amount ~ Claim_Site, data = tsa)
## Print the ANOVA table
summary(my_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Claim_Site 2 2949087 1474543 16.75 5.64e-08 ***
## Residuals 4997 439979039 88049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example, the \(F\)-statistic of 16.75 and the corresponding \(p\)-value of 0.0000000564 provide strong evidence that the alternative model fits the data better than the null model to an extent that cannot be explained by chance, thereby supporting an association between Claim_Site and Close_Amount.
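If you want the \(F\)-statistic and \(p\)-value as numbers rather than reading them off the printed table, one way (sketched here with R's built-in PlantGrowth data rather than the tsa data) is to index into the table returned by summary():

```r
## Fit a one-way ANOVA using the built-in PlantGrowth data (weight ~ group)
pg_model = aov(weight ~ group, data = PlantGrowth)

## summary() returns a list whose first element is the ANOVA table
tab = summary(pg_model)[[1]]
f_val = tab[["F value"]][1]   ## F-statistic (first row = the explanatory variable)
p_val = tab[["Pr(>F)"]][1]    ## corresponding p-value
round(c(F = f_val, p = p_val), 4)
```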
Compared with our slides, you should notice that R's version of the ANOVA table omits the "Total" row, which isn't a problem since the "Total" row is simply the column sums of the relevant quantities from the provided rows. The question below is intended to help you relate the components of an ANOVA table in R with the concepts and models involved in the \(F\)-test.
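As one way of relating the table's rows to the null and alternative models, the sums of squares can be computed by hand and checked against aov(). This sketch uses the built-in PlantGrowth data as a stand-in for the tsa data:

```r
pg = PlantGrowth
grand_mean = mean(pg$weight)                     ## the null model's prediction
group_means = tapply(pg$weight, pg$group, mean)  ## the alternative model's predictions

## Between-group sum of squares (the explanatory variable's row of the table)
n_per_group = table(pg$group)
ss_between = sum(n_per_group * (group_means - grand_mean)^2)

## Residual sum of squares (the "Residuals" row of the table)
ss_within = sum((pg$weight - group_means[pg$group])^2)

## These match the "Sum Sq" column of summary(aov(weight ~ group, data = pg))
c(ss_between = ss_between, ss_within = ss_within)
```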
Question #1: Below are two rows from the tsa data set. What does the null model predict the close amount should be for each of these rows? Hint: you will need to calculate some descriptive statistics yourself to answer this question, perhaps using group_by() and summarize().

| Claim Number | Year | Month | Day and Time | Claim Site | Close Amount |
|---|---|---|---|---|---|
| 2.00505E+12 | 2005 | 4 | 04 00:00:00 | Checked Baggage | 19.95 |
| 2.0051E+12 | 2005 | 8 | 25 00:00:00 | Checkpoint | 80.00 |
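The group_by() and summarize() pattern mentioned in the hint can be sketched on a toy data frame (the site/amount variables here are made up for illustration, not taken from the tsa data):

```r
library(dplyr)

## A tiny made-up data frame with a grouping variable and a numeric outcome
toy = data.frame(
  site = c("A", "A", "B", "B", "B"),
  amount = c(10, 20, 30, 40, 50)
)

## group_by() declares the groups; summarize() computes one row per group
res = toy %>% group_by(site) %>% summarize(mean_amount = mean(amount))
res
```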
Question #2: Consider the explanatory variable Airline_Name and the response variable Claim_Amount. Briefly explain why one-way ANOVA is an appropriate statistical approach for testing whether these variables are associated.
One-way ANOVA makes two main assumptions:

1. The residuals of the alternative model are Normally distributed.
2. The data in each group have the same standard deviation.
Note that the first assumption does not apply to the response variable itself; instead, it deals with the alternative model's residuals. It can be assessed by graphing the residuals using either a histogram or a Q-Q plot:
## Fit an ANOVA model (note we're using Claim_Amount as the outcome now)
my_model = aov(Claim_Amount ~ Claim_Site, data = tsa)
## Use the residuals of this fitted model to make a histogram
library(ggplot2)
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 150)
## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))
Both the histogram and the Q-Q plot suggest the Normality assumption is badly violated. Recall from earlier in the semester (Lab 14) that log-transformations can be used to ameliorate the problems associated with right-skewed variables and outliers:
## Fitted ANOVA model using a log10 transformation
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 30)
## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))
After the log-transformation the Normality assumption still might not be perfectly satisfied, but using ANOVA is much easier to justify after the transformation, and many statisticians would consider it a reasonable choice (assuming the equal standard deviations assumption is also met).
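As a quick illustration of why log10(x + 1) helps with right skew (using simulated skewed data, not the tsa claims), notice how the mean and median nearly coincide after the transformation:

```r
## Simulate a strongly right-skewed variable
set.seed(42)
x = rlnorm(1000, meanlog = 3, sdlog = 1)

## Before transforming, the mean is dragged well above the median by the skew
c(mean = mean(x), median = median(x))

## After the log10(x + 1) transformation the two are close together,
## a sign the distribution is much more symmetric
c(mean = mean(log10(x + 1)), median = median(log10(x + 1)))
```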
In addition to the residuals being Normally distributed, we must check that the data in each group have the same standard deviation:
library(dplyr)
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(Claim_Amount))
## # A tibble: 3 × 2
## Claim_Site std_dev
## <chr> <dbl>
## 1 Checked Baggage 29016.
## 2 Checkpoint 39672.
## 3 Other 1193.
This assumption is not met according to the rule of thumb that the largest standard deviation should be no more than twice the smallest. However, recall that we ultimately decided to log-transform these data, so we really should be considering the standard deviations of the log-transformed data:
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(log10(Claim_Amount+1)))
## # A tibble: 3 × 2
## Claim_Site std_dev
## <chr> <dbl>
## 1 Checked Baggage 0.646
## 2 Checkpoint 0.645
## 3 Other 0.597
Notice that the equal standard deviations assumption is met for the log-transformed data.
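The rule of thumb can also be written as a small reusable check (an illustrative helper of our own, not a function from the lab or from any package), shown here with the built-in PlantGrowth data:

```r
## Returns TRUE when the largest group standard deviation is at most
## twice the smallest (the rule of thumb used in this lab)
sd_ratio_ok = function(x, g) {
  sds = tapply(x, g, sd)
  max(sds) / min(sds) <= 2
}

## Example with the built-in PlantGrowth data (weight by group)
sd_ratio_ok(PlantGrowth$weight, PlantGrowth$group)
```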
Question #3:
Sometimes a log-transformation is not sufficient to ensure the data satisfy the assumptions of ANOVA. In these circumstances the Kruskal-Wallis Test provides a non-parametric alternative. The test is a generalization of the Wilcoxon rank-sum test described in Lab 14; it compares the sums of ranks across three or more groups.
The Kruskal-Wallis Test is performed using the kruskal.test() function, which uses the same syntax as aov():
kruskal.test(Claim_Amount ~ Claim_Site, data = tsa)
##
## Kruskal-Wallis rank sum test
##
## data: Claim_Amount by Claim_Site
## Kruskal-Wallis chi-squared = 40.065, df = 2, p-value = 1.995e-09
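One useful consequence of the rank-based construction, illustrated here with the built-in PlantGrowth data rather than the tsa claims, is that monotone transformations of the response (such as the log-transformation discussed above) leave the test completely unchanged, since they don't change the ranks:

```r
## Kruskal-Wallis test on the raw response
kt = kruskal.test(weight ~ group, data = PlantGrowth)
kt$statistic   ## the chi-squared statistic
kt$p.value

## The same test on a log-transformed response gives an identical statistic,
## because log10 is monotone and so preserves every rank
kt_log = kruskal.test(log10(weight) ~ group, data = PlantGrowth)
all.equal(kt$statistic, kt_log$statistic)
```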
Question #4: Perform a Kruskal-Wallis test for an association between Airline_Name and Claim_Amount. Show the \(p\)-value of the test and provide a brief conclusion.
Any statistically significant ANOVA result should be followed by post-hoc pairwise testing. The phrase "post-hoc" comes from Latin, meaning "after this" or "after the fact". So, as the name suggests, we'll only perform this post-hoc testing step if the global test was significant. The goal of post-hoc testing is to determine which pairwise group differences produced the significant result.
There are many post-hoc tests that can be performed, but we’ll use Tukey’s Honest Significant Differences test:
## Fitted ANOVA using log10 transform from before
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## Tukey's test
TukeyHSD(my_model, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
##
## $Claim_Site
## diff lwr upr p adj
## Checkpoint-Checked Baggage 0.1626076 0.10018054 0.2250347 0.0000000
## Other-Checked Baggage 0.4227458 0.08350470 0.7619869 0.0097949
## Other-Checkpoint 0.2601382 -0.08325054 0.6035269 0.1777600
The \(p\)-values from Tukey's HSD test are automatically adjusted to control the Type I error rate associated with multiple comparisons. This is done using the conf.level argument, where the value 0.95 allows us to compare each \(p\)-value to a decision threshold of \(\alpha = 0.05\) while still giving us confidence that there's at most a 5% chance of making any Type I errors. Note that this is the same type of Type I error control provided by the Bonferroni Correction.
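For comparison, the Bonferroni Correction itself can be applied to a vector of raw \(p\)-values with R's p.adjust() function. The \(p\)-values below are made up for illustration, not taken from the tsa analysis:

```r
## Three hypothetical raw p-values from pairwise comparisons
raw_p = c(0.001, 0.012, 0.178)

## Bonferroni multiplies each p-value by the number of comparisons
## (capping at 1), so the adjusted values can be compared to alpha = 0.05
p.adjust(raw_p, method = "bonferroni")
## 0.003 0.036 0.534
```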
So, in our example we can conclude that the log-transformed means are different for claims made at Checkpoint and claims made at Checked Baggage, as well as for claims made at Other and Checked Baggage, but there is not sufficient evidence of a difference in the mean claims made at Other and Checkpoint.
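If you'd like a visual summary of these pairwise comparisons, TukeyHSD() output can also be plotted directly; intervals that cross zero correspond to non-significant differences. This sketch uses the built-in PlantGrowth data as a stand-in for the tsa claims:

```r
## Fit a one-way ANOVA and run Tukey's test on the built-in PlantGrowth data
pg_model = aov(weight ~ group, data = PlantGrowth)
tukey_res = TukeyHSD(pg_model, conf.level = 0.95)

## The stored table: diff, lwr, upr, and adjusted p-value for each pair
tukey_res$group

## One confidence interval per pairwise comparison
plot(tukey_res)
```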
Question #5: In this question you’ll revisit our previous analysis of the “tailgating” data set, which recorded the mean following distances of regular users of different drugs during a driving simulation. The researchers were interested in whether there was an association between drug type and following distance, hypothesizing that regular users of harder drugs would be more prone to riskier behavior.
driving = read.csv("https://remiller1450.github.io/data/Tailgating.csv")
Use aov() to fit a one-way ANOVA model. Print the ANOVA table and interpret the \(p\)-value. Be sure to make a conclusion that involves the context of these data. For post-hoc testing, you can use the dunn.test() function in the dunn.test package. You can read more about this test here.
# install.packages("dunn.test")
library(dunn.test)
dunn.test(x = my_outcome_var, g = my_expl_var) ## Replace "my_outcome_var" and "my_expl_var" with the relevant variables from your data