\(~\)

Lab

This lab contains no onboarding section; you should read through the sections below and work through them with your partners.

Overview of ANOVA

One-way ANOVA uses an \(F\)-test to statistically evaluate the relationship between a quantitative response variable and a categorical explanatory variable, usually when the categorical variable contains more than 2 categories. The global null hypothesis in one-way ANOVA is: \[H_0: \mu_1 = \mu_2 = \ldots = \mu_k\]

  • This hypothesis reflects “no association” between the explanatory and response variables
  • Rejecting the null hypothesis suggests an association, implying that at least one group’s mean differs from another’s

In R, ANOVA can be performed using the aov() function, which behaves similarly to lm(), the function we used to fit regression models earlier in the semester:

## Load sample data (5000 randomly selected claims against the TSA)
tsa = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

## Fit and store the ANOVA model
my_model = aov(Close_Amount ~ Claim_Site, data = tsa)

## Print the ANOVA table
summary(my_model)
##               Df    Sum Sq Mean Sq F value   Pr(>F)    
## Claim_Site     2   2949087 1474543   16.75 5.64e-08 ***
## Residuals   4997 439979039   88049                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this example, the \(F\)-statistic of 16.75 and the corresponding \(p\)-value of 0.0000000564 provide strong evidence that the alternative model fits the data better than the null model to an extent that cannot be explained by chance alone, thereby supporting an association between Claim_Site and Close_Amount.

Compared with our slides, you should notice that R’s version of the ANOVA table omits the “Total” row. This isn’t a problem, since the “Total” row is simply the column sum of the relevant quantities from the rows that are provided. The question below is intended to help you relate the components of an ANOVA table in R with the concepts and models involved in the \(F\)-test.
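For instance, using the ANOVA table printed above, the omitted “Total” row can be reconstructed by summing down the Df and Sum Sq columns (a quick arithmetic check, not something you’d normally need to run):

```r
## Reconstruct the omitted "Total" row by summing down each column
total_df = 2 + 4997             ## Df of Claim_Site plus Df of Residuals
total_ss = 2949087 + 439979039  ## Sum Sq of Claim_Site plus Residuals
total_df
## [1] 4999
total_ss
## [1] 442928126
```

Notice that the total degrees of freedom equal \(n - 1 = 4999\) for the 5000 claims in this sample.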

Question #1:

  • Part A: Using the ANOVA table from this section’s example, how many distinct categories of the explanatory variable are there? Briefly explain how you can determine this from the ANOVA table.
  • Part B: Printed below are two randomly chosen rows from the tsa data set. What does the null model predict the close amount should be for each of these rows? Hint: you will need to calculate some descriptive statistics yourself to answer this question.
Claim Number   Year  Month  Day and Time  Claim Site       Close Amount
2.00505E+12    2005  4      04 00:00:00   Checked Baggage  19.95
2.0051E+12     2005  8      25 00:00:00   Checkpoint       80.00
  • Part C: Consider the two randomly chosen rows that were printed above. What does the alternative model (the one-way ANOVA model) predict the close amount should be for each of these rows? Hint: you will need to calculate some descriptive statistics yourself to answer this question. You might consider using group_by() and summarize().
  • Part D: How much lower was the alternative model’s sum of squares compared to the null model?
  • Part E: What is the sum of squares of the null model?
  • Part F: Consider the two randomly chosen rows from earlier. What values did each of these rows contribute to the sum of squares you found in Part E (for the null model)?
  • Part G: What is the sum of squares of the alternative model?
  • Part H: Again consider the two randomly chosen rows from earlier. What values did each of these rows contribute to the sum of squares you found in Part G (for the alternative model)?

\(~\)

Question #2:

  • Part A: Consider the explanatory variable Airline_Name and the response variable Claim_Amount. Briefly explain why one-way ANOVA is an appropriate statistical approach for testing whether these variables are associated.
  • Part B: Perform the ANOVA \(F\)-test using the variables mentioned in Part A and report whether the test provides sufficient evidence of an association.
  • Part C: Do the test results from Part B provide strong evidence that each airline in the TSA data set has the same average claim amount? Briefly explain your reasoning.

\(~\)

Checking Model Assumptions

One-way ANOVA makes two main assumptions:

  1. The residuals are Normally distributed (this is how \(\epsilon_i\) is defined in the null and alternative models)
  2. The cases in each group have the same standard deviation (an assumption which implies the errors in each group follow a Normal curve with the same amount of variability but different means)

Note that the first assumption does not apply to the response variable itself; instead, it concerns the alternative model’s residuals. It can be assessed by graphing the residuals using either a histogram or Q-Q plot:

## Fit an ANOVA model (note we're using Claim Amount as the outcome now)
my_model = aov(Claim_Amount ~ Claim_Site, data = tsa)

## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 150)

## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))

Both the histogram and the Q-Q plot suggest the Normality assumption is badly violated. Recall from earlier in the semester (Lab 14) that log-transformations can be used to ameliorate the problems associated with right-skewed variables and outliers:

## Fitted ANOVA model using a log10 transformation
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)

## Use the residuals of this fitted model to make a histogram
ggplot() + geom_histogram(aes(x = my_model$residuals), bins = 30)

## Now a Q-Q plot
ggplot() + stat_qq(aes(sample = my_model$residuals)) + stat_qq_line(aes(sample = my_model$residuals))

After log-transformation the Normality assumption still might not be perfectly satisfied, but using ANOVA is now much more justifiable, and many would find the method to be a reasonable choice (assuming the equal standard deviations assumption is also met).

In addition to the residuals being Normally distributed, we must check that the cases in each group have the same standard deviation:

library(dplyr)
tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(Claim_Amount))
## # A tibble: 3 × 2
##   Claim_Site      std_dev
##   <chr>             <dbl>
## 1 Checked Baggage  29016.
## 2 Checkpoint       39672.
## 3 Other             1193.

This assumption is not met according to the rule of thumb that the largest standard deviation should be no more than twice the smallest. However, recall that we ultimately decided to perform a log-transformation on these data, so we really should be considering the standard deviations of the log-transformed data:

tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(log10(Claim_Amount+1)))
## # A tibble: 3 × 2
##   Claim_Site      std_dev
##   <chr>             <dbl>
## 1 Checked Baggage   0.646
## 2 Checkpoint        0.645
## 3 Other             0.597

Notice that the equal standard deviations assumption is met for the log-transformed data.
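A quick way to apply the rule of thumb in code is to compute the ratio of the largest to smallest group standard deviation (a small sketch that reuses the summarize() output from above; a ratio below 2 satisfies the rule of thumb):

```r
## Ratio of the largest to smallest group standard deviation
sds = tsa %>% group_by(Claim_Site) %>% summarize(std_dev = sd(log10(Claim_Amount + 1)))
max(sds$std_dev) / min(sds$std_dev)  ## approximately 0.646/0.597, well below 2
```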

Question #3:

  • Part A: Check the assumptions of the one-way ANOVA you performed in Question #2. Which, if any, assumptions are met? In addition to your answer, provide the code used to check each assumption.
  • Part B: Similar to the example in this section, perform a log-transformation on the outcome variable in the model from Part A (which came from Question #2). Interpret the \(F\)-test results when the log-transformation is used. Hint: Because there are a small number of cases that claimed zero dollars, you’ll need to add 1 to the outcome variable as was done in the example.
  • Part C: Check the assumptions of the one-way ANOVA you performed in Part B. Which, if any, assumptions are met? In addition to your answer, provide the code used to check each assumption.

\(~\)

The Kruskal-Wallis Test

Sometimes a log-transformation is not sufficient to ensure the data satisfy the assumptions of ANOVA. In these circumstances the Kruskal-Wallis Test provides a non-parametric alternative. The test is a generalization of the Wilcoxon rank-sum test described in Lab 14, and it compares the sums of ranks across three or more groups.

The Kruskal-Wallis Test is performed using the kruskal.test() function, which uses the same syntax as aov():

kruskal.test(Claim_Amount ~ Claim_Site, data = tsa)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Claim_Amount by Claim_Site
## Kruskal-Wallis chi-squared = 40.065, df = 2, p-value = 1.995e-09

Question #4:

  • Part A: Briefly explain why the Kruskal-Wallis Test might be preferable to the log-transformation you performed in Question #3. Hint: you might consider referencing your Q-Q plot.
  • Part B: Use the Kruskal-Wallis Test to determine whether these data provide sufficient statistical evidence of an association between Airline_Name and Claim_Amount. Show the \(p\)-value of the test and provide a brief conclusion.
  • Part C: Is the \(p\)-value you found in Part B of this question closer to the \(p\)-value from the \(F\)-test you performed on the log-transformed data in Question #3 (Part B) or the \(F\)-test performed on the original data in Question #2 (Part B)? Why do you think this is?

\(~\)

Post-hoc Testing

Any statistically significant ANOVA result should be followed by post-hoc pairwise testing. The phrase “post-hoc” comes from Latin, meaning “after this” or “after the fact”. So, as the name suggests, we’ll only perform this step if the global test was significant. The goal of post-hoc testing is to determine which pairwise group differences produced the significant result.

There are many post-hoc tests that can be performed, but we’ll use Tukey’s Honest Significant Differences test:

## Fitted ANOVA using log10 transform from before
my_model = aov(log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)

## Tukey's test
TukeyHSD(my_model, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = log10(Claim_Amount + 1) ~ Claim_Site, data = tsa)
## 
## $Claim_Site
##                                 diff         lwr       upr     p adj
## Checkpoint-Checked Baggage 0.1626076  0.10018054 0.2250347 0.0000000
## Other-Checked Baggage      0.4227458  0.08350470 0.7619869 0.0097949
## Other-Checkpoint           0.2601382 -0.08325054 0.6035269 0.1777600

The \(p\)-values from Tukey’s HSD test are naturally adjusted to control the Type I error rate associated with multiple comparisons. This is done using the conf.level argument, where the value 0.95 allows us to compare each \(p\)-value to a decision threshold of \(\alpha = 0.05\) while still giving us confidence that there’s at most a 5% chance of making any Type I errors. Note that this is the same type of Type I error control provided by the Bonferroni Correction.

So, in our example we can conclude that the log-transformed means are different for claims made at Checkpoint and claims made at Checked Baggage, as well as claims made at Other and Checked Baggage, but there is not sufficient evidence of a difference for the mean claims made at Other and Checkpoint.
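As a point of comparison, a Bonferroni-adjusted set of pairwise comparisons can be obtained with base R’s pairwise.t.test() function. This is a sketch of an alternative approach, not part of the lab’s required workflow; when every pairwise comparison is of interest, Tukey’s HSD is generally less conservative than the Bonferroni Correction:

```r
## Bonferroni-adjusted pairwise t-tests (an alternative to Tukey's HSD)
pairwise.t.test(log10(tsa$Claim_Amount + 1), tsa$Claim_Site,
                p.adjust.method = "bonferroni")
```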

Question #5: In this question you’ll revisit our previous analysis of the “tailgating” data set, which recorded the mean following distances of regular users of different drugs during a driving simulation. The researchers were interested in whether there was an association between drug type and following distance, hypothesizing that regular users of harder drugs would be more prone to riskier behavior.

driving = read.csv("https://remiller1450.github.io/data/Tailgating.csv")
  • Part A: In words, write the null and alternative hypotheses involved in using one-way ANOVA to analyze these data.
  • Part B: Use aov() to fit a one-way ANOVA model. Print the ANOVA table and interpret the \(p\)-value. Be sure to make a conclusion that involves the context of these data.
  • Part C: Check the assumptions of the model you used in Part B. If one or more assumptions are violated, you should log-transform the data or perform the Kruskal-Wallis Test.
  • Part D: Perform post-hoc testing on your final model, which is from Part B if you determined the assumptions of ANOVA were met, or Part C if the assumptions were not met. If you used the Kruskal-Wallis Test in Part C, the appropriate method of post-hoc testing is the Dunn Test, which can be performed using the dunn.test() function in the dunn.test package. You can read more about this test in the dunn.test package documentation.
# install.packages("dunn.test")
library(dunn.test)

dunn.test(x = my_outcome_var, g = my_expl_var) ## Replace "my_outcome_var" and "my_expl_var" with the relevant variables from your data