\(~\)

Lab

This lab contains no onboarding section; you should read through the sections below and work through them with your partners.

Goodness of Fit Tests

The chisq.test() function is used to perform Chi-squared Goodness of Fit Tests. The function expects the sample data to be provided as a frequency table, and the null hypothesis is specified using the argument p.

In most circumstances, we’ll use the table() function to create the table used as the primary input to chisq.test().

## Create the table (w/ some made up data)
x = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C")
my_table = table(x)

## Use the table as an input to chisq.test()
chisq.test(x = my_table, p = c(1/3, 1/3, 1/3))
## 
##  Chi-squared test for given probabilities
## 
## data:  my_table
## X-squared = 1.5, df = 2, p-value = 0.4724

However, in some circumstances, we might need to enter the table ourselves as a data.frame object. This is demonstrated below for our AP Exam example:

## Enter the data ourselves as a data frame
ap_exam_data = data.frame(A = 85, B = 90, C = 79, D = 78, E = 68)

## Perform the test
chisq.test(x = ap_exam_data, p = c(0.2, 0.2, 0.2, 0.2, 0.2))
## 
##  Chi-squared test for given probabilities
## 
## data:  ap_exam_data
## X-squared = 3.425, df = 4, p-value = 0.4894

In addition to reporting the \(p\)-value, the conclusion we draw from a hypothesis test should involve any directional relationships that are identified. For a Chi-squared test, this can be accomplished by comparing the observed and expected frequencies to identify where the largest discrepancies are:

## Store the test results
test_results = chisq.test(x = ap_exam_data, p = c(0.2, 0.2, 0.2, 0.2, 0.2))

## Table of expected counts
test_results$expected
## [1] 80 80 80 80 80

Notice how the category with the largest deviation from its expected count was the answer choice “E”. This was easy to see in our example, as all of the categories had the same expected frequency, but a more general approach is to look at standardized residuals: \[\text{stdres} = \frac{\text{Observed Frequency} - \text{Expected Frequency}}{SE}\]

Here SE is the standard error of the table cell in question, which is estimated using the expected count adjusted by a scaling factor.

  • This document provides a good overview of the calculation, standard error estimation, and adjustment if you are interested in the details.
  • More generally, the main idea you should know is that any cell with a standardized residual that is large in absolute magnitude represents a category where the data deviate from the null hypothesis.
    • If we consider the 68-95-99.7 rule, we’d expect a standardized residual of 2 or larger in absolute value to occur roughly 5% of the time when the null hypothesis is true, and a standardized residual of 3 or larger in absolute value to occur less than 1% of the time.
    • Additionally, the sign of the residual indicates whether the observed count was higher (positive) or lower (negative) than what is expected under the null hypothesis.
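To make the scaling factor concrete, we can reproduce these standardized residuals by hand for the AP Exam example. For a goodness of fit test, each cell’s SE is \(\sqrt{n p (1-p)}\), where the \((1-p)\) term is the adjustment mentioned above. This is a sketch; chisq.test() reports the same values through its stdres component:

```r
## Observed counts and null proportions for the AP Exam example
observed = c(A = 85, B = 90, C = 79, D = 78, E = 68)
p_null = rep(0.2, 5)
n = sum(observed)              # total sample size (400)

## Expected counts under the null hypothesis
expected = n * p_null          # 80 in every category

## SE of each cell: sqrt(n * p * (1 - p))
se = sqrt(n * p_null * (1 - p_null))

## Standardized residuals
(observed - expected) / se
```

These are the same values that chisq.test() computes internally: 0.625, 1.250, -0.125, -0.250, -1.500.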

Below are the standardized residuals for our AP Exam example:

## Table of standardized residuals
test_results$stdres
## [1]  0.625  1.250 -0.125 -0.250 -1.500

Using these results, we can see that the “E” category showed the largest discrepancy, though it was only 1.5 standard deviations below what was expected. The largest positive deviation was “B”, which was only 1.25 standard deviations above what was expected.

Additionally, you should notice that none of these standardized residuals exceeds 2 in absolute value, which is consistent with the Chi-squared test failing to reject the null hypothesis.

Question #1: In court cases jurors are selected from a pool of eligible adults that is supposed to be randomly chosen from the local community. The American Civil Liberties Union (ACLU) has studied the racial composition of jury pools in Alameda County, California, and shown below are the racial/ethnic composition of \(n=1453\) individuals included in these jury pools along with the distribution of eligible jurors (according to US Census data for Alameda County):

Race/Ethnicity       Number in Jury Pools   US Census Percentage
Non-Hispanic White   780                    54%
Black                117                    18%
Hispanic             114                    12%
Asian                384                    15%
Other                58                     1%
Total                1453                   100%
  • Part A: Suppose the ACLU would like to statistically evaluate whether jury pools deviate from the demographics of the local community. For this analysis, state the null hypothesis in words.
  • Part B: Perform a Chi-squared Goodness of Fit test using the provided data. Show the output of the test.
  • Part C: Display the standardized residuals of the test you performed in Part B. Which groups were the most overrepresented and underrepresented in the sample data relative to what was expected under the null hypothesis?

Question #2: As part of a 2009 study, researchers collected data on the moves of 119 novice players in the game rock-paper-scissors against a computer opponent. The data below record each player’s first and second moves:

rps = read.csv("https://remiller1450.github.io/data/rock_paper_scissors.csv")
  • Part A: Consider the null hypothesis that novice players are randomly choosing their first move. Perform a Chi-squared Goodness of Fit test to evaluate this hypothesis. Display the output of this test and provide a 1-sentence conclusion.
  • Part B: Looking at the standardized residuals for the test you performed in Part A, which rock-paper-scissors category exhibited the largest deviation from what is expected under the null hypothesis? Were there more than expected or fewer than expected outcomes in that category?

\(~\)

Tests of Independence

Chi-squared tests of independence are also performed using the chisq.test() function. Here, we must provide a two-way frequency table, and we do not use the p argument, as the null proportions of this test are derived from the data.
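Concretely, under independence the expected count for each cell equals (row total × column total) / n, and these margin-derived expected counts are what the test compares against. Below is a minimal sketch using a small made-up two-way table (the same steps apply to any table built with table()):

```r
## A small made-up two-way frequency table
tab = as.table(rbind(c(20, 10),
                     c(15, 25)))
dimnames(tab) = list(first = c("X", "Y"), second = c("L", "R"))

## Expected counts under independence: (row total * column total) / n
n = sum(tab)
expected = outer(rowSums(tab), colSums(tab)) / n
expected
```

These match the values stored in chisq.test(tab)$expected, which are derived from the data’s margins rather than from a user-supplied p argument.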

The example below evaluates whether the first and second moves of novice rock-paper-scissors players are independent using the data from Question #2:

## Create the required two-way frequency table
rps_table = table(rps$first_move, rps$second_move)

## Chi-squared test using the two-way table
chisq.test(rps_table)
## 
##  Pearson's Chi-squared test
## 
## data:  rps_table
## X-squared = 6.784, df = 4, p-value = 0.1478

Running this test triggers a warning (“Chi-squared approximation may be incorrect”). We will revisit this warning later, as the Chi-squared distribution only serves as a reasonable probability model when there are sufficiently large counts in each cell of the two-way frequency table (expected counts of at least 5 is a common guideline).

Similar to goodness of fit tests, we can follow up on our testing results by looking at standardized residuals:

## Store test results
rps_test_results = chisq.test(rps_table)

## Standardized Residuals
rps_test_results$stdres
##           
##                 Paper       Rock   Scissors
##   Paper     0.7754986 -1.2571960  0.5169031
##   Rock     -0.3258697  1.9778030 -1.7726677
##   Scissors -0.6271109 -1.2193823  1.9814473

Notice that all of these residuals reflect deviations that are less than two standard deviations above/below what is expected. Nevertheless, this table does suggest that more players than expected are selecting the same choice for their second move as they had already used for their first move. For example, the number of players who chose rock on their first and second move is 1.98 standard deviations higher than what would be expected under independence.

Question #3: We’ve previously worked with the “TSA Claims” data set, which contained all claims made against the Transportation Security Administration between 2003 and 2008. For this question you will analyze a random sample of \(n=5000\) claims from this time period.

tsa_sample = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
  • Part A: Consider the variables Month (the month when the claim occurred) and Status (whether a claim was approved, denied, or settled). If these variables are independent, what proportion of claims would you expect to be approved, denied, and settled in each month? That is, find the distribution of Status under the null hypothesis of independence.
  • Part B: Perform a Chi-squared test of independence to evaluate the null hypothesis referenced in Part A.
  • Part C: Analyze the standardized residuals of the test you performed in Part B. Which months stand out as having different outcomes from what is expected? Provide a 1-2 sentence summary of your findings.

\(~\)

Cramer’s V

Standardized residuals are most useful when the variables involved in a Chi-squared test each contain a modest number of categories. However, for two-way frequency tables with a large number of cells, it can become difficult to attribute a significant Chi-squared test result to any small set of individual cells, and it may make sense to describe the strength of the relationship holistically.

Cramer’s V is a popular measure of the association between two nominal categorical variables. It takes on values between 0 (independence) and +1 (complete dependence), allowing it to be interpreted similarly to Pearson’s correlation coefficient.

\[V = \sqrt{\frac{X^2/n}{\text{min}(r-1, c-1)}}\]

  • \(X^2\) is the Chi-squared test statistic
  • \(n\) is the sample size
  • \(r\) is the number of rows in the two-way frequency table
  • \(c\) is the number of columns in the two-way frequency table

There are a few R packages that will calculate Cramer’s V (rcompanion and lsr), but it’s easy enough to do ourselves. The code below calculates Cramer’s V for our rock-paper-scissors example:

## Table and Test results
rps_table = table(rps$first_move, rps$second_move)
rps_test_results = chisq.test(rps_table)

## Calculate Cramer's V
Cramers_V = as.numeric(sqrt(rps_test_results$statistic/nrow(rps)/
       min(nrow(rps_table)-1, ncol(rps_table)-1)))

## Print the result
Cramers_V
## [1] 0.1688317

Here Cramer’s V is relatively close to zero, which is unsurprising as our Chi-squared test was not statistically significant.

Moving beyond this application, it is worth noting that Cramer’s V is useful because its standardized scale allows us to compare the strength of association between many pairings of categorical variables.
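For comparisons like this, it can help to wrap the calculation in a small reusable function. The helper below (the name cramers_v is our own, not from any package) takes a two-way table and returns V; Yates’ continuity correction is disabled so that the result matches the formula even for 2x2 tables:

```r
## Helper function to compute Cramer's V from a two-way frequency table
cramers_v = function(tab) {
  ## Suppress the small-count warning; correct = FALSE disables Yates'
  ## continuity correction so the statistic matches the formula exactly
  test = suppressWarnings(chisq.test(tab, correct = FALSE))
  n = sum(tab)
  as.numeric(sqrt((test$statistic / n) / min(nrow(tab) - 1, ncol(tab) - 1)))
}

## Example with a made-up 2x3 table
tab = rbind(c(10, 20, 30),
            c(30, 20, 10))
cramers_v(tab)
## [1] 0.4082483
```

Applying the same function to several tables then makes the associations directly comparable on a common 0-to-1 scale.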

Question #4: Using the tsa_sample data described in Question 3, use Cramer’s V to determine whether there is a stronger association between the variables Month and Status or Claim_Type and Status.

\(~\)

Fisher’s Exact Test

In the test for independence performed on the rock-paper-scissors data, we saw a red warning message when using chisq.test() due to the small expected counts in some cells of the two-way frequency table. This occurs because the Chi-squared test assumes that, for a sufficiently large sample, the frequencies in each cell of the two-way frequency table will be approximately Normally distributed. A common “rule of thumb” states that this assumption is reasonable when every expected cell count exceeds 5.

If some cells of the two-way table have expected counts less than 5, Fisher’s Exact Test provides an exact testing approach that doesn’t rely upon a Normality assumption. The test works by considering all possible two-way frequency tables with row and column totals fixed at their observed values. As you might expect, this gets computationally expensive for large tables, and the assumption of fixed row and column totals makes the test conservative (less powerful) in most settings. I encourage you to read about Fisher’s “lady tasting tea” experiment if you’re interested.
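As a historical aside, Fisher’s “lady tasting tea” experiment fits in a single 2x2 table. In a commonly cited version of the data, the lady correctly identified 3 of the 4 milk-first cups; the sketch below enters that table by hand and runs the one-sided exact test:

```r
## Lady tasting tea: rows = lady's guess, columns = actual preparation
tea = matrix(c(3, 1, 1, 3), nrow = 2,
             dimnames = list(Guess = c("Milk", "Tea"),
                             Truth = c("Milk", "Tea")))

## One-sided alternative: is she doing better than random guessing?
fisher.test(tea, alternative = "greater")
```

The exact one-sided \(p\)-value here is 17/70 (about 0.243), so even 3 correct cups out of 4 is not convincing evidence against random guessing.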

Fisher’s Exact Test is performed using the fisher.test() function, which behaves similarly to chisq.test():

## Fisher's exact test on our rock-paper-scissors experiment
fisher.test(rps_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  rps_table
## p-value = 0.166
## alternative hypothesis: two.sided

A few things to note:

  1. Fisher’s Exact Test is not based upon expected cell counts, so there are no residuals to analyze after performing the test. Thus, we simply conclude based upon the test’s \(p\)-value whether the data provide sufficient evidence to support an association between variables.
  2. Fisher’s Exact Test is computationally demanding, so we might opt to use the argument simulate.p.value = TRUE in settings where the full test requires too many computational resources.
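To illustrate the second point, the sketch below runs the Monte Carlo version on a made-up 3x4 table containing several small counts. The argument B sets the number of simulated tables, and because the \(p\)-value is approximated by simulation it will vary slightly from run to run:

```r
## A made-up 3x4 table with several small counts
tab = matrix(c(2, 5, 8,
               4, 1, 7,
               6, 3, 2,
               1, 9, 4), nrow = 3)

## Monte Carlo approximation to Fisher's Exact Test
set.seed(123)   # for reproducibility of the simulated p-value
fisher.test(tab, simulate.p.value = TRUE, B = 5000)
```

Note that for 2x2 tables fisher.test() always uses the exact calculation, so simulation is only relevant for larger tables like this one.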

Question #5: The data below come from an occupational health study that asked individuals working in a biology lab to self-report how frequently they wore their lab gloves outside of the lab. These individuals were grouped by educational attainment, and the researchers were interested in whether the likelihood of wearing lab gloves outside of the lab environment is independent of educational attainment.

occ_health = read.csv("https://remiller1450.github.io/data/occ_health.csv")
  • Part A: Ignoring any assumptions of the method, analyze these data using a Chi-squared Test of Independence. Report the test’s \(p\)-value as well as a brief conclusion.
  • Part B: Considering the assumptions that underlie the Chi-squared Test of Independence, why is the test inappropriate for these data?
  • Part C: Analyze the data using Fisher’s Exact Test. Report the test’s \(p\)-value as well as a brief conclusion.