Directions (read before starting)
\(~\)
While this lab will focus on Chi-squared tests, it will also involve a variety of other tools from earlier in the course as review. The purpose of this is to prepare you for “the real world”, where no one will tell you that you need to use a Chi-squared test; instead, it is your job as someone trained in statistics to know when to use a Chi-squared test and when to use a different approach.
\(~\)
Below are a few quick demonstrations of the chisq.test()
function, which is used to perform both goodness of fit testing and
association testing.
## Example data
colleges = read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
The example below is a Chi-squared goodness of fit test using equal proportions as the null hypothesis.
## Example 1
## Goodness of fit test w/ equal proportions for every region as H0
observed = table(colleges$Region)
results = chisq.test(x = observed)
results$expected ## Expected counts
## Far West Great Lakes Mid East New England Plains
## 136.875 136.875 136.875 136.875 136.875
## Rocky Mountains South East South West
## 136.875 136.875 136.875
results$statistic ## X^2 test statistic
## X-squared
## 369.5699
results$p.value ## p-value
## [1] 7.941409e-76
results$residuals ## Pearson residuals
##
## Far West Great Lakes Mid East New England Plains
## -2.8099830 4.4553723 5.2246452 -5.6306504 -0.9295381
## Rocky Mountains South East South West
## -9.1351159 13.3447483 -4.5194784
The Pearson residuals are the only quantities shown above not discussed during this week’s lecture. For a given element of the table you provided, observed, the Pearson residual is the difference between the observed count and the expected count, standardized by the square root of the expected count, or: \[ \text{residual}_k = \tfrac{\text{observed}_k - \text{expected}_k}{\sqrt{\text{expected}_k}}\]
Pearson residuals are useful for determining which categories were most influential in a statistically significant result. In our example we can see that there are more colleges than expected in the South East, Great Lakes, and Mid East, and fewer colleges than expected in the Rocky Mountains, New England, South West, and Far West.
Because Pearson residuals look a lot like Z-scores, we can interpret them similarly. That is, an absolute value of \(\sim 3\) reflects an observed count roughly 3 standard errors away from what we’d expect, which is a noteworthy result.
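If you’d like to verify these values, they can be reconstructed directly from the formula above. Here is a quick sketch using the observed table and results object created in Example 1:
## Pearson residuals by hand: (observed - expected) / sqrt(expected)
(observed - results$expected) / sqrt(results$expected) ## should match results$residuals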
\(~\)
Our first example used a null hypothesis of equal proportions across every category. But we can also provide our own set of null proportions:
## Goodness of fit test w/ custom proportions as H0
results = chisq.test(x = observed,
p = c(0.14, 0.15, 0.11, 0.13, 0.05, 0.08, 0.19, 0.15),
correct = FALSE)
results$expected ## Notice the different expected counts
## Far West Great Lakes Mid East New England Plains
## 153.30 164.25 120.45 142.35 54.75
## Rocky Mountains South East South West
## 87.60 208.05 164.25
One thing to be careful of is that chisq.test() matches the proportions you provide to the positions of the categories in the table of observed counts, not to their names, so be sure you’re providing your null proportions in the same order as the categories appear in that table.
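To guard against an ordering mistake, you can check the category order with names() and store your null proportions in a named vector. The sketch below reuses the proportions from the example above; the object name p_null is just a convenient label of our own:
## Inspect the category order used by the table of observed counts
names(observed)

## Store the null proportions with names, then reorder them to match the table
p_null = c("Far West" = 0.14, "Great Lakes" = 0.15, "Mid East" = 0.11,
           "New England" = 0.13, "Plains" = 0.05, "Rocky Mountains" = 0.08,
           "South East" = 0.19, "South West" = 0.15)
chisq.test(x = observed, p = p_null[names(observed)])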
\(~\)
In addition to goodness of fit testing, chisq.test()
will perform tests of association (independence) when you provide it a
two-way frequency table:
## Test of association/independence
observed = table(colleges$Region, colleges$Private)
results = chisq.test(x = observed, correct = FALSE)
results$expected ## Expected counts
##
## Private Public
## Far West 61.45023 42.54977
## Great Lakes 111.67397 77.32603
## Mid East 116.99178 81.00822
## New England 41.95160 29.04840
## Plains 74.44932 51.55068
## Rocky Mountains 17.72603 12.27397
## South East 173.12420 119.87580
## South West 49.63288 34.36712
results$statistic ## Test statistic
## X-squared
## 30.21473
results$p.value ## p-value
## [1] 8.672211e-05
results$residuals ## Pearson residuals
##
## Private Public
## Far West -0.3125682 0.3756280
## Great Lakes 1.2610280 -1.5154369
## Mid East 0.8328394 -1.0008625
## New England 0.3162575 -0.3800616
## Plains 1.1068892 -1.3302011
## Rocky Mountains -2.3100947 2.7761499
## South East -0.7694526 0.9246875
## South West -1.6512103 1.9843375
The argument correct = FALSE
prevents R
from using a “continuity correction”. The only reason for this is so
that our \(p\)-values will align
exactly with those you could calculate “by hand”. In practice it’s
difficult to think of a reason why you’d ever set this to
FALSE
when analyzing real data on your own.
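If you’d like to check the “by hand” calculation yourself, the test statistic can be rebuilt from its definition using the observed and expected counts stored above (a quick sketch):
## Rebuild the X^2 statistic: sum of (observed - expected)^2 / expected
sum((observed - results$expected)^2 / results$expected) ## should match results$statistic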
\(~\)
Next, you should recall that Chi-squared tests are unreliable when there are expected counts smaller than 5 (as a rule of thumb). In these scenarios you should consider Fisher’s exact test as an alternative:
## Fisher's Exact Test for association/independence
observed = table(colleges$Region, colleges$Private)
fisher.test(x = observed, simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: observed
## p-value = 0.0009995
## alternative hypothesis: two.sided
The null hypothesis is the same for this test and a Chi-squared test of association. The only difference is the manner by which the \(p\)-value is calculated. Recall that Chi-squared tests rely upon an underlying Normal approximation, while Fisher’s exact test does not.
Fisher’s exact test is computationally expensive for tables with a
large number of cells (like the one involved in this example), so you
may choose to use the argument simulate.p.value = TRUE
to
speed up the computation time.
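As a quick check of the rule of thumb mentioned above, you can also inspect the expected counts before deciding between the two tests. A short sketch using the two-way table from this example:
## Check the rule of thumb: are any expected counts smaller than 5?
any(chisq.test(x = observed)$expected < 5) ## FALSE for this table, so a Chi-squared test is also reasonable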
\(~\)
pchisq()
Finally, you should be aware that pchisq() can be used to calculate areas under the Chi-squared distribution. The example below shows how to find the \(p\)-value for a test statistic of \(X^2 = 3.5\) using a distribution with \(df = 1\):
## Area in the tail to the right of X^2 = 3.5
pchisq(q = 3.5, df = 1, lower.tail = FALSE)
## [1] 0.06136883
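The same approach can be used to double-check the test of association from earlier: its 8-by-2 table has \((8-1)\times(2-1) = 7\) degrees of freedom, so pchisq() should recover the reported \(p\)-value from the reported test statistic:
## Area to the right of the earlier test statistic, X^2 = 30.21473, with df = 7
pchisq(q = 30.21473, df = 7, lower.tail = FALSE) ## should match results$p.value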
\(~\)
In this portion of the lab you’ll analyze the data used in the fivethirtyeight.com article “Where Police Have Killed Americans in 2015”. These data are available at the link below:
The data contain the following variables:
Because these data were collected in a single year, we’ll consider them a sample of trends that exist more broadly within the United States.
Question 1:
In 2019, a Pennsylvania jury controversially acquitted a police officer who fatally shot an unarmed teenager (this NPR Article provides details). Based upon this event, we might wonder: how common are police killings of unarmed individuals among all police killings? To answer this question:
\(~\)
Question 2:
The Pennsylvania case mentioned in Question #1 received a lot of publicity because it involved a white police officer killing an unarmed black individual. We might wonder: among those killed by the police, is the proportion of black individuals who were unarmed different from the proportion of white individuals who were unarmed? To answer this question:
\(~\)
Question 3:
Everyone who appears in these data was killed by the police (by definition), but this means we don’t have data on individuals who were not killed by the police, so we might seek to bring external information into our analysis. It is estimated that the racial composition of the United States in 2015 was 61.8% non-Hispanic white, 13.2% black, 17.8% Hispanic (of any race), 5.2% Asian, 0.8% Native American, and 1.2% other. Based upon this, we might wonder if police killings are equally common across races, or if some racial/ethnic groups are disproportionately involved in police killings? To answer this question:
\(~\)
Question 4:
Critics of the analysis described in Question 3 might argue that, because of socio-economic factors, not all racial/ethnic groups commit crimes at the same rate, and therefore exposure to situations with a possibility of being killed by the police is unequal across groups. It might be possible to evaluate this criticism using external information from the National Crime Victimization Survey (NCVS). According to the NCVS, 22.7% of the victims of violent crimes report that the perpetrator of the crime was black. Based upon this, we might wonder if the proportion of black individuals in the Police Killings data differs from the proportion of crimes with black perpetrators (given by the NCVS)? To answer this question:
\(~\)
The Transportation Security Administration (TSA) is an agency within the US Department of Homeland Security that has authority over the safety and security of travel in the United States. The data given below are a random sample of \(n=5000\) claims made by travelers against the TSA between 2003 and 2008, including information on the claim type, claim amount, and whether each claim was approved, settled, or denied.
The relevant variables in this analysis are:
Question 5: Are the majority of claims property
loss? Perform an appropriate statistical test, clearly stating your null
hypothesis, \(p\)-value, and
conclusion. You should use R
to perform the test.
\(~\)
Question 6: Among laptops and cell phones, which
device is more likely to be the subject of a property loss claim than a
property damage claim? Hint: You’ll need to use the
filter()
and/or mutate()
and
ifelse()
functions in order to prepare the data for this
question. You should ignore all other items and claim types for this
question.
\(~\)
Question 7: Do property loss claims tend to result in higher close amounts (payouts) than property damage claims? You should consider only these two claim types in this question.
\(~\)
Question 8: Considering all available categories in each variable, is the status of a claim associated with the claim site?
Perform an appropriate statistical test, clearly stating your null
hypothesis, \(p\)-value, and
conclusion. You should use R
to perform the test.
Hint: Be careful regarding sample sizes in this analysis.
\(~\)
Question 9: What happens when you filter out the
“Other” category before performing the hypothesis test in Question 8?
Repeat this test after subsetting the data using the
filter()
function to remove cases in the “Other” category,
then briefly discuss the consequences, if any, of removing this category
from the analysis.