Directions (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each part before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Introduction

While this lab focuses on Chi-squared tests, it also involves a variety of other tools from earlier in the course as review. The purpose of this is to prepare you for “the real world”, where no one will tell you that you need to use a Chi-squared test; instead, it is your job as someone trained in statistics to know when to use a Chi-squared test and when to use a different approach.

\(~\)

Lab

Examples

Below are a few quick demonstrations of the chisq.test() function, which is used to perform both goodness of fit testing and association testing.

## Example data
colleges = read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Example #1

The example below is a Chi-squared goodness of fit test using equal proportions as the null hypothesis.

## Example 1
## Goodness of fit test w/ equal proportions for every region as H0
observed = table(colleges$Region)
results = chisq.test(x = observed)
results$expected  ## Expected counts
##        Far West     Great Lakes        Mid East     New England          Plains 
##         136.875         136.875         136.875         136.875         136.875 
## Rocky Mountains      South East      South West 
##         136.875         136.875         136.875
results$statistic  ## X^2 test statistic
## X-squared 
##  369.5699
results$p.value    ## p-value
## [1] 7.941409e-76
results$residuals  ## Pearson residuals
## 
##        Far West     Great Lakes        Mid East     New England          Plains 
##      -2.8099830       4.4553723       5.2246452      -5.6306504      -0.9295381 
## Rocky Mountains      South East      South West 
##      -9.1351159      13.3447483      -4.5194784

The Pearson residuals are the only quantities shown above not discussed during this week’s lecture. For a given element of the table you provided, observed, the Pearson residual is the observed count minus the expected count, standardized by the square root of the expected count, or: \[ \text{residual}_k = \tfrac{\text{observed}_k - \text{expected}_k}{\sqrt{\text{expected}_k}}\]

Pearson residuals are useful in determining which categories were most influential in a statistically significant result. In our example, we can see that there are more colleges than expected in the South East, Great Lakes, and Mid East regions and fewer colleges than expected in the Rocky Mountains, New England, South West, and Far West.

Because Pearson residuals look a lot like Z-scores, we can interpret them similarly. That is, an absolute value of \(\sim 3\) reflects an observed count roughly 3 standard errors away from what we’d expect under the null hypothesis, a noteworthy result.
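To demystify these residuals, below is a minimal sketch (using a small made-up table of counts, not the colleges data) that computes the Pearson residuals directly from the formula and checks them against what chisq.test() stores in $residuals:

```r
## Small made-up table of counts (hypothetical data)
observed = c(A = 20, B = 35, C = 45)

## Expected counts under H0 of equal proportions
expected = sum(observed) * rep(1/3, 3)

## Pearson residuals by hand: (observed - expected) / sqrt(expected)
by_hand = (observed - expected) / sqrt(expected)

## These match what chisq.test() stores in $residuals
results = chisq.test(x = observed)
all.equal(unname(by_hand), unname(results$residuals))
```

Here category C sits above its expected count (roughly 33.3), so its residual is positive, while category A sits below and has a negative residual.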

\(~\)

Example #2

Our first example used a null hypothesis of equal proportions across every category. But we can also provide our own set of null proportions:

## Goodness of fit test w/ custom proportions as H0
results = chisq.test(x = observed, 
                     p = c(0.14, 0.15, 0.11, 0.13, 0.05, 0.08, 0.19, 0.15),
                     correct = FALSE)
results$expected  ## Notice the different expected counts
##        Far West     Great Lakes        Mid East     New England          Plains 
##          153.30          164.25          120.45          142.35           54.75 
## Rocky Mountains      South East      South West 
##           87.60          208.05          164.25

The one thing to be careful of is that chisq.test() matches the proportions you provide to the entries of the observed table by position, not by name. So be careful that you’re providing your null proportions in the same order as the categories appear in that table.
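As a quick sketch of this positional matching (with made-up counts, not the colleges data), note that table() orders its categories alphabetically, and that the expected counts are simply the total sample size times each null proportion:

```r
## Hypothetical counts for three made-up regions
region = c(rep("East", 30), rep("North", 50), rep("West", 20))
observed = table(region)
names(observed)    ## alphabetical order: "East" "North" "West"

## Null proportions supplied in that same order (East, North, West)
p0 = c(0.25, 0.50, 0.25)

results = chisq.test(x = observed, p = p0)
results$expected   ## sum(observed) * p0, i.e. 25, 50, 25
```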

\(~\)

Example #3

In addition to goodness of fit testing, chisq.test() will perform tests of association (independence) when you provide it a two-way frequency table:

## Test of association/independence
observed = table(colleges$Region, colleges$Private)
results = chisq.test(x = observed, correct = FALSE)
results$expected  ## Expected counts
##                  
##                     Private    Public
##   Far West         61.45023  42.54977
##   Great Lakes     111.67397  77.32603
##   Mid East        116.99178  81.00822
##   New England      41.95160  29.04840
##   Plains           74.44932  51.55068
##   Rocky Mountains  17.72603  12.27397
##   South East      173.12420 119.87580
##   South West       49.63288  34.36712
results$statistic  ## Test statistic
## X-squared 
##  30.21473
results$p.value    ## p-value
## [1] 8.672211e-05
results$residuals  ## Pearson residuals
##                  
##                      Private     Public
##   Far West        -0.3125682  0.3756280
##   Great Lakes      1.2610280 -1.5154369
##   Mid East         0.8328394 -1.0008625
##   New England      0.3162575 -0.3800616
##   Plains           1.1068892 -1.3302011
##   Rocky Mountains -2.3100947  2.7761499
##   South East      -0.7694526  0.9246875
##   South West      -1.6512103  1.9843375

The argument correct = FALSE prevents R from using a “continuity correction” (which R applies only to 2-by-2 tables). The only reason for setting it here is so that our \(p\)-values align exactly with those you could calculate “by hand”. In practice it’s difficult to think of a reason why you’d ever set this to FALSE when analyzing real data on your own.
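Since the stated goal of correct = FALSE is to match a “by hand” calculation, here is a minimal sketch of that calculation on a small made-up 2-by-2 table (where R would otherwise apply the correction):

```r
## A made-up 2-by-2 table of counts (hypothetical data)
observed = matrix(c(30, 20, 10, 40), nrow = 2,
                  dimnames = list(Group = c("G1", "G2"),
                                  Outcome = c("Yes", "No")))

## Expected counts under independence: (row total * column total) / n
expected = outer(rowSums(observed), colSums(observed)) / sum(observed)

## X^2 statistic and p-value "by hand" (no continuity correction)
X2 = sum((observed - expected)^2 / expected)
df = (nrow(observed) - 1) * (ncol(observed) - 1)
p_by_hand = pchisq(X2, df = df, lower.tail = FALSE)

## These match chisq.test() only when correct = FALSE
results = chisq.test(x = observed, correct = FALSE)
all.equal(X2, unname(results$statistic))
all.equal(p_by_hand, results$p.value)
```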

\(~\)

Example #4

Next, you should recall that Chi-squared tests are unreliable when there are expected counts smaller than 5 (as a rule of thumb). In these scenarios you should consider Fisher’s exact test as an alternative:

## Fisher's Exact Test for association/independence
observed = table(colleges$Region, colleges$Private)
fisher.test(x = observed, simulate.p.value = TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  observed
## p-value = 0.0009995
## alternative hypothesis: two.sided

The null hypothesis is the same for this test and a Chi-squared test of association. The only difference is the manner by which the \(p\)-value is calculated. Recall that Chi-squared tests rely upon an underlying Normal approximation, while Fisher’s exact test does not.

Fisher’s exact test is computationally expensive for tables with a large number of cells (like the one involved in this example), so you may choose to use the argument simulate.p.value = TRUE to speed up the computation time.

\(~\)

Finding \(p\)-values using pchisq()

Finally, you should be aware that pchisq() can be used to calculate the area under a section of the Chi-squared distribution. The example below shows how to find the \(p\)-value for a test statistic of \(X^2 = 3.5\) using a distribution with \(df = 1\):

## Area in the tail to the right of X^2 = 3.5
pchisq(q = 3.5, df = 1, lower.tail = FALSE)
## [1] 0.06136883
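An equivalent way to find this area is to subtract the lower-tail probability from 1, but lower.tail = FALSE is the numerically safer option when the tail area is extremely small:

```r
## Two equivalent ways to find the area to the right of X^2 = 3.5
pchisq(q = 3.5, df = 1, lower.tail = FALSE)
1 - pchisq(q = 3.5, df = 1)    ## same value, 0.06136883

## For a very large test statistic the subtraction underflows to 0,
## while lower.tail = FALSE still returns the tiny tail area
1 - pchisq(q = 369.5699, df = 7)                   ## 0
pchisq(q = 369.5699, df = 7, lower.tail = FALSE)   ## about 7.9e-76
```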

\(~\)

Application #1 - Police Involved Killings

In this portion of the lab you’ll analyze the data used in the fivethirtyeight.com article “Where Police Have Killed Americans in 2015”. These data are available at the link below:

https://remiller1450.github.io/data/PoliceKillings.csv

The data contain the following variables:

  • Name: Name of the deceased
  • Age: Age of the deceased at time of death
  • Gender: Gender of the deceased
  • RaceEthnicity: Racial/Ethnic category of the deceased
  • Month: The month when the incident occurred
  • Day: The day of the month when the incident occurred
  • Year: The year when the incident occurred
  • StreetAddress: The street address or intersection nearest to where the incident occurred
  • City: The city in which the incident occurred
  • State: The state in which the incident occurred
  • Latitude: The latitude of the street address nearest to where the incident occurred
  • Longitude: The longitude of the street address nearest to where the incident occurred
  • LawEnforcementAgency: The law enforcement agency involved
  • Cause: Cause of death
  • Armed: Whether the deceased subject was “Armed” or what they were armed with
  • Pov: The census tract poverty rate
  • Urate: The census tract unemployment rate
  • College: The census tract share of the age 25+ population with a bachelor’s degree (or higher)

Because these data were collected in a single year, we’ll consider them a sample of trends that exist more broadly within the United States.

Question 1:

In 2019, a Pennsylvania jury controversially acquitted a police officer who fatally shot an unarmed teenager (this NPR Article provides details). Based upon this event, we might wonder: how common are police killings of unarmed individuals among all police killings? To answer this question:

  • Part A: Create an appropriate data visualization.
  • Part B: Provide one or more appropriate descriptive statistics.
  • Part C: Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion. If you use a Chi-squared test, briefly comment upon what you see in the Pearson residuals when making your conclusion.

\(~\)

Question 2:

The Pennsylvania case mentioned in Question #1 received a lot of publicity because it involved a white police officer killing an unarmed black individual. We might wonder, among those killed by the police, is the proportion of black individuals who were unarmed different from the proportion of white individuals who were unarmed? To answer this question:

  • Part A: Create an appropriate data visualization.
  • Part B: Provide one or more appropriate descriptive statistics.
  • Part C: Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion. If you use a Chi-squared test, briefly comment upon what you see in the Pearson residuals when making your conclusion.

\(~\)

Question 3:

Everyone who appears in these data was killed by the police (by definition), but this means we don’t have data on individuals who were not killed by the police, so we might seek to bring external information into our analysis. It is estimated that the racial composition of the United States in 2015 was 61.8% non-Hispanic white, 13.2% black, 17.8% Hispanic (of any race), 5.2% Asian, 0.8% Native American, and 1.2% other. Based upon this, we might wonder if police killings are equally common across races, or if some racial/ethnic groups are disproportionately involved in police killings. To answer this question:

  • Part A: Create an appropriate data visualization.
  • Part B: Provide one or more appropriate descriptive statistics.
  • Part C: Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion. If you use a Chi-squared test, briefly comment upon what you see in the Pearson residuals when making your conclusion.

\(~\)

Question 4:

Critics of the analysis described in Question 3 might argue that, because of socio-economic factors, not all racial/ethnic groups commit crimes at the same rate, and therefore exposure to situations with a possibility of being killed by the police is unequal across groups. It might be possible to evaluate this criticism using external information from the National Crime Victimization Survey (NCVS). According to the NCVS, 22.7% of the victims of violent crimes report that the perpetrator of the crime was black. Based upon this, we might wonder if the proportion of black individuals in the Police Killings data differs from the proportion of crimes with black perpetrators (given by the NCVS). To answer this question:

  • Part A: Provide one or more appropriate descriptive statistics.
  • Part B: Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion. If you use a Chi-squared test, briefly comment upon what you see in the Pearson residuals when making your conclusion.

\(~\)

Application #2 - Claims Against the TSA

The Transportation Security Administration (TSA) is an agency within the US Department of Homeland Security that has authority over the safety and security of travel in the United States. The data linked below are a random sample of \(n=5000\) claims made by travelers against the TSA between 2003 and 2008, including information on the claim type, the claim amount, and whether the claim was approved, settled, or denied.

https://remiller1450.github.io/data/tsa_small.csv

The relevant variables in this analysis are:

  • Claim_Site - where the underlying event involved in the claim took place, either at a security checkpoint, in the handling of the individual’s checked baggage, or elsewhere.
  • Status - whether the claim was approved (paid in full), settled (partially paid/negotiated), or denied (not paid at all).
  • Item - the type of item involved in the claim (e.g., cell phone, laptop, clothing)
  • Claim_Type - the category of the claim (e.g., property damage, lost property, bodily injury)
  • Claim_Amount - the monetary amount requested by the individual making the claim
  • Close_Amount - the monetary amount paid to the individual making the claim. This will be the claim amount for claims that were approved and will be zero for claims that were denied.

Question 5: Are the majority of claims property loss? Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion. You should use R to perform the test.

\(~\)

Question 6: Among laptops and cell phones, which device is more likely to be the subject of a property loss claim than of a property damage claim? Hint: You’ll need to use the filter() and/or mutate() and ifelse() functions in order to prepare the data for this question. You should ignore all other items and claim types for this question.

  • Part A: After preparing the data, create an appropriate data visualization related to this research question.
  • Part B: Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion.

\(~\)

Question 7: Do property loss claims tend to result in higher close amounts (payouts) than property damage claims? You should consider only these two claim types in this question.

  • Part A: After preparing the data, create an appropriate data visualization related to this research question.
  • Part B: Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion.

\(~\)

Question 8: Considering all available categories in each variable, is the status of claim associated with the claim site? Perform an appropriate statistical test, clearly stating your null hypothesis, \(p\)-value, and conclusion. You should use R to perform the test. Hint: Be careful regarding sample sizes in this analysis.

\(~\)

Question 9: What happens when you filter out the “Other” category before performing the hypothesis test in Question 8? Repeat this test after subsetting the data using the filter() function to remove cases in the “Other” category, then briefly discuss the consequences, if any, of removing this category from the analysis.