This lab is intended to provide practice and insight in applying Chi-squared tests (for goodness of fit as well as association) to real data. Due to time constraints, it will be slightly shorter than previous labs.

Directions (Please read before starting)

  1. Please work together with your assigned group. Even though you’ll turn in a write-up that is later scored, labs are intended to be formative, and a substantial portion of the credit you’ll receive is based upon effort and completion.
  2. Please record your responses and code in an R Markdown document following the conventions we’ve used in previous labs.

Case Study - Police-involved Deaths

The Washington Post maintains a comprehensive database of instances, dating back to 2015, where police officers have used deadly force on a suspect.

police <- read.csv("https://remiller1450.github.io/data/Police2019.csv")
police_complete <- police[police$race != "", ] ## Filter to exclude individuals with missing race data

These data contain the following variables:

  • name - name of the individual killed
  • date - date of the incident
  • year - year (extracted from date)
  • manner_of_death - cause of death
  • armed - weapon the killed individual was carrying (if applicable)
  • age - age of the individual killed
  • gender - gender of the individual killed
  • race - racial/ethnic group of the individual killed, using the US Census designations, “A” = Asian, “B” = Black, “H” = Hispanic (of any race), “N” = Native American, “O” = Other, “W” = Non-Hispanic White
  • city - location of the incident
  • state - state where the incident took place
  • signs_of_mental_illness - whether any past signs of mental illness were present
  • threat_level - whether the suspect was attacking the involved officers
  • flee - how the suspect fled (if applicable)
  • body_camera - whether any of the involved officers were wearing a body camera

Although an argument can be made that these data are a population (since they contain all incidents from 2015-2019), a useful alternative is to view the data as a representative sample of an underlying random process (since new police-involved deaths continue to occur over time). This alternative view means that we can use the tools of statistical inference (hypothesis tests and confidence intervals) to better understand the uncertainty inherent to the underlying random process that gave rise to the observed data.

Question #1: The US Census Bureau estimates the racial composition of the US as 61.5% Non-Hispanic White, 17.6% Hispanic (of any race), 12.3% Black, 5.3% Asian, 0.7% Native American, and 2.6% other (source). Using the US Census numbers as the basis for a null hypothesis, evaluate whether certain racial groups are disproportionately killed by the police.
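As a rough sketch of how a goodness-of-fit test can be set up in R, the example below passes a vector of counts and a vector of null proportions to chisq.test. The counts are hypothetical placeholders, not values from the police data. The categories are listed alphabetically because that is the order table() would return for the race codes, and the null proportions must follow the same order.

```r
# Hypothetical counts for illustration only; NOT taken from the police data
observed <- c(A = 20, B = 250, H = 180, N = 15, O = 35, W = 500)

# Null proportions from the Census figures quoted in Question #1,
# arranged in the same (alphabetical) order as the counts
census_p <- c(A = 0.053, B = 0.123, H = 0.176, N = 0.007, O = 0.026, W = 0.615)

# chisq.test() expects the null proportions to sum to 1
stopifnot(isTRUE(all.equal(sum(census_p), 1)))

# Goodness-of-fit test: compares observed counts to n * census_p
gof <- chisq.test(x = observed, p = census_p)

gof$expected   # expected counts under the null hypothesis
gof$residuals  # Pearson residuals: (observed - expected) / sqrt(expected)
gof$p.value
```

With the real data, the observed counts could instead come from table(police_complete$race), provided the result contains exactly these six categories in this order.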

Question #2: The widespread adoption of police body cameras has been an area of debate over the past decade. To explore this topic, perform a Chi-squared test to determine whether there is an association between the presence of a police-worn body camera and the race of the individual who was killed. Perform the test “by hand” so that you can recognize and comment upon the largest contributor to the \(X^2\) test statistic.
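One way a “by hand” calculation might look in R is sketched below, under the assumption that the two-way table of counts has already been built. The table here is hypothetical (made-up groups G1 to G3 and made-up counts), so only the steps carry over to the real data: expected counts, per-cell contributions, the test statistic, and the p-value.

```r
# Hypothetical two-way table of counts; NOT taken from the police data
obs <- matrix(c(35, 15, 10,
                65, 85, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(body_camera = c("camera", "no_camera"),
                              group = c("G1", "G2", "G3")))

# Expected count in each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Contribution of each cell to the statistic: (observed - expected)^2 / expected
contrib <- (obs - expected)^2 / expected

# Test statistic, degrees of freedom, and upper-tail p-value
x2 <- sum(contrib)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Row and column of the cell contributing the most to the statistic
which(contrib == max(contrib), arr.ind = TRUE)
```

As a check, sum(contrib) should agree with the statistic reported by chisq.test(obs) for the same table.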

Question #3: Calculate and interpret the odds ratio relating the odds of an individual in the racial/ethnic group with the largest \(X^2\) contribution being killed by an officer wearing a body camera to the odds of a white individual being killed by an officer wearing a body camera.
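For reference, an odds ratio from a 2x2 table can be computed as in the sketch below. The counts and the group label G1 are hypothetical and stand in for whichever group is selected from the real data; the odds for each row are the count with a body camera divided by the count without one.

```r
# Hypothetical 2x2 table; NOT taken from the police data
tab <- matrix(c(35, 65,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("G1", "reference"),
                              body_camera = c("yes", "no")))

# Odds of a body camera being present within each group
odds_g1  <- tab["G1", "yes"] / tab["G1", "no"]
odds_ref <- tab["reference", "yes"] / tab["reference", "no"]

# Odds ratio: odds in G1 relative to odds in the reference group
odds_ratio <- odds_g1 / odds_ref
odds_ratio
```

An odds ratio above 1 would indicate higher odds of a body camera being present in the first group than in the reference group, and a value below 1 would indicate lower odds.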

Question #4: Use the fisher.test function to find a 95% CI estimate for the odds ratio you found in Question #3. What is the importance of this entire confidence interval being above 1?
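A minimal sketch of calling fisher.test on a 2x2 table follows, again using hypothetical counts rather than the police data. The pieces of the result most relevant here are conf.int and estimate.

```r
# Hypothetical 2x2 table; NOT taken from the police data
tab <- matrix(c(35, 65,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("G1", "reference"),
                              body_camera = c("yes", "no")))

# Fisher's exact test; conf.level = 0.95 is the default, shown for clarity
ft <- fisher.test(tab, conf.level = 0.95)

ft$conf.int  # confidence interval for the odds ratio
ft$estimate  # conditional maximum likelihood estimate of the odds ratio
```

Note that fisher.test reports a conditional maximum likelihood estimate, so ft$estimate may differ slightly from the odds ratio computed directly from the four cell counts.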