These questions are intended to help you practice for Exam #3. The real exam will feature 2-3 questions that follow a similar format. All course content up until this point may appear on the exam, but the primary focus will be on Chapters 5.1, 5.2, 5.3, 6.1, 7.1 of the textbook and our notes/labs pertaining to confidence interval estimation and hypothesis testing (for a single mean or single proportion).

On the actual exam you should be prepared to record your answers in a properly formatted R Markdown document, submitting the compiled HTML output.

\(~\)

Question #1

In recent years substantial media attention has been given to the conduct of law enforcement in cases where a criminal suspect was killed by the police. The data we’ll analyze in this application originate from a FiveThirtyEight article titled “Where Police Have Killed Americans in 2015”. These data contain demographic and geographic information for every individual killed by the police in the year 2015.

pk = read.csv("https://remiller1450.github.io/data/PoliceKillings.csv")

1-A: Suppose the researchers analyzing these data are interested in whether black individuals are disproportionately the victims in police-involved deaths. According to the US Census, the racial/ethnic composition of the United States in 2015 was 61.8% non-Hispanic white, 13.2% black, 17.8% Hispanic (of any race), 5.2% Asian, 0.8% Native American, and 1.2% other. With this in mind, state the null hypothesis these researchers would be interested in finding evidence against using these data. Define, in words, any population parameters (i.e., \(\mu\) or \(p\)) used in your null hypothesis.

1-B: Assess the null hypothesis you described in Part 1-A using an appropriate hypothesis test. Your answer should include the R code you used to perform the test, the resulting \(p\)-value, and a brief conclusion.
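For reference, a test of a single proportion can be run in R along the following lines. This is a minimal sketch of an exact binomial test; the counts `x` and `n` and the null value `p0` are hypothetical placeholders, not values taken from the PoliceKillings data or the Census figures above.

```r
# Hypothetical values for illustration only (not from the course data)
x  <- 30    # number of "successes" observed in the sample
n  <- 100   # sample size
p0 <- 0.20  # proportion specified by the null hypothesis

# Exact binomial test of H0: p = p0 against the one-sided alternative p > p0
res <- binom.test(x, n, p = p0, alternative = "greater")

res$p.value    # one-sided p-value
res$estimate   # sample proportion, x / n
```

With a large sample, `prop.test(x, n, p = p0)` offers a Normal-approximation alternative to the exact test.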

1-C: In your own words, describe what a Type I and Type II error would be in the context of this application. Which of these errors could you have made in Part 1-B?

1-D: Suppose you repeated the hypothesis test described in Parts 1-A and 1-B for the five other categories of race/ethnicity recorded in the US Census. If each test were independent of the others, and if the null hypothesis were true for every test, what is the probability of making at least one Type I error if a decision threshold of \(\alpha = 0.01\) is applied to the \(p\)-value of each test?
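When \(k\) independent tests each use threshold \(\alpha\) and every null hypothesis is true, the chance of at least one Type I error is \(1 - (1 - \alpha)^k\). The sketch below evaluates this expression for hypothetical settings that are deliberately different from those in Part 1-D.

```r
# Hypothetical settings for illustration (different from those in Part 1-D)
alpha <- 0.05  # per-test Type I error rate
k     <- 10    # number of independent tests, each with a true null hypothesis

# P(no Type I error in any test) = (1 - alpha)^k, so the complement is:
fwer <- 1 - (1 - alpha)^k
round(fwer, 4)
```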

\(~\)

Question #2

In 2010, new international rules were created to regulate swimsuit coverage and material after an inordinate number of records were set at the 2008 Olympics by swimmers wearing a scientifically designed suit known as the LZR Racer.

To more rigorously investigate whether these scientifically designed swimsuits actually enhance performance, 12 professional swimmers and triathletes participated in an experiment where they were randomly assigned either a special wetsuit or a placebo swimsuit that was similar in appearance. They then swam a 1500m time trial and their swim velocity was recorded. Then, one week later, each participant swam another 1500m time trial, but this time they wore the other suit.

The outcome of interest in this study is recorded as the variable “Difference”, which indicates how much higher each participant’s swim velocity was when wearing the wetsuit (a positive value is an increase in velocity, indicating better performance).

ws = read.csv("https://remiller1450.github.io/data/Wetsuits2.csv")

2-A: Using either a graph or descriptive statistics, assess whether the variable “Difference” appears to follow a Normal distribution. Additionally, briefly explain why it’s important to consider the distribution of this variable before performing a hypothesis test.

2-B: Use an appropriate hypothesis test to evaluate whether this study provides statistically compelling evidence that scientifically designed wetsuits have a beneficial impact on an individual’s swim velocity. Your answer should clearly state the statistical hypotheses you considered, and it should include any R code you used to perform the test, the resulting \(p\)-value, and a brief conclusion.

2-C: Using the data from this study, construct a 95% confidence interval estimate for the average improvement in 1500m swim velocity that a professional swimmer can expect from a scientifically designed wetsuit. Show all steps (R code) used to create your interval estimate.
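The mechanics of a one-sample \(t\)-test and a \(t\)-based confidence interval for a mean of paired differences can be sketched as follows. The vector `d` holds made-up differences for illustration only; it is not the Wetsuits2 data, and the one-sided alternative is an assumption chosen for the example.

```r
# Hypothetical paired differences for illustration (not the Wetsuits2 data)
d <- c(0.08, 0.10, 0.07, 0.12, 0.05, 0.09, 0.06, 0.11)

# One-sample t-test of H0: mu = 0 against the one-sided alternative mu > 0
tt <- t.test(d, mu = 0, alternative = "greater")
tt$p.value

# Two-sided 95% confidence interval for the mean difference
ci <- t.test(d, conf.level = 0.95)$conf.int

# The same interval built by hand: mean +/- t* times the standard error
n       <- length(d)
tstar   <- qt(0.975, df = n - 1)
by_hand <- mean(d) + c(-1, 1) * tstar * sd(d) / sqrt(n)
```

Both `ci` and `by_hand` should agree, which is a useful check that the interval from `t.test()` matches the formula from the notes.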

2-D: Suppose that scientifically designed swimsuits do not actually improve 1500m swim velocity. If 20 different research organizations decided to independently collect data, and they each used the data they collected to form a 90% confidence interval estimate for the mean improvement in 1500m swim velocity that professional swimmers experience when wearing a wetsuit, how many of these research organizations would you expect to conclude that an improvement of zero is statistically plausible?

\(~\)

Question #3

Suppose we use a sample of \(n = 15\) randomly chosen adults to calculate a 95% confidence interval for the mean cholesterol level (mg/dl) of all US adults: \(214 \pm 2.145*\tfrac{21}{\sqrt{15}} = (202.4, 225.6)\). Rate each of the following statements as either true or false and explain why:

  • 3-A: A random sample of \(n = 150\) adults would likely produce a wider 95% confidence interval than (202.4, 225.6) due to there being more uncertainty in the sample.
  • 3-B: The confidence interval suggests we can be confident that 95% of the US adult population has a cholesterol level between 202.4 mg/dl and 225.6 mg/dl.
  • 3-C: A 99% confidence interval constructed using the same sample of \(n = 15\) adults will be wider than the interval (202.4, 225.6).
  • 3-D: The confidence interval suggests that 95% of different random samples of size \(n = 15\) from the same population will have a sample mean between 202.4 mg/dl and 225.6 mg/dl.
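
The arithmetic behind the interval quoted in Question #3 can be reproduced with a short sketch. It reads 214 as the sample mean and 21 as the sample standard deviation, which is an assumption, since the question does not label those values explicitly.

```r
# Values quoted in Question #3 (labels are assumed, see the note above)
xbar <- 214   # sample mean (mg/dl)
s    <- 21    # sample standard deviation (mg/dl)
n    <- 15    # sample size

# t critical value for 95% confidence with n - 1 = 14 degrees of freedom
tstar <- qt(0.975, df = n - 1)

# Margin of error and interval endpoints
moe <- tstar * s / sqrt(n)
ci  <- xbar + c(-1, 1) * moe

round(tstar, 3)
round(ci, 1)
```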