Lab #10 - Bootstrapping and Confidence Intervals

Directions (read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Ask for help, clarification, or even just a check-in if anything seems unclear.


Introduction

Creating a confidence interval estimate of a population parameter using bootstrapping and “the standard error” approach involves the following steps:

Calculating a point estimate of the unknown population parameter using our sample.
Creating a large number of bootstrap samples by re-sampling from the original sample with replacement, and calculating a bootstrap statistic for each bootstrap sample.
Finding the standard deviation of the bootstrap statistics, which serves as the standard error of the estimate from the original sample.
Applying the generic formula \(\text{Confidence Interval} = \text{Point Estimate} \pm c*SE\), where choosing \(c=2\) yields an approximate 95% CI and choosing \(c=3\) yields a conservative 99% CI (its actual coverage is closer to 99.7%), assuming the sampling distribution is bell-shaped.
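
The four steps above can be sketched in R as follows. This is only a minimal illustration, not the lab’s required workflow (the lab uses StatKey for Step 2). The sample `x`, the seed, and the number of bootstrap samples `B` are hypothetical choices made for the example.

```r
# Hypothetical sample of 40 quantitative measurements (not from any lab data set)
set.seed(10)
x <- rnorm(40, mean = 50, sd = 8)

# Step 1: point estimate of the population mean
point_est <- mean(x)

# Step 2: resample with replacement and compute a bootstrap statistic each time
B <- 2000
boot_stats <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Step 3: the standard deviation of the bootstrap statistics estimates the SE
se <- sd(boot_stats)

# Step 4: point estimate +/- c * SE, using c = 2 for roughly 95% confidence
c_mult <- 2
ci <- c(lower = point_est - c_mult * se, upper = point_est + c_mult * se)
print(ci)
```

Changing `c_mult` changes the confidence level, while changing `B` only affects how precisely the SE itself is estimated.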

These steps can be modified to use the “percentile bootstrap” approach by asking StatKey for the middle \(P\%\) of the bootstrap statistics produced in Step 2. The end points of this range are the \(P\%\) confidence interval.
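
In R, the percentile approach amounts to taking quantiles of the bootstrap statistics. The sketch below assumes a hypothetical right-skewed sample `x` and uses `quantile()` to find the middle 90%. The exact endpoints depend on the random resamples and on how quantiles are interpolated, so they need not match StatKey to every decimal place.

```r
# Hypothetical right-skewed sample (not from any lab data set)
set.seed(20)
x <- rexp(30, rate = 1 / 12)

# Bootstrap statistics: the mean of each resample drawn with replacement
B <- 2000
boot_stats <- replicate(B, mean(sample(x, replace = TRUE)))

# Middle P% of the bootstrap statistics: cut off (100 - P)/2 percent in each tail
P <- 90
tail_prob <- (1 - P / 100) / 2
ci_percentile <- quantile(boot_stats, probs = c(tail_prob, 1 - tail_prob))
print(ci_percentile)
```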

Below we will walk through an example of each of these steps. Then you’ll be asked to replicate them for a couple of different applications.

Example

Previous labs have introduced the “police-involved deaths” data set that was compiled by the Washington Post between 2015 and mid-2020.

police = read.csv("https://remiller1450.github.io/data/Police.csv")

For this example we’ll consider the target population to be “all police-involved deaths within the United States”. Note that the Washington Post’s data can be considered a sample (of size \(n=7548\)) from this population, since it does not exhaustively document all police-involved deaths that will ever occur.

The population parameter we’ll try to estimate is the proportion of individuals killed by the police who were unarmed, which we’ll denote as \(p\). We will provide a 95% confidence interval estimate of \(p\) using the “SE approach”.

Step 1: Find the point estimate

This question involves a single categorical variable, armed, so we can find the point estimate by making a one-way table:

my_table = table(police$armed)
prop.table(my_table)

## 
##      armed    unarmed 
## 0.93945416 0.06054584

Thus, our point estimate is \(\hat{p} \approx 0.06\). If our data are representative of the target population, our best single guess is that about 6% of those killed by the police were unarmed.

Step 2: Generate bootstrap samples

Whenever possible we’ll use StatKey to generate bootstrap samples.

For this example we’re estimating a “single proportion”. The data we need to input are the numerator and denominator of this proportion, both of which are present in the one-way table we stored as my_table:

my_table

## 
##   armed unarmed 
##    7091     457

Since we’re interested in the proportion of “unarmed”, we’ll input 457 as the count and \(457+7091=7548\) as the sample size.

Note that in scenarios involving quantitative data we’ll need to input the entire data set. You can download any of our class data sets by pasting the URL of the CSV file into a web browser.

Once we’ve loaded the necessary info into StatKey, we can generate bootstrap samples:

In this example, I’ve generated 100 bootstrap samples and you’re seeing 100 bootstrap statistics (one from each bootstrap sample). Here each bootstrap statistic is the proportion of unarmed individuals in a bootstrap sample drawn from the original sample with replacement.
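
For readers curious about what this resampling looks like in code, here is a rough R equivalent. It rebuilds the sample from the counts in my_table (457 unarmed, 7091 armed) and draws 100 bootstrap samples. The object names and the seed are illustrative, and the resulting statistics will not match StatKey’s random draws.

```r
# Rebuild the original sample from the one-way table: 1 = unarmed, 0 = armed
unarmed <- rep(c(1, 0), times = c(457, 7091))
n <- length(unarmed)   # 457 + 7091 = 7548

set.seed(2020)  # arbitrary seed so the sketch is reproducible
boot_props <- replicate(100, {
  boot_sample <- sample(unarmed, size = n, replace = TRUE)  # one bootstrap sample
  mean(boot_sample)    # bootstrap statistic: proportion unarmed in that sample
})

summary(boot_props)
```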

Step 3: Find the standard error

The standard deviation of the bootstrap statistics is an estimate of the standard error of \(\hat{p}\). So we estimate \(SE = 0.0027\).
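
As a rough cross-check on this value, the sketch below computes a bootstrap SE in R and compares it with the textbook formula \(\sqrt{\hat{p}(1-\hat{p})/n}\). That formula is not part of this lab’s procedure and is included only as a sanity check. The code relies on the fact that the number of “unarmed” cases in a bootstrap sample follows a binomial distribution, so `rbinom()` can stand in for explicit resampling. The seed and the choice of 2000 bootstrap statistics are arbitrary.

```r
n <- 7548
p_hat <- 457 / n   # point estimate from the one-way table

set.seed(69)
# Each draw is the count of "unarmed" in one bootstrap sample, converted to a proportion
boot_props <- rbinom(2000, size = n, prob = p_hat) / n

se_boot <- sd(boot_props)                    # bootstrap estimate of the SE
se_formula <- sqrt(p_hat * (1 - p_hat) / n)  # formula-based SE for comparison
round(c(bootstrap = se_boot, formula = se_formula), 4)
```

Both values should land close to the 0.0027 reported above, with small differences in the bootstrap value due to random resampling.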

Step 4: Use the \(SE\) and the proper value of \(c\) to calculate the confidence interval.

We have \(SE = 0.0027\) and we know \(c = 2\) can be used to produce a 95% confidence interval when the sampling distribution is approximately bell-shaped (which applies to this scenario). Filling in these numbers, we have:

\[\text{Point Estimate} \pm c*SE = 0.06 \pm 2*0.0027 = (0.0546, 0.0654)\]

Step 5: Understand the confidence interval

The generic interpretation of our interval, (0.0546, 0.0654), is “we can be 95% confident that the proportion of individuals killed by the police who are unarmed is between 0.0546 and 0.0654 in the population represented by our sample”.
- The notion of “confidence” means that if we were to collect another sample of size \(n=7548\) in the same manner as the Washington Post and calculate an interval estimate of \(p\) using our approach, there would be roughly a 95% chance of that interval containing the true, unknown value of \(p\).
- In practical terms, you can view this interval as “a range of plausible values of \(p\)”, and you should notice that this range is very narrow. This is because our sample size of \(n=7548\) is large.
It’s essential to know that a confidence interval only communicates the possible effects of sampling variability; it does not necessarily imply that we have highly accurate knowledge of the true value of \(p\) in our population.
- It’s possible for sampling bias to make a confidence interval inaccurate, even when the confidence level is high.
- In our application, our sample might be biased if data from the years 2015-2020 is fundamentally different from the target population of all police-involved deaths.
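
The “95% chance” idea described above can be illustrated with a small simulation. The sketch below assumes a hypothetical population in which the true proportion is known (`p_true = 0.06`), repeatedly draws samples of size 7548, builds a \(\pm 2*SE\) interval from each, and records how often the interval captures `p_true`. All names and settings are illustrative. Binomial draws are used as a shortcut for both sampling and resampling. The simulation only reflects sampling variability and says nothing about sampling bias.

```r
set.seed(95)
p_true <- 0.06     # hypothetical "true" proportion, known only because we simulate
n <- 7548          # sample size used in the example
n_samples <- 500   # number of repeated samples drawn from the population
B <- 500           # bootstrap statistics computed for each sample

covered <- replicate(n_samples, {
  k <- rbinom(1, size = n, prob = p_true)              # one new sample: count of "unarmed"
  p_hat <- k / n                                       # its point estimate
  boot_props <- rbinom(B, size = n, prob = p_hat) / n  # bootstrap proportions
  se <- sd(boot_props)                                 # bootstrap SE
  (p_hat - 2 * se <= p_true) && (p_true <= p_hat + 2 * se)
})

mean(covered)  # share of intervals that contain p_true
```

The reported share should sit near 0.95, though it will vary somewhat with the seed and the number of repeated samples.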

Note: If we wanted to use the “percentile bootstrap” method we could simply click the “two-tail” button on StatKey to show the middle \(P\%\) of the bootstrap statistics. Shown below is a 90% percentile bootstrap confidence interval:


The Infant Heart Study

Previous labs have introduced the “infant heart” data set, which records two outcome measures, the Psychomotor Development Index (PDI) and the Mental Development Index (MDI), measured at 2 years of age for infants born with a congenital heart defect. Each infant was randomly assigned to one of two surgeries: low-flow bypass or circulatory arrest.

inf_heart = read.csv("https://remiller1450.github.io/data/InfantHeart.csv")

Question 1:

The goal in this question is to estimate the expected difference in mean PDI scores for low-flow bypass minus circulatory arrest in the population of infants born with congenital heart defects. We’ll denote this difference in means:

\[\mu_{LF} - \mu_{CA}\]

Part A: Briefly describe the population and the sample in this application. That is, who are the cases in the population and how large is the sample?
Part B: To be included in this study, the parents of a child born with a congenital heart defect in one of several participating hospital systems needed to voluntarily consent to the study protocol in order for their infant to be enrolled and randomly assigned to a surgical group. Considering this information, did this study use a random sample from the target population? If not, is it still possible for this sample to be representative?
Part C: Use the group_by() and summarize() functions to find the mean PDI score in each surgical group, then use this information to report a point estimate of the population parameter of interest, \(\mu_{LF} - \mu_{CA}\).
Part D: Interpret your point estimate in Part C. That is, does it seem that one surgery produces better PDI outcomes than the other?
Part E: Use StatKey’s Bootstrap CI for a difference in means page to estimate the standard error of your point estimate from Part C using at least 2000 bootstrapped statistics.
Part F: Calculate a 99% confidence interval estimate using the point estimate from Part C and the standard error from Part E. Be sure to use a value of \(c\) that corresponds with 99% confidence.
Part G: Interpret your confidence interval (from Part F), and explain whether it provides convincing evidence that the higher PDI scores in the low-flow bypass group cannot be explained by sampling variability alone.
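
Parts C and E follow a common pattern: a group-wise summary, then a bootstrap of the difference in means. The sketch below shows that pattern on made-up data, not the infant heart data. The data frame `toy` and its columns `group` and `score` are hypothetical. Resampling each group separately is one common choice and is assumed here. The `dplyr` package is needed for `group_by()` and `summarize()`.

```r
library(dplyr)

# Hypothetical two-group data; the column names `group` and `score` are made up
set.seed(106)
toy <- data.frame(
  group = rep(c("A", "B"), times = c(30, 35)),
  score = c(rnorm(30, mean = 100, sd = 15), rnorm(35, mean = 94, sd = 15))
)

# Group means using group_by() and summarize()
group_means <- toy %>%
  group_by(group) %>%
  summarize(mean_score = mean(score))
print(group_means)

# Point estimate of the difference in means (A minus B)
diff_hat <- group_means$mean_score[group_means$group == "A"] -
  group_means$mean_score[group_means$group == "B"]

# Bootstrap: resample within each group and recompute the difference each time
a <- toy$score[toy$group == "A"]
b <- toy$score[toy$group == "B"]
boot_diffs <- replicate(2000, mean(sample(a, replace = TRUE)) - mean(sample(b, replace = TRUE)))

se_diff <- sd(boot_diffs)  # bootstrap SE of the difference in means
c(estimate = diff_hat, se = se_diff)
```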


Death Penalty Sentencing (revisited)

A previous lab introduced the “death penalty sentencing” data set, which includes all murders committed during a felony in the state of Florida between 1972 and 1977. The researchers who collected these data were interested in investigating potential racial bias in death penalty sentencing.

The data are available at the URL: https://remiller1450.github.io/data/DeathPenaltySentencing.csv

Question 2:

Part A: Find and report a 99% CI for the difference in the proportion of death penalty sentences for white/black offenders without considering the race of the victim. Follow all of the steps outlined in this lab’s example. You should use StatKey and bootstrapping to help you construct the interval.
Part B: Briefly explain whether this interval supports the claim “the small difference in death penalty sentencing rates between white and black offenders can be explained by sampling variability and the relatively small sample size”.

Question 3:

Part A: Now find and report a 99% CI for the difference in the proportion of death penalty sentences for white/black offenders conditioning on the victim’s race being white.
Part B: After conditioning on the race of the victim being white, can the difference in death penalty sentencing rates be explained by the relatively small sample size of this study? Briefly explain.
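
“Conditioning” on a variable simply means restricting attention to the cases that share a given value of it. The sketch below illustrates this on invented data, then bootstraps a difference in proportions within the restricted subset. The data frame `toy`, its column names, and its values are hypothetical and do not reflect the layout of the death penalty data. The multiplier \(c=3\) follows this lab’s rule for a 99% interval.

```r
# Hypothetical data; column names and values are invented for illustration
set.seed(124)
toy <- data.frame(
  offender = sample(c("black", "white"), 300, replace = TRUE),
  victim   = sample(c("black", "white"), 300, replace = TRUE),
  death    = sample(c("yes", "no"), 300, replace = TRUE, prob = c(0.15, 0.85))
)

# Conditioning: keep only the cases where the victim is white
white_victim <- subset(toy, victim == "white")

# Row proportions: share of each sentencing outcome within each offender group
prop.table(table(white_victim$offender, white_victim$death), margin = 1)

# Logical outcome vectors (TRUE = death sentence) for each offender group
w <- white_victim$death[white_victim$offender == "white"] == "yes"
b <- white_victim$death[white_victim$offender == "black"] == "yes"

# Point estimate: difference in proportions (white minus black offenders)
point_est <- mean(w) - mean(b)

# Bootstrap the difference by resampling within each offender group
boot_diffs <- replicate(2000, mean(sample(w, replace = TRUE)) - mean(sample(b, replace = TRUE)))
se <- sd(boot_diffs)

# Interval using the SE approach with c = 3
ci <- point_est + c(-1, 1) * 3 * se
print(ci)
```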