Directions (read before starting)
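Creating a confidence interval estimate of a population parameter using bootstrapping and “the standard error” approach involves the following steps: (1) find the point estimate; (2) generate bootstrap samples; (3) find the standard error; (4) use the \(SE\) and the proper value of \(c\) to calculate the confidence interval; (5) understand the confidence interval.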
These steps can be modified to use the “percentile bootstrap” approach by asking StatKey for the middle \(P\%\) of the bootstrap statistics produced in Step 2. The end points of this range are the \(P\%\) confidence interval.
Below we will walk through an example of each of these steps. Then you’ll be asked to replicate them for a couple of different applications.
Previous labs have introduced the “police-involved deaths” data set that was compiled by the Washington Post between 2015 and mid-2020.
police = read.csv("https://remiller1450.github.io/data/Police.csv")
For this example we’ll consider the target population to be “all police-involved deaths within the United States”. Note that the Washington Post’s data can be considered a sample (of size \(n=7548\)) from this population, since it does not exhaustively document all police-involved deaths that will ever occur.
The population parameter we’ll try to estimate is the proportion of individuals killed by the police who were unarmed, which we’ll denote as \(p\). We will provide a 95% confidence interval estimate of \(p\) using the “SE approach”.
Step 1: Find the point estimate
This question involves a single categorical variable, armed, so we can find the point estimate by making a one-way table:
my_table = table(police$armed)
prop.table(my_table)
##
## armed unarmed
## 0.93945416 0.06054584
Thus, our point estimate is \(\hat{p} = 0.06\). If our data are representative of the target population, our single best guess is that about 6% of those killed by the police were unarmed.
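As a cross-check, the same point estimate can be computed directly as a proportion. The sketch below rebuilds the armed variable from the counts behind the one-way table (7091 armed and 457 unarmed, shown in Step 2) so that it runs without downloading the file. The object names armed_status and p_hat are illustrative and are not part of the lab.

```r
# Rebuild the variable from the table counts (7091 armed, 457 unarmed)
armed_status <- rep(c("armed", "unarmed"), times = c(7091, 457))

# The mean of a TRUE/FALSE vector is the proportion of TRUE values,
# so this is the sample proportion of "unarmed"
p_hat <- mean(armed_status == "unarmed")
round(p_hat, 4)
```

With the real data loaded, the same idea would be mean(police$armed == "unarmed"), assuming the armed column contains no missing values.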
Step 2: Generate bootstrap samples
Whenever possible we’ll use StatKey to generate bootstrap samples.
For this example we’re estimating a “single proportion”. The data we need to input are the numerator and denominator of this proportion, both of which are present in the one-way table we stored as my_table:
my_table
##
## armed unarmed
## 7091 457
Since we’re interested in the proportion of “unarmed”, we’ll input 457 as the count and \(457+7091=7548\) as the sample size.
Note that in scenarios involving quantitative data we’ll need to input the entire data set. You can download any of our class data sets by pasting the URL of the CSV file into a web browser.
Once we’ve loaded the necessary info into StatKey, we can generate bootstrap samples:
In this example, I’ve generated 100 bootstrap samples and you’re seeing 100 bootstrap statistics (one from each bootstrap sample). Here each bootstrap statistic is the proportion of unarmed individuals in a bootstrap sample drawn from the original sample with replacement.
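StatKey does the resampling for us, but it may help to see what it is doing. The following is a minimal sketch in R, not the lab's required workflow: it rebuilds the sample from the table counts as 0/1 values and draws 100 bootstrap samples with replacement. The names original and boot_stats and the seed value are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Original sample as 0/1 values: 457 unarmed (1) and 7091 armed (0)
original <- rep(c(1, 0), times = c(457, 7091))
n <- length(original)

# Each bootstrap sample draws n cases from the original sample with replacement;
# the bootstrap statistic is the proportion of unarmed cases in that resample
boot_stats <- replicate(100, mean(sample(original, size = n, replace = TRUE)))

length(boot_stats)   # one bootstrap statistic per bootstrap sample
summary(boot_stats)  # the statistics cluster around the original proportion
```

Because the resampling is random, the exact values will differ from the StatKey output and from run to run unless the seed is fixed.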
Step 3: Find the standard error
The standard deviation of the bootstrap statistics is an estimate of the standard error of \(\hat{p}\). So we estimate \(SE = 0.0027\).
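As a rough sanity check on the bootstrap value, the standard large-sample formula for the standard error of a sample proportion gives nearly the same number. This formula is not part of the StatKey workflow and is shown only for comparison, plugging in \(\hat{p} \approx 0.0605\) and \(n = 7548\):

\[SE \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.0605 \times 0.9395}{7548}} \approx 0.0027\]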
Step 4: Use the \(SE\) and the proper value of \(c\) to calculate the confidence interval.
We have \(SE = 0.0027\) and we know \(c = 2\) can be used to produce a 95% confidence interval when the sampling distribution is approximately bell-shaped (which applies to this scenario). Filling in these numbers, we have:
\[\text{Point Estimate} \pm c*SE = 0.06 \pm 2*0.0027 = (0.0546, 0.0654)\]
Step 5: Understand the confidence interval
Note: If we wanted to use the “percentile bootstrap” method we could simply click the “two-tail” button on StatKey to show the middle \(P\%\) of the bootstrap statistics. Shown below is a 90% percentile bootstrap confidence interval:
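For readers who want to compare the two approaches outside StatKey, here is a hedged sketch in R. It assumes 1000 bootstrap statistics (an arbitrary choice, not the number used in the example) and uses the fact that, for a single proportion, the count of “unarmed” cases in a resample drawn with replacement follows a binomial distribution, so rbinom() can stand in for explicit resampling. All object names are illustrative.

```r
set.seed(2)  # arbitrary seed for reproducibility

n <- 7548         # sample size
p_hat <- 457 / n  # observed proportion unarmed

# Count of "unarmed" in each resample is Binomial(n, p_hat);
# dividing by n gives the bootstrap proportions
boot_stats <- rbinom(1000, size = n, prob = p_hat) / n

# SE approach: point estimate plus or minus c = 2 standard errors
se <- sd(boot_stats)
se_interval <- p_hat + c(-1, 1) * 2 * se

# Percentile approach: the middle 90% of the bootstrap statistics
pct_interval <- quantile(boot_stats, probs = c(0.05, 0.95))

se_interval
pct_interval
```

The 90% percentile interval will typically be somewhat narrower than the 95% SE interval, since it covers less of the bootstrap distribution.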
Previous labs have introduced the “infant heart” data set, which records two outcome measures, the Psychomotor Development Index (PDI) and the Mental Development Index (MDI), measured at 2 years in infants born with a congenital heart defect who were randomly assigned to one of two surgeries: low-flow bypass or circulatory arrest.
inf_heart = read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
Question 1:
The goal in this question is to estimate the expected difference in mean PDI scores for low-flow bypass minus circulatory arrest in the population of infants born with congenital heart defects. We’ll denote this difference in means:
\[\mu_{LF} - \mu_{CA}\]
Use the group_by() and summarize() functions to find the mean PDI score in each surgical group, then use this information to report a point estimate of the population parameter of interest, \(\mu_{LF} - \mu_{CA}\).
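As a reminder of the syntax, here is a small sketch of group_by() and summarize() from the dplyr package applied to a toy data frame. The values are made up purely for illustration, and the column names surgery and pdi are hypothetical; check the actual column names in inf_heart before adapting this.

```r
library(dplyr)

# Toy data with made-up values; NOT the infant heart data
toy <- data.frame(
  surgery = c("low-flow", "low-flow", "arrest", "arrest"),
  pdi     = c(100, 90, 95, 85)
)

# One row per group, holding that group's mean score
group_means <- toy %>%
  group_by(surgery) %>%
  summarize(mean_pdi = mean(pdi))

# Difference in group means: low-flow minus arrest
diff_means <- group_means$mean_pdi[group_means$surgery == "low-flow"] -
  group_means$mean_pdi[group_means$surgery == "arrest"]
diff_means
```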
A previous lab introduced the “death penalty sentencing” data set, which includes all murders committed during a felony in the state of Florida between 1972 and 1977. The researchers who collected these data were interested in investigating potential racial bias in death penalty sentencing.
The data are available at the URL:
https://remiller1450.github.io/data/DeathPenaltySentencing.csv
Question 2:
Question 3: