Lab #12 - Bootstrapping

\(~\)

Onboarding

In our previous lab, code was provided that used the sample() function to draw samples from a population. We will repeatedly apply this function with the argument replace = TRUE to perform bootstrapping.

Let’s consider bootstrapping the correlation between AvgMovingSpeed and TotalTime, as we did in the “commute tracker” example from our lecture slides:

## Load the real sample data
commute_tracker = read.csv("https://remiller1450.github.io/data/CommuteTracker.csv")

## Setup
n_boot = 1000                # number of bootstrap samples to draw
n = nrow(commute_tracker)    # size of the real sample
boot_stats = numeric(n_boot) # empty object to store bootstrapped statistics

## Draw each sample and store the bootstrapped statistic
for(i in 1:n_boot){
  
  ## Use the sample function to identify which cases are selected in the bootstrap sample
  case_idx = sample(x = 1:n, size = n, replace = TRUE)
  
  ## Put the selected cases into a data frame
  cur_boot_sample = commute_tracker[case_idx, ]
  
  ## Calculate the descriptive statistic of interest (correlation) for the current bootstrap sample
  boot_stats[i] = cor(x=cur_boot_sample$AvgMovingSpeed, y=cur_boot_sample$TotalTime)
}

At this point we have a vector of 1000 different correlation coefficients that were bootstrapped from the original sample. The variation among these correlation coefficients should approximate the sampling variability involved in the collection of the original sample from the population. We could display this variation using a histogram:

library(ggplot2)
ggplot() + geom_histogram(aes(x = boot_stats), bins = 20)

As we saw in StatKey, the bootstrap distribution shows right-skew, so a symmetric probability model is unlikely to accurately represent the sampling variability in this scenario.

To find a \(P\%\) percentile bootstrap confidence interval we need to find the relevant percentiles that define the middle \(P\%\) of the bootstrap distribution, which we can do using the quantile() function:

quantile(boot_stats, probs = c(0.005, 0.995))

##       0.5%      99.5% 
## -0.9266987 -0.7639543

The above code gives us a 99% confidence interval estimate because it excludes the most extreme 1% of bootstrapped results (0.5% on each end of the bootstrap distribution).

If we had instead wanted a 95% confidence interval, we would simply need to modify the probs argument to reflect the middle 95% of the bootstrapped statistics:

quantile(boot_stats, probs = c(0.025, 0.975))

##       2.5%      97.5% 
## -0.9126938 -0.7839372
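Both calls above follow the same pattern: a confidence level of \(P\%\) leaves \((100 - P)\%\) of the bootstrapped statistics outside the interval, split evenly between the two tails. As a minimal sketch, the hypothetical helper percentile_probs() below (not part of the lab's provided code) does this arithmetic for any level. The vector fake_boot_stats is simulated purely so the example runs on its own; it is a stand-in for boot_stats, not real output.

```r
## Hypothetical helper (not from the lab's code): convert a confidence level
## into the two tail probabilities expected by quantile()
percentile_probs = function(level){
  alpha = 1 - level            # proportion excluded from the interval
  c(alpha/2, 1 - alpha/2)      # half of it is excluded in each tail
}

percentile_probs(0.99)   # 0.005 0.995
percentile_probs(0.95)   # 0.025 0.975

## Stand-in for boot_stats (simulated values, used only for illustration)
set.seed(1)
fake_boot_stats = rnorm(1000)

## A 90% percentile interval uses the 5th and 95th percentiles
quantile(fake_boot_stats, probs = percentile_probs(0.90))
```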

\(~\)

Lab

Your task in this lab is to adapt the bootstrapping code from the lab’s introduction to a variety of different descriptive statistics. You will also compare your R results with bootstrapping results in StatKey. For some of these statistics, using a Normal or \(t\)-distribution as a probability model would produce invalid confidence intervals.
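When adapting the introduction's code, only the line that computes the statistic needs to change; the resampling steps stay the same. One way to see this is to wrap the loop in a function that takes the statistic as an argument. The sketch below is an illustration under stated assumptions, not the lab's required approach: the name boot_percentile_ci(), its arguments, and the simulated vector x are all hypothetical, and it handles a single numeric variable rather than a data frame.

```r
## Hypothetical wrapper (illustrative only): percentile bootstrap CI for any
## statistic that can be computed from a single numeric vector
boot_percentile_ci = function(x, stat_fn, level = 0.95, n_boot = 1000){
  n = length(x)                  # size of the real sample
  boot_stats = numeric(n_boot)   # empty object to store bootstrapped statistics
  for(i in 1:n_boot){
    ## Identify which cases are selected, sampling with replacement
    case_idx = sample(x = 1:n, size = n, replace = TRUE)
    ## Calculate the statistic of interest for the current bootstrap sample
    boot_stats[i] = stat_fn(x[case_idx])
  }
  ## Middle (100 * level)% of the bootstrap distribution
  alpha = 1 - level
  quantile(boot_stats, probs = c(alpha/2, 1 - alpha/2))
}

## Simulated right-skewed data (not one of the lab's data sets)
set.seed(12)
x = rexp(50, rate = 1/3)

## Interquartile range, chosen only as an example statistic
boot_percentile_ci(x, stat_fn = IQR, level = 0.95)
```

Swapping stat_fn for another function of a numeric vector changes the statistic without touching the resampling loop. A statistic that involves two variables, such as the correlation in the introduction, still requires resampling whole rows of the data frame as in the original code.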

Question #1: For this question you will bootstrap the mean of a moderately sized sample with strong right-skew. You will use the mj25 data set introduced in the previous lab. The aim is to estimate the average number of three-point attempts per game that Michael Jordan took during his career. Shown below are the data and a histogram of the variable of interest. A link to the data is also provided so that they can be entered into StatKey.

mj25 = read.csv("https://remiller1450.github.io/data/mj25.csv")
ggplot(mj25, aes(x = threeatt)) + geom_histogram()

https://remiller1450.github.io/data/mj25.csv

Part A: Using the “CI for Single Mean, Median, St. Dev.” menu of StatKey, upload the mj25 data, select the variable threeatt, and generate 1000 bootstrap samples. What does each dot displayed in the main graphic pane represent? How many dots are there?
Part B: Using the 1000 bootstrap samples from Part A, find a 95% percentile bootstrap confidence interval for the mean number of three-point attempts. Report the endpoints of this interval.
Part C: Adapt the bootstrapping code provided in this lab’s introduction to find an equivalent 95% percentile bootstrap confidence interval using R. Find the endpoints of this interval using the quantile() function as was demonstrated in the lab’s introduction.
Part D: Suppose these data did not exhibit any right-skew. Use an R function to find a 95% confidence interval using the probability model you would have relied upon had this been true.
Part E: Compare the interval endpoints found in Parts B and C. How different are these endpoints? What do you think explains the difference?
Part F: Now compare the interval endpoints found in Parts C and D. How different are these endpoints? What do you think explains the difference?

\(~\)

Question #2: Many scholars define a mass shooting as an incident where four or more people are shot and either injured or killed in a short period of time. The Gun Violence Archive maintains a database of mass shootings that have happened in the United States. The data below are the incidents from 2025 that are recorded in this database, which we will view as a representative sample of mass shootings more broadly.

https://remiller1450.github.io/data/mass_shootings_2025.csv

Part A: Load the data into R and create a histogram displaying the distribution of the number of victims who were killed in each incident. Describe the shape of this distribution.
Part B: Load the data into StatKey and create a bootstrap distribution of the mean number of victims who were killed. Does the bootstrapped distribution seem skewed or approximately symmetric? Is this surprising? Briefly explain.
Part C: Use StatKey to find a 99% percentile bootstrap confidence interval using 5000 bootstrapped samples. Report the interval’s endpoints.
Part D: Use an R function to find a 99% confidence interval that relies upon a probability model.
Part E: Compare the endpoints of the intervals you found in Parts C and D. Why do you think these intervals are so similar?

\(~\)

Question #3: In addition to scenarios where the assumptions of traditional methods are questionable, bootstrapping is also useful for unusual descriptive statistics where there is no commonly known probability model. An example of this is the mean-to-median ratio, which we saw in our last lecture. For this question you will use bootstrapping to create a 95% confidence interval for the mean-to-median ratio of a sample that likely came from a skewed population. You will continue using the “2025 mass shootings” dataset.

Part A: Modify the code provided in this lab’s introduction to find a 95% percentile bootstrap confidence interval for the mean-to-median ratio of the number of victims injured in mass shootings. Use the quantile() function to report the endpoints of the interval.
Part B: Suppose that in the population of all mass shootings the distribution of victims injured is perfectly symmetric. In that case, what would you expect the mean-to-median ratio to be when calculated using all cases in the population?
Part C: Does the 95% percentile bootstrap confidence interval you found in Part A provide sufficient statistical evidence that the population is not symmetric? Briefly explain your reasoning.