In our previous lab, the provided code used the
sample() function to draw samples from a population. In this lab we
will apply the same function repeatedly, with the argument
replace = TRUE, to perform bootstrapping.
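To see what replace = TRUE changes, here is a minimal sketch using a toy vector of five case numbers. The vector x and the seed are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the draws are repeatable

x = 1:5  # a toy "sample" of five case numbers

## Without replacement: every case appears exactly once (a reshuffle)
no_replace = sample(x, size = length(x), replace = FALSE)

## With replacement: some cases can repeat and others can be left out,
## which is how each bootstrap sample is drawn
with_replace = sample(x, size = length(x), replace = TRUE)

print(no_replace)
print(with_replace)
```

Both results have the same size as the original vector, but only the draw made with replacement can contain repeated cases.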
Let’s consider bootstrapping the correlation between
AvgMovingSpeed and TotalTime like we did in
the “commute tracker” example from our lecture slides:
## Load the real sample data
commute_tracker = read.csv("https://remiller1450.github.io/data/CommuteTracker.csv")
## Setup
n_boot = 1000 # number of bootstrap samples to draw
n = nrow(commute_tracker) # size of the real sample
boot_stats = numeric(n_boot) # empty object to store bootstrapped statistics
## Draw each sample and store the bootstrapped statistic
for(i in 1:n_boot){
  ## Use the sample function to identify which cases are selected in the bootstrap sample
  case_idx = sample(x = 1:n, size = n, replace = TRUE)
  ## Put the selected cases into a data frame
  cur_boot_sample = commute_tracker[case_idx, ]
  ## Calculate the descriptive statistic of interest (correlation) for the current bootstrap sample
  boot_stats[i] = cor(x = cur_boot_sample$AvgMovingSpeed, y = cur_boot_sample$TotalTime)
}
At this point we have a vector of 1000 correlation coefficients, each calculated from a different bootstrap sample drawn from the original sample. The variation among these correlation coefficients should approximate the sampling variability involved in the collection of the original sample from the population. We could display this variation using a histogram:
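The variation among bootstrapped statistics can also be summarized numerically. The sketch below is an illustration only: it uses simulated stand-in data (the data frame fake_commutes and its columns speed and time are assumptions, not the commute tracker data), and it swaps the for loop for replicate(), which is an alternative way to repeat the same steps.

```r
set.seed(42)  # arbitrary seed so the results are repeatable

## Simulated stand-in data with a strong negative correlation (an assumption)
n = 50
speed = rnorm(n, mean = 70, sd = 5)
time = 45 - 0.4 * speed + rnorm(n, sd = 1)
fake_commutes = data.frame(speed, time)

## replicate() evaluates the expression n_boot times and collects the results
n_boot = 1000
boot_stats = replicate(n_boot, {
  case_idx = sample(x = 1:n, size = n, replace = TRUE)
  cor(x = fake_commutes$speed[case_idx], y = fake_commutes$time[case_idx])
})

## Numeric summaries of the variation among the bootstrapped correlations
sd(boot_stats)
summary(boot_stats)
```

The standard deviation of boot_stats is one way to put a single number on the spread that the histogram displays.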
library(ggplot2)
ggplot() + geom_histogram(aes(x = boot_stats), bins = 20)
As we saw in StatKey, the bootstrap distribution shows right-skew, so a symmetric probability model is unlikely to accurately represent the sampling variability in this scenario.
To find a \(P\%\) percentile
bootstrap confidence interval, we need the two
percentiles that bound the middle \(P\%\) of the bootstrap distribution, which
we can find using the quantile() function:
quantile(boot_stats, probs = c(0.005, 0.995))
## 0.5% 99.5%
## -0.9266987 -0.7639543
The above code gives us a 99% confidence interval estimate because it excludes the most extreme 1% of bootstrapped results (0.5% on each end of the bootstrap distribution).
If we had instead wanted a 95% confidence interval, we would simply need to
modify the probs argument to reflect the middle 95% of
bootstrapped statistics:
quantile(boot_stats, probs = c(0.025, 0.975))
## 2.5% 97.5%
## -0.9126938 -0.7839372
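The relationship between the confidence level and the probs argument can be sketched as a small helper function. The name percentile_ci and the simulated fake_boot_stats values are assumptions made for illustration; they are not part of the lab's provided code or data.

```r
## Hypothetical helper: percentile bootstrap interval for a given confidence level
percentile_ci = function(stats, level = 0.95){
  alpha = 1 - level  # total proportion of bootstrapped results to exclude
  ## Exclude half of alpha from each end of the bootstrap distribution
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(7)  # arbitrary seed so the results are repeatable

## Stand-in values playing the role of bootstrapped statistics (an assumption)
fake_boot_stats = rnorm(1000, mean = -0.85, sd = 0.03)

ci_99 = percentile_ci(fake_boot_stats, level = 0.99)
ci_95 = percentile_ci(fake_boot_stats, level = 0.95)

print(ci_99)
print(ci_95)
```

A higher confidence level excludes less of the bootstrap distribution, so the 99% interval is at least as wide as the 95% interval.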
Your task in this lab will be to adapt the bootstrapping code from
the lab’s introduction to a variety of different descriptive statistics.
You will also compare your R results with bootstrapping
results in StatKey. For some of these statistics, using a Normal or \(t\)-distribution as a probability model
would produce invalid confidence intervals.
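As a rough sketch of what adapting the introduction's code can look like, the example below bootstraps a median instead of a correlation. It uses a simulated, right-skewed stand-in sample (the data frame fake_sample and its column value are assumptions), so it is a template under those assumptions and not a solution to any of the questions.

```r
set.seed(2025)  # arbitrary seed so the results are repeatable

## Simulated right-skewed stand-in sample (an assumption, not lab data)
fake_sample = data.frame(value = rexp(40, rate = 1/3))

## Setup mirrors the introduction
n_boot = 1000
n = nrow(fake_sample)
boot_stats = numeric(n_boot)

for(i in 1:n_boot){
  case_idx = sample(x = 1:n, size = n, replace = TRUE)
  ## drop = FALSE keeps a one-column data frame from collapsing to a vector
  cur_boot_sample = fake_sample[case_idx, , drop = FALSE]
  ## Only this line changes when the statistic of interest changes
  boot_stats[i] = median(cur_boot_sample$value)
}

## 95% percentile bootstrap confidence interval for the median
boot_ci = quantile(boot_stats, probs = c(0.025, 0.975))
print(boot_ci)
```

The resampling steps stay the same across statistics; the line that computes the statistic is the part that needs to be rewritten.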
Question #1: For this question you will bootstrap
the mean of a moderately sized sample with strong right-skew. You will
use the mj25 data set introduced in the previous lab. The
aim is to estimate the average number of three-point attempts per game
that Michael Jordan took during his career. Shown below are the data and
a histogram of the variable of interest. A link to the data is also
provided to facilitate their entry into StatKey.
mj25 = read.csv("https://remiller1450.github.io/data/mj25.csv")
ggplot(mj25, aes(x = threeatt)) + geom_histogram()
mj25 data, select the variable threeatt, and generate 1000 bootstrap samples. What does each dot displayed in the main graphic pane represent? How many dots are there?
R. Find the endpoints of this interval using the quantile() function as was demonstrated in the lab’s introduction.
R function to find a 95% confidence interval using the probability model you would have relied upon had this been true.
Question #2: Many scholars define a mass shooting as an incident where four or more people are shot and either injured or killed in a short period of time. The Gun Violence Archive maintains a database of mass shootings that have happened in the United States. The data below are the incidents from 2025 that are recorded in this database, which we will view as a representative sample of mass shootings more broadly.
R and create a histogram displaying the distribution of the number of victims who were killed in each incident. Describe the shape of this distribution.
R function to find a 99% confidence interval that relies upon a probability model.
Question #3: In addition to scenarios where the assumptions of traditional methods are questionable, bootstrapping is also useful for unusual descriptive statistics where there is no commonly known probability model. An example of this is the mean-to-median ratio, which we saw in our last lecture. For this question you will use bootstrapping to create a 95% confidence interval for the mean-to-median ratio of a sample that likely came from a skewed population. You will continue using the “2025 mass shootings” dataset.
quantile() function to report the
endpoints of the interval.