Directions

  • Read through the entire lab (not just the questions). The lab will introduce course content that you will be responsible for on exams/homework.
  • Answer all questions in a separate document, attaching Minitab (or StatKey) output if needed. Many groups choose to use Google Docs or designate a group member to be in charge of the write-up.
  • Do not use a “divide and conquer” strategy. While it is tempting to finish more quickly, this approach negatively impacts you and your classmates. You are expected to work through the lab as a team. Also, you should recognize that Prof. Miller is happy to devote more class time to a lab if it is taking longer than anticipated.

Bootstrapping

Problem: Suppose you collect a random sample and want to use it to learn something about the population (inference). Because there is uncertainty (randomness) regarding which cases from the population ended up in your sample, you aren’t exactly sure how close the estimates from your sample are to the true characteristics of the population.

Desired Solution: If you could take many random samples, you’d be able to see the sampling distribution of your estimate(s). If your sampling procedure is unbiased, this distribution will be centered around the true population value, and the standard deviation of this distribution (i.e., the standard error of your estimate) measures the variability introduced by your sampling scheme.

Barrier: Taking a large number of random samples is impractical, time-consuming, expensive, and often impossible. We need another way to attain the sampling distribution without needing to take a large number of samples.

Bootstrapping: The main idea behind bootstrapping is that re-sampling cases from your original sample approximates the process of sampling from the population.

A set of re-sampled cases is known as a bootstrap sample. Bootstrap samples are drawn with replacement from the original sample, meaning that in a given bootstrap sample some cases from the original sample could occur multiple times and some could be left out entirely. To preserve the properties of the original sampling protocol, each bootstrap sample contains the same number of cases as the original sample.

By taking many bootstrap samples, we can calculate a different statistic using each and then compile these statistics to arrive at the bootstrap distribution, which is an estimate of the sampling distribution.

Question #1:

In your lab write-up, fill in the blanks in the following sentence: “The goal of bootstrapping is to estimate the __ distribution using only the data from a single __”

Details

The process of re-sampling with replacement, which is used to obtain bootstrap samples, is depicted in the image below:

In this image, you see the original sample contains 3 blue, 2 orange, 2 yellow, and 2 green cases. Each bootstrap sample is drawn from these cases, but after a case is selected it is still eligible to be sampled again (you can think of it as being “replaced” in the original sample). This allows certain cases to be repeated in a bootstrap sample. For example, bootstrap sample 3 has 5 blue cases, even though there were only 3 such cases in the original sample. This is because some of the blue cases got sampled, replaced, and then sampled again.
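As a minimal sketch of a single re-sampling step, the snippet below draws one bootstrap sample from the nine colored cases described above. It uses Python's standard library; the variable names and the seed are assumptions chosen for illustration and are not part of the lab.

```python
import random
from collections import Counter

# The original sample described in the text: 3 blue, 2 orange, 2 yellow, 2 green.
original = ["blue"] * 3 + ["orange"] * 2 + ["yellow"] * 2 + ["green"] * 2

rng = random.Random(1)  # arbitrary seed so the sketch is reproducible

# random.choices draws WITH replacement, so a case can be picked more than once.
# The bootstrap sample has the same number of cases as the original sample.
bootstrap_sample = rng.choices(original, k=len(original))

# Tally how many times each color appears in this bootstrap sample.
counts = Counter(bootstrap_sample)
print(counts)
```

Running the snippet with different seeds shows that a color can appear more often than it did in the original sample, or not at all.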

Once many bootstrap samples have been acquired, the next step is to calculate the statistic of interest for each bootstrap sample, and then to use this collection to form the bootstrap distribution. The bootstrap distribution is an estimate of the sampling distribution.
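The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bootstrap_distribution`, the made-up nine-value sample, and the choice of 2,000 bootstrap samples are all hypothetical and do not come from the lab or from StatKey.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_boot=2000, seed=0):
    """Return one value of `statistic` per bootstrap sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    results = []
    for _ in range(n_boot):
        # Draw n cases with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        # Calculate the statistic of interest on this bootstrap sample.
        results.append(statistic(resample))
    return results

# Hypothetical sample of nine measurements (made up for illustration).
sample = [12, 15, 9, 20, 14, 11, 18, 16, 13]
boot_means = bootstrap_distribution(sample, statistics.mean)
print(len(boot_means), statistics.mean(boot_means))
```

Any function that maps a list of cases to a single number (for example `statistics.median`) can be passed in place of `statistics.mean`.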

Question #2:

In your own words, what is re-sampling with replacement? Why is replacement necessary? (Hint: think about the goal of trying to estimate the sampling distribution and what would happen if cases weren’t replaced when drawing bootstrap samples).

The image below shows the bootstrap distribution (based upon 10,000 bootstrap samples) of the correlation coefficient between commute time and commute distance for a sample of 500 Atlanta commuters.

Question #3:

In the dotplot shown above, how many dots are displayed? What does each dot in this distribution represent?

Question #4:

Will the bootstrap distribution be centered at the population parameter or the sample estimate? How does this compare with the sampling distribution?

Confidence Intervals (the SE approach)

The primary use of the bootstrap distribution is to obtain an estimate of the standard error of the statistic of interest (the estimate from the original sample). The logic behind this approach is very similar to what we saw when first discussing the sampling distribution:

Real World                               | Bootstrap World
Start with entire population             | Start with a single sample
Take many random samples                 | Re-sample with replacement
Construct sampling distribution          | Construct bootstrap distribution
Find the standard error (SE)             | Use it to estimate the standard error (SE)
Calculate confidence interval using SE   | Calculate confidence interval using SE

So, for the purposes of constructing confidence intervals, we are interested in getting a standard error (SE) estimate from the bootstrap distribution.

In the Atlanta commuters plot (shown just above Question #3), the standard error (of the correlation coefficient relating commute time and distance) is 0.037. So, the 95% confidence interval estimate for the population correlation \(\rho\) is given by:

\[0.807 \pm 2 \times 0.037 = (0.733, 0.881)\]

Note that the estimate, 0.807, came from the original sample; we only use the bootstrap distribution to get the estimate’s standard error.
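The arithmetic of the SE approach can be sketched in Python as follows. The helper names `bootstrap_se` and `se_interval` are hypothetical. The sketch assumes the standard error is estimated as the standard deviation of the bootstrap statistics; whether a given tool divides by n or n - 1 makes little difference with thousands of bootstrap samples.

```python
import statistics

def bootstrap_se(boot_stats):
    """Estimate the standard error as the SD of the bootstrap statistics."""
    return statistics.stdev(boot_stats)

def se_interval(estimate, se, multiplier=2):
    """Return (estimate - multiplier * SE, estimate + multiplier * SE)."""
    margin = multiplier * se
    return (estimate - margin, estimate + margin)

# Numbers from the Atlanta commuters example: estimate 0.807, SE 0.037.
low, high = se_interval(0.807, 0.037)
print(round(low, 3), round(high, 3))
```

Note that the interval is centered at the estimate from the original sample, not at the center of the bootstrap distribution.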

Question #4:

Why was the standard error multiplied by 2 in the 95% confidence interval calculation above? Would this interval still be meaningful if we didn’t multiply the standard error by anything?

Question #5:

In 1-2 sentences, explain why it only makes sense to use bootstrapping when your data are a sample. Why wouldn’t you use bootstrapping if your data were the entire population?

Confidence Intervals (the percentile approach)

The standard error (SE) approach to constructing bootstrapped confidence intervals relies upon the bootstrap distribution being symmetric and bell-shaped, which isn’t always the case. In fact, the bootstrap distribution of the Atlanta Commute correlation coefficient that we’ve been looking at as an example is somewhat left-skewed.

A robust alternative is the percentile bootstrap. Rather than approximating the middle 95% of the sampling distribution using the 2 SE rule, we can just chop off the lowest 2.5% and highest 2.5% of our bootstrap distribution to determine the middle 95%. This idea is illustrated in the image below:

In the Atlanta commuters example, here’s what the percentile bootstrap looks like:

This yields a 95% confidence interval of (0.724, 0.866), which is somewhat different from the interval we calculated earlier.

In situations where the bootstrap distribution appears to be skewed or non-symmetric, the percentile approach to finding confidence intervals should be preferred.
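A minimal Python sketch of the percentile approach is given below. The function name `percentile_interval` is hypothetical, and the rule used to pick the cutoff positions is one simple convention; statistical software may use slightly different percentile definitions, so results can differ a little in the last digit.

```python
def percentile_interval(boot_stats, level=0.95):
    """Chop off an equal share of the lowest and highest bootstrap statistics."""
    ordered = sorted(boot_stats)
    n = len(ordered)
    # Number of values to cut from EACH tail (2.5% per tail for a 95% interval).
    cut = round(n * (1 - level) / 2)
    return ordered[cut], ordered[n - cut - 1]

# Stand-in "bootstrap distribution": the integers 1 through 1000.
values = list(range(1, 1001))
low, high = percentile_interval(values)
print(low, high)
```

With 1,000 values and a 95% level, the 25 smallest and 25 largest values are discarded and the endpoints are the smallest and largest values that remain.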

Question #6:

Suppose we wanted to use the percentile approach to find a 90% confidence interval using a bootstrap distribution containing 10,000 bootstrap samples. Briefly explain how this would be done.

StatKey

StatKey (link) is an online application that can be used to perform bootstrapping for a variety of different data types and statistics. While StatKey contains several pre-loaded datasets, we’ll typically need to input our own sample data. What we need to input differs depending on the type of data and statistic you’re using, so we’ll quickly go through a few examples.

One-sample Quantitative Data

For a single quantitative variable, StatKey allows you to bootstrap its mean, median or standard deviation. To do so, you first must click on “CI for a Single Mean, Median, St. Dev”. Next, you must copy your quantitative variable into the “Edit Data” dialog found towards the top of your screen. The data you copy into this dialog should look similar to what is shown below:

Once your data is loaded, you can draw bootstrap samples by clicking one of the “Generate Samples” buttons. You can change the statistic that is calculated within each bootstrap sample using the dropdown menu following the text “Bootstrap Dotplot of”.

Question #7:

Load the variable “LifeExpectency” from the Happy Planet dataset into StatKey. Next, click “Generate 1 sample” and answer each of the following using a short phrase/sentence:

  1. What is shown in the “Original Sample” panel in the upper right?
  2. What is shown in the “Bootstrap sample” panel in the bottom left?
  3. What is shown in the main, “Bootstrap Dotplot of” panel?
  4. How do these panels relate to each other?

Two-sample Quantitative Data

Two-sample quantitative data occurs when a quantitative variable is split into two groups (samples) by a binary categorical variable. In these situations, StatKey allows you to bootstrap the difference in means (of the quantitative variable) of the two samples.

To begin, you should navigate to “CI for Difference in Means” from the StatKey homepage. Next, you need to enter your data. For two-sample quantitative data, you need to supply two columns: the first is a group identifier (a binary categorical variable), and the second is the quantitative variable of interest. The data you input should look something like what is shown below:

Once you’ve successfully input your data, you may generate bootstrap samples using the “Generate Samples” buttons.
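For readers who want to see the idea outside of StatKey, here is a hedged Python sketch of bootstrapping a difference in means from two-column (group, value) data. It assumes each group is re-sampled separately while keeping its original size; the function name, the group labels "A" and "B", and the data are all made up for illustration.

```python
import random
import statistics

def bootstrap_diff_means(rows, group_a, group_b, n_boot=2000, seed=0):
    """Bootstrap the difference in means (group_a minus group_b)."""
    rng = random.Random(seed)
    # Split the two-column data into one list of values per group.
    a = [value for group, value in rows if group == group_a]
    b = [value for group, value in rows if group == group_b]
    diffs = []
    for _ in range(n_boot):
        # Re-sample each group with replacement, keeping its original size.
        a_star = rng.choices(a, k=len(a))
        b_star = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(a_star) - statistics.mean(b_star))
    return diffs

# Hypothetical data: first column is the group identifier, second is the value.
rows = [("A", 5), ("A", 7), ("A", 6), ("A", 9),
        ("B", 3), ("B", 4), ("B", 6), ("B", 2)]
diffs = bootstrap_diff_means(rows, "A", "B")
print(statistics.mean(diffs))
```

The resulting list plays the role of the bootstrap dotplot, and either the SE approach or the percentile approach can be applied to it.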

Question #8:

From the Mass Shootings dataset, load the necessary variables into StatKey to compare the difference in the mean number of victims for spree and mass shootings. Then, construct a 95% confidence interval estimate for the difference in means using the percentile method. Interpret your interval, and be sure to mention what population these data might represent.

One-sample Categorical Data

To bootstrap one-sample categorical data, you should navigate to “CI for single proportion” from the StatKey homepage. This time you don’t need to copy an entire column of values into StatKey. Instead, you only need to supply the frequency of the category of interest (you can pick one) and the overall sample size. In StatKey these are called the “count” and “sample size”. You can obtain these numbers from the variable’s one-way frequency table.
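The reason a count and a sample size are enough can be seen in the following Python sketch: the original sample can be rebuilt as a list of 1s (the category of interest) and 0s (everything else) and then re-sampled as usual. The function name and the numbers (30 out of 50) are hypothetical and chosen only for illustration.

```python
import random
import statistics

def bootstrap_proportion(count, sample_size, n_boot=2000, seed=0):
    """Bootstrap a sample proportion from a count and a sample size."""
    rng = random.Random(seed)
    # Rebuild the original sample: 1 = category of interest, 0 = anything else.
    sample = [1] * count + [0] * (sample_size - count)
    props = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=sample_size)  # with replacement
        props.append(sum(resample) / sample_size)
    return props

# Hypothetical one-way table: 30 of 50 cases fall in the category of interest.
props = bootstrap_proportion(count=30, sample_size=50)
print(statistics.mean(props), statistics.stdev(props))
```

The standard deviation printed at the end is the bootstrap estimate of the standard error of the sample proportion.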

Question #9:

Suppose a basketball player attempts 20 free throws, making 16 of them. Use bootstrapping and the standard error method to find a 95% confidence interval estimate for the proportion of free throws this player will make during the remainder of the season (assuming the 20 attempts are a representative sample).

Two-sample Categorical Data

Two-sample categorical data occurs when a categorical variable is split into two groups (samples) by a binary categorical variable. In this situation, association is often evaluated using differences in proportions.

To enter this type of data into StatKey, you’ll again only need to enter frequencies and sample sizes rather than entire columns from the dataset. This time you can obtain all the necessary numbers from the two-way frequency table of the variable of interest and the binary grouping variable.
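A sketch of the same idea in Python is shown below, assuming each group is re-sampled separately from its own count and sample size. The function name and the counts (18 of 40 versus 12 of 30) are made up for illustration and are unrelated to any dataset in this lab.

```python
import random
import statistics

def bootstrap_diff_props(count1, n1, count2, n2, n_boot=2000, seed=0):
    """Bootstrap the difference in proportions (group 1 minus group 2)."""
    rng = random.Random(seed)
    # Rebuild each group as 1s (category of interest) and 0s (everything else).
    group1 = [1] * count1 + [0] * (n1 - count1)
    group2 = [1] * count2 + [0] * (n2 - count2)
    diffs = []
    for _ in range(n_boot):
        p1_star = sum(rng.choices(group1, k=n1)) / n1
        p2_star = sum(rng.choices(group2, k=n2)) / n2
        diffs.append(p1_star - p2_star)
    return diffs

# Hypothetical two-way table: 18 of 40 in group 1, 12 of 30 in group 2.
prop_diffs = bootstrap_diff_props(18, 40, 12, 30)
print(statistics.mean(prop_diffs), statistics.stdev(prop_diffs))
```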

Question #10:

From the Mass Shootings dataset, use Minitab to construct a two-way frequency table of the variables “Mental” and “Type”. Use this table to input into StatKey the necessary information to compare the difference in the proportion of perpetrators with prior signs of mental illness for spree and mass shootings. Then, construct a 95% confidence interval estimate for the difference in proportions using the standard error method.

Correlation Coefficient

The correlation coefficient measures the strength of linear association between two quantitative variables. In StatKey you can bootstrap the correlation between two variables by copying the two columns of interest into the “Edit Data” dialog.
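When bootstrapping a correlation, whole cases are re-sampled, so each x value stays paired with its y value. The Python sketch below illustrates this with made-up data; the function names are hypothetical, and the rule of redrawing a degenerate resample (one with no variation in x or y, where the correlation is undefined) is an assumption of this sketch rather than a documented StatKey behavior.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlation(xs, ys, n_boot=2000, seed=0):
    """Bootstrap the correlation by re-sampling (x, y) pairs together."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    rs = []
    while len(rs) < n_boot:
        resample = rng.choices(pairs, k=len(pairs))  # re-sample whole cases
        bx = [x for x, _ in resample]
        by = [y for _, y in resample]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # correlation undefined without variation; draw again
        rs.append(pearson_r(bx, by))
    return rs

# Hypothetical paired data (made up for illustration).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3, 1, 4, 6, 5, 9, 7, 8, 12, 10]
boot_rs = bootstrap_correlation(xs, ys)
print(pearson_r(xs, ys), min(boot_rs), max(boot_rs))
```

Re-sampling the two columns independently would break the pairing and destroy the association that the correlation is meant to measure.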

Question #11:

From the Mass Shootings dataset, load the necessary variables and use bootstrapping to obtain a 99% confidence interval estimate (using the percentile method) of the correlation between the number of fatalities and the number of injuries in a mass shooting.

Application #1 - Infant Heart Surgery

Some infants are born with congenital heart defects and require surgery shortly after birth. The standard surgical approach is known as “circulatory arrest”, and has the downside of cutting off the flow of blood to the brain during the surgery, potentially leading to brain damage. An alternative surgical approach is “low-flow bypass”, which maintains circulation to the brain, but does so with an external pump that might lead to other types of brain injuries.

The Infant Heart Surgery data contains data from a randomized trial conducted by surgeons at Harvard Medical School. The data document the outcomes of 70 infants who received low-flow bypass surgery, and 73 infants who received surgery under a circulatory arrest approach. The study considered two primary outcomes:

  1. Psychomotor Development Index (PDI) - a composite score measuring physiological development, with higher scores indicating greater development
  2. Mental Development Index (MDI) - a composite score measuring mental development, with higher scores indicating greater development

In this application, we’ll use the result of this experiment to quantify the differences in these two surgical approaches.

Question #12:

Estimate the effect of the “low-flow bypass” surgery on psychomotor development. That is, by how much does the low-flow bypass surgery improve or worsen PDI relative to the “circulatory arrest” approach? Report your estimate using proper statistical notation. Then, use bootstrapping to find a 95% confidence interval using the standard error approach.

Question #13:

Estimate the effect of the “low-flow bypass” surgery on mental development. That is, by how much does the low-flow bypass surgery improve or worsen MDI relative to the “circulatory arrest” approach? Report your estimate using proper statistical notation. Then, use bootstrapping to find a 95% confidence interval using the standard error approach.

Question #14:

Can you conclude that the differences you reported in Questions 12 and 13 are caused by the difference in surgery? Could there be other confounding factors? Briefly explain.

Question #15:

Based upon your findings in Questions 12, 13, and 14, are you confident that opting for the low-flow bypass surgery over the circulatory arrest approach has a beneficial effect on an infant’s future development? Briefly explain.

Application #2 - San Francisco Shoppers in the 1980s

In 1987, Impact Resources Inc. surveyed 9409 shopping mall customers in the San Francisco Bay area (San Francisco, Oakland, and San Jose). Data for 6876 shoppers who completed the survey are contained in the SF Mall Customers dataset. The variables include:

  • INCOME - a categorical variable describing self-reported household income (personal income if single)
  • SEX - the sex of the respondent
  • MARITAL.STATUS - the marital status of the respondent
  • AGE - a categorical variable describing the age of the respondent
  • EDUCATION - the highest level of education attained by the respondent
  • OCCUPATION - the category of occupation of the respondent
  • AREA - how long the respondent has lived in the bay area
  • DUAL.INCOMES - whether the respondent’s household has multiple income earners
  • HOUSEHOLD.SIZE - how many individuals are in the respondent’s household
  • UNDER18 - how many individuals under 18 are in the respondent’s household
  • HOUSEHOLDER - householder status (own, rent, or live with parents/family)
  • HOME.TYPE - type of home the respondent resides in
  • ETHNIC.CLASS - the racial/ethnic category of the respondent
  • LANGAUGE - the language most often spoken at home in the respondent’s household

In this application, we will use survey data to characterize San Francisco bay area shoppers in the 1980s.

Question #16:

Estimate the proportion of bay-area shopping mall customers who own their own home. Report your estimate using proper statistical notation. Then, use bootstrapping to find a 95% confidence interval using the percentile approach.

Question #17:

Estimate the average household size of bay-area shopping mall customers. Report your estimate using proper statistical notation. Then, use bootstrapping to find a 95% confidence interval. You may use either the standard error or percentile approach.

Question #18:

Estimate the difference in the average number of children (as defined by the variable UNDER18) of dual income and single income married shopping mall customers. Report your estimate using proper statistical notation. Then, use bootstrapping to find a 95% confidence interval. You may use either the standard error or percentile approach. (Hint: you may consider using “subset worksheet” to create a new worksheet that doesn’t include respondents who answered “Not Married” to the dual income question).

Question #19:

Estimate the difference in the proportion of shoppers living in apartments for 18-24 year-olds relative to 25-34 year-olds. Report your estimate using proper statistical notation. Then, use bootstrapping to find a 95% confidence interval. You may use either the standard error or percentile approach.

Question #20:

Would you be confident in using these data to characterize shoppers in the SF bay area in 2019? Briefly explain.