Sta-209 Lab #4 - Bootstrapping

This lab will use both Minitab and StatKey software to explore sampling distributions, along with a technique called bootstrapping that allows us to estimate the sampling distribution (and use it to construct confidence intervals!) with just a single sample.

The Data

The data we’ll use come from “The College Scorecard”, a government-run database that stores information on all degree-granting higher education institutions. The original dataset contains thousands of cases and hundreds of variables; however, we will look at only a select set of variables for a smaller subset of cases defined to be small colleges (1000-5000 students) that primarily grant bachelor’s degrees and require the ACT as part of their admissions process. This is a very interesting dataset that we will continue to explore in the future. A description of the variables can be found below:

INSTNM The name of the institution
CITY The city where the institution is located
STABBR Abbreviation of the state where the institution is located
LOCALE A categorical variable describing the location of the institution
- 11 City: Large (population of 250,000 or more)
- 12 City: Midsize (population of at least 100,000 but less than 250,000)
- 13 City: Small (population less than 100,000)
- 21 Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more)
- 22 Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000)
- 23 Suburb: Small (outside principal city, in urbanized area with population less than 100,000)
- 31 Town: Fringe (in urban cluster up to 10 miles from an urbanized area)
- 32 Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area)
- 33 Town: Remote (in urban cluster more than 35 miles from an urbanized area)
- 41 Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster)
- 42 Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster)
- 43 Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster)
ADM_RATE Percentage of applicants that are admitted to the institution
ACTCMMID Median cumulative ACT score of enrolled students
ACTENMID Median English ACT subscore of enrolled students
ACTMTMID Median Math ACT subscore of enrolled students
UGDS Total undergraduate enrollment
PFTFAC Percent of faculty that are full time
PCTPELL Percent of students receiving a Pell Grant
COSTT4_A Average yearly cost of attending
AVGFACSAL Average faculty salary

Exploratory Analysis - Learning about the variables

As we’ve previously discussed, the first step in any statistical analysis should be to explore and understand the variables in your data. To refresh our memory, here are a few things we should consider before jumping into an analysis:

How were the data collected? Are the data a sample or a population? What population(s) do/don’t they represent?
How were the variables measured? Which variables are quantitative and which are categorical?
How are the variables distributed? Do they have any outliers?
How are the variables related? Are there any patterns or associations?

Question #1

Are these data a population or a sample? If they are a sample, describe the population that they represent. If they are a population, describe what defines that population.

Question #2

Ignoring the variables “INSTNM” and “CITY”, which we will treat as ID variables, complete the table below and include it in your lab write-up. In doing so, state whether the variable is skewed right, skewed left, or approximately symmetric for quantitative variables, and state the percentage of cases in the largest category for categorical variables.

Bootstrapping - The Concept

Suppose we’d like to estimate a population parameter from sample data. Even if the sampling procedure is unbiased, there is still uncertainty regarding which cases from the population ended up in the sample.
Because of this uncertainty, the sample statistic is unlikely to exactly match the population parameter, and we should reflect this uncertainty by reporting an interval estimate.
Previously, the approach we learned for constructing confidence intervals required us to know the standard error of the sampling distribution. In practice, this meant we needed hundreds of different random samples, which is impractical for most applications.
Bootstrapping is an ingenious idea that allows us to estimate the entire sampling distribution using just a single sample.
When bootstrapping, we treat the original sample as if it were the population. We then re-sample from the original sample, replacing each case after it is sampled (this is called sampling with replacement), to create a bootstrap sample.
Because we are re-sampling with replacement, the bootstrap sample might contain replicates of some of the cases in the original sample. Alternatively, some cases might not appear at all.
We don’t stop with one bootstrap sample; instead, we repeat the same re-sampling procedure many times. This mimics the process of repeatedly sampling from a population (which is what is needed to get the sampling distribution). The diagram below illustrates how the process might unfold for the first three bootstrap samples:

You should notice a few things in this diagram:

The first and second bootstrap samples each have more yellow cases than were present in the original sample. This is due to sampling with replacement.
The third bootstrap sample contains zero green cases. This is also due to sampling with replacement.
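
The re-sampling step described above can be sketched in a few lines of Python. This is only an illustration and not part of the Minitab/StatKey workflow: the list `original` is a made-up sample of colored cases (not the cases in the diagram), and the helper name `bootstrap_sample` is hypothetical.

```python
import random

def bootstrap_sample(sample, rng):
    """Draw one bootstrap sample: same size as the original, WITH replacement."""
    return rng.choices(sample, k=len(sample))

# Hypothetical original sample of colored cases (made up for illustration).
original = ["yellow", "yellow", "green", "blue", "blue", "red"]

rng = random.Random(209)  # fixed seed so the sketch is reproducible
for i in range(3):
    # Each bootstrap sample may repeat some cases and omit others entirely.
    print(f"Bootstrap sample {i + 1}:", bootstrap_sample(original, rng))
```

Every bootstrap sample has the same number of cases as the original sample, and every case in it comes from the original sample.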

Question #3

Fill in the blanks of the following sentence:

“The goal of bootstrapping is to estimate the __ distribution using only the data from a single __”

Question #4

Why do bootstrap samples need to be drawn with replacement? (Hint: think about the goal of bootstrapping that you stated in Lab Question #3, and how both sample size and variability relate to that goal.)

The Bootstrap Distribution

The next step in bootstrapping is to calculate the statistic of interest for each bootstrap sample, and then to use this collection of statistics to form the bootstrap distribution. The bootstrap distribution is an estimate of the sampling distribution.

One use of the bootstrap distribution is to approximate the variability (the standard error) of the statistic of interest from our original sample. The logic behind this approach is very similar to what we saw when first discussing the sampling distribution:

Real World	Bootstrap World
Start with entire population	Start with a single sample
Take many random samples	Re-sample with replacement
Construct sampling distribution	Construct bootstrap distribution
Find the standard error (SE)	Use it to estimate the standard error (SE)
Calculate confidence interval using SE	Calculate confidence interval using SE
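
As a rough sketch of the "Bootstrap World" column, the Python code below builds a bootstrap distribution of sample means and uses its standard deviation as the estimate of the standard error. The data values in `act` are made up for illustration, and the function name `bootstrap_distribution` is hypothetical; StatKey does the equivalent work for you in this lab.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_boot, rng):
    """Compute `statistic` on n_boot bootstrap samples drawn from `sample`."""
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

# Hypothetical sample of 10 median ACT scores (made-up values).
act = [21.0, 23.0, 24.0, 22.0, 27.0, 25.0, 20.0, 26.0, 23.0, 24.0]

rng = random.Random(209)  # fixed seed so the sketch is reproducible
boot_means = bootstrap_distribution(act, statistics.mean, 5000, rng)

# The standard deviation of the bootstrap statistics estimates the standard error.
se_hat = statistics.stdev(boot_means)
print(f"sample mean = {statistics.mean(act):.2f}, bootstrap SE = {se_hat:.3f}")
```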

We will talk more about the reliability of bootstrapping in our next lecture. For now, we will focus on learning how to use bootstrapping to construct confidence intervals.

StatKey

The program StatKey (link) allows for the convenient construction of bootstrap samples and the bootstrap distribution for a variety of different statistics. StatKey comes with a number of pre-loaded datasets, but it also allows you to input your own sample data. Categorical data can be entered into StatKey using just summary statistics; however, quantitative data requires you to copy-paste the variable’s entire column. We will practice this sort of data entry in this lab.

Random Samples in Minitab

Because it only makes sense to use bootstrapping when your data are a sample, we will now take a random sample of size $n = 20$ from the Small College data and work with that sample for the remainder of the lab. The original data will serve as our population. You’ll want to have both the full data and your random sample of size $n = 20$ readily available in separate Minitab worksheets.

It is very important to understand that you’d never want to actually do this in a real analysis; we are simply doing so to learn about the mechanics of bootstrapping. If you ever did have data on the whole population there’d be no reason to restrict yourself to just a sample; in fact, you’d be putting yourself at a major disadvantage by doing so.

To draw a random sample in Minitab:

Select “Calc” -> “Random Data” -> “Sample From Columns”
Enter the number of rows you’d like to have in your new random sample
Select the columns containing the variables you’d like to include in your new random sample
Tell Minitab the columns to store your sample in

The example below illustrates drawing a random sample of size $n = 20$ from the Happy Planet Data. It includes only the variables “Happiness” and “LifeExpectency” and stores the sample in columns “C15” and “C16”.
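
For comparison only, the same idea can be sketched in Python. This is not what Minitab runs internally; it assumes rows are drawn without replacement, and the `population` list is a made-up stand-in for the full dataset.

```python
import random

# Hypothetical stand-in for the population: (institution name, median ACT) rows.
population = [(f"College {i}", 18 + i % 12) for i in range(1, 201)]

rng = random.Random(209)  # fixed seed so the sketch is reproducible

# random.sample draws WITHOUT replacement, so no college can appear twice.
sample_rows = rng.sample(population, k=20)

# Show the first few sampled rows.
for name, act_score in sample_rows[:5]:
    print(name, act_score)
```

Note the contrast with bootstrapping: the original random sample is drawn without replacement from the population, while bootstrap samples are drawn with replacement from that sample.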

Question #5

In 1-2 sentences, briefly explain why it only makes sense to use bootstrapping when your data are a sample (Contrast with when your data are the entire population you’re interested in).

Question #6

Using the full dataset, find the population mean cumulative ACT score and mean admissions rate. Include these population parameters in your lab write-up using proper notation (i.e., you might need to use the “equations” option under “insert” if you are using Word).

Question #7

Following the example above, obtain a random sample of size $n = 20$. Paste the names of the colleges in your sample in your lab write-up.

Question #8

Using your random sample, find the best estimates of the overall mean cumulative ACT score and mean admissions rate for all small colleges. Include these in your lab write-up using proper notation. How close are your best estimates to the truth? (see Lab Question #6)

Bootstrapping in StatKey

In Lab Question #8 you found the best estimates of a couple of population parameters. Because these estimates won’t exactly match the true population parameters, a better option is to report an interval estimate. To construct an interval estimate we need to know the variability of our sample statistic (i.e., we need to know the sampling distribution). As previously mentioned, bootstrapping can provide an approximation of the sampling distribution, giving us all the information we need to construct confidence intervals. Provided the bootstrap distribution is symmetric and bell-shaped, we can apply the 68-95-99.7 rule to construct a 95% confidence interval for the population parameter using the formula:

\[ \text{Sample Statistic} \pm 2 \times \text{SE} \]

The sample statistic is our best guess from our sample and the standard error (SE) is estimated from the bootstrap distribution. The example below walks through the process using StatKey’s pre-loaded “Manhattan Apartments Rent” dataset:

The above image shows what StatKey looks like after generating a single bootstrap sample. The red box in the upper right shows the distribution of the original sample. The mean of that sample is the best guess at the population parameter. The blue box in the lower right shows the distribution of the current bootstrap sample. Notice how the single data point with a rent over $5000 occurs three times in this bootstrap sample. Finally, the blue circle shows how the bootstrap sample contributes to the bootstrap distribution. The next image shows what StatKey looks like after generating a large number of bootstrap samples.

In this example I’ve generated 5000 bootstrap samples, so the bootstrap distribution is built from 5000 different bootstrap statistics. Notice how the best estimate of the population parameter is very close to, but not exactly at, the center of the bootstrap distribution. The key information needed to construct a 95% confidence interval is highlighted in yellow. The 95% confidence interval is found by:

\[ 3156.5 \pm 2 \times 303.938 = (2548.624, 3764.376) \]

Thus we can be 95% confident that the true mean rent of all Manhattan apartments is between about $2549 and $3764.
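
The arithmetic in this example can be checked with a short Python sketch. The function name `two_se_interval` is hypothetical; the two input values are the sample mean and bootstrap standard error read from the StatKey example above.

```python
def two_se_interval(statistic, se):
    """Approximate 95% interval: statistic plus or minus 2 standard errors."""
    margin = 2 * se
    return statistic - margin, statistic + margin

# Sample mean and bootstrap SE from the Manhattan apartments example.
lower, upper = two_se_interval(3156.5, 303.938)
print(f"({lower:.3f}, {upper:.3f})")  # prints (2548.624, 3764.376)
```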

Question #9

Using your random sample of size $n = 20$, use StatKey to construct a 95% bootstrap confidence interval for the mean cumulative ACT score of all small colleges. Show your calculation and include a copy of the bootstrap distribution you generated in StatKey.

Question #10

Now construct a new random sample of size $n = 50$ and use StatKey to construct a 95% bootstrap confidence interval for the mean cumulative ACT score of all small colleges. Show your calculation and include a copy of the bootstrap distribution you generated in StatKey.

Question #11

How do the confidence intervals you found in Lab Question #9 and Lab Question #10 compare? In a few sentences, briefly explain why the lengths of these two intervals are different.

The Percentile Bootstrap

The “2SE” approach to confidence intervals that we’ve been using so far is somewhat restricted.

It requires the bootstrap distribution to be symmetric and bell-shaped, which isn’t always the case
It only allows us to construct 95% confidence intervals, and some situations might warrant more or less confidence (e.g., 90% confidence intervals or 99% confidence intervals)

Fortunately, because bootstrapping provides an estimate of the entire sampling distribution, we can construct valid confidence intervals using percentiles from the bootstrap distribution.

For example, rather than approximating the middle 95% of the sampling distribution using the 2 SE rule, we can just chop off the lowest 2.5% and highest 2.5% of our bootstrap distribution to get our estimate. This idea is illustrated in the images below:

Percentile bootstrap confidence intervals can be found by clicking the “Two Tail” box in StatKey. The example below illustrates a 95% percentile bootstrap confidence interval for the Manhattan Apartment Data:

Clicking on the box near the center of the distribution allows us to change the desired confidence level:
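
Outside of StatKey, the percentile idea might be sketched in Python as follows. The `rents` values are made up for illustration (they are not the Manhattan data), the function name `percentile_interval` is hypothetical, and the simple "drop an equal count from each tail" rule used here is one convention among several; StatKey may compute its percentiles slightly differently.

```python
import random
import statistics

def percentile_interval(boot_stats, level):
    """Return the endpoints of the middle `level` fraction (e.g. 0.95) of boot_stats."""
    ordered = sorted(boot_stats)
    n = len(ordered)
    tail = (1 - level) / 2                 # fraction chopped off each end
    lo_idx = round(tail * n)               # first index kept
    hi_idx = round((1 - tail) * n) - 1     # last index kept
    hi_idx = max(lo_idx, min(n - 1, hi_idx))
    return ordered[lo_idx], ordered[hi_idx]

# Hypothetical sample of monthly rents (made-up values, right-skewed).
rents = [1450.0, 1800.0, 2100.0, 2300.0, 2500.0,
         2750.0, 2900.0, 3200.0, 3600.0, 5200.0]

rng = random.Random(209)  # fixed seed so the sketch is reproducible
boot_means = [statistics.mean(rng.choices(rents, k=len(rents)))
              for _ in range(5000)]

lower, upper = percentile_interval(boot_means, 0.95)
print(f"95% percentile interval: ({lower:.1f}, {upper:.1f})")
```

Changing the `level` argument plays the same role as editing the confidence level box in StatKey.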

Question #12

Use your most recent random sample (the one of size $n = 50$) to fill out the following table relating confidence level and interval length for the percentile bootstrap.

Confidence Level	Interval (A, B)	Length (B - A)
50%
70%
80%
90%
95%
99%

Question #13

Create a scatterplot relating “Confidence Level” and “Length” from the table filled out in Lab Question #12. Is the relationship linear or non-linear? Briefly explain what you see and how it relates to the shape of the bootstrap distribution.

Question #14

Based upon your explorations so far in this lab, as well as any additional exploration you might do, fill out the following table summarizing the impact of changing various factors on confidence interval length:

Factor	Impact (i.e., length increases / length decreases / negligible impact)
Larger $n$
More bootstrap samples
Higher confidence level
Larger standard error
Lower confidence level

Question #15

Given what you’ve seen, do you think that all values in a 95% confidence interval estimate are equally plausible? Briefly explain your answer.

Question #16

The examples in this lab used quantitative variables. For this question I’d like you to choose a categorical variable and construct a 95% bootstrap confidence interval for a proportion of interest. You can choose the specific proportion that you’re interested in (you might want to create a new binary variable for this purpose). In your write-up, include:

Your variable of interest
The size of your random sample
Your bootstrap confidence interval (you should show all of your work and include the bootstrap distribution generated by StatKey)
A one sentence interpretation of your confidence interval

Question #17

For this question I’d like you to choose two quantitative variables and construct a 95% bootstrap confidence interval for the correlation coefficient between those variables. In your write-up, include:

Your two variables of interest
The size of your random sample
Your bootstrap confidence interval (you should show all of your work and include the bootstrap distribution generated by StatKey)
A one sentence interpretation of your confidence interval
