This lab will use both Minitab and StatKey software to explore sampling distributions, along with a technique called bootstrapping that allows us to estimate the sampling distribution (and use it to construct confidence intervals!) with just a single sample.
The data we’ll use is from “The College Scorecard”, a government-run database that stores information on all degree-granting higher education institutions. The original dataset contains thousands of cases and hundreds of variables; however, we will look at only a select set of variables for a smaller subset of cases defined to be small colleges (1000-5000 students) that primarily grant bachelor’s degrees and require the ACT as part of their admissions process. This is a very interesting dataset that we will continue to explore in the future. A description of the variables can be found below:
LOCALE A categorical variable describing the location of the institution
AVGFACSAL Average faculty salary
As we’ve previously discussed, the first step in any statistical analysis should be to explore and understand the variables in your data. To refresh our memory, here are a few things we should consider before jumping into an analysis:
Question #1
Are these data a population or a sample? If they are a sample, describe the population that they represent. If they are a population, describe what defines that population.
Question #2
Ignoring the variables “INSTNM” and “CITY”, which we will treat as ID variables, complete the table below and include it in your lab write-up. In doing so, state whether the variable is skewed right, skewed left, or approximately symmetric for quantitative variables, and state the percentage of cases in the largest category for categorical variables.
Suppose we’d like to estimate a population parameter from sample data. Even if the sampling procedure is unbiased, there is still uncertainty regarding which cases from the population ended up in the sample.
Because of this uncertainty, the sample statistic is unlikely to exactly match the population parameter, and we should reflect this uncertainty by reporting an interval estimate.
The approach we previously learned for constructing confidence intervals requires us to know the standard error of the sampling distribution. In practice, this meant we needed hundreds of different random samples, which is impractical for most applications.
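To make the idea of "hundreds of different random samples" concrete, here is a minimal Python sketch. Python is not part of this lab, and the `population` values are synthetic numbers generated only for illustration (they are not the College Scorecard data). The sketch repeatedly draws samples of size 20 and measures how much the sample mean varies from sample to sample; that spread is the standard error.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic "population" of 2000 values (an assumption for illustration only,
# not the College Scorecard data).
population = [random.gauss(22, 3) for _ in range(2000)]


def simulate_sampling_distribution(population, n, num_samples):
    """Draw many random samples of size n and return the mean of each one."""
    return [
        statistics.mean(random.sample(population, n))  # drawn without replacement
        for _ in range(num_samples)
    ]


sample_means = simulate_sampling_distribution(population, n=20, num_samples=500)

# The standard error is the standard deviation of the sampling distribution.
standard_error = statistics.stdev(sample_means)
print(f"Population mean: {statistics.mean(population):.2f}")
print(f"Standard error of the sample mean (n = 20): {standard_error:.2f}")
```

Notice that this only works because the code has access to the whole population, which is exactly what we usually lack in practice.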
Bootstrapping is an ingenious idea that allows us to estimate the entire sampling distribution using just a single sample.
When bootstrapping, we treat the original sample as if it were the population. We then re-sample from the original sample, replacing each case after it is sampled (this is called sampling with replacement), to create a bootstrap sample.
Because we are re-sampling with replacement, the bootstrap sample might contain replicates of some of the cases in the original sample. Alternatively, some cases might not appear at all.
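As a small illustration of sampling with replacement, the Python sketch below draws a single bootstrap sample from a hypothetical original sample of eight labelled cases. The labels are placeholders invented for this example; `random.choices` is the standard-library function that samples with replacement.

```python
import random
from collections import Counter

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical original sample of 8 labelled cases (illustration only).
original_sample = ["A", "B", "C", "D", "E", "F", "G", "H"]

# random.choices samples WITH replacement; k matches the original sample size.
bootstrap_sample = random.choices(original_sample, k=len(original_sample))

counts = Counter(bootstrap_sample)
repeated = sorted(case for case, count in counts.items() if count > 1)
missing = sorted(set(original_sample) - set(bootstrap_sample))

print("Bootstrap sample:", bootstrap_sample)
print("Cases drawn more than once:", repeated)
print("Cases never drawn:", missing)
```

Because the bootstrap sample has the same size as the original sample, any case that is never drawn must be offset by some other case being drawn more than once.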
We don’t stop with one bootstrap sample; instead, we repeat the same re-sampling procedure many times. This mimics the process of repeatedly sampling from a population (which is what is needed to get the sampling distribution). The diagram below illustrates how the process might unfold for the first three bootstrap samples:
You should notice a few things in this diagram:
Question #3
Fill in the blanks of the following sentence:
“The goal of bootstrapping is to estimate the __ distribution using only the data from a single __.”
Question #4
Why do bootstrap samples need to be drawn with replacement? (Hint: think about the goal of bootstrapping that you stated in Lab Question #3, and how both sample size and variability relate to that goal.)
The next step in bootstrapping is to calculate the statistic of interest for each bootstrap sample, and then to use this collection of statistics to form the bootstrap distribution. The bootstrap distribution is an estimate of the sampling distribution.
One use of the bootstrap distribution is to approximate the variability (the standard error) of the statistic of interest from our original sample. The logic behind this approach is very similar to what we saw when first discussing the sampling distribution:
Real World | Bootstrap World |
---|---|
Start with entire population | Start with a single sample |
Take many random samples | Re-sample with replacement |
Construct sampling distribution | Construct bootstrap distribution |
Find the standard error (SE) | Use it to estimate the standard error (SE) |
Calculate confidence interval using SE | Calculate confidence interval using SE |
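The "Bootstrap World" column can be sketched in Python as follows. This is only an illustration under stated assumptions: the `sample` values are synthetic, and the helper name `bootstrap_distribution` is made up for this example rather than taken from StatKey or Minitab.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical sample of n = 20 values (synthetic, for illustration only).
sample = [round(random.gauss(22, 3), 1) for _ in range(20)]


def bootstrap_distribution(sample, statistic, num_resamples=5000):
    """Return the statistic computed on each of num_resamples bootstrap samples."""
    n = len(sample)
    return [
        statistic(random.choices(sample, k=n))  # re-sample with replacement
        for _ in range(num_resamples)
    ]


boot_means = bootstrap_distribution(sample, statistics.mean)

# The spread of the bootstrap distribution estimates the standard error.
bootstrap_se = statistics.stdev(boot_means)
print(f"Sample mean: {statistics.mean(sample):.2f}")
print(f"Bootstrap SE: {bootstrap_se:.2f}")
```

Passing the statistic in as a function means the same helper could be reused for a median or any other summary, not just the mean.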
We will talk more about the reliability of bootstrapping in our next lecture; for now, we will focus on learning how to use bootstrapping to construct confidence intervals.
The program StatKey (link) allows for the convenient construction of bootstrap samples and the bootstrap distribution for a variety of different statistics. StatKey comes with a number of pre-loaded datasets, but it also allows you to input your own sample data. Categorical data can be entered into StatKey using just summary statistics; however, quantitative data requires you to copy-paste the variable’s entire column. We will practice this sort of data entry in this lab.
Because it only makes sense to use bootstrapping when your data are a sample, we will now take a random sample of size \(n = 20\) from the Small College data and work with that sample for the remainder of the lab. The original data will serve as our population. You’ll want to have both the full data and your random sample of size \(n=20\) readily available in separate Minitab worksheets.
It is very important to understand that you’d never want to actually do this in a real analysis; we are simply doing so to learn about the mechanics of bootstrapping. If you ever did have data on the whole population, there’d be no reason to restrict yourself to just a sample; in fact, you’d be putting yourself at a major disadvantage by doing so.
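The sampling step itself is simple to sketch outside Minitab. The Python snippet below is only an illustration: the `population` list is a made-up placeholder, not the Small College data, and it assumes the sample is drawn without replacement (using `random.sample`), so that no college can appear twice.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the full dataset: one record per college.
# The names and values are placeholders, not real College Scorecard rows.
population = [
    {"name": f"College {i}", "act": random.randint(17, 32)}
    for i in range(1, 301)
]

# random.sample draws WITHOUT replacement, so no college appears twice.
sample = random.sample(population, k=20)

sample_names = [college["name"] for college in sample]
print(sample_names)
```

This differs from the bootstrap step, where re-sampling from the sample is done with replacement.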
To draw a random sample in Minitab:
The example below illustrates drawing a random sample of size \(n = 20\) from the Happy Planet Data. It includes only the variables “Happiness” and “LifeExpectency” and stores the sample in columns “C15” and “C16”.
Question #5
In 1-2 sentences, briefly explain why it only makes sense to use bootstrapping when your data are a sample (contrast this with the case where your data are the entire population you’re interested in).
Question #6
Using the full dataset, find the population mean cumulative ACT score and mean admissions rate. Include these population parameters in your lab write-up using proper notation (i.e., you might need to use the “equations” option under “insert” if you are using Word).
Question #7
Following the example above, obtain a random sample of size \(n = 20\). Paste the names of the colleges in your sample in your lab write-up.
Question #8
Using your random sample, find the best estimates of the overall mean cumulative ACT score and mean admissions rate for all small colleges. Include these in your lab write-up using proper notation. How close are your best estimates to the truth? (see Lab Question #6)
In Lab Question #8 you found the best estimates of a couple of population parameters. Because these estimates won’t exactly match the true population parameters, a better option is to report an interval estimate. To construct an interval estimate we need to know the variability of our sample statistic (i.e., we need to know the sampling distribution). As previously mentioned, bootstrapping can provide an approximation of the sampling distribution, giving us all the information we need to construct confidence intervals. Provided the bootstrap distribution is symmetric and bell-shaped, we can apply the 68-95-99.7 rule to construct a 95% confidence interval for the population parameter using the formula:
\[ \text{Sample Statistic} \pm 2 \times \text{SE} \]
The sample statistic is our best guess from our sample and the standard error (SE) is estimated from the bootstrap distribution. The example below walks through the process using StatKey’s pre-loaded “Manhattan Apartments Rent” dataset:
The above image shows what StatKey looks like after generating a single bootstrap sample. The red box in the upper right shows the distribution of the original sample. The mean of that sample is the best guess at the population parameter. The blue box in the lower right shows the distribution of the current bootstrap sample. Notice how the single data point with a rent over $5000 occurs three times in this bootstrap sample. Finally, the blue circle shows how the bootstrap sample contributes to the bootstrap distribution. The next image shows what StatKey looks like after generating a large number of bootstrap samples.
In this example I’ve generated 5000 bootstrap samples, so the bootstrap distribution is built from 5000 different bootstrap statistics. Notice how the best estimate of the population parameter is very close to, but not exactly at, the center of the bootstrap distribution. The key values needed to construct a 95% confidence interval are highlighted in yellow. The 95% confidence interval is found by:
\[ 3156.5 \pm 2 \times 303.938 = (2548.624, 3764.376) \]
Thus, we can be 95% confident that the true mean rent of all Manhattan apartments is between $2548 and $3764.
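The arithmetic in this example is easy to check with a few lines of Python. The function name `two_se_interval` is made up for this sketch; the two input values are the sample mean and bootstrap standard error from the Manhattan example above.

```python
def two_se_interval(statistic, standard_error):
    """Return the (lower, upper) endpoints of statistic ± 2 * SE."""
    margin = 2 * standard_error
    return statistic - margin, statistic + margin


# Values taken from the Manhattan apartments example above.
lower, upper = two_se_interval(3156.5, 303.938)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")  # prints "95% CI: (2548.624, 3764.376)"
```

The same function can be reused for any statistic, provided its bootstrap distribution is roughly symmetric and bell-shaped.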
Question #9
Using your random sample of size \(n = 20\), use StatKey to construct a 95% bootstrap confidence interval for the mean cumulative ACT score of all small colleges. Show your calculation and include a copy of the bootstrap distribution you generated in StatKey.
Question #10
Now construct a new random sample of size \(n = 50\) and use StatKey to construct a 95% bootstrap confidence interval for the mean cumulative ACT score of all small colleges. Show your calculation and include a copy of the bootstrap distribution you generated in StatKey.
Question #11
How do the confidence intervals you found in Lab Question #9 and Lab Question #10 compare? In a few sentences, briefly explain why the lengths of these two intervals are different.
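The “2SE” approach to confidence intervals that we’ve been using so far is somewhat restricted: it assumes the bootstrap distribution is symmetric and bell-shaped, and the multiplier of 2 only produces 95% intervals.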
Fortunately, because bootstrapping provides an estimate of the entire sampling distribution, we can construct valid confidence intervals using percentiles from the bootstrap distribution.
For example, rather than approximating the middle 95% of the sampling distribution using the 2 SE rule, we can just chop off the lowest 2.5% and highest 2.5% of our bootstrap distribution to get our estimate. This idea is illustrated in the images below:
Percentile bootstrap confidence intervals can be found by clicking the “Two Tail” box in StatKey. The example below illustrates a 95% percentile bootstrap confidence interval for the Manhattan Apartment Data:
Clicking on the box near the center of the distribution allows us to change the desired confidence level:
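The percentile idea can also be sketched in a few lines of Python, independent of StatKey. This is a minimal illustration under stated assumptions: the `sample` values are synthetic, the helper name `percentile_interval` is made up, and the cut-off positions use a simple rounding rule that may differ slightly from how StatKey picks its percentiles.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is reproducible

# Hypothetical sample of n = 50 values (synthetic, for illustration only).
sample = [random.gauss(22, 3) for _ in range(50)]

# Bootstrap distribution of the sample mean, sorted from smallest to largest.
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(5000)
)


def percentile_interval(sorted_stats, confidence):
    """Cut an equal share off each tail of the sorted bootstrap statistics."""
    tail = (1 - confidence) / 2
    lower_index = round(tail * len(sorted_stats))
    upper_index = len(sorted_stats) - 1 - lower_index
    return sorted_stats[lower_index], sorted_stats[upper_index]


lower, upper = percentile_interval(boot_means, 0.95)
print(f"95% percentile interval: ({lower:.2f}, {upper:.2f})")
```

Changing the `confidence` argument plays the same role as editing the confidence level box in StatKey.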
Question #12
Use your most recent random sample (the one of size \(n = 50\)) to fill out the following table relating confidence level and interval width for the percentile bootstrap.
Confidence Level | Interval (A, B) | Length (B - A) |
---|---|---|
50% | ||
70% | ||
80% | ||
90% | ||
95% | ||
99% | ||
Question #13
Create a scatterplot relating “Confidence Level” and “Length” from the table filled out in Lab Question #12. Is the relationship linear or non-linear? Briefly explain what you see and how it relates to the shape of the bootstrap distribution.
Question #14
Based upon your explorations so far in this lab, as well as any additional exploration you might do, fill out the following table summarizing the impact of changing various factors on confidence interval length:
Factor | Impact (i.e., length increases / length decreases / negligible impact) |
---|---|
Larger \(n\) | |
More bootstrap samples | |
Higher confidence level | |
Larger standard error | |
Lower confidence level | |
Question #15
Given what you’ve seen, do you think that all values in a 95% confidence interval estimate are equally plausible? Briefly explain your answer.
Question #16
The examples in this lab used quantitative variables. For this question I’d like you to choose a categorical variable and construct a 95% bootstrap confidence interval for a proportion of interest. You can choose the specific proportion that you’re interested in (you might want to create a new binary variable for this purpose). In your write-up, include:
Question #17
For this question I’d like you to choose two quantitative variables and construct a 95% bootstrap confidence interval for the correlation coefficient between those variables. In your write-up, include: