Directions
Lately we’ve seen how the sampling distribution of many commonly used statistics can be approximated by the normal curve, thereby allowing confidence intervals to be created using relatively simple mathematical formulas.
In this lab we’ll use a couple of different samples from the same population to study the impact of factors such as sample size or skew on confidence intervals while also comparing bootstrapped intervals with those arising from normal approximations.
We’ll begin with a brief review of normal approximations. For constructing confidence intervals, we are most concerned with the central limit theorem (CLT) result for the parameter we are trying to estimate. For now, we will focus on estimating five different parameters, which are described in the subsections that follow.
When estimating \(p\) using \(\hat{p}\), CLT suggests:
\(\hat{p} \sim N\bigg(p, \sqrt{\frac{p(1-p)}{n}}\bigg)\)
Leading to confidence intervals of the form:
\(\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
Where \(z^*\) is the quantile of the standard normal distribution (ie: z-score) that gives the desired confidence level. Recall that \(z^* = 1.96 \approx 2\) results in 95% confidence because the middle 95% of the standard normal distribution falls within \(\pm 1.96\).
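As a rough illustration of the formula above, the following Python sketch computes a normal-approximation interval for a single proportion. The function name `proportion_ci` and the counts used are hypothetical and not part of the lab data; the lab itself uses Minitab for this step.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for one proportion."""
    p_hat = successes / n
    # z* is the standard normal quantile leaving (1 - confidence) / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical data: 40 successes in 100 trials
low, high = proportion_ci(40, 100)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```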
When estimating \(p_1 - p_2\) using \(\hat{p}_1 - \hat{p}_2\), CLT suggests:
\(\hat{p}_1 - \hat{p}_2 \sim N\bigg(p_1 - p_2, \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\bigg)\)
Leading to confidence intervals of the form:
\(\hat{p}_1 - \hat{p}_2 \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
Again, where \(z^*\) is used to define the confidence level of the interval.
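The same idea extends to two groups. The sketch below is one way to compute the interval for a difference in proportions, with each sample proportion divided by its own sample size. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Each group's variance term uses that group's own sample size
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se


# Hypothetical data: 60 of 100 successes in group 1, 45 of 100 in group 2
low, high = diff_proportions_ci(60, 100, 45, 100)
print(f"95% CI for p1 - p2: ({low:.3f}, {high:.3f})")
```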
When estimating \(\mu\) using \(\bar{x}\), CLT suggests:
\(\bar{x} \sim N\big(\mu, \frac{s}{\sqrt{n}}\big)\)
Leading to confidence intervals of the form:
\(\bar{x} \pm t^*\frac{s}{\sqrt{n}}\)
This time \(t^*\) is used to define the confidence level of the interval, where \(t^*\) is the quantile from a \(t\)-distribution with \(n - 1\) degrees of freedom. Recall that the \(t\)-distribution is necessary to account for the extra uncertainty introduced by estimating \(\sigma\) using \(s\).
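A minimal sketch of the single-mean interval follows. Python's standard library has no \(t\)-distribution quantile function, so this sketch assumes \(t^*\) is looked up in a table (or taken from software) and passed in. The data values and the name `mean_ci` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(sample, t_star):
    """t-based CI for a mean; t_star is the t quantile with n - 1 df."""
    n = len(sample)
    x_bar = mean(sample)
    se = stdev(sample) / sqrt(n)  # stdev() is the sample SD (divides by n - 1)
    margin = t_star * se
    return x_bar - margin, x_bar + margin


# Hypothetical points scored in 5 games (not the actual MJ-5 sample)
points = [28, 35, 22, 31, 40]
T_STAR_DF4 = 2.776  # 95% t quantile with 4 degrees of freedom, from a t-table
low, high = mean_ci(points, T_STAR_DF4)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Because the margin of error is \(t^* s / \sqrt{n}\), rerunning the sketch with more observations of similar spread produces a narrower interval.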
When estimating \(\mu_1 - \mu_2\) using \(\bar{x}_1 - \bar{x}_2\), CLT suggests:
\(\bar{x}_1 - \bar{x}_2\sim N\big(\mu_1 -\mu_2, \sqrt{\frac{s_1^2}{n_1} +\frac{s_2^2}{n_2}} \big)\)
Leading to confidence intervals of the form:
\(\bar{x}_1 - \bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
This time, \(t^*\) is the quantile from a \(t\)-distribution with \(\text{min}(n_1 - 1, n_2 - 1)\) degrees of freedom. Recall that most software, including Minitab, calculates degrees of freedom differently.
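The sketch below computes the difference-in-means interval using the conservative degrees of freedom described above. For comparison, it also computes the Welch-Satterthwaite approximation, which is assumed here to be the alternative that software typically reports; check your software's documentation to confirm. All names and data values are hypothetical, and \(t^*\) is again taken from a table.

```python
from math import sqrt
from statistics import mean, variance


def conservative_df(n1, n2):
    """Degrees of freedom used in this lab: min(n1 - 1, n2 - 1)."""
    return min(n1 - 1, n2 - 1)


def welch_df(sample1, sample2):
    """Welch-Satterthwaite df (assumed to be what software reports)."""
    v1 = variance(sample1) / len(sample1)
    v2 = variance(sample2) / len(sample2)
    return (v1 + v2) ** 2 / (
        v1 ** 2 / (len(sample1) - 1) + v2 ** 2 / (len(sample2) - 1)
    )


def diff_means_ci(sample1, sample2, t_star):
    """t-based CI for mu1 - mu2 using unpooled sample variances."""
    se = sqrt(variance(sample1) / len(sample1) + variance(sample2) / len(sample2))
    diff = mean(sample1) - mean(sample2)
    return diff - t_star * se, diff + t_star * se


# Hypothetical points scored in two small groups of games
group1 = [28, 35, 22, 31, 40]
group2 = [20, 25, 18, 30]
df_lab = conservative_df(len(group1), len(group2))
df_welch = welch_df(group1, group2)
T_STAR_DF3 = 3.182  # 95% t quantile with 3 degrees of freedom, from a t-table
low, high = diff_means_ci(group1, group2, T_STAR_DF3)
print(f"conservative df = {df_lab}, Welch df = {df_welch:.2f}")
print(f"95% CI for mu1 - mu2: ({low:.2f}, {high:.2f})")
```

The conservative choice gives fewer degrees of freedom than the Welch approximation, so it leads to a larger \(t^*\) and a somewhat wider interval.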
When estimating \(\rho\) using \(r\), CLT suggests:
\(r \sim N\big(\rho, \sqrt{\frac{1-\rho^2}{n-2}} \big)\)
Leading to confidence intervals of the form:
\(r \pm z^* \sqrt{\frac{1-r^2}{n-2}}\)
where \(z^*\) is used to define the confidence level of the interval.
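A sketch of the correlation interval, using the standard error formula given above, is shown below. The helper names and the paired data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Sample correlation coefficient computed from its definition."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_ci(x, y, confidence=0.95):
    """Normal-approximation CI for rho with SE = sqrt((1 - r^2) / (n - 2))."""
    n = len(x)
    r = pearson_r(x, y)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt((1 - r ** 2) / (n - 2))
    return r, r - margin, r + margin


# Hypothetical paired measurements from 6 games
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r, low, high = correlation_ci(x, y)
print(f"r = {r:.3f}, 95% CI: ({low:.3f}, {high:.3f})")
```

With only six hypothetical pairs, the upper endpoint exceeds 1, which is not a possible value for a correlation. Small samples combined with natural boundaries can make normal-approximation intervals include impossible values.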
For this part of the lab, we’ll use four different random samples of basketball games played by Michael Jordan, a professional basketball player who is widely considered to be the greatest of all time. Each sample contains a different number of games. Links to these samples, and additional details on the variables, can be found below:
Each sample contains the variables:
In Minitab, confidence intervals are calculated using the normal approximations provided in the previous section. To calculate a confidence interval in Minitab you can use the following steps:
The first factor we’ll explore is sample size. Intuitively, estimation of a population parameter should be easier when the sample size is larger, and in this section we’ll try and understand the impact of sample size on confidence intervals in greater detail.
Question #1:
Before actually creating any confidence intervals, which sample do you think is most likely to produce a 95% confidence interval that contains the population parameter of interest (for example, Michael Jordan’s average points-per-game for his entire career)? Briefly explain.
Question #2:
For each sample, use Minitab to calculate a 95% confidence interval estimate of Michael Jordan’s career average points per game. Record these intervals in a table like the one below:
Sample Size | 95% CI Lower Endpoint | 95% CI Upper Endpoint |
---|---|---|
n = 5 | | |
n = 25 | | |
n = 75 | | |
n = 200 | | |
Question #3:
Calculate the length (ie: upper endpoint minus lower endpoint) of each confidence interval in your table, then use Minitab (or Excel/another program) to plot interval length versus sample size. Include this graph in your lab write-up, along with 1-2 sentences describing how sample size appears to be related to confidence interval length.
Question #4:
Looking at the CLT normal approximation for a single mean, explain why the relationship you saw in Question #3 exists. (Hint: Think about the margin of error in the confidence interval formula)
Generally speaking, given the sample size is large enough, the sampling distribution of most statistics will be approximately normal, even if the estimate comes from a variable with a skewed distribution. We’ll explore this perhaps surprising result by looking at the variable “tov”, or the turnovers committed by Michael Jordan in each game.
Question #5:
Using the MJ-200 sample, use Minitab to construct a histogram of the variable “tov”. Is this variable skewed right or skewed left?
Question #6:
For the MJ-200 sample, use StatKey to approximate the sampling distribution (via bootstrapping) of Michael Jordan’s average per-game tov. Does the sampling distribution appear to be approximately normal (ie: is it symmetric and bell-shaped)?
Question #7:
Use the distribution you found in Question #6 to find a 95% percentile bootstrap confidence interval. Report your interval in your lab write-up.
Question #8:
Now use Minitab to calculate a 95% confidence interval estimate for Michael Jordan’s career per-game tov. How does this interval compare to the one you found in Question #7?
Question #9:
Repeat the comparisons of Questions #6-8 using the MJ-5 sample. That is, use StatKey to estimate the sampling distribution and describe whether it appears skewed, then find a 95% percentile bootstrap confidence interval, and then compare this interval with the equivalent 95% confidence interval calculated by Minitab.
Question #10:
Based upon your answers to Questions #5-9, when should bootstrapping be preferred (relative to normal approximations)? (Hint: Think about how similar (or dissimilar) your bootstrap and normal approximation CIs were in the aforementioned questions, the underlying assumptions of each method, and the shape of the sampling distribution estimates you saw)
Michael Jordan is 6’6”, which is very tall by normal-person standards, but is not particularly tall for a professional basketball player. Consequently, he tended not to block many shots (blocked shots are relatively rare even for taller players). In this section we will use the variable “blk” to explore confidence interval construction in situations where the data contain natural boundaries.
Question #11:
For the MJ-5 sample, use Minitab to construct a 95% confidence interval for career per-game average number of blocks by Michael Jordan. Do you see a problem with this confidence interval? (Hint: think about the confidence interval as a range of plausible values for the population parameter).
Question #12:
For the MJ-5 sample, use StatKey to construct a 95% percentile bootstrap confidence interval (use at least 5,000 bootstrap samples). Do you see any problems with this confidence interval?
The plot below depicts the process of:
Question #13:
Michael Jordan’s actual career-average is 0.8 blocks per-game. Using this information, along with the plot shown above, do you believe that confidence intervals constructed in this manner have 95% coverage? That is, are they actually 95% confidence intervals? Briefly explain your answer.
Question #14:
The plot below displays a similar procedure, but this time the intervals are constructed using bootstrapping (notice how none have boundary issues). Do 95% confidence intervals constructed in this manner have the correct coverage? Briefly explain.
Question #15:
The plot below displays the results of procedures described in Questions #13 and #14, except this time a sample size of \(n = 100\) was used. Do 95% confidence intervals constructed in this manner have the correct coverage? Briefly explain.
Remark: Questions #13, #14, and #15 demonstrate the importance of having a large, representative sample. Regardless of the procedure used, estimation is a difficult task when the sample size is small. Confidence intervals constructed using bootstrapping are generally preferred for small samples, for highly skewed variables, or for variables with boundary problems, but they aren’t a magic bullet.
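Coverage can also be explored by simulation: draw many samples from a known population, build an interval from each, and record the fraction of intervals that contain the true mean. The sketch below does this for \(t\)-based intervals. The population (a skewed distribution of per-game counts with mean 0.75), the function name, and the number of repetitions are all hypothetical choices for illustration, not the procedure used to make the lab's plots.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_interval_coverage(values, weights, n, t_star, reps=2000, seed=1):
    """Fraction of simulated t-intervals that contain the true mean."""
    rng = random.Random(seed)
    true_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    hits = 0
    for _ in range(reps):
        # Draw one random sample of size n from the hypothetical population
        sample = rng.choices(values, weights=weights, k=n)
        x_bar = mean(sample)
        margin = t_star * stdev(sample) / sqrt(n)
        if x_bar - margin <= true_mean <= x_bar + margin:
            hits += 1
    return hits / reps


# Hypothetical skewed population of per-game counts (true mean = 0.75)
values = [0, 1, 2, 3]
weights = [0.50, 0.30, 0.15, 0.05]
coverage_n5 = t_interval_coverage(values, weights, n=5, t_star=2.776)
coverage_n100 = t_interval_coverage(values, weights, n=100, t_star=1.984)
print(f"n = 5 coverage:   {coverage_n5:.3f}")
print(f"n = 100 coverage: {coverage_n100:.3f}")
```

A procedure has correct 95% coverage when this fraction is close to 0.95. Note that a sample of five identical values (for example, five games with zero blocks) has \(s = 0\), so its interval collapses to a single point.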
A large enough sample size tends to fix most problems related to interval estimation; however, a larger sample size will not fix a biased study design. Remember that the Literary Digest sampled 2.4 million people and predicted that Alfred Landon would win the 1936 presidential election by a margin of 14 percentage points, but because of biased sampling the actual result went in the opposite direction. With a sample size of 2.4 million, any confidence interval would have been incredibly narrow, but it would not have captured the true election result.
When the sample size is relatively small, or if the circumstances are unusual (a highly skewed variable, or a variable with pre-defined boundaries), bootstrapped confidence intervals tend to be preferred because they’ll always contain sensible values and rely on fewer assumptions. Otherwise, both bootstrapping and normal approximation approaches tend to produce very similar results, and approximate intervals are easily found using software like Minitab.
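To make the contrast concrete, the sketch below builds a percentile bootstrap interval and a \(t\)-based interval from the same small sample of counts. The data, function names, and seed are hypothetical; StatKey performs the equivalent resampling in the lab.

```python
import random
from math import sqrt
from statistics import mean, stdev


def percentile_bootstrap_ci(sample, confidence=0.95, n_boot=5000, seed=1):
    """Percentile bootstrap CI for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample n values with replacement, record the mean, repeat n_boot times
    boot_means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_boot))
    tail = (1 - confidence) / 2
    # Approximate positions of the lower and upper percentiles
    lower = boot_means[int(tail * n_boot)]
    upper = boot_means[int((1 - tail) * n_boot) - 1]
    return lower, upper


def t_ci(sample, t_star):
    """t-based CI for the mean, for comparison."""
    margin = t_star * stdev(sample) / sqrt(len(sample))
    return mean(sample) - margin, mean(sample) + margin


# Hypothetical blocks in 5 games; counts cannot be negative
blocks = [0, 1, 0, 2, 0]
boot_low, boot_high = percentile_bootstrap_ci(blocks)
t_low, t_high = t_ci(blocks, t_star=2.776)  # t* for 4 df, from a t-table
print(f"bootstrap: ({boot_low:.2f}, {boot_high:.2f})")
print(f"t-based:   ({t_low:.2f}, {t_high:.2f})")
```

Every bootstrap mean is an average of observed values, so the bootstrap endpoints cannot fall below the smallest observation, whereas the \(t\)-based lower endpoint here is negative.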
Question #16:
Complete the following table summarizing the effect of various factors on confidence interval length. To make each determination you might consider your answers to earlier lab questions, or you might choose to explore the factor yourself using an example dataset(s).
Factor | Increases/Decreases/Doesn’t Influence the Interval Length |
---|---|
Larger n | |
Larger SE | |
Higher Confidence Level | |
More Bootstrap Samples | |
One of the most challenging aspects of statistics is knowing which tool(s) to use in a particular situation.
This section will present four questions where you should use your own judgement to answer the question using appropriate statistical methods. In this lab, that means deciding the appropriate statistics (ie: proportions, differences in proportions, means, differences in means, or correlation) and also whether to use bootstrapping or a normal approximation to construct a valid confidence interval. You may also need to do some data manipulations depending on the characteristic you’re trying to estimate.
Question #17
Using the MJ-200 sample, does Michael Jordan score more points in games where he attempts more three-point shots?
Question #18
Using the MJ-75 sample, did Michael Jordan spend a greater proportion of his career playing for the Chicago Bulls or Washington Wizards (note that these are the only teams he played for)?
Question #19
Using the MJ-25 sample, did Michael Jordan average more points when playing for the Chicago Bulls or Washington Wizards?
Question #20
Using the MJ-5 sample, what proportion of free throws did Michael Jordan make during his career? (Hint: You might consider using the “SUM” function within the menu at “Calc -> Calculator” to obtain the necessary information for this question)