Directions

  • Read through the entire lab (not just the questions). The lab will introduce course content that you will be responsible for on exams/homework.
  • Answer all questions in a separate document, attaching Minitab output if needed.
  • Do not use a “divide and conquer” strategy. While it is tempting to get done quicker, this approach negatively impacts you and your classmates. You are expected to work through the lab as a team. Also, you should recognize that Prof. Miller is happy to devote more class time to a lab if it is taking longer than anticipated.

Normal Approximations and Confidence Intervals

Lately we’ve seen how the sampling distribution of many commonly used statistics can be approximated by the normal curve, thereby allowing confidence intervals to be created using relatively simple mathematical formulas.

In this lab we’ll use a couple of different samples from the same population to study the impact of factors such as sample size or skew on confidence intervals while also comparing bootstrapped intervals with those arising from normal approximations.

We’ll begin with a brief review of normal approximations. For constructing confidence intervals, we are most concerned with the central limit theorem (CLT) result for the parameter we are trying to estimate. For now, we will focus on estimating five different parameters, which are described in the subsections that follow.

Single Proportion

When estimating \(p\) using \(\hat{p}\), CLT suggests:

\(\hat{p} \sim N\bigg(p, \sqrt{\frac{p(1-p)}{n}}\bigg)\)

Leading to confidence intervals of the form:

\(\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

where \(z^*\) is the quantile of the standard normal distribution (ie: z-score) that yields the desired confidence level. Recall that \(z^* = 1.96 \approx 2\) results in 95% confidence because the middle 95% of the standard normal distribution falls within \(\pm 1.96\).
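As a rough illustration of the formula above, the following Python sketch computes a normal-approximation interval for a single proportion. The function name `proportion_ci` and the counts (60 successes in 100 trials) are hypothetical and are not taken from the Michael Jordan data.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, conf=0.95):
    """Normal-approximation CI for a single proportion (illustrative helper)."""
    p_hat = successes / n
    # z* is the standard normal quantile that leaves (1 - conf)/2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical data: 60 successes out of 100 trials
p_low, p_high = proportion_ci(60, 100)
print(round(p_low, 3), round(p_high, 3))
```

Changing `conf` changes \(z^*\) and therefore the margin of error; the point estimate stays the same.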

Difference in Proportions

When estimating \(p_1 - p_2\) using \(\hat{p}_1 - \hat{p}_2\), CLT suggests:

\(\hat{p}_1 - \hat{p}_2 \sim N\bigg(p_1 - p_2, \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\bigg)\)

Leading to confidence intervals of the form:

\(\hat{p}_1 - \hat{p}_2 \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)

Again, where \(z^*\) is used to define the confidence level of the interval.
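A minimal Python sketch of this interval is shown here, with each group's variance term divided by that group's own sample size. The helper name `diff_proportions_ci` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, conf=0.95):
    """Normal-approximation CI for p1 - p2 (illustrative helper)."""
    p1, p2 = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # Each group contributes its own variance term, divided by its own n
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se


# Hypothetical data: 45 of 75 in group 1, 30 of 75 in group 2
d_low, d_high = diff_proportions_ci(45, 75, 30, 75)
print(round(d_low, 3), round(d_high, 3))
```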

Single Mean

When estimating \(\mu\) using \(\bar{x}\), CLT suggests:

\(\bar{x} \sim N\big(\mu, \frac{s}{\sqrt{n}}\big)\)

Leading to confidence intervals of the form:

\(\bar{x} \pm t^*\frac{s}{\sqrt{n}}\)

This time \(t^*\) is used to define the confidence level of the interval, where \(t^*\) is the quantile from a \(t\)-distribution with \(n - 1\) degrees of freedom. Recall that the \(t\)-distribution is necessary to account for the extra uncertainty introduced by estimating \(\sigma\) using \(s\).
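The sketch below applies the single-mean formula to a small made-up sample. Python's standard library has no \(t\)-distribution quantile function, so \(t^*\) is passed in by hand; the value 2.776 is the usual table value for 95% confidence with 4 degrees of freedom. The helper name `mean_ci` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(sample, t_star):
    """t-based CI for a single mean; t_star must match df = n - 1."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (divides by n - 1)
    margin = t_star * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical sample of n = 5 values, so df = 4 and t* = 2.776 for 95%
points = [28, 35, 22, 41, 30]
m_low, m_high = mean_ci(points, t_star=2.776)
print(round(m_low, 2), round(m_high, 2))
```

For comparison, using \(z^* = 1.96\) in place of \(t^* = 2.776\) would give a noticeably narrower interval for the same five values.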

Difference in Means

When estimating \(\mu_1 - \mu_2\) using \(\bar{x}_1 - \bar{x}_2\), CLT suggests:

\(\bar{x}_1 - \bar{x}_2 \sim N\big(\mu_1 - \mu_2, \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \big)\)

Leading to confidence intervals of the form:

\(\bar{x}_1 - \bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)

This time, \(t^*\) is the quantile from a \(t\)-distribution with \(\text{min}(n_1 - 1, n_2 - 1)\) degrees of freedom. Recall that most software, including Minitab, will calculate degrees of freedom differently.
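To give a sense of how the degrees of freedom can differ, the sketch below compares the conservative \(\text{min}(n_1 - 1, n_2 - 1)\) rule with the Welch-Satterthwaite approximation, which is a common software choice. Treating it as what any particular program does is an assumption here. The summary statistics are hypothetical.

```python
from math import sqrt


def two_sample_se(s1, n1, s2, n2):
    """Standard error of x̄1 - x̄2 using each group's own s and n."""
    return sqrt(s1**2 / n1 + s2**2 / n2)


def conservative_df(n1, n2):
    """The rule used in this lab: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)


def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Hypothetical summary statistics for two groups
se = two_sample_se(8, 20, 6, 15)
df_small = conservative_df(20, 15)
df_welch = welch_df(8, 20, 6, 15)
print(round(se, 3), df_small, round(df_welch, 1))
```

The Welch value is never smaller than the conservative one, so the conservative rule yields a \(t^*\) that is at least as large and an interval that is at least as wide.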

Correlation Coefficient

When estimating \(\rho\) using \(r\), CLT suggests:

\(r \sim N\big(\rho, \sqrt{\frac{1-\rho^2}{n-2}} \big)\)

Leading to confidence intervals of the form:

\(r \pm z^* \sqrt{\frac{1-r^2}{n-2}}\)

where \(z^*\) is used to define the confidence level of the interval.
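The same pattern can be sketched for the correlation coefficient, using the standard error \(\sqrt{(1-r^2)/(n-2)}\) from the formula above. The helper name `correlation_ci` and the values \(r = 0.5\), \(n = 27\) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def correlation_ci(r, n, conf=0.95):
    """Normal-approximation CI for a correlation, per the lab's formula."""
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt((1 - r**2) / (n - 2))
    return r - z_star * se, r + z_star * se


# Hypothetical values: r = 0.5 computed from n = 27 pairs
r_low, r_high = correlation_ci(0.5, 27)
print(round(r_low, 3), round(r_high, 3))
```

Nothing in this formula stops an endpoint from falling outside \([-1, 1]\) when \(n\) is small and \(r\) is near a boundary.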

Michael Jordan Game Data

For this part of the lab, we’ll use four different random samples of basketball games played by Michael Jordan, a professional basketball player who is widely considered to be the greatest of all time. Each sample contains a different number of games. Links to these samples, and additional details on the variables, can be found below:

Each sample contains the variables:

  • date - date the game was played
  • age - Jordan’s age reported as years-days
  • team - Jordan’s team
  • opp - opponent
  • result - game result: win (W) or loss (L) and score difference
  • mp - minutes played
  • fg - field goals made
  • fga - field goals attempted
  • fgp - field goal percentage
  • three - three-point shots made
  • threeatt - three-point shots attempted
  • threep - three-point shot percentage
  • ft - free-throws made
  • fta - free-throws attempted
  • ftp - free-throw percentage
  • orb - offensive rebounds
  • drb - defensive rebounds
  • trb - total rebounds (offensive + defensive)
  • ast - assists
  • stl - steals
  • blk - blocks
  • tov - turnovers
  • pts - total points scored (by Jordan)

Calculating Confidence Intervals in Minitab

In Minitab, confidence intervals are calculated using the normal approximations provided in the previous section. To calculate a confidence interval in Minitab you can use the following steps:

  1. Navigate to “Stat” -> “Basic Statistics”
  2. Select the option that describes the type of data you have:
    • 1-proportion = a single proportion
    • 2-proportions = a difference in proportions
    • 1-sample t = a single mean
    • 2-sample t = a difference in means
    • correlation = correlation coefficient
  3. Specify the appropriate column(s) of your dataset
    • Additionally, you can change the confidence level by clicking on “Options”

Sample Size

The first factor we’ll explore is sample size. Intuitively, estimation of a population parameter should be easier when the sample size is larger, and in this section we’ll try to understand the impact of sample size on confidence intervals in greater detail.

Question #1:

Before actually creating any confidence intervals, which sample do you think is most likely to produce a 95% confidence interval that contains the population parameter of interest (For example, Michael Jordan’s average points-per-game for his entire career)? Briefly explain.

Question #2:

For each sample, use Minitab to calculate a 95% confidence interval estimate of Michael Jordan’s career average points per game. Record these intervals in a table like the one below:

Sample Size | 95% CI Lower Endpoint | 95% CI Upper Endpoint
n = 5 | |
n = 25 | |
n = 75 | |
n = 200 | |

Question #3:

Calculate the length (ie: upper endpoint minus lower endpoint) of each confidence interval in your table, then use Minitab (or Excel/another program) to plot interval length versus sample size. Include this graph in your lab write-up, along with 1-2 sentences describing how sample size appears to be related to confidence interval length.

Question #4:

Looking at the CLT normal approximation for a single mean, explain why the relationship you saw in Question #3 exists. (Hint: Think about the margin of error in the confidence interval formula)

Skew

Generally speaking, given the sample size is large enough, the sampling distribution of most statistics will be approximately normal, even if the estimate comes from a variable with a skewed distribution. We’ll explore this perhaps surprising result by looking at the variable “tov”, or the turnovers committed by Michael Jordan in each game.

Question #5:

Using the MJ-200 sample, use Minitab to construct a histogram of the variable “tov”. Is this variable skewed right or skewed left?

Question #6:

For the MJ-200 sample, use StatKey to approximate the sampling distribution (via bootstrapping) of Michael Jordan’s average per-game tov. Does the sampling distribution appear to be approximately normal (ie: is it symmetric and bell-shaped)?

Question #7:

Use the distribution you found in Question #6 to find a 95% percentile bootstrap confidence interval. Report your interval in your lab write-up.
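For readers curious about what a percentile bootstrap does behind the scenes, here is a minimal Python sketch. It is not StatKey's implementation; the function name `percentile_bootstrap_ci`, the seed, and the turnover counts are all hypothetical.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(data, n_boot=5000, conf=0.95, seed=0):
    """Percentile bootstrap CI for the mean (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample n values WITH replacement and record each resample's mean
    boot_means = sorted(mean(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = 1 - conf
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    # The endpoints are the middle `conf` share of the bootstrap means
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical per-game turnover counts (NOT the MJ-200 sample)
turnovers = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 2, 1, 3, 0, 2, 4, 1, 3, 2]
b_low, b_high = percentile_bootstrap_ci(turnovers)
print(b_low, b_high)
```

Because every bootstrap mean is an average of observed values, the endpoints can never fall outside the range of the data.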

Question #8:

Now use Minitab to calculate a 95% confidence interval estimate for Michael Jordan’s career per-game tov. How does this interval compare to the one you found in Question #7?

Question #9:

Repeat the comparisons of Questions #6-8 using the MJ-5 sample. That is, use StatKey to estimate the sampling distribution and describe whether it appears skewed, then find a 95% percentile bootstrap confidence interval, and then compare this interval with the equivalent 95% confidence interval calculated by Minitab.

Question #10:

Based upon your answers to Questions #5-9, when should bootstrapping be preferred (relative to normal approximations)? (Hint: Think about how similar (or dissimilar) your bootstrap and normal approximation CIs were in the aforementioned questions, the underlying assumptions of each method, and the shape of the sampling distribution estimates you saw)

Parameter Boundaries

Michael Jordan is 6’6”, which is very tall by normal-person standards, but is not particularly tall for a professional basketball player. Consequently, he tended not to block many shots (blocked shots are relatively rare even for taller players). In this section we will use the variable “blk” to explore confidence interval construction in situations where the data contain natural boundaries.

Question #11:

For the MJ-5 sample, use Minitab to construct a 95% confidence interval for career per-game average number of blocks by Michael Jordan. Do you see a problem with this confidence interval? (Hint: think about the confidence interval as a range of plausible values for the population parameter).

Question #12:

For the MJ-5 sample, use StatKey to construct a 95% percentile bootstrap confidence interval (use at least 5,000 bootstrap samples). Do you see any problems with this confidence interval?

The plot below depicts the process of:

  1. drawing a random sample of size \(n = 5\) from Michael Jordan’s career game-log
  2. using Minitab’s approach to construct a 95% confidence interval for blocks
  3. repeating steps 1 and 2 100 times (ie: drawing 100 different random samples)
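The three steps above can be sketched in Python as a small simulation. The population below is a made-up list of per-game block counts chosen only so that its mean is 0.8; it is not Jordan's actual game log, and the resulting counts should not be read as the answer to any lab question. The value \(t^* = 2.776\) is the table value for 95% confidence with 4 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Hypothetical population of block counts with mean 0.8 (NOT real data)
population = [0] * 45 + [1] * 35 + [2] * 15 + [3] * 5
TRUE_MEAN = mean(population)
T_STAR_DF4 = 2.776  # 95% t* for n = 5 (df = 4), from a t-table


def t_interval(sample, t_star):
    """t-based CI for the mean of one sample."""
    x_bar = mean(sample)
    margin = t_star * stdev(sample) / sqrt(len(sample))
    return x_bar - margin, x_bar + margin


def count_covering(n_reps=100, n=5, seed=1):
    """Repeat steps 1-2 n_reps times; count hits and negative lower bounds."""
    rng = random.Random(seed)
    covered = 0
    negative_lower = 0
    for _ in range(n_reps):
        sample = rng.choices(population, k=n)  # step 1: draw a sample
        low, high = t_interval(sample, T_STAR_DF4)  # step 2: build the CI
        covered += low <= TRUE_MEAN <= high
        negative_lower += low < 0
    return covered, negative_lower


covered, negative_lower = count_covering()
print(covered, "of 100 intervals contain the true mean;",
      negative_lower, "have a negative lower endpoint")
```

A sample in which every value is identical has \(s = 0\), so its interval collapses to a single point; this can happen with \(n = 5\) and a variable that is usually 0.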

Question #13:

Michael Jordan’s actual career-average is 0.8 blocks per-game. Using this information, along with the plot shown above, do you believe that confidence intervals constructed in this manner have 95% coverage? That is, are they actually 95% confidence intervals? Briefly explain your answer.

Question #14:

The plot below displays a similar procedure, but this time the intervals are constructed using bootstrapping (notice how none have boundary issues). Do 95% confidence intervals constructed in this manner have the correct coverage? Briefly explain.

Question #15:

The plot below displays the results of procedures described in Questions #13 and #14, except this time a sample size of \(n = 100\) was used. Do 95% confidence intervals constructed in this manner have the correct coverage? Briefly explain.

Remark: Questions #13, #14, and #15 demonstrate the importance of having a large, representative sample. Regardless of the procedure used, estimation is a difficult task when the sample size is small. Confidence intervals constructed using bootstrapping are generally preferred for small samples, for highly skewed variables, or for variables with boundary problems - but they aren’t a magic bullet.

Conclusions

A large enough sample size tends to fix most problems related to interval estimation; however, a larger sample size will not fix a biased study design. Recall that the Literary Digest sampled 2.4 million people and predicted that Alfred Landon would win the election by a margin of 14 percentage points, yet because of biased sampling the actual result went in the opposite direction. With a sample size of 2.4 million, any confidence interval would have been incredibly narrow, but would not have captured the true election results.

When the sample size is relatively small, or if the circumstances are unusual (a highly skewed variable, or a variable with pre-defined boundaries), bootstrapped confidence intervals tend to be preferred because they’ll always contain sensible values and rely on fewer assumptions. Otherwise, both bootstrapping and normal approximation approaches tend to produce very similar results, and approximate intervals are easily found using software like Minitab.

Question #16:

Complete the following table summarizing the role of various aspects on confidence interval length. To make each determination you might consider your answers to earlier lab questions, or you might choose to explore the factor yourself using an example dataset(s).

Factor | Increases/Decreases/Doesn’t Influence the Interval Length
Larger n |
Larger SE |
Higher Confidence Level |
More Bootstrap Samples |

Decision Making

One of the most challenging aspects of statistics is knowing which tool(s) to use in a particular situation.

This section will present four questions where you should use your own judgement to answer the question using appropriate statistical methods. In this lab, that means deciding the appropriate statistics (ie: proportions, differences in proportions, means, differences in means, or correlation) and also whether to use bootstrapping or a normal approximation to construct a valid confidence interval. You may also need to do some data manipulations depending on the characteristic you’re trying to estimate.

Question #17

Using the MJ-200 sample, does Michael Jordan score more points in games where he attempts more three-point shots?

Question #18

Using the MJ-75 sample, did Michael Jordan spend a greater proportion of his career playing for the Chicago Bulls or Washington Wizards (note that these are the only teams he played for)?

Question #19

Using the MJ-25 sample, did Michael Jordan average more points when playing for the Chicago Bulls or Washington Wizards?

Question #20

Using the MJ-5 sample, what proportion of free throws did Michael Jordan make during his career? (Hint: You might consider using the “SUM” function within the menu at “Calc -> Calculator” to obtain the necessary information for this question)

Submission Directions

  • Double check that you’ve completed all of the lab’s questions, making sure that everyone in your group agrees with the answer you’ve provided. You will receive a single group score for the lab.
  • Make sure that everyone’s name is on the write-up.
  • Email your completed write-up to Professor Miller with a subject heading that includes the text “Sta-209-Lab4”. Please include this exact character string, including the dashes. You will lose 1 point off the top of your score if you don’t do so.
  • If you’d like to provide feedback on your group, fill out the optional review form at this link: https://forms.gle/wNWRFMbbra8oK4LJ8