\(~\)

Onboarding

Later in this lab I will ask you to explore the validity of a confidence interval procedure when the sample size varies. You will use the code provided below as a template for this:

## Example that we'll use as a population
library(dplyr)
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)

We’ve already seen the functions needed to create confidence interval estimates for many common scenarios. As review, these functions are summarized in the table below:

Scenario                    Function                      Notes
Single Proportion           prop.test() or binom.test()   prop.test() is valid when \(n \cdot \hat{p} \geq 10\) and \(n \cdot (1-\hat{p}) \geq 10\)
Difference in Proportions   prop.test()                   Valid when single proportion conditions are met for each group
Single Mean                 t.test()                      Valid when \(n \geq 30\) or Normally distributed data
Difference in Means         t.test()                      Valid when \(n_1 \text{ and } n_2 \geq 30\) or Normally distributed data
Odds Ratio                  fisher.test()                 Asymmetric confidence interval
Correlation                 cor.test()                    Allows either Pearson’s or Spearman’s correlation
Regression Coefficients     confint()                     The fitted regression model must be given as the function’s input

You should note that each function expects a different type of input. Below are a few quick examples so you don’t need to go back to our previous labs.

## For prop.test() "x" is the numerator(s) and "n" is the denominator(s) of the involved proportions
numerator_count = sum(tsa_laptops$Status == "Denied")
denominator_count = length(tsa_laptops$Status)
prop.test(x = numerator_count, n = denominator_count, correct = FALSE)
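
Because "x" and "n" accept vectors of counts, prop.test() can also handle a difference in proportions. Below is a quick sketch of that usage, assuming we want to compare the proportion of denied claims at O’Hare (airport code "ORD") with the proportion at all other airports:

## Sketch: difference in proportions via vectors of counts in prop.test()
ord_rows = tsa_laptops$Airport_Code == "ORD"
denied_counts = c(sum(tsa_laptops$Status[ord_rows] == "Denied"),
                  sum(tsa_laptops$Status[!ord_rows] == "Denied"))
group_sizes = c(sum(ord_rows), sum(!ord_rows))
prop.test(x = denied_counts, n = group_sizes, correct = FALSE)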


## For t.test() "x" is either the variable (one-sample) or a formula of the form `outcome ~ explanatory` and "data" is the data set.
t.test(x = tsa_laptops$Claim_Amount)
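
t.test() also accepts the formula form mentioned in the comment above. A quick sketch of the two-sample version, assuming we compare claim amounts at O’Hare against all other airports (the grouping variable below is created only for this sketch):

## Sketch: two-sample form of t.test() using `outcome ~ explanatory`
tsa_laptops$ORD_group = ifelse(tsa_laptops$Airport_Code == "ORD", "ORD", "Other")  # grouping variable for this sketch
t.test(Claim_Amount ~ ORD_group, data = tsa_laptops)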


## For fisher.test() "x" is a 2 x 2 contingency table (made using table()) or a 2 x 2 data.frame() object (when entering our own summarized data)
tsa_laptops$Chicago_binary = ifelse(tsa_laptops$Airport_Code == "ORD", "Chicago", "Elsewhere")
tsa_laptops$Denied_binary = ifelse(tsa_laptops$Status == "Denied", "Denied", "Not Denied")
my_table = table(tsa_laptops$Chicago_binary, tsa_laptops$Denied_binary )
fisher.test(my_table)


## For cor.test() "x" and "y" are each of the two involved variables
cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount)
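
The table above notes that cor.test() also supports Spearman’s correlation. A quick sketch of requesting it is shown below; note that R reports an estimate and p-value, but not a confidence interval, for this method:

## Sketch: Spearman's rank correlation (no confidence interval is reported for this method)
cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount, method = "spearman")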


## For confint() "object" is the fitted regression model
fitted_model = lm(Close_Amount ~ Claim_Amount, data = tsa_laptops)
confint(fitted_model)

With all of these functions, the confidence level can be adjusted using the conf.level argument (except for confint(), which uses the level argument).

## Example of a 99% confidence interval
cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount, conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  tsa_laptops$Claim_Amount and tsa_laptops$Close_Amount
## t = 4.7439, df = 1593, p-value = 2.284e-06
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.05396995 0.18111678
## sample estimates:
##       cor 
## 0.1180271
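
For confint(), the analogous adjustment uses the level argument. A quick sketch reusing the fitted_model object from earlier:

## Sketch: 99% confidence intervals for the regression coefficients
confint(fitted_model, level = 0.99)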

\(~\)

Lab

Michael Jordan is widely regarded as the best professional basketball player of all time. By the end of his playing career, which concluded after the 2002-03 NBA season, he had averaged 30.1 points per game across his regular season appearances. Throughout this lab you will work with three samples of varying sizes (\(n=5\), \(n=25\), and \(n=75\)) of regular season games played by Michael Jordan.

mj5 = read.csv("https://remiller1450.github.io/data/mj5.csv")
mj25 = read.csv("https://remiller1450.github.io/data/mj25.csv")
mj75 = read.csv("https://remiller1450.github.io/data/mj75.csv")

\(~\)

Part 1 - Calculating Confidence Intervals

Question #1: For this question you should use the mj75 sample. The variable pts records the outcome of interest in this question.

  • Part A: Consider using these data to construct a 95% confidence interval estimate for Michael Jordan’s career average points per game (in the regular season). Find both components of this interval’s margin of error. Hint: You may use the T-distribution menu on StatKey to find the value of \(c\).
  • Part B: Calculate the interval described in Part A by taking the sample mean (the point estimate) and adding and subtracting the margin of error arising from Part A.
  • Part C: Use the appropriate R function to verify the interval you calculated in Part B. It is okay if your interval is off by a tiny amount due to rounding.
  • Part D: Suppose you had instead calculated an 80% confidence interval in Parts B/C. Without performing the calculation, explain whether this interval would be wider or narrower.
  • Part E: Suppose you had used the mj5 sample rather than the mj75 sample. If this sample had the same sample average and sample standard deviation as mj75, and the confidence level were unchanged, would you expect it to produce a wider or narrower interval than the one you found in Parts B and C? Explain your reasoning.
  • Part F: Recall from the introduction that Michael Jordan’s actual career average in regular season games was 30.1 points per game. Based on this information, did your interval estimate in Part B/C succeed in containing the truth? Is this an expected or an unexpected result?

\(~\)

Question #2: For this question you should continue using the mj75 sample. This question focuses on Michael Jordan’s three-point shooting percentage, which you can compute from the sample using the sum() function with the three (made three-point shots) and threeatt (attempted three-point shots) variables.

  • Part A: Consider using this sample to construct an 80% confidence interval estimate for Michael Jordan’s career three-point shooting percentage (in the regular season). Find both components of this interval’s margin of error. Note: This population parameter is a proportion, as it is calculated by taking the number of made threes divided by the total number of attempted threes.
  • Part B: Calculate the 80% CI using the components you found in Part A and the sample proportion.
  • Part C: Use the appropriate R function to verify the interval you calculated in Part B. It is okay if your interval is off by a tiny amount due to rounding. Remember to change the confidence level in the function.
  • Part D: Michael Jordan’s actual career three-point shooting percentage is 32.7%. Based on this information, did your interval estimate in Part B/C succeed in containing the truth? Is this an expected or an unexpected result?

\(~\)

Question #3: For this question you will use the mj5, mj25, and mj75 samples.

  • Part A: Consider using each of these samples to construct a separate 95% confidence interval estimate for Michael Jordan’s career average points per game. How do both components of the margin of error change as the sample size increases? Explain your reasoning by finding and comparing these components, or by making a conceptual argument based upon where they come from.
  • Part B: Before any of these samples were collected, and assuming all assumptions are met (i.e., each interval is made using a valid procedure), which sample’s 95% confidence interval do you expect to be most likely to contain Michael Jordan’s true career average points per game? Briefly explain.
  • Part C: Calculate the mean of pts in each of these three samples. Given that Michael Jordan’s actual career average is 30.1 points per game, which sample produced the point estimate that is closest to the truth?
  • Part D: Use t.test() to calculate a separate 95% confidence interval estimate using each of these samples. Which of the intervals contained the true value? Is this surprising?

\(~\)

Question #4: For this question you should continue using the mj75 sample. This question will involve a variety of different research questions and your task is to find the requested confidence interval estimate for the relevant population parameter using an appropriate R function.

  • Part A: Suppose you would like to estimate the relationship between the number of turnovers Michael Jordan commits (variable tov) and the number of points he scores. Find a 90% confidence interval estimate for an appropriate descriptive statistic. Be sure you explore the data to determine the appropriate statistic.
  • Part B: Provide a one-sentence interpretation of the interval you found in Part A. According to the interval, is it plausible that these variables are unrelated?
  • Part C: Suppose you would like to assess whether Michael Jordan on average scored a different number of points when playing for the Washington Wizards than he did when playing for the Chicago Bulls. Find a 99% confidence interval estimate for the appropriate descriptive statistic.
  • Part D: Provide a one-sentence interpretation of the interval you found in Part C. According to the interval, is it plausible that the involved variables are unrelated?
  • Part E: Suppose you would like to assess whether there is an association between whether Michael Jordan scored more than 30 points and whether his team won. The outcome variable for this research question is created by the code provided below. You will need to determine/create the explanatory variable. After doing so, find a 95% confidence interval estimate for an appropriate descriptive statistic.
  • Part F: Provide a one-sentence interpretation of the interval you found in Part E. According to the interval, is it plausible that the involved variables are unrelated?

\(~\)

## Outcome variable for Part E/F
mj75$outcome = ifelse(substr(mj75$result, start=1, stop=1) == "W", "win", "loss")

\(~\)

Part 2 - Confidence Interval Validity

The approaches used in the previous section all rely upon a probability model to represent the sampling variability in the data. As we saw with hypothesis testing, if this probability model is inappropriate then the procedures relying upon it will be invalid.

By definition, confidence intervals are based upon the long-run success rate of the procedure used to make them, so we can assess their validity by looking at their long-run success rate in various situations. The following code provides a basic template for this investigation:

## Create a population with known characteristics
pop_values = rnorm(n=10000, mean = 30, sd = 5)  # Hypothetical population of 10,000 cases

## Set up characteristics of the investigation (sample size, number of repeats to get a long-run average)
n_reps = 500
sample_size = 10
lower_endpoint = upper_endpoint = numeric(length = n_reps)

## Repeatedly sample from the population and apply the CI procedure being studied
for(i in 1:n_reps){
  current_sample = sample(pop_values, size = sample_size)
  point_est = mean(current_sample)
  lower_endpoint[i] = point_est - 1.96*(sd(current_sample)/sqrt(sample_size))
  upper_endpoint[i] = point_est + 1.96*(sd(current_sample)/sqrt(sample_size))
}

## Check if the true mean is inside each sample's CI
contains_truth = ifelse(lower_endpoint > mean(pop_values) | upper_endpoint < mean(pop_values), "Fails", "Succeeds")
mean(contains_truth == "Succeeds")  # Proportion of CI's that contain the truth
## [1] 0.928

Because the sample() function randomly samples from pop_values, the output of this code block will differ each time you run it. Across repeated runs, however, the success rate consistently falls short of 95%, so this procedure does not produce valid 95% confidence intervals.
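
Because the questions below ask you to rerun this investigation under different settings, you may find it convenient to wrap the template in a function so the sample size is easy to change. The sketch below simply mirrors the loop above:

## Sketch: the same template wrapped in a function (calculation unchanged)
ci_success_rate = function(pop_values, sample_size, n_reps = 500){
  lower_endpoint = upper_endpoint = numeric(length = n_reps)
  for(i in 1:n_reps){
    current_sample = sample(pop_values, size = sample_size)
    point_est = mean(current_sample)
    lower_endpoint[i] = point_est - 1.96*(sd(current_sample)/sqrt(sample_size))
    upper_endpoint[i] = point_est + 1.96*(sd(current_sample)/sqrt(sample_size))
  }
  mean(lower_endpoint <= mean(pop_values) & upper_endpoint >= mean(pop_values))  # proportion of CI's containing the truth
}
ci_success_rate(pop_values, sample_size = 10)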

Question #5: For this question you should rely upon the code provided in this section.

  • Part A: Notice how the proportion of intervals that contain the true population mean (\(\mu = 30\)) is lower than would be expected for a target confidence level of 95%. Looking at the calculation performed inside the for loop, why do you think the success rate is too low? Explain your reasoning.
  • Part B: Instead of taking samples of size \(n=10\), modify the provided code to take samples of size \(n=400\). Does this procedure produce valid confidence intervals for very large sample sizes? You should run the loop a few times to make sure the success rate you see isn’t a fluke.
  • Part C: Reduce the sample size back down to \(n=10\) and modify another portion of the provided code such that valid 95% confidence intervals are produced.

\(~\)

Skew and Validity

Recall that the \(T\)-distribution is an appropriate probability model when the data come from a Normally distributed population, or when the sample size is large, in which case the Central Limit Theorem ensures an approximately Normal distribution for most sample statistics, including means and differences in means. We can explore this by creating a right-skewed population and sampling from it:

## Create a right-skewed population
library(ggplot2)   # needed for the histogram below
skewed_pop_values = rexp(10000, rate = 5)
ggplot() + geom_histogram(aes(x = skewed_pop_values), bins = 25)

We can now use an evaluation loop similar to the one provided in the previous section to understand the circumstances for which confidence intervals for a single mean created using the \(T\)-distribution are valid.

Question #6: For this question you will modify the evaluation loop from the previous section and use the skewed_pop_values data as the population.

  • Part A: Using samples of size \(n=10\), check the long-run success rate of intervals that get their calibration value, \(c\), from a \(T\)-distribution. Does this procedure seem valid for producing 95% confidence intervals when the population is highly right-skewed?
  • Part B: Now change the sample size to \(n=30\) (you’ll also need to change the value of \(c\)). Does this procedure seem valid for producing 95% confidence intervals when the population is highly right-skewed?

\(~\)

Conditions for \(Z\)-intervals

Recall that the \(Z\)-distribution is used as a probability model for a single proportion and a difference in proportions. A common rule of thumb for determining the appropriateness of this model is \(n \cdot \hat{p} \geq 10\) and \(n \cdot (1-\hat{p}) \geq 10\) (for both groups in two-sample scenarios).
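
As a quick illustration of checking this rule of thumb, the sketch below applies it to the single-proportion example from the onboarding section (the denied laptop claims in tsa_laptops):

## Sketch: large count check for the proportion of denied laptop claims
p_hat = mean(tsa_laptops$Status == "Denied")
n_obs = nrow(tsa_laptops)
c(n_obs * p_hat, n_obs * (1 - p_hat))  # both counts should be at least 10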

We can simulate binary outcomes for two groups using the code provided below:

pop_group_1 = sample(x = c("Yes", "No"), size = 20000, replace = TRUE, prob = c(0.5,0.5))
pop_group_2 = sample(x = c("Yes", "No"), size = 10000, replace = TRUE, prob = c(0.7,0.3))
binary_pop = data.frame(Group = c(rep("Group1", 20000), rep("Group2", 10000)), 
                        Outcome = c(pop_group_1, pop_group_2))

In this simulated population the first group is expected to have 50% “no” outcomes, while the second group is expected to have 30% “no” outcomes. There are also twice as many members in the first group as in the second group.

The loop below will randomly select a sample of size sample_size from the population created above, then it will use prop.test() to construct a 95% confidence interval estimate for the difference in the proportion of “no” between each group:

## Set up characteristics of the investigation (sample size, number of repeats to get a long-run average)
n_reps = 500
sample_size = 100
lower_endpoint = upper_endpoint = numeric(length = n_reps)

## Repeatedly sample from the population and apply the CI procedure being studied
for(i in 1:n_reps){
  sample_idx = sample(1:nrow(binary_pop), size = sample_size)
  current_sample = binary_pop[sample_idx, ]
  current_sample_table = table(current_sample$Group, current_sample$Outcome)
  prop_test_results = prop.test(x = current_sample_table, correct = FALSE)
  lower_endpoint[i] = prop_test_results$conf.int[1]
  upper_endpoint[i] = prop_test_results$conf.int[2]
}

## Check if the true difference in proportions (0.5 - 0.3 = 0.2) is inside each sample's CI
contains_truth = ifelse(lower_endpoint > 0.2 | upper_endpoint < 0.2, "Fails", "Succeeds")
mean(contains_truth == "Succeeds")  # Proportion of CI's that contain the truth
## [1] 0.934
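
The value 0.2 used in this check is the expected difference in the proportion of “no” outcomes (0.5 - 0.3). If you prefer to compare against the realized difference in the simulated population, a quick sketch is:

## Sketch: realized difference in the proportion of "No" outcomes between the two simulated groups
pop_props = prop.table(table(binary_pop$Group, binary_pop$Outcome), margin = 1)
pop_props["Group1", "No"] - pop_props["Group2", "No"]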

Question #7: For this question you will modify the loop provided above to vary the sample size.

  • Part A: Consider a sample size of 20 and the distributions of each group and outcome in the population. Are the large count conditions likely to be met for a sample of this size? Explain your reasoning.
  • Part B: Modify the sample size of the provided code to take samples of size 20 from the population created earlier in this section. Does the \(Z\)-interval calculation performed by the prop.test() function produce valid 95% confidence intervals for samples of this size? Briefly explain.
  • Part C: For this population, the most difficult large count condition to satisfy involves the “no” cases in the second group. At what sample size (approximately) would you expect this condition to be met? Explain your reasoning.
  • Part D: Run the provided code to investigate the validity of the confidence intervals at the sample size you identified in Part C. Briefly comment on whether the large count condition seems to be effective.

\(~\)

Question #8 (optional, extra credit): Modify the examples provided throughout this lab to show that binom.test() produces valid confidence intervals for one-sample categorical data with a small sample size (say \(n=10\)). You will have to generate your own population and modify the repeated sampling loop as part of this question (which is why it is extra credit).