\(~\)

Onboarding

This lab will cover the use of the Central Limit theorem and the Normal distribution to find 95% confidence intervals for a single proportion and for a difference in proportions. It will also cover an example demonstrating a problem that arises when using the Central Limit theorem and a Normal distribution to find confidence intervals for a single mean.

Proportions as Averages

The Central Limit theorem (CLT) is important because it gives us a probability model for the sample average. While at first this seems like a limited result, it is quite versatile for two reasons:

  1. A proportion is an average, so the CLT applies to proportions in the same way it applies to the sample mean.
  2. Properties of random variables are such that we can generalize the results of the CLT to linear combinations of sample averages. This means that we can use the results of the CLT to develop probability models for differences in means and differences in proportions (sketched below).
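As a brief preview of the second point: for independent samples, the variances of the two sample proportions add, which (a standard result stated here without derivation) yields the CLT-based model for a difference in proportions:

\[\hat{p}_1-\hat{p}_2\sim N\bigg(p_1-p_2,\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\bigg)\]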

To understand why a proportion is an average, first consider using one-hot encoding to create a dummy variable representation of a binary categorical variable:
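The table of example colleges isn’t reproduced here, so the short R sketch below builds the dummy variable from five hypothetical colleges (the Type values are chosen to match the calculation that follows):

## Five hypothetical colleges (Type values chosen to match the average below)
college_type = c("Private", "Public", "Public", "Private", "Private")

## One-hot encoded dummy variable: 1 when Type is "Public", 0 otherwise
type_public = as.numeric(college_type == "Public")
type_public
## [1] 0 1 1 0 0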

In this example, the average of the dummy variable Type="Public" will look like: \[\sum_{i=1}^{n}\frac{x_i}{n}=\frac{0 + 1 + 1 + 0 + 0}{5} = 0.4\]

Notice how the proportion of public colleges among these 5 cases is 0.4 (or 40%).

Central Limit theorem gives us an approximate probability model for the sampling distribution of a single proportion that looks like: \[\hat{p}\sim N\bigg(p,\sqrt{\frac{p(1-p)}{n}}\bigg)\]

  • \(\hat{p}\) is the proportion observed in the sample data
  • \(p\) is the proportion among the population
  • \(n\) is the size of the sample used to calculate the sample proportion
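We can check this model with a quick simulation. The sketch below (using a hypothetical population with \(p = 0.4\) and samples of size \(n = 50\)) compares the standard deviation of many simulated sample proportions against the model’s standard error:

## Simulate 5000 sample proportions using Bernoulli (0/1) draws with p = 0.4
set.seed(1)
p = 0.4
n = 50
p_hats = replicate(5000, mean(rbinom(n, size = 1, prob = p)))

## The simulated standard deviation should be close to sqrt(p*(1-p)/n)
sd(p_hats)
sqrt(p*(1-p)/n)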

We will not discuss the rationale behind the standard error used in this scenario, as it will appear on your formula sheet on future exams; but if you are curious, it comes from the variance of the binomial probability distribution.

This probability model suggests we can calculate \(P\%\) confidence intervals using the formula: \[\hat{p} \pm c * \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

  • Here \(c\) is a percentile of the \(N(0,1)\) distribution that governs the interval’s confidence level
  • Choosing \(c=1.96\) produces a 95% confidence interval estimate, as the middle 95% of the \(N(0,1)\) distribution is between \(-1.96\) and \(1.96\)
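For reference, these percentiles can be found using the qnorm() function (which also appears later in this lab). A few common choices of \(c\):

## Percentiles of N(0,1) used as "c" for common confidence levels
qnorm(0.95)   ## 90% confidence, c is roughly 1.645
qnorm(0.975)  ## 95% confidence, c is roughly 1.96
qnorm(0.995)  ## 99% confidence, c is roughly 2.576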

\(~\)

Validating a Confidence Interval Procedure

The confidence level of a confidence interval estimate describes a property of the procedure used to form the interval. More specifically, it means that the procedure for a \(P\%\) confidence interval will produce an interval containing the population parameter we are trying to estimate in \(P\%\) of repeated random samples.

Thus, we can determine whether a confidence interval procedure is valid by testing it out on a large number of random samples from a known population using a for-loop. Let’s see an example looking at the proportion of claims against the TSA for losses on laptops that were denied. The code below loads the data:

## Data that we'll use as a population
library(dplyr)
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)

The code below uses a for-loop to take 200 random samples of size \(n=30\) and create a confidence interval estimate for the proportion of denied claims using the data in each of these samples:

## Repeat the loop 200 times (ie: 200 random samples)
n_reps = 200

## Take random samples of size n=30
sample_size = 30

## Objects to store the lower and upper endpoint for the interval produced from each sample
lower_endpoint = upper_endpoint = numeric(length = n_reps)

## The loop itself (take many random samples)
for(i in 1:n_reps){
  current_sample = sample(tsa_laptops$Status, size = sample_size)
  p_hat = mean(current_sample == "Denied")
  lower_endpoint[i] = p_hat - 1.96*sqrt(p_hat*(1-p_hat)/sample_size)
  upper_endpoint[i] = p_hat + 1.96*sqrt(p_hat*(1-p_hat)/sample_size)
}

## Check if the true proportion is in the interval
true_p = mean(tsa_laptops$Status == "Denied")
contains_p = ifelse(lower_endpoint > true_p | upper_endpoint < true_p, "Fails", "Succeeds")

We can graph these results to get a quick visual assessment of the procedure’s validity:

## Graph the results using geom_errorbar()
library(ggplot2)
ggplot() + geom_errorbar(aes(x = 1:n_reps, ymin = lower_endpoint, ymax = upper_endpoint, col = contains_p))

We can also calculate the success rate of the procedure across the random samples we drew:

## Success rate
mean(contains_p == "Succeeds")
## [1] 0.925

In practice, we won’t have data on the entire population and we’ll only have one sample, so simulation experiments like this one are essential for building trust in the procedures we’ll apply in practical settings.

\(~\)

Lab

In this lab you’ll work with a subset of claims involving laptops from the “TSA claims” data set that was introduced in our previous lab. You’ll also need to use the dplyr and ggplot2 libraries:

## Libraries
library(dplyr)
library(ggplot2)

## Full data set
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv") # Note that the data set is somewhat large

## Data Subset (claims on laptops, except for one outlier)
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)

Calculating Confidence Intervals

The prop.test() function will use a Normal Probability model to calculate a confidence interval estimate of a single proportion using the following arguments:

  • x - the numerator of the sample proportion (ie: the count of the outcome of interest)
  • n - the denominator of the sample proportion (ie: the sample size)
  • conf.level - the confidence level of the interval

We’ll also use the argument correct = FALSE to turn off a continuity correction that is applied by default to increase the accuracy of the underlying Normal Approximation. This is purely to allow the intervals we calculate “by hand” to more closely align with those calculated using prop.test(), though they won’t perfectly align due to another minor methodological difference.
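If you’re curious about the correction’s impact, you can compare the two versions of the same interval directly. The counts below (18 out of 30) are purely hypothetical:

## Intervals with and without the continuity correction (hypothetical counts)
prop.test(x = 18, n = 30, conf.level = 0.95, correct = TRUE)$conf.int
prop.test(x = 18, n = 30, conf.level = 0.95, correct = FALSE)$conf.int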

Below is an example of prop.test():

## We'll work with this sample from tsa_laptops
set.seed(12345) # This makes it so that everyone's "random" sample will be the same
laptop_sample = sample(tsa_laptops$Status, size = 30)

## Find the numerator and denominator
denied_count = sum(laptop_sample == "Denied")
sample_size = length(laptop_sample)

## Use prop.test - it returns a lot of output, we'll use $conf.int to access only the confidence interval
prop.test(x = denied_count, n = sample_size, conf.level = 0.95, correct = FALSE)$conf.int
## [1] 0.4232036 0.7540937
## attr(,"conf.level")
## [1] 0.95

This suggests that we can conclude with 95% confidence that somewhere between 42.3% and 75.4% of claims made against the TSA on laptops are denied.

Let’s see how this interval compares with what we’d calculate “by hand” using the formula:

\[\text{Point Estimate} \pm \text{Margin of Error} = \hat{p} \pm c * \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

The code below fills out this formula using R, but you could do it with a calculator if you wanted:

## We have numerator and denominator for the sample proportion from before
p_hat = denied_count/sample_size

## We can get "c" for 95% confidence level from qnorm()
c = qnorm(0.975, mean = 0, sd = 1)

## Interval endpoints
p_hat - c*sqrt(p_hat*(1-p_hat)/sample_size) ## Lower endpoint
p_hat + c*sqrt(p_hat*(1-p_hat)/sample_size) ## Upper endpoint
## [1] 0.4246955
## [1] 0.7753045

Again, the slight difference in endpoints is due to prop.test() not using exactly the same procedure as ours, but the difference is trivial and we should still consider both approaches valid when the sample size is reasonably large.

Question #1: This question focuses on understanding the components of the example code provided in this section and being able to modify them to fit the needs of your analysis. To ensure consistency, you should use the following random sample of claim amounts (be sure to run the set.seed(1234) command):

## Use this random sample for Question #1
set.seed(1234)
q1_sample = sample(tsa_laptops$Claim_Amount, size = 30)

  • Part A: Modify the examples in this section to find a 90% confidence interval estimate using prop.test() for the population proportion of laptop claim amounts that exceed $1000, using the data contained in q1_sample.
  • Part B: Does your confidence interval estimate from Part A provide compelling statistical support for the conclusion that more than 50% of claims made against the TSA on laptops exceed $1000? Briefly explain.
  • Part C: Does your confidence interval from Part A provide compelling statistical support for the conclusion that more than 60% of claims made against the TSA on laptops exceed $1000? Briefly explain.
  • Part D: Calculate the same 90% confidence interval estimate using a “by hand” approach. This should closely (but not perfectly) resemble the interval you found in Part A.
  • Part E: Use the full population, tsa_laptops, to calculate the true proportion of claims that exceed $1000. Now that you know this proportion, did your confidence interval estimates succeed or fail?

\(~\)

Factors Influencing Confidence Interval Width

All else being equal, narrower intervals (those containing a shorter range of values) are preferable to wider intervals, as they give us a clearer sense of what the unknown population parameter might be. Thus, it’s important to understand the factors that influence the width of a confidence interval estimate.

Sample Size

One of the most influential factors in the width of a confidence interval estimate is the sample size of the data used to calculate the interval. Below are three random samples of different sizes that you should use to answer the question that follows:

## Three different random samples of various sizes
set.seed(1234)   # Run this to make sure you get the same samples as I do
laptop_sample1 = sample(tsa_laptops$Status, size = 20)
laptop_sample2 = sample(tsa_laptops$Status, size = 50)
laptop_sample3 = sample(tsa_laptops$Status, size = 80)

Question #2:

  • Part A: For each of the three different random samples given above, find a 95% confidence interval estimate for the proportion of claims that are denied. Next, calculate the width of each interval estimate by subtracting the lower endpoint from the upper endpoint.
  • Part B: Compare the confidence interval widths you found in Part A. How is sample size related to confidence interval width?
  • Part C: Does the width of a confidence interval change linearly as sample size increases (ie: a constant change in width for a 1-unit increase in sample size)? If it doesn’t, are there diminishing returns to taking a larger sample? Briefly explain your answer.
  • Part D: How does your answer in Part C relate to the “by hand” formula based upon the Central Limit theorem that was given earlier in this lab? Hint: look at where sample size appears in this formula.

\(~\)

Confidence Level

Another factor influencing confidence interval width that we have control over is the confidence level.

Question #3:

  • Part A: Using laptop_sample3 from Question #2, calculate the width of 85%, 90%, and 95% confidence interval estimates for the proportion of claims that are denied using the data in this sample.
  • Part B: Does the width of a confidence interval change linearly as the confidence level increases (ie: a constant change in width for each increase in confidence level)? If it doesn’t, how does width appear to increase when the confidence level is increased to higher and higher levels? Explain your answer.
  • Part C: Follow this link to an interactive visualization of the Normal distribution. Click the “two-tail” button and see that the middle 95% of the distribution is now shown to be between -1.96 and 1.96, noting that \(c=1.96\) is used in our confidence interval formula to ensure a 95% confidence level. Looking at the shape of this distribution and playing around with different values of \(c\), make an argument that the width of a confidence interval will increase at an increasing rate as the confidence level is increased.

\(~\)

Small Sample Sizes

Recall that the Central Limit theorem is an asymptotic result, meaning its approximation only becomes reliable as the sample size grows large. So, in addition to producing wide confidence intervals, small sample sizes may also produce invalid confidence intervals.

The lab’s introduction included a for-loop that demonstrated the validity of our “by hand” formula: \(\hat{p} \pm c * \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\). The code for this is pasted below for convenience:

## Repeat the loop 200 times (ie: 200 random samples)
n_reps = 200

## Take random samples of size n=30
sample_size = 30

## Objects to store the lower and upper endpoint for the interval produced from each sample
lower_endpoint = upper_endpoint = numeric(length = n_reps)

## The loop itself (take many random samples)
for(i in 1:n_reps){
  current_sample = sample(tsa_laptops$Status, size = sample_size)
  p_hat = mean(current_sample == "Denied")
  lower_endpoint[i] = p_hat - 1.96*sqrt(p_hat*(1-p_hat)/sample_size)
  upper_endpoint[i] = p_hat + 1.96*sqrt(p_hat*(1-p_hat)/sample_size)
}

## Check if the true proportion is in the interval
true_p = mean(tsa_laptops$Status == "Denied")
contains_p = ifelse(lower_endpoint > true_p | upper_endpoint < true_p, "Fails", "Succeeds")

## Success rate
mean(contains_p == "Succeeds")
## [1] 0.935

Question #4:

  • Part A: Modify the code that is provided to take random samples of size \(n=20\). Run the loop a few times and take note of the “success” rate of our “by hand” procedure for this sample size. Based upon what you see, does it seem like this method produces valid 95% confidence interval estimates for random samples of size \(n=20\)? Explain your answer.
  • Part B: Modify the provided code again to use a sample size of \(n=10\) and run the loop a few times. Does it seem like this method produces valid 95% confidence interval estimates for random samples of size \(n=10\)? Explain your answer.

\(~\)

Exact Confidence Intervals for a Single Proportion

When the sample size is small, approximating the distribution of the sample proportion is problematic due to the discrete set of possible proportions that can be observed. As an extreme example, consider a sample of size \(n=1\), for which the sample proportion can only be \(\hat{p}=0\) or \(\hat{p}=1\) with no other possibilities, regardless of the proportion in the population the sample was drawn from.

The figure originally shown here illustrated how the sampling distribution of \(\hat{p}\) changes when taking samples of various sizes from a population with \(p=0.5\).
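A rough stand-in for that figure can be created using exact binomial probabilities; the sketch below (the sample sizes are illustrative choices) plots the sampling distribution of \(\hat{p}\) for \(n = 5\), \(25\), and \(100\):

## Exact sampling distributions of the sample proportion when p = 0.5
library(ggplot2)
sizes = c(5, 25, 100)
plot_data = do.call(rbind, lapply(sizes, function(n){
  data.frame(size = n, p_hat = (0:n)/n, prob = dbinom(0:n, size = n, prob = 0.5))
}))
ggplot(plot_data, aes(x = p_hat, y = prob)) +
  geom_segment(aes(xend = p_hat, yend = 0)) +
  facet_wrap(~size, scales = "free_y")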

Fortunately, the discrete set of possibilities for the sample proportion allows us to calculate exact confidence intervals using the binomial probability distribution. You don’t need to be familiar with the binomial distribution for this class beyond the fact that it can be used to calculate exact confidence intervals for a single proportion. The code below demonstrates how to use binom.test() to calculate a 95% confidence interval:

## One of our random samples from before
set.seed(1234)   # Run this to make sure you get the same samples as I do
laptop_sample1 = sample(tsa_laptops$Status, size = 20)

## Confidence interval using binom.test()
denied_count = sum(laptop_sample1 == "Denied")
sample_size = length(laptop_sample1)
binom.test(x = denied_count, n = sample_size, conf.level = 0.95)$conf.int
## [1] 0.4078115 0.8460908
## attr(,"conf.level")
## [1] 0.95

For reference, we can compare this to the prop.test() confidence interval estimate using this same sample.

## Compare to prop.test() interval
prop.test(x = denied_count, n = sample_size, conf.level = 0.95)$conf.int
## [1] 0.4094896 0.8369133
## attr(,"conf.level")
## [1] 0.95

Notice the interval calculated using binom.test() is slightly wider. This will generally, but not always, be true, as intervals calculated using binom.test() use the exact form of the sampling distribution rather than an approximation of it.

Question #5: For this question you should use the two random samples that are drawn by the code provided below:

## Two random samples
set.seed(1234) ## Run this to ensure you get the same samples as me
q5_sample1 = sample(tsa_laptops$Claim_Site, size = 10)
q5_sample2 = sample(tsa_laptops$Claim_Site, size = 100)

  • Part A: Using the first random sample, q5_sample1, calculate a 90% confidence interval estimate for the proportion of laptop claims occurring at security checkpoints (the category "Checkpoint") using binom.test().
  • Part B: Now use prop.test() to calculate a 90% confidence interval estimate for the same proportion. How does this interval estimate compare to the one you found in Part A?
  • Part C: This time use the second random sample, q5_sample2, to calculate a 90% confidence interval estimate for the proportion of laptop claims occurring at security checkpoints using binom.test().
  • Part D: Finally, use prop.test() to calculate a 90% confidence interval estimate for the same proportion using the second random sample, q5_sample2. How does this interval estimate compare to the one you found in Part C?
  • Part E: Using results from the previous parts of this question, explain why calculating an exact binomial confidence interval is more important when working with a smaller-sized sample than it is when working with a larger-sized sample.

Question #6 (extra credit): Using the for-loops given in this lab as a starting point, show that binom.test() is able to produce valid 95% confidence intervals for samples of size \(n=10\).