\(~\)

Onboarding

The basic idea behind a confidence interval is to combine a point estimate (a descriptive statistic calculated from sample data) with a margin of error to produce an interval estimate whose margin of error is calibrated to achieve a long-run success rate. This generally takes the form: \[\text{Point Estimate} \pm c \times SE\]

  • The value \(c\) is a critical value taken from a probability distribution to capture the middle P% of that distribution, with P% corresponding to the confidence level of the interval.
  • The standard error, \(SE\), reflects the expected variability in the point estimate. It is typically derived from the Central Limit Theorem or related results, so it looks somewhat different for different descriptive statistics (proportions, differences in proportions, means, etc.).
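In R, these critical values come directly from quantile functions. A quick sketch (base R only, no lab data needed):

```r
## Critical values c for common confidence levels.
## A 95% interval captures the middle 95%, leaving 2.5% in each tail.
qnorm(0.975)        # Normal critical value for 95% confidence, about 1.96
qnorm(0.95)         # Normal critical value for 90% confidence, about 1.645
qt(0.975, df = 29)  # t critical value for 95% confidence when n = 30 (df = n - 1)
```

Note that the \(t\) critical value is slightly larger than the Normal one, reflecting the extra uncertainty that comes from estimating the population standard deviation.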

The table below presents standard error formulas for common descriptive statistics, as well as conditions for when using a standard error formula and probability model based upon the Central Limit Theorem is generally regarded as reasonable:

| Descriptive Statistic | Standard Error | Conditions | Probability Model |
|---|---|---|---|
| \(\hat{p}\) | \(\sqrt{\frac{p(1 - p)}{n}}\) | \(np \geq 10\) and \(n(1-p) \geq 10\) | Normal distribution |
| \(\bar{x}\) | \(\frac{\sigma}{\sqrt{n}}\) | Normal population or \(n \geq 30\) | \(t\)-distribution |
| \(\hat{p}_1 - \hat{p}_2\) | \(\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\) | \(n_ip_i \geq 10\) and \(n_i(1-p_i) \geq 10\) for \(i \in \{1,2\}\) | Normal distribution |
| \(\bar{x}_1 - \bar{x}_2\) | \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\) | Normal populations or \(n_1 \geq 30\) and \(n_2 \geq 30\) | \(t\)-distribution |
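To see how the pieces of the table fit together, here is a sketch of the first row applied "by hand" with made-up counts (90 successes in \(n = 200\) trials), substituting the sample proportion \(\hat{p}\) for the unknown \(p\):

```r
## Made-up example: 90 successes out of n = 200
p_hat <- 90 / 200
n <- 200
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error (first row of the table)
c_val <- qnorm(0.975)                 # Normal critical value for 95% confidence
p_hat + c(-1, 1) * c_val * se         # interval endpoints, roughly (0.381, 0.519)
```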

You will rarely (if ever) be expected to use these formulas to calculate confidence interval estimates “by hand”, and you will be given this table on the front page of Exam 2.

This lab will also cover confidence intervals for Pearson’s correlation coefficient and odds ratios. These are not included in the table above because their intervals are not constructed using the \(\text{Point Estimate} \pm c \times SE\) pattern followed by the other descriptive statistics.

\(~\)

Lab

In this lab you’ll work with a random sample of \(n=200\) games played in the National Football League (NFL) between 2018 and 2023. These data contain the following variables of interest:

  • season - the year in which the game was played
  • game_type - whether the game was a regular season game (game_type = Reg) or a playoff game (game_type = Playoff)
  • game_outcome - whether the game’s home team won (game_outcome = 1), lost (game_outcome = 0), or tied (game_outcome = 0.5)
  • home_score - the points scored by the home team
  • away_score - the points scored by the away team
  • score_diff - the score of the home team minus the score of the away team

You’ll also need to use the dplyr and ggplot2 libraries:

## Libraries
library(dplyr)
library(ggplot2)

## Data that you'll use
nfl = read.csv("https://remiller1450.github.io/data/nfl_sample.csv")

## Data used in examples
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)

\(~\)

Differences in Proportions

In our previous lab we saw how to use the binom.test() and prop.test() functions to find confidence interval estimates for a single proportion. The prop.test() function can also be used to find confidence interval estimates for a difference in proportions. This is useful because it allows us to compare the risks/success rates across two different groups while accounting for sampling variability as a possible explanation for the observed difference.

To use prop.test() for a difference in proportions we must supply the numerator and denominator of each proportion. While there are quicker approaches, the example shown below is easy to understand and apply to any scenario. In this example we compare the proportions of denied laptop claims at security checkpoints versus checked baggage.

## First find the numerator and denominator of each proportion
numerator_checked_baggage = sum(tsa_laptops$Status == "Denied" & tsa_laptops$Claim_Site == "Checked Baggage")
numerator_checkpoint = sum(tsa_laptops$Status == "Denied" & tsa_laptops$Claim_Site == "Checkpoint")

denominator_checked_baggage = sum(tsa_laptops$Claim_Site == "Checked Baggage")
denominator_checkpoint =  sum(tsa_laptops$Claim_Site == "Checkpoint")

## Then give them to prop.test(), being careful of the order
prop.test(x = c(numerator_checked_baggage, numerator_checkpoint), 
          n = c(denominator_checked_baggage, denominator_checkpoint),
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(numerator_checked_baggage, numerator_checkpoint) out of c(denominator_checked_baggage, denominator_checkpoint)
## X-squared = 96.204, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.2081118 0.3133901
## sample estimates:
##   prop 1   prop 2 
## 0.677747 0.416996

Here I’ve printed the entire set of output from prop.test() so that we can see that the function is considering “prop 1” to be the proportion of checked baggage claims that are denied, or numerator_checked_baggage/denominator_checked_baggage, which is 734/1083 or 0.677.
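As for the quicker approaches mentioned above: prop.test() also accepts a two-column matrix (or table) of counts, where each row is a group, the first column holds "successes", and the second holds "failures". A sketch with hypothetical counts:

```r
## Hypothetical counts: rows = groups, columns = successes, failures
counts <- matrix(c(30, 20,    # group 1: 30 successes, 20 failures
                   15, 35),   # group 2: 15 successes, 35 failures
                 nrow = 2, byrow = TRUE)
prop.test(counts, conf.level = 0.95)
```

Here "prop 1" is 30/50 = 0.6 and "prop 2" is 15/50 = 0.3, so the ordering of the rows determines the direction of the difference, just as the ordering of the vectors did above.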

Question #1:

  • Part A: Find a 90% confidence interval estimate for the difference in the proportion of regular season games won by the home team (game_outcome = 1) relative to the proportion of playoff games won by the home team.
  • Part B: Does your confidence interval from Part A provide compelling statistical evidence of an association between the type of game and the likelihood of the game being won by the home team which cannot be explained by sampling variability (random chance)? Briefly explain.
  • Part C: The confidence interval estimate you found in Part A is based upon a Normal probability model, which only provides a good approximation for the sampling distribution when certain assumptions are met (see the table near the start of the lab). Given these assumptions, do you believe the confidence interval you found in Part A is reliable? Briefly explain.

\(~\)

Odds Ratios

As we discussed earlier in the semester, odds ratios are another common method for describing an association between two categorical variables, and they should be preferred over differences in proportions for rare outcomes.

Unlike the other descriptive statistics we’ve covered, the odds ratio takes values in \([0,\infty)\), so our generic formula of \(\text{Point Estimate} \pm \text{Margin of Error}\) isn’t applicable. Instead, we will rely upon the fisher.test() function to find confidence intervals for odds ratios. Something you should notice is that the point estimate (the sample odds ratio) is not at the center of these interval estimates. An example is shown below:

## fisher.test() only accepts binary categorical variables or 2x2 tables, so we need a few extra steps to prepare our data

# First, filter the data to only include cases in the "checkpoint" and "checked baggage" groups
tsa_laptops_subset = filter(tsa_laptops, Claim_Site %in% c("Checkpoint", "Checked Baggage"))

# Second, use ifelse() to create a binary variable for claims that are "denied"
tsa_laptops_subset$Denied = ifelse(tsa_laptops_subset$Status == "Denied", "Denied", "Settled or Accepted")

## Now we can create a 2x2 table
status_site_table = table(tsa_laptops_subset$Claim_Site, tsa_laptops_subset$Denied)

## Let's make a note of the point estimate
(status_site_table[1,1]/status_site_table[1,2])/
  (status_site_table[2,1]/status_site_table[2,2])
## [1] 2.940426
## Then use fisher.test() on the table (we'll print the full output; note its
## "odds ratio" estimate is a conditional MLE, so it differs very slightly from
## the sample odds ratio computed above)
fisher.test(x=status_site_table, conf.level = 0.95)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  status_site_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.350539 3.678329
## sample estimates:
## odds ratio 
##   2.938064
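To see why the interval above isn’t centered at the point estimate: Normal-approximation intervals for an odds ratio are built on the log scale (where log odds ratios are approximately Normal) and then exponentiated, and exponentiation destroys the symmetry. A sketch with made-up counts (this Woolf-style interval is only an approximation, and is not what fisher.test() computes):

```r
## Made-up 2x2 counts
a <- 40; b <- 10   # group 1: outcome yes / outcome no
c2 <- 20; d <- 30  # group 2: outcome yes / outcome no
or_hat <- (a / b) / (c2 / d)             # sample odds ratio = 6
se_log <- sqrt(1/a + 1/b + 1/c2 + 1/d)   # SE of the log odds ratio
exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log)  # roughly (2.45, 14.68)
```

The resulting interval stretches much farther above 6 than below it, mirroring the asymmetry seen in the fisher.test() output.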

When using this approach you should be mindful of the odds ratio you are calculating. That is, this example found the odds ratio comparing the odds of a claim being “denied” in the “checked baggage” group relative to the odds of a claim being denied in the “checkpoint” group. This is because “denied” is the first column of the frequency table, and “checked baggage” is the first row.

If we want to change this we can use the factor() function as described in Lab 7.
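A minimal sketch of that factor() idea (hypothetical vector; the same pattern applies to the variables in this lab):

```r
## The order of levels controls row/column order in table();
## list the category you want first at the front of levels
site <- c("Checkpoint", "Checked Baggage", "Checkpoint")
site <- factor(site, levels = c("Checkpoint", "Checked Baggage"))
levels(site)  # "Checkpoint" now comes first in any table built from site
```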

Question #2:

  • Part A: Find the odds ratio that compares the odds of the home team winning a regular season game relative to the odds of the home team winning a playoff game. Hint: when game_outcome = 0.5 the game was tied. You should use ifelse() to create a binary variable that groups ties and losses together as “not wins”.
  • Part B: Find a 95% confidence interval estimate involving the odds ratio you calculated in Part A. Hint: you will need to use factor() to reorder categories prior to creating your frequency table (see Lab 7 for details/examples).
  • Part C: Does your confidence interval from Part B provide compelling statistical evidence of an association between the type of game and the likelihood of the game being won by the home team which cannot be explained by sampling variability (random chance)? Briefly explain.

\(~\)

Means and Differences in Means

For large sample sizes, or small samples where it is plausible that the data came from a Normally distributed population, confidence interval estimates for a single mean and difference in means should be found using the t-distribution as the underlying probability model.
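Applying the table’s formula for a single mean by hand looks like the following sketch (made-up data, substituting the sample standard deviation \(s\) for the unknown \(\sigma\)):

```r
## Made-up sample of n = 8 observations
x <- c(12, 15, 9, 14, 11, 13, 10, 16)
n <- length(x)
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)  # 95% interval
```

This reproduces the interval reported by t.test(x, conf.level = 0.95) exactly.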

The t.test() function provides these confidence interval estimates as part of its default output:

## Confidence interval example for a single mean
t.test(x = tsa_laptops$Claim_Amount, conf.level = 0.99)
## 
##  One Sample t-test
## 
## data:  tsa_laptops$Claim_Amount
## t = 49.071, df = 1594, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  1510.687 1678.281
## sample estimates:
## mean of x 
##  1594.484
## Confidence interval example for a difference in means
t.test(Claim_Amount ~ Denied, data = tsa_laptops_subset, conf.level = 0.99)
## 
##  Welch Two Sample t-test
## 
## data:  Claim_Amount by Denied
## t = 5.6768, df = 1520.1, p-value = 1.641e-08
## alternative hypothesis: true difference in means between group Denied and group Settled or Accepted is not equal to 0
## 99 percent confidence interval:
##  197.2246 525.6245
## sample estimates:
##              mean in group Denied mean in group Settled or Accepted 
##                          1741.419                          1379.995

A few things to note:

  • For a single mean, we provide t.test() with the entire vector containing the numeric variable of interest in our sample data.
  • For a difference in means, we use the same formula syntax used by lm() to fit regression models. The quantitative variable (outcome) comes first on the left side of the ~, and the categorical variable defining the two groups comes second on the right side of the ~
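If you only need the interval itself, it can be extracted from the result with $conf.int (sketch with made-up data; the same extraction works for the two-sample formula syntax):

```r
## Made-up vector of score differences
x <- c(3, -7, 10, 0, 14, -3, 7, 21)
t.test(x, conf.level = 0.95)$conf.int  # just the interval endpoints
```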

Question #3:

  • Part A: Find a 95% confidence interval estimate for the average score difference of NFL games.
  • Part B: Does the confidence interval estimate you found in Part A provide compelling statistical evidence that, on average, home teams outscore away teams in NFL games? Briefly explain.
  • Part C: Use the ifelse() function to create a new binary variable that groups together games in weeks 1-9 as “early season” and games in weeks 10-21 as “late season”.
  • Part D: Find a 95% confidence interval estimate for the difference in mean score differentials for early season versus late season games (using your new variable from Part C).
  • Part E: Does the interval estimate you found in Part D provide compelling statistical evidence of an association between whether a game was played in the early or late portion of the season and the score differential? Briefly explain.

\(~\)

Pearson’s Correlation Coefficient

Confidence interval estimates for Pearson’s correlation coefficient are found using the cor.test() function. The example shown below finds a 99% confidence interval estimate for the correlation between the claim amount and close amount (the amount actually paid out by the TSA) for claims involving laptops:

cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount, conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  tsa_laptops$Claim_Amount and tsa_laptops$Close_Amount
## t = 4.7439, df = 1593, p-value = 2.284e-06
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.05396995 0.18111678
## sample estimates:
##       cor 
## 0.1180271

Notice how the correlation between these two variables is weak (\(r = 0.118\)), but the confidence interval estimate suggests that we can conclude with high statistical confidence that these variables are associated. All else being equal, stronger associations are less likely to be explained by sampling variability, but that does not imply that weak correlations observed in sample data suggest independence.
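For context, cor.test()’s interval for Pearson’s correlation is based on Fisher’s z-transformation: \(z = \text{atanh}(r)\) is approximately Normal with standard error \(1/\sqrt{n-3}\), and back-transforming with tanh() is what makes the interval asymmetric around \(r\). A sketch using the estimates from the output above:

```r
r <- 0.1180271                 # sample correlation from the output above
n <- 1595                      # df = 1593, so n = 1595
z <- atanh(r)                  # Fisher's z-transformation
se <- 1 / sqrt(n - 3)          # approximate standard error of z
tanh(z + c(-1, 1) * qnorm(0.995) * se)  # matches the 99% interval above
```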

Question #4:

  • Part A: Find a 90% confidence interval estimate for the correlation between home team scores and away team scores in NFL games.
  • Part B: Based upon your findings in Part A, can you conclude with 90% confidence that there’s an association between how many points the home team scores and how many points the away team scores? Briefly explain.
  • Part C: Now find an 80% confidence interval estimate for the correlation between home team and away team scores. Based upon this interval, can you conclude with 80% confidence that there’s an association between home team and away team scores? Briefly explain.