\(~\)
The basic idea behind a confidence interval is to combine a point estimate (a descriptive statistic found using sample data) and a margin of error to produce an interval estimate whose margin of error is calibrated to achieve a long-run success rate. This generally takes the form: \[\text{Point Estimate} \pm c*SE\] where \(c\) is taken from a probability distribution to reflect the middle P% of that distribution, with P% corresponding to the confidence level of the interval.
The table below presents standard error formulas for common descriptive statistics, as well as conditions under which using a standard error formula and a probability model based upon the Central Limit Theorem is generally regarded as reasonable:
| Descriptive Statistic | Standard Error | Conditions | Probability Model |
|---|---|---|---|
| \(\hat{p}\) | \(\sqrt{\frac{p(1 - p)}{n}}\) | \(np \geq 10\) and \(n(1-p) \geq 10\) | Normal distribution |
| \(\bar{x}\) | \(\frac{\sigma}{\sqrt{n}}\) | Normal population or \(n \geq 30\) | \(t\)-distribution |
| \(\hat{p}_1 - \hat{p}_2\) | \(\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\) | \(n_ip_i \geq 10\) and \(n_i(1-p_i) \geq 10\) for \(i \in \{1,2\}\) | Normal distribution |
| \(\bar{x}_1 - \bar{x}_2\) | \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\) | Normal populations or \(n_1 \geq 30\) and \(n_2 \geq 30\) | \(t\)-distribution |
You will rarely (if ever) be expected to use these formulas to calculate confidence interval estimates “by hand”, and you will be given this table on the front page of Exam 2.
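Even so, it can be helpful to see how the pieces of the generic formula fit together. The sketch below is purely illustrative (it uses a small made-up sample) and builds a 95% confidence interval for a single mean directly from the table, then checks it against the t.test() function introduced later in this lab:
## Illustration only: a 95% CI for a single mean built directly from the generic formula
x = c(12, 15, 9, 22, 18, 14, 17, 11, 20, 16)   # a small hypothetical sample
point_estimate = mean(x)                        # point estimate (x-bar)
SE = sd(x)/sqrt(length(x))                      # standard error from the table (sigma estimated by the sample sd)
c_multiplier = qt(0.975, df = length(x) - 1)    # value capturing the middle 95% of a t-distribution
point_estimate + c(-1, 1)*c_multiplier*SE       # Point Estimate +/- c*SE
t.test(x, conf.level = 0.95)$conf.int           # matches the interval reported by t.test()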
This lab will also cover confidence intervals for Pearson’s correlation coefficient and odds ratios, though these are not mentioned in the above table because their associated methodologies don’t fit the pattern of how confidence intervals are found with these other descriptive statistics.
\(~\)
In this lab you’ll work with a random sample of \(n=200\) games played in the National Football League (NFL) between 2018 and 2023. These data contain the following variables of interest:
- season - the year in which the game was played
- game_type - whether the game was a regular season game (type = Reg) or a playoff game (type = Playoff)
- game_outcome - whether the game’s home team won (game_outcome = 1) or lost (game_outcome = 0)
- home_score - the points scored by the home team
- away_score - the points scored by the away team
- score_diff - the score of the home team minus the score of the away team

You’ll also need to use the dplyr and ggplot2 libraries:
## Libraries
library(dplyr)
library(ggplot2)
## Data that you'll use
nfl = read.csv("https://remiller1450.github.io/data/nfl_sample.csv")
## Data used in examples
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)
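Before starting, you may want to preview the NFL data to confirm it loaded correctly and to see the variable names described above (optional):
## Optional: preview the NFL data and its variables
head(nfl)
dim(nfl)   # should show 200 rows (one per sampled game)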
\(~\)
We’ve already seen how to use the binom.test() and prop.test() functions to find confidence interval estimates for a single proportion in our previous lab. The prop.test() function can also be used to find confidence interval estimates for a difference in proportions. This is useful because it allows us to compare the risks/success rates across two different groups while taking into account sampling variability as a possible explanation for the observed difference.
To use prop.test() for a difference in proportions we must supply the numerator and denominator of each proportion. While there are quicker approaches, the example shown below is easy to understand and apply to any scenario. In this example we compare the proportions of denied laptop claims at security checkpoints versus checked baggage.
## First find the numerator and denominator of each proportion
numerator_checked_baggage = sum(tsa_laptops$Status == "Denied" & tsa_laptops$Claim_Site == "Checked Baggage")
numerator_checkpoint = sum(tsa_laptops$Status == "Denied" & tsa_laptops$Claim_Site == "Checkpoint")
denominator_checked_baggage = sum(tsa_laptops$Claim_Site == "Checked Baggage")
denominator_checkpoint = sum(tsa_laptops$Claim_Site == "Checkpoint")
## Then give them to prop.test(), being careful of the order
prop.test(x = c(numerator_checked_baggage, numerator_checkpoint),
n = c(denominator_checked_baggage, denominator_checkpoint),
conf.level = 0.95)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(numerator_checked_baggage, numerator_checkpoint) out of c(denominator_checked_baggage, denominator_checkpoint)
## X-squared = 96.204, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.2081118 0.3133901
## sample estimates:
## prop 1 prop 2
## 0.677747 0.416996
Here I’ve printed the entire set of output from prop.test() so that we can see that the function is considering “prop 1” to be the proportion of checked baggage claims that are denied, or numerator_checked_baggage/denominator_checked_baggage, which is 734/1083 or 0.677.
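If you only want the interval itself, you can store the results of prop.test() and extract the conf.int component; the sketch below also verifies the value reported as “prop 1”:
## Verify "prop 1" by computing it directly
numerator_checked_baggage/denominator_checked_baggage
## Store the prop.test() results and extract only the confidence interval
prop_results = prop.test(x = c(numerator_checked_baggage, numerator_checkpoint),
                         n = c(denominator_checked_baggage, denominator_checkpoint),
                         conf.level = 0.95)
prop_results$conf.int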
Question #1: Use prop.test() to find a 95% confidence interval estimate for the difference in the proportion of regular season games won by the home team (game_outcome = 1) relative to the proportion of playoff games won by the home team.
\(~\)
As we discussed earlier in the semester, odds ratios are another common method for describing an association between two categorical variables, and they should be preferred over differences in proportions for rare outcomes.
Unlike the other descriptive statistics, the odds ratio takes on values in \([0,\infty)\), so our generic formula of \(\text{Point Estimate} \pm \text{Margin of Error}\) isn’t applicable. We will rely upon the fisher.test() function to find confidence intervals for odds ratios. Something you should notice is that the point estimate (the sample odds ratio) is not in the center of these interval estimates. An example is shown below:
## fisher.test() only accepts binary categorical variables or 2x2 tables, so we need a few extra steps to prepare our data
# First, filter the data to only include cases in the "checkpoint" and "checked baggage" groups
tsa_laptops_subset = filter(tsa_laptops, Claim_Site %in% c("Checkpoint", "Checked Baggage"))
# Second, use ifelse() to create a binary variable for claims that are "denied"
tsa_laptops_subset$Denied = ifelse(tsa_laptops_subset$Status == "Denied", "Denied", "Settled or Accepted")
## Now we can create a 2x2 table
status_site_table = table(tsa_laptops_subset$Claim_Site, tsa_laptops_subset$Denied)
## Let's make a note of the point estimate
(status_site_table[1,1]/status_site_table[1,2])/
(status_site_table[2,1]/status_site_table[2,2])
## [1] 2.940426
## Then use fisher.test() on the table (we'll print the full output to confirm the OR)
fisher.test(x=status_site_table, conf.level = 0.95)
##
## Fisher's Exact Test for Count Data
##
## data: status_site_table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.350539 3.678329
## sample estimates:
## odds ratio
## 2.938064
When using this approach you should be mindful of the odds ratio you are calculating. That is, this example found the odds ratio comparing the odds of a claim being “denied” in the “checked baggage” group relative to the odds of a claim being denied in the “checkpoint” group. This is because “denied” is the first column of the frequency table, and “checked baggage” is the first row.
If we want to change this we can use the factor() function as described in Lab 7.
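As a brief sketch of what that might look like (one possible reordering, using the category names from this example), re-leveling Claim_Site with factor() before building the table flips which group sits in the first row:
## Sketch: copy the data, then use factor() to control the table's row ordering
tsa_relevel = tsa_laptops_subset
tsa_relevel$Claim_Site = factor(tsa_relevel$Claim_Site,
                                levels = c("Checkpoint", "Checked Baggage"))
reordered_table = table(tsa_relevel$Claim_Site, tsa_relevel$Denied)
fisher.test(x = reordered_table, conf.level = 0.95)   # now compares checkpoint relative to checked baggage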
Question #2: Use fisher.test() to find a 95% confidence interval estimate for the odds ratio comparing the odds of a home team win in regular season games relative to the odds of a home team win in playoff games. Note that game_outcome = 0.5 indicates the game was tied; you should use ifelse() to create a binary variable that groups ties and losses together as “not wins”. You may also need to use factor() to reorder categories prior to creating your frequency table (see Lab 7 for details/examples).
\(~\)
For large sample sizes, or small samples where it is plausible that the data came from a Normally distributed population, confidence interval estimates for a single mean and difference in means should be found using the t-distribution as the underlying probability model.
The t.test() function provides these confidence interval estimates as part of its default output:
## Confidence interval example for a single mean
t.test(x = tsa_laptops$Claim_Amount, conf.level = 0.99)
##
## One Sample t-test
##
## data: tsa_laptops$Claim_Amount
## t = 49.071, df = 1594, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 1510.687 1678.281
## sample estimates:
## mean of x
## 1594.484
## Confidence interval example for a difference in means
t.test(Claim_Amount ~ Denied, data = tsa_laptops_subset, conf.level = 0.99)
##
## Welch Two Sample t-test
##
## data: Claim_Amount by Denied
## t = 5.6768, df = 1520.1, p-value = 1.641e-08
## alternative hypothesis: true difference in means between group Denied and group Settled or Accepted is not equal to 0
## 99 percent confidence interval:
## 197.2246 525.6245
## sample estimates:
## mean in group Denied mean in group Settled or Accepted
## 1741.419 1379.995
A few things to note:
- For a confidence interval estimate of a single mean, we provide t.test() with the entire vector containing the numeric variable of interest in our sample data.
- For a difference in means, we use a formula, similar to how we used lm() to fit regression models. The quantitative variable (outcome) comes first on the left side of the ~, and the categorical variable defining the two groups comes second on the right side of the ~.
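If you want to double-check which group is which, the group means printed by t.test() can be reproduced with dplyr (a quick sketch):
## Optional check: reproduce the group means shown in the t.test() output
tsa_laptops_subset %>%
  group_by(Denied) %>%
  summarize(mean_claim = mean(Claim_Amount))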
Question #3: Use the ifelse() function to create a new binary variable that groups together games in weeks 1-9 as “early season” and games in weeks 10-21 as “late season”, then use t.test() to find a confidence interval estimate for a difference in means across these two groups.
\(~\)
Confidence interval estimates for Pearson’s correlation coefficient are found using the cor.test() function. The example shown below finds a 99% confidence interval estimate for the correlation between the claim amount and the close amount (the amount actually paid out by the TSA) for claims involving laptops:
cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount, conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: tsa_laptops$Claim_Amount and tsa_laptops$Close_Amount
## t = 4.7439, df = 1593, p-value = 2.284e-06
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.05396995 0.18111678
## sample estimates:
## cor
## 0.1180271
Notice how the correlation between these two variables is weak (\(r = 0.118\)), but the confidence interval estimate suggests that we can conclude with high statistical confidence that these variables are associated. All else being equal, stronger associations are less likely to be explained by sampling variability, but that does not imply that weak correlations observed in sample data suggest independence.
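To get a sense of what a correlation of roughly 0.12 looks like, you can plot the two variables with ggplot2 (a quick sketch):
## Optional: visualize the weak association between claim and close amounts
ggplot(tsa_laptops, aes(x = Claim_Amount, y = Close_Amount)) +
  geom_point(alpha = 0.3) +
  labs(x = "Claim Amount", y = "Close Amount")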
Question #4: