\(~\)
Later in this lab I will ask you to explore the validity of a confidence interval procedure when the sample size varies. You will use the code provided below as a template for this:
## Example that we'll use as a population
library(dplyr)
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)
We’ve already seen the functions needed to create confidence interval estimates for many common scenarios. As review, these functions are summarized in the table below:
| Scenario | Function | Notes |
|---|---|---|
| Single Proportion | prop.test() or binom.test() |
prop.test() is valid when \(n
\cdot \hat{p} \geq 10\) and $n (1-) $ |
| Difference in Proportions | prop.test() |
Valid when single proportion conditions are met for each group |
| Single Mean | t.test() |
Valid when \(n \geq 30\) or Normally distributed data |
| Difference in Means | t.test() |
Valid when \(n_1 \text{ and } n_2 \geq 30\) or Normally distributed data |
| Odds Ratio | fisher.test() |
Asymmetric confidence interval |
| Correlation | cor.test() |
Allows either Pearson’s or Spearman’s correlation |
| Regression Coefficients | confint() |
The fitted regression model must be given as the function’s input |
You should note that each function expects a different type of input. Below are a few quick examples so you don’t need to go back to our previous labs.
## For prop.test() "x" is the numerator(s) and "n" is the denominator(s) of the involved proportions
numerator_count = sum(tsa_laptops$Status == "Denied")
denominator_count = length(tsa_laptops$Status)
prop.test(x = numerator_count, n = denominator_count, correct = FALSE)
## For t.test() "x" is either the variable (one-sample) or a formula of the form `outcome ~ explanatory` and "data" is the data set.
t.test(x = tsa_laptops$Claim_Amount)
## For fisher.test() "x" is a 2 x 2 contingency table (made using table()) or a 2 x 2 data.frame() object (when entering our own summarized data)
tsa_laptops$Chicago_binary = ifelse(tsa_laptops$Airport_Code == "ORD", "Chicago", "Elsewhere")
tsa_laptops$Denied_binary = ifelse(tsa_laptops$Status == "Denied", "Denied", "Not Denied")
my_table = table(tsa_laptops$Chicago_binary, tsa_laptops$Denied_binary )
fisher.test(my_table)
## For cor.test() "x" and "y" are each of the two involved variables
cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount)
## For confint() "object" is the fitted regression model
fitted_model = lm(Close_Amount ~ Claim_Amount, data = tsa_laptops)
confint(fitted_model)
With all of these functions, the confidence level can be
adjusted using the conf.level argument (except for
confint(), which uses the level argument).
## Example of a 99% confidence interval
cor.test(x = tsa_laptops$Claim_Amount, y = tsa_laptops$Close_Amount, conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: tsa_laptops$Claim_Amount and tsa_laptops$Close_Amount
## t = 4.7439, df = 1593, p-value = 2.284e-06
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.05396995 0.18111678
## sample estimates:
## cor
## 0.1180271
\(~\)
Michael Jordan is widely regarded as the best professional basketball player of all time. At the end of his playing career after the 2002-03 NBA season he averaged 30.1 points per game in his regular season appearances. Throughout this lab you will work with three different samples of varying sizes (\(n=5\), \(n=25\), and \(n=75\)) of regular season games played by Michael Jordan.
mj5 = read.csv("https://remiller1450.github.io/data/mj5.csv")
mj25 = read.csv("https://remiller1450.github.io/data/mj25.csv")
mj75 = read.csv("https://remiller1450.github.io/data/mj75.csv")
\(~\)
Question #1: For this question you should use the
mj75 sample. The variable pts records the
outcome of interest in this question.
R function
to verify the interval you calculated in Part B. It is okay if your
interval is off by a tiny amount due to rounding.mj5
sample rather than the mj75 sample. If this sample had the
same sample average and sample standard deviation as mj75,
and the confidence level used is unchanged, would you expect this sample
to produce a wider or narrower interval than the one
you found in Parts B and C? Explain your reasoning.\(~\)
Question #2: For this question you should continue
using the mj75 sample. This question will focus on Michael
Jordan’s three-point shooting percentage, which you can find in the
sample using the sum() function and the three
(made three-point shots) and threeatt (attempted
three-point shots) variables.
R function
to verify the interval you calculated in Part B. It is okay if your
interval is off by a tiny amount due to rounding. Remember to change the
confidence level in the function.\(~\)
Question #3: For this question you will use the
mj5, mj25, and mj75 samples.
pts in
each of these three samples. Given that Michael Jordan’s actual career
average is 30.1 points per game, which sample produced the point
estimate that is closest to the truth?t.test() to calculate a
separate 95% confidence interval estimate using each of these samples.
Which of the intervals contained the true value? Is this
surprising?\(~\)
Question #4: For this question you should continue
using the mj75 sample. This question will involve a variety
of different research questions and your task is to find the requested
confidence interval estimate for the relevant population parameter using
an appropriate R function.
tov) and the number of points he scores. Find a
90% confidence interval estimate for an appropriate descriptive
statistic. Be sure you explore the data to determine the appropriate
statistic.\(~\)
## Outcome variable for Part E/F
mj75$outcome = ifelse(substr(mj75$result, start=1, stop=1) == "W", "win", "loss")
\(~\)
The approaches used in the previous section all rely upon a probability model used to represent the sampling variability in the data. As we saw with hypothesis testing, if this probability model is inappropriate then the procedures relying upon it will be invalid.
By definition, confidence intervals are based upon the long-run success rate of the procedure used to make them, so we can assess their validity by looking at their long-run success rate in various situations. The following code provides a basic template for this investigation:
## Create a population with known characteristics
pop_values = rnorm(n=10000, mean = 30, sd = 5) # Hypothetical population of 10,000 cases
## Set up characteristics of the investigation (sample size, number of repeats to get a long-run average)
n_reps = 500
sample_size = 10
lower_endpoint = upper_endpoint = numeric(length = n_reps)
## Repeatedly sample from the population and apply the CI procedure being studied
for(i in 1:n_reps){
current_sample = sample(pop_values, size = sample_size)
point_est = mean(current_sample)
lower_endpoint[i] = point_est - 1.96*(sd(current_sample)/sqrt(sample_size))
upper_endpoint[i] = point_est + 1.96*(sd(current_sample)/sqrt(sample_size))
}
## Check if the true mean is inside each sample's CI
contains_truth = ifelse(lower_endpoint > mean(pop_values) | upper_endpoint < mean(pop_values), "Fails", "Succeeds")
mean(contains_truth == "Succeeds") # Proportion of CI's that contain the truth
## [1] 0.928
Because the sample() function randomly samples from
pop_values, the output of this code block will differ each
time you run it, but in general we can see that this procedure does not
produce valid 95% confidence intervals.
Question #5: For this question you should rely upon the code provided in this section.
\(~\)
Recall that the \(T\)-distribution is an appropriate probability model when the data come from a Normally distributed population, or when the sample size is large, for which the Central Limit theorem ensures an approximately Normal distribution for most sample statistics, including means and differences in means. We can explore this by creating a right-skewed population and sampling from it:
## Create a right-skewed population
skewed_pop_values = rexp(10000, rate = 5)
ggplot() + geom_histogram(aes(x = skewed_pop_values), bins=25)
We can now use an evaluation loop similar to the one provided in the previous section to understand the circumstances for which confidence intervals for a single mean created using the \(T\)-distribution are valid.
Question #6: For this question you will modify the
evaluation loop from the previous section and use the
skewed_pop_values data as the population.
\(~\)
Recall that the \(Z\)-distribution is used as a probability model for a single proportion and a difference in proportions. A common rule of thumb for determining the appropriateness of this model is \(n \cdot \hat{p} \geq 10\) and $n (1-) $ (for both groups in two-sample scenarios).
We can simulate binary outcomes for two groups using the code provided below:
pop_group_1 = sample(x = c("Yes", "No"), size = 20000, replace = TRUE, prob = c(0.5,0.5))
pop_group_2 = sample(x = c("Yes", "No"), size = 10000, replace = TRUE, prob = c(0.7,0.3))
binary_pop = data.frame(Group = c(rep("Group1", 20000), rep("Group2", 10000)),
Outcome = c(pop_group_1, pop_group_2))
In this simulated population the first group is expected to have 50% “no” outcomes, while the second group is expected to have 30% “no” outcomes. There are also twice as many members of the first group than there are in the second group.
The loop below will randomly select a sample of size
sample_size from the population created above, then it will
use prop.test() to construct a 95% confidence interval
estimate for the difference in the proportion of “no” between each
group:
## Set up characteristics of the investigation (sample size, number of repeats to get a long-run average)
n_reps = 500
sample_size = 100
lower_endpoint = upper_endpoint = numeric(length = n_reps)
## Repeatedly sample from the population and apply the CI procedure being studied
for(i in 1:n_reps){
sample_idx = sample(1:nrow(binary_pop), size = sample_size)
current_sample = binary_pop[sample_idx, ]
current_sample_table = table(current_sample$Group, current_sample$Outcome)
prop_test_results = prop.test(x = current_sample_table, correct = FALSE)
lower_endpoint[i] = prop_test_results$conf.int[1]
upper_endpoint[i] = prop_test_results$conf.int[2]
}
## Check if the true mean is inside each sample's CI
contains_truth = ifelse(lower_endpoint > 0.2 | upper_endpoint < 0.2, "Fails", "Succeeds")
mean(contains_truth == "Succeeds") # Proportion of CI's that contain the truth
## [1] 0.934
Question #7: For this question you will modify the loop provided above to vary the sample size.
prop.test() function produce
valid 95% confidence intervals for samples of this size? Briefly
explain.\(~\)
Question #8 (optional, extra credit): Modify the
provided examples given throughout this lab to show that
binom.test() produces valid confidence intervals for
one-sample categorical data with a small sample size (say \(n=10\)). You will have to generate your own
population and modify the repeated sampling loop as part of this
question (which is why it is extra credit).