Directions (read before starting)
\(~\)
Our previous lab introduced the basic steps involved in confidence interval estimation:
Additionally, the Central Limit Theorem provides a Normal approximation of the sampling distribution for several commonly used descriptive statistics. This approximation gives us mathematical formulas for the standard errors of these statistics:
Statistic | Standard Error | Conditions |
---|---|---|
\(\hat{p}\) | \(\sqrt{\frac{p(1 - p)}{n}}\) | \(np \geq 10\) and \(n(1-p) \geq 10\) |
\(\bar{x}\) | \(\frac{\sigma}{\sqrt{n}}\) | normal population or \(n \geq 30\) |
\(\hat{p}_1 - \hat{p}_2\) | \(\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\) | \(n_ip_i \geq 10\) and \(n_i(1-p_i) \geq 10\) for \(i \in \{1,2\}\) |
\(\bar{x}_1 - \bar{x}_2\) | \(\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\) | normal populations or \(n_1 \geq 30\) and \(n_2 \geq 30\) |
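As a minimal sketch of how a row of this table turns into an interval, the code below builds a 95% confidence interval for a single proportion as "point estimate plus or minus c times the standard error". The counts are hypothetical (they are not taken from any data set used in this lab), and the sample proportion stands in for the unknown \(p\) in both the standard error and the conditions.

```r
# Hypothetical counts (made up for illustration): 60 "successes" in 100 cases
x <- 60
n <- 100

p_hat <- x / n                            # point estimate
se    <- sqrt(p_hat * (1 - p_hat) / n)    # standard error, with p-hat in place of p
c_val <- qnorm(1 - (1 - 0.95) / 2)        # "c" for 95% confidence from the standard Normal

# Check the conditions from the table before trusting the Normal approximation
conditions_ok <- (n * p_hat >= 10) && (n * (1 - p_hat) >= 10)

normal_ci <- c(lower = p_hat - c_val * se, upper = p_hat + c_val * se)
print(conditions_ok)
print(normal_ci)
```

Here `qnorm()` plays the role of looking up "c" for the chosen confidence level.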
Certain descriptive statistics, such as a single mean or a difference in means, have standard errors that require us to estimate additional parameters (such as the population’s standard deviation) using our sample. In these situations we need to select “c” using an appropriate \(t\)-distribution in order to properly account for the additional uncertainty introduced by estimating these extra parameters.
We will rely on R to find the degrees of freedom (and calculate the confidence interval) in those situations.\(~\)
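The sketch below illustrates the \(t\)-based calculation for a single mean, where the degrees of freedom are \(n - 1\). The data vector is hypothetical, and the comparison against `t.test()` is only there to show that the two routes should agree.

```r
# Hypothetical sample of 8 measurements (made up for illustration)
x <- c(12.1, 9.8, 11.4, 10.7, 13.2, 10.1, 11.9, 12.5)
n <- length(x)

x_bar <- mean(x)                              # point estimate
se    <- sd(x) / sqrt(n)                      # sample SD stands in for sigma
c_val <- qt(1 - (1 - 0.99) / 2, df = n - 1)   # "c" from a t-distribution with n - 1 df

manual_ci <- c(x_bar - c_val * se, x_bar + c_val * se)
r_ci      <- as.numeric(t.test(x, conf.level = 0.99)$conf.int)

print(manual_ci)
print(r_ci)
```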
In Part 1 of the lab you will be given examples of R functions used to find confidence interval estimates and you will be asked to recreate the calculations that produced those intervals. In Part 2, you will be given scenarios and asked to find and interpret confidence interval estimates using R.
On Exam 2, you should expect one question requiring you to create a confidence interval “by hand” for either a mean or a proportion. You should expect several questions related to confidence intervals that involve interpreting R output.
\(~\)
In this part of the lab we’ll use the Iowa City home sales data introduced in a few of our previous labs. We will filter these data to include only the two most common home types, “1 Story Frame” and “2 Story Frame”:
library(dplyr)
ic_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>%
filter(style %in% c("1 Story Frame", "2 Story Frame"))
In R we can find a confidence interval for a single proportion or a difference in proportions using the prop.test() function:
## Single proportion example
n_1_story = sum(ic_homes$style == "1 Story Frame")
n_total = nrow(ic_homes)
results = prop.test(x = n_1_story, n = n_total, conf.level = 0.99)
results$conf.int ## 99% CI estimate of the proportion of 1-story homes in Iowa City
## [1] 0.5977297 0.7053835
## attr(,"conf.level")
## [1] 0.99
## Difference in proportions example
style_ac_table = table(ic_homes$style, ic_homes$ac)
results = prop.test(x = style_ac_table, conf.level = 0.90)
results$estimate ## Each sample proportion
## prop 1 prop 2
## 0.12680115 0.09782609
results$conf.int ## 90% CI for the difference in proportions of homes with ac = "No" (1-story minus 2-story)
## [1] -0.02167003 0.07962016
## attr(,"conf.level")
## [1] 0.9
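You should take note of a few things:
- For a single proportion, we provide the x argument with the numerator of our proportion and we provide the n argument with the denominator.
- For a difference in proportions, we can provide the x argument with a two-way frequency table, and prop.test() will compare the row proportions of the outcome in the table’s first column. In this example these are the proportions of "1 Story Frame" and "2 Story Frame" homes with ac = "No".
- To compare a different outcome, or to change the order of the groups, we can instead give the numerators and denominators directly to prop.test():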
## Difference in proportions example #2 - we'll compare the prop of "Yes" for 2-story vs. 1-story
n_ac_2_story = sum(ic_homes$style == "2 Story Frame" & ic_homes$ac == "Yes")
n_2_story = sum(ic_homes$style == "2 Story Frame")
n_ac_1_story = sum(ic_homes$style == "1 Story Frame" & ic_homes$ac == "Yes")
n_1_story = sum(ic_homes$style == "1 Story Frame")
## Results
results = prop.test(x = c(n_ac_2_story, n_ac_1_story), n = c(n_2_story, n_1_story), conf.level = 0.90)
results$estimate ## Each sample proportion
## prop 1 prop 2
## 0.9021739 0.8731988
results$conf.int ## 90% CI for the difference in prevalence of AC
## [1] -0.02167003 0.07962016
## attr(,"conf.level")
## [1] 0.9
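To connect these intervals with the standard error formula for \(\hat{p}_1 - \hat{p}_2\), the sketch below computes a 90% interval by hand and compares it with `prop.test()`. The counts are hypothetical. One detail worth knowing is that `prop.test()` applies a continuity correction by default (`correct = TRUE`), so a by-hand interval is only expected to match when `correct = FALSE` is used.

```r
# Hypothetical counts (made up for illustration): successes and group sizes
x <- c(45, 120)
n <- c(50, 150)

p_hat    <- x / n
diff_hat <- p_hat[1] - p_hat[2]                # point estimate of the difference
se       <- sqrt(sum(p_hat * (1 - p_hat) / n)) # standard error from the table's formula
c_val    <- qnorm(1 - (1 - 0.90) / 2)          # "c" for 90% confidence

manual_ci <- c(diff_hat - c_val * se, diff_hat + c_val * se)

# Same interval from prop.test() once the continuity correction is switched off
uncorrected_ci <- as.numeric(prop.test(x = x, n = n, conf.level = 0.90, correct = FALSE)$conf.int)

# Default behavior: the continuity correction makes the interval slightly wider
default_ci <- as.numeric(prop.test(x = x, n = n, conf.level = 0.90)$conf.int)

print(manual_ci)
print(uncorrected_ci)
print(default_ci)
```

This is why a calculation done by hand can differ in the second or third decimal place from the default output of `prop.test()`.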
Question 1: Use prop.test() to find a 95% confidence interval estimate for the difference in proportions of 2-story frame and 1-story frame homes in Iowa City that have a “Full” basement.\(~\)
When estimating a single proportion it is possible to calculate the exact probability of every possible sample proportion that could appear in a sample of size \(n\) using the binomial probability distribution. Probability calculations using the binomial distribution are outside the scope of this course, but you should be aware of exact binomial confidence interval estimates, which are especially useful when the conditions for a normal approximation are not satisfied.
Exact binomial confidence intervals can be found using the binom.test() function in a manner that directly mirrors our previous usage of prop.test():
## Single proportion example
n_1_story = sum(ic_homes$style == "1 Story Frame")
n_total = nrow(ic_homes)
results = binom.test(x = n_1_story, n = n_total, conf.level = 0.99)
results$conf.int # 99% exact binomial CI
## [1] 0.5981371 0.7060268
## attr(,"conf.level")
## [1] 0.99
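The following sketch illustrates why the exact interval matters when the normal approximation conditions fail. It uses a small hypothetical sample (2 successes in 15 cases), where the by-hand normal interval produces an impossible negative lower endpoint while `binom.test()` stays inside the range of valid proportions.

```r
# Hypothetical small sample (made up for illustration)
x <- 2
n <- 15
p_hat <- x / n

# The usual conditions (at least 10 successes and 10 failures) are not met
conditions_ok <- (x >= 10) && (n - x >= 10)

# Normal-approximation interval computed by hand
se        <- sqrt(p_hat * (1 - p_hat) / n)
c_val     <- qnorm(1 - (1 - 0.95) / 2)
normal_ci <- c(p_hat - c_val * se, p_hat + c_val * se)

# Exact binomial interval
exact_ci <- as.numeric(binom.test(x = x, n = n, conf.level = 0.95)$conf.int)

print(conditions_ok)
print(normal_ci)   # lower endpoint falls below 0
print(exact_ci)    # both endpoints are valid proportions
```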
Question 2:
\(~\)
In data analyses involving contingency tables, odds ratios are a popular way to describe associations. The fisher.test() function reports an estimated odds ratio along with a confidence interval estimate:
## Example - the odds ratio comparing the odds of ac = "No" for 1-story vs. 2-story homes
result = fisher.test(x = ic_homes$style, y = ic_homes$ac, conf.level = 0.90)
result$estimate # Odds ratio of ac = "No" for 1-story homes relative to 2-story homes
## odds ratio
## 1.33849
result$conf.int # Confidence interval estimate of the odds ratio
## [1] 0.7973224 2.3010853
## attr(,"conf.level")
## [1] 0.9
By default the “event” for which the odds are calculated will be the first alphabetical value in the y variable, or ac = "No" in this example. The group appearing in the numerator of the odds ratio will be the first alphabetical value of the x variable.
We can change these defaults using the factor() function:
## Example using factor() to compare the odds of "Yes"
result = fisher.test(x = ic_homes$style, y = factor(ic_homes$ac, levels = c("Yes","No")), conf.level = 0.90)
result$estimate # Odds ratio of ac = "Yes" for 1-story homes relative to 2-story homes
## odds ratio
## 0.7471104
result$conf.int
## [1] 0.4345775 1.2541977
## attr(,"conf.level")
## [1] 0.9
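The two examples above are reciprocals of one another (1 / 1.33849 is about 0.747), which is what we should expect after swapping the event. The sketch below illustrates this on a small hypothetical data set, and also computes the sample odds ratio directly from the table. The estimate reported by `fisher.test()` is a conditional maximum likelihood estimate, so it is usually close to, but not identical to, the sample odds ratio.

```r
# Hypothetical data (made up for illustration): 40 one-story and 30 two-story homes
style <- factor(c(rep("1 Story Frame", 40), rep("2 Story Frame", 30)))
ac    <- factor(c(rep("No", 8), rep("Yes", 32), rep("No", 3), rep("Yes", 27)))

tab <- table(style, ac)

# Sample odds ratio of ac = "No" for 1-story homes relative to 2-story homes
odds_1_story <- tab["1 Story Frame", "No"] / tab["1 Story Frame", "Yes"]
odds_2_story <- tab["2 Story Frame", "No"] / tab["2 Story Frame", "Yes"]
sample_or    <- odds_1_story / odds_2_story

# Default: the event is the first level of y ("No")
or_no <- as.numeric(fisher.test(x = style, y = ac)$estimate)

# Reordering the levels of y makes "Yes" the event
or_yes <- as.numeric(fisher.test(x = style, y = factor(ac, levels = c("Yes", "No")))$estimate)

print(sample_or)
print(or_no)
print(or_yes)
print(1 / or_no)   # approximately equal to or_yes
```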
Question 3:
\(~\)
The t.test() function can be used to calculate confidence intervals for a single mean or a difference in means:
## Example of a CI for a single mean
results = t.test(x=ic_homes$sale.amount, conf.level = 0.99)
results$conf.int
## [1] 172661.4 193558.8
## attr(,"conf.level")
## [1] 0.99
## Example of a CI for a difference in means
results = t.test(sale.amount ~ style, data = ic_homes, conf.level = 0.99)
results$estimate
## mean in group 1 Story Frame mean in group 2 Story Frame
## 166179.7 215038.5
results$conf.int
## [1] -72704.14 -25013.28
## attr(,"conf.level")
## [1] 0.99
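For a difference in means, `t.test()` by default does not assume equal variances, and it finds the degrees of freedom with the Welch-Satterthwaite approximation. The sketch below reproduces that calculation on two hypothetical samples. It is meant only to show where the degrees of freedom come from, since in practice we let R handle this step.

```r
# Hypothetical samples (made up for illustration)
g1 <- c(210, 185, 240, 199, 225, 230)
g2 <- c(150, 172, 168, 181, 160, 175, 158)

# Each group's contribution to the squared standard error
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)
se <- sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (generally not a whole number)
df <- (v1 + v2)^2 / (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

c_val     <- qt(1 - (1 - 0.99) / 2, df = df)
manual_ci <- (mean(g1) - mean(g2)) + c(-1, 1) * c_val * se

results <- t.test(g1, g2, conf.level = 0.99)
r_ci    <- as.numeric(results$conf.int)
r_df    <- as.numeric(results$parameter)   # degrees of freedom used by t.test()

print(df)
print(manual_ci)
print(r_ci)
```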
Question 4: Use the summarize() function to find the mean and standard deviation for the variable sale.amount. Then, using a value of “c” from the proper distribution, calculate a 99% confidence interval estimate of the average sale amount of homes sold in Iowa City. Calculate your interval “by hand” (with the exception of finding “c” using StatKey). Verify that your interval matches the one found in this section’s examples.
\(~\)
The cor.test() function is used to find a confidence interval estimate for a correlation between two quantitative variables. Currently, the function only returns confidence intervals for Pearson’s correlation coefficient.
## Example of confidence interval for Pearson's correlation coefficient
results = cor.test(x = ic_homes$area.living, y = ic_homes$sale.amount, conf.level = 0.9)
results$estimate ## Point estimate (sample correlation)
## cor
## 0.8224501
results$conf.int
## [1] 0.7978832 0.8442896
## attr(,"conf.level")
## [1] 0.9
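Notice that the interval above is not symmetric around the point estimate. For Pearson's correlation, `cor.test()` builds the interval on a transformed scale (Fisher's z-transformation, `atanh()`), where the standard error is \(1/\sqrt{n-3}\), and then transforms the endpoints back. The sketch below reproduces this on hypothetical data as an illustration of the idea. It is not something you are expected to do by hand in this lab.

```r
# Hypothetical paired measurements (made up for illustration)
x <- c(1.2, 2.3, 2.9, 4.1, 5.0, 6.2, 6.8, 8.1, 9.3, 9.9)
y <- c(2.0, 2.9, 3.1, 5.2, 4.8, 6.9, 7.5, 7.7, 9.9, 10.4)
n <- length(x)

r     <- cor(x, y)                      # sample correlation (point estimate)
z     <- atanh(r)                       # Fisher's z-transformation
se_z  <- 1 / sqrt(n - 3)                # standard error on the transformed scale
c_val <- qnorm(1 - (1 - 0.90) / 2)      # "c" for 90% confidence

manual_ci <- tanh(z + c(-1, 1) * c_val * se_z)   # transform the endpoints back
r_ci      <- as.numeric(cor.test(x, y, conf.level = 0.90)$conf.int)

print(r)
print(manual_ci)
print(r_ci)
```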
Question 5: Find and interpret a 95% confidence interval estimate for the correlation between the year a home was built and its sale amount. How are these variables related in the sample? Can you be confident that these variables are related in the population represented by these data?
\(~\)
The confint() function is designed to calculate confidence interval estimates using the coefficients of a fitted regression model. Differing slightly from the other functions introduced in this lab, it uses the level argument to specify the confidence level:
## Example confidence interval for the coefficients in a linear regression model
my_model = lm(sale.amount ~ style + area.living, data = ic_homes)
confint(my_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -31293.9509 -7554.4756
## style2 Story Frame -51617.3241 -30866.5517
## area.living 147.2762 165.1543
In this example we can be 95% confident that each additional square foot in living area is associated with an expected increase in sale price that is between \(\$147\) and \(\$165\) when the style of the home is held constant (ie: for two homes of the same style).
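The intervals reported by `confint()` follow the same "point estimate plus or minus c times the standard error" recipe, with "c" taken from a \(t\)-distribution using the model's residual degrees of freedom. The sketch below shows this using R's built-in `mtcars` data, which is an assumption made for illustration and is unrelated to the Iowa City data.

```r
# Built-in example data (not part of this lab): fuel economy modeled by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Point estimate and standard error for the coefficient of wt
coefs <- summary(fit)$coefficients
est   <- coefs["wt", "Estimate"]
se    <- coefs["wt", "Std. Error"]

# "c" from a t-distribution with the model's residual degrees of freedom
res_df <- df.residual(fit)
c_val  <- qt(1 - (1 - 0.95) / 2, df = res_df)

manual_ci <- c(est - c_val * se, est + c_val * se)
r_ci      <- as.numeric(confint(fit, level = 0.95)["wt", ])

print(res_df)
print(manual_ci)
print(r_ci)
```

With 32 observations and 3 estimated coefficients (the intercept plus two slopes), the residual degrees of freedom are 29.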
Question 6: Use the summary() function to obtain the point estimate and standard error for the coefficient of the re-coded variable style2 Story Frame. Then, use an appropriate value of “c” from a \(t\)-distribution with \(df=n-p-1\), where \(n\) is the sample size and \(p\) is the number of coefficients excluding the intercept (here \(p = 2\)). Confirm the endpoints of the 95% confidence interval for the coefficient of style2 Story Frame given above (ie: -51617.32 and -30866.55) by calculating the interval yourself.\(~\)
The American Community Survey (ACS) is a component of the US Census that is administered to a random sample of US addresses on a rolling basis. When the mailed version is combined with in-person visits and telephone calls, the survey has a 95% response rate. The data linked below are a random sample of employed individuals drawn from a recent ACS:
acs = read.csv("https://remiller1450.github.io/data/EmployedACS.csv")
The ACS sample includes the following variables:
Question 7: Use a 95% confidence interval estimate of an appropriate descriptive statistic to decide whether you can confidently conclude from the ACS data that married individuals are more likely to have health insurance.
Question 8: Use a 90% confidence interval estimate of an appropriate descriptive statistic to determine whether you can confidently conclude from the ACS data that there’s a relationship between the average number of hours an individual works and their annual income.
Question 9: Use a 99% confidence interval estimate to determine whether you can confidently conclude from the ACS data that males earn higher mean incomes than females.
Question 10: Use a 99% confidence interval estimate to determine whether you can confidently conclude from the ACS data that males earn higher mean incomes than females after controlling for hours worked per week. Hint: Use a regression model that adjusts for the average number of hours worked each week.