Lab #11 - Confidence Intervals

Directions (read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Ask for help, clarification, or even just a check-in if anything seems unclear.

$~$

Introduction

Our previous lab introduced the basic steps involved in confidence interval estimation:

Find the point estimate of the unknown population parameter using the sample data.
Find the standard error of the point estimate.
Attach a margin of error to the point estimate using the generic formula: $\text{Point Estimate} \pm c*SE$ where “c” is chosen to calibrate the interval so that it achieves a certain confidence level
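
As a rough sketch of the three steps above, the R code below builds a 95% interval for a single proportion. The counts (`x = 130`, `n = 200`) are hypothetical and chosen only for illustration, and `qnorm()` is used to pick “c” under a Normal approximation.

```r
## Hypothetical counts (not from the lab's data): 130 "successes" in a sample of 200
x = 130
n = 200

p_hat = x / n                          # Step 1: point estimate
se = sqrt(p_hat * (1 - p_hat) / n)     # Step 2: standard error, plugging in p_hat for p
c_star = qnorm(0.975)                  # Step 3: "c" for 95% confidence (about 1.96)

ci = c(p_hat - c_star * se, p_hat + c_star * se)
ci
```

Changing the value passed to `qnorm()` (for example `0.995` for 99% confidence) changes “c” and therefore the width of the interval.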

Additionally, the Central Limit Theorem provides a Normal approximation of the sampling distribution for several commonly used descriptive statistics. This approximation provides us with mathematical formulas for the standard errors of these statistics:

Statistic	Standard Error	Conditions
$\hat{p}$	$\sqrt{\frac{p(1 - p)}{n}}$	$np \geq 10$ and $n(1-p) \geq 10$
$\bar{x}$	$\frac{\sigma}{\sqrt{n}}$	normal population or $n \geq 30$
$\hat{p}_1 - \hat{p}_2$	$\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$	$n_ip_i \geq 10$ and $n_i(1-p_i) \geq 10$ for $i \in \{1,2\}$
$\bar{x}_1 - \bar{x}_2$	$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$	normal populations or $n_1 \geq 30$ and $n_2 \geq 30$
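
One way to check the conditions for a proportion in the table above is to substitute the sample proportion for the unknown $p$. The helper name `normal_ok()` below is hypothetical (it is not a built-in R function), and the counts are made up for illustration.

```r
## Hypothetical helper: checks n*p >= 10 and n*(1 - p) >= 10 using p_hat in place of p
## (equivalent to requiring at least 10 "successes" and at least 10 "failures")
normal_ok = function(x, n) {
  p_hat = x / n
  (n * p_hat >= 10) & (n * (1 - p_hat) >= 10)
}

normal_ok(x = 130, n = 200)   # 130 successes and 70 failures
normal_ok(x = 4, n = 50)      # only 4 successes
```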

Certain descriptive statistics, such as a single mean or a difference in means, have standard errors that require us to estimate additional parameters (such as the population’s standard deviation) using our sample. In these situations we need to select “c” using an appropriate $t$-distribution in order to properly account for the additional uncertainty introduced by estimating these extra parameters.

For a single mean, we use a $t$-distribution with $df = n-1$
For a difference in means we’ll rely on R to find the degrees of freedom (and calculate the confidence interval)
- The coefficients in linear regression models also require the $t$-distribution, and we’ll use R in those situations too
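
To illustrate how “c” changes when a $t$-distribution is used, the sketch below compares `qnorm()` with `qt()` and then builds a 95% interval for a single mean. The data vector is hypothetical (made up for illustration) and is not part of the lab's data.

```r
## "c" for 95% confidence from the Normal distribution and from t-distributions
qnorm(0.975)                     # about 1.96
qt(0.975, df = c(5, 30, 1000))   # larger than 1.96, shrinking toward it as df grows

## Hypothetical sample of 8 measurements (made up for illustration)
x = c(12.1, 9.8, 11.4, 10.7, 13.2, 9.5, 10.9, 12.6)
n = length(x)
c_star = qt(0.975, df = n - 1)   # t-distribution with df = n - 1
by_hand = mean(x) + c(-1, 1) * c_star * sd(x) / sqrt(n)

by_hand
t.test(x, conf.level = 0.95)$conf.int   # should agree with the by-hand interval
```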

$~$

Lab

In Part 1 of the lab you will be given examples of R functions used to find confidence interval estimates and you will be asked to recreate the calculations that produced those intervals. In Part 2, you will be given scenarios and asked to find and interpret confidence interval estimates using R.

On Exam 2, you should expect one question requiring you to create a confidence interval “by hand” for either a mean or a proportion. You should expect several questions related to confidence intervals that involve interpreting R output.

$~$

Part 1

In this part of the lab we’ll use the Iowa City home sales data introduced in a few of our previous labs. We will filter these data to include only the two most common home types, “1 Story Frame” and “2 Story Frame”:

library(dplyr)
ic_homes = read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv") %>% 
  filter(style %in% c("1 Story Frame", "2 Story Frame"))

Proportions

In R we can find a confidence interval for a single proportion or a difference in proportions using the prop.test() function:

## Single proportion example
n_1_story = sum(ic_homes$style == "1 Story Frame")
n_total = nrow(ic_homes)
results = prop.test(x = n_1_story, n = n_total, conf.level = 0.99)
results$conf.int ## 99% CI estimate of the proportion of 1-story homes in Iowa City

## [1] 0.5977297 0.7053835
## attr(,"conf.level")
## [1] 0.99

## Difference in proportions example
style_ac_table = table(ic_homes$style, ic_homes$ac)
results = prop.test(x = style_ac_table, conf.level = 0.90)
results$estimate  ## Each sample proportion

##     prop 1     prop 2 
## 0.12680115 0.09782609

results$conf.int  ## 90% CI for the difference in proportions of homes with ac = "No" (1-story minus 2-story)

## [1] -0.02167003  0.07962016
## attr(,"conf.level")
## [1] 0.9

You should take note of a few things:

For a single proportion, we provide the x argument with the numerator of our proportion and we provide the n argument with the denominator.
For a difference in proportions we can provide the x argument with a two-way frequency table and prop.test() will compare the row proportions of the outcome in the table’s first column. In this example these are the proportions of "1 Story Frame" and "2 Story Frame" homes with ac = "No".
- The example below illustrates how we can gain more control over the proportions that are compared by calculating each numerator and denominator separately and passing these to prop.test()

## Difference in proportions example #2 - we'll compare the prop of "Yes" for 2-story vs. 1-story
n_ac_2_story = sum(ic_homes$style == "2 Story Frame" & ic_homes$ac == "Yes")
n_2_story = sum(ic_homes$style == "2 Story Frame")
n_ac_1_story = sum(ic_homes$style == "1 Story Frame" & ic_homes$ac == "Yes")
n_1_story = sum(ic_homes$style == "1 Story Frame")

## Results
results = prop.test(x = c(n_ac_2_story, n_ac_1_story), n = c(n_2_story, n_1_story), conf.level = 0.90)
results$estimate  ## Each sample proportion

##    prop 1    prop 2 
## 0.9021739 0.8731988

results$conf.int  ## 90% CI for the difference in prevalence of AC

## [1] -0.02167003  0.07962016
## attr(,"conf.level")
## [1] 0.9
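
A by-hand interval for a difference in proportions can be sketched as shown below, using hypothetical counts rather than the Iowa City data. By default `prop.test()` applies a continuity correction, so its interval is slightly wider than the one from the plain standard error formula. Setting `correct = FALSE` turns the correction off, which is why the first two intervals here agree.

```r
## Hypothetical counts: 45 of 100 in group 1 and 30 of 100 in group 2
x = c(45, 30)
n = c(100, 100)

p_hat = x / n
diff_hat = p_hat[1] - p_hat[2]                 # point estimate
se = sqrt(sum(p_hat * (1 - p_hat) / n))        # SE for a difference in proportions
c_star = qnorm(0.975)                          # "c" for 95% confidence
by_hand = diff_hat + c(-1, 1) * c_star * se

by_hand
no_correction = prop.test(x = x, n = n, conf.level = 0.95, correct = FALSE)$conf.int
no_correction                                  # same endpoints as by_hand
default_ci = prop.test(x = x, n = n, conf.level = 0.95)$conf.int
default_ci                                     # default settings: slightly wider
```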

Question 1:

Part A: Find a point estimate for the difference in proportions of 2-story frame and 1-story frame homes in Iowa City that have a “Full” basement.
Part B: Use prop.test() to find a 95% confidence interval estimate for the difference in proportions of 2-story frame and 1-story frame homes in Iowa City that have a “Full” basement.
Part C: Use the standard error formula for a difference in proportions and an appropriate value of “c” to confirm the confidence interval you found in Part B using a “by hand” calculation. Show your work. Note: You might not get the exact same end points depending upon how you round at each step, but your “by hand” interval should be identical to the one from Part B if both are rounded to 2 decimal places.
Part D: Does the confidence interval you found in Parts B and C support the claim that 1-story and 2-story frame homes have different likelihoods of having a “Full” basement? Could the difference suggested by the point estimate you found in Part A be explained by sampling variability?

$~$

Exact Binomial Confidence Intervals

When estimating a single proportion it is possible to calculate the exact probability of every possible sample proportion that could appear in a sample of size $n$ using the binomial probability distribution. Probability calculations using the binomial distribution are outside the scope of this course, but you should be aware of exact binomial confidence interval estimates, which are especially useful when the conditions for a normal approximation are not satisfied.

Exact binomial confidence intervals can be found using the binom.test() function in a manner that directly mirrors our previous usage of prop.test():

## Single proportion example
n_1_story = sum(ic_homes$style == "1 Story Frame")
n_total = nrow(ic_homes)
results = binom.test(x = n_1_story, n = n_total, conf.level = 0.99)
results$conf.int # 99% exact binomial CI

## [1] 0.5981371 0.7060268
## attr(,"conf.level")
## [1] 0.99
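
To see why the exact interval matters when the Normal approximation's conditions fail, the sketch below uses a hypothetical small sample (1 “success” in 20 trials, made up for illustration) and compares the generic-formula interval with the one from `binom.test()`.

```r
## Hypothetical small sample: 1 "success" out of 20, so n * p_hat = 1 (well below 10)
x = 1
n = 20
p_hat = x / n

## Normal-approximation interval from the generic formula
wald = p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
wald                                             # lower endpoint is negative

## Exact binomial interval
exact = binom.test(x = x, n = n, conf.level = 0.95)$conf.int
exact                                            # stays between 0 and 1
```

A negative lower endpoint is not a possible value for a proportion, which is one sign that the Normal approximation is a poor fit for a sample like this one.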

Question 2:

Part A: Use these data to find a point estimate for the proportion of homes in Iowa City that were built before 1940 (ie: 1939 or earlier).
Part B: Find a 95% exact binomial confidence interval estimate for the proportion of homes built before 1940.
Part C: According to the 2019 American Community Survey, 12.3% of all occupied homes in the entire United States were built before 1940. Does your confidence interval in Part B provide support to the claim that Iowa City has more older homes than the nation as a whole? Briefly explain.
Part D: Is it possible that sampling bias might explain why Iowa City has more older homes in this data set than the United States as a whole? Briefly explain.

$~$

Odds Ratios

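In data analyses involving contingency tables, odds ratios are a popular way to describe associations. The fisher.test() function can be used to find a point estimate and a confidence interval estimate of an odds ratio: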

## Example - the odds ratio comparing the odds of ac = "No" for 1-story vs. 2-story homes
result = fisher.test(x = ic_homes$style, y = ic_homes$ac, conf.level = 0.90)
result$estimate   # Odds ratio of ac = "No" for 1-story homes relative to 2-story homes

## odds ratio 
##    1.33849

result$conf.int   # Confidence interval estimate of the odds ratio

## [1] 0.7973224 2.3010853
## attr(,"conf.level")
## [1] 0.9

By default the “event” for which the odds are calculated will be the first alphabetical value in the y variable, or ac = "No" in this example. The group appearing in the numerator of the odds ratio will be the first alphabetical value of the x variable.

We can change these defaults using the factor() function:

## Example using factor() to compare the odds of "Yes"
result = fisher.test(x = ic_homes$style, y = factor(ic_homes$ac, levels = c("Yes","No")), conf.level = 0.90)
result$estimate   # Odds ratio of ac = "Yes" for 1-story homes relative to 2-story homes

## odds ratio 
##  0.7471104

result$conf.int

## [1] 0.4345775 1.2541977
## attr(,"conf.level")
## [1] 0.9
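
The sketch below uses a small hypothetical 2x2 table (made-up counts) to show where an odds ratio comes from. Note that `fisher.test()` reports a conditional maximum likelihood estimate, so its value is typically close to, but not generally identical to, the odds ratio computed directly from the table's counts. The last step illustrates that switching which outcome is treated as the “event” gives the reciprocal odds ratio, which matches the examples above (1/1.33849 is approximately 0.7471).

```r
## Hypothetical 2x2 table of counts (rows = groups, columns = outcomes)
tab = matrix(c(12, 8,
                5, 20),
             nrow = 2, byrow = TRUE,
             dimnames = list(group = c("A", "B"), outcome = c("No", "Yes")))

## Sample odds ratio: odds of "No" in group A divided by odds of "No" in group B
odds_A = tab["A", "No"] / tab["A", "Yes"]   # 12 / 8
odds_B = tab["B", "No"] / tab["B", "Yes"]   # 5 / 20
sample_or = odds_A / odds_B                 # (12 * 20) / (8 * 5) = 6
sample_or

fit = fisher.test(tab, conf.level = 0.90)
fit$estimate                                # conditional MLE, typically near the sample odds ratio
fit$conf.int

## Swapping the outcome columns makes "Yes" the event and inverts the odds ratio
fit_flip = fisher.test(tab[, c("Yes", "No")], conf.level = 0.90)
fit_flip$estimate                           # approximately 1 / fit$estimate
```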

Question 3:

Part A: In Question 1 you compared the likelihoods of 1-story and 2-story frame homes having a “Full” basement using a difference in proportions. Now find a point estimate for the odds ratio comparing the odds of a 2-story frame home in Iowa City having a full basement to the odds of a 1-story frame home having a full basement.
Part B: Find a 95% confidence interval estimate of the odds ratio described in Part A. Does this interval support the conclusion that 1-story and 2-story homes have different likelihoods of having a full basement?

$~$

Means

The t.test() function can be used to calculate confidence intervals for a single mean or a difference in means:

## Example of a CI for a single mean
results = t.test(x=ic_homes$sale.amount, conf.level = 0.99)
results$conf.int

## [1] 172661.4 193558.8
## attr(,"conf.level")
## [1] 0.99

## Example of a CI for a difference in means
results = t.test(sale.amount ~ style, data = ic_homes, conf.level = 0.99)
results$estimate

## mean in group 1 Story Frame mean in group 2 Story Frame 
##                    166179.7                    215038.5

results$conf.int

## [1] -72704.14 -25013.28
## attr(,"conf.level")
## [1] 0.99
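
For a difference in means, `t.test()` by default uses Welch's method, which does not assume equal variances in the two groups and approximates the degrees of freedom. The sketch below reproduces that calculation on two hypothetical samples (the numbers are made up for illustration), which may help show what R is doing when it finds the degrees of freedom for you.

```r
## Two hypothetical samples (made up for illustration)
g1 = c(21.5, 19.8, 23.1, 20.4, 22.7, 18.9, 21.0)
g2 = c(25.2, 27.9, 24.1, 26.5, 29.3, 23.8, 26.0, 28.4)

v1 = var(g1) / length(g1)      # s1^2 / n1
v2 = var(g2) / length(g2)      # s2^2 / n2
se = sqrt(v1 + v2)             # standard error of the difference in means

## Welch-Satterthwaite approximation to the degrees of freedom
welch_df = (v1 + v2)^2 / (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

c_star = qt(0.995, df = welch_df)    # "c" for 99% confidence
by_hand = (mean(g1) - mean(g2)) + c(-1, 1) * c_star * se

fit = t.test(g1, g2, conf.level = 0.99)
fit$parameter                  # degrees of freedom used by R
by_hand
fit$conf.int                   # should match the by-hand interval
```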

Question 4: Use the summarize() function to find the mean and standard deviation for the variable sale.amount. Then, using a value of “c” from the proper distribution, calculate a 99% confidence interval estimate of the average sale amount of homes sold in Iowa City. Calculate your interval “by hand” (with the exception of finding “c” using StatKey). Verify that your interval matches the one found in this section’s examples.

$~$

Correlation

The cor.test() function is used to find a confidence interval estimate for a correlation between two quantitative variables. Currently, the function only returns confidence intervals for Pearson’s correlation coefficient.

## Example of confidence interval for Pearson's correlation coefficient
results = cor.test(x = ic_homes$area.living, y = ic_homes$sale.amount, conf.level = 0.9)
results$estimate  ## Point estimate (sample correlation)

##       cor 
## 0.8224501

results$conf.int

## [1] 0.7978832 0.8442896
## attr(,"conf.level")
## [1] 0.9
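
The interval for Pearson's correlation does not use the generic formula directly on the correlation itself. Instead, `cor.test()` works on a transformed scale (Fisher's z-transformation) and then transforms the endpoints back. The sketch below reproduces this using R's built-in `mtcars` data, which is used here only for illustration and is unrelated to the lab's data.

```r
## Built-in mtcars data: correlation between car weight (wt) and fuel economy (mpg)
r = cor(mtcars$wt, mtcars$mpg)
n = nrow(mtcars)

z = atanh(r)                          # Fisher's z-transformation of r
se_z = 1 / sqrt(n - 3)                # approximate standard error on the z scale
c_star = qnorm(0.95)                  # "c" for 90% confidence
by_hand = tanh(z + c(-1, 1) * c_star * se_z)   # transform the endpoints back

by_hand
cor.test(mtcars$wt, mtcars$mpg, conf.level = 0.90)$conf.int   # should agree
```

Because of the transformation, the interval is generally not symmetric around the sample correlation.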

Question 5: Find and interpret a 95% confidence interval estimate for the correlation between the year a home was built and its sale amount. How are these variables related in the sample? Can you be confident that these variables are related in the population represented by these data?

$~$

Regression

The confint() function is designed to calculate confidence interval estimates using the coefficients of a fitted regression model. Differing slightly from the other functions introduced in this lab, it uses the level argument to specify the confidence level:

## Example confidence interval for the coefficients in a linear regression model
my_model = lm(sale.amount ~ style + area.living, data = ic_homes)
confint(my_model, level = 0.95)

##                          2.5 %      97.5 %
## (Intercept)        -31293.9509  -7554.4756
## style2 Story Frame -51617.3241 -30866.5517
## area.living           147.2762    165.1543

In this example we can be 95% confident that each additional square foot in living area is associated with an expected increase in sale price that is between $\$147$ and $\$165$ when the style of the home is held constant (ie: for two homes of the same style).
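
The intervals returned by `confint()` for a linear model follow the generic formula with “c” taken from a $t$-distribution. The sketch below checks this for one coefficient using R's built-in `mtcars` data (chosen only for illustration, not part of the lab's data).

```r
## Built-in mtcars data: model fuel economy using weight and horsepower
fit = lm(mpg ~ wt + hp, data = mtcars)
coefs = summary(fit)$coefficients     # columns include "Estimate" and "Std. Error"

est = coefs["wt", "Estimate"]
se = coefs["wt", "Std. Error"]
c_star = qt(0.975, df = df.residual(fit))   # residual df = 32 rows - 3 coefficients = 29
by_hand = est + c(-1, 1) * c_star * se

by_hand
confint(fit, parm = "wt", level = 0.95)     # should agree with the by-hand interval
```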

Question 6:

Part A: Use the summary() function to obtain the point estimate and standard error for the coefficient of the re-coded variable style2 Story Frame. Then, use an appropriate value of “c” from a $t$-distribution with $df=n-p-1$ where $n$ is the sample size and $p$ is the number of coefficients in the model, not counting the intercept. Confirm the endpoints of the 95% confidence interval for the coefficient of style2 Story Frame given above (ie: -51617.32 and -30866.55) by calculating the interval yourself.
Part B: Based upon the interval from Part A, can you be confident that 2-story frame homes are less expensive than 1-story frame homes in Iowa City assuming both homes have the same living area? Briefly explain.

$~$

Part 2

The American Community Survey (ACS) is a component of the US Census that is administered to a random sample of US addresses on a rolling basis. When the mailed version is combined with in-person visits and telephone calls, the survey has a 95% response rate. The data linked below are a random sample of employed individuals drawn from a recent ACS:

acs = read.csv("https://remiller1450.github.io/data/EmployedACS.csv")

The ACS sample includes the following variables:

Sex - “1” for males and “0” for females
Age - age in years
Married - “1” for married individuals and “0” for unmarried individuals
Income - annual income (in thousands of dollars)
HoursWk - average hours worked per week
Race - self-described race
USCitizen - citizenship status, “1” for US citizens and “0” for non-citizens
HealthInsurance - “1” if the individual has health insurance, “0” otherwise
Language - “1” if the individual’s first/native language is English, “0” otherwise

Question 7: Use a 95% confidence interval estimate of an appropriate descriptive statistic to decide whether you can confidently conclude from the ACS data that married individuals are more likely to have health insurance.

Question 8: Use a 90% confidence interval estimate of an appropriate descriptive statistic to determine whether you can confidently conclude from the ACS data that there’s a relationship between the average number of hours an individual works and their annual income.

Question 9: Use a 99% confidence interval estimate to determine whether you can confidently conclude from the ACS data that males earn higher mean incomes than females.

Question 10: Use a 99% confidence interval estimate to determine whether you can confidently conclude from the ACS data that males earn higher mean incomes than females after controlling for hours worked per week. Hint: Use a regression model that adjusts for the average number of hours worked each week.
