This lab is intended to provide insight into the theoretical aspects of confidence intervals.

Directions (Please read before starting)

  1. Please work together with your assigned groups. Even though you’ll turn in a write-up that is later scored, labs are intended to be formative, and a substantial portion of the credit you’ll receive is based upon effort and completion.
  2. Please record your responses and code in an R Markdown document following the conventions we’ve used in previous labs.

\(~\)

Ames Housing Dataset

The Ames Housing dataset documents all residential homes sold in Ames, Iowa between 2006 and 2010. The code below reads these data into R:

ah <- read.csv("https://remiller1450.github.io/data/AmesHousing.csv")

Similar to Lab #4, this lab will treat the full Ames Housing dataset as the population, and our goal will be to explore the theoretical properties of confidence intervals constructed using samples from this population.

A description of the variables contained in these data can be found here.

\(~\)

Practice with a single random sample

The code below will draw a single random sample of size \(n = 30\) from the Ames Housing population:

N = nrow(ah) ## Number of cases in the population
set.seed(7) ## Random number generation seed
sampled = sample(1:N, size = 30)  ## Draw ID numbers to be in the sample
ah_sample = ah[sampled,]  ## Subset original data to only include the sampled IDs

Question #1: Suppose we’re interested in using the data in ah_sample to estimate the average sale price of all homes in the Ames Housing population. Using methods we’ve previously covered (i.e., qnorm or qt), find the point estimate and the margin of error for a 95% confidence interval estimate.

Question #2: Suppose we’re interested in using the data in ah_sample to estimate the proportion of two-story homes in the Ames Housing population (which is recorded by the variable “House.Style”). Using methods we’ve previously covered (i.e., qnorm or qt), find the point estimate and the margin of error for a 95% confidence interval estimate.
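
As a reminder of the general "by hand" recipe (point estimate plus or minus a critical value times a standard error), here is a minimal sketch. It uses simulated data rather than ah_sample; the objects x and y are hypothetical stand-ins, and the proportion interval assumes the Normal (CLT) approximation is reasonable.

```r
## Hypothetical data (not the Ames sample): 30 simulated numeric values
set.seed(1)
x = rnorm(30, mean = 50, sd = 10)
n = length(x)

## Single mean: point estimate +/- t critical value * standard error
xbar = mean(x)
moe_mean = qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_mean = c(xbar - moe_mean, xbar + moe_mean)

## Single proportion: hypothetical 0/1 indicator, Normal (CLT) approximation
y = rbinom(30, size = 1, prob = 0.4)
phat = mean(y)
moe_prop = qnorm(0.975) * sqrt(phat * (1 - phat) / n)
ci_prop = c(phat - moe_prop, phat + moe_prop)
```

For a 95% interval the critical value uses the 0.975 quantile because 2.5% of the distribution is left in each tail.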

\(~\)

Using built-in R functions

As you might expect, R contains a number of built-in functions that can be used to create confidence interval estimates for a variety of population parameters.

  • The t.test() function can be used to create intervals for a single mean, or a difference in means, using the \(t\)-distribution.
  • The binom.test() function can be used to create exact confidence intervals using the binomial distribution. It should be used when estimating a single proportion.
## Example - using t.test() to find a confidence interval
t.test(ah_sample$Year.Built, conf.level = 0.99)$conf.int
## [1] 1952.572 1986.495
## attr(,"conf.level")
## [1] 0.99
## Example - using binom.test() to find a confidence interval
table(ah_sample$Central.Air)  ## notice 25 of 30 homes had central AC
## 
##  N  Y 
##  5 25
binom.test(x = 25, n = 30, conf.level = 0.99)$conf.int
## [1] 0.5959730 0.9621783
## attr(,"conf.level")
## [1] 0.99
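
The conf.int element returned by these functions is a length-two numeric vector, so the endpoints can be pulled out by position. The sketch below illustrates this for a difference in means, using two simulated groups (g1 and g2 are hypothetical, not variables in the Ames data). By default, t.test() with two samples uses the Welch (unequal variance) procedure.

```r
## Hypothetical data: simulated measurements for two groups
set.seed(2)
g1 = rnorm(25, mean = 10, sd = 2)
g2 = rnorm(25, mean = 12, sd = 2)

## Two-sample t interval for mean(g1) - mean(g2)
res = t.test(g1, g2, conf.level = 0.95)
ci = res$conf.int
lower = ci[1]  ## First element is the lower endpoint
upper = ci[2]  ## Second element is the upper endpoint
upper - lower  ## Interval width
attr(ci, "conf.level")  ## Confidence level stored with the interval
```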

Question #3: Use a built-in R function to find a 95% CI estimate of the average sale price of the Ames Housing population. How does this interval compare with the one you previously calculated “by hand” (in Question #1)?

Question #4: Use a built-in R function to find a 95% CI estimate of the proportion of two-story homes in the Ames Housing population. How does this interval compare with the one you previously calculated “by hand” (in Question #2)?

\(~\)

Exploring statistical validity

In Question #4 you should have found the exact binomial interval was somewhat wider than the one you found “by hand”. This raises the question of whether both of these intervals are statistically valid. That is, will both of these procedures “succeed” at least 95% of the time if applied to many different random samples?

The code below generates 500 random samples of size \(n = 30\) and stores the 95% CI endpoints for the proportion of homes with central air conditioning for the two methods we’ve now used (exact binomial and “by hand”).

nrep = 500 ## Number of times to repeat the loop
n = 30 ## Size of each sample
set.seed(7) ## Set seed for consistency

## Create empty objects to store results
lower_exact = numeric(nrep) 
lower_byhand = numeric(nrep) 
upper_exact = numeric(nrep) 
upper_byhand = numeric(nrep) 

for(i in 1:nrep){
  sampled = sample(1:N, size = n)  ## Draw IDs of a sample
  ah_sample = ah[sampled,]  ## Create the sample
  
  ## By-hand CLT approach
  phat = sum(ah_sample$Central.Air == "Y")/n
  lower_byhand[i] = phat - 1.96*sqrt(phat*(1-phat)/n)
  upper_byhand[i] = phat + 1.96*sqrt(phat*(1-phat)/n)
  
  ## Exact binomial approach  
  eci = binom.test(x = sum(ah_sample$Central.Air == "Y"), n = n, conf.level = 0.95)$conf.int
  lower_exact[i] = eci[1]  ## First element is the lower endpoint
  upper_exact[i] = eci[2]  ## Second element is the upper endpoint
}

Question #5: Using the full population dataset (i.e., ah), find the population’s proportion of homes that have central air conditioning.

Question #6: Count the number of times the “by hand” approach failed to produce an interval that contained the true population proportion (of homes with central air conditioning). Hint: you can do this by adding the number of lower endpoints that are larger than the population’s proportion and the number of upper endpoints that are smaller than the population’s proportion.
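
The counting strategy in the hint can be sketched on a toy example. The endpoints and the value of truth below are made up for illustration and are not results from the Ames simulation.

```r
## Hypothetical endpoints for five intervals and a hypothetical true value
truth = 0.50
lower = c(0.30, 0.52, 0.41, 0.20, 0.45)
upper = c(0.60, 0.80, 0.49, 0.55, 0.70)

too_high = sum(lower > truth)  ## Intervals lying entirely above the truth
too_low = sum(upper < truth)   ## Intervals lying entirely below the truth
misses = too_high + too_low    ## Total number of failures (2 here)

## Equivalent summary: fraction of intervals that contain the truth (0.6 here)
coverage = mean(lower <= truth & truth <= upper)
```

Comparing two vectors to a single number produces a logical vector, and sum() counts its TRUE entries, which is why no loop is needed.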

Question #7: Count the number of times the exact binomial approach failed to produce an interval that contained the true population proportion (of homes with central air conditioning). Do both this approach and the “by hand” approach seem to be producing statistically valid confidence intervals?

\(~\)

Factors influencing interval width

This final section is intended to help you understand and explain how various factors will impact the width of a confidence interval estimate. For Questions #8-#12, you may justify your answers using either of the following approaches:

  • Simulation - drawing different random samples and studying the results
  • Theory - looking at formulas or theoretical results and isolating the role of a particular factor

Regardless of the approach you choose, you should write 1-2 sentences supporting your answer (along with other support in the form of either numerical results or formulas).
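
If you choose the simulation approach, one possible starting point is a small helper that averages interval widths over repeated samples. The function avg_width() below is a hypothetical sketch, not part of any package, and it draws from a simulated Normal population rather than from ah; you would need to adapt it to the factor you want to study.

```r
## Hypothetical helper: average width of t-intervals across repeated samples
## drawn from a simulated Normal population (not the Ames data)
avg_width = function(n, conf.level, sd = 1, nrep = 200){
  widths = numeric(nrep)  ## Empty object to store results
  for(i in 1:nrep){
    x = rnorm(n, mean = 0, sd = sd)  ## Draw one sample
    ci = t.test(x, conf.level = conf.level)$conf.int
    widths[i] = ci[2] - ci[1]  ## Upper endpoint minus lower endpoint
  }
  mean(widths)
}

set.seed(3)
avg_width(n = 30, conf.level = 0.95)
```

Calling the helper again while changing only one argument isolates the role of that factor, since everything else is held fixed.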

Question #8: Assuming everything else remains the same, what effect does increasing the confidence level have on the width of a confidence interval estimate?

Question #9: Assuming everything else remains the same, what effect does increasing the sample size have on the width of a confidence interval estimate?

Question #10: Assuming everything else remains the same, what effect does using the t-distribution (rather than the Normal distribution) have on the width of a confidence interval estimate of a single mean?

Question #11: Assuming everything else remains the same, what effect does the sample proportion being closer to 0.5 have on the width of a confidence interval estimate of a single proportion?

Question #12: Assuming everything else remains the same, what effect does less variability among the numeric values of cases in the population have on the width of a confidence interval estimate of a single mean?