Directions (read before starting)
\(~\)
You should work collaboratively with your assigned partner(s), making sure that you all understand the concepts introduced in the lab and that you all agree on your recorded responses to the lab’s questions.
\(~\)
Our lecture introduced sampling bias and sampling variability as two possible explanations for why descriptive statistics calculated within a sample might be inconsistent with the true values of those statistics within the target population.
We saw that our attempts to sample words from the Gettysburg Address to accurately estimate the speech’s average word length tended to be upwardly biased due to our predisposition against very common short words. We’ll continue working with the words in the Gettysburg Address in this lab:
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv")
Our goal will be to understand the concept of sampling variability and how it relates to sample size and variability within the population.
Question 1: In one sentence, briefly describe the observations of the target population for this application. That is, what is a single observation in the population, and how many observations are there in the population?
Question 2: The nchar() function (demonstrated below) returns the number of characters in a given input. The function is vectorized, so if we give it a vector containing 3 words it will return the number of characters in each word in a vector of length 3.
## Demonstration of nchar()
example_words = c("dog", "frog", "hippopotamus")
nchar(example_words)
## [1] 3 4 12
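Since our goal involves the average word length, it's worth noting that nchar() can be combined with mean(). The sketch below is an illustration on the example vector, not part of the lab's original code:

```r
## Mean word length of the example vector
example_words = c("dog", "frog", "hippopotamus")
mean(nchar(example_words))  ## (3 + 4 + 12) / 3
## [1] 6.333333
```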
Part A: Use the mutate() function covered in Lab 4 to add a new variable named word_length to a new data frame named gettysburg_new.
Part B: Use the gettysburg_new data frame and the summarize() function (also covered in Lab 4) to calculate the mean word length and standard deviation of word lengths in the population.
\(~\)
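As a reminder of the mutate() and summarize() pattern from Lab 4, here is a sketch using a hypothetical toy data frame (deliberately not the Gettysburg data, so it doesn't give away the answer to Question 2):

```r
library(dplyr)

## Hypothetical toy data frame for illustration only
toy = data.frame(word = c("dog", "frog", "hippopotamus"))

## mutate() adds a new column to a new data frame
toy_new = toy %>% mutate(word_length = nchar(word))

## summarize() collapses a column into summary statistics
toy_new %>% summarize(mean_length = mean(word_length),
                      sd_length = sd(word_length))
```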
In this section we’ll use the sample() function to select simple random samples from the words in our target population. The code below demonstrates how to draw a random sample of size \(n=2\) from the example vector used earlier.
example_words = c("dog", "frog", "hippopotamus")
sample(example_words, size = 2)
## [1] "hippopotamus" "frog"
Recognize that results produced by the sample() function are inherently random, so you can re-run the same command several times and get several different results.
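As an optional aside (not required by the lab), base R's set.seed() function fixes the starting point of the random number generator, which makes results produced by sample() reproducible across runs:

```r
## With a fixed seed, re-running both lines gives the same sample each time
example_words = c("dog", "frog", "hippopotamus")
set.seed(123)
sample(example_words, size = 2)
```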
Question 3: Use the sample() function to obtain a random sample of \(n=5\) words from the Gettysburg Address. Find the mean word length in your sample, then give your opinion on whether you think you observed a large or a small sampling error by comparing your estimate (sample mean) with the truth about the population (population mean).
\(~\)
In the real world, we typically can only collect a single sample from our population.
However, in order to better understand what might be observed in that single sample, statisticians have devoted considerable time to exploring what happens when random samples are repeatedly drawn from a population. We will explore this ourselves using a for loop:
## We'll do 100 repeats
n_repeats = 100
## Object to store each sample's mean
samp_mean = numeric(length = n_repeats)
## Repeat random sampling 100 times, storing each mean word length
for(i in 1:n_repeats){
current_sample = sample(gettysburg$word, size = 5)
samp_mean[i] = mean(nchar(current_sample))
}
If you aren’t familiar with for loops, the basic idea is to repeat a block of code a certain number of times using an index variable (in this case i) to keep track of the current repetition. This index variable is incremented after each repetition until it reaches its final value (in this case 100, the value of n_repeats).
The for loop given above repeats the following steps:
1. Use sample() to obtain a random sample of words.
2. Calculate the mean word length in that sample and store it in position i of samp_mean.
The result is that samp_mean now stores mean word lengths calculated in 100 different random samples of size \(n=5\). We can see the distribution of these sample means using a histogram:
library(ggplot2)
ggplot() + geom_histogram(aes(x = samp_mean), bins = 12)
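Beyond the histogram, a single-number summary of sampling variability is the standard deviation of the stored sample means. The sketch below re-creates samp_mean so it is self-contained; it mirrors the for loop shown above:

```r
## Self-contained sketch: re-create samp_mean, then summarize its spread
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv")
n_repeats = 100
samp_mean = numeric(length = n_repeats)
for(i in 1:n_repeats){
  samp_mean[i] = mean(nchar(sample(gettysburg$word, size = 5)))
}

## The standard deviation of the 100 sample means quantifies sampling variability
sd(samp_mean)
```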
Question 4: This question will explore sampling variability for random samples of 3 different sizes: 5, 10, and 50.
\(~\)
In Part 1 we saw that sampling variability, or the degree to which an estimate tends to vary from one sample to another, decreases as the sample size increases. That is, for a larger sample size (higher \(n\)), estimates from different samples will tend to be close to each other (and close to the true value for the population if there’s no sampling bias).
However, sampling variability is not exclusively a product of sample size; it also depends upon how much variability exists among the cases in our population.
To explore the role of variability within the population on sampling variability we’ll artificially create two new populations with higher variability than the original Gettysburg Address:
## Original population w/out the 'word_number' column (requires the dplyr package)
library(dplyr)
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv") %>% select(-word_number)
## New population #1
## Filter to keep only words with 7 or more letters, or 1-2 letters
new_pop1 = gettysburg %>%
  filter(nchar(word) > 6 | nchar(word) < 3)
## New population #2
## Add a few words to New Pop #1 that are extremely long
new_pop2 = new_pop1 %>% add_row(word = c("Incomprehensibilities", "Antidisestablishmentarianism", "Supercalifragilisticexpialidocious"))
## Notice how these new populations have more variability than the original
sd(nchar(new_pop1$word))
## [1] 3.294599
sd(nchar(new_pop2$word))
## [1] 5.261177
sd(nchar(gettysburg$word))
## [1] 2.123273
Question 5:
Part A: Repeat the repeated-sampling procedure from the previous section (drawing 100 random samples of size \(n=5\)) using new_pop1. Then find the standard deviation of these 100 different sample means.
Part B: Do the mean word lengths of samples of size \(n=5\) have more or less variability than the word lengths of different cases in new_pop1?
Part C: Repeat Part A using new_pop2. How does the sampling variability you observe compare with Part B of this question and Part A of Question 4? Briefly explain why there is more sampling variability here.
\(~\)
The histograms we’ve looked at throughout this lab are known as sampling distributions. These distributions display the amount of variability present in a sampling procedure.
If we know how much sampling variability is present, we can attach a margin of error to our point estimate to more accurately describe what we believe is likely to be true about our target population. This goal, interval estimation, will be our focus next week.