Directions (read before starting)
\(~\)
You should work collaboratively with your assigned partner(s), making sure that you all understand the concepts introduced in the lab and that you all agree on your recorded responses to the lab’s questions.
\(~\)
Our lecture introduced sampling bias and sampling variability as two possible explanations for why descriptive statistics calculated within a sample might be inconsistent with the true values of those statistics within the target population.
We saw that our attempts to sample words from the Gettysburg Address to accurately estimate the speech’s average word length tended to be upwardly biased due to our predisposition against very common short words. We’ll continue working with the words in the Gettysburg Address in this lab:
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv")
Our goal will be to understand the concept of sampling variability and how it relates to sample size and variability within the population.
Question 1: In one sentence, briefly describe the observations of the target population for this application. That is, what is a single observation in the population, and how many observations are there in the population?
Question 2: The nchar() function (demonstrated below) returns the number of characters in a given input. The function is vectorized, so if we give it a vector containing 3 words it will return the number of characters in each word in a vector of length 3.
## Demonstration of nchar()
example_words = c("dog", "frog", "hippopotamus")
nchar(example_words)
## [1] 3 4 12
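Since our goal involves the average word length, it's worth noting that nchar() can be combined with mean(). The sketch below is an illustration on the example vector, not part of the lab's original code:

```r
## Mean word length of the example vector
example_words = c("dog", "frog", "hippopotamus")
mean(nchar(example_words))  ## (3 + 4 + 12) / 3
## [1] 6.333333
```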
Part A: Use the mutate() function covered in Lab 4 to add a new variable named word_length to a new data frame named gettysburg_new.
Part B: Use the gettysburg_new data frame and the summarize() function (also covered in Lab 4) to calculate the mean word length and standard deviation of word lengths in the population.
\(~\)
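As a reminder of the mutate() and summarize() pattern from Lab 4, here is a sketch using a hypothetical toy data frame (deliberately not the Gettysburg data, so it doesn't give away the answer to Question 2):

```r
library(dplyr)

## Hypothetical toy data frame for illustration only
toy = data.frame(word = c("dog", "frog", "hippopotamus"))

## mutate() adds a new column to a new data frame
toy_new = toy %>% mutate(word_length = nchar(word))

## summarize() collapses a column into summary statistics
toy_new %>% summarize(mean_length = mean(word_length),
                      sd_length = sd(word_length))
```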
In this section we’ll use the sample() function to select simple random samples from the words in our target population. The code below demonstrates how to draw a random sample of size \(n=2\) from the example vector used earlier.
example_words = c("dog", "frog", "hippopotamus")
sample(example_words, size = 2)
## [1] "hippopotamus" "frog"
Recognize that results produced by the sample() function are inherently random, so you can re-run the same command several times and get several different results.
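As an optional aside (not required by the lab), base R's set.seed() function fixes the starting point of the random number generator, which makes results produced by sample() reproducible across runs:

```r
## With a fixed seed, re-running both lines gives the same sample each time
example_words = c("dog", "frog", "hippopotamus")
set.seed(123)
sample(example_words, size = 2)
```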
Question 3: Use the sample() function to obtain a random sample of \(n=5\) words from the Gettysburg Address. Find the mean word length in your sample, then give your opinion on whether you think you observed a large or a small sampling error by comparing your estimate (sample mean) with the truth about the population (population mean).
\(~\)
In the real world, we typically can only collect a single sample from our population.
However, in order to better understand what might be observed in that single sample, statisticians have devoted considerable time to exploring what happens when random samples are repeatedly drawn from a population. We will explore this ourselves using a for loop:
## We'll do 100 repeats
n_repeats = 100
## Object to store each sample's mean
samp_mean = numeric(length = n_repeats)
## Repeat random sampling 100 times, storing each mean word length
for(i in 1:n_repeats){
current_sample = sample(gettysburg$word, size = 5)
samp_mean[i] = mean(nchar(current_sample))
}
If you aren’t familiar with for loops, the basic idea is to repeat a block of code a certain number of times using an index variable (in this case i) to keep track of the current repetition. This index variable is incremented after each repetition until it reaches its final value (in this case 100, the value of n_repeats).
The for loop given above repeats the following steps:
1. Use sample() to obtain a random sample of words.
2. Calculate the mean word length in that sample and store it in position i of samp_mean.
The result is that samp_mean now stores mean word lengths calculated in 100 different random samples of size \(n=5\). We can see the distribution of these sample means using a histogram:
library(ggplot2)
ggplot() + geom_histogram(aes(x = samp_mean), bins = 12)
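Beyond the histogram, a single-number summary of sampling variability is the standard deviation of the stored sample means. The sketch below re-creates samp_mean so it is self-contained; it mirrors the for loop shown above:

```r
## Self-contained sketch: re-create samp_mean, then summarize its spread
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv")
n_repeats = 100
samp_mean = numeric(length = n_repeats)
for(i in 1:n_repeats){
  samp_mean[i] = mean(nchar(sample(gettysburg$word, size = 5)))
}

## The standard deviation of the 100 sample means quantifies sampling variability
sd(samp_mean)
```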
Question 4: This question will explore sampling variability for random samples of 3 different sizes: 5, 10, and 50.
\(~\)
In Part 1 we saw that sampling variability, or the degree to which an estimate tends to vary from one sample to another, decreases as the sample size increases. That is, for a larger sample size (higher \(n\)), estimates from different samples will tend to be close to each other (and close to the true value for the population if there’s no sampling bias).
However, sampling variability is not exclusively a product of sample size; it also depends upon how much variability exists among the cases in our population.
To explore the role of variability within the population on sampling variability we’ll artificially create two new populations with higher variability than the original Gettysburg Address:
## Original population w/out the 'word_number' column (requires the dplyr package)
library(dplyr)
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv") %>% select(-word_number)
## New population #1
## Filter to keep only words with 7 or more letters, or 1-2 letters
new_pop1 = gettysburg %>%
  filter(nchar(word) > 6 | nchar(word) < 3)
## New population #2
## Add a few words to New Pop #1 that are extremely long
new_pop2 = new_pop1 %>% add_row(word = c("Incomprehensibilities", "Antidisestablishmentarianism", "Supercalifragilisticexpialidocious"))
## Notice how these new populations have more variability than the original
sd(nchar(new_pop1$word))
## [1] 3.294599
sd(nchar(new_pop2$word))
## [1] 5.261177
sd(nchar(gettysburg$word))
## [1] 2.123273
Question 5:
Part A: Repeat the repeated-sampling procedure from the previous section (drawing 100 random samples of size \(n=5\)) using new_pop1. Then find the standard deviation of these 100 different sample means.
Part B: Do the mean word lengths of samples of size \(n=5\) have more or less variability than the word lengths of different cases in new_pop1?
Part C: Repeat Part A using new_pop2. How does the sampling variability you observe compare with Part B of this question and Part A of Question 4? Briefly explain why there is more sampling variability here.
\(~\)
The histograms we’ve looked at throughout this lab are known as sampling distributions. These distributions display the amount of variability present in a sampling procedure.
If we know how much sampling variability is present, we can attach a margin of error to our point estimate to more accurately describe what we believe is likely to be true about our target population. This goal, interval estimation, will be our focus next week.