There are three major topics covered in this lab: sampling variability, the influence of sample size and population variability on that variability, and sampling bias.
We’ll work with the text of the Gettysburg Address, a famous speech given by former US President Abraham Lincoln during the American Civil War. The full text will be considered our population of interest, and we’ll explore sampling words from the text.
Below we load the speech into R and use the nchar() function to add a new variable, word_length:
gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv")
gettysburg$word_length = nchar(gettysburg$word)
Later in this lab we’ll use “for-loops”, a computational tool for repeating a block of code many times. Below is an example that repeatedly samples a word from the vector my_words and stores the number of characters contained in that word:
## Words to sample from
my_words = c("cat", "dog", "frog", "hippopotamus")
## Number of times to execute the loop
n_reps = 100
## Object to store results from each pass through the loop
word_lengths = numeric(length = n_reps)
## Repeat random sampling 100 times, storing each word length
for(i in 1:n_reps){
  current_sample = sample(my_words, size = 1) # size = 1 samples n=1 case from 'my_words'
  word_lengths[i] = nchar(current_sample)
}
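As an aside, the same repeated sampling can be written more compactly using base R's replicate() function, which evaluates an expression a fixed number of times and collects the results. This sketch is equivalent to the for-loop above:

```r
## Words to sample from (same toy population as above)
my_words = c("cat", "dog", "frog", "hippopotamus")
n_reps = 100

## replicate() runs the expression n_reps times and returns a vector of results
word_lengths = replicate(n_reps, nchar(sample(my_words, size = 1)))
length(word_lengths)  # one word length per repetition, so 100 values
```

The for-loop version is more flexible (e.g., it can store several quantities per pass), but replicate() is a convenient shorthand when each pass produces a single value.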
We can use the table() function to tally the frequencies of word lengths across the sampled words:
## Table of the word lengths
table(word_lengths)
## word_lengths
## 3 4 12
## 51 26 23
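Because Question #1 below concerns proportions rather than counts, it may also help to know that prop.table() converts a frequency table into proportions. A self-contained sketch (your exact proportions will differ from run to run, since the samples are random):

```r
## Proportion of sampled words with each length
my_words = c("cat", "dog", "frog", "hippopotamus")
word_lengths = replicate(100, nchar(sample(my_words, size = 1)))
prop.table(table(word_lengths))  # entries sum to 1
```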
In this lab you’ll need the dplyr and ggplot2 packages that we’ve been using in previous labs.
library(dplyr)
library(ggplot2)
To prepare for the remainder of this lab, you should begin by answering the following question pertaining to the example given in this lab’s preamble. Copied below is the vector my_words that is used in that example.
## Vector of words used in the preamble
my_words = c("cat", "dog", "frog", "hippopotamus")
Question #1:

Part A: If the for-loop in the preamble were run with n_reps increased to 1000, would you expect the proportion of 3-character words to be closer to or further from 50%? Briefly explain your answer.

Part B: Explain why the probability of randomly sampling a 3-character word from my_words is 0.5.

Part C: Let \(X\) denote the number of characters in a word randomly sampled from my_words. Is \(X\) a continuous or discrete random variable? What is the probability distribution of \(X\)?
Sampling variability is a reason why a sample might suggest trends
that deviate from those present in the population. For example, we know
that the population mean word length of my_words
(from the preamble and Question #1) is 5.5 characters:
## Mean word length of the population
mean(nchar(my_words))
## [1] 5.5
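The same approach extends to the population's spread, which becomes important later in this lab. A quick sketch (note, as an aside, that R's sd() uses the n - 1 denominator):

```r
## Word lengths in the my_words population
my_words = c("cat", "dog", "frog", "hippopotamus")
pop_lengths = nchar(my_words)  # 3 3 4 12
mean(pop_lengths)              # 5.5, the population mean word length
sd(pop_lengths)                # roughly 4.36, the spread of the population
```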
If we take a random sample of \(n=2\) words, we might get a sample average of 3.5 characters (if the words “cat” and “frog” were randomly selected). This is an underestimate of the population’s mean, but that isn’t because our sampling strategy was biased/flawed.
Now let’s move to the Gettysburg Address data set. The code below uses a for-loop to repeatedly take random samples of size \(n=5\) from the words in the address:
## We'll do 100 repeats
n_reps = 100
## Object to store each sample's mean
sample_means = numeric(length = n_reps)
## Repeat random sampling 100 times, storing each mean word length
for(i in 1:n_reps){
  current_sample = sample(gettysburg$word, size = 5)
  sample_means[i] = mean(nchar(current_sample))
}
The object sample_means
contains the sample mean for
each of the 100 random samples drawn during the for-loop. We can graph
the distribution of these means to better understand sampling
variability:
ggplot() + geom_histogram(aes(x = sample_means), bins = 10)
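A histogram shows the shape of the sampling distribution, and its center and spread can also be summarized numerically. The sketch below uses the small my_words population (rather than the Gettysburg data) so that it is self-contained; exact values vary because the sampling is random:

```r
## Sampling distribution of the mean for samples of n = 2 words
my_words = c("cat", "dog", "frog", "hippopotamus")
sample_means = replicate(100, mean(nchar(sample(my_words, size = 2))))
mean(sample_means)  # tends to land near the population mean of 5.5
sd(sample_means)    # smaller values indicate less sampling variability
```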
Question #2: This question focuses on the influence of sample size on sampling variability.

Part A: Calculate the standard deviation of the sample means stored in sample_means. Interpret this value using the definition of standard deviation (see slide 11 in our univariate summaries notes for the definition).

Part B: Modify the for-loop so that it draws samples of a larger size (by changing the size argument in the sample() function). Display the average word lengths in these samples in a histogram similar to the one shown above for samples of size \(n=5\).
In the previous section, we saw that sample size had a significant influence on sampling variability; however, the degree to which a sample could be expected to deviate from the population is not solely determined by the size of the sample. Variability among the cases in the population also contributes to sampling variability.
To study this, we’ll create a few artificial populations that are offshoots of the real Gettysburg Address:
# New Population #1
## Filter to keep only the long and short words
new_pop1 = gettysburg %>%
  filter(word_length > 6 | word_length < 3)
# New Population #2
## Add a few extremely long words to New Population #1
new_pop2 = new_pop1 %>%
  add_row(word = c("Incomprehensibilities", "Antidisestablishmentarianism", "Supercalifragilisticexpialidocious"))
new_pop2$word_length = nchar(new_pop2$word)
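To see why these modifications should matter, consider a hypothetical mini-population of word lengths (the numbers below are made up for illustration): removing the mid-length values, and then adding a few extreme values, each increase the standard deviation.

```r
## Hypothetical word lengths illustrating the effect of each modification
pop  = c(2, 3, 4, 5, 6, 7, 8)   # an original population of word lengths
pop1 = pop[pop > 6 | pop < 3]   # keep only short and long values, as in new_pop1
pop2 = c(pop1, 21, 28, 34)      # add a few extreme values, as in new_pop2
c(sd(pop), sd(pop1), sd(pop2))  # each step increases the spread
```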
Question #3:

Part A: Briefly explain why the standard deviation of word_length is larger in new_pop1 and new_pop2 than it is in the original Gettysburg Address.

Part B: Modify the for-loop given prior to Question #2 so that it samples from new_pop1, maintaining the sample size of \(n=5\). Then, using the results of this loop, calculate the standard deviation of the 100 sample averages stemming from new_pop1. How does this standard deviation compare to the one you calculated in Part A of Question #2?

Part C: Repeat Part B using new_pop2. How does the standard deviation of these 100 sample averages compare to the one you calculated in Part A of Question #2? How about the one you calculated in Part B of this question?
By default, the sample() function performs simple random sampling, meaning each case within the population has an equal likelihood of being selected during the sampling process. Sampling bias occurs when certain cases in the population are more likely to be selected than others, and it can arise in several ways in practice.
To further understand sampling bias, we can perform biased sampling
using the prob
argument in the sample()
function:
## Sampling probabilities proportional to word length
proportional_to_length = gettysburg$word_length/sum(gettysburg$word_length)
## Use in the sample() function
biased_sample = sample(gettysburg$word, size = 50, prob = proportional_to_length)
mean(nchar(biased_sample))
## [1] 5.1
Notice how this single sample of size \(n=50\) has an average word length that is much higher than the population’s mean word length of 4.3 characters. However, how do we know that this isn’t just sampling variability? The answer lies in the fact that sampling bias systematically produces samples whose estimates are either too high or too low. We could study the behavior of this sampling approach across many repeated samples to determine if our error was due to sampling bias or sampling variability.
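One way to sketch this idea is with the small my_words population from the preamble: across many repeated samples, unbiased sampling produces means that average out near the population mean of 5.5, while length-weighted sampling is systematically too high (exact values vary from run to run):

```r
## Repeated unbiased vs. length-weighted sampling from a tiny population
my_words = c("cat", "dog", "frog", "hippopotamus")
wts = nchar(my_words) / sum(nchar(my_words))  # probability proportional to length

unbiased = replicate(1000, mean(nchar(sample(my_words, size = 2))))
biased   = replicate(1000, mean(nchar(sample(my_words, size = 2, prob = wts))))

mean(unbiased)  # near 5.5: errors cancel out across samples (sampling variability)
mean(biased)    # consistently above 5.5: a systematic error (sampling bias)
```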
Question #4: Modify the for-loop given prior to
Question #2 so that each word is sampled with a probability proportional
to its length. After running your modified loop, use the results stored
in sample_means
to create a histogram that displays the
distribution of average word lengths stored in
sample_means
. Does this distribution appear to be centered
at the population’s average word length (4.3 characters)?