\(~\)

Onboarding

There are three major topics covered in this lab:

  1. Random variables which stem from the selection of a sample of data from a population
  2. Sampling variability in response to characteristics of the sampling process
  3. Sampling bias in response to characteristics of the sampling process

We’ll work with the text of the Gettysburg Address, a famous speech given by former US President Abraham Lincoln during the American Civil War. The full text will be considered our population of interest, and we’ll explore sampling words from the text.

Below we load the speech into R and use the nchar() function to add a new variable, word_length, recording the number of characters in each word:

gettysburg = read.csv("https://remiller1450.github.io/data/gettysburg.csv") 
gettysburg$word_length = nchar(gettysburg$word)

Later in this lab we’ll use “for-loops”, a computational tool for repeating a block of code many times. Below is an example that repeatedly samples a word from the vector my_words and stores the number of characters contained in that word:

## Words to sample from
my_words = c("cat", "dog", "frog", "hippopotamus")

## Number of times to execute the loop
n_reps = 100

## Object to store results from each pass thru the loop
word_lengths = numeric(length = n_reps)

## Repeat random sampling 100 times, storing each word length
for(i in 1:n_reps){
  current_sample = sample(my_words, size = 1) # size = 1 samples n=1 case from 'my_words'
  word_lengths[i] = nchar(current_sample)
}

We can use the table() function to tally the frequencies of word lengths across the sampled words:

## Table of the word lengths
table(word_lengths)
## word_lengths
##  3  4 12 
## 51 26 23
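The same simulation can be written more compactly without an explicit for-loop. Below is a minimal sketch using base R's replicate() function, with set.seed() added so the results are reproducible (neither appears in the lab's own code, and the seeded counts will differ from the table above):

```r
## Words to sample from (same vector as above)
my_words = c("cat", "dog", "frog", "hippopotamus")

## Fix the random seed so repeated runs give identical results
set.seed(1)

## replicate() evaluates the expression 100 times and collects the results
word_lengths = replicate(100, nchar(sample(my_words, size = 1)))

## Tally the frequencies, just as before
table(word_lengths)
```

Both versions produce a vector of 100 word lengths drawn from {3, 4, 12}; the for-loop form in the preamble is used throughout this lab because it makes the repetition explicit.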

\(~\)

Lab

In this lab you’ll need the dplyr and ggplot2 packages that we’ve been using in previous labs.

library(dplyr)
library(ggplot2)

\(~\)

Random Variables and Probability

To prepare for the remainder of this lab, you should begin by answering the following question pertaining to the example given in this lab’s preamble. Copied below is the vector my_words that is used in that example.

## Vector of words used in the preamble
my_words = c("cat", "dog", "frog", "hippopotamus")

Question #1:

  • Part A: In the for-loop example from this lab’s preamble, roughly 50% of the 100 randomly selected words were 3 characters long; however, this percentage was not exactly 50%. If we were to increase the value of n_reps to 1000, would you expect the proportion of 3-character words to be closer to or further from 50%? Briefly explain your answer.
  • Part B: Make an argument using the Law of Large Numbers (revisit this week’s probability notes for the definition) that the probability of a 3-character word being sampled from the vector my_words is 0.5.
  • Part C: Let the random variable \(X\) denote the number of characters observed when randomly sampling a single word from the vector my_words. Is \(X\) a continuous or discrete random variable? What is the probability distribution of \(X\)?

\(~\)

Sample Size and Sampling Variability

Sampling variability is a reason why a sample might suggest trends that deviate from those present in the population. For example, we know that the population mean word length of my_words (from the preamble and Question #1) is 5.5 characters:

## Mean word length of the population
mean(nchar(my_words))
## [1] 5.5

If we take a random sample of \(n=2\) words, we might get a sample average of 3.5 characters (if the words “cat” and “frog” were randomly selected). This underestimates the population mean, but not because our sampling strategy was biased or flawed; it simply reflects sampling variability.
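Because my_words contains only four words, we can list every possible sample of size \(n=2\) and see exactly how much the sample mean can vary. Here is a small sketch using base R's combn() function (this enumeration is our own illustration, not part of the lab's code):

```r
## The small population from the preamble
my_words = c("cat", "dog", "frog", "hippopotamus")

## Every pair of word lengths that could be sampled (without replacement)
all_pairs = combn(nchar(my_words), 2)

## The sample mean for each possible pair:
## 3.0, 3.5, 7.5, 3.5, 7.5, 8.0
colMeans(all_pairs)

## Averaging over every possible sample recovers the population mean of 5.5
mean(colMeans(all_pairs))
```

Individual samples can miss the population mean badly (as low as 3 or as high as 8), yet the average over all possible samples is exactly 5.5, illustrating that random sampling is variable but not biased.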

Now let’s move to the Gettysburg Address data set. The code below uses a for-loop to repeatedly take random samples of size \(n=5\) from the words in the address:

## We'll do 100 repeats
n_reps = 100

## Object to store each sample's mean
sample_means = numeric(length = n_reps)

## Repeat random sampling 100 times, storing each mean word length
for(i in 1:n_reps){
  current_sample = sample(gettysburg$word, size = 5)
  sample_means[i] = mean(nchar(current_sample))
}

The object sample_means contains the sample mean for each of the 100 random samples drawn during the for-loop. We can graph the distribution of these means to better understand sampling variability:

ggplot() + geom_histogram(aes(x = sample_means), bins = 10)
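Since Question #2 asks you to rerun this simulation for several sample sizes, it may help to wrap the for-loop in a function whose sample size is an argument. The sketch below is our own suggestion (the name simulate_means and its arguments are not part of the lab); it is demonstrated on the small my_words vector, but in the lab you would pass gettysburg$word instead:

```r
## Draw 'n_reps' random samples of size 'n' from 'words' and
## return the mean word length of each sample
simulate_means = function(words, n, n_reps = 100) {
  sample_means = numeric(length = n_reps)
  for (i in 1:n_reps) {
    current_sample = sample(words, size = n)
    sample_means[i] = mean(nchar(current_sample))
  }
  sample_means
}

## Example with a small self-contained vector;
## in the lab, use simulate_means(gettysburg$word, n = 5) and so on
means_n2 = simulate_means(c("cat", "dog", "frog", "hippopotamus"), n = 2)
```

Changing the sample size then only requires changing the n argument rather than editing the loop each time.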

Question #2: This question focuses on the influence of sample size on sampling variability.

  • Part A: Run the for-loop code given in this section, then find the standard deviation of the values contained in the object sample_means. Interpret this value using the definition of standard deviation (see slide 11 in our univariate summaries notes for the definition).
  • Part B: Modify the for-loop given in this section to now take samples of size \(n=15\) (this is controlled via the size argument in the sample() function). Display the average word lengths in these samples in a histogram similar to the one shown above for samples of size \(n=5\).
  • Part C: Calculate the standard deviation of the 100 sample averages you found in Part B. Are these sample averages more spread out or less spread out than those from the samples of size \(n=5\) you looked at in Part A?
  • Part D: Modify the for-loop given in this section one more time to now take samples of size \(n=50\). Display these sample averages using a histogram and briefly describe how this histogram compares to the one from Part B, being sure to pay attention to the scale/units of the x-axis.
  • Part E: Find the standard deviation of the 100 sample averages you found in Part D. How does the variability across random samples of size \(n=50\) compare to the variability of samples of size \(n=5\) and \(n=15\)?

\(~\)

Variability within the Population

In the previous section, we saw that sample size has a substantial influence on sampling variability; however, how far a sample can be expected to deviate from the population is not determined solely by the size of the sample. Variability among the cases in the population also contributes to sampling variability.

  • If the cases in the population tend to be similar (i.e., the standard deviation of the population is small), we can expect less sampling variability.
  • If the cases in the population tend to be dissimilar (i.e., the standard deviation of the population is large), we can expect more sampling variability.

To study this, we’ll create a few artificial populations that are offshoots of the real Gettysburg Address:

# New Population #1
## Filter to keep only the long and short words 
new_pop1 = gettysburg %>% 
  filter(word_length > 6 | word_length < 3)

# New Population #2
## Add a few extremely long words to New Population #1
new_pop2 = new_pop1 %>% 
  add_row(word = c("Incomprehensibilities", "Antidisestablishmentarianism", "Supercalifragilisticexpialidocious"))
new_pop2$word_length = nchar(new_pop2$word)
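The final line above is necessary because rows appended without a word_length value are filled with NA. Here is a small self-contained sketch of the same step in base R using rbind() (a stand-in for dplyr's add_row(), with a tiny made-up data frame rather than the real populations):

```r
## A toy two-word population with word lengths already computed
pop = data.frame(word = c("men", "war"),
                 word_length = c(3, 3),
                 stringsAsFactors = FALSE)

## Append a new word; its word_length is unknown (NA) at this point
pop = rbind(pop, data.frame(word = "Incomprehensibilities",
                            word_length = NA,
                            stringsAsFactors = FALSE))

## Recompute word_length for every row, as the lab does for new_pop2
pop$word_length = nchar(pop$word)
```

Forgetting the recomputation step would leave missing values in word_length, which would propagate NA into any standard deviations calculated later.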

Question #3:

  • Part A: Confirm that the standard deviation of the variable word_length is larger in new_pop1 and new_pop2 than it is in the original Gettysburg Address.
  • Part B: Adapt the for-loop given prior to Question #2 to use new_pop1, maintaining the sample size of \(n=5\). Then, using the results of this loop, calculate the standard deviation of the 100 sample averages stemming from new_pop1. How does this standard deviation compare to the one you calculated in Part A of Question #2?
  • Part C: Repeat Part B using new_pop2. How does the standard deviation of these 100 sample averages compare to the one you calculated in Part A of Question #2? How about the one you calculated in Part B of this question?
  • Part D: Suppose we create another contrived new population that filters the original Gettysburg Address to only include words that contain between 3 and 6 characters. If you were to take 100 random samples of size \(n=5\) and find the average word length in each sample, would you expect these samples to exhibit more or less sampling variability than random samples of size \(n=5\) drawn from the original Gettysburg Address? Briefly explain.

\(~\)

Sampling Bias

By default, the sample() function performs random sampling, which means that each case within the population has an equal likelihood of being selected during the sampling process. Sampling bias occurs when certain cases in the population are more likely to be selected than others. There are a few ways this can happen:

  1. The likelihood of a case being sampled can be influenced by attributes of that case.
    • For example, suppose we pin a copy of the Gettysburg Address up on the wall and throw darts at it while blindfolded, keeping each word that is struck by a dart to be part of our sample. You might think that these words were randomly selected, but statistically speaking they are not, as the longer/larger words are more likely to be struck by a dart.
    • Another example of this is non-response bias. For example, if you try to recruit 1000 office workers to complete an anonymous survey about workload, those with higher workloads might disproportionately decline to respond to the survey because they don’t have time. Thus, the sample will disproportionately reflect the office workers who did have the time/motivation to complete the survey.
  2. A certain segment of the population is unreachable by the sampling procedure being used.
    • For example, suppose you are sampling science professors by randomly selecting office numbers in Noyce. Any science professors who have moved their office due to being on research leave (which can be as high as 1/6th of tenured faculty in a given year) cannot be selected by this procedure.

To further understand sampling bias, we can perform biased sampling using the prob argument in the sample() function:

## Sampling probabilities proportional to word length
proportional_to_length = gettysburg$word_length/sum(gettysburg$word_length)

## Use in the sample() function
biased_sample = sample(gettysburg$word, size = 50, prob = proportional_to_length)
mean(nchar(biased_sample))
## [1] 5.1
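The prob argument expects one non-negative weight per element of the vector being sampled. The following self-contained sketch (our own demonstration, using the small my_words vector rather than the full speech) shows that length-proportional weights make longer words far more likely to be drawn:

```r
my_words = c("cat", "dog", "frog", "hippopotamus")
lengths = nchar(my_words)

## Length-proportional weights; these sum to 1, though sample()
## would also rescale any non-negative weights for us
w = lengths / sum(lengths)

## Draw many single-word samples under the biased weights
set.seed(1)
biased = replicate(2000, sample(my_words, size = 1, prob = w))

## "hippopotamus" (12 of the 22 total characters) should appear in
## roughly 55% of draws, versus 25% under equal weights
table(biased)
```

The same mechanism is what inflates the average word length in the Gettysburg Address sample above.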

Notice how this single sample of size \(n=50\) has an average word length noticeably higher than the population’s mean word length of 4.3 characters. But how do we know that this isn’t just sampling variability? The distinction is that sampling bias systematically pushes estimates in one direction (consistently too high or consistently too low), whereas sampling variability produces errors in both directions. We could study the behavior of this sampling approach across many repeated samples to determine whether our error was due to sampling bias or sampling variability.

Question #4: Modify the for-loop given prior to Question #2 so that each word is sampled with a probability proportional to its length. After running your modified loop, use the results stored in sample_means to create a histogram that displays the distribution of average word lengths stored in sample_means. Does this distribution appear to be centered at the population’s average word length (4.3 characters)?