This lab is intended to provide insight into one of the most important statistical results, the Central Limit Theorem.

Directions (Please read before starting)

  1. Please work together with your assigned groups. Even though you’ll turn in a write-up that is later scored, labs are intended to be formative, and a substantial portion of the credit you’ll receive is based upon effort and completion.
  2. Please record your responses and code in an R Markdown document following the conventions we’ve used in previous labs.

\(~\)

Claims Against the TSA

The Transportation Security Administration (TSA) is an agency within the US Department of Homeland Security that has authority over the safety and security of travel in the United States. It was created in response to 9/11 and is most heavily involved in the daily operating procedures of airports.

The TSA screens all persons and personal possessions traveling aboard US airplanes. As a result, thousands of legal claims are filed against the TSA each year regarding stolen or damaged property, improper screening practices, and bodily injury. The code below loads all claims made against the TSA in the years 2004 and 2005.

tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")  ## Full dataset
#tsa <- read.csv("https://remiller1450.github.io/data/tsa_small.csv")  ## small version (only 5000 cases)

The variables in this dataset include:

  • Claim_Number: A unique identifier for the claim
  • Year: The year in which the event occurred
  • Month: The month within the specified year
  • Day: The day within the specified year/month
  • Airport_Code: An identifier for the airport at which the claim occurred
  • Airline_Name: The airline that the claimant was traveling on
  • Claim_Type: The category of the claim
  • Claim_Site: Where the event occurred
  • Item: The item being claimed
  • Claim_Amount: The monetary amount asked for by the claimant
  • Status: Whether the claim was approved, settled, or denied
  • Close_Amount: The monetary amount awarded by the TSA to the claimant

Note: This is a large dataset that takes a while to load. If your PC does not have enough memory to handle it, you may use the “tsa_small.csv” file instead (please make a comment indicating that you’re doing this).

\(~\)

Sampling variability

For the remainder of this lab, we’ll treat the full tsa dataset as a population (something we wouldn’t have access to in most real applications), and our goal will be to understand what we might observe in any given sample drawn from a population.

The code below draws a random sample of \(n = 50\) cases from the TSA population:

N = nrow(tsa) ## Number of cases in the population
set.seed(123) ## Random number generation seed
sampled = sample(1:N, size = 50)  ## Draw ID numbers to be in the sample
tsa_sample = tsa[sampled,]  ## Subset original data to only include the sampled IDs

Question #1: Find the mean claim amount in the sample of \(n = 50\) cases created by the code above and report it using proper statistical notation (see the sampling and study design slides for details). Then find the mean claim amount for the entire population (i.e., the tsa data.frame). Briefly explain why these two values are different.

\(~\)

To better understand sampling variability, we’ll repeat the process of drawing random samples of \(n = 50\) and then recording their respective sample means 100 different times. The code below uses a for loop to do this:

nrep = 100 ## Number of times to repeat the loop
sample_means = numeric(nrep)  ## Empty vector to store results (sample means)

for(i in 1:nrep){
  sampled = sample(1:N, size = 50)  ## Draw IDs of a sample
  tsa_sample = tsa[sampled,]  ## Create the sample
  sample_means[i] = mean(tsa_sample$Claim_Amount)  ## Record the mean of that sample
}
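As an aside, the same repeated sampling can be written more compactly with R’s `replicate()` function. The sketch below uses a simulated stand-in population (a hypothetical right-skewed variable, not the actual tsa data) so that it runs on its own; with the real data you would replace `population` with `tsa$Claim_Amount`:

```r
## Compact alternative to the for loop using replicate().
## `population` here is a simulated stand-in so the snippet is
## self-contained; with the real data, use tsa$Claim_Amount instead.
set.seed(123)
population <- rexp(10000, rate = 1/1000)  ## hypothetical skewed "claim amounts"
sample_means <- replicate(100, mean(sample(population, size = 50)))
length(sample_means)  ## one sample mean per repetition
```

`replicate()` evaluates its second argument repeatedly and collects the results into a vector, which avoids pre-allocating storage and indexing by hand.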

Question #2: Create a histogram displaying the various sample means that were seen across the 100 different random samples drawn by the code provided above. Then describe the shape and center of this distribution. Finally, how would you quantify the amount of sampling variability involved in drawing a sample of size \(n = 50\) from this population?

Question #3: Using the code provided above Question #2 as a template, modify the size argument to instead draw different samples of size \(n = 200\). Following this change, create a histogram of the new set of sample means. How does this histogram compare to the one from Question #2 in terms of its shape, center, and spread?

\(~\)

The sample average as a random variable

Repeatedly drawing different random samples from a population is a useful exercise for understanding sampling variability; however, it is not something we’d ever be able (or want) to do in practice. In the real world, the process of collecting data is usually too expensive and time consuming to repeat.

Fortunately, the knowledge we gained from Questions #1-3, combined with what we’ve been learning about probability theory and random variables, will allow us to describe the amount of sampling variability present in our data using only a single sample.

To begin, let \(X_i\) be a random variable describing the claim amount made by a randomly sampled case from the tsa population.

By definition, \(E(X_i) = \mu\), the mean claim amount of the tsa population. Similarly, \(Var(X_i) = \sigma^2\), the variance of claim amounts within the tsa population.
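A quick simulation can make these definitions concrete: averaging many independent draws of \(X_i\) recovers something close to \(\mu\). The population below is simulated (an assumption for illustration), not the tsa data:

```r
## Sanity check that E(X_i) = mu: the average of many independent
## draws of X_i approaches the population mean (simulated population).
set.seed(1)
population <- rexp(100000, rate = 1/500)  ## hypothetical population of "claims"
mu <- mean(population)                    ## the population mean, mu
draws <- sample(population, size = 50000, replace = TRUE)
mean(draws)                               ## close to mu
```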

Next, notice how we can express the sample mean as a linear combination of random variables: \[\bar{X} = \tfrac{1}{n}(X_1 + X_2 + \ldots + X_n)\]

Question #4: What is the expected value of \(\bar{X}\)? (Hint: check out the “Linear combinations of random variables” slide in the “Random Variables and Probability Models” slides).

Question #5: What is the standard deviation of \(\bar{X}\)? (Hint: check out the “Linear combinations of random variables” slide in the “Random Variables and Probability Models” slides).

Question #6: From Questions #4 and #5 we’ve determined the expected value (center) and the standard deviation (variability) of the sample mean, but we’ve yet to establish a probability distribution. Based upon the histograms you created in Questions #2 and #3, does a Normal model (with the mean and standard deviation you calculated in Questions #4 and #5) seem appropriate?

\(~\)

The Central Limit Theorem

Arguably the most famous theoretical result in all of statistics, the Central Limit Theorem states:

Suppose \(X_1, X_2, \ldots, X_n\) are independent random variables with a common expectation, \(E(X_i) = \mu\), and a common variance, \(Var(X_i) = \sigma^2\). Letting \(\bar{X}\) denote the average of these random variables: \(\sqrt{n}\big(\tfrac{\bar{X} - \mu}{\sigma}\big) \rightarrow N(0,1)\)

The Central Limit Theorem implies that we can use a Normal probability model to evaluate the sampling variability intrinsic to a sample average, even if we only have data from a single sample!

With a slight abuse of notation, we can express the main result of the CLT as: \(\bar{X} \sim N(\mu, \sigma/\sqrt{n})\)

In words, the CLT suggests that the sample mean follows a Normal distribution centered at the population mean with a standard deviation equal to the population’s standard deviation divided by the square root of the sample size, \(n\).
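The \(\sigma/\sqrt{n}\) claim is easy to check numerically. The sketch below does so on a simulated skewed population (an assumption, so it runs without the tsa data): the standard deviation of many sample means should closely match \(\sigma/\sqrt{n}\):

```r
## Numerical check of the CLT's standard deviation, sigma / sqrt(n),
## using a simulated right-skewed population (not the tsa data).
set.seed(42)
population <- rexp(100000, rate = 1/300)  ## hypothetical population
sigma <- sd(population)                   ## population standard deviation
n <- 50
sample_means <- replicate(5000, mean(sample(population, size = n)))
c(observed = sd(sample_means), predicted = sigma / sqrt(n))  ## nearly equal
```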

Question #7: The Central Limit Theorem suggests the amount of sampling variability in the sample mean is equal to the standard deviation (of the variable of interest) divided by the square root of \(n\). To verify this, compare the standard deviation of the different sample means (which should be stored in sample_means, a vector you created for Questions #2 and #3) with the standard deviation of the “Claim_Amount” variable in your original sample divided by \(\sqrt{n}\).

\(~\)

Judging normality using Q-Q plots

It’s relatively easy to check that the expected value and standard deviation suggested by the Central Limit Theorem appear to be accurate. It’s more difficult to determine if the distribution of sample means actually follows a Normal curve.

One tool statisticians use to judge whether observed data appear to follow a Normal distribution is a quantile-quantile, or Q-Q plot.

A Q-Q plot graphs the quantiles of the observed data against what they theoretically should be under a Normal probability model. Note that a “quantile” is the general term for a cut-off point that divides a distribution into intervals with certain probabilities. A percentile is a commonly used type of quantile (as is a quartile, such as Q1 or Q3 in a boxplot).
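In R, `quantile()` returns observed quantiles of a data vector, while `qnorm()` returns the theoretical quantiles of a Normal distribution. A small, self-contained illustration (using simulated data):

```r
## Observed vs. theoretical quantiles.
set.seed(7)
z <- rnorm(1000)          ## data drawn from N(0, 1)
quantile(z, probs = 0.5)  ## observed median, should be near 0
qnorm(0.5)                ## theoretical median of N(0, 1): exactly 0
qnorm(c(0.25, 0.75))      ## theoretical quartiles: about -0.674 and 0.674
```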

It’s easiest to understand how a Q-Q plot works by looking at a few examples:

## Use n = 200
n = 200
set.seed(123)

## Simulate a right-skewed variable
x = rexp(n, rate = 2)

## Simulate a left-skewed variable
y = -rexp(n, rate = 1)

## Simulate a normally distributed variable
z = rnorm(n, mean = 0, sd = 10)

## graphics parameters for a 3x2 arrangement of plots
par(mfrow = c(3,2))

## Histogram and base R Q-Q plot of the variable x
hist(x)
qqnorm(x)
qqline(x)

## Histogram and base R Q-Q plot of the variable y
hist(y)
qqnorm(y)
qqline(y)

## Histogram and base R Q-Q plot of the variable z
hist(z)
qqnorm(z)
qqline(z)

The \(x\)-axis of each Q-Q plot displays theoretical quantiles, while the \(y\)-axis displays the quantiles observed in the sample.

  • The quantiles of right-skewed data, such as x, are very bunched up at first (at the smaller values), then become very spread out later (at the larger values). This is seen in the Q-Q plot by the flat section early on that curves upward towards the right of the graph.
  • The quantiles of left-skewed data, such as y, are very spread out at first (at the smaller values), then become bunched up later (at the larger values). This is seen in the Q-Q plot by the steep slope early on and the flattening out towards the right of the graph.
  • The values of z were drawn from a Normal distribution, so this Q-Q plot reflects what you’d expect to see for data that are truly Normally distributed (a close, but not perfect, reflection of the 45-degree line).
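Under the hood, a Normal Q-Q plot is essentially a scatterplot of the sorted data against Normal quantiles evaluated at evenly spaced probabilities. The sketch below reproduces that idea by hand (as an illustration of the concept, not necessarily R’s exact internal algorithm):

```r
## What a Normal Q-Q plot computes, sketched manually:
## sorted observed data vs. Normal quantiles at evenly spaced probabilities.
set.seed(123)
x <- rexp(200, rate = 2)         ## a right-skewed sample, as above
probs <- ppoints(length(x))      ## evenly spaced probabilities in (0, 1)
theoretical <- qnorm(probs)      ## where those quantiles fall under N(0, 1)
observed <- sort(x)              ## the sample's own quantiles
plot(theoretical, observed)      ## essentially what qqnorm(x) draws
```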

Question #8: Create both a histogram and a Q-Q plot of the variable “Claim_Amount” in the population (recall that this is the tsa data.frame from the very start of the lab). Comment upon the shape of this distribution and whether it does/doesn’t resemble the Normal curve.

Question #9: Create both a histogram and a Q-Q plot of the variable “Claim_Amount” in your original sample (this should be the tsa_sample data.frame used in Question #1). Comment upon the shape of this distribution and whether it does/doesn’t resemble the Normal curve.

Question #10: Create both a histogram and a Q-Q plot of the sample means of “Claim_Amount” recorded across the random samples drawn prior to Question #2 (this should be the sample_means vector). Comment upon the shape of this distribution and whether it does/doesn’t resemble the Normal curve.

Remark: Hopefully you can appreciate just how profound the Central Limit Theorem is. It’s extremely difficult to anticipate the distribution of a random variable without observing many realizations of it; however, the Central Limit Theorem lets us do exactly that for the sample mean: we can characterize its distribution despite observing only a single sample.