\(~\)

Onboarding

This lab formalizes our investigation of sampling variability by introducing the Central Limit Theorem and showing how it provides a probability model for sample averages.

Normal Distributions

Earlier this week, we alluded to the fact that many random variables related to sampling, such as the sample average, appear to have a bell-shaped probability distribution. This probability distribution is known as the Normal Distribution, and it is defined by the mathematical function: \[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

This function involves two parameters, \(\mu\) and \(\sigma\), which govern the distribution’s appearance.

  • \(\mu\) (the Greek letter “mu”) is the mean parameter; it determines the distribution’s center/peak
  • \(\sigma\) (the Greek letter “sigma”) is the standard deviation parameter; it determines how spread out or concentrated the distribution is

We typically reference Normal Distributions using the shorthand \(N(\mu, \sigma)\); for example, \(N(0,1)\) denotes a Normal distribution with a mean of 0 and a standard deviation of 1.

Below are three different Normal Distributions displayed on the same x-axis:
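A figure like this can be sketched in R using ggplot2's stat_function() together with dnorm(), the Normal density function. The three parameter choices below are illustrative and may differ from the ones used in the original figure:

```r
## Sketch: three Normal curves on one x-axis (parameter choices are illustrative)
library(ggplot2)

ggplot(data.frame(x = c(-6, 6)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(color = "N(0, 1)")) +
  stat_function(fun = dnorm, args = list(mean = 2, sd = 1), aes(color = "N(2, 1)")) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(color = "N(0, 2)")) +
  labs(y = "Density", color = "Distribution")
```

Notice how changing \(\mu\) shifts the peak left or right, while changing \(\sigma\) stretches or compresses the curve.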

The first section of today’s lab will cover basic probability calculations using the Normal Distribution.

\(~\)

Central Limit Theorem (CLT)

Consider a set of \(n\) independent observations of a random variable, \(X\), such that we’ve observed: \(\{x_1, x_2, \ldots, x_n\}\). For example, these values might be the observed values of a given variable for cases that were randomly selected from a population.

Intuitively:

  • The random variable that we observed has an expected value, and some of our observations will be above this expected value, while others will be below it
  • When we sum \(x_1 + x_2 + \ldots + x_n\) the values that are larger than expected tend to be balanced out by those that are smaller than expected
    • Thus, the distribution of the sample average \(\overline{x} = \tfrac{x_1 + x_2 + \ldots + x_n}{n}\) tends to cluster around the expected value of \(X\) with tails on either side

Mathematically:

\[\sqrt{n}\bigg(\frac{\overline{X} - \mu}{\sigma} \bigg) \rightarrow N(0,1) \text{ as } n \rightarrow \infty\] This says that as the sample size increases, the probability distribution of the standardized sample mean converges to the Standard Normal distribution.

With slightly improper notation, we can write out an approximate Normal model for the sample mean:

\[\overline{X} \sim N\big(\mu, \tfrac{\sigma}{\sqrt{n}}\big)\] In words, CLT suggests that the sample mean follows a Normal distribution centered at the population mean with a standard deviation equal to the population’s standard deviation divided by the square-root of the sample size.

An amazing thing about this result is that it doesn’t assume anything about the distribution of the population. This means that the population we are sampling from can be skewed and/or full of outliers and the sample average will still follow a Normal probability distribution so long as we take a large enough sample.
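To see this in action, here is a small simulation sketch (not part of the lab's required code) that draws repeated samples from a heavily right-skewed Exponential distribution, whose population mean and standard deviation are both 1, and checks the sample means against the \(N\big(\mu, \tfrac{\sigma}{\sqrt{n}}\big)\) model:

```r
## Sketch: CLT applied to a skewed population (Exponential with rate 1, so mu = sigma = 1)
set.seed(1)                       # for reproducibility
n <- 100                          # size of each sample
n_reps <- 5000                    # number of samples
xbars <- replicate(n_reps, mean(rexp(n, rate = 1)))

## CLT predicts approximately N(1, 1/sqrt(100)) = N(1, 0.1)
mean(xbars)   ## should be close to 1
sd(xbars)     ## should be close to 0.1
```

A histogram of xbars would look bell-shaped even though the population itself is strongly skewed.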

We will not prove CLT in this course, as the most accessible proofs require knowledge of moment generating functions.

\(~\)

Lab

In this lab you’ll work with a data set containing all claims made against the US Transportation Security Administration (TSA), a government agency that oversees travel security in the United States (primarily in airports). The claims were filed in 2004 and 2005, shortly after the agency was created.

Claims are filed against the TSA for damaged or stolen property, improper screening practices, and bodily injury. Each claim requests a certain amount of damages that the claimant seeks to recover from the TSA (the variable Claim_Amount). These claims are reviewed and can be settled, approved, or rejected. The final amount paid to the claimant is recorded as Close_Amount.

We will work with two different subsets of claims. The first subset contains claims for eyeglasses of less than $2000 (this excludes two outliers in this category). The second subset contains claims for laptops, excluding one extreme outlier that claimed a $1,000,000 loss.

You’ll also need the dplyr and ggplot2 packages that we’ve been using in previous labs.

## Libraries
library(dplyr)
library(ggplot2)

## Full data set (note that it is somewhat large)
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv")

## First subset (eyeglasses claims under $2,000)
tsa_eyeglasses <- tsa %>% filter(Item == "Eyeglasses - (including contact lenses)", Claim_Amount < 2000)

## Second subset (laptop claims, excluding the $1,000,000 outlier)
tsa_laptops <- tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)

\(~\)

Working with Normal Probability Models

There are two functions we’ll frequently use when working with Normal Distributions:

  1. pnorm() takes the argument q and computes a tail area of a specified Normal Distribution: the area below q (argument lower.tail = TRUE, the default) or the area above q (argument lower.tail = FALSE)
  2. qnorm() takes the argument p and returns the associated quantile of the Normal Distribution, that is, the value with proportion p of the distribution below it (when lower.tail = TRUE)

Below is an example use of pnorm():

pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE)
## [1] 0.1586553

This example corresponds to the shaded area:

Below is an example use of qnorm():

qnorm(p = 0.25, mean = 5, sd = 2, lower.tail = TRUE)
## [1] 3.65102

This tells us that 25% of values from the \(N(5,2)\) distribution are less than 3.651. In other words, 3.651 is the 25th percentile of this Normal distribution.
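One way to keep these two functions straight: qnorm() is the inverse of pnorm(). Feeding the output of one into the other (with matching mean, sd, and lower.tail settings) returns the original value, and the two tail areas at any value of q always sum to 1:

```r
## pnorm() maps a value to a probability; qnorm() maps that probability back
p <- pnorm(q = 3.65102, mean = 5, sd = 2, lower.tail = TRUE)
p                                  # approximately 0.25
qnorm(p = p, mean = 5, sd = 2)     # approximately 3.65102

## The lower and upper tail areas sum to 1
pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE) +
  pnorm(q = 1, mean = 0, sd = 1, lower.tail = TRUE)   # exactly 1
```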

Question #1: For this question you should use the data introduced at the start of the lab.

  • Part A: Create a plot displaying the distribution of the variable Claim_Amount for both the tsa_eyeglasses and tsa_laptops subsets. While neither distribution appears perfectly Normal, which distribution do you think would be better approximated by a Normal curve?
  • Part B: Find the mean and standard deviation of the variable Claim_Amount in the tsa_eyeglasses subset. Using these values as the mean and standard deviation parameters of a Normal probability model, estimate the probability of a randomly selected claim made against the TSA for eyeglasses having a claim amount greater than $300. Hint: You should use the pnorm() function to answer this question.
  • Part C: Using the same Normal probability model from Part B, estimate the probability of a randomly selected claim made against the TSA for eyeglasses having a claim amount less than $50.
  • Part D: Confirm that the actual proportion of claims in tsa_eyeglasses whose Claim_Amount exceeds $300 is 0.342. Hint: You might find the examples at the start of Lab 6 helpful in demonstrating how to use sum() and n() to calculate a proportion like this one.
  • Part E: Find the claim amount corresponding to the 65.8th percentile in the Normal model from Part B. Briefly comment on how this is related to Part D.
  • Part F: Using the results from Parts B, D, and E, comment upon the accuracy of a Normal probability model to approximate the distribution of Claim_Amount for claims made for eyeglasses. That is, do you believe this model provides a reasonable approximation of the underlying distribution?
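The general pattern for Parts B and C looks something like the following sketch. The mean and standard deviation values shown here are made-up placeholders, not the actual summary statistics of the data:

```r
## Placeholder workflow (these numbers are NOT the answers)
est_mean <- 150    # replace with mean(tsa_eyeglasses$Claim_Amount)
est_sd   <- 120    # replace with sd(tsa_eyeglasses$Claim_Amount)

## P(claim > 300) under the Normal model: upper tail area at q = 300
pnorm(q = 300, mean = est_mean, sd = est_sd, lower.tail = FALSE)
```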

\(~\)

\(~\)

Judging Normality using Q-Q plots

It can be difficult to judge whether a distribution displayed in a histogram actually follows the Normal Distribution. One tool that statisticians use to judge normality is a quantile-quantile plot (Q-Q plot for short).

A Q-Q plot graphs the quantiles (percentiles) of the observed data against where those quantiles would theoretically fall under a Normal probability model. It’s easiest to understand how a Q-Q plot works using some examples:

The x-axis of a Q-Q plot shows the theoretical quantiles, while the y-axis shows the observed quantiles.

  • In Example 1, we see that the data are right-skewed, which implies they exhibit more variability than would be expected by a Normal distribution for large positive values. This explains why the Q-Q plot bends upward, as the higher percentiles of this distribution are larger values than would be expected under a Normal model.
  • We see the reverse in Example 2 where the data are left-skewed. Here the lower percentiles of the data are smaller than would be expected under a Normal model.
  • Example 3 shows data that actually came from a Normal distribution, so you see roughly a 1-to-1 correspondence between theoretical and observed quantiles with any deviations being due to chance.

We can create nice-looking Q-Q plots using stat_qq() and stat_qq_line(). The former function creates the plot, while the latter adds a reference line for comparison purposes:

## Example Q-Q plot
ggplot(data = tsa_laptops, aes(sample = Month)) + stat_qq() + stat_qq_line()

In this example we can judge that the variable Month does not appear to follow a Normal Distribution. It has shorter (lighter) tails than would be expected under a Normal model, which makes sense because the distribution of months in the data is roughly uniform, and a uniform distribution has no long tails.
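For comparison, here is a sketch (not part of the lab) of what a Q-Q plot looks like when the data really do come from a Normal distribution; the points should fall close to the reference line, with only small chance deviations at the tails:

```r
## Sketch: Q-Q plot for simulated Normal data
library(ggplot2)
set.seed(7)                                     # for reproducibility
sim <- data.frame(vals = rnorm(200, mean = 5, sd = 2))

ggplot(data = sim, aes(sample = vals)) +
  stat_qq() +
  stat_qq_line()
```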

Question #2: Create a Q-Q plot to assess the Normality of the variable Claim_Amount in the tsa_laptops subset. Does this plot suggest a Normal probability model is appropriate for approximating the distribution of this variable? Briefly explain.

\(~\)

Central Limit Theorem

Recall that the Central Limit Theorem provides theoretical justification for using a Normal probability model to represent the probability distribution of the sample mean. You’ll explore this result in the following question:

Question #3: This question looks at approximating the distribution of the sample mean across many random samples of claims from tsa_laptops. You should begin by noting that tsa_laptops is very right-skewed. Throughout this question you are encouraged to modify the for-loop and histogram code provided below:

## Repeat the loop 200 times (i.e., 200 random samples)
n_reps <- 200

## Object to store each sample's mean
sample_means <- numeric(length = n_reps)

## The loop itself (take many random samples of size n = 25)
for(i in 1:n_reps){
  current_sample <- sample(tsa_laptops$Claim_Amount, size = 25)
  sample_means[i] <- mean(current_sample)
}

## Graph these 200 sample means using a histogram
ggplot() + geom_histogram(aes(x = sample_means), bins = 15)

  • Part A: Create a histogram showing the distribution of sample means for the variable Claim_Amount across 200 random samples of \(n=10\) claims drawn from tsa_laptops. Is this distribution more skewed or less skewed than the distribution of the data that samples were being drawn from?
  • Part B: Use a Q-Q plot to assess the appropriateness of a Normal probability model for the distribution of sample means for samples of size \(n=10\) that you found in Part A. Hint: You’ll need to use the argument aes(sample = sample_means) in both stat_qq() and stat_qq_line().
  • Part C: Modify your code from Part A to draw 200 random samples of size \(n=30\) from tsa_laptops, then construct a Q-Q plot to assess the appropriateness of a Normal probability model for the distribution of sample means for samples of size \(n=30\).
  • Part D: Modify your code from Part A once more to draw samples of size \(n=100\), then construct another Q-Q plot to assess the Normality of the distribution of sample means for samples of size \(n=100\).
  • Part E: Based upon the definition of the Central Limit Theorem given at the start of the lab, explain why the Q-Q plots change in the manner that they do in Parts B, C, and D as the size of the samples drawn from the population is increased.
  • Part F: Calculate the mean and standard deviation of the 200 sample means you found in Part D. Then, using these as the parameters of a Normal probability model, calculate the probability of a random sample of claims involving laptops having a mean claim amount larger than $1800.
  • Part G: Calculate the proportion of claims in the full tsa_laptops data set that exceed $1800. Why is this different than the probability you found in Part F? Hint: Think about the random variable involved in each of these calculations and how they are different.