\(~\)
This lab formalizes our earlier investigation of sampling variability by introducing the Central Limit Theorem and showing how it provides a probability model for sample averages.
Earlier this week, we saw that many random variables related to sampling, such as the sample average, appear to have a bell-shaped probability distribution. This probability distribution is known as the Normal Distribution, and it is defined by the mathematical function: \[ f(X) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(X - \mu)^2}{2\sigma^2}} \]
This function involves two parameters, \(\mu\) and \(\sigma\), which govern the distribution’s appearance.
We typically reference Normal Distributions using the shorthand \(N(\mu, \sigma)\); for example, \(N(0,1)\) denotes a Normal distribution with a mean of zero and a standard deviation of 1.
Below are three different Normal Distributions displayed on the same x-axis:
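One way to draw such a comparison yourself is sketched below; the three parameter choices, \(N(0,1)\), \(N(0,2)\), and \(N(2,1)\), are illustrative and not necessarily the ones in the figure:

```r
## Sketch: three Normal density curves on a shared x-axis
## (parameter choices are illustrative)
library(ggplot2)

x <- seq(-6, 8, length.out = 400)
curves <- data.frame(
  x       = rep(x, times = 3),
  density = c(dnorm(x, mean = 0, sd = 1),
              dnorm(x, mean = 0, sd = 2),
              dnorm(x, mean = 2, sd = 1)),
  dist    = rep(c("N(0,1)", "N(0,2)", "N(2,1)"), each = length(x))
)

ggplot(curves, aes(x = x, y = density, color = dist)) + geom_line()
```

Notice how \(\mu\) shifts the center of each curve while \(\sigma\) controls its spread.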
The first section of today’s lab will cover basic probability calculations using the Normal Distribution.
\(~\)
Consider a set of \(n\) independent observations of a random variable, \(X\), such that we’ve observed: \(\{x_1, x_2, \ldots, x_n\}\). For example, these values might be the observed values of a given variable for cases that were randomly selected from a population.
Intuitively: as the sample size \(n\) grows, the sample average \(\overline{x}\) settles down near the population mean \(\mu\), and the distribution of its remaining variability becomes bell-shaped.

Mathematically:
\[\lim_{n \rightarrow \infty} \sqrt{n}\bigg(\frac{\overline{x} - \mu}{\sigma} \bigg) \rightarrow N(0,1)\] This says that as the sample size increases, the probability distribution of the standardized sample mean converges to a standard Normal distribution.
With slightly improper notation, we can write out an approximate Normal model for the sample mean:
\[\overline{X} \sim N\big(\mu, \tfrac{\sigma}{\sqrt{n}}\big)\] In words, the CLT suggests that the sample mean follows a Normal distribution centered at the population mean with a standard deviation equal to the population’s standard deviation divided by the square root of the sample size.
An amazing thing about this result is that it doesn’t assume anything about the distribution of the population. This means that the population we are sampling from can be skewed and/or full of outliers and the sample average will still follow a Normal probability distribution so long as we take a large enough sample.
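We can check this claim with a quick simulation; the Exponential population below is an arbitrary but conveniently skewed choice, with \(\mu = 1\) and \(\sigma = 1\):

```r
## Empirical check of the CLT using a right-skewed Exponential(1) population,
## which has mu = 1 and sigma = 1
set.seed(1)
n <- 50                                   # sample size

## Record the mean of each of 5000 random samples of size n
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

## CLT predicts: center near mu = 1, spread near sigma/sqrt(n) = 0.141
mean(sample_means)
sd(sample_means)
```

Despite the strong right skew of the Exponential population, a histogram of these sample means would look symmetric and bell-shaped.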
We will not prove CLT in this course, as the most accessible proofs require knowledge of moment generating functions.
\(~\)
In this lab you’ll work with a data set containing all claims made against the US Transportation Security Administration (TSA), a government agency that oversees travel security in the United States (primarily in airports), in the years 2004 and 2005, shortly after the agency was first created.
Claims are filed against the TSA for damaged or stolen property, improper screening practices, and bodily injury. Each claim requests a certain amount of damages that the plaintiff seeks to recover from the TSA (the variable Claim_Amount). These claims are reviewed and can be settled, rejected, or approved. The final amount paid to the plaintiff is recorded as Close_Amount.
We will work with two different subsets of claims. The first subset contains claims for eyeglasses of less than $2000 (this excludes two outliers in this category). The second subset contains claims for laptops, excluding one extreme outlier that claimed a $1,000,000 loss.
You’ll also need the dplyr and ggplot2 packages that we’ve been using in previous labs.
## Libraries
library(dplyr)
library(ggplot2)
## Full data set
tsa <- read.csv("https://remiller1450.github.io/data/tsa.csv") # Note that the data set is somewhat large
## First Subset
tsa_eyeglasses = tsa %>% filter(Item == "Eyeglasses - (including contact lenses)", Claim_Amount < 2000)
## Second Subset
tsa_laptops = tsa %>% filter(Item == "Computer - Laptop", Claim_Amount < 1e6)
\(~\)
There are two functions we’ll frequently use when working with Normal Distributions:

- pnorm() takes the argument q and computes the tail area of a specified Normal Distribution, either for values smaller than q (argument lower.tail = TRUE) or for values larger than q (argument lower.tail = FALSE).
- qnorm() takes the argument p and returns the associated quantile of the Normal Distribution.

Below is an example use of pnorm():
pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE)
## [1] 0.1586553
This example corresponds to the shaded area (the region under the \(N(0,1)\) curve to the right of 1):
Below is an example use of qnorm()
:
qnorm(p = 0.25, mean = 5, sd = 2, lower.tail = TRUE)
## [1] 3.65102
This tells us that 25% of values of the \(N(5,2)\) distribution are less than 3.651. In other words, 3.651 corresponds to the 25th percentile of this Normal distribution.
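Since qnorm() inverts pnorm(), the two examples above can be sanity-checked against each other:

```r
## qnorm() undoes pnorm(): feeding the quantile back returns the probability
q <- qnorm(p = 0.25, mean = 5, sd = 2)   # 3.65102
pnorm(q = q, mean = 5, sd = 2)           # 0.25

## The two tail areas at any cutoff always sum to 1
pnorm(q = 1, lower.tail = TRUE) + pnorm(q = 1, lower.tail = FALSE)   # 1
```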
Question #1: For this question you should use the data introduced at the start of the lab.

- Create histograms displaying the distribution of Claim_Amount for both the tsa_eyeglasses and tsa_laptops subsets. While neither distribution appears perfectly Normal, which distribution do you think would be better approximated by a Normal curve?
- Find the mean and standard deviation of Claim_Amount in the tsa_eyeglasses subset. Using these values as the mean and standard deviation parameters of a Normal probability model, estimate the probability of a randomly selected claim made against the TSA for eyeglasses having a claim amount greater than $300. Hint: You should use the pnorm() function to answer this question.
- Verify that the proportion of claims in tsa_eyeglasses whose Claim_Amount exceeds $300 is 0.342. Hint: You might find the examples at the start of Lab 6 helpful in demonstrating how to use sum() and n() to calculate a proportion like this one.
- Using your answers to the previous two parts, comment on the suitability of a Normal probability model for Claim_Amount for claims made for eyeglasses. That is, do you believe this model provides a reasonable approximation of the underlying distribution?

\(~\)
\(~\)
It can be difficult to judge whether a distribution displayed in a histogram actually follows the Normal Distribution. One tool that statisticians use to judge normality is a quantile-quantile plot (Q-Q plot for short).
A Q-Q plot graphs the quantiles (percentiles) of the observed data against where they should theoretically fall under a Normal probability model. It’s easiest to understand how a Q-Q plot works using some examples:
The x-axis of a Q-Q plot shows theoretical quantiles, while the y-axis shows the observed quantiles.
We can create nice-looking Q-Q plots using stat_qq() and stat_qq_line(). The former function creates the plot, while the latter adds a 45-degree reference line for comparison purposes:
## Example Q-Q plot
ggplot(data = tsa_laptops, aes(sample = Month)) + stat_qq() + stat_qq_line()
In this example we can judge that the variable Month does not appear to follow a Normal Distribution. It has thicker tails than would be expected under a Normal model (which makes sense because the distribution of months in the data is almost uniform).
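To see the same contrast on data we control, the sketch below simulates one Normal and one right-skewed sample (the sample size of 200 is an arbitrary choice):

```r
## Q-Q plots for simulated Normal vs. right-skewed data
library(ggplot2)
set.seed(7)
sim <- data.frame(normal = rnorm(200), skewed = rexp(200))

## Points from the Normal sample should track the reference line closely
ggplot(sim, aes(sample = normal)) + stat_qq() + stat_qq_line()

## The right-skewed sample bends away from the line in the upper tail
ggplot(sim, aes(sample = skewed)) + stat_qq() + stat_qq_line()
```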
Question #2: Create a Q-Q plot to assess the Normality of the variable Claim_Amount in the tsa_laptops subset. Does this plot suggest a Normal probability model is appropriate for approximating the distribution of this variable? Briefly explain.
\(~\)
Recall that the Central Limit Theorem provides theoretical justification for using a Normal probability model to represent the probability distribution of the sample mean. You’ll explore that result in the following question:
Question #3: This question looks at approximating the distribution of the sample mean across many random samples of claims from tsa_laptops. You should begin by noting that Claim_Amount in tsa_laptops is very right-skewed. Throughout this question you are encouraged to modify the for-loop and histogram code provided below:
## Repeat the loop 200 times (ie: 200 random samples)
n_reps = 200
## Object to store each sample's mean
sample_means = numeric(length = n_reps)
## The loop itself (take many random samples of size n=25)
for(i in 1:n_reps){
current_sample = sample(tsa_laptops$Claim_Amount, size = 25)
sample_means[i] = mean(current_sample)
}
## Graph these 200 sample means using a histogram
ggplot() + geom_histogram(aes(x = sample_means), bins = 15)
- Create a histogram displaying the sample means of Claim_Amount across 200 random samples of \(n=10\) claims drawn from tsa_laptops. Is this distribution more skewed or less skewed than the distribution of the data that samples were being drawn from?
- Construct a Q-Q plot to assess the Normality of these sample means. Hint: You’ll need aes(sample = sample_means) in both stat_qq() and stat_qq_line().
- Repeat the sampling process using samples of size \(n=30\) drawn from tsa_laptops, then construct a Q-Q plot to assess the appropriateness of a Normal probability model for the distribution of sample means for samples of size \(n=30\).
- Calculate the proportion of individual claims in the tsa_laptops data set that exceed $1800. Why is this different than the probability you found in Part F? Hint: Think about the random variable involved in each of these calculations and how they are different.