This lab introduces R and RStudio, as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
\(~\)
Two-sample data are generally recorded in a single data sheet with a binary categorical variable defining the groups and a second variable defining the outcome, which can be either categorical or quantitative.
Before starting this lab we will cover a few data
manipulation techniques that are helpful in getting more complex
data to fit within this framework. To do so, we will use a few functions
from the dplyr
library, which you might need to
install.
# install.packages('dplyr')
library(dplyr)
We’ve previously worked with the “TSA claims” data set, a random sample of 5000 claims made against the TSA during its first five years of existence. The data set uses 16 categories to group the items involved in claims:
tsa_data = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
table(tsa_data$Item)
##
## Cameras - Digital
## 440
## Cell Phones
## 55
## Clothing - Shoes; belts; accessories; etc.
## 578
## Computer - Laptop
## 242
## Cosmetics - Perfume; toilet articles; medicines; soaps; etc.
## 100
## Currency
## 93
## DVD/CD Players
## 50
## Eyeglasses - (including contact lenses)
## 114
## Jewelry - Fine
## 466
## Locks
## 367
## Luggage (all types including footlockers)
## 758
## Medicines
## 56
## Musical Instruments - Other - Over $250
## 26
## Other
## 1581
## PDA - Personal Data Assistants
## 27
## Stereo Items & Accessories
## 47
We might be interested in a comparison of digital cameras and laptops, and one way to approach this would be to filter our data to remove any cases not belonging to either of these categories:
## Store filtered results
tsa_laptop_camera_subset = filter(tsa_data, Item %in% c("Cameras - Digital", "Computer - Laptop"))
## Display table of filtered results to confirm
table(tsa_laptop_camera_subset$Item)
##
## Cameras - Digital Computer - Laptop
## 440 242
You should note:
- There is a filter() function in base R that gets masked when you load the dplyr package. So if you are getting an unexpected error you should make sure you’ve run library(dplyr) before trying to filter.
- The %in% operator checks whether each value of Item is in the list c("Cameras - Digital", "Computer - Laptop"), and the rows that have such an item are kept.

\(~\)
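As a small illustration (using made-up item values rather than the full data), %in% returns a logical vector indicating whether each value on its left appears anywhere in the vector on its right, and filter() keeps the rows where this vector is TRUE:
## Small example of the %in% operator
c("Locks", "Computer - Laptop", "Currency") %in% c("Cameras - Digital", "Computer - Laptop")
## [1] FALSE  TRUE FALSE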
After filtering we might want to calculate descriptive statistics separately for each group. We can do this by stringing together a series of functions using piping:
tsa_laptop_camera_subset %>% ## Step 1 - provide the data
group_by(Item) %>% ## Step 2 - variable defining groups
summarize(mean(Claim_Amount), ## Step 3 - summarize
median(Claim_Amount))
## # A tibble: 2 × 3
## Item `mean(Claim_Amount)` `median(Claim_Amount)`
## <chr> <dbl> <dbl>
## 1 Cameras - Digital 596. 430.
## 2 Computer - Laptop 1742. 1500
Piping takes the object produced by a function and passes it directly as the “data” to the next step in the chain. We could have done our filtering step as part of this pipeline if we’d wanted:
tsa_data %>% ## Step 1 - provide the data
filter(Item %in% c("Cameras - Digital", "Computer - Laptop")) %>% ## Step 2 - filter to keep only cameras and laptops
group_by(Item) %>% ## Step 3 - define the groups
summarize(mean(Claim_Amount), ## Step 4 - summarize
median(Claim_Amount))
## # A tibble: 2 × 3
## Item `mean(Claim_Amount)` `median(Claim_Amount)`
## <chr> <dbl> <dbl>
## 1 Cameras - Digital 596. 430.
## 2 Computer - Laptop 1742. 1500
\(~\)
Suppose we’d like to assess whether the claim rejection rate is
higher for laptops or cameras. It’s worth knowing how to transform a
nominal variable like Status
(which has values of
“Settled”, “Denied” and “Accepted”) into a binary variable using the
ifelse()
function:
## Add new variable to the subset from earlier
tsa_laptop_camera_subset$Rejected = ifelse(tsa_laptop_camera_subset$Status == "Denied", "Rejected", "Not Rejected")
## Look at two-way frequency table
table(tsa_laptop_camera_subset$Item, tsa_laptop_camera_subset$Rejected)
##
## Not Rejected Rejected
## Cameras - Digital 232 208
## Computer - Laptop 95 147
The first argument given to ifelse()
is a logical check,
the second is the value that is returned when the check evaluates to
TRUE
, and the third is the value returned when the check
evaluates to FALSE
.
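As a small, made-up example of these three arguments (unrelated to the TSA data), each element of the logical check selects between the two return values:
## Small example of ifelse(): check, value if TRUE, value if FALSE
ifelse(c(250, 80, 1200) > 100, "Large", "Small")
## [1] "Large" "Small" "Large"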
\(~\)
At this point you should begin working independently with your assigned partner(s). Remember that you should read the lab’s content, not just the questions, and you should all agree with an answer before moving on. You are responsible for any ideas introduced during labs on class exams.
\(~\)
Recall that any complete analysis should report data visualizations and descriptive statistics in addition to statistical inferences (hypothesis tests). Two-sample categorical data often involve comparing a proportion across two groups (i.e., a difference in proportions), so we can use visualizations and tables that focus on the two variables involved in this comparison.
Below are three different varieties of bar chart that you might consider:
## Example Data
library(ggplot2)
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")
## Stacked
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "stack") + labs(title = "Stacked Bar Chart")
## Clustered
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "dodge") + labs(title = "Clustered Bar Chart")
## Conditional
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill") + labs(title = "Conditional Bar Chart")
When deciding between these types of bar chart you should think about the types of comparisons each one lends itself to. Stacked bar charts most clearly convey the overall sample size in each group (Day and Night here), and clustered bar charts facilitate comparisons within groups while still conveying sample size information. Conditional bar charts provide the clearest means of comparing the outcome of interest across the two groups, but they don't provide any sample size information.
For two-sample categorical data, the prop.test()
function can be used to provide both descriptive statistics and
hypothesis testing results. We need to provide the numerator and
denominator for both of the sample proportions involved in the
two-sample \(Z\)-test. Below
we can see a two-way table for the example data:
## Two-way table
table(example_data$Time, example_data$Sex)
##
## F M
## Day 35 33
## Night 52 124
We can store this table and reference specific values using indices:
example_table = table(example_data$Time, example_data$Sex)
example_table[1,1] ## Daytime tables w/ a female customer
## [1] 35
Values from the table can be used to obtain the numerators and denominators we need to perform the two-sample \(Z\)-test:
num1 = example_table[1,1]
denom1 = example_table[1,1] + example_table[1,2]
num2 = example_table[2,1]
denom2 = example_table[2,1] + example_table[2,2]
Note that this gives conditional proportions: the proportion of daytime tables with a female customer (num1/denom1) and the proportion of nighttime tables with a female customer (num2/denom2). We must keep track of this because prop.test() will simply label its estimates "p1" and "p2" in the order we provide them.
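Before running the test it can help to print these conditional proportions yourself, so you know which group "p1" and "p2" will refer to in the output:
## Proportion of daytime tables with a female customer
num1/denom1
## [1] 0.5147059
## Proportion of nighttime tables with a female customer
num2/denom2
## [1] 0.2954545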
The code below shows how these quantities are used in
prop.test()
to perform the two-sample \(Z\)-test using the null hypothesis \(H_0: p_1 = p_2\):
prop.test(x = c(num1, num2), n = c(denom1, denom2), correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(num1, num2) out of c(denom1, denom2)
## X-squared = 10.277, df = 1, p-value = 0.001347
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.0826709 0.3558318
## sample estimates:
## prop 1 prop 2
## 0.5147059 0.2954545
In this example, a good analysis might include one of the previously displayed bar charts along with the conclusion statement:
“There is strong evidence (\(p=0.001\)) that daytime tables have a higher proportion of female customers (51.47%) relative to nighttime tables (29.55%) for this individual.”
Notice how the conclusion includes context, level of evidence, and direction, and it also provides relevant descriptive statistics (conditional percentages within each group).
Question #1: For this question you should begin with the full TSA claims data set. You will investigate whether theft is more likely to occur at security checkpoints or with checked baggage.

- Filter the data so that it only contains claims made at the "Checkpoint" or "Checked Baggage" claim sites.
- Create bar charts of this relationship, trying each of the two variables as the x aesthetic in your plot. Which of these graphs do you prefer? There is no correct preference, but you should provide a reason explaining your choice.
- Create a two-way frequency table of Theft and Claim_Site for the data subset you've been working with on this question.
- Use prop.test() and perform a two-sample \(Z\)-test evaluating whether the proportion of theft claims at checkpoints is the same as the proportion of theft claims involving checked baggage. Provide a one-sentence conclusion that includes the \(p\)-value and the conditional proportions involved in the test. See the example in this section for help on what a good conclusion might look like.

\(~\)
For two-sample quantitative data you may consider side-by-side box plots or histograms as appropriate data visualizations:
## Boxplots
ggplot(example_data, aes(x = Tip, y = Time)) + geom_boxplot()
## Histograms
ggplot(example_data, aes(x = Tip)) + geom_histogram(bins = 20) + facet_wrap(~Time)
Boxplots are generally preferred in this scenario, with the exception being when you would like to check the distributional shape of each sample and cannot make a clear judgement from the boxplots.
Descriptive statistics for two-sample quantitative data should be calculated using grouped summarization (described in the lab’s introduction). Another example of this is shown below:
example_data %>%
group_by(Time) %>%
summarize(mean(Tip),
sd(Tip),
n())
## # A tibble: 2 × 4
## Time `mean(Tip)` `sd(Tip)` `n()`
## <chr> <dbl> <dbl> <int>
## 1 Day 2.73 1.21 68
## 2 Night 3.10 1.44 176
Note that the command n()
will simply count the number
of cases in each group.
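If you only need the group sizes, dplyr's count() function returns the same counts that n() produced inside summarize():
## Count the number of tables observed at each time of day
example_data %>% count(Time)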
Finally, we can perform the two-sample \(T\)-test using the
t.test()
function:
t.test(Tip ~ Time, data = example_data)
##
## Welch Two Sample t-test
##
## data: Tip by Time
## t = -2.0593, df = 144.07, p-value = 0.04126
## alternative hypothesis: true difference in means between group Day and group Night is not equal to 0
## 95 percent confidence interval:
## -0.73411080 -0.01505364
## sample estimates:
## mean in group Day mean in group Night
## 2.728088 3.102670
The syntax here is a bit different from our prior use of
t.test()
, but will be very useful to get familiar with as
we move towards building regression models. The first piece,
Tip ~ Time
, is a formula that says that the
variable “Tip” is explained by the variable “Time”. We also provide the
data so that t.test()
knows where to look for these
variables.
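If it helps to connect this formula syntax to the way we've called t.test() previously, the sketch below (the vector names day_tips and night_tips are just illustrative) runs the same Welch test by passing each group's tips as separate vectors:
## Equivalent two-vector form of the same Welch t-test
day_tips = example_data$Tip[example_data$Time == "Day"]
night_tips = example_data$Tip[example_data$Time == "Night"]
t.test(day_tips, night_tips)
The test statistic, degrees of freedom, and p-value match the formula-based output shown above.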
In this example, a good analysis might include the side-by-side boxplots along with the conclusion statement:
“There is moderate evidence (\(p=0.041\)) that nighttime tables have a higher average tip ($3.10) than daytime tables ($2.73) for tables served by this individual.”
Question #2: For this question you will use the data subset you created in Question #1 that only contains the "Checkpoint" and "Checked Baggage" claim sites.

- Create a data visualization displaying the distribution of Close_Amount at each claim site. How would you describe the shape of these distributions?
- Calculate appropriate descriptive statistics for Close_Amount at each claim site.
- Use t.test() to evaluate whether the average Close_Amount is equal at both claim sites. Provide a one-sentence conclusion that includes both the \(p\)-value and the average close amounts observed in each group (sample). See the example in this section for help on what a good conclusion might look like.

\(~\)
The two-sample \(Z\) and \(T\) tests both assume specific probability models for the outcomes of interest, but depending upon the characteristics of the sample and the population being studied these assumed models might be inappropriate. When the conditions required by these tests are not met, the alternative tests described below can be used.
If the conditions for the two-sample \(Z\)-test are not met, a reasonable
alternative is Fisher’s exact test, which tests whether
the rows and columns of a two-way frequency table are independent given
the table’s marginal totals are fixed. To perform Fisher’s exact test we
supply the relevant two-way frequency table to the
fisher.test()
function:
## Example from earlier - difference in proportion of female customers by time of day
example_table = table(example_data$Time, example_data$Sex)
fisher.test(example_table)
##
## Fisher's Exact Test for Count Data
##
## data: example_table
## p-value = 0.001737
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.364025 4.678206
## sample estimates:
## odds ratio
## 2.518817
We will discuss odds ratios and what they mean later in the course,
but for now you should recognize that the \(p\)-value is extremely close to the one we
saw with prop.test()
(0.0017 here vs. 0.0013 earlier in the
lab). Because the \(p\)-value is so
small, the conclusion is that there is strong evidence of an association
between the time of day and whether the customer was female.
Question #3: For this question you should create a filtered version of the TSA claims data that contains only claims which involved cell phones or personal data assistants (PDA).

- Use ifelse() to create a variable named "approved" that takes on the value "approved" if a claim was approved and the value "not approved" if the claim was either settled or denied. Use the table() function to display a two-way frequency table involving these variables. If your data preparation steps were correct you should see a 2 by 2 table with 4 cells (counts).
- Perform Fisher's exact test using fisher.test() and provide a one-sentence conclusion that includes the \(p\)-value.
- Use prop.test() to address the same hypotheses involved in Parts C and D. How does this \(p\)-value compare to the one from Part D?

\(~\)
The Wilcoxon Rank Sum Test (also known as the Mann-Whitney U-test) is a rank-based test that does not make any distributional assumptions and can be used in situations where you might otherwise consider the two-sample \(T\)-test.
The null hypothesis is that both groups follow the same distribution, and the alternative is that the groups follow different distributions. The test is conducted using the following steps:
1. Combine the two samples and rank all of the observations from smallest to largest (ties receive the average of the ranks they span).
2. Sum the ranks belonging to each group.
3. Compare the observed rank sums to those expected if the group labels made no difference, which yields the \(p\)-value.
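As a rough sketch of these steps (not something you need to do yourself, since wilcox.test() handles it), we could rank every tip in the combined data and then sum the ranks within each time of day:
## Rank all tips (ties get the average rank), then sum the ranks by group
tip_ranks = rank(example_data$Tip)
tapply(tip_ranks, example_data$Time, sum)
The W statistic reported by wilcox.test() is based on the rank sum of the first group, shifted so that its smallest possible value is zero.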
The test is performed using the wilcox.test()
function:
## Perform the test on our example data (tips by time of day)
wilcox.test(Tip ~ Time, mu = 0, data = example_data)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Tip by Time
## W = 4905, p-value = 0.02883
## alternative hypothesis: true location shift is not equal to 0
Interestingly, the \(p\)-value is
slightly smaller than what we calculated using t.test()
(which generally isn’t the case), but the conclusion is the same - the
data provide strong evidence (\(p =
0.029\)) that the amount tipped differs by time of day, with tips
tending to be larger at night (evidenced by previous box plots).
Question #4: In Question #3 you created a filtered version of the TSA claims data that contains only claims which involved cell phones or personal data assistants (PDA). You will continue using that subset in this question.

- Use t.test() to evaluate the hypotheses considered in Parts A and B. How does the \(p\)-value compare to the one found using the Wilcoxon Rank-Sum test?

\(~\)
For the following questions you should start with the full TSA data set. Unlike previous questions where you were guided through the steps of the analysis, these questions will require you to determine the proper steps and hypothesis tests on your own. Your answer to each question should include:

- R code used to prepare your data and perform hypothesis testing (i.e., filter(), ifelse(), t.test(), etc.)

Question #5: Following the guidelines given above, perform a statistical analysis to determine if there is a difference in the proportion of property damage and personal injury claims that are denied.
\(~\)
Question #6: Following the guidelines given above, perform a statistical analysis to determine if there is a difference in the average claim amount at Los Angeles International Airport (airport code: LAX) and Chicago O'Hare International Airport (airport code: ORD).