Lab #5 - Two-sample Data Analysis in R

This lab introduces R and R Studio as well as a few procedures we’ll use in future class sessions.

Directions (read before starting)

Please work together with your assigned partner(s). Make sure you all fully understand each topic or example before moving on.
Record your answers to lab questions separately from the lab’s examples. Everyone should have their own copy of your group’s responses, and each individual should turn-in this copy, even if it’s identical to that of your lab partners.
Ask for help, clarification, or even just a check-in if anything seems unclear.

$~$

Onboarding

Two-sample data are generally recorded in a single data sheet with a binary categorical variable defining the groups and a second variable defining the outcome, which can be either categorical or quantitative.

Before starting this lab we will cover a few data manipulation techniques that are helpful in getting more complex data to fit within this framework. To do so, we will use a few functions from the dplyr library, which you might need to install.

# install.packages('dplyr')
library(dplyr)

Filtering

We’ve previously worked with the “TSA claims” data set, a random sample of 5000 claims made against the TSA during its first five years of existence. The data set uses 16 categories to group the items involved in claims:

tsa_data = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
table(tsa_data$Item)

## 
##                                            Cameras - Digital 
##                                                          440 
##                                                  Cell Phones 
##                                                           55 
##                   Clothing - Shoes; belts; accessories; etc. 
##                                                          578 
##                                            Computer - Laptop 
##                                                          242 
## Cosmetics - Perfume; toilet articles; medicines; soaps; etc. 
##                                                          100 
##                                                     Currency 
##                                                           93 
##                                               DVD/CD Players 
##                                                           50 
##                      Eyeglasses - (including contact lenses) 
##                                                          114 
##                                               Jewelry - Fine 
##                                                          466 
##                                                        Locks 
##                                                          367 
##                    Luggage (all types including footlockers) 
##                                                          758 
##                                                    Medicines 
##                                                           56 
##                      Musical Instruments - Other - Over $250 
##                                                           26 
##                                                        Other 
##                                                         1581 
##                               PDA - Personal Data Assistants 
##                                                           27 
##                                   Stereo Items & Accessories 
##                                                           47

We might be interested in a comparison of digital cameras and laptops, and one way to approach this would be to filter our data to remove any cases not belonging to either of these categories:

## Store filtered results
tsa_laptop_camera_subset = filter(tsa_data, Item %in% c("Cameras - Digital", "Computer - Laptop"))

## Display table of filtered results to confirm
table(tsa_laptop_camera_subset$Item)

## 
## Cameras - Digital Computer - Laptop 
##               440               242

You should note:

There is a filter() function in base R that gets masked when you load the dplyr package. So if you are getting an unexpected error you should make sure you’ve run library(dplyr) before trying to filter.
The second argument in filter (after providing the data) is a logical condition used to determine which rows are kept. Here the condition checks if the row has a value of Item in the list c("Cameras - Digital", "Computer - Laptop"), and the rows that have such an item are kept.

$~$

Grouped Summarization

After filtering we might want to calculate descriptive statistics separately for each group. We can do this by stringing together a series of functions using piping:

tsa_laptop_camera_subset %>%     ## Step 1 - provide the data
  group_by(Item) %>%             ## Step 2 - variable defining groups
  summarize(mean(Claim_Amount),  ## Step 3 - summarize
            median(Claim_Amount))

## # A tibble: 2 × 3
##   Item              `mean(Claim_Amount)` `median(Claim_Amount)`
##   <chr>                            <dbl>                  <dbl>
## 1 Cameras - Digital                 596.                   430.
## 2 Computer - Laptop                1742.                  1500

Piping takes the object produced by a function and passes it directly as the “data” to the next step in the chain. We could have done our filtering step as part of this pipeline if we’d wanted:

tsa_data %>%                                                               ## Step 1 - provide the data
  filter(Item %in% c("Cameras - Digital", "Computer - Laptop")) %>%        ## Step 2 - filter to keep only cameras and laptops
  group_by(Item) %>%                                                       ## Step 3 - define the groups
  summarize(mean(Claim_Amount),                                            ## Step 4 - summarize
            median(Claim_Amount))

## # A tibble: 2 × 3
##   Item              `mean(Claim_Amount)` `median(Claim_Amount)`
##   <chr>                            <dbl>                  <dbl>
## 1 Cameras - Digital                 596.                   430.
## 2 Computer - Laptop                1742.                  1500

$~$

Category Collapsing

Suppose we’d like to assess whether the claim rejection rate is higher for laptops or cameras. It’s worth knowing how to transform a nominal variable like Status (which has values of “Settled”, “Denied” and “Accepted”) into a binary variable using the ifelse() function:

## Add new variable to the subset from earlier
tsa_laptop_camera_subset$Rejected = ifelse(tsa_laptop_camera_subset$Status == "Denied", "Rejected", "Not Rejected")

## Look at two-way frequency table
table(tsa_laptop_camera_subset$Item, tsa_laptop_camera_subset$Rejected)

##                    
##                     Not Rejected Rejected
##   Cameras - Digital          232      208
##   Computer - Laptop           95      147

The first argument given to ifelse() is a logical check, the second is the value that is returned when the check evaluates to TRUE, and the third is the value returned when the check evaluates to FALSE.

$~$

Lab

At this point you should begin working independently with your assigned partner(s). Remember that you should read the lab’s content, not just the questions, and you should all agree with an answer before moving on. You are responsible for any ideas introduced during labs on class exams.

$~$

Two-sample Categorical Data

Recall that any complete analysis should report data visualizations and descriptive statistics in addition to statistical inferences (hypothesis tests). Two-sample categorical data often involves comparing a proportion across two groups (ie: difference in proportions), so we can use visualizations and tables that focus on the two variables involved in this comparison.

Below are three different varieties of bar chart that you might consider:

## Example Data
library(ggplot2)
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")

## Stacked
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "stack") + labs(title = "Stacked Bar Chart")

## Clustered
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "dodge") + labs(title = "Clustered Bar Chart")

## Conditional
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill") + labs(title = "Conditional Bar Chart")

When deciding between these types of bar chart you should think about the types of comparisons that each lend themselves towards. Stacked bar charts most clearly convey the overall sample size in each group (Day and Night here), clustered bar charts facilitate comparisons within groups and convey sample size information. Conditional bar charts provide the clearest means of comparing the outcome of interest in two groups, but don’t provide any sample size information.

For two-sample categorical data, the prop.test() function can be used to provide both descriptive statistics and hypothesis testing results. We need to provide the numerator and denominator for both of the sample proportions involved in the two-sample $Z$-test. Below we can see a two-way table for the example data:

## Two-way table
table(example_data$Time, example_data$Sex)

##        
##           F   M
##   Day    35  33
##   Night  52 124

We can store this table and reference specific values using indices:

example_table = table(example_data$Time, example_data$Sex)
example_table[1,1] ## Daytime tables w/ a female customer

## [1] 35

Values from the table can be used to obtain the numerators and denominators we need to perform the two-sample $Z$-test:

num1 = example_table[1,1]
denom1 = example_table[1,1] + example_table[1,2]
num2 = example_table[2,1]
denom2 = example_table[2,1] + example_table[2,2]

Note that this gives conditional proportions, the proportion of female tables during daytime (num1/denom1) and the proportion of female tables at night (num2/denom2). We must keep track of this because prop.test() will only give us “p1” and “p2” using the order we provide it.

The code below shows how these quantities are used in prop.test() to perform the two-sample $Z$-test using the null hypothesis $H_0: p_1 = p_2$:

prop.test(x = c(num1, num2), n = c(denom1, denom2), correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(num1, num2) out of c(denom1, denom2)
## X-squared = 10.277, df = 1, p-value = 0.001347
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.0826709 0.3558318
## sample estimates:
##    prop 1    prop 2 
## 0.5147059 0.2954545

In this example, a good analysis might include one of the previously displayed bar charts along with the conclusion statement:

“There is strong evidence ($p=0.001$) that daytime tables have a higher proportion of female customers (51.47%) relative to nighttime tables (29.55%) for this individual.”

Notice how the conclusion includes context, level of evidence, and direction, and it also provides relevant descriptive statistics (conditional percentages within each group).

Question #1: For this question you should begin with the full TSA claims data set. You will investigate whether theft is more likely to occur at security checkpoints or with checked baggage.

Part A: Follow the steps in the lab’s introduction to create a filtered subset of the data that contains only “Checked Baggage” and “Checkpoint” claim sites.
Part B: Follow the steps in the lab’s introduction to create a binary outcome variable that indicates whether a claim was theft. Name the variable “Theft” and ensure that it takes on the value “theft” if the claim type was “Passenger Theft” and the value “not theft” for all other claim types.
Part C: Create both clustered and conditional bar charts displaying the relationship between claim site and the variable “Theft” you created in Part B. Make sure that claim site, the explanatory variable, is the mapped to the x aesthetic in your plot. Which of these graphs do you prefer? There is no correct preference, but you should provide a reason explaining your choice.
Part D: Create a two-way frequency table of the variables Theft and Claim_Site for the data subset you’ve been working with on this question.
Part E: Use the frequencies from the table you created in Part D to create the required inputs to prop.test() and perform a two-sample $Z$-test evaluating whether the proportion of theft claims at checkpoints is the same as the proportion of theft claims involving checked baggage. Provide a one-sentence conclusion that includes the $p$-value and the conditional proportions involved in the test. See the example in this section for help on what a good conclusion might look like.

$~$

Two-sample Quantitative Data

For two-sample quantitative data you may consider side-by-side box plots or histograms as appropriate data visualizations:

## Boxplots
ggplot(example_data, aes(x = Tip, y = Time)) + geom_boxplot()

## Histograms
ggplot(example_data, aes(x = Tip)) +  geom_histogram(bins = 20) + facet_wrap(~Time)

Boxplots are generally preferred in this scenario, with the exception being when you would like to check the distributional shape of each sample and cannot make a clear judgement from the boxplots.

Descriptive statistics for two-sample quantitative data should be calculated using grouped summarization (described in the lab’s introduction). Another example of this is shown below:

example_data %>%  
  group_by(Time) %>%             
  summarize(mean(Tip),  
            sd(Tip),
            n())

## # A tibble: 2 × 4
##   Time  `mean(Tip)` `sd(Tip)` `n()`
##   <chr>       <dbl>     <dbl> <int>
## 1 Day          2.73      1.21    68
## 2 Night        3.10      1.44   176

Note that the command n() will simply count the number of cases in each group.

Finally, we can perform the two-sample $T$-test using the t.test() function:

t.test(Tip ~ Time, data = example_data)

## 
##  Welch Two Sample t-test
## 
## data:  Tip by Time
## t = -2.0593, df = 144.07, p-value = 0.04126
## alternative hypothesis: true difference in means between group Day and group Night is not equal to 0
## 95 percent confidence interval:
##  -0.73411080 -0.01505364
## sample estimates:
##   mean in group Day mean in group Night 
##            2.728088            3.102670

The syntax here is a bit different from our prior use of t.test(), but will be very useful to get familiar with as we move towards building regression models. The first piece, Tip ~ Time, is a formula that says that the variable “Tip” is explained by the variable “Time”. We also provide the data so that t.test() knows where to look for these variables.

In this example, a good analysis might include the side-by-side boxplots along with the conclusion statement:

“There is moderate evidence ($p=0.041$) that nighttime tables have a higher average tip ($3.10) than daytime tables ($2.73) for tables served by this individual.”

Question #2: For this question you will use the data subset you created in Question #1 that only contains the “Checkpoint” and “Checked Baggage” claim sites.

Part A: Create a set of side-by-side box plots and/or histograms showing the distribution of the variable Close_Amount at each claim site. How would you describe the shape of these distributions?
Part B: Use grouped summarization to find the mean Close_Amount at each claim site.
Part C: Perform a two-sample $T$-test investigating whether the average Close_Amount is equal at both claim sites. Provide a one-sentence conclusion that includes both the $p$-value and the average close amounts observed in each group (sample). See the example in this section for help on what a good conclusion might look like.

$~$

Violated Assumptions

The two-sample $Z$ and $T$ tests both assume specific probability models for the outcomes of interest, but depending upon the characteristics of the sample and the population being studied these assumed models might be inappropriate. These conditions are summarized below:

Two-sample $Z$-test - at least 10 of each outcome are expected in each group (sample), or $n_1 p_0 \geq 10$, $n_1(1-p_0) \geq 10$, $n_2 p_0 \geq 10$, and $n_2(1-p_0) \geq 10$
Two-sample $T$-test - either both group sizes (samples) are large (ie: $n_1 \geq 30$ and $n_2 \geq 30$), or it is reasonable to assume the data came from Normally distributed populations (judged via histograms or box plots)

Fisher’s Exact Test

If the conditions for the two-sample $Z$-test are not met, a reasonable alternative is Fisher’s exact test, which tests whether the rows and columns of a two-way frequency table are independent given the table’s marginal totals are fixed. To perform Fisher’s exact test we supply the relevant two-way frequency table to the fisher.test() function:

## Example from earlier - difference in proportion of female customers by time of day
example_table = table(example_data$Time, example_data$Sex)
fisher.test(example_table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  example_table
## p-value = 0.001737
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.364025 4.678206
## sample estimates:
## odds ratio 
##   2.518817

We will discuss odds ratios and what they mean later in the course, but for now you should recognize that the $p$-value is extremely close to the one we saw with prop.test() (0.0017 here vs. 0.0013 earlier in the lab). Because the $p$-value is so small, the conclusion is that there is strong evidence of an association between the time of day and whether the customer was female.

Question #3:

Part A: Filter the full TSA claims data set to include only claims that involve cell phones or personal data assistants (PDA). Then use ifelse() to create variable named “approved” that takes on the value “approved” if a claim was approved and the value “not approved” if the claim was either settled or denied. Use the table() function to display a two-way frequency table involving these variables. If your data preparation steps were correct you should see a 2 by 2 table with 4 cells (counts).
Part B: Using the frequency table from Part A, calculate the pooled proportion reflecting independence between the item (cell phone or PDA) and whether the claim was approved in full.
Part C: Using the pooled proportion from Part B, explain why Fisher’s exact test might be more appropriate than a two-sample $Z$-test for comparing the approval rates of these two item categories.
Part D: Perform Fisher’s exact test using fisher.test() and provide a one-sentence conclusion that includes the $p$-value.
Part E: Use prop.test() to address the same hypotheses involved in Parts C and D. How does this $p$-value compare to the one from Part D?

$~$

Wilcoxon Rank-Sum Test

The Wilcoxon Rank Sum Test (also known as the Mann-Whitney U-test) is a rank-based test that does not make any distributional assumptions and can be used in situations where you might be considering the two-sample $T-test.

The null hypothesis is that both groups follow the same distribution, and the alternative is that both groups follow different distributions. The test is conducted using the following steps:

Each data-point, regardless of group, is ranked from smallest to largest
These ranks are summed within each group
These two sums, along with the sample sizes in each group, are used to compute a test statistic that is compared against a null distribution to produce a $p$-value

The test is performed using the wilcox.test() function:

## Perform the test on our example data (tips by time of day)
wilcox.test(Tip ~ Time, mu = 0, data = example_data)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Tip by Time
## W = 4905, p-value = 0.02883
## alternative hypothesis: true location shift is not equal to 0

Interestingly, the $p$-value is slightly smaller than what we calculated using t.test() (which generally isn’t the case), but the conclusion is the same - the data provide strong evidence ($p = 0.029$) that the amount tipped differs by time of day, with tips tending to be larger at night (evidenced by previous box plots).

Question #4: In Question #3 you created a filtered version of the TSA claims data that contains only claims which involved cell phones or personal data assistants (PDA). You will continue using that subset in this question.

Part A: Create side-by-side box plots of the variable “Claim_Amount” for the two categories of item included in this subset. Based upon these box plots and the number of claims made involving each type of item, assess whether the assumptions of the two-sample $T$-test seem reasonable for these data.
Part B: Perform the Wilcoxon Rank-Sum test to assess whether there is statistical evidence that claim amounts differ by item for the two items involved in the data subset used in this question. Provide a one-sentence conclusion that includes the $p$-value of the test.
Part C: Use t.test() to evaluate the hypotheses considered in Parts A and B. How does the $p$-value compare to the one found using the Wilcoxon Rank-Sum test?

$~$

Practice (required)

For the following questions you should start with the full TSA data set. Unlike previous questions where you were guided through the steps of the analysis, these questions will require to determine the proper steps and hypothesis tests on your own. Your answer to each question should include:

Any R code used to prepare your data and perform hypothesis testing (ie: filter(), ifelse(), t.test(), etc.)
At least one data visualization that is directly related to the stated research question.
One sentence describing the conclusion of your hypothesis test that includes both the $p$-value and the observed descriptive statistics (sample means or sample proportions for each group)
One sentence explaining your choice of hypothesis test. Here you might reference small sample sizes and a skewed distribution, or large counts in each cell of the two-way frequency table.

Question #5: Following the guidelines given above, perform a statistical analysis to determine if there is a difference in the proportion of property damage and personal injury claims that are denied.

$~$

Question #6: Following the guidelines given above, perform a statistical analysis to determine if there is a difference in the average claim amount at the Los Angeles international airport (airport code: LAX) or Chicago O’Hare international airport (airport code: ORD)