This lab introduces R and R Studio as well as a few procedures we’ll use in future class sessions.

Directions (read before starting)

  1. Please work together with your assigned partner(s). Make sure you all fully understand each topic or example before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. If you and your partners work together for the entire lab, your group only needs to submit one person’s copy.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

Onboarding

Review of filter() and ifelse()

Previous labs have already introduced the filter() and ifelse() functions, but we’ll quickly review them here in order to see how they can be used to create a contingency table from a more complicated data set.

## Load data
tsa_sample = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

## Use filter() to make a subset w/ only the groups of interest
library(dplyr) # Remember that filter() is in the "dplyr" library
my_subset = tsa_sample %>% filter(Item %in% c("Cell Phones", "Medicines"))

## Use ifelse() to create a new outcome variable w/ only the categories of interest
my_subset$new_var = ifelse(my_subset$Status == "Denied", "Denied", "Not Denied")

## The contingency table
table(my_subset$Item, my_subset$new_var)
##              
##               Denied Not Denied
##   Cell Phones     29         26
##   Medicines       16         40
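
The same three steps (subset, recode, cross-tabulate) can be traced on a tiny data set where every row is visible. The sketch below is only an illustration: `toy_claims` and its values are made up rather than taken from the TSA file, and it uses base R row indexing in place of filter() so that it runs without loading any packages.

```r
# Hypothetical toy data (made up for illustration; not the TSA file)
toy_claims <- data.frame(
  Item   = c("Cell Phones", "Cell Phones", "Medicines", "Medicines", "Jewelry", "Cell Phones"),
  Status = c("Denied", "Approved", "Settled", "Denied", "Denied", "Denied"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the two item types of interest (base R analogue of filter())
toy_subset <- toy_claims[toy_claims$Item %in% c("Cell Phones", "Medicines"), ]

# Step 2: collapse Status into two categories with ifelse()
toy_subset$new_var <- ifelse(toy_subset$Status == "Denied", "Denied", "Not Denied")

# Step 3: cross-tabulate the grouping variable (rows) and the outcome (columns)
toy_table <- table(toy_subset$Item, toy_subset$new_var)
print(toy_table)
```

Because the "Jewelry" row is dropped in Step 1, it never appears in the table, and the "Approved" and "Settled" rows are both counted under "Not Denied".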

Reordering Categories

Functions like fisher.test(), which we’ll use to calculate odds ratios and run the associated hypothesis tests, expect a contingency table as input. The fisher.test() function calculates the odds ratio using the first column to define the event of interest (i.e., the odds of “Denied” in our current subset) and the first row as the group in the numerator of the odds ratio:

## Table from previous section
my_table = table(my_subset$Item, my_subset$new_var)
fisher.test(my_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  my_table
## p-value = 0.01214
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.187826 6.611653
## sample estimates:
## odds ratio 
##   2.761943

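Thus, the estimated odds ratio here is 2.76 (fisher.test() reports a conditional maximum likelihood estimate, which can differ slightly from the cross-product ratio computed directly from the table counts), and we can interpret it in either of the following ways: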

  • “the odds of a cell phone claim being denied are 2.76 times the odds of a medicine claim being denied”
  • “the odds of a cell phone claim being denied are 176% higher than the odds of a medicine claim being denied”
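
To see where a number like this comes from, the odds in each row can be computed by hand from the table counts. The sketch below re-enters the four counts shown earlier as a matrix so that it runs on its own; the object names (`tsa_counts`, `odds_cell`, `odds_med`, `sample_or`, `fisher_or`) are chosen here for illustration only.

```r
# Counts copied from the contingency table shown earlier
tsa_counts <- matrix(
  c(29, 26,
    16, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Cell Phones", "Medicines"), c("Denied", "Not Denied"))
)

# Odds of denial within each row: denied count divided by not-denied count
odds_cell <- tsa_counts["Cell Phones", "Denied"] / tsa_counts["Cell Phones", "Not Denied"]
odds_med  <- tsa_counts["Medicines", "Denied"] / tsa_counts["Medicines", "Not Denied"]

# Cross-product (sample) odds ratio, with the first row in the numerator
sample_or <- odds_cell / odds_med
sample_or                # about 2.79

# Percent by which the first row's odds exceed the second row's odds
(sample_or - 1) * 100

# Estimate stored in the fisher.test() result (a conditional MLE)
fisher_or <- unname(fisher.test(tsa_counts)$estimate)
fisher_or                # about 2.76
```

The two values are close but not identical, which is why the hand calculation gives roughly 2.79 while fisher.test() prints 2.76.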

Sometimes we want to reorder the categories of one (or both) of the variables involved in our contingency table. We can do this using the factor() function and the levels argument:

## Reorder "new_var" and store as "reordered_new_var"
my_subset$reordered_new_var = factor(my_subset$new_var, levels = c("Not Denied", "Denied"))

## Reorder "Item"
my_subset$reordered_item = factor(my_subset$Item, levels = c("Medicines", "Cell Phones"))

## Print the table using the reordered variables
table(my_subset$reordered_item, my_subset$reordered_new_var)
##              
##               Not Denied Denied
##   Medicines           40     16
##   Cell Phones         26     29

An important feature of the odds ratio is that it is symmetric, meaning if we reorder the categories of both variables involved in the contingency table we’ll get the exact same measure of the strength of association:

## Odds ratio on the reordered table
reordered_table = table(my_subset$reordered_item, my_subset$reordered_new_var)
fisher.test(reordered_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  reordered_table
## p-value = 0.01214
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.187826 6.611653
## sample estimates:
## odds ratio 
##   2.761943

Technically, this odds ratio should be interpreted as “the odds of a medicine claim being settled or accepted (not denied) are 2.76 times the odds of a cell phone claim being settled or accepted.”

Note that if only one variable were reordered, the odds ratio would not be numerically identical; we would instead get its reciprocal (about 1/2.76 ≈ 0.36).
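
The effect of reordering one variable versus both can be checked directly. The sketch below rebuilds case-level vectors from the counts shown earlier using rep(), so it does not need the CSV file; `cross_product()` is a small helper defined here for illustration, not a built-in function, and it computes the simple cross-product ratio rather than the conditional estimate that fisher.test() reports.

```r
# Rebuild case-level data from the table counts (29/26 and 16/40)
item   <- rep(c("Cell Phones", "Medicines"), times = c(55, 56))
status <- c(rep(c("Denied", "Not Denied"), times = c(29, 26)),
            rep(c("Denied", "Not Denied"), times = c(16, 40)))

# Helper (defined here for illustration): cross-product odds ratio of a 2x2 table
cross_product <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# Reordered versions of each variable
status_rev <- factor(status, levels = c("Not Denied", "Denied"))
item_rev   <- factor(item, levels = c("Medicines", "Cell Phones"))

or_original <- cross_product(table(item, status))          # default alphabetical order
or_one_swap <- cross_product(table(item, status_rev))      # only the outcome reordered
or_two_swap <- cross_product(table(item_rev, status_rev))  # both variables reordered

or_original   # about 2.79
or_one_swap   # about 0.36, the reciprocal of or_original
or_two_swap   # matches or_original
```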

Manually Creating Tables in R

You won’t always receive your data in a CSV where each case is a row. Sometimes you might need to analyze data that has already been aggregated into frequencies.

For example, suppose you are given results from an experiment involving two groups, each containing 35 subjects, where 25 subjects in group 1 experienced the event of interest compared to only 15 in group 2. To perform a statistical analysis of this experiment, we can put the necessary information into R using the data.frame() function:

## Create the data frame, naming the columns based upon the event outcome
ny_table = data.frame(event = c(25, 15), not = c(10, 20))

## The rows of the data are the groups, we can label them as such
row.names(ny_table) = c("group1", "group2")

## Print to check how it looks:
print(ny_table)
##        event not
## group1    25  10
## group2    15  20

Entering this data into R allows us to more easily calculate odds ratios and perform hypothesis tests:

## Fisher's exact test on the example data
fisher.test(ny_table)

The printed output follows the same layout as in the earlier examples: a p-value, a 95 percent confidence interval, and the estimated odds ratio for group1 relative to group2.
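
For the two-group example counts (25 of 35 versus 15 of 35), the odds and odds ratio can also be worked out by hand and compared with what fisher.test() stores in its result. The sketch below re-creates the counts so that it runs on its own; the names `example_counts`, `odds_group1`, `odds_group2`, `cross_product_or`, and `result` are chosen here for illustration, and as.matrix() is used only to make the 2x2 numeric layout explicit.

```r
# Same counts as the manually created data frame, rebuilt so this block is self-contained
example_counts <- data.frame(event = c(25, 15), not = c(10, 20))
row.names(example_counts) <- c("group1", "group2")

# Odds of the event in each group: events divided by non-events
odds_group1 <- example_counts["group1", "event"] / example_counts["group1", "not"]  # 25/10
odds_group2 <- example_counts["group2", "event"] / example_counts["group2", "not"]  # 15/20

# Cross-product odds ratio, group1 relative to group2
cross_product_or <- odds_group1 / odds_group2
cross_product_or          # (25 * 20) / (10 * 15), about 3.33

# Fisher's exact test on the same counts; individual pieces can be pulled out by name
result <- fisher.test(as.matrix(example_counts))
result$estimate           # conditional estimate of the odds ratio
result$p.value            # two-sided p-value
result$conf.int           # 95 percent confidence interval for the odds ratio
```

Extracting `$estimate`, `$p.value`, and `$conf.int` this way can be convenient when a report needs the individual numbers rather than the full printout.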

Lab

Throughout the lab you’ll look at three different studies where the data can be analyzed using a contingency table.

Application #1 - Treatments for Anorexia Nervosa

Anorexia nervosa is an eating disorder characterized by abnormally low body weight and a compulsive fear of weight gain. To explore the efficacy of different treatment protocols, researchers randomly assigned consenting participants to three groups: cognitive behavioral therapy (CBT), family therapy (FT), and a control group, targeting an allocation of 40% to CBT, 20% to FT, and 40% to control. The family therapy approach is the most invasive and cost-prohibitive, but it was also hypothesized to be the most beneficial, hence the lower participant allocation it received.

anorexia_data = read.csv("https://remiller1450.github.io/data/anorexia_nervosa.csv")

  • Part A: Were these data collected as part of an experiment or an observational study? If the study is an experiment, were groups randomly assigned? If the study is observational, does it use a prospective, cross-sectional, or retrospective design? Briefly explain your answer.
  • Part B: Create a data visualization that displays the results of this experiment. You should show the explanatory variable on the x-axis and create a graph that facilitates comparisons of the outcome variable across groups despite the different group sizes.
  • Part C: Create a contingency table comparing the study outcome, weight_increase, for the CBT and control groups (excluding the FT group for the moment). Format the table such that the positive outcome, “increased”, is the first column, and active treatment, “CBT”, is the first row.
  • Part D: Calculate the odds that a participant who receives CBT gains weight during the study and the odds that a participant in the control group gains weight. Then, find the odds ratio comparing the odds of weight gain in the CBT group to the odds of weight gain in the control group.
  • Part E: Use Fisher’s exact test to evaluate whether this study provides evidence that receiving CBT influences the odds of weight gain in anorexia patients. Report a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.
  • Part F: Using the original data from the entire study, create a contingency table comparing the study outcome, weight_increase, for the FT and CBT groups (excluding the control group). Format the table such that the positive outcome, “increased”, is the first column, and the treatment that was hypothesized as more effective, “FT”, is the first row.
  • Part G: Use Fisher’s exact test to evaluate whether this study provides evidence that the odds of weight gain are different for patients receiving FT than they are for patients receiving CBT. Report a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.
  • Part H: Would risk differences or risk ratios have been appropriate measures of association for this study? If not, explain which aspects of the study design or sample characteristics make them problematic.

Application #2 - CDC Investigation of Fluoride Poisoning

In 1992 an unknown illness appeared in a small Alaskan community. Based upon the reported symptoms, the CDC suspected fluoride poisoning from one of the town’s water sources. The CDC surveyed 38 individuals from the community who reported feeling ill, and 33 of them had drunk from the water source. They also surveyed 54 residents who did not report feeling ill, and only 8 of them said they had drunk from the water source.

  • Part A: Were these data collected as part of an experiment or an observational study? If the study is an experiment, were groups randomly assigned? If the study is observational, does it use a prospective, cross-sectional, or retrospective design? Briefly explain your answer.
  • Part B: Use R to create and display a contingency table where the study outcome, suffering from the illness, is the first column, and the exposure of interest, drinking from the suspected water source, is the first row.
  • Part C: Use fisher.test() to calculate the odds ratio and perform a hypothesis test investigating whether these data provide statistical evidence of an association between use of the water source and self-reported illness. Report a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.
  • Part D: Instead of relying upon odds ratios, would it have been reasonable to use risk differences and/or risk ratios to describe the results of this study? If not, briefly describe which study design features or sample characteristics would make those measures problematic.

Application #3 - Baby Boomers in Congress

The composition of the US Congress changes every two years, when new elections are held. Consequently, we might choose to view the composition of any particular congressional class as a sample of broader political trends existing within that era.

This application will look at the members of the 118th US Congress, which served until Jan 3, 2025. The intent is to compare the relative frequency of baby boomers (ages 61-79 in this data set) across the two major political parties.

congress_data = read.csv("https://remiller1450.github.io/data/congress_2024.csv")

  • Part A: Were these data collected as part of an experiment or an observational study? If the study is an experiment, were groups randomly assigned? If the study is observational, does it use a prospective, cross-sectional, or retrospective design? Briefly explain your answer.
  • Part B: Some members of this Congress, namely Bernie Sanders (Senator, VT), Angus King (Senator, ME), and Kyrsten Sinema (Senator, AZ), identify as independents but generally vote with the Democratic party. Use ifelse() to create a new variable containing the groups “Republican aligned” (all Republicans) and “Democrat aligned” (all Democrats, plus the independents Sanders, King, and Sinema, who generally vote with the Democratic party).
  • Part C: Use ifelse() to create a new variable indicating whether a member of Congress is a baby boomer. You may use the logical condition congress_data$Age >= 61 & congress_data$Age <= 79 for this task.
  • Part D: Use the new variables you created in Parts B and C to create and display a contingency table. The first column of the table should be the “baby boomer” outcome, and the rows may be in whichever order you’d like.
  • Part E: Use fisher.test() to calculate the odds ratio and perform a hypothesis test investigating whether these data suggest that either of the major political parties has a greater preponderance of baby boomers among its elected members of Congress. Report a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.