This lab introduces R and R Studio as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
filter() and ifelse()
Previous labs have already introduced the filter() and ifelse() functions, but we’ll quickly review them here to see how they can be used to create a contingency table from a more complicated data set.
## Load data
tsa_sample = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
## Use filter() to make a subset w/ only the groups of interest
library(dplyr) # Remember that filter() is in the "dplyr" library
my_subset = tsa_sample %>% filter(Item %in% c("Cell Phones", "Medicines"))
## Use ifelse() to create a new outcome variable w/ only the categories of interest
my_subset$new_var = ifelse(my_subset$Status == "Denied", "Denied", "Not Denied")
## The contingency table
table(my_subset$Item, my_subset$new_var)
##
## Denied Not Denied
## Cell Phones 29 26
## Medicines 16 40
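Row proportions can make a table like this easier to read. The sketch below is not part of the original lab code: it re-enters the counts printed above by hand (the object names `item_counts` and `row_props` are assumptions used only for illustration) and uses base R's prop.table() to show the share of claims with each outcome within each item type.

```r
## Counts copied by hand from the table above (rows: items, columns: outcomes)
item_counts = matrix(c(29, 16, 26, 40), nrow = 2,
                     dimnames = list(c("Cell Phones", "Medicines"),
                                     c("Denied", "Not Denied")))

## margin = 1 divides each count by its row total
row_props = prop.table(item_counts, margin = 1)
print(round(row_props, 3))
```

Each row of `row_props` sums to 1, so the “Denied” column gives the proportion of denied claims for each item type.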
Functions like fisher.test(), which we’ll use to calculate odds ratios and perform hypothesis tests involving them, expect a contingency table as input. The fisher.test() function then calculates the odds ratio using the first column to define the event of interest (i.e., the odds of “Denied” in our current subset) and the first row as the group used in the numerator of the odds ratio:
## Table from previous section
my_table = table(my_subset$Item, my_subset$new_var)
fisher.test(my_table)
##
## Fisher's Exact Test for Count Data
##
## data: my_table
## p-value = 0.01214
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.187826 6.611653
## sample estimates:
## odds ratio
## 2.761943
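Thus, the estimated odds ratio here is 2.76, which we can interpret as follows: the odds of a cell phone claim being denied are 2.76 times the odds of a medicine claim being denied. Equivalently, the odds of denial are 176% higher for cell phone claims than for medicine claims.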
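As a rough check on where this number comes from, the odds ratio can also be computed by hand from the four cell counts. The sketch below re-enters the counts from the table above (the names `my_counts`, `cross_product_or`, and `fisher_or` are assumptions for illustration, not part of the lab's code). Note that fisher.test() reports a conditional maximum likelihood estimate rather than the simple cross-product, so the two values are close but not identical.

```r
## Counts copied by hand from the contingency table above
my_counts = matrix(c(29, 16, 26, 40), nrow = 2,
                   dimnames = list(c("Cell Phones", "Medicines"),
                                   c("Denied", "Not Denied")))

## Odds of "Denied" within each row
odds_cell = my_counts["Cell Phones", "Denied"] / my_counts["Cell Phones", "Not Denied"]
odds_med  = my_counts["Medicines", "Denied"] / my_counts["Medicines", "Not Denied"]

## Cross-product odds ratio: (29 * 40) / (26 * 16)
cross_product_or = odds_cell / odds_med
print(cross_product_or)

## Conditional estimate reported by fisher.test(), for comparison
fisher_or = unname(fisher.test(my_counts)$estimate)
print(fisher_or)
```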
Sometimes we want to reorder the categories of one (or both) of the variables involved in our contingency table. We can do this using the factor() function and the levels argument:
## Reorder "new_var" and store as "reordered_new_var"
my_subset$reordered_new_var = factor(my_subset$new_var, levels = c("Not Denied", "Denied"))
## Reorder "Item"
my_subset$reordered_item = factor(my_subset$Item, levels = c("Medicines", "Cell Phones"))
## Print the table using the reordered variables
table(my_subset$reordered_item, my_subset$reordered_new_var)
##
## Not Denied Denied
## Medicines 40 16
## Cell Phones 26 29
An important feature of the odds ratio is that it is symmetric, meaning if we reorder the categories of both variables involved in the contingency table we’ll get the exact same measure of the strength of association:
## Odds ratio on the reordered table
reordered_table = table(my_subset$reordered_item, my_subset$reordered_new_var)
fisher.test(reordered_table)
##
## Fisher's Exact Test for Count Data
##
## data: reordered_table
## p-value = 0.01214
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.187826 6.611653
## sample estimates:
## odds ratio
## 2.761943
We should note that this odds ratio should technically be interpreted as “the odds of a medicine claim being settled or accepted (not denied) are 2.76 times the odds of a cell phone claim being settled or accepted”.
Note that if only one variable were reordered we would not expect to see a numerically identical odds ratio.
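To illustrate the last point, the sketch below compares the original table with a version where only the columns are swapped and a version where both rows and columns are swapped. It works from hand-entered counts rather than the lab's data frame, and the object names (`my_counts`, `cols_swapped`, `both_swapped`) are assumptions for illustration. When only one variable is reordered, the estimate should be approximately the reciprocal of the original.

```r
## Counts copied by hand from the first contingency table in this lab
my_counts = matrix(c(29, 16, 26, 40), nrow = 2,
                   dimnames = list(c("Cell Phones", "Medicines"),
                                   c("Denied", "Not Denied")))

## Swap only the columns (outcome order), leaving the rows alone
cols_swapped = my_counts[, c("Not Denied", "Denied")]

## Swap both the rows and the columns
both_swapped = my_counts[c("Medicines", "Cell Phones"), c("Not Denied", "Denied")]

or_original = unname(fisher.test(my_counts)$estimate)
or_cols     = unname(fisher.test(cols_swapped)$estimate)
or_both     = unname(fisher.test(both_swapped)$estimate)

## or_cols should be close to 1 / or_original; or_both should match or_original
print(c(original = or_original, columns_only = or_cols, both = or_both))
```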
R
You won’t always receive your data in a CSV where each case is a row. Sometimes you might need to analyze data that has already been aggregated into frequencies.
For example, suppose you are given an experiment involving two groups, each containing 35 subjects, where 25 subjects in group 1 experienced the event of interest compared to only 15 experiencing this event in group 2. To perform a statistical analysis of this experiment we can put the necessary information into R using the data.frame() function:
## Create the data frame, naming the columns based upon the event outcome
ny_table = data.frame(event = c(25, 15), not = c(10, 20))
## The rows of the data are the groups, we can label them as such
row.names(ny_table) = c("group1", "group2")
## Print to check how it looks:
print(ny_table)
## event not
## group1 25 10
## group2 15 20
Entering this data into R allows us to more easily calculate odds ratios and perform hypothesis tests:
## Fisher's exact test on the example data (the data frame created above)
fisher.test(ny_table)
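For the two-group example (25 of 35 versus 15 of 35 experiencing the event), the sample odds ratio can be worked out by hand as (25 / 10) / (15 / 20) = 10/3, or about 3.33. The sketch below repeats that arithmetic in R and then runs fisher.test() on the same counts. The object names `example_counts`, `hand_or`, and `example_test` are assumptions chosen for illustration, and the estimate from fisher.test() is a conditional one, so it will not exactly equal the hand calculation.

```r
## Same example counts as above, entered as a data frame
example_counts = data.frame(event = c(25, 15), not = c(10, 20))
row.names(example_counts) = c("group1", "group2")

## Sample odds ratio by hand: (25 / 10) / (15 / 20)
odds_group1 = example_counts["group1", "event"] / example_counts["group1", "not"]
odds_group2 = example_counts["group2", "event"] / example_counts["group2", "not"]
hand_or = odds_group1 / odds_group2
print(hand_or)

## fisher.test() on the same counts, converted to a matrix first
example_test = fisher.test(as.matrix(example_counts))
print(example_test$estimate)
```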
Throughout the lab you’ll look at three different studies where the data can be analyzed using a contingency table.
Anorexia nervosa is an eating disorder characterized by abnormally low body weight and a compulsive fear of weight gain. To explore the efficacy of different treatment protocols, researchers randomly assigned consenting participants to three different groups: cognitive behavioral therapy (CBT), family therapy (FT), and a control group, targeting a distribution of 40% to CBT, 20% to FT, and 40% to control. The family therapy approach is the most invasive and cost prohibitive, but was also hypothesized to be the most beneficial, hence the lower participant allocation it received.
anorexia_data = read.csv("https://remiller1450.github.io/data/anorexia_nervosa.csv")
Create a contingency table relating treatment group to the outcome, weight_increase, for the CBT and control groups (excluding the FT group for the moment). Format the table such that the positive outcome, “increased”, is the first column, and active treatment, “CBT”, is the first row.
Create a contingency table relating treatment group to the outcome, weight_increase, for the FT and CBT groups (excluding the control group). Format the table such that the positive outcome, “increased”, is the first column, and the treatment that was hypothesized as more effective, “FT”, is the first row.
In 1992 an unknown illness appeared in a small Alaskan community. Based upon the reported symptoms, the CDC suspected fluoride poisoning from one of the town’s water sources. The CDC surveyed 38 individuals from the community who reported feeling ill, and 33 of them had drunk from the water source. They also surveyed 54 residents who did not report feeling ill, and only 8 of them said they had drunk from the water source.
Use R to create and display a contingency table where the study outcome, suffering from the illness, is the first column, and the exposure of interest, drinking from the suspected water source, is the first row.
Use fisher.test() to calculate the odds ratio and perform a hypothesis test investigating whether these data provide statistical evidence of an association between use of the water source and self-reported illness. Report a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.
The composition of the US Congress changes every two years, when new elections are held. Consequently, we might choose to view the composition of any particular congressional class as a sample of broader political trends existing within that era.
This application will look at the members of the 118th US Congress, which served until Jan 3, 2025. The intent is to compare the relative frequency of baby boomers (ages 61-79 in this data set) across the two major political parties.
congress_data = read.csv("https://remiller1450.github.io/data/congress_2024.csv")
Use ifelse() to create a new variable containing the groups “Republican aligned” (all Republicans) and “Democrat aligned” (all Democrats and independents who generally vote with the Democratic party (Sanders, King, Sinema)).
Use ifelse() to create a new variable indicating whether a member of Congress is a baby boomer. You may use the logical condition congress_data$Age >= 61 & congress_data$Age <= 79 for this task.
Use fisher.test() to calculate the odds ratio and perform a hypothesis test investigating whether these data suggest that either of the major political parties has a greater preponderance of baby boomers among its elected members of Congress. Report a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.
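The two ifelse() patterns used in these steps, grouping several categories under one label and flagging a numeric range, can be tried out on a small made-up data frame first. Everything in the sketch below is hypothetical: the data frame `toy`, its columns `party` and `age`, and its values are assumptions for illustration and do not reflect the column names or contents of the congress data.

```r
## Hypothetical toy data; these column names and values are assumptions
toy = data.frame(party = c("Republican", "Democrat", "Independent", "Republican"),
                 age   = c(45, 66, 82, 70))

## Group several categories under one label using %in%
toy$alignment = ifelse(toy$party %in% c("Democrat", "Independent"),
                       "Democrat aligned", "Republican aligned")

## Combine two conditions with & to flag an age range
toy$boomer = ifelse(toy$age >= 61 & toy$age <= 79, "Boomer", "Not Boomer")

## Contingency table of the two new variables
print(table(toy$alignment, toy$boomer))
```

Checking the new variables with table() before building the final contingency table is a quick way to confirm that every case landed in the intended category.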