\(~\)
Last week we learned about ways to describe associations between two variables:
These measures allow us to quantify the effect of one variable upon another. Generally speaking, statisticians consider two different types of associations:
To help us find conditional effects, we’ll use two different pipeline functions in the dplyr library, a package we first introduced in Lab 5.
The group_by() function defines a set of groups, and all subsequent operations in a data pipeline are performed separately for each group.
The summarize() function performs various calculations, including descriptive statistics like means, proportions, or the correlation coefficient.
A simple example using group_by() and summarize() is shown below:
library(dplyr)
congress = read.csv("https://remiller1450.github.io/data/congress_2024.csv")
## Example of group_by and summarize
congress %>% group_by(Party) %>% summarize(Average_Age = mean(Age))
## # A tibble: 3 × 2
## Party Average_Age
## <chr> <dbl>
## 1 D 59.0
## 2 I 68.3
## 3 R 57.2
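Several statistics can be requested in a single summarize() call. The sketch below uses a small made-up data frame (the names toy, group, x, and y are hypothetical and not part of the congress data) to show a count, a mean, and a correlation coefficient computed separately for each group:

```r
library(dplyr)

## Hypothetical toy data, used only for illustration
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 6, 6, 4, 2)
)

## One row per group, one column per statistic
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(count = n(), mean_x = mean(x), cor_xy = cor(x, y))
toy_summary
```

In this toy data y rises with x in group "a" and falls with x in group "b", so the two groups have correlations of opposite sign even though their means of x are the same.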
Most descriptive statistics are straightforward to calculate using this framework, but proportions are a bit tricky as they require us to calculate the numerator and denominator ourselves:
## Conditional proportion example
congress %>% group_by(Party) %>% summarize(Prop_Over_65 = sum(Age >= 65)/n())
## # A tibble: 3 × 2
## Party Prop_Over_65
## <chr> <dbl>
## 1 D 0.378
## 2 I 0.667
## 3 R 0.282
In this example we find the proportion of congress members in each party who were aged 65 or older when the 118th Congress began.
The numerator uses sum() to tally the number of cases in each group that satisfy the logical condition Age >= 65.
The denominator is the number of cases in each group, given by n().
We can verify this calculation using the table() and prop.table() functions (introduced in Lab 4):
## Create the frequency table
party_age_table = table(congress$Party, congress$Age >= 65)
## Calculate row proportions
prop.table(party_age_table, margin = 1)
##
## FALSE TRUE
## D 0.6221374 0.3778626
## I 0.3333333 0.6666667
## R 0.7179487 0.2820513
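As a side note, R treats TRUE as 1 and FALSE as 0, so taking the mean() of a logical condition gives the same proportion as dividing sum() by n(). The sketch below checks this on a small made-up data frame (toy_ages, party, and age are hypothetical names and values, not the congress data):

```r
library(dplyr)

## Hypothetical ages for two made-up groups, used only for illustration
toy_ages <- data.frame(
  party = c("X", "X", "X", "X", "Y", "Y", "Y"),
  age = c(70, 66, 50, 40, 65, 30, 20)
)

toy_props <- toy_ages %>%
  group_by(party) %>%
  summarize(
    prop_sum = sum(age >= 65) / n(),  # numerator and denominator written out
    prop_mean = mean(age >= 65)       # TRUE counts as 1, FALSE counts as 0
  )
toy_props
```

Both columns should agree within each group. Writing out the numerator and denominator remains the more general approach, since it also works for quantities like odds where the denominator is not the group size.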
The major benefit of group_by() and summarize() is that we can condition upon multiple factors. Consider the example below:
## Another conditional proportion example
congress %>% group_by(Party, Chamber) %>% summarize(Prop_Over_65 = sum(Age >= 65)/n())
## # A tibble: 5 × 3
## # Groups: Party [3]
## Party Chamber Prop_Over_65
## <chr> <chr> <dbl>
## 1 D House 0.350
## 2 D Senate 0.5
## 3 I Senate 0.667
## 4 R House 0.232
## 5 R Senate 0.510
This information can be used to find the conditional relative risk of a member of Congress being aged 65+ for Democrats vs. Republicans among House members: \[RR = 0.350/0.232 = 1.51\]
We conclude that among House members, Democratic representatives are 51% more likely to be aged 65+ than Republicans.
Compare this with the marginal relative risk, which averages across both chambers of Congress: \[RR = 0.378/0.282 = 1.34\]
This suggests that Democrats in Congress, averaging across both chambers, are 34% more likely to be aged 65+ than Republicans. The marginal relative risk is smaller because a similar proportion of Democrats and Republicans are aged 65+ in the Senate (0.500 vs. 0.510), but not in the House of Representatives.
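The relative risks can also be computed in R. The sketch below simply copies the rounded proportions from the output shown earlier into named vectors (the object names house, senate, and overall are hypothetical), so the results carry the same rounding as the printed tables:

```r
## Proportions aged 65+ copied from the output shown earlier
house   <- c(D = 0.350, R = 0.232)  # conditional on Chamber == "House"
senate  <- c(D = 0.500, R = 0.510)  # conditional on Chamber == "Senate"
overall <- c(D = 0.378, R = 0.282)  # marginal, both chambers combined

## Relative risk: Democrats vs. Republicans
rr_house    <- house[["D"]] / house[["R"]]
rr_senate   <- senate[["D"]] / senate[["R"]]
rr_marginal <- overall[["D"]] / overall[["R"]]

round(c(house = rr_house, senate = rr_senate, marginal = rr_marginal), 2)
```

The Senate relative risk is close to 1, which is why the marginal relative risk falls between the House value and 1.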
\(~\)
In this lab you’ll work with the “Death Penalty Sentencing” data set. These data are from an influential study published in 1981 that analyzed the sentencing outcomes of individuals who were convicted of murder committed during the course of another felony in the state of Florida between 1972 and 1977. The researchers were studying racial bias in death penalty sentencing, and they recorded the race of the offender, the race of the murder victim, and whether or not the offender was sentenced to receive the death penalty.
death_penalty <- read.csv("https://remiller1450.github.io/data/DeathPenaltySentencing.csv")
You’ll also need the ggplot2 and dplyr libraries, so make sure you’ve loaded them:
library(ggplot2)
library(dplyr)
\(~\)
The difference between marginal effects and conditional effects was covered at the start of the lab. Please review this information before answering the question below if you need a refresher.
Question #1: For this question you should use the death_penalty data set described in the previous section.
Use group_by() and summarize() to find conditional proportions describing how likely white offenders and black offenders were to receive the death penalty. Based upon these two proportions, briefly comment on whether or not you believe there is an association between the race of an offender and whether they were sentenced to receive the death penalty. Hint: The logical condition to check whether the death penalty was imposed involves a double equals sign, or DeathPenalty == "death".
Use group_by() and summarize() to find the odds of a white offender and a black offender receiving the death penalty in these data. Hint: In the “onboarding” section we did something similar for conditional proportions. However, the denominator of odds should be the count of how often the event of interest did not occur, so it will be another logical condition inside of the sum() function.
\(~\)
Notice that these data contain a third variable, the race of the victim. The technique of splitting up data using a third variable that is neither the explanatory nor response variable in order to obtain conditional effects is known as stratification. We’ll soon discuss other ways of estimating conditional effects, but stratification is the most straightforward approach.
Question #2:
Add VictimRace to your group_by() command. This should produce 4 different proportions.
\(~\)
In the “Death Penalty Sentencing” data set, VictimRace is a confounding variable, or a third variable that is associated with both the explanatory and response variables in the primary analysis. The presence of a confounding variable will obscure the relationship between the explanatory and response variable, making the marginal association (or lack thereof) misleading.
Stratification neutralizes the impact of a confounding variable, as the definition of confounding is no longer met within strata where all cases have the same value of the confounding variable. For example, in the stratum of data where the victim’s race is white, all of the cases have the same victim’s race, so the variable VictimRace is not associated with the explanatory and response variables within this stratum.
Question #3: This question seeks to better understand how and why VictimRace confounds the relationship between OffenderRace and DeathPenalty in the “Death Penalty Sentencing” data set:
Create a bar chart showing DeathPenalty for each value of VictimRace (a conditional bar chart). Provide a brief description of the relationship you see in this bar chart.
Create a bar chart showing OffenderRace for each value of VictimRace (a conditional bar chart). Provide a brief description of the relationship you see in this bar chart.
Does VictimRace satisfy the definition of confounding? Briefly explain.
\(~\)
Stratification is an excellent analytic approach in scenarios like the “Death Penalty Sentencing” study where there are relatively few confounding variables/strata. In more complex situations it is common to report adjusted effects, which modify the marginal effect to adjust for confounding variable(s).
In the “Death Penalty Sentencing” study, the marginal association between OffenderRace and DeathPenalty was misleading due to imbalances in the distribution of VictimRace across the categories of OffenderRace (as well as the association between VictimRace and DeathPenalty). Thus, we could obtain an adjusted effect by weighting the data to force an equal distribution of VictimRace, which would make it no longer satisfy the definition of confounding.
Consider the risk of being sentenced to the death penalty for a white offender, which we previously found to be 0.232. This risk is actually a weighted average of the risk for cases involving white victims and the risk for cases involving black victims: \[p_{d|wo} = 0.232= \tfrac{190}{198}*0.242 + \tfrac{8}{198}*0 \]
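Written in general form, this is the law of total probability applied within the group of white offenders. The subscripts \(wv\) (white victim) and \(bv\) (black victim) are notation introduced here for illustration and do not appear elsewhere in the lab: \[p_{d|wo} = P(wv|wo)*p_{d|wo,wv} + P(bv|wo)*p_{d|wo,bv}\]
Here \(P(wv|wo) = \tfrac{190}{198}\) and \(P(bv|wo) = \tfrac{8}{198}\) are the weights that naturally occurred among white offenders, while \(p_{d|wo,wv} = 0.242\) and \(p_{d|wo,bv} = 0\) are the stratum-specific risks. An adjusted risk keeps the stratum-specific risks and changes only the weights.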
Instead of using the weights that naturally occurred in the data, we could use another set of weights:
We will focus on the first approach, though you should be aware that the second exists and it could be achieved using similar steps to those shown below. First, let’s look at the marginal distribution of victim’s race:
## Overall distribution of victim's race
table(death_penalty$VictimRace)
##
## black white
## 110 268
This tells us the overall proportions of each victim’s race, which are the weights we’d need to force balance in our risk calculation. Thus, the adjusted risk of receiving a death penalty sentence for white offenders is calculated as: \[p_{d|wo}^{adj} = \tfrac{268}{378}*0.242 + \tfrac{110}{378}*0 = 0.172\]
Notice that the adjusted risk is substantially lower than what we’d previously calculated.
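Both weighted averages for white offenders can be checked with the base R function weighted.mean(), which rescales the weights so that they sum to 1. This is only a sketch of the arithmetic: the object names are hypothetical, and the risks and counts are copied from the text above rather than computed from the data.

```r
## Stratum-specific risks for white offenders, copied from the text above
risk_by_victim <- c(white = 0.242, black = 0)

## Weights observed among white offenders: 190 white victims, 8 black victims
natural_weights <- c(white = 190, black = 8)

## Weights from the overall distribution of VictimRace: 268 white, 110 black
overall_weights <- c(white = 268, black = 110)

## Same risks, different weights
unadjusted <- weighted.mean(risk_by_victim, natural_weights)
adjusted   <- weighted.mean(risk_by_victim, overall_weights)

round(c(unadjusted = unadjusted, adjusted = adjusted), 3)
```

Only the weights differ between the two calculations, so the gap between the two results reflects how unevenly VictimRace is distributed among white offenders compared with the full data set.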
Question #4:
Use the overall distribution of VictimRace to calculate the adjusted risk of receiving a death penalty sentence for black offenders. Show your work.
\(~\)
Both stratification and weighting are difficult to implement in scenarios involving many confounding variables. Our next topic, multivariable regression, offers a way to adjust for multiple confounding variables simultaneously; however, a more reliable approach is to collect our data in a way such that the groups we seek to compare are balanced in every possible way, thereby precluding the possibility of confounding variables distorting the association we observe between our explanatory and response variables.
An example of this is the “Infant Heart” data set, where researchers at Harvard Medical School randomly assigned infants in need of heart surgery to one of two surgical approaches, “low-flow bypass” or “circulatory arrest”:
## Read data
infant_heart <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
## Compare the distribution of birth weight by treatment group
ggplot(data = infant_heart, aes(x = Treatment, y = Weight)) + geom_boxplot()
Since the treatment group (type of surgery) is independent of birth weight, we can conclude that birth weight is not confounding the relationship between treatment and the study’s primary outcome variables (MDI and PDI scores). This is true even if birth weight is associated with MDI/PDI scores.
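To see why random assignment tends to balance a variable like birth weight, consider the simulation sketched below. Everything in it is made up for illustration: the data are simulated rather than taken from the “Infant Heart” study, and the sample size, mean, standard deviation, seed, and arm labels are arbitrary assumptions.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 10000
sim <- data.frame(
  ## Simulated covariate standing in for birth weight (made-up mean and SD)
  weight = rnorm(n, mean = 3.4, sd = 0.5),
  ## Random assignment ignores the covariate entirely
  treatment = sample(c("arm_1", "arm_2"), size = n, replace = TRUE)
)

## Compare the covariate across the randomly assigned groups
balance <- sim %>%
  group_by(treatment) %>%
  summarize(count = n(), mean_weight = mean(weight))
balance
```

Because the assignment does not depend on weight, the two group means should differ only by chance, and that chance difference shrinks as the sample size grows. A real study with far fewer participants can still show a modest imbalance.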
Question #5: For this question you should use the “Infant Heart” data set.
Examine the relationship between Treatment and Sex.
If you were to adjust for Sex in the study, how much change would you expect as a result of the adjustment: “very little change”, “moderate change”, or “substantial change”? Choose one of these options and briefly explain your choice.