
Onboarding

Last week we learned about ways to describe associations between two variables:

  • Contingency tables, conditional proportions, relative risks and odds ratios are used for two categorical variables.
  • Correlation and simple linear regression are used for quantitative variables.

These measures allow us to quantify the effect of one variable upon another. Generally speaking, statisticians consider two different types of associations:

  1. Marginal effects - These are the average effect of one variable upon another across an entire population. So far we’ve only looked at marginal associations. The odds ratio we calculated in Lister’s experiment is a marginal effect because we did not consider any specific group other than the general population of patients at the hospital where Lister conducted the experiment.
  2. Conditional effects - These are the effect of one variable upon another within a specific group or condition. For example, the odds ratio relating type of surgery and survival among male patients is a conditional effect - it involves a categorical variable other than the explanatory and response variables.

To help us find conditional effects, we’ll use two different pipeline functions in the dplyr library, a package we first introduced in Lab 5.

  • The group_by() function defines a set of groups and all subsequent operations in a data pipeline are performed separately for each group.
  • The summarize() function performs various calculations, including descriptive statistics like means, proportions, or the correlation coefficient.

A simple example using group_by() and summarize() is shown below:

library(dplyr)
congress = read.csv("https://remiller1450.github.io/data/congress_2024.csv")

## Example of group_by and summarize
congress %>% group_by(Party) %>% summarize(Average_Age = mean(Age))
## # A tibble: 3 × 2
##   Party Average_Age
##   <chr>       <dbl>
## 1 D            59.0
## 2 I            68.3
## 3 R            57.2

Most descriptive statistics are straightforward to calculate using this framework, but proportions are a bit tricky as they require us to calculate the numerator and denominator ourselves:

## Conditional proportion example 
congress %>% group_by(Party) %>% summarize(Prop_Over_65 = sum(Age >= 65)/n())
## # A tibble: 3 × 2
##   Party Prop_Over_65
##   <chr>        <dbl>
## 1 D            0.378
## 2 I            0.667
## 3 R            0.282

In this example we find the proportion of congress members in each party who were aged 65 or older when the 118th Congress began.

  • Each proportion’s numerator is found using sum() to tally the number of cases in each group that satisfy the logical condition Age >= 65.
  • Each denominator is the total number of congress members in each group, which is given by the function n().

We can verify this calculation using the table() and prop.table() functions (introduced in Lab 4):

## Create the frequency table
party_age_table = table(congress$Party, congress$Age >= 65)

## Calculate row proportions
prop.table(party_age_table, margin = 1)
##    
##         FALSE      TRUE
##   D 0.6221374 0.3778626
##   I 0.3333333 0.6666667
##   R 0.7179487 0.2820513
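
Another way to obtain the same kind of conditional proportion is to take the mean of the logical condition, since R treats TRUE as 1 and FALSE as 0. The sketch below is only an illustration: toy_congress and its values are made up and are not the real congress data.

```r
library(dplyr)

## Hypothetical toy data, used only for illustration
toy_congress <- data.frame(
  Party = c("D", "D", "D", "R", "R", "R", "R"),
  Age   = c(70, 52, 66, 45, 68, 59, 61)
)

## Proportion aged 65+ computed two equivalent ways
toy_props <- toy_congress %>%
  group_by(Party) %>%
  summarize(Prop_Manual = sum(Age >= 65)/n(),
            Prop_Mean   = mean(Age >= 65))
toy_props
```

Both columns should agree in every row, so either form can be used inside summarize().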

The major benefit of group_by() and summarize() is that we can condition upon multiple factors. Consider the example below:

## Another conditional proportion example 
congress %>% group_by(Party, Chamber) %>% summarize(Prop_Over_65 = sum(Age >= 65)/n())
## # A tibble: 5 × 3
## # Groups:   Party [3]
##   Party Chamber Prop_Over_65
##   <chr> <chr>          <dbl>
## 1 D     House          0.350
## 2 D     Senate         0.5  
## 3 I     Senate         0.667
## 4 R     House          0.232
## 5 R     Senate         0.510

This information can be used to find the conditional relative risk of a congress person being aged 65+ for Democrats vs. Republicans among house members: \[RR = 0.350/0.232 = 1.51\]

We conclude that among house members, Democratic representatives are 51% more likely to be aged 65+ than Republicans.

Compare this with the marginal relative risk, which averages across both chambers of congress: \[RR = 0.378/0.282 = 1.34\]

This suggests that Democrats in congress, averaging across both chambers, are 34% more likely to be aged 65+ than Republicans. The marginal relative risk is smaller because a similar proportion of Democrats and Republicans are aged 65+ in the Senate, but not in the House of Representatives.
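
The two relative risks can be reproduced in R from the rounded proportions reported in the tables above. This is a small sketch; the variable names are ours and are not part of the lab’s data.

```r
## Conditional proportions aged 65+ among House members
p_dem_house <- 0.350
p_rep_house <- 0.232

## Marginal proportions aged 65+ (both chambers combined)
p_dem_all <- 0.378
p_rep_all <- 0.282

## Relative risk = risk in one group divided by risk in the other
rr_conditional <- p_dem_house/p_rep_house
rr_marginal    <- p_dem_all/p_rep_all

## Rounds to 1.51 (conditional) and 1.34 (marginal)
round(c(conditional = rr_conditional, marginal = rr_marginal), 2)
```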


Lab

In this lab you’ll work with the “Death Penalty Sentencing” data set. These data are from an influential study published in 1981 that analyzed the sentencing outcomes of individuals who were convicted of murder committed during the course of another felony in the state of Florida between 1972 and 1977. The researchers were studying racial bias in death penalty sentencing, and they recorded the race of the offender, the race of the murder victim, and whether or not the offender was sentenced to receive the death penalty.

death_penalty <- read.csv("https://remiller1450.github.io/data/DeathPenaltySentencing.csv")

You’ll also need the ggplot2 and dplyr libraries, so make sure you’ve loaded them:

library(ggplot2)
library(dplyr)


Marginal Relationships

The difference between marginal effects and conditional effects was covered at the start of the lab. Please review this information before answering the question below if you need a refresher.

Question #1: For this question you should use the death_penalty data set described in the previous section.

  • Part A: Use group_by() and summarize() to find conditional proportions describing how likely white offenders and black offenders were to receive the death penalty. Based upon these two proportions, briefly comment on whether or not you believe there is an association between the race of an offender and whether they were sentenced to receive the death penalty. Hint: The logical condition to check whether the death penalty was awarded involves a double equals sign, or DeathPenalty == "death".
  • Part B: Use group_by() and summarize() to find the odds of a white offender and a black offender receiving the death penalty in these data. Hint: In the “onboarding” section we did something similar for conditional proportions. However, the denominator of odds should be the count of how often the event of interest did not occur, so it will be another logical condition inside of the sum() function.
  • Part C: Using the results of Part B, divide the larger odds by the smaller odds and provide a 1-sentence interpretation of the resulting odds ratio.
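
As a rough illustration of the pattern described in the Part B hint, the sketch below calculates odds within groups and then an odds ratio. It uses a made-up data frame (toy, Group, and Outcome are hypothetical names, not variables in death_penalty), so you will need to adapt it to the lab data.

```r
library(dplyr)

## Hypothetical toy data, not the death_penalty data
toy <- data.frame(
  Group   = c("a", "a", "a", "a", "b", "b", "b", "b", "b"),
  Outcome = c("yes", "no", "no", "no", "yes", "yes", "no", "no", "no")
)

## Odds = (count where the event occurred) / (count where it did not)
toy_odds <- toy %>%
  group_by(Group) %>%
  summarize(Odds_Yes = sum(Outcome == "yes")/sum(Outcome != "yes"))
toy_odds

## Odds ratio: larger odds divided by smaller odds
toy_or <- max(toy_odds$Odds_Yes)/min(toy_odds$Odds_Yes)
toy_or
```

In this toy example the odds are 1/3 in group a and 2/3 in group b, so the odds ratio is 2.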


Stratification

Notice that these data contain a third variable, the race of the victim. The technique of splitting up data using a third variable that is neither the explanatory nor response variable in order to obtain conditional effects is known as stratification. We’ll soon discuss other ways of estimating conditional effects, but stratification is the most straightforward approach.

Question #2:

  • Part A: Starting with your code from Question #1 Part A, add the variable VictimRace to your group_by() command. This should produce 4 different proportions.
  • Part B: Among cases with a white victim, do you believe there is an association between the race of the offender and whether or not they were sentenced to receive the death penalty? Justify your answer.
  • Part C: Among cases with a black victim, do you believe there is an association between the race of the offender and whether or not they were sentenced to receive the death penalty? Justify your answer.
  • Part D: Which do you believe to be more meaningful in this study, the marginal effect of offender’s race on death penalty sentencing (what you explored in Question #1) or the conditional effects of offender’s race on death penalty sentencing within each category of victim’s race (what you explored in Parts A-C of this question)? Briefly explain your answer.


Confounding Variables

In the “Death Penalty Sentencing” data set, VictimRace is a confounding variable, or a third variable that is associated with both the explanatory and response variables in the primary analysis. The presence of a confounding variable will obscure the relationship between the explanatory and response variable, making the marginal association (or lack thereof) misleading.

Stratification neutralizes the impact of a confounding variable, as the definition of confounding is no longer met within strata where all cases have the same value of the confounding variable. For example, in the stratum of data where the victim’s race is white, all of the cases have the same victim’s race, so the variable VictimRace cannot be associated with the explanatory and response variables within this stratum.

Question #3: This question seeks to better understand how and why VictimRace confounds the relationship between OffenderRace and DeathPenalty in the “Death Penalty Sentencing” data set:

  • Part A: Using the entire data set, create a bar chart that shows the proportion of cases in each category of DeathPenalty for each value of VictimRace (a conditional bar chart). Provide a brief description of the relationship you see in this bar chart.
  • Part B: Using the entire data set, create a bar chart that shows the proportion of cases in each category of OffenderRace for each value of VictimRace (a conditional bar chart). Provide a brief description of the relationship you see in this bar chart.
  • Part C: The definition of confounding requires a third variable be associated with both the explanatory and response variables. Based upon what you saw in Parts A and B, does VictimRace satisfy the definition of confounding? Briefly explain.
  • Part D: In your own words, briefly explain how it is possible that the marginal effect of offender’s race suggests white offenders are slightly more likely to receive the death penalty, but when a stratified analysis is performed black offenders are more likely to receive the death penalty in both strata.
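
Parts A and B ask for conditional bar charts. One common way to draw such a chart in ggplot2 is geom_bar() with position = "fill", which rescales every bar to the same height so that its segments show conditional proportions. The sketch below uses a made-up data frame (toy, Stratum, and Outcome are hypothetical names), so treat it as a template, not a solution.

```r
library(ggplot2)

## Hypothetical toy data, used only for illustration
toy <- data.frame(
  Stratum = c("s1", "s1", "s1", "s2", "s2", "s2", "s2"),
  Outcome = c("yes", "no", "no", "yes", "yes", "yes", "no")
)

## position = "fill" rescales each bar to height 1,
## so the segments show proportions of Outcome within each Stratum
toy_plot <- ggplot(data = toy, aes(x = Stratum, fill = Outcome)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
toy_plot
```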


Adjusted Effects

Stratification is an excellent analytic approach in scenarios like the “Death Penalty Sentencing” study where there are relatively few confounding variables/strata. In more complex situations it is common to report adjusted effects, which modify the marginal effect to adjust for confounding variable(s).

In the “Death Penalty Sentencing” study, the marginal association between OffenderRace and DeathPenalty was misleading due to imbalances in the distribution of VictimRace across the categories of OffenderRace (as well as the association between VictimRace and DeathPenalty). Thus, we could obtain an adjusted effect by weighting the data to force an equal distribution of VictimRace, which would make it no longer satisfy the definition of confounding.

Consider the risk of being sentenced to the death penalty for a white offender, which we previously found to be 0.232. This risk is actually a weighted average of the risk for cases involving white victims and the risk for cases involving black victims: \[p_{d|wo} = 0.232 = \tfrac{190}{198} \times 0.242 + \tfrac{8}{198} \times 0 \]

  • The first weight, \(\tfrac{190}{198}\), reflects the 190 of 198 cases with a white offender where the victim was white
    • The associated risk of death penalty in this stratum was 0.242 (you should have found this in Question #2)
  • The second weight, \(\tfrac{8}{198}\), reflects the 8 of 198 cases with a white offender where the victim was black
    • The associated risk of death penalty in this stratum was 0 (there are no cases where a white offender received the death penalty when the victim was black)
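
The weighted average above can be checked with a few lines of R. This is a minimal sketch using only the numbers given in this section; the variable names are ours.

```r
## Stratum-specific risks for white offenders
risk_white_victim <- 0.242
risk_black_victim <- 0

## Weights that occurred naturally in the data:
## 190 of 198 cases had a white victim, 8 of 198 had a black victim
w_natural <- c(190, 8)/198

## Weighted average of the conditional risks
marginal_risk <- sum(w_natural * c(risk_white_victim, risk_black_victim))
round(marginal_risk, 3)   ## 0.232
```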

Instead of using the weights that naturally occurred in the data, we could use another set of weights:

  1. We could use the overall proportions of white and black victims across the entire data set.
  2. We could choose either white offenders or black offenders as a reference group, and adjust the weights for the other group to match those of the reference group.

We will focus on the first approach, though you should be aware that the second exists and it could be achieved using similar steps to those shown below. First, let’s look at the marginal distribution of victim’s race:

## Overall distribution of victim's race
table(death_penalty$VictimRace)
## 
## black white 
##   110   268

This tells us the overall proportion of victims of each race (268/378 white and 110/378 black), which are the weights we’d need to force balance in our risk calculation. Thus, the adjusted risk of receiving a death penalty sentence for white offenders is calculated as: \[p_{d|wo}^{adj} = \tfrac{268}{378} \times 0.242 + \tfrac{110}{378} \times 0 = 0.172\]

Notice that the adjusted risk is substantially lower than what we’d previously calculated.
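
The same adjusted risk can be computed in R with weighted.mean(), which rescales the weights so that they sum to 1. The sketch below uses only numbers from this section; the variable names are ours.

```r
## Conditional risks for white offenders: white victims, then black victims
risks_white_offender <- c(0.242, 0)

## Overall victim counts from table(death_penalty$VictimRace),
## listed in the same order as the risks
victim_counts <- c(white = 268, black = 110)

## Weighted average using the overall distribution of victim's race
adjusted_risk <- weighted.mean(risks_white_offender, w = victim_counts)
round(adjusted_risk, 3)   ## 0.172
```

The order of the weights must match the order of the risks, which is why both vectors list white victims first.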

Question #4:

  • Part A: Verify that the marginal risk of a black offender receiving the death penalty, or 0.211, can be expressed as a weighted average of the conditional risks you found in Question #2. You should use the calculations shown in this section as an example, and you should show your work.
  • Part B: Use the overall distribution of VictimRace to calculate the adjusted risk of receiving a death penalty sentence for black offenders. Show your work.
  • Part C: Is the adjusted risk you calculated in Part B higher or lower than the marginal risk?
  • Part D: When compared to the adjusted risk for white offenders, 0.172, does your adjusted risk from Part B suggest racial bias? Briefly explain.


Random Assignment

Both stratification and weighting are difficult to implement in scenarios involving many confounding variables. Our next topic, multivariable regression, offers a way to adjust for multiple confounding variables simultaneously; however, a more reliable approach is to collect our data in a way such that the groups we seek to compare are balanced in every possible way, thereby precluding the possibility of confounding variables distorting the association we observe between our explanatory and response variables.

An example of this is the “Infant Heart” data set, where researchers at Harvard Medical School randomly assigned infants in need of heart surgery to one of two surgical approaches, “low-flow bypass” or “circulatory arrest”:

## Read data
infant_heart <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")

## Compare the distribution of birth weight by treatment group
ggplot(data = infant_heart, aes(x = Treatment, y = Weight)) + geom_boxplot()

Since the treatment group (type of surgery) is independent of birth weight, we can conclude that birth weight is not confounding the relationship between treatment and the study’s primary outcome variables (MDI and PDI scores). This is true even if birth weight is associated with MDI/PDI scores.
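
To see why random assignment tends to produce this kind of balance, consider the small simulation below. It is only a sketch under made-up assumptions: the sample size, the baseline variable, and the group labels "A" and "B" are all hypothetical and are unrelated to the “Infant Heart” data.

```r
set.seed(1)   ## arbitrary seed so the sketch is reproducible

n <- 10000

## Hypothetical baseline characteristic measured before assignment
baseline <- rnorm(n, mean = 3.5, sd = 0.5)

## Randomly assign exactly half of the subjects to each treatment
treatment <- sample(rep(c("A", "B"), each = n/2))

## Assignment ignored baseline, so the group means should be nearly identical
group_means <- tapply(baseline, treatment, mean)
group_means
diff_means <- unname(group_means["A"] - group_means["B"])
diff_means
```

Because the assignment never looks at the baseline variable, any difference between the group means is due to chance alone and shrinks as the sample size grows.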

Question #5: For this question you should use the “Infant Heart” data set.

  • Part A: To gain a better sense of what to expect when two variables are independent due to random assignment, create a data visualization depicting the relationship between Treatment and Sex.
  • Part B: Suppose we calculate the mean PDI score of each treatment group. If we were to adjust these means by weighting according to the overall distribution of Sex in the study, how much change would you expect as a result of the adjustment: “very little change”, “moderate change”, or “substantial change”? Choose one of these options and briefly explain your choice.
  • Part C: Consider a new study where the researchers randomly assigned the explanatory variable. Should these researchers be concerned about unknown confounding variables, or variables that meet the definition of confounding and were not measured when the data were collected? Briefly explain.