Introduction

For this activity we will use the online program StatKey to calculate descriptive statistics and construct basic data visualizations.

The goal of the activity is to understand how multivariate relationships should influence our understanding of data and shape the decisions we make using that data.

For some background, a multivariate relationship is one where three or more variables are related to one another. Consider the following example:

Activity (part 1, exploration)

A widely cited study, published in 1981, analyzed data on murders committed during a felony in the state of Florida between 1972 and 1977. The study recorded numerous attributes of each murder, with the outcome of interest being whether the offender was sentenced to death. The researchers wanted to determine whether racial bias was present in the administration of the death penalty.

A CSV file containing these data can be downloaded by clicking the following link:

This file will appear in your “Downloads” folder, and you can upload it into StatKey using the “Upload File” button on an appropriate menu. Each row in this file describes a court case identified by the research team, and the columns contain the following variables describing each case:

Question #1: Use the “Two Categorical Variables” menu under “Descriptive Statistics” on StatKey to create a two-way frequency table displaying the variables OffenderRace and DeathPenalty. Explore the “Proportions: Row, Column, Overall” buttons and use this information to determine whether there seems to be any relationship between the race of the offender and whether or not they received the death penalty. Justify your assessment by citing 2 numbers from one of the tables you looked at.

Question #2: Staying on the “Two Categorical Variables” menu, change the variables displayed to VictimRace and DeathPenalty. Do these two variables appear related? Briefly explain.

Question #3: Again staying on the “Two Categorical Variables” menu, change the variables displayed to VictimRace and OffenderRace. Do these two variables appear related? Briefly explain.

Question #4: Considering what you saw in Questions 2 and 3, do you believe the numbers you cited in Question #1 might be misleading? Briefly explain.
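For reference, the “Row”, “Column”, and “Overall” proportions mentioned in Question #1 differ only in the denominator used for each cell. The short Python sketch below shows the arithmetic outside of StatKey. The counts, the group labels, and the helper function `proportions` are all hypothetical and are not taken from the Florida data.

```python
# Hypothetical counts (NOT the Florida data): rows are groups, columns are outcomes.
counts = {
    "Group A": {"Yes": 30, "No": 70},
    "Group B": {"Yes": 20, "No": 180},
}


def proportions(table, kind):
    """Return row, column, or overall proportions for a nested dict of counts."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, n in cols.items():
            col_totals[col] = col_totals.get(col, 0) + n
    grand_total = sum(row_totals.values())

    result = {}
    for row, cols in table.items():
        result[row] = {}
        for col, n in cols.items():
            if kind == "row":
                denominator = row_totals[row]      # divide by the row total
            elif kind == "column":
                denominator = col_totals[col]      # divide by the column total
            elif kind == "overall":
                denominator = grand_total          # divide by the grand total
            else:
                raise ValueError("kind must be 'row', 'column', or 'overall'")
            result[row][col] = n / denominator
    return result


for kind in ("row", "column", "overall"):
    print(kind, proportions(counts, kind))
```

When the question is whether one group experiences an outcome more often than another, the proportions computed within each group (here, the row proportions) are usually the ones to compare.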

Below are subsets of the full death penalty sentencing dataset that contain only the cases involving victims of a single race:

Question #5: For the cases involving a White victim (first subset given above), were White and Black offenders sentenced to the death penalty at approximately the same rate? Which group was more likely to receive the death penalty? Briefly explain.

Question #6: For the cases involving a Black victim (second subset given above), were White and Black offenders sentenced to the death penalty at approximately the same rate? Which group was more likely to receive the death penalty? Briefly explain.

Question #7: Given everything you’ve seen so far, explain why White and Black offenders seemed to be sentenced to the death penalty at similar rates in Question #1, but in Questions #5 and #6 it was Black offenders who received the death penalty at a much higher rate.

Activity (part 2, debriefing)

In this study, “VictimRace” is a confounding variable, or a third variable that is associated with both the explanatory and response variables in your analysis. The presence of a confounding variable will obscure the real relationship between the explanatory and response variables.

This obfuscation can go in both directions:

The following data

Below are some naive statistical results that consider only the variables “sex” and “admit”.

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(data$sex, data$admit)
## p-value = 1.84e-15
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.225016 1.401925
## sample estimates:
## odds ratio 
##   1.310402

This suggests that the odds of admission are about 31% higher for males than for females. The probability of seeing a discrepancy this large by chance alone in these data is less than 1 in a million.
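An odds ratio compares odds, not probabilities, so an odds ratio of 1.31 does not by itself mean a 31% higher probability of admission. The Python sketch below illustrates the difference. The odds ratio is taken from the output above; the baseline admission probabilities and the helper `probability_from_odds_ratio` are hypothetical and chosen only for illustration.

```python
def probability_from_odds_ratio(baseline_prob, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


odds_ratio = 1.310402  # sample estimate from the Fisher's exact test output

# Hypothetical baseline admission probabilities (assumptions, not from the data).
for baseline in (0.10, 0.30, 0.60):
    new_prob = probability_from_odds_ratio(baseline, odds_ratio)
    relative_increase = new_prob / baseline - 1
    print(f"baseline {baseline:.2f} -> {new_prob:.3f} "
          f"(relative increase in probability: {relative_increase:.1%})")
```

In each case the relative increase in probability is smaller than 31%, and it shrinks as the baseline probability grows.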

However, notice that males and females tend to apply to different departments:

And some departments are more selective than others:

It turns out that males were disproportionately applying to departments A and B, which were less selective, while females were disproportionately applying to department F, a highly selective program.

If you’re interested, the difference in the odds of admission is statistically insignificant (p = 0.241) after adjusting for the effect of “department” using a logistic regression model:

## 
## Call:
## glm(formula = (admit == "Y") ~ sex + dept, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4454  -0.7664  -0.3698  -0.3610   2.3510  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.61090    0.06678   9.148   <2e-16 ***
## sexM        -0.04984    0.04253  -1.172    0.241    
## deptB       -0.02331    0.08721  -0.267    0.789    
## deptC       -1.21283    0.07240 -16.751   <2e-16 ***
## deptD       -1.25094    0.07157 -17.479   <2e-16 ***
## deptE       -1.68582    0.07712 -21.860   <2e-16 ***
## deptF       -3.25940    0.06849 -47.587   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21769  on 20753  degrees of freedom
## Residual deviance: 17435  on 20747  degrees of freedom
## AIC: 17449
## 
## Number of Fisher Scoring iterations: 5

Note that this effect is not distinguishable from something that could be observed by chance, and the direction actually suggests males might be slightly less likely to be admitted (about 4.9% lower odds of admission, given they apply to the same department).
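The percentage quoted above comes from exponentiating the `sexM` coefficient, since logistic regression coefficients are on the log-odds scale. The Python sketch below reproduces that arithmetic from the estimate and standard error in the output. The 95% interval is a simple Wald approximation (estimate ± 1.96 standard errors), which is an assumption made here for illustration and may differ slightly from intervals reported by other methods.

```python
import math

# Values copied from the "sexM" row of the logistic regression output.
estimate = -0.04984   # log-odds scale
std_error = 0.04253

odds_ratio = math.exp(estimate)            # convert log-odds to an odds ratio
percent_change = (odds_ratio - 1) * 100    # percent change in the odds

# Approximate 95% Wald interval on the odds-ratio scale (an assumption here).
z = 1.96
lower = math.exp(estimate - z * std_error)
upper = math.exp(estimate + z * std_error)

print(f"odds ratio: {odds_ratio:.4f}")
print(f"percent change in odds: {percent_change:.1f}%")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

The interval contains 1, which is consistent with the p-value of 0.241: after adjusting for department, the data do not clearly distinguish the male and female odds of admission.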

Next Steps

The possibility that the naive relationship observed between two variables in a population reverses when the data are split into subpopulations using a third variable is known as Simpson’s Paradox.
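To see how such a reversal can arise arithmetically, the Python sketch below uses a small set of hypothetical counts (invented for illustration, not drawn from either dataset in this activity). Group X has the higher rate within each subgroup, yet Group Y has the higher rate once the subgroups are pooled, because the two groups are distributed very differently across the subgroups.

```python
# Hypothetical counts chosen only to illustrate the reversal; not real data.
# Each entry is (number with the outcome, number of cases).
data = {
    "Subgroup 1": {"Group X": (8, 10), "Group Y": (70, 100)},
    "Subgroup 2": {"Group X": (30, 100), "Group Y": (2, 10)},
}


def rate(with_outcome, total):
    """Proportion of cases with the outcome."""
    return with_outcome / total


def pooled_rate(table, group):
    """Rate for one group after combining all subgroups."""
    with_outcome = sum(sub[group][0] for sub in table.values())
    total = sum(sub[group][1] for sub in table.values())
    return with_outcome / total


for name, sub in data.items():
    x = rate(*sub["Group X"])
    y = rate(*sub["Group Y"])
    print(f"{name}: Group X = {x:.2f}, Group Y = {y:.2f}")

print(f"Pooled: Group X = {pooled_rate(data, 'Group X'):.3f}, "
      f"Group Y = {pooled_rate(data, 'Group Y'):.3f}")
```

Most of Group X's cases fall in the subgroup where the outcome is rare for everyone, which pulls its pooled rate down even though it has the higher rate in both subgroups.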

For Oral Presentation #2, you may find your own historical/real-world example of Simpson’s Paradox and use it as your presentation topic.