Directions:

  • My main expectation is that you thoughtfully work through labs collaboratively with your group, discussing the embedded questions and recording your responses in a shared document.
    • At times you might be asked to add screenshots to your write-up. If you are on a Windows PC, an easy way to do this is the “snipping tool”, which you can find using the search bar along the bottom of your screen. If you are on a Mac, you can find instructions on how to take a screenshot at this link.
  • Everyone should upload their own copy of the lab write-up to Canvas.
  • Only a couple of questions on each lab will be graded for accuracy, so your focus should be on learning the material rather than “getting the right answers” as quickly as possible.

\(~\)

Introduction

This lab is intended to introduce the concept of a multivariate relationship. In doing so, it will serve as motivation for an upcoming discussion of study design.

\(~\)

Dataset #1 - Death Penalty Sentencing in Florida

In the early 1980s, the Harvard Law Review published a study that analyzed data on murders that took place during a felony committed in the state of Florida between 1972 and 1977. The researchers were interested in whether the death penalty was being applied equally to both white and black offenders.

  • This Link contains the Florida Death Penalty Sentencing dataset.

\(~\)

Initial Analysis

Question #1: In any analysis we should start by thinking about how far we can generalize our data. To begin this process, describe a narrowly defined population for which these data are very likely to be representative. Then, describe a more broadly defined population for which these data could be argued as being reasonably representative.

Question #2: The researchers who assembled this dataset were interested in whether the race of the offender played a role in death penalty sentencing. For this question, name the explanatory variable and the response variable as they are recorded in these data and state whether each variable is categorical or quantitative.

Question #3: Create a contingency table displaying the relationship between the variables you identified in Question #2. Then, use conditional proportions to describe the relationship you see. In doing so, be sure to comment upon which group was more likely to receive a death penalty sentence, and whether or not you believe this to be evidence of racially biased sentencing.
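For readers who want to see the mechanics outside of StatKey, the sketch below builds a contingency table and the conditional proportions of the response within each explanatory group. It is only an illustration: the rows are made up (they are not the Florida data), the value labels “White”, “Black”, “Yes”, and “No” are assumed codings, and the helper names `contingency_table` and `conditional_proportions` are hypothetical.

```python
from collections import Counter

# Made-up rows for illustration only; these are NOT the Florida data.
# Column names follow the lab; the value labels are assumed codings.
rows = [
    {"OffenderRace": "White", "DeathPenalty": "Yes"},
    {"OffenderRace": "White", "DeathPenalty": "No"},
    {"OffenderRace": "White", "DeathPenalty": "No"},
    {"OffenderRace": "White", "DeathPenalty": "No"},
    {"OffenderRace": "Black", "DeathPenalty": "Yes"},
    {"OffenderRace": "Black", "DeathPenalty": "No"},
]

def contingency_table(rows, explanatory, response):
    """Count the cases for every (explanatory, response) pair."""
    return Counter((r[explanatory], r[response]) for r in rows)

def conditional_proportions(table, outcome):
    """For each explanatory group, return the share of its cases with `outcome`."""
    totals = Counter()
    for (group, _), n in table.items():
        totals[group] += n
    return {group: table[(group, outcome)] / totals[group] for group in totals}

table = contingency_table(rows, "OffenderRace", "DeathPenalty")
props = conditional_proportions(table, "Yes")
for group, p in props.items():
    print(f"P(DeathPenalty = Yes | OffenderRace = {group}) = {p:.2f}")
```

Each proportion is conditional on the explanatory variable: the denominator is the number of cases in that offender group, not the total number of cases.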

\(~\)

Complicating Factors

Notice how these data contain a third variable, the race of the victim. For reasons that will soon be apparent, it is common for statisticians to analyze subgroups of cases, or segments of the sample data that are created by conditioning on a variable that is not directly stated in the study’s research question.

The technique of splitting up a sample according to a certain variable is called stratification. Any analysis done within an individual stratum (a subgroup/segment within the larger dataset) is described as “controlling for” the variable used to split the data. The outcomes observed within each stratum can also be described as “conditional” upon the criteria that define the stratum.
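As a rough sketch of what stratification does to a dataset, the code below splits a list of cases into strata according to one variable. The cases are made up for illustration (they are not the Florida data), the value labels are assumed codings, and `stratify` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up cases for illustration only; these are NOT the Florida data.
cases = [
    {"VictimRace": "White", "OffenderRace": "White", "DeathPenalty": "Yes"},
    {"VictimRace": "White", "OffenderRace": "White", "DeathPenalty": "No"},
    {"VictimRace": "White", "OffenderRace": "Black", "DeathPenalty": "Yes"},
    {"VictimRace": "Black", "OffenderRace": "Black", "DeathPenalty": "No"},
    {"VictimRace": "Black", "OffenderRace": "Black", "DeathPenalty": "No"},
    {"VictimRace": "Black", "OffenderRace": "White", "DeathPenalty": "No"},
]

def stratify(rows, by):
    """Split rows into strata keyed by the value of the `by` column."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[by]].append(row)
    return dict(strata)

strata = stratify(cases, "VictimRace")
for victim_race, subset in strata.items():
    # Any summary computed on `subset` alone is "controlling for" VictimRace.
    n_death = sum(row["DeathPenalty"] == "Yes" for row in subset)
    print(f"VictimRace = {victim_race}: {len(subset)} cases, {n_death} death sentences")
```

Every case lands in exactly one stratum, so the strata together reproduce the full dataset.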

To facilitate a stratified analysis of the death penalty sentencing data, the links below provide separate subsets of the original data according to the variable “VictimRace”:

  1. Click here to download the subset containing only cases involving a white victim.
  2. Click here to download the subset containing only cases involving a black victim.

Question #4: Using the subset containing only cases involving a white victim (i.e., the subset linked in item 1 above), determine whether there is an association between “OffenderRace” and “DeathPenalty”. To support your assessment, report the conditional proportions of death penalty sentences given to both white and black offenders.

Question #5: Using the subset containing only cases involving a black victim (i.e., the subset linked in item 2 above), determine whether there is an association between “OffenderRace” and “DeathPenalty”. To support your assessment, report the conditional proportions of death penalty sentences given to both white and black offenders.

Question #6: Using your answers to Questions #4 and #5, briefly discuss whether these data appear to provide evidence of racially biased sentencing after controlling for the race of the victim.

\(~\)

Confounding Variables

A confounding variable is a third variable that is associated with both the explanatory and response variables. In the data you analyzed, race of the victim was a confounding variable. When a confounding variable is not controlled for, it can obscure the real relationship between the explanatory and response variables.

Sometimes the impact of a confounding variable is so substantial that it reverses the direction of the association, which is what happened in this study. For the data as a whole, a greater proportion of white offenders received the death penalty. However, a greater proportion of black offenders received the death penalty both when the victim was white and when the victim was black. This seemingly contradictory finding is an example of Simpson’s Paradox.
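The arithmetic behind such a reversal can be sketched with a small example. The counts below are hypothetical and were chosen only to produce a reversal; they are not the Florida data, and the labels “A”, “B”, “X”, and “Y” are placeholders for two groups and two strata.

```python
# Hypothetical counts chosen to produce a reversal; NOT the Florida data.
# Keys: (stratum, group) -> (number with the outcome, total cases)
counts = {
    ("X", "A"): (10, 100),
    ("X", "B"): (3, 20),
    ("Y", "A"): (0, 10),
    ("Y", "B"): (5, 100),
}

def rate(pairs):
    """Pooled proportion: total outcomes divided by total cases."""
    pairs = list(pairs)
    return sum(k for k, _ in pairs) / sum(n for _, n in pairs)

# Proportion with the outcome inside each stratum, for each group.
within = {key: rate([value]) for key, value in counts.items()}

# Proportion with the outcome for each group, ignoring the strata.
overall = {
    group: rate(value for (_, g), value in counts.items() if g == group)
    for group in ("A", "B")
}

for (stratum, group), p in within.items():
    print(f"stratum {stratum}, group {group}: {p:.3f}")
for group, p in overall.items():
    print(f"overall, group {group}: {p:.3f}")
```

In this made-up example group B has the higher proportion inside both strata, yet group A has the higher proportion overall. The reversal arises because most of group A's cases fall in stratum X, where the outcome is more common for everyone, while most of group B's cases fall in stratum Y, where it is rare.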

The following questions will help us understand the paradox:

Question #7: Considering no other variables, are offenders more likely to victimize someone of their own race or another race? Use two conditional proportions to justify your answer.

Question #8: Considering no other variables, were the Florida courts more likely to assign the death penalty when the case involved a white victim or a black victim? Use two proportions to justify your answer.

Question #9: Considering your answers to Questions #7 and #8, explain why there seemed to be little difference in death penalty sentencing rates overall, but a large discrepancy after stratifying by the race of the victim.

\(~\)

Dataset #2 - The College Scorecard

In Lab #1, you analyzed data published by the College Scorecard. Provided below are a link to the data and a description of the variables it contains:

  • College Scorecard Dataset Link

  • Name - Name of the institution

  • City - City where the institution is located

  • State - State where the institution is located

  • Enrollment - Number of full-time enrolled students

  • Private - Binary indicator distinguishing public and private institutions

  • Region - Geographic region

  • Adm_Rate - Admissions rate, the proportion of applicants who are admitted

  • ACT_median - Median composite ACT score of enrolled students

  • ACT_Q1 - 25th percentile composite ACT score of enrolled students

  • ACT_Q3 - 75th percentile composite ACT score of enrolled students

  • Cost - Average yearly cost of attendance

  • Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)

  • Avg_Fac_Salary - Average faculty salary

  • PercentFemale - Proportion of enrolled students who are female

  • PercentWhite - Proportion of enrolled students who identify as White

  • PercentBlack - Proportion of enrolled students who identify as Black

  • PercentHispanic - Proportion of enrolled students who identify as Hispanic

  • PercentAsian - Proportion of enrolled students who identify as Asian

  • FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment

  • FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment

  • Debt_median - Median student debt upon leaving the institution

  • Salary10yr_median - Median salary 10 years after graduating from the institution

\(~\)

Initial Analysis

Question #10: Use StatKey to summarize the bivariate relationship between the variables “Cost” and “Avg_Fac_Salary”. That is, describe how faculty salaries appear to be related to the costs of a college.

Question #11: Use the definition of confounding, justified by appropriate StatKey graphs or comparative statistics, to determine whether the variable “Private” should be considered a confounding variable in the relationship you described in Question #10. (Hint: remember that a confounding variable is a third variable that is related to both the explanatory and response variables in a bivariate relationship.)

\(~\)

Stratified Analysis

After identifying a confounding variable, we can use a program like Microsoft Excel to help us conduct a stratified analysis using the following steps:

  1. Open the data using Excel and click on the “Data” header, then click on “Filter”. You should now see dropdown boxes on each of your variables.
  2. Use the dropdown box for your confounding variable to obtain the first subset of cases (Private colleges). Copy and paste these into a new Excel workbook and save it as a CSV file (e.g., “Colleges_private.csv”).
  3. Repeat step two to create the second subset of cases (Public colleges). Save these as a separate CSV file (e.g., “Colleges_public.csv”).
  4. Use StatKey to analyze each of the files you created separately.
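The same filter-then-analyze idea can be sketched in code, as an alternative illustration rather than a replacement for the Excel and StatKey workflow above. The rows below are made up (they are not the College Scorecard data), the codings “Private” and “Public” for the variable “Private” are assumptions, and `pearson_r` and `stratified_correlation` are hypothetical helper names.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def stratified_correlation(rows, x, y, by):
    """Correlation between columns x and y, computed separately per level of `by`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append((float(row[x]), float(row[y])))
    # zip(*pairs) turns the list of (x, y) pairs into an xs tuple and a ys tuple.
    return {level: pearson_r(*zip(*pairs)) for level, pairs in groups.items()}

# Made-up rows for illustration only; NOT the College Scorecard data.
# The "Private"/"Public" codings are an assumption about the dataset.
colleges = [
    {"Private": "Private", "Cost": 40000, "Avg_Fac_Salary": 7000},
    {"Private": "Private", "Cost": 50000, "Avg_Fac_Salary": 9000},
    {"Private": "Private", "Cost": 60000, "Avg_Fac_Salary": 8000},
    {"Private": "Public", "Cost": 20000, "Avg_Fac_Salary": 6000},
    {"Private": "Public", "Cost": 22000, "Avg_Fac_Salary": 7000},
    {"Private": "Public", "Cost": 24000, "Avg_Fac_Salary": 8000},
]

by_sector = stratified_correlation(colleges, "Cost", "Avg_Fac_Salary", "Private")
for sector, r in by_sector.items():
    print(f"{sector}: r = {r:.2f}")
```

With real data, the rows could be read from a CSV file using `csv.DictReader` from the Python standard library; rows with missing values would need to be dropped first, since `float("")` raises an error.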

Question #12: Following the steps above, perform a stratified analysis of the relationship between “Cost” and “Avg_Fac_Salary” while controlling for the variable “Private”. That is, find the correlation coefficient between “Cost” and “Avg_Fac_Salary” among only private schools, and then find the same correlation coefficient among only public schools. Is the relationship between cost and faculty salary stronger or weaker after stratifying the data?