MATH-256 - Lab #7 - Two-Sample Hypothesis Testing

This lab is intended to provide practice and insight in applying two-sample hypothesis testing methods to real data.

Directions (Please read before starting)

Please work together with your assigned groups. Even though you’ll turn in a write-up that is later scored, labs are intended to be formative, and a substantial portion of the credit you’ll receive is based upon effort and completion.
Please record your responses and code in an R Markdown document following the conventions we’ve used in previous labs.


Overview

Similar to Lab #6, which covered methods for one-sample data, this lab is intended to provide you with practice applying statistical methods used on two-sample data.

As review, we’ve covered the following hypothesis tests for two-sample categorical data:

Fisher’s exact test - fisher.test
- \(H_0: p_1 - p_2 = 0\) for two-sample categorical data of any sample size
\(Z\)-test - prop.test
- \(H_0: p_1 - p_2 = 0\) for two-sample categorical data with at least 10 observed “successes” and 10 observed “failures” in each group

The \(Z\)-test is more computationally efficient than Fisher’s exact test, and it should be used for large datasets. Otherwise, Fisher’s exact test should be preferred in most circumstances.
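As a minimal sketch of how these two functions are called, the example below uses a small 2x2 table of made-up counts. The object names (counts, z_test_ok, and so on) and the group labels are hypothetical and chosen only for illustration. Note that prop.test applies a continuity correction by default.

```r
# Hypothetical 2x2 table of counts (made up for illustration only)
counts <- matrix(c(18, 12,
                   11, 19),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Group = c("A", "B"),
                                 Outcome = c("Success", "Failure")))

# Z-test condition: at least 10 "successes" and 10 "failures" in each group
z_test_ok <- all(counts >= 10)

# Fisher's exact test accepts the 2x2 table directly and works for any sample size
fisher_result <- fisher.test(counts)

# prop.test takes the number of successes and the total size of each group
z_result <- prop.test(x = counts[, "Success"], n = rowSums(counts))

# Observed difference in sample proportions (group A minus group B)
p_hat <- counts[, "Success"] / rowSums(counts)
obs_diff <- unname(p_hat["A"] - p_hat["B"])

fisher_result$p.value
z_result$p.value
```

Both tests evaluate the same null hypothesis here, so their \(p\)-values can be compared directly once the sample size condition has been checked.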

We’ve also covered the following tests for two-sample quantitative data:

\(t\)-test - t.test
- \(H_0: \mu_1 -\mu_2 = 0\) for Normally distributed numeric data (of any sample size) or for large samples (\(n_1 \geq 30\) and \(n_2 \geq 30\)) of numeric data (of any distributional shape)
Wilcoxon rank sum test - wilcox.test
- \(H_0: m_1 - m_2 = 0\) (where \(m_1\) and \(m_2\) denote population-level medians) for non-Normally distributed numeric data (typically small samples)

Generally speaking, the \(t\)-test is more powerful and should always be used if conditions allow for it.
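The sketch below runs both tests on simulated data so the two function calls can be compared side by side. The data frame scores, its columns, and the seed are assumptions made for illustration and are not part of any lab dataset. By default, t.test performs Welch's version of the test, which does not assume equal variances.

```r
set.seed(256)  # arbitrary seed so the simulated data are reproducible

# Simulated scores for two hypothetical groups (not real data)
scores <- data.frame(
  group = rep(c("A", "B"), each = 40),
  value = c(rnorm(40, mean = 100, sd = 15), rnorm(40, mean = 108, sd = 15))
)

# Two-sample t-test using the formula interface (outcome ~ grouping variable)
t_result <- t.test(value ~ group, data = scores)

# Wilcoxon rank sum test on the same data
w_result <- wilcox.test(value ~ group, data = scores)

# Observed difference in sample means (group A minus group B)
group_means <- tapply(scores$value, scores$group, mean)
obs_diff <- unname(group_means["A"] - group_means["B"])

t_result$p.value
w_result$p.value
```

The formula interface is equivalent to passing the two groups separately through the x and y arguments.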


Case Study #1

Some infants are born with congenital heart defects that require surgery shortly after birth. In this study, researchers at Harvard Medical School randomly assigned 143 infants in need of heart surgery to one of two approaches. The first was the current standard of care, known as “circulatory arrest”, which had the downside of cutting off the flow of blood to the brain during the surgery. The second was a new alternative surgical approach, known as “low-flow bypass”, which maintains circulation to the brain but uses an external pump that might lead to other types of brain injuries.

ih = read.csv("https://remiller1450.github.io/data/InfantHeart.csv")

The researchers followed up on these infants a few years later to assess their mental and physical development via the outcomes:

Psychomotor Development Index (PDI) - a composite score measuring physiological development, with higher scores indicating greater development
Mental Development Index (MDI) - a composite score measuring mental development, with higher scores indicating greater development

Additionally, the research team recorded data on the following variables for each infant:

Treatment - the type of surgery the infant received
Weight - the infant’s weight (in grams)
Length - the infant’s length (in cm)
Age - the infant’s age (in hours)
Sex - the infant’s sex (male or female)

Preliminary Steps

Question #1: Considering the design of this study, can a causal relationship between the type of surgery and an infant’s development (measured via PDI or MDI) be established? Briefly explain.

Question #2: Using graphical methods, verify that “weight” and “length” were balanced across both types of surgery. Then, briefly explain why you’d expect these variables to be balanced given the design of the study.


Statistical Analyses

Question #3: Create side-by-side boxplots depicting the outcome variable “PDI” for each type of surgery. Then, considering the sample sizes for each surgery and the distributions seen in these boxplots, determine whether a two-sample \(t\)-test can be used to evaluate a difference in mean PDI scores across the two types of surgery.

Question #4: Regardless of your answer to Question #3, perform a two-sample \(t\)-test comparing the mean PDI scores across each type of surgery. You should report the observed difference in means, the \(p\)-value of the test, and a conclusion in the context of the application.

Question #5: Create side-by-side boxplots depicting the outcome variable “MDI” for each type of surgery. Then, considering the sample sizes for each surgery and the distributions seen in these boxplots, determine whether a two-sample \(t\)-test can be used to evaluate a difference in mean MDI scores across the two types of surgery.

Question #6: Regardless of your answer to Question #5, perform a two-sample \(t\)-test comparing the mean MDI scores across each type of surgery. You should report the observed difference in means, the \(p\)-value of the test, and a conclusion in the context of the application.

Question #7: The code and output below use Wilcoxon rank-sum tests to statistically evaluate differences in median PDI and MDI scores across the two types of surgery. How do the \(p\)-values and the conclusions drawn from these tests compare with those you found in Questions #4 and #6? Briefly explain why the two approaches produce results that are similar and/or different.

## Testing median PDI
wilcox.test(x = ih$PDI[ih$Treatment == "Circulatory arrest"], y = ih$PDI[ih$Treatment == "Low-flow bypass"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ih$PDI[ih$Treatment == "Circulatory arrest"] and ih$PDI[ih$Treatment == "Low-flow bypass"]
## W = 1975.5, p-value = 0.01891
## alternative hypothesis: true location shift is not equal to 0

## Testing median MDI
wilcox.test(x = ih$MDI[ih$Treatment == "Circulatory arrest"], y = ih$MDI[ih$Treatment == "Low-flow bypass"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ih$MDI[ih$Treatment == "Circulatory arrest"] and ih$MDI[ih$Treatment == "Low-flow bypass"]
## W = 2140.5, p-value = 0.09424
## alternative hypothesis: true location shift is not equal to 0


Case Study #2

Early in the semester, we looked at data from a widely cited study that considered trial verdicts for the offenders in all murders that took place during felonies committed in the state of Florida between 1972 and 1977.

dp = read.csv("https://remiller1450.github.io/data/DeathPenaltySentencing.csv")

This dataset contains the following variables:

OffenderRace - whether the person tried for the crime was white or black
VictimRace - whether the murder victim was white or black
DeathPenalty - whether or not the person tried received the death penalty

The researchers who assembled the dataset were interested in whether or not juries exhibited racially biased sentencing in death penalty verdicts.

Preliminary Steps

Question #8: Considering the design of this study, can a causal relationship between the race of the offender and the death penalty verdict be established? Briefly explain.

Question #9: Using graphical methods, verify that “VictimRace” is imbalanced across the two categories of offender’s race. Then, briefly explain why this result is not unexpected given the design of the study.


Statistical Analyses

Question #10: Using an appropriate statistical test, evaluate the null hypothesis \(H_0: p_1 - p_2 = 0\), where \(p_1\) is the proportion of black offenders that receive the death penalty, and \(p_2\) is the proportion of white offenders that receive the death penalty. Report the observed difference in proportions, a \(p\)-value, and a brief conclusion.

Question #11: Suppose an individual looks at the results of the hypothesis test you performed in Question #10 and claims that the results of the test provide proof that death penalty sentences were not racially biased in Florida during the 1970s. Briefly explain two common hypothesis testing mistakes that this individual has committed.


Stratification

A proper statistical analysis of these data will control for the race of the victim. One way of doing this is a stratified analysis, which performs separate statistical tests on the sub-groups that are created when the data is split according to a confounding variable. In the context of this study, a stratified analysis would split the data into cases involving white victims and cases involving black victims, then evaluate the death penalty rates for white and black offenders within each of those strata. The stratified table below summarizes these data:
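The general mechanics of a stratified analysis can be sketched on a toy data frame. Everything in this example is hypothetical: the variable names (stratum, group, outcome) and the values are made up and are unrelated to the death penalty data. The idea is to split the rows by the stratifying variable, build a separate 2x2 table within each stratum, and then run one test per table.

```r
# Toy data frame with hypothetical variable names and made-up values
toy <- data.frame(
  stratum = rep(c("S1", "S2"), times = c(8, 6)),
  group   = c("A", "A", "A", "A", "B", "B", "B", "B",
              "A", "A", "A", "B", "B", "B"),
  outcome = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No",
              "Yes", "No", "No", "Yes", "Yes", "No")
)

# Split the rows by the stratifying variable
by_stratum <- split(toy, toy$stratum)

# Build one 2x2 table (group by outcome) within each stratum
tables <- lapply(by_stratum, function(d) table(d$group, d$outcome))

# Run a separate test within each stratum
# (Fisher's exact test is used here because the toy strata are very small)
p_values <- sapply(tables, function(tab) fisher.test(tab)$p.value)

tables
p_values
```

A three-way call such as table(toy$group, toy$outcome, toy$stratum) produces the same per-stratum counts in a single object.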

Cases involving a White Victim
	Death	Not
Black Offenders	37	41
White Offenders	46	144

Cases involving a Black Victim
	Death	Not
Black Offenders	1	101
White Offenders	0	8

Question #12: Perform separate hypothesis tests for each of the two strata defined above (cases involving white victims and cases involving black victims). Report the \(p\)-value and a brief conclusion for each hypothesis test.

Question #13: Could a two-sample \(Z\)-test have been used to evaluate a possible difference in death penalty sentencing rates for white and black offenders in one, both, or neither of the two strata defined above? Briefly explain.