Statistical Testing

Introduction

Before fall break, we learned about basic probability models, computational methods like bootstrapping and Monte Carlo simulation, and confidence intervals as tools for understanding uncertainty. Today we’ll cover what is arguably the best-known statistical tool: hypothesis testing.

The basic idea behind hypothesis testing is as follows:

1. Set up a “straw man” hypothesis that would be useful to falsify. For example, you might set up the hypothesis “Drug A offers no benefit over Drug B”. This hypothesis is called the null hypothesis.
2. Use the null hypothesis to inform a probability model. Then use this model to calculate the probability of seeing the observed data, or data that contradict the null hypothesis even more strongly. This probability is called the p-value.
3. Use this probability as evidence against the null hypothesis. That is, if the probability of seeing the study’s data under the null hypothesis is very small, you could argue that the null hypothesis does not provide a good model for the data and is likely to be incorrect. However, if the probability is not small, you cannot rule out that the data arose from the null model.
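
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data (9 heads in 10 coin flips) are hypothetical, the null hypothesis is that the coin is fair, and the function name `simulated_p_value` and its parameters are made up for this example.

```python
import random

def simulated_p_value(observed, n, null_p=0.5, reps=10_000, seed=1):
    """Estimate P(count >= observed) when each of n trials succeeds with probability null_p."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    at_least_as_extreme = 0
    for _ in range(reps):
        # Step 2: generate one dataset from the null model
        count = sum(rng.random() < null_p for _ in range(n))
        # Keep track of datasets that contradict the null at least as strongly
        if count >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps

# Step 1: null hypothesis "the coin is fair" (null_p = 0.5)
# Hypothetical data: 9 heads in 10 flips
p_value = simulated_p_value(observed=9, n=10)

# Step 3: a small value counts as evidence against the null hypothesis
print(p_value)
```

Because the p-value here is estimated by simulation, it varies slightly with the seed and the number of repetitions. Note also that this sketch only counts outcomes in one direction (more heads than observed).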


Application

In an article published in the journal Nature, Hamlin, Wynn, and Bloom (2007) explored the capacity of infants to judge pro-social behavior.

In one part of their study, infants were repeatedly shown puppet shows where a “climber” character struggled to reach the top of a hill. There were two variations of the show:

Helper scenario - the “helper” character assists the “climber” character

Hinderer scenario - the “hinderer” character antagonizes the “climber” character

Infants watched each variation of the show several times in an alternating order, giving them the opportunity to learn the behavior of each character (which could be identified across repetitions by its color and shape).

Next, the infants were given the opportunity to choose a character to play with.

The researchers recorded the choices of 16 infants who participated in the study. They sought to evaluate whether the infants were inclined to select the “helper” character over the “hinderer” character. They found that 14 of the 16 infants selected the “helper” character.


Question #1: The first step in hypothesis testing is setting up a null hypothesis that the researchers hope to falsify using their data. In words, what would this hypothesis be for the study/data described in this section?

Question #2: Briefly describe one way that you could generate/simulate data that are consistent with the null hypothesis you provided in Question #1.

Question #3: On this StatKey simulation app, click on “Edit Data” and enter the observed count and the sample size of this study. Next, verify that the null hypothesis is \(p=0.5\) and generate 10 simulated outcomes. What does each dot shown in the app’s main panel represent? Briefly explain.

Question #4: On the same page used in Question #3, click “Reset Plot” then generate 2,000 simulated outcomes. You should use the same data and null hypothesis as in Question #3. Now, considering the steps and philosophy of hypothesis testing, briefly explain why this distribution of simulated outcomes is useful.

Question #5: Under the model used in Questions #3 and #4, calculate the probability of an outcome at least as extreme as the one observed in the actual study. Note that this probability is known as the p-value.

Question #6: Based upon the result you found in Question #5, decide whether you believe the results of this study are sufficient to falsify the null hypothesis.

Question #7: Assume you decided in Question #6 that the observed data provided sufficient evidence to falsify the null hypothesis. Does this mean that infants prefer pro-social behavior? Or are there other explanations that you must still rule out?

Question #8: In the actual study, the researchers randomly assigned the color and shape of the “helper” and “hinderer” characters for each infant. That is, for some infants the helper was a red triangle, while for others it was a blue square, or yellow triangle, etc. Why do you think the researchers did this? Briefly explain.
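
As a cross-check on the StatKey work in Questions #3 through #5, the same simulation can be sketched in Python. This is not how StatKey is implemented; it is a minimal sketch that assumes each of the 16 infants picks the helper with probability 0.5 under the null hypothesis, and that "at least as extreme" means 14 or more helper choices (a one-sided count). The names `simulate_study`, `null_counts`, and the seed are made up for this example.

```python
import random
from collections import Counter
from math import comb

N_INFANTS = 16        # sample size reported in the study
OBSERVED = 14         # infants who selected the helper
NULL_P = 0.5          # null model assumed here: each infant chooses at random
N_SIMULATIONS = 2000  # same number of simulated outcomes as Question #4

rng = random.Random(2007)  # arbitrary seed, used only for reproducibility

def simulate_study():
    """Count helper choices among 16 infants who each choose at random."""
    return sum(rng.random() < NULL_P for _ in range(N_INFANTS))

null_counts = [simulate_study() for _ in range(N_SIMULATIONS)]
tally = Counter(null_counts)

# Text version of the dot plot: one row per possible number of helper choices,
# with one star for every 10 simulated studies that produced that number
for k in range(N_INFANTS + 1):
    print(f"{k:2d} | {'*' * (tally[k] // 10)} ({tally[k]})")

# Simulated p-value: share of simulated studies with 14 or more helper choices
simulated_p = sum(c >= OBSERVED for c in null_counts) / N_SIMULATIONS

# Exact binomial tail; this shortcut is valid only because NULL_P is 0.5,
# which makes all 2**16 sequences of choices equally likely
exact_p = sum(comb(N_INFANTS, k) for k in range(OBSERVED, N_INFANTS + 1)) / 2**N_INFANTS

print("simulated p-value:", simulated_p)
print("exact p-value:", exact_p)
```

The simulated value changes a little from run to run when the seed changes, just as it does in StatKey, while the exact value does not. If you want a two-sided p-value instead, the symmetry of the null model at \(p=0.5\) means you can double the one-sided exact value.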