Directions
In this lab we will explore hypothesis testing using a couple of case studies. We will focus on using randomization to estimate the null distribution, a procedure sometimes called randomization testing.
Have you ever been waiting for a parking spot and felt like people take forever to exit their spot?
Psychologists Ruback and Juieng investigated this question in the research paper Territorial defense in parking lots: Retaliation against waiting drivers, which describes a series of studies investigating how various factors relate to how quickly someone leaves a public parking space.
In the first of these studies, Ruback and Juieng observed 200 drivers departing from a public parking lot. For each departing driver they recorded the time (in seconds) between when each driver first entered their car and when they exited their parking space. Additionally, they recorded whether another car was waiting for the space while the driver got into their car and exited their space.
The Parking Dataset contains the results of this study.
Question #1
Construct a graph that compares the distribution of exit time when another vehicle is waiting and when another vehicle is not waiting. Do these distributions appear skewed? Is there an association between leaving time and the presence of a waiting vehicle?
In this study, researchers wanted to determine whether drivers exited faster when another car was waiting for their spot.
To answer this, they might use statistical testing to evaluate whether the mean leaving time is the same for each group (where groups are defined by whether another car is waiting), or if the mean is lower when another car is waiting.
Question #2
Using proper statistical notation, what is the null hypothesis of this test? What is the alternative hypothesis? Is this a one-sided or two-sided test?
Question #3
Using proper statistical notation, what is your best estimate of the parameter specified in your null hypothesis in Question #2? Provide both the notation for your best estimate, and its actual numeric value.
Randomization testing simulates the data collection process in a world where the null hypothesis is true. The different estimates that arise from these simulations are used to construct the randomization distribution, which is an estimate of the null distribution. This allows us to understand the estimates that we’d expect to see had the null hypothesis been true.
If the observed estimate in the original study is deemed sufficiently rare (relative to the possible estimates shown in the null distribution), we declare the observed difference in our sample to be statistically significant.
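To make the simulation idea concrete, here is a minimal Python sketch of a randomization test for a difference in means. The exit times and waiting labels below are made-up numbers for illustration only (not the study's data); StatKey automates this same label-shuffling procedure for you.

```python
import random

# Hypothetical exit times (seconds) and whether a car was waiting.
# These values are invented for illustration -- not the study's data.
times = [42.0, 55.0, 38.0, 61.0, 47.0, 70.0, 52.0, 44.0, 66.0, 58.0]
waiting = [True, True, True, True, True, False, False, False, False, False]

def mean_diff(times, labels):
    """Statistic of interest: mean(waiting group) - mean(not-waiting group)."""
    g1 = [t for t, w in zip(times, labels) if w]
    g2 = [t for t, w in zip(times, labels) if not w]
    return sum(g1) / len(g1) - sum(g2) / len(g2)

random.seed(1)
observed = mean_diff(times, waiting)   # -9.4 with the numbers above

# Build the randomization distribution: shuffling the labels simulates a
# world where the label carries no information about exit time (H0 true).
rand_stats = []
for _ in range(5000):
    shuffled = random.sample(waiting, k=len(waiting))
    rand_stats.append(mean_diff(times, shuffled))
```

Each entry of `rand_stats` is one simulated estimate; together they form the randomization distribution against which `observed` is compared.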
StatKey is a tool that allows us to create randomization distributions for a few common situations. In the parking example, we are interested in performing a “randomization test for a difference in means”.
As we saw in the bootstrapping lab, you’ll need to make use of the “Edit Data” option. When you click on “Edit Data” you can see how your data needs to be formatted; once you recognize the correct format, you can simply copy-paste the correct columns from Minitab.
After you have the data loaded into StatKey, you can view it in the “Original Sample” panel. You should always check this panel to make sure the data were loaded in correctly (for example, make sure \(\bar{x}_1 - \bar{x}_2\) is the same as what you see using Minitab).
To simulate the data collection process under the null hypothesis, click on “Generate 1 Sample”.
Question #4:
When you click on “Generate 1 Sample”, what is plotted in the panel titled “Randomization Dotplot of \(\bar{x}_1 - \bar{x}_2\)”? Be very specific in your answer.
Depending on the type of parameter you are estimating, StatKey will simulate randomization samples differently. The way this happens is summarized below:
Question #5:
When you clicked on “Generate 1 Sample” in Lab Question #4, what was plotted in the panel titled “Randomization Sample”? Be specific.
Question #6:
Before you clicked on “Generate 1 Sample”, could you have known the total number of data-points in the randomization sample with waiting times longer than 70 seconds? Could you have known which groups these data-points would belong to? Briefly explain.
To get an accurate assessment of how rare our observed sample is (if the null hypothesis were true), we need to compare it with a large number of randomization samples. My general advice is to generate randomization samples until the standard error of the randomization distribution stays roughly constant as you generate additional samples. For most applications, this takes a few thousand randomization samples.
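As a rough illustration of this “generate until the standard error stabilizes” advice, the sketch below tracks the standard error of a growing collection of simulated randomization statistics. The statistics here are stand-in random draws, not real label shuffles:

```python
import random
import statistics

random.seed(2)

# Stand-in for one randomization statistic (e.g., a shuffled mean difference).
# A normal draw is used purely for illustration.
def one_randomization_stat():
    return random.gauss(0, 5)

stats_so_far = []
for n in (100, 500, 1000, 2000, 5000):
    while len(stats_so_far) < n:
        stats_so_far.append(one_randomization_stat())
    se = statistics.stdev(stats_so_far)
    print(f"after {n:>4} samples, SE of randomization distribution = {se:.2f}")
```

Once the printed standard errors stop changing much from row to row, generating additional samples yields diminishing returns.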
Question #6
Reset your randomization plot and generate 1,000 randomization samples. Include your plot in your lab write-up and answer the following questions: How many dots are there and what do they represent? Why are most of these dots close to zero?
Once you’ve constructed the randomization distribution, the next step in the hypothesis test is to determine how rare/unexpected the observed estimate would be had the null hypothesis been true. This can be done by checking the appropriate “Left Tail” or “Right Tail” box. In this step, you need to be aware of the direction of your test, because there are two different one-sided tests.
Once you’ve selected the correct tail, you can click the box on the x-axis of the randomization dotplot to input the estimate from your original sample. StatKey will tell you the proportion of randomization samples that are at least as extreme as the value you entered. This proportion is the p-value, which provides the information you need to make a ruling on whether or not you believe the null hypothesis.
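In code, the p-value StatKey reports is just a proportion. Here is a sketch using a made-up set of randomization statistics and a made-up observed estimate:

```python
# Hypothetical randomization statistics and observed estimate (illustration
# only). Suppose the alternative is H1: mu1 - mu2 < 0, so "at least as
# extreme" means less than or equal to the observed value.
rand_stats = [-3.1, -1.2, 0.4, 2.2, -9.8, 5.0, -0.7, 1.9, -4.4, 0.1]
observed = -9.4   # made-up observed difference in means

p_value = sum(1 for s in rand_stats if s <= observed) / len(rand_stats)
print(p_value)   # 0.1 here: only -9.8 is as extreme as -9.4
```

A real analysis would use thousands of randomization statistics rather than ten, but the computation is the same.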
The plot below corresponds with the test of \(H_0: \mu_1 - \mu_2 = 0\) vs \(H_1: \mu_1 - \mu_2 > 0\). Notice here that \(\bar{x}_1\) is the sample mean of the “Smile” group:
The plot below corresponds with the test of \(H_0: \mu_1 - \mu_2 = 0\) vs \(H_1: \mu_1 - \mu_2 < 0\). Notice the impact of specifying the wrong direction:
Question #7:
Using your randomization plot constructed in Lab Question #6, formally conduct a test of the hypothesis that drivers leave faster when another car is waiting. Report your null and alternative hypotheses (recall that you already stated them earlier), your p-value, and a one-sentence conclusion.
One-sided hypothesis tests can be risky: if you specify the direction incorrectly, you can completely miss out on an interesting discovery. This might sound like a minor inconvenience; couldn’t we just switch hypotheses after seeing the data?
The answer is “no”. In its purest sense, statistical testing is only meaningful when a hypothesis is specified a priori (ahead of time). Important properties of statistical testing will not hold if you form your hypotheses around patterns you’ve already seen in the data.
Post hoc testing (and hypothesis formation) is suspected to be a contributing factor in the reproducibility crisis facing many areas of scientific research. Practically speaking, one-sided tests are almost never used: not only do they look suspicious (as if you peeked at your data ahead of time), they also risk missing important findings.
Two-sided tests are a little trickier to perform in StatKey due to the ambiguity regarding how to determine values “at least as unexpected” in two opposite directions. One possible definition, which you can find by clicking “Two-Tail”, is to double the p-value from the correctly specified one-sided test. An example of this is shown below; the two-sided p-value in this example is 0.058:
An alternate, but equally valid, approach is to interpret “at least as unexpected” in terms of distance from the null value. This often requires separate specification of the right and left tail cutoffs; notice that the two-sided p-value here is 0.057:
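The two definitions can be compared directly with a small sketch; all values below are hypothetical:

```python
# Hypothetical randomization statistics and observed estimate; the null
# value is 0. These numbers are invented for illustration.
rand_stats = [-3.1, -1.2, 0.4, 2.2, -9.8, 5.0, -0.7, 1.9, -4.4, 0.1]
observed = -4.0

# Definition 1: double the p-value from the correctly specified one-sided
# test (here the observed estimate falls in the left tail).
one_sided = sum(1 for s in rand_stats if s <= observed) / len(rand_stats)
p_doubled = min(1.0, 2 * one_sided)

# Definition 2: count randomization statistics at least as far from the
# null value (0) as the observed estimate, in either direction.
p_distance = sum(1 for s in rand_stats if abs(s) >= abs(observed)) / len(rand_stats)

print(p_doubled, p_distance)   # the two definitions can disagree slightly
```

With these made-up numbers, doubling gives 0.4 while the distance definition gives 0.3, mirroring the small discrepancy (0.058 vs 0.057) seen in the StatKey plots.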
Question #8:
Using your randomization plot constructed in Lab Question #6, conduct a two-sided hypothesis test. Include a screenshot of your randomization distribution, report your p-value, and make a one-sentence conclusion addressing the original research question.
To end this section of the lab, we should briefly comment on why StatKey generates its randomization samples in the ways that it does. Generally speaking, randomization samples are created such that the following are satisfied:
Question #9:
In randomization testing for a single mean, StatKey re-samples the shifted data points with replacement. Why is replacement necessary? Explain in 1-2 sentences.
Question #10:
For randomization testing of a difference in means (or a difference in proportions), StatKey reallocates (shuffles) the group labels. Briefly explain why this approach satisfies both 1) and 2) stated prior to Question 9.
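For reference, the mechanics of label reallocation can be sketched in a few lines; the outcomes and labels below are made up for illustration (this shows only what shuffling does, not why it works, which Question #10 asks you to explain):

```python
import random

random.seed(3)
# Hypothetical pooled outcomes (1 = concussion) and group labels; these
# are invented values, not the lab's data.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
labels = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]

# Reallocating (shuffling) the labels leaves every outcome value in the
# data set and keeps the original group sizes; only the pairing of label
# to outcome changes.
shuffled = random.sample(labels, k=len(labels))

assert sorted(shuffled) == sorted(labels)   # same labels, same group sizes
```

Each shuffle produces one randomization sample, from which a difference in proportions can be recomputed.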
You can find the data “Concussions” at this link. These data originate from a study published in the Journal of Athletic Training in 2003 by Covassin, Swanik, and Sachs titled “Sex Differences and the Incidence of Concussions Among Collegiate Athletes”. In the study, the authors used data from the NCAA Injury Surveillance System (ISS), a voluntary injury reporting system used by athletic trainers at colleges across the United States. The NCAA ISS is considered to be a representative sample of all US colleges. The data we’ll analyze contain the following variables:
Question #11:
Use a Minitab formula to create a new column displaying the proportion of concussions for each sex, sport, and year combination. By inspecting, graphing, or summarizing this column, which sport appears to lead to the highest proportion of concussions?
Question #12:
Within the “Stat -> Descriptive Statistics” menu, click on the button titled “Statistics” and select the checkbox for “Sum” (you can also uncheck the other boxes if you want to de-clutter the output). Then, on the main menu, enter the variables “Concussion” and “No Concussion” in the “Variables” panel and include “Sex” as a by variable. This will provide the total number of concussions and non-concussions for each sex. With this information, use a randomization test to answer the question: “do a higher proportion of female athletes (in these sports) sustain concussions?” Use only 1,000 randomization samples to avoid crashing StatKey. Clearly state your null and alternative hypotheses using proper statistical notation.
Question #13:
Return to the “Stat -> Descriptive Statistics” menu and this time include both “Sex” and “Sport” as by variables. With this information, report the difference in proportions (female minus male) separately for each sport. Do these results seem consistent with your findings in Question 12? Do you think that sport might be a confounding variable? (Hint: You don’t need to do any statistical tests to answer this question)
Question #14:
Using an approach similar to those described in previous questions, perform a randomization test to determine whether the proportion of concussions in 1999 differs from the proportion of concussions in 1997. Use only 1,000 randomization samples to avoid crashing StatKey. Clearly state your null and alternative hypotheses using proper statistical notation.
Question #15:
Critics might point out that the proportions in these data are very small, and that we shouldn’t be worried about the male/female and year-to-year differences in concussions that we analyzed. However, in situations involving rare events it is common for researchers to look at ratios of proportions (a measure called relative risk) rather than differences in proportions. What is the female/male relative risk of concussion based on these data? Also, had a randomization test been done on the relative risk, what would the null hypothesis be?
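For reference, relative risk is computed as a ratio of two proportions; the counts in this sketch are hypothetical, not values from the concussion data:

```python
# Hypothetical counts (illustration only, not the lab's data):
female_concussions, female_exposures = 30, 2000
male_concussions, male_exposures = 20, 2500

p_female = female_concussions / female_exposures   # 0.015
p_male = male_concussions / male_exposures         # 0.008

# Relative risk: the ratio of the two proportions.
relative_risk = p_female / p_male
print(relative_risk)   # ~1.875: the female rate is about 1.9x the male rate
```

Note how a tiny difference in proportions (0.007) can correspond to a large relative risk, which is why ratios are often preferred for rare events.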
Question #16:
For this question I’d like you and your group to use the data from either case study presented in this lab to form and test a hypothesis of your choosing. You should include: