MATH-146 Lab #4 - Hypothesis Testing

Goals:

The purpose of this lab is to provide practice applying hypothesis tests (via randomization) to a wide variety of scenarios.

Directions:

You are expected to progress through the analyses described in this document as a group, recording your answers in a shared document. It’s completely up to your group how you’d like to organize this - some groups like using a shared Google Doc, while others might designate one person to be the group’s recorder.
You are expected to work together; any attempt to “divide and conquer” the lab questions may result in point deductions on your group’s lab score.
Labs are graded primarily for completion, and we will get together as a group for the last 10-15 minutes of class to discuss some of the lab questions. This means you should focus on learning the material (while also helping the teammates in your group) rather than seeing labs as an assessment (like homework or exams).
Please upload your responses to the Lab’s questions on Canvas. The expectation is that everyone uploads their own copy (they can be identical within your group).
Use the snipping tool on Windows or take a Mac screenshot to add screenshots to your lab write-up as requested.

$~$

Study #1 - Oatbran and LDL cholesterol

In an investigation of whether oatbran cereal might be effective in reducing LDL cholesterol, researchers randomly assigned 14 adult males with high cholesterol into two groups:

The first group followed a diet involving daily consumption of corn flakes cereal for two weeks, then had a one-week washout period, and then followed a diet involving daily consumption of oatbran cereal for two more weeks.
The second group followed a similar protocol, but consumed oatbran cereal during the first two-week diet and corn flakes cereal during the second two-week diet.

We’ll analyze each subject’s difference in LDL cholesterol when they were on the oatbran diet relative to when they were on the corn flakes diet. This outcome is recorded as the variable “difference” in the dataset linked below. You should recognize that a positive value of “difference” indicates a reduction in LDL cholesterol on the oatbran diet.

Click Here to download the data from this study.

$~$

Orientation

Question #1: Briefly describe one population that the researchers can reasonably generalize the results of this study to. Additionally, briefly describe another population that the researchers should avoid generalizing the results of this study to.

Question #2: Is this a randomized experiment or an observational study? With that in mind, how concerned are you about the study’s outcome being influenced by bias or confounding variables? You should respond in 2-3 sentences.

$~$

Statistical analysis

Question #3: These researchers wanted to determine whether the oatbran diet was capable of reducing LDL cholesterol levels as measured by the variable “difference”. With that in mind, state the null hypothesis the researchers should evaluate. Be sure to define (in words) any population parameters you include in the null hypothesis (i.e., define the meaning of $\mu$, $p$, etc.).

Question #4: Use StatKey to generate a null distribution for the variable “difference” under the null hypothesis you stated in Question #3. Then, use this distribution to provide an estimate of the $p$-value for this study (either one-sided or two-sided is acceptable).
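For intuition about what a randomization test for a single mean does behind the scenes, here is a minimal Python sketch. It assumes the common approach of shifting the sample so its mean matches the null value and then resampling with replacement; StatKey's exact procedure may differ in its details. The function names, the placeholder values, and the null value of 10 are all made up for illustration and have nothing to do with the oatbran data or the answer to Question #3.

```python
import random
from statistics import mean


def null_distribution_single_mean(sample, null_mean, n_resamples=5000, seed=146):
    """Simulate sample means that could arise if H0: mu = null_mean were true."""
    rng = random.Random(seed)
    observed = mean(sample)
    # Shift the data so its mean equals the null value while keeping its spread.
    shifted = [x - observed + null_mean for x in sample]
    n = len(sample)
    # Each simulated statistic is the mean of n values drawn with replacement.
    return [mean(rng.choices(shifted, k=n)) for _ in range(n_resamples)]


def one_sided_p_value(null_means, observed_mean):
    """Proportion of simulated means at least as large as the observed mean."""
    extreme = sum(m >= observed_mean for m in null_means)
    return extreme / len(null_means)


# Hypothetical placeholder values, NOT the data from this study.
placeholder = [10.4, 9.9, 11.2, 10.8, 9.5, 10.9, 11.1, 10.3]
null_means = null_distribution_single_mean(placeholder, null_mean=10.0)
p_value = one_sided_p_value(null_means, mean(placeholder))
print(f"estimated one-sided p-value: {p_value:.3f}")
```

Because the resampling is random, the estimated $p$-value changes slightly with the seed and the number of resamples, just as it does from one StatKey run to the next.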

Question #5: At the $\alpha = 0.05$ significance threshold, what do you conclude from the data observed in this study? Write an appropriate 1-2 sentence conclusion. (Hint: any acceptable conclusion needs to say something about oatbran)

$~$

Study #2 - the American Community Survey

The American Community Survey (ACS) is a component of the US Census that is administered to a random sample of US addresses on a rolling basis. When the mailed version is combined with in-person visits and telephone calls, the survey has a 95% response rate. The data linked below are a random sample of employed individuals drawn from a recent ACS:

ACS data link

The ACS data linked above includes the following variables:

Sex - “1” for males and “0” for females
Age - age in years
Married - “1” for married individuals and “0” for unmarried individuals
Income - annual income (thousands of dollars)
HoursWk - average hours worked per week
Race - self-described race
USCitizen - citizenship status, “1” for US citizens and “0” for non-citizens
HealthInsurance - “1” if the individual has health insurance, “0” otherwise
Language - “1” if the individual’s first/native language is English, “0” otherwise

Important: Questions #6 - #10 each ask you to perform a different hypothesis test. Your answer to each should include the following components: a clear statement of the null and alternative hypotheses, the corresponding sample estimate, a 1-sided or 2-sided $p$-value found using StatKey, and a 1-sentence conclusion describing the results and implications of the test.

Question #6: Perform a hypothesis test to determine whether these data provide compelling statistical evidence that married individuals are more likely to have health insurance than unmarried individuals.

Question #7: Perform a hypothesis test to determine whether these data provide compelling statistical evidence that males and females differ in the average number of hours they work each week.
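The two-group comparisons in Questions #6 and #7 rely on a null distribution built by reshuffling group labels. The Python sketch below illustrates that idea under the assumption that the null hypothesis is "no difference between groups"; it is not a description of StatKey's internals. The function names and the placeholder values are hypothetical and are not drawn from the ACS data. Because the mean of a 0/1 variable is a proportion, the same reshuffling also covers a difference in proportions.

```python
import random
from statistics import mean


def permutation_null_diff_means(group_a, group_b, n_resamples=5000, seed=146):
    """Simulate differences in group means when group labels are arbitrary."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(n_resamples):
        # Reshuffle all values, then split them back into groups of the original sizes.
        rng.shuffle(pooled)
        diffs.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    return diffs


def two_sided_p_value(null_diffs, observed_diff):
    """Proportion of simulated differences at least as far from 0 as the observed one."""
    extreme = sum(abs(d) >= abs(observed_diff) for d in null_diffs)
    return extreme / len(null_diffs)


# Hypothetical placeholder values, NOT the ACS data.
group_a = [40, 45, 38, 50, 42, 36]
group_b = [35, 40, 30, 44, 38, 32]
observed_diff = mean(group_a) - mean(group_b)
null_diffs = permutation_null_diff_means(group_a, group_b)
p_value = two_sided_p_value(null_diffs, observed_diff)
print(f"estimated two-sided p-value: {p_value:.3f}")
```

For a one-sided test, count only the simulated differences in the direction stated by the alternative hypothesis.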

Question #8: According to Wikipedia, 78% of American adults speak English as their primary language. Use a hypothesis test to determine whether these data provide sufficient evidence to refute Wikipedia’s claim.

Question #9: Perform a hypothesis test to determine whether these data provide compelling statistical evidence that higher salaries are associated with more hours worked.

Question #10: According to the website “worlddata.info” (which seems like it might be a questionable source), the average personal income in the United States is $64,000. Use a hypothesis test to determine whether these data provide sufficient evidence to refute this claim.

Question #11: Notice that Questions #6-10 each required you to perform a different hypothesis test using the same set of data. If you wanted to use the Bonferroni Adjustment to ensure no more than a 5% Type 1 error rate for this family of tests, what threshold should be used for statistical significance?
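As a reminder, the Bonferroni Adjustment divides the desired family-wise Type 1 error rate $\alpha$ by the number of tests $m$ in the family. The values $\alpha = 0.10$ and $m = 4$ below are made up for illustration and are not the answer to Question #11:

$$
\alpha_{\text{adj}} = \frac{\alpha}{m}, \qquad \text{for example } \frac{0.10}{4} = 0.025
$$

Each individual test is then declared statistically significant only if its $p$-value falls below $\alpha_{\text{adj}}$.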

Question #12: If you use the adjusted significance threshold from Question #11, would your chances of making a Type 2 error (on each hypothesis test) increase, decrease, or remain the same? Briefly explain.