Directions:
- Submit your assignment via P-web.
- Submit only a compiled R Markdown document (pdf, word, or html
output are all okay, but you may need to “zip” an html file).
- If you want to compile to a pdf you can install the tinytex package by running
install.packages('tinytex')
followed by
tinytex::install_tinytex()
- Only submit your .Rmd file if you are unable to compile it due to
errors (in the future you will be penalized for this)
Question #1
In a previous assignment you worked with the American Community
Survey (ACS) data, which are a component of the US Census administered
to a random sample of US addresses on a rolling basis. When the mailed
version is combined with in-person visits and telephone calls the survey
has a 95% response rate. The data linked below are a random sample of
employed individuals drawn from a recent ACS (2020 Census):
acs = read.csv("https://remiller1450.github.io/data/EmployedACS.csv")
The ACS data linked above includes the following variables:
- Sex - “1” for males and “0” for females
- Age - age in years
- Married - “1” for married individuals and “0” for unmarried
individuals
- Income - annual income (thousands of dollars)
- HoursWk - average hours worked per week
- Race - self-described race
- USCitizen - citizenship status, “1” for US citizens and “0” for
non-citizens
- HealthInsurance - “1” if the individual has health insurance, “0”
otherwise
- Language - “1” if the individual’s first/native language is English,
“0” otherwise
In this question you will perform several hypothesis tests. For each
hypothesis test (Parts A-C) you should include the following steps:
- State the null and alternative hypotheses using either words or
statistical notation.
- Use either StatKey or an appropriate R function to find
the \(p\)-value.
- Provide a one-sentence conclusion summarizing the results of your
hypothesis test. Be sure to follow the guidelines from this week’s
lecture slides.
- Part A: According
to Wikipedia, 78% of US adults speak English as their native
language. Use a hypothesis test to determine whether the ACS sample
provides evidence that refutes this claim.
- Part B: Perform a hypothesis test to evaluate
whether these data provide compelling statistical evidence that married
individuals are more likely to have health insurance than unmarried
individuals.
- Part C: According to “worlddata.info” (which seems
like a questionable source to me) the average personal income in the
United States is $64,000. Use a hypothesis test to determine whether the
ACS sample provides sufficient evidence to refute this claim.
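The three steps listed above (hypotheses, \(p\)-value, conclusion) can be sketched in R on made-up numbers. Everything in this sketch is hypothetical: the counts `x_success` and `n_trials`, the simulated vector `y`, and the null values 0.5 and 10 are illustrations only and are not taken from the ACS data. `prop.test()` and `t.test()` are base R functions; StatKey is an equally valid route.

```r
# Hypothetical example, not the ACS data.

# One-sample test of a proportion: H0: p = 0.5 vs. HA: p != 0.5
x_success <- 60    # made-up number of "successes"
n_trials  <- 100   # made-up sample size
prop_result <- prop.test(x = x_success, n = n_trials, p = 0.5,
                         alternative = "two.sided")
prop_result$p.value

# One-sample test of a mean: H0: mu = 10 vs. HA: mu != 10
set.seed(1)
y <- rnorm(30, mean = 11, sd = 3)   # simulated measurements
t_result <- t.test(y, mu = 10, alternative = "two.sided")
t_result$p.value
```

For a one-sided alternative, both functions accept `alternative = "greater"` or `alternative = "less"`. The direction should match the alternative hypothesis you wrote down before looking at the data.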
Question #2
Rosiglitazone is the active ingredient in the controversial type 2
diabetes medicine Avandia and has been linked to an increased risk of
serious cardiovascular problems such as stroke, heart failure, and
death. A common alternative treatment is Pioglitazone, the active
ingredient in a diabetes medicine called Actos. In a nationwide
retrospective observational study of 227,571 Medicare beneficiaries aged
65 years or older, it was found that 2,593 of the 67,593 patients using
Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had serious
cardiovascular problems. These data are summarized in the contingency
table below.
| Treatment | No CV problems | CV problems | Total |
|---|---|---|---|
| Pioglitazone | 154592 | 5386 | 159978 |
| Rosiglitazone | 65000 | 2593 | 67593 |
| Total | 219592 | 7979 | 227571 |
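One way to work with these counts in R is to enter the table as a matrix. The sketch below uses only the counts from the table above; the object names `cv_table` and `row_props` are assumptions chosen for illustration.

```r
# Counts copied from the contingency table above
cv_table <- matrix(c(154592, 5386,
                     65000,  2593),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(
                     Treatment = c("Pioglitazone", "Rosiglitazone"),
                     Outcome   = c("No CV problems", "CV problems")))

# Add row and column totals to confirm they match the table
addmargins(cv_table)

# Row proportions: share of each treatment group with and without CV problems
row_props <- prop.table(cv_table, margin = 1)
round(row_props, 3)
```

Using `margin = 1` conditions on treatment (rows). Using `margin = 2` would condition on outcome (columns) instead, which answers a different question.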
- Part A: Below are several statements about this
study that are either true or false. Identify which
statements are false and briefly explain why each is false.
Note that some statements might reach the correct conclusion using
incorrect reasoning. In these circumstances the entire statement should
be considered false.
- I: Since more than 50% of patients with cardiovascular problems were
on Pioglitazone we can conclude that the risk of cardiovascular problems
is higher for Pioglitazone than it is for Rosiglitazone.
- II: The data suggest that diabetic patients who are taking
Rosiglitazone appear to be more likely to have cardiovascular problems (3.8%
risk) than patients on Pioglitazone (3.4% risk).
- III: A statistical test is needed to evaluate whether the observed
difference in the risk of cardiovascular problems is due to the
medication or could be explained by chance.
- IV: If a statistical test performed on these data yielded a \(p\)-value less than 0.01 we can conclude
that Rosiglitazone causes an increase in the rate of
cardiovascular problems relative to Pioglitazone.
- Part B: If the type of treatment and having
cardiovascular problems were independent, how many patients in the
Rosiglitazone group would we expect to have had cardiovascular
problems?
- Part C: Using both words and statistical
symbols, write the null and alternative hypotheses for a statistical
test that investigates whether there is an association between type of
treatment and the risk of cardiovascular problems.
- Part D: Use prop.test() to perform a
statistical test of the hypotheses you provided in Part C. Report the
\(p\)-value of the test.
- Part E: Provide a one-sentence conclusion
summarizing the results of the test you performed in Part D. Be careful
to follow the guidelines covered in this week’s lecture slides.
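As a reminder of the calling convention, here is a minimal sketch of a two-group `prop.test()` on made-up counts. The vectors `events` and `totals` are hypothetical and are not the Medicare data. `prop.test()` accepts either a vector of event counts with a vector of group sizes, or a two-column matrix of events and non-events.

```r
# Hypothetical counts, not the Medicare data
events <- c(30, 45)     # made-up number of events in groups 1 and 2
totals <- c(200, 210)   # made-up group sizes

# H0: p1 = p2 vs. HA: p1 != p2
two_group <- prop.test(x = events, n = totals, alternative = "two.sided")
two_group$estimate   # sample proportion in each group
two_group$p.value

# Same test from a two-column matrix of (events, non-events)
counts <- cbind(events, totals - events)
from_matrix <- prop.test(counts)
all.equal(two_group$p.value, from_matrix$p.value)
```

Both calls run the same test, so the choice between them is a matter of which form your data already has.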
Question #3
In modern biomedical studies it is relatively common to record
measurements for thousands of genetic features at once. Suppose a cancer
researcher collects data on 2000 genes, and 20 of these genes are truly
related to the cancer that the researcher is studying. For each of the
2000 genes, the researcher performs a hypothesis test and compares the
\(p\)-value to a decision threshold of
\(\alpha = 0.05\).
- Part A: Out of all of the hypothesis tests
performed by the researcher, how many Type 1 errors do you expect?
Hint: Remember that Type 1 errors can only occur when the null
hypothesis is correct.
- Part B: Suppose the researcher’s statistical tests
have 80% power, meaning they will correctly reject a false null
hypothesis 80% of the time. Considering this information, how many Type
2 errors do you expect out of all of the hypothesis tests performed by
this researcher? Hint: Remember that Type 2 errors can only
occur when the null hypothesis is incorrect.
- Part C: Suppose the researcher identifies 109
statistically significant genes using \(\alpha = 0.05\). What do you estimate the false discovery rate
to be for this study? Hint: You should use the number of Type 1
errors from Part A in your calculation.
- Part D: Consider a procedure that controls the
false discovery rate at 5% and one that controls the family-wise Type 1
error rate at 5%. Which procedure do you expect to identify more of the
20 genes that are truly related to this cancer as being “significant”?
Briefly explain your answer.
- Part E: Suppose a 95% confidence interval is
calculated for the association between each gene and cancer status. How
many of these intervals would you expect to contain the true association
for their corresponding gene?
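The quantities in this question (Type 1 errors, Type 2 errors, the false discovery rate, and adjusted decision rules) can be explored with a small simulation. The sketch below is illustrative only: the number of tests (`m_tests`), the number of real effects (`m_true`), and the effect size of 3 are made-up values that differ from the scenario above, and `p.adjust()` is the base R function for multiple-testing adjustments.

```r
# Hypothetical simulation, not the scenario described in the question
set.seed(42)
m_tests <- 1000   # made-up number of hypothesis tests
m_true  <- 50     # made-up number of real effects
is_real <- rep(c(TRUE, FALSE), times = c(m_true, m_tests - m_true))

# z-statistics: centered at 0 under the null, shifted to 3 for real effects
z <- rnorm(m_tests, mean = ifelse(is_real, 3, 0))
p_raw <- 2 * pnorm(-abs(z))   # two-sided p-values

# Unadjusted decisions at alpha = 0.05
reject_raw   <- p_raw < 0.05
type1_count  <- sum(reject_raw & !is_real)   # false positives
type2_count  <- sum(!reject_raw & is_real)   # missed real effects
fdr_observed <- type1_count / max(sum(reject_raw), 1)

# Adjusted decisions: family-wise error control vs. false discovery rate control
reject_bonf <- p.adjust(p_raw, method = "bonferroni") < 0.05
reject_bh   <- p.adjust(p_raw, method = "BH") < 0.05

c(raw = sum(reject_raw), bonferroni = sum(reject_bonf), BH = sum(reject_bh))
```

Changing `m_tests`, `m_true`, or the seed lets you see how the error counts move, which is a useful check on hand calculations.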