Directions:
- Submit your work via the “Assignments” tab on Canvas
- For this assignment you should record your answers/code using
R Markdown
- Please upload HTML, Word, or PDF output created using R Markdown and
make sure it contains your code, output, and written answers. You should
not include extraneous output, such as printing an entire data
frame.
- At this point in the course you are responsible for knowing how to
properly knit an R Markdown document, so uploading any other file format
will result in a point deduction on the assignment
- Homework is an individual assignment. It’s okay to check
your work or collaborate with your classmates, student mentors, and
others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you get from individuals other
than yourself, or resources other than the materials on our course
website (such as external websites and AI)
Question #1
For this question you should use the “diet” data set provided below.
These data come from a randomized experiment seeking to compare the
efficacy of three different weight loss diets. A subject’s assigned diet
is encoded as either 1, 2, or 3, and is recorded in the variable
Diet. The provided code coerces this variable to a factor
(categorical variable).
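# Read the diet data from the course website
diet_data = read.csv("https://remiller1450.github.io/data/diet.csv")
# Coerce Diet (coded 1, 2, 3) to a factor so R treats it as categorical
diet_data$Diet = factor(diet_data$Diet)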
- Part A: These data contain two variables that were
recorded at the end of the experiment: postWeight (final body weight at
the end of the experiment) and weightChange (change in body weight from
the start to the end of the experiment).
Which of these variables is the better outcome for the researchers to
focus on? Briefly explain your reasoning.
- Part B: Create an appropriate data visualization
showing the relationship between the explanatory variable
Diet and the outcome variable weightChange.
Based upon this visualization, does there appear to be an association
between diet and weight change?
- Part C: Use group_by() and summarize() from the dplyr library to find the
mean and standard deviation of the variable weightChange in
each assigned group.
- Part D: Consider using one-way ANOVA to analyze the
relationship between diet and weight change. Participant #66 was a
41-year-old female weighing 76 kg at the start of the study who was
assigned to Diet #3 and experienced a weight change of -5.0 kg. What is
this participant’s residual under the null model? What is this
participant’s residual under the alternative model? Show your
calculations of each.
- Part E: Use one-way ANOVA to evaluate the
relationship between diet and weight change. Provide a one-sentence
conclusion that includes appropriate context and cites the \(p\)-value. You do not need to check the
assumptions of your ANOVA yet.
- Part F: Using the ANOVA table resulting from the
test you performed in Part E, report the sums of squared residuals for
both the null and alternative models. Would you describe it as
likely or unlikely for these sums of squared residuals to be this
different if diet and weight change were independent? Briefly explain
your reasoning.
- Part G: Evaluate the two primary assumptions of the
one-way ANOVA you performed in Part E using either graphs, descriptive
statistics, or both. Briefly explain whether you believe these
assumptions are reasonable or not.
- Part H: Perform post-hoc pairwise testing to
determine which pairs of diets produced statistically significant
differences in weight change.
- Part I: Now use one-way ANOVA to evaluate the
relationship between the variables Diet and postWeight. Provide a
one-sentence conclusion that includes
appropriate context and cites the \(p\)-value. You do not need to check the
assumptions of this ANOVA test.
- Part J: Without actually performing any tests,
indicate whether you think it would be valuable to perform post-hoc
testing to expand upon the results of the one-way ANOVA you performed in
Part I. Briefly explain why you believe post-hoc testing would or would
not be worthwhile in this situation.
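As a refresher on the residual and sum-of-squares ideas behind Parts D through F, here is a minimal sketch in base R using a small made-up data set. The data frame `toy` and its columns `group` and `y` are hypothetical and are not the diet data. The sketch assumes the null model predicts every observation with the overall mean, while the alternative model predicts each observation with its own group mean.

```r
# Hypothetical toy data: three groups with three observations each
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  y = c(1, 2, 3, 2, 4, 6, 5, 7, 9)
)

# Null model: every observation is predicted by the overall (grand) mean
grand_mean <- mean(toy$y)
resid_null <- toy$y - grand_mean

# Alternative model: each observation is predicted by its group mean
group_means <- ave(toy$y, toy$group)  # group mean repeated for each row
resid_alt <- toy$y - group_means

# Sums of squared residuals under each model
sse_null <- sum(resid_null^2)  # 56 for this toy data
sse_alt <- sum(resid_alt^2)    # 18 for this toy data

# The one-way ANOVA table reports the same quantities:
# the "Residuals" row is sse_alt, and both rows together add up to sse_null
fit <- aov(y ~ group, data = toy)
anova_table <- summary(fit)[[1]]
print(anova_table)
```

The same pattern should carry over to any outcome and grouping variable, but how you present the calculation for a single participant is up to you.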
Question #2
In an experiment conducted in 1982, researchers had 12 subjects
perform a visual motor task where they had to steer a pencil along a
moving track. Each subject was tested on two different tracks: a
straight track and an oscillating one. The researchers were not
interested in how well the participants kept their pencil on the track,
but rather their blink rate (measured in blinks per minute) under each
condition. The columns Straight and
Oscillating record each subject’s blink rate in each of
these conditions.
# Read the blink-rate data from the course website
blink = read.csv("https://remiller1450.github.io/data/blink.csv")
- Part A: This study uses a paired design (see our two-sample
hypothesis testing slides for a refresher). Create a new variable
that records the difference in blink rate for each subject
(Straight minus Oscillating) and display the
distribution of these differences using a histogram with 13 bins.
- Part B: Consider the assumptions of the paired
(one-sample) \(T\)-test and the
characteristics of these data. Why would it be inappropriate to use a
\(T\)-test to evaluate whether blink
rate is associated with the type of track?
- Part C: Apply a log-2 transformation to the
variable you created in Part A to create a new outcome
log_difference. Create a histogram using 13 bins to confirm
that the data seem to be reasonably Normally distributed.
- Part D: Consider two hypothetical subjects, one
whose value of log_difference is 4 and another whose value
of log_difference is 1. In terms of the original units
(before the log transformation) how many times larger was the blink
difference for the first subject compared to the second?
- Part E: Perform a one-sample \(T\)-test on the log_difference
variable you created in Part C and provide a one-sentence conclusion
that includes context, the \(p\)-value,
and a statement about how blink rates compare across experimental
conditions.
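To illustrate how a log-2 transformation of paired differences behaves, here is a minimal sketch in base R on made-up values. The vectors `straight` and `oscillating` below are hypothetical and are not the blink data. It assumes every difference is positive, since `log2()` is undefined for zero or negative values.

```r
# Hypothetical paired measurements for six subjects (not the blink data)
straight    <- c(20, 18, 34, 12, 9, 70)
oscillating <- c(12, 10, 18, 8, 8, 6)

# Paired difference for each subject
difference <- straight - oscillating   # 8, 8, 16, 4, 1, 64

# Log base 2 transformation (requires strictly positive differences)
log_difference <- log2(difference)     # 3, 3, 4, 2, 0, 6

# A gap of d units on the log-2 scale is a factor of 2^d on the original scale.
# Subjects 3 and 4 differ by 4 - 2 = 2 log units, so their raw differences
# differ by a factor of 2^2 = 4 (that is, 16 versus 4).
ratio <- 2^(log_difference[3] - log_difference[4])

# One-sample t-test of the log differences against a mean of zero
result <- t.test(log_difference, mu = 0)
print(result)
```

A mean of zero on the log-2 scale corresponds to a typical difference of \(2^0 = 1\) in the original units, which is worth keeping in mind when interpreting a test on the transformed variable.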
Question #3
In an educational study aimed at informing people of the problems
associated with multiple comparisons, Canadian researchers performed a
study looking at various reasons for hospitalization and their potential
associations with astrological signs. They performed a total of 24
different hypothesis tests.
- Part A: If the researchers compare the \(p\)-value of each test they performed to
\(\alpha = 0.05\), how many
statistically significant findings would you expect if in reality there
are no associations between astrological signs and reasons for
hospitalization?
- Part B: If the Bonferroni correction is applied to
keep the family-wise type I error rate at 5%, what is the new
significance threshold that \(p\)-values should be compared to?
- Part C: The study obtained two “significant”
findings (without applying any corrections): individuals born under Leo
had a higher probability of gastrointestinal hemorrhage (\(p = 0.0447\)),
while Sagittarians had a higher probability of humerus fracture
(\(p = 0.0123\)) compared to all other signs combined. Using the adjusted
threshold you found in Part B, is either of these findings
statistically significant in light of the multiple comparisons the
investigators performed?
- Part D: Suppose that instead of using the
Bonferroni correction, as was done in Parts B and C, you used a false
discovery rate control procedure to limit the false discovery rate to
5%. Is using this procedure more likely or less likely
to produce type I errors than the Bonferroni procedure?
- Part E: Suppose that some astrological signs
actually do predispose individuals to certain reasons for
hospitalization. Which procedure, the Bonferroni correction or false
discovery rate control, is more prone to type II errors?
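The mechanics of these two corrections can be sketched in base R with `p.adjust()`. The ten \(p\)-values below are made up for illustration and are not from the astrology study. The sketch uses the Benjamini-Hochberg method (`"BH"`) as one example of a false discovery rate control procedure, which is an assumption, since the question does not name a specific procedure.

```r
# Hypothetical p-values from m = 10 tests (not the astrology study results)
p_values <- c(0.001, 0.004, 0.012, 0.03, 0.2, 0.5, 0.7, 0.8, 0.9, 0.95)
alpha <- 0.05
m <- length(p_values)

# If every null hypothesis were true, the expected number of p-values
# falling below alpha is m * alpha
expected_false_positives <- m * alpha

# Bonferroni: compare each raw p-value to alpha / m
bonferroni_threshold <- alpha / m

# Equivalent approach: adjust the p-values, then compare them to alpha
n_bonferroni <- sum(p.adjust(p_values, method = "bonferroni") <= alpha)

# Benjamini-Hochberg false discovery rate control
n_bh <- sum(p.adjust(p_values, method = "BH") <= alpha)

print(c(bonferroni = n_bonferroni, BH = n_bh))
```

On these made-up values the Bonferroni correction rejects two hypotheses and the Benjamini-Hochberg procedure rejects three.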