Sta-209 (Spring 2025) Homework #6

Directions:

Submit your work via the “Assignments” tab on Canvas.
For this assignment you should record your answers and code using R Markdown.
- Please upload HTML, Word, or PDF output created using R Markdown and make sure it contains your code, output, and written answers. You should not include extraneous output, such as printing an entire data frame.
- At this point in the course you are responsible for knowing how to properly knit an R Markdown document, so uploading any other file format will result in a point deduction on the assignment.
Homework is an individual assignment. It’s okay to check your work or collaborate with your classmates, student mentors, and others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you receive from other people, or from resources other than the materials on our course website (such as external websites and AI).

Question #1

For this question you should use the “diet” data set provided below. These data come from a randomized experiment seeking to compare the efficacy of three different weight loss diets. A subject’s assigned diet is encoded as either 1, 2, or 3, and is recorded in the variable Diet. The provided code coerces this variable to a factor (categorical variable).

diet_data = read.csv("https://remiller1450.github.io/data/diet.csv")
diet_data$Diet = factor(diet_data$Diet)

Part A: These data contain two variables that were recorded at the end of the experiment, postWeight (final body weight at the end of the experiment) and weightChange (change in body weight from the start to the end of the experiment). Which of these variables is the better outcome for the researchers to focus on? Briefly explain your reasoning.
Part B: Create an appropriate data visualization showing the relationship between the explanatory variable Diet and the outcome variable weightChange. Based upon this visualization, does there appear to be an association between diet and weight change?
Part C: Use group_by() and summarize() from the dplyr package to find the mean and standard deviation of the variable weightChange in each assigned group.
Part D: Consider using one-way ANOVA to analyze the relationship between diet and weight change. Participant #66 was a 41-year-old female weighing 76 kg at the start of the study who was assigned to Diet #3 and experienced a weight change of -5.0 kg. What is this participant’s residual under the null model? What is this participant’s residual under the alternative model? Show your calculations of each.
Part E: Use one-way ANOVA to evaluate the relationship between diet and weight change. Provide a one-sentence conclusion that includes appropriate context and cites the \(p\)-value. You do not need to check the assumptions of your ANOVA yet.
Part F: Using the ANOVA table resulting from the test you performed in Part E, report the sums of squared residuals for both the null and alternative models. Would you describe it as likely or unlikely for these sums of squared residuals to be this different if diet and weight change were independent? Briefly explain your reasoning.
Part G: Evaluate the two primary assumptions of the one-way ANOVA you performed in Part E using either graphs, descriptive statistics, or both. Briefly explain whether you believe these assumptions are reasonable or not.
Part H: Perform post-hoc pairwise testing to determine which pairs of diets produced statistically significant differences in weight change.
Part I: Now use one-way ANOVA to evaluate the relationship between the variables Diet and postWeight. Provide a one-sentence conclusion that includes appropriate context and cites the \(p\)-value. You do not need to check the assumptions of this ANOVA test.
Part J: Without actually performing any tests, indicate whether you think it would be valuable to perform post-hoc testing to expand upon the results of the one-way ANOVA you performed in Part I. Briefly explain why you believe post-hoc testing would or would not be worthwhile in this situation.
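
As a refresher on the mechanics behind Parts D through H, here is a minimal sketch using a small made-up data set rather than the diet data. The names toy, group, and y are hypothetical, and Tukey's HSD is shown only as one common choice of post-hoc procedure (pairwise.t.test() is another); use whichever approach we covered in class.

```r
# Toy data (hypothetical, NOT the diet data): three groups of four values
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y     = c(1, 2, 3, 2,  4, 5, 6, 5,  2, 3, 4, 3)
)

# Null model: a single overall mean predicts every observation
resid_null <- toy$y - mean(toy$y)

# Alternative model: each group's own mean predicts its observations
# (ave() returns the group mean repeated for every row in that group)
resid_alt <- toy$y - ave(toy$y, toy$group)

# Sums of squared residuals under each model
ss_null <- sum(resid_null^2)
ss_alt  <- sum(resid_alt^2)

# One-way ANOVA: the "Residuals" row of the table matches ss_alt, and the
# two "Sum Sq" entries add up to ss_null
fit <- aov(y ~ group, data = toy)
summary(fit)

# One common option for post-hoc pairwise comparisons
TukeyHSD(fit)
```

The point of computing resid_null and resid_alt by hand is to see where the numbers in the ANOVA table come from before relying on aov() to produce them.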


Question #2

In an experiment conducted in 1982, researchers had 12 subjects perform a visual motor task where they had to steer a pencil along a moving track. Each subject was tested on two different tracks: a straight track and an oscillating one. The researchers were not interested in how well the participants kept their pencil on the track, but rather their blink rate (measured in blinks per minute) under each condition. The columns Straight and Oscillating record each subject’s blink rate in each of these conditions.

blink = read.csv("https://remiller1450.github.io/data/blink.csv")

Part A: This study uses a paired design (see our two-sample hypothesis testing slides for a refresher). Create a new variable that records the difference in blink rate for each subject (Straight minus Oscillating) and display the distribution of these differences using a histogram with 13 bins.
Part B: Consider the assumptions of the paired (one-sample) \(T\)-test and the characteristics of these data. Why would it be inappropriate to use a \(T\)-test to evaluate whether blink rate is associated with the type of track?
Part C: Apply a log-2 transformation to the variable you created in Part A to create a new outcome log_difference. Create a histogram using 13 bins to confirm that the data seem to be reasonably Normally distributed.
Part D: Consider two hypothetical subjects, one whose value of log_difference is 4 and another whose value of log_difference is 1. In terms of the original units (before the log transformation), how many times larger was the blink difference for the first subject compared to the second?
Part E: Perform a one-sample \(T\)-test on the log_difference variable you created in Part C and provide a one-sentence conclusion that includes context, the \(p\)-value, and a statement about how blink rates compare across experimental conditions.
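
The workflow in this question can be sketched on made-up numbers, as below. The data frame toy and its columns cond_a and cond_b are hypothetical stand-ins, not the blink data, and the sketch assumes every within-subject difference is positive so that log2() is defined.

```r
# Hypothetical paired measurements (NOT the blink data)
toy <- data.frame(
  cond_a = c(24, 30, 18, 40, 22, 36),
  cond_b = c(6, 10, 9, 8, 11, 12)
)

# Paired design: reduce each subject to one within-subject difference
toy$difference <- toy$cond_a - toy$cond_b

# log2() needs positive inputs; check before transforming
stopifnot(all(toy$difference > 0))
toy$log_difference <- log2(toy$difference)

# A gap of k units on the log2 scale is a factor of 2^k on the original
# scale; here log2 values of 3 and 1 correspond to original values 8 and 2
ratio <- 2^(3 - 1)

# One-sample t-test of whether the mean of log_difference is zero
# (a log2 value of zero corresponds to a difference of 1 in original units)
result <- t.test(toy$log_difference, mu = 0)
result$p.value
```

Working on the log2 scale turns multiplicative gaps into additive ones, which is why a fixed distance between two log2 values always corresponds to the same ratio in the original units.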


Question #3

In an educational study aimed at informing people of the problems associated with multiple comparisons, Canadian researchers examined various reasons for hospitalization and their potential associations with astrological signs. They performed a total of 24 different hypothesis tests.

Part A: If the researchers compare the \(p\)-value of each test they performed to \(\alpha = 0.05\), how many statistically significant findings would you expect if in reality there are no associations between astrological signs and reasons for hospitalization?
Part B: If the Bonferroni correction is applied to keep the family-wise type I error rate at 5%, what is the new significance threshold that \(p\)-values should be compared to?
Part C: The study obtained two “significant” findings (without applying any corrections): individuals born under Leo had a higher probability of gastrointestinal hemorrhage (\(p = 0.0447\)), while Sagittarians had a higher probability of humerus fracture (\(p = 0.0123\)) compared to all other signs combined. Using the adjusted threshold you found in Part B, is either of these findings statistically significant in light of the multiple comparisons the investigators performed?
Part D: Suppose that instead of using the Bonferroni correction, as was done in Parts B and C, you used a false discovery rate control procedure to limit the false discovery rate to 5%. Is using this procedure more likely or less likely to produce type I errors than the Bonferroni procedure?
Part E: Suppose that some astrological signs actually do predispose individuals to certain reasons for hospitalization. Which procedure, the Bonferroni correction or false discovery rate control, is more prone to type II errors?
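
For reference, both kinds of adjustment are available through p.adjust() in base R. The sketch below uses five made-up p-values, not the study's results, and the Benjamini-Hochberg method ("BH") is assumed here as the false discovery rate procedure; the names p, alpha, and m are hypothetical.

```r
# Hypothetical p-values from m = 5 tests (NOT the study's results)
p <- c(0.004, 0.012, 0.040, 0.200, 0.650)
alpha <- 0.05
m <- length(p)

# Expected number of false positives if every null hypothesis is true
# and each test is compared to alpha without any correction
expected_false_pos <- alpha * m

# Bonferroni: compare each p-value to alpha / m ...
bonf_direct <- p < alpha / m

# ... or, equivalently, inflate the p-values and compare them to alpha
bonf_adjusted <- p.adjust(p, method = "bonferroni") < alpha

# Benjamini-Hochberg targets the false discovery rate rather than the
# family-wise error rate
bh_adjusted <- p.adjust(p, method = "BH") < alpha

# Number of tests each procedure flags as significant
c(bonferroni = sum(bonf_adjusted), BH = sum(bh_adjusted))
```

Comparing the two counts on a set of p-values like this is a quick way to check your intuition about how strict each procedure is.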