Sta-209 (Spring 2025) Homework #4

Directions:

Submit your assignment via P-web.
Submit only a compiled R Markdown document (PDF, Word, or HTML output are all okay, but you may need to “zip” an HTML file).
- If you want to compile to a PDF, you can install the tinytex package by running install.packages('tinytex') followed by tinytex::install_tinytex()
Only submit your .Rmd file if you are unable to compile it due to errors (in the future you will be penalized for this).

Question #1

The Indoor Obstacle Course Test (IOCT) is an obstacle course that all Cadets at the United States Military Academy at West Point are tested on and must complete successfully before graduation. The data set below records IOCT completion times for 384 graduates of the academy:

ioct = read.csv("https://data.scorenetwork.org/data/ioct_west_point.csv")

Part A: Find Pearson’s correlation coefficient measuring the strength of association between the variables height and IOCT_Time. Based upon this correlation, do taller Cadets tend to finish the course faster or slower than shorter Cadets? Briefly explain.
Part B: Create a scatter plot showing the relationship between the explanatory variable height and the response variable IOCT_Time that includes both linear and moving average smoothers. Briefly explain why Pearson’s correlation coefficient is appropriate for measuring the strength of association for these data.
Part C: Fit the simple linear regression model IOCT_Time ~ height and interpret the slope coefficient that describes the effect of changes in height on the expected completion time in the fitted model.
Part D: Create an appropriate data visualization to determine whether the variables sex and height are associated in these data. Using your visualization, briefly explain whether or not you believe these variables are associated.
Part E: Create an appropriate data visualization to determine whether the variables sex and IOCT_Time are associated in these data. Using your visualization, briefly explain whether or not you believe these variables are associated.
Part F: Based upon your answers to Parts D and E, state whether or not you believe sex confounds the relationship between height and IOCT_Time.
Part G: Use the group_by() and summarize() functions in the dplyr package to perform a stratified analysis that reports the correlation between height and IOCT_Time separately for each category of sex. Hint: If you are specifying the data as the first step in the pipeline you should reference the variables height and IOCT_Time without using the $ operator inside of the summarize() function.
Part H: Briefly interpret the results of the stratified analysis you performed in Part G. That is, does height appear associated with IOCT_Time after you account for the variable sex?
Part I: Fit the multivariable linear regression model IOCT_Time ~ height + sex and interpret the slope coefficient that describes the effect of changes in height on the expected completion time in the fitted model. Briefly explain the difference between this effect and the one from Part C.
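
The hint in Part G can be illustrated on a different data set. The following is a minimal sketch using R's built-in iris data (not the IOCT data), assuming the dplyr package is installed; the object name iris_cor and the column name r are arbitrary labels chosen for illustration.

```r
library(dplyr)

# Stratified correlations on the built-in iris data (illustration only,
# not the homework data): one correlation per level of Species.
iris_cor = iris %>%
  group_by(Species) %>%
  # Variables are referenced by bare name, with no `iris$` prefix,
  # so cor() is computed separately within each group.
  summarize(r = cor(Sepal.Length, Petal.Length))

iris_cor
```

Writing iris$Sepal.Length inside summarize() would instead pull the entire column from the original data frame, so every group would report the same overall correlation.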

$~$

Question #2

For this question you’ll work with the “2023 Boston Marathon” data set found below:

marathon = read.csv("https://data.scorenetwork.org/data/boston_marathon_2023.csv")

This data set records information on each finisher of the 2023 Boston Marathon, with the variable finish_net_sec (finishing time in seconds) being the outcome of interest for this question.

Part A: Create a frequency table of the variable age_group and briefly describe the distribution of finisher ages.
Part B: Fit the regression model finish_net_sec ~ age_group. Based upon your fitted model, which age group has the fastest expected finish time?
Part C: Briefly interpret the estimated coefficient for “age_group50-54” in the model you fit in Part B.
Part D: Create a data visualization showing the relationship between half_time_sec (the runner’s time at the race’s halfway point) and finish_net_sec. Using this visualization, briefly describe the relationship between these two variables.
Part E: Fit the regression model finish_net_sec ~ half_time_sec. Briefly interpret the slope and intercept of this model. If the intercept is not meaningful, you should indicate this.
Part F: Now fit the multivariable regression model finish_net_sec ~ half_time_sec + age_group. Explain why the coefficient of “age_group50-54” in this model is so different from the coefficient in the model you fit in Part B.
Part G: Compare the adjusted $R^2$ of the models from Parts E and F, that is, finish_net_sec ~ half_time_sec and finish_net_sec ~ half_time_sec + age_group. Does including age group appear to improve the model, or does including this variable produce an overfit model?
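
One way to extract the adjusted $R^2$ from fitted models is sketched below on R's built-in mtcars data (not the marathon data). The object names fit_small and fit_large are hypothetical, and the choice of predictors is only for illustration.

```r
# Illustration on the built-in mtcars data (not the marathon data).
# fit_small uses one quantitative predictor; fit_large adds a
# categorical predictor, mirroring the structure of Parts E and F.
fit_small = lm(mpg ~ wt, data = mtcars)
fit_large = lm(mpg ~ wt + factor(cyl), data = mtcars)

# summary() of an lm fit stores the adjusted R-squared
# in the element named `adj.r.squared`.
summary(fit_small)$adj.r.squared
summary(fit_large)$adj.r.squared
```

Unlike the ordinary $R^2$, the adjusted version penalizes each additional coefficient, so it can decrease when an added predictor contributes little to the fit.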