Directions:
A commonly referenced example of Simpson’s Paradox involves the batting averages of professional baseball players. In 1995, Derek Jeter got a hit at 12 of 48 at bats (0.250 batting average) while David Justice got a hit at 104 of 411 at bats (0.253 batting average). In 1996, Derek Jeter got a hit at 183 of 582 at bats (0.314 batting average), while David Justice got a hit at 45 of 140 at bats (0.321 batting average). Thus, David Justice had a higher batting average than Derek Jeter in both of these two seasons.
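The arithmetic in this example can be checked directly. The sketch below uses only the hit and at-bat counts quoted above; the vector names are illustrative choices, not part of any data set. It recomputes each season's batting averages and then pools the two seasons for each player.

```r
# Hits and at bats quoted in the text, ordered as (1995, 1996).
jeter_hits   <- c(12, 183)
jeter_ab     <- c(48, 582)
justice_hits <- c(104, 45)
justice_ab   <- c(411, 140)

# Season-by-season batting averages.
jeter_season   <- jeter_hits / jeter_ab
justice_season <- justice_hits / justice_ab

# Pooled averages across both seasons: total hits over total at bats.
jeter_pooled   <- sum(jeter_hits) / sum(jeter_ab)
justice_pooled <- sum(justice_hits) / sum(justice_ab)

# Collect everything in one small table for comparison.
averages <- rbind(
  Jeter   = c(jeter_season, jeter_pooled),
  Justice = c(justice_season, justice_pooled)
)
colnames(averages) <- c("1995", "1996", "Pooled")
round(averages, 3)
```

Justice leads within each season, yet the pooled averages reverse the ordering (about 0.310 for Jeter versus about 0.270 for Justice). This happens because most of Jeter's at bats came in 1996, when both players hit better, while most of Justice's came in 1995.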
\(~\)
The Indoor Obstacle Course Test (IOCT) is an obstacle course that all Cadets at West Point Academy are tested on and must successfully complete prior to graduation. The data set below records IOCT completion times for 384 graduates of the academy:
ioct = read.csv("https://data.scorenetwork.org/data/ioct_west_point.csv")
Part A: Find Pearson's correlation coefficient between height and IOCT_Time. Based upon this correlation, do taller Cadets tend to finish the course faster or slower than shorter Cadets? Briefly explain.
Part B: Create a scatterplot of the explanatory variable height and the response variable IOCT_Time that includes both linear and moving average smoothers. Briefly explain why Pearson's correlation coefficient is appropriate for measuring the strength of association for these data.
Part C: Fit the regression model IOCT_Time ~ height and interpret the slope coefficient that describes the effect of changes in height on the expected completion time in the fitted model.
Part D: Create a visualization to assess whether sex and height are associated in these data. Using your visualization, briefly explain whether or not you believe these variables are associated.
Part E: Create a visualization to assess whether sex and IOCT_Time are associated in these data. Using your visualization, briefly explain whether or not you believe these variables are associated.
Part F: Briefly explain whether or not sex confounds the relationship between height and IOCT_Time.
Part G: Use the group_by() and summarize() functions in the dplyr package to perform a stratified analysis that reports the correlation between height and IOCT_Time separately for each category of sex. Hint: If you specify the data as the first step in the pipeline, you should reference the variables height and IOCT_Time without using the $ operator inside of the summarize() function.
Part H: Does height appear associated with IOCT_Time after you account for the variable sex?
Part I: Fit the regression model IOCT_Time ~ height + sex and interpret the slope coefficient that describes the effect of changes in height on the expected completion time in the fitted model. Briefly explain the difference between this effect and the one from Part C.
\(~\)
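As a rough illustration of the stratified analysis and the adjusted model described above, the sketch below runs the same steps on a small toy data frame. The values in toy are invented purely for illustration and are not the real IOCT data; only the column names (height, IOCT_Time, sex) come from the directions, and the category labels "F" and "M" are assumptions about how sex might be coded.

```r
library(dplyr)

# Toy values invented for illustration only; these are NOT the real IOCT data.
# The labels "F" and "M" are assumed, not taken from the data set.
toy <- data.frame(
  sex       = rep(c("F", "M"), each = 4),
  height    = c(62, 64, 66, 68, 68, 70, 72, 74),
  IOCT_Time = c(250, 246, 249, 244, 200, 203, 198, 201)
)

# Overall Pearson correlation, ignoring sex.
overall_r <- cor(toy$height, toy$IOCT_Time)

# Stratified correlations: inside summarize() the columns are referenced
# by bare name (no $), because the data frame enters through the pipeline.
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(r = cor(height, IOCT_Time), n = n())

# Height slope before and after adjusting for sex.
fit_unadjusted <- lm(IOCT_Time ~ height, data = toy)
fit_adjusted   <- lm(IOCT_Time ~ height + sex, data = toy)

overall_r
by_sex
coef(fit_unadjusted)["height"]
coef(fit_adjusted)["height"]
```

Replacing toy with the ioct data frame should give the corresponding results for the real data, assuming the column names match those used in the directions.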
For this question you’ll work with the “2023 Boston Marathon” data set found below:
marathon = read.csv("https://data.scorenetwork.org/data/boston_marathon_2023.csv")
This data set records information on each finisher of the 2023 Boston
Marathon, with the variable finish_net_sec (finishing time
in seconds) being the outcome of interest for this question.
Part A: Create a visualization of age_group and briefly describe the distribution of finisher ages.
Part B: Fit the regression model finish_net_sec ~ age_group. Based upon your fitted model, which age group has the fastest expected finish time?
Part C: Create a scatterplot of half_time_sec (the runner's time at the race's halfway point) and finish_net_sec. Using this visualization, briefly describe the relationship between these two variables.
Part D: Fit the regression model finish_net_sec ~ half_time_sec. Briefly interpret the slope and intercept of this model. If the intercept is not meaningful, you should indicate this.
Part E: Fit the regression model finish_net_sec ~ half_time_sec + age_group. Explain why the coefficient of "age_group50-54" in this model is so different from the coefficient in the model you fit in Part B.
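To see how the coefficient of a single age group can be pulled from two competing models, the sketch below fits both on a toy data frame. All values are invented for illustration and are not the real marathon data; the level "18-39" is a hypothetical reference category, and only the column names and the "50-54" label come from the directions.

```r
# Toy values invented for illustration only; NOT the real marathon data.
# "18-39" is a hypothetical reference level chosen for this sketch.
toy <- data.frame(
  age_group      = rep(c("18-39", "50-54"), each = 3),
  half_time_sec  = c(5400, 5700, 6000, 6300, 6600, 6900),
  finish_net_sec = c(11000, 11700, 12400, 13000, 13500, 14300)
)

# Model with age group only: each coefficient is a difference in mean
# finish time relative to the reference level.
fit_age <- lm(finish_net_sec ~ age_group, data = toy)

# Model that also adjusts for the halfway time: the age-group coefficient
# now compares runners who share the same half_time_sec.
fit_both <- lm(finish_net_sec ~ half_time_sec + age_group, data = toy)

# R names the coefficient by pasting the variable name and the level.
coef(fit_age)["age_group50-54"]
coef(fit_both)["age_group50-54"]
```

In the first model the coefficient equals the raw difference in group means, while in the second it is the difference that remains after holding the halfway time fixed, which is why the two values can differ substantially.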