Directions:
- Submit your work via the “Assignments” tab on Canvas
- For this assignment you should record your answers and code using R Markdown
- Please upload HTML, Word, or PDF output created using R Markdown and
make sure it contains your code, output, and written answers. You should
not include extraneous output, such as printing an entire data
frame.
- At this point in the course you are responsible for knowing how to
properly knit an R Markdown document, so uploading any other file format
will result in a point deduction on the assignment
- Homework is an individual assignment. It’s okay to check
your work or collaborate with your classmates, student mentors, and
others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you receive from other people, or
from resources other than the materials on our course website (such as
external websites and AI)
Question #1
A research group conducted a survey of \(n=1250\) licensed drivers, asking them
about the color of their car and whether they’ve received a speeding
ticket in the past year. The researchers hypothesized that red cars are
more noticeable and therefore might be pulled over for speeding at
higher rates than other colors. The researchers found that of the 270
drivers in their sample who drove a red car, 15 reported receiving a
speeding ticket. Of the remaining 980 drivers only 45 reported receiving
a speeding ticket.
- Part A: Were these data collected as part of an
experiment or an observational study? If the study is an experiment, were
groups randomly assigned? If the study is observational, does it use a
prospective, cross-sectional, or retrospective design? Briefly explain
your answer.
- Part B: Use R to create and display a contingency table where the
first column is the outcome of receiving a speeding ticket and the
first row is the group of drivers with a red car.
- Part C: Calculate the risk difference
comparing the risk of receiving a speeding ticket for drivers of red
cars versus other colored cars. Show your work.
- Part D: Calculate the relative risk of
receiving a speeding ticket for drivers of a red car relative to drivers
of another colored car. Show your work.
- Part E: Do you think the risk difference
or relative risk provides a clearer comparison of how the risk
of receiving a speeding ticket compares across groups? Briefly explain
your reasoning.
- Part F: Use fisher.test() to calculate the odds ratio and perform a
hypothesis test investigating whether these data provide statistical
evidence that drivers of red cars are more likely to receive speeding
tickets. Provide a one-sentence conclusion that includes the strength
of evidence, the observed odds ratio, and appropriate context.
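
As a rough illustration of the workflow in Parts B through F, the sketch below builds a 2x2 table and computes a risk difference, a relative risk, and Fisher's exact test. The counts and the group and outcome labels are hypothetical and chosen only for illustration; they are not the survey data described above.

```r
# Hypothetical counts for illustration only (NOT the survey data above).
# Rows are groups, columns are outcomes, with the outcome of interest first.
tab <- matrix(c(30,  70,
                20, 180),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("exposed", "unexposed"),
                              outcome = c("event", "no event")))
tab

# Risk of the event within each group = events / group total
risk <- tab[, "event"] / rowSums(tab)

# Risk difference and relative risk, exposed versus unexposed
risk_diff <- risk[["exposed"]] - risk[["unexposed"]]
rel_risk  <- risk[["exposed"]] / risk[["unexposed"]]

# Fisher's exact test; alternative = "greater" asks whether the odds of the
# event are higher in the first row than in the second row
ft <- fisher.test(tab, alternative = "greater")
ft$estimate   # odds ratio (conditional maximum likelihood estimate)
ft$p.value
```

Note that fisher.test() reports a conditional maximum likelihood estimate of the odds ratio, which can differ slightly from the cross-product ratio computed by hand from the table.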
Question #2
A study published in 1983 looked at the seasonal variation in the
frequencies of cyclones in Antarctica. The data provided below record
each cyclone that occurred during a two-year period.
The data contain two categorical variables describing these
cyclones:
- latLong - the latitude band where the cyclone occurred, using the
coding scheme: 1 = 40-49S, 2 = 50-59S, and 3 = 60-79S
- season - the season when the cyclone occurred, using the coding
scheme: 1 = Fall, 2 = Winter, 3 = Spring, and 4 = Summer
cyclones = read.csv("https://remiller1450.github.io/data/cyclone.csv")
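
Because both variables are stored as numeric codes, R will treat them as numbers unless told otherwise. The sketch below shows one way to handle coded categories on a small made-up data frame; the names toy, band, and period and the labels "early", "middle", and "late" are hypothetical and are not part of the cyclone data.

```r
# Hypothetical data frame with numerically coded categories
# (illustration only; not the cyclone data)
toy <- data.frame(band   = c(1, 2, 3, 1, 2, 2),
                  period = c(1, 1, 2, 2, 3, 3))

# as.character() makes R treat the codes as categories rather than numbers
toy$band <- as.character(toy$band)

# factor() can additionally attach a readable label to each code
toy$period <- factor(toy$period, levels = 1:3,
                     labels = c("early", "middle", "late"))

# Two-way table of counts: rows = band, columns = period
counts <- table(toy$band, toy$period)
counts

# Base R stacked bar chart: one bar per column, stacked by row
barplot(counts, legend.text = TRUE, xlab = "period", ylab = "count")
```

This sketch uses base R's barplot() only to stay self-contained; ggplot2's geom_bar() with a fill aesthetic is a common alternative for the same kind of chart.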
- Part A: Use as.character() to coerce the variable latLong to
character type, then create a stacked bar chart displaying the season
on the x-axis and latitude band as the fill of each bar.
- Part B: Based upon the data visualization you
created in Part A, do cyclones seem equally likely to occur in every
season? Does there appear to be a relationship between the season in
which a cyclone occurs and where it occurs (in terms of latitude)?
Provide a brief justification for your answer to each of these
questions.
- Part C: Consider the hypothesis that cyclones are
equally likely in all seasons. What is the name of an appropriate
hypothesis test for this scenario?
- Part D: Use R to perform the hypothesis test you identified in Part C.
Provide a one-sentence summary of the results of this test that includes
the appropriate context.
- Part E: Consider the hypothesis that the location
of a cyclone is related to the season in which it occurs. What is the
name of an appropriate hypothesis test for this scenario?
- Part F: Use R to perform the hypothesis test you identified in Part E.
Provide a one-sentence summary of the results of this test that includes
the appropriate context.
- Part G: Use the standardized residuals of the test
you performed in Part F to describe how the frequencies of cyclones in
each season in the 40-49S latitude band differ from what would be
expected under the null hypothesis. That is, which seasons have fewer
cyclones than expected, and which have more cyclones than expected?
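
For reference, here is a minimal sketch of how chisq.test() handles the two kinds of hypotheses in Parts C through G and how standardized residuals can be read from the result. The counts and the region and period labels are hypothetical and unrelated to the cyclone data.

```r
# Hypothetical counts (illustration only; not the cyclone data):
# rows = region, columns = period
obs <- matrix(c(40, 25, 15,
                20, 30, 50),
              nrow = 2, byrow = TRUE,
              dimnames = list(region = c("north", "south"),
                              period = c("early", "middle", "late")))

# One categorical variable: given a vector of counts, chisq.test()
# uses equal null proportions by default
gof <- chisq.test(colSums(obs))
gof

# Two categorical variables: given a two-way table, chisq.test()
# tests whether the row and column variables are associated
ind <- chisq.test(obs)
ind

# Standardized residuals: positive = more observations than expected
# under the null hypothesis, negative = fewer than expected
round(ind$stdres, 2)
```

Cells whose standardized residuals are large in absolute value contribute most to the departure from the null hypothesis.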
Question #3
For this question you’ll use the ICU admissions data that we’ve seen
in a few of our previous labs and assignments. The data are a random
sample of \(n=200\) patients at a
hospital affiliated with Carnegie Mellon University (CMU).
These data contain several categorical variables that are encoded
numerically. The questions below will involve the variables:
- Race - The patient’s race: 1=white, 2=black, or 3=other
- Consciousness - Level of awareness at arrival: 1=conscious, 2=deep
stupor, or 3=coma
- Status - Patient status: 0=lived or 1=died
icu_data = read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")
- Part A: Suppose the researchers in this study recorded the
Consciousness of each selected patient when they were first admitted,
then waited to determine the Status of these patients until they had
either left the ICU or died. Would this be most accurately described as
a prospective, cross-sectional, or retrospective design? Briefly explain
your answer.
- Part B: Perform a Chi-squared test to evaluate whether Consciousness
and Status are associated. For the moment you may ignore any warning
messages that appear. Provide a one-sentence conclusion that includes
the \(p\)-value of the test as well as appropriate context about the
research question.
- Part C: Now consider using Fisher’s exact test to evaluate whether
Consciousness and Status are associated. Based upon the assumptions of
the Chi-squared test you performed in Part B, is Fisher’s exact test
more appropriate for these data? Briefly explain.
- Part D: Perform Fisher’s exact test and compare the
\(p\)-value with the one you found in
Part B. Do the two tests lead to the same conclusion or to different
conclusions?
- Part E: According to the US Census, the Pittsburgh
metropolitan area (where CMU is located) is 85% white, 8% black, and 7%
other races. Use this information to statistically evaluate whether the
racial demographics of the ICU patients at this hospital differ from
those of the Pittsburgh metropolitan area. Report the \(p\)-value of your test and a one-sentence
summary. If you find a significant result, use standardized residuals to
report the group with the largest deviation from what is expected.
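
The sketch below illustrates two ideas from this question on made-up numbers: checking expected counts to judge whether the Chi-squared approximation is reasonable, and testing observed counts against unequal null proportions. The tables, labels, and proportions are hypothetical and are not the ICU data or the Census figures. The cutoff of 5 for expected counts is a common rule of thumb and is an assumption here; use whatever threshold the course materials specify.

```r
# Hypothetical sparse table (illustration only; not the ICU data)
sparse <- matrix(c(50, 10,
                    3,  4,
                    2,  5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(level  = c("a", "b", "c"),
                                 result = c("no", "yes")))

# suppressWarnings() hides the small-expected-count warning so the
# expected counts can be inspected directly
chi <- suppressWarnings(chisq.test(sparse))
chi$expected
any(chi$expected < 5)   # TRUE suggests the approximation may be unreliable

# Fisher's exact test does not rely on the large-sample approximation
fisher.test(sparse)$p.value

# Goodness-of-fit test against unequal null proportions (must sum to 1)
counts <- c(a = 55, b = 30, c = 15)
gof <- chisq.test(counts, p = c(0.6, 0.3, 0.1))
gof$p.value
round(gof$stdres, 2)
```

In the goodness-of-fit output, the category with the largest absolute standardized residual is the one that deviates most from its expected count.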