Directions:
- Submit your assignment via P-web.
- Submit only a compiled R Markdown document (PDF, Word, or HTML
output are all okay, but you may need to “zip” an HTML file).
- If you want to compile to a PDF, you can install the tinytex package
by running install.packages('tinytex') followed by
tinytex::install_tinytex()
- Only submit your .Rmd file if you are unable to compile it due to
errors (in the future you will be penalized for this)
Question #1
The “Antarctica Cyclones” data (loaded below) were collected as part
of a study published in 1983 that examined seasonal variation in the
frequencies of cyclones in Antarctica. Each row in this data set
represents an observed cyclone, and the data contain two categorical
variables:
latLong - the latitude band where the cyclone occurred, using the
coding: 1 = 40-49S, 2 = 50-59S, and 3 = 60-79S
Season - the season when the cyclone occurred, using the coding:
1 = Fall, 2 = Winter, 3 = Spring, and 4 = Summer
cyclones = read.csv("https://remiller1450.github.io/data/cyclone.csv")
- Part A: Create a data visualization displaying the
frequencies of cyclones in each season. Based upon your visualization,
which season appears to have the most recorded cyclones?
- Part B: Consider the hypothesis that cyclones are
equally likely in all seasons. Perform a hypothesis test to evaluate
whether these data are consistent with this hypothesis. Clearly state
your test’s null hypothesis, report the resulting \(p\)-value, and provide a
1-sentence summary of the results. Include any R code you used to
perform the test.
- Part C: Display the standardized residuals for the
hypothesis test you performed in Part B. How do you interpret the
residual for the category Season = 3?
- Part D: Notice that the categories of latLong do not span the same
area. In particular, the third category covers twice as many degrees of
latitude as each of the first two categories. Considering this
difference, what might be a suitable set of proportions for the null
hypothesis that cyclones are evenly distributed across categories of
latLong?
- Part E: Perform a Chi-squared Goodness of Fit Test
using the null hypothesis you provided in Part D. Report the \(p\)-value and a 1-sentence summary of the
test.
- Part F: Use standardized residuals to identify the
latitude band whose observed count most substantially deviates from its
expected count.
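For reference, the general workflow for a goodness-of-fit test with unequal null proportions can be sketched as follows. The category codes, counts, and null proportions here are made up for illustration; they are not the cyclone data and are not meant as an answer to any part above. The sketch assumes only base R: table(), chisq.test(), and the expected and stdres components of its result.

```r
# Hypothetical raw data: one row per observation, categories coded 1-3
# (illustrative only, not the cyclone data)
codes <- rep(c(1, 2, 3), times = c(55, 25, 20))

# Tabulate the observed frequency of each category
observed <- table(codes)

# Hypothetical null proportions (must sum to 1, one per category)
null_props <- c(0.5, 0.3, 0.2)

# Chi-squared goodness-of-fit test against the stated proportions
gof <- chisq.test(observed, p = null_props)

gof$p.value    # p-value of the test
gof$expected   # expected counts under the null: 50, 30, 20
gof$stdres     # standardized residuals, one per category

# Category whose observed count deviates most from its expected count
names(which.max(abs(gof$stdres)))
```

A standardized residual far from 0 (roughly beyond 2 in absolute value) flags a category whose observed count is notably higher (positive) or lower (negative) than the null hypothesis predicts.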
Question #2
The data provided below are a random sample of \(n=200\) patient visits to the ICU of a hospital
affiliated with Carnegie Mellon University (CMU).
These data contain several categorical variables that are encoded
numerically. The questions below will involve the variables:
Race - The patient’s race: 1=white, 2=black, or 3=other
Consciousness - Level of awareness at arrival: 1=conscious,
2=deep stupor, or 3=coma
Status - Patient status: 0=lived or 1=died
icu = read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")
- Part A: According to the US
Census, the Pittsburgh metropolitan area (where CMU is located) is
85% white, 8% black, and 7% other races. Use this information to
statistically evaluate whether the racial demographics of the ICU
patients at this hospital differ from those of the Pittsburgh
metropolitan area. Report the \(p\)-value of your test and a 1-sentence
summary. If you find a significant result, use standardized residuals to
report the group with the largest deviation from what is expected.
- Part B: Perform a Chi-squared Test to evaluate whether the variable
Race is associated with the variable Status. Report the \(p\)-value of
your test and a 1-sentence summary. If you find a significant result,
use standardized residuals to report the combination of categories with
the largest deviation from what is expected. You may ignore any warning
messages given by chisq.test().
- Part C: Suppose you were to calculate Cramér’s V to quantify the
strength of association present in the data used in Part B. Without
actually calculating it, would you expect this measure to be closer to 0
or closer to 1? Briefly explain your reasoning.
- Part D: Repeat the analysis in Part B using
Fisher’s Exact Test. How does the \(p\)-value of this test compare to the one
you found in Part B? Which of these \(p\)-values is more trustworthy? Briefly
explain.
- Part E: Consider a Chi-squared Test of Independence involving the
variables Race and Consciousness. The null hypothesis for this test
implies a certain distribution of Consciousness for each category of
Race. Calculate the proportions that define this distribution.
- Part F: The \(p\)-value when using Fisher’s Exact Test to assess the
relationship between the variables Race and Consciousness is 0.227.
Does this provide strong evidence that each category of race follows the
distribution of consciousness you provided in Part E? Briefly explain
your reasoning.
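For reference, the tools used in this question can be sketched on a small two-way table. The table below is hypothetical (the group and outcome labels and all counts are made up, and it is not the ICU data). The sketch assumes only base R; Cramér’s V is computed by hand from the Chi-squared statistic rather than with an add-on package.

```r
# Hypothetical 3x2 table of counts (illustrative only, not the ICU data)
tab <- matrix(c(20, 10,
                15, 15,
                 5, 35),
              nrow = 3, byrow = TRUE,
              dimnames = list(group = c("g1", "g2", "g3"),
                              outcome = c("no", "yes")))

# Chi-squared test of independence
ind <- chisq.test(tab)
ind$p.value   # p-value of the test
ind$stdres    # standardized residual for each cell

# Under independence, every row shares the overall column distribution
null_dist <- colSums(tab) / sum(tab)
null_dist     # 0.4 for "no", 0.6 for "yes"

# Cramer's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
v <- sqrt(unname(ind$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
v

# Fisher's Exact Test on the same table (does not rely on large expected counts)
fisher.test(tab)$p.value
```

When some expected counts are small, chisq.test() warns that its approximation may be inaccurate, which is the situation where comparing against fisher.test() is most informative.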