Sta-209 (Spring 2025) Homework #8

Directions:

Submit your assignment via P-web.
Submit only a compiled R Markdown document (pdf, word, or html output are all okay, but you may need to “zip” an html file)
- If you want to compile to a pdf, you can install the tinytex package by running install.packages('tinytex') followed by tinytex::install_tinytex()
Only submit your .Rmd file if you are unable to compile it due to errors (in the future you will be penalized for this)

Question #1

The “Antarctica Cyclones” data (loaded below) were collected as part of a study published in 1983 looking at seasonal variation in the frequencies of cyclones in Antarctica. Each row in this data set represents an observed cyclone, and the data contain two categorical variables:

latLong - the latitude band of where the cyclone occurred using the coding: 1 = 40-49S, 2 = 50-59S, and 3 = 60-79S
Season - the season when the cyclone occurred using the coding: 1=Fall, 2=Winter, 3=Spring, and 4=Summer

cyclones = read.csv("https://remiller1450.github.io/data/cyclone.csv")

Part A: Create a data visualization displaying the frequencies of cyclones in each season. Based upon your visualization, which season appears to have the most recorded cyclones?
Part B: Consider the hypothesis that cyclones are equally likely in all seasons. Perform a hypothesis test to evaluate whether these data are consistent with this hypothesis. Clearly state your test’s null hypothesis, the resulting \(p\)-value, and provide a 1-sentence summary of the results. Include any R code you used to perform the test.
Part C: Display the standardized residuals for the hypothesis test you performed in Part B. How do you interpret the residual for the category Season = 3?
Part D: Notice that the categories of latLong do not span the same area. In particular, the third category covers twice as many degrees of latitude as the first two categories. Considering this difference, what might be a suitable set of proportions for the null hypothesis that cyclones are evenly distributed across categories of latLong?
Part E: Perform a Chi-squared Goodness of Fit Test using the null hypothesis you provided in Part D. Report the \(p\)-value and a 1-sentence summary of the test.
Part F: Use standardized residuals to identify the latitude band whose observed count most substantially deviates from its expected count.
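
As a syntax reminder, the general pattern for a goodness-of-fit test with unequal null proportions can be sketched as follows. This is only a minimal illustration: the counts for the three hypothetical categories (A, B, C) and the null proportions are made up, none of them come from the cyclone data, and the object names are placeholders.

```r
# Hypothetical raw data: one row per observation (NOT the cyclone data)
category <- rep(c("A", "B", "C"), times = c(30, 50, 20))

# Tabulate the observed count for each category
observed <- table(category)

# Assumed null proportions, chosen only for illustration; they must sum
# to 1 and be listed in the same order as the table's categories
null_props <- c(0.2, 0.5, 0.3)

# Chi-squared goodness-of-fit test against the stated proportions
gof <- chisq.test(x = observed, p = null_props)

gof$expected  # expected counts under the null (n * null_props)
gof$p.value   # p-value of the test
gof$stdres    # standardized residuals, one per category
```

A positive standardized residual means the category was observed more often than the null hypothesis predicts, and a negative one means it was observed less often.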


Question #2

The data provided below are a random sample of \(n=200\) patient visits to the ICU of a hospital affiliated with Carnegie Mellon University (CMU).

These data contain several categorical variables that are encoded numerically. The questions below will involve the variables:

Race - The patient’s race: 1=white, 2=black, or 3=other
Consciousness - Level of awareness at arrival: 1=conscious, 2=deep stupor, or 3=coma
Status - Patient status: 0=lived or 1=died

icu = read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")

Part A: According to the US Census, the Pittsburgh metropolitan area (where CMU is located) is 85% white, 8% black, and 7% other races. Use this information to statistically evaluate whether the racial demographics of the ICU patients at this hospital differ from those of the Pittsburgh metropolitan area. Report the \(p\)-value of your test and a 1-sentence summary. If you find a significant result, use standardized residuals to report the group with the largest deviation from what is expected.
Part B: Perform a Chi-squared Test to evaluate whether the variable Race is associated with the variable Status. Report the \(p\)-value of your test and a 1-sentence summary. If you find a significant result, use standardized residuals to report the combination of categories with the largest deviation from what is expected. You may ignore any warning messages given by chisq.test().
Part C: Without actually calculating it, suppose you were to use Cramer’s V to quantify the strength of association present in the data used in Part B. Would you expect this measure to be closer to 0 or closer to 1? Briefly explain your reasoning.
Part D: Repeat the analysis in Part B using Fisher’s Exact Test. How does the \(p\)-value of this test compare to the one you found in Part B? Which of these \(p\)-values is more trustworthy? Briefly explain.
Part E: Consider a Chi-squared test of Independence involving the variables Race and Consciousness. The null hypothesis for this test implies a certain distribution of Consciousness for each category of Race. Calculate the proportions that define this distribution.
Part F: The \(p\)-value when using Fisher’s Exact Test to assess the relationship between the variables Race and Consciousness is 0.227. Does this provide strong evidence that each category of race follows the distribution of consciousness you provided in Part E? Briefly explain your reasoning.
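
The two-variable tests in this question follow a common pattern, sketched below under stated assumptions. The 3x2 table of counts is made up (it is not the ICU data), and the names `tab`, `Group`, and `Outcome` are placeholders. With a data frame, a call such as `table(df$x, df$y)` would produce a table of the same shape.

```r
# Hypothetical 3x2 table of counts (NOT the ICU data)
tab <- matrix(c(20, 30,
                15, 15,
                 5, 15),
              nrow = 3, byrow = TRUE,
              dimnames = list(Group = c("G1", "G2", "G3"),
                              Outcome = c("yes", "no")))

# Chi-squared test of independence
indep <- chisq.test(tab)
indep$expected  # expected counts if Group and Outcome are independent
indep$p.value   # p-value of the test
indep$stdres    # standardized residual for each cell

# Distribution of Outcome that the null hypothesis implies for every group
null_dist <- colSums(tab) / sum(tab)

# Cramer's V from its usual formula: sqrt(X^2 / (n * (min(r, c) - 1)))
cramers_v <- sqrt(unname(indep$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

# Fisher's Exact Test on the same table
exact <- fisher.test(tab)
exact$p.value
```

Each expected count equals the row total times the column total divided by the overall sample size, so every row of `indep$expected` has the same proportions as `null_dist`.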