Sta-209 (Spring 2025) Homework #5

Directions:

- Submit your work via the “Assignments” tab on Canvas.
- For this assignment you should record your answers/code using R Markdown.
  - Please upload HTML, Word, or PDF output created using R Markdown and make sure it contains your code, output, and written answers. You should not include extraneous output, such as printing an entire data frame.
  - At this point in the course you are responsible for knowing how to properly knit an R Markdown document, so uploading any other file format will result in a point deduction on the assignment.
- Homework is an individual assignment. It’s okay to check your work or collaborate with your classmates, student mentors, and others, but it is not okay to pass off their work as your own.
  - Please clearly acknowledge any help you get from individuals other than yourself, or resources other than the materials on our course website (such as external websites and AI).
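
As a minimal sketch of how to check a data set without printing all of it, the chunk below uses a made-up data frame (`example_df` is hypothetical and is not one of the assignment's data sets) and shows only its dimensions, first few rows, and column types.

```r
# Hypothetical data frame standing in for a data set loaded with read.csv()
example_df <- data.frame(id = 1:500,
                         group = rep(c("a", "b"), each = 250))

dim(example_df)      # number of rows and columns
head(example_df, 5)  # first five rows only, not the whole data frame
str(example_df)      # compact summary of each column's type
```

Any of these calls keeps the knitted output short while still showing that the data were read in correctly.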

Question #1

A research group conducted a survey of \(n=1250\) licensed drivers, asking them about the color of their car and whether they’ve received a speeding ticket in the past year. The researchers hypothesized that red cars are more noticeable and therefore might be pulled over for speeding at higher rates than cars of other colors. The researchers found that of the 270 drivers in their sample who drove a red car, 15 reported receiving a speeding ticket. Of the remaining 980 drivers, only 45 reported receiving a speeding ticket.

Part A: Were these data collected as part of an experiment or an observational study? If the study is an experiment, were groups randomly assigned? If the study is observational, does it use a prospective, cross-sectional, or retrospective design? Briefly explain your answer.
Part B: Use R to create and display a contingency table where the first column is the outcome of receiving a speeding ticket and the first row is the group of drivers who drive a red car.
Part C: Calculate the risk difference comparing the risk of receiving a speeding ticket for drivers of red cars versus other colored cars. Show your work.
Part D: Calculate the relative risk of receiving a speeding ticket for drivers of a red car relative to drivers of another colored car. Show your work.
Part E: Do you think the risk difference or relative risk provides a clearer comparison of how the risk of receiving a speeding ticket compares across groups? Briefly explain your reasoning.
Part F: Use fisher.test() to calculate the odds ratio and perform a hypothesis test investigating whether these data provide statistical evidence that drivers of red cars are more likely to receive speeding tickets. Provide a one-sentence conclusion that includes the strength of evidence, the observed odds ratio, and appropriate context.
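
The general workflow behind these parts can be sketched on made-up numbers. The counts below are hypothetical and unrelated to the survey, and the labels `exposed`, `unexposed`, `yes`, and `no` are placeholders chosen for illustration. The table is arranged so that the group of interest is the first row and the outcome of interest is the first column.

```r
# Hypothetical 2x2 table: rows are groups, columns are outcomes
counts <- matrix(c(30,  70,
                   20, 180),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group   = c("exposed", "unexposed"),
                                 outcome = c("yes", "no")))
counts

# Risk of the outcome within each group
risk <- counts[, "yes"] / rowSums(counts)

# Risk difference and relative risk, with "exposed" compared to "unexposed"
risk_difference <- risk["exposed"] - risk["unexposed"]  # 0.30 - 0.10
relative_risk   <- risk["exposed"] / risk["unexposed"]  # 0.30 / 0.10

# One-sided Fisher's exact test: are the odds higher in the first row?
fisher_result <- fisher.test(counts, alternative = "greater")
fisher_result$estimate  # conditional estimate of the odds ratio
fisher_result$p.value
```

Note that the odds ratio reported by fisher.test() is a conditional maximum likelihood estimate, so it may differ slightly from the sample odds ratio computed by hand from the four cell counts.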

Question #2

A study published in 1983 looked at the seasonal variation in the frequencies of cyclones in Antarctica. The data provided below record each cyclone that occurred during a two-year period.

The data contain two categorical variables describing these cyclones:

latLong - the latitude band where the cyclone occurred using the coding scheme: 1=40-49S, 2=50-59S, and 3=60-79S
season - the season when the cyclone occurred using the coding scheme: 1=Fall, 2=Winter, 3=Spring, and 4=Summer

cyclones = read.csv("https://remiller1450.github.io/data/cyclone.csv")

Part A: Use as.character() to coerce the variable latLong to character type, then create a stacked bar chart displaying the season on the x-axis and latitude band as the fill of each bar.
Part B: Based upon the data visualization you created in Part A, do cyclones seem equally likely to occur in every season? Does there appear to be a relationship between the season in which a cyclone occurs and where it occurs (in terms of latitude)? Provide a brief justification for your answer to each of these questions.
Part C: Consider the hypothesis that cyclones are equally likely in all seasons. What is the name of an appropriate hypothesis test for this scenario?
Part D: Use R to perform the hypothesis test you identified in Part C. Provide a one-sentence summary of the results of this test that includes the appropriate context.
Part E: Consider the hypothesis that the location of a cyclone is related to the season in which it occurs. What is the name of an appropriate hypothesis test for this scenario?
Part F: Use R to perform the hypothesis test you identified in Part E. Provide a one-sentence summary of the results of this test that includes the appropriate context.
Part G: Use the standardized residuals of the test you performed in Part F to describe how the frequencies of cyclones in each season in the 40-49S latitude band differ from what would be expected under the null hypothesis. That is, which seasons have fewer cyclones than expected, and which have more cyclones than expected?
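
For reference, here is a rough sketch of the mechanics involved, run on made-up data. The data frame `storms` and its columns `region` and `quarter` are hypothetical stand-ins with invented counts, not the cyclone data. The plot uses base R's barplot() only to keep the sketch free of package dependencies; a ggplot2 bar chart with a fill aesthetic would serve the same purpose.

```r
# Hypothetical numerically coded data: region (1-3) and quarter (1-4)
storms <- data.frame(
  region  = rep(c(1, 2, 3), times = c(40, 35, 25)),
  quarter = c(rep(1:4, times = c(16, 10, 8, 6)),
              rep(1:4, times = c(9, 9, 9, 8)),
              rep(1:4, times = c(4, 6, 7, 8)))
)

# Coerce the numeric codes to character so they are treated as categories
storms$region <- as.character(storms$region)

# Two-way table and a stacked bar chart (quarter on the x-axis, region as fill)
two_way <- table(storms$region, storms$quarter)
barplot(two_way, legend.text = TRUE, xlab = "quarter", ylab = "count")

# One categorical variable versus equal proportions (the default for p)
gof <- chisq.test(table(storms$quarter))
gof$p.value

# Two categorical variables: test of independence
indep <- chisq.test(two_way)
indep$p.value
round(indep$stdres, 2)  # positive = more than expected, negative = fewer
```

Each row of the standardized residual matrix corresponds to one level of the first variable passed to table(), so a single row can be read off to see which columns fall above or below their expected counts.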

Question #3

For this question you’ll use the ICU admissions data that we’ve seen in a few of our previous labs and assignments. The data are a random sample of \(n=200\) patients at a hospital affiliated with Carnegie Mellon University (CMU).

These data contain several categorical variables that are encoded numerically. The questions below will involve the variables:

Race - The patient’s race: 1=white, 2=black, or 3=other
Consciousness - Level of awareness at arrival: 1=conscious, 2=deep stupor, or 3=coma
Status - Patient status: 0=lived or 1=died

icu_data = read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")

Part A: Suppose the researchers in this study recorded the Consciousness of each selected patient when they were first admitted, then they waited to determine the Status of these patients until they had either left the ICU or died. Would this be most accurately described as a prospective, cross-sectional, or retrospective design? Briefly explain your answer.
Part B: Perform a Chi-squared test to evaluate whether Consciousness and Status are associated. For the moment you may ignore any warning messages that appear. Provide a one-sentence conclusion that includes the \(p\)-value of the test as well as appropriate context about the research question.
Part C: Now consider using Fisher’s exact test to evaluate whether Consciousness and Status are associated. Based upon the assumptions of the Chi-squared test you performed in Part B, is Fisher’s exact test more appropriate for these data? Briefly explain.
Part D: Perform Fisher’s exact test and compare the \(p\)-value with the one you found in Part B. Do these tests lend themselves to the same or different conclusions?
Part E: According to the US Census, the Pittsburgh metropolitan area (where CMU is located) is 85% white, 8% black, and 7% other races. Use this information to statistically evaluate whether the racial demographics of the ICU patients at this hospital differ from those of the Pittsburgh metropolitan area. Report the \(p\)-value of your test and a one-sentence summary. If you find a significant result, use standardized residuals to report the group with the largest deviation from what is expected.
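
Two of the techniques used in this question can be sketched on invented numbers: checking expected counts before choosing between the Chi-squared test and Fisher’s exact test, and testing observed counts against a set of reference proportions. Every count, proportion, and label below (`sparse`, `observed`, `reference`, and the level names) is hypothetical and unrelated to the ICU data.

```r
# Hypothetical sparse two-way table
sparse <- matrix(c(50, 4,
                   10, 3,
                    5, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(level   = c("low", "medium", "high"),
                                 outcome = c("no", "yes")))

# Warnings are suppressed here only so the expected counts can be inspected
chisq_result <- suppressWarnings(chisq.test(sparse))
chisq_result$expected            # small expected counts are a concern
any(chisq_result$expected < 5)

# Fisher's exact test does not rely on large expected counts
fisher_result <- fisher.test(sparse)
c(chisq = chisq_result$p.value, fisher = fisher_result$p.value)

# Goodness of fit against hypothetical reference proportions
observed  <- c(a = 60, b = 25, c = 15)
reference <- c(a = 0.70, b = 0.20, c = 0.10)
gof <- chisq.test(observed, p = reference)
gof$p.value
gof$stdres  # the largest absolute value marks the biggest deviation
```

The proportions passed to the p argument must be listed in the same order as the observed counts and must sum to 1.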