These questions are intended to help you practice for Exam #4. The real exam will feature 2-3 questions that follow a similar format. All course content up until this point may appear on the exam, but the primary focus will be on two-sample hypothesis testing, Chi-squared tests, and ANOVA.

On the actual exam you should be prepared to record your answers in a properly formatted R Markdown document, submitting the compiled HTML output.

\(~\)

Question #1

Between 1972 and 1974, public health researchers canvassed Whickham, a small town in northeast England. The researchers asked residents about their smoking habits and collected demographic information. Twenty years later, a follow-up was performed to determine which of the original subjects were still alive and which had died. These data are a subset of individuals from the original study, consisting only of women who identified as either regular smokers or having never smoked.

sm = read.csv("https://remiller1450.github.io/data/Whickham.csv")

Part A: Using the table() function, find the proportions of smokers and non-smokers that were still alive at follow-up.

Part B: Using the table you found in Part A, calculate the odds ratio describing the odds of a non-smoker being alive at follow-up relative to the odds of a smoker being alive.
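
As a reminder of the mechanics behind Parts A and B, here is a minimal sketch using made-up data (not the Whickham data). The data frame `toy` and its columns `group` and `outcome` are hypothetical names chosen for illustration only.

```r
# Hypothetical toy data; names and values are illustrative only
toy <- data.frame(
  group   = c(rep("A", 10), rep("B", 10)),
  outcome = c(rep("Yes", 8), rep("No", 2), rep("Yes", 5), rep("No", 5))
)

# Two-way table of counts: rows are groups, columns are outcomes
tab <- table(toy$group, toy$outcome)

# Row proportions (margin = 1 conditions on the row variable)
props <- prop.table(tab, margin = 1)

# Odds of "Yes" within each group, then the odds ratio of A relative to B
odds_A <- tab["A", "Yes"] / tab["A", "No"]
odds_B <- tab["B", "Yes"] / tab["B", "No"]
odds_ratio <- odds_A / odds_B
```

Which group goes in the numerator determines how the odds ratio is interpreted, so it is worth stating the direction explicitly in your answer.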

Part C: Using an appropriate hypothesis test, determine whether there is a statistically significant difference in the proportion of smokers who were alive at follow-up relative to the proportion of non-smokers who were alive at follow-up. At minimum your answer should clearly state the null hypothesis, it should provide a \(p\)-value, and it should use the \(p\)-value to make a reasonable conclusion.
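
One possible approach to comparing two proportions is `prop.test()`. The sketch below uses made-up counts (not the Whickham data); the object names `successes` and `totals` are hypothetical.

```r
# Hypothetical counts: 80 of 100 "successes" in group 1, 50 of 100 in group 2
successes <- c(80, 50)
totals    <- c(100, 100)

# Null hypothesis: the two population proportions are equal
res <- prop.test(x = successes, n = totals)

res$estimate  # sample proportions in each group
res$p.value   # p-value for the two-sided test
```

A Chi-squared test via `chisq.test()` applied to the corresponding two-way table of counts is another reasonable choice here.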

Part D: Using an appropriate hypothesis test, determine whether the variable “Age” is associated with the variable “Survival”. At minimum your answer should clearly state the null hypothesis, it should provide a \(p\)-value, and it should use the \(p\)-value to make a reasonable conclusion.
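
When one variable is quantitative and the other has two categories, one option is a two-sample \(t\)-test using R's formula interface. This is a sketch on made-up data, assuming hypothetical column names `value` and `group`; whether a \(t\)-test is appropriate for the exam data is for you to judge.

```r
# Hypothetical toy data: a quantitative variable measured in two groups
toy <- data.frame(
  value = c(23, 31, 28, 35, 40, 52, 61, 58, 66, 70),
  group = rep(c("G1", "G2"), each = 5)
)

# Welch two-sample t-test (R's default does not assume equal variances)
# Null hypothesis: the two group means are equal
res <- t.test(value ~ group, data = toy)

res$estimate  # sample mean in each group
res$p.value   # p-value for the two-sided test
```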

Part E: Using an appropriate hypothesis test, determine whether the variable “Age” is associated with the variable “Smoking”. At minimum your answer should clearly state the null hypothesis, it should provide a \(p\)-value, and it should use the \(p\)-value to make a reasonable conclusion.

Part F: Considering the results of the hypothesis tests you performed in Parts C, D, and E, provide a brief explanation for why smokers appeared to have a significantly higher survival rate in this study.

\(~\)

Question #2

The data below document the daily commutes of a worker living in the greater Toronto area from their home to their workplace, GlaxoSmithKline (GSK). They were collected using a GPS app. The variables you will need in this question are:

  • DayOfWeek - whether the commute was on a Monday, Tuesday, Wednesday, Thursday, or Friday
  • MovingTime - the total amount of time their vehicle was moving
  • GoingTo - whether the trip was to GSK or home

ct = read.csv("https://remiller1450.github.io/data/CommuteTracker.csv")

Part A: Perform a statistical test to determine whether “DayOfWeek” is associated with “MovingTime”. At minimum your answer should clearly state the null hypothesis, it should provide a \(p\)-value, and it should use the \(p\)-value to make a reasonable conclusion.

Part B: Use Tukey’s Honest Significant Differences method to follow up on the statistical test you performed in Part A. Based upon this analysis, which days seem to be significantly different in terms of their average moving times?
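
The general workflow of a one-way ANOVA followed by Tukey's HSD might look like the sketch below. It uses made-up data with only three groups (not the commute data), and the names `toy`, `day`, and `time` are hypothetical.

```r
# Hypothetical toy data: a quantitative outcome measured in three groups
toy <- data.frame(
  day  = factor(rep(c("Mon", "Tue", "Wed"), each = 4)),
  time = c(20, 22, 21, 23,  30, 31, 29, 32,  21, 20, 22, 23)
)

# One-way ANOVA; null hypothesis: all group means are equal
fit <- aov(time ~ day, data = toy)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Tukey's HSD: all pairwise comparisons with adjusted p-values
tukey <- TukeyHSD(fit)
tukey$day  # columns: diff, lwr, upr, p adj (one row per pair of groups)
```

Pairs whose adjusted \(p\)-value falls below your significance threshold (equivalently, whose interval from `lwr` to `upr` excludes zero) are the ones flagged as significantly different.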

Part C: Suppose the person who collected these data wants to know if trips to “Home” tend to be significantly slower than trips to work, or “GSK”. Would you recommend they use a two-sample \(t\)-test? Or is there a structure within these data that would lead you to recommend another type of test? Briefly explain in 1-3 sentences; you do not need to perform either test.