Sta-209 (Spring 2025) Homework #3

Directions:

Submit your assignment via P-web.
Submit only a compiled R Markdown document (PDF, Word, or HTML output are all okay, but you may need to “zip” an HTML file)
- If you want to compile to a PDF, you can install the tinytex package by running install.packages('tinytex') followed by tinytex::install_tinytex()
Only submit your .Rmd file if you are unable to compile it due to errors (in the future you will be penalized for this)

Question #1

For this question you’ll use the “118th Congress” data set located at the URL:

https://remiller1450.github.io/data/congress_2024.csv

As a reminder, this data set documents the age and political party of all members of the 118th US Congress.

Before beginning, you should run the code provided below, which reads these data and uses the ifelse() function to create a new binary categorical variable “Baby_Boomer” that indicates whether a member of congress is part of the baby boomer demographic cohort.

## Read data
congress = read.csv("https://remiller1450.github.io/data/congress_2024.csv")

## Create the "Baby_Boomer" variable
congress$Baby_Boomer = ifelse(congress$Age > 60 & congress$Age < 79, "Boomer", "Not Boomer")
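
As a quick illustration of how ifelse() works element by element, here is a minimal sketch on a made-up vector of ages. The names toy_age and toy_cohort and all values are hypothetical and are not taken from the congress data.

```r
# Hypothetical ages, made up for illustration only (not from the congress data)
toy_age <- c(45, 61, 70, 79, 85)

# ifelse() checks the condition for each element and returns the second
# argument where the condition is TRUE and the third where it is FALSE
toy_cohort <- ifelse(toy_age > 60 & toy_age < 79, "Boomer", "Not Boomer")

print(toy_cohort)
```

Because both inequalities are strict, an age of exactly 79 is labeled "Not Boomer" in this sketch.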

Part A: Create a two-way frequency table showing frequencies of baby boomer status (columns) among each chamber of congress (rows).
Part B: Using indices and your table from Part A, calculate the proportion of US House members that are part of the baby boomer demographic cohort.
Part C: Use a difference in proportions to compare the relative frequencies of baby boomers in the US House and US Senate. In which chamber are members more likely to be baby boomers?
Part D: In Part C, was it reasonable to use a difference in proportions, or should a relative measure of association (i.e., relative risk or odds ratio) be preferred? Briefly explain.
Part E: Create a two-way frequency table showing frequencies of baby boomer status (columns) for each political party in congress (rows).
Part F: Using indices and your table from Part E, calculate the odds of a Republican member of congress (Party = 'R') belonging to the baby boomer demographic cohort.
Part G: Now calculate the odds of a Democratic member of congress (Party = 'D') belonging to the baby boomer cohort.
Part H: Using your results from Parts F and G, calculate an odds ratio describing the relative likelihood of a member of congress being a baby boomer in each major political party. You should put the political party with larger odds in the numerator of the odds ratio. Write a 1-sentence statement communicating the association measured by this odds ratio (see the third bullet on slide 12 of our Contingency Table notes for an example).
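
The table-indexing steps used throughout this question can be sketched on a small made-up data set. Everything below (the data frame toy, its columns group and outcome, and all counts) is hypothetical and chosen only to show the mechanics; it is not the congress data and not a solution.

```r
# Hypothetical data, made up for illustration only (not the congress data)
toy <- data.frame(
  group   = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  outcome = c("Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No")
)

# Two-way frequency table: the first argument gives rows, the second columns
tab <- table(toy$group, toy$outcome)
print(tab)

# Index cells as tab[row, column]; names avoid relying on column order
prop_A <- tab["A", "Yes"] / sum(tab["A", ])   # proportion of "Yes" in group A
prop_B <- tab["B", "Yes"] / sum(tab["B", ])   # proportion of "Yes" in group B
diff_prop <- prop_A - prop_B                  # difference in proportions

# Odds = (count with the outcome) / (count without the outcome)
odds_A <- tab["A", "Yes"] / tab["A", "No"]
odds_B <- tab["B", "Yes"] / tab["B", "No"]

# Odds ratio with the larger odds in the numerator
odds_ratio <- odds_A / odds_B
print(c(diff_prop = diff_prop, odds_ratio = odds_ratio))
```

Note that table() orders the categories alphabetically, so indexing by position (for example tab[1, 2]) requires checking which row and column each number refers to.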


Question #2

For this question you’ll use the “Hollywood Movies” data set located at the URL:

https://remiller1450.github.io/data/HollywoodMovies.csv

This data set documents the box office performance and critic ratings of major films released between 2007 and 2013.

Part A: Create a data visualization showing the relationship between the explanatory variable Budget, the amount of money used to produce the film (millions of USD), and the response variable TheatersOpenWeek, the number of theaters worldwide that showed the film during its opening weekend.
Part B: Describe the relationship you see in your data visualization from Part A. Be sure to address every important aspect of this relationship.
Part C: Find and report Pearson’s correlation coefficient for the variables Budget and TheatersOpenWeek. Do you believe this is an appropriate measure of association to describe the relationship between these variables? Briefly explain. Hint: be sure to use the argument use = "complete.obs" since there are missing values in these data.
Part D: Find and report Spearman’s correlation coefficient for the variables Budget and TheatersOpenWeek. Do you believe this is an appropriate measure of association to describe the relationship between these variables? Briefly explain.
Part E: Use the lm() function to fit a linear regression model using Budget as the explanatory variable and TheatersOpenWeek as the response variable. Print the model’s estimated coefficients (intercept and slope) and provide a brief interpretation of each.
Part F: Consider the \(R^2\) of the model you fit in Part E and the \(R^2\) of a second model that allows for a quadratic relationship between Budget and TheatersOpenWeek. Without fitting the second model, briefly explain why it will have a higher \(R^2\) value.
Part G: Use the select() function to create a new data frame containing only the variables RottenTomatoes (the film’s rating by professional critics), AudienceScore (the film’s rating by ordinary viewers), and WorldGross (the total amount of money generated by the film). Then, create a correlation matrix using Pearson’s correlation to determine which of these two ratings has the stronger linear association with WorldGross.
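
The function calls named in this question can be sketched on a small made-up data set. The data frame toy and its columns x, y, and z are hypothetical and unrelated to the Hollywood Movies data, and the sketch assumes that select() refers to the dplyr function and that the dplyr package is installed.

```r
library(dplyr)  # assumption: select() comes from dplyr

# Hypothetical data with one missing value, made up for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, NA),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.0),
  z = c(5, 3, 4, 1, 2, 6)
)

# Pearson (the default) and Spearman correlations, dropping incomplete rows
r_pearson  <- cor(toy$x, toy$y, use = "complete.obs")
r_spearman <- cor(toy$x, toy$y, method = "spearman", use = "complete.obs")

# Simple linear regression written as response ~ explanatory;
# lm() omits rows with missing values by default
fit <- lm(y ~ x, data = toy)
print(coef(fit))   # estimated intercept and slope

# Keep a subset of columns, then compute a Pearson correlation matrix
toy_sub <- select(toy, x, y, z)
cor_mat <- cor(toy_sub, use = "complete.obs")
print(round(cor_mat, 2))
```

In the correlation matrix, each off-diagonal entry is the correlation between the variable naming its row and the variable naming its column, so comparing two entries in the same column shows which variable is more strongly associated with that column's variable.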