Sta-209 (Spring 2025) Homework #4

Directions:

Submit your work via the “Assignments” tab on Canvas
For this assignment you should record your answers/code using R Markdown
- Please upload HTML, Word, or PDF output created using R Markdown and make sure it contains your code, output, and written answers. You should not include extraneous output, such as printing an entire data frame.
- At this point in the course you are responsible for knowing how to properly knit an R Markdown document, so uploading any other file format will result in a point deduction on the assignment
Homework is an individual assignment. It’s okay to check your work or collaborate with your classmates, student mentors, and others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you get from individuals other than yourself, or resources other than the materials on our course website (such as external websites and AI)

Question #1

In this question you will work with the “TSA claims” data set that you’ve encountered in several of our in-class labs. In the first portion of the analysis you will create a subset of the full sample that you will continue using throughout the remainder of this question.

tsa_data = read.csv("https://remiller1450.github.io/data/tsa_small.csv")

Part A: Use the filter() function from the dplyr package (discussed in Lab 5) to create a subset of TSA claims that were settled (which is recorded in the variable Status). Then create a scatterplot to graph the relationship between the variables Claim_Amount and Close_Amount. Include both a linear regression line and a loess smoother on your graph to help you judge the relationship.
Part B: Briefly describe the relationship you see in the scatterplot you created in Part A. Be sure to address the form, direction, and strength of the relationship.
Part C: Would Pearson’s or Spearman’s correlation be a more appropriate method for quantifying the strength of association in these data? Briefly explain.
Part D: Consider the hypothesis that Claim_Amount and Close_Amount are unrelated for settled claims against the TSA. Perform a statistical test using the measure of association you identified in Part C to evaluate this claim. In addition to the R code needed to perform the test, provide a one-sentence conclusion that includes context, strength of evidence (the \(p\)-value), and the type of relationship (sample correlation and a description).
Part E: Consider a regression analysis involving these variables. Based upon the context of these data, which variable should be the explanatory variable and which variable should be the response (outcome) variable? Briefly explain your answer.
Part F: Fit a simple linear regression model to the data and report the model’s estimated slope and intercept. Provide a brief interpretation of both the slope and intercept, indicating whether you believe the intercept to be meaningful in the context of these data.
Part G: Consider a hypothesis test aimed at evaluating whether Claim_Amount and Close_Amount are related using the regression model described in Part F. What is the null hypothesis of this test? Provide your answer using the appropriate statistical symbols.
Part H: Complete the hypothesis test described in Part G using your fitted regression model. Report a one-sentence conclusion that includes context, strength of evidence (the \(p\)-value), and the type of relationship (slope estimate and a description).
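
As a reminder of the general workflow used in this question (not a solution), here is a minimal sketch on made-up data. The data frame `toy` and its columns `status`, `x`, and `y` are hypothetical and are not part of the TSA data; the sketch also assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: names and values are invented for illustration only
set.seed(1)
toy <- data.frame(
  status = rep(c("Open", "Closed"), each = 50),
  x = rexp(100, rate = 1 / 200)
)
toy$y <- 0.5 * toy$x + rnorm(100, sd = 40)

# Keep only the rows that satisfy a logical condition
closed <- filter(toy, status == "Closed")

# Scatterplot with a straight-line fit (blue) and a loess smoother (red)
p <- ggplot(closed, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red")
print(p)

# Correlation tests: in both, the null hypothesis is a correlation of zero
pearson_test <- cor.test(closed$x, closed$y, method = "pearson")
spearman_test <- cor.test(closed$x, closed$y, method = "spearman")

# Simple linear regression: the "x" row of the coefficient table
# tests the null hypothesis that the slope equals zero
fit <- lm(y ~ x, data = closed)
coef(fit)
summary(fit)$coefficients
```

The estimate and \(p\)-value for a correlation test are stored in the `estimate` and `p.value` elements of the object returned by `cor.test()`. Which correlation method suits the TSA data is left for you to decide in Part C.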


Question #2

Researchers at Harvard Medical School conducted an experiment where infants born with congenital heart defects were randomly assigned to receive one of two different open-heart surgeries, low-flow bypass or circulatory arrest. Two years after the surgery the researchers followed up on each infant and assessed their development, measured by their MDI (mental development index) and PDI (psychomotor development index) scores. The data from this experiment are loaded into R by the code provided below:

ih_data = read.csv("https://remiller1450.github.io/data/InfantHeart.csv")

Part A: Suppose the researchers want to compare the mental development index scores of children who received each type of surgery. Does this scenario describe one-sample or two-sample data? Is the outcome categorical or quantitative? You may state your answer without any explanation.
Part B: Create an appropriate data visualization showing the relationship between the variables involved in the analysis described in Part A.
Part C: Perform an appropriate hypothesis test evaluating the hypothesis that MDI scores are not impacted by the type of heart surgery a child received. Include the R code used to perform the test along with a one-sentence conclusion that includes context, strength of evidence, and relevant descriptive statistics.
Part D: In addition to MDI, the study also looked at each child’s psychomotor development index. Create a scatterplot displaying the relationship between these two variables, and briefly describe the relationship you see in the scatterplot. Your plot does not need to consider any other variables in the data set.
Part E: Perform an appropriate hypothesis test to evaluate whether these data provide evidence that MDI and PDI are related in infants who receive open-heart surgery. Provide the R code needed for the test along with a one-sentence conclusion that includes context, strength of evidence (the \(p\)-value), and the type of relationship.
Part F: Create a correlation matrix of the quantitative variables contained in the data set using Pearson’s correlation. Use this matrix to identify the pair of variables that are most strongly related as well as the pair of variables that are least strongly related.
Part G: Filter the data to only include infants who received the “Low-flow bypass” surgery, then consider a linear regression model that predicts the infant’s PDI score based upon their body weight (in grams) at birth. Use this model to evaluate whether there is evidence that body weight and PDI score are related among these infants. Provide the R code needed to perform the test as well as a one-sentence conclusion that includes context, strength of evidence (the \(p\)-value), and the type of relationship (the slope estimate and a description).
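
The kinds of analyses named in this question can be sketched on made-up data as follows. This is only an illustration under stated assumptions, not a solution: the data frame `toy` and its columns `group`, `score1`, `score2`, and `weight` are hypothetical and do not come from the InfantHeart data, and the dplyr and ggplot2 packages are assumed to be installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: names and values are invented for illustration only
set.seed(2)
toy <- data.frame(
  group = rep(c("A", "B"), each = 40),
  score1 = rnorm(80, mean = 100, sd = 15),
  weight = rnorm(80, mean = 3300, sd = 400)
)
toy$score2 <- 0.6 * toy$score1 + rnorm(80, mean = 40, sd = 10)

# Side-by-side boxplots: one quantitative outcome across two groups
p <- ggplot(toy, aes(x = group, y = score1)) + geom_boxplot()
print(p)

# Two-sample t-test (R uses Welch's unequal-variance version by default)
tt <- t.test(score1 ~ group, data = toy)

# Descriptive statistics by group to accompany the test
group_stats <- toy %>%
  group_by(group) %>%
  summarize(mean = mean(score1), sd = sd(score1), n = n())

# Pearson correlation matrix of the numeric columns only
# (real data with missing values may need use = "complete.obs")
num_cols <- toy[sapply(toy, is.numeric)]
cor_mat <- cor(num_cols, method = "pearson")
round(cor_mat, 2)

# Regression within a single group, fit after filtering
fit_a <- lm(score2 ~ weight, data = filter(toy, group == "A"))
summary(fit_a)$coefficients
```

When reading a correlation matrix, ignore the diagonal (each variable is perfectly correlated with itself) and compare the absolute values of the off-diagonal entries.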