Directions:
- Submit your work via the “Assignments” tab on Canvas
- For this assignment you should record your answers/code using
R Markdown
- Please upload HTML, Word, or PDF output created using R Markdown and
make sure it contains your code, output, and written answers. You should
not include extraneous output, such as printing an entire data
frame.
- At this point in the course you are responsible for knowing how to
properly knit an R Markdown document, so uploading any other file format
will result in a point deduction on the assignment
- Homework is an individual assignment. It’s okay to check
your work or collaborate with your classmates, student mentors, and
others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you get from individuals other
than yourself, or resources other than the materials on our course
website (such as external websites and AI)
Question #1
In this question you will work with the “TSA claims” data set that
you’ve encountered in several of our in-class labs. In the first portion
of the analysis you will create a subset of the full sample that you will
continue using throughout the remainder of this question.
tsa_data = read.csv("https://remiller1450.github.io/data/tsa_small.csv")
- Part A: Use the filter() function from the dplyr package (discussed in
Lab 5) to create a subset of TSA claims that were settled (which is
recorded in the variable Status). Then create a scatterplot to graph the
relationship between the variables Claim_Amount and Close_Amount.
Include both a linear regression line and a
loess smoother on your graph to help you judge the relationship.
- Part B: Briefly describe the relationship you see
in the scatterplot you created in Part A. Be sure to address the
form, direction, and strength of the
relationship.
- Part C: Would Pearson’s or Spearman’s correlation
be a more appropriate method for quantifying the strength of association
in these data? Briefly explain.
- Part D: Consider the hypothesis that Claim_Amount and Close_Amount are
unrelated for settled claims against the TSA. Perform a statistical test
using the measure of association you identified in Part C to evaluate
this claim. In addition to the R code needed to perform the test,
provide a one-sentence conclusion that includes context,
strength of evidence (the \(p\)-value), and the type of
relationship (sample correlation and a description).
- Part E: Consider a regression analysis involving
these variables. Based upon the context of these data, which variable
should be the explanatory variable and which variable should be
the response (outcome) variable? Briefly explain your
answer.
- Part F: Fit a simple linear regression model to the
data and report the model’s estimated slope and intercept. Provide a
brief interpretation of both the slope and intercept,
indicating whether you believe the intercept to be meaningful in the
context of these data.
- Part G: Consider a hypothesis test aimed at evaluating whether
Claim_Amount and Close_Amount are related using the regression model
described in Part F. What is the null hypothesis of this test? Provide
your answer using the appropriate statistical symbols.
- Part H: Complete the hypothesis test described in
Part G using your fitted regression model. Report a one-sentence
conclusion that includes context, strength of evidence
(the \(p\)-value), and the type of
relationship (slope estimate and a description).
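
For reference, the general workflow behind Parts A through H can be sketched on simulated data. This is only an illustrative sketch, not a solution: the data frame `toy` and its columns `status`, `x`, and `y` are hypothetical stand-ins invented here, and the category labels "A" and "B" are assumptions that do not correspond to values in the TSA data.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Simulated stand-in data (hypothetical; NOT the TSA claims data)
toy <- data.frame(
  status = rep(c("A", "B"), each = 50),
  x = rexp(100, rate = 1 / 200)
)
toy$y <- 0.5 * toy$x + rnorm(100, sd = 20)

# Keep only the rows belonging to one category
toy_sub <- filter(toy, status == "A")

# Scatterplot with a linear regression line and a loess smoother
p <- ggplot(toy_sub, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red")
print(p)

# Correlation tests: choose the method that suits the form of the data
cor.test(toy_sub$x, toy_sub$y, method = "pearson")
cor.test(toy_sub$x, toy_sub$y, method = "spearman")

# Simple linear regression: estimates, standard errors, and p-values
fit <- lm(y ~ x, data = toy_sub)
summary(fit)$coefficients
```

The row labeled `x` in the coefficient table holds the slope estimate and the p-value for the test that the slope is zero.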
Question #2
Researchers at Harvard Medical School conducted an experiment where
infants born with congenital heart defects were randomly assigned to
receive one of two different open-heart surgeries, low-flow bypass or
circulatory arrest. Two years after the surgery the researchers followed
up on each infant and assessed their development, measured by their MDI
(mental development index) and PDI (psychomotor development index)
scores. The data from this experiment are loaded into R by the code
provided below:
ih_data = read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
- Part A: Suppose the researchers want to compare the
mental development index scores of children who received each type of
surgery. Does this scenario describe one-sample or two-sample data? Is
the outcome categorical or quantitative? You may state your answer
without any explanation.
- Part B: Create an appropriate data visualization
showing the relationship between the variables involved in the analysis
described in Part A.
- Part C: Perform an appropriate hypothesis test
evaluating the hypothesis that MDI scores are not impacted by the type
of heart surgery a child received. Include the R code used
to perform the test along with a one-sentence conclusion that includes
context, strength of evidence, and relevant
descriptive statistics.
- Part D: In addition to MDI, the study also looked at
each child’s psychomotor development index. Create a scatterplot
displaying the relationship between these two variables, and briefly
describe the relationship you see in the scatterplot. Your plot does not
need to consider any other variables in the data set.
- Part E: Perform an appropriate hypothesis test to
evaluate whether these data provide evidence that MDI and PDI are
related in infants who receive open-heart surgery. Provide the R code
needed for the test along with a one-sentence conclusion that includes
context, strength of evidence
(the \(p\)-value), and the type of
relationship.
- Part F: Create a correlation matrix of the
quantitative variables contained in the data set using Pearson’s
correlation. Use this matrix to identify the pair of variables that are
most strongly related as well as the pair of variables that are
least strongly related.
- Part G: Filter the data to only include infants who
received the “Low-flow bypass” surgery, then consider a linear
regression model that predicts the infant’s PDI score based upon their
body weight (in grams) at birth. Use this model to evaluate whether
there is evidence that body weight and PDI score are related among these
infants. Provide the R code needed to perform the test as well
as a one-sentence conclusion that includes context,
strength of evidence (the \(p\)-value), and the type of
relationship (the slope estimate and a description).
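
As with Question #1, the kinds of tools this question calls for can be sketched on simulated data. This is a hedged illustration rather than a solution: the data frame `toy` and its columns `group`, `score1`, `score2`, and `weight` are hypothetical names made up for this sketch and are not the variable names in InfantHeart.csv.

```r
set.seed(2)
# Simulated stand-in data (hypothetical; NOT the infant heart data)
toy <- data.frame(
  group = rep(c("g1", "g2"), each = 40),
  score1 = rnorm(80, mean = 100, sd = 15),
  weight = rnorm(80, mean = 3400, sd = 500)
)
toy$score2 <- 0.6 * toy$score1 + rnorm(80, sd = 10)

# Quantitative outcome compared across two groups: side-by-side boxplots
boxplot(score1 ~ group, data = toy)

# Two-sample t-test (Welch's version is the default in t.test)
t.test(score1 ~ group, data = toy)

# Pearson correlation matrix of the numeric columns only
num_vars <- toy[, sapply(toy, is.numeric)]
cor_mat <- cor(num_vars, method = "pearson")
round(cor_mat, 2)

# Regression within one group: subset first, then fit the model
fit_g1 <- lm(score2 ~ weight, data = subset(toy, group == "g1"))
summary(fit_g1)$coefficients
```

In the correlation matrix, the off-diagonal entries with the largest and smallest absolute values identify the most and least strongly related pairs.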