Goals:
The purpose of this lab is to provide you with hands-on practice using correlation and regression to describe relationships between quantitative variables. The lab will also include questions related to interval estimation and confidence intervals.
Directions:
\(~\)
This dataset comes from a clinical trial exploring the survival of breast cancer patients. The data were filtered to include only patients who died from a recurrence of their cancer (i.e., those still alive at the end of the study were excluded). More complex statistical methods would be necessary to analyze data that also include survivors.
The variables documented in these data are described below:
Question #1: The questions in this lab will focus on “Time” as an outcome variable. For this question, use StatKey to provide a univariate summary of survival times. Be sure to comment on the shape, central tendency, and spread of the variable. You do not need to add any graphs/output to your lab write-up.
\(~\)
In this section you will explore the relationships between various explanatory variables and survival time.
Question #2: Describe the relationship between tumor size (mm) at the time of diagnosis and survival time. In doing so, provide a point estimate for the correlation between these two variables.
Question #3: Using the 2-SE method, find a bootstrap 95% confidence interval estimate of the population-level correlation between tumor size and survival time. Show the arithmetic behind how you calculated the interval’s endpoints. Then, briefly explain the benefits of reporting this interval alongside the point estimate you found in Question #2.
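The 2-SE arithmetic can be sketched in a few lines of Python. This is only an illustration of what a bootstrap tool is doing behind the scenes, not StatKey's actual implementation. The data values, the function names `pearson_r` and `two_se_interval`, and the choice of 2000 resamples are all assumptions made for the example. The interval is the point estimate plus or minus two bootstrap standard errors.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def two_se_interval(xs, ys, n_boot=2000, seed=1):
    """Return (estimate, SE, (low, high)) using estimate +/- 2 * SE."""
    rng = random.Random(seed)
    n = len(xs)
    boot_rs = []
    while len(boot_rs) < n_boot:
        # Resample whole cases (x, y pairs) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        # Skip degenerate resamples where the correlation is undefined.
        if len(set(bx)) > 1 and len(set(by)) > 1:
            boot_rs.append(pearson_r(bx, by))
    estimate = pearson_r(xs, ys)
    se = statistics.stdev(boot_rs)  # SE = SD of the bootstrap statistics
    return estimate, se, (estimate - 2 * se, estimate + 2 * se)

# Made-up values for illustration only; these are NOT the lab data.
size_mm = [12, 18, 22, 25, 30, 34, 41, 47, 52, 60]
time = [80, 62, 75, 58, 66, 49, 55, 38, 44, 30]

r, se, (low, high) = two_se_interval(size_mm, time)
print(f"r = {r:.3f}, SE = {se:.3f}, interval = ({low:.3f}, {high:.3f})")
```

In your write-up you would show the same two steps by hand: estimate minus 2 times SE for the lower endpoint, and estimate plus 2 times SE for the upper endpoint.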
Question #4: Using StatKey, describe the relationship between age and survival time. Be sure to address the form, strength, and direction of the relationship you observe.
Question #5: Use StatKey to find the intercept and slope of the estimated regression model that predicts survival time from age. Write out the model using proper statistical notation, then provide a brief interpretation of the intercept and slope of the model.
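For reference, a fitted simple linear regression model is usually written as \(\widehat{\text{Time}} = b_0 + b_1 \cdot \text{Age}\), where \(b_0\) is the intercept and \(b_1\) is the slope. The sketch below shows how least squares produces those two numbers. The ages and times are made up for illustration and are not the lab data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx  # the fitted line passes through (mean x, mean y)
    return intercept, slope

# Made-up values for illustration only; these are NOT the lab data.
age = [40, 50, 60, 70]
time = [30, 26, 24, 20]

b0, b1 = fit_line(age, time)
print(f"predicted time = {b0:.2f} + ({b1:.2f}) * age")
# prints: predicted time = 42.60 + (-0.32) * age
```

With these made-up numbers, the slope says predicted time drops by 0.32 units for each additional year of age, and the intercept is the predicted time at an age of 0, which is far outside the range of the data.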
Question #6: Using the percentile method, find a 95% confidence interval estimate of the population-level slope relating age and survival time. Based upon this interval, can you be confident that age is truly related to survival time? Or does the interval suggest it might be plausible for age and survival to be unrelated within the broader population? (Hint: use the drop-down menu in StatKey to switch from correlation to slope before bootstrapping)
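The percentile method can be sketched as follows: bootstrap the slope many times, sort the results, and read off the values that cut off the lowest 2.5% and the highest 2.5%. This is a minimal illustration under stated assumptions, not StatKey's exact procedure. The data are made up, `percentile_interval` is a hypothetical helper, and the simple index-based cutoff used here is one of several common ways to pick percentiles.

```python
import random

def slope(xs, ys):
    """Least-squares slope for predicting y from x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def percentile_interval(xs, ys, level=0.95, n_boot=2000, seed=1):
    """Percentile bootstrap interval for the slope."""
    rng = random.Random(seed)
    n = len(xs)
    boot = []
    while len(boot) < n_boot:
        # Resample whole cases (x, y pairs) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) > 1:  # the slope is undefined if every x is identical
            boot.append(slope(bx, [ys[i] for i in idx]))
    boot.sort()
    tail = (1 - level) / 2        # 0.025 in each tail for a 95% interval
    lo_idx = int(tail * n_boot)
    hi_idx = n_boot - 1 - lo_idx
    return boot[lo_idx], boot[hi_idx]

# Made-up values for illustration only; these are NOT the lab data.
age = [34, 41, 47, 52, 58, 63, 69, 75]
time = [62, 58, 55, 49, 47, 40, 38, 31]

low, high = percentile_interval(age, time)
contains_zero = low <= 0 <= high
print(f"95% interval for the slope: ({low:.3f}, {high:.3f})")
print("0 is inside the interval" if contains_zero else "0 is outside the interval")
```

Checking whether 0 falls inside the interval is the key step, since a slope of 0 corresponds to no linear relationship between the two variables.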
\(~\)
Question #7: Use StatKey to find the difference in mean survival times of pre- and post-menopausal women. Based upon this difference, do post-menopausal women appear to have longer survival times?
Question #8: Using the 2-SE method, find a bootstrap 95% confidence interval estimate for the difference in mean survival times of pre- and post-menopausal women. Based upon this interval, can you be confident that a difference exists between these two groups at the population level?
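For a difference in means, the bootstrap works on two groups at once. The sketch below assumes each group is resampled separately with replacement, keeping its original sample size, and that the difference is taken as post minus pre. The values are made up for illustration and `diff_in_means_2se` is a hypothetical helper name.

```python
import random
import statistics

def diff_in_means_2se(group_a, group_b, n_boot=2000, seed=1):
    """Return (estimate, SE, (low, high)) for mean(a) - mean(b)."""
    rng = random.Random(seed)
    boot = []
    for _ in range(n_boot):
        # Resample each group on its own, keeping the group sizes fixed.
        ra = rng.choices(group_a, k=len(group_a))
        rb = rng.choices(group_b, k=len(group_b))
        boot.append(statistics.fmean(ra) - statistics.fmean(rb))
    estimate = statistics.fmean(group_a) - statistics.fmean(group_b)
    se = statistics.stdev(boot)  # SE = SD of the bootstrap differences
    return estimate, se, (estimate - 2 * se, estimate + 2 * se)

# Made-up values for illustration only; these are NOT the lab data.
post = [52, 61, 47, 70, 58, 66, 49]
pre = [45, 59, 50, 63, 41, 55]

est, se, (low, high) = diff_in_means_2se(post, pre)
print(f"difference = {est:.2f}, SE = {se:.2f}, interval = ({low:.2f}, {high:.2f})")
```

Be sure to state which group is subtracted from which in your write-up, since that determines the sign of the difference and of the interval's endpoints.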
Question #9: Now, use the percentile method to find a bootstrap 99% confidence interval for the difference in mean survival times of pre- and post-menopausal women, and notice that the interval is wider than the one you found in Question #8. What is the primary reason the interval is wider? Briefly explain.
\(~\)
In Lab #1 you analyzed data published by the College Scorecard. For this portion of Lab #2 I’d like you to revisit this dataset and analyze the relationship between two quantitative variables. Provided below are a link to the data and a description of the variables it contains:
Name - Name of the institution
City - City where the institution is located
State - State where the institution is located
Enrollment - Number of full-time enrolled students
Private - Binary indicator distinguishing public and private institutions
Region - Geographic region
Adm_Rate - Admissions rate, the proportion of applicants who are admitted
ACT_median - Median composite ACT score of enrolled students
ACT_Q1 - 25th percentile composite ACT score of enrolled students
ACT_Q3 - 75th percentile composite ACT score of enrolled students
Cost - Average yearly cost of attendance
Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)
Avg_Fac_Salary - Average faculty salary
PercentFemale - Proportion of enrolled students who are female
PercentWhite - Proportion of enrolled students who identify as White
PercentBlack - Proportion of enrolled students who identify as Black
PercentHispanic - Proportion of enrolled students who identify as Hispanic
PercentAsian - Proportion of enrolled students who identify as Asian
FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment
FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment
Debt_median - Median student debt upon leaving the institution
Salary10yr_median - Median salary 10 years after graduating from the institution
Question #10: Choose any two quantitative variables that you think might be related. Then, use StatKey to estimate a regression model that summarizes the relationship between these variables. Provide the slope and intercept of your model, along with brief interpretations, in your lab write-up.
Question #11: Using the model you found in Question #10, find the residual of Xavier University and provide a brief explanation of what information this residual provides about Xavier in relation to the other schools in the dataset. (Hint: Open the data spreadsheet and filter or scroll to find Xavier, then determine its value for your explanatory variable and use it to generate a predicted outcome. Next, compare this prediction with the observed outcome to find the residual.)
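The steps in the hint can be sketched as a short calculation. The intercept, slope, and school values below are placeholders chosen so the arithmetic is easy to follow. They are not taken from the College Scorecard data, and `predict` and `residual` are hypothetical helper names.

```python
def predict(intercept, slope, x):
    """Predicted outcome from a fitted line: y-hat = b0 + b1 * x."""
    return intercept + slope * x

def residual(observed, predicted):
    """Residual = observed outcome minus predicted outcome."""
    return observed - predicted

# Placeholder numbers for illustration only; substitute your own model and data.
b0, b1 = 10.0, 0.5   # intercept and slope of a hypothetical fitted model
x_school = 30.0      # the school's value of the explanatory variable
y_school = 27.0      # the school's observed outcome

y_hat = predict(b0, b1, x_school)   # 10.0 + 0.5 * 30.0 = 25.0
res = residual(y_school, y_hat)     # 27.0 - 25.0 = 2.0
print(f"predicted = {y_hat}, residual = {res}")
```

A positive residual means the observed outcome is above what the model predicts for a school with that explanatory value, and a negative residual means it is below.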