Goals:

The purpose of this lab is to provide you with hands-on practice using correlation and regression to describe relationships between quantitative variables. The lab will also include questions related to interval estimation and confidence intervals.

Directions:

  • You are expected to progress through the analyses described in this document as a group, recording your answers in a shared document. It’s completely up to your group how you’d like to organize this: some groups like using a shared Google Doc, while others might designate one person to be the group’s recorder.
  • You are expected to work together; any attempt to “divide and conquer” the lab questions may result in point deductions on your group’s lab score.
  • Labs are graded primarily for completion, and we will get together as a group for the last 10-15 minutes of class to discuss some of the lab questions. This means you should focus on learning the material (while also helping the teammates in your group) rather than seeing labs as an assessment (like homework or exams).
  • Please upload your responses to the Lab’s questions on Canvas. The expectation is that everyone uploads their own copy (they can be identical within your group).
  • Use the Snipping Tool on Windows or the screenshot tool on a Mac to add screenshots to your lab write-up as requested.


Dataset #1 - Breast Cancer Survival Times

This dataset comes from a clinical trial exploring the survival of breast cancer patients. The data were filtered to only include patients who died from a recurrence of their cancer (i.e., those still alive at the end of the study were excluded). More complex statistical methods would be necessary to analyze data that also includes survivors.

The variables documented in these data are described below:

  • Time: The outcome variable (days the patient survived after their diagnosis)
  • Age: age in years
  • Cycles: Cycles of chemotherapy (either 3 or 6)
  • Menopause: Menopausal status (Pre/Post)
  • Size: Tumor size (mm) at the time of diagnosis
  • Grade: Tumor grade (I/II/III)
  • Nodes: Number of positive lymph nodes (more severe cases of cancer often spread to the lymph nodes)
  • PR: Progesterone receptor status (fmol/mg) (certain types of tumors are driven by progesterone)
  • ER: Estrogen receptor status (fmol/mg) (certain types of tumors are driven by estrogen)

Univariate Analysis

Question #1: The questions in this lab will focus on “Time” as an outcome variable. For this question, use StatKey to provide a univariate summary of survival times. Be sure to comment on the shape, central tendency, and spread of the variable. You do not need to add any graphs/output to your lab write-up.


Quantitative Predictors

In this section you will explore the relationships between various explanatory variables and survival time.

Question #2: Describe the relationship between tumor size (mm) at the time of diagnosis and survival time. In doing so, provide a point estimate for the correlation between these two variables.

Question #3: Using the 2-SE method, find a bootstrap 95% confidence interval estimate of the population-level correlation between tumor size and survival time. Show the arithmetic behind how you calculated the interval’s endpoints. Then, briefly explain the benefits of reporting this interval alongside the point estimate you found in Question #2.
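
If it helps to see the 2-SE arithmetic outside of StatKey, the Python sketch below resamples (size, time) pairs with replacement, uses the standard deviation of the bootstrap correlations as the standard error, and forms the interval as estimate ± 2 × SE. Everything specific here is an assumption made for illustration: the data values are made up (they are not the lab's data), the helper names `correlation` and `bootstrap_stats` are hypothetical, and the 2,000 resamples and the seed are arbitrary choices. It is not meant to reproduce StatKey's internals.

```python
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_stats(xs, ys, stat, reps=2000, seed=1):
    """Resample (x, y) pairs with replacement and recompute `stat` each time."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(xs)
    results = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        results.append(stat([xs[i] for i in idx], [ys[i] for i in idx]))
    return results


# Made-up illustrative values, NOT the breast cancer data.
size = [12, 18, 20, 25, 27, 30, 34, 40, 45, 52]
time = [2100, 1900, 2300, 1500, 1700, 1200, 1400, 900, 1100, 600]

r = correlation(size, time)                       # point estimate
boot = bootstrap_stats(size, time, correlation)   # bootstrap distribution
se = statistics.stdev(boot)                       # bootstrap standard error
lower, upper = r - 2 * se, r + 2 * se             # 2-SE interval endpoints

print(f"r = {r:.3f}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

In StatKey the same two ingredients, the original sample statistic and the standard error of the bootstrap distribution, are displayed after you generate the bootstrap samples.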

Question #4: Using StatKey, describe the relationship between age and survival time. Be sure to address the form, strength, and direction of the relationship you observe.

Question #5: Use StatKey to find the intercept and slope of the estimated regression model that predicts survival time from age. Write out the model using proper statistical notation, then provide a brief interpretation of the intercept and slope of the model.
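
For reference, here is a minimal sketch of the least-squares formulas that produce an intercept and slope. The ages and survival times are invented for illustration (they are not the lab's data), and `fit_line` is a hypothetical helper name rather than anything StatKey exposes.

```python
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y-hat = b0 + b1 * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return intercept, slope


# Made-up illustrative values, NOT the breast cancer data.
age = [38, 42, 45, 50, 53, 57, 60, 64, 68, 71]
time = [1500, 1320, 1610, 1180, 1400, 1250, 990, 1300, 1050, 1120]

b0, b1 = fit_line(age, time)
print(f"predicted time = {b0:.1f} + ({b1:.2f}) * age")
```

The printed line has the same form as the model you are asked to write out: a predicted outcome equal to the intercept plus the slope times the explanatory variable.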

Question #6: Using the percentile method, find a 95% confidence interval estimate of the population-level slope relating age and survival time. Based upon this interval, can you be confident that age is truly related to survival time? Or does the interval suggest it might be plausible for age and survival to be unrelated within the broader population? (Hint: use the drop-down menu in StatKey to switch from correlation to slope before bootstrapping)
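
The percentile method can be sketched in a few lines of Python: generate many bootstrap slopes, sort them, and cut an equal share off each tail. The data below are made up (not the lab's data), `slope` and `percentile_interval` are hypothetical helper names, and the index-based cutoff rule is one simple convention assumed here; StatKey's exact rule for locating percentiles may differ slightly.

```python
import random
import statistics


def slope(xs, ys):
    """Least-squares slope for predicting y from x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def percentile_interval(values, level):
    """Cut an equal share off each tail; return the endpoints of what remains."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lo_idx = round(tail * len(ordered))
    hi_idx = round((1 - tail) * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Made-up illustrative values, NOT the breast cancer data.
age = [38, 42, 45, 50, 53, 57, 60, 64, 68, 71]
time = [1500, 1320, 1610, 1180, 1400, 1250, 990, 1300, 1050, 1120]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = len(age)
boot_slopes = []
for _ in range(2000):
    # Resample (age, time) pairs with replacement, then refit the slope.
    idx = [rng.randrange(n) for _ in range(n)]
    boot_slopes.append(slope([age[i] for i in idx], [time[i] for i in idx]))

lower, upper = percentile_interval(boot_slopes, 0.95)
print(f"95% percentile interval for the slope: ({lower:.2f}, {upper:.2f})")
```

With 2,000 bootstrap slopes and a 95% level, this drops the 50 smallest and 50 largest values and reports the endpoints of the middle 1,900.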


Categorical Predictors

Question #7: Use StatKey to find the difference in mean survival times of pre- and post-menopausal women. Based upon this difference, do post-menopausal women appear to have longer survival times?

Question #8: Using the 2-SE method, find a bootstrap 95% confidence interval estimate for the difference in mean survival times of pre- and post-menopausal women. Based upon this interval, can you be confident that a difference exists between these two groups at the population level?

Question #9: Now, use the percentile method to find a bootstrap 99% confidence interval for the difference in mean survival times of pre- and post-menopausal women and notice how the interval is wider than the one you found in Question #8. What is the primary reason for the interval being wider? Briefly explain.
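
For a difference in means, the bootstrap resamples each group separately. The sketch below computes both kinds of interval used in this section so you can see the arithmetic side by side. The survival times are invented (not the lab's data), the group sizes, seed, and 2,000 resamples are arbitrary, and `percentile_interval` is a hypothetical helper using one simple cutoff convention that may differ slightly from StatKey's.

```python
import random
import statistics


def percentile_interval(values, level):
    """Cut an equal share off each tail; return the endpoints of what remains."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lo_idx = round(tail * len(ordered))
    hi_idx = round((1 - tail) * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]


# Made-up illustrative survival times (days), NOT the breast cancer data.
pre = [1420, 980, 1760, 1150, 2010, 1330, 890, 1610]
post = [1210, 1540, 760, 1890, 1020, 1450, 1680, 930, 1370]

# Observed difference in sample means (post minus pre).
observed = statistics.fmean(post) - statistics.fmean(pre)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
boot_diffs = []
for _ in range(2000):
    # Resample within each group, with replacement, keeping group sizes fixed.
    pre_star = rng.choices(pre, k=len(pre))
    post_star = rng.choices(post, k=len(post))
    boot_diffs.append(statistics.fmean(post_star) - statistics.fmean(pre_star))

se = statistics.stdev(boot_diffs)
ci95_2se = (observed - 2 * se, observed + 2 * se)    # 2-SE method, 95%
ci99_pct = percentile_interval(boot_diffs, 0.99)     # percentile method, 99%

print(f"observed difference = {observed:.1f}")
print(f"95% (2-SE):       ({ci95_2se[0]:.1f}, {ci95_2se[1]:.1f})")
print(f"99% (percentile): ({ci99_pct[0]:.1f}, {ci99_pct[1]:.1f})")
```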


Dataset #2 - College Scorecard

In Lab #1 you analyzed data published by the College Scorecard. For this portion of Lab #2 I’d like you to revisit this dataset and analyze the relationship between two quantitative variables. Provided below are a link to the data and a description of the variables it contains:

  • College Scorecard Dataset Link

  • Name - Name of the institution

  • City - City where the institution is located

  • State - State where the institution is located

  • Enrollment - Number of full-time enrolled students

  • Private - Binary indicator distinguishing public and private institutions

  • Region - Geographic region

  • Adm_Rate - Admissions rate, the proportion of applicants who are admitted

  • ACT_median - Median composite ACT score of enrolled students

  • ACT_Q1 - 25th percentile composite ACT score of enrolled students

  • ACT_Q3 - 75th percentile composite ACT score of enrolled students

  • Cost - Average yearly cost of attendance

  • Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)

  • Avg_Fac_Salary - Average faculty salary

  • PercentFemale - Proportion of enrolled students who are female

  • PercentWhite - Proportion of enrolled students who identify as White

  • PercentBlack - Proportion of enrolled students who identify as Black

  • PercentHispanic - Proportion of enrolled students who identify as Hispanic

  • PercentAsian - Proportion of enrolled students who identify as Asian

  • FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment

  • FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment

  • Debt_median - Median student debt upon leaving the institution

  • Salary10yr_median - Median salary 10 years after graduating from the institution

Analysis

Question #10: Choose any two quantitative variables that you think might be related. Then, use StatKey to estimate a regression model that summarizes the relationship between these variables. Provide the slope and intercept of your model, along with brief interpretations, in your lab write-up.

Question #11: Using the model you found in Question #10, find the residual of Xavier University and provide a brief explanation of what information this residual provides about Xavier in relation to the other schools in the dataset. (Hint: Open the data spreadsheet and filter or scroll to find Xavier, then determine its value for your explanatory variable and use it to generate a predicted outcome. Next, compare this prediction with the observed outcome to find the residual.)
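
The steps in the hint amount to two lines of arithmetic, sketched below in Python. The intercept, slope, and school values are hypothetical numbers chosen only to make the arithmetic easy to follow; they do not come from the College Scorecard data, and `predict` and `residual` are illustrative helper names.

```python
def predict(intercept, slope, x):
    """Predicted outcome from a fitted line: y-hat = intercept + slope * x."""
    return intercept + slope * x


def residual(observed_y, predicted_y):
    """Residual = observed outcome minus predicted outcome."""
    return observed_y - predicted_y


# Hypothetical model and school, NOT taken from the College Scorecard data.
b0, b1 = 20000.0, 500.0   # assumed intercept and slope
x_school = 25             # assumed explanatory-variable value for the school
y_observed = 34000.0      # assumed observed outcome for the school

y_hat = predict(b0, b1, x_school)   # 20000 + 500 * 25 = 32500.0
res = residual(y_observed, y_hat)   # 34000 - 32500 = 1500.0

print(f"predicted = {y_hat}, residual = {res}")
```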