Directions

  • You may choose to work on this lab individually or with your final project group
    • If you work as a group, all members are responsible for the content of your lab write-up. I strongly encourage you to use voice chat software (Skype, Google Hangouts, etc.) while working together
    • If you choose to work as an individual you are not required to answer questions tagged (Group Only)
  • Read through the entire lab (not just the questions). The lab will introduce course content that you will be responsible for on exams/homework.
  • Answer all questions in a separate document, attaching Minitab output if needed.

Introduction

Multiple regression is a widely used statistical method with many applications; we will focus on two of them:

  • Controlling for confounding in situations with multiple confounding variables
  • Developing predictive models to be applied to new data
    • You will have an opportunity to earn extra credit in this portion of the lab

Professor Salaries and Confounding Variables

In Lab #7 we were introduced to data from a major United States university that were collected as part of the university’s ongoing evaluation of salary differences between its male and female faculty. In our previous attempt to analyze salary differences using the \(t\)-test, we were unable to adequately account for the many confounding variables.

To refresh your memory, the Professor Salary Data includes the following variables:

  • Rank - a categorical variable with levels “AsstProf”, “AssocProf”, and “Prof”. New professors are usually hired at the rank of Assistant Professor and after several years of productivity they are promoted to Associate Professor or released. The promotion to Full Professor occurs after several additional years of productivity at the Associate Professor level.
  • Discipline - A binary variable with levels “A” (theoretical departments) and “B” (applied departments)
  • Yrs.since.phd - The number of years since the professor received their PhD
  • Yrs.service - The number of years the professor has been working for the university
  • Sex - Whether the professor is male or female
  • Salary - The 9 month salary of the professor (in US dollars)

Question #1

Repeat the two-sample \(t\)-test done in Lab #7 that compared the average salaries of male and female professors. Do males or females have higher salaries? Could the observed difference be due to random chance? Can you claim the difference is due to discrimination by the university?

Question #2

In Lab #7 we identified several confounding variables in the relationship between salary and sex. Briefly explain what makes the variable “Rank” a confounding variable. (Hint: you should directly reference the definition of confounding in your answer)

Question #3

Without using multiple regression, describe how we could evaluate the relationship between salary and sex while controlling for the confounding variable of rank.

Question #4

Fit a regression model that uses “Sex” to predict “Salary”. Interpret the slope and the intercept of this model. How could you use this model to test whether males and females have statistically significant differences in salaries? (Hint: you should directly refer to the idea of a reference category in your answer)

Question #5

Fit a new regression model that uses both “Sex” and “Rank” to predict “Salary”. The coefficients of this model are a little trickier to interpret, as the reference category is defined by two different categorical variables. That said, the model still imposes the same difference between male and female salaries within a rank. We describe this difference in salaries as being “adjusted for rank”. What is the average difference in male and female salaries after adjusting for rank? Is the adjusted difference larger or smaller than the unadjusted difference? Is the adjusted difference statistically significant?
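One way to write the model described above is sketched below. It assumes that “Female” and “AsstProf” are the reference categories; Minitab may choose different reference levels, so check your own output. The symbols \(b_0, \ldots, b_3\) are notation introduced here for illustration.

\[
\widehat{\text{Salary}} = b_0 + b_1\,\text{Male} + b_2\,\text{AssocProf} + b_3\,\text{Prof}
\]

Here Male, AssocProf, and Prof are 0/1 indicator variables. Under this coding, the predicted salaries of a male and a female professor of the same rank differ by \(b_1\) no matter which rank is chosen, which is why \(b_1\) is described as the difference “adjusted for rank”.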

Question #6

Rank is not the only confounding variable in these data; in Lab #7 we also identified “yrs.since.phd” and “discipline” as confounding variables. When building regression models, we typically add variables one-by-one and take time to understand what happens after each new addition. For this question, you will add “yrs.since.phd” to the model from Question #5: fit a regression model that uses “Sex”, “Rank”, and “yrs.since.phd” to predict “Salary”. Interpret each of the regression coefficients in this model.

Question #7

Use Adjusted \(R^2\) to compare the fit of the model from Lab Question #6 with the fit of the model from Lab Question #5. Based upon this comparison, do you believe the predictor “yrs.since.phd” significantly improves the model?
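As a reminder of what Adjusted \(R^2\) penalizes, here is a minimal Python sketch of the standard formula. The lab itself uses Minitab; the function name `adjusted_r2` and the numbers below are made up for illustration and are not results from the salary data.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p predictor terms
    (not counting the intercept) fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: adding one term raises R^2 slightly.
smaller_model = adjusted_r2(r2=0.40, n=100, p=3)
larger_model = adjusted_r2(r2=0.41, n=100, p=4)
print(round(smaller_model, 4), round(larger_model, 4))
```

Because every added term increases `p`, the adjusted value rises only when the gain in \(R^2\) outweighs the penalty for the extra term.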

Question #8 (Group Only)

When building regression models, some predictors may have non-linear relationships with the outcome. These relationships can be identified by looking at a graph of the predictor versus the residuals and looking for a pattern. For example, if the residuals show a curved or quadratic pattern when graphed versus a predictor, we should consider using a quadratic effect for that predictor.
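The idea can be illustrated with a small Python sketch on made-up data (the lab itself uses Minitab, and the helper `fit_line` is a hypothetical name). A straight line is fit to a response that is exactly quadratic in the predictor, and the residuals are printed in order of the predictor.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up data with a purely quadratic relationship.
x = list(range(11))
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print([round(r, 1) for r in residuals])
```

The residuals are positive at both ends of the predictor's range and negative in the middle, which is the kind of curved pattern that suggests a quadratic effect.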

For Question #8, construct a graph of “yrs.since.phd” versus the residuals of the model from Lab Question #6 (the model containing “Sex”, “Rank”, and “yrs.since.phd” as predictors). You can ask Minitab for this plot using the “graphs” button in the regression dialog, and then entering “yrs.since.phd” in the bottom box. To help assess trends in the plot, right click on it and add a loess smoother (you can use the default options). Does the smoothed line show a curved pattern in the residuals?

Question #9 (Group Only)

For this question, add a quadratic effect for “yrs.since.phd” to the model from Lab Question #6 (your model should now contain “Sex”, “Rank”, “yrs.since.phd”, and “yrs.since.phd*yrs.since.phd” as explanatory variables). Using ANOVA, evaluate whether using a quadratic effect for yrs.since.phd significantly improves the model’s fit over using just a linear effect. Report your results and a brief conclusion in your lab write-up.
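For reference, the ANOVA comparison of two nested models rests on an \(F\)-statistic. The notation below is introduced here as a sketch and is not defined elsewhere in this lab: SSE is a model's residual (error) sum of squares, \(q\) is the number of terms added, \(n\) is the sample size, and \(k\) is the number of predictor terms in the larger model.

\[
F = \frac{\left(\text{SSE}_{\text{linear}} - \text{SSE}_{\text{quadratic}}\right)/q}{\text{SSE}_{\text{quadratic}}/(n - k - 1)}
\]

In this question \(q = 1\), since only the squared term is added. A large \(F\) (and a correspondingly small p-value) indicates that the quadratic term reduces the unexplained variation by more than would be expected by chance.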

Question #10

The only variable we haven’t yet considered is “yrs.service”; however, we might wonder whether “yrs.since.phd” and “yrs.service” should both be used in our regression model. Find the correlation coefficient between these two variables. Do you think it is necessary to use both variables to predict salary?

Question #11

Fit a model using “Sex”, “Rank”, “yrs.since.phd”, and “yrs.service” as predictors. In the coefficient table there is a column titled “VIF”.

VIF stands for Variance Inflation Factor; it measures how much an explanatory variable overlaps (correlates) with the other variables in the model. A large VIF indicates that a variable is essentially measuring the same effect as something else in the model. Including variables with high VIFs is generally not recommended; a rule-of-thumb cutoff somewhere between 5 and 10 is typically used as grounds to exclude a variable, and variables with VIFs over 10 tend to be viewed as redundant (note: these rules don’t apply to variables involved in quadratic or cubic terms, which naturally are related to the linear effect already present in the model).
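In general, the VIF of predictor \(j\) is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. The Python sketch below covers only the simplest case of exactly two predictors, where \(R_j^2\) is the squared correlation between them. The data and function names are made up for illustration; the model in this question has more predictors, so the VIFs Minitab reports will not match this simplified calculation.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = correlation(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical values, not the Professor Salary Data.
predictor_a = [2, 5, 9, 14, 20, 27, 33, 40]
overlapping = [1, 3, 8, 10, 18, 20, 30, 35]   # tracks predictor_a closely
unrelated = [7, 1, 9, 3, 8, 2, 6, 4]          # little relation to predictor_a

print(round(vif_two_predictors(predictor_a, overlapping), 1))
print(round(vif_two_predictors(predictor_a, unrelated), 2))
```

The strongly overlapping pair produces a VIF far above the 5-10 rule of thumb, while the unrelated pair produces a VIF close to 1.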

For Lab Question #11, use the description of the variables “yrs.service” and “yrs.since.phd” to explain why the VIF of “yrs.service” is high in this model.

Question #12

At this point we’ve thoroughly explored the model building process for these data, and have explicitly considered several different models. Using the knowledge you’ve gained in Lab Questions 1-11, decide upon a regression model and use it to evaluate the adjusted difference in male and female salaries. Provide a brief justification of your model and make a conclusion regarding the male and female salary differences at this university.

Predicting Cancer Survival

The Breast Cancer Data come from a study of breast cancer patients. I’ve subset the original data to only include patients who ended up experiencing death or a recurrence of their cancer. This is done for convenience’s sake; this sort of subsetting isn’t optimal, as there are better ways to analyze these data using more complex statistical methods. Nevertheless, we will analyze the subset using regression modeling.

The variables in the data are listed below:

  • Time: The outcome variable (days the patient survived without recurrence)
  • Age: Age in years
  • Cycles: Cycles of chemotherapy (3/6)
  • Menopause: Menopausal status (Pre/Post)
  • Size: Tumor size (mm)
  • Grade: Tumor grade (I/II/III)
  • Nodes: Number of positive lymph nodes (more severe cases of cancer often spread to the lymph nodes)
  • PR: Progesterone receptor status (fmol/mg) (certain types of tumors are driven by progesterone)
  • ER: Estrogen receptor status (fmol/mg) (certain types of tumors are driven by estrogen)

The goal of this portion of the lab will be to predict survival time on a separate set of 100 patients that I haven’t given you. The idea of evaluating a model on data that weren’t used to fit the model is known as external validation. To perform well on the data that I’ve withheld you’ll need to find a model that is not overfit (high variance) or underfit (high bias).
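The mechanics of external validation can be sketched in a few lines of Python (the lab itself uses Minitab). Everything here is an assumption made for illustration: the data are made up, the helper names are hypothetical, and root mean squared error (RMSE) is used as the accuracy measure even though this lab does not say how predictions on the withheld patients will be scored.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def rmse(x, y, intercept, slope):
    """Root mean squared prediction error of the line on (x, y)."""
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


# Made-up data: the first 8 rows stand in for the data you are given,
# the last 2 rows stand in for the withheld patients.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.2, 19.8]
train_x, train_y = x[:8], y[:8]
test_x, test_y = x[8:], y[8:]

# The model is fit using the training rows only.
intercept, slope = fit_line(train_x, train_y)
print("training RMSE:  ", rmse(train_x, train_y, intercept, slope))
print("validation RMSE:", rmse(test_x, test_y, intercept, slope))
```

The key design point is that the withheld rows play no part in fitting, so the validation error is an honest estimate of how the model performs on new data. An overfit model tends to show a training error that is much smaller than its validation error.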

Question #13

Create a scatter plot matrix displaying the relationship between all of the quantitative variables in these data (you should recognize that some categorical predictors are coded using numbers). Which variables appear to be most strongly related with survival? Do any variables appear to have outliers?

Question #14 (Group Only)

When a predictor variable has an outlier (a data point far from the average), that data point can be highly influential on that variable’s estimated regression coefficient. To illustrate this concept, fit a regression model using the variable PR to predict survival time and record the slope of this model. Now remove the largest value of PR (subject #127 has a PR of over 900) and refit the same regression model. How does the slope of the model change when this outlier is included/removed?
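The same comparison can be sketched in Python on made-up numbers (the lab itself uses Minitab). The values below are hypothetical and deliberately exaggerated; they are not taken from the Breast Cancer Data, and the function name `slope` is introduced here for illustration.

```python
def slope(x, y):
    """Least-squares slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Made-up data: nine ordinary points plus one extreme predictor value (last).
predictor = [5, 10, 20, 30, 40, 50, 60, 70, 80, 900]
outcome = [300, 320, 340, 380, 400, 410, 450, 470, 500, 520]

with_outlier = slope(predictor, outcome)
without_outlier = slope(predictor[:-1], outcome[:-1])
print(round(with_outlier, 3), round(without_outlier, 3))
```

In this exaggerated example the single extreme point pulls the fitted slope far below the slope supported by the other nine points.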

Question #15 (Group Only)

The single subject discussed in Lab Question #14 changes the regression line’s slope by almost 20%. This is undesirable, but we shouldn’t arbitrarily exclude real data. Instead we should try transforming the variable to reduce the impact of outliers.

For Question #15, create a new variable by applying a log-transformation to PR, then compare the \(R^2\) value of the simple linear regression model using the log-transformed variable with the \(R^2\) value of the simple linear regression model using the original, un-transformed variable. (Hints: You might need to add a small positive value to each subject’s PR value in your log-transformation formula. Also, you should be sure that subject #127 has been added back to your data for these comparisons.)
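The reason for adding a small positive value is that the logarithm of zero is undefined. The Python sketch below shows the transformation on made-up values (the lab itself uses Minitab). The offset of 1 is an assumption chosen for illustration, and the numbers are hypothetical rather than taken from the Breast Cancer Data.

```python
import math

# Made-up values, including a zero and one extreme value.
raw_values = [0, 3, 12, 45, 110, 900]

# Assumed offset: any small positive constant avoids evaluating log(0).
offset = 1
log_values = [math.log(value + offset) for value in raw_values]

print([round(v, 2) for v in log_values])
```

On the original scale the largest value is more than eight times the next largest, while on the log scale the two are much closer together, so the extreme observation has less leverage on the fitted slope.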

Question #16 (Group Only)

Now use backward elimination (with an \(\alpha\) to remove of 0.1) to select a model. Recall that you can do this by clicking on the “Stepwise” button in the “Fit Regression Model” menu. Include in your write-up a screenshot of the coefficient table for the final model.

Question #17 (THIS QUESTION IS OPTIONAL, BUT YOU CAN EARN EXTRA CREDIT FOR YOUR MODEL)

For the final question I’d like you to use your knowledge to construct a model that you think will be most predictive of cancer survival time. Include a screenshot of the coefficient table of your final model, as well as a 1-2 sentence description of how you arrived at that model. You should consider the bias-variance tradeoff when deciding upon the complexity of your model.

Hints: you may transform any of the explanatory or response variables, you may also choose to include quadratic or cubic effects, and you may include or exclude any of the available variables.

The groups whose models make the most accurate predictions on the 100 subjects that I’ve withheld will receive a small amount of extra credit (1st place = 5 pts, 2nd = 2 pts, 3rd = 1 pt). Any group whose model beats the top model from the three semesters in which I’ve previously used this application will receive an additional 2 pts of extra credit (making it possible to earn up to 7 pts on this question).

Submission Directions

  • Email your completed write-up to Professor Miller with a subject heading that includes the text “Sta-209-Lab10”. Please include this exact character string, including the dashes. You will lose 1 point off the top of your score if you don’t do so.
  • If you’d like to provide feedback on your group, fill out the optional review form at this link: https://forms.gle/wNWRFMbbra8oK4LJ8