Directions

  • You may choose to work on this lab individually or with your final project group
    • If you work as a group, all members are responsible for the content of your lab write-up. I strongly encourage you to use voice chat software (Skype, Google Hangouts, etc.) while working together.
    • If you choose to work individually, you are not required to answer questions tagged (Group Only).
  • Read through the entire lab (not just the questions). The lab will introduce course content that you will be responsible for on exams/homework.
  • Answer all questions in a separate document, attaching Minitab output if needed.

Introduction

This lab will focus on statistical inference for quantitative variables. In it you will analyze data from a major United States university that was used as part of that university’s ongoing evaluation of salary differences between its male and female faculty. The university released the data set under the condition of anonymity.

The University Salaries data contain the following variables:

  • Rank - a categorical variable with levels “AsstProf”, “AssocProf”, and “Prof”. Generally speaking, new professors are usually hired at the rank of Assistant Professor and after several years of productivity they are promoted to Associate Professor or released. The promotion to Full Professor occurs after several additional years of productivity.
  • Discipline - A binary variable with levels “A” (theoretical departments) and “B” (applied departments)
  • Yrs.since.phd - The number of years since the professor received their PhD
  • Yrs.service - The number of years the professor has been working for the university
  • Sex - Whether the professor is male or female
  • Salary - The 9 month salary of the professor (in dollars)

Note that these data only include salaries for a single year at a single university; we will consider them to be a representative sample of similar universities, and the primary goal of our analysis will be to evaluate differences in male and female salaries.

Question #1:

What are the explanatory and response variables in our primary analysis of these data? What are three different statistical tests that you might consider for evaluating an association between these variables?

Exploring the Data

These data are observational, so before jumping straight into hypothesis testing we should thoroughly explore them. In our exploration we should carefully look for:

  1. Confounding variables
  2. Outliers or unusual data-points
  3. Variables with highly-skewed distributions that might need to be transformed
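As a rough illustration of item 3, the sketch below compares a simple moment-based skewness measure before and after a log transformation. The salary values are made up for the example (they are not taken from the University Salaries data), and the function name `skewness` is hypothetical. The lab itself uses Minitab graphs for this step, so treat this only as a way to see the idea in numbers.

```python
import math
from statistics import mean, median

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (positive means a long right tail)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Made-up salaries (dollars) with one unusually large value
salaries = [60000, 70000, 80000, 90000, 200000]
log_salaries = [math.log(s) for s in salaries]

skew_raw = skewness(salaries)
skew_log = skewness(log_salaries)

# A mean well above the median is another sign of right skew
print("mean:", mean(salaries), "median:", median(salaries))
print("skewness (raw):", round(skew_raw, 2))
print("skewness (log):", round(skew_log, 2))
```

In this toy example the log transformation pulls in the long right tail, so the skewness of the logged values is smaller than that of the raw values.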

Question #2:

Use Minitab to create a dot plot of male and female salaries and include this plot in your lab write-up. Do males or females tend to have higher salaries? How would you describe the shape of these distributions? Do you see any clear outliers?

Question #3:

The \(t\)-test requires normally distributed populations, or a relatively large sample size. Based upon the characteristics of these data, are you comfortable using a two-sample \(t\)-test to evaluate differences in male and female salaries?

Question #4:

In addition to salary and sex, there are four other variables in this dataset. For each of these four variables use an appropriate graph(s) to determine whether that variable is associated with both Sex and Salary. Summarize your results in a table like the one shown below. The first row of this table provides an example of the information you’re expected to record:

Variable      | Related to Sex? | Graph Used (Sex) | Related to Salary? | Graph Used (Salary)
Rank          | Yes             | Stacked Barchart | Yes                | Boxplots
Discipline    |                 |                  |                    |
yrs.since.phd |                 |                  |                    |
yrs.service   |                 |                  |                    |

Investigating Salary Differences

In this section we will explore three different analysis approaches. In practice you should decide upon a single approach prior to actually conducting your analysis. That said, we’ll apply all three approaches to learn about their similarities and differences.

Question #5: (Group Only)

Suppose your colleague tests the relationship between Sex and Salary in three different ways: a two-sample \(t\)-test, a non-parametric test, and a two-sample \(t\)-test on the log-transformed variable “log(Salary)”. Then, after conducting these tests, they only report results for the test with the lowest \(p\)-value. Do you have any problems with your colleague’s approach? (Think about ethics and Type I errors)

Question #6:

Ignoring any assumptions of the test, use Minitab to perform a two-sample \(t\)-test comparing the male and female salaries. State your hypotheses and include Minitab output documenting the test. Based upon your results, do you think the observed difference in male and female salaries could be due to random chance?
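For readers who want to see the arithmetic behind a two-sample \(t\)-test, here is a minimal sketch of the unpooled (Welch) form, which does not assume equal variances. It assumes this is the form of the test you run; check the equal-variances option in your Minitab dialog. The data are made up, the helper name `welch_t` is hypothetical, and the \(p\)-value itself would still come from a \(t\) distribution with the computed degrees of freedom (for example, from Minitab).

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for the unpooled two-sample t-test of mean(x) - mean(y)."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of mean(x)
    vy = variance(y) / ny  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Made-up salaries (dollars), NOT values from the lab data set
group_a = [100000, 110000, 120000, 130000]
group_b = [90000, 100000, 110000]

t_stat, df = welch_t(group_a, group_b)
print("t =", round(t_stat, 3), "df =", round(df, 2))
```

The test statistic is the difference in sample means divided by its estimated standard error, so a large value of |t| means the observed difference is large relative to the sampling variability.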

Question #7:

Suppose the \(p\)-value of the test described in Question #6 is “statistically significant”. Does this result imply that the salary differences are due to discrimination? Briefly explain.

Question #8:

Conduct an appropriate non-parametric test evaluating the difference between male and female salaries. Add the Minitab output from this test to your lab write-up and answer the following: How do the results of this test compare with the two-sample \(t\)-test you performed in Question #6? Do you believe a non-parametric test is necessary for these data?
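To make the idea of a rank-based comparison concrete, the sketch below computes the Mann-Whitney U statistic by direct pair counting. It assumes the Mann-Whitney (Wilcoxon rank-sum) test is the non-parametric test you have in mind. The data are made up and the helper name `mann_whitney_u` is hypothetical. The sketch stops at the statistic and leaves the \(p\)-value to Minitab.

```python
def mann_whitney_u(x, y):
    """U statistic for x: count of (x_i, y_j) pairs with x_i > y_j; ties count 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up salaries (dollars), NOT values from the lab data set
group_a = [100000, 110000, 120000, 130000]
group_b = [90000, 100000, 110000]
u_original = mann_whitney_u(group_a, group_b)

# Replace the largest value with an extreme outlier: the ordering of the
# observations is unchanged, so the statistic is unchanged too
group_a_outlier = [100000, 110000, 120000, 1300000]
u_outlier = mann_whitney_u(group_a_outlier, group_b)

print("U =", u_original, "| U with outlier =", u_outlier)
```

Because the statistic depends only on how the observations are ordered, making one salary far more extreme does not change it. This is the sense in which rank-based tests are resistant to outliers, whereas the sample means used by the \(t\)-test are not.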

Question #9: (Group Only)

Use a Minitab formula to create a new variable “Log_Salary” that is the logarithm of the variable “Salary”. Then perform a two-sample \(t\)-test comparing the Log_Salary of male and female professors. Add the Minitab output from this test to your lab write-up and answer the following: How does this test compare to the tests you performed in Questions #6 and #8? How do you interpret the 95% confidence interval associated with this test? (Hint: remember this confidence interval is on the log-scale, so you should transform its endpoints to make sense of it)
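The back-transformation mentioned in the hint can be sketched as follows. A difference in mean log salaries is the log of a ratio of geometric means, so raising the log base to each endpoint turns the interval into one for a multiplicative difference. The interval endpoints below are hypothetical, not results from the lab data, and the helper name `back_transform` is made up. The sketch assumes a natural logarithm; if your Minitab formula used a base-10 logarithm, pass `base=10`.

```python
import math

def back_transform(lower, upper, base=math.e):
    """Convert a CI for a difference in mean logs into a CI for a ratio."""
    return base ** lower, base ** upper

# Hypothetical 95% CI for (mean log salary, group 1) - (mean log salary, group 2)
log_ci = (0.05, 0.25)

ratio_lower, ratio_upper = back_transform(*log_ci)
print("ratio CI:", round(ratio_lower, 3), "to", round(ratio_upper, 3))
```

With these hypothetical endpoints, the back-transformed interval runs from about 1.05 to about 1.28. It would be read as the typical (geometric mean) salary in group 1 being roughly 5% to 28% higher than in group 2.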

Multivariate Relationships

The salary of a faculty member depends upon several factors, some of which might also be associated with “Sex”. When analyzing the relationship between “Salary” and “Sex” we need to consider these confounding variables.

Question #10:

Using your answer to Question #4, are there any variables that could be confounding the relationship between “Sex” and “Salary”? Use the definition of confounding to justify your answer.

Question #11:

Recall that stratification is an analysis strategy that can be used to neutralize a confounding variable. Explain how you might use stratification to analyze the relationship between “Sex” and “Salary” while properly accounting for the confounding variable “Rank”. (For this question you do not actually need to perform the analysis you describe).

Question #12: (Group Only)

Without transforming any of the data, could stratification be used to address possible confounding due to the variable “yrs.service”? Briefly explain.

Question #13: (Group Only)

Considering possible transformations of the variable “yrs.service”, explain how stratification could be used to address possible confounding due to “yrs.service”. (Hint: think about the type of variables we’ve been able to stratify by).

The Final Analysis (for now)

These data are challenging to properly analyze using the hypothesis tests we’ve learned about so far. We’ll later see that multiple regression is better suited for this analysis. Nevertheless, we’ll now attempt the best analysis that we’re capable of right now.

Question #14:

The analysis you perform for this question should consist of the following steps:

  1. Determine which variable, “Rank” or “Discipline”, is the more influential confounding variable. In your lab write-up, state your choice and defend your decision using summary statistics. (Hint: consider which variable has a stronger relationship with the two variables of interest).
  2. Stratify the data by the variable you identified in step 1.
  3. Determine the appropriate test(s) to perform in each stratum. In your lab write-up you should state your choices and the reasons why you believe your test(s) should be preferred.
  4. Formally conduct the tests you described in step 3, add the relevant Minitab output to your lab write-up, and write a thorough 1-2 conclusion summarizing the overall results of the analysis you performed in this question. Your lab write-up doesn’t need to explicitly label steps 1-4; instead, you can provide a single answer to this question that incorporates these steps.
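The mechanics of stratifying can be sketched in a few lines. The rows below are made up purely for illustration (they are not from the University Salaries data), the helper name `stratified_differences` is hypothetical, and the choice of “Rank” as the stratifying variable here is only an example, not a suggested answer to step 1.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (stratum, sex, salary in dollars)
rows = [
    ("AsstProf", "Male", 80000), ("AsstProf", "Male", 82000),
    ("AsstProf", "Female", 78000), ("AsstProf", "Female", 80000),
    ("Prof", "Male", 120000), ("Prof", "Male", 125000),
    ("Prof", "Male", 130000), ("Prof", "Female", 124000),
]

def stratified_differences(rows):
    """Male-minus-female difference in mean salary within each stratum."""
    groups = defaultdict(lambda: defaultdict(list))
    for stratum, sex, salary in rows:
        groups[stratum][sex].append(salary)
    return {s: mean(g["Male"]) - mean(g["Female"]) for s, g in groups.items()}

# Unstratified comparison, ignoring the stratifying variable
overall = (mean([sal for _, sex, sal in rows if sex == "Male"])
           - mean([sal for _, sex, sal in rows if sex == "Female"]))
within = stratified_differences(rows)

print("overall difference:", overall)
print("within-stratum differences:", within)
```

In this toy data set the overall difference is much larger than the difference inside either stratum, because the two groups are spread unevenly across the strata. A formal analysis would then run a separate test within each stratum rather than just comparing means.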

Submission Directions

  • Email your completed write-up to Professor Miller with a subject heading that includes the text “Sta-209-Lab7”. Please include this exact character string, including the dashes. You will lose 1 point off the top of your score if you don’t do so.
  • If you’d like to provide feedback on your group, fill out the optional review form at this link: https://forms.gle/wNWRFMbbra8oK4LJ8