Directions
This lab will focus on statistical inference for quantitative variables. In it you will analyze data from a major United States university that was used as part of that university’s ongoing evaluation of salary differences between its male and female faculty. The university released the data set under the condition of anonymity.
The University Salaries data contain the following variables:
Note that these data only include salaries for a single year at a single university; we will treat them as a representative sample of faculty salaries at similar universities. The primary goal of our analysis will be to evaluate differences in male and female salaries.
Question #1:
What are the explanatory and response variables in our primary analysis of these data? What are three different statistical tests that you might consider for evaluating an association between these variables?
These data are observational, so before jumping straight into hypothesis testing we should thoroughly explore them. In our exploration we should carefully look for:
Question #2:
Use Minitab to create a dot plot of male and female salaries and include this plot in your lab write-up. Do males or females tend to have higher salaries? How would you describe the shape of these distributions? Do you see any clear outliers?
Question #3:
The \(t\)-test requires normally distributed populations, or a relatively large sample size. Based upon the characteristics of these data, are you comfortable using a two-sample \(t\)-test to evaluate differences in male and female salaries?
Question #4:
In addition to Salary and Sex, there are four other variables in this dataset. For each of these four variables, use an appropriate graph (or graphs) to determine whether that variable is associated with both Sex and Salary. Summarize your results in a table like the one shown below. The first row of this table provides an example of the information you’re expected to record:
Variable | Related to Sex? | Graph Used (Sex) | Related to Salary? | Graph Used (Salary) |
---|---|---|---|---|
Rank | Yes | Stacked Barchart | Yes | Boxplots |
Discipline | | | | |
yrs.since.phd | | | | |
yrs.service | | | | |
In this section we will explore three different analysis approaches. In practice you should decide upon a single approach prior to actually conducting your analysis. That said, we’ll apply all three approaches to learn about their similarities and differences.
Question #5: (Group Only)
Suppose your colleague tests the relationship between Sex and Salary in three different ways: a two-sample \(t\)-test, a non-parametric test, and a two-sample \(t\)-test on the log-transformed variable “log(Salary)”. Then, after conducting these tests, they report results only for the test with the lowest \(p\)-value. Do you have any problems with your colleague’s approach? (Think about ethics and Type I errors.)
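One way to explore the Type I error concern is a small simulation. The Python sketch below is not part of the Minitab workflow and does not use the lab's data. It draws two groups from the same hypothetical log-normal "salary" distribution, so any "significant" result is a false positive. It then compares how often a single pre-chosen test rejects with how often the smallest of three \(p\)-values rejects. All names (`welch_p`, `rank_sum_p`, `simulate`) and the distribution parameters are assumptions made for illustration, and the \(p\)-values use normal approximations rather than exact \(t\) or Mann-Whitney distributions.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_p(x, y):
    """Two-sided p-value for a difference in means.

    Uses a normal approximation to the unequal-variance t statistic,
    which is only reasonable for moderately large samples.
    """
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rank_sum_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, assumes no ties)."""
    n1, n2 = len(x), len(y)
    u = sum(1 for a in x for b in y if a > b)  # pairs where x exceeds y
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=1000, n=40, alpha=0.05, seed=1):
    """Return (rejection rate of one fixed test, rejection rate of the smallest p-value)."""
    rng = random.Random(seed)
    single = cherry_picked = 0
    for _ in range(n_sims):
        # Both groups come from the SAME hypothetical distribution (null is true).
        x = [rng.lognormvariate(11.5, 0.3) for _ in range(n)]
        y = [rng.lognormvariate(11.5, 0.3) for _ in range(n)]
        p_values = [
            welch_p(x, y),
            rank_sum_p(x, y),
            welch_p([math.log(v) for v in x], [math.log(v) for v in y]),
        ]
        single += p_values[0] < alpha
        cherry_picked += min(p_values) < alpha
    return single / n_sims, cherry_picked / n_sims


single_rate, cherry_rate = simulate()
print(f"one pre-chosen test: {single_rate:.3f}")
print(f"smallest of three:   {cherry_rate:.3f}")
```

Because the smallest of three \(p\)-values can never exceed any one of them, the second rate is always at least as large as the first; the simulation gives a rough sense of how much larger it is under these assumed conditions.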
Question #6:
Ignoring any assumptions of the test, use Minitab to perform a two-sample \(t\)-test comparing the male and female salaries. State your hypotheses and include Minitab output documenting the test. Based upon your results, do you think the observed difference in male and female salaries could be due to random chance?
Question #7:
Suppose the \(p\)-value of the test described in Question #6 is “statistically significant”. Does this result imply that the salary differences are due to discrimination? Briefly explain.
Question #8:
Conduct an appropriate non-parametric test evaluating the difference between male and female salaries. Add the Minitab output from this test to your lab write-up and answer the following: How do the results of this test compare with the two-sample \(t\)-test you performed in Question #6? Do you believe a non-parametric test is necessary for these data?
Question #9: (Group Only)
Use a Minitab formula to create a new variable “Log_Salary” that is the logarithm of the variable “Salary”. Then perform a two-sample \(t\)-test comparing the Log_Salary of male and female professors. Add the Minitab output from this test to your lab write-up and answer the following: How does this test compare to the tests you performed in Questions #6 and #8? How do you interpret the 95% confidence interval associated with this test? (Hint: remember this confidence interval is on the log-scale, so you should transform its endpoints to make sense of it)
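The hint about transforming the endpoints can be sketched as follows. This Python snippet is only an illustration: the interval endpoints are hypothetical, not results from the lab's data, and the helper name `back_transform_interval` is made up for this example. It assumes Log_Salary was created with the natural logarithm; if a base-10 logarithm was used instead, pass `base=10`.

```python
import math


def back_transform_interval(lower, upper, base=math.e):
    """Map a CI for a difference in mean log-salaries to the ratio scale.

    A difference of logs is the log of a ratio, so raising the base to
    each endpoint gives an interval for a ratio of typical salaries.
    """
    return base ** lower, base ** upper


# Hypothetical endpoints for (mean male Log_Salary) - (mean female Log_Salary).
log_lower, log_upper = 0.05, 0.20

ratio_lower, ratio_upper = back_transform_interval(log_lower, log_upper)
print(f"ratio interval: ({ratio_lower:.3f}, {ratio_upper:.3f})")
```

On the back-transformed scale the interval describes a multiplicative difference (a ratio of geometric means) rather than a difference in dollars, so an endpoint of 1 corresponds to no difference.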
The salary of a faculty member depends upon several factors, some of which might also be associated with “Sex”. When analyzing the relationship between “Salary” and “Sex” we need to consider these confounding variables.
Question #10:
Using your answer to Question #4, are there any variables that could be confounding the relationship between “Sex” and “Salary”? Use the definition of confounding to justify your answer.
Question #11:
Recall that stratification is an analysis strategy that can be used to neutralize a confounding variable. Explain how you might use stratification to analyze the relationship between “Sex” and “Salary” while properly accounting for the confounding variable “Rank”. (For this question you do not actually need to perform the analysis you describe).
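To make the idea of stratification concrete, here is a minimal Python sketch on a tiny made-up dataset (these records are hypothetical and are not the University Salaries data). It compares the pooled male-female difference in mean salary with the difference computed separately within each level of Rank. The function name `stratified_differences` is an assumption for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (rank, sex, salary). NOT the lab's data.
records = [
    ("AsstProf", "Male", 80000), ("AsstProf", "Female", 78000),
    ("AsstProf", "Male", 82000), ("AsstProf", "Female", 80000),
    ("Prof", "Male", 125000), ("Prof", "Female", 121000),
    ("Prof", "Male", 127000), ("Prof", "Male", 123000),
]


def stratified_differences(rows):
    """Return {rank: mean male salary - mean female salary} within each rank.

    Assumes every rank contains at least one male and one female record.
    """
    by_rank = defaultdict(lambda: defaultdict(list))
    for rank, sex, salary in rows:
        by_rank[rank][sex].append(salary)
    return {
        rank: mean(groups["Male"]) - mean(groups["Female"])
        for rank, groups in by_rank.items()
    }


# Pooled comparison that ignores Rank entirely.
overall = (
    mean(s for _, sex, s in records if sex == "Male")
    - mean(s for _, sex, s in records if sex == "Female")
)
within_rank = stratified_differences(records)

print("pooled difference:", overall)
for rank, diff in within_rank.items():
    print(f"difference within {rank}: {diff}")
```

In this toy example the pooled difference is much larger than the difference within either rank, because the made-up male records are concentrated in the higher-paid rank. In the lab itself the analogous step would be running the chosen test separately for each level of Rank in Minitab.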
Question #12: (Group Only)
Without transforming any of the data, could stratification be used to address possible confounding due to the variable “yrs.service”? Briefly explain.
Question #13: (Group Only)
Considering possible transformations of the variable “yrs.service”, explain how stratification could be used to address possible confounding due to “yrs.service”. (Hint: think about the type of variables we’ve been able to stratify by).
These data are challenging to properly analyze using the hypothesis tests we’ve learned about so far. We’ll later see that multiple regression is better suited for this analysis. Nevertheless, we’ll now attempt the best analysis that we’re capable of right now.
Question #14:
The analysis you perform for this question should consist of the following steps: