Lab #3 - Bivariate Data Visualizations and Finding Association

$~$

Onboarding

Our previous lab covered univariate graphs, which are primarily used to assess the distribution of a single variable. However, many of the most interesting patterns that can be found in data involve two or more variables.

Association

Two variables are associated if one variable provides information about the other. In more technical terms, association implies that the distribution of a variable changes after conditioning on the other variable.

The opposite of association is independence, which implies one variable provides no information about the other.

Example #1:

Consider the variables “age” and “class year” in a sample of Sta-209 students. After conditioning on class = 'first year', you’ll see a distribution of ages that is younger than the marginal distribution of ages.
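The idea of conditioning can be sketched in a few lines of R. The data frame below is invented purely for illustration (it is not the actual Sta-209 roster), and the names `students`, `age`, and `class` are assumptions:

```r
# Hypothetical data, invented for illustration only
students <- data.frame(
  age   = c(18, 18, 19, 19, 20, 20, 21, 21, 22, 22),
  class = c("first year", "first year", "first year", "sophomore", "sophomore",
            "junior", "junior", "senior", "senior", "senior")
)

# Marginal distribution of age: every student, ignoring class year
marginal_mean <- mean(students$age)

# Conditional distribution of age: only students with class == "first year"
first_year_ages  <- students$age[students$class == "first year"]
conditional_mean <- mean(first_year_ages)

marginal_mean
conditional_mean
```

In this made-up sample the conditional mean (about 18.3) is lower than the marginal mean (20), which is the kind of shift that indicates an association.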

Example #2:

Here we’ll begin by using a scatter plot to show the relationship between the average monthly salary of faculty at a college and the median earnings of its alumni four years after they graduate.

These two variables are associated. If you look at the distribution of alumni earnings for colleges where the average faculty salary is less than $7,000 per month, the mean of those earnings is substantially lower than the mean across all colleges.

Example #3:

In the example below we can reasonably conclude the variables HSI and Religious are independent.

Knowing that a college is religiously affiliated provides effectively no information about whether or not that college might be an HSI. Roughly the same percentage of religious and non-religious colleges are classified as HSIs.
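A rough numerical check of independence is to compare the percentage of HSIs within each category of Religious. The sketch below uses made-up counts (not the real Colleges 2024 values), and the object names `toy` and `hsi_rate` are assumptions:

```r
# Hypothetical counts, invented for illustration only
toy <- data.frame(
  Religious = rep(c("Yes", "No"), times = c(40, 60)),
  HSI = c(rep(c("Yes", "No"), times = c(4, 36)),   # 40 religious colleges
          rep(c("Yes", "No"), times = c(6, 54)))   # 60 non-religious colleges
)

# Proportion of HSIs within each category of Religious
hsi_rate <- tapply(toy$HSI == "Yes", toy$Religious, mean)
hsi_rate
```

Both groups have the same HSI rate (10%) in this toy example, so knowing Religious tells us nothing about HSI. With real data the percentages will rarely match exactly, so "roughly the same" is a judgment call.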

$~$

Explanatory and Response Variables

In many scenarios where an association exists we will designate an explanatory variable (suspected cause) and a response variable (suspected effect). These designations are based upon:

Subject-area expertise - there is a plausible mechanism for one variable driving changes in the other.
Time order - the explanatory variable occurs earlier in time than the response variable.

For example, the religious status of a college can usually be traced back to its founding (though some institutions do opt to become non-religious later on), but its HSI status is regularly re-assessed according to the ethnic composition of its student body. Thus, we may consider Religious an explanatory variable and HSI a response variable in Example #3 due to the time ordering of these variables.

It’s important to note that designating an explanatory and response variable does not imply a causal relationship. We will talk more about causation later in the semester.

$~$

Lab

As a reminder, you should work on the lab with your assigned partner(s) using the principles of pair programming. Everyone should keep and submit their own copy of your group’s work.

In this lab you’ll use the “Colleges 2024” data set, which was introduced in our previous lab.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges_2024_Complete.csv")

You’ll also need the same packages used in our previous lab, ggplot2 and forcats. Remember that you do not need to install these packages if you’ve done so previously, but you do need to load them using the library() function:

# install.packages("ggplot2")
# install.packages("forcats")

library(ggplot2)
library(forcats)

$~$

Two Quantitative Variables

Associations between two quantitative variables are visualized using scatter plots, which graph the values of each variable for each case using x and y coordinates. Whenever possible, the typical convention is to use the x axis for the explanatory variable and the y axis for the response variable.

When describing an association, we should address:

Form - what type of trend or pattern exists (e.g., linear, quadratic, or otherwise non-linear)
Strength - how closely the data adhere to a trend or pattern (e.g., strong, moderate, or weak)
Direction - how the values of one variable relate to the values of the other variable (i.e., positive or negative)

For some non-linear associations there may not be a single direction.

Scatter plots are created using geom_point(), with variable names mapped to the x and y aesthetics:

## Scatter plot example
ggplot(data = colleges, aes(x = Med_ACT, y = Avg_Fac_Salary)) + geom_point()

In this example we see a moderate-to-strong, non-linear, positive relationship between the median ACT score of students at a college and the average monthly salary of its faculty.

Since identifying weaker relationships can be tricky, especially when overplotting causes data points to overlap, we can add a smoother to our graph using geom_smooth():

## Adding a smoother layer
ggplot(data = colleges, aes(x = Med_ACT, y = Avg_Fac_Salary)) + geom_point() + geom_smooth(se = FALSE)

The argument se = FALSE is used to turn off the shaded error bands that appear around the smoother by default. The smoother helps us confirm that the graph does indeed show a non-linear (perhaps quadratic or exponential) relationship.
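One way to judge whether a trend is non-linear is to overlay a straight-line fit on the same graph and see how far the default smoother departs from it. The sketch below uses simulated data rather than the colleges data, and the names `curved` and `p` are assumptions. The `method = "lm"` argument asks geom_smooth() for a linear fit instead of its default flexible smoother:

```r
library(ggplot2)

# Simulated data with a curved trend, invented for illustration only
set.seed(1)
curved <- data.frame(x = seq(1, 10, length.out = 50))
curved$y <- curved$x^2 + rnorm(50, sd = 5)

p <- ggplot(data = curved, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(se = FALSE) +                                    # default flexible smoother
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed")  # straight-line fit
p
```

If the two lines nearly coincide, a linear description is probably adequate. If the solid smoother bends away from the dashed line across a region with plenty of data, that suggests a non-linear form.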

However, we should be cautious to avoid over-interpreting our smoother in regions of the graph without much data. Consider the example below:

## Don't read too much into the smoother on the far left side!
ggplot(data = colleges, aes(x = Med_ACT, y = Ret_Rate)) + geom_point() + geom_smooth(se = FALSE)

This graph shows a strong, positive, linear relationship between a college’s median ACT score and its retention rate. We should not over-interpret the smoother bending to accommodate the two outliers with very low median ACT scores as evidence of a quadratic relationship.

Note: These are real data from the US Government’s college scorecard, but it’s dubious that any colleges have median ACT scores of 5 and 7.
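When a graph reveals implausible values like these, it can help to look at the underlying rows before deciding how to interpret them. The sketch below uses a made-up data frame. The `Name` column, the college labels, and the cutoff of 10 are all assumptions chosen for illustration:

```r
# Hypothetical data, invented for illustration only
toy <- data.frame(
  Name    = c("College A", "College B", "College C", "College D"),
  Med_ACT = c(5, 21, 24, 7)
)

# ACT composite scores range from 1 to 36; flagging medians below 10
# is an arbitrary cutoff used only for this sketch
suspicious <- subset(toy, Med_ACT < 10)
suspicious
```

Flagging cases this way does not mean they should be deleted. It simply identifies which colleges deserve a closer look.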

Question #1:

Part A: Create a scatter plot displaying the variables Med_Grad_Debt and Avg_Fac_Salary, and add a smoother to the graph.
Part B: Describe the form, strength, and direction of the relationship between these two variables using your graph from Part A.

Question #2:

Part A: Create a scatter plot displaying the variables Adm_Rate and Med_Grad_Earnings_4Y, and add a smoother to the graph.
Part B: Describe the form, strength, and direction of the relationship between these two variables using your graph from Part A.

$~$

One Categorical and One Quantitative Variable

Associations between a categorical and a quantitative variable are visualized using side-by-side graphs, most often box plots.

Box plots display various percentiles of a quantitative variable, which are defined as the values that a certain proportion of data falls below. For example, 75% of observed data is below the 75th percentile (also known as Q3, the “third quartile”).
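Percentiles can be computed directly with the quantile() function from base R. The values in `ret_rate` below are made up for illustration, and the object names are assumptions:

```r
# Hypothetical retention rates (in percent), invented for illustration only
ret_rate <- c(52, 60, 65, 70, 74, 78, 81, 85, 93)

# The 75th percentile (Q3)
q3 <- quantile(ret_rate, probs = 0.75)
q3

# Minimum, Q1, median, Q3, and maximum
five_num <- quantile(ret_rate, probs = c(0, 0.25, 0.5, 0.75, 1))
five_num

# Share of values at or below Q3
mean(ret_rate <= q3)
```

Here Q3 is 81 and the median is 74. With only nine values, the share of data at or below Q3 is 7/9 rather than exactly 75%, so the definition is best read as approximate in small samples.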

Box plots are created using geom_boxplot():

## Side-by-side boxplots
ggplot(data = colleges, aes(x = Med_Grad_Debt, y = Type)) + geom_boxplot()

Box plots reduce the distribution of a quantitative variable down to a small set of values, but this helps facilitate comparisons across the groups created by the categorical variable. In the example above, we see that the median value of median debt upon graduation is lower among public colleges in comparison to private non-profit and private for-profit colleges. Thus, the variables Type and Med_Grad_Debt are associated.
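The medians drawn inside each box can also be computed numerically, which is a useful check on what the graph appears to show. The sketch below uses invented debt values, and the category labels may not match the exact labels stored in the Type column of the real data:

```r
# Hypothetical data, invented for illustration only
toy <- data.frame(
  Type = rep(c("Public", "Private nonprofit", "Private for-profit"), each = 3),
  Med_Grad_Debt = c(15000, 18000, 21000,
                    22000, 25000, 27000,
                    24000, 26000, 30000)
)

# Median of Med_Grad_Debt within each category of Type
group_medians <- aggregate(Med_Grad_Debt ~ Type, data = toy, FUN = median)
group_medians
```

In this toy example the public group has the lowest median (18000), mirroring the pattern described above.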

Question #3:

Part A: Create a set of side-by-side box plots showing the relationship between the variables Ret_Rate (retention rate) and Region.
Part B: Compare the distributions of Ret_Rate in the Southwest and the Southeast regions. When considering just these two regions, do you see evidence of an association between Ret_Rate and Region? State “yes” or “no” and briefly explain your reasoning.
Part C: Now compare the distributions of Ret_Rate in the Southwest and New England regions. When considering just these two regions, do you see evidence of an association between Ret_Rate and Region? State “yes” or “no” and briefly explain your reasoning.
Part D: A categorical and quantitative variable are associated so long as the distribution of the quantitative variable differs across at least one pairing of categories. Based upon your answers to Parts B and C, are the variables Ret_Rate and Region associated in the “colleges” data set? Briefly explain.

$~$

Two Categorical Variables

Recall that we used bar charts to display the distribution of a single categorical variable. It is possible to modify these charts to show the relationship between two categorical variables by subdividing the bars of a univariate bar chart using a second variable, thus producing a stacked bar chart.

## Stacked bar chart
ggplot(data = colleges, aes(y = Region, fill = Type)) + geom_bar()

In this example we use the fill aesthetic to instruct ggplot to use the variable Type to subdivide each bar. Typically we’ll use either the x or y axis to show the categories of the explanatory variable, and we’ll use the fill aesthetic to incorporate the response variable.

In analyses without a clear explanatory or response variable we should map whichever variable has fewer categories to the fill aesthetic and use the x or y axis for the variable with more categories.

A disadvantage of stacked bar charts is that differing frequencies in the first variable, region in this example, make it difficult to assess whether the relative frequency of each type differs by region. In many circumstances we prefer conditional bar charts, which show the conditional distribution of the response variable for each category of the explanatory variable.

## Conditional bar chart
ggplot(data = colleges, aes(y = Region, fill = Type)) + geom_bar(position = 'fill')

The conditional bar chart makes it much easier for us to compare the Mountain region, which has the fewest cases, with the other regions.
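The proportions that a conditional bar chart displays can be reproduced with table() and prop.table() from base R. The counts below are made up, and the "Plains" label and two-category Type are simplifications assumed for this sketch:

```r
# Hypothetical data, invented for illustration only
toy <- data.frame(
  Region = rep(c("Mountain", "Plains"), times = c(4, 20)),
  Type   = c(rep(c("Public", "Private"), times = c(3, 1)),     # Mountain
             rep(c("Public", "Private"), times = c(10, 10)))   # Plains
)

# Raw frequencies: what a stacked bar chart displays
counts <- table(toy$Region, toy$Type)
counts

# Row proportions: what a conditional bar chart displays
# margin = 1 rescales each row (region) to sum to 1
row_props <- prop.table(counts, margin = 1)
row_props
```

The Mountain row has far fewer cases, so its bar would be short in a stacked chart, but the row proportions make it clear that 75% of its colleges are public compared with 50% in the other region.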

Question #4:

Part A: Create a stacked bar chart showing the frequencies of Historically Black College and University (HBCU) status on the y-axis and filling these bars according to the variable Religious.
Part B: Create a conditional bar chart using the same variables from Part A.
Part C: Briefly comment upon why the conditional bar chart is more useful when assessing whether these two variables are associated.

$~$

Practice (required)

In previous sections it was obvious from the examples and section titles which type of graph you’d be creating. In this section you’ll practice translating research questions into effective data visualizations by identifying the explanatory and response variables, checking their types, and then creating an appropriate graph.
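Checking variable types can be done with class() or is.numeric(). The sketch below summarizes the pairings covered in this lab using a hypothetical helper, `suggest_graph()`, which is not part of ggplot2 or any other package. The toy data frame is also invented for illustration:

```r
# Hypothetical helper: maps a pair of variables to the graph type used in this lab
suggest_graph <- function(x, y) {
  n_quant <- is.numeric(x) + is.numeric(y)   # number of quantitative variables
  switch(as.character(n_quant),
         "2" = "scatter plot",
         "1" = "side-by-side box plots",
         "0" = "stacked or conditional bar chart")
}

# Hypothetical data, invented for illustration only
toy <- data.frame(
  Region   = c("Mountain", "Plains", "Plains"),
  Ret_Rate = c(0.70, 0.80, 0.75),
  Med_ACT  = c(21, 24, 22)
)

sapply(toy, class)                        # how R stores each column
suggest_graph(toy$Med_ACT, toy$Ret_Rate)  # two quantitative variables
suggest_graph(toy$Region, toy$Ret_Rate)   # one categorical, one quantitative
```

This check only looks at how R stores a column. A categorical variable recorded as numbers (such as a 0/1 indicator) would be treated as quantitative, so you should still think about what each variable actually measures.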

Question #5: Create an appropriate data visualization to determine whether the highest degree granted by an institution is associated with the average monthly salary of its faculty. Briefly describe whether or not your graph suggests an association and explain your conclusion.

Question #6: Create an appropriate data visualization to determine whether the proportion of a college’s students who are female is predictive of its retention rate. Briefly describe whether or not your graph suggests an association and explain your conclusion.

Question #7: Create an appropriate data visualization to assess whether colleges that only grant bachelor’s degrees are more prevalent in some regions than in others. Briefly describe whether or not your graph suggests an association and explain your conclusion.