\(~\)
Our previous lab covered univariate graphs, which are primarily used to assess the distribution of a single variable. However, many of the most interesting patterns that can be found in data involve two or more variables.
Two variables are associated if one variable provides information about the other. In more technical terms, association implies that the distribution of a variable changes after conditioning on the other variable.
The opposite of association is independence, which implies one variable provides no information about the other.
Example #1:
Consider the variables “age” and “class year” in a sample of Sta-209 students. After conditioning on class = 'first year', you’ll see a distribution of ages that is younger than the marginal distribution of ages.
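To make the idea of conditioning concrete, here is a minimal sketch in base R. The `students` data frame and its values are invented for illustration (they are not a real Sta-209 roster); the point is only that the mean age changes once we restrict attention to first-year students.

```r
# Hypothetical toy data, invented for illustration only
students <- data.frame(
  age   = c(18, 18, 19, 19, 20, 20, 21, 21, 22, 19),
  class = c("first year", "first year", "first year", "second year",
            "second year", "third year", "third year", "fourth year",
            "fourth year", "first year")
)

# Marginal distribution: all students, ignoring class year
marginal_mean <- mean(students$age)

# Conditional distribution: only the students whose class is 'first year'
first_years <- subset(students, class == "first year")
conditional_mean <- mean(first_years$age)

marginal_mean
conditional_mean
```

Because the conditional mean differs from the marginal mean in this toy sample, knowing a student's class year tells us something about their age, which is what it means for the two variables to be associated.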
Example #2:
Here we’ll begin by using a scatter plot to show the relationship between the average monthly salary of faculty at a college and the median earnings of its alumni four years after they graduate.
These two variables are associated: if you look at the distribution of alumni earnings for colleges where the average faculty salary is less than $7,000 per month, the mean earnings are substantially lower than the mean earnings across all colleges.
Example #3:
In the example below we can reasonably conclude the variables HSI and Religious are independent.
Knowing that a college is religiously affiliated provides effectively no information about whether or not that college might be an HSI. Roughly the same percentage of religious and non-religious colleges are classified as HSIs.
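The comparison of percentages described here can be sketched in base R. The data frame below is invented for illustration, and the "Yes"/"No" coding is an assumption; the real columns in the colleges data may be coded differently.

```r
# Hypothetical toy data: 10 religious and 20 non-religious colleges
toy <- data.frame(
  Religious = rep(c("Yes", "No"), times = c(10, 20)),
  HSI       = c(rep(c("Yes", "No"), times = c(1, 9)),    # religious colleges
                rep(c("Yes", "No"), times = c(2, 18)))   # non-religious colleges
)

# Proportion of HSIs within each category of Religious
hsi_rate <- tapply(toy$HSI == "Yes", toy$Religious, mean)
hsi_rate
```

In this toy sample the proportion of HSIs is the same in both groups, so conditioning on Religious does not change the distribution of HSI. With real data the proportions will rarely match exactly, so "roughly the same" is the practical standard.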
\(~\)
In many scenarios where an association exists we will designate an explanatory variable (suspected cause) and a response variable (suspected effect). These designations are based upon considerations such as the time ordering of the variables.
For example, the religious status of a college can usually be traced back to its founding (though some institutions do opt to become non-religious later on), but its HSI status is regularly re-assessed according to the ethnic composition of its student body. Thus, we may consider Religious an explanatory variable and HSI a response variable in Example #3 due to the time ordering of these variables.
It’s important to note that designating an explanatory and response variable does not imply a causal relationship. We will talk more about causation later in the semester.
\(~\)
As a reminder, you should work on the lab with your assigned partner(s) using the principles of pair programming. Everyone should keep and submit their own copy of your group’s work.
In this lab you’ll use the “Colleges 2024” data set, which was introduced in our previous lab.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges_2024_Complete.csv")
You’ll also need the same packages used in our previous lab, ggplot2 and forcats. Remember that you do not need to install these packages if you’ve done so previously, but you do need to load them using the library() function:
# install.packages("ggplot2")
# install.packages("forcats")
library(ggplot2)
library(forcats)
\(~\)
Associations between two quantitative variables are visualized using scatter plots, which graph the values of each variable for each case using x and y coordinates.
Whenever possible, the typical convention is to use the x axis for the explanatory variable and the y axis for the response variable.
When describing an association, we should address its form (linear or non-linear), its direction (positive or negative), and its strength. For some non-linear associations there may not be a single direction.
Scatter plots are created using geom_point() via variable names given to the x and y aesthetics:
## Scatter plot example
ggplot(data = colleges, aes(x = Med_ACT, y = Avg_Fac_Salary)) + geom_point()
In this example we see a moderate-to-strong, non-linear, positive relationship between the median ACT score of students at a college and the average monthly salary of its faculty.
Since identifying weaker relationships can be tricky, especially with overplotting where data points overlap, we can add a smoother to our graph using geom_smooth():
## Adding a smoother layer
ggplot(data = colleges, aes(x = Med_ACT, y = Avg_Fac_Salary)) + geom_point() + geom_smooth(se = FALSE)
The argument se = FALSE is used to turn off the shaded error bands that appear around the smoother by default. The smoother helps us confirm that the graph does indeed show a non-linear (perhaps quadratic or exponential) relationship.
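One way to judge whether a smoother is really showing curvature is to overlay a straight-line fit for comparison. The sketch below uses simulated data (the `toy` data frame is invented, not the colleges data) and relies on the `method = "lm"` option of geom_smooth(), which draws a least-squares line instead of the default flexible curve.

```r
library(ggplot2)

# Simulated toy data with a curved (quadratic) trend, for illustration only
set.seed(1)
toy <- data.frame(x = seq(1, 10, length.out = 50))
toy$y <- toy$x^2 + rnorm(50, sd = 5)

p <- ggplot(data = toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(se = FALSE) +                              # default flexible smoother
  geom_smooth(method = "lm", se = FALSE, color = "red")  # straight-line fit
p
```

Where the two curves separate noticeably, a linear description is probably inadequate; where they nearly coincide, "linear" is a reasonable summary of the form.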
However, we should be cautious to avoid over-interpreting our smoother in regions of the graph without much data. Consider the example below:
## Don't read too much into the smoother on the far left side!
ggplot(data = colleges, aes(x = Med_ACT, y = Ret_Rate)) + geom_point() + geom_smooth(se = FALSE)
This graph shows a strong, positive, linear relationship between a college’s median ACT score and its retention rate. We should not over-interpret the smoother bending to accommodate the two outliers with very low median ACT scores as evidence of a quadratic relationship.
Note: These are real data from the US Government’s College Scorecard, but it’s doubtful that any colleges have median ACT scores of 5 and 7.
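To check how much a smoother depends on a few suspicious points, one option is to redraw the graph without them. The sketch below uses simulated data standing in for the real variables, and the cutoff of 10 is an arbitrary choice made for illustration, not a rule from the College Scorecard.

```r
library(ggplot2)

# Simulated toy data: two implausibly low median ACT scores plus 40 typical colleges
set.seed(209)
act <- c(5, 7, round(runif(40, min = 16, max = 34)))
ret <- c(0.62, 0.58, 0.40 + 0.015 * act[-(1:2)] + rnorm(40, sd = 0.03))
toy <- data.frame(Med_ACT = act, Ret_Rate = ret)

# Keep only rows at or above the illustrative cutoff (10 is an assumption)
toy_trimmed <- subset(toy, Med_ACT >= 10)

# Same scatter plot and smoother, drawn without the two low values
p <- ggplot(data = toy_trimmed, aes(x = Med_ACT, y = Ret_Rate)) +
  geom_point() +
  geom_smooth(se = FALSE)
p
```

If the smoother looks essentially straight once the unusual points are set aside, the bend in the original graph was driven by those points rather than by the bulk of the data. Any points excluded this way should be reported, not silently dropped.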
Question #1: Create a scatter plot of Med_Grad_Debt and Avg_Fac_Salary, and add a smoother to the graph.
Question #2: Create a scatter plot of Adm_Rate and Med_Grad_Earnings_4Y, and add a smoother to the graph.
\(~\)
Associations between a categorical and a quantitative variable are visualized using side-by-side graphs, most often box plots.
Box plots display various percentiles of a quantitative variable, which are defined as the values that a certain proportion of data falls below. For example, 75% of observed data is below the 75th percentile (also known as Q3, the “third quartile”).
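The percentiles that a box plot draws can be computed directly with base R's quantile() function. The values below are simulated for illustration and are not taken from the colleges data.

```r
# Simulated toy values, for illustration only
set.seed(42)
x <- rnorm(1000, mean = 25000, sd = 5000)

# The three quartiles: Q1 (25th percentile), the median, and Q3 (75th percentile)
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
quartiles

# Check the definition: the share of values that fall below Q3
q3 <- quartiles[["75%"]]
share_below_q3 <- mean(x < q3)
share_below_q3
```

The share of values below Q3 comes out at about three quarters, matching the definition of the 75th percentile. With small samples or tied values the share can differ slightly from 75%.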
Box plots are created using geom_boxplot():
## Side-by-side boxplots
ggplot(data = colleges, aes(x = Med_Grad_Debt, y = Type)) + geom_boxplot()
Box plots reduce the distribution of a quantitative variable down to a small set of values, but this helps facilitate comparisons across the groups created by the categorical variable. In the example above, we see that the median value of median debt upon graduation is lower among public colleges in comparison to private non-profit and private for-profit colleges. Thus, the variables Type and Med_Grad_Debt are associated.
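The group medians that the box plots display can also be computed numerically, which is a useful check on what you read off the graph. The sketch below uses an invented data frame; the category labels are assumptions and may not match how Type is coded in the colleges data.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  Type = rep(c("Public", "Private nonprofit", "Private for-profit"), each = 4),
  Med_Grad_Debt = c(18000, 20000, 21000, 23000,   # Public
                    24000, 25000, 27000, 28000,   # Private nonprofit
                    22000, 26000, 29000, 31000)   # Private for-profit
)

# Median of the quantitative variable within each category
group_medians <- tapply(toy$Med_Grad_Debt, toy$Type, median)
group_medians
```

If the medians (and more generally the boxes) differ noticeably across groups, the categorical variable provides information about the quantitative one. With real data containing missing values, median() would need `na.rm = TRUE`.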
Question #3:
Create side-by-side box plots of Ret_Rate (retention rate) and Region.
Compare Ret_Rate in the Southwest and the Southeast regions. When considering just these two regions, do you see evidence of an association between Ret_Rate and Region? State “yes” or “no” and briefly explain your reasoning.
Compare Ret_Rate in the Southwest and New England regions. When considering just these two regions, do you see evidence of an association between Ret_Rate and Region? State “yes” or “no” and briefly explain your reasoning.
Are Ret_Rate and Region associated in the “colleges” data set? Briefly explain.
\(~\)
Recall that we used bar charts to display the distribution of a single categorical variable. It is possible to modify these charts to show the relationship between two categorical variables by subdividing the bars of a univariate bar chart using a second variable, thus producing a stacked bar chart.
## Stacked bar chart
ggplot(data = colleges, aes(y = Region, fill = Type)) + geom_bar()
In this example we use the fill aesthetic to instruct ggplot to use the variable Type to subdivide each bar.
Typically we’ll use either the x or y axis to show the categories of the explanatory variable, and we’ll use the fill aesthetic to incorporate the response variable.
In analyses without a clear explanatory or response variable we should map whichever variable has fewer categories to the fill aesthetic and use the x or y axis for the variable with more categories.
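When it is not obvious which variable has fewer categories, a quick count settles it. This is a minimal base R sketch on an invented data frame; the region and type labels are placeholders rather than the actual categories in the colleges data.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  Region = c("Plains", "Plains", "Mountain", "Southeast", "Southwest", "Plains"),
  Type   = c("Public", "Private", "Public", "Public", "Private", "Private")
)

# Number of distinct categories in each variable
n_categories <- sapply(toy, function(column) length(unique(column)))
n_categories

# The variable with the fewest categories is the candidate for the fill aesthetic
fill_variable <- names(which.min(n_categories))
fill_variable
```

Here Type has two categories and Region has four, so Type would go to fill and Region to an axis.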
A disadvantage of stacked bar charts is that differing frequencies in the first variable, Region in this example, make it difficult to assess whether the relative frequency of each type differs by region. In many circumstances we prefer conditional bar charts, which show the conditional distribution of the response variable for each category of the explanatory variable.
## Conditional bar chart
ggplot(data = colleges, aes(y = Region, fill = Type)) + geom_bar(position = 'fill')
The conditional bar chart makes it much easier for us to compare the Mountain region, which has the fewest cases, with the other regions.
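The numbers behind a stacked bar chart and a conditional bar chart can be sketched with table() and prop.table(). The data below are invented for illustration: a small region and a large region with the same mix of college types.

```r
# Hypothetical toy data: 4 colleges in one region, 16 in another
toy <- data.frame(
  Region = rep(c("Mountain", "Plains"), times = c(4, 16)),
  Type   = c("Public", "Public", "Private", "Private",
             rep("Public", 8), rep("Private", 8))
)

# Counts: what a stacked bar chart displays
counts <- table(toy$Region, toy$Type)
counts

# Row proportions: what a conditional bar chart (position = 'fill') displays
conditional <- prop.table(counts, margin = 1)
conditional
```

The counts differ a great deal between the two regions, yet the row proportions are identical, so in this toy sample Type and Region are not associated. That equality is hard to see from bar heights alone, which is why conditional bar charts are often preferred.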
Question #4:
Religious.
\(~\)
In previous sections it was obvious from the examples and section titles which type of graph you’d be creating. In this section you’ll practice translating research questions into effective data visualizations by identifying the explanatory and response variables, checking their types, and then creating an appropriate graph.
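The type-checking step can be sketched as a small helper function. Note that `suggest_graph()` is a hypothetical function written for this illustration, not part of ggplot2 or any other package, and it simply encodes the pairings used in this lab.

```r
# Hypothetical helper (not from any package): suggest a graph from variable types
suggest_graph <- function(explanatory, response) {
  x_quantitative <- is.numeric(explanatory)
  y_quantitative <- is.numeric(response)

  if (x_quantitative && y_quantitative) {
    "scatter plot (geom_point)"
  } else if (x_quantitative || y_quantitative) {
    "side-by-side box plots (geom_boxplot)"
  } else {
    "stacked or conditional bar chart (geom_bar)"
  }
}

# Toy vectors, invented for illustration
suggest_graph(c(21, 25, 30), c(0.70, 0.80, 0.90))    # two quantitative variables
suggest_graph(c("Public", "Private", "Public"), c(0.70, 0.80, 0.90))
suggest_graph(c("Yes", "No", "No"), c("Plains", "Plains", "Mountain"))
```

One caveat: is.numeric() only checks how a column is stored. A categorical variable recorded with numeric codes would be misclassified, so it is still worth inspecting the data (for example with str()) before choosing a graph.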
Question #5: Create an appropriate data visualization to determine whether the highest degree granted by an institution is associated with the average monthly salary of its faculty. Briefly describe whether or not your graph suggests an association and explain your conclusion.
Question #6: Create an appropriate data visualization to determine whether the proportion of a college’s students who are female is predictive of its retention rate. Briefly describe whether or not your graph suggests an association and explain your conclusion.
Question #7: Create an appropriate data visualization to assess whether colleges that only grant bachelor’s degrees are more prevalent in some regions than in others. Briefly describe whether or not your graph suggests an association and explain your conclusion.