ggplot2 (part 2)
This lab continues our study of data visualization using the ggplot2 package. The lab will focus on the concepts and strategy behind creating effective visualizations.
Directions (Please read before starting)
\(~\)
We will continue using the ggplot2 package:
# install.packages("ggplot2")
library(ggplot2)
The examples will again use data from The College Scorecard:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
The colleges data set records attributes and outcomes for all primarily undergraduate institutions in the United States with at least 400 full-time students.
\(~\)
The fundamental principles of creating effective data visualizations are quite simple. In short, an effective visualization should:
It’s helpful to understand these principles with a few examples of effective and ineffective visuals:
Consider survey data on the popularity of different internet browsers over time:
Which graph is more effective? Why?
\(~\)
Consider the heights of men and women in the NHANES sample:
Which graph is more effective? Why?
\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
\(~\)
The fundamental idea of data visualization is to convey a particular message by exploiting human understanding of visual cues. The most common strategies represent differences in the observed data by visual differences in:
However, as we saw in Example #1 from the preamble, not all of these are equally effective. For example, bar charts (lengths) are more effective than pie charts (angles and areas).
In fact, we could make the pie chart from Example #1 even less effective by removing any possible comparison of angles (using a graph called a “donut chart”):
Research by Cleveland and McGill has shown that judgments based upon position and length are the most accurate. Judgments of angle or color are somewhat less accurate but still acceptable, judgments of area and brightness are substantially less accurate, and judgments based upon volume are the least accurate.
Question #1: Consider the following data visualizations, whose goal is to compare the population of Suffolk County, MA (where Boston is located) with that of all other counties in the northeastern United States.
Note: you do not need to write any R code for this question.
\(~\)
Histograms and density plots display the distribution of a quantitative variable.
Faceting provides a strategy to show distributional differences across subsets of data. Effective use of faceting will align the scales of your axes (using scales = "fixed" in the faceting function) throughout the graph to facilitate accurate comparisons:
ggplot(heights, aes(x = height)) + geom_histogram(aes(y = after_stat(density)), bins = 20) +
facet_wrap(~sex, nrow = 1, scales = "free") + labs(title = "Graph #1 (Difficult)")
ggplot(heights, aes(x = height)) + geom_histogram(aes(y = after_stat(density)), bins = 20) +
facet_wrap(~sex, nrow = 2, scales = "fixed") + labs(title = "Graph #2 (Effective)")
Notice how vertical alignment combined with a common x-axis facilitates a clear comparison in Graph #2, while Graph #1 makes it very difficult to judge whether the male and female distributions differ.
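The code above relies on a data frame named heights that is not created anywhere in this lab. As a minimal sketch, the block below builds a simulated stand-in so both faceting layouts can be run and compared. The name sim_heights and the means and standard deviations are made up for illustration; they are not the NHANES values.

```r
library(ggplot2)

# Simulated stand-in for the NHANES heights (hypothetical values, in cm)
set.seed(42)
sim_heights <- data.frame(
  sex = rep(c("female", "male"), each = 500),
  height = c(rnorm(500, mean = 162, sd = 7), rnorm(500, mean = 176, sd = 7.5))
)

base_plot <- ggplot(sim_heights, aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20)

# Free scales, side by side: each panel gets its own axes
p_free <- base_plot + facet_wrap(~sex, nrow = 1, scales = "free")

# Fixed scales, stacked: panels share one x-axis, so a shift is easy to see
p_fixed <- base_plot + facet_wrap(~sex, nrow = 2, scales = "fixed")
```

Typing p_free or p_fixed at the console draws each plot.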
Alternatively, depending upon the number of comparisons, you might want to consider a single plot:
ggplot(heights, aes(x = height, color = sex, fill = sex)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, bins = 20, position = "identity") +
  labs(title = "Graph #3 (Also effective?)")
Question #2: In the examples shown above, the additional aesthetic mapping y = after_stat(density) instructs geom_histogram() to use the density scale for the heights of the histogram bars (displayed using the y-axis). What happens when you remove this argument from the code used to produce Graph #3? What happens when you remove it from the code used to produce Graph #2? Based upon what you’ve observed, briefly describe when and how this additional mapping should be used.
Question #3: For the colleges data, create a collection of density plots that effectively display the distribution of the variable “Enrollment” across the following regions: “South East”, “Plains”, “Great Lakes”, and “New England”. Do not display enrollments in any other regions. Hint: you may use the %in% operator to check if a region is contained in a set of target character strings.
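As a small illustration of the %in% operator mentioned in the hint, the sketch below uses a made-up data frame (the names toy, school, region, and targets are hypothetical and unrelated to the colleges data):

```r
# Toy data frame (hypothetical values, not the colleges data)
toy <- data.frame(
  school = c("A", "B", "C", "D", "E"),
  region = c("Plains", "Far West", "New England", "Plains", "Mid East")
)

targets <- c("Plains", "New England")

# %in% returns one TRUE/FALSE value per element of toy$region
keep <- toy$region %in% targets
keep

# The same condition can be used inside subset() to retain matching rows
toy_sub <- subset(toy, region %in% targets)
toy_sub$school
```

Only the rows whose region appears in targets remain in toy_sub.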
\(~\)
Histograms and density plots are great at showing the shape of a variable’s distribution, but they don’t scale well when many distributions are to be compared.
In contrast, dot plots, box plots, and violin plots are suitable alternatives when comparing more than 2-3 groupings of data.
However, simply switching the geom is not enough to produce an effective graph. Consider the following three dot plots, each of which displays the relationship between the mean of “Avg_Fac_Salary” and the variables “Region” and “Private”.
## Data subset used for these graphs
datasub = subset(colleges, Region %in% c("South East", "South West", "Far West", "Mid East",
                                         "Great Lakes", "Plains", "New England"))
## Dot plot #1
ggplot(datasub, aes(x = Avg_Fac_Salary, y = Private, color = Region)) +
  labs(title = "Dot Plot #1") +
  stat_summary(fun.data = mean_se)
## Dot plot #2
ggplot(datasub, aes(x = Avg_Fac_Salary, y = Private)) +
  labs(title = "Dot Plot #2") +
  stat_summary(fun.data = mean_se, color = "red", alpha = 0.3) +
  facet_wrap(~Region)
## Dot plot #3
ggplot(datasub, aes(x = Avg_Fac_Salary,
                    y = reorder(Region, X = Avg_Fac_Salary, FUN = mean, na.rm = TRUE),
                    color = Private)) +
  labs(title = "Dot Plot #3", y = "Region") +
  stat_summary(fun.data = mean_se)
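Each point and bar in these dot plots comes from stat_summary(fun.data = mean_se). As a rough sketch of what that helper computes, the block below applies mean_se() to a short made-up vector (the values in x are hypothetical) and reproduces the result by hand:

```r
library(ggplot2)

# Hypothetical salaries (in thousands) for one group
x <- c(60, 70, 80, 90)

# mean_se() returns the mean (y) and the mean -/+ 1 standard error (ymin, ymax)
summary_se <- mean_se(x)
summary_se

# The same standard error computed by hand
se_by_hand <- sd(x) / sqrt(length(x))
c(mean(x) - se_by_hand, mean(x) + se_by_hand)
```

The hand-computed interval should match the ymin and ymax columns returned by mean_se().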
For simplicity these graphs only show the mean \(\pm\) 1 standard error for each region. But you could attempt to show the entire distribution using geom_violin() in place of or in addition to stat_summary(), or you could show a more detailed statistical summary using geom_boxplot():
ggplot(datasub, aes(x = Avg_Fac_Salary,
                    y = reorder(Region, X = Avg_Fac_Salary, FUN = mean, na.rm = TRUE),
                    color = Private)) +
  geom_violin() +
  labs(title = "Using geom_violin", y = "Region")
ggplot(datasub, aes(x = Avg_Fac_Salary,
                    y = reorder(Region, X = Avg_Fac_Salary, FUN = mean, na.rm = TRUE),
                    color = Private)) +
  geom_boxplot() +
  labs(title = "Using geom_boxplot", y = "Region")
Note: The examples above also illustrate the reorder() function, which can rearrange the categories of a variable according to a function of another variable in the data. In these examples, the categories of “Region” are rearranged by the mean value of “Avg_Fac_Salary” within each region.
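To see what reorder() does outside of a plot, here is a minimal sketch on a made-up data frame (toy, group, and value are hypothetical names). The extra argument na.rm = TRUE is passed along to mean(), just as in the plotting code above:

```r
# Toy data (hypothetical): three groups, one missing value
toy <- data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(5, 7, 1, NA, 3, 3)
)

# Group means ignoring NA: a = 6, b = 1, c = 3
ordered_group <- reorder(toy$group, X = toy$value, FUN = mean, na.rm = TRUE)

# Levels are sorted from the smallest group mean to the largest
levels(ordered_group)
```

Because ggplot2 draws discrete axes in level order, the reordered factor is what places the categories from lowest to highest mean.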
Question #4: In the United States, federal law defines a Hispanic-serving institution (HSI) as a college or university where 25% or more of the total undergraduate student body is Hispanic or Latino. The code below uses this definition and the ifelse() function to add a new binary categorical variable, “HSI”, to the colleges data. For this question, your goal is to create a graph that effectively compares the variable “Net_Tuition” for HSI and non-HSI colleges in the states of “CA”, “FL”, “NY” and “TX”. That is, your graph should allow for easy comparisons of the variables “Net_Tuition” (or a summary of it), “HSI”, and “State” (displaying only the four aforementioned states). Hint: You should remove any colleges with missing values of “HSI” by including a logical condition involving !is.na() (which will return TRUE if a college is not missing that variable).
colleges$HSI = ifelse(colleges$PercentHispanic >= 0.25, "HSI", "No")
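The hint about !is.na() matters because ifelse() returns NA wherever its test is NA. A minimal sketch on made-up values (toy, pct, and flag are hypothetical names, not columns of the colleges data):

```r
# Toy data (hypothetical): one school has a missing percentage
toy <- data.frame(pct = c(0.10, 0.40, NA, 0.25))

# A missing input produces a missing label rather than "No"
toy$flag <- ifelse(toy$pct >= 0.25, "HSI", "No")
toy$flag

# !is.na() is TRUE only for rows where the label is present
kept <- subset(toy, !is.na(flag))
nrow(kept)
```

Without the filter, the missing label would typically show up as its own NA category in a plot.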
\(~\)
Scatter plots are used to display relationships between two numeric variables using position. However, aesthetics like color, point character, or brightness allow for additional variables to be included in the graph. Below are several tips for creating more effective scatter plots:
The most natural way to add a third variable into a scatter plot is the color aesthetic. By default, adding color into aes will create a legend on the side of the plot describing how the chosen variable is mapped to the colors seen on the graph.
From a visual processing perspective this is less efficient than placing color annotations near the relevant regions of the plot:
data("iris")
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  labs(title = "Using a legend")
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  labs(title = "Using annotations") +
  guides(color = "none") + ## This removes the "guides" for the color aesthetic
  annotate(geom = "text", x = c(2.5, 4, 6), y = c(0.25, 0.75, 1.25),
           label = c("Setosa", "Versicolor", "Virginica"),
           color = c("red", "darkgreen", "blue"))
The example above demonstrates how annotations can allow a viewer to understand the essence of a scatter plot more quickly.
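The annotation coordinates and colors above were chosen by hand. One possible alternative, sketched here as an assumption rather than as the approach this lab requires, is to compute a label position for each group and let geom_text() inherit the color mapping so the labels match the points automatically (centers and p_labels are illustrative names):

```r
library(ggplot2)

# One row per species holding the mean of each plotted variable
centers <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                     data = iris, FUN = mean)

p_labels <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(alpha = 0.4) +
  # Labels are drawn at the group means and reuse the plot's color scale
  geom_text(data = centers, aes(label = Species), fontface = "bold", vjust = -1) +
  guides(color = "none")
```

Labels placed at group means can overlap the points, so hand-tuned positions like those used with annotate() may still read better for some data.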
\(~\)
The goal of any graph is to show your data. If outliers or skew are leading to a lot of blank space in your graph, you might consider a scale transformation:
ggplot(colleges, aes(x = Enrollment, y = Net_Tuition)) + geom_point() + labs(title = "A few outliers dominate your attention")
Sometimes scale transformations can lead to new insights:
ggplot(colleges, aes(x = Enrollment, y = Net_Tuition)) +
  geom_point() +
  labs(title = "Two distinct clusters?") +
  scale_x_continuous(trans = "log2")
ggplot(colleges, aes(x = Enrollment, y = Net_Tuition, color = Private)) +
  geom_point() +
  labs(title = "Simpson's Paradox!") +
  scale_x_continuous(trans = "log2") +
  guides(color = "none") +
  annotate(geom = "text", x = c(2000, 52000), y = c(52000, 25000),
           label = c("Private", "Public"), color = c("red", "cyan3"))
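To see why a log2 scale spreads out skewed data, note that it turns each doubling into one equal step. The sketch below shows this numerically and then on simulated data (sizes, sim, size, and cost are made-up names and values, not the colleges data):

```r
library(ggplot2)

# Each value doubles the previous one
sizes <- c(500, 1000, 2000, 4000)
diff(sizes)        # unequal gaps on the raw scale
diff(log2(sizes))  # equal gaps on the log2 scale

# Simulated right-skewed data (hypothetical)
set.seed(1)
sim <- data.frame(size = 2^runif(200, min = 8, max = 16))
sim$cost <- 10000 + 2000 * log2(sim$size) + rnorm(200, sd = 3000)

# Same points, with and without the transformed x-axis
p_raw <- ggplot(sim, aes(x = size, y = cost)) + geom_point()
p_log <- p_raw + scale_x_continuous(trans = "log2")
```

In p_raw most points crowd against the left edge, while p_log uses the full width of the panel.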
\(~\)
In some situations, outliers distract from the main purpose of a graph, but in other situations the outliers are the most interesting aspect of the data. Shown below is a counterexample to Tip #2:
In this application, people generally have the greatest interest in knowing the military expenditures of the small number of countries that are currently viewed as having strong geopolitical aspirations. The smaller countries are less interesting, and the main purpose of displaying them is to highlight just how extreme the spending of the world’s top military powers is relative to the average nation.
\(~\)
A third numeric variable can be included in a scatter plot using color, size or brightness, with color being the most effective choice. When mapping numeric values to different colors there are two possible options:
The examples above use pre-built color palettes via the function scale_color_distiller(). You should read the help documentation of this function to see a list of possible color palette choices.
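As a hedged sketch of how scale_color_distiller() might be used with a sequential and a diverging palette, the block below uses the built-in mtcars data rather than the colleges data. The centered variable hp_centered and the palette choices ("Blues", "RdBu") are illustrative assumptions:

```r
library(ggplot2)

cars <- mtcars
# Center horsepower so that 0 marks the average car (for the diverging scale)
cars$hp_centered <- cars$hp - mean(cars$hp)
max_dev <- max(abs(cars$hp_centered))

# Sequential palette: suited to values that run from low to high
p_seq <- ggplot(cars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "Blues", direction = 1)

# Diverging palette: suited to values on either side of a meaningful midpoint;
# symmetric limits keep the palette's neutral middle color at 0
p_div <- ggplot(cars, aes(x = wt, y = mpg, color = hp_centered)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "RdBu", limits = c(-max_dev, max_dev))
```

A diverging palette only helps when the midpoint carries meaning, such as an average or zero change. Otherwise a sequential palette is usually the safer choice.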
\(~\)
Question #5: Using the “colleges” data, create a scatter plot displaying the relationship between the variables “Cost”, “ACT_median”, and “Private”. Use annotations rather than a legend to display the color aesthetic.
Question #6: Create a scatter plot that displays three numeric variables from the “colleges” data. Then, briefly write 2-3 sentences justifying the choices (i.e., diverging vs. sequential scales, etc.) you made in constructing the plot.
\(~\)
For these practice questions you should use the data contained at the link below:
mass = read.csv("https://remiller1450.github.io/data/MassShootings.csv")
These data come from a database of mass shootings maintained by the Mother Jones news organization. They consider two types of incidents: Type = "Mass", defined by at least 4 fatalities in a single location, and Type = "Spree", defined by at least 4 fatalities across a set of related locations/incidents.
Part A: Following the example(s) at this reference page, create a bar chart showing the number of mass shootings by “Place” and use the argument aes(label = after_stat(count)) to add a label above each bar to more clearly display the count.
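As a minimal sketch of the labeling technique named in Part A, shown on the built-in mtcars data rather than the shootings data, a geom_text() layer can reuse the same "count" statistic that geom_bar() computes:

```r
library(ggplot2)

p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  # stat = "count" makes the computed bar heights available as after_stat(count);
  # vjust = -0.5 nudges each label just above its bar
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  labs(x = "Cylinders")

# Inspect the labels computed for the text layer (layer 2)
layer_data(p_bar, 2)$label
```

The labels should agree with table(mtcars$cyl), which is one way to check that the text layer and the bars use the same counts.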
Part B: Construct a scatterplot relating the variables “Year” and “Victims” and notice the outlier in 2017. Use an annotation to highlight this outlier as interesting.
Part C: Use a dot plot to effectively display the mean and standard error for the number of fatalities by “Place” and “Type”. Try to use strategies from the lab’s “Dotplots, Boxplots, and Violin Plots” section. Pay special attention to the use of the reorder() function in these examples. You may ignore any warnings related to missing values being removed (as some category combinations do not have enough data to plot).