ggplot2
This lab focuses on data visualization and how to create high-quality
graphics using the ggplot2 package. In this lab we will use cleaned
data, with subsequent labs covering how to manipulate data prior to
graphing.
Directions (Please read before starting)
\(~\)
This lab will use the ggplot2 package:
# install.packages("ggplot2")
library(ggplot2)
It will also use data from The College Scorecard:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
The colleges data set contains attributes and outcomes for all primarily
undergraduate institutions in the United States with at least 400
full-time students for the year 2019.
\(~\)
How ggplot2 creates graphics
The ggplot2 package builds graphics in a structured manner using
layers. Layers are sequentially added to a graph, with each serving a
particular purpose, such as displaying the raw data, overlaying a
smoother, or adding reference text.
Consider the following examples:
## Example #1.1 - nothing (just a base layer)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition))
## Example #1.2 - add a layer displaying raw data
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point()
## Example #1.3 - add another layer (smoother)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point() +
geom_smooth()
## Example #1.4 - add another layer (reference text)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point() +
geom_smooth() +
annotate("text", x = 20000, y = 40000, label = "Data from 2019")
Note: A high-quality graphic doesn’t require any particular number of layers. In fact, more layers can sometimes detract from the clarity of a data visualization.
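The layering idea can also be seen by storing a plot in an object and adding to it one step at a time. The sketch below is only illustrative: it uses the mpg data set that ships with ggplot2 as a stand-in for colleges (so it runs without downloading anything), and the object names are made up for this example.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the colleges data here
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))  # base layer only
with_points <- base + geom_point()                             # adds the raw data
with_text <- with_points +
  annotate("text", x = 6, y = 40, label = "mpg data")          # adds reference text

# Each "+" appends one entry to the plot's list of layers
length(base$layers)
length(with_points$layers)
length(with_text$layers)

# Printing the object draws the graph
with_text
```

Storing intermediate plots this way makes it easy to compare versions of a graph with and without a particular layer.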
\(~\)
The mapping and data arguments provided in the base layer are carried
forward to all subsequent layers (which is often desirable). However,
an individual layer can override what it inherits, as Example #2.2
shows.
## Example #2.1
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point()
## Example #2.2 (local override of color aesthetic)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point(color = "red")
Specifying aesthetics locally within layers:
## Example #3.1
ggplot(data = colleges) +
geom_point(mapping = aes(x = Cost, y = Net_Tuition))
## Example #3.2
ggplot(data = colleges) +
geom_point(mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point(mapping = aes(x = Cost, y = Enrollment), color = "red")
Local specification is most useful when you want to add layers that use the same aesthetics (“x” and “y” in the example above) but map them to different variables or data. Common examples are drawing a line through the group means of an ordinal categorical variable, or displaying both polygons and points on a map.
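As a rough sketch of the group-means case, the code below gives two layers their own data argument while they reuse the x and y mapping from the base layer. It uses the mpg data set that ships with ggplot2 rather than colleges, and the names cyl_means and group_means_plot are invented for this illustration.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2 (not the colleges data used in this lab)
# Mean highway mileage for each number of cylinders
cyl_means <- aggregate(hwy ~ cyl, data = mpg, FUN = mean)

group_means_plot <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(alpha = 0.3) +                          # raw data, inherited from the base layer
  geom_line(data = cyl_means, color = "red") +       # same x/y mapping, different data
  geom_point(data = cyl_means, color = "red", size = 3)

group_means_plot
```

This works because cyl_means has columns named cyl and hwy, so the inherited mapping still applies to the layer-specific data.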
\(~\)
The ggplot framework is unique because all graphics are grammatically
defined using the following terminology:
Question #1: Identify and briefly describe each term mentioned above in the graphic created by the code below.
ggplot(data = colleges, mapping = aes(x = Adm_Rate, color = Private)) +
geom_density() +
scale_x_continuous(trans = "reverse") +
facet_wrap(~Region)
Question #2: Create a histogram of the variable “Enrollment” displayed on the log2-scale and faceted by the variable “Private”. Use the ggplot2 cheatsheet to help you identify the necessary functions and arguments.
\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
ggplot graphics are a very expansive topic, and not everything can be
covered in this lab. Throughout the lab, I encourage you to reference
the following resources if you need to figure out the proper function or
syntax for a particular task:
\(~\)
Themes are pre-built style templates used to better tailor a graphic to the mode of publication.
The example below applies a black and white theme to Example 2.1 from the preamble.
## Example #2.1 w/ black and white theme
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point() +
theme_bw()
The pre-built themes include theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic(), and theme_void().
You can judge the differences in these themes below:
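One possible way to compare the themes yourself is to apply each one to the same base plot and print the results in turn. The sketch below assumes the mpg data set that ships with ggplot2 as a stand-in for colleges; the names base_plot, themes, and themed_plots are just for illustration.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2
base_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Named list of the pre-built themes to compare
themes <- list(
  theme_bw       = theme_bw(),
  theme_linedraw = theme_linedraw(),
  theme_light    = theme_light(),
  theme_dark     = theme_dark(),
  theme_minimal  = theme_minimal(),
  theme_classic  = theme_classic(),
  theme_void     = theme_void()
)

# One plot per theme, titled with the theme's name
themed_plots <- lapply(names(themes), function(nm) {
  base_plot + themes[[nm]] + ggtitle(nm)
})

# Draw each plot in turn
for (p in themed_plots) print(p)
```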
Any theme can be further customized using theme(). Most commonly this
function is used to remove a graph’s legend:
## Example #2.1 w/ black and white theme and no legend
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point() +
theme_bw() +
theme(legend.position = "none")
Question #3: The code below creates a line graph that depicts (via a smoothed moving average) the approval of former presidents Jimmy Carter, Ronald Reagan, and Barack Obama. Modify the second portion of this code to try out a few different non-default themes. Then, briefly discuss (1-2 sentences) which themes you feel are most effective and least effective for this type of graph. Include the graph with your preferred theme in your lab write-up.
## Data processing
approval <- read.csv("https://bit.ly/398YR6M")
approval$Week = as.numeric(difftime(as.Date(approval$End.Date, "%m/%d/%y"), as.Date(approval$Inaug.Date, "%m/%d/%y"), units = "weeks"))
approval2 = subset(approval, President %in% c("Reagan", "Carter", "Obama"))
## Creating the graph
ggplot(data = approval2, mapping = aes(x = Week, y = Approving, color = President)) +
geom_smooth(method = "loess", span = 0.6, se = FALSE)
\(~\)
Labels and annotations are important aspects of well-constructed data visualizations. They are used to provide context, or draw the viewer’s attention towards particular aspects of the graphic.
Labels corresponding to aesthetics (such as x, y, color, etc.) are
controlled using the labs() function:
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
labs(x = "Sticker Cost", y = "Price Paid", color = "Admissions Rate")
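Beyond aesthetic labels, labs() also accepts title, subtitle, and caption arguments. The following is a small sketch of that, again using the mpg data set from ggplot2 as a stand-in; the label text and the object name labelled are only examples.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2
labelled <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",        # x-axis label
    y = "Highway miles per gallon",            # y-axis label
    color = "City mpg",                        # legend title for the color aesthetic
    title = "Highway mileage by engine size",  # plot title
    caption = "Data: mpg data set shipped with ggplot2"
  )

labelled
```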
Annotations are added to the graphic as a layer using the annotate()
function:
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
annotate(geom = "rect", xmin = 65000, xmax = 77000, ymin = 20000, ymax = 52000, color = "red", alpha = 0.2)
The example above annotates a scatter plot by drawing a rectangle with a
red outline and a fill at 20% opacity (controlled by alpha, which
affects a rectangle's fill but not its outline) around the cluster of
colleges with high costs, high net tuition, and low admissions rates.
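For rectangles, color sets the outline and alpha is applied to the fill, so a translucent red interior needs an explicit fill. Here is a minimal sketch of that variation, assuming the mpg data set from ggplot2 in place of colleges; the rectangle coordinates are arbitrary values chosen for this stand-in data.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2; coordinates below are arbitrary
highlighted <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(geom = "rect", xmin = 5, xmax = 7.2, ymin = 21, ymax = 28,
           color = "red",   # outline color (not affected by alpha)
           fill = "red",    # interior color
           alpha = 0.2)     # interior opacity: 0 = invisible, 1 = solid

highlighted
```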
Question #4: Use the subset() function to create a data frame
containing only colleges located in the state of Iowa. Using these data,
create a box plot of admissions rates, rename the x-axis label to
“Admissions Rate”, and add a text annotation above the outlier saying
“Grinnell”.
\(~\)
Scales map values in the data space to the aesthetic space (i.e., where
and how should data with Adm_Rate=0.2, Cost=40000, and Net_Tuition=10000
appear on the graphic?).
Scales can be modified by adding functions whose names follow the general format “scale_aesthetic_type()”, such as scale_x_continuous() or scale_color_gradient(). Shown below are a few examples:
## Default Scales
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point()
## Put cost (the "x" aesthetic) on the log2 scale
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
scale_x_continuous(trans = "log2")
## Use a gradient from purple to yellow to display Adm_Rate
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
scale_color_gradient(low = "purple", high = "yellow")
## Use the popular "viridis" color scale, reversing the default direction via "-1"
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
scale_color_continuous(type = "viridis", direction = -1)
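The examples above all color by a numeric variable. As a sketch of the discrete counterpart, the code below colors by a character variable with scale_color_brewer() and fills bars with scale_fill_brewer(). It uses the mpg data set from ggplot2 as a stand-in for colleges, and the "Dark2" palette is just one arbitrary choice.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2; drv is a character variable with three values
# Discrete color scale for points
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

# Bars are colored through the fill aesthetic, so the fill scale is the one to adjust
filled_bars <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2")

filled_bars
```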
There are two things you should note at this point:
scale_color_continuous() is used because “Adm_Rate” is a continuous
numeric variable. However, if colors were being determined by a
categorical variable (such as a character string) you’d want to use
scale_color_brewer() instead (or one of the related color scale
functions designed for discrete variables).
The fill aesthetic can be used to add color to a bar chart, and you must
use a function like scale_fill_brewer() to change the fill color.
Question #5: Create a 2-dimensional filled density plot (using
geom_density_2d_filled()) with the aesthetics “x = Adm_Rate” and
“y = Net_Tuition” using the “viridis” color scale and a “reverse” x-axis
that goes from 1.00 to 0.00. Then, write a sentence or two describing
the combinations of admissions rate and net tuition that occur most
frequently among US colleges.
\(~\)
Sometimes we’d like to display a statistical transformation (mean,
error bars, etc.) alongside the data itself. While this could be
accomplished by creating a separate data frame, it’s generally better to
use a stat_ function:
ggplot(data = colleges[1:30,], mapping = aes(x = Net_Tuition, y = Private)) +
geom_point() +
stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
stat_summary(fun.data = mean_se, geom = "errorbar", color = "red")
The example above adds error bars at 1 standard error above and below
the mean net tuition for the private and public colleges in the first 30
rows of the colleges data frame.
Notice how the first instance of stat_summary() uses the fun argument (a
simple option that takes a function returning a single number for each
group), while the second uses the fun.data argument (a more complex
option that takes a function returning a data frame, which contains the
lower and upper endpoints of the interval in this example).
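To see the difference between the two kinds of summary function, it can help to call them directly on a vector outside of any plot. The sketch below does this with mean() and mean_se(); the vector values is made up purely for illustration.

```r
library(ggplot2)

values <- c(2, 4, 6, 8)   # made-up numbers, purely for illustration

# A function suitable for `fun`: one number summarizing the vector
mean(values)

# A function suitable for `fun.data`: a one-row data frame with columns y, ymin, ymax
interval <- mean_se(values)
interval

# The endpoints are the mean minus/plus one standard error
se <- sd(values) / sqrt(length(values))
all.equal(interval$ymin, mean(values) - se)
all.equal(interval$ymax, mean(values) + se)
```

When passed through fun.data, the ymin and ymax columns are what geoms such as "errorbar" and "linerange" use for their endpoints.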
Question #6: Using the data subset containing only colleges located in
Iowa (you created this in Question #4), create a graph similar to the
example above but using the arguments geom = "linerange", alpha = 0.3,
size = 4, and color = "red" to depict 1 standard error above/below the
means of private and public colleges.
\(~\)
Relying on large numbers of aesthetics to include additional variables on a graph can quickly become overwhelming. Facets allow you to display multiple side-by-side graphs according to one or more categorical variables.
facet_wrap() is designed to display the data broken down by a single
categorical variable.
facet_grid() is designed to display the data broken down by all
combinations of two categorical variables.
Two examples of faceting are shown below:
## create a subset for example purposes
reduced_colleges = subset(colleges, Region %in% c("Great Lakes", "Far West", "Plains"))
## facet_wrap
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
geom_density() +
facet_wrap(~Region, nrow = 1)
## facet_grid
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
geom_density() +
facet_grid(Private~Region)
The facet_wrap() function will wrap a series of plots according to a
fixed number of rows or columns, while facet_grid() will construct a
2-dimensional grid whose panels correspond to the unique combinations of
values in the variables used in the grid formula.
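As a further sketch of how the two functions lay out panels, the code below wraps three panels into two columns with facet_wrap() and builds a rows-by-columns grid with facet_grid(). It assumes the mpg data set from ggplot2 as a stand-in for colleges, and the object names wrapped and gridded are invented here.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2

# facet_wrap: one variable (drv), panels wrapped into two columns
wrapped <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_density() +
  facet_wrap(~drv, ncol = 2)

# facet_grid: rows defined by drv, columns defined by year
gridded <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_density() +
  facet_grid(drv ~ year)

wrapped
gridded
```

In the grid formula, the variable to the left of the tilde defines the rows and the variable to the right defines the columns.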
\(~\)
The code below will load a data set containing 970 Hollywood films
released between 2007 and 2011, then reduce these data to only include
variables that could be known prior to a film’s opening weekend. The
data are then simplified further to only include the four largest
studios (Warner Bros, Fox, Paramount, and Universal) in the three most
common genres (action, comedy, drama). You will use the resulting data
(i.e., movies_subset) for Question #7.
movies = read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")
movies_subset = subset(movies,
                       LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") &
                         Genre %in% c("Action", "Comedy", "Drama"),
                       select = c("Movie", "LeadStudio", "Story", "Genre", "Budget",
                                  "TheatersOpenWeek", "Year", "OpeningWeekend"))
Question #7: Create a graphic that shows the relationship between the outcome variable “OpeningWeekend” and the explanatory variables: “Budget”, “Genre”, and “Studio” that satisfies the following requirements: