ggplot2
This lab focuses on data visualization and how to create high-quality graphics using the ggplot2 package. In this lab we will use cleaned data, with subsequent labs covering how to manipulate data prior to graphing.
Directions (Please read before starting)
This lab will use the ggplot2 package:
# install.packages("ggplot2")
library(ggplot2)
It will also use data from The College Scorecard:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
The colleges data set records attributes and outcomes for all primarily undergraduate institutions in the United States with at least 400 full-time students for the year 2019.
How ggplot2 creates graphics
The ggplot2 package builds graphics in a structured manner using layers. Layers are sequentially added to a graph, with each serving a particular purpose, such as displaying the raw data, adding a smoother, or adding reference text.
Consider the following examples:
## Example #1.1 - nothing (just a base layer)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition))
## Example #1.2 - add a layer displaying raw data
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point()
## Example #1.3 - add another layer (smoother)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point() +
geom_smooth()
## Example #1.4 - add another layer (reference text)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point() +
geom_smooth() +
annotate("text", x = 20000, y = 40000, label = "Data from 2019")
Note: A high-quality graphic doesn’t require any particular number of layers. In fact, more layers can sometimes detract from the clarity of a data visualization.
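As a minimal sketch of how layers accumulate, the code below rebuilds a plot one layer at a time and counts the layers stored in each object. It uses the built-in mtcars data as a stand-in for colleges (an assumption, so the example runs without a download), and the object names are purely illustrative.

```r
library(ggplot2)

# mtcars ships with R and stands in for the colleges data here (assumption)
base <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))   # base layer only
with_points <- base + geom_point()                                # + raw data
with_smooth <- with_points + geom_smooth(method = "lm", se = FALSE)  # + trend line

# Each `+` appends one layer to the plot object
length(base$layers)         # 0
length(with_points$layers)  # 1
length(with_smooth$layers)  # 2
```

Storing intermediate plots in objects like this is optional, but it can make it easier to try out different final layers on the same base.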
The mapping and data arguments provided in the base layer are carried forward to all subsequent layers (which is generally desirable). However, we can override this behavior locally within an individual layer.
## Example #2.1
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point()
## Example #2.2 (local override of color aesthetic)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point(color = "red")
Specifying aesthetics locally within layers:
## Example #3.1
ggplot(data = colleges) +
geom_point(mapping = aes(x = Cost, y = Net_Tuition))
## Example #3.2
ggplot(data = colleges) +
geom_point(mapping = aes(x = Cost, y = Net_Tuition)) +
geom_point(mapping = aes(x = Cost, y = Enrollment), color = "red")
Local specification is most useful when you want to add layers that use the same parameters (i.e., “x” and “y” in the example above) but reference different things. A few other examples are drawing a line through several group means ordered by an ordinal categorical variable, or adding both polygons and points to a map.
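The group-means case can be sketched as follows. This is a hypothetical illustration that uses the built-in mtcars data rather than colleges; cyl_means is a helper data frame created only for this example, and it is supplied to individual layers through their own data argument.

```r
library(ggplot2)

# Hypothetical example using mtcars (built into R), not the colleges data
# One row per cylinder count, holding the mean mpg of that group
cyl_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

p <- ggplot(data = mtcars, mapping = aes(x = cyl, y = mpg)) +
  geom_point(alpha = 0.5) +                              # inherits data and mapping
  geom_line(data = cyl_means, color = "red") +           # same x and y, drawn from cyl_means
  geom_point(data = cyl_means, color = "red", size = 3)  # mark each group mean
p
```

Here the mapping is still inherited from the base layer; only the data differ between layers.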
The ggplot framework is unique because all graphics are grammatically defined using the following terminology:
Question #1: Identify and briefly describe each term mentioned above in the graphic created by the code below.
ggplot(data = colleges, mapping = aes(x = Adm_Rate, color = Private)) +
geom_density() +
scale_x_continuous(trans = "reverse") +
facet_wrap(~Region)
Question #2: Create a histogram of the variable “Enrollment” displayed on the log2-scale and faceted by the variable “Private”. Use the ggplot2 cheatsheet to help you identify the necessary functions and arguments.
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
The sections that follow introduce various customization options you can use to improve the appearance and clarity of ggplot2 graphics.
Themes are pre-built style templates used to better tailor a graphic to the mode of publication.
The example below applies a black and white theme to Example 2.1 from the preamble.
## Example #2.1 w/ black and white theme
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point() +
theme_bw()
Other pre-built themes:
theme_bw()
theme_linedraw(), theme_light(), and theme_dark()
theme_minimal()
theme_classic()
theme_void()
You can judge the differences in these themes below:
Any theme can be further customized using theme(). Most commonly this function is used to remove a graph’s legend:
## Example #2.1 w/ black and white theme and no legend
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
geom_point() +
theme_bw() +
theme(legend.position = "none")
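Removing the legend is only one use of theme(). The sketch below shows a few other adjustments, again on the built-in mtcars data as a stand-in (an assumption); the particular settings are illustrative rather than recommended defaults.

```r
library(ggplot2)

# mtcars stands in for the colleges data (assumption)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  theme_bw() +
  theme(
    legend.position = "bottom",            # move the legend rather than removing it
    axis.title = element_text(size = 14),  # larger axis titles
    panel.grid.minor = element_blank()     # drop the minor grid lines
  )
p
```

Because theme() is added after theme_bw(), its settings modify the black and white theme rather than being overwritten by it.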
Question #3: The code below creates a line graph that depicts (via a smoothed moving average) the approval of former presidents Jimmy Carter, Ronald Reagan, and Barack Obama. Modify the second portion of this code to try out a few different non-default themes. Then, briefly discuss (1-2 sentences) which themes you feel are most effective and least effective for this type of graph. Include the graph with your preferred theme in your lab write-up.
## Data processing
approval <- read.csv("https://bit.ly/398YR6M")
approval$Week = as.numeric(difftime(as.Date(approval$End.Date, "%m/%d/%y"), as.Date(approval$Inaug.Date, "%m/%d/%y"), units = "weeks"))
approval2 = subset(approval, President %in% c("Reagan", "Carter", "Obama"))
## Creating the graph
ggplot(data = approval2, mapping = aes(x = Week, y = Approving, color = President)) +
geom_smooth(method = "loess", span = 0.6, se = FALSE)
Labels and annotations are important aspects of well-constructed data visualizations. They are used to provide context, or draw the viewer’s attention towards particular aspects of the graphic.
Labels corresponding to aesthetics (such as x, y, color, etc.) are controlled using the labs() function:
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
labs(x = "Sticker Cost", y = "Price Paid", color = "Admissions Rate")
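labs() also accepts plot-level text such as title and caption. A small sketch, assuming the built-in mtcars data in place of colleges:

```r
library(ggplot2)

# mtcars stands in for the colleges data (assumption)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Horsepower",                                 # legend title
    title = "Heavier cars travel fewer miles per gallon",
    caption = "Data: mtcars (built into R)"
  )
p
```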
Annotations are added to the graphic as a layer using the annotate() function:
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
annotate(geom = "rect", xmin = 65000, xmax = 77000, ymin = 20000, ymax = 52000, color = "red", alpha = 0.2)
The example above annotates a scatter plot by drawing a red-outlined rectangle with an opacity of 20% (controlled by alpha) around the cluster of colleges with high costs, high net tuition, and low admissions rates.
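Other geoms can be passed to annotate() in the same way. The sketch below combines a "text" and a "segment" annotation, taking the coordinates from the data instead of typing them by hand. It uses the built-in mtcars data (an assumption), and the offsets added to the x position are arbitrary values chosen for readability.

```r
library(ggplot2)

# Locate the most fuel-efficient car so the annotation coordinates come from the data
best <- mtcars[which.max(mtcars$mpg), ]

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  # label placed to the right of the point (offset is arbitrary)
  annotate(geom = "text", x = best$wt + 0.9, y = best$mpg, label = rownames(best)) +
  # short red line leading from the label back to the point
  annotate(geom = "segment", x = best$wt + 0.45, xend = best$wt + 0.05,
           y = best$mpg, yend = best$mpg, color = "red")
p
```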
Question #4: Use the subset() function to create a data frame containing only colleges located in the state of Iowa. Using these data, create a box plot of admissions rates, rename the x-axis label to “Admissions Rate”, and add a text annotation above the outlier saying “Grinnell”.
Scales are the guidelines for how aesthetics are displayed. They can be modified by adding layers using functions whose names follow the general format: “scale_aesthetic_function()”. Shown below are a few examples:
## Default Scales
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point()
## Put cost (the "x" aesthetic) on the log2 scale
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
scale_x_continuous(trans = "log2")
## Use a gradient from purple to yellow to display Adm_Rate
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
scale_color_gradient(low = "purple", high = "yellow")
## Use the popular "viridis" color scale, reversing the default direction via "-1"
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
geom_point() +
scale_color_continuous(type = "viridis", direction = -1)
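The same naming pattern applies to categorical aesthetics. As a hedged sketch on the built-in mtcars data (an assumption, as are the chosen colors), scale_color_manual() assigns a specific color to each category, and ggplot_build() lets you confirm which colors were actually used:

```r
library(ggplot2)

# mtcars stands in for the colleges data (assumption)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_y_continuous(breaks = seq(10, 35, by = 5)) +   # "y" aesthetic, continuous scale
  scale_color_manual(values = c("4" = "darkgreen",     # "color" aesthetic, manual scale
                                "6" = "orange",
                                "8" = "purple"))

# The built plot records the color assigned to each point
built <- ggplot_build(p)
table(built$data[[1]]$colour)
```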
Question #5: Create a 2-dimensional filled density plot (using geom_density_2d_filled()) with the aesthetics “x = Adm_Rate” and “y = Net_Tuition” using the “viridis” color scale and a “reverse” x-axis that goes from 1.00 to 0.00. Then, write a sentence or two describing the combinations of admissions rate and net tuition that are most prevalent among US colleges.
Often we’d like to display a statistical transformation (such as means, error bars, etc.) alongside the data itself. While this can be done by creating a new data frame, it’s generally easier to use a stat_ function:
ggplot(data = colleges[1:30,], mapping = aes(x = Net_Tuition, y = Private)) +
geom_point() +
stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
stat_summary(fun.data = mean_se, geom = "errorbar", color = "red")
The example above adds error bars depicting 1 standard error above/below the mean net tuition costs of the private and public colleges within the first 30 rows of the colleges data frame.
Notice how the first instance of stat_summary() uses the fun argument (a simple option designed to return a single number/vector), while the second uses the fun.data argument (a complex option designed to return a data frame, which contains the lower and upper endpoints of the interval in this example).
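To see the difference between the two kinds of summary function, you can call them directly on a small vector. The sketch below uses a made-up vector named values (hypothetical, for illustration only) and checks the endpoints returned by mean_se() against a hand calculation of the standard error.

```r
library(ggplot2)

values <- c(2, 4, 6, 8)   # made-up numbers for illustration

mean(values)              # a single number: suitable for `fun`
res <- mean_se(values)    # a one-row data frame: suitable for `fun.data`
res                       # columns y, ymin, ymax

# mean_se() places ymin and ymax one standard error either side of the mean
se <- sd(values) / sqrt(length(values))
c(mean(values) - se, mean(values) + se)
```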
Question #6: Using the data subset containing only colleges located in Iowa (you created this in Question #4), create a graph similar to the example above but using the arguments geom = "linerange", alpha = 0.3, size = 4, and color = "red" to depict 1 standard error above/below the means of private and public colleges.
Trying to use aesthetics to display multiple variables on a graph can quickly become overwhelming. Facets allow you to display multiple side-by-side graphs according to one or more categorical variables.
facet_wrap() is designed to display the data broken down by a single categorical variable.
facet_grid() is designed to display the data broken down by all combinations of two categorical variables.
Two examples are shown below:
## create a subset for example purposes
reduced_colleges = subset(colleges, Region %in% c("Great Lakes", "Far West", "Plains"))
## facet_wrap
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
geom_density() +
facet_wrap(~Region, nrow = 1)
## facet_grid
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
geom_density() +
facet_grid(Private~Region)
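One way to check how the two functions lay out panels is to count the panels in the built plot. This is a sketch on the built-in mtcars data (an assumption, chosen because it has two small categorical variables, cyl and am), not on the colleges data.

```r
library(ggplot2)

# mtcars stands in for the colleges data (assumption)
base <- ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(bins = 10)

wrapped <- base + facet_wrap(~cyl, nrow = 1)  # one panel per value of cyl
gridded <- base + facet_grid(am ~ cyl)        # rows by am, columns by cyl

# Count the panels each layout produces
nrow(ggplot_build(wrapped)$layout$layout)  # 3
nrow(ggplot_build(gridded)$layout$layout)  # 6
```

In the facet_grid() formula, the variable on the left of ~ defines the rows and the variable on the right defines the columns.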
The code below will load a data set containing 970 Hollywood films
released between 2007 and 2011, then reduce these data to only include
variables that could be known prior to a film’s opening weekend. The
data are then simplified further to only include the four largest
studios (Warner Bros, Fox, Paramount, and Universal) in the three most
common genres (action, comedy, drama). You will use the resulting data (i.e., movies_subset) for Question #7.
movies = read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")
movies_subset = subset(movies,
                       LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") &
                         Genre %in% c("Action", "Comedy", "Drama"),
                       select = c("Movie", "LeadStudio", "Story", "Genre", "Budget",
                                  "TheatersOpenWeek", "Year", "OpeningWeekend"))
Question #7: Your goal in this question is to create a graphic that effectively differentiates movies with high revenue on opening weekend from those with low revenue on opening weekend (the variable “OpeningWeekend” records this revenue in millions of US dollars). To practice the topics introduced in this lab, your graphic should include at least 3 of the following 5 components:
Finally, using the graph you created, write 2-3 sentences explaining the trends you found (i.e., what attributes seem to predict a film having low/high opening weekend revenue).