Data Visualization with ggplot2

This lab focuses on data visualization and how to create high-quality graphics using the ggplot2 package. In this lab we will use cleaned data; subsequent labs cover how to manipulate data prior to graphing.

Directions (Please read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Please record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Please ask for help, clarification, or even just a check-in if anything seems unclear.

Preamble

Packages and Datasets

This lab will use the ggplot2 package:

# install.packages("ggplot2")
library(ggplot2)

It will also use data from The College Scorecard:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

Description: The colleges data set records attributes and outcomes for all primarily undergraduate institutions in the United States with at least 400 full-time students for the year 2019.

How `ggplot2` creates graphics

The ggplot2 package builds graphics in a structured manner using layers. Layers are added to a graph sequentially, with each serving a particular purpose, such as:

Displaying raw data
Displaying a statistical summary
Adding metadata (e.g., annotations, context, references)

Consider the following examples:

## Example #1.1 - nothing (just a base layer)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition))

## Example #1.2 - add a layer displaying raw data
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point()

## Example #1.3 - add another layer (smoother)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point() +
  geom_smooth()

## Example #1.4 - add another layer (reference text)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point() +
  geom_smooth() + 
  annotate("text", x = 20000, y = 40000, label = "Data from 2019")

Note: A high-quality graphic doesn’t require any particular number of layers. In fact, more layers can sometimes detract from the clarity of a data visualization.
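
To see the layering described above more concretely, the sketch below stores each step as an object and counts the layers it holds. It is only an illustration: it uses ggplot2's built-in mpg data as a stand-in for the colleges data (so it runs without a download), and the object names are made up for this example.

```r
library(ggplot2)

# Stand-in data: ggplot2's built-in mpg data set (an assumption, not the colleges data)
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Each `+` returns a new plot object holding one more layer
with_points <- base + geom_point()
with_note <- with_points +
  annotate("text", x = 6, y = 40, label = "Built-in mpg data")

# Count the layers stored in each object
length(base$layers)
length(with_points$layers)
length(with_note$layers)

# Nothing is drawn until a plot object is printed
with_note
```

Storing intermediate steps this way can make it easier to try out an extra layer without retyping the whole plot.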

More on the base layer

The mapping and data arguments provided in the base layer are carried forward to all subsequent layers (which is generally desirable). However, an individual layer can override this behavior by supplying its own arguments.

## Example #2.1
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point()

## Example #2.2 (local override of color aesthetic)
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point(color = "red")
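
The override in Example #2.2 works because color = "red" is given outside of aes(), which sets a constant rather than mapping a variable. The sketch below contrasts the two cases. It assumes ggplot2's built-in mpg data as a stand-in for the colleges data, and the object names are hypothetical.

```r
library(ggplot2)

# Stand-in data: ggplot2's built-in mpg data set (an assumption)

# Setting: a constant given outside aes() is applied directly to every point
p_set <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = "red")

# Mapping: inside aes(), "red" is treated as a variable with a single value,
# and the color scale chooses the color that is displayed
p_map <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = "red"))

# layer_data() returns the values a layer actually draws
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

If a plot shows an unexpected color and a legend entry labeled "red", a constant placed inside aes() is a likely cause.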

Specifying aesthetics locally within layers:

## Example #3.1
ggplot(data = colleges) +
  geom_point(mapping = aes(x = Cost, y = Net_Tuition)) 

## Example #3.2
ggplot(data = colleges) +
  geom_point(mapping = aes(x = Cost, y = Net_Tuition)) +
  geom_point(mapping = aes(x = Cost, y = Enrollment), color = "red")

Local specification is most useful when you want to add layers that use the same parameters (i.e., “x” and “y” in the example above) but reference different things. A few other examples are drawing a line through several group means ordered by an ordinal categorical variable, or adding both polygons and points to a map.
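
As a rough sketch of the group-means case mentioned above, the code below gives two layers their own data argument while they inherit the base layer's x and y mapping. It assumes ggplot2's built-in mpg data as a stand-in, and cyl_means is a name invented for this illustration.

```r
library(ggplot2)

# Stand-in data: built-in mpg (an assumption); one row per group mean
cyl_means <- aggregate(hwy ~ cyl, data = mpg, FUN = mean)

p_means <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(alpha = 0.3) +
  # These layers inherit x and y but draw from a different data frame
  geom_line(data = cyl_means, color = "red") +
  geom_point(data = cyl_means, color = "red", size = 3)

p_means
```

This works because the summary data frame uses the same column names (cyl and hwy) as the mapping in the base layer.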

Terminology

The ggplot framework is unique because all graphics are grammatically defined using the following terminology:

Aesthetics (or “aes”) - mappings of variables to visual cues representing their values (e.g., position on the x-axis)
Geometric elements (or “geom”) - what you actually see in the plot (e.g., points, lines)
Scales - guidelines for how aesthetic mappings should be displayed (e.g., a logarithmic x-axis, a red-to-blue color palette)
Guides (or “legends”) - references to help a human reader interpret the aesthetics
Facets - rules specifying how to break up and separately display subsets of data

Question #1: Identify and briefly describe each term mentioned above in the graphic created by the code below.

ggplot(data = colleges, mapping = aes(x = Adm_Rate, color = Private)) + 
  geom_density() +
  scale_x_continuous(trans = "reverse") +
  facet_wrap(~Region)

Question #2: Create a histogram of the variable “Enrollment” displayed on the log2-scale and faceted by the variable “Private”. Use the ggplot2 cheatsheet to help you identify the necessary functions and arguments.

Lab

At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

The sections that follow introduce various customization options you can use to improve the appearance and clarity of ggplot2 graphics.

Themes

Themes are pre-built style templates used to better tailor a graphic to the mode of publication.

The example below applies a black and white theme to Example 2.1 from the preamble.

## Example #2.1 w/ black and white theme
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point() +
  theme_bw()

Other pre-built themes:

theme_bw()
theme_linedraw(), theme_light() and theme_dark()
theme_minimal()
theme_classic()
theme_void()

You can judge the differences in these themes below:
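
One way to compare them side by side is to apply each theme to the same base plot. The sketch below is only an illustration: it assumes ggplot2's built-in mpg data as a stand-in for the colleges data, and the names base, themes, and themed are made up for this example.

```r
library(ggplot2)

# Stand-in data: built-in mpg (an assumption)
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# The pre-built themes listed above
themes <- list(
  bw = theme_bw(),
  linedraw = theme_linedraw(),
  light = theme_light(),
  dark = theme_dark(),
  minimal = theme_minimal(),
  classic = theme_classic(),
  void = theme_void()
)

# Apply each theme to the same plot and title it with the theme's name
themed <- lapply(names(themes), function(nm) {
  base + themes[[nm]] + labs(title = paste0("theme_", nm, "()"))
})
names(themed) <- names(themes)

# Print the plots one after another
for (p in themed) print(p)
```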

Any theme can be further customized using theme(). Most commonly this function is used to remove a graph’s legend:

## Example #2.1 w/ black and white theme and no legend
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Region)) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "none")

Question #3: The code below creates a line graph that depicts (via a smoothed moving average) the approval of former presidents Jimmy Carter, Ronald Reagan, and Barack Obama. Modify the second portion of this code to try out a few different non-default themes. Then, briefly discuss (1-2 sentences) which themes you feel are most effective and least effective for this type of graph. Include the graph with your preferred theme in your lab write-up.

## Data processing
approval <- read.csv("https://bit.ly/398YR6M")
approval$Week = as.numeric(difftime(as.Date(approval$End.Date, "%m/%d/%y"), as.Date(approval$Inaug.Date, "%m/%d/%y"), units = "weeks"))
approval2 = subset(approval, President %in% c("Reagan", "Carter", "Obama"))

## Creating the graph
ggplot(data = approval2, mapping = aes(x = Week, y = Approving, color = President)) +
 geom_smooth(method = "loess", span = 0.6, se = FALSE)

Labels and Annotations

Labels and annotations are important aspects of well-constructed data visualizations. They are used to provide context, or draw the viewer’s attention towards particular aspects of the graphic.

Labels corresponding to aesthetics (such as x, y, color, etc.) are controlled using the labs() function:

ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  labs(x = "Sticker Cost", y = "Price Paid", color = "Admissions Rate")
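
Besides aesthetic labels, labs() also accepts title, subtitle, and caption arguments. The sketch below shows a few of these together. It assumes ggplot2's built-in mpg data as a stand-in for the colleges data, and the label text is invented for illustration.

```r
library(ggplot2)

# Stand-in data: built-in mpg (an assumption)
p_labs <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "City mpg",
    title = "Highway mileage by engine size",
    caption = "Data: ggplot2's built-in mpg data set"
  )

p_labs
```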

Annotations are added to the graphic as a layer using the annotate() function:

ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() + 
  annotate(geom = "rect", xmin = 65000, xmax = 77000, ymin = 20000, ymax = 52000, color = "red", alpha = 0.2)

The example above annotates a scatter plot by drawing a rectangle with a red outline around the cluster of colleges with high costs, high net tuition, and low admissions rates. The alpha argument sets the opacity of the rectangle's fill to 20%, so the points underneath remain visible.
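
In a "rect" annotation, color controls the outline while fill controls the interior, and alpha applies to the fill. The sketch below sets both and adds a text annotation at explicit coordinates. It assumes ggplot2's built-in mpg data as a stand-in, and the coordinates and label text are chosen only for illustration.

```r
library(ggplot2)

# Stand-in data: built-in mpg (an assumption)
p_note <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  # color = outline, fill = interior; alpha = 0.2 makes the fill 20% opaque
  annotate(geom = "rect", xmin = 5, xmax = 7.2, ymin = 21, ymax = 28,
           color = "red", fill = "red", alpha = 0.2) +
  # A text annotation placed at explicit data coordinates
  annotate(geom = "text", x = 6.1, y = 29.5, label = "High-displacement cars")

# Inspect the values used to draw the rectangle (layer 2)
rect_values <- layer_data(p_note, 2)
rect_values[, c("colour", "fill", "alpha")]

p_note
```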

Question #4: Use the subset() function to create a data frame containing only colleges located in the state of Iowa. Using these data, create a box plot of admissions rates, rename the x-axis label to “Admissions Rate”, and add a text annotation above the outlier saying “Grinnell”.

Scales

Scales are the guidelines for how aesthetics are displayed. They can be modified by adding layers using functions whose names follow the general format: “scale_aesthetic_function()”. Shown below are a few examples:

## Default Scales
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point()

## Put cost (the "x" aesthetic) on the log2 scale
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  scale_x_continuous(trans = "log2")

## Use a gradient from purple to yellow to display Adm_Rate
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  scale_color_gradient(low = "purple", high = "yellow")

## Use the popular "viridis" color scale, reversing the default direction via "-1"
ggplot(data = colleges, mapping = aes(x = Cost, y = Net_Tuition, color = Adm_Rate)) +
  geom_point() +
  scale_color_continuous(type = "viridis", direction = -1)
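
A scale transformation changes where values are drawn, not the data frame itself. The sketch below checks this for the log2 case by comparing the positions ggplot2 computes against log2() of the raw values. It assumes ggplot2's built-in mpg data as a stand-in. Recent ggplot2 releases also accept this argument under the name transform, so trans may produce a deprecation message there.

```r
library(ggplot2)

# Stand-in data: built-in mpg (an assumption)
p_log <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(trans = "log2")

# Positions actually used for drawing, as computed by ggplot2
drawn_x <- layer_data(p_log)$x

# Compare against log2 of the raw data; mpg$displ itself is unchanged
all.equal(drawn_x, log2(mpg$displ))
range(mpg$displ)
```

The axis labels still show the original units, which is usually easier to read than plotting a log-transformed column directly.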

Question #5: Create a 2-dimensional filled density plot (using geom_density_2d_filled()) with the aesthetics “x = Adm_Rate” and “y = Net_Tuition” using the “viridis” color scale and a “reverse” x-axis that goes from 1.00 to 0.00. Then, write a sentence or two describing the combinations of admissions rate and net tuition that are most prevalent among US colleges.

Stats

Often we’d like to display a statistical transformation (such as means, error bars, etc.) alongside the data itself. While this can be done by creating a new data frame, it’s generally easier to use a stat_ function:

ggplot(data = colleges[1:30,], mapping = aes(x = Net_Tuition, y = Private)) +
  geom_point() + 
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  stat_summary(fun.data = mean_se, geom = "errorbar", color = "red")

The example above adds error bars depicting 1 standard error above/below the mean net tuition costs of the private and public colleges within the first 30 rows of the colleges data frame.

Notice how the first instance of stat_summary() uses the fun argument (a simple option designed to return a single number/vector), while the second uses the fun.data argument (a complex option designed to return a data frame, which contains the lower and upper endpoints of the interval in this example).
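
To make the difference between fun and fun.data more concrete, the sketch below calls mean() and mean_se() directly on a small vector. The vector x is made up for illustration.

```r
library(ggplot2)

# A small made-up vector (illustration only)
x <- c(2, 4, 6)

# What fun = mean passes along: a single number
mean(x)

# What fun.data = mean_se passes along: a one-row data frame holding
# the center (y) and the interval endpoints (ymin, ymax)
interval <- mean_se(x)
interval

# The half-width of the interval is one standard error: sd(x) / sqrt(n)
interval$ymax - interval$y
sd(x) / sqrt(length(x))
```

Any function that returns a data frame with these column names could be supplied to fun.data in the same way.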

Question #6: Using the data subset containing only colleges located in Iowa (you created this in Question #4), create a graph similar to the example above but using the arguments geom = "linerange", alpha = 0.3, size = 4, and color = "red" to depict 1 standard error above/below the means of private and public colleges. Note that ggplot2 versions 3.4.0 and later set the thickness of lines with linewidth rather than size, so size = 4 may produce a deprecation warning.

Facets

Trying to use aesthetics to display multiple variables on a graph can quickly become overwhelming. Facets allow you to display multiple side-by-side graphs according to one or more categorical variables.

facet_wrap() is designed to display the data broken up by a single categorical variable
facet_grid() is designed to display the data broken up by all combinations of two categorical variables

Two examples are shown below:

## create a subset for example purposes
reduced_colleges = subset(colleges, Region %in% c("Great Lakes", "Far West", "Plains"))

## facet_wrap
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
  geom_density() +
  facet_wrap(~Region, nrow = 1)

## facet_grid
ggplot(data = reduced_colleges, mapping = aes(x = Cost)) +
  geom_density() +
  facet_grid(Private~Region)
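
As a quick check on how many panels each function produces, the sketch below counts the panels that contain data. It assumes ggplot2's built-in mpg data as a stand-in for the colleges data, and the object names are hypothetical.

```r
library(ggplot2)

# Stand-in data: built-in mpg (an assumption)
p_base <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_density()

# One categorical variable: one panel per level of drv
p_wrap <- p_base + facet_wrap(~drv, nrow = 1)

# Two categorical variables, written as rows ~ columns
p_grid <- p_base + facet_grid(year ~ drv)

# Count the panels that contain data
n_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_grid <- length(unique(layer_data(p_grid)$PANEL))
c(wrap = n_wrap, grid = n_grid)
```

In facet_grid() the variable on the left of the ~ defines the rows and the variable on the right defines the columns.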

Practice

The code below will load a data set containing 970 Hollywood films released between 2007 and 2011, then reduce these data to only include variables that could be known prior to a film’s opening weekend. The data are then simplified further to only include the four largest studios (Warner Bros, Fox, Paramount, and Universal) in the three most common genres (action, comedy, drama). You will use the resulting data (i.e., movies_subset) for Question #7.

movies = read.csv("https://remiller1450.github.io/data/HollywoodMovies.csv")
movies_subset = subset(movies, LeadStudio %in% c("Warner Bros", "Fox", "Paramount", "Universal") & 
                               Genre %in% c("Action", "Comedy", "Drama"),
                       select = c("Movie", "LeadStudio", "Story", "Genre","Budget",
                                  "TheatersOpenWeek","Year","OpeningWeekend"))

Question #7: Your goal in this question is to create a graphic that effectively differentiates movies with high revenue on opening weekend from those with low revenue on opening weekend (the variable “OpeningWeekend” records this revenue in millions of US dollars). To practice the topics introduced in this lab, your graphic should include at least 3 of the following 5 components:

Scale modifications
Stats
Labs or annotations
Theme changes
Facets

Finally, using the graph you created, write 2-3 sentences explaining the trends you found (i.e., what attributes seem to predict a film having low/high opening weekend revenue).