This lab focuses on creating interactive graphics using plotly, an open-source graphing tool that can interface with R and ggplot.

Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.


Preamble

Packages and Datasets

This lab will primarily use the plotly package, but will also require the ggplot2 package.

# load the following packages
# install.packages("plotly")
library(plotly)
library(ggplot2)

The lab’s examples will use the college scorecard data that we’ve previously been working with:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")


Converting existing graphs

Before learning anything about plotly, you should be aware that it is possible to convert a ggplot object into a plotly graphic:

## Store a simple ggplot scatter plot
my_ggplot <- ggplot(data=colleges, aes(x=Cost, y=Salary10yr_median, color = Private)) + geom_point()
ggplotly(my_ggplot) ## Convert

The plotly version of this graph includes the following features:

  1. A dashboard with options to zoom in, zoom out, re-scale, etc.
  2. A tool-tip that displays information when you hover over a data point.
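The information shown in the tool-tip can be narrowed with the tooltip argument of ggplotly(). The following is a minimal sketch on a small made-up data frame; the names toy, cost, salary, and sector are illustrative and are not part of the colleges data:

```r
library(ggplot2)
library(plotly)

# Made-up data for illustration only
toy <- data.frame(cost   = c(20, 35, 50, 65),
                  salary = c(38, 42, 51, 60),
                  sector = c("Public", "Public", "Private", "Private"))

toy_gg <- ggplot(toy, aes(x = cost, y = salary, color = sector)) + geom_point()

# tooltip = c("x", "y") limits the hover label to the x and y aesthetics
toy_widget <- ggplotly(toy_gg, tooltip = c("x", "y"))
toy_widget
```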


Creating the graph using plot_ly()

The code below demonstrates how to use plotly to create a scatter plot that is colored by a categorical variable:

plot_ly(data = colleges, type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private)
  • type = "scatter" tells plotly to draw a scatter plot
  • mode = "markers" plots the data as hover-able dots (rather than text labels or other symbols)

You should notice plotly uses a ~ character to indicate when it should look for variables in the data provided by the data argument. If it were omitted, plotly would look for a vector called “Cost” in your global R environment.
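To make the role of the ~ concrete, the sketch below draws the same scatter plot twice: once with formulas that refer to columns of a data frame, and once with vectors passed directly. The data frame toy and its columns are made up for illustration:

```r
library(plotly)

# Made-up data for illustration only
toy <- data.frame(cost = c(20, 35, 50), salary = c(38, 42, 51))

# With data supplied, ~cost means "the cost column of toy"
p_formula <- plot_ly(data = toy, type = "scatter", mode = "markers",
                     x = ~cost, y = ~salary)

# Without the ~, plotly expects the values themselves, so pass vectors directly
p_vectors <- plot_ly(type = "scatter", mode = "markers",
                     x = toy$cost, y = toy$salary)

p_formula
```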

Additionally, you should notice that the default themes and colors in plotly differ from those of ggplot.


Comparison of plotly and ggplot

The decision to use plotly or ggplot depends upon the end goal of your visualization. Here are some factors to consider:

ggplot                                  | plotly
Easier to construct complex graphics    | Interactive
Easier customization (colors, etc.)     | Allows for 3-D graphics
More legible syntax and grammar         | Allows for animations
Annotations and exporting               | Can convert ggplot graphics

Because plotly graphics are interactive, they tend to work nicely with R Shiny.


Lab

Layering in plotly

As in ggplot, a plotly graphic can be built up by sequentially adding layers. Layers are chained with the %>% operator, which plays the same role as the + used with ggplot:

plot_ly(data = colleges) %>% 
    add_trace(type = "scatter", mode = "markers", x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>%
    add_text(x = ~Cost, y = ~Salary10yr_median, text = ~State)

The example above creates a scatter plot using add_trace(), then it adds a layer of text labels on top of those markers.
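Layers can also inherit the aesthetics declared in plot_ly(), which avoids repeating x and y in every layer. The sketch below assumes a small made-up data frame (toy and its columns are illustrative) and uses add_markers(), a shortcut for a scatter trace drawn with markers:

```r
library(plotly)

# Made-up data for illustration only
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2, 4, 5, 8),
                  label = c("a", "b", "c", "d"))

# x and y are declared once and inherited by both layers
p_layers <- plot_ly(data = toy, x = ~x, y = ~y) %>%
  add_markers() %>%
  add_text(text = ~label, textposition = "top center")

p_layers
```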

The pipe operator allows plotly to be compatible with data wrangling functions from the dplyr and tidyr packages:

library(dplyr)   # filter() comes from dplyr
colleges %>% 
   filter(State %in% c("IA", "MN", "IL", "WI")) %>%
plot_ly() %>%
   add_trace(type = "box", x = ~Cost, y = ~State)

Typically, the first layer of a plotly graphic is created using add_trace() and the type argument. Additional layers are created using other add_ functions (such as add_text()). This prevents the less important layers from interfering with the hover capacity of the tool-tip.

You can use this reference page for a list of different types of graphics that can be created using add_trace (search using the navigation drop down menus on the left side of the page).

Question #1: Using add_trace(), create a violin plot that separately displays the distributions of the variable “Enrollment” for private and public colleges in the “colleges” data set. Hint: Use the reference page linked above to determine the proper arguments needed to create this type of graph.


Custom Labels

Perhaps the most appealing feature of plotly is the ability to see a label when you hover over a data point or area of interest.

Information can be added to these labels using the text argument in either plot_ly() or add_trace(). For example, we can add the names of each college to our previous scatter plot:

plot_ly(data = colleges) %>%
  add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)

Notice that the other aesthetics used to create the graph (i.e., x, y, and color) still appear in the label by default. These defaults often look messy, so you might prefer to display only the text labels you’ve created by using the argument hoverinfo = "text":

plot_ly(data = colleges, hoverinfo = "text") %>%
  add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)

In plotly, label text uses hypertext markup language (HTML), so HTML commands can be used to organize and modify the appearance of labels:

plot_ly(data = colleges) %>% 
    add_trace(type = "scatter", mode = "markers",
        x = ~Cost, y = ~Salary10yr_median, color = ~Private,
        text = ~paste0(Name, "<br>", City, ", ", State ))

In this example, paste0() is used to combine fixed character strings with variable values, and the string “<br>” is the HTML command used to begin a new line.

Some other useful HTML commands include:

  • <b> my text </b> - Bolds the text in between the tags
  • <i> my text </i> - Italicizes the text in between the tags
  • x<sub>i</sub> - Adds a subscript, in this case we get \(x_i\)
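Because these labels are ordinary character strings, they can be built and inspected before being passed to the text argument. The sketch below uses made-up values (name, city, and state here are standalone vectors, not columns of the colleges data):

```r
# Made-up values for illustration only
name  <- c("College A", "College B")
city  <- c("Springfield", "Riverton")
state <- c("IL", "IA")

# Italic name on the first line, bold location on the second
labels <- paste0("<i>", name, "</i><br><b>", city, ", ", state, "</b>")
labels[1]   # "<i>College A</i><br><b>Springfield, IL</b>"
```

Note that each opening tag is paired with a closing tag that contains a forward slash.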

Question #2: Using the “colleges” data, create a scatter plot of the variables “FourYearComp_Males” and “FourYearComp_Females” that includes a custom label which shows each college’s name in bold text, and also shows on a new line its “PercentFemale” after the character string “percentage female:”.


3-D Graphics

The plotly package is able to create graphics in three dimensions. The code below creates a basic 3-D scatter plot:

plot_ly(data = colleges, type = "scatter3d", mode = "markers",
        x = ~Enrollment, y = ~Cost, z = ~ACT_median)

Since 3-D plotly graphs can be rotated, they tend to be more effective visualizations than 3-D scatter plots generated using other packages.

Another 3-D graph that plotly can create is a surface, which can be useful in displaying modeling results. For example, consider a linear regression model that predicts the median 10 year salary of a college’s graduates based upon that college’s cost and admission rate:

## Fit the model
model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)

We’ll learn more about modeling later in the course. For now, all you should know is that lm() is used to fit linear regression models, and this particular model uses the formula Salary10yr_median ~ Cost + Adm_Rate to indicate that Salary10yr_median is the model’s outcome variable while Cost and Adm_Rate are the model’s predictor variables.
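To see what lm() and its formula do on a small scale, here is a sketch using made-up data in which the outcome is constructed as exactly 1 + 2*x1 + 3*x2 (toy, x1, x2, and y are illustrative names):

```r
# Made-up data where y is exactly 1 + 2*x1 + 3*x2
toy <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(2, 1, 4, 3, 6))
toy$y <- 1 + 2 * toy$x1 + 3 * toy$x2

# y is the outcome; x1 and x2 are the predictors
toy_fit <- lm(y ~ x1 + x2, data = toy)
coef(toy_fit)   # the intercept and the two slopes estimated by the fit

# predict() evaluates the fitted model at new predictor values
toy_pred <- predict(toy_fit, newdata = data.frame(x1 = 10, x2 = 10))
toy_pred
```

Since the made-up outcome has no noise, the estimated coefficients should match 1, 2, and 3 up to rounding error.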

Once the model has been fit, creating a surface to visualize the model involves two preliminary steps:

  1. Creating a grid containing the combinations of predictors we’d like to appear in the graph (i.e., setting up the surface’s “x” and “y” scales)
  2. Creating a matrix of model predictions corresponding to each value within that grid (i.e., setting up the “z” scale, or the surface’s height)
## Step 1
xs <- seq(min(colleges$Cost, na.rm = TRUE), max(colleges$Cost, na.rm = TRUE), length.out = 100)      ## Seq of 100 costs
ys <- seq(min(colleges$Adm_Rate, na.rm = TRUE), max(colleges$Adm_Rate, na.rm = TRUE), length.out = 100)      ## Seq of 100 adm rates
grid <- expand.grid(Cost = xs, Adm_Rate = ys)                    # Grid of every combo

## Step 2
z <- predict(model, newdata = grid)                         # Generate predictions for the entire grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)        # Reformat the predictions into a matrix

## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>% 
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Blues")

To summarize, xs and ys are sequences of values we’d like the model to predict over. We set these up to go from the minimum to the maximum of those variables so that the surface we see spans the entire range of the observed data. In this example, we asked for 100 equally spaced points between the minimum and maximum values. Because linear regression surfaces are flat planes, we could have asked for far fewer and still seen the same outcome. However, for other models we can see greater resolution by using more points in our grid. This is because add_surface() works by connecting the heights (predicted values) for each combination of values in xs and ys, so more combinations will lead to less interpolation.
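The orientation of the matrix matters: a plotly surface reads each row of z as one y value and each column as one x value. The sketch below checks on a tiny made-up grid that expand.grid() followed by matrix(..., byrow = TRUE) produces that layout. The toy_ objects are illustrative, and a simple sum stands in for predict():

```r
toy_xs <- c(1, 2, 3)          # plays the role of xs (3 values)
toy_ys <- c(10, 20)           # plays the role of ys (2 values)
toy_grid <- expand.grid(x = toy_xs, y = toy_ys)   # x varies fastest

# Stand-in for predict(): any function of the grid columns
toy_z <- toy_grid$x + toy_grid$y

# One row per y value, one column per x value
toy_m <- matrix(toy_z, nrow = length(toy_ys), ncol = length(toy_xs), byrow = TRUE)
toy_m[2, 3]   # value at toy_ys[2] and toy_xs[3], i.e. 3 + 20 = 23
```

When xs and ys have different lengths, nrow should be the number of y values and ncol the number of x values, as in this sketch.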

Admittedly, this code might seem somewhat complicated, but it’s easily adapted to other models and variables simply by modifying the xs, ys, and model objects.

For example, shown below is the regression surface of a generalized additive model, or GAM, a type of model that allows for non-linear relationships between the predictors and the outcome using spline functions:

## Fit model
library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)

Once the new model has been fit, only the matrix of predicted values needs to be updated (since the “x” and “y” variables from the previous example remain unchanged).

z <- predict(model, newdata = grid)                   # Predictions for every combination in the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE)  # Store predictions as a matrix

## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>% 
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds")

Later in the semester we will discuss methods for determining which of these two models should be preferred.

Question #3: Using this section’s code as a template, display the linear regression surface for the model Debt_median ~ Net_Tuition + ACT_median on a 3-D scatter plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the y-variable. You should use the lm() function to fit this model prior to Step 2.


Customizing Appearance

Axis labels in plotly can be modified using the layout() function. Most other scales can be renamed in the function used to create them (e.g., the colorbar argument in add_surface()):

## Plot of the GAM model - gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)

## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>% 
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds", colorbar = list(title = "Salary")) %>%
  layout(scene = list(xaxis = list(title = "Cost"),
                      yaxis = list(title = "Admission Rate"),
                      zaxis = list(title = "Median 10 year salary")))

Documentation for the full set of options in layout() can be found here.
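For 2-D graphics the axes are set directly in layout() rather than inside a scene list. A minimal sketch on made-up data follows; toy, its columns, and the label text are illustrative:

```r
library(plotly)

# Made-up data for illustration only
toy <- data.frame(cost = c(20, 35, 50), salary = c(38, 42, 51))

p_labeled <- plot_ly(data = toy, type = "scatter", mode = "markers",
                     x = ~cost, y = ~salary) %>%
  layout(title = "Salary versus cost",
         xaxis = list(title = "Cost (thousands of dollars)"),
         yaxis = list(title = "Salary (thousands of dollars)"))

p_labeled
```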


Animation

Most plotly graphics can be made into animations using the frame argument, which indicates the series of data snapshots that the animation will progress through.

For example, the code below creates an animated bar chart showing the populations of US states for each year going from 2010 to 2018:

## Load the data
states <- read.csv("https://remiller1450.github.io/data/state_pops.csv")

## Tidy the data
library(tidyr)
library(stringr)
states_long <- pivot_longer(states, cols = 2:ncol(states), names_to = "Year", values_to = "Population")
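states_long$Year <- str_replace(string = states_long$Year, pattern = "X", replacement = "")
## fixed(".") matches a literal period; a bare "." is a regular expression that matches any character
states_long$State <- str_replace(string = states_long$State, pattern = fixed("."), replacement = "")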

## Animation
plot_ly(data = states_long, type = "bar",
        x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE)

Notice that these data needed to be converted to “long format” for the column “Year” to be used as the frame argument. Additionally, reorder() was used to arrange the states by their initial population (assumed to be their minimum).
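The reshaping and reordering steps can be checked on a tiny made-up table before applying them to the real data. In the sketch below, toy_wide and its values are illustrative:

```r
library(tidyr)

# Made-up wide data: one row per state, one column per year
toy_wide <- data.frame(State = c("A", "B"),
                       X2010 = c(5, 3),
                       X2011 = c(6, 4))

# Long format: one row per state-year combination
toy_long <- pivot_longer(toy_wide, cols = 2:ncol(toy_wide),
                         names_to = "Year", values_to = "Population")
nrow(toy_long)   # 4

# reorder() returns a factor whose levels are sorted by each state's minimum
state_order <- levels(reorder(toy_long$State, X = toy_long$Population, FUN = min))
state_order   # "B" "A"
```

State B comes first because its smallest population (3) is below state A's smallest population (5).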

Animations can be customized using the animation_opts() function.

## Fast and bouncy animation
plot_ly(data = states_long, type = "bar",
        x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
         animation_opts(frame = 100, easing = "elastic", redraw = FALSE)

Within animation_opts(), the frame argument controls the speed at which frames progress. The default is 500 milliseconds, so this second animation is 5 times faster than the initial example.

The easing argument implements a transition between frames (in this case an elastic bounce). Different easing options are listed here between lines 68 and 103.

Finally, redraw = FALSE is used to avoid redrawing the entire plot at each frame. In this example, skipping the redraw doesn’t make much of a difference, but for larger data sets or more complex visualizations it can greatly reduce lag.
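animation_opts() also accepts a transition argument, which sets how many milliseconds of each frame are spent on the smooth transition. The sketch below slows an animation down instead of speeding it up; toy_frames and its columns are made up for illustration:

```r
library(plotly)

# Made-up data: two bars observed at three time steps
toy_frames <- data.frame(group = rep(c("A", "B"), times = 3),
                         value = c(1, 4, 2, 3, 3, 2),
                         step  = rep(1:3, each = 2))

# One second per frame, with half of that spent transitioning
p_slow <- plot_ly(data = toy_frames, type = "bar",
                  x = ~group, y = ~value, frame = ~step, showlegend = FALSE) %>%
  animation_opts(frame = 1000, transition = 500, easing = "linear")

p_slow
```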

Question #4: The code below reads a data set compiled by Mother Jones that aims to document all mass shootings in the United States. For this question, create an animated plot that displays the yearly number of fatalities and injuries in these shootings over time. For reference, a sample animation is included below (yours should be similar, but it doesn’t need to be identical). Hint: Before creating the animation you should use group_by(), summarize(), and pivot_longer() to prepare the data.

shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')