plotly
This lab focuses on creating interactive graphics using
plotly, an open-source graphing tool that can interface
with R and ggplot.
Directions (Please read before starting)
This lab will primarily use the plotly package, but will
also require the ggplot2 and dplyr packages.
# load the following packages
# install.packages("plotly")
library(plotly)
library(dplyr)
library(ggplot2)
The lab’s examples will use the college scorecard data that we’ve previously been working with:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Before learning anything about plotly, you should be
aware that it is possible to convert a ggplot object into a
plotly graphic:
## Store a simple ggplot scatter plot
my_ggplot <- ggplot(data=colleges, aes(x=Cost, y=Salary10yr_median, color = Private)) + geom_point()
ggplotly(my_ggplot) ## Convert
The plotly version of this graph includes the following
features:
plot_ly()
The code below demonstrates how to use plotly to create
a scatter plot that is colored by a categorical variable:
plot_ly(data = colleges, type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private)
type = "scatter" tells plotly to draw a scatter plot
mode = "markers" plots the data as hover-able dots (rather than text labels or other symbols)
You should notice that plotly uses a ~ character
to indicate when it should look for variables in the data provided
by the data argument. If it were omitted,
plotly would look for a vector called “Cost” in your
global R environment.
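To make the role of the tilde concrete, the sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not part of the colleges data). It shows that `~Cost` is an ordinary R formula whose right-hand side can be evaluated inside a data frame, whereas a bare `Cost` refers to whatever object of that name exists in the environment.

```r
library(plotly)

# Hypothetical toy data (not from the colleges data set)
toy <- data.frame(Cost = c(20000, 35000, 50000),
                  Salary = c(40000, 48000, 61000))

# A vector with the same name in the global environment
Cost <- c(1, 2, 3)

# ~Cost is a one-sided formula; its right-hand side can be evaluated in `toy`
f <- ~Cost
in_data <- eval(f[[2]], envir = toy)   # values of the Cost column

# With the tilde, the x-values come from toy$Cost
p_data <- plot_ly(data = toy, type = "scatter", mode = "markers",
                  x = ~Cost, y = ~Salary)

# Without the tilde, the x-values come from the Cost vector defined above
p_env <- plot_ly(data = toy, type = "scatter", mode = "markers",
                 x = Cost, y = ~Salary)
```

Printing `p_data` and `p_env` side by side should show the same y-values plotted against very different x-axes.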
Additionally, you should notice that the default themes and colors in
plotly differ from those of ggplot.
plotly and ggplot
The decision to use plotly or ggplot
depends upon the end goal of your visualization. Here are some factors
to consider:
| ggplot | plotly |
|---|---|
| Easier to construct complex graphics | Interactive |
| Easier customization (colors, etc.) | Allows for 3-D graphics |
| More legible syntax and grammar | Allows for animations |
| Annotations and exporting | Can convert ggplot graphics |
Because plotly graphics are interactive, they tend to
work nicely with R
Shiny.
plotly
Similar to ggplot, it is possible to build up a
plotly graphic by sequentially adding layers using the
%>% operator (similar to the + used with
ggplot):
plot_ly(data = colleges) %>%
add_trace(type = "scatter", mode = "markers", x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>%
add_text(x = ~Cost, y = ~Salary10yr_median, text = ~State, showlegend = FALSE)
The example above creates a scatter plot using
add_trace(), then it adds a layer of text labels on top of
those markers. plotly will automatically generate a legend
for each trace/layer it displays, but you can disable this using the
showlegend argument.
The pipe operator allows plotly to be compatible with
data wrangling functions from the dplyr and
tidyr packages:
colleges %>%
filter(State %in% c("IA", "MN", "IL", "WI")) %>%
plot_ly() %>%
add_trace(type = "box", x = ~Cost, y = ~State)
Typically, the first layer of a plotly graphic is
created using add_trace() and the type
argument. Additional layers are created using other add_
functions (such as add_text()). This prevents the less
important layers from interfering with the hover capacity of the
tool-tip.
You can use this reference
page for a list of different types of graphics that can be created
using add_trace (search using the navigation drop down
menus on the left side of the page).
Question #1: Using add_trace(), create
a violin plot that separately displays the distributions of the
variable “Enrollment” for private and public colleges in the “colleges”
data set. Hint: Use the reference page linked above to
determine the proper arguments needed to create this type of graph.
Perhaps the most appealing feature of plotly is the
ability to see a label when you hover over a data point or area of
interest.
Information can be added to these labels using the text
argument in either plot_ly() or add_trace().
For example, we can add the names of each college to our previous
scatter plot:
plot_ly(data = colleges) %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
Notice how we still see the other aesthetics used to create the graph
(i.e., x, y, and color) in the label by default. Often these defaults look
messy, so you might only want to display the text labels you’ve created
using the argument hoverinfo = "text":
plot_ly(data = colleges, hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
In plotly, label text uses hypertext markup language
(HTML), so HTML commands can be used to organize and modify the
appearance of labels:
plot_ly(data = colleges, hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private,
text = ~paste0(Name, "<br>", City, ", ", State ))
In this example, paste0() is used to combine fixed
character strings with variable values, and the string
“<br>” is the HTML command used to begin a new
line.
Some other useful HTML commands include:
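As a minimal sketch, assuming a small made-up data frame (the names and values in `toy` are hypothetical), the tags `<b>` (bold) and `<i>` (italics) can be combined with `<br>` inside `paste0()`. Building the label strings first makes it easy to inspect exactly what plotly will receive.

```r
library(plotly)

# Hypothetical toy data (names and values are made up)
toy <- data.frame(Name = c("College A", "College B"),
                  City = c("Springfield", "Riverton"),
                  State = c("IL", "IA"),
                  Cost = c(30000, 45000),
                  Salary = c(42000, 55000))

# <b>...</b> gives bold text, <i>...</i> gives italics, <br> starts a new line
labels <- paste0("<b>", toy$Name, "</b><br>",
                 "<i>", toy$City, ", ", toy$State, "</i>")

# Use the prepared strings as the only content of the hover label
p <- plot_ly(data = toy, type = "scatter", mode = "markers",
             x = ~Cost, y = ~Salary, text = labels, hoverinfo = "text")
p
```

Inspecting `labels[1]` in the console shows the raw string with its tags, which is a quick way to catch a missing closing tag before plotting.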
Question #2: Using the “colleges” data, create a
scatter plot of the variables “FourYearComp_Males” and
“FourYearComp_Females” that includes a custom label which shows each
college’s name in bold text, and also shows on a new line its
“PercentFemale” after the character string “percentage female:”. For an
improved style, you can use the round() function to display
“PercentFemale” rounded to 2 decimal places.
The plotly package is able to create graphics in
3-dimensions. The code below creates a basic 3-D scatter plot:
plot_ly(data = colleges, type = "scatter3d", mode = "markers",
x = ~Enrollment, y = ~Cost, z = ~ACT_median)
Since 3-D plotly graphs are easily rotated to a
desirable orientation, they tend to be more effective and easier to use
when compared with 3-D scatter plots generated using other packages.
Another 3-D graph that plotly can create is a
surface, which can be useful in displaying modeling results.
For example, consider a linear regression model that predicts
the median 10 year salary of a college’s graduates based upon that
college’s cost and admission rate:
## Fit the model
model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)
We’ll learn more about modeling later in the course. For now, all you
should know is that lm() is used to fit linear regression
models, and this particular model uses the formula
Salary10yr_median ~ Cost + Adm_Rate to indicate that
Salary10yr_median is the model’s outcome variable while
Cost and Adm_Rate are the model’s predictor
variables.
Once the model has been fit, creating a surface to visualize the model involves two preliminary steps:
## Step 1
xs <- seq(min(colleges$Cost, na.rm = TRUE), max(colleges$Cost, na.rm = TRUE), length.out = 100) ## Seq of 100 costs
ys <- seq(min(colleges$Adm_Rate, na.rm = TRUE), max(colleges$Adm_Rate, na.rm = TRUE), length.out = 100) ## Seq of 100 adm rates
grid <- expand.grid(Cost = xs, Adm_Rate = ys) # Grid of every combo
## Step 2
z <- predict(model, newdata = grid) # Generate predictions for the entire grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE) # Reformat the predictions into a matrix
## Graph
plot_ly() %>%
add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
z = ~colleges$Salary10yr_median, color = I("black")) %>%
add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Blues")
To summarize, xs and ys are sequences of
values we’d like the model to predict over. We set these up to go from
the minimum to the maximum of those variables so that the surface we see
spans the entire range of the observed data. In this example, we asked
for 100 equally spaced points between the minimum and maximum values.
Because linear regression surfaces are flat planes, we could have asked
for far fewer and still seen the same outcome. However, for other models
we can see greater resolution by using more points in our grid. This is
because add_surface() works by connecting the heights
(predicted values) for each combination of values in xs and
ys, so more combinations will lead to less
interpolation.
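The claim that a plane needs very few grid points can be checked numerically. The sketch below uses R's built-in `mtcars` data as a stand-in (an assumption made so the example runs without downloading the colleges file). It compares the model's prediction at the centre of the grid with the average of the four corner predictions, which is the value an interpolated 2 x 2 surface would take at that point.

```r
# Stand-in data: built-in mtcars instead of the colleges data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# A coarse 2 x 2 grid: only the corners of the observed ranges
xs <- range(mtcars$wt)
ys <- range(mtcars$hp)
corners <- expand.grid(wt = xs, hp = ys)
corner_pred <- predict(fit, newdata = corners)

# The model's own prediction at the centre of the grid
centre <- data.frame(wt = mean(xs), hp = mean(ys))
centre_pred <- predict(fit, newdata = centre)

# For a plane, the average of the corners matches the model at the centre
same_height <- all.equal(mean(corner_pred), unname(centre_pred))
same_height
```

For a model with curvature, such as one using splines, the two numbers would generally differ, which is why a finer grid gives a more faithful surface in that case.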
Admittedly, this code might seem somewhat complicated, but it’s
easily adapted to other models and variables simply by modifying the
xs, ys, and model objects.
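One way to make that adaptation less error-prone is to wrap the two steps in a function. The helper below, `surface_inputs()`, is hypothetical (it is not part of plotly), and the example call uses the built-in `mtcars` data as a stand-in for the colleges data.

```r
library(plotly)

# Hypothetical helper (not part of plotly): builds the x, y and z inputs
# that add_surface() expects from a fitted model with two predictors
surface_inputs <- function(model, data, xvar, yvar, n = 100) {
  xs <- seq(min(data[[xvar]], na.rm = TRUE), max(data[[xvar]], na.rm = TRUE),
            length.out = n)
  ys <- seq(min(data[[yvar]], na.rm = TRUE), max(data[[yvar]], na.rm = TRUE),
            length.out = n)
  grid <- expand.grid(x = xs, y = ys)   # the first column varies fastest
  names(grid) <- c(xvar, yvar)          # rename to match the model's predictors
  z <- predict(model, newdata = grid)
  # byrow = TRUE puts each y-value in a row and each x-value in a column
  list(x = xs, y = ys, z = matrix(z, nrow = n, ncol = n, byrow = TRUE))
}

# Example with built-in mtcars data as a stand-in for the colleges data
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- surface_inputs(fit, mtcars, xvar = "wt", yvar = "hp", n = 25)

p <- plot_ly() %>%
  add_surface(x = s$x, y = s$y, z = s$z)
p
```

The row and column layout matters: entry `s$z[i, j]` holds the prediction at `s$y[i]` and `s$x[j]`, which is why the matrix is filled with `byrow = TRUE`.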
For example, shown below is the regression surface of a generalized additive model, or GAM, a type of model that allows for non-linear relationships between the predictors and the outcome using spline functions:
## Fit model
library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
Once the new model has been fit, only the matrix of predicted values needs to be updated (since the “x” and “y” variables from the previous example remain unchanged).
z <- predict(model, newdata = grid) # Predictions for every combination in the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE) # Store predictions as a matrix
## Graph
plot_ly() %>%
add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
z = ~colleges$Salary10yr_median, color = I("black")) %>%
add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds")
Later in the semester we will discuss methods for determining which of these two models should be preferred.
Question #3: Using this section’s code as a
template, display the linear regression surface for the model
Debt_median ~ Net_Tuition + ACT_median on a 3-D scatter
plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the
y-variable. You should use the lm() function to fit this
model prior to Step 2.
Axis labels in plotly can be modified using the
layout() function, while most other scales can be renamed
in the function used to create them (e.g., colorbar in
add_surface()) or in an argument describing them (e.g.,
marker = list(size = 2) for changing the point size). The
example below demonstrates a few of these modifications:
## Plot of the GAM model - gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
## Graph
plot_ly() %>%
add_trace(type = "scatter3d", mode = "markers", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
z = ~colleges$Salary10yr_median, color = I("black"), marker = list(size = 2)) %>%
add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds", colorbar = list(title = "Salary")) %>%
layout(scene = list(xaxis = list(title = "Cost"),
yaxis = list(title = "Admission Rate"),
zaxis = list(title = "Median 10 year salary")))
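In the 3-D example above the axis lists are nested inside `scene`. For 2-D graphics the same kind of list is passed to `layout()` directly. The sketch below illustrates this with the built-in `mtcars` data as a stand-in; the title strings are made up for illustration.

```r
library(plotly)

# Built-in mtcars data used as a stand-in for the colleges data
p <- plot_ly(data = mtcars, type = "scatter", mode = "markers",
             x = ~wt, y = ~mpg, marker = list(size = 10)) %>%
  layout(title = "Fuel economy by weight",          # overall plot title
         xaxis = list(title = "Weight (1000 lbs)"), # 2-D axes are not nested in scene
         yaxis = list(title = "Miles per gallon"))
p
```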
Documentation for the full set of options in layout()
can be found
here.
Most plotly graphics can be made into animations using
the frame argument, which indicates the series of data
snapshots that the animation will progress through.
For example, the code below creates an animated bar chart showing the populations of US states for each year going from 2010 to 2018:
## Load the data
states <- read.csv("https://remiller1450.github.io/data/state_pops.csv")
## Tidy the data
library(tidyr)
library(stringr)
states_long <- pivot_longer(states, cols = 2:ncol(states), names_to = "Year", values_to = "Population")
states_long$Year <- str_replace(string = states_long$Year, pattern = "X", replacement = "")
states_long$State <- str_replace(string = states_long$State, pattern = fixed("."), replacement = "") # fixed() matches a literal period; a bare "." is a regex wildcard
## Animation
plot_ly(data = states_long, type = "bar",
x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
layout(xaxis = list(title = " "))
Notice that these data needed to be converted to “long format” for
the column “Year” to be used as the frame argument.
Additionally, reorder() was used to arrange the states by
their initial population (assumed to be their minimum).
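To see what these two steps do, the sketch below applies them to a tiny made-up table (the data frame `wide` and its values are hypothetical). It reshapes the data so that each row is one state-year pair, then checks the ordering that `reorder()` produces.

```r
library(tidyr)

# Hypothetical wide data: one row per state, one column per year
wide <- data.frame(State = c("A", "B", "C"),
                   X2010 = c(30, 10, 20),
                   X2011 = c(31, 12, 19))

# Long format: one row per state-year combination
long <- pivot_longer(wide, cols = 2:ncol(wide),
                     names_to = "Year", values_to = "Population")
long$Year <- sub("X", "", long$Year, fixed = TRUE)  # drop the leading "X"

# reorder() sorts the factor levels by each state's minimum population
ordered_states <- reorder(long$State, X = long$Population, FUN = min)
levels(ordered_states)
```

Because the levels are sorted by each state's smallest value, a state whose population dipped below its initial value would be placed according to that dip rather than its first year.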
Animations can be customized using the animation_opts()
function.
## Fast and bouncy animation
plot_ly(data = states_long, type = "bar",
x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
animation_opts(frame = 100, easing = "elastic", redraw = FALSE) %>%
layout(xaxis = list(title = " "))
Within animation_opts(), the frame argument
controls the speed at which frames progress. The default is 500
milliseconds, so this second animation is 5 times faster than the
initial example.
The easing argument implements a transition between
frames (in this case an elastic bounce). Different easing options are listed
here between lines 68 and 103.
Finally, redraw = FALSE is used to avoid redrawing the
entire plot at each frame. In this example, redrawing doesn’t make much
of a difference, but for larger data sets or more complex visualizations
skipping the redraw can greatly reduce lag.
Question #4: The code below reads a data set
compiled by Mother
Jones that aims to document all mass shootings in the United States.
For this question, create an animated plot that displays the yearly
number of fatalities and injuries in these shootings over time. For
reference, a sample animation is included below (yours should be
similar, but it doesn’t need to be identical). Hint: Before
creating the animation you should use group_by(),
summarize(), and pivot_longer() to prepare the
data.
shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')