plotly
This lab focuses on creating interactive graphics using plotly, an open-source graphing tool that can interface with R and ggplot.
Directions (Please read before starting)
\(~\)
This lab will primarily use the plotly package, but will also require the ggplot2 package.
# load the following packages
# install.packages("plotly")
library(plotly)
library(dplyr)
library(ggplot2)
The lab’s examples will use the college scorecard data that we’ve previously been working with:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
\(~\)
Before learning anything about plotly, you should be aware that it is possible to convert a ggplot object into a plotly graphic:
## Store a simple ggplot scatter plot
my_ggplot <- ggplot(data=colleges, aes(x=Cost, y=Salary10yr_median, color = Private)) + geom_point()
ggplotly(my_ggplot) ## Convert
The plotly version of this graph includes the following features:
\(~\)
plot_ly()
The code below demonstrates how to use plotly to create a scatter plot that is colored by a categorical variable:
plot_ly(data = colleges, type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private)
type = "scatter" tells plotly to draw a scatter plot
mode = "markers" plots the data as hover-able dots (rather than text labels or other symbols)
You should notice plotly uses a ~ character to indicate when it should look for variables in the data provided by the data argument. If it were omitted, plotly would look for a vector called “Cost” in your global R environment.
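The role of the ~ character can be illustrated with a minimal sketch. The data frame toy below is made up for illustration (it is not the colleges data), and the object names p_formula and p_vector are hypothetical:

```r
library(plotly)

# Hypothetical toy data standing in for the colleges data
toy <- data.frame(Cost = c(20000, 35000, 50000),
                  Salary10yr_median = c(40000, 48000, 61000))

# With ~, Cost and Salary10yr_median are looked up inside `toy`
p_formula <- plot_ly(data = toy, type = "scatter", mode = "markers",
                     x = ~Cost, y = ~Salary10yr_median)

# Without ~, the vectors themselves must be supplied
p_vector <- plot_ly(type = "scatter", mode = "markers",
                    x = toy$Cost, y = toy$Salary10yr_median)
```

Both calls should draw the same three points; the formula version is usually preferable because it keeps the variable names tied to the data argument.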
Additionally, you should notice that the default themes and colors in plotly differ from those of ggplot.
\(~\)
plotly and ggplot
The decision to use plotly or ggplot depends upon the end goal of your visualization. Here are some factors to consider:
ggplot | plotly |
---|---|
Easier to construct complex graphics | Interactive |
Easier customization (colors, etc.) | Allows for 3-D graphics |
More legible syntax and grammar | Allows for animations |
Annotations and exporting | Can convert ggplot graphics |
Because plotly graphics are interactive, they tend to work nicely with R Shiny.
\(~\)
plotly
Similar to ggplot, it is possible to build up a plotly graphic by sequentially adding layers using the %>% operator (similar to the + used with ggplot):
plot_ly(data = colleges) %>%
  add_trace(type = "scatter", x = ~Cost, y = ~Salary10yr_median, color = ~Private) %>%
  add_text(x = ~Cost, y = ~Salary10yr_median, text = ~State, showlegend = FALSE)
The example above creates a scatter plot using add_trace(), then adds a layer of text labels on top of those markers. plotly will automatically generate a legend entry for each trace/layer it displays, but you can disable this using the showlegend argument.
The pipe operator allows plotly to be compatible with data wrangling functions from the dplyr and tidyr packages:
colleges %>%
  filter(State %in% c("IA", "MN", "IL", "WI")) %>%
  plot_ly() %>%
  add_trace(type = "box", x = ~Cost, y = ~State)
Typically, the first layer of a plotly graphic is created using add_trace() and the type argument. Additional layers are created using other add_ functions (such as add_text()). This prevents the less important layers from interfering with the hover capacity of the tool-tip.
You can use this reference page for a list of different types of graphics that can be created using add_trace() (search using the navigation drop down menus on the left side of the page).
Question #1: Using add_trace(), create a violin plot that separately displays the distributions of the variable “Enrollment” for private and public colleges in the “colleges” data set. Hint: Use the reference page linked above to determine the proper arguments needed to create this type of graph.
\(~\)
Perhaps the most appealing feature of plotly is the ability to see a label when you hover over a data point or area of interest.
Information can be added to these labels using the text argument in either plot_ly() or add_trace().
For example, we can add the names of each college to our previous
scatter plot:
plot_ly(data = colleges) %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
Notice how we still see the other aesthetics used to create the graph (i.e., x, y, and color) in the label by default. Often these defaults look messy, so you might only want to display the text labels you’ve created using the argument hoverinfo = "text":
plot_ly(data = colleges, hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private, text = ~Name)
In plotly, label text uses hypertext markup language (HTML), so HTML commands can be used to organize and modify the appearance of labels:
plot_ly(data = colleges, hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "markers",
x = ~Cost, y = ~Salary10yr_median, color = ~Private,
text = ~paste0(Name, "<br>", City, ", ", State ))
In this example, paste0() is used to combine fixed character strings with variable values, and the string “<br>” is the HTML command used to begin a new line.
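To see exactly what string plotly receives, here is a small base R sketch. The vectors below are made-up placeholders rather than rows from the colleges data, and the name label is hypothetical:

```r
# Hypothetical placeholder values (not taken from the colleges data)
Name  <- c("Example College", "Sample University")
City  <- c("Springfield", "Riverton")
State <- c("IL", "WY")

# paste0() works element-wise and inserts no separator of its own,
# so the comma and space must be supplied as a fixed string
label <- paste0(Name, "<br>", City, ", ", State)
label[1]
# "Example College<br>Springfield, IL"
```

Each element of label becomes one hover label, with the name on the first line and the city and state on the second.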
Some other useful HTML commands include:
Question #2: Using the “colleges” data, create a
scatter plot of the variables “FourYearComp_Males” and
“FourYearComp_Females” that includes a custom label which shows each
college’s name in bold text, and also shows on a new line its
“PercentFemale” after the character string “percentage female:”. For an
improved style, you can use the round() function to display “PercentFemale” rounded to 2 decimal places.
\(~\)
The plotly package is able to create graphics in 3-dimensions. The code below creates a basic 3-D scatter plot:
plot_ly(data = colleges, type = "scatter3d", mode = "markers",
x = ~Enrollment, y = ~Cost, z = ~ACT_median)
Since 3-D plotly graphs are easily rotated to a desirable orientation, they tend to be more effective and easier to use when compared with 3-D scatter plots generated using other packages.
Another 3-D graph that plotly can create is a surface, which can be useful in displaying modeling results.
For example, consider a linear regression model that predicts
the median 10 year salary of a college’s graduates based upon that
college’s cost and admission rate:
## Fit the model
model <- lm(Salary10yr_median ~ Cost + Adm_Rate, data = colleges)
We’ll learn more about modeling later in the course. For now, all you should know is that lm() is used to fit linear regression models, and this particular model uses the formula Salary10yr_median ~ Cost + Adm_Rate to indicate that Salary10yr_median is the model’s outcome variable while Cost and Adm_Rate are the model’s predictor variables.
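As a quick illustration of how the formula maps onto the fitted model, the sketch below uses R’s built-in mtcars data as a stand-in (an assumption made so the example runs without downloading anything; fit and coef_names are hypothetical names):

```r
# Stand-in example: mpg is the outcome, wt and hp are the predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)

# A fitted linear model has one coefficient per predictor, plus an intercept
coef_names <- names(coef(fit))
coef_names
# "(Intercept)" "wt" "hp"
```

The variable on the left of ~ never appears among the coefficients, because it is what the model predicts.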
Once the model has been fit, creating a surface to visualize it involves two preliminary steps:
## Step 1
xs <- seq(min(colleges$Cost, na.rm = TRUE), max(colleges$Cost, na.rm = TRUE), length.out = 100) ## Seq of 100 costs
ys <- seq(min(colleges$Adm_Rate, na.rm = TRUE), max(colleges$Adm_Rate, na.rm = TRUE), length.out = 100) ## Seq of 100 adm rates
grid <- expand.grid(Cost = xs, Adm_Rate = ys) # Grid of every combo
## Step 2
z <- predict(model, newdata = grid) # Generate predictions for the entire grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE) # Reformat the predictions into a matrix
## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black")) %>%
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Blues")
To summarize, xs and ys are sequences of values we’d like the model to predict over. We set these up to go from the minimum to the maximum of those variables so that the surface we see spans the entire range of the observed data. In this example, we asked for 100 equally spaced points between the minimum and maximum values. Because linear regression surfaces are flat planes, we could have asked for far fewer and still seen the same outcome. However, for other models we can see greater resolution by using more points in our grid. This is because add_surface() works by connecting the heights (predicted values) for each combination of values in xs and ys, so more combinations will lead to less interpolation.
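The reshaping step is easier to follow on a tiny grid. The sketch below uses made-up values and hypothetical object names (xs_small, ys_small, z_small, m_small), with a simple sum standing in for model predictions. It shows that expand.grid() varies its first variable fastest, so filling the matrix with byrow = TRUE puts one y value in each row and one x value in each column, which is the layout plotly surfaces are understood to expect:

```r
xs_small <- c(1, 2, 3)   # three "x" values
ys_small <- c(10, 20)    # two "y" values

# Rows of the grid: (1,10), (2,10), (3,10), (1,20), (2,20), (3,20)
grid_small <- expand.grid(x = xs_small, y = ys_small)

# Stand-in for predict(): the "height" is simply x + y
z_small <- grid_small$x + grid_small$y

# One row per y value, one column per x value
m_small <- matrix(z_small, nrow = length(ys_small),
                  ncol = length(xs_small), byrow = TRUE)
m_small
#      [,1] [,2] [,3]
# [1,]   11   12   13
# [2,]   21   22   23
```

When xs and ys have the same length, as in the lab’s example, nrow and ncol are interchangeable; with unequal lengths, nrow should match the length of ys.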
Admittedly, the surface code above might seem somewhat complicated, but it’s easily adapted to other models and variables simply by modifying the xs, ys, and model objects.
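One way to make that adaptation less error-prone is to wrap the two steps in a helper. The function prediction_surface() below is a hypothetical convenience written for this illustration (it is not part of plotly), and the demonstration uses the built-in mtcars data as a stand-in for the colleges data:

```r
# Hypothetical helper: returns the pieces needed by add_surface()
prediction_surface <- function(model, data, xvar, yvar, n = 100) {
  # Step 1: sequences spanning the observed range of each predictor
  xs <- seq(min(data[[xvar]], na.rm = TRUE), max(data[[xvar]], na.rm = TRUE), length.out = n)
  ys <- seq(min(data[[yvar]], na.rm = TRUE), max(data[[yvar]], na.rm = TRUE), length.out = n)
  grid <- expand.grid(xs, ys)      # first column varies fastest
  names(grid) <- c(xvar, yvar)     # names must match the model's predictors
  # Step 2: predictions reshaped so rows follow ys and columns follow xs
  z <- predict(model, newdata = grid)
  list(xs = xs, ys = ys, m = matrix(z, nrow = n, ncol = n, byrow = TRUE))
}

# Stand-in demonstration with built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
surf <- prediction_surface(fit, mtcars, xvar = "wt", yvar = "hp", n = 25)
dim(surf$m)
```

The returned surf$xs, surf$ys, and surf$m could then be passed to add_surface() in place of xs, ys, and m.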
For example, shown below is the regression surface of a generalized additive model, or GAM, a type of model that allows for non-linear relationships between the predictors and the outcome using spline functions:
## Fit model
library(mgcv)
model <- gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
Once the new model has been fit, only the matrix of predicted values needs to be updated (since the “x” and “y” variables from the previous example remain unchanged).
z <- predict(model, newdata = grid) # Predictions for every combination in the grid
m <- matrix(z, nrow = 100, ncol = 100, byrow = TRUE) # Store predictions as a matrix
## Graph
plot_ly() %>%
add_trace(type = "scatter3d", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
z = ~colleges$Salary10yr_median, color = I("black")) %>%
add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds")
Later in the semester we will discuss methods for determining which of these two models should be preferred.
Question #3: Using this section’s code as a template, display the linear regression surface for the model Debt_median ~ Net_Tuition + ACT_median on a 3-D scatter plot that uses “Net_Tuition” as the x-variable and “ACT_median” as the y-variable. You should use the lm() function to fit this model prior to Step 2.
\(~\)
Axis labels in plotly can be modified using the layout() function, while most other scales can be renamed in the function used to create them (e.g., colorbar in add_surface()) or in an argument describing them (e.g., marker = list(size = 2) for changing the point size). The example below demonstrates a few of these modifications:
## Plot of the GAM model - gam(Salary10yr_median ~ s(Cost) + s(Adm_Rate), data = colleges)
## Graph
plot_ly() %>%
  add_trace(type = "scatter3d", x = ~colleges$Cost, y = ~colleges$Adm_Rate,
            z = ~colleges$Salary10yr_median, color = I("black"), marker = list(size = 2)) %>%
  add_surface(x = ~xs, y = ~ys, z = ~m, colorscale = "Reds", colorbar = list(title = "Salary")) %>%
  layout(scene = list(xaxis = list(title = "Cost"),
                      yaxis = list(title = "Admission Rate"),
                      zaxis = list(title = "Median 10 year salary")))
Documentation for the full set of options in layout() can be found here.
\(~\)
Most plotly graphics can be made into animations using the frame argument, which indicates the series of data snapshots that the animation will progress through.
For example, the code below creates an animated bar chart showing the populations of US states for each year going from 2010 to 2018:
## Load the data
states <- read.csv("https://remiller1450.github.io/data/state_pops.csv")
## Tidy the data
library(tidyr)
library(stringr)
states_long <- pivot_longer(states, cols = 2:ncol(states), names_to = "Year", values_to = "Population")
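states_long$Year <- str_replace(string = states_long$Year, pattern = "X", replacement = "")
states_long$State <- str_replace(string = states_long$State, pattern = fixed("."), replacement = "") ## fixed() matches a literal period; an unescaped "." is a regular expression that matches any character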
## Animation
plot_ly(data = states_long, type = "bar",
x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
layout(xaxis = list(title = " "))
Notice that these data needed to be converted to “long format” for the column “Year” to be used as the frame argument. Additionally, reorder() was used to arrange the states by their initial population (assumed to be their minimum).
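The wide-to-long conversion can be seen on a two-row sketch. The data frame wide below holds made-up numbers (not real populations), and its column names mimic the “X” prefix that read.csv() adds to column names that start with a digit:

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(State = c("Iowa", "Ohio"),
                   X2010 = c(10, 20),
                   X2011 = c(11, 21))

# One row per state-year combination
long <- pivot_longer(wide, cols = 2:ncol(wide),
                     names_to = "Year", values_to = "Population")

# Strip the literal "X" prefix from the year labels
long$Year <- sub("X", "", long$Year, fixed = TRUE)
long
```

After this step each year is a value in a single “Year” column, which is the form the frame argument needs.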
Animations can be customized using the animation_opts() function.
## Fast and bouncy animation
plot_ly(data = states_long, type = "bar",
        x = ~reorder(State, X = Population, FUN = min), y = ~Population, frame = ~Year, showlegend = FALSE) %>%
  animation_opts(frame = 100, easing = "elastic", redraw = FALSE) %>%
  layout(xaxis = list(title = " "))
Within animation_opts(), the frame argument controls the speed at which frames progress. The default is 500 milliseconds, so this second animation is 5 times faster than the initial example.
The easing argument implements a transition between frames (in this case an elastic bounce). Different easing options are listed here between lines 68 and 103.
Finally, redraw = FALSE is used to avoid redrawing the entire plot at each frame. In this example it doesn’t make much of a difference, but for larger data sets or more complex visualizations, skipping the redraw can greatly reduce lag.
Question #4: The code below reads a data set compiled by Mother Jones that aims to document all mass shootings in the United States. For this question, create an animated plot that displays the yearly number of fatalities and injuries in these shootings over time. For reference, a sample animation is included below (yours should be similar, but it doesn’t need to be identical). Hint: Before creating the animation you should use group_by(), summarize(), and pivot_longer() to prepare the data.
shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')