1. Introduction

Static visualizations, like those produced by ggplot, provide a wealth of information, but don’t allow the viewer to actively interact with the data. This isn’t a problem if the visualization is to be printed in a report or on a poster, but it can be a limitation if the visual is to be used during a presentation, hosted on a webpage, or incorporated into an R Shiny application.

The graphics package plotly is an open-source interactive graphing tool that interfaces with R and ggplot. Graphics created using the plotly package are compatible with existing tools like R Markdown and R Shiny, allowing for easy publishing and sharing.

library(plotly)   # for interactive visuals
library(ggplot2)  # for static visuals
library(tidyr)    # for data tidying
library(stringr)  # to process character strings
library(forcats)  # to process categorical data

2. ggplotly

Before learning any of the syntax used by plotly, we’ll first use the package to add interactivity to an existing graphic created by ggplot. The ggplotly function accepts any ggplot object and renders it into an interactive graphic.

In the example below, we create a simple scatterplot describing the Ames Housing data, then make it interactive with the ggplotly function. You should notice two key features have been added to the plot:

  1. A dashboard that allows users to zoom in/out and rescale the plot.
  2. Popup information that appears when you hover over a data-point.

AmesHousing <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-viz/data/AmesHousing.csv")

my_ggplot <- ggplot(data=AmesHousing) + geom_point(mapping= aes(x=GrLivArea, y=SalePrice))
ggplotly(my_ggplot)

Let’s now add some complexity and color each data-point by building type:

my_ggplot <- ggplot(data=AmesHousing) + geom_point(mapping= aes(x=GrLivArea, y=SalePrice, color = BldgType))

ggplotly(my_ggplot)

You should notice:

  1. You can click on a building type in the legend to remove that category of data-points from the plot.
  2. You now see building type, in addition to x-y coordinates, when you hover over a data-point.

ggplotly is a simple and effective way to augment an existing ggplot graphic, but in order to access the more advanced features of plotly we’ll need to construct graphics from scratch using the plot_ly function.

3. plotly Basics

The code below illustrates one way to create our previous scatterplot using plot_ly:

plot_ly(data = AmesHousing, type = "scatter",  mode = "markers", x = ~GrLivArea, y = ~SalePrice, color = ~BldgType)

Here, the argument type = "scatter" tells plotly to create a scatterplot, and the argument mode = "markers" plots the data-points as hoverable dots (rather than text labels or other characters).

You might also notice the ~ character used in front of the variables we reference in the x, y, and color arguments. This character tells plot_ly to look for that variable inside of the data.frame supplied in the data argument. Had we omitted it, plot_ly would look for vectors with those names in our global R environment.
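The lookup rule can be sketched in base R without plotly. This is only an illustration of the idea that a formula's variable is resolved inside the data frame rather than in the global environment; plotly's own evaluation is more involved, and the toy data frame and the vector named a are hypothetical.

```r
# Toy data frame and a same-named global vector (both hypothetical)
toy <- data.frame(a = c(1, 2, 3))
a <- c(10, 20, 30)

f <- ~a                                   # a one-sided formula, like x = ~GrLivArea
inside_data <- eval(f[[2]], envir = toy)  # resolves a as a column of toy
in_global <- a                            # a bare name refers to the global vector

print(class(f))
print(inside_data)
print(in_global)
```

The two results differ even though both use the name a, which is why forgetting the ~ can silently plot the wrong values.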

Similar to ggplot, we can build up plot_ly graphics by adding layers, though this time layers are added using the %>% operator. In the example below, we add a scatterplot layer using a trace with type = "scatter", then add a second layer of lines. (The lines are for illustration only and serve no practical purpose.)

plot_ly(data = AmesHousing) %>% 
  add_trace(type = "scatter",  x = ~GrLivArea, y = ~SalePrice, color = ~BldgType) %>%
  add_lines(x = ~GrLivArea, y = ~SalePrice, color = ~BldgType) 

To see the full range of plotly trace options (valid inputs to the type argument) visit the reference page. The different options appear in the navigation panel on the left side of the page; click on the type of graph you’re interested in to jump to that section of the documentation.

Question #1

Create an R Markdown document and delete the existing sections/code chunks, then add a section (defined by ## Question 1). In a code chunk within this section, load the AmesHousing data and use plotly to create a violin plot displaying the distribution of sale prices for each different house style in the Ames housing data. (Hint: you should use the reference page to learn how to add a violin plot trace).

4. Custom Hover Labels

The labels that appear when you hover over a data-point are one of the most useful features of plotly. They can be controlled via the text argument, which can be given within the plot_ly function, or within add_trace. In the example below, we label each home sale using its property id (PID):

plot_ly(data = AmesHousing, type = "scatter",
        x = ~GrLivArea, y = ~SalePrice, color = ~BldgType,
        text = ~PID)

plotly labels are constructed using Hypertext Markup Language (HTML), meaning we can customize their appearance using HTML commands:

plot_ly(data = AmesHousing, type = "scatter",
        x = ~GrLivArea, y = ~SalePrice, color = ~BldgType,
        text = ~paste0("This home was built in: ", YearBuilt, "<br> It was last sold in: ", YrSold))

The paste0 function is used to combine character strings and variable values into a single string. The text “<br>” is the HTML code to begin a new line. A few other useful HTML commands are:

Along with “<br>”, the commands indicated above are the most common HTML commands that you might use in a label. If your desired labels require something else, I encourage you to look at this HTML cheatsheet.
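As a small sketch of how such a label is assembled, the code below builds the strings by themselves, without plotting. It assumes that <b> (bold) and <i> (italic) are among the tags plotly renders in hover labels; the two-row homes data frame is made up for illustration.

```r
# Hypothetical two-row stand-in for the Ames data
homes <- data.frame(YearBuilt = c(1960, 2003), YrSold = c(2008, 2010))

# One label per row: bold and italic captions, with <br> starting a new line
labels <- paste0("<b>Built:</b> ", homes$YearBuilt,
                 "<br><i>Sold:</i> ", homes$YrSold)
print(labels)
```

A vector like this can be passed to the text argument directly, or the same paste0 call can be written inline after a ~ as in the earlier examples.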

Something you might have noticed is that our custom text was added to the default hover info. If we wanted to only include our custom label, and not the default information, we could use the argument hoverinfo = "text":

plot_ly(data = AmesHousing, type = "scatter",
        x = ~GrLivArea, y = ~SalePrice, color = ~BldgType,
        hoverinfo = 'text',
        text = ~paste("This home was built in:", YearBuilt, "<br>", "It was last sold in:", YrSold))

Question #2:

The code below uses plotly to create a piechart of different house styles. Notice what happens when you click on a category in the chart’s legend!

For Question #2, modify this code to include custom hovertext displaying the house style and that style’s median sale price on separate lines. (Hint: you should use the group_by and summarize functions in the dplyr package to get the information needed for your labels)

prop = table(AmesHousing$HouseStyle)/nrow(AmesHousing)
style = names(prop)

plot_ly() %>%
  add_trace(type = "pie", labels = ~style, values = ~prop, textinfo = "percent")

5. 3D Graphics

plotly allows us to graph in three dimensions, something that ggplot cannot accommodate. The code below creates a simple 3D scatterplot:

plot_ly(data = AmesHousing, type = "scatter3d", mode = "markers",
        x = ~GrLivArea, y = ~SalePrice, z = ~OverallQual)

plotly can also display surfaces. The code below displays the estimated two-dimensional density of overall quality and year built:

library(MASS)
kd <- kde2d(AmesHousing$YearBuilt, AmesHousing$OverallQual, n = 50)

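# The surface trace reads rows of z along y and columns along x, whereas
# kde2d returns z with rows indexed by x, so we transpose it with t()
plot_ly() %>% add_surface(x = ~kd$x, y = ~kd$y, z = ~t(kd$z), showscale = FALSE) %>% 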
  layout(scene = list(xaxis = list(title = "Year Built"), 
                      yaxis = list(title = "Overall Quality"), 
                      zaxis = list(title = "Density")))

In this example, we see that changing the labels of axes in plotly is quite cumbersome, requiring the use of nested lists within the layout function. In order to reduce the amount of code that is shown, the remaining examples in this tutorial will not format labels of axes or scales. However, you should always neatly format your axes and labels when using plotly on a project.
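One way to tame the nested lists is to build them with a small helper. The sketch below uses a hypothetical function, scene_titles, which is not part of plotly; it only constructs the list structure that the scene argument of layout was given above.

```r
# Hypothetical helper (not a plotly function): builds the nested list
# of axis titles that layout(scene = ...) expects for a 3D plot
scene_titles <- function(x, y, z) {
  list(xaxis = list(title = x),
       yaxis = list(title = y),
       zaxis = list(title = z))
}

scene <- scene_titles("Year Built", "Overall Quality", "Density")
str(scene)
```

The result can then be supplied as layout(scene = scene) at the end of a plotly pipeline, keeping the plotting code itself short.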

Perhaps more useful than 3D representations of two-dimensional densities are fitted regression planes:

model <- lm(SalePrice ~ YearBuilt + OverallQual, data = AmesHousing)

xs <- seq(1900, 2020, by = 10)
ys <- seq(1,10, by = 1)
grid <- expand.grid(xs,ys)
names(grid) <- c("YearBuilt", "OverallQual")
z <- predict(model, newdata = grid)
m <- matrix(z, nrow = length(unique(grid$YearBuilt)), ncol = length(unique(grid$OverallQual)))

# m[i, j] holds the prediction at xs[i], ys[j]. The surface trace reads rows
# of z along y and columns along x, so we pass the transpose t(m)
plot_ly() %>% add_surface(x = ~xs, y = ~ys, z = ~t(m), colors = c("#d1d1d1", "#000000")) %>% 
  add_markers(x = ~AmesHousing$YearBuilt, y = ~AmesHousing$OverallQual, z = ~AmesHousing$SalePrice, color = I("blue"))

The code used to create this plot seems pretty complicated, but it isn’t too difficult to understand with some explanation.

To begin, plotly creates a surface using a grid of x-y coordinates and a matrix of heights (z coordinates). In the z matrix that plotly receives, the element in position \(\{i,j\}\) corresponds with the height of the surface at the location defined by \(x_j\) and \(y_i\), the respective \(j^{th}\) and \(i^{th}\) elements of the \(x\) and \(y\) vectors. In other words, rows run along y and columns run along x.

For us to use this information to construct the regression plane, we first created sequences of x and y values that we wanted to plot over. We then used the expand.grid function to create a data frame containing all possible pairings of these values, as we needed heights for each x-y coordinate. We used the predict function to obtain model predictions for each of these possible x-y pairings, storing the predictions in a matrix m whose rows follow xs and whose columns follow ys. Finally, we transposed m with t() so that position \(\{i,j\}\) contains the height that plotly is expecting.

The code provided above is pretty general and can be easily adapted to different applications simply by changing the variable names and the sequences of values that are plotted over.
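When adapting the code, it helps to check the reshaping step on a grid small enough to inspect by eye. The sketch below uses made-up sequences and a simple sum as a stand-in for predict, so every height is easy to verify.

```r
# Tiny made-up grid; x + y stands in for model predictions
xs <- c(1, 2, 3)
ys <- c(10, 20)
grid <- expand.grid(x = xs, y = ys)   # x varies fastest down the rows
grid$z <- grid$x + grid$y

# Filling column by column gives m[i, j] = height at xs[i], ys[j]
m <- matrix(grid$z, nrow = length(xs), ncol = length(ys))
print(m)
print(t(m))   # transpose: rows now follow ys, columns follow xs
```

If a surface ever looks mirrored or stretched relative to the points, comparing m with t(m) in this way is a quick check of which orientation the plotting function was given.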

Looking more closely at this regression plane, you’ll notice that the effects of year and quality on sale price don’t appear to be linear. Fortunately, we can use the same general procedure to plot the prediction surface of nearly any model.

The plot below displays what the regression surface looks like using a generalized additive model (GAM), a type of model that adds flexibility to allow non-linear relationships between predictors and the outcome using splines:

library(splines)
library(mgcv)
model <- gam(SalePrice ~ s(YearBuilt) + s(OverallQual), data = AmesHousing)

xs <- seq(1900, 2020, by = 10)
ys <- seq(1,10, by = 1)
grid <- expand.grid(xs,ys)
names(grid) <- c("YearBuilt", "OverallQual")
z <- predict(model, newdata = grid)
m <- matrix(z, nrow = length(unique(grid$YearBuilt)), ncol = length(unique(grid$OverallQual)))

# As before, transpose m so that rows of z run along y and columns along x
plot_ly() %>% add_surface(x = ~xs, y = ~ys, z = ~t(m), colors = c("#d1d1d1", "#000000")) %>% 
  add_markers(x = ~AmesHousing$YearBuilt, y = ~AmesHousing$OverallQual, z = ~AmesHousing$SalePrice, color = I("red"))

Within the gam function, you’ll notice that predictors are specified within the s function, which specifies that a spline should be used to smooth the relationship between that predictor and the outcome (see the presentation on smoothing if this doesn’t sound familiar).
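To see what s() buys us, here is a minimal sketch on simulated data (not the Ames data) in which the true relationship is a sine curve. It assumes the mgcv package is installed and compares a straight-line fit with a smoothed fit.

```r
library(mgcv)   # provides gam() and s()

# Simulated curved relationship with noise (illustrative only)
set.seed(1)
toy <- data.frame(x = seq(0, 10, length.out = 200))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.2)

fit_lm  <- lm(y ~ x, data = toy)       # straight line
fit_gam <- gam(y ~ s(x), data = toy)   # spline-smoothed term

# Proportion of variance explained by each model
print(c(lm = summary(fit_lm)$r.squared, gam = summary(fit_gam)$r.sq))
```

The smoothed model should explain far more of the variation here because a straight line cannot follow the curve, which is the same reason the GAM surface above bends where the regression plane cannot.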

Question #3:

Modify the code given below to display the fitted surface of a linear regression model that predicts SalePrice using LotArea and GrLivArea as predictors. Note that you’ll need to change how the xs and ys sequences are defined, as well as many of the variable names that are referenced.

model <- lm(SalePrice ~ YearBuilt + OverallQual, data = AmesHousing)
xs <- seq(1900, 2020, by = 10)
ys <- seq(1,10, by = 1)
grid <- expand.grid(xs,ys)
names(grid) <- c("YearBuilt", "OverallQual")
z <- predict(model, newdata = grid)
m <- matrix(z, nrow = length(unique(grid$YearBuilt)), ncol = length(unique(grid$OverallQual)))

# Transpose m so that rows of z run along y and columns along x
plot_ly() %>% add_surface(x = ~xs, y = ~ys, z = ~t(m), colors = c("#d1d1d1", "#000000")) %>% 
  add_markers(x = ~AmesHousing$YearBuilt, y = ~AmesHousing$OverallQual, z = ~AmesHousing$SalePrice, color = I("green"))

6. Animations

plotly graphics can be animated using the frame argument, where each frame is a snapshot of the graphic at one point in time. The example below shows a barchart of US state populations that is animated to show changes from 2010 to 2018.

## Load the data
states <- read.csv("https://remiller1450.github.io/data/state_pops.csv")

## Tidy the data
states_long <- gather(states, key = "Year", value = "Population", 2:ncol(states))
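states_long$Year <- str_replace(string = states_long$Year, pattern = "X", replacement = "")
## fixed(".") matches a literal period; a bare "." is a regular expression that matches any character
states_long$State <- str_replace(string = states_long$State, pattern = fixed("."), replacement = "")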

## Plotly animation
plot_ly(data = states_long, type = "bar",
        x = ~fct_reorder(State, Population), y = ~Population, frame = ~Year, showlegend = FALSE) 

Notice:

  1. Before plotting we first needed to tidy these data, putting them into “long” format.
  2. The frame argument is used to control the animation. Each “frame” of this animation corresponds with a particular year.
  3. We used the fct_reorder function in the forcats package to reorder the states according to their population.
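The tidying step is easier to follow on a table small enough to print. The sketch below uses a made-up two-state table with placeholder population values, and assumes (as the tidying code above suggests) that year columns arrive prefixed with "X" and state names with a leading period.

```r
library(tidyr)
library(stringr)

# Made-up wide table: one row per state, one column per year
wide <- data.frame(State = c(".Iowa", ".Ohio"),
                   X2010 = c(1, 2),
                   X2011 = c(3, 4))

# Wide to long: one row per state-year combination
long <- gather(wide, key = "Year", value = "Population", 2:ncol(wide))

# Strip the "X" prefix and the literal leading period
long$Year  <- str_replace(long$Year, "X", "")
long$State <- str_replace(long$State, fixed("."), "")
print(long)
```

Each row of the long table is one bar in one frame of the animation, which is why the frame argument can simply point at the Year column.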

Generally speaking, any type of visual can be animated by supplying the frame argument, and there isn’t a requirement that each data-point exists within every frame. The example below demonstrates a creative use of the frame argument where trends in the size of new homes are displayed over time.

plot_ly(data = AmesHousing, type = "scatter", mode = "markers", 
        x = ~GrLivArea, y = ~TotRmsAbvGrd, frame = ~round(YearBuilt,-1),
        showlegend = FALSE) %>%
   animation_opts(frame = 1000, easing = "elastic", redraw = FALSE)

This example also illustrates a couple of additional animation features, which are modified using the animation_opts function:

  1. The speed at which the animation progresses is controlled by the frame argument (confusing, I know). The default value is 500 milliseconds, but in this example we increase it to 1000 milliseconds, resulting in slower transitions between frames.
  2. The way by which frames transition can be changed via the easing argument. Here, easing = "elastic" causes the points to bounce when a new frame occurs. Different easing options can be found here, with the option names being listed at approximately line 80.
  3. The redraw = FALSE argument can improve the performance of laggy animations by not entirely re-drawing the plot at each transition. However, in this example, it doesn’t make much difference.
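The frame = ~round(YearBuilt, -1) expression in the example above is what groups homes into frames. A minimal sketch with made-up years shows what it does: a negative digits value rounds to the nearest ten, so each home is assigned to the nearest decade rather than the decade it was built in.

```r
# Made-up construction years
years <- c(1994, 1996, 2003)

nearest_decade <- round(years, -1)       # rounds to the nearest multiple of 10
own_decade     <- floor(years / 10) * 10 # alternative: always rounds down

print(nearest_decade)
print(own_decade)
print(table(nearest_decade))             # number of homes per animation frame
```

Which version to use is a design choice; rounding down keeps a home built in 1996 in the 1990 frame, while rounding to the nearest ten moves it to the 2000 frame.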

Question #4:

The code below loads a dataset compiled by Mother Jones documenting mass shootings in the United States. For this question you should create an animated plot of your choosing that highlights something you deem to be an important or interesting trend in these data.

shootings <- read.csv('https://remiller1450.github.io/data/MassShootings.csv')

7. Resources