

This lab is optional and will be rewarded with a small amount of extra credit. You may work on it with up to one partner and turn in a single copy with both of your names on it.

We’ll continue using the two data sets introduced in our previous lab, as well as the ggplot2 package:

library(ggplot2) # You should already have this installed

## College majors 
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")

## Police involved deaths
police = read.csv("https://remiller1450.github.io/data/Police.csv")

Themes and labels

The graphics you made in our previous lab are informative, but not polished enough to be used in any sort of formal capacity. This lab will cover a few ways to make your graphics more professional and polished.

The ggplot2 package comes with several pre-built themes that can be used to change the appearance of a graph. These include:

  • theme_bw()
  • theme_linedraw()
  • theme_light() and theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

These functions can be added as a layer to change the theme. The example below displays a scatter plot using theme_bw():

ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) + 
  geom_point() +
  theme_bw()

Next, you might notice that the x-axis and y-axis are labeled using the variable names we specified in aes(), including the unprofessional-looking underscores. We can provide a different label for any aesthetic present in our graph using the labs() function, as demonstrated below:

ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) + 
  geom_point() +
  theme_bw() + 
  labs(x = "Median Income (Bachelor's degree holders)",
       y = "Median Income (Graduate degree holders)", 
       color = "Major (category)")

Question #1: Using the police data, create a clustered bar chart displaying the variables manner_of_death and threat_level. Change the theme of the plot to theme_minimal() and use labs() to change the labels of these variables to no longer show the underscores.


Scales and transformations

Sometimes relationships between variables are difficult to see due to the presence of skew and/or outliers. As an example, consider the following visualization:

ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
  geom_point()

The one major with \(> 20\%\) of its workforce holding a PhD makes it difficult to see if there’s any relationship present among the more typical majors where \(< 10\%\) of the workforce holds a PhD.

Without changing the underlying data at all, we can change the x-axis scale used to display the data via scale_x_continuous() and the trans (short for transformation) argument:

ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
  geom_point() +
  scale_x_continuous(trans = "log2")

Each 1-unit increment on the log2 scale reflects a doubling on the original scale. So, we might conclude there is a weak, non-linear relationship between these two variables on the original scale, or a weak, linear relationship on the log2 scale.
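To see why each step on the log2 axis corresponds to a doubling, it may help to apply log2() to a few values directly. The sketch below uses made-up percentages (pct is a hypothetical vector, not values from college_majors) purely for illustration:

```r
# Hypothetical percentages, each double the previous one
pct <- c(1, 2, 4, 8, 16)

# On the log2 scale, the doublings become evenly spaced 1-unit steps
log2_pct <- log2(pct)
print(log2_pct)        # 0 1 2 3 4
print(diff(log2_pct))  # 1 1 1 1
```

This is the same transformation that scale_x_continuous(trans = "log2") applies to the axis positions, while the tick labels stay on the original scale.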

Note that there are a wide variety of layer functions that use the general format scale_AESTHETIC_TYPE(). A few examples are listed below:

  • scale_y_continuous() - can re-scale a continuous variable used in the y aesthetic
  • scale_color_continuous() - can change the color scale of a continuous variable to a different pre-built color palette
  • scale_color_gradient() - can create a custom gradient scale for a continuous variable
  • scale_fill_brewer() - can change the colors used to fill the bars of a bar chart
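As one illustration of the scale_AESTHETIC_TYPE() pattern, the sketch below adds scale_color_gradient() to a scatter plot. The data frame toy and its columns x, y, and z are made up for this example, and the two color names are arbitrary choices:

```r
library(ggplot2)

# Hypothetical toy data, used only to illustrate the scale function
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5), z = c(10, 20, 30, 40, 50))

# Map the continuous variable z to color, then customize the gradient
p <- ggplot(data = toy, mapping = aes(x = x, y = y, color = z)) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "darkblue")

p
```

Points with small values of z should appear light blue and points with large values dark blue. The same layering approach applies to the other scale functions in the list.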

Question #2: Starting with the graph you created in Question 1, add another layer that changes the fill colors to the palette “Set3”. If you’re curious, you can find a nice diagram showing all of the pre-loaded color palettes about 1/3 of the way down the page at this link.


Category reordering

When using a categorical variable in a graph, the default behavior of ggplot is to arrange the categories in alphabetical order. However, many graphics are more informative when groups are arranged in ascending or descending order.

The reorder() function can be used inside of aes() to reorder the categories of a variable. The code below provides an example:

ggplot(data = college_majors, aes(x = reorder(Major, X = Bach_Med_Income, decreasing = TRUE), y = Bach_Med_Income)) +
  geom_col() +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  labs(x = "Major", y = "Median Income (Bachelor's degree holders)")

A couple of things to notice with this example:

  1. The argument X = Bach_Med_Income inside of reorder() told the function to reorder the categories of Major according to the values of Bach_Med_Income.
  2. The argument decreasing = TRUE arranged the categories in descending order of median bachelor’s degree income.
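Behind the scenes, reorder() returns a factor whose levels are sorted by the values of X, and ggplot places the categories along the axis in that level order. A minimal sketch using a made-up data frame (toy is hypothetical, not part of the lab data) shows the resulting level order. It negates X to get descending order, which should give the same result as decreasing = TRUE here without depending on that argument being available in your version of R:

```r
# Hypothetical toy data: one income value per major
toy <- data.frame(
  major  = c("Art", "Biology", "Chemistry"),
  income = c(40, 55, 50)
)

# Default behavior: levels sorted by increasing income
asc <- reorder(toy$major, X = toy$income)
print(levels(asc))   # "Art" "Chemistry" "Biology"

# Negating X flips the order to decreasing income
desc <- reorder(toy$major, X = -toy$income)
print(levels(desc))  # "Biology" "Chemistry" "Art"
```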

The reorder() function also allows us to reorder a categorical variable according to a function of another variable. This is useful for ordering boxplots or dot plots, as demonstrated below:

ggplot(data = college_majors, aes(x = reorder(Category, X = Bach_Med_Income, FUN = median, decreasing = TRUE), y = Bach_Med_Income)) +
  geom_boxplot() +
  labs(x = "Major Category", y = "Median Income (Bachelor's degree holders)")

Here we arrange Category in descending order according to the median value of Bach_Med_Income. The argument FUN = median tells reorder to use the median() function to guide the reordering.
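The choice of FUN can change the resulting order, because reorder() summarizes X within each category before sorting (using the mean when FUN is not given). The sketch below uses made-up values (category and income are hypothetical vectors) in which one large value pulls the mean of group A above group B while its median stays lower:

```r
# Hypothetical toy data: group A contains one unusually large value
category <- c("A", "A", "A", "B", "B", "B")
income   <- c(10, 20, 90, 30, 35, 40)

# Per-group summaries: medians are A = 20, B = 35; means are A = 40, B = 35
print(tapply(income, category, median))
print(tapply(income, category, mean))

# Ordering by median puts A first; the default (mean) puts B first
by_median <- reorder(category, X = income, FUN = median)
by_mean   <- reorder(category, X = income)
print(levels(by_median))  # "A" "B"
print(levels(by_mean))    # "B" "A"
```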

These examples both use a variable in the data to guide the reordering, but this is a problem for us when we’d like to use geom_bar(), which calculates the height of the bars internally. Unfortunately, there’s no easy way to handle this using reorder(), but we can use the fct_infreq() function in the forcats package:

# install.packages("forcats")
library(forcats) # Provides fct_infreq()

ggplot(data = police, aes(x = fct_infreq(race))) + geom_bar()
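To see what ordering by frequency means, the base R sketch below counts a made-up categorical vector with table() and sets the factor levels from most to least frequent. The vector race_toy is hypothetical, and this is only an illustration of the idea, not a description of how fct_infreq() is implemented:

```r
# Hypothetical toy vector of category labels
race_toy <- c("W", "B", "W", "H", "W", "B")

# Count each category, then sort the counts from largest to smallest
counts <- sort(table(race_toy), decreasing = TRUE)
print(counts)

# Use the sorted category names as the factor levels
race_by_freq <- factor(race_toy, levels = names(counts))
print(levels(race_by_freq))  # "W" "B" "H"
```

Without this step the levels would be alphabetical ("B", "H", "W"), which is the order geom_bar() would otherwise use.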

In summary:

  • We can use reorder() to change the ordering of a categorical variable according to either:
    • Another variable in our data, using the argument X = var_name
    • A function of another variable in our data, using the arguments X = var_name and FUN = function_name
  • We can use fct_infreq() to change the ordering of categories in bar charts that rely upon ggplot() to calculate the heights of the bars internally.

Question #3: Using the college_majors data, create a column chart that shows the percentage of females in the workforce for each major in ascending (increasing) order.

Question #4: Using the police data, create a bar chart that shows the frequency of police involved deaths by state with the states arranged in descending (decreasing) order.

Question #5: Using the police data, create a set of side-by-side box plots that show the distribution of ages among those involved in police involved deaths by race/ethnicity, where the racial/ethnic categories appear along the x-axis ordered by their corresponding mean age.