\(~\)

Lab

This lab is optional and will be rewarded with a small amount of extra credit. You may work on it with up to one partner and turn in a single copy with both of your names on it.

We’ll continue using the two data sets introduced in our previous lab, as well as the ggplot2 package:

library(ggplot2) # You should already have this installed

## College majors 
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")

## Police involved deaths
police = read.csv("https://remiller1450.github.io/data/Police.csv")

Themes and labels

The graphics you made in our previous lab are informative, but not polished enough to be used in any sort of formal capacity. This lab will cover a few ways to make your graphics more professional and polished.

The ggplot2 package comes with several pre-built themes that can be used to change the appearance of a graph. These include:

  • theme_bw()
  • theme_linedraw()
  • theme_light() and theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

These functions can be added as layers to change the theme. The example below displays a scatter plot using theme_bw():

ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) + 
  geom_point() +
  theme_bw()

Next, you might notice that the x-axis and y-axis are labeled using the variable names we specified in aes(), including the unprofessional-looking underscores. We can provide a different label for any aesthetic present in our graph using the labs() function, as demonstrated below:

ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) + 
  geom_point() +
  theme_bw() + 
  labs(x = "Median Income (Bachelor's degree holders)",
       y = "Median Income (Graduate degree holders)", 
       color = "Major (category)")
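
Beyond axis and legend labels, labs() also accepts title, subtitle, and caption. The sketch below is only an illustration. As an assumption made so that it runs without downloading anything, it uses the mpg data set that ships with ggplot2 rather than this lab's data, and the label wording is invented for the example.

```r
library(ggplot2)

# Assumption: mpg is ggplot2's built-in fuel economy data set, not the lab data
p = ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_bw() +
  labs(title = "Highway fuel economy by engine size",   # shown above the plot
       subtitle = "Each point is one car model",        # shown under the title
       caption = "Data: ggplot2's built-in mpg data",   # shown below the plot
       x = "Engine displacement (liters)",
       y = "Highway miles per gallon",
       color = "Type of car")

p  # printing the stored plot draws it
```

Storing the plot in an object such as p also makes it easy to try a different theme later, for example p + theme_classic().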

Question #1: Using the police data, create a clustered bar chart displaying the variables manner_of_death and threat_level. Change the theme of the plot to theme_minimal() and use labs() to change the labels of these variables to no longer show the underscores.

\(~\)

Scales and transformations

Sometimes relationships between variables might be difficult to see due to the presence of skew and/or outliers. As an example, consider the following visualization:

ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
  geom_point()

The one major with \(> 20\%\) of its workforce holding a PhD makes it difficult to see if there’s any relationship present among the more typical majors where \(< 10\%\) of the workforce holds a PhD.

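Without changing the underlying data at all, we can change the x-axis scale used to display the data via scale_x_continuous() and the trans (short for transformation) argument. In ggplot2 3.5.0 and later this argument is named transform, and trans is deprecated but still accepted: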

ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
  geom_point() +
  scale_x_continuous(trans = "log2")

Each 1-unit increment on the log2 scale reflects a doubling on the original scale. So, we might conclude there is a weak, non-linear relationship between these two variables on the original scale, or a weak, linear relationship on the log2 scale.
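
To see the doubling concretely, here is a quick check in base R. The percentages are made-up round numbers chosen for illustration, not values from college_majors.

```r
# Hypothetical percentages, each one double the previous
phd_percent = c(1, 2, 4, 8, 16)

# On the log2 scale these become equally spaced: 0 1 2 3 4
log2_percent = log2(phd_percent)
log2_percent

# The gap between neighbors is always exactly 1 unit
diff(log2_percent)
```

This is why the outlying major no longer stretches the x-axis: going from 1% to 2% takes up as much horizontal space as going from 8% to 16%.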

Note that there are a wide variety of layer functions that use the general format scale_AESTHETIC_TYPE(). A few examples are listed below:

  • scale_y_continuous() - can re-scale a continuous variable used in the y aesthetic
  • scale_color_continuous() - can change the color scale of a continuous variable to a different pre-built color palette
  • scale_color_gradient() - can create a custom gradient scale for a continuous variable
  • scale_fill_brewer() - can change the colors used to fill the bars of a bar chart
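
As a rough sketch of how two of these scale layers combine, the example below again assumes ggplot2's built-in mpg data set (not the lab data) so that it is self-contained. The color names and axis breaks are arbitrary choices made for illustration.

```r
library(ggplot2)

# Assumption: mpg is ggplot2's built-in data set; cty is city miles per gallon
p_scales = ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  # Custom two-color gradient for the continuous color aesthetic
  scale_color_gradient(low = "lightblue", high = "darkblue") +
  # Place y-axis tick marks every 5 units instead of the defaults
  scale_y_continuous(breaks = seq(10, 45, by = 5))

p_scales
```

Each scale layer targets one aesthetic, so several can be added to the same plot as long as they apply to different aesthetics.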

Question #2: Starting with the graph you created in Question 1, add another layer that changes the fill colors to the palette “Set3”. If you’re curious, you can find a nice diagram showing all of the pre-loaded color palettes about 1/3 of the way down the page at this link.

\(~\)

Category reordering

When using a categorical variable in a graph, the default behavior of ggplot is to arrange the categories in alphabetical order. However, many graphics tend to be more informative when groups are arranged in ascending/descending order.

The reorder() function can be used inside of aes() to reorder the categories of a variable. The code below provides an example:

ggplot(data = college_majors, aes(x = reorder(Major, X = Bach_Med_Income, decreasing = TRUE), y = Bach_Med_Income)) +
  geom_col() +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  labs(x = "Major", y = "Median Income (Bachelor's degree holders)")

A couple of things to notice with this example:

  1. The argument X = Bach_Med_Income inside of reorder() told the function to reorder the categories of Major according to the values of Bach_Med_Income.
  2. The argument decreasing = TRUE arranged the categories in descending order of median bachelor’s degree income.

The reorder() function also allows you to reorder a categorical variable according to a function of another variable. This is useful for ordering boxplots or dot plots, as demonstrated below:

ggplot(data = college_majors, aes(x = reorder(Category, X = Bach_Med_Income, FUN = median, decreasing = TRUE), y = Bach_Med_Income)) +
  geom_boxplot() +
  labs(x = "Major Category", y = "Median Income (Bachelor's degree holders)")

Here we arrange Category in descending order according to the median value of Bach_Med_Income. The argument FUN = median tells reorder to use the median() function to guide the reordering.
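
It can help to see what reorder() returns outside of a plot. The sketch below uses a small made-up data frame (toy, with hypothetical columns group and value) and inspects the resulting category order with levels(). Like the examples above, it relies on the decreasing argument, which is assumed to be available in your version of R (older versions may not support it).

```r
# Hypothetical toy data: three groups with two values each
toy = data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(1, 3, 10, 12, 5, 7)
)

# Group medians are a = 2, b = 11, c = 6
ordered_group = reorder(toy$group, X = toy$value, FUN = median, decreasing = TRUE)

# Categories now run from the largest median to the smallest: "b" "c" "a"
levels(ordered_group)
```

When FUN is omitted, reorder() uses the mean by default. In the earlier column chart each major appears in only one row, so the mean is simply that major's own value.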

These examples both use a variable in the data to guide the reordering, but this is a problem for us when we’d like to use geom_bar(), which calculates the height of the bars internally. Unfortunately, there’s no easy way to handle this using reorder(), but we can use the fct_infreq() function in the forcats package:

# install.packages("forcats")
library(forcats)

ggplot(data = police, aes(x = fct_infreq(race))) + geom_bar()
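
To see what fct_infreq() does on its own, the sketch below applies it to a small made-up vector (toy_colors is hypothetical, not part of the lab data) and inspects the category order. It also shows fct_rev(), another forcats function, which reverses the order of the categories.

```r
library(forcats)

# Hypothetical toy data: "green" appears 3 times, "blue" twice, "red" once
toy_colors = c("red", "blue", "blue", "green", "green", "green")

# Most frequent category first: "green" "blue" "red"
by_freq = fct_infreq(toy_colors)
levels(by_freq)

# Reversing gives least frequent first: "red" "blue" "green"
by_freq_rev = fct_rev(by_freq)
levels(by_freq_rev)
```

Because geom_bar() draws the bars in the order of the factor's levels, wrapping a variable in fct_infreq() inside aes() is enough to sort the bars by frequency.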

In summary:

  • We can use reorder() to change the ordering of a categorical variable according to either:
    • Another variable in our data, using the argument X = var_name
    • A function of another variable in our data, using the arguments X = var_name and FUN = function_name
  • We can use fct_infreq() to change the ordering of categories in bar charts that rely upon geom_bar() to calculate the heights of the bars internally.

Question #3: Using the college_majors data, create a column chart that shows the percentage of females in the workforce for each major in ascending (increasing) order.

Question #4: Using the police data, create a bar chart that shows the frequency of police involved deaths by state with the states arranged in descending (decreasing) order.

Question #5: Using the police data, create a set of side-by-side box plots that show the distribution of ages among those involved in police involved deaths by race/ethnicity, where the racial/ethnic categories appear along the x-axis ordered by their corresponding mean age.