ggplot
This lab is optional and will be rewarded with a small amount of extra credit. You may work on it with up to one partner and turn in a single copy with both of your names on it.
We’ll continue using the two data sets introduced in our previous
lab, as well as the ggplot2 package:
library(ggplot2) # You should already have this installed
## College majors
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")
## Police involved deaths
police = read.csv("https://remiller1450.github.io/data/Police.csv")
The graphics you made in our previous lab are informative, but not polished enough to be used in any sort of formal capacity. This lab will cover a few ways to make your graphics more professional and polished.
The ggplot2 package comes with several pre-built themes
that can be used to change the appearance of a graph. These include:
theme_bw()
theme_linedraw()
theme_light() and theme_dark()
theme_minimal()
theme_classic()
theme_void()
These functions can be added using layering to change the theme. The
example below displays a scatter plot using theme_bw():
ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) +
geom_point() +
theme_bw()
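Because a ggplot can be stored in a variable, one convenient way to compare themes is to build the plot once and add a different theme layer each time. The sketch below is only an illustration: it assumes the mpg data set that ships with ggplot2 as a stand-in (so it runs without downloading anything), and the names base_plot, plot_bw, and plot_minimal are our own.

```r
library(ggplot2)

# Build the scatter plot once; mpg is bundled with ggplot2 (stand-in data)
base_plot = ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Add a different theme layer to the same stored plot
plot_bw = base_plot + theme_bw()
plot_minimal = base_plot + theme_minimal()

# Display each version to compare them
print(plot_bw)
print(plot_minimal)
```

The same pattern should work with the college_majors scatter plot: store everything up to geom_point() and then try each theme in turn.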
Next, you might notice that the x-axis and y-axis are labeled using
the variable names we specified in aes(), including the
unprofessional-looking underscores. We can provide a different label for
any aesthetic present in our graph using the labs()
function. This is demonstrated below:
ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) +
geom_point() +
theme_bw() +
labs(x = "Median Income (Bachelor's degree holders)",
y = "Median Income (Graduate degree holders)",
color = "Major (category)")
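The labs() function also accepts title, subtitle, and caption arguments, which can help when a graph needs to stand on its own. The sketch below uses the mpg data bundled with ggplot2 as a stand-in, and the label text is made up purely for illustration.

```r
library(ggplot2)

# mpg is bundled with ggplot2 and is used here only as a stand-in data set
labeled_plot = ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_bw() +
  labs(title = "Fuel economy by engine size",      # text above the plot
       subtitle = "Each point is one car model",   # smaller text under the title
       caption = "Data: mpg (bundled with ggplot2)", # note below the plot
       x = "Engine displacement (liters)",
       y = "Highway miles per gallon",
       color = "Vehicle class")

print(labeled_plot)
```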
Question #1: Using the police data,
create a clustered bar chart displaying the variables
manner_of_death and threat_level. Change the
theme of the plot to theme_minimal() and use
labs() to change the labels of these variables to no longer
show the underscores.
Sometimes relationships between variables might be difficult to see due to the presence of skew and/or outliers. As an example, consider the following visualization:
ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
geom_point()
The one major with \(> 20\%\) of its workforce holding a PhD makes it difficult to see if there’s any relationship present among the more typical majors where \(< 10\%\) of the workforce holds a PhD.
Without changing the underlying data at all, we can change the x-axis
scale used to display the data via scale_x_continuous() and
the trans (short for transformation) argument:
ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
geom_point() +
scale_x_continuous(trans = "log2")
Each 1-unit increment on the log2 scale reflects a doubling on the original scale. So, we might conclude there is a weak, non-linear relationship between these two variables on the original scale, or a weak, linear relationship on the log2 scale.
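To see why a 1-unit step on the log2 scale corresponds to a doubling, it can help to apply log2() to a few made-up percentages that double at each step. The values in pct below are hypothetical and are not taken from the college_majors data.

```r
# Hypothetical percentages that double at each step
pct = c(1, 2, 4, 8, 16)

# On the log2 scale, consecutive doublings sit exactly 1 unit apart
log2_pct = log2(pct)
print(log2_pct)        # 0 1 2 3 4

# The gap between neighboring values is always 1
print(diff(log2_pct))  # 1 1 1 1
```

This is also why the single major above 20% no longer dominates the x-axis: going from 10% to 20% takes up the same width as going from 1% to 2%.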
Note that there is a wide variety of layer functions that use the
general format scale_AESTHETIC_TYPE(). A few examples are
listed below:
scale_y_continuous() - can re-scale a continuous variable used in the y aesthetic
scale_color_continuous() - can change the color scale of a continuous variable to a different pre-built color palette
scale_color_gradient() - can create a custom gradient scale for a continuous variable
scale_fill_brewer() - can change the colors used to fill the bars of a bar chart
Question #2: Starting with the graph you created in Question 1, add another layer that changes the fill colors to the palette “Set3”. If you’re curious, you can find a nice diagram showing all of the pre-loaded color palettes about 1/3 of the way down the page at this link.
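As one illustration of the scale_AESTHETIC_TYPE() pattern described above, the sketch below maps a continuous variable to color and then uses scale_color_gradient() to choose the two ends of the gradient. It assumes the mpg data bundled with ggplot2 as a stand-in, and the two color names are arbitrary choices.

```r
library(ggplot2)

# cty (city miles per gallon) is numeric, so it gets a continuous color scale
gradient_plot = ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  # Replace the default gradient with one running from light to dark blue
  scale_color_gradient(low = "lightblue", high = "darkblue")

print(gradient_plot)
```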
When using a categorical variable in a graph, the default behavior of
ggplot is to arrange the categories in alphabetical order.
However, many graphics tend to be more informative when groups are
arranged in ascending/descending order.
The reorder() function can be used inside of
aes() to reorder the categories of a variable. The code
below provides an example:
ggplot(data = college_majors, aes(x = reorder(Major, X = Bach_Med_Income, decreasing = TRUE), y = Bach_Med_Income)) +
geom_col() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
labs(x = "Major", y = "Median Income (Bachelor's degree holders)")
A couple of things to notice with this example:
X = Bach_Med_Income inside of reorder() told the function to reorder the categories of Major according to the values of Bach_Med_Income.
decreasing = TRUE arranged the categories in descending order of median bachelor’s degree income.
The reorder() function also allows you to reorder a
categorical variable according to a function of another variable. This
is useful for ordering boxplots or dot plots, as demonstrated below:
ggplot(data = college_majors, aes(x = reorder(Category, X = Bach_Med_Income, FUN = median, decreasing = TRUE), y = Bach_Med_Income)) +
geom_boxplot() +
labs(x = "Major Category", y = "Median Income (Bachelor's degree holders)")
Here we arrange Category in descending order according
to the median value of Bach_Med_Income. The argument
FUN = median tells reorder to use the median()
function to guide the reordering.
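It can be easier to see what reorder() does by looking at the levels it produces outside of a plot. The sketch below uses a small made-up factor (the names group and value are hypothetical). It also shows an alternative way to get descending order by negating the values, which may be useful because, as far as we are aware, the decreasing argument is only available in R 4.3.0 and later.

```r
# Toy data: three groups with two values each (made up for illustration)
group = factor(c("a", "a", "b", "b", "c", "c"))
value = c(1, 3, 10, 20, 5, 7)

# Default levels are alphabetical
print(levels(group))    # "a" "b" "c"

# Reorder by each group's median (medians: a = 2, b = 15, c = 6)
ascending = reorder(group, X = value, FUN = median)
print(levels(ascending))   # "a" "c" "b"

# Negating the values flips the ordering without using decreasing = TRUE
descending = reorder(group, X = -value, FUN = median)
print(levels(descending))  # "b" "c" "a"
```

Only the order of the levels changes; the data values themselves are left untouched, which is why the reordering can be done directly inside aes().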
These examples both use a variable in the data to guide the
reordering, but this is a problem for us when we’d like to use
geom_bar(), which calculates the height of the bars
internally. Unfortunately, there’s no easy way to handle this using
reorder(), but we can use the fct_infreq()
function in the forcats package:
# install.packages("forcats")
library(forcats)
ggplot(data = police, aes(x = fct_infreq(race))) + geom_bar()
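As with reorder(), the effect of fct_infreq() can be checked by inspecting factor levels directly. The sketch below uses a small made-up factor (the name groups is hypothetical) and also shows fct_rev(), another forcats function, which reverses the level order and so gives least-frequent-first ordering.

```r
library(forcats)

# Toy factor: "c" appears 3 times, "a" twice, "b" once (made up for illustration)
groups = factor(c("a", "c", "c", "b", "c", "a"))

# Default levels are alphabetical
print(levels(groups))      # "a" "b" "c"

# fct_infreq() puts the most frequent category first
by_freq = fct_infreq(groups)
print(levels(by_freq))     # "c" "a" "b"

# fct_rev() reverses the levels, giving least frequent first
by_freq_rev = fct_rev(by_freq)
print(levels(by_freq_rev)) # "b" "a" "c"
```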
In summary:
Use reorder() to change the ordering of a categorical variable according to either:
X = var_name (order by the values of another variable)
X = var_name and FUN = function_name (order by a function of another variable)
Use fct_infreq() to change the ordering of categories in bar charts that rely upon ggplot() to calculate the heights of the bars internally.
Question #3: Using the college_majors
data, create a column chart that shows the percentage of females in the
workforce for each major in ascending (increasing) order.
Question #4: Using the police data,
create a bar chart that shows the frequency of police-involved deaths by
state with the states arranged in descending (decreasing) order.
Question #5: Using the police data,
create a set of side-by-side box plots that show the distribution of
ages in police-involved deaths by race/ethnicity,
where the racial/ethnic categories appear along the x-axis ordered by
their corresponding mean age.