ggplot
This lab is optional and will be rewarded with a small amount of extra credit. You may work on it with up to one partner and turn in a single copy with both of your names on it.
We’ll continue using the two data sets introduced in our previous lab, as well as the ggplot2 package:
library(ggplot2) # You should already have this installed
## College majors
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")
## Police involved deaths
police = read.csv("https://remiller1450.github.io/data/Police.csv")
The graphics you made in our previous lab are informative, but not polished enough to be used in any sort of formal capacity. This lab will cover a few ways to make your graphics more professional and polished.
The ggplot2 package comes with several pre-built themes that can be used to change the appearance of a graph. These include:
theme_bw()
theme_linedraw()
theme_light() and theme_dark()
theme_minimal()
theme_classic()
theme_void()
These functions can be added as a layer to change the theme. The example below displays a scatter plot using theme_bw():
ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) +
geom_point() +
theme_bw()
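Because a theme is just another layer, one convenient pattern is to save the base plot in an object and add different themes to it for comparison. The sketch below assumes a small made-up data frame (toy_majors and its columns are hypothetical stand-ins for college_majors) so that it runs without downloading anything.

```r
library(ggplot2)

# Hypothetical toy data standing in for college_majors (values are made up)
toy_majors = data.frame(
  bach_income = c(40, 52, 61, 75),
  grad_income = c(55, 68, 80, 101)
)

# Save the base plot once...
base_plot = ggplot(data = toy_majors, mapping = aes(x = bach_income, y = grad_income)) +
  geom_point()

# ...then layer on different themes to compare their appearance
plot_bw = base_plot + theme_bw()
plot_minimal = base_plot + theme_minimal()

print(plot_bw)
print(plot_minimal)
```

The same base_plot object can be reused with any of the themes listed above.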
Next, you might notice that the x-axis and y-axis are labeled using the variable names we specified in aes(), including the unprofessional-looking underscores. We can provide a different label for any aesthetic present in our graph using the labs() function. This is demonstrated below:
ggplot(data = college_majors, mapping = aes(x = Bach_Med_Income, y = Grad_Med_Income, color = Category)) +
geom_point() +
theme_bw() +
labs(x = "Median Income (Bachelor's degree holders)",
y = "Median Income (Graduate degree holders)",
color = "Major (category)")
Question #1: Using the police data, create a clustered bar chart displaying the variables manner_of_death and threat_level. Change the theme of the plot to theme_minimal() and use labs() to change the labels of these variables so they no longer show the underscores.
Sometimes relationships between variables might be difficult to see due to the presence of skew and/or outliers. As an example, consider the following visualization:
ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
geom_point()
The one major with \(> 20\%\) of its workforce holding a PhD makes it difficult to see if there’s any relationship present among the more typical majors where \(< 10\%\) of the workforce holds a PhD.
Without changing the underlying data at all, we can change the x-axis scale used to display the data via scale_x_continuous() and the trans (short for transformation) argument:
ggplot(data = college_majors, mapping = aes(x = Per_PhD, y = Grad_Med_Income)) +
geom_point() +
scale_x_continuous(trans = "log2")
Each 1-unit increment on the log2 scale reflects a doubling on the original scale. So, we might conclude there is a weak, non-linear relationship between these two variables on the original scale, or a weak, linear relationship on the log2 scale.
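To see why a 1-unit step on the log2 scale corresponds to a doubling, it can help to apply log2() to a few values directly. The values below are made up for illustration and are not taken from the college_majors data.

```r
# Made-up values that double at each step on the original scale
per_phd = c(1, 2, 4, 8, 16)

# On the log2 scale these values are equally spaced
log2(per_phd)        # 0 1 2 3 4

# The gap between consecutive values is always 1 unit
diff(log2(per_phd))  # 1 1 1 1
```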
Note that there are a wide variety of layer functions that use the general format scale_AESTHETIC_TYPE(). A few examples are listed below:
scale_y_continuous() - can re-scale a continuous variable used in the y aesthetic
scale_color_continuous() - can change the color scale of a continuous variable to a different pre-built color palette
scale_color_gradient() - can create a custom gradient scale for a continuous variable
scale_fill_brewer() - can change the colors used to fill the bars of a bar chart
Question #2: Starting with the graph you created in Question 1, add another layer that changes the fill colors to the palette “Set3”. If you’re curious, you can find a nice diagram showing all of the pre-loaded color palettes about 1/3 of the way down the page at this link.
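As an illustration of the scale_AESTHETIC_TYPE() naming pattern, the sketch below uses scale_color_gradient(), where the aesthetic is color and the type is gradient. The data frame toy and the two color choices are assumptions made up for this example.

```r
library(ggplot2)

# Hypothetical toy data (values are made up for illustration)
toy = data.frame(x = 1:5, y = c(2, 4, 3, 6, 5), z = c(10, 20, 30, 40, 50))

# Map the continuous variable z to color, then customize its gradient:
# low values are drawn in light blue and high values in dark blue
p = ggplot(data = toy, mapping = aes(x = x, y = y, color = z)) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "darkblue")

print(p)
```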
When using a categorical variable in a graph, the default behavior of ggplot is to arrange the categories in alphabetical order. However, many graphics are more informative when groups are arranged in ascending or descending order.
The reorder() function can be used inside of aes() to reorder the categories of a variable. The code below provides an example:
ggplot(data = college_majors, aes(x = reorder(Major, X = Bach_Med_Income, decreasing = TRUE), y = Bach_Med_Income)) +
geom_col() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
labs(x = "Major", y = "Median Income (Bachelor's degree holders)")
A couple of things to notice with this example:
X = Bach_Med_Income inside of reorder() told the function to reorder the categories of Major according to the values of Bach_Med_Income.
decreasing = TRUE arranged the categories in descending order of median bachelor’s degree income.
The reorder() function also allows you to reorder a categorical variable according to a function of another variable. This is useful for ordering box plots or dot plots, as demonstrated below:
ggplot(data = college_majors, aes(x = reorder(Category, X = Bach_Med_Income, FUN = median, decreasing = TRUE), y = Bach_Med_Income)) +
geom_boxplot() +
labs(x = "Major Category", y = "Median Income (Bachelor's degree holders)")
Here we arrange Category in descending order according to the median value of Bach_Med_Income. The argument FUN = median tells reorder() to use the median() function to guide the reordering.
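To check what reorder() does before using it in a plot, you can inspect the levels it produces. The sketch below uses made-up data (category and income are hypothetical vectors, not columns from college_majors) and the default ascending order.

```r
# Hypothetical toy data (values are made up for illustration)
category = factor(c("Arts", "Arts", "Business", "Business", "STEM", "STEM"))
income = c(58, 62, 30, 34, 46, 50)

# Default ordering of the categories is alphabetical
levels(category)   # "Arts" "Business" "STEM"

# Reorder by the median income within each category (ascending by default);
# the group medians are Arts = 60, Business = 32, STEM = 48
by_median = reorder(category, X = income, FUN = median)
levels(by_median)  # "Business" "STEM" "Arts"
```

Adding decreasing = TRUE, as in the lab's examples, reverses this ordering.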
These examples both use a variable in the data to guide the reordering, but this is a problem for us when we’d like to use geom_bar(), which calculates the height of the bars internally. Unfortunately, there’s no easy way to handle this using reorder(), but we can use the fct_infreq() function in the forcats package:
# install.packages("forcats")
library(forcats)
ggplot(data = police, aes(x = fct_infreq(race))) + geom_bar()
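As with reorder(), you can inspect the levels that fct_infreq() produces to confirm the ordering. The sketch below uses a made-up factor (race_toy is hypothetical, not taken from the police data) and cross-checks the result against a frequency table from base R.

```r
library(forcats)

# Hypothetical toy data: "B" appears 3 times, "C" twice, "A" once
race_toy = factor(c("A", "B", "B", "B", "C", "C"))

# Default ordering of the categories is alphabetical
levels(race_toy)              # "A" "B" "C"

# fct_infreq() orders the categories from most to least frequent
levels(fct_infreq(race_toy))  # "B" "C" "A"

# Cross-check using a sorted frequency table from base R
names(sort(table(race_toy), decreasing = TRUE))
```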
In summary:
Use reorder() to change the ordering of a categorical variable according to either:
    the values of another variable, using X = var_name
    a function of another variable, using X = var_name and FUN = function_name
Use fct_infreq() to change the ordering of categories in bar charts that rely upon geom_bar() to calculate the heights of the bars internally.
Question #3: Using the college_majors data, create a column chart that shows the percentage of females in the workforce for each major in ascending (increasing) order.
Question #4: Using the police data, create a bar chart that shows the frequency of police-involved deaths by state, with the states arranged in descending (decreasing) order.
Question #5: Using the police data, create a set of side-by-side box plots that show the distribution of age in police-involved deaths by race/ethnicity, where the racial/ethnic categories appear along the x-axis ordered by their corresponding mean age.