ggplot (part 1)
This lab introduces R and RStudio as well as the format of future class sessions.
Directions (read before starting)
Some functions we’ll use are contained in the default installation of R, but others are stored in external libraries known as “packages”. The first package we’ll use is ggplot2, which is used to create professional quality data visualizations. To use a package it must be installed onto the computer you’re working on. You can do this using install.packages():
install.packages("ggplot2")
Installing a package is similar to downloading an app - you only ever need to download it once, but you’ll need to load it each time you want to use it. As a result, you’ll frequently see code that looks like this:
# install.packages("ggplot2")
library(ggplot2)
In this example, the installation code is not run because it is commented out, and library() loads the ggplot2 package that was previously downloaded (possibly on an entirely different date).
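A common variation on this pattern, sketched below, installs the package only when it is missing, so the same script can run on a new computer without editing comments. requireNamespace() is part of base R; installing automatically is an assumption about how you want your script to behave, not a requirement of this lab.

```r
# Install ggplot2 only if it is not already available on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current R session
library(ggplot2)
```

Either version works; the commented-out form shown above is simpler and makes it obvious that installation is a one-time step.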
The ggplot2 package uses a layer-based framework to build data visualizations. We can understand this framework using the following sequence of examples:
## Read the data from our last lab
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")
## Example #1.1 - nothing (just a base layer)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income))
## Example #1.2 - add a layer displaying raw data
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point()
## Example #1.3 - add another layer (regression line)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
geom_smooth(method = "lm")
## Example #1.4 - add another layer (reference text)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
geom_smooth(method = "lm") +
annotate("text", x=25, y=80000, label = "Strong, positive relationship")
Notice how each layer is added to the existing graph using +. The data frame and variable names from the first layer are passed forward into later layers. Thus, the variables Per_Male and Bach_Med_Income contained in the data frame college_majors are used by the layer created by geom_point() to create a scatter plot.
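Because each + returns a new plot object, the layering can also be seen by storing intermediate plots and counting their layers. The following is a minimal sketch that assumes a small made-up data frame (toy, with columns x and y) rather than the lab's data; the object names are hypothetical.

```r
library(ggplot2)

# Small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Base layer: stores the data and the aesthetic mappings, draws nothing yet
base <- ggplot(data = toy, mapping = aes(x = x, y = y))

# Each `+` returns a new plot object with one more layer
with_points <- base + geom_point()
with_line   <- with_points + geom_smooth(method = "lm")

# Count the layers stored in each plot object
length(base$layers)
length(with_points$layers)
length(with_line$layers)
```

Typing the name of a stored plot (for example, with_line) at the console draws it.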
The guiding philosophy of ggplot is to define data visualizations grammatically using:
aes(): for example, x = Per_Male instructs ggplot to relate each value of Per_Male to a position on the x-axis
+ geom_point(): instructs ggplot to use points to display the aesthetic mappings you defined inside of aes()
Sometimes we’ll want a visual cue to be mapped to a variable, but other times we won’t. What difference do you see in the following examples?
## Example 2.1
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = Category)) + geom_point() + labs(title = "Example 2.1")
## Example 2.2
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = "blue")) + geom_point() + labs(title = "Example 2.2")
## Example 2.3
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point(color = "blue") + labs(title = "Example 2.3")
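Once you have formed your own answer, one way to check it is to ask ggplot which colours each plot will actually draw. The sketch below assumes a small made-up data frame (toy) in place of college_majors and uses layer_data() from ggplot2 to inspect the computed values for the first layer.

```r
library(ggplot2)

# Small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))

# Inside aes(): "blue" is treated as data (a variable with one category)
mapped <- ggplot(toy, aes(x = x, y = y, color = "blue")) + geom_point()

# Outside aes(): "blue" is used directly as the colour of the points
fixed <- ggplot(toy, aes(x = x, y = y)) + geom_point(color = "blue")

# layer_data() returns the values a layer will draw after scales are applied
mapped_colours <- unique(layer_data(mapped)$colour)
fixed_colours  <- unique(layer_data(fixed)$colour)

mapped_colours  # a colour chosen by the default colour scale
fixed_colours   # the literal colour that was requested
```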
At this point you should begin working independently with your assigned partner(s) using a pair programming dynamic. You will be responsible for submitting one copy of responses to the questions embedded in the lab. However, I will remind you that all content covered in labs may appear on an exam.
At the end of this lab is a section titled “Examples”. You can navigate to this section using the navigation bar on the left side of the screen. The section contains an example of each type of graph I’ll expect you to be able to create throughout this course.
You do not need to memorize or study the commands or arguments needed to create these graphs. Instead, you will be responsible for knowing the situations where a specific type of graph is appropriate, as well as how to interpret what that graph tells you about your data.
Throughout the lab you’ll need to use two different data sets:
## College majors
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")
## Police involved deaths
police <- read.csv("https://remiller1450.github.io/data/Police.csv")
Because we’ll put an emphasis on interpretation, it’s important that you know a little bit about where these data came from:
college_majors contains salary data for various college majors based upon the results of the 2022 American Community Survey. The data were originally obtained from this page.
police was aggregated by the Washington Post and made publicly available here. This data set documents all fatal shootings by a police officer that took place between 2015 and mid-2020.
Question #1: Use the environment panel to explore the contents of each data set. Which data set appears to contain mostly categorical variables? Which data set appears to contain mostly numeric variables?
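The information in the environment panel is also available from the console. As a minimal sketch, the code below uses str() and sapply(), both part of base R, on a small made-up data frame (toy, with hypothetical columns); the same calls can be applied to any data frame you have loaded.

```r
# Small made-up data frame standing in for a real data set (hypothetical values)
toy <- data.frame(
  major  = c("Biology", "History", "Statistics"),
  area   = c("Science", "Humanities", "Science"),
  income = c(52000, 48000, 70000)
)

# str() lists each variable with its type, much like the environment panel
str(toy)

# Tally how many variables are numeric versus not numeric
is_numeric <- sapply(toy, is.numeric)
table(is_numeric)
```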
In this section you will be given several questions that you will need to answer by creating and interpreting an appropriate univariate (single variable) data visualization. All answers should include the code used to create your visualization and a written response (as a comment). Remember to use the “Examples” section at the end of the lab to find template code for the type of graph you decide is most appropriate.
Question #2: The race variable in the police data set records the racial or ethnic group of the individual. Create an appropriate graph displaying the distribution of this variable. Use this graph to describe the distribution of this variable and what that tells you about who has been killed by the police in recent years. If you’d like additional information on this variable you may read the data documentation here.
Question #3: The age variable in police records the age at the time of death of each individual in the data set. Create an appropriate graph displaying the distribution of this variable. Use this graph to describe the distribution of this variable and what that tells you about who has been killed by the police in recent years.
Question #4: The year variable in police indicates the year during which the individual was killed. Create an appropriate graph displaying the distribution of this variable, then assess whether the number of police involved deaths appears to be increasing, decreasing, or relatively stable in recent years.
This section is similar to part 1, but you should create a bivariate (two variable) graph as your support for each question.
Question #5: In the college_majors data set, the Per_Male variable describes the percentage of the workforce with a given major that identifies as male, and the Bach_Med_Income variable reports the median income of the workforce with that major. Create a graph that displays the relationship between these two variables and briefly describe what you see in 1-2 sentences.
Question #6: In the college_majors data set, the Category variable describes the general area of a given major. Create a graph to assess whether any of these areas disproportionately involve majors with higher median incomes than others.
Question #7: In the college_majors data set, the Per_Masters variable describes the percentage of the workforce in a given major whose highest degree is a master’s degree. Create a graph to assess whether there is a relationship between the percentage of the workforce in a field who holds a master’s degree and the unemployment rate in that field.
Question #8: In the police data set, one might hypothesize that the presence of a body camera might deter an officer from using deadly force on an unarmed suspect. Create a graph that explores this hypothesis and briefly describe what the graph tells you.
In the near future we’ll discuss the different ways that a third variable can impact the association between an explanatory and response variable. The questions below are intended to give you an opportunity to further practice creating graphics with ggplot and gain some exposure to the issues that arise when trying to understand a multivariate relationship.
Question #9: Using the college_majors data set, choose 2 numeric variables and create a graph displaying the multivariate relationship between these variables and the third variable Category. Does the relationship between the numeric variables you chose appear to differ by category? Use your graph to justify your answer.
Question #10: Using the police data set, choose 2 categorical variables and create a graph displaying the multivariate relationship between these variables and the third variable body_camera. Does the relationship between the categorical variables you chose appear to differ when a body camera is/isn’t present? Use your graph to justify your answer.
All of these examples will use the “tips” data set loaded below:
## Read in the "Tips" data
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")
In this data set the cases are individual tables served by a waiter in suburban New York. The variables describe the characteristics of each table.
ggplot(example_data, aes(x = TotBill, y = Tip)) + geom_point()
ggplot(example_data, aes(x = TotBill, y = Tip, color = Day)) + geom_point()
ggplot(example_data, aes(x = TotBill, y = Tip, color = Size)) + geom_point()
ggplot(example_data, aes(x = Tip)) + geom_boxplot()
ggplot(example_data, aes(x = Tip, y = Smoker)) + geom_boxplot()
ggplot(example_data, aes(x = Tip, y = Smoker, color = Sex)) + geom_boxplot()
Notes:
Try using fill = Sex instead of color = Sex and see how the graph changes.
ggplot(example_data, aes(x = Tip)) + geom_histogram(bins = 15)
Notes:
bins = 15 determines how many equal length bins should be used to divvy up the x-axis.
ggplot(example_data, aes(x = Tip, fill = Smoker)) + geom_histogram(bins = 15, position = "identity", alpha = 0.4)
Notes:
alpha is used to set the level of transparency (1 = fully opaque, 0 = fully transparent)
position = "identity" prevents the binned frequencies in each group from stacking atop each other.
ggplot(example_data, aes(x = Time)) + geom_bar()
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar()
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "dodge")
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill")
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill") + facet_wrap(~Day)
Notes:
facet_wrap() replicates the same graphic for each category of the specified variable (in this case Day)
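The quantities behind the bar chart examples can also be computed numerically, which may help when interpreting them. The sketch below uses table() and prop.table() from base R on a small made-up data frame (toy, with hypothetical values rather than the Tips data) to show the counts a dodged bar chart displays and the within-group proportions that position = "fill" displays.

```r
# Small made-up data frame (hypothetical values, not the Tips data)
toy <- data.frame(
  Time = c("Day", "Day", "Day", "Night", "Night", "Night", "Night"),
  Sex  = c("F",   "M",   "M",   "F",     "F",     "F",     "M")
)

# Counts of Sex within each Time: the bar heights under position = "dodge"
counts <- table(toy$Time, toy$Sex)
counts

# Proportions within each Time (each row sums to 1): what position = "fill" shows
props <- prop.table(counts, margin = 1)
props
```

Comparing counts with props shows why a filled bar chart is often easier to read when the groups on the x-axis have very different sizes.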