ggplot, and univariate data
In our first lab you wrote your code in an R Script; however, R Studio supports several other file types, including R Markdown, a framework that allows for R code, that code’s output, and markdown text to seamlessly coexist in the same document.
If you recently installed R Studio it should come with R Markdown already available. You can check this by navigating:
File -> New File -> R Markdown
If you do not see “R Markdown” displayed in this menu you’ll need to
install the rmarkdown package:
# install.packages("rmarkdown")
library(rmarkdown)
In R, packages refer to collections of functions (and other code) that
exist in an external repository. To use these functions you must:
1. Install the package using the install.packages() function. You’ll
only need to do this step once. Think of install.packages() like
downloading an app onto your PC or phone. Once the app has been
downloaded it’s there to use until you delete it.
2. Load the package using the library() function. Think of library()
like logging into an app. You’ll need to do it every time you restart
your PC/phone (akin to creating a new file or R session).
This lab will use two other packages, ggplot2 and forcats, which we’ll
talk about soon. For now, we’ll use this as an opportunity to install
and load them:
# install.packages("ggplot2")
# install.packages("forcats")
library(ggplot2)
library(forcats)
I’ll generally add a # in front of an install.packages() command for
two reasons, one of which is that the install.packages() function
cannot be called when an R Markdown document is being compiled (this
will make sense later).
From top to bottom, an R Markdown document contains the following
parts. At the top of the document is the header. After the end of the
header you’ll see a code chunk (the setup code chunk). After the setup
code chunk you’ll see a section header. Following the section header
you’ll see ordinary text.
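The parts described above (header, setup code chunk, section header, ordinary text) can be sketched by writing a small R Markdown file from R and printing it. The file contents here are an assumption modeled on a typical new document; the title, section name, and text in your own file will likely differ.

```r
# Three backticks open and close a code chunk; built with strrep() so the
# chunk markers are easy to spot in the vector below
fence <- strrep("`", 3)

# Lines of a small R Markdown file (title, heading, and text are placeholders)
rmd_lines <- c(
  "---",                                      # the header starts here
  "title: \"My Lab Report\"",
  "output: html_document",
  "---",                                      # the header ends here
  "",
  paste0(fence, "{r setup, include=FALSE}"),  # a code chunk named 'setup'
  "knitr::opts_chunk$set(echo = TRUE)",
  fence,                                      # the code chunk ends here
  "",
  "## A Section Header",                      # a section header
  "",
  "Ordinary text goes here."                  # ordinary text
)

# Write the lines to a temporary .Rmd file and print them back
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```

Opening the printed text in R Studio as a .Rmd file should show the same four parts in the same order.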
The purpose of R Markdown is to seamlessly blend R code, output, and
written text. This is accomplished by “knitting” your file into a
completed report. You can knit a file using the “Knit” button (blue yarn
ball icon), or (on Windows) by pressing ctrl-shift-k.
A few things to know about knitting:
Commands such as install.packages() and View() cannot be used in the
environment where the document is knit. You should comment out or
remove these commands before knitting to prevent errors.
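If you'd rather not comment View() in and out by hand, one option is to guard it with interactive(). This is only a sketch and not a requirement of the lab: demo_df is a made-up data frame, and the approach assumes you knit with the “Knit” button, where interactive() is typically FALSE.

```r
# A made-up data frame, used only for illustration
demo_df <- data.frame(id = 1:3, score = c(90, 85, 77))

# interactive() is TRUE when you run code yourself in the console, so
# View() is skipped whenever the code runs non-interactively
if (interactive()) {
  View(demo_df)
}

# Output like this is safe to keep when knitting
head(demo_df)
```

Commenting the command out, as described above, remains the simplest fix.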
As a reminder, you should work on the lab with your assigned partner(s) using the principles of paired programming. Everyone should keep and submit their own copy of your group’s work.
This lab will require you to work with two different data sets:
college_majors - which contains salary data for various college majors
based on the results of the 2022 American Community Survey. These data
were originally obtained from a US Census page.
colleges - which contains institution-level data for the year 2024 of
all US colleges and universities with at least 300 enrolled students
that primarily grant bachelor's degrees. These data were obtained via
the College Scorecard, a database run by the US Department of
Education.
## First data set
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")
## Second data set
colleges <- read.csv("https://remiller1450.github.io/data/Colleges_2024_Complete.csv")
The ggplot2 package uses a layer-based framework to progressively build
data visualizations. We can understand this framework using the
following sequence of examples:
## Example #1.1 - nothing (a totally blank plot)
ggplot()
## Example #1.2 - we've now given ggplot our data and mapped variables to the x and y axes
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income))
## Example #1.3 - we've now added a geometric element that uses our data and mappings (points)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point()
## Example #1.4 - we've now added another geometric element using our data and mappings (regression line)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
In these examples, you should notice how the information we provide in
the ggplot() function via the data and mapping arguments is passed
forward to the geometric elements that are later added to the graph.
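One way to see this layering is to store the plot in an object and inspect it after each step. The sketch below uses mtcars, a data set that ships with R, as a stand-in for college_majors (an assumption made so the code runs without a download); the object names are arbitrary.

```r
library(ggplot2)

# Data and mappings are given once, in ggplot()
base_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Each `+` adds one layer; neither geom repeats `data` or `mapping`
with_points <- base_plot + geom_point()
with_line <- with_points + geom_smooth(method = "lm", se = FALSE)

# Number of layers stored in each plot object
length(base_plot$layers)
length(with_points$layers)
length(with_line$layers)

# The mappings are stored on the plot itself, where every layer can use them
names(base_plot$mapping)
```

The layer counts should grow by one with each geometric element, while the x and y mappings stay attached to the plot rather than to any single layer.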
The guiding philosophy of ggplot is to define data visualizations
grammatically. In this regard you should be familiar with the following
terms:
aes() - defines the aesthetic mappings, which relate variables to
visual cues. For example, x = Per_Male instructs ggplot to relate each
value of Per_Male to a position on the x-axis.
geom_point() - a geometric element. Adding + geom_point() instructs
ggplot to use points to display the aesthetic mappings you defined
inside of aes().
Question #1: Below is the R code and output for 3 different
visualization attempts that use the “color” aesthetic/cue differently.
You should reference this code and output for Parts A and B of this
question.
## Example 1
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = Category)) + geom_point() + labs(title = "Example 1")
## Example 2
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = "blue")) + geom_point() + labs(title = "Example 2")
## Example 3
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point(color = "blue") + labs(title = "Example 3")
… geom_point() to relate each value of a variable in college_majors to
the “color” visual cue?
… aes() …
… "purple" is a valid color in R …
… Per_Asian in these data).
The previous section covered the broad fundamentals of ggplot. This
section will focus on specific univariate graphics and descriptive
statistics for categorical data.
Recall that a bar chart displays the frequencies (or relative
frequencies) of the categories of a categorical variable. The example
below creates a bar chart that shows the distribution of the variable
Category in the college_majors data set:
ggplot(college_majors, aes(x = Category)) + geom_bar()
Notice that we mapped the variable’s name, Category, to the x
aesthetic, then we used the geometric element geom_bar() to visually
display the values that were mapped. This process is slightly different
for geom_bar(), which will tabulate frequencies internally and display
those results rather than displaying the raw data itself (as
geom_point() did in earlier examples).
We could perform this underlying tabulation ourselves using the table()
function:
## Example use of the table() function
table(college_majors$Category)
##
##   Business  Education Humanities    Science
##          6          3         11         14
Notice how we used the $ operator to ensure the character vector
corresponding to the variable Category was the input used to construct
the frequency table.
We could make this a table of proportions (relative frequencies) by
using the frequency table as input in the prop.table() function:
## Example use of the prop.table() function
my_table <- table(college_majors$Category) # We'll first store our table as the object 'my_table'
prop.table(my_table) # We'll then use 'my_table' as an input
##
##   Business  Education Humanities    Science
## 0.17647059 0.08823529 0.32352941 0.41176471
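The same two steps work for any categorical vector. The sketch below uses mtcars, a data set that ships with R, as a stand-in for college_majors (an assumption made so the code runs without a download), and also converts the proportions to rounded percentages, which are often easier to read.

```r
# cyl (number of cylinders) is treated as a categorical variable here
cyl_counts <- table(mtcars$cyl)       # frequencies
cyl_props <- prop.table(cyl_counts)   # relative frequencies

cyl_counts
cyl_props

# Relative frequencies always sum to 1
sum(cyl_props)

# Percentages rounded to one decimal place
round(100 * cyl_props, 1)
```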
Finally, it is often desirable to reorder the categories of a bar
chart in ascending or descending order of frequency. We can do this by
applying the fct_infreq() function in the forcats package to the
variable Category before it is defined as the x aesthetic:
## Example of category reordering
ggplot(college_majors, aes(x = fct_infreq(Category))) + geom_bar()
Notice that this modification results in an ugly looking x-axis label.
We can fix this by providing our own label using the labs() layer:
ggplot(college_majors, aes(x = fct_infreq(Category))) + geom_bar() + labs(x = "Major Category")
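Since fct_infreq() puts the most frequent category first, an ascending chart needs one more step. A possible approach, sketched here with the built-in mtcars data standing in for college_majors (an assumption), is to wrap the result in fct_rev(), another forcats function that reverses the order of the categories.

```r
library(ggplot2)
library(forcats)

# Convert cyl to a factor so it is treated as categorical
cyl_factor <- factor(mtcars$cyl)

levels(cyl_factor)                       # default order (sorted labels)
levels(fct_infreq(cyl_factor))           # most frequent category first
levels(fct_rev(fct_infreq(cyl_factor)))  # least frequent category first

# Bar chart in ascending order of frequency, with a readable axis label
ggplot(mtcars, aes(x = fct_rev(fct_infreq(factor(cyl))))) +
  geom_bar() +
  labs(x = "Number of Cylinders")
```

Printing the levels before plotting is a quick way to confirm the order the bars will appear in.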
Question #2: For this question you should use the colleges data set and
base your code on the examples given in this section.
… prop.table() to calculate relative frequencies from the table you
created in Part A.
… Region arranged in descending order of frequency. Change the label of
the x-axis to be the string "Geographic Region".
… R comment) write one sentence that briefly describes whether colleges
seem uniformly distributed across regions or not.
Recall that the distribution of a quantitative variable can be
visualized using a histogram. The example below demonstrates this for
the Per_Masters variable in the college_majors data set:
ggplot(data = college_majors, aes(x = Per_Masters)) + geom_histogram(bins=12)
Notice how we mapped this variable to the x aesthetic and used
geom_histogram() to create a histogram from this mapping. Additionally,
you should notice we used the argument bins=12 inside of
geom_histogram(), which led to the creation of 12 equal-width bins
across the range of our variable.
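Besides bins, geom_histogram() also accepts a binwidth argument, which sets the width of each bin in the variable's own units. The sketch below compares the two using the built-in mtcars data as a stand-in for college_majors (an assumption); the value 2.5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Option 1: ask for a number of bins
hist_bins <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 12)

# Option 2: ask for a bin width (here, 2.5 miles per gallon per bin)
hist_width <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2.5)

# Display each histogram
hist_bins
hist_width
```

Which option reads better depends on the variable; bins is convenient when you don't yet know the variable's scale, while binwidth gives bins with round, interpretable edges.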
Histograms are great for assessing the shape of a quantitative variable’s distribution, but we should rely upon descriptive statistics when reporting center and spread. The following examples correspond to each descriptive statistic discussed in today’s lecture:
# Measures of Center
## mean
mean(college_majors$Per_Masters)
## median
median(college_majors$Per_Masters)
# Measures of Spread
## standard deviation
sd(college_majors$Per_Masters)
## range (max - min)
min_val <- min(college_majors$Per_Masters)
max_val <- max(college_majors$Per_Masters)
max_val - min_val # This is the range
## IQR
IQR(college_majors$Per_Masters)
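These functions return NA if the variable contains any missing values. Whether that applies to the variables in this lab is not stated here, so treat the na.rm = TRUE argument below as a precaution. The sketch uses a small made-up vector and also shows how the range and IQR relate to range() and quantile().

```r
# A small made-up vector with one missing value (NA), purely for illustration
x <- c(12, 15, 11, NA, 20, 18)

mean(x)                 # NA: a single missing value makes the result missing
mean(x, na.rm = TRUE)   # drops the NA before averaging

# range() returns the minimum and maximum together; diff() subtracts them
diff(range(x, na.rm = TRUE))

# IQR() is the distance between the 75th and 25th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
unname(quartiles[2] - quartiles[1])
IQR(x, na.rm = TRUE)
```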
Question #3: For this question, you should use the colleges data set
and base your code on the examples given in this section.
… Enrollment, which records the number of full-time undergraduate
students enrolled at each institution in 2024.
… Enrollment.
… Enrollment. Include a one-sentence interpretation for each measure of
spread.