Lab #2 - R Markdown, ggplot, and univariate data

$~$

Onboarding

In our first lab you wrote your code in an R script; however, RStudio supports several other file types, including R Markdown, a framework that allows R code, that code’s output, and markdown text to coexist seamlessly in the same document.

If you recently installed RStudio it should come with R Markdown already available. You can check this by navigating:

File -> New File -> R Markdown

If you do not see “R Markdown” displayed in this menu you’ll need to install the rmarkdown package:

# install.packages("rmarkdown")
library(rmarkdown)

In R, packages refer to collections of functions (and other code) that exist in an external repository. To use these functions you must:

Download the package using the install.packages() function. You’ll only need to do this step once. Think of install.packages() like downloading an app onto your PC or phone. Once the app has been downloaded it’s there to use until you delete it.
Load the package contents into your current session using the library() function. Think of library() like logging into an app. You’ll need to do it every time you restart your PC/phone (akin to creating a new file or R session).

This lab will use two other packages, ggplot2 and forcats, which we’ll talk about soon. For now, we’ll use this as an opportunity to install and load them:

# install.packages("ggplot2")
# install.packages("forcats")

library(ggplot2)
library(forcats)

I’ll generally add a # in front of an install.packages() command for two reasons:

As a reminder that you only need to install a package once and you shouldn’t be constantly re-running that command.
The install.packages() function cannot be called when an R Markdown document is being compiled (this will make sense later).
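
If you are unsure whether a package still needs the one-time install step, you can check from the console. The sketch below is only an illustration: it uses base R's requireNamespace() to report which of the packages named in this lab are already installed, and it does not install anything. The object names pkgs and is_installed are made up for this example.

```r
# Packages named in this lab
pkgs <- c("rmarkdown", "ggplot2", "forcats")

# requireNamespace() returns TRUE if a package is installed and FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: one TRUE/FALSE per package
print(is_installed)

# Any package listed here still needs install.packages() (a one-time step)
print(pkgs[!is_installed])
```

Packages that show TRUE only need library() at the start of each session.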

$~$

Components of an .Rmd file

At the top of an R Markdown document is the header:

The header is initiated by --- and closed by ---
Here you can provide title text, authors, and other information that will appear at the top of the document created by your R Markdown file

After the end of the header you’ll see a code chunk:

Code chunks are initiated by three backticks followed by {r} and closed by three backticks
The first code chunk in most documents is used to set up options for the remainder of the document. In fact, the text “setup” that you see in {r setup} gives this chunk the name “setup”. You should keep this chunk as it appears and use other code chunks to add your own code.
You can run the code in any code chunk using the green arrow in its upper right corner. The icon next to it (a grey triangle above a green rectangle) will run, in sequential order, all code chunks that come before the current one.

After the setup code chunk you’ll see a section header:

Sections are created using varying numbers of the # character, with the number determining the size of the header (fewer # characters produce a larger header)

Following the section header you’ll see ordinary text:

Ordinary text follows markdown conventions, which include LaTeX math written between dollar signs, so typing \$H_0: \mu = 0\$ will appear as $H_0: \mu = 0$ in your document.

$~$

Knitting

The purpose of R Markdown is to seamlessly blend R code, output, and written text. This is accomplished by “knitting” your file into a completed report. You can knit a file using the “Knit” button (blue yarn ball icon), or (on Windows) by pressing Ctrl+Shift+K.

A few things to know about knitting:

When you knit an .Rmd file it begins with an empty environment, so the file might not knit if you’ve been testing your code out of order, or if your code depends upon things that you’ve since deleted while working.
Commands like install.packages() and View() cannot be used in the environment where the document is knit. You should comment out or remove these commands before knitting to prevent errors.
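
The empty-environment point is easier to see with a small experiment. The sketch below is an analogy only, not how knitting is actually implemented: it creates a new environment (the made-up name fresh_env) and shows that code fails there when it depends on an object that was never created in order.

```r
# A new environment standing in for the clean session used while knitting
# (an analogy only; knitting itself is handled by the rmarkdown package)
fresh_env <- new.env(parent = baseenv())

# 'my_data' was never created, so code that depends on it fails
outcome <- tryCatch(
  eval(quote(my_data + 1), envir = fresh_env),
  error = function(e) conditionMessage(e)
)
print(outcome)

# Creating the object first, as a well-ordered document would, fixes the problem
assign("my_data", 10, envir = fresh_env)
eval(quote(my_data + 1), envir = fresh_env)
```

A practical habit that follows from this: before knitting, restart R and run your chunks from top to bottom to confirm that each one only uses objects created earlier in the file.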

$~$

Lab

As a reminder, you should work on the lab with your assigned partner(s) using the principles of pair programming. Everyone should keep and submit their own copy of your group’s work.

This lab will require you to work with two different data sets:

college_majors - which contains salary data for various college majors based on the results of the 2022 American Community Survey. These data were originally obtained from this US Census page
colleges - which contains institution-level data for the year 2024 of all US colleges and universities with at least 300 enrolled students that primarily grant bachelor’s degrees. These data were obtained via the College Scorecard, a database run by the US Department of Education.

## First data set
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")


## Second data set
colleges <- read.csv("https://remiller1450.github.io/data/Colleges_2024_Complete.csv")
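
Since View() cannot be used while knitting, it helps to know a few functions that print a quick summary of a data frame instead. The sketch below uses a tiny, made-up CSV (the names csv_text and toy_majors and all of the values are hypothetical, not the lab's data) so that it runs without an internet connection; the same functions can be applied to college_majors or colleges.

```r
# A tiny, made-up CSV standing in for a downloaded file (not the lab's data)
csv_text <- "Major,Category,Median_Income
Accounting,Business,60000
Biology,Science,55000
History,Humanities,50000"

# read.csv() can read from a text string as well as from a file or URL
toy_majors <- read.csv(text = csv_text)

dim(toy_majors)     # number of rows and columns
names(toy_majors)   # variable names
head(toy_majors)    # first few rows
str(toy_majors)     # each variable's type
```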

The Grammar of Graphics (ggplot)

The ggplot2 package uses a layer-based framework to progressively build data visualizations. We can understand this framework using the following sequence of examples:

## Example #1.1 - nothing (a totally blank plot)
ggplot()

## Example #1.2 - we've now given ggplot our data and mapped variables to the x and y axes
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income))

## Example #1.3 - we've now added a geometric element that uses our data and mappings (points)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point()

## Example #1.4 - we've now added another geometric element using our data and mappings (regression line)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)

In these examples, you should notice how the information we provide in the ggplot() function via the data and mapping arguments is passed forward to the geometric elements that are later added to the graph.
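
One way to see the layering at work is to store a plot as an object and count its layers. The sketch below is only an illustration using a small made-up data frame (toy, with hypothetical variables x and y) rather than the lab's data.

```r
library(ggplot2)

# Made-up data for illustration (not the lab's data)
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 5, 8))

# Data and mappings only: a plot with axes but no geometric elements
base_plot <- ggplot(data = toy, mapping = aes(x = x, y = y))

# Each + adds one layer that reuses the data and mappings given in ggplot()
layered_plot <- base_plot + geom_point() + geom_smooth(method = "lm", se = FALSE)

length(base_plot$layers)      # no layers yet
length(layered_plot$layers)   # one layer per geometric element added
```

Because neither geom_point() nor geom_smooth() was given its own data or mapping, both layers fall back on what was supplied to ggplot().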

The guiding philosophy of ggplot is to define data visualizations grammatically. In this regard you should be familiar with the following terms:

Aesthetic mappings - relationships between visual cues and variable names given inside of aes()
- x = Per_Male instructs ggplot to relate each value of Per_Male to a position on the x-axis
Geometric elements - elements you actually see on the graphic (i.e., points, lines, etc.)
- + geom_point() instructs ggplot to use points to display the aesthetic mappings you defined inside of aes()

Question #1: Below is the R code and output for 3 different visualization attempts that use the “color” aesthetic/cue differently. You should reference this code and output for Parts A and B of this question.

## Example 1
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = Category)) + geom_point() + labs(title = "Example 1")

## Example 2
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = "blue")) + geom_point() + labs(title = "Example 2")

## Example 3
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point(color = "blue") + labs(title = "Example 3")

Part A: Which example(s) instruct geom_point() to relate each value of a variable in college_majors to the “color” visual cue?
Part B: Why do you think the points in Example 2 are red and not blue? Hint: look at the color legend and think about what was given inside of aes()
Part C: Copy and modify one of these examples to color all of the data-points purple. Note: the string "purple" is a valid color in R
Part D: Copy and modify one of these examples to color all of the data-points according to the percentage of the workforce in each major that self identifies as Asian (the variable Per_Asian in these data).

$~$

Categorical Data

The previous section covered the broad fundamentals of ggplot. This section will focus on specific univariate graphics and descriptive statistics for categorical data.

Recall that a bar chart displays the frequencies (or relative frequencies) of the categories of a categorical variable. The example below creates a bar chart that shows the distribution of the variable Category in the college_majors data set:

ggplot(college_majors, aes(x = Category)) + geom_bar()

Notice that we mapped the variable’s name, Category, to the x aesthetic, then we used the geometric element geom_bar() to visually display the values that were mapped. This process is slightly different for geom_bar(), which will tabulate frequencies internally and display those results rather than displaying the raw data itself (like what we saw for geom_point() in earlier examples).

We could perform this underlying tabulation ourselves using the table() function:

## Example use of the table() function
table(college_majors$Category)

## 
##   Business  Education Humanities    Science 
##          6          3         11         14

Notice how we used the $ operator to ensure the character vector corresponding to the variable Category was the input used to construct the frequency table.
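
To check that geom_bar() really performs the same tabulation, we can compare the output of table() with the values ggplot computes for the bars. This is a minimal sketch on made-up data (the data frame toy and its variable group are hypothetical); layer_data() is a ggplot2 helper that returns the values a layer computed.

```r
library(ggplot2)

# Made-up categorical data for illustration (not the lab's data)
toy <- data.frame(group = c("a", "b", "a", "c", "a", "b"))

# Tabulate the frequencies ourselves
manual_counts <- table(toy$group)
print(manual_counts)

# Let geom_bar() tabulate, then pull out the values it computed
bar_plot <- ggplot(toy, aes(x = group)) + geom_bar()
computed <- layer_data(bar_plot)
print(computed$count)
```

The count column should match the frequencies from table(), one value per bar.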

We could make this a table of proportions (relative frequencies) by using the frequency table as input in the prop.table() function:

## Example use of the prop.table() function
my_table = table(college_majors$Category) # We'll first store our table as the object 'my_table'
prop.table(my_table)                      # We'll then use 'my_table' as an input

## 
##   Business  Education Humanities    Science 
## 0.17647059 0.08823529 0.32352941 0.41176471
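
A relative frequency is simply each count divided by the total count. The sketch below, using a made-up vector (toy_groups is a hypothetical name), shows that prop.table() gives the same result as doing that division by hand.

```r
# Made-up categorical data for illustration (not the lab's data)
toy_groups <- c("a", "b", "a", "c", "a", "b")

counts <- table(toy_groups)

# Relative frequencies from prop.table()
props <- prop.table(counts)
print(props)

# The same calculation by hand: each count divided by the total
by_hand <- counts / sum(counts)
print(by_hand)

# Proportions always sum to 1; multiplying by 100 gives percentages
sum(props)
round(100 * props, 1)
```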

Finally, it is often desirable to arrange the categories of a bar chart by frequency. We can do this by applying the fct_infreq() function from the forcats package to the variable Category before it is defined as the x aesthetic, which orders the categories from most to least frequent:

## Example of category reordering
ggplot(college_majors, aes(x = fct_infreq(Category))) + geom_bar()

Notice that this modification results in an ugly-looking x-axis label. We can fix this by providing our own label using the labs() layer:

ggplot(college_majors, aes(x = fct_infreq(Category))) + geom_bar() + labs(x = "Major Category")
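
It can help to look at what fct_infreq() does to a variable outside of a plot. The sketch below uses a made-up vector (toy_groups is a hypothetical name) and prints the order of the categories before and after reordering; fct_rev(), another forcats function, reverses an ordering if you want the least frequent category first.

```r
library(forcats)

# Made-up categorical data for illustration (not the lab's data)
toy_groups <- c("b", "a", "b", "c", "b", "a")

# Default ordering of categories is alphabetical
levels(factor(toy_groups))

# fct_infreq() orders categories from most to least frequent
levels(fct_infreq(toy_groups))

# Wrapping the result in fct_rev() reverses the order
levels(fct_rev(fct_infreq(toy_groups)))
```

The bars in a bar chart are drawn in the order of these levels, which is why wrapping the variable inside aes() changes the arrangement of the bars.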

Question #2: For this question you should use the colleges data set and base your code on the examples given in this section.

Part A: Create a table displaying the frequencies of colleges across geographic regions.
Part B: Use prop.table() to calculate relative frequencies from the table you created in Part A.
Part C: Create a bar chart showing the distribution of Region arranged in descending order of frequency. Change the label of the x-axis to be the string "Geographic Region".
Part D: Using plain text (not an R comment), write one sentence that briefly describes whether colleges seem uniformly distributed across regions or not.

$~$

Quantitative Data

Recall that the distribution of a quantitative variable can be visualized using a histogram. The example below demonstrates this for the Per_Masters variable in the college_majors data set:

ggplot(data = college_majors, aes(x = Per_Masters)) + geom_histogram(bins=12)

Notice how we mapped this variable to the x aesthetic and used geom_histogram() to create a histogram from this mapping. Additionally, you should notice we used the argument bins=12 inside of geom_histogram(), which led to the creation of 12 equal-width bins across the range of our variable.
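
To see what the bins argument does, we can inspect the bins that geom_histogram() computes. This is a rough sketch on made-up values (the data frame toy and its variable value are hypothetical); layer_data() returns one row per bin, including each bin's lower edge (xmin), upper edge (xmax), and count.

```r
library(ggplot2)

# Made-up quantitative data for illustration (not the lab's data)
toy <- data.frame(value = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 12))

hist_plot <- ggplot(toy, aes(x = value)) + geom_histogram(bins = 12)

# One row per bin
bin_info <- layer_data(hist_plot)

nrow(bin_info)                   # number of bins
bin_info$xmax - bin_info$xmin    # width of each bin (all the same)
sum(bin_info$count)              # every observation lands in exactly one bin
```

If you would rather choose the width of each bin than the number of bins, geom_histogram() also accepts a binwidth argument.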

Histograms are great for assessing the shape of a quantitative variable’s distribution, but we should rely upon descriptive statistics when reporting center and spread. The following examples correspond to each descriptive statistic discussed in today’s lecture:

# Measures of Center
## mean
mean(college_majors$Per_Masters)

## median
median(college_majors$Per_Masters)

# Measures of Spread
## standard deviation
sd(college_majors$Per_Masters)

## range (max - min)
min_val = min(college_majors$Per_Masters)
max_val = max(college_majors$Per_Masters)
max_val - min_val # This is the range

## IQR
IQR(college_majors$Per_Masters)
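
The sketch below applies the same functions to a small made-up vector (x and x_na are hypothetical names) so that each result can be checked by hand. It also shows the na.rm argument, which these functions accept and which you may need if a variable contains missing values; whether the lab's data contain any is not something this example assumes.

```r
# Made-up values for illustration (not the lab's data)
x <- c(2, 4, 4, 5, 7, 9, 30)

mean(x)     # pulled upward by the large value 30
median(x)   # middle value of the sorted data

# The range as a single number: max minus min
diff(range(x))

# The IQR is the distance between the 75th and 25th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(quartiles[2] - quartiles[1])
iqr_by_hand
IQR(x)

# With a missing value the result is NA unless na.rm = TRUE is used
x_na <- c(x, NA)
mean(x_na)
mean(x_na, na.rm = TRUE)
```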

Question #3: For this question, you should use the colleges data set and base your code on the examples given in this section.

Part A: Create a histogram using 20 bins that displays the variable Enrollment, which records the number of full-time undergraduate students enrolled at each institution in 2024.
Part B: Would you describe the shape of the distribution you see in Part A as skewed-left, skewed-right, or approximately symmetric?
Part C: Calculate and report the mean and median of the variable Enrollment.
Part D: Calculate and report the standard deviation and IQR of the variable Enrollment. Include a one-sentence interpretation for each measure of spread.
Part E: Based upon the distributional shape you saw in Part A, which measure of center (mean or median) provides a more reasonable description of the typical college’s enrollment?