This lab introduces R and RStudio as well as the format of future class sessions.

Directions (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.


Onboarding

Some functions we’ll use are contained in the default installation of R, but others are stored in external libraries known as “packages”.

The first package we’ll use is ggplot2, which is used to create professional quality data visualizations. To use a package it must be installed onto the computer you’re working on. You can do this using install.packages():

install.packages("ggplot2")

Installing a package is similar to downloading an app: you only need to download it once, but you’ll need to load it each time you want to use it. As a result, you’ll frequently see code that looks like this:

# install.packages("ggplot2")
library(ggplot2)

In this example, the installation code is not run because it is commented out, and library() loads the ggplot2 package that was previously downloaded (possibly on an entirely different date).
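A common variation on this pattern checks whether the package is already installed before attempting the download. The sketch below is an optional alternative, not something this lab requires; it relies on the base R function requireNamespace(), which does not appear elsewhere in the lab.

```r
# Install ggplot2 only if it is not already available on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current R session
library(ggplot2)
```

Written this way, the same script can be run on a new computer or an old one without editing the comment.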

The “Grammar of Graphics”

The ggplot2 package uses a layer-based framework to build data visualizations. We can understand this framework using the following sequence of examples:

## Read the data from our last lab
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")

## Example #1.1 - nothing (just a base layer)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income))

## Example #1.2 - add a layer displaying raw data
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point()

## Example #1.3 - add another layer (regression line)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
  geom_smooth(method = "lm")

## Example #1.4 - add another layer (reference text)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
  geom_smooth(method = "lm") + 
  annotate("text", x=25, y=80000, label = "Strong, positive relationship")

Notice how each layer is added to the existing graph using +. The data frame and variable names from the first layer are passed forward into later layers. Thus, the variables Per_Male and Bach_Med_Income contained in the data frame college_majors are used by the layer created by geom_point() to create a scatter plot.
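The layering idea can also be seen by storing a graph in an object and adding to it one step at a time. The sketch below is an illustration only: it uses mtcars, a data set that ships with R, in place of college_majors, and the object names are made up for this example.

```r
library(ggplot2)

# mtcars ships with R; it stands in for college_majors here
base_layer <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Each + adds a layer that reuses the data and mappings of the base layer
with_points <- base_layer + geom_point()
with_line   <- with_points + geom_smooth(method = "lm")

# Count the layers stored in each object
length(base_layer$layers)   # no layers yet, only the axes
length(with_points$layers)  # the points layer
length(with_line$layers)    # the points layer and the regression line

# Printing the object draws the graph
with_line
```

Neither geom_point() nor geom_smooth() is told which variables to use; both pick up x = wt and y = mpg from the first line.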

Terminology

The guiding philosophy of ggplot is to define data visualizations grammatically using:

  1. Aesthetic mappings - relations between visual cues and variable names given inside of aes()
    • x = Per_Male instructs ggplot to relate each value of Per_Male to a position on the x-axis
  2. Geometric elements - elements you actually see on the graphic (i.e., points, lines, etc.)
    • + geom_point() instructs ggplot to use points to display the aesthetic mappings you defined inside of aes()

Sometimes we’ll want a visual cue to be mapped to a variable, but other times we won’t. What difference do you see in the following examples?

## Example 2.1
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = Category)) + geom_point() + labs(title = "Example 2.1")

## Example 2.2
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = "blue")) + geom_point() + labs(title = "Example 2.2")

## Example 2.3
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point(color = "blue") + labs(title = "Example 2.3")
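Once you have compared the three graphs above yourself, the sketch below offers one way to check what you saw. It is an illustration under assumptions: it uses mtcars (a data set that ships with R) rather than college_majors, and it uses the ggplot2 helper layer_data(), which is not otherwise part of this lab, to print the colors that are actually drawn.

```r
library(ggplot2)

# mtcars ships with R; factor(cyl) plays the role of a categorical variable
mapped   <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point()
constant <- ggplot(mtcars, aes(x = wt, y = mpg, color = "blue")) + geom_point()
fixed    <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(color = "blue")

# layer_data() returns the values ggplot2 computes for drawing a layer
unique(layer_data(mapped)$colour)    # one color per category of cyl
unique(layer_data(constant)$colour)  # a single default palette color, not blue
unique(layer_data(fixed)$colour)     # "blue"
```

Inside aes(), "blue" is treated as a one-category variable and is assigned a color from the default palette; outside aes(), it is treated as the literal color to draw with.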

Lab

At this point you should begin working independently with your assigned partner(s) using a pair programming dynamic. You will be responsible for submitting one copy of responses to the questions embedded in the lab. However, I will remind you that all content covered in labs may appear on an exam.

Introduction

At the end of this lab is a section titled “Examples”. You can navigate to this section using the navigation bar on the left side of the screen. The section contains an example of each type of graph I’ll expect you to be able to create throughout this course.

You do not need to memorize or study the commands or arguments needed to create these graphs. Instead, you will be responsible for knowing the situations where a specific type of graph is appropriate, as well as how to interpret what that graph tells you about your data.

Throughout the lab you’ll need to use two different data sets:

## College majors 
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")

## Police involved deaths
police <- read.csv("https://remiller1450.github.io/data/Police.csv")

Because we’ll put an emphasis on interpretation, it’s important that you know a little bit about where these data came from:

  • The first data set, college_majors, contains salary data for various college majors based upon the results of the 2022 American Community Survey. The data were originally obtained from this page.
  • The second data set, police, was aggregated by the Washington Post and made publicly available here. This data set documents all fatal shootings by a police officer that took place between 2015 and mid-2020.

Question #1: Use the environment panel to explore the contents of each data set. Which data set appears to contain mostly categorical variables? Which data set appears to contain mostly numeric variables?
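The information shown in the environment panel can also be printed in the console. The sketch below is a minimal illustration using a small made-up data frame (the name toy and its values are hypothetical and are not part of either lab data set); the same functions can be applied to any data frame.

```r
# A small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  name  = c("A", "B", "C"),
  score = c(90.5, 82.0, 77.25),
  year  = c(2019L, 2020L, 2020L)
)

# Variable names, types, and the first few values of each variable
str(toy)

# TRUE for variables stored as numbers, FALSE for variables stored as text
sapply(toy, is.numeric)
```

Keep in mind that a variable stored as a number is not always best treated as numeric; a year or an ID code, for example, may be better thought of as categorical.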


Part 1 - Univariate graphs

In this section you will be given several questions that you will need to answer by creating and interpreting an appropriate univariate (single variable) data visualization. All answers should include the code used to create your visualization and a written response (as a comment). Remember to use the “Examples” section at the end of the lab to find template code for the type of graph you decide is most appropriate.

Question #2: The race variable in the police data set records the racial or ethnic group of the individual. Create an appropriate graph displaying the distribution of this variable. Use this graph to describe the distribution of this variable and what that tells you about who has been killed by the police in recent years. If you’d like additional information on this variable you may read the data documentation here.

Question #3: The age variable in police records the age at the time of death of each individual in the data set. Create an appropriate graph displaying the distribution of this variable. Use this graph to describe the distribution of this variable and what that tells you about who has been killed by the police in recent years.

Question #4: The year variable in police indicates the year during which the individual was killed. Create an appropriate graph displaying the distribution of this variable, then assess whether the number of police-involved deaths appears to be increasing, decreasing, or relatively stable in recent years.


Part 2 - Bivariate graphs

This section is similar to Part 1, but you should create a bivariate (two variable) graph to support your answer to each question.

Question #5: In the college_majors data set, the Per_Male variable describes the percentage of the workforce with a given major that identifies as male, and the Bach_Med_Income variable reports the median income of the workforce with that major. Create a graph that displays the relationship between these two variables and briefly describe what you see in 1-2 sentences.

Question #6: In the college_majors data set, the Category variable describes the general area of a given major. Create a graph to assess whether any of these areas disproportionately involve majors with higher median incomes than others.

Question #7: In the college_majors data set, the Per_Masters variable describes the percentage of the workforce in a given major whose highest degree is a master’s degree. Create a graph to assess whether there is a relationship between the percentage of the workforce in a field who holds a master’s degree and the unemployment rate in that field.

Question #8: In the police data set one might hypothesize that the presence of a body camera might deter an officer from using deadly force on an unarmed suspect. Create a graph that explores this hypothesis and briefly describe what the graph tells you.

Part 3 - Multivariate graphs

In the near future we’ll discuss the different ways that a third variable can impact the association between an explanatory and response variable. The questions below are intended to give you an opportunity to further practice creating graphics with ggplot and gain some exposure to the issues that arise when trying to understand a multivariate relationship.

Question #9: Using the college_majors data set, choose 2 numeric variables and create a graph displaying the multivariate relationship between these variables and the third variable Category. Does the relationship between the numeric variables you chose appear to differ by category? Use your graph to justify your answer.

Question #10: Using the police data set, choose 2 categorical variables and create a graph displaying the multivariate relationship between these variables and the third variable body_camera. Does the relationship between the categorical variables you chose appear to differ when a body camera is/isn’t present? Use your graph to justify your answer.


Examples

All of these examples will use the “tips” data set loaded below:

## Read in the "Tips" data
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")

In this data set the cases are individual tables served by a waiter in suburban New York. The variables describe the characteristics of each table.

Scatter plot (simple)

ggplot(example_data, aes(x = TotBill, y = Tip)) + geom_point()

Scatter plot (color by group)

ggplot(example_data, aes(x = TotBill, y = Tip, color = Day)) + geom_point()

Scatter plot (color by numeric)

ggplot(example_data, aes(x = TotBill, y = Tip, color = Size)) + geom_point()

Boxplot (simple)

ggplot(example_data, aes(x = Tip)) + geom_boxplot()

Boxplot (by group)

ggplot(example_data, aes(x = Tip, y = Smoker)) + geom_boxplot()

Boxplot (by multiple groups)

ggplot(example_data, aes(x = Tip, y = Smoker, color = Sex)) + geom_boxplot()

Notes:

  • Try this with the aesthetic mapping fill = Sex instead of color = Sex and see how the graph changes.

Histogram

ggplot(example_data, aes(x = Tip)) + geom_histogram(bins = 15)

Notes:

  • The argument bins = 15 determines how many equal-width bins are used to divide up the x-axis.
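To see how the choice of bins changes a histogram, it can help to draw the same variable twice. The sketch below is an illustration only: it uses faithful, a data set that ships with R, instead of the tips data, and it uses the ggplot2 helper layer_data() (not otherwise part of this lab) to look at the bins that were computed.

```r
library(ggplot2)

# faithful ships with R; eruptions is a numeric variable
few_bins  <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(bins = 5)
many_bins <- ggplot(faithful, aes(x = eruptions)) + geom_histogram(bins = 30)

# layer_data() returns one row per bin, including the count in each bin
nrow(layer_data(few_bins))
nrow(layer_data(many_bins))

# Every observation lands in exactly one bin, whatever the number of bins
sum(layer_data(few_bins)$count) == nrow(faithful)

# Draw both versions to compare them
few_bins
many_bins
```

Too few bins can hide features of a distribution, while too many can make it look noisy, so it is worth trying a few values before settling on one.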

Histogram (by group)

ggplot(example_data, aes(x = Tip, fill = Smoker)) + geom_histogram(bins = 15, position = "identity", alpha = 0.4) 

Notes:

  • The argument alpha is used to set the level of transparency (1 = fully opaque, 0 = fully transparent)
  • The argument position = "identity" prevents the binned frequencies in each group from stacking atop each other.

Bar chart (simple)

ggplot(example_data, aes(x = Time)) + geom_bar()

Bar chart (stacked)

ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar()

Bar chart (clustered)

ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "dodge")

Bar chart (conditional)

ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill")

Bar chart (conditional by group)

ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill") + facet_wrap(~Day)

Notes:

  • The layer added by facet_wrap() replicates the same graphic for each category of the specified variable (in this case Day)
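The effect of facet_wrap() can be checked on a data set that ships with R. The sketch below is an illustration under assumptions: it uses mtcars rather than the tips data, treats the numeric codes am, vs, and gear as categorical by wrapping them in factor(), and uses the ggplot2 helper layer_data() (not otherwise part of this lab) to count the panels.

```r
library(ggplot2)

# mtcars ships with R; am, vs, and gear are treated as categorical here
faceted <- ggplot(mtcars, aes(x = factor(am), fill = factor(vs))) +
  geom_bar(position = "fill") +
  facet_wrap(~gear)

# The number of panels matches the number of categories of gear
length(unique(layer_data(faceted)$PANEL))
length(unique(mtcars$gear))

# Draw the faceted bar chart
faceted
```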