Merging and Joining with dplyr

This lab focuses on combining multiple data sources using packages in the tidyverse suite.

Directions (Please read before starting)

Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Packages and Datasets

This lab will primarily use the dplyr package.

# load the following packages
library(dplyr)
library(ggplot2)

The lab’s examples will use two databases:

orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv", as.is = TRUE)
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv", as.is = TRUE)

Description: A contrived example of a customer database that is small enough to visually inspect.

library(nycflights13)
data("flights")
data("planes")
data("weather")
data("airports")

Description: Data originating from the US Bureau of Transportation pertaining to all flights that departed from New York City in 2013.

\(~\)

Databases

So far we’ve only dealt with data that could be stored in a single spreadsheet; however, many real world applications require more flexibility.

Data are often stored in databases, or structured collections of records. Essential to a database is its relational structure. For example, here is the relational structure of the nycflights13 database:

From this diagram you should notice a few things:

Information about a plane could be added to flights via the “tailnum” variable
- This variable exists in both data files
Information about the airport that a plane is departing from or arriving at can be added to flights via the “faa” variable
- The corresponding key in flights is either “origin” or “dest”

Keys

Matching data across the different components of a database is done using keys:

A primary key is a variable (or set of variables) that uniquely define an observation in it’s own table
- “tailnum” is a primary key in planes as no tail number appears more than once in this data frame

planes %>% count(tailnum) %>% filter(n > 1)  ## Primary key (unique to every plane)

## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>

A foreign key is a variable (or group of variables) in another table that can be linked to a primary key.
- “tailnum” is a foreign key in the flights data file, as it can be linked to “tailnum” in the planes data file
- Notice that “tailnum” is not a primary key in the flights file (as many values appear more than once)

flights %>% count(tailnum) %>% filter(n > 1)  ## Foreign key (unique to planes, but not flights)

## # A tibble: 3,873 × 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 3,863 more rows

A primary key and a foreign key combine to form a relation. Relations are used to link information in one data file to another.

Question #1: Use the follow data tables in the nycflights13 database:

data(flights)
data(airlines)

Part A: Is the variable “flight” a primary key in the flights? Why or why not?
Part B: What keys form the relation needed to add the full airline name (contained in airlines) to the records in flights?

\(~\)

Lab

At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

\(~\)

Mutating Joins

The goal of a mutating join is to combine variables from two different data frames, “X” and “Y”.

There are three important types of mutating joins:

Left Outer Join: Keep all observations in “X” and add information from any matching records in “Y”, filling with NA for records in “X” that do not have a match in “Y”.
Full Outer Join: Keep all observations in both “X” and “Y”, filling with NA for records in “X” without a match in “Y” and for records in “Y” without a match in “X”
Inner Join: Keep only observations with records in both “X” and “Y”, dropping any records that aren’t present in both data frames.

In most circumstances you can expect to use a left outer join, as this approach will preserve all of the records in “X” and attach additional variables from “Y”. The different mutating joins are summarized in the graphic below:

Blue sections illustrate observations (rows) that will be present in the combined data frame, while the white sections represent observations that will be omitted. We will not discuss Right Joins, as they are equivalent to a Left Outer Join with the roles of “X” and “Y” reversed.

\(~\)

Left Joins

Consider the data frames “orders” and “customers”:

print(orders)

##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
## 4     4 50 Apr-17

print(customers)

##   id    name
## 1  4   Tukey
## 2  8 Wickham
## 3 15   Mason
## 4 16  Jordan
## 5 23   Patil
## 6 42     Cox

Notice that the variable “id” is a primary key (unique identifier) in each data frame.

The left_join() function is used to perform a left outer join, adding the customer information to each order using the column “id” as the join’s key:

left_join(x = orders, y = customers, by = "id")

##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>

Because no entries in the customers data frame had an “id” of 50, a missing value, NA, appears as the name for that order in the combined data frame.

In most situations, left joins are not symmetric, so your choice of “x” and “y” matters:

left_join(x = customers, y = orders, by = "id")

##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 15   Mason    NA   <NA>
## 4 16  Jordan    NA   <NA>
## 5 23   Patil    NA   <NA>
## 6 42     Cox     3 Apr-15

As seen above, customers with “id” numbers of 15, 16, and 23 did not have any matching records in the orders data frame, resulting in missing values for “order” and “date” in the combined data frame.

Question #2: Consider the example join demonstrated below. Given our initial description of a left outer join, you’d expect the left join using table2 as the “x” argument (in the above code) to produce a data frame with 3 rows (one for each observation in table2). Instead, in this scenario, it produces a data frame with 6 rows. Why do you think this happens?

table1 <- data.frame(key = c("a", "a", "b", "b", "b", "c"), X = 1:6)
table2 <- data.frame(key = c("a", "b", "c"), Y = c("X", "Y", "Z"))

left_join(x = table2, y = table1, by = "key")

##   key Y X
## 1   a X 1
## 2   a X 2
## 3   b Y 3
## 4   b Y 4
## 5   b Y 5
## 6   c Z 6

\(~\)

Full Outer Joins

The full_join() function will perform a full outer join of two data frames:

full_join(x = orders, y = customers, by = "id")

##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>
## 5    NA 15   <NA>   Mason
## 6    NA 16   <NA>  Jordan
## 7    NA 23   <NA>   Patil

Notice how the order without a customer (id 50) as well as the three customers without orders (ids 15, 16, and 23) are all included in the resulting data frame alongside the three records that had matches (ids 4, 8, and 42).

Outer joins are symmetric, so the choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:

full_join(x = customers, y = orders, by = "id")

##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 15   Mason    NA   <NA>
## 4 16  Jordan    NA   <NA>
## 5 23   Patil    NA   <NA>
## 6 42     Cox     3 Apr-15
## 7 50    <NA>     4 Apr-17

\(~\)

Inner Joins

The inner_join() function is used to perform an inner join on two data frames:

inner_join(x = orders, y = customers, by = "id")

##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox

Inner joins also are symmetric, so your choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:

inner_join(x = customers, y = orders, by = "id")

##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 42     Cox     3 Apr-15

Question #3: Consider two data frames, “A” and “B”, each containing the key variable “ID”. In “A”, there is a single record for each of the following IDs: \(\{1, 2, 3, 4\}\). In “B”, there is a single record for each of the following IDs: \(\{3, 4, 5, 6\}\). Given this information, state how many rows will be present in the result produced by the following joins:

A: left_join(x = A, y = B, by = "ID")
B: left_join(x = B, y = A, by = "ID")
C: full_join(x = A, y = B, by = "ID")
D: inner_join(x = A, y = B, by = "ID")

\(~\)

More on the `by` argument

In an ideal world, you have a single key with the same name that can link the information in your “x” and “y” data frames. This is not always the case; however, there are several ways to use the by argument to accommodate those scenarios.

The default value, by = NULL, will perform a “natural join”, which will automatically identify all variables with shared names in “x” and “y” and use that collection to link matching records. This can be useful when several variables are needed to uniquely identify an observation.
Alternatively, you may use named vector syntax, for example: by = c("a" = "b"). In this example, the variable named “a” in the “x” data frame will be matched with the variable named “b” in the “y” data frame to link records and perform the join.

Question #4: In the nycflights13 database, the locations of each airport (“lat” and “lon”) are stored in the airports data file. For this question, use an appropriate join to add the “lat” and “lon” of the destination airport to each flight recorded in the flights data file. Hint: the variable “faa” is a primary key in airports.

\(~\)

Filtering Joins

Filtering joins will match observations in the same way as mutating joins, but they will not add any variables from “y” to the “x” data frame.

semi_join() will keep the observations in “x” that have a match in “y”
anti_join() will drop the observations in “x” that have a match in “y”

Here a few examples:

semi_join(x = orders, y = customers, by = "id")

##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15

Notice how semi_join() filters out the fourth order because there are no customers with an id of 50 in the customers data frame.

anti_join(x = orders, y = customers, by = "id")

##   order id   date
## 1     4 50 Apr-17

Conversely, anti_join() will filter out everything except for the fourth order because it was the only observation in the “x” data file without a match in the “y” data file.

Filtering joins are most useful in finding implicit missing values, or values that aren’t recorded as NA and instead exist only as an absence. For example, we could find missing entries in airports by looking for flights whose destination airport doesn’t have a match:

anti_join(x = flights, y = airports, by = c("dest" = "faa")) %>% 
  distinct(dest)  ## Return only the unique values of "dest"

## # A tibble: 4 × 1
##   dest 
##   <chr>
## 1 BQN  
## 2 SJU  
## 3 STT  
## 4 PSE

We can see that there are four implicitly missing airports in our database.

Question #5: The code below creates a data frame, top_dest, containing the ten most common flight destinations and how many flights arrived at them.

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)

Part A: Use this data frame, along with an appropriate join, to filter the flights data file to include only flights whose destination is one of the top ten airports. Then, use dim() to print the dimensions of the resulting data frame.
Part B: Using your result from Part A and an appropriate join, find the number of planes whose destination was a top ten airport that do not have a record in the planes data frame.

\(~\)

Practice (required)

Question #6: Suppose you want to draw an approximate flight path from the origin to destination airport of all flights departing on Jan 1st, 2013, and you’d like to color these paths by “origin”.

Part A: Attach the longitude and latitude of the origin airport to flights. Then, rename these variables to origin_lon and origin_lat (Hint: you can do this using a dplyr command like: select(origin_lon = lon)). Be sure to keep the variables “origin”, “dest”, “month”, and “day”, which will be used later.
Part B: Now attach the longitude and latitude of the destination airport and rename these variables to dest_lon and dest_lat. Note that you can do this by continuing your pipeline from Part A.
Part C (optional): Filter your results from Part B to only include January flights, then use the code given below to construct a map. Note that this code uses borders() to add a map layer to the graphic, which requires the maps package be installed.

## Code for Part C
library(maps)
ggplot() + 
  borders("state", fill = "white") +  ## Add a map with US states filled in white
  geom_curve(data = filter(f6b, month == 1, day == 1), 
             aes(x=origin_lon, y=origin_lat, xend=dest_lon, yend=dest_lat, color = origin), 
             curvature = 0.2,
             alpha = 0.5)

Merging and Joining with dplyr

Preamble

Packages and Datasets

Databases

Keys

Lab

Mutating Joins

Left Joins

Full Outer Joins

Inner Joins

More on the by argument

Filtering Joins

Practice (required)

Merging and Joining with `dplyr`

More on the `by` argument