dplyrThis lab focuses on combining multiple data sources using packages in the tidyverse suite.
Directions (Please read before starting)
\(~\)
This lab will primarily use the dplyr package.
# load the following packages
library(dplyr)
library(ggplot2)
The lab’s examples will use two databases:
orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv", as.is = TRUE)
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv", as.is = TRUE)
library(nycflights13)
data("flights")
data("planes")
data("weather")
data("airports")
\(~\)
So far we’ve only dealt with data that could be stored in a single spreadsheet; however, many real world applications require more flexibility.
Data are often stored in databases, or structured
collections of records. Essential to a database is its relational
structure. For example, here is the relational structure of the
nycflights13 database:
From this diagram you should notice a few things:
flights via
the “tailnum” variable
flights via the “faa” variable
flights is either
“origin” or “dest”Matching data across the different components of a database is done using keys:
planes as no tail number
appears more than once in this data frameplanes %>% count(tailnum) %>% filter(n > 1)  ## Primary key (unique to every plane)    
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
flights data file, as
it can be linked to “tailnum” in the planes data fileflights file (as many values appear more than once)flights %>% count(tailnum) %>% filter(n > 1)  ## Foreign key (unique to planes, but not flights)
## # A tibble: 3,873 × 2
##    tailnum     n
##    <chr>   <int>
##  1 D942DN      4
##  2 N0EGMQ    371
##  3 N10156    153
##  4 N102UW     48
##  5 N103US     46
##  6 N104UW     47
##  7 N10575    289
##  8 N105UW     45
##  9 N107US     41
## 10 N108UW     60
## # … with 3,863 more rows
A primary key and a foreign key combine to form a relation. Relations are used to link information in one data file to another.
Question #1: Use the follow data tables in the
nycflights13 database:
data(flights)
data(airlines)
flights? Why or why not?airlines) to the
records in flights?\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
\(~\)
The goal of a mutating join is to combine variables from two different data frames, “X” and “Y”.
There are three important types of mutating joins:
NA for records in “X” that do not have a match in “Y”.NA for records in
“X” without a match in “Y” and for records in “Y” without a
match in “X”In most circumstances you can expect to use a left outer join, as this approach will preserve all of the records in “X” and attach additional variables from “Y”. The different mutating joins are summarized in the graphic below:
Blue sections illustrate observations (rows) that will be present in the combined data frame, while the white sections represent observations that will be omitted. We will not discuss Right Joins, as they are equivalent to a Left Outer Join with the roles of “X” and “Y” reversed.
\(~\)
Consider the data frames “orders” and “customers”:
print(orders)
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
## 4     4 50 Apr-17
print(customers)
##   id    name
## 1  4   Tukey
## 2  8 Wickham
## 3 15   Mason
## 4 16  Jordan
## 5 23   Patil
## 6 42     Cox
Notice that the variable “id” is a primary key (unique identifier) in each data frame.
The left_join() function is used to perform a left outer
join, adding the customer information to each order using the column
“id” as the join’s key:
left_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>
Because no entries in the customers data frame had an
“id” of 50, a missing value, NA, appears as the name for
that order in the combined data frame.
In most situations, left joins are not symmetric, so your choice of “x” and “y” matters:
left_join(x = customers, y = orders, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 15   Mason    NA   <NA>
## 4 16  Jordan    NA   <NA>
## 5 23   Patil    NA   <NA>
## 6 42     Cox     3 Apr-15
As seen above, customers with “id” numbers of 15, 16, and 23 did not
have any matching records in the orders data frame,
resulting in missing values for “order” and “date” in the combined data
frame.
Question #2: Consider the example join demonstrated
below. Given our initial description of a left outer join, you’d expect
the left join using table2 as the “x” argument (in the
above code) to produce a data frame with 3 rows (one for each
observation in table2). Instead, in this scenario, it
produces a data frame with 6 rows. Why do you think this happens?
table1 <- data.frame(key = c("a", "a", "b", "b", "b", "c"), X = 1:6)
table2 <- data.frame(key = c("a", "b", "c"), Y = c("X", "Y", "Z"))
left_join(x = table2, y = table1, by = "key")
##   key Y X
## 1   a X 1
## 2   a X 2
## 3   b Y 3
## 4   b Y 4
## 5   b Y 5
## 6   c Z 6
\(~\)
The full_join() function will perform a full outer join
of two data frames:
full_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>
## 5    NA 15   <NA>   Mason
## 6    NA 16   <NA>  Jordan
## 7    NA 23   <NA>   Patil
Notice how the order without a customer (id 50) as well as the three customers without orders (ids 15, 16, and 23) are all included in the resulting data frame alongside the three records that had matches (ids 4, 8, and 42).
Outer joins are symmetric, so the choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:
full_join(x = customers, y = orders, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 15   Mason    NA   <NA>
## 4 16  Jordan    NA   <NA>
## 5 23   Patil    NA   <NA>
## 6 42     Cox     3 Apr-15
## 7 50    <NA>     4 Apr-17
\(~\)
The inner_join() function is used to perform an inner
join on two data frames:
inner_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
Inner joins also are symmetric, so your choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:
inner_join(x = customers, y = orders, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 42     Cox     3 Apr-15
Question #3: Consider two data frames, “A” and “B”, each containing the key variable “ID”. In “A”, there is a single record for each of the following IDs: \(\{1, 2, 3, 4\}\). In “B”, there is a single record for each of the following IDs: \(\{3, 4, 5, 6\}\). Given this information, state how many rows will be present in the result produced by the following joins:
left_join(x = A, y = B, by = "ID")left_join(x = B, y = A, by = "ID")full_join(x = A, y = B, by = "ID")inner_join(x = A, y = B, by = "ID")\(~\)
by argumentIn an ideal world, you have a single key with the same name that can
link the information in your “x” and “y” data frames. This is not always
the case; however, there are several ways to use the by
argument to accommodate those scenarios.
by = NULL, will perform a “natural
join”, which will automatically identify all variables with
shared names in “x” and “y” and use that collection to link matching
records. This can be useful when several variables are needed to
uniquely identify an observation.by = c("a" = "b"). In this example, the variable
named “a” in the “x” data frame will be matched with the variable named
“b” in the “y” data frame to link records and perform the join.Question #4: In the nycflights13
database, the locations of each airport (“lat” and “lon”) are stored in
the airports data file. For this question, use an
appropriate join to add the “lat” and “lon” of the destination airport
to each flight recorded in the flights data file.
Hint: the variable “faa” is a primary key in
airports.
\(~\)
Filtering joins will match observations in the same way as mutating joins, but they will not add any variables from “y” to the “x” data frame.
semi_join() will keep the observations
in “x” that have a match in “y”anti_join() will drop the observations
in “x” that have a match in “y”Here a few examples:
semi_join(x = orders, y = customers, by = "id")
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
Notice how semi_join() filters out the fourth order
because there are no customers with an id of 50 in the customers data
frame.
anti_join(x = orders, y = customers, by = "id")
##   order id   date
## 1     4 50 Apr-17
Conversely, anti_join() will filter out everything
except for the fourth order because it was the only observation in the
“x” data file without a match in the “y” data file.
Filtering joins are most useful in finding implicit missing
values, or values that aren’t recorded as NA and
instead exist only as an absence. For example, we could find missing
entries in airports by looking for flights whose
destination airport doesn’t have a match:
anti_join(x = flights, y = airports, by = c("dest" = "faa")) %>% 
  distinct(dest)  ## Return only the unique values of "dest"
## # A tibble: 4 × 1
##   dest 
##   <chr>
## 1 BQN  
## 2 SJU  
## 3 STT  
## 4 PSE
We can see that there are four implicitly missing airports in our database.
Question #5: The code below creates a data frame,
top_dest, containing the ten most common flight
destinations and how many flights arrived at them.
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
flights data file to
include only flights whose destination is one of the top ten airports.
Then, use dim() to print the dimensions of the resulting
data frame.planes data
frame.\(~\)
Question #6: Suppose you want to draw an approximate flight path from the origin to destination airport of all flights departing on Jan 1st, 2013, and you’d like to color these paths by “origin”.
flights. Then, rename these variables to
origin_lon and origin_lat (Hint: you can do
this using a dplyr command like:
select(origin_lon = lon)). Be sure to keep the variables
“origin”, “dest”, “month”, and “day”, which will be used later.dest_lon and dest_lat. Note that you can do
this by continuing your pipeline from Part A.borders() to add
a map layer to the graphic, which requires the maps package
be installed.## Code for Part C
library(maps)
ggplot() + 
  borders("state", fill = "white") +  ## Add a map with US states filled in white
  geom_curve(data = filter(f6b, month == 1, day == 1), 
             aes(x=origin_lon, y=origin_lat, xend=dest_lon, yend=dest_lat, color = origin), 
             curvature = 0.2,
             alpha = 0.5)