dplyr
This lab focuses on combining multiple data sources using packages in the tidyverse suite.
Directions (Please read before starting)
\(~\)
This lab will primarily use the dplyr
package.
# load the following packages
library(dplyr)
library(ggplot2)
The lab’s examples will use two databases:
orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv", as.is = TRUE)
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv", as.is = TRUE)
library(nycflights13)
data("flights")
data("planes")
data("weather")
data("airports")
\(~\)
So far we’ve only dealt with data that could be stored in a single spreadsheet; however, many real world applications require more flexibility.
Data are often stored in databases, or structured
collections of records. Essential to a database is its relational
structure. For example, here is the relational structure of the
nycflights13
database:
From this diagram you should notice a few things:
flights
via
the “tailnum” variable
flights
via the “faa” variable
flights
is either
“origin” or “dest”Matching data across the different components of a database is done using keys:
planes
as no tail number
appears more than once in this data frameplanes %>% count(tailnum) %>% filter(n > 1) ## Primary key (unique to every plane)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
flights
data file, as
it can be linked to “tailnum” in the planes
data fileflights
file (as many values appear more than once)flights %>% count(tailnum) %>% filter(n > 1) ## Foreign key (unique to planes, but not flights)
## # A tibble: 3,873 × 2
## tailnum n
## <chr> <int>
## 1 D942DN 4
## 2 N0EGMQ 371
## 3 N10156 153
## 4 N102UW 48
## 5 N103US 46
## 6 N104UW 47
## 7 N10575 289
## 8 N105UW 45
## 9 N107US 41
## 10 N108UW 60
## # … with 3,863 more rows
A primary key and a foreign key combine to form a relation. Relations are used to link information in one data file to another.
Question #1: Use the follow data tables in the
nycflights13
database:
data(flights)
data(airlines)
flights
? Why or why not?airlines
) to the
records in flights
?\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
\(~\)
The goal of a mutating join is to combine variables from two different data frames, “X” and “Y”.
There are three important types of mutating joins:
NA
for records in “X” that do not have a match in “Y”.NA
for records in
“X” without a match in “Y” and for records in “Y” without a
match in “X”In most circumstances you can expect to use a left outer join, as this approach will preserve all of the records in “X” and attach additional variables from “Y”. The different mutating joins are summarized in the graphic below:
Blue sections illustrate observations (rows) that will be present in the combined data frame, while the white sections represent observations that will be omitted. We will not discuss Right Joins, as they are equivalent to a Left Outer Join with the roles of “X” and “Y” reversed.
\(~\)
Consider the data frames “orders” and “customers”:
print(orders)
## order id date
## 1 1 4 Jan-01
## 2 2 8 Feb-01
## 3 3 42 Apr-15
## 4 4 50 Apr-17
print(customers)
## id name
## 1 4 Tukey
## 2 8 Wickham
## 3 15 Mason
## 4 16 Jordan
## 5 23 Patil
## 6 42 Cox
Notice that the variable “id” is a primary key (unique identifier) in each data frame.
The left_join()
function is used to perform a left outer
join, adding the customer information to each order using the column
“id” as the join’s key:
left_join(x = orders, y = customers, by = "id")
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
## 4 4 50 Apr-17 <NA>
Because no entries in the customers
data frame had an
“id” of 50, a missing value, NA
, appears as the name for
that order in the combined data frame.
In most situations, left joins are not symmetric, so your choice of “x” and “y” matters:
left_join(x = customers, y = orders, by = "id")
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 15 Mason NA <NA>
## 4 16 Jordan NA <NA>
## 5 23 Patil NA <NA>
## 6 42 Cox 3 Apr-15
As seen above, customers with “id” numbers of 15, 16, and 23 did not
have any matching records in the orders
data frame,
resulting in missing values for “order” and “date” in the combined data
frame.
Question #2: Consider the example join demonstrated
below. Given our initial description of a left outer join, you’d expect
the left join using table2
as the “x” argument (in the
above code) to produce a data frame with 3 rows (one for each
observation in table2
). Instead, in this scenario, it
produces a data frame with 6 rows. Why do you think this happens?
table1 <- data.frame(key = c("a", "a", "b", "b", "b", "c"), X = 1:6)
table2 <- data.frame(key = c("a", "b", "c"), Y = c("X", "Y", "Z"))
left_join(x = table2, y = table1, by = "key")
## key Y X
## 1 a X 1
## 2 a X 2
## 3 b Y 3
## 4 b Y 4
## 5 b Y 5
## 6 c Z 6
\(~\)
The full_join()
function will perform a full outer join
of two data frames:
full_join(x = orders, y = customers, by = "id")
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
## 4 4 50 Apr-17 <NA>
## 5 NA 15 <NA> Mason
## 6 NA 16 <NA> Jordan
## 7 NA 23 <NA> Patil
Notice how the order without a customer (id 50) as well as the three customers without orders (ids 15, 16, and 23) are all included in the resulting data frame alongside the three records that had matches (ids 4, 8, and 42).
Outer joins are symmetric, so the choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:
full_join(x = customers, y = orders, by = "id")
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 15 Mason NA <NA>
## 4 16 Jordan NA <NA>
## 5 23 Patil NA <NA>
## 6 42 Cox 3 Apr-15
## 7 50 <NA> 4 Apr-17
\(~\)
The inner_join()
function is used to perform an inner
join on two data frames:
inner_join(x = orders, y = customers, by = "id")
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
Inner joins also are symmetric, so your choice of “x” and “y” will only impact the ordering of columns in the resulting data frame:
inner_join(x = customers, y = orders, by = "id")
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 42 Cox 3 Apr-15
Question #3: Consider two data frames, “A” and “B”, each containing the key variable “ID”. In “A”, there is a single record for each of the following IDs: \(\{1, 2, 3, 4\}\). In “B”, there is a single record for each of the following IDs: \(\{3, 4, 5, 6\}\). Given this information, state how many rows will be present in the result produced by the following joins:
left_join(x = A, y = B, by = "ID")
left_join(x = B, y = A, by = "ID")
full_join(x = A, y = B, by = "ID")
inner_join(x = A, y = B, by = "ID")
\(~\)
by
argumentIn an ideal world, you have a single key with the same name that can
link the information in your “x” and “y” data frames. This is not always
the case; however, there are several ways to use the by
argument to accommodate those scenarios.
by = NULL
, will perform a “natural
join”, which will automatically identify all variables with
shared names in “x” and “y” and use that collection to link matching
records. This can be useful when several variables are needed to
uniquely identify an observation.by = c("a" = "b")
. In this example, the variable
named “a” in the “x” data frame will be matched with the variable named
“b” in the “y” data frame to link records and perform the join.Question #4: In the nycflights13
database, the locations of each airport (“lat” and “lon”) are stored in
the airports
data file. For this question, use an
appropriate join to add the “lat” and “lon” of the destination airport
to each flight recorded in the flights
data file.
Hint: the variable “faa” is a primary key in
airports
.
\(~\)
Filtering joins will match observations in the same way as mutating joins, but they will not add any variables from “y” to the “x” data frame.
semi_join()
will keep the observations
in “x” that have a match in “y”anti_join()
will drop the observations
in “x” that have a match in “y”Here a few examples:
semi_join(x = orders, y = customers, by = "id")
## order id date
## 1 1 4 Jan-01
## 2 2 8 Feb-01
## 3 3 42 Apr-15
Notice how semi_join()
filters out the fourth order
because there are no customers with an id of 50 in the customers data
frame.
anti_join(x = orders, y = customers, by = "id")
## order id date
## 1 4 50 Apr-17
Conversely, anti_join()
will filter out everything
except for the fourth order because it was the only observation in the
“x” data file without a match in the “y” data file.
Filtering joins are most useful in finding implicit missing
values, or values that aren’t recorded as NA
and
instead exist only as an absence. For example, we could find missing
entries in airports
by looking for flights whose
destination airport doesn’t have a match:
anti_join(x = flights, y = airports, by = c("dest" = "faa")) %>%
distinct(dest) ## Return only the unique values of "dest"
## # A tibble: 4 × 1
## dest
## <chr>
## 1 BQN
## 2 SJU
## 3 STT
## 4 PSE
We can see that there are four implicitly missing airports in our database.
Question #5: The code below creates a data frame,
top_dest
, containing the ten most common flight
destinations and how many flights arrived at them.
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
flights
data file to
include only flights whose destination is one of the top ten airports.
Then, use dim()
to print the dimensions of the resulting
data frame.planes
data
frame.\(~\)
Question #6: Suppose you want to draw an approximate flight path from the origin to destination airport of all flights departing on Jan 1st, 2013, and you’d like to color these paths by “origin”.
flights
. Then, rename these variables to
origin_lon
and origin_lat
(Hint: you can do
this using a dplyr
command like:
select(origin_lon = lon)
). Be sure to keep the variables
“origin”, “dest”, “month”, and “day”, which will be used later.dest_lon
and dest_lat
. Note that you can do
this by continuing your pipeline from Part A.borders()
to add
a map layer to the graphic, which requires the maps
package
be installed.## Code for Part C
library(maps)
ggplot() +
borders("state", fill = "white") + ## Add a map with US states filled in white
geom_curve(data = filter(f6b, month == 1, day == 1),
aes(x=origin_lon, y=origin_lat, xend=dest_lon, yend=dest_lat, color = origin),
curvature = 0.2,
alpha = 0.5)