There are many situations where data sets are broken into multiple tables, and multiple reasons that this might make sense—sometimes it is easier to collect data in different pieces, other times it is to reduce the file size. Regardless of the reason for splitting data sets into multiple tables, they should always formatted in such a way that there is at least one common column between the tables so that they can be merged as needed. In this example we will explore how to use the join suite of functions found in the dplyr package to merge these data tables.

Note: In R, most tabular sets of data are called data frames. Data frames can store objects of different classes (e.g. some columns can be text and other columns can be integers). A synonymous term is data table, which is used by some textbooks and other languages, such as the structued query language (SQL).

To begin, let’s ensure that dplyr is loaded.

library(dplyr)

1. Example data tables

To illustrate the different join functions we will use a small example of a customer database. We will focus on two tables: orders which contains the order number, customer ID, and date of the order; and customers which contains the customer ID and customer name. These are intentionally small data tables so that is easier to see how the join statements are working.

orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv", as.is = TRUE)
orders
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
## 4     4 50 Apr-17
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv", as.is = TRUE)
customers
##   id    name
## 1  4   Tukey
## 2  8 Wickham
## 3 15   Mason
## 4 16  Jordan
## 5 23   Patil
## 6 42     Cox

2. Joining data tables

The the dplyr package provides six different join functions, each merges two data tables together in a different way. The best way to understand these functions is to see how each works in our small example.

All of the functions have the same three arguments:

argument explanation
x the first (left) table to join
y the second (right) table to join
by a character vector of variables to join by, notice that the column name must always be in quotes.

inner_join

inner_join creates a new table which is restricted to cases where the values of “by variable” exist in both data sets. All columns from both data sets are returned for these cases.

inner_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox

Remark

In this inner_join, we lose the row for order 4 from the orders table because customer 50 does not appear in the customers data table. We also lose lose orders 15, 16 and 23 from the customers table. Note the columns order, date and name are also returned.

left_join

left_join returns all cases from the x data table, regardless of whether there are matching values of the by variable(s) in y. All columns from both data tables are returned for these cases.

left_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>

Remark

This new data frame now includes the name column. Since customer 50 does not appear in the customers data table, an NA (missing value) is used for their name.

right_join

right_join returns all cases from the y data table, regardless of whether there are matching values of the by variable(s) in x. All columns from both data tables are returned for these cases.

right_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3    NA 15   <NA>   Mason
## 4    NA 16   <NA>  Jordan
## 5    NA 23   <NA>   Patil
## 6     3 42 Apr-15     Cox

Remark

We have added the order and date columns to the customers data table. Customers 15, 16, and 23 did not make purchases during this time frame, so missing values (NAs) are used for their order and date values.

full_join

full_join returns all rows and columns from both x and y.

full_join(x = orders, y = customers, by = "id")
##   order id   date    name
## 1     1  4 Jan-01   Tukey
## 2     2  8 Feb-01 Wickham
## 3     3 42 Apr-15     Cox
## 4     4 50 Apr-17    <NA>
## 5    NA 15   <NA>   Mason
## 6    NA 16   <NA>  Jordan
## 7    NA 23   <NA>   Patil

Remark

We have fully merged the orders and customers data tables; thus, we get all of the columns and all of the rows from both data tables. NAs fill in the necessary values for customers not making purchases and for orders without a customer record.

semi_join

semi_join returns all rows from the x data table where there are matching values of the by variable(s) in y, and only the columns from x.

semi_join(x = orders, y = customers, by = "id")
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15

Remark

We lose the row for order 4 because customer 50 does not appear in the customers data table.

inner_join vs. semi_join

Above, the inner_join and semi_join returned the same number of rows, but this will not always be the case. For example, suppose that customer 42 also placed an order on May-01 so that we have multiple orders from the same customer.

extra_order <- data.frame(order = 5, id = 42, date = "May-01")
extra_order
##   order id   date
## 1     5 42 May-01
orders2 <- rbind(orders, extra_order)
orders2
##   order id   date
## 1     1  4 Jan-01
## 2     2  8 Feb-01
## 3     3 42 Apr-15
## 4     4 50 Apr-17
## 5     5 42 May-01
inner_join(x = customers, y = orders2, by = "id")
##   id    name order   date
## 1  4   Tukey     1 Jan-01
## 2  8 Wickham     2 Feb-01
## 3 42     Cox     3 Apr-15
## 4 42     Cox     5 May-01
semi_join(x = customers, y = orders2, by = "id")
##   id    name
## 1  4   Tukey
## 2  8 Wickham
## 3 42     Cox

Remark

The result of the inner_join includes two rows for customer 42 because inner_join returns all of the columns from both data tables for ids that match. The result of the semi_join only returns one row for each customer because it only returns the rows from customers that have matching ids in orders2.

anti_join

anti_join returns all rows from the x data table where there are not matching values of the by variable(s) in y, and only the columns from x.

anti_join(x = orders, y = customers, by = "id")
##   order id   date
## 1     4 50 Apr-17

Remark

Order 4 is the only order from a customer without a record in the customers data table, so it is the only row of orders returned.

anti_join(x = customers, y = orders, by = "id")
##   id   name
## 1 15  Mason
## 2 16 Jordan
## 3 23  Patil

Remark

Customers 15, 16, and 23 did not place orders during this time frame, so their entries from the customers data table are returned.

3. Common complications

All of our examples have only used a single column to match the entries between the data tables, and has also assumed that the columns will have identical names. This will not always be the case. Below we detail how to refine what variables you merge by.

Another issue that crops up occasionally is duplicate entries in the variable(s) that you wish to merge by. We saw one example of this above when there were two orders from the same customer. In that case the id value was unique in the customer table, but not in the orders table. The result of this join is quite logical, as seen above. If, however, both tables contain duplicate entries in the variable(s) that you wish to merge by, all possible combinations of these entries are returned. A simple example for a full_join is shown below:

# Creating example data frames
table1 <- data.frame(key = c("a", "a", "b", "b", "c"), var = 1:5)
table1
##   key var
## 1   a   1
## 2   a   2
## 3   b   3
## 4   b   4
## 5   c   5
table2 <- data.frame(key = c("a", "a", "b", "b", "c"), var = LETTERS[1:5])
table2
##   key var
## 1   a   A
## 2   a   B
## 3   b   C
## 4   b   D
## 5   c   E
# A full join
full_join(x = table1, y = table2, by = "key")
##   key var.x var.y
## 1   a     1     A
## 2   a     1     B
## 3   a     2     A
## 4   a     2     B
## 5   b     3     C
## 6   b     3     D
## 7   b     4     C
## 8   b     4     D
## 9   c     5     E

In this situation, the results for left_join, right_join, and full_join will be identical.

4. Practice

Record the answers to the questions below in the R Markdown document you created during the plotly lab. Use a single section/code chunk for each question, separating each part using comments (ie: # Part A)

Question #1:

The files books.csv, authors.csv, and book-authors.csv give details about the planned summer reading of a statistics student. You can find the files at:

books.csv provides details for each book (isbn, title, year, and genre), authors.csv provides details about each author (authorid, first name, last name, and nationality), and book-authors.csv provides the author identification (authorid) for each isbn (books with multiple authors will have multiple rows).

  1. Read the three files into R, naming them books, authors, and book_authors.
  2. Use the appropriate join statement to add the ISBNs to the authors data table. Why does the resulting data frame have 31 rows instead of 11?
  3. Eliminate the duplicate rows of your data frame from Part B (which we’ll assume you named df) run the following code (change the object names to align with your code as necessary):

    df2 <- unique(df)

  4. Use the appropriate join statement to add the updated author information from Part B/C to the books data table.
  5. Are there any authors in the authors data table that do not correspond to books in the books data table? Use an appropriate join statement to determine this.

Question #2:

The nycflights13 package contains a database of all flights departing New York City in 2013.

  1. Write code that adds weather information to each record in the flights table. You should match each flight with the nearest weather recording at that airport, meaning you should use the provided code to get the hour nearest to when the flight departs.
  2. Write code to print the dimensions of the flights data.frame before and after the merge. Should they be the same? Briefly explain in a comment.
library(nycflights13)
data("flights")
flights$nearest_hour <- round(flights$dep_time, -2)/100

5. Additional Resources