Data sets are stored in tabular format and there are many possible ways to organize tabular data. Some organizational schemes are designed to be easily read on the page (or screen), while others are designed to be easily used in analysis. In this tutorial, we focus on how a data set should be formatted for analysis in R.

Make sure that the following packages are installed and loaded:

#install.packages("tidyr")
#install.packages("ggplot2")
#install.packages("readr")
library(tidyr)     # contains tools to tidy data
library(ggplot2)   # for plotting
library(readr)     # a package for parsing data

then load in these example data sets:

UBSprices <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/UBSprices.csv", as.is = TRUE)
polls <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/rcp-polls.csv", na.strings = "--", as.is = TRUE)
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv", as.is = TRUE)

1. Definition of a tidy data set

In R, it is easiest to work with data that follow five basic rules:

  1. Every variable is stored in its own column.
  2. Every observation is stored in its own row—that is, every row corresponds to a single case.
  3. Each value of a variable is stored in a cell of the table.
  4. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.
  5. There should be no extraneous information (footnotes, table titles, etc.).

A data set satisfying these rules is said to be tidy, a term popularized by Hadley Wickham.

Remark: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.

This tutorial will describe the following tidyr commands, which can be thought of as verbs for tidying data:

Command Meaning
gather collapses multiple columns into two columns
spread creates multiple columns from two columns
separate splits compound variables into individual columns

Question #1: Open your R Script “Lab1_Name1_Name2” containing answers to the previous lab’s questions and add a comment beneath your earlier answers indicating all subsequent work pertains to the tidy data lab.

2. Tidying longitudinal data (gather)

UBS is an international bank that reports prices of various staples in major cities every three years. The data set in UBSprices data set contains prices of a 1 kg bag of rice in the years 2003 and 2009 in major world cities. The data set was originally extracted from the alr4 R package.

head(UBSprices)
##        city rice2003 rice2009
## 1 Amsterdam        9       11
## 2    Athens       19       27
## 3  Auckland        9       13
## 4   Bangkok       25       27
## 5 Barcelona       10        8
## 6    Berlin       16       17

This data set is not tidy because each row contains two cases: the city in 2003 and the city in 2009. Additionally, the column names 2003 and 2009 contain the year, which should be the value of a variable. In order to tidy these data, we need to

  1. Reorganize the data so that each row corresponds to a city in a specific year.
  2. Create a single variable for the price of rice.
  3. Add a variable for year.

To do this, we will use the gather function in the tidyr package. gather collapses multiple columns into two columns: a key column and a value column. The key will be the new variable containing the old column names and the value will contain the information recorded in the cells of the collapsed columns.

In our example, we want to collapse rice2003 and rice2009 into the key-value pair year and price. To do this, we use the following command:

tidy_ubs <- gather(data = UBSprices, key = year, value = price, rice2003, rice2009)
head(tidy_ubs)
##        city     year price
## 1 Amsterdam rice2003     9
## 2    Athens rice2003    19
## 3  Auckland rice2003     9
## 4   Bangkok rice2003    25
## 5 Barcelona rice2003    10
## 6    Berlin rice2003    16

Remarks

Questions #2:

For question 2, answer the following in your lab write-up using comments and R code:

  1. How are the number of rows adjusted by using the gather command? (Hint: Use dim(UBSprices) to determine how many rows are in the UBSprices data set and dim(tidy_ubs) to determine how many are in the tidy_ubs data set).

  2. How many rows would there be after using the gather command if the UBSprices data set had five columns of years: rice2003, rice2006, rice2009, rice2012, and rice2015?

To finish tidying these data, we need to modify the year column by removing the word “rice” from each cell.

To do this, we can use the parse_number function in the readr package. This function drops any non-numeric characters in a character string.

tidy_ubs$year <- parse_number(tidy_ubs$year)
head(tidy_ubs)
##        city year price
## 1 Amsterdam 2003     9
## 2    Athens 2003    19
## 3  Auckland 2003     9
## 4   Bangkok 2003    25
## 5 Barcelona 2003    10
## 6    Berlin 2003    16

We now have a data set that we can call tidy.

Remark

This data set started in a relatively tidy form, so it may be difficult to see the benefit of tidying it. Tidy data are typically required for summarizing and plotting data in R. For example, consider making a side-by-side boxplot using ggplot (we will learn more about ggplot in a future lab).

qplot(data=tidy_ubs, x= factor(year), y=price, geom="boxplot")

This was straightforward since tidy_ubs was already tidy, but would have required extra manipulation in the original format.

3. Tidying pollster data (separate + gather)

The polls data set contains the results of various presidential polls conducted during July 2016, and was scraped from RealClear Politics.

polls
##                     Poll        Date  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688 LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5 - 7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

Here, the data set is not tidy because:

To break a single character column into multiple new columns we use the separate function in the tidyr package.

To begin, let’s break the Date column into Begin and End columns:

tidy_polls <- separate(data = polls, col = Date, into = c("Begin", "End"), sep = " - ")
tidy_polls
##                     Poll Begin  End  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth  7/14 7/16  688 LV 3.7          45        43
## 2                CNN/ORC  7/13 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post  7/11 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov   7/9 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK   7/7 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5  7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

Remarks

We also need to separate the Sample column into size and population columns.

tidy_polls <- separate(data = tidy_polls, col = Sample, into = c("size", "population"), sep = " ")
tidy_polls
##                     Poll Begin  End size population MoE Clinton..D.
## 1               Monmouth  7/14 7/16  688         LV 3.7          45
## 2                CNN/ORC  7/13 7/16  872         RV 3.5          42
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0          42
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1          41
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5          40
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA          40
## 7       McClatchy/Marist   7/5  7/9 1053         RV 3.0          40
##   Trump..R. Johnson..L. Stein..G.
## 1        43           5         1
## 2        37          13         5
## 3        38           8         5
## 4        35          11         6
## 5        37           5         2
## 6        36           6         2
## 7        35          10         5

Next, we need to gather the last four columns into a candidate variable.

tidy_polls <- gather(data = tidy_polls, key = candidate, value = percentage, 7:10)
head(tidy_polls)
##                     Poll Begin  End size population MoE   candidate
## 1               Monmouth  7/14 7/16  688         LV 3.7 Clinton..D.
## 2                CNN/ORC  7/13 7/16  872         RV 3.5 Clinton..D.
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0 Clinton..D.
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1 Clinton..D.
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5 Clinton..D.
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA Clinton..D.
##   percentage
## 1         45
## 2         42
## 3         42
## 4         41
## 5         40
## 6         40

Notice that instead of writing out the column names (Clinton..D., Trump..R., etc.) we can simply specify the column numbers—here 7:10 specifies that we are gathering columns 7 through 10.

Finally, we need to separate the candidate names from the political party.

tidy_polls <- separate(tidy_polls, candidate, into= c("candidate", "party"))
## Warning: Expected 2 pieces. Additional pieces discarded in 28 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(tidy_polls)
##                     Poll Begin  End size population MoE candidate party
## 1               Monmouth  7/14 7/16  688         LV 3.7   Clinton     D
## 2                CNN/ORC  7/13 7/16  872         RV 3.5   Clinton     D
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0   Clinton     D
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1   Clinton     D
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5   Clinton     D
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA   Clinton     D
##   percentage
## 1         45
## 2         42
## 3         42
## 4         41
## 5         40
## 6         40

Remark

In the last command we let R guess which separator to use. This worked, but resulted in a warning message—we’re lucky that it worked! There are many situations where the separator is too complex for R to guess correctly and it cannot be specified using a simple character in quotes. In such cases we need to use regular expressions (see the tutorial IntroStrings) to aid our data tidying, but that’s a topic for another tutorial. The important thing to note here is that you should always check that separate worked as you expected, don’t blindly trust it!

4. Tidying crash data (gather + separate + spread)

The airlines data set contains the raw data behind the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.

head(airlines)
##                 airline avail_seat_km_per_week incidents.1985_1999
## 1            Aer Lingus              320906734                   2
## 2             Aeroflot*             1197672318                  76
## 3 Aerolineas Argentinas              385803648                   6
## 4           Aeromexico*              596871813                   3
## 5            Air Canada             1865253802                   2
## 6            Air France             3004002661                  14
##   fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1                         0                    0                   0
## 2                        14                  128                   6
## 3                         0                    0                   1
## 4                         1                   64                   5
## 5                         0                    0                   2
## 6                         4                   79                   6
##   fatal_accidents.2000_2014 fatalities.2000_2014
## 1                         0                    0
## 2                         1                   88
## 3                         0                    0
## 4                         0                    0
## 5                         0                    0
## 6                         2                  337

In this example, a case is best described as an airline in a specific time frame, so these data are not tidy because each case is not its own row. Additionally, the last six column names contain the time frame, which is a value. In order to tidy this data set we must

First, we gather the last six columns into a common accidents column. This will allow us to easily create the years column.

tidy_airlines <- gather(airlines, key = accidents, value = count, 3:8)
head(tidy_airlines)
##                 airline avail_seat_km_per_week           accidents count
## 1            Aer Lingus              320906734 incidents.1985_1999     2
## 2             Aeroflot*             1197672318 incidents.1985_1999    76
## 3 Aerolineas Argentinas              385803648 incidents.1985_1999     6
## 4           Aeromexico*              596871813 incidents.1985_1999     3
## 5            Air Canada             1865253802 incidents.1985_1999     2
## 6            Air France             3004002661 incidents.1985_1999    14

Next, we separate the values of the new accidents column into var (short for variable) and years. The default guessing scheme fails here, so we must specify sep = "[.]" to denote that the period is the separator. (If you want to learn more about why we need brackets around the period you need to delve into regular expressions.)

tidy_airlines <- separate(tidy_airlines, accidents, into = c("var", "years"), sep = "[.]")
head(tidy_airlines)
##                 airline avail_seat_km_per_week       var     years count
## 1            Aer Lingus              320906734 incidents 1985_1999     2
## 2             Aeroflot*             1197672318 incidents 1985_1999    76
## 3 Aerolineas Argentinas              385803648 incidents 1985_1999     6
## 4           Aeromexico*              596871813 incidents 1985_1999     3
## 5            Air Canada             1865253802 incidents 1985_1999     2
## 6            Air France             3004002661 incidents 1985_1999    14

Finally, we need to ensure that each row corresponds to a case. (Don’t worry, this will also make each column a variable!) Currently, there are six rows for each airline: one for each var in each time frame. To solve this problem, we need to spread out the var column so that each variable has its own column.

tidy_airlines <- spread(data = tidy_airlines, key = var, value = count)
head(tidy_airlines)
##                 airline avail_seat_km_per_week     years fatal_accidents
## 1            Aer Lingus              320906734 1985_1999               0
## 2            Aer Lingus              320906734 2000_2014               0
## 3             Aeroflot*             1197672318 1985_1999              14
## 4             Aeroflot*             1197672318 2000_2014               1
## 5 Aerolineas Argentinas              385803648 1985_1999               0
## 6 Aerolineas Argentinas              385803648 2000_2014               0
##   fatalities incidents
## 1          0         2
## 2          0         0
## 3        128        76
## 4         88         6
## 5          0         6
## 6          0         1

Remark

Notice that the first argument given to spread is the data frame, followed by the key-value pair. The key is the name of the column whose values will be used as column headings and the value is the name of the column whose values will populate the cells of the new columns. In this example, we use var as the key and populate the cells with the count.

Practice

Question 3:

The file under5mortality.csv (available at the url “https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/under5mortality.csv”) contains the child mortality rate per 1,000 children born for each country from 1800 to 2015. (Source: https://www.gapminder.org/data/)

  1. Briefly describe why it is not considered to be tidy data and what changes need to be made to tidy it.
  2. Use gather to create a tidy data set with columns country, year and mortality. Use parse_number to ensure that the year column is numeric. (Hint: you can change the column names of a data.frame object using the colnames function. For example, the code colnames(mydata)[1] <- "newName" will change the name of the first column to “newName”)

Question 4:

The file HospitalAdmits.csv (available at the url “https://remiller1450.github.io/data/HospitalAdmits.csv”) describes the general reasons for people being hospitalized in the financial years ranging from July 1993 to June 1998. The variable “Separations” describes how many patient discharges occured in that year, while the variable “PatientDays” describes how many days in total patients spent in the hospital for that reason.

  1. Briefly explain why it is not considered to be tidy data and what changes need to be made to tidy it.
  2. Use gather and separate to create a tidy data set with columns IcdCode, IcdText, Year, Field, and Count. The IcdCode is the numeric component of IcdChapter (Hint: pay attention to variable types, you might need to coerce factor variables into character variables using the as.character function, and then convert the result to numeric using the as.numeric function)

6. Additional Resources