Introduction to Data Scraping with R

Scraping Data From the Web

Data on the web are often presented in tables. For instance, we can see a list of countries by population in 1900 on Wikipedia

Web pages are written in HTML (Hyper Text Markup Language) which uses tags to describe different aspects of document content. For example, a heading in a document is indicated by <h1>My Title</h1> whereas a paragraph would be indicated by <p>A paragraph of content...</p>.

In this tutorial, we will learn how to read data from a table on a web page into R. We will need the package rvest to get the data from the web page, and the stringr package to clean up the data.

library(rvest)
library(stringr)
library(readr)

1. Reading data into R with `rvest`

To get the population data on Wikipedia into R, we use the read_html command from the xml2 package (which is attached when rvest is called) to parse the page to obtain an HTML document.

We then use the html_nodes command that extracts all occurrences of the desired tag. We will be interested in scraping data presented in tables, so in the source code, we look for the table tag: <table> ... </table>.

Note: some of the rvest commands may be slow depending on your Internet connection and the complexity of the web page.

popParse <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_in_1900")
str(popParse)

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

The read_html command creates an R object, basically a list, that stores information about the web page.

To extract a table from this web page (this may take some time):

popNodes <- html_nodes(popParse, "table")
popNodes

## {xml_nodeset (4)}
## [1] <table class="box-More_footnotes plainlinks metadata ambox ambox-sty ...
## [2] <table class="wikitable floatright"><tbody>\n<tr><th colspan="3" sty ...
## [3] <table class="wikitable" style="text-align:right"><tbody>\n<tr>\n<th ...
## [4] <table class="nowraplinks hlist collapsible autocollapse navbox-inne ...

There are four tables in this document. By inspecting the output of popNodes, we make a guess that we want the third table. (In other cases, trial and error may be required.) We select the third table by using double brackets, which are used to index list objects:

pop <- html_table(popNodes, header = TRUE, fill = TRUE)[[3]]
str(pop)

## 'data.frame':    236 obs. of  4 variables:
##  $ Rank                                 : chr  "" "1" "" "2" ...
##  $ Country/Territory                    : chr  "World" "Qing China[1]" "British Empire[a]" "Indian Empire[b][1] (UK)" ...
##  $ Population c.<U+2009>1900 estimate[1]: chr  "1,700,000,000" "400,000,000" "384,000,000" "294,000,000" ...
##  $ Percentage of
## World Population      : chr  "-" "23.5%" "22.5%" "17.3%" ...

3. Cleaning the data frame

We now have a workable data frame that we can analyze. Notice that even though the first and third columns are numbers, they are classified as “character.” For Rank, that is because the first observation is the world population and it is not assigned a rank, but rather, the character “-”. The Population column is also a character because the numbers have commas in them, plus some observations include characters such as [1] to indicate some footnotes. We need to convert these columns to be numeric.

In the case of the Rank column, we will remove the world population.

We will also simplify the name of the third column to “Population.”

pop2 <- pop[-1, ]         #remove row 1
head(pop2)

row.names(pop2) <- NULL    #reset row numbers to start at 1

pop2$Rank <- as.numeric(pop2$Rank) #coerce Rank to numeric

## Warning: NAs introduced by coercion

names(pop2)[3] <- "Population" #rename 3rd column

pop2$Population <- parse_number(pop2$Population) ## Removes commas and footnotes from Population

Another problem with this table are footnotes in the Country/Territory. While it isn’t critical that we remove them since they aren’t likely to interfere with an analysis, they look ugly, which is not good if they end up being used as labels for a figure or table.

We can remove them using regular expressions:

pop2$`Country/Territory` <- str_replace_all(pop2$`Country/Territory`, "\\[[^]]+\\]", "")

head(pop2)

4. A movie box office example

The web site Box Office Mojo gives statistics on box office earnings of movies. In addition to daily earnings, the web site also maintains lists of yearly and all time record holders.

We will look at the movies in the top 100 in all time movie worldwide grosses in box office receipts. In particular, we will scrape the data from Box Office Mojo: All Time Box Office. The dollar amounts are in millions of dollars and the years marked with “^” indicate that the movie had multiple releases.

movieParse<- read_html("http://www.boxofficemojo.com/alltime/world/?pagenum=1")

movieTables <- html_nodes(movieParse, "table")
head(movieTables)

## {xml_nodeset (3)}
## [1] <table border="0" cellpadding="0" cellspacing="0" width="100%"><tr>< ...
## [2] <table border="0" cellspacing="0" cellpadding="0" width="100%"><tr>\ ...
## [3] <table border="0" cellpadding="5" cellspacing="1" width="98%">\n<tr  ...

movies <- html_table(movieTables, header = TRUE, fill = TRUE)[[3]]
str(movies)

## 'data.frame':    100 obs. of  9 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title       : chr  "Avatar" "Titanic" "Star Wars: The Force Awakens" "Avengers: Infinity War" ...
##  $ Studio      : chr  "Fox" "Par." "BV" "BV" ...
##  $ Worldwide   : chr  "$2,788.0" "$2,187.5" "$2,068.2" "$2,048.4" ...
##  $ Domestic / %: chr  "$760.5" "$659.4" "$936.7" "$678.8" ...
##  $ Domestic / %: chr  "27.3%" "30.1%" "45.3%" "33.1%" ...
##  $ Overseas / %: chr  "$2,027.5" "$1,528.1" "$1,131.6" "$1,369.5" ...
##  $ Overseas / %: chr  "72.7%" "69.9%" "54.7%" "66.9%" ...
##  $ Year^       : chr  "2009^" "1997^" "2015" "2018" ...

We will first clean up the column names and then convert the character variables into numeric variables.

names(movies)[5:9] <- c("DomesticDollars", "DomesticPercentage", "OverseasDollars", "OverseasPercentage", "Year")

For the grosses, we note that the numbers are either prefaced by a dollar sign or end with a percentage, both of which will need to be removed. We will also have to remove the commas. These tasks can be done using either parse_number or regular expressions. An example of the latter is shown below:

out <- str_replace_all(movies$Worldwide, "\\$|,", "" )
head(out)

## [1] "2788.0" "2187.5" "2068.2" "2048.4" "1671.7" "1518.8"

movies$Worldwide <- as.numeric(out)

Cleaning up the remaining variables of the movies table is left as an exercise.

5. Non-table Nodes

So far the two examples we’ve looked at extract tables from a webpage, but lots of other information from a webpage can be extracted using rvest. The example below extracts data on the images contained on the Box Office Mojo webpage:

moviesImg <- html_nodes(movieParse, "img")
head(moviesImg)

## {xml_nodeset (6)}
## [1] <img src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=displa ...
## [2] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box  ...
## [3] <img src="/images/IMDbPro.png" alt="Get industry info at IMDbPro" he ...
## [4] <img src="/images/glyphicons-social-32-twitter@2x.png" alt="Follow u ...
## [5] <img src="/images/glyphicons-social-31-facebook@2x.png" alt="Follow  ...
## [6] <img src="/images/mojo_imdb_sm.png">

We see that there are 10 images on the page. We can also see that the page has Twitter and Facebook accounts, items we might consider extracting using string processing. In general, there are a huge variety of nodes that we could extract, the next section provides a more detailed example of extracting data from a non-table html node.

6. A NY Times Article Example

Sometimes we might want to extract specific nodes that are defined by CSS (Cascading Style Sheets, a langauge which defines the style of an html page). For this goal, it can be very useful to view to html source code of a page for guidance.

To view the HTML code behind a webpage, right click anywhere on the page and select “View Page Source” in Chrome or Firefox, “View Source” in Internet Explorer, or “Show Page Source” in Safari. (If that option doesn’t appear in Safari, just open Safari Preferences, select the Advanced tab, and check “Show Develop menu in menu bar”.)

The example we will look at is a New York Times article. The goal for this example is transform the body of text into a clean data.frame containing:

The lie itself
The date of that lie
A url sourcing the lie

By inspecting the html source code, we see that each lie has the following format:

<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>

This tells us that extracting all <span> tags belonging to the class “short-desc” will provide us what we need.

NytTrump <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
lies <- html_nodes(NytTrump, ".short-desc")
lies

## {xml_nodeset (180)}
##  [1] <span class="short-desc"><strong>Jan. 21 </strong>I wasn't a fan o ...
##  [2] <span class="short-desc"><strong>Jan. 21 </strong>A reporter for T ...
##  [3] <span class="short-desc"><strong>Jan. 23 </strong>Between 3 millio ...
##  [4] <span class="short-desc"><strong>Jan. 25 </strong>Now, the audienc ...
##  [5] <span class="short-desc"><strong>Jan. 25 </strong>Take a look at t ...
##  [6] <span class="short-desc"><strong>Jan. 25 </strong>You had millions ...
##  [7] <span class="short-desc"><strong>Jan. 25 </strong>So, look, when P ...
##  [8] <span class="short-desc"><strong>Jan. 26 </strong>We've taken in t ...
##  [9] <span class="short-desc"><strong>Jan. 26 </strong>I cut off hundre ...
## [10] <span class="short-desc"><strong>Jan. 28 </strong>The coverage abo ...
## [11] <span class="short-desc"><strong>Jan. 29 </strong>The Cuban-Americ ...
## [12] <span class="short-desc"><strong>Jan. 30 </strong>Only 109 people  ...
## [13] <span class="short-desc"><strong>Feb. 3 </strong>Professional anar ...
## [14] <span class="short-desc"><strong>Feb. 4 </strong>After being force ...
## [15] <span class="short-desc"><strong>Feb. 5 </strong>We had 109 people ...
## [16] <span class="short-desc"><strong>Feb. 6 </strong>I have already sa ...
## [17] <span class="short-desc"><strong>Feb. 6 </strong>It's gotten to a  ...
## [18] <span class="short-desc"><strong>Feb. 6 </strong>The failing @nyti ...
## [19] <span class="short-desc"><strong>Feb. 6 </strong>And the previous  ...
## [20] <span class="short-desc"><strong>Feb. 7 </strong>And yet the murde ...
## ...

The “.” in from of “short-desc” is CSS selector syntax, it will select all elements with class=“short-desc”. See CSS selector link for more selector examples.

After extracting each lie, the first task is to find the date of each. For this we notice that each date is wrapped within a <strong> tag. Thus, extracting nodes with the <strong> tag will provide the dates we need:

dates <- html_nodes(lies, "strong")
dates <- html_text(dates, trim = TRUE)  ## html_text will parse text from the html jargon
head(dates)

## [1] "Jan. 21" "Jan. 21" "Jan. 23" "Jan. 25" "Jan. 25" "Jan. 25"

Extracting each lie requires a different strategy. We could do it using regular expressions, but instead we use the xml_contents function:

lies_var <- xml_contents(lies)

We see that each lie has three components, the second containing the lie itself, and these components are arranged in a vector. We can exploit this ordering and extract the lies using their positions:

lies_var2 <- lies_var[seq(2, length(lies_var), by = 3)]

We can use a similar strategy to get the urls:

lies_var3 <- lies_var[seq(3, length(lies_var), by = 3)]

From here we see that all the urls are tagged with “a”, we can select these nodes using the tag, and then select the href object to get the url itself. Note that we could have noticed this right away and eliminated the prior step.

lies_var3 <- html_node(lies_var3, "a")
url_var <- html_attr(lies_var3, "href")

At this point we’ve satisfied the goal of this example, next step might be a textual analysis that further extracts key words or categorizes the urls.

lies_clean <- data.frame(date = dates, lie = as.character(lies_var2), url = url_var)
head(lies_clean)

7. A Billboard Top 100 example

The website billboard.com keeps track of top songs, albums and artists from the music industry.

One page lists the greatest hot women artists. In the source code, here is one artist’s listing: <div class="chart-list-item " data-rank="24" data-artist="" data-title="Dionne Warwick" data-has-content="false">.

(Note that the current number 1 entry, Madonna, is not listed in this format)

webParse <- read_html("https://www.billboard.com/charts/greatest-hot-100-women-artists")
str(webParse)

## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

webNodes <- html_nodes(webParse,".chart-list-item  " ) 
webNodes

## {xml_nodeset (49)}
##  [1] <div class="chart-list-item  " data-rank="2" data-artist="" data-ti ...
##  [2] <div class="chart-list-item  " data-rank="3" data-artist="" data-ti ...
##  [3] <div class="chart-list-item  " data-rank="4" data-artist="" data-ti ...
##  [4] <div class="chart-list-item  " data-rank="5" data-artist="" data-ti ...
##  [5] <div class="chart-list-item  " data-rank="6" data-artist="" data-ti ...
##  [6] <div class="chart-list-item  " data-rank="7" data-artist="" data-ti ...
##  [7] <div class="chart-list-item  " data-rank="8" data-artist="" data-ti ...
##  [8] <div class="chart-list-item  " data-rank="9" data-artist="" data-ti ...
##  [9] <div class="chart-list-item  " data-rank="10" data-artist="" data-t ...
## [10] <div class="chart-list-item  " data-rank="11" data-artist="" data-t ...
## [11] <div class="chart-list-item  " data-rank="12" data-artist="" data-t ...
## [12] <div class="chart-list-item  " data-rank="13" data-artist="" data-t ...
## [13] <div class="chart-list-item  " data-rank="14" data-artist="" data-t ...
## [14] <div class="chart-list-item  " data-rank="15" data-artist="" data-t ...
## [15] <div class="chart-list-item  " data-rank="16" data-artist="" data-t ...
## [16] <div class="chart-list-item  " data-rank="17" data-artist="" data-t ...
## [17] <div class="chart-list-item  " data-rank="18" data-artist="" data-t ...
## [18] <div class="chart-list-item  " data-rank="19" data-artist="" data-t ...
## [19] <div class="chart-list-item  " data-rank="20" data-artist="" data-t ...
## [20] <div class="chart-list-item  " data-rank="21" data-artist="" data-t ...
## ...

We now need to extract the name of the artist, given in the data-title attribute:

webNodes2 <- html_attr(webNodes, "data-title")
webNodes2

##  [1] "Mariah Carey"               "Janet Jackson"             
##  [3] "Whitney Houston"            "Rihanna"                   
##  [5] "The Supremes"               "Olivia Newton-John"        
##  [7] "Aretha Franklin"            "Katy Perry"                
##  [9] "Taylor Swift"               "Diana Ross"                
## [11] "Donna Summer"               "Carpenters"                
## [13] "Connie Francis"             "Beyonce"                   
## [15] "Brenda Lee"                 "Barbra Streisand"          
## [17] "Cher"                       "P!nk"                      
## [19] "Gloria Estefan"             "Gladys Knight And The Pips"
## [21] "Lady Gaga"                  "TLC"                       
## [23] "Dionne Warwick"             "Heart"                     
## [25] "Paula Abdul"                "Alicia Keys"               
## [27] "Kelly Clarkson"             "Linda Ronstadt"            
## [29] "Destiny's Child"            "Celine Dion"               
## [31] "The Pointer Sisters"        "Captain & Tennille"        
## [33] "Britney Spears"             "Toni Braxton"              
## [35] "Jennifer Lopez"             "Christina Aguilera"        
## [37] "Adele"                      "Monica"                    
## [39] "Helen Reddy"                "Cyndi Lauper"              
## [41] "Roberta Flack"              "Taylor Dayne"              
## [43] "Nicki Minaj"                "Mary J. Blige"             
## [45] "Tina Turner"                "Debbie Gibson"             
## [47] "Sheena Easton"              "Kesha"                     
## [49] "Carly Simon"

As we noted earlier, Madonna’s entry was not listed in the same format as the others, so we will have to add her manually:

top50women <- data.frame(top = c("Madonna", webNodes2))
head(top50women, 5)

On Your Own

Option #1

The web site BikeRaceInfo has a table with data on past winners of the Tour de France. Create a cleaned-up data.frame of this data.
The web site NY Times Best Sellers: Hardcover Fiction contains a list of best-selling fiction books. Scrape the names and authors of these top books and store the results in a clean data.frame.

Option #2

The link https://www.auhouseprices.com/rent/list/NSW/2131/Ashfield/1/?sort=date&type=apartment&bmin=2&bmax=2 shows two-bedroom rental listings in Ashfield Australia. Scrap this webpage to create a clean data.frame containing three variables: each rental’s address, its price, and the month/year when the price was offered. Hint: this is best accomplished by writing a function and using a for loop.

N <- 3
for(i in 1:N){
  print(i)
}

## [1] 1
## [1] 2
## [1] 3