Intro to Text Manipulation in R

Many data sets contain character strings. For example, your data might include tweets from Twitter (which are essentially strings of characters), and you might want to search for occurrences of a certain word or Twitter handle. Alternatively, your data might have a location variable that includes city and state abbreviations, and you might want to extract the observations whose location contains “NY.”

In this tutorial, you will learn how to manipulate text data using the package stringr and how to match patterns using regular expressions. Some of the commands include:

Command	Description
`str_sub`	Extract substring from a given start to end position
`str_detect`	Detect presence/absence of a pattern in a string
`str_locate`	Give position (start, end) of first occurrence of substring
`str_locate_all`	Give positions of all occurrences of a substring
`str_replace`	Replace one substring with another

1. Extracting and locating substrings

We begin by introducing some basic functions in the stringr package.

The str_sub function extracts substrings from a string (a string being a sequence of characters, such as letters, digits, spaces, and punctuation) given the starting and ending positions. This example extracts the characters in the second through fourth positions of each string in fruits:

library(stringr)
fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")
str_sub(fruits, 2, 4)

## [1] "ppl" "ine" "ear" "ran" "eac" "ana"

The str_detect function checks to see if any instance of a pattern occurs in a string.

str_detect(fruits, "p")  #any occurrence of 'p'?

## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

Note that pattern matching is case-sensitive: “Pear” returns FALSE because it contains an uppercase “P” but no lowercase “p”.
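If case should not matter, one option is to wrap the pattern in stringr's regex() helper with ignore_case = TRUE. The following is a minimal sketch; the variable name ci_match is only illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# regex(..., ignore_case = TRUE) lets "p" match both "p" and "P",
# so "Pear" is now detected as well
ci_match <- str_detect(fruits, regex("p", ignore_case = TRUE))
ci_match

## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
```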

To locate the position of a pattern within a string, use str_locate:

str_locate(fruits, "an")

##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     3   4
## [5,]    NA  NA
## [6,]     2   3

Only the fourth and sixth fruits contain “an”. In the case of “banana,” note that only the first occurrence of “an” is returned.

To find all instances of “an” within each string:

str_locate_all(fruits,"an")

## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## [1,]     3   4
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## [1,]     2   3
## [2,]     4   5

Remark

The command str_locate_all returns a list, that is, an object whose elements are themselves objects (possibly of different types). In this example our output is a list with six elements, where each element is a matrix with columns start and end.

out <- str_locate_all(fruits, "an")
data.class(out)

## [1] "list"

data.class(out[[1]])

## [1] "matrix"

out[[6]] # 6th element of the list, corresponding to banana

##      start end
## [1,]     2   3
## [2,]     4   5

unlist(out) # coerces the list into a vector

## [1] 3 4 2 4 3 5

length(unlist(out))/2    #total number of times "an" occurs in vector fruits

## [1] 3
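As an alternative to unlisting the output of str_locate_all, stringr also provides str_count, which returns the number of matches in each string directly. A short sketch follows; the variable name counts is only illustrative.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# Number of (non-overlapping) occurrences of "an" in each string
counts <- str_count(fruits, "an")
counts

## [1] 0 0 0 1 0 2

# Total number of occurrences across the whole vector
sum(counts)

## [1] 3
```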

2. Regular expressions

Now suppose we want to detect or locate words that begin with “p” or end in “e,” or that match more complex criteria. A regular expression is a sequence of characters that defines a pattern.

Let’s detect strings that begin with either “p” or “P”. The metacharacter “^” is used to indicate the beginning of the string, and “[Pp]” is used to indicate “P” or “p”.

str_detect(fruits, "^[Pp]")

## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE

str_detect(fruits, "[Pp]") ## Notice the impact of the metacharacter "^"

## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Similarly, the metacharacter “$” is used to signify the end of a string.

str_detect(fruits, "e$" )   #end in 'e'

## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

The following characters are metacharacters with special meanings:

* \ + $ { } [ ] ^ ? . ( ) |

For instance, “.” matches any single character:

gr.y matches gray, grey, gr9y, grEy, etc.

and * indicates 0 or more instances of the preceding character:

xy*z matches xz, xyz, xyyz, xyyyz, xyyyyz, etc.
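These two claims can be checked with str_detect. The test strings below are made up for illustration, and the second pattern is anchored with ^ and $ so that it has to cover the whole string.

```r
library(stringr)

# "." must match exactly one character, so "gry" does not match
dot_match <- str_detect(c("gray", "grey", "gr9y", "grEy", "gry"), "gr.y")
dot_match

## [1]  TRUE  TRUE  TRUE  TRUE FALSE

# "y*" allows zero or more "y", but no other character between "x" and "z"
star_match <- str_detect(c("xz", "xyz", "xyyyz", "xyaz"), "^xy*z$")
star_match

## [1]  TRUE  TRUE  TRUE FALSE
```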

To detect the letter “a” followed by 0 or more occurrences of “p”, type:

str_detect(fruits, "ap*")

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Compare this to

str_detect(fruits, "ap+")

## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

The “+” after the “p” indicates that we want 1 or more occurrences of “p.”

Here is a more complex pattern:

str_detect(fruits, "^a(.*)e$")

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

The anchors ^ and $ are used to indicate we want strings that begin with the letter a and end with e. The (.*) indicates that we want to match 0 or more occurrences of any character. Parentheses can be used to group parts of the pattern for readability; here you’ll get the same results without them.
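A quick check confirms that the parentheses are optional in this pattern. The second half of this sketch is an extra illustration, not part of the example above: parentheses do change the result when a quantifier is meant to apply to a whole group rather than to a single character.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# Same pattern with and without the grouping parentheses
with_parens    <- str_detect(fruits, "^a(.*)e$")
without_parens <- str_detect(fruits, "^a.*e$")
identical(with_parens, without_parens)

## [1] TRUE

# Here the group matters: (an)+ repeats "an", while n+ repeats only "n"
group_match  <- str_detect("banana", "^b(an)+a$")   # TRUE
single_match <- str_detect("banana", "^ban+a$")     # FALSE
```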

3. Examples

Suppose we want to extract 10-digit United States phone numbers from a data set of text strings.

a1 <- "Home: 507-645-5489"
a2 <- "Cell: 219.917.9871"
a3 <- "My work phone is 507-202-2332"
a4 <- "I don't have a phone"
info <- c(a1, a2, a3, a4)
info

## [1] "Home: 507-645-5489"            "Cell: 219.917.9871"           
## [3] "My work phone is 507-202-2332" "I don't have a phone"

To derive a regular expression for valid 10-digit phone numbers we must recognize a few patterns:

A United States area code must start with a digit of 2 or higher, so we use brackets again to indicate a range: [2-9].
The next two digits in the area code can be any digit between 0 and 9, so we write [0-9]{2}.
The area code is separated from the other digits using either a “.” or “-”, so we use [-.] to indicate either a dash or a period.
The remaining seven digits form a group of three and a group of four, separated in the same way, so we write [0-9]{3}, then [-.], then [0-9]{4}.
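Before combining these pieces, it can help to test each one on its own. The sketch below uses made-up test strings and illustrative variable names (area_code, separator), and anchors each piece with ^ and $ so that it must match the entire string.

```r
library(stringr)

area_code <- "[2-9][0-9]{2}"   # first digit 2-9, then any two digits
separator <- "[-.]"            # a dash or a period

# "107" starts with 1 and "50" has only two digits, so both fail
area_ok <- str_detect(c("507", "107", "50"), paste0("^", area_code, "$"))
area_ok

## [1]  TRUE FALSE FALSE

# "/" is not an accepted separator
sep_ok <- str_detect(c("-", ".", "/"), paste0("^", separator, "$"))
sep_ok

## [1]  TRUE  TRUE FALSE
```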

The complete regular expression is given below:

phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"
out <- str_detect(info, phone)
out

## [1]  TRUE  TRUE  TRUE FALSE

Again, str_detect just indicates the presence or absence of the pattern in question. To pull out the matching text itself, use str_extract:

str_extract(info, phone)

## [1] "507-645-5489" "219.917.9871" "507-202-2332" NA
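The parentheses in the phone pattern also define capture groups, which stringr's str_match returns as separate columns after the full match. As a sketch, the following pulls out just the area codes; the names parts and area_codes are illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489", "Cell: 219.917.9871",
          "My work phone is 507-202-2332", "I don't have a phone")
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"

# Column 1 holds the full match; columns 2-4 hold the three groups
parts <- str_match(info, phone)
area_codes <- parts[, 2]
area_codes

## [1] "507" "219" "507" NA
```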

We can also use stringr to make manipulations such as anonymizing the phone numbers:

str_replace(info, phone, "XXX-XXX-XXXX")

## [1] "Home: XXX-XXX-XXXX"            "Cell: XXX-XXX-XXXX"           
## [3] "My work phone is XXX-XXX-XXXX" "I don't have a phone"
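The replacement string can also refer back to the capture groups with \\1, \\2, and so on. As an illustrative variation (not part of the example above), this sketch masks the first six digits but keeps the last four; the name masked is hypothetical.

```r
library(stringr)

info <- c("Home: 507-645-5489", "Cell: 219.917.9871",
          "My work phone is 507-202-2332", "I don't have a phone")
phone <- "([2-9][0-9]{2})[-.]([0-9]{3})[-.]([0-9]{4})"

# "\\3" inserts whatever the third group (the last four digits) matched
masked <- str_replace(info, phone, "XXX-XXX-\\3")
masked

## [1] "Home: XXX-XXX-5489"            "Cell: XXX-XXX-9871"           
## [3] "My work phone is XXX-XXX-2332" "I don't have a phone"
```

Note that str_replace changes only the first match in each string; str_replace_all changes every match.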

Remarks

As we noted above, certain characters are reserved. To match one of them literally in a regular expression, either put it within brackets or precede it with a double backslash.

str_locate(info, "[.]")  #find first instance of period

##      start end
## [1,]    NA  NA
## [2,]    10  10
## [3,]    NA  NA
## [4,]    NA  NA

str_locate(info, "\\.")  #same

##      start end
## [1,]    NA  NA
## [2,]    10  10
## [3,]    NA  NA
## [4,]    NA  NA

str_locate(info, ".")    #first instance of any character

##      start end
## [1,]     1   1
## [2,]     1   1
## [3,]     1   1
## [4,]     1   1
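A third option, if the whole pattern should be taken literally, is stringr's fixed() helper, which turns off regular expression matching altogether. A minimal sketch; the name loc is illustrative.

```r
library(stringr)

info <- c("Home: 507-645-5489", "Cell: 219.917.9871",
          "My work phone is 507-202-2332", "I don't have a phone")

# fixed(".") matches a literal period, with no escaping needed
loc <- str_locate(info, fixed("."))
loc

##      start end
## [1,]    NA  NA
## [2,]    10  10
## [3,]    NA  NA
## [4,]    NA  NA
```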

Metacharacters have different meanings within brackets. For example, “^” at the start of a bracket expression means “any character except those listed.”

str_detect(fruits, "^[Pp]")  #starts with 'P' or 'p'

## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE

str_detect(fruits, "[^Pp]")  #any character except 'P' or 'p'

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

str_detect(fruits, "^[^Pp]") #start with any character except 'P' or 'p'

## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE

See the stringr cheatsheet for a summary of regular expressions.

4. Matching brackets or HTML tags

In many cases, you may want to match brackets such as [8] or HTML tags such as <table>.

out <- c("abc[8]", "abc[9][20]", "abc[9]def[10][7]", "abc[]")
out

## [1] "abc[8]"           "abc[9][20]"       "abc[9]def[10][7]"
## [4] "abc[]"

In order to better understand what regular expressions are matching here, we will replace pieces of the above strings with the character “X”.

To match the left bracket, we write \\[. Next we want to match 0 or more occurrences of any character except the right bracket, so we need [^]]*. Finally, to match the right bracket, we write \\].

str_replace_all(out, "\\[([^]]*)\\]", "X")

## [1] "abcX"      "abcXX"     "abcXdefXX" "abcX"

Compare this to

str_replace_all(out, "\\[(.*)\\]", "X")

## [1] "abcX" "abcX" "abcX" "abcX"

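In this case, we match the first left bracket (indicated by the \\[), followed by 0 or more instances of any character (the (.*) portion), which can include right brackets, up to the final right bracket \\]. Because * is greedy, everything from the first left bracket to the last right bracket is replaced by a single “X”.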
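Another way to stop at the first right bracket, not used in the examples above, is the lazy quantifier *?, which matches as few characters as possible. The same “anything except the closing delimiter” idea carries over to HTML tags. Both parts of this sketch use illustrative names (lazy, html, stripped) and a made-up HTML string.

```r
library(stringr)

out <- c("abc[8]", "abc[9][20]", "abc[9]def[10][7]", "abc[]")

# .*? is lazy: it stops at the first "]" instead of the last one
lazy <- str_replace_all(out, "\\[.*?\\]", "X")
lazy

## [1] "abcX"      "abcXX"     "abcXdefXX" "abcX"

# HTML tags: "<", then any characters other than ">", then ">"
html <- "<table><tr><td>cell</td></tr></table>"
stripped <- str_replace_all(html, "<[^>]*>", "")
stripped

## [1] "cell"
```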

Practice

Question #1:

veggies <- c("carrot", "bean", "peas", "cabbage", "scallion", "asparagus")

Use stringr functions on the vector veggies defined above to do the following:

Find those strings that contain the pattern “ea”.
Find those strings that end in “s”.
Find those strings that contain at least two “a”’s.
Find those strings that begin with any letter except “c”.
Find the starting and ending position of the pattern “ca” in each string.

Question #2:

The regular expression "^[Ss](.*)(t+)(.+)(t+)" matches “scuttlebutt”, “Stetson”, and “Scattter”, but not “Scatter.” Why?

Extra Credit:

One real-world application of string matching is detecting whether or not an email address is valid. Examples of some valid email addresses are shown below:

“simple@example.com”
“johnsmith@email.gov”
“marie.curie@college.edu”
“very_common@example.com”
“a.little.lengthy.but.ok@dept.example.com”

For this question, write code that takes the vector given below and identifies the first 5 email addresses as valid and the last 3 as invalid.

You may assume that a valid email address has a top-level domain (e.g., .com, .gov, .fr, …) that is either 2 or 3 letters. You may also assume that letters, numbers, ., -, and _ are the only valid characters in the core components of the address (see the 6th email below for an invalid email).

emails <- c(
"simple@example.com",
"johnsmith@email.gov",
"marie.curie@college.edu",
"very_common@example.com",
"a.little.lengthy.but.ok@dept.example.com",
"bad.email.because+symbol@example.com",
"not_good@email.address",
"this.email.is.fake@gmail.xcom")

## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE