stringr
This lab focuses on manipulating character strings using the stringr package within the tidyverse suite.
# Please install and load the following packages
# install.packages("stringr")
library(stringr)
library(ggplot2)
Directions (Please read before starting)
A “string” is a single element of a character variable (vector), or a stand alone collection of characters enclosed by quotes.
x <- "Single Element String"
y <- c("apple", "banana", "pear") ## strings within a character vector
There are many similarities between strings and other types of data. For example, the individual characters in a string have their own positions, and strings have their own length:
str_length(x)
str_length(y)
length(y)
## [1] 21
## [1] 5 6 4
## [1] 3
Positional indices are important to many string processing functions, such as str_sub():
str_sub(x, start = 1, end = 4) ## str_sub() will subset a string
str_sub(y, start = 1, end = 4)
## [1] "Sing"
## [1] "appl" "bana" "pear"
Positions can also be referenced from the end of a string using negative numbers:
str_sub(x, start = -4, end = -1)
str_sub(y, start = -2, end = -1)
## [1] "ring"
## [1] "le" "na" "ar"
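Positive and negative positions can also be mixed in a single call. The sketch below is only an illustration: the vector `files` and its contents are hypothetical, chosen so that every element ends in a four-character extension.

```r
library(stringr)

# Hypothetical file names, used only for illustration
files <- c("report.csv", "summary.csv")

# Keep everything from the first character up to the 5th-from-last character
base_names <- str_sub(files, start = 1, end = -5)

# Keep only the last three characters
extensions <- str_sub(files, start = -3, end = -1)

base_names
extensions
```

Counting from the end is convenient here because the strings have different lengths but share the same suffix length.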
A single character string within a vector forces every other element to be stored as a character string, because a vector can hold only one type of data:
x1 <- c(1, 2, 3)
typeof(x1)
x2 <- c(1, 2, 3, "A")
typeof(x2)
## [1] "double"
## [1] "character"
y1 <- c(TRUE, FALSE)
typeof(y1)
y2 <- c(TRUE, FALSE, "A")
typeof(y2)
## [1] "logical"
## [1] "character"
This can be a problem if your raw data used character strings to represent missing data or footnotes:
my_data <- read.csv("https://remiller1450.github.io/data/char_dom.csv")
head(my_data)
## ID messy_x messy_y
## 1 1 100 50
## 2 2 90 40
## 3 3 85 55
## 4 4 90 45
## 5 5 110 55*
## 6 6 Missing 60
str(my_data)
## 'data.frame': 10 obs. of 3 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10
## $ messy_x: chr "100" "90" "85" "90" ...
## $ messy_y: chr "50" "40" "55" "45" ...
Fortunately, “messy_x” can be fixed using as.numeric(), and “messy_y” can be fixed using the parse_number() function in the readr package:
## Coerce x to numeric
as.numeric(my_data$messy_x)
## [1] 100 90 85 90 110 NA 115 NA 105 100
## Extract only the numeric values in y
library(readr)
parse_number(my_data$messy_y)
## [1] 50 40 55 45 55 60 40 35 40 50
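The difference between the two approaches can be seen side by side on a small vector. This is a minimal sketch; the vector `vals` is hypothetical and simply mimics the footnote marker seen in “messy_y”.

```r
library(readr)

# Hypothetical values: one entry carries a footnote marker
vals <- c("55*", "60", "45")

# as.numeric() cannot interpret "55*" and returns NA for it (with a warning,
# which is suppressed here to keep the output clean)
direct <- suppressWarnings(as.numeric(vals))

# parse_number() drops the non-numeric characters and keeps the number
parsed <- parse_number(vals)

direct
parsed
```

In other words, as.numeric() suits columns where the stray strings should become missing values, while parse_number() suits columns where a number is embedded in extra characters.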
At this point you should begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
The stringr package contains dozens of string processing tools. We will focus our attention on the following functions:
Function | Description |
---|---|
str_sub() | Extract substring from a given start to end position |
str_detect() | Detect presence/absence of a substring |
str_locate() | Give position (start, end) of first occurrence of substring |
str_locate_all() | Give positions of all occurrences of a substring |
str_replace() | Replace the first instance of a substring with another |
str_replace_all() | Replace all instances of a substring with another |
For illustration purposes, we will use the vector created below:
fruits <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")
String Detect: str_detect() returns TRUE or FALSE for each element in a character vector depending upon whether it contains the target pattern:
str_detect(fruits, "ap") ## returns TRUE if "ap" is found
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
Because strings are case-sensitive, "ap" is not found in "Apple", but is found in "Pineapple".
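If case should not matter, one option is to wrap the pattern in stringr's regex() helper with ignore_case = TRUE. The sketch below reuses the `fruits` vector; the name `hits` is just an illustrative choice.

```r
library(stringr)

fruits <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")

# Treat "ap", "Ap", "aP", and "AP" as equivalent when searching
hits <- str_detect(fruits, regex("ap", ignore_case = TRUE))

hits
fruits[hits]
```

With case ignored, both "Apple" and "Pineapple" are reported as matches.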
String Locate: str_locate() returns the start and end positions of the first instance of the target pattern:
str_locate(fruits, "an")
## start end
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] 3 4
## [5,] NA NA
## [6,] 2 3
Notice the sixth fruit, "Banana", has two instances of "an", but only the first is considered.
str_locate_all() can be used to find every instance of a target pattern, but it can be more cumbersome to work with because its output is a list object (rather than the matrix returned by str_locate()):
str_locate_all(fruits, "an")
## [[1]]
## start end
##
## [[2]]
## start end
##
## [[3]]
## start end
##
## [[4]]
## start end
## [1,] 3 4
##
## [[5]]
## start end
##
## [[6]]
## start end
## [1,] 2 3
## [2,] 4 5
A common goal might be to count the instances of a pattern, in which case the unlist() function can be used to coerce the list into a vector.
out <- str_locate_all(fruits, "an")
v <- unlist(out) ## coerce the list into a vector
length(v)/2 ## total number of times "an" occurs in "fruits"
## [1] 3
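If a separate count for each string is wanted rather than one overall total, the list can be summarized element by element. This is a sketch of one possible approach, relying on the fact that each list element is a matrix with one row per match; the name `per_fruit` is illustrative.

```r
library(stringr)

fruits <- c("Apple", "Pineapple", "Pear", "Orange", "Peach", "Banana")
out <- str_locate_all(fruits, "an")

# Each list element is a matrix with one row per match,
# so the number of rows is the number of matches in that string
per_fruit <- sapply(out, nrow)

per_fruit        # matches in each fruit
sum(per_fruit)   # same total as length(unlist(out)) / 2
```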
String Replace: str_replace() will replace the target pattern with another expression:
str_replace(fruits, "an", "XX")
## [1] "Apple" "Pineapple" "Pear" "OrXXge" "Peach" "BXXana"
Similar to str_locate(), it also works only on the first instance of the pattern, but str_replace_all() can be used if every instance should be replaced:
str_replace_all(fruits, "an", "XX")
## [1] "Apple" "Pineapple" "Pear" "OrXXge" "Peach" "BXXXXa"
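str_replace_all() also accepts a named character vector, where each name is a pattern and each value is its replacement, so several substitutions can be made in one call. The sentence and replacement words below are hypothetical and serve only to illustrate the idea.

```r
library(stringr)

# Hypothetical sentence, used only for illustration
sentence <- "cat and dog"

# Names are the patterns to find; values are their replacements
swapped <- str_replace_all(sentence, c("cat" = "bird", "dog" = "fish"))

swapped
```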
Question #1: Using appropriate stringr functions, identify which elements in the vector “fruits” contain a lower case “p”. Then, use the which() function and indices to print these fruits.
Question #2: Using appropriate stringr functions, count the total number of times a lower case “p” occurs in the vector “fruits”.
With strings it can be beneficial to pre-process your data to simplify later operations. For example, you might convert all of your strings to lower case via the str_to_lower() function:
str_to_lower(fruits)
## [1] "apple" "pineapple" "pear" "orange" "peach" "banana"
Similar functions exist for converting to upper, title, or sentence case:
str_to_upper(fruits)
## [1] "APPLE" "PINEAPPLE" "PEAR" "ORANGE" "PEACH" "BANANA"
str_to_title(fruits)
## [1] "Apple" "Pineapple" "Pear" "Orange" "Peach" "Banana"
str_to_sentence("aPPles AND Bananas are THE Most popular FRUits")
## [1] "Apples and bananas are the most popular fruits"
Another common problem in string processing is the presence of white space, such as excessive spacing between characters, or at the beginning and end of strings.
The str_squish() function will eliminate all leading and trailing white space and reduce any repeated spaces inside the body of the string to a single instance:
str_squish("  String    A  ")
## [1] "String A"
If you wish to remove leading/trailing spaces without modifying the body of the string you can do so using the str_trim() function:
str_trim("  String    A  ")
## [1] "String    A"
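str_trim() also takes a side argument for cases where only one end of the string should be cleaned. A brief sketch, using a hypothetical padded string:

```r
library(stringr)

# Hypothetical string with two leading and two trailing spaces
padded <- "  String   A  "

left_trimmed  <- str_trim(padded, side = "left")   # trailing spaces kept
right_trimmed <- str_trim(padded, side = "right")  # leading spaces kept

# str_length() makes the otherwise invisible difference easy to check
str_length(padded)
str_length(left_trimmed)
str_length(right_trimmed)
```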
Question #3: Beginning with the character string given below, use the functions str_squish(), str_to_title(), and str_replace_all() to produce the string "United_States".
q3 <- c(" UNITED STATES ")
Regular expressions, or “regex”, are special sequences of characters used to identify patterns in strings. At its simplest, a regular expression might look exactly like the pattern you’re trying to find. For example "ea" will match the "ea" in "pear".
From here, we can increase the flexibility of the expression using ., sometimes called a wildcard, which can match any character.
To understand the wildcard, consider searching for ".a." in y = c("apple", "banana", "pear"). The patterns matching ".a." are "ban" and "ana" in "banana", and "ear" in "pear". There is no match in "apple" because no character precedes its “a”.
Notice how each match has the character “a” surrounded by exactly one
character on both sides, but the exact characters surrounding the “a”
can be different.
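The wildcard matches can be listed directly with str_extract_all(), which returns every non-overlapping match in each string. A short sketch (the name `matches` is illustrative):

```r
library(stringr)

y <- c("apple", "banana", "pear")

# One list element per string, holding every match of ".a."
matches <- str_extract_all(y, ".a.")

matches
```

The first list element is empty because the only “a” in "apple" is the first character, so there is nothing for the leading . to match.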
By default, a regular expression can match any portion of a string, but it is sometimes useful to anchor an expression so that it matches only at the start or end of a string.
^ anchors matching to the start of a string
$ anchors matching to the end of a string
fruits[str_detect(fruits, "e$")] ## Fruits ending in "e"
## [1] "Apple" "Pineapple" "Orange"
fruits[str_detect(fruits, "^P")] ## Fruits starting with "P"
## [1] "Pineapple" "Pear" "Peach"
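Using both anchors together restricts a match to the entire string. The sketch below contrasts an unanchored and a fully anchored pattern; the vector `words` is hypothetical.

```r
library(stringr)

# Hypothetical words, used only for illustration
words <- c("pear", "pears", "spear")

# Unanchored: "pear" may appear anywhere in the string
anywhere <- str_detect(words, "pear")

# Anchored at both ends: the whole string must be exactly "pear"
exact <- str_detect(words, "^pear$")

anywhere
exact
```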
Question #4: Use the wildcard character, ., and anchoring to find all strings that start with “P”, followed by any character, followed by “a”.
The meta-characters * and + can be used to match patterns with a flexible number of repeated characters.
* is used to match 0 or more instances of the preceding character
+ is used to match 1 or more instances of the preceding character
fruits[str_detect(fruits, "ap*")] ## "a" followed by 0 or more instances of "p"
fruits[str_detect(fruits, "ap+")] ## "a" followed by 1 or more occurrences of "p"
## [1] "Pineapple" "Pear" "Orange" "Peach" "Banana"
## [1] "Pineapple"
In these examples, the usefulness of * isn’t obvious, but consider the following examples:
strings <- c("good", "goood", "goooood", "gooooood!")
str_detect(strings, "goo*d") ## Handles varying numbers of o's
## [1] TRUE TRUE TRUE TRUE
fruits[str_detect(fruits, "^A.*e$")] ## Fruits starting with "A", followed by 0 or more of any char, ending in "e"
## [1] "Apple"
Question #5: Suppose you want to detect variations of “good” with excessive o’s, but you do not want to detect “god”. Why won’t the expression "goo*d" work for this aim? How can you modify it to address the problem?
In addition to ., *, +, ^, and $, other meta-characters include:
[] - indicates the literal interpretation of a character, or a limited set of exchangeable characters.
  [Pp] will match either “P” or “p”
{} - indicates a fixed number of repetitions of the preceding character(s).
  [0-9]{2} will match any two-digit number (e.g., 01, 15, 78)
\\ - an escape character used to match something that is a meta-character.
  \\. will match the character . (a period appearing in a string)
  \\\\ is needed to match the character "\" (because \ is itself a special character)
In addition to these meta-characters, there are some special pattern shortcuts that are worth knowing:
\\d will match any digit
\\s will match any white space (i.e., a space, a tab, or a newline)
[^abd] will match any character other than “a”, “b” or “d”. When used inside square brackets, ^ operates differently (recall that outside brackets it is used for anchoring).
() can be used to group parts of a pattern for organization; grouping by itself does not change which characters are matched.
To illustrate these meta-characters, consider the task of extracting 10-digit phone numbers from text string data:
phone_strings <- c("Home: 507-645-5489",
"Cell: 219.917.9871",
"My work phone is 507-202-2332",
"I don't have a phone")
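The pattern can be built up in steps:
The area code is a digit from 2 to 9 followed by two more digits, which gives [2-9]\\d{2}
Next comes a separator, either . or -. So, we now have [2-9]\\d{2}[-.]
Then come 3 more digits, another . or -, then 4 more digits, making the full expression [2-9]\\d{2}[-.]\\d{3}[-.]\\d{4}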
phone_pattern = "[2-9]\\d{2}[-.]\\d{3}[-.]\\d{4}"
str_detect(phone_strings, phone_pattern) ## Identify strings with matches
## [1] TRUE TRUE TRUE FALSE
str_extract(phone_strings, phone_pattern) ## Extract the matches
## [1] "507-645-5489" "219.917.9871" "507-202-2332" NA
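Like str_locate() and str_replace(), str_extract() returns only the first match in each string. When a single string may contain several phone numbers, str_extract_all() can be used instead. The sketch below assumes a hypothetical string, `contact`, that holds two numbers.

```r
library(stringr)

phone_pattern <- "[2-9]\\d{2}[-.]\\d{3}[-.]\\d{4}"

# Hypothetical string containing two phone numbers
contact <- "Home: 507-645-5489, Cell: 219.917.9871"

first_only <- str_extract(contact, phone_pattern)          # first match only
all_found  <- str_extract_all(contact, phone_pattern)[[1]]  # every match

first_only
all_found
```

As with str_locate_all(), the result of str_extract_all() is a list with one element per input string, which is why `[[1]]` is used to pull out the matches for the single string here.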
It’s also possible to use stringr functions to help make sensitive information anonymous:
str_replace(phone_strings, phone_pattern, "XXX-XXX-XXXX")
## [1] "Home: XXX-XXX-XXXX" "Cell: XXX-XXX-XXXX"
## [3] "My work phone is XXX-XXX-XXXX" "I don't have a phone"
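The escape character described earlier is easy to get wrong, so it can help to check it on a small example. The sketch below uses a hypothetical vector, `paths`, to compare an unescaped period, an escaped period, and an escaped backslash.

```r
library(stringr)

# Hypothetical strings; "C:\\data" is the R way of writing C:\data
paths <- c("file.txt", "filetxt", "C:\\data")

any_char  <- str_detect(paths, ".")      # unescaped: matches any character
period    <- str_detect(paths, "\\.")    # escaped: matches a literal period
backslash <- str_detect(paths, "\\\\")   # matches a literal backslash

any_char
period
backslash
```

Only the escaped versions distinguish between the strings; the unescaped . matches every non-empty string.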
The data below come from a real driving simulator experiment. The “disposition” file records the experimental participants, their assigned conditions, and the driving simulator output files that record a time-series of driver/vehicle inputs in a particular simulated drive. Output from the simulator is stored in DAQ files.
disp = read.csv("https://remiller1450.github.io/data/disposition.csv")
head(disp)
## Analyze Reduced Ignore Discard DaqPath
## 1 X NA Control\\C_001_POST
## 2 X NA Control\\C_002__PRE
## 3 X NA Control\\C_003__PRE
## 4 X NA Occasional\\O_001__PRE
## 5 X NA Control\\C_004_POST
## 6 X NA Frequent\\F_001__PRE
## DaqName Date
## 1 1_RuralRedLight_20180905113244.daq 9/5/2018 23:01
## 2 3_RuralRedLight_20180912092144.daq 9/12/2018 23:11
## 3 2_RuralRedLight_20180914094223.daq 9/14/2018 23:12
## 4 1_RuralRedLight_20180917093224.daq 9/17/2018 23:10
## 5 3_RuralRedLight_20180928135832.daq 9/28/2018 23:04
## 6 4_RuralRedLight_20181003103829.daq 10/3/2018 23:11
Question #7: Exclude any drives that do not have an “X” in the “Reduced” column. Then, using stringr functions as appropriate, process the information recorded in “disposition.csv” into a data frame with the columns Group, SubjectID, Treatment, Scenario, and DriveNumber, as shown in the target output printed below.
Hints:
\\\\ is needed to match a single \
The expression _[^\\d]*$ may be useful. Think about what it will match.
Use data.frame to assemble the processed columns into a data frame.
Printed below are the first 10 rows of the target data.frame in the requested format.
## Group SubjectID Treatment Scenario DriveNumber
## 1 Control C_001 POST RuralRedLight 1
## 2 Control C_002 PRE RuralRedLight 3
## 3 Control C_003 PRE RuralRedLight 2
## 4 Occasional O_001 PRE RuralRedLight 1
## 5 Control C_004 POST RuralRedLight 3
## 6 Frequent F_001 PRE RuralRedLight 4
## 7 Frequent F_002 PRE RuralRedLight 4
## 8 Control C_005 POST RuralRedLight 3
## 9 Frequent F_003 PRE RuralRedLight 4
## 10 Occasional O_002 PRE RuralRedLight 1