Quick Guide to Regex in R

2016-04-02

The purpose of this guide is to bridge the gap between understanding what regular expressions are and knowing how to use them in R. If you’re brand new to regular expressions, I highly recommend checking out RegexOne.

Hadley Wickham’s stringr package makes using regular expressions in R a breeze. I use it to avoid the complexity of base R’s regex functions (grep, grepl, regexpr, gregexpr, sub, and gsub), where even the function names are cryptic.
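For comparison, here is a rough sketch of how a few of the tasks in this guide might look with the base R functions. The sample string `x` and the choice of `perl = TRUE` are my own assumptions for illustration; everything else in this guide sticks to stringr.

```r
# Base R counterparts to some stringr calls (illustrative sketch only)
x <- "We paid $30 on 1/1/2015 where they   have many dogs."  # hypothetical sample string

# roughly str_detect(): grepl() returns TRUE/FALSE
has_the <- grepl("\\bthe\\b", x, perl = TRUE)

# roughly str_extract(): regexpr() finds the first match, regmatches() pulls it out
first_number <- regmatches(x, regexpr("\\d+", x, perl = TRUE))

# roughly str_extract_all(): gregexpr() finds every match
all_numbers <- regmatches(x, gregexpr("\\b\\d+\\b", x, perl = TRUE))[[1]]

# roughly str_replace_all(): gsub() replaces every match (sub() replaces only the first)
single_spaced <- gsub("\\s{2,}", " ", x, perl = TRUE)
```

Note how extraction needs two nested calls in base R, which is the kind of friction stringr removes.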

Setup

library(stringr)

sentence <- "We bought our  Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they   have many dogs."

Does the string contain a pattern?

# Does the sentence contain the word “the”?

# disregard adjacent characters
str_detect(sentence, "the")
## [1] TRUE

# consider word boundaries on both sides of the word "the"
str_detect(sentence, "\\bthe\\b")
## [1] FALSE
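
The same check works on a whole character vector at once, and matching is case-sensitive unless you say otherwise. A minimal sketch, assuming a made-up vector `phrases` that is not part of the sentence above:

```r
library(stringr)

# hypothetical inputs for illustration
phrases <- c("The dog", "they ran", "at the end")

# case-sensitive: only a standalone lowercase "the" matches
strict <- str_detect(phrases, "\\bthe\\b")
strict
## [1] FALSE FALSE  TRUE

# wrapping the pattern in regex(..., ignore_case = TRUE) also accepts "The"
relaxed <- str_detect(phrases, regex("\\bthe\\b", ignore_case = TRUE))
relaxed
## [1]  TRUE FALSE  TRUE
```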

Extracting patterns

# What’s the first number that appears in the sentence?

# find the first digit
str_extract(sentence, "\\d")
## [1] "3"

# find the first sequence of digits
str_extract(sentence, "\\d+")
## [1] "30"

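# find the first run of digits together with the single character before it,
# where the digits must end at a word boundary (the lookahead (?=\\b));
# note that \\b cannot act as a word boundary inside [ ], so [^\\b] simply
# matches one character (here the "$")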
str_extract(sentence, "[^\\b]\\d+(?=\\b)") 
## [1] "$30"
 
# find all sequences of digits bounded by word boundaries
str_extract_all(sentence, "\\b\\d+\\b")
## [[1]]
## [1] "30"   "1"    "1"    "2015" "1017"
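
When you need the pieces of a match rather than the whole thing, capture groups with `str_match()` are one option. A minimal sketch; reading the three groups as month, day, and year is my assumption, since 1/1/2015 is ambiguous:

```r
library(stringr)

sentence <- "We bought our  Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they   have many dogs."

# each pair of parentheses is a capture group;
# str_match() returns a matrix: column 1 is the full match, then one column per group
date_parts <- str_match(sentence, "(\\d+)/(\\d+)/(\\d+)")

full_date <- date_parts[1, 1]  # "1/1/2015"
year      <- date_parts[1, 4]  # "2015" (third group, assumed to be the year)
```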

Counting matching patterns

# How many times does the word “dog” appear in the sentence?

# count occurrences of the word "dog"
str_count(sentence, "dog")
## [1] 1

# count occurrences of the word "dog" and require word boundaries 
# on both sides of the word
str_count(sentence, "\\bdog\\b")
## [1] 0
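
If the goal is to count the word in either singular or plural form, an optional "s" keeps the word boundaries while still matching "dogs". A small sketch along those lines, with a digit count thrown in as a second example:

```r
library(stringr)

sentence <- "We bought our  Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they   have many dogs."

# "s?" makes the trailing s optional, so both "dog" and "dogs" count as whole words
n_dog_words <- str_count(sentence, "\\bdogs?\\b")
n_dog_words
## [1] 1

# count individual digits anywhere in the sentence
n_digits <- str_count(sentence, "\\d")
n_digits
## [1] 12
```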

Replacing matching patterns

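# Replace the 2nd digit with a 9
# (the match also consumes any non-digits sitting between the first two digits,
# so this only works as intended when those digits are adjacent, as in "30")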
str_replace(sentence, "(?<=\\d)[^\\d]*(\\d)", "9")
## [1] "We bought our  Golden Retriever, Snuggles, for $39 on 1/1/2015 at 1017 Main St. where they   have many dogs."
 
# Replace every 0 or 1 with a 6
str_replace_all(sentence, "(0|1)", "6")
## [1] "We bought our  Golden Retriever, Snuggles, for $36 on 6/6/2665 at 6667 Main St. where they   have many dogs."
 
# Replace all instances of multiple spaces with a single space
str_replace_all(sentence, "\\s{2,}", " ")
## [1] "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St. where they have many dogs."
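
Two further replacement techniques may be worth knowing; both are sketches on a made-up string `x`, not part of the examples above. The first replaces a match by position using `str_locate_all()` and `str_sub<-`, which is one way to target "the nth match" even when other characters sit between matches. The second reuses capture groups in the replacement via backreferences (`\\1`, `\\2`, ...).

```r
library(stringr)

x <- "Paid on 1/1/2015"  # hypothetical sample string

# 1. Replace the second digit by position
# str_locate_all() returns a list with one start/end matrix per input string
digit_pos <- str_locate_all(x, "\\d")[[1]]
second <- digit_pos[2, "start"]          # position of the second digit

second_replaced <- x
str_sub(second_replaced, second, second) <- "9"
second_replaced
## [1] "Paid on 1/9/2015"

# 2. Reorder captured groups with backreferences
# \\1, \\2, \\3 refer to the first, second, and third parenthesised groups
reordered <- str_replace(x, "(\\d+)/(\\d+)/(\\d+)", "\\3-\\1-\\2")
reordered
## [1] "Paid on 2015-1-1"
```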