library(tidyverse)
Practice: Regular Expressions (part 1)
Stat 133
- Getting familiar with regex functions from "stringr"
- Use regex operations to clean/process “messy” data
- Focus on detection and extraction of string patterns
1 Regular Expressions with "stringr"
2 Practice Regex patterns
Consider the following character vector with the names of some animals:
animals = c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')
Let’s match the pattern "dog" with str_match():
str_match(animals, pattern = 'dog')
[,1]
[1,] "dog"
[2,] NA
[3,] NA
[4,] NA
[5,] NA
[6,] NA
[7,] NA
[8,] NA
[9,] NA
[10,] NA
[11,] NA
[12,] NA
[13,] NA
[14,] NA
As you can tell, the output of str_match() is a character matrix with as many rows as elements in the input vector. If there is a match, the matched string is displayed. If there is no match, you get a missing value NA.
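To make the matrix structure more concrete, here is a minimal sketch. The two-element vector pets and the capture groups in '(d)(o)g' are illustrative assumptions, not part of the exercise; each parenthesized group adds one column to the result.

```r
library(stringr)

# Hypothetical two-element input, used only for illustration
pets <- c('dog', 'cat')

# Column 1 holds the full match; columns 2 and 3 hold the capture groups
m <- str_match(pets, pattern = '(d)(o)g')

dim(m)   # one row per input element, one column per match/group
m[1, ]   # the pieces matched in 'dog'
m[2, ]   # all NA, because 'cat' does not match
```

Without capture groups, as in the call above, the matrix has a single column.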
You can use str_detect() to check if there is a match:
str_detect(animals, 'dog')
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE
To extract the matched pattern, you can use str_extract():
str_extract(animals, pattern = 'dog')
[1] "dog" NA NA NA NA NA NA NA NA NA NA NA
[13] NA NA
Sometimes you may want to extract the elements of the input vector associated with a match. For instance, say we are interested in those animals that contain the letter "a". We can use the output of str_detect() to do logical subsetting:
animals[str_detect(animals, 'a')]
[1] "cat" "zebra" "whale" "eagle" "kangaroo" "koala"
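As a side note, stringr also provides str_subset(), which returns the matching elements directly. The sketch below compares it with logical subsetting; the variable names with_a and with_a_subset are illustrative.

```r
library(stringr)

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

# Logical subsetting with the output of str_detect()
with_a <- animals[str_detect(animals, 'a')]

# str_subset() keeps the matching elements in one step
with_a_subset <- str_subset(animals, 'a')

# Both approaches select the same animals
identical(with_a, with_a_subset)
```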
2.1 Your turn
Use logical subsetting with str_detect() to find the names of animals with:
- zero or more o
Show answer
animals[str_detect(animals, 'o*')]
- zero or one o
Show answer
animals[str_detect(animals, 'o?')]
- at least 1 o: "dog" "dolphin" "lion" "wolf" "osprey" "kangaroo" "koala"
Show answer
animals[str_detect(animals, 'o+')]
- exactly 2 o’s together: "kangaroo"
Show answer
animals[str_detect(animals, 'o{2}')]
- one o, but not two o’s together: "dog" "dolphin" "lion" "wolf" "osprey" "koala"
Show answer
animals[str_detect(animals, 'o[^o]')]
- two vowels together: "lion" "eagle" "kangaroo" "koala"
Show answer
animals[str_detect(animals, '[aeiou]{2}')]
- two or more consonants together: "bird" "dolphin" "zebra" "wolf" "whale" "eagle" "osprey" "kangaroo"
Show answer
animals[str_detect(animals, '[^aeiou]{2,}')]
- three consonants together: "dolphin" "osprey"
Show answer
animals[str_detect(animals, '[^aeiou]{3}')]
- three letters only: "dog" "cat" "pig"
Show answer
animals[str_detect(animals, '^[a-z]{3}$')]
- four letters only: "bird" "lion" "wolf"
Show answer
animals[str_detect(animals, '^[a-z]{4}$')]
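Two of the answers above are worth a quick check. Because 'o*' and 'o?' both allow zero occurrences, they match every string, so the subsetting returns the whole vector. Also, 'o[^o]' requires some character after the o; that works for this vector, but it would miss a name ending in a single o. The sketch below verifies the first point and shows one possible alternative for the second. The alternative is an assumption of this note, not the handout's answer, and the variable names are illustrative.

```r
library(stringr)

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

# 'o*' and 'o?' accept zero occurrences, so every element matches
all_star <- all(str_detect(animals, 'o*'))
all_question <- all(str_detect(animals, 'o?'))

# Alternative for "one o, but not two o's together":
# require an o somewhere and rule out any "oo"
single_o <- animals[str_detect(animals, 'o') & !str_detect(animals, 'oo')]
single_o
```

Combining two str_detect() calls with & and ! keeps each pattern simple and avoids depending on what follows the o.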
3 File Names
Here’s another character vector with some file names and their extensions:
files <- c(
  'sales1.csv', 'orders.csv', 'sales2.csv',
  'sales3.csv', 'europe.csv', 'usa.csv', 'mex.csv',
  'CA.csv', 'FL.csv', 'NY.csv', 'TX.csv',
  'sales-europe.csv', 'sales-usa.csv', 'sales-mex.csv')
3.1 Your turn
- Find the file names containing numbers
Show answer
files[str_detect(files, '[0123456789]')]
files[str_detect(files, '[0-9]')]
files[str_detect(files, '[[:digit:]]')]
- Find the file names containing no numbers
Show answer
files[!str_detect(files, '[0-9]')]
- Find the file names containing lower case letters (including file extension)
Show answer
files[str_detect(files, '[[:lower:]]')]
- Find the file names containing lower case letters (just the name, not the file extension)
Show answer
files[!str_detect(files, '[[:upper:]]')]
- Find the file names containing a dash
Show answer
files[str_detect(files, '-')]
- Find the file names containing no dash
Show answer
files[!str_detect(files, '-')]
- Create a vector of files by replacing the ‘csv’ extension with the ‘txt’ extension
Show answer
str_replace(files, pattern = "csv", replacement = "txt")
- Extract just the file name (without the extension)
Show answer
str_replace(files, pattern = "\\.csv", replacement = "")
str_remove(files, pattern = "\\.csv")
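The patterns above work for these file names, but they are not anchored, so a "csv" or ".csv" appearing in the middle of a name would also be affected. The sketch below is one possible variation, assuming every file ends in ".csv": it anchors the pattern with $ and then reuses the extension-free names to check for lower case letters in the name only. The variable names txt_files, stems, and lower_named are illustrative.

```r
library(stringr)

files <- c(
  'sales1.csv', 'orders.csv', 'sales2.csv',
  'sales3.csv', 'europe.csv', 'usa.csv', 'mex.csv',
  'CA.csv', 'FL.csv', 'NY.csv', 'TX.csv',
  'sales-europe.csv', 'sales-usa.csv', 'sales-mex.csv')

# Anchor with $ so only a trailing ".csv" is touched
txt_files <- str_replace(files, pattern = '\\.csv$', replacement = '.txt')
stems <- str_remove(files, pattern = '\\.csv$')

# Lower case letters in the name itself, ignoring the extension
lower_named <- files[str_detect(stems, '[[:lower:]]')]
lower_named
```

Testing the stems rather than the full file names expresses the "name only" condition directly, instead of relying on the absence of upper case letters.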