Practice: Regular Expressions (part 1)
Stat 133
- Getting familiar with regex functions from "stringr"
- Use regex operations to clean/process “messy” data
- Focus on detection and extraction of string patterns
library(tidyverse)
1 Regular Expressions with "stringr"
2 Practice Regex patterns
Consider the following character vector with the names of some animals:
animals = c(
'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')
Let’s match the pattern dog with str_match():
str_match(animals, pattern = 'dog')
     [,1]
[1,] "dog"
[2,] NA
[3,] NA
[4,] NA
[5,] NA
[6,] NA
[7,] NA
[8,] NA
[9,] NA
[10,] NA
[11,] NA
[12,] NA
[13,] NA
[14,] NA
As you can tell, the output of str_match() is a character matrix with as many rows as elements in the input vector. If there is a match, the matched string is displayed; if there is no match, you get a missing value NA.
You can use str_detect() to check if there is a match:
str_detect(animals, 'dog')
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE
To extract the matched pattern, you can use str_extract():
str_extract(animals, pattern = 'dog')
[1] "dog" NA NA NA NA NA NA NA NA NA NA NA
[13] NA NA
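To compare the three functions side by side, here is a minimal sketch on a shorter vector. The vector `pets` and the variable names are made up for illustration; only str_match(), str_detect(), and str_extract() come from the text above. The block loads "stringr" directly, which is one of the packages attached by library(tidyverse).

```r
library(stringr)

# Hypothetical toy vector (not part of the original data)
pets <- c("dog", "cat", "hotdog")

m <- str_match(pets, pattern = "dog")    # character matrix, one row per element
d <- str_detect(pets, pattern = "dog")   # logical vector, one value per element
e <- str_extract(pets, pattern = "dog")  # character vector, NA where no match

dim(m)  # rows = length(pets); a single column holding the full match
d
e
```

The matrix shape of str_match() is the main difference: str_extract() returns the same matched strings as a plain vector, which is usually easier to work with when the pattern has no groups.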
Sometimes you may want to extract the elements of the input vector associated with a match. For instance, say we are interested in those animals that contain the letter "a". We can use the output of str_detect() to do logical subsetting:
animals[str_detect(animals, 'a')]
[1] "cat" "zebra" "whale" "eagle" "kangaroo" "koala"
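"stringr" also provides str_subset(), which combines the detection and subsetting steps in one call. A minimal sketch, repeating the `animals` vector so the block runs on its own (the names `with_a` and `with_a_subset` are illustrative):

```r
library(stringr)

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

# Logical subsetting, as in the text
with_a <- animals[str_detect(animals, 'a')]

# One-call alternative; the input here has no missing values
with_a_subset <- str_subset(animals, 'a')

identical(with_a, with_a_subset)
```

Logical subsetting remains the more general tool, since the logical vector can be combined with other conditions before indexing.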
2.1 Your turn
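The exercises below rely on regex quantifiers. As a brief reminder, here is a sketch on a made-up vector (`words` and the result names are hypothetical, not part of the exercise data). Note that a quantifier allowing zero repetitions, such as `*` or `?`, can match the empty string, so str_detect() reports a match for every non-missing element.

```r
library(stringr)

# Hypothetical toy vector with 0, 1, 2, and 3 consecutive o's
words <- c("bk", "bok", "book", "boook")

zero_or_more <- str_detect(words, "o*")      # matches even when there is no o
zero_or_one  <- str_detect(words, "o?")      # same: zero occurrences is allowed
one_or_more  <- str_detect(words, "o+")      # needs at least one o
two_in_a_row <- str_detect(words, "o{2}")    # also matches inside "boook"
exactly_two  <- str_detect(words, "bo{2}k")  # surrounding letters pin the count

words[one_or_more]
words[exactly_two]
```

The last two patterns show that `o{2}` on its own only asks for two consecutive o's somewhere in the string; the count becomes exact only when the surrounding characters are part of the pattern.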
Use logical subsetting with str_detect() to find the names of animals with:
- zero or more o
  Answer: animals[str_detect(animals, 'o*')]
- zero or one o
  Answer: animals[str_detect(animals, 'o?')]
- at least 1 o: "dog" "dolphin" "lion" "wolf" "osprey" "kangaroo" "koala"
  Answer: animals[str_detect(animals, 'o+')]
- exactly 2 o’s together: "kangaroo"
  Answer: animals[str_detect(animals, 'o{2}')]
- one o, but not two o’s together: "dog" "dolphin" "lion" "wolf" "osprey" "koala"
  Answer: animals[str_detect(animals, 'o[^o]')]
- two vowels together: "lion" "eagle" "kangaroo" "koala"
  Answer: animals[str_detect(animals, '[aeiou]{2}')]
- two or more consonants together: "bird" "dolphin" "zebra" "wolf" "whale" "eagle" "osprey" "kangaroo"
  Answer: animals[str_detect(animals, '[^aeiou]{2,}')]
- three consonants together: "dolphin" "osprey"
  Answer: animals[str_detect(animals, '[^aeiou]{3}')]
- three letters only: "dog" "cat" "pig"
  Answer: animals[str_detect(animals, '^[a-z]{3}$')]
- four letters only: "bird" "lion" "wolf"
  Answer: animals[str_detect(animals, '^[a-z]{4}$')]
3 File Names
Here’s another character vector with some file names and their extensions:
files <- c(
'sales1.csv', 'orders.csv', 'sales2.csv',
'sales3.csv', 'europe.csv', 'usa.csv', 'mex.csv',
'CA.csv', 'FL.csv', 'NY.csv', 'TX.csv',
'sales-europe.csv', 'sales-usa.csv', 'sales-mex.csv')
3.1 Your turn
- Find the file names containing numbers
  Answer (any of the following):
  files[str_detect(files, '[0123456789]')]
  files[str_detect(files, '[0-9]')]
  files[str_detect(files, '[[:digit:]]')]
- Find the file names containing no numbers
  Answer: files[!str_detect(files, '[0-9]')]
- Find the file names containing lower case letters (including file extension)
  Answer: files[str_detect(files, '[[:lower:]]')]
- Find the file names containing lower case letters (just the name, not the file extension)
  Answer: files[!str_detect(files, '[[:upper:]]')]
  (This works for this data because every name without an upper case letter contains lower case letters, and the upper case names contain none.)
- Find the file names containing a dash
  Answer: files[str_detect(files, '-')]
- Find the file names containing no dash
  Answer: files[!str_detect(files, '-')]
- Create a vector of files by replacing the ‘csv’ extension with the ‘txt’ extension
  Answer: str_replace(files, pattern = "csv", replacement = "txt")
- Extract just the file name (without the extension)
  Answer (either of the following):
  str_replace(files, pattern = "\\.csv", replacement = "")
  str_remove(files, pattern = "\\.csv")
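The patterns "csv" and "\\.csv" in the last two answers work for this data, but they match the first occurrence anywhere in the string rather than only at the end. A slightly more defensive sketch anchors the pattern with `$`; the vector `tricky` and the result names are hypothetical, chosen to show where the unanchored version goes wrong.

```r
library(stringr)

# Hypothetical file names where "csv" also appears outside the extension
tricky <- c("csv-export.csv", "notes.csv.bak", "usa.csv")

# Unanchored: replaces the first "csv", wherever it appears
unanchored <- str_replace(tricky, pattern = "csv", replacement = "txt")

# Anchored: only a trailing ".csv" is treated as the extension
as_txt <- str_replace(tricky, pattern = "\\.csv$", replacement = ".txt")
no_ext <- str_remove(tricky, pattern = "\\.csv$")

unanchored
as_txt
no_ext
```

With the anchored patterns, "notes.csv.bak" is left untouched because ".csv" is not at the end of the string.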