Practice: Regular Expressions (part 1)

Stat 133

Author

Gaston Sanchez

Learning Objectives

Get familiar with regex functions from "stringr"
Use regex operations to clean/process “messy” data
Focus on detection and extraction of string patterns

1 Regular Expressions with "stringr"

library(tidyverse)

2 Practice Regex patterns

Consider the following character vector with the names of some animals:

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

Let’s match the pattern 'dog' with str_match():

str_match(animals, pattern = 'dog')

      [,1] 
 [1,] "dog"
 [2,] NA   
 [3,] NA   
 [4,] NA   
 [5,] NA   
 [6,] NA   
 [7,] NA   
 [8,] NA   
 [9,] NA   
[10,] NA   
[11,] NA   
[12,] NA   
[13,] NA   
[14,] NA

As you can tell, the output of str_match() is a character matrix with as many rows as there are elements in the input vector. If there is a match, the matched string is displayed; if there is no match, you get a missing value, NA.
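The matrix shape becomes more useful once the pattern contains capture groups: the first column holds the full match and each additional column holds one group. The following is a minimal sketch; the vector some_animals and the pattern are made up for illustration and are not part of the original exercise.

```r
library(stringr)

# A small, illustrative subset of the animals vector (hypothetical name)
some_animals <- c("dog", "cat", "dolphin")

# Two capture groups: the "o" that follows "d", and the character after it
m <- str_match(some_animals, "d(o)(.)")

# Column 1 is the full match; columns 2 and 3 are the two groups.
# Rows without a match are filled with NA.
m
```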

You can use str_detect() to check if there is a match:

str_detect(animals, 'dog')

 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE

To extract the matched pattern, you can use str_extract():

str_extract(animals, pattern = 'dog')

 [1] "dog" NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
[13] NA    NA

Sometimes you may want to extract the elements of the input vector associated with a match. For instance, say we are interested in those animals that contain the letter "a". We can use the output of str_detect() to do logical subsetting:

animals[str_detect(animals, 'a')]

[1] "cat"      "zebra"    "whale"    "eagle"    "kangaroo" "koala"
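As a side note, "stringr" also provides str_subset(), which wraps this detect-then-subset idiom in a single call. The sketch below compares the two approaches; the negate argument is assumed to be available (it was added in stringr 1.4.0), and the variable names are illustrative.

```r
library(stringr)

animals <- c(
  "dog", "cat", "bird", "dolphin", "lion", "zebra", "tiger",
  "wolf", "whale", "eagle", "pig", "osprey", "kangaroo", "koala")

# Logical subsetting, as in the text
by_detect <- animals[str_detect(animals, "a")]

# Same result with str_subset()
by_subset <- str_subset(animals, "a")
identical(by_detect, by_subset)

# negate = TRUE keeps the elements that do NOT match
no_a <- str_subset(animals, "a", negate = TRUE)
no_a
```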

2.1 Your turn

Use logical subsetting with str_detect() to find the names of animals with:

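zero or more o: all 14 animals (a pattern that allows zero occurrences also matches the empty string, so every name qualifies)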

Show answer

animals[str_detect(animals, 'o*')]

zero or one o: all 14 animals (zero occurrences always counts as a match)

Show answer

animals[str_detect(animals, 'o?')]
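The quantifiers * and ? both allow zero repetitions, so on their own they succeed on any string. A quick check, sketched here with illustrative variable names, makes the contrast with + visible:

```r
library(stringr)

animals <- c(
  "dog", "cat", "bird", "dolphin", "lion", "zebra", "tiger",
  "wolf", "whale", "eagle", "pig", "osprey", "kangaroo", "koala")

# Both patterns can match the empty string, so every element is TRUE
all_star  <- all(str_detect(animals, "o*"))
all_quest <- all(str_detect(animals, "o?"))

# 'o+' requires at least one o, so only some elements match
n_plus <- sum(str_detect(animals, "o+"))

c(all_star = all_star, all_quest = all_quest, n_plus = n_plus)
```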

at least 1 o: "dog" "dolphin" "lion" "wolf" "osprey" "kangaroo" "koala"

Show answer

animals[str_detect(animals, 'o+')]

exactly 2 o’s together: "kangaroo"

Show answer

animals[str_detect(animals, 'o{2}')]

one o, but not two o’s together: "dog" "dolphin" "lion" "wolf" "osprey" "koala"

Show answer

animals[str_detect(animals, 'o[^o]')]
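The pattern 'o[^o]' works for this vector, but it requires some character after the o, so a name ending in a single o would be missed. One possible alternative, sketched below, combines two str_detect() calls. The value "buffalo" is a hypothetical addition used only to show the difference.

```r
library(stringr)

# "buffalo" is a hypothetical extra value, not part of the original vector
x <- c("dog", "kangaroo", "koala", "buffalo")

# Needs a non-o character right after an o, so "buffalo" is not detected
single_pattern <- str_detect(x, "o[^o]")

# Alternative: contains an o, and does not contain "oo"
has_o  <- str_detect(x, "o")
has_oo <- str_detect(x, "oo")
combined <- has_o & !has_oo

x[single_pattern]
x[combined]
```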

two vowels together: "lion" "eagle" "kangaroo" "koala"

Show answer

animals[str_detect(animals, '[aeiou]{2}')]

two or more consonants together: "bird" "dolphin" "zebra" "wolf" "whale" "eagle" "osprey" "kangaroo"

Show answer

animals[str_detect(animals, '[^aeiou]{2,}')]

three consonants together: "dolphin" "osprey"

Show answer

animals[str_detect(animals, '[^aeiou]{3}')]
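Keep in mind that '[^aeiou]' means "anything that is not a lower case vowel", which also covers digits, punctuation, and upper case letters. For the animals vector this makes no difference, but with messier data an explicit consonant class may be safer. The sketch below assumes lower case ASCII input; the string "a-1" is a hypothetical example.

```r
library(stringr)

animals <- c(
  "dog", "cat", "bird", "dolphin", "lion", "zebra", "tiger",
  "wolf", "whale", "eagle", "pig", "osprey", "kangaroo", "koala")

# Explicit class of the 21 lower case consonants
consonants <- "[b-df-hj-np-tv-z]"

# On the animals vector both approaches agree
two_plus_negated  <- str_detect(animals, "[^aeiou]{2,}")
two_plus_explicit <- str_detect(animals, str_c(consonants, "{2,}"))
identical(two_plus_negated, two_plus_explicit)

# On a string with non-letters they differ ("a-1" is hypothetical)
negated_hit  <- str_detect("a-1", "[^aeiou]{2,}")
explicit_hit <- str_detect("a-1", str_c(consonants, "{2,}"))
c(negated = negated_hit, explicit = explicit_hit)
```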

three letters only: "dog" "cat" "pig"

Show answer

animals[str_detect(animals, '^[a-z]{3}$')]

four letters only: "bird" "lion" "wolf"

Show answer

animals[str_detect(animals, '^[a-z]{4}$')]
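The anchors ^ and $ are what turn "contains three letters" into "consists of exactly three letters". The following sketch shows what happens without them, and compares the anchored pattern with a length check based on str_length(); the variable names are illustrative.

```r
library(stringr)

animals <- c(
  "dog", "cat", "bird", "dolphin", "lion", "zebra", "tiger",
  "wolf", "whale", "eagle", "pig", "osprey", "kangaroo", "koala")

# Without anchors: any name with at least three letters in a row matches
unanchored <- animals[str_detect(animals, "[a-z]{3}")]

# With anchors: the whole name must be exactly three letters
anchored <- animals[str_detect(animals, "^[a-z]{3}$")]

# For purely alphabetic names, a length check gives the same answer
by_length <- animals[str_length(animals) == 3]

length(unanchored)
anchored
identical(anchored, by_length)
```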

3 File Names

Here’s another character vector with some file names and their extensions:

files <- c(
  'sales1.csv', 'orders.csv', 'sales2.csv',
  'sales3.csv', 'europe.csv', 'usa.csv', 'mex.csv',
  'CA.csv', 'FL.csv', 'NY.csv', 'TX.csv',
  'sales-europe.csv', 'sales-usa.csv', 'sales-mex.csv')

3.1 Your turn

Find the file names containing numbers

Show answer

files[str_detect(files, '[0123456789]')]

files[str_detect(files, '[0-9]')]

files[str_detect(files, '[[:digit:]]')]
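A fourth spelling of the same idea is the shorthand \d, which has to be written as "\\d" inside an R string because the backslash itself must be escaped. A minimal sketch, with an illustrative variable name:

```r
library(stringr)

files <- c(
  "sales1.csv", "orders.csv", "sales2.csv",
  "sales3.csv", "europe.csv", "usa.csv", "mex.csv",
  "CA.csv", "FL.csv", "NY.csv", "TX.csv",
  "sales-europe.csv", "sales-usa.csv", "sales-mex.csv")

# "\\d" in R source code is the regex \d (any digit)
with_digits <- files[str_detect(files, "\\d")]
with_digits
```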

Find the file names containing no numbers

Show answer

files[!str_detect(files, '[0-9]')]

Find the file names containing lower case letters (including file extension)

Show answer

files[str_detect(files, '[[:lower:]]')]

Find the file names containing lower case letters (just the name, not the file extension)

Show answer

files[!str_detect(files, '[[:upper:]]')]
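Excluding names with upper case letters gives the expected result for this vector, but it answers a slightly different question: a mixed-case name would be dropped even though it contains lower case letters. A more direct approach, sketched below, strips the extension first and then looks for lower case letters. The file "Report.csv" is a hypothetical example.

```r
library(stringr)

files <- c(
  "sales1.csv", "orders.csv", "sales2.csv",
  "sales3.csv", "europe.csv", "usa.csv", "mex.csv",
  "CA.csv", "FL.csv", "NY.csv", "TX.csv",
  "sales-europe.csv", "sales-usa.csv", "sales-mex.csv")

# Drop the extension, then test the remaining name for lower case letters
name_only <- str_remove(files, "\\.csv$")
lower_in_name <- files[str_detect(name_only, "[[:lower:]]")]
lower_in_name

# Hypothetical mixed-case file name, not part of the original vector
mixed <- "Report.csv"
kept_by_negation <- !str_detect(mixed, "[[:upper:]]")
kept_by_direct   <- str_detect(str_remove(mixed, "\\.csv$"), "[[:lower:]]")
c(negation = kept_by_negation, direct = kept_by_direct)
```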

Find the file names containing a dash

Show answer

files[str_detect(files, '-')]

Find the file names containing no dash

Show answer

files[!str_detect(files, '-')]

Create a vector of files by replacing the ‘csv’ extension with a ‘txt’ extension

Show answer

str_replace(files, pattern = "\\.csv$", replacement = ".txt")

Extract just the file name (without the extension)

Show answer

str_replace(files, pattern = "\\.csv$", replacement = "")

str_remove(files, pattern = "\\.csv$")
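Anchoring the extension pattern with $ guards against file names that contain the same text somewhere else. Since str_replace() and str_remove() act on the first match only, an unanchored pattern can hit the wrong part of the name. The sketch below uses a hypothetical file name, "csv-data.csv", to show the difference.

```r
library(stringr)

# Hypothetical file name in which "csv" also appears in the name itself
f <- "csv-data.csv"

# Unanchored: the first "csv" is replaced, which is not the extension
unanchored <- str_replace(f, "csv", "txt")

# Anchored at the end of the string: only the extension is changed
anchored <- str_replace(f, "\\.csv$", ".txt")

# The same anchored pattern removes just the extension
base_name <- str_remove(f, "\\.csv$")

c(unanchored = unanchored, anchored = anchored, base_name = base_name)
```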
