Practice: Regular Expressions (part 1)

Stat 133

Author

Gaston Sanchez

Learning Objectives
  • Get familiar with the regex functions in "stringr"
  • Use regex operations to clean/process “messy” data
  • Focus on detection and extraction of string patterns

1 Regular Expressions with "stringr"

library(tidyverse)

2 Practice Regex patterns

Consider the following character vector with the names of some animals:

animals = c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger', 
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

Let’s match the pattern 'dog' with str_match():

str_match(animals, pattern = 'dog')
      [,1] 
 [1,] "dog"
 [2,] NA   
 [3,] NA   
 [4,] NA   
 [5,] NA   
 [6,] NA   
 [7,] NA   
 [8,] NA   
 [9,] NA   
[10,] NA   
[11,] NA   
[12,] NA   
[13,] NA   
[14,] NA   

As you can tell, the output of str_match() is a character matrix with as many rows as there are elements in the input vector. If there is a match, the matched string is displayed. If there is no match, you get a missing value NA.
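To make the shape of that output concrete, here is a small sketch. It assumes stringr is installed; the names m and m_groups and the pattern '(d)(og)' are illustrative and not part of the original exercise. It checks that str_match() returns a matrix, and shows that each parenthesized group in a pattern adds one more column.

```r
library(stringr)

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

# No capture groups: a single column holding the full match (or NA)
m <- str_match(animals, pattern = 'dog')
is.matrix(m)     # TRUE
dim(m)           # 14 rows, 1 column

# Two capture groups (illustrative pattern): the full match,
# followed by one column per group
m_groups <- str_match(animals, pattern = '(d)(og)')
dim(m_groups)    # 14 rows, 3 columns
m_groups[1, ]    # "dog" "d" "og"
```

When the pattern has no groups, str_extract() returns the same information as a plain character vector, which is usually easier to work with.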

You can use str_detect() to check if there is a match:

str_detect(animals, 'dog')
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE

To extract the matched pattern, you can use str_extract():

str_extract(animals, pattern = 'dog')
 [1] "dog" NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
[13] NA    NA   
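
Note that str_extract() returns only the first match in each string. The sketch below assumes stringr is installed and uses an illustrative vowel pattern that does not appear in the original text. It contrasts str_extract() with str_extract_all(), which returns every match.

```r
library(stringr)

# First match only: one value per input string
first_vowel <- str_extract(c('dog', 'kangaroo'), pattern = '[aeiou]')
first_vowel        # "o" "a"

# All matches: a list with one character vector per input string
all_vowels <- str_extract_all(c('dog', 'kangaroo'), pattern = '[aeiou]')
all_vowels[[2]]    # "a" "a" "o" "o"
```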

Sometimes you may want to extract the elements of the input vector that contain a match. For instance, say we are interested in those animals whose names contain the letter "a". We can use the output of str_detect() to do logical subsetting:

animals[str_detect(animals, 'a')]
[1] "cat"      "zebra"    "whale"    "eagle"    "kangaroo" "koala"   
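
stringr also provides str_subset(), which wraps this detect-then-subset idiom in a single call. A minimal sketch, assuming stringr is installed (the variable names are illustrative):

```r
library(stringr)

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

# Logical subsetting, as in the text
with_a <- animals[str_detect(animals, 'a')]

# Shortcut: keep only the elements that match the pattern
with_a_subset <- str_subset(animals, 'a')
identical(with_a, with_a_subset)    # TRUE

# negate = TRUE keeps the elements that do NOT match
without_a <- str_subset(animals, 'a', negate = TRUE)
```

The exercises that follow ask for str_detect(), but either form can be used to check an answer.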

2.1 Your turn

Use logical subsetting with str_detect() to find the names of animals with:

  1. zero or more o
animals[str_detect(animals, 'o*')]
  2. zero or one o
animals[str_detect(animals, 'o?')]
  3. at least 1 o: "dog" "dolphin" "lion" "wolf" "osprey" "kangaroo" "koala"
animals[str_detect(animals, 'o+')]
  4. exactly 2 o’s together: "kangaroo"
animals[str_detect(animals, 'o{2}')]
  5. one o, but not two o’s together: "dog" "dolphin" "lion" "wolf" "osprey" "koala"
animals[str_detect(animals, 'o[^o]')]
  6. two vowels together: "lion" "eagle" "kangaroo" "koala"
animals[str_detect(animals, '[aeiou]{2}')]
  7. two or more consonants together: "bird" "dolphin" "zebra" "wolf" "whale" "eagle" "osprey" "kangaroo"
animals[str_detect(animals, '[^aeiou]{2,}')]
  8. three consonants together: "dolphin" "osprey"
animals[str_detect(animals, '[^aeiou]{3}')]
  9. three letters only: "dog" "cat" "pig"
animals[str_detect(animals, '^[a-z]{3}$')]
  10. four letters only: "bird" "lion" "wolf"
animals[str_detect(animals, '^[a-z]{4}$')]
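
Two points about the answers above are easy to miss. Because 'o*' and 'o?' can both match zero characters, they succeed on every string, so the first two answers return all 14 animals. Also, 'o[^o]' requires some character after the o: it gives the listed result for this vector, but it would miss a name ending in a single o. The sketch below assumes stringr is installed; the extra name 'hippo' and the helper single_o() are illustrative additions, not part of the original answers.

```r
library(stringr)

animals <- c(
  'dog', 'cat', 'bird', 'dolphin', 'lion', 'zebra', 'tiger',
  'wolf', 'whale', 'eagle', 'pig', 'osprey', 'kangaroo', 'koala')

# 'o*' and 'o?' can match the empty string, so every element is detected
all(str_detect(animals, 'o*'))    # TRUE
all(str_detect(animals, 'o?'))    # TRUE

# 'o[^o]' needs a non-o character after the o,
# so a single o at the end of a name is missed
str_detect('hippo', 'o[^o]')      # FALSE

# Alternative sketch (hypothetical helper): contains an o, but never 'oo'
single_o <- function(x) str_detect(x, 'o') & !str_detect(x, 'oo')
one_o <- animals[single_o(animals)]
single_o('hippo')                 # TRUE
```

Combining two str_detect() calls with & and ! is often easier to read than a single pattern that tries to express both conditions at once.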

3 File Names

Here’s another character vector with some file names and their extensions:

files <- c(
  'sales1.csv', 'orders.csv', 'sales2.csv',
  'sales3.csv', 'europe.csv', 'usa.csv', 'mex.csv',
  'CA.csv', 'FL.csv', 'NY.csv', 'TX.csv',
  'sales-europe.csv', 'sales-usa.csv', 'sales-mex.csv')

3.1 Your turn

  1. Find the file names containing numbers
files[str_detect(files, '[0123456789]')]

files[str_detect(files, '[0-9]')]

files[str_detect(files, '[[:digit:]]')]
  2. Find the file names containing no numbers
files[!str_detect(files, '[0-9]')]
  3. Find the file names containing lower case letters (including the file extension)
files[str_detect(files, '[[:lower:]]')]
  4. Find the file names containing lower case letters (just the name, not the file extension)
files[!str_detect(files, '[[:upper:]]')]
  5. Find the file names containing a dash
files[str_detect(files, '-')]
  6. Find the file names containing no dash
files[!str_detect(files, '-')]
  7. Create a vector of files by replacing the ‘csv’ extension with a ‘txt’ extension
str_replace(files, pattern = "csv", replacement = "txt")
  8. Extract just the file name (without the extension)
str_replace(files, pattern = "\\.csv", replacement = "")

str_remove(files, pattern = "\\.csv")
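
The answers above are enough for this vector, but some of them rely on properties of the data. The patterns "csv" and "\\.csv" match wherever they first appear in a name, and !str_detect(files, '[[:upper:]]') selects names without upper case letters rather than names with lower case ones. A more defensive sketch, assuming stringr is installed, anchors the extension with $ and tests the part before the extension directly. The patterns and the file name 'csv-report.csv' are illustrative assumptions, not part of the original answers.

```r
library(stringr)

files <- c(
  'sales1.csv', 'orders.csv', 'sales2.csv',
  'sales3.csv', 'europe.csv', 'usa.csv', 'mex.csv',
  'CA.csv', 'FL.csv', 'NY.csv', 'TX.csv',
  'sales-europe.csv', 'sales-usa.csv', 'sales-mex.csv')

# Replace only a trailing '.csv' extension
txt_files <- str_replace(files, pattern = '\\.csv$', replacement = '.txt')

# An unanchored 'csv' changes the wrong part of this hypothetical name
str_replace('csv-report.csv', 'csv', 'txt')         # "txt-report.csv"
str_replace('csv-report.csv', '\\.csv$', '.txt')    # "csv-report.txt"

# Remove only a trailing '.csv' extension
base_names <- str_remove(files, pattern = '\\.csv$')

# Names with a lower case letter somewhere before the final '.csv'
lower_names <- files[str_detect(files, '[[:lower:]].*\\.csv$')]
```

For the files vector, these calls select and rewrite the same elements as the original answers; the difference only shows up on names such as the hypothetical 'csv-report.csv'.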