Strings II

Parsing text with regular expressions

fs <- list.files(recursive = TRUE)
fs

[1] "data/133-survey-small.csv" "images/headache.webp"     
[3] "images/headwebp.png"       "images/penguins.webp"     
[5] "images/regex-tester.png"   "ps.qmd"                   
[7] "slides.qmd"                "slides.rmarkdown"

Question: Write the pattern that matches each of the following:

Files with a webp extension.
Images with a name that starts with the letter p.
Files with a three-digit number.

Files with a webp extension.

library(tidyverse) # attaches stringr along with the other core packages
str_view(fs, "\\.webp", match = NA)

[1] │ data/133-survey-small.csv
[2] │ images/headache<.webp>
[3] │ images/headwebp.png
[4] │ images/penguins<.webp>
[5] │ images/regex-tester.png
[6] │ ps.qmd
[7] │ slides.qmd
[8] │ slides.rmarkdown

Images with a name that starts with the letter p.

str_view(fs, "images/p", match = NA)

[1] │ data/133-survey-small.csv
[2] │ images/headache.webp
[3] │ images/headwebp.png
[4] │ <images/p>enguins.webp
[5] │ images/regex-tester.png
[6] │ ps.qmd
[7] │ slides.qmd
[8] │ slides.rmarkdown

Files with a three-digit number.

str_view(fs, "[0123456789]{3}")

[1] │ data/<133>-survey-small.csv

Agenda

Regular expressions
- Literal Characters
- Metacharacters
  - Wildcards
  - Quantifiers
  - Anchors
- Character Sets
- Character Classes
- Grouping and Alternation

Case Studies
- Extracting a filename
- Splitting into majors
Kahoot

Anchors

You can anchor the regular expression using ^ to match the start or $ to match the end.

df <- tibble(fruit = c("apple", "banana"))
df |>
  mutate(start_a = str_view(fruit, "^a", match = NA))

# A tibble: 2 × 2
  fruit  start_a   
  <chr>  <strngr_v>
1 apple  <a>pple   
2 banana banana

df <- tibble(fruit = c("apple", "banana"))
df |>
  mutate(start_a = str_view(fruit, "^a", match = NA),
         end_a   = str_view(fruit, "a$", match = NA))

# A tibble: 2 × 3
  fruit  start_a    end_a     
  <chr>  <strngr_v> <strngr_v>
1 apple  <a>pple    apple     
2 banana banana     banan<a>
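
A minimal sketch of using both anchors together, assuming stringr is attached. The `words` vector is a hypothetical example that is not part of the slides. Anchoring both ends makes the pattern describe the whole string rather than a piece of it.

```r
library(stringr)

# Hypothetical example strings (not from the slides)
words <- c("apple", "pineapple", "apples")

# Unanchored: "apple" may appear anywhere in the string
anywhere <- str_detect(words, "apple")

# Anchored at both ends: the whole string must be exactly "apple"
exact <- str_detect(words, "^apple$")

anywhere  # TRUE TRUE TRUE
exact     # TRUE FALSE FALSE
```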

fs <- c(fs, "my.webp.pic.png")
fs

[1] "data/133-survey-small.csv" "images/headache.webp"     
[3] "images/headwebp.png"       "images/penguins.webp"     
[5] "images/regex-tester.png"   "ps.qmd"                   
[7] "slides.qmd"                "slides.rmarkdown"         
[9] "my.webp.pic.png"

Files with a webp extension.

# the unescaped . matches any character, so this also catches "dwebp" and a .webp in the middle of a name
str_view(fs, ".webp", match = NA)

[1] │ data/133-survey-small.csv
[2] │ images/headache<.webp>
[3] │ images/hea<dwebp>.png
[4] │ images/penguins<.webp>
[5] │ images/regex-tester.png
[6] │ ps.qmd
[7] │ slides.qmd
[8] │ slides.rmarkdown
[9] │ my<.webp>.pic.png

# much better: escape the dot and anchor the match to the end of the string
str_view(fs, "\\.webp$", match = NA)

[1] │ data/133-survey-small.csv
[2] │ images/headache<.webp>
[3] │ images/headwebp.png
[4] │ images/penguins<.webp>
[5] │ images/regex-tester.png
[6] │ ps.qmd
[7] │ slides.qmd
[8] │ slides.rmarkdown
[9] │ my.webp.pic.png
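
Once the pattern is right, the same regex can do the filtering instead of only highlighting. A short sketch, assuming stringr is attached and re-creating `fs` by hand so the block runs on its own:

```r
library(stringr)

# The same nine file names shown above, typed out so the example is self-contained
fs <- c("data/133-survey-small.csv", "images/headache.webp",
        "images/headwebp.png", "images/penguins.webp",
        "images/regex-tester.png", "ps.qmd", "slides.qmd",
        "slides.rmarkdown", "my.webp.pic.png")

# Logical vector: does each name end in a literal ".webp"?
is_webp <- str_detect(fs, "\\.webp$")

# Keep only the matching names
webp_files <- str_subset(fs, "\\.webp$")
webp_files  # "images/headache.webp" "images/penguins.webp"
```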

Tip

Regex tester websites are very helpful for understanding how your regex works, e.g. https://regexr.com/

plates <- c("7ABC123", "5ZZA113", "9HMB22B")
colors <- c("col", "color", "colour", "colr")
phones <- c("510-384-2988", "483-3309", "1-558-90-3942", "382-APG-3911")

Question: Write the pattern that will match each of the following:

Any license plate that ends in A, B, or C.
Valid full spellings of the word “color”.
Strings that contain only a valid phone number with an area code (no country code).

plates <- c("7ABC123", "5ZZA113", "9HMB22B")

Any license plate that ends in A, B, or C.

str_view(plates, "[ABC]$", match = NA)

[1] │ 7ABC123
[2] │ 5ZZA113
[3] │ 9HMB22<B>

colors <- c("col", "color", "colour", "colr")

Valid full spellings of the word “color”.

str_view(colors, "colou?r", match = NA)

[1] │ col
[2] │ <color>
[3] │ <colour>
[4] │ colr

phones <- c("510-384-2988", "483-3309", "1-558-90-3942", "382-APG-3911")

Strings that contain only a valid phone number with an area code (no country code).

str_view(phones, "^\\d{3}-\\d{3}-\\d{4}$", match = NA)

[1] │ <510-384-2988>
[2] │ 483-3309
[3] │ 1-558-90-3942
[4] │ 382-APG-3911
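
The anchors are what enforce the word "only". A small sketch of the difference, assuming stringr is attached. The second phone number is a hypothetical value added for illustration and is not in the slide data.

```r
library(stringr)

# First value is from the slide; the second is hypothetical (a country code was added)
nums <- c("510-384-2988", "1-510-384-2988")

core <- "\\d{3}-\\d{3}-\\d{4}"

# Without anchors the pattern may match inside a longer string
unanchored <- str_detect(nums, core)

# With ^ and $ the whole string has to be the phone number
anchored <- str_detect(nums, str_c("^", core, "$"))

unanchored  # TRUE TRUE
anchored    # TRUE FALSE
```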

Character Classes

Construct your own set: [ABC]
Special characters in a set:
- ^ at the start negates the set.
- - expresses a range.

x <- "abcd ABCD 12345 -!@#%."

str_view(x, "[a-z]+")

[1] │ <abcd> ABCD 12345 -!@#%.

str_view(x, "[^a-z0-9]+")

[1] │ abcd< ABCD >12345< -!@#%.>
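
Ranges and negation can be combined in one set. The following is a sketch on the same string, assuming stringr is attached; `str_extract_all()` returns the matches themselves instead of highlighting them.

```r
library(stringr)

x <- "abcd ABCD 12345 -!@#%."

# One range: runs of uppercase letters
upper <- str_extract_all(x, "[A-Z]+")[[1]]

# Two ranges in one set: runs of letters of either case
letters_any <- str_extract_all(x, "[a-zA-Z]+")[[1]]

# Negated set: runs of anything that is neither a digit nor a space
not_digit_space <- str_extract_all(x, "[^0-9 ]+")[[1]]

upper            # "ABCD"
letters_any      # "abcd" "ABCD"
not_digit_space  # "abcd" "ABCD" "-!@#%."
```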

Shortcut classes

\d matches any digit;
\D matches anything that isn’t a digit.
\s matches any whitespace (e.g., space, tab, newline);
\S matches anything that isn’t whitespace.
\w matches any “word” character, i.e. letters, digits, and the underscore;
\W matches any “non-word” character.

Note: These all have to be escaped when forming a string to express the pattern, e.g. "\\d" for \d.
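
A quick way to see why the doubled backslash is needed: the R string "\\d+" holds only three characters, and those three characters are what the regex engine receives. The sketch below assumes stringr is attached and R 4.0 or later for the raw-string syntax; the sample text "room 404" is hypothetical.

```r
library(stringr)

# The string literal "\\d+" stores the characters \ d + (three in total)
pattern <- "\\d+"
n_chars <- nchar(pattern)
writeLines(pattern)   # prints \d+

# Raw strings (R >= 4.0) avoid the doubling and give the same string
same <- identical(pattern, r"(\d+)")

# Either form works as a pattern
digits <- str_extract("room 404", pattern)

n_chars  # 3
same     # TRUE
digits   # "404"
```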

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")

[1] │ abcd ABCD <12345> -!@#%.

str_view(x, "\\D+")

[1] │ <abcd ABCD >12345< -!@#%.>

str_view(x, "\\s+")

[1] │ abcd< >ABCD< >12345< >-!@#%.

str_view(x, "\\S+")

[1] │ <abcd> <ABCD> <12345> <-!@#%.>

str_view(x, "\\w+")

[1] │ <abcd> <ABCD> <12345> -!@#%.

str_view(x, "\\W+")

[1] │ abcd< >ABCD< >12345< -!@#%.>

Grouping and Alternation

Grouping

You can change the scope of different operations by grouping characters with ().

x <- c("hah", "hahaha", "haaa")

str_view(x, "ha+")

[1] │ <ha>h
[2] │ <ha><ha><ha>
[3] │ <haaa>

str_view(x, "(ha)+")

[1] │ <ha>h
[2] │ <hahaha>
[3] │ <ha>aa

str_count(x, "ha+")

[1] 1 3 1

str_count(x, "(ha)+")

[1] 1 1 1
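
The same scoping rule applies to every quantifier, not only +. A sketch with {2}, assuming stringr is attached; the vector `y` is a hypothetical example.

```r
library(stringr)

# Hypothetical example strings (not from the slides)
y <- c("haha", "haa")

# Without a group, {2} applies only to the "a": h followed by two a's
no_group <- str_detect(y, "ha{2}")

# With a group, {2} applies to the whole "ha": "ha" twice in a row
with_group <- str_detect(y, "(ha){2}")

no_group    # FALSE TRUE
with_group  # TRUE FALSE
```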

Alternation

You can separate different possible patterns with |; reads like “or”.

str_view(df$fruit, "an|ap")

[1] │ <ap>ple
[2] │ b<an><an>a

survey <- read_csv("data/133-survey-small.csv")
str_view(survey$major, "Stat|stat") # same as "[sS]tat"

 [3] │ <Stat>istics
 [4] │ <Stat>istics
 [8] │ <Stat>istics
 [9] │ <Stat>istics + Economics
[10] │ <Stat>istics
[11] │ <Stat>istics, Economics, Political Economy
[12] │ <stat>istics
[16] │ <Stat>istics
[18] │ <Stat>istics + Integrative Biology
[21] │ <Stat>istics
[24] │ <Stat>istics
[25] │ <stat>istics
[26] │ <Stat>istics
[31] │ <Stat>istics
[32] │ Data Science & <Stat>s
[34] │ <Stat>istics
[37] │ <Stat>istics
[38] │ <Stat>istics
[42] │ <stat>istic
[43] │ Data Science, <Stat>istics
... and 66 more
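
A sketch comparing three ways to handle the capital letter, assuming stringr is attached. The `majors` vector is a small hypothetical sample, not the survey data. Note that `regex(ignore_case = TRUE)` is broader than the other two: it would also accept spellings such as "STAT".

```r
library(stringr)

# Hypothetical sample of responses (not read from the survey file)
majors <- c("Statistics", "statistics", "Data Science & Stats", "Economics")

# Alternation between two literal spellings
alt <- str_detect(majors, "Stat|stat")

# A character set for the first letter only
set <- str_detect(majors, "[sS]tat")

# Case-insensitive matching for the whole pattern
insensitive <- str_detect(majors, regex("stat", ignore_case = TRUE))

alt          # TRUE TRUE TRUE FALSE
set          # TRUE TRUE TRUE FALSE
insensitive  # TRUE TRUE TRUE FALSE
```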

Grouping with Alternation

x <- c("cat", "catcat", "catdog", "dogcat", "dog")
p1 <- "^cat|dog$"

Question: Which strings will this match?

str_view(x, p1)

[1] │ <cat>
[2] │ <cat>cat
[3] │ <cat><dog>
[5] │ <dog>

p2 <- "^(cat|dog)$"

Question: Which strings will this match?

str_view(x, p2)

[1] │ <cat>
[5] │ <dog>
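
The two patterns can be checked side by side with `str_detect()`. This is a sketch assuming stringr is attached. Without parentheses, | splits the whole pattern, so p1 reads as "starts with cat" or "ends with dog"; with parentheses, both anchors apply to either word.

```r
library(stringr)

x <- c("cat", "catcat", "catdog", "dogcat", "dog")

p1 <- "^cat|dog$"     # (^cat) or (dog$)
p2 <- "^(cat|dog)$"   # the whole string is "cat" or "dog"

m1 <- str_detect(x, p1)
m2 <- str_detect(x, p2)

m1  # TRUE TRUE TRUE FALSE TRUE
m2  # TRUE FALSE FALSE FALSE TRUE
```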

Kahoot

Case Studies

Extracting file names

How do I extract just the file name from each path?

fs

[1] "data/133-survey-small.csv" "images/headache.webp"     
[3] "images/headwebp.png"       "images/penguins.webp"     
[5] "images/regex-tester.png"   "ps.qmd"                   
[7] "slides.qmd"                "slides.rmarkdown"         
[9] "my.webp.pic.png"

str_view(fs, "[^/]+$")

[1] │ data/<133-survey-small.csv>
[2] │ images/<headache.webp>
[3] │ images/<headwebp.png>
[4] │ images/<penguins.webp>
[5] │ images/<regex-tester.png>
[6] │ <ps.qmd>
[7] │ <slides.qmd>
[8] │ <slides.rmarkdown>
[9] │ <my.webp.pic.png>

str_extract(fs, "[^/]+$")

[1] "133-survey-small.csv" "headache.webp"        "headwebp.png"        
[4] "penguins.webp"        "regex-tester.png"     "ps.qmd"              
[7] "slides.qmd"           "slides.rmarkdown"     "my.webp.pic.png"
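
The same negated-set idea can pull out the extension. A sketch assuming stringr is attached and using three of the paths above; the comparison with base R's `basename()` is only a sanity check, not something the slides rely on. For a name with no dot at all, `"[^.]+$"` would return the whole string, so treat it as a rough tool.

```r
library(stringr)

# Three of the paths shown above, typed out so the example is self-contained
paths <- c("data/133-survey-small.csv", "images/penguins.webp", "my.webp.pic.png")

# Everything after the last slash
file_name <- str_extract(paths, "[^/]+$")

# Base R gives the same answer for these paths
same_as_basename <- identical(file_name, basename(paths))

# Everything after the last dot (inside a set, . is a literal dot)
extension <- str_extract(paths, "[^.]+$")

file_name         # "133-survey-small.csv" "penguins.webp" "my.webp.pic.png"
same_as_basename  # TRUE
extension         # "csv" "webp" "png"
```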

How do you learn and keep track of the str_ functions in stringr?

Splitting off majors

How do I systematically keep track of each major a student has?

select(survey, major)

# A tibble: 202 × 1
   major                           
   <chr>                           
 1 Applied mathematics             
 2 Psychology & Data Science       
 3 Statistics                      
 4 Statistics                      
 5 math major + ds minor           
 6 Economics and minor Data Science
 7 Political Science               
 8 Statistics                      
 9 Statistics + Economics          
10 Statistics                      
# ℹ 192 more rows
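
One possible starting point, offered as a sketch rather than the approach taken in the course: split each response on the separators visible in the rows above. The `major` vector below is a hypothetical hand-typed sample, and the choice of separators (&, +, and a comma, each with optional surrounding whitespace) is an assumption. Responses such as "Economics and minor Data Science" would need extra handling.

```r
library(stringr)

# Hypothetical sample modeled on a few responses (not read from the survey file)
major <- c("Psychology & Data Science",
           "Statistics + Economics",
           "Statistics, Economics, Political Economy",
           "Statistics")

# Assumed separators: &, +, or a comma, with optional whitespace on either side
sep <- "\\s*(&|\\+|,)\\s*"

# str_split() returns a list with one character vector per student
major_list <- str_split(major, sep)

major_list[[1]]  # "Psychology" "Data Science"
major_list[[3]]  # "Statistics" "Economics" "Political Economy"
lengths(major_list)  # 2 2 3 1
```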
