Strings I

Parsing text with regular expressions

Agenda

  • Fundamentals of Character Strings
  • Stringr
  • Regular expressions
    • Literal Characters
    • Metacharacters
      • Wildcards
      • Quantifiers
      • Anchors
    • Character Sets
    • Character Classes

Question

What is the distribution of majors in this course?

library(tidyverse)
survey <- read_csv("data/133-survey-small.csv")
survey
# A tibble: 202 × 2
   major                            excited                                     
   <chr>                            <chr>                                       
 1 Applied mathematics              Learning how to use R                       
 2 Psychology & Data Science        To learn more about how to manipulate and w…
 3 Statistics                       Learning the R programming language properl…
 4 Statistics                       to gain a deeper understanding of R and und…
 5 math major + ds minor            Learn real skills                           
 6 Economics and minor Data Science Being able to advance my knowledge in R     
 7 Political Science                excited to learn how to code in RStudio     
 8 Statistics                       I am most excited about being able to learn…
 9 Statistics + Economics           Learning R and how to process data!!        
10 Statistics                       computation in stats                        
# ℹ 192 more rows

Fundamentals of Strings

A character vector is an ordered collection of strings. length() counts the strings in the vector; nchar() counts the characters in each string.

fruit <- c("apple", "banana")
length(fruit)
[1] 2
nchar(fruit)
[1] 5 6

Which major has the longest name?

survey |>
  mutate(nchar = nchar(major)) |>
  arrange(desc(nchar))
# A tibble: 202 × 3
   major                                                         excited   nchar
   <chr>                                                         <chr>     <int>
 1 Neuroscience, Molecular and Cell Biology, Minor: Data Science I'm exci…    61
 2 Cognitive science and legal studies, data science minor       To learn…    55
 3 Statistics & Environmental Economics and Policy               I am exc…    47
 4 Cognitive Science/Media Studies + Data Minor                  <NA>         44
 5 Applied Mathematics & Integrative Biology                     I want t…    41
 6 Cognitive Science with Data Science minor                     Gaining …    41
 7 Statistics, Economics, Political Economy                      Labs         40
 8 Statistics and Business Administration                        Learning…    38
 9 Integrative Biology and Public Health                         As a PH …    37
10 Legal Studies (Minor: Public Health)                          I am exc…    36
# ℹ 192 more rows

Creating Strings

  • Wrap the characters either in " or '.
  • Combine strings with paste()
paste("ba", "na", "na")
[1] "ba na na"

The sep argument sets the separator placed between the pieces (the default is a single space).

paste("ba", "na", "na", sep = ",")
[1] "ba,na,na"

paste("ba", "na", "na", sep = "")
[1] "banana"
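
As a small sketch of two related base R features not shown above: paste0() is shorthand for paste(..., sep = ""), and the collapse argument joins the elements of a vector into one string. The vectors used here are made up for illustration.

```r
# paste0() behaves like paste(..., sep = "")
word <- paste0("ba", "na", "na")

# paste() is vectorized: the single string "fruit" is recycled across the vector
labels <- paste("fruit", c("apple", "banana"), sep = ": ")

# collapse joins the elements of the result into a single string
joined <- paste(c("apple", "banana"), collapse = " & ")

word
labels
joined
```

sep goes between the arguments, while collapse goes between the elements of the resulting vector.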

Changing case

  • tolower() makes all letter characters lowercase
  • toupper() makes all letter characters uppercase
fruit_yelled <- toupper(fruit)
fruit_yelled
[1] "APPLE"  "BANANA"
tolower(fruit_yelled)
[1] "apple"  "banana"
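
stringr has its own analogs of these functions. The sketch below assumes stringr is installed and uses a made-up vector of majors; str_to_title() capitalizes the first letter of each word.

```r
library(stringr)

# Made-up responses with inconsistent capitalization
majors <- c("applied mathematics", "STATISTICS", "Data science")

lower <- str_to_lower(majors)  # stringr analog of tolower()
upper <- str_to_upper(majors)  # stringr analog of toupper()
title <- str_to_title(majors)  # capitalize the first letter of each word

lower
title
```

Putting every response in the same case is one way to make free-text answers comparable.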

Quotations

"She said, "Naaaah...", and walked away."
Error in parse(text = input): <text>:1:13: unexpected symbol
1: "She said, "Naaaah...
                ^

To include double quotes inside a string, wrap the string in single quotes (').

'She said, "Naaaah...", and walked away.'
[1] "She said, \"Naaaah...\", and walked away."

Escape Characters

You can change the default meaning of a given character by escaping it: prepending it with \.

  • " normally starts or ends a string. \" is the double-quote character.
  • t is normally the lowercase Latin t. \t is a tab.
  • n is normally the lowercase Latin n. \n is a newline.
  • U0001F600 is normally just those characters. \U0001F600 is 😀.
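
A short sketch of how escapes behave in practice, using only base R. Each escape sequence is stored as a single character, and writeLines() prints the characters themselves rather than the escaped form. The variable names are illustrative.

```r
quote_str <- "She said, \"Naaaah...\""
tabbed    <- "a\tb"
two_lines <- "line one\nline two"
backslash <- "\\"   # a literal backslash must itself be escaped

# Each escape sequence counts as one character
nchar(tabbed)
nchar(backslash)

# Printing shows the escaped form; writeLines() shows the actual characters
print(quote_str)
writeLines(quote_str)
writeLines(two_lines)
```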

Stringr

stringr

A package of string manipulation functions that share a consistent interface (an “API”): function names start with str_ and the string is usually the first argument.

  • str_length(): analog of nchar()
  • str_c(): analog of paste()
  • str_view()
  • str_sub()
  • str_detect()
  • str_split()
  • str_match()
  • str_replace()
  • str_extract()
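
As a quick tour, here is a sketch that calls several of these functions on one string. The string is a single value of the kind seen in the survey, and only plain literal patterns are used.

```r
library(stringr)

major <- "Statistics + Economics"

len      <- str_length(major)                       # number of characters
joined   <- str_c("Stat", "istics")                 # concatenate, no separator by default
prefix   <- str_sub(major, 1, 4)                    # characters 1 through 4
has_econ <- str_detect(major, "Econ")               # does the pattern occur?
shorter  <- str_replace(major, "Economics", "Econ") # replace the first match
found    <- str_extract(major, "Econ")              # pull out the first match
pieces   <- str_split(major, " ")                   # returns a list, one element per input string

len
pieces
```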

str_view()

Shows a more readable version of a string. It can also show where a pattern matches.

x <- 'She said, "Naaaah...", and walked away.'
x
[1] "She said, \"Naaaah...\", and walked away."
str_view(x)
[1] │ She said, "Naaaah...", and walked away.

y <- "by only me is your doing, my darling)\n\t\t\t\t\ti fear\U0001F47B"
y
[1] "by only me is your doing, my darling)\n\t\t\t\t\ti fear👻"
str_view(y)
[1] │ by only me is your doing, my darling)
    │ {\t\t\t\t\t}i fear👻

str_sub()

Subset strings by position.

fruit
[1] "apple"  "banana"
str_sub(fruit, start = 1, end = 3)
[1] "app" "ban"

survey |>
  select(major) |>
  mutate(first_letter = str_sub(major, start = 1, end = 1))
# A tibble: 202 × 2
   major                            first_letter
   <chr>                            <chr>       
 1 Applied mathematics              A           
 2 Psychology & Data Science        P           
 3 Statistics                       S           
 4 Statistics                       S           
 5 math major + ds minor            m           
 6 Economics and minor Data Science E           
 7 Political Science                P           
 8 Statistics                       S           
 9 Statistics + Economics           S           
10 Statistics                       S           
# ℹ 192 more rows
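
str_sub() also accepts negative positions, which count backwards from the end of the string. A minimal sketch, reusing the fruit vector from earlier:

```r
library(stringr)

fruit <- c("apple", "banana")

# Negative positions count from the end: -1 is the last character
last_three <- str_sub(fruit, start = -3, end = -1)

# end defaults to -1, so this drops only the first character
all_but_first <- str_sub(fruit, start = 2)

last_three
all_but_first
```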

str_detect()

Detects the presence or absence of a pattern in each string.

str_detect(fruit, pattern = "apple")
[1]  TRUE FALSE
str_detect(fruit, pattern = "an")
[1] FALSE  TRUE
str_view(fruit, pattern = "an")
[2] │ b<an><an>a
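
Because str_detect() returns a logical vector, its result can be counted or averaged directly. The sketch below also uses the negate argument and str_count(), both from stringr, on the same fruit vector.

```r
library(stringr)

fruit <- c("apple", "banana")

has_an <- str_detect(fruit, "an")

n_with_an <- sum(has_an)   # TRUE counts as 1, so this counts matching strings
p_with_an <- mean(has_an)  # proportion of strings that match

# negate = TRUE flips the result: strings that do NOT contain the pattern
no_an <- str_detect(fruit, "an", negate = TRUE)

# str_count() reports how many times the pattern occurs in each string
n_matches <- str_count(fruit, "an")

n_with_an
p_with_an
n_matches
```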

survey |>
  select(major) |>
  filter(str_detect(major, "Statistics"))
# A tibble: 64 × 1
   major                                   
   <chr>                                   
 1 Statistics                              
 2 Statistics                              
 3 Statistics                              
 4 Statistics + Economics                  
 5 Statistics                              
 6 Statistics, Economics, Political Economy
 7 Statistics                              
 8 Statistics + Integrative Biology        
 9 Statistics                              
10 Statistics                              
# ℹ 54 more rows

survey |>
  select(major) |>
  filter(str_detect(major, "Statistics")) |>
  arrange(desc(row_number()))
# A tibble: 64 × 1
   major                             
   <chr>                             
 1 Statistics                        
 2 Statistics                        
 3 Statistics/Pure Mathematics       
 4 Statistics                        
 5 Economics, Statistics             
 6 Statistics                        
 7 Applied Mathematics and Statistics
 8 Statistics                        
 9 Statistics                        
10 Statistics                        
# ℹ 54 more rows

survey |>
  select(major) |>
  filter(str_detect(major, "Stat"))
# A tibble: 75 × 1
   major                                   
   <chr>                                   
 1 Statistics                              
 2 Statistics                              
 3 Statistics                              
 4 Statistics + Economics                  
 5 Statistics                              
 6 Statistics, Economics, Political Economy
 7 Statistics                              
 8 Statistics + Integrative Biology        
 9 Statistics                              
10 Statistics                              
# ℹ 65 more rows

survey |>
  select(major) |>
  filter(str_detect(major, "stat"))
# A tibble: 11 × 1
   major                          
   <chr>                          
 1 statistics                     
 2 statistics                     
 3 statistic                      
 4 statistic                      
 5 Engineering Math and statistics
 6 statistics and economics       
 7 statistics                     
 8 stats,applied math            
 9 statistics                     
10 stats                          
11 statistic                      

How do we (succinctly) specify that we want to detect “Statistics”, “Stat”, “statistics”, and “stat”?

Either an upper- or lower-case “s”, followed by “tat”, anywhere in the string.

Regular expressions

A sequence of characters expressing a string or pattern to be searched for within a longer piece of text.

  1. Literal characters
  2. Metacharacters
  3. Character sets
  4. Character classes

Literal characters

Match the letter “a”.

pattern <- "a"
str_view(survey$major, pattern)
 [1] │ Applied m<a>them<a>tics
 [2] │ Psychology & D<a>t<a> Science
 [3] │ St<a>tistics
 [4] │ St<a>tistics
 [5] │ m<a>th m<a>jor + ds minor
 [6] │ Economics <a>nd minor D<a>t<a> Science
 [7] │ Politic<a>l Science
 [8] │ St<a>tistics
 [9] │ St<a>tistics + Economics
[10] │ St<a>tistics
[11] │ St<a>tistics, Economics, Politic<a>l Economy
[12] │ st<a>tistics
[15] │ Applied M<a>th
[16] │ St<a>tistics
[17] │ Public He<a>lth
[18] │ St<a>tistics + Integr<a>tive Biology
[19] │ Applied M<a>th
[21] │ St<a>tistics
[22] │ Public he<a>lth
[24] │ St<a>tistics
... and 139 more

Literal characters

Match the letters “ta”.

pattern <- "ta"
str_view(survey$major, pattern)
 [2] │ Psychology & Da<ta> Science
 [3] │ S<ta>tistics
 [4] │ S<ta>tistics
 [6] │ Economics and minor Da<ta> Science
 [8] │ S<ta>tistics
 [9] │ S<ta>tistics + Economics
[10] │ S<ta>tistics
[11] │ S<ta>tistics, Economics, Political Economy
[12] │ s<ta>tistics
[16] │ S<ta>tistics
[18] │ S<ta>tistics + Integrative Biology
[21] │ S<ta>tistics
[24] │ S<ta>tistics
[25] │ s<ta>tistics
[26] │ S<ta>tistics
[27] │ Da<ta> Science and Applied Math
[31] │ S<ta>tistics
[32] │ Da<ta> Science & S<ta>ts
[34] │ S<ta>tistics
[35] │ Da<ta> Science
... and 93 more

Literal characters

Match the letters “sta”.

pattern <- "sta"
str_view(survey$major, pattern)
 [12] │ <sta>tistics
 [25] │ <sta>tistics
 [42] │ <sta>tistic
 [59] │ <sta>tistic
 [80] │ Engineering Math and <sta>tistics
[116] │ <sta>tistics and economics
[117] │ <sta>tistics
[122] │ <sta>ts,applied math
[140] │ <sta>tistics
[142] │ <sta>ts
[164] │ <sta>tistic

Metacharacters

Non-letter characters with special meaning.

  • .
  • ?
  • +
  • *
  • [
  • ]
  • etc.

Wildcard .

. matches any single character (by default, any character except a newline).

str_view(fruit, pattern = ".")
[1] │ <a><p><p><l><e>
[2] │ <b><a><n><a><n><a>
str_view(fruit, pattern = "a.")
[1] │ <ap>ple
[2] │ b<an><an>a

“a” followed by any character.

Any character followed by the characters “tat”.

str_view(survey$major, pattern = ".tat")
 [3] │ <Stat>istics
 [4] │ <Stat>istics
 [8] │ <Stat>istics
 [9] │ <Stat>istics + Economics
[10] │ <Stat>istics
[11] │ <Stat>istics, Economics, Political Economy
[12] │ <stat>istics
[16] │ <Stat>istics
[18] │ <Stat>istics + Integrative Biology
[21] │ <Stat>istics
[24] │ <Stat>istics
[25] │ <stat>istics
[26] │ <Stat>istics
[31] │ <Stat>istics
[32] │ Data Science & <Stat>s
[34] │ <Stat>istics
[37] │ <Stat>istics
[38] │ <Stat>istics
[42] │ <stat>istic
[43] │ Data Science, <Stat>istics
... and 66 more

But what if the string has an actual period?

survey$excited[3]
[1] "Learning the R programming language properly."

How do I match "properly." but not "properly!" (or any other character)?

str_view(c("properly.", "properly!"),
           pattern = "properly\\.")
[1] │ <properly.>

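In the R string "properly\\.", the first \ escapes the second, so the string contains a single \. The regex engine then reads \. as a literal period rather than as the wildcard.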
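
The difference is easy to check on the two test strings. This sketch compares the unescaped pattern, the escaped pattern, and fixed(), a stringr helper that treats the whole pattern as literal text.

```r
library(stringr)

x <- c("properly.", "properly!")

# Unescaped: . is a wildcard, so both strings match
unescaped <- str_detect(x, "properly.")

# Escaped: only a literal period matches
escaped <- str_detect(x, "properly\\.")

# writeLines() shows what the regex engine actually receives
writeLines("properly\\.")

# fixed() turns off regex interpretation entirely
literal <- str_detect(x, fixed("properly."))

unescaped
escaped
literal
```

fixed() is convenient when a pattern contains no metacharacters that are meant to be special.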

Quantifiers

  • ?: Matches 0 or 1 of the preceding character
  • +: Matches 1 or more of the preceding character
  • *: Matches 0 or more of the preceding character
  • {3}: Matches exactly three of the preceding character

# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")
[2] │ <ab>
[3] │ <abb>
# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")
[1] │ <a>
[2] │ <ab>
[3] │ <abb>

# ab{2} matches an "a", followed by exactly 2 "b"s.
str_view(c("a", "ab", "abb"), "ab{2}")
[3] │ <abb>
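
The braces also accept a range. As a sketch, {1,2} means between one and two, and {2,} means two or more. str_extract() is used here instead of str_view() so the matched text is returned as a vector, with NA where nothing matches.

```r
library(stringr)

x <- c("a", "ab", "abb", "abbb")

# Between 1 and 2 "b"s; quantifiers are greedy, so the longest allowed match is taken
one_to_two <- str_extract(x, "ab{1,2}")

# 2 or more "b"s
two_plus <- str_extract(x, "ab{2,}")

# 0 or more "b"s, for comparison
zero_plus <- str_extract(x, "ab*")

one_to_two
two_plus
zero_plus
```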

Character Sets

You can demarcate a set of characters with []. The set matches any one of the characters inside it.

str_view(fruit, pattern = "[ab]")
[1] │ <a>pple
[2] │ <b><a>n<a>n<a>
str_view(fruit, pattern = "[aeiou]")
[1] │ <a>ppl<e>
[2] │ b<a>n<a>n<a>
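
Two further features of sets, sketched here as a supplement: a hyphen gives a range, such as [0-9], and a leading ^ negates the set. The room strings are made up for illustration.

```r
library(stringr)

fruit <- c("apple", "banana")

# Count the vowels in each string
n_vowels <- str_count(fruit, "[aeiou]")

# [^aeiou] matches any character that is NOT a vowel;
# with + it extracts the first run of non-vowels
first_consonants <- str_extract(fruit, "[^aeiou]+")

# [0-9] is a range: any single digit
rooms  <- c("room 12", "lab 7")
digits <- str_extract(rooms, "[0-9]+")

n_vowels
first_consonants
digits
```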

Your Turn

str_view('She said, "Naaaah...", and walked away.', p1)

Question: Write down 5 different regular expressions to match <Naaaah>.

Your Turn

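# Each of these patterns matches "Naaaah"; p keeps only the last one assigned.
p <- "Naaaah"      # literal characters
p <- "Na+h"        # one or more "a"
p <- "Na{4}h"      # exactly four "a"
p <- "N[aeiou]+h"  # one or more vowels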
str_view('She said, "Naaaah...", and walked away.', p)
[1] │ She said, "<Naaaah>...", and walked away.

How do we (succinctly) specify that we want to detect “Statistics”, “Stat”, “statistics”, and “stat”?

Either an upper- or lower-case “s”, followed by “tat”, anywhere in the string.

"[sS]tat"

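A sketch of the pattern [sS]tat on a made-up vector of responses, next to one alternative: regex(ignore_case = TRUE) from stringr. The two are not equivalent, since the case-insensitive version also accepts all-caps responses.

```r
library(stringr)

# Made-up responses, not taken from the survey
majors <- c("Statistics", "stats", "Data Science", "STATISTICS")

# Character set: upper- or lower-case s, then lowercase "tat"
with_set <- str_detect(majors, "[sS]tat")

# Alternative: ignore case for the whole pattern
ignoring_case <- str_detect(majors, regex("stat", ignore_case = TRUE))

with_set
ignoring_case
```

Which version is appropriate depends on how much variation in capitalization the responses contain.
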
What is the distribution of majors in this course?

What proportion of students are majoring in Stats?

survey |>
  select(major) |>
  mutate(view_stat = str_view(major, pattern = "[sS]tat", match = NA))
# A tibble: 202 × 2
   major                            view_stat                       
   <chr>                            <strngr_v>                      
 1 Applied mathematics              Applied mathematics             
 2 Psychology & Data Science        Psychology & Data Science       
 3 Statistics                       <Stat>istics                    
 4 Statistics                       <Stat>istics                    
 5 math major + ds minor            math major + ds minor           
 6 Economics and minor Data Science Economics and minor Data Science
 7 Political Science                Political Science               
 8 Statistics                       <Stat>istics                    
 9 Statistics + Economics           <Stat>istics + Economics        
10 Statistics                       <Stat>istics                    
# ℹ 192 more rows

survey |>
  select(major) |>
  mutate(view_stat = str_view(major, pattern = "[sS]tat", match = NA),
         is_stats  = str_detect(major, pattern = "[sS]tat"))
# A tibble: 202 × 3
   major                            view_stat                        is_stats
   <chr>                            <strngr_v>                       <lgl>   
 1 Applied mathematics              Applied mathematics              FALSE   
 2 Psychology & Data Science        Psychology & Data Science        FALSE   
 3 Statistics                       <Stat>istics                     TRUE    
 4 Statistics                       <Stat>istics                     TRUE    
 5 math major + ds minor            math major + ds minor            FALSE   
 6 Economics and minor Data Science Economics and minor Data Science FALSE   
 7 Political Science                Political Science                FALSE   
 8 Statistics                       <Stat>istics                     TRUE    
 9 Statistics + Economics           <Stat>istics + Economics         TRUE    
10 Statistics                       <Stat>istics                     TRUE    
# ℹ 192 more rows

survey |>
  select(major) |>
  mutate(view_stat = str_view(major, pattern = "[sS]tat", match = NA),
         is_stats  = str_detect(major, pattern = "[sS]tat")) |>
  summarize(p_stat = mean(is_stats))
# A tibble: 1 × 1
  p_stat
   <dbl>
1  0.426
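
The same computation can be checked by hand on a small example. The sketch below uses a made-up four-row tibble (not the survey data) and reports the count alongside the proportion. Note that str_detect() returns NA for a missing value, so with missing majors mean() would need na.rm = TRUE.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: 2 of the 4 responses mention statistics
toy <- tibble(major = c("Statistics",
                        "statistics and economics",
                        "Data Science",
                        "Applied Math"))

result <- toy |>
  mutate(is_stats = str_detect(major, pattern = "[sS]tat")) |>
  summarize(n      = n(),             # number of responses
            n_stat = sum(is_stats),   # TRUE counts as 1
            p_stat = mean(is_stats))  # proportion of TRUE values

result
```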