Strings I

Parsing text with regular expressions

Agenda

Fundamentals of Character Strings
Stringr
Regular expressions
- Literal Characters
- Metacharacters
  - Wildcards
  - Quantifiers
  - Anchors
- Character Sets
- Character Classes

Question

What is the distribution of majors in this course?

library(tidyverse)
survey <- read_csv("data/133-survey-small.csv")
survey

# A tibble: 202 × 2
   major                            excited                                     
   <chr>                            <chr>                                       
 1 Applied mathematics              Learning how to use R                       
 2 Psychology & Data Science        To learn more about how to manipulate and w…
 3 Statistics                       Learning the R programming language properl…
 4 Statistics                       to gain a deeper understanding of R and und…
 5 math major + ds minor            Learn real skills                           
 6 Economics and minor Data Science Being able to advance my knowledge in R     
 7 Political Science                excited to learn how to code in RStudio     
 8 Statistics                       I am most excited about being able to learn…
 9 Statistics + Economics           Learning R and how to process data!!        
10 Statistics                       computation in stats                        
# ℹ 192 more rows

Fundamentals of Strings

A character vector is an ordered collection of strings. length() counts the strings in the vector; nchar() counts the characters in each string.

fruit <- c("apple", "banana")
length(fruit)

[1] 2

nchar(fruit)

[1] 5 6

Which major has the longest name?

survey |>
  mutate(nchar = nchar(major)) |>
  arrange(desc(nchar))

# A tibble: 202 × 3
   major                                                         excited   nchar
   <chr>                                                         <chr>     <int>
 1 Neuroscience, Molecular and Cell Biology, Minor: Data Science I'm exci…    61
 2 Cognitive science and legal studies, data science minor       To learn…    55
 3 Statistics & Environmental Economics and Policy               I am exc…    47
 4 Cognitive Science/Media Studies + Data Minor                  <NA>         44
 5 Applied Mathematics & Integrative Biology                     I want t…    41
 6 Cognitive Science with Data Science minor                     Gaining …    41
 7 Statistics, Economics, Political Economy                      Labs         40
 8 Statistics and Business Administration                        Learning…    38
 9 Integrative Biology and Public Health                         As a PH …    37
10 Legal Studies (Minor: Public Health)                          I am exc…    36
# ℹ 192 more rows

Creating Strings

Wrap the characters either in " or '.
Combine strings with paste()

paste("ba", "na", "na")

[1] "ba na na"

paste("ba", "na", "na", sep = ",")

[1] "ba,na,na"

paste("ba", "na", "na", sep = "")

[1] "banana"
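The three paste() calls above differ only in sep. As a small sketch of two related base R tools that the slides do not cover, paste0() and the collapse argument (the vector name syllables is made up for illustration):

```r
# paste0() is shorthand for paste(..., sep = "")
paste0("ba", "na", "na")
#> [1] "banana"

# sep goes BETWEEN the arguments; paste() is vectorised over them
labelled <- paste("fruit", c("apple", "banana"), sep = ": ")
labelled   # "fruit: apple" and "fruit: banana"

# collapse joins the elements of ONE vector into a single string
syllables <- c("ba", "na", "na")   # hypothetical example vector
joined <- paste(syllables, collapse = "")
joined
#> [1] "banana"
```

A rule of thumb: use sep when gluing several arguments together, and collapse when squashing one vector down to a single string.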

Changing case

tolower() makes all letter characters lowercase
toupper() makes all letter characters uppercase

fruit_yelled <- toupper(fruit)
fruit_yelled

[1] "APPLE"  "BANANA"

tolower(fruit_yelled)

[1] "apple"  "banana"
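One place where changing case tends to help is comparing free-text answers that differ only in capitalisation. A minimal sketch, using a made-up vector majors rather than the survey data:

```r
# Hypothetical responses that differ only in case
majors <- c("Statistics", "statistics", "STATISTICS")

# A direct comparison is case sensitive
majors == "statistics"
#> [1] FALSE  TRUE FALSE

# Normalising the case first makes all three compare equal
same_major <- tolower(majors) == "statistics"
same_major
#> [1] TRUE TRUE TRUE
```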

Quotations

"She said, "Naaaah...", and walked away."

Error in parse(text = input): <text>:1:13: unexpected symbol
1: "She said, "Naaaah...
                ^

Include a double-quoted quotation in a string by wrapping the whole string in single quotes (').

'She said, "Naaaah...", and walked away.'

[1] "She said, \"Naaaah...\", and walked away."

Escape Characters

You can change the default meaning of a given character by escaping it: prepending it with \.

" normally starts or ends a string. \" is the double-quote character.
t normally is the lowercase Latin t. \t is a tab character.
n normally is the lowercase Latin n. \n is a new line.
U0001F600 normally is just those nine characters. \U0001F600 is 😀.
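To see what the escapes above actually put in a string, it can help to compare print(), which shows the escaped form, with writeLines(), which shows the raw contents. A small sketch (the string s is made up for illustration):

```r
# A string containing a tab, a newline, and two escaped double quotes
s <- "tab:\there\nquote: \"hi\""

print(s)        # shows the escapes as typed: \t, \n, \"
writeLines(s)   # shows the real tab, line break, and quote characters

# Each escape sequence is stored as ONE character
nchar("\t")
#> [1] 1
nchar("\"")
#> [1] 1
```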

Stringr

`stringr`

A set of functions for string manipulation with a consistent interface (an “API”): every function name starts with str_ and takes the character vector as its first argument.

str_length(): analog of nchar()
str_c(): analog of paste()
str_view()
str_sub()
str_detect()

str_split()
str_match()
str_replace()
str_extract()
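As a rough sketch of how the first two functions in this list compare with their base R analogs (the differences shown are the default separator and the treatment of NA):

```r
library(stringr)

fruit <- c("apple", "banana")

# str_length() gives the same counts as nchar() here
str_length(fruit)
#> [1] 5 6

# str_c() defaults to sep = "", whereas paste() defaults to sep = " "
str_c("ba", "na", "na")
#> [1] "banana"
paste("ba", "na", "na")
#> [1] "ba na na"

# Missing values: paste() turns NA into the text "NA", str_c() keeps it missing
paste("a", NA)
#> [1] "a NA"
str_c("a", NA)
#> [1] NA
```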

`str_view()`

Shows a more readable version of strings, and can also show how a pattern matches.

x <- 'She said, "Naaaah...", and walked away.'
x

[1] "She said, \"Naaaah...\", and walked away."

str_view(x)

[1] │ She said, "Naaaah...", and walked away.

y <- "by only me is your doing, my darling)\n\t\t\t\t\ti fear\U0001F47B"
y

[1] "by only me is your doing, my darling)\n\t\t\t\t\ti fear👻"

str_view(y)

[1] │ by only me is your doing, my darling)
    │ {\t\t\t\t\t}i fear👻

`str_sub()`

Subset strings by position.

fruit

[1] "apple"  "banana"

str_sub(fruit, start = 1, end = 3)

[1] "app" "ban"

survey |>
  select(major) |>
  mutate(first_letter = str_sub(major, start = 1, end = 1))

# A tibble: 202 × 2
   major                            first_letter
   <chr>                            <chr>       
 1 Applied mathematics              A           
 2 Psychology & Data Science        P           
 3 Statistics                       S           
 4 Statistics                       S           
 5 math major + ds minor            m           
 6 Economics and minor Data Science E           
 7 Political Science                P           
 8 Statistics                       S           
 9 Statistics + Economics           S           
10 Statistics                       S           
# ℹ 192 more rows
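str_sub() also accepts negative positions, which count backwards from the end of each string. A short sketch on the same fruit vector:

```r
library(stringr)

fruit <- c("apple", "banana")

# Last three characters: -1 is the final character, -3 the third from last
last_three <- str_sub(fruit, start = -3, end = -1)
last_three
#> [1] "ple" "ana"

# Omitting end keeps everything from start to the end of the string
from_second <- str_sub(fruit, start = 2)
from_second   # "pple" and "anana"
```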

`str_detect()`

Detects the presence or absence of a pattern in each string and returns TRUE or FALSE.

str_detect(fruit, pattern = "apple")

[1]  TRUE FALSE

str_detect(fruit, pattern = "an")

[1] FALSE  TRUE

str_view(fruit, pattern = "an")

[2] │ b<an><an>a

survey |>
  select(major) |>
  filter(str_detect(major, "Statistics"))

# A tibble: 64 × 1
   major                                   
   <chr>                                   
 1 Statistics                              
 2 Statistics                              
 3 Statistics                              
 4 Statistics + Economics                  
 5 Statistics                              
 6 Statistics, Economics, Political Economy
 7 Statistics                              
 8 Statistics + Integrative Biology        
 9 Statistics                              
10 Statistics                              
# ℹ 54 more rows

survey |>
  select(major) |>
  filter(str_detect(major, "Statistics")) |>
  arrange(desc(row_number()))

# A tibble: 64 × 1
   major                             
   <chr>                             
 1 Statistics                        
 2 Statistics                        
 3 Statistics/Pure Mathematics       
 4 Statistics                        
 5 Economics, Statistics             
 6 Statistics                        
 7 Applied Mathematics and Statistics
 8 Statistics                        
 9 Statistics                        
10 Statistics                        
# ℹ 54 more rows

survey |>
  select(major) |>
  filter(str_detect(major, "Stat"))

# A tibble: 75 × 1
   major                                   
   <chr>                                   
 1 Statistics                              
 2 Statistics                              
 3 Statistics                              
 4 Statistics + Economics                  
 5 Statistics                              
 6 Statistics, Economics, Political Economy
 7 Statistics                              
 8 Statistics + Integrative Biology        
 9 Statistics                              
10 Statistics                              
# ℹ 65 more rows

survey |>
  select(major) |>
  filter(str_detect(major, "stat"))

# A tibble: 11 × 1
   major                          
   <chr>                          
 1 statistics                     
 2 statistics                     
 3 statistic                      
 4 statistic                      
 5 Engineering Math and statistics
 6 statistics and economics       
 7 statistics                     
 8 stats，applied math            
 9 statistics                     
10 stats                          
11 statistic

How do we (succinctly) specify that we want to detect “Statistics”, “Stat”, “statistics”, and “stat”?

Either upper- or lower-case “s”, followed by “tat”, anywhere in the string.

Regular expressions

A sequence of characters expressing a string or pattern to be searched for within a longer piece of text.

Literal characters
Metacharacters
Character sets
Character classes

Literal characters

Match the letter “a”.

pattern <- "a"

str_view(survey$major, pattern)

 [1] │ Applied m<a>them<a>tics
 [2] │ Psychology & D<a>t<a> Science
 [3] │ St<a>tistics
 [4] │ St<a>tistics
 [5] │ m<a>th m<a>jor + ds minor
 [6] │ Economics <a>nd minor D<a>t<a> Science
 [7] │ Politic<a>l Science
 [8] │ St<a>tistics
 [9] │ St<a>tistics + Economics
[10] │ St<a>tistics
[11] │ St<a>tistics, Economics, Politic<a>l Economy
[12] │ st<a>tistics
[15] │ Applied M<a>th
[16] │ St<a>tistics
[17] │ Public He<a>lth
[18] │ St<a>tistics + Integr<a>tive Biology
[19] │ Applied M<a>th
[21] │ St<a>tistics
[22] │ Public he<a>lth
[24] │ St<a>tistics
... and 139 more

Literal characters

Match the letters “ta”.

pattern <- "ta"

str_view(survey$major, pattern)

 [2] │ Psychology & Da<ta> Science
 [3] │ S<ta>tistics
 [4] │ S<ta>tistics
 [6] │ Economics and minor Da<ta> Science
 [8] │ S<ta>tistics
 [9] │ S<ta>tistics + Economics
[10] │ S<ta>tistics
[11] │ S<ta>tistics, Economics, Political Economy
[12] │ s<ta>tistics
[16] │ S<ta>tistics
[18] │ S<ta>tistics + Integrative Biology
[21] │ S<ta>tistics
[24] │ S<ta>tistics
[25] │ s<ta>tistics
[26] │ S<ta>tistics
[27] │ Da<ta> Science and Applied Math
[31] │ S<ta>tistics
[32] │ Da<ta> Science & S<ta>ts
[34] │ S<ta>tistics
[35] │ Da<ta> Science
... and 93 more

Literal characters

Match the letters “sta”.

pattern <- "sta"

str_view(survey$major, pattern)

 [12] │ <sta>tistics
 [25] │ <sta>tistics
 [42] │ <sta>tistic
 [59] │ <sta>tistic
 [80] │ Engineering Math and <sta>tistics
[116] │ <sta>tistics and economics
[117] │ <sta>tistics
[122] │ <sta>ts，applied math
[140] │ <sta>tistics
[142] │ <sta>ts
[164] │ <sta>tistic
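Literal patterns are case sensitive, which is why "sta" finds only the lowercase entries above. A minimal sketch on a made-up vector (majors here is illustrative, not the survey column):

```r
library(stringr)

# Hypothetical responses
majors <- c("Statistics", "statistics", "Data Science & Stats")

# Lowercase pattern: only the all-lowercase entry matches
lower_hit <- str_detect(majors, "sta")
lower_hit
#> [1] FALSE  TRUE FALSE

# Capitalised pattern: the other two match instead
upper_hit <- str_detect(majors, "Sta")
upper_hit
#> [1]  TRUE FALSE  TRUE
```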

Metacharacters

Non-letter characters that have a special meaning in a pattern instead of matching themselves.

.
?
+
*
[
]
etc.

Wildcard `.`

. matches any single character (except a newline, by default).

str_view(fruit, pattern = ".")

[1] │ <a><p><p><l><e>
[2] │ <b><a><n><a><n><a>

str_view(fruit, pattern = "a.")

[1] │ <ap>ple
[2] │ b<an><an>a

“a” followed by any character.

Any character followed by the characters “tat”.

str_view(survey$major, pattern = ".tat")

 [3] │ <Stat>istics
 [4] │ <Stat>istics
 [8] │ <Stat>istics
 [9] │ <Stat>istics + Economics
[10] │ <Stat>istics
[11] │ <Stat>istics, Economics, Political Economy
[12] │ <stat>istics
[16] │ <Stat>istics
[18] │ <Stat>istics + Integrative Biology
[21] │ <Stat>istics
[24] │ <Stat>istics
[25] │ <stat>istics
[26] │ <Stat>istics
[31] │ <Stat>istics
[32] │ Data Science & <Stat>s
[34] │ <Stat>istics
[37] │ <Stat>istics
[38] │ <Stat>istics
[42] │ <stat>istic
[43] │ Data Science, <Stat>istics
... and 66 more

But what if the string has an actual period?

survey$excited[3]

[1] "Learning the R programming language properly."

How do I match "properly." but not "properly!" (or any other character)?

str_view(c("properly.", "properly!"),
           pattern = "properly\\.")

[1] │ <properly.>

Inside an R string, \\ stands for a single backslash, so the regex engine receives \. and treats the period as a literal character instead of a wildcard.
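One way to check what the regex engine receives is to print the pattern with writeLines(). The sketch below also shows stringr's fixed(), which is not used elsewhere in these slides and treats the whole pattern as literal text:

```r
library(stringr)

# The R string "properly\\." contains a single backslash
writeLines("properly\\.")
#> properly\.

x <- c("properly.", "properly!")

# Escaped period: only the literal "." matches
escaped <- str_detect(x, "properly\\.")
escaped
#> [1]  TRUE FALSE

# Unescaped period: "." is a wildcard, so "!" matches too
unescaped <- str_detect(x, "properly.")
unescaped
#> [1] TRUE TRUE

# fixed() switches off regex interpretation entirely
literal <- str_detect(x, fixed("properly."))
literal
#> [1]  TRUE FALSE
```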

Quantifiers

?: Matches 0 or 1 of the preceding character
+: Matches 1 or more of the preceding character
*: Matches 0 or more of the preceding character
{3}: Matches exactly three of the preceding character

# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")

[1] │ <a>
[2] │ <ab>
[3] │ <ab>b

# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")

[2] │ <ab>
[3] │ <abb>

# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")

[1] │ <a>
[2] │ <ab>
[3] │ <abb>

# ab{2} matches an "a", followed by exactly 2 "b"s.
str_view(c("a", "ab", "abb"), "ab{2}")

[3] │ <abb>
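The braces also accept a range, which the list above does not show: {n,m} means between n and m repetitions and {n,} means at least n. A small sketch using str_extract() from the stringr function list, on a slightly longer made-up test vector:

```r
library(stringr)

x <- c("a", "ab", "abb", "abbb")

# "a" followed by one or two "b"s; the match takes as many "b"s as allowed
one_or_two <- str_extract(x, "ab{1,2}")
one_or_two     # NA, "ab", "abb", "abb"

# "a" followed by at least two "b"s
two_or_more <- str_extract(x, "ab{2,}")
two_or_more    # NA, NA, "abb", "abbb"
```

str_extract() returns NA where there is no match, which makes it easy to see which strings failed.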

Character Sets

You can demarcate a set of characters with []. The set matches any one of the characters inside it.

str_view(fruit, pattern = "[ab]")

[1] │ <a>pple
[2] │ <b><a>n<a>n<a>

str_view(fruit, pattern = "[aeiou]")

[1] │ <a>ppl<e>
[2] │ b<a>n<a>n<a>
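Two extensions of the set syntax that are not shown above but are standard in regular expressions: a leading ^ negates the set, and a hyphen gives a range. A brief sketch, using str_count() and str_extract() from stringr:

```r
library(stringr)

fruit <- c("apple", "banana")

# Count the vowels in each string
n_vowels <- str_count(fruit, "[aeiou]")
n_vowels
#> [1] 2 3

# [^aeiou] matches any character that is NOT a vowel; extract the first one
first_consonant <- str_extract(fruit, "[^aeiou]")
first_consonant
#> [1] "p" "b"

# [A-Z] matches any single uppercase letter
has_upper <- str_detect(c("apple", "Apple"), "[A-Z]")
has_upper
#> [1] FALSE  TRUE
```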

Your Turn

str_view('She said, "Naaaah...", and walked away.', p1)

Question: Write down 5 different regular expressions to match <Naaaah>.

Your Turn

p <- "Naaaah"
p <- "Na+h"
p <- "Na{4}h"
p <- "N[aeiou]+h"

str_view('She said, "Naaaah...", and walked away.', p)

[1] │ She said, "<Naaaah>...", and walked away.

How do we (succinctly) specify that we want to detect “Statistics”, “Stat”, “statistics”, and “stat”?

Either upper- or lower-case “s”, followed by “tat”, anywhere in the string.

"[sS]tat"

What is the distribution of majors in this course?

What proportion of students major in Stats?

survey |>
  select(major) |>
  mutate(view_stat = str_view(major, pattern = "[sS]tat", match = NA))

# A tibble: 202 × 2
   major                            view_stat                       
   <chr>                            <strngr_v>                      
 1 Applied mathematics              Applied mathematics             
 2 Psychology & Data Science        Psychology & Data Science       
 3 Statistics                       <Stat>istics                    
 4 Statistics                       <Stat>istics                    
 5 math major + ds minor            math major + ds minor           
 6 Economics and minor Data Science Economics and minor Data Science
 7 Political Science                Political Science               
 8 Statistics                       <Stat>istics                    
 9 Statistics + Economics           <Stat>istics + Economics        
10 Statistics                       <Stat>istics                    
# ℹ 192 more rows

survey |>
  select(major) |>
  mutate(view_stat = str_view(major, pattern = "[sS]tat", match = NA),
         is_stats  = str_detect(major, pattern = "[sS]tat"))

# A tibble: 202 × 3
   major                            view_stat                        is_stats
   <chr>                            <strngr_v>                       <lgl>   
 1 Applied mathematics              Applied mathematics              FALSE   
 2 Psychology & Data Science        Psychology & Data Science        FALSE   
 3 Statistics                       <Stat>istics                     TRUE    
 4 Statistics                       <Stat>istics                     TRUE    
 5 math major + ds minor            math major + ds minor            FALSE   
 6 Economics and minor Data Science Economics and minor Data Science FALSE   
 7 Political Science                Political Science                FALSE   
 8 Statistics                       <Stat>istics                     TRUE    
 9 Statistics + Economics           <Stat>istics + Economics         TRUE    
10 Statistics                       <Stat>istics                     TRUE    
# ℹ 192 more rows

survey |>
  select(major) |>
  mutate(view_stat = str_view(major, pattern = "[sS]tat", match = NA),
         is_stats  = str_detect(major, pattern = "[sS]tat")) |>
  summarize(p_stat = mean(is_stats))

# A tibble: 1 × 1
  p_stat
   <dbl>
1  0.426
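The last step works because mean() of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up vector (majors below is hypothetical, not the survey data) that also shows what a missing response does to the result, plus a case-insensitive alternative using stringr's regex():

```r
library(stringr)

# Hypothetical responses, including one missing value
majors <- c("Statistics", "stats", "Economics", "STATISTICS", NA)

is_stats <- str_detect(majors, "[sS]tat")
is_stats
#> [1]  TRUE  TRUE FALSE FALSE    NA

# A single NA makes the plain mean NA; na.rm = TRUE drops it first
mean(is_stats)
#> [1] NA
p_stat <- mean(is_stats, na.rm = TRUE)
p_stat
#> [1] 0.5

# Ignoring case entirely also catches the all-caps response
is_stats_any_case <- str_detect(majors, regex("stat", ignore_case = TRUE))
p_any_case <- mean(is_stats_any_case, na.rm = TRUE)
p_any_case
#> [1] 0.75
```

Whether to drop missing responses or count them as non-Stats depends on the question being asked; the sketch only shows the mechanics.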
