Practice: Regular Expressions (part 2)

Stat 133

Author

Gaston Sanchez

Learning Objectives
  • Getting familiar with regex functions from "stringr"
  • Use regex operations to clean/process “messy” data
  • Focus on detection and extraction of string patterns

1 Regular Expressions with "stringr"

library(tidyverse) # data wrangling and graphics
library(rvest)     # for web scraping
library(plotly)    # web-interactive plots

The goal of this module is for you to get a first contact with the most common regex functions in "stringr" (see below). While R provides a set of built-in functions for regex pattern matching, I prefer to use "stringr" functions which are more consistent and user friendly.

To be more precise, the purpose of this document is to give you a little bit of regex practice with functions such as:

  • str_match()
  • str_detect()
  • str_extract()
  • str_locate()
  • str_sub()
  • str_split()
  • str_remove()
  • str_replace()

1.1 Stringr cheatsheet

We recommend having at hand the stringr cheatsheet (available in bCourses)

https://bcourses.berkeley.edu/courses/1526481/files/folder/cheatsheets?preview=86322354


2 Women’s Discus Data

You will be working with the data set for the Women’s discus throw world record progression, available in wikipedia:

https://en.wikipedia.org/wiki/Women%27s_discus_throw_world_record_progression

2.1 Getting the Data

Perhaps the simplest and most straightforward way to get the data in R is by passing the URL to a function that lets you import HTML files such as read_html():

# assemble url
wiki = "https://en.wikipedia.org/wiki/"
women_discus = "Women%27s_discus_throw_world_record_progression"
url = paste0(wiki, women_discus)

# import HTML file
doc = read_html(url)
class(doc)
[1] "xml_document" "xml_node"    

As you can tell, doc is an object of class "xml_document" that contains, among other things, the HTML table with the data of discus world records.

2.2 Another option to get the data

Another option to get the data involves a 2-step process:

  1. download the html file of the wikipedia page to your working directory,

  2. then import this file in R.

This is my preferred method for working with data from the web, following good practices to avoid making recurrent and unnecessary requests to the server where the data is hosted.

# step 1) download a copy of the HTML file, 
# (run this code in R's console!!!)
wiki = "https://en.wikipedia.org/wiki/"
women_discus = "Women%27s_discus_throw_world_record_progression"
url = paste0(wiki, women_discus)
download.file(url, "women-discus-throw.html")

Assuming that you’ve downloaded the html content in the file women-discus-throw.html, and that this file is in your working directory, you can then import it with read_html():

# step 2) extract HTML table into a data.frame
doc = read_html("women-discus-throw.html")

To extract the HTML tables from doc you use the html_table() function:

tbls = html_table(doc)
length(tbls)
[1] 4

2.3 Raw (messy) Table

If you inspect all the extracted tables in tbls, you’ll see that the table we are interested in is the second one:

dat = tbls[[2]]
dat
# A tibble: 57 × 4
   Mark                                                   Athlete Location Date 
   <chr>                                                  <chr>   <chr>    <chr>
 1 24.90 m (.mw-parser-output .frac{white-space:nowrap}.… Lilli … Berlin   1 Oc…
 2 26.62 m (87 ft 4 in)                                   Lilli … Berlin   8 Ju…
 3 27.39 m (89 ft 10+1⁄4 in)                              Yvonne… Paris    23 S…
 4 27.70 m (90 ft 10+1⁄2 in)                              Lucie … Paris    14 J…
 5 28.325 m (92 ft 11 in)                                 Lisett… Brussels 21 J…
 6 30.225 m (99 ft 1+3⁄4 in)                              Lucien… Paris    14 S…
 7 31.15 m (102 ft 2+1⁄4 in)                              Maria … Prague   11 O…
 8 34.15 m (112 ft 1⁄4 in)                                Halina… Warsaw   23 M…
 9 38.34 m (125 ft 9+1⁄4 in)                              Milly … Braunsc… 22 A…
10 39.18 m (128 ft 6+1⁄2 in)                              Halina… Warsaw   4 Se…
# ℹ 47 more rows

3 Extracting Mark (in meters)

The column Mark contains a character string with the record expressed both in meters and feet-inches. We want to extract only the value associated to meters.

My suggestion is to always start small. In this case, you can get a subset of values on which to test your regex patterns:

tmp = head(dat$Mark)
tmp
[1] "24.90 m (.mw-parser-output .frac{white-space:nowrap}.mw-parser-output .frac .num,.mw-parser-output .frac .den{font-size:80%;line-height:0;vertical-align:super}.mw-parser-output .frac .den{vertical-align:sub}.mw-parser-output .sr-only{border:0;clip:rect(0,0,0,0);height:1px;margin:-1px;overflow:hidden;padding:0;position:absolute;width:1px}81 ft 8+1⁄4 in)"
[2] "26.62 m (87 ft 4 in)"                                                                                                                                                                                                                                                                                                                                              
[3] "27.39 m (89 ft 10+1⁄4 in)"                                                                                                                                                                                                                                                                                                                                         
[4] "27.70 m (90 ft 10+1⁄2 in)"                                                                                                                                                                                                                                                                                                                                         
[5] "28.325 m (92 ft 11 in)"                                                                                                                                                                                                                                                                                                                                            
[6] "30.225 m (99 ft 1+3⁄4 in)"                                                                                                                                                                                                                                                                                                                                         

With the values in tmp, you can match the numeric values of meters with a pattern that matches a string beginning with two digits, followed by a dot ".", followed by two more digits

str_extract(tmp, "^[0-9][0-9]\\.[0-9][0-9]")
[1] "24.90" "26.62" "27.39" "27.70" "28.32" "30.22"

3.1 Your Turn: Meters

  1. Use str_extract() with a pattern that matches a string beginning with two digits, followed by a dot ".", followed by two more digits. Come up with three different regex patterns that meet these criteria.
Show answer
# pattern 1
str_extract(tmp, "^[0-9][0-9]\\.[0-9][0-9]")

# pattern 2
str_extract(tmp, "^[0-9]{2}\\.[0-9]{2}")

# pattern 3
str_extract(tmp, "^[[:digit:]]{2}\\.[[:digit:]]{2}")
  1. Once you have a simple pattern, use it on the entire column Mark and get a numeric vector mark:
Show answer
# numeric vector mark
mark = as.numeric(str_extract(dat$Mark, "^[0-9]{2}\\.[0-9]{2}"))

4 Extracting Athlete Name

The second task involves extracting the first and last names of the athletes. If you inspect the column Athlete, you will see that all its values are formed with the first name, the last name, and the country inside parenthesis:

ath = head(dat$Athlete)
ath
[1] "Lilli Henoch (GER)"     "Lilli Henoch (GER)"     "Yvonne Tembouret (FRA)"
[4] "Lucie Petit (FRA)"      "Lisette Petré (BEL)"    "Lucienne Velu (FRA)"   

4.1 Your Turn: play with ath

  1. Work with the sample vector ath and try to str_extract() the first name. You can look for a pattern consisting of a word at the beginning of the string. This involves using the beginning of the string anchor "^", and the word pattern "\\w+" (i.e. one or more alphanumeric characters):
Show answer
str_extract(ath, "^\\w+")
  1. Now use the patterns whitespace "\\s" and word "\\w+" to attempt extracting the athlete’s last name "\\s\\w+"; hint: to remove the extra matched space you can use str_trim()
Show answer
ath_last = str_extract(ath, "\\s\\w+")
str_trim(ath_last)
  1. Once you are done working with ath, use your code to extract the first and last names of all athletes; use vectors first_name and last_name for this purpose:
Show answer
# first and last name of all athletes
first_name = str_extract(dat$Athlete, "^\\w+")

last_str = str_extract(dat$Athlete, "\\s\\w+")
last_name = str_trim(last_str)

4.2 Your Turn: Athlete’s Initials

Use first_name and last_name to select the first letter in each vector in order to form a new vector initials containing the initials of each athlete’s name: e.g. "J.T.", "T.L.", "G.H.", ...

Show answer
# initials vector
first_initial = str_extract(first_name, "\\w")
last_initial = str_extract(last_name, "\\w")
initials = paste0(first_initial, ".", last_initial, ".")

5 Country

The column Athlete also contains the athlete’s country—abbreviated—within parenthesis:

ath = head(dat$Athlete)
ath
[1] "Lilli Henoch (GER)"     "Lilli Henoch (GER)"     "Yvonne Tembouret (FRA)"
[4] "Lucie Petit (FRA)"      "Lisette Petré (BEL)"    "Lucienne Velu (FRA)"   

5.1 Your Turn: Athlete’s country

  1. Use str_extract() with a regex pattern that matches the country abbreviation formed by three consecutive upper case letters. Come up with three different regex patterns that lets you get these characters:
Show answer
# pattern 1
str_extract(ath, "[A-Z][A-Z][A-Z]")

# pattern 2
str_extract(ath, "[A-Z]{3}")

# pattern 3
str_extract(ath, "[[:upper:]]{3}")
  1. Once you have the right pattern, use your code to extract the country abbreviations on the entire column to produce a vector country
Show answer
country = str_extract(dat$Athlete, "[A-Z][A-Z][A-Z]")

6 Date

The date values are in the column Date:

dts = head(dat$Date)
dts
[1] "1 October 1922"       "8 July 1923"          "23 September 1923[1]"
[4] "14 July 1924[1]"      "21 July 1924[2][3]"   "14 September 1924[1]"

6.1 Your Turn: play with dts

With the dts vector, extract in separate vectors the values of day, month name, and year: you can try using patterns such as:

  • digit: "[0-9]", "\\d",

  • non-digits: "\\D"

  • word-character: "\\w":

Show answer
# days
day = as.numeric(str_extract(dat$Date, "^[0-9]+"))

# months
month = str_trim(str_extract(dat$Date, "\\D+"))

# years
year = as.numeric(str_extract(dat$Date, "\\d{4}"))

7 Clean Data Frame

We are assuming that you have created vectors for all the cleaned components:

  • mark (in meters)
  • first_name (first name of athlete)
  • last_name (last name of athlete)
  • initials (name initials)
  • country(athlete’s country)
  • day (number of day)
  • month (name of month)
  • year (number of year)

7.1 Your Turn: assemble table discus

Create a data frame discus with all the above vectors used as columns of this table. The head() of your table should look like this:

   mark first_name last_name initials country day     month year
1 24.90      Lilli    Henoch     L.H.     GER   1   October 1922
2 26.62      Lilli    Henoch     L.H.     GER   8      July 1923
3 27.39     Yvonne Tembouret     Y.T.     FRA  23 September 1923
4 27.70      Lucie     Petit     L.P.     FRA  14      July 1924
5 28.32    Lisette     Petré     L.P.     BEL  21      July 1924
6 30.22   Lucienne      Velu     L.V.     FRA  14 September 1924
Show answer
discus = data.frame(
  mark = mark,
  first_name = first_name,
  last_name = last_name,
  initials = initials,
  country = country,
  day = day,
  month = month,
  year = year
)

8 Timeline graph

Graph a timeline to visualize the progression of world records: use year for the x-axis, and mark for the y-axis. To be more specific, create a "ggplot" object and then pass it to ggplotly(); you could get something like this:

Show answer
gg = ggplot(data = discus, aes(x = year, y = mark)) +
  geom_line(color = "#FF2E66") +
  geom_point(color = "#FF2E66", aes(text = last_name)) + 
  labs(title = "Women's discus throw world record progression") +
  theme_minimal()

ggplotly(gg)