library(tidyverse) # data wrangling and graphics
library(rvest) # for web scraping
library(plotly) # web-interactive plots
Practice: Regular Expressions (part 2)
Stat 133
- Getting familiar with regex functions from
"stringr"
- Use regex operations to clean/process “messy” data
- Focus on detection and extraction of string patterns
1 Regular Expressions with "stringr"
The goal of this module is for you to get a first contact with the most common regex functions in "stringr"
(see below). While R provides a set of built-in functions for regex pattern matching, I prefer to use "stringr"
functions which are more consistent and user friendly.
To be more precise, the purpose of this document is to give you a little bit of regex practice with functions such as:
str_match()
str_detect()
str_extract()
str_locate()
str_sub()
str_split()
str_remove()
str_replace()
1.1 Stringr cheatsheet
We recommend having at hand the stringr cheatsheet (available in bCourses)
https://bcourses.berkeley.edu/courses/1526481/files/folder/cheatsheets?preview=86322354
2 Women’s Discus Data
You will be working with the data set for the Women’s discus throw world record progression, available in wikipedia:
https://en.wikipedia.org/wiki/Women%27s_discus_throw_world_record_progression
2.1 Getting the Data
Perhaps the simplest and most straightforward way to get the data in R is by passing the URL to a function that lets you import HTML files such as read_html()
:
# assemble url
= "https://en.wikipedia.org/wiki/"
wiki = "Women%27s_discus_throw_world_record_progression"
women_discus = paste0(wiki, women_discus)
url
# import HTML file
= read_html(url)
doc class(doc)
[1] "xml_document" "xml_node"
As you can tell, doc
is an object of class "xml_document"
that contains, among other things, the HTML table with the data of discus world records.
2.2 Another option to get the data
Another option to get the data involves a 2-step process:
download the html file of the wikipedia page to your working directory,
then import this file in R.
This is my preferred method for working with data from the web, following good practices to avoid making recurrent and unnecessary requests to the server where the data is hosted.
# step 1) download a copy of the HTML file,
# (run this code in R's console!!!)
= "https://en.wikipedia.org/wiki/"
wiki = "Women%27s_discus_throw_world_record_progression"
women_discus = paste0(wiki, women_discus)
url download.file(url, "women-discus-throw.html")
Assuming that you’ve downloaded the html content in the file women-discus-throw.html
, and that this file is in your working directory, you can then import it with read_html()
:
# step 2) extract HTML table into a data.frame
= read_html("women-discus-throw.html") doc
To extract the HTML tables from doc
you use the html_table()
function:
= html_table(doc)
tbls length(tbls)
[1] 4
2.3 Raw (messy) Table
If you inspect all the extracted tables in tbls
, you’ll see that the table we are interested in is the second one:
= tbls[[2]]
dat dat
# A tibble: 57 × 4
Mark Athlete Location Date
<chr> <chr> <chr> <chr>
1 24.90 m (.mw-parser-output .frac{white-space:nowrap}.… Lilli … Berlin 1 Oc…
2 26.62 m (87 ft 4 in) Lilli … Berlin 8 Ju…
3 27.39 m (89 ft 10+1⁄4 in) Yvonne… Paris 23 S…
4 27.70 m (90 ft 10+1⁄2 in) Lucie … Paris 14 J…
5 28.325 m (92 ft 11 in) Lisett… Brussels 21 J…
6 30.225 m (99 ft 1+3⁄4 in) Lucien… Paris 14 S…
7 31.15 m (102 ft 2+1⁄4 in) Maria … Prague 11 O…
8 34.15 m (112 ft 1⁄4 in) Halina… Warsaw 23 M…
9 38.34 m (125 ft 9+1⁄4 in) Milly … Braunsc… 22 A…
10 39.18 m (128 ft 6+1⁄2 in) Halina… Warsaw 4 Se…
# ℹ 47 more rows
3 Extracting Mark (in meters)
The column Mark
contains a character string with the record expressed both in meters and feet-inches. We want to extract only the value associated to meters.
My suggestion is to always start small. In this case, you can get a subset of values on which to test your regex patterns:
= head(dat$Mark)
tmp tmp
[1] "24.90 m (.mw-parser-output .frac{white-space:nowrap}.mw-parser-output .frac .num,.mw-parser-output .frac .den{font-size:80%;line-height:0;vertical-align:super}.mw-parser-output .frac .den{vertical-align:sub}.mw-parser-output .sr-only{border:0;clip:rect(0,0,0,0);height:1px;margin:-1px;overflow:hidden;padding:0;position:absolute;width:1px}81 ft 8+1⁄4 in)"
[2] "26.62 m (87 ft 4 in)"
[3] "27.39 m (89 ft 10+1⁄4 in)"
[4] "27.70 m (90 ft 10+1⁄2 in)"
[5] "28.325 m (92 ft 11 in)"
[6] "30.225 m (99 ft 1+3⁄4 in)"
With the values in tmp
, you can match the numeric values of meters with a pattern that matches a string beginning with two digits, followed by a dot "."
, followed by two more digits
str_extract(tmp, "^[0-9][0-9]\\.[0-9][0-9]")
[1] "24.90" "26.62" "27.39" "27.70" "28.32" "30.22"
3.1 Your Turn: Meters
- Use
str_extract()
with a pattern that matches a string beginning with two digits, followed by a dot"."
, followed by two more digits. Come up with three different regex patterns that meet these criteria.
Show answer
# pattern 1
str_extract(tmp, "^[0-9][0-9]\\.[0-9][0-9]")
# pattern 2
str_extract(tmp, "^[0-9]{2}\\.[0-9]{2}")
# pattern 3
str_extract(tmp, "^[[:digit:]]{2}\\.[[:digit:]]{2}")
- Once you have a simple pattern, use it on the entire column
Mark
and get a numeric vectormark
:
Show answer
# numeric vector mark
= as.numeric(str_extract(dat$Mark, "^[0-9]{2}\\.[0-9]{2}")) mark
4 Extracting Athlete Name
The second task involves extracting the first and last names of the athletes. If you inspect the column Athlete
, you will see that all its values are formed with the first name, the last name, and the country inside parenthesis:
= head(dat$Athlete)
ath ath
[1] "Lilli Henoch (GER)" "Lilli Henoch (GER)" "Yvonne Tembouret (FRA)"
[4] "Lucie Petit (FRA)" "Lisette Petré (BEL)" "Lucienne Velu (FRA)"
4.1 Your Turn: play with ath
- Work with the sample vector
ath
and try tostr_extract()
the first name. You can look for a pattern consisting of a word at the beginning of the string. This involves using the beginning of the string anchor"^"
, and the word pattern"\\w+"
(i.e. one or more alphanumeric characters):
Show answer
str_extract(ath, "^\\w+")
- Now use the patterns whitespace
"\\s"
and word"\\w+"
to attempt extracting the athlete’s last name"\\s\\w+"
; hint: to remove the extra matched space you can usestr_trim()
Show answer
= str_extract(ath, "\\s\\w+")
ath_last str_trim(ath_last)
- Once you are done working with
ath
, use your code to extract the first and last names of all athletes; use vectorsfirst_name
andlast_name
for this purpose:
Show answer
# first and last name of all athletes
= str_extract(dat$Athlete, "^\\w+")
first_name
= str_extract(dat$Athlete, "\\s\\w+")
last_str = str_trim(last_str) last_name
4.2 Your Turn: Athlete’s Initials
Use first_name
and last_name
to select the first letter in each vector in order to form a new vector initials
containing the initials of each athlete’s name: e.g. "J.T.", "T.L.", "G.H.", ...
Show answer
# initials vector
= str_extract(first_name, "\\w")
first_initial = str_extract(last_name, "\\w")
last_initial = paste0(first_initial, ".", last_initial, ".") initials
5 Country
The column Athlete also contains the athlete’s country—abbreviated—within parenthesis:
= head(dat$Athlete)
ath ath
[1] "Lilli Henoch (GER)" "Lilli Henoch (GER)" "Yvonne Tembouret (FRA)"
[4] "Lucie Petit (FRA)" "Lisette Petré (BEL)" "Lucienne Velu (FRA)"
5.1 Your Turn: Athlete’s country
- Use
str_extract()
with a regex pattern that matches the country abbreviation formed by three consecutive upper case letters. Come up with three different regex patterns that lets you get these characters:
Show answer
# pattern 1
str_extract(ath, "[A-Z][A-Z][A-Z]")
# pattern 2
str_extract(ath, "[A-Z]{3}")
# pattern 3
str_extract(ath, "[[:upper:]]{3}")
- Once you have the right pattern, use your code to extract the country abbreviations on the entire column to produce a vector
country
Show answer
= str_extract(dat$Athlete, "[A-Z][A-Z][A-Z]") country
6 Date
The date values are in the column Date
:
= head(dat$Date)
dts dts
[1] "1 October 1922" "8 July 1923" "23 September 1923[1]"
[4] "14 July 1924[1]" "21 July 1924[2][3]" "14 September 1924[1]"
6.1 Your Turn: play with dts
With the dts
vector, extract in separate vectors the values of day, month name, and year: you can try using patterns such as:
digit:
"[0-9]"
,"\\d"
,non-digits:
"\\D"
word-character:
"\\w"
:
Show answer
# days
= as.numeric(str_extract(dat$Date, "^[0-9]+"))
day
# months
= str_trim(str_extract(dat$Date, "\\D+"))
month
# years
= as.numeric(str_extract(dat$Date, "\\d{4}")) year
7 Clean Data Frame
We are assuming that you have created vectors for all the cleaned components:
mark
(in meters)first_name
(first name of athlete)last_name
(last name of athlete)initials
(name initials)country
(athlete’s country)day
(number of day)month
(name of month)year
(number of year)
7.1 Your Turn: assemble table discus
Create a data frame discus
with all the above vectors used as columns of this table. The head()
of your table should look like this:
mark first_name last_name initials country day month year
1 24.90 Lilli Henoch L.H. GER 1 October 1922
2 26.62 Lilli Henoch L.H. GER 8 July 1923
3 27.39 Yvonne Tembouret Y.T. FRA 23 September 1923
4 27.70 Lucie Petit L.P. FRA 14 July 1924
5 28.32 Lisette Petré L.P. BEL 21 July 1924
6 30.22 Lucienne Velu L.V. FRA 14 September 1924
Show answer
= data.frame(
discus mark = mark,
first_name = first_name,
last_name = last_name,
initials = initials,
country = country,
day = day,
month = month,
year = year
)
8 Timeline graph
Graph a timeline to visualize the progression of world records: use year
for the x-axis, and mark
for the y-axis. To be more specific, create a "ggplot"
object and then pass it to ggplotly()
; you could get something like this:
Show answer
= ggplot(data = discus, aes(x = year, y = mark)) +
gg geom_line(color = "#FF2E66") +
geom_point(color = "#FF2E66", aes(text = last_name)) +
labs(title = "Women's discus throw world record progression") +
theme_minimal()
ggplotly(gg)