library(tidyverse)
Practice: Manipulating tables with dplyr
Stat 133
- Learn the basic verbs of dplyr
- Subset rows by position with
slice()
- Subset rows that match a condition with
filter()
- Select columns by name with
select()
- Modify variables with
mutate()
- Reorder rows with
arrange()
- Reduce variables to values with
summarise()
- Group values by one or more variables with
group_by()
1 Manipulation of Data Frames with "dplyr"
In this module, you will start learning how to manipulate tables, in a syntactic way, with functions from the tidyverse package "dplyr"
.
1.1 Data starwars
We are going to use a data starwars
that, as you can imagine, has to do with Star Wars characters.
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
2 Basic "dplyr"
verbs
The starting point involves reviewing a set of elementary functions known as the basic dplyr verbs:
filter()
,slice()
, andselect()
: subsetting and selecting rows and columnsmutate()
: add new variablesarrange()
: reorder rowssummarise()
: reduce variables to valuesgroup_by()
: grouped (aggregated) operations
3 Slice
Let’s start with the function slice()
. This function allows you to subset rows using their positions, for example:
# first row
slice(starwars, 1)
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# third row
slice(starwars, 3)
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 R2-D2 96 32 <NA> white, blue red 33 none masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# first three rows
= slice(starwars, 1:3)
three_rows three_rows
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# specify a negative position
slice(three_rows, -1)
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none masculine
2 R2-D2 96 32 <NA> white, blue red 33 none masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
We should note that slice()
comes with a set of sibling functions such as:
slice_head()
to subset the first rowsslice_tail()
to subset the last rowsslice_min()
to subset rows with lowest values of a variableslice_max()
to subset rows with highest values of a variableslice_sample()
to randomly subset rows
3.1 Your turn: using slice()
and friends
- use
slice()
to subset the 10th row ofstarwars
.
Show answer
slice(starwars, 10)
- use
slice_head()
to subset the first 4 rows ofstarwars
Show answer
slice_head(starwars, n = 4)
- use
slice_tail()
to subset the last 3 rows ofstarwars
Show answer
slice_tail(starwars, n = 3)
- use
slice_min()
to subset the row with the smallestheight
value
Show answer
slice_min(starwars, height)
- use
slice()
to subset the data by selecting rows 10, 20, 30, …, 50. Optional hint:seq()
is your friend.
Show answer
slice(starwars, seq(from = 10, to = 50, by = 10))
4 Filter
filter()
allows you to select rows by defining a condition (which could be simple or compound):
# subset rows given a simple condition
# (height greater than 200 cm)
<- filter(starwars, height > 200)
gt_200 gt_200
# A tibble: 10 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Darth V… 202 136 none white yellow 41.9 male mascu…
2 Chewbac… 228 112 brown unknown blue 200 male mascu…
3 Roos Ta… 224 82 none grey orange NA male mascu…
4 Rugor N… 206 NA none green orange NA male mascu…
5 Yarael … 264 NA none white yellow NA male mascu…
6 Lama Su 229 88 none grey black NA male mascu…
7 Taun We 213 NA none grey black NA fema… femin…
8 Grievous 216 159 none brown, wh… green, y… NA male mascu…
9 Tarfful 234 136 brown brown blue NA male mascu…
10 Tion Me… 206 80 none grey black NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# subset rows given a compound condition
filter(starwars, height > 200 & mass < 100)
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Roos Tar… 224 82 none grey orange NA male mascu…
2 Lama Su 229 88 none grey black NA male mascu…
3 Tion Med… 206 80 none grey black NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
4.1 Your turn: using filter()
- use
filter()
to subset those individuals with height less than 100 cm tall.
Show answer
filter(starwars, height < 100)
- use
filter()
to subset rows of female individuals (sex
).
Show answer
filter(starwars, sex == "female")
- use
filter()
to subset rows of female individuals (sex
) no more than 160cm tall.
Show answer
filter(starwars, sex == "female" & height < 160)
5 Select
select()
allows you to select one or more columns by name:
# columns by name
= select(starwars, name, height, mass)
name_height name_height
# A tibble: 87 × 3
name height mass
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun Lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# ℹ 77 more rows
select()
also allows you to exclude one or more columns by negating their names:
# exclude columns by name
select(name_height, -mass)
# A tibble: 87 × 2
name height
<chr> <int>
1 Luke Skywalker 172
2 C-3PO 167
3 R2-D2 96
4 Darth Vader 202
5 Leia Organa 150
6 Owen Lars 178
7 Beru Whitesun Lars 165
8 R5-D4 97
9 Biggs Darklighter 183
10 Obi-Wan Kenobi 182
# ℹ 77 more rows
# exclude columns by name
select(name_height, -height, -mass)
# A tibble: 87 × 1
name
<chr>
1 Luke Skywalker
2 C-3PO
3 R2-D2
4 Darth Vader
5 Leia Organa
6 Owen Lars
7 Beru Whitesun Lars
8 R5-D4
9 Biggs Darklighter
10 Obi-Wan Kenobi
# ℹ 77 more rows
You can also specify a range of consecutive columns via the :
operator
# consecutive columns
select(starwars, name:mass)
# A tibble: 87 × 3
name height mass
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun Lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# ℹ 77 more rows
Interestingly, you can also select()
columns by their position
# columns by position
select(starwars, 1:3)
# A tibble: 87 × 3
name height mass
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun Lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# ℹ 77 more rows
5.1 Your turn: using select()
- use
filter()
to subset rows of individuals from homeworldNaboo
, and thenselect()
their names.
Show answer
select(filter(starwars, homeworld == 'Naboo'), name)
- find how to select the name, and homeworld, of human female individuals.
Show answer
select(filter(starwars, sex == 'female' & species == "Human"), name, homeworld)
- find how to select the name, and gender, of
"Droid"
species.
Show answer
select(filter(starwars, species == 'Droid' ), name, gender)
6 Adding new variables: mutate()
Another basic verb is mutate()
which allows you to add new variables.
Let’s create a small data frame with some female individuals and three columns: name
, height
, and mass
:
# creating a small data frame step by step
= filter(starwars, sex == 'female')
fem = select(fem, name, height, mass)
fem = slice(fem, c(1, 2, 5, 6, 8))
fem fem
# A tibble: 5 × 3
name height mass
<chr> <int> <dbl>
1 Leia Organa 150 49
2 Beru Whitesun Lars 165 75
3 Shmi Skywalker 163 NA
4 Ayla Secura 178 55
5 Luminara Unduli 170 56.2
Now, let’s use mutate()
to (temporarily) add a column with the ratio height / mass
:
mutate(fem, height / mass)
# A tibble: 5 × 4
name height mass `height/mass`
<chr> <int> <dbl> <dbl>
1 Leia Organa 150 49 3.06
2 Beru Whitesun Lars 165 75 2.2
3 Shmi Skywalker 163 NA NA
4 Ayla Secura 178 55 3.24
5 Luminara Unduli 170 56.2 3.02
You can also give a new name, like: ht_wt = height / mass
:
mutate(fem, ht_wt = height / mass)
# A tibble: 5 × 4
name height mass ht_wt
<chr> <int> <dbl> <dbl>
1 Leia Organa 150 49 3.06
2 Beru Whitesun Lars 165 75 2.2
3 Shmi Skywalker 163 NA NA
4 Ayla Secura 178 55 3.24
5 Luminara Unduli 170 56.2 3.02
In order to permanently change the data, you need to assign the changes to an object:
<- mutate(fem, ht_m = height * 0.0254, wt_kg = mass * 0.4536)
fem2 fem2
# A tibble: 5 × 5
name height mass ht_m wt_kg
<chr> <int> <dbl> <dbl> <dbl>
1 Leia Organa 150 49 3.81 22.2
2 Beru Whitesun Lars 165 75 4.19 34.0
3 Shmi Skywalker 163 NA 4.14 NA
4 Ayla Secura 178 55 4.52 24.9
5 Luminara Unduli 170 56.2 4.32 25.5
7 Reordering rows: arrange()
The next basic verb of "dplyr"
is arrange()
which allows you to reorder rows. For example, here’s how to arrange the rows of fem
by height
# order rows by height (increasingly)
arrange(fem, height)
# A tibble: 5 × 3
name height mass
<chr> <int> <dbl>
1 Leia Organa 150 49
2 Shmi Skywalker 163 NA
3 Beru Whitesun Lars 165 75
4 Luminara Unduli 170 56.2
5 Ayla Secura 178 55
By default arrange()
sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc()
.
# order rows by height (decreasingly)
arrange(fem, desc(height))
# A tibble: 5 × 3
name height mass
<chr> <int> <dbl>
1 Ayla Secura 178 55
2 Luminara Unduli 170 56.2
3 Beru Whitesun Lars 165 75
4 Shmi Skywalker 163 NA
5 Leia Organa 150 49
# order rows by height, and then mass
arrange(fem, height, mass)
# A tibble: 5 × 3
name height mass
<chr> <int> <dbl>
1 Leia Organa 150 49
2 Shmi Skywalker 163 NA
3 Beru Whitesun Lars 165 75
4 Luminara Unduli 170 56.2
5 Ayla Secura 178 55
7.1 Your Turn
- use the original data frame
starwars
tofilter()
andarrange()
those individuals with height less than 150 cm tall, in increasing order by height.
Show answer
arrange(filter(starwars, height < 150), height)
- display the name, homeworld, and species, of the top-5 tallest individuals.
Show answer
slice(select(arrange(starwars, desc(height)), name, homeworld, species), 1:5)
- display the name, homeworld, and species, for the top-5 heaviest individuals.
Show answer
slice(select(arrange(starwars, desc(mass)), name, homeworld, species), 1:5)
8 Summarizing values with summarise()
The next verb is summarise()
, which is also available as summarize()
. Conceptually, this involves applying a function on one or more columns, in order to summarize values. This is probably easier to understand with one example.
Say you are interested in calculating the average height of all individuals. To do this “a la dplyr” you use summarise()
as follows:
# average height (removing missing values)
summarise(starwars, avg_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
avg_height
<dbl>
1 175.
What if you want to calculate some summary statistics for height
: min, median, mean, and max?
# some stats for height (dplyr)
summarise(
starwars, min = min(height, na.rm = TRUE),
median = median(height, na.rm = TRUE),
avg = mean(height, na.rm = TRUE),
max = max(height, na.rm = TRUE)
)
# A tibble: 1 × 4
min median avg max
<int> <int> <dbl> <int>
1 66 180 175. 264
9 Grouped operations
To better appreciate the power of summarise()
, we need to introduce another major "dplyr"
basic verb: group_by()
. This is the function that allows you to perform data aggregations, or grouped operations.
Let’s see the combination of summarise()
and group_by()
to calculate the average height by homeworld:
# average height, grouped by homeworld
summarise(
group_by(starwars, homeworld),
avg_salary = mean(height, na.rm = TRUE)
)
Here’s a more fancy example: average mass and height, by homeworld, displayed in descending order by average height:
arrange(
summarise(
group_by(starwars, homeworld),
avg_height = mean(height, na.rm = TRUE),
avg_mass = mean(mass, na.rm = TRUE)),
desc(avg_height)
)
9.1 Your Turn
- use
summarise()
to get the largest height value.
Show answer
summarise(starwars, max_height = max(height, na.rm = TRUE))
- use
summarise()
andgroup_by()
to display the median of mass, by homeworld
Show answer
summarise(group_by(starwars, homeworld), med_mass = median(mass, na.rm = TRUE))
- use
arrange()
,summarise()
andgroup_by()
display the average mass by gender, in ascending order
Show answer
arrange(
summarise(group_by(starwars, gender),
avg = mean(mass, na.rm = TRUE)),
avg)
10 More Exercises
- use
slice()
to subset the data by selecting the last 5 rows.
Show answer
slice(starwars, 83:87)
- Find out how to use the functions
slice_head()
to select the first 5 rows
Show answer
slice_head(starwars, n = 5)
- Find out how to use the functions
slice_tail()
to select the last 3 rows
Show answer
slice_tail(starwars, n = 3)
- use
filter()
to subset rows of individuals with brown hair color.
Show answer
filter(starwars, hair_color == 'brown')
- use
"dplyr"
functions to display the names of individuals with green skin color.
Show answer
select(filter(starwars, skin_color == "green"), name)
- find how to select the name, height, and mass, of male individuals, with brown or black hair color.
Show answer
select(filter(starwars, sex == 'male' & hair_color %in% c("brown", "black")), name, height, mass)
- using the data frame
fem
, add a new variableproduct
with the product ofheight
andmass
.
Show answer
<- mutate(fem, product = height * mass)
fem fem
- create a new data frame
fem3
, by adding columnslog_height
andlog_mass
with the log transformations ofheight
andmass
.
Show answer
<- mutate(fem, log_height = log(height), log_mass = log(mass))
fem3 fem3
- use
summarise()
to get the standard deviation ofmass
; the argumentna.rm = TRUE
is your friend for removing missing values.
Show answer
summarise(starwars, sd_mass = sd(mass, na.rm = TRUE))
- obtain the mean and standard deviation of
height
, for female characters; the argumentna.rm = TRUE
is your friend for removing missing values.
Show answer
summarise(
filter(starwars, sex == "female"),
mean_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))