Practice: Manipulating tables with dplyr

Stat 133

Author

Gaston Sanchez

Learning Objectives
  • Learn the basic verbs of dplyr
  • Subset rows by position with slice()
  • Subset rows that match a condition with filter()
  • Select columns by name with select()
  • Modify variables with mutate()
  • Reorder rows with arrange()
  • Reduce variables to values with summarise()
  • Group values by one or more variables with group_by()

1 Manipulation of Data Frames with "dplyr"

In this module, you will start learning how to manipulate tables, in a syntactic way, with functions from the tidyverse package "dplyr" .

library(tidyverse)

1.1 Data starwars

We are going to use a data starwars that, as you can imagine, has to do with Star Wars characters.

starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

2 Basic "dplyr" verbs

The starting point involves reviewing a set of elementary functions known as the basic dplyr verbs:

  • filter(), slice(), and select(): subsetting and selecting rows and columns

  • mutate(): add new variables

  • arrange(): reorder rows

  • summarise(): reduce variables to values

  • group_by(): grouped (aggregated) operations


3 Slice

Let’s start with the function slice(). This function allows you to subset rows using their positions, for example:

# first row
slice(starwars, 1)
# A tibble: 1 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# third row
slice(starwars, 3)
# A tibble: 1 × 14
  name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
1 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# first three rows
three_rows = slice(starwars, 1:3)
three_rows
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue              19 male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red               33 none  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# specify a negative position
slice(three_rows, -1)
# A tibble: 2 × 14
  name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
1 C-3PO    167    75 <NA>       gold        yellow           112 none  masculine
2 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

We should note that slice() comes with a set of sibling functions such as:

  • slice_head() to subset the first rows
  • slice_tail() to subset the last rows
  • slice_min() to subset rows with lowest values of a variable
  • slice_max() to subset rows with highest values of a variable
  • slice_sample() to randomly subset rows

3.1 Your turn: using slice() and friends

  1. use slice() to subset the 10th row of starwars.
Show answer
slice(starwars, 10)
  1. use slice_head() to subset the first 4 rows of starwars
Show answer
slice_head(starwars, n = 4)
  1. use slice_tail() to subset the last 3 rows of starwars
Show answer
slice_tail(starwars, n = 3)
  1. use slice_min() to subset the row with the smallest height value
Show answer
slice_min(starwars, height)
  1. use slice() to subset the data by selecting rows 10, 20, 30, …, 50. Optional hint: seq() is your friend.
Show answer
slice(starwars, seq(from = 10, to = 50, by = 10))

4 Filter

filter() allows you to select rows by defining a condition (which could be simple or compound):

# subset rows given a simple condition
# (height greater than 200 cm)
gt_200 <- filter(starwars, height > 200)
gt_200
# A tibble: 10 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 2 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 3 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 4 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 5 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 6 Lama Su     229    88 none       grey       black           NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 9 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
10 Tion Me…    206    80 none       grey       black           NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# subset rows given a compound condition
filter(starwars, height > 200 & mass < 100)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Roos Tar…    224    82 none       grey       orange            NA male  mascu…
2 Lama Su      229    88 none       grey       black             NA male  mascu…
3 Tion Med…    206    80 none       grey       black             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

4.1 Your turn: using filter()

  1. use filter() to subset those individuals with height less than 100 cm tall.
Show answer
filter(starwars, height < 100)
  1. use filter() to subset rows of female individuals (sex).
Show answer
filter(starwars, sex == "female")
  1. use filter() to subset rows of female individuals (sex) no more than 160cm tall.
Show answer
filter(starwars, sex == "female" & height < 160)

5 Select

select() allows you to select one or more columns by name:

# columns by name
name_height = select(starwars, name, height, mass)
name_height
# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun Lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# ℹ 77 more rows

select() also allows you to exclude one or more columns by negating their names:

# exclude columns by name
select(name_height, -mass)
# A tibble: 87 × 2
   name               height
   <chr>               <int>
 1 Luke Skywalker        172
 2 C-3PO                 167
 3 R2-D2                  96
 4 Darth Vader           202
 5 Leia Organa           150
 6 Owen Lars             178
 7 Beru Whitesun Lars    165
 8 R5-D4                  97
 9 Biggs Darklighter     183
10 Obi-Wan Kenobi        182
# ℹ 77 more rows
# exclude columns by name
select(name_height, -height, -mass)
# A tibble: 87 × 1
   name              
   <chr>             
 1 Luke Skywalker    
 2 C-3PO             
 3 R2-D2             
 4 Darth Vader       
 5 Leia Organa       
 6 Owen Lars         
 7 Beru Whitesun Lars
 8 R5-D4             
 9 Biggs Darklighter 
10 Obi-Wan Kenobi    
# ℹ 77 more rows

You can also specify a range of consecutive columns via the : operator

# consecutive columns
select(starwars, name:mass)
# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun Lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# ℹ 77 more rows

Interestingly, you can also select() columns by their position

# columns by position
select(starwars, 1:3)
# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun Lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# ℹ 77 more rows

5.1 Your turn: using select()

  1. use filter() to subset rows of individuals from homeworld Naboo, and then select() their names.
Show answer
select(filter(starwars, homeworld == 'Naboo'), name)
  1. find how to select the name, and homeworld, of human female individuals.
Show answer
select(filter(starwars, sex == 'female' & species == "Human"), name, homeworld)
  1. find how to select the name, and gender, of "Droid" species.
Show answer
select(filter(starwars, species == 'Droid' ), name, gender)

6 Adding new variables: mutate()

Another basic verb is mutate() which allows you to add new variables.

Let’s create a small data frame with some female individuals and three columns: name, height, and mass:

# creating a small data frame step by step
fem = filter(starwars, sex == 'female')
fem = select(fem, name, height, mass)
fem = slice(fem, c(1, 2, 5, 6, 8))
fem
# A tibble: 5 × 3
  name               height  mass
  <chr>               <int> <dbl>
1 Leia Organa           150  49  
2 Beru Whitesun Lars    165  75  
3 Shmi Skywalker        163  NA  
4 Ayla Secura           178  55  
5 Luminara Unduli       170  56.2

Now, let’s use mutate() to (temporarily) add a column with the ratio height / mass:

mutate(fem, height / mass)
# A tibble: 5 × 4
  name               height  mass `height/mass`
  <chr>               <int> <dbl>         <dbl>
1 Leia Organa           150  49            3.06
2 Beru Whitesun Lars    165  75            2.2 
3 Shmi Skywalker        163  NA           NA   
4 Ayla Secura           178  55            3.24
5 Luminara Unduli       170  56.2          3.02

You can also give a new name, like: ht_wt = height / mass:

mutate(fem, ht_wt = height / mass)
# A tibble: 5 × 4
  name               height  mass ht_wt
  <chr>               <int> <dbl> <dbl>
1 Leia Organa           150  49    3.06
2 Beru Whitesun Lars    165  75    2.2 
3 Shmi Skywalker        163  NA   NA   
4 Ayla Secura           178  55    3.24
5 Luminara Unduli       170  56.2  3.02

In order to permanently change the data, you need to assign the changes to an object:

fem2 <- mutate(fem, ht_m = height * 0.0254, wt_kg = mass * 0.4536)
fem2
# A tibble: 5 × 5
  name               height  mass  ht_m wt_kg
  <chr>               <int> <dbl> <dbl> <dbl>
1 Leia Organa           150  49    3.81  22.2
2 Beru Whitesun Lars    165  75    4.19  34.0
3 Shmi Skywalker        163  NA    4.14  NA  
4 Ayla Secura           178  55    4.52  24.9
5 Luminara Unduli       170  56.2  4.32  25.5

7 Reordering rows: arrange()

The next basic verb of "dplyr" is arrange() which allows you to reorder rows. For example, here’s how to arrange the rows of fem by height

# order rows by height (increasingly)
arrange(fem, height)
# A tibble: 5 × 3
  name               height  mass
  <chr>               <int> <dbl>
1 Leia Organa           150  49  
2 Shmi Skywalker        163  NA  
3 Beru Whitesun Lars    165  75  
4 Luminara Unduli       170  56.2
5 Ayla Secura           178  55  

By default arrange() sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc().

# order rows by height (decreasingly)
arrange(fem, desc(height))
# A tibble: 5 × 3
  name               height  mass
  <chr>               <int> <dbl>
1 Ayla Secura           178  55  
2 Luminara Unduli       170  56.2
3 Beru Whitesun Lars    165  75  
4 Shmi Skywalker        163  NA  
5 Leia Organa           150  49  
# order rows by height, and then mass
arrange(fem, height, mass)
# A tibble: 5 × 3
  name               height  mass
  <chr>               <int> <dbl>
1 Leia Organa           150  49  
2 Shmi Skywalker        163  NA  
3 Beru Whitesun Lars    165  75  
4 Luminara Unduli       170  56.2
5 Ayla Secura           178  55  

7.1 Your Turn

  1. use the original data frame starwars to filter() and arrange() those individuals with height less than 150 cm tall, in increasing order by height.
Show answer
arrange(filter(starwars, height < 150), height)
  1. display the name, homeworld, and species, of the top-5 tallest individuals.
Show answer
slice(select(arrange(starwars, desc(height)), name, homeworld, species), 1:5)
  1. display the name, homeworld, and species, for the top-5 heaviest individuals.
Show answer
slice(select(arrange(starwars, desc(mass)), name, homeworld, species), 1:5)

8 Summarizing values with summarise()

The next verb is summarise(), which is also available as summarize(). Conceptually, this involves applying a function on one or more columns, in order to summarize values. This is probably easier to understand with one example.

Say you are interested in calculating the average height of all individuals. To do this “a la dplyr” you use summarise() as follows:

# average height (removing missing values)
summarise(starwars, avg_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
  avg_height
       <dbl>
1       175.

What if you want to calculate some summary statistics for height: min, median, mean, and max?

# some stats for height (dplyr)
summarise(
  starwars, 
  min = min(height, na.rm = TRUE),
  median = median(height, na.rm = TRUE),
  avg = mean(height, na.rm = TRUE),
  max = max(height, na.rm = TRUE)
)
# A tibble: 1 × 4
    min median   avg   max
  <int>  <int> <dbl> <int>
1    66    180  175.   264

9 Grouped operations

To better appreciate the power of summarise(), we need to introduce another major "dplyr" basic verb: group_by(). This is the function that allows you to perform data aggregations, or grouped operations.

Let’s see the combination of summarise() and group_by() to calculate the average height by homeworld:

# average height, grouped by homeworld
summarise(
  group_by(starwars, homeworld),
  avg_salary = mean(height, na.rm = TRUE)
)

Here’s a more fancy example: average mass and height, by homeworld, displayed in descending order by average height:

arrange(
  summarise(
    group_by(starwars, homeworld),
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE)),
  desc(avg_height)
)

9.1 Your Turn

  1. use summarise() to get the largest height value.
Show answer
summarise(starwars, max_height = max(height, na.rm = TRUE))
  1. use summarise() and group_by() to display the median of mass, by homeworld
Show answer
summarise(group_by(starwars, homeworld), med_mass = median(mass, na.rm = TRUE))
  1. use arrange(), summarise() and group_by() display the average mass by gender, in ascending order
Show answer
arrange(
  summarise(group_by(starwars, gender), 
            avg = mean(mass, na.rm = TRUE)),
  avg)

10 More Exercises

  1. use slice() to subset the data by selecting the last 5 rows.
Show answer
slice(starwars, 83:87)
  1. Find out how to use the functions slice_head() to select the first 5 rows
Show answer
slice_head(starwars, n = 5)
  1. Find out how to use the functions slice_tail() to select the last 3 rows
Show answer
slice_tail(starwars, n = 3)
  1. use filter() to subset rows of individuals with brown hair color.
Show answer
filter(starwars, hair_color == 'brown')
  1. use "dplyr" functions to display the names of individuals with green skin color.
Show answer
select(filter(starwars, skin_color == "green"), name)
  1. find how to select the name, height, and mass, of male individuals, with brown or black hair color.
Show answer
select(filter(starwars, sex == 'male' & hair_color %in% c("brown", "black")), name, height, mass)
  1. using the data frame fem, add a new variable product with the product of height and mass.
Show answer
fem <- mutate(fem, product = height * mass)
fem
  1. create a new data frame fem3, by adding columns log_height and log_mass with the log transformations of height and mass.
Show answer
fem3 <- mutate(fem, log_height = log(height), log_mass = log(mass))
fem3
  1. use summarise() to get the standard deviation of mass; the argument na.rm = TRUE is your friend for removing missing values.
Show answer
summarise(starwars, sd_mass = sd(mass, na.rm = TRUE))
  1. obtain the mean and standard deviation of height, for female characters; the argument na.rm = TRUE is your friend for removing missing values.
Show answer
summarise(
  filter(starwars, sex == "female"),
  mean_height = mean(height, na.rm = TRUE),
  sd_height = sd(height, na.rm = TRUE))