Data Wrangling II

Agenda

Two big ideas today:

  1. Data Pipelines
  2. Grouped Operations

Data pipelines

library(dplyr)
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "JarJar"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", "Naboo"),
    height = c(1.8, 1.6, 1.7, 1.9),
    weight = c(84, 45, 77, 90)
)

Using base R data frame subsetting, create from star_wars…

  1. The average height of characters from Tatooine

Let’s look at three ways to solve this.

Nesting

summarize(filter(star_wars, homeworld == "Tatooine"), mean(height))
  mean(height)
1         1.75
  • Must be read from the inside out πŸ‘Ž
  • Hard to keep track of arguments πŸ‘Ž
  • All in one line of code πŸ‘
  • Only refer to one data frame πŸ‘

Step-by-step

star_wars2 <- filter(star_wars, homeworld == "Tatooine")
summarize(star_wars2, mean(height))
  mean(height)
1         1.75
  • Have to repeaet data frame names πŸ‘Ž
  • Creates unnecessary objects πŸ‘Ž
  • Stores intermediate objects πŸ‘
  • Can be read from top to bottom πŸ‘

Data Pipelines in R

2016: magrittr introduces %>%

|>

2021:

R now provides a simple native forward pipe syntax |>. The simple form of the forward pipe inserts the left-hand side as the first argument in the right-hand side call.

Using the Pipe Operator

star_wars |>

Using the Pipe Operator

star_wars |>
    filter(homeworld == "Tatooine") |>

Using the Pipe Operator

star_wars |>
    filter(homeworld == "Tatooine") |>
    summarize(mean(height))
  mean(height)
1         1.75
  • πŸ€·β€β™‚οΈ
  • Can be read like an english paragraph πŸ‘
  • Only type the data once πŸ‘
  • No leftover objects πŸ‘

Understanding your pipeline

It’s good practice to understand the output of each line by breaking the pipe.

star_wars |>
    select(homeworld) |>
    filter(mean(height))
Error in `filter()`:
β„Ή In argument: `mean(height)`.
Caused by error:
! object 'height' not found
star_wars |>
    select(homeworld)
  homeworld
1  Tatooine
2     Naboo
3  Tatooine
4     Naboo

star_wars |> # A #<<
    filter(homeworld == "Naboo",
           name %in% c("Padme", "JarJar")) |> # B #<<
    select(height, weight) |> # C #<<
    summarize(mean(height), sd(height),
              mean(weight), sd(weight)) # D #<<

Question: What are the dimensions of the data frame at each stage of the pipe: A, B, C, and D?

02:00

|> works everywhere

R now provides a simple native forward pipe syntax |>. The simple form of the forward pipe inserts the left-hand side as the first argument in the right-hand side call.

factor(c("cat", "cat", "dog")) |> summary()
cat dog 
  2   1 

Grouped Operations

library(dplyr)
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "JarJar"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", "Naboo"),
    height = c(1.8, 1.6, 1.7, 1.9),
    weight = c(84, 45, 77, 90)
)

Using base R data frame subsetting, create from star_wars…

  1. The average height of characters from Tatooine across each planet

You could set up two pipelines with different filters. But there’s a better way.

group_by()

Flag the rows of a dataframe as belong to a group defined by a factor. For use in downstream operations.

star_wars |>
    group_by(homeworld)
# A tibble: 4 Γ— 4
# Groups:   homeworld [2]
  name   homeworld height weight
  <chr>  <chr>      <dbl>  <dbl>
1 Anakin Tatooine     1.8     84
2 Padme  Naboo        1.6     45
3 Luke   Tatooine     1.7     77
4 JarJar Naboo        1.9     90

group_by()

Flag the rows of a dataframe as belong to a group defined by a factor. For use in downstream operations.

star_wars |>
    group_by(homeworld) |>
    summarize(mean(height))
# A tibble: 2 Γ— 2
  homeworld `mean(height)`
  <chr>              <dbl>
1 Naboo               1.75
2 Tatooine            1.75

Draw diagram

group_by() with summarize()

star_wars |>
    summarize(mean(height))
  mean(height)
1         1.75

group_by() with summarize()

star_wars |>
    group_by(homeworld) |>
    summarize(mean(height))
# A tibble: 2 Γ— 2
  homeworld `mean(height)`
  <chr>              <dbl>
1 Naboo               1.75
2 Tatooine            1.75

Makes a summary row for each group.

group_by() with filter()

star_wars |>
    filter(height == max(height))
    name homeworld height weight
1 JarJar     Naboo    1.9     90

group_by() with filter()

star_wars |>
    group_by(homeworld) |>
    filter(height == max(height))
# A tibble: 2 Γ— 4
# Groups:   homeworld [2]
  name   homeworld height weight
  <chr>  <chr>      <dbl>  <dbl>
1 Anakin Tatooine     1.8     84
2 JarJar Naboo        1.9     90

Changes the scope of functions inside filter() to operate within groups.

group_by() with arrange()

star_wars |>
    arrange(desc(height))
    name homeworld height weight
1 JarJar     Naboo    1.9     90
2 Anakin  Tatooine    1.8     84
3   Luke  Tatooine    1.7     77
4  Padme     Naboo    1.6     45

group_by() with arrange()

star_wars |>
    group_by(homeworld) |>
    arrange(desc(height))
# A tibble: 4 Γ— 4
# Groups:   homeworld [2]
  name   homeworld height weight
  <chr>  <chr>      <dbl>  <dbl>
1 JarJar Naboo        1.9     90
2 Anakin Tatooine     1.8     84
3 Luke   Tatooine     1.7     77
4 Padme  Naboo        1.6     45

Arrange ignores group_by() and is always global

Why???

star_wars |>
    arrange(homeworld, height)
    name homeworld height weight
1  Padme     Naboo    1.6     45
2 JarJar     Naboo    1.9     90
3   Luke  Tatooine    1.7     77
4 Anakin  Tatooine    1.8     84

group_by() with mutate()

star_wars |>
    mutate(height_z = (height - mean(height)) / sd(height))
    name homeworld height weight   height_z
1 Anakin  Tatooine    1.8     84  0.3872983
2  Padme     Naboo    1.6     45 -1.1618950
3   Luke  Tatooine    1.7     77 -0.3872983
4 JarJar     Naboo    1.9     90  1.1618950

group_by() with mutate()

star_wars |>
    group_by(homeworld) |>
    mutate(height_z = (height - mean(height)) / sd(height))
# A tibble: 4 Γ— 5
# Groups:   homeworld [2]
  name   homeworld height weight height_z
  <chr>  <chr>      <dbl>  <dbl>    <dbl>
1 Anakin Tatooine     1.8     84    0.707
2 Padme  Naboo        1.6     45   -0.707
3 Luke   Tatooine     1.7     77   -0.707
4 JarJar Naboo        1.9     90    0.707

Changes the scope of functions inside mutate() to operate within groups.

Statefulness

What will this produce?

star_wars |>
    group_by(homeworld) |>
    mutate(height_z = (height - mean(height)) / sd(height)) |>
    summarize(mean(height_z))
# A tibble: 2 Γ— 2
  homeworld `mean(height_z)`
  <chr>                <dbl>
1 Naboo                    0
2 Tatooine                 0

Once grouped, a data frame stays grouped until reduced to one-row-per group or it is ungrouped.

Statefulness

What will this produce?

star_wars |>
    group_by(homeworld) |>
    mutate(height_z = (height - mean(height)) / sd(height)) |>
    ungroup() |>
    summarize(mean(height_z))
# A tibble: 1 Γ— 1
  `mean(height_z)`
             <dbl>
1                0

Shortcut: .by

Most dplyr functions have a .by argument to apply the function to the groups of another variable.

  • Not stateful
  • Good for single line operations
star_wars |>
    summarize(mean(height), .by = homeworld)
  homeworld mean(height)
1  Tatooine         1.75
2     Naboo         1.75