Data Wrangling II

Agenda

Two big ideas today:

Data Pipelines
Grouped Operations

Data pipelines

library(dplyr)
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "JarJar"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", "Naboo"),
    height = c(1.8, 1.6, 1.7, 1.9),
    weight = c(84, 45, 77, 90)
)

Using base R data frame subsetting, create from star_wars…

The average height of characters from Tatooine

Let’s look at three ways to solve this.

Nesting

summarize(filter(star_wars, homeworld == "Tatooine"), mean(height))

  mean(height)
1         1.75

Must be read from the inside out 👎
Hard to keep track of arguments 👎

All in one line of code 👍
Only refer to one data frame 👍

Step-by-step

star_wars2 <- filter(star_wars, homeworld == "Tatooine")
summarize(star_wars2, mean(height))

  mean(height)
1         1.75

Have to repeaet data frame names 👎
Creates unnecessary objects 👎

Stores intermediate objects 👍
Can be read from top to bottom 👍

Data Pipelines in R

2016: magrittr introduces %>%

|>

2021:

R now provides a simple native forward pipe syntax |>. The simple form of the forward pipe inserts the left-hand side as the first argument in the right-hand side call.

Using the Pipe Operator

star_wars |>

Using the Pipe Operator

star_wars |>
    filter(homeworld == "Tatooine") |>

Using the Pipe Operator

star_wars |>
    filter(homeworld == "Tatooine") |>
    summarize(mean(height))

  mean(height)
1         1.75

🤷‍♂️

Can be read like an english paragraph 👍
Only type the data once 👍
No leftover objects 👍

Understanding your pipeline

It’s good practice to understand the output of each line by breaking the pipe.

star_wars |>
    select(homeworld) |>
    filter(mean(height))

Error in `filter()`:
ℹ In argument: `mean(height)`.
Caused by error:
! object 'height' not found

star_wars |>
    select(homeworld)

  homeworld
1  Tatooine
2     Naboo
3  Tatooine
4     Naboo

star_wars |> # A #<<
    filter(homeworld == "Naboo",
           name %in% c("Padme", "JarJar")) |> # B #<<
    select(height, weight) |> # C #<<
    summarize(mean(height), sd(height),
              mean(weight), sd(weight)) # D #<<

Question: What are the dimensions of the data frame at each stage of the pipe: A, B, C, and D?

02:00

`|>` works everywhere

R now provides a simple native forward pipe syntax |>. The simple form of the forward pipe inserts the left-hand side as the first argument in the right-hand side call.

factor(c("cat", "cat", "dog")) |> summary()

cat dog 
  2   1

Grouped Operations

library(dplyr)
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "JarJar"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", "Naboo"),
    height = c(1.8, 1.6, 1.7, 1.9),
    weight = c(84, 45, 77, 90)
)

Using base R data frame subsetting, create from star_wars…

The average height of characters from Tatooine across each planet

You could set up two pipelines with different filters. But there’s a better way.

`group_by()`

Flag the rows of a dataframe as belong to a group defined by a factor. For use in downstream operations.

star_wars |>
    group_by(homeworld)

# A tibble: 4 × 4
# Groups:   homeworld [2]
  name   homeworld height weight
  <chr>  <chr>      <dbl>  <dbl>
1 Anakin Tatooine     1.8     84
2 Padme  Naboo        1.6     45
3 Luke   Tatooine     1.7     77
4 JarJar Naboo        1.9     90

`group_by()`

Flag the rows of a dataframe as belong to a group defined by a factor. For use in downstream operations.

star_wars |>
    group_by(homeworld) |>
    summarize(mean(height))

# A tibble: 2 × 2
  homeworld `mean(height)`
  <chr>              <dbl>
1 Naboo               1.75
2 Tatooine            1.75

Draw diagram

`group_by()` with `summarize()`

star_wars |>
    summarize(mean(height))

  mean(height)
1         1.75

`group_by()` with `summarize()`

star_wars |>
    group_by(homeworld) |>
    summarize(mean(height))

# A tibble: 2 × 2
  homeworld `mean(height)`
  <chr>              <dbl>
1 Naboo               1.75
2 Tatooine            1.75

Makes a summary row for each group.

`group_by()` with `filter()`

star_wars |>
    filter(height == max(height))

    name homeworld height weight
1 JarJar     Naboo    1.9     90

`group_by()` with `filter()`

star_wars |>
    group_by(homeworld) |>
    filter(height == max(height))

# A tibble: 2 × 4
# Groups:   homeworld [2]
  name   homeworld height weight
  <chr>  <chr>      <dbl>  <dbl>
1 Anakin Tatooine     1.8     84
2 JarJar Naboo        1.9     90

Changes the scope of functions inside filter() to operate within groups.

`group_by()` with `arrange()`

star_wars |>
    arrange(desc(height))

    name homeworld height weight
1 JarJar     Naboo    1.9     90
2 Anakin  Tatooine    1.8     84
3   Luke  Tatooine    1.7     77
4  Padme     Naboo    1.6     45

`group_by()` with `arrange()`

star_wars |>
    group_by(homeworld) |>
    arrange(desc(height))

# A tibble: 4 × 4
# Groups:   homeworld [2]
  name   homeworld height weight
  <chr>  <chr>      <dbl>  <dbl>
1 JarJar Naboo        1.9     90
2 Anakin Tatooine     1.8     84
3 Luke   Tatooine     1.7     77
4 Padme  Naboo        1.6     45

Arrange ignores group_by() and is always global

Why???

star_wars |>
    arrange(homeworld, height)

    name homeworld height weight
1  Padme     Naboo    1.6     45
2 JarJar     Naboo    1.9     90
3   Luke  Tatooine    1.7     77
4 Anakin  Tatooine    1.8     84

`group_by()` with `mutate()`

star_wars |>
    mutate(height_z = (height - mean(height)) / sd(height))

    name homeworld height weight   height_z
1 Anakin  Tatooine    1.8     84  0.3872983
2  Padme     Naboo    1.6     45 -1.1618950
3   Luke  Tatooine    1.7     77 -0.3872983
4 JarJar     Naboo    1.9     90  1.1618950

`group_by()` with `mutate()`

star_wars |>
    group_by(homeworld) |>
    mutate(height_z = (height - mean(height)) / sd(height))

# A tibble: 4 × 5
# Groups:   homeworld [2]
  name   homeworld height weight height_z
  <chr>  <chr>      <dbl>  <dbl>    <dbl>
1 Anakin Tatooine     1.8     84    0.707
2 Padme  Naboo        1.6     45   -0.707
3 Luke   Tatooine     1.7     77   -0.707
4 JarJar Naboo        1.9     90    0.707

Changes the scope of functions inside mutate() to operate within groups.

Statefulness

What will this produce?

star_wars |>
    group_by(homeworld) |>
    mutate(height_z = (height - mean(height)) / sd(height)) |>
    summarize(mean(height_z))

# A tibble: 2 × 2
  homeworld `mean(height_z)`
  <chr>                <dbl>
1 Naboo                    0
2 Tatooine                 0

Once grouped, a data frame stays grouped until reduced to one-row-per group or it is ungrouped.

Statefulness

What will this produce?

star_wars |>
    group_by(homeworld) |>
    mutate(height_z = (height - mean(height)) / sd(height)) |>
    ungroup() |>
    summarize(mean(height_z))

# A tibble: 1 × 1
  `mean(height_z)`
             <dbl>
1                0

Shortcut: `.by`

Most dplyr functions have a .by argument to apply the function to the groups of another variable.

Not stateful
Good for single line operations

star_wars |>
    summarize(mean(height), .by = homeworld)

  homeworld mean(height)
1  Tatooine         1.75
2     Naboo         1.75

Data Wrangling II

Agenda

Data pipelines

Nesting

Step-by-step

Data Pipelines in R

Using the Pipe Operator

Using the Pipe Operator

Using the Pipe Operator

Understanding your pipeline

|> works everywhere

Grouped Operations

group_by()

group_by()

group_by() with summarize()

group_by() with summarize()

group_by() with filter()

group_by() with filter()

group_by() with arrange()

group_by() with arrange()

group_by() with mutate()

group_by() with mutate()

Statefulness

Statefulness

Shortcut: .by

`|>` works everywhere

`group_by()`

`group_by()`

`group_by()` with `summarize()`

`group_by()` with `summarize()`

`group_by()` with `filter()`

`group_by()` with `filter()`

`group_by()` with `arrange()`

`group_by()` with `arrange()`

`group_by()` with `mutate()`

`group_by()` with `mutate()`

Shortcut: `.by`