Data Wrangling I

Manipulating data frames with dplyr

Room for Improvement in Base R

star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "JarJar"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", "Naboo"),
    height = c(1.8, 1.6, 1.7, 1.9),
    weight = c(84, 45, 77, 90)
)

Using base R data frame subsetting, create from star_wars

  1. The vector of heights.
  2. A data frame that contains only the characters shorter than 1.75.
  3. A data frame that contains the four original cols plus the one with the height in inches.
  4. The average height of characters from Tatooine
02:30

What’s going on here?

star_wars[star_wars$height < 1.75, ]
   name homeworld height weight
2 Padme     Naboo    1.6     45
3  Luke  Tatooine    1.7     77
star_wars$height
[1] 1.8 1.6 1.7 1.9
star_wars$height < 1.75
[1] FALSE  TRUE  TRUE FALSE

Comparison operators

You can form a logical vector through a logical comparison:

Examples

c("apple", "orange", "apple", "kiwi") == "apple"
[1]  TRUE FALSE  TRUE FALSE
"apple" == c("apple", "orange", "apple", "kiwi")
[1]  TRUE FALSE  TRUE FALSE
"apple" %in% c("apple", "orange", "apple", "kiwi")
[1] TRUE
c("apple", "orange", "apple", "kiwi") %in% "apple"
[1]  TRUE FALSE  TRUE FALSE

Examples

c("apple", "kiwi") %in% c("apple", "orange", "apple", "kiwi")
[1] TRUE TRUE
c("apple", "orange", "apple", "kiwi") %in% c("apple", "kiwi")
[1]  TRUE FALSE  TRUE  TRUE

Let’s make this better

star_wars[star_wars$height < 1.75, ]



star_wars$height_in <- star_wars$height * 39.3701



mean(star_wars$height[star_wars$homeworld == "Tatooine"])

Four Principles:

  1. Reuse existing data structures.
    • The Data Frame
  2. Compose simple functions with the pipe.
    • |>
  3. Embrace functional programming.
    • Clear separate of data structure and functions.
  4. Design for humans.
    • It should be readable!

dplyr

dplyr verbs

  • slice()
  • select()
  • filter()
  • mutate()
  • arrange()
  • summarize()
  • group_by()
  1. All of these have a data frame as the first argument.
  2. All of these produce a data frame as output.
  3. Within the function, you can simply refer to column names unquoted.

Load the package.

library(dplyr)

Our example data set.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90

slice()

Isolates particular rows of a data frame by row number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
slice(star_wars, 2)
   name homeworld height weight
1 Padme     Naboo    1.6     45

slice()

Isolates particular rows of a data frame by row number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
slice(star_wars, 2:4)
    name homeworld height weight
1  Padme     Naboo    1.6     45
2   Luke  Tatooine    1.7     77
3 JarJar     Naboo    1.9     90

slice()

Isolates particular rows of a data frame by row number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
slice(star_wars, -3)
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3 JarJar     Naboo    1.9     90

select()

Selects variables by name or number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
select(star_wars, homeworld)
  homeworld
1  Tatooine
2     Naboo
3  Tatooine
4     Naboo

select()

Selects variables by name or number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
select(star_wars, 2)
  homeworld
1  Tatooine
2     Naboo
3  Tatooine
4     Naboo

select()

Selects variables by name or number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
select(star_wars, homeworld, name)
  homeworld   name
1  Tatooine Anakin
2     Naboo  Padme
3  Tatooine   Luke
4     Naboo JarJar

filter()

Returns rows that meet certain criteria.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
filter(star_wars, homeworld == "Naboo")
    name homeworld height weight
1  Padme     Naboo    1.6     45
2 JarJar     Naboo    1.9     90

filter()

Returns rows that meet certain criteria.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
filter(star_wars, weight < 50)
   name homeworld height weight
1 Padme     Naboo    1.6     45

filter()

Returns rows that meet certain criteria.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
filter(star_wars, name %in% c("Padme", "Luke"))
   name homeworld height weight
1 Padme     Naboo    1.6     45
2  Luke  Tatooine    1.7     77

filter()

Returns rows that meet certain criteria.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
filter(star_wars,
       name %in% c("Padme", "Luke"),
       weight < 50)
   name homeworld height weight
1 Padme     Naboo    1.6     45

You can add multiple conditions separated by ,

mutate()

Adds a new variable that can be a function of previous variables.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
mutate(star_wars, weight_g = weight * 1000)
    name homeworld height weight weight_g
1 Anakin  Tatooine    1.8     84    84000
2  Padme     Naboo    1.6     45    45000
3   Luke  Tatooine    1.7     77    77000
4 JarJar     Naboo    1.9     90    90000

arrange()

Sort the rows of a data frame by the values of variables.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
arrange(star_wars, height)
    name homeworld height weight
1  Padme     Naboo    1.6     45
2   Luke  Tatooine    1.7     77
3 Anakin  Tatooine    1.8     84
4 JarJar     Naboo    1.9     90

arrange()

Sort the rows of a data frame by the values of variables.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
arrange(star_wars, desc(height))
    name homeworld height weight
1 JarJar     Naboo    1.9     90
2 Anakin  Tatooine    1.8     84
3   Luke  Tatooine    1.7     77
4  Padme     Naboo    1.6     45

arrange()

Sort the rows of a data frame by the values of variables.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
arrange(star_wars, homeworld)
    name homeworld height weight
1  Padme     Naboo    1.6     45
2 JarJar     Naboo    1.9     90
3 Anakin  Tatooine    1.8     84
4   Luke  Tatooine    1.7     77

arrange()

Sort the rows of a data frame by the values of variables.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
arrange(star_wars, homeworld, height)
    name homeworld height weight
1  Padme     Naboo    1.6     45
2 JarJar     Naboo    1.9     90
3   Luke  Tatooine    1.7     77
4 Anakin  Tatooine    1.8     84

summarize()

Summarize a variable with a statistic.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
summarize(star_wars,
          mean(height))
  mean(height)
1         1.75

summarize()

Summarize a variable with a statistic.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
summarize(star_wars,
          avg_height = mean(height))
  avg_height
1       1.75

summarize()

Summarize a variable with a statistic.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
summarize(star_wars,
          avg_height = mean(height),
          sd_height  = sd(height))
  avg_height sd_height
1       1.75 0.1290994

Can calculate multiple statistics separated by ,

star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "JarJar"),
    homeworld = c("Tatooine", "Naboo", "Tatooine", "Naboo"),
    height = c(1.8, 1.6, 1.7, 1.9),
    weight = c(84, 45, 77, 90)
)

Using dplyr, create from star_wars

  1. The vector of heights.
  2. A data frame that contains only the characters shorter than 1.75.
  3. A data frame that contains the four original cols plus the one with the height in inches.
  4. The average height of characters from Tatooine
03:00

Data Wrangling Made Even Easier

select()

Selects variables by name or number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
select(star_wars, name:height)
    name homeworld height
1 Anakin  Tatooine    1.8
2  Padme     Naboo    1.6
3   Luke  Tatooine    1.7
4 JarJar     Naboo    1.9

select()

Selects variables by name or number.

star_wars
    name homeworld height weight
1 Anakin  Tatooine    1.8     84
2  Padme     Naboo    1.6     45
3   Luke  Tatooine    1.7     77
4 JarJar     Naboo    1.9     90
select(star_wars, contains("world"))
  homeworld
1  Tatooine
2     Naboo
3  Tatooine
4     Naboo

Practice Data Frame

library(tibble)

day_at_cal <- tibble(
  weekday    = factor(c("Mon","Mon","Tue","Tue","Wed","Wed"),
                      levels = c("Mon","Tue","Wed","Thu","Fri"), ordered = TRUE),
  study_spot = factor(c("Moffitt","Doe","Cory","MLK","CITRIS","Soda")),
  ai_hours   = c(1.2, 0.4, 2.1, 1.0, 0.8, 1.6),
  steps_k    = c(8.5, 6.2, 9.1, 7.0, 5.8, 10.3))