Factors and Data Frames

Data structures for data science

While you’re waiting

Imagine you collect University ID number and major from 5 friends:

uid <- c("3850278", "5869204", "4728112", "3829651", "3859278")
maj <- c("statistics", "economics", "statistics", "data", "economics")

Conceptually, in what ways is the data in these two vectors similar? In what ways is it different?
Imagine instead you asked 100 friends. For each vector, can you think of a more efficient way for the computer to store that data?

Agenda

Factors
Data Frames

Factors

Factors

A data structure for storing categorical data. Stores the values in an integer vector and adds a levels attribute to map integers to character strings. Created with factor().

maj <- factor(x = c("STAT", "ECON", "STAT", "DATA", "ECON"))
maj

[1] STAT ECON STAT DATA ECON
Levels: DATA ECON STAT

as.integer(maj)

[1] 3 2 3 1 2

c(maj, 2L)

[1] 3 2 3 1 2 2

Attributes

Objects built on top of simpler data structures can have attributes. A factor’s attributes include its class and levels.

class(maj)

[1] "factor"

levels(maj)

[1] "DATA" "ECON" "STAT"
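To see how these pieces fit together, here is a minimal sketch that strips the class off a factor and inspects what is left. It rebuilds maj so it runs on its own; the name codes is introduced only for illustration.

```r
# Rebuild the factor from the slides so this block runs on its own.
maj <- factor(x = c("STAT", "ECON", "STAT", "DATA", "ECON"))

# Underneath, a factor is stored as an integer vector.
typeof(maj)

# attributes() lists everything layered on top of those integers.
attributes(maj)

# unclass() removes the class, leaving the integer codes plus the levels.
codes <- unclass(maj)   # `codes` is an illustrative name
codes

# Indexing the levels by the codes recovers the original strings.
levels(maj)[codes]
```

The last line is roughly what R does whenever it prints a factor: it looks up each integer code in the levels attribute.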

Setting Levels

Rename a level to change every value that carries it at once.

levels(maj) <- c("DATA", "ECON", "STATISTICS")
maj

[1] STATISTICS ECON       STATISTICS DATA       ECON      
Levels: DATA ECON STATISTICS

Add a level even if it’s not observed.

levels(maj) <- c("DATA", "ECON", "STATISTICS", "BIO")
maj

[1] STATISTICS ECON       STATISTICS DATA       ECON      
Levels: DATA ECON STATISTICS BIO
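One thing to watch for: levels()<- matches by position, so it renames levels rather than reordering them. The sketch below illustrates the difference; the names relabeled and reordered are made up for this example.

```r
maj <- factor(c("STAT", "ECON", "STAT", "DATA", "ECON"))
levels(maj)   # "DATA" "ECON" "STAT"

# Pitfall: assigning the same labels in a different order relabels the data.
# Position 1 (DATA) becomes STAT and position 3 (STAT) becomes DATA.
relabeled <- maj
levels(relabeled) <- c("STAT", "ECON", "DATA")
as.character(relabeled)

# To reorder the levels without touching the values, rebuild with factor().
reordered <- factor(maj, levels = c("STAT", "ECON", "DATA"))
as.character(reordered)
as.integer(reordered)
```

With factor(), the values stay the same and only the integer codes behind them change.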

Taking advantage of class

Certain functions behave differently depending on the class of the object passed to them.

class(1:4)

[1] "integer"

summary(1:4)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.75    2.50    2.50    3.25    4.00

class(maj)

[1] "factor"

summary(maj)

      DATA       ECON STATISTICS        BIO 
         1          2          2          0

Ordering levels

You can create a factor with ordered levels by adding ordered = TRUE.

maj_or <- factor(x = c("STAT", "ECON", "STAT", "DATA", "ECON"),
                 levels = c("STAT", "ECON", "DATA", "BIO"),
                 ordered = TRUE)

maj_or

[1] STAT ECON STAT DATA ECON
Levels: STAT < ECON < DATA < BIO

sort(maj_or)

[1] STAT STAT ECON ECON DATA
Levels: STAT < ECON < DATA < BIO

barplot(summary(maj_or))

barplot(summary(maj))
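Ordering affects more than sorting and plotting. As a small sketch, an ordered factor also supports comparisons against a level and functions such as max():

```r
maj_or <- factor(x = c("STAT", "ECON", "STAT", "DATA", "ECON"),
                 levels = c("STAT", "ECON", "DATA", "BIO"),
                 ordered = TRUE)

# An ordered factor carries two classes.
class(maj_or)

# Comparisons use the position of each level, not alphabetical order.
maj_or < "DATA"

# The largest observed level (BIO is a level but never observed).
max(maj_or)
```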

Factors and Matrices

Both are atomic vectors augmented with attributes (levels for a factor, dim for a matrix) and given a class that allows for special behavior.

mat <- matrix(1:4, ncol = 2)
class(mat)

[1] "matrix" "array"

summary(mat)

       V1             V2      
 Min.   :1.00   Min.   :3.00  
 1st Qu.:1.25   1st Qu.:3.25  
 Median :1.50   Median :3.50  
 Mean   :1.50   Mean   :3.50  
 3rd Qu.:1.75   3rd Qu.:3.75  
 Max.   :2.00   Max.   :4.00

This is a form of object-oriented programming.
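A rough sketch of the mechanism, assuming R's S3 system: a generic such as summary() looks for a method named after the object's class. The class name card and the method summary.card below are hypothetical, invented for this example.

```r
# A toy object with a made-up class, "card".
card <- structure(list(song = "Stay with me", distance = 800),
                  class = "card")

# summary() is a generic: it looks for a method named summary.<class>.
summary.card <- function(object, ...) {
  paste0(object$song, " (", object$distance, ")")
}

# Dispatches to summary.card() because of the class attribute.
summary(card)

# Without the class, the same list falls back to the default method.
summary(unclass(card))
```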

The Limitations of Our Structures

The Limitation of Atomic Vectors

songs <- factor(c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix"))
distances <- c(2000, 800, 100, 10)

Question: What was the song and distance of the second index card?

songs[2]

[1] Stay with me
Levels: 4me4u Golden by Huntrix Stay with me

distances[2]

[1] 800

😬 This is brittle!
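Why brittle? Nothing ties the two vectors together, so any operation on one can silently break the pairing. A small sketch (distances_sorted and ord are illustrative names):

```r
songs <- factor(c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix"))
distances <- c(2000, 800, 100, 10)

# Sorting one vector on its own breaks the alignment with the other.
distances_sorted <- sort(distances)
songs[2]              # still "Stay with me"
distances_sorted[2]   # 100, which belonged to "4me4u"

# Staying aligned means remembering to reorder both with the same index.
ord <- order(distances)
songs[ord]
distances[ord]
```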

The Limitation of Matrices

mat <- matrix(c(songs, distances), ncol = 2)

mat[2,]

[1]   3 800

Question: Summarize the distribution of songs and distances.

summary(mat)

       V1             V2        
 Min.   :1.00   Min.   :  10.0  
 1st Qu.:1.75   1st Qu.:  77.5  
 Median :2.00   Median : 450.0  
 Mean   :2.00   Mean   : 727.5  
 3rd Qu.:2.25   3rd Qu.:1100.0  
 Max.   :3.00   Max.   :2000.0

🤬 Matrices must contain only one type.
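Converting the factor to character first does not rescue the matrix either. In this sketch (mat_chr is an illustrative name), the song titles survive but the distances are coerced to strings:

```r
songs <- factor(c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix"))
distances <- c(2000, 800, 100, 10)

# c() coerces everything to the most flexible type, here character.
mat_chr <- matrix(c(as.character(songs), distances), ncol = 2)
mat_chr[2, ]

# The whole matrix is now character, so the distances are no longer numbers.
typeof(mat_chr)
is.numeric(mat_chr[, 2])
```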

The Limitation of Lists

lst <- list("songs" = songs, "distances" = distances)

summary(lst[["songs"]])

            4me4u Golden by Huntrix      Stay with me 
                1                 2                 1

summary(lst[["distances"]])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.0    77.5   450.0   727.5  1100.0  2000.0

lst[["songs"]][2]

[1] Stay with me
Levels: 4me4u Golden by Huntrix Stay with me

lst[["distances"]][2]

[1] 800

🤯 Too complicated to subset.
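One possible workaround is to apply the same index to every element of the list, as sketched below. It works, but the list still has no notion of a row, so nothing prevents its elements from having different lengths. The names second_card and bad are illustrative.

```r
songs <- factor(c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix"))
distances <- c(2000, 800, 100, 10)
lst <- list("songs" = songs, "distances" = distances)

# Pull the second card by indexing each element in turn.
second_card <- lapply(lst, function(v) v[2])
second_card

# A list happily stores elements of unequal length.
bad <- list(songs = songs, distances = distances[1:3])
lengths(bad)
```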

Data Structures in R

Data Frames

Data Frame

A data frame is a named list of vectors of the same length. Holds attributes for (column) names, row.names, and its class, “data.frame”. Created with data.frame().

star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "Leia"),
    gender = c("male", "female", "male", "female"),
    height = c(1.88, 1.65, 1.72, 1.50),
    weight = c(84, 45, 77, 49)
)
star_wars

    name gender height weight
1 Anakin   male   1.88     84
2  Padme female   1.65     45
3   Luke   male   1.72     77
4   Leia female   1.50     49

Data Frame Attributes

names(star_wars)

[1] "name"   "gender" "height" "weight"

row.names(star_wars)

[1] "1" "2" "3" "4"

class(star_wars)

[1] "data.frame"
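As a quick check of the "named list of vectors of the same length" description, the sketch below inspects the same data frame with list tools and then tries to build one from columns of unequal length. The name mismatch is illustrative.

```r
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "Leia"),
    gender = c("male", "female", "male", "female"),
    height = c(1.88, 1.65, 1.72, 1.50),
    weight = c(84, 45, 77, 49)
)

# Underneath, a data frame is a list with one element per column.
typeof(star_wars)
is.list(star_wars)
length(star_wars)

# The three attributes described above.
names(attributes(star_wars))

# Unlike a plain list, data.frame() rejects columns of incompatible lengths.
mismatch <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
class(mismatch)
```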

Other useful functions for Data Frames

summary()
str()
head()
dim()
ncol()
nrow()
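A short sketch of these functions applied to the star_wars data frame from the earlier slide (rebuilt here so the block runs on its own):

```r
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "Leia"),
    gender = c("male", "female", "male", "female"),
    height = c(1.88, 1.65, 1.72, 1.50),
    weight = c(84, 45, 77, 49)
)

dim(star_wars)           # rows, then columns
nrow(star_wars)
ncol(star_wars)
head(star_wars, n = 2)   # the first two rows
str(star_wars)           # compact view of each column's type and values
summary(star_wars)       # per-column summary, chosen by column class
```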

Index Cards as a Data Frame

index_cards <- data.frame(songs, distances)
index_cards

              songs distances
1 Golden by Huntrix      2000
2      Stay with me       800
3             4me4u       100
4 Golden by Huntrix        10

summary(index_cards)

               songs     distances     
 4me4u            :1   Min.   :  10.0  
 Golden by Huntrix:2   1st Qu.:  77.5  
 Stay with me     :1   Median : 450.0  
                       Mean   : 727.5  
                       3rd Qu.:1100.0  
                       Max.   :2000.0

Index Cards as a Data Frame, cont.

index_cards[2, ]

         songs distances
2 Stay with me       800

👍👍

Question

    name gender height weight
1 Anakin   male   1.88     84
2  Padme female   1.65     45
3   Luke   male   1.72     77
4   Leia female   1.50     49

Using what you know about matrix and list subsetting, write code to subset from star_wars…

The data frame containing only height and weight.
The character vector gender.
The data frame containing only the last row.
The value 1.88

Subsetting Data Frames

Data frames can be subset either as a two-dimensional array (matrix-style [row, col] subsetting) or as a one-dimensional list ([], [[]] or $ subsetting).
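Both styles are sketched below on star_wars. These are one possible set of answers to the four prompts in the question; other forms work too.

```r
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "Leia"),
    gender = c("male", "female", "male", "female"),
    height = c(1.88, 1.65, 1.72, 1.50),
    weight = c(84, 45, 77, 49)
)

# Matrix-style [row, col] subsetting
star_wars[, c("height", "weight")]   # data frame with only height and weight
star_wars[nrow(star_wars), ]         # data frame with only the last row
star_wars[1, "height"]               # the value 1.88

# List-style subsetting
star_wars["gender"]      # [ ] keeps a one-column data frame
star_wars[["gender"]]    # [[ ]] extracts the character vector
star_wars$gender         # $ does the same by name
```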

The Evolution of the Data Frame

1990: Data Frame (R)

2009: Pandas DF (Py)

2016: Tibble (R)

2020: Polars (R/Py)

Tibbles

An updated version of a data frame with convenient behaviors.

library(tibble)
star_wars_tbl <- as_tibble(star_wars)
star_wars_tbl

# A tibble: 4 × 4
  name   gender height weight
  <chr>  <chr>   <dbl>  <dbl>
1 Anakin male     1.88     84
2 Padme  female   1.65     45
3 Luke   male     1.72     77
4 Leia   female   1.5      49

The main difference is that tibbles are lazy and surly: they do less and complain more - Hadley Wickham
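A hedged sketch of two behaviors this quote is usually taken to mean. The base R part runs anywhere; the tibble part runs only if the tibble package is installed. The names df_col, df_partial, and msg are illustrative.

```r
star_wars <- data.frame(
    name = c("Anakin", "Padme", "Luke", "Leia"),
    gender = c("male", "female", "male", "female"),
    height = c(1.88, 1.65, 1.72, 1.50),
    weight = c(84, 45, 77, 49)
)

# Base data frames "do more": a one-column [ , ] subset drops to a vector,
# and $ partially matches column names.
df_col <- star_wars[, "name"]
class(df_col)
df_partial <- star_wars$na   # "na" partially matches "name"
df_partial

# Tibbles keep the data frame shape and do not partially match.
if (requireNamespace("tibble", quietly = TRUE)) {
  star_wars_tbl <- tibble::as_tibble(star_wars)
  print(class(star_wars_tbl[, "name"]))   # still a tibble
  # Capture the complaint raised for the unknown column `na`.
  msg <- tryCatch(star_wars_tbl$na, warning = function(w) conditionMessage(w))
  print(msg)
}
```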