Data structures for data science
Imagine you collect University ID number and major from 5 friends:
A factor is a data structure for storing categorical data. It stores the values in an integer vector and adds a levels attribute that maps the integers to character strings. Created with factor().
Objects built on top of simpler data structures can have attributes. A factor’s attributes include its class and levels.
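As a minimal sketch of the two points above, the code below builds a factor and inspects what sits underneath it. The variable name major and its five values are illustrative assumptions, not data from the slides.

```r
# Hypothetical majors collected from five friends (illustrative values only)
major <- factor(c("stat", "econ", "stat", "data", "econ"))

typeof(major)      # the underlying storage type is "integer"
attributes(major)  # a list holding $levels and $class
as.integer(major)  # the integer codes that index into the levels
levels(major)      # the character labels, sorted alphabetically by default
```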
Quickly change all of the values with a particular level.
[1] STATISTICS ECON STATISTICS DATA ECON
Levels: DATA ECON STATISTICS
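One way to arrive at values and levels like those printed above is to assign new labels to levels(). This is only a sketch: the starting labels ("stat", "econ", "data") are an assumption, since the original code is not shown.

```r
# Hypothetical starting factor; the short labels are assumed for illustration
major <- factor(c("stat", "econ", "stat", "data", "econ"))

# Relabel every level at once; the order matches levels(major):
# "data", "econ", "stat"
levels(major) <- c("DATA", "ECON", "STATISTICS")

major  # every "stat" is now "STATISTICS", and so on
```

Because the values are stored as integer codes, only the three labels change; the five codes are left untouched.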
Add a level even if it’s not observed.
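A possible way to do this is to extend the levels attribute directly. In the sketch below, the level "math" and the factor major are hypothetical and chosen only for illustration.

```r
# Hypothetical factor of majors
major <- factor(c("stat", "econ", "stat", "data", "econ"))

# Append a level that no observation currently has
levels(major) <- c(levels(major), "math")

levels(major)  # "math" is now a valid level
table(major)   # and it is reported with a count of 0
```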
Certain functions behave differently depending on the class of the object that is passed to them.
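To make this concrete, here is a small sketch that calls summary() on a numeric vector and on a factor. The objects x_num and x_fac are made up for this example.

```r
# Two hypothetical objects with different classes
x_num <- c(3, 2, 3, 1, 2)
x_fac <- factor(c("stat", "econ", "stat", "data", "econ"))

class(x_num)    # "numeric"
class(x_fac)    # "factor"

summary(x_num)  # quartiles and mean for a numeric vector
summary(x_fac)  # counts per level for a factor
```

The same function name gives two different kinds of result because R picks a method based on the class of the argument.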
You can create a factor with ordered levels by adding ordered = TRUE.
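A brief sketch of an ordered factor follows. The class-year values are hypothetical; the point is that the order comes from the levels argument, not from the alphabet.

```r
# Hypothetical class years; the level order is supplied explicitly
year <- factor(
  c("junior", "freshman", "senior", "sophomore"),
  levels  = c("freshman", "sophomore", "junior", "senior"),
  ordered = TRUE
)

year             # levels print with < between them
year < "junior"  # comparisons are defined for ordered factors
max(year)        # so are min() and max()
```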
Both are augmented versions of atomic vectors that are given classes to allow for special behavior.
V1 V2
Min. :1.00 Min. :3.00
1st Qu.:1.25 1st Qu.:3.25
Median :1.50 Median :3.50
Mean :1.50 Mean :3.50
3rd Qu.:1.75 3rd Qu.:3.75
Max. :2.00 Max. :4.00
This is a form of object-oriented programming.
Question: What was the song and distance of the second index card?
😬 This is brittle!
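A sketch of why this can be brittle, assuming the cards were stored as two separate vectors (the original code is not shown). The values are the ones from the songs and distances table later in these slides; songs_sorted is a hypothetical name.

```r
# Assumed storage: one vector per variable, linked only by position
songs     <- c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix")
distances <- c(2000, 800, 100, 10)

# The second card has to be looked up twice, once per vector
songs[2]
distances[2]

# Nothing ties the vectors together: reordering one breaks the pairing
songs_sorted <- sort(songs)
songs_sorted[2]  # no longer the song that goes with distances[2]
```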
Question: Summarize the distribution of songs and distances.
V1 V2
Min. :1.00 Min. : 10.0
1st Qu.:1.75 1st Qu.: 77.5
Median :2.00 Median : 450.0
Mean :2.00 Mean : 727.5
3rd Qu.:2.25 3rd Qu.:1100.0
Max. :3.00 Max. :2000.0
🤬 Matrices must contain only one type.
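The sketch below shows two ways the one-type rule can bite, under the assumption that the cards were combined into a matrix. The names m_chr and m_num are hypothetical.

```r
songs     <- c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix")
distances <- c(2000, 800, 100, 10)

# Mixing character and numeric columns coerces everything to character
m_chr <- cbind(songs, distances)
typeof(m_chr)  # "character", so the distances are now strings

# Converting songs to factor codes keeps the matrix numeric,
# but the first column is then just arbitrary integer codes
m_num <- matrix(c(as.integer(factor(songs)), distances), ncol = 2)
summary(m_num)  # the summary of the first column describes codes, not songs
```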
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 77.5 450.0 727.5 1100.0 2000.0
🤯 Too complicated to subset.
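As an illustration of the subsetting overhead, here is a sketch that stores each card as its own list. The structure and the names cards, song, distance, and dists are assumptions for this example.

```r
# Assumed storage: a list of cards, each card a named list
cards <- list(
  list(song = "Golden by Huntrix", distance = 2000),
  list(song = "Stay with me",      distance = 800),
  list(song = "4me4u",             distance = 100),
  list(song = "Golden by Huntrix", distance = 10)
)

# The second card is easy to reach and keeps both types intact
cards[[2]]$song
cards[[2]]$distance

# But summarizing one variable means pulling it out of every card first
dists <- vapply(cards, function(card) card$distance, numeric(1))
summary(dists)
```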
A data frame is a named list of vectors of the same length. Holds attributes for (column) names, row.names, and its class, “data.frame”. Created with data.frame().
name gender height weight
1 Anakin male 1.88 84
2 Padme female 1.65 45
3 Luke male 1.72 77
4 Leia female 1.50 49
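A data frame like the one printed above could be built as follows. This is a sketch using the values from the table; the object name star_wars is the one used in the exercise later in these slides.

```r
# One vector per column, all of length 4
star_wars <- data.frame(
  name   = c("Anakin", "Padme", "Luke", "Leia"),
  gender = c("male", "female", "male", "female"),
  height = c(1.88, 1.65, 1.72, 1.50),
  weight = c(84, 45, 77, 49)
)

star_wars
typeof(star_wars)      # "list": a data frame is a list underneath
attributes(star_wars)  # names, class, and row.names
```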
summary(), str(), head(), dim(), ncol(), nrow()
songs distances
1 Golden by Huntrix 2000
2 Stay with me 800
3 4me4u 100
4 Golden by Huntrix 10
👍👍
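The index-card data fits naturally in a data frame. The sketch below rebuilds the table above and tries the inspection functions on it; the object name cards is an assumption.

```r
# Each column keeps its own type: character songs, numeric distances
cards <- data.frame(
  songs     = c("Golden by Huntrix", "Stay with me", "4me4u", "Golden by Huntrix"),
  distances = c(2000, 800, 100, 10)
)

cards[2, ]      # the second card: song and distance together
summary(cards)  # one summary per column, chosen by column type
str(cards)      # compact view of the structure
head(cards, 2)  # first two rows
dim(cards)      # rows and columns
nrow(cards)
ncol(cards)
```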
name gender height weight
1 Anakin male 1.88 84
2 Padme female 1.65 45
3 Luke male 1.72 77
4 Leia female 1.50 49
Using what you know about matrix and list subsetting, write code to subset from star_wars…
height and weight.
gender.
1.88
Data frames can be subset either as a two-dimensional array (matrix-style [row, col] subsetting) or as a one-dimensional list ([], [[]], or $ subsetting).
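Here is one possible set of answers, sketched with both styles side by side. The star_wars data frame is rebuilt from the table above so that the example is self-contained.

```r
star_wars <- data.frame(
  name   = c("Anakin", "Padme", "Luke", "Leia"),
  gender = c("male", "female", "male", "female"),
  height = c(1.88, 1.65, 1.72, 1.50),
  weight = c(84, 45, 77, 49)
)

# height and weight
star_wars[, c("height", "weight")]  # matrix style: all rows, two columns
star_wars[c("height", "weight")]    # list style: returns a smaller data frame

# gender
star_wars$gender                    # list style: the column as a vector
star_wars[["gender"]]               # same result
star_wars[, "gender"]               # matrix style, drops to a vector

# 1.88
star_wars[1, "height"]              # matrix style: row 1, column height
star_wars$height[1]                 # list style, then vector subsetting
```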
1990: Data Frame (R)

2009: Pandas DF (Py)

2016: Tibble (R)

2020: Polars (R/Py)

A tibble is an updated version of a data frame with convenient behaviors.
# A tibble: 4 × 4
name gender height weight
<chr> <chr> <dbl> <dbl>
1 Anakin male 1.88 84
2 Padme female 1.65 45
3 Luke male 1.72 77
4 Leia female 1.5 49
"The main difference is that tibbles are lazy and surly: they do less and complain more." - Hadley Wickham
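A short sketch of what "do less and complain more" can look like in practice, assuming the tibble package is installed. The object names star_wars_df and star_wars_tbl are hypothetical.

```r
library(tibble)

star_wars_df <- data.frame(
  name   = c("Anakin", "Padme", "Luke", "Leia"),
  gender = c("male", "female", "male", "female"),
  height = c(1.88, 1.65, 1.72, 1.50),
  weight = c(84, 45, 77, 49)
)
star_wars_tbl <- as_tibble(star_wars_df)

# Do less: single-column [ , ] subsetting is not simplified to a vector
star_wars_df[, "height"]   # a plain numeric vector
star_wars_tbl[, "height"]  # still a tibble with one column

# Complain more: no partial matching of column names with $
star_wars_df$hei           # silently matches "height"
star_wars_tbl$hei          # NULL, with a warning about an unknown column
```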
