library(tidyverse)
Practice: Importing tables with readr
Stat 133
- Get to know various options to import data tables in R
- Practice using reading tables functions from package
"utils"
- Practice using reading tables functions from package
"readr"
1 Importing Data Tables with "readr"
In this module, you will start learning how to import tables in R using functions from the tidyverse package "readr"
.
1.1 Data Files
This module has a companion folder that contains the text files listed below (see bCourses, Files/
tab, folder data/
, subfolder importing-tables/
):
starwarstoy.csv
: this is a typical CSV filestarwarstoy2.csv
: this is a typical CSV filestarwarsfwf.txt
: this is a fixed width format (fwf) filestarwarstoy2.txt
: this is a text file with both metadata and data
The following exercises will allow you to practice importing data into a tabular object (e.g. data.frame
, tibble
) using base "utils"
functions read.table()
and friends, as well as functions from package "readr"
.
2 Example: CSV file starwars.csv
Consider the CSV file starwarstoy.csv
. This is a comma separated value file, with the following contents:
name,gender,height,weight
Luke Skywalker,male,1.72,77
Leia Skywalker,female,1.5,49
Han Solo,male,1.8,80
Obi-Wan Kenobi,male,1.82,77
As you can tell, the content of the file consists of five lines of text.
The first line corresponds to the names of the variables or fields:
name
gender
height
weight
Lines 2 to 5 correspond to the values of four Star Wars characters (Luke, Leia, Han, and Obi-Wan)
2.1 How do you import this data in R?
R has a set of built-in functions for importing a great variety of data files. For sake of convenience, we are going to describe importing tools from the "tidyverse"
package "readr"
.
2.1.1 Quick & Dirty Import
Assuming the file starwarstoy.csv
is in your working directory, one quick-and-dirty way to import it in R is with the function read_csv()
, as shown below:
# quick-and-dirty import
= read_csv(file = "starwarstoy.csv") dat
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name, gender
dbl (2): height, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice the message provided by read_csv()
. It detected that the input file uses ","
as the field delimiter. Likewise, it has made an educated guess for the data-types of the columns: name
and gender
have been encoded as character
, whereas height
and weight
have been encoded as double
.
Once the data has been imported, you can use dat
like any other object. It is important to notice that the output of "readr"
functions is a tibble
dat
# A tibble: 4 × 4
name gender height weight
<chr> <chr> <dbl> <dbl>
1 Luke Skywalker male 1.72 77
2 Leia Skywalker female 1.5 49
3 Han Solo male 1.8 80
4 Obi-Wan Kenobi male 1.82 77
2.1.2 A More Formal Import
A more formal way to import the file with read_csv()
is to specify not just the name of the input file, but also the data-type of each column. The latter specification can be given with the argument col_types
and the auxiliary function cols()
.
The arguments passed to cols()
involve pairs of column names and their types. In turn, data types are specified with parsing functions such as col_logical()
, col_character()
, col_integer()
, col_double()
, etc.
# more formal import
= read_csv(
dat file = "starwarstoy.csv",
col_types = cols(
name = col_character(),
gender = col_character(),
height = col_double(),
weight = col_double()
))
2.1.3 Yet Another Import
An alternative way to specify the data-types of the columns is to use one-letter abbreviations instead of the parsing functions, for example (but not limited to):
"l"
: logical"i"
: integer"d"
: double"c"
: character
Here’s how to use abbreviations for the data-types:
# abbreviations for column data-types
= read_csv(
dat file = "starwarstoy.csv",
col_types = cols(
name = "c",
gender = "c",
height = "d",
weight = "d"
))
But wait, there’s more! You can write a more compact command without the use of the auxiliary cols()
function. All you have to do is pass a string with the abbreviations to the argument col_types
. For example, the command below passes the string "ccdd"
to col_types
:
# more compact command
= read_csv(
dat file = "starwarstoy.csv",
col_types = "ccdd")
The way the string "ccdd"
works is fairly simple: the first c
is for the first column, the next c
is for the second column, and so on and so forth.
2.2 Your Turn: importing starwarstoy2.csv
The file starwarstoy2.csv
has the following content:
name,gender,height,weight,jedi,species,weapon
"Luke Skywalker",male,1.72,77,jedi,human,lightsaber
"Leia Skywalker",female,1.5,49,no_jedi,human,blaster
"Obi-Wan Kenobi",male,1.82,77,jedi,human,lightsaber
"Han Solo",male,1.8,80,no_jedi,human,blaster
"R2-D2",male,0.96,32,no_jedi,droid,unarmed
"C-3PO",male,1.67,75,no_jedi,droid,unarmed
"Yoda",male,0.66,17,jedi,yoda,lightsaber
"Chewbacca",male,2.28,112,no_jedi,wookiee,bowcaster
Use read_csv()
to import this CSV file. Make sure to specify reasonable data types for each of the columns. Try to do this using at least 2 different options for the col_types
argument.
Show answer
# somewhat abbreviated
= read_csv(
dat2 file = "starwarstoy.csv",
col_types = cols(
name = "c",
gender = "c",
height = "d",
weight = "i",
jedi = "c",
species = "c",
weapon = "c"
))
# super abbreviated
= read_csv(
dat2 file = "starwarstoy.csv",
col_types = "ccdiccc")
2.3 Your Turn: Fixed Width Format
The file starwarsfwf.txt
has the following content:
name gender height weight
"Luke Skywalker" male 1.72 77
"Leia Skywalker" female 1.5 49
"Han Solo" male 1.8 80
"Obi-Wan Kenobi" male 1.82 77
See the documentation of the "readr"
function read_fwf()
, and use this function to import the data in starwarsfwf.txt
.
One suggestion to complete this operation is to specify arguments:
name of the
file
column positions and names—argument
col_positions
—via the functionfwf_widths()
skip the first line—because column names are being already specified via
fwf_widths()
Show answer
= read_fwf(
fwf file = "starwarsfwf.txt",
col_positions = fwf_widths(
widths = c(16, 7, 7, 7),
col_names = c('name', 'gender', 'height', 'weight')),
skip = 1)
2.4 Your Turn: File with metadata
Take a look at the file starwarstoy2.txt
.
# Description: Toy data set of some Star Wars characters
# Format: Tab separated values
# height units in meters
# weight units in kilograms
# column names
name
gender
height
weight
jedi
# rows
"Luke Skywalker" male 1.72 77 TRUE
"Leia Skywalker" female 1.5 49 FALSE
"Obi-Wan Kenobi" male 1.82 77 TRUE
"Han Solo" male 1.8 80 FALSE
As you can tell, this file has three sections:
The first four lines of text have some general information with the description of the file, its format, and units of measurement for
height
andweight
The next section, from line 7 to 11, indicates the names of the columns
The last section, from line 14 to 17, have the actual data table
Your goal is to write commands, without hardcoding any values, in order to import the data table (with their corresponding column names). We recommend the following strategy:
use
read_lines()
to import the first 11 lines in a character vector,from the character vector of step (1), extract those elements that correspond to the column names; this will give you a character vector with just the column names,
use
read_tsv()
to import the data values starting at line 14 (i.e. by skipping the first 13 lines), specifying the column names previously extracted in the character vector of step (2); as well as providing sensible values for thecol_types
argument.
Show answer
# step 1
= read_lines("starwarstoy2.txt", n_max = 11)
first_lines
# step 2
= tail(first_lines, 5)
column_names
# step 3
= read_tsv(
sw2 file = "starwarstoy2.txt",
skip = 13,
col_names = column_names,
col_types = "ccddl")