13.1 readr | Practitioner’s Guide to Data Science

13.1 `readr`

You must be familiar with read.csv(), read.table() and write.csv() in base R. Here we will introduce a more efficient package for reading and writing data: readr package. The corresponding functions are read_csv(), read_table() and write_csv(). The commands look quite similar, but readr is different in the following respects:

It is 10x faster. The trick is that readr uses C++ to process the data quickly.
It doesn’t change the column names. The names can start with a number and “.” will not be substituted to “_”. For example:

read_csv("2015,2016,2017
1,2,3
4,5,6")

## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────
## Delimiter: ","
## dbl (3): 2015, 2016, 2017
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##   `2015` `2016` `2017`
##    <dbl>  <dbl>  <dbl>
## 1      1      2      3
## 2      4      5      6

readr functions do not convert strings to factors by default, are able to parse dates and times and can automatically determine the data types in each column.
The killing character, in my opinion, is that readr provides progress bar. What makes you feel worse than waiting is not knowing how long you have to wait.

The major functions of readr is to turn flat files into data frames:

read_csv(): reads comma delimited files
read_csv2(): reads semicolon separated files (common in countries where , is used as the decimal place)
read_tsv(): reads tab delimited files
read_delim(): reads in files with any delimiter
read_fwf(): reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions()
read_table(): reads a common variation of fixed width files where columns are separated by white space
read_log(): reads Apache style log files

The good thing is that those functions have similar syntax. Once you learn one, the others become easy. Here we will focus on read_csv().

The most important information for read_csv() is the path to your data:

sim.dat <- read_csv("http://bit.ly/2P5gTw4")
head(sim.dat)

# A tibble: 6 x 19
    age gender income house store_exp online_exp store_trans online_trans    Q1
  <int> <chr>   <dbl> <chr>     <dbl>      <dbl>       <int>        <int> <int>
1    57 Female 1.21e5 Yes        529.       304.           2            2     4
2    63 Female 1.22e5 Yes        478.       110.           4            2     4
3    59 Male   1.14e5 Yes        491.       279.           7            2     5
4    60 Male   1.14e5 Yes        348.       142.          10            2     5
5    51 Male   1.24e5 Yes        380.       112.           4            4     4
6    59 Male   1.08e5 Yes        338.       196.           4            5     4
# ... with 10 more variables: Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>, Q6 <int>,
#   Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>

The function reads the file to R as a tibble. You can consider tibble as next iteration of the data frame. They are different with data frame for the following aspects:

It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!)
It never adjusts the names of variables
It has a refined print method that shows only the first 10 rows and all the columns that fit on the screen. You can also control the default print behavior by setting options.

Refer to http://r4ds.had.co.nz/tibbles.html for more information about ‘tibble’.

When you run read_csv() it prints out a column specification that gives the name and type of each column. To better understanding how readr works, it is helpful to type in some baby data set and check the results:

dat <- read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")

## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (3): 2015, 2016, 2017
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(dat)

## # A tibble: 2 × 3
##   `2015` `2016`  `2017`
##   <chr>  <chr>   <chr> 
## 1 100    200     300   
## 2 canola soybean corn

You can also add comments on the top and tell R to skip those lines:

dat <- read_csv("# I will never let you know that
          # my favorite food is carrot
          Date,Food,Mood
          Monday,carrot,happy
          Tuesday,carrot,happy
          Wednesday,carrot,happy
          Thursday,carrot,happy
          Friday,carrot,happy
          Saturday,carrot,extremely happy
          Sunday,carrot,extremely happy", 
          skip = 2)

## Rows: 7 Columns: 3
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (3): Date, Food, Mood
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(dat)

## # A tibble: 7 × 3
##   Date      Food   Mood           
##   <chr>     <chr>  <chr>          
## 1 Monday    carrot happy          
## 2 Tuesday   carrot happy          
## 3 Wednesday carrot happy          
## 4 Thursday  carrot happy          
## 5 Friday    carrot happy          
## 6 Saturday  carrot extremely happy
## 7 Sunday    carrot extremely happy

If you don’t have column names, set col_names = FALSE then R will assign names “X1”,“X2”… to the columns:

dat <- read_csv("Saturday,carrot,extremely happy
          Sunday,carrot,extremely happy", col_names = FALSE)

## Rows: 2 Columns: 3
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(dat)

## # A tibble: 2 × 3
##   X1       X2     X3             
##   <chr>    <chr>  <chr>          
## 1 Saturday carrot extremely happy
## 2 Sunday   carrot extremely happy

You can also pass col_names a character vector which will be used as the column names. Try to replace col_names=FALSE with col_names=c("Date","Food","Mood") and see what happen.

As mentioned before, you can use read_csv2() to read semicolon separated files:

dat <- read_csv2("Saturday; carrot; extremely happy \n 
                 Sunday; carrot; extremely happy", col_names = FALSE)
print(dat)

## # A tibble: 2 × 3
##   X1       X2     X3             
##   <chr>    <chr>  <chr>          
## 1 Saturday carrot extremely happy
## 2 Sunday   carrot extremely happy

Here “\n” is a convenient shortcut for adding a new line.

You can use read_tsv() to read tab delimited files:

dat <- read_tsv("every\tman\tis\ta\tpoet\twhen\the\tis\tin\tlove\n", 
    col_names = FALSE)

## Rows: 1 Columns: 10
## ── Column specification ───────────────────────────────
## Delimiter: "\t"
## chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(dat)

## # A tibble: 1 × 10
##   X1    X2    X3    X4    X5    X6    X7    X8    X9   
##   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 every man   is    a     poet  when  he    is    in   
## # … with 1 more variable: X10 <chr>

Or more generally, you can use read_delim() and assign separating character:

dat <- read_delim("THE|UNBEARABLE|RANDOMNESS|OF|LIFE\n", 
    delim = "|", col_names = FALSE)

## Rows: 1 Columns: 5
## ── Column specification ───────────────────────────────
## Delimiter: "|"
## chr (5): X1, X2, X3, X4, X5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(dat)

## # A tibble: 1 × 5
##   X1    X2         X3         X4    X5   
##   <chr> <chr>      <chr>      <chr> <chr>
## 1 THE   UNBEARABLE RANDOMNESS OF    LIFE

Another situation you will often run into is the missing value. In marketing survey, people like to use “99” to represent missing. You can tell R to set all observation with value “99” as missing when you read the data:

dat <- read_csv("Q1,Q2,Q3
               5, 4,99", 
               na = "99")

## Rows: 1 Columns: 3
## ── Column specification ───────────────────────────────
## Delimiter: ","
## dbl (2): Q1, Q2
## lgl (1): Q3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

print(dat)

## # A tibble: 1 × 3
##      Q1    Q2 Q3   
##   <dbl> <dbl> <lgl>
## 1     5     4 NA

For writing data back to disk, you can use write_csv() and write_tsv(). The following two characters of the two functions increase the chances of the output file being read back in correctly:

Encode strings in UTF-8
Save dates and date-times in ISO8601 format so they are easily parsed elsewhere

For example:

write_csv(sim.dat, "sim_dat.csv")

For other data types, you can use the following packages:

Haven: SPSS, Stata and SAS data
Readxl and xlsx: excel data(.xls and .xlsx)
DBI: given data base, such as RMySQL, RSQLite and RPostgreSQL, read data directly from the database using SQL

Some other useful materials:

For getting data from the internet, you can refer to the book “XML and Web Technologies for Data Sciences with R”.
R data import/export manual
rio package:https://github.com/leeper/rio