
1 Data Wrangling

This chapter focuses on some of the most frequently used data manipulations and shows how to implement them in R. It is important to explore a data set with descriptive statistics (mean, standard deviation, etc.) and data visualization prior to analysis. You also need to transform the data so that its structure is in line with the requirements of the model, and to summarize the results after the analysis.

Here we assume the readers are already familiar with some of the traditional R data operations, such as subsetting data frames, deleting variables, and the read and write functions (read.csv(), write.csv(), etc.) in base R. We will also skip some basic descriptive functions in R. For example, for discrete variables, we often use a frequency table (table()) to look at the counts of a variable at its various levels, or a crosstab of two variables, and we can draw a bar chart (barplot()). For continuous variables, we need to look at the mean (mean()), standard deviation (sd()) and quantiles (quantile()) of a variable from time to time. There are also functions like summary(), str() and describe() (a function in the ‘psych’ package) that give a summary of a data frame.
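
As a quick refresher, here is a minimal sketch of those base functions applied to the built-in iris data set (iris is only a stand-in for your own data):

# frequency table and bar chart of a discrete variable
table(iris$Species)
barplot(table(iris$Species))
# mean, standard deviation and quantiles of a continuous variable
mean(iris$Sepal.Length)
sd(iris$Sepal.Length)
quantile(iris$Sepal.Length)
# summaries of the whole data frame
summary(iris)
str(iris)
# psych::describe(iris[, 1:4])  # if the 'psych' package is installed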

The focus here is to introduce some of the more efficient data wrangling methods in R.

1.1 Read and write data

1.1.1 readr

You must be familiar with read.csv(), read.table() and write.csv() in base R. Here we will introduce a more efficient package for reading and writing data, released by RStudio in 2015: the readr package. The corresponding functions are read_csv(), read_table() and write_csv(). The commands look quite similar, but readr is different in the following respects:

  1. It is roughly 10x faster. The trick is that readr uses C++ to process the data quickly.

  2. It doesn’t change the column names. Names can start with a number, and “.” will not be substituted with “_”. For example:

    library(readr)
    read_csv("2015,2016,2017
    1,2,3
    4,5,6")
    ##   2015 2016 2017
    ## 1    1    2    3
    ## 2    4    5    6
  3. readr functions do not convert strings to factors by default, are able to parse dates and times, and can automatically determine the data type of each column (see the small sketch after this list).

  4. The killer feature, in my opinion, is that readr provides a progress bar. What makes you feel worse than waiting is not knowing how long you have to wait. The lack of a “progress bar” might be the No. 1 reason people break up with the one they have been dating.
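
To see point 3 in action, here is a minimal sketch (the values are made up): strings stay as characters and ISO-8601 dates are parsed automatically:

library(readr)
dat=read_csv("name,birthday
alice,2000-01-01
bob,1999-12-31")
# "name" is read as character and "birthday" as a date,
# without any stringsAsFactors fiddling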

The major functions of readr turn flat files into data frames:

  • read_csv(): reads comma delimited files
  • read_csv2(): reads semicolon separated files (common in countries where , is used as the decimal place)
  • read_tsv(): reads tab delimited files
  • read_delim(): reads in files with any delimiter
  • read_fwf(): reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions()
  • read_table(): reads a common variation of fixed width files where columns are separated by white space
  • read_log(): reads Apache style log files

The good thing is that those functions have similar syntax. Once you learn one, the others become easy. Here we will focus on read_csv().

The most important information for read_csv() is the path to your data:

library(readr)
sim.dat <- read_csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
head(sim.dat)
##   age gender   income house store_exp online_exp store_trans online_trans
## 1  57 Female 120963.4   Yes  529.1344   303.5125           2            2
## 2  63 Female 122008.1   Yes  478.0058   109.5297           4            2
## 3  59   Male 114202.3   Yes  490.8107   279.2496           7            2
## 4  60   Male 113616.3   Yes  347.8090   141.6698          10            2
## 5  51   Male 124252.6   Yes  379.6259   112.2372           4            4
## 6  59   Male 107661.5   Yes  338.3154   195.6870           4            5
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1  4  2  1  2  1  4  1  4  2   4   Price
## 2  4  1  1  2  1  4  1  4  1   4   Price
## 3  5  2  1  2  1  4  1  4  1   4   Price
## 4  5  2  1  3  1  4  1  4  2   4   Price
## 5  4  1  1  3  1  4  1  4  2   4   Price
## 6  4  2  1  2  1  4  1  4  1   4   Price

The function reads the file into R as a tibble. You can consider a tibble as the next iteration of the data frame. Tibbles differ from data frames in the following respects:

  • It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!)
  • It never adjusts the names of variables
  • It has a refined print method that shows only the first 10 rows, and all the columns that fit on screen. You can also control the default print behavior by setting options.

Refer to http://r4ds.had.co.nz/tibbles.html for more information about ‘tibble’.
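
A minimal sketch of the printing difference, using the built-in iris data frame as a stand-in:

library(tibble)
as_tibble(iris)  # prints only the first 10 rows plus the column types
# iris           # a plain data frame would print all 150 rows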

When you run read_csv(), it prints out a column specification that gives the name and type of each column. To better understand how readr works, it is helpful to type in a baby data set and check the results:

dat=read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")
print(dat)
##     2015    2016 2017
## 1    100     200  300
## 2 canola soybean corn

You can also add comments on the top and tell R to skip those lines:

dat=read_csv("# I will never let you know that
          # my favorite food is carrot
          Date,Food,Mood
          Monday,carrot,happy
          Tuesday,carrot,happy
          Wednesday,carrot,happy
          Thursday,carrot,happy
          Friday,carrot,happy
          Saturday,carrot,extremely happy
          Sunday,carrot,extremely happy", skip = 2)
print(dat)
##        Date   Food            Mood
## 1    Monday carrot           happy
## 2   Tuesday carrot           happy
## 3 Wednesday carrot           happy
## 4  Thursday carrot           happy
## 5    Friday carrot           happy
## 6  Saturday carrot extremely happy
## 7    Sunday carrot extremely happy

If you don’t have column names, set col_names = FALSE then R will assign names “X1”,“X2”… to the columns:

dat=read_csv("Saturday,carrot,extremely happy
          Sunday,carrot,extremely happy", col_names=FALSE)
print(dat)
##         X1     X2              X3
## 1 Saturday carrot extremely happy
## 2   Sunday carrot extremely happy

You can also pass col_names a character vector, which will be used as the column names. Try replacing col_names=FALSE with col_names=c("Date","Food","Mood") and see what happens.
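
In case you want to check your answer, here is a sketch of that call:

dat=read_csv("Saturday,carrot,extremely happy
          Sunday,carrot,extremely happy",
          col_names=c("Date","Food","Mood"))
print(dat)
# the columns are now named Date, Food and Mood instead of X1, X2, X3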

As mentioned before, you can use read_csv2() to read semicolon separated files:

dat=read_csv2("Saturday; carrot; extremely happy \n Sunday; carrot; extremely happy", col_names=FALSE)
print(dat)
##         X1     X2              X3
## 1 Saturday carrot extremely happy
## 2   Sunday carrot extremely happy

Here “\n” is a convenient shortcut for adding a new line.

You can use read_tsv() to read tab delimited files:

dat=read_tsv("every\tman\tis\ta\tpoet\twhen\the\tis\tin\tlove\n", col_names = FALSE)
print(dat)
##      X1  X2 X3 X4   X5   X6 X7 X8 X9  X10
## 1 every man is  a poet when he is in love

Or more generally, you can use read_delim() and assign separating character:

dat=read_delim("THE|UNBEARABLE|RANDOMNESS|OF|LIFE\n", delim = "|", col_names = FALSE)
print(dat)
##    X1         X2         X3 X4   X5
## 1 THE UNBEARABLE RANDOMNESS OF LIFE

Another situation you will often run into is missing values. In marketing surveys, people like to use “99” to represent missing. You can tell R to set all observations with value “99” to missing when you read the data:

dat=read_csv("Q1,Q2,Q3
               5, 4,99",na="99")
print(dat)
##   Q1 Q2   Q3
## 1  5  4 <NA>
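
The na argument also accepts a character vector in case more than one code means missing (here “-9” is a hypothetical second missing-value code):

dat=read_csv("Q1,Q2,Q3
               5,99,-9", na=c("99","-9"))
print(dat)
# both 99 and -9 are read in as NA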

For writing data back to disk, you can use write_csv() and write_tsv(). The following two characteristics of these functions increase the chances of the output file being read back in correctly:

  • Encode strings in UTF-8
  • Save dates and date-times in ISO8601 format so they are easily parsed elsewhere

For example:

write_csv(sim.dat, "sim_dat.csv")

For other data types, you can use the following packages (a minimal sketch of each follows below):

  • haven: SPSS, Stata and SAS data
  • readxl and xlsx: Excel data (.xls and .xlsx)
  • DBI: combined with a database-specific backend such as RMySQL, RSQLite or RPostgreSQL, it lets you read data directly from the database using SQL
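
A minimal sketch of each (the file paths are placeholders, and you need the corresponding package installed):

# SPSS, Stata and SAS files with haven
# dat <- haven::read_sas("data.sas7bdat")
# Excel files with readxl
# dat <- readxl::read_excel("data.xlsx", sheet = 1)
# a database via DBI, using an in-memory SQLite database as an example
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE cyl = 4")
DBI::dbDisconnect(con)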


1.1.2 data.table: an enhanced data.frame

What is data.table? It is an R package that provides an enhanced version of data.frame, the most used object in R. Before we move on, let’s briefly review some basic characteristics and manipulations of data.frame:

  • It is a set of rows and columns.
  • Every row has the same number of columns.
  • Every column is of the same length and holds a single data type, but different columns can have different data types.
  • It has characteristics of both a matrix and a list.
  • It uses [] to subset data.

I will use the clothes customer data to illustrate. There are two dimensions in []: the first indicates the row and the second indicates the column, separated by a comma.

# read data
sim.dat<-read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
# subset the first two rows
sim.dat[1:2,]
##   age gender   income house store_exp online_exp store_trans online_trans
## 1  57 Female 120963.4   Yes  529.1344   303.5125           2            2
## 2  63 Female 122008.1   Yes  478.0058   109.5297           4            2
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1  4  2  1  2  1  4  1  4  2   4   Price
## 2  4  1  1  2  1  4  1  4  1   4   Price
# subset the first two rows and column 3 and 5
sim.dat[1:2,c(3,5)]
##     income store_exp
## 1 120963.4  529.1344
## 2 122008.1  478.0058
# get all rows with age>70
sim.dat[sim.dat$age>70,]
##     age gender   income house store_exp online_exp store_trans
## 288 300   Male 208017.5   Yes  5076.801   6053.485          12
##     online_trans Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
## 288           11  1  4  5  4  4  4  4  1  4   2 Conspicuous
# get rows with age> 60 and gender is Male
# select column 3 and 4
sim.dat[sim.dat$age>68 & sim.dat$gender == "Male", 3:4]
##       income house
## 204 119552.0    No
## 288 208017.5   Yes

Remember that there are usually several ways to conduct the same manipulation. For example, the following code presents three ways to calculate the average number of online transactions for males and females:

tapply(sim.dat$online_trans, sim.dat$gender, mean )
##   Female     Male 
## 15.38448 11.26233
aggregate(online_trans ~ gender, data = sim.dat, mean)
##   gender online_trans
## 1 Female     15.38448
## 2   Male     11.26233
library(dplyr)
sim.dat%>% 
  group_by(gender)%>%
  summarise(Avg_online_trans=mean(online_trans))
## # A tibble: 2 x 2
##   gender Avg_online_trans
##   <fctr>            <dbl>
## 1 Female         15.38448
## 2   Male         11.26233

There is no gold standard for choosing a specific function to manipulate data. The goal is to solve the real problem, not to master the tool itself. So just use whatever tool is convenient for you.

The way to use [] is straightforward, but the manipulations it supports are limited. If you need more complicated data reshaping or aggregation, there are other packages to use, such as dplyr, reshape2, tidyr, etc. However, their usage is not as straightforward as [], and you often need to switch among functions. Keeping related operations together, such as subset, group, update, join, etc., will allow for:

  • concise, consistent and readable syntax irrespective of the set of operations you would like to perform to achieve your end goal
  • performing data manipulation fluidly without the cognitive burden of having to change among different functions
  • by knowing precisely the data required for each operation, you can automatically optimize operations effectively

data.table is the package for that. If you are not familiar with other data manipulating packages and are interested in reducing programming time tremendously, then this package is for you.

Other than extending the function of [], data.table has the following advantages:

  • fast import, subsetting, grouping, updating, and joins for large data files
  • easy conversion from a data frame to a data table
  • the ability to behave just like a data frame

You need to install and load the package:

# If you haven't installed it, use the following code to install it
# install.packages("data.table")
# load package
library(data.table)

Use data.table() to convert the existing data frame sim.dat to a data table:

dt <- data.table(sim.dat)
class(dt)
## [1] "data.table" "data.frame"

Calculate mean for counts of online transactions:

dt[, mean(online_trans)]
## [1] 13.546

You can’t do the same thing using data frame:

sim.dat[,mean(online_trans)]
Error in mean(online_trans) : object 'online_trans' not found

If you want to calculate the mean by group as before, set the by argument:

dt[ , mean(online_trans), by = gender]
##    gender       V1
## 1: Female 15.38448
## 2:   Male 11.26233

You can group by more than one variable. For example, group by “gender” and “house”:

dt[ , mean(online_trans), by = .(gender, house)]
##    gender house        V1
## 1: Female   Yes 11.312030
## 2:   Male   Yes  8.771523
## 3: Female    No 19.145833
## 4:   Male    No 16.486111

Assign column names for aggregated variables:

dt[ , .(avg = mean(online_trans)), by = .(gender, house)]
##    gender house       avg
## 1: Female   Yes 11.312030
## 2:   Male   Yes  8.771523
## 3: Female    No 19.145833
## 4:   Male    No 16.486111

data.table can accomplish all operations that aggregate() and tapply() can do for a data frame.

  • General setting of data.table

Different from a data frame, there are three arguments inside a data table’s []:

dt[i, j, by]

It is analogous to SQL: you select column j (using SELECT) for the rows satisfying condition i (using WHERE), grouped by the variables in by (GROUP BY). You don’t have to know SQL to learn data.table, but experience with SQL will help you understand it.

Let’s review our previous code:

dt[ , mean(online_trans), by = gender]
##    gender       V1
## 1: Female 15.38448
## 2:   Male 11.26233

The code above is equal to the following SQL:

SELECT gender, avg(online_trans) FROM sim.dat GROUP BY gender

R code:

dt[ , .(avg = mean(online_trans)), by = .(gender, house)]
##    gender house       avg
## 1: Female   Yes 11.312030
## 2:   Male   Yes  8.771523
## 3: Female    No 19.145833
## 4:   Male    No 16.486111

is equal to SQL:

SELECT gender, house, avg(online_trans) AS avg FROM sim.dat GROUP BY gender, house

R code:

dt[ age < 40, .(avg = mean(online_trans)), by = .(gender, house)]
##    gender house      avg
## 1:   Male   Yes 14.45977
## 2: Female   Yes 18.14062
## 3:   Male    No 18.24299
## 4: Female    No 20.10196

is equal to SQL:

SELECT gender, house, avg(online_trans) AS avg FROM sim.dat WHERE age < 40 GROUP BY gender, house

You can see the analogy between data.table and SQL. Now let’s focus on operations in data table.

  • select rows
# select rows with age<20 and income > 80000
dt[age < 20 & income > 80000]
##    age gender   income house store_exp online_exp store_trans online_trans
## 1:  19 Female 83534.70    No  227.6686   1490.719           1           22
## 2:  18 Female 89415.97   Yes  209.5487   1926.470           3           28
## 3:  19 Female 92812.81    No  186.7475   1041.539           2           18
##    Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1:  2  1  1  2  4  1  4  2  4   1   Style
## 2:  2  1  1  1  4  1  4  2  4   1   Style
## 3:  3  1  1  2  4  1  4  3  4   1   Style
# select the first two rows
dt[1:2]
##    age gender   income house store_exp online_exp store_trans online_trans
## 1:  57 Female 120963.4   Yes  529.1344   303.5125           2            2
## 2:  63 Female 122008.1   Yes  478.0058   109.5297           4            2
##    Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1:  4  2  1  2  1  4  1  4  2   4   Price
## 2:  4  1  1  2  1  4  1  4  1   4   Price
  • select columns

Selecting columns in data.table doesn’t require $:

# select column “age” but return it as a vector
# the argument for row is empty so the result will return all observations
ans <- dt[, age]
head(ans)
## [1] 57 63 59 60 51 59

To return data.table object, put column names in list():

# Select age and online_exp columns and return as a data.table instead
ans <- dt[, list(age, online_exp)]
head(ans)
##    age online_exp
## 1:  57   303.5125
## 2:  63   109.5297
## 3:  59   279.2496
## 4:  60   141.6698
## 5:  51   112.2372
## 6:  59   195.6870

Or you can also put column names in .():

ans <- dt[, .(age, online_exp)]
# head(ans)

To select all columns from “age” to “income”:

ans <- dt[, age:income, with = FALSE]
head(ans,2)
##    age gender   income
## 1:  57 Female 120963.4
## 2:  63 Female 122008.1

Delete columns using - or !:

# delete columns from  age to online_exp
ans <- dt[, -(age:online_exp), with = FALSE]
ans <- dt[, !(age:online_exp), with = FALSE]
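
A related idiom worth knowing (not covered above) is deleting a column by reference with :=; here copy() is used so the original dt stays untouched:

# copy() avoids modifying dt itself
dt2 <- copy(dt)
# delete column Q10 by reference
dt2[, Q10 := NULL]
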
  • tabulation

In data.table, .N means the number of rows (a count).

# row count
dt[, .N] 
## [1] 1000

If you assign the group variable, then it will count by groups:

# counts by gender
dt[, .N, by= gender]  
##    gender   N
## 1: Female 554
## 2:   Male 446
# for those younger than 30, count by gender
 dt[age < 30, .(count=.N), by= gender] 
##    gender count
## 1: Female   292
## 2:   Male    86

Order table:

# get records with the highest 5 online expense:
head(dt[order(-online_exp)],5) 
##    age gender   income house store_exp online_exp store_trans online_trans
## 1:  40 Female 217599.7    No  7023.684   9479.442          10            6
## 2:  41 Female       NA   Yes  3786.740   8638.239          14           10
## 3:  36   Male 228550.1   Yes  3279.621   8220.555           8           12
## 4:  31 Female 159508.1   Yes  5177.081   8005.932          11           13
## 5:  43 Female 190407.4   Yes  4694.922   7875.562           6           11
##    Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
## 1:  1  4  5  4  3  4  4  1  4   2 Conspicuous
## 2:  1  4  4  4  4  4  4  1  4   2 Conspicuous
## 3:  1  4  5  4  4  4  4  1  4   1 Conspicuous
## 4:  1  4  4  4  4  4  4  1  4   2 Conspicuous
## 5:  1  4  5  4  4  4  4  1  4   2 Conspicuous

Since a data table keeps some characteristics of a data frame, they share some operations:

dt[order(-online_exp)][1:5]
##    age gender   income house store_exp online_exp store_trans online_trans
## 1:  40 Female 217599.7    No  7023.684   9479.442          10            6
## 2:  41 Female       NA   Yes  3786.740   8638.239          14           10
## 3:  36   Male 228550.1   Yes  3279.621   8220.555           8           12
## 4:  31 Female 159508.1   Yes  5177.081   8005.932          11           13
## 5:  43 Female 190407.4   Yes  4694.922   7875.562           6           11
##    Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
## 1:  1  4  5  4  3  4  4  1  4   2 Conspicuous
## 2:  1  4  4  4  4  4  4  1  4   2 Conspicuous
## 3:  1  4  5  4  4  4  4  1  4   1 Conspicuous
## 4:  1  4  4  4  4  4  4  1  4   2 Conspicuous
## 5:  1  4  5  4  4  4  4  1  4   2 Conspicuous

You can also order the table by more than one variable. The following code orders the table by gender first, then by online_exp in descending order within gender:

dt[order(gender, -online_exp)][1:5]
##    age gender   income house store_exp online_exp store_trans online_trans
## 1:  40 Female 217599.7    No  7023.684   9479.442          10            6
## 2:  41 Female       NA   Yes  3786.740   8638.239          14           10
## 3:  31 Female 159508.1   Yes  5177.081   8005.932          11           13
## 4:  43 Female 190407.4   Yes  4694.922   7875.562           6           11
## 5:  50 Female 263858.0   Yes  5813.802   7448.729          11           11
##    Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10     segment
## 1:  1  4  5  4  3  4  4  1  4   2 Conspicuous
## 2:  1  4  4  4  4  4  4  1  4   2 Conspicuous
## 3:  1  4  4  4  4  4  4  1  4   2 Conspicuous
## 4:  1  4  5  4  4  4  4  1  4   2 Conspicuous
## 5:  1  4  5  4  4  4  4  1  4   1 Conspicuous
  • Use fread() to import data

Other than read.csv() in base R, we have introduced read_csv() in readr. read_csv() is much faster and provides a progress bar, which makes users feel much better (at least it makes me feel better). fread() in data.table further increases the efficiency of reading data. The following are three examples of reading the same data file topic.csv. The file includes text data scraped from an agriculture forum, with 209,670 rows and 6 columns:

system.time(topic<-read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv"))
  user  system elapsed 
  4.313   0.027   4.340
system.time(topic<-readr::read_csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv"))
   user  system elapsed 
  0.267   0.008   0.274 
system.time(topic<-data.table::fread("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv"))
   user  system elapsed 
  0.217   0.005   0.221 

It is clear that read_csv() is much faster than read.csv(), and fread() is a little faster than read_csv(). As the file size increases, the difference becomes more significant. Note that fread() reads the file as a data.table by default.
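
On the writing side, data.table offers a fast counterpart as well; a one-line sketch (the output file name is a placeholder):

# write the data back to disk quickly
data.table::fwrite(topic, "topic_copy.csv")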

1.2 Summarize data

1.2.1 apply(), lapply() and sapply() in base R

There are some powerful functions for summarizing data in base R, such as apply(), lapply() and sapply(). They do the same basic thing and are all from the “apply” family: they apply functions over parts of the data. They differ in two important respects:

  1. the type of object they apply to
  2. the type of result they will return

When do we use apply()? When we want to apply a function to the margins of an array or matrix. That means our data needs to be structured, but the operations can be very flexible. It returns a vector, array, or list of values obtained by applying the function to the margins of the array or matrix.

For example you can compute row and column sums for a matrix:

## simulate a matrix
x <- cbind(x1 =1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
##  x1  x2 
## 4.5 3.0
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)

You can also apply other functions:

ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
ma
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    1    7
## [2,]    2    4    6    8
apply(ma, 1, table)  #--> a list of length 2
## [[1]]
## 
## 1 3 7 
## 2 1 1 
## 
## [[2]]
## 
## 2 4 6 8 
## 1 1 1 1
apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
##      [,1] [,2]
## 0%      1  2.0
## 25%     1  3.5
## 50%     2  5.0
## 75%     4  6.5
## 100%    7  8.0

Results can have different lengths for each call. This is a trickier example. What will you get?

## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq         ## a 2 x 3 matrix
typeof(zseq) ## list
dim(zseq) ## 2 3
zseq[1,]
apply(z, 3, function(x) seq_len(max(x)))
  • lapply() applies a function over a list, data frame or vector and returns a list of the same length.
  • sapply() is a user-friendly version and wrapper of lapply(). By default it returns a vector or matrix (or, if simplify = "array", an array) when possible. sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f); with simplify = TRUE (the default), it returns a simplified structure instead of a list whenever possible. The small sketch after this list shows the difference.
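
A tiny illustration, using a made-up list:

x <- list(a = 1:5, b = c(2.5, 3.5))
lapply(x, mean)                    # a list of length 2
sapply(x, mean)                    # a named numeric vector
sapply(x, mean, simplify = FALSE)  # identical to lapply(x, mean)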

Let’s use some data with context to help you better understand the functions.

  • Get the mean and standard deviation of all numerical variables in the data set.
# Read data
sim.dat<-read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
# Get numerical variables
sdat<-sim.dat[,!lapply(sim.dat,class)=="factor"]
## Try the following code with apply() function
## apply(sim.dat,2,class)
## What is the problem?

The data frame sdat only includes numeric columns. Now we can go ahead and use apply() to get the mean and standard deviation of each column:

apply(sdat, MARGIN=2,function(x) mean(na.omit(x)))
##          age       income    store_exp   online_exp  store_trans 
##       38.840   113543.065     1356.851     2120.181        5.350 
## online_trans           Q1           Q2           Q3           Q4 
##       13.546        3.101        1.823        1.992        2.763 
##           Q5           Q6           Q7           Q8           Q9 
##        2.945        2.448        3.434        2.396        3.085 
##          Q10 
##        2.320

Here we defined a function using function(x) mean(na.omit(x)). It is a very simple function: it tells R to ignore missing values when calculating the mean. MARGIN=2 tells R to apply the function to each column; it is not hard to guess what MARGIN=1 means. The results show that the average online expense is much higher than the average store expense. You can also compare the average scores across different questions. The command to calculate the standard deviation is very similar; the only difference is to change mean() to sd():

apply(sdat, MARGIN=2,function(x) sd(na.omit(x)))
##          age       income    store_exp   online_exp  store_trans 
##    16.416818 49842.287197  2774.399785  1731.224308     3.695559 
## online_trans           Q1           Q2           Q3           Q4 
##     7.956959     1.450139     1.168348     1.402106     1.155061 
##           Q5           Q6           Q7           Q8           Q9 
##     1.284377     1.438529     1.455941     1.154347     1.118493 
##          Q10 
##     1.136174

Even though the average online expense is higher than the store expense, the standard deviation of store expense is much higher than that of online expense, which indicates that there are very likely some unusually large or small purchases in store. We can check quickly:

summary(sdat$store_exp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -500.0   205.0   329.0  1357.0   597.3 50000.0
summary(sdat$online_exp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   68.82  420.30 1942.00 2120.00 2441.00 9479.00

There are some odd values in store expense. The minimum value is -500, which is clearly a wrong entry and indicates that you should preprocess the data before analyzing it. Checking those simple statistics will help you understand your data better, and it gives you some idea of how to preprocess and analyze it. How about using lapply() and sapply()?

Run the following code and compare the results:

lapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)

1.2.2 ddply() in plyr package

plyr is a set of clean and consistent tools that implement the split-apply-combine pattern in R. This is an extremely common pattern in data analysis: you solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again. [From the package description]

You may find the description sounds familiar: the package is sort of a wrapper around the apply family. We will only introduce the main function, ddply(), because the package has a next iteration, the dplyr package, which we will introduce in more detail. The reason we still spend some time on the older version is that the two share a similar idea, and knowing the lineage will deepen your understanding of the whole family.

We will use the same data frame sim.dat to illustrate. Run the following command:

library(plyr)
ddply(sim.dat,"segment",summarize, Age=round(mean(na.omit(age)),0),
      FemalePct=round(mean(gender=="Female"),2),
      HouseYes=round(mean(house=="Yes"),2),
      store_exp=round(mean(na.omit(store_exp),trim=0.1),0),
      online_exp=round(mean(online_exp),0),
      store_trans=round(mean(store_trans),1),
      online_trans=round(mean(online_trans),1))
##       segment Age FemalePct HouseYes store_exp online_exp store_trans
## 1 Conspicuous  42      0.32     0.86      4990       4898        10.9
## 2       Price  60      0.45     0.94       501        205         6.1
## 3     Quality  35      0.47     0.34       301       2013         2.9
## 4       Style  24      0.81     0.27       200       1962         3.0
##   online_trans
## 1         11.1
## 2          3.0
## 3         16.0
## 4         21.1

Now, let’s peel the onion in order.

The first argument sim.dat is easy. It is the data you want to work on.

The second argument "segment" is the column you want to group by. It is a very standard marketing segmentation problem: the final segments are the result you get by designing a survey, then collecting and analyzing data. Here we assume those segments are known, and we want to understand what each group of customers looks like. It is a common task in segmentation: figuring out a profile. Here we only summarize the data by one categorical variable, but you can group by multiple variables using ddply(sim.dat, c("segment","house"), .), as the sketch after this paragraph shows. So the second argument tells the function that we want to divide the data by customer segment.
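
For example, a sketch of grouping by both segment and house (the summary column avg_online_exp is made up for illustration):

ddply(sim.dat, c("segment", "house"), summarize,
      avg_online_exp = round(mean(online_exp), 0))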

The third argument summarize tells R the kind of manipulation you want to do, which is to summarize data. There are other choices for this argument, such as transform (transform data within each group) and subset (subset data within each group).

Then the rest of the command tells R the exact action. For example, Age=round(mean(na.omit(age)),0) tells R the following things:

  1. Calculate the mean of column age ignoring missing value
  2. Round the result to the specified number of decimal places
  3. Store the result to a new variable named Age

The rest of the commands above are similar. In the end we calculate the following for each segment:

  1. Age: average age
  2. FemalePct: percentage of female customers
  3. HouseYes: percentage of people who own a house
  4. store_exp: average expense in store
  5. online_exp: average expense online
  6. store_trans: average number of in-store transactions
  7. online_trans: average number of online transactions

There is a lot of information you can draw from those simple averages.

  • Conspicuous: average age is about 40. This is a target of middle-aged, wealthy people. 1/3 of them are female and 2/3 are male. They may be good targets for a candy dad. They buy regardless of price. Almost all of them own a house (0.86). It makes us wonder what is wrong with the remaining 14%. They may live in Manhattan.

  • Price: They are older people, with an average age of 60. Nearly all of them own a house (0.94). They are less likely to purchase online (store_trans = 6 while online_trans = 3); in fact, this is the only group that is less likely to purchase online.

  • Quality: The average age is 35. They are not that different from the Conspicuous group in terms of age, but they spend much less. The percentages of males and females are similar. They prefer online shopping, and more than half of them don’t own a house (0.66).

  • Style: They are young people with an average age of 24. The majority of them are female (0.81), and most of them don’t own a house (0.73). They are very likely to be digital natives and definitely prefer online shopping.

You may notice that the Style group purchases online more frequently (online_trans = 21), but its expense (online_exp = 1962) is not higher. This makes us wonder what the average expense per transaction is, so that we have a better idea of the price range each group falls in.

The analytical process is cumulative rather than a set of independent steps: what you learn earlier will help you decide what to do next, and sometimes you also need to go backward to fix something in a previous step. For example, you may need to check those negative expense values.

We continue to use ddply() to calculate the two statistics:

ddply(sim.dat,"segment",summarize,avg_online=round(sum(online_exp)/sum(online_trans),2),
      avg_store=round(sum(store_exp)/sum(store_trans),2))
##       segment avg_online avg_store
## 1 Conspicuous     442.27    479.25
## 2       Price      69.28     81.30
## 3     Quality     126.05    105.12
## 4       Style      92.83    121.07

The Price group has the lowest average price per transaction, and the Conspicuous group pays the highest. When we build profiles in real life, we will need to look at the survey results too, but those simple data manipulations can already provide lots of information. As mentioned before, other than “summarize” there are other functions such as “transform” and “subset”.

For simplicity, I draw a random sample of 11 observations and 3 variables (age, store_exp and segment) from the original data, stratified by segment. We will explain stratified sampling later; here we just do it without explanation.

library(caret)
set.seed(2016)
trainIndex<-createDataPartition(sim.dat$segment,p=0.01,list=F,times=1)
examp<-sim.dat[trainIndex,c("age","store_exp","segment")]

Now the data frame examp only has 11 rows and 3 columns. Let’s look at the transform function:

ddply(examp,"segment",transform,store_pct=round(store_exp/sum(store_exp),2))
##    age store_exp     segment store_pct
## 1   42 6319.0718 Conspicuous      0.55
## 2   42 5106.4816 Conspicuous      0.45
## 3   55  595.2520       Price      0.42
## 4   64  399.3550       Price      0.28
## 5   64  426.6653       Price      0.30
## 6   39  362.4795     Quality      0.58
## 7   35  260.5065     Quality      0.42
## 8   23  205.6099       Style      0.25
## 9   24  212.3040       Style      0.26
## 10  24  202.1017       Style      0.25
## 11  28  200.1906       Style      0.24

What “transform” does is to transform data within the specified group (segment) and append the result as a new column.

Next let’s look at the subset function:

ddply(examp,"segment",subset,store_exp>median(store_exp))
##   age store_exp     segment
## 1  42 6319.0718 Conspicuous
## 2  55  595.2520       Price
## 3  39  362.4795     Quality
## 4  23  205.6099       Style
## 5  24  212.3040       Style

You get all rows with store_exp greater than its group median.

1.2.3 dplyr package

dplyr provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the d in the name). It is faster and more user-friendly:

  • It identifies the most important data manipulations and makes them easy to use from R
  • It performs faster for in-memory data by writing key pieces in C++ using Rcpp
  • The interface is the same for data frame, data table or database

I will illustrate the following functions in order:

  1. Display
  2. Subset
  3. Summarize
  4. Create new variable
  5. Merge

Display

  • tbl_df(): Convert the data to a tibble, which offers better checking and printing capabilities than a traditional data frame. It will adjust the output width to fit the current window.
library(dplyr)
tbl_df(sim.dat)
  • glimpse(): This is like a transposed version of tbl_df()
glimpse(sim.dat)

Subset

Get rows with income more than 300000:

library(magrittr)
filter(sim.dat, income >300000) %>%
  tbl_df()
## # A tibble: 4 x 19
##     age gender   income  house store_exp online_exp store_trans
##   <int> <fctr>    <dbl> <fctr>     <dbl>      <dbl>       <int>
## 1    40   Male 301398.0    Yes  4840.461   3618.212          10
## 2    33   Male 319704.3    Yes  5998.305   4395.923           9
## 3    41   Male 317476.2    Yes  3029.844   4179.671          11
## 4    37 Female 315697.2    Yes  6548.970   4284.065          13
## # ... with 12 more variables: online_trans <int>, Q1 <int>, Q2 <int>,
## #   Q3 <int>, Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>,
## #   Q10 <int>, segment <fctr>

Here we meet a new operator, %>%. It is called the “pipe operator”, and it pipes a value forward into an expression or function call. What you get from the operation on the left becomes the first argument (or the only argument) of the operation on the right:

x %>% f(y) = f(x, y)
y %>% f(x, ., z) = f(x, y, z)

It is an operator from magrittr which can be really beneficial. Look at the following code. Can you tell me what it does?

ave_exp <- filter( 
  summarise(
    group_by( 
      filter(
        sim.dat, 
        !is.na(income)
      ), 
      segment
    ), 
    ave_online_exp = mean(online_exp), 
    n = n()
  ), 
  n > 200
) 

Now look at the identical code using “%>%”:

avg_exp <- sim.dat %>% 
 filter(!is.na(income)) %>% 
 group_by(segment) %>% 
 summarise( 
   ave_online_exp = mean(online_exp), 
   n = n() ) %>% 
  filter(n > 200)

Isn’t it much more straightforward now? Let’s read it:

  1. Delete observations from sim.dat with missing income values
  2. Group the data from step 1 by the variable segment
  3. Calculate the mean of online expense for each segment and save the result as a new variable named ave_online_exp
  4. Calculate the size of each segment and save it as a new variable named n
  5. Get the segments with size larger than 200

You can use distinct() to delete duplicated rows:

dplyr::distinct(sim.dat)

sample_frac() will randomly select rows at a specified percentage; sample_n() will randomly select a specified number of rows:

dplyr::sample_frac(sim.dat, 0.5, replace = TRUE) 
dplyr::sample_n(sim.dat, 10, replace = TRUE) 

slice() will select rows by position:

dplyr::slice(sim.dat, 10:15) 

It is equivalent to sim.dat[10:15,].

top_n() will select the top n entries ordered by a given variable:

dplyr::top_n(sim.dat,2,income)

If you want to select columns instead of rows, you can use select(). The following are some sample codes:

# select by column name
dplyr::select(sim.dat,income,age,store_exp)

# select columns whose name contains a character string
dplyr::select(sim.dat, contains("_"))

# select columns whose name ends with a character string
# similar there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))

# select columns Q1,Q2,Q3,Q4 and Q5
select(sim.dat, num_range("Q", 1:5)) 

# select columns whose names are in a group of names
dplyr::select(sim.dat, one_of(c("age", "income")))

# select columns between age and online_exp
dplyr::select(sim.dat, age:online_exp)

# select all columns except for age
dplyr::select(sim.dat, -age)

Summarize

The operations here are similar to what we did before with apply() and ddply():

dplyr::summarise(sim.dat, avg_online = mean(online_trans)) 
##   avg_online
## 1     13.546
# apply function anyNA() to each column
# you can also assign a function vector such as: c("anyNA","is.factor")
dplyr::summarise_each(sim.dat, funs_(c("anyNA")))
##     age gender income house store_exp online_exp store_trans online_trans
## 1 FALSE  FALSE   TRUE FALSE     FALSE      FALSE       FALSE        FALSE
##      Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9   Q10 segment
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE

You can use group_by() to indicate the variables you want to group by as before:

sim.dat %>% group_by(segment) %>% summarise_each(funs_(c("anyNA")))
## # A tibble: 4 x 19
##       segment   age gender income house store_exp online_exp store_trans
##        <fctr> <lgl>  <lgl>  <lgl> <lgl>     <lgl>      <lgl>       <lgl>
## 1 Conspicuous FALSE  FALSE   TRUE FALSE     FALSE      FALSE       FALSE
## 2       Price FALSE  FALSE   TRUE FALSE     FALSE      FALSE       FALSE
## 3     Quality FALSE  FALSE   TRUE FALSE     FALSE      FALSE       FALSE
## 4       Style FALSE  FALSE   TRUE FALSE     FALSE      FALSE       FALSE
## # ... with 11 more variables: online_trans <lgl>, Q1 <lgl>, Q2 <lgl>,
## #   Q3 <lgl>, Q4 <lgl>, Q5 <lgl>, Q6 <lgl>, Q7 <lgl>, Q8 <lgl>, Q9 <lgl>,
## #   Q10 <lgl>

Create new variable

mutate() will compute and append one or more new columns:

dplyr::mutate(sim.dat, total_exp = store_exp + online_exp)

mutate() applies window functions to the columns and returns columns of the same length. A window function is a different type of function from the aggregation functions used before: it takes a vector of n values and returns a vector of n values.

# min_rank=rank(ties.method = "min")
# mutate_each() means apply function to each column
dplyr::mutate_each(sim.dat, funs(min_rank)) 
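
For instance, a minimal sketch of a single window function: min_rank() takes the income column and returns one rank per row, so the output has as many rows as sim.dat (income_rank is a made-up column name):

dplyr::mutate(sim.dat, income_rank = dplyr::min_rank(income))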

The other similar function is transmute(). The difference is that transmute() deletes the original columns and only keeps the new ones:

dplyr::transmute(sim.dat, total_exp = store_exp + online_exp) 

Merge

We create two baby data sets to show how the functions work.

(x<-data.frame(cbind(ID=c("A","B","C"),x1=c(1,2,3))))
##   ID x1
## 1  A  1
## 2  B  2
## 3  C  3
(y<-data.frame(cbind(ID=c("B","C","D"),y1=c(T,T,F))))
##   ID    y1
## 1  B  TRUE
## 2  C  TRUE
## 3  D FALSE
# join to the left
# keep all rows in x
left_join(x,y,by="ID")
##   ID x1   y1
## 1  A  1 <NA>
## 2  B  2 TRUE
## 3  C  3 TRUE
# get rows matched in both data sets
inner_join(x,y,by="ID")
##   ID x1   y1
## 1  B  2 TRUE
## 2  C  3 TRUE
# get rows in either data set
full_join(x,y,by="ID")
##   ID   x1    y1
## 1  A    1  <NA>
## 2  B    2  TRUE
## 3  C    3  TRUE
## 4  D <NA> FALSE
# filter out rows in x that can be matched in y 
# it doesn't bring in any values from y 
semi_join(x,y,by="ID")
##   ID x1
## 1  B  2
## 2  C  3
# the opposite of  semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x,y,by="ID")
##   ID x1
## 1  A  1

There are other functions (intersect(), union() and setdiff()), as well as the data frame versions of rbind and cbind, which are bind_rows() and bind_cols(). We are not going to go through them all; you can try them yourself. If you understand the functions we have introduced so far, it should be easy for you to figure out the rest. A small sketch of the binding functions follows.
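
This sketch reuses the x and y data sets above (an illustration, not from the original):

# stack rows; x1 and y1 are filled with NA where missing
bind_rows(x, y)
# paste columns side by side; row counts must match
bind_cols(x["x1"], y)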

1.3 Tidy and Reshape Data

“Tidy data” represents the information from a dataset as a data frame where each row is an observation and each column contains the values of a variable (i.e., an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with for the problem at hand, in practice we often need to convert data between the “wide” and the “long” formats. The process feels like playing with dough.

There are two commonly used packages for this kind of manipulation: tidyr and reshape2. We will show how to tidy and reshape data using both packages, comparing their functions to show how they overlap and where they differ.

1.3.1 reshape2 package

It is a reboot of the previous package reshape. Why? Here is what I got from Stack Overflow:

“reshape2 let Hadley make a rebooted reshape that was way, way faster, while avoiding busting up people’s dependencies and habits.”

Take a baby subset of our exemplary clothes consumers data to illustrate:

(sdat<-sim.dat[1:5,1:6])
##   age gender   income house store_exp online_exp
## 1  57 Female 120963.4   Yes  529.1344   303.5125
## 2  63 Female 122008.1   Yes  478.0058   109.5297
## 3  59   Male 114202.3   Yes  490.8107   279.2496
## 4  60   Male 113616.3   Yes  347.8090   141.6698
## 5  51   Male 124252.6   Yes  379.6259   112.2372

For the above data sdat, what if we want a variable indicating the purchasing channel (i.e., online or in-store) and another column with the corresponding expense amount, keeping the rest of the columns the same? It is a task of changing data from “wide” to “long”. There are two general ways to shape data:

  • Use melt() to convert an object into a molten data frame, i.e. from wide to long
  • Use dcast() to cast a molten data frame into the shape you want, i.e. from long to wide
library(reshape2)
(mdat <- melt(sdat, measure.vars=c("store_exp","online_exp"),
              variable.name = "Channel",
              value.name = "Expense"))
##    age gender   income house    Channel  Expense
## 1   57 Female 120963.4   Yes  store_exp 529.1344
## 2   63 Female 122008.1   Yes  store_exp 478.0058
## 3   59   Male 114202.3   Yes  store_exp 490.8107
## 4   60   Male 113616.3   Yes  store_exp 347.8090
## 5   51   Male 124252.6   Yes  store_exp 379.6259
## 6   57 Female 120963.4   Yes online_exp 303.5125
## 7   63 Female 122008.1   Yes online_exp 109.5297
## 8   59   Male 114202.3   Yes online_exp 279.2496
## 9   60   Male 113616.3   Yes online_exp 141.6698
## 10  51   Male 124252.6   Yes online_exp 112.2372

You melted the data frame sdat by two variables: store_exp and online_exp (measure.vars=c("store_exp","online_exp")). The name of the new grouping variable is Channel, set by variable.name = "Channel"; the name of the value column is Expense, set by value.name = "Expense".

You can run a regression to study the effect of purchasing channel:

# Here we use all observations from sim.dat
mdat<-melt(sim.dat[,1:6], measure.vars=c("store_exp","online_exp"),
            variable.name = "Channel",
              value.name = "Expense")
fit<-lm(Expense~gender+house+income+Channel+age,data=mdat)
summary(fit)
## 
## Call:
## lm(formula = Expense ~ gender + house + income + Channel + age, 
##     data = mdat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4208   -821   -275    533  44353 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -9.132e+02  1.560e+02  -5.855 5.76e-09 ***
## genderMale         3.572e+02  1.028e+02   3.475 0.000524 ***
## houseYes          -5.687e+01  1.138e+02  -0.500 0.617275    
## income             2.834e-02  1.079e-03  26.268  < 2e-16 ***
## Channelonline_exp  8.296e+02  9.772e+01   8.489  < 2e-16 ***
## age               -2.793e+01  3.356e+00  -8.321  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1974 on 1626 degrees of freedom
##   (368 observations deleted due to missingness)
## Multiple R-squared:  0.348,  Adjusted R-squared:  0.346 
## F-statistic: 173.5 on 5 and 1626 DF,  p-value: < 2.2e-16

You can melt() lists, matrices and tables too. The syntax is similar, and we won’t go through every situation. Sometimes we want to convert the data from “long” to “wide”. For example, suppose you want to compare the online and in-store expenses between males and females, by house ownership.

dcast(mdat, house + gender ~ Channel, sum)
##   house gender store_exp online_exp
## 1    No Female  171102.2   583492.4
## 2    No   Male  133130.8   332499.9
## 3   Yes Female  355320.2   500856.9
## 4   Yes   Male  697297.3   703332.0

In the above code, the left side of ~ lists the variables you want to group by, and the right side is the variable you want to spread into columns. The function uses the value column created by melt() before, here “Expense”.
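
As a variation, you can swap in a different aggregation function; a sketch using the mean with an explicit value.var:

dcast(mdat, house + gender ~ Channel, mean, value.var = "Expense")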

1.3.2 tidyr package

The other package that will do similar manipulations is tidyr. Let’s get a subset to illustrate the usage.

library(dplyr)
# practice functions we learnt before
sdat<-sim.dat[1:5,]%>%
  dplyr::select(age,gender,store_exp,store_trans)
sdat %>% tbl_df()
## # A tibble: 5 x 4
##     age gender store_exp store_trans
## * <int> <fctr>     <dbl>       <int>
## 1    57 Female  529.1344           2
## 2    63 Female  478.0058           4
## 3    59   Male  490.8107           7
## 4    60   Male  347.8090          10
## 5    51   Male  379.6259           4

The gather() function in tidyr is analogous to melt() in reshape2. The following code will do the same thing as we did before using melt():

library(tidyr)
msdat<-tidyr::gather(sdat,"variable","value",store_exp,store_trans)
msdat %>% tbl_df()
## # A tibble: 10 x 4
##      age gender    variable    value
##    <int> <fctr>       <chr>    <dbl>
##  1    57 Female   store_exp 529.1344
##  2    63 Female   store_exp 478.0058
##  3    59   Male   store_exp 490.8107
##  4    60   Male   store_exp 347.8090
##  5    51   Male   store_exp 379.6259
##  6    57 Female store_trans   2.0000
##  7    63 Female store_trans   4.0000
##  8    59   Male store_trans   7.0000
##  9    60   Male store_trans  10.0000
## 10    51   Male store_trans   4.0000

Or, using the pipe operator, we can write the above code as:

sdat%>%gather("variable","value",store_exp,store_trans)

It is identical to the following code using melt():

melt(sdat, measure.vars=c("store_exp","store_trans"),
            variable.name = "variable",
              value.name = "value")

The opposite operation to gather() is spread(): the former stacks columns into rows, and the latter spreads rows back into columns.

msdat %>% spread(variable,value)
##   age gender store_exp store_trans
## 1  51   Male  379.6259           4
## 2  57 Female  529.1344           2
## 3  59   Male  490.8107           7
## 4  60   Male  347.8090          10
## 5  63 Female  478.0058           4

Another pair of functions that do opposite manipulations are separate() and unite().

sepdat<- msdat %>% 
  separate(variable,c("Source","Type"))
sepdat %>% tbl_df()
## # A tibble: 10 x 5
##      age gender Source  Type    value
##  * <int> <fctr>  <chr> <chr>    <dbl>
##  1    57 Female  store   exp 529.1344
##  2    63 Female  store   exp 478.0058
##  3    59   Male  store   exp 490.8107
##  4    60   Male  store   exp 347.8090
##  5    51   Male  store   exp 379.6259
##  6    57 Female  store trans   2.0000
##  7    63 Female  store trans   4.0000
##  8    59   Male  store trans   7.0000
##  9    60   Male  store trans  10.0000
## 10    51   Male  store trans   4.0000

You can see that the function separates the original column “variable” into two new columns, “Source” and “Type”. You can use sep = to set the string or regular expression used to split the column; by default, it splits on any non-alphanumeric character. The sketch below sets it explicitly.
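
For instance, a sketch that sets the separator explicitly (the result is the same as with the default here):

msdat %>% separate(variable, c("Source","Type"), sep = "_")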

The unite() function does the opposite: it combines two columns into one. It is like a generalization of paste() to data frames; its default separator is “_”.

sepdat %>% 
  unite("variable",Source,Type,sep="_")
##    age gender    variable    value
## 1   57 Female   store_exp 529.1344
## 2   63 Female   store_exp 478.0058
## 3   59   Male   store_exp 490.8107
## 4   60   Male   store_exp 347.8090
## 5   51   Male   store_exp 379.6259
## 6   57 Female store_trans   2.0000
## 7   63 Female store_trans   4.0000
## 8   59   Male store_trans   7.0000
## 9   60   Male store_trans  10.0000
## 10  51   Male store_trans   4.0000

The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with those functions; unfortunately, there is no shortcut.