This chapter focuses on some of the most frequently used data manipulations and shows how to implement them in R. It is important to explore a data set with descriptive statistics (mean, standard deviation, etc.) and data visualization prior to analysis, to transform the data so that its structure is in line with the requirements of the model, and to summarize the results after analysis.
Here we assume the readers are already familiar with some of the traditional R data operations, such as subsetting data frames, deleting variables, and the read and write functions (read.csv(), write.csv(), etc.) in base R. We will also skip some basic descriptive functions in R. For example, for discrete variables, we often use a frequency table (table()) to look at the counts of the variable at various levels, or a crosstab of two variables. You can also draw a bar chart for discrete variables (barplot()). For continuous variables, we need to look at the mean (mean()), standard deviation (sd()) and quantiles (quantile()) of a variable from time to time. There are also functions like summary(), str() and describe() (a function in the 'psych' package) that give a summary of a data frame.
The focus here is to introduce some of the more efficient data wrangling methods in R.
readr

You must be familiar with read.csv(), read.table() and write.csv() in base R. Here we will introduce a more efficient package for reading and writing data, released by RStudio in 2015: the readr package. The corresponding functions are read_csv(), read_table() and write_csv(). The commands look quite similar, but readr is different in the following respects:

- It is roughly 10x faster. The trick is that readr uses C++ to process the data quickly.
- It doesn't change the column names. Names can start with a number, and "." will not be substituted with "_". For example:
library(readr)
read_csv("2015,2016,2017
1,2,3
4,5,6")
## 2015 2016 2017
## 1 1 2 3
## 2 4 5 6
- readr functions do not convert strings to factors by default, are able to parse dates and times, and can automatically determine the data type of each column (a sketch of overriding the guessed types follows this list).
- The killer feature, in my opinion, is that readr provides a progress bar. What makes you feel worse than waiting is not knowing how long you have to wait. (The lack of a progress bar might be the No.1 reason people break up with the one they have been dating.)
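If the guessed column types are not what you want, you can override them with the col_types argument. A minimal sketch (the toy data here is made up for illustration):

library(readr)
dat <- read_csv("x,y,z
1,a,2015-01-01",
    col_types = cols(
        x = col_double(),     # force numeric
        y = col_character(),  # force character
        z = col_date()        # parse as an ISO8601 date
    ))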
The major functions of readr turn flat files into data frames:

- read_csv(): reads comma-delimited files
- read_csv2(): reads semicolon-separated files (common in countries where , is used as the decimal mark)
- read_tsv(): reads tab-delimited files
- read_delim(): reads files with any delimiter
- read_fwf(): reads fixed-width files; you can specify fields either by their widths with fwf_widths() or by their positions with fwf_positions()
- read_table(): reads a common variation of fixed-width files where columns are separated by white space
- read_log(): reads Apache-style log files

The good thing is that those functions have similar syntax. Once you learn one, the others become easy. Here we will focus on read_csv().
The most important argument for read_csv() is the path to your data:

library(readr)
sim.dat <- read_csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
head(sim.dat)
## age gender income house store_exp online_exp store_trans online_trans
## 1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
## 2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
## 3 59 Male 114202.3 Yes 490.8107 279.2496 7 2
## 4 60 Male 113616.3 Yes 347.8090 141.6698 10 2
## 5 51 Male 124252.6 Yes 379.6259 112.2372 4 4
## 6 59 Male 107661.5 Yes 338.3154 195.6870 4 5
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1 4 2 1 2 1 4 1 4 2 4 Price
## 2 4 1 1 2 1 4 1 4 1 4 Price
## 3 5 2 1 2 1 4 1 4 1 4 Price
## 4 5 2 1 3 1 4 1 4 2 4 Price
## 5 4 1 1 3 1 4 1 4 2 4 Price
## 6 4 2 1 2 1 4 1 4 1 4 Price
The function reads the file into R as a tibble. You can consider a tibble the next iteration of the data frame. Tibbles differ from data frames in the following respects:

- they never convert a string to a factor (no more stringsAsFactors = FALSE!)
- printing is more compact and informative
- subsetting with [ always returns another tibble

Refer to http://r4ds.had.co.nz/tibbles.html for more information about tibble.
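For example, a quick sketch of the first two points (assuming the tibble package is installed):

library(tibble)
tb <- tibble(x = c("a", "b"), `2015` = 1:2)  # a name may start with a number
class(tb$x)  # "character": strings are not converted to factors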
When you run read_csv(), it prints out a column specification that gives the name and type of each column. To better understand how readr works, it is helpful to type in some baby data sets and check the results:
dat=read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")
print(dat)
## 2015 2016 2017
## 1 100 200 300
## 2 canola soybean corn
You can also add comments on the top and tell R to skip those lines:
dat=read_csv("# I will never let you know that
# my favorite food is carrot
Date,Food,Mood
Monday,carrot,happy
Tuesday,carrot,happy
Wednesday,carrot,happy
Thursday,carrot,happy
Friday,carrot,happy
Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", skip = 2)
print(dat)
## Date Food Mood
## 1 Monday carrot happy
## 2 Tuesday carrot happy
## 3 Wednesday carrot happy
## 4 Thursday carrot happy
## 5 Friday carrot happy
## 6 Saturday carrot extremely happy
## 7 Sunday carrot extremely happy
If you don't have column names, set col_names = FALSE and R will assign names "X1", "X2", … to the columns:
dat=read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", col_names=FALSE)
print(dat)
## X1 X2 X3
## 1 Saturday carrot extremely happy
## 2 Sunday carrot extremely happy
You can also pass col_names a character vector to be used as the column names. Try replacing col_names = FALSE with col_names = c("Date","Food","Mood") and see what happens.
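For reference, here is what that looks like:

dat <- read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", col_names = c("Date", "Food", "Mood"))
print(dat)
## Date Food Mood
## 1 Saturday carrot extremely happy
## 2 Sunday carrot extremely happy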
As mentioned before, you can use read_csv2() to read semicolon-separated files:
dat=read_csv2("Saturday; carrot; extremely happy \n Sunday; carrot; extremely happy", col_names=FALSE)
print(dat)
## X1 X2 X3
## 1 Saturday carrot extremely happy
## 2 Sunday carrot extremely happy
Here "\n" is a convenient shortcut for adding a new line.
You can use read_tsv() to read tab-delimited files:
dat=read_tsv("every\tman\tis\ta\tpoet\twhen\the\tis\tin\tlove\n", col_names = FALSE)
print(dat)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 every man is a poet when he is in love
Or, more generally, you can use read_delim() and specify the separating character:
dat=read_delim("THE|UNBEARABLE|RANDOMNESS|OF|LIFE\n", delim = "|", col_names = FALSE)
print(dat)
## X1 X2 X3 X4 X5
## 1 THE UNBEARABLE RANDOMNESS OF LIFE
Another situation you will often run into is missing values. In marketing surveys, people like to use "99" to represent missing. You can tell R to set all observations with value "99" to missing when you read the data:
dat=read_csv("Q1,Q2,Q3
5, 4,99",na="99")
print(dat)
## Q1 Q2 Q3
## 1 5 4 <NA>
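The na argument also accepts a vector, so you can flag several codes as missing at once. A small sketch:

dat <- read_csv("Q1,Q2,Q3
5,99,", na = c("99", ""))
print(dat)
## Q1 Q2 Q3
## 1 5 <NA> <NA>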
For writing data back to disk, you can use write_csv() and write_tsv(). Two characteristics of these functions increase the chances of the output file being read back in correctly: they always encode strings in UTF-8, and they save dates and date-times in ISO8601 format. For example:
write_csv(sim.dat, "sim_dat.csv")
For other data types, you can use the following packages:

- haven: SPSS, Stata and SAS data
- readxl and xlsx: Excel data (.xls and .xlsx)
- DBI: database back ends such as RMySQL, RSQLite and RPostgreSQL; read data directly from the database using SQL

Some other useful materials:

- the rio package: https://github.com/leeper/rio
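For instance, a minimal sketch of reading an SPSS file and an Excel file (the file names here are hypothetical):

# install.packages(c("haven", "readxl"))
library(haven)
library(readxl)
survey <- read_sav("survey.sav")    # hypothetical SPSS file
sales <- read_excel("sales.xlsx")   # hypothetical Excel file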
data.table: enhanced data.frame

What is data.table? It is an R package that provides an enhanced version of data.frame. The most used object in R is the data frame. Before we move on, let's briefly review some basic characteristics and manipulations of data.frame:

- use [] to subset data

I will use the clothes customer data to illustrate. There are two dimensions in []: the first indicates the row and the second indicates the column, separated by a comma.
# read data
sim.dat<-read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
# subset the first two rows
sim.dat[1:2,]
## age gender income house store_exp online_exp store_trans online_trans
## 1 57 Female 120963.4 Yes 529.1344 303.5125 2 2
## 2 63 Female 122008.1 Yes 478.0058 109.5297 4 2
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1 4 2 1 2 1 4 1 4 2 4 Price
## 2 4 1 1 2 1 4 1 4 1 4 Price
# subset the first two rows and column 3 and 5
sim.dat[1:2,c(3,5)]
## income store_exp
## 1 120963.4 529.1344
## 2 122008.1 478.0058
# get all rows with age>70
sim.dat[sim.dat$age>70,]
## age gender income house store_exp online_exp store_trans
## 288 300 Male 208017.5 Yes 5076.801 6053.485 12
## online_trans Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 288 11 1 4 5 4 4 4 4 1 4 2 Conspicuous
# get rows with age> 60 and gender is Male
# select column 3 and 4
sim.dat[sim.dat$age>68 & sim.dat$gender == "Male", 3:4]
## income house
## 204 119552.0 No
## 288 208017.5 Yes
Remember that there are usually several ways to conduct the same manipulation. For example, the following code presents three ways to calculate the average number of online transactions for male and female customers:
tapply(sim.dat$online_trans, sim.dat$gender, mean )
## Female Male
## 15.38448 11.26233
aggregate(online_trans ~ gender, data = sim.dat, mean)
## gender online_trans
## 1 Female 15.38448
## 2 Male 11.26233
library(dplyr)
sim.dat%>%
group_by(gender)%>%
summarise(Avg_online_trans=mean(online_trans))
## # A tibble: 2 x 2
## gender Avg_online_trans
## <fctr> <dbl>
## 1 Female 15.38448
## 2 Male 11.26233
There is no gold standard for choosing a specific function to manipulate data. The goal is to solve the real problem, not to master the tool itself. So just use whatever tool is convenient for you.
The way to use [] is straightforward, but the manipulations are limited. If you need more complicated data reshaping or aggregation, there are other packages such as dplyr, reshape2 and tidyr, but their usage is not as straightforward as []; you often need to switch functions. Keeping related operations together, such as subset, group, update and join, allows for concise and consistent syntax as well as fast code. data.table is the package for that. If you are not familiar with other data manipulation packages and are interested in reducing programming time tremendously, then this package is for you.
Other than extending the functionality of [], data.table has the following advantages:

- it offers fast import, subsetting, grouping, updating and joins for large data files
- it is easy to turn a data frame into a data table
- it can behave just like a data frame
You need to install and load the package:
# if you haven't installed it, use the code below to install
# install.packages("data.table")
# load package
library(data.table)
Use data.table() to convert the existing data frame sim.dat to a data table:
dt <- data.table(sim.dat)
class(dt)
## [1] "data.table" "data.frame"
Calculate mean for counts of online transactions:
dt[, mean(online_trans)]
## [1] 13.546
You can't do the same thing with a data frame:
sim.dat[,mean(online_trans)]
Error in mean(online_trans) : object 'online_trans' not found
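To get the same result from a data frame in base R, you have to reference the column explicitly, for example:

# base R equivalents of dt[, mean(online_trans)]
mean(sim.dat$online_trans)
with(sim.dat, mean(online_trans))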
If you want to calculate the mean by group as before, set the "by =" argument:
dt[ , mean(online_trans), by = gender]
## gender V1
## 1: Female 15.38448
## 2: Male 11.26233
You can group by more than one variable. For example, group by "gender" and "house":
dt[ , mean(online_trans), by = .(gender, house)]
## gender house V1
## 1: Female Yes 11.312030
## 2: Male Yes 8.771523
## 3: Female No 19.145833
## 4: Male No 16.486111
Assign column names for aggregated variables:
dt[ , .(avg = mean(online_trans)), by = .(gender, house)]
## gender house avg
## 1: Female Yes 11.312030
## 2: Male Yes 8.771523
## 3: Female No 19.145833
## 4: Male No 16.486111
data.table can accomplish all operations that aggregate() and tapply() can do for data frames.
The general form of data.table

Different from a data frame, there are three arguments inside a data table's []:

dt[i, j, by]

It is analogous to SQL. You don't have to know SQL to learn data table, but experience with SQL will help you understand it. In SQL, you select column j (using SELECT) for rows i (using WHERE), and GROUP BY assigns the variable to group the observations by. In a data table, i selects rows, j selects or computes on columns, and by specifies the grouping variables.
Let’s review our previous code:
dt[ , mean(online_trans), by = gender]
## gender V1
## 1: Female 15.38448
## 2: Male 11.26233
The code above is equal to the following SQL:
SELECT gender, avg(online_trans) FROM sim.dat GROUP BY gender
R code:
dt[ , .(avg = mean(online_trans)), by = .(gender, house)]
## gender house avg
## 1: Female Yes 11.312030
## 2: Male Yes 8.771523
## 3: Female No 19.145833
## 4: Male No 16.486111
is equal to SQL:
SELECT gender, house, avg(online_trans) AS avg FROM sim.dat GROUP BY gender, house
R code:
dt[ age < 40, .(avg = mean(online_trans)), by = .(gender, house)]
## gender house avg
## 1: Male Yes 14.45977
## 2: Female Yes 18.14062
## 3: Male No 18.24299
## 4: Female No 20.10196
is equal to SQL:
SELECT gender, house, avg(online_trans) AS avg FROM sim.dat WHERE age < 40 GROUP BY gender, house
You can see the analogy between data.table and SQL. Now let's focus on operations in data table.
# select rows with age<20 and income > 80000
dt[age < 20 & income > 80000]
## age gender income house store_exp online_exp store_trans online_trans
## 1: 19 Female 83534.70 No 227.6686 1490.719 1 22
## 2: 18 Female 89415.97 Yes 209.5487 1926.470 3 28
## 3: 19 Female 92812.81 No 186.7475 1041.539 2 18
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1: 2 1 1 2 4 1 4 2 4 1 Style
## 2: 2 1 1 1 4 1 4 2 4 1 Style
## 3: 3 1 1 2 4 1 4 3 4 1 Style
# select the first two rows
dt[1:2]
## age gender income house store_exp online_exp store_trans online_trans
## 1: 57 Female 120963.4 Yes 529.1344 303.5125 2 2
## 2: 63 Female 122008.1 Yes 478.0058 109.5297 4 2
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1: 4 2 1 2 1 4 1 4 2 4 Price
## 2: 4 1 1 2 1 4 1 4 1 4 Price
Selecting columns in data.table doesn't need $:
# select column “age” but return it as a vector
# the argument for row is empty so the result will return all observations
ans <- dt[, age]
head(ans)
## [1] 57 63 59 60 51 59
To return a data.table object, put the column names in list():
# Select age and online_exp columns and return as a data.table instead
ans <- dt[, list(age, online_exp)]
head(ans)
## age online_exp
## 1: 57 303.5125
## 2: 63 109.5297
## 3: 59 279.2496
## 4: 60 141.6698
## 5: 51 112.2372
## 6: 59 195.6870
Or you can put the column names in .():
ans <- dt[, .(age, online_exp)]
# head(ans)
To select all columns from "age" to "income":
ans <- dt[, age:income, with = FALSE]
head(ans,2)
## age gender income
## 1: 57 Female 120963.4
## 2: 63 Female 122008.1
Delete columns using - or !:
# delete columns from age to online_exp
ans <- dt[, -(age:online_exp), with = FALSE]
ans <- dt[, !(age:online_exp), with = FALSE]
In data.table, .N means count.
# row count
dt[, .N]
## [1] 1000
If you assign the group variable, then it will count by groups:
# counts by gender
dt[, .N, by= gender]
## gender N
## 1: Female 554
## 2: Male 446
# for those younger than 30, count by gender
dt[age < 30, .(count=.N), by= gender]
## gender count
## 1: Female 292
## 2: Male 86
Order table:
# get records with the highest 5 online expense:
head(dt[order(-online_exp)],5)
## age gender income house store_exp online_exp store_trans online_trans
## 1: 40 Female 217599.7 No 7023.684 9479.442 10 6
## 2: 41 Female NA Yes 3786.740 8638.239 14 10
## 3: 36 Male 228550.1 Yes 3279.621 8220.555 8 12
## 4: 31 Female 159508.1 Yes 5177.081 8005.932 11 13
## 5: 43 Female 190407.4 Yes 4694.922 7875.562 6 11
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1: 1 4 5 4 3 4 4 1 4 2 Conspicuous
## 2: 1 4 4 4 4 4 4 1 4 2 Conspicuous
## 3: 1 4 5 4 4 4 4 1 4 1 Conspicuous
## 4: 1 4 4 4 4 4 4 1 4 2 Conspicuous
## 5: 1 4 5 4 4 4 4 1 4 2 Conspicuous
Since data tables keep some characteristics of data frames, they share some operations:
dt[order(-online_exp)][1:5]
## age gender income house store_exp online_exp store_trans online_trans
## 1: 40 Female 217599.7 No 7023.684 9479.442 10 6
## 2: 41 Female NA Yes 3786.740 8638.239 14 10
## 3: 36 Male 228550.1 Yes 3279.621 8220.555 8 12
## 4: 31 Female 159508.1 Yes 5177.081 8005.932 11 13
## 5: 43 Female 190407.4 Yes 4694.922 7875.562 6 11
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1: 1 4 5 4 3 4 4 1 4 2 Conspicuous
## 2: 1 4 4 4 4 4 4 1 4 2 Conspicuous
## 3: 1 4 5 4 4 4 4 1 4 1 Conspicuous
## 4: 1 4 4 4 4 4 4 1 4 2 Conspicuous
## 5: 1 4 5 4 4 4 4 1 4 2 Conspicuous
You can also order the table by more than one variable. The following code orders the table by gender, then within gender by online_exp:
dt[order(gender, -online_exp)][1:5]
## age gender income house store_exp online_exp store_trans online_trans
## 1: 40 Female 217599.7 No 7023.684 9479.442 10 6
## 2: 41 Female NA Yes 3786.740 8638.239 14 10
## 3: 31 Female 159508.1 Yes 5177.081 8005.932 11 13
## 4: 43 Female 190407.4 Yes 4694.922 7875.562 6 11
## 5: 50 Female 263858.0 Yes 5813.802 7448.729 11 11
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1: 1 4 5 4 3 4 4 1 4 2 Conspicuous
## 2: 1 4 4 4 4 4 4 1 4 2 Conspicuous
## 3: 1 4 4 4 4 4 4 1 4 2 Conspicuous
## 4: 1 4 5 4 4 4 4 1 4 2 Conspicuous
## 5: 1 4 5 4 4 4 4 1 4 1 Conspicuous
fread() to import data

Other than read.csv() in base R, we have introduced read_csv() in the readr package. read_csv() is much faster and provides a progress bar, which makes users feel much better (at least it makes me feel better). fread() in data.table further increases the efficiency of reading data. The following are three examples of reading the same data file topic.csv. The file includes text data scraped from an agriculture forum, with 209,670 rows and 6 columns:
system.time(topic<-read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv"))
user system elapsed
4.313 0.027 4.340
system.time(topic<-readr::read_csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv"))
user system elapsed
0.267 0.008 0.274
system.time(topic<-data.table::fread("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv"))
user system elapsed
0.217 0.005 0.221
It is clear that read_csv() is much faster than read.csv(), and fread() is a little faster than read_csv(). As the file size increases, the difference becomes more significant. Note that fread() reads the file as a data.table by default.
apply(), lapply() and sapply() in base R

There are some powerful functions to summarize data in base R, such as apply(), lapply() and sapply(). They do the same basic thing and are all from the "apply" family: they apply functions over parts of data. They differ in two important respects: the type of object they apply the function to and the type of object they return.
When do we use apply()? When we want to apply a function to the margins of an array or matrix. That means our data needs to be structured. The operations can be very flexible. It returns a vector, array or list of values obtained by applying the function to the margins of the array or matrix.
For example you can compute row and column sums for a matrix:
## simulate a matrix
x <- cbind(x1 =1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
## x1 x2
## 4.5 3.0
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
You can also apply other functions:
ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
ma
## [,1] [,2] [,3] [,4]
## [1,] 1 3 1 7
## [2,] 2 4 6 8
apply(ma, 1, table) #--> a list of length 2
## [[1]]
##
## 1 3 7
## 2 1 1
##
## [[2]]
##
## 2 4 6 8
## 1 1 1 1
apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
## [,1] [,2]
## 0% 1 2.0
## 25% 1 3.5
## 50% 2 5.0
## 75% 4 6.5
## 100% 7 8.0
Results can have different lengths for each call. This is a trickier example. What will you get?
## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq ## a 2 x 3 matrix
typeof(zseq) ## list
dim(zseq) ## 2 3
zseq[1,]
apply(z, 3, function(x) seq_len(max(x)))
- lapply() applies a function over a list, data frame or vector and returns a list of the same length.
- sapply() is a user-friendly version and wrapper of lapply(). By default it returns a vector or matrix, or, if simplify = "array", an array if appropriate. sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f); with the default simplify = TRUE it returns a simplified vector or matrix instead of a list.

Let's use some data with context to help you better understand the functions.
# Read data
sim.dat<-read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
# Get numerical variables
sdat<-sim.dat[,!lapply(sim.dat,class)=="factor"]
## Try the following code with apply() function
## apply(sim.dat,2,class)
## What is the problem?
The data frame sdat only includes numeric columns. Now we can go ahead and use apply() to get the mean and standard deviation of each column:
apply(sdat, MARGIN=2,function(x) mean(na.omit(x)))
## age income store_exp online_exp store_trans
## 38.840 113543.065 1356.851 2120.181 5.350
## online_trans Q1 Q2 Q3 Q4
## 13.546 3.101 1.823 1.992 2.763
## Q5 Q6 Q7 Q8 Q9
## 2.945 2.448 3.434 2.396 3.085
## Q10
## 2.320
Here we defined a function using function(x) mean(na.omit(x)). It is a very simple function: it tells R to ignore missing values when calculating the mean. MARGIN = 2 tells R to apply the function to each column; it is not hard to guess what MARGIN = 1 means. The results show that the average online expense is much higher than the store expense. You can also compare the average scores across different questions. The command to calculate the standard deviation is very similar; the only difference is changing mean() to sd():
apply(sdat, MARGIN=2,function(x) sd(na.omit(x)))
## age income store_exp online_exp store_trans
## 16.416818 49842.287197 2774.399785 1731.224308 3.695559
## online_trans Q1 Q2 Q3 Q4
## 7.956959 1.450139 1.168348 1.402106 1.155061
## Q5 Q6 Q7 Q8 Q9
## 1.284377 1.438529 1.455941 1.154347 1.118493
## Q10
## 1.136174
Even though the average online expense is higher than the store expense, the standard deviation of the store expense is much larger, which indicates there are likely some very large or very small purchases in store. We can check quickly:
summary(sdat$store_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -500.0 205.0 329.0 1357.0 597.3 50000.0
summary(sdat$online_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.82 420.30 1942.00 2120.00 2441.00 9479.00
There are some odd values in the store expense. The minimum value is -500, which must be a wrong imputation; this indicates that you should preprocess data before analyzing it. Checking those simple statistics helps you better understand your data, and then gives you some idea of how to preprocess and analyze it. How about using lapply() and sapply()?
Run the following code and compare the results:
lapply(sdat, function(x) sd(na.omit(x)))   # returns a list
sapply(sdat, function(x) sd(na.omit(x)))   # simplifies the result to a named vector
sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)  # same as lapply()
ddply() in plyr package

plyr is a set of clean and consistent tools that implement the split-apply-combine pattern in R. This is an extremely common pattern in data analysis: you solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again. [From package description]

You may find the description sounds familiar: the package is sort of a wrapper of the apply family. We will only introduce the main function, ddply(), because the package has a next iteration, the dplyr package, which we will introduce in more detail. The reason we still want to spend some time on the older version is that the two share similar ideas, and knowing the lineage will deepen your understanding of the whole family.
We will use the same data frame sim.dat to illustrate. Run the following command:
library(plyr)
ddply(sim.dat,"segment",summarize, Age=round(mean(na.omit(age)),0),
FemalePct=round(mean(gender=="Female"),2),
HouseYes=round(mean(house=="Yes"),2),
store_exp=round(mean(na.omit(store_exp),trim=0.1),0),
online_exp=round(mean(online_exp),0),
store_trans=round(mean(store_trans),1),
online_trans=round(mean(online_trans),1))
## segment Age FemalePct HouseYes store_exp online_exp store_trans
## 1 Conspicuous 42 0.32 0.86 4990 4898 10.9
## 2 Price 60 0.45 0.94 501 205 6.1
## 3 Quality 35 0.47 0.34 301 2013 2.9
## 4 Style 24 0.81 0.27 200 1962 3.0
## online_trans
## 1 11.1
## 2 3.0
## 3 16.0
## 4 21.1
Now, let’s peel the onion in order.
The first argument, sim.dat, is easy: it is the data you want to work on.
The second argument, "segment", is the column you want to group by. This is a very standard marketing segmentation problem: the final segments are the result of designing a survey, collecting and analyzing data. Here we assume those segments are known and we want to understand what each group of customers looks like. It is a common task in segmentation: figuring out a profile. Here we only summarize data by one categorical variable, but you can group by multiple variables, e.g. ddply(sim.dat, c("segment","house"), ...); a sketch follows below. So the second argument tells the function to divide data by customer segment.
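For instance, a minimal sketch grouping by both segment and house (the summary column avg_online here is just for illustration):

library(plyr)
ddply(sim.dat, c("segment", "house"), summarize,
      avg_online = round(mean(online_exp), 0))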
The third argument, summarize, tells R the kind of manipulation you want to do, which is to summarize data. There are other choices for this argument, such as transform (transform data within each group) and subset (subset data within each group).
Then the rest of the commands tell R the exact action. For example, Age=round(mean(na.omit(age)),0) tells R the following things:

- calculate the mean of age, ignoring missing values
- round the result to the nearest integer
- name the new variable Age

The rest of the commands above are similar. In the end, we calculate the following for each segment:
- Age: average age for each segment
- FemalePct: percentage of females in each segment
- HouseYes: percentage of people who own a house
- store_exp: average expense in store
- online_exp: average expense online
- store_trans: average number of store transactions
- online_trans: average number of online transactions

There is a lot of information you can draw from those simple averages.
Conspicuous: the average age is about 40, so the target is middle-aged, wealthy people. 1/3 of them are female and 2/3 are male; they may be good candidates for the "sugar daddy" label. They buy regardless of price. Almost all of them own a house (0.86), which makes us wonder what is going on with the remaining 14%. Maybe they live in Manhattan.
Price: they are older people, with an average age of 60. Nearly all of them own a house (0.94). They are less likely to purchase online (store_trans = 6 while online_trans = 3); this is the only group that is less likely to purchase online.
Quality: the average age is 35. They are not very different from the Conspicuous group in terms of age, but they spend much less. The percentages of male and female are similar. They prefer online shopping. More than half of them don't own a house (0.66).
Style: they are young people, with an average age of 24. The majority are female (0.81), and most don't own a house (0.73). They are very likely to be digital natives and definitely prefer online shopping.
You may notice that the Style group purchases online more frequently (online_trans = 21), yet their expense (online_exp = 1962) is not higher. This makes us wonder about the average expense per transaction, which gives a better idea of the price range each group falls in.
The analytical process is iterative rather than a set of independent steps: what you learn earlier helps you decide what to do next, and sometimes you need to go backward to fix something in a previous step. For example, you may need to check those negative expense values.
We continue to use ddply() to calculate these two statistics:
ddply(sim.dat,"segment",summarize,avg_online=round(sum(online_exp)/sum(online_trans),2),
avg_store=round(sum(store_exp)/sum(store_trans),2))
## segment avg_online avg_store
## 1 Conspicuous 442.27 479.25
## 2 Price 69.28 81.30
## 3 Quality 126.05 105.12
## 4 Style 92.83 121.07
The Price group has the lowest average price per transaction, and the Conspicuous group pays the highest. When we build profiles in real life, we need to look at the survey results too, but those simple data manipulations can already provide a lot of information. As mentioned before, other than summarize there are other functions such as transform and subset.
For simplicity, I draw 11 random samples of 3 variables (age, store_exp and segment) from the original data according to the different segments. We will explain stratified sampling later; here we just do it without explanation.
library(caret)
set.seed(2016)
trainIndex<-createDataPartition(sim.dat$segment,p=0.01,list=F,times=1)
examp<-sim.dat[trainIndex,c("age","store_exp","segment")]
Now the data frame examp only has 11 rows and 3 columns. Let's look at the transform function:
ddply(examp,"segment",transform,store_pct=round(store_exp/sum(store_exp),2))
## age store_exp segment store_pct
## 1 42 6319.0718 Conspicuous 0.55
## 2 42 5106.4816 Conspicuous 0.45
## 3 55 595.2520 Price 0.42
## 4 64 399.3550 Price 0.28
## 5 64 426.6653 Price 0.30
## 6 39 362.4795 Quality 0.58
## 7 35 260.5065 Quality 0.42
## 8 23 205.6099 Style 0.25
## 9 24 212.3040 Style 0.26
## 10 24 202.1017 Style 0.25
## 11 28 200.1906 Style 0.24
What transform does is transform data within the specified group (segment) and append the result as a new column.
Next let's look at the subset function:
ddply(examp,"segment",subset,store_exp>median(store_exp))
## age store_exp segment
## 1 42 6319.0718 Conspicuous
## 2 55 595.2520 Price
## 3 39 362.4795 Quality
## 4 23 205.6099 Style
## 5 24 212.3040 Style
You get all rows with store_exp greater than its group median.
dplyr package

dplyr provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the d in the name). It is faster and more user-friendly, with key pieces written in C++ using Rcpp. I will illustrate the following functions in order:
Display

tbl_df(): converts the data to a tibble, which offers better checking and printing capabilities than traditional data frames. It adjusts the output width to fit the current window.

library(dplyr)
tbl_df(sim.dat)

glimpse(): this is like a transposed version of tbl_df().

glimpse(sim.dat)
Subset

Get rows with income more than 300000:

library(magrittr)
filter(sim.dat, income > 300000) %>%
  tbl_df()
## # A tibble: 4 x 19
## age gender income house store_exp online_exp store_trans
## <int> <fctr> <dbl> <fctr> <dbl> <dbl> <int>
## 1 40 Male 301398.0 Yes 4840.461 3618.212 10
## 2 33 Male 319704.3 Yes 5998.305 4395.923 9
## 3 41 Male 317476.2 Yes 3029.844 4179.671 11
## 4 37 Female 315697.2 Yes 6548.970 4284.065 13
## # ... with 12 more variables: online_trans <int>, Q1 <int>, Q2 <int>,
## # Q3 <int>, Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>,
## # Q10 <int>, segment <fctr>
Here we meet a new operator, %>%, called the "pipe operator". It pipes a value forward into an expression or function call: what you get from the left-hand operation becomes the first argument (or the dot argument) of the right-hand operation.
x %>% f(y) = f(x, y)
y %>% f(x, ., z) = f(x, y, z )
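For instance, a tiny sketch of these two rules:

library(magrittr)
c(1, 4, 9) %>% sqrt() %>% sum()  # same as sum(sqrt(c(1, 4, 9))), i.e. 6
2 %>% rep("ab", times = .)       # the dot places 2 there: rep("ab", times = 2)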
The operator comes from the magrittr package and can be really beneficial for longer chains. Look at the following code. Can you tell what it does?
ave_exp <- filter(
summarise(
group_by(
filter(
sim.dat,
!is.na(income)
),
segment
),
ave_online_exp = mean(online_exp),
n = n()
),
n > 200
)
Now look at the identical code using %>%:
avg_exp <- sim.dat %>%
filter(!is.na(income)) %>%
group_by(segment) %>%
summarise(
ave_online_exp = mean(online_exp),
n = n() ) %>%
filter(n > 200)
Isn't it much more straightforward now? Let's read it:

- take sim.dat, then
- filter out rows with missing income values, then
- group the data by segment, then
- summarise within each group to get the average online expense ave_online_exp and the group count n, then
- keep only the groups with n > 200
You can use distinct() to delete duplicated rows.
dplyr::distinct(sim.dat)
sample_frac() randomly selects a specified fraction of the rows; sample_n() randomly selects a specified number of rows.
dplyr::sample_frac(sim.dat, 0.5, replace = TRUE)
dplyr::sample_n(sim.dat, 10, replace = TRUE)
slice() selects rows by position:
dplyr::slice(sim.dat, 10:15)
It is equivalent to sim.dat[10:15, ].
top_n() selects the top n entries ordered by a given variable:
dplyr::top_n(sim.dat,2,income)
If you want to select columns instead of rows, you can use select(). Here are some sample codes:
# select by column name
dplyr::select(sim.dat,income,age,store_exp)
# select columns whose name contains a character string
dplyr::select(sim.dat, contains("_"))
# select columns whose name ends with a character string
# similar there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))
# select columns Q1,Q2,Q3,Q4 and Q5
select(sim.dat, num_range("Q", 1:5))
# select columns whose names are in a group of names
dplyr::select(sim.dat, one_of(c("age", "income")))
# select columns between age and online_exp
dplyr::select(sim.dat, age:online_exp)
# select all columns except for age
dplyr::select(sim.dat, -age)
Summarize

The operations here are similar to what we did before with apply() and ddply().
dplyr::summarise(sim.dat, avg_online = mean(online_trans))
## avg_online
## 1 13.546
# apply function anyNA() to each column
# you can also assign a function vector such as: c("anyNA","is.factor")
dplyr::summarise_each(sim.dat, funs_(c("anyNA")))
## age gender income house store_exp online_exp store_trans online_trans
## 1 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
You can use group_by() to indicate the variables you want to group by, as before:
sim.dat %>% group_by(segment) %>% summarise_each(funs_(c("anyNA")))
## # A tibble: 4 x 19
## segment age gender income house store_exp online_exp store_trans
## <fctr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 Conspicuous FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 2 Price FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 3 Quality FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## 4 Style FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## # ... with 11 more variables: online_trans <lgl>, Q1 <lgl>, Q2 <lgl>,
## # Q3 <lgl>, Q4 <lgl>, Q5 <lgl>, Q6 <lgl>, Q7 <lgl>, Q8 <lgl>, Q9 <lgl>,
## # Q10 <lgl>
Create new variable

mutate() computes and appends one or more new columns:
dplyr::mutate(sim.dat, total_exp = store_exp + online_exp)
You can also apply a window function to each column. A window function returns a vector the same length as its input, a different type of function from the aggregation functions used before:
# min_rank=rank(ties.method = "min")
# mutate_each() means apply function to each column
dplyr::mutate_each(sim.dat, funs(min_rank))
The other similar function is transmute(). The difference is that transmute() deletes the original columns and only keeps the new ones.
dplyr::transmute(sim.dat, total_exp = store_exp + online_exp)
Merge
We create two baby data sets to show how the functions work.
(x<-data.frame(cbind(ID=c("A","B","C"),x1=c(1,2,3))))
## ID x1
## 1 A 1
## 2 B 2
## 3 C 3
(y<-data.frame(cbind(ID=c("B","C","D"),y1=c(T,T,F))))
## ID y1
## 1 B TRUE
## 2 C TRUE
## 3 D FALSE
# join to the left
# keep all rows in x
left_join(x,y,by="ID")
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
# get rows matched in both data sets
inner_join(x,y,by="ID")
## ID x1 y1
## 1 B 2 TRUE
## 2 C 3 TRUE
# get rows in either data set
full_join(x,y,by="ID")
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## 4 D <NA> FALSE
# filter out rows in x that can be matched in y
# it doesn't bring in any values from y
semi_join(x,y,by="ID")
## ID x1
## 1 B 2
## 2 C 3
# the opposite of semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x,y,by="ID")
## ID x1
## 1 A 1
There are other functions (intersect(), union() and setdiff()), as well as data frame versions of rbind and cbind: bind_rows() and bind_cols(). We are not going to go through them all; you can try them yourself. If you understand the functions introduced so far, it should be easy to figure out the rest.
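For instance, a quick sketch using the baby data sets x and y from above (note that the set operations require data frames with identical columns):

bind_rows(x[1:2, ], x[3, ])           # stack rows, like rbind()
bind_cols(x, y)                       # bind columns side by side, like cbind()
dplyr::intersect(x[1:2, ], x[2:3, ])  # rows in both: ID "B"
dplyr::union(x[1:2, ], x[2:3, ])      # rows in either, duplicates removed
dplyr::setdiff(x[1:2, ], x[2:3, ])    # rows in the first but not the second: ID "A"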
"Tidy data" represents the information from a dataset as a data frame where each row is an observation and each column contains the values of a variable (i.e. an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with for the problem at hand, in practice we often need to convert data between the "wide" and the "long" format. The process feels like playing with dough.
There are two commonly used packages for this kind of manipulation: tidyr and reshape2. We will show how to tidy and reshape data using both packages, comparing the functions to show how they overlap and where they differ.
reshape2 package

It is a reboot of the previous package reshape. Why? Here is what I got from Stack Overflow:

"reshape2 let Hadley make a rebooted reshape that was way, way faster, while avoiding busting up people's dependencies and habits."
Take a baby subset of our exemplary clothes consumers data to illustrate:
(sdat<-sim.dat[1:5,1:6])
## age gender income house store_exp online_exp
## 1 57 Female 120963.4 Yes 529.1344 303.5125
## 2 63 Female 122008.1 Yes 478.0058 109.5297
## 3 59 Male 114202.3 Yes 490.8107 279.2496
## 4 60 Male 113616.3 Yes 347.8090 141.6698
## 5 51 Male 124252.6 Yes 379.6259 112.2372
For the above data sdat, what if we want a variable indicating the purchasing channel (i.e. online or in-store) and another column with the corresponding expense amount, keeping the rest of the columns the same? That is a task of changing data from "wide" to "long". There are two general ways to shape data:

- melt() to convert an object into a molten data frame, i.e. from wide to long
- dcast() to cast a molten data frame into the shape you want, i.e. from long to wide

library(reshape2)
(mdat <- melt(sdat, measure.vars=c("store_exp","online_exp"),
variable.name = "Channel",
value.name = "Expense"))
## age gender income house Channel Expense
## 1 57 Female 120963.4 Yes store_exp 529.1344
## 2 63 Female 122008.1 Yes store_exp 478.0058
## 3 59 Male 114202.3 Yes store_exp 490.8107
## 4 60 Male 113616.3 Yes store_exp 347.8090
## 5 51 Male 124252.6 Yes store_exp 379.6259
## 6 57 Female 120963.4 Yes online_exp 303.5125
## 7 63 Female 122008.1 Yes online_exp 109.5297
## 8 59 Male 114202.3 Yes online_exp 279.2496
## 9 60 Male 113616.3 Yes online_exp 141.6698
## 10 51 Male 124252.6 Yes online_exp 112.2372
You melted the data frame sdat by two variables, store_exp and online_exp (measure.vars = c("store_exp","online_exp")). The name of the new grouping variable is Channel, set by variable.name = "Channel", and the name of the value column is Expense, set by value.name = "Expense".
You can run a regression to study the effect of purchasing channel:
# Here we use all observations from sim.dat
mdat<-melt(sim.dat[,1:6], measure.vars=c("store_exp","online_exp"),
variable.name = "Channel",
value.name = "Expense")
fit<-lm(Expense~gender+house+income+Channel+age,data=mdat)
summary(fit)
##
## Call:
## lm(formula = Expense ~ gender + house + income + Channel + age,
## data = mdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4208 -821 -275 533 44353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.132e+02 1.560e+02 -5.855 5.76e-09 ***
## genderMale 3.572e+02 1.028e+02 3.475 0.000524 ***
## houseYes -5.687e+01 1.138e+02 -0.500 0.617275
## income 2.834e-02 1.079e-03 26.268 < 2e-16 ***
## Channelonline_exp 8.296e+02 9.772e+01 8.489 < 2e-16 ***
## age -2.793e+01 3.356e+00 -8.321 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1974 on 1626 degrees of freedom
## (368 observations deleted due to missingness)
## Multiple R-squared: 0.348, Adjusted R-squared: 0.346
## F-statistic: 173.5 on 5 and 1626 DF, p-value: < 2.2e-16
You can melt() lists, matrices and tables too. The syntax is similar, and we won't go through every situation. Sometimes we want to convert the data from "long" to "wide". For example, say you want to compare the online and in-store expense between male and female customers based on house ownership:
dcast(mdat, house + gender ~ Channel, sum)
## house gender store_exp online_exp
## 1 No Female 171102.2 583492.4
## 2 No Male 133130.8 332499.9
## 3 Yes Female 355320.2 500856.9
## 4 Yes Male 697297.3 703332.0
In the above code, the left side of ~ holds the variables you want to group by, and the right side is the variable whose levels you want to spread into columns. The cells are filled with the value column created by melt() before, here "Expense".
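You can pass other aggregation functions to dcast() as well; for example, averages instead of sums:

dcast(mdat, house + gender ~ Channel, mean)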
tidyr package

The other package that does similar manipulations is tidyr. Let's get a subset to illustrate the usage.
library(dplyr)
# practice functions we learnt before
sdat<-sim.dat[1:5,]%>%
dplyr::select(age,gender,store_exp,store_trans)
sdat %>% tbl_df()
## # A tibble: 5 x 4
## age gender store_exp store_trans
## * <int> <fctr> <dbl> <int>
## 1 57 Female 529.1344 2
## 2 63 Female 478.0058 4
## 3 59 Male 490.8107 7
## 4 60 Male 347.8090 10
## 5 51 Male 379.6259 4
The gather() function in tidyr is analogous to melt() in reshape2. The following code does the same thing as we did before using melt():
library(tidyr)
msdat<-tidyr::gather(sdat,"variable","value",store_exp,store_trans)
msdat %>% tbl_df()
## # A tibble: 10 x 4
## age gender variable value
## <int> <fctr> <chr> <dbl>
## 1 57 Female store_exp 529.1344
## 2 63 Female store_exp 478.0058
## 3 59 Male store_exp 490.8107
## 4 60 Male store_exp 347.8090
## 5 51 Male store_exp 379.6259
## 6 57 Female store_trans 2.0000
## 7 63 Female store_trans 4.0000
## 8 59 Male store_trans 7.0000
## 9 60 Male store_trans 10.0000
## 10 51 Male store_trans 4.0000
Or if we use the pipe operation, we can write the above code as:
sdat%>%gather("variable","value",store_exp,store_trans)
It is identical to the following code using melt():
melt(sdat, measure.vars=c("store_exp","store_trans"),
variable.name = "variable",
value.name = "value")
The opposite operation to gather() is spread(). The former stacks columns into rows and the latter spreads rows back out into columns.
msdat %>% spread(variable,value)
## age gender store_exp store_trans
## 1 51 Male 379.6259 4
## 2 57 Female 529.1344 2
## 3 59 Male 490.8107 7
## 4 60 Male 347.8090 10
## 5 63 Female 478.0058 4
Another pair of functions that do opposite manipulations are separate() and unite().
sepdat<- msdat %>%
separate(variable,c("Source","Type"))
sepdat %>% tbl_df()
## # A tibble: 10 x 5
## age gender Source Type value
## * <int> <fctr> <chr> <chr> <dbl>
## 1 57 Female store exp 529.1344
## 2 63 Female store exp 478.0058
## 3 59 Male store exp 490.8107
## 4 60 Male store exp 347.8090
## 5 51 Male store exp 379.6259
## 6 57 Female store trans 2.0000
## 7 63 Female store trans 4.0000
## 8 59 Male store trans 7.0000
## 9 60 Male store trans 10.0000
## 10 51 Male store trans 4.0000
You can see that the function separates the original column variable into two new columns, Source and Type. You can use sep = to set the string or regular expression used to split the column; by default, it splits on any non-alphanumeric character.
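If the column used a different separator, you would set sep explicitly. A minimal sketch with made-up data:

df <- data.frame(x = c("store-exp", "store-trans"))
tidyr::separate(df, x, c("Source", "Type"), sep = "-")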
The unite() function does the opposite: it combines two columns into one. It is like a generalization of paste() to data frames.
sepdat %>%
unite("variable",Source,Type,sep="_")
## age gender variable value
## 1 57 Female store_exp 529.1344
## 2 63 Female store_exp 478.0058
## 3 59 Male store_exp 490.8107
## 4 60 Male store_exp 347.8090
## 5 51 Male store_exp 379.6259
## 6 57 Female store_trans 2.0000
## 7 63 Female store_trans 4.0000
## 8 59 Male store_trans 7.0000
## 9 60 Male store_trans 10.0000
## 10 51 Male store_trans 4.0000
The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with these functions. Unfortunately, there is no shortcut.