6.1 Summarize Data
6.1.1 dplyr package
dplyr provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the d in the name). It is faster and more user-friendly:
- It identifies the most important data manipulations and makes them easy to use from R.
- It performs faster for in-memory data because key pieces are written in C++ using Rcpp.
- The interface is the same for data frames, data tables, or databases.
We will illustrate the following functions in order using the clothing company data:
- Display
- Subset
- Summarize
- Create new variable
- Merge
# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
6.1.1.1 Display
tbl_df(): Convert the data to a tibble, which offers better checking and printing capabilities than traditional data frames. It adjusts the output width to fit the current window.
tbl_df(sim.dat)
glimpse(): This is like a transposed version of tbl_df().
glimpse(sim.dat)
6.1.1.2 Subset
Get rows with income more than 300000:
filter(sim.dat, income > 300000) %>%
  tbl_df()
Here we use the operator %>%. It is called a "pipe operator", which pipes a value forward into an expression or function call. What you get from the left operation becomes the first argument (or the only argument) of the right operation:
x %>% f(y) = f(x, y)
y %>% f(x, ., z) = f(x, y, z)
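For example, the two calls below are equivalent ways of looking at the first rows of sim.dat (a trivial illustration, not from the original text):
# pipe sim.dat into head(), or call head() directly
sim.dat %>% head(3)
head(sim.dat, 3)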
It is an operator from the magrittr package and can be really beneficial. The following R code is difficult to read and understand without the pipe operator.
ave_exp <- filter(
  summarise(
    group_by(
      filter(
        sim.dat, !is.na(income)
      ),
      segment
    ),
    ave_online_exp = mean(online_exp),
    n = n()
  ),
  n > 200
)
The same function with the pipe operator "%>%":
ave_exp <- sim.dat %>%
  filter(!is.na(income)) %>%
  group_by(segment) %>%
  summarise(
    ave_online_exp = mean(online_exp),
    n = n()
  ) %>%
  filter(n > 200)
It is much easier to read:
- Delete observations from sim.dat with missing income values
- Group the data from step 1 by variable segment
- Calculate the mean of online expense for each segment and save the result as a new variable named ave_online_exp
- Calculate the size of each segment and save it as a new variable named n
- Get segments with size larger than 200
You can use distinct() to delete duplicated rows.
dplyr::distinct(sim.dat)
sample_frac() will randomly select rows by a specified percentage, and sample_n() will randomly select a specified number of rows.
dplyr::sample_frac(sim.dat, 0.5, replace = TRUE)
dplyr::sample_n(sim.dat, 10, replace = TRUE)
slice() will select rows by position:
dplyr::slice(sim.dat, 10:15)
It is equivalent to sim.dat[10:15, ].
top_n() will select the top n entries ordered by a given variable:
dplyr::top_n(sim.dat, 2, income)
If you want to select columns instead of rows, you can use select(). The following are some sample codes:
# select by column name
dplyr::select(sim.dat, income, age, store_exp)

# select columns whose name contains a character string
dplyr::select(sim.dat, contains("_"))

# select columns whose name ends with a character string
# similarly, there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))

# select columns Q1, Q2, Q3, Q4 and Q5
dplyr::select(sim.dat, num_range("Q", 1:5))

# select columns whose names are in a group of names
dplyr::select(sim.dat, one_of(c("age", "income")))

# select columns between age and online_exp
dplyr::select(sim.dat, age:online_exp)

# select all columns except for age
dplyr::select(sim.dat, -age)
6.1.1.3 Summarize
Let us use a standard marketing problem, customer segmentation, to illustrate how to summarize data. It usually starts with designing a survey and collecting data, followed by a cluster analysis on the data to get customer segments. Once we have the segments, the next step is to understand what each group of customers looks like by summarizing some key metrics. For example, we can do the following data aggregation for different segments of clothing customers.
dat_summary <- sim.dat %>%
  dplyr::group_by(segment) %>%
  dplyr::summarise(Age = round(mean(na.omit(age)), 0),
                   FemalePct = round(mean(gender == "Female"), 2),
                   HouseYes = round(mean(house == "Yes"), 2),
                   store_exp = round(mean(na.omit(store_exp), trim = 0.1), 0),
                   online_exp = round(mean(online_exp), 0),
                   store_trans = round(mean(store_trans), 1),
                   online_trans = round(mean(online_trans), 1))

# transpose the data frame for showing purpose
# due to the limit of output width
cnames <- dat_summary$segment
dat_summary <- dplyr::select(dat_summary, -segment)
tdat_summary <- t(dat_summary) %>% data.frame()
names(tdat_summary) <- cnames
tdat_summary
## Conspicuous Price Quality Style
## Age 42.00 60.00 35.00 24.00
## FemalePct 0.32 0.45 0.47 0.81
## HouseYes 0.86 0.94 0.34 0.27
## store_exp 4990.00 501.00 301.00 200.00
## online_exp 4898.00 205.00 2013.00 1962.00
## store_trans 10.90 6.10 2.90 3.00
## online_trans 11.10 3.00 16.00 21.10
Now, let's look at the code in more detail.
The first line, sim.dat, is easy: it is the data you want to work on. The second line, group_by(segment), tells R that in the following steps you want to summarise by the variable segment. Here we only summarize data by one categorical variable, but you can group by multiple variables, such as group_by(segment, house) (see the short sketch after the list below). The third part, summarise(), tells R the manipulation(s) to do; you list the exact actions inside summarise(). For example, Age = round(mean(na.omit(age)), 0) tells R the following things:
- Calculate the mean of column age, ignoring missing values, for each customer segment
- Round the result to the specified number of decimal places
- Store the result in a new variable named Age
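As a short illustration of grouping by two variables (a minimal sketch, not part of the original analysis):
# group by both segment and house,
# then compute the average online expense per group
sim.dat %>%
  dplyr::group_by(segment, house) %>%
  dplyr::summarise(ave_online_exp = mean(online_exp))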
The rest of the command above is similar. In the end, we calculate the following for each segment:
- Age: average age for each segment
- FemalePct: percentage of females for each segment
- HouseYes: percentage of people who own a house
- store_exp: average expense in store
- online_exp: average expense online
- store_trans: average number of in-store transactions
- online_trans: average number of online transactions
There is a lot of information you can extract from those simple averages.
- Conspicuous: The average age is about 40. It is a group of middle-aged wealthy people. 1/3 of them are female, and 2/3 are male. They buy regardless of price. Almost all of them own a house (0.86).
- Price: They are older people with an average age of 60. Nearly all of them own a house (0.94). They are less likely to purchase online (store_trans = 6 while online_trans = 3). It is the only group that is less likely to buy online.
- Quality: The average age is 35. They are not very different from the Conspicuous group in age, but they spend much less. The percentages of male and female are similar. They prefer online shopping. More than half of them don't own a house (0.66).
- Style: They are young people with an average age of 24. The majority of them are female (0.81). Most of them don't own a house (0.73). They are very likely to be digital natives and prefer online shopping.
You may notice that the Style group purchases online more frequently (online_trans), but its expense (online_exp) is not higher. It makes us wonder what the average expense per transaction is, so we have a better idea about the price range of the group.
The analytical process is cumulative rather than a series of independent steps: the current step sheds new light on what to do next, and sometimes you need to go back and fix something in a previous step. Let's check the average one-time online and in-store purchase amounts:
sim.dat %>%
  group_by(segment) %>%
  summarise(avg_online = round(sum(online_exp) / sum(online_trans), 2),
            avg_store = round(sum(store_exp) / sum(store_trans), 2))
## # A tibble: 4 × 3
## segment avg_online avg_store
## <chr> <dbl> <dbl>
## 1 Conspicuous 442. 479.
## 2 Price 69.3 81.3
## 3 Quality 126. 105.
## 4 Style 92.8 121.
The Price group has the lowest average one-time purchase amount, and the Conspicuous group pays the highest price. When we build customer profiles in real life, we also need to look at the survey summarization (a possible sketch follows). You may be surprised how much information simple data manipulations can provide.
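For instance, a minimal sketch of such a survey summarization, using the survey columns Q1 through Q10 that appear later in this section, could average each question by segment:
# average score of each survey question (Q1-Q10) by segment
sim.dat %>%
  group_by(segment) %>%
  summarise_at(vars(Q1:Q10), mean)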
Another common task is to check which columns have missing values. It requires the program to look at each column in the data. In this case you can use summarise_all:
# apply function anyNA() to each column
# you can also assign a function vector
# such as: c("anyNA", "is.factor")
dplyr::summarise_all(sim.dat, funs_(c("anyNA")))
## age gender income house store_exp online_exp
## 1 FALSE FALSE TRUE FALSE FALSE FALSE
## store_trans online_trans Q1 Q2 Q3 Q4
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The above code returns a one-row data frame indicating whether any value is missing in each column.
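Note that funs_() has been deprecated in newer versions of dplyr. If the call above does not work with your version, the following sketch (the across() variant assumes dplyr 1.0 or later) produces the same check:
# pass the function directly to summarise_all(), or
# use across() with dplyr >= 1.0.0
dplyr::summarise_all(sim.dat, anyNA)
dplyr::summarise(sim.dat, dplyr::across(dplyr::everything(), anyNA))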
6.1.1.4 Create new variable
There are often situations where you need to create new variables. For example, adding online and store expense to get total expense. In this case, you will apply a function to the columns and return a column with the same length. mutate()
can do it for you and append one or more new columns:
dplyr::mutate(sim.dat, total_exp = store_exp + online_exp)
The above code sums up two columns and appends the result (total_exp) to sim.dat. Another similar function is transmute(). The difference is that transmute() will delete the original columns and only keep the new ones.
dplyr::transmute(sim.dat, total_exp = store_exp + online_exp)
6.1.1.5 Merge
Similar to SQL, there are different joins in dplyr. We create two baby data sets to show how the functions work.
(x <- data.frame(cbind(ID = c("A", "B", "C"), x1 = c(1, 2, 3))))
## ID x1
## 1 A 1
## 2 B 2
## 3 C 3
(y <- data.frame(cbind(ID = c("B", "C", "D"), y1 = c(T, T, F))))
## ID y1
## 1 B TRUE
## 2 C TRUE
## 3 D FALSE
# join to the left
# keep all rows in x
left_join(x, y, by = "ID")
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
# get rows matched in both data sets
inner_join(x, y, by = "ID")
## ID x1 y1
## 1 B 2 TRUE
## 2 C 3 TRUE
# get rows in either data set
full_join(x, y, by = "ID")
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## 4 D <NA> FALSE
# filter out rows in x that can be matched in y
# it doesn't bring in any values from y
semi_join(x, y, by = "ID")
# the opposite of semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x, y, by = "ID")
There are other functions (intersect(), union(), and setdiff()), as well as data frame versions of rbind and cbind, which are bind_rows() and bind_cols(). We are not going to go through them all; you can try them yourself (see the sketch below). If you understand the functions introduced so far, it should be easy for you to figure out the rest.
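As a quick sketch (not from the original text) using the baby data sets x and y above:
# stack x on top of itself: 6 rows, same columns
dplyr::bind_rows(x, x)
# put x and y side by side: 3 rows, columns from both
# (the duplicated ID column names get repaired automatically)
dplyr::bind_cols(x, y)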
6.1.2 apply(), lapply() and sapply() in base R
There are some powerful functions to summarize data in base R, such as apply(), lapply() and sapply(). They do the same basic thing and are all from the "apply" family: they apply functions over parts of the data. They differ in two important respects:
- the type of object they apply to
- the type of result they will return
When do we use apply()? When we want to apply a function to the margins of an array or matrix. That means our data need to be structured. The operations can be very flexible: it returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix.
For example you can compute row and column sums for a matrix:
## simulate a matrix
x <- cbind(x1 = 1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
## x1 x2
## 4.5 3.0
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
You can also apply other functions:
ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
ma
## [,1] [,2] [,3] [,4]
## [1,] 1 3 1 7
## [2,] 2 4 6 8
apply(ma, 1, table) #--> a list of length 2
## [[1]]
##
## 1 3 7
## 2 1 1
##
## [[2]]
##
## 2 4 6 8
## 1 1 1 1
apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
## [,1] [,2]
## 0% 1 2.0
## 25% 1 3.5
## 50% 2 5.0
## 75% 4 6.5
## 100% 7 8.0
Results can have different lengths for each call. This is a trickier example. What will you get?
## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq          ## a 2 x 3 matrix
typeof(zseq)  ## list
dim(zseq)     ## 2 3
zseq[1, ]
apply(z, 3, function(x) seq_len(max(x)))
- lapply() applies a function over a list, data frame, or vector and returns a list of the same length.
- sapply() is a user-friendly version and wrapper of lapply(). By default it returns a vector or matrix, or, if simplify = "array", an array when appropriate. sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f); with the default simplify = TRUE, the result is simplified to a vector or matrix instead of a list whenever possible.
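A quick illustration of the difference (a minimal sketch, not from the original text):
# lapply() returns a list; sapply() simplifies to a named vector
lapply(list(a = 1:3, b = 4:6), mean)
sapply(list(a = 1:3, b = 4:6), mean)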
Let’s use some data with context to help you better understand the functions.
- Get the mean and standard deviation of all numerical variables in the dataset.
# Get numerical variables
sdat <- sim.dat[, lapply(sim.dat, class) %in% c("integer", "numeric")]
## Try the following code with the apply() function:
## apply(sim.dat, 2, class)
## What is the problem?
The data frame sdat only includes numeric columns. Now we can go ahead and use apply() to get the mean and standard deviation of each column:
apply(sdat, MARGIN = 2, function(x) mean(na.omit(x)))
## age income store_exp online_exp
## 3.884e+01 1.135e+05 1.357e+03 2.120e+03
## store_trans online_trans Q1 Q2
## 5.350e+00 1.355e+01 3.101e+00 1.823e+00
## Q3 Q4 Q5 Q6
## 1.992e+00 2.763e+00 2.945e+00 2.448e+00
## Q7 Q8 Q9 Q10
## 3.434e+00 2.396e+00 3.085e+00 2.320e+00
Here we defined a function using function(x) mean(na.omit(x)). It is a very simple function that tells R to ignore missing values when calculating the mean. MARGIN = 2 tells R to apply the function to each column; it is not hard to guess what MARGIN = 1 means. The results show that the average online expense is much higher than the average store expense. You can also compare the average scores across different questions. The command to calculate standard deviation is very similar; the only difference is to change mean() to sd():
apply(sdat, MARGIN = 2, function(x) sd(na.omit(x)))
## age income store_exp online_exp
## 16.417 49842.287 2774.400 1731.224
## store_trans online_trans Q1 Q2
## 3.696 7.957 1.450 1.168
## Q3 Q4 Q5 Q6
## 1.402 1.155 1.284 1.439
## Q7 Q8 Q9 Q10
## 1.456 1.154 1.118 1.136
Even though the average online expense is higher than the store expense, the standard deviation of store expense is much higher than that of online expense, which indicates there are very likely some unusually large or small in-store purchases. We can check quickly:
summary(sdat$store_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -500 205 329 1357 597 50000
summary(sdat$online_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69 420 1942 2120 2441 9479
There are some odd values in store expense. The minimum value is -500, which indicates that you should preprocess the data before analyzing it. Checking those simple statistics helps you better understand your data and gives you some idea of how to preprocess and analyze it. How about using lapply() and sapply()?
Run the following code and compare the results:
lapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)
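You should see that lapply() returns a list of standard deviations, sapply() simplifies the result to a named numeric vector by default, and sapply() with simplify = FALSE behaves like lapply().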