6.1 Summarize Data
dplyr provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the
d in the name). It is faster and more friendly:
- It identifies the most important data manipulations and make them easy to use from R.
- It performs faster for in-memory data by writing key pieces in C++ using
- The interface is the same for data frame, data table or database.
We will illustrate the following functions in order using the clothing company data:
- Create new variable
# Read data <- read.csv("http://bit.ly/2P5gTw4")sim.dat
tbl_df(): Convert the data to
tibblewhich offers better checking and printing capabilities than traditional data frames. It will adjust output width according to fit the current window.
glimpse(): This is like a transposed version of
Get rows with
income more than 300000:
filter(sim.dat, income >300000) %>% tbl_df()
Here we use the operator
%>%. It is called a “pipe operator” which pipes a value forward into an expression or function call. What you get in the left operation will be the first argument or the only argument in the right operation.
%>% f(y) = f(x, y) x %>% f(x, ., z) = f(x, y, z )y
It is an operator from
magrittr which can be really beneficial. The following R code is difficulty to read and understand without using the pipe operator.
<- filter( ave_exp summarise( group_by( filter( sim.dat, !is.na(income) ), segment ), ave_online_exp = mean(online_exp), n = n() ), > 200 n )
The same function with pipe operator “
<- sim.dat %>% ave_exp filter(!is.na(income)) %>% group_by(segment) %>% summarise( ave_online_exp = mean(online_exp), n = n() ) %>% filter(n > 200)
It is much easier to read:
- Delete observations from
sim.datwith missing income values
- Group the data from step 1 by variable
- Calculate mean of online expense for each segment and save the result as a new variable named
- Calculate the size of each segment and saved it as a new variable named
- Get segments with size larger than 200
You can use
distinct() to delete duplicated rows.
sample_frac() will randomly select some rows with a specified percentage.
sample_n() can randomly select rows with a specified number.
::sample_frac(sim.dat, 0.5, replace = TRUE) dplyr::sample_n(sim.dat, 10, replace = TRUE) dplyr
slice() will select rows by position:
::slice(sim.dat, 10:15) dplyr
It is equivalent to
top_n() will select the order top n entries:
If you want to select columns instead of rows, you can use
select(). The following are some sample codes:
# select by column name ::select(sim.dat,income,age,store_exp) dplyr # select columns whose name contains a character string ::select(sim.dat, contains("_")) dplyr # select columns whose name ends with a character string # similar there is "starts_with" ::select(sim.dat, ends_with("e")) dplyr # select columns Q1,Q2,Q3,Q4 and Q5 select(sim.dat, num_range("Q", 1:5)) # select columns whose names are in a group of names ::select(sim.dat, one_of(c("age", "income"))) dplyr # select columns between age and online_exp ::select(sim.dat, age:online_exp) dplyr # select all columns except for age ::select(sim.dat, -age)dplyr
Let us use a standard marketing problem, customer segmentation, to illustrate how to summarize data. It usually starts with designing survey and collecting data. Then run a cluster analysis on the data to get customer segments. Once we have different segments, the next is to understand how each group of customer look like by summarizing some key metrics. For example, we can do the following data aggregation for different segments of clothes customers.
<- sim.dat %>% dat_summary ::group_by(segment) %>% dplyr::summarise(Age = round(mean(na.omit(age)), 0), dplyrFemalePct = round(mean(gender == "Female"), 2), HouseYes = round(mean(house == "Yes"), 2), store_exp = round(mean(na.omit(store_exp), trim = 0.1), 0), online_exp = round(mean(online_exp), 0), store_trans = round(mean(store_trans), 1), online_trans = round(mean(online_trans), 1)) # transpose the data frame for showing purpose # due to the limit of output width <- dat_summary$segment cnames <- dplyr::select(dat_summary, - segment) dat_summary <- t(dat_summary) %>% data.frame() tdat_summary names(tdat_summary) <- cnames tdat_summary
## Conspicuous Price Quality Style ## Age 42.00 60.00 35.00 24.00 ## FemalePct 0.32 0.45 0.47 0.81 ## HouseYes 0.86 0.94 0.34 0.27 ## store_exp 4990.00 501.00 301.00 200.00 ## online_exp 4898.00 205.00 2013.00 1962.00 ## store_trans 10.90 6.10 2.90 3.00 ## online_trans 11.10 3.00 16.00 21.10
Now, let’s look at the code in more details.
The first line
sim.dat is easy. It is the data you want to work on. The second line
group_by(segment) tells R that in the following steps you want to summarise by variable
segment. Here we only summarize data by one categorical variable, but you can group by multiple variables, such as
group_by(segment, house). The third argument
summarise tells R the manipulation(s) to do. Then list the exact actions inside
summarise(). For example,
Age = round(mean(na.omit(age)),0) tell R the following things:
- Calculate the mean of column
ageignoring missing value for each customer segment
- Round the result to the specified number of decimal places
- Store the result in a new variable named
The rest of the command above is similar. In the end, we calculate the following for each segment:
Age: average age for each segment
FemalePct: percentage for each segment
HouseYes: percentage of people who own a house
stroe_exp: average expense in store
online_exp: average expense online
store_trans: average times of transactions in the store
online_trans: average times of online transactions
There is a lot of information you can extract from those simple averages.
Conspicuous: average age is about 40. It is a group of middle-age wealthy people. 1/3 of them are female, and 2/3 are male. They buy regardless the price. Almost all of them own house (0.86).
Price: They are older people with average age 60. Nearly all of them own a house (0.94). They are less likely to purchase online (
store_trans = 6while
online_trans = 3). It is the only group that is less likely to buy online.
Quality: The average age is 35. They are not way different with Conspicuous regarding age. But they spend much less. The percentages of male and female are similar. They prefer online shopping. More than half of them don’t own a house (0.66).
Style: They are young people with average age 24. The majority of them are female (0.81). Most of them don’t own a house (0.73). They are very likely to be digital natives and prefer online shopping.
You may notice that Style group purchase more frequently online (
online_trans) but the expense (
online_exp) is not higher. It makes us wonder what is the average expense each time, so you have a better idea about the price range of the group.
The analytical process is aggregated instead of independent steps. The current step will shed new light on what to do next. Sometimes you need to go back to fix something in the previous steps. Let’s check average one-time online and instore purchase amounts:
%>% sim.dat group_by(segment) %>% summarise(avg_online = round(sum(online_exp)/sum(online_trans), 2), avg_store = round(sum(store_exp)/sum(store_trans), 2))
## # A tibble: 4 × 3 ## segment avg_online avg_store ## <chr> <dbl> <dbl> ## 1 Conspicuous 442. 479. ## 2 Price 69.3 81.3 ## 3 Quality 126. 105. ## 4 Style 92.8 121.
Price group has the lowest averaged one-time purchase. The Conspicuous group will pay the highest price. When we build customer profile in real life, we will also need to look at the survey summarization. You may be surprised how much information simple data manipulations can provide.
Another comman task is to check which column has missing values. It requires the program to look at each column in the data. In this case you can use
# apply function anyNA() to each column # you can also assign a function vector # such as: c("anyNA","is.factor") ::summarise_all(sim.dat, funs_(c("anyNA")))dplyr
## age gender income house store_exp online_exp ## 1 FALSE FALSE TRUE FALSE FALSE FALSE ## store_trans online_trans Q1 Q2 Q3 Q4 ## 1 FALSE FALSE FALSE FALSE FALSE FALSE ## Q5 Q6 Q7 Q8 Q9 Q10 segment ## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The above code returns a vector indicating if there is any value missing in each column.
22.214.171.124 Create new variable
There are often situations where you need to create new variables. For example, adding online and store expense to get total expense. In this case, you will apply a function to the columns and return a column with the same length.
mutate() can do it for you and append one or more new columns:
::mutate(sim.dat, total_exp = store_exp + online_exp)dplyr
The above code sums up two columns and appends the result (
sim.dat. Another similar function is
transmute(). The difference is that
transmute() will delete the original columns and only keep the new ones.
::transmute(sim.dat, total_exp = store_exp + online_exp)dplyr
Similar to SQL, there are different joins in
dplyr. We create two baby data sets to show how the functions work.
<- data.frame(cbind(ID = c("A", "B", "C"), x1 = c(1, 2, 3))))(x
## ID x1 ## 1 A 1 ## 2 B 2 ## 3 C 3
<- data.frame(cbind(ID = c("B", "C", "D"), y1 = c(T, T, F))))(y
## ID y1 ## 1 B TRUE ## 2 C TRUE ## 3 D FALSE
# join to the left # keep all rows in x left_join(x, y, by = "ID")
## ID x1 y1 ## 1 A 1 <NA> ## 2 B 2 TRUE ## 3 C 3 TRUE
# get rows matched in both data sets inner_join(x, y, by = "ID")
## ID x1 y1 ## 1 B 2 TRUE ## 2 C 3 TRUE
# get rows in either data set full_join(x, y, by = "ID")
## ID x1 y1 ## 1 A 1 <NA> ## 2 B 2 TRUE ## 3 C 3 TRUE ## 4 D <NA> FALSE
# filter out rows in x that can be matched in y # it doesn't bring in any values from y semi_join(x, y, by = "ID")
# the opposite of semi_join() # it gets rows in x that cannot be matched in y # it doesn't bring in any values from y anti_join(x, y, by = "ID")
There are other functions (
setdiff()). Also the data frame version of
cbind which are
bind_col(). We are not going to go through them all. You can try them yourself. If you understand the functions we introduced so far. It should be easy for you to figure out the rest.
sapply() in base R
There are some powerful functions to summarize data in base R, such as
sapply(). They do the same basic things and are all from “apply” family: apply functions over parts of data. They differ in two important respects:
- the type of object they apply to
- the type of result they will return
When do we use
apply()? When we want to apply a function to margins of an array or matrix. That means our data need to be structured. The operations can be very flexible. It returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
For example you can compute row and column sums for a matrix:
## simulate a matrix <- cbind(x1 =1:8, x2 = c(4:1, 2:5)) x dimnames(x)[] <- letters[1:8] apply(x, 2, mean)
## x1 x2 ## 4.5 3.0
<- apply(x, 2, sum) col.sums <- apply(x, 1, sum)row.sums
You can also apply other functions:
<- matrix(c(1:4, 1, 6:8), nrow = 2) ma ma
## [,1] [,2] [,3] [,4] ## [1,] 1 3 1 7 ## [2,] 2 4 6 8
apply(ma, 1, table) #--> a list of length 2
## [] ## ## 1 3 7 ## 2 1 1 ## ## [] ## ## 2 4 6 8 ## 1 1 1 1
apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
## [,1] [,2] ## 0% 1 2.0 ## 25% 1 3.5 ## 50% 2 5.0 ## 75% 4 6.5 ## 100% 7 8.0
Results can have different lengths for each call. This is a trickier example. What will you get?
## Example with different lengths for each call <- array(1:24, dim = 2:4) z <- apply(z, 1:2, function(x) seq_len(max(x))) zseq ## a 2 x 3 matrix zseq typeof(zseq) ## list dim(zseq) ## 2 3 1,] zseq[apply(z, 3, function(x) seq_len(max(x)))
lapply()applies a function over a list, data.frame or vector and returns a list of the same length.
sapply()is a user-friendly version and wrapper of
lapply(). By default it returns a vector, matrix or if
simplify = "array", an array if appropriate.
apply(x, f, simplify = FALSE, USE.NAMES = FALSE)is the same as
lapply(x, f). If
simplify=TRUE, then it will return a
Let’s use some data with context to help you better understand the functions.
- Get the mean and standard deviation of all numerical variables in the dataset.
# Get numerical variables <- sim.dat[, lapply(sim.dat, class) %in% c("integer", "numeric")] sdat ## Try the following code with apply() function apply(sim.dat,2,class) ## What is the problem?
The data frame
sdat only includes numeric columns. Now we can go head and use
apply() to get mean and standard deviation for each column:
apply(sdat, MARGIN = 2, function(x) mean(na.omit(x)))
## age income store_exp online_exp ## 3.884e+01 1.135e+05 1.357e+03 2.120e+03 ## store_trans online_trans Q1 Q2 ## 5.350e+00 1.355e+01 3.101e+00 1.823e+00 ## Q3 Q4 Q5 Q6 ## 1.992e+00 2.763e+00 2.945e+00 2.448e+00 ## Q7 Q8 Q9 Q10 ## 3.434e+00 2.396e+00 3.085e+00 2.320e+00
Here we defined a function using
function(x) mean(na.omit(x)). It is a very simple function. It tells R to ignore the missing value when calculating the mean.
MARGIN = 2 tells R to apply the function to each column. It is not hard to guess what
MARGIN = 1 mean. The result show that the average online expense is much higher than store expense. You can also compare the average scores across different questions. The command to calculate standard deviation is very similar. The only difference is to change
apply(sdat, MARGIN = 2, function(x) sd(na.omit(x)))
## age income store_exp online_exp ## 16.417 49842.287 2774.400 1731.224 ## store_trans online_trans Q1 Q2 ## 3.696 7.957 1.450 1.168 ## Q3 Q4 Q5 Q6 ## 1.402 1.155 1.284 1.439 ## Q7 Q8 Q9 Q10 ## 1.456 1.154 1.118 1.136
Even the average online expense is higher than store expense, the standard deviation for store expense is much higher than online expense which indicates there is very likely some big/small purchase in store. We can check it quickly:
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -500 205 329 1357 597 50000
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 69 420 1942 2120 2441 9479
There are some odd values in store expense. The minimum value is -500 which indicates that you should preprocess data before analyzing it. Checking those simple statistics will help you better understand your data. It then gives you some idea how to preprocess and analyze them. How about using
Run the following code and compare the results:
lapply(sdat, function(x) sd(na.omit(x))) sapply(sdat, function(x) sd(na.omit(x))) sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)