6.1 Summarize Data

6.1.1 dplyr package

dplyr provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the d in the name). It is faster and more friendly:

  • It identifies the most important data manipulations and make them easy to use from R.
  • It performs faster for in-memory data by writing key pieces in C++ using Rcpp.
  • The interface is the same for data frame, data table or database.

We will illustrate the following functions in order using the clothing company data:

  1. Display
  2. Subset
  3. Summarize
  4. Create new variable
  5. Merge
# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4") Display

  • tbl_df(): Convert the data to tibble which offers better checking and printing capabilities than traditional data frames. It will adjust output width according to fit the current window.
  • glimpse(): This is like a transposed version of tbl_df()
glimpse(sim.dat) Subset

Get rows with income more than 300000:

filter(sim.dat, income >300000) %>%

Here we use the operator %>%. It is called a “pipe operator” which pipes a value forward into an expression or function call. What you get in the left operation will be the first argument or the only argument in the right operation.

x %>% f(y) = f(x, y)
y %>% f(x, ., z) = f(x, y, z )

It is an operator from magrittr which can be really beneficial. The following R code is difficulty to read and understand without using the pipe operator.

ave_exp <- filter( 
    ave_online_exp = mean(online_exp), 
    n = n()
  n > 200

The same function with pipe operator “%>%”:

ave_exp <- sim.dat %>% 
 filter(!is.na(income)) %>% 
 group_by(segment) %>% 
   ave_online_exp = mean(online_exp), 
   n = n() ) %>% 
  filter(n > 200)

It is much easier to read:

  1. Delete observations from sim.dat with missing income values
  2. Group the data from step 1 by variable segment
  3. Calculate mean of online expense for each segment and save the result as a new variable named ave_online_exp
  4. Calculate the size of each segment and saved it as a new variable named n
  5. Get segments with size larger than 200

You can use distinct() to delete duplicated rows.


sample_frac() will randomly select some rows with a specified percentage. sample_n() can randomly select rows with a specified number.

dplyr::sample_frac(sim.dat, 0.5, replace = TRUE) 
dplyr::sample_n(sim.dat, 10, replace = TRUE) 

slice() will select rows by position:

dplyr::slice(sim.dat, 10:15) 

It is equivalent to sim.dat[10:15,].

top_n() will select the order top n entries:


If you want to select columns instead of rows, you can use select(). The following are some sample codes:

# select by column name

# select columns whose name contains a character string
dplyr::select(sim.dat, contains("_"))

# select columns whose name ends with a character string
# similar there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))

# select columns Q1,Q2,Q3,Q4 and Q5
select(sim.dat, num_range("Q", 1:5)) 

# select columns whose names are in a group of names
dplyr::select(sim.dat, one_of(c("age", "income")))

# select columns between age and online_exp
dplyr::select(sim.dat, age:online_exp)

# select all columns except for age
dplyr::select(sim.dat, -age) Summarize

Let us use a standard marketing problem, customer segmentation, to illustrate how to summarize data. It usually starts with designing survey and collecting data. Then run a cluster analysis on the data to get customer segments. Once we have different segments, the next is to understand how each group of customer look like by summarizing some key metrics. For example, we can do the following data aggregation for different segments of clothes customers.

dat_summary <- sim.dat %>% 
  dplyr::group_by(segment) %>% 
  dplyr::summarise(Age = round(mean(na.omit(age)), 0), 
            FemalePct = round(mean(gender == "Female"), 2), 
            HouseYes = round(mean(house == "Yes"), 2), 
            store_exp = round(mean(na.omit(store_exp), 
                                   trim = 0.1), 0),
            online_exp = round(mean(online_exp), 0), 
            store_trans = round(mean(store_trans), 1), 
            online_trans = round(mean(online_trans), 1))

# transpose the data frame for showing purpose
# due to the limit of output width
cnames <- dat_summary$segment
dat_summary <- dplyr::select(dat_summary, - segment)
tdat_summary <- t(dat_summary) %>% data.frame()
names(tdat_summary) <- cnames
##              Conspicuous  Price Quality   Style
## Age                42.00  60.00   35.00   24.00
## FemalePct           0.32   0.45    0.47    0.81
## HouseYes            0.86   0.94    0.34    0.27
## store_exp        4990.00 501.00  301.00  200.00
## online_exp       4898.00 205.00 2013.00 1962.00
## store_trans        10.90   6.10    2.90    3.00
## online_trans       11.10   3.00   16.00   21.10

Now, let’s look at the code in more details.

The first line sim.dat is easy. It is the data you want to work on. The second line group_by(segment) tells R that in the following steps you want to summarise by variable segment. Here we only summarize data by one categorical variable, but you can group by multiple variables, such as group_by(segment, house). The third argument summarise tells R the manipulation(s) to do. Then list the exact actions inside summarise(). For example, Age = round(mean(na.omit(age)),0) tell R the following things:

  1. Calculate the mean of column age ignoring missing value for each customer segment
  2. Round the result to the specified number of decimal places
  3. Store the result in a new variable named Age

The rest of the command above is similar. In the end, we calculate the following for each segment:

  1. Age: average age for each segment
  2. FemalePct: percentage for each segment
  3. HouseYes: percentage of people who own a house
  4. stroe_exp: average expense in store
  5. online_exp: average expense online
  6. store_trans: average times of transactions in the store
  7. online_trans: average times of online transactions

There is a lot of information you can extract from those simple averages.

  • Conspicuous: average age is about 40. It is a group of middle-age wealthy people. 1/3 of them are female, and 2/3 are male. They buy regardless the price. Almost all of them own house (0.86).

  • Price: They are older people with average age 60. Nearly all of them own a house (0.94). They are less likely to purchase online (store_trans = 6 while online_trans = 3). It is the only group that is less likely to buy online.

  • Quality: The average age is 35. They are not way different with Conspicuous regarding age. But they spend much less. The percentages of male and female are similar. They prefer online shopping. More than half of them don’t own a house (0.66).

  • Style: They are young people with average age 24. The majority of them are female (0.81). Most of them don’t own a house (0.73). They are very likely to be digital natives and prefer online shopping.

You may notice that Style group purchase more frequently online (online_trans) but the expense (online_exp) is not higher. It makes us wonder what is the average expense each time, so you have a better idea about the price range of the group.

The analytical process is aggregated instead of independent steps. The current step will shed new light on what to do next. Sometimes you need to go back to fix something in the previous steps. Let’s check average one-time online and instore purchase amounts:

sim.dat %>% 
  group_by(segment) %>% 
  summarise(avg_online = round(sum(online_exp)/sum(online_trans), 2),
            avg_store = round(sum(store_exp)/sum(store_trans), 2))
## # A tibble: 4 × 3
##   segment     avg_online avg_store
##   <chr>            <dbl>     <dbl>
## 1 Conspicuous      442.      479. 
## 2 Price             69.3      81.3
## 3 Quality          126.      105. 
## 4 Style             92.8     121.

Price group has the lowest averaged one-time purchase. The Conspicuous group will pay the highest price. When we build customer profile in real life, we will also need to look at the survey summarization. You may be surprised how much information simple data manipulations can provide.

Another comman task is to check which column has missing values. It requires the program to look at each column in the data. In this case you can use summarise_all:

# apply function anyNA() to each column
# you can also assign a function vector 
# such as: c("anyNA","is.factor")
dplyr::summarise_all(sim.dat, funs_(c("anyNA")))
##     age gender income house store_exp online_exp
##   store_trans online_trans    Q1    Q2    Q3    Q4
##      Q5    Q6    Q7    Q8    Q9   Q10 segment

The above code returns a vector indicating if there is any value missing in each column. Create new variable

There are often situations where you need to create new variables. For example, adding online and store expense to get total expense. In this case, you will apply a function to the columns and return a column with the same length. mutate() can do it for you and append one or more new columns:

dplyr::mutate(sim.dat, total_exp = store_exp + online_exp)

The above code sums up two columns and appends the result (total_exp) to sim.dat. Another similar function is transmute(). The difference is that transmute() will delete the original columns and only keep the new ones.

dplyr::transmute(sim.dat, total_exp = store_exp + online_exp) Merge

Similar to SQL, there are different joins in dplyr. We create two baby data sets to show how the functions work.

(x <- data.frame(cbind(ID = c("A", "B", "C"), x1 = c(1, 2, 3))))
##   ID x1
## 1  A  1
## 2  B  2
## 3  C  3
(y <- data.frame(cbind(ID = c("B", "C", "D"), y1 = c(T, T, F))))
##   ID    y1
## 1  B  TRUE
## 2  C  TRUE
## 3  D FALSE
# join to the left
# keep all rows in x
left_join(x, y, by = "ID")
##   ID x1   y1
## 1  A  1 <NA>
## 2  B  2 TRUE
## 3  C  3 TRUE
# get rows matched in both data sets
inner_join(x, y, by = "ID")
##   ID x1   y1
## 1  B  2 TRUE
## 2  C  3 TRUE
# get rows in either data set
full_join(x, y, by = "ID")
##   ID   x1    y1
## 1  A    1  <NA>
## 2  B    2  TRUE
## 3  C    3  TRUE
## 4  D <NA> FALSE
# filter out rows in x that can be matched in y 
# it doesn't bring in any values from y 
semi_join(x, y, by = "ID")
# the opposite of  semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x, y, by = "ID")

There are other functions (intersect(), union() and setdiff()). Also the data frame version of rbind and cbind which are bind_rows() and bind_col(). We are not going to go through them all. You can try them yourself. If you understand the functions we introduced so far. It should be easy for you to figure out the rest.

6.1.2 apply(), lapply() and sapply() in base R

There are some powerful functions to summarize data in base R, such as apply(), lapply() and sapply(). They do the same basic things and are all from “apply” family: apply functions over parts of data. They differ in two important respects:

  1. the type of object they apply to
  2. the type of result they will return

When do we use apply()? When we want to apply a function to margins of an array or matrix. That means our data need to be structured. The operations can be very flexible. It returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.

For example you can compute row and column sums for a matrix:

## simulate a matrix
x <- cbind(x1 =1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
##  x1  x2 
## 4.5 3.0
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)

You can also apply other functions:

ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    1    7
## [2,]    2    4    6    8
apply(ma, 1, table)  #--> a list of length 2
## [[1]]
## 1 3 7 
## 2 1 1 
## [[2]]
## 2 4 6 8 
## 1 1 1 1
apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
##      [,1] [,2]
## 0%      1  2.0
## 25%     1  3.5
## 50%     2  5.0
## 75%     4  6.5
## 100%    7  8.0

Results can have different lengths for each call. This is a trickier example. What will you get?

## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq         ## a 2 x 3 matrix
typeof(zseq) ## list
dim(zseq) ## 2 3
apply(z, 3, function(x) seq_len(max(x)))
  • lapply() applies a function over a list, data.frame or vector and returns a list of the same length.
  • sapply() is a user-friendly version and wrapper of lapply(). By default it returns a vector, matrix or if simplify = "array", an array if appropriate. apply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f). If simplify=TRUE, then it will return a data.frame instead of list.

Let’s use some data with context to help you better understand the functions.

  • Get the mean and standard deviation of all numerical variables in the dataset.
# Get numerical variables
sdat <- sim.dat[, lapply(sim.dat, class) %in% c("integer", "numeric")]
## Try the following code with apply() function apply(sim.dat,2,class)
## What is the problem?

The data frame sdat only includes numeric columns. Now we can go head and use apply() to get mean and standard deviation for each column:

apply(sdat, MARGIN = 2, function(x) mean(na.omit(x)))
##          age       income    store_exp   online_exp 
##    3.884e+01    1.135e+05    1.357e+03    2.120e+03 
##  store_trans online_trans           Q1           Q2 
##    5.350e+00    1.355e+01    3.101e+00    1.823e+00 
##           Q3           Q4           Q5           Q6 
##    1.992e+00    2.763e+00    2.945e+00    2.448e+00 
##           Q7           Q8           Q9          Q10 
##    3.434e+00    2.396e+00    3.085e+00    2.320e+00

Here we defined a function using function(x) mean(na.omit(x)). It is a very simple function. It tells R to ignore the missing value when calculating the mean. MARGIN = 2 tells R to apply the function to each column. It is not hard to guess what MARGIN = 1 mean. The result show that the average online expense is much higher than store expense. You can also compare the average scores across different questions. The command to calculate standard deviation is very similar. The only difference is to change mean() to sd():

apply(sdat, MARGIN = 2, function(x) sd(na.omit(x)))
##          age       income    store_exp   online_exp 
##       16.417    49842.287     2774.400     1731.224 
##  store_trans online_trans           Q1           Q2 
##        3.696        7.957        1.450        1.168 
##           Q3           Q4           Q5           Q6 
##        1.402        1.155        1.284        1.439 
##           Q7           Q8           Q9          Q10 
##        1.456        1.154        1.118        1.136

Even the average online expense is higher than store expense, the standard deviation for store expense is much higher than online expense which indicates there is very likely some big/small purchase in store. We can check it quickly:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -500     205     329    1357     597   50000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      69     420    1942    2120    2441    9479

There are some odd values in store expense. The minimum value is -500 which indicates that you should preprocess data before analyzing it. Checking those simple statistics will help you better understand your data. It then gives you some idea how to preprocess and analyze them. How about using lapply() and sapply()?

Run the following code and compare the results:

lapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)