## 6.2 Summarize data

### 6.2.1 `apply()`

, `lapply()`

and `sapply()`

in base R

There are some powerful functions to summarize data in base R, such as `apply()`

, `lapply()`

and `sapply()`

. They do the same basic things and are all from “apply” family: apply functions over parts of data. They differ in two important respects:

- the type of object they apply to
- the type of result they will return

When do we use `apply()`

? When we want to apply a function to margins of an array or matrix. That means our data need to be structured. The operations can be very flexible. It returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.

For example you can compute row and column sums for a matrix:

```
## simulate a matrix
x <- cbind(x1 =1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
```

```
## x1 x2
## 4.5 3.0
```

You can also apply other functions:

```
## [,1] [,2] [,3] [,4]
## [1,] 1 3 1 7
## [2,] 2 4 6 8
```

```
## [[1]]
##
## 1 3 7
## 2 1 1
##
## [[2]]
##
## 2 4 6 8
## 1 1 1 1
```

```
## [,1] [,2]
## 0% 1 2.0
## 25% 1 3.5
## 50% 2 5.0
## 75% 4 6.5
## 100% 7 8.0
```

Results can have different lengths for each call. This is a trickier example. What will you get?

```
## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq ## a 2 x 3 matrix
typeof(zseq) ## list
dim(zseq) ## 2 3
zseq[1,]
apply(z, 3, function(x) seq_len(max(x)))
```

`lapply()`

applies a function over a list, data.frame or vector and returns a list of the same length.`sapply()`

is a user-friendly version and wrapper of`lapply()`

. By default it returns a vector, matrix or if`simplify = "array"`

, an array if appropriate.`apply(x, f, simplify = FALSE, USE.NAMES = FALSE)`

is the same as`lapply(x, f)`

. If`simplify=TRUE`

, then it will return a`data.frame`

instead of`list`

.

Let’s use some data with context to help you better understand the functions.

- Get the mean and standard deviation of all numerical variables in the dataset.

```
# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
# Get numerical variables
sdat <- sim.dat[, !lapply(sim.dat, class) == "factor"]
## Try the following code with apply() function apply(sim.dat,2,class)
## What is the problem?
```

The data frame `sdat`

only includes numeric columns. Now we can go head and use `apply()`

to get mean and standard deviation for each column:

```
## age income store_exp online_exp
## 3.884e+01 1.135e+05 1.357e+03 2.120e+03
## store_trans online_trans Q1 Q2
## 5.350e+00 1.355e+01 3.101e+00 1.823e+00
## Q3 Q4 Q5 Q6
## 1.992e+00 2.763e+00 2.945e+00 2.448e+00
## Q7 Q8 Q9 Q10
## 3.434e+00 2.396e+00 3.085e+00 2.320e+00
```

Here we defined a function using `function(x) mean(na.omit(x))`

. It is a very simple function. It tells R to ignore the missing value when calculating the mean. `MARGIN=2`

tells R to apply the function to each column. It is not hard to guess what `MARGIN=1`

mean. The result show that the average online expense is much higher than store expense. You can also compare the average scores across different questions. The command to calculate standard deviation is very similar. The only difference is to change `mean()`

to `sd()`

:

```
## age income store_exp online_exp
## 16.417 49842.287 2774.400 1731.224
## store_trans online_trans Q1 Q2
## 3.696 7.957 1.450 1.168
## Q3 Q4 Q5 Q6
## 1.402 1.155 1.284 1.439
## Q7 Q8 Q9 Q10
## 1.456 1.154 1.118 1.136
```

Even the average online expense is higher than store expense, the standard deviation for store expense is much higher than online expense which indicates there is very likely some big/small purchase in store. We can check it quickly:

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -500 205 329 1357 597 50000
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69 420 1942 2120 2441 9479
```

There are some odd values in store expense. The minimum value is -500 which indicates that you should preprocess data before analyzing it. Checking those simple statistics will help you better understand your data. It then gives you some idea how to preprocess and analyze them. How about using `lapply()`

and `sapply()`

?

Run the following code and compare the results:

### 6.2.2 `dplyr`

package

`dplyr`

provides a flexible grammar of data manipulation focusing on tools for working with data frames (hence the `d`

in the name). It is faster and more friendly:

- It identifies the most important data manipulations and make they easy to use from R
- It performs faster for in-memory data by writing key pieces in C++ using
`Rcpp`

- The interface is the same for data frame, data table or database.

We will illustrate the following functions in order:

- Display
- Subset
- Summarize
- Create new variable
- Merge

**Display**

`tbl_df()`

: Convert the data to`tibble`

which offers better checking and printing capabilities than traditional data frames. It will adjust output width according to fit the current window.

`glimpse()`

: This is like a transposed version of`tbl_df()`

**Subset**

Get rows with `income`

more than 300000:

Here we meet a new operator `%>%`

. It is called “Pipe operator” which pipes a value forward into an expression or function call. What you get in the left operation will be the first argument or the only argument in the right operation.

It is an operator from `magrittr`

which can be really beneficial. Look at the following code. Can you tell me what it does?

```
ave_exp <- filter(
summarise(
group_by(
filter(
sim.dat,
!is.na(income)
),
segment
),
ave_online_exp = mean(online_exp),
n = n()
),
n > 200
)
```

Now look at the identical code using “`%>%`

”:

```
ave_exp <- sim.dat %>%
filter(!is.na(income)) %>%
group_by(segment) %>%
summarise(
ave_online_exp = mean(online_exp),
n = n() ) %>%
filter(n > 200)
```

Isn’t it much more straightforward now? Let’s read it:

- Delete observations from
`sim.dat`

with missing income values - Group the data from step 1 by variable
`segment`

- Calculate mean of online expense for each segment and save the result as a new variable named
`ave_online_exp`

- Calculate the size of each segment and saved it as a new variable named
`n`

- Get segments with size larger than 200

You can use `distinct()`

to delete duplicated rows.

`sample_frac()`

will randomly select some rows with a specified percentage. `sample_n()`

can randomly select rows with a specified number.

`slice()`

will select rows by position:

It is equivalent to `sim.dat[10:15,]`

.

`top_n()`

will select the order top n entries:

If you want to select columns instead of rows, you can use `select()`

. The following are some sample codes:

```
# select by column name
dplyr::select(sim.dat,income,age,store_exp)
# select columns whose name contains a character string
dplyr::select(sim.dat, contains("_"))
# select columns whose name ends with a character string
# similar there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))
# select columns Q1,Q2,Q3,Q4 and Q5
select(sim.dat, num_range("Q", 1:5))
# select columns whose names are in a group of names
dplyr::select(sim.dat, one_of(c("age", "income")))
# select columns between age and online_exp
dplyr::select(sim.dat, age:online_exp)
# select all columns except for age
dplyr::select(sim.dat, -age)
```

**Summarize**

A standard marketing problem is customer segmentation. It usually starts with designing survey and collecting data. Then run a cluster analysis on the data to get customer segments. Once we have different segments, the next is to understand how each group of customer look like by summarizing some key metrics. For example, we can do the following data aggregation for different segments of clothes customers.

```
sim.dat %>%
group_by(segment) %>%
summarise(Age = round(mean(na.omit(age)), 0),
FemalePct = round(mean(gender == "Female"), 2),
HouseYes = round(mean(house == "Yes"), 2),
store_exp = round(mean(na.omit(store_exp), trim = 0.1), 0),
online_exp = round(mean(online_exp), 0),
store_trans = round(mean(store_trans), 1),
online_trans = round(mean(online_trans), 1))
```

```
## Age FemalePct HouseYes store_exp online_exp
## 1 39 0.55 0.57 840 2120
## store_trans online_trans
## 1 5.3 13.5
```

Now, let’s peel the onion.

The first line `sim.dat`

is easy. It is the data you want to work on. The second line `group_by(segment)`

tells R that in the following steps you want to summarise by variable `segment`

. Here we only summarize data by one categorical variable, but you can group by multiple variables, such as `group_by(segment, house)`

. The third argument `summarise`

tells R the manipulation(s) to do. Then list the exact actions inside `summarise()`

. For example, `Age=round(mean(na.omit(age)),0)`

tell R the following things:

- Calculate the mean of column
`age`

ignoring missing value for each customer segment - Round the result to the specified number of decimal places
- Store the result in a new variable named
`Age`

The rest of the command above is similar. In the end, we calculate the following for each segment:

`Age`

: average age for each segment`FemalePct`

: percentage for each segment`HouseYes`

: percentage of people who own a house`stroe_exp`

: average expense in store`online_exp`

: average expense online`store_trans`

: average times of transactions in the store`online_trans`

: average times of online transactions

There is a lot of information you can extract from those simple averages.

Conspicuous: average age is about 40. It is a group of middle-age wealthy people. 1/3 of them are female, and 2/3 are male. They buy regardless the price. Almost all of them own house (0.86). It makes us wonder what is wrong with the rest 14%?

Price: They are older people with average age 60. Nearly all of them own a house(0.94). They are less likely to purchase online (store_trans=6 while online_trans=3). It is the only group that is less likely to buy online.

Quality: The average age is 35. They are not way different with Conspicuous regarding age. But they spend much less. The percentages of male and female are similar. They prefer online shopping. More than half of them don’t own a house (0.66).

Style: They are young people with average age 24. The majority of them are female (0.81). Most of them don’t own a house (0.73). They are very likely to be digital natives and prefer online shopping.

You may notice that Style group purchase more frequently online (`online_trans`

) but the expense (`online_exp`

) is not higher. It makes us wonder what is the average expense each time, so you have a better idea about the price range of the group.

The analytical process is aggregated instead of independent steps. The current step will shed new light on what to do next. Sometimes you need to go back to fix something in the previous steps. Let’s check average one-time online and instore purchase amounts:

```
sim.dat %>%
group_by(segment) %>%
summarise(avg_online = round(sum(online_exp)/sum(online_trans), 2),
avg_store = round(sum(store_exp)/sum(store_trans), 2))
```

```
## avg_online avg_store
## 1 156.5 253.6
```

Price group has the lowest averaged one-time purchase. The Conspicuous group will pay the highest price. When we build customer profile in real life, we will also need to look at the survey summarization. You may be surprised how much information simple data manipulations can provide.

Another comman task is to check which column has missing values. It requires the program to look at each column in the data. In this case you can use `summarise_all`

:

```
# apply function anyNA() to each column
# you can also assign a function vector such as: c("anyNA","is.factor")
dplyr::summarise_all(sim.dat, funs_(c("anyNA")))
```

```
## age gender income house store_exp online_exp
## 1 FALSE FALSE TRUE FALSE FALSE FALSE
## store_trans online_trans Q1 Q2 Q3 Q4
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## Q5 Q6 Q7 Q8 Q9 Q10 segment
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

The above code returns a vector indicating if there is any value missing in each column.

**Create new variable**

There are often situations where you need to create new variables. For example, adding online and store expense to get total expense. In this case, you will apply **window function** to the columns and return a column with the same length. `mutate()`

can do it for you and append one or more new columns:

The above code sums up two columns and appends the result (`total_exp`

) to `sim.dat`

. Another similar function is `transmute()`

. The difference is that `transmute()`

will delete the original columns and only keep the new ones.

**Merge**

Similar to SQL, there are different joins in `dplyr`

. We create two baby data sets to show how the functions work.

```
## ID x1
## 1 A 1
## 2 B 2
## 3 C 3
```

```
## ID y1
## 1 B TRUE
## 2 C TRUE
## 3 D FALSE
```

```
## Warning: Column `ID` joining factors with different
## levels, coercing to character vector
```

```
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
```

```
## Warning: Column `ID` joining factors with different
## levels, coercing to character vector
```

```
## ID x1 y1
## 1 B 2 TRUE
## 2 C 3 TRUE
```

```
## Warning: Column `ID` joining factors with different
## levels, coercing to character vector
```

```
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## 4 D <NA> FALSE
```

```
# filter out rows in x that can be matched in y
# it doesn't bring in any values from y
semi_join(x, y, by = "ID")
```

```
# the opposite of semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x, y, by = "ID")
```

There are other functions(`intersect()`

, `union()`

and `setdiff()`

). Also the data frame version of `rbind`

and `cbind`

which are `bind_rows()`

and `bind_col()`

. We are not going to go through them all. You can try them yourself. If you understand the functions we introduced so far. It should be easy for you to figure out the rest.