6.3 Tidy and Reshape Data

“Tidy data” represent the information from a dataset as data frames where each row is an observation, and each column contains the values of a variable (i.e., an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with for the problem at hand, in practice, we often need to convert data between the “wide” and the “long” format. The process feels like kneading the dough.

There are two commonly used packages for this kind of manipulations: tidyr and reshape2. We will show how to tidy and reshape data using the two packages. By comparing the functions to show how they overlap and where they differ.

6.3.1 reshape2 package

It is a reboot of the previous package reshape. Take a baby subset of our exemplary clothes consumers data to illustrate:

For the above data sdat, what if we want to have a variable indicating the purchasing channel (i.e. online or in-store) and another column with the corresponding expense amount? Assume we want to keep the rest of the columns the same. It is a task to change data from “wide” to “long”. There are two general ways to shape data:

  • Use melt() to convert an object into a molten data frame, i.e., from wide to long
  • Use dcast() to cast a molten data frame into the shape you want, i.e., from long to wide
##    age gender income house    Channel Expense
## 1   57 Female 120963   Yes  store_exp   529.1
## 2   63 Female 122008   Yes  store_exp   478.0
## 3   59   Male 114202   Yes  store_exp   490.8
## 4   60   Male 113616   Yes  store_exp   347.8
## 5   51   Male 124253   Yes  store_exp   379.6
## 6   57 Female 120963   Yes online_exp   303.5
## 7   63 Female 122008   Yes online_exp   109.5
## 8   59   Male 114202   Yes online_exp   279.2
## 9   60   Male 113616   Yes online_exp   141.7
## 10  51   Male 124253   Yes online_exp   112.2

You melted the data frame sdat by two variables: store_exp and online_exp (measure.vars=c("store_exp","online_exp")). The new variable name is Channel set by command variable.name = "Channel". The value name is Expense set by command value.name = "Expense".

You can run a regression to study the effect of purchasing channel as follows:

You can melt() list, matrix, table too. The syntax is similar, and we won’t go through every situation. Sometimes we want to convert the data from “long” to “wide”. For example, you want to compare the online and in-store expense between male and female based on the house ownership.

## Using Expense as value column: use value.var to override.
##   house gender store_exp online_exp
## 1   Yes Female      1007      413.0
## 2   Yes   Male      1218      533.2

In the above code, what is the left side of ~ are variables that you want to group by. The right side is the variable you want to spread as columns. It will use the column indicating value from melt() before. Here is “Expense” .

6.3.2 tidyr package

The other package that will do similar manipulations is tidyr. Let’s get a subset to illustrate the usage.

## # A tibble: 5 x 4
##     age gender store_exp store_trans
##   <int> <fct>      <dbl>       <int>
## 1    57 Female      529.           2
## 2    63 Female      478.           4
## 3    59 Male        491.           7
## 4    60 Male        348.          10
## 5    51 Male        380.           4

gather() function in tidyr is analogous to melt() in reshape2. The following code will do the same thing as we did before using melt():

## # A tibble: 10 x 4
##      age gender variable    value
##    <int> <fct>  <chr>       <dbl>
##  1    57 Female store_exp    529.
##  2    63 Female store_exp    478.
##  3    59 Male   store_exp    491.
##  4    60 Male   store_exp    348.
##  5    51 Male   store_exp    380.
##  6    57 Female store_trans    2 
##  7    63 Female store_trans    4 
##  8    59 Male   store_trans    7 
##  9    60 Male   store_trans   10 
## 10    51 Male   store_trans    4

Or if we use the pipe operation, we can write the above code as:

It is identical with the following code using melt():

The opposite operation to gather() is spread(). The previous one stacks columns and the latter one spread the columns.

##   age gender store_exp store_trans
## 1  51   Male     379.6           4
## 2  57 Female     529.1           2
## 3  59   Male     490.8           7
## 4  60   Male     347.8          10
## 5  63 Female     478.0           4

Another pair of functions that do opposite manipulations are separate() and unite().

## # A tibble: 10 x 5
##      age gender Source Type  value
##    <int> <fct>  <chr>  <chr> <dbl>
##  1    57 Female store  exp    529.
##  2    63 Female store  exp    478.
##  3    59 Male   store  exp    491.
##  4    60 Male   store  exp    348.
##  5    51 Male   store  exp    380.
##  6    57 Female store  trans    2 
##  7    63 Female store  trans    4 
##  8    59 Male   store  trans    7 
##  9    60 Male   store  trans   10 
## 10    51 Male   store  trans    4

You can see that the function separates the original column “variable” to two new columns “Source” and “Type”. You can use sep= to set the string or regular expression to separate the column. By default, it is “_”.

The unite() function will do the opposite: combining two columns. It is the generalization of paste() to a data frame.

##    age gender    variable value
## 1   57 Female   store_exp 529.1
## 2   63 Female   store_exp 478.0
## 3   59   Male   store_exp 490.8
## 4   60   Male   store_exp 347.8
## 5   51   Male   store_exp 379.6
## 6   57 Female store_trans   2.0
## 7   63 Female store_trans   4.0
## 8   59   Male store_trans   7.0
## 9   60   Male store_trans  10.0
## 10  51   Male store_trans   4.0

The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with those functions. Unfortunately, there is no shortcut.