6.2 Tidy and Reshape Data

“Tidy data” represents the information from a dataset as data frames where each row is an observation, and each column contains the values of a variable (i.e., an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with the problem at hand. In practice, we often need to convert data between the “wide” and the “long” format. The process feels like kneading the dough.

In this section, we will show how to tidy and reshape data using tidyr packages. It is built to simplify the process of creating tidy data. We will go through four fundamental functions:

  • gather(): reshape data from wide to long
  • spread(): reshape data from long to wide
  • separate(): split a column into multiple columns
  • unite(): combine multiple columns to one column

Take a baby subset of our exemplary clothes consumers data to illustrate:

sdat<-sim.dat[1:5,1:6]
sdat
##   age gender income house store_exp online_exp
## 1  57 Female 120963   Yes     529.1      303.5
## 2  63 Female 122008   Yes     478.0      109.5
## 3  59   Male 114202   Yes     490.8      279.2
## 4  60   Male 113616   Yes     347.8      141.7
## 5  51   Male 124253   Yes     379.6      112.2

For the above data sdat, what if we want to reshape the data to have a column indicating the purchasing channel (i.e. from store_exp or online_exp) and a second column with the corresponding expense amount? Assume we want to keep the rest of the columns the same. It is a task to change data from “wide” to “long”.

dat_long <- tidyr::gather(sdat, "Channel","Expense", 
                       store_exp, online_exp)
dat_long
##    age gender income house    Channel Expense
## 1   57 Female 120963   Yes  store_exp   529.1
## 2   63 Female 122008   Yes  store_exp   478.0
## 3   59   Male 114202   Yes  store_exp   490.8
## 4   60   Male 113616   Yes  store_exp   347.8
## 5   51   Male 124253   Yes  store_exp   379.6
## 6   57 Female 120963   Yes online_exp   303.5
## 7   63 Female 122008   Yes online_exp   109.5
## 8   59   Male 114202   Yes online_exp   279.2
## 9   60   Male 113616   Yes online_exp   141.7
## 10  51   Male 124253   Yes online_exp   112.2

The above code gathers two variables (store_exp and online_exp), and collapses them into key-value pairs (Channel and Expense), duplicating all other columns as needed.

You can run a regression to study the effect of purchasing channel as follows:

# Here we use all observations from sim.dat
# Don't show result here

msdat <- tidyr::gather(sim.dat[, 1:6], "Channel","Expense", 
                       store_exp, online_exp)
fit <- lm(Expense ~ gender + house + income + Channel + age, 
          data = msdat)
summary(fit)

Sometimes we want to reshape the data from “long” to “wide”. For example, you want to compare the online and in-store expense between male and female based on house ownership.

We need to reshape the wide data frame dat_wide to a long format by spreading the key-value pairs across multiple columns. And then summarize the long data frame dat_long, grouping byhouse and gender.

dat_wide = tidyr::spread(dat_long, Channel, Expense)
# you can check what dat_long is like
dat_wide %>% 
  dplyr::group_by(house, gender) %>% 
  dplyr::summarise(total_online_exp = sum(online_exp),
                   total_store_exp = sum(store_exp))
## # A tibble: 2 × 4
## # Groups:   house [1]
##   house gender total_online_exp total_store_exp
##   <chr> <chr>             <dbl>           <dbl>
## 1 Yes   Female             413.           1007.
## 2 Yes   Male               533.           1218.

The above code also uses the functions in the dplyr package introduced in the previous section. Here we use package::function to make clear the package name. It is not necessary if the package is already loaded.

Another pair of functions that do opposite manipulations are separate() and unite().

sepdat<- dat_long %>% 
  separate(Channel, c("Source", "Type"))
sepdat
##    age gender income house Source Type Expense
## 1   57 Female 120963   Yes  store  exp   529.1
## 2   63 Female 122008   Yes  store  exp   478.0
## 3   59   Male 114202   Yes  store  exp   490.8
## 4   60   Male 113616   Yes  store  exp   347.8
## 5   51   Male 124253   Yes  store  exp   379.6
## 6   57 Female 120963   Yes online  exp   303.5
## 7   63 Female 122008   Yes online  exp   109.5
## 8   59   Male 114202   Yes online  exp   279.2
## 9   60   Male 113616   Yes online  exp   141.7
## 10  51   Male 124253   Yes online  exp   112.2

You can see that the function separates the original column “Channel” to two new columns “Source” and “Type”. You can use sep = to set the string or regular expression to separate the column. By default, it is “_”.

The unite() function will do the opposite: combining two columns. It is the generalization of paste() to a data frame.

sepdat %>% 
  unite("Channel", Source, Type, sep = "_")
##    age gender income house    Channel Expense
## 1   57 Female 120963   Yes  store_exp   529.1
## 2   63 Female 122008   Yes  store_exp   478.0
## 3   59   Male 114202   Yes  store_exp   490.8
## 4   60   Male 113616   Yes  store_exp   347.8
## 5   51   Male 124253   Yes  store_exp   379.6
## 6   57 Female 120963   Yes online_exp   303.5
## 7   63 Female 122008   Yes online_exp   109.5
## 8   59   Male 114202   Yes online_exp   279.2
## 9   60   Male 113616   Yes online_exp   141.7
## 10  51   Male 124253   Yes online_exp   112.2

The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with those functions. Unfortunately, there is no shortcut.