6.2 Tidy and Reshape Data
“Tidy data” represents the information from a dataset as data frames where each row is an observation, and each column contains the values of a variable (i.e., an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with the problem at hand. In practice, we often need to convert data between the “wide” and the “long” format. The process feels like kneading the dough.
In this section, we will show how to tidy and reshape data using tidyr
packages. It is built to simplify the process of creating tidy data. We will go through four fundamental functions:
gather()
: reshape data from wide to longspread()
: reshape data from long to wideseparate()
: split a column into multiple columnsunite()
: combine multiple columns to one column
Take a baby subset of our exemplary clothes consumers data to illustrate:
<-sim.dat[1:5,1:6]
sdat sdat
## age gender income house store_exp online_exp
## 1 57 Female 120963 Yes 529.1 303.5
## 2 63 Female 122008 Yes 478.0 109.5
## 3 59 Male 114202 Yes 490.8 279.2
## 4 60 Male 113616 Yes 347.8 141.7
## 5 51 Male 124253 Yes 379.6 112.2
For the above data sdat
, what if we want to reshape the data to have a column indicating the purchasing channel (i.e. from store_exp
or online_exp
) and a second column with the corresponding expense amount? Assume we want to keep the rest of the columns the same. It is a task to change data from “wide” to “long”.
<- tidyr::gather(sdat, "Channel","Expense",
dat_long
store_exp, online_exp) dat_long
## age gender income house Channel Expense
## 1 57 Female 120963 Yes store_exp 529.1
## 2 63 Female 122008 Yes store_exp 478.0
## 3 59 Male 114202 Yes store_exp 490.8
## 4 60 Male 113616 Yes store_exp 347.8
## 5 51 Male 124253 Yes store_exp 379.6
## 6 57 Female 120963 Yes online_exp 303.5
## 7 63 Female 122008 Yes online_exp 109.5
## 8 59 Male 114202 Yes online_exp 279.2
## 9 60 Male 113616 Yes online_exp 141.7
## 10 51 Male 124253 Yes online_exp 112.2
The above code gathers two variables (store_exp
and online_exp
), and collapses them into key-value pairs (Channel
and Expense
), duplicating all other columns as needed.
You can run a regression to study the effect of purchasing channel as follows:
# Here we use all observations from sim.dat
# Don't show result here
<- tidyr::gather(sim.dat[, 1:6], "Channel","Expense",
msdat
store_exp, online_exp)<- lm(Expense ~ gender + house + income + Channel + age,
fit data = msdat)
summary(fit)
Sometimes we want to reshape the data from “long” to “wide”. For example, you want to compare the online and in-store expense between male and female based on house ownership.
We need to reshape the wide data frame dat_wide
to a long format by spreading the key-value pairs across multiple columns. And then summarize the long data frame dat_long
, grouping byhouse
and gender
.
= tidyr::spread(dat_long, Channel, Expense)
dat_wide # you can check what dat_long is like
%>%
dat_wide ::group_by(house, gender) %>%
dplyr::summarise(total_online_exp = sum(online_exp),
dplyrtotal_store_exp = sum(store_exp))
## # A tibble: 2 × 4
## # Groups: house [1]
## house gender total_online_exp total_store_exp
## <chr> <chr> <dbl> <dbl>
## 1 Yes Female 413. 1007.
## 2 Yes Male 533. 1218.
The above code also uses the functions in the dplyr
package introduced in the previous section. Here we use package::function
to make clear the package name. It is not necessary if the package is already loaded.
Another pair of functions that do opposite manipulations are separate()
and unite()
.
<- dat_long %>%
sepdatseparate(Channel, c("Source", "Type"))
sepdat
## age gender income house Source Type Expense
## 1 57 Female 120963 Yes store exp 529.1
## 2 63 Female 122008 Yes store exp 478.0
## 3 59 Male 114202 Yes store exp 490.8
## 4 60 Male 113616 Yes store exp 347.8
## 5 51 Male 124253 Yes store exp 379.6
## 6 57 Female 120963 Yes online exp 303.5
## 7 63 Female 122008 Yes online exp 109.5
## 8 59 Male 114202 Yes online exp 279.2
## 9 60 Male 113616 Yes online exp 141.7
## 10 51 Male 124253 Yes online exp 112.2
You can see that the function separates the original column “Channel
” to two new columns “Source
” and “Type
”. You can use sep =
to set the string or regular expression to separate the column. By default, it is “_
”.
The unite()
function will do the opposite: combining two columns. It is the generalization of paste()
to a data frame.
%>%
sepdat unite("Channel", Source, Type, sep = "_")
## age gender income house Channel Expense
## 1 57 Female 120963 Yes store_exp 529.1
## 2 63 Female 122008 Yes store_exp 478.0
## 3 59 Male 114202 Yes store_exp 490.8
## 4 60 Male 113616 Yes store_exp 347.8
## 5 51 Male 124253 Yes store_exp 379.6
## 6 57 Female 120963 Yes online_exp 303.5
## 7 63 Female 122008 Yes online_exp 109.5
## 8 59 Male 114202 Yes online_exp 279.2
## 9 60 Male 113616 Yes online_exp 141.7
## 10 51 Male 124253 Yes online_exp 112.2
The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with those functions. Unfortunately, there is no shortcut.