Chapter 6 Data Wrangling

This chapter focuses on some of the most frequently used data manipulations and shows how to implement them in R and Python. Before analysis, it is critical to explore the data with descriptive statistics (mean, standard deviation, etc.) and data visualization, and to transform the data so that its structure meets the requirements of the model. After analysis, you also need to summarize the results.

When the data is too large to fit in a computer’s memory, we can use a big data analytics engine such as Spark on a cloud platform (see Chapter 4). Even though the user interfaces of many data platforms are much friendlier now, it is still easier to manipulate the data as a local data frame. Spark’s R and Python interfaces aim to keep the data manipulation syntax consistent with popular packages for local data frames. As shown in Section 4.4, we can run nearly all of the dplyr functions on a Spark data frame once the Spark environment is set up. The Python interface, pyspark, uses a syntax similar to that of pandas. This chapter focuses on data manipulations on standard data frames, which is also the foundation of big data manipulation.

Even when the data fits in memory, it can be slow to read and manipulate because of its relatively large size. Some R packages make the process faster, at the cost of a less familiar syntax, especially for data wrangling. In return, they avoid the hurdle of setting up a Spark cluster and working in an unfamiliar environment. This topic is not covered in this chapter, but Appendix 13 briefly introduces some alternative R packages for reading, writing, and wrangling a data set that is relatively large but still fits in memory.

There are many fundamental data processing functions in R, but they lack a consistent coding style and do not flow together easily. Learning all of them is daunting and unnecessary. RStudio developed a collection of packages and bundled them in tidyverse to systematize data wrangling and analysis tasks. You can see the list of packages in tidyverse on its website. This chapter focuses on some of the tidyverse packages for data wrangling for the following reasons:

Those packages are widely used among R users in data science.
The code is more efficient.
The code syntax is consistent, which makes it easier to remember and read.

Section 6.1.2 introduces some base R functions outside the tidyverse, such as apply(), lapply() and sapply(). They complement the tidyverse functions when you are working with a data frame.
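
As a quick preview of how these three base R functions differ, here is a minimal sketch. The data frame `df` and its columns are made up for illustration and are not part of the book's data.

```r
# A small made-up data frame, used only for illustration
df <- data.frame(
  income = c(50, 60, 70, 80),
  age    = c(25, 35, 45, 55)
)

# apply(): work over the rows (MARGIN = 1) or columns (MARGIN = 2)
# of a matrix-like object; here, the mean of each column
col_means <- apply(df, 2, mean)

# lapply(): apply a function to each column and always return a list
col_ranges <- lapply(df, range)

# sapply(): like lapply(), but simplifies the result to a vector
# or matrix when possible
col_sds <- sapply(df, sd)

print(col_means)
print(col_ranges)
print(col_sds)
```

The main practical difference is the shape of the result: lapply() always returns a list, while sapply() tries to simplify it, which is convenient interactively but less predictable inside functions.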

Load the R packages first:

# install packages from CRAN
p_needed <- c('dplyr','tidyr')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}

# load the packages; require() returns TRUE for each package that loads successfully
lapply(p_needed, require, character.only = TRUE)
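
With dplyr available, the consistent syntax mentioned in the reasons above can be illustrated with a short pipeline. This is only a sketch: the `sales` data frame and its columns are hypothetical, and it assumes dplyr is installed as in the step above.

```r
library(dplyr)

# Made-up data, for illustration only
sales <- data.frame(
  store  = c("A", "A", "B", "B"),
  amount = c(10, 20, 30, 50)
)

# Each verb takes a data frame as its first argument and returns a
# data frame, so the steps chain together with the pipe operator
store_summary <- sales %>%
  filter(amount > 10) %>%         # keep rows with amount above 10
  group_by(store) %>%             # one group per store
  summarise(total = sum(amount),  # total amount per store
            n = n())              # number of rows per store

print(store_summary)
```

Because every step has the same shape (data frame in, data frame out), the pipeline reads top to bottom in the order the operations happen.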