• Load R Packages and Data
  • Deal with Problematic Data
  • Deal with Missing Values
  • Centering and Scaling
  • Resolve Skewness
  • Resolve Outliers
  • Deal with Collinearity
  • Deal with Sparse Variables
  • Encode Dummy Variables

This notebook illustrates how to perform standard data pre-processing, an essential step in any data science project.

Load R Packages and Data

# install packages
p_needed <- c('imputeMissings','caret','e1071','psych','car','corrplot','RANN')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]

if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}

lapply(p_needed, require, character.only = TRUE)
# load the simulated dataset and return summary statistics for each column
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
summary(sim.dat)
##       age            gender              income          house          
##  Min.   : 16.00   Length:1000        Min.   : 41776   Length:1000       
##  1st Qu.: 25.00   Class :character   1st Qu.: 85832   Class :character  
##  Median : 36.00   Mode  :character   Median : 93869   Mode  :character  
##  Mean   : 38.84                      Mean   :113543                     
##  3rd Qu.: 53.00                      3rd Qu.:124572                     
##  Max.   :300.00                      Max.   :319704                     
##                                      NA's   :184                        
##    store_exp         online_exp       store_trans     online_trans  
##  Min.   : -500.0   Min.   :  68.82   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:  205.0   1st Qu.: 420.34   1st Qu.: 3.00   1st Qu.: 6.00  
##  Median :  329.0   Median :1941.86   Median : 4.00   Median :14.00  
##  Mean   : 1356.8   Mean   :2120.18   Mean   : 5.35   Mean   :13.55  
##  3rd Qu.:  597.3   3rd Qu.:2440.78   3rd Qu.: 7.00   3rd Qu.:20.00  
##  Max.   :50000.0   Max.   :9479.44   Max.   :20.00   Max.   :36.00  
##                                                                     
##        Q1              Q2              Q3              Q4       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :3.000   Median :1.000   Median :1.000   Median :3.000  
##  Mean   :3.101   Mean   :1.823   Mean   :1.992   Mean   :2.763  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##        Q5              Q6              Q7              Q8       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.:1.000   1st Qu.:2.500   1st Qu.:1.000  
##  Median :4.000   Median :2.000   Median :4.000   Median :2.000  
##  Mean   :2.945   Mean   :2.448   Mean   :3.434   Mean   :2.396  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##        Q9             Q10         segment         
##  Min.   :1.000   Min.   :1.00   Length:1000       
##  1st Qu.:2.000   1st Qu.:1.00   Class :character  
##  Median :4.000   Median :2.00   Mode  :character  
##  Mean   :3.085   Mean   :2.32                     
##  3rd Qu.:4.000   3rd Qu.:3.00                     
##  Max.   :5.000   Max.   :5.00                     
## 

Deal with Problematic Data

Set the problematic values as missing and impute them later.

# set problematic values as missing
sim.dat$age[which(sim.dat$age > 100)] <- NA
sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA
# see the results
summary(subset(sim.dat, select = c("age", "store_exp")))
##       age          store_exp      
##  Min.   :16.00   Min.   :  155.8  
##  1st Qu.:25.00   1st Qu.:  205.1  
##  Median :36.00   Median :  329.8  
##  Mean   :38.58   Mean   : 1358.7  
##  3rd Qu.:53.00   3rd Qu.:  597.4  
##  Max.   :69.00   Max.   :50000.0  
##  NA's   :1       NA's   :1

Deal with Missing Values

Missing values are common in raw data sets. Depending on the mechanism behind the missingness, there are a few different ways to impute missing values.

First, let’s use the impute() function from the imputeMissings package with method = "median/mode".

# save the result as another object 
# call impute() with the imputeMissings:: prefix to make its source explicit
demo_imp <- imputeMissings::impute(sim.dat, method = "median/mode")

# check the first five columns. 
# There are no missing values in other columns
summary(demo_imp[, 1:5])
##       age           gender              income          house          
##  Min.   :16.00   Length:1000        Min.   : 41776   Length:1000       
##  1st Qu.:25.00   Class :character   1st Qu.: 87896   Class :character  
##  Median :36.00   Mode  :character   Median : 93869   Mode  :character  
##  Mean   :38.58                      Mean   :109923                     
##  3rd Qu.:53.00                      3rd Qu.:119456                     
##  Max.   :69.00                      Max.   :319704                     
##    store_exp      
##  Min.   :  155.8  
##  1st Qu.:  205.1  
##  Median :  329.8  
##  Mean   : 1357.7  
##  3rd Qu.:  597.3  
##  Max.   :50000.0

We can also use the preProcess() function from the caret package with method = "medianImpute".

imp <- preProcess(sim.dat, method = "medianImpute")
demo_imp2 <- predict(imp, sim.dat)
summary(demo_imp2[, 1:5])
##       age           gender              income          house          
##  Min.   :16.00   Length:1000        Min.   : 41776   Length:1000       
##  1st Qu.:25.00   Class :character   1st Qu.: 87896   Class :character  
##  Median :36.00   Mode  :character   Median : 93869   Mode  :character  
##  Mean   :38.58                      Mean   :109923                     
##  3rd Qu.:53.00                      3rd Qu.:119456                     
##  Max.   :69.00                      Max.   :319704                     
##    store_exp      
##  Min.   :  155.8  
##  1st Qu.:  205.1  
##  Median :  329.8  
##  Mean   : 1357.7  
##  3rd Qu.:  597.3  
##  Max.   :50000.0

Use preProcess() to conduct KNN imputation:

# note: method = "knnImpute" requires the RANN package (install.packages('RANN'))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)

# need to use predict() to get KNN result
demo_imp <- predict(imp, sim.dat)

# only show summaries for the first three columns
lapply(demo_imp, summary)[1:3]
## $age
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.5910972 -0.9568733 -0.1817107  0.0000156  1.0162678  2.1437770 
## 
## $gender
##    Length     Class      Mode 
##      1000 character character 
## 
## $income
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.43989 -0.53732 -0.37606  0.02389  0.21540  4.13627

Note that the knnImpute method also centers and scales the numeric columns, which is why age and income in the output above are standardized. preProcess() automatically ignores non-numeric columns. When all of the columns in a row are missing, the KNN method fails. For example, the following code returns an error. In this case, we need to identify the rows with all columns missing and remove them from the dataset.

temp <- rbind(sim.dat, rep(NA, ncol(sim.dat)))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
demo_imp <- predict(imp, temp)
Error in FUN(newX[, i], ...) : 
  cannot impute when all predictors are missing in the new data point

There is an error saying “cannot impute when all predictors are missing in the new data point”. It is easy to fix by finding and removing the problematic row(s):

idx <- apply(temp, 1, function(x) sum(is.na(x)))
as.vector(which(idx == ncol(temp)))
## [1] 1001

The output shows that row 1001 is the problematic one. We can go ahead and delete it before imputing, as shown below.
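
Based on the index above, a minimal sketch of dropping the all-missing row and re-running the imputation (assuming temp, idx, and imp from the chunks above):

# keep only the rows that are not entirely missing
temp <- temp[idx != ncol(temp), ]
# impute again; predict() now runs without error
demo_imp <- predict(imp, temp)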

Finally, let us try the “bagImpute” method, which is more time-consuming.

imp <- preProcess(sim.dat, method = "bagImpute")
demo_imp <- predict(imp, sim.dat)
summary(demo_imp[, 1:5])
##       age           gender              income          house          
##  Min.   :16.00   Length:1000        Min.   : 41776   Length:1000       
##  1st Qu.:25.00   Class :character   1st Qu.: 86762   Class :character  
##  Median :36.00   Mode  :character   Median : 94739   Mode  :character  
##  Mean   :38.58                      Mean   :114602                     
##  3rd Qu.:53.00                      3rd Qu.:123726                     
##  Max.   :69.00                      Max.   :319704                     
##    store_exp      
##  Min.   :  155.8  
##  1st Qu.:  205.1  
##  Median :  329.0  
##  Mean   : 1357.6  
##  3rd Qu.:  597.3  
##  Max.   :50000.0

Centering and Scaling

Centering and scaling are the most common data transformations, and they are easy to apply. For example, one way to center and scale is to use the mean and standard deviation of the data, as described below.

income <- sim.dat$income
# calculate the mean of income
mux <- mean(income, na.rm = T)
# calculate the standard deviation of income
sdx <- sd(income, na.rm = T)
# centering
tr1 <- income - mux
# scaling
tr2 <- tr1/sdx

But we can use the preProcess() function in caret directly, as illustrated below for age and income:

sdat <- subset(sim.dat, select = c("age", "income"))
# set the 'method' option
trans <- preProcess(sdat, method = c("center", "scale"))
# use predict() function to get the final result
transformed <- predict(trans, sdat)
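
As a quick sanity check (a small sketch, assuming transformed from the chunk above), the centered and scaled columns should now have mean 0 and standard deviation 1:

# income contains missing values, so drop them before computing the check
round(c(mean = mean(transformed$income, na.rm = TRUE),
        sd = sd(transformed$income, na.rm = TRUE)), 2)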

Resolve Skewness

We first show what left and right skewness look like and then use the Box-Cox procedure to identify a transformation that reduces skewness in the data.

# need skewness() function from e1071 package
set.seed(1000)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
# random sample 1000 chi-square distribution with df=2
# right skew
x1 <- rchisq(1000, 2, ncp = 0)
# get left skew variable x2 from x1
x2 <- max(x1) - x1
plot(density(x2), main = paste("left skew, skewnwss =",
round(skewness(x2), 2)), xlab = "X2")
plot(density(x1), main = paste("right skew, skewness =",
round(skewness(x1), 2)), xlab = "X1")

In the cell below, we first check the skewness of each variable using describe() from the psych package, and then use the preProcess() function in the caret package to find the best Box-Cox transformation for store_trans and online_trans in our simulated dataset.

describe(sim.dat)
##              vars    n      mean       sd   median   trimmed      mad      min
## age             1  999     38.58    14.19    36.00     37.67    16.31    16.00
## gender*         2 1000      1.45     0.50     1.00      1.43     0.00     1.00
## income          3  816 113543.07 49842.29 93868.68 104841.94 28989.47 41775.64
## house*          4 1000      1.57     0.50     2.00      1.59     0.00     1.00
## store_exp       5  999   1358.71  2775.17   329.80    845.14   197.47   155.81
## online_exp      6 1000   2120.18  1731.22  1941.86   1874.51  1015.21    68.82
## store_trans     7 1000      5.35     3.70     4.00      4.89     2.97     1.00
## online_trans    8 1000     13.55     7.96    14.00     13.42    10.38     1.00
## Q1              9 1000      3.10     1.45     3.00      3.13     1.48     1.00
## Q2             10 1000      1.82     1.17     1.00      1.65     0.00     1.00
## Q3             11 1000      1.99     1.40     1.00      1.75     0.00     1.00
## Q4             12 1000      2.76     1.16     3.00      2.83     1.48     1.00
## Q5             13 1000      2.95     1.28     4.00      3.05     0.00     1.00
## Q6             14 1000      2.45     1.44     2.00      2.43     1.48     1.00
## Q7             15 1000      3.43     1.46     4.00      3.54     0.00     1.00
## Q8             16 1000      2.40     1.15     2.00      2.36     1.48     1.00
## Q9             17 1000      3.09     1.12     4.00      3.23     0.00     1.00
## Q10            18 1000      2.32     1.14     2.00      2.27     1.48     1.00
## segment*       19 1000      2.70     1.15     3.00      2.75     1.48     1.00
##                    max     range  skew kurtosis      se
## age              69.00     53.00  0.47    -1.18    0.45
## gender*           2.00      1.00  0.22    -1.95    0.02
## income       319704.34 277928.70  1.69     2.57 1744.83
## house*            2.00      1.00 -0.27    -1.93    0.02
## store_exp     50000.00  49844.19  8.08   115.04   87.80
## online_exp     9479.44   9410.63  1.18     1.31   54.75
## store_trans      20.00     19.00  1.11     0.69    0.12
## online_trans     36.00     35.00  0.03    -0.98    0.25
## Q1                5.00      4.00 -0.12    -1.36    0.05
## Q2                5.00      4.00  1.13    -0.32    0.04
## Q3                5.00      4.00  1.06    -0.40    0.04
## Q4                5.00      4.00 -0.18    -1.46    0.04
## Q5                5.00      4.00 -0.60    -1.40    0.04
## Q6                5.00      4.00  0.11    -1.89    0.05
## Q7                5.00      4.00 -0.90    -0.79    0.05
## Q8                5.00      4.00  0.21    -1.33    0.04
## Q9                5.00      4.00 -0.68    -1.10    0.04
## Q10               5.00      4.00  0.39    -1.23    0.04
## segment*          4.00      3.00 -0.20    -1.41    0.04
# select the two columns and save them as dat_bc
dat_bc <- subset(sim.dat, select = c("store_trans", "online_trans"))
trans <- preProcess(dat_bc, method = c("BoxCox"))

Compare the histogram of the store_trans variable before and after the Box-Cox transformation.

transformed <- predict(trans, dat_bc)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
hist(dat_bc$store_trans, main = "Before Transformation",
xlab = "store_trans")
hist(transformed$store_trans, main = "After Transformation",
xlab = "store_trans")

skewness(transformed$store_trans)
## [1] -0.2154708

We can also use the BoxCoxTrans() function directly to perform the Box-Cox transformation.

trans <- BoxCoxTrans(dat_bc$store_trans)
transformed <- predict(trans, dat_bc$store_trans)
skewness(transformed)
## [1] -0.2154708
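
The fitted BoxCoxTrans object also stores the estimated transformation parameter; a small sketch, assuming trans from the chunk above:

# estimated Box-Cox lambda for store_trans
trans$lambda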

Resolve Outliers

There are formal definitions of outliers, but it is essential to visualize the data to gain some intuition.

# select numerical non-survey data
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
"online_exp", "store_trans", "online_trans"))
# use scatterplotMatrix() function from car package
par(oma = c(2, 2, 1, 2))
car::scatterplotMatrix(sdat, diagonal = TRUE)

In addition to visualization, we can also calculate a modified Z-score for each data point using the mean and the MAD (median absolute deviation). We can then flag points with a modified Z-score greater than 3.5 as outliers.

# calculate the median absolute deviation (MAD) of income
ymad <- mad(na.omit(sdat$income))
# calculate the modified z-score
zs <- (sdat$income - mean(na.omit(sdat$income)))/ymad
# count the number of outliers
sum(na.omit(zs > 3.5))
## [1] 59
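
To inspect the flagged observations rather than just counting them, we can pull out their indices; a small sketch using the zs vector computed above:

# indices of observations with modified Z-score above 3.5
out_idx <- which(zs > 3.5)
# look at a few of the corresponding income values
head(sdat$income[out_idx])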

For models sensitive to outliers, we can use the spatial sign transformation to minimize their impact, as described below.

# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")

Deal with Collinearity

Collinearity exists when two or more variables are highly correlated. We can start by visualizing the correlations between variables.

# select non-survey numerical variables
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
"online_exp", "store_trans", "online_trans"))

# use bagging imputation here
imp <- preProcess(sdat, method = "bagImpute")
sdat <- predict(imp, sdat)

# get the correlation matrix
correlation <- cor(sdat)

# plot
par(oma = c(2, 2, 2, 2))
corrplot.mixed(correlation, order = "hclust", tl.pos = "lt",
upper = "ellipse")

Once we have the correlation matrix, we can flag highly correlated variables using a cutoff threshold and remove them to reduce potential collinearity.

highCorr <- findCorrelation(cor(sdat), cutoff = 0.7)

# delete highly correlated columns
sdat <- sdat[-highCorr]

# check the new correlation matrix
cor(sdat)
##                  income  store_exp online_exp online_trans
## income        1.0000000  0.6003472  0.5203073   -0.3567879
## store_exp     0.6003472  1.0000000  0.5349530   -0.1367383
## online_exp    0.5203073  0.5349530  1.0000000    0.2256370
## online_trans -0.3567879 -0.1367383  0.2256370    1.0000000
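
As a quick check (a small sketch using the reduced sdat above), findCorrelation() should no longer flag any columns at the same cutoff:

# returns an empty integer vector when no pair exceeds the cutoff
findCorrelation(cor(sdat), cutoff = 0.7)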

Deal with Sparse Variables

In addition to highly correlated predictors, predictors with degenerate distributions can cause problems. Removing those variables can significantly improve some models’ performance and stability. One extreme example is a variable with a single value, which is called a zero-variance variable. Variables whose unique values occur with very low frequency are near-zero-variance predictors.

# make a copy
zero_demo <- sim.dat
# add two sparse variables:
# zero1 has only one unique value;
# zero2 has 1 as its first element and 0 for the rest
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))

We can use the nearZeroVar() function in the caret package to identify these sparse variables.

nearZeroVar(zero_demo,freqCut = 95/5, uniqueCut = 10)
## [1] 20 21
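
The returned indices point to the two artificial columns, so they can be dropped directly; a minimal sketch:

# drop the near-zero-variance columns by index
nzv_idx <- nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)
zero_demo_clean <- zero_demo[, -nzv_idx]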

Encode Dummy Variables

For categorical variables, dummy encoding is a necessary step before fitting many models. It converts one categorical column into multiple columns of 0s and 1s. For a single categorical variable, we can use the class.ind() function in the nnet package, but it is more convenient to use the dummyVars() function in the caret package, which can be applied to a whole data frame.

dumVar <- nnet::class.ind(sim.dat$gender)
head(dumVar)
##      Female Male
## [1,]      1    0
## [2,]      1    0
## [3,]      0    1
## [4,]      0    1
## [5,]      0    1
## [6,]      0    1
# use "origional variable name + level" as new name
dumMod <- dummyVars(~gender + house + income,
data = sim.dat,
levelsOnly = F)
head(predict(dumMod, sim.dat))
##   genderFemale genderMale houseNo houseYes   income
## 1            1          0       0        1 120963.4
## 2            1          0       0        1 122008.1
## 3            0          1       0        1 114202.3
## 4            0          1       0        1 113616.3
## 5            0          1       0        1 124252.6
## 6            0          1       0        1 107661.5

The function can also create interaction terms:

dumMod <- dummyVars(~gender + house + income + income:gender,
data = sim.dat,
levelsOnly = F)
head(predict(dumMod, sim.dat))
##   genderFemale genderMale houseNo houseYes   income genderFemale:income
## 1            1          0       0        1 120963.4            120963.4
## 2            1          0       0        1 122008.1            122008.1
## 3            0          1       0        1 114202.3                 0.0
## 4            0          1       0        1 113616.3                 0.0
## 5            0          1       0        1 124252.6                 0.0
## 6            0          1       0        1 107661.5                 0.0
##   genderMale:income
## 1               0.0
## 2               0.0
## 3          114202.3
## 4          113616.3
## 5          124252.6
## 6          107661.5
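
Keeping every level of a factor, as above, produces dummy columns that sum to one, which is redundant for models with an intercept. dummyVars() can drop one level per factor via fullRank = TRUE; a short sketch:

# fullRank = TRUE keeps k - 1 dummy columns for a factor with k levels
dumModFR <- dummyVars(~gender + house + income,
                      data = sim.dat,
                      fullRank = TRUE)
head(predict(dumModFR, sim.dat))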