This notebook illustrates how to perform standard data pre-processing, an essential step for any data science project.
# install any missing packages, then load them all
p_needed <- c('imputeMissings','caret','e1071','psych','car','corrplot','RANN')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)
# load the simulated dataset and return summary statistics for each column
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
summary(sim.dat)
## age gender income house
## Min. : 16.00 Length:1000 Min. : 41776 Length:1000
## 1st Qu.: 25.00 Class :character 1st Qu.: 85832 Class :character
## Median : 36.00 Mode :character Median : 93869 Mode :character
## Mean : 38.84 Mean :113543
## 3rd Qu.: 53.00 3rd Qu.:124572
## Max. :300.00 Max. :319704
## NA's :184
## store_exp online_exp store_trans online_trans
## Min. : -500.0 Min. : 68.82 Min. : 1.00 Min. : 1.00
## 1st Qu.: 205.0 1st Qu.: 420.34 1st Qu.: 3.00 1st Qu.: 6.00
## Median : 329.0 Median :1941.86 Median : 4.00 Median :14.00
## Mean : 1356.8 Mean :2120.18 Mean : 5.35 Mean :13.55
## 3rd Qu.: 597.3 3rd Qu.:2440.78 3rd Qu.: 7.00 3rd Qu.:20.00
## Max. :50000.0 Max. :9479.44 Max. :20.00 Max. :36.00
##
## Q1 Q2 Q3 Q4
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :1.000 Median :3.000
## Mean :3.101 Mean :1.823 Mean :1.992 Mean :2.763
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## Q5 Q6 Q7 Q8
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.750 1st Qu.:1.000 1st Qu.:2.500 1st Qu.:1.000
## Median :4.000 Median :2.000 Median :4.000 Median :2.000
## Mean :2.945 Mean :2.448 Mean :3.434 Mean :2.396
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## Q9 Q10 segment
## Min. :1.000 Min. :1.00 Length:1000
## 1st Qu.:2.000 1st Qu.:1.00 Class :character
## Median :4.000 Median :2.00 Mode :character
## Mean :3.085 Mean :2.32
## 3rd Qu.:4.000 3rd Qu.:3.00
## Max. :5.000 Max. :5.00
##
The summary reveals a few problematic values: the maximum age is 300, store_exp has negative values, and income has 184 missing values. Set the problematic values to missing (NA) and impute them later.
# set problematic values to NA
sim.dat$age[which(sim.dat$age > 100)] <- NA
sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA
# see the results
summary(subset(sim.dat, select = c("age", "store_exp")))
## age store_exp
## Min. :16.00 Min. : 155.8
## 1st Qu.:25.00 1st Qu.: 205.1
## Median :36.00 Median : 329.8
## Mean :38.58 Mean : 1358.7
## 3rd Qu.:53.00 3rd Qu.: 597.4
## Max. :69.00 Max. :50000.0
## NA's :1 NA's :1
Missing values are common in raw data. Depending on the mechanism behind the missingness, there are several ways to impute them. First, let's use the impute() function from the imputeMissings package with method = "median/mode".
# save the result as another object; call the function with the
# imputeMissings:: prefix in case impute() is masked by another package
demo_imp <- imputeMissings::impute(sim.dat, method = "median/mode")
# check the first five columns;
# there are no missing values in the other columns
summary(demo_imp[, 1:5])
## age gender income house
## Min. :16.00 Length:1000 Min. : 41776 Length:1000
## 1st Qu.:25.00 Class :character 1st Qu.: 87896 Class :character
## Median :36.00 Mode :character Median : 93869 Mode :character
## Mean :38.58 Mean :109923
## 3rd Qu.:53.00 3rd Qu.:119456
## Max. :69.00 Max. :319704
## store_exp
## Min. : 155.8
## 1st Qu.: 205.1
## Median : 329.8
## Mean : 1357.7
## 3rd Qu.: 597.3
## Max. :50000.0
We can also use the preProcess() function from the caret package with method = "medianImpute".
imp <- preProcess(sim.dat, method = "medianImpute")
demo_imp2 <- predict(imp, sim.dat)
summary(demo_imp2[, 1:5])
## age gender income house
## Min. :16.00 Length:1000 Min. : 41776 Length:1000
## 1st Qu.:25.00 Class :character 1st Qu.: 87896 Class :character
## Median :36.00 Mode :character Median : 93869 Mode :character
## Mean :38.58 Mean :109923
## 3rd Qu.:53.00 3rd Qu.:119456
## Max. :69.00 Max. :319704
## store_exp
## Min. : 155.8
## 1st Qu.: 205.1
## Median : 329.8
## Mean : 1357.7
## 3rd Qu.: 597.3
## Max. :50000.0
Use preProcess() to conduct KNN imputation:
# note: method = "knnImpute" requires the RANN package
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
# need to use predict() to get KNN result
demo_imp <- predict(imp, sim.dat)
# show summaries for the first three columns only
lapply(demo_imp, summary)[1:3]
## $age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5910972 -0.9568733 -0.1817107 0.0000156 1.0162678 2.1437770
##
## $gender
## Length Class Mode
## 1000 character character
##
## $income
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.43989 -0.53732 -0.37606 0.02389 0.21540 4.13627
The preProcess() function automatically ignores non-numeric columns. However, when all the columns of a row are missing, the KNN method fails. For example, the following code returns an error. In that case, we can identify the rows with all columns missing and remove them from the dataset.
# append a row that is entirely NA to trigger the error
temp <- rbind(sim.dat, rep(NA, ncol(sim.dat)))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
demo_imp <- predict(imp, temp)
Error in FUN(newX[, i], ...) :
cannot impute when all predictors are missing in the new data point
The error says "cannot impute when all predictors are missing in the new data point". It is easy to fix by finding and removing the problematic row(s):
# count the number of missing values in each row
idx <- apply(temp, 1, function(x) sum(is.na(x)))
# rows where every column is missing
as.vector(which(idx == ncol(temp)))
## [1] 1001
It shows that row 1001 is problematic. We can go ahead and delete it.
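A minimal sketch of that fix, reusing temp, idx, and imp from above:
# keep only rows with at least one observed value, then rerun the KNN imputation
temp <- temp[idx != ncol(temp), ]
demo_imp <- predict(imp, temp)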
Finally, let us try the "bagImpute" method, which fits a bagged tree model for each predictor and is therefore more time-consuming.
imp <- preProcess(sim.dat, method = "bagImpute")
demo_imp <- predict(imp, sim.dat)
summary(demo_imp[, 1:5])
## age gender income house
## Min. :16.00 Length:1000 Min. : 41776 Length:1000
## 1st Qu.:25.00 Class :character 1st Qu.: 86762 Class :character
## Median :36.00 Mode :character Median : 94739 Mode :character
## Mean :38.58 Mean :114602
## 3rd Qu.:53.00 3rd Qu.:123726
## Max. :69.00 Max. :319704
## store_exp
## Min. : 155.8
## 1st Qu.: 205.1
## Median : 329.0
## Mean : 1357.6
## 3rd Qu.: 597.3
## Max. :50000.0
Centering and scaling are the most common data transformations, and they are easy to apply. For example, one way to center and scale is to use the mean and standard deviation of the data, as shown below.
income <- sim.dat$income
# calculate the mean of income
mux <- mean(income, na.rm = T)
# calculate the standard deviation of income
sdx <- sd(income, na.rm = T)
# centering
tr1 <- income - mux
# scaling
tr2 <- tr1/sdx
But we can use the preProcess() function in caret directly, as illustrated below for the age and income variables:
sdat <- subset(sim.dat, select = c("age", "income"))
# set the 'method' option
trans <- preProcess(sdat, method = c("center", "scale"))
# use predict() function to get the final result
transformed <- predict(trans, sdat)
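As a quick sanity check (a sketch reusing tr2 and transformed from above), the manual result should match the preProcess() output up to floating-point error:
# compare the hand-rolled standardization with caret's result
all.equal(tr2, transformed$income)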
We first show what left and right skewness look like, and then describe how to use the Box-Cox procedure to identify a transformation that reduces skewness in the data.
# need skewness() function from e1071 package
set.seed(1000)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
# random sample 1000 chi-square distribution with df=2
# right skew
x1 <- rchisq(1000, 2, ncp = 0)
# get left skew variable x2 from x1
x2 <- max(x1) - x1
plot(density(x2), main = paste("left skew, skewness =",
round(skewness(x2), 2)), xlab = "X2")
plot(density(x1), main = paste("right skew, skewness =",
round(skewness(x1), 2)), xlab = "X1")
In the cell below, we use the preProcess() function in the caret package to find the best Box-Cox transformation for store_trans and online_trans in our simulated dataset. First, let's check the skewness of each variable.
# describe() from the psych package reports skewness for each column
describe(sim.dat)
## vars n mean sd median trimmed mad min
## age 1 999 38.58 14.19 36.00 37.67 16.31 16.00
## gender* 2 1000 1.45 0.50 1.00 1.43 0.00 1.00
## income 3 816 113543.07 49842.29 93868.68 104841.94 28989.47 41775.64
## house* 4 1000 1.57 0.50 2.00 1.59 0.00 1.00
## store_exp 5 999 1358.71 2775.17 329.80 845.14 197.47 155.81
## online_exp 6 1000 2120.18 1731.22 1941.86 1874.51 1015.21 68.82
## store_trans 7 1000 5.35 3.70 4.00 4.89 2.97 1.00
## online_trans 8 1000 13.55 7.96 14.00 13.42 10.38 1.00
## Q1 9 1000 3.10 1.45 3.00 3.13 1.48 1.00
## Q2 10 1000 1.82 1.17 1.00 1.65 0.00 1.00
## Q3 11 1000 1.99 1.40 1.00 1.75 0.00 1.00
## Q4 12 1000 2.76 1.16 3.00 2.83 1.48 1.00
## Q5 13 1000 2.95 1.28 4.00 3.05 0.00 1.00
## Q6 14 1000 2.45 1.44 2.00 2.43 1.48 1.00
## Q7 15 1000 3.43 1.46 4.00 3.54 0.00 1.00
## Q8 16 1000 2.40 1.15 2.00 2.36 1.48 1.00
## Q9 17 1000 3.09 1.12 4.00 3.23 0.00 1.00
## Q10 18 1000 2.32 1.14 2.00 2.27 1.48 1.00
## segment* 19 1000 2.70 1.15 3.00 2.75 1.48 1.00
## max range skew kurtosis se
## age 69.00 53.00 0.47 -1.18 0.45
## gender* 2.00 1.00 0.22 -1.95 0.02
## income 319704.34 277928.70 1.69 2.57 1744.83
## house* 2.00 1.00 -0.27 -1.93 0.02
## store_exp 50000.00 49844.19 8.08 115.04 87.80
## online_exp 9479.44 9410.63 1.18 1.31 54.75
## store_trans 20.00 19.00 1.11 0.69 0.12
## online_trans 36.00 35.00 0.03 -0.98 0.25
## Q1 5.00 4.00 -0.12 -1.36 0.05
## Q2 5.00 4.00 1.13 -0.32 0.04
## Q3 5.00 4.00 1.06 -0.40 0.04
## Q4 5.00 4.00 -0.18 -1.46 0.04
## Q5 5.00 4.00 -0.60 -1.40 0.04
## Q6 5.00 4.00 0.11 -1.89 0.05
## Q7 5.00 4.00 -0.90 -0.79 0.05
## Q8 5.00 4.00 0.21 -1.33 0.04
## Q9 5.00 4.00 -0.68 -1.10 0.04
## Q10 5.00 4.00 0.39 -1.23 0.04
## segment* 4.00 3.00 -0.20 -1.41 0.04
# select the two columns and save them as dat_bc
dat_bc <- subset(sim.dat, select = c("store_trans", "online_trans"))
trans <- preProcess(dat_bc, method = c("BoxCox"))
Compare the histogram of the store_trans variable before and after the Box-Cox transformation.
transformed <- predict(trans, dat_bc)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
hist(dat_bc$store_trans, main = "Before Transformation",
xlab = "store_trans")
hist(transformed$store_trans, main = "After Transformation",
xlab = "store_trans")
skewness(transformed$store_trans)
## [1] -0.2154708
We can also use the BoxCoxTrans() function directly to perform the Box-Cox transformation on a single vector.
trans <- BoxCoxTrans(dat_bc$store_trans)
transformed <- predict(trans, dat_bc$store_trans)
skewness(transformed)
## [1] -0.2154708
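To see the lambda that BoxCoxTrans() estimated, print the fitted object; caret's print method reports the estimated transformation parameter:
# inspect the fitted Box-Cox transformation, including the estimated lambda
trans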
There are formal definitions of outliers, but it is essential to visualize the data to gain some intuition.
# select numerical non-survey data
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
"online_exp", "store_trans", "online_trans"))
# use scatterplotMatrix() function from car package
par(oma = c(2, 2, 1, 2))
car::scatterplotMatrix(sdat, diagonal = TRUE)
In addition to visualization, we can calculate a modified Z-score for each data point using the mean and the MAD (median absolute deviation). We can then define outliers as points with a modified Z-score greater than 3.5.
# calculate the median absolute deviation (MAD) of income
ymad <- mad(na.omit(sdat$income))
# calculate the modified Z-score
zs <- (sdat$income - mean(na.omit(sdat$income)))/ymad
# count the number of outliers
sum(na.omit(zs > 3.5))
## [1] 59
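To inspect the flagged observations rather than just count them, we can pull their indices (a sketch reusing zs and sdat from above; idx_out is an illustrative name):
# indices of the flagged incomes and a summary of their values
idx_out <- which(zs > 3.5)
summary(sdat$income[idx_out])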
For models sensitive to outliers, we can use the spatial sign transformation to minimize the impact of outliers, as described below.
# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")
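The spatial sign projects every observation onto the unit circle, so each transformed row should have Euclidean norm 1. A quick check, assuming transformed from above:
# every row norm should be (essentially) 1
summary(sqrt(rowSums(transformed^2)))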
Collinearity exists when two or more variables are highly correlated. We can visualize the correlation between variables.
# select non-survey numerical variables
sdat <- subset(sim.dat, select = c("age", "income", "store_exp",
"online_exp", "store_trans", "online_trans"))
# use bagging imputation here
imp <- preProcess(sdat, method = "bagImpute")
sdat <- predict(imp, sdat)
# get the correlation matrix
correlation <- cor(sdat)
# plot
par(oma = c(2, 2, 2, 2))
corrplot.mixed(correlation, order = "hclust", tl.pos = "lt",
upper = "ellipse")
Once we have the correlation between variables, we can flag highly correlated variables using a cutoff threshold and remove them to reduce potential collinearity.
highCorr <- findCorrelation(cor(sdat), cutoff = 0.7)
# delete highly correlated columns
sdat <- sdat[-highCorr]
# check the new correlation matrix
cor(sdat)
## income store_exp online_exp online_trans
## income 1.0000000 0.6003472 0.5203073 -0.3567879
## store_exp 0.6003472 1.0000000 0.5349530 -0.1367383
## online_exp 0.5203073 0.5349530 1.0000000 0.2256370
## online_trans -0.3567879 -0.1367383 0.2256370 1.0000000
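To see the names of the flagged columns rather than their indices, findCorrelation() can return names directly (a sketch, assuming your caret version supports the names argument):
# rerun on the correlation matrix computed earlier
findCorrelation(correlation, cutoff = 0.7, names = TRUE)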
Besides highly correlated predictors, predictors with degenerate distributions can also cause problems. Removing those variables can significantly improve some models' performance and stability. One extreme example is a variable with a single unique value, which is called a zero-variance variable. Variables with only a handful of unique values that occur with very low frequency are near-zero-variance predictors.
# make a copy
zero_demo <- sim.dat
# add two sparse variables: zero1 has only one unique value;
# zero2 has 1 as its first element and 0s elsewhere
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))
We can use the nearZeroVar() function in the caret package to identify these sparse variables.
nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)
## [1] 20 21
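The returned indices (columns 20 and 21, i.e., zero1 and zero2) can then be dropped directly. A minimal sketch:
# remove the near-zero-variance columns identified above
zero_demo <- zero_demo[, -nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)]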
For categorical variables, dummy encoding is a necessary step before fitting many models. It converts one column of a categorical variable into multiple columns of 0s and 1s. For a single categorical variable, we can use the class.ind() function in the nnet package. But it is more convenient to use the dummyVars() function in the caret package, which can be applied to a data frame.
dumVar <- nnet::class.ind(sim.dat$gender)
head(dumVar)
## Female Male
## [1,] 1 0
## [2,] 1 0
## [3,] 0 1
## [4,] 0 1
## [5,] 0 1
## [6,] 0 1
# use "origional variable name + level" as new name
dumMod <- dummyVars(~gender + house + income,
data = sim.dat,
levelsOnly = F)
head(predict(dumMod, sim.dat))
## genderFemale genderMale houseNo houseYes income
## 1 1 0 0 1 120963.4
## 2 1 0 0 1 122008.1
## 3 0 1 0 1 114202.3
## 4 0 1 0 1 113616.3
## 5 0 1 0 1 124252.6
## 6 0 1 0 1 107661.5
The function can also create interaction terms:
dumMod <- dummyVars(~gender + house + income + income:gender,
data = sim.dat,
levelsOnly = F)
head(predict(dumMod, sim.dat))
## genderFemale genderMale houseNo houseYes income genderFemale:income
## 1 1 0 0 1 120963.4 120963.4
## 2 1 0 0 1 122008.1 122008.1
## 3 0 1 0 1 114202.3 0.0
## 4 0 1 0 1 113616.3 0.0
## 5 0 1 0 1 124252.6 0.0
## 6 0 1 0 1 107661.5 0.0
## genderMale:income
## 1 0.0
## 2 0.0
## 3 114202.3
## 4 113616.3
## 5 124252.6
## 6 107661.5
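If a model requires a full-rank design matrix (one dummy column fewer per factor, to avoid perfect collinearity), dummyVars() supports the fullRank argument; a sketch, assuming the standard caret signature:
dumMod <- dummyVars(~gender + house,
                    data = sim.dat,
                    fullRank = TRUE)
head(predict(dumMod, sim.dat))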