Home[主页] Blog [博客] Resume [简历]

0.1 Start

There are a number of reasons a predictive model falls [“Applied Predictive Modeling”], such as:

• Unjustified extrapolation
• Over-fitting

In this blog post, I am going to summarize some common data pre-processing approaches.

0.2Centering and Scaling

It is the most straightforward data transformation. It centers and scales a variable to mean 0 and standard deviation 1. It ensures that the criterion for finding linear combinations of the predictors is based on how much variation they explain and therefore improves the numerical stability. Models involving finding linear combinations of the predictors to explain response/predictors variation need data centering and scaling, such as PCA and PLS. You can easily writing code yourself to conduct this transformation. Or the function preProcess() in package caret can apply this transformation to a set of predictors.

#install packages needed
library(caret)
library(e1071)
library(gridExtra)
library(lattice)
library(imputeMissings)
library(RANN)
library(corrplot)
library(nnet)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
trans<-preProcess(cars,method=c("center","scale"))
transformed<-predict(trans,cars)
par(mfrow=c(1,2))
hist(cars$dist,main="Original",xlab="dist") hist(transformed$dist,main="Centered and Scaled",xlab="dist")

Sometimes you only need to scale the variable. For example, if the model adds penalty to the parameter estimates (such as $$L_2$$ penalty is ridge regression and $$L_1$$ penalty in LASSO), the variables need to have similar scale to ensure a fair variable selection. I am heavy user of this kind of penalty-based model in my work and I used the following quantile transformation:

$x_{ij}^{*}=\frac{x_{ij}-quantile(x_{.j},0.01)}{quantile(x_{.j}-0.99)-quantile(x_{-j},0.01)}$

The reason to use 99% and 1% quantile instead of maximum and minimum values is to resist the impact of outliers.

It is easy to write a function to do it:

qscale<-function(dat){
for (i in 1:ncol(dat)){
up<-quantile(dat[,i],0.99)
low<-quantile(dat[,i],0.01)
diff<-up-low
dat[,i]<-(dat[,i]-low)/diff
}
return(dat)
}

In order to illustrate, let’s simulate a data set with three variables: income, age and education.

set.seed(2015)
income<-sample(seq(50000,150000,by=500),95)
age<-income/2000-10
noise<-round(runif(95)*10,0)
age<-age+noise
income<-c(income,10000,15000,300000,250000,230000)
age<-c(age,30,20,25,35,95)
demo<-data.frame(income,age)
demo$education<-as.factor(sample(c("High School","Bachelor","Master","Doctor"),100,replace = T,prob =c(0.7,0.15,0.12,0.03) )) summary(demo[,c("income","age")]) ## income age ## Min. : 10000 Min. :20.0 ## 1st Qu.: 76375 1st Qu.:30.2 ## Median : 98750 Median :44.2 ## Mean :103480 Mean :44.9 ## 3rd Qu.:126375 3rd Qu.:56.9 ## Max. :300000 Max. :95.0 It is clear that income and age are not on the same scale. Now apply the function qscale() on the simulated data demo. transformed<-qscale(demo[,c("income","age")]) summary(transformed) ## income age ## Min. :-0.021 Min. :-0.019 ## 1st Qu.: 0.261 1st Qu.: 0.178 ## Median : 0.356 Median : 0.448 ## Mean : 0.376 Mean : 0.460 ## 3rd Qu.: 0.473 3rd Qu.: 0.690 ## Max. : 1.210 Max. : 1.424 0.3Resolve Skewness Skewness is defined to be the third standardized central moment. The formula for the sample skewness statistics is: $skewness=\frac{\sum(x_{i}+\bar{x})^{3}}{(n-1)v^{3/2}}$ $v=\frac{\sum(x_{i}=\bar{x})^{2}}{(n-1)}$ Skewness=0 means that the destribution is symmetric, i.e. the probability of falling on either side of the distribution’s mean is equal. You can easily tell if a distribution is skewed by simple visualization. There are different ways may help to remove skewness such as log, square root or inverse. However it is often difficult to determine from plots which transformation is most appropriate for correcting skewness. The Box-Cox procedure automatically identified a transformation from the family of power transformations that are indexed by a parameter $$\lambda$$. $x^{*}=\begin{cases} \begin{array}{c} \frac{x^{\lambda}-1}{\lambda}\\ log(x) \end{array} & \begin{array}{c} if\ \lambda\neq0\\ if\ \lambda=0 \end{array}\end{cases}$ It is easy to see that this family includes log transformation ($$\lambda=0$$), square transformation ($$\lambda=2$$), square root ($$\lambda=0.5$$), inverse ($$\lambda=-1$$) and others in-between. We can still use function preProcess() in package caret to apply this transformation by chaning the method argument. (trans<-preProcess(cars,method=c("BoxCox"))) ## ## Call: ## preProcess.default(x = cars, method = c("BoxCox")) ## ## Created from 50 samples and 2 variables ## Pre-processing: Box-Cox transformation ## ## Lambda estimates for Box-Cox transformation: ## 1, 0.5 The output shows the sample size (50), number of variables (2) and the $$\lambda$$ estimates for each variable. After calling the preProcess() function, the predict() method applies the results to a data frame. transformed<-predict(trans,cars) par(mfrow=c(1,2)) hist(cars$dist,main="Original",xlab="dist")
hist(transformed$dist,main="After BoxCox Transformation",xlab="dist") An alternative is to use function BoxCoxTrans() in package caret. (trans<-BoxCoxTrans(cars$dist))
## Box-Cox Transformation
##
## 50 data points used to estimate Lambda
##
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       2      26      36      43      56     120
##
## Largest/Smallest: 60
## Sample Skewness: 0.759
##
## Estimated Lambda: 0.5
transformed<-predict(trans,cars$dist) skewness(transformed) ## [1] -0.01903 The estimated $$\lambda$$ is the same 0.5. Original skewness is 0.759 and after transformation, the skewness is -0.01902765 which is close to 0. You can use function skewness() in package e1071 to get the skewness statistics. 0.4Resolve Outliers Even under certain assumptions we can statistically define outliers, it can be hard to define in some situations. You can refer to “Detection of Outliers” for more information. Some models are resistant to outliers (such as tree-based model and support vector machine). If a model is sensitive to outliers (such as linear regression and logistic regression), we can use spatial sign transformation to minimize the problem. It projects the original sample points to the surface of a sphere by: $x_{ij}^{*}=\frac{x_{ij}}{\sqrt{\sum_{j=1}^{p}x_{ij}^{2}}}$ As noted in the book “Applied Predictive Modeling”, Since the denominator is intended to measure the squared distance to the center of the predictor’s distribution, it is important to center and scale the predictor data prior to using this transformation. Note that, unlike centering or scaling, this manipulation of the predictors transforms them as a group. We can use spatialSign() function in caret to conduct spatial sign on demo: trans<-preProcess(demo[,c("income","age")],method=c("center","scale")) transformed<-predict(trans,demo[,c("income","age")]) transformed2 <- spatialSign(transformed) transformed2 <- as.data.frame(transformed2) p1<-xyplot(income ~ age, data = transformed, main="Original") p2<-xyplot(income ~ age, data = transformed2, main="After Spatial Sign") grid.arrange(p1,p2, ncol=2) 0.5Missing Values We need a book to fully explicate this topic. Before we decide how to handle missing value, it is important to understand why the values are missing. Do the missing values have information related outcomes? Or are they missing at random? It is not the goal here to illustrate which methods to use in different missing situation. You can refer to Section 3.4 of “Applied Predictive Modeling” for more discussion on that. The objective of this post is to introduce some imputation methods and corresponding application examples using R. Survey statistics has studied the imputation extensively which focuses on making valid inferences. Missing value imputation in predictive modeling is a different problem. Saar-Tsechansky and Provost compared several different methods for applying classification to instance with missing values. “Handling Missing Values when Applying Classification Models The following code randomly assigns some missing values to the previous data demo and names the new data set demo_missing. set.seed(100) id1<-sample(1:nrow(demo),15) id2<-sample(1:nrow(demo),10) id3<-sample(1:nrow(demo),10) demo_missing<-demo demo_missing$age[id1]<-NA
demo_missing$income[id2]<-NA demo_missing$education[id3]<-NA
summary(demo_missing)
##      income            age             education
##  Min.   : 15000   Min.   :20.0   Bachelor   :13
##  1st Qu.: 77125   1st Qu.:30.2   Doctor     : 2
##  Median : 98750   Median :44.2   High School:70
##  Mean   :102811   Mean   :44.4   Master     : 5
##  3rd Qu.:125250   3rd Qu.:56.2   NA's       :10
##  Max.   :300000   Max.   :95.0
##  NA's   :10       NA's   :15

0.5.1 Impute missing values with median/mode

You can use function impute() under package imputeMissings to impute missing values with mdedian/mode. This method is simple, fast but treats each predictor independently, and may not be accurate.

demo_imp<-impute(demo_missing,method="median/mode")
summary(demo_imp)
##      income            age             education
##  Min.   : 15000   Min.   :20.0   Bachelor   :13
##  1st Qu.: 79250   1st Qu.:32.2   Doctor     : 2
##  Median : 98750   Median :44.2   High School:80
##  Mean   :102405   Mean   :44.4   Master     : 5
##  3rd Qu.:122875   3rd Qu.:54.5
##  Max.   :300000   Max.   :95.0

Note that the median/mode method imputes mode to character vectors and median to numeric and integer vectors.So you can see the 10 missing values for variable “education” are imputed with “High School” since it is the mode.

You can also use function ‘preProcess()’ to attain this.But it only works for numeric variable.

imp<-preProcess(demo_missing[,c("income","age")],method="medianImpute")
demo_imp<-predict(imp,demo_missing[,c("income","age")])
summary(demo_imp)
##      income            age
##  Min.   : 15000   Min.   :20.0
##  1st Qu.: 79250   1st Qu.:32.2
##  Median : 98500   Median :44.2
##  Mean   :102380   Mean   :44.4
##  3rd Qu.:122875   3rd Qu.:54.5
##  Max.   :300000   Max.   :95.0

0.5.2 Impute missing values based on K-nearest neighbors

k-nearest neighbor will find the k closest samples (Euclidian distance) in the training set and impute the mean of those “neighbors”.

imp<-preProcess(demo_missing[,c("income","age")],method="knnImpute",k=2)
demo_imp<-predict(imp,demo_missing[,c("income","age")])
## Error: cannot impute when all predictors are missing in the new data point

Now we get an error saying “cannot impute when all predictors are missing in the new data point”. It is because there is at least one sample with both “income” and “age” missing. We can delete the corresponding row and do it again.

idx<-which(is.na(demo_missing$income)&is.na(demo_missing$age))

imp<-preProcess(demo_missing[-idx,c("income","age")],method="knnImpute",k=2)
demo_imp<-predict(imp,demo_missing[-idx,c("income","age")])
summary(demo_imp)
##      income            age
##  Min.   :-2.260   Min.   :-1.538
##  1st Qu.:-0.687   1st Qu.:-0.893
##  Median :-0.105   Median :-0.011
##  Mean   :-0.006   Mean   : 0.016
##  3rd Qu.: 0.594   3rd Qu.: 0.768
##  Max.   : 5.074   Max.   : 3.183

The error doesn’t show up this time. This method considers all predictors together but it requires them to be in the same scale since the “euclidian distance” is used to find the neighbours.

0.6Collinearity

It is probably a technical term that many un-technical people also know. There is an excellent function in corrplot package with the same name corrplot() that can visualize correlation structure of a set of predictors. The function has option to reorder the variables in a way that reveals clusters of highly correlated ones. We add some columns to demo that are correlated.

adddemo<-demo[,-3]
adddemo$added1<-sqrt(demo$age)+10
adddemo$added2<-log(demo$income)+demo$age adddemo$added2<-log(demo$age) adddemo$added4<-demo$income/1000+5*demo$age
adddemo$added5<-sin(demo$age)

The following command will produce visualization for the correlation matrix of adddemo.

corrplot(cor(adddemo),order="hclust")

The size and color of the points are associated with the strength of corresponding correlation. Section 3.5 of “Applied Predictive Modeling” presents a heuristic algorithm to remove minium number of predicitors to ensure all pairwise corelations are below a certain threshold:

1. Calculate the correlation matrix of the predictors.
2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
3. Determine the average correlation between A and the other variables. Do the same for predictor B.
4. If A has a larger average correlation, remove it; otherwise, remove predictor B.
5. Repeat Step 2-4 until no absolute correlations are above the threshold.

The findCorrelation() function in package caret will apaply the above algorithm.

(highCorr<-findCorrelation(cor(adddemo),cutoff=.75))
## [1] 5 2 3
# remove columns with high correlations
# correlation matrix for filtered data
corrplot(cor(filter_demo),order="hclust")

0.7Sparse Variables

Other than the highly related predictors, predictors with degenerate distributions need to be removed as well. Removing those variables can significantly improve some models’ performance and/or stability (such as linear regression and logistic regression but tree based model is impervious to this type of predictors). One extreme example is a variable with single value which is called zero-variance variable.

Similarly those variables with very low frequency of unique values are near-zero variance predictors. How to detect those variables? There are two rules:

• The fraction of unique values over the sample size
• The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value
• The caret package funciton nearZeroVar() can filter near-zero variance predictors.

#add two variables with low variance
zero_demo<-demo
zero_demo$zero1<-rep(0,nrow(demo)) zero_demo$zero2<-c(1,rep(0,nrow(demo)-1))
# zero1 only has one unique value
# zero2 is a vector with the first element 1 and the rest are 0s
summary(zero_demo)
##      income            age             education      zero1
##  Min.   : 10000   Min.   :20.0   Bachelor   :15   Min.   :0
##  1st Qu.: 76375   1st Qu.:30.2   Doctor     : 2   1st Qu.:0
##  Median : 98750   Median :44.2   High School:77   Median :0
##  Mean   :103480   Mean   :44.9   Master     : 6   Mean   :0
##  3rd Qu.:126375   3rd Qu.:56.9                    3rd Qu.:0
##  Max.   :300000   Max.   :95.0                    Max.   :0
##      zero2
##  Min.   :0.00
##  1st Qu.:0.00
##  Median :0.00
##  Mean   :0.01
##  3rd Qu.:0.00
##  Max.   :1.00
# the function will return a vector of integers indicating which columns to remove
nearZeroVar(zero_demo,freqCut = 95/5, uniqueCut = 10)
## [1] 4 5

Note the two arguments in the function freqCut = and uniqueCut =. They are corresponding to the previous two rules.

• freqCut: the cutoff for the ratio of the most common value to the second most common value

• uniqueCut:the cutoff for the percentage of distinct values out of the number of total samples

0.8Re-encode Dummy Variables

Sometimes we need to recode categories to smaller bits of information named “dummy variables”. Take the variable “education” in demo for example. It has four categories: “High School”,“Bachelor”,“Master” and “Doctor”. If we recode it to be dummy variables, each category get its own dummy variable that is 0/1 indicator for that category.

For a single categorical variable, we can use function class.ind() in package nnet:

dumVar<-class.ind(demo\$education)
head(dumVar)
##      Bachelor Doctor High School Master
## [1,]        0      0           1      0
## [2,]        0      0           1      0
## [3,]        0      0           1      0
## [4,]        0      0           1      0
## [5,]        0      0           1      0
## [6,]        0      0           1      0

If we want to determine encodeings for more than one variables, we can use dummyVars() in caret.

dumMod<-dummyVars(~income+education,
data=demo,
# Remove the variable name from the column name
levelsOnly=T)
predict(dumMod,head(demo))
##   income Bachelor Doctor High School Master
## 1  56000        0      0           1      0
## 2 133500        0      0           1      0
## 3  79500        0      0           1      0
## 4  53000        0      0           1      0
## 5  63500        0      0           1      0
## 6  84500        0      0           1      0

To add some more complexity, we could assume joint effect of income and education. In this case, this will add 4 more columns to the resulted data frame:

dumMod<-dummyVars(~income+education+income:education,
data=demo,
levelsOnly=T)
predict(dumMod,head(demo))
##   income Bachelor Doctor High School Master income:Bachelor income:Doctor
## 1  56000        0      0           1      0               0             0
## 2 133500        0      0           1      0               0             0
## 3  79500        0      0           1      0               0             0
## 4  53000        0      0           1      0               0             0
## 5  63500        0      0           1      0               0             0
## 6  84500        0      0           1      0               0             0
##   income:High School income:Master
## 1              56000             0
## 2             133500             0
## 3              79500             0
## 4              53000             0
## 5              63500             0
## 6              84500             0

</html>