5.2 Missing Values
Whole books have been written about missing values. This section shows only some of the most commonly used methods without going too deep into the topic. Chapter 7 of the book by De Waal, Pannekoek and Scholtus (Waal, Pannekoek, and Scholtus 2011) gives a concise overview of some of the existing imputation methods. The choice of method depends on the actual situation; there is no single best way.
One question to ask before imputation: is there any auxiliary information? Being aware of auxiliary information is critical. For example, if the system records customers who did not purchase as missing, then the real purchase amount should be 0. The next question: is the missingness a random occurrence? If so, it may be reasonable to impute with the mean or median. If not, is there a potential mechanism behind the missing data? For example, older people are more reluctant to disclose their age in a questionnaire, so the absence of age is not completely random. In this case, the missing values need to be estimated using the relationship between age and other variables. For example, use variables such as whether they have children, income, and other survey questions to build a model to predict age.
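The questions above can be turned into quick checks. The following is a minimal sketch on a small made-up data frame (toy, with hypothetical columns age, income, and purchase), not the sim.dat data used in this chapter. It assumes that "no purchase" was stored as missing, compares income between rows with and without a recorded age, and uses a simple linear model as one possible way to estimate age from another variable.

```r
set.seed(100)
n <- 200
# made-up data for illustration only
toy <- data.frame(
  age = round(runif(n, 20, 70)),
  purchase = round(rexp(n, 1 / 300), 1)
)
toy$income <- 30000 + 1500 * toy$age + rnorm(n, sd = 5000)

# Auxiliary information: assume the system stored "no purchase" as NA,
# so the real purchase amount is 0
toy$purchase[sample(n, 30)] <- NA
toy$purchase[is.na(toy$purchase)] <- 0

# Non-random missingness: older respondents hide their age more often
hide <- runif(n) < ifelse(toy$age > 50, 0.4, 0.05)
toy$age[hide] <- NA

# If age were missing completely at random, mean income should be
# similar in the two groups (FALSE = age observed, TRUE = age missing)
tapply(toy$income, is.na(toy$age), mean)

# Estimate the missing ages from income; lm() drops incomplete rows by default
fit <- lm(age ~ income, data = toy)
miss <- is.na(toy$age)
toy$age[miss] <- predict(fit, newdata = toy[miss, ])
```

In real data the mechanism is rarely this clean, so the group comparison is a hint rather than a formal test.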
The purpose of modeling is also important when selecting an imputation method. If the goal is to interpret parameter estimates or to make statistical inferences, then it is important to study the missing mechanism carefully and to estimate the missing values using as much of the non-missing information as possible. If the goal is prediction, people usually do not study the missing mechanism rigorously (although sometimes the mechanism is obvious). If the mechanism is not clear, treat the data as missing at random and impute with the mean, median, or k-nearest neighbors. Since statistical inference is sensitive to missing values, researchers in survey statistics have conducted in-depth studies of various imputation schemes that focus on valid statistical inference. The problem of missing values in predictive modeling differs from that in the traditional survey setting, and there are not many papers on missing value imputation for predictive models. Those who want to study further can refer to Saar-Tsechansky and Provost's comparison of different imputation methods (Saar-Tsechansky and Provost 2007) and De Waal, Pannekoek and Scholtus' book (Waal, Pannekoek, and Scholtus 2011).
5.2.1 Impute missing values with median/mode
In the case of missing at random, a common method is to impute with the median (continuous variables) or the mode (categorical variables). You can use the impute() function in the imputeMissings package.
library(imputeMissings)
# save the result as another object
demo_imp <- impute(sim.dat, method = "median/mode")
# check the first 5 columns
# there are no missing values in the other columns
summary(demo_imp[, 1:5])
      age          gender        income       house       store_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:  205.1
 Median :36.00                Median : 93869             Median :  329.8
 Mean   :38.58                Mean   :109923             Mean   : 1357.7
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:  597.3
 Max.   :69.00                Max.   :319704             Max.   :50000.0
After imputation, demo_imp has no missing values. This method is straightforward and widely used. The disadvantage is that it does not take into account the relationship between the variables. When a large proportion of the values is missing, it distorts the data. In that case, it is better to consider the relationship between variables and study the missing mechanism. In the example here, the variables with missing values are numeric. If a variable with missing values is a categorical/factor variable, the impute() function will impute with the mode.
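To make the rule concrete, here is a minimal base R sketch of median/mode imputation. It is not the implementation inside imputeMissings; the helper names get_mode() and impute_median_mode() and the toy data frame are made up for illustration.

```r
# Hypothetical helper: most frequent non-missing value of a vector
get_mode <- function(x) {
  tab <- table(x)  # table() leaves out NA by default
  names(tab)[which.max(tab)]
}

# Hypothetical helper: median for numeric columns, mode for the others
impute_median_mode <- function(df) {
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    if (!any(miss)) next
    if (is.numeric(df[[col]])) {
      df[[col]][miss] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][miss] <- get_mode(df[[col]])
    }
  }
  df
}

# made-up data: one numeric column with an outlier, one factor column
toy <- data.frame(
  income = c(50, 60, NA, 80, 1000),
  house = factor(c("Yes", NA, "Yes", "No", "Yes"))
)
toy_imp <- impute_median_mode(toy)
toy_imp
```

The missing income becomes 70, the median of the observed values, which is not pulled up by the outlier 1000 the way the mean would be. The missing house value becomes "Yes", the most frequent level.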
You can also use preProcess() in the caret package, but it works only for numeric variables and cannot impute categorical variables. Since the missing values here are numeric, we can use the preProcess() function. The result is the same as that of the impute() function. preProcess() is a powerful function that links to a variety of data preprocessing methods. We will use it later for other data preprocessing.
library(caret)
imp <- preProcess(sim.dat, method = "medianImpute")
demo_imp2 <- predict(imp, sim.dat)
summary(demo_imp2[, 1:5])
      age          gender        income       house       store_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:  205.1
 Median :36.00                Median : 93869             Median :  329.8
 Mean   :38.58                Mean   :109923             Mean   : 1357.7
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:  597.3
 Max.   :69.00                Max.   :319704             Max.   :50000.0
5.2.2 K-nearest neighbors
K-nearest neighbor (KNN) will find the k closest samples (by Euclidean distance) in the training set and impute the mean of those "neighbors."
Use preProcess() to conduct KNN:
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
# need to use predict() to get KNN result
demo_imp <- predict(imp, sim.dat)
# only show the first three columns
summary(demo_imp[, 1:3])
      age                gender        income
 Min.   :-1.5910972   Female:554   Min.   :-1.43989
 1st Qu.:-0.9568733   Male  :446   1st Qu.:-0.53732
 Median :-0.1817107                Median :-0.37606
 Mean   : 0.0000156                Mean   : 0.02389
 3rd Qu.: 1.0162678                3rd Qu.: 0.21540
 Max.   : 2.1437770                Max.   : 4.13627
The preProcess() in the first line will automatically ignore non-numeric columns.
Comparing the KNN result with the previous median imputation, the two are very different. This is because when you tell the preProcess() function to use KNN (the option method = "knnImpute"), it will automatically standardize the data.
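The idea behind KNN imputation, including the standardization step, can be sketched in base R. This is a simplified illustration and not caret's implementation: the function knn_impute() and the toy data are made up, neighbors are taken only from complete rows, and the result is transformed back to the original scale so that it is easier to compare with the raw data.

```r
set.seed(1)
n <- 50
# made-up numeric data with a few missing values
toy <- data.frame(x1 = rnorm(n, 40, 10))
toy$x2 <- 2000 * toy$x1 + rnorm(n, sd = 5000)
toy$x3 <- rnorm(n, 300, 50)
toy$x2[c(3, 17)] <- NA
toy$x3[8] <- NA

# Hypothetical helper: KNN imputation on standardized numeric columns
knn_impute <- function(df, k = 5) {
  z <- scale(df)  # center and scale each column; NAs are skipped
  ctr <- attr(z, "scaled:center")
  scl <- attr(z, "scaled:scale")
  donors <- z[complete.cases(z), , drop = FALSE]
  for (i in which(!complete.cases(z))) {
    obs <- !is.na(z[i, ])  # columns observed in row i
    if (!any(obs)) stop("cannot impute a row with all values missing")
    # Euclidean distance to every complete row, on the observed columns only
    d <- sqrt(rowSums(sweep(donors[, obs, drop = FALSE], 2, z[i, obs])^2))
    nn <- order(d)[seq_len(k)]
    # impute with the mean of the k nearest neighbors
    z[i, !obs] <- colMeans(donors[nn, !obs, drop = FALSE])
  }
  # undo the standardization
  as.data.frame(sweep(sweep(z, 2, scl, "*"), 2, ctr, "+"))
}

toy_knn <- knn_impute(toy, k = 5)
summary(toy_knn)
```

Standardizing first keeps a variable with a large scale (x2 here) from dominating the distance. Multiplying by the stored scale and adding the center back is one way to return the imputed values to the original units.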
Another way is to use a bagging tree (see the next section). Note that KNN cannot impute a sample whose entire row is missing. The reason is straightforward: the algorithm finds the neighbors of a sample by computing distances on its observed values. If every value in the row is missing, there is nothing to compute distances from, so no neighbors can be found and there is no mean to impute.
Let’s append a new row with all values missing to the original data frame to get a new object called temp. Then apply KNN to temp and see what happens:
temp <- rbind(sim.dat, rep(NA, ncol(sim.dat)))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
demo_imp <- predict(imp, temp)
Error in FUN(newX[, i], ...) :
cannot impute when all predictors are missing in the new data point
The error message says “cannot impute when all predictors are missing in the new data point”. It is easy to fix by finding and removing the problematic row(s):
# count the missing values in each row
idx <- apply(temp, 1, function(x) sum(is.na(x)))
# rows where every column is missing
as.vector(which(idx == ncol(temp)))
It shows that row 1001 is problematic. You can go ahead and delete it.
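As a small illustration of the deletion step, the sketch below uses a made-up data frame (toy) instead of temp. It flags the rows where every column is missing with rowSums() and drops them with a logical index.

```r
# made-up data: row 4 is entirely missing, rows 2 and 3 only partly
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, NA)
)

# TRUE for rows where all columns are NA
all_na <- rowSums(is.na(toy)) == ncol(toy)
which(all_na)  # flags row 4 in this toy data

# keep only the rows that have at least one observed value
toy_clean <- toy[!all_na, , drop = FALSE]
nrow(toy_clean)
```

A logical index is used here rather than a negative row index because toy[-integer(0), ] would return zero rows if no row happened to be entirely missing.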
5.2.3 Bagging Tree
Bagging (bootstrap aggregating) was originally proposed by Leo Breiman. It is one of the earliest ensemble methods (Breiman 1996a). When used for missing value imputation, it uses the remaining variables as predictors to train a bagging tree and then uses the tree to predict the missing values. Although the method is powerful in theory, the computation is much more intensive than KNN. In practice, there is a trade-off between computation time and the gain in accuracy. If median or mean imputation meets the modeling needs, a bagging tree may still improve the accuracy a little, but the gain is often too marginal to justify the extra time. The bagging tree itself is a model for regression and classification. Here we use preProcess() to impute sim.dat:
imp <- preProcess(sim.dat, method = "bagImpute")
demo_imp <- predict(imp, sim.dat)
summary(demo_imp[, 1:5])
      age          gender        income       house       store_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8
 1st Qu.:25.00   Male  :446   1st Qu.: 86762   Yes:568   1st Qu.:  205.1
 Median :36.00                Median : 94739             Median :  329.0
 Mean   :38.58                Mean   :114665             Mean   : 1357.7
 3rd Qu.:53.00                3rd Qu.:123726             3rd Qu.:  597.3
 Max.   :69.00                Max.   :319704             Max.   :50000.0
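To show what happens conceptually, here is a minimal sketch of bagged-tree imputation for a single variable. It is an illustration under stated assumptions, not caret's bagImpute code: it assumes the rpart package is installed, uses a made-up data frame (toy), and the number of bootstrap trees B = 25 is an arbitrary choice.

```r
library(rpart)

set.seed(1)
n <- 200
# made-up data: y depends on x1, and 20 values of y are missing
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 3 * toy$x1 + rnorm(n, sd = 0.1)
toy$y[sample(n, 20)] <- NA

obs <- toy[!is.na(toy$y), ]  # rows used to train the trees
mis <- toy[is.na(toy$y), ]   # rows whose y will be predicted

B <- 25  # number of bootstrap trees (arbitrary)
preds <- sapply(seq_len(B), function(b) {
  # fit one regression tree on a bootstrap sample of the observed rows
  boot <- obs[sample(nrow(obs), replace = TRUE), ]
  fit <- rpart(y ~ x1 + x2, data = boot)
  predict(fit, newdata = mis)
})

# aggregate: average the B tree predictions for each missing value
toy$y[is.na(toy$y)] <- rowMeans(preds)
summary(toy$y)
```

Every variable with missing values needs its own set of B trees, which is where the extra computation compared with median or KNN imputation comes from.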