## 5.2 Missing Values

You can write a whole book about missing values. This section only covers some of the most commonly used methods without going too deep into the topic. Chapter 7 of the book by De Waal, Pannekoek, and Scholtus (Waal, Pannekoek, and Scholtus 2011) provides a concise overview of existing imputation methods. The choice of method depends on the actual situation; there is no single best way.

One question to ask before imputation: is there any auxiliary information? Being aware of auxiliary information is critical. For example, if the system records customers who did not purchase as missing, then the real purchase amount should be 0. Is the missingness a random occurrence? If so, it may be reasonable to impute with the mean or median. If not, is there a potential mechanism behind the missing data? For example, older people may be more reluctant to disclose their age in a questionnaire, so the absence of age is not completely random. In this case, the missing values need to be estimated using the relationship between age and other variables, for example by building a model that predicts age from variables such as whether the respondent has children, income, and other survey questions.
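A minimal sketch of this idea, using a tiny made-up survey table (the column names `age`, `income`, and `have_child` are illustrative, not from the book's data): fit a regression on the complete cases, then predict the missing ages.

```
# hypothetical survey data: age is missing for two respondents
dat <- data.frame(
  age = c(34, 52, NA, 41, NA, 29),
  income = c(52, 88, 95, 61, 80, 45),
  have_child = c(0, 1, 1, 1, 1, 0)
)
# fit on the complete cases only
fit <- lm(age ~ income + have_child, data = dat[!is.na(dat$age), ])
# fill in the missing ages with the model's predictions
dat$age[is.na(dat$age)] <- predict(fit, dat[is.na(dat$age), ])
```

In practice you would use whatever survey variables correlate with age, and a more flexible model than `lm()` if the relationship is nonlinear.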

The purpose of modeling also matters when selecting an imputation method. If the goal is to interpret parameter estimates or draw statistical inference, it is important to study the missing mechanism carefully and to estimate the missing values using the non-missing information as much as possible. If the goal is prediction, people usually do not study the missing mechanism rigorously (though sometimes the mechanism is obvious). If the mechanism is not clear, treat the data as missing at random and impute with the mean, median, or k-nearest neighbors. Since statistical inference is sensitive to missing values, researchers in survey statistics have conducted in-depth studies of imputation schemes that focus on valid inference. The missing value problem in prediction models is different from that in traditional surveys, so there are not many papers on missing value imputation for prediction models. Those who want to study further can refer to Saar-Tsechansky and Provost’s comparison of different imputation methods (Saar-Tsechansky and Provost 2007) and De Waal, Pannekoek, and Scholtus’ book (Waal, Pannekoek, and Scholtus 2011).

### 5.2.1 Impute missing values with median/mode

In the case of missing at random, a common method is to impute with the median (continuous variables) or the mode (categorical variables). You can use the `impute()` function in the `imputeMissings` package.

```
# load the package that provides impute()
library(imputeMissings)
# save the result as another object
demo_imp <- impute(sim.dat, method = "median/mode")
# check the first five columns; the other columns have no missing values either
summary(demo_imp[, 1:5])
```

```
      age            gender        income          house       store_exp      
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8  
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:  205.1  
 Median :36.00                Median : 93869             Median :  329.8  
 Mean   :38.58                Mean   :109923             Mean   : 1357.7  
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:  597.3  
 Max.   :69.00                Max.   :319704             Max.   :50000.0  
```

After imputation, `demo_imp` has no missing values. This method is straightforward and widely used. The disadvantage is that it does not take into account the relationships between variables, and when a significant proportion of values is missing, it distorts the data. In that case, it is better to consider the relationships between variables and study the missing mechanism. In this example, the missing variables are numeric. If a missing variable is a categorical/factor variable, the `impute()` function imputes with the mode.

You can also use `preProcess()` in the `caret` package, but it only works for numeric variables and cannot impute categorical variables. Since the missing values here are numeric, we can use the `preProcess()` function, and the result is the same as with `impute()`. `preProcess()` is a powerful function that links to a variety of data preprocessing methods; we will use it later for other preprocessing steps.

```
library(caret)
imp <- preProcess(sim.dat, method = "medianImpute")
demo_imp2 <- predict(imp, sim.dat)
summary(demo_imp2[, 1:5])
```

```
      age            gender        income          house       store_exp      
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8  
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:  205.1  
 Median :36.00                Median : 93869             Median :  329.8  
 Mean   :38.58                Mean   :109923             Mean   : 1357.7  
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:  597.3  
 Max.   :69.00                Max.   :319704             Max.   :50000.0  
```

### 5.2.2 K-nearest neighbors

K-nearest neighbors (KNN) finds the k closest samples (by Euclidean distance) in the training set and imputes with the mean of those “neighbors.”

Use `preProcess()` to conduct KNN imputation:

```
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
# need to use predict() to get the KNN result
demo_imp <- predict(imp, sim.dat)
# only show the first three columns
summary(demo_imp[, 1:3])
```

```
      age               gender        income         
 Min.   :-1.5910972   Female:554   Min.   :-1.43989  
 1st Qu.:-0.9568733   Male  :446   1st Qu.:-0.53732  
 Median :-0.1817107                Median :-0.37606  
 Mean   : 0.0000156                Mean   : 0.02389  
 3rd Qu.: 1.0162678                3rd Qu.: 0.21540  
 Max.   : 2.1437770                Max.   : 4.13627  
```

The `preProcess()` call in the first line automatically ignores non-numeric columns.

Comparing the KNN result with the previous median imputation, the two are very different. This is because when you tell the `preProcess()` function to use KNN (the option `method = "knnImpute"`), it automatically standardizes the data.

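You can check that the imputed numeric columns are indeed standardized, with mean near 0 and standard deviation near 1 (a sketch on a small synthetic data frame, since `sim.dat` is created elsewhere in the book):

```
library(caret)

set.seed(1)
# synthetic numeric data with two missing entries in x
toy <- data.frame(x = c(rnorm(20, mean = 50, sd = 10), NA, NA),
                  y = rnorm(22, mean = 5, sd = 2))
imp <- preProcess(toy, method = "knnImpute", k = 5)
toy_imp <- predict(imp, toy)
# after knnImpute, columns are centered and scaled
round(c(mean(toy_imp$x), sd(toy_imp$x)), 2)
```

The mean is not exactly 0 because the imputed values are inserted after centering, but it stays close.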
Another way is to use a bagging tree (see the next section). Note that KNN cannot impute samples where the entire row is missing. The reason is straightforward: with every predictor missing, there is no information left to find the neighbors or to compute a mean.

Let’s append a new row with all values missing to the original data frame to get a new object called `temp`, then apply KNN to `temp` and see what happens:

```
temp <- rbind(sim.dat, rep(NA, ncol(sim.dat)))
imp <- preProcess(sim.dat, method = "knnImpute", k = 5)
demo_imp <- predict(imp, temp)
```

```
Error in FUN(newX[, i], ...) :
cannot impute when all predictors are missing in the new data point
```

There is an error saying “`cannot impute when all predictors are missing in the new data point`”. It is easy to fix by finding and removing the problematic row(s):
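One way to find such rows is to count the observed values per row; a small helper (the name `all_missing_rows` is ours, not from any package):

```
# indices of rows in a data frame df that are entirely missing
all_missing_rows <- function(df) which(rowSums(!is.na(df)) == 0)
```

Calling `all_missing_rows(temp)` on the object created above flags the appended all-NA row.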

It shows that row 1001 is problematic. You can go ahead and delete it.

### 5.2.3 Bagging Tree

Bagging (bootstrap aggregating) was originally proposed by Leo Breiman and is one of the earliest ensemble methods (Breiman 1996). When used for missing value imputation, it trains a bagging tree with the remaining variables as predictors and then uses the tree to predict the missing values. Although the method is theoretically powerful, the computation is much more intense than KNN. In practice, there is a trade-off between computation time and benefit: if median or mean imputation meets the modeling needs, a bagging tree may improve accuracy a little, but the gain is usually too marginal to justify the extra time. The bagging tree itself is a model for regression and classification. Here we use `preProcess()` to impute `sim.dat`:

```
imp <- preProcess(sim.dat, method = "bagImpute")
demo_imp <- predict(imp, sim.dat)
summary(demo_imp[, 1:5])
```

```
      age            gender        income          house       store_exp      
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :  155.8  
 1st Qu.:25.00   Male  :446   1st Qu.: 86762   Yes:568   1st Qu.:  205.1  
 Median :36.00                Median : 94739             Median :  329.0  
 Mean   :38.58                Mean   :114665             Mean   : 1357.7  
 3rd Qu.:53.00                3rd Qu.:123726             3rd Qu.:  597.3  
 Max.   :69.00                Max.   :319704             Max.   :50000.0  
```

### References

Breiman, Leo. 1996. “Bagging Predictors.” *Machine Learning* 24 (2): 123–40.

Saar-Tsechansky, Maytal, and Foster Provost. 2007. “Handling Missing Values When Applying Classification Models.” *Journal of Machine Learning Research* 8: 1625–57.

Waal, Ton de, Jeroen Pannekoek, and Sander Scholtus. 2011. *Handbook of Statistical Data Editing and Imputation*. John Wiley & Sons.