5.1 Data Cleaning

After you load the data, the first thing is to check how many variables are there, the type of variables, the distributions, and data errors. Let’s read and check the data:

sim.dat <- read.csv("http://bit.ly/2P5gTw4")
summary(sim.dat)
      age           gender        income       house       store_exp    
 Min.   : 16.0   Female:554   Min.   : 41776   No :432   Min.   : -500  
 1st Qu.: 25.0   Male  :446   1st Qu.: 85832   Yes:568   1st Qu.:  205  
 Median : 36.0                Median : 93869             Median :  329  
 Mean   : 38.8                Mean   :113543             Mean   : 1357  
 3rd Qu.: 53.0                3rd Qu.:124572             3rd Qu.:  597  
 Max.   :300.0                Max.   :319704             Max.   :50000  
                              NA's   :184                               
   online_exp    store_trans     online_trans        Q1            Q2      
 Min.   :  69   Min.   : 1.00   Min.   : 1.0   Min.   :1.0   Min.   :1.00  
 1st Qu.: 420   1st Qu.: 3.00   1st Qu.: 6.0   1st Qu.:2.0   1st Qu.:1.00  
 Median :1942   Median : 4.00   Median :14.0   Median :3.0   Median :1.00  
 Mean   :2120   Mean   : 5.35   Mean   :13.6   Mean   :3.1   Mean   :1.82  
 3rd Qu.:2441   3rd Qu.: 7.00   3rd Qu.:20.0   3rd Qu.:4.0   3rd Qu.:2.00  
 Max.   :9479   Max.   :20.00   Max.   :36.0   Max.   :5.0   Max.   :5.00  
                                                                           
       Q3             Q4             Q5             Q6             Q7      
 Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
 1st Qu.:1.00   1st Qu.:2.00   1st Qu.:1.75   1st Qu.:1.00   1st Qu.:2.50  
 Median :1.00   Median :3.00   Median :4.00   Median :2.00   Median :4.00  
 Mean   :1.99   Mean   :2.76   Mean   :2.94   Mean   :2.45   Mean   :3.43  
 3rd Qu.:3.00   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.00  
 Max.   :5.00   Max.   :5.00   Max.   :5.00   Max.   :5.00   Max.   :5.00  
                                                                           
       Q8            Q9            Q10              segment   
 Min.   :1.0   Min.   :1.00   Min.   :1.00   Conspicuous:200  
 1st Qu.:1.0   1st Qu.:2.00   1st Qu.:1.00   Price      :250  
 Median :2.0   Median :4.00   Median :2.00   Quality    :200  
 Mean   :2.4   Mean   :3.08   Mean   :2.32   Style      :350  
 3rd Qu.:3.0   3rd Qu.:4.00   3rd Qu.:3.00                    
 Max.   :5.0   Max.   :5.00   Max.   :5.00                    

Are there any problems? Questionnaire response Q1-Q10 seem reasonable, the minimum is 1 and maximum is 5. Recall that the questionnaire score is 1-5. The number of store transactions (store_trans) and online transactions (online_trans) make sense too. Things to pay attention are:

  • There are some missing values.
  • There are outliers for store expenses (store_exp). The maximum value is 50000. Who would spend $50000 a year buying clothes? Is it an imputation error?
  • There is a negative value ( -500) in store_exp which is not logical.
  • Someone is 300 years old.

How to deal with that? Depending on the situation, if the sample size is large enough and the missing happens randomly, it does not hurt to delete those problematic samples. Or we can set these values as missing and impute them instead of deleting the rows.

# set problematic values as missings
sim.dat$age[which(sim.dat$age > 100)] <- NA
sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA
# see the results
summary(subset(sim.dat, select = c("age", "store_exp")))
      age          store_exp      
 Min.   :16.00   Min.   :  155.8  
 1st Qu.:25.00   1st Qu.:  205.1  
 Median :36.00   Median :  329.8  
 Mean   :38.58   Mean   : 1358.7  
 3rd Qu.:53.00   3rd Qu.:  597.4  
 Max.   :69.00   Max.   :50000.0  
 NA's   :1       NA's   :1        

Now let’s deal with the missing values in the data.