5.1 Data Cleaning
After you load the data, the first thing is to check how many variables are there, the type of variables, the distributions, and data errors. Let’s read and check the data:
<- read.csv("http://bit.ly/2P5gTw4")
sim.dat summary(sim.dat)
age gender income house store_exp
Min. : 16.0 Female:554 Min. : 41776 No :432 Min. : -500
1st Qu.: 25.0 Male :446 1st Qu.: 85832 Yes:568 1st Qu.: 205
Median : 36.0 Median : 93869 Median : 329
Mean : 38.8 Mean :113543 Mean : 1357
3rd Qu.: 53.0 3rd Qu.:124572 3rd Qu.: 597
Max. :300.0 Max. :319704 Max. :50000
NA's :184
online_exp store_trans online_trans Q1 Q2
Min. : 69 Min. : 1.00 Min. : 1.0 Min. :1.0 Min. :1.00
1st Qu.: 420 1st Qu.: 3.00 1st Qu.: 6.0 1st Qu.:2.0 1st Qu.:1.00
Median :1942 Median : 4.00 Median :14.0 Median :3.0 Median :1.00
Mean :2120 Mean : 5.35 Mean :13.6 Mean :3.1 Mean :1.82
3rd Qu.:2441 3rd Qu.: 7.00 3rd Qu.:20.0 3rd Qu.:4.0 3rd Qu.:2.00
Max. :9479 Max. :20.00 Max. :36.0 Max. :5.0 Max. :5.00
Q3 Q4 Q5 Q6 Q7
Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
1st Qu.:1.00 1st Qu.:2.00 1st Qu.:1.75 1st Qu.:1.00 1st Qu.:2.50
Median :1.00 Median :3.00 Median :4.00 Median :2.00 Median :4.00
Mean :1.99 Mean :2.76 Mean :2.94 Mean :2.45 Mean :3.43
3rd Qu.:3.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00
Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00
Q8 Q9 Q10 segment
Min. :1.0 Min. :1.00 Min. :1.00 Conspicuous:200
1st Qu.:1.0 1st Qu.:2.00 1st Qu.:1.00 Price :250
Median :2.0 Median :4.00 Median :2.00 Quality :200
Mean :2.4 Mean :3.08 Mean :2.32 Style :350
3rd Qu.:3.0 3rd Qu.:4.00 3rd Qu.:3.00
Max. :5.0 Max. :5.00 Max. :5.00
Are there any problems? Questionnaire response Q1-Q10 seem reasonable, the minimum is 1 and maximum is 5. Recall that the questionnaire score is 1-5. The number of store transactions (store_trans
) and online transactions (online_trans
) make sense too. Things to pay attention are:
- There are some missing values.
- There are outliers for store expenses (
store_exp
). The maximum value is 50000. Who would spend $50000 a year buying clothes? Is it an imputation error? - There is a negative value ( -500) in
store_exp
which is not logical. - Someone is 300 years old.
How to deal with that? Depending on the situation, if the sample size is large enough and the missing happens randomly, it does not hurt to delete those problematic samples. Or we can set these values as missing and impute them instead of deleting the rows.
# set problematic values as missings
$age[which(sim.dat$age > 100)] <- NA
sim.dat$store_exp[which(sim.dat$store_exp < 0)] <- NA
sim.dat# see the results
summary(subset(sim.dat, select = c("age", "store_exp")))
age store_exp
Min. :16.00 Min. : 155.8
1st Qu.:25.00 1st Qu.: 205.1
Median :36.00 Median : 329.8
Mean :38.58 Mean : 1358.7
3rd Qu.:53.00 3rd Qu.: 597.4
Max. :69.00 Max. :50000.0
NA's :1 NA's :1
Now let’s deal with the missing values in the data.