5.7 Sparse Variables

Besides highly correlated predictors, predictors with degenerate distributions can also cause problems. Removing them can significantly improve the performance and stability of some models, such as linear regression and logistic regression; tree-based models are impervious to this type of predictor. An extreme example is a variable that takes a single value, which is called a zero-variance variable. Variables that have very few unique values, with one value dominating the rest, are near-zero-variance predictors. In general, detecting these variables relies on two measures:

  • The fraction of unique values over the sample size
  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value.
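The two measures above can be computed by hand for a single variable. The following is a minimal base R sketch, not caret's implementation; the helper name sparsity_metrics is hypothetical, and returning NA when the variable has only one distinct value is a choice made here for illustration.

```r
# Hypothetical helper (not part of caret): compute the two screening
# measures for a single vector x.
sparsity_metrics <- function(x) {
  # frequency of each distinct value, most common first (NAs are dropped)
  counts <- sort(table(x), decreasing = TRUE)
  # percentage of distinct values relative to the sample size
  pct_unique <- 100 * length(counts) / length(x)
  # ratio of the most common to the second most common value;
  # NA here (an illustrative choice) when there is no second value
  freq_ratio <- if (length(counts) > 1) {
    as.numeric(counts[1] / counts[2])
  } else {
    NA_real_
  }
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# one 1 followed by ninety-nine 0s
x <- c(1, rep(0, 99))
sparsity_metrics(x)
# freq_ratio = 99, pct_unique = 2
```

A large frequency ratio combined with a small percentage of unique values is the signature of a near-zero-variance predictor.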

The nearZeroVar() function in the caret package can filter near-zero-variance predictors according to the above rules. To show how to use the function, let’s arbitrarily add some problematic variables to the original data sim.dat:

# make a copy
zero_demo <- sim.dat
# add two sparse variables: zero1 has only one unique value;
# zero2 is a vector whose first element is 1 and the rest are 0s
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))

The function will return a vector of integers indicating which columns to remove:

nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)
## [1] 20 21

As expected, it returns the two columns we generated, and you can go ahead and remove them. Note that the two arguments freqCut and uniqueCut correspond to the two rules above:

  • freqCut: the cutoff for the ratio of the most common value to the second most common value
  • uniqueCut: the cutoff for the percentage of distinct values out of the number of total samples
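As a sketch of the removal step, the returned indices can be used to drop the flagged columns. The example assumes the caret package is installed and uses a small made-up data frame, toy, in place of sim.dat. The saveMetrics = TRUE argument asks nearZeroVar() to return the per-column measures instead of the indices, which helps when choosing cutoffs.

```r
library(caret)

# Made-up data standing in for sim.dat (an assumption for illustration)
set.seed(1)
toy <- data.frame(
  x = rnorm(100),             # ordinary continuous predictor
  zero1 = rep(1, 100),        # a single unique value
  zero2 = c(1, rep(0, 99))    # one 1 followed by 0s
)

# indices of the columns flagged as near-zero variance
nzv_idx <- nearZeroVar(toy, freqCut = 95/5, uniqueCut = 10)

# drop the flagged columns; the length check matters because indexing
# with an empty negative index would drop every column
toy_clean <- if (length(nzv_idx) > 0) {
  toy[, -nzv_idx, drop = FALSE]
} else {
  toy
}
names(toy_clean)

# inspect the measures behind the decision for each column
nearZeroVar(toy, saveMetrics = TRUE)
```

The metrics table has one row per column, with the frequency ratio, the percentage of unique values, and logical flags for zero and near-zero variance.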