It is probably the technical term known by the most non-technical people. When two predictors are very strongly correlated, including both in a model may lead to confusion or to problems with a singular matrix. The corrplot package has an excellent function of the same name, corrplot(), that can visualize the correlation structure of a set of predictors. The function has an option to reorder the variables in a way that reveals clusters of highly correlated ones.
# load the packages used below
library(caret)
library(corrplot)

# select non-survey numerical variables
sdat <- subset(sim.dat, select = c("age", "income", "store_exp", "online_exp", "store_trans", "online_trans"))

# use bagging imputation here
imp <- preProcess(sdat, method = "bagImpute")
sdat <- predict(imp, sdat)

# get the correlation matrix
correlation <- cor(sdat)

# plot
par(oma = c(2, 2, 2, 2))
corrplot.mixed(correlation, order = "hclust", tl.pos = "lt", upper = "ellipse")
The closer the correlation is to 0, the lighter the color and the closer the shape is to a circle. An elliptical shape means the correlation is not equal to 0 (the shapes are ellipses because we set upper = "ellipse"); the greater the absolute correlation, the narrower the ellipse. Blue represents a positive correlation; red represents a negative correlation. The direction of the ellipse also changes with the sign of the correlation. The correlation coefficients are shown in the lower triangle of the matrix.
The relationships among the variables seen in the previous scatterplot matrix are clear here: the negative correlation between age and online shopping, and the positive correlation between income and the amount of purchasing. Some correlations are very strong (such as the correlation of -0.7414 between age and one of the online shopping variables), which means the two variables contain duplicate information.
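To see what duplicate information can do to a model, here is a minimal sketch on simulated data. The variables x1, x2 and y are hypothetical and are not part of sim.dat. It takes the extreme case where one predictor is an exact multiple of another, which is the situation that produces a singular matrix in linear regression.

```r
# Simulated, hypothetical data: x2 carries exactly the same information as x1
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- 2 * x1
y  <- 1 + 3 * x1 + rnorm(n)

# lm() detects the redundancy and cannot estimate a separate effect for x2
fit <- lm(y ~ x1 + x2)
coef(fit)      # the coefficient for x2 is NA

# The design matrix has 3 columns but only rank 2,
# so the matrix t(X) %*% X used by least squares is singular
X <- cbind(intercept = 1, x1 = x1, x2 = x2)
qr(X)$rank     # 2
```

With strong but imperfect correlation the model can still be fit, but the individual coefficients tend to be unstable and hard to interpret.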
Section 3.5 of “Applied Predictive Modeling” (Kuhn and Johnson 2013) presents a heuristic algorithm to remove a minimum number of predictors to ensure all pairwise correlations are below a certain threshold:
1. Calculate the correlation matrix of the predictors.
2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
3. Determine the average correlation between A and the other variables. Do the same for predictor B.
4. If A has a larger average correlation, remove it; otherwise, remove predictor B.
5. Repeat Steps 2-4 until no absolute correlations are above the threshold.
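The steps above can be sketched in base R as follows. This is an illustration only, not the code used by any package: remove_high_corr() is a hypothetical helper, the data are simulated, and the sketch assumes that "average correlation" means the average absolute correlation with the other remaining predictors and that the input has named columns.

```r
# Hypothetical helper, written for illustration only
remove_high_corr <- function(x, cutoff = 0.7) {
  keep    <- colnames(x)
  removed <- character(0)
  while (length(keep) > 1) {
    # Step 1: absolute correlation matrix of the remaining predictors
    cm <- abs(cor(x[, keep, drop = FALSE]))
    diag(cm) <- 0
    if (max(cm) <= cutoff) break
    # Step 2: the pair (A, B) with the largest absolute correlation
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    a <- keep[pair[1]]
    b <- keep[pair[2]]
    # Step 3: average absolute correlation with the other predictors
    mean_a <- sum(cm[a, ]) / (length(keep) - 1)
    mean_b <- sum(cm[b, ]) / (length(keep) - 1)
    # Step 4: remove A if its average is larger, otherwise remove B
    drop    <- if (mean_a > mean_b) a else b
    keep    <- setdiff(keep, drop)
    removed <- c(removed, drop)
  }
  list(keep = keep, removed = removed)
}

# Simulated example: x2 nearly duplicates x1, and x4 nearly duplicates x3
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x3 <- rnorm(n)
x  <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = x3,
  x4 = x3 + rnorm(n, sd = 0.1),
  x5 = rnorm(n)
)
res <- remove_high_corr(x, cutoff = 0.7)
res$removed   # one predictor from each highly correlated pair
```

Recomputing the correlation matrix on every pass is wasteful for wide data, but it keeps the sketch close to the written steps.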
The findCorrelation() function in the caret package applies the above algorithm.
##  2 6
It returns the indices of the columns that need to be deleted. Here it tells us that we need to remove the second and sixth columns (income and online_trans) to make sure the correlations are all below 0.7.
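As a minimal end-to-end sketch of this step, the example below runs findCorrelation() on simulated data and then drops the flagged columns. The data frame x and its columns x1 to x4 are hypothetical and are not the sdat used above; only the call to findCorrelation() with its cutoff argument comes from caret.

```r
library(caret)

# Simulated, hypothetical data: x2 is nearly a copy of x1
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x  <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

# Indices of the columns to delete so that no |correlation| exceeds 0.7
high_corr <- findCorrelation(cor(x), cutoff = 0.7)

# Drop the flagged columns; guard against the case where nothing is flagged,
# because x[, -integer(0)] would select zero columns
reduced <- if (length(high_corr) > 0) x[, -high_corr, drop = FALSE] else x

# Check the remaining pairwise correlations
round(cor(reduced), 2)
```

The guard around the subsetting matters in practice: with a loose cutoff nothing may be flagged, and negative indexing with an empty vector silently returns no columns.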
After removal, the absolute values of all off-diagonal elements in the correlation matrix are below 0.7. How strong does a correlation have to get before you should start worrying about multicollinearity? There is no easy answer to that question. You can treat the threshold as a tuning parameter and pick the one that gives you the best prediction accuracy.
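One way to treat the threshold as a tuning parameter is sketched below, under several assumptions: the data are simulated, the model is a plain linear regression, the candidate cutoffs are arbitrary, and accuracy is measured by RMSE on a single hold-out set. Cross-validation would give a more stable comparison than a single split.

```r
library(caret)

# Simulated, hypothetical data: x1, x2 and x3 share a common component z
set.seed(2024)
n <- 300
z <- rnorm(n)
x <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = z + rnorm(n, sd = 0.6),
  x4 = rnorm(n),
  x5 = rnorm(n)
)
y <- 2 * x$x1 - x$x4 + rnorm(n)

# Hold out one third of the rows for evaluation
train <- sample(n, 200)

# Candidate thresholds (arbitrary choices for illustration)
cutoffs <- c(0.5, 0.7, 0.9, 0.99)

rmse <- sapply(cutoffs, function(cut) {
  # Filter predictors using the training rows only
  drop <- findCorrelation(cor(x[train, ]), cutoff = cut)
  keep <- setdiff(seq_along(x), drop)
  dat  <- cbind(y = y, x[, keep, drop = FALSE])
  fit  <- lm(y ~ ., data = dat[train, ])
  pred <- predict(fit, newdata = dat[-train, ])
  sqrt(mean((dat$y[-train] - pred)^2))
})

# Hold-out error for each threshold, and the threshold with the lowest error
results <- data.frame(cutoff = cutoffs, rmse = rmse)
results
best_cutoff <- cutoffs[which.min(rmse)]
```

The filter is computed on the training rows only, so the hold-out rows play no part in choosing which predictors to drop.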
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Springer.