11.5 Bagging Tree

As mentioned before, a single tree is unstable. If you randomly split the sample into two parts and fit a tree model on each, you can get two very different trees. A stable model should give similar results on different random samples. Some traditional statistical models, such as linear regression, are highly stable. Ensemble methods, which appeared in the 1990s, can effectively stabilize a model. Bootstrapping is a process in which you repeatedly draw samples of the same size from a single original sample with replacement (Efron and Tibshirani 1986). Bootstrap aggregation (bagging) is an ensemble technique proposed by Leo Breiman (Breiman 1996a). It uses bootstrapping in conjunction with any model to construct an ensemble. The process is straightforward: draw $B$ bootstrap samples, fit a model on each, and combine the predictions of the $B$ models.

Assume that there are $n$ independent random variables $Z_{1}, \dots, Z_{n}$, each with variance $\sigma^{2}$. Then the variance of the mean $\bar{Z}$ is $\frac{\sigma^{2}}{n}$. Averaging therefore reduces variance, which is why bagged models have less variance than a single model.

Since bootstrapping samples with replacement, some observations are selected multiple times and some not at all. The left-out observations are called out-of-bag. You can use the out-of-bag sample to assess model performance. For regression, the prediction is a simple average. For classification, the prediction is the category with the most “votes.” The number of trees, $B$, is a tuning parameter you need to choose. Bagging is a general approach that can be applied to different learners. Here we discuss it only in the context of decision trees.
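The bagging procedure for classification can be sketched by hand in a few lines. This is a minimal illustration, not the implementation used by any particular package: it assumes the rpart package for the individual trees, uses a two-class subset of the built-in iris data rather than the customer dataset, and the names `dat_demo`, `votes`, and `bagged_pred` are made up for the example.

```r
library(rpart)  # recommended package that ships with R

set.seed(100)
# Illustrative two-class data, not the customer dataset used in this section
dat_demo <- droplevels(iris[iris$Species != "setosa", ])
n <- nrow(dat_demo)
B <- 25  # number of bootstrap samples (trees), chosen arbitrarily

# votes[i, b] stores the class that tree b predicts for row i
votes <- matrix(NA_character_, nrow = n, ncol = B)
for (b in seq_len(B)) {
  idx <- sample(n, size = n, replace = TRUE)  # bootstrap sample of row indices
  fit <- rpart(Species ~ ., data = dat_demo[idx, ], method = "class")
  votes[, b] <- as.character(predict(fit, newdata = dat_demo, type = "class"))
}

# Bagged prediction: the class with the most votes across the B trees
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
table(predicted = bagged_pred, observed = dat_demo$Species)
```

The table compares predictions with the same rows the trees were trained on, so it is optimistic. It only shows the mechanics of bootstrapping, fitting, and voting.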

The advantages of bagged trees are:

Bagging stabilizes the model predictions by averaging the results. If we have 10 bootstrap samples and fit a single tree on each, we may get 10 trees with very different structures that lead to different predictions for a new sample. But if we use the average of the 10 predictions as the final prediction, the result is much more stable: if we draw another 10 samples and repeat the process, we will get a very similar averaged prediction.
Bagging provides more accurate predictions. If the goal is to predict rather than interpret, the ensemble approach usually has an advantage, especially for unstable models. For stable models (such as linear regression or MARS), however, bagging may bring only marginal improvement in model performance.
Bagging can use out-of-bag samples to evaluate model performance. For each model in the ensemble, we can calculate a performance metric (of your choice) on its out-of-bag samples. The average of these out-of-bag values gauges the predictive performance of the entire ensemble, and it correlates well with both cross-validation and test set estimates. On average, each tree uses about 2/3 of the samples, and the remaining 1/3 is out-of-bag. When the number of bootstrap samples is large enough, the out-of-bag performance estimate approximates that from leave-one-out cross-validation.
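The out-of-bag idea can be checked with a small simulation. The sketch below assumes the rpart package and a two-class subset of the built-in iris data (not the customer dataset). The names `in_bag_frac` and `oob_acc` are illustrative. For each bootstrap sample it records the share of distinct rows that were drawn and the accuracy of the tree on the rows that were not drawn.

```r
library(rpart)

set.seed(100)
dat_demo <- droplevels(iris[iris$Species != "setosa", ])  # illustrative data
n <- nrow(dat_demo)
B <- 200

in_bag_frac <- numeric(B)  # share of distinct rows in each bootstrap sample
oob_acc <- numeric(B)      # out-of-bag accuracy of each tree
for (b in seq_len(B)) {
  idx <- sample(n, size = n, replace = TRUE)
  oob <- setdiff(seq_len(n), idx)  # rows never drawn in this sample
  in_bag_frac[b] <- 1 - length(oob) / n
  fit <- rpart(Species ~ ., data = dat_demo[idx, ], method = "class")
  pred <- predict(fit, newdata = dat_demo[oob, ], type = "class")
  oob_acc[b] <- mean(pred == dat_demo$Species[oob])
}

mean(in_bag_frac)   # empirical in-bag share
1 - (1 - 1 / n)^n   # theoretical share, close to 1 - exp(-1) for large n
mean(oob_acc)       # average out-of-bag accuracy across the trees
```

The in-bag share should land near 0.63, which is where the "about 2/3" rule of thumb comes from.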

You need to choose the number of bootstrap samples. The authors of “Applied Predictive Modeling” (Kuhn and Johnson 2013) point out that people often see an exponential decrease in predictive improvement as the number of iterations increases, so most of the predictive power comes from a small portion of the trees. In their experience, model performance can show small improvements up to about 50 bagging iterations. If performance is still unsatisfactory, they suggest trying more powerful ensemble methods such as random forests and boosting, which are described in the following sections.
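One way to see how performance changes with the number of bagging iterations is to grow the ensemble one tree at a time and score it on held-out data. The following is a rough sketch on simulated data, assuming the rpart package. The data-generating rule, the grid `B_grid`, and the choice to average predicted class probabilities (rather than count votes) are all assumptions made for illustration.

```r
library(rpart)

set.seed(100)
# Simulated two-class data, purely illustrative
n <- 600
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- factor(ifelse(x$x1 + x$x2^2 + rnorm(n, sd = 0.5) > 1, "A", "B"))
dat_sim <- cbind(x, y = y)
train <- dat_sim[1:400, ]
test <- dat_sim[401:600, ]

B_max <- 50
# prob_A[i, b]: tree b's predicted probability of class "A" for test row i
prob_A <- matrix(NA_real_, nrow = nrow(test), ncol = B_max)
for (b in seq_len(B_max)) {
  idx <- sample(nrow(train), replace = TRUE)
  fit <- rpart(y ~ ., data = train[idx, ], method = "class")
  prob_A[, b] <- predict(fit, newdata = test, type = "prob")[, "A"]
}

# Test accuracy of the ensemble built from the first B trees
B_grid <- c(1, 5, 10, 25, 50)
acc <- sapply(B_grid, function(B) {
  avg <- rowMeans(prob_A[, seq_len(B), drop = FALSE])
  mean(ifelse(avg > 0.5, "A", "B") == test$y)
})
data.frame(B = B_grid, accuracy = acc)
```

The exact numbers depend on the seed and the simulated data, so the output is best read as a trend across values of B rather than as a benchmark.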

The disadvantages of bagged trees are:

As the number of bootstrap samples increases, the computation and memory requirements increase as well. You can mitigate this with parallel computing. Since each bootstrap sample and its model are independent of the others, you can easily parallelize bagging by building the models separately and combining the results at the end to generate the prediction.
The bagged model is difficult to explain, which is common to all ensemble approaches. However, you can still get variable importance by combining importance measures across the ensemble. For example, we can calculate the RSS decrease for each variable across all trees and use the average as the measure of importance.
Since bagging considers all of the original predictors at every split of every tree, the trees are correlated with each other. This tree correlation prevents bagging from optimally reducing the variance of the predicted values. See Hastie, Tibshirani, and Friedman (2008) for a mathematical illustration of the tree correlation phenomenon.
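The effect of tree correlation can be made concrete with a standard variance calculation. As a sketch, assume (these symbols are introduced here for illustration) that the $B$ tree predictions $T_{1}, \dots, T_{B}$ at a given point each have variance $\sigma^{2}$ and that every pair has correlation $\rho$. Then

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_{b}\right)
= \frac{1}{B^{2}}\left[B\sigma^{2} + B(B-1)\rho\sigma^{2}\right]
= \rho\sigma^{2} + \frac{1-\rho}{B}\sigma^{2}.
$$

With $\rho = 0$ this reduces to $\frac{\sigma^{2}}{B}$, the independent case. With $\rho > 0$, the second term vanishes as $B$ grows, but the first term $\rho\sigma^{2}$ remains, so adding more trees cannot push the variance below it.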

Let’s look at how to use R to build a bagged tree model that uses the survey questions to predict customer gender in the customer dataset. Get the predictors and response variable first:

dat <- read.csv("http://bit.ly/2P5gTw4")
# use the 10 survey questions as predictors
trainx <- dat[, grep("Q", names(dat))]
# add segment as a predictor 
# don't need to encode it to dummy variables
trainx$segment <- as.factor(dat$segment)
# use gender as the response variable
trainy <- as.factor(dat$gender)

Then fit the model using the train() function in the caret package. Here we simply set the number of trees (nbagg) to 1000. You can tune that parameter.

# trainControl() and twoClassSummary() come from caret
library(caret)

set.seed(100)
bagTune <- caret::train(trainx, trainy,
                        method = "treebag",
                        nbagg = 1000,
                        metric = "ROC",
                        trControl = trainControl(method = "cv",
                                                 summaryFunction = twoClassSummary,
                                                 classProbs = TRUE,
                                                 savePredictions = TRUE))

The model results are:

bagTune

## Bagged CART 
## 
## 1000 samples
##   11 predictor
##    2 classes: 'Female', 'Male' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 901, 899, 900, 900, 901, 900, ... 
## Resampling results:
## 
##   ROC     Sens    Spec  
##   0.7093  0.6533  0.6774

Since this example has only a handful of predictors, bagging does not improve the AUC over a single tree. It tends to make more of a difference when there are more predictors.

References

Efron, B., and R. Tibshirani. 1986. “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy.” Statistical Science, 54–75.

Hastie, T., R. Tibshirani, and J. Friedman. 2008. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Springer.