## 11.7 Gradient Boosted Machine

Boosting models were developed in the 1980s and were originally for classification problems. Due to the excellent model performance, they were widely used for a variety of applications, such as gene expression , chemical substructure classification , music classification , etc. The first effective implementation of boosting is Adaptive Boosting (AdaBoost) algorithm came up by Yoav Freund and Robert Schapire in 1996 (YFR 1999). After that, some researchers started to connect the boosting algorithm with some statistical concepts, such as loss function, additive model, logistic regression. Friedman pointed out that boosting can be considered as a forward stagewise additive model that minimizes exponential loss. The new view of boosting in a statistical framework enabled the method to be extended to regression problems.

The idea is to combine a group of weak learners (a classifier that is marginally better than random guess) to produce a strong learner. Like bagging, boosting is a general approach that can be applied to different learners. Here we focus on the decision tree. Recall that both bagging and random forest create multiple copies of the original training data using the bootstrap, fitting a separate decision tree to each copy and combining all the results to create a single prediction. Boosting also creates different trees but the trees are grown sequentially and each tree is a weak learner. Any modeling technique with tuning parameters can produce a range of learners, from weak to strong. You can easily make a weak learner by restricting the depth of the tree. There are different types of boosting. Here we introduce two main types: adaptive boosting and stochastic gradient boosting.

### 11.7.1 Adaptive Boosting

Yoav Freund and Robert Schapire came up the AdaBoost.M1 algorithm. Consider a binary classification problem where the response variable has two categories $$Y \in \{-1, 1\}$$. Given predictor matrix, $$X$$, construct a classifier $$G(X)$$ that predicts $$1$$ or $$-1$$. The corresponding error rate in the training set is:

$\bar{err}=\frac{1}{N}\Sigma_{i=1}^NI(y_i\neq G(x_i))$

The algorithm produces a series of classifiers $$G_m(x),\ m=1,2,...,M$$ from different iterations. In each iteration, it finds the best classifier based on the current weights. The misclassified samples in the $$m^{th}$$ iteration will have higher weights in the $$(m+1)^{th}$$ iteration and the correctly classified samples will have lower weights. As it moves on, the algorithm will put more effort into the “difficult” samples until it can correctly classify them. So it requires the algorithm to change focus at each iteration. At each iteration, the algorithm will calculate a stage weight based on the error rate. The final prediction is a weighted average of all those weak classifiers using stage weights from all the iterations:

$G(x)=sign ( \Sigma_{m=1}^M \alpha_{m}G_m(x))$ where $$\alpha_1,\alpha_2,...,\alpha_M$$ are the weights from different iterations.

Since the classifier $$G_m(x)$$ returns discrete value, the AdaBoost.M1 algorithm is known as “Discrete AdaBoost” . You can revise the above algorithm if it returns continuousf value, for example, a probability . As mentioned before, boosting is a general approach that can be applied to different learners. Since you can easily create weak learners by limiting the depth of the tree, the boosting tree is a common method. Since the classification tree is a low bias/high variance technique, ensemble decreases model variance and lead to low bias/low variance model. See Breinman for more explanation about why the boosting tree performs well in general. However, boosting can not significantly improve the low variance model. So applying boosting to K-Nearest Neighbor (KNN) doesn’t lead to as good improvement as applying boosting to statistical learning methods like naive Bayes .

### 11.7.2 Stochastic Gradient Boosting

As mentioned before, Friedman provided a statistical framework for the AdaBoost algorithm and pointed out that boosting can be considered as a forward stagewise additive model that minimizes exponential loss. The framework led to some generalized algorithms such as Real AdaBoost, Gentle AdaBoost, and LogitBoost. Those algorithms later were unified under a framework called gradient boosting machine. The last section of the chapter illustrates how boosting can be considered as an additive model.

Consider a 2-class classification problem. You have the response $$y \in \{-1, 1\}$$ and the sample proportion of class 1 from the training set is $$p$$. $$f(x)$$ is the model prediction in the range of $$[-\infty, +\infty]$$ and the predicted event probability is $$\hat{p}=\frac{1}{1+exp[-f(x)]}$$. The gradient boosting for this problem is as follows:

When using the tree as the base learner, basic gradient boosting has two tuning parameters: tree depth and the number of iterations. You can further customize the algorithm by selecting a different loss function and gradient . The final line of the loop includes a regularization strategy. Instead of adding $$f_i^{(j)}$$ to the previous iteration’s $$f_i$$, only a fraction of the value is added. This fraction is called learning rate which is $$\lambda$$ in the algorithm. It can take values between 0 and 1 which is another tuning parameter of the model.

The way to calculate variable importance in boosting is similar to a bagging model. You get variable importance by combining measures of importance across the ensemble. For example, we can calculate the Gini index improvement for each variable across all trees and use the average as the measurement of the importance.

Boosting is a very popular method for classification. It is one of the methods that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing. Applying boosting on tree models significantly improves predictive accuracy. Some advantages of trees that are sacrificed by boosting are speed and interpretability.

Let’s look at the R implementation.

gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7, 9),
n.trees = 1:5,
shrinkage = c(.01, .1),
n.minobsinnode = c(1:10))

set.seed(100)
gbmTune <- caret::train(x = trainx,
y = trainy,
method = "gbm",
tuneGrid = gbmGrid,
metric = "ROC",
verbose = FALSE,
trControl = trainControl(method = "cv",
classProbs = TRUE,
savePredictions = TRUE))
# only show part of the output
gbmTune
Stochastic Gradient Boosting

1000 samples
11 predictor
2 classes: 'Female', 'Male'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
Resampling results across tuning parameters:

shrinkage interaction.depth n.minobsinnode n.trees ROC    Sens Spec
0.01      1                 1              1       0.6821 1.00 0.00
0.01      1                 1              2       0.6882 1.00 0.00
.
.
.
0.01      5                 8              4       0.7096 1.00 0.00
0.01      5                 8              5       0.7100 1.00 0.00
0.01      5                 9              1       0.7006 1.00 0.00
0.01      5                 9              2       0.7055 1.00 0.00
[ reached getOption("max.print") -- omitted 358 rows ]

ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 4,
interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 6.

The results show that the tuning parameter settings that lead to the best ROC are n.trees = 4 (number of trees), interaction.depth = 3 (depth of tree), shrinkage = 0.01 (learning rate) and n.minobsinnode = 6 (minimum number of observations in each node).

Now, let’s compare the results from the three tree models.

treebagRoc <- pROC::roc(response = bagTune$pred$obs,
predictor = bagTune$pred$Female,
levels = rev(levels(bagTune$pred$obs)))

rfRoc <- pROC::roc(response = rfTune$pred$obs,
predictor = rfTune$pred$Female,
levels = rev(levels(rfTune$pred$obs)))

gbmRoc <- pROC::roc(response = gbmTune$pred$obs,
predictor = gbmTune$pred$Female,
levels = rev(levels(gbmTune$pred$obs)))

plot.roc(rpartRoc,
type = "s",
print.thres = c(.5), print.thres.pch = 16,
print.thres.pattern = "", print.thres.cex = 1.2,
col = "black", legacy.axes = TRUE,
print.thres.col = "black")

plot.roc(treebagRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 3,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "red", print.thres.col = "red")

plot.roc(rfRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 1,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "green", print.thres.col = "green")

plot.roc(gbmRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 10,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "blue", print.thres.col = "blue")

legend(0.2, 0.5, cex = 0.8,
c("Single Tree", "Bagged Tree",
"Random Forest", "Boosted Tree"),
lwd = c(1, 1, 1, 1),
col = c("black", "red", "green", "blue"),
pch = c(16, 3, 1, 10))

Since the data here doesn’t have many variables, we don’t see a significant difference among the models. But you can still see those ensemble methods are better than a single tree. In most of the real applications, ensemble methods perform much better. Random forest and boosting trees can be a baseline model. Before exploring different models, you can quickly run a random forest to see the performance and then try to improve that performance. If the performance you got from the random forest is not too much better than guessing, you should consider collecting more data or reviewing the problem to frame it a different way instead of trying other models. Because it usually means the current data is not enough to solve the problem.

### References

al, BenDor A et. 2000. “Tissue Classification with Gene Expression Profiles.” Journal of Computational Biology 7 (3): 559–83.
Bauer, Eric, and Ron Kohavi. 1999. “An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants.” Machine Learning 36: 105–42.
Bergstra, James, Norman Casagrande, Dumitru Erhan, Douglas Eck, and Balazs Kegl. 2006. “Aggregate Features and AdaBoost for Music Classification.” Machine Learning 65: 473–84.
Breiman, Leo. 1998. “Arcing Classifiers.” The Annals of Statistics 26: 123–40.
Dudoit S, Fridlyand J, and Speed T. 2002. “Comparison of Discrimination Meth- Ods for the Classification of Tumors Using Gene Expression Data.” Journal of the American Statistical Association 97 (457): 77–87.
Freund, Y., and R. Schapire. 1997. “A Decision-Theoretic Generalization of Online Learning and an Application to Boosting.” Journal of Computer and System Sciences 55: 119–39.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2000. “Additive Logistic Regression: A Statistical View of Boosting.” Annals of Statistics 38: 337–74.
Hastie T, Friedman J, Tibshirani R. 2008. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.
L, Valiant. 1984. “A Theory of the Learnable.” Communications of the ACM 27: 1134–42.
M, Kearns, and Valiant L. 1989. “Cryptographic Limitations on Learning Boolean Formulae and Finite Automata.”
Varmuza K, He P, and Fang K. 2003. “Boosting Applied to Classification of Mass Spectral Data.” Journal of Data Science 1 (391–404).
YFR, Schapire. 1999. “Adaptive Game Playing Using Multiplicative Weights.” Games and Economic Behavior 29: 79–103.