## 11.6 Gradient Boosted Machine

Boosting models were developed in the 1980s (L 1984; M and L 1989) and were originally for classification problems. Due to the excellent model performance, they were widely used for a variety of applications, such as gene expression (Dudoit S and T 2002; al 2000), chemical substructure classification (Varmuza K and K 2003), music classification (al 2006), etc. The first effective implementation of boosting is Adaptive Boosting (AdaBoost) algorithm came up by Yoav Freund and Robert Schapire in 1996 (YFR 1999). After that, some researchers (Friedman J and R 2000) started to connect the boosting algorithm with some statistical concepts, such as loss function, additive model, logistic regression. Friedman pointed out that boosting can be considered as a forward stagewise additive model that minimizes exponential loss. The new view of boosting in a statistical framework enabled the method to be extended to regression problems.

The idea is to combine a group of weak learners (a classifier that is marginally better than random guess) to produce a strong learner. Like bagging, boosting is a general approach that can be applied to different learners. Here we focus on the decision tree. Recall that both bagging and random forest create multiple copies of the original training data using the bootstrap, fitting a separate decision tree to each copy and combining all the results to create a single prediction. Boosting also creates different trees but the trees are grown sequentially and each tree is a weak learner. Any modeling technique with tuning parameters can produce a range of learners, from weak to strong. You can easily make a weak learner by restricting the depth of the tree. There are different types of boosting. Here we introduce two main types: adaptive boosting and stochastic gradient boosting.

### 11.6.1 Adaptive Boosting

Yoav Freund and Robert Schapire (Freund and Schapire 1997) came up the AdaBoost.M1 algorithm. Consider a binary classification problem where the response variable has two categories \(Y \in \{-1, 1\}\). Given predictor matrix, \(X\), construct a classifier \(G(X)\) that predicts \(1\) or \(-1\). The corresponding error rate in the training set is:

\[\bar{err}=\frac{1}{N}\Sigma_{i=1}^NI(y_i\neq G(x_i))\]

The algorithm produces a series of classifiers \(G_m(x),\ m=1,2,...,M\) from different iterations. In each iteration, it finds the best classifier based on the current weights. The misclassified samples in the \(m^{th}\) iteration will have higher weights in the \(m+1^{th}\) iteration and the correctly classified samples will have lower weights. As it moves on, the algorithm will put more effort into the “difficult” samples until it can correctly classify them. So it requires the algorithm to change focus at each iteration. At each iteration, the algorithm will calculate a stage weight based on the error rate. The final prediction is a weighted average of all those weak classifiers using stage weights from all the iterations:

\[G(x)=sign ( \Sigma_{m=1}^M \alpha_{m}G_m(x))\] where \(\alpha_1,\alpha_2,...,\alpha_M\) are the weights from different iterations.

**AdaBoost.M1**

- Response variables have two values: +1 and -1
- Initialize the observation to have the same weights: \(w_i=\frac{1}{N},i=1,...,N\)
for m = 1 to M:

Fit a classifier \(G_m(x)\) to the training data using weights \(w_i\)

Compute the error rate for the \(m^th\) classifier: \(err_m=\frac{\Sigma_{i=1}^Nw_i I(y_i\neq G_m(x_i))}{\Sigma_{i=1}^Nw_i}\)

Compute the stage weight for \(m^{th}\) iteration: \(\alpha_m=log\frac{1-err_m}{err_m}\)

Update \(w_i = w_i\cdot exp[\alpha_m\cdot I(y_i \neq G_m(x_i))],\ i=1,2,\dots,N\)

Calculate the prediction：\(G(x)=sign[\Sigma_{m=1}^M\alpha_mG_m(x)]\)，where \(sign(\cdot)\) means if \(\cdot\) is positive, then the sample is classified as +1, -1 otherwise.

Since the classifier \(G_m(x)\) returns discrete value, the AdaBoost.M1 algorithm is known as “Discrete AdaBoost” (Friedman J and R 2000). You can revise the above algorithm if it returns continuous value, for example, a probability(Friedman J and R 2000). As mentioned before, boosting is a general approach that can be applied to different learners. Since you can easily create weak learners by limiting the depth of the tree, the boosting tree is a common method. Since the classification tree is a low bias/high variance technique, ensemble decreases model variance and lead to low bias/low variance model. See Breinman(Breiman 1998) for more explanation about why the boosting tree performs well in general. However, boosting can not significantly improve the low variance model. So applying boosting to Latent Dirichlet Allocation (LDA) or K-Nearest Neighbor (KNN) doesn’t lead to as good improvement as applying boosting to statistical learning methods like naive Bayes (E and R 1999).

### 11.6.2 Stochastic Gradient Boosting

As mentioned before, Friedman (Friedman J and R 2000) provided a statistical framework for the AdaBoost algorithm and pointed out that boosting can be considered as a forward stagewise additive model that minimizes exponential loss. The framework led to some generalized algorithms such as Real AdaBoost, Gentle AdaBoost, and LogitBoost. Those algorithms later were unified under a framework called gradient boosting machine. The last section of the chapter illustrates how boosting can be considered as an additive model.

Consider a 2-class classification problem. You have the response \(y \in \{0, 1\}\) and the sample proportion of class 1 from the training set is \(p\). \(f(x)\) is the model prediction in the range of \([-\infty, +\infty]\) and the predicted event probability is \(\hat{p}=\frac{1}{1+exp[-f(x)]}\). The gradient boosting for this problem is as follows:

- Initialize all predictions to the sample log-odds: \(f_{i} = log \frac{\hat{p}}{1- \hat{p}}\)
**for**j=1 … M**do**- Compute predicted event probability: \(\hat{p}_i=\frac{1}{1+exp[-f_{i}(x)]}\).
- Compute the residual (i.e. gradient): \(z_i=y_i-\hat{p}_i\)
- Randomly sample the training data
- Train a tree model on the random subset using the residuals as the outcome
- Compute the terminal node estimates of the Pearson residuals: \(r_i=\frac{1/n\Sigma_i^n(y_i-\hat{p}_i)}{1/n\Sigma_i^n\hat{p}_i(1-\hat{p}_i)}\)
- Update f：\(f_i=f_i+\lambda f_i^{(j)}\)

- end

When using the tree as the base learner, basic gradient boosting has two tuning parameters: tree depth and the number of iterations. You can further customize the algorithm by selecting a different loss function and gradient(Hastie T 2008). The final line of the loop includes a regularization strategy. Instead of adding \(f_i^{(j)}\) to the previous iteration’s \(f_i\), only a fraction of the value is added. This fraction is called learning rate which is \(\lambda\) in the algorithm. It can take values between 0 and 1 which is another tuning parameter of the model.

The way to calculate variable importance in boosting is similar to a bagging model. You get variable importance by combining measures of importance across the ensemble. For example, we can calculate the Gini index improvement for each variable across all trees and use the average as the measurement of the importance.

Boosting is a very popular method for classification. It is one of the methods that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing. Applying boosting on tree models significantly improves predictive accuracy. Some advantages of trees that are sacrificed by boosting are speed and interpretability.

Let’s look at the R implementation.

```
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7, 9),
n.trees = 1:5,
shrinkage = c(.01, .1),
n.minobsinnode = c(1:10))
set.seed(100)
gbmTune <- train(x = trainx,
y = trainy,
method = "gbm",
tuneGrid = gbmGrid,
metric = "ROC",
verbose = FALSE,
trControl = trainControl(method = "cv",
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = TRUE))
```

```
Stochastic Gradient Boosting
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
Resampling results across tuning parameters:
shrinkage interaction.depth n.minobsinnode n.trees ROC Sens Spec
0.01 1 1 1 0.6821 1.0000 0.000000
0.01 1 1 2 0.6882 1.0000 0.000000
0.01 1 1 3 0.6936 1.0000 0.000000
.
.
.
0.01 5 8 2 0.7127 1.0000 0.000000
0.01 5 8 3 0.7148 1.0000 0.000000
0.01 5 8 4 0.7096 1.0000 0.000000
0.01 5 8 5 0.7100 1.0000 0.000000
0.01 5 9 1 0.7006 1.0000 0.000000
0.01 5 9 2 0.7055 1.0000 0.000000
[ reached getOption("max.print") -- omitted 358 rows ]
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 4, interaction.depth = 3, shrinkage
= 0.01 and n.minobsinnode = 6.
```

The results show that the tuning parameter settings that lead to the best ROC are `n.trees = 4`

(number of trees), `interaction.depth = 3`

(depth of tree), `shrinkage = 0.01`

(learning rate) and `n.minobsinnode = 6`

(minimum number of observations in each node).

Now, let’s compare the results from the three tree models.

```
treebagRoc <- roc(response = bagTune$pred$obs,
predictor = bagTune$pred$Female,
levels = rev(levels(bagTune$pred$obs)))
```

`## Setting direction: controls < cases`

```
rfRoc <- roc(response = rfTune$pred$obs,
predictor = rfTune$pred$Female,
levels = rev(levels(rfTune$pred$obs)))
```

`## Setting direction: controls < cases`

```
gbmRoc <- roc(response = gbmTune$pred$obs,
predictor = gbmTune$pred$Female,
levels = rev(levels(gbmTune$pred$obs)))
```

`## Setting direction: controls < cases`

```
plot.roc(rpartRoc,
type = "s",
print.thres = c(.5),
print.thres.pch = 16,
print.thres.pattern = "",
print.thres.cex = 1.2,
col = "black", legacy.axes = TRUE,
print.thres.col = "black")
plot.roc(treebagRoc,
type = "s",
add = TRUE, print.thres = c(.5),
print.thres.pch = 3, legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "red", print.thres.col = "red")
plot.roc(rfRoc,
type = "s",
add = TRUE, print.thres = c(.5),
print.thres.pch = 1, legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "green", print.thres.col = "green")
plot.roc(gbmRoc,
type = "s",
add = TRUE, print.thres = c(.5),
print.thres.pch = 10, legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "blue", print.thres.col = "blue")
legend(0.2, 0.5, cex = 0.8,
c("Single Tree", "Bagged Tree", "Random Forest", "Boosted Tree"),
lwd = c(1, 1, 1, 1),
col = c("black", "red", "green", "blue"),
pch = c(16, 3, 1, 10))
```

Since the data here doesn’t have many variables, we don’t see a significant difference among the models. But you can still see those ensemble methods are better than a single tree. In most of the real applications, ensemble methods perform much better. Random forest and boosting trees can be a baseline model. Before exploring different models, you can quickly run a random forest to see the performance and then try to improve that performance. If the performance you got from the random forest is not too much better than guessing, you should consider collecting more data or reviewing the problem to frame it a different way instead of trying other models. Because it usually means the current data is not enough to solve the problem.

### 11.6.3 Boosting as Additive Model

This section illustrates how boosting is a forward stagewise additive model that minimizes exponential loss (Friedman J and R 2000). Many seemingly different models can be represented as a **basis expansion model**. Recall the classifier obtained by the AdaBoost.M1 algorithm:

\[G(x)=sign ( \Sigma_{m=1}^M \alpha_{m}G_m(x))\]

The above expression fits in the framework of basis expansion which is as following:

\[\begin{equation} f(x)=\Sigma_{m=1}^M \beta_m b(x,\gamma_m) \tag{11.2} \end{equation}\]

where \(\beta_m,\ m=1,\dots,M\) is expansion coefficient and \(b(x,\gamma)\in \mathbb{R}\) is basis function. Many of the learning methods fit into this additive framework. Hastie et al discuss basis expansion in detail in Chapter 5 of the book (Hastie T 2008) and cover the additive expansion for different learning techniques (Hastie T 2008), such as single-hidden-layer neural networks (Chapter 11) and MARS (Section 9.4). These models are fit by minimizing a designated loss function averaged over the training data:

\[\begin{equation} \underset{\{\beta_m,\gamma_m\}_i^M}{min}\Sigma_{i=1}^N L\left(y_i,\Sigma_{m=1}^M\beta_{m}b(x_i;\gamma_m)\right) \tag{11.3} \end{equation}\]

Different models have different basis function \(b(x_i;\gamma_m)\). You can also customize loss function \(L(\cdot)\), such as squared-error loss, or likelihood-based loss. Optimizing the loss function across the whole training set is usually computation costly no matter the choice of basis and loss. The good news is that the problem can be simplified as fitting a single basis function.

\[\underset{\beta,\gamma}{min}=\Sigma_{i=1}^N L(y_i,\beta b(x_i;\gamma))\]

The forward stagewise algorithm approximates the optimal solution of equation (11.3). It adds new basis functions to the expansion without adjusting the previous ones. The forward stagewise additive algorithm is as following (section 10.2, (Hastie T 2008)):

- Initialize \(f_0(x)=0\)
**for**\(m=1,\dots,M\)**do**：- compute \[(\beta_m,\gamma_m)=\underset{\beta,\gamma}{argmin}\Sigma_{i=1}^NL(y_i,f_{m-1}(x_i)+\beta b(x_i;\gamma))\]
- set \(f_m(x)=f_{m-1}(x)+\beta_m b(x;\gamma_m)\)

At iteration m, it will search for the optimal \(b(x;\gamma_m)\) and \(\beta_m\) based on the previous basis function \(f_{m-1}(x)\). And then add the new basis \(b(x;\gamma_m)\beta_m\) to the previous basis function to get a new basis function \(f_m(x)\) without changing any parameters from previous steps. Assume we use squared-error loss:

\[L(y,f(x))=(y-f(x))^2\]

Then we have:

\[L(y_i,f_{m-1}(x_i)+\beta b(x_i;\gamma))=(y_i-f_{m-1}(x_i)-\beta b(x_i;\gamma))^2\]

where \(y_i-f_{m-1}(x_i)\) is the residual for sample i based on the previous model. That is to say, it is fitting the new basis using the residual of the previous step.

However, the squared-error loss is generally not a good choice. For a regression problem, it is susceptible to outliers. It doesn’t fit the classification problem since the response is categorical. Hence we often consider other loss functions.

Now let is go back to the AdaBoost.M1 algorithm. It is actually a special case of the above forward stagewise model when the loss function is:

\[L(y,f(x))=exp(-yf(x))\]

In AdaBoost.M1, the basis function is the classifier from each iteration \(G_m(x)\in \{-1,1\}\). If we use the exponential loss, the optimization problem is

\[\begin{equation} \begin{array}{ccc} (\beta_m,G_m) & = & \underset{\beta,G}{argmin}\Sigma_{i=1}^N exp[-y_i(f_{m-1}(x_i)+\beta G(x_i))]\\ & = & \underset{\beta, G}{argmin}\Sigma_{i=1}^N exp[-y_i \beta G(x_i)]\cdot exp[-y_if_{m-1}(x_i)]\\ & = & \underset{\beta, G}{argmin}\Sigma_{i=1}^N w_i^m exp[-y_i\beta G(x_i)] \end{array} \tag{11.4} \end{equation}\]

where \(w_i^m= exp[-y_if_{m-1}(x_i)]\). It does not depend on \(\beta\) and \(G(x)\). So we can consider it as the weight for each sample. Since the weight is related to \(f_{m-1}(x_i)\), it changes each iteration. We can further decompose equation (11.4):

\[\begin{equation} \begin{array}{ccc} (\beta_{m},G_{m}) & = & \underset{\beta,G}{argmin}\Sigma_{i=1}^{N}w_{i}^{m}exp[-y_{i}\beta G(x_{i})]\\ & = & \underset{\beta,G}{argmin}\Sigma_{i=1}^{N}\left\{ w_{i}^{m}e^{-\beta}I(y_{i}=G(x))+w_{i}^{m}e^{\beta}I(y_{i}\neq G(x))\right\} \\ & = & \underset{\beta,G}{argmin}\Sigma_{i=1}^{N}\left\{ w_{i}^{m}e^{-\beta}[1-I(y_{i}\neq G(x))]+w_{i}^{m}e^{\beta}I(y_{i}\neq G(x))\right\} \\ & = & \underset{\beta,G}{argmin}\left\{ (e^{\beta}-e^{-\beta})\cdot\Sigma_{i=1}^{N}w_{i}^{m}I(y_{i}\neq G(x_{i}))+e^{-\beta}\cdot\Sigma_{i=1}^{N}w_{i}^{m}\right\} \end{array} \tag{11.5} \end{equation}\]

when \(\beta >0\), the solution for equation (11.5) is:

\[G_{m} = \underset{G}{argmin}\Sigma_{i=1}^{N}w_{i}^{m}I(y_{i}\neq G(x))\]

It is the classifier that minimizes the weighted error. Plug the above \(G_m\) into equation (11.5). Take the derivative with respect to \(\beta\) and set it to be 0 to solve the optimal \(\beta_m\):

\[\beta_m =\frac{1}{2}ln\frac{1-err_m}{err_m}\]

where

\[err_m = \frac{\Sigma_{i=1}^N w_i^{m}I(y_i \neq G_m(x_i))}{\Sigma_{i=1}^N w_i^{m}}\]

According to the forward stagewise algorithm, the result is updated as:

\[f_m(x)=f_{m-1}(x)+\beta_m G_m(x)\]

We can go ahead and get the weight for the next iteration:

\[\begin{equation} \begin{array}{ccc} w_i^{m+1} & = & exp[-y_if_m (x_i)]\\ & = & exp[-y_if_{m-1}(x)-y_i \beta_m G_m(x)]\\ & = & w_{i}^{m}\cdot exp[-\beta_m y_i G_m(x_i)] \end{array} \tag{11.6} \end{equation}\]

Since \(-y_i G_m(x_i)=2\cdot I(y_i \neq G_m(x_i))-1\), equation (11.5) can be written as:

\[w_i^{m+1}=w_i^m \cdot exp[\alpha_mI(y_i\neq G_m(x_i))] \cdot exp[-\beta_m]\]

where \(\alpha_m=2\beta_m=log\frac{1-err_m}{err_m}\) is the same with the \(\alpha_m\) in AdaBoost.M1 algorithm we showed before. So AdaBoost.M1 is a special case of a forward stagewise additive model using exponential loss. For comparing and selecting different loss functions, you can refer to section 10.5 and 10.6 in (Hastie T 2008).

### References

al, BenDor A et. 2000. “Tissue Classification with Gene Expression Profiles.” *Journal of Computational Biology* 7 (3): 559–83.

al, Bergstra J et. 2006. “Aggregate Features and Adaboost for Music Classification.” *Machine Learning* 65: 473–84.

Breiman, Leo. 1998. “Arcing Classifiers.” *The Annals of Statistics* 26: 123–40.

Dudoit S, Fridlyand J, and Speed T. 2002. “Comparison of Discrimination Meth- Ods for the Classification of Tumors Using Gene Expression Data.” *Journal of the American Statistical Association* 97 (457): 77–87.

E, Bauer, and Kohavi R. 1999. “An Empirical Comparison of Voting Classifica- Tion Algorithms: Bagging, Boosting, and Variants.” *Machine Learning* 36: 105–42.

Freund, Y., and R. Schapire. 1997. “A Decision-Theoretic Generalization of Online Learning and an Application to Boosting.” *Journal of Computer and System Sciences* 55: 119–39.

Friedman J, Hastie T, and Tibshirani R. 2000. “Additive Logistic Regression: A Statistical View of Boosting.” *Annals of Statistics* 38: 337–74.

Hastie T, Friedman J, Tibshirani R. 2008. *The Elements of Statistical Learning: Data Mining, Inference and Prediction*. 2nd ed. Springer.

L, Valiant. 1984. “A Theory of the Learnable.” *Communications of the ACM* 27: 1134–42.

M, Kearns, and Valiant L. 1989. “Cryptographic Limitations on Learning Boolean Formulae and Finite Automata.”

Varmuza K, He P, and Fang K. 2003. “Boosting Applied to Classification of Mass Spectral Data.” *Journal of Data Science* 1 (391–404).

YFR, Schapire. 1999. “Adaptive Game Playing Using Multiplicative Weights.” *Games and Economic Behavior* 29: 79–103.