Chapter 11 Tree-Based Methods

The tree-based models can be used for regression and classification. The goal is to stratify or segment the predictor space into a number of sub-regions. For a given observation, use the mean or the mode of the training observations in the sub-region as the prediction. Tree-based methods are conceptually simple yet powerful. This type of model is often referred to as Classification And Regression Trees (CART). They are popular tools for many reasons:

  1. Do not require user to specify the form of the relationship between predictors and response
  2. Do not require (or if they do, very limited) data preprocessing and can handle different types of predictors (sparse, skewed, continuous, categorical, etc.)
  3. Robust to co-linearity
  4. Can handle missing data
  5. Many pre-built packages make implementation as easy as a button push

CART can refer to the tree model in general, but most of the time, it represents the algorithm initially proposed by Breiman (al 1984). After Breiman, there are many new algorithms, such as ID3, C4.5, and C5.0. C5.0 is an improved version of C4.5, but since C5.0 is not open source, the C4.5 algorithm is more popular. C4.5 was a major competitor of CART. But now, all those seem outdated. The most popular tree models are Random Forest (RF) and Gradient Boosting Machine (GBM). Despite being out of favor in application, it is important to understand the mechanism of the basic tree algorithm. Because the later models are based on the same foundation.

The original CART algorithm targets binary classification, and the later algorithms can handle multi-category classification. A single tree is easy to explain but has poor accuracy. More complicated tree models, such as RF and GBM, can provide much better prediction at the cost of explainability. As the model becoming more complicated, it is more like a black-box which makes it very difficult to explain the relationship among predictors. There is always a trade-off between explainability and predictability.

The reason why it is called “tree” is of course because the structure has similarities. But the direction of the decision tree is opposite to the real tree, the root is on the top, and the leaf is on the bottom. From the root node, a decision tree divides to different branches and generates more nodes. The new nodes are child nodes, and the previous node is the parent node. At each child node, the algorithm will decide whether to continue dividing. If it stops, the node is called a leaf node. If it continues, then the node becomes the new parent node and splits to produce the next layer of child nodes. At each non-leaf node, the algorithm needs to decide split into branches. The leaf node contains the final “decision,” final class the sample belongs to or the sample’s value has. Here are the important definitions in the tree model:

  • Classification tree: the outcome is discrete
  • Regression tree: the outcome is continuous (e.g. the price of a house, or a patient’s length of stay in a hospital)
  • Non-leaf node (or split node): the algorithm needs to decide a split at each non-leaf node (eg: \(age \leq 25\), \(25 < age \leq 35\), \(age > 35\))
  • Root node:the beginning node where the tree starts
  • Leaf node (or Terminal node): the node stops splitting. It has the final decision of the model
  • Degree of the node: the number of subtrees of a node
  • Degree of the tree: the maximum degree of a node in the tree
  • Pruning: remove parts of the tree that do not provide power to classify instances
  • Branch (or Subtree): the whole part under a non-leaf node
  • Child: the node directly after and connected to another node
  • Parent: the converse notion of a child

Single tree is easy to explain but has high variance and low accuracy, and hence is very limited. Minor changes in the training data can lead to large changes in the fitted tree. A series of rectangular decision regions defined by a single tree is often too naive to represent the relationship between the dependent variable and the predictors. To overcome these shortcomings, researchers have proposed ensemble methods which combine many trees. Ensemble tree models typically have much better predictive performance than a single tree. We will introduce those models in later sections.

References

al, Leo Breiman et. 1984. Classification and Regression Trees. ISBN 978-0412048418. Chapman; Hall/CRC.