11.1 Splitting Criteria

The splitting criteria used by regression trees and classification trees are different. Like the regression tree, the goal of the classification tree is to divide the data into smaller, more homogeneous groups. Homogeneity here means that most of the samples at each node come from one class. The original CART algorithm uses Gini impurity as the splitting criterion; the later ID3, C4.5, and C5.0 algorithms use entropy. We will look at the three most common splitting criteria.

Gini impurity

Gini impurity (Breiman et al. 1984) is a measure of non-homogeneity. It is widely used in classification trees. For a two-class problem, the Gini impurity of a given node is defined as:

\[p_{1}(1-p_{1})+p_{2}(1-p_{2})\]

where \(p_{1}\) and \(p_{2}\) are the probabilities of the two classes, respectively. It is easy to see that when the sample set is pure, one of the probabilities is 0 and the Gini score is at its smallest. Conversely, when \(p_{1}=p_{2}=0.5\), the Gini score is at its largest and the purity of the node is at its lowest. Let’s look at an example. Suppose we want to determine which students are computer science (CS) majors. Here is a simple hypothetical classification tree obtained with the gender variable.

Let’s calculate the Gini impurity for splitting node “Gender”:

  1. Gini impurity for “Female” = \(\frac{1}{6}\times\frac{5}{6}+\frac{5}{6}\times\frac{1}{6}=\frac{5}{18}\)
  2. Gini impurity for “Male” = \(0\times1+1\times 0=0\)

The Gini impurity for the node “Gender” is the following weighted average of the above two scores:

\[\frac{3}{5}\times\frac{5}{18}+\frac{2}{5}\times 0=\frac{1}{6}\]

The Gini impurity for the 50 samples in the parent node is \(\frac{1}{2}\), so the split decreases the Gini impurity from \(\frac{1}{2}\) to \(\frac{1}{6}\): the split using “gender” causes a Gini impurity decrease of \(\frac{1}{3}\). The algorithm tries different variables to split the data and chooses the one that causes the most substantial Gini impurity decrease.
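The calculation above is easy to reproduce in code. Below is a minimal sketch (Python is assumed here, since the text does not fix a language); the class counts are read off the hypothetical tree: the “Female” node has 30 students, 5 of one class and 25 of the other, and the “Male” node has 20 students, all of one class.

```python
def gini(counts):
    """Gini impurity of a node, given the class counts in that node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

parent = [25, 25]   # 50 students, half CS and half non-CS
female = [5, 25]    # counts read off the hypothetical tree above
male   = [20, 0]

n_f, n_m = sum(female), sum(male)
n = n_f + n_m

# Weighted average of the children's impurities
split_gini = (n_f / n) * gini(female) + (n_m / n) * gini(male)

print(gini(parent))               # 0.5
print(gini(female))               # 5/18 ~= 0.278
print(split_gini)                 # 1/6 ~= 0.167
print(gini(parent) - split_gini)  # 1/3 ~= 0.333, the impurity decrease
```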

Information gain

Looking at the samples in the following three nodes, which one is the easiest to describe? It is obviously C: since all the samples in C are of the same type, describing it requires the least amount of information. By contrast, B needs more information and A needs the most. In other words, C has the highest purity, B the second highest, and A the lowest. We need less information to describe nodes with higher purity.

A measure of the degree of disorder is entropy, which is defined as:

\[\text{Entropy}=-p\log_{2}p-(1-p)\log_{2}(1-p)\]

where \(p\) is the proportion of samples of one type. If all the samples in a node are of one type (such as C), the entropy is 0. If each type makes up 50% of the node, the entropy is 1. We can use entropy as the splitting criterion; the goal is to decrease entropy as the tree grows.

Similarly, the entropy of a splitting node is the weighted average of the entropy of each child. In the above tree for the students, the entropy of the root node with all 50 students is \(-\frac{25}{50}\log_{2}\frac{25}{50}-\frac{25}{50}\log_{2}\frac{25}{50}=1\). Here an entropy of 1 indicates that the purity of the node is the lowest, that is, each type takes up half of the samples.

The entropy of the split using variable “gender” can be calculated in three steps:

  1. Entropy for “Female” = \(-\frac{5}{30}\log_{2}\frac{5}{30}-\frac{25}{30}\log_{2}\frac{25}{30}=0.65\)
  2. Entropy for “Male” = \(-\frac{20}{20}\log_{2}\frac{20}{20}=0\) (the node is pure, so its entropy is 0)
  3. Entropy for the node “Gender” is the weighted average of the above two entropy values: \(\frac{3}{5}\times 0.65+\frac{2}{5}\times 0=0.39\)

So entropy decreases from 1 to 0.39 after the split.
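The same kind of sketch works for entropy. Using the same hypothetical class counts as in the Gini sketch above, the weighted child entropy and the resulting information gain are:

```python
import math

def entropy(counts):
    """Entropy of a node, given the class counts (0 * log2(0) treated as 0)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent, female, male = [25, 25], [5, 25], [20, 0]
n_f, n_m = sum(female), sum(male)
n = n_f + n_m

# Weighted average of the children's entropies
split_entropy = (n_f / n) * entropy(female) + (n_m / n) * entropy(male)

print(entropy(parent))                  # 1.0
print(entropy(female))                  # ~= 0.65
print(split_entropy)                    # ~= 0.39
print(entropy(parent) - split_entropy)  # ~= 0.61, the information gain
```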

Sum of Squared Errors (SSE)

The previous two metrics are for classification trees. SSE is the most widely used splitting metric for regression. Suppose you want to divide the data set \(S\) into two groups, \(S_{1}\) and \(S_{2}\), where \(S_{1}\) and \(S_{2}\) are chosen to minimize the sum of squared errors:

\[\begin{equation} SSE=\sum_{i\in S_{1}}(y_{i}-\bar{y}_{1})^{2}+\sum_{i\in S_{2}}(y_{i}-\bar{y}_{2})^{2} \tag{11.1} \end{equation}\]

In equation (11.1), \(\bar{y}_{1}\) and \(\bar{y}_{2}\) are the averages of the samples in \(S_{1}\) and \(S_{2}\), respectively. A regression tree grows by automatically deciding on the splitting variables and split points that maximize the SSE reduction. Since this process is essentially recursive segmentation, the approach is also called recursive partitioning.

Take a look at this simple regression tree for the height of 10 students:

You can calculate the SSE for each node; the results are listed below, and a code sketch for this calculation follows the comparison of the two candidate splits.

  1. SSE for “Female” is 136
  2. SSE for “Male” is 32
  3. SSE for splitting node “Gender” is the sum of the above two numbers which is 168

The SSE for the 10 students in the root node is 522.9. After the split, the SSE decreases from 522.9 to 168.

Another possible way of splitting the data is by major, as follows:

In this situation:

  1. SSE for “Math” is 184
  2. SSE for “English” is 302.8
  3. SSE for splitting node “Major” is the sum of the above two numbers which is 486.8

Splitting data using variable “gender” reduced SSE from 522.9 to 168; using variable “major” reduced SSE from 522.9 to 486.8. Based on SSE reduction, you should use gender to split the data.
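The comparison of the two candidate splits can be expressed with a few lines of code. The sketch below is in Python; the ten heights and the two groupings are hypothetical placeholders (the actual values appear only in the tree figures), so the printed numbers will not reproduce 522.9, 168, and 486.8, but the computation follows equation (11.1).

```python
def sse(values):
    """Sum of squared errors of a group around its mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def sse_after_split(groups):
    """Total SSE when the data are partitioned into the given groups."""
    return sum(sse(g) for g in groups)

heights = [60, 62, 63, 65, 66, 68, 70, 71, 73, 75]  # hypothetical heights
by_gender = [heights[:5], heights[5:]]               # hypothetical grouping
by_major  = [heights[::2], heights[1::2]]            # hypothetical grouping

root_sse = sse(heights)
for name, groups in [("gender", by_gender), ("major", by_major)]:
    # The split with the larger SSE reduction is preferred
    print(name, root_sse - sse_after_split(groups))
```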

The three splitting criteria mentioned above are the basis for building a tree model.

References

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Chapman & Hall/CRC. ISBN 978-0412048418.