11.3 Tree Pruning
Pruning is the process of reducing the size of a decision tree. It reduces the risk of overfitting by limiting the size of the tree or removing sections of the tree that provide little predictive power.
Limit the size
You can limit the tree size by setting some parameters.
Minimum sample size at each node: Defining a minimum sample size at each node helps to prevent leaf nodes from containing only one sample. The minimum sample size can be a tuning parameter. If it is too large, the model tends to under-fit. If it is too small, the model tends to over-fit. In the case of severe class imbalance, the minimum sample size may need to be smaller because the number of samples in the minority class is small.
Maximum depth of the tree: If the tree grows too deep, the model tends to over-fit. It can be a tuning parameter.
Maximum number of terminal nodes: A limit on the number of terminal nodes works much like a limit on the depth of the tree, since the two grow together (a binary tree of depth \(d\) has at most \(2^{d}\) terminal nodes).
The number of variables considered for each split: at each split, the algorithm randomly selects a subset of variables and searches for the best split point among them. In general, the square root of the total number of variables works well, which is also the default setting for many functions. However, people often treat it as a tuning parameter.
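As a minimal sketch of how these limits look in practice, scikit-learn's `DecisionTreeClassifier` (one common implementation; the specific values below are illustrative, not recommendations) exposes each of them as a constructor argument:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument corresponds to one of the size limits discussed above;
# the values are illustrative, not tuned recommendations.
tree = DecisionTreeClassifier(
    min_samples_leaf=5,    # minimum sample size at each terminal node
    max_depth=4,           # maximum depth of the tree
    max_leaf_nodes=16,     # maximum number of terminal nodes
    max_features="sqrt",   # number of variables considered for each split
    random_state=42,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```

In practice each of these arguments would be chosen by cross-validation rather than set by hand.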
Remove branches
Another way is to first let the tree grow as much as possible and then go back and remove insignificant branches, which reduces the size of the tree. The idea is to deliberately overfit the training set and then correct the overfitting, typically using cross-validation. There are different implementations.
- Cost/complexity penalty
The idea is that pruning minimizes the penalized error \(SSE_{\lambda}\) for a given value of the tuning parameter \(\lambda\):
\[SSE_{\lambda} = SSE+\lambda \times (complexity)\]
Here, complexity is a function of the number of leaves. For every given \(\lambda\), we want to find the tree that minimizes this penalized error. Breiman et al. present an algorithm to solve the optimization (L. Breiman et al. 1984).
To find the optimal pruned tree, you need to iterate through a series of values of \(\lambda\) and calculate the corresponding SSE. For the same \(\lambda\), the SSE varies across samples. Breiman et al. suggested using cross-validation (L. Breiman et al. 1984) to study the variation of SSE under each \(\lambda\) value. They also proposed a standard deviation criterion to obtain the simplest tree: among the trees whose error falls within one standard deviation of the minimum error, choose the simplest one. Another method is to choose the tree size that minimizes the numerical error (Hastie T 2008).
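As a hedged sketch of this procedure, scikit-learn implements minimal cost-complexity pruning through the `ccp_alpha` argument, where \(\alpha\) plays the role of \(\lambda\) above (the penalized quantity there is an impurity measure rather than SSE). `cost_complexity_pruning_path` enumerates the candidate \(\alpha\) values, and cross-validation compares them, including a one-standard-deviation pick:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alpha values produced by weakest-link pruning of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the last alpha, which prunes down to the root

# Cross-validated accuracy of the pruned tree at each alpha.
scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5)
    for a in alphas
]
means = np.array([s.mean() for s in scores])
stds = np.array([s.std() for s in scores])

best = means.argmax()
# Standard deviation criterion: the largest alpha (simplest tree) whose mean
# accuracy is within one standard deviation of the best mean accuracy.
within_1sd = np.where(means >= means[best] - stds[best])[0]
alpha_1sd = alphas[within_1sd].max()
print(f"best alpha = {alphas[best]:.5f}, 1-SD alpha = {alpha_1sd:.5f}")
```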
- Error-based pruning
This method was first proposed by Quinlan (Quinlan 1999). The idea behind it is intuitive. All split nodes of the tree are included in the initial candidate pool. Pruning a split node means removing the entire subtree under the node and making the node a terminal node. The data is divided into 3 subsets for:
1. training a complete tree
2. pruning
3. testing the final model
You train a complete tree on subset (1) and apply the tree to subset (2) to calculate the accuracy. Then prune the tree at a candidate node and apply the pruned tree to subset (2) to calculate another accuracy. If the accuracy after pruning is higher than or equal to that of the complete tree, set the node as a terminal node. Otherwise, keep the subtree under the node. The advantage of this method is that it is easy to compute. However, when subset (2) is much smaller than subset (1), there is a risk of over-pruning. Some researchers found that this method results in more accurate trees than pruning processes based on tree size (F. Esposito and Semeraro 1997).
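A toy sketch of this pruning loop is shown below; the nested-dict tree layout and helper names (`predict`, `accuracy`, `prune`, the `majority` field) are illustrative assumptions rather than a reference implementation:

```python
def predict(tree, x):
    node = tree
    while "class" not in node:                      # descend until a terminal node
        key = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[key]
    return node["class"]

def accuracy(tree, X, y):
    return sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)

def prune(node, root, X_prune, y_prune):
    """Bottom-up pass: collapse a split into its majority class whenever
    accuracy on the pruning subset (subset 2) does not decrease."""
    if "class" in node:                             # already a terminal node
        return
    prune(node["left"], root, X_prune, y_prune)
    prune(node["right"], root, X_prune, y_prune)
    before = accuracy(root, X_prune, y_prune)
    backup = dict(node)                             # remember the split
    node.clear()
    node["class"] = backup["majority"]              # tentatively prune the node
    if accuracy(root, X_prune, y_prune) < before:   # pruning hurt accuracy
        node.clear()
        node.update(backup)                         # restore the subtree

# Tiny example: one split on feature 0 that the pruning set does not support.
tree = {
    "feature": 0, "threshold": 0.5, "majority": 1,
    "left": {"class": 0}, "right": {"class": 1},
}
X_prune, y_prune = [[0.2], [0.4], [0.9]], [1, 1, 1]
prune(tree, tree, X_prune, y_prune)
print(tree)   # collapsed to {'class': 1} because pruning does not reduce accuracy
```

The key design choice is that the pruning decision is made on subset (2), which the tree never saw during training.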
- Error-complexity pruning
This method searches for a trade-off between error and complexity. Assume we have a splitting node \(t\) and the corresponding subtree \(T\) rooted at it. The error cost of the node is defined as:
\[R(t)=r(t)\times p(t) = \frac{misclassified\ sample\ size\ of\ the\ node}{total\ sample\ size}\]
where \(r(t)\) is the error rate associated with the node, treating it as if it were a terminal node:
\[r(t)=\frac{misclassified\ sample\ size\ of\ the\ node}{sample\ size\ of\ the\ node}\]
and \(p(t)\) is the ratio of the sample size of the node to the total sample size:
\[p(t)=\frac{ sample\ size\ of\ the\ node}{total\ sample\ size}\]
The multiplication \(r(t)\times p(t)\) cancels out the sample size of the node. If we keep node \(t\), the error cost of the subtree \(T\) is:
\[R(T)=\sum_{i\ \in\ leaves\ of\ subtree\ T} R(i)\]
The error-complexity measure of the node is:
\[a(t)=\frac{R(t)-R(T)}{no.\ of\ leaves\ in\ T - 1}\]
Based on the metrics above, the pruning process is (Patel and Upadhyay 2012):
1. Calculate \(a(t)\) for every split node \(t\) (a numeric sketch of this measure follows the list).
2. Prune the node with the lowest value of \(a(t)\).
3. Repeat steps 1 and 2. Each iteration produces a pruned tree, and together they form a sequence of candidate trees.
4. Select the tree with the best overall accuracy.
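To make the error-complexity measure concrete, here is a short sketch that computes \(a(t)\) for a single node using made-up counts; the function name and inputs are illustrative assumptions:

```python
def error_complexity(misclassified_node, leaf_misclassified, total):
    """a(t) = (R(t) - R(T)) / (number of leaves in T - 1)."""
    r_node = misclassified_node / total            # R(t): node treated as a leaf
    r_subtree = sum(leaf_misclassified) / total    # R(T): summed over the leaves of T
    return (r_node - r_subtree) / (len(leaf_misclassified) - 1)

# Node t misclassifies 15 of 1000 total samples if collapsed into a leaf; its
# subtree has 3 leaves that misclassify 4, 3, and 2 samples respectively.
a_t = error_complexity(15, [4, 3, 2], total=1000)
print(round(a_t, 4))  # 0.003: the per-leaf "price" of keeping the subtree
```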
- Minimum error pruning
Niblett and Bratko introduced this pruning method in 1991 (Cestnik and Bratko 1991). The process is a bottom-up approach that seeks a single tree minimizing the expected error rate on new samples. If we prune a splitting node \(t\), all the samples under \(t\) will be classified as belonging to one category, say category \(c\). If we prune the subtree, the expected error rate is:
\[E(t)=\frac{n_{t}-n_{t,c}+k-1}{n_{t}+k}\]
where:

- \(k\) = number of categories
- \(n_{t}\) = sample size under node \(t\)
- \(n_{t,c}\) = number of samples under node \(t\) that belong to category \(c\)
Based on the above definitions, the pruning process is (F. Esposito and Semeraro 1997):
- Calculate the expected error rate for each non-leaf node if the subtree under it is pruned.
- Calculate the expected error rate for each non-leaf node if the subtree is kept, i.e., the weighted sum of the expected error rates of its children.
- If pruning the node leads to a higher expected error rate, keep the subtree; otherwise, prune it (a numeric sketch of this comparison follows).
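To see the minimum-error comparison on concrete numbers, the sketch below computes \(E(t)\) for a node and the weighted expected error of its children; the counts and function names are made-up illustrations:

```python
def expected_error(n_t, n_tc, k):
    """E(t) = (n_t - n_tc + k - 1) / (n_t + k), with c the majority category at the node."""
    return (n_t - n_tc + k - 1) / (n_t + k)

k = 3                                   # number of categories
# Node t: 100 samples, 70 of them in the majority category.
e_pruned = expected_error(100, 70, k)

# If t is kept, its expected error is the weighted sum over its children, here
# two children with (sample size, majority-category size) of (60, 55) and (40, 25).
children = [(60, 55), (40, 25)]
e_kept = sum(n / 100 * expected_error(n, nc, k) for n, nc in children)

print(round(e_pruned, 3), round(e_kept, 3))   # 0.311 0.225
# Keeping the subtree has the lower expected error here, so node t is not pruned.
```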