Machine Learning and Data Mining in Business
Lecture 8: Decision Trees and Random Forests
Discipline of Business Analytics
Learning objectives
• Classification and regression trees.
• Bagging.
• Random forests.
1. Classification and regression trees
2. Fitting decision trees
3. Advantages and disadvantages of trees
4. Bagging
5. Random forests
Classification and regression trees
Decision trees
Decision trees originate from the field of decision analysis:
Image from http://towardsdatascience.com/decision-tree-hugging-b8851f853486
Regression tree
A regression tree uses a decision tree as a rule to predict a continuous response.
[Regression tree diagram for predicting house prices. The root splits on OverallQual <= 7.5 (2930 samples, value 180.8); further splits use OverallQual and GarageCars. Each node shows the number of training samples and its predicted value (mean sale price, in thousands).]
Example: regression tree
Suppose that the house has a quality rating of 8 and 3 garage spots. The prediction is 300.1.
[The same regression tree diagram. For OverallQual = 8 and GarageCars = 3, the path is OverallQual > 7.5, then OverallQual <= 8.5, then GarageCars > 2.5, ending in the leaf with value 300.1.]
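To make the prediction rule concrete, the tree above can be written out as nested if/else statements. The sketch below is a hand-coded Python version of the splits and leaf values read from the diagram (the left/right orientation of some interior branches is inferred from the leaf values, so treat it as an illustration rather than the fitted model):

def predict_price(overall_qual, garage_cars):
    # Hand-coded version of the regression tree in the figure above.
    # Leaf values are predicted sale prices (in thousands).
    if overall_qual <= 7.5:
        if overall_qual <= 6.5:
            if garage_cars <= 1.5:
                return 120.9
            # GarageCars > 1.5: one further split on OverallQual <= 5.5
            return 139.4 if overall_qual <= 5.5 else 173.8
        # 6.5 < OverallQual <= 7.5
        return 205.0
    if overall_qual <= 8.5:
        # 7.5 < OverallQual <= 8.5: split on GarageCars <= 2.5
        return 244.6 if garage_cars <= 2.5 else 300.1
    # OverallQual > 8.5
    return 386.7

# The example house: quality rating 8 and 3 garage spots.
print(predict_price(overall_qual=8, garage_cars=3))   # 300.1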
Classification tree
A classification tree is very similar to a regression tree, except that it outputs a class label or class probabilities.
[Classification tree diagram for customer churn. The root splits on Employees <= 1243.5 (250 samples); further splits use Retention expenses. Each node shows the number of samples, the class counts [retention, churn], and the majority class.]
Example: classification tree
Suppose that the number of employees is 2000 and the retention expense is 50. The tree predicts retention.
[The same classification tree diagram. For Employees = 2000 and Retention expenses = 50, the path goes right at the root (Employees > 1243.5) and then follows the Retention expenses <= 68.935 split, ending in a leaf whose majority class is retention.]
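A classification tree stores class counts in each leaf, and the predicted class probabilities are simply the leaf proportions. A minimal sketch using one of the leaves from the diagram above (counts ordered as [retention, churn], matching the root node [135, 115]):

# Class counts in a leaf are converted into predicted probabilities.
leaf_counts = {"retention": 41, "churn": 0}          # leaf value = [41, 0]
n_leaf = sum(leaf_counts.values())
probabilities = {label: count / n_leaf for label, count in leaf_counts.items()}
prediction = max(probabilities, key=probabilities.get)
print(probabilities)   # {'retention': 1.0, 'churn': 0.0}
print(prediction)      # 'retention'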
Decision trees
Each leaf of a classification or regression tree corresponds to a rectangular region of the set of possible values of the predictors.
[Figure: a regression tree with a split such as Years < 4.5 and the corresponding rectangular regions.]
Rectangular partitions and trees
[Figure: a tree with splits X1 <= t1 and X2 <= t2 and the matching rectangular partition of the (X1, X2) plane.]
Regression tree
Formally, the regression tree model is
f(x) = \sum_{m=1}^{M} w_m \, I(x \in R_m),
where:
1. R_1, R_2, \dots, R_M are distinct and non-overlapping rectangular regions of the input space.
2. The parameter w_m is the prediction for every instance in region m.
3. M is the number of leaves.
The number of leaves M determines the capacity of the tree.
• The higher the number of leaves, the higher the capacity, and therefore the lower the bias.
• On the other hand, a large tree will clearly overfit the data, since having a small number of training examples in a region R_m leads to high variance when estimating the best prediction for that region.
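The number of leaves can be controlled directly when fitting a tree. A minimal scikit-learn sketch on synthetic data (the data set and parameter values are illustrative, not from the lecture example): as max_leaf_nodes grows, the training error keeps falling, while the test error eventually stops improving.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a nonlinear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# More leaves means higher capacity: lower bias but higher variance.
for n_leaves in [2, 8, 32, 128, 512]:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    print(n_leaves,
          round(mean_squared_error(y_train, tree.predict(X_train)), 3),
          round(mean_squared_error(y_test, tree.predict(X_test)), 3))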
Fitting decision trees
Fitting decision trees
The learning algorithm needs to select:
• The tree architecture.
• The split variable and split point for each decision node.
• The predicted value in each region.
[Regression tree diagram for the house price example, repeated for reference: splits on OverallQual and GarageCars, with the predicted value shown in each node.]
Fitting decision trees
Given the tree structure, it’s simple to choose the predicted value:
• For a regression tree, the predicted value is the mean response for the samples in the region.
• For a classification tree, the predicted probabilities are the class proportions in the region.
• If we are interested in the most probable label, the prediction is simply the majority class in the region.
Fitting decision trees
Let D_m be the set of instances in node m. For a regression tree,
w_m = \frac{1}{|D_m|} \sum_{i \in D_m} y_i,
where |D_m| is the number of samples in the node.
For a classification tree, the class proportions are
\pi_{mc} = \frac{1}{|D_m|} \sum_{i \in D_m} I(y_i = c).
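A small NumPy sketch of these two formulas: given the responses of the training instances that fall in a node, the regression prediction is their mean and the classification probabilities are the class proportions (the toy values are made up for illustration).

import numpy as np

# Regression node: w_m is the mean response of the samples in the node.
y_node = np.array([120.0, 135.5, 150.0, 128.0])   # toy responses in one region
w_m = y_node.mean()
print(w_m)

# Classification node: pi_mc is the proportion of each class in the node.
y_class_node = np.array(["retention", "churn", "retention", "retention"])
classes, counts = np.unique(y_class_node, return_counts=True)
pi_m = counts / counts.sum()
print(dict(zip(classes, pi_m)))          # class proportions
print(classes[np.argmax(pi_m)])          # majority-class prediction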
Example: regression tree
[Regression tree diagram for the house price example: the value in each leaf is the mean sale price of the training samples that fall in that region.]
Example: classification tree
[Classification tree diagram for customer acquisition. The root splits on Acquisition expense <= 490.755 (250 samples); further splits use Acquisition expense and Employees. Each node shows the class counts [no acquisition, acquisition] and the majority class; the leaf proportions give the predicted probabilities.]
Fitting decision trees
• A decision tree asks a series of yes or no questions. How can we find the right questions to ask?
• That is, what should be the tree architecture, split variables, and split points?
• Unfortunately, it’s not computationally feasible to enumerate and search over all possible decision trees.
Growing a decision tree
1. Consider a tree that only has the root node.
2. Loop through each feature column.
3. For each column, loop through each possible value of the feature.
4. For every combination of feature and value, split the data into two leaves based on whether the feature is greater than or less than that value.
5. Pick the combination of feature and split point that gives the best predictions using this simple model.
Continues below.
Illustration:
[Tree after the first split: the root (OverallQual <= 7.5, 2930 samples, value 180.8) is split into two leaves with 2442 samples (value 156.2) and 488 samples (value 303.7).]
Growing a decision tree
6. We now have divided the training observations into two non-overlapping groups. Treat each group as a separate dataset and go back to step one.
7. Continue this process recursively until reaching a stopping criterion, such as a minimum number of observations in each leaf.
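A minimal sketch of steps 2-5 above for a regression tree: loop over every feature and every candidate split point, split the data into two leaves, and score the split by the resulting sum of squared errors. The variable names are hypothetical and the search is deliberately naive (real implementations sort each feature once and update the sums incrementally).

import numpy as np

def best_split(X, y):
    """Exhaustive search for the best single split (steps 2-5).

    Returns (feature index, threshold, score), where the score is the
    total squared error of predicting each leaf by its mean response.
    """
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):                              # step 2: each feature
        for t in np.unique(X[:, j]):                # step 3: each value
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # step 4: predict each leaf by its mean and measure the error
            score = ((left - left.mean()) ** 2).sum() + \
                    ((right - right.mean()) ** 2).sum()
            if score < best[2]:                     # step 5: keep the best
                best = (j, t, score)
    return best

# Toy usage with made-up data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.where(X[:, 0] > 5, 10.0, 2.0) + rng.normal(size=200)
print(best_split(X, y))   # should pick feature 0 with a threshold near 5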
Illustration:
[Tree after the second split: the node with 2442 samples is split on OverallQual <= 6.5 into leaves with 1840 samples (value 140.3) and 602 samples (value 205.0).]
Illustration:
[Tree after the third split: the node with 488 samples is split on OverallQual <= 8.5 into leaves with 350 samples (value 270.9) and 138 samples (value 386.7).]
Illustration:
[Tree after the fourth split: the node with 1840 samples is split on GarageCars <= 1.5 into leaves with 883 samples (value 120.9) and 957 samples (value 158.2).]
Fitting decision trees
• We call the above procedure recursive binary splitting.
• Recursive binary splitting is a greedy algorithm. At each step, we choose the best immediate split without considering the impact on the rest of the tree.
Fitting decision trees
Let D_m be the set of instances that reach node m, and define
D_m^L(j, t) = {(x_i, y_i) ∈ D_m : x_{ij} ≤ t},
D_m^R(j, t) = {(x_i, y_i) ∈ D_m : x_{ij} > t}.
We select the split variable j and split point t as follows:
(j_m, t_m) = \arg\min_{j,t} \left[ \frac{|D_m^L(j,t)|}{|D_m|} \, c\big(D_m^L(j,t)\big) + \frac{|D_m^R(j,t)|}{|D_m|} \, c\big(D_m^R(j,t)\big) \right],
where c(·) is a cost function.
Fitting decision trees
For a regression tree, the cost is the MSE
c(D_m) = \frac{1}{|D_m|} \sum_{i \in D_m} (y_i - w_m)^2.
Node impurity measures
• For a classification tree, we want each decision node to discriminate as much as possible between the classes. An ideal situation is a pure node where all observations belong to the same class.
• We refer to the criterion for growing a classification tree as a node impurity measure.
Example: Customer Acquisition
[Classification tree diagram for customer acquisition, repeated. Internal nodes such as the root (class counts [97, 153]) are impure, while leaves such as [59, 0] and [0, 85] are pure.]
Node impurity measures
• A small value of an impurity measure indicates that the node contains cases predominantly from a single class.
• One option is to directly compute the average loss based on the loss matrix, including the error rate as a special case.
• However, we typically prefer to use the Gini index or entropy to grow the tree, as these two similar measures are more sensitive to the node purity.
Node impurity measures
The Gini index is
G_m = \sum_{c=1}^{C} \pi_{mc} (1 - \pi_{mc}) = \sum_{c=1}^{C} \sum_{j \neq c} \pi_{mc} \, \pi_{mj},
where \pi_{mc} is the class c proportion in leaf m. The entropy is
H_m = - \sum_{c=1}^{C} \pi_{mc} \log(\pi_{mc}).
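A small sketch of the two impurity measures as functions of the class proportions in a node (pure nodes score zero, evenly mixed nodes score the most).

import numpy as np

def gini(proportions):
    p = np.asarray(proportions, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(proportions):
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                       # define 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))    # pure node: 0.0, 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))    # evenly mixed: 0.5, ~0.693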
Node impurity measures
Binary case (p indicates the class proportion, the entropy is scaled for visualisation):
Figure from ESL.
Regularisation
In cost-complexity pruning, we consider the following criterion to reduce the size of the tree:
C_α(T) = J(T) + α|T|,
where J(T) is the training error, |T| is the number of leaves, and α ≥ 0 is a hyperparameter that controls the trade-off between tree size and the training error.
Cost-complexity pruning
Cα(T) = J(T) + α|T|
• We can interpret α as the minimum reduction in the training error required to allow a split.
• When α = 0, we simply obtain the maximal tree T_0. For any given α > 0, there is a subtree T_α ⊆ T_0 that minimises C_α(T).
• We select α by cross-validation.
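In scikit-learn, cost-complexity pruning is exposed through the ccp_alpha parameter, and cross-validation can be used to choose α as suggested above. A hedged sketch on synthetic data (the data set and grid are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

# Candidate alphas from the cost-complexity pruning path of the full tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Select alpha by cross-validation.
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"ccp_alpha": alphas},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)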
Illustration:
[Regression tree diagram for the house price example, repeated for the pruning illustration.]
Advantages and disadvantages of trees
Advantages of trees
Trees are simple.
Trees lead to interpretable prediction rules that are easy to explain to non-experts.
Insensitive to monotone transformations of numerical predictors.
Robust to outliers in the input space.
Natural handling of nominal and ordinal predictors.
Natural handling of mixed data types.
Natural handling of missing values.
Advantages of trees
Trees can approximate complex nonlinearities, including complex interaction effects.
Trees automatically perform feature selection.
Trees scale well computationally to large datasets.
Disadvantages of trees
Lower predictive accuracy than other methods (for a single tree).
Trees are inherently unstable due to their hierarchical nature. Small changes in the data can lead to a very different sequence of splits.
Trees lack smoothness, which can degrade performance (especially in regression settings).
Trees have difficulty capturing additive structure.
Illustration: lack of smoothness
Bagging
• The idea of bagging (bootstrap aggregation) is to average many noisy but approximately unbiased models in order to reduce variance.
• Trees are ideal candidates for bagging since they have high flexibility if grown deeply enough. Deep trees have relatively low bias but high variance.
The Bootstrap
Let D = {(y_i, x_i)}_{i=1}^{n} be the training data.
• We construct a bootstrap sample D^* = {(y_i^*, x_i^*)}_{i=1}^{n} by randomly sampling n observations with replacement from the training data.
• In bootstrapping, we generate a large number of independent bootstrap samples D_1^*, \dots, D_B^*.
• By applying an estimator θ(D) to each bootstrap sample, we obtain the bootstrap distribution of the estimator.
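A minimal NumPy sketch of bootstrapping: draw B samples with replacement and look at the bootstrap distribution of a simple estimator (here the sample mean, chosen only for illustration).

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # toy training data
B = 1000

# Bootstrap distribution of the sample mean.
boot_estimates = np.empty(B)
for b in range(B):
    sample = rng.choice(data, size=len(data), replace=True)  # D_b^*
    boot_estimates[b] = sample.mean()                        # theta(D_b^*)

print(data.mean(), boot_estimates.std())   # estimate and its bootstrap SE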
Bootstrapping: graphical illustration
[Figure from ISL: bootstrap samples drawn with replacement from the original data (Z), each yielding a bootstrap estimate.]
Bagging
1. Generate B bootstrap samples D_b^* (b = 1, \dots, B) from the training data D.
2. Fit a decision tree T_b^* to each bootstrap sample.
3. Use the ensemble of trees T_1^*, \dots, T_B^* to make predictions.
In regression, the prediction from bagging is the average prediction across all trees:
f_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b^*(x).
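A direct translation of these three steps using deep regression trees, written out with NumPy and scikit-learn rather than a ready-made bagging class so that each step is visible (the data and the value of B are illustrative).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

B, n = 100, len(y)
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)                 # 1. bootstrap sample D_b^*
    tree = DecisionTreeRegressor(random_state=b)     # 2. deep tree, no pruning
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# 3. Average the predictions of the ensemble.
X_new = np.array([[2.0, 5.0]])
f_bag = np.mean([tree.predict(X_new) for tree in trees], axis=0)
print(f_bag)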
Bagging: graphical illustration
Image credit: Sirakorn, CC BY-SA 4.0, via Wikimedia Commons
• Increasing B does not cause overfitting.
• The larger B is, the lower the variance. However, the reduction in variance from additional trees becomes smaller and smaller as B increases, tending towards zero.
• We grow deep trees and do not apply pruning.
Random forests
Random forests
Random forests improve on bagging by adding a tweak that reduces the correlation between the trees: we select only a random subset of k ≤ p predictors as candidate split variables for each decision node.
Random forests
As a simple example, consider the average of two trees
f(x) = \frac{T_1(x) + T_2(x)}{2},
where T_1 and T_2 are trees. From the properties of variances,
\mathrm{Var}\big(f(x)\big) = \frac{\mathrm{Var}\big(T_1(x)\big) + \mathrm{Var}\big(T_2(x)\big) + 2\,\mathrm{Cov}\big(T_1(x), T_2(x)\big)}{4}.
Therefore, the variance of the estimator increases with the correlation between the trees.
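As a worked special case (an added assumption, not on the slide): if both trees have the same variance σ² and correlation ρ, then
\mathrm{Var}\big(f(x)\big) = \frac{\sigma^2 + \sigma^2 + 2\rho\sigma^2}{4} = \frac{(1 + \rho)\,\sigma^2}{2},
so averaging halves the variance when the trees are uncorrelated (ρ = 0) and gives no reduction at all when ρ = 1. Random forests aim to push ρ down.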
Random forest
Algorithm Random forest
1: for b = 1 to B do
2:   Sample n observations with replacement from the training data D to obtain the bootstrap sample D_b^*.
3:   Grow a random forest tree T_b on D_b^* by repeating the following steps for each leaf until the algorithm reaches a minimum node size n_min:
4:     (i) Select k variables at random from the p variables.
5:     (ii) Pick the best variable and split point among the k candidates.
6:     (iii) Split the node into two child nodes.
7: end for
8: Output the ensemble of trees {T_b}_{b=1}^{B}.
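A hedged scikit-learn sketch of this algorithm: RandomForestRegressor exposes k as max_features and n_min as min_samples_leaf (the data and parameter values below are illustrative).

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

forest = RandomForestRegressor(
    n_estimators=500,      # B: number of trees
    max_features=2,        # k: candidate split variables per node
    min_samples_leaf=5,    # n_min: minimum node size
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))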
Random forests
• The main hyperparameters in a random forest are the number of candidate split variables (k) and the minimum node size for each tree.
• As usual, we select the hyperparameters by cross-validation. Using a small k will typically be helpful when there is a large number of correlated predictors.
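A short sketch of the cross-validation step for these two hyperparameters (the grid values and data are illustrative).

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

param_grid = {"max_features": [1, 2, 3, 5],      # k
              "min_samples_leaf": [1, 5, 20]}    # minimum node size
search = GridSearchCV(RandomForestRegressor(n_estimators=200, random_state=0),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)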
Illustration
[Figure from ISL: test classification error (vertical axis, roughly 0.2 to 0.5) against the number of trees (0 to 500) for m = p, m = p/2, and m = √p candidate split variables.]
Why do random forests work so well?
• When there is a small subset of strong predictors, these variables will tend to dominate the top splits of the trees over the bootstrap samples. That makes the trees quite similar to each other.
• Since the trees are similar, their predictions will be highly correlated. Averaging over highly correlated trees does not lead to a substantial reduction in variance.
• Random forests overcome this problem by giving a chance to other predictors, leading to decorrelation, at the expense of some bias.
Feature importance
• A common measure of feature importance for random forests is to add up the total reduction in the training error due to splits over that predictor, averaged across all B trees.
• This is different from the feature importance based on SHAP values.
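In scikit-learn, this impurity-based measure is available as the feature_importances_ attribute of a fitted forest. A minimal sketch (the data and feature names are illustrative).

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=500)  # X[:, 2] is pure noise

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
for name, importance in zip(["x0", "x1", "x2"], forest.feature_importances_):
    print(name, round(importance, 3))   # importances sum to 1; x0 should dominate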