STAT318 — Data Mining
University of Canterbury, Christchurch
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Tree-Based Methods
In this section we consider tree-based methods for regression and classification.
These methods partition the feature (predictor) space into a number of regions, and then fit a simple model (for example, a constant) in each region.
The set of splitting rules used to partition the feature space can be summarized in a tree, so these approaches are known as decision tree methods.
We will focus on the Classification and Regression Trees (CART) approach.
Decision trees are great because they:
• are simple;
• are easy to interpret (at least for small trees);
• scale well to large training data sets (divide-and-conquer construction); and
• form the basis for state-of-the-art supervised learning methods (random forests and boosting).
Decision trees also have some drawbacks, but we’ll get to these later.
Decision Tree
A decision tree is a recursive partition of the feature space that can be displayed in a flowchart-like structure consisting of nodes and branches.
1 Root Node: A node that has no incoming branches.
2 Internal Node: A node that has exactly one incoming branch and at least two outgoing branches.
3 Terminal (Leaf) Node: A node with exactly one incoming branch and no outgoing branches.
Internal nodes have an associated query (splitting rule) and by convention, we proceed to the left descendent node (left daughter) if we answer ‘yes’ to the query.
Here is an over-simplified decision tree for a medical triage system to illustrate the structure of a tree.
[Figure: triage decision tree. Node 1 (root) asks 'Does the patient have difficulty breathing?'; Yes leads to node 2 (send to hospital), No leads to node 3, which asks 'Is the patient bleeding?'; Yes leads to node 6 (send to hospital), No leads to node 7 (stay at home).]
To classify a patient we answer the queries and follow the branches. For example, a patient with no difficulty breathing and no bleeding is told to stay at home:
node 1 → node 3 → node 7 = stay at home. A patient with difficulty breathing is told to go to hospital: node 1 → node 2 = send to hospital.
Splitting Rules
Qualitative Feature:
1 Multiway split: let K be the number of values the ith feature, Xi, takes in node m. Using Xi, split node m into K descendent nodes, one node for each value.
2 Binary split: the split has the form Xi ∈ S, where S is a subset of the values the ith feature takes.
Quantitative Feature:
1 Axis parallel splits: a split using the ith feature, Xi, has the form Xi ≤ s, where s ∈ R.
2 Oblique splits: the split uses a linear combination of features and has the form
a0 + Σ_{i=1}^p ai Xi ≤ 0, where a0, a1, . . . , ap ∈ R.
Consider a feature Car Type with three values: Sports, Family, Luxury. A multiway split and an equivalent series of binary splits are shown below.
[Figure: Multiway split: Car Type splits into three descendent nodes, Sports, Family and Luxury. Binary split: first split on Car Type in {Sports, Family} versus {Luxury}, then split the left node on Car Type in {Sports} versus {Family}.]
If the feature has many values, multiway splits tend to fragment the data too quickly, leaving insufficient training data for further splitting of the descendent nodes. Noting that a multiway split can be formed using a series of binary splits (see above), binary splits are preferred. CART only uses binary splits for qualitative features.
Oblique and axis parallel splits
[Figure: two panels showing the same two-class data in (X1, X2); the left panel uses a single oblique split, the right panel uses axis parallel splits.]
In this example, we have a two-class classification problem (green or yellow) with two continuous predictors (X1 and X2). The classes are linearly separable and an oblique split of the form
a0 + a1X1 + a2X2 ≤ 0
separates the classes perfectly. However, in general, finding an optimal oblique split can be computationally expensive as there are many possibilities to consider. Oblique trees are also more difficult to interpret because each split is a linear combination of the predictors, rather than a split involving a single predictor.
CART uses axis parallel splits (Xi ≤ s) for continuous predictors. Axis parallel splits can produce complex decision boundary structure (see the right figure) when compared with oblique splits, but they are easier to fit. We will not be considering oblique splits any further in this course.
Regression Trees: Hitters data set
[Figure: scatter plot of Hits versus Years for the Hitters training data, with points coloured by salary.]
To motivate regression trees, we consider a simple example. The aim here is to predict a baseball player's salary using:
• Hits: the number of hits in the previous year; and
• Years: the number of years playing in the major leagues.
The response variable Salary has been log transformed so it's more ‘bell shaped’ (Salary is measured in thousands of dollars). Players with low income are coloured blue and those with high income yellow.
Can you explain the player with a high income (yellow), zero hits and only one year playing in the major league?
How do we partition the training data?
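A minimal R sketch for exploring these data (assuming the ISLR package that accompanies the textbook is installed; the colour threshold below is an illustrative choice, not the one used in the figure):

```r
# Load the Hitters data, drop players with missing Salary and log-transform.
library(ISLR)

Hitters <- na.omit(Hitters)
Hitters$logSalary <- log(Hitters$Salary)   # natural log used here for illustration

# Scatter plot of Hits against Years, coloured by salary (blue = low, yellow = high)
plot(Hitters$Years, Hitters$Hits,
     col = ifelse(Hitters$logSalary > median(Hitters$logSalary), "yellow", "blue"),
     xlab = "Years", ylab = "Hits")
```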
Decision Tree: Hitters data set
[Figure: partition of the (Years, Hits) feature space into regions R1, R2 and R3, with split points Years = 4.5 and Hits = 117.5.]
A partition of the feature space for the Hitters training data. The three regions can be written as follows:
R1 = {X : Years < 4.5}
R2 = {X : Years ≥ 4.5, Hits < 117.5}
R3 = {X : Years ≥ 4.5, Hits ≥ 117.5}
We read {X : Years < 4.5} as ‘the set of training data points X such that Years is less than 4.5’.
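These region definitions translate directly into R. A minimal sketch (assuming Hitters and logSalary were prepared as in the earlier snippet); the three means should be close to the terminal-node predictions on the next slide if the same log transformation is used:

```r
# Mean log salary in each region of the hand-coded partition
in_R1 <- Hitters$Years < 4.5
in_R2 <- Hitters$Years >= 4.5 & Hitters$Hits < 117.5
in_R3 <- Hitters$Years >= 4.5 & Hitters$Hits >= 117.5

c(R1 = mean(Hitters$logSalary[in_R1]),
  R2 = mean(Hitters$logSalary[in_R2]),
  R3 = mean(Hitters$logSalary[in_R3]))
```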
Decision Tree: Hitters data set
[Figure: regression tree with root split Years < 4.5 and a second split Hits < 117.5; terminal node predictions 5.11, 6.00 and 6.74 (log salary).]
The regression tree corresponding to the partition on the previous slide. The lengths of the branches are proportional to the decrease in RSS after the split is made. That is, longer branches correspond to greater reductions in RSS. (Note: this is something that the tree R function does, not all functions plot trees in this way.)
The mean response of the training observations at each terminal node is used for making predictions. For example, a player with 6 years experience and 80 hits has a predicted log salary of 6 (that is, a salary of $10^6, 1 million USD).
Observations:
• Years is the most important predictor (it’s at the top of the tree and it reduces the RSS the most).
• Experienced players with a large number of hits get larger salaries.
• Simple, easy to display, and easy to explain to a non-expert. Now we need to learn how to fit one!
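The notes above mention the tree function in R. A minimal sketch of fitting and plotting this tree (assuming the tree package is installed and the Hitters data were prepared as in the earlier snippets; the default stopping rules may give a bushier tree than the three-region example):

```r
library(tree)

# Regression tree for log salary using Years and Hits only
hit_tree <- tree(logSalary ~ Years + Hits, data = Hitters)
plot(hit_tree)               # branch lengths reflect the decrease in RSS
text(hit_tree, pretty = 0)

# Predicted log salary for a player with 6 years experience and 80 hits
predict(hit_tree, newdata = data.frame(Years = 6, Hits = 80))
```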
Tree Building Process (Regression)
We have N training data observations {(xi, yi)}, i = 1, . . . , N, where xi is a vector containing p predictors and yi is its response value.
Suppose we have partitioned the feature space into M distinct regions, R1, R2, . . . , RM, and model the response as a constant, cm, in the mth region:
f(x) = Σ_{m=1}^{M} cm I(x ∈ Rm).
To minimize the sum of squares Σ_i (yi − f(xi))², we choose
cm = ave(yi : xi ∈ Rm).
There are many possible partitions that can be formed. CART uses partitions consisting of nested boxes. These partitions are simple to form and the resulting model is easy to interpret.
Tree Building Process (Regression)
It is usually computationally infeasible to find a partition that minimizes the sum of squares.
CART uses a top-down, greedy approach called binary recursive partitioning.
1 Top-Down: Start at the root node with all the data and then recursively split the feature space, where each split defines two new branches further down the tree.
2 Greedy: Make a series of locally optimal splits, rather than looking ahead and picking a split that will produce a better tree at some future step.
This figure illustrates how the data are divided following a split.
[Figure: a split divides the training data into two sets, A and B.]
After each split, we end up with two independent training sets (A and B). These sets can be treated separately in parallel for further splitting.
A series of locally optimal splits (greedy) does not guarantee a globally optimal tree.
Tree Building Process (Regression)
Starting with all the training data, consider splitting on the jth feature using a split point s.
This defines two regions:
R1 = {X : Xj ≤ s} and R2 = {X : Xj > s}.
We seek the splitting feature j and the split point s that solve
min_{j,s} [ Σ_{xi∈R1} (yi − ȳR1)² + Σ_{xi∈R2} (yi − ȳR2)² ],
where ȳRm is the mean response in the mth region.
If ȳR1 ≠ ȳR2, the split reduces the RSS; otherwise the RSS is unchanged. That is, a split cannot increase the RSS.
For N points in R^d, there are d(N − 1) possible splits to consider. For example, in R² with N = 4 there are six possible splits:
[Figure: four points in the (X1, X2) plane with the three candidate splits of the form X1 < s and the three of the form X2 < s drawn as vertical and horizontal lines.]
The splitting value s is taken as the midpoint between the two points. (Qualitative features are covered on the next page.)
Tree Building Process (Regression)
The best split can be found quickly by scanning through all the features.
Having found the best split, we partition the training data into two regions and repeat the splitting process on these two regions.
The splitting process is repeated until a stopping condition is satisfied, for example, until no terminal node contains more than five observations.
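To make the greedy search concrete, here is a small illustrative R function (a sketch, not the CART implementation) that scans a single quantitative feature for the split point minimizing the RSS criterion from the previous slide:

```r
# Find the best axis parallel split of response y on a single feature x by
# scanning the midpoints between consecutive sorted feature values.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(NULL)              # nothing to split on
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2     # candidate split points (midpoints)
  rss <- sapply(cuts, function(s) {
    left  <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = cuts[which.min(rss)], rss = min(rss))
}

# Example (assuming the Hitters objects from the earlier snippets):
# best_split(Hitters$Years, Hitters$logSalary)
```

Scanning every feature with this function and keeping the overall winner gives the locally optimal split at a node.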
Qualitative Features
CART considers all 2^(K−1) − 1 possible binary splits for qualitative features, where K is the number of different values the feature takes in the node. For the Car Type example on Slide 4, there are three possible splits to consider. Let Xj = Car Type; then the three possible splits are
Xj ∈ {Sports}, Xj ∈ {Sports, Family} and Xj ∈ {Sports, Luxury}.
The corresponding regions are:
R1 = {X : Xj ∈ {Sports}} and R2 = {X : Xj ∈ {Family, Luxury}};
R1 = {X : Xj ∈ {Sports, Family}} and R2 = {X : Xj ∈ {Luxury}};
R1 = {X : Xj ∈ {Sports, Luxury}} and R2 = {X : Xj ∈ {Family}}.
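A small base-R sketch that enumerates these 2^(K−1) − 1 splits for the Car Type example (purely illustrative; CART implementations do this internally):

```r
# Enumerate the distinct binary splits of a qualitative feature with K values.
# Each split is identified by the subset sent to the left daughter; keeping only
# subsets containing the first value avoids counting each split twice.
values <- c("Sports", "Family", "Luxury")
K <- length(values)

subsets <- unlist(lapply(1:(K - 1),
                         function(k) combn(values, k, simplify = FALSE)),
                  recursive = FALSE)
splits <- Filter(function(S) values[1] %in% S, subsets)
splits   # {Sports}, {Sports, Family}, {Sports, Luxury}
```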
Recursive Binary Partition
[Figure: a partition of the (X1, X2) plane into rectangles that are not nested.]
This partition was not formed using recursive binary partitioning (the boxes are not nested). CART will not form a partition like this.
Recursive Binary Partition
[Figure: a recursive binary partition of the (X1, X2) plane into regions R1, . . . , R5 using the splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3 and X2 ≤ t4, the corresponding tree, and the resulting prediction surface.]
This is an example of a CART partition, the corresponding tree and a plot of the model (a piecewise constant surface). Another simple regression example with one predictor is illustrated below.
[Figure: a one-predictor example. The split x < 4 gives the predictions y = 2 for x < 4 and y = 8 for x ≥ 4, shown as a step function and as the corresponding tree.]
Pruning a Tree
The tree building process may produce good predictions on the training data, but is likely to overfit the training data.
A smaller tree might give lower variance and better interpretation at the cost of increased bias.
Rather than growing a small tree, it is better to grow a large tree and prune it back into a smaller tree.
Deep bushy trees localize the training data to relatively small regions Rm (in a similar way to KNN using a relatively small value of k). This suggests deep trees have low bias and high variance. Pruning the deep bushy tree aims to reduce the variance and reduce overfitting.
Another strategy would be to grow the tree until the reduction in RSS becomes sufficiently small. However, a seemingly worthless split early on may be followed by a useful split that reduces the RSS substantially. It is good practice to grow a large tree and then prune it back into a smaller tree.
Cost-Complexity Pruning
Let T0 denote a fully grown tree. A subtree T ⊂ T0 is any tree that can be obtained by collapsing any number of internal nodes.
The cost-complexity criterion is
Cα(T) = Σ_{m=1}^{|T|} Σ_{xi∈Rm} (yi − ȳRm)² + α|T|,
where |T| is the number of terminal nodes of tree T.
For each α ≥ 0, there is a unique subtree Tα ⊂ T0 that minimizes Cα(T).
The idea here is to find a sequence of best trees of varying complexities (by changing α). Then, we choose the correct complexity for the problem at hand.
The penalty parameter α governs the trade-off between tree size (complexity) and its goodness-of-fit to the training data, where large α values select smaller trees.
• To find Tα, we collapse the two terminal nodes (back to their parent node) that yield the smallest increase in
RSS = Σ_{m=1}^{|T|} Σ_{xi∈Rm} (yi − ȳRm)².
We continue this collapsing process to produce a sequence of trees until we reach the root node. This sequence of trees must contain Tα (we will not prove this result).
• We then find the best α (the one that gives the lowest testing error) using cross validation.
Tree Algorithm
1 Use recursive binary partitioning to grow a large tree on the training data.
2 Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3 Use K-fold cross-validation to choose α. For k = 1,2,...,K, do
(a) Repeat steps (1) and (2) on the training data excluding the kth fold.
(b) Evaluate the mean squared testing error of the kth fold, as a function of α.
Average the results and choose α to minimize the average error.
4 Return the subtree from step (2) that corresponds to the chosen value of α.
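With the tree package in R, steps 1 to 4 look roughly like the following sketch (assuming the Hitters objects from the earlier snippets; cv.tree cross-validates over the pruning sequence and prune.tree extracts the chosen subtree):

```r
library(tree)

# Step 1: grow a large tree on the training data (Salary itself is excluded)
big_tree <- tree(logSalary ~ . - Salary, data = Hitters)

# Steps 2-3: cost-complexity pruning sequence + K-fold cross-validation over alpha
set.seed(1)
cv_out <- cv.tree(big_tree, K = 10)
best_size <- cv_out$size[which.min(cv_out$dev)]

# Step 4: return the subtree of the chosen size
pruned <- prune.tree(big_tree, best = best_size)
plot(pruned)
text(pruned, pretty = 0)
```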
Step 2 gives us the best trees as a function of α:
[Figure: the sequence of subtrees, from the full tree with α = 0 through progressively smaller trees as α increases (complexity decreasing).]
Step 3 uses cross-validation to find the best α (tree size). The tree from step 2 of that size is then selected.
Fully Grown Tree: Hitters data
[Figure: the fully grown regression tree for the Hitters data, with 12 terminal nodes and splits on Years, Hits, RBI, Walks, Runs and PutOuts.]
This is the fully grown regression tree fitted to the Hitters data from earlier in this section.
Observations:
• There were 12 terminal nodes in the tree.
• Splits further down the tree did not reduce the RSS as much as splits higher up (as expected).
• The most useful predictors were Years, RBI (number of runs batted in) and Hits.
• There are 19 predictors in the Hitters data and the tree only used 6 of them to model the response (automatic feature selection, we didn’t need to specify the most useful predictors to fit the model). Trees only use the predictors that matter (in terms of reducing the RSS) and discard the rest.
• Are we over-fitting the training data?
Testing Error: Hitters data
[Figure: mean squared error versus tree size for the Hitters data, showing the training, cross-validation and test curves.]
The training MSE decreases as tree size (complexity) increases (as expected because larger trees have lower bias than smaller ones). The CV error is a reasonable approximation to the testing error and shows the characteristic U shape. For this problem, a tree size of three fits the data well. The pruned tree was given on slide 8.
Classification Trees
Classification trees are very similar to regression trees. The feature space is subdivided using recursive binary partitioning, but the sum of squares error is not used to choose the best split and the average is not used for predictions at terminal nodes.
In a node m (region Rm of the partition) with Nm observations, denote the fraction of observations in class k by
p̂mk = (1/Nm) Σ_{xi∈Rm} I(yi = k).
We classify an observation in a terminal node m to the majority class in node m:
k(m) = argmax_k (p̂mk).
Impurity Measures
Misclassification Error:
I(m) = 1 − p̂mk(m) = 1 − max_k {p̂mk}.
Gini Index:
I(m) = Σ_{k=1}^{K} p̂mk (1 − p̂mk).
Cross-Entropy (Deviance):
I(m) = − Σ_{k=1}^{K} p̂mk log2(p̂mk)   (with 0 log2(0) = 0).
A small impurity value indicates that a node contains predominantly observations from one class (we want nodes to be pure, or close to it). The choice of impurity measure doesn't really have an impact on the performance of a classification tree. CART uses the Gini Index so we'll focus on this one. For a two-class classification problem we have
I(m) = 2p(1 − p), where p is the proportion in the first (or second) class.
Impurity Measures: Two-Class Problem
[Figure: misclassification error, Gini index and cross-entropy for a two-class problem, plotted as functions of the class proportion p.]
Impurity measures attain their maximum value when each class is equally represented at a node.
Suppose node m is split into daughter nodes 2m and 2m + 1. The best split in node m is the one that maximizes the reduction in impurity
∆(I) = I(m) − (N2m/Nm) I(2m) − (N2m+1/Nm) I(2m + 1),
where I(m) is the impurity and Nm is the number of observations in node m.
Note: it is not always possible to split an impure node. In such cases the node is declared terminal and observations are classified to the majority class.
Consider a binary classification problem y ∈ {Yes, No} with 100 training observations in node m. The node has been split, giving two daughter nodes with the following structure:
[Figure: node m (Yes = 50, No = 50) splits into a left daughter 2m (Yes = 25, No = 50) and a right daughter 2m + 1 (Yes = 25, No = 0).]
That is, 25 of the Yes observations from node m go to the left daughter after the split and the remaining 25 go right. The change in impurity is calculated as follows:
I(m) = 2(1/2)(1 − 1/2) = 1/2 (parent node),
I(2m) = 2(2/3)(1 − 2/3) = 4/9 (left daughter),
I(2m + 1) = 0 (right daughter).
The change in impurity is
∆(I) = 1/2 − (3/4)(4/9) − (1/4)(0) = 1/6.
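A few lines of R reproduce this calculation (illustrative sketch):

```r
# Gini index for a two-class node given the counts of each class
gini <- function(n_yes, n_no) {
  p <- n_yes / (n_yes + n_no)
  2 * p * (1 - p)
}

n_left  <- 25 + 50      # observations sent to the left daughter
n_right <- 25 + 0       # observations sent to the right daughter
n_total <- n_left + n_right

gini(50, 50) -
  (n_left / n_total)  * gini(25, 50) -
  (n_right / n_total) * gini(25, 0)   # 1/6, approximately 0.167
```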
Classification Tree: Heart data
[Figure: the fully grown classification tree for the Heart data, with splits on predictors including Thal, Ca, MaxHR, ChestPain, Age, Sex, Chol, RestBP, RestECG, Oldpeak and Slope, and terminal nodes labelled Yes or No.]
The fully grown classification tree for the Heart data. The Heart data contains 303 patients with 13 predictors. The outcome Yes indicates the presence of heart disease and No indicates no heart disease. Notice that some internal nodes have been split into terminal nodes with the same class label (both Yes or both No). How can this happen?
Here is an example that illustrates this:
[Figure: a node with Yes = 10, No = 2 splits into a left daughter with Yes = 7, No = 0 and a right daughter with Yes = 3, No = 2; both daughters are labelled Yes.]
Hence, it is useful to report the estimated probability of being in a particular class. In the example above, the left node is pure so we are confident about the predicted Yes. However, for the right node, the estimated probability of Yes is 3/5 (not much better than flipping a coin).
Classification Tree: Heart data
[Figure: training, cross-validation and test error versus tree size for the Heart data, together with the pruned classification tree, which has six terminal nodes and splits on Thal, Ca, MaxHR and ChestPain.]
Cost-complexity pruning gives a simple tree with six terminal nodes. Of the 13 predictors, only four were included in the final model.
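A sketch of the corresponding R workflow (the file name Heart.csv and its column names, including the Yes/No response AHD, are assumptions about the data file that accompanies the textbook):

```r
# Grow and prune a classification tree on the Heart data.
Heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
# If the file contains a row-index column, drop it before fitting.

library(tree)
heart_tree <- tree(AHD ~ ., data = Heart)

set.seed(1)
cv_out <- cv.tree(heart_tree, FUN = prune.misclass)    # prune on misclassification error
best_size <- cv_out$size[which.min(cv_out$dev)]

heart_pruned <- prune.misclass(heart_tree, best = best_size)
plot(heart_pruned)
text(heart_pruned, pretty = 0)

# Estimated class probabilities at the terminal nodes (useful, as noted above)
# predict(heart_pruned, newdata = Heart, type = "vector")
```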
Missing Values
Deficient Observations: training or testing observations that have missing feature values (not missing response values).
If deficient observations are present in the training set, the best split is found using the known feature values.
To make predictions for deficient observations, surrogate splits are used.
Let Xj < s be the splitting rule in node m, called the primary split. A surrogate split uses a different feature Xi (i ̸= j) to split node m in a way that best approximates the distribution of training observations after the primary split. If training observation x goes to the left daughter after the primary split, it should also go left using the surrogate split etc.
Surrogate splits are used to decide whether to go left or right at a node when the splitting feature is missing. For example, consider passing
x = (x1, x2, x3) = (missing, 4, 12) through a tree with the query
x1 < 2.
We cannot answer this query because we don't have an x1 value. If the surrogate split is
x3 < 14,
we proceed to the left daughter node. Otherwise, we go to the right daughter. In either case, a decision is made and we can continue moving through the tree.
Surrogate splits are only used if they perform better than the blind rule, which sends x to the daughter node with the most training observations.
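In R, the rpart package implements CART-style surrogate splits. A minimal sketch (the data frame train and response y are placeholders; summary() lists the primary and surrogate splits at each node):

```r
library(rpart)

# usesurrogate = 2: use surrogate splits, and fall back on the blind rule
# (majority direction) if all surrogates are also missing.
fit <- rpart(y ~ ., data = train,
             control = rpart.control(usesurrogate = 2,
                                     maxsurrogate = 5))

summary(fit)                     # shows the surrogate splits chosen at each node
# predict(fit, newdata = test)   # rows of `test` may have missing feature values
```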
Pros and Cons of Decision Trees
Small trees are simple, easy to explain and can be displayed graphically.
Trees can easily handle categorical predictors without having to create dummy variables and can deal with missing values.
Trees are invariant under all monotone transformations of the predictors in the training data.
Trees perform variable selection and complexity reduction automatically. Tree-growing algorithms scale well to large training data sets.
Unfortunately, trees generally have poor predictive accuracy when compared with other methods. However, by aggregating trees, their predictive performance can be greatly improved.
Bagging
Deep trees suffer from high variance (small changes in the training data can dramatically change the tree produced), but tend to have low bias.
Bootstrap aggregation (bagging) is a general purpose approach for reducing the variance of a statistical learning technique (it is particularly useful for deep trees because of their high variance and low bias).
Given a set of n independent observations X1, X2, . . . , Xn with common variance σ2, the mean of these observations X ̄ has variance σ2/n. Averaging a set of observations reduces variance.
The sample mean (from a random sample), X̄ = (1/n) Σ_{i=1}^n Xi, is approximately normally distributed (Central Limit Theorem) and its expected value is
E(X̄) = E[(1/n)(X1 + X2 + . . . + Xn)]
= (1/n)(E(X1) + E(X2) + . . . + E(Xn))
= (1/n)(μ + μ + . . . + μ) = μ.
Hence, X̄ is an unbiased estimate of the population mean μ and its variance is
V(X̄) = V[(1/n)(X1 + X2 + . . . + Xn)]
= (1/n²)(V(X1) + V(X2) + . . . + V(Xn))   (independence)
= (1/n²)(σ² + σ² + . . . + σ²) = σ²/n,
which decreases with increasing sample size (precision increases with sample size).
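A quick simulation confirming the σ²/n behaviour (illustrative sketch):

```r
# Variance of single observations versus variance of means of n observations
set.seed(1)
sigma <- 2
n <- 25

single_obs <- rnorm(10000, mean = 0, sd = sigma)
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

var(single_obs)     # close to sigma^2 = 4
var(sample_means)   # close to sigma^2 / n = 0.16
```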
Idea: Build an ensemble of bushy trees using different training data and average their predictions to reduce variance.
Bagging
Unfortunately, we usually only have one training data set.
However, we can generate B different bootstrapped training data sets.
Bagging: Train a tree on the bth bootstrapped training set to get T̂b(x). The prediction is given by
f̂bag(x) = (1/B) Σ_{b=1}^{B} T̂b(x)   (regression).
For classification, we record the prediction made by each tree and then take a majority vote.
We want our trees to have low bias and high variance so we do not prune trees in the ensemble. Pruning trees increases the pairwise correlation between trees which reduces the effect of averaging. Essentially, we want each tree in the ensemble to have low bias (big and bushy trees) and for each tree to make mistakes on different training observations.
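A hand-rolled sketch of bagging regression trees with the tree package (the data frames train and test and the numeric response y are placeholders):

```r
library(tree)

B <- 100
preds <- replicate(B, {
  boot <- train[sample(nrow(train), replace = TRUE), ]            # bootstrap sample
  fit  <- tree(y ~ ., data = boot,
               control = tree.control(nobs = nrow(boot),
                                      mindev = 0.001))            # grow a deep tree, no pruning
  predict(fit, newdata = test)                                    # predictions on the test set
})

f_bag <- rowMeans(preds)   # bagged prediction: average over the B trees
```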
Bagging: Example
[Figure: the original tree and trees grown on five bootstrapped training sets (b = 1, . . . , 5); the root node splits include x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.3 < 0.985 and x.4 < −1.36. From The Elements of Statistical Learning, Hastie, Tibshirani and Friedman (2009).]
This is a two-class classification problem y ∈ {0,1} with p = 5 features x1, x2, . . . , x5. Clearly, each bootstrapped training set produces different tree structure. The root node splits also use several different features. However, x1 appears to be a dominant feature because it is used to split the root node in 7 of the 11 trees.
To classify an x using this ensemble of 11 trees, we pass x through each tree and record each tree's prediction. If more than five trees classify x as 0, then x is class 0. Otherwise, x is class 1.
Bagging: Example
[Figure: trees grown on bootstrapped training sets b = 6, . . . , 11; most root node splits use x.1 (x.1 < 0.395 or x.1 < 0.555), with one tree splitting on x.3 < 0.985. From The Elements of Statistical Learning, Hastie, Tibshirani and Friedman (2009).]
Random Forests
Bagged trees tend to look quite similar to each other, particularly if there are only a few strong predictors.
Hence, predictions from the bagged trees will be highly (positively) correlated.
Unfortunately, averaging highly correlated quantities does not reduce variance as much as it would for uncorrelated quantities.
Random forests provide a potential improvement over bagged trees by attempting to de-correlate trees in the ensemble.
If B identically distributed (same σ², but not necessarily independent) random variables Xi have positive pairwise correlation
ρ = cov(Xi, Xj)/(σXi σXj) = cov(Xi, Xj)/σ²   (identically distributed, so the same σ),
the variance of the sample mean is
V((1/B) Σ_{i=1}^B Xi) = (1/B²)[ Σ_{i=1}^B V(Xi) + 2 Σ_{i<j} cov(Xi, Xj) ]
= σ²/B + ((B − 1)/B) ρσ²
= ρσ² + (1 − ρ) σ²/B.
The second term vanishes as B increases, but the first term does not. Hence, positive correlation between the trees limits how much averaging can reduce the variance, which is why random forests try to de-correlate the trees.
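In R, both bagging and random forests can be fitted with the randomForest package: setting mtry to the number of predictors gives bagging, while a smaller mtry (randomForest defaults to roughly p/3 for regression and √p for classification) considers only a random subset of the predictors at each split, which de-correlates the trees. A minimal sketch with a placeholder data frame train and numeric response y:

```r
library(randomForest)

p <- ncol(train) - 1    # number of predictors

set.seed(1)
bag_fit <- randomForest(y ~ ., data = train, mtry = p, ntree = 500)             # bagging
rf_fit  <- randomForest(y ~ ., data = train, mtry = floor(p / 3), ntree = 500)  # random forest

# predict(rf_fit, newdata = test)
```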