STAT318 — Data Mining
Dr
University of Canterbury, Christchurch,
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Tree-Based Methods
In this section we consider tree-based methods for regression and classification.
These methods partition the feature (predictor) space into a number of regions, and then fit a simple model (for example, a constant) in each region.
The set of splitting rules used to partition the feature space can be summarized in a tree, so these approaches are known as decision tree methods.
We will focus on the Classification and Regression Trees (CART) approach.
Decision trees are great because they are:
• simple;
• easy to interpret (at least for small trees);
• scale well to large training data sets (divide-and-conquer operation); and
• form the basis of state-of-the-art supervised learning methods (random forests and boosting).
Decision trees also have some drawbacks, but we’ll get to these later.
Decision Tree
A decision tree is a recursive partition of the feature space that can be displayed in a flowchart-like structure consisting of nodes and branches.
1 Root Node: A node that has no incoming branches.
2 Internal Node: A node that has exactly one incoming branch and at least two outgoing branches.
3 Terminal (Leaf) Node: A node with exactly one incoming branch and no outgoing branches.
Internal nodes have an associated query (splitting rule) and, by convention, we proceed to the left descendant node (left daughter) if we answer 'yes' to the query.
Here is an over-simplified decision tree for a medical triage system to illustrate the structure of a tree.
[Figure: triage decision tree. Node 1 (root): "Does the patient have difficulty breathing?" Yes → node 2: send to hospital. No → node 3: "Is the patient bleeding?" Yes → node 6: send to hospital. No → node 7: stay at home.]
To classify a patient we answer the queries and follow the branches. For example, a patient with no difficulty breathing and no bleeding is told to stay at home:
node 1 → node 3 → node 7 = stay at home. A patient with difficulty breathing is told to go to hospital: node 1 → node 2 = send to hospital.
Splitting Rules
Qualitative Feature:
1 Multiway split: let K be the number of values the ith feature, $X_i$, takes in node m. Using $X_i$, split node m into K descendant nodes, one node for each value.
2 Binary split: the split has the form $X_i \in S$, where S is a subset of the values the ith feature takes.
Quantitative Feature:
1 Axis-parallel splits: a split using the ith feature, $X_i$, has the form $X_i \le s$, where $s \in \mathbb{R}$.
2 Oblique splits: the split uses a linear combination of features and has the form
$$a_0 + \sum_{i=1}^{p} a_i X_i \le 0, \quad \text{where } a_0, a_1, \ldots, a_p \in \mathbb{R}.$$
Consider a feature Car Type with three values: Sports, Family, Luxury. A multiway split and an equivalent series of binary splits are shown below.
[Figure: a multiway split of Car Type into Sports, Family and Luxury, and an equivalent series of two binary splits using the subsets {Sports} and {Sports, Family}.]
If the feature has many values, multiway splits tend to fragment the data too quickly, leaving insufficient training data for further splitting of the descendant nodes. Noting that a multiway split can be formed using a series of binary splits (see above), binary splits are preferred. CART only uses binary splits for qualitative features.
Oblique and axis-parallel splits
[Figure: a two-class data set in (X1, X2) separated by a single oblique split (left) and by a sequence of axis-parallel splits (right).]
In this example, we have a two-class classification problem (green or yellow) with two continuous predictors ($X_1$ and $X_2$). The classes are linearly separable and an oblique split of the form $a_0 + a_1 X_1 + a_2 X_2 \le 0$ separates the classes perfectly. However, in general, finding an optimal oblique split can be computationally expensive as there are many possibilities to consider. Oblique trees are also more difficult to interpret because each split is a linear combination of the predictors, rather than a split involving a single predictor.
CART uses axis-parallel splits ($X_i \le s$) for continuous predictors. Axis-parallel splits can produce complex decision boundary structure (see the right figure) when compared with oblique splits, but they are easier to fit. We will not be considering oblique splits any further in this course.
Regression Trees: Hitters data set
[Figure: the Hitters training data plotted as Hits versus Years; players with low salary are coloured blue and those with high salary yellow.]
To motivate regression trees, we consider a simple example. The aim here is to predict a Baseball player’s salary using:
• Hits: the number of hits in the previous year; and
• Years: the number of years playing in the major leagues.
The response variable Salary has been log-transformed so it's more 'bell shaped' (Salary is measured in thousands of dollars). Players with low income are coloured blue and those with high income yellow.
Can you explain the player with a high income (yellow), zero hits and only one year playing in the major league?
How do we partition the training data?
Decision Tree: Hitters data set
[Figure: the partition of the (Years, Hits) feature space into the three regions R1, R2 and R3, with boundaries at Years = 4.5 and Hits = 117.5.]
A partition of the feature space for the Hitters training data. The three regions can be written as follows:
$R_1 = \{X : \text{Years} < 4.5\}$
$R_2 = \{X : \text{Years} \ge 4.5, \text{Hits} < 117.5\}$
$R_3 = \{X : \text{Years} \ge 4.5, \text{Hits} \ge 117.5\}$
We read {X : Years < 4.5} as ‘the set of training data points X such that Years is less than 4.5’.
Decision Tree: Hitters data set
[Figure: the corresponding regression tree; the root split is Years < 4.5 (left leaf 5.11), followed by Hits < 117.5 (leaves 6.00 and 6.74). Leaf values are mean log Salary.]
The regression tree corresponding to the partition on the previous slide. The lengths of the branches are proportional to the decrease in RSS after the split is made. That is, longer branches correspond to greater reductions in RSS. (Note: this is something that the tree R function does, not all functions plot trees in this way.)
The mean response of the training observations at each terminal node is used for making predictions. For example, a player with 6 years experience and 80 hits has a predicted log salary of 6.00 (that is, a salary of about $1{,}000 \times e^{6} \approx \$403{,}000$).
Observations:
• Years is the most important predictor (it’s at the top of the tree and it reduces the RSS the most).
• Experienced players with a large number of hits get larger salaries.
• Simple, easy to display, and easy to explain to a non-expert. Now we need to learn how to fit one!
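The notes later mention the tree R function; as a rough sketch of how a tree like the one above can be fitted (assuming the standard ISLR package for the Hitters data and the tree package; the unpruned tree will typically have more than three leaves until it is pruned, which is covered shortly):

```r
library(ISLR)   # Hitters data set
library(tree)   # CART-style trees

Hitters <- na.omit(Hitters)                        # remove players with a missing Salary
fit <- tree(log(Salary) ~ Years + Hits, data = Hitters)

plot(fit); text(fit, pretty = 0)                   # draw the tree with its split labels
predict(fit, newdata = data.frame(Years = 6, Hits = 80))   # predicted log salary for the example player
```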
Tree Building Process (Regression)
We have N training data observations $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a vector containing p predictors and $y_i$ is its response value.
Suppose we have partitioned the feature space into M distinct regions, $R_1, R_2, \ldots, R_M$, and model the response as a constant, $c_m$, in the mth region:
$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m).$$
To minimize the sum of squares $\sum_i (y_i - f(x_i))^2$, we choose $c_m = \text{ave}(y_i : x_i \in R_m)$.
There are many possible partitions that can be formed. CART uses partitions consisting of nested boxes. These partitions are simple to form and the resulting model is easy to interpret.
Tree Building Process (Regression)
It is usually computationally infeasible to find a partition that minimizes the sum of squares.
CART uses a top-down, greedy approach called binary recursive partitioning.
1 Top-Down: Start at the root node with all the data and then recursively split the
feature space, where each split defines two new branches further down the tree.
2 Greedy: Make a series of locally optimal splits, rather than looking ahead and picking a split that will produce a better tree at some future step.
This figure illustrates how the data are divided following a split.
[Figure: a split divides the training data into two sets, A and B.]
After each split, we end up with two independent training sets (A and B). These sets can be treated separately in parallel for further splitting.
A series of locally optimal splits (greedy) does not guarantee a globally optimal tree.
Tree Building Process (Regression)
Starting with all the training data, consider splitting on the jth feature using a split point s.
This defines two regions:
$$R_1 = \{X : X_j \le s\} \quad \text{and} \quad R_2 = \{X : X_j > s\}.$$
We seek the splitting feature j and the split point s that solve
$$\min_{j,\,s}\left\{\sum_{x_i \in R_1} (y_i - \bar{y}_{R_1})^2 + \sum_{x_i \in R_2} (y_i - \bar{y}_{R_2})^2\right\},$$
where $\bar{y}_{R_m}$ is the mean response in the mth region.
If $\bar{y}_{R_1} \ne \bar{y}_{R_2}$, the split reduces the RSS; otherwise the RSS is unchanged. That is, the split cannot increase the RSS.
For N points in $\mathbb{R}^d$, there are $d(N-1)$ possible splits to consider. For example, in $\mathbb{R}^2$ with N = 4 there are six possible splits:
[Figure: four training points in the (X1, X2) plane with the three candidate splits of the form X1 < s and the three of the form X2 < s.]
The splitting value s is taken as the midpoint between the two neighbouring points. (Qualitative features are covered on the next page.)
Tree Building Process (Regression)
The best split can be found quickly by scanning through all the features.
Having found the best split, we partition the training data into two regions and repeat the splitting process on these two regions.
The splitting process is repeated until a stopping condition is satisfied, for example, until no terminal node contains more than five observations.
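To make the scan concrete, here is a minimal base-R sketch of the greedy search (best_split is a hypothetical helper, not course code): for every feature j and every candidate midpoint s it computes the RSS of the two resulting regions and keeps the best pair.

```r
# X: data frame of numeric predictors, y: numeric response
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    xj <- X[[j]]
    cuts <- sort(unique(xj))
    if (length(cuts) < 2) next
    midpoints <- (head(cuts, -1) + tail(cuts, -1)) / 2   # s is the midpoint between neighbouring values
    for (s in midpoints) {
      left  <- y[xj <= s]
      right <- y[xj >  s]
      rss   <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best
}

# Example: the best first split for the Hitters data (assuming ISLR is available)
# hit <- na.omit(ISLR::Hitters)
# best_split(hit[, c("Years", "Hits")], log(hit$Salary))
```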
Qualitative Features
CART considers all $2^{K-1} - 1$ possible binary splits for qualitative features, where K is the number of different values the feature takes in the node. For the Car Type example on Slide 4, there are three possible splits to consider. Let $X_j$ = Car Type, then the three possible splits are:
$X_j \in \{\text{Sports}\}$, $X_j \in \{\text{Sports, Family}\}$ and $X_j \in \{\text{Sports, Luxury}\}$. The corresponding regions are:
$R_1 = \{X : X_j \in \{\text{Sports}\}\}$ and $R_2 = \{X : X_j \in \{\text{Family, Luxury}\}\}$;
$R_1 = \{X : X_j \in \{\text{Sports, Family}\}\}$ and $R_2 = \{X : X_j \in \{\text{Luxury}\}\}$;
$R_1 = \{X : X_j \in \{\text{Sports, Luxury}\}\}$ and $R_2 = \{X : X_j \in \{\text{Family}\}\}$.
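A small base-R sketch (binary_splits is a hypothetical helper) that enumerates these $2^{K-1}-1$ splits by listing the subsets S that contain the first level, so each split is counted once:

```r
binary_splits <- function(lev) {
  K <- length(lev)
  splits <- list(lev[1])                         # S = {first level}
  if (K > 2) {
    for (size in 2:(K - 1)) {
      rests <- combn(lev[-1], size - 1, simplify = FALSE)   # choose the rest of S
      for (rest in rests) splits <- c(splits, list(c(lev[1], rest)))
    }
  }
  splits
}

binary_splits(c("Sports", "Family", "Luxury"))
# {Sports}, {Sports, Family}, {Sports, Luxury}: the three splits listed above
```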
Recursive Binary Partition
[Figure: a partition of the (X1, X2) feature space into rectangles that are not nested.]
This partition was not formed using recursive binary partitioning (the boxes are not nested). CART will not form a partition like this.
Recursive Binary Partition
[Figure: a recursive binary partition of the (X1, X2) feature space using the splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3 and X2 ≤ t4, giving regions R1-R5; the corresponding tree; and the fitted piecewise-constant surface.]
This is an example of a CART partition, the corresponding tree and a plot of the model (a piecewise-constant surface). Another simple regression example with one predictor is illustrated below.
[Figure: a one-predictor regression example with a single split at x = 4 and leaf predictions y = 2 and y = 8, together with the corresponding tree (root query x < 4).]
Pruning a Tree
The tree building process may produce good predictions on the training data, but is likely to overfit the training data.
A smaller tree might give lower variance and better interpretation at the cost of increased bias.
Rather than growing a small tree, it is better to grow a large tree and prune it back into a smaller tree.
Deep bushy trees localize the training data to relatively small regions $R_m$ (in a similar way to KNN using a relatively small value of k). This suggests deep trees have low bias and high variance. Pruning the deep bushy tree aims to reduce the variance and reduce overfitting.
Another strategy would be to grow the tree until the reduction in RSS becomes sufficiently small. However, a seemingly worthless split early on may be followed by a useful split that reduces the RSS substantially. It is good practice to grow a large tree and then prune it back into a smaller tree.
Cost-Complexity Pruning
Let $T_0$ denote a fully grown tree. A subtree $T \subset T_0$ is any tree that can be obtained by collapsing any number of internal nodes.
The cost-complexity criterion is
$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \bar{y}_{R_m})^2 + \alpha |T|,$$
where $|T|$ is the number of terminal nodes of tree T.
For each $\alpha \ge 0$, there is a unique subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$.
The idea here is to find a sequence of best trees of varying complexities (by changing $\alpha$). Then, we choose the correct complexity for the problem at hand.
The penalty parameter $\alpha$ governs the trade-off between tree size (complexity) and its goodness-of-fit to the training data, where large $\alpha$ values select smaller trees.
• To find $T_\alpha$, we collapse the two terminal nodes (into their parent node) that yield the smallest increase in
$$\text{RSS} = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \bar{y}_{R_m})^2.$$
We continue this collapsing process, producing a sequence of trees, until we reach the root node. This sequence of trees must contain $T_\alpha$ (we will not prove this result).
• We then find the best $\alpha$ (the one that gives the lowest testing error) using cross-validation.
Tree Algorithm
1 Use recursive binary partitioning to grow a large tree on the training data.
2 Apply cost-complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.
3 Use K-fold cross-validation to choose $\alpha$. For $k = 1, 2, \ldots, K$:
(a) Repeat steps (1) and (2) on the training data excluding the kth fold.
(b) Evaluate the mean squared testing error of the kth fold, as a function of $\alpha$.
Average the results and choose $\alpha$ to minimize the average error.
4 Return the subtree from step (2) that corresponds to the chosen value of $\alpha$.
Step 2 gives us the sequence of best trees as a function of $\alpha$, from the full tree ($\alpha = 0$) down to the root node, with complexity increasing as $\alpha$ decreases. Step 3 uses cross-validation to find the best $\alpha$ (tree size). The tree from step 2 of that size is then selected.
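A hedged sketch of steps 2-4 using the tree package (assuming fit is a large regression tree such as the one grown on the Hitters data earlier); cv.tree() carries out the cross-validation over the cost-complexity pruning sequence and prune.tree() extracts the subtree of the chosen size:

```r
set.seed(1)
cv.out <- cv.tree(fit, K = 10)                     # CV deviance for the pruning sequence
best.size <- cv.out$size[which.min(cv.out$dev)]    # tree size with the smallest CV error
pruned <- prune.tree(fit, best = best.size)

plot(pruned); text(pruned, pretty = 0)
```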
Fully Grown Tree: Hitters data
[Figure: the fully grown regression tree for the Hitters data, with 12 terminal nodes. The splits use Years, Hits, RBI, Putouts, Walks and Runs (for example Years < 4.5, Hits < 117.5, RBI < 60.5, Putouts < 82, Walks < 52.5, Runs < 47.5); leaf values (mean log Salary) range from about 4.62 to 7.29.]
This is the fully grown regression tree for the Hitters data from earlier in this section.
Observations:
• There are 12 terminal nodes in the tree.
• Splits further down the tree did not reduce the RSS as much as splits higher up (as expected).
• The most useful predictors were Years, RBI (number of runs batted in) and Hits.
• There are 19 predictors in the Hitters data and the tree only used 6 of them to model the response (automatic feature selection, we didn’t need to specify the most useful predictors to fit the model). Trees only use the predictors that matter (in terms of reducing the RSS) and discard the rest.
• Are we over-fitting the training data?
Testing Error: Hitters data
[Figure: training, cross-validation and test mean squared error plotted against tree size (1-10) for the Hitters data.]
The training MSE decreases as tree size (complexity) increases (as expected because larger trees have lower bias than smaller ones). The CV error is a reasonable approximation to the testing error and shows the characteristic U shape. For this problem, a tree size of three fits the data well. The pruned tree was given on slide 8.
Classification Trees
Classification trees are very similar to regression trees. The feature space is subdivided using recursive binary partitioning, but the sum of squares error is not used to choose the best split and the average is not used for predictions at terminal nodes.
In a node m (region $R_m$ of the partition) with $N_m$ observations, denote the fraction of observations in class k by
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k).$$
We classify an observation in a terminal node m to the majority class in node m,
$$k(m) = \arg\max_k (\hat{p}_{mk}).$$
Impurity Measures
Misclassification Error:
$$I(m) = 1 - \hat{p}_{mk(m)} = 1 - \max_k \{\hat{p}_{mk}\}.$$
Gini Index:
$$I(m) = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}).$$
Cross-Entropy (Deviance):
$$I(m) = -\sum_{k=1}^{K} \hat{p}_{mk}\log_2(\hat{p}_{mk}) \quad (\text{with } 0\log_2 0 = 0).$$
A small impurity value indicates that a node contains predominantly observations from one class (we want nodes to be pure, or close to it). The choice of impurity measure doesn't really have an impact on the performance of a classification tree. CART uses the Gini index, so we'll focus on this one. For a two-class classification problem we have
$$I(m) = 2p(1-p),$$
where p is the proportion in the first (or second) class.
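A small base-R sketch (hypothetical helper functions, not course code) of the three impurity measures for a vector of class proportions:

```r
# p.hat is a vector of class proportions in a node (sums to one)
misclass <- function(p.hat) 1 - max(p.hat)
gini     <- function(p.hat) sum(p.hat * (1 - p.hat))
entropy  <- function(p.hat) -sum(ifelse(p.hat > 0, p.hat * log2(p.hat), 0))   # 0 log2(0) = 0

p <- c(0.5, 0.5)                       # two classes, equally represented
c(misclass(p), gini(p), entropy(p))    # 0.5, 0.5, 1: each measure at its maximum
```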
Impurity Measures: Two-Class Problem
[Figure: misclassification error, Gini index and cross-entropy for a two-class problem plotted against the class proportion p.]
Impurity measures attain their maximum value when each class is equally represented at a node.
[Diagram: node m is split into a left daughter 2m and a right daughter 2m + 1.]
The best split in node m is the one that maximizes the reduction in impurity
$$\Delta I(m) = I(m) - \frac{N_{2m}}{N_m} I(2m) - \frac{N_{2m+1}}{N_m} I(2m+1),$$
where I(m) is the impurity and $N_m$ is the number of observations in node m.
Note: it is not always possible to split an impure node. In such cases the node is declared terminal and observations are classified to the majority class.
Consider a binary classification problem, $y \in \{\text{Yes, No}\}$, with 100 training observations in node m. The node has been split, giving two daughter nodes with the following structure:
Node m: Yes = 50, No = 50. Left daughter 2m: Yes = 25, No = 50. Right daughter 2m + 1: Yes = 25, No = 0.
That is, 25 of the Yes's from node m go to the left daughter after the split and the remaining 25 go right. The change in impurity is calculated as follows:
$$I(m) = 2(1/2)(1 - 1/2) = 1/2 \quad \text{(parent node)},$$
$$I(2m) = 2(2/3)(1 - 2/3) = 4/9 \quad \text{(left daughter)},$$
$$I(2m+1) = 0 \quad \text{(right daughter)}.$$
The change in impurity is
$$\Delta I(m) = 1/2 - (3/4)(4/9) - (1/4)(0) = 1/6.$$
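The same calculation in base R, as a quick check:

```r
gini2 <- function(p) 2 * p * (1 - p)   # two-class Gini index

I.m   <- gini2(50 / 100)   # parent node m:          1/2
I.2m  <- gini2(25 / 75)    # left daughter 2m:       4/9
I.2m1 <- gini2(25 / 25)    # right daughter 2m + 1:  0

I.m - (75 / 100) * I.2m - (25 / 100) * I.2m1   # change in impurity: 1/6
```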
Classification Tree: Heart data
[Figure: the fully grown classification tree for the Heart data. Splits include Thal, Ca < 0.5, ChestPain, MaxHR < 161.5, MaxHR < 156, MaxHR < 145.5, Chol < 244, RestBP < 157, RestECG < 1, Age < 52, Sex < 0.5, Slope < 1.5 and Oldpeak < 1.1; terminal nodes are labelled Yes or No.]
The fully grown classification tree for the Heart data. The Heart data contains 303 patients with 13 predictors. The outcome Yes indicates the presence of heart disease and No indicates no heart disease. Notice that some internal nodes have been split into terminal nodes with the same class label (both Yes or both No). How can this happen?
Here is an example that illustrates this: a node containing Yes = 10, No = 2 is split into a left daughter with Yes = 7, No = 0 and a right daughter with Yes = 3, No = 2, so both daughters are labelled Yes.
Hence, it is useful to report the estimated probability of being in a particular class. In the example above, the left node is pure so we are confident about the predicted Yes. However, for the right node, the estimated probability of Yes is 3/5 (not much better than flipping a coin).
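A hedged sketch of fitting this tree with the tree package; the file name Heart.csv and the column names (AHD as the Yes/No response, X as a row-index column) are assumptions about the course copy of the data:

```r
Heart <- read.csv("Heart.csv", stringsAsFactors = TRUE)   # assumed file name
Heart$X <- NULL                 # drop the row-index column, if present (assumption)
Heart <- na.omit(Heart)

library(tree)
heart.tree <- tree(AHD ~ ., data = Heart)        # AHD assumed to be the Yes/No response
plot(heart.tree); text(heart.tree, pretty = 0)

predict(heart.tree, Heart[1:3, ], type = "class")   # predicted classes
predict(heart.tree, Heart[1:3, ])                   # estimated class probabilities
```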
Classification Tree: Heart data
[Figure: the pruned classification tree for the Heart data (six terminal nodes, using Thal, Ca, MaxHR and ChestPain), together with the training, cross-validation and test error plotted against tree size.]
Cost-complexity pruning gives a simple tree with six terminal nodes. Of the 13 predictors, only four were included in the final model.
Missing Values
Deficient Observations: training or testing observations that have missing feature values (not missing response values).
If deficient observations are present in the training set, the best split is found using the known feature values.
To make predictions for deficient observations, surrogate splits are used.
Let $X_j < s$ be the splitting rule in node m, called the primary split. A surrogate split uses a different feature $X_i$ ($i \ne j$) to split node m in a way that best approximates the distribution of training observations after the primary split. If training observation x goes to the left daughter after the primary split, it should also go left using the surrogate split, etc.
Surrogate splits are used to decide whether to go left or right at a node when the splitting feature is missing. For example, consider passing
x = (x1, x2, x3) = (missing, 4, 12)
through a tree with the query
x1 < 2.
We cannot answer this query because we don’t have an x1 value. If the surrogate split is x3 < 14,
we proceed to the left daughter node. Otherwise, we go to the right daughter. In either case, a decision is made and we can continue moving through the tree.
Surrogate splits are only used if they perform better than the blind rule, which sends x to the daughter node with the most training observations.
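The rpart package implements CART-style surrogate splits; the sketch below (reusing the assumed Heart.csv file) shows the relevant control settings, which are rpart's own arguments:

```r
library(rpart)
Heart <- read.csv("Heart.csv", stringsAsFactors = TRUE)   # assumed file name, as before

# rpart keeps observations whose response is observed, even if some predictor
# values are missing, and routes them through the tree using surrogate splits.
fit <- rpart(AHD ~ ., data = Heart, method = "class",
             control = rpart.control(usesurrogate = 2,   # if all surrogates are missing, use the blind rule
                                     maxsurrogate = 5))  # keep up to 5 surrogate splits per node
summary(fit)   # lists each primary split together with its surrogate splits
```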
Pros and Cons of Decision Trees
Small trees are simple, easy to explain and can be displayed graphically.
Trees can easily handle categorical predictors without having to create dummy variables and can deal with missing values.
Trees are invariant under all monotone transformations of the predictors in the training data.
Trees perform variable selection and complexity reduction automatically.
Tree-growing algorithms scale well to large training data sets.
Unfortunately, trees generally have poor predictive accuracy when compared with other methods. However, by aggregating trees, their predictive performance can be greatly improved.
A single large tree tends to have low bias but high variance. Idea: if we could somehow average the predictions from many trees, we could keep the bias low but reduce the variance, in the same way that the mean of several measurements is less variable than a single measurement.
How do we build many trees when we only have one training set? Use bootstrapping: resample our training sample with replacement to obtain a large number B of resamples. Sampling with replacement is critical; otherwise we would get B identical samples. Each bootstrap resample also leaves out some of the original training cases, and these 'out-of-bag' cases can be used to estimate the test error.
This gives bootstrap aggregation (bagging): obtain B bootstrap resamples, grow a large (low-bias, high-variance) tree on each resample, and then aggregate the predictions from all of the trees. For regression, take the mean of the B predictions; for classification, take the majority vote of the B predictions.
Bagging
Deep trees suffer from high variance (small changes in the training data can dramatically change the tree produced), but tend to have low bias.
Bootstrap aggregation (bagging) is a general purpose approach for reducing the variance of a statistical learning technique (it is particularly useful for deep trees because of their high variance and low bias).
Given a set of n independent observations $X_1, X_2, \ldots, X_n$ with common variance $\sigma^2$, the mean of these observations $\bar{X}$ has variance $\sigma^2/n$. Averaging a set of observations reduces variance.
The sample mean (from a random sample) $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$ is (by the Central Limit Theorem) approximately normally distributed, and its expected value is
$$E(\bar{X}) = E\left(n^{-1}(X_1 + X_2 + \ldots + X_n)\right) = n^{-1}\left(E(X_1) + E(X_2) + \ldots + E(X_n)\right) = n^{-1}(\mu + \mu + \ldots + \mu) = \mu.$$
Hence, $\bar{X}$ is an unbiased estimate of the population mean $\mu$ and its variance is
$$V(\bar{X}) = V\left(n^{-1}(X_1 + X_2 + \ldots + X_n)\right) = n^{-2}\left(V(X_1) + V(X_2) + \ldots + V(X_n)\right) = n^{-2}\left(\sigma^2 + \sigma^2 + \ldots + \sigma^2\right) = \sigma^2/n$$
(using independence), which decreases with increasing sample size (precision increases with sample size).
Idea: Build an ensemble of bushy trees using different training data and average their predictions to reduce variance.
Bagging
Unfortunately, we usually only have one training data set.
However, we can generate B different bootstrapped training data sets.
Bagging: train a tree on the bth bootstrapped training set to get $T_b(x)$. The prediction is given by
$$\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x) \quad \text{(regression)}.$$
For classification, we record the prediction made by each tree and then take a majority vote.
We want our trees to have low bias and high variance, so we do not prune trees in the ensemble. Pruning trees increases the pairwise correlation between trees, which reduces the effect of averaging. Essentially, we want each tree in the ensemble to have low bias (big and bushy trees) and for each tree to make mistakes on different training observations.
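A minimal base-R sketch of bagging regression trees (bag_trees and predict_bag are hypothetical helpers; the tree package and the Hitters data are used for illustration, and B is kept small):

```r
library(ISLR)
library(tree)

bag_trees <- function(formula, data, B = 50) {
  n <- nrow(data)
  lapply(seq_len(B), function(b) {
    boot <- data[sample(n, n, replace = TRUE), ]            # bootstrap resample
    tree(formula, data = boot,
         control = tree.control(nobs = n, mindev = 0))      # grow a deep, unpruned tree
  })
}

predict_bag <- function(trees, newdata) {
  # average the B predictions (regression)
  Reduce(`+`, lapply(trees, predict, newdata = newdata)) / length(trees)
}

Hitters <- na.omit(Hitters)
set.seed(1)
ensemble <- bag_trees(log(Salary) ~ Years + Hits, Hitters, B = 50)
predict_bag(ensemble, data.frame(Years = 6, Hits = 80))
```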
Bagging: Example
[Figure: the tree grown on the original training data and the trees grown on bootstrap samples b = 1 to 5. The root splits are x.1 < 0.395 (original), x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985 and x.4 < 1.36. From The Elements of Statistical Learning, Hastie, Tibshirani and Friedman (2009).]
This is a two-class classification problem, $y \in \{0, 1\}$, with p = 5 features $x_1, x_2, \ldots, x_5$. Clearly, each bootstrapped training set produces a different tree structure. The root-node splits also use several different features. However, x1 appears to be a dominant feature because it is used to split the root node in 7 of the 12 trees shown.
To classify an observation x using this ensemble of 11 trees, we pass x through each tree and record each tree's prediction. If more than five trees classify x as 0, then x is class 0. Otherwise, x is class 1.
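A tiny base-R sketch of the majority vote (votes is a hypothetical matrix with one row per observation and one column per tree):

```r
majority_vote <- function(votes) {
  apply(votes, 1, function(v) names(which.max(table(v))))   # most frequent class per row
}

votes <- matrix(c("0", "0", "1", "0", "1", "0",
                  "1", "1", "1", "0", "0", "1"), nrow = 2, byrow = TRUE)
majority_vote(votes)   # "0" for the first observation, "1" for the second
```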
Bagging: Example
[Figure: the trees grown on bootstrap samples b = 6 to 11. The root splits are x.1 < 0.395, x.1 < 0.395, x.3 < 0.985, x.1 < 0.395, x.1 < 0.555 and x.1 < 0.555. From The Elements of Statistical Learning, Hastie, Tibshirani and Friedman (2009).]
Random Forests
Bagged trees tend to look quite similar to each other, particularly if there are only a few strong predictors.
Hence, predictions from the bagged trees will be highly (positively) correlated.
Unfortunately, averaging highly correlated quantities does not reduce variance as much as it would for uncorrelated quantities.
Random forests provide a potential improvement over bagged trees by attempting to de-correlate trees in the ensemble.
If B identically distributed (same $\sigma^2$, but not necessarily independent) random variables $X_i$ have positive pairwise correlation
$$\rho = \frac{\mathrm{cov}(X_i, X_j)}{\sigma_{X_i}\sigma_{X_j}} = \frac{\mathrm{cov}(X_i, X_j)}{\sigma^2} \quad \text{(identically distributed, so the same } \sigma\text{)},$$
the variance of the sample mean is
$$V\!\left(\frac{1}{B}\sum_{i=1}^{B} X_i\right) = \frac{1}{B^2}\left(\sum_{i=1}^{B} V(X_i) + 2\sum_{i<j} \mathrm{cov}(X_i, X_j)\right) = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.$$
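A quick base-R simulation (not from the notes; the values of rho, B and the construction of the correlated variables are illustrative) that checks this formula numerically:

```r
set.seed(1)
rho <- 0.6; sigma <- 1; B <- 50; nsim <- 10000

sim_means <- replicate(nsim, {
  z  <- rnorm(1)                                     # shared component gives pairwise correlation rho
  xi <- sigma * (sqrt(rho) * z + sqrt(1 - rho) * rnorm(B))
  mean(xi)
})

var(sim_means)                             # simulated variance of the average
rho * sigma^2 + (1 - rho) * sigma^2 / B    # theoretical value: 0.6 + 0.4/50 = 0.608
```

Even with a large B, the variance of the average cannot be pushed below $\rho\sigma^2$, which is why de-correlating the trees (as random forests attempt to do) matters.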