Ensemble Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2020
COMP9417 ML & DM Ensemble Learning Term 2, 2020 1 / 70
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from figures for the book
“Python Data Science Handbook” by J. VanderPlas O’Reilly Media (2017) http://shop.oreilly.com/product/0636920034919.do
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 2 / 70
Aims
Aims
This lecture will develop your understanding of ensemble methods in machine learning, based on analyses and algorithms covered previously. Following it you should be able to:
• describe the framework of the bias-variance decomposition and some of its practical implications
• describe how ensembles might be used to address the bias and variance components of error
• outline the concept of the stability of a learning algorithm
• describe the ensemble methods of bagging, random forests and
boosting
• compare the operation of these methods in terms of the bias and variance components of error
COMP9417 ML & DM Ensemble Learning Term 2, 2020 3 / 70
Introduction
Introduction
In previous lectures we introduced some theoretical ideas about limits on machine learning. But do these have any practical impact?
The answer is yes!
• The bias-variance decomposition of error can be a tool for thinking about how to reduce error in learning
• Take a learning algorithm and ask:
  • how can we reduce its bias?
  • how can we reduce its variance?
• Ensemble learning methods can be viewed in this light
• A form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble
COMP9417 ML & DM Ensemble Learning Term 2, 2020 4 / 70
Bias-variance
Review: bias-variance decomposition
• Theoretical tool for analyzing how much the specific training set affects the performance of a classifier
• Assume we have an infinite number of classifiers built from different training sets all of the same size:
• The bias of a learning scheme is the expected error due to the mismatch between the learner’s hypothesis space and the space of target concepts
• The variance of a learning scheme is the expected error due to differences in the training sets used
• Total expected error ≈ bias² + variance (written out below for squared loss)
• Next slide: a graphical representation of this idea, where distance
from target stands for error
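For squared-error loss the decomposition can be written out exactly. With target function f, a predictor ŷ learned from a random training set, and irreducible noise of variance σ², the expected error at a point x is

  E[(y − ŷ(x))²] = (f(x) − E[ŷ(x)])² + E[(ŷ(x) − E[ŷ(x)])²] + σ²
                 = bias² + variance + noise

where the expectations are taken over training sets (and the noise).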
COMP9417 ML & DM Ensemble Learning Term 2, 2020 5 / 70
Bias-variance
Bias and variance
A dartboard metaphor illustrating the concepts of bias and variance. Each dartboard corresponds to a different learning algorithm, and each dart signifies a different training sample. The learning algorithms in the top row exhibit low bias, on average staying close to the bullseye (the true function value for a particular x), while the ones in the bottom row have high bias. The left column shows low variance and the right column high variance.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 6 / 70
Bias-variance
Bias-variance: a trade-off
The trade-off is easier to see with regression, as in the following figure¹ (to see the details you may have to zoom in with your viewer):
• each column represents a different model class g(x) shown in red
• each row represents a different set of n = 6 training points, Di, randomly sampled from target function F (x) with noise, shown in black
• probability functions of mean squared error E are shown
¹ from: “Elements of Statistical Learning” by Hastie, Tibshirani and Friedman (2001)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 7 / 70
Bias-variance
Bias-variance: a trade-off
COMP9417 ML & DM Ensemble Learning Term 2, 2020 8 / 70
Bias-variance
Bias-variance: a trade-off
• a) is very poor: a linear model with fixed parameters independent of training data; high bias, zero variance
• b) is better: another model with fixed parameters, still independent of the training data but a closer fit to the target; slightly lower bias, zero variance
• c) is a cubic model with parameters trained by mean-square-error on training data; low bias, moderate variance
• d) is a linear model with parameters adjusted to fit each training set; intermediate bias and variance
• training with n → ∞ data would give:
• c) with bias approaching small value due to noise
• but not d)
• variance for all models would approach zero
COMP9417 ML & DM Ensemble Learning Term 2, 2020 9 / 70
Bias-variance
Bias-variance in ensemble classification
• Recall that we derived the bias-variance decomposition for regression – squared-error loss function
• Cannot apply same derivation for classification – zero-one loss
• Bias-variance decomposition used to analyze how much restriction to a single training set affects performance
• Can decompose expected error of any individual ensemble member as follows:
• Bias = expected error of the ensemble classifier on new data
• Variance = component of the expected error due to particular training
set being used to build classifier
• Total expected error ≈ bias + variance
• Note (A): we assume noise inherent in the data is part of the bias component as it cannot normally be measured
• Note (B): multiple versions of this decomposition exist for zero-one loss but the basic idea is always the same
COMP9417 ML & DM Ensemble Learning Term 2, 2020 10 / 70
Bias-variance
Bias-variance with “Big Data”
The following 4 slides are due to Prof. G. Webb, Monash U.
• high-bias algorithms are often used for efficiency – why?
• big data can reduce variance – why?
COMP9417 ML & DM Ensemble Learning Term 2, 2020 11 / 70
Bias-variance
Bias-variance with “Big Data”
Suppose we have a low bias representation (e.g., all conjunctive concepts), but such concepts may not always occur frequently in small datasets:
COMP9417 ML & DM Ensemble Learning Term 2, 2020 12 / 70
Bias-variance
Bias-variance with “Big Data”
So we can increase bias – e.g., by Naive Bayes-type conditional independence assumptions – but this forces averaging of class distributions over all “small concepts”:
COMP9417 ML & DM Ensemble Learning Term 2, 2020 13 / 70
Bias-variance
Bias-variance with “Big Data”
“Big Data” may help to resolve the bias-variance dilemma:
• high-bias algorithms are often used for efficiency
  • usually simpler to compute
• big data can reduce variance
  • “small” concepts will occur more frequently
  • low-bias algorithms can find them in each sample
  • but: how to compute efficiently?
This is still largely an open problem!
COMP9417 ML & DM Ensemble Learning Term 2, 2020 14 / 70
Bias-variance
Bias-variance in “Real-world AI”
Applications increasingly require machine-learning systems to perform at “human-level” (e.g., personal assistants, self-driving vehicles, etc.)
How can an understanding of the bias-variance decomposition help ?
Suppose you are developing an application and you know what “human-level” error would typically be on this task.
You have sufficient data for training and validation datasets, and you are not restricted in terms of the models that you could learn (e.g., from linear regression or classification up to kernel methods, ensembles, deep networks, etc.)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 15 / 70
Bias-variance
Bias-variance in “Real-world AI”
Training-set error is observed to be high compared to human-level – why ? Bias is too high – solution: move to a more expressive (lower bias) model
Training-set error is observed to be similar to human-level, but validation set error is high compared to human-level – why ?
Variance is too high – solution: get more data (!), try regularization, ensembles, move to a different model architecture
These scenarios are often found in applications of deep learning2.
2“Nuts and Bolts of Applying Deep Learning” by Andrew Ng http://www.youtube.com/watch?v=F1ka6a13S9I
COMP9417 ML & DM Ensemble Learning Term 2, 2020 16 / 70
Stability
Stability
• for a given data distribution D
• train algorithm L on training sets S1, S2 sampled from D
• expect that the model from L should be the same (or very similar) on both S1 and S2
• if so, we say that L is a stable learning algorithm
• otherwise it is unstable
• typical stable algorithm: kNN (for some k)
• typical unstable algorithm: decision-tree learning
Turney, P. “Bias and the Quantification of Stability”
COMP9417 ML & DM Ensemble Learning Term 2, 2020 17 / 70
Stability
Decision boundaries in tree learning
Decision boundaries for monothetic two-class trees in two and three dimensions; arbitrarily fine decision regions for classes R1, R2 can be learned by recursively partitioning the instance space.
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 18 / 70
Stability
Instability of tree learning
An example shows the effect of a small change in the training data on the structure of an unpruned binary tree learned by CART. The training set has 8 instances for each class:
Note: for class ω2 (red) the last instance has two values for feature x2. On the next slide is a tree learned from the data where this instance has value x2 = .36 (marked ∗), and on the following slide we see the tree obtained when this value is changed to
x2 = .32 (marked †).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 19 / 70
Stability
Instability of tree learning
The partitioned instance space (left) contains the instance marked ∗ and corresponds to the decision tree (right).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 20 / 70
Stability
Instability of tree learning
The partitioned instance space (left) contains the instance marked † and corresponds to the decision tree (right). Note that both the decision boundaries and the tree topology are considerably changed, for example, testing x2 rather than x1 at the tree root, although the change in the data was very small.
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 21 / 70
Stability
Stability and Bias-Variance
• stable algorithms typically have high bias
• unstable algorithms typically have high variance
• BUT: take care to consider effect of parameters, e.g., in kNN
• 1NN perfectly separates training data, so low bias but high variance
• By increasing the number of neighbours k we increase bias and
decrease variance (what happens when k = n?)
• Every test instance will have the same number of neighbours, and the
class probability vectors will all be the same !
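A minimal illustration of this k = n limiting case, sketched with scikit-learn's KNeighborsClassifier on made-up data (names and data are illustrative only): every query's neighbourhood is the whole training set, so every test instance gets the same prediction and the same class-probability vector.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # 100 made-up training points, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # a simple linear concept

# k = n: each query's neighbourhood is the entire training set,
# so the prediction and class probabilities are identical for every query.
knn = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
X_test = rng.normal(size=(5, 2))
print(knn.predict(X_test))                   # the same (majority) label five times
print(knn.predict_proba(X_test))             # five identical probability vectors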
COMP9417 ML & DM Ensemble Learning Term 2, 2020 22 / 70
Stability
Three-, five- and seven-nearest neighbour
Decision regions of k-nearest neighbour classifiers; the shading represents the predicted probability distribution over the five classes.
(Panels, left to right: 3-nearest neighbour, 5-nearest neighbour, 7-nearest neighbour.)
Illustrates the effect of varying k on stability (i.e., bias and variance).
COMP9417 ML & DM Ensemble Learning Term 2, 2020 23 / 70
Ensembles
Ensemble methods
In essence, ensemble methods in machine learning have the following two things in common:
• they construct multiple, diverse predictive models from adapted versions of the training data (most often reweighted or resampled);
• they combine the predictions of these models in some way, often by simple averaging or voting (possibly weighted).
COMP9417 ML & DM Ensemble Learning Term 2, 2020 24 / 70
Ensembles
Ensembles: combining multiple models
• Basic idea of ensembles or “multi-level” learning schemes: build different “experts” and let them vote
• Advantage: often improves predictive performance
• Disadvantage: produces output that is very hard to interpret
• Notable schemes: bagging, random forests, boosting
• can be applied to both classification and numeric prediction problems
COMP9417 ML & DM Ensemble Learning Term 2, 2020 25 / 70
Ensembles
Bagging
Bootstrap error estimation
This is a standard “resampling” technique from statistics. Can be used to estimate a parameter of interest, e.g., error rate of a learning method on a data set.
• sampling from data set with replacement
• e.g. sample from n instances, with replacement, n times to generate
another data set of n instances
• (almost certainly) new data set contains some duplicate instances
• and does not contain others – used as the test set
• chance of an instance not being picked: (1 − 1/n)^n ≈ e^(−1) = 0.368
• so about 63.2% of the original instances appear in the bootstrap training set
• error estimate = 0.632 × err_test + 0.368 × err_train (see the code sketch below)
• repeat and average with different bootstrap samples
• however, in ML, cross-validation is preferred for error estimates
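A small numpy sketch (illustrative, not from the slides) of one bootstrap sample and the 0.632 estimate described above:

import numpy as np

n = 1000
rng = np.random.default_rng(0)

# One bootstrap sample: draw n indices with replacement from {0, ..., n-1}.
idx = rng.integers(0, n, size=n)
out_of_sample = np.setdiff1d(np.arange(n), idx)   # instances never picked -> test set

# Fraction left out is close to e^-1 = 0.368, so about 0.632 of instances are picked.
print(len(out_of_sample) / n, (1 - 1 / n) ** n, np.exp(-1))

def bootstrap_632(err_test, err_train):
    # The 0.632 estimate combining out-of-sample and training (resubstitution) error.
    return 0.632 * err_test + 0.368 * err_train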
COMP9417 ML & DM Ensemble Learning Term 2, 2020 26 / 70
Ensembles
Bagging
Bootstrap error estimation
Why this is interesting/useful
• Can be used to estimate many parameters of interest
• For example, bias, variance, etc.
• Can then apply significance tests and other statistical machinery
• But can be computationally demanding
• Not widely used in machine learning (unlike cross-validation)
• See Bradley Efron’s book for more details
COMP9417 ML & DM Ensemble Learning Term 2, 2020 27 / 70
Ensembles
Bagging
Bagging
“Bootstrap Aggregation”
• Employs simplest way of combining predictions: voting/averaging
• Each model receives equal weight
• Generalized version of bagging:
• Sample several training sets of size n (instead of just having one training set of size n)
• Build a classifier for each training set
• Combine the classifiers’ predictions
• This improves performance in almost all cases if the learning scheme is unstable (e.g., decision trees)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 28 / 70
Ensembles
Bagging
Bagging
• Bagging reduces variance by voting/averaging, thus reducing the overall expected error, even though the bootstrap training sets are not independent of one another
• In the case of classification there are pathological situations where the overall error might increase
• Usually, the more classifiers the better, with diminishing returns
• Problem: we only have one dataset!
• Solution: generate new datasets of size n by sampling with replacement from original dataset, giving duplicate instances
• Can be applied to numeric prediction and classification
• Can help a lot if data is noisy
COMP9417 ML & DM Ensemble Learning Term 2, 2020 29 / 70
Ensembles
Bagging
Bagging in a nutshell
Learning (model generation)
Let n be the number of instances in the training data.
For each of t iterations:
  Sample n instances with replacement from the training set.
  Apply the learning algorithm to the sample.
  Store the resulting model.
Classification
For each of the t models:
Predict class of instance using model.
Return “ensemble” class.
What is the ensemble class? For classification it is the class that has been predicted most often (the majority vote or mode); for regression it is the mean of the predicted values.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 30 / 70
Ensembles
Bagging
Bagging more precisely
Algorithm Bagging(D, T, A) // train ensemble from bootstrap samples
Input: data set D; ensemble size T; learning algorithm A.
Output: set of models; predictions to be combined by voting or averaging.
1: for t = 1 to T do
2:   bootstrap sample Dt from D by sampling |D| examples with replacement
3:   run A on Dt to produce a model Mt
4: end
5: return {Mt | 1 ≤ t ≤ T}
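A compact Python sketch of this algorithm (illustrative, not part of the slides): scikit-learn decision trees stand in for the unstable base learner A, numpy does the bootstrap sampling, and the majority vote assumes integer class labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, T=50, seed=0):
    # Train T trees, each on a bootstrap sample D_t of the training data.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)            # sample |D| examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_vote(models, X):
    # Combine predictions by majority vote (assumes integer class labels).
    votes = np.array([m.predict(X) for m in models])        # shape (T, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])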
COMP9417 ML & DM Ensemble Learning Term 2, 2020 31 / 70
Ensembles
Bagging
Bagging linear classifiers
(left) An ensemble of five basic linear classifiers built from bootstrap samples with bagging. The decision rule is majority vote, leading to a piecewise linear decision boundary. (right) If we turn the votes into probabilities, we see the ensemble is effectively grouping instances in different ways, with each segment obtaining a slightly
different probability.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 32 / 70
Ensembles
Bagging
Bagging trees
An experiment with simulated data:
• sample of size n = 30, two classes, five features
• Pr(Y = 1 | x1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x1 > 0.5) = 0.8
• test sample of size 2000 from same population
• fit classification trees to training sample, 200 bootstrap samples
• trees are different (tree induction is unstable)
• therefore have high variance
• averaging reduces variance and leaves bias unchanged
• (graph: test error for original and bagged trees, with green – vote; purple – average probabilities)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 33 / 70
Ensembles
Bagging
Bagging trees
(Figures: test error of the original and bagged trees as the number of bootstrap samples increases; green – consensus vote, purple – averaged class probabilities.)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 34–37 / 70
Ensembles
Bagging
Bagging trees
The news is not all good:
• when we bag a model, any simple structure is lost
• this is because a bagged tree is no longer a tree …
• … but a forest
• although bagged trees can be mapped back to a single tree …
• … this reduces the claim to comprehensibility
• stable models like nearest neighbour are not very affected by bagging
• unstable models like trees are most affected by bagging
• usually, their design for interpretability (bias) leads to instability
• more recently, random forests (see Breiman’s web-site)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 38 / 70
Ensembles
Random Forests
Randomization
• Can randomize learning algorithm instead of input to introduce diversity into an ensemble
• Some algorithms already have a random component: e.g., initial weights in a neural net
• Most algorithms can be randomized, e.g., greedy algorithms:
• Pick N options at random from the full set of options, then choose the
best of those N choices
• E.g.: attribute selection in decision trees
• More generally applicable than bagging: e.g., we can use random subsets of features in a nearest-neighbor classifier
• Bagging does not work with stable classifiers such as nearest neighbour classifiers
• Can be combined with bagging
• When learning decision trees, this yields the Random Forest method for
building ensemble classifiers
COMP9417 ML & DM Ensemble Learning Term 2, 2020 39 / 70
Ensembles
Random Forests
Random Forests
Algorithm RandomForest(D, T, d) // train ensemble of randomized trees
Input: data set D; ensemble size T; subspace dimension d.
Output: set of models; predictions to be combined by voting or averaging.
1: for t = 1 to T do
2:   bootstrap sample Dt from D by sampling |D| examples with replacement
3:   select d features at random and reduce dimensionality of Dt accordingly
4:   train a tree model Mt on Dt without pruning
5: end
6: return {Mt | 1 ≤ t ≤ T}
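Continuing the illustrative Python sketch given for bagging, the only addition this pseudocode needs is a per-tree random feature subset (function names are, again, just illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, T=100, d=2, seed=0):
    # T unpruned trees, each on a bootstrap sample restricted to d random features.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)              # bootstrap sample
        feats = rng.choice(p, size=d, replace=False)  # random subspace of d features
        tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
        models.append((tree, feats))                  # remember each tree's feature subset
    return models

def predict_forest(models, X):
    # Majority vote over trees, each queried on its own feature subset.
    votes = np.array([tree.predict(X[:, feats]) for tree, feats in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])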
COMP9417 ML & DM Ensemble Learning Term 2, 2020 40 / 70
Ensembles
Random Forests
Random Forests
Leo Breiman’s Random Forests algorithm is essentially like Bagging for trees, except the ensemble of tree models is trained from bootstrap samples and random subspaces.
• each tree in the forest is learned from
  • a bootstrap sample, i.e., a sample from the training set with replacement
  • a subspace sample, i.e., a randomly sampled subset of features
• advantage: forces more diversity among trees in ensemble
• advantage: less time to train since only consider a subset of features
Note: combining linear classifiers in an ensemble gives a piecewise linear (i.e., non-linear) model, whereas multiple trees can be combined into a single tree.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 41 / 70
Ensembles
Boosting
Boosting
• Also uses voting/averaging but each model is weighted according to their performance
• Iterative procedure: new models are influenced by performance of previously built ones
• New model is encouraged to become “expert” for instances classified incorrectly by earlier models
• Intuitive justification: models should be experts that complement each other
• There are several variants of this algorithm . . .
COMP9417 ML & DM Ensemble Learning Term 2, 2020 42 / 70
Ensembles
Boosting
The strength of weak learnability
• Learner produces a binary classifier h with outputs in {−1, +1} and error rate ε < 0.5.
• In some sense h is “useful”, i.e., better than random !
• strong learner if ε < 0.5 and ε “close” to zero.
• weak learner if ε < 0.5 and ε “close” to 0.5.
• Question (arising from Valiant’s PAC framework):
is there a procedure to convert a weak learner into a strong learner ?
COMP9417 ML & DM Ensemble Learning Term 2, 2020 43 / 70
Ensembles
Boosting
The strength of weak learnability
Schapire (1990) - first boosting algorithm.
Method:
• weak learner learns initial hypothesis h1 from N examples
• next learns hypothesis h2 from new set of N examples, half of which are misclassified by h1
• then learns hypothesis h3 from N examples for which h1 and h2 disagree
• “boosted” hypothesis h gives a voted prediction on instance x:
  • if h1(x) = h2(x) then return the agreed prediction
  • else return h3(x)
Result: if h1 has error rate ε < 0.5 then the error of h is bounded by 3ε² − 2ε³, i.e., better than ε (see next slide).
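For example, a weak learner with ε = 0.3 gives a boosted error of at most 3(0.3)² − 2(0.3)³ = 0.27 − 0.054 = 0.216, already noticeably better than 0.3.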
Schapire showed that weak learners can be boosted into strong learners.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 44 / 70
Ensembles
Boosting
Boosting a weak learner reduces error
COMP9417 ML & DM Ensemble Learning Term 2, 2020 45 / 70
Ensembles
Boosting
Why does boosting work?
Simple boosting using three classifiers in an ensemble:
• h1 is a weak learner
• dataset used to train h2 is maximally informative wrt h1
• h3 learns on what h1 and h2 disagree about
• for prediction on instance x:
• if h1 and h2 agree, use that label (probably correct)
• otherwise use h3 (probably neither h1 nor h2 is correct)
Can apply this reasoning recursively within each component classifier
COMP9417 ML & DM Ensemble Learning Term 2, 2020 46 / 70
Ensembles
Boosting
A general boosting method
• original version: after initial hypothesis, each subsequent hypothesis has to “focus” on errors made by previous hypotheses
• general version: extend from 3 hypotheses to many
• how to focus current hypothesis on errors of previous hypotheses ?
• apply weights to misclassified examples
• called adaptive boosting
COMP9417 ML & DM Ensemble Learning Term 2, 2020 47 / 70
Ensembles
Boosting
Weight updates in boosting
• Suppose a linear classifier achieves performance as in the first contingency table. The error rate is ε = (9 + 16)/100 = 0.25.
• We want to give half the total weight to the misclassified examples. The following weight updates achieve this: a factor 1/(2ε) = 2 for the misclassified examples and 1/(2(1 − ε)) = 2/3 for the correctly classified examples.
              Predicted ⊕    Predicted ⊖
Actual ⊕           24             16        40
Actual ⊖            9             51        60
                   33             67       100
COMP9417 ML & DM Ensemble Learning Term 2, 2020 48 / 70
Ensembles
Boosting
Weight updates in boosting
• Taking these updated weights into account leads to the contingency table below, which has a (weighted) error rate of 0.5.
              Predicted ⊕    Predicted ⊖
Actual ⊕           16             32        48
Actual ⊖           18             34        52
                   34             66       100
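A few lines of Python arithmetic (an illustration, not part of the slides) reproduce both tables and the weighted error of 0.5:

import numpy as np

# Contingency table from the previous slide: rows = actual, columns = predicted.
counts = np.array([[24, 16],     # actual ⊕: predicted ⊕, predicted ⊖
                   [ 9, 51]])    # actual ⊖: predicted ⊕, predicted ⊖
n = counts.sum()
eps = (counts[0, 1] + counts[1, 0]) / n                 # error rate = 25/100 = 0.25

correct = np.eye(2, dtype=bool)                         # diagonal cells = correctly classified
weights = np.where(correct, 1 / (2 * (1 - eps)), 1 / (2 * eps))
reweighted = counts * weights
print(reweighted)                                       # [[16. 32.], [18. 34.]]
print((reweighted[0, 1] + reweighted[1, 0]) / reweighted.sum())   # weighted error = 0.5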
COMP9417 ML & DM Ensemble Learning Term 2, 2020 49 / 70
Ensembles
Boosting
Boosting
Algorithm Boosting(D, T, A) // train binary classifier ensemble, reweighting datasets
Input: data set D; ensemble size T; learning algorithm A.
Output: weighted ensemble of models.
1: w1i ← 1/|D| for all xi ∈ D
2: for t = 1 to T do
3:   run A on D with weights wti to produce a model Mt
4:   calculate weighted error εt
5:   αt ← (1/2) ln((1 − εt)/εt)
6:   w(t+1)i ← wti / (2εt) for misclassified instances xi ∈ D
7:   w(t+1)j ← wtj / (2(1 − εt)) for correctly classified instances xj ∈ D
8: end
9: return M(x) = Σ_{t=1}^{T} αt Mt(x)
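A Python sketch of this pseudocode (illustrative), using decision stumps as the weak learner and scikit-learn's sample_weight to pass the instance weights wti; class labels are assumed to be in {−1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=20):
    # AdaBoost-style ensemble following the pseudocode above; y has labels in {-1, +1}.
    n = len(X)
    w = np.full(n, 1 / n)                                # w_1i = 1/|D|
    models, alphas = [], []
    for t in range(T):
        m = DecisionTreeClassifier(max_depth=1)          # a decision stump as weak learner
        m.fit(X, y, sample_weight=w)
        pred = m.predict(X)
        eps = np.sum(w[pred != y])                       # weighted error
        if eps == 0 or eps >= 0.5:                       # perfect, or weak-learner condition violated
            break
        alphas.append(0.5 * np.log((1 - eps) / eps))     # alpha_t
        w = np.where(pred != y, w / (2 * eps), w / (2 * (1 - eps)))   # reweight (sums back to 1)
        models.append(m)
    return models, alphas

def predict(models, alphas, X):
    # Weighted vote: sign of sum_t alpha_t * M_t(x).
    return np.sign(sum(a * m.predict(X) for m, a in zip(models, alphas)))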
COMP9417 ML & DM Ensemble Learning Term 2, 2020 50 / 70
Ensembles
Boosting
(left) An ensemble of five boosted basic linear classifiers with majority vote. The linear classifiers were learned from blue to red; none of them achieves zero training error, but the ensemble does. (right) Applying bagging results in a much more homogeneous ensemble, indicating that there is little diversity in the bootstrap samples.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 51 / 70
Ensembles
Boosting
Why those αt?
The two weight updates for the misclassified instances and the correctly classified instances can be written as reciprocal terms δt and 1/δt normalised by some term Zt:
1/(2εt) = δt/Zt        1/(2(1 − εt)) = (1/δt)/Zt
From this we can derive
Zt = 2√(εt(1 − εt))        δt = √((1 − εt)/εt) = exp(αt)
So the weight update for misclassified instances is exp (αt) /Zt and for
correctly classified instances exp (−αt) /Zt. Using the fact that yiMt(xi) = +1 for instances correctly classified by model Mt and −1 otherwise, we can write the weight update as
w(t+1)i = wti · exp(−αt yi Mt(xi)) / Zt
which is the expression commonly found in the literature.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 52 / 70
Ensembles
Boosting
More on boosting
• Can be applied without weights using resampling with probability determined by weights
• Disadvantage: not all instances are used
• Advantage: resampling can be repeated if error exceeds 0.5
• Stems from computational learning theory
• Theoretical result: training error decreases exponentially
• Also: works if base classifiers not too complex and their error doesn’t become too large too quickly
COMP9417 ML & DM Ensemble Learning Term 2, 2020 53 / 70
Ensembles
Boosting
A bit more on boosting
• Puzzling fact: generalization error can decrease long after training error has reached zero
• Seems to contradict Occam’s Razor !
• However, problem disappears if margin (confidence) is considered
instead of error
• Margin: difference between estimated probability for true class and most likely other class (between -1, 1)
• Boosting works with weak learners: only condition is that error ε doesn’t exceed 0.5 (slightly better than random guessing)
• LogitBoost: more sophisticated boosting scheme in Weka (based on additive logistic regression)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 54 / 70
Ensembles
Boosting
Boosting reduces error
Adaboost applied to a weak learning system can reduce the training error exponentially as the number of component classifiers is increased.
• focuses on “difficult” patterns
• training error of successive classifier on its own weighted training set
is generally larger than predecessor
• training error of ensemble will decrease
• typically, test error of ensemble will decrease also
COMP9417 ML & DM Ensemble Learning Term 2, 2020 55 / 70
Ensembles
Boosting
Boosting reduces error
COMP9417 ML & DM Ensemble Learning Term 2, 2020 56 / 70
Ensembles
Boosting
Boosting enlarges the model class
A two-dimensional two-category classification task
• three component linear classifiers
• final classification is by voting component classifiers
• gives a non-linear decision boundary
• each component is a weak learner (slightly better than 0.5)
• ensemble classifier has error lower than any single component
• ensemble classifier has error lower than single classifier on complete training set
COMP9417 ML & DM Ensemble Learning Term 2, 2020 57 / 70
Ensembles
Boosting
Boosting enlarges the model class
COMP9417 ML & DM Ensemble Learning Term 2, 2020 58 / 70
Other ensemble methods
Stacking
• So far, ensembles where base learners all use same algorithm
• But what if we want to combine outputs of different algorithms ?
• Also, what if the combining method could be tuned from data ?
• “Stacked generalization” or stacking
• Uses meta learner instead of voting to combine predictions of base learners
• Predictions of base learners (level-0 models) are used as input for meta learner (level-1 model)
• Each base learner is treated as a feature, with value its output ŷ on instance x
• But predictions on training data can’t be used to generate data for level-1 model!
• So a cross-validation-like scheme is employed
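A minimal sketch of that cross-validation scheme with scikit-learn's cross_val_predict (the base learners and the logistic-regression meta-learner are illustrative choices, and binary classification is assumed):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def fit_stacking(X, y, base_learners, meta_learner, cv=5):
    # Level-1 training data: out-of-fold predicted probabilities of each base learner,
    # so the meta-learner never sees predictions made on a model's own training folds.
    Z = np.column_stack([
        cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
        for m in base_learners
    ])
    meta_learner.fit(Z, y)
    for m in base_learners:          # refit level-0 models on the full training set
        m.fit(X, y)
    return base_learners, meta_learner

def predict_stacking(base_learners, meta_learner, X):
    Z = np.column_stack([m.predict_proba(X)[:, 1] for m in base_learners])
    return meta_learner.predict(Z)

base = [DecisionTreeClassifier(max_depth=3), KNeighborsClassifier(5)]
meta = LogisticRegression()
# fit_stacking(X_train, y_train, base, meta), then predict_stacking(base, meta, X_test)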
COMP9417 ML & DM Ensemble Learning Term 2, 2020 59 / 70
Other ensemble methods
Stacking
• If base learners can output probabilities it’s better to use those as input to meta learner
• gives more information to meta-learner
• Which algorithm to use to generate meta learner?
• In principle, any learning scheme can be applied, but it is suggested to use
  • “relatively global, smooth” models (David Wolpert)
  • since base learners do most of the work
  • and this reduces the risk of overfitting
• Stacking can also be applied to numeric prediction (and density estimation)
COMP9417 ML & DM Ensemble Learning Term 2, 2020 60 / 70
Other ensemble methods
Mixture of Experts
• Framework for learning assuming data generated by a mixture model
  • base-level component classifiers (or rankers, …)
  • outputs are combined by a tunable system to do the “mixing”
• Each component models an “expert” for some part of the problem
• All component outputs are pooled for ensemble output
• Can be trained by gradient descent
COMP9417 ML & DM Ensemble Learning Term 2, 2020 61 / 70
Other ensemble methods
Mixture of Experts
COMP9417 ML & DM Ensemble Learning Term 2, 2020 62 / 70
Other ensemble methods
Additive Regression
• Using statistical terminology, boosting is a greedy algorithm for fitting an additive model
• More specifically, it implements forward stagewise additive modeling
• Forward stagewise additive modeling for numeric prediction:
1 Build standard regression model (e.g., regression tree)
2 Gather residuals, learn model predicting residuals (e.g. another
regression tree), and repeat
• To predict, simply sum up individual predictions from all regression models
Gradient (Tree) Boosting is based on this approach, where at each boosting iteration a model is fit to approximate the components of the negative gradient of the overall loss (Friedman, 2001).
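As an illustration (not Friedman's exact algorithm), a bare-bones Python sketch of this forward stagewise scheme for squared loss, where the negative gradient is simply the residual; the learning-rate shrinkage is an added, commonly used detail:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, lr=0.1, max_depth=3):
    # Forward stagewise additive modelling: each tree is fit to the current residuals.
    base = y.mean()                        # start from the constant model
    pred = np.full(len(y), base)
    trees = []
    for _ in range(T):
        residuals = y - pred               # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += lr * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gb(base, trees, X, lr=0.1):
    # Sum of the constant model and all (shrunken) tree predictions.
    return base + lr * sum(t.predict(X) for t in trees)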
COMP9417 ML & DM Ensemble Learning Term 2, 2020 63 / 70
Other ensemble methods
Option Trees
• Ensembles are not easily interpretable
• Can we generate a single model?
• One possibility: ”cloning” the ensemble by using large amounts of artificial data that is labeled by the ensemble
• Another possibility: generating a single structure that represents an ensemble in a compact fashion
• Option tree: decision tree with option nodes
• Idea: follow all possible branches at option node
• Predictions from different branches are merged using voting or by
averaging probability estimates
COMP9417 ML & DM Ensemble Learning Term 2, 2020 64 / 70
Other ensemble methods
Option Trees
• Can be learned by modifying a standard decision tree learner:
• Create option node if there are several equally promising splits (within
a user-specified interval)
• When pruning, error at option node is average error of options
COMP9417 ML & DM Ensemble Learning Term 2, 2020 65 / 70
Other ensemble methods
Alternating Decision Trees
• Can also grow an option tree by incrementally adding nodes to it using a boosting algorithm
• The resulting structure is called an alternating decision tree, with splitter nodes and prediction nodes
• Prediction nodes are leaf nodes if no splitter nodes have been added to them yet
• Standard alternating tree applies to 2-class problems but the algorithm can be extended to multi-class problems
• To obtain a prediction from an alternating tree, filter the instance down all applicable branches and sum the predictions
• Predictions from all relevant prediction nodes need to be used, whether those nodes are leaves or not
• Predict one class or the other depending on whether the sum is positive or negative
COMP9417 ML & DM Ensemble Learning Term 2, 2020 66 / 70
Other ensemble methods
Alternating Decision Trees
• Different approaches, but can be grown using a boosting algorithm:
• Assume that the base learner used for boosting produces a single
conjunctive if-then rule in each boosting iteration, including numeric
prediction
• Choose best extension among all possible extensions applicable to the
tree, according to the loss function used
COMP9417 ML & DM Ensemble Learning Term 2, 2020 67 / 70
Other ensemble methods
Gradient Boosting
• Boosting algorithms learn a form of additive model
  • training uses a forward stagewise procedure
• At each boosting iteration, a new (weighted) component function is added to the boosted model
• In gradient boosting, this approach is used to solve an optimization problem
• Informally, need to minimize loss over all components (basis functions) over all training examples
• A simpler approximation to this optimization is a forward stepwise procedure
• At each iteration, minimize loss summed over all previously added components, plus the current one
• In gradient boosting, a regression tree is learned at each iteration to minimize the loss for predicting the negative gradient at each leaf
• Implemented in the widely-used XGBoost package for scalable learning
COMP9417 ML & DM Ensemble Learning Term 2, 2020 68 / 70
Summary
Ensemble Learning
Important points to remember
Low-bias models tend to have high variance, and vice versa.
Bagging is predominantly a variance-reduction technique, while boosting is
primarily a bias-reduction technique.
This explains why bagging is often used in combination with high-variance models such as tree models (as in Random Forests), whereas boosting is typically used with high-bias models such as linear classifiers or univariate decision trees (also called decision stumps).
COMP9417 ML & DM Ensemble Learning Term 2, 2020 69 / 70
Summary
Ensemble Learning
• Bias-variance decomposition breaks down error, suggests possible fixes to improve learning algorithms
• Stability idea captures aspects of both bias and variance
• Bagging is a simple way to run ensemble methods
• Random Forests are a popular bagging approach for trees
• Boosting has a more theoretically justified basis and may work better in practice to reduce error, but can be susceptible to very noisy data
• Many other variants of ensemble learning
• Remember: No Free Lunch and other theorems → no “magic bullet”
for machine learning!
COMP9417 ML & DM Ensemble Learning Term 2, 2020 70 / 70
References
Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5):1189–1232.
COMP9417 ML & DM Ensemble Learning Term 2, 2020 70 / 70