Ensemble Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will develop your understanding of ensemble methods in machine learning, based on analyses and algorithms covered previously. Following it you should be able to:
• describe the framework of the bias-variance decomposition and some of its practical implications
• describe how ensembles might be used to address the bias and variance components of error
• outline the concept of the stability of a learning algorithm
• describe different ensemble methods: bagging, randomization, boosting, etc.
COMP9417 T1, 2021 1
Introduction
In previous lectures we introduced some theoretical ideas about limits on machine learning. But do these have any practical impact?
The answer is yes!
One of these theoretical tools is bias-variance decomposition:
• The bias-variance decomposition of error can be a tool for thinking about how to reduce error in learning
• Take a learning algorithm and ask:
– how can we reduce its bias ?
– how can we reduce its variance ?
• Ensemble learning methods can be viewed in this light
• A form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble
COMP9417 T1, 2021 2
Review: bias-variance decomposition
• Theoretical tool for analysing how much the specific training set used affects the performance of a classifier
• Assume we have an infinite number of classifiers built from different training sets all of the same size:
– The bias of a learning scheme is the expected error due to the mismatch between the learner’s hypothesis space (class of models) and the space of target concepts
– The variance of a learning scheme is the expected error due to differences in the training sets used
§ i.e., how much the predictions for a given point vary between models trained on different training sets
– Total expected error ≈ bias² + variance
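A minimal simulation sketch of this decomposition (my own illustration, not from the slides; assumes NumPy and scikit-learn): repeatedly draw training sets for a fixed target function, fit a model, and estimate bias² and variance of its predictions at one test point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # target function F(x)
x_test = np.array([[0.3]])                     # a fixed test point

preds = []
for _ in range(500):                           # 500 different training sets
    X = rng.uniform(0, 1, (30, 1))
    y = f(X).ravel() + rng.normal(0, 0.3, 30)  # noisy labels
    model = DecisionTreeRegressor(max_depth=3).fit(X, y)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_test).item()) ** 2   # (E[g(x)] - F(x))^2
variance = preds.var()                             # E[(g(x) - E[g(x)])^2]
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```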
COMP9417 T1, 2021 3
Bias-Variance
[Figure: illustration of underfitting, a good fit, and overfitting of a model relative to the true function ("Truth").]
Source: Scott Fortmann-Roe, "Understanding the Bias-Variance Tradeoff"
COMP9417 T1, 2021 4
Bias-Variance
• The inability of the learning algorithm to capture the true relationship between the output and the features/attributes is called bias.
• The amount by which a learning algorithm's fit varies between different training datasets is called variance.
Three commonly used methods to find a good bias-variance tradeoff are:
o Regularization
o Bagging
o Boosting
COMP9417 T1, 2021 5
Bias-variance: a trade-off
Easier to see with regression in the following figure¹ (to see the details you will have to zoom in in your viewer):
– each column represents a different model class 𝑔(x), shown in red
– each row represents a different set of 𝑛 = 6 training points, 𝐷ᵢ, randomly sampled from target function 𝐹(x) with noise, shown in black
– probability distributions of the mean squared error 𝐸 are shown
1 - from: "Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (2001)
COMP9417 T1, 2021 6
Bias-variance: a trade-off
COMP9417 T1, 2021 7
Bias-variance: a trade-off
• “𝑎” is very poor: a linear model with fixed parameters independent of training data; high bias, zero variance
• “𝑏” is better: a linear model with fixed parameters independent of training data; slightly lower bias, zero variance
• “𝑐” is a cubic model with parameters trained by mean-square-error on training data; low bias, moderate variance
• “𝑑” is a linear model with parameters adjusted to fit each training set; intermediate bias and variance
• training with data 𝑚 → ∞ would give
– “𝑐” with bias approaching small value due to noise
– but not “𝑑”
– variance for all models would approach zero
COMP9417 T1, 2021 8
Bias-variance in ensemble classification
• Recall that we derived the bias-variance decomposition for regression (squared-error loss function)
• Cannot apply same derivation for classification (zero-one loss can be used)
• Bias-variance decomposition used to analyze how much restriction to a
single training set affects performance
• Can decompose expected error of any individual ensemble member as
follows:
– Bias = expected error of the ensemble classifier on new data
– Variance = component of the expected error due to particular training set being used to build classifier
COMP9417 T1, 2021 9
Bias-variance with “Big Data”
• high bias algorithms are often used for efficiency – why?
• big data can reduce variance – why?
This slide and the following 3 slides are due to Prof. G. Webb, Monash U.
COMP9417 T1, 2021 10
Bias-variance with “Big Data”
Suppose we have a low bias representation (e.g., all conjunctive concepts), but such concepts may not always occur frequently in small datasets:
COMP9417 T1, 2021 11
Bias-variance with “Big Data”
So, we can increase bias – e.g., by Naive Bayes-type conditional independence assumptions – but this forces averaging of class distributions over all “small concepts”:
COMP9417 T1, 2021 12
Bias-variance with “Big Data”
“Big Data” may help to resolve the bias-variance dilemma:
• high bias algorithms are often used for efficiency – usually simpler to compute
• big data can reduce variance
– “small” concepts will occur more frequently – low bias algorithms can be applied
– but: how to compute efficiently ?
This is still largely an open problem!
COMP9417 T1, 2021 13
Bias-variance in “Real-world AI”
Imagine the following situation:
– Applications increasingly require machine-learning systems to perform at “human-level” (e.g., personal assistants, self-driving vehicles, etc.)
– Suppose you are developing an application and you know what "human-level" error would typically be on this task.
– You have sufficient data for training and validation datasets, and you are not restricted in terms of the models that you could learn (e.g., from linear regression or classification up to kernel methods, ensembles, deep networks, etc.)
– How can an understanding of the bias-variance decomposition help ?
COMP9417 T1, 2021 14
Bias-variance in “Real-world AI”
The following scenarios can happen:
1. Training-set error is observed to be high compared to human-level – why ?
– Bias is too high – solution: move to a more expressive (lower bias) model
2. Training-set error is observed to be similar to human-level, but validation set error is high compared to human-level – why ?
– Variance is too high – solution: get more data (!), try regularization, ensembles, move to a different model architecture
These scenarios are often found in applications of deep learning2
2-“Nuts and Bolts of Applying Deep Learning” by Andrew Ng http://www.youtube.com/watch?v=F1ka6a13S9I
COMP9417 T1, 2021 15
Stability
• for a given data distribution 𝒟
• train algorithm 𝐿 on training sets $S_1$, $S_2$ sampled from 𝒟
• expect that the model from 𝐿 should be the same (or very similar) on both $S_1$ and $S_2$
• if so, we say that 𝐿 is a stable learning algorithm
• otherwise, it is unstable
• typical stable algorithm: 𝑘NN (for some 𝑘)
• typical unstable algorithm: decision-tree learning
Turney, P. “Bias and the Quantification of Stability”
COMP9417 T1, 2021 16
Decision boundaries in tree learning
Decision boundaries for monothetic two-class trees in two and three dimensions; arbitrarily fine decision regions for classes $R_1$, $R_2$ can be learned by recursively partitioning the instance space.
COMP9417 T1, 2021 17
Instability of tree learning
An example shows the effect of a small change in the training data on the structure of an unpruned binary tree learned by CART. The training set has 8 instances for each class:
Note: for the class shown in red, the last instance is given two alternative values for one of its features. On the next slide is a tree learned from the data where this instance has the value .36 (marked ∗), and on the following slide we see the tree obtained when this value is changed to .32 (marked †).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 T1, 2021 18
Instability of tree learning
The partitioned instance space (left) contains the instance marked ∗ and corresponds to the decision tree (right).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 T1, 2021 19
Instability of tree learning
The partitioned instance space (left) contains the instance marked † and corresponds to the decision tree (right). Note that both the decision boundaries and the tree topology are considerably changed, for example a different feature is tested at the tree root, although the change in the data was very small.
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 T1, 2021 20
Stability and Bias-Variance
• stable algorithms typically have high bias
• unstable algorithms typically have high variance
• BUT: take care to consider effect of parameters on stability, e.g., in 𝑘NN:
– 1NN perfectly separates training data, so low bias but high variance
– By increasing the number of neighbours 𝑘 we increase bias and decrease variance (what happens when 𝑘 = 𝑛?)
– Every test instance will then have the same set of neighbours (the whole training set), so the class probability vectors will all be the same!
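A small sketch of this trade-off (assuming scikit-learn; the synthetic dataset is illustrative only): as 𝑘 grows, the training accuracy of 1NN's perfect separation is traded for a smoother, higher-bias, lower-variance model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, len(X_tr)):   # k = n: every test point has the same neighbours
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train acc={knn.score(X_tr, y_tr):.2f}  "
          f"test acc={knn.score(X_te, y_te):.2f}")
```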
COMP9417 T1, 2021 21
Three-, five- and seven-nearest neighbour
Decision regions of kNN classifiers; the shading represents the predicted probability distribution over the five classes.
Illustrates the effect of varying k on stability (i.e., bias and variance).
COMP9417 T1, 2021 22
Ensemble Methods
COMP9417 T1, 2021 23
Supervised learning
We have looked into different supervised methods:
– Basic linear classifier
– kNN
– Naïve Bayes
– Logistic regression
– Decision trees
– Perceptron
– Support vector machine
– ...
For a new problem, which method to pick?
COMP9417 T1, 2021 24
Supervised learning
• We can test all different methods on a validation set and pick the one which gives the best performance
• Each learning method gives a different hypothesis,… but no hypothesis is perfect
• Maybe we can combine different hypotheses to build a better hypothesis (model) !?
COMP9417 T1, 2021 25
Ensemble methods
• Ensemble methods are meta-algorithms that combine different models into one model, and they can:
o Decrease variance
o Decrease bias
o Improve performance
• The idea relates to the “wisdom of crowd” phenomenon:
o Aggregation of information in groups (popularised by James Surowiecki, The Wisdom of Crowds, 2004; the idea goes back to Galton, 1907)
o Aggregation of independent estimate can be really effective for prediction
» Unweighted average (e.g., majority vote)
» Weighted average (e.g., committee of experts)
COMP9417 T1, 2021 26
Simple ensembles
There are different ways of combining predictors:
1. Simple ensembles like majority vote or unweighted average
[Diagram: the training set is used to train learning algorithms 1, ..., k, producing models 1, ..., k; at test time each model predicts on the test sample and the predictions are combined by majority voting (classification) or the mean (regression) to give the output.]
COMP9417 T1, 2021 27
Simple ensembles
2. weighted averages / weighted votes: Every model gets a weight (e.g. depending on its performance)
[Diagram: as above, k models are trained; at test time model i produces a prediction $\hat{y}_i$, which is combined with weight $\lambda_i$ to give the output $\sum_i \lambda_i \hat{y}_i$.]
COMP9417 T1, 2021 28
Weighted average / weighted majority
In practice:
– Learning algorithms may not be independent
– Some fit the data better and so make fewer errors
We can define weights in different ways:
– Decrease the weight of correlated learners
– Increase the weight of good learners
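A sketch of both combination rules with hypothetical predictions and weights (not taken from the slides):

```python
import numpy as np

# predictions of k = 5 classifiers for one test sample (binary labels)
preds = np.array([1, 0, 1, 1, 0])
# per-model weights, e.g. proportional to validation accuracy
lams = np.array([0.9, 0.4, 0.8, 0.7, 0.3])

majority = int(preds.sum() > len(preds) / 2)                      # unweighted majority vote
weighted = int(lams[preds == 1].sum() > lams[preds == 0].sum())   # weighted vote
print(majority, weighted)
```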
COMP9417 T1, 2021 29
Simple ensembles
3. Treat the output of each model as a feature and train a model on that
[Diagram: models 1, ..., k are trained as before; their outputs on a sample are then used as input features to a further learning algorithm, which is trained to produce the final model output.]
COMP9417 T1, 2021 30
Simple ensembles
3. Treat the output of each model as a feature and train a model on that
o If the task is a binary classification and we choose the fusion model to be a linear model, then this will become a weighted vote
o We can train the fusion model on a validation set, or on a portion of the training data that was not used to train the initial models; otherwise the fusion model will always be biased towards the models that had better performance on the training data
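A minimal stacking sketch along these lines (assuming scikit-learn; the base models, fusion model and synthetic data are my own choices): the base models are fit on the training split, and the fusion (meta) model on a held-out split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

base = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]
for m in base:
    m.fit(X_tr, y_tr)                       # base models see only the training split

# base-model predictions become the features of the fusion model
meta_features = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base])
fusion = LogisticRegression().fit(meta_features, y_val)   # trained on held-out data

def predict(X_new):
    z = np.column_stack([m.predict_proba(X_new)[:, 1] for m in base])
    return fusion.predict(z)
```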
COMP9417 T1, 2021 31
Simple ensembles
4. Mixture of experts:
o Weight $\alpha_i(\mathbf{x})$ indicates the "expertise" of model $i$ at input $\mathbf{x}$
o It divides the feature space into homogeneous regions
o It may use a weighted average or just pick the model with the largest expertise
o It is a kind of local learning
COMP9417 T1, 2021 32
Ensemble methods
5. “Bagging” method: (“Bootstrap Aggregation”)
o Train many classifiers, but each with only a portion of the data
o Then aggregate through model averaging / majority voting
We used data splitting in cross-validation (k-fold CV) to check for, and possibly avoid, overfitting.
o But maybe we can also combine the models that we train on each split
o This will reduce the overfitting
COMP9417 T1, 2021 33
Bagging
Bagging = Bootstrap + Aggregation
Bootstrap:
A standard “resampling” technique from statistics for estimating quantities about a population by averaging estimates from multiple small data samples. Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen.
https://mlcourse.ai/articles/topic5-part1-bagging/
COMP9417 T1, 2021 34
Bagging
Bootstrap:
– Create a random subset of data by sampling with replacement
– Draw $m'$ samples from the $m$ available samples, with replacement ($m' \le m$)
Bagging:
– Repeat $k$ times to generate $k$ subsets
§ Some of the samples get repeated, and some will be left out
– Train one classifier on each subset
– To test, aggregate the outputs of the $k$ classifiers trained in the previous step using majority vote or unweighted average
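A sketch of the bootstrap step with NumPy (here $m' = m$; the dataset size is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 14
idx = rng.integers(0, m, size=m)          # draw m indices with replacement
print(sorted(idx))                        # some indices repeat ...
print("left out:", set(range(m)) - set(idx.tolist()))   # ... and some never appear
```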
COMP9417 T1, 2021 35
Bagging
[Diagram: k bootstrap training sets $T_1, \ldots, T_k$ are drawn from the original training set; classifiers $C_1, \ldots, C_k$ are trained on them and their predictions $P_1, \ldots, P_k$ are combined into the final prediction.]
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783555130/7/ch07lvl1sec46/bagging-building-an-ensemble-of-classifiers-from-bootstrap-samples
COMP9417 T1, 2021 36
Bias / Variance in Bagging
Error of any model has two components:
– Bias: due to the model choice, which
o can be reduced by increasing complexity
– Variance: due to small sample size or high complexity of the model
o can be reduced by increasing the data or reducing the complexity
Bagging is applied to a collection of low-bias, high-variance models; averaging them leaves the bias essentially unaffected, but reduces the variance.
COMP9417 T1, 2021 37
Bagging error
Assumption:
– If learners are independent
– If each learner makes an error with probability 𝑝 (same error probability for all learners)
The probability that exactly $k'$ out of the $k$ learners make an error is:
$$\binom{k}{k'} p^{k'} (1-p)^{k-k'}$$
If we use majority voting to decide the output, the ensemble errs only when more than $k/2$ of the learners make an error, so the error of majority voting is:
$$\sum_{k' > k/2} \binom{k}{k'} p^{k'} (1-p)^{k-k'}$$
If $k = 10$ and $p = 0.1$, then $\mathrm{error}(\text{majority voting}) < 0.001$
COMP9417 T1, 2021 38
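A quick numerical check of this bound (a sketch using SciPy's binomial distribution; "more than $k/2$" follows the formula above):

```python
from scipy.stats import binom

k, p = 10, 0.1
# P(more than k/2 of the k independent learners are wrong)
err = sum(binom.pmf(j, k, p) for j in range(k // 2 + 1, k + 1))
print(err)   # about 1.5e-4, well below 0.001
```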
Bagging
Advantage:
– Reduces overfitting (it is harder for the aggregated model to memorise the full dataset, since some of the data will not be seen by each individual model)
– This improves performance in almost all cases, especially if the learning scheme is unstable (e.g. decision trees are unstable because, with a slightly different sub-sample, the tree structure may change drastically)
– Can be applied to numeric prediction and classification
– Can help a lot if data is noisy
COMP9417 T1, 2021 39
Bagging more precisely
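A minimal sketch of the bagging procedure (my own illustration, assuming scikit-learn-style base estimators with fit/predict and binary 0/1 labels):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base, k=50, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)     # majority vote over the k models

# example base learner: an unpruned decision tree (an unstable, high-variance model)
base = DecisionTreeClassifier(random_state=0)
```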
COMP9417 T1, 2021 40
Bagging linear classifiers
(left) An ensemble of five basic linear classifiers built from bootstrap samples with bagging. The decision rule is majority vote, leading to a piecewise linear decision boundary. (right) If we turn the votes into probabilities, we see the ensemble is effectively grouping instances in different ways, with each segment obtaining a slightly different probability.
COMP9417 T1, 2021 41
Bagging trees
Example:
An experiment with simulated data:
• A sample set of size m = 30, two classes {0,1} and 5 features
• Each feature has a standard Gaussian distribution with pairwise correlation 0.95
• The outputs are generated according to:
$\Pr(Y = 1 \mid x_1 \le 0.5) = 0.2$ and $\Pr(Y = 1 \mid x_1 > 0.5) = 0.8$
• A test set of size 2000 from the same population
• The Bayes error for this data is 0.2
COMP9417 T1, 2021 42
Bagging trees
Approach:
• fit classification trees to training sample using 200 bootstrap samples
• No pruning was used
• trees are different (tree induction is unstable)
• therefore, have high variance
Results:
• averaging reduces variance and leaves bias unchanged
COMP9417 T1, 2021 43
Bagging trees
[Figure (ESL Fig. 8.9, part 1): the original tree learned on the simulated training sample, with its top split x.1 < 0.395 annotated, alongside the first trees grown on bootstrap samples (b = 1, 2, ...); the bootstrap trees differ noticeably in their splits.]
COMP9417 T1, 2021 44
Bagging trees
[Figure (ESL Fig. 8.9, part 2): the remaining bootstrap trees. Caption: "Bagging trees on simulated dataset. The top left panel shows the original tree. Eleven trees grown on bootstrap samples are shown. For each tree, the top split is annotated."]
COMP9417 T1, 2021 45
Bagging trees
[Figure (ESL Fig. 8.10): error curves for the bagging example of Figure 8.9. Test error (roughly 0.20–0.50) of the original tree and the bagged trees is shown as a function of the number of bootstrap samples (0–200), for both the consensus vote and the averaged class probabilities, together with the Bayes error.]
COMP9417 T1, 2021 46
Bagging trees
Pros:
• Reduces the overfitting
• Generalizes better
Cons:
• when we bag a model, any simple structure is lost
– this is because a bagged tree is no longer a single tree but a forest, which reduces the claim to interpretability
– stable models like nearest neighbour are not much affected by bagging, but unstable models like trees are most affected
• With lots of data, we usually learn the same classifier on each sample, so averaging them does not help
COMP9417 T1, 2021 47
Random Forest
What is the solution for an ensemble of decision trees when we have lots of data?
– Introduce extra variability among the learners by adding more randomness to the procedure
How?
– For every model, use only a subset of the features
o This will create diversity among the trees
– Then, as before, take the average / majority vote over the trees (learners)
An ensemble of trees built from random subsets of features is called a Random Forest
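In practice this is available off the shelf; a short sketch using scikit-learn's RandomForestClassifier (the synthetic data is illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# each tree sees a bootstrap sample and considers a random subset of features per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```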
COMP9417 T1, 2021 48
Random Forests
COMP9417 T1, 2021 49
Random Forest (PlayTennis dataset)
Day  Outlook   Temp.  Humidity  Wind    Play
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
COMP9417 T1, 2021 50
Random Forest (PlayTennis dataset) Example:
Step 1: select 14 samples from the dataset with replacement
Day  Outlook   Temp.  Humidity  Wind    Play
D3   Overcast  Hot    High      Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D3   Overcast  Hot    High      Weak    Yes
D8   Sunny     Mild   High      Weak    No
D14  Rain      Mild   High      Strong  No
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D12  Overcast  Mild   High      Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D9   Sunny     Cool   Normal    Weak    Yes
D1   Sunny     Hot    High      Weak    No
D14  Rain      Mild   High      Strong  No
COMP9417 T1, 2021 51
COMP9417 T1, 2021 51
Random Forest (PlayTennis dataset) Example:
Step 2: select a subset of features randomly (for example 2 features out of 4)
Day  Temp.  Wind    Play
D3   Hot    Weak    Yes
D11  Mild   Strong  Yes
D3   Hot    Weak    Yes
D8   Mild   Weak    No
D14  Mild   Strong  No
D5   Cool   Weak    Yes
D6   Cool   Strong  No
D12  Mild   Strong  Yes
D8   Mild   Weak    No
D9   Cool   Weak    Yes
D10  Mild   Weak    Yes
D11  Mild   Strong  Yes
D9   Cool   Weak    Yes
D1   Hot    Weak    No
D14  Mild   Strong  No
COMP9417 T1, 2021 52
Random Forest (PlayTennis dataset) Example:
Step 3: build a full tree without pruning using the selected features and samples
[Tree: the root tests Wind; one branch leads to a test on Temp. with leaves predicting No and Yes, and the other branch to a Yes leaf.]
COMP9417 T1, 2021
53
Random Forest (PlayTennis dataset) Example:
Step 4: Repeat previous steps k = 5 times (the value for 𝑘 can be tuned by using cross validation)
COMP9417 T1, 2021 54
Random Forest (PlayTennis dataset) Example:
For every new sample, predict the output using all trees and then take the average/majority vote
COMP9417 T1, 2021 55
Random Forests
Leo Breiman’s Random Forests algorithm is essentially like Bagging for trees, except the ensemble of tree models is trained from bootstrap samples and random subspaces.
• each tree in the forest is learned from
– a bootstrap sample, i.e., sample from the training set with replacement
– a subspace sample, i.e., randomly sample a subset of features (without replacement)
• advantage:
– forces more diversity among trees in ensemble
– less time to train since only consider a subset of features
Note: combining linear classifiers in an ensemble gives a piecewise linear (i.e., non-linear) model
COMP9417 T1, 2021 56
Randomization
• Some algorithms already have a random component: e.g., initial weights in a neural net
• Most algorithms can be randomized, e.g., greedy algorithms:
– Pick N options at random from the full set of options, then choose the best of those N choices
– E.g.: attribute selection in decision trees
• More generally applicable than bagging: e.g., we can use random subsets of features in a nearest-neighbour classifier (see the sketch after this list)
– Bagging does not work with stable classifiers such as nearest neighbour classifiers
• Can be combined with bagging
– E.g. when learning decision trees, this yields the Random Forest method for building ensemble classifiers
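A sketch of randomization beyond trees: an ensemble of nearest-neighbour classifiers, each trained on a random subset of features (the "random subspace" idea, not bagging). This assumes scikit-learn, binary 0/1 labels and arbitrarily chosen sizes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
rng = np.random.default_rng(0)

ensemble = []
for _ in range(15):
    feats = rng.choice(X.shape[1], size=5, replace=False)   # random feature subset
    ensemble.append((feats, KNeighborsClassifier(n_neighbors=5).fit(X[:, feats], y)))

def predict(X_new):
    votes = np.stack([m.predict(X_new[:, feats]) for feats, m in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)            # majority vote
```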
COMP9417 T1, 2021 57
Ensemble methods
So far we have seen several ensemble methods:
1. Simple ensembles like majority vote or unweighted average
2. weighted averages / weighted votes
3. Treat the output of each model as a feature
4. Mixture of experts
5. Bagging (Bootstrap Aggregation)
6. Adding randomization to introduce diversity in the models (e.g. feature subset selection)
7. What is next?
COMP9417 T1, 2021 58
Boosting
So far we have focused on building learners in parallel and then combining their decisions (with equal or unequal weights)
Problem with parallel learners:
what if all learners are mistaken in the same region?
Boosting:
– Uses “weak” learners
– Learners are trained sequentially
– New learners focus on errors of earlier learners
– New learners try to get these “hard” examples right by operating on a weighted training set in favor of misclassified instances
– Combine all learners in the end
Boosting is a method that combines a sequence of weak learners into a more complex model for prediction
COMP9417 T1, 2021 59
Boosting
Re-weighted training set:
o Start with the same weight for all the instances
o Misclassified instances gain higher weights: the next classifier is more likely to classify them correctly
o Correctly classified instances lose weight
Learning with a weighted training set:
1. Give different weights to the loss function for different instances
§ So the algorithm will be forced to learn instances with higher weights
2. Create a new collection of data with multiple copies of the samples with higher sample weights
COMP9417 T1, 2021 60
Boosting algorithm
1. Set $w_i = 1/m$ for $i = 1, \ldots, m$
2. Repeat until there is a sufficient number of hypotheses:
o Train model $L_j$ using the dataset with weights $w$
o Increase $w_i$ for the instances misclassified by $L_j$
3. The ensemble hypothesis is the weighted majority / weighted average of the $k$ learners $L_1, \ldots, L_k$ with weights $\lambda_1, \ldots, \lambda_k$, which are proportional to the accuracy of each $L_j$
COMP9417 T1, 2021 61
Boosting
Re-weighting the training data:
[Figure: each training instance has a weight $w_i$; ticks and crosses show which instances each successive learner $L_1, L_2, L_3, L_4$ gets right, and the weights of misclassified instances are increased for the next learner.]
We always aim to minimise some cost function:
• Unweighted average loss: $J(\theta) = \frac{1}{N} \sum_i J_i(\theta, \mathbf{x}_i)$
• Weighted average loss: $J(\theta) = \sum_i w_i J_i(\theta, \mathbf{x}_i)$
COMP9417 T1, 2021 62
Boosting
[Diagram: learners are trained sequentially; the data weights produced after training learner $j$ are passed on to learner $j+1$; at test time the models' predictions are combined by a weighted majority / weighted average with weights $\lambda_1, \ldots, \lambda_k$ to give the output.]
COMP9417 T1, 2021 63
Boosting
Boosting works well as long as we use weak learners.
– a weak learner is a model that is only slightly better than random guessing
Examples of weak learners:
– Perceptron
– Decision stumps (trees with one node)
– Naïve Bayes models
– etc.
COMP9417 T1, 2021 64
AdaBoost (Adaptive Boosting)
• AdaBoost usually uses stump trees (trees with one node and two leaves)
as the base learner
o Not very accurate at classification on their own
• AdaBoost combines stumps to boost the performance so it creates a forest of stumps instead of forest of trees
• In AdaBoost, stumps are created sequentially
• The error of each stump affects the training data weight in the next stump
• Depending on its performance, each stump gets a different weight ($\lambda_j$) in the final classification decision
COMP9417 T1, 2021 65
AdaBoost (Adaptive Boosting)
• $w_i^{(1)} = \frac{1}{m}$ for all $i$
• For $j = 1$ to $k$ do
– Learn $L_j$ with data weights $w^{(j)}$
– $\epsilon_j = \sum_{i=1}^{m} w_i^{(j)} \, \mathbf{1}[L_j(\mathbf{x}_i) \ne y_i] \, \big/ \, \sum_{i=1}^{m} w_i^{(j)}$
– $\lambda_j = \frac{1}{2} \log\!\left(\frac{1 - \epsilon_j}{\epsilon_j}\right)$
– $w_i^{(j+1)} = w_i^{(j)} \exp(\lambda_j)$ for misclassified instances
– $w_i^{(j+1)} = w_i^{(j)} \exp(-\lambda_j)$ for correctly classified instances
• End
• Make predictions using the final model: $\hat{y}(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{j=1}^{k} \lambda_j L_j(\mathbf{x})\right)$
COMP9417 T1, 2021 66
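A compact sketch of these update rules (a simplified illustration, not the lecturer's code; it assumes labels $y \in \{-1, +1\}$ and scikit-learn decision stumps):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=50):
    m = len(y)
    w = np.full(m, 1 / m)                        # w_i^(1) = 1/m
    stumps, lams = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        eps = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)  # weighted error
        lam = 0.5 * np.log((1 - eps) / eps)      # stump weight
        w = w * np.exp(np.where(miss, lam, -lam))   # re-weight instances
        w /= w.sum()                             # normalise
        stumps.append(stump)
        lams.append(lam)
    return stumps, np.array(lams)

def adaboost_predict(stumps, lams, X):
    scores = sum(lam * s.predict(X) for s, lam in zip(stumps, lams))
    return np.sign(scores)                       # sign of the weighted sum
```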
AdaBoost
Example:
Step 1: give similar weight to your sample (weights have to be normalized)
Blood pressure  Weight  Age  Heart disease  Sample weight
Yes             68      56   Yes            1/7
No              75      44   No             1/7
Yes             80      35   Yes            1/7
Yes             76      49   No             1/7
No              78      50   Yes            1/7
No              83      38   No             1/7
Yes             85      60   Yes            1/7
COMP9417 T1, 2021 67
AdaBoost
Step 2: Find the best stump by testing each feature/attribute, and compute the stump weight ($\lambda_1$)

(Same table as in Step 1, all sample weights 1/7.)

Candidate stump on Blood Pressure: the No branch contains [2, 1] and the Yes branch [3, 1] instances of the two classes.

$\text{Total error} = \frac{2}{7}$

$\lambda_1 = \frac{1}{2} \log\!\left(\frac{1 - \text{Total error}}{\text{Total error}}\right) \approx 0.45$
COMP9417 T1, 2021 68
AdaBoost
Step 3: Modify the sample weights ($w_i$)

(Same table as in Step 1; the stump from Step 2 misclassifies samples 4 and 5.)

New $w_i = w_i \times e^{\lambda_1}$ for misclassified samples: $w_4 = w_5 = \frac{1}{7} \times e^{0.45} \approx 0.22$

New $w_i = w_i \times e^{-\lambda_1}$ for correctly classified samples: $w_1 = w_2 = w_3 = w_6 = w_7 = \frac{1}{7} \times e^{-0.45} \approx 0.09$
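A quick arithmetic check of these updates (a sketch; the misclassified indices follow the example above):

```python
import numpy as np

lam1 = 0.45                         # = 0.5*log((1 - 2/7)/(2/7)), rounded as on the slide
w = np.full(7, 1 / 7)
miss = np.array([False, False, False, True, True, False, False])  # samples 4 and 5

w_new = w * np.exp(np.where(miss, lam1, -lam1))
print(np.round(w_new, 2))               # [0.09 0.09 0.09 0.22 0.22 0.09 0.09]
print(np.round(w_new / w_new.sum(), 2)) # normalised: [0.1 0.1 0.1 0.25 0.25 0.1 0.1]
```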
COMP9417 T1, 2021 69
AdaBoost
Step 4: Normalise the sample weights

Blood pressure  Weight  Age  Heart disease  Sample weight  New weight  Norm. weight
Yes             68      56   Yes            1/7            0.09        0.1
No              75      44   No             1/7            0.09        0.1
Yes             80      35   Yes            1/7            0.09        0.1
Yes             76      49   No             1/7            0.22        0.25
No              78      50   Yes            1/7            0.22        0.25
No              83      38   No             1/7            0.09        0.1
Yes             85      60   Yes            1/7            0.09        0.1

Normalised $w_i = \frac{w_i}{\sum_i w_i}$
COMP9417 T1, 2021 70
AdaBoost
Step 5: create a new stump using the weights

Blood pressure  Weight  Age  Heart disease  Norm. weight
Yes             68      56   Yes            0.1
No              75      44   No             0.1
Yes             80      35   Yes            0.1
Yes             76      49   No             0.25
No              78      50   Yes            0.25
No              83      38   No             0.1
Yes             85      60   Yes            0.1

There are two ways to include the weights:
• use weighted information gain to find the best next stump, or
• generate a new sample collection from the training set based on the sample weights and find the best stump on that
COMP9417 T1, 2021 71
AdaBoost
• Go back to step 2 and repeat 𝑘 times
You will end up with $k$ stumps, each with its own weight for the final prediction; pass your test sample to all of them and take the weighted majority vote / weighted average
COMP9417 T1, 2021 72
Boosting
(left) An ensemble of five boosted basic linear classifiers with majority vote. The linear classifiers were learned from blue to red; none of them achieves zero training error, but the ensemble does. (right) Applying bagging results in a much more homogeneous ensemble, indicating that there is little diversity in the bootstrap samples.
COMP9417 T1, 2021 73
Boosting
Advantages:
– No need to use complex models
– Can boost the performance of any weak learner
– Very simple to implement
– Decreasing the bias
– Decreasing the variance
– Good generalization
Disadvantage:
– Lack of interpretability
– Slow during training and potentially testing
COMP9417 T1, 2021 74
Gradient Boosting
Gradient boosting applies a similar idea to regression (although it can also be used for classification).
Simple linear regression or simple regression trees can be used as weak learners
Repeat the following procedure:
– Learn a regression predictor
– Compute the error residual
– learn to predict residual
COMP9417 T1, 2021 75
Gradient Boosting
• It learns a sequence of predictors
– It learns a new predictor for the residuals at each iteration
• The sum of the predictors creates a more complex model
• The model gets more accurate at each step as more predictors are added
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
COMP9417 T1, 2021 76
Gradient Boosting
• In this example regression trees are used as the base learner
COMP9417 T1, 2021 77
Gradient Boosting
• Start by training a weak learner and making a set of predictions $\hat{y}_i$
• The error of our prediction is $J(y_i, \hat{y}_i)$
o If we are working with MSE then $J = \sum_i (y_i - \hat{y}_i)^2$
• We can improve $\hat{y}_i$ by gradually reducing the error
– $\hat{y}_i \leftarrow \hat{y}_i - \alpha \, \partial J(y, \hat{y}) / \partial \hat{y}_i$ (we want the loss function to decrease, i.e. $\hat{y}_i$ to move towards the actual $y_i$)
o For the MSE (up to a constant factor), $-\nabla J(y_i, \hat{y}_i) = y_i - \hat{y}_i$ (the residual), and
o we can estimate this with a new learner, $f(i) \approx -\nabla J(y_i, \hat{y}_i)$
• Each new learner $f_j(i)$ estimates the (negative) gradient of the loss
Gradient descent: taking a sequence of steps to reduce $J(y_i, \hat{y}_i)$
Gradient boosting: a sum of predictors, weighted by some step size $\alpha$
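A minimal gradient-boosting sketch for regression with squared-error loss (a simplified illustration assuming scikit-learn regression trees as the weak learners; tree depth, learning rate and number of rounds are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, n_rounds=100, alpha=0.1):
    pred = np.full(len(y), y.mean())        # initial constant prediction
    learners = []
    for _ in range(n_rounds):
        residual = y - pred                 # negative gradient of MSE w.r.t. pred
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        learners.append(tree)
        pred += alpha * tree.predict(X)     # take a small step towards the targets
    return y.mean(), learners

def gb_predict(init, learners, X, alpha=0.1):
    pred = np.full(len(X), init)
    for tree in learners:                   # sum of predictors, weighted by alpha
        pred += alpha * tree.predict(X)
    return pred
```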
COMP9417 T1, 2021 78
Ensemble Learning
Important points to remember
Low-bias models tend to have high variance, and vice versa.
Bagging is predominantly a variance-reduction technique, while boosting is primarily a bias-reduction technique (its weighted averaging can also help to reduce variance).
This explains why bagging is often used in combination with high- variance models such as tree models (as in Random Forests), whereas boosting is typically used with high-bias models such as linear classifiers or univariate decision trees (also called decision stumps).
COMP9417 T1, 2021 79
Ensemble Learning
• Bias-variance decomposition breaks down error, suggests possible fixes to improve learning algorithms
• Stability idea captures aspects of both bias and variance
• Bagging is a simple way to run ensemble methods
• Random Forests are a popular bagging approach for trees
• Boosting has a more theoretically justified basis and may work better in practice to reduce error, but can be susceptible to very noisy data
• Many other variants of ensemble learning
COMP9417 T1, 2021 80
Acknowledgements
• Material derived from slides for the book “Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book “Machine Learning: A Probabilistic Perspective” by K. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book “Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw- Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani Goa Campus, India (2016)
COMP9417 T1, 2021 81