
Ensemble Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain

Aims
This lecture will develop your understanding of ensemble methods in machine learning, based on analyses and algorithms covered previously. Following it you should be able to:
• Describe the framework of the bias-variance decomposition and some of its practical implications
• describe how ensembles might be used to address the bias and variance components of error
• outline the concept of the stability of a learning algorithm
• describe different ensemble methods of bagging, randomization, boosting, etc.
COMP9417 T1, 2021 1

Introduction
In previous lectures we introduced some theoretical ideas about limits on machine learning. But do these have any practical impact?
The answer is yes!
One of these theoretical tools is bias-variance decomposition:
• The bias-variance decomposition of error can be a tool for thinking about how to reduce error in learning
• Take a learning algorithm and ask:
– how can we reduce its bias ?
– how can we reduce its variance ?
• Ensemble learning methods can be viewed in this light
• A form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble
COMP9417 T1, 2021 2

Review: bias-variance decomposition
• Theoretical tool for analyzing how much specific training set affects performance of classifier
• Assume we have an infinite number of classifiers built from different training sets all of the same size:
– The bias of a learning scheme is the expected error due to the mismatch between the learner’s hypothesis space (class of models) and the space of target concepts
– The variance of a learning scheme is the expected error due to differences in the training sets used
§ The variance of a learning scheme is how much the predictions for a given point vary between different models.
– Total expected error ≈ bias² + variance
COMP9417 T1, 2021 3
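To make the decomposition concrete, the following minimal sketch (not from the slides; the sine target, degree-3 polynomial and noise level are illustrative assumptions) estimates bias² and variance at a single test point by repeatedly resampling training sets:

# Sketch: Monte Carlo estimate of bias^2 and variance at one test point,
# using polynomial regression fitted to noisy samples of a sine function.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)           # target function F(x)
x_test, n, runs, degree = 0.3, 20, 500, 3     # illustrative settings

preds = []
for _ in range(runs):
    x = rng.uniform(0, 1, n)                  # a fresh training set each run
    y = f(x) + rng.normal(0, 0.3, n)          # noisy targets
    coeffs = np.polyfit(x, y, degree)         # fit a degree-3 polynomial
    preds.append(np.polyval(coeffs, x_test))  # its prediction at x_test

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_test)) ** 2     # (expected prediction - truth)^2
variance = preds.var()                        # spread of predictions across training sets
print(bias_sq, variance, bias_sq + variance)  # approximate total expected error (ignoring noise)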

Bias-Variance
[Figure: fitted curves illustrating underfitting, a good fit, and overfitting relative to the true function ("truth"). Source: Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff]
COMP9417 T1, 2021 4

Bias-Variance
• The inability of the learning algorithm to capture the true relationship between the output and the features/attributes is called bias.
• The difference in a learning algorithm's fits across different training datasets is called variance.
Three commonly used methods to find a good bias-variance tradeoff are:
o Regularization
o Bagging
o Boosting
COMP9417 T1, 2021 5

Bias-variance: a trade-off
Easier to see with regression in the following figure¹ (to see the details you may have to zoom in in your viewer):
– each column represents a different model class g(x), shown in red
– each row represents a different set of n = 6 training points, D_i, randomly sampled from the target function F(x) with noise, shown in black
– probability distributions of the mean squared error E are shown
1 - from: "Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (2001)
COMP9417 T1, 2021 6

Bias-variance: a trade-off
COMP9417 T1, 2021 7

Bias-variance: a trade-off
• “𝑎” is very poor: a linear model with fixed parameters independent of training data; high bias, zero variance
• “𝑏” is better: a linear model with fixed parameters independent of training data; slightly lower bias, zero variance
• “𝑐” is a cubic model with parameters trained by mean-square-error on training data; low bias, moderate variance
• “𝑑” is a linear model with parameters adjusted to fit each training set; intermediate bias and variance
• training with data 𝑚 → ∞ would give
– “𝑐” with bias approaching small value due to noise
– but not “𝑑”
– variance for all models would approach zero
COMP9417 T1, 2021 8

Bias-variance in ensemble classification
• Recall that we derived the bias-variance decomposition for regression (squared-error loss function)
• The same derivation cannot be applied directly to classification, although analogous decompositions exist for zero-one loss
• The bias-variance decomposition is used to analyze how much restricting learning to a single training set affects performance
• Can decompose expected error of any individual ensemble member as
follows:
– Bias = expected error of the ensemble classifier on new data
– Variance = component of the expected error due to particular training set being used to build classifier
COMP9417 T1, 2021 9

Bias-variance with “Big Data”
• high bias algorithms often used for efficiency
  o why?
• big data can reduce variance
  o why?
This slide and the following 3 slides are due to Prof. G. Webb, Monash U.
COMP9417 T1, 2021 10

Bias-variance with “Big Data”
Suppose we have a low bias representation (e.g., all conjunctive concepts), but such concepts may not always occur frequently in small datasets:
COMP9417 T1, 2021 11

Bias-variance with “Big Data”
So, we can increase bias – e.g., by Naive Bayes-type conditional independence assumptions – but this forces averaging of class distributions over all “small concepts”:
COMP9417 T1, 2021 12

Bias-variance with “Big Data”
“Big Data” may help to resolve the bias-variance dilemma:
• high bias algorithms are often used for efficiency
  – usually simpler to compute
• big data can reduce variance
  – "small" concepts will occur more frequently
  – low bias algorithms can be applied
  – but: how to compute efficiently?
This is still largely an open problem!
COMP9417 T1, 2021 13

Bias-variance in “Real-world AI”
Imagine the following situation:
– Applications increasingly require machine-learning systems to perform at “human-level” (e.g., personal assistants, self-driving vehicles, etc.)
– Suppose you are developing an application and you know what "human-level" error would typically be on this task.
– You have sufficient data for training and validation datasets, and you are not restricted in terms of the models that you could learn (e.g., from linear regression or classification up to kernel methods, ensembles, deep networks, etc.)
– How can an understanding of the bias-variance decomposition help ?
COMP9417 T1, 2021 14

Bias-variance in “Real-world AI”
The following scenarios can happen:
1. Training-set error is observed to be high compared to human-level – why ?
– Bias is too high – solution: move to a more expressive (lower bias) model
2. Training-set error is observed to be similar to human-level, but validation set error is high compared to human-level – why ?
– Variance is too high – solution: get more data (!), try regularization, ensembles, move to a different model architecture
These scenarios are often found in applications of deep learning2
2-“Nuts and Bolts of Applying Deep Learning” by Andrew Ng http://www.youtube.com/watch?v=F1ka6a13S9I
COMP9417 T1, 2021 15
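The two scenarios above can be summarised as a small decision rule; the sketch below is only illustrative, and the tolerance threshold is an assumption:

# Sketch: decide whether bias or variance dominates, given training error,
# validation error and a known human-level error (tol is an assumed margin).
def diagnose(train_err, val_err, human_err, tol=0.02):
    if train_err > human_err + tol:
        return "high bias: move to a more expressive (lower-bias) model"
    if val_err > train_err + tol:
        return "high variance: get more data, regularize, try ensembles"
    return "roughly at human level on both sets"

print(diagnose(train_err=0.15, val_err=0.16, human_err=0.02))  # scenario 1
print(diagnose(train_err=0.03, val_err=0.15, human_err=0.02))  # scenario 2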

Stability
• for a given data distribution 𝒟
• train algorithm 𝐿 on training sets S_1, S_2 sampled from 𝒟
• expect that the model from 𝐿 should be the same (or very similar) on both S_1 and S_2
• if so, we say that 𝐿 is a stable learning algorithm
• otherwise, it is unstable
• typical stable algorithm: 𝑘NN (for some 𝑘)
• typical unstable algorithm: decision-tree learning
Turney, P. “Bias and the Quantification of Stability”
COMP9417 T1, 2021 16

Decision boundaries in tree learning
Decision boundaries for monothetic two-class trees in two and three dimensions; arbitrarily fine decision regions for classes R_1, R_2 can be learned by recursively partitioning the instance space.
COMP9417 T1, 2021 17

Instability of tree learning
An example shows the effect of a small change in the training data on the structure of an unpruned binary tree learned by CART. The training set has 8 instances for each class.
Note: for the class shown in red, the last instance has two alternative values for one of its features. The next slide shows a tree learned from the data in which this instance takes the value 0.36 (marked ∗), and the following slide shows the tree obtained when this value is changed to 0.32 (marked †).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 T1, 2021 18

Instability of tree learning
The partitioned instance space (left) contains the instance marked ∗ and corresponds to the decision tree (right).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 T1, 2021 19

Instability of tree learning
The partitioned instance space (left) contains the instance marked † and corresponds to the decision tree (right). Note that both the decision boundaries and the tree topology are considerably changed, for example a different feature is tested at the tree root, although the change in the data was very small.
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
COMP9417 T1, 2021 20

Stability and Bias-Variance
• stable algorithms typically have high bias
• unstable algorithms typically have high variance
• BUT: take care to consider effect of parameters on stability, e.g., in 𝑘NN:
– 1NN perfectly separates training data, so low bias but high variance
– By increasing the number of neighbors 𝑘 we increase bias and decrease variance (what happens when 𝑘 = 𝑛?)
– Every test instance will then have the same set of neighbors (the whole training set), so the class probability vectors will all be the same!
COMP9417 T1, 2021 21
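Stability can also be checked empirically. The sketch below (synthetic data and the choice of k are assumptions, not part of the slides) trains the same algorithm on bootstrap-perturbed training sets and measures how often the resulting models disagree on fixed test points:

# Sketch: prediction disagreement of an unpruned tree vs 5-NN across
# models trained on resampled versions of the same training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_test = np.random.default_rng(1).normal(size=(500, 5))   # fixed query points

def disagreement(make_model, runs=20):
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(runs):
        idx = rng.integers(0, len(X), len(X))              # perturb the training set
        preds.append(make_model().fit(X[idx], y[idx]).predict(X_test))
    preds = np.array(preds)
    return (preds != preds[0]).mean()                      # disagreement with the first model

print("tree:", disagreement(lambda: DecisionTreeClassifier(random_state=0)))
print("5-NN:", disagreement(lambda: KNeighborsClassifier(n_neighbors=5)))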

Three-, five- and seven-nearest neighbour
Decision regions of kNN classifiers; the shading represents the predicted probability distribution over the five classes.
Illustrates the effect of varying k on stability (i.e., bias and variance).
COMP9417 T1, 2021 22

Ensemble Methods
COMP9417 T1, 2021 23

Supervised learning
We have looked into different supervised methods:
– Basic linear classifier
– kNN
– Naïve Bayes
– Logistic regression
– Decision trees
– Perceptron
– Support vector machine
– ...
For a new problem, which method to pick?
COMP9417 T1, 2021 24

Supervised learning
• We can test all different methods on a validation set and pick the one which gives the best performance
• Each learning method gives a different hypothesis, but no hypothesis is perfect
• Maybe we can combine different hypotheses to build a better hypothesis (model) !?
COMP9417 T1, 2021 25

Ensemble methods
• Ensemble methods are meta-algorithms that combine different models into one model, and they can:
o Decrease variance
o Decrease bias
o Improve performance
• The idea relates to the "wisdom of the crowd" phenomenon:
o Aggregation of information in groups (James Surowiecki, The Wisdom of Crowds, 2004)
o Aggregation of independent estimates can be very effective for prediction
» Unweighted average (e.g., majority vote)
» Weighted average (e.g., committee of experts)
COMP9417 T1, 2021 26

Simple ensembles
There are different ways of combining predictors:
1. Simple ensembles like majority vote or unweighted average
[Diagram: the training set is used to train learning algorithms 1 ... k, giving models 1 ... k; a test sample is passed to every model and their outputs are combined by majority voting (classification) or the mean (regression) to give the final output. A minimal code sketch follows this slide.]
COMP9417 T1, 2021 27
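A minimal sketch of this unweighted combination (the array shapes and toy votes are assumptions):

# Sketch: combine k model outputs by majority vote (classification)
# or by the unweighted mean (regression).
import numpy as np

def majority_vote(predictions):
    # predictions: shape (k, n_samples), integer class labels
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])

def mean_combine(predictions):
    # predictions: shape (k, n_samples), real-valued outputs
    return np.asarray(predictions).mean(axis=0)

votes = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]   # three classifiers, three test samples
print(majority_vote(votes))                 # class chosen by most models per sample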

Simple ensembles
2. weighted averages / weighted votes: Every model gets a weight (e.g. depending on its performance)
[Diagram: as before, k models are trained on the training set; at test time each model i produces a prediction ŷ_i, which is weighted by λ_i, and the ensemble output is the weighted combination]
output = Σ_i λ_i ŷ_i
COMP9417 T1, 2021 28

Weighted average / weighted majority
In practice:
– Learning algorithms may not be independent
– Some fit the data better and so make fewer errors
We can define weights in different ways:
– Decrease the weight of correlated learners
– Increase the weight of good learners
COMP9417 T1, 2021 29
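One simple way to implement such weights (the validation accuracies below are made-up numbers) is to weight each model's vote by its performance on held-out data:

# Sketch: weighted voting, with weights proportional to validation accuracy.
import numpy as np

def weighted_vote(predictions, lambdas, n_classes):
    # predictions: shape (k, n_samples) class labels; lambdas: shape (k,) model weights
    predictions, lambdas = np.asarray(predictions), np.asarray(lambdas)
    n = predictions.shape[1]
    scores = np.zeros((n, n_classes))
    for preds_j, lam_j in zip(predictions, lambdas):
        scores[np.arange(n), preds_j] += lam_j   # each model adds its weight to its vote
    return scores.argmax(axis=1)

val_acc = np.array([0.70, 0.80, 0.55])           # assumed validation accuracies
lambdas = val_acc / val_acc.sum()                # normalise into weights
print(weighted_vote([[1, 0], [1, 1], [0, 0]], lambdas, n_classes=2))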

Simple ensembles
3. Treat the output of each model as a feature and train a model on that
[Diagram: k models are trained on the training set; for a test sample, the outputs of models 1 ... k are fed as input features to a further learning algorithm, whose model produces the final output.]
COMP9417 T1, 2021 30

Simple ensembles
3. Treat the output of each model as a feature and train a model on that
o If the task is a binary classification and we choose the fusion model to be a linear model, then this will become a weighted vote
o We can train the fusion model on a validation set, or on a portion of the training data that was not used to train the initial models; otherwise the fusion model will always be biased towards the models that performed better on the training data
COMP9417 T1, 2021 31
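A minimal stacking sketch along these lines (the synthetic dataset and choice of base models are assumptions): the base models are trained on one split, and the fusion model is trained on their outputs on a held-out split.

# Sketch: stacking, with a fusion model trained on held-out base-model outputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

base = [DecisionTreeClassifier(random_state=0), GaussianNB()]
base = [m.fit(X_train, y_train) for m in base]

# base-model probabilities on the held-out split become the fusion features
Z_hold = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base])
fusion = LogisticRegression().fit(Z_hold, y_hold)

# at test time: run the base models first, then the fusion model
Z_new = np.column_stack([m.predict_proba(X_hold[:5])[:, 1] for m in base])
print(fusion.predict(Z_new))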

Simple ensembles
4. Mixture of experts:
o A weight α_j(x) indicates the "expertise" of model j for input x
o It divides the feature space into homogeneous regions
o It may use a weighted average or just pick the model with the largest expertise
o It is a kind of local learning
COMP9417 T1, 2021 32
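A toy sketch of the idea (the gate below is hand-crafted for illustration; in practice the gating weights α_j(x) are learned):

# Sketch: mixture of two "experts" combined by an input-dependent softmax gate.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

experts = [lambda x: 2.0 * x[0], lambda x: -1.0 * x[0] + 3.0]   # two toy regressors

def gate(x):
    # expertise scores: expert 0 favoured for negative x[0], expert 1 otherwise
    return softmax(np.array([-5.0 * x[0], 5.0 * x[0]]))

def predict(x):
    alphas = gate(x)                                 # alpha_j(x), sums to 1
    outputs = np.array([f(x) for f in experts])
    return np.dot(alphas, outputs)                   # locally weighted combination

print(predict(np.array([-1.0])), predict(np.array([1.0])))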

Ensemble methods
5. "Bagging" method ("Bootstrap Aggregation"):
o Train many classifiers, each with only a portion of the data
o Then aggregate through model averaging / majority voting
We used data splitting in cross-validation (k-fold CV) to check for overfitting and possibly avoid it.
o But maybe we can also combine the models that we train on each split
o This will reduce the overfitting
COMP9417 T1, 2021 33

Bagging
Bagging = Bootstrap + Aggregation
Bootstrap:
A standard “resampling” technique from statistics for estimating quantities about a population by averaging estimates from multiple small data samples. Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen.
https://mlcourse.ai/articles/topic5-part1-bagging/
COMP9417 T1, 2021 34

Bagging
Bootstrap:
– Create a random subset of the data by sampling with replacement
– Draw m′ samples from the m samples, with replacement (m′ ≤ m)
Bagging:
– Repeat k times to generate k subsets
  § Some of the samples are repeated and some are left out
– Train one classifier on each subset
– To test, aggregate the outputs of the k classifiers trained in the previous step using a majority vote / unweighted average
COMP9417 T1, 2021 35
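A from-scratch sketch of this procedure (synthetic data, trees as base learners and k = 25 are assumptions):

# Sketch: bagging -- k bootstrap subsets, one tree per subset, majority vote at test time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
k = 25

models = []
for _ in range(k):
    idx = rng.integers(0, len(X), len(X))                  # draw m' = m samples with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    votes = np.array([m.predict(X_new) for m in models])   # shape (k, n)
    return (votes.mean(axis=0) > 0.5).astype(int)          # majority vote for 0/1 labels

print(bagged_predict(X[:10]))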

Bagging
[Diagram: bootstrap samples T1 ... Tk are drawn from the training set, a classifier Ci is trained on each, and their predictions Pi are combined by majority voting.
Source: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783555130/7/ch07lvl1sec46/bagging-building-an-ensemble-of-classifiers-from-bootstrap-samples]
COMP9417 T1, 2021 36

Bias / Variance in Bagging
Error of any model has two components:
– Bias: due to the model choice, which
  o can be reduced by increasing model complexity
– Variance: due to small sample size or high complexity of the model
  o can be reduced by increasing the data or reducing the complexity
Bagging is applied to a collection of low-bias, high-variance models; averaging them leaves the bias unaffected but reduces the variance.
COMP9417 T1, 2021 37

Bagging error
Assumption:
– the learners are independent
– each learner makes an error with probability p (the same error probability for all learners)
The probability that exactly k′ out of the k learners make an error is:
C(k, k′) · p^k′ · (1 − p)^(k − k′),  where C(k, k′) is the binomial coefficient
If we use majority voting to decide the output, the ensemble errs only when more than k/2 of the learners make an error, so the error for majority voting is:
error(majority voting) = Σ_{k′ > k/2} C(k, k′) · p^k′ · (1 − p)^(k − k′)
If k = 10 and p = 0.1, then error(majority voting) < 0.001
COMP9417 T1, 2021 38
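The majority-voting error above can be evaluated directly; this small check (a sketch, relying on the independence assumption) confirms the figure quoted for k = 10 and p = 0.1:

# Sketch: majority-vote error = sum over k' > k/2 of C(k, k') p^k' (1-p)^(k-k').
from math import comb

def majority_error(k, p):
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

print(majority_error(10, 0.1))   # roughly 1.5e-4, i.e. below 0.001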

Bagging
Advantages:
– Reduces overfitting (it is harder for the aggregated model to memorize the full dataset, since some of the data will not be seen by each trained model)
– This improves performance in almost all cases, especially if the learning scheme is unstable (e.g. decision trees are unstable because, with a slightly different sub-sample, the tree structure may change drastically)
– Can be applied to numeric prediction and classification
– Can help a lot if the data is noisy
COMP9417 T1, 2021 39

Bagging more precisely
COMP9417 T1, 2021 40

Bagging linear classifiers
(left) An ensemble of five basic linear classifiers built from bootstrap samples with bagging. The decision rule is majority vote, leading to a piecewise linear decision boundary. (right) If we turn the votes into probabilities, we see the ensemble is effectively grouping instances in different ways, with each segment obtaining a slightly different probability.
COMP9417 T1, 2021 41

Bagging trees
Example: An experiment with simulated data:
• A sample set of size m = 30, two classes {0, 1} and 5 features
• Each feature has a standard Gaussian distribution with pairwise correlation 0.95
• The outputs are generated according to:
  Pr(Y = 1 | x_1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x_1 > 0.5) = 0.8
• A test set of size 2000 from same population
• The Bayesian error for this data is 0.2
COMP9417 T1, 2021 42

Bagging trees
Approach:
• fit classification trees to training sample using 200 bootstrap samples
• No pruning was used
• trees are different (tree induction is unstable)
• therefore, have high variance
Results:
• averaging reduces variance and leaves bias unchanged
COMP9417 T1, 2021 43

Bagging trees
[Figure (part of Hastie et al. Figure 8.9): the original unpruned tree learned on the simulated data (top split x.1 < 0.395) and trees grown on the first few bootstrap samples, each annotated with its top split.]
COMP9417 T1, 2021 44

Bagging trees
[Figure, with caption from Hastie et al.: FIGURE 8.9. Bagging trees on simulated dataset. The top left panel shows the original tree. Eleven trees grown on bootstrap samples are shown. For each tree, the top split is annotated. The top splits differ from sample to sample (x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985, x.4 < −1.36, ...).]
COMP9417 T1, 2021 45

Bagging trees
[Figure, from Hastie et al. (Figure 8.10): Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and of the bagged trees as a function of the number of bootstrap samples (0 to 200), for both the consensus vote and averaged probabilities, together with the Bayes error.]
COMP9417 T1, 2021 46

Bagging trees
Pros:
• Reduces the overfitting
• Generalizes better
Cons:
• when we bag a model, any simple structure is lost
  – this is because a bagged tree is no longer a tree ... but a forest, so this reduces the claim to interpretability
  – stable models like nearest neighbor are not much affected by bagging, but unstable models like trees are most affected by bagging
• With lots of data, we usually learn the same classifier, so averaging them does not help
COMP9417 T1, 2021 47

Random Forest
What is the solution for ensembles of decision trees with lots of data?
– Have extra variability in the learners and introduce more randomness into the procedure
How?
– For every model, use only a subset of features
  o This will create diversity in the trees
– Then, similar to before, take the average / majority vote over the trees (learners)
An ensemble of trees built on random subsets of features is called a Random Forest.
COMP9417 T1, 2021 48

Random Forests
COMP9417 T1, 2021 49
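A minimal scikit-learn sketch of a Random Forest before the worked example that follows (the synthetic data is an assumption; note that scikit-learn samples a random feature subset at every split, rather than once per tree as in the hand-worked example below):

# Sketch: Random Forest = bagged trees plus random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown on a bootstrap sample
    max_features="sqrt",   # size of the random feature subset tried at each split
    bootstrap=True,
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))        # majority vote over the trees
print(forest.predict_proba(X[:5]))  # averaged class probabilities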

Random Forest (PlayTennis dataset)
Day  Outlook   Temp.  Humidity  Wind    Play
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
COMP9417 T1, 2021 50

Random Forest (PlayTennis dataset)
Example:
Step 1: select 14 samples from the dataset with replacement
Day  Outlook   Temp.  Humidity  Wind    Play
D3   Overcast  Hot    High      Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D3   Overcast  Hot    High      Weak    Yes
D8   Sunny     Mild   High      Weak    No
D14  Rain      Mild   High      Strong  No
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D12  Overcast  Mild   High      Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D9   Sunny     Cool   Normal    Weak    Yes
D1   Sunny     Hot    High      Weak    No
D14  Rain      Mild   High      Strong  No
COMP9417 T1, 2021 51

Random Forest (PlayTennis dataset)
Example:
Step 2: select a subset of features randomly (for example 2 features out of 4)
Day  Temp.  Wind    Play
D3   Hot    Weak    Yes
D11  Mild   Strong  Yes
D3   Hot    Weak    Yes
D8   Mild   Weak    No
D14  Mild   Strong  No
D5   Cool   Weak    Yes
D6   Cool   Strong  No
D12  Mild   Strong  Yes
D8   Mild   Weak    No
D9   Cool   Weak    Yes
D10  Mild   Weak    Yes
D11  Mild   Strong  Yes
D9   Cool   Weak    Yes
D1   Hot    Weak    No
D14  Mild   Strong  No
COMP9417 T1, 2021 52

Random Forest (PlayTennis dataset)
Example:
Step 3: build a full tree, without pruning, using the selected features and samples
[Figure: a small tree that tests Wind at the root and Temp. below it, with leaves predicting Yes / No / Yes]
COMP9417 T1, 2021 53

Random Forest (PlayTennis dataset)
Example:
Step 4: repeat the previous steps k = 5 times (the value of k can be tuned using cross-validation)
COMP9417 T1, 2021 54

Random Forest (PlayTennis dataset)
Example:
For every new sample, predict the output using all trees and then take the average / majority vote
COMP9417 T1, 2021 55

Random Forests
Leo Breiman's Random Forests algorithm is essentially like Bagging for trees, except the ensemble of tree models is trained from bootstrap samples and random subspaces.
• each tree in the forest is learned from
  – a bootstrap sample, i.e., sample from the training set with replacement
  – a subspace sample, i.e., randomly sample a subset of features (without replacement)
• advantage:
  – forces more diversity among trees in the ensemble
  – less time to train, since only a subset of features is considered
Note: combining linear classifiers in an ensemble gives a piecewise linear (i.e., non-linear) model
COMP9417 T1, 2021 56

Randomization
• Some algorithms already have a random component: e.g., initial weights in a neural net
• Most algorithms can be randomized, e.g., greedy algorithms:
  – Pick N options at random from the full set of options, then choose the best of those N choices
  – E.g.: attribute selection in decision trees
• More generally applicable than bagging: e.g., we can use random subsets of features in a nearest-neighbor classifier
  – Bagging does not work with stable classifiers such as nearest neighbor classifiers
• Can be combined with bagging
  – E.g. when learning decision trees, this yields the Random Forest method for building ensemble classifiers
COMP9417 T1, 2021 57

Ensemble methods
So far we have seen several ensemble methods:
1. Simple ensembles like majority vote or unweighted average
2. Weighted averages / weighted votes
3. Treat the output of each model as a feature
4. Mixture of experts
5. Bagging (Bootstrap Aggregation)
6. Adding randomization to introduce diversity in the models (e.g. feature subset selection)
7. What is next?
COMP9417 T1, 2021 58

Boosting
So far we have focused on building learners in parallel and then combining their decisions (with equal or unequal weights).
Problem with parallel learners: what if all learners are mistaken in the same region?
Boosting:
– Uses "weak" learners
– Learners are trained sequentially
– New learners focus on the errors of earlier learners
– New learners try to get these "hard" examples right by operating on a training set re-weighted in favor of misclassified instances
– Combine all learners in the end
Boosting is a method that converts a sequence of weak learners into a very complex predictive model.
COMP9417 T1, 2021 59

Boosting
Re-weighted training set:
o Start with the same weight for all instances
o Misclassified instances gain higher weights, so the next classifier is more likely to classify them correctly
o Correctly classified instances lose weight
Learning with a weighted training set:
1. Give different weights to the loss function for different instances
  § so the algorithm will be forced to learn instances with higher weights
2. Create a new collection of data with multiple copies of the samples that have higher sample weight
COMP9417 T1, 2021 60

Boosting algorithm
1. Set w_i = 1/m for i = 1, ..., m
2. Repeat until there is a sufficient number of hypotheses:
  o Train model L_j using the dataset with weights w
  o Increase w_i for the instances misclassified by L_j
3. The ensemble hypothesis is the weighted majority / weighted average of the k learners L_1, ..., L_k, with weights λ_1, ..., λ_k proportional to the accuracy of each L_j
COMP9417 T1, 2021 61

Boosting
Reweighting training data:
[Figure: a table of instance weights w_1 ... w_6 across successive learners L_1 ... L_4; instances misclassified (✗) by one learner receive larger weights for the next learner.]
We always aim to minimize some cost function:
• Unweighted average loss: J(θ) = (1/N) Σ_i J_i(θ, x_i)
• Weighted average loss: J(θ) = Σ_i w_i J_i(θ, x_i)
COMP9417 T1, 2021 62

Boosting
[Diagram: learning algorithms 1 ... k are trained sequentially, each passing updated data weights to the next; at test time model i predicts with weight λ_i and the outputs are combined by a weighted majority / weighted average.]
COMP9417 T1, 2021 63

Boosting
Boosting works well as long as we use weak learners.
– a weak learner is a model that is slightly better than random
Examples of weak learners:
– Perceptron
– Decision stumps (trees with one node)
– Naïve Bayes models
– etc.
COMP9417 T1, 2021 64

AdaBoost (Adaptive Boosting)
• AdaBoost usually uses decision stumps (trees with one node and two leaves) as the base learner
  o Not very accurate at classification on their own
• AdaBoost combines stumps to boost the performance, so it creates a forest of stumps instead of a forest of trees
• In AdaBoost, stumps are created sequentially
• The error of each stump affects the training data weights for the next stump
• Depending on its performance, each stump gets a different weight (λ_j) in the final classification decision
COMP9417 T1, 2021 65

AdaBoost (Adaptive Boosting)
• w_i^(1) = 1/m  ∀ i
• For j = 1 to k do
  – Learn L_j with data weights w^(j)
  – ε_j = Σ_{i=1..m} w_i^(j) 1[L_j(x_i) ≠ y_i] / Σ_{i=1..m} w_i^(j)
  – λ_j = ½ log((1 − ε_j) / ε_j)
  – w_i^(j+1) = w_i^(j) exp(λ_j) for misclassified instances
  – w_i^(j+1) = w_i^(j) exp(−λ_j) for correctly classified instances
• End
• Make a prediction using the final model: y(x) = sign(Σ_{j=1..k} λ_j L_j(x))
COMP9417 T1, 2021 66
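A from-scratch sketch of the algorithm above with decision stumps (synthetic data and k = 50 rounds are assumptions; labels are mapped to {−1, +1} so the two weight-update cases collapse into one line):

# Sketch: AdaBoost with decision stumps, following the pseudocode above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y = 2 * y - 1                                      # map {0, 1} -> {-1, +1}
m, k = len(X), 50

w = np.full(m, 1.0 / m)                            # w_i = 1/m
stumps, lambdas = [], []
for _ in range(k):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.sum(w * (pred != y)) / np.sum(w)      # weighted error of this stump
    lam = 0.5 * np.log((1 - eps) / (eps + 1e-12))  # stump weight lambda_j
    w = w * np.exp(-lam * y * pred)                # exp(+lam) if wrong, exp(-lam) if right
    w = w / w.sum()                                # renormalise the data weights
    stumps.append(stump)
    lambdas.append(lam)

def predict(X_new):
    scores = sum(lam * s.predict(X_new) for lam, s in zip(lambdas, stumps))
    return np.sign(scores)

print((predict(X) == y).mean())                    # training accuracy of the ensemble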

AdaBoost
Example:
Step 1: give the same weight to every sample (the weights must be normalized, i.e. sum to 1)
Blood pressure  Weight  Age  Heart disease  Sample weight
Yes             68      56   Yes            1/7
No              75      44   No             1/7
Yes             80      35   Yes            1/7
Yes             76      49   No             1/7
No              78      50   Yes            1/7
No              83      38   No             1/7
Yes             85      60   Yes            1/7
COMP9417 T1, 2021 67

AdaBoost
Step 2: find the best stump by testing each feature/attribute, and compute the stump weight (λ_1)
Blood pressure  Weight  Age  Heart disease  Sample weight
Yes             68      56   Yes            1/7
No              75      44   No             1/7
Yes             80      35   Yes            1/7
Yes             76      49   No             1/7
No              78      50   Yes            1/7
No              83      38   No             1/7
Yes             85      60   Yes            1/7

Best stump: Blood Pressure
  Yes branch: [3, 1] (3 with heart disease, 1 without)
  No branch:  [2, 1] (2 without heart disease, 1 with)
Total error = 2/7
λ_1 = ½ log((1 − Total error) / Total error) ≈ 0.45
COMP9417 T1, 2021 68

AdaBoost
Step 3: modify the sample weights (w_i)
Blood pressure  Weight  Age  Heart disease  Sample weight
Yes             68      56   Yes            1/7
No              75      44   No             1/7
Yes             80      35   Yes            1/7
Yes             76      49   No             1/7
No              78      50   Yes            1/7
No              83      38   No             1/7
Yes             85      60   Yes            1/7

New w_i = w_i × e^{λ_1} for misclassified samples:
  w_4 = 1/7 × e^{0.45} = 0.22
  w_5 = 1/7 × e^{0.45} = 0.22
New w_i = w_i × e^{−λ_1} for correctly classified samples:
  w_1 = 1/7 × e^{−0.45} = 0.09
  w_1 = w_2 = w_3 = w_6 = w_7 = 0.09
COMP9417 T1, 2021 69

AdaBoost
Step 4: Normalise sample weights
Blood pressure  Weight  Age  Heart disease  Sample weight  New weight  Norm. weight
Yes             68      56   Yes            1/7            0.09        0.1
No              75      44   No             1/7            0.09        0.1
Yes             80      35   Yes            1/7            0.09        0.1
Yes             76      49   No             1/7            0.22        0.25
No              78      50   Yes            1/7            0.22        0.25
No              83      38   No             1/7            0.09        0.1
Yes             85      60   Yes            1/7            0.09        0.1

Normalized w_i = w_i / Σ_j w_j
COMP9417 T1, 2021 70
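The numbers in the table above can be reproduced in a few lines (a sketch; the slides round the stump weight 0.5 · ln(2.5) ≈ 0.458 to 0.45):

# Sketch: reproduce the stump weight and sample-weight updates of the example.
import numpy as np

total_error = 2 / 7
lam = 0.5 * np.log((1 - total_error) / total_error)              # stump weight (about 0.45)

w = np.full(7, 1 / 7)                                            # initial sample weights
misclassified = np.array([0, 0, 0, 1, 1, 0, 0], dtype=bool)      # samples 4 and 5 were wrong
w = np.where(misclassified, w * np.exp(lam), w * np.exp(-lam))   # ~0.22 and ~0.09
w_norm = w / w.sum()                                             # ~0.25 and ~0.1
print(lam, np.round(w, 2), np.round(w_norm, 2))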

AdaBoost
Step 5: create a new stump using the weights
Blood pressure  Weight  Age  Heart disease  Norm. weight
Yes             68      56   Yes            0.1
No              75      44   No             0.1
Yes             80      35   Yes            0.1
Yes             76      49   No             0.25
No              78      50   Yes            0.25
No              83      38   No             0.1
Yes             85      60   Yes            0.1

There are two ways to include the weights:
• use weighted information gain to find the best next stump, or
• generate a new sample collection from the training set based on the sample weights and find the best stump on that
COMP9417 T1, 2021 71

AdaBoost
• Go back to step 2 and repeat k times
You will end up with k stumps, each with a weight used in the final prediction; a test sample is passed to all of them and the weighted majority vote / weighted average is taken.
COMP9417 T1, 2021 72

Boosting
(left) An ensemble of five boosted basic linear classifiers with majority vote. The linear classifiers were learned from blue to red; none of them achieves zero training error, but the ensemble does. (right) Applying bagging results in a much more homogeneous ensemble, indicating that there is little diversity in the bootstrap samples.
COMP9417 T1, 2021 73

Boosting
Advantages:
– No need to use complex models
– Can boost the performance of any weak learner
– Very simple to implement
– Decreasing the bias
– Decreasing the variance
– Good generalization
Disadvantage:
– Lack of interpretability
– Slow during training and potentially testing
COMP9417 T1, 2021 74

Gradient Boosting
Gradient boosting applies a similar idea to regression (though gradient boosting can also be used for classification).
Simple linear regression or simple regression trees can be used as weak learners
Repeat the following procedure:
– Learn a regression predictor
– Compute the error residual
– learn to predict residual
COMP9417 T1, 2021 75

Gradient Boosting
• It learns a sequence of predictors
– Learns a new predictor for the residuals at each iteration
• The sum of the predictors creates a more complex model
• The model gets more accurate at each step as more predictors are added
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
COMP9417 T1, 2021 76

Gradient Boosting
• In this example regression trees are used as the base learner
COMP9417 T1, 2021 77

Gradient Boosting
• Start with training a weak learner and make a set of predictions 𝑦`’
• The error of our prediction is = 𝐽(𝑦’,𝑦`’)
o If we are working with MSE then 𝐽 = ∑'(𝑦’ − 𝑦`’)!
• We can improve 𝑦`’ by gradually reducing the error
– 𝑦`’ = 𝑦`’+𝛼 𝜕𝐽(𝑦, 𝑦`)/𝜕𝑦`’ (we want the loss function to move towards the actual 𝑦’ or to be minimised)
oFortheMSE,∇𝐽 𝑦’,𝑦`’ =𝑦’ −𝑦`’ and
o we can estimate this with 𝑓 𝑖 ≈ ∇𝐽 𝑦’, 𝑦`’
• Each new learner s estimating the gradient of loss 𝑓$(𝑖) Gradient descent: taking sequence of steps to reduce 𝐽(𝑦’,𝑦`’)
Gradient boosting: sum of predictors weighted by some step size 𝛼
COMP9417 T1, 2021 78
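A from-scratch sketch of this loop for squared error (the sine data, tree depth and step size α are assumptions): each round fits a small regression tree to the current residuals and adds a scaled copy of it to the prediction.

# Sketch: gradient boosting for squared error, fitting residuals y - y_hat.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 200)

alpha, n_rounds = 0.1, 100
y_hat = np.full_like(y, y.mean())             # start from a constant prediction
trees = []
for _ in range(n_rounds):
    residual = y - y_hat                      # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    y_hat = y_hat + alpha * tree.predict(X)   # take a small step towards y
    trees.append(tree)

def predict(X_new):
    out = np.full(len(X_new), y.mean())
    for tree in trees:
        out = out + alpha * tree.predict(X_new)
    return out

print(np.mean((predict(X) - y) ** 2))         # training MSE after boosting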

Ensemble Learning Important points to remember
Low-bias models tend to have high variance, and vice versa.
Bagging is predominantly a variance-reduction technique, while boosting is primarily a bias-reduction technique (the weighted averaging it performs also provides some variance reduction).
This explains why bagging is often used in combination with high- variance models such as tree models (as in Random Forests), whereas boosting is typically used with high-bias models such as linear classifiers or univariate decision trees (also called decision stumps).
COMP9417 T1, 2021 79

Ensemble Learning
• Bias-variance decomposition breaks down error, suggests possible fixes to improve learning algorithms
• Stability idea captures aspects of both bias and variance
• Bagging is a simple way to run ensemble methods
• Random Forests are a popular bagging approach for trees
• Boosting has a more theoretically justified basis and may work better in practice to reduce error, but can be susceptible to very noisy data
• Many other variants of ensemble learning
COMP9417 T1, 2021 80

Acknowledgements
• Material derived from slides for the book “Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book “Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book “Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw- Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani Goa Campus, India (2016)
COMP9417 T1, 2021 81