Ensemble Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael
This lecture will develop your understanding of ensemble methods in machine learning, based on analyses and algorithms covered previously. Following it you should be able to:
• describe the framework of the bias-variance decomposition and some of its practical implications
• describe how ensembles might be used to address the bias and variance components of error
• outline the concept of the stability of a learning algorithm
• describe different ensemble methods of bagging, randomization, boosting, etc.
Introduction
In previous lectures, we introduced some theoretical ideas about limits on machine learning. But do these have any practical impact?
The answer is yes!
One of these theoretical tools is bias-variance decomposition:
• The bias-variance decomposition of error can be a tool for thinking about how to reduce error in learning
• Take a learning algorithm and ask:
– how can we reduce its bias?
– how can we reduce its variance?
• Ensemble learning methods can be viewed in this light
• A form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble
Review: bias-variance decomposition
• Theoretical tool for analyzing how much the specific training set used affects the performance of a classifier
• Assume we have an infinite number of classifiers built from different training sets all of the same size:
– The bias of a learning scheme is the expected error due to the mismatch between the learner’s hypothesis space (class of models) and the space of target concepts
– The variance of a learning scheme is the expected error due to differences in the training sets used
§ The variance of a learning scheme measures how much the predictions for a given point vary between different models.
– Total expected error ≈ bias² + variance
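To make the decomposition concrete, here is a minimal illustrative sketch (not from the lecture): it estimates bias² and variance at a single test point by repeatedly fitting a cubic polynomial to noisy samples of an assumed target 𝐹(x) = sin(2πx); all names and settings are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # assumed target function F(x)
    return np.sin(2 * np.pi * x)

x0, n_sets, n_train, noise = 0.3, 2000, 20, 0.3

# train the same learner (a cubic polynomial fit) on many training sets
preds = np.empty(n_sets)
for i in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, noise, n_train)
    preds[i] = np.polyval(np.polyfit(x, y, 3), x0)

bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
variance = preds.var()                # variance at x0
# expected squared error at x0 is approximately bias^2 + variance
# (plus the irreducible noise term sigma^2)
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```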
Bias-Variance
[Figure: underfitting (high bias) vs. overfitting (high variance)]
Source: Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff
Bias-Variance
• The inability of the learning algorithm to capture the true relationship between the output and the features/attributes is called bias.
• The variability of a learning algorithm’s fits across different training sets is called variance.
Three commonly used methods to find a good bias-variance tradeoff are:
o Regularization
o Bagging
o Boosting
Bias-variance: a trade-off
Easier to see with regression in the following figure¹ (to see the details you will have to zoom in in your viewer):
– each column represents a different model class 𝑔(x) shown in red
– each row represents a different set of 𝑛 = 6 training points, 𝐷ᵢ, randomly sampled from target function 𝐹(x) with noise, shown in black
– probability functions of mean squared error 𝐸 are shown
1- from: “Elements of Statistical Learning” by Hastie, Tibshirani and Friedman (2001)
Bias-variance: a trade-off
• “𝑎” is very poor: a linear model with fixed parameters independent of training data; high bias, zero variance
• “𝑏” is better: a linear model with fixed parameters independent of training data; slightly lower bias, zero variance
• “𝑐” is a cubic model with parameters trained by mean-square-error on training data; low bias, moderate variance
• “𝑑” is a linear model with parameters adjusted to fit each training set; intermediate bias and variance
• training with 𝑚 → ∞ data would give
– “𝑐” with bias approaching small value due to noise
– but not “𝑑”
– variance for all models would approach zero
Bias-variance in ensemble classification
• Recall that we derived the bias-variance decomposition for regression (squared-error loss function)
• Cannot apply the same derivation for classification (although a decomposition for zero-one loss can be defined)
• Bias-variance decomposition used to analyze how much restriction to a
single training set affects performance
• Can decompose expected error of any individual ensemble member as
– Bias = expected error of the ensemble classifier on new data
– Variance = component of the expected error due to particular training set being used to build classifier
Bias-variance with “Big Data”
• high bias algorithms are often used for efficiency o why?
• big data can reduce variance o why?
This slide and the following 3 slides are due to Prof. G. Webb, Monash U.
Bias-variance with “Big Data”
Suppose we have a low bias representation (e.g., all conjunctive concepts), but such concepts may not always occur frequently in small datasets:
Bias-variance with “Big Data”
So, we can increase bias – e.g., by Naive Bayes-type conditional independence assumptions – but this forces averaging of class distributions over all “small concepts”:
Bias-variance with “Big Data”
“Big Data” may help to resolve the bias-variance dilemma:
• high bias algorithms are often used for efficiency – usually simpler to compute
• big data can reduce variance
– “small” concepts will occur more frequently
– low bias algorithms can be applied
– but: how to compute efficiently?
This is still largely an open problem!
Bias-variance in “Real-world AI”
Imagine the following situation:
– Applications increasingly require machine-learning systems to perform at “human-level” (e.g., personal assistants, self-driving vehicles, etc.)
– Suppose you are developing an application and you know what “human-level” error would typically be on this task.
– You have sufficient data for training and validation datasets, and you are not restricted in terms of the models that you could learn (e.g., from linear regression or classification up to kernel methods, ensembles, deep networks, etc.)
– How can an understanding of the bias-variance decomposition help?
Bias-variance in “Real-world AI”
The following scenarios can happen:
1. Training-set error is observed to be high compared to human-level – why?
– Bias is too high – solution: move to a more expressive (lower bias) model
2. Training-set error is observed to be similar to human-level, but validation set error is high compared to human-level – why?
– Variance is too high – solution: get more data (!), try regularization, ensembles, move to a different model architecture
These scenarios are often found in applications of deep learning²
2- “Nuts and Bolts of Applying Deep Learning” by Andrew Ng: http://www.youtube.com/watch?v=F1ka6a13S9I
Stability
• for a given data distribution 𝒟
• train algorithm 𝐿 on training sets 𝑆₁, 𝑆₂ sampled from 𝒟
• expect that the model from 𝐿 should be the same (or very similar) on both 𝑆₁ and 𝑆₂
• if so, we say that 𝐿 is a stable learning algorithm
• otherwise, it is unstable
• typical stable algorithm: 𝑘NN (for some 𝑘)
• typical unstable algorithm: decision-tree learning
Turney, P. “Bias and the Quantification of Stability”
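A crude way to see stability empirically (an illustrative sketch, not from the Turney paper): train the same algorithm on two samples 𝑆₁, 𝑆₂ and measure how often the two resulting models agree on a fixed set of points. The dataset and parameters below are arbitrary; expect the kNN models to agree more often than the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=2000, n_features=5, random_state=1)
X_eval = X[:500]                      # fixed evaluation points

for name, model in [("15-NN", KNeighborsClassifier(n_neighbors=15)),
                    ("tree", DecisionTreeClassifier())]:
    # train on two random samples S1, S2 drawn from the same distribution
    preds = []
    for _ in range(2):
        idx = rng.choice(len(X), size=200, replace=False)
        preds.append(model.fit(X[idx], y[idx]).predict(X_eval))
    agreement = (preds[0] == preds[1]).mean()
    print(f"{name}: models from S1 and S2 agree on {agreement:.0%} of points")
```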
Decision boundaries in tree learning
Decision boundaries for monothetic two-class trees in two and three dimensions; arbitrarily fine decision regions for classes R₁, R₂ can be learned by recursively partitioning the instance space.
Instability of tree learning
An example shows the effect of a small change in the training data on the structure of an unpruned binary tree learned by CART. The training set has 8 instances for each class:
Note: for class 𝜔₂ (red) the last instance has two values for feature 𝑥₂. On the next slide is a tree learned from the data where this instance has value 𝑥₂ = .36 (marked ∗), and on the following slide we see the tree obtained when this value is changed to 𝑥₂ = .32 (marked †).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
Instability of tree learning
The partitioned instance space (left) contains the instance marked ∗ and corresponds to the decision tree (right).
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
Instability of tree learning
The partitioned instance space (left) contains the instance marked † and corresponds to the decision tree (right). Note that both the decision boundaries and the tree topology are considerably changed, for example, testing 𝑥₂ rather than 𝑥₁ at the tree root, although the change in data was very small.
From: “Pattern Classification”. R. Duda, P. Hart, and D. Stork (2001) Wiley.
Stability and Bias-Variance
• stable algorithms typically have high bias
• unstable algorithms typically have high variance
• BUT: take care to consider effect of parameters on stability, e.g., in 𝑘NN:
– 1NN perfectly separates training data, so low bias but high variance
– By increasing the number of neighbors 𝑘 we increase bias and decrease variance (what happens when 𝑘 = 𝑛?)
– Every test instance will then have the same neighbours (the entire training set), so the class probability vectors will all be the same!
Three-, five- and seven-nearest neighbour
Decision regions of kNN classifiers; the shading represents the predicted probability distribution over the five classes.
Illustrates the effect of varying k on stability (i.e., bias and variance).
Ensemble Methods
Supervised learning
We have looked into different supervised methods:
– Basic linear classifier
– Naïve Bayes
– Logistic regression
– Decision trees
– Perceptron
– Support vector machine
– ...
For a new problem, which method to pick?
Supervised learning
• We can test all different methods on a validation set and pick the one which gives the best performance
• Each learning method gives a different hypothesis... but no hypothesis is perfect
• Maybe we can combine different hypotheses to build a better hypothesis (model)!?
Ensemble methods
• Ensemble methods are meta-algorithms that combine different models into one model, and they can:
o Decrease variance
o Decrease bias
o Improve performance
• The idea relates to the “wisdom of crowds” phenomenon:
o Aggregation of information in groups (Galton, 1907)
o Aggregation of independent estimates can be really effective for prediction
» Unweighted average (e.g., majority vote)
» Weighted average (e.g., committee of experts)
Simple ensembles
There are different ways of combining predictors:
1. Simple ensembles like majority vote or unweighted average
[Diagram: learning algorithms 1…k each train a model on the training set; for a test sample, the k model outputs are combined by majority voting / mean to give the final output.]
Simple ensembles
2. Weighted averages / weighted votes: every model gets a weight (e.g., depending on its performance)
[Diagram: as above, but model 𝑖 receives a weight 𝜆ᵢ; for a test sample the ensemble output is the weighted combination ŷ = Σᵢ₌₁ᵏ 𝜆ᵢ ŷᵢ, where ŷᵢ is the prediction of model 𝑖.]
Weighted average / weighted majority
In practice:
– Learning algorithms may not be independent
– Some fit the data better and so make fewer errors
We can define weights in different ways:
– Decrease weight of correlated learners
– Increase weight of good learners
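A minimal sketch of a weighted vote over 𝑘 classifiers (illustrative; the helper name and the example weights, e.g. rescaled validation accuracies, are made up):

```python
import numpy as np

def weighted_vote(predictions, weights):
    """predictions: (k, n) array of class labels from k models;
       weights: length-k array of model weights (lambda_i)."""
    classes = np.unique(predictions)
    # total weight assigned to each class, per test instance
    scores = np.array([np.where(predictions == c, weights[:, None], 0).sum(axis=0)
                       for c in classes])           # shape (n_classes, n)
    return classes[scores.argmax(axis=0)]

preds = np.array([[0, 1, 1],    # model 1's predictions on 3 instances
                  [1, 1, 0],    # model 2
                  [0, 0, 1]])   # model 3
print(weighted_vote(preds, np.array([0.5, 0.3, 0.2])))  # -> [0 1 1]
```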
Simple ensembles
3. Treat the output of each model as a feature and train a model on that
[Diagram: learning algorithms 1…k each train a model on the training set; their outputs on a sample are treated as features and fed to a further learning algorithm, which trains a fusion model that produces the final output.]
Simple ensembles
3. Treat the output of each model as a feature and train a model on that
o If the task is binary classification and we choose the fusion model to be a linear model, then this becomes a weighted vote
o We should train the fusion model on a validation set, or on a portion of the training data that was not used to train the initial models; otherwise it will always be biased towards the models that performed best on the training data (see the sketch below)
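A sketch of this idea (“stacking”) using scikit-learn’s StackingClassifier, which addresses the point above with internal cross-validation, so the fusion model is fit on out-of-fold predictions rather than on predictions over the base models’ own training data. The dataset and base models here are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# base models produce the features; a linear fusion model combines them
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(max_depth=5))],
    final_estimator=LogisticRegression(),
    cv=5)                # out-of-fold predictions for the fusion model
print(stack.fit(X_tr, y_tr).score(X_te, y_te))
```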
Simple ensembles
4. Mixture of experts:
o Weight 𝛼ᵢ(𝑥) indicates “expertise”
o It divides the feature space into homogeneous regions
o It may use a weighted average or just pick the model with the largest expertise
o It is a kind of local learning
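A toy illustration of the idea (not a full mixture of experts: the two “experts” and the gating weights 𝛼ᵢ(𝑥) are hand-crafted here, whereas in practice the gate and the experts are learned jointly):

```python
import numpy as np

# two "experts", each intended to be accurate in one region of a 1-D space
def expert_a(x): return 2.0 * x        # meant for x < 0
def expert_b(x): return np.sin(x)      # meant for x >= 0

def gate(x):
    # soft gating weights alpha_i(x): a logistic switch around x = 0,
    # dividing the feature space into two (nearly) homogeneous regions
    a_b = 1 / (1 + np.exp(-10 * x))
    return np.stack([1 - a_b, a_b])    # shape (2, n), columns sum to 1

def mixture(x):
    alpha = gate(x)
    preds = np.stack([expert_a(x), expert_b(x)])
    return (alpha * preds).sum(axis=0)  # expertise-weighted average

print(mixture(np.linspace(-2, 2, 5)))
```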
Ensemble methods
5. “Bagging” method: (“Bootstrap Aggregation”)
o Training many classifiers, but each with only a portion of the data
o Then aggregate through model averaging / majority voting
We used data splitting in cross-validation (k-fold CV) to check for overfitting and possibly avoid overfitting.
o But maybe we can combine the models that we train on each split
o This will reduce the overfitting
Bagging = Bootstrap + Aggregation
Bootstrap:
A standard “resampling” technique from statistics for estimating quantities about a population by averaging estimates from multiple small data samples. Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen.
https://mlcourse.ai/articles/topic5-part1-bagging/
Bootstrap:
– Create a random subset of data by sampling with replacement
– Draw 𝑚′ samples from the 𝑚 samples with replacement (𝑚′ ≤ 𝑚)
– Repeat 𝑘 times to generate 𝑘 subsets
§ Some of the samples get repeated and some will be left out
– Train one classifier on each subset
– To test, aggregate the output of 𝑘 classifiers that you trained in the previous step using either majority vote / unweighted average
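A minimal sketch of the sampling step above (illustrative; integer indices stand in for instances, and 𝑚′ = 𝑚 is used, which is the common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10, 3                      # dataset size, number of bootstrap subsets
data = np.arange(m)               # indices stand in for training instances

for j in range(k):
    subset = rng.choice(data, size=m, replace=True)   # sample with replacement
    left_out = np.setdiff1d(data, subset)             # never-drawn instances
    print(f"subset {j}: {np.sort(subset)}  left out: {left_out}")
```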
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783555130/7/ ch07lvl1sec46/bagging-building-an-ensemble-of-classifiers-from-bootstrap-samples
Bias / Variance in Bagging
Error of any model has two components:
– Bias: due to model choice which
o can be reduced by increasing complexity
– Variance: due to small sample size or high complexity of the model
o Can be reduced by increasing the data or reducing the complexity
Bagging is applied to a collection of low-bias, high-variance models; averaging them leaves the bias unaffected but reduces the variance.
Bagging error
Assumptions:
– If learners are independent
– If each learner makes an error with probability 𝑝 (same error probability for all learners)
The probability that 𝑘′ out of 𝑘 learners make an error is:
C(𝑘, 𝑘′) 𝑝^(𝑘′) (1 − 𝑝)^(𝑘−𝑘′)
If we use majority voting to decide about the output, then an error happens if more than 𝑘/2 of the learners make an error, so the error for majority voting is:
error(majority voting) = Σ_(𝑘′ > 𝑘/2) C(𝑘, 𝑘′) 𝑝^(𝑘′) (1 − 𝑝)^(𝑘−𝑘′)
If 𝑘 = 10 and 𝑝 = 0.1, then error(majority voting) < 0.001
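The slide’s bound can be checked directly (a small sketch; for even 𝑘, “more than 𝑘/2” means at least 𝑘/2 + 1 learners err):

```python
from math import comb

def majority_vote_error(k, p):
    # P(more than k/2 of k independent learners err), each erring with prob. p
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

print(majority_vote_error(10, 0.1))   # ~1.5e-4, below the 0.001 bound
```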
Advantages:
– Reduces overfitting (harder for the aggregated model to memorize full dataset, since some of the data will not be seen by the training models)
– This improves performance in almost all cases, especially if the learning scheme is unstable (e.g., decision trees are unstable: with a slightly different sub-sample, the tree structure may change drastically)
– Can be applied to numeric prediction and classification
– Can help a lot if data is noisy
Bagging more precisely
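A minimal end-to-end sketch of bagging, assuming scikit-learn-style base models (the function names are illustrative, and the majority vote assumes integer class labels):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(base_model, X, y, k, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))          # bootstrap sample
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    preds = np.stack([m.predict(X) for m in models]).astype(int)  # (k, n)
    majority = lambda col: np.bincount(col).argmax()    # per-instance vote
    return np.apply_along_axis(majority, 0, preds)

X, y = make_classification(n_samples=500, random_state=0)
models = bagging_fit(DecisionTreeClassifier(), X, y, k=25)
print((bagging_predict(models, X) == y).mean())         # training accuracy
```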
Bagging linear classifiers
(left) An ensemble of five basic linear classifiers built from bootstrap samples with bagging. The decision rule is majority vote, leading to a piecewise linear decision boundary. (right) If we turn the votes into probabilities, we see the ensemble is effectively grouping instances in different ways, with each segment obtaining a slightly different probability.
Bagging trees
An experiment with simulated data:
• A sample set of size m = 30, two classes {0,1} and 5 features
• Each feature has a standard Gaussian distribution with pairwise correlation 0.95
• The outputs are generated according to:
Pr(𝑌 = 1 | 𝑥₁ ≤ 0.5) = 0.2 and Pr(𝑌 = 1 | 𝑥₁ > 0.5) = 0.8
• A test set of size 2000 from same population
• The Bayes error for this data is 0.2
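A sketch reproducing this simulation setup (assuming the correlated features come from a multivariate Gaussian with the stated pairwise correlation; seeds and helper names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # 5 standard Gaussian features with pairwise correlation 0.95
    cov = np.full((5, 5), 0.95)
    np.fill_diagonal(cov, 1.0)
    X = rng.multivariate_normal(np.zeros(5), cov, size=n)
    # Pr(Y=1 | x1 <= 0.5) = 0.2 and Pr(Y=1 | x1 > 0.5) = 0.8
    p = np.where(X[:, 0] > 0.5, 0.8, 0.2)
    return X, (rng.random(n) < p).astype(int)

X_train, y_train = make_data(30)       # m = 30 training instances
X_test, y_test = make_data(2000)       # test set of size 2000
# the Bayes-optimal rule predicts 1 iff x1 > 0.5; its error rate is 0.2
bayes = (X_test[:, 0] > 0.5).astype(int)
print("estimated Bayes error:", (bayes != y_test).mean())
```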
Bagging trees
• fit classification trees to training sample using 200 bootstrap samples
• No pruning was used
• trees are different (tree induction is unstable)
• therefore, have high variance
• averaging reduces variance and leaves bias unchanged
Bagging trees
[Figure 8.9 from Hastie et al., Chapter 8 “Model Inference and Averaging”: bagging trees on the simulated dataset. The top left panel shows the original tree (top split x.1 < 0.395); eleven trees grown on bootstrap samples are shown, and for each tree the top split is annotated.]
Bagging trees
[Figure 8.10 from Hastie et al.: error curves for the bagging example of Figure 8.9, showing the test error (roughly 0.20–0.50) of the original tree and the bagged trees as a function of the number of bootstrap samples (0–200); one curve corresponds to the consensus vote and the other to averaging the class probabilities.]
Bagging trees
• Reduces the overfitting
• Generalizes better
• when we bag a model, any simple structure is lost
– this is because a bagged tree is no longer a tree... but a forest, so this reduces the claim to interpretability
– stable models like nearest neighbour are not much affected by bagging, but unstable models like trees are affected the most
• With lots of data, we usually learn the same classifier, so averaging them does not help
Random Forest
What is the solution for ensembles of decision trees with lots of data?
– Add extra variability to the learners by introducing more randomness into the procedure
– For every model, use only a subset of features
o This will create diversity in the trees
– Then, similar to before, take the average / majority vote over the trees (learners)
An ensemble of trees, each built with a random subset of features, is called a Random Forest
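A sketch using scikit-learn’s RandomForestClassifier on an arbitrary synthetic dataset; note that scikit-learn samples a random feature subset at each split rather than once per tree, a common variant of the idea above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 200 trees; each split considers a random subset of sqrt(20) ~ 4 features,
# and the trees' votes are aggregated by majority
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```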
Random Forest (PlayTennis dataset)
Day Outlook Temp. Humidity Wind Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Random Forest (PlayTennis dataset)
Example:
Step 1: select 14 samples from the dataset with replacement
[Table: a bootstrap sample of 14 instances drawn from the dataset with replacement; some days (e.g., D3, D11, D12) appear more than once while others do not appear at all.]