
Ensemble Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2021

Acknowledgements
• Material derived from slides for the book “Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009). http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book “Machine Learning: A Probabilistic Perspective” by K. Murphy. MIT Press (2012). http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach. Cambridge University Press (2012). http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book “Bayesian Reasoning and Machine Learning” by D. Barber. Cambridge University Press (2012). http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from figures for the book “Python Data Science Handbook” by J. VanderPlas. O’Reilly Media (2017). http://shop.oreilly.com/product/0636920034919.do
• Material derived from slides for the course “Machine Learning” by A. Srinivasan, BITS Pilani, Goa, India (2016)

Aims
This lecture will develop your understanding of ensemble methods in machine learning, based on analyses and algorithms covered previously. Following it you should be able to:
• describe the framework of the bias-variance decomposition and some of its practical implications
• describe how ensembles might be used to address the bias and variance components of error
• outline the concept of the stability of a learning algorithm
• describe the ensemble methods of bagging, random forests and boosting
• compare the operation of these methods in terms of the bias and variance components of error

Introduction
• The bias-variance decomposition of error can be a tool for thinking about how to reduce error in learning
• Take a learning algorithm and ask:
  • how can we reduce its bias?
  • how can we reduce its variance?
• Ensemble learning methods can be viewed in this light
• A form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble

Bias-variance
The bias-variance decomposition
• A theoretical tool for analyzing how much the specific training set affects learning performance
• Assume we have an infinite number of models trained by the same learning algorithm on different sample datasets, all of the same size:
  • the bias of a learning algorithm is the expected error due to the mismatch between the learner’s model space and the space of target concepts
  • the variance of a learning algorithm is the expected error due to differences in the training datasets used
• Total expected error ≈ bias² + variance
• Next slide: a graphical representation of this idea, where the distance from the “bullseye” (the centre of the target) stands for error

Bias and variance
[Figure (omitted): four “dartboard” targets illustrating the combinations of low/high bias and low/high variance.]
Each target corresponds to a different learning algorithm, each arrow (green ‘x’) to a different training sample. The learning algorithms in the top row show low bias, on average close to the centre (i.e., the true function value), while those in the bottom row have high bias. Algorithms in the left column show low variance; those in the right column, high variance.

Bias-variance: a trade-off
Bias-variance components of error typically show a trade-off:
Bias ↑ and Variance ↓, or Bias ↓ and Variance ↑
This trade-off is shown with regression in the following figure¹ (you may have to zoom in to see the details):
• each column represents a different regression model class g(x), shown in red
• each row represents a different set of n = 6 training points, Di, randomly sampled from the target function F(x) with noise, shown in black
• the probability distributions of the mean squared error E are also shown
¹ See: Duda et al. (2001).

Bias-variance: a trade-off
[Figure (omitted): the four regression set-ups a)–d) described on the next slide, from Duda et al. (2001).]

Bias-variance: a trade-off
Previous slide:
• a) is very poor: a linear model with fixed parameters, independent of the training data; high bias, zero variance
• b) is better: another fixed-parameter model, still independent of the training data but a closer match to F(x); slightly lower bias, zero variance
• c) is a cubic model with parameters trained by mean squared error on the training data; low bias, moderate variance
• d) is a linear model with parameters adjusted to fit each training set; intermediate bias and variance
• training with n → ∞ data points would give:
  • c) with bias approaching a small value due to noise
  • but not d)
  • variance for all models approaching zero
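To make the decomposition concrete, here is a minimal Python sketch (not from the original slides) that estimates the bias² and variance terms by simulation: many small training sets are drawn from an assumed noisy target F(x) = sin(2πx), a model is fit to each, and the predictions at fixed test points are compared with F. The target, sample size and the two model classes are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)      # assumed target function
x_test = np.linspace(0, 1, 50)

def bias2_variance(make_model, n_train=6, n_repeats=500, noise=0.3):
    """Estimate bias^2 and variance of a learner at fixed test points."""
    preds = np.empty((n_repeats, len(x_test)))
    for r in range(n_repeats):
        x = rng.uniform(0, 1, n_train)                      # a fresh training sample
        y = F(x) + rng.normal(0, noise, n_train)            # with additive noise
        model = make_model().fit(x.reshape(-1, 1), y)
        preds[r] = model.predict(x_test.reshape(-1, 1))
    avg_pred = preds.mean(axis=0)
    bias2 = np.mean((avg_pred - F(x_test)) ** 2)            # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))                   # variance, averaged over x
    return bias2, variance

for name, make_model in [("linear", LinearRegression),
                         ("cubic", lambda: make_pipeline(PolynomialFeatures(3),
                                                         LinearRegression()))]:
    b2, var = bias2_variance(make_model)
    print(f"{name:>6}: bias^2 = {b2:.3f}  variance = {var:.3f}")
```

On this kind of setup the linear fit will typically show higher bias and lower variance than the cubic fit, mirroring the trade-off described above.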

Stability
• for a given data distribution D
• train algorithm L on training sets S1, S2 sampled from D
• if the model produced by L is the same (or very similar) on both S1 and S2, we say that L is a stable learning algorithm
• otherwise it is unstable
• typical stable algorithm: kNN (for some k)
• typical unstable algorithm: decision-tree learning
Turney, P. “Bias and the Quantification of Stability”
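As an illustration (not part of the original slides), the sketch below fits an unpruned tree to two training sets that differ in a single feature value and prints the root test of each; with tree learning even such a tiny change can alter the split chosen at the root and hence the whole tree. The dataset is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# A small two-feature, two-class training set (illustrative data only).
X, y = make_classification(n_samples=16, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

X_perturbed = X.copy()
X_perturbed[-1, 1] -= 0.04        # change one feature value of one instance slightly

for name, data in [("original", X), ("perturbed", X_perturbed)]:
    tree = DecisionTreeClassifier(random_state=0).fit(data, y)   # unpruned tree
    print(f"--- root test, {name} training set ---")
    print(export_text(tree, max_depth=1))   # the root split (and the topology below it) may change
```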

Decision boundaries in tree learning
Decision boundaries for monothetic two-class trees in two and three dimensions; arbitrarily fine decision regions for classes R1, R2 can be learned by recursively partitioning the instance space. (Figure omitted.)
From: Duda et al. (2001).

Instability of tree learning
An example showing the effect of a small change in the training data on the structure of an unpruned binary tree learned by CART. The training set has 8 instances for each class (data table omitted).
Note: for class ω2 (red), the last instance has two values listed for feature x2. The next slide shows a tree learned from the data in which this instance has the value x2 = .36 (marked ∗); the slide after that shows the tree obtained when this value is changed to x2 = .32 (marked †).
From: Duda et al. (2001).

Instability of tree learning
The partitioned instance space (left) contains the instance marked ∗ and corresponds to the decision tree (right). (Figure omitted.)
From: Duda et al. (2001).

Instability of tree learning
The partitioned instance space (left) contains the instance marked † and corresponds to the decision tree (right). Note that both the decision boundaries and the tree topology have changed considerably (for example, x2 rather than x1 is tested at the tree root), although the change in the data was very small. (Figure omitted.)
From: Duda et al. (2001).

Stability and Bias-Variance
• stable algorithms typically have high bias
• unstable algorithms typically have high variance
• BUT: take care to consider the effect of parameters
  • e.g., decision trees typically have parameters to limit the depth of the tree
  • simple trees will be more stable
• also, e.g., in kNN:
  • 1NN perfectly separates the training data, so it has low bias but high variance
  • by increasing the number of neighbours k we increase bias and decrease variance (what happens when k = n?)
  • every test instance will then have the same set of neighbours, and the class probability vectors will all be the same!

Three-, five- and seven-nearest neighbour
Decision regions of k-nearest neighbour classifiers on a multiclass classification problem; shading represents the predicted probability distribution over the five classes.
[Figure (omitted): decision regions for 3-, 5- and 7-nearest neighbour classifiers.]
As k increases, the decision boundaries become less sharp (bias ↑, variance ↓).
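To see the k = n effect mentioned above, the following sketch (illustrative data and parameters only) compares training and test accuracy of kNN for increasing k; at k = n every test instance has the same neighbourhood, so all predictions collapse to the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=3)

for k in [1, 5, 25, len(X_tr)]:     # k = n uses every training point as a neighbour
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}  train acc={knn.score(X_tr, y_tr):.2f}  test acc={knn.score(X_te, y_te):.2f}")
```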

Ensembles
Ensemble methods
In essence, ensemble methods in machine learning have the following two things in common:
• they construct multiple, diverse predictive models from adapted versions of the training data (most often reweighted or resampled);
• they combine the predictions of these models in some way, often by simple averaging or voting (possibly weighted).

Ensembles: combining multiple models
• Basic idea of ensembles or “multi-level” learning schemes: build different “experts” and combine their output predictions
• Advantage: often improves predictive performance
• Disadvantage: produces output that is very hard to interpret
• Notable schemes: bagging, random forests, boosting
• can be applied to both classification and numeric prediction problems

Bias-variance in ensemble classification
• Recall that we originally derived the bias-variance decomposition for regression – squared-error loss function
• Cannot apply same derivation for classification – zero-one loss
• Bias-variance decomposition used to analyze how much restriction to a single training set affects performance
• Can decompose expected error of any individual ensemble member as follows:
  • Bias = expected error of the ensemble classifier on new data
  • Variance = component of the expected error due to the particular training set being used to build the classifier
• Total expected error ≈ bias² + variance
• Note (A): we assume noise inherent in the data is part of the bias component as it cannot normally be measured
• Note (B): multiple versions of this decomposition exist for zero-one loss but the basic idea is always the same

Bagging
“Bootstrap Aggregation” – introduced by Leo Breiman²
• Employs simplest way of combining predictions: voting/averaging
• Each model receives equal weight
• Generalized version of bagging:
• Sample several training sets of size n (instead of just having one training set of size n)
• Build a classifier for each training set
• Combine the classifiers’ predictions
• This improves performance in almost all cases if learning scheme is unstable (i.e., high-variance, e.g., decision trees)
² See: Breiman (1996).

Bagging
• Bagging reduces variance by voting/averaging, thus reducing the overall expected error
• In the case of classification there are pathological situations where the overall error might increase
• Usually, the more classifiers the better, with diminishing returns
• Problem: we only have one dataset of n instances !
• Solution: generate multiple samples from the dataset
• use sampling with replacement from the original dataset
• each sample can have duplicate examples
• each sample can have missing examples
• Can be applied to numeric prediction and classification

Bagging more precisely
Algorithm Bagging(D, T, A) // train ensemble from bootstrap samples
Input: dataset D; ensemble size T; learning algorithm A.
Output: set of models; predictions to be combined by voting or averaging.
1: for t = 1 to T do
2:     bootstrap sample Dt from D by sampling |D| examples with replacement
3:     run A on Dt to produce a model Mt
4: end
5: return {Mt | 1 ≤ t ≤ T}
What is the ensemble’s prediction? The average of the predictions of all the models in the ensemble:
• for classification, the majority vote (the mode)
• for regression, the mean
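A direct transcription of this pseudocode into Python might look like the following sketch (for illustration only; scikit-learn's BaggingClassifier provides a production version of the same idea):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T, base_model, rng):
    """Train T models, each on a bootstrap sample of (X, y)."""
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)            # sample |D| examples with replacement
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine predictions by majority vote (the mode of the T predictions)."""
    votes = np.array([m.predict(X) for m in models])        # shape (T, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# usage sketch on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
ensemble = bagging_fit(X, y, T=25, base_model=DecisionTreeClassifier(), rng=rng)
print(bagging_predict(ensemble, X[:5]), y[:5])
```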

Bagging linear classifiers
[Figure (omitted): scatter plot of the two classes with the bagged ensemble's piecewise linear decision boundary.]
An ensemble of five basic linear classifiers built from bootstrap samples with bagging. The decision rule is majority vote, leading to a piecewise linear decision boundary.

Bagging trees
An experiment with simulated data:
• sample of size n = 30, two classes, five features
• Pr(Y = 1 | x1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x1 > 0.5) = 0.8
• test sample of size 2000 from same population
• fit classification trees to training sample, 200 bootstrap samples
• trees are different (tree induction is unstable)
• therefore have high variance
• averaging reduces variance and leaves bias unchanged
• (graph on the following slides: test error for the original and bagged trees, with green – vote; purple – average probabilities; a code sketch reproducing a similar experiment follows the figure)

Bagging trees
[Figures (omitted): test error of the original tree and of the bagged trees as the number of bootstrap samples increases (green: vote; purple: averaged probabilities).]
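The experiment described above can be approximated in a few lines with scikit-learn; this is only a sketch under the stated setup, so the exact error values will differ from the original figure:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

def sample(n):
    """Five features; Pr(Y=1 | x1 <= 0.5) = 0.2 and Pr(Y=1 | x1 > 0.5) = 0.8."""
    X = rng.uniform(0, 1, size=(n, 5))
    p = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
    return X, rng.binomial(1, p)

X_tr, y_tr = sample(30)       # training sample of size n = 30
X_te, y_te = sample(2000)     # test sample of size 2000

tree = DecisionTreeClassifier().fit(X_tr, y_tr)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200).fit(X_tr, y_tr)

print("single tree test error :", round(1 - tree.score(X_te, y_te), 3))
print("bagged trees test error:", round(1 - bag.score(X_te, y_te), 3))
```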

Bagging trees
The news is not all good:
• when we bag a model, any simple structure is lost
• this is because a bagged tree is no longer a tree, but a forest
• although bagged trees can be mapped back to a single tree, this reduces their claim to comprehensibility
• stable models like nearest neighbour are not much affected by bagging
• unstable models like trees are most affected by bagging
• usually, their design for interpretability (bias) leads to instability

Random Forests
Randomization
• Can randomize learning algorithm instead of input to introduce diversity into an ensemble
• Some learning algorithms already have a random component
• Most algorithms can be randomized, e.g., greedy algorithms:
• Pick r options at random from the full set of options, then choose the best of those r choices
• E.g., feature selection in decision trees
• More generally applicable than bagging: e.g., we can use random subsets of features in a nearest-neighbor classifier
• Bagging does not work well with stable classifiers
• Randomization can be combined with bagging
• When learning decision trees, this yields the Random Forest³ method for building ensemble classifiers
³ See: Breiman (2001).
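As an example of the "random subsets of features" idea with a nearest-neighbour base learner, scikit-learn's BaggingClassifier can be used with bootstrap resampling switched off; the dataset and the half-size feature subsets below are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Random-subspace ensemble: no bootstrap resampling of the rows, but each
# member kNN classifier sees a random half of the features.
subspace_knn = BaggingClassifier(KNeighborsClassifier(), n_estimators=50,
                                 bootstrap=False, max_features=0.5, random_state=0)

print("single kNN  :", cross_val_score(KNeighborsClassifier(), X, y).mean())
print("subspace kNN:", cross_val_score(subspace_knn, X, y).mean())
```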

Random Forests
Algorithm RandomForest(D, T, d) // train ensemble of randomized trees
Input: data set D; ensemble size T; subspace dimension d.
Output: set of models; predictions to be combined by voting or averaging.
1: for t = 1 to T do
2:     bootstrap sample Dt from D by sampling |D| examples with replacement
3:     select d features at random and reduce the dimensionality of Dt accordingly
4:     train a tree model Mt on Dt without pruning
5: end
6: return {Mt | 1 ≤ t ≤ T}
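A literal Python rendering of this pseudocode is sketched below. Note one difference from Breiman's Random Forests (and from scikit-learn's RandomForestClassifier): here the d-feature subspace is sampled once per tree, exactly as in the pseudocode, whereas those implementations resample the candidate features at every split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T, d, rng):
    """Train T unpruned trees, each on a bootstrap sample restricted to d random features."""
    n, p = X.shape
    models = []
    for _ in range(T):
        rows = rng.integers(0, n, size=n)              # bootstrap sample, with replacement
        cols = rng.choice(p, size=d, replace=False)    # random subspace of d features
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        models.append((tree, cols))
    return models

def random_forest_predict(models, X):
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])    # majority vote

# usage: forest = random_forest_fit(X, y, T=100, d=2, rng=np.random.default_rng(0))
```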

Random Forests
Leo Breiman’s Random Forests algorithm is essentially Bagging for trees, except that each tree in the ensemble is trained from a bootstrap sample and a random subspace of the features.
• each tree in the forest is learned from
  • a bootstrap sample, i.e., a sample from the training set drawn with replacement
  • a subspace sample, i.e., a randomly sampled subset of the features
• advantage: forces more diversity among trees in ensemble
• advantage: less time to train since only consider a subset of features
Note: combining linear classifiers in an ensemble gives a piecewise linear (i.e., non-linear) model, whereas multiple trees can be combined into a single tree.

Boosting
• Boosting⁴ also uses voting/averaging over an ensemble, but each model is weighted according to its performance
• An iterative learning procedure: new models are influenced by performance of previously built ones
• New model is encouraged to become “expert” for instances classified incorrectly by earlier models
• Intuition: models should be experts that complement each other
• There are many variants of this method . . .
⁴ See: Freund and Schapire (1997).

The strength of weak learnability
The setting comes from computational learning theory:
• Learner produces a binary [−1, +1] classifier h with error rate ε < 0.5.
• In some sense h is “useful”, i.e., better than random!
  • a strong learner if ε < 0.5 and ε is “close” to zero
  • a weak learner if ε < 0.5 and ε is “close” to 0.5
• Question: is there a procedure to convert a weak learner into a strong learner?
• Answer: yes, weak learners can be boosted into strong learners!

Weight updates in boosting
• Suppose a linear classifier achieves the performance shown in the contingency table below. The error rate is ε = (9 + 16)/100 = 0.25.

                 Predicted ⊕   Predicted ⊖   Total
    Actual ⊕          24            16          40
    Actual ⊖           9            51          60
    Total             33            67         100

• We want to give half the total weight to the misclassified examples. The following weight updates achieve this:
  • a factor 1/(2ε) = 2 for the misclassified examples, and
  • a factor 1/(2(1 − ε)) = 2/3 for the correctly classified examples.
• Taking these updated weights into account leads to the contingency table below, which has a (weighted) error rate of 0.5.

                 Predicted ⊕   Predicted ⊖   Total
    Actual ⊕          16            32          48
    Actual ⊖          18            34          52
    Total             34            66         100

Boosting
Algorithm Boosting(D, T, A) // train binary classifier ensemble, reweighting the dataset
Input: data set D; ensemble size T; learning algorithm A.
Output: weighted ensemble of models.
1: w1i ← 1/|D| for all xi ∈ D
2: for t = 1 to T do
3:     run A on D with weights wti to produce a model Mt
4:     calculate the weighted error εt
5:     αt ← (1/2) ln((1 − εt)/εt)
6:     w(t+1)i ← wti / (2εt) for misclassified instances xi ∈ D
7:     w(t+1)j ← wtj / (2(1 − εt)) for correctly classified instances xj ∈ D
8: end
9: return M(x) = Σ_{t=1..T} αt Mt(x)

Why those αt?
The two weight updates, for the misclassified and the correctly classified instances, can be written as reciprocal terms δt and 1/δt normalised by some term Zt:

    1/(2εt) = δt / Zt        1/(2(1 − εt)) = (1/δt) / Zt

From this we can derive

    Zt = 2 √(εt (1 − εt))        δt = √((1 − εt)/εt) = exp(αt)

So the weight update for misclassified instances is exp(αt)/Zt and for correctly classified instances exp(−αt)/Zt. Using the fact that yi Mt(xi) = +1 for instances correctly classified by model Mt and −1 otherwise, we can write the weight update as

    w(t+1)i = wti exp(−αt yi Mt(xi)) / Zt

which is the expression commonly found in the literature.
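The boosting scheme above can be written directly in Python. The sketch below is illustrative only (scikit-learn's AdaBoostClassifier is a production implementation); it uses decision stumps as the weak learner A, assumes labels in {−1, +1}, and applies the weight update in the exp(−αt yi Mt(xi))/Zt form just derived.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_fit(X, y, T):
    """AdaBoost-style training; labels y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                            # w_1i = 1/|D|
    ensemble = []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                       # weighted error
        if eps == 0 or eps >= 0.5:                     # no longer a useful weak learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)              # up-weight mistakes, down-weight the rest
        w = w / w.sum()                                # normalise, i.e. divide by Z_t
        ensemble.append((alpha, stump))
    return ensemble

def boost_predict(ensemble, X):
    """M(x) = sign( sum_t alpha_t * M_t(x) )."""
    scores = sum(alpha * m.predict(X) for alpha, m in ensemble)
    return np.sign(scores)
```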
Boosting reduces error
AdaBoost applied to a weak learning system can reduce the training error exponentially as the number of component classifiers is increased.
• it focuses on the “difficult” patterns
• the training error of each successive classifier on its own weighted training set is generally larger than that of its predecessor
• the training error of the ensemble will decrease
• typically, the test error of the ensemble will decrease also
[Figure omitted.]

Boosting enlarges the model class
A two-dimensional two-category classification task:
• three component linear classifiers
• the final classification is by weighted voting of the component classifiers
• this gives a non-linear decision boundary
• each component is a weak learner (error only slightly better than 0.5)
• the ensemble classifier has lower error than any single component
[Figure omitted.]

Boosting vs. Bagging
[Figure (omitted): two scatter plots comparing boosted and bagged linear classifiers.]
(left) An ensemble of five boosted basic linear classifiers with majority vote. The linear classifiers were learned from blue to red; none of them achieves zero training error, but the ensemble does. (right) Applying bagging instead results in a much more homogeneous ensemble, indicating that there is little diversity in the bootstrap samples.

Other ensemble methods

Stacking
• So far we have considered ensembles in which the base learners all use the same algorithm
• But what if we want to combine the outputs of different algorithms?
• Also, what if the combining method itself could be tuned from data?
• “Stacked generalization”, or stacking
• Uses a meta-learner instead of voting to combine the predictions of the base learners
• Predictions of the base learners (level-0 models) are used as input for the meta-learner (level-1 model)
• Each base learner is treated as a feature, whose value is its output ŷ on instance x
• But predictions on the training data can’t be used to generate the data for the level-1 model!
• So a cross-validation-like scheme is employed
• If the base learners can output probabilities, it is better to use those as input to the meta-learner
  • this gives more information to the meta-learner
• Which algorithm should be used to generate the meta-learner?
  • in principle any learning scheme can be applied, but “relatively global, smooth” models are suggested (David Wolpert)
  • a linear model is typically used
    • since the base learners do most of the work
    • and this reduces the risk of overfitting
• Stacking can also be applied to numeric prediction (and density estimation)
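A sketch of stacking with scikit-learn's StackingClassifier, which applies the cross-validation trick internally; the base learners and the linear meta-learner chosen here are just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),             # level-0 (base) models
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),                    # level-1 (meta) model
    stack_method="predict_proba",                            # pass probabilities upward
    cv=5)                                                    # out-of-fold level-0 predictions

print("stacked accuracy:", cross_val_score(stack, X, y).mean())
```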
Mixture of Experts
• A framework for learning that assumes the data are generated by a mixture model
  • base-level component classifiers (or rankers, …)
  • their outputs are combined by a tunable system that does the “mixing”
• Each component models an “expert” for some part of the problem
• All component outputs are pooled for the ensemble output
• Can be trained by gradient descent
[Figure omitted.]

Additive Regression
• In statistical terminology, boosting is a greedy algorithm for fitting an additive model
• More specifically, it implements forward stagewise additive modeling
• Forward stagewise additive modeling for numeric prediction:
  1. Build a standard regression model (e.g., a regression tree)
  2. Gather the residuals, learn a model that predicts the residuals (e.g., another regression tree), and repeat
• To predict, simply sum up the individual predictions from all the regression models
Gradient (Tree) Boosting is based on this approach: at each boosting iteration a model is fit to approximate the components of the negative gradient of the overall loss (Friedman, 2001).

Gradient Boosting
• Boosting algorithms learn a form of additive model
  • training uses a forward stagewise procedure
• At each boosting iteration, a new (weighted) component function is added to the boosted model
• In gradient boosting, this approach is used to solve an optimization problem
  • informally, we need to minimize the loss over all components (basis functions) and over all training examples
  • a simpler approximation to this optimization is a forward stepwise procedure
  • at each iteration, minimize the loss summed over all previously added components, plus the current one
• In gradient boosting, a regression tree is learned at each iteration to minimize the loss for predicting the negative gradient at each leaf
• Implemented in the widely used XGBoost⁵ package for scalable learning
⁵ See: Chen and Guestrin (2016).
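For squared-error loss the negative gradient is simply the residual, so forward stagewise additive modeling reduces to the residual-fitting loop sketched below (illustrative only; production systems such as GradientBoostingRegressor or XGBoost add shrinkage schedules, subsampling and regularised tree learning on top of this idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Forward stagewise additive modelling with squared-error loss:
    each new tree is fit to the current residuals (the negative gradient)."""
    f0 = float(np.mean(y))                                   # initial constant model
    residual = y - f0
    trees = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - learning_rate * tree.predict(X)   # move along the negative gradient
        trees.append(tree)
    return f0, learning_rate, trees

def gradient_boost_predict(model, X):
    f0, lr, trees = model
    return f0 + lr * sum(t.predict(X) for t in trees)        # sum of all stage predictions

# usage: model = gradient_boost_fit(X_train, y_train); y_hat = gradient_boost_predict(model, X_test)
```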
Summary

Important points to remember
Low-bias models tend to have high variance, and vice versa. Bagging is predominantly a variance-reduction technique, while boosting is primarily a bias-reduction technique. This explains why bagging is often used in combination with high-variance models such as tree models (as in Random Forests), whereas boosting is typically used with high-bias models such as linear classifiers or univariate decision trees (also called decision stumps).

• The bias-variance decomposition breaks down error and suggests possible fixes to improve learning algorithms
• The idea of stability captures aspects of both bias and variance
• Bagging is a simple way to run ensemble methods
• Random Forests are a popular bagging & randomization approach for trees
• Boosting has a more theoretically justified basis and often works better in practice at reducing error, but can be susceptible to very noisy data
• There are many other variants of ensemble learning
• Gradient Boosting (XGBoost, LightGBM, CatBoost) is often the “off-the-shelf” method of choice

References
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2):123–140.
Breiman, L. (2001). Random Forests. Machine Learning, 45:5–32.
Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. Wiley.
Freund, Y. and Schapire, R. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119–139.
Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5):1189–1232.