1 Introduction
2 Bagging and Boosting
AdaBoost – Adaptive Boost
Stacked generalization
Structure of ensemble classifiers
Random Forests and Decision Trees
3 Conclusion
Introduction
Introduction
Idea of Ensemble Methods
Generate a collection of “weak” classifiers from a given dataset.
Each "weak" classifier is a simple classifier, e.g., a half-space classifier (it only needs to be slightly better than choosing randomly/blindly), trained on a different dataset (generated by resampling).
Combine classifiers to be a final one, e.g., by averaging or voting. (Multi-classifier system)
Introduction
Motivation
Reduces variance: gives a more precise estimate; results are less dependent on the peculiarities of a single training set.
Reduces bias: gives a more accurate, better-quality estimate.
Improves the estimate if the learning algorithm is unstable.
A learner is said to be unstable if a small change to the training set causes a large change in the output classifier. For example, decision trees and neural networks are unstable learners; k-nearest neighbour and naive Bayesian classifiers are stable learners.
Goal: Improve the accuracy by using multi-classifiers.
Introduction
Ensemble Methods – Methods that combine multiple diverse classifiers to obtain better classification accuracy.
How to combine classifiers?
How to diversify classifiers?
The answers to these questions lead to the different ensemble approaches.
Bagging and Boosting
Bagging and Boosting
Ensemble learning: 2 methods
Bagging (a name derived from “bootstrap aggregating”)
Uses multiple versions of a training set.
Each created by drawing n′ < n samples from D with replacement. Each bootstrap data set is used to train a different weak classifier.
Final classification decision is based on the vote of each weak classifier.
Boosting
First create a classifier with accuracy on the training set greater than average. Then add new component classifiers to form an ensemble whose joint decision rule has arbitrarily high accuracy on the training set.
Bagging - procedure
1 Start with dataset D.
2 Generate M datasets D1, D2, ..., DM.
Each dataset is created by drawing n′ < n samples from D with replacement.
Some samples can appear more than once while others do not appear at all.
3 Learn a weak classifier for each dataset:
weak classifier fi(x) for dataset Di, i = 1, 2, ..., M.
4 Combine all weak classifiers using a majority voting or averaging scheme.
Averaging: f_final(x) = sgn( (1/M) ∑_{i=1}^{M} f_i(x) )
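Below is a minimal sketch of this bagging procedure in Python, assuming scikit-learn decision stumps as the weak classifiers and a synthetic dataset; these specific choices are illustrative, not prescribed by the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1} so the sign vote works

# Steps 1-3: draw M bootstrap datasets (n' < n samples, with replacement)
# and train one weak classifier per dataset.
M, n_prime = 25, 150
weak = []
for _ in range(M):
    idx = rng.choice(len(X), size=n_prime, replace=True)   # some samples repeat, others are absent
    weak.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

# Step 4: combine the weak classifiers by averaging their outputs and taking the sign.
def f_final(X_new):
    votes = np.mean([clf.predict(X_new) for clf in weak], axis=0)
    return np.sign(votes)

print("training accuracy:", np.mean(f_final(X) == y))
```

Ties (an average of exactly zero) are left unhandled here; a practical implementation would break them, e.g. in favour of one class.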
Boosting
Goal: to improve the accuracy of any given learning algorithm.
The classification performance is "boosted" by combining a number of weak classifiers (having accuracy only slightly better than chance is sufficient).
Add weak classifier one by one to form the overall classifier such that higher classification accuracy can be achieved.
Each weak classifier is trained with “informative” dataset complementary to others.
Example: a 2-dimensional, 2-category classification task with 3 weak classifiers.
Dataset D with n patterns. Training procedure (a rough code sketch follows the steps below):
Step 1: Randomly select a set of n1 ≤ n patterns (without replacement) from D to create dataset D1. Train a weak classifier C1 using D1 (C1 should have at least 50% classification accuracy).
Step 2: Create an “informative” dataset D2 (n2 ≤ n) from D of which roughly half of the patterns should be correctly classified by C1 and the rest is wrongly classified. Train a weak classifier C2 using D2.
Step 3: Create an “informative” dataset D3 (n3 ≤ n) from D of which the patterns are not well classified by C1 and C2 (C1 and C2 disagree). Train a weak classifier C3 using D3.
The final classification decision is based on the votes of the weak classifiers,
e.g., it is given by the first two weak classifiers if they agree, and by the third weak classifier if the first two disagree.
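The following is a rough sketch of the three-step procedure above, assuming scikit-learn decision stumps as the weak classifiers C1, C2, C3 and a synthetic dataset; the subset sizes are arbitrary and degenerate cases (e.g. C1 and C2 never disagreeing, or C1 making no errors) are not handled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, random_state=1)

# Step 1: train C1 on a random subset D1 drawn without replacement.
i1 = rng.choice(len(X), size=100, replace=False)
C1 = DecisionTreeClassifier(max_depth=1).fit(X[i1], y[i1])

# Step 2: build the "informative" set D2 - roughly half the patterns C1 gets
# right and half it gets wrong - and train C2 on it.
correct = np.flatnonzero(C1.predict(X) == y)
wrong = np.flatnonzero(C1.predict(X) != y)
m = min(len(correct), len(wrong), 50)
i2 = np.concatenate([rng.choice(correct, m, replace=False),
                     rng.choice(wrong, m, replace=False)])
C2 = DecisionTreeClassifier(max_depth=1).fit(X[i2], y[i2])

# Step 3: D3 contains the patterns on which C1 and C2 disagree; train C3 on it.
i3 = np.flatnonzero(C1.predict(X) != C2.predict(X))
C3 = DecisionTreeClassifier(max_depth=1).fit(X[i3], y[i3])

# Final decision: C1/C2 decide when they agree, otherwise C3 arbitrates.
def predict(X_new):
    p1, p2, p3 = C1.predict(X_new), C2.predict(X_new), C3.predict(X_new)
    return np.where(p1 == p2, p1, p3)

print("training accuracy:", np.mean(predict(X) == y))
```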
Figure 1: 3 weak classifiers in a boosting algorithm. (Reproduced from Duda et al., Figure 9.6: a two-dimensional two-category classification task is shown at the top; the middle row shows three component linear classifiers Ck trained by the LMS algorithm, where their training patterns were chosen through the basic boosting procedure; the final classification is given by the voting of the three component classifiers.)
Adaboost - Adaptive Boost
AdaBoost is a short form of “Adaptive Boosting”.
It allows adding weak classifiers to the final classifier until a desirable classification accuracy is achieved.
Each training pattern is assigned a weight: a lower weight is assigned to patterns that are classified correctly, and a higher weight to patterns that are classified incorrectly. The AdaBoost classifier therefore pays more attention to the "difficult" patterns.
It constructs a strong classifier by combining weak classifiers.
Adaboost - Adaptive Boost
Algorithm recapitulation
(a) 1 weak classifier. (b) 2 weak classifiers. (c) 3 weak classifiers. (d) 4 weak classifiers.
(e) 5 weak classifiers. (f) 7 weak classifiers. (g) 40 weak classifiers. Figure 2: An example of AdaBoost Classifier.
Adaboost – Adaptive Boost
Advantages:
Simple for implementation
Good classification accuracy and generalisation
Can be applied to different classifiers
Disadvantages:
Solution is suboptimal
Sensitive to noisy data and outliers
Adaboost - Adaptive Boost
The Adaboost training loop (Figure 3):
1 Take the dataset and class labels.
2 Initialise the weighted error weights and the maximum number of weak classifiers.
3 Select the weak classifier with the minimum weighted error.
4 If its error rate is greater than 0.5, stop.
5 Otherwise, calculate the weight of the selected weak classifier and update the weighted error weights.
6 If the maximum number of weak classifiers has not been reached, return to step 3.
7 Output the strong classifier as a weighted sum of the selected weak classifiers.
Figure 3: Flowchart of Adaboost Algorithm.
Adaboost – Adaptive Boost
Algorithm: Adaboost
Given D = {(x1, y1), (x2, y2), ..., (xn, yn)}, yi ∈ {−1, +1}, i = 1, ..., n
begin initialise kmax, W1(i) = 1/n, i = 1, ..., n, k = 0
  do k ← k + 1
    ĥk = arg min_{hj ∈ H} Ej, where Ej = ∑_{i: yi ≠ hj(xi)} Wk(i) (weighted error rate)†
    εk = overall weighted error rate of classifier ĥk (i.e., the minimum Ej)
    if εk > 0.5 then set kmax = k − 1 and stop
    αk = (1/2) ln((1 − εk)/εk)
    Wk+1(i) = Wk(i) e^{−αk yi ĥk(xi)} / Zk, i = 1, ..., n (update Wk(i))‡
  until k = kmax
end
† Find the classifier hk ∈ H that minimises the error εk with respect to the distribution Wk (i). ‡ Zk is a normalization factor chosen so that Wk+1(i) is a normalised distribution.
Table 1: Algorithm for Adaboost.
Final classifier: H(x) = sign( ∑_{k=1}^{kmax} αk ĥk(x) )
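A compact Python sketch of the loop in Table 1 is given below, assuming scikit-learn decision stumps stand in for the pool of weak classifiers H (a weighted stump fit approximates the arg-min over H); degenerate cases such as εk = 0 are not handled, and this is an illustration of the pseudocode rather than a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y - 1                                    # labels in {-1, +1}

def adaboost(X, y, k_max=20):
    n = len(y)
    W = np.full(n, 1.0 / n)                      # W1(i) = 1/n
    classifiers, alphas = [], []
    for _ in range(k_max):
        # A weighted stump fit stands in for "choose h_k minimising the weighted error".
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
        pred = h.predict(X)
        eps = W[pred != y].sum()                 # weighted error rate eps_k
        if eps > 0.5:                            # stop if worse than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_k
        W = W * np.exp(-alpha * y * pred)
        W /= W.sum()                             # divide by Z_k so W stays a distribution
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def H(classifiers, alphas, X_new):
    score = sum(a * h.predict(X_new) for h, a in zip(classifiers, alphas))
    return np.sign(score)                        # H(x) = sign(sum_k alpha_k h_k(x))

hs, als = adaboost(X, y)
print("training accuracy:", np.mean(H(hs, als, X) == y))
```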
Adaboost – Adaptive Boost
Final classifier: H(x) = sign( ∑_{k=1}^{kmax} αk ĥk(x) )
Figure 4: Structure of Adaboost Classifier (the outputs of ĥ1(x), ĥ2(x), ..., ĥkmax(x) are weighted by α1, ..., αkmax, summed, and passed through sign(·)).
Adaboost – Adaptive Boost
Example: 2 classes, {+, −}, with 10 training patterns labelled +1, ..., +5 (class '+') and −6, ..., −10 (class '−'), and three candidate weak classifiers:
Classifier h1(x): 3 errors (3 '+' are wrongly recognised)
Classifier h2(x): 3 errors (3 '−' are wrongly recognised)
Classifier h3(x): 3 errors (2 '+' and 1 '−' are wrongly recognised)
Adaboost – Adaptive Boost
Example Round 1: k = 1
kmax = 3 (initialisation), n = 10, W1(i) = 1/n = 0.1, i = 1, ..., 10
Classifier 1: E1 = 0.1 + 0.1 + 0.1 = 0.3 (errors on +3, +4, +5)
Classifier 2: E2 = 0.1 + 0.1 + 0.1 = 0.3
Classifier 3: E3 = 0.1 + 0.1 + 0.1 = 0.3
Choose classifier 1: ĥ1(x) = h1(x), ε1 = E1 = 0.3
α1 = (1/2) ln((1 − ε1)/ε1) = 0.4236
Adaboost – Adaptive Boost
Round 1: k = 1, α1 = 0.4236
i    pattern   W1(i)   W1(i) e^{−α1 yi ĥ1(xi)}      W2(i) = W1(i) e^{−α1 yi ĥ1(xi)} / Z1
1    +1        0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715
2    +2        0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715
3    +3        0.1     0.1 e^{0.4236} = 0.1527      0.1527/0.9166 = 0.1666
4    +4        0.1     0.1 e^{0.4236} = 0.1527      0.1527/0.9166 = 0.1666
5    +5        0.1     0.1 e^{0.4236} = 0.1527      0.1527/0.9166 = 0.1666
6    −6        0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715
7    −7        0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715
8    −8        0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715
9    −9        0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715
10   −10       0.1     0.1 e^{−0.4236} = 0.0655     0.0655/0.9166 = 0.0715

Z1 = ∑i W1(i) e^{−α1 yi ĥ1(xi)} = 0.0655 × 7 + 0.1527 × 3 = 0.9166
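The round-1 update can be checked numerically with a few lines of Python; the only assumption (taken from the table above) is that ĥ1 misclassifies exactly patterns 3, 4 and 5.

```python
import numpy as np

# Round-1 AdaBoost weight update, reproducing the table above.
y = np.array([+1, +1, +1, +1, +1, -1, -1, -1, -1, -1])   # true labels
h1 = np.array([+1, +1, -1, -1, -1, -1, -1, -1, -1, -1])  # h1's outputs (errors on patterns 3, 4, 5)

W1 = np.full(10, 0.1)                       # initial weights
eps1 = W1[y != h1].sum()                    # weighted error = 0.3
alpha1 = 0.5 * np.log((1 - eps1) / eps1)    # 0.4236

unnorm = W1 * np.exp(-alpha1 * y * h1)      # 0.0655 if correct, 0.1527 if wrong
Z1 = unnorm.sum()                           # 0.9166
W2 = unnorm / Z1                            # 0.0715 if correct, 0.1666 if wrong
print(round(eps1, 4), round(alpha1, 4), round(Z1, 4))
print(W2.round(4))
```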
Adaboost – Adaptive Boost
At the beginning of round 1 (k = 1), all patterns have weight 0.1.
At the end of round 1, the weights are: +1 (0.0715), +2 (0.0715), +3 (0.1666), +4 (0.1666), +5 (0.1666), −6 (0.0715), −7 (0.0715), −8 (0.0715), −9 (0.0715), −10 (0.0715).
Adaboost – Adaptive Boost
Example Round 2: k = 2
Classifier 1: E1 = 0.1666 + 0.1666 + 0.1666 = 0.4998
Classifier 2: E2 = 0.0715 + 0.0715 + 0.0715 = 0.2145
Classifier 3: E3 = 0.0715 + 0.0715 + 0.0715 = 0.2145
Choose classifier 2 (i.e., ĥ2(x) = h2(x)) and ε2 = E2 = 0.2145
α2 = (1/2) ln((1 − ε2)/ε2) = 0.6490
Adaboost – Adaptive Boost
Round 2: k = 2, α2 = 0.6490
i    pattern   W2(i)    W2(i) e^{−α2 yi ĥ2(xi)}          W3(i) = W2(i) e^{−α2 yi ĥ2(xi)} / Z2
1    +1        0.0715   0.0715 e^{−0.6490} = 0.0374      0.0374/0.8213 = 0.0455
2    +2        0.0715   0.0715 e^{−0.6490} = 0.0374      0.0374/0.8213 = 0.0455
3    +3        0.1666   0.1666 e^{−0.6490} = 0.0871      0.0871/0.8213 = 0.1061
4    +4        0.1666   0.1666 e^{−0.6490} = 0.0871      0.0871/0.8213 = 0.1061
5    +5        0.1666   0.1666 e^{−0.6490} = 0.0871      0.0871/0.8213 = 0.1061
6    −6        0.0715   0.0715 e^{0.6490} = 0.1368       0.1368/0.8213 = 0.1666
7    −7        0.0715   0.0715 e^{0.6490} = 0.1368       0.1368/0.8213 = 0.1666
8    −8        0.0715   0.0715 e^{0.6490} = 0.1368       0.1368/0.8213 = 0.1666
9    −9        0.0715   0.0715 e^{−0.6490} = 0.0374      0.0374/0.8213 = 0.0455
10   −10       0.0715   0.0715 e^{−0.6490} = 0.0374      0.0374/0.8213 = 0.0455

Z2 = ∑i W2(i) e^{−α2 yi ĥ2(xi)} = 0.0374 × 4 + 0.0871 × 3 + 0.1368 × 3 = 0.8213
Adaboost – Adaptive Boost
At the beginning of round 2 (k = 2), the weights are those from the end of round 1: +1 (0.0715), +2 (0.0715), +3 (0.1666), +4 (0.1666), +5 (0.1666), −6 (0.0715), −7 (0.0715), −8 (0.0715), −9 (0.0715), −10 (0.0715).
At the end of round 2, the weights are: +1 (0.0455), +2 (0.0455), +3 (0.1061), +4 (0.1061), +5 (0.1061), −6 (0.1666), −7 (0.1666), −8 (0.1666), −9 (0.0455), −10 (0.0455).
Adaboost – Adaptive Boost
Example Round 3: k = 3
Classifier 1: E1 = 0.1061 + 0.1061 + 0.1061 = 0.3183
Classifier 2: E2 = 0.1666 + 0.1666 + 0.1666 = 0.4998
Classifier 3: E3 = 0.0455 + 0.0455 + 0.0455 = 0.1365
Choose classifier 3 (i.e., ĥ3(x) = h3(x)) and ε3 = E3 = 0.1365
α3 = (1/2) ln((1 − ε3)/ε3) = 0.9223
Adaboost – Adaptive Boost
Final classifier:
H(x) = sign( ∑_{k=1}^{kmax} αk ĥk(x) )
= sign( 0.4236 h1(x) + 0.6490 h2(x) + 0.9223 h3(x) )
Adaboost – Adaptive Boost
The weighted sum 0.4236 h1(x) + 0.6490 h2(x) + 0.9223 h3(x) is evaluated below for each of the regions R1–R6, where each weak classifier outputs +1 or −1:
R1 : 0.4236×1+0.6490×1+0.9223×1 = 1.9949
R2 : 0.4236×−1+0.6490×1+0.9223×1 = 1.1477
R3 : 0.4236×−1+0.6490×−1+0.9223×1 = −0.1503
R4 : 0.4236×1+0.6490×1+0.9223×−1 = 0.1503
R5 : 0.4236×−1+0.6490×1+0.9223×−1 = −0.6969
R6 : 0.4236×−1+0.6490×−1+0.9223× −1 = −1.9949
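These six region scores can be recomputed with the short check below; the ±1 entries simply restate the h1, h2, h3 outputs listed for R1–R6 above.

```python
import numpy as np

# Recompute the combined AdaBoost score in each region.
alphas = np.array([0.4236, 0.6490, 0.9223])      # alpha_1, alpha_2, alpha_3
regions = {
    "R1": (+1, +1, +1),
    "R2": (-1, +1, +1),
    "R3": (-1, -1, +1),
    "R4": (+1, +1, -1),
    "R5": (-1, +1, -1),
    "R6": (-1, -1, -1),
}
for name, h in regions.items():
    score = float(alphas @ np.array(h))          # weighted sum of the three outputs
    print(f"{name}: score = {score:+.4f}, H(x) = {np.sign(score):+.0f}")
```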
Stacked generalization
In stacked generalisation, the overall classification is done in two levels. The input is processed by a group of classifiers in the first level, and their outputs go through the classifier in the second level, which makes the final decision.
Training procedure (training dataset D with n samples; K classifiers are used in the first level):
Step 1: Prepare K sub-training datasets D1, D2, ..., DK of different sizes and samples.
Step 2: Train the kth first-level classifier using training dataset Dk, k = 1, 2, ..., K.
Step 3: Create a new training dataset for the second-level classifier. The outputs of the first-level classifiers (z1, z2, ..., zK) and the target output (y) of the classification problem (supervised) are taken to form the new dataset Dnew.
Step 4: Train the second-level classifier using the dataset Dnew.
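A minimal sketch of this training procedure follows, assuming scikit-learn decision trees as the first-level classifiers and logistic regression as the second-level classifier; these particular models, and the synthetic dataset, are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Steps 1-2: train K first-level classifiers on different sub-training datasets.
K = 3
level1 = []
for k in range(K):
    idx = rng.choice(len(X), size=int(rng.integers(150, 300)), replace=False)
    level1.append(DecisionTreeClassifier(max_depth=3, random_state=k).fit(X[idx], y[idx]))

# Step 3: form D_new from the first-level outputs z1, ..., zK and the targets y.
Z = np.column_stack([clf.predict(X) for clf in level1])

# Step 4: train the second-level classifier on D_new.
level2 = LogisticRegression().fit(Z, y)

# Classifying a new sample: pass it through level 1, then level 2.
x_new = X[:1]
z_new = np.column_stack([clf.predict(x_new) for clf in level1])
print(level2.predict(z_new))
```

In practice the second-level training set is often built from held-out (e.g. cross-validated) first-level predictions rather than in-sample predictions, to reduce overfitting; that refinement is omitted here.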
Stacked generalization
Figure 5: Stacked generalization (the input features feed the first-level classifiers, whose outputs feed the second-level classifier).
Structure of ensemble classifiers
Parallel Structure
All classifiers make decisions independently; the results are combined by a combiner.
Combination strategies include averaging, voting, weighted voting, adaptive weights, etc.
The ensemble approaches introduced above, such as bagging, boosting and AdaBoost, use this structure.
Structure of ensemble classifiers
Serial Structure
A particular case of the hierarchical (tree) structure.
Information and decision results are propagated to later levels for making the decision.
Rough decisions can be made in the upper levels for efficiency.
Accuracy can be fine-tuned in the lower levels.
However, errors will propagate.
Structure of ensemble classifiers
Hierarchical
A general structure that subsumes the parallel and serial structures.
Methods from the parallel and serial structures can be combined.
Random Forests
Random Forests is also called “Randomised Decision Forest” or “Randomised Forests”.
A random forest is an ensemble of decision trees, which are a standard machine learning technique.
The final decision is made by combining the decisions from all decision trees.
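The sketch below illustrates the idea with scikit-learn's random forest implementation on a standard dataset; the dataset and hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A random forest: an ensemble of decision trees, each grown on a resampled
# version of the training data; the final decision combines the trees' votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))

# The individual decision trees in the ensemble are accessible directly.
print("number of trees:", len(forest.estimators_))
```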
Decision Trees
A decision tree has a tree structure consisting of nodes linked by branches.
It starts at a root node and ends at leaf nodes.
Decision Trees
Training data consist of feature vectors and class labels
Training dataset:
D = {(x1,y1),(x2,y2),…,(xN,yN)}
Class label: yi ∈ {ω1, ω2, ..., ωk}
Feature vector: xi = [x1i, x2i, ..., xdi]^T, i = 1, 2, ..., N
N: the number of training samples
d: the dimensionality of the feature space
Decision Trees
(Figure: an example binary decision tree whose internal nodes test x2 against 5.5 and x1 against 7.3.)
Classification: Non-leaf nodes make decisions based on the value of an element of x (input feature vector).
Typically decisions are binary (resulting in a binary decision tree)
For example, if x4 > 11, go right; else go left.
Leaf nodes perform classification
May represent one class label or P(ω|x)
Decision Trees
Training:
Determine the tree structure.
Determine the decision rules. Typically, choose the feature xj and threshold t that best split the data into distinct classes, e.g. most salmon on the left (xj ≤ t), most sea bass on the right (xj > t).
Several metrics exist for the best split, such as:
Gini impurity
Information gain
Variance reduction
Figure 6: Classification of fish (number of samples in training data: 60 salmon, 75 sea bass). The tree tests length against 11 at the root and lightness against 5.5 at the next level; the pairs 35/12 and 25/63 are the salmon/sea bass counts reaching the root's two branches.
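As a concrete illustration of one of these metrics, the sketch below scores candidate thresholds with the Gini impurity; the feature values and labels are made-up toy numbers, not the fish data from the figure.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_c p_c^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature, labels, t):
    """Weighted Gini impurity after splitting on feature <= t (lower is better)."""
    left, right = labels[feature <= t], labels[feature > t]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data: 0 = salmon, 1 = sea bass, with a made-up "length" feature.
length = np.array([8, 9, 10, 12, 13, 14, 15, 16])
species = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Try candidate thresholds (excluding the maximum, which gives an empty branch)
# and keep the one with the lowest weighted impurity.
thresholds = np.unique(length)[:-1]
best_t = min(thresholds, key=lambda t: split_score(length, species, t))
print("best threshold:", best_t, "score:", split_score(length, species, best_t))
```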
Decision Trees
Training:
Determine the tree structure.
Determine the decision rules.
Calculate P(ω|x) for each leaf node: for each training exemplar, pass x down the tree and count how many exemplars from each category arrive at each node. For example, for a leaf node reached by 22 salmon and 4 sea bass exemplars:
P(Salmon|x) = 22/(22 + 4) = 0.8462
P(Sea Bass|x) = 4/(22 + 4) = 0.1538
Figure 7: Classification of fish (number of samples in training data: 60 salmon, 75 sea bass)
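A hand-coded toy version of such a tree is sketched below. The thresholds (length 11, lightness 5.5) and the 22/4 leaf come from the figures above, but the overall branch layout and the remaining leaf counts are inferred and illustrative only.

```python
# A toy, hand-coded decision tree in the spirit of the fish example.
def leaf_counts(length, lightness):
    """Pass a sample down the tree; return (salmon, sea bass) counts at its leaf."""
    if length > 11:                          # branch with 35 salmon / 12 sea bass
        return (22, 4) if lightness <= 5.5 else (13, 8)
    return (25, 63)                          # other branch, treated as a leaf here

salmon, bass = leaf_counts(length=15, lightness=2)
total = salmon + bass
print(f"P(Salmon|x) = {salmon / total:.4f}")     # 22/26 = 0.8462
print(f"P(Sea Bass|x) = {bass / total:.4f}")     # 4/26  = 0.1538
```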
Decision Trees
Classifying:
For a new sample, pass x down the tree and determine which leaf node it arrives at.
Give the sample the class label associated with that leaf.
Or, given P(ω|x) associated with the leaf node, give the sample the class label of the most probable class, e.g.:
P(Salmon|x) = 22/(22 + 4) = 0.8462
P(Sea Bass|x) = 4/(22 + 4) = 0.1538
Figure 8: Classify a fish (sample has length 15, lightness 2)
Decision Trees
It is not restricted to dichotomies and can be used to classify datasets with more than two categories.
Decision Trees
Another example of a two-class decision tree: go left if the condition at a node is true, otherwise go right.
(Figure: a tree with splits such as x1 < 0.27, x2 < 0.32 and x1 < 0.81 and leaves labelled ω1 and ω2, shown alongside the corresponding partition of the [0, 1] × [0, 1] feature space.)
Decision Trees
Decisions at nodes could be more complex - implemented by a more complex classifier
e.g. Binary