
Lecture 12. Ensemble methods.
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne

This lecture
• Ensemble methods: Hedging your bets!
∗ Bagging and random forests
∗ Boosting
∗ Stacking

Why “one true” model?
• Thus far, we have discussed individual models and considered each of them in isolation/competition
• We know how to evaluate each model's performance (via accuracy, F-measure, etc.), which allows us to choose the best model for a dataset overall
• But overall doesn't imply per-example performance
∗ Overall best model: likely makes errors on some instances!
∗ Overall worst model: could be superior on some instances!
• Ensembles let us use multiple models together!

Panel of experts
• Consider a panel of 3 experts that make a classification decision independently. Each expert makes a mistake with probability 0.3. The consensus decision is by majority vote.
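For instance, with this panel (n = 3, p = 0.3) the majority, meaning at least 2 of the 3 experts, errs with probability $\binom{3}{2}(0.3)^2(0.7) + \binom{3}{3}(0.3)^3 = 0.189 + 0.027 = 0.216$, already below any single expert's error rate of 0.3.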

Panel of experts (cont.)
• Setup: Independent mistakes, each probability p
• Distribution of #mistakes of n experts is Binom(n,p)
$\Pr(k) = \binom{n}{k} p^k (1-p)^{n-k}$
• Majority vote errs when at least $\lceil n/2 \rceil$ experts err (for n odd):
$\Pr(\text{panel errs}) = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} p^k (1-p)^{n-k}$
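A quick numerical check of this sum (a minimal sketch in plain Python; the helper name panel_error_rate is illustrative, not from the lecture):

    from math import comb

    def panel_error_rate(n, p):
        """Probability that a majority of n independent experts errs,
        each erring independently with probability p (n odd)."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    print(panel_error_rate(3, 0.3))   # ≈ 0.216, versus 0.3 for a single expert
    print(panel_error_rate(11, 0.3))  # ≈ 0.078: a larger panel errs even less often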

Combining models
• Model combination (aka ensemble learning) constructs a set of base models (aka base learners) from a given training set and aggregates their outputs into a single meta-model (ensemble)
∗ Classification via (weighted) majority vote
∗ Regression via (weighted) averaging
∗ More generally: meta-model = f(base models)
How to generate multiple learners from a single training dataset?
• Recall bias-variance trade-off:
$\mathbb{E}\left[ l\left(\mathcal{Y}, \hat{f}(\boldsymbol{x}_0)\right) \right] = \left(\mathbb{E}[\mathcal{Y}] - \mathbb{E}[\hat{f}]\right)^2 + \mathrm{Var}[\hat{f}] + \mathrm{Var}[\mathcal{Y}]$
Test error = (bias)^2 + variance + irreducible error
• Averaging k independent and identically distributed predictions reduces variance: $\mathrm{Var}[\hat{f}_{\mathrm{avg}}] = \frac{1}{k}\mathrm{Var}[\hat{f}]$
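A small simulation sketch of this 1/k variance reduction (assuming NumPy; the "predictions" are stand-in i.i.d. noisy estimates rather than real trained models):

    import numpy as np

    rng = np.random.default_rng(0)
    k, trials = 10, 100_000
    # Each simulated prediction is the true value 1.0 plus unit-variance noise.
    single = rng.normal(loc=1.0, scale=1.0, size=trials)
    averaged = rng.normal(loc=1.0, scale=1.0, size=(trials, k)).mean(axis=1)
    print(single.var())    # close to 1.0
    print(averaged.var())  # close to 1.0 / k = 0.1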

Bagging (bootstrap aggregating; Breiman '94)
• Method: construct "near-independent" datasets via sampling with replacement
∗ Generate k datasets, each of size n, sampled from the training data with replacement – bootstrap samples
∗ Build base classifier on each constructed dataset
∗ Aggregate predictions via voting/averaging
• Original training dataset: {0,1,2,3,4,5,6,7,8,9}
• Bootstrap samples:
{7, 2, 6, 7, 5, 4, 8, 8, 1, 0} – out-of-sample: 3, 9
{1, 3, 8, 0, 3, 5, 8, 0, 1, 9} – out-of-sample: 2, 4, 6, 7
{2, 9, 4, 2, 7, 9, 3, 0, 1, 0} – out-of-sample: 5, 6, 8
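A minimal sketch of how such bootstrap samples and their out-of-sample sets could be generated (assuming NumPy; the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(42)
    data = np.arange(10)                 # original training dataset {0, ..., 9}
    for _ in range(3):
        # Sample n = 10 items with replacement: a bootstrap sample.
        sample = rng.choice(data, size=len(data), replace=True)
        out_of_sample = sorted(set(data.tolist()) - set(sample.tolist()))
        print(sample.tolist(), "– out-of-sample:", out_of_sample)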

Refresher on decision trees
[Figure: decision tree with splits x1 ≥ θ1 and x2 ≥ θ2, alongside the corresponding axis-aligned partition of the feature space into regions labelled A and B]
• Training criterion: Purity of each final partition
• Optimisation: Heuristic greedy iterative approach
• Model complexity is defined by the depth of the tree
• Deep trees: Very fine-tuned to specific data → high variance, low bias
• Shallow trees: Crude approximation → low variance, high bias
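To make the depth/complexity trade-off concrete, here is a small sketch using scikit-learn (assuming it is available; the synthetic dataset and the particular depths are arbitrary illustrative choices):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (1, 3, None):  # decision stump, shallow tree, unrestricted (deep) tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        # Deep trees tend to score near-perfectly on training data but can generalise
        # worse than a moderately shallow tree: high variance, low bias.
        print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))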

Bagging example: Random forest
• Just bagged trees!
• Algorithm (parameters: #trees k, #features p ≤ m):
1. Initialise forest as empty
2. For c = 1 … k
a) Create new bootstrap sample of training data
b) Select random subset of p of the m features
c) Train decision tree on bootstrap sample using the p features
d) Add tree to forest
3. Make predictions via majority vote or averaging
• Works extremely well in many practical settings
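A from-scratch sketch of this algorithm for binary classification with majority vote, using scikit-learn only for the base trees (the sqrt(m) default for p is a common convention, not prescribed by the slide):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_forest(X, y, k=100, p=None, seed=0):
        """Random forest as bagged trees over random feature subsets."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        p = p if p is not None else max(1, int(np.sqrt(m)))  # common default for p
        forest = []                                    # 1. initialise forest as empty
        for _ in range(k):                             # 2. for c = 1 ... k
            rows = rng.integers(0, n, size=n)          #    a) bootstrap sample of training data
            feats = rng.choice(m, size=p, replace=False)  # b) random subset of p of the m features
            tree = DecisionTreeClassifier(random_state=0) # c) train tree on that sample/features
            tree.fit(X[rows][:, feats], y[rows])
            forest.append((feats, tree))               #    d) add tree to forest
        return forest

    def predict_forest(forest, X):
        # 3. majority vote over the trees (binary 0/1 labels assumed)
        votes = np.mean([tree.predict(X[:, feats]) for feats, tree in forest], axis=0)
        return (votes >= 0.5).astype(int)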

Putting out-of-sample data to use
• At each round, a particular training example has a probability of $1 - \frac{1}{n}$ of not being selected
∗ Thus probability of being left out of a bootstrap sample is $\left(1 - \frac{1}{n}\right)^n$
∗ For large n, this probability approaches $e^{-1} \approx 0.368$
∗ On average only 63.2% of data included per bootstrap sample
• Can use this for independent error estimate of ensemble
∗ Safe, like cross-validation, but on overlapping sub-samples!
∗ Evaluate each base classifier on its out-of-sample 36.8%
∗ Average these evaluations → evaluation of ensemble!
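A quick numerical check of this limit in plain Python:

    import math

    # Probability that a given example is absent from a bootstrap sample of size n;
    # it approaches 1/e ≈ 0.368 as n grows.
    for n in (10, 100, 1000, 100000):
        print(n, round((1 - 1 / n) ** n, 4), round(math.exp(-1), 4))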

Bagging: Reflections
• Simple method based on sampling and voting
• Possibility to parallelise computation of individual base classifiers
• Highly effective over noisy datasets
• Performance is often significantly better than (simple) base classifiers, never substantially worse
• Improves unstable classifiers by reducing variance

Boosting
• Intuition: focus attention of base classifiers on examples "hard to classify"
• Method: iteratively change the distribution on examples to reflect performance of the classifier on the previous iteration
∗ Start with each training instance having a 1/n probability of being included in the sample
∗ Over k iterations, train a classifier and update the weight of each instance according to the classifier's ability to classify it
∗ Combine the base classifiers via weighted voting

Boosting: Sampling example
• Original training dataset: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• Boosting samples:
Iteration 1: {7, 2, 6, 7, 5, 4, 8, 8, 1, 0}
Suppose that example 2 was misclassified
Iteration 2: {1, 3, 8, 2, 3, 5, 2, 0, 1, 9}
Suppose that example 2 was misclassified still
Iteration 3: {2, 9, 2, 2, 7, 9, 3, 2, 1, 0}
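The sketch below mimics this numerically (assuming NumPy; doubling the weight of example 2 is purely illustrative, AdaBoost's exact update follows on the next slide):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    weights = np.ones(n) / n                 # start uniform: each example has probability 1/n
    for it in range(1, 4):
        sample = rng.choice(n, size=n, replace=True, p=weights)
        print("iteration", it, ":", sorted(sample.tolist()))
        weights[2] *= 2.0                    # pretend example 2 was misclassified again
        weights /= weights.sum()             # renormalise to a probability distribution
    # Example 2 tends to appear more and more often in later samples.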

Boosting Example: AdaBoost
1. Initialise example distribution $P_1(i) = 1/n$, $i = 1, \ldots, n$
2. For $c = 1 \ldots k$
a) Train base classifier $A_c$ on a sample drawn with replacement from $P_c$
b) Set confidence $\alpha_c = \frac{1}{2} \ln \frac{1-\varepsilon_c}{\varepsilon_c}$ for classifier's error rate $\varepsilon_c$
c) Update example distribution to be the normalisation of:
$P_{c+1}(i) \propto P_c(i) \times \begin{cases} \exp(-\alpha_c), & \text{if } A_c(i) = y_i \\ \exp(\alpha_c), & \text{otherwise} \end{cases}$
3. Classify as majority vote weighted by confidences:
$\operatorname{argmax}_y \sum_{c=1}^{k} \alpha_c \, \delta\left(A_c(\boldsymbol{x}) = y\right)$
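A compact sketch of this procedure with decision stumps as the weak learners (scikit-learn assumed; labels must be -1/+1, and the resampling from P_c in step (a) is replaced here by passing P_c as sample weights, a common equivalent implementation choice):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, k=50):
        """y holds labels in {-1, +1}. Returns a list of (alpha_c, stump) pairs."""
        n = len(y)
        P = np.ones(n) / n                                # 1. uniform example distribution
        ensemble = []
        for _ in range(k):                                # 2. for c = 1 ... k
            stump = DecisionTreeClassifier(max_depth=1)   # a) weak base classifier A_c
            stump.fit(X, y, sample_weight=P)
            pred = stump.predict(X)
            eps = P[pred != y].sum()                      # weighted error rate epsilon_c
            if eps >= 0.5:                                # technicality: reinitialise and skip
                P = np.ones(n) / n
                continue
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # b) confidence alpha_c
            P = P * np.exp(-alpha * y * pred)             # c) exp(-alpha) if correct, exp(alpha) if not
            P = P / P.sum()                               #    normalise to a distribution
            ensemble.append((alpha, stump))
        return ensemble

    def predict(ensemble, X):
        # 3. sign of the confidence-weighted vote
        return np.sign(sum(alpha * stump.predict(X) for alpha, stump in ensemble))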

AdaBoost (cont.)
[Figure: confidence weight $\alpha_c = \frac{1}{2}\ln\frac{1-\varepsilon_c}{\varepsilon_c}$ plotted against the classifier error rate $\varepsilon_c$]
• Technicality: Reinitialise example distribution whenever $\varepsilon_c > 0.5$
• Base learners: often decision stumps or trees, anything "weak"
∗ A decision stump is a decision tree with one splitting node

Boosting: Reflections
• Method based on iterative sampling and weighted voting
• More computationally expensive than bagging
• The method has guaranteed performance in the form of error bounds over the training data
• In practical applications, boosting can overfit

Bagging vs Boosting
Bagging                        | Boosting
------------------------------ | ------------------------------------------------------
Parallel sampling              | Iterative sampling
Minimise variance              | Target "hard" instances
Simple voting                  | Weighted voting
Classification or regression   | Classification or regression
Not prone to overfitting       | Prone to overfitting (unless base learners are simple)

Stacking
• Intuition: “smooth” errors over a range of algorithms with different biases
• Method: train a meta-model over the outputs of the base learners
∗ Train base- and meta-learners using cross-validation
∗ Simple meta-classifier: logistic regression
• Generalisation of bagging and boosting
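A minimal stacking sketch along these lines (scikit-learn assumed; the base learners, dataset and use of predicted probabilities as meta-features are illustrative choices):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    base_learners = [DecisionTreeClassifier(random_state=0),
                     SVC(probability=True, random_state=0)]

    # Out-of-fold predicted probabilities become the meta-features, so the
    # meta-learner never sees a base learner's output on its own training fold.
    meta_features = np.column_stack([
        cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
        for clf in base_learners
    ])
    meta_model = LogisticRegression().fit(meta_features, y)

    # At test time, refit each base learner on all of X and feed its
    # predictions through meta_model.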

Stacking: Reflections
• Compare this to ANNs and basis expansion
• Mathematically simple but computationally expensive method
• Able to combine heterogeneous classifiers with varying performance
• With care, stacking results in as good or better results than the best of the base classifiers

This lecture
• Ensemble methods: Hedging your bets!
∗ Bagging and random forests
∗ Boosting
∗ Stacking
• Next lecture: multi-armed bandits