
Lecture 19: Ensemble Learning
Semester 1, 2022 , CIS
Copyright © University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.


Acknowledgement: , &
Today we’ll be using:
https://webwhiteboard.com/board/pACt6sNEHK6DM9Z7HMfVOXAmiNQmenYo/

So far:
• individual classification algorithms in isolation
• choosing the “optimal” classifier by comparing the performance of individual classifiers over a given dataset/task
• when evaluating, we only get one shot at classifying a given test instance, and are stuck with the bias inherent in a given algorithm
Today:
• Wrapping up classification:
• aside on (non)-linear classification
• aside on (non)-parametric machine learning
• Ensembles: combining multiple (weak/unstable) models into one strong model!

Aside: (Non)-linear and (non)-parametric classification

Aside I: linear vs. non-linear classification
Linear classifiers
• Naive Bayes
• Logistic Regression
• Perceptron
… because their decision boundary is a linear function of the input x. (There’s still a non-linear activation function, so y is not a linear function of x).
Non-linear classifiers
• Multi-layer perceptron (with non-linear activations)
• K-Nearest Neighbors
• Decision trees
… because their decision boundary is *not* a linear function of the input x. They can learn more complex decision boundaries.
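To make the distinction concrete, here is a minimal sketch (assuming scikit-learn and its make_moons toy dataset, neither of which is part of the lecture material) that fits a linear and a non-linear classifier to data that is not linearly separable:

```python
# Sketch: a linear vs. a non-linear classifier on data that is not
# linearly separable (two interleaving half-moons).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)
nonlinear = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# The straight-line boundary cannot separate the moons; the tree can.
print("Logistic Regression accuracy:", linear.score(X_test, y_test))
print("Decision Tree accuracy:      ", nonlinear.score(X_test, y_test))
```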

Aside I: linear vs. non-linear classification
[Figure: decision boundary of a well-trained decision tree vs. a well-trained Naive Bayes classifier on the same data, illustrating under-fitting, appropriate fitting and over-fitting.]
https://webwhiteboard.com/board/pACt6sNEHK6DM9Z7HMfVOXAmiNQmenYo/

Aside II: parametric vs. non-parametric models
Warning: these terms are ambiguous and several definitions exist. We’ll adopt the following.
Parametric Models
• Naive Bayes, Logistic Regression, Multi-layer perceptron, …
… because they have a constant number of parameters, irrespective of the amount of training data. We can write down the model as y = f(x; θ), with a fixed functional form that holds for any input x; we then fit the parameters θ of that given model.
Non-parametric models
• K-Nearest Neighbors, Decision trees, …
… because the parameters grow with the training data and are possibly infinite. We learn our model directly from the data.
• Discuss: what’s ‘non-parametric’ about KNN?
• Discuss: what’s ‘non-parametric’ about Decision Trees?
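As a rough illustration of the distinction (a sketch only, assuming scikit-learn; the n_samples_fit_ attribute is scikit-learn's, not something defined in the lecture): a logistic regression model keeps the same number of parameters however much data we train on, whereas k-NN has to keep every training instance around.

```python
# Sketch: parameter count of a parametric model stays fixed as the training
# set grows, while a non-parametric model (k-NN) stores the data itself.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

    lr = LogisticRegression(max_iter=1000).fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

    n_lr_params = lr.coef_.size + lr.intercept_.size   # fixed: 5 weights + 1 bias
    n_knn_stored = knn.n_samples_fit_                   # grows with the data
    print(f"N={n:6d}  LogisticRegression params={n_lr_params}  "
          f"k-NN stored instances={n_knn_stored}")
```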

Now, on to ensembles

Ensembles I
• Ensemble learning (aka. Classifier combination): constructs a set of base classifiers from a given set of training data and aggregates the outputs into a single meta-classifier
• Intuition 1: the combination of lots of weak classifiers can be at least as good as one strong classifier
• Intuition 2: the combination of a selection of strong classifiers is (usually) at least as good as the best of the base classifiers

Ensembles II
• When does ensemble learning work?
• the base classifiers should not make the same mistakes
• the base classifiers are reasonably accurate
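A quick way to probe both conditions empirically (a sketch, assuming scikit-learn and a synthetic dataset; the three base classifiers are illustrative choices) is to measure each classifier's accuracy and how often pairs of classifiers are wrong on the same test instances:

```python
# Sketch: each base classifier should be reasonably accurate, and their
# errors should overlap as little as possible.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {"NB": GaussianNB(), "DT": DecisionTreeClassifier(random_state=0),
          "kNN": KNeighborsClassifier()}
errors = {}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    errors[name] = pred != y_te
    print(f"{name}: accuracy = {1 - errors[name].mean():.3f}")

# Fraction of test instances that two classifiers BOTH get wrong: the smaller
# this is relative to the individual error rates, the more an ensemble gains.
print("NB & DT both wrong: ", np.mean(errors["NB"] & errors["DT"]))
print("NB & kNN both wrong:", np.mean(errors["NB"] & errors["kNN"]))
```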

Ensembles III
Let’s assume that:
• We have a set of 25 binary base classifiers
• Each has an error rate of ε = 0.35
• The base classifiers are independent (that’s usually false)
• We perform classifier combination by voting
The error rate of the combined classifier (the majority vote is wrong whenever 13 or more of the 25 base classifiers are wrong) is:

$\varepsilon_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$
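The ≈ 0.06 figure can be checked directly (a small sketch assuming scipy, which is not part of the lecture material):

```python
# Sketch: verifying the ensemble error rate for 25 independent base
# classifiers with individual error rate 0.35, combined by majority vote.
from scipy.stats import binom

eps, n = 0.35, 25
# The majority vote is wrong when 13 or more of the 25 classifiers are wrong.
ensemble_error = sum(binom.pmf(i, n, eps) for i in range(13, n + 1))
print(f"{ensemble_error:.3f}")   # about 0.06; equivalently binom.sf(12, n, eps)
```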

Ensembles IV
• When does ensemble learning work?

Classification with Ensemble Learning
• The simplest means of classification over multiple base classifiers is simple voting:
• Classification: run multiple base classifiers over the test data and select the class predicted by the most base classifiers (e.g. k-NN)
• Regression: average over the numeric predictions of our base classifiers
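A hedged sketch of the classification case using scikit-learn's VotingClassifier (the particular base classifiers and dataset are illustrative choices, not prescribed by the lecture):

```python
# Sketch: simple (unweighted) voting over heterogeneous base classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("knn", KNeighborsClassifier())],
    voting="hard")   # "hard" = predict the class chosen by most base classifiers

print("ensemble accuracy:", cross_val_score(vote, X, y, cv=5).mean())

# For regression the analogue is averaging the numeric predictions,
# e.g. sklearn.ensemble.VotingRegressor.
```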

Approaches to Ensemble Learning
• Instance manipulation: generate multiple training datasets through sampling, and train a base classifier over each dataset
• Feature manipulation: generate multiple training datasets through different feature subsets, and train a base classifier over each dataset
• Algorithm manipulation: semi-randomly “tweak” internal parameters within a given algorithm to generate multiple base classifiers over a given dataset
• Class label manipulation: generate multiple training datasets by manipulating the class labels in a reversible manner

Stacking I
• Intuition: “smooth” errors over a range of algorithms with different biases
• Simple Voting: combine the predictions of the base classifiers by (majority) voting
• presupposes the classifiers have equal performance
• Meta Classification: train a classifier over the outputs of the base classifiers
• train using nested cross validation to reduce bias

Stacking II
• Level 0: Given training dataset (X, y):
  • Train Neural Network
  • Train Naive Bayes
  • Train Decision Tree
  • …
• Discard (or keep) X, add new attributes for each instance:
  • predictions of the classifiers above
  • other data as available (NB scores etc.)
• Level 1: Train meta-classifier. Usually Logistic Regression or Neural Network
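A minimal sketch of this two-level setup using scikit-learn's StackingClassifier, which builds the level-1 training data from cross-validated level-0 predictions (the specific estimators are illustrative choices):

```python
# Sketch: two-level stacking. The level-1 (meta) classifier is trained on
# cross-validated predictions of the level-0 classifiers, matching the
# "train using cross validation to reduce bias" point above.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("mlp", MLPClassifier(max_iter=2000, random_state=0)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(),  # level-1 meta-classifier
    cv=5,                # level-1 training data comes from 5-fold CV predictions
    passthrough=False)   # True would keep the original attributes X as well

print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```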

Stacking III
Nested Cross-validation

Stacking IV
• Mathematically simple but computationally expensive method
• Able to combine heterogeneous classifiers with varying performance
• Generally, stacking results in as good or better results than the best of the base classifiers
• Widely seen in applied research; less interest within theoretical circles (esp. statistical learning)

Bagging I
• Intuition: the more data, the better the performance (the lower the variance), so how can we get ever more data out of a fixed training dataset?
• Method: construct “novel” datasets through a combination of random sampling and replacement
• Randomly sample the original dataset N times, with replacement (bootstrap)
• Thus, we get a new dataset of the same size, where any individual instance is absent with probability (1 − 1/N)^N ≈ 0.37 for reasonably large N
• construct k random datasets for k base classifiers, and arrive at prediction via voting
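A sketch of both steps, assuming numpy and scikit-learn (BaggingClassifier over decision trees is an illustrative choice): first check the (1 − 1/N)^N figure empirically, then bag k = 50 trees and combine them by voting.

```python
# Sketch: bootstrap sampling by hand, plus bagging via scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N = 1000
idx = rng.integers(0, N, size=N)           # sample N times with replacement
absent = 1 - len(np.unique(idx)) / N
print(f"fraction of instances absent from one bootstrap sample: {absent:.2f}")
# close to (1 - 1/N)^N ~= 1/e ~= 0.37

# k base classifiers, each trained on its own bootstrap sample, combined by voting
X, y = make_classification(n_samples=N, n_features=20, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
print("bagging accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```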

Bagging II
• Original dataset:
• Bootstrap Samples

Bagging III
• The same (unstable) classification algorithm is used throughout
• As bagging is aimed towards minimising variance through sampling, the algorithm should be unstable ( =high-variance) … e.g.?

Bagging IV
• Simple method based on sampling and voting
• Possibility to parallelise computation of individual base classifiers
• Highly effective over noisy datasets (outliers may vanish)
• Performance is generally significantly better than the base classifiers and only occasionally substantially worse

Bagging – Random Forest

Random Forest I
A Random Tree is a Decision Tree where:
• At each node, only some of the possible attributes are considered
• For example, a fixed proportion of all of the attributes, except the ones used earlier in the tree
• Attempts to control for unhelpful attributes in the feature set
• Much faster to build than a “deterministic” Decision Tree, but increases model variance
This is an instance of Feature Manipulation.

Random Forest II
A Random Forest is:
• An ensemble of Random Trees (many trees = forest)
• Each tree is built using a different Bagged training dataset
• As with Bagging the combined classification is via voting
• The idea behind them is to minimise overall model variance, without introducing (combined) model bias
This is an instance of Instance Manipulation.

Random Forest III
Hyperparameters:
• number of trees B (can be tuned, e.g. based on the “out-of-bag” error rate)
• feature sub-sample size (e.g. log|F| + 1)
Interpretation:
• logic behind predictions on individual instances can be tediously followed through the various trees
• logic behind the overall model: ???
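A sketch tying these hyperparameters to scikit-learn's RandomForestClassifier (the parameter names are scikit-learn's; max_features="log2" is roughly the log|F| + 1 rule above, and oob_score=True gives the out-of-bag estimate):

```python
# Sketch: a Random Forest with B trees and a per-node feature sub-sample,
# evaluated via the out-of-bag error estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # B, the number of random trees
    max_features="log2",   # features considered at each node
    oob_score=True,        # evaluate on instances left out of each bootstrap sample
    random_state=0)
forest.fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
```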

Random Forest IV
Practical properties of Random Forests:
• Generally a very strong performer
• Embarrassingly parallelisable
• Surprisingly efficient
• Robust to overfitting
• Interpretability sacrificed

Boosting I
• Intuition: tune base classifiers to focus on the “hard to classify” instances
• Approach: iteratively change the distribution and weights of training instances to reflect the performance of the classifier on the previous iteration
• start with each training instance having a probability of 1/N of being included in the sample
• over T iterations, train a classifier and update the weight of each instance according to whether it is correctly classified
• combine the base classifiers via weighted voting

Boosting II
• Original dataset:
• Boosting samples:

AdaBoost I
AdaBoost (Adaptive Boosting) is a sequential ensembling algorithm. Basic idea:
• Base classifiers: C_i
• Training instances: (x_j, y_j), for j = 1, 2, …, N
• Initial instance weights: w_j^{(0)} = 1/N, for j = 1, 2, …, N
• In iteration i:
  • Construct classifier C_i and compute its error rate ε_i:

    $\varepsilon_i = \sum_{j} w_j^{(i)}\, \delta\!\left(C_i(x_j) \neq y_j\right)$

  • Use ε_i to compute the classifier weight α_i (the importance of C_i)
  • Use α_i to update the instance weights
  • Add the weighted C_i to the ensemble
Output: weighted set of base classifiers {(α_1, C_1), (α_2, C_2), …, (α_T, C_T)}

AdaBoost II: Computing α
“Importance” of Ci (i.e. the weight associated with the classifiers’ votes):
$\alpha_i = \frac{1}{2} \ln \frac{1 - \varepsilon_i}{\varepsilon_i}$

AdaBoost III: Updating w
Weights for instance j (i ≥ 0):

$w_j^{(i+1)} = w_j^{(i)} \times \begin{cases} e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\ e^{\alpha_i} & \text{if } C_i(x_j) \neq y_j \end{cases}$

followed by renormalisation so that the weights again sum to 1: correctly classified instances are down-weighted, misclassified instances are up-weighted.
(Recall that the α_i values are always positive.)

AdaBoost IV
• Continue iterating for i = 1, 2, . . . , T , but reinitialise the instance weights whenever εi > 0.5
• As long as each base classifier is better than random, convergence to a stronger model is guaranteed
• Classification:

$C^{*}(x) = \operatorname*{argmax}_{y} \sum_{j=1}^{T} \alpha_j\, \delta\!\left(C_j(x) = y\right)$
• Typical base classifiers: decision stumps (OneR) or decision trees. (Possibly the world’s favorite classifier!)
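A from-scratch sketch of the procedure above with decision stumps as base classifiers (assuming numpy and scikit-learn; in practice sklearn.ensemble.AdaBoostClassifier implements boosting for you). The point is to show ε_i, α_i, the weight update and the weighted vote in code:

```python
# Sketch: AdaBoost with decision stumps, implementing the eps_i / alpha_i /
# weight-update formulas above. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

N, T = len(y_tr), 50
w = np.full(N, 1 / N)                      # w_j^(0) = 1/N
ensemble = []                              # will hold (alpha_i, C_i) pairs

for i in range(T):
    stump = DecisionTreeClassifier(max_depth=1, random_state=i)   # decision stump
    stump.fit(X_tr, y_tr, sample_weight=w)
    wrong = stump.predict(X_tr) != y_tr
    eps = np.sum(w * wrong)                # eps_i = sum_j w_j^(i) delta(C_i(x_j) != y_j)
    if eps > 0.5:                          # reinitialise the weights and move on
        w = np.full(N, 1 / N)
        continue
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))   # alpha_i = 1/2 ln((1 - eps_i)/eps_i)
    w = w * np.exp(np.where(wrong, alpha, -alpha))      # up-weight mistakes, down-weight the rest
    w = w / w.sum()                                     # renormalise so the weights sum to 1
    ensemble.append((alpha, stump))

# Weighted voting: C*(x) = argmax_y sum_j alpha_j delta(C_j(x) = y)
votes = np.zeros((len(y_te), 2))           # two classes in this toy problem
for alpha, clf in ensemble:
    votes[np.arange(len(y_te)), clf.predict(X_te)] += alpha
print("boosted ensemble accuracy:", np.mean(votes.argmax(axis=1) == y_te))
```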

Boosting: Summary
• Mathematically complicated but computationally cheap method based on iterative sampling and weighted voting
• More computationally expensive than bagging
• The method has guaranteed performance in the form of error bounds over the training data
• Interesting effect with convergence of the error rate over the training vs. test data
• In practical applications, boosting has the tendency to overfit

Bagging vs. Boosting
Bagging:
• Parallel sampling
• Simple voting
• Single classification algorithm
• Minimise variance
• Not prone to overfitting

Boosting:
• Iterative sampling
• Weighted voting
• Single classification algorithm
• Minimise (instance) bias
• Prone to overfitting

• What is classifier combination?
• What is bagging and what is the basic thinking behind it?
• What is boosting and what is the basic thinking behind it?
• What is stacking and what is the basic thinking behind it?
• How do bagging and boosting compare?

References
• Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
• Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, USA, second edition, 2005.
