Ensemble Learning
Lingqiao Liu
University of Adelaide
Some slides adapted from Rama Ramakrishnan, Rob Schapire, and others.
Outline
• Ensemble methods overview
• Random forest
• Bagging
• Boosting
What is Ensemble learning?
• According to the dictionary, the word ‘ensemble’ means ‘a group of items viewed as a whole rather than individually’
• In machine learning, ensemble learning is a technique that trains and combines multiple predictive models
– The models can be of any type
– The more diverse, the better
– Combine the predictions (a small sketch follows this list):
• Majority voting / weighted majority voting
• Average / weighted average
• …
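As a small illustrative sketch (using NumPy; the predictions and weights below are hypothetical stand-ins for real model outputs), majority voting and weighted averaging can be implemented as follows:

import numpy as np

# Hard 0/1 predictions from 5 hypothetical classifiers on 4 samples.
preds = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 1],
                  [1, 0, 1, 0],
                  [1, 1, 1, 1]])               # shape: (n_models, n_samples)

# Majority voting: each sample takes the class most models predicted.
majority = (preds.mean(axis=0) > 0.5).astype(int)

# Weighted average of probability outputs; more trusted models can be
# given larger weights (random stand-in probabilities here).
probs = np.random.rand(5, 4)                   # hypothetical P(class 1)
weights = np.array([0.3, 0.2, 0.2, 0.2, 0.1])  # weights sum to 1
avg_prob = weights @ probs                     # one probability per sample

print(majority, avg_prob)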
Ensemble learning is powerful
• The ensemble approach is often an essential strategy for winning machine learning competitions!
Ensemble learning
• The key is to ensure the diversity of the predictive models
• Let’s consider an extreme case
– Binary classification problem
– We ensemble 1000 classifiers; each has an accuracy of 51%, i.e., slightly better than random guessing
– The classifiers are statistically independent
– We use majority voting to make the final prediction
• If most classifiers predict class 1, the final prediction is class 1, and vice versa
What is the classification accuracy of the final system?
Ensemble learning example
• For the majority vote to be correct, more than 500 of the classifiers must give the correct prediction
• This probability can be calculated from the binomial cumulative distribution:
P = binocdf(499, 1000, 1 − 0.51) ≈ 0.73
• So the final prediction accuracy rises from 51% to over 70%!
– If each individual classifier were slightly more accurate, say 55%, the ensemble’s accuracy would be nearly 100%
https://stats.stackexchange.com/questions/108002/what-is-a-cumulative-binomial-probability
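The calculation can be reproduced with SciPy’s binomial CDF; a minimal sketch (the helper name ensemble_accuracy is my own):

from scipy.stats import binom

# The majority vote over n independent classifiers is correct when at
# most (n - 1) // 2 of them are wrong; each is wrong with probability 1 - p.
def ensemble_accuracy(n=1000, p=0.51):
    return binom.cdf((n - 1) // 2, n, 1 - p)

print(ensemble_accuracy(p=0.51))  # ~0.73, matches binocdf(499, 1000, 0.49)
print(ensemble_accuracy(p=0.55))  # ~0.999, nearly perfect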
Ensemble learning
• By ensembling rather weak classifiers, the combined classifier can significantly outperform each original classifier
• The key is the independence assumption
– In practice, ensuring independence is challenging
– We usually use various methods to make the individual predictors as diverse/independent as possible
• How to ensure the diversity of classifiers?
– Vary the predictor/classifier, e.g., use different types of classifiers, or classifiers with different parameters
– Vary the dataset used to train the model, e.g., train classifiers on different subsets of the data
Outline
• Ensemble methods overview
• Random forest
• Bagging
• Boosting
Random forest
• Basic idea: ensemble a set of simple classifiers, called decision trees, into the final classifier
• A simple but effective approach
– Especially good for features with diverse meanings
– Efficient in both training and testing
Decision Tree: An introduction
• A classic predictive model
– A decision tree is drawn upside down, with its root at the top
– Each non-leaf node represents a test, usually involving only one feature
– Each leaf node represents the decision/prediction outcome
– Example:
As a basic example, consider the Titanic dataset and the task of predicting whether a passenger survives. The model below uses three features/attributes/columns from the dataset: sex, age, and sibsp (the number of siblings or spouses aboard).
Decision Tree: an example
• At each node, the decision tree chooses one dimension and one threshold to perform a test; the data are divided according to the outcome of the test
• Splitting stops when all the samples in a branch belong to the same class
Decision Tree: an example
This is equivalent to recursively partitioning the feature space (by vertical or horizontal lines in the 2-D case).
Decision Tree: how to build
• Intuitively, we try to divide the data so that the class labels are “purer” in the resulting subsets
• For each node, scan all the dimensions, try all possible thresholds, and pick the best one
– “Best” is measured by the weighted average Gini impurity of the subsets in the left and right children; the lower the better
Gini impurity of a node: G = 1 − Σ_k p_k², where p_k is the fraction of samples of class k in the node
The weight of each child is given by its number of samples
Note that G = 0 if all samples in a node belong to the same class
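A minimal NumPy sketch of this criterion (the helpers gini and split_score, and the tiny 1-D dataset, are hypothetical illustrations):

import numpy as np

def gini(labels):
    # Gini impurity G = 1 - sum_k p_k^2; G = 0 when the node is pure.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature, labels, threshold):
    # Weighted average Gini of the two children; lower is better.
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Scan candidate thresholds on a toy 1-D feature and keep the best.
x = np.array([0.1, 0.4, 0.5, 0.9])
y = np.array([0, 0, 1, 1])
best = min(((split_score(x, y, t), t) for t in x[:-1]), key=lambda s: s[0])
print(best)  # (0.0, 0.4): splitting at 0.4 gives two pure children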
Example
Decision Tree
• By recursively applying the above rule, we split nodes until the impurity cannot be reduced further, i.e., all the samples in the child nodes belong to the same category
• A decision tree is fast to train and evaluate, because each test uses only one feature
• A decision tree can easily overfit the training data
– It is unstable, since a mistake made at a parent node affects all the decisions of its children
– On its own, its classification accuracy is often poor
Random forest
• Idea: build T trees instead of one (that’s why it is called a forest), and introduce randomness for each tree
• Injecting randomness
– Instead of trying all features every time we create a new decision node, try only a random subset of the features
• Random forests work surprisingly well in many applications (see the sketch below)
• Why use decision trees?
– Easy to train
– Fast to evaluate
– Easy to inject randomness
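As a usage sketch with scikit-learn (the synthetic dataset and hyperparameters are illustrative; max_features controls the random feature subset tried at each split):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, split into train and test sets.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# T = 100 trees; "sqrt" tries sqrt(n_features) random features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))  # accuracy of the T-tree ensemble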
Outline
• Ensemble methods overview
• Random forest
• Bagging
• Boosting
Bagging
• A general method to train t diverse models
• Idea: given a dataset of n points:
– Sample, with replacement, n training examples from the dataset
– Train a model on the n samples
– Repeat t times
Sampling with replacement means the same sample can be drawn more than once, so each “n-sample” training set will be different (see the sketch below)
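A minimal sketch of the sampling step (NumPy; train_model is a hypothetical stand-in for whatever base learner is used):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    # Draw n indices with replacement: some points repeat, some are absent.
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

# The bagging loop, with a hypothetical train_model function:
# models = [train_model(*bootstrap_sample(X, y)) for _ in range(t)]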
Example of Bagging
• Bagging a set of decision trees
– Each tree is built with the standard decision-tree algorithm; no randomness is introduced in the tree-construction stage
• Decision boundaries of the individual trees
Example of Bagging
• Use T trees
Outline
• Ensemble methods overview
• Random forest
• Bagging
• Boosting
Boosting
• Bagging can be understood as a way of assigning a different weight to each sample
• Recall that, in general, a machine learning algorithm tries to minimize a loss function of the form L = Σ_i u_i · ℓ(f(x_i), y_i), where u_i is the weight of sample i
• In bagging, the weights are randomly assigned and can take only a limited number of values
If a sample is not selected, this is equivalent to setting its weight u_i to 0
Boosting
• Boosting follows a similar procedure to build an ensemble of predictors
– Build a model with the given predictor and the initial weights
– Adjust the weights
– Build a model with the updated weights
– Adjust the weights
– Repeat T times
• The weights are updated so that the classifier focuses on the samples most often misclassified by the previous rules
• The results are combined by weighted majority voting
Boosting
• Given a weak classifier with prediction accuracy slightly better than random, say 55% in the binary case, a boosting algorithm can provably construct a single classifier with very high accuracy, say 99%, given sufficient training data
– The weak classifier is also called a weak learner
• Boosting has many variants
– AdaBoost
– AnyBoost
– Gradient boosting
– …
A formal definition of AdaBoost (binary case)
• Given training samples (x_i, y_i), i = 1…n, with y_i ∈ {−1, +1}
• Initialize the weight for each sample: D_1(i) = 1/n
• For t = 1…T:
– Train a weak classifier h_t using the distribution D_t
– Compute the weighted error rate: ε_t = Σ_i D_t(i) · 1[h_t(x_i) ≠ y_i]
– Set α_t = ½ ln((1 − ε_t) / ε_t)
– Update the weights: D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t, where Z_t normalizes D_{t+1} to sum to 1
• The update decreases the weight of correctly classified samples (factor e^{−α_t}) and increases the weight of misclassified ones (factor e^{α_t}); the weight is updated based on the previous round’s error
• Output the final classifier: H(x) = sign(Σ_t α_t h_t(x))
Putting together: an example procedure
• Start training with the initial weights D_1
• Find a weak classifier h_1 minimizing the weighted error
• Calculate the weighted classification error ε_1 and compute α_1
• Update the weights to D_2 using α_1 and h_1
• Find a weak classifier h_2 minimizing the weighted error
• Calculate the weighted classification error ε_2 and compute α_2
• Repeat… (a code sketch of this loop follows)
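A minimal NumPy sketch of this loop (train_weak is a hypothetical routine that fits a weak classifier to the weighted data and returns it as a callable; it assumes y ∈ {−1, +1} and 0 < ε_t < 1):

import numpy as np

def adaboost(X, y, train_weak, T):
    n = len(y)
    D = np.full(n, 1.0 / n)                    # initial uniform weights
    ensemble = []
    for t in range(T):
        h = train_weak(X, y, D)                # weak learner on weighted data
        pred = h(X)
        eps = D[pred != y].sum()               # weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)  # classifier weight
        D *= np.exp(-alpha * y * pred)         # shrink correct, grow wrong
        D /= D.sum()                           # renormalize to sum to 1
        ensemble.append((alpha, h))
    # Final classifier: sign of the alpha-weighted vote.
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in ensemble))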
Toy example (figures: Round 1, Round 2, Round 3, and the final classifier)
Voted combination of classifiers
H(x) = sign(Σ_t α_t h_t(x)): the smaller a classifier’s weighted error, the larger its weight α_t
Example of weak learner
A common weak learner is the decision stump h(x) = sign(w_1 · x_k + w_0), a threshold test on a single feature dimension k
In practice, it is OK to fix w_1 = 1 or −1
Discussion
• How to train a decision stump for a given set of weights?
– Try different model parameters and pick the one giving the minimal weighted error
– An efficient method can be used to quickly search for the optimal parameter
– Example: suppose we choose the k-th dimension and w_1 = 1; how do we choose w_0?
Feature values at the k-th dimension: [0.5, 0.2, 0.3, −0.4, 0.1]
Weights for each sample: [0.2, 0.1, 0.3, 0.2, 0.2]
Discussion
– Example (continued): choosing the k-th dimension with w_1 = 1, how do we choose w_0?
Feature values at the k-th dimension: [0.5, 0.2, 0.3, −0.4, 0.1]
Weights for each sample: [0.2, 0.1, 0.3, 0.2, 0.2]
– Sort the feature values and try a threshold between each pair of consecutive values
– Calculate the weighted classification error for each threshold and pick the best one
Sorted values: [0.5, 0.3, 0.2, 0.1, −0.4]
Correspondingly sorted weights: [0.2, 0.3, 0.1, 0.2, 0.2]
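A minimal sketch of this search on the slide’s numbers (the labels are hypothetical, added so that the weighted error is well defined):

import numpy as np

def best_stump_threshold(values, weights, labels):
    # Sweep thresholds midway between consecutive sorted feature values
    # and return the one with minimal weighted error (labels in {-1, +1}).
    v_sorted = np.sort(values)[::-1]           # [0.5, 0.3, 0.2, 0.1, -0.4]
    best_err, best_t = np.inf, None
    for lo, hi in zip(v_sorted[1:], v_sorted[:-1]):
        t = (lo + hi) / 2                      # threshold between neighbours
        pred = np.where(values > t, 1, -1)     # stump with w1 = 1
        err = weights[pred != labels].sum()    # weighted classification error
        if err < best_err:
            best_err, best_t = err, t
    return best_t, best_err

values = np.array([0.5, 0.2, 0.3, -0.4, 0.1])
weights = np.array([0.2, 0.1, 0.3, 0.2, 0.2])
labels = np.array([1, -1, 1, -1, -1])          # hypothetical labels
print(best_stump_threshold(values, weights, labels))  # (0.25, 0.0)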
Questions
• Although the update equation is intuitively reasonable, why should we choose this particular one for AdaBoost?
• What is the objective function of AdaBoost?
Loss of AdaBoost
AdaBoost minimizes the exponential loss L = Σ_i exp(−y_i · f(x_i)), where f(x) = Σ_m α_m h_m(x) is the weighted combination of weak learners
Adding a weak learner
At round m, the model grows incrementally: f_m(x) = f_{m−1}(x) + α_m · h_m(x; θ_m)
Note that AdaBoost adds weak learners sequentially, so it can be seen as a greedy way of optimizing the loss function: fix the previously found classifiers and their weights, and optimize only the current round’s classifier and weight
The variables to be optimized at round m are α_m and θ_m (the parameters of h_m)
Substituting f_m into the loss:
L_m = Σ_i exp(−y_i (f_{m−1}(x_i) + α h(x_i))) = Σ_i w_i^(m) exp(−α y_i h(x_i)), where w_i^(m) = exp(−y_i f_{m−1}(x_i)) plays the role of the sample weight (up to normalization, the AdaBoost weight D_m(i))
Separating correctly and incorrectly classified samples:
L_m = e^{−α} Σ_{i: h(x_i)=y_i} w_i^(m) + e^{α} Σ_{i: h(x_i)≠y_i} w_i^(m)
Solving for α
Setting ∂L_m/∂α = 0 gives α_m = ½ ln((1 − ε_m)/ε_m), where ε_m = Σ_{i: h(x_i)≠y_i} w_i^(m) / Σ_i w_i^(m) is the weighted error rate: exactly the α used in the AdaBoost procedure
Solving for θ_m
For any fixed α > 0, minimizing L_m with respect to h (i.e., θ_m) is equivalent to minimizing the weighted error Σ_i w_i^(m) · 1[h(x_i) ≠ y_i]
Update W
w_i^(m+1) = exp(−y_i f_m(x_i)) = w_i^(m) · exp(−α_m y_i h_m(x_i))
Note that since we update W greedily, it is fine to divide W by a constant, or to normalize W so that it sums to 1
Putting the derivation together
• We have shown that AdaBoost can be seen as a greedy solution to the following optimization problem (the objective function of AdaBoost):
min_{α_m, h_m} Σ_i exp(−y_i Σ_m α_m h_m(x_i))
• It sequentially optimizes each α_m and h_m
Application of AdaBoost
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Application of AdaBoost: Face detection
• Face detection: scan the image with a bounding box and evaluate the decision at each candidate location
– Very time-consuming, so fast evaluation is needed
• Haar features
– Feature value = (the sum of pixel values in the black area) − (the sum of pixel values in the white area)
– Parameters:
• Type of Haar feature
• Location of the Haar template inside the bounding box
• The resulting weak classifiers are combined by AdaBoost
• Used in OpenCV
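For illustration, a naive NumPy sketch of one two-rectangle Haar feature (real detectors evaluate these in constant time using integral images, as described in the Viola-Jones paper; the 24x24 window size is just an example):

import numpy as np

def haar_two_rect_vertical(patch):
    # (sum of pixel values in the top/black half) minus
    # (sum of pixel values in the bottom/white half).
    h = patch.shape[0] // 2
    return patch[:h].sum() - patch[h:].sum()

patch = np.random.rand(24, 24)   # stand-in for a grayscale detection window
print(haar_two_rect_vertical(patch))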
Summary
• Ensemble learning
– Main idea
– Why it works
• Random forest
– Decision trees
– Injecting randomness into decision trees to make a random forest
• Bagging
• Boosting
– AdaBoost: the procedure
– Weak classifiers (decision stumps)
– The objective function of AdaBoost