
Ensemble Learning

Lingqiao Liu
University of Adelaide

Some slides are borrowed from Rama Ramakrishnan, Rob Schapire, and others.

Outlines


• Ensemble methods overview

• Random forest

• Bagging

• Boosting


What is Ensemble learning?

• According to the dictionary, the word ‘ensemble’ means


‘a group of items viewed as a whole rather
than individually’

• In machine learning, ensemble learning is a technique that
trains and combines multiple predictive
models
– The models can be anything

– The more diverse, the better.

– Combine the predictions (a sketch follows this list):
• Majority voting/weighted majority voting

• Average/weighted average

• …
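To make these combination rules concrete, here is a minimal Python sketch; the three classifiers' hard predictions, class probabilities, and per-model weights are made-up numbers purely for illustration.

import numpy as np

# Hard 0/1 predictions of three hypothetical classifiers on five test points
preds = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 1, 1]])
# Majority voting: predict 1 when more than half of the models vote 1
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

# Weighted average of predicted class-1 probabilities (weights sum to 1)
weights = np.array([0.5, 0.3, 0.2])
probs = np.array([[0.9, 0.2, 0.7, 0.8, 0.1],
                  [0.6, 0.7, 0.4, 0.9, 0.2],
                  [0.3, 0.4, 0.8, 0.6, 0.7]])
weighted = (weights @ probs > 0.5).astype(int)
print(majority, weighted)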

Ensemble learning is powerful

• The ensemble approach is often an essential strategy for
winning machine learning competitions!


Ensemble learning

• The key is to ensure the diversity of the predictive models

• Let’s consider an extreme case

– Binary classification problem

– We ensemble 1000 classifiers; the accuracy of each classifier is
51%, that is, slightly better than random guessing

– Those classifiers are statistically independent

– We use majority voting to make final prediction

• If most classifiers predict class 1, then the final prediction is class 1, and vice
versa


What is the classification accuracy of the
final system?

Ensemble learning example

• For the majority vote to be correct, more than 500 of the 1000
classifiers must give the correct prediction

• This probability can be calculated from the binomial
cumulative distribution

– What is a cumulative Binomial probability? – Cross Validated
(stackexchange.com)

P = binocdf(499, 1000, 1-0.51) ≈ 0.73
(the probability that at most 499 of the 1000 classifiers are wrong, i.e., at least 501 are correct)

• After this calculation, the final prediction accuracy goes
from 51% to over 70%!

– If each individual classifier had a slightly higher accuracy, say 55%,

the ensembled classifier would be correct almost all the time (about 99.9% accuracy)


https://stats.stackexchange.com/questions/108002/what-is-a-cumulative-binomial-probability
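The binocdf call above is MATLAB/Octave notation; an equivalent sanity check in Python with scipy (a sketch, not part of the original slides) looks like this:

from scipy.stats import binom

# Majority vote of 1000 independent classifiers, each 51% accurate, is correct
# when more than 500 are right, i.e. at most 499 of them are wrong.
p_wrong = 1 - 0.51
print(binom.cdf(499, 1000, p_wrong))      # same as MATLAB binocdf(499, 1000, 0.49), about 0.73

# With 55%-accurate base classifiers the ensemble is almost always right
print(binom.cdf(499, 1000, 1 - 0.55))     # about 0.999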

Ensemble learning

• We can see that, by combining many rather weak classifiers, the
ensemble classifier can significantly outperform each
individual classifier

• The key is the independence assumption

– In practice, ensuring independence is challenging

– We usually use different methods to make the individual
predictors as diverse/independent as possible.

• How to ensure the diversity of classifiers

– Vary the predictor/classifier, e.g., different types of classifiers, or
classifiers with different parameters

– Vary the dataset used to train each model, e.g., classifiers trained on
different subsets of the data



Random forest

• Basic idea: ensemble a set of simple classifiers, called
decision trees, into the final classifier

• Simple but effective approach

– Especially good for features with diverse meanings

– Efficient in training and testing


Decision Tree: An introduction

• A classic predictive model
– A decision tree is drawn upside

down with its root at the top.

– Each non-leaf node represents a
test; usually the test involves only one
feature

– Each leaf node represents the
decision/prediction outcome

– Example


As a very basic example, consider the Titanic data set and the task of predicting
whether a passenger survives. The model below uses 3
features/attributes/columns from the data set, namely sex, age
and sibsp (number of siblings or spouses aboard).

Decision Tree: an example


• At each node, the decision tree chooses one dimension
and one threshold to perform a test. The data are
divided according to the outcome of the test

• The splitting ends when the samples in a branch all
belong to the same class.

Decision Tree: an example


It is equivalent to recursively partitioning the
feature space (by vertical or horizontal
lines in the 2-D case)

Decision Tree: how to build

• Intuitively, we try to find a way to divide the data such that
the class labels are “purer” in the divided subsets

• For each node, scan all the dimensions, try all
possible thresholds, and pick the best one

– “Best” is measured by the weighted average Gini impurity of the subsets
in the left child and the right child; the lower the better (see the sketch below)

– Gini impurity of a node: G = 1 − Σ_k p_k^2, where p_k is the fraction of
samples of class k in that node


The weight of each child is proportional to the number of samples in that child

We can tell that G = 0 if all samples in a node belong to the same class
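As a minimal sketch of this split-selection rule (assuming a single numeric feature, binary labels, and made-up toy data), the weighted Gini score of each candidate threshold can be computed like this:

import numpy as np

def gini(labels):
    # Gini impurity G = 1 - sum_k p_k^2; G = 0 when all labels share one class
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(x, y, threshold):
    # Weighted average impurity of the two children; weights are the child sizes
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy 1-D feature and binary labels; try a threshold between every pair of values
x = np.array([0.1, 0.35, 0.4, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1])
xs = np.sort(np.unique(x))
candidates = (xs[:-1] + xs[1:]) / 2
best = min(candidates, key=lambda t: split_score(x, y, t))
print(best, split_score(x, y, best))      # best threshold ~0.6, weighted impurity 0.0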

Example


Decision Tree

• By recursively applying the above rule, we can split nodes
until the impurity cannot be reduced further, i.e., all the
samples in each child node belong to the same
category

• A decision tree is fast to train and evaluate, because each test
only uses one feature

• Decision Tree can easily overfit the training data

– Not stable, since a mistake made at a parent node will affect the
decisions of its children.

– Its classification accuracy on its own is often poor


Random forest

• Idea: build T trees instead of one (that is why it is called a
forest) and introduce randomness into each tree

• Injecting randomness

– Instead of trying all features every time we make a new decision
node, we only try a random subset of the features (see the sketch after this list).

• Random forest works surprisingly well in many
applications

• Why use decision trees?

– Easy to train

– Fast to evaluate

– Easy to inject randomness
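For example, the idea can be tried with scikit-learn's off-the-shelf RandomForestClassifier, where max_features="sqrt" restricts every split to a random subset of the features; the synthetic dataset below is only for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees; at each split only a random subset of about sqrt(20) features is tried
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))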



Bagging

• A general method to train t diverse models

• Idea:

• given a dataset of n points:
– Sample, with replacement, n training examples from the dataset.

– Train a model on the n samples.

– Repeat t times.


Sampling with replacement -> some samples may be repeated, so
each “n-sample” training set will be different
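A minimal sketch of this procedure, assuming numpy arrays, binary 0/1 labels, and decision trees as the base model; the helper names bagging_fit and bagging_predict are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, t=25):
    """Train t trees, each on a bootstrap sample drawn with replacement."""
    models = []
    n = len(X)
    for _ in range(t):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the individual trees (binary 0/1 labels assumed)."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)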

Example of Bagging

• Bagging a set of decision trees

– Each tree is built with the standard decision tree algorithm; no
randomness is introduced in the tree construction stage

• Decision boundary of individual trees


Example of Bagging

• Use T trees



Boosting

• Bagging can be understood as a way to assign each
sample different weights

• Recall that, in general, a machine learning algorithm tries to
minimize a loss function of the form Σ_i w_i ℓ(f(x_i), y_i), where w_i is the
weight assigned to the i-th training sample

• In bagging, the weights are randomly assigned and can
only take a limited number of values


If a sample is not selected, it is equivalent to setting its weight to 0

Boosting

• Boosting follows a similar procedure to build a bundle of
predictors

– Build a model of the given predictor type with the initial weights

– Adjust weights

– Build a model with the updated weights

– Adjust weights

– Repeat T times

• The weights are updated to make the classifier focus on
samples that are most often misclassified by the previous
rules

• The results are combined by weighted majority voting


Boosting

• Given a weak classifier with prediction accuracy slightly
better than random, say, 55% in the binary classification
case, a boosting algorithm can provably construct a single
combined classifier with very high accuracy, say, 99%, given
sufficient training data.

– The weak classifier is also called weak learner

• Has many variants

– Adaboost

– Anyboost

– Gradient boosting

– …


A formal definition of Adaboost (binary
case)


Weight for each sample: D_t(i), initialised as D_1(i) = 1/n for i = 1, …, n

Weighted error rate of the weak classifier h_t at round t: ε_t = Σ_i D_t(i) · 1[h_t(x_i) ≠ y_i]

A formal definition of Adaboost (binary
case)


The classifier weight is α_t = ½ ln((1 − ε_t)/ε_t), and the sample weights for the
next round are D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t normalises
the weights to sum to 1

A formal definition of Adaboost (binary
case)


Decrease: if h_t classifies sample i correctly (y_i h_t(x_i) = +1), its weight is multiplied by e^{−α_t}

Increase: if sample i is misclassified (y_i h_t(x_i) = −1), its weight is multiplied by e^{α_t}

The weights are updated based on the errors made in the previous round

Putting it together: an example procedure

• Start training with the initial weights D_1

• Find a weak classifier h_1 minimizing the weighted error under D_1

• Calculate the weighted classification error ε_1 and
calculate α_1

• Update the weights to D_2 by using α_1 and h_1

• Find a weak classifier h_2 minimizing the weighted error under D_2

• Calculate the weighted classification error ε_2 and
calculate α_2

• Repeat…
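A minimal sketch of this procedure, using scikit-learn depth-1 trees (decision stumps) as the weak learner and assuming labels in {-1, +1}; the names adaboost_fit and adaboost_predict are illustrative only:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Minimal AdaBoost sketch: y must be in {-1, +1}; stumps are the weak learners."""
    n = len(X)
    D = np.full(n, 1.0 / n)                              # initial weights D_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)                 # weak learner trained on weighted data
        pred = stump.predict(X)
        eps = D[pred != y].sum()                         # weighted error rate
        if eps >= 0.5:                                   # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D *= np.exp(-alpha * y * pred)                   # decrease weights of correct samples,
        D /= D.sum()                                     # increase misclassified ones, then normalise
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of stump predictions."""
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(agg)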


Toy example


Round 1


Round 2


Round 3


Final classifier


Voted combination of classifiers


The final classifier is the weighted vote H(x) = sign(Σ_t α_t h_t(x)): the smaller a weak
classifier's weighted error, the larger its weight α_t

Example of weak learner


A common weak learner is a decision stump, h(x) = sign(w_1 x_k + w_0), which thresholds a
single feature dimension k. In practice, it is OK to assume 𝑤1 = 1 or -1


Discussion

• How to train a decision stump for a set of given weights?

– Try different model parameters and pick the one giving the
minimal weighted error

– Efficient methods can be used to quickly search for the optimal
parameters

– Example: consider choosing the k-th dimension and 𝑤1= 1, how
to choose 𝑤0

Feature values at the k-th dimension [0.5, 0.2, 0.3,-0.4, 0.1]

Weights for each sample [ 0.2,0.1, 0.3, 0.2, 0.2]



Discussion

– Example: consider choosing the k-th dimension and 𝑤1= 1, how
to choose 𝑤0

Feature values at the k-th dimension [0.5, 0.2, 0.3,-0.4, 0.1]

Weights for each sample [ 0.2,0.1, 0.3, 0.2, 0.2]

– Rank the feature values and try a threshold between each pair of
consecutive values

– Calculate the weighted classification error for each threshold and
pick the best threshold

[0.5, 0.3, 0.2, 0.1, -0.4]

[0.2, 0.3, 0.1, 0.2, 0.2]
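A small sketch of this search using the numbers above; the slide does not give the class labels, so the y values below are hypothetical, added only to make the example runnable:

import numpy as np

# Feature values at the k-th dimension and the per-sample weights from the slide
x = np.array([0.5, 0.2, 0.3, -0.4, 0.1])
w = np.array([0.2, 0.1, 0.3, 0.2, 0.2])
y = np.array([1, -1, 1, -1, -1])                 # hypothetical labels (not on the slide)

x_sorted = np.sort(x)[::-1]                      # ranked values: [0.5, 0.3, 0.2, 0.1, -0.4]

best_err, best_thr = np.inf, None
for thr in (x_sorted[:-1] + x_sorted[1:]) / 2:   # thresholds between consecutive values
    pred = np.where(x > thr, 1, -1)              # stump with w1 = 1: predict +1 above thr
    err = w[pred != y].sum()                     # weighted classification error
    if err < best_err:
        best_err, best_thr = err, thr
print(best_thr, best_err)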


Questions

• Although it is intuitively reasonable, why should we choose
this particular weight-update equation in Adaboost?

• What is the objective function of Adaboost?


Loss of Adaboost

Adaboost minimises the exponential loss L = Σ_i exp(−y_i F(x_i)), where
F(x) = Σ_m α_m h(x; θ_m) is the weighted combination of the weak learners


Adding a weak learner


Note that Adaboost adds weak learners sequentially, so it can be seen as a
greedy way of optimizing the loss function: fix the previously found classifiers and
weights, and only optimize the current round's classifier and weight

Variables to be optimized at the m-th round: the classifier weight α_m and the
weak-learner parameters θ_m


Solving 𝛼


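The algebra on the original slide is shown as an image; the standard calculation it refers to, fixing the weak learner h(·; θ_m) and minimising the exponential loss over α at round m, is:

\[
\begin{aligned}
L(\alpha) &= \sum_i w_i^{(m)} e^{-\alpha y_i h(x_i;\theta_m)}
           = e^{-\alpha}\sum_{i:\,y_i = h(x_i;\theta_m)} w_i^{(m)}
           + e^{\alpha}\sum_{i:\,y_i \neq h(x_i;\theta_m)} w_i^{(m)}, \\
\frac{\partial L}{\partial \alpha} = 0
  \;&\Rightarrow\;
  \alpha_m = \frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m},
  \qquad
  \epsilon_m = \frac{\sum_{i:\,y_i \neq h(x_i;\theta_m)} w_i^{(m)}}{\sum_i w_i^{(m)}}.
\end{aligned}
\]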

Solving 𝜃𝑚


For a fixed positive α, choosing θ_m is equivalent to minimizing the weighted
classification error Σ_i w_i^(m) · 1[y_i ≠ h(x_i; θ_m)]

Update W


The weights for the next round are updated as w_i^(m+1) = w_i^(m) exp(−α_m y_i h(x_i; θ_m)).
Note that, since we update W greedily, it is fine to divide W by a constant or to
normalize W so that it sums to 1

Put the above derivation together

• We have shown that Adaboost can be seen as a greedy solution
to the following optimization problem (the objective
function of Adaboost):

min over {α_m, θ_m} of Σ_i exp(−y_i Σ_m α_m h(x_i; θ_m))

• It sequentially optimizes each α_m and θ_m


Application of Adaboost


P. Viola; M. Jones. Rapid object detection using a boosted
cascade of simple features. CVPR 2001

Application of Adaboost: Face detection

• Face detection: scan the image
with a bounding box and
evaluate the decision at each
candidate location

– Very time consuming, so fast evaluation is needed

• Haar features
– The feature value is the sum of pixel values in the
black area minus the sum of pixel
values in the white area

– Parameters:
• Type of Haar features

• Location of Haar template inside
the bounding box


The Haar-feature weak classifiers are combined by Adaboost;
this detector is used in OpenCV
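For reference, a minimal sketch of running OpenCV's pretrained boosted Haar cascade for faces; the input file name photo.jpg is hypothetical:

import cv2

# Load the boosted Haar cascade for frontal faces that ships with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("photo.jpg")                    # hypothetical input image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan the image with a sliding window at multiple scales
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", img)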

Summary

• Ensemble Learning

– Main idea.

– Why it works

• Random forest

– Decision tree

– Injecting randomness into decision trees to make a random forest

• Bagging

• Boosting

– Adaboost: procedure

– Weak classifier

– The objective function of Adaboost
