CS373 Data Mining and
Machine Learning
Lecture 8
Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
Today’s topics
• Ensembles and Boosting
- ensembles, relation to feature selection
- myopic forward-fitting and boosting
- understanding boosting
Ensembles
• An ensemble classifier combines a set of m “weak” base
learners into a “strong” ensemble
• The base learners are typically simple “decision
stumps”, i.e., linear classifiers based on one coordinate
• Ensemble discriminant function:
h_m(x) = \alpha_1 h(x; \theta_1) + \ldots + \alpha_m h(x; \theta_m) = \sum_{j=1}^{m} \alpha_j h(x; \theta_j)
where the \alpha_j \ge 0 are the non-negative “votes” and each h(x; \theta_j) is a base learner
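As a concrete illustration (not from the slides), here is a minimal NumPy sketch of a decision-stump base learner and of the ensemble discriminant function; the names stump_predict and ensemble_predict are hypothetical.

```python
import numpy as np

def stump_predict(X, k, threshold, sign):
    """Decision stump: classify by thresholding a single coordinate k.

    Returns +1 where sign * (X[:, k] - threshold) > 0, else -1.
    """
    return np.where(sign * (X[:, k] - threshold) > 0, 1.0, -1.0)

def ensemble_predict(X, stumps, alphas):
    """Ensemble discriminant: h_m(x) = sum_j alpha_j * h(x; theta_j)."""
    scores = np.zeros(X.shape[0])
    for (k, threshold, sign), alpha in zip(stumps, alphas):
        scores += alpha * stump_predict(X, k, threshold, sign)
    return scores  # the predicted label is the sign of the score
```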
Ensemble learning
• We can view the ensemble learning problem as a
coordinate selection problem
• The problem of finding the best “coordinates” to
include corresponds to finding the parameters of the
base learners (out of an uncountable set)
- in h_m(x) = \sum_j \alpha_j h(x; \theta_j), the \alpha_j are the parameters
or “votes”, and the base-learner outputs h(x; \theta) are the
“coordinates”, indexed by \theta
Estimation criterion
• In principle, we can estimate the ensemble
h_m(x) = \sum_{j=1}^{m} \alpha_j h(x; \theta_j)
by minimizing the training loss
J = \frac{1}{n} \sum_{i=1}^{n} \text{Loss}(y_i h_m(x_i))
with respect to the parameters \{\alpha_j, \theta_j\} in the ensemble, where
Loss is, e.g., the logistic loss \text{Loss}(z) = \log(1 + e^{-z}) or the
exponential loss \text{Loss}(z) = e^{-z}
• This is a hard problem to solve jointly but we can add
base learners sequentially (cf. forward fitting): keep the current
ensemble h_{m-1}(x) fixed and optimize only the new term \alpha_m h(x; \theta_m)
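A small sketch of this criterion (assuming labels y_i in {-1, +1} and the discriminant values h_m(x_i) collected in an array called scores; the function names are mine, not from the slides):

```python
import numpy as np

def exponential_loss(margins):
    # Loss(z) = exp(-z), where z = y_i * h_m(x_i) is the margin
    return np.exp(-margins)

def logistic_loss(margins):
    # Loss(z) = log(1 + exp(-z)), computed stably via logaddexp
    return np.logaddexp(0.0, -margins)

def training_loss(y, scores, loss=exponential_loss):
    # J = (1/n) * sum_i Loss(y_i * h_m(x_i))
    return np.mean(loss(y * scores))
```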
Myopic forward-fitting
• Out of all base learners h(x; \theta), we wish to select the one that
minimizes the derivative of the loss at \alpha = 0 (has the most negative
derivative at zero), with the current ensemble h_{m-1}(x) held fixed:
\frac{d}{d\alpha} \sum_{i=1}^{n} \text{Loss}\big(y_i h_{m-1}(x_i) + \alpha\, y_i h(x_i; \theta)\big)\Big|_{\alpha=0} = -\sum_{i=1}^{n} W_i^{(m-1)} y_i h(x_i; \theta)
where W_i^{(m-1)} = -\text{Loss}'(y_i h_{m-1}(x_i)) are fixed weights on the
training examples
Weights on training examples
• Each weight W_i^{(m-1)} = -\text{Loss}'(y_i h_{m-1}(x_i)) is positive,
since these derivatives are negative for a decreasing loss function
• Minimizing the derivative -\sum_i W_i^{(m-1)} y_i h(x_i; \theta) is
equivalent to minimizing the weighted error
\sum_i W_i^{(m-1)} [[\, y_i \ne h(x_i; \theta)\, ]] (agreement with the opposite label)
• We can always normalize the weights without affecting
the choice of \hat{\theta}
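The following sketch (assuming the exponential loss and ±1-valued base learners; helper names are hypothetical) makes the weight computation and the derivative/weighted-error equivalence concrete:

```python
import numpy as np

def example_weights(y, scores):
    """W_i proportional to -Loss'(y_i h_{m-1}(x_i)); for the exponential
    loss, -Loss'(z) = exp(-z). Normalization does not change the best stump."""
    w = np.exp(-y * scores)
    return w / w.sum()

def loss_derivative_at_zero(w, y, pred):
    """dJ/d(alpha) at alpha = 0 for a candidate base learner with outputs pred."""
    return -np.sum(w * y * pred)

def weighted_error(w, y, pred):
    """sum_i W_i * 1[y_i != h(x_i; theta)] -- agreement with the opposite label."""
    return np.sum(w * (pred != y))

# With normalized weights and +/-1 predictions the two criteria coincide:
# -sum_i W_i y_i h_i = 2 * weighted_error - 1, so the base learner with the
# most negative derivative is exactly the one with the smallest weighted error.
```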
General boosting algorithm
• We use a myopic forward-fitting method to estimate
h_m(x) = \sum_{j=1}^{m} \alpha_j h(x; \theta_j)

Step 0: start with the empty ensemble, h_0(x) = 0

Step 1: with h_{m-1}(x) fixed, compute the weights
W_i^{(m-1)} = -\text{Loss}'(y_i h_{m-1}(x_i)) and find the base learner
\hat{\theta}_m that minimizes the weighted error \sum_i W_i^{(m-1)} [[\, y_i \ne h(x_i; \theta)\, ]]
- weights correspond to derivatives of the loss function
- the weight is large if the example is not classified correctly by
the ensemble we have so far
- finding the parameters that minimize the weighted error is
easily solved, e.g., for decision stumps

Step 2: with h_{m-1}(x) and \hat{\theta}_m fixed, set the vote \hat{\alpha}_m to minimize
\sum_{i=1}^{n} \text{Loss}\big(y_i h_{m-1}(x_i) + \alpha\, y_i h(x_i; \hat{\theta}_m)\big)
- this is a 1-dimensional convex problem that can be solved
easily

Step 3: add the new base learner to the ensemble,
h_m(x) = h_{m-1}(x) + \hat{\alpha}_m h(x; \hat{\theta}_m), and repeat from Step 1
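Putting the steps together, here is a minimal self-contained sketch with decision stumps and the exponential loss, for which Step 2 has the familiar closed form \hat{\alpha} = \frac{1}{2}\log\frac{1-\epsilon}{\epsilon} when the weights are normalized. The names fit_stump and boost are mine, and the exhaustive stump search is only meant to be illustrative.

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the stump (coordinate, threshold, sign) with the smallest
    weighted error sum_i w_i * 1[y_i != h(x_i; theta)]."""
    best, best_err = None, np.inf
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k]):
            for s in (+1.0, -1.0):
                pred = np.where(s * (X[:, k] - t) > 0, 1.0, -1.0)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (k, t, s), err
    return best, best_err

def boost(X, y, rounds=50):
    """Myopic forward-fitting with stumps and the exponential loss."""
    scores = np.zeros(X.shape[0])      # Step 0: empty ensemble, h_0(x) = 0
    stumps, alphas = [], []
    for _ in range(rounds):
        w = np.exp(-y * scores)        # Step 1: weights ~ -Loss'(y_i h_{m-1}(x_i))
        w /= w.sum()
        (k, t, s), err = fit_stump(X, y, w)
        if err <= 0 or err >= 0.5:     # stop: perfect stump (alpha diverges)
            break                      # or no better than chance
        alpha = 0.5 * np.log((1 - err) / err)   # Step 2: optimal vote
        scores += alpha * np.where(s * (X[:, k] - t) > 0, 1.0, -1.0)  # Step 3
        stumps.append((k, t, s))
        alphas.append(alpha)
    return stumps, alphas
```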
Boosting example
Trying the same base learner again
• \hat{\alpha}_m is set to minimize the loss given the new base learner h(x; \hat{\theta}_m)
• At the optimum value, the derivative of the loss with respect to \alpha is zero:
\sum_{i=1}^{n} W_i^{(m)} y_i h(x_i; \hat{\theta}_m) = 0
where W_i^{(m)} = -\text{Loss}'(y_i h_m(x_i)) are the updated weights (up to
normalization and overall sign)
• Thus the current base learner is useless at the next
iteration (but may be chosen again later on)
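A toy numerical check of this fact (synthetic numbers, exponential loss; scipy.optimize.minimize_scalar performs the exact 1-D line search): after the optimal \alpha is chosen, the updated weights are “orthogonal” to the just-added base learner.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=20)            # synthetic labels
prev_scores = rng.normal(size=20)               # h_{m-1}(x_i), made up
pred = rng.choice([-1.0, 1.0], size=20)         # outputs of the new base learner

# Exact line search over alpha for the exponential loss
loss = lambda a: np.sum(np.exp(-y * (prev_scores + a * pred)))
alpha = minimize_scalar(loss).x

w = np.exp(-y * (prev_scores + alpha * pred))   # updated (unnormalized) weights
print(np.sum(w * y * pred))   # ~ 0: the stump looks "useless" at the next round
```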
Ensemble training error
• The boosting algorithm decreases the training loss
\frac{1}{n} \sum_{i=1}^{n} \text{Loss}(y_i h_m(x_i))
monotonically while the base learners remain effective
against the weighted error (derivative is not zero)
• For any non-increasing loss function,
[[\, y_i h_m(x_i) \le 0\, ]] \le \text{Loss}(y_i h_m(x_i)) / \text{Loss}(0)
(for the exponential loss \text{Loss}(0) = 1), so that
\frac{1}{n} \sum_{i=1}^{n} [[\, y_i h_m(x_i) \le 0\, ]] \le \frac{1}{n} \sum_{i=1}^{n} \frac{\text{Loss}(y_i h_m(x_i))}{\text{Loss}(0)}
Thus we have a monotonically decreasing upper bound
on the 0-1 training error (classification error)
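A one-line numerical illustration of the bound for the exponential loss (the margin values below are made up):

```python
import numpy as np

# exp(-z) >= 1 whenever z <= 0, so the 0-1 loss 1[y_i h_m(x_i) <= 0]
# is bounded above by the exponential loss at every training example.
margins = np.array([-2.0, -0.1, 0.0, 0.3, 1.5])   # hypothetical y_i * h_m(x_i)
zero_one = (margins <= 0).astype(float)
exp_loss = np.exp(-margins)
assert np.all(zero_one <= exp_loss)
print(zero_one.mean(), "<=", exp_loss.mean())     # 0-1 error <= training loss
```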
Ensemble training error
[Figure: training loss and 0-1 training error as a function of the boosting iteration]