
CS373 Data Mining and
Machine Learning

Lecture 8
Jean Honorio

Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

Today’s topics
• Ensembles and Boosting

- ensembles, relation to feature selection
- myopic forward-fitting and boosting
- understanding boosting

Ensembles
• An ensemble classifier combines a set of m “weak” base
  learners into a “strong” ensemble

      h_m(x) = \sum_{j=1}^{m} \alpha_j \, h(x; \theta_j)

  where the non-negative “votes” \alpha_j \ge 0 weight each base
  learner h(x; \theta_j), and h_m(x) is the ensemble discriminant
  function

• The base learners are typically simple “decision
  stumps”, i.e., linear classifiers based on one coordinate

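To make the decision-stump base learner concrete, here is a minimal Python sketch (not from the slides; the names fit_stump and stump_predict and the array-based interface are assumptions). It selects the coordinate k, threshold t, and sign s that minimize the weighted classification error, which is what the boosting algorithm later asks of the base learner.

import numpy as np

def fit_stump(X, y, w):
    """Fit a decision stump h(x; theta) = s * sign(x[k] - t).

    X : (n, d) feature matrix, y : (n,) labels in {-1, +1},
    w : (n,) non-negative example weights (e.g., the boosting weights).
    Returns theta = (k, t, s) minimizing the weighted classification error.
    """
    n, d = X.shape
    best, best_err = (0, 0.0, 1), np.inf
    for k in range(d):                       # try every coordinate
        vals = np.unique(X[:, k])            # candidate thresholds between sorted values
        thresholds = np.concatenate(([vals[0] - 1.0],
                                     (vals[:-1] + vals[1:]) / 2.0))
        for t in thresholds:
            pred = np.where(X[:, k] > t, 1, -1)
            for s in (1, -1):                # try both orientations
                err = np.sum(w * (s * pred != y))
                if err < best_err:
                    best_err, best = err, (k, t, s)
    return best

def stump_predict(X, theta):
    """Evaluate the stump on every row of X."""
    k, t, s = theta
    return s * np.where(X[:, k] > t, 1, -1)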

Ensemble learning
• We can view the ensemble learning problem as a
  coordinate selection problem

• The problem of finding the best “coordinates” to
include corresponds to finding the parameters of the
base learners (out of an uncountable set)

  (the “coordinates” are indexed by the base-learner parameters θ,
   and the “votes” α are the parameters to estimate)


Estimation criterion
• In principle, we can estimate the ensemble

      h_m(x) = \sum_{j=1}^{m} \alpha_j \, h(x; \theta_j)

  by minimizing the training loss

      J = \sum_{i=1}^{n} \mathrm{Loss}\big( y_i \, h_m(x_i) \big)

  with respect to the parameters in the ensemble, e.g., with the
  logistic loss \mathrm{Loss}(z) = \log(1 + e^{-z}) or the
  exponential loss \mathrm{Loss}(z) = e^{-z}

• This is a hard problem to solve jointly, but we can add
  base learners sequentially (cf. forward fitting)

      h_m(x) = h_{m-1}(x) + \alpha_m \, h(x; \theta_m),   with h_{m-1}(x) kept fixed
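As a sketch in code of the two losses and the training objective (the name training_loss is an assumption, and H stands for any function returning real-valued ensemble scores):

import numpy as np

# non-increasing losses of the margin z = y * h(x)
def logistic_loss(z):
    return np.log1p(np.exp(-z))       # log(1 + e^{-z})

def exponential_loss(z):
    return np.exp(-z)                 # e^{-z}

def training_loss(H, X, y, loss=exponential_loss):
    """Training loss sum_i Loss(y_i * H(x_i)) of an ensemble score function H."""
    return np.sum(loss(y * H(X)))     # y in {-1, +1}, H(X) real-valued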

Myopic forward-fitting
• Out of all candidate base learners h(x; θ), we wish to select the
  one that minimizes the derivative of the loss at α = 0 (i.e., has
  the most negative derivative at zero), keeping the ensemble built
  so far fixed; this derivative is a sum over fixed weights on the
  training examples
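A sketch of this criterion in formulas, assuming the notation h_{m-1} for the fixed ensemble built so far and W_{m-1}(i) for the example weights (consistent with the weight-update step later):

\left. \frac{d}{d\alpha} \sum_{i=1}^{n}
  \mathrm{Loss}\big( y_i h_{m-1}(x_i) + \alpha \, y_i h(x_i; \theta) \big)
\right|_{\alpha = 0}
= -\sum_{i=1}^{n} W_{m-1}(i) \, y_i \, h(x_i; \theta),
\qquad
W_{m-1}(i) = -\,\mathrm{Loss}'\big( y_i h_{m-1}(x_i) \big) \ge 0 .

Minimizing this derivative over θ is the same as maximizing the weighted agreement \sum_i W_{m-1}(i) y_i h(x_i; θ), i.e., minimizing the weighted training error.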


Weights on training examples
• The weight on each training example is positive because these
  derivatives (of a non-increasing loss) are negative, and it is
  large when the ensemble so far agrees with the opposite label,
  i.e., makes an error on that example

• We can always normalize the weights without affecting
  the choice of θ

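A minimal numpy sketch of how these example weights could be computed (the name example_weights is an assumption; H is the current ensemble score function, as in the earlier sketches):

import numpy as np

def example_weights(H, X, y, loss="exponential"):
    """Normalized weights W(i) proportional to -Loss'(y_i * H(x_i))."""
    z = y * H(X)                       # margins of the current ensemble
    if loss == "exponential":          # Loss(z) = e^{-z},            -Loss'(z) = e^{-z}
        w = np.exp(-z)
    else:                              # logistic: Loss(z) = log(1+e^{-z}), -Loss'(z) = 1/(1+e^{z})
        w = 1.0 / (1.0 + np.exp(z))
    return w / w.sum()                 # normalization does not change the best theta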


General boosting algorithm
• We use a myopic forward-fitting method to estimate the ensemble

  Step 0: start with the empty ensemble h_0(x) = 0 and uniform
          weights W_0(i) on the training examples

  Step 1: given the fixed weights W_{m-1}(i), find the base-learner
          parameters θ_m that minimize the weighted training error
  - weights correspond to derivatives of the loss function
  - the weight is large if the example is not classified correctly by
    the ensemble we have so far
  - finding the parameters that minimize the weighted error is
    easily solved, e.g., for decision stumps

  Step 2: keeping the ensemble h_{m-1}(x) and the new base learner
          h(x; θ_m) fixed, find the vote α_m ≥ 0 that minimizes the
          training loss
  - this is a 1-dimensional convex problem that can be solved
    easily

  Step 3: update the ensemble, h_m(x) = h_{m-1}(x) + α_m h(x; θ_m),
          and recompute the example weights W_m(i) from the
          derivatives of the loss at the new margins
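Putting the steps together, here is a minimal Python sketch of the loop under the exponential loss (an assumption; any non-increasing loss works). It reuses the hypothetical fit_stump and stump_predict helpers sketched earlier, and a coarse grid search stands in for an exact solver of the 1-dimensional convex problem in Step 2.

import numpy as np

def boost(X, y, rounds=50):
    """Myopic forward-fitting of decision stumps under the exponential
    loss Loss(z) = exp(-z).  Returns the list [(alpha_1, theta_1), ...]."""
    n = len(y)
    ensemble = []
    scores = np.zeros(n)                  # h_{m-1}(x_i) on the training set
    w = np.full(n, 1.0 / n)               # Step 0: uniform example weights

    for _ in range(rounds):
        theta = fit_stump(X, y, w)        # Step 1: minimize the weighted error
        pred = stump_predict(X, theta)

        # Step 2: the vote alpha_m solves a 1-d convex problem; a coarse
        # grid search stands in here for an exact line search
        alphas = np.linspace(0.0, 5.0, 501)
        objective = [np.exp(-y * (scores + a * pred)).sum() for a in alphas]
        alpha = float(alphas[int(np.argmin(objective))])
        if alpha <= 0.0:                  # the base learner no longer helps
            break

        # Step 3: update the ensemble and the normalized example weights
        ensemble.append((alpha, theta))
        scores = scores + alpha * pred
        w = np.exp(-y * scores)           # -Loss'(y_i h_m(x_i)) for the exponential loss
        w = w / w.sum()

    return ensemble

def ensemble_predict(X, ensemble):
    """Classify by the sign of the weighted vote of the base learners."""
    score = sum(a * stump_predict(X, th) for a, th in ensemble)
    return np.sign(score)

For example, ensemble_predict(X, boost(X, y)) labels the training points by the weighted vote; with the exponential loss, solving Step 2 exactly recovers the familiar AdaBoost vote α_m = ½ log((1 − ε_m)/ε_m), where ε_m is the weighted error of the selected stump.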

Boosting example
[figures]

Trying the same base learner again
• α_m is set to minimize the loss given the base learner h(x; θ_m)

• At the optimum value of α_m, the derivative of the loss with
  respect to α_m is zero; these derivatives are exactly the updated
  weights W_m(i) (up to normalization and overall sign)

• Thus the current base learner is useless at the next
  iteration (but may be chosen again later on)
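A sketch of this optimality condition in formulas, using the notation from the algorithm above:

0 = \left. \frac{d}{d\alpha} \sum_{i=1}^{n}
      \mathrm{Loss}\big( y_i h_{m-1}(x_i) + \alpha \, y_i h(x_i; \theta_m) \big)
    \right|_{\alpha = \alpha_m}
  = -\sum_{i=1}^{n} W_m(i) \, y_i \, h(x_i; \theta_m)

so the weighted agreement of h(x; θ_m) under the updated weights W_m(i) is exactly zero, and selecting the same base learner again at the next iteration cannot produce a negative derivative.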

Ensemble training error
• The boosting algorithm decreases the training loss

      J(h_m) = \sum_{i=1}^{n} \mathrm{Loss}\big( y_i \, h_m(x_i) \big)

  monotonically while the base learners remain effective
  against the weighted error (the derivative at α = 0 is not zero)

• For any non-increasing loss function, a misclassified example
  incurs a loss of at least Loss(0), so the (suitably scaled)
  training loss upper bounds the number of training mistakes

  Thus we have a monotonically decreasing upper bound
  on the 0-1 training error (classification error)
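In formulas, a sketch of the bound, assuming the loss is non-negative and scaled so that Loss(0) = 1 (true for the exponential loss):

\mathbf{1}\big[ y_i h_m(x_i) \le 0 \big] \;\le\; \mathrm{Loss}\big( y_i h_m(x_i) \big)
\quad\Longrightarrow\quad
\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big[ y_i h_m(x_i) \le 0 \big]
\;\le\; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\big( y_i h_m(x_i) \big)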

Ensemble training error
[plot: training loss and 0-1 training error versus boosting iteration]