Machine learning lecture slides

COMS 4771 Fall 2020

Classification IV: Ensemble methods

Overview

- Bagging and Random Forests
- Boosting
- Margins and over-fitting

Motivation

- Recall model averaging: given T real-valued predictors f̂^(1), ..., f̂^(T), form the ensemble predictor f̂_avg:

  f̂_avg(x) := (1/T) ∑_{t=1}^T f̂^(t)(x).

- Mean squared error:

  mse(f̂_avg) = (1/T) ∑_{t=1}^T mse(f̂^(t)) − (1/T) ∑_{t=1}^T E[(f̂_avg(X) − f̂^(t)(X))²].

- For classification, the analogue is the majority-vote classifier f̂_maj:

  f̂_maj(x) := +1 if ∑_{t=1}^T f̂^(t)(x) > 0, and −1 otherwise.

  (f̂_avg is the scoring function used for f̂_maj.)
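The averaging identity above can be sanity-checked numerically. The following small script is an illustration only (not from the slides): it builds T synthetic predictors, with `preds[t]` playing the role of f̂^(t) evaluated on a sample of points, and verifies that the two sides agree.

```python
# Numerical sanity check of the model-averaging identity (illustration only):
# mse(f_avg) = (1/T) sum_t mse(f_t) - (1/T) sum_t E[(f_avg(X) - f_t(X))^2],
# with expectations replaced by averages over a synthetic sample.
import numpy as np

rng = np.random.default_rng(0)
n, T = 1000, 5
y = rng.normal(size=n)                      # targets
preds = y + rng.normal(size=(T, n))         # T noisy predictors evaluated at n points
f_avg = preds.mean(axis=0)

lhs = np.mean((f_avg - y) ** 2)
rhs = np.mean((preds - y) ** 2) - np.mean((preds - f_avg) ** 2)
print(np.isclose(lhs, rhs))                 # True: the identity holds exactly
```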

How to get classifiers to combine?

- Starting anew: how should we train classifiers to combine by majority vote?
- Recall: model averaging works well when
  - all f̂^(t) have similar MSEs, and
  - all f̂^(t) predict very differently from each other.
- For the first point: use the same learning algorithm for all f̂^(t).
- For the second point: the learning algorithm should have "high variance".


Using the same learning algorithm multiple times I

- Running the same learning algorithm T times on the same data set yields T identical classifiers – not helpful!
- Instead, we want to run the same learning algorithm on T separate data sets S_1, ..., S_T.

Figure 1: What we want is T data sets S_1, ..., S_T, each drawn from P.

Using the same learning algorithm multiple times II

- Invoke the plug-in principle:
  - In the IID model, regard the empirical distribution P_n on the training examples as an estimate of the example distribution P.
  - Draw T independent data sets from P_n, and run the learning algorithm on each data set.
  - This is called bootstrap resampling.

Figure 2: What we can get is T data sets S*_1, ..., S*_T drawn from P_n (our estimate of P).

Bagging

- Bagging: bootstrap aggregating (Breiman, 1994)
- Given training data (x_1, y_1), ..., (x_n, y_n) ∈ X × {−1, +1}
- For t = 1, ..., T:
  - Randomly draw n examples with replacement from the training data: S*_t := ((x_i^(t), y_i^(t)))_{i=1}^n (a bootstrap sample)
  - Run the learning algorithm on S*_t to get classifier f̂^(t)
- Return the majority-vote classifier over f̂^(1), ..., f̂^(T)
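A minimal sketch of the procedure above (illustration only, not the course's reference code). It assumes labels in {−1, +1} and uses scikit-learn decision trees as the base learner; `bagging_fit` and `bagging_predict` are made-up helper names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=100, seed=0):
    """Train T classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # n draws with replacement
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, X):
    """Majority vote: sign of the sum of the individual {-1,+1} predictions."""
    votes = np.sum([clf.predict(X) for clf in classifiers], axis=0)
    return np.where(votes > 0, +1, -1)
```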

Aside: Sampling with replacement

- Pick n individuals from a population of size n, with replacement.
- What is the chance that a given individual is never picked? It is (1 − 1/n)^n, which tends to 1/e ≈ 0.37 as n grows.
- Implications for bagging:
  - Each bootstrap sample contains about 63% of the distinct training examples.
  - The remaining ~37% can be used to estimate the error rate of the classifier trained on that bootstrap sample.
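A quick numerical check of the claim above (illustration only): the probability that a fixed example is missed by a bootstrap sample of size n is (1 − 1/n)^n, which approaches 1/e ≈ 0.368.

```python
import math

for n in (10, 100, 10_000):
    print(n, (1 - 1 / n) ** n)   # probability a given example is never drawn
print("1/e =", math.exp(-1))     # limiting value, about 0.368
```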

Random forests

- Random Forests (Breiman, 2001): bagging with a randomized variant of the decision tree learning algorithm.
- Each time we need to choose a split, pick a random subset of roughly √d of the d features, and only choose a split from among those features.
- Main idea: the trees may use very different features, so they are less likely to make mistakes in the same way.
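For illustration, scikit-learn's random forest exposes this feature-subsampling rule via `max_features`; the snippet below, on synthetic data, is just an example usage and not part of the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# max_features="sqrt" considers roughly sqrt(d) features at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print("training accuracy:", forest.score(X, y))
```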

Classifiers with independent errors

- Say we have T binary classifiers f̂^(1), ..., f̂^(T).
- Assume that on a given x, each provides an incorrect prediction with probability 0.4:

  Pr(f̂^(t)(X) ≠ Y | X = x) = 0.4.

  Moreover, assume the error events are independent.
- Use the majority-vote classifier f̂_maj.
- What is the chance that more than half of the classifiers give the incorrect prediction? (A binomial tail computation; see the sketch after this slide.)
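A sketch answering the question numerically (not from the slides): the majority vote is wrong exactly when more than T/2 of the independent classifiers err, which is a binomial tail probability.

```python
from math import comb

def p_majority_wrong(T, p=0.4):
    """P(more than half of T independent classifiers err), for odd T."""
    return sum(comb(T, k) * p**k * (1 - p) ** (T - k) for k in range(T // 2 + 1, T + 1))

for T in (1, 5, 25, 101):
    print(T, round(p_majority_wrong(T), 4))   # shrinks rapidly as T grows
```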

Coping with non-independent errors

- Classifier errors are unlikely to be independent, so we do something else to benefit from the majority vote.
- Change how we obtain the individual classifiers:
  - Adaptively choose the classifiers.
  - Re-weight the training data.
- Start with the uniform distribution over training examples.
- Loop:
  - Use the learning algorithm to get a new classifier for the ensemble.
  - Re-weight the training examples to emphasize the examples on which the new classifier is incorrect.


Adaptive Boosting

- AdaBoost (Freund and Schapire, 1997)
- Training data (x_1, y_1), ..., (x_n, y_n) ∈ X × {−1, +1}
- Initialize D_1(i) = 1/n for all i = 1, ..., n
- For t = 1, ..., T:
  - Run the learning algorithm on the D_t-weighted training examples to get classifier f^(t)
  - Update the weights:

    z_t := ∑_{i=1}^n D_t(i) · y_i f^(t)(x_i) ∈ [−1, +1]

    α_t := (1/2) ln((1 + z_t)/(1 − z_t)) ∈ ℝ

    D_{t+1}(i) := D_t(i) exp(−α_t · y_i f^(t)(x_i)) / Z_t for i = 1, ..., n,

    where Z_t is the normalizer that makes D_{t+1} a probability distribution.
- Final classifier: f̂(x) = sign(∑_{t=1}^T α_t · f^(t)(x))
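A compact sketch of the updates above (illustration only, not the course's reference implementation). Labels are assumed to be in {−1, +1}, depth-1 scikit-learn trees (decision stumps) serve as the weak learner, and `adaboost_fit` / `adaboost_predict` are made-up names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                     # D_1(i) = 1/n
    classifiers, alphas = [], []
    for _ in range(T):
        # Weak learner trained on the D_t-weighted examples (depth-1 tree = stump).
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        z = np.clip(np.sum(D * y * pred), -1 + 1e-12, 1 - 1e-12)   # z_t
        alpha = 0.5 * np.log((1 + z) / (1 - z))                    # alpha_t
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()                                            # divide by Z_t
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.where(scores > 0, +1, -1)
```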

Plot of α_t as a function of z_t

α_t = (1/2) ln((1 + z_t)/(1 − z_t)) ∈ ℝ

Figure 3: α_t as a function of z_t (z_t ranges over [−1, 1]; α_t = 0 at z_t = 0 and diverges to ±∞ as z_t → ±1).

Example: AdaBoost with decision stumps

- (From Figures 1.1 and 2.2 of the Schapire & Freund text.)
- Use the "decision stump" learning algorithm with AdaBoost.
- Each f^(t) thresholds a single coordinate of x: for some coordinate j and threshold θ,

  f^(t)(x) = +1 if x_j > θ and −1 if x_j ≤ θ,   or   f^(t)(x) = −1 if x_j > θ and +1 if x_j ≤ θ.

- It is straightforward to handle the importance weights D_t(i) in the decision tree learning algorithm (see the sketch after this slide).

Figure 4: Training data for example execution
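A sketch of how importance weights D_t(i) can be handled by a stump learner (illustration only; `fit_stump` and `predict_stump` are made-up names): exhaustively search over (coordinate, threshold, sign) and keep the choice with the smallest D-weighted error.

```python
import numpy as np

def fit_stump(X, y, D):
    n, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):                                   # coordinate
        for theta in np.unique(X[:, j]):                 # threshold
            for sign in (+1, -1):                        # orientation
                pred = np.where(X[:, j] > theta, sign, -sign)
                err = D[pred != y].sum()                 # D-weighted error
                if err < best_err:
                    best_err, best = err, (j, theta, sign)
    return best

def predict_stump(stump, X):
    j, theta, sign = stump
    return np.where(X[:, j] > theta, sign, -sign)
```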

Example execution of AdaBoost (slides I–VIII)

(Figures from the example execution are omitted.)

Round 1: f^(1) with z_1 = 0.40, α_1 = 0.42
Round 2: f^(2) with z_2 = 0.58, α_2 = 0.65
Round 3: f^(3) with z_3 = 0.72, α_3 = 0.92

Final classifier:

f̂(x) = sign(0.42 f^(1)(x) + 0.65 f^(2)(x) + 0.92 f^(3)(x))

Training error rate of final classifier

- Let γ_t := z_t/2, the advantage over random guessing achieved by f^(t).
- Theorem: the training error rate of the final classifier satisfies

  err(f̂, ((x_i, y_i))_{i=1}^n) ≤ exp(−2 ∑_{t=1}^T γ_t²) = exp(−2 γ̄² T),

  where γ̄² := (1/T) ∑_{t=1}^T γ_t².

- AdaBoost is "adaptive":
  - Some γ_t can be small (even negative); only the average γ̄² matters.
- What about the true error rate in the IID model?
  - The final classifier is a very complex model as T becomes large!
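One concrete consequence of the bound (a back-of-the-envelope sketch, assuming every round achieves advantage γ_t ≥ γ): the bound drops below 1/n, and hence the training error is exactly zero, once T > ln(n)/(2γ²).

```python
import math

def rounds_until_zero_training_error(n, gamma):
    # exp(-2 * gamma^2 * T) < 1/n  <=>  T > ln(n) / (2 * gamma^2);
    # training error is a multiple of 1/n, so below 1/n means exactly 0.
    return math.ceil(math.log(n) / (2 * gamma ** 2))

print(rounds_until_zero_training_error(n=1000, gamma=0.1))   # 346 rounds suffice
```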


Surprising behavior of boosting

- AdaBoost + C4.5 decision tree learning on the "letters" data set.

Figure 5: Figure 1.7 from the Schapire & Freund text: training and test percent error rates (0–20%) versus the number of boosting rounds (10 to 1000) for AdaBoost with C4.5 as the base learner. The top and bottom curves are AdaBoost's test and training error, respectively; the top horizontal line is the test error of C4.5 alone, and the bottom horizontal line is AdaBoost's final test error after 1000 rounds.

- Training error rate is zero after five iterations.
- Test error rate continues to decrease, even up to 1000 iterations.

Margins theory

- Look at the (normalized) scoring function of the final classifier:

  ĥ(x) := (∑_{t=1}^T α_t · f^(t)(x)) / (∑_{t=1}^T |α_t|) ∈ [−1, +1].

- Say y · ĥ(x) is the margin achieved by ĥ on example (x, y).
- Theorem (Schapire, Freund, Bartlett, and Lee, 1998):
  - Larger margins on training examples ⇒ better resistance to over-fitting in the IID model.
  - AdaBoost tends to increase the margins on training examples.
  - (Similar to, but not the same as, SVM margins.)
- On the "letters" data set:

                          T = 5    T = 100    T = 1000
    training error rate    0.0%      0.0%       0.0%
    test error rate        8.4%      3.3%       3.1%
    % margins ≤ 0.5        7.7%      0.0%       0.0%
    min margin achieved    0.14      0.52       0.55
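A sketch of how the statistics in the table can be computed from a trained ensemble (illustration only; it reuses the hypothetical `adaboost_fit` output from the earlier sketch and assumes labels in {−1, +1}).

```python
import numpy as np

def normalized_margins(classifiers, alphas, X, y):
    alphas = np.asarray(alphas)
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    h = scores / np.abs(alphas).sum()        # normalized score in [-1, +1]
    return y * h                             # margin of each example

# margins = normalized_margins(classifiers, alphas, X_train, y_train)
# print(margins.min())                 # "min margin achieved"
# print(np.mean(margins <= 0.5))       # "% margins <= 0.5"
```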

