Machine learning lecture slides
COMS 4771 Fall 2020
Classification IV: Ensemble methods
Overview
Bagging and Random Forests
Boosting
Margins and over-fitting
Motivation
Recall model averaging: given T real-valued predictors \hat{f}^{(1)}, \dots, \hat{f}^{(T)}, form the ensemble predictor
\[
\hat{f}_{\mathrm{avg}}(x) := \frac{1}{T} \sum_{t=1}^{T} \hat{f}^{(t)}(x).
\]
Mean squared error:
\[
\mathrm{mse}(\hat{f}_{\mathrm{avg}}) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{mse}(\hat{f}^{(t)}) - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[(\hat{f}_{\mathrm{avg}}(X) - \hat{f}^{(t)}(X))^2\big].
\]
For classification, the analogue is the majority-vote classifier \hat{f}_{\mathrm{maj}}:
\[
\hat{f}_{\mathrm{maj}}(x) := \begin{cases} +1 & \text{if } \sum_{t=1}^{T} \hat{f}^{(t)}(x) > 0 \\ -1 & \text{otherwise} \end{cases}
\]
(\hat{f}_{\mathrm{avg}} is the scoring function used for \hat{f}_{\mathrm{maj}}.)
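A minimal sketch of these two combination rules, assuming each predictor is a callable mapping an (n, d) NumPy array of inputs to a length-n array of predictions (names here are placeholders):

```python
import numpy as np

def f_avg(predictors, X):
    # Averaged scoring function: (1/T) * sum_t f^(t)(x)
    return np.mean([f(X) for f in predictors], axis=0)

def f_maj(predictors, X):
    # Majority vote over {-1, +1} predictions: +1 iff the average vote is positive
    return np.where(f_avg(predictors, X) > 0, +1, -1)
```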
How to get classifiers to combine?
Starting anew; how should we train classifiers to combine in majority-vote?
Recall: model averaging works well when
all \hat{f}^{(t)} have similar MSEs, and
all \hat{f}^{(t)} predict very differently from each other
To the first point, use the same learning algorithm for all \hat{f}^{(t)}.
To the second point, the learning algorithm should have "high variance".
Using the same learning algorithm multiple times I
Running same learning algorithm T times on the same data set yields T identical classifiers – not helpful!
Instead, want to run same learning algorithm on T separate data sets.
Figure 1: What we want is T data sets S_1, \dots, S_T drawn from P.
Using the same learning algorithm multiple times II
Invoke plug-in principle
In the IID model, regard the empirical distribution P_n on the training examples as an estimate of the example distribution P.
Draw T independent data sets from P_n, and run the learning algorithm on each data set.
This is called bootstrap resampling.
Figure 2: What we can get is T data sets S_1^*, \dots, S_T^* drawn from P_n.
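A minimal sketch of bootstrap resampling, assuming the training examples are stored as NumPy arrays X (shape (n, d)) and y (length n); the names are placeholders:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # One draw of n examples from the empirical distribution P_n:
    # sample n indices uniformly at random, with replacement.
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

# T independent bootstrap data sets S_1^*, ..., S_T^*:
# rng = np.random.default_rng(0)
# samples = [bootstrap_sample(X, y, rng) for _ in range(T)]
```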
Bagging
Bagging: bootstrap aggregating (Breiman, 1994)
Given training data (x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \{-1, +1\}
For t = 1, \dots, T:
  Randomly draw n examples with replacement from the training data: S_t^* := ((x_i^{(t)}, y_i^{(t)}))_{i=1}^{n} (bootstrap sample)
  Run the learning algorithm on S_t^* to get classifier \hat{f}^{(t)}
Return the majority-vote classifier over \hat{f}^{(1)}, \dots, \hat{f}^{(T)}
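A sketch of the bagging loop, assuming a generic routine learn(X, y) that returns a classifier (a callable with outputs in {-1, +1}); the routine name is a placeholder for whatever base learner is used:

```python
import numpy as np

def bagging(X, y, learn, T, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)           # bootstrap sample S_t^*
        classifiers.append(learn(X[idx], y[idx]))  # f_hat^(t)

    def predict(X_new):
        votes = np.sum([f(X_new) for f in classifiers], axis=0)
        return np.where(votes > 0, +1, -1)         # majority vote
    return predict
```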
Aside: Sampling with replacement
Pick n individuals from a population of size n with replacement.
What is the chance that a given individual is not picked? It is (1 - 1/n)^n \approx e^{-1} \approx 0.37 for large n.
Implications for bagging:
Each bootstrap sample contains about 63% of the training
examples
Remaining 37% can be used to estimate error rate of classifier
trained on bootstrap sample
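A quick numerical check of these figures: the chance that a fixed example is missed by a bootstrap sample is (1 - 1/n)^n, which approaches e^{-1} ≈ 0.37.

```python
import math

for n in (10, 100, 1000, 100000):
    p_missed = (1 - 1 / n) ** n          # P(a given example never appears)
    print(n, round(p_missed, 4), round(1 - p_missed, 4))
print("limit:", round(math.exp(-1), 4))  # 0.3679, so ~63% of examples appear
```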
Random forests
Random Forests (Breiman, 2001): Bagging with randomized
variant of decision tree learning algorithm
Each time we need to choose a split, pick a random subset of the features (e.g., \sqrt{d} of the d features) and only choose the split from among those features.
Main idea: trees may use very different features, so less likely to make mistakes in the same way.
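For reference, a hedged sketch using scikit-learn's RandomForestClassifier, which follows this recipe (bootstrap samples plus a random feature subset of size about \sqrt{d} at each split); X_train, y_train, X_test are placeholder arrays:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # T randomized trees
    max_features="sqrt",   # consider ~sqrt(d) random features per split
    bootstrap=True,        # each tree fit on a bootstrap sample
)
# forest.fit(X_train, y_train)
# y_pred = forest.predict(X_test)
```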
Classifiers with independent errors
Say we have T binary classifiers \hat{f}^{(1)}, \dots, \hat{f}^{(T)}.
Assume that on a given x, each provides an incorrect prediction with probability 0.4:
\[
\Pr(\hat{f}^{(t)}(X) \neq Y \mid X = x) = 0.4.
\]
Moreover, assume the error events are independent.
Use the majority-vote classifier \hat{f}_{\mathrm{maj}}.
What is the chance that more than half of the classifiers give the incorrect prediction?
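The answer is the upper tail of a Binomial(T, 0.4) distribution; a small sketch computes it exactly for a few ensemble sizes:

```python
from math import comb

def p_majority_wrong(T, p=0.4):
    # P(more than half of T independent classifiers are wrong)
    return sum(comb(T, k) * p**k * (1 - p) ** (T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (1, 5, 25, 101):
    print(T, round(p_majority_wrong(T), 4))
# The majority vote's error probability shrinks rapidly as T grows.
```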
Coping with non-independent errors
Classifier errors are unlikely to be independent; do something else to benefit from majority-vote
Change how we obtain the individual classifiers:
Adaptively choose classifiers
Re-weight training data
Start with uniform distribution over training examples
Loop:
Use learning algorithm to get new classifier for ensemble
Re-weight training examples to emphasize examples on which
new classifier is incorrect
Adaptive Boosting
AdaBoost (Freund and Schapire, 1997)
Training data (x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \{-1, +1\}
Initialize D_1(i) = 1/n for all i = 1, \dots, n
For t = 1, \dots, T:
  Run learning algorithm on the D_t-weighted training examples, get classifier \hat{f}^{(t)}
  Update weights:
  \[
  z_t := \sum_{i=1}^{n} D_t(i) \cdot y_i \hat{f}^{(t)}(x_i) \in [-1, +1]
  \]
  \[
  \alpha_t := \tfrac{1}{2} \ln\frac{1 + z_t}{1 - z_t} \in \mathbb{R}
  \]
  \[
  D_{t+1}(i) := \frac{D_t(i) \exp(-\alpha_t \cdot y_i \hat{f}^{(t)}(x_i))}{Z_t} \quad \text{for } i = 1, \dots, n,
  \]
  where Z_t is the normalizer that makes D_{t+1} a probability distribution.
Final classifier: \hat{f}(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t \cdot \hat{f}^{(t)}(x)\big)
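A minimal sketch of this loop, assuming a routine weak_learn(X, y, D) that accepts importance weights D and returns a classifier with outputs in {-1, +1} (e.g., a weighted decision stump); the names are placeholders:

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    n = X.shape[0]
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    classifiers, alphas = [], []
    for t in range(T):
        f_t = weak_learn(X, y, D)                # train on D_t-weighted examples
        z_t = np.sum(D * y * f_t(X))             # in [-1, +1]
        alpha_t = 0.5 * np.log((1 + z_t) / (1 - z_t))  # assumes |z_t| < 1
        D = D * np.exp(-alpha_t * y * f_t(X))    # up-weight mistakes of f_t
        D /= D.sum()                             # normalize by Z_t
        classifiers.append(f_t)
        alphas.append(alpha_t)

    def final(X_new):                            # sign of the weighted vote
        scores = sum(a * f(X_new) for a, f in zip(alphas, classifiers))
        return np.where(scores > 0, +1, -1)
    return final, classifiers, alphas
```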
Plot of αt as function of zt
\[
\alpha_t = \tfrac{1}{2} \ln\frac{1 + z_t}{1 - z_t} \in \mathbb{R}
\]
Figure 3: \alpha_t as a function of z_t.
Example: AdaBoost with decision stumps
(From Figures 1.1 and 2.2 of Schapire & Freund text.) Use “decision stump” learning algorithm with AdaBoost
Each \hat{f}^{(t)} has the form
\[
\hat{f}^{(t)}(x) = \begin{cases} +1 & \text{if } x_i > \theta \\ -1 & \text{if } x_i \leq \theta \end{cases}
\qquad \text{or} \qquad
\hat{f}^{(t)}(x) = \begin{cases} -1 & \text{if } x_i > \theta \\ +1 & \text{if } x_i \leq \theta \end{cases}
\]
Straightforward to handle importance weights D_t(i) in the decision tree learning algorithm.
Figure 4: Training data for example execution
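A sketch of a weighted decision-stump learner of this form, compatible with the AdaBoost sketch above: it brute-force searches over features, thresholds, and both sign orientations for the stump with the highest D-weighted agreement with y (purely illustrative):

```python
import numpy as np

def learn_stump(X, y, D):
    n, d = X.shape
    best_score, best = -np.inf, None
    for i in range(d):
        for theta in np.unique(X[:, i]):
            base = np.where(X[:, i] > theta, 1, -1)
            score = np.sum(D * y * base)       # weighted agreement, in [-1, +1]
            for s in (+1, -1):                 # try both orientations
                if s * score > best_score:
                    best_score, best = s * score, (i, theta, s)
    i, theta, s = best
    return lambda Xn: s * np.where(Xn[:, i] > theta, 1, -1)
```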
Example execution of AdaBoost I
Example execution of AdaBoost II
f^(1): z_1 = 0.40, α_1 = 0.42
Example execution of AdaBoost III
f^(1): z_1 = 0.40, α_1 = 0.42
Example execution of AdaBoost IV
f^(1): z_1 = 0.40, α_1 = 0.42;  f^(2): z_2 = 0.58, α_2 = 0.65
Example execution of AdaBoost V
f^(1): z_1 = 0.40, α_1 = 0.42;  f^(2): z_2 = 0.58, α_2 = 0.65
Example execution of AdaBoost VI
f^(1): z_1 = 0.40, α_1 = 0.42;  f^(2): z_2 = 0.58, α_2 = 0.65;  f^(3): z_3 = 0.72, α_3 = 0.92
Example execution of AdaBoost VII
f^(1): z_1 = 0.40, α_1 = 0.42;  f^(2): z_2 = 0.58, α_2 = 0.65;  f^(3): z_3 = 0.72, α_3 = 0.92
Example execution of AdaBoost VIII
f^(1): z_1 = 0.40, α_1 = 0.42;  f^(2): z_2 = 0.58, α_2 = 0.65;  f^(3): z_3 = 0.72, α_3 = 0.92
Final classifier:
\[
\hat{f}(x) = \mathrm{sign}\big(0.42\,\hat{f}^{(1)}(x) + 0.65\,\hat{f}^{(2)}(x) + 0.92\,\hat{f}^{(3)}(x)\big)
\]
Training error rate of final classifier
Let \gamma_t := z_t / 2: the advantage over random guessing achieved by \hat{f}^{(t)}.
Theorem: The training error rate of the final classifier satisfies
\[
\mathrm{err}\big(\hat{f}, ((x_i, y_i))_{i=1}^{n}\big) \;\leq\; \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big) \;=\; \exp\big(-2 \bar{\gamma}^2 T\big),
\]
where \bar{\gamma}^2 := \frac{1}{T} \sum_{t=1}^{T} \gamma_t^2.
AdaBoost is "adaptive":
Some \gamma_t can be small (even negative); we only care about the average \bar{\gamma}^2.
What about the true error rate in the IID model?
A very complex model as T becomes large!
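As a worked check of the bound using the z_t values from the example run above (z_1 = 0.40, z_2 = 0.58, z_3 = 0.72):

```python
import math

gammas = [z / 2 for z in (0.40, 0.58, 0.72)]          # gamma_t = z_t / 2
bound = math.exp(-2 * sum(g * g for g in gammas))
print(round(bound, 3))                                # about 0.60
```

The bound is loose for small T, but it decreases exponentially in T as long as the average squared advantage \bar{\gamma}^2 stays bounded away from zero.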
Surprising behavior of boosting
AdaBoost + C4.5 decision tree learning on the "letters" data set
Figure 5: Figure 1.7 from the Schapire & Freund text: training and test percent error rates obtained using boosting on an OCR data set with C4.5 as the base learner. The top and bottom curves are test and training error, respectively; the top horizontal line shows the test error rate using just C4.5, and the bottom line shows the final test error rate of AdaBoost after 1000 rounds.
Training error rate is zero after five iterations.
Test error rate continues to decrease, even up to 1000 iterations.
Margins theory
Look at the (normalized) scoring function of the final classifier:
\[
\hat{h}(x) := \frac{\sum_{t=1}^{T} \alpha_t \cdot \hat{f}^{(t)}(x)}{\sum_{t=1}^{T} |\alpha_t|} \in [-1, +1].
\]
Say y \cdot \hat{h}(x) is the margin achieved by \hat{h} on example (x, y).
Theorem (Schapire, Freund, Bartlett, and Lee, 1998):
Larger margins on training examples ⇒ better resistance to over-fitting in the IID model.
AdaBoost tends to increase margins on training examples.
(Similar to but not the same as SVM margins.)
On the "letters" data set:

                        T=5     T=100   T=1000
training error rate     0.0%    0.0%    0.0%
test error rate         8.4%    3.3%    3.1%
% margins ≤ 0.5         7.7%    0.0%    0.0%
min margin achieved     0.14    0.52    0.55
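A minimal sketch of computing these normalized margins from the α_t's and classifiers produced by the AdaBoost sketch earlier:

```python
import numpy as np

def normalized_margins(X, y, classifiers, alphas):
    # h_hat(x) = sum_t alpha_t f^(t)(x) / sum_t |alpha_t|, in [-1, +1];
    # the margin on example (x, y) is y * h_hat(x).
    scores = sum(a * f(X) for a, f in zip(alphas, classifiers))
    h_hat = scores / np.sum(np.abs(alphas))
    return y * h_hat

# e.g., np.mean(normalized_margins(X, y, classifiers, alphas) <= 0.5)
# is the fraction of training examples with margin at most 0.5,
# as reported in the table above.
```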