
Machine learning lecture slides
COMS 4771 Fall 2020

Classification IV: Ensemble methods

Overview
- Bagging and Random Forests
- Boosting
- Margins and over-fitting

Motivation
- Recall model averaging: given T real-valued predictors \hat{f}^{(1)}, ..., \hat{f}^{(T)}, form the ensemble predictor
  \hat{f}_{avg}(x) := \frac{1}{T} \sum_{t=1}^{T} \hat{f}^{(t)}(x).
- Mean squared error:
  mse(\hat{f}_{avg}) = \frac{1}{T} \sum_{t=1}^{T} mse(\hat{f}^{(t)}) - \frac{1}{T} \sum_{t=1}^{T} E\big[(\hat{f}_{avg}(X) - \hat{f}^{(t)}(X))^2\big].
- For classification, the analogue is the majority-vote classifier \hat{f}_{maj}:
  \hat{f}_{maj}(x) := +1 if \sum_{t=1}^{T} \hat{f}^{(t)}(x) > 0, and -1 otherwise.
  (\hat{f}_{avg} is the scoring function used for \hat{f}_{maj}.)
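A tiny sketch of the two combination rules in code, assuming the individual predictors are given as Python callables returning real-valued (or ±1-valued) predictions; the names here are illustrative, not from the slides:

```python
import numpy as np

def f_avg(predictors, x):
    # Ensemble average of T real-valued predictors.
    return np.mean([f(x) for f in predictors])

def f_maj(predictors, x):
    # Majority vote: +1 exactly when the summed score is positive.
    return +1 if sum(f(x) for f in predictors) > 0 else -1
```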

How to get classifiers to combine?
- Starting anew: how should we train classifiers to combine in a majority vote?
- Recall: model averaging works well when
  - all \hat{f}^{(t)} have similar MSEs, and
  - all \hat{f}^{(t)} predict very differently from each other.
- To the first point: use the same learning algorithm for all \hat{f}^{(t)}.
- To the second point: the learning algorithm should have "high variance".

Using the same learning algorithm multiple times I
- Running the same learning algorithm T times on the same data set yields T identical classifiers, which is not helpful!
- Instead, we want to run the same learning algorithm on T separate data sets.

Figure 1: What we want: T data sets S_1, S_2, ..., S_T drawn independently from P.

Using the same learning algorithm multiple times II
- Invoke the plug-in principle:
  - In the IID model, regard the empirical distribution P_n on the training examples as an estimate of the example distribution P.
  - Draw T independent data sets from P_n, and run the learning algorithm on each data set.
- This is called bootstrap resampling.

Figure 2: What we can get: T data sets S_1^*, S_2^*, ..., S_T^* drawn from P_n.

Bagging
- Bagging: bootstrap aggregating (Breiman, 1994)
- Given training data (x_1, y_1), ..., (x_n, y_n) \in X \times \{-1, +1\}
- For t = 1, ..., T:
  - Randomly draw n examples with replacement from the training data: S_t^* := ((x_i^{(t)}, y_i^{(t)}))_{i=1}^{n} (a bootstrap sample)
  - Run the learning algorithm on S_t^* to get classifier \hat{f}^{(t)}
- Return the majority-vote classifier over \hat{f}^{(1)}, ..., \hat{f}^{(T)}
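A minimal sketch of this procedure, assuming scikit-learn's DecisionTreeClassifier as the base learner and labels in {-1, +1} (both are illustration choices, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T, seed=0):
    """Train T classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ensemble = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # draw n indices with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Majority vote over the ensemble (labels assumed to be -1/+1)."""
    votes = np.sum([f.predict(X) for f in ensemble], axis=0)
    return np.where(votes > 0, +1, -1)
```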

Aside: Sampling with replacement
- Pick n individuals from a population of size n with replacement.
- What is the chance that a given individual is not picked? It is (1 - 1/n)^n \approx 1/e \approx 0.37 for large n.
- Implications for bagging:
  - Each bootstrap sample contains about 63% of the training examples.
  - The remaining 37% can be used to estimate the error rate of the classifier trained on that bootstrap sample.
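A quick numerical check of this approximation (a sketch; the choice n = 1000 is arbitrary):

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)   # one bootstrap sample of indices
print(len(np.unique(sample)) / n)     # fraction of distinct examples: close to 1 - 1/e ≈ 0.63
print((1 - 1/n) ** n)                 # chance a given index is never picked: ≈ 0.37
```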

Random forests
- Random Forests (Breiman, 2001): bagging with a randomized variant of the decision tree learning algorithm.
  - Each time we need to choose a split, pick a random subset of the features and only choose the split from among those features (a common choice is a subset of about \sqrt{d} of the d features).
- Main idea: the trees may use very different features, so they are less likely to make mistakes in the same way.
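A sketch of this using scikit-learn's built-in implementation; max_features="sqrt" restricts each split to a random subset of roughly \sqrt{d} features. The synthetic data set and parameter values below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators = number of bagged trees; max_features="sqrt" is the
# per-split random feature subset described above.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```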

Classifiers with independent errors
- Say we have T binary classifiers \hat{f}^{(1)}, ..., \hat{f}^{(T)}.
- Assume that, on a given x, each provides an incorrect prediction with probability 0.4:
  Pr(\hat{f}^{(t)}(X) \neq Y \mid X = x) = 0.4.
  Moreover, assume the error events are independent.
- Use the majority-vote classifier \hat{f}_{maj}.
- What is the chance that more than half of the classifiers give the incorrect prediction?
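This is a binomial tail probability; a quick way to evaluate it numerically (a sketch, with T = 101 chosen arbitrarily):

```python
from scipy.stats import binom

T, p_err = 101, 0.4
# Probability that more than half of T independent classifiers are wrong on x.
print(1 - binom.cdf(T // 2, T, p_err))   # ≈ 0.02 for T = 101; shrinks as T grows
```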

Coping with non-independent errors
- Classifier errors are unlikely to be independent; do something else to benefit from the majority vote.
- Change how we obtain the individual classifiers:
  - Adaptively choose classifiers
  - Re-weight training data
- Start with the uniform distribution over training examples.
- Loop:
  - Use the learning algorithm to get a new classifier for the ensemble.
  - Re-weight the training examples to emphasize examples on which the new classifier is incorrect.

Adaptive Boosting
- AdaBoost (Freund and Schapire, 1997)
- Training data (x_1, y_1), ..., (x_n, y_n) \in X \times \{-1, +1\}
- Initialize D_1(i) = 1/n for all i = 1, ..., n
- For t = 1, ..., T:
  - Run the learning algorithm on the D_t-weighted training examples to get classifier f^{(t)}.
  - Update weights:
    z_t := \sum_{i=1}^{n} D_t(i) \cdot y_i f^{(t)}(x_i) \in [-1, +1]
    \alpha_t := \frac{1}{2} \ln \frac{1 + z_t}{1 - z_t} \in \mathbb{R}
    D_{t+1}(i) := \frac{D_t(i) \exp(-\alpha_t \cdot y_i f^{(t)}(x_i))}{Z_t}  for i = 1, ..., n,
    where Z_t is the normalizer that makes D_{t+1} a probability distribution.
- Final classifier: \hat{f}(x) = sign(\sum_{t=1}^{T} \alpha_t \cdot f^{(t)}(x))
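A minimal sketch of these updates, using scikit-learn's depth-1 DecisionTreeClassifier as a stand-in for a weighted base learner and assuming labels in {-1, +1} (both are illustration choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T):
    """AdaBoost with the z_t / alpha_t parameterization from the slides."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1(i) = 1/n
    classifiers, alphas = [], []
    for t in range(T):
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = f_t.predict(X)
        z = np.sum(D * y * pred)            # z_t in [-1, +1]; assumes |z_t| < 1 below
        alpha = 0.5 * np.log((1 + z) / (1 - z))
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                        # normalize so D_{t+1} is a distribution
        classifiers.append(f_t)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * f.predict(X) for a, f in zip(alphas, classifiers))
    return np.sign(scores)
```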

Plot of \alpha_t as a function of z_t

\alpha_t = \frac{1}{2} \ln \frac{1 + z_t}{1 - z_t} \in \mathbb{R}

Figure 3: \alpha_t as a function of z_t (increasing from -\infty at z_t = -1 to +\infty at z_t = +1, with \alpha_t = 0 at z_t = 0).

Example: AdaBoost with decision stumps
- (From Figures 1.1 and 2.2 of the Schapire & Freund text.)
- Use the "decision stump" learning algorithm with AdaBoost.
- Each f^{(t)} has the form
  f^{(t)}(x) = +1 if x_i > \theta, -1 if x_i \leq \theta
  or
  f^{(t)}(x) = -1 if x_i > \theta, +1 if x_i \leq \theta.
- It is straightforward to handle the importance weights D_t(i) in the decision tree learning algorithm.

Figure 4: Training data for the example execution
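A brute-force sketch of a decision stump learner that minimizes importance-weighted training error; this is a hypothetical helper, written to make the handling of the weights D_t(i) explicit rather than to be efficient:

```python
import numpy as np

def fit_weighted_stump(X, y, D):
    """Pick the (feature, threshold, sign) stump with smallest D-weighted error."""
    n, d = X.shape
    best_err, best_params = np.inf, None
    for i in range(d):
        for theta in np.unique(X[:, i]):
            for sign in (+1, -1):
                pred = np.where(X[:, i] > theta, sign, -sign)
                err = np.sum(D * (pred != y))   # importance-weighted error
                if err < best_err:
                    best_err, best_params = err, (i, theta, sign)
    return best_params

def stump_predict(params, X):
    i, theta, sign = params
    return np.where(X[:, i] > theta, sign, -sign)
```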

Example execution of AdaBoost

(Figures omitted: the slides show the training data together with the decision stumps f^{(1)}, f^{(2)}, f^{(3)} chosen in the first three rounds.)

Round 1: z_1 = 0.40, \alpha_1 = 0.42
Round 2: z_2 = 0.58, \alpha_2 = 0.65
Round 3: z_3 = 0.72, \alpha_3 = 0.92

Final classifier:
\hat{f}(x) = sign(0.42 f^{(1)}(x) + 0.65 f^{(2)}(x) + 0.92 f^{(3)}(x))

Training error rate of final classifier
- Let \gamma_t := z_t / 2: the advantage over random guessing achieved by f^{(t)}.
- Theorem: The training error rate of the final classifier satisfies
  err(\hat{f}, ((x_i, y_i))_{i=1}^{n}) \leq \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big) = \exp(-2 \bar{\gamma}^2 T),
  where \bar{\gamma}^2 := \frac{1}{T} \sum_{t=1}^{T} \gamma_t^2.
- AdaBoost is "adaptive":
  - Some \gamma_t can be small (even negative); we only care about the average \bar{\gamma}^2.
- What about the true error rate in the IID model?
  - The final classifier is a very complex model as T becomes large!
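A quick numerical check of the bound, plugging in the z_t values from the example execution above (a sketch; with only T = 3 rounds the bound is loose):

```python
import numpy as np

z = np.array([0.40, 0.58, 0.72])        # z_t from the example execution
gamma = z / 2                           # gamma_t = z_t / 2
print(np.exp(-2 * np.sum(gamma**2)))    # ≈ 0.60, an upper bound on the training error rate
```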

Surprising behavior of boosting
- AdaBoost + C4.5 decision tree learning on the "letters" data set.

Figure 5: Figure 1.7 from the Schapire & Freund text. It plots training and test percent error rates of boosting with C4.5 as the base learner on an OCR data set, against the number of rounds (10 to 1000). The top curve is test error and the bottom curve is training error; horizontal lines mark the test error of C4.5 alone and the final test error of AdaBoost after 1000 rounds.

- The training error rate is zero after five iterations.
- The test error rate continues to decrease, even up to 1000 iterations.

Margins theory
- Look at the (normalized) scoring function of the final classifier:
  \hat{h}(x) := \frac{\sum_{t=1}^{T} \alpha_t \cdot f^{(t)}(x)}{\sum_{t=1}^{T} |\alpha_t|} \in [-1, +1].
- Say y \cdot \hat{h}(x) is the margin achieved by \hat{h} on example (x, y).
- Theorem (Schapire, Freund, Bartlett, and Lee, 1998):
  - Larger margins on training examples imply better resistance to over-fitting in the IID model.
  - AdaBoost tends to increase the margins on training examples.
  - (Similar to, but not the same as, SVM margins.)
- On the "letters" data set:

                         T=5     T=100   T=1000
  training error rate    0.0%    0.0%    0.0%
  test error rate        8.4%    3.3%    3.1%
  % margins ≤ 0.5        7.7%    0.0%    0.0%
  min margin achieved    0.14    0.52    0.55
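A sketch of computing these normalized margins, reusing the hypothetical adaboost_fit helper from the earlier sketch:

```python
import numpy as np

def normalized_margins(classifiers, alphas, X, y):
    """y * h_hat(x) per example, where h_hat is the alpha-weighted vote
    divided by sum_t |alpha_t|, so values lie in [-1, +1]."""
    scores = sum(a * f.predict(X) for a, f in zip(alphas, classifiers))
    return y * scores / np.sum(np.abs(alphas))

# e.g.: margins = normalized_margins(*adaboost_fit(X, y, T=100), X, y)
#       print(margins.min(), np.mean(margins <= 0.5))
```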