AdaBoost
Daniel Hsu (COMS 4771)

The algorithm
The input training data is {(xi, yi)}_{i=1}^n from X × {−1, +1}.

• Initialize D1(i) := 1/n for each i = 1,…,n.
• For t = 1,…,T, do:
  – Give Dt-weighted examples to Weak Learner; get back ht : X → {−1, +1}.
  – Compute weight on ht and update weights on examples:

        st := ∑_{i=1}^n Dt(i) · yi ht(xi)

        αt := (1/2) · ln((1 + st) / (1 − st))

        Dt+1(i) := Dt(i) · exp(−αt · yi ht(xi)) / Zt   for each i = 1,…,n,

    where

        Zt := ∑_{i=1}^n Dt(i) · exp(−αt · yi ht(xi))

    is the normalizer that makes Dt+1 a probability distribution.
• Final hypothesis is ĥ defined by ĥ(x) := sign(∑_{t=1}^T αt · ht(x)) for x ∈ X.
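Below is a minimal Python sketch of this procedure; it is not part of the original notes. The decision-stump weak learner, the toy data, and the function names (stump_weak_learner, adaboost, predict) are illustrative assumptions; any weak learner returning a hypothesis X → {−1, +1} could be plugged in. The sketch also records st and Zt from each round, which the later numerical checks refer to.

    import numpy as np

    def stump_weak_learner(X, y, D):
        # Hypothetical weak learner: exhaustive search over decision stumps
        # (feature, threshold, sign), maximizing the D-weighted correlation
        # sum_i D(i) * y_i * h(x_i).
        n, d = X.shape
        best, best_corr = None, -np.inf
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (+1.0, -1.0):
                    preds = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                    corr = np.sum(D * y * preds)
                    if corr > best_corr:
                        best_corr, best = corr, (j, thr, sign)
        j, thr, sign = best
        return lambda Z: sign * np.where(Z[:, j] <= thr, 1.0, -1.0)

    def adaboost(X, y, T, weak_learner=stump_weak_learner):
        # Runs T rounds of the algorithm above; assumes |s_t| < 1 in every round.
        n = X.shape[0]
        D = np.full(n, 1.0 / n)                      # D_1(i) := 1/n
        alphas, hyps, s_list, Z_list = [], [], [], []
        for t in range(T):
            h = weak_learner(X, y, D)                # weak hypothesis h_t
            preds = h(X)
            s = np.sum(D * y * preds)                # s_t
            alpha = 0.5 * np.log((1 + s) / (1 - s))  # alpha_t
            unnorm = D * np.exp(-alpha * y * preds)
            Z = unnorm.sum()                         # normalizer Z_t
            D = unnorm / Z                           # D_{t+1}
            alphas.append(alpha)
            hyps.append(h)
            s_list.append(s)
            Z_list.append(Z)
        return alphas, hyps, s_list, Z_list

    def predict(alphas, hyps, X):
        # Final hypothesis: sign of the alpha-weighted vote.
        return np.sign(sum(a * h(X) for a, h in zip(alphas, hyps)))

    # Toy usage on synthetic data (an arbitrary illustrative choice).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1.0, -1.0)
    alphas, hyps, s_list, Z_list = adaboost(X, y, T=20)
    print("training error:", np.mean(predict(alphas, hyps, X) != y))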
Training error rate bound

Let l̂ be the function defined by

    l̂(x) := ∑_{t=1}^T αt · ht(x)   for x ∈ X,

so ĥ(x) = sign(l̂(x)). The training error rate of ĥ can be bounded above by the average exponential loss of l̂:

    (1/n) ∑_{i=1}^n 1{ĥ(xi) ≠ yi} ≤ (1/n) ∑_{i=1}^n exp(−yi l̂(xi)).

This holds because

    ĥ(xi) ≠ yi  ⟹  −yi l̂(xi) ≥ 0  ⟺  exp(−yi l̂(xi)) ≥ 1.
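As a quick numerical illustration (not from the notes; it assumes X, y, alphas, hyps from the sketch above), the 0/1 indicator never exceeds the exponential loss on any training example:

    import numpy as np

    l_hat = sum(a * h(X) for a, h in zip(alphas, hyps))   # l-hat(x_i)
    zero_one = (np.sign(l_hat) != y).astype(float)        # 1{h-hat(x_i) != y_i}
    exp_loss = np.exp(-y * l_hat)                         # exp(-y_i l-hat(x_i))
    assert np.all(zero_one <= exp_loss + 1e-12)
    print(zero_one.mean(), "<=", exp_loss.mean())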
Furthermore, the average exponential loss of l̂ equals the product of the normalizers from all rounds:

    (1/n) ∑_{i=1}^n exp(−yi l̂(xi))
        = ∑_{i=1}^n D1(i) · exp(−∑_{t=1}^T αt · yi ht(xi))
        = ∑_{i=1}^n D1(i) · exp(−α1 · yi h1(xi)) · exp(−∑_{t=2}^T αt · yi ht(xi))
        = Z1 · ∑_{i=1}^n D2(i) · exp(−∑_{t=2}^T αt · yi ht(xi))
        = Z1 · Z2 · ∑_{i=1}^n D3(i) · exp(−∑_{t=3}^T αt · yi ht(xi))
        = ⋯
        = ∏_{t=1}^T Zt.

Since each yi ht(xi) ∈ {−1, +1}, the normalizer Zt can be written as

    Zt = ∑_{i=1}^n Dt(i) · exp(−αt · yi ht(xi))
       = ∑_{i=1}^n Dt(i) · [ ((1 + yi ht(xi))/2) · exp(−αt) + ((1 − yi ht(xi))/2) · exp(αt) ]
       = ∑_{i=1}^n Dt(i) · [ ((1 + yi ht(xi))/2) · √((1 − st)/(1 + st)) + ((1 − yi ht(xi))/2) · √((1 + st)/(1 − st)) ]
       = ((1 + st)/2) · √((1 − st)/(1 + st)) + ((1 − st)/2) · √((1 + st)/(1 − st))
       = √((1 + st)(1 − st))
       = √(1 − st^2).
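Both identities can be sanity-checked numerically on the hypothetical run from the sketch above (which recorded s_list and Z_list); this check is not part of the notes:

    import numpy as np

    # Assumes X, y, alphas, hyps, s_list, Z_list from the sketch above.
    l_hat = sum(a * h(X) for a, h in zip(alphas, hyps))
    avg_exp_loss = np.mean(np.exp(-y * l_hat))
    print(np.isclose(avg_exp_loss, np.prod(Z_list)))                    # average exp loss = product of Z_t
    print(np.allclose(Z_list, np.sqrt(1 - np.array(s_list) ** 2)))      # Z_t = sqrt(1 - s_t^2)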
So, we conclude the following bound on the training error rate of ĥ:

    (1/n) ∑_{i=1}^n 1{ĥ(xi) ≠ yi} ≤ (1/n) ∑_{i=1}^n exp(−yi l̂(xi)) = ∏_{t=1}^T Zt = ∏_{t=1}^T √(1 − st^2) ≤ exp(−(1/2) ∑_{t=1}^T st^2),

where the last step uses the fact that 1 + x ≤ e^x for any real number x. (The bound is usually written in terms of γt := st/2, i.e., as exp(−2 ∑_{t=1}^T γt^2).)
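Continuing the hypothetical run from the sketches above (again, not part of the notes), the whole chain of inequalities can be checked directly:

    import numpy as np

    # Assumes X, y, alphas, hyps, s_list, Z_list from the sketches above.
    train_err = np.mean(predict(alphas, hyps, X) != y)
    s = np.array(s_list)
    print(train_err, "<=", np.prod(Z_list), "<=", np.exp(-0.5 * np.sum(s ** 2)))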
Let gˆ be the function defined by
Tt=1 αt · ht(x)
gˆ(x) := Tt=1 |αt| for x ∈ X
so yigˆ(xi) is the margin achieved on example (xi, yi). We may assume without loss of generality that αt ≥ 0 for each t = 1, . . . , T (by replacing ht with −ht as needed.) Fix a value θ ∈ (0, 1), and consider the fraction of training examples on which gˆ achieves a margin at most θ:
1 n
n
n 1{h(xi)̸=yi}≤ n i=1
i=1
t=1
Zt = 1−st ≤exp −2 t=1
i=1
1{yigˆ(xi) ≤ θ}. 2
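To make this quantity concrete, here is a small computation of the normalized margins and of this fraction for the hypothetical run above (not part of the notes; θ = 0.1 is an arbitrary illustrative choice):

    import numpy as np

    # Assumes X, y, alphas, hyps from the sketches above, with all alphas >= 0.
    theta = 0.1
    margins = y * sum(a * h(X) for a, h in zip(alphas, hyps)) / np.sum(np.abs(alphas))
    print("fraction with margin <= theta:", np.mean(margins <= theta))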
This quantity can be bounded above using the arguments from before:

    (1/n) ∑_{i=1}^n 1{yi ĝ(xi) ≤ θ}
        = (1/n) ∑_{i=1}^n 1{yi l̂(xi) ≤ θ · ∑_{t=1}^T αt}
        ≤ exp(θ · ∑_{t=1}^T αt) · (1/n) ∑_{i=1}^n exp(−yi l̂(xi))
        = exp(θ · ∑_{t=1}^T αt) · ∏_{t=1}^T √(1 − st^2)
        = ∏_{t=1}^T √((1 + st)^{1+θ} (1 − st)^{1−θ}).
Suppose that for some γ > 0, st ≥ 2γ for all t = 1,…,T. If θ < γ, then using calculus, it can be shown that each term in the product is less than 1:

    √((1 + st)^{1+θ} (1 − st)^{1−θ}) < 1.

Hence, the bound decreases to zero exponentially fast with T.
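A quick way to convince oneself of this last claim (again, not part of the notes; γ, θ, and the grid are arbitrary illustrative choices) is to evaluate the per-round factor over a range of st ≥ 2γ:

    import numpy as np

    gamma, theta = 0.2, 0.1                     # any theta < gamma
    s = np.linspace(2 * gamma, 0.999, 1000)     # s_t ranging over [2*gamma, 1)
    factor = np.sqrt((1 + s) ** (1 + theta) * (1 - s) ** (1 - theta))
    print(factor.max() < 1)                     # every per-round factor is below 1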