AdaBoost
Daniel Hsu (COMS 4771)
The algorithm
The input training data is $\{(x_i, y_i)\}_{i=1}^n$ from $\mathcal{X} \times \{-1,+1\}$.
• Initialize $D_1(i) := 1/n$ for each $i = 1, \dots, n$.
• For $t = 1, \dots, T$, do:
– Give $D_t$-weighted examples to Weak Learner; get back $h_t \colon \mathcal{X} \to \{-1,+1\}$.
– Compute weight on $h_t$ and update weights on examples:
\[
s_t := \sum_{i=1}^n D_t(i) \cdot y_i h_t(x_i), \qquad
\alpha_t := \frac{1}{2} \ln \frac{1 + s_t}{1 - s_t},
\]
\[
D_{t+1}(i) := \frac{D_t(i) \cdot \exp(-\alpha_t \cdot y_i h_t(x_i))}{Z_t} \quad \text{for each } i = 1, \dots, n,
\]
where
\[
Z_t := \sum_{i=1}^n D_t(i) \cdot \exp(-\alpha_t \cdot y_i h_t(x_i))
\]
is the normalizer that makes $D_{t+1}$ a probability distribution.
• Final hypothesis is $\hat{h}$ defined by
\[
\hat{h}(x) := \operatorname{sign}\left( \sum_{t=1}^T \alpha_t \cdot h_t(x) \right) \quad \text{for } x \in \mathcal{X}.
\]
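The loop above translates directly into code. Below is a minimal sketch (not from the notes): `weak_learner(X, y, D)` is a hypothetical callable standing in for the Weak Learner; it trains on $D_t$-weighted examples and returns a function $h$ whose predictions lie in $\{-1,+1\}$.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal sketch of the boosting loop, under the assumptions stated above.

    y is an array of labels in {-1, +1}; weak_learner(X, y, D) returns a
    callable h with h(X) in {-1, +1}^n (a hypothetical interface).
    """
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # h_t from D_t-weighted examples
        yh = y * h(X)                            # y_i * h_t(x_i), each in {-1, +1}
        s = np.dot(D, yh)                        # s_t
        alpha = 0.5 * np.log((1 + s) / (1 - s))  # alpha_t
        D = D * np.exp(-alpha * yh)
        D /= D.sum()                             # divide by Z_t so D_{t+1} is a distribution
        hypotheses.append(h)
        alphas.append(alpha)

    def h_hat(X_new):
        # Final hypothesis: sign of the weighted vote (np.sign gives 0 on an exact tie).
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))

    return h_hat
```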
Training error rate bound
Let $\hat{\ell}$ be the function defined by
\[
\hat{\ell}(x) := \sum_{t=1}^T \alpha_t \cdot h_t(x) \quad \text{for } x \in \mathcal{X},
\]
so $\hat{h}(x) = \operatorname{sign}(\hat{\ell}(x))$. The training error rate of $\hat{h}$ can be bounded above by the average exponential loss of $\hat{\ell}$:
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat{h}(x_i) \neq y_i\} \le \frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i)).
\]
This holds because $\hat{h}(x_i) \neq y_i$ implies $-y_i \hat{\ell}(x_i) \ge 0$, which in turn implies $\exp(-y_i \hat{\ell}(x_i)) \ge 1$.
Furthermore, the average exponential loss of $\hat{\ell}$ equals the product of the normalizers from all rounds:
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i))
&= \sum_{i=1}^n D_1(i) \cdot \exp\left( -\sum_{t=1}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 \sum_{i=1}^n \frac{D_1(i) \cdot \exp(-\alpha_1 \cdot y_i h_1(x_i))}{Z_1} \cdot \exp\left( -\sum_{t=2}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 \sum_{i=1}^n D_2(i) \cdot \exp\left( -\sum_{t=2}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 Z_2 \sum_{i=1}^n \frac{D_2(i) \cdot \exp(-\alpha_2 \cdot y_i h_2(x_i))}{Z_2} \cdot \exp\left( -\sum_{t=3}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 Z_2 \sum_{i=1}^n D_3(i) \cdot \exp\left( -\sum_{t=3}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= \cdots \\
&= \prod_{t=1}^T Z_t.
\end{align*}
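This telescoping identity can be confirmed numerically. In the sketch below, a random $\pm 1$ matrix `yh` stands in for the values $y_i h_t(x_i)$; the setup is illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 200, 10
# yh[t, i] plays the role of y_i * h_t(x_i); biased so each round beats chance
yh = rng.choice([-1, 1], size=(T, n), p=[0.4, 0.6])

D = np.full(n, 1.0 / n)
alphas, Zs = [], []
for t in range(T):
    s = np.dot(D, yh[t])
    alpha = 0.5 * np.log((1 + s) / (1 - s))
    Z = np.dot(D, np.exp(-alpha * yh[t]))
    D = D * np.exp(-alpha * yh[t]) / Z
    alphas.append(alpha)
    Zs.append(Z)

margins = np.array(alphas) @ yh   # y_i * lhat(x_i) for each example i
print(np.isclose(np.mean(np.exp(-margins)), np.prod(Zs)))   # True
```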
Since each $y_i h_t(x_i) \in \{-1,+1\}$, the normalizer $Z_t$ can be written as
\begin{align*}
Z_t &= \sum_{i=1}^n D_t(i) \cdot \exp(-\alpha_t \cdot y_i h_t(x_i)) \\
&= \sum_{i=1}^n D_t(i) \cdot \left( \frac{1 + y_i h_t(x_i)}{2} \exp(-\alpha_t) + \frac{1 - y_i h_t(x_i)}{2} \exp(\alpha_t) \right) \\
&= \sum_{i=1}^n D_t(i) \cdot \left( \frac{1 + y_i h_t(x_i)}{2} \sqrt{\frac{1 - s_t}{1 + s_t}} + \frac{1 - y_i h_t(x_i)}{2} \sqrt{\frac{1 + s_t}{1 - s_t}} \right) \\
&= \frac{1 + s_t}{2} \sqrt{\frac{1 - s_t}{1 + s_t}} + \frac{1 - s_t}{2} \sqrt{\frac{1 + s_t}{1 - s_t}} \\
&= \sqrt{(1 + s_t)(1 - s_t)} \\
&= \sqrt{1 - s_t^2},
\end{align*}
where the fourth step uses $\sum_{i=1}^n D_t(i) = 1$ and $\sum_{i=1}^n D_t(i) \cdot y_i h_t(x_i) = s_t$.
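A quick numerical spot check of this closed form, with an arbitrary distribution and random $\pm 1$ values standing in for $D_t$ and $y_i h_t(x_i)$ (illustrative names only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
D = rng.random(n)
D /= D.sum()                             # an arbitrary distribution playing the role of D_t
yh = rng.choice([-1, 1], size=n)         # stand-in for y_i * h_t(x_i)

s = np.dot(D, yh)                        # s_t
alpha = 0.5 * np.log((1 + s) / (1 - s))  # alpha_t
Z = np.dot(D, np.exp(-alpha * yh))       # normalizer Z_t
print(np.isclose(Z, np.sqrt(1 - s**2)))  # True
```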
So, we conclude the following bound on the training error rate of $\hat{h}$:
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat{h}(x_i) \neq y_i\} \le \frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i)) = \prod_{t=1}^T Z_t = \prod_{t=1}^T \sqrt{1 - s_t^2} \le \exp\left( -\frac{1}{2} \sum_{t=1}^T s_t^2 \right),
\]
where the last step uses the fact that $1 + x \le e^x$ for any real number $x$.
(The bound is usually written in terms of $\gamma_t := s_t/2$, i.e., as $\exp(-2 \sum_{t=1}^T \gamma_t^2)$.)
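As an illustration of how fast the bound shrinks: if every round satisfies $s_t \ge s_{\min} > 0$, the bound is at most $\exp(-\tfrac{1}{2} T s_{\min}^2)$, which falls below $1/n$ once $T > 2\ln(n)/s_{\min}^2$; since the training error rate is a multiple of $1/n$, it must then be exactly zero. A small calculation with hypothetical numbers:

```python
import math

# Hypothetical setting: n examples, every round has s_t >= s_min.
# Training error rate <= exp(-0.5 * T * s_min**2) < 1/n once T > 2*ln(n)/s_min**2,
# which forces the training error rate to be exactly 0.
n, s_min = 1000, 0.2
T_zero = math.floor(2 * math.log(n) / s_min**2) + 1
print(T_zero)   # 346 rounds suffice under this assumption
```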
Margins on training examples
Let $\hat{g}$ be the function defined by
\[
\hat{g}(x) := \frac{\sum_{t=1}^T \alpha_t \cdot h_t(x)}{\sum_{t=1}^T |\alpha_t|} \quad \text{for } x \in \mathcal{X},
\]
so $y_i \hat{g}(x_i)$ is the margin achieved on example $(x_i, y_i)$. We may assume without loss of generality that $\alpha_t \ge 0$ for each $t = 1, \dots, T$ (by replacing $h_t$ with $-h_t$ as needed). Fix a value $\theta \in (0,1)$, and consider the fraction of training examples on which $\hat{g}$ achieves a margin of at most $\theta$:
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{y_i \hat{g}(x_i) \le \theta\}.
\]
This quantity can be bounded above using the arguments from before:
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{y_i \hat{g}(x_i) \le \theta\}
&= \frac{1}{n} \sum_{i=1}^n \mathbf{1}\left\{ y_i \hat{\ell}(x_i) \le \theta \sum_{t=1}^T \alpha_t \right\} \\
&\le \exp\left( \theta \sum_{t=1}^T \alpha_t \right) \cdot \frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i)) \\
&= \exp\left( \theta \sum_{t=1}^T \alpha_t \right) \cdot \prod_{t=1}^T \sqrt{1 - s_t^2} \\
&= \prod_{t=1}^T \sqrt{(1 + s_t)^{1+\theta} (1 - s_t)^{1-\theta}},
\end{align*}
where the last step uses $\exp(\theta \alpha_t) = \left( \frac{1 + s_t}{1 - s_t} \right)^{\theta/2}$, which follows from the definition of $\alpha_t$.
Suppose that for some $\gamma > 0$, we have $s_t \ge 2\gamma$ for all $t = 1, \dots, T$. If $\theta < \gamma$, then using calculus, it can be shown that each term in the product is less than $1$:
\[
\sqrt{(1 + s_t)^{1+\theta} (1 - s_t)^{1-\theta}} < 1.
\]
Hence, the bound decreases to zero exponentially fast with $T$.
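The calculus claim can be sanity-checked numerically; the sketch below simply evaluates the factor on a grid of values satisfying $s_t \ge 2\gamma$ and $\theta < \gamma$ (an illustration, not a proof):

```python
import numpy as np

# Check that sqrt((1+s)**(1+theta) * (1-s)**(1-theta)) < 1 whenever
# s >= 2*gamma and theta < gamma, for a few sample values.
for gamma in [0.05, 0.1, 0.3]:
    for theta in [0.0, gamma / 2, 0.9 * gamma]:
        for s in np.linspace(2 * gamma, 0.999, 200):
            factor = np.sqrt((1 + s) ** (1 + theta) * (1 - s) ** (1 - theta))
            assert factor < 1.0
print("every factor is below 1")
```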