AdaBoost
Daniel Hsu (COMS 4771)
The algorithm
The input training data is $\{(x_i, y_i)\}_{i=1}^n$ from $\mathcal{X} \times \{-1,+1\}$.
• Initialize $D_1(i) := 1/n$ for each $i = 1, \dots, n$.
• For $t = 1, \dots, T$, do:
– Give $D_t$-weighted examples to Weak Learner; get back $h_t \colon \mathcal{X} \to \{-1,+1\}$.
– Compute weight on $h_t$ and update weights on examples:
\[
s_t := \sum_{i=1}^n D_t(i) \cdot y_i h_t(x_i), \qquad
\alpha_t := \frac{1}{2} \ln \frac{1 + s_t}{1 - s_t},
\]
\[
D_{t+1}(i) := \frac{D_t(i) \cdot \exp(-\alpha_t \cdot y_i h_t(x_i))}{Z_t} \quad \text{for each } i = 1, \dots, n,
\]
where
\[
Z_t := \sum_{i=1}^n D_t(i) \cdot \exp(-\alpha_t \cdot y_i h_t(x_i))
\]
is the normalizer that makes $D_{t+1}$ a probability distribution.
• Final hypothesis is $\hat{h}$ defined by
\[
\hat{h}(x) := \operatorname{sign}\left( \sum_{t=1}^T \alpha_t \cdot h_t(x) \right) \quad \text{for } x \in \mathcal{X}.
\]
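The loop above translates directly into code. Below is a minimal sketch (not from the notes): `weak_learner(X, y, D)` is a hypothetical callable standing in for the Weak Learner; it trains on $D_t$-weighted examples and returns a function $h$ whose predictions lie in $\{-1,+1\}$.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal sketch of the boosting loop, under the assumptions stated above.

    y is an array of labels in {-1, +1}; weak_learner(X, y, D) returns a
    callable h with h(X) in {-1, +1}^n (a hypothetical interface).
    """
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # h_t from D_t-weighted examples
        yh = y * h(X)                            # y_i * h_t(x_i), each in {-1, +1}
        s = np.dot(D, yh)                        # s_t
        alpha = 0.5 * np.log((1 + s) / (1 - s))  # alpha_t
        D = D * np.exp(-alpha * yh)
        D /= D.sum()                             # divide by Z_t so D_{t+1} is a distribution
        hypotheses.append(h)
        alphas.append(alpha)

    def h_hat(X_new):
        # Final hypothesis: sign of the weighted vote (np.sign gives 0 on an exact tie).
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))

    return h_hat
```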
Training error rate bound
Let $\hat{\ell}$ be the function defined by
\[
\hat{\ell}(x) := \sum_{t=1}^T \alpha_t \cdot h_t(x) \quad \text{for } x \in \mathcal{X},
\]
so $\hat{h}(x) = \operatorname{sign}(\hat{\ell}(x))$. The training error rate of $\hat{h}$ can be bounded above by the average exponential loss of $\hat{\ell}$:
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat{h}(x_i) \neq y_i\} \le \frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i)).
\]
This holds because $\hat{h}(x_i) \neq y_i$ implies $-y_i \hat{\ell}(x_i) \ge 0$, which in turn implies $\exp(-y_i \hat{\ell}(x_i)) \ge 1$.
Furthermore, the average exponential loss of $\hat{\ell}$ equals the product of the normalizers from all rounds:
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i))
&= \sum_{i=1}^n D_1(i) \cdot \exp\left( -\sum_{t=1}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 \sum_{i=1}^n \frac{D_1(i) \cdot \exp(-\alpha_1 \cdot y_i h_1(x_i))}{Z_1} \cdot \exp\left( -\sum_{t=2}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 \sum_{i=1}^n D_2(i) \cdot \exp\left( -\sum_{t=2}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 Z_2 \sum_{i=1}^n \frac{D_2(i) \cdot \exp(-\alpha_2 \cdot y_i h_2(x_i))}{Z_2} \cdot \exp\left( -\sum_{t=3}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= Z_1 Z_2 \sum_{i=1}^n D_3(i) \cdot \exp\left( -\sum_{t=3}^T \alpha_t \cdot y_i h_t(x_i) \right) \\
&= \cdots \\
&= \prod_{t=1}^T Z_t.
\end{align*}
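This telescoping identity can be confirmed numerically. In the sketch below, a random $\pm 1$ matrix `yh` stands in for the values $y_i h_t(x_i)$; the setup is illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 200, 10
# yh[t, i] plays the role of y_i * h_t(x_i); biased so each round beats chance
yh = rng.choice([-1, 1], size=(T, n), p=[0.4, 0.6])

D = np.full(n, 1.0 / n)
alphas, Zs = [], []
for t in range(T):
    s = np.dot(D, yh[t])
    alpha = 0.5 * np.log((1 + s) / (1 - s))
    Z = np.dot(D, np.exp(-alpha * yh[t]))
    D = D * np.exp(-alpha * yh[t]) / Z
    alphas.append(alpha)
    Zs.append(Z)

margins = np.array(alphas) @ yh   # y_i * lhat(x_i) for each example i
print(np.isclose(np.mean(np.exp(-margins)), np.prod(Zs)))   # True
```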
Since each $y_i h_t(x_i) \in \{-1,+1\}$, the normalizer $Z_t$ can be written as
\begin{align*}
Z_t &= \sum_{i=1}^n D_t(i) \cdot \exp(-\alpha_t \cdot y_i h_t(x_i)) \\
&= \sum_{i=1}^n D_t(i) \cdot \left( \frac{1 + y_i h_t(x_i)}{2} \exp(-\alpha_t) + \frac{1 - y_i h_t(x_i)}{2} \exp(\alpha_t) \right) \\
&= \sum_{i=1}^n D_t(i) \cdot \left( \frac{1 + y_i h_t(x_i)}{2} \sqrt{\frac{1 - s_t}{1 + s_t}} + \frac{1 - y_i h_t(x_i)}{2} \sqrt{\frac{1 + s_t}{1 - s_t}} \right) \\
&= \frac{1 + s_t}{2} \sqrt{\frac{1 - s_t}{1 + s_t}} + \frac{1 - s_t}{2} \sqrt{\frac{1 + s_t}{1 - s_t}} \\
&= \sqrt{(1 + s_t)(1 - s_t)} \\
&= \sqrt{1 - s_t^2},
\end{align*}
where the fourth step uses $\sum_{i=1}^n D_t(i) = 1$ and $\sum_{i=1}^n D_t(i) \cdot y_i h_t(x_i) = s_t$.
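A quick numerical spot check of this closed form, with an arbitrary distribution and random $\pm 1$ values standing in for $D_t$ and $y_i h_t(x_i)$ (illustrative names only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
D = rng.random(n)
D /= D.sum()                             # an arbitrary distribution playing the role of D_t
yh = rng.choice([-1, 1], size=n)         # stand-in for y_i * h_t(x_i)

s = np.dot(D, yh)                        # s_t
alpha = 0.5 * np.log((1 + s) / (1 - s))  # alpha_t
Z = np.dot(D, np.exp(-alpha * yh))       # normalizer Z_t
print(np.isclose(Z, np.sqrt(1 - s**2)))  # True
```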
So, we conclude the following bound on the training error rate of $\hat{h}$:
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\hat{h}(x_i) \neq y_i\} \le \frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i)) = \prod_{t=1}^T Z_t = \prod_{t=1}^T \sqrt{1 - s_t^2} \le \exp\left( -\frac{1}{2} \sum_{t=1}^T s_t^2 \right),
\]
where the last step uses the fact that $1 + x \le e^x$ for any real number $x$.
(The bound is usually written in terms of $\gamma_t := s_t/2$, i.e., as $\exp(-2 \sum_{t=1}^T \gamma_t^2)$.)
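As an illustration of how fast the bound shrinks: if every round satisfies $s_t \ge s_{\min} > 0$, the bound is at most $\exp(-\tfrac{1}{2} T s_{\min}^2)$, which falls below $1/n$ once $T > 2\ln(n)/s_{\min}^2$; since the training error rate is a multiple of $1/n$, it must then be exactly zero. A small calculation with hypothetical numbers:

```python
import math

# Hypothetical setting: n examples, every round has s_t >= s_min.
# Training error rate <= exp(-0.5 * T * s_min**2) < 1/n once T > 2*ln(n)/s_min**2,
# which forces the training error rate to be exactly 0.
n, s_min = 1000, 0.2
T_zero = math.floor(2 * math.log(n) / s_min**2) + 1
print(T_zero)   # 346 rounds suffice under this assumption
```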
Margins on training examples
Let $\hat{g}$ be the function defined by
\[
\hat{g}(x) := \frac{\sum_{t=1}^T \alpha_t \cdot h_t(x)}{\sum_{t=1}^T |\alpha_t|} \quad \text{for } x \in \mathcal{X},
\]
so $y_i \hat{g}(x_i)$ is the margin achieved on example $(x_i, y_i)$. We may assume without loss of generality that $\alpha_t \ge 0$ for each $t = 1, \dots, T$ (by replacing $h_t$ with $-h_t$ as needed). Fix a value $\theta \in (0,1)$, and consider the fraction of training examples on which $\hat{g}$ achieves a margin of at most $\theta$:
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{y_i \hat{g}(x_i) \le \theta\}.
\]
This quantity can be bounded above using the arguments from before:
\begin{align*}
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{y_i \hat{g}(x_i) \le \theta\}
&= \frac{1}{n} \sum_{i=1}^n \mathbf{1}\left\{ y_i \hat{\ell}(x_i) \le \theta \sum_{t=1}^T \alpha_t \right\} \\
&\le \exp\left( \theta \sum_{t=1}^T \alpha_t \right) \cdot \frac{1}{n} \sum_{i=1}^n \exp(-y_i \hat{\ell}(x_i)) \\
&= \exp\left( \theta \sum_{t=1}^T \alpha_t \right) \cdot \prod_{t=1}^T \sqrt{1 - s_t^2} \\
&= \prod_{t=1}^T \sqrt{(1 + s_t)^{1+\theta} (1 - s_t)^{1-\theta}},
\end{align*}
where the last step uses $\exp(\theta \alpha_t) = \left( \frac{1 + s_t}{1 - s_t} \right)^{\theta/2}$, which follows from the definition of $\alpha_t$.
Suppose that for some $\gamma > 0$, we have $s_t \ge 2\gamma$ for all $t = 1, \dots, T$. If $\theta < \gamma$, then using calculus, it can be shown that each term in the product is less than $1$:
\[
\sqrt{(1 + s_t)^{1+\theta} (1 - s_t)^{1-\theta}} < 1.
\]
Hence, the bound decreases to zero exponentially fast with $T$.
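The calculus claim can be sanity-checked numerically; the sketch below simply evaluates the factor on a grid of values satisfying $s_t \ge 2\gamma$ and $\theta < \gamma$ (an illustration, not a proof):

```python
import numpy as np

# Check that sqrt((1+s)**(1+theta) * (1-s)**(1-theta)) < 1 whenever
# s >= 2*gamma and theta < gamma, for a few sample values.
for gamma in [0.05, 0.1, 0.3]:
    for theta in [0.0, gamma / 2, 0.9 * gamma]:
        for s in np.linspace(2 * gamma, 0.999, 200):
            factor = np.sqrt((1 + s) ** (1 + theta) * (1 - s) ** (1 - theta))
            assert factor < 1.0
print("every factor is below 1")
```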