
The Boosting Approach to Machine Learning
Maria-Florina Balcan 03/16/2015

Boosting
• General method for improving the accuracy of any given learning algorithm.
• Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor.
• Works amazingly well in practice: Adaboost and its variations are among the top 10 machine-learning algorithms.
• Backed up by solid foundations.

Readings:
• The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001
• Theory and Applications of Boosting. NIPS tutorial. http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf
Plan for today:
• Motivation.
• A bit of history.
• Adaboost: algorithm, guarantees, discussion.
• Focus on supervised classification.


An Example: Spam Detection
E.g., classify which emails are spam and which are important.
[Figure: sample emails labeled "spam" and "not spam".]
Key observation/motivation:
• Easy to find rules of thumb that are often correct.
• E.g., "If 'buy now' appears in the message, then predict spam."
• E.g., "If 'say good-bye to debt' appears in the message, then predict spam."
• Harder to find a single rule that is very highly accurate.

An Example: Spam Detection
• Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets:
• apply the weak learner to a subset of emails, obtain a rule of thumb $h_1$
• apply it to a 2nd subset of emails, obtain a 2nd rule of thumb $h_2$
• apply it to a 3rd subset of emails, obtain a 3rd rule of thumb $h_3$
• repeat T times; combine the weak rules $h_1, \ldots, h_T$ into a single highly accurate rule.
[Figure: a pool of emails repeatedly fed to the weak learner, producing rules of thumb $h_1, h_2, h_3, \ldots, h_T$.]

Boosting: Important Aspects
How to choose examples on each round?
• Typically, concentrate on the "hardest" examples (those most often misclassified by the previous rules of thumb).
How to combine the rules of thumb into a single prediction rule?
• Take a (weighted) majority vote of the rules of thumb.
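To make the weighted-majority-vote combination concrete, here is a minimal Python sketch; the stub rules and the weights are hypothetical choices for illustration only (AdaBoost, introduced below, derives such weights from each rule's error):

def weighted_majority_vote(rules, alphas, x):
    """Combine rules of thumb by a weighted majority vote.

    rules  : list of functions, each mapping an example x to a label in {-1, +1}
    alphas : list of nonnegative weights, one per rule
    """
    total = sum(alpha * rule(x) for rule, alpha in zip(rules, alphas))
    return 1 if total >= 0 else -1

# Hypothetical spam rules of thumb (illustrative stubs only).
rules = [
    lambda email: 1 if "buy now" in email else -1,
    lambda email: 1 if "good-bye to debt" in email else -1,
]
alphas = [0.7, 0.4]  # assumed weights; AdaBoost would compute these from the rules' errors
print(weighted_majority_vote(rules, alphas, "buy now and say good-bye to debt"))  # prints 1 (spam)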

Historically….

Weak Learning vs Strong/PAC Learning
• [Kearns & Valiant '88]: defined weak learning: being able to predict better than random guessing (error $\leq 1/2 - \gamma$), consistently.
• Posed an open problem: "Does there exist a boosting algorithm that turns a weak learner into a strong PAC learner (one that can produce arbitrarily accurate hypotheses)?"
• Informally, given a "weak" learning algorithm that can consistently find classifiers of error $\leq 1/2 - \gamma$, a boosting algorithm would provably construct a single classifier with error $\leq \epsilon$.

Weak Learning vs Strong/PAC Learning
Strong (PAC) Learning:
• $\exists$ algorithm A
• $\forall c \in H$
• $\forall D$
• $\forall \epsilon > 0$
• $\forall \delta > 0$
• A produces h s.t. $\Pr[err(h) \geq \epsilon] \leq \delta$

Weak Learning:
• $\exists$ algorithm A, $\exists \gamma > 0$
• $\forall c \in H$
• $\forall D$
• $\forall \epsilon \geq 1/2 - \gamma$
• $\forall \delta > 0$
• A produces h s.t. $\Pr[err(h) \geq \epsilon] \leq \delta$
• [Kearns & Valiant '88]: defined weak learning & posed the open problem of finding a boosting algorithm.


Surprisingly….
Weak Learning = Strong (PAC) Learning

Original Construction [Schapire '89]:
• A poly-time boosting algorithm that exploits the fact that we can learn a little on every distribution.
• A modest booster is obtained by calling the weak learning algorithm on 3 distributions: error $\beta < \frac{1}{2} - \gamma$ → error $3\beta^2 - 2\beta^3$.
• The modest boost in accuracy is then amplified by running this construction recursively.
• Cool conceptually and technically, but not very practical.
• Led to an explosion of subsequent work.

Adaboost (Adaptive Boosting)
"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" [Freund-Schapire, JCSS'97]. Gödel Prize winner 2003.

Informal Description of Adaboost
• Boosting: turns a weak algorithm into a strong (PAC) learner.
Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, with $x_i \in X$, $y_i \in Y = \{-1, 1\}$; a weak learning algorithm A (e.g., Naïve Bayes, decision stumps).
• For $t = 1, 2, \ldots, T$:
• Construct $D_t$ on $\{x_1, \ldots, x_m\}$.
• Run A on $D_t$, producing $h_t : X \to \{-1, 1\}$ (a weak classifier); let $\epsilon_t = \Pr_{x_i \sim D_t}(h_t(x_i) \neq y_i)$ be the error of $h_t$ over $D_t$.
• Output $H_{final}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
Roughly speaking, $D_{t+1}$ increases the weight on $x_i$ if $h_t$ is incorrect on $x_i$, and decreases it if $h_t$ is correct.

Adaboost: A toy example
Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)

[Figures: successive rounds of AdaBoost on a 2D toy dataset; each round reweights the misclassified points and fits a new stump, and the final classifier combines the stumps.]
Adaboost (Adaptive Boosting)
• Weak learning algorithm A.
• For $t = 1, 2, \ldots, T$:
• Construct $D_t$ on $\{x_1, \ldots, x_m\}$.
• Run A on $D_t$, producing $h_t$.

Constructing $D_t$:
• $D_1$ uniform on $\{x_1, \ldots, x_m\}$ [i.e., $D_1(i) = \frac{1}{m}$].
• Given $D_t$ and $h_t$, set
\[
D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t} \;\text{ if } y_i = h_t(x_i), \qquad
D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{\alpha_t} \;\text{ if } y_i \neq h_t(x_i);
\]
equivalently, $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$, where $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} > 0$ and $Z_t$ is a normalizer.
• $D_{t+1}$ puts half of its weight on examples $x_i$ where $h_t$ is incorrect and half on examples where $h_t$ is correct.
Final hypothesis: $H_{final}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$.
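To make the algorithm above concrete, here is a minimal sketch of AdaBoost in Python, using depth-one decision stumps as the weak learner A; the function names and the brute-force stump search are illustrative choices of ours, not part of the original lecture:

import numpy as np

def train_stump(X, y, D):
    """Weak learner: exhaustive search for the decision stump
    (feature j, threshold, sign) with smallest weighted error under D."""
    best, best_err = None, float("inf")
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thresh, sign = stump
    return np.where(X[:, j] > thresh, sign, -sign)

def adaboost(X, y, T):
    """AdaBoost; labels y must be in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1 uniform: D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump, eps = train_stump(X, y, D)      # run weak learner A on D_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # numerical guard for eps_t in {0, 1}
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1 - eps_t) / eps_t)
        pred = stump_predict(stump, X)
        D = D * np.exp(-alpha * y * pred)      # D_{t+1}(i) proportional to D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D /= D.sum()                           # normalize by Z_t
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """H_final(x) = sign(sum_t alpha_t h_t(x))."""
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

Typical use would be stumps, alphas = adaboost(X_train, y_train, T=10) followed by predict(stumps, alphas, X_test); the clipping of eps is only a numerical guard for the degenerate case of a perfect (or perfectly bad) stump.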

Nice Features of Adaboost
• Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps)!
• Very fast (a single pass through the data each round) and simple to code, with no parameters to tune.
• Shift in mindset: goal is now just to find classifiers a bit better than random guessing.
• Grounded in rich theory.
• Relevant for big data age: quickly focuses on “core difficulties”, well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT’12].

Analyzing Training Error
Theorem
Let $\epsilon_t = 1/2 - \gamma_t$ (the error of $h_t$ over $D_t$). Then
\[
err_S(H_{final}) \leq \exp\left(-2 \sum_t \gamma_t^2\right).
\]
So, if $\gamma_t \geq \gamma > 0$ for all $t$, then $err_S(H_{final}) \leq \exp(-2\gamma^2 T)$.
The training error drops exponentially in T!
To get $err_S(H_{final}) \leq \epsilon$, only $T = O\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\right)$ rounds are needed (a worked instance follows below).

Adaboost is adaptive:
• It does not need to know $\gamma$ or $T$ a priori.
• It can exploit $\gamma_t \gg \gamma$.
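As a quick worked instance of the bound above (with illustrative numbers of our own choosing): if every round has edge $\gamma_t \geq \gamma = 0.1$, then
\[
err_S(H_{final}) \leq e^{-2\gamma^2 T} = e^{-0.02\,T},
\]
so driving the training error below $\epsilon = 0.01$ takes only
\[
e^{-0.02\,T} \leq 0.01 \iff T \geq \frac{\ln 100}{0.02} \approx 231
\]
rounds, consistent with $T = O\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\right)$.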

Understanding the Updates & Normalization

Claim: $D_{t+1}$ puts half of its weight on the $x_i$ where $h_t$ was incorrect and half on the $x_i$ where $h_t$ was correct.

Recall $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$, and note that $e^{\alpha_t} = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$.

Weight on incorrect examples:
\[
\Pr_{D_{t+1}}[y_i \neq h_t(x_i)] = \sum_{i:\, y_i \neq h_t(x_i)} \frac{D_t(i)}{Z_t}\, e^{\alpha_t} = \frac{\epsilon_t\, e^{\alpha_t}}{Z_t} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}.
\]

Weight on correct examples:
\[
\Pr_{D_{t+1}}[y_i = h_t(x_i)] = \sum_{i:\, y_i = h_t(x_i)} \frac{D_t(i)}{Z_t}\, e^{-\alpha_t} = \frac{(1-\epsilon_t)\, e^{-\alpha_t}}{Z_t} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}.
\]

The two probabilities are equal, and since
\[
Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)},
\]
each equals $1/2$.
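As a small numeric sanity check of this claim (a sketch of our own, not from the lecture), one AdaBoost-style update should put exactly half the weight on the misclassified examples, whatever $\epsilon_t$ is:

import numpy as np

m = 10
y = np.ones(m)                         # labels, all +1 for simplicity
h = np.ones(m); h[:3] = -1             # weak classifier wrong on 3 of 10 examples
D = np.full(m, 1.0 / m)                # D_t uniform

eps = D[h != y].sum()                  # epsilon_t = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
D_next = D * np.exp(-alpha * y * h)
D_next /= D_next.sum()                 # divide by Z_t

print(D_next[h != y].sum())            # -> 0.5 (half the weight on mistakes)
print(D_next[h == y].sum())            # -> 0.5 (half on correct examples)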

Analyzing Training Error: Proof Intuition

Theorem: if $\epsilon_t = 1/2 - \gamma_t$ (the error of $h_t$ over $D_t$), then $err_S(H_{final}) \leq \exp\left(-2\sum_t \gamma_t^2\right)$.

• On round $t$, we increase the weight of the $x_i$ on which $h_t$ is wrong.
• If $H_{final}$ incorrectly classifies $x_i$:
– then $x_i$ is incorrectly classified by a (weighted) majority of the $h_t$'s,
– which implies the final probability weight of $x_i$ is large: one can show it is $\geq \frac{1}{m} \cdot \frac{1}{\prod_t Z_t}$.
• Since the weights sum to 1, there cannot be too many high-weight examples: one can show the number incorrectly classified is $\leq m \prod_t Z_t$.
• And $\prod_t Z_t \to 0$.

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$ [the unthresholded weighted vote of the $h_t$'s on $x_i$].

Step 2: $err_S(H_{final}) \leq \prod_t Z_t$.

Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq e^{-2\sum_t \gamma_t^2}$.

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.

Recall that $D_1(i) = \frac{1}{m}$ and $D_{t+1}(i) = D_t(i) \cdot \frac{\exp(-y_i \alpha_t h_t(x_i))}{Z_t}$. Unwrapping:
\[
D_{T+1}(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \times D_T(i)
= \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \times \frac{\exp(-y_i \alpha_{T-1} h_{T-1}(x_i))}{Z_{T-1}} \times D_{T-1}(i)
\]
\[
= \cdots = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \times \cdots \times \frac{\exp(-y_i \alpha_1 h_1(x_i))}{Z_1} \times \frac{1}{m}
= \frac{1}{m} \cdot \frac{\exp(-y_i(\alpha_1 h_1(x_i) + \cdots + \alpha_T h_T(x_i)))}{Z_1 \cdots Z_T}
= \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}.
\]
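A small numeric check of the unwrapped form (a sketch of our own, with arbitrary predictions and weights): the iteratively computed $D_{T+1}$ should match the closed form from Step 1 exactly:

import numpy as np

rng = np.random.default_rng(1)
m, T = 8, 5
y = rng.choice([-1.0, 1.0], size=m)       # labels
H = rng.choice([-1.0, 1.0], size=(T, m))  # h_t(x_i) for each round t
alphas = rng.random(T)                    # arbitrary positive alpha_t

# Iterative form: D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
D = np.full(m, 1.0 / m)
Zs = []
for t in range(T):
    D = D * np.exp(-alphas[t] * y * H[t])
    Zs.append(D.sum())                    # Z_t, since the previous D was normalized
    D /= D.sum()

# Closed form from Step 1: D_{T+1}(i) = exp(-y_i f(x_i)) / (m * prod_t Z_t)
f = (alphas[:, None] * H).sum(axis=0)     # f(x_i) = sum_t alpha_t h_t(x_i)
closed = np.exp(-y * f) / (m * np.prod(Zs))

print(np.allclose(D, closed))             # -> True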

Analyzing Training Error: Proof Math

Step 2: $err_S(H_{final}) \leq \prod_t Z_t$. This combines Step 1 with the fact that the exponential loss upper-bounds the 0/1 loss, i.e., $\mathbf{1}_{y_i f(x_i) \leq 0} \leq \exp(-y_i f(x_i))$:
\[
err_S(H_{final}) = \frac{1}{m}\sum_i \mathbf{1}_{y_i \neq H_{final}(x_i)}
= \frac{1}{m}\sum_i \mathbf{1}_{y_i f(x_i) \leq 0}
\leq \frac{1}{m}\sum_i \exp(-y_i f(x_i))
= \sum_i D_{T+1}(i) \prod_t Z_t = \prod_t Z_t.
\]

[Figure: the 0/1 loss and the exponential loss $e^{-yf(x)}$, which upper-bounds it.]

Analyzing Training Error: Proof Math

Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq e^{-2\sum_t \gamma_t^2}$.

Note: recall $Z_t = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$; indeed, $\alpha_t$ is the minimizer of $\alpha \mapsto (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$. With $\epsilon_t = 1/2 - \gamma_t$, we get $2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{4(1/2-\gamma_t)(1/2+\gamma_t)} = \sqrt{1 - 4\gamma_t^2}$.
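The last inequality in Step 3 is just the standard bound $1 + x \leq e^x$, applied with $x = -4\gamma_t^2$:
\[
\sqrt{1 - 4\gamma_t^2} \leq \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2},
\qquad\text{so}\qquad
\prod_t \sqrt{1 - 4\gamma_t^2} \leq e^{-2\sum_t \gamma_t^2}.
\]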


Generalization Guarantees
Theorem: $err_S(H_{final}) \leq \exp\left(-2\sum_t \gamma_t^2\right)$, where $\epsilon_t = 1/2 - \gamma_t$.

How about generalization guarantees?

Original analysis [Freund & Schapire '97]:
• H = space of weak hypotheses; d = VCdim(H).
• $H_{final}$ is a weighted vote, so the hypothesis class is G = {all functions of the form $\mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$}.

Theorem [Freund & Schapire '97]: $\forall g \in G$, $err(g) \leq err_S(g) + O\left(\sqrt{\frac{Td}{m}}\right)$, where T = # of rounds.

Key reason: $\mathrm{VCdim}(G) = O(dT)$, plus typical VC bounds.

Generalization Guarantees
Theorem [Freund & Schapire '97]: $\forall g \in co(H)$, $err(g) \leq err_S(g) + O\left(\sqrt{\frac{Td}{m}}\right)$, where d = VCdim(H) and T = # of rounds.

[Figure: error vs. complexity; the training error decreases with the number of rounds T, while the generalization-error bound's complexity term grows.]

Generalization Guarantees
• Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.
• Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

• These results seem to contradict the [Freund & Schapire '97] bound and Occam's razor (in order to achieve good test error, the classifier should be as simple as possible)!

How can we explain the experiments?
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods".
Key Idea:
Training error does not tell the whole story.
We also need to consider the classification confidence!