A Brief Introduction to Boosting

Robert E. Schapire
AT&T Labs, Shannon Laboratory
180 Park Avenue, Florham Park, NJ, USA
www.research.att.com/~schapire
schapire@research.att.com

Abstract

Boosting is a general method for improving the accuracy of any given learning algorithm. This short paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting. Some examples of recent applications of boosting are also described.

Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
Background

Boosting is a general method which attempts to "boost" the accuracy of any given learning algorithm. Boosting has its roots in a theoretical framework for studying machine learning called the "PAC" learning model, due to Valiant; see Kearns and Vazirani for a good introduction to this model. Kearns and Valiant were the first to pose the question of whether a "weak" learning algorithm, which performs just slightly better than random guessing in the PAC model, can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire came up with the first provable polynomial-time boosting algorithm in 1989. A year later, Freund developed a much more efficient boosting algorithm which, although optimal in a certain sense, nevertheless suffered from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard on an OCR task.
AdaBoost

The AdaBoost algorithm, introduced in 1995 by Freund and Schapire, solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper, we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
- Train the weak learner using distribution $D_t$.
- Get a weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$.
- Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
- Update
  $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} \;=\; \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t},$$
  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis:
$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$

Figure 1: The boosting algorithm AdaBoost.

The weak learner's job is to find a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the distribution $D_t$. The goodness of a weak hypothesis is measured by its error
$$\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i] = \sum_{i : h_t(x_i) \ne y_i} D_t(i).$$
Notice that the error is measured with respect to the distribution $D_t$ on which the weak learner was trained. In practice, the weak learner may be an algorithm that can use the weights $D_t$ on the training examples. Alternatively, when this is not possible, a subset of the training examples can be sampled according to $D_t$, and these (unweighted) resampled examples can be used to train the weak learner.
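To make the pseudocode of Fig. 1 concrete, the following is a minimal Python sketch of the AdaBoost loop with a simple exhaustive decision stump as the weak learner. The helper names (`train_stump`, `adaboost`) and the numerical clipping of $\epsilon_t$ are illustrative choices of this sketch, not part of the original presentation.

```python
import numpy as np

def train_stump(X, y, D):
    """Pick the single-feature threshold stump with lowest weighted error."""
    m, n = X.shape
    best_err, best_params = np.inf, None
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best_err, best_params = err, (j, thresh, sign)
    j, thresh, sign = best_params
    return lambda Z: np.where(Z[:, j] <= thresh, sign, -sign)

def adaboost(X, y, T=50):
    """AdaBoost for labels y in {-1, +1}, following Fig. 1."""
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        h = train_stump(X, y, D)            # train weak learner on D_t
        pred = h(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)                   # alpha_t
        D = D * np.exp(-alpha * y * pred)   # increase weight on mistakes
        D /= D.sum()                        # normalize by Z_t
        hyps.append(h)
        alphas.append(alpha)
    # final hypothesis: weighted majority vote of the weak hypotheses
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hyps)))
```

Here `X` is an m-by-n numpy array and `y` a vector of labels in $\{-1, +1\}$; a practical implementation would use a stronger or more efficient base learner than this brute-force stump search.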
Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t$ as in the figure. Intuitively, $\alpha_t$ measures the importance that is assigned to $h_t$. Note that $\alpha_t \ge 0$ if $\epsilon_t \le 1/2$ (which we can assume without loss of generality), and that $\alpha_t$ gets larger as $\epsilon_t$ gets smaller.

The distribution $D_t$ is next updated using the rule shown in the figure. The effect of this rule is to increase the weight of examples misclassified by $h_t$ and to decrease the weight of correctly classified examples. Thus, the weight tends to concentrate on "hard" examples.

The final hypothesis $H$ is a weighted majority vote of the $T$ weak hypotheses, where $\alpha_t$ is the weight assigned to $h_t$.

Schapire and Singer show how AdaBoost and its analysis can be extended to handle weak hypotheses which output real-valued or confidence-rated predictions. That is, for each instance $x$, the weak hypothesis $h_t$ outputs a prediction $h_t(x) \in \mathbb{R}$ whose sign is the predicted label ($-1$ or $+1$) and whose magnitude $|h_t(x)|$ gives a measure of confidence in the prediction.
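As a rough sketch of how such a confidence-rated weak hypothesis can be built, the code below follows the domain-partitioning recipe in the spirit of Schapire and Singer's framework: on each block of a partition of the instance space, the hypothesis predicts half the log-ratio of positive to negative weight in that block, and the per-round weight $\alpha_t$ is folded into $h_t$ itself. The smoothing constant and helper names are assumptions of this sketch.

```python
import numpy as np

def confidence_rated_stump(X, y, D, j, thresh, smooth=1e-6):
    """Two-block partition on feature j at `thresh`; on each block output the
    real value 0.5 * ln(W_plus / W_minus), where W_plus / W_minus are the total
    weights of positive / negative examples falling in the block."""
    left = X[:, j] <= thresh
    def block_value(mask):
        w_pos = D[mask & (y == +1)].sum() + smooth
        w_neg = D[mask & (y == -1)].sum() + smooth
        return 0.5 * np.log(w_pos / w_neg)
    c_left, c_right = block_value(left), block_value(~left)
    return lambda Z: np.where(Z[:, j] <= thresh, c_left, c_right)

def update_distribution(D, y, h_values):
    """With real-valued h_t, the update is D_{t+1}(i) proportional to
    D_t(i) * exp(-y_i * h_t(x_i)); no separate alpha_t is needed."""
    D_new = D * np.exp(-y * h_values)
    return D_new / D_new.sum()
```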
Analyzing the training error
The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Let us write the error $\epsilon_t$ of $h_t$ as $\frac{1}{2} - \gamma_t$. Since a hypothesis that guesses each instance's class at random has an error rate of $1/2$ (on binary problems), $\gamma_t$ thus measures how much better than random are $h_t$'s predictions. Freund and Schapire prove that the training error (the fraction of mistakes on the training set) of the final hypothesis $H$ is at most
$$\prod_t \left[ 2\sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left( -2 \sum_t \gamma_t^2 \right).$$
Thus, if each weak hypothesis is slightly better than random, so that $\gamma_t \ge \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast.

A similar property is enjoyed by previous boosting algorithms. However, previous algorithms required that such a lower bound $\gamma$ be known a priori before boosting begins. In practice, knowledge of such a bound is very difficult to obtain. AdaBoost, on the other hand, is adaptive in that it adapts to the error rates of the individual weak hypotheses. This is the basis of its name: "Ada" is short for "adaptive."

The bound given above, combined with the bounds on generalization error given below, proves that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a weak learning algorithm (which can always generate a hypothesis with a weak edge for any distribution) into a strong learning algorithm (which can generate a hypothesis with an arbitrarily low error rate, given sufficient data).
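Since the per-round errors $\epsilon_t$ are available during a run, this bound is easy to evaluate numerically. The following short sketch (the helper name is illustrative) computes both forms of the bound:

```python
import numpy as np

def training_error_bound(eps):
    """Upper bounds on the training error of the final hypothesis H:
    prod_t 2*sqrt(eps_t*(1-eps_t))  and  exp(-2 * sum_t gamma_t^2),
    where gamma_t = 1/2 - eps_t."""
    eps = np.asarray(eps, dtype=float)
    gammas = 0.5 - eps
    product_bound = np.prod(2.0 * np.sqrt(eps * (1.0 - eps)))
    exp_bound = np.exp(-2.0 * np.sum(gammas ** 2))
    return product_bound, exp_bound

# Example: ten rounds with edge gamma_t = 0.1 (i.e., eps_t = 0.4) already force
# the training error of H below roughly 0.82 under both forms of the bound.
print(training_error_bound([0.4] * 10))
```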
Generalization error
Freund and Schapire showed how to bound the generalization error of the final hypothesis in terms of its training error, the size $m$ of the sample, the VC-dimension $d$ of the weak hypothesis space, and the number of rounds $T$ of boosting. (The VC-dimension is a standard measure of the "complexity" of a space of hypotheses; see, for instance, Blumer et al.) Specifically, they used techniques from Baum and Haussler to show that the generalization error, with high probability, is at most
$$\hat{\Pr}\left[ H(x) \ne y \right] + \tilde{O}\left( \sqrt{\frac{Td}{m}} \right),$$
where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes large. In fact, this sometimes does happen. However, in early experiments, several authors observed empirically that boosting often does not overfit, even when run for thousands of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm on the "letter" dataset.

Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the "letter" dataset, as reported by Schapire et al. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: the cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

In response to these empirical findings, Schapire et al., following the work of Bartlett, gave an alternative analysis in terms of the margins of the training examples. The margin of example $(x, y)$ is defined to be
$$\frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}.$$
It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies the example. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most
$$\hat{\Pr}\left[ \mathrm{margin}(x, y) \le \theta \right] + \tilde{O}\left( \sqrt{\frac{d}{m\theta^2}} \right)$$
for any $\theta > 0$, with high probability. Note that this bound is entirely independent of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense), since it concentrates on the examples with the smallest margins (whether positive or negative). Boosting's effect on the margins can be seen empirically, for instance, on the right side of Fig. 2, which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even after the training error reaches zero, boosting continues to increase the margins of the training examples, effecting a corresponding drop in the test error.
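The margins of a trained ensemble are straightforward to compute empirically. The sketch below assumes `alphas` and `hyps` come from a run like the earlier AdaBoost sketch, and evaluates the empirical probability $\hat{\Pr}[\mathrm{margin}(x, y) \le \theta]$ that appears in the bound above:

```python
import numpy as np

def margins(alphas, hyps, X, y):
    """Margin of each example: y * sum_t alpha_t h_t(x) / sum_t alpha_t,
    a value in [-1, +1] when each h_t maps into {-1, +1}."""
    votes = sum(a * h(X) for a, h in zip(alphas, hyps))
    return y * votes / sum(alphas)

def empirical_margin_cdf(margin_values, theta):
    """Empirical probability Pr[margin(x, y) <= theta] on the training sample."""
    return np.mean(margin_values <= theta)
```

Plotting `empirical_margin_cdf` over a range of thresholds reproduces the kind of cumulative margin distribution shown on the right of Fig. 2.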
Attempts (not always successful) to use the insights gleaned from the theory of margins have been made by several authors. In addition, the margin theory points to a strong connection between boosting and the support-vector machines of Vapnik and others, which explicitly attempt to maximize the minimum margin.

The behavior of AdaBoost can also be understood in a game-theoretic setting, as explored by Freund and Schapire (see also Grove and Schuurmans, and Breiman). In particular, boosting can be viewed as repeated play of a certain game, and AdaBoost can be shown to be a special case of a more general algorithm for playing repeated games and for approximately solving a game. This also shows that boosting is related to linear programming.
Multiclass classification
There are several methods of extending AdaBoost to the multiclass case. The most straightforward generalization, called AdaBoost.M1, is adequate when the weak learner is strong enough to achieve reasonably high accuracy, even on the hard distributions created by AdaBoost. However, this method fails if the weak learner cannot achieve at least 50% accuracy when run on these hard distributions.

For the latter case, several more sophisticated methods have been developed. These generally work by reducing the multiclass problem to a larger binary problem. Schapire and Singer's algorithm AdaBoost.MH works by creating a set of binary problems, for each example $x$ and each possible label $y$, of the form: "For example $x$, is the correct label $y$, or is it one of the other labels?" Freund and Schapire's algorithm AdaBoost.M2 (which is a special case of Schapire and Singer's AdaBoost.MR algorithm) instead creates binary problems, for each example $x$ with correct label $y$ and each incorrect label $y'$, of the form: "For example $x$, is the correct label $y$ or $y'$?"
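As a rough illustration of the first style of reduction, the sketch below expands each multiclass example into one binary example per candidate label. This is a simplified, one-versus-rest rendering of the idea behind AdaBoost.MH, not the algorithm itself; the function and variable names are hypothetical.

```python
def expand_one_vs_rest(examples, labels):
    """Turn multiclass examples (x, y) into binary examples ((x, l), +1/-1):
    +1 if l is the correct label for x, -1 otherwise. Binary AdaBoost can then
    be run on the expanded set, treating the candidate label l as part of the
    instance."""
    binary = []
    for x, y in examples:
        for l in labels:
            binary.append(((x, l), +1 if l == y else -1))
    return binary

# Example with three classes:
examples = [([0.2, 1.3], "a"), ([0.7, 0.1], "c")]
print(expand_one_vs_rest(examples, labels=["a", "b", "c"]))
```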
These methods require additional effort in the design of the weak learning algorithm. A different technique, which incorporates Dietterich and Bakiri's method of error-correcting output codes, achieves similar provable bounds to those of AdaBoost.MH and AdaBoost.M2, but can be used with any weak learner which can handle simple, binary labeled data. Schapire and Singer give yet another method of combining boosting with error-correcting output codes.
Experiments and applications
Practically, AdaBoost has many advantages. It is fast, simple and easy to program. It has no parameters to tune (except for the number of rounds $T$). It requires no prior knowledge about the weak learner and so can be flexibly combined with any method for finding weak hypotheses. Finally, it comes with a set of theoretical guarantees, given sufficient data and a weak learner that can reliably provide only moderately accurate weak hypotheses. This is a shift in mind set for the learning-system designer: instead of trying to design a learning algorithm that is accurate over the entire space, we can instead focus on finding weak learning algorithms that only need to be better than random.

On the other hand, some caveats are certainly in order. The actual performance of boosting on a particular problem is clearly dependent on the data and the weak learner. Consistent with theory, boosting can fail to perform well given insufficient data, overly complex weak hypotheses, or weak hypotheses which are too weak. Boosting also seems to be especially susceptible to noise.

AdaBoost has been tested empirically by many researchers. For instance, Freund and Schapire tested AdaBoost on a set of UCI benchmark datasets, using C4.5 as a weak learning algorithm, as well as an algorithm which finds the best "decision stump" or single-test decision tree. Some of the results of these experiments are shown in Fig. 3. As can be seen from this figure, even boosting the weak decision stumps can usually give results as good as C4.5, while boosting C4.5 generally gives the decision-tree algorithm a significant improvement in performance.

Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of benchmark problems, as reported by Freund and Schapire. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark. The y-coordinate of each point gives the test error rate (in percent) of C4.5 on the given benchmark, and the x-coordinate gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple runs.
In another set of experiments, Schapire and Singer used boosting for text categorization tasks. For this work, weak hypotheses were used which test on the presence or absence of a word or phrase. Some results of these experiments, comparing AdaBoost to four other methods, are shown in Fig. 4. In nearly all of these experiments, and for all of the performance measures tested, boosting performed as well as or significantly better than the other methods tested. Boosting has also been applied to text filtering and to ranking problems.

Figure 4: Comparison of error rates for AdaBoost and four other text categorization methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping-experts), as reported by Schapire and Singer. The algorithms were tested on two text corpora, Reuters newswire articles (left) and AP newswire headlines (right), with varying numbers of class labels as indicated on the x-axis of each figure.
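A weak hypothesis of this kind is easy to sketch. The names below are illustrative, and the original system additionally uses confidence-rated (real-valued) predictions rather than the hard labels shown here:

```python
def word_presence_hypothesis(term, label_if_present=+1):
    """Weak hypothesis for text: predict one label if `term` occurs in the
    document, and the opposite label otherwise."""
    def h(document):
        tokens = document.lower().split()
        return label_if_present if term in tokens else -label_if_present
    return h

h = word_presence_hypothesis("money")
print(h("Wire the money today"), h("Weather report for Tuesday"))  # prints: 1 -1
```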
A nice property of AdaBoost is its ability to identify outliers, i.e., examples that are either mislabeled in the training data or which are inherently ambiguous and hard to categorize. Because AdaBoost focuses its weight on the hardest examples, the examples with the highest weight often turn out to be outliers. An example of this phenomenon can be seen in Fig. 5, taken from an OCR experiment conducted by Freund and Schapire.

Figure 5: A sample of the examples that have the largest weight on an OCR task, as reported by Freund and Schapire. The examples in the top, middle and bottom lines were chosen after increasing numbers of rounds of boosting. Underneath each image is a line of the form d:l1/w1,l2/w2, where d is the label of the example, l1 and l2 are the labels that receive the highest and second highest vote from the combined hypothesis at that point in the run of the algorithm, and w1, w2 are the corresponding normalized scores.
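Because the weight concentrates on the hardest examples, a cheap diagnostic is to inspect the training examples that end up with the largest weights. A small sketch, assuming the final distribution `D` from a run such as the one sketched earlier has been kept:

```python
import numpy as np

def highest_weight_examples(D, k=10):
    """Indices of the k training examples with the largest final AdaBoost
    weights; these are frequent candidates for mislabeled or ambiguous
    examples."""
    return np.argsort(D)[::-1][:k]
```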
References
Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory.
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, to appear.
Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Computation.
Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery.
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory.
Leo Breiman. Arcing the edge. Technical report, Statistics Department, University of California at Berkeley.
Leo Breiman. Prediction games and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.
Leo Breiman. Arcing classifiers. The Annals of Statistics.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning.
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, to appear.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research.
Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems.
Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence.
Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation.
Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference.
Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference.
Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.
Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems.
Michael Kearns and Leslie G. Valiant. Learning Boolean formulae or finite automata is as hard as factoring. Technical report, Harvard University Aiken Computation Laboratory.
Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery.
Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press.
Richard Maclin and David Opitz. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence.
Llew Mason, Peter Bartlett, and Jonathan Baxter. Direct optimization of margins improves generalization in combined classifiers. Technical report, Department of Systems Engineering, Australian National University.
C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/MLRepository.html.
J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Robert E. Schapire. The strength of weak learnability. Machine Learning.
Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics.
Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. To appear, Machine Learning.
Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, to appear.
Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio applied to text filtering. In SIGIR '98: Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval.
Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems.
L. G. Valiant. A theory of the learnable. Communications of the ACM.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer.