
A Brief Introduction to Boosting

Robert E. Schapire
AT&T Labs, Shannon Laboratory
180 Park Avenue, Room A279, Florham Park, NJ 07932, USA
www.research.att.com/~schapire
schapire@research.att.com

Abstract

Boosting is a general method for improving the accuracy of any given learning algorithm. This short paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting. Some examples of recent applications of boosting are also described.

Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.

Background
Boosting is a general method which attempts to "boost" the accuracy of any given learning algorithm. Boosting has its roots in a theoretical framework for studying machine learning called the "PAC" learning model, due to Valiant [37]; see Kearns and Vazirani [24] for a good introduction to this model. Kearns and Valiant [22, 23] were the first to pose the question of whether a "weak" learning algorithm which performs just slightly better than random guessing in the PAC model can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [30] came up with the first provable polynomial-time boosting algorithm in 1989. A year later, Freund [14] developed a much more efficient boosting algorithm which, although optimal in a certain sense, nevertheless suffered from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard [13] on an OCR task.

AdaBoost

The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [18], solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1.

    Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
    Initialize $D_1(i) = 1/m$.
    For $t = 1, \ldots, T$:
      - Train weak learner using distribution $D_t$.
      - Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
      - Choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
      - Update:
        $$D_{t+1}(i) \;=\; \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} \;=\; \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t},$$
        where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
    Output the final hypothesis:
    $$H(x) = \mathrm{sign}\Bigl(\sum_{t=1}^{T} \alpha_t h_t(x)\Bigr).$$

Figure 1: The boosting algorithm AdaBoost.

The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper, we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

The weak learner's job is to find a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the distribution $D_t$. The goodness of a weak hypothesis is measured by its error

$$\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i : h_t(x_i) \neq y_i} D_t(i).$$

Notice that the error is measured with respect to the distribution $D_t$ on which the weak learner was trained.
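To make this weighted error concrete, the following is a minimal sketch (not from the original paper; Python with NumPy, and the function and variable names are illustrative assumptions) that evaluates $\epsilon_t$ for a candidate hypothesis under a distribution $D_t$:

```python
import numpy as np

def weighted_error(h, X, y, D):
    """Weighted error eps_t = sum of D_t(i) over the examples that h misclassifies."""
    predictions = np.array([h(x) for x in X])   # each prediction is -1 or +1
    return float(np.sum(D[predictions != y]))

# Tiny illustration: a hypothesis that always predicts +1.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])
D = np.full(len(X), 1.0 / len(X))               # D_1(i) = 1/m
print(weighted_error(lambda x: 1, X, y, D))     # 0.5: half the weight is misclassified
```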

In practice, the weak learner may be an algorithm that can use the weights $D_t$ on the training examples. Alternatively, when this is not possible, a subset of the training examples can be sampled according to $D_t$, and these (unweighted) resampled examples can be used to train the weak learner.

Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset as reported by Schapire et al. [32]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: The cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t$ as in the figure. Intuitively, $\alpha_t$ measures the importance that is assigned to $h_t$. Note that $\alpha_t \geq 0$ if $\epsilon_t \leq 1/2$ (which we can assume without loss of generality), and that $\alpha_t$ gets larger as $\epsilon_t$ gets smaller.

The distribution $D_t$ is next updated using the rule shown in the figure. The effect of this rule is to increase the weight of examples misclassified by $h_t$, and to decrease the weight of correctly classified examples. Thus, the weight tends to concentrate on "hard" examples.

The final hypothesis $H$ is a weighted majority vote of the $T$ weak hypotheses where $\alpha_t$ is the weight assigned to $h_t$.

Schapire and Singer [33] show how AdaBoost and its analysis can be extended to handle weak hypotheses which output real-valued or confidence-rated predictions. That is, for each instance $x$, the weak hypothesis $h_t$ outputs a prediction $h_t(x) \in \mathbb{R}$ whose sign is the predicted label ($-1$ or $+1$) and whose magnitude $|h_t(x)|$ gives a measure of "confidence" in the prediction.
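The round structure of Fig. 1 is compact enough to sketch directly. The following is a minimal, illustrative implementation (not the authors' code; Python with NumPy, the generic weak_learner callable, and the clipping of $\epsilon_t$ are assumptions) of the training loop and the final weighted majority vote:

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost as in Fig. 1: y holds -1/+1 labels; weak_learner(X, y, D) returns h with h(x) in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # train weak learner on distribution D_t
        pred = np.array([h(x) for x in X])
        eps = np.sum(D[pred != y])               # weighted error eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against division by zero / log of zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t) / eps_t)
        D = D * np.exp(-alpha * y * pred)        # D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i))
        D = D / D.sum()                          # normalize by Z_t so D_{t+1} is a distribution
        hypotheses.append(h)
        alphas.append(alpha)

    def H(x):
        # H(x) = sign(sum_t alpha_t h_t(x)); an exact tie yields 0 in this sketch
        return int(np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses))))

    return H, hypotheses, alphas, D
```

Returning the final distribution D in this sketch is a convenience: it makes it easy to inspect which training examples accumulated the most weight, a point revisited in the section on experiments and applications.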
Analyzing the training error

The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Let us write the error $\epsilon_t$ of $h_t$ as $1/2 - \gamma_t$. Since a hypothesis that guesses each instance's class at random has an error rate of $1/2$ (on binary problems), $\gamma_t$ thus measures how much better than random are $h_t$'s predictions. Freund and Schapire [18] prove that the training error (the fraction of mistakes on the training set) of the final hypothesis $H$ is at most

$$\prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)} \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\leq\; \exp\Bigl(-2\sum_t \gamma_t^2\Bigr). \qquad (1)$$

Thus, if each weak hypothesis is slightly better than random so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast.

A similar property is enjoyed by previous boosting algorithms. However, previous algorithms required that such a lower bound $\gamma$ be known a priori before boosting begins. In practice, knowledge of such a bound is very difficult to obtain. AdaBoost, on the other hand, is adaptive in that it adapts to the error rates of the individual weak hypotheses. This is the basis of its name: "Ada" is short for "adaptive."

The bound given in Eq. (1), combined with the bounds on generalization error given below, proves that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a weak learning algorithm (which can always generate a hypothesis with a weak edge for any distribution) into a strong learning algorithm (which can generate a hypothesis with an arbitrarily low error rate, given sufficient data).
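As a quick illustration of how fast the bound of Eq. (1) falls (a worked example added here for illustration, not part of the original text), suppose every weak hypothesis has edge $\gamma_t \geq \gamma = 0.1$. Then

$$\prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \;\leq\; \exp\Bigl(-2\sum_{t=1}^{T}\gamma_t^2\Bigr) \;\leq\; e^{-0.02\,T},$$

which drops below $1/m$ once $T > (\ln m)/(2\gamma^2) = 50\ln m$. Since the training error is a multiple of $1/m$, a bound strictly below $1/m$ forces the training error all the way to zero after that many rounds.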
Generalization error

Freund and Schapire [18] showed how to bound the generalization error of the final hypothesis in terms of its training error, the size $m$ of the sample, the VC-dimension $d$ of the weak hypothesis space and the number of rounds $T$ of boosting. (The VC-dimension is a standard measure of the "complexity" of a space of hypotheses. See, for instance, Blumer et al. [4].) Specifically, they used techniques from Baum and Haussler [3] to show that the generalization error, with high probability, is at most

$$\hat{\Pr}\bigl[H(x) \neq y\bigr] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$

where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes large. In fact, this sometimes does happen. However, in early experiments, several authors [8, 12, 28] observed empirically that boosting often does not overfit, even when run for thousands of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm [29] on the "letter" dataset.

Figure 3: Comparison of C4.5 versus boosting stumps and boosting C4.5 on a set of 27 benchmark problems as reported by Freund and Schapire [16]. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark. The y-coordinate of each point gives the test error rate (in percent) of C4.5 on the given benchmark, and the x-coordinate gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot). All error rates have been averaged over multiple runs.
In response to these empirical findings, Schapire et al. [32], following the work of Bartlett [1], gave an alternative analysis in terms of the margins of the training examples. The margin of example $(x, y)$ is defined to be

$$\frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}.$$

It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies the example. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most

$$\hat{\Pr}\bigl[\mathrm{margin}(x, y) \leq \theta\bigr] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right)$$

for any $\theta > 0$ with high probability. Note that this bound is entirely independent of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative). Boosting's effect on the margins can be seen empirically, for instance, on the right side of Fig. 2 which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even after the training error reaches zero, boosting continues to increase the margins of the training examples, effecting a corresponding drop in the test error.

Attempts (not always successful) to use the insights gleaned from the theory of margins have been made by several authors [6, 20, 26]. In addition, the margin theory points to a strong connection between boosting and the support-vector machines of Vapnik and others [5, 9, 38] which explicitly attempt to maximize the minimum margin.
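The margins themselves, and the cumulative margin distribution plotted on the right of Fig. 2, are straightforward to compute from a trained combination. The sketch below is illustrative only (NumPy and the function names are assumptions), and it reuses the hypotheses and alphas produced by the AdaBoost sketch given earlier:

```python
import numpy as np

def margins(X, y, hypotheses, alphas):
    """margin(x_i, y_i) = y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t, a value in [-1, +1]."""
    alphas = np.asarray(alphas, dtype=float)
    votes = np.array([[h(x) for h in hypotheses] for x in X])   # shape (m, T), entries -1/+1
    return y * (votes @ alphas) / alphas.sum()

def cumulative_margin_distribution(margin_values, theta):
    """Fraction of training examples whose margin is at most theta (one point on a Fig. 2-style curve)."""
    return float(np.mean(margin_values <= theta))
```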
Figure 4: Comparison of error rates for AdaBoost and four other text categorization methods (naive Bayes, probabilistic TF-IDF, Rocchio and sleeping experts) as reported by Schapire and Singer [34]. The algorithms were tested on two text corpora, Reuters newswire articles (left) and AP newswire headlines (right), with varying numbers of class labels as indicated on the x-axis of each figure.
The behavior of AdaBoost can also be understood in a game-theoretic setting as explored by Freund and Schapire [17, 19] (see also Grove and Schuurmans [20] and Breiman [7]). In particular, boosting can be viewed as repeated play of a certain game, and AdaBoost can be shown to be a special case of a more general algorithm for playing repeated games and for approximately solving a game. This also shows that boosting is related to linear programming.

Multiclass classification

There are several methods of extending AdaBoost to the multiclass case. The most straightforward generalization [18], called AdaBoost.M1, is adequate when the weak learner is strong enough to achieve reasonably high accuracy, even on the hard distributions created by AdaBoost. However, this method fails if the weak learner cannot achieve at least 50% accuracy when run on these hard distributions.
For the latter case, several more sophisticated methods have been developed. These generally work by reducing the multiclass problem to a larger binary problem. Schapire and Singer's [33] algorithm AdaBoost.MH works by creating a set of binary problems, for each example $x$ and each possible label $y$, of the form: "For example $x$, is the correct label $y$ or is it one of the other labels?" Freund and Schapire's [18] algorithm AdaBoost.M2 (which is a special case of Schapire and Singer's [33] AdaBoost.MR algorithm) instead creates binary problems, for each example $x$ with correct label $y$ and each incorrect label $y'$, of the form: "For example $x$, is the correct label $y$ or $y'$?"

These methods require additional effort in the design of the weak learning algorithm. A different technique [31], which incorporates Dietterich and Bakiri's [11] method of error-correcting output codes, achieves similar provable bounds to those of AdaBoost.MH and AdaBoost.M2, but can be used with any weak learner which can handle simple, binary labeled data. Schapire and Singer [33] give yet another method of combining boosting with error-correcting output codes.
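The reduction behind AdaBoost.MH can be pictured as a data transformation: each multiclass example is expanded into one binary example per possible label, after which binary boosting can be run on the expanded set. A minimal sketch of that expansion (illustrative only, not the authors' implementation; the function and variable names are assumptions):

```python
def expand_for_adaboost_mh(examples, labels):
    """Turn each multiclass example (x, y) into one binary example ((x, l), +1 or -1) per label l.

    The binary question encoded for each pair is: "for example x, is the correct label l?"
    """
    binary_examples = []
    for x, y in examples:
        for label in labels:
            binary_examples.append(((x, label), +1 if label == y else -1))
    return binary_examples

# Example: a 3-class problem with labels 'a', 'b', 'c'.
expanded = expand_for_adaboost_mh([("doc1", "a"), ("doc2", "c")], ["a", "b", "c"])
# [(('doc1', 'a'), 1), (('doc1', 'b'), -1), (('doc1', 'c'), -1),
#  (('doc2', 'a'), -1), (('doc2', 'b'), -1), (('doc2', 'c'), 1)]
```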
Experiments and applications

Practically, AdaBoost has many advantages. It is fast, simple and easy to program. It has no parameters to tune (except for the number of rounds $T$). It requires no prior knowledge about the weak learner and so can be flexibly combined with any method for finding weak hypotheses. Finally, it comes with a set of theoretical guarantees given sufficient data and a weak learner that can reliably provide only moderately accurate weak hypotheses. This is a shift in mind set for the learning-system designer: instead of trying to design a learning algorithm that is accurate over the entire space, we can instead focus on finding weak learning algorithms that only need to be better than random.

On the other hand, some caveats are certainly in order. The actual performance of boosting on a particular problem is clearly dependent on the data and the weak learner. Consistent with theory, boosting can fail to perform well given insufficient data, overly complex weak hypotheses or weak hypotheses which are too weak. Boosting seems to be especially susceptible to noise [10].

AdaBoost has been tested empirically by many researchers, including [2, 10, 12, 21, 25, 28, 36]. For instance, Freund and Schapire [16] tested AdaBoost on a set of UCI benchmark datasets [27] using C4.5 [29] as a weak learning algorithm, as well as an algorithm which finds the best "decision stump" or single-test decision tree. Some of the results of these experiments are shown in Fig. 3. As can be seen from this figure, even boosting the weak decision stumps can usually give as good results as C4.5, while boosting C4.5 generally gives the decision-tree algorithm a significant improvement in performance.
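A decision stump of the kind boosted in these experiments tests a single attribute against a single threshold. A weighted version that minimizes $\epsilon_t$ over features, thresholds and orientations can be sketched as follows; this is an illustrative reconstruction, not the code used in the experiments (NumPy and the function name are assumptions), and it plugs into the weak_learner slot of the AdaBoost sketch given earlier:

```python
import numpy as np

def train_stump(X, y, D):
    """Return the single-feature threshold rule h(x) with the smallest weighted error under D."""
    X, y, D = np.asarray(X, dtype=float), np.asarray(y), np.asarray(D)
    best = None
    for j in range(X.shape[1]):                       # each feature
        for threshold in np.unique(X[:, j]):          # each candidate split point
            for sign in (+1, -1):                     # which side of the split predicts +1
                pred = np.where(X[:, j] > threshold, sign, -sign)
                eps = np.sum(D[pred != y])            # weighted error of this stump
                if best is None or eps < best[0]:
                    best = (eps, j, threshold, sign)
    _, j, threshold, sign = best
    return lambda x: sign if x[j] > threshold else -sign
```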

In another set of experiments, Schapire and Singer [34] used boosting for text categorization tasks. For this work, weak hypotheses were used which test on the presence or absence of a word or phrase. Some results of these experiments comparing AdaBoost to four other methods are shown in Fig. 4. In nearly all of these experiments and for all of the performance measures tested, boosting performed as well or significantly better than the other methods tested. Boosting has also been applied to text filtering [35] and "ranking" problems [15].

A nice property of AdaBoost is its ability to identify outliers, i.e., examples that are either mislabeled in the training data, or which are inherently ambiguous and hard to categorize. Because AdaBoost focuses its weight on the hardest examples, the examples with the highest weight often turn out to be outliers. An example of this phenomenon can be seen in Fig. 5, taken from an OCR experiment conducted by Freund and Schapire [16].

Figure 5: A sample of the examples that have the largest weight on an OCR task as reported by Freund and Schapire [16]. These examples were chosen after 4 rounds of boosting (top line), 12 rounds (middle) and 25 rounds (bottom). Underneath each image is a line of the form $d:\ell_1/w_1,\ell_2/w_2$, where $d$ is the label of the example, $\ell_1$ and $\ell_2$ are the labels that get the highest and second highest vote from the combined hypothesis at that point in the run of the algorithm, and $w_1$, $w_2$ are the corresponding normalized scores.
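Since the round-$t$ weights $D_t(i)$ are computed explicitly, surfacing candidate outliers is just a matter of sorting. A small illustrative sketch (not from the paper; it assumes the D array returned by the AdaBoost sketch given earlier):

```python
import numpy as np

def highest_weight_examples(D, k=10):
    """Indices of the k training examples carrying the most weight: candidates for mislabeled or ambiguous data."""
    D = np.asarray(D)
    return np.argsort(D)[::-1][:k]
```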
References

[1] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, March 1998.

[2] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, to appear.

[3] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.

[4] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929-965, October 1989.

[5] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.

[6] Leo Breiman. Arcing the edge. Technical Report 486, Statistics Department, University of California at Berkeley, 1997.

[7] Leo Breiman. Prediction games and arcing classifiers. Technical Report 504, Statistics Department, University of California at Berkeley, 1997.

[8] Leo Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.

[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.

[10] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, to appear.

[11] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, January 1995.

[12] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479-485, 1996.

[13] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):705-719, 1993.

[14] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.

[15] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference, 1998.

[16] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.

[17] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332, 1996.

[18] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.

[19] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.

[20] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.

[21] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems 8, pages 654-660, 1996.

[22] Michael Kearns and Leslie G. Valiant. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, August 1988.

[23] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery, 41(1):67-95, January 1994.

[24] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[25] Richard Maclin and David Opitz. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 546-551, 1997.

[26] Llew Mason, Peter Bartlett, and Jonathan Baxter. Direct optimization of margins improves generalization in combined classifiers. Technical report, Department of Systems Engineering, Australian National University, 1998.

[27] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998. www.ics.uci.edu/~mlearn/MLRepository.html.

[28] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725-730, 1996.

[29] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[30] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

[31] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313-321, 1997.

[32] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, October 1998.

[33] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80-91, 1998. To appear, Machine Learning.

[34] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, to appear.

[35] Robert E. Schapire, Yoram Singer, and Amit Singhal. Boosting and Rocchio applied to text filtering. In SIGIR '98: Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval, 1998.

[36] Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems 10, pages 647-653, 1998.

[37] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984.

[38] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.