The Annals of Statistics
2000, Vol. 28, No. 2, 337–407
SPECIAL INVITED PAPER
ADDITIVE LOGISTIC REGRESSION: A STATISTICAL VIEW OF BOOSTING
By Jerome Friedman,¹ Trevor Hastie²,³ and Robert Tibshirani²,⁴
Stanford University
Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.
1. Introduction. The starting point for this paper is an interesting procedure called "boosting," which is a way of combining the performance of many "weak" classifiers to produce a powerful "committee." Boosting was proposed in the computational learning theory literature [Schapire (1990), Freund (1995), Freund and Schapire (1997)] and has since received much attention.
While boosting has evolved somewhat over the years, we describe the most commonly used version of the AdaBoost procedure [Freund and Schapire
Received August 1998; revised December 1999.
1Also at Stanford Linear Accelerator Center, Stanford, CA 94305. Supported in part by Dept. of Energy Contract DE-AC03-76 SF 00515 and NSF Grant DMS-97-64431.
2Also at Division of BioStatistics, Dept. of Health, Research and Policy, Stanford University, Stanford, CA 94305.
3Supported in part by NSF Grants DMS-95-04495, DMS-98-03645 and NIH Grant ROI-CA-72028-01.
4Supported in part by Natural Sciences and Engineering Research Council of Canada.
AMS 1991 subject classifications. 62G05, 62G07, 68T10, 68T05.
Key words and phrases. classification, tree, nonparametric estimation, stagewise fitting,
machine learning.
(1996b)], which we call Discrete AdaBoost. This is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i$ a vector valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^M c_m f_m(x)$ where each $f_m(x)$ is a classifier producing values plus or minus 1 and $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. The AdaBoost procedure trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage. A detailed description of Discrete AdaBoost is given in the boxed display titled Algorithm 1.
Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for fmx and have demonstrated that it consistently produces significantly lower error rates than a single decision tree. In fact, Breiman (1996) (referring to a NIPS workshop) called AdaBoost with trees the “best off-the-shelf classifier in the world” [see also Breiman (1998b)]. Interestingly, in many examples the test error seems to consistently decrease and then level off as more classifiers are added, rather than ultimately increase. For some reason, it seems that AdaBoost is resistant to overfitting.
Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART™ [Breiman, Friedman, Olshen and Stone (1984)] as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see Section 8). Included in the figure is the bagged tree [Breiman (1996)] which averages trees grown on bootstrap resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces good results.
Discrete AdaBoost [Freund and Schapire (1996b)]
1. Start with weights $w_i = 1/N$, $i = 1, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier $f_m(x) \in \{-1, 1\}$ using weights $w_i$ on the training data.
   (b) Compute $\mathrm{err}_m = E_w[1_{(y \ne f_m(x))}]$, $c_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$.
   (c) Set $w_i \leftarrow w_i \exp[c_m 1_{(y_i \ne f_m(x_i))}]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}[\sum_{m=1}^M c_m f_m(x)]$.
Algorithm 1. $E_w$ represents expectation over the training data with weights $w = (w_1, w_2, \ldots, w_N)$, and $1_{(S)}$ is the indicator of the set $S$. At each iteration, AdaBoost increases the weights of the observations misclassified by $f_m(x)$ by a factor that depends on the weighted training error.
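To make the boxed description concrete, the following is a minimal sketch of Algorithm 1 in Python with NumPy, using a weighted decision stump as the base classifier. The helper names (fit_stump, discrete_adaboost) are our own, not from any particular library, and the brute-force stump search is meant only to mirror the boxed steps, not to be efficient.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: pick the (feature j, threshold t, sign s) whose rule
    s * sign(x_j > t) has the smallest weighted misclassification error."""
    best = (0, 0.0, 1, np.inf)                      # (j, t, s, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            pred = np.where(X[:, j] > t, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (s * pred != y))
                if err < best[3]:
                    best = (j, t, s, err)
    j, t, s, _ = best
    return lambda Z: s * np.where(Z[:, j] > t, 1, -1)

def discrete_adaboost(X, y, M=100):
    """Algorithm 1 (Discrete AdaBoost); y must be coded as -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                         # 1. start with uniform weights
    classifiers, cs = [], []
    for _ in range(M):                              # 2. repeat for m = 1, ..., M
        f = fit_stump(X, y, w)                      # (a) fit f_m using weights w_i
        err = np.clip(np.sum(w * (f(X) != y)), 1e-10, 1 - 1e-10)
        c = np.log((1 - err) / err)                 # (b) weighted error and c_m
        w = w * np.exp(c * (f(X) != y))             # (c) upweight misclassified cases
        w = w / w.sum()                             #     and renormalize
        classifiers.append(f)
        cs.append(c)
    return lambda Z: np.sign(sum(c * f(Z) for c, f in zip(cs, classifiers)))
```

With `X` an (N, p) array and `y` in {-1, +1}, `clf = discrete_adaboost(X, y)` returns the committee, and `clf(X)` gives its predictions.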
Fig. 1. Test error for Bagging, Discrete AdaBoost and Real AdaBoost on a simulated two-class nested spheres problem (see Section 6). There are 2000 training data points in ten dimensions, and the Bayes error rate is zero. All trees are grown “best-first” without pruning. The leftmost iteration corresponds to a single tree.
Early versions of AdaBoost used a resampling scheme to implement step 2 of Algorithm 1, by weighted sampling from the training data. This suggested a connection with bagging and that a major component of the success of boosting has to do with variance reduction.
However, boosting performs comparably well when:
1. A weighted tree-growing algorithm is used in step 2 rather than weighted resampling, where each training observation is assigned its weight $w_i$. This removes the randomization component essential in bagging.
2. “Stumps” are used for the weak learners. Stumps are single-split trees with only two terminal nodes. These typically have low variance but high bias. Bagging performs very poorly with stumps [Figure 1 (top right panel)].
These observations suggest that boosting is capable of both bias and variance reduction, and thus differs fundamentally from bagging.
The base classifier in Discrete AdaBoost produces a classification rule $f_m(x): \mathcal{X} \rightarrow \{-1, 1\}$, where $\mathcal{X}$ is the domain of the predictive features $x$. Freund and Schapire (1996b), Breiman (1998a) and Schapire and Singer (1998) have suggested various modifications to improve the boosting algorithms.
A generalization of Discrete AdaBoost appeared in Freund and Schapire (1996b), and was developed further in Schapire and Singer (1998), that uses real-valued "confidence-rated" predictions rather than the $\{-1, 1\}$ of Discrete AdaBoost. The weak learner for this generalized boosting produces a mapping $f_m(x): \mathcal{X} \rightarrow \mathbf{R}$; the sign of $f_m(x)$ gives the classification, and $|f_m(x)|$ a measure of the "confidence" in the prediction. This real-valued contribution is combined with the previous contributions with a multiplier $c_m$ as before, and a slightly different recipe for $c_m$ is provided.
We present a generalized version of AdaBoost, which we call Real AdaBoost in Algorithm 2, in which the weak learner returns a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$. The contribution to the final classifier is half the logit-transform of this probability estimate. One form of Schapire and Singer's generalized AdaBoost coincides with Real AdaBoost, in the special case where the weak learner is a decision tree. Real AdaBoost tends to perform the best in our simulated examples in Figure 1, especially with stumps, although we see with 100 node trees Discrete AdaBoost overtakes Real AdaBoost after 200 iterations.
In this paper we analyze the AdaBoost procedures from a statistical perspective. The main result of our paper rederives AdaBoost as a method for fitting an additive model $\sum_m f_m(x)$ in a forward stagewise manner. This simple fact largely explains why it tends to outperform a single base learner. By fitting an additive model of different and potentially simple functions, it expands the class of functions that can be approximated.
Real AdaBoost
1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier to obtain a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$, using weights $w_i$ on the training data.
   (b) Set $f_m(x) \leftarrow \frac{1}{2} \log[p_m(x)/(1 - p_m(x))] \in \mathbf{R}$.
   (c) Set $w_i \leftarrow w_i \exp[-y_i f_m(x_i)]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}[\sum_{m=1}^M f_m(x)]$.

Algorithm 2. The Real AdaBoost algorithm uses class probability estimates $p_m(x)$ to construct real-valued contributions $f_m(x)$.
Given this fact, Discrete and Real AdaBoost appear unnecessarily complicated. A much simpler way to fit an additive model would be to minimize squared-error loss $E[y - \sum_m f_m(x)]^2$ in a forward stagewise manner. At the $m$th stage we fix $f_1(x), \ldots, f_{m-1}(x)$ and minimize squared error to obtain $f_m(x) = E[y - \sum_1^{m-1} f_j(x) \mid x]$. This is just "fitting of residuals" and is commonly used in linear regression and additive modeling [Hastie and Tibshirani (1990)]. However squared error loss is not a good choice for classification (see Figure 2 in Section 4.2) and hence "fitting of residuals" doesn't work very well in that case. We show that AdaBoost fits an additive model using a better loss function for classification. Specifically we show that AdaBoost fits an additive logistic regression model, using a criterion similar to, but not the same as, the binomial log-likelihood. [If $p_m(x)$ are the class probabilities, an additive logistic regression approximates $\log[p_m(x)/(1 - p_m(x))]$ by an additive function $\sum_m f_m(x)$.] We then go on to derive a new boosting procedure
“LogitBoost” that directly optimizes the binomial log-likelihood.
The original boosting techniques [Schapire (1990), Freund (1995)] provably improved or "boosted" the performance of a single classifier by producing a "majority vote" of similar classifiers. These algorithms then evolved into more adaptive and practical versions such as AdaBoost, whose success was still explained in terms of boosting individual classifiers by a "weighted majority vote" or "weighted committee." We believe that this view, along with the appealing name "boosting" inherited by AdaBoost, may have led to some of the mystery about how and why the method works. As mentioned above,
we instead view boosting as a technique for fitting an additive model. Section 2 gives a short history of the boosting idea. In Section 3 we briefly review additive modeling. Section 4 shows how boosting can be viewed as an additive model estimator and proposes some new boosting methods for the two-class case. The multiclass problem is studied in Section 5. Simulated and real data experiments are discussed in Sections 6 and 7. Our tree-growing implementation, using truncated best-first trees, is described in Section 8. Weight trimming to speed up computation is discussed in Section 9, and we briefly describe generalizations of boosting in Section 10. We end with a discussion
in Section 11.
2. A brief history of boosting. Schapire (1990) developed the first simple boosting procedure in the PAC-learning framework [Valiant (1984), Kearns and Vazirani (1994)]. Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin flip. After learning an initial classifier $h_1$ on the first $N$ training points:

1. $h_2$ is learned on a new sample of $N$ points, half of which are misclassified by $h_1$.
2. $h_3$ is learned on $N$ points for which $h_1$ and $h_2$ disagree.
3. The boosted classifier is $h_B = \mathrm{Majority\ Vote}(h_1, h_2, h_3)$.
Schapire's "Strength of Weak Learnability" theorem proves that $h_B$ has improved performance over $h_1$.
Freund (1995) proposed a "boost by majority" variation which combined many weak learners simultaneously and improved the performance of the simple boosting algorithm of Schapire. The theory supporting both of these algorithms requires the weak learner to produce a classifier with a fixed error rate. This led to the more adaptive and realistic AdaBoost [Freund and Schapire (1996b)] and its offspring, where this assumption was dropped.
Freund and Schapire (1996b) and Schapire and Singer (1998) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory has evolved in the computational learning community, initially based on the concepts of PAC learning. Other theories attempting to explain boosting come from game theory [Freund and Schapire (1996a), Breiman (1997)] and VC theory [Schapire, Freund, Bartlett and Lee (1998)]. The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice, boosting achieves results far more impressive than the bounds would imply.
3. Additive models. We show in the next section that AdaBoost fits an additive model $F(x) = \sum_{m=1}^M c_m f_m(x)$. We believe that viewing current boosting procedures as stagewise algorithms for fitting additive models goes a long way toward understanding their performance. Additive models have a long history in statistics, and so we first give some examples here.
3.1. Additive regression models. We initially focus on the regression problem, where the response $y$ is quantitative, $x$ and $y$ have some joint distribution, and we are interested in modeling the mean $E(y \mid x) = F(x)$. The additive model has the form
(1) $F(x) = \sum_{j=1}^p f_j(x_j)$.
There is a separate function $f_j(x_j)$ for each of the $p$ input variables $x_j$. More generally, each component $f_j$ is a function of a small, prespecified subset of the input variables. The backfitting algorithm [Friedman and Stuetzle (1981), Buja, Hastie and Tibshirani (1989)] is a convenient modular "Gauss–Seidel" algorithm for fitting additive models. A backfitting update is

(2) $f_j(x_j) \leftarrow E\bigl[y - \sum_{k \ne j} f_k(x_k) \bigm| x_j\bigr]$ for $j = 1, 2, \ldots, p, 1, 2, \ldots$.
Any method or algorithm for estimating a function of $x_j$ can be used to obtain an estimate of the conditional expectation in (2). In particular, this can include nonparametric smoothing algorithms, such as local regression or smoothing splines. In the right-hand side, all the latest versions of the functions $f_k$ are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can
be shown to converge to the minimizer of $E[y - F(x)]^2$ [Buja, Hastie and Tibshirani (1989)].
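As an illustration of the backfitting update (2), here is a small sketch in Python. The nearest-neighbour running-mean smoother is a crude stand-in for whatever smoother one would actually use (local regression, smoothing splines), and all names are our own.

```python
import numpy as np

def smooth(x, r, span=0.3):
    """Toy smoother: for each x_i, average the residual r over roughly span*N
    nearest neighbours in x. Stands in for any estimate of E(r | x_j)."""
    order = np.argsort(x)
    k = max(3, int(span * len(x)))
    fitted = np.empty_like(r, dtype=float)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k // 2), min(len(x), rank + k // 2 + 1)
        fitted[i] = r[order[lo:hi]].mean()
    return fitted

def backfit(X, y, n_cycles=20):
    """Gauss-Seidel backfitting as in (2): cycle through coordinates, smoothing
    the partial residual y - sum_{k != j} f_k(x_k) against x_j."""
    N, p = X.shape
    F = np.zeros((N, p))                  # F[:, j] holds f_j evaluated at x_ij
    alpha = y.mean()
    for _ in range(n_cycles):
        for j in range(p):
            partial = y - alpha - F.sum(axis=1) + F[:, j]
            F[:, j] = smooth(X[:, j], partial)
            F[:, j] -= F[:, j].mean()     # center each f_j for identifiability
    return alpha, F
```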
3.2. Extended additive models. More generally, one can consider additive models whose elements $\{f_m(x)\}_1^M$ are functions of potentially all of the input features $x$. Usually in this context the $f_m(x)$ are taken to be simple functions characterized by a set of parameters $\gamma$ and a multiplier $\beta_m$,
(3) $f_m(x) = \beta_m b(x; \gamma_m)$.

The additive model then becomes

(4) $F_M(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$.
For example, in single hidden layer neural networks $b(x; \gamma) = \sigma(\gamma^t x)$, where $\sigma(\cdot)$ is a sigmoid function and $\gamma$ parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice with $\gamma$ parameterizing the location and scale shifts of a "mother" wavelet $b(x)$. In these applications $\{b(x; \gamma_m)\}_1^M$ are generally called "basis functions" since they span a function subspace.
If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm with updates,

(5) $\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\bigl[y - \sum_{k \ne m} \beta_k b(x; \gamma_k) - \beta b(x; \gamma)\bigr]^2$
for $m = 1, 2, \ldots, M$ in cycles until convergence. Alternatively, one can use a "greedy" forward stepwise approach,

(6) $\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\bigl[y - F_{m-1}(x) - \beta b(x; \gamma)\bigr]^2$

for $m = 1, 2, \ldots, M$, where $\{\beta_k, \gamma_k\}_1^{m-1}$ are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat and Zhang (1993) in "matching pursuit," where the $b(x; \gamma)$ are selected from an over-complete dictionary of wavelet bases. In the language of boosting, $f(x) = \beta b(x; \gamma)$ would be called a "weak learner" and $F_M(x)$ (4) the "committee." If decision trees were used as the weak learner, the parameters $\gamma$ would represent the splitting variables, split points, the constants in each terminal node and number of terminal nodes of each tree.
Note that the backfitting procedure (5) or its greedy cousin (6) only require an algorithm for fitting a single weak learner (3) to data. This base algorithm is simply applied repeatedly to modified versions of the original data

$y_m \leftarrow y - \sum_{k \ne m} f_k(x)$.
In the forward stepwise procedure (6), the modified output $y_m$ at the $m$th iteration depends only on its value $y_{m-1}$ and the solution $f_{m-1}(x)$ at the previous iteration,

(7) $y_m = y_{m-1} - f_{m-1}(x)$.

At each step $m$, the previous output values $y_{m-1}$ are modified (7) so that the previous model $f_{m-1}(x)$ has no explanatory power on the new outputs $y_m$. One can therefore view this as a procedure for boosting a weak learner $f(x) = \beta b(x; \gamma)$ to form a powerful committee $F_M(x)$ (4).
3.3. Classification problems. For the classification problem, we learn from Bayes theorem that all we need is $P(y = j \mid x)$, the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that $E(1_{[y=j]} \mid x) = P(y = j \mid x)$, where $1_{[y=j]}$ is the 0/1 indicator variable representing class $j$. While this works fairly well in general, several problems have been noted [Hastie, Tibshirani and Buja (1994)] for constrained regression methods. The estimates are typically not confined to $[0, 1]$, and severe masking problems can occur when there are more than two classes. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman, Friedman, Olshen and Stone (1984).
Logistic regression is a popular approach used in statistics for overcoming these problems. For a two-class problem, an additive logistic model has the form
(8) $\log \dfrac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = \sum_{m=1}^M f_m(x)$.

The monotone logit transformation on the left guarantees that for any values of $F(x) = \sum_{m=1}^M f_m(x) \in \mathbf{R}$, the probability estimates lie in $[0, 1]$; inverting we get

(9) $p(x) = P(y = 1 \mid x) = \dfrac{e^{F(x)}}{1 + e^{F(x)}}$.
Here we have given a general additive form for Fx; special cases exist that are well known in statistics. In particular, linear logistic regression [McCullagh and Nelder (1989), e.g.] and additive logistic regression [Hastie and Tibshirani (1990)] are popular. These models are usually fit by maximizing the binomial log-likelihood and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.
A generalized version of backfitting (2), called "Local Scoring" in Hastie and Tibshirani (1990), can be used to fit the additive logistic model by maximum likelihood. Starting with guesses $f_1(x_1), \ldots, f_p(x_p)$, $F(x) = \sum_k f_k(x_k)$ and $p(x)$ defined in (9), we form the working response:

(10) $z = F(x) + \dfrac{1_{[y=1]} - p(x)}{p(x)(1 - p(x))}$.
We then apply backfitting to the response $z$ with observation weights $p(x)(1 - p(x))$ to obtain new $f_k(x_k)$. This process is repeated until convergence. The forward stagewise version (6) of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.
4. AdaBoost: an additive logistic regression model. In this section we show that the AdaBoost algorithms (Discrete and Real) can be interpreted as stagewise estimation procedures for fitting an additive logistic regression model. They optimize an exponential criterion which to second order is equivalent to the binomial log-likelihood criterion. We then propose a more standard likelihood-based boosting procedure.
4.1. An exponential criterion. Consider minimizing the criterion
(11) $J(F) = E[e^{-yF(x)}]$
for estimation of $F(x)$. Here $E$ represents expectation; depending on the context, this may be a population expectation (with respect to a probability distribution) or else a sample average. $E_w$ indicates a weighted expectation. Lemma 1 shows that the function $F(x)$ that minimizes $J(F)$ is the symmetric logistic transform of $P(y = 1 \mid x)$.
Lemma 1. $E[e^{-yF(x)}]$ is minimized at

(12) $F(x) = \dfrac{1}{2} \log \dfrac{P(y = 1 \mid x)}{P(y = -1 \mid x)}$.

Hence

(13) $P(y = 1 \mid x) = \dfrac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}$,

(14) $P(y = -1 \mid x) = \dfrac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}$.

Proof. While $E$ entails expectation over the joint distribution of $y$ and $x$, it is sufficient to minimize the criterion conditional on $x$:

$E(e^{-yF(x)} \mid x) = P(y = 1 \mid x)\, e^{-F(x)} + P(y = -1 \mid x)\, e^{F(x)}$,

$\dfrac{\partial E(e^{-yF(x)} \mid x)}{\partial F(x)} = -P(y = 1 \mid x)\, e^{-F(x)} + P(y = -1 \mid x)\, e^{F(x)}$.

The result follows by setting the derivative to zero. □
This exponential criterion appeared in Schapire and Singer (1998), motivated as an upper bound on misclassification error. Breiman (1997) also used this criterion in his results on AdaBoost and prediction games. The usual
logistic transform does not have the factor $\frac{1}{2}$ as in (12); by multiplying the numerator and denominator in (13) by $e^{F(x)}$, we get the usual logistic model

(15) $p(x) = \dfrac{e^{2F(x)}}{1 + e^{2F(x)}}$.

Hence the two models are equivalent up to a factor 2.
Corollary 1. If $E$ is replaced by averages over regions of $x$ where $F(x)$ is constant (as in the terminal node of a decision tree), the same result applies to the sample proportions of $y = 1$ and $y = -1$.
Results 1 and 2 show that both Discrete and Real AdaBoost, as well as the Generalized AdaBoost of Freund and Schapire (1996b), can be motivated as iterative algorithms for optimizing the (population based) exponential criterion. The results share the same format.
1. Given an imperfect $F(x)$, an update $F(x) + f(x)$ is proposed based on the population version of the criterion.
2. The update, which involves population conditional expectations, is imperfectly approximated for finite data sets by some restricted class of estimators, such as averages in terminal nodes of trees.
Hastie and Tibshirani (1990) use a similar derivation of the local scoring
algorithm used in fitting generalized additive models. Many terms are typically required in practice, since at each stage the approximation to conditional expectation is rather crude. Because of Lemma 1, the resulting algorithms can be interpreted as a stagewise estimation procedure for fitting an additive logistic regression model. The derivations are sufficiently different to warrant separate treatment.
Result 1. The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing
$E[e^{-yF(x)}]$.
Proof. Let $J(F) = E[e^{-yF(x)}]$. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + cf(x)$. For fixed $c$ (and $x$), we expand $J(F(x) + cf(x))$ to second order about $f(x) = 0$,

$J(F + cf) = E[e^{-y(F(x) + cf(x))}]$
$\approx E[e^{-yF(x)}(1 - ycf(x) + c^2 y^2 f(x)^2/2)]$
$= E[e^{-yF(x)}(1 - ycf(x) + c^2/2)]$,

since $y^2 = 1$ and $f(x)^2 = 1$. Minimizing pointwise with respect to $f(x) \in \{-1, 1\}$, we write

(16) $f(x) = \arg\min_f E_w(1 - ycf(x) + c^2/2 \mid x)$.
Here the notation $E_w(\cdot \mid x)$ refers to a weighted conditional expectation, where $w = w(x, y) = e^{-yF(x)}$, and

$E_w[g(x, y) \mid x] \stackrel{\mathrm{def}}{=} \dfrac{E[w(x, y)\, g(x, y) \mid x]}{E[w(x, y) \mid x]}$.

For $c > 0$, minimizing (16) is equivalent to maximizing

(17) $E_w[y f(x)]$.

The solution is

(18) $f(x) = \begin{cases} 1, & \text{if } E_w(y \mid x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x) > 0, \\ -1, & \text{otherwise.} \end{cases}$

Note that

(19) $-E_w[y f(x)] = E_w[(y - f(x))^2]/2 - 1$

[again using $f(x)^2 = y^2 = 1$]. Thus minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of $f(x) \in \{-1, 1\}$, and this constitutes the Newton-like step.
Given $f(x) \in \{-1, 1\}$, we can directly minimize $J(F + cf)$ to determine $c$:

(20) $c = \arg\min_c E_w[e^{-cyf(x)}] = \dfrac{1}{2} \log \dfrac{1 - \mathrm{err}}{\mathrm{err}}$,

where $\mathrm{err} = E_w[1_{(y \ne f(x))}]$. Note that $c$ can be negative if the weak learner does worse than 50%, in which case it automatically reverses the polarity. Combining these steps we get the update for $F(x)$,

$F(x) \leftarrow F(x) + \dfrac{1}{2} \log \dfrac{1 - \mathrm{err}}{\mathrm{err}}\, f(x)$.

In the next iteration the new contribution $cf(x)$ to $F(x)$ augments the weights

$w(x, y) \leftarrow w(x, y)\, e^{-cf(x)y}$.

Since $-yf(x) = 2 \times 1_{(y \ne f(x))} - 1$, we see that the update is equivalent to

$w(x, y) \leftarrow w(x, y) \exp\Bigl(\log \dfrac{1 - \mathrm{err}}{\mathrm{err}}\, 1_{(y \ne f(x))}\Bigr)$.

Thus the function and weight updates are of an identical form to those used in Discrete AdaBoost.

This population version of AdaBoost translates naturally to a data version using trees. The weighted conditional expectation in (18) is approximated by the terminal-node weighted averages in a tree. In particular, the weighted least-squares criterion is used to grow the tree-based classifier $f(x)$, and given $f(x)$, the constant $c$ is based on the weighted training error.
Note that after each Newton step, the weights change, and hence the tree configuration will change as well. This adds an adaptive twist to the data version of a Newton-like algorithm.
Parts of this derivation for AdaBoost can be found in Breiman (1997) and Schapire and Singer (1998), but without making the connection to additive logistic regression models.
Corollary 2. After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.
Proof. This follows by noting that the $c$ that minimizes $J(F + cf)$ satisfies

(21) $\dfrac{\partial J(F + cf)}{\partial c} = -E[e^{-y(F(x) + cf(x))}\, yf(x)] = 0$.

The result follows since $yf(x)$ is 1 for a correct and $-1$ for an incorrect classification. □
Schapire and Singer (1998) give the interpretation that the weights are updated to make the new weighted problem maximally difficult for the next weak learner.
The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier $f(x) \in \{-1, 1\}$. Result 1 requires minor modifications to accommodate $f(x) \in \mathbf{R}$, as in the generalized AdaBoost algorithms [Freund and Schapire (1996b), Schapire and Singer (1998)]; the estimate for $c_m$ differs. Fixing $f$, we see that the minimizer of (20) must satisfy
(22) $E_w[yf(x)\, e^{-cyf(x)}] = 0$.

If $f$ is not discrete, this equation has no closed-form solution for $c$, and requires an iterative solution such as Newton–Raphson.
We now derive the Real AdaBoost algorithm, which uses weighted probability estimates to update the additive logistic model, rather than the classifications themselves. Again we derive the population updates and then apply it to data by approximating conditional expectations by terminal-node averages in trees.
Result 2. The Real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of $J(F) = E[e^{-yF(x)}]$.
Proof. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + f(x)$ by minimizing $J(F(x) + f(x))$ at each $x$:

$J(F(x) + f(x)) = E[e^{-yF(x)} e^{-yf(x)} \mid x]$
$= e^{-f(x)} E[e^{-yF(x)} 1_{[y=1]} \mid x] + e^{f(x)} E[e^{-yF(x)} 1_{[y=-1]} \mid x]$.

Dividing through by $E[e^{-yF(x)} \mid x]$ and setting the derivative w.r.t. $f(x)$ to zero we get

(23) $f(x) = \dfrac{1}{2} \log \dfrac{E_w[1_{[y=1]} \mid x]}{E_w[1_{[y=-1]} \mid x]}$

(24) $= \dfrac{1}{2} \log \dfrac{P_w(y = 1 \mid x)}{P_w(y = -1 \mid x)}$,

where $w(x, y) = \exp(-yF(x))$. The weights get updated by

$w(x, y) \leftarrow w(x, y)\, e^{-yf(x)}$.

The algorithm as presented would stop after one iteration. In practice we use crude approximations to conditional expectation, such as decision trees or other constrained models, and hence many steps are required.
Corollary 3. At the optimal $F(x)$, the weighted conditional mean of $y$ is 0.

Proof. If $F(x)$ is optimal, we have

(25) $\dfrac{\partial J(F(x))}{\partial F(x)} = -E[e^{-yF(x)} y] = 0$. □
We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function F, there is no further information about F in the weighted conditional distribution of y. If there is, we use it to update F.
At iteration $M$ in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

(26) $F(x) = \sum_{m=1}^M f_m(x)$,
where each of the components are found in a greedy forward stagewise fashion, fixing the earlier components. Our term "stagewise" refers to a similar
approach in statistics:
1. Variables are included sequentially in a stepwise regression.
2. The coefficients of variables already included receive no further adjustment.
4.2. Why $E[e^{-yF(x)}]$? So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:
1. Schapire and Singer (1998) motivate $e^{-yF(x)}$ as a differentiable upper bound to misclassification error $1_{[yF<0]}$ (see Figure 2).
2. The AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.
Let $y^* = (y + 1)/2$, taking values 0, 1, and parametrize the binomial probabilities by

$p(x) = \dfrac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$.

The binomial log-likelihood is

(27) $l(y^*, p(x)) = y^* \log p(x) + (1 - y^*) \log(1 - p(x)) = -\log(1 + e^{-2yF(x)})$.
Fig. 2. A variety of loss functions for estimating a function $F(x)$ for classification. The horizontal axis is $yF$, which is negative for errors and positive for correct classifications. All the loss functions are monotone in $yF$ and are centered and scaled to match $e^{-yF}$ at $F = 0$. The curve labeled "Log-likelihood" is the binomial log-likelihood or cross-entropy $y^* \log p + (1 - y^*)\log(1 - p)$. The curve labeled "Squared Error(p)" is $(y^* - p)^2$. The curve labeled "Squared Error(F)" is $(y - F)^2$ and increases once $yF$ exceeds 1, thereby increasingly penalizing classifications that are "too correct."
Hence we see that:
3. The population minimizers of $-E[l(y^*, p(x))]$ and $E[e^{-yF(x)}]$ coincide. This is easily seen because the expected log-likelihood is maximized at the true probabilities $p(x) = P(y^* = 1 \mid x)$, which define the logit $F(x)$. By Lemma 1 we see that this is exactly the minimizer of $E[e^{-yF(x)}]$. In fact, the exponential criterion and the (negative) log-likelihood are equivalent to second order in a Taylor series around $F = 0$,

(28) $-l(y^*, p) \approx \exp(-yF) + \log 2 - 1$.
Graphs of $\exp(-yF)$ and $\log(1 + e^{-2yF})$ are shown in Figure 2, as a function of $yF$—positive values of $yF$ imply correct classification. Note that $-\exp(-yF)$ itself is not a proper log-likelihood, as it does not equal the log of any probability mass function on plus or minus 1.
4. There is another way to view the criterion $J(F)$. It is easy to show that

(29) $e^{-yF(x)} = \dfrac{|y^* - p(x)|}{\sqrt{p(x)(1 - p(x))}}$,

with $F(x) = \frac{1}{2}\log[p(x)/(1 - p(x))]$. The right-hand side is known as the $\chi$ statistic in the statistical literature. $\chi^2$ is a quadratic approximation to the log-likelihood, and so $\chi$ can be considered a "gentler" alternative.
One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).
Why not estimate the $f_m$ by minimizing the squared error $E[y - F(x)]^2$? If $F_{m-1}(x) = \sum_1^{m-1} f_j(x)$ is the current prediction, this leads to a forward stagewise procedure that does an unweighted fit to the response $y - F_{m-1}(x)$ at step $m$, as in (6). Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the nonmonotonicity of squared error loss (Figure 2) is the reason. Correct classifications, but with $yF(x) > 1$, incur increasing loss for increasing values of $|F(x)|$. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.
4.3. Direct optimization of the binomial log-likelihood. In this section we explore algorithms for fitting additive logistic regression models by stagewise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case and will use a 0/1 response $y^*$ to represent the outcome. We represent the probability of $y^* = 1$ by $p(x)$, where

(30) $p(x) = \dfrac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$.
Algorithm 3 gives the details.
LogitBoost (two classes)
1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p(x_i) = \frac{1}{2}$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Compute the working response and weights
       $z_i = \dfrac{y_i^* - p(x_i)}{p(x_i)(1 - p(x_i))}$,
       $w_i = p(x_i)(1 - p(x_i))$.
   (b) Fit the function $f_m(x)$ by a weighted least-squares regression of $z_i$ to $x_i$ using weights $w_i$.
   (c) Update $F(x) \leftarrow F(x) + \frac{1}{2} f_m(x)$ and $p(x) \leftarrow e^{F(x)}/(e^{F(x)} + e^{-F(x)})$.
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^M f_m(x)]$.

Algorithm 3. An adaptive Newton algorithm for fitting an additive logistic regression model.
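A minimal sketch of Algorithm 3 in Python, using a weighted least-squares regression stump as the learner in step 2(b). The clipping of $z$ and the flooring of the weights anticipate the numerical safeguards discussed at the end of this subsection; all function names are our own.

```python
import numpy as np

def fit_ls_stump(X, z, w):
    """Weighted least-squares regression stump: choose the split minimizing the
    weighted squared error of z, and return the weighted mean of z in each node."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            right = X[:, j] > t
            wl, wr = w[~right].sum(), w[right].sum()
            if wl <= 0 or wr <= 0:
                continue
            ml = np.sum(w[~right] * z[~right]) / wl
            mr = np.sum(w[right] * z[right]) / wr
            sse = np.sum(w[~right] * (z[~right] - ml) ** 2) + np.sum(w[right] * (z[right] - mr) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    _, j, t, ml, mr = best
    return lambda Z: np.where(Z[:, j] > t, mr, ml)

def logitboost(X, y, M=100, z_max=4.0):
    """Algorithm 3 (two classes). y is coded -1/+1, so y* = (y + 1)/2."""
    ystar = (y + 1) / 2
    N = len(y)
    F = np.zeros(N)
    p = np.full(N, 0.5)
    fs = []
    for _ in range(M):
        w = np.maximum(p * (1 - p), 1e-24)            # weights, floored (Section 4.3)
        z = np.clip((ystar - p) / w, -z_max, z_max)   # working response, thresholded
        f = fit_ls_stump(X, z, w)                     # (b) weighted least-squares fit
        fs.append(f)
        F = F + 0.5 * f(X)                            # (c) update F(x)
        p = 1.0 / (1.0 + np.exp(-2.0 * np.clip(F, -30, 30)))  # e^F/(e^F+e^{-F}), clipped
    return lambda Z: np.sign(sum(f(Z) for f in fs))
```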
Result 3. The LogitBoost algorithm (two classes, population version) uses Newton steps for fitting an additive symmetric logistic model by maximum likelihood.
Derivation. Consider the update $F(x) + f(x)$ and the expected log-likelihood

(31) $E\,l(F + f) = E\bigl[2y^*(F(x) + f(x)) - \log\bigl(1 + e^{2(F(x) + f(x))}\bigr)\bigr]$.

Conditioning on $x$, we compute the first and second derivative at $f(x) = 0$,

(32) $s(x) = \dfrac{\partial E[l(F(x) + f(x)) \mid x]}{\partial f(x)} \Big|_{f(x)=0} = 2\,E(y^* - p(x) \mid x)$,

(33) $H(x) = \dfrac{\partial^2 E[l(F(x) + f(x)) \mid x]}{\partial f(x)^2} \Big|_{f(x)=0} = -4\,E[p(x)(1 - p(x)) \mid x]$,

where $p(x)$ is defined in terms of $F(x)$. The Newton update is then

(34) $F(x) \leftarrow F(x) - H(x)^{-1} s(x) = F(x) + \dfrac{1}{2} \dfrac{E(y^* - p(x) \mid x)}{E[p(x)(1 - p(x)) \mid x]}$

(35) $= F(x) + \dfrac{1}{2}\, E_w\Bigl(\dfrac{y^* - p(x)}{p(x)(1 - p(x))} \Bigm| x\Bigr)$,

where $w(x) = p(x)(1 - p(x))$. Equivalently, the Newton update $f(x)$ solves the weighted least-squares approximation [about $F(x)$] to the log-likelihood

(36) $\min_{f(x)} E_{w(x)} \Bigl[\Bigl(F(x) + \dfrac{1}{2} \dfrac{y^* - p(x)}{p(x)(1 - p(x))} - \bigl(F(x) + f(x)\bigr)\Bigr)^2 \Bigm| x\Bigr]$.
The population algorithm described here translates immediately to an implementation on data when $E(\cdot \mid x)$ is replaced by a regression method, such as regression trees [Breiman, Friedman, Olshen and Stone (1984)]. While the role of the weights is somewhat artificial in the population case, they are not in any implementation; $w(x)$ is constant when conditioned on $x$, but the $w(x_i)$ in a terminal node of a tree, for example, depend on the current values $F(x_i)$, and will typically not be constant.
Sometimes the $w(x)$ get very small in regions of $x$ perceived [by $F(x)$] to be pure—that is, when $p(x)$ is close to 0 or 1. This can cause numerical problems in the construction of $z$, and lead to the following crucial implementation protections:

1. If $y^* = 1$, then compute $z = (y^* - p)/(p(1 - p))$ as $1/p$. Since this number can get large if $p$ is small, threshold this ratio at $z_{\max}$. The particular value chosen for $z_{\max}$ is not crucial; we have found empirically that
$z_{\max} \in [2, 4]$ works well. Likewise, if $y^* = 0$, compute $z = -1/(1 - p)$ with a lower threshold of $-z_{\max}$.
2. Enforce a lower threshold on the weights: $w = \max(w,\, 2 \times \text{machine-zero})$.
4.4. Optimizing $E[e^{-yF(x)}]$ by Newton stepping. The population version of the Real AdaBoost procedure (Algorithm 2) optimizes $E[\exp(-y(F(x) + f(x)))]$ exactly with respect to $f$ at each iteration. In Algorithm 4 we propose the "Gentle AdaBoost" procedure that instead takes adaptive Newton steps much like the LogitBoost algorithm just described.
Result 4. The Gentle AdaBoost algorithm (population version) uses Newton steps for minimizing $E[e^{-yF(x)}]$.
Derivation.

$\dfrac{\partial J(F(x) + f(x))}{\partial f(x)} \Big|_{f(x)=0} = -E(e^{-yF(x)} y \mid x)$,

$\dfrac{\partial^2 J(F(x) + f(x))}{\partial f(x)^2} \Big|_{f(x)=0} = E(e^{-yF(x)} \mid x)$, since $y^2 = 1$.

Hence the Newton update is

$F(x) \leftarrow F(x) + \dfrac{E(e^{-yF(x)} y \mid x)}{E(e^{-yF(x)} \mid x)} = F(x) + E_w(y \mid x)$,

where $w(x, y) = e^{-yF(x)}$.
The main difference between this and the Real AdaBoost algorithm is how it uses its estimates of the weighted class probabilities to update the functions. Here the update is $f_m(x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x)$, rather than half the
Gentle AdaBoost
1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the regression function $f_m(x)$ by weighted least-squares of $y_i$ to $x_i$ with weights $w_i$.
   (b) Update $F(x) \leftarrow F(x) + f_m(x)$.
   (c) Update $w_i \leftarrow w_i \exp(-y_i f_m(x_i))$ and renormalize.
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^M f_m(x)]$.

Algorithm 4. A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
log-ratio as in (24): $f_m(x) = \frac{1}{2} \log[P_w(y = 1 \mid x)/P_w(y = -1 \mid x)]$. Log-ratios can be numerically unstable, leading to very large updates in pure regions, while the update here lies in the range $[-1, 1]$. Empirical evidence suggests (see Section 7) that this more conservative algorithm has similar performance to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue.
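A corresponding sketch of Algorithm 4; it reuses the weighted least-squares stump fit_ls_stump from the LogitBoost sketch above (our own helper, not from the paper) and simply regresses the ±1 response on $x$ at each step.

```python
import numpy as np

def gentle_adaboost(X, y, M=100):
    """Algorithm 4 (Gentle AdaBoost); y coded -1/+1. Relies on fit_ls_stump,
    the weighted least-squares regression stump defined in the LogitBoost sketch."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    fs = []
    for _ in range(M):
        f = fit_ls_stump(X, y.astype(float), w)   # (a) regress y on x with weights w
        fs.append(f)
        w = w * np.exp(-y * f(X))                 # (c) reweight
        w = w / w.sum()                           #     and renormalize
    return lambda Z: np.sign(sum(f(Z) for f in fs))
```

Because the stump's fitted values are weighted means of a ±1 response, each update automatically lies in $[-1, 1]$, which is the stability property discussed above.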
There is a strong similarity between the updates for the Gentle AdaBoost algorithm and those for the LogitBoost algorithm. Let $P = P(y = 1 \mid x)$, and $p(x) = e^{F(x)}/(e^{F(x)} + e^{-F(x)})$. Then

(37) $\dfrac{E(e^{-yF(x)} y \mid x)}{E(e^{-yF(x)} \mid x)} = \dfrac{e^{-F(x)} P - e^{F(x)}(1 - P)}{e^{-F(x)} P + e^{F(x)}(1 - P)} = \dfrac{P - p(x)}{(1 - p(x))P + p(x)(1 - P)}$.

The analogous expression for LogitBoost, from (34), is

(38) $\dfrac{1}{2}\, \dfrac{P - p(x)}{p(x)(1 - p(x))}$.

At $p(x) \approx \frac{1}{2}$ these are nearly the same, but they differ as the $p(x)$ become extreme. For example, if $P \approx 1$ and $p(x) \approx 0$, (38) blows up, while (37) is about 1 (and always falls in $[-1, 1]$).
5. Multiclass procedures. Here we explore extensions of boosting to classification with multiple classes. We start off by proposing a natural generalization of the two-class symmetric logistic transformation, and then consider specific algorithms. In this context Schapire and Singer (1998) define $J$ responses $y_j$ for a $J$-class problem, each taking values in $\{-1, 1\}$. Similarly the indicator response vector with elements $y_j^*$ is more standard in the statistics literature. Assume the classes are mutually exclusive.
Definition 1. For a $J$-class problem let $p_j(x) = P(y_j = 1 \mid x)$. We define the symmetric multiple logistic transformation

(39) $F_j(x) = \log p_j(x) - \dfrac{1}{J} \sum_{k=1}^J \log p_k(x)$.

Equivalently,

(40) $p_j(x) = \dfrac{e^{F_j(x)}}{\sum_{k=1}^J e^{F_k(x)}}$,   $\sum_{k=1}^J F_k(x) = 0$.
The centering condition in (40) is for numerical stability only; it simply pins the Fj down, else we could add an arbitrary constant to each Fj and the probabilities remain the same. The equivalence of these two definitions is easily established, as well as the equivalence with the two-class case.
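A small numerical illustration of Definition 1 (our own): applying (39) to a probability vector and then inverting with (40) recovers the probabilities, and the $F_j$ sum to zero.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])               # p_j(x) for a J = 3 class problem (arbitrary)
F = np.log(p) - np.mean(np.log(p))          # (39): symmetric multiple logistic transform
print(F.sum())                              # centering condition in (40): 0 up to rounding
print(np.exp(F) / np.exp(F).sum())          # (40): recovers [0.5, 0.3, 0.2]
```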
Schapire and Singer (1998) provide several generalizations of AdaBoost for the multiclass case, and also refer to other proposals [Freund and Schapire (1997), Schapire (1997)]; we describe their AdaBoost.MH algorithm (see Algorithm 5), since it seemed to dominate the others in their empirical studies. We then connect it to the models presented here. We will refer to the augmented variable in Algorithm 5 as the "class" variable $C$. We make a few observations:
1. The population version of this algorithm minimizes $\sum_{j=1}^J E[e^{-y_j F_j(x)}]$, which is equivalent to running separate population boosting algorithms on each of the $J$ problems of size $N$ obtained by partitioning the $N \times J$ samples in the obvious fashion. This is seen trivially by first conditioning on $C = j$, and then $x \mid C = j$, when computing conditional expectations.
2. The same is almost true for a tree-based algorithm. We see this because:
(a) If the first split is on C, either a J-nary split if permitted, or else J − 1 binary splits, then the subtrees are identical to separate trees grown to each of the J groups. This will always be the case for the first tree.
(b) If a tree does not split on $C$ anywhere on the path to a terminal node, then that node returns a function $f_m(x, j) = g_m(x)$ that contributes nothing to the classification decision. However, as long as a tree includes a split on $C$ at least once on every path to a terminal node, it will make a contribution to the classifier for all input feature values.
The advantage or disadvantage of building one large tree using class label as an additional input feature is not clear. No motivation is provided. We therefore implement AdaBoost.MH using the more traditional direct approach of building $J$ separate trees to minimize $\sum_{j=1}^J E[\exp(-y_j F_j(x))]$.
We have thus shown
Result 5. The AdaBoost.MH algorithm for a $J$-class problem fits $J$ uncoupled additive logistic models, $G_j(x) = \frac{1}{2} \log[p_j(x)/(1 - p_j(x))]$, each class against the rest.
AdaBoost.MH [Schapire and Singer (1998)]
1. Expand the original $N$ observations into $N \times J$ pairs $((x_i, 1), y_{i1}), ((x_i, 2), y_{i2}), \ldots, ((x_i, J), y_{iJ})$, $i = 1, \ldots, N$. Here $y_{ij}$ is the $\{-1, 1\}$ response for class $j$ and observation $i$.
2. Apply Real AdaBoost to the augmented dataset, producing a function $F: \mathcal{X} \times (1, \ldots, J) \rightarrow \mathbf{R}$, $F(x, j) = \sum_m f_m(x, j)$.
3. Output the classifier $\arg\max_j F(x, j)$.

Algorithm 5. The AdaBoost.MH algorithm converts the $J$-class problem into that of estimating a two-class classifier on a training set $J$ times as large, with an additional "feature" defined by the set of class labels.
In principle this parametrization is fine, since $G_j(x)$ is monotone in $p_j(x)$. However, we are estimating the $G_j(x)$ in an uncoupled fashion, and there is no guarantee that the implied probabilities sum to 1. We give some examples where this makes a difference, and AdaBoost.MH performs more poorly than an alternative coupled likelihood procedure.
Schapire and Singer’s AdaBoost.MH was also intended to cover situations where observations can belong to more than one class. The “MH” represents “Multi-Label Hamming”, Hamming loss being used to measure the errors in the space of 2J possible class labels. In this context fitting a separate classifier for each label is a reasonable strategy. However, Schapire and Singer also propose using AdaBoost.MH when the class labels are mutually exclusive, which is the focus in this paper.
Algorithm 6 is a natural generalization of Algorithm 3 for fitting the J-class logistic regression model (40).
Result 6. The LogitBoost algorithm (J classes, population version) uses quasi-Newton steps for fitting an additive symmetric logistic model by maximum-likelihood.
LogitBoost (J classes)
1. Start with weights $w_{ij} = 1/N$, $i = 1, \ldots, N$, $j = 1, \ldots, J$, $F_j(x) = 0$ and $p_j(x) = 1/J$ $\forall j$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Repeat for $j = 1, \ldots, J$:
      (i) Compute working responses and weights in the $j$th class,
          $z_{ij} = \dfrac{y_{ij}^* - p_j(x_i)}{p_j(x_i)(1 - p_j(x_i))}$,
          $w_{ij} = p_j(x_i)(1 - p_j(x_i))$.
      (ii) Fit the function $f_{mj}(x)$ by a weighted least-squares regression of $z_{ij}$ to $x_i$ with weights $w_{ij}$.
   (b) Set $f_{mj}(x) \leftarrow \dfrac{J-1}{J}\bigl(f_{mj}(x) - \dfrac{1}{J}\sum_{k=1}^J f_{mk}(x)\bigr)$, and $F_j(x) \leftarrow F_j(x) + f_{mj}(x)$.
   (c) Update $p_j(x)$ via (40).
3. Output the classifier $\arg\max_j F_j(x)$.

Algorithm 6. An adaptive Newton algorithm for fitting an additive multiple logistic regression model.
Derivation.

1. We first give the score and Hessian for the population Newton algorithm corresponding to a standard multilogit parametrization

$G_j(x) = \log \dfrac{P(y_j^* = 1 \mid x)}{P(y_J^* = 1 \mid x)}$,

with $G_J(x) = 0$ (and the choice of $J$ for the base class is arbitrary). The expected conditional log-likelihood is

$E[l(G + g) \mid x] = \sum_{j=1}^{J-1} E(y_j^* \mid x)\bigl(G_j(x) + g_j(x)\bigr) - \log\Bigl(1 + \sum_{k=1}^{J-1} e^{G_k(x) + g_k(x)}\Bigr)$,

$s_j(x) = E(y_j^* - p_j(x) \mid x)$, $j = 1, \ldots, J - 1$,

$H_{j,k}(x) = -p_j(x)\bigl(\delta_{jk} - p_k(x)\bigr)$, $j, k = 1, \ldots, J - 1$.

2. Our quasi-Newton update amounts to using a diagonal approximation to the Hessian, producing updates:

$g_j(x) \leftarrow \dfrac{E(y_j^* - p_j(x) \mid x)}{p_j(x)(1 - p_j(x))}$, $j = 1, \ldots, J - 1$.

3. To convert to the symmetric parametrization, we would note that $g_J = 0$, and set $f_j(x) = g_j(x) - \frac{1}{J}\sum_{k=1}^J g_k(x)$. However, this procedure could be applied using any class as the base, not just the $J$th. By averaging over all choices for the base class, we get the update

$f_j(x) = \dfrac{J - 1}{J}\Bigl(\dfrac{E(y_j^* - p_j(x) \mid x)}{p_j(x)(1 - p_j(x))} - \dfrac{1}{J}\sum_{k=1}^J \dfrac{E(y_k^* - p_k(x) \mid x)}{p_k(x)(1 - p_k(x))}\Bigr)$.

For more rigid parametric models and full Newton stepping, this symmetrization would be redundant. With quasi-Newton steps and adaptive (tree based) models, the symmetrization removes the dependence on the choice of the base class.
6. Simulation studies. In this section the four flavors of boosting outlined above are applied to several artificially constructed problems. Comparisons based on real data are presented in Section 7.
An advantage of comparisons made in a simulation setting is that all aspects of each example are known, including the Bayes error rate and the complexity of the decision boundary. In addition, the population expected error rates achieved by each of the respective methods can be estimated to arbitrary accuracy by averaging over a large number of different training and test data
sets drawn from the population. The four boosting methods compared here are:
DAB: Discrete AdaBoost—Algorithm 1.
RAB: Real AdaBoost—Algorithm 2.
LB: LogitBoost—Algorithms 3 and 6.
GAB: Gentle AdaBoost—Algorithm 4.
DAB, RAB and GAB handle multiple classes using the AdaBoost.MH approach.
In an attempt to differentiate performance, all of the simulated examples involve fairly complex decision boundaries. The ten input features for all examples are randomly drawn from a ten-dimensional standard normal distribution $x \sim N^{10}(0, I)$. For the first three examples the decision boundaries separating successive classes are nested concentric ten-dimensional spheres constructed by thresholding the squared-radius from the origin

(41) $r^2 = \sum_{j=1}^{10} x_j^2$.

Each class $C_k$, $1 \le k \le K$, is defined as the subset of observations

(42) $C_k = \{x_i \mid t_{k-1} \le r_i^2 < t_k\}$,

with $t_0 = 0$ and $t_K = \infty$. The $\{t_k\}_1^{K-1}$ for each example were chosen so as to
put approximately equal numbers of observations in each class. The training
sample size is $N = K \cdot 1000$ so that approximately 1000 training observations are in each class. An independently drawn test set of 10,000 observations was used to estimate error rates for each training set. Averaged results over ten such independently drawn training–test set combinations were used for the final error rate estimates. The corresponding statistical uncertainties (standard errors) of these final estimates (averages) are approximately a line width on each plot.
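For concreteness, here is one way to generate the nested-spheres data of (41) and (42) in Python (our own code). Taking the thresholds $t_k$ to be chi-squared(10) quantiles is our choice for putting approximately equal numbers of observations in each class, since $r^2$ has a chi-squared distribution with 10 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def nested_spheres(K, n_per_class=1000, p=10, seed=0):
    """Draw x ~ N^10(0, I); class k is assigned when t_{k-1} <= r^2 < t_k, with
    thresholds taken as chi-squared(10) quantiles so classes are roughly equal-sized."""
    rng = np.random.default_rng(seed)
    N = K * n_per_class
    X = rng.standard_normal((N, p))
    r2 = np.sum(X ** 2, axis=1)                       # squared radius, as in (41)
    t = chi2.ppf(np.linspace(0.0, 1.0, K + 1), df=p)  # t_0 = 0, ..., t_K = inf
    y = np.searchsorted(t, r2, side="right") - 1      # class label in {0, ..., K-1}, as in (42)
    return X, y
```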
Figure 3 (top left) compares the four algorithms in the two-class ($K = 2$) case using a two-terminal node decision tree ("stump") as the base classifier. Shown is error rate as a function of number of boosting iterations. The upper (black) line represents DAB and the other three nearly coincident lines are the other three methods (dotted red = RAB, short-dashed green = LB, and long-dashed blue = GAB). Note that the somewhat erratic behavior of DAB, especially for less than 200 iterations, is not due to statistical uncertainty. For less than 400 iterations LB has a minuscule edge, after that it is a dead heat with RAB and GAB. DAB shows substantially inferior performance here with roughly twice the error rate at all iterations.
Figure 3 (lower left) shows the corresponding results for three classes ($K = 3$), again with two-terminal node trees. Here the problem is more difficult as represented by increased error rates for all four methods, but their relationship is roughly the same: the upper (black) line represents DAB and the other
Fig. 3. Test error curves for the simulation experiment with an additive decision boundary, as described in (42). In all panels except the top right, the solid curve (representing Discrete AdaBoost) lies alone above the other three curves.
three nearly coincident lines are the other three methods. The situation is somewhat different for larger number of classes. Figure 3 (lower right) shows results for $K = 5$, which are typical for $K \ge 4$. As before, DAB incurs much higher error rates than all the others, and RAB and GAB have nearly identical performance. However, the performance of LB relative to RAB and GAB has changed. Up to about 40 iterations it has the same error rate. From 40 to about 100 iterations LB's error rates are slightly higher than the other two. After 100 iterations the error rate for LB continues to improve whereas that
for RAB and GAB level off, decreasing much more slowly. By 800 iterations the error rate for LB is 0.19 whereas that for RAB and GAB is 0.32. Speculation as to the reason for LB’s performance gain in these situations is presented below.
In the above examples a stump was used as the base classifier. One might expect the use of larger trees would do better for these rather complex problems. Figure 3 (top right) shows results for the two-class problem, here boosting trees with eight terminal nodes. These results can be compared to those for stumps in Figure 3 (top left). Initially, error rates for boosting eight-node trees decrease much more rapidly than for stumps, with each successive iteration, for all methods. However, the error rates quickly level off and improvement is very slow after about 100 iterations. The overall performance of DAB is much improved with the bigger trees, coming close to that of the other three methods. As before RAB, GAB and LB exhibit nearly identical performance. Note that at each iteration the eight-node tree model consists of four times the number of additive terms as does the corresponding stump model. This is why the error rates decrease so much more rapidly in the early iterations. In terms of model complexity (and training time), a 100-iteration model using eight-terminal node trees is equivalent to a 400-iteration stump model.
Comparing the top two panels in Figure 3, one sees that for RAB, GAB and LB the error rate using the bigger trees (0.072) is in fact 33% higher than that for stumps (0.054) at 800 iterations, even though the former is four times more complex. This seemingly mysterious behavior is easily understood by examining the nature of the decision boundary separating the classes. The Bayes decision boundary between two classes is the set
(43) $\Bigl\{x: \log \dfrac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = 0\Bigr\}$,
or simply $\{x: B(x) = 0\}$. To approximate this set it is sufficient to estimate the logit $B(x)$, or any monotone transformation of $B(x)$, as closely as possible. As discussed above, boosting produces an additive logistic model whose component functions are represented by the base classifier. With stumps as the base classifier, each component function has the form

(44) $f_m(x) = c_{mL}\, 1_{[x_j \le t_m]} + c_{mR}\, 1_{[x_j > t_m]}$
(45) $= f_m(x_j)$,

if the $m$th stump chose to split on coordinate $j$. Here $t_m$ is the split-point, and $c_{mL}$ and $c_{mR}$ are the weighted means of the response in the left and right terminal nodes. Thus the model produced by boosting stumps is additive in the original features,
(46) $F(x) = \sum_{j=1}^p g_j(x_j)$,

where $g_j(x_j)$ adds together all those stumps involving $x_j$ (and is 0 if none exist).
Examination of (41) and (42) reveals that an optimal decision boundary for the above examples is also additive in the original features, with $f_j(x_j) = x_j^2 + \text{constant}$. Thus, in the context of decision trees, stumps are ideally matched to these problems; larger trees are not needed. However boosting larger trees need not be counterproductive in this case if all of the splits in each individual tree are made on the same predictor variable. This would also produce an additive model in the original features (46). However, due to the forward greedy stagewise strategy used by boosting, this is not likely to happen if the decision boundary function involves more than one predictor; each individual tree will try to do its best to involve all of the important predictors. Owing to the nature of decision trees, this will produce models with interaction effects; most terms in the model will involve products in more than one variable. Such nonadditive models are not as well suited for approximating truly additive decision boundaries such as (41) and (42). This is reflected in increased error rate as observed in Figure 3.
The above discussion also suggests that if the decision boundary separating
pairs of classes were inherently nonadditive in the predictors, then boosting
stumps would be less advantageous than using larger trees. A tree with m
terminal nodes can produce basis functions with a maximum interaction order
of $\min(m - 1, p)$ where $p$ is the number of predictor features. These higher
order basis functions provide the possibility to more accurately estimate those
decision boundaries $B(x)$ with high-order interactions. The purpose of the next
example is to verify this intuition. There are two classes ($K = 2$) and 5000
training observations, with the $\{x_i\}_1^{5000}$ drawn from a ten-dimensional normal distribution as in the previous examples. Class labels were randomly assigned to each observation with log-odds

(47) $\log \dfrac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} = 10 \sum_{j=1}^6 x_j \Bigl(1 + \sum_{l=1}^6 (-1)^l x_l\Bigr)$.
Approximately equal numbers of observations are assigned to each of the two classes, and the Bayes error rate is 0.046. The decision boundary for this problem is a complicated function of the first six predictor variables involving all of them in second-order interactions of equal strength. As in the above examples, test sets of 10,000 observations were used to estimate error rates for each training set, and final estimates were averages over ten replications.
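A sketch of the corresponding data generator for the nonadditive example (47), again our own code; labels are drawn from the Bernoulli distribution implied by the stated log-odds.

```python
import numpy as np

def nonadditive_example(N=5000, p=10, seed=0):
    """Draw x ~ N^10(0, I) and labels y in {-1, +1} with log-odds given by (47)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, p))
    signs = (-1.0) ** np.arange(1, 7)                              # (-1)^l for l = 1, ..., 6
    log_odds = 10.0 * X[:, :6].sum(axis=1) * (1.0 + X[:, :6] @ signs)
    prob = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -30, 30)))       # P(y = 1 | x), clipped
    y = np.where(rng.uniform(size=N) < prob, 1, -1)
    return X, y
```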
Figure 4 (top left) shows test-error rate as a function of iteration number for each of the four boosting methods using stumps. As in the previous examples, RAB and GAB track each other very closely. DAB begins very slowly, being dominated by all of the others until around 180 iterations, where it passes below RAB and GAB. LB mostly dominates, having the lowest error rate until about 650 iterations. At that point DAB catches up and by 800 iterations it may have a very slight edge. However, none of these boosting methods perform well with stumps on this problem, the best error rate being 0.35.
Figure 4 (top right) shows the corresponding plot when four terminal node trees are boosted. Here there is a dramatic improvement with all of the four
Fig. 4. Test error curves for the simulation experiment with a nonadditive decision boundary, as described in (47).
methods. For the first time there is some small differentiation between RAB and GAB. At nearly all iterations the performance ranking is LB best, followed by GAB, RAB and DAB in order. At 800 iterations LB achieves an error rate of 0.134. Figure 4 (lower left) shows results when eight terminal node trees are boosted. Here, error rates are generally further reduced with LB improving the least (0.130), but still dominating. The performance ranking among the other three methods changes with increasing iterations; DAB overtakes RAB
at around 150 iterations and GAB at about 230 becoming fairly close to LB by 800 iterations with an error rate of 0.138.
Although limited in scope, these simulation studies suggest several trends. They explain why boosting stumps can sometimes be superior to using larger trees, and suggest situations where this is likely to be the case; that is when decision boundaries $B(x)$ can be closely approximated by functions that are additive in the original predictor features. When higher order interactions are required, stumps exhibit poor performance. These examples illustrate the close similarity between RAB and GAB. In all cases the difference in performance between DAB and the others decreases when larger trees and more iterations are used, sometimes overtaking the others. More generally, relative performance of these four methods depends on the problem at hand in terms of the nature of the decision boundaries, the complexity of the base classifier and the number of boosting iterations.
The superior performance of LB in Figure 3 (lower right) appears to be a consequence of the multiclass logistic model (Algorithm 6). All of the other methods use the asymmetric AdaBoost.MH strategy (Algorithm 5) of building separate two-class models for each individual class against the pooled complement classes. Even if the decision boundaries separating all class pairs are relatively simple, pooling classes can produce complex decision boundaries that are difficult to approximate [Friedman (1996)]. By considering all of the classes simultaneously, the symmetric multiclass model is better able to take advantage of simple pairwise boundaries when they exist [Hastie and Tibshirani (1998)]. As noted above, the pairwise boundaries induced by (41) and (42) are simple when viewed in the context of additive modeling, whereas the pooled boundaries are more complex; they cannot be well approximated by functions that are additive in the original predictor variables.
The decision boundaries associated with these examples were deliberately chosen to be geometrically complex in an attempt to elicit performance differences among the methods being tested. Such complicated boundaries are not likely to often occur in practice. Many practical problems involve comparatively simple boundaries [Holte (1993)]; in such cases performance differences will still be situation dependent, but correspondingly less pronounced.
7. Some experiments with real world data. In this section we show the results of running the four fitting methods: LogitBoost, Discrete AdaBoost, Real AdaBoost and Gentle AdaBoost on a collection of datasets from the UC-Irvine machine learning archive, plus a popular simulated dataset. The base learner is a tree in each case, with either two or eight terminal nodes. For comparison, a single decision tree was also fit (using the tree function in Splus), with the tree size determined by 5-fold cross-validation.
The datasets are summarized in Table 1. The test error rates are shown in Table 2 for the smaller datasets, and in Table 3 for the larger ones. The vowel, sonar, satimage and letter datasets come with a prespecified test set. The waveform data is simulated, as described in Breiman, Friedman, Olshen and Stone (1984). For the others, 5-fold cross-validation was used to estimate the test error.

Table 1
Datasets used in the experiments

Data set         # Train   # Test       # Inputs   # Classes
Vowel                528   462                10        11
Breast cancer        699   5-fold CV           9         2
Ionosphere           351   5-fold CV          34         2
Glass                214   5-fold CV          10         7
Sonar                210   5-fold CV          60         2
Waveform             300   5000               21         3
Satimage            4435   2000               36         6
Letter             16000   4000               16        26
It is difficult to discern trends on the small datasets (Table 2) because all but quite large observed differences in performance could be attributed to sampling fluctuations. On the vowel, breast cancer, ionosphere, sonar and waveform data, purely additive stump models seem to perform comparably to the larger (eight-node) trees. The glass data seems to benefit a little from larger trees. There is no clear differentiation in performance among the boosting methods.

On the larger data sets (Table 3) clearer trends are discernible. For the satimage data the eight-node tree models are only slightly, but significantly, more accurate than the purely additive models. For the letter data there is no contest. Boosting stumps is clearly inadequate. There is no clear differentiation among the boosting methods for eight-node trees. For the stumps, LogitBoost, Real AdaBoost and Gentle AdaBoost have comparable performance, distinctly superior to Discrete AdaBoost. This is consistent with the results of the simulation study (Section 6).

Except perhaps for Discrete AdaBoost, the real data examples fail to demonstrate performance differences between the various boosting methods. This is in contrast to the simulated data sets of Section 6. There LogitBoost generally dominated, although often by a small margin. The inability of the real data examples to discriminate may reflect statistical difficulties in estimating subtle differences with small samples. Alternatively, it may be that their underlying decision boundaries are all relatively simple [Holte (1993)] so that all reasonable methods exhibit similar performance.
Table 2
Test error rates on small real examples

                               2 Terminal Nodes           8 Terminal Nodes
                              Iterations                  Iterations
Method                         50     100    200           50     100    200

Vowel (CART error = 0.642)
  LogitBoost                  0.532  0.524  0.511         0.517  0.517  0.517
  Real AdaBoost               0.565  0.561  0.548         0.496  0.496  0.496
  Gentle AdaBoost             0.556  0.571  0.584         0.515  0.496  0.496
  Discrete AdaBoost           0.563  0.535  0.563         0.511  0.500  0.500

Breast (CART error = 0.045)
  LogitBoost                  0.028  0.031  0.029         0.034  0.038  0.038
  Real AdaBoost               0.038  0.038  0.040         0.032  0.034  0.034
  Gentle AdaBoost             0.037  0.037  0.041         0.032  0.031  0.031
  Discrete AdaBoost           0.042  0.040  0.040         0.032  0.035  0.037

Ion (CART error = 0.076)
  LogitBoost                  0.074  0.077  0.071         0.068  0.063  0.063
  Real AdaBoost               0.068  0.066  0.068         0.054  0.054  0.054
  Gentle AdaBoost             0.085  0.074  0.077         0.066  0.063  0.063
  Discrete AdaBoost           0.088  0.080  0.080         0.068  0.063  0.063

Glass (CART error = 0.400)
  LogitBoost                  0.266  0.257  0.266         0.243  0.238  0.238
  Real AdaBoost               0.276  0.247  0.257         0.234  0.234  0.234
  Gentle AdaBoost             0.276  0.261  0.252         0.219  0.233  0.238
  Discrete AdaBoost           0.285  0.285  0.271         0.238  0.234  0.243

Sonar (CART error = 0.596)
  LogitBoost                  0.231  0.231  0.202         0.163  0.154  0.154
  Real AdaBoost               0.154  0.163  0.202         0.173  0.173  0.173
  Gentle AdaBoost             0.183  0.183  0.173         0.154  0.154  0.154
  Discrete AdaBoost           0.154  0.144  0.183         0.163  0.144  0.144

Waveform (CART error = 0.364)
  LogitBoost                  0.196  0.195  0.206         0.192  0.191  0.191
  Real AdaBoost               0.193  0.197  0.195         0.185  0.182  0.182
  Gentle AdaBoost             0.190  0.188  0.193         0.185  0.185  0.186
  Discrete AdaBoost           0.188  0.185  0.191         0.186  0.183  0.183

Table 3
Test error rates on larger data examples

                               Terminal               Iterations
Method                          Nodes        20      50     100     200   Fraction

Satimage (CART error = 0.148)
  LogitBoost                      2         0.140   0.120  0.112   0.102
  Real AdaBoost                   2         0.148   0.126  0.117   0.119
  Gentle AdaBoost                 2         0.148   0.129  0.119   0.119
  Discrete AdaBoost               2         0.174   0.156  0.140   0.128
  LogitBoost                      8         0.096   0.095  0.092   0.088
  Real AdaBoost                   8         0.105   0.102  0.092   0.091
  Gentle AdaBoost                 8         0.106   0.103  0.095   0.089
  Discrete AdaBoost               8         0.122   0.107  0.100   0.099

Letter (CART error = 0.124)
  LogitBoost                      2         0.250   0.182  0.159   0.145    0.06
  Real AdaBoost                   2         0.244   0.181  0.160   0.150    0.12
  Gentle AdaBoost                 2         0.246   0.187  0.157   0.145    0.14
  Discrete AdaBoost               2         0.310   0.226  0.196   0.185    0.18
  LogitBoost                      8         0.075   0.047  0.036   0.033    0.03
  Real AdaBoost                   8         0.068   0.041  0.033   0.032    0.03
  Gentle AdaBoost                 8         0.068   0.040  0.030   0.028    0.03
  Discrete AdaBoost               8         0.080   0.045  0.035   0.029    0.03

8. Additive logistic trees. In most applications of boosting the base classifier is considered to be a primitive, repeatedly called by the boosting procedure as iterations proceed. The operations performed by the base classifier are the same as they would be in any other context given the same data and weights. The fact that the final model is going to be a linear combination of a large number of such classifiers is not taken into account. In particular, when using decision trees, the same tree growing and pruning algorithms are generally employed. Sometimes alterations are made (such as no pruning) for programming convenience and speed.
When boosting is viewed in the light of additive modeling, however, this greedy approach can be seen to be far from optimal in many situations. As discussed in Section 6 the goal of the final classifier is to produce an accurate approximation to the decision boundary function B(x). In the context of boosting, this goal applies to the final additive model, not to the individual terms (base classifiers) at the time they were constructed. For example, it was seen in Section 6 that if B(x) was close to being additive in the original predictive features, then boosting stumps was optimal since it produced an approximation with the same structure. Building larger trees increased the error rate of the final model because the resulting approximation involved high-order interactions among the features. The larger trees optimized error rates of the individual base classifiers, given the weights at that step, and even produced lower unweighted error rates in the early stages. However, after a sufficient number of boosts, the stump-based model achieved superior performance.
More generally, one can consider an expansion of the decision boundary function in a functional ANOVA decomposition [Friedman (1991)]
(48)    B(x) = Σ_j f_j(x_j) + Σ_{j,k} f_{jk}(x_j, x_k) + Σ_{j,k,l} f_{jkl}(x_j, x_k, x_l) + · · ·
The first sum represents the closest function to B(x) that is additive in the original features, the sum of the first two represents the closest approximation involving at most two-feature interactions, the sum of the first three the closest approximation involving at most three-feature interactions, and so on. If B(x) can be accurately approximated by such an expansion, truncated at low interaction order, then allowing the base classifier to produce higher order interactions can reduce the accuracy of the final boosted model. In the context of decision trees, higher order interactions are produced by deeper trees.
In situations where the true underlying decision boundary function admits a low order ANOVA decomposition, one can take advantage of this structure to improve accuracy by restricting the depth of the base decision trees to be not much larger than the actual interaction order of B(x). Since this is not likely to be known in advance for any particular problem, this maximum depth becomes a "meta-parameter" of the procedure to be estimated by some model selection technique, such as cross-validation.
One can restrict the depth of an induced decision tree by using its standard pruning procedure, starting from the largest possible tree, but requiring it to delete enough splits to achieve the desired maximum depth. This can be computationally wasteful when this depth is small. The time required to build the tree is proportional to the depth of the largest possible tree before pruning. Therefore, dramatic computational savings can be achieved by simply stopping the growing process at the maximum depth, or alternatively at a maximum number of terminal nodes. The standard heuristic arguments in favor of growing large trees and then pruning do not apply in the context of boosting. Shortcomings in any individual tree can be compensated by trees grown later in the boosting sequence.
If a truncation strategy based on the number of terminal nodes is to be employed, it is necessary to define an order in which splitting takes place. We adopt a "best-first" strategy. An optimal split is computed for each currently terminal node. The node whose split would achieve the greatest reduction in the tree building criterion is then actually split. This increases the number of terminal nodes by one. This continues until a maximum number M of terminal nodes is induced. Standard computational tricks can be employed so that inducing trees in this order requires no more computation than other orderings commonly used in decision tree induction.
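To make the best-first ordering concrete, here is a minimal sketch in Python; it is not the authors' implementation, and the helper names (best_split, best_first_tree), the exhaustive threshold search and the weighted squared-error criterion are illustrative assumptions standing in for the actual tree building criterion.

    import numpy as np

    def best_split(x, y, w):
        """Best single axis-aligned split of one node, scored by the reduction
        in weighted squared error (an illustrative stand-in for the criterion)."""
        parent_sse = np.sum(w * (y - np.average(y, weights=w)) ** 2)
        best = (None, None, 0.0)
        for j in range(x.shape[1]):
            for t in np.unique(x[:, j])[:-1]:
                left = x[:, j] <= t
                sse = 0.0
                for mask in (left, ~left):
                    yk, wk = y[mask], w[mask]
                    sse += np.sum(wk * (yk - np.average(yk, weights=wk)) ** 2)
                gain = parent_sse - sse
                if gain > best[2]:
                    best = (j, t, gain)
        return best

    def best_first_tree(x, y, w, max_leaves):
        """Grow a tree by repeatedly splitting the terminal node whose best split
        gives the largest criterion reduction, until max_leaves nodes exist."""
        leaves = [np.arange(len(y))]              # index sets of terminal nodes
        while len(leaves) < max_leaves:
            splits = [best_split(x[idx], y[idx], w[idx]) for idx in leaves]
            gains = [s[2] for s in splits]
            if max(gains) <= 0.0:                 # no split improves the criterion
                break
            i = int(np.argmax(gains))             # node whose split helps most
            j, t, _ = splits[i]
            idx = leaves.pop(i)
            left = idx[x[idx, j] <= t]
            leaves.extend([left, np.setdiff1d(idx, left)])
        # each terminal node predicts its weighted mean response
        return [(idx, np.average(y[idx], weights=w[idx])) for idx in leaves]

The sketch recomputes every candidate split at each step for clarity; the "standard computational tricks" mentioned above amount to caching these candidate splits.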
The truncation limit M is applied to all trees in the boosting sequence. It is thus a meta-parameter of the entire boosting procedure. An optimal value can be estimated through standard model selection techniques such as minimizing cross-validated error rate of the final boosted model. We refer to this combination of truncated best-first trees, with boosting, as "additive logistic trees" (ALT). Best-first trees were used in all of the simulated and real examples. One can compare results on the latter (Tables 2 and 3) to corresponding results reported by Dietterich [(1998), Table 1] on common data sets. Error rates achieved by ALT with very small truncation values are seen to compare quite favorably with other committee approaches using much larger trees at each boosting step. Even when error rates are the same, the computational savings associated with ALT can be quite important in data mining contexts where large data sets cause computation time to become an issue.

Fig. 5. Coordinate functions for the additive logistic tree obtained by boosting (LogitBoost) with stumps, for the two-class nested sphere example from Section 6.
Another advantage of low order approximations is model visualization. In particular, for models additive in the input features (46), the contribution of each feature x_j can be viewed as a graph of g_j(x_j) plotted against x_j. Figure 5 shows such plots for the ten features of the two-class nested spheres example of Figure 3. The functions are shown for the first class concentrated near the origin; the corresponding functions for the other class are the negatives of these functions.
The plots in Figure 5 clearly show that the contribution to the log-odds of each individual feature is approximately quadratic, which matches the generating model (41) and (42).
When there are more than two classes, plots similar to Figure 5 can be made for each class and analogously interpreted. Higher order interaction models are more difficult to visualize. If there are at most two-feature interactions, the two-variable contributions can be visualized using contour or perspective mesh plots. Beyond two-feature interactions, visualization techniques are even less effective. Even when noninteraction (stump) models do not achieve the highest accuracy, they can be very useful as descriptive statistics owing to the interpretability of the resulting model.
9. Weight trimming. In this section we propose a simple idea and show that it can dramatically reduce computation for boosted models without sacrificing accuracy. Despite its apparent simplicity, this approach does not appear to be in common use [although similar ideas have been proposed before: Schapire (1990), Freund (1995)]. At each boosting iteration there is a distribution of weights over the training sample. As iterations proceed, this distribution tends to become highly skewed towards smaller weight values. A larger fraction of the training sample becomes correctly classified with increasing confidence, thereby receiving smaller weights. Observations with very low relative weight have little impact on training of the base classifier; only those that carry the dominant proportion of the weight mass are influential. The fraction of such high weight observations can become very small in later iterations. This suggests that at any iteration one can simply delete from the training sample the large fraction of observations with very low weight without having much effect on the resulting induced classifier. However, computation is reduced since it tends to be proportional to the size of the training sample, regardless of weights.
At each boosting iteration, training observations with weight w_i < t_β are not used to train the classifier. We take the value of t_β to be the βth quantile of the weight distribution over the training data at the corresponding iteration. That is, only those observations that carry the fraction 1 − β of the total weight mass are used for training. Typically β ∈ [0.01, 0.1], so that the data used for training carries from 90 to 99% of the total weight mass. Note that the weights for all training observations are recomputed at each iteration. Observations deleted at a particular iteration may therefore reenter at later iterations if their weights subsequently increase relative to other observations.
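As a concrete illustration, the following Python sketch (ours, not the authors' code) selects the retained observations at one iteration; the helper name trim_weights and the weight-mass cutoff are assumptions consistent with the description above.

    import numpy as np

    def trim_weights(w, beta=0.10):
        """Indices of the observations kept at this iteration: the heaviest cases
        that together carry at least a fraction (1 - beta) of the total weight
        mass; the rest are dropped from training the base classifier."""
        order = np.argsort(w)[::-1]                    # heaviest observations first
        cum = np.cumsum(w[order]) / np.sum(w)
        k = int(np.searchsorted(cum, 1.0 - beta)) + 1  # smallest k reaching 1 - beta
        return np.sort(order[:k])

    # Inside the boosting loop the weights are recomputed every iteration, so a
    # trimmed observation can reenter later; hypothetical usage:
    #   keep = trim_weights(w_m, beta=0.10)
    #   base_learner.fit(X[keep], y[keep], sample_weight=w_m[keep])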
Figure 6 (left panel) shows test-error rate as a function of iteration number for the letter recognition problem described in Section 7, here using Gentle AdaBoost and eight-node trees as the base classifier. Two error rate curves are shown. The black solid one represents using the full training sample at each iteration (β = 0), whereas the blue dashed curve represents the corresponding error rate for β = 0.1. The two curves track each other very closely, especially at the later iterations. Figure 6 (right panel) shows the corresponding fraction of observations used to train the base classifier as a function of iteration number. Here the two curves are not similar. With β = 0.1 the number of observations used for training drops very rapidly, reaching roughly 5% of the total at 20 iterations. By 50 iterations it is down to about 3%, where it stays throughout the rest of the boosting procedure. Thus, computation is reduced by over a factor of 30 with no apparent loss in classification accuracy. The reason why sample size in this case decreases for β = 0 after 150 iterations is that if all of the observations in a particular class are classified correctly with very high confidence (F_k > 15 + log N), training for that class stops and continues only for the remaining classes. At 400 iterations, 12 classes remained of the original 26 classes.

Fig. 6. The left panel shows the test error for the letter recognition problem as a function of iteration number. The black solid curve uses all the training data, the red dashed curve uses a subset based on weight thresholding. The right panel shows the percent of training data used for both approaches. The upper curve steps down, because training can stop for an entire class if it is fit sufficiently well (see text).
The last column labeled fraction in Table 3 for the letter-recognition problem shows the average fraction of observations used in training the base classifiers over the 200 iterations, for all boosting methods and tree sizes. For eight-node trees, all methods behave as shown in Figure 6. With stumps, LogitBoost uses considerably less data than the others and is thereby correspondingly faster.
This is a genuine property of LogitBoost that sometimes gives it an advantage with weight trimming. Unlike the other methods, the LogitBoost weights w_i = p_i(1 − p_i) do not in any way involve the class outputs y_i; they simply measure nearness to the currently estimated decision boundary F_M(x) = 0. Discarding small weights thus retains only those training observations that are estimated to be close to the boundary. For the other three procedures the weight is monotone in −y_i F_M(x_i). This gives highest weight to currently misclassified training observations, especially those far from the boundary. If after trimming the fraction of observations remaining is less than the error rate, the subsample passed to the base learner will be highly unbalanced, containing very few correctly classified observations. This imbalance seems to inhibit learning. No such imbalance occurs with LogitBoost since near the decision boundary, correctly and misclassified observations appear in roughly equal numbers.
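The contrast between the two weighting schemes can be written down directly; the short sketch below is ours, not from the paper, and assumes the paper's parameterization p(x) = e^{F(x)}/(e^{F(x)} + e^{−F(x)}) with y ∈ {−1, 1}.

    import numpy as np

    def logitboost_weights(F):
        """LogitBoost weights w_i = p_i (1 - p_i): largest where p is near 1/2,
        i.e., near the current boundary F(x) = 0; they do not involve y."""
        p = 1.0 / (1.0 + np.exp(-2.0 * F))    # p = e^F / (e^F + e^-F)
        return p * (1.0 - p)

    def exponential_weights(F, y):
        """AdaBoost-style weights exp(-y F(x)): monotone in -y F(x), so badly
        misclassified observations far from the boundary dominate."""
        return np.exp(-y * F)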
As this example illustrates, very large reductions in computation for boosting can be achieved by this simple trick. A variety of other examples (not shown) exhibit similar behavior with all boosting methods. Note that other committee approaches to classification, such as bagging [Breiman (1996)] and randomized trees [Dietterich (1998)], while admitting parallel implementations, cannot take advantage of this approach to reduce computation.
10. Further generalizations of boosting. We have shown above that AdaBoost fits an additive model, optimizing a criterion similar to binomial log-likelihood, via an adaptive Newton method. This suggests ways in which the boosting paradigm may be generalized. First, the Newton step can be replaced by a gradient step, slowing down the fitting procedure. This can reduce susceptibility to overfitting and lead to improved performance. Second, any smooth loss function can be used: for regression, squared error is natural, leading to the "fitting of residuals" boosting algorithm mentioned in the introduction. However, other loss functions might have benefits, for example, tapered squared error based on Huber's robust influence function [Huber (1964)]. The resulting procedure is a fast, convenient method for resistant fitting of additive models. Details of these generalizations may be found in Friedman (1999).
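As a small illustration of the gradient-step generalization (a sketch under our own naming, not the detailed procedure of Friedman (1999)), the pseudo-residuals for a tapered squared error based on Huber's influence function are simply the clipped ordinary residuals:

    import numpy as np

    def huber_pseudo_residuals(y, F, delta):
        """Negative gradient of the Huber loss at the current fit F: equal to the
        residual y - F where it is small, and clipped at +/- delta otherwise.
        A gradient boosting step fits the weak learner to these values and adds
        a (possibly shrunken) multiple of that fit to F."""
        r = y - F
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))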
11. Concluding remarks. In order to understand a learning procedure statistically it is necessary to identify two important aspects: its structural model and its error model. The former is most important since it determines the function space of the approximator, thereby characterizing the class of functions or hypotheses that can be accurately approximated with it. The error model specifies the distribution of random departures of sampled data from the structural model. It thereby defines the criterion to be optimized in the estimation of the structural model.
We have shown that the structural model for boosting is additive on the logistic scale with the base learner providing the additive components. This understanding alone explains many of the properties of boosting. It is no surprise that a large number of such (jointly optimized) components defines a much richer class of learners than one of them alone. It reveals that in the context of boosting all base learners are not equivalent, and there is no universally best choice over all situations. As illustrated in Section 6, the base learners need to be chosen so that the resulting additive expansion matches the particular decision boundary encountered. Even in the limited context of boosting decision trees the interaction order, as characterized by the number of terminal nodes, needs to be chosen with care. Purely additive models induced by decision stumps are sometimes, but not always, the best. However, we conjecture that boundaries involving very high-order interactions will rarely be encountered in practice. This motivates our additive logistic trees (ALT) procedure described in Section 8.
The error model for two-class boosting is the obvious one for binary variables, namely the Bernoulli distribution. We show that the AdaBoost procedures maximize a criterion that is closely related to expected log-Bernoulli likelihood, having the identical solution in the distributional (L_2) limit of infinite data. We derived a more direct procedure for maximizing this log-likelihood (LogitBoost) and show that it exhibits properties nearly identical to those of Real AdaBoost.
In the multiclass case, the AdaBoost procedures maximize a separate Bernoulli likelihood for each class versus the others. This is a natural choice and is especially appropriate when observations can belong to more than one class [Schapire and Singer (1998)]. In the more usual setting of a unique class label for each observation, the symmetric multinomial distribution is a more appropriate error model. We develop a multiclass LogitBoost procedure that maximizes the corresponding log-likelihood by quasi-Newton stepping. We show through simulated examples that there exist settings where this approach leads to superior performance, although none of these situations seems to have been encountered in the set of real data examples used for illustration; both approaches exhibited quite similar performance over these examples.
The concepts developed in this paper suggest that there is very little, if any, connection between (deterministic) weighted boosting and other (randomized) ensemble methods such as bagging [Breiman (1996)] and randomized trees [Dietterich (1998)]. In the language of least-squares regression, the latter are
purely "variance" reducing procedures intended to mitigate instability, especially that associated with decision trees. Boosting on the other hand seems fundamentally different. It appears to be mainly a "bias" reducing procedure, intended to increase the flexibility of stable (highly biased) weak learners by incorporating them in a jointly fitted additive expansion.
The distinction becomes less clear [Breiman (1998a)] when boosting is implemented by finite weighted random sampling instead of weighted optimization. The advantages or disadvantages of introducing randomization into boosting by drawing finite samples are not clear. If there turns out to be an advantage with randomization in some situations, then the degree of randomization, as reflected by the sample size, is an open question. It is not obvious that the common choice of using the size of the original training sample is optimal in all (or any) situations.
One fascinating issue not covered in this paper is the fact that boosting, whatever flavor, seems resistant to overfitting. Some possible explanations are:
1. As the LogitBoost iterations proceed, the overall impact of changes introduced by f_m(x) reduces. Only observations with appreciable weight determine the new functions, namely those near the decision boundary. By definition these observations have F(x) near zero and can be affected by changes, while those in pure regions have large values of F(x) and are less likely to be modified.
2. The stagewise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and thus has far lower variance than the full parameterization might suggest. In the computational learning theory literature this is explained in terms of VC dimension of the ensemble compared to that of each weak learner.
3. Classifiers are hurt less by overfitting than other function estimators [e.g., the famous risk bound of the 1-nearest-neighbor classifier, Cover and Hart (1967)].

Fig. 7. Real AdaBoost (stumps) on a noisy concentric-sphere problem, with 400 observations per class and Bayes error 25%. The test error (upper curve) increases after about fifty iterations.
Figure 7 shows a case where boosting does overfit. The data are generated from two ten-dimensional spherical Gaussians with the same mean, and variances chosen so that the Bayes error is 25% (400 samples per class). We used Real AdaBoost and stumps (the results were similar for all the boosting algorithms). After about 50 iterations the test error (slowly) increases.
Schapire, Freund, Bartlett and Lee (1998) suggest that the properties of AdaBoost, including its resistance to overfitting, can be understood in terms of classification margins. However, Breiman (1997) presents evidence counter to this explanation. Whatever the explanation, the empirical evidence is strong; the introduction of boosting by Schapire, Freund and colleagues has brought an exciting and important set of new ideas to the table.
Acknowledgments. We thank Andreas Buja for alerting us to the recent work on text classification at AT&T laboratories, Bogdan Popescu for illuminating discussions on PAC learning theory and Leo Breiman and Robert Schapire for useful comments on an earlier version of this paper. We also thank two anonymous referees and an Associate Editor for detailed and useful comments on an earlier draft of the paper.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Breiman, L. (1997). Prediction games and arcing algorithms. Technical Report 504, Dept. Statistics, Univ. California, Berkeley.
Breiman, L. (1998a). Arcing classifiers (with discussion). Ann. Statist. 26 801–849.
Breiman, L. (1998b). Combining predictors. Technical report, Dept. Statistics, Univ. California, Berkeley.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17 453–555.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13 21–27.
Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 1–22.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256–285.
Freund, Y. and Schapire, R. (1996a). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325–332.
Freund, Y. and Schapire, R. E. (1996b). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufman, San Francisco.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. J. Comput. System Sciences 55.
Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141.
Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Stanford Univ.
Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. Ann. Statist. 26 451–471.
Hastie, T., Tibshirani, R. and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Amer. Statist. Assoc. 89 1255–1270.
Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63–90.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73–101.
Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press.
Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41 3397–3415.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Schapire, R. (1997). Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning 313–321. Morgan Kaufman, San Francisco.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.
Schapire, R. E. and Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27 1134–1142.
Department of Statistics
Sequoia Hall
Stanford University
Stanford, California 94305
E-mail: jnf, hastie, tibs@stat.stanford.edu
DISCUSSION
Leo Breiman
University of California, Berkeley
The authors have presented a fascinating view of boosting and I congratulate them for the steps they have taken to clear up the mystery of why the AdaBoost algorithm works so well. Yet I have misgivings about their explanation (which I expressed earlier in my comments to them on receiving the first draft of their paper).
A crucial property of AdaBoost is that it almost never overfits the data no matter how many iterations it is run. The authors' counterexample is the first convincing one I have seen, but it stands in contrast to the hundreds of data sets on which AdaBoost has not produced overfitting. Any complete explanation of AdaBoost has to explain this empirical fact. Ordinarily, in logistic regression,
if one adds more and more variables, increasing the likelihood at each step, then at some point overfitting sets in and the test-set error blossoms. The fact that this does not happen in AdaBoost or in LogitBoost is extraordinary. And, unless I am missing something, there is no explanation in the paper.
I approach the problem differently. In Breiman (1999a), I assume an ensemble of classifiers h(x, θ), θ ∈ Θ. Let the sequence {θ_n} be i.i.d. selections from Θ according to some probability Q, and let the classifiers h(x, θ_1), h(x, θ_2), ..., h(x, θ_N) each cast a unit vote for the predicted class at x. Then, under weak conditions, as N goes to infinity, the selected class converges a.s. to arg max_j Q(h(x, θ) = j). This holds, for example, if the classifiers are trees, in which case I refer to the ensemble as a "random forest." The theorem is a simple consequence of the law of large numbers.
For example, in bagging, each classifier is grown on a bootstrap sample from the original training set. If the training set has K examples, let θ be a K-vector of nonnegative integers adding to K. Under Q, let θ have the distribution of sampling K times with replacement from K objects. Thus, using the results cited, bagging converges a.s. as the number of iterations becomes large.
AdaBoost is probably a random forest. Consider the following version of AdaBoost: given a probability P on the training set, use a fixed algorithm to construct a classifier h(x, P) using the weighted training set. The next probability P′ on the training set depends only on P, the indices of the examples misclassified and the weighted misclassification error. Then there is a deterministic map, taking P to P′, defined on the space of all probabilities on the training set. Assume that this map is ergodic with invariant measure Q and that the voting is unweighted. Then the class selected converges a.s. to arg max_j Q(h(x, P) = j).
The implication is that the random forest constructed by letting the sequence {P_n} be i.i.d. from Q is equivalent to AdaBoost. AdaBoost uses weighted voting, but a small alteration in the argument gives the same result. In Breiman (1999a), empirical evidence is given to show that AdaBoost is a random forest. This would settle the question of why AdaBoost does not overfit, but not the question of why AdaBoost gives such low test-set error.
There are many different random forests, and some are more accurate than others. For instance, Dietterich (1998) constructs a random forest by choosing at random among the ten best splits at any node and shows empirically that this gives generally lower test-set error than bagging. In Breiman (1999a) we show that the test-set misclassification error is bounded above by an expression that depends on two parameters: the first is the expected correlation between pairs of classifiers, and the second is the expected accuracy of the individual classifiers.
One very effective way to reduce correlation is by random choice of features at each node. That is, suppose there are ten predictor variables. At each node, choose a small subgroup of them at random to split on. This idea first appeared in a paper by Amit and Geman (1997). Using this approach and 100 iterations gives the following test-set errors as compared to the best corresponding values for LogitBoost.
Test-Set Errors (%)

Dataset        LogitBoost   Random Forest
Breast             3.8           2.9
Ionosphere         6.3           7.1
Glass             23.8          20.6
Sonar             15.4          15.9
Waveform          19.1          17.2
Sat-images        11.2           8.6
The comparison is on runs I have already done. Since I have not done a run on the vowel dataset that is similar to that of the paper, it does not appear. Similarly, I have not run the letter dataset referred to. The major point of this comparison is to show that correlation and strength are the essential ingredients of an accurate ensemble, and that if boosting succeeds, it is because it gives low correlation between classifiers of reasonable strength. The fact that a similar algorithm maximizes the likelihood in a logistic setting is not the primary reason for its success.
So what is known about the factors leading to AdaBoost’s success? One clear factor is that at each step it decouples itself from the previous step. This would intuitively seem to lead to low correlation between pairs of classifiers in AdaBoost considered as a random forest. However, this has not yet been analytically or empirically established.
A recently discovered empirical property of AdaBoost is mysterious but may lead to better understanding. At each iteration, the examples currently misclassified by AdaBoost are weighted more heavily for the next iteration, to encourage their correct classification. The commonly accepted interpretation, therefore, is that AdaBoost tries to classify correctly the hard-to-classify examples, and ignores the easy-to-classify examples. This interpretation is incorrect, and I have to admit some culpability.
Suppose AdaBoost is run for a while. Then for each case in the training set, look at the weighted proportion of times that it has been misclassified (the weights are just those that AdaBoost assigns to each tree in the ensemble). For all data sets I’ve looked at so far, these proportions are nearly the same over the entire training set. That is, AdaBoost is an equalizer. When it puts more emphasis on points that have been recently misclassified, it is trying to get them correctly classified next time around [see Wheway (1999)].
Let the points in the training set be (y_n, x_n). Let pr(k, n) be the unweighted proportion of times that the nth case in the training set is misclassified by the first k classifiers. To check on my hypothesis that AdaBoost worked so well because it is an equalizer, using the methods in Breiman (1999b), I devised an algorithm called arc-equal which sequentially minimizes Av_n (pr(k, n) − t)^2, where t is an adjustable parameter.
Then I did runs on a number of datasets, proceeding as follows: first AdaBoost was run 100 times. The value of t was taken equal to the average weighted misclassification rate over the training set produced by AdaBoost, and arc-equal was run 100 times. When this was repeated 50 times, each time leaving out 10% of the data as a test set, the test-set error rate of arc-equal was always nearly the same as that of AdaBoost.
The experiments performed, and the empirical observation that on our datasets AdaBoost tends to equalize the misclassification error over the training set, are suggestive. But this does not yet explain why equalization leads to low test-set error. This mystery should be most interesting to some of our more theoretically motivated statisticians.
REFERENCES
Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Comput. 9 1545–1588.
Breiman, L. (1999a). Random forests. Technical report, available at www.stat.berkeley.edu.
Breiman, L. (1999b). Prediction games and arcing algorithms. Neural Comput. 11 1493–1517.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Mach. Learning 40 139–158.
Wheway, V. (1999). Variance reduction trends on 'boosted' classifiers. Available from virg@cse.unsw.edu.au.

Department of Statistics
University of California
367 Evans Hall, #3860
Berkeley, California 94720-3860
E-mail: leo@stat.berkeley.edu
DISCUSSION
Peter Bühlmann and Bin Yu
ETH Zürich and Bell Laboratories, Murray Hill, Lucent Technologies and University of California, Berkeley
We congratulate Friedman, Hastie and Tibshirani (FHT) for their significant work connecting boosting and (kinds of) additive models. FHT provide a much-needed bridge between an important and effective machine learning procedure and traditional statistical modeling ideas. With this bridge, ideas can now flow easily in both directions so that a deeper and more thorough understanding of boosting will eventually emerge.
In this discussion, we would like to share our thoughts and reflections on boosting and related statistical ideas, inspired by FHT.
1. What makes boosting work in classification? Let us consider the two-class problem in this section. In an elegant and convincing way, FHT bring Freund and Schapire's Discrete AdaBoosting to our home domain by rederiving its population version as a familiar Newton-like step in the optimization of a not-so-familiar exponential loss function serving as a surrogate for the zero–one loss function. This surrogate is sensible since it is uniquely minimized by (half of) the log odds ratio of the conditional probability of Y = 1 given X = x. Under this light, it is not hard to find instances of statistical procedures sharing traces of similarities with boosting. For example, in parametric estimation, we indeed have been "boosting" a consistent estimator to an efficient estimator by taking a Newton step in the optimization of the likelihood function and indeed have been using estimating equations as sensible surrogates because the likelihood equations are intractable.

Despite these similarities, there are fundamental differences between boosting and these statistical procedures. As recounted by FHT, boosting as a conjecture was proposed as a theoretical learning question in the machine learning community and is now being viewed as a nonparametric stepwise fitting technique in an additive style. Its superb empirical performance in high-dimensional classification problems has very few, if any, rivals. Even though the concept of a "true" model does get used in the evaluation stage of boosting, the starting point of boosting is not a "true model" as is commonly done in statistics. Its starting point is a "weak learner" and the question posed was how to make it better or "boost" it. In hindsight, this is a natural and realistic approach in modern data analysis where the size of datasets could not be imagined in the time of our forebears such as Fisher or Neyman. Because of the complexity and scale of these problems, it is impossible to come up with an effective likelihood approach in one step. Often, if not always, a sensible procedure or a "weak learner" is available, either ad hoc or based on a simple approximation. Moreover, an evaluation criterion or a loss function is always available. The boosting question now becomes how to improve the weak learner in terms of this loss function. With a starting point and an objective function, a greedy approach for improvement is very natural.

Numerical considerations of the greedy approach explain why boosting in terms of the evaluating zero–one loss function might not be a good idea. Any derivative-based greedy algorithm such as Newton's method is not appropriate for this discontinuous loss function whose minimizer is not unique but a whole class. A surrogate, the exponential loss function, is used in AdaBoosting as the implementing objective function although the evaluation function remains the zero–one loss. From a numerical point of view, the exponential function is an excellent function to apply Newton's method, because of its convexity and gradually changing derivatives. Furthermore, this is a relevant loss function to optimize since its minimizer is sensible for the classification problem. A second surrogate, proposed by FHT, is the binomial likelihood that as a function is very close to the exponential loss and has the same minimizer. FHT devise the Gentle Boosting algorithm based on the binomial surrogate and show a similar performance to those based on the exponential loss. A third surrogate, the squared loss function, is also discussed in FHT. The squared function is the best possible for Newton's method: convex and having constant second derivatives. FHT report good performance but think that it is dominated by schemes based on either the exponential or the binomial likelihood loss functions and hint that the reason might be that the squared loss function loses its monotonicity when yF(x) > 1.
It is well known that the unique minimizer of the squared loss is the conditional expectation of Y given X = x, which is naturally bounded between −1 and 1 since |Y| = 1. If this constraint is taken into account in the greedy search for the optimizer of the squared loss, then one should not wander out to the region where yF(x) > 1 but stay in the region where the squared loss is monotone. It is curious to know whether or not taking this constraint into account (obviously numerically more complicated) would bring the performance of squared-loss boosting on a par with those based on the exponential or binomial likelihood loss.
Although the three surrogates mentioned so far have sensible minimizers, they are only qualitative approximations to the zero–one loss function. They are actually quite bad quantitative approximations (cf. Figure 2 of FHT). But the population minimizers of the expected zero–one loss function include those of exponential, binomial likelihood and the squared loss. The first question then simply becomes:
1. Which surrogate (or implementing loss function for boosting) to use so that the boosting estimate best approximates its corresponding population minimizer? This (difficult) question might involve numerical as well as statistical–stochastic efficiency issues.
Now let us step back from these successful (implementing) surrogates to look at the original zero–one loss function. From the quantitative approximation point of view, an obvious surrogate is a smoothed version of the zero–one loss function, 1 − Φ(yF(x)/σ), where Φ is the cdf of N(0, 1) and σ is a tuning parameter to control the approximation of such a smoothed version to the original zero–one loss function. This makes the numerical greedy approach possible. If one follows the exact steps (when c is fixed) to this objective function as in the derivation of the population version of Discrete AdaBoost (Result 1 in FHT), one gets a boosting algorithm with a different reweighting scheme, having weights

φ(yF(x)/σ) = φ(F(x)/σ),

because y² = 1 and φ(·), the standard normal density, is symmetric.
These weights ignore the values of y, or avoid a reality check, and concentrate more and more on the cases where the previous classifier F is unsure about itself. Since this is the opposite of the boosting reweighting philosophy, one might expect that it would not work well. The smoothed zero–one loss function is not convex (or concave) and its derivatives change rapidly, especially for a small tuning parameter σ, or a closer approximation to the zero–one loss function. For such functions, the quadratic approximation is not accurate and Newton's method easily overshoots to miss the true optimizer. Hence this recipe should not work from a numerical point of view.
2. The question is, however, for this smoothed zero–one loss, will a more suit- able numerical optimization method such as trust region [cf. Gill, Murray and Wright (1981)] lead to a sensible boosting procedure having a different weighting philosophy, presumably with higher computational cost?
As argued above, boosting algorithms benefit greatly from a smart choice of a surrogate implementing objective function from both statistical and numerical points of view. FHT (Section 4.4) also point out that Gentle Boosting has an edge because of the numerical stability of the derivative calculation of the binomial likelihood function. However, boosting's most powerful advocate is its stunning success on real data sets and its "mysterious" resistance to overfitting in most cases in classification. The summary in Table 2 of FHT on real dataset results suggests that the performance of a particular variant of boosting depends on the interaction among the choice of the weak learner, the underlying problem and the effectiveness of the Newton method as a numerical approximation to the surrogate loss function. Figuring out how these factors interact should bring us a deeper understanding of boosting and might shed light on its resistance to overfitting.
We explore this interaction in the rest of this discussion. In the context of boosting L_2 regression, we compare boosting with another ensemble scheme, bagging [Breiman (1996)], for which we have gained some understanding recently [Bühlmann and Yu (2000)]. Different weak learners are looked at, the overfitting issue is touched upon, and a bag-boosting scheme is proposed. In the last section, we return to classification to emphasize the importance of the choice of the weak learner and to make the point that the resistance to overfitting of boosting is probably due to the zero–one evaluation loss function.
2. Boosting and bagging with L_2 loss for regression. Boosting as explained and analyzed by FHT is a general method: a greedy forward stagewise technique to fit a model of additive style by minimizing a loss function. This view opens the door for boosting in other contexts than classification (although L_2 loss is also an appropriate surrogate as described in the previous section). For L_2 regression, the boosting algorithm works as follows:
(a) Set F_0(·) ≡ 0 (or another more sensible starting value).
(b) For m = 1, 2, ..., M, fit the function estimator f_m(·), which is parameterized as f_m(x) = β b(x; γ) (as in FHT):

(β_m, γ_m) = arg min_{β, γ} Σ_{i=1}^{n} (Y_i − F_{m−1}(X_i) − β b(X_i; γ))².

Set F_m(·) = F_{m−1}(·) + f_m(·).
(c) Output the function estimator

F_M(·) = Σ_{m=1}^{M} f_m(·).
This algorithm is indicated by FHT [formula (6)] and was also given by Friedman (1999). It does not involve any "reweighting." The weak learner b(·; γ) is fitted in the mth step to the current residuals Y_i − F_{m−1}(X_i). There is no need for a surrogate loss function in this case since the evaluating L_2 loss is the best possible for Newton's method and the quadratic approximation is exact. We now discuss the issue of choosing the weak learner. This discussion is continued in Section 3 for the case of classification.
Linear weak learners. It is well known that boosting's cousin bagging does not change any linear procedures. It can be shown easily that it is the same with boosting in L_2 regression. When the weak learner is linear, b(x; γ) = x^T γ and f_m(x) = x^T γ_m, the following two statements hold:

(a) For L_2 boosting in regression as described earlier, f_m(·) ≡ 0 for all m = 2, 3, ..., M, and F_M(·) is the least-squares linear regression predictor for all M ≥ 1.

(b) For LogitBoost as presented by FHT, with

p_M(x) = exp(F_M(x)) / (exp(F_M(x)) + exp(−F_M(x))),

p_M(x) converges to the corresponding MLE in logistic linear regression as M → ∞, provided that standard conditions for convergence of the MLE hold.

Stumps as the weak learner. We now investigate the effect of boosting in a concrete simple linear model,

(1)    Y_i = 2 + 3X_i + ε_i,    i = 1, ..., n,

with the ε_i i.i.d. N(0, 1), the X_i i.i.d. Uniform(0, 1), and all X_i independent of all ε_i. The target is the true function F_true(x) = 2 + 3x, x ∈ [0, 1]. We use stumps (with one-dimensional covariate space) as weak learners and run the L_2 boosting algorithm for regression from above. For a typical sample, the boosting stumps estimate is displayed in Figure 1; it is still a rather crude and erratic approximation for the target. Figure 1 uses six boosting iterations, which is optimal on average with respect to MSE as indicated in Figure 2. Based on 50 independent simulations of model (1), we have estimated bias, variance and hence also mean-squared error as a function of the number of boosting iterations. The result is displayed in Figure 2.
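For readers who want to reproduce the flavor of this experiment, here is a minimal Python sketch of the L_2 boosting loop above with a least-squares stump as the weak learner, applied to data generated from model (1); the helper names and the exhaustive split search are our own simplifications, not the discussants' code.

    import numpy as np

    def fit_stump(x, r):
        """Least-squares stump for a one-dimensional covariate: choose the split
        point and two leaf means minimizing squared error on the residuals r."""
        best = None
        for t in np.unique(x)[:-1]:
            left = x <= t
            c_left, c_right = r[left].mean(), r[~left].mean()
            sse = np.sum((r - np.where(left, c_left, c_right)) ** 2)
            if best is None or sse < best[0]:
                best = (sse, t, c_left, c_right)
        _, t, cl, cr = best
        return lambda z, t=t, cl=cl, cr=cr: np.where(z <= t, cl, cr)

    def l2_boost(x, y, n_iter):
        """L_2 boosting: F_0 = 0, then repeatedly fit the weak learner to the
        current residuals and add the fit to the ensemble."""
        F, fitted = np.zeros_like(y, dtype=float), []
        for _ in range(n_iter):
            f = fit_stump(x, y - F)        # step (b): fit to residuals
            F += f(x)
            fitted.append(f)
        return lambda z: sum(f(z) for f in fitted)   # step (c)

    # model (1): Y = 2 + 3X + eps, X ~ Uniform(0,1), eps ~ N(0,1)
    rng = np.random.default_rng(0)
    X = rng.uniform(size=100)
    Y = 2.0 + 3.0 * X + rng.normal(size=100)
    F_hat = l2_boost(X, Y, n_iter=6)       # six iterations, as in Figure 1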
3. Interestingly, there is a clear indication for overfitting, starting after six iterations already. We see here a simple example in L_2 boosting for regression where overfitting occurs easily, in contrast to classification; similar phenomena are reported in Friedman (1999).
From our own work [Bühlmann and Yu (2000)] we know that stumps evaluated at x have high variances for x in a whole region of the covariate space. From an asymptotic point of view, this region is "centered around" the true optimal split point for a stump and has "substantial" size O(n^{-1/3}). That is, stumps do have high variances even in low dimensions as in this simple case (with only three parameters), as long as one is looking at the "right scale" O(n^{-1/3}); such a high variance presumably propagates when combining stumps in boosting. This observation is the starting point for another boosting machine to be described next.
The bag-boosting machine with bagged weak learners. Bagging is known to smooth the weak learner such as stumps and thus achieves a variance reduction. The precise (asymptotic) smoothing function is given in Bühlmann and Yu (2000), which also characterizes its variance reduction effect in simple yet canonical cases. The high variance of a stump is reduced up to a factor of size 2 to 3, while leaving the bias approximately unchanged. This implies that bagging is also effective for a low-dimensional predictor such as stumps. Hence, we combine bagging with boosting and we call it bag-boosting. For the weak learner in boosting, just replace the stump (or a more general tree) by the bagged stump (or bagged tree). This is a very natural idea and has been thought about by a couple of researchers, although we have not seen anything in writing. The resulting fit from bag-boosting is shown in Figures 1 and 2. Note that performance for bagging alone is given by bag-boosting with one boost. Instead of a classical bagging step, we actually use "sub-bagging," or bagging on subsamples of size [n/2] (resampling without replacement), which is computationally cheaper while still being as accurate as bagging [Bühlmann and Yu (2000)]. The visual improvement of bag-boosting in Figure 1 can be explained by the fact that a bagged stump is a smooth rather than a step function [Bühlmann and Yu (2000)]. Its variance and MSE improvements are impressively described by Figure 2, provided that we know roughly where to stop with boosting iterations. Figure 2 illustrates that the more efficient base learner (namely the bagged stump) has a faster increase in variance with the number of boosting iterations: it would be interesting to know whether this is a more general fact.

Fig. 1. Boosting stumps (B, black) and bag-boosting stumps (BB, red) estimate based on a typical sample (dots) from (1). The target function is indicated in green. Boosting is done with 6 and bag-boosting with 4 iterations.

Fig. 2. Boosting stumps and bag-boosting stumps for model (1). Top: squared bias and variance of boosting (B, black and red) and bag-boosting (BB, purple and green). Bottom: mean-squared error for boosting (B, black) and bag-boosting (BB, red).
4. Thus bag-boosting has a potential to improve upon boosting with stumps or larger decision trees. Its drawback is the higher, although still feasible, computational cost.
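A minimal sketch of the sub-bagging step, under our own helper names and reusing fit_stump and l2_boost from the earlier sketch: each call averages the stump over subsamples of size [n/2] drawn without replacement, and the resulting smoothed learner can be plugged into the boosting loop in place of the raw stump.

    import numpy as np

    def subbag(fit, x, r, n_bags=25, rng=None):
        """Sub-bagging: average the weak learner `fit` over n_bags subsamples of
        size floor(n/2), drawn without replacement."""
        rng = rng if rng is not None else np.random.default_rng(0)
        n = len(r)
        fits = []
        for _ in range(n_bags):
            idx = rng.choice(n, size=n // 2, replace=False)
            fits.append(fit(x[idx], r[idx]))
        return lambda z: np.mean([f(z) for f in fits], axis=0)

    # bag-boosting: inside l2_boost, replace the raw stump fit by
    #   f = subbag(fit_stump, x, y - F)
    # so that each boosting step uses the bagged (smoothed) stump.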
3. Back to classification. Many issues remain. For example, the question about convergence is not touched by FHT. In particular, for a fixed sample size, there are two issues: the (non)convergence of the boosting algorithm and whether it is better to stop iterations before having approximately converged, if the algorithm actually would converge. The latter issue is known in fitting complex parametric models such as neural networks, where a possible regularization is given by stopping before convergence. There is another convergence issue about the next level when the sample size tends to infinity.
5. A more important question is how much FHT's population story tells about the data-driven algorithms. The population quantities are always approximated: the accuracy depends on the choice of weak learners (see also our remarks about linear weak learners, stumps and bagged stumps in Section 2) and the data-generating distributions. Finally, the hope is that the errors do not propagate with the boosting iterations!
Particularly, even with the connection made by FHT, it is hard to know whether one should use a simple learner like stumps or a more complex one such as trees with more terminal nodes.
In the boosting literature, which is almost exclusively on classification, decision trees are the most popular weak learners. We do agree that trees are generally attractive, but it is worth pointing out that other (nonlinear) weak learners may exhibit good performance as well. Even when using trees as weak learners, the choice of the number of terminal nodes (e.g., with best-first induction) in the learners does matter. Figure 1 in FHT can be misleading for a superficial reader. It wrongly suggests that the number of terminal nodes in the learner is irrelevant when there is sufficient computing power for performing many boosting steps, and that with stumps, Discrete AdaBoost is outperformed by Real AdaBoost consistently. The careful reader will exploit a more subtle sensitivity with respect to choosing the learner in Section 6 of FHT [e.g., Figure 4 (top left and top right)] and results on real data sets (Table 2).
FHT distinguish between additive and nonadditive decision boundaries. They argue convincingly that in the first case, choosing stumps as weak learners is better than using a larger tree, which is expected to overestimate nonexisting interaction terms. If we knew that the decision boundary could be well approximated by an additive function, we would feel more comfortable by fitting an additive model (with backfitting) as in Hastie and Tibshirani (1990), rather than boosting stumps. Or in other words, if the conjecture by FHT [Section 11] "that boundaries involving very high-order interactions will rarely be encountered in practice" (whatever "high" means here) is really relevant, there would not be any good reason to prefer boosting over the established and popular GAM backfitting estimate.
Boosting's resistance to overfitting has been one of its major virtues in comparison with other nonparametric techniques. Overfitting in boosting algorithms in the sense of fitting a large model can happen in two ways. The first is having a large model fitted as the weak learner, and the other having to run the algorithm for many iterations. The most overfitted model comes from a large or complex weak learner in combination with a long iteration in boosting. FHT touch on this overfitting issue in the concluding section by listing three plausible explanations. As emphasized in Section 1 of this Discussion, the use of the (implementing) surrogate exponential loss function is crucial from the numerical point of view and is thus partially responsible for the success of boosting. However, the evaluation in a classification problem has always been the zero–one loss.
6. This divorce of the implementation and the evaluation loss function does not exist in L_2 boosting, where we saw overfitting easily in Section 2. We concur with the last point of FHT in Section 11 that the zero–one loss is very robust against overfitting, and we further argue that an evaluation based on the exponential implementing loss does show strong overfitting. This conjecture is supported by the breast cancer dataset below. Moreover, we believe that "parts of the real world" are nonadditive. Under this belief and knowing boosting's resistance to overfitting, it can be very valuable to use best-first induced trees with more than two terminal nodes. This is confirmed also by our analysis of the breast cancer data in Table 1.
Analysis of breast cancer data. We partially redo the analysis of FHT for the breast cancer data and thereby add insights on bag-boosting and overfitting. Table 1 and Figure 3 give comparative results to the ones in FHT. We use "sub-bagging" for computational efficiency; it has a performance similar to bagging [Bühlmann and Yu (2000)]. The misclassification rates (called test error in FHT) are estimated by averaging over 100 random partitions of the data set into a 90% training set and a 10% test set, while FHT use 5-fold CV. So our results differ slightly from FHT (also because presumably we are using different tree algorithms), but we believe ours have better accuracies. Boosting and its variants are done with Real AdaBoost, and the trees are grown with the program from R. Bag-boosting large trees gives the best result, followed by boosting large trees, bag-boosting stumps, boosting stumps and then bagging. Bag-boosting needs only a few iterations and brings a nontrivial improvement over boosting alone, at the expense of a more computationally costly weak learner. Note that Figure 3 (bottom panel) shows strong overfitting with respect to $E[\exp(-YF(X))]$!
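To make the bag-boosting recipe concrete, the following is a minimal Python sketch of Real AdaBoost in which the stump learner can be replaced by a subbagged stump, that is, an average of stump class-probability estimates fit to random half-samples. It is restricted to a single numeric feature to keep it short, and the helper names (`stump_proba`, `subbagged_stump`, `real_adaboost`) and the choice of 25 half-samples are illustrative assumptions, not the procedure used for Table 1.

```python
import numpy as np

def stump_proba(x, y01, w):
    """Weighted stump on one numeric feature: predicts P(y=1 | x) on each side
    of a single split by the weighted mean of the 0/1 responses."""
    best, best_err = None, np.inf
    for c in np.unique(x)[:-1]:
        left = x <= c
        pl = np.average(y01[left], weights=w[left])
        pr = np.average(y01[~left], weights=w[~left])
        err = np.sum(w * (y01 - np.where(left, pl, pr)) ** 2)
        if err < best_err:
            best, best_err = (c, pl, pr), err
    if best is None:                      # degenerate sample: no split available
        pbar = np.average(y01, weights=w)
        return lambda z: np.full(np.shape(z), pbar, dtype=float)
    c, pl, pr = best
    return lambda z: np.where(z <= c, pl, pr)

def subbagged_stump(x, y01, w, n_half=25, seed=None):
    """Bag-boosting weak learner: average stump probability estimates fit to
    random half-samples drawn without replacement (subbagging)."""
    rng = np.random.default_rng(seed)
    n, fits = len(x), []
    for _ in range(n_half):
        idx = rng.choice(n, size=n // 2, replace=False)
        fits.append(stump_proba(x[idx], y01[idx], w[idx]))
    return lambda z: np.mean([f(z) for f in fits], axis=0)

def real_adaboost(x, y, n_steps=50, bag=True):
    """Real AdaBoost: each step fits a class-probability estimate p_m under the
    current weights and adds f_m = 0.5 * log(p_m / (1 - p_m)) to F."""
    y01 = (y + 1) / 2                     # y is coded -1/+1; stumps fit 0/1 responses
    w = np.full(len(x), 1.0 / len(x))
    F = np.zeros(len(x))
    learners = []
    for _ in range(n_steps):
        fit = subbagged_stump(x, y01, w) if bag else stump_proba(x, y01, w)
        p = np.clip(fit(x), 1e-6, 1 - 1e-6)
        fm = 0.5 * np.log(p / (1 - p))
        F += fm
        w *= np.exp(-y * fm)              # upweight currently misclassified cases
        w /= w.sum()
        learners.append(fit)
    return F, learners
```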
Fig. 3. Different weak learners in Real AdaBoost for breast cancer data (Section 7 in FHT). Top: misclassification rates. Bottom: expected exponential losses $E[\exp(-YF(X))]$ on the logarithmic scale.
7. For this particular dataset, and with respect to the misclassification rate, there is hardly any overfitting. It suggests that there is no loss in using large trees, if there is never any substantial overfitting. Boosting stumps, on the other hand, restricts the fit to additive models. With respect to the misclassification rate, should we always boost with large trees as weak learners?
Acknowledgments. Bin Yu thanks Mark Hansen for many stimulating discussions and helpful comments on the draft and Margaret Wright for help- ful discussions on optimization. Special thanks are due to Josef Vogt for the results on the breast cancer data.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Bühlmann, P. and Yu, B. (2000). Explaining bagging. Preprint.
Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
Gill, P. E., Murray, W. and Wright, M. H. (1981). Practical Optimization. Academic Press, New York.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Seminar für Statistik
ETH-Zentrum, LEO D72
CH-8092 Zürich
Switzerland
E-mail: buhlmann@stat.math.ethz.ch

Department of Statistics
University of California
Berkeley, California 94720-3860
DISCUSSION
Andreas Buja
AT&T Labs
1. The cultural background of boosting: machine learning versus statistics. I have worked under the same roof with the inventors of boosting, Freund and Schapire, for several years, but I might as well have been on another planet. The physical vicinity was of no help in coming to grips with boosting. I listened to their talks, but I heard the words and missed the point. Later I read Leo Breiman’s debates with Freund and Schapire, and it really got my attention when I heard his praise of boosting. His penetrating analysis was an effort to understand the performance of boosting, in particular its mysterious immunity to overfitting. What he did not provide was a translation of boosting into terms that are familiar to statisticians. He did not bridge the culture gap that separates the stodgy main stream of statistics from its dynamic younger sibling in computer science, machine learning. This is one of the achievements of the paper by Friedman, Hastie and Tibshirani (FHT): It gives statisticians a language with which to think about boosting.
If Freund and Schapire feel misinterpreted or if they think of FHT’s statis- tical modeling interpretation as misguided, as I expect they will, it is beside the point. There is no single “true” interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of an interpre- tation is in enabling others to fruitfully think about an idea, and this is what the FHT paper certainly does for statisticians and the idea of boosting.
An interpretation of new ideas in terms of old ones is not a mere reduction of the ideas to an “old hat.” On the contrary, such a reduction can clarify what is novel. The emphasis of the FHT interpretations may be too long on the reduction aspect and too short on working out the novelty in boosting; I will try to explain what I mean below. The novelty that we see in light of FHT’s interpretations, though, may not agree with what Freund and Schapire think is novel. Thus, while the FHT paper may provide a gate for statisticians to a new area, the same gate may not be open in reverse for the folks in machine learning (although I’d love to be proved wrong).
2. Boosting arbitrary generalized linear and additive models. Ac- cording to FHT, boosting can be seen as a two-class classification method that fits additive models on a logistic scale in a forward stagewise manner with regard to a loss function that is second-order equivalent to the negative log likelihood for the conditional Bernoulli model. This interpretation leads FHT naturally to propose their “LogitBoost” method by substituting the exact neg- ative log likelihood for the original loss function.
The test of every interpretation is in its capacity to spawn new ideas and generalizations. Differing interpretations are worthwhile because they lead to differing generalizations. The FHT interpretation resulted immediately in a
very competitive new version of boosting, as well as an improvement in boost- ing multiclass classification.
It seems to me, though, that the FHT interpretation calls, even screams, for a much more global extension of boosting: Now that we know how to boost additive logistic regression, we also know how to boost any generalized addi- tive model. If Algorithm 3 of FHT for LogitBoost is the familiar Newton– Raphson algorithm modified for the acquisition of nonlinear terms in a stage- wise logistic regression, then the same can be used for the acquisition of non- linear terms in any stagewise generalized regression.
In particular, we now know how to boost ordinary least-squares regression: by stagewise LS fitting of nonlinear terms to the previous residuals. This may seem trite, but it isn’t if one has ever tried to come up with a version of boosting for regression with continuous response. Freund and Schapire have tried, and their tentative proposal was to reduce regression to some form of classification, in a classical attempt at reducing a new problem to an old one. This seems like a circuitous approach, however, once the FHT interpretation of boosting is at hand. With the extension of the FHT interpretation to generalized additive regression, it seems most natural to interpret stagewise LS regression as the regression version of boosting.
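As a concrete reading of this remark, here is a minimal sketch of least-squares boosting for a single feature: at each stage a stump is fit to the current residuals and added to the fit, with no readjustment of earlier terms. The helper names (`fit_stump`, `ls_boost`) and the 100 stages are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump: one split on a single feature, constant on each side."""
    best, best_sse = None, np.inf
    for c in np.unique(x)[:-1]:
        left = x <= c
        ml, mr = r[left].mean(), r[~left].mean()
        sse = ((r[left] - ml) ** 2).sum() + ((r[~left] - mr) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (c, ml, mr), sse
    c, ml, mr = best
    return lambda z: np.where(z <= c, ml, mr)

def ls_boost(x, y, n_stages=100):
    """Stagewise least-squares 'boosting' for regression: repeatedly fit a weak
    learner to the residuals and add it in, never readjusting earlier terms."""
    fit = np.zeros_like(y, dtype=float)
    terms = []
    for _ in range(n_stages):
        f = fit_stump(x, y - fit)      # fit the current residuals
        fit += f(x)                    # stagewise update, no joint refit
        terms.append(f)
    return terms

# usage: terms = ls_boost(x, y); prediction at z is sum(f(z) for f in terms)
```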
3. The role of greedy stagewise fitting in the protection against overfit. FHT’s interpretation of boosting is lucid, but their presentation of stagewise fitting needs to be put in perspective. I have no quibbles with the facts, but some with the positioning of the facts.
FHT do state that the fitting à la boosting is "forward stagewise," but the impression we (and above all the readers in machine learning) may get is that this term describes a common mode of model fitting in statistics. It doesn't. We have the term, but it is not very well known, and it does not enjoy high standing among most statisticians who know about it. It is a computational device, or more accurately, a capitulation, in those situations where one is unable to fit a large number of model terms jointly. We even teach our students that fitting one term at a time and not adjusting the others is not the way to fit a model; we teach them that changing one term in a model affects all others in a joint model fit, assuming that the joint model fit is the norm. Interestingly, one popular method for joint fitting, the backfitting algorithm, was devised by Friedman and Stuetzle (1981) to more closely achieve joint optimality in the otherwise greedy fitting procedure of projection pursuit regression (PPR). Had they omitted the backfitting part in their algorithm and allowed the acquisition of a greater number of ridge function terms instead, PPR would have been a regression version of boosting. It appears that even the unconventional minds of Friedman and Stuetzle were involuntarily constrained by statisticians' prejudice in favor of parsimonious models obtained by joint model fits. Doing away with this prejudice is certainly one of the novelties in the boosting endeavor. Ockham's razor is being replaced by Freund and Schapire's sledgehammer.
Omitting joint fitting is crucial to the success of boosting: Its purported advantage, relative protection against overfit, disappears instantaneously if termwise (stagewise) fitting is replaced with joint fitting of all terms. The 500 or so stumps and twigs that I have seen fitted in a single boosting fit would come crashing down if they were fitted jointly. Therefore, the essence of the protection against overfit is in the suboptimality of the fitting method.
This also explains why boosting could not have been invented by statisti- cians: Our preoccupations are such that the statistical benefits of a suboptimal computational device would not have appeared on our radar. There are excep- tions to the rule, though: Statisticians have, for example, developed a notion of one-step estimates in robustness, where it was shown that one Newton step in the computation of an M-estimate provides essentially the robustness benefits of full iteration. Similarly, in order to understand the performance of boosting, we will need a theory that explains why this particular type of suboptimality often avoids overfit.
A curious observation in this context concerns the cavalier attitude in the boosting community, including FHT, toward the stopping problem. They seem to consider it sufficient to provide plots or tables that show misclassification rates for 100, 200, 300, 400 terms. So we stop around 300 or 400, and this is part of the definition of the method. The assumption, empirically substanti- ated, is that it really doesn’t matter much as long as we stop after the acqui- sition of a sufficiently large number of terms. The situation is similar to that in artificial neural networks (NNs), where the stopping time of online opti- mization acts as a smoothing parameter. In NNs, however, one tries to detect the onset of overfit with cross-validation. In boosting, by contrast, the testing error remains flat for a long time, overfit does not occur before the human runs out of patience and a precise stopping rule seems therefore unnecessary.
A satisfactory theory of protection against overfit would have to explain the relative flatness of the test error in boosting. The flatness is only relative, however, and even boosting has to submit to overfit at some point. This is seen from the fact that usually the class of fitted terms $b(x; \gamma)$ is rich in the sense that, for a given finite sample size $N$, the $N$-dimensional design vectors $b_\gamma = (b(x_n; \gamma))_{n=1,\ldots,N}$ span all of $\mathbb{R}^N$ as $\gamma$ runs across its domain.
This is obviously not a proof, but a slightly tighter argument can be given. First, we note that, unlike in joint fitting, any iterative termwise fitting pro- cedure depends not only on the space spanned by the terms, but the actual basis they represent and the order in which they are fitted. (I thank Werner Stuetzle who spelled this out most clearly to me.)
In the simple case where the domain of the $\gamma$'s is finite and consists of $K$ elements $\gamma_1, \ldots, \gamma_K$ only, boosting is a greedy form of backfitting the linear model $\sum_{k=1}^{K} \beta_k b(x; \gamma_k)$. Instead of cycling through the terms in an orderly fashion $k = 1, 2, \ldots, K, 1, 2, \ldots$ as in conventional backfitting, the next $k$ is picked greedily so as to lower the loss function maximally. We can conceive of at least three types of backfitting according to the order in which the terms are fitted: (1) cycle over the terms in a fixed order (traditional backfitting), (2) pick the next term randomly, (3) pick the next term greedily. It would
seem plausible to examine combinations of type 2 and type 3, whereby the picking probability of a term is a function of its loss reduction; as is, type 2 uses a uniform distribution on the terms and type 3 puts probability 1 on the term with maximal reduction.
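To make the three orderings concrete, the following sketch backfits a fixed dictionary of $K$ basis vectors by least squares, where only the rule for choosing the next term differs: cyclic (traditional backfitting), random, or greedy. The function name and step count are illustrative assumptions.

```python
import numpy as np

def backfit(B, y, order="greedy", n_steps=500, seed=0):
    """Termwise least-squares fitting of y on the columns of B (n x K).
    order: 'cyclic' (traditional backfitting), 'random', or 'greedy'."""
    rng = np.random.default_rng(seed)
    n, K = B.shape
    beta = np.zeros(K)
    r = y.astype(float).copy()              # current residual y - B @ beta
    col_ss = (B ** 2).sum(axis=0)
    for t in range(n_steps):
        gains = (B.T @ r) ** 2 / col_ss     # SSE reduction from refitting each term
        if order == "cyclic":
            k = t % K
        elif order == "random":
            k = int(rng.integers(K))
        else:                               # greedy: the term with maximal reduction
            k = int(np.argmax(gains))
        delta = (B[:, k] @ r) / col_ss[k]   # refit coordinate k to the residual
        beta[k] += delta
        r -= delta * B[:, k]
    return beta
```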
It is clear from the known convergence of type 1 backfitting to the joint least-squares solution that overfit must occur if the joint solution overfits. If similar convergence properties for type 2 and type 3 backfitting hold, overfit will occur there also. In light of these considerations, it may be just a matter of time till boosting follows the lead of NNs and cross-validation is introduced for the stopping problem in boosting as well.
An issue that a theory of protection against overfit would have to tackle is that there are two ways for a learner or base term bx γ to be “weak”: weak- ness can arise from bias or from variance. In either case the weak learner usually contains signal, but the signal is weak for different reasons. Weak- ness due to bias means underfit, weakness due to variance means overfit. Boosting is designed to reduce bias but not variance. The latter is addressed by Breiman’s (1996) “bagging” proposal and by Bayesian model averaging. As things stand, one either approaches good fits from poor fits with boosting, or from overly flexible fits with model averaging. It would be desirable if both bias and variance could be tackled in one unified approach.
4. The role of the functional form of the weak learner. Additive models with small trees as base terms are an innovation we owe the machine learning literature. Trees are not essential to boosting, but the idea of building additive models from multiple trees is a welcome side benefit.
Functional form hasn’t been a topic of great debate in the boosting liter- ature, probably because of the original justifications in terms of the “weak learner” metaphor. These justifications do not carry us far, however. The ques- tion is not whether but how much a weak learner can be boosted, or in statis- tical terminology: how well an additive model can approximate an underlying response surface. A major determinant for answering this second question is the functional form of the weak learner bx γ. The fact that the boosting folks haven’t hit on this the hard way is probably due to the Holte (1993) effect, cited a couple of times by FHT: “Very simple classification rules perform well on most commonly used datasets.” That is, nature rarely chooses complex deci- sion boundaries that couldn’t be described by thresholding additive functions or at most two-variable or three-variable interactions.
Inspired by ANOVA practices, FHT propose to measure and control com- plexity of a functional form in terms of interaction order, that is, the maximum number of variables admitted in the weak learner at any given time. For trees, the interaction order is identical with the maximum admitted tree depth. But the idea of interaction order can be extended to other functional forms as well. The most obvious are Friedman’s MARS fits, in which interaction order can be controlled as easily as in trees. Another possibility are projection pursuit regression and classification (PPR and PPC, respectively), although a major modification is necessary: the linear combinations that enter the ridge
functions would have to be constrained to, say, two or three predictors at a time.
In order to more fully understand boosting, it will be necessary to apply it to functional forms other than trees. This would not be difficult in the case of MARS and PPR/C. In fact, for a boosting version of either one would simply replace the joint fitting with unadjusted greedy fitting. Ironically, stagewise modifications of MARS and PPR/C require that the technically most creative ideas of their inventors be thrown out: Friedman used sophisticated stable Cholesky updates for the fast computation of joint fits in MARS, and Friedman and Stuetzle devised backfitting for the same purpose in PPR/C. Now we are led to expect that giving up these technical marvels will be beneficial! Having experienced first hand the problems of overfit in the original MARS, I would be thrilled if the anticipated benefits of boosting showed up.
5. Conclusion. The FHT interpretation of boosting has implications for machine learning as well as for statistics. For machine learning, the gener- alization of FHT’s ideas lead us straight to a regression version of boosting. For statistics, FHT’s ideas raise the expectation that the benefits of boosting might carry over to other stagewise modifications of nonparametric fitting pro- cedures. The latter is just a guess so far, and in the absence of a convincing theory of boosting’s protection against overfit, we wouldn’t even know why it held if it did. Clearly, there is a lot of work ahead of us. FHT have given us a means for generating new ideas that can guide our future research.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc.
76 817–823.
Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets.
Machine Learning 11 63–90.
AT&T Labs
Shannon Laboratory
180 Park Avenue
Florham Park, New Jersey 07932-0971 E-mail: andreas@research.att.com
DISCUSSION
Yoav Freund and Robert E. Schapire
AT&T Labs
The main and important contribution of this paper is in establishing a connection between boosting, a newcomer to the statistics scene, and additive models.
One of the main properties of boosting that has made it interesting to statis- ticians and others is its relative (but not complete) immunity to overfitting. As pointed out by the authors, the current paper does not address this issue. Leo Breiman (1998) tried to explain this behaviour in terms of bias and variance. In our paper with Bartlett and Lee (1997), we gave an explanation in terms of the “margins” of the training examples and the VC-dimension of the base class. Breiman, as well as the current paper, point out that our bounds are very rough and are thus not useful in practice. While this is clearly true at this time, it is also true that the analysis given by Breiman and by this paper yield no provable bounds whatsoever. It is completely unclear whether this analysis can be used to predict the performance of classification rules outside of the training sample.
At the root of this argument about boosting is a much more fundamen- tal argument about the type of prior assumptions that one should make when embarking on the task of inducing a classification rule from data. The assump- tion that seems to underlie the use of maximum likelihood in the current paper is that data are generated by a distribution from a prespecified class. In this case, this is the class of distributions in which the relationship between the features and the labels is described by a log-linear function. In comparison, the assumption that we make in our analysis is that the data are generated from some arbitrary distribution in an i.i.d. fashion. Clearly, our assumption is the weaker one and this leads to a theory that is more generally applicable.
From a related but more practical point of view, one main issue when apply- ing boosting or boosting-like techniques in practice is how to choose the base class. The approach taken in this paper is that this choice is made based on our prior beliefs regarding the type of log-linear dependencies that might exist between the features and the label. On the other hand, in the boosting approach, we make an assumption about what kind of rules might have slight but significant correlations with the label. This is the essence of the “weak learning” assumption upon which the theory of boosting is founded.
In the current paper, boosting is analyzed mostly in the context of decision stumps and decision trees. The argument seems to be that while in most real-world cases decision stumps are powerful enough, in some less com- mon cases the type of dependencies that exist in the data require a more powerful base class, such as two- or three-level decision trees. A rather differ- ent approach to the combination of decision trees and boosting was recently proposed by Freund and Mason (1999). They represent decision trees as sums of very simple functions and use boosting to simultaneously learn both the decision rules and the way to average them.
Another important issue discussed in this paper is the performance of boost- ing methods on data which are generated by classes that have a significant overlap, in other words, classification problems in which even the Bayes opti- mal prediction rule has a significant error. It has been observed by several authors, including those of the current paper, that AdaBoost is not an optimal method in this case. The problem seems to be that AdaBoost overemphasizes the atypical examples which eventually results in inferior rules. In the current
paper, the authors suggest “GentleBoost” as a better method than AdaBoost for this case. The reason that this might be a better method is that it gives less emphasis to misclassified examples. The increase in the weight of the example is quadratic in the negative margin, rather than exponential.
However, one can argue that this alteration of AdaBoost, while being a step in the right direction, is not large enough. In fact, one can argue that once an example has a very large negative margin it is best to assume that it is an outlier that should be completely removed from the training set. A new boosting algorithm based on this radical approach was recently proposed by Freund (1999).
REFERENCES
Breiman, L. (1998). Arcing classifiers. Ann. Statist. 26 801–849.
Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the
Twelfth Annual Conference on Computational Learning Theory.
Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine
Learning: Proceedings of the Sixteenth International Conference 124–133.
Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: A new
explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
AT&T Labs–Research
180 Park Avenue
Florham Park, New Jersey 07932-0971 E-mail: yoav@research.att.com
DISCUSSION
Greg Ridgeway
University of Washington
The authors have done a great service to the field of statistics by branching out to the problems of interest to the computational learning theorist. The connection between boosting and maximum likelihood not only helps to explain why boosting is an effective classification method but it also opens the door to the application of boosting to many other statistical prediction problems.
It is difficult to overstate the attention that boosting has received in the machine learning and data mining communities. Soon after Freund and Schapire's algorithm made its debut, a flurry of empirical work showcased its uncanny ability to improve prediction accuracy [Quinlan (1996), Bauer and Kohavi (1999), Drucker and Cortes (1996)]. Charles Elkan's application of the boosted naïve Bayes won first place out of 45 entries in the 1997 data mining competition [Elkan (1997)], demonstrating the strength of boosting against research and industrial prediction tools. Elkan also noted in this work that his model had the form of nonlinear logistic regression but did not completely
connect boosting with optimization and likelihood. Boosting is even making
its way into popular science as noted in the New York Times [Lee (1999)].
Low variance, bias reducing steps. Likelihood based inference is in the business of obtaining good parameter estimates, which is not necessarily the same as producing a good predictor. In the classification game we win if we produce a classifier with the lowest misclassification rate. A good estimator of the underlying class distribution conditional on x is of secondary importance. Consider the following example.
Example 1. Generate $N = 1000$ observations as
$$x \sim N(0, 1), \qquad F(x) = \tfrac{1}{2}x^2 - 1,$$
$$p(x) = \frac{\exp\{2F(x)\}}{1 + \exp\{2F(x)\}}, \qquad y \mid x \sim \mathrm{Bern}(p(x)).$$
The Bayes decision rule is to classify a new observation as a 1 if $|x| > \sqrt{2}$ and as a 0 otherwise. The misclassification error for this rule is about 20.2%. Figure 1 shows the $p(x)$ that generated the data as a smooth curve.
Fig. 1. The smooth curve is $P(Y = 1 \mid x)$ and the jagged curve is the estimate that the Real AdaBoost algorithm produces after 100 iterations using a stump classifier on the Example 1 data.
After 100 boosting iterations using a stump classifier (a piecewise constant with only one knot, fit using CART), we see that boosting produces an estimate of $p(x)$ that is overfit in terms of providing an accurate picture of the underlying data generating mechanism. However, the misclassification rate on a test dataset with 10,000 observations is only 21.5%, not far off from the Bayes error rate. More iterations have little effect. This counts as a win for the analyst interested solely in predictive performance but a loss for those seeking to uncover the data generating mechanism. Even though the authors' Lemma 1 shows that the population minimizer of $E[e^{-yF(x)}]$ corresponds to the true conditional class probability, with a finite sample it is easy for AdaBoost to overfit from a function estimate or calibration point of view. The debate between accurate prediction and the discovery of the contents of the black box is certainly not new, but it also has not gone away.
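The setup is easy to check numerically. The following short sketch simulates Example 1, taking $F(x) = \tfrac{1}{2}x^2 - 1$ as above, and estimates the Bayes error of the rule $|x| > \sqrt{2}$ by Monte Carlo; it should come out near the quoted 20.2%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                          # large sample for a stable Monte Carlo estimate

x = rng.normal(size=n)
F = 0.5 * x**2 - 1.0                   # F(x) as in Example 1; boundary at |x| = sqrt(2)
p = np.exp(2 * F) / (1 + np.exp(2 * F))
y = rng.binomial(1, p)                 # labels coded 0/1 as in the text

bayes_pred = (np.abs(x) > np.sqrt(2)).astype(int)
print("Bayes error estimate:", np.mean(bayes_pred != y))   # roughly 0.20
```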
Some of the amazement surrounding boosting was that it could take a sim- ple classifier like a stump and turn it into a near-Bayes optimal classifier. The choice of using a stump at each iteration in the example is critical. After two boosting iterations the classifier will pretty much nail the correct decision boundary since the two knots will likely be in the neighborhood of −√2 and +√2. Due to the constraints on the base classifier, subsequent iterations can do little to modify the decision boundary. Therefore, local jaggedness devel- ops as successive stumps try to fit individual observations but it cannot be substantial enough to have an effect on the misclassification rate. This leads to the question of the importance of selecting the base classifier. In this exam- ple, fitting slightly more complex trees at each iteration has disastrous effects on the performance. Figure 2 tracks the change in misclassification rate with each boosting iteration when the base classifier is a tree with four termi- nal nodes. A four-terminal node tree is likely to nearly capture the decision boundary on the first iteration. Subsequent iterations have enough flexibility to fit individual observations and generalization performance declines. From Figure 2 we can see that the performance degrades to slightly worse than predicting 0, the majority class, for all observations.
The results of Example 1 yield the following insights into boosting.
1. AdaBoost (in its present form) is not necessarily useful in function estima- tion. Calibration of the boosted probability estimates is questionable.
2. The idea that AdaBoost is resistant to overfitting needs some qualification. The choice of base classifier can make or break the success of utilizing boosting.
3. Boosting seems to work well when we use a sequence of simple, low vari- ance classifiers. Each stage makes a small, low variance step, sequentially chipping away at the bias.
Other looks at functional optimization. Once boosting is cast as functional optimization we can approach the problem from other avenues. In the AdaBoost setup we wish to find an $F(x)$ that minimizes $J(F) = E[e^{-yF(x)}]$. An additive improvement to $F(x)$ would find an $\alpha$ and an $f(x)$ such that $J(F + \alpha f)$ decreases.
Fig. 2. Misclassification rate when fitting trees with four terminal nodes to the Example 1 test data on each Real AdaBoost iteration. The horizontal line marks the error rate if we predicted all observations to be zeros.
The authors describe how AdaBoost selects $\alpha$ and $f(x)$. The Gentle AdaBoost algorithm that the authors propose performs a pointwise Newton step to suggest $f(x)$ with $\alpha$ fixed at 1. We can also view this as gradient descent in function space with a squared penalty on the magnitude of $f(x)$. Consider the following analysis.
The directional derivative, or Gâteaux variation, of $J(F)$ in the direction of some function $f(x)$ computes the local change in $J(F)$ if we make the modification $F(x) \leftarrow F(x) + \alpha f(x)$, moving $F(x)$ in the direction of $f(x)$. The directional derivative of $J(F)$ is
(1)   $\delta J(F; f) = \frac{d}{d\alpha} E\left[\exp\{-y(F(x) + \alpha f(x))\}\right]\Big|_{\alpha=0} = E\left[-y f(x)\exp\{-yF(x)\}\right].$
A gradient descent strategy would have us select, from among some class of functions, the $f(x)$ that minimizes $\delta J(F; f)$, or equivalently offers the greatest local decrease in $J(F)$. Without further restrictions this method is rather unstable, since we can make $\delta J(F; f)$ arbitrarily small by setting
(2)   $f(x) = \begin{cases} -\infty & \text{if } E[y \mid x] < 0, \\ +\infty & \text{if } E[y \mid x] > 0. \end{cases}$
On the other hand, if we penalize $f(x)$ in (1) for being too big, then we might select $f(x)$ as
(3)   $f(x) = \arg\min_f E\left[-y f(x)\,e^{-yF(x)} + \tfrac{1}{2} f(x)^2\right]$
$\phantom{(3)\ \ f(x)} = \arg\min_f E\left[-y f(x)\,e^{-yF(x)} + \tfrac{1}{2} f(x)^2 + \tfrac{1}{2}\left(y\,e^{-yF(x)}\right)^2\right]$
$\phantom{(3)\ \ f(x)} = \arg\min_f E\left[\left(f(x) - y\,e^{-yF(x)}\right)^2\right].$
Note that (3) is equivalent to choosing the $f(x)$ that is similar to a least-squares fit to the gradient that the authors compute in their derivation of the Gentle AdaBoost algorithm. Obviously, in practice we only have a sample with which to estimate the expectation in (3). Therefore, it becomes critical that we use a low variance least-squares regression method to minimize the empirical estimate of (3). I will return to this topic in a moment. After estimating $f(x)$ we can use a line search to obtain the $\alpha$ that minimizes $E\left[\exp\{-y(F(x) + \alpha f(x))\}\right]$.
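As a minimal numerical sketch of this step for the exponential loss, assuming a generic least-squares fitter `fit_ls` (standing in for any low-variance regression method) that returns a fitted function, the direction is a least-squares fit to the pseudo-response $y\,e^{-yF(x)}$ and the step size comes from a crude grid line search; the grid and the helper name are illustrative assumptions.

```python
import numpy as np

def exp_loss_gradient_step(x, y, F, fit_ls, alphas=np.linspace(0.0, 2.0, 201)):
    """One functional gradient step for J(F) = E[exp(-y F(x))], y in {-1, +1}.

    Following (3), the step direction f is a least-squares fit to the
    pseudo-response y * exp(-y F(x)); the step size alpha is then chosen
    by a grid line search on the sample exponential loss."""
    pseudo = y * np.exp(-y * F)           # negative gradient of the sample loss at F
    f = fit_ls(x, pseudo)                 # f is a function: f(x_new) -> fitted values
    fx = f(x)
    losses = [np.mean(np.exp(-y * (F + a * fx))) for a in alphas]
    alpha = alphas[int(np.argmin(losses))]
    return f, alpha                       # update: F <- F + alpha * f(x)
```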
Applying this same technique within the LogitBoost scenario, that is, maximizing the directional derivative of the Bernoulli log-likelihood with a squared penalty on the direction $f(x)$, the gradient step is
(4)   $f(x) = \arg\min_f E\left[\left(f(x) - \left(y^* - \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}\right)\right)^2\right].$
After estimating the optimal step direction, the choice of the optimal step size, $\alpha$, takes on an interesting form when we have a sample:
(5)   $\hat\alpha = \arg\max_\alpha \sum_{i=1}^{N}\left[2y_i^*\left(F(x_i) + \alpha f(x_i)\right) - \log\left(1 + \exp\{2(F(x_i) + \alpha f(x_i))\}\right)\right].$
Equation (5) has the form of a likelihood for the linear logistic regression model with $f(x_i)$ as a covariate and $F(x_i)$ as an offset term. The optimal $\alpha$ then is just the maximum likelihood estimate of the coefficient of $f(x_i)$, for which efficient software exists. Similar results hold for the class of exponential family regression models and the Cox model for survival analysis [Ridgeway (1999)].
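As an illustration of (5), a small sketch that finds the step size by Newton's method on the Bernoulli log-likelihood with $F(x_i)$ as an offset; using a dedicated logistic regression routine with an offset term would do the same job, and the iteration count here is an arbitrary assumption.

```python
import numpy as np

def logitboost_step_size(ystar, F, f, n_iter=25):
    """Maximize (5) over alpha: Bernoulli log-likelihood with ystar in {0, 1},
    offset F(x_i) and single covariate f(x_i).  Plain Newton iterations."""
    alpha = 0.0
    for _ in range(n_iter):
        eta = 2.0 * (F + alpha * f)          # log-odds scale used in (5)
        p = 1.0 / (1.0 + np.exp(-eta))       # fitted P(y* = 1)
        grad = np.sum(2.0 * f * (ystar - p))            # d/d alpha of the log-likelihood
        hess = -np.sum(4.0 * f**2 * p * (1.0 - p))      # second derivative (negative)
        alpha -= grad / hess
    return alpha
```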
Both the authors’ results and this section arrive at the situation where we need to propose a direction in which to move our current guess for the regres- sion function. When handed only finite training datasets, one is faced with estimating a direction that seems likely to offer a decrease in the loss func- tion. Example 1 showed that high variance estimators of descent directions (e.g., four-terminal node trees) can cause problems. In the past statisticians have controlled the variance of predictors in other manners. As the authors point out, the LogitBoost algorithm has a familiar feel to statisticians. Tradi- tional statistical practice uses linear regression instead of trees at each stage to estimate the conditional expectation. This yields the iteratively reweighted
least-squares algorithm for fitting linear logistic regression models. We can also use smoothers to estimate the conditional expectation as in the local scoring algorithm. These types of methods controlled variance at the expense of bias reduction by limiting the functional form, restricting variable interac- tions, or constraining the jaggedness of the curve.
When we introduce regression trees to estimate these gradient steps insta- bility becomes an issue. However, various forms of averaging may help to keep variance in check. If we use bagged trees to estimate the gradient we can often reduce the variance introduced on the iteration. This is the basis of Breiman’s adaptive bagging [Breiman (1999)]. Denison (2000) proposes averaging over all possible stumps at each iteration with each weighted by their posterior dis- tribution. The amount of needed computation is nearly equivalent to CART with the benefit of more stable gradient steps. It seems that the future of nonparametric regression might consist of choosing an ideal loss function and a sequence of low variance estimates of the gradient direction in which each step decreases bias.
Bayesian prediction and function estimation. When I first learned of the problem of boosting I attempted to understand it from a Bayesian perspective. The redistribution of weights on observations, the model mixing and the improved performance seemed to have a Bayesian flavor to it. In fact, Freund and Schapire (1997) even offered a naïve Bayes justification for the AdaBoost algorithm. With the current results in hand it now seems clear that the Bayesian interpretation of boosting is that it is simply maximum likelihood! Boosting suggests moving the current function in some optimal direction and assembles a path through function space toward a maximum likelihood estimate. The Bayesian wants to move the current function in all directions at each iteration, eventually considering all paths through the function space with each weighted by some posterior probability. As is often the case, computation can impede a fully Bayesian solution. However, the lessons learned from boosting could be eminently useful for Bayesian prediction and function estimation problems.
In Bayesian prediction problems we want to find the predictive distribution of a $y$ given its covariates, $x$, and a dataset $D$ containing previous observations,
(6)   $\Pr(y \mid x, D) = \int \Pr(y \mid x, F)\,\Pr(F \mid D)\,dF.$
The predictive distribution integrates over all members, $F$, in some class of functions. The crudest approximation simply assumes that $\Pr(F \mid D)$ has most of its mass in a small neighborhood of the maximum likelihood estimate of $F$, so that $\Pr(y \mid x, D) \approx \Pr(y \mid x, \hat F)$. To make a point prediction one would compute some functional of $\Pr(y \mid x, \hat F)$ (e.g., the mean, or the class with the largest probability). Recent efforts have proposed methods for sampling from $\Pr(F \mid D)$ in order to compute a Monte Carlo estimate of (6). Some of these ideas include MCMC on the space of trees [Denison, Mallick and Smith (1996), Chipman, George and McCulloch (1998)], on the space of additive models with a particular
smoothness [Hastie and Tibshirani (1998)] and reversible jump MCMC on the space of piecewise constants [Heikkinen (1998)] or splines [DiMatteo, Genovese and Kass (1999)]. I think boosting may have a lot to offer the latter of these methods.
Heikkinen (1998) assumes that the regression function is a piecewise constant where the number of pieces and the constants associated with those pieces are random quantities. The simulation from $\Pr(F \mid D)$ proceeds by either randomly combining two neighboring pieces, splitting a random piece into two, or randomly perturbing the constant associated with one of the pieces. Among the curses of MCMC methods is the problem of wasting iterations exploring regions of the parameter space that are not particularly of interest, or not exploring the entire interesting region. Each boosting iteration computes some kind of gradient in the function space, indicating with respect to the current state which neighboring functions are most interesting in some sense. From the insight gained from boosting one could potentially produce more efficient proposal mechanisms for exploring $\Pr(F \mid D)$; that is in the works.
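For concreteness, the Monte Carlo estimate of (6) can be sketched generically as below, assuming some sampler `draw_F` for $\Pr(F \mid D)$ is available (building that sampler is of course the hard part and is not shown); the function names are illustrative.

```python
import numpy as np

def predictive_prob(x_new, draw_F, prob_given_F, n_draws=200):
    """Monte Carlo estimate of (6): average Pr(y = 1 | x, F) over draws F ~ Pr(F | D).

    draw_F()            -> one sampled function F from the posterior (assumed given)
    prob_given_F(F, x)  -> Pr(y = 1 | x, F), e.g. exp(2*F(x)) / (1 + exp(2*F(x)))"""
    draws = [prob_given_F(draw_F(), x_new) for _ in range(n_draws)]
    return np.mean(draws, axis=0)
```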
Closing comments. This work melds some exciting work in both compu- tational learning theory and statistics. The authors have elegantly connected the AdaBoost algorithm with a novel interpretation of fitting additive logis- tic regression. Researchers from a diverse set of fields are in search of bet- ter methods for modeling datasets that are massive, noisy, complex and high dimensional. Boosting, as a modular procedure for constructing such models, offers an exciting paradigm for future prediction methods.
REFERENCES
Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 36 105–139.
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept. Statistics, Univ. California, Berkeley.
Chipman, H., George, E. and McCulloch, R. (1998). Bayesian CART model search (with dis- cussion). J. Amer. Statist. Assoc. 93 935–960.
Denison, D. (2000). Boosting with Bayesian stumps. Technical report, Dept. Mathematics, Impe- rial College.
Denison, D., Mallick, B. and Smith, A. (1996). Bayesian CART. Technical report, Dept. Mathe- matics, Imperial College.
DiMatteo, I., Genovese, C. and Kass, R. (1999). Bayesian curve fitting with free-knot splines. Technical report, Carnegie Mellon Univ.
Drucker, H. and Cortes, C. (1996). Boosting decision trees. In Proceedings of Neural Informa- tion Processing 8 479–485. MIT Press.
Elkan, C. (1997). Boosting and naïve Bayes learning. Technical Report CS97-557, Univ. California, San Diego.
Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
Hastie, T. and Tibshirani, R. (1998). Bayesian backfitting. Technical report, Stanford Univ.
Heikkinen, J. (1998). Curve and surface estimation using dynamic step functions. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.) 255–272. Springer, New York.
Lee, J. (1999). A computer program that plays a hunch. New York Times August 17.
Quinlan, J. (1996). Bagging, boosting, and C4.5. In Proceedings Thirteenth American Associa- tion for Artificial Intelligence National Conference on Artificial Intelligence 725–730. AAAI Press, Menlo Park, CA.
Ridgeway, G. (1999). The state of boosting. In Computing Science and Statistics 31 (K. Berk and M. Pourahmadi, eds.) 172–181. Interface Foundation of North America, Fairfax, VA.
Department of Statistics University of Washington Seattle, Washington 98195-4322 E-mail: greg@stat.washington.edu
REJOINDER
Jerome Friedman, Trevor Hastie and Robert Tibshirani
Stanford University
We thank the discussants for their generous and thoughtful comments, and the editors for arranging this discussion. Since the paper and discussions are long, we can only reply to a few of the many important points raised by the discussants.
1. Overfitting. All of the discussions raise the issue of overfitting in the context of boosting. Breiman, and to a lesser extent Freund and Schapire, suggest that our explanation is inadequate since it suggests that enough boosting steps can lead to overfitting. Breiman cites empirical evidence suggesting that "AdaBoost almost never overfits the data no matter how many iterations it is run." It would be a nice world if that were the case, but unfortunately it isn't quite. Besides the example in our paper, Ridgeway provides a simple example that exhibits dramatic overfitting in Figure 2 of his discussion. Quoting Rätsch, Onoda and Müller (2000), "Recent studies with highly noisy patterns [Quinlan (1996), Grove and Schuurmans (1998), Rätsch (1998)] showed that it is clearly a myth that Boosting methods do not overfit." Still, it does appear that boosting is more resistant to overfitting than one might expect based on general modeling experience. Buja suggests that the suboptimal greedy "stagewise" nature of the fitting procedure may provide some resistance. This could be the case, and merits further investigation. However, we agree with Bühlmann and Yu that it is largely the nature of zero–one loss that provides general resistance to overfitting as judged in classification.
We illustrate this with a simple type of "nearest-neighbor" procedure. Consider a fixed set of data points $\{x_i\}_1^N$. A training data set consists of $\{y_i, x_i\}_1^N$, where each $y_i \in \{-1, 1\}$ is a random variable with $p_i = \Pr(y_i = 1)$. Given a training set, our estimate for each $p_i$ is $\hat p_i = (1 + a y_i)/2$, $0 \le a \le 1$. Here $a$ is a parameter that controls the degree of fit to the training data. Larger values provide a closer fit; the value $a = 1$ interpolates the data, whereas $a = 0$ estimates $\hat p_i = 1/2$ irrespective of the data. We consider five loss functions $L(y, F)$—exponential, likelihood, squared error, absolute error and zero–one loss—all defined in Table 1.

Table 1

Criterion        L(y, F)                 R̂                      R                                                        a*
exponential      exp(-yF)                ((1-a)/(1+a))^{1/2}    (1-ē)((1-a)/(1+a))^{1/2} + ē((1+a)/(1-a))^{1/2}          1 - 2ē
likelihood       log(1 + exp(-2yF))      log(2/(1+a))           (1-ē) log(2/(1+a)) + ē log(2/(1-a))                      1 - 2ē
squared error    (y - F)^2               (1-a)^2                (1-a)^2 + 4aē                                            1 - 2ē
absolute error   |y - F|                 1 - a                  1 - (1-2ē)a                                              1
zero–one loss    1{y ≠ sign(F)}          0                      ē                                                        any a > 0
For each loss function, $F$ is defined to be the population minimizer,
$$F = \arg\min_F E\,L(y, F) = \arg\min_F \left[p\,L(1, F) + (1 - p)\,L(-1, F)\right].$$
For the exponential and likelihood criteria this is $F = \tfrac{1}{2}\log\{p/(1-p)\}$, for squared and absolute error $F = 2p - 1$, and for zero–one loss $F = \mathrm{sign}(2p - 1)$.
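To see the first of these, set the derivative of the population exponential loss to zero (this is the same calculation as in Lemma 1 of the paper):
$$\frac{d}{dF}\left[p\,e^{-F} + (1-p)\,e^{F}\right] = -p\,e^{-F} + (1-p)\,e^{F} = 0
\;\Longrightarrow\; e^{2F} = \frac{p}{1-p}
\;\Longrightarrow\; F = \tfrac{1}{2}\log\frac{p}{1-p}.$$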
Table 1 shows the average loss on the training data,
$$\hat R = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat F_i),$$
and the corresponding average population risk,
$$R = \frac{1}{N}\sum_{i=1}^{N} E\,L(y_i^*, \hat F_i).$$
The latter expectation is with respect to the training $y_i$ (which produces $\hat F_i$), and the test observation $y_i^*$, both drawn independently with probability $p_i$. The $\hat F_i$ are obtained by plugging $\hat p_i$ into the expressions for $F$. For the exponential and likelihood criteria,
$$\hat F_i = \frac{1}{2}\log\frac{\hat p_i}{1 - \hat p_i} = \frac{1}{2}\log\frac{1 + a y_i}{1 - a y_i},$$
whereas for the squared and absolute error criteria,
$$\hat F_i = a y_i.$$
The risk values are reported in terms of the parameter $a$ and
$$\bar e = \frac{1}{N}\sum_{i=1}^{N} E\,1_{\{y_i^* \ne \mathrm{sign}(\hat F_i)\}} = \frac{2}{N}\sum_{i=1}^{N} p_i(1 - p_i),$$
which is just the average risk based on zero–one loss (and is at most twice the Bayes risk). This error rate does not depend on the value of $a$, since it is independent of the magnitude of $\hat F_i$, depending only on its sign. Thus, overfitting as measured by zero–one loss does not exist for this procedure. Interpolating the data by setting $a = 1$ does not degrade zero–one loss. Also shown in Table 1 is the parameter value $a^*$ that minimizes the population risk for each respective criterion.
As seen from Table 1, both the average training set loss and the population risk for the four criteria other than zero–one loss do depend on the magnitude of $\hat F_i$, and thus depend strongly on how closely the training data are fit, as regulated by the value of $a$. The average training set loss $\hat R$ is monotone decreasing with increasing values of $a$. Starting at $a = 0$, the population risk $R$ for all four criteria initially decreases. For the first three, a minimum is reached at the value $a^* = 1 - 2\bar e$. Increasing the value of $a$ beyond that point monotonically increases population risk, symptomatic of classic overfitting. The degree of overfitting depends on the criterion. Criteria listed higher in Table 1 report more overfitting. For absolute error there is no overfitting; interpolating the training data minimizes population risk for this criterion.
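The closed forms in Table 1 are easy to check numerically. The sketch below evaluates the population risks on a grid of $a$ for a chosen value of $\bar e$ and confirms that the first three criteria are minimized near $a = 1 - 2\bar e$, while absolute error keeps improving all the way to $a = 1$; the particular value $\bar e = 0.1$ and the grid are arbitrary choices for illustration.

```python
import numpy as np

ebar = 0.1                       # average zero-one risk of the procedure (see text)
a = np.linspace(0.001, 0.999, 999)

risks = {
    "exponential": (1 - ebar) * np.sqrt((1 - a) / (1 + a)) + ebar * np.sqrt((1 + a) / (1 - a)),
    "likelihood":  (1 - ebar) * np.log(2 / (1 + a)) + ebar * np.log(2 / (1 - a)),
    "squared":     (1 - a) ** 2 + 4 * a * ebar,
    "absolute":    1 - (1 - 2 * ebar) * a,
}

for name, R in risks.items():
    print(f"{name:12s} minimized near a = {a[np.argmin(R)]:.3f}")
# expected: about 1 - 2*ebar = 0.8 for the first three, and a -> 1 for absolute error
```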
Figure 1 of this Rejoinder shows the training error (lowest green curve) and the population risk for $\bar e = 0.1$ (middle red curve) and $\bar e = 0.2$ (highest blue curve), as a function of $1/(1 - a)$, for the loss criteria shown in Table 1. All curves are normalized to have unit risk at $a = 0$. Note the difference in vertical scale between the lower and upper plots. The first three criteria exhibit classic overfitting. The degree of this overfit increases with increasing intrinsic noise as reflected by the value of $\bar e$. For exponential and likelihood loss the expected test error approaches infinity as the value of $a$ approaches one (data interpolation), the latter much more slowly. Both squared and absolute loss approach finite values of $4\bar e$ and $2\bar e$, respectively. Except for the discontinuity at $a = 0$ (where the risk is that of a coin flip, $R = \frac{1}{2}$, normalized to be 1), the population zero–one risk curves drop immediately to $\bar e$. When the abscissa is plotted on a linear rather than logarithmic scale, the plot for exponential loss (upper left panel) closely resembles that displayed in Figure 3 (lower panel) of the Bühlmann and Yu discussion.
It is important to note that the extent to which overfitting is observed here depends only on the loss criterion used to measure degree-of-fit. For a given training data set, equal values of the parameter $a$ produce identical models for the probabilities $\hat p_i$. The judgment as to whether, or how much, a particular model overfits the training data is determined solely by the criterion used to define "fit," and not by the training procedure used to obtain the model. Exponential loss reports dramatic overfitting, likelihood substantial overfitting, squared error loss moderate overfitting, and absolute and zero–one loss no overfitting at all, on exactly the same sequence of models.
Of course, this example is highly idealized so as to permit exact calculations of risk. In other situations absolute and zero–one loss may exhibit some degree of overfitting. The assumption that the x-values are fixed prevents zero–one loss from reporting overfitting for this procedure.
Fig. 1. Training and population risk curves for the five loss functions in Table 1 as a function of $1/(1 - a)$. The green curves are training risk. The red curves are population risk for $\bar e = 0.1$ and the blue curves are for $\bar e = 0.2$. All curves are normalized to have unit risk at $a = 0$.
In actual practice, the x-values themselves are random variables, and fitting the data too closely can result in some incorrect estimates of the sign of $F$, thereby degrading population zero–one risk. However, the analysis above suggests that overfitting as judged by absolute and especially zero–one loss should be less severe than when judged by exponential, likelihood or squared error loss. Also, there is no reason to expect that the degree-of-fit to the training data that results in minimum population risk will be the same for the former two measures as for the latter three. This is illustrated in Figure 3 of Bühlmann and Yu's discussion.
This analysis has no bearing on the relative merits of any of these criteria as implementing loss functions for boosting, or any other procedure. As pointed out by Bühlmann and Yu, it is often the case that the best criterion to drive a search strategy is not the criterion used to judge the value of the resulting solution. However, the phenomenon of "overfitting" is largely dictated by the latter.
In the field of statistics most of our experience has been derived from likelihood and squared error loss. As Breiman points out, "adding more and more variables to a logistic regression, increasing the training likelihood at each step, usually results in overfitting at some point." This is surely the case when overfitting is judged by the likelihood criterion. It can sometimes be the case when judged by zero–one loss. However, the simple example presented here, as well as considerable empirical evidence elsewhere, suggests that zero–one loss, being less sensitive to variations in the estimates of the underlying probabilities (it depends only on the sign of $2\hat p_i - 1$), exhibits considerably less overfitting than other loss criteria that depend more heavily on the detailed probability estimates. This will be the case whether those estimates are obtained by AdaBoost, LogitBoost or any other procedure.
2. Regression. Most of the discussions consider the extension of boosting to regression. Buja and Bühlmann and Yu, as well as Ridgeway (1999), observe that our view of boosting leads to natural extensions to regression based on squared error loss. This is discussed in Section 10 of the paper. It is explored in more detail in Friedman (1999a), where gradient based boosting is extended to arbitrary differentiable loss criteria. As noted in Friedman (1999a), shrinkage is an important ingredient to the success of boosting in the regression context. At each step $m$ of the iterative procedure, the current approximation $F_{m-1}(x)$ is updated by
(7)   $F_m(x) = F_{m-1}(x) + \nu \cdot f_m(x).$
Here $f_m(x)$ is the update computed in the usual manner by the boosting algorithm at the $m$th step, and $\nu \in (0, 1]$ is a parameter whose value controls the amount of shrinkage. Empirical evidence was presented that such shrinkage dramatically improves the accuracy of the target function estimate when measured by likelihood, squared error or absolute error. Less dramatic but significant gains were also observed for zero–one loss, where the quality of the function estimate (as judged by the other criteria) is less important. In
all cases, accuracy improved monotonically with decreasing value of ν (more shrinkage), provided that the number of iterations M was correspondingly increased so as to reach minimum test error.
Breiman and Bühlmann and Yu suggest that incorporating a stochastic component into boosting algorithms can substantially improve performance. Bühlmann and Yu propose "bag-boosting" which, as Ridgeway points out, is the essence of Breiman's (1999) "adaptive bagging" procedure. The base learner is replaced by the corresponding bagged learner at each boosting step. This can be viewed as an alternative to shrinkage for reducing the learning rate. The individual bagging substeps can be regarded as randomized boosting steps without updating the approximation. Friedman (1999b) investigated combining randomization with shrinkage in an integrated manner. The approximation is slowly but continuously updated at each randomized boost using a slow learning rate, rather than alternating between a large number of steps without updating (bagging), followed by a single step with a full update (boost). Combining randomization with shrinkage in this manner provides a modest but significant additional improvement over the already large gain produced by shrinkage alone. However, there is a substantial computational advantage in employing what Bühlmann and Yu call "subbagging." Friedman and Hall (1999) discuss the equivalence of bagging based on bootstrapping and half-sampling without replacement, and show that different subsample sizes can often lead to better performance.
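As a schematic of (7) combined with randomization, the following sketch runs a squared-error gradient boosting loop in which each update is shrunk by $\nu$ and each weak learner is fit on a random half-sample; the helper `fit_weak` (any weak learner returning a fitted function), the value of $\nu$ and the subsample fraction are illustrative assumptions.

```python
import numpy as np

def shrunk_stochastic_boost(x, y, fit_weak, n_steps=500, nu=0.1, frac=0.5, seed=0):
    """Squared-error gradient boosting with shrinkage (7) and subsampling:
    each step fits the weak learner to residuals on a random half-sample
    and adds nu times the fitted function to the approximation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    F = np.zeros(n)
    ensemble = []                           # list of (nu, fitted weak learner)
    for _ in range(n_steps):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # half-sample
        f = fit_weak(x[idx], (y - F)[idx])  # fit residuals on the subsample
        F += nu * f(x)                      # shrunken update, cf. equation (7)
        ensemble.append((nu, f))
    return ensemble

# prediction at new points z: sum(nu * f(z) for nu, f in ensemble)
```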
3. Interpretations of boosting. For their invention of boosting, Freund and Schapire deserve a great deal of praise. It has led to both practical learn- ing algorithms and much intellectual study and debate. As Buja so poetically states, “There is no single ‘true’ interpretation of anything; interpretation is a vehicle in the service of human comprehension. The proof of an interpretation is in enabling others to fruitfully think about an idea.” We hope that our work has helped serve that goal. Freund and Schapire and Breiman present two additional interpretations, different from ours and each other. Ridgeway con- siders yet another view along Bayesian lines. Having multiple interpretations can only increase the chance of obtaining new insights.
We don't see a great deal of contradiction between our view and that of Freund and Schapire. "Provable bounds" can be derived for LogitBoost (Nigel Duffy and David Helmbold, private communication). Also, we don't see how our "assumptions" are necessarily stronger (or weaker) than those motivating their interpretation. It is hard to imagine a binary random variable that is not Bernoulli distributed. We do not assume "a log-linear relationship between the features and the labels." The function class describing this relationship is determined by the weak learner employed. AdaBoost, LogitBoost, or any other boosting method using the same learner is making similar assumptions concerning the label–feature relationship. As was pointed out in the paper, and amplified by Bühlmann and Yu and Buja, "The question is not whether but how much a weak learner can be boosted, or, in statistical terminology, how well an additive model can approximate an underlying response surface."
The underlying response surface is the unknown label–feature relationship. The weak learner provides the additive components. Different weak learners will perform differently with different response surfaces.
Freund and Mason’s (1999) “alternating decision tree” approach is quite interesting. It shares many common elements with MARS [Friedman (1991)], substituting exponential loss for squared error loss, and using zero-order rather than first-order splines. Both of these modifications may be beneficial in the classification context.
Freund's (1999) suggestion of removing outliers is a good one, but not "radical." This approach is at the center of the field of robust statistics. The source of AdaBoost's lack of robustness is its use of exponential loss as the implementing loss function, which increases exponentially with large negative margin. Negative log-likelihood increases only linearly with large negative margin, thereby being much more robust. Still, sometimes performance can be further improved a little by completely removing the influence of observations with very large negative margins. This is accomplished as a beneficial side effect of the weight trimming strategy described in Section 9 of the paper, when used with LogitBoost. The LogitBoost weights are $w_i = p(x_i)(1 - p(x_i))$. Observations with the smallest weights at the current iteration are removed from consideration for that iteration. Observations with large negative margin have either $p(x_i) \approx 0$ (for $y_i = 1$) or $p(x_i) \approx 1$ (for $y_i = -1$), thereby being deleted.
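A small sketch of that trimming step: at each iteration, keep only the observations whose LogitBoost weights account for, say, the top 90% of the total weight; the threshold fraction and function name are illustrative assumptions.

```python
import numpy as np

def trim_by_weight(p, keep_fraction=0.90):
    """Return the indices of observations retained after weight trimming.

    p: current probability estimates p(x_i); LogitBoost weights are p*(1-p).
    Observations are dropped from the smallest weights upward until the
    retained ones carry keep_fraction of the total weight."""
    w = p * (1 - p)
    order = np.argsort(w)[::-1]             # largest weights first
    cum = np.cumsum(w[order]) / w.sum()
    n_keep = int(np.searchsorted(cum, keep_fraction) + 1)
    return np.sort(order[:n_keep])          # indices used for this iteration's fit
```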
Breiman views boosting as an "ensemble" method for combining classifiers. He seeks to incorporate boosting and bagging into a common framework based on "random forests." This is an intriguing concept. However, it seems to us that boosting operates differently than random forests. From the description, a random forest appears to be an ensemble of learners that are i.i.d. given the data. Hence the bias of the ensemble is the same as that of each individual learner. By contrast, in boosting the individual members are not independent given the data: the construction of each depends explicitly on those members that have been previously entered. The bias of the ensemble is typically much less than that of each constituent learner. Boosting turns a weak learner into a strong ensemble, whereas random forests combine strong learners in an i.i.d. fashion in order to reduce variance.
Viewing boosting as stagewise additive regression does however suggest Breiman’s conclusion that “if boosting succeeds, it is because it gives low cor- relation between classifiers of reasonable strength.” In stagewise regression “each step decouples itself from the previous step” leading to “low correla- tion between pairs of classifiers.” In general, a greedy search strategy will not successively add highly correlated terms, since that would not produce very much local reduction in the criterion being minimized. Also, there will be a tendency to add terms of “reasonable strength,” since that will tend to reduce the criterion. Thus, one could conclude that AdaBoost owes its success to these particular aspects of stagewise additive regression. This is consis- tent with Breiman’s explanation, although motivated somewhat differently. Unfortunately, this conclusion appears to be in conflict with one empirical fact,
namely, the success of shrinking (7) in increasing accuracy with all boosting methods, including AdaBoost [Friedman (1999a)].
Shrinking strongly inhibits the natural tendency of greedy search strate- gies to add low correlation terms of reasonable strength. For smaller val- ues of the shrinkage parameter ν, the strength of each added term ν · fmx becomes smaller, and more highly correlated with recently entered terms. Yet, the evidence so far indicates that the smaller the value of ν, the higher the overall accuracy, as long as there are enough iterations.
The source of this mystery is under investigation. It is but one of the many aspects of this general class of procedures that are not yet well understood. Whether these procedures are referred to as "boosting," or more prosaically as "stagewise additive regression," there are still many problems left to solve that can challenge us for some time to come.
REFERENCES
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept. Statistics, Univ. California, Berkeley.
Freund, Y. (1999). An adaptive version of boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.
Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference 124–133.
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141.
Friedman, J. H. (1999a). Greedy function approximation: a gradient boosting machine. Ann. Statist. To appear.
Friedman, J. H. (1999b). Stochastic gradient boosting. Technical report, Dept. Statistics, Stanford Univ.
Friedman, J. H. and Hall, P. (1999). On bagging and nonlinear estimation. J. Comput. Graph. Statist. To appear.
Grove, A. and Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
Quinlan, J. (1996). Boosting first order learning. In Proceedings of the Seventh International Workshop on Algorithmic Learning Theory (S. Arikawa and A. Sharma, eds.) Lecture Notes in Artificial Intelligence 1160 143–155. Springer, Berlin.
Rätsch, G. (1998). Ensemble learning methods for classification. Masters thesis, Dept. Computer Science, Univ. Potsdam.
Rätsch, G., Onoda, T. and Müller, K. R. (2000). Soft margins for AdaBoost. Machine Learning 1–35.
Ridgeway, G. (1999). The state of boosting. In Proceedings of the Thirty-first Symposium on the Interface 172–181.
Department of Statistics Sequoia Hall
370 Serra Mall
Stanford University
Stanford, California 94305-4065 E-mail: hastie@stat.stanford.edu