
Greedy Function Approximation: A Gradient Boosting Machine
Author(s): Jerome H. Friedman
Source: The Annals of Statistics, Vol. 29, No. 5 (Oct., 2001), pp. 1189-1232. Published by: Institute of Mathematical Statistics
Stable URL: https://www.jstor.org/stable/2699986

The Annals of Statistics
2001, Vol. 29, No. 5, 1189-1232
1999 REITZ LECTURE
GREEDY FUNCTION APPROXIMATION: A GRADIENT BOOSTING MACHINE¹
BY JEROME H. FRIEDMAN Stanford University
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and Friedman, Hastie and Tibshirani are discussed.
1. Function estimation. In the function estimation or "predictive learning" problem, one has a system consisting of a random "output" or "response" variable y and a set of random "input" or "explanatory" variables x = {x_1, ..., x_n}. Using a "training" sample {y_i, x_i}_1^N of known (y, x)-values, the goal is to obtain an estimate or approximation F(x), of the function F*(x) mapping x to y, that minimizes the expected value of some specified loss function L(y, F(x)) over the joint distribution of all (y, x)-values,

(1) F^* = \arg\min_F E_{y,x} L(y, F(x)) = \arg\min_F E_x [E_y(L(y, F(x))) \mid x].

Frequently employed loss functions L(y, F) include squared-error (y - F)^2 and absolute error |y - F| for y ∈ R^1 (regression) and negative binomial log-likelihood, log(1 + e^{-2yF}), when y ∈ {-1, 1} (classification).
A common procedure is to restrict F(x) to be a member of a parameterized class of functions F(x; P), where P = {P_1, P_2, ...} is a finite set of parameters whose joint values identify individual class members.
Received May 1999; revised April 2001.
¹Supported in part by CSIRO Mathematical and Information Science, Australia; Department of Energy Contract DE-AC03-76SF00515; and NSF Grant DMS-97-64431.
AMS 2000 subject classifications. 62-02, 62-07, 62-08, 62G08, 62H30, 68T10.
Key words and phrases. Function estimation, boosting, decision trees, robust nonparametric regression.

In this article we focus on "additive" expansions of the form

(2) F(x; \{\beta_m, a_m\}_1^M) = \sum_{m=1}^{M} \beta_m h(x; a_m).

The (generic) function h(x; a) in (2) is usually a simple parameterized function of the input variables x, characterized by parameters a = {a_1, a_2, ...}. The individual terms differ in the joint values a_m chosen for these parameters. Such expansions (2) are at the heart of many function approximation methods such as neural networks [Rumelhart, Hinton, and Williams (1986)], radial basis functions [Powell (1987)], MARS [Friedman (1991)], wavelets [Donoho (1993)] and support vector machines [Vapnik (1995)]. Of special interest here is the case where each of the functions h(x; a_m) is a small regression tree, such as those produced by CART™ [Breiman, Friedman, Olshen and Stone (1983)]. For a regression tree the parameters a_m are the splitting variables, split locations and the terminal node means of the individual trees.
1.1. Numerical optimization. In general, choosing a parameterized model F(x; P) changes the function optimization problem to one of parameter optimization,

(3) P^* = \arg\min_{P} \Phi(P),

where

\Phi(P) = E_{y,x} L(y, F(x; P)),

and then

F^*(x) = F(x; P^*).

For most F(x; P) and L, numerical optimization methods must be applied to solve (3). This often involves expressing the solution for the parameters in the form

(4) P^* = \sum_{m=0}^{M} p_m,

where p_0 is an initial guess and {p_m}_1^M are successive increments ("steps" or "boosts"), each based on the sequence of preceding steps. The prescription for computing each step p_m is defined by the optimization method.
1.2. Steepest-descent. Steepest-descent is one of the simplest of the frequently used numerical minimization methods. It defines the increments {p_m}_1^M in (4) as follows. First the current gradient g_m is computed:

g_m = \{g_{jm}\} = \left\{ \left[ \frac{\partial \Phi(P)}{\partial P_j} \right]_{P = P_{m-1}} \right\},

where

P_{m-1} = \sum_{i=0}^{m-1} p_i.

The step is taken to be

p_m = -\rho_m g_m,

where

(5) \rho_m = \arg\min_{\rho} \Phi(P_{m-1} - \rho g_m).

The negative gradient -g_m is said to define the "steepest-descent" direction and (5) is called the "line search" along that direction.
2. Numerical optimization in function space. Here we take a "nonparametric" approach and apply numerical optimization in function space. That is, we consider F(x) evaluated at each point x to be a "parameter" and seek to minimize

\Phi(F) = E_{y,x} L(y, F(x)) = E_x [E_y(L(y, F(x))) \mid x],

or equivalently,

\phi(F(x)) = E_y [L(y, F(x)) \mid x]

at each individual x, directly with respect to F(x). In function space there are an infinite number of such parameters, but in data sets (discussed below) only a finite number {F(x_i)}_1^N are involved. Following the numerical optimization paradigm we take the solution to be

F^*(x) = \sum_{m=0}^{M} f_m(x),

where f_0(x) is an initial guess, and {f_m(x)}_1^M are incremental functions ("steps" or "boosts") defined by the optimization method.
For steepest-descent,

(6) f_m(x) = -\rho_m g_m(x)

with

g_m(x) = \left[ \frac{\partial \phi(F(x))}{\partial F(x)} \right]_{F(x)=F_{m-1}(x)} = \left[ \frac{\partial E_y[L(y, F(x)) \mid x]}{\partial F(x)} \right]_{F(x)=F_{m-1}(x)}

and

F_{m-1}(x) = \sum_{i=0}^{m-1} f_i(x).

Assuming sufficient regularity that one can interchange differentiation and integration, this becomes

(7) g_m(x) = E_y \left[ \frac{\partial L(y, F(x))}{\partial F(x)} \,\Big|\, x \right]_{F(x)=F_{m-1}(x)}.

The multiplier \rho_m in (6) is given by the line search

(8) \rho_m = \arg\min_{\rho} E_{y,x} L(y, F_{m-1}(x) - \rho g_m(x)).
3. Finite data. This nonparametric approach breaks down when the joint distribution of (y, x) is estimated by a finite data sample {y_i, x_i}_1^N. In this case E_y[· | x] cannot be estimated accurately by its data value at each x_i, and even if it could, one would like to estimate F*(x) at x values other than the training sample points. Strength must be borrowed from nearby data points by imposing smoothness on the solution. One way to do this is to assume a parameterized form such as (2) and do parameter optimization as discussed in Section 1.1 to minimize the corresponding data based estimate of expected loss,

\{\beta_m, a_m\}_1^M = \arg\min_{\{\beta'_m, a'_m\}_1^M} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M} \beta'_m h(x_i; a'_m)\Big).

In situations where this is infeasible one can try a "greedy stagewise" approach. For m = 1, 2, ..., M,

(9) (\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \beta h(x_i; a))

and then

(10) F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m).
Note that this stagewise strategy is different from stepwise approaches that readjust previously entered terms when new ones are added.
In signal processing this stagewise strategy is called "matching pursuit" [Mallat and Zhang (1993)] where L(y, F) is squared-error loss and the {h(x; a_m)}_1^M are called basis functions, usually taken from an overcomplete wavelet-like dictionary. In machine learning, (9), (10) is called "boosting" where y ∈ {-1, 1} and L(y, F) is either an exponential loss criterion e^{-yF} [Freund and Schapire (1996), Schapire and Singer (1998)] or negative binomial log-likelihood [Friedman, Hastie and Tibshirani (2000), hereafter referred to as FHT00]. The function h(x; a) is called a "weak learner" or "base learner" and is usually a classification tree.
Suppose that for a particular loss L(y, F) and/or base learner h(x; a) the solution to (9) is difficult to obtain. Given any approximator F_{m-1}(x), the function β_m h(x; a_m) (9), (10) can be viewed as the best greedy step toward the data-based estimate of F*(x) (1), under the constraint that the step "direction" h(x; a_m) be a member of the parameterized class of functions h(x; a). It can thus be regarded as a steepest descent step (6) under that constraint.

By construction, the data-based analogue of the unconstrained negative gradient (7),

-g_m(x_i) = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)},

gives the best steepest-descent step direction -g_m = \{-g_m(x_i)\}_1^N in the N-dimensional data space at F_{m-1}(x). However, this gradient is defined only at the data points {x_i}_1^N and cannot be generalized to other x-values. One possibility for generalization is to choose that member of the parameterized class h(x; a_m) that produces h_m = \{h(x_i; a_m)\}_1^N most parallel to -g_m ∈ R^N. This is the h(x; a) most highly correlated with -g_m(x) over the data distribution. It can be obtained from the solution

(11) a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} [-g_m(x_i) - \beta h(x_i; a)]^2.

This constrained negative gradient h(x; a_m) is used in place of the unconstrained one -g_m(x) (7) in the steepest-descent strategy. Specifically, the line search (8) is performed

(12) \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))

and the approximation updated,

F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m).
Basically, instead of obtaining the solution under a smoothness constraint (9), the constraint is applied to the unconstrained (rough) solution by fitting h(x; a) to the "pseudoresponses" \{\tilde{y}_i = -g_m(x_i)\}_1^N (7). This permits replacement of the difficult function minimization problem (9) by least-squares function minimization (11), followed by only a single parameter optimization based on the original criterion (12). Thus, for any h(x; a) for which a feasible least-squares algorithm exists for solving (11), one can use this approach to minimize any differentiable loss L(y, F) in conjunction with forward stagewise additive modeling. This leads to the following (generic) algorithm using steepest-descent.
ALGORITHM 1 (Gradient_Boost).
1. F_0(x) = arg min_ρ Σ_{i=1}^N L(y_i, ρ)
2. For m = 1 to M do:
3.   ỹ_i = -[∂L(y_i, F(x_i))/∂F(x_i)]_{F(x)=F_{m-1}(x)},  i = 1, N
4.   a_m = arg min_{a,β} Σ_{i=1}^N [ỹ_i - β h(x_i; a)]^2
5.   ρ_m = arg min_ρ Σ_{i=1}^N L(y_i, F_{m-1}(x_i) + ρ h(x_i; a_m))
6.   F_m(x) = F_{m-1}(x) + ρ_m h(x; a_m)
7. endFor
end Algorithm
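As a concrete illustration, the following is a minimal Python sketch of Algorithm 1 with a regression tree as the base learner h(x; a). It is not the implementation used in the paper: the choice of scikit-learn's DecisionTreeRegressor, the crude grid searches for the initial constant and for the line search, and all function names are assumptions made purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, neg_gradient, M=100, J=6):
    # Line 1: constant minimizing the total loss, found by a coarse 1-D grid search.
    grid = np.linspace(y.min(), y.max(), 200)
    c0 = grid[np.argmin([loss(y, np.full(len(y), c)).sum() for c in grid])]
    F = np.full(len(y), c0)
    models, rhos = [], []
    for _ in range(M):
        ytilde = neg_gradient(y, F)                                 # line 3: pseudoresponses
        h = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, ytilde)  # line 4: least-squares fit
        hx = h.predict(X)
        rho_grid = np.linspace(-2.0, 2.0, 201)                      # line 5: crude line search
        rho = rho_grid[np.argmin([loss(y, F + r * hx).sum() for r in rho_grid])]
        F = F + rho * hx                                            # line 6: update
        models.append(h)
        rhos.append(rho)
    return c0, models, rhos

For squared-error loss one would pass loss = lambda y, F: 0.5 * (y - F)**2 and neg_gradient = lambda y, F: y - F, which reduces line 4 to fitting the current residuals (Section 4.1).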
Note that any fitting criterion that estimates conditional expectation (given x) could in principle be used to estimate the (smoothed) negative gradient (7) at line 4 of Algorithm 1. Least-squares (11) is a natural choice owing to the superior computational properties of many least-squares algorithms.
In the special case where y ∈ {-1, 1} and the loss function L(y, F) depends on y and F only through their product L(y, F) = L(yF), the analogy of boosting (9), (10) to steepest-descent minimization has been noted in the machine learning literature [Rätsch, Onoda and Müller (1998), Breiman (1999)]. Duffy and Helmbold (1999) elegantly exploit this analogy to motivate their GeoLev and GeoArc procedures. The quantity yF is called the "margin" and the steepest-descent is performed in the space of margin values, rather than the space of function values F. The latter approach permits application to more general loss functions where the notion of margins is not apparent. Drucker (1997) employs a different strategy of casting regression into the framework of
classification in the context of the AdaBoost algorithm [Freund and Schapire (1996)].
4. Applications: additive modeling. In this section the gradient boost- ing strategy is applied to several popular loss criteria: least-squares (LS), least absolute deviation (LAD), Huber (M), and logistic binomial log-likelihood (L). The first serves as a “reality check”, whereas the others lead to new boosting algorithms.
4.1. Least-squares regression. Here L(y, F) = (y - F)^2/2. The pseudoresponse in line 3 of Algorithm 1 is ỹ_i = y_i - F_{m-1}(x_i). Thus, line 4 simply fits the current residuals and the line search (line 5) produces the result ρ_m = β_m, where β_m is the minimizing β of line 4. Therefore, gradient boosting on squared-error loss produces the usual stagewise approach of iteratively fitting the current residuals.
ALGORITHM 2 (LS_Boost).
F_0(x) = ȳ
For m = 1 to M do:
  ỹ_i = y_i - F_{m-1}(x_i),  i = 1, N
  (ρ_m, a_m) = arg min_{a,ρ} Σ_{i=1}^N [ỹ_i - ρ h(x_i; a)]^2
  F_m(x) = F_{m-1}(x) + ρ_m h(x; a_m)
endFor
end Algorithm
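A compact Python sketch of Algorithm 2 is given below, again assuming scikit-learn regression trees as the base learner; because the tree is itself fit to the residuals by least-squares, the line-search multiplier is absorbed into the fitted terminal-node means, so no explicit search appears. The function name is illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, y, M=100, J=6):
    F = np.full(len(y), y.mean())                                   # F_0(x) = ybar
    trees = []
    for _ in range(M):
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, y - F)  # fit current residuals
        F += tree.predict(X)                                        # F_m = F_{m-1} + h(x; a_m)
        trees.append(tree)
    return y.mean(), trees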
4.2. Least absolute deviation (LAD) regression. For the loss function L(y, F) = |y - F|, one has

(13) \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} = \text{sign}(y_i - F_{m-1}(x_i)).

This implies that h(x; a) is fit (by least-squares) to the sign of the current residuals in line 4 of Algorithm 1. The line search (line 5) becomes

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} |y_i - F_{m-1}(x_i) - \rho h(x_i; a_m)|
       = \arg\min_{\rho} \sum_{i=1}^{N} |h(x_i; a_m)| \cdot \left| \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} - \rho \right|

(14)   = \text{median}_W \left\{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; a_m)} \right\}, \qquad w_i = |h(x_i; a_m)|.

Here median_W{·} is the weighted median with weights w_i. Inserting these results [(13), (14)] into Algorithm 1 yields an algorithm for least absolute deviation boosting, using any base learner h(x; a).
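A small sketch of the weighted median needed for the line search (14) follows; the function name and the tie-handling convention (take the smallest value whose cumulative weight reaches half the total) are assumptions for illustration.

import numpy as np

def weighted_median(values, weights):
    # Returns a minimizer of sum_i w_i * |v_i - t| over t.
    v, w = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * w.sum())]

In the LAD line search one would call it with values = (y_i - F_{m-1}(x_i)) / h(x_i; a_m) and weights = |h(x_i; a_m)|.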
4.3. Regression trees. Here we consider the special case where each base learner is a J-terminal node regression tree [Breiman, Friedman, Olshen and Stone (1983)]. Each regression tree model itself has the additive form

(15) h(x; \{b_j, R_j\}_1^J) = \sum_{j=1}^{J} b_j \, 1(x \in R_j).

Here \{R_j\}_1^J are disjoint regions that collectively cover the values of the predictor variables x. These regions are represented by the terminal nodes of the corresponding tree. The indicator function 1(·) has the value 1 if its argument is true, and zero otherwise. The "parameters" of this base learner (15) are the coefficients \{b_j\}_1^J, and the quantities that define the boundaries of the regions \{R_j\}_1^J. These are the splitting variables and the values of those variables that represent the splits at the nonterminal nodes of the tree. Because the regions are disjoint, (15) is equivalent to the prediction rule: if x ∈ R_j then h(x) = b_j.
For a regression tree, the update at line 6 of Algorithm 1 becomes

(16) F_m(x) = F_{m-1}(x) + \rho_m \sum_{j=1}^{J} b_{jm} \, 1(x \in R_{jm}).

Here \{R_{jm}\}_1^J are the regions defined by the terminal nodes of the tree at the mth iteration. They are constructed to predict the pseudoresponses (line 3) by least-squares (line 4). The \{b_{jm}\} are the corresponding least-squares coefficients,

b_{jm} = \text{ave}_{x_i \in R_{jm}} \tilde{y}_i.

The scaling factor \rho_m is the solution to the "line search" at line 5.

The update (16) can be alternatively expressed as

(17) F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \, 1(x \in R_{jm})

with \gamma_{jm} = \rho_m b_{jm}. One can view (17) as adding J separate basis functions at each step, \{1(x \in R_{jm})\}_1^J, instead of a single additive one as in (16). Thus, in this case one can further improve the quality of the fit by using the optimal coefficients for each of these separate basis functions (17). These optimal coefficients are the solution to

\{\gamma_{jm}\}_1^J = \arg\min_{\{\gamma_j\}_1^J} \sum_{i=1}^{N} L\Big(y_i, F_{m-1}(x_i) + \sum_{j=1}^{J} \gamma_j \, 1(x_i \in R_{jm})\Big).

Owing to the disjoint nature of the regions produced by regression trees, this reduces to

(18) \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma).

This is just the optimal constant update in each terminal node region, based on the loss function L, given the current approximation F_{m-1}(x). For the case of LAD regression (18) becomes

\gamma_{jm} = \text{median}_{x_i \in R_{jm}} \{y_i - F_{m-1}(x_i)\},

which is simply the median of the current residuals in the jth terminal node at the mth iteration. At each iteration a regression tree is built to best predict the sign of the current residuals y_i - F_{m-1}(x_i), based on a least-squares criterion. Then the approximation is updated by adding the median of the residuals in each of the derived terminal nodes.
ALGORITHM 3 (LAD_TreeBoost).
F_0(x) = median{y_i}_1^N
For m = 1 to M do:
  ỹ_i = sign(y_i - F_{m-1}(x_i)),  i = 1, N
  {R_{jm}}_1^J = J-terminal node tree({ỹ_i, x_i}_1^N)
  γ_{jm} = median_{x_i ∈ R_{jm}} {y_i - F_{m-1}(x_i)},  j = 1, J
  F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_{jm} 1(x ∈ R_{jm})
endFor
end Algorithm
This algorithm is highly robust. The trees use only order information on the individual input variables x_j, and the pseudoresponses ỹ_i (13) have only two values, ỹ_i ∈ {-1, 1}. The terminal node updates are based on medians.

An alternative approach would be to build a tree to directly minimize the loss criterion,

\text{tree}_m(x) = \arg\min_{J\text{-node tree}} \sum_{i=1}^{N} |y_i - F_{m-1}(x_i) - \text{tree}(x_i)|,

F_m(x) = F_{m-1}(x) + \text{tree}_m(x).

However, Algorithm 3 is much faster since it uses least-squares to induce the trees. Squared-error loss is much more rapidly updated than mean absolute deviation when searching for splits during the tree building process.
4.4. M-Regression. M-regression techniques attempt resistance to long-tailed error distributions and outliers while maintaining high efficiency for normally distributed errors. We consider the Huber loss function [Huber (1964)]

(19) L(y, F) = \begin{cases} \tfrac{1}{2}(y - F)^2, & |y - F| \le \delta, \\ \delta\,(|y - F| - \delta/2), & |y - F| > \delta. \end{cases}

Here the pseudoresponse is

\tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)}
           = \begin{cases} y_i - F_{m-1}(x_i), & |y_i - F_{m-1}(x_i)| \le \delta, \\ \delta \cdot \text{sign}(y_i - F_{m-1}(x_i)), & |y_i - F_{m-1}(x_i)| > \delta, \end{cases}

and the line search becomes

(20) \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))

with L given by (19). The solution to (19), (20) can be obtained by standard iterative methods [see Huber (1964)].
The value of the transition point δ defines those residual values that are considered to be "outliers," subject to absolute rather than squared-error loss. An optimal value will depend on the distribution of y - F*(x), where F* is the true target function (1). A common practice is to choose the value of δ to be the α-quantile of the distribution of |y - F*(x)|, where (1 - α) controls the breakdown point of the procedure. The "breakdown point" is the fraction of observations that can be arbitrarily modified without seriously degrading the quality of the result. Since F*(x) is unknown one uses the current estimate F_{m-1}(x) as an approximation at the mth iteration. The distribution of |y - F_{m-1}(x)| is estimated by the current residuals, leading to

\delta_m = \text{quantile}_{\alpha} \{|y_i - F_{m-1}(x_i)|\}_1^N.

With regression trees as base learners we use the strategy of Section 4.3, that is, a separate update (18) in each terminal node R_{jm}. For the Huber loss

(19) the solution to (18) can be approximated by a single step of the standard iterative procedure [Huber (1964)] starting at the median

\tilde{r}_{jm} = \text{median}_{x_i \in R_{jm}} \{r_{m-1}(x_i)\},

where \{r_{m-1}(x_i)\}_1^N are the current residuals

r_{m-1}(x_i) = y_i - F_{m-1}(x_i).

The approximation is

\gamma_{jm} = \tilde{r}_{jm} + \frac{1}{N_{jm}} \sum_{x_i \in R_{jm}} \text{sign}(r_{m-1}(x_i) - \tilde{r}_{jm}) \cdot \min(\delta_m, \text{abs}(r_{m-1}(x_i) - \tilde{r}_{jm})),

where N_{jm} is the number of observations in the jth terminal node. This gives the following algorithm for boosting regression trees based on Huber loss (19).
ALGORITHM 4 (M_TreeBoost).
F_0(x) = median{y_i}_1^N
For m = 1 to M do:
  r_{m-1}(x_i) = y_i - F_{m-1}(x_i),  i = 1, N
  δ_m = quantile_α{|r_{m-1}(x_i)|}_1^N
  ỹ_i = r_{m-1}(x_i) if |r_{m-1}(x_i)| ≤ δ_m, else δ_m · sign(r_{m-1}(x_i)),  i = 1, N
  {R_{jm}}_1^J = J-terminal node tree({ỹ_i, x_i}_1^N)
  r̃_{jm} = median_{x_i ∈ R_{jm}} {r_{m-1}(x_i)},  j = 1, J
  γ_{jm} = r̃_{jm} + (1/N_{jm}) Σ_{x_i ∈ R_{jm}} sign(r_{m-1}(x_i) - r̃_{jm}) · min(δ_m, abs(r_{m-1}(x_i) - r̃_{jm})),  j = 1, J
  F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_{jm} 1(x ∈ R_{jm})
endFor
end Algorithm
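The two Huber-specific ingredients of Algorithm 4 are easy to isolate. The sketch below (illustrative names, NumPy assumed) computes the pseudoresponses and the one-step terminal-node update from a vector of current residuals.

import numpy as np

def huber_pseudoresponse(residuals, alpha=0.9):
    # delta_m is the alpha-quantile of |r_{m-1}(x_i)|; ytilde clips the residuals at it.
    delta = np.quantile(np.abs(residuals), alpha)
    return np.clip(residuals, -delta, delta), delta

def huber_node_update(node_residuals, delta):
    # Single-step approximation to (18): start at the node median and add the
    # averaged, delta-truncated correction from Algorithm 4.
    r_med = np.median(node_residuals)
    d = node_residuals - r_med
    return r_med + np.mean(np.sign(d) * np.minimum(delta, np.abs(d)))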
According to the motivations underlying robust regression, this algorithm should have properties similar to that of least-squares boosting (Algorithm 2) for normally distributed errors, and similar to that of least absolute deviation regression (Algorithm 3) with very long-tailed distributions. For error distri- butions with only moderately long tails it can have performance superior to both (see Section 6.2).
4.5. Two-class logistic regression and classification. Here the loss function is negative binomial log-likelihood (FHT00)

L(y, F) = \log(1 + \exp(-2yF)), \qquad y \in \{-1, 1\},

where

(21) F(x) = \frac{1}{2} \log \left[ \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} \right].

The pseudoresponse is

(22) \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} = 2 y_i / (1 + \exp(2 y_i F_{m-1}(x_i))).

The line search becomes

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} \log(1 + \exp(-2 y_i (F_{m-1}(x_i) + \rho h(x_i; a_m)))).

With regression trees as base learners we again use the strategy (Section 4.3) of separate updates in each terminal node R_{jm}:

(23) \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} \log(1 + \exp(-2 y_i (F_{m-1}(x_i) + \gamma))).

There is no closed-form solution to (23). Following FHT00, we approximate it by a single Newton-Raphson step. This turns out to be

\gamma_{jm} = \sum_{x_i \in R_{jm}} \tilde{y}_i \Big/ \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|)

with \tilde{y}_i given by (22). This gives the following algorithm for likelihood gradient boosting with regression trees.
ALGORITHM 5 (L2_TreeBoost).
F_0(x) = (1/2) log[(1 + ȳ)/(1 - ȳ)]
For m = 1 to M do:
  ỹ_i = 2y_i/(1 + exp(2y_i F_{m-1}(x_i))),  i = 1, N
  {R_{jm}}_1^J = J-terminal node tree({ỹ_i, x_i}_1^N)
  γ_{jm} = Σ_{x_i ∈ R_{jm}} ỹ_i / Σ_{x_i ∈ R_{jm}} |ỹ_i|(2 - |ỹ_i|),  j = 1, J
  F_m(x) = F_{m-1}(x) + Σ_{j=1}^J γ_{jm} 1(x ∈ R_{jm})
endFor
end Algorithm
The final approximation F_M(x) is related to log-odds through (21). This can be inverted to yield probability estimates

p_+(x) = \hat{\Pr}(y = 1 \mid x) = 1/(1 + e^{-2F_M(x)}),
p_-(x) = \hat{\Pr}(y = -1 \mid x) = 1/(1 + e^{2F_M(x)}).

These in turn can be used for classification,

\hat{y}(x) = 2 \cdot 1[c(-1, 1)\, p_+(x) > c(1, -1)\, p_-(x)] - 1,

where c(\hat{y}, y) is the cost associated with predicting \hat{y} when the truth is y.
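A short sketch of these two-class pieces, assuming labels y ∈ {-1, +1} stored in a NumPy array and F holding the current approximation values; the function names are illustrative.

import numpy as np

def logistic_pseudoresponse(y, F):
    # (22): ytilde_i = 2 y_i / (1 + exp(2 y_i F_{m-1}(x_i)))
    return 2.0 * y / (1.0 + np.exp(2.0 * y * F))

def logistic_node_gamma(ytilde_node):
    # Single Newton-Raphson step for the terminal-node problem (23).
    return ytilde_node.sum() / np.sum(np.abs(ytilde_node) * (2.0 - np.abs(ytilde_node)))

def class_probabilities(F_M):
    # Invert (21): p_plus(x) = 1 / (1 + exp(-2 F_M(x))), p_minus = 1 - p_plus.
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * F_M))
    return p_plus, 1.0 - p_plus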

4.5.1. Influence trimming. The empirical loss function for the two-class logistic regression problem at the mth iteration is

(24) \phi_m(\rho, a) = \sum_{i=1}^{N} \log\big[1 + \exp(-2 y_i F_{m-1}(x_i)) \cdot \exp(-2 y_i \rho h(x_i; a))\big].

If y_i F_{m-1}(x_i) is very large, then (24) has almost no dependence on \rho h(x_i; a) for small to moderate values near zero. This implies that the ith observation (y_i, x_i) has almost no influence on the loss function, and therefore on its solution

(\rho_m, a_m) = \arg\min_{\rho, a} \phi_m(\rho, a).

This suggests that all observations (y_i, x_i) for which y_i F_{m-1}(x_i) is relatively very large can be deleted from all computations of the mth iteration without having a substantial effect on the result. Thus,

(25) w_i = \exp(-2 y_i F_{m-1}(x_i))

can be viewed as a measure of the "influence" or weight of the ith observation on the estimate \rho_m h(x; a_m).
More generally, from the nonparametric function space perspective of Section 2, the parameters are the observation function values \{F(x_i)\}_1^N. The influence of changes in a "parameter" value F(x_i) on the estimate (holding all the other parameters fixed) can be gauged by the second derivative of the loss function with respect to that parameter. Here this second derivative at the mth iteration is |\tilde{y}_i|(2 - |\tilde{y}_i|) with \tilde{y}_i given by (22). Thus, another measure of the influence or "weight" of the ith observation on the estimate \rho_m h(x; a_m) at the mth iteration is

(26) w_i = |\tilde{y}_i|(2 - |\tilde{y}_i|).

Influence trimming deletes all observations with w_i < w_{l(\alpha)}, where l(\alpha) is the solution to

(27) \sum_{i=1}^{l(\alpha)} w_{(i)} = \alpha \sum_{i=1}^{N} w_i.

Here \{w_{(i)}\}_1^N are the weights arranged in ascending order. Typical values are \alpha \in [0.05, 0.2]. Note that influence trimming based on (25), (27) is identical to the "weight trimming" strategy employed with Real AdaBoost, whereas (26), (27) is equivalent to that used with LogitBoost, in FHT00. There it was seen that 90% to 95% of the observations were often deleted without sacrificing accuracy of the estimates, using either influence measure. This results in a corresponding reduction in computation by factors of 10 to 20.
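A sketch of the trimming rule (27): sort the influences, find the largest index whose cumulative influence stays below the fraction α of the total, and drop those observations for the current iteration. The helper name and the return convention are assumptions made here.

import numpy as np

def influence_trim(w, alpha=0.1):
    # w holds either (25) or (26); returns the indices retained for this iteration.
    order = np.argsort(w)                       # ascending influence
    cum = np.cumsum(w[order])
    l_alpha = np.searchsorted(cum, alpha * w.sum())
    return order[l_alpha:]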

4.6. Multiclass logistic regression and classification. Here we develop a gradient-descent boosting algorithm for the K-class problem. The loss function is

(28) L(\{y_k, F_k(x)\}_1^K) = -\sum_{k=1}^{K} y_k \log p_k(x),

where y_k = 1(class = k) ∈ {0, 1}, and p_k(x) = Pr(y_k = 1 | x). Following FHT00, we use the symmetric multiple logistic transform

(29) F_k(x) = \log p_k(x) - \frac{1}{K} \sum_{l=1}^{K} \log p_l(x),

or equivalently

(30) p_k(x) = \exp(F_k(x)) \Big/ \sum_{l=1}^{K} \exp(F_l(x)).

Substituting (30) into (28) and taking first derivatives one has

(31) \tilde{y}_{ik} = -\left[ \frac{\partial L(\{y_{il}, F_l(x_i)\}_1^K)}{\partial F_k(x_i)} \right]_{\{F_l(x)=F_{l,m-1}(x)\}_1^K} = y_{ik} - p_{k,m-1}(x_i),

where p_{k,m-1}(x) is derived from F_{k,m-1}(x) through (30). Thus, K trees are induced at each iteration m to predict the corresponding current residuals for each class on the probability scale. Each of these trees has J terminal nodes, with corresponding regions \{R_{jkm}\}_{j=1}^J. The model updates \gamma_{jkm} corresponding to these regions are the solution to

\{\gamma_{jkm}\} = \arg\min_{\{\gamma_{jk}\}} \sum_{i=1}^{N} \sum_{k=1}^{K} \phi\Big(y_{ik}, F_{k,m-1}(x_i) + \sum_{j=1}^{J} \gamma_{jk} \, 1(x_i \in R_{jkm})\Big),

where \phi(y_k, F_k) = -y_k \log p_k with p_k derived from F_k through (30).
This has no closed-form solution. Moreover, the regions corresponding to the different class trees overlap, so that the solution does not reduce to a separate calculation within each region of each tree in analogy with (18). Following FHT00, we approximate the solution with a single Newton-Raphson step, using a diagonal approximation to the Hessian. This decomposes the problem into a separate calculation for each terminal node of each tree. The result is

(32) \gamma_{jkm} = \frac{K-1}{K} \; \frac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} |\tilde{y}_{ik}| (1 - |\tilde{y}_{ik}|)}.

This leads to the following algorithm for K-class logistic gradient boosting.
ALGORITHM 6 (LK_TreeBoost).
F_{k0}(x) = 0,  k = 1, K
For m = 1 to M do:
  p_k(x) = exp(F_k(x)) / Σ_{l=1}^K exp(F_l(x)),  k = 1, K
  For k = 1 to K do:
    ỹ_{ik} = y_{ik} - p_k(x_i),  i = 1, N
    {R_{jkm}}_{j=1}^J = J-terminal node tree({ỹ_{ik}, x_i}_1^N)
    γ_{jkm} = ((K-1)/K) · Σ_{x_i ∈ R_{jkm}} ỹ_{ik} / Σ_{x_i ∈ R_{jkm}} |ỹ_{ik}|(1 - |ỹ_{ik}|),  j = 1, J
    F_{km}(x) = F_{k,m-1}(x) + Σ_{j=1}^J γ_{jkm} 1(x ∈ R_{jkm})
  endFor
endFor
end Algorithm
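One iteration of Algorithm 6 can be sketched as follows, assuming y is one-hot encoded with shape (N, K), F holds the current scores with shape (N, K), and scikit-learn trees serve as base learners; the max-subtraction in the softmax and the small guard on the denominator are numerical conveniences added here, not part of the algorithm, and all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def lk_treeboost_step(X, Y, F, J=6):
    K = Y.shape[1]
    expF = np.exp(F - F.max(axis=1, keepdims=True))       # (30), stabilized softmax
    P = expF / expF.sum(axis=1, keepdims=True)
    trees = []
    for k in range(K):
        ytilde = Y[:, k] - P[:, k]                         # (31)
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, ytilde)
        leaves = tree.apply(X)                             # terminal node of each x_i
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            num = ytilde[idx].sum()
            den = np.sum(np.abs(ytilde[idx]) * (1.0 - np.abs(ytilde[idx]))) + 1e-12
            F[idx, k] += (K - 1.0) / K * num / den         # (32), training-set update
        trees.append(tree)
    return F, trees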
The final estimates {F_{kM}(x)}_1^K can be used to obtain corresponding probability estimates {p_{kM}(x)}_1^K through (30). These in turn can be used for classification,

\hat{k}(x) = \arg\min_{1 \le k \le K} \sum_{k'=1}^{K} c(k, k') \, p_{k'M}(x),

where c(k, k') is the cost associated with predicting the kth class when the truth is the k'th class.

... M > 200, so that optimal M-values for it are unstable.
Although illustrated here for just one target function and base learner (11-terminal node tree), the qualitative nature of these results is fairly universal. Other target functions and tree sizes (not shown) give rise to the same behavior. This suggests that the best value for ν depends on the number of iterations M. The latter should be made as large as is computationally convenient or feasible. The value of ν should then be adjusted so that LOF achieves its minimum close to the value chosen for M. If LOF is still decreasing at the last iteration, the value of ν or the number of iterations M should be increased, preferably the latter. Given the sequential nature of the algorithm, it can easily be restarted where it finished previously, so that no computation need be repeated. LOF as a function of iteration number is most conveniently estimated using a left-out test sample.
As illustrated here, decreasing the learning rate clearly improves performance, usually dramatically. The reason for this is less clear. Shrinking the model update (36) at each iteration produces a more complex effect than direct proportional shrinkage of the entire model

(38) F_\nu(x) = \bar{y} + \nu\,(F_M(x) - \bar{y}),

where F_M(x) is the model induced without shrinkage. The update \rho_m h(x; a_m) at each iteration depends on the specific sequence of updates at the previous iterations. Incremental shrinkage (36) produces very different models than global shrinkage (38). Empirical evidence (not shown) indicates that global shrinkage (38) provides at best marginal improvement over no shrinkage, far from the dramatic effect of incremental shrinkage. The mystery underlying the success of incremental shrinkage is currently under investigation.
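The distinction can be made concrete with the following sketch (scikit-learn trees assumed, illustrative names): incremental shrinkage damps each update as it is made, so every later stage is fit to residuals of the already-shrunken model, whereas global shrinkage (38) is applied only once, after the full model has been built.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_with_incremental_shrinkage(X, y, M=500, J=6, nu=0.1):
    F = np.full(len(y), y.mean())
    trees = []
    for _ in range(M):
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, y - F)
        F += nu * tree.predict(X)        # (36): damped update inside the loop
        trees.append(tree)
    return y.mean(), trees, F

def global_shrinkage(F0, F_M, nu=0.1):
    # (38): shrink the finished model toward the initial constant.
    return F0 + nu * (F_M - F0)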
6. Simulation studies. The performance of any function estimation method depends on the particular problem to which it is applied. Important characteristics of problems that affect performance include training sample size N, true underlying “target” function F*(x) (1), and the distribution of

the departures, ε, of y | x from F*(x). For any given problem, N is always known and sometimes the distribution of ε is also known, for example when y is binary (Bernoulli). When y is a general real-valued variable the distribution of ε is seldom known. In nearly all cases, the nature of F*(x) is unknown.
In order to gauge the value of any estimation method it is necessary to accurately evaluate its performance over many different situations. This is most conveniently accomplished through Monte Carlo simulation where data can be generated according to a wide variety of prescriptions and resulting performance accurately calculated. In this section several such studies are presented in an attempt to understand the properties of the various Gradient_TreeBoost procedures developed in the previous sections. Although such a study is far more thorough than evaluating the methods on just a few selected examples, real or simulated, the results of even a large study can only be regarded as suggestive.
6.1. Random function generator. One of the most important characteristics of any problem affecting performance is the true underlying target function F*(x) (1). Every method has particular targets for which it is most appropriate and others for which it is not. Since the nature of the target function can vary greatly over different problems, and is seldom known, we compare the merits of regression tree gradient boosting algorithms on a variety of different randomly generated targets. Each one takes the form

(39) F^*(x) = \sum_{l=1}^{20} a_l \, g_l(z_l).

The coefficients \{a_l\}_1^{20} are randomly generated from a uniform distribution a_l \sim U[-1, 1]. Each g_l(z_l) is a function of a randomly selected subset, of size n_l, of the n input variables x. Specifically,

z_l = \{x_{P_l(j)}\}_{j=1}^{n_l},

where each P_l is a separate random permutation of the integers {1, 2, ..., n}. The size of each subset n_l is itself taken to be random, n_l = \lfloor 1.5 + r \rfloor, with r being drawn from an exponential distribution with mean λ = 2. Thus, the expected number of input variables for each g_l(z_l) is between three and four. However, most often there will be fewer than that, and somewhat less often, more. This reflects a bias against strong very high-order interaction effects. However, for any realized F*(x) there is a good chance that at least a few of the 20 functions g_l(zl) will involve higher-order interactions. In any case, F*(x) will be a function of all, or nearly all, of the input variables.
Each g_l(z_l) is an n_l-dimensional Gaussian function

(40) g_l(z_l) = \exp\big(-\tfrac{1}{2}(z_l - \mu_l)^T V_l (z_l - \mu_l)\big),

where each of the mean vectors \{\mu_l\}_1^{20} is randomly generated from the same distribution as that of the input variables x.

The n_l × n_l covariance matrix V_l is also randomly generated. Specifically,

V_l = U_l D_l U_l^T,

where U_l is a random orthonormal matrix (uniform on Haar measure) and D_l = \text{diag}\{d_{1l}, \ldots, d_{n_l l}\}. The square roots of the eigenvalues are randomly generated from a uniform distribution \sqrt{d_{jl}} \sim U[a, b], where the limits a, b depend on the distribution of the input variables x.
For all of the studies presented here, the number of input variables was taken to be n = 10, and their joint distribution was taken to be standard normal, x ~ N(0, I). The eigenvalue limits were a = 0.1 and b = 2.0. Although the tails of the normal distribution are often shorter than those of data encountered in practice, they are still more realistic than the uniformly distributed inputs often used in simulation studies. Also, regression trees are immune to the effects of long-tailed input variable distributions, so shorter tails give a relative advantage to competitors in the comparisons.
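A sketch of this random target generator, with the settings just described (n = 10 standard-normal inputs, 20 Gaussian components, eigenvalue limits a = 0.1 and b = 2.0, subset sizes n_l = floor(1.5 + r) with r exponential of mean 2); the function names and the exact random-matrix construction are illustrative assumptions.

import numpy as np

def make_random_target(n=10, n_components=20, a=0.1, b=2.0, seed=None):
    rng = np.random.default_rng(seed)
    comps = []
    for _ in range(n_components):
        a_l = rng.uniform(-1.0, 1.0)                          # coefficient a_l
        n_l = min(n, int(np.floor(1.5 + rng.exponential(2.0))))
        subset = rng.permutation(n)[:n_l]                     # variables entering z_l
        mu = rng.standard_normal(n_l)                         # mean from N(0, I)
        U, _ = np.linalg.qr(rng.standard_normal((n_l, n_l)))  # random orthonormal U_l
        eig = rng.uniform(a, b, size=n_l) ** 2                # sqrt(d_jl) ~ U[a, b]
        V = U @ np.diag(eig) @ U.T
        comps.append((a_l, subset, mu, V))

    def F_star(X):
        out = np.zeros(len(X))
        for a_l, subset, mu, V in comps:
            Z = X[:, subset] - mu
            out += a_l * np.exp(-0.5 * np.einsum('ij,jk,ik->i', Z, V, Z))
        return out

    return F_star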
In the simulation studies below, 100 target functions F*(x) were randomly generated according to the above prescription (39), (40). Performance is evaluated in terms of the distribution of approximation inaccuracy [relative approximation error (37) or misclassification risk] over these different targets. This approach allows a wide variety of quite different target functions to be generated in terms of the shapes of their contours in the ten-dimensional input space. Although lower order interactions are favored, these functions are not especially well suited to additive regression trees. Decision trees produce tensor product basis functions, and the components g_l(z_l) of the targets F*(x) are not tensor product functions. Using the techniques described in Section 8, visualizations of the dependencies of the first randomly generated function on some of its more important arguments are shown in Section 8.3.
Although there are only ten input variables, each target is a function of all of them. In many data mining applications there are many more than ten inputs. However, the relevant dimensionalities are the intrinsic dimensionality of the input space, and the number of inputs that actually influence the output response variable y. In problems with many input variables there are usually high degrees of collinearity among many of them, and the number of roughly independent variables (approximate intrinsic dimensionality) is much smaller. Also, target functions often strongly depend only on a small subset of all of the inputs.
6.2. Error distribution. In this section, LS_TreeBoost, LAD_TreeBoost, and M_TreeBoost are compared in terms of their performance over the 100 target functions for two different error distributions. Best-first regression trees with 11 terminal nodes were used with all algorithms. The breakdown parameter for the M_TreeBoost was set to its default value α = 0.9. The learning rate parameter (36) was set to ν = 0.1 for all TreeBoost procedures in all of the simulation studies.

One hundred data sets {y_i, x_i}_1^N were generated according to

y_i = F^*(x_i) + \varepsilon_i,

where F*(x) represents each of the 100 target functions randomly generated as described in Section 6.1. For the first study, the errors ε_i were generated from a normal distribution with zero mean, and variance adjusted so that

(41) E|\varepsilon| = E_x |F^*(x) - \text{median}_x F^*(x)|,

giving a 1/1 signal-to-noise ratio. For the second study the errors were generated from a "slash" distribution, ε_i = s·(u/v), where u ~ N(0, 1) and v ~ U[0, 1]. The scale factor s is adjusted to give a 1/1 signal-to-noise ratio (41). The slash distribution has very thick tails and is often used as an extreme to test robustness. The training sample size was taken to be N = 7500, with 5000 used for training, and 2500 left out as a test sample to estimate the optimal number of components M. For each of the 100 trials an additional validation sample of 5000 observations was generated (without error) to evaluate the approximation inaccuracy (37) for that trial.
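A sketch of the noise generation follows, with the scale set empirically so that the mean absolute error matches the right-hand side of (41); the empirical rescaling and the function name are simplifications assumed here.

import numpy as np

def make_noise(F_values, kind="normal", seed=None):
    rng = np.random.default_rng(seed)
    target = np.mean(np.abs(F_values - np.median(F_values)))   # rhs of (41), estimated
    N = len(F_values)
    if kind == "normal":
        eps = rng.standard_normal(N)
    else:                                   # "slash": N(0, 1) divided by U[0, 1]
        eps = rng.standard_normal(N) / rng.uniform(0.0, 1.0, N)
    return eps * target / np.mean(np.abs(eps))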
The left panels of Figure 2 show boxplots of the distribution of approximation inaccuracy (37) over the 100 targets for the two error distributions for each of the three methods. The shaded area of each boxplot shows the interquartile range of the distribution with the enclosed white bar being the median.

FIG. 2. Distribution of absolute approximation error (left panels) and error relative to the best (right panels) for LS_TreeBoost, LAD_TreeBoost and M_TreeBoost for normal and slash error distributions. LS_TreeBoost performs best with the normal error distribution. LAD_TreeBoost and M_TreeBoost both perform well with slash errors. M_TreeBoost is very close to the best for both error distributions. Note the use of logarithmic scale in the lower right panel.

The outer hinges represent the points closest to (plus/minus) 1.5 interquartile range units from the (upper/lower) quartiles. The isolated bars represent individual points outside this range (outliers).
These plots allow the comparison of the overall distributions, but give no information concerning relative performance for individual target functions. The right two panels of Figure 2 attempt to provide such a summary. They show distributions of error ratios, rather than the errors themselves. For each target function and method, the error for the method on that target is divided by the smallest error obtained on that target, over all of the methods (here three) being compared. Thus, for each of the 100 trials, the best method receives a value of 1.0 and the others receive a larger value. If a particular method was best (smallest error) for all 100 target functions, its resulting distribution (boxplot) would be a point mass at the value 1.0. Note that the logarithm of this ratio is plotted in the lower right panel.
From the left panels of Figure 2 one sees that the 100 targets represent a fairly wide spectrum of difficulty for all three methods; approximation errors vary by over a factor of two. For normally distributed errors LS_TreeBoost is the superior performer, as might be expected. It had the smallest error in 73 of the trials, with M_TreeBoost best the other 27 times. On average LS_TreeBoost was 0.2% worse than the best, M_TreeBoost 0.9% worse, and LAD_TreeBoost was 7.4% worse than the best.
With slash-distributed errors, things are reversed. On average the approximation error for LS_TreeBoost was 0.95, thereby explaining only 5% of the target variation. On individual trials however, it could be much better or much worse. The performance of both LAD_TreeBoost and M_TreeBoost was much better and comparable to each other. LAD_TreeBoost was best 32 times and M_TreeBoost 68 times. On average LAD_TreeBoost was 4.1% worse than the best, M_TreeBoost 1.0% worse, and LS_TreeBoost was 364.6% worse than the best, over the 100 targets.
The results suggest that of these three, M_TreeBoost is the method of choice. In both the extreme cases of very well-behaved (normal) and very badly behaved (slash) errors, its performance was very close to that of the best. By comparison, LAD_TreeBoost suffered somewhat with normal errors, and LS_TreeBoost was disastrous with slash errors.
6.3. LS_TreeBoost versus MARS. All Gradient_TreeBoost algorithms produce piecewise constant approximations. Although the number of such pieces is generally much larger than that produced by a single tree, this aspect of the approximating function F_M(x) might be expected to represent a disadvantage with respect to methods that provide continuous approximations, especially when the true underlying target F*(x) (1) is continuous and fairly smooth. All of the randomly generated target functions (39), (40) are continuous and very smooth. In this section we investigate the extent of the piecewise constant disadvantage by comparing the accuracy of Gradient_TreeBoost with that of MARS [Friedman (1991)] over these 100 targets. Like TreeBoost, MARS produces a tensor product based approximation.

However, it uses continuous functions as the product factors, thereby producing a continuous approximation. It also uses a more involved (stepwise) strategy to induce the tensor products. Since MARS is based on least-squares fitting, we compare it to LS_TreeBoost using normally distributed errors, again with a 1/1 signal-to-noise ratio (41). The experimental setup is the same as that in Section 6.2. It is interesting to note that here the performance of MARS was considerably enhanced by
using the 2500 observation test set for model selection, rather than its default generalized cross-validation (GCV) criterion [Friedman (1991)].
The top left panel of Figure 3 compares the distribution of MARS average absolute approximation errors, over the 100 randomly generated target functions (39), (40), to that of LS_TreeBoost from Figure 2. The MARS distribution is seen to be much broader, varying by almost a factor of three. There were many targets for which MARS did considerably better than LS_TreeBoost, and many for which it was substantially worse. This further illustrates the fact that the nature of the target function strongly influences the relative performance of different methods. The top right panel of Figure 3 shows the distribution of errors, relative to the best for each target. The two methods
exhibit similar performance based on average absolute error. There were a number of targets where each one substantially outperformed the other.
The bottom two panels of Figure 3 show corresponding plots based on root mean squared error. This gives proportionally more weight to larger errors in assessing lack of performance. For LS-TreeBoost the two error measures have close to the same values for all of the 100 targets. However with MARS, root mean squared error is typically 30% higher than average absolute error. This indicates that MARS predictions tend to be either very close to, or far from, the target. The errors from LS-TreeBoost are more evenly distributed. It tends to have fewer very large errors or very small errors. The latter may be a consequence of the piecewise constant nature of the approximation which
makes it difficult to get arbitrarily close to very smoothly varying targets with approximations of finite size. As Figure 3 illustrates, relative performance can be quite sensitive to the criterion used to measure it.
These results indicate that the piecewise constant aspect of TreeBoost approximations is not a serious disadvantage. In the rather pristine environment of normal errors and normal input variable distributions, it is competitive with MARS. The advantage of the piecewise constant approach is robustness; specifically, it provides immunity to the adverse effects of wide tails and outliers in the distribution of the input variables x. Methods that produce continuous approximations, such as MARS, can be extremely sensitive to such problems. Also, as shown in Section 6.2, M_TreeBoost (Algorithm 4) is nearly
as accurate as LS-TreeBoost for normal errors while, in addition, being highly resistant to output y-outliers. Therefore in data mining applications where the cleanliness of the data is not assured and x- and/or y-outliers may be present, the relatively high accuracy, consistent performance and robustness of M-TreeBoost may represent a substantial advantage.

FIG. 3. Distribution of approximation error (left panels) and error relative to the best (right panels) for LS_TreeBoost and MARS. The top panels are based on average absolute error, whereas the bottom ones use root mean squared error. For absolute error the MARS distribution is wider, indicating more frequent better and worse performance than LS_TreeBoost. The MARS distribution as measured by root mean squared error is much worse, indicating that MARS tends to make both larger and smaller errors than LS_TreeBoost.
6.4. LK_TreeBoost versus K-class LogitBoost and AdaBoost.MH. In this section the performance of LK_TreeBoost is compared to that of K-class LogitBoost (FHT00) and AdaBoost.MH [Schapire and Singer (1998)] over the 100 randomly generated targets (Section 6.1). Here K = 5 classes are generated
by thresholding each target at its 0.2, 0.4, 0.6 and 0.8 quantiles over the distribution of input x-values. There are N = 7500 training observations for each trial (1500 per class) divided into 5000 for training and 2500 for model selec-
tion (number of iterations, M). An independently generated validation sample
of 5000 observations was used to estimate the error rate for each target. The

Bayes error rate is zero for all targets, but the induced decision boundaries can become quite complicated, depending on the nature of each individual target function F*(x). Regression trees with 11 terminal nodes were used for each method.
Figure 4 shows the distribution of error rate (left panel), and its ratio to the smallest (right panel), over the 100 target functions, for each of the three methods. The error rate of all three methods is seen to vary substantially over these targets. LK TreeBoost is seen to be the generally superior performer. It had the smallest error for 78 of the trials and on average its error rate was 0.6% higher than the best for each trial. LogitBoost was best on 21 of the targets and there was one tie. Its error rate was 3.5% higher than the best on
average. AdaBoost.MH was never the best performer, and on average it was 15% worse than the best.
Figure 5 shows a corresponding comparison, with the LogitBoost and AdaBoost.MH procedures modified to incorporate incremental shrinkage (36), with the shrinkage parameter set to the same (default) value ν = 0.1 used with LK_TreeBoost. Here one sees a somewhat different picture. Both LogitBoost and AdaBoost.MH benefit substantially from shrinkage. The performance of all three procedures is now nearly the same, with LogitBoost perhaps having a slight advantage. On average its error rate was 0.5% worse than the best; the corresponding values for LK_TreeBoost and AdaBoost.MH were 2.3% and 3.9%, respectively. These results suggest that the relative performance of these methods is more dependent on their aggressiveness, as parameterized by learning rate, than on their structural differences.
FIG. 4. Distribution of error rate on a five-class problem (left panel) and error rate relative to the best (right panel) for LK_TreeBoost, LogitBoost, and AdaBoost.MH. LK_TreeBoost exhibits superior performance.

FIG. 5. Distribution of error rate on a five-class problem (left panel), and error rate relative to the best (right panel), for LK_TreeBoost, and with proportional shrinkage applied to LogitBoost and Real AdaBoost. Here the performance of all three methods is similar.
LogitBoost has an additional internal shrinkage associated with stabilizing its pseudoresponse (33) when the denominator is close to zero (FHT00, page 352). This may account for its slight superiority in this comparison. In fact, when increased shrinkage is applied to LK_TreeBoost (ν = 0.05) its performance improves, becoming identical to that of LogitBoost shown in Figure 5. It is likely that when the shrinkage parameter is carefully tuned for each of the three methods, there would be little performance differential between them.
7. Tree boosting. The Gradient_Boost procedure (Algorithm 1) has two primary metaparameters, the number of iterations M and the learning rate parameter ν (36). These are discussed in Section 5. In addition to these, there are the metaparameters associated with the procedure used to estimate the base learner h(x; a). The primary focus of this paper has been on the use of best-first induced regression trees with a fixed number of terminal nodes, J. Thus, J is the primary metaparameter of this base learner. The best choice for its value depends most strongly on the nature of the target function, namely
the highest order of the dominant interactions among the variables. Consider an ANOVA expansion of a function
(42) F(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots.
The first sum is called the "main effects" component of F(x). It consists of a sum of functions that each depend on only one input variable.

The particular functions {f_j(x_j)} are those that provide the closest approximation to F(x) under this additive constraint. This is sometimes referred to as an "additive" model because the contributions of each x_j, f_j(x_j), add to the contributions of the others. This is a different and more restrictive definition of "additive"
than (2). The second sum consists of functions of pairs of input variables.
They are called the two-variable “interaction effects.” They are chosen so that
along with the main effects they provide the closest approximation to F(x) under the limitation of no more than two-variable interactions. The third sum represents three-variable interaction effects, and so on.
The highest interaction order possible is limited by the number of input variables n. However, especially for large n, many target functions F*(x) encountered in practice can be closely approximated by ANOVA decomposi- tions of much lower order. Only the first few terms in (42) are required to capture the dominant variation in F*(x). In fact, considerable success is often achieved with the additive component alone [Hastie and Tibshirani (1990)]. Purely additive approximations are also produced by the “naive” -Bayes method [Warner, Toronto, Veasey and Stephenson (1961)], which is often highly successful in classification. These considerations motivated the bias toward lower-order interactions in the randomly generated target functions
(Section 6.1) used for the simulation studies.
The goal of function estimation is to produce an approximation F(x) that
closely matches the target F*(x). This usually requires that the dominant interaction order of F(x) be similar to that of F*(x). In boosting regression trees, the interaction order can be controlled by limiting the size of the individual trees induced at each iteration. A tree with J terminal nodes produces a function with interaction order at most min(J - 1, n). The boosting process is additive, so the interaction order of the entire approximation can be no larger than the largest among its individual components. Therefore, with any of the TreeBoost procedures, the best tree size J is governed by the effec-
tive interaction order of the target F*(x). This is usually unknown so that J becomes a metaparameter of the procedure to be estimated using a model selection criterion such as cross-validation or on a left-out subsample not used in training. However, as discussed above, it is unlikely that large trees would ever be necessary or desirable.
Figure 6 illustrates the effect of tree size on approximation accuracy for the 100 randomly generated functions (Section 6.1) used in the simulation studies. The experimental set-up is the same as that used in Section 6.2. Shown is the distribution of absolute errors (37) (left panel), and errors relative to the lowest for each target (right panel), for J ∈ {2, 3, 6, 11, 21}. The first value J = 2 produces additive main effects components only; J = 3 produces additive and two-variable interaction terms, and so on. A J terminal node tree can produce interaction levels up to a maximum of min(J - 1, n), with typical values being less than that, especially when J - 1 < n.

FIG. 6. Distribution of absolute approximation error (left panel) and error relative to the best (right panel) for LS_TreeBoost with different sized trees, as measured by number of terminal nodes J. The distribution using the smallest trees J ∈ {2, 3} is wider, indicating more frequent better and worse performance than with the larger trees, all of which have similar performance.

As seen in Figure 6 the smallest trees J ∈ {2, 3} produce lower accuracy on average, but their distributions are considerably wider than the others. This means that they produce more very accurate, and even more very inaccurate, approximations. The smaller trees, being restricted to low-order interactions, are better able to take advantage of targets that happen to be of low interaction level. However, they do quite badly when trying to approximate the high-order interaction targets. The larger trees J ∈ {6, 11, 21} are more consistent. They sacrifice some accuracy on low-order interaction targets, but do much better on the higher-order functions. There is little performance difference among the larger trees, with perhaps some slight deterioration for J = 21. The J = 2 trees produced the most accurate approximation eight times; the corresponding numbers for J ∈ {3, 6, 11, 21} were 2, 30, 31, 29, respectively. On average the J = 2 trees had errors 23.2% larger than the lowest for each target, while the others had corresponding values of 16.4%, 2.4%, 2.2% and 3.7%, respectively. Higher accuracy should be obtained when the best tree size J is individually estimated for each target. In practice this can be accomplished by evaluating the use of different tree sizes with an independent test data set, as illustrated in Section 9.

8. Interpretation. In many applications it is useful to be able to interpret the derived approximation F(x). This involves gaining an understanding of those particular input variables that are most influential in contributing to its variation, and the nature of the dependence of F(x) on those influential inputs. To the extent that F(x) at least qualitatively reflects the nature of the target function F*(x) (1), such tools can provide information concerning the underlying relationship between the inputs x and the output variable y. In this section, several tools are presented for interpreting TreeBoost approximations. Although they can be used for interpreting single decision trees, they tend to be more effective in the context of boosting (especially small) trees.
These interpretative tools are illustrated on real data examples in Section 9.

8.1. Relative importance of input variables. Among the most useful descriptions of an approximation F(x) are the relative influences I_j of the individual inputs x_j on the variation of F(x) over the joint input variable distribution. One such measure is

(43) I_j = \Big( E_x \Big[ \frac{\partial F(x)}{\partial x_j} \Big]^2 \cdot \text{var}_x[x_j] \Big)^{1/2}.

For piecewise constant approximations produced by decision trees, (43) does not strictly exist and it must be approximated by a surrogate measure that reflects its properties. Breiman, Friedman, Olshen and Stone (1983) proposed

(44) \hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, 1(v_t = j),

where the summation is over the nonterminal nodes t of the J-terminal node tree T, v_t is the splitting variable associated with node t, and \hat{i}_t^2 is the corresponding empirical improvement in squared error (35) as a result of the split. The right-hand side of (44) is associated with squared influence so that its units correspond to those of (43). Breiman, Friedman, Olshen and Stone (1983) used (44) directly as a measure of influence, rather than squared influence. For a collection of decision trees \{T_m\}_1^M, obtained through boosting, (44) can be generalized by its average over all of the trees,

(45) \hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m),

in the sequence.
The motivation for (44), (45) is based purely on heuristic arguments. As a partial justification we show that it produces expected results when applied in the simplest context. Consider a linear target function

(46) F^*(x) = a_0 + \sum_{j=1}^{n} a_j x_j,

where the covariance matrix of the inputs is a multiple of the identity, E_x[(x - \bar{x})(x - \bar{x})^T] = c I_n. In this case the influence measure (43) produces

(47) I_j = |a_j|.

Table 2 shows the results of a small simulation study similar to those in Section 6, but with F*(x) taken to be linear (46) with coefficients

(48) a_j = (-1)^j j,

and a signal-to-noise ratio of 1/1 (41). Shown are the mean and standard deviation of the values of (44), (45) over ten random samples, all with F*(x) given by (46), (48). The influence of the estimated most influential variable x_{j*} is arbitrarily assigned the value I_{j*} = 100, and the estimated values of the others scaled accordingly. The estimated importance ranking of the input variables was correct on every one of the ten trials. As can be seen in Table 2, the estimated relative influence values are consistent with those given by (47) and (48).

TABLE 2
Estimated mean and standard deviation of input variable relative influence for a linear target function

  Variable     Mean    Standard deviation
     10       100.0            0.0
      9        90.3            4.3
      8        80.0            4.1
      7        69.8            3.9
      6        62.1            2.3
      5        51.7            2.0
      4        40.3            4.2
      3        31.3            2.9
      2        22.2            2.8
      1        13.0            3.2

In Breiman, Friedman, Olshen and Stone (1983), the influence measure (44) is augmented by a strategy involving surrogate splits intended to uncover the masking of influential variables by others highly associated with them. This strategy is most helpful with single decision trees where the opportunity for variables to participate in splitting is limited by the size J of the tree in (44). In the context of boosting, however, the number of splitting opportunities is vastly increased (45), and surrogate unmasking is correspondingly less essential.
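The squared-influence measure (44), (45) can be computed from any collection of fitted regression trees by crediting each split's weighted squared-error improvement to its splitting variable. The sketch below uses scikit-learn tree internals (tree_.feature, tree_.impurity, tree_.weighted_n_node_samples, tree_.children_left/right); the function names and the final rescaling to 100, which mirrors the convention of Table 2, are assumptions of this sketch.

import numpy as np

def squared_influence(tree, n_features):
    t = tree.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                      # terminal node: no split to credit
            continue
        gain = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
        imp[t.feature[node]] += gain        # (44): credit the splitting variable
    return imp

def relative_influence(trees, n_features):
    # (45): average squared influence over the trees, then rescale so the most
    # influential variable receives the value 100.
    total = np.mean([squared_influence(tr, n_features) for tr in trees], axis=0)
    influence = np.sqrt(total)
    return 100.0 * influence / influence.max()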
8.2. Partial dependence plots. Visualization is one of the most powerful interpretational tools. Graphical renderings of the value of F(x) as a function of its arguments provide a comprehensive summary of its dependence on the joint values of the input variables. Unfortunately, such visualization is limited to low-dimensional arguments. Functions of a single real-valued variable x, F(x), can be plotted as a graph of the values of F(x) against each corresponding value of x. Functions of a single categorical variable can be represented by a bar plot, each bar representing one of its values, and the bar height the value of the function. Functions of two real-valued variables can be pictured using contour or perspective mesh plots. Functions of a categorical variable and another variable (real or categorical) are best summarized by a sequence of ("trellis") plots, each one showing the dependence of F(x) on the second variable, conditioned on the respective values of the first variable [Becker and Cleveland (1996)].

Viewing functions of higher-dimensional arguments is more difficult. It is therefore useful to be able to view the partial dependence of the approximation F(x) on selected small subsets of the input variables. Although a collection of such plots can seldom provide a comprehensive depiction of the approximation, it can often produce helpful clues, especially when F(x) is dominated by low-order interactions (Section 7).

Let $z_l$ be a chosen "target" subset, of size l, of the input variables x,

$z_l = \{z_1, \ldots, z_l\} \subset \{x_1, \ldots, x_n\},$

and let $z_{\backslash l}$ be the complement subset, $z_{\backslash l} \cup z_l = x$. The approximation F(x) in principle depends on variables in both subsets, $F(x) = F(z_l, z_{\backslash l})$. If one conditions on specific values for the variables in $z_{\backslash l}$, then F(x) can be considered as a function only of the variables in the chosen subset $z_l$,

(50) $F_{z_{\backslash l}}(z_l) = F(z_l \mid z_{\backslash l}).$

In general, the functional form of $F_{z_{\backslash l}}(z_l)$ will depend on the particular values chosen for $z_{\backslash l}$. If, however, this dependence is not too strong, then the average function

(51) $\bar F_l(z_l) = E_{z_{\backslash l}}[F(x)] = \int F(z_l, z_{\backslash l})\, p_{\backslash l}(z_{\backslash l})\, dz_{\backslash l}$

can represent a useful summary of the partial dependence of F(x) on the chosen variable subset $z_l$. Here $p_{\backslash l}(z_{\backslash l})$ is the marginal probability density of $z_{\backslash l}$,

(52) $p_{\backslash l}(z_{\backslash l}) = \int p(x)\, dz_l,$

where p(x) is the joint density of all of the inputs x. This complement marginal density (52) can be estimated from the training data, so that (51) becomes

(53) $\bar F_l(z_l) = \frac{1}{N} \sum_{i=1}^{N} F(z_l, z_{i,\backslash l}).$

In the special cases where the dependence of F(x) on $z_l$ is additive,

(54) $F(x) = F_l(z_l) + F_{\backslash l}(z_{\backslash l}),$

or multiplicative,

(55) $F(x) = F_l(z_l) \cdot F_{\backslash l}(z_{\backslash l}),$

the form of $F_{z_{\backslash l}}(z_l)$ (50) does not depend on the values of the variables $z_{\backslash l}$. Then $\bar F_l(z_l)$ (51) provides a complete description of the nature of the variation of F(x) on the chosen input variable subset $z_l$.
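The data average (53) can be computed directly for any fitted model by substituting each candidate value of the target subset into every training row and averaging the predictions. A minimal sketch, with illustrative names; `model` stands for any fitted approximation with a `predict` method:

```python
# Sketch of the empirical partial dependence estimate (53): for each joint
# value of the target subset z_l, overwrite those columns of the training
# matrix (leaving the complement variables at their observed values) and
# average the model's predictions over the N training rows.
import numpy as np


def partial_dependence_brute(model, X_train, target_cols, grid):
    """X_train: (N, n) array of training inputs.
    target_cols: list of column indices defining z_l.
    grid: (G, len(target_cols)) array of joint values of z_l to evaluate.
    Returns the length-G vector of averaged predictions, i.e. (53)."""
    pd_values = np.empty(len(grid))
    for g, z in enumerate(grid):
        X_mod = X_train.copy()
        X_mod[:, target_cols] = z          # condition z_l at this grid point
        pd_values[g] = model.predict(X_mod).mean()
    return pd_values
```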
Then F1(zl) (51) provides a complete description of the nature of the variation of F(x) on the chosen input variable subset zl. An alternative way of summarizing the dependence of F(x) on a subset z1 is to directly model F(x) as a function of z1 on the training data (56) Fl(zl) = EX[F(X) I Zi] = f F(x) p(z\l I zl) dz\l. However, averaging over the conditional density in (56), rather than the marginal density in (51), causes Fl(zl) to reflect not only the dependence of F(x) on the selected variable subset zl, but in addition, apparent depen- dencies induced solely by the associations between them and the complement variables z\1. For example, if the contribution of z1 happens to be additive (54) or multiplicative (55), Fl(zl) (56) would not evaluate to the correspond- ing term or factor Fl(zl), unless the joint density p(x) happened to be the product (57) p(x) = pl(z1). P\l(Z\l). Partial dependence functions (51) can be used to help interpret models pro- duced by any "black box" prediction method, such as neural networks, support vector machines, nearest neighbors, radial basis functions, etc. When there are a large number of predictor variables, it is very useful to have a measure of This content downloaded from 143.89.17.135 on Thu, 29 Apr 2021 05:07:41 UTC All use subject to https://about.jstor.org/terms GREEDY FUNCTION APPROXIMATION 1221 relevance (Section 8.1) to reduce the potentially large number variables and variable combinations to be considered. Also, a pass over the data (53) is required to evaluate each Fl (zl) for each set of joint values z1 of its argu This can be time-consuming for large data sets, although subsampling could help somewhat. For regression trees based on single-variable splits, however, the partial dependence of F(x) on a specified target variable subset z1 (51) is straight- forward to evaluate given only the tree, without reference to the data itself (53). For a specific set of values for the variables zl, a weighted traversal of the tree is performed. At the root of the tree, a weight value of 1 is assigned. For each nonterminal node visited, if its split variable is in the target subset zl, the appropriate left or right daughter node is visited and the weight is not modified. If the node's split variable is a member of the complement subset z\l, then both daughters are visited and the current weight is multiplied by the fraction of training observations that went left or right, respectively, at that node. Each terminal node visited during the traversal is assigned the current value of the weight. When the tree traversal is complete, the value of is the corresponding weighted average of the F(x) values over those termi- nal nodes visited during the tree traversal. For a collection of M regression trees, obtained through boosting, the results for the individual trees are simply averaged. For purposes of interpretation through graphical displays, input variable subsets of low cardinality (I < 2) are most useful. The most informative of such subsets would likely be comprised of the input variables deemed to be among the most influential (44), (45) in contributing to the variation of F(x). Illustrations are provided in Sections 8.3 and 9. The closer the dependence of F(x) on the subset zZ is to being additive (54) or multiplicative (55), the more completely the partial dependence function Fl(zl) (51) captures the nature of the influence of the variables in z1 on t derived approximation F(x). 
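A minimal sketch of the weighted tree traversal just described, again with illustrative names rather than the paper's code: each fitted tree is assumed to be available as a node structure storing the split variable, the split threshold, the fraction of training observations sent to the left daughter, and the terminal-node prediction.

```python
# Sketch of the weighted traversal that evaluates the partial dependence (51)
# from a fitted regression tree alone (no pass over the data, unlike (53)).
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDNode:
    split_var: Optional[int] = None    # None marks a terminal node
    threshold: float = 0.0             # x[split_var] <= threshold goes left
    frac_left: float = 0.5             # fraction of training obs. that went left
    value: float = 0.0                 # prediction stored at a terminal node
    left: Optional["PDNode"] = None
    right: Optional["PDNode"] = None


def tree_partial_dependence(node: PDNode, z: dict, weight: float = 1.0) -> float:
    """Weighted traversal for one tree.  `z` maps the indices of the target
    subset z_l to their chosen values.  Splits on target variables follow one
    branch with unchanged weight; splits on complement variables send the
    weight down both branches in proportion to the training fractions."""
    if node.split_var is None:                      # terminal node
        return weight * node.value
    if node.split_var in z:                         # split on a target variable
        branch = node.left if z[node.split_var] <= node.threshold else node.right
        return tree_partial_dependence(branch, z, weight)
    # split on a complement variable: visit both daughters with reduced weight
    return (tree_partial_dependence(node.left, z, weight * node.frac_left)
            + tree_partial_dependence(node.right, z, weight * (1.0 - node.frac_left)))

# For a boosted collection of M trees, the per-tree results are then combined
# as described in the text.
```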
The closer the dependence of F(x) on the subset $z_l$ is to being additive (54) or multiplicative (55), the more completely the partial dependence function $\bar F_l(z_l)$ (51) captures the nature of the influence of the variables in $z_l$ on the derived approximation F(x). Therefore, subsets $z_l$ that group together those influential inputs that have complex [nonfactorable (55)] interactions between them will provide the most revealing partial dependence plots. As a diagnostic, both $\bar F_l(z_l)$ and $\bar F_{\backslash l}(z_{\backslash l})$ can be separately computed for candidate subsets. The value of the multiple correlation over the training data between F(x) and $\{\bar F_l(z_l), \bar F_{\backslash l}(z_{\backslash l})\}$ and/or $\bar F_l(z_l) \cdot \bar F_{\backslash l}(z_{\backslash l})$ can be used to gauge the degree of additivity and/or factorability of F(x) with respect to a chosen subset $z_l$. As an additional diagnostic, $F_{z_{\backslash l}}(z_l)$ (50) can be computed for a small number of $z_{\backslash l}$-values randomly selected from the training data. The resulting functions of $z_l$ can be compared to $\bar F_l(z_l)$ to judge the variability of the partial dependence of F(x) on $z_l$ with respect to changing values of $z_{\backslash l}$.

In K-class logistic regression and classification (Section 4.6) there are K (logistic) regression functions $\{F_k(x)\}_{k=1}^{K}$. Each is logarithmically related to $p_k(x) = \Pr(y = k \mid x)$ through (29). Larger values of $F_k(x)$ imply high probability of observing class k at x. Partial dependence plots of each $F_k(x)$ on the variable subsets $z_l$ most relevant to that class (49) provide information on how the input variables influence the respective individual class probabilities.

8.3. Randomly generated function. In this section the interpretational tools described in the preceding two sections are applied to the first (of the 100) randomly generated functions (Section 6.1) used for the Monte Carlo studies of Section 6.

Figure 7 shows the estimated relative importance (44), (45) of the 10 input predictor variables. Some are seen to be more influential than others, but no small subset appears to dominate. This is consistent with the mechanism used to generate these functions.

FIG. 7. Relative importance of the input predictor variables for the first randomly generated function used in the Monte Carlo studies. [Bar plot; input variables shown in the order 7, 2, 8, 9, 3, 5, 4, 1, 10, 6.]

Figure 8 displays single-variable (l = 1) partial dependence plots (53) on the six most influential variables. The hash marks at the base of each plot represent the deciles of the corresponding predictor variable distribution. The piecewise constant nature of the approximation is evident. Unlike most approximation methods, there is no explicit smoothness constraint imposed upon TreeBoost models. Arbitrarily sharp discontinuities can be accommodated. The generally smooth trends exhibited in these plots suggest that a smooth approximation best describes this target. This is again consistent with the way these functions were generated.

FIG. 8. Single-variable partial dependence plots for the six most influential predictor variables for the first randomly generated function used in the simulation studies.

Figure 9 displays two-variable (l = 2) partial dependence plots on some of the more influential variables. Interaction effects of varying degrees are indicated among these variable pairs.
This is in accordance with the way in which these target functions were actually generated (39), (40).

Given the general complexity of these generated targets as a function of their arguments, it is unlikely that one would ever be able to uncover their complete detailed functional form through a series of such partial dependence plots. The goal is to obtain an understandable description of some of the important aspects of the functional relationship. In this example the target function was generated from a known prescription, so that at least qualitatively we can verify that this is the case here.

FIG. 9. Two-variable partial dependence plots on a few of the important predictor variables for the first randomly generated function used in the simulation studies.

9. Real data. In this section the TreeBoost regression algorithms are illustrated on two moderate-sized data sets. The results in Section 6.4 suggest that the properties of the classification algorithm LK-TreeBoost are very similar to those of LogitBoost, which was extensively applied to data in FHT00. The first (scientific) data set consists of chemical concentration measurements on rock samples, and the second (demographic) is sample survey questionnaire data. Both data sets were partitioned into a learning sample consisting of two-thirds of the data, with the remaining data being used as a test sample for choosing the model size (number of iterations M). The shrinkage parameter (36) was set to v = 0.1.

9.1. Garnet data. This data set consists of a sample of N = 13317 garnets collected from around the world [Griffin, Fisher, Friedman, Ryan and O'Reilly (1997)]. A garnet is a complex Ca-Mg-Fe-Cr silicate that commonly occurs as a minor phase in rocks making up the earth's mantle. The variables associated with each garnet are the concentrations of various chemicals and the tectonic plate setting where the rock was collected: (TiO2, Cr2O3, FeO, MnO, MgO, CaO, Zn, Ga, Sr, Y, Zr, tec). The first eleven variables representing concentrations are real-valued. The last variable (tec) takes on three categorical values: "ancient stable shields," "Proterozoic shield areas," and "young orogenic belts." There are no missing values in these data, but the distributions of many of the variables tend to be highly skewed toward larger values, with many outliers. The purpose of this exercise is to estimate the concentration of titanium (TiO2) as a function of the joint concentrations of the other chemicals and the tectonic plate index.

TABLE 3
Average absolute error of LS-TreeBoost, LAD-TreeBoost, and M-TreeBoost on the garnet data for varying numbers of terminal nodes in the individual trees

Terminal nodes   LS     LAD    M
 2               0.58   0.57   0.57
 3               0.48   0.47   0.46
 4               0.49   0.45   0.45
 6               0.48   0.44   0.43
11               0.47   0.44   0.43
21               0.46   0.43   0.43

Table 3 shows the average absolute error in predicting the output y-variable, relative to the optimal constant prediction,

(58) $A(y, F(x)) = \frac{E_y\,|y - F(x)|}{E_y\,|y - \mathrm{median}(y)|},$

based on the test sample, for LS-TreeBoost, LAD-TreeBoost, and M-TreeBoost for several values of the size (number of terminal nodes) J of the constituent trees.
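As a small illustration (not from the paper), the measure (58) can be estimated on a held-out test sample in a few lines; the function name is illustrative.

```python
# Sketch of the relative absolute-error measure (58): mean absolute prediction
# error divided by the error of the best constant predictor, the median of y,
# both estimated here on the test sample as in Table 3.
import numpy as np


def relative_absolute_error(y_test, y_pred):
    """A(y, F) from (58), estimated on a test sample."""
    num = np.mean(np.abs(y_test - y_pred))
    den = np.mean(np.abs(y_test - np.median(y_test)))
    return num / den
```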
Note that this prediction error measure (58) includes the additive irreducible error associated with the (unknown) underlying target function F*(x) (1). This irreducible error adds the same amount to all entries in Table 3. Thus, differences in those entries reflect a proportionally greater improvement in approximation error (37) on the target function itself.

For all three methods the additive (J = 2) approximation is distinctly inferior to that using larger trees, indicating the presence of interaction effects (Section 7) among the input variables. Six terminal node trees are seen to be adequate, and using only three terminal node trees is seen to provide accuracy within 10% of the best. The errors of LAD-TreeBoost and M-TreeBoost are smaller than those of LS-TreeBoost and similar to each other, with perhaps M-TreeBoost having a slight edge. These results are consistent with those obtained in the simulation studies as shown in Figures 2 and 6.

Figure 10 shows the relative importance (44), (45) of the 11 input variables in predicting TiO2 concentration based on the M-TreeBoost approximation using six terminal node trees. Results are very similar for the other models in Table 3 with similar errors. Ga and Zr are seen to be the most influential, with MnO being somewhat less important. The top three panels of Figure 11 show the partial dependence (51) of the approximation F(x) on these three most influential variables. The bottom three panels show the partial dependence of F(x) on the three pairings of these variables. A strong interaction effect between Ga and Zr is clearly evident. F(x) has very little dependence on either variable when the other takes on its smallest values. As the value of one of them is increased, the dependence of F(x) on the other is correspondingly amplified. A somewhat smaller interaction effect is seen between MnO and Zr.

FIG. 10. Relative influence of the eleven input variables on the target variation for the garnet data. Ga and Zr are much more influential than the others. [Bar plot; variables shown in the order Ga, Zr, MnO, Y, Zn, MgO, CaO, FeO, Cr2O3, tec, Sr.]

FIG. 11. Partial dependence plots for the three most influential input variables in the garnet data. Note the different vertical scales for each plot. There is a strong interaction effect between Zr and Ga, and a somewhat weaker one between Zr and MnO.

9.2. Demographic data. This data set consists of N = 9409 questionnaires filled out by shopping mall customers in the San Francisco Bay Area [Impact Resources, Inc., Columbus, Ohio (1987)]. Here we use answers to the first 14 questions, relating to demographics, for illustration. These questions are listed in Table 4.

TABLE 4
Variables for the demographic data

Variable   Demographic                  Number of values   Type
 1         sex                          2                  cat
 2         marital status               5                  cat
 3         age                          7                  real
 4         education                    6                  real
 5         occupation                   9                  cat
 6         income                       9                  real
 7         years in Bay Area            5                  real
 8         dual incomes                 2                  cat
 9         number in household          9                  real
10         number in household < 18     9                  real
11         householder status           3                  cat
12         type of home                 5                  cat
13         ethnic classification        8                  cat
14         language in home             3                  cat
The data are seen to consist of a mixture of real and categorical variables, each with a small number of distinct values. There are many missing values. We illustrate TreeBoost on these data by modeling income as a function of the other 13 variables.

Table 5 shows the average absolute error in predicting income, relative to the best constant predictor (58), for the three regression TreeBoost algorithms. There is little difference in performance among the three methods. Owing to the highly discrete nature of these data, there are no outliers or long-tailed distributions among the real-valued inputs or the output y. There is also very little reduction in error as the constituent tree size J is increased, indicating a lack of interactions among the input variables; an approximation additive in the individual input variables (J = 2) seems to be adequate.

TABLE 5
Average absolute error of LS-TreeBoost, LAD-TreeBoost, and M-TreeBoost on the demographic data for varying numbers of terminal nodes in the individual trees

Terminal nodes   LS     LAD    M
 2               0.60   0.63   0.61
 3               0.60   0.62   0.59
 4               0.59   0.59   0.59
 6               0.59   0.58   0.59
11               0.59   0.57   0.58
21               0.59   0.58   0.58

Figure 12 shows the relative importance of the input variables in predicting income, based on the (J = 2) LS-TreeBoost approximation. There is no small subset of them that dominates. Figure 13 shows partial dependence plots on the six most influential variables. Those for the categorical variables are represented as bar plots, and all plots are centered to have zero mean over the data. Since the approximation consists of main effects only [first sum in (42)], these plots completely describe the corresponding contributions $f_j(x_j)$ of each of these inputs.

There do not appear to be any surprising results in Figure 13. The dependencies for the most part confirm prior suspicions and suggest that the approximation is intuitively reasonable.

FIG. 12. Relative influence of the 13 input variables on the target variation for the demographic data. No small group of variables dominates. [Bar plot; variables shown in the order occ, hsld, mar, age, edu, hme, eth, lan, dinc, num, sex, <18, yBA.]

FIG. 13. Partial dependence plots for the six most influential input variables in the demographic data. Note the different vertical scales for each plot. The abscissa values for age and education are codes representing consecutive equal intervals. The dependence of income on age is nonmonotonic, reaching a maximum at the value 5, representing the interval 45-54 years old. [Panels: occupation, household status, marital status, age, education, type of home.]

10. Data mining. As "off the shelf" tools for predictive data mining, the TreeBoost procedures have some attractive properties. They inherit the favorable characteristics of trees while mitigating many of the unfavorable ones. Among the most favorable is robustness. All TreeBoost procedures are invariant under all (strictly) monotone transformations of the individual input variables. For example, using $x_j$, $\log x_j$, $e^{x_j}$, or $x_j^a$ as the jth input variable yields the same result. Thus, the need for considering input variable transformations is eliminated. As a consequence of this invariance, sensitivity to long-tailed
distributions and outliers is also eliminated. In addition, LAD-TreeBoost is completely robust against outliers in the output variable y as well. M-TreeBoost also enjoys a fair measure of robustness against output outliers.
Another advantage of decision tree induction is internal feature selection. Trees tend to be quite robust against the addition of irrelevant input variables. Also, tree-based models handle missing values in a unified and elegant manner
[Breiman, Friedman, Olshen and Stone (1983)]. There is no need to consider external imputation schemes. TreeBoost clearly inherits these properties as well.
The principal disadvantage of single tree models is inaccuracy. This is a consequence of the coarse nature of their piecewise constant approximations, especially for smaller trees, and instability, especially for larger trees, and the fact that they involve predominantly high-order interactions. All of these are mitigated by boosting. TreeBoost procedures produce piecewise constant
approximations, but the granularity is much finer. TreeBoost enhances stability by using small trees and by the effect of averaging over many of them. The interaction level of TreeBoost approximations is effectively controlled by
limiting the size of the individual constituent trees.
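As an illustration (not from the paper), modern gradient-boosting libraries expose exactly these controls. In scikit-learn's GradientBoostingRegressor, used here purely for demonstration on a standard synthetic benchmark, max_leaf_nodes plays the role of the tree size J (J = 2 gives an additive model, larger J admits higher-order interactions), learning_rate is the shrinkage parameter v, and n_estimators is the number of iterations M.

```python
# Sweep the tree size J to see its effect on test accuracy, in the spirit of
# Figure 6 and Tables 3 and 5.  The data set and library are illustrative.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for J in (2, 3, 6, 11, 21):
    model = GradientBoostingRegressor(max_leaf_nodes=J, learning_rate=0.1,
                                      n_estimators=500, random_state=0)
    model.fit(X_train, y_train)
    print(f"J = {J:2d}  test R^2 = {model.score(X_test, y_test):.3f}")
```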
Among the purported biggest advantages of single tree models is interpretability, whereas boosted trees are thought to lack this feature. Small trees
can be easily interpreted, but due to instability such interpretations should be treated with caution. The interpretability of larger trees is questionable [Ripley (1996)]. TreeBoost approximations can be interpreted using partial dependence plots in conjunction with the input variable relative importance measure, as illustrated in Sections 8.3 and 9. While not providing a complete
description, they at least offer some insight into the nature of the input-output relationship. Although these tools can be used with any approximation method, the special characteristics of tree-based models allow their rapid calculation. Partial dependence plots can also be used with single regression trees, but as noted above, more caution is required owing to greater instability.
After sorting of the input variables, the computation of the regression TreeBoost procedures (LS, LAD, and M-TreeBoost) scales linearly with the number of observations N, the number of input variables n and the number of iterations M. It scales roughly as the logarithm of the size of the constituent trees J. In addition, the classification algorithm LK-TreeBoost scales linearly with the number of classes K; but it scales highly sublinearly with the number of iterations M, if influence trimming (Section 4.5.1) is employed. As a point of reference, applying M-TreeBoost to the garnet data of Section 9.1 (N = 13317, n = 11, J = 6, M = 500) required 20 seconds on a 933 MHz Pentium III computer.
As seen in Section 5, many boosting iterations (M ≈ 500) can be required to obtain optimal TreeBoost approximations, based on small values of the shrinkage parameter v (36). This is somewhat mitigated by the very small size of the trees induced at each iteration. However, as illustrated in Figure 1, improvement tends to be very rapid initially and then levels off to slower increments. Thus, nearly optimal approximations can be achieved quite early (M ≈ 100) with correspondingly much less computation. These near-optimal approximations can be used for initial exploration and to provide an indication of whether the final approximation will be of sufficient accuracy to warrant continuation. If lack of fit improves very little in the first few iterations (say 100), it is unlikely that there will be dramatic improvement later on. If continuation is judged to be warranted, the procedure can be restarted where it left off previously, so that no computational investment is lost. Also, one can use larger values of the shrinkage parameter to speed initial improvement for this purpose. As seen in Figure 1, using v = 0.25 provided accuracy within 10% of the optimal (v = 0.1) solution after only 20 iterations. In this case, however, boosting would have to be restarted from the beginning if a smaller shrinkage
parameter value were to be subsequently employed.
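A sketch of this restart-where-you-left-off workflow, again using scikit-learn only as an illustrative stand-in: with warm_start enabled, increasing n_estimators and refitting adds trees to the existing ensemble rather than rebuilding it, provided the shrinkage parameter is left unchanged.

```python
# Illustration (not the paper's implementation): quick exploratory fit, then
# continuation from the current solution if it looks worthwhile.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Exploratory fit with roughly 100 iterations.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  max_leaf_nodes=6, warm_start=True,
                                  random_state=0)
model.fit(X_train, y_train)
print("after 100 trees:", model.score(X_test, y_test))

# Continuation: only the additional 400 trees are fitted.
model.set_params(n_estimators=500)
model.fit(X_train, y_train)
print("after 500 trees:", model.score(X_test, y_test))
```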
The ability of TreeBoost procedures to give a quick indication of potential
predictability, coupled with their extreme robustness, makes them a useful preprocessing tool that can be applied to imperfect data. If sufficient predictability is indicated, further data cleaning can be invested to render it suitable for more sophisticated, less robust, modeling procedures.
If more data become available after modeling is complete, boosting can be continued on the new data starting from the previous solution. This usually improves accuracy provided an independent test data set is used to monitor
improvement to prevent overfitting on the new data. Although the accuracy increase is generally less than would be obtained by redoing the entire analysis on the combined data, considerable computation is saved.
Boosting on successive subsets of data can also be used when there is insufficient random access main memory to store the entire data set. Boosting can be applied to "arcbites" of data [Breiman (1997)] sequentially read into main memory, each time starting at the current solution, recycling over previous subsets as time permits. Again, it is crucial to use an independent test set to stop training on each individual subset at that point where the estimated accuracy of the combined approximation starts to diminish.
Acknowledgments. Helpful discussions with Trevor Hastie, Bogdan Popescu and Robert Tibshirani are gratefully acknowledged.
REFERENCES
BECKER, R. A. and CLEVELAND, W. S. (1996). The design and control of Trellis display. J. Comput. Statist. Graphics 5 123-155.
BREIMAN, L. (1997). Pasting bites together for prediction in large data sets and on-line. Technical report, Dept. Statistics, Univ. California, Berkeley.
BREIMAN, L. (1999). Prediction games and arcing algorithms. Neural Comput. 11 1493-1517.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. and STONE, C. (1983). Classification and Regression Trees. Wadsworth, Belmont, CA.
COPAS, J. B. (1983). Regression, prediction, and shrinkage (with discussion). J. Roy. Statist. Soc. Ser. B 45 311-354.
DONOHO, D. L. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. In Different Perspectives on Wavelets. Proceedings of Symposium in Applied Mathematics (I. Daubechies, ed.) 47 173-205. Amer. Math. Soc., Providence, RI.
DRUCKER, H. (1997). Improving regressors using boosting techniques. In Proceedings of the Fourteenth International Conference on Machine Learning (D. Fisher, Jr., ed.) 107-115. Morgan Kaufmann, San Francisco.
DUFFY, N. and HELMBOLD, D. (1999). A geometric approach to leveraging weak learners. In Computational Learning Theory. Proceedings of the 4th European Conference EuroCOLT99 (P. Fischer and H. U. Simon, eds.) 18-33. Springer, New York.
FREUND, Y. and SCHAPIRE, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1-141.
FRIEDMAN, J. H., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion). Ann. Statist. 28 337-407.
GRIFFIN, W. L., FISHER, N. I., FRIEDMAN, J. H., RYAN, C. G. and O'REILLY, S. (1999). Cr-Pyrope garnets in lithospheric mantle. J. Petrology 40 679-704.
HASTIE, T. and TIBSHIRANI, R. (1990). Generalized Additive Models. Chapman and Hall, London.
HUBER, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.
MALLAT, S. and ZHANG, Z. (1993). Matching pursuits with time frequency dictionaries. IEEE Trans. Signal Processing 41 3397-3415.
POWELL, M. J. D. (1987). Radial basis functions for multivariate interpolation: a review. In Algorithms for Approximation (J. C. Mason and M. G. Cox, eds.) 143-167. Clarendon Press, Oxford.
RATSCH, G., ONODA, T. and MULLER, K. R. (1998). Soft margins for AdaBoost. NeuroCOLT Technical Report NC-TR-98-021.
RIPLEY, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press.
RUMELHART, D. E., HINTON, G. E. and WILLIAMS, R. J. (1986). Learning representations by back-propagating errors. Nature 323 533-536.
SCHAPIRE, R. and SINGER, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM, New York.
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
WARNER, J. R., TORONTO, A. E., VEASEY, L. R. and STEPHENSON, R. (1961). A mathematical model for medical diagnosis-application to congenital heart disease. J. Amer. Med. Assoc. 177 177-184.
DEPARTMENT OF STATISTICS SEQUOIA HALL
STANFORD UNIVERSITY STANFORD, CALIFORNIA 94305 E-MAIL: jhf@stat.stanford.edu