
10
Boosting and Additive Trees
10.1 Boosting Methods
Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification problems, but as will be seen in this chapter, it can profitably be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” From this perspective boosting bears a resemblance to bagging and other committee-based approaches (Section 8.8). However we shall see that the connection is at best superficial and that boosting is fundamentally different.
We begin by describing the most popular boosting algorithm due to Freund and Schapire (1997) called “AdaBoost.M1.” Consider a two-class problem, with the output variable coded as Y ∈ {−1, 1}. Given a vector of predictor variables X, a classifier G(X) produces a prediction taking one of the two values {−1, 1}. The error rate on the training sample is
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} I(y_i \neq G(x_i)),$$

and the expected error rate on future predictions is $E_{XY}\, I(Y \neq G(X))$.

A weak classifier is one whose error rate is only slightly better than random guessing. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers Gm(x), m = 1, 2, . . . , M.
FIGURE 10.1. Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction.
The predictions from all of them are then combined through a weighted majority vote to produce the final prediction:
$$G(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right). \tag{10.1}$$
Here α1, α2, . . . , αM are computed by the boosting algorithm, and weight the contribution of each respective Gm(x). Their effect is to give higher influence to the more accurate classifiers in the sequence. Figure 10.1 shows a schematic of the AdaBoost procedure.
The data modifications at each boosting step consist of applying weights w1, w2, . . . , wN to each of the training observations (xi, yi), i = 1, 2, . . . , N. Initially all of the weights are set to wi = 1/N, so that the first step simply trains the classifier on the data in the usual manner. For each successive iteration m = 2, 3, . . . , M the observation weights are individually modified and the classification algorithm is reapplied to the weighted observations. At step m, those observations that were misclassified by the classifier Gm−1(x) induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.

Algorithm 10.1 AdaBoost.M1.
1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. For $m = 1$ to $M$:
   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
   (b) Compute
       $$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}.$$
   (c) Compute $\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$.
   (d) Set $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \ldots, N$.
3. Output $G(x) = \operatorname{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.
Algorithm 10.1 shows the details of the AdaBoost.M1 algorithm. The current classifier Gm(x) is induced on the weighted observations at line 2a. The resulting weighted error rate is computed at line 2b. Line 2c calculates the weight αm given to Gm(x) in producing the final classifier G(x) (line 3). The individual weights of each of the observations are updated for the next iteration at line 2d. Observations misclassified by Gm(x) have their weights scaled by a factor exp(αm), increasing their relative influence for inducing the next classifier Gm+1(x) in the sequence.
The AdaBoost.M1 algorithm is known as “Discrete AdaBoost” in Friedman et al. (2000), because the base classifier Gm(x) returns a discrete class label. If the base classifier instead returns a real-valued prediction (e.g., a probability mapped to the interval [−1, 1]), AdaBoost can be modified appropriately (see “Real AdaBoost” in Friedman et al. (2000)).
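The boxed algorithm translates almost line for line into code. Below is a minimal sketch of AdaBoost.M1 with decision stumps as the weak learner, using NumPy and scikit-learn; the function names and the specific settings (M = 400 rounds, depth-1 trees) are our own illustrative assumptions, not part of the original algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=400):
    """AdaBoost.M1 (Algorithm 10.1); y must be coded as -1/+1."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                      # 1. uniform observation weights
    classifiers, alphas = [], []
    for m in range(M):
        # 2(a) fit the weak classifier to the weighted training data
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (G.predict(X) != y).astype(float)
        # 2(b) weighted error rate
        err = np.sum(w * miss) / np.sum(w)
        # 2(c) classifier weight alpha_m = log((1 - err_m)/err_m)
        alpha = np.log((1.0 - err) / max(err, 1e-16))
        # 2(d) up-weight the observations misclassified by G_m
        w = w * np.exp(alpha * miss)
        classifiers.append(G)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """3. Weighted majority vote: G(x) = sign(sum_m alpha_m G_m(x))."""
    f = sum(a * G.predict(X) for a, G in zip(alphas, classifiers))
    return np.sign(f)
```

The small constant guarding against a zero error rate is purely a numerical convenience; everything else mirrors steps 1 through 3 of the box.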
The power of AdaBoost to dramatically increase the performance of even a very weak classifier is illustrated in Figure 10.2. The features X1, . . . , X10 are standard independent Gaussian, and the deterministic target Y is defined by

$$Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5), \\ -1 & \text{otherwise.} \end{cases} \tag{10.2}$$
Here $\chi^2_{10}(0.5) = 9.34$ is the median of a chi-squared random variable with 10 degrees of freedom (sum of squares of 10 standard Gaussians). There are 2000 training cases, with approximately 1000 cases in each class, and 10,000 test observations. Here the weak classifier is just a “stump”: a two-terminal node classification tree. Applying this classifier alone to the training data set yields a very poor test set error rate of 45.8%, compared to 50% for random guessing. However, as boosting iterations proceed the error rate steadily decreases, reaching 5.8% after 400 iterations. Thus, boosting this simple very weak classifier reduces its prediction error rate by almost a factor of four. It also outperforms a single large classification tree (error rate 24.7%). Since its introduction, much has been written to explain the success of AdaBoost in producing accurate classifiers. Most of this work has centered on using classification trees as the “base learner” G(x), where improvements are often most dramatic. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the “best off-the-shelf classifier in the world” (see also Breiman (1998)). This is especially the case for data-mining applications, as discussed more fully in Section 10.7 later in this chapter.

FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rate for a single stump, and a 244-node classification tree.
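For readers who want to reproduce this experiment, the simulated data (10.2) are easy to generate. The sketch below is our own helper (the function name, seeds, and use of SciPy are illustrative assumptions); its output can be fed directly to the AdaBoost sketch given earlier.

```python
import numpy as np
from scipy.stats import chi2

def simulate_nested_spheres(n, p=10, seed=0):
    """Model (10.2): X ~ N(0, I_p); Y = +1 if sum_j X_j^2 exceeds the
    chi-squared median (9.34 for p = 10), and Y = -1 otherwise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = np.where((X ** 2).sum(axis=1) > chi2.median(df=p), 1, -1)
    return X, y

X_train, y_train = simulate_nested_spheres(2000)
X_test, y_test = simulate_nested_spheres(10000, seed=1)
```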
10.1.1 Outline of This Chapter
Here is an outline of the developments in this chapter:
• We show that AdaBoost fits an additive model in a base learner, optimizing a novel exponential loss function. This loss function is very similar to the (negative) binomial log-likelihood (Sections 10.2–10.4).
• The population minimizer of the exponential loss function is shown to be the log-odds of the class probabilities (Section 10.5).
• We describe loss functions for regression and classification that are more robust than squared error or exponential loss (Section 10.6).
• It is argued that decision trees are an ideal base learner for data mining applications of boosting (Sections 10.7 and 10.9).
• We develop a class of gradient boosted models (GBMs), for boosting trees with any loss function (Section 10.10).
• The importance of “slow learning” is emphasized, and implemented by shrinkage of each new term that enters the model (Section 10.12), as well as randomization (Section 10.12.2).
• Tools for interpretation of the fitted model are described (Section 10.13).
10.2 Boosting Fits an Additive Model
The success of boosting is really not very mysterious. The key lies in expression (10.1). Boosting is a way of fitting an additive expansion in a set of elementary “basis” functions. Here the basis functions are the individual classifiers Gm(x) ∈ {−1, 1}. More generally, basis function expansions take the form
$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m), \tag{10.3}$$
where βm,m = 1,2,…,M are the expansion coefficients, and b(x;γ) ∈ IR
are usually simple functions of the multivariate argument x, characterized by a set of parameters γ. We discuss basis expansions in some detail in Chapter 5.
Additive expansions like this are at the heart of many of the learning techniques covered in this book:
• In single-hidden-layer neural networks (Chapter 11), b(x; γ) = σ(γ0 + γ1T x), where σ(t) = 1/(1+e−t) is the sigmoid function, and γ param- eterizes a linear combination of the input variables.
• In signal processing, wavelets (Section 5.9.1) are a popular choice with γ parameterizing the location and scale shifts of a “mother” wavelet.
• Multivariate adaptive regression splines (Section 9.4) uses truncated- power spline basis functions where γ parameterizes the variables and values for the knots.

• For trees, γ parameterizes the split variables and split points at the internal nodes, and the predictions at the terminal nodes.

Typically these models are fit by minimizing a loss function averaged over the training data, such as the squared-error or a likelihood-based loss function,

$$\min_{\{\beta_m,\gamma_m\}_1^M} \; \sum_{i=1}^{N} L\left(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\right). \tag{10.4}$$

For many loss functions L(y, f(x)) and/or basis functions b(x; γ), this requires computationally intensive numerical optimization techniques. However, a simple alternative often can be found when it is feasible to rapidly solve the subproblem of fitting just a single basis function,

$$\min_{\beta,\gamma} \; \sum_{i=1}^{N} L(y_i, \beta b(x_i; \gamma)). \tag{10.5}$$
10.3 Forward Stagewise Additive Modeling

Forward stagewise modeling approximates the solution to (10.4) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added. This is outlined in Algorithm 10.2. At each iteration m, one solves for the optimal basis function b(x; γm) and corresponding coefficient βm to add to the current expansion fm−1(x). This produces fm(x), and the process is repeated. Previously added terms are not modified.

Algorithm 10.2 Forward Stagewise Additive Modeling.
1. Initialize $f_0(x) = 0$.
2. For $m = 1$ to $M$:
   (a) Compute
       $$(\beta_m, \gamma_m) = \arg\min_{\beta,\gamma} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\big).$$
   (b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.

For squared-error loss

$$L(y, f(x)) = (y - f(x))^2, \tag{10.6}$$

one has

$$L\big(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\big) = \big(y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)\big)^2 = \big(r_{im} - \beta b(x_i; \gamma)\big)^2, \tag{10.7}$$

where rim = yi − fm−1(xi) is simply the residual of the current model on the ith observation. Thus, for squared-error loss, the term βmb(x; γm) that best fits the current residuals is added to the expansion at each step. This idea is the basis for “least squares” regression boosting discussed in Section 10.10.2. However, as we show near the end of the next section, squared-error loss is generally not a good choice for classification; hence the need to consider other loss criteria.
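To make the stagewise recipe concrete, here is a minimal sketch of Algorithm 10.2 specialized to squared-error loss: each new basis function is a regression stump fit to the current residuals. The helper names and the choice of stumps as basis functions are our own illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_ls(X, y, M=200):
    """Algorithm 10.2 under squared-error loss (10.6): each new basis
    function is a regression stump fit to the current residuals."""
    f = np.zeros(len(y))                 # f_0(x) = 0
    basis = []
    for m in range(M):
        r = y - f                        # residuals r_im = y_i - f_{m-1}(x_i)
        b = DecisionTreeRegressor(max_depth=1).fit(X, r)
        basis.append(b)
        f = f + b.predict(X)             # beta_m is absorbed into the stump's leaf values
    return basis

def predict_stagewise(basis, X):
    return sum(b.predict(X) for b in basis)
```

Because the stump is fit to the residuals by least squares, its leaf values already contain the optimal coefficient, so no separate βm is needed in this special case.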
10.4 Exponential Loss and AdaBoost
We now show that AdaBoost.M1 (Algorithm 10.1) is equivalent to forward stagewise additive modeling (Algorithm 10.2) using the loss function
$$L(y, f(x)) = \exp(-y\, f(x)). \tag{10.8}$$
The appropriateness of this criterion is addressed in the next section.
For AdaBoost the basis functions are the individual classifiers Gm(x) ∈ {−1, 1}. Using the exponential loss function, one must solve

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\big[-y_i\big(f_{m-1}(x_i) + \beta G(x_i)\big)\big]$$

for the classifier Gm and corresponding coefficient βm to be added at each step. This can be expressed as

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-\beta\, y_i G(x_i)\big) \tag{10.9}$$

with $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$. Since each $w_i^{(m)}$ depends neither on β nor G(x), it can be regarded as a weight that is applied to each observation. This weight depends on fm−1(xi), and so the individual weight values change with each iteration m.

The solution to (10.9) can be obtained in two steps. First, for any value of β > 0, the solution to (10.9) for Gm(x) is

$$G_m = \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq G(x_i)), \tag{10.10}$$

which is the classifier that minimizes the weighted error rate in predicting y. This can be easily seen by expressing the criterion in (10.9) as

$$e^{-\beta} \cdot \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \cdot \sum_{y_i \neq G(x_i)} w_i^{(m)},$$

which in turn can be written as

$$\left(e^{\beta} - e^{-\beta}\right) \cdot \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \cdot \sum_{i=1}^{N} w_i^{(m)}. \tag{10.11}$$

Plugging this Gm into (10.9) and solving for β one obtains

$$\beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \tag{10.12}$$

where errm is the minimized weighted error rate

$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i^{(m)}}. \tag{10.13}$$

The approximation is then updated

$$f_m(x) = f_{m-1}(x) + \beta_m G_m(x),$$

which causes the weights for the next iteration to be

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)}. \tag{10.14}$$

Using the fact that $-y_i G_m(x_i) = 2 \cdot I(y_i \neq G_m(x_i)) - 1$, (10.14) becomes

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m I(y_i \neq G_m(x_i))} \cdot e^{-\beta_m}, \tag{10.15}$$

where αm = 2βm is the quantity defined at line 2c of AdaBoost.M1 (Algorithm 10.1). The factor $e^{-\beta_m}$ in (10.15) multiplies all weights by the same value, so it has no effect. Thus (10.15) is equivalent to line 2(d) of Algorithm 10.1.
One can view line 2(a) of the AdaBoost.M1 algorithm as a method for approximately solving the minimization in (10.11) and hence (10.10). Hence we conclude that AdaBoost.M1 minimizes the exponential loss criterion (10.8) via a forward-stagewise additive modeling approach.

Figure 10.3 shows the training-set misclassification error rate and average exponential loss for the simulated data problem (10.2) of Figure 10.2. The training-set misclassification error decreases to zero at around 250 iterations (and remains there), but the exponential loss keeps decreasing. Notice also in Figure 10.2 that the test-set misclassification error continues to improve after iteration 250. Clearly AdaBoost is not optimizing training-set misclassification error; the exponential loss is more sensitive to changes in the estimated class probabilities.

FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss, $(1/N)\sum_{i=1}^{N} \exp(-y_i f(x_i))$. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.
10.5 Why Exponential Loss?
The AdaBoost.M1 algorithm was originally motivated from a very different perspective than presented in the previous section. Its equivalence to forward stagewise additive modeling based on exponential loss was only discovered five years after its inception. By studying the properties of the exponential loss criterion, one can gain insight into the procedure and discover ways it might be improved.

The principal attraction of exponential loss in the context of additive modeling is computational; it leads to the simple modular reweighting AdaBoost algorithm. However, it is of interest to inquire about its statistical properties. What does it estimate and how well is it being estimated? The first question is answered by seeking its population minimizer.
It is easy to show (Friedman et al., 2000) that
$$f^*(x) = \arg\min_{f(x)} E_{Y|x}\left(e^{-Y f(x)}\right) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}, \tag{10.16}$$

or equivalently

$$\Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2f^*(x)}}.$$
Thus, the additive expansion produced by AdaBoost is estimating one-half the log-odds of P(Y = 1|x). This justifies using its sign as the classification rule in (10.1).
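This relationship gives a simple way to turn a boosted score into a class-probability estimate by inverting the half log-odds. A minimal sketch, assuming `f` holds the additive expansion $\sum_m \alpha_m G_m(x)$ produced by the earlier AdaBoost sketch:

```python
import numpy as np

def prob_from_score(f):
    """Invert the half log-odds (10.16): Pr(Y = 1 | x) = 1 / (1 + exp(-2 f(x)))."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(f)))
```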
Another loss criterion with the same population minimizer is the binomial negative log-likelihood or deviance (also known as cross-entropy), interpreting f as the logit transform. Let

$$p(x) = \Pr(Y = 1 \mid x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2f(x)}} \tag{10.17}$$

and define Y′ = (Y + 1)/2 ∈ {0, 1}. Then the binomial log-likelihood loss function is

$$l(Y, p(x)) = Y' \log p(x) + (1 - Y') \log(1 - p(x)),$$

or equivalently the deviance is

$$-l(Y, f(x)) = \log\left(1 + e^{-2Y f(x)}\right). \tag{10.18}$$
Since the population maximizer of log-likelihood is at the true probabilities p(x) = Pr(Y = 1 | x), we see from (10.17) that the population minimizers of the deviance EY |x[−l(Y,f(x))] and EY |x[e−Y f(x)] are the same. Thus, using either criterion leads to the same solution at the population level. Note that e−Y f itself is not a proper log-likelihood, since it is not the logarithm of any probability mass function for a binary random variable Y ∈ {−1, 1}.
10.6 Loss Functions and Robustness
In this section we examine the different loss functions for classification and regression more closely, and characterize them in terms of their robustness to extreme data.
Robust Loss Functions for Classification
Although both the exponential (10.8) and binomial deviance (10.18) yield the same solution when applied to the population joint distribution, the same is not true for finite data sets. Both criteria are monotone decreasing functions of the “margin” yf(x). In classification (with a −1/1 response) the margin plays a role analogous to the residuals y−f(x) in regression. The classification rule G(x) = sign[f(x)] implies that observations with positive margin yif(xi) > 0 are classified correctly whereas those with negative margin yif(xi) < 0 are misclassified. The decision boundary is defined by 10.6 Loss Functions and Robustness 347 Misclassification Exponential Binomial Deviance Squared Error Support Vector −2 −1 0 1 2 y·f FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: I(sign(f) ̸= y); exponential: exp(−yf); binomial deviance: log(1 + exp(−2yf)); squared error: (y − f)2; and support vector: (1 − yf)+ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1). f(x) = 0. The goal of the classification algorithm is to produce positive margins as frequently as possible. Any loss criterion used for classification should penalize negative margins more heavily than positive ones since positive margin observations are already correctly classified. Figure 10.4 shows both the exponential (10.8) and binomial deviance criteria as a function of the margin y · f (x). Also shown is misclassification loss L(y,f(x)) = I(y·f(x) < 0), which gives unit penalty for negative mar- gin values, and no penalty at all for positive ones. Both the exponential and deviance loss can be viewed as monotone continuous approximations to misclassification loss. They continuously penalize increasingly negative margin values more heavily than they reward increasingly positive ones. The difference between them is in degree. The penalty associated with bi- nomial deviance increases linearly for large increasingly negative margin, whereas the exponential criterion increases the influence of such observa- tions exponentially. At any point in the training process the exponential criterion concen- trates much more influence on observations with large negative margins. Binomial deviance concentrates relatively less influence on such observa- Loss 0.0 0.5 1.0 1.5 2.0 2.5 3.0 348 10. Boosting and Additive Trees tions, more evenly spreading the influence among all of the data. It is therefore far more robust in noisy settings where the Bayes error rate is not close to zero, and especially in situations where there is misspecification of the class labels in the training data. The performance of AdaBoost has been empirically observed to dramatically degrade in such situations. Also shown in the figure is squared-error loss. The minimizer of the cor- responding risk on the population is f∗(x)=argminEY|x(Y−f(x))2 =E(Y |x)=2·Pr(Y =1|x)−1. (10.19) f (x) As before the classification rule is G(x) = sign[f(x)]. Squared-error loss is not a good surrogate for misclassification error. As seen in Figure 10.4, it is not a monotone decreasing function of increasing margin yf(x). For mar- gin values yif(xi) > 1 it increases quadratically, thereby placing increasing influence (error) on observations that are correctly classified with increas- ing certainty, thereby reducing the relative influence of those incorrectly classified yif(xi) < 0. 
Thus, if class assignment is the goal, a monotone de- creasing criterion serves as a better surrogate loss function. Figure 12.4 on page 426 in Chapter 12 includes a modification of quadratic loss, the “Hu- berized” square hinge loss (Rosset et al., 2004b), which enjoys the favorable properties of the binomial deviance, quadratic loss and the SVM hinge loss. It has the same population minimizer as the quadratic (10.19), is zero for y·f(x) > 1, and becomes linear for y·f(x) < −1. Since quadratic functions are easier to compute with than exponentials, our experience suggests this to be a useful alternative to the binomial deviance. With K-class classification, the response Y takes values in the unordered set G = {G1,...,Gk} (see Sections 2.4 and 4.4). We now seek a classifier G(x) taking values in G. It is sufficient to know the class conditional proba- bilities pk(x) = Pr(Y = Gk|x), k = 1, 2, . . . , K, for then the Bayes classifier is G(x)=Gk wherek=argmaxpl(x). (10.20) l In principal, though, we need not learn the pk(x), but simply which one is largest. However, in data mining applications the interest is often more in the class probabilities pl(x), l = 1, . . . , K themselves, rather than in per- forming a class assignment. As in Section 4.4, the logistic model generalizes naturally to K classes, efk (x) pk(x) = 􏰛Kl=1 efl(x) , (10.21) which ensures that 0 ≤ pk(x) ≤ 1 and that they sum to one. Note that here we have K different functions, one per class. There is a redundancy in the functions fk(x), since adding an arbitrary h(x) to each leaves the model unchanged. Traditionally one of them is set to zero: for example, L(y, f(x)) = 􏰙 [y − f (x)]2 for | y − f (x) | ≤ δ, 2δ(| y − f(x) | − δ2) otherwise. (10.23) 􏰜K = − k=1 I(y = Gk)fk(x) + log 􏰜K l=1 efl(x) . (10.22) 10.6 Loss Functions and Robustness 349 fK (x) = 0, as in (4.17). Here we prefer to retain the symmetry, and impose the constraint 􏰛Kk=1 fk(x) = 0. The binomial deviance extends naturally to the K-class multinomial deviance loss function: 􏰜K L(y,p(x)) = − I(y=Gk)logpk(x) k=1 􏰨􏰩 As in the two-class case, the criterion (10.22) penalizes incorrect predictions only linearly in their degree of incorrectness. Zhu et al. (2005) generalize the exponential loss for K-class problems. See Exercise 10.5 for details. Robust Loss Functions for Regression In the regression setting, analogous to the relationship between exponential loss and binomial log-likelihood is the relationship between squared-error loss L(y,f(x)) = (y−f(x))2 and absolute loss L(y,f(x)) = |y−f(x)|. The population solutions are f (x) = E(Y |x) for squared-error loss, and f (x) = median(Y |x) for absolute loss; for symmetric error distributions these are the same. However, on finite samples squared-error loss places much more emphasis on observations with large absolute residuals |yi −f(xi)| during the fitting process. It is thus far less robust, and its performance severely degrades for long-tailed error distributions and especially for grossly mis- measured y-values (“outliers”). Other more robust criteria, such as abso- lute loss, perform much better in these situations. In the statistical ro- bustness literature, a variety of regression loss criteria have been proposed that provide strong resistance (if not absolute immunity) to gross outliers while being nearly as efficient as least squares for Gaussian errors. They are often better than either for error distributions with moderately heavy tails. 
One such criterion is the Huber loss criterion used for M-regression (Huber, 1964) Figure 10.5 compares these three loss functions. These considerations suggest than when robustness is a concern, as is especially the case in data mining applications (see Section 10.7), squared- error loss for regression and exponential loss for classification are not the best criteria from a statistical perspective. However, they both lead to the elegant modular boosting algorithms in the context of forward stagewise additive modeling. For squared-error loss one simply fits the base learner to the residuals from the current model yi − fm−1(xi) at each step. For 350 10. Boosting and Additive Trees Squared Error Absolute Error Huber −3 −2 −1 0 1 2 3 y−f FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin y−f. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when |y − f | is large. exponential loss one performs a weighted fit of the base learner to the output values yi, with weights wi = exp(−yifm−1(xi)). Using other more robust criteria directly in their place does not give rise to such simple feasible boosting algorithms. However, in Section 10.10.2 we show how one can derive simple elegant boosting algorithms based on any differentiable loss criterion, thereby producing highly robust boosting procedures for data mining. 10.7 “Off-the-Shelf” Procedures for Data Mining Predictive learning is an important aspect of data mining. As can be seen from this book, a wide variety of methods have been developed for predic- tive learning from data. For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data. We have attempted to characterize appropriate situations in our discussions of each of the re- spective methods. However, it is seldom known in advance which procedure will perform best or even well for any given problem. Table 10.1 summarizes some of the characteristics of a number of learning methods. Industrial and commercial data mining applications tend to be especially challenging in terms of the requirements placed on learning procedures. Data sets are often very large in terms of number of observations and number of variables measured on each of them. Thus, computational con- 02468 Loss 10.7 “Off-the-Shelf” Procedures for Data Mining 351 TABLE 10.1. Some characteristics of different learning methods. Key: ▲= good, ◆=fair, and ▼=poor. Characteristic Neural SVM Trees Nets Kernels Natural handling of data ▼ ▼ ▲ of “mixed” type Handling of missing values ▼ ▼ ▲ Robustness to outliers in ▼ ▼ ▲ input space Insensitive to monotone ▼ ▼ ▲ transformations of inputs Computational scalability ▼ ▼ ▲ (large N) Ability to deal with irrel- ▼ ▼ ▲ evant inputs Ability to extract linear ▲ ▲ ▼ combinations of features ▲ ▼ ▲ ▲ ▼ ▲ ▼ ▼ ▲ ▼ ▲ ▼ ▼ ◆ MARS k-NN, Interpretability ▼ ▼ ◆ Predictive power ▲ ▲ ▲ ▼ ▼◆ ▲ siderations play an important role. Also, the data are usually messy: the inputs tend to be mixtures of quantitative, binary, and categorical vari- ables, the latter often with many levels. There are generally many missing values, complete observations being rare. Distributions of numeric predic- tor and response variables are often long-tailed and highly skewed. 
This is the case for the spam data (Section 9.1.2); when fitting a generalized additive model, we first log-transformed each of the predictors in order to get a reasonable fit. In addition they usually contain a substantial fraction of gross mis-measurements (outliers). The predictor variables are generally measured on very different scales. In data mining applications, usually only a small fraction of the large number of predictor variables that have been included in the analysis are actually relevant to prediction. Also, unlike many applications such as pat- tern recognition, there is seldom reliable domain knowledge to help create especially relevant features and/or filter out the irrelevant ones, the inclu- sion of which dramatically degrades the performance of many methods. In addition, data mining applications generally require interpretable mod- els. It is not enough to simply produce predictions. It is also desirable to have information providing qualitative understanding of the relationship 352 10. Boosting and Additive Trees between joint values of the input variables and the resulting predicted re- sponse value. Thus, black box methods such as neural networks, which can be quite useful in purely predictive settings such as pattern recognition, are far less useful for data mining. These requirements of speed, interpretability and the messy nature of the data sharply limit the usefulness of most learning procedures as off- the-shelf methods for data mining. An “off-the-shelf” method is one that can be directly applied to the data without requiring a great deal of time- consuming data preprocessing or careful tuning of the learning procedure. Of all the well-known learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. They are relatively fast to construct and they produce interpretable models (if the trees are small). As discussed in Section 9.2, they naturally incorporate mixtures of numeric and categorical predictor variables and missing values. They are invariant under (strictly monotone) transforma- tions of the individual predictors. As a result, scaling and/or more general transformations are not an issue, and they are immune to the effects of pre- dictor outliers. They perform internal feature selection as an integral part of the procedure. They are thereby resistant, if not completely immune, to the inclusion of many irrelevant predictor variables. These properties of decision trees are largely the reason that they have emerged as the most popular learning method for data mining. Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy. They seldom provide predictive ac- curacy comparable to the best that can be achieved with the data at hand. As seen in Section 10.1, boosting decision trees improves their accuracy, often dramatically. At the same time it maintains most of their desirable properties for data mining. Some advantages of trees that are sacrificed by boosting are speed, interpretability, and, for AdaBoost, robustness against overlapping class distributions and especially mislabeling of the training data. A gradient boosted model (GBM) is a generalization of tree boosting that attempts to mitigate these problems, so as to produce an accurate and effective off-the-shelf procedure for data mining. 
10.8 Example: Spam Data Before we go into the details of gradient boosting, we demonstrate its abili- ties on a two-class classification problem. The spam data are introduced in Chapter 1, and used as an example for many of the procedures in Chapter 9 (Sections 9.1.2, 9.2.5, 9.3.1 and 9.4.1). Applying gradient boosting to these data resulted in a test error rate of 4.5%, using the same test set as was used in Section 9.1.2. By comparison, an additive logistic regression achieved 5.5%, a CART tree fully grown and pruned by cross-validation 8.7%, and MARS 5.5%. The standard error of these estimates is around 0.6%, although gradient boosting is significantly better than all of them using the McNemar test (Exercise 10.6). In Section 10.13 below we develop a relative importance measure for each predictor, as well as a partial dependence plot describing a predictor’s contribution to the fitted model. We now illustrate these for the spam data. Figure 10.6 displays the relative importance spectrum for all 57 predictor variables. Clearly some predictors are more important than others in sep- arating spam from email. The frequencies of the character strings !, $, hp, and remove are estimated to be the four most relevant predictor variables. At the other end of the spectrum, the character strings 857, 415, table, and 3d have virtually no relevance. The quantity being modeled here is the log-odds of spam versus email f(x) = log Pr(spam|x) (10.24) Pr(email|x) (see Section 10.13 below). Figure 10.7 shows the partial dependence of the log-odds on selected important predictors, two positively associated with spam (! and remove), and two negatively associated (edu and hp). These particular dependencies are seen to be essentially monotonic.There is a general agreement with the corresponding functions found by the additive logistic regression model; see Figure 9.1 on page 303. Running a gradient boosted model on these data with J = 2 terminal- node trees produces a purely additive (main effects) model for the log- odds, with a corresponding error rate of 4.7%, as compared to 4.5% for the full gradient boosted model (with J = 5 terminal-node trees). Although not significant, this slightly higher error rate suggests that there may be interactions among some of the important predictor variables. This can be diagnosed through two-variable partial dependence plots. Figure 10.8 shows one of the several such plots displaying strong interaction effects. One sees that for very low frequencies of hp, the log-odds of spam are greatly increased. For high frequencies of hp, the log-odds of spam tend to be much lower and roughly constant as a function of !. As the frequency of hp decreases, the functional relationship with ! strengthens. 10.9 Boosting Trees Regression and classification trees are discussed in detail in Section 9.2. They partition the space of all joint predictor variable values into disjoint regions Rj,j = 1,2,...,J, as represented by the terminal nodes of the tree. A constant γj is assigned to each such region and the predictive rule is x∈Rj ⇒f(x)=γj. 10.9 Boosting Trees 353 354 10. Boosting and Additive Trees 3d addresses labs telnet 857 415 direct cs table 85 # parts credit [ lab conference report original data project font make address order all hpl technology people pm mail over 650 ; meeting email 000 internet receive ( re business 1999 will money our you edu CAPTOT george CAPMAX your CAPAVE free remove hp $ ! 0 20 40 60 80 100 Relative Importance FIGURE 10.6. 
Predictor variable importance spectrum for the spam data. The variable names are written on the vertical axis. 0.0 0.2 0.4 0.6 ! 0.8 1.0 0.0 0.2 0.4 0.6 remove 0.0 0.2 0.4 edu 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 hp FIGURE 10.7. Partial dependence of log-odds of spam on four important pre- dictors. The red ticks at the base of the plots are deciles of the input variable. 10.9 Boosting Trees 355 1.0 0.5 0.0 -0.5 -1.0 1.0 0.8 0.6 ! 0.4 0.2 3.0 1.0 hp 0.5 2.5 2.0 1.5 FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a func- tion of joint frequencies of hp and the character !. Partial Dependence -1.0 -0.6 -0.2 0.0 0.2 Partial Dependence -0.2 0.0 0.2 0.4 0.6 0.8 1.0 Partial Dependence -1.0 -0.6 -0.2 0.0 0.2 Partial Dependence -0.2 0.0 0.2 0.4 0.6 0.8 1.0 356 10. Boosting and Additive Trees Thus a tree can be formally expressed as The parameters are found by minimizing the empirical risk ˆ 􏰜J􏰜 Θ = arg min Θ j=1 xi∈Rj 􏰜J j=1 γjI(x ∈ Rj), (10.25) T(x;Θ) = with parameters Θ = {Rj , γj }J1 . J is usually treated as a meta-parameter. This is a formidable combinatorial optimization problem, and we usually settle for approximate suboptimal solutions. It is useful to divide the opti- mization problem into two parts: Finding γj given Rj: Given the Rj, estimating the γj is typically trivial, and often γˆj = y ̄j, the mean of the yi falling in region Rj. For mis- classification loss, γˆj is the modal class of the observations falling in region Rj. Finding Rj : This is the difficult part, for which approximate solutions are found. Note also that finding the Rj entails estimating the γj as well. A typical strategy is to use a greedy, top-down recursive partitioning algorithm to find the Rj. In addition, it is sometimes necessary to approximate (10.26) by a smoother and more convenient criterion for optimizing the Rj: 􏰜N L ̃(yi,T(xi,Θ)). (10.27) Then given the Rˆj = R ̃j, the γj can be estimated more precisely using the original criterion. In Section 9.2 we described such a strategy for classification trees. The Gini index replaced misclassification loss in the growing of the tree (identifying the Rj). The boosted tree model is a sum of such trees, the forward stagewise procedure one must solve ˆ 􏰜N Θm =argmin Θm i=1 Θ ̃ = argmin Θ i=1 􏰜M m=1 fM (x) = induced in a forward stagewise manner (Algorithm 10.2). At each step in L(yi , γj ). (10.26) T (x; Θm ), (10.28) L(yi,fm−1(xi)+T(xi;Θm)) (10.29) for the region set and constants Θ = {R , γ }Jm m jmjm1 of the next tree, given the current model fm−1(x). Given the regions Rjm, finding the optimal constants γjm in each region is typically straightforward: 􏰜 xi∈Rjm Finding the regions is difficult, and even more difficult than for a single tree. For a few special cases, the problem simplifies. For squared-error loss, the solution to (10.29) is no harder than for a single tree. It is simply the regression tree that best predicts the current residuals yi − fm−1 (xi ), and γˆj m is the mean of these residuals in each corresponding region. For two-class classification and exponential loss, this stagewise approach gives rise to the AdaBoost method for boosting classification trees (Algo- rithm 10.1). In particular, if the trees T(x;Θm) are restricted to be scaled γˆjm =argmin γjm L(yi,fm−1(xi)+γjm). (10.30) classification trees, then we showed in Section 10.4 that the solution to (10.29) is the tree that minimizes the weighted error rate 􏰛N w(m)I(yi ̸= i=1 i T(xi;Θm)) with weights w(m) = e−yifm−1(xi). 
By a scaled classification i tree, we mean βmT(x;Θm), with the restriction that γjm ∈ {−1,1}). Without this restriction, (10.29) still simplifies for exponential loss to a weighted exponential criterion for the new tree: Θˆ m = arg min 􏰜N Θm i w(m) exp[−yiT (xi; Θm)]. (10.31) It is straightforward to implement a greedy recursive-partitioning algorithm using this weighted exponential loss as a splitting criterion. Given the Rjm, one can show (Exercise 10.7) that the solution to (10.30) is the weighted log-odds in each corresponding region 􏰛 w(m)I(yi = 1) γˆjm = log 􏰛 xi∈Rjm i . (10.32) This requires a specialized tree-growing algorithm; in practice, we prefer the approximation presented below that uses a weighted least squares re- gression tree. Using loss criteria such as the absolute error or the Huber loss (10.23) in place of squared-error loss for regression, and the deviance (10.22) in place of exponential loss for classification, will serve to robustify boosting trees. Unfortunately, unlike their nonrobust counterparts, these robust criteria do not give rise to simple fast boosting algorithms. For more general loss criteria the solution to (10.30), given the Rjm, is typically straightforward since it is a simple “location” estimate. For i=1 10.9 Boosting Trees 357 w(m)I(yi = −1) xi∈Rjm i 358 10. Boosting and Additive Trees absolute loss it is just the median of the residuals in each respective region. For the other criteria fast iterative algorithms exist for solving (10.30), and usually their faster “single-step” approximations are adequate. The problem is tree induction. Simple fast algorithms do not exist for solving (10.29) for these more general loss criteria, and approximations like (10.27) become essential. 10.10 Numerical Optimization via Gradient Boosting Fast approximate algorithms for solving (10.29) with any differentiable loss criterion can be derived by analogy to numerical optimization. The loss in using f(x) to predict y on the training data is 􏰜N i=1 The goal is to minimize L(f) with respect to f, where here f(x) is con- strained to be a sum of trees (10.28). Ignoring this constraint, minimizing (10.33) can be viewed as a numerical optimization ˆf = arg min L(f ), (10.34) f where the “parameters” f ∈ IRN are the values of the approximating func- tion f(xi) at each of the N data points xi: f = {f(x1),f(x2)),...,f(xN)}. Numerical optimization procedures solve (10.34) as a sum of component vectors 􏰜M L(f) = L(yi,f(xi)). (10.33) fM= hm, hm∈IRN, m=0 where f0 = h0 is an initial guess, and each successive fm is induced based on the current parameter vector fm−1, which is the sum of the previously induced updates. Numerical optimization methods differ in their prescrip- tions for computing each increment vector hm (“step”). 10.10.1 Steepest Descent Steepest descent chooses hm = −ρmgm where ρm is a scalar and gm ∈ IRN is the gradient of L(f) evaluated at f = fm−1. The components of the gradient gm are 􏰗∂L(yi,f(xi))􏰘 gim = ∂f(xi) (10.35) f (xi )=fm−1 (xi ) 10.10 Numerical Optimization via Gradient Boosting 359 The step length ρm is the solution to ρm = arg min L(fm−1 − ρgm ). (10.36) ρ The current solution is then updated fm = fm−1 − ρmgm and the process repeated at the next iteration. Steepest descent can be viewed as a very greedy strategy, since −gm is the local direction in IRN for which L(f) is most rapidly decreasing at f = fm−1. 10.10.2 Gradient Boosting Forward stagewise boosting (Algorithm 10.2) is also a very greedy strategy. 
At each step the solution tree is the one that maximally reduces (10.29), given the current model fm−1 and its fits fm−1(xi). Thus, the tree predictions T(xi; Θm) are analogous to the components of the negative gradient (10.35). The principal difference between them is that the tree components tm = (T(x1; Θm), . . . , T(xN; Θm)) are not independent. They are constrained to be the predictions of a Jm-terminal node decision tree, whereas the negative gradient is the unconstrained maximal descent direction.

The solution to (10.30) in the stagewise approach is analogous to the line search (10.36) in steepest descent. The difference is that (10.30) performs a separate line search for those components of tm that correspond to each separate terminal region {T(xi; Θm)}xi∈Rjm.

If minimizing loss on the training data (10.33) were the only goal, steepest descent would be the preferred strategy. The gradient (10.35) is trivial to calculate for any differentiable loss function L(y, f(x)), whereas solving (10.29) is difficult for the robust criteria discussed in Section 10.6. Unfortunately the gradient (10.35) is defined only at the training data points xi, whereas the ultimate goal is to generalize fM(x) to new data not represented in the training set.

A possible resolution to this dilemma is to induce a tree T(x; Θm) at the mth iteration whose predictions tm are as close as possible to the negative gradient. Using squared error to measure closeness, this leads us to

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \left(-g_{im} - T(x_i; \Theta)\right)^2. \tag{10.37}$$

That is, one fits the tree T to the negative gradient values (10.35) by least squares. As noted in Section 10.9 fast algorithms exist for least squares decision tree induction. Although the solution regions R̃jm to (10.37) will not be identical to the regions Rjm that solve (10.29), it is generally similar enough to serve the same purpose. In any case, the forward stagewise
boosting procedure, and top-down decision tree induction, are themselves approximation procedures. After constructing the tree (10.37), the corresponding constants in each region are given by (10.30).

Table 10.2 summarizes the gradients for commonly used loss functions. For squared error loss, the negative gradient is just the ordinary residual −gim = yi − fm−1(xi), so that (10.37) on its own is equivalent to standard least squares boosting. With absolute error loss, the negative gradient is the sign of the residual, so at each iteration (10.37) fits the tree to the sign of the current residuals by least squares. For Huber M-regression, the negative gradient is a compromise between these two (see the table).

TABLE 10.2. Gradients for commonly used loss functions.

Setting          Loss Function                  −∂L(yi, f(xi))/∂f(xi)
Regression       (1/2)[yi − f(xi)]²             yi − f(xi)
Regression       |yi − f(xi)|                   sign[yi − f(xi)]
Regression       Huber                          yi − f(xi) for |yi − f(xi)| ≤ δm; δm sign[yi − f(xi)] for |yi − f(xi)| > δm, where δm = αth-quantile{|yi − f(xi)|}
Classification   Deviance                       kth component: I(yi = Gk) − pk(xi)
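The entries in Table 10.2 translate directly into code. Below is a hedged sketch of the negative-gradient (pseudo-residual) computations for the three regression losses; the Huber cutoff δ is taken as the α-quantile of the current absolute residuals, with α = 0.9 as an illustrative choice.

```python
import numpy as np

def neg_gradient_squared(y, f):
    # -dL/df for L = (1/2)(y - f)^2: the ordinary residual
    return y - f

def neg_gradient_absolute(y, f):
    # -dL/df for L = |y - f|: the sign of the residual
    return np.sign(y - f)

def neg_gradient_huber(y, f, alpha=0.9):
    # Huber loss: ordinary residual inside the delta band, clipped to
    # +/- delta outside it, with delta the alpha-quantile of |y - f|
    r = y - f
    delta = np.quantile(np.abs(r), alpha)
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
```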
For classification the loss function is the multinomial deviance (10.22), and K least squares trees are constructed at each iteration. Each tree Tkm is fit to its respective negative gradient vector gkm,
$$-g_{ikm} = -\frac{\partial L\left(y_i, f_{1m}(x_i), \ldots, f_{Km}(x_i)\right)}{\partial f_{km}(x_i)} = I(y_i = G_k) - p_k(x_i), \tag{10.38}$$
with pk(x) given by (10.21). Although K separate trees are built at each iteration, they are related through (10.21). For binary classification (K = 2), only one tree is needed (exercise 10.10).
10.10.3 Implementations of Gradient Boosting
Algorithm 10.3 presents the generic gradient tree-boosting algorithm for regression. Specific algorithms are obtained by inserting different loss criteria L(y, f(x)). The first line of the algorithm initializes to the optimal constant model, which is just a single terminal node tree. The components of the negative gradient computed at line 2(a) are referred to as generalized or pseudo residuals, r. Gradients for commonly used loss functions are summarized in Table 10.2.

Algorithm 10.3 Gradient Tree Boosting Algorithm.
1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
   (a) For $i = 1, 2, \ldots, N$ compute
       $$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}.$$
   (b) Fit a regression tree to the targets $r_{im}$ giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J_m$.
   (c) For $j = 1, 2, \ldots, J_m$ compute
       $$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big).$$
   (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.
The algorithm for classification is similar. Lines 2(a)–(d) are repeated K times at each iteration m, once for each class using (10.38). The result at line 3 is K different (coupled) tree expansions fkM (x), k = 1, 2, . . . , K. These produce probabilities via (10.21) or do classification as in (10.20). Details are given in Exercise 10.9. Two basic tuning parameters are the number of iterations M and the sizes of each of the constituent trees Jm, m=1,2,…,M.
The original implementation of this algorithm was called MART for “multiple additive regression trees,” and was referred to in the first edition of this book. Many of the figures in this chapter were produced by MART. Gradient boosting as described here is implemented in the R gbm package (Ridgeway, 1999, “Gradient Boosted Models”), and is freely available. The gbm package is used in Section 10.14.2, and extensively in Chapters 16 and 15. Another R implementation of boosting is mboost (Hothorn and Bühlmann, 2006). A commercial implementation of gradient boosting/MART called TreeNet® is available from Salford Systems, Inc.
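As a complement to the packaged implementations just mentioned, Algorithm 10.3 can be sketched in a few dozen lines. The version below uses absolute-error loss, so the pseudo-residuals are signs of residuals and the terminal-node constants are medians; the scikit-learn tree, the helper names, and the default settings (M = 100, six terminal nodes) are our own illustrative assumptions, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost_lad(X, y, M=100, n_leaves=6):
    """Algorithm 10.3 with absolute-error loss L(y, f) = |y - f|."""
    f0 = np.median(y)                          # line 1: optimal constant for L1 loss
    f = np.full(len(y), f0)
    trees, leaf_values = [], []
    for m in range(M):
        r = np.sign(y - f)                     # 2(a): pseudo-residuals for L1 loss
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves).fit(X, r)   # 2(b)
        leaves = tree.apply(X)                 # terminal region R_jm of each x_i
        gamma = {}
        for j in np.unique(leaves):            # 2(c): per-region line search; for
            idx = leaves == j                  # L1 loss the minimizer is the median
            gamma[j] = np.median(y[idx] - f[idx])
        f = f + np.array([gamma[j] for j in leaves])    # 2(d): update the fit
        trees.append(tree)
        leaf_values.append(gamma)
    return f0, trees, leaf_values

def gtb_predict(f0, trees, leaf_values, X):
    """Sum the constant plus the terminal-node constants of every tree."""
    f = np.full(X.shape[0], f0)
    for tree, gamma in zip(trees, leaf_values):
        leaves = tree.apply(X)
        f = f + np.array([gamma[j] for j in leaves])
    return f
```

Swapping in another differentiable loss only requires changing the pseudo-residual in step 2(a) and the per-region constant in step 2(c), which is exactly the modularity the algorithm is designed to deliver.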
10.11 Right-Sized Trees for Boosting
Historically, boosting was considered to be a technique for combining models, here trees. As such, the tree building algorithm was regarded as a primitive that produced models to be combined by the boosting procedure. In this scenario, the optimal size of each tree is estimated separately in the usual manner when it is built (Section 9.2). A very large (oversized) tree is first induced, and then a bottom-up procedure is employed to prune it to the estimated optimal number of terminal nodes. This approach assumes implicitly that each tree is the last one in the expansion (10.28). Except perhaps for the very last tree, this is clearly a very poor assumption. The result is that trees tend to be much too large, especially during the early iterations. This substantially degrades performance and increases computation.
The simplest strategy for avoiding this problem is to restrict all trees to be the same size, Jm = J ∀m. At each iteration a J-terminal node regression tree is induced. Thus J becomes a meta-parameter of the entire boosting procedure, to be adjusted to maximize estimated performance for the data at hand.
One can get an idea of useful values for J by considering the properties of the “target” function
$$\eta = \arg\min_{f} E_{XY}\, L(Y, f(X)). \tag{10.39}$$
Here the expected value is over the population joint distribution of (X, Y ). The target function η(x) is the one with minimum prediction risk on future data. This is the function we are trying to approximate.
One relevant property of η(X) is the degree to which the coordinate variables $X^T = (X_1, X_2, \ldots, X_p)$ interact with one another. This is captured by its ANOVA (analysis of variance) expansion

$$\eta(X) = \sum_{j} \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots. \tag{10.40}$$
The first sum in (10.40) is over functions of only a single predictor variable Xj . The particular functions ηj (Xj ) are those that jointly best approximate η(X) under the loss criterion being used. Each such ηj(Xj) is called the “main effect” of Xj. The second sum is over those two-variable functions that when added to the main effects best fit η(X). These are called the second-order interactions of each respective variable pair (Xj,Xk). The third sum represents third-order interactions, and so on. For many problems encountered in practice, low-order interaction effects tend to dominate. When this is the case, models that produce strong higher-order interaction effects, such as large decision trees, suffer in accuracy.
The interaction level of tree-based approximations is limited by the tree size J. Namely, no interaction effects of level greater than J − 1 are possible. Since boosted models are additive in the trees (10.28), this limit extends to them as well. Setting J = 2 (single split “decision stump”) produces boosted models with only main effects; no interactions are permitted. With J = 3, two-variable interaction effects are also allowed, and

FIGURE 10.9. Boosting with different sized trees, applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1. (Curves shown: stumps, 10-node trees, 100-node trees, and AdaBoost.)
so on. This suggests that the value chosen for J should reflect the level of dominant interactions of η(x). This is of course generally unknown, but in most situations it will tend to be low. Figure 10.9 illustrates the effect of interaction order (choice of J) on the simulation example (10.2). The generative function is additive (sum of quadratic monomials), so boosting models with J > 2 incurs unnecessary variance and hence the higher test error. Figure 10.10 compares the coordinate functions found by boosted stumps with the true functions.
Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 ≤ J ≤ 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range. One can fine-tune the value for J by trying several different values and choosing the one that produces the lowest risk on a validation sample. However, this seldom provides significant improvement over using J ≃ 6.
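In packaged implementations this tuning amounts to a small grid search over the tree-size parameter. A hedged sketch with scikit-learn's GradientBoostingClassifier, where max_leaf_nodes plays the role of J (so max_leaf_nodes = 2 gives stumps and a purely additive fit); the candidate values, the validation split, and the reuse of X_train, y_train from the earlier simulation sketch are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size=0.25, random_state=0)
scores = {}
for J in (2, 4, 6, 8):
    gbm = GradientBoostingClassifier(n_estimators=400, learning_rate=0.1,
                                     max_leaf_nodes=J, random_state=0)
    gbm.fit(X_tr, y_tr)
    scores[J] = gbm.score(X_val, y_val)      # validation accuracy for this J
best_J = max(scores, key=scores.get)
```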

FIGURE 10.10. Coordinate functions f1(x1), . . . , f10(x10) estimated by boosting stumps for the simulated example used in Figure 10.9 (panel title: “Coordinate Functions for Additive Logistic Trees”). The true quadratic functions are shown for comparison.
10.12 Regularization
Besides the size of the constituent trees, J, the other meta-parameter of gradient boosting is the number of boosting iterations M. Each iteration usually reduces the training risk L(fM ), so that for M large enough this risk can be made arbitrarily small. However, fitting the training data too well can lead to overfitting, which degrades the risk on future predictions. Thus, there is an optimal number M∗ minimizing future risk that is application dependent. A convenient way to estimate M∗ is to monitor prediction risk as a function of M on a validation sample. The value of M that minimizes this risk is taken to be an estimate of M∗. This is analogous to the early stopping strategy often used with neural networks (Section 11.4).
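Such monitoring is straightforward with staged predictions. The hedged sketch below tracks validation deviance after every iteration and picks the minimizing M, reusing one of the fitted models (`gbm`) and the validation split from the previous sketch; both are our own assumptions, not part of the text.

```python
import numpy as np
from sklearn.metrics import log_loss

# Validation deviance after each boosting iteration m = 1, ..., M
val_dev = [log_loss(y_val, p, labels=gbm.classes_)
           for p in gbm.staged_predict_proba(X_val)]
M_star = int(np.argmin(val_dev)) + 1          # estimated optimal number of iterations
```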
10.12.1 Shrinkage
Controlling the value of M is not the only possible regularization strategy. As with ridge regression and neural networks, shrinkage techniques can be employed as well (see Sections 3.4.1 and 11.5). The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor 0 < ν < 1 when it is added to the current approximation. That is, line 2(d) of Algorithm 10.3 is replaced by 􏰜J j=1 The parameter ν can be regarded as controlling the learning rate of the boosting procedure. Smaller values of ν (more shrinkage) result in larger training risk for the same number of iterations M. Thus, both ν and M control prediction risk on the training data. However, these parameters do fm(x) = fm−1(x) + ν · γjmI(x ∈ Rjm). (10.41) not operate independently. Smaller values of ν lead to larger values of M for the same training risk, so that there is a tradeoff between them. Empirically it has been found (Friedman, 2001) that smaller values of ν favor better test error, and require correspondingly larger values of M. In fact, the best strategy appears to be to set ν to be very small (ν < 0.1) and then choose M by early stopping. This yields dramatic improvements (over no shrinkage ν = 1) for regression and for probability estimation. The corresponding improvements in misclassification risk via (10.20) are less, but still substantial. The price paid for these improvements is computa- tional: smaller values of ν give rise to larger values of M, and computation is proportional to the latter. However, as seen below, many iterations are generally computationally feasible even on very large data sets. This is partly due to the fact that small trees are induced at each step with no pruning. Figure 10.11 shows test error curves for the simulated example (10.2) of Figure 10.2. A gradient boosted model (MART) was trained using binomial deviance, using either stumps or six terminal-node trees, and with or with- out shrinkage. The benefits of shrinkage are evident, especially when the binomial deviance is tracked. With shrinkage, each test error curve reaches a lower value, and stays there for many iterations. Section 16.2.1 draws a connection between forward stagewise shrinkage in boosting and the use of an L1 penalty for regularizing model parame- ters (the “lasso”). We argue that L1 penalties may be superior to the L2 penalties used by methods such as the support vector machine. 10.12.2 Subsampling We saw in Section 8.7 that bootstrap averaging (bagging) improves the performance of a noisy classifier through averaging. Chapter 15 discusses in some detail the variance-reduction mechanism of this sampling followed by averaging. We can exploit the same device in gradient boosting, both to improve performance and computational efficiency. With stochastic gradient boosting (Friedman, 1999), at each iteration we sample a fraction η of the training observations (without replacement), and grow the next tree using that subsample. The rest of the algorithm is identical. A typical value for η can be 12, although for large N, η can be substantially smaller than 12 . Not only does the sampling reduce the computing time by the same fraction η, but in many cases it actually produces a more accurate model. Figure 10.12 illustrates the effect of subsampling using the simulated example (10.2), both as a classification and as a regression example. We see in both cases that sampling along with shrinkage slightly outperformed the rest. 
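In the scikit-learn interface the shrinkage parameter ν of (10.41) is called learning_rate and the subsampling fraction η is called subsample, so stochastic gradient boosting with strong shrinkage is a one-line change. The specific values below (ν = 0.05, η = 0.5, large M, J = 6), and the reuse of X_tr, y_tr from the earlier sketch, are illustrative assumptions in the spirit of the recommendations above.

```python
from sklearn.ensemble import GradientBoostingClassifier

stochastic_gbm = GradientBoostingClassifier(
    n_estimators=2000,     # large M, to be cut back by early stopping
    learning_rate=0.05,    # shrinkage factor nu in (10.41)
    subsample=0.5,         # fraction eta of training observations used per tree
    max_leaf_nodes=6,      # tree size J
    random_state=0,
).fit(X_tr, y_tr)
```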
It appears here that subsampling without shrinkage does poorly. 10.12 Regularization 365 366 10. Boosting and Additive Trees Test Set Deviance 0.0 0.5 1.0 1.5 2.0 Test Set Deviance 0.0 0.5 1.0 1.5 2.0 Test Set Misclassification Error 0.0 0.1 0.2 0.3 0.4 0.5 Test Set Misclassification Error 0.0 0.1 0.2 0.3 0.4 0.5 0 500 1000 1500 Boosting Iterations 6-Node Trees Deviance No shrinkage Shrinkage=0.6 2000 0 500 Boosting Iterations 0 500 1000 1500 Boosting Iterations 2000 0 500 Boosting Iterations Stumps Deviance No shrinkage Shrinkage=0.2 Stumps Misclassification Error No shrinkage Shrinkage=0.2 FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial de- viance, either stumps or six terminal-node trees, and with or without shrinkage. The left panels report test deviance, while the right panels show misclassification error. The beneficial effect of shrinkage can be seen in all cases, especially for deviance in the left panels. 1000 1500 2000 6-Node Trees Misclassification Error No shrinkage Shrinkage=0.6 1000 1500 2000 Deviance Absolute Error No shrinkage Shrink=0.1 Sample=0.5 Shrink=0.1 Sample=0.5 4−Node Trees 10.13 Interpretation 367 0 200 Boosting Iterations FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity. For the curves labeled “Sample= 0.5”, a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel using square-error loss. The downside is that we now have four parameters to set: J, M, ν and η. Typically some early explorations determine suitable values for J, ν and η, leaving M as the primary parameter. 10.13 Interpretation Single decision trees are highly interpretable. The entire model can be com- pletely represented by a simple two-dimensional graphic (binary tree) that is easily visualized. Linear combinations of trees (10.28) lose this important feature, and must therefore be interpreted in a different way. 10.13.1 Relative Importance of Predictor Variables In data mining applications the input predictor variables are seldom equally relevant. Often only a few of them have substantial influence on the re- sponse; the vast majority are irrelevant and could just as well have not been included. It is often useful to learn the relative importance or contri- bution of each input variable in predicting the response. 400 600 800 1000 0 200 Boosting Iterations 400 600 800 1000 Test Set Deviance 0.4 0.6 0.8 1.0 1.2 1.4 Test Set Absolute Error 0.30 0.35 0.40 0.45 0.50 368 10. Boosting and Additive Trees For a single decision tree T, Breiman et al. (1984) proposed J−1 all of the classes 1 􏰜K I l2 ( T ) = 􏰜 t=1 ˆı 2t I ( v ( t ) = l ) ( 1 0 . 4 2 ) as a measure of relevance for each predictor variable Xl. The sum is over the J − 1 internal nodes of the tree. At each such node t, one of the input variables Xv(t) is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement ˆı2t in squared error risk over that for a constant fit over the entire region. The squared relative importance of variable Xl is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable. 
This importance measure is easily generalized to additive tree expansions (10.28); it is simply averaged over the trees

    I_l^2 = \frac{1}{M} \sum_{m=1}^{M} I_l^2(T_m).    (10.43)

Due to the stabilizing effect of averaging, this measure turns out to be more reliable than is its counterpart (10.42) for a single tree. Also, because of shrinkage (Section 10.12.1) the masking of important variables by others with which they are highly correlated is much less of a problem. Note that (10.42) and (10.43) refer to squared relevance; the actual relevances are their respective square roots. Since these measures are relative, it is customary to assign the largest a value of 100 and then scale the others accordingly. Figure 10.6 shows the relative importance of the 57 inputs in predicting spam versus email.

For K-class classification, K separate models f_k(x), k = 1, 2, . . . , K are induced, each consisting of a sum of trees

    f_k(x) = \sum_{m=1}^{M} T_{km}(x).    (10.44)

In this case (10.43) generalizes to

    I_{lk}^2 = \frac{1}{M} \sum_{m=1}^{M} I_l^2(T_{km}).    (10.45)

Here I_{lk} is the relevance of X_l in separating the class k observations from the other classes. The overall relevance of X_l is obtained by averaging over all of the classes

    I_l^2 = \frac{1}{K} \sum_{k=1}^{K} I_{lk}^2.    (10.46)

Figures 10.23 and 10.24 illustrate the use of these averaged and separate relative importances.
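Tree-based software typically reports an importance measure of exactly this averaged form. As a hedged illustration, scikit-learn's gradient boosting exposes an impurity-based variant of (10.42)–(10.43) through its feature_importances_ attribute; the sketch below rescales it so that the largest value is 100, as is customary. The dataset and settings are placeholders, not those used in the text.

```python
# Minimal sketch: impurity-based relative importances averaged over the
# boosted trees (a close relative of (10.42)-(10.43)), rescaled so the
# largest importance is 100. Data and settings are illustrative.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, n_features=10, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                max_leaf_nodes=6, random_state=0)
gbm.fit(X, y)

imp = gbm.feature_importances_          # averaged over all M trees
rel = 100.0 * imp / imp.max()           # largest variable scaled to 100
for j in np.argsort(rel)[::-1]:
    print(f"X{j + 1}: relative importance {rel[j]:.1f}")
```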
10.13.2 Partial Dependence Plots

After the most relevant variables have been identified, the next step is to attempt to understand the nature of the dependence of the approximation f(X) on their joint values. Graphical renderings of f(X) as a function of its arguments provide a comprehensive summary of its dependence on the joint values of the input variables.

Unfortunately, such visualization is limited to low-dimensional views. We can easily display functions of one or two arguments, either continuous or discrete (or mixed), in a variety of different ways; this book is filled with such displays. Functions of slightly higher dimensions can be plotted by conditioning on particular sets of values of all but one or two of the arguments, producing a trellis of plots (Becker et al., 1996).1

1 lattice in R.

For more than two or three variables, viewing functions of the corresponding higher-dimensional arguments is more difficult. A useful alternative can sometimes be to view a collection of plots, each one of which shows the partial dependence of the approximation f(X) on a selected small subset of the input variables. Although such a collection can seldom provide a comprehensive depiction of the approximation, it can often produce helpful clues, especially when f(x) is dominated by low-order interactions (10.40).

Consider the subvector X_S of l < p of the input predictor variables X^T = (X_1, X_2, . . . , X_p), indexed by S ⊂ {1, 2, . . . , p}. Let C be the complement set, with S ∪ C = {1, 2, . . . , p}. A general function f(X) will in principle depend on all of the input variables: f(X) = f(X_S, X_C). One way to define the average or partial dependence of f(X) on X_S is

    f_S(X_S) = E_{X_C} f(X_S, X_C).    (10.47)

This is a marginal average of f, and can serve as a useful description of the effect of the chosen subset on f(X) when, for example, the variables in X_S do not have strong interactions with those in X_C.

Partial dependence functions can be used to interpret the results of any "black box" learning method. They can be estimated by

    \bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC}),    (10.48)

where {x_{1C}, x_{2C}, . . . , x_{NC}} are the values of X_C occurring in the training data. This requires a pass over the data for each set of joint values of X_S for which \bar{f}_S(X_S) is to be evaluated. This can be computationally intensive, even for moderately sized data sets. Fortunately with decision trees, \bar{f}_S(X_S) (10.48) can be rapidly computed from the tree itself without reference to the data (Exercise 10.11).

It is important to note that partial dependence functions defined in (10.47) represent the effect of X_S on f(X) after accounting for the (average) effects of the other variables X_C on f(X). They are not the effect of X_S on f(X) ignoring the effects of X_C. The latter is given by the conditional expectation

    \tilde{f}_S(X_S) = E(f(X_S, X_C) \mid X_S),    (10.49)

and is the best least squares approximation to f(X) by a function of X_S alone. The quantities \tilde{f}_S(X_S) and \bar{f}_S(X_S) will be the same only in the unlikely event that X_S and X_C are independent. For example, if the effect of the chosen variable subset happens to be purely additive,

    f(X) = h_1(X_S) + h_2(X_C),    (10.50)

then (10.47) produces h_1(X_S) up to an additive constant. If the effect is purely multiplicative,

    f(X) = h_1(X_S) \cdot h_2(X_C),    (10.51)

then (10.47) produces h_1(X_S) up to a multiplicative constant factor. On the other hand, (10.49) will not produce h_1(X_S) in either case. In fact, (10.49) can produce strong effects on variable subsets for which f(X) has no dependence at all.

Viewing plots of the partial dependence of the boosted-tree approximation (10.28) on selected variable subsets can help to provide a qualitative description of its properties. Illustrations are shown in Sections 10.8 and 10.14. Owing to the limitations of computer graphics, and human perception, the size of the subsets X_S must be small (l ≈ 1, 2, 3). There are of course a large number of such subsets, but only those chosen from among the usually much smaller set of highly relevant predictors are likely to be informative. Also, those subsets whose effect on f(X) is approximately additive (10.50) or multiplicative (10.51) will be most revealing.

For K-class classification, there are K separate models (10.44), one for each class. Each one is related to the respective probabilities (10.21) through

    f_k(X) = \log p_k(X) - \frac{1}{K} \sum_{l=1}^{K} \log p_l(X).    (10.52)

Thus each f_k(X) is a monotone increasing function of its respective probability on a logarithmic scale. Partial dependence plots of each respective f_k(X) (10.44) on its most relevant predictors (10.45) can help reveal how the log-odds of realizing that class depend on the respective input variables.
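The estimate (10.48) can be computed by brute force for any fitted model: fix X_S at a grid of values, substitute each training observation's x_{iC}, and average the predictions. The sketch below does exactly this for a fitted scikit-learn gradient boosting model; the data, the model, and the choice S = {0} are placeholder assumptions, and sklearn.inspection.partial_dependence offers a faster tree-based alternative.

```python
# Brute-force estimate of the partial dependence (10.48): for each grid
# value of X_S, replace that column in every training row and average the
# model's predictions. Model, data, and the choice S = {0} are placeholders.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  max_leaf_nodes=6, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, j, grid):
    """Evaluate (10.48) for the single variable X_j over the given grid."""
    pd_values = np.empty(len(grid))
    for a, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = value          # fix X_S = value, keep x_iC as observed
        pd_values[a] = model.predict(X_mod).mean()
    return pd_values

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
print(partial_dependence_1d(model, X, j=0, grid=grid))
```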
10.14 Illustrations

In this section we illustrate gradient boosting on a number of larger datasets, using different loss functions as appropriate.

10.14.1 California Housing

This data set (Pace and Barry, 1997) is available from the Carnegie-Mellon StatLib repository2. It consists of aggregated data from each of 20,460 neighborhoods (1990 census block groups) in California. The response variable Y is the median house value in each neighborhood measured in units of $100,000. The predictor variables are demographics such as median income MedInc, housing density as reflected by the number of houses House, and the average occupancy in each house AveOccup. Also included as predictors are the location of each neighborhood (longitude and latitude), and several quantities reflecting the properties of the houses in the neighborhood: average number of rooms AveRooms and bedrooms AveBedrms. There are thus a total of eight predictors, all numeric.

2 http://lib.stat.cmu.edu.

We fit a gradient boosting model using the MART procedure, with J = 6 terminal nodes, a learning rate (10.41) of ν = 0.1, and the Huber loss criterion for predicting the numeric response. We randomly divided the dataset into a training set (80%) and a test set (20%).

Figure 10.13 shows the average absolute error

    AAE = E |y - \hat{f}_M(x)|    (10.53)

as a function of the number of iterations M on both the training data and test data. The test error is seen to decrease monotonically with increasing M, more rapidly during the early stages and then leveling off to being nearly constant as iterations increase. Thus, the choice of a particular value of M is not critical, as long as it is not too small. This tends to be the case in many applications. The shrinkage strategy (10.41) tends to eliminate the problem of overfitting, especially for larger data sets.

The value of AAE after 800 iterations is 0.31. This can be compared to that of the optimal constant predictor median{y_i}, which is 0.89. In terms of more familiar quantities, the squared multiple correlation coefficient of this model is R² = 0.84. Pace and Barry (1997) use a sophisticated spatial autoregression procedure, where prediction for each neighborhood is based on median house values in nearby neighborhoods, using the other predictors as covariates. Experimenting with transformations they achieved R² = 0.85, predicting log Y. Using log Y as the response, the corresponding value for gradient boosting was R² = 0.86.

FIGURE 10.13. Average absolute error as a function of number of iterations for the California housing data.
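For readers who want to reproduce an analysis in this spirit, a version of the same dataset ships with scikit-learn, whose gradient boosting implementation supports the Huber loss. The sketch below only approximates the fit described above (it does not use MART itself, and the exact settings are illustrative).

```python
# A rough re-creation of the California housing fit: 6 terminal-node trees,
# learning rate nu = 0.1, Huber loss, 80/20 train/test split. This is an
# illustrative sketch, not the MART run reported in the text.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

gbm = GradientBoostingRegressor(loss="huber", learning_rate=0.1,
                                n_estimators=800, max_leaf_nodes=6,
                                random_state=0)
gbm.fit(X_train, y_train)

aae = np.mean(np.abs(y_test - gbm.predict(X_test)))      # estimate of (10.53)
baseline = np.mean(np.abs(y_test - np.median(y_train)))  # constant predictor
print(f"test AAE: {aae:.2f} (constant-median baseline: {baseline:.2f})")
```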
Figure 10.14 displays the relative variable importances for each of the eight predictor variables. Not surprisingly, median income in the neighborhood is the most relevant predictor. Longitude, latitude, and average occupancy all have roughly half the relevance of income, whereas the others are somewhat less influential.

FIGURE 10.14. Relative importance of the predictors for the California housing data.

Figure 10.15 shows single-variable partial dependence plots on the most relevant nonlocation predictors. Note that the plots are not strictly smooth. This is a consequence of using tree-based models. Decision trees produce discontinuous piecewise constant models (10.25). This carries over to sums of trees (10.28), with of course many more pieces. Unlike most of the methods discussed in this book, there is no smoothness constraint imposed on the result. Arbitrarily sharp discontinuities can be modeled. The fact that these curves generally exhibit a smooth trend is because that is what is estimated to best predict the response for this problem. This is often the case.

The hash marks at the base of each plot delineate the deciles of the data distribution of the corresponding variables. Note that here the data density is lower near the edges, especially for larger values. This causes the curves to be somewhat less well determined in those regions. The vertical scales of the plots are the same, and give a visual comparison of the relative importance of the different variables.

The partial dependence of median house value on median income is monotonic increasing, being nearly linear over the main body of data. House value is generally monotonic decreasing with increasing average occupancy, except perhaps for average occupancy rates less than one. Median house value has a nonmonotonic partial dependence on average number of rooms. It has a minimum at approximately three rooms and is increasing both for smaller and larger values.

Median house value is seen to have a very weak partial dependence on house age that is inconsistent with its importance ranking (Figure 10.14). This suggests that this weak main effect may be masking stronger interaction effects with other variables. Figure 10.16 shows the two-variable partial dependence of housing value on joint values of median age and average occupancy. An interaction between these two variables is apparent. For values of average occupancy greater than two, house value is nearly independent of median age, whereas for values less than two there is a strong dependence on age.

FIGURE 10.15. Partial dependence of housing value on the nonlocation variables for the California housing data. The red ticks at the base of the plot are deciles of the input variables.

FIGURE 10.16. Partial dependence of house value on median age and average occupancy. There appears to be a strong interaction effect between these two variables.

Figure 10.17 shows the two-variable partial dependence of the fitted model on joint values of longitude and latitude, displayed as a shaded contour plot. There is clearly a very strong dependence of median house value on the neighborhood location in California. Note that Figure 10.17 is not a plot of house value versus location ignoring the effects of the other predictors (10.49). Like all partial dependence plots, it represents the effect of location after accounting for the effects of the other neighborhood and house attributes (10.47). It can be viewed as representing an extra premium one pays for location. This premium is seen to be relatively large near the Pacific coast, especially in the Bay Area and Los Angeles–San Diego regions. In the northern, central valley, and southeastern desert regions of California, location costs considerably less.

FIGURE 10.17. Partial dependence of median house value on location in California. One unit is $100,000, at 1990 prices, and the values plotted are relative to the overall median of $180,000.

10.14.2 New Zealand Fish

Plant and animal ecologists use regression models to predict species presence, abundance and richness as a function of environmental variables. Although for many years simple linear and parametric models were popular, recent literature shows increasing interest in more sophisticated models such as generalized additive models (Section 9.1, GAM), multivariate adaptive regression splines (Section 9.4, MARS) and boosted regression trees (Leathwick et al., 2005; Leathwick et al., 2006). Here we model the presence and abundance of the Black Oreo Dory, a marine fish found in the oceanic waters around New Zealand.3
Figure 10.18 shows the locations of 17,000 trawls (deep-water net fishing, with a maximum depth of 2km), and the red points indicate those 2353 trawls for which the Black Oreo was present, one of over a hundred species regularly recorded. The catch size in kg for each species was recorded for each trawl. Along with the species catch, a number of environmental measurements are available for each trawl. These include the average depth of the trawl (AvgDepth), and the temperature and salinity of the water. Since the latter two are strongly correlated with depth, Leathwick et al. (2006) derived instead TempResid and SalResid, the residuals obtained when these two measures are adjusted for depth (via separate non-parametric regressions). SSTGrad is a measure of the gradient of the sea surface temperature, and Chla is a broad indicator of ecosystem productivity via satellite-image measurements. SusPartMatter provides a measure of suspended particulate matter, particularly in coastal waters, and is also satellite derived.

The goal of this analysis is to estimate the probability of finding Black Oreo in a trawl, as well as the expected catch size, standardized to take into account the effects of variation in trawl speed and distance, as well as the mesh size of the trawl net. The authors used logistic regression for estimating the probability. For the catch size, it might seem natural to assume a Poisson distribution and model the log of the mean count, but this is often not appropriate because of the excessive number of zeros. Although specialized approaches have been developed, such as the zero-inflated Poisson (Lambert, 1992), they chose a simpler approach. If Y is the (non-negative) catch size,

    E(Y \mid X) = E(Y \mid Y > 0, X) \cdot \Pr(Y > 0 \mid X).    (10.54)
The second term is estimated by the logistic regression, and the first term can be estimated using only the 2353 trawls with a positive catch.
For the logistic regression the authors used a gradient boosted model (GBM)4 with binomial deviance loss function, depth-10 trees, and a shrinkage factor ν = 0.025. For the positive-catch regression, they modeled log(Y) using a GBM with squared-error loss (also depth-10 trees, but ν = 0.01), and un-logged the predictions. In both cases they used 10-fold cross-validation for selecting the number of terms, as well as the shrinkage factor.
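The two-part strategy in (10.54) is easy to assemble from two off-the-shelf boosted models: a classifier for Pr(Y > 0 | X), and a regressor for log Y fit to the positive catches only, whose predictions are then un-logged and multiplied together. The sketch below mimics this with scikit-learn rather than the R gbm package used by the authors; the simulated data, tree depth and shrinkage values are placeholder assumptions that only roughly mirror the text.

```python
# Sketch of the two-part model (10.54): a boosted classifier estimates
# Pr(Y > 0 | X); a boosted regressor, fit on positive catches only, models
# log(Y); predictions are un-logged and multiplied. Data are simulated
# placeholders, and the settings only roughly mirror those in the text.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(1)
N, p = 5000, 6
X = rng.standard_normal((N, p))
presence = rng.random(N) < 1.0 / (1.0 + np.exp(-X[:, 0]))   # Pr(Y > 0 | X)
y = np.where(presence, np.exp(0.5 * X[:, 1] + 0.2 * rng.standard_normal(N)), 0.0)

# Part 1: presence/absence model, binomial deviance loss.
clf = GradientBoostingClassifier(learning_rate=0.025, n_estimators=200,
                                 max_depth=10, random_state=0)
clf.fit(X, (y > 0).astype(int))

# Part 2: catch-size model on log(Y), positive catches only.
pos = y > 0
reg = GradientBoostingRegressor(learning_rate=0.01, n_estimators=200,
                                max_depth=10, random_state=0)
reg.fit(X[pos], np.log(y[pos]))

# Combine as in (10.54): un-logged regression times estimated Pr(Y > 0 | X).
expected_catch = np.exp(reg.predict(X)) * clf.predict_proba(X)[:, 1]
print(expected_catch[:5])
```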
3 The models, data, and maps shown here were kindly provided by Dr John Leathwick of the National Institute of Water and Atmospheric Research in New Zealand, and Dr Jane Elith, School of Botany, University of Melbourne. The collection of the research trawl data took place from 1979–2005, and was funded by the New Zealand Ministry of Fisheries.
4 Version 1.5-7 of package gbm in R, ver. 2.2.0.

FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic zone, showing the locations of 17,000 trawls (small blue dots) taken between 1979 and 2005. The red points indicate trawls for which the species Black Oreo Dory were present.

FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (and 1 × s.e. bars), and test deviance on the test data. Also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in left plot) and the GAM model. The AUC annotations in the right panel are 0.97 for GAM and 0.98 for GBM.
Figure 10.19 (left panel) shows the mean binomial deviance for the sequence of GBM models, both for 10-fold CV and test data. There is a modest improvement over the performance of a GAM model, fit using smoothing splines with 8 degrees-of-freedom (df) per term. The right panel shows the ROC curves (see Section 9.2.5) for both models, which measure predictive performance. From this point of view, the performance looks very similar, with GBM perhaps having a slight edge as summarized by the AUC (area under the curve). At the point of equal sensitivity/specificity, GBM achieves 91%, and GAM 90%.
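A comparison of this kind is easy to reproduce for any pair of probabilistic classifiers. The sketch below computes the AUC and the point of equal sensitivity and specificity from predicted test-set probabilities; y_test, prob_gbm and prob_gam are placeholder quantities standing in for the test labels and the two models' predictions, not outputs of the analysis above.

```python
# Sketch: compare two classifiers via ROC/AUC and the point of equal
# sensitivity and specificity. y_test, prob_gbm and prob_gam are placeholders
# standing in for the test labels and the two models' predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=1000)                            # placeholder labels
prob_gbm = np.clip(y_test * 0.6 + rng.random(1000) * 0.5, 0, 1)   # fake scores
prob_gam = np.clip(y_test * 0.5 + rng.random(1000) * 0.6, 0, 1)

for label, prob in [("GBM", prob_gbm), ("GAM", prob_gam)]:
    auc = roc_auc_score(y_test, prob)
    fpr, tpr, _ = roc_curve(y_test, prob)
    i = np.argmin(np.abs(tpr - (1.0 - fpr)))   # sensitivity = specificity
    print(f"{label}: AUC = {auc:.3f}, equal sens/spec around {tpr[i]:.2f}")
```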
Figure 10.20 summarizes the contributions of the variables in the logistic GBM fit. We see that there is a well-defined depth range over which Black Oreo are caught, with much more frequent capture in colder waters. We do not give details of the quantitative catch model; the important variables were much the same.
All the predictors used in these models are available on a fine geographical grid; in fact they were derived from environmental atlases, satellite images and the like; see Leathwick et al. (2006) for details. This also means that predictions can be made on this grid, and imported into GIS mapping systems. Figure 10.21 shows prediction maps for both presence and catch size, with both standardized to a common set of trawl conditions; since the predictors vary in a continuous fashion with geographical location, so do the predictions.

FIGURE 10.21. Geological prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.

FIGURE 10.20. The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables, all plotted on the same scale for comparison. In decreasing order of relative influence the variables are TempResid, AvgDepth, SusPartMatter, SalResid, SSTGrad, ChlaCase2, Slope, TidalCurr, Pentade, CodendSize, DisOrgMatter, Distance, Speed and OrbVel.
Because of their ability to model interactions and automatically select variables, as well as robustness to outliers and missing data, GBM models are rapidly gaining popularity in this data-rich and enthusiastic community.
10.14.3 Demographics Data
In this section we illustrate gradient boosting on a multiclass classification problem, using MART. The data come from 9243 questionnaires filled out by shopping mall customers in the San Francisco Bay Area (Impact Resources, Inc., Columbus, OH). Among the questions are 14 concerning demographics. For this illustration the goal is to predict occupation using the other 13 variables as predictors, and hence identify demographic variables that discriminate between different occupational categories. We randomly divided the data into a training set (80%) and test set (20%), and used J = 6 node trees with a learning rate ν = 0.1.
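A multiclass fit of this kind grows K trees per iteration, one for each class, as in Algorithm 10.4 at the end of the chapter. As a hedged illustration, scikit-learn's GradientBoostingClassifier follows the same pattern for K > 2 classes; the synthetic data below merely stand in for the proprietary questionnaire data, and the settings are illustrative.

```python
# Sketch of a K-class gradient boosting fit with J = 6 terminal-node trees
# and learning rate nu = 0.1. The synthetic data are placeholders for the
# demographics questionnaires, which are not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=13, n_informative=8,
                           n_classes=9, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gbm = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200,
                                 max_leaf_nodes=6, random_state=0)
gbm.fit(X_tr, y_tr)

pred = gbm.predict(X_te)
print(f"overall test error rate: {np.mean(pred != y_te):.3f}")
# Per-class error rates, in the spirit of Figure 10.22:
for k in range(9):
    mask = y_te == k
    print(f"class {k}: error rate {np.mean(pred[mask] != k):.3f}")
```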
Figure 10.22 shows the K = 9 occupation class values along with their corresponding error rates. The overall error rate is 42.5%, which can be compared to the null rate of 69% obtained by predicting the most numerous
class Prof/Man (Professional/Managerial). The four best predicted classes are seen to be Retired, Student, Prof/Man, and Homemaker.
Figure 10.23 shows the relative predictor variable importances as averaged over all classes (10.46). Figure 10.24 displays the individual relative importance distributions (10.45) for each of the four best predicted classes. One sees that the most relevant predictors are generally different for each respective class. An exception is age, which is among the three most relevant for predicting Retired, Student, and Prof/Man.
Figure 10.25 shows the partial dependence of the log-odds (10.52) on age for these three classes. The abscissa values are ordered codes for respective equally spaced age intervals. One sees that after accounting for the contributions of the other variables, the odds of being retired are higher for older people, whereas the opposite is the case for being a student. The odds of being professional/managerial are highest for middle-aged people. These results are of course not surprising. They illustrate that inspecting partial dependences separately for each class can lead to sensible results.
Bibliographic Notes
Schapire (1990) developed the first simple boosting procedure in the PAC learning framework (Valiant, 1984; Kearns and Vazirani, 1994).

FIGURE 10.22. Error rate for each occupation in the demographics data.
FIGURE 10.23. Relative importance of the predictors as averaged over all classes for the demographics data.
FIGURE 10.24. Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data.

FIGURE 10.25. Partial dependence of the odds of three different occupations on age, for the demographics data.
Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin-flip. After learning an initial classifier G1 on the first N training points,
• G2 is learned on a new sample of N points, half of which are misclassified by G1;
• G3 is learned on N points for which G1 and G2 disagree;
• the boosted classifier is GB = majority vote(G1, G2, G3).
Schapire’s “Strength of Weak Learnability” theorem proves that GB has improved performance over G1.
Freund (1995) proposed a “boost by majority” variation which combined many weak learners simultaneously and improved the performance of the simple boosting algorithm of Schapire. The theory supporting both of these
algorithms requires the weak learner to produce a classifier with a fixed error rate. This led to the more adaptive and realistic AdaBoost (Freund and Schapire, 1996a) and its offspring, where this assumption was dropped.
Freund and Schapire (1996a) and Schapire and Singer (1999) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory has evolved in the computational learning community, initially based on the concepts of PAC learning. Other theories attempting to explain boosting come from game theory (Freund and Schapire, 1996b; Breiman, 1999; Breiman, 1998), and VC theory (Schapire et al., 1998). The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice, boosting achieves results far more impressive than the bounds would imply. Schapire (2002) and Meir and Rätsch (2003) give useful overviews more recent than the first edition of this book.
Friedman et al. (2000) and Friedman (2001) form the basis for our exposition in this chapter. Friedman et al. (2000) analyze AdaBoost statistically, derive the exponential criterion, and show that it estimates the log-odds of the class probability. They propose additive tree models, the right-sized trees and ANOVA representation of Section 10.11, and the multiclass logit formulation. Friedman (2001) developed gradient boosting and shrinkage for classification and regression, while Friedman (1999) explored stochastic variants of boosting. Mason et al. (2000) also embraced a gradient approach to boosting. As the published discussions of Friedman et al. (2000) show, there is some controversy about how and why boosting works.
Since the publication of the first edition of this book, these debates have continued, and spread into the statistical community with a series of papers on consistency of boosting (Jiang, 2004; Lugosi and Vayatis, 2004; Zhang and Yu, 2005; Bartlett and Traskin, 2007). Mease and Wyner (2008), through a series of simulation examples, challenge some of our interpretations of boosting; our response (Friedman et al., 2008b) puts most of these objections to rest. A recent survey by Bühlmann and Hothorn (2008) supports our approach to boosting.
Exercises
Ex. 10.1 Derive expression (10.12) for the update parameter in AdaBoost.

Ex. 10.2 Prove result (10.16), that is, that the minimizer of the population version of the AdaBoost criterion is one-half of the log odds.
Ex. 10.3 Show that the marginal average (10.47) recovers additive and multiplicative functions (10.50) and (10.51), while the conditional expectation (10.49) does not.

Ex. 10.4
(a) Write a program implementing AdaBoost with trees.
(b) Redo the computations for the example of Figure 10.2. Plot the training error as well as test error, and discuss its behavior.
(c) Investigate the number of iterations needed to make the test error finally start to rise.
(d) Change the setup of this example as follows: define two classes, with the features in Class 1 being X1, X2, . . . , X10, standard independent Gaussian variates. In Class 2, the features X1, X2, . . . , X10 are also standard independent Gaussian, but conditioned on the event \sum_j X_j^2 > 12. Now the classes have significant overlap in feature space. Repeat the AdaBoost experiments as in Figure 10.2 and discuss the results.
Ex. 10.5 Multiclass exponential loss (Zhu et al., 2005). For a K-class classification problem, consider the coding Y = (Y1, . . . , YK)^T with

    Y_k = \begin{cases} 1, & \text{if } G = \mathcal{G}_k \\ -\frac{1}{K-1}, & \text{otherwise.} \end{cases}    (10.55)

Let f = (f1, . . . , fK)^T with \sum_{k=1}^{K} f_k = 0, and define

    L(Y, f) = \exp\left(-\frac{1}{K} Y^T f\right).    (10.56)

(a) Using Lagrange multipliers, derive the population minimizer f* of E(Y, f), subject to the zero-sum constraint, and relate these to the class probabilities.

(b) Show that a multiclass boosting using this loss function leads to a reweighting algorithm similar to AdaBoost, as in Section 10.4.
Ex. 10.6 McNemar test (Agresti, 1996). We report the test error rates on the spam data to be 5.5% for a generalized additive model (GAM), and 4.5% for gradient boosting (GBM), with a test sample of size 1536.
(a) Show that the standard error of these estimates is about 0.6%.
Since the same test data are used for both methods, the error rates are correlated, and we cannot perform a two-sample t-test. We can compare the methods directly on each test observation, leading to the summary
                 GBM Correct   GBM Error
    GAM Correct      1434           18
    GAM Error          33           51

The McNemar test focuses on the discordant errors, 33 vs. 18.
(b) Conduct a test to show that GAM makes significantly more errors
than gradient boosting, with a two-sided p-value of 0.036.
Ex. 10.7 Derive expression (10.32).
Ex. 10.8 Consider a K-class problem where the targets y_{ik} are coded as 1 if observation i is in class k and zero otherwise. Suppose we have a current model f_k(x), k = 1, . . . , K, with \sum_{k=1}^{K} f_k(x) = 0 (see (10.21) in Section 10.6). We wish to update the model for observations in a region R in predictor space, by adding constants f_k(x) + γ_k, with γ_K = 0.
(a) Write down the multinomial log-likelihood for this problem, and its first and second derivatives.
(b) Using only the diagonal of the Hessian matrix in (a), and starting from γ_k = 0 ∀k, show that a one-step approximate Newton update for γ_k is

    \gamma_k^1 = \frac{\sum_{x_i \in R} (y_{ik} - p_{ik})}{\sum_{x_i \in R} p_{ik}(1 - p_{ik})}, \quad k = 1, \ldots, K-1,    (10.57)

where p_{ik} = \exp(f_k(x_i)) / \sum_{l=1}^{K} \exp(f_l(x_i)).
(c) We prefer our update to sum to zero, as the current model does. Using symmetry arguments, show that

    \hat{\gamma}_k = \frac{K-1}{K}\left(\gamma_k^1 - \frac{1}{K}\sum_{l=1}^{K} \gamma_l^1\right), \quad k = 1, \ldots, K    (10.58)

is an appropriate update, where γ_k^1 is defined as in (10.57) for all k = 1, . . . , K.
Ex. 10.9 Consider a K-class problem where the targets yik are coded as 1 if observation i is in class k and zero otherwise. Using the multinomial deviance loss function (10.22) and the symmetric logistic transform, use the arguments leading to the gradient boosting Algorithm 10.3 to derive Algorithm 10.4. Hint: See exercise 10.8 for step 2(b)iii.
Ex. 10.10 Show that for K = 2 class classification, only one tree needs to be grown at each gradient-boosting iteration.
Ex. 10.11 Show how to compute the partial dependence function fS(XS) in (10.47) efficiently.
Ex. 10.12 Referring to (10.49), let S = {1} and C = {2}, with f(X1, X2) = X1. Assume X1 and X2 are bivariate Gaussian, each with mean zero, variance one, and E(X1X2) = ρ. Show that E(f(X1, X2) | X2) = ρX2, even though f is not a function of X2.

Algorithm 10.4 Gradient Boosting for K-class Classification.

1. Initialize f_{k0}(x) = 0, k = 1, 2, . . . , K.

2. For m = 1 to M:

   (a) Set

       p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^{K} e^{f_l(x)}}, \quad k = 1, 2, \ldots, K.

   (b) For k = 1 to K:

       i. Compute r_{ikm} = y_{ik} - p_k(x_i), i = 1, 2, . . . , N.

       ii. Fit a regression tree to the targets r_{ikm}, i = 1, 2, . . . , N, giving terminal regions R_{jkm}, j = 1, 2, . . . , J_m.

       iii. Compute

            \gamma_{jkm} = \frac{K-1}{K} \, \frac{\sum_{x_i \in R_{jkm}} r_{ikm}}{\sum_{x_i \in R_{jkm}} |r_{ikm}|(1 - |r_{ikm}|)}, \quad j = 1, 2, \ldots, J_m.

       iv. Update f_{km}(x) = f_{k,m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jkm} I(x \in R_{jkm}).

3. Output \hat{f}_k(x) = f_{kM}(x), k = 1, 2, . . . , K.
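For concreteness, the following is a minimal Python sketch of Algorithm 10.4, using scikit-learn's DecisionTreeRegressor as the base learner and applying the leaf update of step 2(b)iii directly. It omits shrinkage, subsampling and other refinements from Section 10.12; the function and variable names are illustrative.

```python
# Minimal sketch of Algorithm 10.4 (K-class gradient boosting). Each
# iteration fits one regression tree per class to the residuals
# r_ikm = y_ik - p_k(x_i) and applies the leaf update of step 2(b)iii.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_kclass(X, y, K, M=100, J=6):
    N = X.shape[0]
    Y = np.eye(K)[y]                       # one-hot targets y_ik
    f = np.zeros((N, K))                   # f_k0(x_i) = 0
    trees, gammas = [], []
    for m in range(M):
        p = np.exp(f - f.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)  # step 2(a), stabilized softmax
        trees_m, gam_m = [], []
        for k in range(K):
            r = Y[:, k] - p[:, k]                          # step 2(b)i
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)
            leaf = tree.apply(X)                           # terminal regions
            gam = {}
            for j in np.unique(leaf):                      # step 2(b)iii
                rj = r[leaf == j]
                denom = np.sum(np.abs(rj) * (1 - np.abs(rj))) + 1e-12
                gam[j] = (K - 1) / K * rj.sum() / denom
            f[:, k] += np.array([gam[j] for j in leaf])    # step 2(b)iv
            trees_m.append(tree)
            gam_m.append(gam)
        trees.append(trees_m)
        gammas.append(gam_m)
    return trees, gammas

# Usage: given X and y with labels in {0,...,K-1}, predictions follow by
# accumulating the per-leaf gamma values over m and taking argmax_k f_k(x).
```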