
Machine Learning and Data Mining in Business
Lecture 9: Gradient Boosting
Discipline of Business Analytics


Lecture 9: Gradient Boosting
Learning objectives
• Boosting.
• Gradient boosting.
• XGBoost, LightGBM and CatBoost.

Lecture 9: Gradient Boosting
1. Boosting
2. Least squares boosting
3. Gradient boosting
4. Practical details
5. Explainable Boosting Machine

Boosting

Boosting is one of the most powerful ideas in machine learning. In boosting, we consider an additive model of the form
f(x; θ) = ∑_{m=1}^{M} β_m F(x; θ_m),
where F(x; θ_m) is a basis function and β_m is the corresponding weight, which we often set to β_m = 1.

f(x) = ∑_{m=1}^{M} β_m F(x; θ_m)
We can think of this as a linear model with adaptive basis functions.

f(x) = ∑_{m=1}^{M} β_m F(x; θ_m)
• In boosting, each basis function F(x; θ_m) is typically a weak learner, in the sense that it is only weakly related to the response (regression) or slightly better than random guessing (classification).
• The idea of boosting is to sequentially combine a large number of weak learners that add up to a final strong learner f(x).

f(x) = ∑_{m=1}^{M} β_m F(x; θ_m)
• Boosting is a nonparametric method, since the model does not have a fixed number of parameters. The more basis functions we include, the higher the capacity of the model.
• The larger the size of the training data, the more basis functions we include. Therefore, the capacity of the model adapts to the size of the data.

Boosted trees
Small decision trees are the most common choice of basis functions for boosting.
• Recall from last week that trees have many useful properties, such as the ability to model interactions, insensitivity to monotone transformations, and robustness to outliers in the predictor space, among others.
• These properties make trees good choices of weak learners in practice.

Boosted trees
The boosted trees model is
f(x) = ∑_{m=1}^{M} F(x; θ_m),
where F(x; θ_m) are decision trees with a fixed number of leaves and θ_m are the parameters of the tree (split variables, split points, and predictions at the leaves).

Model fitting
Consider the following optimisation problem:
minimise_{ {θ_m}_{m=1}^{M} } ∑_{i=1}^{n} L( y_i, ∑_{m=1}^{M} F(x_i; θ_m) ).
Unfortunately, this problem is computationally intractable.

Forward stagewise additive modelling
• In forward stagewise additive modelling, we approximate the problem of fitting an additive model by sequentially adding and fitting a single basis function F(x; θ_m) at a time.
• At each iteration m, we add a new basis function F(x; θ_m) to the model without updating the basis functions F(x; θ_1), …, F(x; θ_{m−1}) that are already in the model.

Forward stagewise additive modelling
At iteration m, we compute
(β_m, θ_m) = argmin_{β_m, θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + β_m F(x_i; θ_m) ),
and update the model as
f_m(x) = f_{m−1}(x) + β_m F(x; θ_m).

Forward stagewise additive modelling
Algorithm Forward stagewise additive modelling
Initialise f_0(x) = argmin_{β_0} ∑_{i=1}^{n} L(y_i, β_0).
For m = 1 to M do:
    Compute (β_m, θ_m) = argmin_{β_m, θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + β_m F(x_i; θ_m) ).
    Update f_m(x) ← f_{m−1}(x) + β_m F(x; θ_m).
The final model is f_M(x).

Forward stagewise additive modelling
Applying the algorithm to boosted trees, we simply solve
θ_m = argmin_{θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + F(x_i; θ_m) ),
and update the model as
f_m(x) = f_{m−1}(x) + F(x; θ_m).

Boosting vs. random forests
Despite the similarities, these two learning algorithms are fundamentally different.
• A random forest is an average of deep trees fit independently to different bootstrap samples.
• Boosting fits an additive model based on shallow trees by forward stagewise additive modelling.
• In boosting, the growth of each tree takes into account the trees that are already in the model.

Least squares boosting

Least squares boosting
Suppose that we use the squared error loss. We consider the optimisation problem
minimise_{ {θ_m}_{m=1}^{M} } ∑_{i=1}^{n} ( y_i − ∑_{m=1}^{M} F(x_i; θ_m) )².

Least squares boosting
At the first iteration (m = 1), the problem simplifies to that of fitting a regression tree
minimise_{θ_1} ∑_{i=1}^{n} ( y_i − F(x_i; θ_1) )²,
which we can do by recursive binary splitting.

Least squares boosting
Consider now the second iteration (m = 2). By forward stagewise additive modelling, we want to fit a new tree by solving
minimise_{θ_2} ∑_{i=1}^{n} ( y_i − F(x_i; θ_1) − F(x_i; θ_2) )²,
where y_i − F(x_i; θ_1) is the residual from the first tree. The problem simplifies to that of fitting a regression tree to the residuals of the current model!

Least squares boosting
At the next iteration, we analogously solve
minimise_{θ_3} ∑_{i=1}^{n} ( y_i − F(x_i; θ_1) − F(x_i; θ_2) − F(x_i; θ_3) )².
The key detail is that θ_1 and θ_2 are fixed in this equation. We fit a new tree at each iteration without revisiting the fit of previous trees.

Least squares boosting
At each iteration m, we solve
minimise_{θ_m} ∑_{i=1}^{n} ( y_i − f_{m−1}(x_i) − F(x_i; θ_m) )²,
where
f_{m−1}(x) = ∑_{k=1}^{m−1} F(x; θ_k)
is the current fit. The problem reduces to that of fitting a regression tree to the current residuals r_{im} = y_i − f_{m−1}(x_i).

Illustration: least squares boosting
Image credit: “Probabilistic Machine Learning” (2021) by K. P. Murphy

In practice, we shrink the contribution of each new tree by applying the modified update
f_m(x) = f_{m−1}(x) + η F(x; θ_m),
where we call η ∈ (0, 1) the learning rate.
This is a regularisation strategy. By shrinking the contribution of each tree, we leave more room for future trees to improve the model.

Least squares boosting
Algorithm Least squares boosting
1: Set the number of trees M and the maximum depth d of each tree.
2: Initialise f_0(x) = ȳ and r_i = y_i − ȳ.
3: For m = 1 to M do:
4:     Fit a regression tree F_m of depth d to the training data {(r_i, x_i)}_{i=1}^{n}.
5:     Update the regression function: f_m(x) ← f_{m−1}(x) + η F_m(x).
6:     Update the residuals: r_i ← r_i − η F_m(x_i).
7: Output the boosted model f(x) = f_M(x).
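As an illustration, here is a minimal Python sketch of the algorithm above, using scikit-learn regression trees as the weak learners. The function names and default values are our own, not from the lecture.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost_fit(X, y, M=500, depth=3, eta=0.1):
    """Least squares boosting: each new tree is fit to the current residuals."""
    f0 = y.mean()                                      # f_0(x) = ybar
    residuals = y - f0                                 # r_i = y_i - ybar
    trees = []
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, residuals)                         # fit F_m to {(r_i, x_i)}
        residuals = residuals - eta * tree.predict(X)  # r_i <- r_i - eta * F_m(x_i)
        trees.append(tree)
    return f0, trees

def ls_boost_predict(X, f0, trees, eta=0.1):
    """Boosted prediction f_M(x) = ybar + eta * sum_m F_m(x)."""
    return f0 + eta * sum(tree.predict(X) for tree in trees)
```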

Gradient boosting

Gradient boosting
At the m-th iteration of boosting, we want to solve
minimise_{θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + F(x_i; θ_m) ).
We would therefore like to find the regions {R_j}_{j=1}^{J} and predicted values {w_j}_{j=1}^{J} for the tree F(x; θ_m) that minimise the objective function given the current model f_{m−1}(x).

Gradient boosting
minimise_{θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + F(x_i; θ_m) )
Given the tree, finding the optimal predicted value for each leaf is typically straightforward by computing
ŵ_j = argmin_w ∑_{x_i ∈ R_j} L( y_i, f_{m−1}(x_i) + w ).
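For example (our illustration, not from the slides), with the squared error loss L(y, f) = (y − f)², the optimal leaf value is simply the average residual of the training points in the leaf:
ŵ_j = (1 / |R_j|) ∑_{x_i ∈ R_j} ( y_i − f_{m−1}(x_i) ),
where |R_j| is the number of training points falling in leaf j.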

Gradient boosting
minimise_{θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + F(x_i; θ_m) )
• It’s not computationally feasible to enumerate all possible trees and optimise over them. We need a fast algorithm to grow the tree.
• Except in special cases such as least squares boosting, standard algorithms for tree induction are not applicable to boosting.

Gradient boosting machines
In gradient boosting machines, the original version of gradient boosting, we grow a tree by approximating a solution to
minimise_{θ_m} ∑_{i=1}^{n} ( −g_{im} − F(x_i; θ_m) )²,
where
g_{im} = dL(y_i, ŷ_i)/dŷ_i evaluated at ŷ_i = f_{m−1}(x_i),
using fast algorithms for regression tree induction.
Given the regions R_j specified by the solution, we can directly update the predictions for each leaf.
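A minimal sketch of this idea for a generic differentiable loss (our own illustration, not the lecture's code): each tree is fit to the negative gradients −g_im, and for simplicity the per-leaf update of the predictions described above is replaced by a plain learning-rate step.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient(y, f, loss="squared"):
    """-dL(y, yhat)/dyhat evaluated at yhat = f."""
    if loss == "squared":                 # L = (y - yhat)^2 / 2
        return y - f
    if loss == "absolute":                # L = |y - yhat|
        return np.sign(y - f)
    raise ValueError(f"unknown loss: {loss}")

def gbm_fit(X, y, M=300, depth=3, eta=0.1, loss="squared"):
    f = np.full(len(y), np.mean(y), dtype=float)   # constant initial fit
    trees = []
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, neg_gradient(y, f, loss))      # regression tree approximates -g_im
        f = f + eta * tree.predict(X)              # functional gradient step
        trees.append(tree)
    return np.mean(y), trees
```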

XGBoost, which stands for “extreme gradient boosting”, optimises the following regularised objective:
minimise_{θ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + F(x_i; θ_m) ) + Ω(θ_m),
where Ω(θ_m) penalises the complexity of the new tree F(x; θ_m).

XGBoost uses a second-order Taylor approximation of the loss function around the current fit,
L( y_i, f_{m−1}(x_i) + F(x_i; θ_m) ) ≈ L( y_i, f_{m−1}(x_i) ) + g_{im} F(x_i; θ_m) + (1/2) h_{im} F(x_i; θ_m)²,
where
g_{im} = dL(y_i, ŷ_i)/dŷ_i evaluated at ŷ_i = f_{m−1}(x_i),
h_{im} = d²L(y_i, ŷ_i)/dŷ_i² evaluated at ŷ_i = f_{m−1}(x_i).
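For concreteness, g_im and h_im could be computed as follows for two standard losses (our illustration; the squared error loss uses the 1/2 convention so that h_im = 1):
```python
import numpy as np

def grad_hess_squared(y, yhat):
    """L(y, yhat) = (y - yhat)^2 / 2  ->  g = yhat - y,  h = 1."""
    return yhat - y, np.ones_like(yhat)

def grad_hess_logistic(y, yhat):
    """Logistic loss with y in {0, 1} and yhat the log-odds f_{m-1}(x)."""
    p = 1.0 / (1.0 + np.exp(-yhat))     # predicted probability
    return p - y, p * (1.0 - p)         # g = p - y,  h = p(1 - p)
```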

The penalty term is
Ω(θ_m) = γJ + (λ/2) ∑_{j=1}^{J} w_j²,
where γ and λ are hyperparameters, J is the number of leaves, and w_j is the parameter for each leaf. Alternatively, we can use ℓ1 regularisation.

Therefore, we have the following approximate objective function for fitting a new tree:
∑_{i=1}^{n} [ g_{im} F(x_i; θ_m) + (1/2) h_{im} F(x_i; θ_m)² ] + γJ + (λ/2) ∑_{j=1}^{J} w_j²,
where we dropped constant terms.

Let q : ℝ^p → {1, …, J} be a function that maps each x to a leaf j. Given the tree structure specified by q, the cost function is
L(w; q) = ∑_{i=1}^{n} [ g_{im} F(x_i; θ_m) + (1/2) h_{im} F(x_i; θ_m)² ] + γJ + (λ/2) ∑_{j=1}^{J} w_j²
        = ∑_{j=1}^{J} [ w_j ( ∑_{i∈I_j} g_{im} ) + (1/2) w_j² ( λ + ∑_{i∈I_j} h_{im} ) ] + γJ,
where I_j denotes the set of indices of data points assigned to leaf j. Note that
F(x_i; θ_m) = ∑_{j=1}^{J} w_j I(i ∈ I_j).

J2  2 
L(w; q) = 􏰐 wj 􏰐 gim + wj λ + 􏰐 him + γJ j=1 i∈Ij i∈Ij
Taking the partial derivative with respect to wj, 
∂wj     ∂L(θ;q)= 􏰐gim +wj λ+􏰐him .

Setting the partial derivative to zero gives the first-order condition
∑_{i∈I_j} g_{im} + w_j* ( λ + ∑_{i∈I_j} h_{im} ) = 0,
which we can solve for w_j* to obtain
w_j* = − ( ∑_{i∈I_j} g_{im} ) / ( λ + ∑_{i∈I_j} h_{im} ).
We work with convex loss functions in practice, so this is the minimum.
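In code, the optimal leaf value is a one-liner (hypothetical helper for illustration):
```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """w_j* = -sum(g_im) / (lambda + sum(h_im)) over the points in leaf j."""
    return -np.sum(g) / (lam + np.sum(h))
```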

Now, let’s go back to the cost function
L(θ_m) = ∑_{i=1}^{n} [ g_{im} F(x_i; θ_m) + (1/2) h_{im} F(x_i; θ_m)² ] + γJ + (λ/2) ∑_{j=1}^{J} w_j²
to plug in the solution:
w_j* = − ( ∑_{i∈I_j} g_{im} ) / ( λ + ∑_{i∈I_j} h_{im} ).

L(q) = ∑_{j=1}^{J} [ w_j* ( ∑_{i∈I_j} g_{im} ) + (1/2) (w_j*)² ( λ + ∑_{i∈I_j} h_{im} ) ] + γJ
     = ∑_{j=1}^{J} [ − ( ∑_{i∈I_j} g_{im} )² / ( λ + ∑_{i∈I_j} h_{im} ) + (1/2) ( ∑_{i∈I_j} g_{im} )² / ( λ + ∑_{i∈I_j} h_{im} ) ] + γJ
     = − (1/2) ∑_{j=1}^{J} ( ∑_{i∈I_j} g_{im} )² / ( λ + ∑_{i∈I_j} h_{im} ) + γJ.

Therefore, XGBoost uses the following cost function to evaluate the tree structure:
L(q) = − (1/2) ∑_{j=1}^{J} G_j² / (λ + H_j) + γJ,
where G_j = ∑_{i∈I_j} g_{im} and H_j = ∑_{i∈I_j} h_{im}.

Suppose that we consider splitting a node with data indices I into left and right nodes, I = I_L ∪ I_R. The reduction in the cost function from such a split is:
gain = (1/2) [ G_L² / (λ + H_L) + G_R² / (λ + H_R) − (G_L + G_R)² / (λ + H_L + H_R) ] − γ,
where G_L = ∑_{i∈I_L} g_{im}, H_L = ∑_{i∈I_L} h_{im}, G_R = ∑_{i∈I_R} g_{im} and H_R = ∑_{i∈I_R} h_{im}.
Therefore, it is only worth splitting a node when the improvement in the loss exceeds γ, that is, when the gain defined above is positive.
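The gain could be evaluated as in the sketch below (our own naming; the actual XGBoost implementation scans sorted feature values and uses histogram approximations rather than this naive search):
```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain from splitting a node with gradient/Hessian vectors g, h into left/right."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G ** 2 / (lam + H)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Naive search over thresholds of one feature x; split only if the gain is positive."""
    best_t, best_gain = None, -np.inf
    for t in np.unique(x)[:-1]:
        gain = split_gain(g, h, x <= t, lam=lam, gamma=gamma)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```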

XGBoost
✓ High predictive accuracy.
✓ Regularisation.
✓ Automatic handling of missing values.
✓ Feature interaction constraints.
✓ Monotonicity constraints.
✓ Uses various computer science methods for speed and scalability.
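A usage sketch with the xgboost package's scikit-learn interface (the hyperparameter values are arbitrary illustrations, and X_train, y_train, X_val, y_val are assumed to exist):
```python
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,       # number of trees M
    learning_rate=0.05,     # shrinkage eta
    max_depth=4,            # depth of each weak learner
    reg_lambda=1.0,         # l2 penalty lambda on the leaf weights
    gamma=0.0,              # minimum loss reduction gamma required to split
    subsample=0.8,          # row subsampling
    colsample_bytree=0.8,   # column subsampling
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
preds = model.predict(X_val)
```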

LightGBM is a gradient tree boosting library that is similar to XGBoost.
✓ High predictive accuracy.
✓ Tends to be even faster than XGBoost.
✓ Built-in support for categorical features.
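A comparable sketch with LightGBM's scikit-learn interface (illustrative values; columns stored with the pandas 'category' dtype are treated as categorical by default):
```python
from lightgbm import LGBMRegressor

# Assumes X_train, X_val are pandas DataFrames whose categorical columns
# use the 'category' dtype, and y_train is the response.
model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,   # leaf-wise growth: complexity is controlled via the number of leaves
)
model.fit(X_train, y_train)
preds = model.predict(X_val)
```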

Leaf-wise growth
LightGBM performs leaf-wise growth:
Image credit: LightGBM documentation.

Level-wise growth
Image credit: LightGBM documentation.

CatBoost is another very popular library for gradient tree boosting.
✓ High predictive accuracy.
✓ Built-in support for categorical features.
✓ Uses a modified algorithm that mitigates target leakage.
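And a similar sketch with CatBoost (illustrative values; cat_cols is an assumed list of categorical column names):
```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=False,
)
model.fit(X_train, y_train, cat_features=cat_cols)
preds = model.predict(X_val)
```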

Practical details

Hyperparameters
The most important hyperparameters for boosting are:
• Number of trees (M): more trees lead to higher capacity and a lower training error. Boosting is typically slow to overfit the data as we increase M.
• Learning rate (η): a lower η shrinks every tree and slows down the learning process. Smaller values of η means that we need a larger M to obtain any given training error, so that there is a trade-off between M and η. Typically η < 0.1 leads to the best performance. • Maximum tree depth (d): a higher d leads to a higher order of interaction between predictors and higher model complexity. We typically set 3 ≤ d ≤ 8. Early stopping Early stopping is a method for selecting the number of boosting iterations M. 1. Set the maximum number of boosting iterations. 2. At the end of each iteration m (or at every few iterations), compute the validation (or cross-validation) error of the current model. 3. If the validation error stops improving or starts deteriorating after a certain number of iterations, stop the algorithm. 4. Output the model with lowest validation error up to that point. Illustration: early stopping Subsampling Another regularisation technique for boosting is subsampling. • Row subsampling (stochastic gradient boosting): similarly to bagging and pasting, we can sample a fraction of observations (without replacement) to grow the new tree at each iteration. • Column subsampling: as in random forests, we use a random subset of predictors as candidate split variables (either for the entire tree or for each node). In addition to the regularising effect, subsampling speeds up the computations. Partial dependence plot A partial dependence plot for feature k is a plot of g(xk) = n1 􏰐f(xi,−k,xk) against xk, where xi,−k denotes xi with element k removed. That is, the partial dependence function evaluated for a feature value xk is the average prediction if we force all data points to assume that feature value. Suppose that we have the following input data: Index Size Bedrooms Quality 113382 8 212802 7 316163 8 418044 7 516553 6 To evaluate the partial dependence function for Size at 1500, we average all the predictions for the modified dataset: Index Size Bedrooms Quality 115002 8 215002 7 315003 8 415004 7 515003 6 Partial dependence plot Partial dependence plot 􏰗 Simple and intuitive. 􏰖 Ignores the dependence between the inputs, which makes interpretation subject to unrealistic combinations of feature values. Explainable Boosting Machine Explainable Boosting Machine The explainable boosting machine (EBM) fits a generalised additive model f(x)=β0+􏰐fj(xj)+ 􏰐 fjk(xj,xk) j=1 (j,k)∈I with automatic interaction detection by cyclic gradient boosting. 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com