Tree-Based Methods
April 15, 2020
April 15, 2020 1 / 106
1 The basics of decision trees.
2 Bagging, random forests and boosting
April 15, 2020 2 / 106
About this chapter
• Decision trees: split on one variable at a time, sequentially, creating rectangular regions.
• Fitting/prediction is done locally within each region.
• They are intuitive, easy to implement, and often have good interpretability.
• They generally have lower prediction accuracy.
• Bagging, random forests and boosting … make fitting/prediction based on a number of trees.
• Bagging and Boosting are general methodologies, not just limited to trees.
April 15, 2020 3 / 106
Figure: CART and C4.5
April 15, 2020 4 / 106
Figure: Historical story about CART
April 15, 2020 5 / 106
The basics of decision trees.
• Trees can be applied to both regression and classification.
• CART refers to classification and regression trees.
• We first consider regression trees through an example of predicting Baseball players’ salaries.
Regression trees
April 15, 2020 6 / 106
The basics of decision trees.
• Response/Output: Salary.
• Covariates/Inputs:
Years (the number of years that he has played in the major leagues)
Hits (the number of hits that he made in the previous year).
• Preparing the data: remove the observations with missing values and log-transform Salary (to prevent heavy right-skewness).
The Hitters data
April 15, 2020 7 / 106
The basics of decision trees.
Figure: 1. (Caption on the next page.) Regression tree for the Hitters data: the root split is Years < 4.5; the right branch is split again at Hits < 117.5; the three leaves have mean log salaries 5.11, 6.00 and 6.74.
April 15, 2020 8 / 106
The basics of decision trees.
Figure 1. For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year. At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years< 4.5, and the right-hand branch corresponds to Years≥ 4.5. The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.
April 15, 2020 9 / 106
The basics of decision trees.
Figure: 2. The three-region partition for the Hitters data set from the regression tree illustrated in Figure 1. Horizontal axis: Years (from 1 to 24, split at 4.5); vertical axis: Hits (from 1 to 238, split at 117.5); the regions are labeled R1, R2 and R3.
April 15, 2020 10 / 106
The basics of decision trees.
Estimation/prediction
• On the regions R1, R2 and R3, the mean log salary is 5.107, 5.999 and 6.740, respectively.
• Our predictions for any player in R1, R2 and R3 are, respectively, 1,000 × e^5.107 = $165,174, 1,000 × e^5.999 = $402,834, and 1,000 × e^6.740 = $845,346.
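A minimal R sketch that reproduces this analysis, assuming the ISLR and tree packages are installed (the slides do not show code for this step; pruning back to three leaves is my way of matching Figure 1):
library(ISLR)   # Hitters data
library(tree)   # regression trees
# Remove observations with missing values and log-transform the response
hit <- na.omit(Hitters)
hit$logSalary <- log(hit$Salary)
# Fit a regression tree on Years and Hits, then prune to three leaves
fit  <- tree(logSalary ~ Years + Hits, data = hit)
fit3 <- prune.tree(fit, best = 3)
plot(fit3); text(fit3)
# Predicted salaries (in $1,000s) are exp() of the leaf means
head(exp(predict(fit3)))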
April 15, 2020 11 / 106
The basics of decision trees.
Estimation/prediction
• Trees involve a series of splits of the data, one variable at a time.
• This sequence of splits, carried out sequentially, creates a tree-like structure.
• As in Figure 1, the terminal nodes are the three indexed by the numbers, which represent the regions R1, R2 and R3. These regions constitute the final partition of the data.
• Terminal nodes are also called leaves.
• Each internal node represents a split.
• In Figure 1, the two internal nodes are indexed by Years < 4.5 and Hits < 117.5.
• The lines connecting nodes are called branches.
• Trees are typically drawn upside down.
April 15, 2020 12 / 106
The basics of decision trees.
Two steps towards prediction
• Run the splits on the input values sequentially, and obtain the final partition of the data into regions R1, ..., RJ.
• For any new observation whose covariates fall in region Rk, we predict its response by the average of the responses of the training data points in region Rk.
April 15, 2020 13 / 106
The basics of decision trees.
How to split
• Suppose we wish to partition a region R. In other words, we wish to separate the data in region R into two parts, say R1 and R2, according to the values of one input.
• What would be the optimal or efficient split, in some sense?
• There are only two choices in a split: 1. the input variable to split on, and 2. the cutpoint of the split for that chosen input.
• Imagine that this is the final split of R: R1 and R2 would be leaves, and we would use the mean response of the data in R1 and R2 to predict the response of any new/old observation.
• We wish our choice of R1 and R2 to be optimal in the sense of achieving minimum prediction error on the training data in region R.
April 15, 2020 14 / 106
The basics of decision trees.
Recursive binary splitting
• A greedy algorithm (greedy means it is optimal at the current step): for j = 1, ..., p and all real values s, let R1(j, s) = {i ∈ R : Xj < s} and R2(j, s) = {i ∈ R : Xj ≥ s}, and choose the pair (j, s) whose split minimizes the training error (the residual sum of squares within R1 and R2).
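A minimal R sketch of one greedy split by residual sum of squares (illustrative only; the function best_split and its arguments are my own, not from the slides):
# One greedy binary split: search over variables j and cutpoints s,
# minimizing the within-region residual sum of squares.
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))) {
      left <- X[, j] < s
      if (!any(left) || all(left)) next
      rss <- sum((y[left] - mean(y[left]))^2) +
        sum((y[!left] - mean(y[!left]))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best   # best$j is the splitting variable, best$s the cutpoint
}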
Bagging, random forests and boosting
Figure: AdaBoost vs. LogitBoost, plotted against the number of iterations (0 to 1000). Panels: misclassification loss, exponential loss of the probability, squared loss of the probability, and log loss of the probability. Simulated data: xi i.i.d. from [0,1]^d; the class label is related only to the first J variables (through whether Σ_{j=1}^{J} xij > J/2). We set q = 0.1, J = 5, d = 20 and n = 200, and used stumps as base learners.
April 15, 2020 84 / 106
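A minimal gbm-based sketch of this comparison under the setup above (my own illustration, not the slides' code; treating q as a label-flip probability is my assumption; distribution = "adaboost" uses the exponential loss and "bernoulli" the binomial deviance):
library(gbm)
set.seed(1)
n <- 200; d <- 20; J <- 5; q <- 0.1
x <- matrix(runif(n * d), n, d)
y <- as.numeric(rowSums(x[, 1:J]) > J / 2)
flip <- runif(n) < q            # label noise: my reading of q
y[flip] <- 1 - y[flip]
dat <- data.frame(y = y, x)
# Boosted stumps (interaction.depth = 1) under the two losses
fit_ada <- gbm(y ~ ., data = dat, distribution = "adaboost",
               n.trees = 1000, interaction.depth = 1, shrinkage = 0.1)
fit_log <- gbm(y ~ ., data = dat, distribution = "bernoulli",
               n.trees = 1000, interaction.depth = 1, shrinkage = 0.1)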
Bagging, random forests and boosting
R package: gbm
library(gbm)
# Simulate data
set.seed(101) # for reproducibility
N <- 1000
X1 <- runif(N)
X2 <- 2 * runif(N)
X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
X4 <- factor(sample(letters[1:6], N, replace = TRUE))
X5 <- factor(sample(letters[1:3], N, replace = TRUE))
X6 <- 3 * runif(N)
mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu
sigma <- sqrt(var(Y) / SNR)
Y <- Y + rnorm(N, 0, sigma)
X1[sample(1:N, size = 500)] <- NA # introduce some missing values
X4[sample(1:N, size = 300)] <- NA # introduce some missing values
data <- data.frame(Y, X1, X2, X3, X4, X5, X6)
April 15, 2020
85 / 106
Bagging, random forests and boosting
R package: gbm
# Fit a GBM
set.seed(102) # for reproducibility
gbm1 <- gbm(Y ~ ., data = data, var.monotone = c(0, 0, 0, 0, 0, 0),
distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
verbose = FALSE, n.cores = 1)
# Check performance using the out-of-bag (OOB) error; the OOB error typically
# underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1, method = "OOB")
print(best.iter)
# Check performance using the 50% heldout test set
best.iter <- gbm.perf(gbm1, method = "test")
print(best.iter)
# Check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1, method = "cv")
print(best.iter)
# Plot relative influence of each variable
par(mfrow = c(1, 2))
summary(gbm1, n.trees = 1) # using first tree
summary(gbm1, n.trees = best.iter) # using estimated best number of trees
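As a small follow-up (not shown in the slides), predictions at the estimated best number of trees could be obtained with predict():
# Predict on the training data frame at the estimated best number of trees
Yhat <- predict(gbm1, newdata = data, n.trees = best.iter)
mean((data$Y - Yhat)^2)   # in-sample mean squared error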
April 15, 2020
86 / 106
Bagging, random forests and boosting
Penalized regression
• Consider the dictionary of all possible J-terminal node regression trees T = {Tk} that could be realized on the training data as the basis functions. The linear model is
    f(x) = Σ_{k=1}^{K} αk Tk(x),
where K = card(T ).
• The coefficients are to be estimated by penalized regression
    min_α { Σ_{i=1}^{n} ( yi − Σ_{k=1}^{K} αk Tk(xi) )^2 + λ · J(α) },
where
    J(α) = Σ_{k=1}^{K} |αk|^2 (ridge),    or    J(α) = Σ_{k=1}^{K} |αk| (lasso).
April 15, 2020 87 / 106
Bagging, random forests and boosting
• Sparse regression is reasonable, since it is likely that only a small fraction of all possible J-terminal node trees will be relevant in approximating the target function.
• Owing to the very large number of basis functions Tk, directly solving the lasso-penalized regression is computationally infeasible.
Forward stagewise approximation works!
1. Initialize αk = 0, k = 1, ..., K. Set ε > 0 to a small constant.
2. For m = 1 to M (a large number),
   (a) Find a good basis function to approximate the current residual:
       (β∗, k∗) = argmin_{β,k} Σ_{i=1}^{n} [ ( yi − Σ_{l=1}^{K} αl Tl(xi) ) − β Tk(xi) ]^2,
       where yi − Σ_{l=1}^{K} αl Tl(xi) is the current residual.
   (b) Update the corresponding coefficient: αk∗ ← αk∗ + ε · sign(β∗).
3. Output fM(x) = Σ_{k=1}^{K} αk Tk(x).
April 15, 2020
88 / 106
Bagging, random forests and boosting
The Lasso and Boosting
Start with the centered response y ← y − mean(y) and assume each xj is standardized.
The Lasso
    β̂(t) = argmin_β (1/2) ∥y − Xβ∥²_2, subject to ∥β∥_1 ≤ t.
Boosting (stagewise fitting) for linear regression
• Start with β(0) = 0. Repeat, for k = 1, 2, 3, ...
• Compute the negative gradient g = Xᵀ(y − Xβ(k−1)).
• Find j∗ = argmax_j |gj|.
• Update only the j∗-th coefficient by a small amount, say ε:
    βj∗(k) = βj∗(k−1) + ε · sign(gj∗).
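A minimal R sketch of this stagewise loop (my own illustration; the problem sizes and variable names are assumptions, and the columns of X here play the role of the basis functions Tk on the previous slides):
# Forward stagewise (epsilon-boosting) for linear regression
set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))        # standardized inputs
y <- drop(X[, 1:3] %*% c(3, -2, 1)) + rnorm(n)
y <- y - mean(y)                              # centered response
eps <- 0.01; M <- 2000
beta <- rep(0, p)
for (k in 1:M) {
  g <- drop(t(X) %*% (y - X %*% beta))        # negative gradient
  j <- which.max(abs(g))                      # coordinate with largest |g_j|
  beta[j] <- beta[j] + eps * sign(g[j])       # small update of that coefficient
}
round(beta, 2)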
April 15, 2020
89 / 106
Bagging, random forests and boosting
Figure: Profiles of estimated coefficients from linear regression, for the prostate data. The left panel shows the results from the lasso, for different values of the bound parameter t = Σ_k |αk|. The right panel shows the results of forward stagewise linear regression, using M = 220 consecutive steps of size ε = 0.01.
April 15, 2020 90 / 106
Bagging, random forests and boosting
Algorithm Perspective
The boosting algorithm can also be written as follows:
• Start with β(0) = 0.
• Repeat, for k = 1, 2, 3, ...
  • Compute the gradient g = −Xᵀ(y − Xβ(k−1)).
  • Update β(k) = β(k−1) + ∆, where
        ∆ = argmin_{z∈R^p} gᵀz, s.t. ∥z∥_1 ≤ ε.
At each iteration, forward stagewise balances decreases in the loss function f against increases in the L1 norm.
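To see that this matches the coordinate update on the previous slide (a short derivation, not spelled out in the slides): the minimizer of gᵀz over ∥z∥_1 ≤ ε places all of its mass on one coordinate,
    ∆ = −ε · sign(g_{j∗}) e_{j∗},  where j∗ = argmax_j |gj|,
so each step changes only the coordinate with the largest absolute gradient, by ±ε.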
April 15, 2020
91 / 106
Bagging, random forests and boosting
A more general framework
Consider a more general convex problem
    x̂(t) = argmin_{x∈R^p} f(x), s.t. g(x) ≤ t,    (15)
where f, g : R^p → R are both convex functions, and f is differentiable. An approximate solution path can be generated from the following general stagewise procedure.
Algorithm. Fix ε > 0 and t0 ∈ R; initialize x(0) = x̂(t0), a solution of (15) at t = t0. Repeat, for k = 1, 2, 3, ...,
    x(k) = x(k−1) + ∆,  where ∆ = argmin_{z∈R^p} ⟨∇f(x(k−1)), z⟩, s.t. g(z) ≤ ε.
• Intuition: move in a direction that minimizes the inner product with the gradient of f, but simultaneously restrict this direction to be small under g.
April 15, 2020 92 / 106
Bagging, random forests and boosting
Steepest descent method
• The first-order Taylor approximation of f(x + v) around x is
    f(x + v) ≈ f̂(x + v) = f(x) + ∇f(x)ᵀv,
where ∇f(x)ᵀv is the directional derivative.
• Let ∥ · ∥ be any norm on R^n. We define the normalized steepest descent direction as
    ∆x_nsd = argmin_v { ∇f(x)ᵀv : ∥v∥ ≤ 1 }.
• Update x ← x + t∆x_nsd, where t is a small step size.
• The general framework simply replaces the norm ∥ · ∥ by g(·).
April 15, 2020 93 / 106
Bagging, random forests and boosting
Quadratic regularization
Consider the problem of the form
    β̂(t) ∈ argmin_{β∈R^p} f(β), s.t. βᵀQβ ≤ t,    (16)
where Q ⪰ 0.
• f(β) = (1/2)∥y − Xβ∥²_2 and Q = I leads to ridge regression, where βᵀβ = ∥β∥²_2 ≤ t.
• When Q ≻ 0, the Algorithm can be initialized with t0 = 0 and β(0) = 0. The update follows the lemma below.
Lemma. For g(β) = βᵀQβ, where Q is a positive definite matrix, the general stagewise procedure in the Algorithm repeats the updates β(k) = β(k−1) + ∆, where
    ∆ = −ε · Q⁻¹∇f / [ (∇f)ᵀQ⁻¹(∇f) ]^{1/2}.
When Q = I, we have ∆ = −ε · ∇f / ∥∇f∥_2. (Leave it as an exercise.)
April 15, 2020 94 / 106
Bagging, random forests and boosting
Lasso vs. Forward Stagewise Regression
Figure: Comparison of lasso and forward stagewise paths on simulated regression data. The number of samples is 60 and the number of variables 1000. The forward-stagewise paths fluctuate less than those of lasso in the final stages of the algorithms.
April 15, 2020 95 / 106
Bagging, random forests and boosting
Lasso vs. Forward Stagewise Regression
Figure: Mean squared error for lasso and forward stagewise on the simulated data. Despite the difference in the coefficient paths, the two models perform similarly over the critical part of the regularization path. In the right tail, lasso appears to overfit more rapidly.
April 15, 2020 96 / 106
Bagging, random forests and boosting
Discussion: Explicit and implicit regularization
April 15, 2020 97 / 106
Bagging, random forests and boosting
Discussion: Degrees of freedom
Figure: Degrees of freedom, df = (1/σ²) Σ_{i=1}^{n} Cov(yi, ŷi). Panels: ridge regression and gradient descent (df versus ∥β∥/max∥β∥, with the true ridge df for reference); forward stepwise, lasso, relaxed lasso 0 and relaxed lasso 0.5 (df versus the number of nonzero coefficients and versus |β|/max|β|); gradient boosting trees with GBM shrinkage 0.05, 0.2 and 0.8 (df versus the number of trees). Setting: n = 70, p = 30, y = Xβ + ε, xij ∼ N(0,1), SNR = 3:7. See my source code “df demo.html” on Canvas.
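A minimal Monte Carlo sketch of this df definition, illustrated for ridge regression (my own illustration, not the “df demo.html” code; the penalty value and coefficient vector are arbitrary choices):
# df = (1/sigma^2) * sum_i Cov(y_i, yhat_i), estimated over B simulated responses
set.seed(1)
n <- 70; p <- 30; sigma <- 1; lambda <- 5
X <- matrix(rnorm(n * p), n, p)
mu <- drop(X %*% c(rep(1, 5), rep(0, p - 5)))
B <- 500
Y <- matrix(mu, n, B) + sigma * matrix(rnorm(n * B), n, B)
H <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X))    # ridge hat matrix
Yhat <- H %*% Y
df_mc <- sum(sapply(1:n, function(i) cov(Y[i, ], Yhat[i, ]))) / sigma^2
c(monte_carlo = df_mc, trace_formula = sum(diag(H)))     # should agree closely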
April 15, 2020 98 / 106
Bagging, random forests and boosting
Discussion: Bias-Variance Tradeoff for Gradient Boosting
Figure: Bias-variance tradeoff for gradient boosting trees (GBM shrinkage 0.05, 0.2 and 0.8). Panels show squared bias, variance, and prediction error versus the number of trees (0 to 5000). Setting: n = 100, p = 50, y = Xβ + ε, xij ∼ N(0,1), SNR = 1:1. See my source code “Bias Variance Boosting.html” on Canvas.
April 15, 2020
99 / 106
Bagging, random forests and boosting
Discussion: Bias-Variance Tradeoff for Gradient Boosting
Figure: Log scale. Gradient boosting trees (GBM shrinkage 0.05, 0.2 and 0.8); panels show log(squared bias), log(variance), and log(prediction error) versus the number of trees (0 to 5000). Setting: n = 100, p = 50, y = Xβ + ε, xij ∼ N(0,1), SNR = 1:1. See my source code “Bias Variance Boosting.html” on Canvas.
April 15, 2020 100 / 106
Bagging, random forests and boosting
Discussion: Bias-Variance Tradeoff for Gradient Boosting
Figure: Log scale. Gradient boosting trees (GBM shrinkage 0.05, 0.2 and 0.8); panels show log(squared bias), log(variance), and log(prediction error) versus the number of trees (0 to 5000). Setting: n = 100, p = 50, y = Xβ + ε, xij ∼ N(0,1), SNR = 9:1. See my source code “Bias Variance Boosting.html” on Canvas.
April 15, 2020 101 / 106
Bagging, random forests and boosting
Discussion: “Bet on sparsity” principle
April 15, 2020 102 / 106
Bagging, random forests and boosting
Discussion: “Bet on sparsity” principle
April 15, 2020 103 / 106
Bagging, random forests and boosting
April 15, 2020 104 / 106
Bagging, random forests and boosting
Summary: Statistical view of boosting methods
Boosting methods have three important properties that contribute to their success:
• they fit an additive model in a flexible set of basis functions.
• they use a suitable loss function for the fitting process.
• they regularize by forward stagewise fitting; with shrinkage this mimics an L1 (Lasso) penalty on the weights.
April 15, 2020 105 / 106
Bagging, random forests and boosting
Suggested Readings
Yang C.
Stories of Statistical learning (in Chinese),
COS, 2011, https: //cosx.org/2011/12/stories-about-statistical-learning.
David Mease and Abraham Wyner
“Evidence Contrary to the Statistical View of Boosting” (with discussion)
JMLR 9 (2008), 131-201.
Robert E. Schapire and Yoav Freund
Boosting: Foundations and Algorithms
MIT Press, 2012.
Tomaso Poggio et al.
Theory of Deep Learning III: Explaining the Non-Overfitting Puzzle
arXiv, Feb. 2018.
April 15, 2020 106 / 106