Ensemble methods
Data Mining
Prof. Dr. Matei Demetrescu
Statistics and Econometrics (CAU Kiel), Summer 2020
Moving further away from classical statistics
So far, we proceeded as follows:
1. Get (many) data, then
2. build a single, typically complex, predictor.
3. Don't forget to validate and test the prediction model.
We've also seen that, say, trees, nice and intuitive as they are, need some form of enhancement.
Today we combine several base models to obtain improved predictions and classifications.
Today’s outline
Ensemble methods
1. Model averaging
2. Bagging
3. Boosting
4. Up next
Model averaging
A Bayesian motivation
Let $f(x)$ be of interest (forecast function or conditional class probability),¹ and let $Z = \{Y_i, X_i\}_{i=1,\dots,n}$ be the training data.
Use Bayes' law with each candidate model $M_m$, $m = 1,\dots,M$:
$$P\left(f(x) \mid Z, M_m\right) \propto p\left(f(x)\right) L\left(f(x); Z, M_m\right),$$
where the prior $p(f(x))$ may be model-specific as well. Then we have the "overall" posterior
$$P\left(f(x) \mid Z\right) = \sum_{m=1}^{M} P\left(f(x) \mid Z, M_m\right) P\left(M_m \mid Z\right).$$
Models may differ in choice of model family or included predictors.
¹ We simplified a bit; in "genuine" Bayesian forecasting you need to derive the so-called predictive density, averaging future outcomes over their conditional distribution and the posterior distribution of the relevant parameters.
Bayesian averaging
For this posterior we have e.g. the posterior mean
$$E\left(f(x) \mid Z\right) = \sum_{m=1}^{M} E\left(f(x) \mid Z, M_m\right) P\left(M_m \mid Z\right).$$
This gives (roughly) an average of the individual forecasts.
Moreover, the individual weights are proportional to the posterior probability that model $M_m$ is the right one.
We may compute $P(M_m \mid Z)$ using Bayes' theorem as well; with uniform priors $P(M_m) = 1/M$, it also holds that²
$$P\left(M_m \mid Z\right) \approx \frac{\exp\left(-\tfrac{1}{2}\,\mathrm{BIC}(M_m)\right)}{\sum_{m=1}^{M} \exp\left(-\tfrac{1}{2}\,\mathrm{BIC}(M_m)\right)}.$$
² Sometimes you run into numerical problems if $\mathrm{BIC}(M_m)$ is large and negative; in that case replace the $\mathrm{BIC}(M_m)$ with $\mathrm{BIC}(M_m) - \min_m \{\mathrm{BIC}(M_m)\}$.
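As an illustration, a minimal sketch of these approximate posterior model weights (the BIC values below are hypothetical), including the shift by the minimum BIC mentioned in the footnote:

```python
import numpy as np

def bic_model_weights(bic_values):
    """Approximate posterior model probabilities from BIC values,
    assuming a uniform prior over the M candidate models."""
    bic = np.asarray(bic_values, dtype=float)
    bic = bic - bic.min()              # numerical stabilization (see footnote)
    w = np.exp(-0.5 * bic)
    return w / w.sum()

# Hypothetical BIC values for three candidate models:
weights = bic_model_weights([1250.3, 1248.1, 1260.7])
# Averaged forecast at x: weights @ [f1_hat(x), f2_hat(x), f3_hat(x)]
```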
And a frequentist view
The above averaging estimator can be used without the Bayes background.
We may also approach model averaging as an MSE optimality problem, i.e. determine optimal weights in the average forecast
$$\hat f(x) = \sum_{m=1}^{M} w_m \hat f_m(x).$$
To this end, we must minimize
$$E\left[\left(Y_i - \hat f(X_i)\right)^2\right],$$
i.e. pick the $w_m$ as the slope coefficients of a linear regression of $Y_i$ on the $\hat f_m(X_i)$.
This is unstable if $M$ is large (one may pick $w_m = 1/M$ in this case). For classification, use e.g. inverse error rates as weights.
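A minimal sketch of the regression-based weights, assuming `preds` is an $n \times M$ matrix collecting the individual models' (validation-sample) forecasts:

```python
import numpy as np

def averaging_weights(preds, y):
    """MSE-optimal combination weights: OLS slopes from regressing the
    outcomes y on the n x M matrix of individual forecasts."""
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    return w

# preds_new @ averaging_weights(preds, y) then gives the combined forecast;
# with many models, falling back to equal weights 1/M is often safer.
```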
And a mixture of approaches: Complete subset regressions
Simple idea: combine the forecasts from all possible (linear) regression models that keep the number of predictors fixed.
Model complexity is measured by the number of included features; this trades off the bias and variance of the forecast errors.
Thus, we may pick a number of regressors (much) smaller than $n$, … but at the same time allow for many model configurations.
Of course, choosing the optimal complexity is the tricky part…
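A brute-force sketch of complete subset regressions for a fixed subset size $k$ (the function name is made up; for large $p$ one would sample subsets instead of enumerating all of them):

```python
import numpy as np
from itertools import combinations

def complete_subset_forecast(X, y, x_new, k):
    """Average the forecasts of all OLS regressions (with intercept)
    that use exactly k of the p available predictors."""
    n, p = X.shape
    forecasts = []
    for subset in combinations(range(p), k):
        cols = list(subset)
        Xs = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        forecasts.append(beta[0] + x_new[cols] @ beta[1:])
    return np.mean(forecasts)
```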
Digression: Weak and strong learners
A learner is just a predictive model (waiting to be) trained on data.
A strong learner is a method that can learn an optimal population prediction/classification rule arbitrarily well.
A weak learner does better than tossing a (fair) coin, but only offers a rough approximation to the optimal decision.
Example
Trying to decide whether an email is ham or spam, a rule of thumb like
Classify to spam if subject line contains “lose weight”
is a weak learner. A strong learner would do a serious text analysis.
Averaging weak learners
Model averaging was developed for several strong learners, none of which is uniformly better than the others.³
Model averaging does work (with some care) with weak learners too.
This is attractive because weak learners are computationally cheap. They also have high bias, so one hopes that combining them averages out the bias.
Fortunately, there are more reliable methods of aggregating weak learners…
³ By the way, if one of the models underperforms systematically, do not consider it for averaging!
Bagging
Averaging
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a strong or moderate learner.
Note first that model averaging is useless if models are very similar.
Averages have smallest variance for independent summands.
This is highly unlikely, since the models are all trained on the same data! Since we can't generate new independent training data sets,
we take repeated samples from the (single) training data set.⁴
Such bootstrapping amounts to drawing data from the empirical cdf, which is consistent for the population cdf!
(The bootstrap may also be used to approximate standard errors or cdfs of intractable statistics, btw.; see Advanced Statistics III for more details.)
⁴ A variant using subsamples drawn without replacement results in so-called dagging, i.e. disjoint sample aggregation.
Bagging classification trees
To sum up, we generate B different bootstrapped training data sets.
Then train the learner on the $b$th bootstrapped training set to obtain $\hat f^{*b}(x)$, the prediction at a point $x$.
Finally, average all the predictions to obtain
$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x).$$
For classification
Record the class predicted by each of the B models for given x, and
Take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
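A minimal sketch of this recipe for regression, assuming the base learner follows the usual fit/predict interface; for classification one would replace the average by a majority vote:

```python
import numpy as np
from copy import deepcopy

def bag(base_learner, X, y, B=100, seed=0):
    """Fit B copies of base_learner, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # draw n rows with replacement
        models.append(deepcopy(base_learner).fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X_new):
    """Average the B predictions (for classes: take a majority vote instead)."""
    return np.mean([m.predict(X_new) for m in models], axis=0)
```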
Out-of-Bag Error Estimation
There is a simple way to estimate the error rates of a bagged model:
On average, each bagged model makes use of around two-thirds of the training sample.
The remaining one-third are referred to as the out-of-bag (OOB) observations.
We can predict the response for the ith observation using each of the models in which that observation was OOB.
This will yield around B/3 predictions for the ith observation, which we average.
This estimate is essentially the LOO cross-validation error for bagging, if B is large.
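A sketch of the OOB error (here the MSE) that goes with the bagging sketch above; it assumes the bootstrap index sets were stored alongside the fitted models:

```python
import numpy as np

def oob_mse(models, boot_indices, X, y):
    """Out-of-bag error: predict observation i only with the models whose
    bootstrap sample did not contain i, then average and compare to y_i."""
    n = len(y)
    preds, counts = np.zeros(n), np.zeros(n)
    for model, idx in zip(models, boot_indices):
        oob = np.setdiff1d(np.arange(n), idx)         # left-out observations
        preds[oob] += model.predict(X[oob])
        counts[oob] += 1
    used = counts > 0                                 # skip points never left out
    return np.mean((y[used] - preds[used] / counts[used]) ** 2)
```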
Especially for trees: Random Forests
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees (variance reduction!).
As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees,
Each time a split in a tree is considered, a random selection of $m$ predictors is chosen as split candidates from the full set of $p$ predictors.
The split is allowed to use only one of those $m$ predictors.
A fresh selection of $m$ predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$.
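In practice one rarely codes this by hand; e.g. scikit-learn's random forest implements the decorrelation trick directly (hypothetical data names below):

```python
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train are hypothetical training data.
rf = RandomForestClassifier(
    n_estimators=500,       # number of bootstrapped trees B
    max_features="sqrt",    # m ~ sqrt(p) candidate predictors at each split
    oob_score=True,         # report the out-of-bag accuracy as a by-product
    random_state=0,
)
rf.fit(X_train, y_train)
print(rf.oob_score_)        # OOB estimate of the prediction accuracy
```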
Enhancing interpretability
Give an indication of predictor importance
For regression trees, record the total amount by which the RSS decreases due to splits over a given predictor, and average over the B samples.
For classification trees, add up the total amount by which the Gini index is decreased by splits over a given predictor, averaged over all B trees.
Generate explanations
Fit a simple model to the predictions fˆ(Xi);
Get some class representatives (surrogates) via nearest neighbours;
And don’t forget feature engineering/dimensionality reduction.
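For a forest fitted as above, impurity-based importances of this kind are available directly (assuming `rf` is the fitted scikit-learn forest from the previous sketch):

```python
import numpy as np

importances = rf.feature_importances_        # mean impurity (Gini/RSS) decrease per predictor
top = np.argsort(importances)[::-1][:10]     # ten most important predictors
for j in top:
    print(f"feature {j}: importance {importances[j]:.3f}")
```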
Boosting
The basic idea – classification
Averaging and bagging are not designed to work with weak learners.
Starting point: many flies can't be wrong.⁵
Instead of building (a few) strong or moderate learners, train many weak learners that are good at different parts of the input space.
Then take a weighted aggregation of the predictions/classifications.
Boosting: single learners that are most "confident" get a higher weight, and each focuses on a particular subset of the input space. (Bagging achieves this by random resampling.)
The ensemble does better than single learners!
⁵ Yeah, sure.
Making boosting work
1. We start with a method of building weak learners (e.g. trees);
2. call this method repeatedly, each time with focus on different subsets of the data;
3. the subsets come from samples of the data with different weights each time.
And how should we …
Reweight the data in each boosting step/round?
Combine the weak rules into a single (strong?) rule?
Decide when to stop?
The boosting principle
Boosting was introduced for classification tasks, and the key idea is:
Data that have been misclassified in previous rounds should get higher weights!
This ensures that we improve overall performance after each round. (In a prediction setup, just build on the magnitude of the residuals.)
And a bit more obvious:
Prediction should be a simple weighted majority (or average) of rules.
All boosting algorithms vary along these lines.
Adaptive boosting
AdaBoost is popular
We are given training pairs $(Y_i, X_i)$ with $Y_i = \pm 1$ (standard notation in the boosting literature), and some weak classifier $f(x)$.
We start with equal weights for each data point, $w_i^{(1)} = \frac{1}{n}$.
1. In each round $b$, we train the classifier $f$ on the data weighted with $w_i^{(b)}$ to obtain $\hat f^{(b)}(x)$.
2. We then update the weights as follows (where $\bar w^{(b+1)}$ is a normalization constant):
$$w_i^{(b+1)} = \frac{w_i^{(b)}}{\bar w^{(b+1)}} \cdot \begin{cases} e^{-\alpha^{(b)}} & \text{if } Y_i = \hat f^{(b)}(X_i) \\ e^{\alpha^{(b)}} & \text{if } Y_i \neq \hat f^{(b)}(X_i) \end{cases} \;=\; \frac{w_i^{(b)}}{\bar w^{(b+1)}}\, e^{-\alpha^{(b)} Y_i \hat f^{(b)}(X_i)}.$$
The final classifier is given by
$$\hat f(x) = \mathrm{sgn}\left(\sum_{b=1}^{B} \alpha^{(b)} \hat f^{(b)}(x)\right).$$
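An off-the-shelf implementation of this scheme is available in scikit-learn; a minimal usage sketch (hypothetical data, default decision-stump base learner):

```python
from sklearn.ensemble import AdaBoostClassifier

# X_train, y_train, X_test are hypothetical; labels coded as -1 / +1.
ada = AdaBoostClassifier(n_estimators=200, random_state=0)  # default weak learner: a stump
ada.fit(X_train, y_train)
y_hat = ada.predict(X_test)                  # sign of the weighted vote
p_hat = ada.predict_proba(X_test)[:, 1]      # estimated class probabilities
```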
[Figure: relative performance on a deterministic problem.]
The essential details
The trick is to choose $\alpha^{(b)} > 0$ in each boosting step such that "difficult" sample points are assigned a higher weight in the next round.
The AdaBoost weights are picked as
$$\alpha^{(b)} = \frac{1}{2} \log \frac{1 - \mathrm{Err}^{(b)}}{\mathrm{Err}^{(b)}},$$
where $\mathrm{Err}^{(b)}$ is the weighted classification error in round $b$,
$$\mathrm{Err}^{(b)} = \sum_{i=1}^{n} w_i^{(b)}\, I\left(Y_i \neq \hat f^{(b)}(X_i)\right).$$
As long as the classification error is nontrivial ($\mathrm{Err}^{(b)} < 0.5$), $\alpha^{(b)} > 0$ and the training error goes down exponentially fast.
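A from-scratch sketch of these updates with decision stumps as weak learners, labels in $\{-1, +1\}$ and no safeguard against a zero training error (hypothetical data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, B=100):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # w_i^(1) = 1/n
    learners, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))              # weighted classification error
        alpha = 0.5 * np.log((1 - err) / err)      # alpha^(b)
        w = w * np.exp(-alpha * y * pred)          # up-weight misclassified points
        w = w / w.sum()                            # normalization (w_bar)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_score(learners, alphas, X_new):
    """Weighted vote sum_b alpha^(b) * f^(b)(x); classify by its sign."""
    return sum(a * m.predict(X_new) for a, m in zip(alphas, learners))
```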
[Figure: AdaBoost tends to avoid overfitting.]
Estimating conditional class probabilities
We may often want to estimate $p(x) = P(Y = 1 \mid X = x)$ as well. Writing $f(x)$ for the boosted score, AdaBoost minimizes (an empirical version of)
$$E\left[e^{-Y f(X)}\right] = E\left[P(Y = 1 \mid X)\, e^{-f(X)} + P(Y = -1 \mid X)\, e^{f(X)}\right].$$
This is minimized for
$$f(x) = \frac{1}{2} \log \frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)},$$
or
$$P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-2 f(x)}}.$$
Therefore, simply use $1 / \left(1 + e^{-2 \sum_{b=1}^{B} \alpha^{(b)} \hat f^{(b)}(x)}\right)$.
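Continuing the from-scratch sketch above, the probability estimate is a one-liner on top of the boosted score (it assumes `adaboost_score` from the earlier snippet):

```python
import numpy as np

def adaboost_proba(learners, alphas, X_new):
    """Estimate P(Y = 1 | X = x) as 1 / (1 + exp(-2 * boosted score))."""
    score = adaboost_score(learners, alphas, X_new)
    return 1.0 / (1.0 + np.exp(-2.0 * score))
```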
Multiclass problems
Should there be several classes, say $\mathcal{R}(Y) = \{1, \dots, K\}$,
… the cheapest solution is to reduce the situation to a binary problem.
Replace each training data point $(Y, X)$ with $Y = k$ by $K$ artificial data points, one per candidate label $j \in \{1, \dots, K\}$, carrying binary label $+1$ if $j = k$ and $-1$ otherwise:
$$(k, X) \;\Rightarrow\; \bigl((1, X), -1\bigr), \;\dots,\; \bigl((k, X), +1\bigr), \;\dots,\; \bigl((K, X), -1\bigr),$$
and classify to the label receiving the most weighted votes.
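A small sketch of this expansion (the helper name is made up); the resulting binary data set can then be fed to any binary boosting algorithm:

```python
import numpy as np

def expand_multiclass(X, y, K):
    """Replace each (x_i, y_i) with y_i in {1,...,K} by K rows ((j, x_i), z),
    where z = +1 if j equals y_i and z = -1 otherwise."""
    rows, labels = [], []
    for x, k in zip(X, y):
        for j in range(1, K + 1):
            rows.append(np.concatenate(([j], x)))    # prepend the candidate label j
            labels.append(1 if j == k else -1)
    return np.asarray(rows), np.asarray(labels)
```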
Effect of outliers
(Ada)Boosting may identify outliers, since it focuses on examples that are hard to categorize.
But: too many outliers …
can dramatically increase the time to convergence,
can degrade classification performance.
Some more general remarks
The total number of rounds $B$ is the only parameter to tune.⁶
We may choose a small number of degrees of freedom for the weak learner, since performance does not hinge on them.
Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all.
We use cross-validation to select B.
Boosting has nice statistical properties and interpretations (which we did not cover here).
Since it can work with any weak learner, it can handle any kind of data (at least in principle).
Boosting is fast to evaluate (linear-additive), and can be fast to train – depending on the weak classifier.
⁶ The weak learner may have some too, but they're not important as long as the weak learner does not overfit.
And some disadvantages
Performance depends on the data and the weak learner:
may fail if the weak classifiers overfit (complexity!)
may also fail in practice if weak classifiers are too weak.
Outliers and noise may also blur things
AdaBoost seems to be particularly sensitive to uniform noise (no proper information and thus random emphasis of particular cases)
This goes for boosted regression trees as well.
Boosting regression trees
Boosting algorithm for regression
1. Set $\hat f(x) = 0$ and $r_i = Y_i$ for all $i$ in the training set.
2. For $b = 1, 2, \dots, B$, repeat:
   1. Fit a weak learner to the training data $(r^{(b)}, X)$ (take e.g. a shallow tree $\hat f^{(b)}$ with $d$ splits, i.e. $d + 1$ terminal nodes).
   2. Update $\hat f$ with a shrunken version of the freshly fitted weak learner:
   $$\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^{(b)}(x).$$
   3. Then update the residuals,
   $$r_i^{(b+1)} = Y_i - \hat f(X_i) \quad \text{or} \quad r_i^{(b+1)} = r_i^{(b)} - \lambda \hat f^{(b)}(X_i).$$
3. Output the boosted model,
$$\hat f(x) = \sum_{b=1}^{B} \lambda \hat f^{(b)}(x).$$
Any weak learner may be used! E.g. $\hat f^{(b)}(x)$ could be a prediction from the best-fitting simple linear regression in step $b$.
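A minimal sketch of this loop with shallow regression trees as weak learners ($d + 1$ terminal nodes enforced via `max_leaf_nodes`); data names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, B=1000, lam=0.01, d=1):
    """Fit shallow trees to the current residuals, shrink by lam, accumulate."""
    r = y.astype(float).copy()                           # r_i = Y_i, f_hat = 0
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)   # d splits
        tree.fit(X, r)                                   # fit weak learner to residuals
        r -= lam * tree.predict(X)                       # update the residuals
        trees.append(tree)
    return trees

def boosted_predict(trees, X_new, lam=0.01):
    """f_hat(x) = sum_b lam * f^(b)(x)."""
    return lam * sum(t.predict(X_new) for t in trees)
```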
Recall the idea behind this procedure?
Fitting a single strong learner to the data runs the risk of overfitting (not to mention computational costs).
The boosting approach instead learns incrementally:
Each new learner learns something new as it fits to previous residuals.
By fitting weak learners to the residuals, we slowly improve fˆ in areas where it does not perform well (yet).
The shrinkage parameter λ slows the process down even further, allowing more, and differently shaped, trees to attack the residuals.
Tuning parameters for boosting
1. The number of trees $B$. Cross-validate.
2. The shrinkage parameter $\lambda$, a small positive number; typical values are 0.01 or 0.001, and the right choice can depend on the problem.
Very small $\lambda$ can require using a very large value of $B$ in order to achieve good performance; be patient.
Cross-validating it may be overkill, though.
3. The complexity of the weak learner, which influences the complexity of the boosted ensemble.
Keep it low – the boosting algorithm takes care of the rest.
For trees, $d = 1$ often works well, in which case each tree is a stump, consisting of a single split.⁷
This could be cross-validated if you have no prior experience with the particular type of data you're analyzing.
⁷ More generally, $d$ is the interaction depth and influences the interaction order of the boosted model, since $d$ splits can involve at most $d$ variables.
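One way to tune $B$ and $\lambda$ jointly is a grid search with cross-validation; a sketch using scikit-learn's gradient boosting (which follows the same residual-fitting recipe) on hypothetical data:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500, 1000, 2000],   # B
    "learning_rate": [0.1, 0.01, 0.001],      # shrinkage lambda
}
gb = GradientBoostingRegressor(max_depth=1)    # stumps as weak learners
search = GridSearchCV(gb, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)                   # X_train, y_train: hypothetical data
print(search.best_params_)
```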
Up next
Coming up
Neural networks