Lecture 23: Ensembles of trees CS 189 (CDSS offering)
2022/03/18
Today’s lecture
Today, we will discuss the concept of ensembling: combining many "weak" learners into a strong learner
We will use decision trees as the vessel for studying this concept
We will focus on two types of ensembling: bootstrap aggregation ("bagging") and boosting — these are random forests and boosted trees, respectively, when combined with decision trees
Though many types of models can be ensembled, decision trees are the most common model that is combined with ensembling
What’s a decision tree?
Decision tree pruning
At a high level
If decision trees are grown too large, they often overfit the training data
We can prevent this via regularization, e.g.:
restrict the maximum depth of the tree
do not make splits that result in too few data points or small information gain
Otherwise, we may need to prune the tree after training
At a high level, we get rid of splits that have the lowest information gain or
contribute the least to validation accuracy until the tree is an acceptable size
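For concreteness, here is a minimal scikit-learn sketch of these regularization and pruning knobs; the dataset, hyperparameter values, and variable names are illustrative assumptions, not part of the lecture.

# A minimal sketch of regularizing and pruning a decision tree with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Regularization at training time: cap the depth, require enough points per leaf,
# and skip splits with too little impurity decrease (information gain)
regularized = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=10,
    min_impurity_decrease=1e-3,
    random_state=0,
).fit(X_train, y_train)

# Pruning after training: cost-complexity pruning removes the weakest splits;
# in practice ccp_alpha would be chosen by validation accuracy
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("regularized val acc:", regularized.score(X_val, y_val))
print("pruned val acc:     ", pruned.score(X_val, y_val))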
Advantages of decision trees
Decision trees are often said to be simple to understand and interpret
They’re probably at least more likely to be understandable and interpretable,
compared to other models we have studied and will study
Decision trees are a natural fit for categorical data and often require little to no
data preprocessing
Decision trees can give the practitioner an intuition for which features are the
most important (split first), which features are irrelevant, etc.
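As a small illustration of this last point, here is a sketch using scikit-learn's impurity-based feature importances; the dataset choice is an assumption for the example.

# A minimal sketch of reading off which features a fitted tree relies on.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Features split on early/often get larger importance scores;
# irrelevant features get scores near zero
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")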
Disadvantages of decision trees
Decision trees can overfit, and to tackle this, we need regularization
However, regularizing decision trees will often make them underfit instead…
Because of this, decision trees by themselves don't actually work very well
But they can work very well when ensembled!
However, this takes away some of the advantages, such as being simple to understand and interpret, so it's all a trade-off
The motivation of ensembling
Or, the "wisdom of the crowd"
A classic example: in 1906, 787 people guessed the weight of an ox, whose real weight was 1197 pounds
Median guess: 1207 pounds
Mean guess: 1197 pounds
In this example, the people are the “weak” learners, and we ensemble them together to get greatly improved accuracy
How do we ensemble?
Most of the learning algorithms we have studied thus far are deterministic, i.e.,
with a fixed training set, we will always get the same model
How do we get diversity in the models in our ensemble?
• Idea #1: randomize the data each model sees as well as the algorithm itself
We will see this in bagging and random forests
• Idea #2: learn the models sequentially, and specialize them to different data
We will see this in boosting
Bootstrap aggregation (“bagging”)
In statistics, bootstrap sampling refers to sampling from the observed dataset with replacement; in bagging, we average models trained on these bootstrap samples to reduce the variance of our predictions
• We are given a dataset (x1, y1), …, (xN, yN)
Suppose we wish to learn an ensemble of M models
To generate a different training set for a model, we randomly sample up to N
times from the dataset with replacement
So we will get M different datasets for training
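Here is a minimal sketch of bagging done by hand: train M trees on bootstrap samples of the data. The dataset, the value of M, and the tree settings are illustrative assumptions.

# A minimal sketch of bagging: M trees, each trained on a bootstrap sample.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
N = len(X)
M = 25
rng = np.random.default_rng(0)

ensemble = []
for _ in range(M):
    # Sample N indices with replacement to form this model's training set
    idx = rng.integers(0, N, size=N)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    ensemble.append(tree)

print(f"trained {len(ensemble)} trees on bootstrapped datasets")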
Random forests
A random forest classifier is an ensemble of M small decision trees, trained on bootstrapped sampled versions of the dataset — call this "data bagging"
Random forests also use "feature bagging": at each split, consider only a random subset of the features to split on rather than all possible splits
• A good default: if there are d features, consider only √d of them
This further decorrelates the trees, especially if there are only a few features that are very informative, and also speeds up training
The resulting classifier has drastically lower variance than a single large tree
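A minimal scikit-learn sketch of a random forest follows; the hyperparameters and dataset are illustrative assumptions.

# A minimal random forest sketch: data bagging plus feature bagging.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# n_estimators = M bagged trees; max_features="sqrt" is the feature-bagging default
# (consider only about sqrt(d) of the d features at each split)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())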
Boosted ensembles
An alternative to bagging is boosting, which learns ensembles sequentially
At a high level, each model in the ensemble is trained with a focus on what the
previous models have gotten wrong
This is done either by reweighting the data points or having each model learn residual predictions after subtracting the previous models' predictions
The details become gory very quickly, and they're not that important
• If you’re curious, https://hastie.su.domains/Papers/ESLII.pdf Chapter 10 gives this topic very thorough coverage
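To make the two flavors mentioned above concrete, here is a sketch using scikit-learn's boosted ensembles: AdaBoost reweights data points, while gradient boosting fits each new tree to residual-like errors. The dataset and hyperparameters are illustrative assumptions.

# A minimal sketch of the two boosting flavors, via scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)          # reweights points
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3,     # fits residuals
                                 random_state=0)

for name, model in [("AdaBoost", ada), ("Gradient boosting", gbt)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())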
Making predictions with ensembles
Prediction on a new point, for a bagged ensemble such as a random forest, works
as you would expect
For classification, take a majority vote of the ensemble members, and for
regression, take the average of the predictions
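A minimal sketch of this is below; it assumes `ensemble` is a list of fitted trees (as in the bagging sketch earlier), `X_new` is an array of new points, and class labels are nonnegative integers. These names are assumptions for illustration.

# A minimal sketch of prediction for a bagged ensemble.
import numpy as np

def predict_classification(ensemble, X_new):
    # votes[m, i] = class predicted by model m for point i
    votes = np.stack([tree.predict(X_new) for tree in ensemble]).astype(int)
    # majority vote over the M models, separately for each point
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])

def predict_regression(ensemble, X_new):
    # average the M models' predictions for each point
    preds = np.stack([tree.predict(X_new) for tree in ensemble])
    return preds.mean(axis=0)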
Prediction for boosted ensembles is actually more complicated, for technical
reasons which we can’t get into since we didn’t do the math to begin with
At a high level, we actually take a weighted average of the models’ predictions,
where the weights are roughly related to how good each model is
Again, find details in the link on the previous slide if you’re curious
Decision trees are very popular for many machine learning applications due to
their simplicity, interpretability (sometimes), and ease of use
But single decision trees are bad — if grown large enough to fit the training data
well (low bias), they exhibit high variance and overfit
Ensembling via either bagging or boosting reduces variance and keeps bias low
Very good implementations of both random forests and boosted trees exist that many people use for research, industry, Kaggle competitions, etc.
E.g., check out XGBoost for gradient boosted trees
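As a closing sketch, here is XGBoost via its scikit-learn-style interface; the hyperparameters and dataset are illustrative assumptions, not recommended settings.

# A minimal XGBoost sketch for gradient boosted trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators boosted trees of depth 3, each scaled by the learning rate
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))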