Random Forests and other Ensemble Methods
Gerhard Neumann
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
Recap: Regression and Classification Trees
Grow a binary tree
• At each node, “split” the data into two “daughter” nodes.
• Splits are chosen using a splitting criterion.
• Bottom nodes are “terminal” nodes.
For regression:
• the predicted value at a node is the average response variable for all observations
in the node.
For classification:
• the predicted class is the most common class in the node (majority vote).
• Can also get estimated probability of membership in each of the classes
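As a quick reminder of the mechanics, a minimal scikit-learn sketch of both cases (the dataset and settings are illustrative, not part of the lecture):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X, y = load_iris(return_X_y=True)

    # Classification: predicted class = majority vote in the terminal node
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict(X[:2]), clf.predict_proba(X[:2]))   # class + membership probabilities

    # Regression: predicted value = average response in the terminal node
    # (here we predict petal width, column 3, from the other three features)
    reg = DecisionTreeRegressor(max_depth=3).fit(X[:, :3], X[:, 3])
    print(reg.predict(X[:2, :3]))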
Recap: A classification tree
Recap: A regression tree
Predict (log) prostate specific antigen from
• Log cancer volume
• Log prostate weight
Recap: Splitting criterion (that's the first mathy slide…)
• Regression: Minimum residual sum of squares
  RSS = \sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2
– where \bar{y}_L and \bar{y}_R are the average label values in the left and right subtree
– Split such that the variance in the subtrees is minimized (a minimal sketch follows below)
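A minimal numpy sketch of scoring a regression split by residual sum of squares (function names and data are illustrative, not a library's internals):

    import numpy as np

    def rss_of_split(x, y, threshold):
        """Residual sum of squares if we split feature x at the given threshold."""
        left, right = y[x <= threshold], y[x > threshold]
        rss = 0.0
        for part in (left, right):
            if len(part) > 0:
                rss += np.sum((part - part.mean()) ** 2)
        return rss

    # try every midpoint between sorted feature values and keep the best
    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    y = np.array([1.1, 0.9, 1.0, 3.0, 3.2, 2.9])
    candidates = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2
    best = min(candidates, key=lambda t: rss_of_split(x, y, t))
    print(best)   # 4.5, the threshold between the two clusters of responses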
Recap: Splitting criterion (that's the second mathy slide…)
• Classification: Minimum entropy in subtrees
  H_L = -\sum_k p_{L,k} \log p_{L,k}
– where H_L is the entropy in the left sub-tree
– and p_{L,k} is the proportion of class k in the left sub-tree
• Entropy is a measure of uncertainty
– Split such that class labels in the sub-trees are "pure" (a minimal sketch follows below)
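The same idea for classification, as a minimal numpy sketch (illustrative helper functions, not a library's internals):

    import numpy as np

    def entropy(labels):
        """Entropy H = -sum_k p_k log2 p_k of a vector of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def split_entropy(left_labels, right_labels):
        """Size-weighted entropy of the two daughter nodes."""
        n = len(left_labels) + len(right_labels)
        return (len(left_labels) / n) * entropy(left_labels) + \
               (len(right_labels) / n) * entropy(right_labels)

    print(entropy(["a", "a", "b", "b"]))          # 1.0: maximally uncertain
    print(split_entropy(["a", "a"], ["b", "b"]))  # zero: both daughters are pure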
Bias-Variance Tradeoff
What we have seen is a general concept called the Bias-Variance Tradeoff
• Bias: What would the error be if I knew the optimal parameters of my model (assuming infinite data)?
• Variance: What error do I make because I only have limited data?
Recap: Model Selection
• We can evaluate a model using:
– Test-set method:
• High variance
• A large number of samples are "wasted" for evaluation
• Simple and fast
– K-Fold Cross-Validation:
• More robust – averages the score over the K folds
• Samples are used more efficiently
• Slow
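A minimal scikit-learn sketch of the two evaluation methods (the dataset and estimator are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3)

    # Test-set method: simple and fast, but the estimate depends on one random split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print(tree.fit(X_tr, y_tr).score(X_te, y_te))

    # 5-fold cross-validation: slower, but averages over five different splits
    print(cross_val_score(tree, X, y, cv=5).mean())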
Standard Model-Selection Loop
How do we select a model?
• For each model (algorithm + parameter setting)
– Evaluate using a test set or cross-validation
• Pick the best model
• Relearn using all the data
How do we evaluate the final performance?
• We cannot use the cross-validation score, as it has already been used for selecting the model. Bias!
• We need an independent test set that has been used neither for training nor for model selection!
• Often the bias is small and this necessity is ignored for simplicity
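A minimal sketch of this loop with scikit-learn (the estimator and parameter grid are illustrative, not part of the lecture):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Hold out an independent test set that is never touched during model selection
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # For each parameter setting: evaluate by cross-validation, then pick the best
    search = GridSearchCV(DecisionTreeClassifier(),
                          param_grid={"max_depth": [2, 4, 8, None]},
                          cv=5)
    search.fit(X_dev, y_dev)   # refit=True relearns the best model on all of X_dev

    # Final performance estimate on the untouched test set
    print(search.best_params_, search.score(X_test, y_test))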
Today's agenda
Ensemble Methods
• Ensemble = more than one classifier/regressor
• Why can that be useful?
• Bagging:
– Train multiple classifiers/regressors
– Reduces variance while the bias stays the same
– E.g.: Random Forests
• Boosting:
– Train multiple classifiers/regressors
– Adjust the importance of the training data
– Reduces bias and variance
Bagging Predictors
Key Idea: Use multiple trees to improve accuracy
How do we get variability in the trees?
Bagging (Bootstrap Aggregating)
Breiman, “Bagging Predictors”, Machine Learning, 1996.
Fit classification or regression models to bootstrap samples from the data
and combine by voting (classification) or averaging (regression).
Bagging (Bootstrap Aggregating)
• A bootstrap sample is chosen at random with replacement from the data. Some
observations end up in the bootstrap sample more than once, while others are not
included (“out of bag”).
• Bagging reduces the variance of the base learner but has limited effect on the bias.
– I.e. no overfitting: The more trees the better
• It’s most effective if we use strong base learners that have very little bias but high
variance (unstable). E.g. trees.
• Both bagging and boosting are examples of “ensemble learners” that were popular
in machine learning in the ‘90s.
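A minimal hand-rolled sketch of bagging with trees as the base learner (for illustration only; scikit-learn's BaggingRegressor provides the same behaviour):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_bagged_trees(X, y, n_trees=100, random_state=0):
        """Fit one deep tree per bootstrap sample (drawn with replacement)."""
        rng = np.random.default_rng(random_state)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
            trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return trees

    def predict_bagged(trees, X):
        """Average the predictions of all trees (vote instead for classification)."""
        return np.mean([t.predict(X) for t in trees], axis=0)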
Bagging CART
Random Forests
Grow a forest of many trees. (R default is 500)
Grow each tree on an independent bootstrap sample from the training data.
• Sample N cases at random with replacement.
At each node:
1. Select max_features variables at random out of all M possible variables
(independently for each node).
2. Find the best split on the selected max_features variables.
Grow the trees to maximum depth.
Vote/average the trees to get predictions for new data.
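The same recipe in code, as a minimal scikit-learn sketch (the dataset is illustrative; parameter values mirror the defaults mentioned above):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(
        n_estimators=500,        # grow a forest of many trees
        bootstrap=True,          # each tree sees an independent bootstrap sample
        max_features="sqrt",     # variables tried at each node (sqrt(M) for classification)
        max_depth=None,          # grow the trees to maximum depth
    )
    forest.fit(X, y)
    print(forest.predict(X[:3]))   # majority vote over the 500 trees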
Why does that work?
Intuition: Why randomization?
• Increase variability of the single trees
• A single tree is less likely to over-specialize
• The trees are less likely to overfit
Random Regression Forests
(Figure: forest predictions on a 1-D regression problem with 1, 10, and 50 trees)
With many trees, the averaged prediction can represent almost continuous functions!
Random Forests
Advantages
• Applicable to both regression and classification problems. Yes
• Handle categorical predictors naturally. Yes
• Computationally simple and quick to fit, even for large problems. Yes
• No formal distributional assumptions (non-parametric). Yes
• Can handle highly non-linear interactions and classification boundaries. Yes
• Automatic variable selection. Yes
• Very easy to interpret if the tree is small. No – a forest of hundreds of trees is no longer easy to interpret
Random Forests
Improve on CART with respect to:
• Accuracy – Random Forests is competitive with the best known machine learning
methods
• Instability – if we change the data a little, the individual trees may change but the
forest is relatively stable because it is a combination of many trees.
Random Forests and the Kinect
Use computer graphics to generate plenty of data
Shotton et al., "Real-Time Human Pose Recognition in Parts from Single Depth Images", CVPR 2011
Boosting
Main Idea
• Boosting refers to a general and provably effective method of producing a very
accurate classifier by combining rough and moderately inaccurate rules of
thumb.
• It is based on the observation that finding many rough rules of thumb can be a lot
easier than finding a single, highly accurate classifier.
• To begin, we define an algorithm for finding the rules of thumb, which we call a weak
learner.
• The boosting algorithm repeatedly calls this weak learner, each time feeding it a
different distribution over the training data (in Adaboost).
• Each call generates a weak classifier and we must combine all of these into a
single classifier that, hopefully, is much more accurate than any one of the rules.
Boosting Pseudo Code
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
On each iteration t:
• Weight each training example by how incorrectly it was classified
• Learn a hypothesis h_t
• Compute a strength for this hypothesis \alpha_t
Final classifier:
• A linear combination of the votes of the different classifiers, weighted by their strength: H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)
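In scikit-learn this loop is available off the shelf; a minimal usage sketch (the data is illustrative, and the default weak learner is a depth-1 decision stump):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Each round reweights the data and adds one more stump to the weighted vote
    boosted = AdaBoostClassifier(n_estimators=50)
    boosted.fit(X, y)
    print(boosted.score(X, y))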
Boosting Intuition
Adaboost
• The coefficient \alpha_t depends on the error \epsilon_t of the weak classifier: \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}
– Smaller error, higher coefficient
• The weights of the samples are:
– Increased by a factor of e^{\alpha_t} if incorrectly classified
– Decreased by a factor of e^{-\alpha_t} if correctly classified
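A minimal numpy sketch of one AdaBoost round, just to make the update rule concrete (the data, and the assumption that h_pred holds the weak classifier's ±1 predictions, are illustrative):

    import numpy as np

    def adaboost_round(w, y_true, h_pred):
        """One AdaBoost round: weighted error, coefficient, and weight update."""
        miss = (h_pred != y_true)                      # which samples are misclassified
        eps = np.sum(w[miss]) / np.sum(w)              # weighted error of the weak classifier
        alpha = 0.5 * np.log((1 - eps) / eps)          # smaller error -> higher coefficient
        w = w * np.exp(np.where(miss, alpha, -alpha))  # up-weight mistakes, down-weight correct ones
        return w / w.sum(), alpha                      # renormalise the weight distribution

    w = np.ones(4) / 4
    y = np.array([+1, +1, -1, -1])
    h = np.array([+1, -1, -1, -1])                     # the stump gets the second sample wrong
    w, alpha = adaboost_round(w, y, h)
    print(w, alpha)   # the misclassified sample now carries more weight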
Toy Example: Decision Stumps
Toy Example: Round 1
Toy Example: Round 2
Toy Example: Round 3
Toy Example: Final Hypothesis
Boosting Results – Digit Classification
Application – Face Detection
Training Data
• 5000 faces
– All frontal
• 300 million non-faces
– 9500 non-face images
Application – Detecting Faces
• Problem: find faces in photograph or movie
• Weak classifiers: detect light/dark rectangle in image
• Many clever tricks to make it very fast and accurate
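A minimal sketch of such a light/dark rectangle feature, computed from an integral image so each rectangle sum costs only four lookups (this is the Viola-Jones style of weak classifier; all names and values here are illustrative):

    import numpy as np

    def integral_image(img):
        """Cumulative sums so any rectangle sum costs only four lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r0, c0, r1, c1):
        """Sum of img[r0:r1, c0:c1] from the integral image ii (exclusive ends)."""
        total = ii[r1 - 1, c1 - 1]
        if r0 > 0: total -= ii[r0 - 1, c1 - 1]
        if c0 > 0: total -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
        return total

    def two_rectangle_feature(img, r0, c0, h, w):
        """Dark-above-light feature: top-half sum minus bottom-half sum."""
        ii = integral_image(img.astype(float))
        top = rect_sum(ii, r0, c0, r0 + h // 2, c0 + w)
        bottom = rect_sum(ii, r0 + h // 2, c0, r0 + h, c0 + w)
        return top - bottom

    # the weak classifier is simply a threshold on such a feature value
    img = np.random.rand(24, 24)
    print(two_rectangle_feature(img, 4, 4, 12, 16) > 2.0)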
Summary: What we have learned today
• Averaging over multiple trees reduces variance and improves performance
• Variability in the trees:
– Bagging
– Randomized splits
• Bagging:
– Resample data points
– Weight of each classifier is the same
– Only variance reduction
• Boosting:
– Reweights data points (modifies their distribution)
– Weight is dependent on classifier’s accuracy
– Both bias and variance reduced – learning rule becomes more complex with iterations