Trees and Forests…
Gerhard Neumann
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
Step back… Bigger Picture
Supervised Learning: We want to learn the
mapping from the input data to the output labels
• Regression: Continuous output labels
– Linear regression, Polynomial Regression,
kNN (?)
• Classification: Discrete / Nominal output
labels
– Logistic Regression, Decision Trees, Naive
Bayes, kNN
Recap… Bigger Picture
Unsupervised Learning
There are no "labels". We want to learn the
"structure" of the input data
• Clustering: Find clusters (similar data
points) in the data
– kMeans
• Dimensionality Reduction: How can we
represent high dimensional data with only a
few variables?
– Not in this lecture…
Model Selection
For most of these algorithms, we have to solve a
model selection problem
• Choose complexity of the function class
– One of the hardest problems in machine learning
– Too complex: Overfitting
• Fit noise in the data
• Unspecified behavior between data points,
not enough data
– Too simple: Underfitting
• We cannot represent the underlying function
• Let's look at some examples…
Recap: Polynomial Regression
Instead of fitting a line, we can also fit a d-th order polynomial
• Still linear in the parameters beta
• We can use linear regression (see week 4)
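A minimal sketch of this idea in Python/NumPy (the toy data and the order d below are illustrative, not from the lecture):

```python
# Fit a d-th order polynomial by ordinary least squares: build the features
# 1, x, x^2, ..., x^d and solve a linear regression problem (still linear in beta).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # noisy toy data

d = 3                                            # polynomial order (model complexity)
X = np.vander(x, d + 1, increasing=True)         # columns 1, x, ..., x^d
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares estimate of beta
y_hat = X @ beta                                 # fitted values
```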
Selecting the order of the polynomial
Model Selection for polynomial regression
• Evaluate the model's performance on:
– Test-set
– K-fold cross validation (see Semester A)
• The error on the training set is not an
indication of a good fit!
• Overfitting:
– Training error goes down
– Test error goes up
• Underfitting:
– Training + Test error are high
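A hedged sketch of that comparison in Python (a simple hold-out split instead of the R-based k-fold cross validation from Semester A; data and degrees are made up):

```python
# Train on one part of the data, evaluate the MSE of several polynomial orders
# on a held-out test set: the training error keeps falling, the test error does not.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
train, test = np.arange(40), np.arange(40, 60)           # simple hold-out split

def fit_poly(x, y, d):
    X = np.vander(x, d + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(x, y, beta, d):
    X = np.vander(x, d + 1, increasing=True)
    return np.mean((X @ beta - y) ** 2)

for d in (1, 3, 9, 15):
    beta = fit_poly(x[train], y[train], d)
    print(d, mse(x[train], y[train], beta, d), mse(x[test], y[test], beta, d))
```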
Recap: KNN
• Find the k-Nearest Neighbors.
• Use the most frequent class among the k
neighbors as the output label
• Different complexity with different k
– Small k: Complex decision
boundaries, high complexity
– Large k: Much smoother decision
boundaries, lower complexity
[Figure: k-NN decision boundaries for k = 25 and k = 1]
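A small scikit-learn sketch of the same effect (the moons data set is illustrative; the figure above uses k = 1 and k = 25):

```python
# k-NN with a small and a large k: k = 1 memorises the training set,
# k = 25 gives much smoother decision boundaries.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))   # train vs. test accuracy
```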
Model-Selection for k…
• Overfitting:
– Training accuracy goes up
– Test accuracy goes down
• Underfitting:
– Training + test accuracy are low
Bias-Variance Tradeoff
What we have seen is a general
concept called the Bias-Variance
Tradeoff
• Bias: What error would I still make if I
knew the optimal parameters of my model
(assuming infinite data)?
• Variance: What additional error do I make
because I only have limited data?
Bias-Variance Tradeoff
Each dot represents a fitted model
on a slight variation of the data
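A rough simulation of this picture (everything below – the true function, noise level and evaluation point – is an illustrative assumption):

```python
# Refit a simple (d = 1) and a complex (d = 9) polynomial on many noisy versions
# of the same data and look at the fits at one input point x0:
# the mean offset from the truth estimates the bias, the spread the variance.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)              # assumed "true" function
x = np.linspace(0, 1, 25)
x0 = 0.2                                         # evaluation point

for d in (1, 9):
    preds = []
    for _ in range(200):                         # 200 slight variations of the data
        y = f(x) + 0.3 * rng.standard_normal(x.size)
        beta, *_ = np.linalg.lstsq(np.vander(x, d + 1, increasing=True), y, rcond=None)
        preds.append(np.vander([x0], d + 1, increasing=True) @ beta)
    preds = np.array(preds).ravel()
    print(d, (preds.mean() - f(x0)) ** 2, preds.var())   # squared bias, variance
```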
Classification and Regression Trees (CART)
Today's agenda
Learn more about tree-based methods
• For regression: Regression Tree
• For classification: Decision Tree
• Almost the same algorithms!
Random Forest:
• Why use multiple trees?
• Why use randomization?
Good news for today… only 2 slides of math
Classification and Regression Trees
Pioneers:
• Morgan and Sonquist (1963).
• Breiman, Friedman, Olshen, Stone (1984). CART
• Quinlan (1993). C4.5
Regression and Classification Trees
Grow a binary tree
• At each node, “split” the data into two “daughter” nodes.
• Splits are chosen using a splitting criterion.
• Bottom nodes are “terminal” nodes.
For regression:
• the predicted value at a node is the average response variable for all observations
in the node.
For classification:
• the predicted class is the most common class in the node (majority vote).
• Can also get estimated probability of membership in each of the classes
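As a hedged illustration (scikit-learn rather than the R tools used in the module), the two variants differ only in what the leaves predict:

```python
# Classification tree: the leaf predicts the majority class, and predict_proba
# returns the class proportions in that leaf.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(Xc, yc)
print(clf.predict(Xc[:2]), clf.predict_proba(Xc[:2]))

# Regression tree: the leaf predicts the average response of the training
# observations that end up in it.
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3).fit(Xr, yr)
print(reg.predict(Xr[:2]))
```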
A classification tree
A regression tree
Predict (log) prostate specific antigen from
• Log cancer volume
• Log prostate weight
Splitting criterion (that's the first mathy slide…)
• Regression: Minimum residual sum of squares
  $\mathrm{RSS} = \sum_{i \in \text{left}} (y_i - \bar{y}_L)^2 + \sum_{i \in \text{right}} (y_i - \bar{y}_R)^2$
– where $\bar{y}_L$ and $\bar{y}_R$ are the average label values in the left and right subtrees
– Split such that the variance in the subtrees is minimized (see the sketch below)
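A minimal sketch of this criterion for a single variable (the function names are mine, not from the lecture):

```python
# Try every candidate threshold on one input variable and keep the split
# x <= t vs. x > t with the smallest total residual sum of squares.
import numpy as np

def rss(y):
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x, y):
    best_t, best_rss = None, np.inf
    for t in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= t], y[x > t]
        total = rss(left) + rss(right)           # RSS_L + RSS_R
        if total < best_rss:
            best_t, best_rss = t, total
    return best_t, best_rss
```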
Splitting criterion (that's the second mathy slide…)
• Classification: Minimum entropy in the subtrees
  $H_L = -\sum_k p_{L,k} \log p_{L,k}$
– where $H_L$ is the entropy in the left sub-tree
– and $p_{L,k}$ is the proportion of class $k$ in the left sub-tree
• Entropy is a measure of uncertainty
– Split such that the class labels in the sub-trees are "pure" (sketch below)
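And the classification counterpart as a short sketch (again with illustrative helper names):

```python
# Entropy of a node from its class proportions; a split is scored by the
# size-weighted entropy of the two daughter nodes (lower = "purer").
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                    # proportion p_k of each class k
    return float(-np.sum(p * np.log2(p)))        # H = -sum_k p_k log p_k

def split_entropy(left, right):
    n_l, n_r = len(left), len(right)
    return (n_l * entropy(left) + n_r * entropy(right)) / (n_l + n_r)
```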
Finding the best horizontal split
[Figure: RSS as a function of the horizontal split location]
Finding the best vertical split
Creating the root node
Finding the best split in the left node
Building the regression tree…
Finding the best split in the right node…
Skipping some steps… final result
Classification Tree
When do we stop?
There are many stopping criteria; the two main ones are:
Stop if:
• the minimum number of samples per node
• the maximum depth
… has been reached.
Both criteria again influence the complexity of the tree!
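In scikit-learn terms (the parameter values below are just an example), both criteria are simple constructor arguments:

```python
# Stop splitting when a leaf would get too few samples or when the maximum depth is reached.
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    min_samples_leaf=5,   # minimum number of samples per leaf
    max_depth=6,          # maximum depth of the tree
)
```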
Controlling the tree complexity
Small number of samples per leaf:
• Tree is very sensitive to noise
Large number of samples per leaf:
• Tree not expressive enough
Model-Selection for Regression Trees
Evaluate error on test-set
• Overfitting for min_samples = 1
• Underfitting for min_samples > 2
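A sketch of that evaluation (illustrative data; the sweep over min_samples_leaf mirrors the min_samples setting above):

```python
# Fit regression trees with different minimum leaf sizes and compare
# train vs. test performance: very small leaves overfit, very large leaves underfit.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for m in (1, 5, 20, 80):
    tree = DecisionTreeRegressor(min_samples_leaf=m).fit(X_tr, y_tr)
    print(m, tree.score(X_tr, y_tr), tree.score(X_te, y_te))   # R^2 on train vs. test
```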
Classification and Regression Trees
Advantages
• Applicable to both regression and classification problems.
• Handle categorical predictors naturally.
• Computationally simple and quick to fit, even for large problems.
• No formal distributional assumptions (non-parametric).
• Can handle highly non-linear interactions and classification boundaries.
• Automatic variable selection.
• Very easy to interpret if the tree is small.
Classification and Regression Trees (CART)
Disadvantages
• Accuracy – current methods, such as support vector machines and ensemble
classifiers, often have 30% lower error rates than CART.
• Instability – if we change the data a little, the tree picture can change a lot. So the
interpretation is not as straightforward as it appears.
Nowadays, we can do better! Random Forests!
Bagging Predictors
Key Idea: Use multiple trees to improve accuracy
How do we get variability in the trees?
Bagging (Bootstrap Aggregating)
Breiman, “Bagging Predictors”, Machine Learning, 1996.
Fit classification or regression models to bootstrap samples from the data
and combine by voting (classification) or averaging (regression).
Bagging (Bootstrap Aggregating)
• A bootstrap sample is chosen at random with replacement from the data. Some
observations end up in the bootstrap sample more than once, while others are not
included (“out of bag”).
• Bagging reduces the variance of the base learner but has limited effect on the bias.
– i.e. no overfitting: the more trees, the better
• It’s most effective if we use strong base learners that have very little bias but high
variance (unstable). E.g. trees.
• Both bagging and boosting are examples of “ensemble learners” that were popular
in machine learning in the ‘90s.
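A hand-rolled sketch of bagging for regression (scikit-learn's BaggingRegressor wraps the same idea; the function names here are illustrative):

```python
# Draw bootstrap samples, fit one tree per sample, and average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)         # sample n cases with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    # average for regression; for classification, take a majority vote instead
    return np.mean([t.predict(X) for t in trees], axis=0)
```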
Bagging CART
Random Forests
Grow a forest of many trees. (R default is 500)
Grow each tree on an independent bootstrap sample from the training data.
• Sample N cases at random with replacement.
At each node:
1. Select m variables at random out of all M possible variables
(independently for each node).
2. Find the best split on the selected m variables.
Grow the trees to maximum depth (classification).
Vote/average the trees to get predictions for new data.
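In scikit-learn this whole recipe is one estimator (shown on the iris data purely as an illustration; n_estimators=500 mirrors the R default mentioned above, and max_features plays the role of the m randomly selected variables per split):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt").fit(X, y)
print(rf.predict(X[:3]))          # majority vote over the trees
print(rf.predict_proba(X[:3]))    # class probabilities averaged over the trees
```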
Why does that work?
Intuition: Why randomization?
• Increase variability of the single trees
• A single tree is less likely to over-specialize
• The trees are less likely to overfit
Random Regression Forests
[Figure: random forest fits with Num Trees = 1, 10, and 50]
We can represent almost continuous functions!
Random Forests
Advantages
• Applicable to both regression and classification problems. Yes
• Handle categorical predictors naturally. Yes
• Computationally simple and quick to fit, even for large problems. Yes
• No formal distributional assumptions (non-parametric). Yes
• Can handle highly non-linear interactions and classification boundaries. Yes
• Automatic variable selection. Yes
• Very easy to interpret if the tree is small. No
Random Forests
Improve on CART with respect to:
• Accuracy – Random Forests are competitive with the best known machine learning
methods
• Instability – if we change the data a little, the individual trees may change but the
forest is relatively stable because it is a combination of many trees.
Random Forests and the Kinect
Use computer graphics to generate plenty of data
Shotton et al., Real-Time Human Pose Recognition in Parts from a Single Depth Image, CVPR 2011
Summary: What we have learned today
• CART: Binary decision trees
• Can be efficiently used for classification and regression
• Complexity can be set by minimum samples per leaf
• Averaging over multiple trees reduces variance and improves performance
• Variability in the trees:
– Bootstrap
– Randomized splits