Decision Trees
Jiayu Zhou
Department of Computer Science and Engineering, Michigan State University
East Lansing, MI USA
Based on slides by M. Welling
Running Example: Wait for a table?
Problem: decide whether to wait for a table at a restaurant, based on the following features:
1 Alternate: is there an alternative restaurant nearby?
2 Bar: is there a comfortable bar area to wait in?
3 Fri/Sat: is today Friday or Saturday?
4 Hungry: are we hungry?
5 Patrons: number of people in the restaurant (None, Some, Full)
6 Price: price range ($, $$, $$$)
7 Raining: is it raining outside?
8 Reservation: have we made a reservation?
9 Type: kind of restaurant (French, Italian, Thai, Burger)
10 WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Running Example: Wait for a table? (cont.)
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won’t wait for a table:
Classification of examples is positive (T) or negative (F).
General form of the data: N instances, each with an attribute vector x = [x1, x2, ..., xd]^T and a target value y.
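For concreteness, a minimal sketch of how such a dataset might be represented in Python; the two rows below are illustrative placeholders, not the actual examples from the slides:

examples = [
    # (attribute dictionary, target: will we wait?) -- hypothetical rows for illustration
    ({"Alt": True, "Bar": False, "Fri": False, "Hun": True, "Pat": "Some",
      "Price": "$$$", "Rain": False, "Res": True, "Type": "French", "Est": "0-10"}, True),
    ({"Alt": True, "Bar": False, "Fri": False, "Hun": True, "Pat": "Full",
      "Price": "$", "Rain": False, "Res": False, "Type": "Thai", "Est": "30-60"}, False),
]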
Decision Trees
One possible representation for hypotheses
We imagine someone taking a sequence of decisions. E.g., here is the “true” tree for deciding whether to wait:
Note we can use the same feature more than once.
Expressiveness
Decision trees can express any function of the input features. E.g., for Boolean functions, truth table row → path to leaf:
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example, but it probably won’t generalize to new examples.
Prefer to find more compact decision trees: we don’t want to memorize the data, we want to find the structure in the data!
Hypothesis space
How many distinct decision trees with n Boolean attributes? = number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
With 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees
Example:
n = 2: 2^2 = 4 rows. For each row we can choose T or F: 2^4 = 16 functions.
Decision tree learning
If there are so many possible trees, can we actually search this space? (solution: greedy search).
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose the most significant feature as the root of the (sub)tree.
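A minimal sketch of this greedy, recursive procedure in Python (names are illustrative; the choose argument stands for any rule that picks the split attribute, e.g. the information-gain criterion developed on the following slides):

def learn_tree(examples, attributes, parent_examples, choose):
    # examples: list of (attribute_dict, label) pairs; choose picks the split attribute.
    if not examples:
        return majority_label(parent_examples)     # empty branch: inherit parent majority
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:
        return labels[0]                           # pure node: stop
    if not attributes:
        return majority_label(examples)            # no attributes left: majority vote
    best = choose(examples, attributes)            # most significant feature
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, rest, examples, choose)
    return tree

def majority_label(examples):
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)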
Choosing an attribute
What is a good attribute?
Patrons or Type?
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”.
Using information theory
Entropy measures the amount of uncertainty in a probability distribution:
Consider tossing a biased coin. If you toss the coin VERY often, the frequency of heads is, say, p, and hence the frequency of tails is 1 − p (for a fair coin, p = 0.5).
The uncertainty in any actual outcome is given by the entropy
Entropy(p) = −p log(p) − (1 − p) log(1 − p)
Note that the uncertainty is zero if p = 0 or p = 1, and maximal if p = 0.5.
Using information theory
If there are more than two states s = 1,2,…,n we have (e.g. a die):
Entropy(p) = − p(s=1) log p(s=1) − p(s=2) log p(s=2) − … − p(s=n) log p(s=n) = − Σ_{s=1}^{n} p(s) log p(s),
where Σ_{s=1}^{n} p(s) = 1.
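A small sketch of this quantity in Python (base-2 logarithms, so entropy is measured in bits; zero-probability terms contribute nothing):

import math

def entropy(probs):
    # Entropy of a discrete distribution given as a list of probabilities, in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))    # certain outcome: 0.0 bits
print(entropy([1/6] * 6))     # fair die: about 2.58 bits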
Using information theory
Imagine we have p examples which are true (positive) and n examples which are false (negative).
Our best estimate of true or false (prior) is given by: p(true) ≈ p/(p + n), p(false) ≈ n/(p + n).
Hence the entropy is given by
Entropy( p/(p+n), n/(p+n) ) ≈ − p/(p+n) log( p/(p+n) ) − n/(p+n) log( n/(p+n) )
Using Information Theory
Question: How much information do we gain if we disclose the value of some attribute?
Answer: uncertainty before – uncertainty after
Example
Before split: E = −1/2 log(1/2) − 1/2 log(1/2) = log(2) = 1. There is “1 bit of information to be discovered”.
Split Type: if we go into the French branch, we still have 1 bit of uncertainty; similarly for the others. On average: 1 bit. We gained nothing.
Split Patrons: branches None and Some have entropy 0; branch Full has entropy −1/3 log(1/3) − 2/3 log(2/3) ≈ 0.92. So Patrons gains more information!
Information Gain
How do we combine branches?
An intuitive weighting scheme: 1/6 of the time we enter None, so we weight None with 1/6. Similarly, Some has weight 1/3 and Full has weight 1/2.
Entropy(A) = Σ_{i=1}^{n} (p_i + n_i)/(p + n) · Entropy( p_i/(p_i + n_i), n_i/(p_i + n_i) )
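A sketch of this weighted average in Python, reusing the entropy helper above (p_i and n_i are the positive/negative counts in branch i):

def remainder(branch_counts):
    # branch_counts: list of (p_i, n_i) pairs, one per branch of attribute A.
    total = sum(p + n for p, n in branch_counts)
    rem = 0.0
    for p, n in branch_counts:
        weight = (p + n) / total
        rem += weight * entropy([p / (p + n), n / (p + n)])
    return rem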
Information gain
Information Gain (IG) or reduction in entropy from the attribute test:
IG(A) = Entropy Before − Entropy After
Choose the attribute with the largest IG as the next split.
Information Gain
For the training set, p = n = 6, E(6/12, 6/12) = 1 bit. Compute IG(Patrons) and IG(Type).
IG(Patrons) = 1 − [ 2/12 E(0, 1) + 4/12 E(1, 0) + 6/12 E(2/6, 4/6) ] ≈ 0.54
IG(Type) = 1 − [ 2/12 E(1/2, 1/2) + 2/12 E(1/2, 1/2) + 4/12 E(2/4, 2/4) + 4/12 E(2/4, 2/4) ] = 0
Patrons has the highest IG of all attributes and so is chosen.
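These numbers can be checked with the helpers sketched earlier; the branch counts below are read off the Patrons and Type splits on this slide (this also gives a concrete choose rule for the learn_tree sketch above):

def information_gain(before_counts, branch_counts):
    # before_counts: (p, n) at the parent node; branch_counts: list of (p_i, n_i).
    p, n = before_counts
    return entropy([p / (p + n), n / (p + n)]) - remainder(branch_counts)

print(information_gain((6, 6), [(0, 2), (4, 0), (2, 4)]))              # Patrons: about 0.54
print(information_gain((6, 6), [(1, 1), (1, 1), (2, 2), (2, 2)]))      # Type: 0.0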
Example cont’d.
Decision tree learned from the 12 examples:
Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by the small amount of data.
Gain-Ratio
If one attribute splits into many more classes than another (e.g., it acts like a unique identifier), it has an (unfair) advantage if we use information gain. → Hurts generalization!
The gain-ratio is designed to compensate for this problem:
Gain-Ratio = Info-Gain / ( − Σ_{i=1}^{n} (p_i + n_i)/(p + n) log( (p_i + n_i)/(p + n) ) )
If we have n uniformly populated classes, the denominator is log2(n), which penalizes the split relative to the value of 1 obtained for 2 uniformly populated classes.
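A sketch of this ratio using the helpers above (the denominator is the entropy of the split proportions themselves, often called the split information):

def gain_ratio(before_counts, branch_counts):
    total = sum(p + n for p, n in branch_counts)
    split_info = entropy([(p + n) / total for p, n in branch_counts])
    ig = information_gain(before_counts, branch_counts)
    return ig / split_info if split_info > 0 else 0.0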
What to Do if …
In some leaf there are no examples?
Choose True or False according to the number of positive/negative examples at your parent.
There are no attributes left?
But if there are still both positive and negative examples, we have errors/noise in the data. Stop and use a majority vote.
Continuous Variables
If variables are continuous we can bin them (quantization) or we can learn a simple classifier on a single dimension
E.g. we can find a decision point which classifies all data to the left of that point in one class and all data to the right in the other (decision stump – next slide)
We can also use a small subset of dimensions and train a linear classifier (e.g. logistic regression classifier).
Decision Stump
Data: {(x1, y1), (x2, y2), …, (xn, yn)}. Decision stump:
h(xi) = +1 if xi > θ, −1 if xi < θ
This determines one half-space to be +1 and the other half −1.
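A minimal sketch of fitting such a stump by brute force, scanning midpoints between sorted feature values and minimizing training error (illustrative names; ties and degenerate cases are ignored):

def fit_stump(xs, ys):
    # xs: scalar feature values; ys: labels in {+1, -1}.
    best = (None, None, float("inf"))           # (theta, sign, training error)
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for theta in thresholds:
        for sign in (+1, -1):
            preds = [sign if x > theta else -sign for x in xs]
            err = sum(p != y for p, y in zip(preds, ys))
            if err < best[2]:
                best = (theta, sign, err)
    return best

print(fit_stump([1.0, 2.0, 3.0, 4.0], [-1, -1, +1, +1]))   # threshold 2.5, sign +1, 0 errors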
When to Stop?
Can we keep going until perfect classification? We might over-fit!
Heuristics
Stop when Info-Gain (Gain-Ratio) is smaller than a threshold
Stop when there are M examples in each leaf node
Penalize complex trees by minimizing
Σ_{all leaves} entropy(leaf) + α · complexity, where complexity = # nodes
(a small sketch of this penalized cost follows after this list)
Note: if tree grows, complexity grows but entropy shrinks.
Compute many fully grown trees on subsets of the data and test them on hold-out data. Pick the best, or average their predictions (random forests – advanced topic).
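A sketch of the penalized cost from the third heuristic above, taking per-leaf label counts and the node count as inputs (illustrative names, reusing the entropy helper):

def tree_cost(leaf_counts, num_nodes, alpha):
    # leaf_counts: list of (p, n) label counts at each leaf; complexity = number of nodes.
    total_entropy = sum(entropy([p / (p + n), n / (p + n)])
                        for p, n in leaf_counts if p + n > 0)
    return total_entropy + alpha * num_nodes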
Pruning
If the tree is too big, the lower “branches” are modeling noise in the data (overfitting).
The usual paradigm is to grow the trees large and prune back unnecessary splits.
Methods for pruning trees have been developed. Most use some form of cross-validation. Tuning may be necessary.
Regression Trees
Decision trees where the target has continuous values.
Splitting criterion: residual sum of squares
RSS = Σ_left (y_i − y_L*)^2 + Σ_right (y_i − y_R*)^2
where
y_L* = mean y-value for the left node, y_R* = mean y-value for the right node.
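A sketch of choosing a split point on a single feature by minimizing this RSS (brute force over midpoints between sorted values; illustrative only):

def best_rss_split(xs, ys):
    # xs: scalar feature values; ys: continuous targets.
    def rss(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)
    pairs = sorted(zip(xs, ys))
    best_theta, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        theta = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= theta]
        right = [y for x, y in pairs if x > theta]
        if rss(left) + rss(right) < best_score:
            best_theta, best_score = theta, rss(left) + rss(right)
    return best_theta, best_score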
Classification and Regression Trees
Advantages
Applicable to both regression and classification problems.
Handle categorical predictors naturally.
Computationally simple and quick to fit, even for large problems.
No formal distributional assumptions (non-parametric).
Can handle highly non-linear interactions and classification boundaries.
Automatic variable selection.
Handle missing values through surrogate variables.
Very easy to interpret if the tree is small.
Comparison Between Logistic Regression and Decision Tree Decision Boundaries
https://www.quora.com/In-what-circumstances-or-properties-of-the-data-will-decision-trees-perform-worse-than-logistic-regression
Classification and Regression Trees
Disadvantages
Accuracy: current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than classification and regression trees (CART).
Instability: if we change the data a little, the tree can change a lot, so the interpretation is not as straightforward as it appears.
Today, we can do better!
Random Forests
Random Forest
Example of Random Forest
[Figure: three example trees. Tree 1 splits on height (tall/short) and age (old/young); Tree 2 splits on sex (male/female) and age; Tree 3 splits on work (retired/working) and height; each leaf predicts healthy or diseased.]
New sample: {old, retired, male, short}. Prediction?
Tree predictions: diseased, healthy, diseased. Majority rule: diseased.
The Random Forest Algorithm
1 For b = 1 to B:
  1 Draw a bootstrap sample of size N from the training data (randomly sample with replacement).
  2 Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps, until the minimum node size n_min is reached:
    1 Select m variables at random from the p variables.
    2 Pick the best variable/split-point among the m.
    3 Split the node into two daughter nodes.
2 Output the ensemble of trees {T_b}_1^B
To make a prediction at a new point x:
Regression: f_rf^B(x) = (1/B) Σ_{b=1}^{B} T_b(x)
Classification: let C_b(x) be the class prediction of the b-th random-forest tree; then C_rf^B(x) = majority vote { C_b(x) }_1^B.
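A compact sketch of this procedure under the conventions used earlier in these notes (discrete attributes, the learn_tree and information_gain helpers from above; names are illustrative and no pruning is applied):

import random

def bootstrap(examples):
    # Draw N examples with replacement: some occur multiple times, some not at all.
    return [random.choice(examples) for _ in range(len(examples))]

def split_information_gain(exs, attr):
    # Wrap the earlier information_gain helper: count positives/negatives per value of attr.
    pos = sum(1 for _, y in exs if y)
    neg = len(exs) - pos
    branches = {}
    for x, y in exs:
        p, n = branches.get(x[attr], (0, 0))
        branches[x[attr]] = (p + 1, n) if y else (p, n + 1)
    return information_gain((pos, neg), list(branches.values()))

def random_forest(examples, attributes, B, m):
    def choose(exs, attrs):
        # Feature bagging: only m randomly chosen attributes compete at each split.
        pool = random.sample(attrs, min(m, len(attrs)))
        return max(pool, key=lambda a: split_information_gain(exs, a))
    return [learn_tree(bootstrap(examples), attributes, examples, choose) for _ in range(B)]

def predict(tree, x):
    # Walk the nested-dict tree from learn_tree; unseen attribute values are not handled here.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]
    return tree

def forest_predict(forest, x):
    votes = [predict(tree, x) for tree in forest]
    return max(set(votes), key=votes.count)     # classification: majority vote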
Differences to standard tree
Train each tree on a bootstrap resample of the data.
Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set.
For each split, consider only m randomly selected variables (feature bagging).
Don’t prune.
Fit a forest of B trees in this way and use averaging or majority voting to aggregate the results.
Variable Importance
An advantage of tree methods is their interpretability.
But we lose this property in a random forest: what are the important variables in the forest?
Good news! There are ways to identify important variables in the model, e.g., via the Out-of-Bag (OOB) error.
[Figure: the same three example trees as before, splitting on height/age, sex/age, and work/height, with healthy/diseased leaves.]
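In practice, variable importance is typically read off a fitted forest. A minimal sketch with scikit-learn (assuming the library is available; the synthetic data and all parameter values are illustrative), showing the OOB score, impurity-based importances, and permutation importances:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)                      # estimated on out-of-bag samples
print("Impurity-based importances:", rf.feature_importances_)

perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)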
Tree and Random Forest
Trees
+ Trees yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
− Predictions of trees tend to have a high variance
Random Forest
+ RF has smaller prediction variance and therefore usually better overall performance
+ Easy to tune parameters
− Rather slow
− “Black Box”: Rather hard to get insights into decision rules