SGN-13006 Introduction to Pattern Recognition and Machine Learning (5 cr)
Decision Tree Learning
Joni-Kristian Kämäräinen
September 2018
Laboratory of Signal Processing
Tampere University of Technology
Material
• Lecturer’s slides and blackboard notes
• T.M. Mitchell. Machine Learning. McGraw-Hill, 1997:
Chapter 3
• (Advanced: T. Hastie, R. Tibshirani, and J. Friedman. The
Elements of Statistical Learning. Springer, 2009: Chapter 15)
Contents
Decision Tree Learning
ID3 algorithm
Analysis of Decision Tree Learning
Overfitting
Different types of variables
Advanced Topic: Randomisation
Randomisation & Decision trees
Beyond concept learning
Figure 1: http://cognitrn.psych.indiana.edu/rgoldsto/courses/concepts2.pdf
Decision Tree Learning
Decision Tree Representation
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
How would we represent:
• ∧, ∨, XOR
• (A ∧ B) ∨ (C ∧ ¬D ∧ E)
• M of N
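As one possible answer (a sketch that is not part of the slides), A XOR B can be written as a tree of nested attribute tests – one `if` per internal node, one `return` per leaf:

```python
# Hypothetical sketch: the decision tree for A XOR B as nested attribute tests.
# Each `if` is an internal node testing one attribute, each `return` a leaf label.
def xor_tree(A: bool, B: bool) -> bool:
    if A:                   # root node tests A
        if B:               # branch A = True tests B
            return False    # leaf: A=1, B=1 -> 0
        return True         # leaf: A=1, B=0 -> 1
    else:
        if B:
            return True     # leaf: A=0, B=1 -> 1
        return False        # leaf: A=0, B=0 -> 0
```

∧ and ∨ have compact trees, whereas XOR must test every attribute on every path, so parity-like concepts give the largest trees.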
Training Data (Mitchell Table 3.1)
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
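In ID3 (introduced on the following slides) the "best" attribute is the one with the highest information gain. As a sketch that is not part of the original slides, the gains on this data can be computed directly; attribute and value names follow the table:

```python
import math
from collections import Counter

# Mitchell Table 3.1 as (Outlook, Temperature, Humidity, Wind, PlayTennis) rows.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, a):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v) for attribute index a."""
    g = entropy([r[-1] for r in rows])
    for v in set(r[a] for r in rows):
        subset = [r[-1] for r in rows if r[a] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for i, name in enumerate(ATTRS):
    print(f"Gain(S, {name}) = {information_gain(DATA, i):.3f}")
# Outlook wins with a gain of about 0.247, so ID3 places it at the root.
```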
Top-Down Induction of Decision Trees
Main loop:
1: A← the “best” decision attribute for next node
2: Assign A as decision attribute for node
3: For each value of A, create new descendant of node
4: Sort training examples to leaf nodes
5: If training examples perfectly classified, Then STOP, Else
iterate over new leaf nodes
Decision Tree Learning
ID3 algorithm
ID3 Algorithm
ID3(Examples, Target Attribute, Attributes)
1: Create a root node Root for the tree
2: if all examples are positive then
3: Return the single-node tree Root, with label = +
4: end if
5: if all examples are negative then
6: Return the single-node tree Root, with label = −
7: end if
8: if the set of predicting attributes is empty then
9: Return the single-node tree Root, with label = the most common value of the target attribute in the examples
10: end if
11: Otherwise begin
12: A = the attribute that best classifies the examples
13: Decision tree attribute for Root = A
14: for each possible value v_i of A do
15: Add a new tree branch below Root, corresponding to the test A = v_i
16: Let Examples(v_i) be the subset of examples that have the value v_i for A
17: if Examples(v_i) is empty then
18: Below this new branch add a leaf node with label = the most common target value in the examples
19: else
20: Below this new branch add the subtree ID3(Examples(v_i), Target Attribute, Attributes − {A})
21: end if
22: end for
23: Return Root
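A minimal Python re-implementation of the pseudocode, offered as a sketch rather than Mitchell's own code. It assumes each example is a tuple of attribute values followed by the class label, and it only branches on attribute values that occur in the data, so the empty-branch case of steps 17–18 never triggers:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, a):
    """Information gain of splitting the (attributes..., label) rows on attribute index a."""
    g = entropy([r[-1] for r in rows])
    for v in set(r[a] for r in rows):
        subset = [r[-1] for r in rows if r[a] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

def id3(rows, attrs):
    """attrs maps attribute index -> attribute name; returns a nested-dict tree."""
    labels = [r[-1] for r in rows]
    # Steps 2-10: a leaf when all examples share one label or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 12: the attribute that best classifies the examples (highest gain).
    best = max(attrs, key=lambda a: gain(rows, a))
    # Steps 14-22: one subtree per observed value of the chosen attribute.
    tree = {attrs[best]: {}}
    rest = {a: name for a, name in attrs.items() if a != best}
    for v in set(r[best] for r in rows):
        tree[attrs[best]][v] = id3([r for r in rows if r[best] == v], rest)
    return tree

# With DATA as in the earlier sketch (Table 3.1):
# id3(DATA, {0: "Outlook", 1: "Temperature", 2: "Humidity", 3: "Wind"})
# -> {'Outlook': {'Overcast': 'Yes',
#                 'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                 'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}
```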
Example
Example (cont.)
Application Example: C-Section Risk
Analysis of Decision Tree Learning
Hypothesis Space Search by ID3
• Hypothesis space is complete (cf. Concept Learning)
• Entropy-driven search is robust to noise – it always provides a solution that fits the training data
• Follows Occam’s Razor – “prefer the shortest tree”
Occam’s Razor
Why prefer short hypotheses?
Argument in favor:
• Fewer short hyps. than long hyps.
→ a short hyp that fits data unlikely to be coincidence (cf.
Copernicus)
→ a long hyp that fits data might be coincidence
Analysis of Decision Tree Learning
Overfitting
Overfitting
• The most informative attributes typically end up near the root, but
• all attributes are tested toward the leaves ⇒ the tree easily overfits
Overfitting
Consider error of hypothesis h over
• training data: error_train(h)
• entire distribution D of data: error_D(h)
Overfitting
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
error_train(h) < error_train(h′)
and
error_D(h) > error_D(h′)
Avoiding Overfitting
• Stop growing the tree early when some criterion is met (e.g. only a few samples left)
• Grow a full tree, then post-prune
• Use a separate validation set to stop training
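The last two points can be combined in practice: grow the full tree, then let a held-out validation set choose how hard to prune. A sketch using scikit-learn's cost-complexity pruning (a different pruning scheme than the rule post-pruning on the next slide); the dataset is an arbitrary stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # validation accuracy decides the pruning level
    if score >= best_score:            # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha = {best_alpha:.4f}, validation accuracy = {best_score:.3f}")
```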
Avoiding Overfitting – A Post-pruning Example
1. Convert tree to equivalent set of rules
2. Prune each rule independently of others
3. Sort final rules into desired sequence for use
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
. . .
Perhaps most frequently used method (e.g., C4.5)
Analysis of Decision Tree Learning
Different types of variables
Continuous Valued Attributes
Create a discrete attribute to test a continuous one:
• Temperature = 82.5
• (Temperature > 72.3) = t, f
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
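A sketch (not part of the slides) of how such a threshold can be chosen: candidate split points sit midway between adjacent values where the class changes, and the candidate with the highest information gain wins. Here the candidates are 54 and 85, and 54 gives the larger gain:

```python
import math
from collections import Counter

temps  = [40, 48, 60, 72, 80, 90]                 # sorted Temperature values
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]  # PlayTennis

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def gain_for_threshold(t):
    """Information gain of the binary test (Temperature > t)."""
    left  = [y for x, y in zip(temps, labels) if x <= t]
    right = [y for x, y in zip(temps, labels) if x > t]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

# Candidate thresholds: midpoints between neighbours with different labels.
candidates = [(a + b) / 2
              for (a, la), (b, lb) in zip(zip(temps, labels), zip(temps[1:], labels[1:]))
              if la != lb]
best = max(candidates, key=gain_for_threshold)
print(candidates, "-> best threshold:", best)     # [54.0, 85.0] -> best threshold: 54.0
```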
Attributes with Many Values
Problem:
• If an attribute has many values, Gain will tend to select it
• Imagine using Date = Jun 3 1996 as an attribute
One approach: use GainRatio instead
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|S_i| / |S|) log2(|S_i| / |S|)
where S_i is the subset of S for which A has value v_i
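As a small numerical illustration (not in the slides) of why dividing by SplitInformation helps: on the 14 examples of Table 3.1, a Date-like attribute with one example per value has SplitInformation = log2(14) ≈ 3.81, while Outlook (5 Sunny, 4 Overcast, 5 Rain) has only ≈ 1.58, so GainRatio deflates the many-valued attribute far more:

```python
import math

def split_information(value_counts, n):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    return -sum((c / n) * math.log2(c / n) for c in value_counts)

print(split_information([1] * 14, 14))   # ~3.807  (Date-like: 14 distinct values)
print(split_information([5, 4, 5], 14))  # ~1.577  (Outlook on Table 3.1)
```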
Attributes with Costs
Consider
• medical diagnosis, BloodTest has cost $150
How to learn a consistent tree with low expected cost?
One approach: replace gain by
• Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
• Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost
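Both heuristics are one-liners; a sketch assuming Gain(S, A) and Cost(A) have already been computed for the attribute in question:

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w
```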
Unknown Attribute Values
What if some examples are missing the value of A?
Use the training example anyway and sort it through the tree:
• If node n tests A, assign the most common value of A among the other examples sorted to node n
• or assign the most common value of A among the other examples with the same target value
• or assign probability p_i to each possible value v_i of A
• and assign the fraction p_i of the example to each descendant in the tree
Classify new examples in the same fashion
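A sketch (not from the slides) of the probabilistic option: estimate p_i for each value v_i of A from the examples that reach node n, then send the fraction p_i of the incomplete example down branch v_i. Missing values are assumed to be encoded as None:

```python
from collections import Counter

def value_probabilities(rows_at_node, a):
    """p_i for each observed value v_i of attribute index a among the examples at a node."""
    counts = Counter(r[a] for r in rows_at_node if r[a] is not None)
    n = sum(counts.values())
    return {v: c / n for v, c in counts.items()}

# e.g. {'Weak': 0.6, 'Strong': 0.4}: 0.6 of the example's weight goes down Wind = Weak.
```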
Decision Trees – Pros & Cons
• Pros
• Powerful hypothesis space (cf. concept learning)
• Robust to noise (probabilistic)
• Easy to interpret and add heuristics
• (adaptive computation time at test time; maps nicely to GPUs)
• Cons
• Greedy search finds a local optimum (and is sensitive to the training data); unpruned trees overfit severely (high variance)
Advanced Topic: Randomisation
Randomisation
There are many ways to add randomisation to the process of
learning:
• Data: Bootstrapping & bagging & cross-validation
• Features: Random splitting & sub-spaces (feature selection)
• Initialisation: Random initialisation
• Learning: Evolutionary & genetic algorithms
Power of randomisation
Randomisation is a meta-technique that can be used with any machine learning method – often randomisation is easier than a detailed analysis of a method’s properties and its sensitivity to the training data, followed by manual adjustment
Advanced Topic: Randomisation
Randomisation & Decision trees
Randomisation & Decision Trees = Random Forests
• Bagging (bootstrap aggregation) is a technique for reducing
the variance of an estimated prediction function.
• Bagging seems to work especially well for high-variance, low-bias procedures, such as trees.
• For regression:
• We fit the same tree many times to bootstrap-sampled versions of the training data and average over all trees.
• For Classification:
• A committee of trees each casts a vote and majority wins.
• Random forests is a substantial modification of bagging that
builds on a large collection of de-correlated trees.
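A by-hand sketch of bagging regression trees (not from the slides), with scikit-learn's DecisionTreeRegressor standing in for a generic tree learner and X, y assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, seed=0):
    """Fit the same tree to B bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample: n indices drawn with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the predictions of all trees (for classification: majority vote instead)."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```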
Random Forest for Regression or Classification
1: for b = 1 to B do
2: Draw a bootstrap sample Z of size N from the training data.
3: Grow a random-forest tree T_b to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached.
4: Select m variables at random from the p variables.
5: Pick the best variable among the m.
6: Split the node into daughter nodes.
7: end for
8: Output the ensemble of trees {T_b}_{b=1}^B.
To make a prediction at a new point x:
Regression: f̂_rf^B(x) = (1/B) Σ_{b=1}^B T_b(x)
Classification: let Ĉ_b(x) be the class prediction of the b-th random-forest tree; then Ĉ_rf^B(x) = majority vote of {Ĉ_b(x)}_{b=1}^B
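The same ensemble is available off the shelf; a sketch using scikit-learn, where n_estimators plays the role of B and max_features the role of m, and the dataset is an arbitrary stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample, testing sqrt(p) random features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))   # prediction = majority vote over trees
```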
Example: Real-Time Human Pose Recognition in Parts from
Single Depth Images
J. Shotton et al. “Real-Time Human Pose Recognition in Parts from
Single Depth Images”. In: IEEE Conference on Computer Vision and
Pattern Recognition. 2011
Video: kinect cvpr2011 bestpaper.mp4
Summary
Summary
• Decision tree model
• Entropy and information gain
• The principles of the ID3 algorithm
• Problem of overfitting
• Randomisation and the principles behind the random forest algorithm