Tree Learning
COMP9417 Machine Learning and Data Mining
Last revision: 21 Mar 2018
COMP9417 ML & DM Tree Learning Semester 1, 2018 1 / 98
Acknowledgements
Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides by Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
Material derived from slides by Eibe Frank
http://www.cs.waikato.ac.nz/ml/weka
Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook
COMP9417 ML & DM Tree Learning Semester 1, 2018 2 / 98
Aims
Aims
This lecture will enable you to describe decision tree learning, the use of
entropy and the problem of overfitting. Following it you should be able to:
• define the decision tree representation
• list representation properties of data and models for which decision
trees are appropriate
• reproduce the basic top-down algorithm for decision tree induction
(TDIDT)
• define entropy in the context of learning a Boolean classifier from
examples
• describe the inductive bias of the basic TDIDT algorithm
• define overfitting of a training set by a hypothesis
• describe developments of the basic TDIDT algorithm: pruning, rule
generation, numerical attributes, many-valued attributes, costs,
missing values
• describe regression and model trees
COMP9417 ML & DM Tree Learning Semester 1, 2018 3 / 98
Introduction
Brief History of Decision Tree Learning Algorithms
• late 1950s – Bruner et al. in psychology work on modelling concept
acquisition
• early 1960s – Hunt et al. in computer science work on Concept
Learning Systems (CLS)
• late 1970s – Quinlan’s Iterative Dichotomizer 3 (ID3) based on CLS is
efficient at learning on then-large data sets
• early 1990s – ID3 adds features, develops into C4.5, becomes the
“default” machine learning algorithm
• late 1990s – C5.0, commercial version of C4.5 (available from SPSS
and www.rulequest.com)
• current – widely available and applied; influential techniques
COMP9417 ML & DM Tree Learning Semester 1, 2018 4 / 98
Introduction
Why use decision trees?
• Decision trees are probably the single most popular data mining tool
• Easy to understand
• Easy to implement
• Easy to use
• Computationally cheap (efficient, even on big data)
• There are some drawbacks, though — e.g., high variance
• They do classification, i.e., predict a categorical output from
categorical and/or real inputs, or regression
COMP9417 ML & DM Tree Learning Semester 1, 2018 5 / 98
Introduction
Decision Tree for PlayTennis
Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes
COMP9417 ML & DM Tree Learning Semester 1, 2018 6 / 98
Introduction
A Tree to Predict C-Section Risk
Learned from medical records of 1000 women
Negative examples are C-sections
[833+,167-] .83+ .17-
Fetal_Presentation = 1: [822+,116-] .88+ .12-
| Previous_Csection = 0: [767+,81-] .90+ .10-
| | Primiparous = 0: [399+,13-] .97+ .03-
| | Primiparous = 1: [368+,68-] .84+ .16-
| | | Fetal_Distress = 0: [334+,47-] .88+ .12-
| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-
| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-
| | | Fetal_Distress = 1: [34+,21-] .62+ .38-
| Previous_Csection = 1: [55+,35-] .61+ .39-
Fetal_Presentation = 2: [3+,29-] .11+ .89-
Fetal_Presentation = 3: [8+,22-] .27+ .73-
COMP9417 ML & DM Tree Learning Semester 1, 2018 7 / 98
Introduction
Decision Tree for Credit Rating
COMP9417 ML & DM Tree Learning Semester 1, 2018 8 / 98
Introduction
Decision Tree for Fisher’s Iris data
COMP9417 ML & DM Tree Learning Semester 1, 2018 9 / 98
Introduction
Decision Trees
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
How would we represent the following expressions ?
• ∧,∨, XOR
• (A ∧B) ∨ (C ∧ ¬D ∧ E)
• M of N
COMP9417 ML & DM Tree Learning Semester 1, 2018 10 / 98
Introduction
Decision Trees
X ∧ Y
X = t:
| Y = t: true
| Y = f: false
X = f: false
X ∨ Y
X = t: true
X = f:
| Y = t: true
| Y = f: false
COMP9417 ML & DM Tree Learning Semester 1, 2018 11 / 98
Introduction
Decision Trees
2 of 3
X = t:
| Y = t: true
| Y = f:
| | Z = t: true
| | Z = f: false
X = f:
| Y = t:
| | Z = t: true
| | Z = f: false
| Y = f: false
So in general decision trees represent a disjunction of conjunctions of
constraints on the attribute values of instances.
COMP9417 ML & DM Tree Learning Semester 1, 2018 12 / 98
Introduction
When are Decision Trees the Right Model?
• With Boolean values for the instances X and class Y , the
representation adopted by decision-trees allows us to represent Y as a
Boolean function of the X
• Given d input Boolean variables, there are 2^d possible input values for
these variables. Any specific function assigns Y = 1 to some subset of
these, and Y = 0 to the rest
• Any Boolean function can be trivially represented by a tree. Each
function assigns Y = 1 to some subset of the 2^d possible values of X.
So, for each combination of values with Y = 1, have a path from root
to a leaf with Y = 1. All other leaves have Y = 0
COMP9417 ML & DM Tree Learning Semester 1, 2018 13 / 98
Introduction
When are Decision Trees the Right Model?
• This is nothing but a re-representation of the truth-table, and will
have 2^d leaves. More compact trees may be possible, by taking into
account what is common between one or more rows with the same Y
value
• But, even for Boolean functions, there are some functions for which
compact trees may not be possible (the parity and majority functions
are examples)
• In general, although possible in principle to express any Boolean
function, our search and prior restrictions may not allow us to find the
correct tree in practice.
• BUT: If you want readable models that combine logical tests with a
probability-based decision, then decision trees are a good start
COMP9417 ML & DM Tree Learning Semester 1, 2018 14 / 98
Introduction
When to Consider Decision Trees?
• Instances described by a mix of numeric features and discrete
attribute–value pairs
• Target function is discrete valued (otherwise use regression trees)
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Interpretability is an advantage
Examples are extremely numerous, including:
• Equipment or medical diagnosis
• Credit risk analysis
• Modeling calendar scheduling preferences
• etc.
COMP9417 ML & DM Tree Learning Semester 1, 2018 15 / 98
The Basic Algorithm
Top-Down Induction of Decision Trees (TDIDT)
Main loop:
• A ← the “best” decision attribute for next node
• Assign A as decision attribute for node
• For each value of A, create new descendant of node
• Sort training examples to leaf nodes
• If training examples perfectly classified, Then STOP, Else iterate over
new leaf nodes
Essentially this is the “ID3” algorithm (Quinlan, 1986) — the first efficient
symbolic Machine Learning algorithm.
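A minimal Python sketch of this main loop, assuming discrete attributes and examples given as (attribute-dict, label) pairs; best_attribute stands in for the scoring step (e.g. information gain, defined in the next section):

```python
from collections import Counter

def tdidt(examples, attributes, best_attribute):
    """Grow a decision tree top-down; examples is a list of (dict, label) pairs."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # perfectly classified: stop
        return labels[0]
    if not attributes:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    A = best_attribute(examples, attributes)  # choose the "best" attribute
    tree = {A: {}}
    for v in {x[A] for x, _ in examples}:     # one branch per observed value of A
        subset = [(x, y) for x, y in examples if x[A] == v]
        tree[A][v] = tdidt(subset,
                           [a for a in attributes if a != A],
                           best_attribute)
    return tree
```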
COMP9417 ML & DM Tree Learning Semester 1, 2018 16 / 98
The Basic Algorithm
Which attribute is best?
S: [29+,35-]
A1 = t: [21+,5-]   A1 = f: [8+,30-]
A2 = t: [18+,33-]  A2 = f: [11+,2-]
COMP9417 ML & DM Tree Learning Semester 1, 2018 17 / 98
Information Gain
Bits
You are watching a set of independent random samples of X
You observe that X has four possible values
P(X = A) = 1/4, P(X = B) = 1/4, P(X = C) = 1/4, P(X = D) = 1/4
So you might see: BAACBADCDADDDA…
You transmit data over a binary serial link. You can encode each reading
with two bits (e.g. A = 00, B = 01, C = 10, D = 11)
0100001001001110110011111100…
COMP9417 ML & DM Tree Learning Semester 1, 2018 18 / 98
Information Gain
Fewer Bits
Someone tells you that the probabilities are not equal
P(X = A) = 1/2, P(X = B) = 1/4, P(X = C) = 1/8, P(X = D) = 1/8
It’s possible . . .
. . . to invent a coding for your transmission that only uses 1.75 bits on
average per symbol. How ?
COMP9417 ML & DM Tree Learning Semester 1, 2018 19 / 98
Information Gain
Fewer Bits
Someone tells you that the probabilities are not equal
P(X = A) = 1/2, P(X = B) = 1/4, P(X = C) = 1/8, P(X = D) = 1/8
It’s possible . . .
. . . to invent a coding for your transmission that only uses 1.75 bits per
symbol on average. How ?
A 0
B 10
C 110
D 111
(This is just one of several ways)
COMP9417 ML & DM Tree Learning Semester 1, 2018 20 / 98
Information Gain
Fewer Bits
Suppose there are three equally likely values
P(X = A) = 1/3, P(X = B) = 1/3, P(X = C) = 1/3
Here’s a naïve coding, costing 2 bits per symbol
A 00
B 01
C 10
Can you think of a coding that would need only 1.6 bits per symbol on
average ?
COMP9417 ML & DM Tree Learning Semester 1, 2018 21 / 98
Information Gain
Fewer Bits
Suppose there are three equally likely values
P(X = A) = 1/3, P(X = B) = 1/3, P(X = C) = 1/3
Using the same approach as before, we can get a coding costing 1.6 bits
per symbol on average . . .
A 0
B 10
C 11
This gives us, on average, (1/3) × 1 bit for A and 2 × (1/3) × 2 bits for B and C,
which equals 5/3 ≈ 1.6 bits.
Is this the best we can do ?
COMP9417 ML & DM Tree Learning Semester 1, 2018 22 / 98
Information Gain
Fewer Bits
Suppose there are three equally likely values
P(X = A) = 1/3, P(X = B) = 1/3, P(X = C) = 1/3
From information theory, the optimal number of bits to encode a symbol
with probability p is − log2 p . . .
So the best we can do for this case is − log2(1/3) bits for each of A, B and
C, or 1.5849625007211563 bits per symbol
COMP9417 ML & DM Tree Learning Semester 1, 2018 23 / 98
Information Gain
General Case
Suppose X can have one of m values . . . V1, V2, . . . Vm
P (X = V1) = p1 P (X = V2) = p2 . . . P (X = Vm) = pm
What’s the smallest possible number of bits, on average, per symbol,
needed to transmit a stream of symbols drawn from X’s distribution ? It’s
H(X) = −p1 log2 p1 − p2 log2 p2 − . . . − pm log2 pm = − ∑_{j=1}^{m} pj log2 pj
H(X) = the entropy of X
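A small sketch in Python to check the numbers above; entropy is computed directly from the definition of H(X):

```python
import math

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: four equally likely symbols
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: the A/B/C/D example
print(entropy([1/3, 1/3, 1/3]))            # 1.5849... bits: three equally likely symbols
```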
COMP9417 ML & DM Tree Learning Semester 1, 2018 24 / 98
Information Gain
General Case
“High entropy” means X is drawn from a (nearly) uniform distribution: samples are spread over all values and each one is hard to predict
“Low entropy” means the distribution is highly non-uniform (peaked): samples are concentrated on a few values and are easier to predict
COMP9417 ML & DM Tree Learning Semester 1, 2018 25 / 98
Information Gain
Entropy
[Figure: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to a maximum of 1.0 at p⊕ = 0.5 and falling back to 0 at p⊕ = 1]
Where:
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
COMP9417 ML & DM Tree Learning Semester 1, 2018 26 / 98
Information Gain
Entropy
Entropy measures the “impurity” of S
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
A “pure” sample is one in which all examples are of the same class.
COMP9417 ML & DM Tree Learning Semester 1, 2018 27 / 98
Information Gain
Entropy
Entropy(S) = expected number of bits needed to encode class (⊕ or ⊖)
of randomly drawn member of S (under the optimal, shortest-length code)
Why ?
Information theory: optimal length code assigns − log2 p bits to message
having probability p.
So, expected number of bits to encode ⊕ or ⊖ of random member of S:
p⊕(− log2 p⊕) + p⊖(− log2 p⊖)
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
COMP9417 ML & DM Tree Learning Semester 1, 2018 28 / 98
Information Gain
Information Gain
• Gain(S,A) = expected reduction in entropy due to sorting on A
Gain(S,A) ≡ Entropy(S) − ∑_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

S: [29+,35-]
A1 = t: [21+,5-]   A1 = f: [8+,30-]
A2 = t: [18+,33-]  A2 = f: [11+,2-]
COMP9417 ML & DM Tree Learning Semester 1, 2018 29 / 98
Information Gain
Information Gain
Gain(S,A1) = Entropy(S) − ( (|St| / |S|) Entropy(St) + (|Sf| / |S|) Entropy(Sf) )
           = 0.9936 − ( (26/64)(−(21/26) log2(21/26) − (5/26) log2(5/26))
                      + (38/64)(−(8/38) log2(8/38) − (30/38) log2(30/38)) )
           = 0.9936 − ( 0.2869 + 0.4408 )
           = 0.2658
COMP9417 ML & DM Tree Learning Semester 1, 2018 30 / 98
Information Gain
Information Gain
Gain(S,A2) = 0.9936 − ( 0.7464 + 0.1258 )
           = 0.1214
COMP9417 ML & DM Tree Learning Semester 1, 2018 31 / 98
Information Gain
Information Gain
So we choose A1, since it gives a larger expected reduction in entropy.
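The same calculation as a minimal sketch (counts are (positives, negatives); the printed values should reproduce the figures above up to rounding):

```python
import math

def entropy2(pos, neg):
    """Binary entropy, in bits, of a node with the given class counts."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def info_gain(parent, children):
    """parent = (pos, neg) at the node; children = list of (pos, neg) after the split."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy2(p, q) for p, q in children)
    return entropy2(*parent) - remainder

print(info_gain((29, 35), [(21, 5), (8, 30)]))   # A1: ~0.266
print(info_gain((29, 35), [(18, 33), (11, 2)]))  # A2: ~0.121
```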
COMP9417 ML & DM Tree Learning Semester 1, 2018 32 / 98
Information Gain
Training Examples
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
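For comparison, a hedged sketch that fits sklearn's tree learner to this table. Note this is not ID3: sklearn implements CART-style binary splits on numeric inputs, so the nominal attributes are one-hot encoded first; criterion="entropy" at least uses the same impurity measure as the slides.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild",
                    "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes",
                    "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="PlayTennis"))  # one-hot encode nominal attributes
y = data["PlayTennis"]
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))
```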
COMP9417 ML & DM Tree Learning Semester 1, 2018 33 / 98
Information Gain
Information gain once more
Which attribute is the best classifier?
S: [9+,5-], Entropy = 0.940

Humidity: High → [3+,4-] (E = 0.985),  Normal → [6+,1-] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind: Weak → [6+,2-] (E = 0.811),  Strong → [3+,3-] (E = 1.00)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
COMP9417 ML & DM Tree Learning Semester 1, 2018 34 / 98
Information Gain
Information gain once more
{D1, D2, ..., D14}  [9+,5−]
Outlook = Sunny:    {D1,D2,D8,D9,D11}  [2+,3−]  ?
Outlook = Overcast: {D3,D7,D12,D13}    [4+,0−]  Yes
Outlook = Rain:     {D4,D5,D6,D10,D14} [3+,2−]  ?
Which attribute should be tested here?
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
COMP9417 ML & DM Tree Learning Semester 1, 2018 35 / 98
Tree Learning More Generally
Hypothesis Space Search by ID3
[Figure: ID3’s hypothesis space search: starting from the empty tree, each step greedily adds an attribute test (A1, A2, A3, A4, ...) at a leaf, generating progressively larger partial trees]
Tree Learning More Generally
Hypothesis Space Search by ID3
• This can be viewed as a graph-search problem
• Each vertex in the graph is a decision tree
• Suppose we only consider the two-class case (ω = ω1 or ω2), and all
the features xi are Boolean, so each vertex is a binary tree
• A pair of vertices in the graph have an edge if the corresponding trees
differ in just the following way: one of the leaf-nodes in one vertex
has been replaced by a non-leaf node testing a feature that has not
appeared earlier (and 2 leaves)
• This is the full space of all decision trees (is it?). We want to search
for a single tree or a small number of trees in this space. How should
we do this?
COMP9417 ML & DM Tree Learning Semester 1, 2018 37 / 98
Tree Learning More Generally
Hypothesis Space Search by ID3
• Usual graph-search technique: greedy or beam search, starting with
the vertex corresponding to the “empty tree” (single leaf node)
• Greedy choice: which one to select? The neighbour that results in the
greatest increase in P (D|T )
• How?
• Suppose T is changed to T ′. Simply use the ratio of P (D|T ′)/P (D|T )
• Most of the calculation will cancel out: so, we will only need to do the
local computation at the leaf that was converted into a non-leaf node
• RESULT: set of trees with (reasonably) high posterior probabilities
given D: we can now use these to answer questions like
P (y′ = ω1| . . .)? or even make a decision or a classification that
y′ = ω1, given input data x
COMP9417 ML & DM Tree Learning Semester 1, 2018 38 / 98
Tree Learning More Generally
Hypothesis Space Search by ID3
• Hypothesis space is complete! (contains all finite discrete-valued
functions w.r.t attributes)
• Target function surely in there…
• Outputs a single hypothesis (which one?)
• Can’t play 20 questions…
• No backtracking
• Local minima…
• Statistically-based search choices
• Robust to noisy data…
• Inductive bias: approx “prefer shortest tree”
COMP9417 ML & DM Tree Learning Semester 1, 2018 39 / 98
Tree Learning More Generally
Inductive Bias in ID3
Note H is the power set of instances X
→Unbiased?
Not really…
• Preference for short trees, and for those with high information gain
attributes near the root
• Bias is a preference for some hypotheses, rather than a restriction of
hypothesis space H
• an incomplete search of a complete hypothesis space versus a
complete search of an incomplete hypothesis space (as in learning
conjunctive concepts)
• Occam’s razor: prefer the shortest hypothesis that fits the data
COMP9417 ML & DM Tree Learning Semester 1, 2018 40 / 98
Tree Learning More Generally
Occam’s Razor
William of Ockham (c. 1287-1347)
Entities should not be multiplied beyond necessity
Why prefer short hypotheses?
Argument in favour:
• Fewer short hypotheses than long hypotheses
→ a short hyp that fits data unlikely to be coincidence
→ a long hyp that fits data might be coincidence
COMP9417 ML & DM Tree Learning Semester 1, 2018 41 / 98
Tree Learning More Generally
Occam’s Razor
Argument opposed:
• There are many ways to define small sets of hypotheses
• e.g., all trees with a prime number of nodes that use attributes
beginning with “Z”
• What’s so special about small sets based on size of hypothesis??
Look back to linear classification lecture to see how to make this work
using Minimum Description Length (MDL)
COMP9417 ML & DM Tree Learning Semester 1, 2018 42 / 98
Overfitting and How To Avoid It
Why does overfitting occur?
• Greedy search can make mistakes. We know that it can end up in
local minima, so a sub-optimal choice earlier might result in a
better solution later (i.e. picking a test whose posterior gain (or
information gain) is less than the best one)
• But there is also another kind of problem. We know that training
error is an optimistic estimate of the true error of the model, and that
this optimism increases as the training error decreases
• We will see why this is the case later (lectures on Evaluation)
• Suppose we have two models h1 and h2 with training errors e1 and e2
and optimism o1 and o2. Let the true error of each be E1 = e1 + o1
and E2 = e2 + o2
• If e1 < e2 and E1 > E2, then we say that h1 has overfit the
training data
• So, a search method based purely on training data estimates may end
up overfitting the training data
COMP9417 ML & DM Tree Learning Semester 1, 2018 43 / 98
Overfitting and How To Avoid It
Overfitting in Decision Tree Learning
Consider adding noisy training example #15:
Sunny, Hot, Normal, Strong, PlayTennis = No
What effect on earlier tree?
Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes
COMP9417 ML & DM Tree Learning Semester 1, 2018 44 / 98
Overfitting and How To Avoid It
Overfitting in Decision Tree Learning
{D1, D2, ..., D14}  [9+,5−]
Outlook = Sunny:    {D1,D2,D8,D9,D11}  [2+,3−]  ?
Outlook = Overcast: {D3,D7,D12,D13}    [4+,0−]  Yes
Outlook = Rain:     {D4,D5,D6,D10,D14} [3+,2−]  ?
Which attribute should be tested here?
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
COMP9417 ML & DM Tree Learning Semester 1, 2018 45 / 98
Overfitting and How To Avoid It
Overfitting in General
Consider error of hypothesis h over
• training data: error_train(h)
• entire distribution D of data: error_D(h)
Definition
Hypothesis h ∈ H overfits the training data if there is an alternative
hypothesis h′ ∈ H such that
error_train(h) < error_train(h′) and error_D(h) > error_D(h′)
COMP9417 ML & DM Tree Learning Semester 1, 2018 46 / 98
Overfitting and How To Avoid It
Overfitting in Decision Tree Learning
[Figure: accuracy vs. size of tree (number of nodes). Accuracy on the training data keeps increasing as the tree grows, while accuracy on the test data levels off and then declines]
COMP9417 ML & DM Tree Learning Semester 1, 2018 47 / 98
Overfitting and How To Avoid It
Avoiding Overfitting
How can we avoid overfitting? Pruning
• pre-pruning stop growing when data split not statistically significant
• post-pruning grow full tree, then remove sub-trees which are
overfitting
Post-pruning avoids problem of “early stopping”
How to select “best” tree:
• Measure performance over training data ?
• Measure performance over separate validation data set ?
• MDL: minimize size(tree) + size(misclassifications(tree)) ?
COMP9417 ML & DM Tree Learning Semester 1, 2018 48 / 98
Overfitting and How To Avoid It
Avoiding Overfitting
Pre-pruning
• Can be based on statistical significance test
• Stops growing the tree when there is no statistically significant
association between any attribute and the class at a particular node
• For example, in ID3: chi-squared test plus information gain
• only statistically significant attributes were allowed to be selected by
information gain procedure
COMP9417 ML & DM Tree Learning Semester 1, 2018 49 / 98
Overfitting and How To Avoid It
Avoiding Overfitting
Pre-pruning
• Simplest approach: stop growing the tree when fewer than some
lower-bound on the number of examples at a leaf
• In C4.5, this parameter is the m parameter
• In sklearn, this parameter is min_samples_leaf
• In sklearn, the parameter min_impurity_decrease stops splitting when
the impurity decrease achieved by the best split falls below a
lower bound (see the sketch below)
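A minimal sketch of these two pre-pruning parameters in sklearn (the parameter names are sklearn's; the iris data is used only as a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    min_samples_leaf=5,           # lower bound on examples per leaf
    min_impurity_decrease=0.01,   # best split must reduce impurity at least this much
    random_state=0,
).fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())   # typically a smaller tree than the unrestricted one
```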
COMP9417 ML & DM Tree Learning Semester 1, 2018 50 / 98
Overfitting and How To Avoid It
Avoiding Overfitting
Early stopping
• Pre-pruning may suffer from early stopping: may stop the growth of
tree prematurely
• Classic example: XOR/Parity-problem
• No individual attribute exhibits a significant association with the class
• Target structure only visible in fully expanded tree
• Prepruning won’t expand the root node
• But: XOR-type problems not common in practice
• And: pre-pruning faster than post-pruning
COMP9417 ML & DM Tree Learning Semester 1, 2018 51 / 98
Overfitting and How To Avoid It
Avoiding Overfitting
Post-pruning
• Builds full tree first and prunes it afterwards
• Attribute interactions are visible in fully-grown tree
• Problem: identification of subtrees and nodes that are due to chance
effects
• Two main pruning operations:
• Subtree replacement
• Subtree raising
• Possible strategies: error estimation, significance testing, MDL
principle
• We examine two methods: Reduced-error Pruning and Error-based
Pruning
COMP9417 ML & DM Tree Learning Semester 1, 2018 52 / 98
Overfitting and How To Avoid It
Reduced-Error Pruning
Split data into training and validation set
Do until further pruning is harmful:
• Evaluate impact on validation set of pruning each possible node
(plus those below it)
• Greedily remove the one that most improves validation set accuracy
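Reduced-error pruning itself is not built into sklearn; its post-pruning mechanism is minimal cost-complexity pruning. The hedged sketch below borrows the validation-set idea from reduced-error pruning: grow the full tree, then choose the pruning strength (ccp_alpha) that maximises validation accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit at each strength and keep the tree that does best on the validation set
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), round(best.score(X_val, y_val), 3))
```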
COMP9417 ML & DM Tree Learning Semester 1, 2018 53 / 98
Overfitting and How To Avoid It
Reduced-Error Pruning
• Good: produces smallest version of the most accurate subtree
• Not so good: reduces the effective size of the training set, since data
must be held out for validation
COMP9417 ML & DM Tree Learning Semester 1, 2018 54 / 98
Overfitting and How To Avoid It
Effect of Reduced-Error Pruning
[Figure: accuracy vs. size of tree (number of nodes), showing accuracy on training data, on test data, and on test data during pruning; reduced-error pruning removes nodes while improving test-set accuracy]
COMP9417 ML & DM Tree Learning Semester 1, 2018 55 / 98
Overfitting and How To Avoid It
Error-based pruning (C4.5 / J48 / C5.0)
Quinlan (1993) describes the successor to ID3 – C4.5
• many extensions – see below
• post-pruning using training set
• includes sub-tree replacement and sub-tree raising
• also: pruning by converting tree to rules
• commercial version – C5.0 – is widely used
• RuleQuest.com
• now free
• Weka version – J48 – also widely used
COMP9417 ML & DM Tree Learning Semester 1, 2018 56 / 98
Overfitting and How To Avoid It
Pruning operator: Sub-tree replacement
Bottom-up:
a sub-tree is considered for replacement once all of its own sub-trees have
been considered
COMP9417 ML & DM Tree Learning Semester 1, 2018 57 / 98
Overfitting and How To Avoid It
Error-based pruning: error estimate
Goal is to improve estimate of error on unseen data using all and only data
from training set
But how can this work ?
Make the estimate of error pessimistic !
COMP9417 ML & DM Tree Learning Semester 1, 2018 58 / 98
Overfitting and How To Avoid It
Error-based pruning: error estimate
• Apply pruning operation if this does not increase the estimated error
• C4.5’s method: using upper limit of standard confidence interval
derived from the training data
• Standard Bernoulli-process-based method
• Note: statistically motivated, but not statistically valid
• However: works well in practice !
COMP9417 ML & DM Tree Learning Semester 1, 2018 59 / 98
Overfitting and How To Avoid It
Error-based pruning: error estimate
• The error estimate for a tree node is the weighted sum of error
estimates for all its subtrees (possibly leaves).
• Upper bound error estimate e for a node (simplified version):
e = f + Zc · √( f (1 − f) / N )
• f is actual (empirical) error of tree on examples at the tree node
• N is the number of examples at the tree node
• Zc is a constant whose value depends on confidence parameter c
• C4.5’s default value for confidence c = 0.25
• If c = 0.25 then Zc = 0.69 (from standardized normal distribution)
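A sketch of the simplified estimate above. Zc is taken here from the one-sided standard normal quantile, which gives about 0.67 for c = 0.25 (the slides quote 0.69); either way the estimate is more pessimistic for nodes with few examples.

```python
import math
from statistics import NormalDist

def pessimistic_error(f, N, c=0.25):
    """Upper-bound error estimate e = f + Zc * sqrt(f(1-f)/N)."""
    z = NormalDist().inv_cdf(1 - c)           # ~0.674 for c = 0.25
    return f + z * math.sqrt(f * (1 - f) / N)

# A leaf with 2 errors out of 6 examples: empirical error 0.33, estimate ~0.46
print(round(pessimistic_error(2 / 6, 6), 2))
```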
COMP9417 ML & DM Tree Learning Semester 1, 2018 60 / 98
Overfitting and How To Avoid It
Error-based pruning: error estimate
• How does this method implement a pessimistic error estimate ?
• What effect will the c parameter have on pruning ?
• As c ↑, Zc ↓, so the error estimate is less pessimistic and less pruning occurs
• See example on next slide (note: values not calculated using exactly
the above formula)
COMP9417 ML & DM Tree Learning Semester 1, 2018 61 / 98
Overfitting and How To Avoid It
Error-based pruning: error estimate
• health plan contribution: node
measures f = 0.36, e = 0.46
• sub-tree measures:
• none: f = 0.33, e = 0.47
• half: f = 0.5, e = 0.72
• full: f = 0.33, e = 0.47
• sub-trees combined 6 : 2 : 6 gives
0.51
• sub-trees estimated to give greater
error so prune away
COMP9417 ML & DM Tree Learning Semester 1, 2018 62 / 98
Overfitting and How To Avoid It
Rule Post-Pruning
This method was introduced in Quinlan’s C4.5
• Convert tree to equivalent set of rules
• Prune each rule independently of others
• Sort final rules into desired sequence for use
For: simpler classifiers, people prefer rules to trees
Against: does not scale well, slow for large trees & datasets
COMP9417 ML & DM Tree Learning Semester 1, 2018 63 / 98
Overfitting and How To Avoid It
Converting A Tree to Rules
Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes

IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes
. . .
COMP9417 ML & DM Tree Learning Semester 1, 2018 64 / 98
Overfitting and How To Avoid It
Rules from Trees (Rule Post-Pruning)
Rules can be simpler than trees but just as accurate, e.g., in C4.5Rules:
• path from root to leaf in (unpruned) tree forms a rule
• i.e., tree forms a set of rules
• can simplify rules independently by deleting conditions
• i.e., rules can be generalized while maintaining accuracy
• greedy rule simplification algorithm
• drop the condition giving lowest estimated error (as for pruning)
• continue while estimated error does not increase
COMP9417 ML & DM Tree Learning Semester 1, 2018 65 / 98
Overfitting and How To Avoid It
Rules from Trees
Select a “good” subset of rules within a class (C4.5Rules):
• goal: remove rules not useful in terms of accuracy
• find a subset of rules which minimises an MDL criterion
• trade-off accuracy and complexity of rule-set
• stochastic search using simulated annealing
Sets of rules can be ordered by class (C4.5Rules):
• order classes by increasing chance of making false positive errors
• set as a default the class with the most training instances not covered
by any rule
COMP9417 ML & DM Tree Learning Semester 1, 2018 66 / 98
Further Issues in Tree Learning
Continuous Valued Attributes
Decision trees originated for discrete attributes only. Now: continuous
attributes.
Can create a discrete attribute to test continuous value:
• Temperature = 82.5
• (Temperature > 72.3) ∈ {t, f}
• Usual method: continuous attributes have a binary split
• Note:
• discrete attributes – one split exhausts all values
• continuous attributes – can have many splits in a tree
COMP9417 ML & DM Tree Learning Semester 1, 2018 67 / 98
Further Issues in Tree Learning
Continuous Valued Attributes
• Splits evaluated on all possible split points
• More computation: n− 1 possible splits for n values of an attribute
in training set
• Fayyad (1991)
• sort examples on continuous attribute
• find midway boundaries where class changes, e.g. for Temperature:
(48+60)/2 and (80+90)/2
• Choose best split point by info gain (or evaluation of choice)
• Note: C4.5 uses actual values in data
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
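A small sketch of this procedure on the Temperature example: sort on the attribute and keep the midpoints where the class label changes.

```python
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, labels))
candidates = [(t1 + t2) / 2
              for (t1, c1), (t2, c2) in zip(pairs, pairs[1:])
              if c1 != c2]                    # class changes between neighbours
print(candidates)   # [54.0, 85.0], i.e. (48+60)/2 and (80+90)/2
```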
COMP9417 ML & DM Tree Learning Semester 1, 2018 68 / 98
Further Issues in Tree Learning
Axis-parallel Splitting
Fitting data that is not a good “match” to the possible splits in a tree.
“Pattern Classification” Duda, Hart, and Stork, (2001)
COMP9417 ML & DM Tree Learning Semester 1, 2018 69 / 98
Further Issues in Tree Learning
Splitting on Linear Combinations of Features
Reduced tree size by allowing splits that are a better “match” to the data.
“Pattern Classification” Duda, Hart, and Stork, (2001)
COMP9417 ML & DM Tree Learning Semester 1, 2018 70 / 98
Further Issues in Tree Learning
Attributes with Many Values
Problem:
• If attribute has many values, Gain will select it
• Why ? more likely to split instances into “pure” subsets
• Maximised by singleton subsets
• Imagine using Date = March 21, 2018 as attribute
• High gain on training set, useless for prediction
COMP9417 ML & DM Tree Learning Semester 1, 2018 71 / 98
Further Issues in Tree Learning
Attributes with Many Values
One approach: use GainRatio instead
GainRatio(S,A) ≡ Gain(S,A) / SplitInformation(S,A)

SplitInformation(S,A) ≡ − ∑_{i=1}^{c} (|Si| / |S|) log2 (|Si| / |S|)

where Si is the subset of S for which A has value vi
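A minimal sketch of split information, contrasting Outlook (which splits the 14 PlayTennis examples 5/4/5) with a Date-like attribute that gives every example its own value:

```python
import math

def split_information(subset_sizes):
    """Entropy of the partition itself: -sum |Si|/|S| * log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

print(split_information([5, 4, 5]))   # ~1.58 for Outlook
print(split_information([1] * 14))    # ~3.81 for a singleton-per-example attribute
```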
COMP9417 ML & DM Tree Learning Semester 1, 2018 72 / 98
Further Issues in Tree Learning
Attributes with Many Values
Why does this help ?
• sensitive to how broadly and uniformly attribute splits instances
• actually the entropy of S w.r.t. values of A
• i.e., the information of the partition itself
• therefore higher for many-valued attributes, especially if mostly
uniformly distributed across possible values
COMP9417 ML & DM Tree Learning Semester 1, 2018 73 / 98
Further Issues in Tree Learning
Attributes with Costs
Consider
• medical diagnosis, BloodTest has cost $150
• robotics, Width_from_1ft has cost 23 sec.
How to learn a consistent tree with low expected cost?
COMP9417 ML & DM Tree Learning Semester 1, 2018 74 / 98
Further Issues in Tree Learning
Attributes with Costs
One approach: evaluate information gain relative to cost:
• Example: Gain²(S,A) / Cost(A)
Preference for decision trees using lower-cost attributes.
COMP9417 ML & DM Tree Learning Semester 1, 2018 75 / 98
Further Issues in Tree Learning
Attributes with Costs
Also: class (misclassification) costs, instance costs, . . .
See5 / C5.0 can use a misclassification cost matrix.
Can give false positives a different cost to false negatives
Forces a different tree structure to be learned to minimise asymmetric
misclassification costs – can help if class distribution is skewed – why ?
COMP9417 ML & DM Tree Learning Semester 1, 2018 76 / 98
Further Issues in Tree Learning
Unknown Attribute Values
What if some examples missing values of A?
Use training example anyway, sort through tree. Here are 3 possible
approaches
• If node n tests A, assign most common value of A among other
examples sorted to node n
• assign most common value of A among other examples with same
target value
• assign probability pi to each possible value vi of A, and assign fraction
pi of the example to each corresponding descendant in the tree
Note: need to classify new (unseen) examples in same fashion
COMP9417 ML & DM Tree Learning Semester 1, 2018 77 / 98
Further Issues in Tree Learning
Windowing
Early implementations – training sets too large for memory
As a solution ID3 implemented windowing:
1. select subset of instances – the window
2. construct decision tree from all instances in the window
3. use tree to classify training instances not in window
4. if all instances correctly classified then halt, else
5. add selected misclassified instances to the window
6. go to step 2
Windowing retained in C4.5 because it can lead to more accurate trees.
Related to ensemble learning.
COMP9417 ML & DM Tree Learning Semester 1, 2018 78 / 98
Learning Non-linear Regression Models with Trees
Non-linear Regression with Trees
Despite some nice properties of Neural Networks, such as generalization to
deal sensibly with unseen input patterns and robustness to losing neurons
(prediction performance can degrade gracefully), they still have some
problems:
• Back-propagation is often difficult to scale – large nets need lots of
computing time; may have to be partitioned into separate modules
that can be trained independently, e.g. NetTalk, DeepBind
• Neural Networks are not very transparent – hard to understand the
representation of what has been learned
Possible solution: exploit success of tree-structured approaches in ML
COMP9417 ML & DM Tree Learning Semester 1, 2018 79 / 98
Learning Non-linear Regression Models with Trees
Regression trees
• Differences to decision trees:
• Splitting criterion: minimizing intra-subset variation
• Pruning criterion: based on numeric error measure
• Leaf node predicts the average class (target) value of the training
instances reaching that node
• Can approximate piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees
COMP9417 ML & DM Tree Learning Semester 1, 2018 80 / 98
Learning Non-linear Regression Models with Trees
A Regression Tree and its Prediction Surface
“Elements of Statistical Learning” Hastie, Tibshirani & Friedman (2001)
COMP9417 ML & DM Tree Learning Semester 1, 2018 81 / 98
Learning Non-linear Regression Models with Trees
Regression Tree on sine dataset
COMP9417 ML & DM Tree Learning Semester 1, 2018 82 / 98
Learning Non-linear Regression Models with Trees
Regression Tree on CPU dataset
COMP9417 ML & DM Tree Learning Semester 1, 2018 83 / 98
Learning Non-linear Regression Models with Trees
Tree learning as variance reduction
• The variance of a Boolean (i.e., Bernoulli) variable with success
probability ṗ is ṗ(1 − ṗ), which is half the Gini index. So we could
interpret the goal of tree learning as minimising the class variance (or
standard deviation, in the case of √Gini) in the leaves.
• In regression problems we can define the variance in the usual way:
Var(Y) = (1 / |Y|) ∑_{y ∈ Y} (y − ȳ)²
If a split partitions the set of target values Y into mutually exclusive
sets {Y1, . . . , Yl}, the weighted average variance is then
Var({Y1, . . . , Yl}) = ∑_{j=1}^{l} (|Yj| / |Y|) Var(Yj) = . . . = (1 / |Y|) ∑_{y ∈ Y} y² − ∑_{j=1}^{l} (|Yj| / |Y|) ȳj²
The first term is constant for a given set Y and so we want to
maximise the weighted average of squared means in the children.
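A small sketch of this criterion: the weighted average variance of the children (to be minimised), or equivalently the weighted average of squared means (to be maximised).

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def weighted_variance(children):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * variance(c) for c in children)

def weighted_sq_means(children):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * (sum(c) / len(c)) ** 2 for c in children)

ys = [1, 3, 5, 7]
print(variance(ys))                          # 5.0 before splitting
print(weighted_variance([[1, 3], [5, 7]]))   # 1.0 after a good split
```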
COMP9417 ML & DM Tree Learning Semester 1, 2018 84 / 98
Learning Non-linear Regression Models with Trees
Learning a regression tree
Imagine you are a collector of vintage Hammond tonewheel organs. You
have been monitoring an online auction site, from which you collected
some data about interesting transactions:
# Model Condition Leslie Price
1. B3 excellent no 4513
2. T202 fair yes 625
3. A100 good no 1051
4. T202 good no 270
5. M102 good yes 870
6. A100 excellent no 1770
7. T202 fair no 99
8. A100 good yes 1900
9. E112 fair no 77
COMP9417 ML & DM Tree Learning Semester 1, 2018 85 / 98
Learning Non-linear Regression Models with Trees
Learning a regression tree
From this data, you want to construct a regression tree that will help you
determine a reasonable price for your next purchase.
There are three features, hence three possible splits:
Model = [A100, B3, E112, M102, T202]:
[1051, 1770, 1900] [4513] [77] [870] [99, 270, 625]
Condition = [excellent, good, fair]:
[1770, 4513] [270, 870, 1051, 1900] [77, 99, 625]
Leslie = [yes, no]:
[625, 870, 1900] [77, 99, 270, 1051, 1770, 4513]
The means of the first split are 1574, 4513, 77, 870 and 331, and the
weighted average of squared means is 3.21 · 106. The means of the second
split are 3142, 1023 and 267, with weighted average of squared means
2.68 · 106; for the third split the means are 1132 and 1297, with weighted
average of squared means 1.55 · 106. We therefore branch on Model at the
top level. This gives us three single-instance leaves, as well as three A100s
and three T202s.
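The split evaluation above, reproduced as a sketch (the value groups are taken from the splits listed earlier; dividing by 10^6 matches the figures quoted):

```python
def weighted_sq_means(children):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * (sum(c) / len(c)) ** 2 for c in children)

splits = {
    "Model":     [[1051, 1770, 1900], [4513], [77], [870], [99, 270, 625]],
    "Condition": [[1770, 4513], [270, 870, 1051, 1900], [77, 99, 625]],
    "Leslie":    [[625, 870, 1900], [77, 99, 270, 1051, 1770, 4513]],
}
for name, children in splits.items():
    print(name, round(weighted_sq_means(children) / 1e6, 2))
# Model 3.21, Condition 2.68, Leslie 1.55: so Model is chosen at the root
```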
COMP9417 ML & DM Tree Learning Semester 1, 2018 86 / 98
Learning Non-linear Regression Models with Trees
Learning a regression tree
For the A100s we obtain the following splits:
Condition = [excellent, good, fair] [1770][1051, 1900][]
Leslie = [yes, no] [1900][1051, 1770]
Without going through the calculations we can see that the second split
results in less variance (to handle the empty child, it is customary to set its
variance equal to that of the parent). For the T202s the splits are as
follows:
Condition = [excellent, good, fair] [][270][99, 625]
Leslie = [yes, no] [625][99, 270]
Again we see that splitting on Leslie gives tighter clusters of values. The
learned regression tree is depicted on the next slide.
COMP9417 ML & DM Tree Learning Semester 1, 2018 87 / 98
Learning Non-linear Regression Models with Trees
A regression tree
Model = A100:
| Leslie = yes: f̂(x) = 1900
| Leslie = no: f̂(x) = 1411
Model = B3: f̂(x) = 4513
Model = E112: f̂(x) = 77
Model = M102: f̂(x) = 870
Model = T202:
| Leslie = yes: f̂(x) = 625
| Leslie = no: f̂(x) = 185
A regression tree learned from the Hammond organ dataset.
COMP9417 ML & DM Tree Learning Semester 1, 2018 88 / 98
Learning Non-linear Regression Models with Trees
Model trees
• Like regression trees but with linear regression functions at each node
• Linear regression applied to instances that reach a node after full tree
has been built
• Only a subset of the attributes is used for LR
• Attributes occurring in subtree (+maybe attributes occurring in path
to the root)
• Fast: overhead for Linear Regression (LR) not large because usually
only a small subset of attributes is used in tree
COMP9417 ML & DM Tree Learning Semester 1, 2018 89 / 98
Learning Non-linear Regression Models with Trees
Two uses of features
Suppose we want to approximate y = cos πx on the interval −1 ≤ x ≤ 1.
A linear approximation is not much use here, since the best fit would be
y = 0. However, if we split the x-axis in two intervals −1 ≤ x < 0 and
0 ≤ x ≤ 1, we could find reasonable linear approximations on each
interval. We can achieve this by using x both as a splitting feature and as
a regression variable (next slide).
COMP9417 ML & DM Tree Learning Semester 1, 2018 90 / 98
Learning Non-linear Regression Models with Trees
A small model tree
x < 0: ŷ = 2x + 1
x ≥ 0: ŷ = −2x + 1
(left) A model tree combining a one-split feature tree with linear regression
models in the leaves. Notice how x is used as both a splitting feature and
a regression variable. (right) The function y = cosπx on the interval
−1 ≤ x ≤ 1, and the piecewise linear approximation achieved by the
model tree.
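A sketch of the same idea with least-squares fits in each half (the fitted slopes come out near ±2.4; the ±2x + 1 lines in the figure correspond to joining the interval endpoints instead):

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = np.cos(np.pi * x)

for name, mask in [("x < 0 ", x < 0), ("x >= 0", x >= 0)]:
    slope, intercept = np.polyfit(x[mask], y[mask], 1)   # linear model in this leaf
    print(f"{name}: y_hat = {slope:.2f}*x + {intercept:.2f}")
```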
COMP9417 ML & DM Tree Learning Semester 1, 2018 91 / 98
Learning Non-linear Regression Models with Trees
Model Tree on CPU dataset
COMP9417 ML & DM Tree Learning Semester 1, 2018 92 / 98
Learning Non-linear Regression Models with Trees
Smoothing
• Näıve prediction method – output value of LR model at
corresponding leaf node
• Improve performance by smoothing predictions with internal LR
models
• Predicted value is weighted average of LR models along path from root
to leaf
• Smoothing formula (see the sketch after this list): p′ = (np + kq) / (n + k), where
• p′ prediction passed up to next higher node
• p prediction passed to this node from below
• q value predicted by model at this node
• n number of instances that reach node below
• k smoothing constant
• Same effect can be achieved by incorporating the internal models into
the leaf nodes
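A minimal sketch of the smoothing step for one node (the value of k is only an assumed illustration):

```python
def smoothed(p, q, n, k=15):
    """p: prediction coming up from below; q: this node's own model's prediction;
    n: number of training instances below; k: smoothing constant."""
    return (n * p + k * q) / (n + k)

print(smoothed(p=120.0, q=100.0, n=30))   # 113.33...: pulled towards the node's model
```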
COMP9417 ML & DM Tree Learning Semester 1, 2018 93 / 98
Learning Non-linear Regression Models with Trees
Building the tree
• Splitting criterion: standard deviation reduction
SDR = sd(T) − ∑_i (|Ti| / |T|) × sd(Ti)
where T1, T2, . . . are the sets from splits of the data at the node
(see the sketch after this list).
• Termination criteria (important when building trees for numeric
prediction):
• Standard deviation becomes smaller than certain fraction of sd for full
training set (e.g. 5%)
• Too few instances remain (e.g. less than four)
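A small sketch of the SDR criterion (the split of the Hammond prices used here is only illustrative):

```python
import math

def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sdr(parent, subsets):
    """Standard deviation reduction achieved by splitting `parent` into `subsets`."""
    n = len(parent)
    return sd(parent) - sum(len(s) / n * sd(s) for s in subsets)

prices = [99, 270, 625, 870, 1051, 1770, 1900, 4513, 77]
print(round(sdr(prices, [[1051, 1770, 1900, 4513], [99, 270, 625, 870, 77]]), 1))
```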
COMP9417 ML & DM Tree Learning Semester 1, 2018 94 / 98
Learning Non-linear Regression Models with Trees
Pruning the tree
• Pruning is based on estimated absolute error of LR models
• Heuristic estimate:
(n + v) / (n − v) × average absolute error
where n is the number of training instances that reach the node, and v is
the number of parameters in the linear model
• LR models are pruned by greedily removing terms to minimize the
estimated error
• Model trees allow for heavy pruning: often a single LR model can
replace a whole subtree
• Pruning proceeds bottom up: error for LR model at internal node is
compared to error for subtree
COMP9417 ML & DM Tree Learning Semester 1, 2018 95 / 98
Learning Non-linear Regression Models with Trees
Discrete (nominal) attributes
• Nominal attributes converted to binary attributes and treated as
numeric
• Nominal values sorted using average class value for each one
• For k-values, k − 1 binary attributes are generated
• the ith binary attribute is 0 if an instance’s value is one of the first i in
the ordering, 1 otherwise
• Best binary split with original attribute provably equivalent to a split
on one of the new attributes
COMP9417 ML & DM Tree Learning Semester 1, 2018 96 / 98
Summary
Summary – decision trees
• Decision tree learning is a practical method for many classifier
learning tasks – still a “Top 10” data mining algorithm – see
sklearn.tree.DecisionTreeClassifier
• TDIDT family descended from ID3 searches complete hypothesis
space - the hypothesis is there, somewhere...
• Uses a search (preference) bias; the search for an optimal tree is, in general,
not tractable
• Overfitting is inevitable with an expressive hypothesis space and noisy
data, so pruning is important
• Decades of research into extensions and refinements of the general
approach, e.g., for numerical prediction, logical trees
• Often the “try-first” machine learning method in applications,
illustrates many general issues
• Performance can be improved with use of “ensemble” methods
COMP9417 ML & DM Tree Learning Semester 1, 2018 97 / 98
Summary
Summary – regression and model trees
• Regression trees were introduced in CART – R’s implementation is
close to CART, but see sklearn.tree.DecisionTreeRegressor
for a basic version
• Quinlan proposed the M5 model tree inducer
• M5′: slightly improved version that is publicly available (M5P in Weka
is based on this)
• Quinlan also investigated combining instance-based learning with M5
• CUBIST: Quinlan’s rule learner for numeric prediction
www.rulequest.com
• Interesting comparison: Neural nets vs. model trees — both do
non-linear regression
• other methods also can learn non-linear models
COMP9417 ML & DM Tree Learning Semester 1, 2018 98 / 98