Tree Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2021
Acknowledgements
Material derived from slides for the book “Machine Learning” by T. Mitchell
McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides by Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
Material derived from slides by Eibe Frank
http://www.cs.waikato.ac.nz/ml/weka
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Aims
This lecture will enable you to describe decision tree learning, the use of entropy and the problem of overfitting. Following it you should be able to:
• define the decision tree representation
• list representation properties of data and models for which decision
trees are appropriate
• reproduce the top-down decision tree induction (TDIDT) algorithm
• define entropy and information gain for the TDIDT algorithm
• describe the inductive bias of the TDIDT algorithm
• define overfitting of a training set by a hypothesis
• describe developments of the basic TDIDT algorithm: pruning, rules, numeric attributes, many-valued attributes, missing values
• describe regression and model trees
Introduction
Why use decision trees?
• Trees in some form are probably still the single most popular data mining tool
• Easy to understand
• Easy to implement
• Easy to use
• Computationally efficient (even on big data) to learn and run
• They do classification, i.e., predict a categorical output from categorical and/or real inputs
• Tree learning can also be used for predicting a real-valued output, i.e., they can do regression
• There are some drawbacks, though — e.g., can have high variance (later lecture)
Introduction
Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Introduction
Decision Tree for PlayTennis
Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes
Introduction
Decision Trees
Decision tree representation:
• Each internal node tests an attribute (feature)
• Each branch corresponds to attribute (feature) value or threshold
• Each leaf node assigns a classification value

How would we represent the following expressions ?
• ∧, ∨, XOR
• M of N
• (A∧B)∨(C∧¬D∧E)
Introduction
Decision Trees
X ∧ Y
X = t:
| Y = t: true
| Y = f: false
X = f: false

X ∨ Y
X = t: true
X = f:
| Y = t: true
| Y = f: false
Introduction
Decision Trees
X XOR Y
X = t:
| Y = t: false
| Y = f: true
X = f:
| Y = t: true
| Y = f: false
So decision trees are, in some sense, non-linear classifiers (or regressors).
Introduction
Decision Trees
2 of 3
X = t:
| Y = t: true
| Y = f:
| | Z = t: true
| | Z = f: false
X = f:
| Y = t:
| | Z = t: true
| | Z = f: false
| Y = f: false
In general, decision trees represent a disjunction of conjunctions of constraints, or tests, on the attribute values of instances.
Introduction
When are Decision Trees the Right Model?
• With Boolean values for instances X and class Y, decision trees represent Y as a Boolean function of the Xs
• Given d input Boolean variables, there are 2^d possible input values for these variables. Any specific function assigns Y = 1 to some subset of these, and Y = 0 to the rest
• Any Boolean function can be trivially represented by a tree. Each function assigns Y = 1 to some subset of the 2^d possible values of X. So, for each combination of values with Y = 1, have a path from root to a leaf with Y = 1. All other leaves have Y = 0
Introduction
When are Decision Trees the Right Model?
• This is a re-representation of the truth-table, and will have 2^d leaves.
• More compact trees may be possible by taking into account what is
common between one or more rows with the same Y value
• But there are some Boolean functions for which compact trees may
not be possible (the parity and majority functions are examples)1
• In general, although possible in principle to express any Boolean function, search and prior restrictions may not allow us to find the correct tree in practice
• BUT: If you want readable models that combine logical tests with a probability-based decision, then decision trees are a good choice
1Finding the optimal tree is NP-complete (Hyafil and Rivest (1976)).
The Basic Algorithm
Top-Down Induction of Decision Trees (TDIDT)
Main loop:
1 A ← the “best” decision attribute for next node
2 Assign A as decision attribute for node
3 For each value of A, create new descendant of node
4 Sort training examples to leaf nodes
5 If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes
Essentially, this is the top-level of the ID3 algorithm2 — the first efficient symbolic Machine Learning algorithm.
2See: Quinlan (1986).
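A minimal Python sketch of this main loop, assuming categorical attributes and training examples given as (attribute-dictionary, label) pairs; this is only an illustration of TDIDT, not Quinlan's implementation, and the function and variable names are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def tdidt(examples, attributes):
    labels = [y for _, y in examples]
    # STOP if examples perfectly classified (or nothing left to split on)
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # 1. pick the "best" attribute; 2. make it the decision attribute for this node
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {"attribute": best, "branches": {}}
    # 3. one descendant per value of best; 4. sort examples to the new leaves; 5. iterate
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree["branches"][v] = tdidt(subset, [a for a in attributes if a != best])
    return tree

# tiny illustrative call on a two-attribute fragment of the weather data
examples = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
            ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
            ({"Outlook": "Overcast", "Wind": "Weak"}, "Yes"),
            ({"Outlook": "Overcast", "Wind": "Strong"}, "Yes")]
print(tdidt(examples, ["Outlook", "Wind"]))
```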
The Basic Algorithm
Which attribute is best?
A1 = ? splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2 = ? splits [29+,35-] into t: [18+,33-] and f: [11+,2-]
Information Gain
Entropy
[Plot: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to a maximum of 1.0 at p⊕ = 0.5 and back to 0 at p⊕ = 1.]
Consider a 2-class distribution, where:
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Information Gain
Entropy
Entropy measures the “impurity” of S
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
A “pure” sample is one in which all examples are of the same class.
A decision tree node with low impurity means that the path from root to the node represents a combination of attribute-value tests with good classification accuracy.
Information Gain
Entropy
Entropy(S) = expected number of bits needed to encode class (⊕ or ⊖) of randomly drawn member of S (under the optimal, shortest-length code)
Why ?
Information theory: optimal length code assigns − log2 p bits to message
having probability p.
So, expected number of bits to encode ⊕ or ⊖ of random member of S:
p⊕(− log2 p⊕) + p⊖(− log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
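As a quick numerical check, a small Python sketch of this formula (the [29+,35-] counts are the ones used in the attribute-selection example that follows):

```python
from math import log2

def entropy(p_pos):
    """Entropy of a two-class sample with proportion p_pos of positives."""
    return sum(-p * log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)

print(entropy(0.5))      # 1.0 bit: a 50/50 split is maximally impure
print(entropy(29 / 64))  # ~0.994 bits for the [29+,35-] sample used below
```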
Information Gain
• Gain(S, A) = expected reduction in entropy due to sorting on A
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
A1 = ? splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2 = ? splits [29+,35-] into t: [18+,33-] and f: [11+,2-]
Information Gain
Gain(S, A1)
= Entropy(S) − ( (|St| / |S|) Entropy(St) + (|Sf| / |S|) Entropy(Sf) )
= 0.9936 − ( (26/64)(−(21/26) log2(21/26) − (5/26) log2(5/26))
           + (38/64)(−(8/38) log2(8/38) − (30/38) log2(30/38)) )
= 0.9936 − ( 0.2869 + 0.4408 )
= 0.2658
Information Gain
Gain(S, A2) = 0.9936 − ( 0.7464 + 0.1258 ) = 0.1214
Information Gain
So we choose A1, since it gives a larger expected reduction in entropy.
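A short sketch that reproduces these gains from the class counts in the figure (illustrative helper functions, not library code):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

print(gain((29, 35), [(21, 5), (8, 30)]))   # Gain(S, A1) ~ 0.266
print(gain((29, 35), [(18, 33), (11, 2)]))  # Gain(S, A2) ~ 0.121
```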
Information Gain
Attribute selection – impurity measures more generally
Estimate the class probability of class k at node m of the tree as p̂mk = |Smk| / |Sm|.

Classify at node m by predicting the majority class k(m), which has probability p̂mk(m).

Misclassification error:

1 − p̂mk(m)

Entropy for K class values:

− Σ_{k=1..K} p̂mk log p̂mk

CART (Breiman et al. (1984)) uses the “Gini index”:

Σ_{k=1..K} p̂mk (1 − p̂mk)
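A minimal sketch computing all three impurity measures from the class counts at a node (the function is illustrative, not from any library):

```python
from math import log2

def impurities(counts):
    """Misclassification error, entropy and Gini index for a node's class counts."""
    n = sum(counts)
    p = [c / n for c in counts]
    misclassification = 1 - max(p)
    entropy = -sum(pk * log2(pk) for pk in p if pk > 0)
    gini = sum(pk * (1 - pk) for pk in p)
    return misclassification, entropy, gini

print(impurities([29, 35]))  # roughly (0.453, 0.994, 0.496) for a [29+,35-] node
```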
Information Gain
Attribute selection – impurity measures more generally
Why not just use accuracy, or misclassification error ? In practice, they are not found to work as well as the other measures (entropy and the Gini index).
Entropy and Gini index are more sensitive to changes in the node probabilities than misclassification error.
Entropy and Gini index are differentiable, but misclassification error is not (Hastie et al. (2009)).
Tree Learning More Generally
Hypothesis Space Search by TDIDT
[Figure: hypothesis space search. Starting from the empty tree, candidate root tests (A1, A2, . . .) are considered; the chosen partial tree is then extended by adding further tests (A2, A3, A4, . . .) at its leaves, and so on.]

A greedy search to maximise information gain . . .
Tree Learning More Generally
Inductive Bias of TDIDT
Note hypothesis space H is complete (contains all finite discrete-valued functions w.r.t attributes)
• So H can represent the power set of instances X ! →Unbiased?
Not really…
• Preference for short trees, and for those with high information gain attributes near the root
• Inductive bias is a preference for some hypotheses, rather than a restriction of hypothesis space H
• An incomplete search of a complete hypothesis space versus a complete search of an incomplete hypothesis space
• Occam’s razor: prefer the shortest hypothesis that fits the data
• Inductive bias: approximately, “prefer shortest tree”
Overfitting and How To Avoid It
Why does overfitting occur?
• Greedy search can make mistakes: it can get stuck in a local minimum, because sometimes a locally sub-optimal choice (i.e., picking a test whose information gain is less than the best one) would lead to a better tree later
• But there is also another kind of problem. Training error is an optimistic estimate of the true error of the model, and this optimism increases as the training error decreases
• Suppose we could quantify the “optimism” of a learning algorithm . . .
• Say we have two models h1 and h2 with training errors e1 and e2 and
optimism o1 and o2.
• Let the true error of each be E1 = e1 + o1 and E2 = e2 + o2
• If e1 < e2 but E1 > E2, then h1 looks better than h2 on the training data even though its true error is higher
• So, a search method based purely on training data estimates may end up overfitting the training data
Overfitting and How To Avoid It
Overfitting in Decision Tree Learning
[Plot: accuracy vs. size of tree (number of nodes). Accuracy on the training data keeps increasing as the tree grows, while accuracy on the test data levels off and then declines.]
Overfitting and How To Avoid It
Avoiding Overfitting
How can we avoid overfitting?
For tree learning the answer is pruning. Two main approaches:
• pre-pruning: stop growing when further data splits are not useful
• post-pruning: grow the full tree, then remove sub-trees which may be overfitting
Overfitting and How To Avoid It
Avoiding Overfitting – Pre-pruning
• Can be based on statistical significance test
• Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
• For example, ID3 used chi-squared test plus information gain
• only statistically significant attributes were allowed to be selected by
information gain procedure
• Problem — as tree grows:
→ typical sample sizes at nodes get smaller
→ statistical tests become unreliable
Overfitting and How To Avoid It
Avoiding Overfitting – Pre-pruning
• Simplest approach: stop growing the tree when fewer than some lower-bound on the number of examples at a leaf
• In C5.0³, this is the parameter minCases (default value 2)
• In sklearn, this parameter is min_samples_leaf
• In sklearn, the parameter min_impurity_decrease enables stopping when the impurity decrease falls below a lower bound
3See: www.rulequest.com
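A hedged sklearn example of these pre-pruning parameters; min_samples_leaf and min_impurity_decrease are the actual DecisionTreeClassifier parameter names, but the data and the chosen thresholds here are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(200, 4), rng.randint(0, 2, 200)   # random toy data

clf = DecisionTreeClassifier(
    criterion="entropy",         # information-gain-style splitting
    min_samples_leaf=5,          # pre-prune: no leaf with fewer than 5 examples
    min_impurity_decrease=0.01,  # pre-prune: stop when the impurity decrease is too small
).fit(X, y)
print(clf.get_n_leaves())
```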
Overfitting and How To Avoid It
Avoiding Overfitting – Pre-pruning
• May cause early stopping: may stop the growth of the tree prematurely
• Classic example: XOR/parity problem
• No individual attribute exhibits a significant association with the class
• Target structure only visible in fully expanded tree
• Pre-pruning won’t expand the root node
• But: XOR-type problems not common in practice
• And: pre-pruning faster than post-pruning
• Useful in exploratory data analysis
• only grow top-level tree — all leaves have many examples
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
• Builds full tree first and prunes it afterwards
• Why might this be a good idea ?
• Attribute interactions are visible in fully-grown tree
• Problem: identification of subtrees and nodes that are due to chance effects
• Typical pruning operation:
  • Subtree replacement
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
A simple greedy method (originally called “Reduced-Error Pruning”)
Split data into training and validation set
Do until further pruning is harmful:
• Evaluate impact on validation set of pruning each possible node (plus those below it)
• Remove the one that most improves validation set accuracy
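A minimal sketch of this greedy procedure, assuming a tree stored as nested dicts of the form {"attr": ..., "branches": {value: subtree}, "majority": label} with plain labels as leaves, and a validation set of (example, label) pairs; the representation and names are assumptions for illustration, not the original algorithm's data structures.

```python
def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"].get(x.get(tree["attr"]), tree["majority"])
    return tree

def accuracy(tree, val):
    return sum(predict(tree, x) == y for x, y in val) / len(val)

def reduced_error_prune(tree, val):
    if not isinstance(tree, dict):
        return tree
    while True:
        base = accuracy(tree, val)
        # collect (parent, branch value, subtree) for every internal node below the root
        stack, candidates = [tree], []
        while stack:
            node = stack.pop()
            for v, child in node["branches"].items():
                if isinstance(child, dict):
                    candidates.append((node, v, child))
                    stack.append(child)
        best, best_acc = None, base
        for parent, v, child in candidates:
            parent["branches"][v] = child["majority"]   # tentatively prune this subtree
            acc = accuracy(tree, val)
            parent["branches"][v] = child               # undo
            if acc > best_acc:
                best_acc, best = acc, (parent, v, child)
        if best is None:                                # further pruning is harmful: stop
            return tree
        parent, v, child = best                         # keep the single best pruning step
        parent["branches"][v] = child["majority"]
```

For simplicity the sketch never replaces the root node itself and stops as soon as no replacement improves validation accuracy.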
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
Effect of reduced-error pruning on tree learning to avoid overfitting.

[Plot: accuracy vs. size of tree (number of nodes), showing accuracy on training data, on test data, and on test data during pruning.]
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
Pruning operator: Sub-tree replacement
Bottom-up:
tree is considered for replacement once all its sub-trees have been considered
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
Problem with using validation set: reduces effective size of training data
Goal: to improve estimate of error on unseen data using all and only data from training set
But how can this work ?
Make the estimate of error pessimistic !
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
• Apply pruning operation if this does not increase the estimated error
• C5.0’s method: using upper limit of standard confidence interval derived from the training data (parameter called ’c’ or ’CF’)
• originally called “error-based pruning”
• Standard Bernoulli-process-based method
• Note: statistically motivated, but not statistically valid
• However: works well in practice !
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
• The error estimate for a tree node is the weighted sum of error estimates for all its subtrees (possibly leaves).
• Use upper bound error estimate e for a node (simplified version):

  e = f + Zc · √( f · (1 − f) / N )

• f is actual (empirical) error of tree on examples at the tree node
• N is the number of examples at the tree node
• Zc is a constant whose value depends on confidence parameter c
• Typical default value for confidence c = 0.25
• If c = 0.25 then Zc = 0.69 (from standardized normal distribution)
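A small sketch of this upper-bound estimate in Python (Zc = 0.69 corresponds to the default c = 0.25; the example numbers are illustrative):

```python
from math import sqrt

def pessimistic_error(f, n, z_c=0.69):
    """Upper-bound error estimate e = f + Zc * sqrt(f(1-f)/N) for a node."""
    return f + z_c * sqrt(f * (1 - f) / n)

print(pessimistic_error(0.33, 6))  # an observed error of 0.33 on 6 examples becomes ~0.46
```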
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
• How does this method implement a pessimistic error estimate ?
• What effect will the c parameter have on pruning ?
• As c ↑, Zc ↓
• See example on next slide (note: values not calculated using exactly the above formula)
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
• health plan contribution: node measures f = 0.36, e = 0.46
• sub-tree measures:
• none: f = 0.33, e = 0.47
• half: f = 0.5, e = 0.72
• full: f = 0.33, e = 0.47
• sub-trees combined 6 : 2 : 6 gives 0.51
• sub-trees estimated to give greater error so prune away
Further Issues in Tree Learning
Generating Rules from Trees
Rules can be more interpretable than trees, but just as accurate
• path from root to leaf in (unpruned) tree forms a rule
• i.e., tree forms a set of rules
• can simplify rules independently by deleting conditions
• i.e., rules can be generalized while maintaining accuracy
• A greedy rule simplification algorithm
  1 drop the condition giving lowest estimated error
  2 continue while estimated error does not increase
Further Issues in Tree Learning
Generating Rules from Trees
[Decision tree for PlayTennis as before: Outlook at the root, with the Sunny branch testing Humidity (High: No, Normal: Yes), Overcast predicting Yes, and the Rain branch testing Wind (Strong: No, Weak: Yes).]

IF   (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No

IF   (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes

…
Further Issues in Tree Learning
Generating Rules from Trees
Rule “post-pruning” introduced by Quinlan4
• Convert tree to equivalent set of rules
• Find a subset of rules which minimises a regularised loss function
• trade-off accuracy and complexity of rule-set
• stochastic search using simulated annealing
• Sort final set of rules, ordered by class
• order classes by increasing chance of making false positive errors
• set as a default the class with the most training instances not covered
by any rule
4See: Quinlan (1993).
Further Issues in Tree Learning
Generating Rules from Trees
Why do rule generation (and rule post-pruning) ?
For: simpler classifiers; people prefer rules to trees
For: interpretable models are very important in many applications (medicine, finance, etc.)
Against: does not scale well, slow for large trees & datasets
Further Issues in Tree Learning
Numeric Attributes in Tree Learning
Numeric Attributes
Decision trees originated for discrete attributes only. However, data often has numeric attributes.
Can create a discrete attribute to test a numeric value:
• Temperature = 82.5
• (Temperature > 72.3) ∈ {t, f}
• Usual method: numeric attributes have a binary split
• Note:
• discrete attributes – one split exhausts all values
• numeric attributes – can have many splits in a tree
Further Issues in Tree Learning
Numeric Attributes in Tree Learning
Numeric Attributes
• Splits evaluated on all possible split points
• More computation: n − 1 possible splits for n values of an attribute in training set
• Fayyad (1991)
• sort examples on numeric attribute
• find midway boundaries where class changes, e.g. for Temperature: (48+60)/2 and (80+90)/2
• Choose best split point by info gain (or evaluation of choice)
• Note: use actual values in data — more user-friendly
Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
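A small sketch of this procedure for the Temperature example: sort on the numeric attribute and put a candidate threshold midway wherever the class changes.

```python
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

pairs = sorted(zip(temps, labels))      # sort examples on the numeric attribute
candidates = [(a + b) / 2               # midway boundary between adjacent values
              for (a, ya), (b, yb) in zip(pairs, pairs[1:])
              if ya != yb]              # ...but only where the class changes
print(candidates)                       # [54.0, 85.0], i.e. (48+60)/2 and (80+90)/2
```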
Further Issues in Tree Learning
Many-valued Attributes
Attributes with Many Values
Problem:
• If attribute has many values, Gain will select it
• Why ? more likely to split instances into “pure” subsets
• Maximised by singleton subsets
• Imagine using Date = June 21, 2021 as attribute
• High gain on training set, useless for prediction
Further Issues in Tree Learning
Many-valued Attributes
Attributes with Many Values
One approach: use GainRatio instead

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1..c} (|Si| / |S|) log2 (|Si| / |S|)

where Si is the subset of S for which A has value vi
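A small sketch of the SplitInformation normaliser (Gain itself would additionally need the class counts in each subset; the subset sizes here are illustrative):

```python
from math import log2

def split_information(subset_sizes):
    n = sum(subset_sizes)
    return -sum(s / n * log2(s / n) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

print(split_information([7, 7]))    # 1.0 for an even binary split
print(split_information([1] * 14))  # ~3.81 for a 14-way split into singletons (e.g. Date),
                                    # so GainRatio heavily penalises many-valued attributes
```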
Further Issues in Tree Learning
Many-valued Attributes
Attributes with Many Values
Why does this help ?
• sensitive to how broadly and uniformly attribute splits instances
• actually the entropy of S w.r.t. values of A
• i.e., the information of the partition itself
• therefore higher for many-valued attributes, especially if mostly
uniformly distributed across possible values
• a kind of normalisation
Further Issues in Tree Learning
Missing Values
What if some training set examples have missing values for attribute A? Use the example anyway, and sort it through the tree. Here are 3 possible approaches:
• If node n tests A, assign most common value of A among other examples sorted to node n
• assign most common value of A among other examples with same target value
• assign probability pi to each possible value vi of A
• assign fraction pi of example to each descendant in tree
Note: need to classify new (unseen) examples in the same way !
Learning Non-linear Regression Models with Trees
Regression trees
• Differences to decision trees:
• Splitting criterion: minimizing intra-subset variation
• Pruning criterion: based on numeric error measure
• Leaf node predicts average class values of training instances reaching
that node
• Can approximate piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees
Learning Non-linear Regression Models with Trees
A Regression Tree and its Prediction Surface
“Elements of Statistical Learning” Hastie, Tibshirani & Friedman (2001)
Learning Non-linear Regression Models with Trees
Regression Tree on sine dataset
Learning Non-linear Regression Models with Trees
Regression Tree on CPU dataset
Learning Non-linear Regression Models with Trees
Tree learning as variance reduction
• Variance of a Boolean (i.e., Bernoulli) variable with success probability p is p(1 − p).
• Can interpret goal of tree learning as minimising the class variance in the leaves.
• In regression problems we can define the variance in the usual way:

  Var(Y) = (1/|Y|) Σ_{y∈Y} (y − ȳ)²

  If a split partitions the set of target values Y into mutually exclusive sets {Y1, . . . , Yl}, the weighted average variance is then

  Var({Y1, . . . , Yl}) = Σ_{j=1..l} (|Yj| / |Y|) Var(Yj) = . . . = (1/|Y|) Σ_{y∈Y} y² − Σ_{j=1..l} (|Yj| / |Y|) ȳj²

  The first term is constant for a given set Y, and so we want to maximise the weighted average of squared means in the children.
Learning Non-linear Regression Models with Trees
Learning a regression tree
Imagine you are a collector of vintage Hammond tonewheel organs. You have been monitoring an online auction site, from which you collected some data about interesting transactions:
# Model Condition Leslie Price
1. B3 excellent no 4513
2. T202 fair yes 625
3. A100 good no 1051
4. T202 good no 270
5. M102 good yes 870
6. A100 excellent no 1770
7. T202 fair no 99
8. A100 good yes 1900
9. E112 fair no 77
Learning Non-linear Regression Models with Trees
Learning a regression tree
From this data, you want to construct a regression tree that will help you determine a reasonable price for your next purchase.
There are three features, hence three possible splits:
Model = [A100, B3, E112, M102, T202]
[1051, 1770, 1900][4513][77][870][99, 270, 625]
Condition = [excellent, good, fair]
[1770, 4513][270, 870, 1051, 1900][77, 99, 625]
Leslie = [yes, no]
[625, 870, 1900] [77, 99, 270, 1051, 1770, 4513]
The means of the first split are 1574, 4513, 77, 870 and 331, and the weighted average of squared means is 3.21 · 106. The means of the second split are 3142, 1023 and 267, with weighted average of squared means 2.68 · 106; for the third split the means are 1132 and 1297, with weighted average of squared means 1.55 · 106. We therefore branch on Model at the top level. This gives us three single-instance leaves, as well as three A100s and three T202s.
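A short sketch that reproduces these weighted averages of squared means from the prices in the table (the grouping is written out by hand, as in the text):

```python
splits = {
    "Model":     [[1051, 1770, 1900], [4513], [77], [870], [99, 270, 625]],
    "Condition": [[1770, 4513], [270, 870, 1051, 1900], [77, 99, 625]],
    "Leslie":    [[625, 870, 1900], [77, 99, 270, 1051, 1770, 4513]],
}

def weighted_avg_squared_means(children):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * (sum(c) / len(c)) ** 2 for c in children)

for feature, children in splits.items():
    print(feature, round(weighted_avg_squared_means(children)))
# Model ~3.21e6, Condition ~2.68e6, Leslie ~1.55e6, so we branch on Model first
```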
Learning Non-linear Regression Models with Trees
Learning a regression tree
For the A100s we obtain the following splits:
Condition = [excellent, good, fair]   [1770] [1051, 1900] []
Leslie = [yes, no]   [1900] [1051, 1770]
Without going through the calculations we can see that the second split results in less variance (to handle the empty child, it is customary to set its variance equal to that of the parent). For the T202s the splits are as follows:
Condition = [excellent, good, fair]   [] [270] [99, 625]
Leslie = [yes, no]   [625] [99, 270]
Again we see that splitting on Leslie gives tighter clusters of values. The learned regression tree is depicted on the next slide.
Learning Non-linear Regression Models with Trees
A regression tree
Model = A100:
|  Leslie = yes: f̂(x) = 1900
|  Leslie = no:  f̂(x) = 1411
Model = B3:   f̂(x) = 4513
Model = E112: f̂(x) = 77
Model = M102: f̂(x) = 870
Model = T202:
|  Leslie = yes: f̂(x) = 625
|  Leslie = no:  f̂(x) = 185

A regression tree learned from the Hammond organ dataset.
Learning Non-linear Regression Models with Trees
Model trees
• Like regression trees but with linear regression functions at each node
• Linear regression applied to instances that reach a node after full tree has been built
• Only a subset of the attributes is used for LR
• Attributes occurring in subtree (+maybe attributes occurring in path
to the root)
• Fast: overhead for Linear Regression (LR) not large because usually only a small subset of attributes is used in tree
Learning Non-linear Regression Models with Trees
Two uses of features (Flach (2012))
Suppose we want to approximate y = cos πx on the interval −1 ≤ x ≤ 1. A linear approximation is not much use here, since the best fit would be
y = 0.
However, if we split the x-axis in two intervals −1 ≤ x < 0 and 0 ≤ x ≤ 1,
we could find reasonable linear approximations on each interval.
We can achieve this by using x both as a splitting feature and as a regression variable (next slide).
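A small numpy sketch of this idea: split on x at 0, then fit an ordinary least-squares line to x in each half. The fitted coefficients are whatever least squares gives on this grid (roughly ±2.4x + 1.2), not values taken from the lecture; they have the same shape as the ±2x + 1 models shown in the tree on the next slide.

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = np.cos(np.pi * x)

for mask in (x < 0, x >= 0):                             # split on x, as in the model tree
    slope, intercept = np.polyfit(x[mask], y[mask], 1)   # linear model within the branch
    print(round(slope, 2), round(intercept, 2))
```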
Learning Non-linear Regression Models with Trees
A small model tree
x < 0:  ŷ = 2x + 1
x ≥ 0:  ŷ = −2x + 1

[Plot: the two linear pieces approximating y = cos πx on the interval −1 ≤ x ≤ 1.]
Learning Non-linear Regression Models with Trees
Model Tree on CPU dataset
Learning Non-linear Regression Models with Trees
Smoothing
• Naïve prediction method – output value of LR model at corresponding leaf node
• Improve performance by smoothing predictions with internal LR models
• Predicted value is weighted average of LR models along path from root to leaf
• Smoothing formula (sketched after this list):

  p′ = (n p + k q) / (n + k)

  where
  • p′ is the prediction passed up to the next higher node
  • p is the prediction passed to this node from below
  • q is the value predicted by the model at this node
  • n is the number of instances that reach the node below
  • k is a smoothing constant
• Same effect can be achieved by incorporating the internal models into the leaf nodes
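A minimal sketch of the smoothing formula, assuming the path from a leaf to the root is given as (n, q) pairs; the representation and the value k = 15 are illustrative assumptions, not taken from the slides.

```python
def smooth(leaf_prediction, path, k=15):
    """Pass a prediction up the tree: at each node, p' = (n*p + k*q) / (n + k)."""
    p = leaf_prediction
    for n, q in path:   # n = instances reaching the node below, q = this node's model output
        p = (n * p + k * q) / (n + k)
    return p

print(smooth(100.0, [(20, 90.0), (80, 70.0)]))   # ~91.6
```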
Learning Non-linear Regression Models with Trees
Pruning the tree
• Pruning is based on estimated absolute error of LR models
• Heuristic estimate (see the sketch after this list):

  (n + v) / (n − v) × average absolute error

  where n is the number of training instances that reach the node, and v is the number of parameters in the linear model
• LR models are pruned by greedily removing terms to minimize the estimated error
• Model trees allow for heavy pruning: often a single LR model can replace a whole subtree
• Pruning proceeds bottom up: error for LR model at internal node is compared to error for subtree
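A tiny sketch of this heuristic (the numbers are illustrative):

```python
def estimated_error(avg_abs_error, n, v):
    """Inflate the training error by (n + v) / (n - v) to penalise model complexity."""
    return (n + v) / (n - v) * avg_abs_error

print(estimated_error(10.0, n=50, v=3))   # ~11.3: small leaves with big models are penalised
```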
Summary
Summary – decision trees
• Decision tree learning is a practical method for many classifier learning tasks – still a “Top 10” data mining algorithm – see sklearn.tree.DecisionTreeClassifier
• TDIDT family descended from ID3 searches complete hypothesis space - the hypothesis is there, somewhere...
• Uses a search (preference) bias; the search for an optimal tree is, in general, not tractable
• Overfitting is inevitable with an expressive hypothesis space and noisy data, so pruning is important
• Decades of research into extensions and refinements of the general approach, e.g., for numerical prediction, logical trees
• Often the “try-first” machine learning method in applications, illustrates many general issues
• Performance can be improved with use of “ensemble” methods
Summary
Summary – regression and model trees
• Regression trees were introduced in CART – R’s implementation is close to CART, but see sklearn.tree.DecisionTreeRegressor for a basic version
• Quinlan proposed the M5 model tree inducer
• M5′: slightly improved version that is publicly available (M5P in Weka is based on this)
• Quinlan also investigated combining instance-based learning with M5
• CUBIST: Quinlan’s rule learner for numeric prediction www.rulequest.com
• Interesting comparison: Neural nets vs. model trees — both do non-linear regression
• other methods also can learn non-linear models
References
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont.
Flach, P. (2012). Machine Learning. Cambridge University Press.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning.
Springer, 2nd edition.
Hyafil, L. and Rivest, R. (1976). Constructing Optimal Binary Decision Trees is NP-Complete.
Information Processing Letters, 5(1):15–17.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.