
Tree Learning

COMP9417 Machine Learning and Data Mining

Last revision: 21 Mar 2018

COMP9417 ML & DM Tree Learning Semester 1, 2018 1 / 98

Acknowledgements

Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html

Material derived from slides by Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials

Material derived from slides by Eibe Frank
http://www.cs.waikato.ac.nz/ml/weka

Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook

COMP9417 ML & DM Tree Learning Semester 1, 2018 2 / 98


Aims

This lecture will enable you to describe decision tree learning, the use of
entropy and the problem of overfitting. Following it you should be able to:

• define the decision tree representation
• list representation properties of data and models for which decision trees are appropriate
• reproduce the basic top-down algorithm for decision tree induction (TDIDT)
• define entropy in the context of learning a Boolean classifier from examples
• describe the inductive bias of the basic TDIDT algorithm
• define overfitting of a training set by a hypothesis
• describe developments of the basic TDIDT algorithm: pruning, rule generation, numerical attributes, many-valued attributes, costs, missing values
• describe regression and model trees
COMP9417 ML & DM Tree Learning Semester 1, 2018 3 / 98

Introduction

Brief History of Decision Tree Learning Algorithms

• late 1950s – Bruner et al. in psychology work on modelling concept acquisition

• early 1960s – Hunt et al. in computer science work on Concept Learning Systems (CLS)

• late 1970s – Quinlan’s Iterative Dichotomizer 3 (ID3), based on CLS, is efficient at learning on then-large data sets

• early 1990s – ID3 adds features, develops into C4.5, becomes the “default” machine learning algorithm

• late 1990s – C5.0, commercial version of C4.5 (available from SPSS and www.rulequest.com)

• current – widely available and applied; influential techniques

COMP9417 ML & DM Tree Learning Semester 1, 2018 4 / 98

Introduction

Why use decision trees?

• Decision trees are probably the single most popular data mining tool
• Easy to understand
• Easy to implement
• Easy to use
• Computationally cheap (efficient, even on big data)

• There are some drawbacks, though — e.g., high variance
• They do classification, i.e., predict a categorical output from categorical and/or real inputs, or regression

COMP9417 ML & DM Tree Learning Semester 1, 2018 5 / 98

Introduction

Decision Tree for PlayTennis

Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes

COMP9417 ML & DM Tree Learning Semester 1, 2018 6 / 98

Introduction

A Tree to Predict C-Section Risk

Learned from medical records of 1000 women

Negative examples are C-sections

[833+,167-] .83+ .17-

Fetal_Presentation = 1: [822+,116-] .88+ .12-

| Previous_Csection = 0: [767+,81-] .90+ .10-

| | Primiparous = 0: [399+,13-] .97+ .03-

| | Primiparous = 1: [368+,68-] .84+ .16-

| | | Fetal_Distress = 0: [334+,47-] .88+ .12-

| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-

| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-

| | | Fetal_Distress = 1: [34+,21-] .62+ .38-

| Previous_Csection = 1: [55+,35-] .61+ .39-

Fetal_Presentation = 2: [3+,29-] .11+ .89-

Fetal_Presentation = 3: [8+,22-] .27+ .73-

COMP9417 ML & DM Tree Learning Semester 1, 2018 7 / 98

Introduction

Decision Tree for Credit Rating

COMP9417 ML & DM Tree Learning Semester 1, 2018 8 / 98

Introduction

Decision Tree for Fisher’s Iris data

COMP9417 ML & DM Tree Learning Semester 1, 2018 9 / 98

Introduction

Decision Trees

Decision tree representation:

• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification

How would we represent the following expressions ?

• ∧,∨, XOR
• (A ∧B) ∨ (C ∧ ¬D ∧ E)
• M of N

COMP9417 ML & DM Tree Learning Semester 1, 2018 10 / 98

Introduction

Decision Trees

X ∧ Y
X = t:

| Y = t: true

| Y = f: false

X = f: false

X ∨ Y
X = t: true

X = f:

| Y = t: true

| Y = f: false

COMP9417 ML & DM Tree Learning Semester 1, 2018 11 / 98

Introduction

Decision Trees

2 of 3

X = t:

| Y = t: true

| Y = f:

| | Z = t: true

| | Z = f: false

X = f:

| Y = t:

| | Z = t: true

| | Z = f: false

| Y = f: false

So in general decision trees represent a disjunction of conjunctions of
constraints on the attribute values of instances.

COMP9417 ML & DM Tree Learning Semester 1, 2018 12 / 98

Introduction

When are Decision Trees the Right Model?

• With Boolean values for the instances X and class Y, the representation adopted by decision trees allows us to represent Y as a Boolean function of the X

• Given d input Boolean variables, there are 2^d possible input values for these variables. Any specific function assigns Y = 1 to some subset of these, and Y = 0 to the rest

• Any Boolean function can be trivially represented by a tree. Each function assigns Y = 1 to some subset of the 2^d possible values of X. So, for each combination of values with Y = 1, have a path from root to a leaf with Y = 1. All other leaves have Y = 0

COMP9417 ML & DM Tree Learning Semester 1, 2018 13 / 98

Introduction

When are Decision Trees the Right Model?

• This is nothing but a re-representation of the truth table, and will have 2^d leaves. More compact trees may be possible, by taking into account what is common between one or more rows with the same Y value

• But, even for Boolean functions, there are some functions for which
compact trees may not be possible (the parity and majority functions
are examples)

• In general, although possible in principle to express any Boolean
function, our search and prior restrictions may not allow us to find the
correct tree in practice.

• BUT: If you want readable models that combine logical tests with a
probability-based decision, then decision trees are a good start

COMP9417 ML & DM Tree Learning Semester 1, 2018 14 / 98

Introduction

When to Consider Decision Trees?

• Instances described by a mix of numeric features and discrete
attribute–value pairs

• Target function is discrete valued (otherwise use regression trees)
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Interpretability is an advantage

Examples are extremely numerous, including:

• Equipment or medical diagnosis
• Credit risk analysis
• Modeling calendar scheduling preferences
• etc.

COMP9417 ML & DM Tree Learning Semester 1, 2018 15 / 98

The Basic Algorithm

Top-Down Induction of Decision Trees (TDIDT)

Main loop:

• A ← the “best” decision attribute for next node
• Assign A as decision attribute for node
• For each value of A, create new descendant of node
• Sort training examples to leaf nodes
• If training examples perfectly classified, then STOP; else iterate over new leaf nodes

Essentially this is the “ID3” algorithm (Quinlan, 1986) — the first efficient
symbolic Machine Learning algorithm.
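
Below is a minimal sketch of the TDIDT main loop written as a recursion. It is not Quinlan's actual ID3 code; the example representation (a list of (feature-dict, label) pairs) and the helper functions are assumptions for illustration, with the "best" attribute chosen by information gain as defined in the next section.

```python
# Sketch of top-down induction of decision trees (TDIDT), assumed data layout:
# examples = [({"Outlook": "Sunny", ...}, "No"), ...]
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    labels = [y for _, y in examples]
    by_value = Counter(x[attr] for x, _ in examples)
    remainder = sum(cnt / len(examples) *
                    entropy([y for x, y in examples if x[attr] == v])
                    for v, cnt in by_value.items())
    return entropy(labels) - remainder

def tdidt(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                       # perfectly classified: stop
        return labels[0]
    if not attributes:                              # no tests left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    branches = {}
    for v in {x[best] for x, _ in examples}:        # one descendant per value of best
        subset = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = tdidt(subset, [a for a in attributes if a != best])
    return {best: branches}
```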

COMP9417 ML & DM Tree Learning Semester 1, 2018 16 / 98

The Basic Algorithm

Which attribute is best?

Both candidate splits start from the same sample [29+,35-]:

A1 = t: [21+,5-]    A1 = f: [8+,30-]
A2 = t: [18+,33-]   A2 = f: [11+,2-]

COMP9417 ML & DM Tree Learning Semester 1, 2018 17 / 98

Information Gain

Bits

You are watching a set of independent random samples of X
You observe that X has four possible values

P(X = A) = 1/4    P(X = B) = 1/4    P(X = C) = 1/4    P(X = D) = 1/4

So you might see: BAACBADCDADDDA…

You transmit data over a binary serial link. You can encode each reading
with two bits (e.g. A = 00, B = 01, C = 10, D= 11)

0100001001001110110011111100…

COMP9417 ML & DM Tree Learning Semester 1, 2018 18 / 98

Information Gain

Fewer Bits

Someone tells you that the probabilities are not equal

P(X = A) = 1/2    P(X = B) = 1/4    P(X = C) = 1/8    P(X = D) = 1/8

It’s possible . . .

. . . to invent a coding for your transmission that only uses 1.75 bits on
average per symbol. How ?

COMP9417 ML & DM Tree Learning Semester 1, 2018 19 / 98

Information Gain

Fewer Bits

Someone tells you that the probabilities are not equal

P(X = A) = 1/2    P(X = B) = 1/4    P(X = C) = 1/8    P(X = D) = 1/8

It’s possible . . .

. . . to invent a coding for your transmission that only uses 1.75 bits per
symbol on average. How ?

A 0

B 10

C 110

D 111

(This is just one of several ways)
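
A quick numerical check (a sketch) that this code averages 1.75 bits per symbol, and that 1.75 is also the entropy of the distribution:

```python
# Expected code length vs. entropy for P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8.
from math import log2

probs = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}
code_lengths = {"A": 1, "B": 2, "C": 3, "D": 3}   # A=0, B=10, C=110, D=111

avg_length = sum(probs[s] * code_lengths[s] for s in probs)
entropy = -sum(p * log2(p) for p in probs.values())
print(avg_length, entropy)   # 1.75 1.75
```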

COMP9417 ML & DM Tree Learning Semester 1, 2018 20 / 98

Information Gain

Fewer Bits

Suppose there are three equally likely values

P(X = A) = 1/3    P(X = B) = 1/3    P(X = C) = 1/3

Here’s a naïve coding, costing 2 bits per symbol

A 00

B 01

C 10

Can you think of a coding that would need only 1.6 bits per symbol on
average ?

COMP9417 ML & DM Tree Learning Semester 1, 2018 21 / 98

Information Gain

Fewer Bits

Suppose there are three equally likely values

P(X = A) = 1/3    P(X = B) = 1/3    P(X = C) = 1/3

Using the same approach as before, we can get a coding costing 1.6 bits
per symbol on average . . .

A 0

B 10

C 11

This gives us, on average, 1/3 × 1 bit for A and 2 × 1/3 × 2 bits for B and C,
which equals 5/3 ≈ 1.67 bits.

Is this the best we can do ?

COMP9417 ML & DM Tree Learning Semester 1, 2018 22 / 98

Information Gain

Fewer Bits

Suppose there are three equally likely values

P(X = A) = 1/3    P(X = B) = 1/3    P(X = C) = 1/3

From information theory, the optimal number of bits to encode a symbol
with probability p is − log2 p . . .
So the best we can do for this case is − log2 (1/3) bits for each of A, B and
C, or 1.5849625007211563 bits per symbol

COMP9417 ML & DM Tree Learning Semester 1, 2018 23 / 98

Information Gain

General Case

Suppose X can have one of m values . . . V1, V2, . . . , Vm

P(X = V1) = p1    P(X = V2) = p2    . . .    P(X = Vm) = pm

What’s the smallest possible number of bits, on average, per symbol,
needed to transmit a stream of symbols drawn from X’s distribution ? It’s

H(X) = −p1 log2 p1 − p2 log2 p2 − . . . − pm log2 pm = − Σ_{j=1}^{m} p_j log2 p_j

H(X) = the entropy of X

COMP9417 ML & DM Tree Learning Semester 1, 2018 24 / 98

Information Gain

General Case

“High entropy” means the distribution of X is close to uniform (a flat, “boring” histogram) — samples are maximally unpredictable
“Low entropy” means the distribution of X is varied, with peaks and valleys — samples are much more predictable

COMP9417 ML & DM Tree Learning Semester 1, 2018 25 / 98

Information Gain

Entropy

[Figure: Entropy(S) plotted as a function of p⊕, rising from 0 at p⊕ = 0 to a maximum of 1.0 at p⊕ = 0.5 and falling back to 0 at p⊕ = 1]

Where:
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S

COMP9417 ML & DM Tree Learning Semester 1, 2018 26 / 98

Information Gain

Entropy

Entropy measures the “impurity” of S

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

A “pure” sample is one in which all examples are of the same class.

COMP9417 ML & DM Tree Learning Semester 1, 2018 27 / 98

Information Gain

Entropy

Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖)
of a randomly drawn member of S (under the optimal, shortest-length code)

Why ?

Information theory: an optimal-length code assigns − log2 p bits to a message
having probability p.

So, the expected number of bits to encode ⊕ or ⊖ for a random member of S is:

p⊕(− log2 p⊕) + p⊖(− log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
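
A quick numerical sanity check of this definition (a sketch, using the [9+,5-] PlayTennis sample that appears on later slides):

```python
# Entropy of a sample with 9 positive and 5 negative examples.
from math import log2

p_pos, p_neg = 9 / 14, 5 / 14
print(round(-p_pos * log2(p_pos) - p_neg * log2(p_neg), 3))   # 0.940
```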

COMP9417 ML & DM Tree Learning Semester 1, 2018 28 / 98

Information Gain

Information Gain

• Gain(S, A) = expected reduction in entropy due to sorting on A

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

Both candidate splits start from the same sample [29+,35-]:

A1 = t: [21+,5-]    A1 = f: [8+,30-]
A2 = t: [18+,33-]   A2 = f: [11+,2-]

COMP9417 ML & DM Tree Learning Semester 1, 2018 29 / 98

Information Gain

Information Gain

Gain(S, A1) = Entropy(S) − ( (|S_t|/|S|) Entropy(S_t) + (|S_f|/|S|) Entropy(S_f) )
            = 0.9936 − ( (26/64) (−(21/26) log2(21/26) − (5/26) log2(5/26))
                       + (38/64) (−(8/38) log2(8/38) − (30/38) log2(30/38)) )
            = 0.9936 − ( 0.2869 + 0.4408 )
            = 0.2658

COMP9417 ML & DM Tree Learning Semester 1, 2018 30 / 98

Information Gain

Information Gain

Gain(S, A2) = 0.9936 − ( 0.7464 + 0.1258 )
            = 0.1214

COMP9417 ML & DM Tree Learning Semester 1, 2018 31 / 98

Information Gain

Information Gain

So we choose A1, since it gives a larger expected reduction in entropy.
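
A short sketch that recomputes both gains directly from the class counts; computed without the rounded intermediate values used above it gives roughly 0.266 for A1 and 0.122 for A2:

```python
# Information gain for the [29+,35-] example, from (pos, neg) counts.
from math import log2

def entropy(pos, neg):
    n = pos + neg
    return -sum(p / n * log2(p / n) for p in (pos, neg) if p > 0)

def gain(parent, children):
    n = sum(p + q for p, q in children)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

print(round(gain((29, 35), [(21, 5), (8, 30)]), 3))    # A1: ~0.266
print(round(gain((29, 35), [(18, 33), (11, 2)]), 3))   # A2: ~0.122
```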

COMP9417 ML & DM Tree Learning Semester 1, 2018 32 / 98

Information Gain

Training Examples

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
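
A sketch that recomputes the information gain of each attribute directly from this table; the next slides quote Gain(S, Humidity) = .151 and Gain(S, Wind) = .048 (from rounded intermediate values):

```python
# Information gain of each attribute on the PlayTennis training set.
from collections import Counter
from math import log2

rows = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(idx):
    labels = [r[-1] for r in rows]
    remainder = sum(len([r for r in rows if r[idx] == v]) / len(rows)
                    * entropy([r[-1] for r in rows if r[idx] == v])
                    for v in {r[idx] for r in rows})
    return entropy(labels) - remainder

for name, idx in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(gain(idx), 3))
```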

COMP9417 ML & DM Tree Learning Semester 1, 2018 33 / 98

Information Gain

Information gain once more

Which attribute is the best classifier?

S: [9+,5-], Entropy = 0.940

Humidity:  High → [3+,4-] (E = 0.985),  Normal → [6+,1-] (E = 0.592)
Gain(S, Humidity) = .940 − (7/14)(.985) − (7/14)(.592) = .151

Wind:  Weak → [6+,2-] (E = 0.811),  Strong → [3+,3-] (E = 1.00)
Gain(S, Wind) = .940 − (8/14)(.811) − (6/14)(1.00) = .048

COMP9417 ML & DM Tree Learning Semester 1, 2018 34 / 98

Information Gain

Information gain once more

S = {D1, D2, . . . , D14}  [9+,5-]

Outlook = Sunny:    {D1,D2,D8,D9,D11}   [2+,3-]  → which attribute should be tested here?
Outlook = Overcast: {D3,D7,D12,D13}     [4+,0-]  → Yes
Outlook = Rain:     {D4,D5,D6,D10,D14}  [3+,2-]  → which attribute should be tested here?

S_sunny = {D1,D2,D8,D9,D11}

Gain(S_sunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(S_sunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(S_sunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
COMP9417 ML & DM Tree Learning Semester 1, 2018 35 / 98

Tree Learning More Generally

Hypothesis Space Search by ID3

[Figure: ID3’s search through the space of decision trees — starting from the empty tree, the tree is grown greedily by choosing an attribute (A1, A2, A3, A4, . . .) to test at each newly added node]
COMP9417 ML & DM Tree Learning Semester 1, 2018 36 / 98

Tree Learning More Generally

Hypothesis Space Search by ID3

• This can be viewed as a graph-search problem
• Each vertex in the graph is a decision tree
• Suppose we only consider the two-class case (ω = ω1 or ω2), and all the features x_i are Boolean, so each vertex is a binary tree
• A pair of vertices in the graph have an edge if the corresponding trees differ in just the following way: one of the leaf nodes in one vertex has been replaced by a non-leaf node testing a feature that has not appeared earlier (and 2 leaves)
• This is the full space of all decision trees (is it?). We want to search for a single tree or a small number of trees in this space. How should we do this?

COMP9417 ML & DM Tree Learning Semester 1, 2018 37 / 98

Tree Learning More Generally

Hypothesis Space Search by ID3

• Usual graph-search technique: greedy or beam search, starting with the vertex corresponding to the “empty tree” (single leaf node)
• Greedy choice: which one to select? The neighbour that results in the greatest increase in P(D|T)
• How?
  • Suppose T is changed to T′. Simply use the ratio P(D|T′)/P(D|T)
  • Most of the calculation will cancel out: so, we will only need to do the local computation at the leaf that was converted into a non-leaf node
• RESULT: a set of trees with (reasonably) high posterior probabilities given D: we can now use these to answer questions like P(y′ = ω1 | . . .), or even make a decision or a classification that y′ = ω1, given input data x
COMP9417 ML & DM Tree Learning Semester 1, 2018 38 / 98

Tree Learning More Generally

Hypothesis Space Search by ID3

• Hypothesis space is complete! (contains all finite discrete-valued
functions w.r.t attributes)

• Target function surely in there…
• Outputs a single hypothesis (which one?)

• Can’t play 20 questions…
• No back tracking

• Local minima…
• Statistically-based search choices

• Robust to noisy data…

• Inductive bias: approx “prefer shortest tree”

COMP9417 ML & DM Tree Learning Semester 1, 2018 39 / 98

Tree Learning More Generally

Inductive Bias in ID3

Note H is the power set of instances X

→Unbiased?

Not really…

• Preference for short trees, and for those with high information gain
attributes near the root

• Bias is a preference for some hypotheses, rather than a restriction of
hypothesis space H

• an incomplete search of a complete hypothesis space versus a
complete search of an incomplete hypothesis space (as in learning
conjunctive concepts)

• Occam’s razor: prefer the shortest hypothesis that fits the data

COMP9417 ML & DM Tree Learning Semester 1, 2018 40 / 98

Tree Learning More Generally

Occam’s Razor

William of Ockham (c. 1287-1347)

Entities should not be multiplied beyond necessity

Why prefer short hypotheses?

Argument in favour:

• Fewer short hypotheses than long hypotheses
→ a short hyp that fits data unlikely to be coincidence
→ a long hyp that fits data might be coincidence

COMP9417 ML & DM Tree Learning Semester 1, 2018 41 / 98

Tree Learning More Generally

Occam’s Razor

Argument opposed:

• There are many ways to define small sets of hypotheses
• e.g., all trees with a prime number of nodes that use attributes beginning with “Z”
• What’s so special about small sets based on size of hypothesis?

Look back to linear classification lecture to see how to make this work
using Minimum Description Length (MDL)

COMP9417 ML & DM Tree Learning Semester 1, 2018 42 / 98

Overfitting and How To Avoid It

Why does overfitting occur?

• Greedy search can make mistakes. We know that it can end up in local minima — so a sub-optimal choice earlier might result in a better solution later (i.e. picking a test whose posterior gain (or information gain) is less than the best one)

• But there is also another kind of problem. We know that training error is an optimistic estimate of the true error of the model, and that this optimism increases as the training error decreases

• We will see why this is the case later (lectures on Evaluation)

• Suppose we have two models h1 and h2 with training errors e1 and e2 and optimism o1 and o2. Let the true error of each be E1 = e1 + o1 and E2 = e2 + o2

• If e1 < e2 and E1 > E2, then we say that h1 has overfit the training data

• So, a search method based purely on training-data estimates may end up overfitting the training data

COMP9417 ML & DM Tree Learning Semester 1, 2018 43 / 98

Overfitting and How To Avoid It

Overfitting in Decision Tree Learning

Consider adding noisy training example #15:

Sunny, Hot, Normal, Strong, PlayTennis = No

What effect on earlier tree?

Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes

COMP9417 ML & DM Tree Learning Semester 1, 2018 44 / 98

Overfitting and How To Avoid It

Overfitting in Decision Tree Learning

S = {D1, D2, . . . , D14}  [9+,5-]

Outlook = Sunny:    {D1,D2,D8,D9,D11}   [2+,3-]  → which attribute should be tested here?
Outlook = Overcast: {D3,D7,D12,D13}     [4+,0-]  → Yes
Outlook = Rain:     {D4,D5,D6,D10,D14}  [3+,2-]  → which attribute should be tested here?

S_sunny = {D1,D2,D8,D9,D11}

Gain(S_sunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(S_sunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(S_sunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
COMP9417 ML & DM Tree Learning Semester 1, 2018 45 / 98

Overfitting and How To Avoid It

Overfitting in General

Consider the error of hypothesis h over

• training data: error_train(h)
• entire distribution D of data: error_D(h)

Definition
Hypothesis h ∈ H overfits the training data if there is an alternative
hypothesis h′ ∈ H such that

error_train(h) < error_train(h′)  and  error_D(h) > error_D(h′)

COMP9417 ML & DM Tree Learning Semester 1, 2018 46 / 98

Overfitting and How To Avoid It

Overfitting in Decision Tree Learning

[Figure: accuracy (y-axis, 0.5–0.9) vs. size of tree in number of nodes (x-axis, 0–100). Accuracy on the training data continues to increase as the tree grows, while accuracy on the test data levels off and then declines.]

COMP9417 ML & DM Tree Learning Semester 1, 2018 47 / 98

Overfitting and How To Avoid It

Avoiding Overfitting

How can we avoid overfitting? Pruning

• pre-pruning: stop growing when a data split is not statistically significant
• post-pruning: grow the full tree, then remove sub-trees which are overfitting

Post-pruning avoids the problem of “early stopping”
How to select the “best” tree:

• Measure performance over training data ?
• Measure performance over separate validation data set ?
• MDL: minimize size(tree) + size(misclassifications(tree)) ?

COMP9417 ML & DM Tree Learning Semester 1, 2018 48 / 98

Overfitting and How To Avoid It

Avoiding Overfitting

Pre-pruning

• Can be based on a statistical significance test
• Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
• For example, in ID3: chi-squared test plus information gain
  • only statistically significant attributes were allowed to be selected by the information gain procedure

COMP9417 ML & DM Tree Learning Semester 1, 2018 49 / 98

Overfitting and How To Avoid It

Avoiding Overfitting

Pre-pruning

• Simplest approach: stop growing the tree when fewer than some lower bound on the number of examples would reach a leaf
• In C4.5, this is the m parameter
• In sklearn, this is the min_samples_leaf parameter
• In sklearn, the parameter min_impurity_decrease stops splitting when the reduction in impurity falls below a lower bound (see the sketch below)
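
A minimal pre-pruning sketch using sklearn's DecisionTreeClassifier; the Iris data and the particular parameter values are illustrative assumptions, not recommendations:

```python
# Pre-pruning via min_samples_leaf and min_impurity_decrease in sklearn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="entropy",           # information-gain-style splitting
    min_samples_leaf=5,            # don't create leaves covering < 5 examples
    min_impurity_decrease=0.01,    # stop if the best split gains too little
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```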

COMP9417 ML & DM Tree Learning Semester 1, 2018 50 / 98

Overfitting and How To Avoid It

Avoiding Overfitting

Early stopping

• Pre-pruning may suffer from early stopping: may stop the growth of
tree prematurely

• Classic example: XOR/Parity-problem
• No individual attribute exhibits a significant association with the class
• Target structure only visible in fully expanded tree
• Prepruning won’t expand the root node

• But: XOR-type problems not common in practice
• And: pre-pruning faster than post-pruning

COMP9417 ML & DM Tree Learning Semester 1, 2018 51 / 98

Overfitting and How To Avoid It

Avoiding Overfitting

Post-pruning

• Builds full tree first and prunes it afterwards
• Attribute interactions are visible in fully-grown tree

• Problem: identification of subtrees and nodes that are due to chance
effects

• Two main pruning operations:
• Subtree replacement
• Subtree raising

• Possible strategies: error estimation, significance testing, MDL
principle

• We examine two methods: Reduced-error Pruning and Error-based
Pruning

COMP9417 ML & DM Tree Learning Semester 1, 2018 52 / 98

Overfitting and How To Avoid It

Reduced-Error Pruning

Split data into training and validation set

Do until further pruning is harmful:

• Evaluate impact on validation set of pruning each possible node
(plus those below it)

• Greedily remove the one that most improves validation set accuracy

COMP9417 ML & DM Tree Learning Semester 1, 2018 53 / 98

Overfitting and How To Avoid It

Reduced-Error Pruning

• Good: produces the smallest version of the most accurate subtree
• Not so good: reduces the effective size of the training set

COMP9417 ML & DM Tree Learning Semester 1, 2018 54 / 98

Overfitting and How To Avoid It

Effect of Reduced-Error Pruning

[Figure: accuracy (0.5–0.9) vs. size of tree in number of nodes (0–100), with three curves: accuracy on training data, on test data, and on test data during pruning — reduced-error pruning removes nodes and increases test-set accuracy relative to the unpruned tree.]

COMP9417 ML & DM Tree Learning Semester 1, 2018 55 / 98

Overfitting and How To Avoid It

Error-based pruning (C4.5 / J48 / C5.0)

Quinlan (1993) describes the successor to ID3 – C4.5

• many extensions – see below
• post-pruning using training set
• includes sub-tree replacement and sub-tree raising
• also: pruning by converting tree to rules
• commercial version – C5.0 – is widely used

• RuleQuest.com
• now free

• Weka version – J48 – also widely used

COMP9417 ML & DM Tree Learning Semester 1, 2018 56 / 98

Overfitting and How To Avoid It

Pruning operator: Sub-tree replacement

Bottom-up:
tree is considered for replacement once all its sub-trees have been
considered

COMP9417 ML & DM Tree Learning Semester 1, 2018 57 / 98

Overfitting and How To Avoid It

Error-based pruning: error estimate

Goal is to improve estimate of error on unseen data using all and only data
from training set

But how can this work ?

Make the estimate of error pessimistic !

COMP9417 ML & DM Tree Learning Semester 1, 2018 58 / 98

Overfitting and How To Avoid It

Error-based pruning: error estimate

• Apply pruning operation if this does not increase the estimated error
• C4.5’s method: using upper limit of standard confidence interval

derived from the training data
• Standard Bernoulli-process-based method
• Note: statistically motivated, but not statistically valid
• However: works well in practice !

COMP9417 ML & DM Tree Learning Semester 1, 2018 59 / 98

Overfitting and How To Avoid It

Error-based pruning: error estimate

• The error estimate for a tree node is the weighted sum of error estimates for all its subtrees (possibly leaves).
• Upper-bound error estimate e for a node (simplified version):

e = f + Z_c · √( f · (1 − f) / N )

• f is the actual (empirical) error of the tree on the examples at the node
• N is the number of examples at the node
• Z_c is a constant whose value depends on the confidence parameter c
• C4.5’s default value for the confidence is c = 0.25
• If c = 0.25 then Z_c = 0.69 (from the standardized normal distribution)
COMP9417 ML & DM Tree Learning Semester 1, 2018 60 / 98

Overfitting and How To Avoid It

Error-based pruning: error estimate

• How does this method implement a pessimistic error estimate ?
• What effect will the c parameter have on pruning ?
• As c ↑, z ↓
• See example on next slide (note: values not calculated using exactly the above formula)

COMP9417 ML & DM Tree Learning Semester 1, 2018 61 / 98

Overfitting and How To Avoid It

Error-based pruning: error estimate

• health plan contribution: the node itself measures f = 0.36, e = 0.46
• sub-tree measures:
  • none: f = 0.33, e = 0.47
  • half: f = 0.5, e = 0.72
  • full: f = 0.33, e = 0.47
• sub-trees combined in the ratio 6 : 2 : 6 give e = 0.51
• the sub-trees are estimated to give greater error, so prune them away

COMP9417 ML & DM Tree Learning Semester 1, 2018 62 / 98

Overfitting and How To Avoid It

Rule Post-Pruning

This method was introduced in Quinlan’s C4.5

• Convert tree to equivalent set of rules
• Prune each rule independently of others
• Sort final rules into desired sequence for use

For: simpler classifiers, people prefer rules to trees
Against: does not scale well, slow for large trees & datasets

COMP9417 ML & DM Tree Learning Semester 1, 2018 63 / 98

Overfitting and How To Avoid It

Converting A Tree to Rules

Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes

IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No

IF (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes

. . .
COMP9417 ML & DM Tree Learning Semester 1, 2018 64 / 98

Overfitting and How To Avoid It

Rules from Trees (Rule Post-Pruning)

Rules can be simpler than trees but just as accurate, e.g., in C4.5Rules:

• path from root to leaf in (unpruned) tree forms a rule
• i.e., tree forms a set of rules

• can simplify rules independently by deleting conditions
• i.e., rules can be generalized while maintaining accuracy

• greedy rule simplification algorithm
• drop the condition giving lowest estimated error (as for pruning)
• continue while estimated error does not increase

COMP9417 ML & DM Tree Learning Semester 1, 2018 65 / 98

Overfitting and How To Avoid It

Rules from Trees

Select a “good” subset of rules within a class (C4.5Rules):

• goal: remove rules not useful in terms of accuracy
• find a subset of rules which minimises an MDL criterion
• trade-off accuracy and complexity of rule-set
• stochastic search using simulated annealing

Sets of rules can be ordered by class (C4.5Rules):

• order classes by increasing chance of making false positive errors
• set as a default the class with the most training instances not covered

by any rule

COMP9417 ML & DM Tree Learning Semester 1, 2018 66 / 98

Further Issues in Tree Learning

Continuous Valued Attributes

Decision trees originated for discrete attributes only. Now: continuous
attributes.
Can create a discrete attribute to test continuous value:

• Temperature = 82.5
• (Temperature > 72.3) ∈ {t, f}

• Usual method: continuous attributes have a binary split
• Note:

• discrete attributes – one split exhausts all values
• continuous attributes – can have many splits in a tree

COMP9417 ML & DM Tree Learning Semester 1, 2018 67 / 98

Further Issues in Tree Learning

Continuous Valued Attributes

• Splits evaluated on all possible split points
• More computation: n − 1 possible splits for n values of an attribute in the training set
• Fayyad (1991)
  • sort examples on the continuous attribute
  • find midway boundaries where the class changes, e.g. for Temperature below, (48+60)/2 and (80+90)/2
• Choose the best split point by info gain (or evaluation of choice)
• Note: C4.5 uses actual values in the data

Temperature: 40 48 60 72 80 90
PlayTennis:  No No Yes Yes Yes No
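
A sketch of the Fayyad-style procedure on the Temperature example above: candidate thresholds fall at (48+60)/2 = 54 and (80+90)/2 = 85, and each is scored by information gain:

```python
# Candidate binary splits for a continuous attribute, at class-change midpoints.
from math import log2

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(ys):
    return -sum(ys.count(c) / len(ys) * log2(ys.count(c) / len(ys)) for c in set(ys))

candidates = [(temps[i] + temps[i + 1]) / 2
              for i in range(len(temps) - 1) if labels[i] != labels[i + 1]]

for t in candidates:                                   # 54.0 and 85.0
    left  = [y for x, y in zip(temps, labels) if x < t]
    right = [y for x, y in zip(temps, labels) if x >= t]
    gain = (entropy(labels) - len(left) / len(labels) * entropy(left)
            - len(right) / len(labels) * entropy(right))
    print(t, round(gain, 3))
```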

COMP9417 ML & DM Tree Learning Semester 1, 2018 68 / 98

Further Issues in Tree Learning

Axis-parallel Splitting

Fitting data that is not a good “match” to the possible splits in a tree.

“Pattern Classification” Duda, Hart, and Stork, (2001)

COMP9417 ML & DM Tree Learning Semester 1, 2018 69 / 98

Further Issues in Tree Learning

Splitting on Linear Combinations of Features

Reduced tree size by allowing splits that are a better “match” to the data.

“Pattern Classification” Duda, Hart, and Stork, (2001)

COMP9417 ML & DM Tree Learning Semester 1, 2018 70 / 98

Further Issues in Tree Learning

Attributes with Many Values

Problem:

• If attribute has many values, Gain will select it
• Why ? more likely to split instances into “pure” subsets

• Maximised by singleton subsets

• Imagine using Date = March 21, 2018 as attribute
• High gain on training set, useless for prediction

COMP9417 ML & DM Tree Learning Semester 1, 2018 71 / 98

Further Issues in Tree Learning

Attributes with Many Values

One approach: use GainRatio instead

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|S_i| / |S|) log2 (|S_i| / |S|)

where S_i is the subset of S for which A has value v_i
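
A minimal sketch of gain ratio — information gain divided by the split information (the entropy of the partition induced by the attribute itself). The PlayTennis Outlook values are used as an example:

```python
# Gain ratio = information gain / split information.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

# Outlook has gain ~0.247 and split information ~1.577, so gain ratio ~0.156;
# a many-valued attribute such as Date would have a much larger split
# information and so be penalised.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes",
        "Yes", "Yes", "Yes", "No"]
print(round(gain_ratio(outlook, play), 3))
```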

COMP9417 ML & DM Tree Learning Semester 1, 2018 72 / 98

Further Issues in Tree Learning

Attributes with Many Values

Why does this help ?

• sensitive to how broadly and uniformly attribute splits instances
• actually the entropy of S w.r.t. values of A

• i.e., the information of the partition itself

• therefore higher for many-valued attributes, especially if mostly
uniformly distributed across possible values

COMP9417 ML & DM Tree Learning Semester 1, 2018 73 / 98

Further Issues in Tree Learning

Attributes with Costs

Consider

• medical diagnosis, BloodTest has cost $150
• robotics, Width from 1ft has cost 23 sec.

How to learn a consistent tree with low expected cost?

COMP9417 ML & DM Tree Learning Semester 1, 2018 74 / 98

Further Issues in Tree Learning

Attributes with Costs

One approach: evaluate information gain relative to cost:

• Example: Gain²(S, A) / Cost(A)

Preference for decision trees using lower-cost attributes.

COMP9417 ML & DM Tree Learning Semester 1, 2018 75 / 98

Further Issues in Tree Learning

Attributes with Costs

Also: class (misclassification) costs, instance costs, . . .

See5 / C5.0 can use a misclassification cost matrix.

Can give false positives a different cost to false negatives

Forces a different tree structure to be learned to minimise asymmetric
misclassification costs – can help if the class distribution is skewed – why ?

COMP9417 ML & DM Tree Learning Semester 1, 2018 76 / 98

Further Issues in Tree Learning

Unknown Attribute Values

What if some examples missing values of A?
Use training example anyway, sort through tree. Here are 3 possible
approaches

• If node n tests A, assign most common value of A among other
examples sorted to node n

• assign most common value of A among other examples with same
target value

• assign probability pi to each possible value vi of A
• assign fraction pi of example to each descendant in tree

Note: need to classify new (unseen) examples in same fashion
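
A minimal sketch of the third option (fractional instances); the dict-based example representation and helper name are assumptions for illustration:

```python
# Split an example with a missing value of `attribute` across the branches at
# node n, in proportion to the branch frequencies among examples at that node.
from collections import Counter

def distribute(example, weight, attribute, examples_at_node):
    """Return (value, fractional_weight) pairs for an example missing `attribute`."""
    counts = Counter(e[attribute] for e in examples_at_node if e[attribute] is not None)
    total = sum(counts.values())
    return [(v, weight * c / total) for v, c in counts.items()]

# e.g. if 6 of 10 known examples at the node have Wind=Weak and 4 have Wind=Strong,
# an example with Wind missing goes down both branches with weights 0.6 and 0.4.
examples = [{"Wind": "Weak"}] * 6 + [{"Wind": "Strong"}] * 4
print(distribute({"Wind": None}, 1.0, "Wind", examples))
```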

COMP9417 ML & DM Tree Learning Semester 1, 2018 77 / 98

Further Issues in Tree Learning

Windowing

Early implementations – training sets too large for memory

As a solution ID3 implemented windowing:

1. select subset of instances – the window
2. construct decision tree from all instances in the window
3. use tree to classify training instances not in window
4. if all instances correctly classified then halt, else
5. add selected misclassified instances to the window
6. go to step 2

Windowing retained in C4.5 because it can lead to more accurate trees.
Related to ensemble learning.

COMP9417 ML & DM Tree Learning Semester 1, 2018 78 / 98

Learning Non-linear Regression Models with Trees

Non-linear Regression with Trees

Despite some nice properties of Neural Networks, such as generalization to
deal sensibly with unseen input patterns and robustness to losing neurons
(prediction performance can degrade gracefully), they still have some
problems:

• Back-propagation is often difficult to scale – large nets need lots of
computing time; may have to be partitioned into separate modules
that can be trained independently, e.g. NetTalk, DeepBind

• Neural Networks are not very transparent – hard to understand the
representation of what has been learned

Possible solution: exploit success of tree-structured approaches in ML

COMP9417 ML & DM Tree Learning Semester 1, 2018 79 / 98

Learning Non-linear Regression Models with Trees

Regression trees

• Differences to decision trees:
  • Splitting criterion: minimizing intra-subset variation
  • Pruning criterion: based on a numeric error measure
  • Leaf node predicts the average class value of the training instances reaching that node
• Can approximate piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees

COMP9417 ML & DM Tree Learning Semester 1, 2018 80 / 98

Learning Non-linear Regression Models with Trees

A Regression Tree and its Prediction Surface

“Elements of Statistical Learning” Hastie, Tibshirani & Friedman (2001)

COMP9417 ML & DM Tree Learning Semester 1, 2018 81 / 98

Learning Non-linear Regression Models with Trees

Regression Tree on sine dataset

COMP9417 ML & DM Tree Learning Semester 1, 2018 82 / 98

Learning Non-linear Regression Models with Trees

Regression Tree on CPU dataset

COMP9417 ML & DM Tree Learning Semester 1, 2018 83 / 98

Learning Non-linear Regression Models with Trees

Tree learning as variance reduction

• The variance of a Boolean (i.e., Bernoulli) variable with success probability ṗ is ṗ(1 − ṗ), which is half the Gini index. So we could interpret the goal of tree learning as minimising the class variance (or standard deviation, in the case of √Gini) in the leaves.

• In regression problems we can define the variance in the usual way:

Var(Y) = (1 / |Y|) Σ_{y ∈ Y} (y − ȳ)²

If a split partitions the set of target values Y into mutually exclusive sets {Y1, . . . , Yl}, the weighted average variance is then

Var({Y1, . . . , Yl}) = Σ_{j=1}^{l} (|Yj| / |Y|) Var(Yj) = . . . = (1 / |Y|) Σ_{y ∈ Y} y² − Σ_{j=1}^{l} (|Yj| / |Y|) ȳj²

The first term is constant for a given set Y and so we want to
maximise the weighted average of squared means in the children.
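
A small numerical check (a sketch) of this identity, using the Hammond organ prices introduced on the next slide and the Leslie = yes / no split:

```python
# Weighted average child variance equals the mean of squared targets minus the
# weighted average of squared child means.
import numpy as np

Y = np.array([4513., 625, 1051, 270, 870, 1770, 99, 1900, 77])
children = [np.array([625., 870, 1900]),                  # Leslie = yes
            np.array([4513., 1051, 270, 1770, 99, 77])]   # Leslie = no

weighted_var = sum(len(c) / len(Y) * c.var() for c in children)
rhs = (Y ** 2).mean() - sum(len(c) / len(Y) * c.mean() ** 2 for c in children)
print(np.isclose(weighted_var, rhs))   # True
```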

COMP9417 ML & DM Tree Learning Semester 1, 2018 84 / 98

Learning Non-linear Regression Models with Trees

Learning a regression tree

Imagine you are a collector of vintage Hammond tonewheel organs. You
have been monitoring an online auction site, from which you collected
some data about interesting transactions:

# Model Condition Leslie Price

1. B3 excellent no 4513
2. T202 fair yes 625
3. A100 good no 1051
4. T202 good no 270
5. M102 good yes 870
6. A100 excellent no 1770
7. T202 fair no 99
8. A100 good yes 1900
9. E112 fair no 77

COMP9417 ML & DM Tree Learning Semester 1, 2018 85 / 98

Learning Non-linear Regression Models with Trees

Learning a regression tree

From this data, you want to construct a regression tree that will help you
determine a reasonable price for your next purchase.
There are three features, hence three possible splits:

Model = [A100,B3,E112,M102,T202]
[1051, 1770, 1900][4513][77][870][99, 270, 625]

Condition = [excellent, good, fair]
[1770, 4513][270, 870, 1051, 1900][77, 99, 625]

Leslie = [yes, no] [625, 870, 1900][77, 99, 270, 1051, 1770, 4513]

The means of the first split are 1574, 4513, 77, 870 and 331, and the
weighted average of squared means is 3.21 · 10^6. The means of the second
split are 3142, 1023 and 267, with weighted average of squared means
2.68 · 10^6; for the third split the means are 1132 and 1297, with weighted
average of squared means 1.55 · 10^6. We therefore branch on Model at the
top level. This gives us three single-instance leaves, as well as three A100s
and three T202s.
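
A short sketch reproducing these numbers: for each feature, group the prices by feature value and compute the weighted average of squared child means (higher is better, since the first term of the identity above is the same for every split):

```python
# Choosing the top-level split for the Hammond organ data.
data = [("B3", "excellent", "no", 4513), ("T202", "fair", "yes", 625),
        ("A100", "good", "no", 1051),    ("T202", "good", "no", 270),
        ("M102", "good", "yes", 870),    ("A100", "excellent", "no", 1770),
        ("T202", "fair", "no", 99),      ("A100", "good", "yes", 1900),
        ("E112", "fair", "no", 77)]

def weighted_squared_means(idx):
    groups = {}
    for row in data:
        groups.setdefault(row[idx], []).append(row[3])    # group prices by value
    n = len(data)
    return sum(len(g) / n * (sum(g) / len(g)) ** 2 for g in groups.values())

for name, idx in [("Model", 0), ("Condition", 1), ("Leslie", 2)]:
    print(name, round(weighted_squared_means(idx) / 1e6, 2))   # 3.21 2.68 1.55
```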

COMP9417 ML & DM Tree Learning Semester 1, 2018 86 / 98

Learning Non-linear Regression Models with Trees

Learning a regression tree

For the A100s we obtain the following splits:

Condition = [excellent, good, fair] [1770][1051, 1900][]
Leslie = [yes, no] [1900][1051, 1770]

Without going through the calculations we can see that the second split
results in less variance (to handle the empty child, it is customary to set its
variance equal to that of the parent). For the T202s the splits are as
follows:

Condition = [excellent, good, fair] [][270][99, 625]
Leslie = [yes, no] [625][99, 270]

Again we see that splitting on Leslie gives tighter clusters of values. The
learned regression tree is depicted on the next slide.

COMP9417 ML & DM Tree Learning Semester 1, 2018 87 / 98

Learning Non-linear Regression Models with Trees

A regression tree

Model = A100:
|  Leslie = yes: f̂(x) = 1900
|  Leslie = no:  f̂(x) = 1411
Model = B3:   f̂(x) = 4513
Model = E112: f̂(x) = 77
Model = M102: f̂(x) = 870
Model = T202:
|  Leslie = yes: f̂(x) = 625
|  Leslie = no:  f̂(x) = 185

A regression tree learned from the Hammond organ dataset.
COMP9417 ML & DM Tree Learning Semester 1, 2018 88 / 98

Learning Non-linear Regression Models with Trees

Model trees

• Like regression trees but with linear regression functions at each node
• Linear regression applied to the instances that reach a node after the full tree has been built
• Only a subset of the attributes is used for LR
  • attributes occurring in the subtree (+ maybe attributes occurring in the path to the root)
• Fast: the overhead for Linear Regression (LR) is not large because usually only a small subset of attributes is used in the tree

COMP9417 ML & DM Tree Learning Semester 1, 2018 89 / 98

Learning Non-linear Regression Models with Trees

Two uses of features

Suppose we want to approximate y = cosπx on the interval −1 ≤ x ≤ 1.
A linear approximation is not much use here, since the best fit would be
y = 0. However, if we split the x-axis in two intervals −1 ≤ x < 0 and 0 ≤ x ≤ 1, we could find reasonable linear approximations on each interval. We can achieve this by using x both as a splitting feature and as a regression variable (next slide).

COMP9417 ML & DM Tree Learning Semester 1, 2018 90 / 98

Learning Non-linear Regression Models with Trees

A small model tree

x < 0:  ŷ = 2x + 1
x ≥ 0:  ŷ = −2x + 1

(left) A model tree combining a one-split feature tree with linear regression models in the leaves. Notice how x is used as both a splitting feature and a regression variable. (right) The function y = cos πx on the interval −1 ≤ x ≤ 1, and the piecewise linear approximation achieved by the model tree.

COMP9417 ML & DM Tree Learning Semester 1, 2018 91 / 98

Learning Non-linear Regression Models with Trees

Model Tree on CPU dataset

COMP9417 ML & DM Tree Learning Semester 1, 2018 92 / 98

Learning Non-linear Regression Models with Trees

Smoothing

• Naïve prediction method – output the value of the LR model at the corresponding leaf node
• Improve performance by smoothing predictions with the internal LR models
• Predicted value is a weighted average of the LR models along the path from root to leaf
• Smoothing formula: p′ = (np + kq) / (n + k), where
  • p′ is the prediction passed up to the next higher node
  • p is the prediction passed to this node from below
  • q is the value predicted by the model at this node
  • n is the number of instances that reach the node below
  • k is a smoothing constant
• The same effect can be achieved by incorporating the internal models into the leaf nodes

COMP9417 ML & DM Tree Learning Semester 1, 2018 93 / 98

Learning Non-linear Regression Models with Trees

Building the tree

• Splitting criterion: standard deviation reduction

SDR = sd(T) − Σ_i (|T_i| / |T|) × sd(T_i)

where T_1, T_2, . . . are the sets from splits of the data at the node.
• Termination criteria (important when building trees for numeric prediction):
  • Standard deviation becomes smaller than a certain fraction of the sd for the full training set (e.g. 5%)
  • Too few instances remain (e.g. fewer than four)

COMP9417 ML & DM Tree Learning Semester 1, 2018 94 / 98

Learning Non-linear Regression Models with Trees

Pruning the tree

• Pruning is based on the estimated absolute error of the LR models
• Heuristic estimate:

(n + v) / (n − v) × average absolute error

where n is the number of training instances that reach the node, and v is the number of parameters in the linear model
• LR models are pruned by greedily removing terms to minimize the estimated error
• Model trees allow for heavy pruning: often a single LR model can replace a whole subtree
• Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree

COMP9417 ML & DM Tree Learning Semester 1, 2018 95 / 98

Learning Non-linear Regression Models with Trees

Discrete (nominal) attributes

• Nominal attributes are converted to binary attributes and treated as numeric
• Nominal values are sorted using the average class value for each one
• For k values, k − 1 binary attributes are generated
  • the i-th binary attribute is 0 if an instance’s value is one of the first i in the ordering, 1 otherwise
• The best binary split with the original attribute is provably equivalent to a split on one of the new attributes

COMP9417 ML & DM Tree Learning Semester 1, 2018 96 / 98

Summary

Summary – decision trees

• Decision tree learning is a practical method for many classifier learning tasks – still a “Top 10” data mining algorithm – see sklearn.tree.DecisionTreeClassifier
• TDIDT family descended from ID3 searches a complete hypothesis space – the hypothesis is there, somewhere...
• Uses a search or preference bias; the search for an optimal tree is, in general, not tractable
• Overfitting is inevitable with an expressive hypothesis space and noisy data, so pruning is important
• Decades of research into extensions and refinements of the general approach, e.g., for numerical prediction, logical trees
• Often the “try-first” machine learning method in applications; illustrates many general issues
• Performance can be improved with use of “ensemble” methods

COMP9417 ML & DM Tree Learning Semester 1, 2018 97 / 98

Summary

Summary – regression and model trees

• Regression trees were introduced in CART – R’s implementation is close to CART, but see sklearn.tree.DecisionTreeRegressor for a basic version
• Quinlan proposed the M5 model tree inducer
• M5′: slightly improved version that is publicly available (M5P in Weka is based on this)
• Quinlan also investigated combining instance-based learning with M5
• CUBIST: Quinlan’s rule learner for numeric prediction www.rulequest.com
• Interesting comparison: Neural nets vs. model trees — both do non-linear regression
• other methods also can learn non-linear models

COMP9417 ML & DM Tree Learning Semester 1, 2018 98 / 98