SGN-13006 Introduction to Pattern Recognition and Machine Learning (5 cr)
Decision Tree Learning
Joni-Kristian Kämäräinen
September 2018
Laboratory of Signal Processing
Tampere University of Technology
Material
• Lecturer’s slides and blackboard notes
• T.M. Mitchell. Machine Learning. McGraw-Hill, 1997:
Chapter 3
• (Advanced: T. Hastie, R. Tibshirani, and J. Friedman. The
Elements of Statistical Learning. Springer, 2009: Chapter 15)
Contents
Decision Tree Learning
ID3 algorithm
Analysis of Decision Tree Learning
Overfitting
Different types of variables
Advanced Topic: Randomisation
Randomisation & Decision trees
Beyond concept learning
Figure 1: http://cognitrn.psych.indiana.edu/rgoldsto/courses/concepts2.pdf
Decision Tree Learning
Decision Tree Representation
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
How would we represent:
• ∧, ∨, XOR
• (A ∧ B) ∨ (C ∧ ¬D ∧ E)
• M of N
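As one possible answer (a sketch that is not part of the slides), A XOR B can be written as a tree of nested attribute tests – one `if` per internal node, one `return` per leaf:

```python
# Hypothetical sketch: the decision tree for A XOR B as nested attribute tests.
# Each `if` is an internal node testing one attribute, each `return` a leaf label.
def xor_tree(A: bool, B: bool) -> bool:
    if A:                   # root node tests A
        if B:               # branch A = True tests B
            return False    # leaf: A=1, B=1 -> 0
        return True         # leaf: A=1, B=0 -> 1
    else:
        if B:
            return True     # leaf: A=0, B=1 -> 1
        return False        # leaf: A=0, B=0 -> 0
```

∧ and ∨ have compact trees, whereas XOR must test every attribute on every path, so parity-like concepts give the largest trees.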
Training Data (Mitchell Table 3.1)
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
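In ID3 (introduced on the following slides) the "best" attribute is the one with the highest information gain. As a sketch that is not part of the original slides, the gains on this data can be computed directly; attribute and value names follow the table:

```python
import math
from collections import Counter

# Mitchell Table 3.1 as (Outlook, Temperature, Humidity, Wind, PlayTennis) rows.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, a):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v) for attribute index a."""
    g = entropy([r[-1] for r in rows])
    for v in set(r[a] for r in rows):
        subset = [r[-1] for r in rows if r[a] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for i, name in enumerate(ATTRS):
    print(f"Gain(S, {name}) = {information_gain(DATA, i):.3f}")
# Outlook wins with a gain of about 0.247, so ID3 places it at the root.
```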
Top-Down Induction of Decision Trees
Main loop:
1: A← the “best” decision attribute for next node
2: Assign A as decision attribute for node
3: For each value of A, create new descendant of node
4: Sort training examples to leaf nodes
5: If training examples perfectly classified, Then STOP, Else
iterate over new leaf nodes
Decision Tree Learning
ID3 algorithm
ID3 Algorithm
ID3(Examples, Target Attribute, Attributes)
1: Create a root node Root for the tree
2: if all examples are positive then
3: Return the single-node tree Root, with label = +
4: end if
5: if all examples are negative then
6: Return the single-node tree Root, with label = −
7: end if
8: if the set of predicting attributes is empty then
9: Return the single-node tree Root, with label = the most common value of the target attribute in the examples
10: end if
11: Otherwise begin
12: A = the attribute that best classifies the examples
13: Decision tree attribute for Root = A
14: for each possible value v_i of A do
15: Add a new tree branch below Root, corresponding to the test A = v_i
16: Let Examples(v_i) be the subset of examples that have the value v_i for A
17: if Examples(v_i) is empty then
18: Below this new branch add a leaf node with label = the most common target value in the examples
19: else
20: Below this new branch add the subtree ID3(Examples(v_i), Target Attribute, Attributes − {A})
21: end if
22: end for
23: Return Root
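A minimal Python re-implementation of the pseudocode, offered as a sketch rather than Mitchell's own code. It assumes each example is a tuple of attribute values followed by the class label, and it only branches on attribute values that occur in the data, so the empty-branch case of steps 17–18 never triggers:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, a):
    """Information gain of splitting the (attributes..., label) rows on attribute index a."""
    g = entropy([r[-1] for r in rows])
    for v in set(r[a] for r in rows):
        subset = [r[-1] for r in rows if r[a] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

def id3(rows, attrs):
    """attrs maps attribute index -> attribute name; returns a nested-dict tree."""
    labels = [r[-1] for r in rows]
    # Steps 2-10: a leaf when all examples share one label or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 12: the attribute that best classifies the examples (highest gain).
    best = max(attrs, key=lambda a: gain(rows, a))
    # Steps 14-22: one subtree per observed value of the chosen attribute.
    tree = {attrs[best]: {}}
    rest = {a: name for a, name in attrs.items() if a != best}
    for v in set(r[best] for r in rows):
        tree[attrs[best]][v] = id3([r for r in rows if r[best] == v], rest)
    return tree

# With DATA as in the earlier sketch (Table 3.1):
# id3(DATA, {0: "Outlook", 1: "Temperature", 2: "Humidity", 3: "Wind"})
# -> {'Outlook': {'Overcast': 'Yes',
#                 'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                 'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}
```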
Example
Example (cont.)
Application Example: C-Section Risk
Analysis of Decision Tree Learning
Hypothesis Space Search by ID3
• Hypothesis space is complete (cf. Concept Learning)
• Entropy-driven search is robust to noise – it always provides a solution that fits the training data
• Follows Occam’s Razor – “prefer the shortest tree”
Occam’s Razor
Why prefer short hypotheses?
Argument in favor:
• Fewer short hyps. than long hyps.
→ a short hyp that fits data unlikely to be coincidence (cf.
Copernicus)
→ a long hyp that fits data might be coincidence
Analysis of Decision Tree Learning
Overfitting
Overfitting
• The most informative attributes typically end up near the root, but
• all attributes are tested toward the leaves ⇒ the tree easily overfits
Overfitting
Consider error of hypothesis h over
• training data: error_train(h)
• entire distribution D of data: error_D(h)
Overfitting
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
error_train(h) < error_train(h′)
and
error_D(h) > error_D(h′)
Avoiding Overfitting
• Stop growing the tree early when some criterion is met (e.g. only a few samples left)
• Grow a full tree, then post-prune
• Use a separate validation set to stop training
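The last two points can be combined in practice: grow the full tree, then let a held-out validation set choose how hard to prune. A sketch using scikit-learn's cost-complexity pruning (a different pruning scheme than the rule post-pruning on the next slide); the dataset is an arbitrary stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # validation accuracy decides the pruning level
    if score >= best_score:            # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha = {best_alpha:.4f}, validation accuracy = {best_score:.3f}")
```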
Avoiding Overfitting – A Post-pruning Example
1. Convert tree to equivalent set of rules
2. Prune each rule independently of others
3. Sort final rules into desired sequence for use
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
. . .
Perhaps most frequently used method (e.g., C4.5)
Analysis of Decision Tree Learning
Different types of variables
Continuous Valued Attributes
Create a discrete attribute to test a continuous one:
• Temperature = 82.5
• (Temperature > 72.3) = t, f
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
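A sketch (not part of the slides) of how such a threshold can be chosen: candidate split points sit midway between adjacent values where the class changes, and the candidate with the highest information gain wins. Here the candidates are 54 and 85, and 54 gives the larger gain:

```python
import math
from collections import Counter

temps  = [40, 48, 60, 72, 80, 90]                 # sorted Temperature values
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]  # PlayTennis

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def gain_for_threshold(t):
    """Information gain of the binary test (Temperature > t)."""
    left  = [y for x, y in zip(temps, labels) if x <= t]
    right = [y for x, y in zip(temps, labels) if x > t]
    n = len(labels)
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

# Candidate thresholds: midpoints between neighbours with different labels.
candidates = [(a + b) / 2
              for (a, la), (b, lb) in zip(zip(temps, labels), zip(temps[1:], labels[1:]))
              if la != lb]
best = max(candidates, key=gain_for_threshold)
print(candidates, "-> best threshold:", best)     # [54.0, 85.0] -> best threshold: 54.0
```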
Attributes with Many Values
Problem:
• If an attribute has many values, Gain will tend to select it
• Imagine using Date = Jun 3 1996 as an attribute
One approach: use GainRatio instead
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|S_i| / |S|) log2(|S_i| / |S|)
where S_i is the subset of S for which A has value v_i
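As a small numerical illustration (not in the slides) of why dividing by SplitInformation helps: on the 14 examples of Table 3.1, a Date-like attribute with one example per value has SplitInformation = log2(14) ≈ 3.81, while Outlook (5 Sunny, 4 Overcast, 5 Rain) has only ≈ 1.58, so GainRatio deflates the many-valued attribute far more:

```python
import math

def split_information(value_counts, n):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    return -sum((c / n) * math.log2(c / n) for c in value_counts)

print(split_information([1] * 14, 14))   # ~3.807  (Date-like: 14 distinct values)
print(split_information([5, 4, 5], 14))  # ~1.577  (Outlook on Table 3.1)
```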
Attributes with Costs
Consider
• medical diagnosis, BloodTest has cost $150
How to learn a consistent tree with low expected cost?
One approach: replace gain by
• Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
• Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost
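Both heuristics are one-liners; a sketch assuming Gain(S, A) and Cost(A) have already been computed for the attribute in question:

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w
```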
Unknown Attribute Values
What if some examples are missing the value of A?
Use the training example anyway and sort it through the tree:
• If node n tests A, assign the most common value of A among the other examples sorted to node n
• or assign the most common value of A among the other examples with the same target value
• or assign probability p_i to each possible value v_i of A
• and assign the fraction p_i of the example to each descendant in the tree
Classify new examples in the same fashion
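A sketch (not from the slides) of the probabilistic option: estimate p_i for each value v_i of A from the examples that reach node n, then send the fraction p_i of the incomplete example down branch v_i. Missing values are assumed to be encoded as None:

```python
from collections import Counter

def value_probabilities(rows_at_node, a):
    """p_i for each observed value v_i of attribute index a among the examples at a node."""
    counts = Counter(r[a] for r in rows_at_node if r[a] is not None)
    n = sum(counts.values())
    return {v: c / n for v, c in counts.items()}

# e.g. {'Weak': 0.6, 'Strong': 0.4}: 0.6 of the example's weight goes down Wind = Weak.
```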
Decision Trees – Pros & Cons
• Pros
• Powerful hypothesis space (cf. concept learning)
• Robust to noise (probabilistic)
• Easy to interpret and add heuristics
• (adaptive computation time at test time; maps nicely to GPUs)
• Cons
• Greedy search finds a local optimum (and is sensitive to the training data); unpruned trees overfit severely (high variance)
Advanced Topic: Randomisation
Randomisation
There are many ways to add randomisation to the process of
learning:
• Data: Bootstrapping & bagging & cross-validation
• Features: Random splitting & sub-spaces (feature selection)
• Initialisation: Random initialisation
• Learning: Evolutionary & genetic algorithms
Power of randomisation
Randomisation is a meta-technique that can be used with any machine learning method – often randomisation is easier than a detailed analysis of a method’s properties and its sensitivity to the training data, followed by manual adjustment
Advanced Topic: Randomisation
Randomisation & Decision trees
Randomisation & Decision Trees = Random Forests
• Bagging (bootstrap aggregation) is a technique for reducing
the variance of an estimated prediction function.
• Bagging seems to work especially well for high-variance, low-bias procedures, such as trees.
• For regression:
• We fit the same tree many times to bootstrap-sampled versions of the training data and average over all trees.
• For Classification:
• A committee of trees each casts a vote and majority wins.
• Random forests is a substantial modification of bagging that
builds on a large collection of de-correlated trees.
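A by-hand sketch of bagging regression trees (not from the slides), with scikit-learn's DecisionTreeRegressor standing in for a generic tree learner and X, y assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, seed=0):
    """Fit the same tree to B bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample: n indices drawn with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the predictions of all trees (for classification: majority vote instead)."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```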
Random Forest for Regression or Classification
1: for b = 1 to B do
2: Draw a bootstrap sample Z of size N from the training data.
3: Grow a random-forest tree T_b to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached.
4: Select m variables at random from the p variables.
5: Pick the best variable among the m.
6: Split the node into daughter nodes.
7: end for
8: Output the ensemble of trees {T_b}_{b=1}^B.
To make a prediction at a new point x:
Regression: f̂_rf^B(x) = (1/B) Σ_{b=1}^B T_b(x)
Classification: let Ĉ_b(x) be the class prediction of the b-th random-forest tree; then Ĉ_rf^B(x) = majority vote of {Ĉ_b(x)}_{b=1}^B
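The same ensemble is available off the shelf; a sketch using scikit-learn, where n_estimators plays the role of B and max_features the role of m, and the dataset is an arbitrary stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample, testing sqrt(p) random features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))   # prediction = majority vote over trees
```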
Example: Real-Time Human Pose Recognition in Parts from
Single Depth Images
J. Shotton et al. “Real-Time Human Pose Recognition in Parts from
Single Depth Images”. In: IEEE Conference on Computer Vision and
Pattern Recognition. 2011
Video: kinect cvpr2011 bestpaper.mp4
Summary
Summary
• Decision tree model
• Entropy and information gain
• The principles of the ID3 algorithm
• Problem of overfitting
• Randomisation and the principles behind the random forest algorithm