
COMP3308/3608, Lecture 7
ARTIFICIAL INTELLIGENCE
Decision Trees
Reference: Witten, Frank, Hall and Pal, ch. 4.3 and ch. 6.1; Russell and Norvig, p. 697-707



Core topics:
• Constructing decision trees
• Entropy and information gain
• DT’s decision boundary
Additional topics:
• Avoiding overfitting by pruning
• Dealing with numeric attributes
• Alternative measures for selecting attributes
• Dealing with missing values
• Handling attributes with different costs

Decision Trees (DTs)
• DTs are supervised classifiers
• Most popular and well researched ML/DM technique
• Developed in parallel in ML by Quinlan (USyd) and in statistics by Breiman, Friedman, Olshen and Stone, “Classification and regression trees”, 1984
• Quinlan has refined the DT algorithm over the years
• ID3, 1986
• C4.5, 1993
• C5.0 (See5 on Windows) – commercial version of C4.5, used in many DM tools
• 2 DT versions in WEKA
• id3 – nominal attributes only
• j48 – both nominal and numeric
• Many other implementations available, both free and commercial

DT for the weather data
DT Example
DT representation:
• each non-leaf node tests the values of an attribute
• each branch corresponds to an attribute value
• each leaf node assigns a class
To predict the class for a new example: start from the root and test the values of the attributes, until you reach a leaf node; return the class of the leaf node.
What would be the prediction for outlook=sunny, temperature=cool, humidity=high, windy=true?
Another way to look at DTs:
Use your data to build a tree of questions with answers at the leaves. To answer a new query, start from the tree root, answer the questions until you reach a
leaf node and return its answer.
outlook   temp.  humidity  windy  play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no
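As a small illustration (not part of the original slide), here is a Python sketch of how such a tree can be represented and queried, assuming the tree that the information-gain calculations later in this lecture produce (outlook at the root, humidity under sunny, windy under rainy):

# Non-leaf nodes are (attribute, {value: subtree}); leaves are class labels.
weather_tree = ("outlook", {
    "sunny":    ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy":    ("windy", {"false": "yes", "true": "no"}),
})

def classify(tree, example):
    """Walk from the root, testing one attribute per node, until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

# The query from the slide: only outlook and humidity are actually tested.
query = {"outlook": "sunny", "temperature": "cool", "humidity": "high", "windy": "true"}
print(classify(weather_tree, query))   # -> 'no'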

Constructing DTs (ID3 algorithm)
• Top-down in recursive divide-and-conquer fashion:
1. The best attribute is selected for root node and a branch is created for
each possible attribute value
2. The examples are split into subsets, one for each branch extending from the node
3. The procedure is repeated recursively for each branch, using only the examples that reach the branch
4. Stop if all examples have the same class; make a leaf node corresponding to this class

Building a DT – Example

Training data (wage, dependents, gender → class):
wage  dependents  gender  class
40K   7           M       reject
20K   1           F       reject
58K   5           M       accept
23K   3           F       accept

• Assume that we know how to select the best attribute at each step

First node: the root tests Wage > 22K, with a Y and an N branch. The N branch receives only the 20K example (class reject), so it becomes a leaf.

Second node: the Y branch receives {40K 7 M reject, 58K 5 M accept, 23K 3 F accept} and is split on dependents < 6.

Final tree:
Wage > 22K?
  N → reject
  Y → dependents < 6?
        Y → accept
        N → reject

Example created by Creath

Stopping Criterion Again – It is More Complex
• 4. Stop if
• a) [as before, the typical case]
all examples in the subset following the branch have the same class
=> make a leaf node corresponding to this class
• b) [additional condition]
Cannot split any further – all examples in the subset have the same attribute values but different classes (noise in data) => make a leaf node and label it with the majority class of the subset
• c) [additional condition]
The subset is empty (e.g. there are no examples with the attribute value for the branch) => create a leaf node and label it with the majority class of the subset of the parent

Pseudo-Code
• Aim: find a tree consistent with the training examples
• Idea: recursively choose the best attribute as root of sub-tree
[Pseudocode figure: when the example subset is empty, return the majority class of the parent node's subset; when no attributes remain, return the majority class of the examples.]
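The pseudocode figure itself did not survive the export; the following is a minimal Python sketch of the same recursive procedure (my own reconstruction, not the slide's code), using the stopping conditions from the previous slide. choose_best_attribute is a placeholder for the information-gain selection introduced later, and domains maps each attribute to its set of possible values. Trees are returned in the same nested form used in the earlier weather-tree sketch.

from collections import Counter

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def id3(examples, attributes, domains, parent_examples, choose_best_attribute):
    """examples: list of dicts with a 'class' key; attributes: names still usable."""
    if not examples:                       # (c) empty subset -> majority class of parent's subset
        return majority_class(parent_examples)
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                  # (a) pure subset -> leaf with that class
        return classes.pop()
    if not attributes:                     # (b) cannot split any further -> majority class
        return majority_class(examples)
    best = choose_best_attribute(examples, attributes)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in domains[best]:            # one branch per possible attribute value
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, remaining, domains, examples, choose_best_attribute)
    return (best, branches)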

DT = A Set of Rules
• DTs can be expressed as a set of mutually exclusive rules
• Each rule is a conjunction of tests (1 rule = 1 path in the tree)
• Rules are connected with disjunction
• => DT = disjunction of conjunctions
• Extract the rules for the weather tree!

Expressiveness of DTs
• DTs can express any function of the input attributes
• Reason: there is a consistent DT for any training set, with one path to a leaf for each example
  • i.e. a lookup table, explicitly encoding the training data
  • the above is true unless f(x) is nondeterministic in x
• But such a DT will probably not generalize well to new examples => prefer to find more compact decision trees

How to Find the Best Attribute?
• Four DTs for the weather data; which is the best choice?
• A heuristic is needed!
• A measure of ‘purity’ of each node would be a good choice, as the “pure” subsets (containing only yes or no examples) will not have to be split further and the recursive process will terminate
=> at each step we can choose the attribute which produces the purest children nodes; such measure of purity is called …

Entropy (= Information Content) – Interpretation 1
• Given a set of examples with their class
• e.g. the weather data – 9 examples from class yes and 5 examples
from class no
• Entropy measures the homogeneity (purity) of a set of examples with
respect to their class
• The smaller the entropy, the greater the purity of the set
• Standard measure in signal compression, information theory, physics

Entropy – Formal Definition
• Entropy H(S) of data set S:
H(S) = I(S) = - Σi Pi · log2 Pi
where Pi is the proportion of examples that belong to class i
• Different notation is used in textbooks; we will use H(S) and I(S)
• log to the base 2 – entropy is measured in bits
• Note: log 0 is not defined, but when calculating entropy we take 0 · log2 0 = 0
• For our example, the weather data with 9 yes and 5 no examples:
H(S) = -Pyes·log2 Pyes - Pno·log2 Pno = I(9/14, 5/14) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940 bits
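As a quick sanity check (a sketch, not from the slides), the formula can be evaluated directly from the class counts:

import math

def entropy(class_counts):
    """H(S) = -sum_i P_i * log2(P_i), with 0*log2(0) treated as 0."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(round(entropy([9, 5]), 3))   # -> 0.94, i.e. 0.940 bits for 9 yes / 5 no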

Range of the Entropy for Binary Classification
• 2 classes: yes and no
• on x: p – the proportion of positive examples
=> the proportion of negative examples is 1-p
• on y: the entropy H(S)
H(S) = I(p, 1-p) = -p·log2 p - (1-p)·log2(1-p)
graph from T. Mitchell, Machine Learning, McGraw Hill, 1997
• H(S) ∈ [0, 1]
• H(S)=0 => all elements of S belong to the same class (no disorder, min entropy)
• H(S)=1 => equal number of yes & no examples (S is as disordered as it can be; max entropy)

Entropy – Interpretation 2 (Information Theory
Interpretation)
• Sender and Receiver
• Sender sends information to Receiver about the outcome of an
event (e.g. flipping a coin, 2 possibilities: heads or tails)
• Receiver has some prior knowledge about the outcome, e.g.
• Case 1: knows that the coin is honest
• Case 2: knows that the coin is rigged so that it comes heads 75% of the time
• In which case an answer telling the outcome of a toss will be more informative for the Receiver (i.e. will be less predictable)?

Information Theory
• Information theory, Shannon and Weaver 1949:
• Given a set of possible answers (messages) M = {m1, m2, …, mn} and a probability P(mi) for the occurrence of each answer, the expected information content (entropy) of the actual answer is:
entropy(M) = - Σi P(mi) · log2 P(mi)
• Shows the amount of surprise of the receiver by the answer based on the probability of the answers (i.e. based on the prior knowledge of the receiver about the possible answers)
• The less the receiver knows, the more information is provided (the more informative the answer is)

Another Example
• Example 2: Basketball game outcome
• Two teams A and B are playing basketball
• Case 1: Equal probability to win
• Case 2: (a famous basketball player) is playing for A
and the probability of A to win is 90%
• In which case an answer telling the outcome of the game will contain more information? (i.e. will be less predictable)

Entropy Calculation for the Examples
• Calculate the entropy for the two examples: coin flipping and basketball game
• Example 1: Flipping coin
• Case 1 (honest coin):
• Case 2 (rigged coin):
• Example 2: Basketball game outcome
• Case 1 (equal probability to win):
• Case 2 (non-equal probability to win):

Entropy – Interpretation 3
adapted from http://www.cs.cmu.edu/~awm/tutorials
Suppose that X is a random variable with 4 possible values A, B, C and D; P(X=A)=P(X=B)=P(X=C)=P(X=D)=1/4.
You must transmit a sequence of values of X over a serial binary channel. How?
Solution: encode each symbol with 2 bits, e.g. A=00, B=01, C=10, D=11
• ABBBCADBADCB…
• 000101011000110100111001…
Now you are told that the probabilities are not equal, e.g. P(X=A)=1/2, P(X=B)=1/4, P(X=C)=P(X=D)=1/8.
Can you invent a better coding? E.g. a coding that uses less than 2 bits on average per symbol, e.g. 1.75 bits?

Calculating the Number of Bits per Symbol
• Coding used:
• Average number of bits per symbol:
• Is this the smallest number of bits per symbol?
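The coding and the average are shown only in the slide graphic; assuming the usual prefix code for these probabilities, A=0, B=10, C=110, D=111 (an assumption, not confirmed by the extracted text), the expected number of bits per symbol is:

(1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 bits

which matches the 1.75 bits referred to on the next slide.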

Entropy – Interpretation 3; General Case
adapted from http://www.cs.cmu.edu/~awm/tutorials
Suppose that X is a random variable with m possible values V1…Vm; P(X=V1)=P1, P(X=V2)=P2,…, P(X=Vm)=Pm.
The smallest possible number of bits per symbol on average needed to transmit a stream of symbols drawn from X’s distribution is:
H(X) = - Σi=1..m Pi · log2 Pi
High entropy – the values of X are all over place
• The histogram of the frequency distribution of values of X will be flat
Low entropy – the values of X are more predictable
• The histogram of the frequency distribution of values of X will have many lows and one or two highs

Let’s Check the Entropy!
• Let’s calculate the entropy (= minimum number of bits per symbol on average) and compare with our coding which required 1.75 bits
• The probabilities were:
P(X=A)=1/2, P(X=B)=1/4, P(X=C)=P(X=D)=1/8
• H(X) = I(1/2, 1/4, 1/8, 1/8) = -(1/2)·log2(1/2) - (1/4)·log2(1/4) - (1/8)·log2(1/8) - (1/8)·log2(1/8) = 1/2 + 1/2 + 3/8 + 3/8 = 1.75 bits
• => the entropy is 1.75 bits and our coding also uses 1.75 bits
• => our coding is good – it uses the smallest number of bits per
symbol on average!

Information Gain
• Back to DTs…
• Entropy measures
• the disorder of a set of training examples with respect to their class Y
• shows the amount of surprise of the receiver by the answer Y based on the probability of the answers
• the smallest possible number of bits per symbol (on average) needed to transmit a stream of symbols drawn from Y’s distribution
• We can use the first one to define a measure for the effectiveness of an attribute to classify the training data
• This measure is called information gain
• It measures the reduction in entropy caused by using this attribute to
partition the set of training examples
• The best attribute is the one with the highest information gain (i.e. with
the biggest reduction in entropy)

Information Gain – Example
• Let’s calculate the information gain of the attribute outlook
(Figure: the split on outlook; 5, 4 and 5 examples go along the sunny, overcast and rainy branches.)
• Information gain measures reduction in entropy
• It is a difference of 2 terms: T1 –T2
• T1 is the entropy of the set of examples S associated with the parent node before the split
Before splitting: 9 yes & 5 no
T1 = H(S) = I(9/14, 5/14) = 0.940 bits
• T2 is the remaining entropy in S, after S is split by the attribute (e.g. outlook)
• It takes into consideration the entropies of the child nodes and the
distribution of the examples along each child node
• E.g. for a split on outlook, it will consider the entropies of S1, S2 and S3 and the proportion of examples following each branch (5/14, 4/14, 5/14):
T2 = H(S|outlook) = (5/14)·H(S1) + (4/14)·H(S2) + (5/14)·H(S3)

Information Gain – Example (cont.)

Gain = T1 - T2
• T1 is the entropy of the set of examples S at the parent node before the split
• T2 is the remaining entropy in S after S is split by an attribute
(Figure: before splitting, 9 yes & 5 no; after the split, 5, 4 and 5 examples follow the three branches)
• The larger the difference, the higher the information gain
• The best attribute is the one with the highest information gain
  • it reduces most the entropy of the parent node
  • it is expected to lead to pure partitions fastest

Information Gain – Formal Definition
• The information gain of an attribute A, Gain(S|A), is the reduction in entropy caused by the partitioning of the set of examples S using A
Gain(S|A) = H(S) - Σv∈values(A) P(A=v) · H(S|A=v)
          = H(S) - Σv∈values(A) (|Sv| / |S|) · H(S|A=v)

• Gain(S|A) is the information gain of an attribute A relative to S
• Values(A) is the set of all possible values for A
• Sv is the subset of S for which A has value v
• H(S) is the entropy of the original data set S
• The second term is the conditional entropy: the expected value of the entropy after S is partitioned by A (called Remainder in Russell and Norvig)
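A minimal sketch of this definition in Python (my own, not the slides' code). It takes the class counts of the parent set and of each child subset; the per-branch counts for outlook are read off the weather table:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_per_value):
    """Gain(S|A) = H(S) - sum over values v of (|S_v|/|S|) * H(S_v)."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in child_counts_per_value)
    return entropy(parent_counts) - remainder

# outlook splits the 9 yes / 5 no examples into sunny (2 yes, 3 no),
# overcast (4 yes, 0 no) and rainy (3 yes, 2 no):
gain_outlook = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain_outlook, 3))   # -> 0.247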

Computing the Remainder for outlook

Fully computing the second term (the remainder) H(S|outlook):
T2 = H(S|outlook) = (5/14)·H(S1) + (4/14)·H(S2) + (5/14)·H(S3)

Before splitting: 9 yes & 5 no
(The overcast branch is pure, so a leaf would be created for it if outlook is chosen for splitting.)

H(S|outlook=sunny)    = I(2/5, 3/5) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.971 bits
H(S|outlook=overcast) = I(4/4, 0/4) = -(4/4)·log2(4/4) - (0/4)·log2(0/4) = 0 bits
H(S|outlook=rainy)    = I(3/5, 2/5) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.971 bits
H(S|outlook) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits

(This is the amount of information that we expect would be necessary to determine the class of an example given this tree structure, i.e. how difficult it will be to predict the class given this tree.)

Computing the Information Gain for outlook

=> the information gain obtained by branching on outlook is:
Gain(S|outlook) = H(S) - H(S|outlook) = 0.940 - 0.693 = 0.247 bits

Similarly, the information gain for the other three attributes is:
Gain(S|temperature) = 0.029 bits
Gain(S|humidity) = 0.152 bits
Gain(S|windy) = 0.048 bits

=> we select outlook for splitting because it has the highest information gain
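Reusing the entropy and information_gain sketch from the formal-definition slide, the per-branch (yes, no) counts read off the weather table reproduce all four gains:

# Class counts (yes, no) per attribute value, read off the weather table:
splits = {
    "outlook":     [[2, 3], [4, 0], [3, 2]],   # sunny, overcast, rainy
    "temperature": [[2, 2], [4, 2], [3, 1]],   # hot, mild, cool
    "humidity":    [[3, 4], [6, 1]],           # high, normal
    "windy":       [[6, 2], [3, 3]],           # false, true
}
for attribute, child_counts in splits.items():
    print(attribute, round(information_gain([9, 5], child_counts), 3))
# -> outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048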

Continuing to Split

For the examples in the sunny branch:
Gain(Ssunny|temperature) = 0.571 bits
Gain(Ssunny|humidity) = 0.971 bits
Gain(Ssunny|windy) = 0.020 bits
=> humidity is selected as the next split

Building DT as a Search Problem
DT algorithm searches the hypothesis space for a hypothesis that fits
(correctly classifies) the training data
image ref: T. Mitchell, Machine Learning, McGraw Hill, 1997
What is the search strategy?
• simple-to-complex search (starting with an empty tree and progressively considering more elaborate hypotheses)
• hill climbing with information gain as an evaluation function
Information gain is an evaluation function of how good the current state is (how close we are to a goal state, i.e. a tree that classifies correctly all training examples)

DT Decision Boundary
Example from 6.034 AI, MIT
DTs define a decision boundary in the feature space
Example: binary classification, numeric attributes (binary DT)
2 attributes:
R – ratio of earnings to expenses
L – number of late payments on credit cards over the past year
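The boundary figure is not in the extracted text; as a hedged illustration with made-up thresholds (not the ones from the MIT example), a small binary tree over R and L shows why a DT's decision boundary is a union of axis-parallel regions:

def approve(R, L):
    """A hypothetical two-level tree over R (earnings/expenses ratio) and
    L (number of late payments). Each test splits the (R, L) plane with an
    axis-parallel line, so each leaf corresponds to a rectangular region."""
    if R < 1.5:                               # vertical split of the plane
        return "reject"
    return "accept" if L < 3 else "reject"    # horizontal split of the R >= 1.5 half

print(approve(R=2.0, L=1))   # -> 'accept'
print(approve(R=2.0, L=5))   # -> 'reject'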

Additional Topics
• Avoiding Overfitting
• Dealing with Numeric Attributes
• Alternative Measures for Selecting Attributes
• Dealing with Missing Attributes
• Handling Attributes with Different Costs

Overfitting
• Overfitting: the error on the training data is very small but the error on new data (test data) is high
=> The classifier has memorized the training examples but has not extracted the underlying pattern from them, so it does not classify new examples well!
More formal definition of overfitting (generic, not only for DTs) [T. Mitchell, Machine Learning, McGraw Hill, 1997]:
• given: H – a hypothesis space, a hypothesis h ∈ H, D – the entire distribution of instances, train – the training instances
• h is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that:
error_train(h) < error_train(h') but error_D(h) > error_D(h')
Overfitting is a problem not only for DTs but for most classifiers

Overfitting in DTs
image ref: T. Mitchell, Machine Learning, McGraw Hill, 1997

Overfitting in DTs – Reasons
