MIS 637
Data Analytics & Machine Learning
C4.5 Algorithm

C4.5 Algorithm
C4.5 uses information gain or entropy reduction to select optimal split at each decision node
In Engineering, information analogous to signal, entropy analogous to noise
What is Entropy?
For an event with probability p, the average amount of information, in bits, required to transmit the result is –log2(p)
For example, toss fair coin with p = 0.5
Result of toss transmitted using –log2(0.5) = 1 bit of information
1 bit of information represents the result of the toss as 0 or 1
Corresponds to either “HEAD” or “TAIL”

C4.5 Algorithm (cont’d)
Variable X has k values with probabilities p1, p2, … , pk
For variables with several outcomes, use weighted sum of –log2(p)’s to transmit result
Entropy of X defined using formula:

H(X) = – Σj pj log2(pj)
Represents smallest number of bits, on average per symbol, needed to transmit stream of symbols corresponding to observed values of X
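A minimal Python sketch of this entropy calculation (an illustration added here, not part of the original slides; the function name entropy is an arbitrary choice):

```python
import math

def entropy(probabilities):
    """H(X) = -sum(p * log2(p)) over the outcome probabilities; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin needs 1 bit per toss; a heavily biased coin needs less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```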

C4.5 Algorithm (cont’d)
Information Gain
Suppose we have a candidate split S, which partitions the data into k subsets: T1, T2, …, Tk
Weighted sum of the entropies: HS(T) = Σi Pi HS(Ti), where Pi is the proportion of records in subset Ti
C4.5 uses Information Gain:
gain(S) = H(T) – HS(T)
Represents increase in information by partitioning training data T according to candidate split S
From the candidate splits, C4.5 chooses the split with maximum information gain, gain(S)
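As an illustrative sketch building on the entropy function above (the helper names label_entropy and information_gain are arbitrary choices, not C4.5 internals), HS(T) and gain(S) could be computed like this:

```python
from collections import Counter

def label_entropy(labels):
    """Entropy of a list of class labels, e.g. ['Good', 'Bad', ...]."""
    counts = Counter(labels)
    return entropy([c / len(labels) for c in counts.values()])

def information_gain(all_labels, subsets):
    """gain(S) = H(T) - H_S(T), where H_S(T) is the proportion-weighted sum of subset entropies."""
    total = len(all_labels)
    h_s = sum(len(subset) / total * label_entropy(subset) for subset in subsets)
    return label_entropy(all_labels) - h_s
```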
Example
C4.5 illustrated with example
Recall the data set where each customer is classified as a “Good” or “Bad” credit risk using three predictor fields

C4.5 Algorithm (cont’d)

Consider the data set and all candidate splits for the root node, shown below

Customer Savings Assets Income ($1000s) Credit Risk
1 Medium High 75 Good
2 Low Low 50 Bad
3 High Medium 25 Bad
4 Medium Medium 50 Good
5 Low Medium 100 Good
6 High High 25 Good
7 Low Low 25 Bad
8 Medium Medium 75 Good

Candidate Split Child Nodes
1 Savings = Low Savings = Medium Savings = High
2 Assets = Low Assets = Medium Assets = High
3 Income <= $25,000 Income > $25,000
4 Income <= $50,000 Income > $50,000
5 Income <= $75,000 Income > $75,000
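For the worked computations that follow, the eight training records above can be written down directly in Python (a convenience representation used only for these sketches):

```python
# The eight training records from the table above, one dict per customer.
records = [
    {"savings": "Medium", "assets": "High",   "income": 75,  "risk": "Good"},
    {"savings": "Low",    "assets": "Low",    "income": 50,  "risk": "Bad"},
    {"savings": "High",   "assets": "Medium", "income": 25,  "risk": "Bad"},
    {"savings": "Medium", "assets": "Medium", "income": 50,  "risk": "Good"},
    {"savings": "Low",    "assets": "Medium", "income": 100, "risk": "Good"},
    {"savings": "High",   "assets": "High",   "income": 25,  "risk": "Good"},
    {"savings": "Low",    "assets": "Low",    "income": 25,  "risk": "Bad"},
    {"savings": "Medium", "assets": "Medium", "income": 75,  "risk": "Good"},
]
```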

C4.5 Algorithm (cont’d)
Calculate Entropy of training set before splitting, where 5/8 records classified “Good” and 3/8 “Bad”:

H(T) = – (5/8) log2(5/8) – (3/8) log2(3/8) = 0.9544 bits
Compare each candidate split against H(T) = 0.9544, to determine which split maximizes information gain
Candidate 1
Split on Savings, “High” = 2 records, “Medium” = 3 records, and “Low” = 3 records:

PHigh = 2/8, PMedium = 3/8, PLow = 3/8
C4.5 Algorithm (cont’d)
Of the records with Savings = “High”, 1 is “Good” and 1 is “Bad”, so P = 0.5 of choosing a “Good” record
Where Savings = “Medium”, all 3 are “Good”, so P = 1.0 of choosing “Good”
Of the records with Savings = “Low”, 1 is “Good” and 2 are “Bad”, giving P = 0.33 of choosing “Good”
Entropies of the 3 branches, “High”, “Medium”, and “Low”, are:

HHigh = – (1/2) log2(1/2) – (1/2) log2(1/2) = 1
HMedium = – (3/3) log2(3/3) – (0/3) log2(0/3) = 0
HLow = – (1/3) log2(1/3) – (2/3) log2(2/3) = 0.9183
C4.5 Algorithm (cont’d)
Combining the Entropy of the three branches, weighted by the corresponding proportions Pi:

HSavings(T) = Σi Pi HSavings(Ti) = (2/8)(1) + (3/8)(0) + (3/8)(0.9183) = 0.5944 bits
Information Gain represented by split on Savings attribute is H(T) – HSavings(T) = 0.9544 – 0.5944 = 0.36 bits
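These numbers can be reproduced with the earlier sketches (assuming the records list and the label_entropy helper defined above):

```python
# Reproduce H(T), HSavings(T), and the gain for the Savings split.
risks = [r["risk"] for r in records]
h_t = label_entropy(risks)                                   # 0.9544 bits

branches = {}
for r in records:
    branches.setdefault(r["savings"], []).append(r["risk"])

h_savings = sum(len(labels) / len(records) * label_entropy(labels)
                for labels in branches.values())             # 0.5944 bits

print(round(h_t, 4), round(h_savings, 4), round(h_t - h_savings, 4))
# 0.9544 0.5944 0.3601  (the slides round the gain to 0.36 bits)
```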

How is this measure interpreted?
Before the split, H(T) = 0.9544 bits: on average, 0.9544 bits of information are needed to transmit the credit risk of each of the 8 records

C4.5 Algorithm (cont’d)
Splitting on Savings results in HSavings(T) = 0.5944. Now, on average, fewer bits of information are required to transmit the credit risk of the 8 records
Reduction in Entropy is Information Gain, 0.9544 – 0.5944 = 0.36 bits of information gained by splitting on Savings
C4.5 chooses attribute split with highest Information Gain as optimal split at root node
Information Gain for other 4 candidate splits calculated similarly
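A sketch of that calculation, looping over all five candidate splits (again assuming records, label_entropy, and information_gain from the earlier sketches); the results agree with the table below to within rounding:

```python
# Information gain for all five candidate root-node splits.
def categorical_split(attr):
    """Group the class labels by each value of a categorical attribute."""
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r["risk"])
    return list(groups.values())

def threshold_split(attr, t):
    """Two-way split on a numeric attribute at threshold t; empty sides are dropped."""
    left = [r["risk"] for r in records if r[attr] <= t]
    right = [r["risk"] for r in records if r[attr] > t]
    return [side for side in (left, right) if side]

risks = [r["risk"] for r in records]
candidates = {
    "Savings": categorical_split("savings"),
    "Assets": categorical_split("assets"),
    "Income <= 25": threshold_split("income", 25),
    "Income <= 50": threshold_split("income", 50),
    "Income <= 75": threshold_split("income", 75),
}
for name, subsets in candidates.items():
    print(name, round(information_gain(risks, subsets), 4))
# Assets shows the largest gain, in agreement with the table below.
```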

C4.5 Algorithm (cont’d)
Information Gain for 5 candidate splits occurring at root node summarized below
Candidate split 2 has the highest Information Gain = 0.5487 bits, and is chosen for the initial split

Candidate Split   Child Nodes                                        Information Gain (Entropy Reduction)
1                 Savings = Low, Savings = Medium, Savings = High    0.36 bits
2                 Assets = Low, Assets = Medium, Assets = High       0.5487 bits
3                 Income <= $25,000, Income > $25,000                0.1588 bits
4                 Income <= $50,000, Income > $50,000                0.3475 bits
5                 Income <= $75,000, Income > $75,000                0.0923 bits

C4.5 Algorithm (cont’d)
Initial split results in two terminal leaf nodes and one second-level decision node
Records with Assets = “Low” all share the target value “Bad”, and records with Assets = “High” all share the target value “Good”, so both become leaf nodes

C4.5 iterates at Decision Node A, choosing optimal split from list of four possible candidate splits


Root Node (All Records): Assets = Low, Medium, High?
  Assets = Low    → Bad Risk (Records 2, 7)
  Assets = Medium → Decision Node A (Records 3, 4, 5, 8)
  Assets = High   → Good Risk (Records 1, 6)
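A small illustration of this iteration (assuming the records list from the earlier sketch): partitioning the records on the chosen Assets split shows which child nodes are already pure and which one C4.5 must split again:

```python
# Partition the training records by the chosen root split on Assets.
children = {}
for r in records:
    children.setdefault(r["assets"], []).append(r)

for value, subset in children.items():
    classes = {r["risk"] for r in subset}
    status = "pure leaf" if len(classes) == 1 else "needs further splitting"
    print(f"Assets = {value}: {len(subset)} records, {status}")
# Only Assets = Medium is impure, so C4.5 iterates there (Decision Node A).
```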

C4.5 Algorithm (cont’d)

After growing the tree, C4.5 performs pessimistic postpruning, if necessary, which increases the generality of the tree
The diagram shows the fully grown C4.5 tree; it is “bushier” and one level shallower than the tree built by CART
Both algorithms concur on importance of Assets (root level) and Savings (second level)


Root Node (All Records): Assets = Low, Medium, High?
  Assets = Low    → Bad Risk (Records 2, 7)
  Assets = Medium → Decision Node A (Records 3, 4, 5, 8): Savings = Low, Medium, High?
    Savings = Low    → Good Risk (Record 5)
    Savings = Medium → Good Risk (Records 4, 8)
    Savings = High   → Bad Risk (Record 3)
  Assets = High   → Good Risk (Records 1, 6)

Decision Rules
Decision Trees produce interpretable output in human-readable form
Decision Rules constructed directly from Decision Tree output, traversing path from root node to a given leaf node
Decision Rules have form IF antecedent THEN consequent
Antecedent consists of attribute values from the branches along the given path
Consequent is classification of records contained in particular leaf node, corresponding to path
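A minimal sketch of this rule extraction (the nested-dict tree representation and the extract_rules helper are illustrative assumptions, not C4.5's internal format):

```python
# The C4.5 tree from the earlier slide, written as nested dicts; leaves are class strings.
tree = {
    "attribute": "assets",
    "branches": {
        "Low":  "bad credit risk",
        "High": "good credit risk",
        "Medium": {
            "attribute": "savings",
            "branches": {
                "Low":    "good credit risk",
                "Medium": "good credit risk",
                "High":   "bad credit risk",
            },
        },
    },
}

def extract_rules(node, conditions=()):
    """Walk each root-to-leaf path: antecedent = branch conditions, consequent = leaf class."""
    if isinstance(node, str):                        # leaf node: emit one IF ... THEN ... rule
        yield "IF " + " AND ".join(conditions) + " THEN " + node
        return
    for value, child in node["branches"].items():
        yield from extract_rules(child, conditions + (f"{node['attribute']} = {value}",))

for rule in extract_rules(tree):
    print(rule)
```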

Decision Rules (cont’d)
Recall the full decision tree produced by C4.5. Table below shows Decision Rules associated with tree

Support of a decision rule is the proportion of records in the training set that fall in the corresponding leaf node
Confidence is the percentage of records in the leaf node for which the decision rule is true
Although the confidence levels reported here are all 100%, such results are not typical of real-world examples

Antecedent and Consequent                                        Support   Confidence
If assets = low then bad credit risk.                            2/8       1.00
If assets = high then good credit risk.                          2/8       1.00
If assets = medium and savings = low then good credit risk.      1/8       1.00
If assets = medium and savings = medium then good credit risk.   2/8       1.00
If assets = medium and savings = high then bad credit risk.      1/8       1.00
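The support and confidence figures above can be checked against the training data with a short sketch (assuming the records list from the earlier sketch; the rule encoding here is an illustrative choice):

```python
# Each rule: (antecedent conditions, predicted class).
rules = [
    ({"assets": "Low"},                         "Bad"),
    ({"assets": "High"},                        "Good"),
    ({"assets": "Medium", "savings": "Low"},    "Good"),
    ({"assets": "Medium", "savings": "Medium"}, "Good"),
    ({"assets": "Medium", "savings": "High"},   "Bad"),
]
for conditions, predicted in rules:
    covered = [r for r in records if all(r[attr] == val for attr, val in conditions.items())]
    support = len(covered) / len(records)            # fraction of all training records in the leaf
    confidence = sum(r["risk"] == predicted for r in covered) / len(covered)
    print(conditions, predicted, f"support={support:.3f}", f"confidence={confidence:.2f}")
```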
