Data Mining (EECS 4412)
Decision Rule Learning
Parke Godfrey
EECS
Lassonde School of Engineering, York University
Thanks to
Professor Aijun An
for curation & use of these slides.
Outline
What are decision rules?
How to learn decision rules?
Sequential covering algorithm
Classification with rules
What Are Decision Rules?
If-then rules that can be used for classification (also called classification rules)
If condition1 and condition2 and …, then class
Types of if-then rules
Propositional rules: rules without variables
Relate an example’s class to its attribute values
Example:
If (Outlook = “sunny”) and (temperature < 30), then (PlayTennis = yes)
If (Outlook = “overcast”) and (wind ≠ “strong”), then (PlayTennis = yes)
First-order rules: rules containing variables
Example:
If Parent(x, y), then Ancestor(x, y)
If Parent(x, z) ∧ Ancestor(z, y), then Ancestor(x, y)
Decision Rules vs Decision Trees
Decision rules are easier to understand: the most human-readable representation
Rules are more flexible than decision trees
There is no overlapping among branches in a decision tree
Branches in a decision tree share at least one attribute (the root test)
If-then rules can be used with expert systems
Learning Rules
Two general ways to learn rules from data
Rules can be derived from other representations (e.g., decision trees)
Rules can be learned directly. Here, we concentrate on the direct method.
Advantage of direct rule-learning algorithms: more flexible rules can be learned.
Some algorithms can learn sets of first-order rules which have much more representational power than the propositional rules that can be derived from decision trees.
Decision rule learning algorithms
CN2, AQ family, HYDRA, PRISM, ELEM2, FOIL, etc.
Learning Rules Directly
Most direct rule-learning algorithms use the sequential covering strategy.
A sequential covering algorithm learns a set of
conjunctive rules for each class in turn. A conjunctive rule for class ci has the form
(A1 rel1 v1) ∧ (A2 rel2 v2) ∧ ... ∧ (Ak relk vk) → (Class = ci),
where A1, A2, ..., Ak are attributes, v1, v2, ..., vk are attribute values, and rel1, rel2, ..., relk are relational operators (=, ≠, ≤, >, ∈, etc.). Each attribute-value pair is called a conjunct.
Examples:
(Outlook = “sunny”) ∧ (temperature < 30) → (PlayTennis = yes)
(Outlook = “overcast”) ∧ (wind ≠ “strong”) → (PlayTennis = yes)
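As a concrete illustration (not part of the original slides), a propositional conjunctive rule can be represented in Python as a list of attribute-value conjuncts plus a predicted class; the dictionary layout and the `covers` helper below are assumptions of this sketch, reused in later examples:

```python
# A propositional conjunctive rule: a list of (attribute, value) conjuncts and
# the class it predicts. Only the "=" operator is shown; other relational
# operators (<, >=, !=, ...) could be handled analogously.
rule = {
    "conjuncts": [("Outlook", "sunny"), ("Wind", "weak")],
    "class": "yes",                      # predicted value of PlayTennis
}

def covers(rule, example):
    """True if every conjunct in the rule's antecedent holds for the example."""
    return all(example.get(attr) == value for attr, value in rule["conjuncts"])

example = {"Outlook": "sunny", "Temperature": "hot",
           "Humidity": "high", "Wind": "weak", "class": "yes"}
print(covers(rule, example))             # True
```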
Sequential Covering Algorithm
To learn rules for class ci from training set S,
Training set is separated into
positive examples: examples that belong to ci
negative examples: examples that do not belong to ci
Idea: greedily (sequentially) find rules that apply to (cover) positive examples in the training data
Learn one rule
Remove positive examples covered by this rule from the training data
Repeat
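A minimal sketch of this covering loop, assuming the rule/example representation and the `covers` helper from the earlier snippet; `learn_one_rule` is the subroutine developed on the following slides and is passed in as a parameter here:

```python
def learn_rules_for_class(examples, target_class, learn_one_rule):
    """Sequential covering: learn one rule, remove the positive examples it
    covers, and repeat until no positive example of the class remains."""
    rules = []
    remaining = list(examples)
    while any(ex["class"] == target_class for ex in remaining):
        rule = learn_one_rule(remaining, target_class)   # see the later sketch
        if rule is None or not rule["conjuncts"]:        # no useful rule found
            break
        rules.append(rule)
        # keep the negative examples and the positives this rule does not cover
        remaining = [ex for ex in remaining
                     if ex["class"] != target_class or not covers(rule, ex)]
    return rules
```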
A Sequential Covering Example
Sequential Covering Algorithm
Learn a set of rules for class ci from training set S:
1. Learn one conjunctive rule (L1 ∧ L2 ∧ ... → ci), trying to cover as many positive examples and as few negative examples as possible.
2. Remove from S the positive examples covered by the rule.
3. If the stopping criterion is satisfied (no positive example remains in S, or some other criterion is met), stop and output the set of conjunctive rules for class ci; otherwise, go back to step 1.
How to Learn One Conjunctive Rule
General-to-Specific Search
Start with an empty condition (the most general condition) and add conjuncts (attribute-value pairs)
Specific-to-General Search
Start with the most specific condition (such as an example) and drop conjuncts (attribute-value pairs)
Bi-directional search
General to Specific Search
(Specialization search tree: each level adds one more conjunct to the antecedent.)
IF {} THEN PlayTennis = Yes
IF {Wind = Weak} THEN PlayTennis = Yes
IF {Wind = Strong} THEN PlayTennis = Yes
IF {Humidity = Normal} THEN PlayTennis = Yes
IF {Humidity = High} THEN PlayTennis = Yes
...
IF {Humidity = Normal, Outlook = Rain} THEN PlayTennis = Yes
IF {Humidity = Normal, Wind = Weak} THEN PlayTennis = Yes
IF {Humidity = Normal, Wind = Strong} THEN PlayTennis = Yes
IF {Humidity = Normal, Outlook = Sunny} THEN PlayTennis = Yes
Learning One Conjunctive Rule with General-to-Specific Search
1. Initialize rule R to the most general hypothesis: True → ci.
2. Specialization:
• Select an attribute-value pair L (such as A = ai) based on an evaluation function (search heuristic).
• Add L to the antecedent of R as a conjunct (R is specialized).
3. If R covers only positive examples, or no attribute-value pair can be added, stop and output the conjunctive rule R; otherwise, repeat step 2.
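One possible rendering of this flowchart in Python, under the same assumed representation as the earlier sketches; `evaluate` is an evaluation function such as training accuracy p/t (examples of such functions follow on the next slides):

```python
def learn_one_rule(examples, target_class, evaluate, attributes):
    """General-to-specific search: start from the empty antecedent and greedily
    add the best attribute-value pair until the rule covers only positive
    examples or no attribute-value pair can be added."""
    rule = {"conjuncts": [], "class": target_class}
    covered = list(examples)              # examples covered by the rule so far
    while any(ex["class"] != target_class for ex in covered):
        candidates = [(a, v) for a in attributes
                      for v in {ex[a] for ex in covered}
                      if (a, v) not in rule["conjuncts"]]
        if not candidates:
            break
        best = max(candidates, key=lambda c:
                   evaluate(rule["conjuncts"] + [c], covered, target_class))
        rule["conjuncts"].append(best)
        covered = [ex for ex in covered if covers(rule, ex)]
    return rule
```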
Evaluation Functions for Selecting an Attribute-value Pair
Training accuracy of the rule (AQ15): accuracy of the rule on the training examples,
p/t,
where t is the total number of training examples that the new rule covers, and p is the number of these that are positive (i.e., belong to the class in question).
Information gain (CN2, PRISM):
log2(p/t) − log2(P/T),
where p and t are the same as above, and P and T are the corresponding numbers of examples that satisfied the rule before the new attribute-value pair was added. This is equivalent to using p/t, since P and T are the same for every candidate attribute-value pair.
Evaluation Functions for Selecting an Attribute-value Pair (Cont’d)
Training accuracy itself is not a reliable indicator of the predictive accuracy on future test examples
Rules that have a high training accuracy may be too specific and overfit the data.
Coverage of the rule should be considered in the evaluation function.
Information gain with coverage (used in FOIL):
p × [log2(p/t) − log2(P/T)]
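These two measures could be written as interchangeable evaluation functions for the learn-one-rule sketch above; note that p and t are counted over the candidate (specialized) rule, while P and T come from the examples already covered by the current rule. The function names are assumptions of this sketch:

```python
import math

def _counts(conjuncts, examples, target_class):
    covered = [ex for ex in examples
               if all(ex.get(a) == v for a, v in conjuncts)]
    p = sum(1 for ex in covered if ex["class"] == target_class)
    return p, len(covered)

def accuracy(conjuncts, examples, target_class):
    """Training accuracy p/t of the candidate rule (AQ15-style)."""
    p, t = _counts(conjuncts, examples, target_class)
    return p / t if t else 0.0

def foil_gain(conjuncts, examples, target_class):
    """Information gain with coverage: p * (log2(p/t) - log2(P/T)), where P and
    T are the counts before the new attribute-value pair is added (here, over
    the examples covered by the current rule)."""
    p, t = _counts(conjuncts, examples, target_class)
    P = sum(1 for ex in examples if ex["class"] == target_class)
    T = len(examples)
    if p == 0 or t == 0 or P == 0 or T == 0:
        return 0.0
    return p * (math.log2(p / t) - math.log2(P / T))
```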
Contact lens data
age | spectacle prescription | astigmatism | tear production rate | recommended lenses
young | myope | no | reduced | none
young | myope | no | normal | soft
young | myope | yes | reduced | none
young | myope | yes | normal | hard
young | hypermetrope | no | reduced | none
young | hypermetrope | no | normal | soft
young | hypermetrope | yes | reduced | none
young | hypermetrope | yes | normal | hard
middle-aged | myope | no | reduced | none
middle-aged | myope | no | normal | soft
middle-aged | myope | yes | reduced | none
middle-aged | myope | yes | normal | hard
middle-aged | hypermetrope | no | reduced | none
middle-aged | hypermetrope | no | normal | soft
middle-aged | hypermetrope | yes | reduced | none
middle-aged | hypermetrope | yes | normal | none
old | myope | no | reduced | none
old | myope | no | normal | none
old | myope | yes | reduced | none
old | myope | yes | normal | hard
old | hypermetrope | no | reduced | none
old | hypermetrope | no | normal | soft
old | hypermetrope | yes | reduced | none
old | hypermetrope | yes | normal | none
Learn the first rule from the data for class “hard”
Assume training accuracy is used as the attribute selection criterion
To begin, we seek a rule
If ? then recommendation = hard
For ?, we have 9 choices:
Attribute-value pair | p/t
age = young | 2/8
age = middle-aged | 1/8
age = old | 1/8
spectacle pres. = myope | 3/12
spectacle pres. = hypermetrope | 1/12
astigmatism = no | 0/12
astigmatism = yes | 4/12
tear prod. rate = reduced | 0/12
tear prod. rate = normal | 4/12
The highest fraction is 4/12 (a tie between astigmatism = yes and tear prod. rate = normal).
Arbitrarily choose one, or choose the first one:
If astigmatism = yes then recommendation = hard
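The fractions in this table can be reproduced directly from the contact lens data; a small self-contained check (plain Python, no library assumed):

```python
# Contact lens data: (age, spectacle prescription, astigmatism, tear rate, lenses)
data = [
    ("young", "myope", "no", "reduced", "none"), ("young", "myope", "no", "normal", "soft"),
    ("young", "myope", "yes", "reduced", "none"), ("young", "myope", "yes", "normal", "hard"),
    ("young", "hypermetrope", "no", "reduced", "none"), ("young", "hypermetrope", "no", "normal", "soft"),
    ("young", "hypermetrope", "yes", "reduced", "none"), ("young", "hypermetrope", "yes", "normal", "hard"),
    ("middle-aged", "myope", "no", "reduced", "none"), ("middle-aged", "myope", "no", "normal", "soft"),
    ("middle-aged", "myope", "yes", "reduced", "none"), ("middle-aged", "myope", "yes", "normal", "hard"),
    ("middle-aged", "hypermetrope", "no", "reduced", "none"), ("middle-aged", "hypermetrope", "no", "normal", "soft"),
    ("middle-aged", "hypermetrope", "yes", "reduced", "none"), ("middle-aged", "hypermetrope", "yes", "normal", "none"),
    ("old", "myope", "no", "reduced", "none"), ("old", "myope", "no", "normal", "none"),
    ("old", "myope", "yes", "reduced", "none"), ("old", "myope", "yes", "normal", "hard"),
    ("old", "hypermetrope", "no", "reduced", "none"), ("old", "hypermetrope", "no", "normal", "soft"),
    ("old", "hypermetrope", "yes", "reduced", "none"), ("old", "hypermetrope", "yes", "normal", "none"),
]
attrs = ["age", "spectacle pres.", "astigmatism", "tear prod. rate"]

for col, name in enumerate(attrs):
    for value in dict.fromkeys(row[col] for row in data):    # values in data order
        t = sum(1 for row in data if row[col] == value)      # examples covered
        p = sum(1 for row in data if row[col] == value and row[4] == "hard")
        print(f"{name} = {value}: {p}/{t}")
# astigmatism = yes and tear prod. rate = normal both score 4/12 (the maximum)
```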
Learn the first rule from the data for class “hard” (Cont’d)
The rule
If astigmatism = yes then recommendation = hard
is not very accurate: it covers 12 examples, only 4 of which are positive.
We need to refine it:
If astigmatism = yes and ? then recommendation = hard
Examples covered by the rule (astigmatism = yes):
age | spectacle prescription | astigmatism | tear production rate | recommended lenses
young | myope | yes | reduced | none
young | myope | yes | normal | hard
young | hypermetrope | yes | reduced | none
young | hypermetrope | yes | normal | hard
middle-aged | myope | yes | reduced | none
middle-aged | myope | yes | normal | hard
middle-aged | hypermetrope | yes | reduced | none
middle-aged | hypermetrope | yes | normal | none
old | myope | yes | reduced | none
old | myope | yes | normal | hard
old | hypermetrope | yes | reduced | none
old | hypermetrope | yes | normal | none
Learn the first rule from the data for class “hard” (Cont’d)
For ? in:
If astigmatism = yes and ? then recommendation = hard
Consider 7 choices:
Attribute-value pair | p/t
age = young | 2/4
age = middle-aged | 1/4
age = old | 1/4
spectacle pres. = myope | 3/6
spectacle pres. = hypermetrope | 1/6
tear prod. rate = reduced | 0/6
tear prod. rate = normal | 4/6
The last one (tear prod. rate = normal, 4/6) is the winner.
The rule is refined as
If astigmatism = yes and tear production rate = normal then recommendation = hard
Learn the first rule from the data for class “hard” (Cont’d)
The rule
If astigmatism = yes and tear production rate = normal then recommendation = hard
is still not very accurate
Refine it further:
If astigmatism = yes and tear production rate = normal and ?
then recommendation = hard
Examples covered by the rule (astigmatism = yes and tear production rate = normal):
age | spectacle prescription | astigmatism | tear production rate | recommended lenses
young | myope | yes | normal | hard
young | hypermetrope | yes | normal | hard
middle-aged | myope | yes | normal | hard
middle-aged | hypermetrope | yes | normal | none
old | myope | yes | normal | hard
old | hypermetrope | yes | normal | none
Learn the first rule from the data for class “hard” (Cont’d)
For ? in:
If astigmatism = yes and tear production rate = normal and ?
then recommendation = hard
Consider 5 choices:
Attribute-value pair | p/t
age = young | 2/2
age = middle-aged | 1/2
age = old | 1/2
spectacle pres. = myope | 3/3
spectacle pres. = hypermetrope | 1/3
Choose
spectacle pres. = myope
because it has the highest ratio and its coverage is better than age = young.
The rule is refined to:
If astigmatism = yes and tear production rate = normal and spectacle pres. = myope
then recommendation = hard
Contact lens data (the same 24 examples as before), with coverage by the first rule marked:
Covered by the rule (astigmatism = yes and tear production rate = normal and spectacle pres. = myope): the three “hard” examples with spectacle pres. = myope (young, middle-aged, old).
Not covered positive example: young | hypermetrope | yes | normal | hard.
Learn the second rule from the data for class “hard”
The rule
If astigmatism = yes and tear production rate = normal and spectacle pres. = myope then recommendation = hard
is accurate, but covers only 3 positive examples
We need to generate other rules to cover the rest of the positive examples:
Remove the 3 covered examples from the training data.
Look for another rule using the same process.
Start with a rule of the form:
If ? then recommendation = hard
Using the same process, find the following rule:
If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
This rule covers 2 of the original positive examples (the first two “hard” examples), one of which has already been covered by the first rule; that is fine.
Remove the example covered by the second rule from the training data. No positive example is left, so the process for learning rules for the “hard” class stops.
Contact lens data (the same 24 examples as before), with coverage by the two rules marked:
Covered by the 1st rule: the three “hard” examples with spectacle pres. = myope (young, middle-aged, old).
Covered by the 2nd rule: the two “hard” examples with age = young (myope and hypermetrope).
(The young | myope | yes | normal | hard example is covered by both rules.)
Overfitting Problem
The learn-one-rule algorithm on the previous slides tries to learn as accurate a rule as possible.
Such a rule may fit noise:
be very complicated (containing many conjuncts)
exhibit low predictive accuracy on unseen examples
Solution: approximate but simpler rules
Pre-pruning
Post-pruning
Prepruning
Stop the refinement of rules even though they are not yet fully accurate
Most commonly used stopping criteria:
Minimum Purity Criterion
Requires that a certain percentage of the examples covered by the rule be positive (i.e., a minimum training accuracy); see the sketch after this list
Difference Testing
Tests the difference between the distribution of positive and negative examples covered by a rule and the overall distribution of positive and negative examples, or
the difference between the distribution of instances covered by a rule and that of its direct predecessor
Admits a new attribute-value pair only when the difference is above a user-set threshold (cutoff)
Not as successful as post-pruning
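For illustration only, a minimum-purity stopping test might look like the sketch below; the threshold value is an assumed user parameter, not one given in the slides:

```python
def purity_stop(p, t, min_purity=0.8):
    """Pre-pruning: stop refining the rule once the fraction of covered
    examples that are positive (the training accuracy p/t) reaches the
    user-set minimum purity threshold."""
    return t > 0 and p / t >= min_purity
```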
Postpruning
General idea
Learn a “pure” rule first
Remove some attribute-value pairs from the rule to make it more general
Test the effect of the removal on the quality of the rule on a validation/pruning set
Method 1
Separate the training data into a growing set and a validation set
Learn rules from the growing set
Test the accuracy of the pruned rule on a validation set
Postpruning (Cont’d)
Method 2
Use a rule quality measure, test the quality of the pruned rule on the same training set from which the rule was learned.
For example, ELEM2 uses the rule quality formula
Q(R) = log [ P(R|C) (1 − P(R|¬C)) ] / [ P(R|¬C) (1 − P(R|C)) ]
where P(R|C) is the probability that an example is covered by rule R given that the example belongs to class C, and P(R|¬C) is the corresponding probability given that the example does not belong to C.
Procedure of postpruning with rule quality measure:
Compute a rule quality value: Q(R)
Check each attribute-value pair L in R, in the reverse order of their generation, to see if removal of L decreases Q(R)
If not, L is removed (i.e., R is generalized), and the process is repeated
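A hedged sketch of this post-pruning step, with the ELEM2-style quality measure written out from the formula above; the count-based probability estimates and the helper names are assumptions of this sketch:

```python
import math

def rule_quality(p, t, P, N):
    """Q(R) = log[ P(R|C)(1 - P(R|~C)) / ( P(R|~C)(1 - P(R|C)) ) ], estimated
    from counts: the rule covers t examples, p of them positive, out of
    P positive and N negative training examples."""
    pr_c = p / P if P else 0.0              # estimate of P(R|C)
    pr_notc = (t - p) / N if N else 0.0     # estimate of P(R|~C)
    eps = 1e-9                              # avoid log(0) / division by zero
    return math.log((pr_c * (1 - pr_notc) + eps) / (pr_notc * (1 - pr_c) + eps))

def post_prune(rule, examples, target_class):
    """Drop conjuncts (most recently added first) whenever removing one does
    not decrease Q(R)."""
    def q(conjuncts):
        covered = [ex for ex in examples
                   if all(ex.get(a) == v for a, v in conjuncts)]
        p = sum(1 for ex in covered if ex["class"] == target_class)
        P = sum(1 for ex in examples if ex["class"] == target_class)
        return rule_quality(p, len(covered), P, len(examples) - P)

    best_q = q(rule["conjuncts"])
    for conjunct in list(reversed(rule["conjuncts"])):
        pruned = [c for c in rule["conjuncts"] if c != conjunct]
        new_q = q(pruned)
        if new_q >= best_q:                 # removal does not decrease Q(R)
            rule["conjuncts"] = pruned
            best_q = new_q
    return rule
```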
Learning One Conjunctive Rule with General-to-Specific Search and Postpruning
1. Initialize R to the most general hypothesis: True → ci.
2. Specialization:
• Select an attribute-value pair L (such as A = ai) based on an evaluation function (search heuristic).
• Add L to the antecedent of R as a conjunct (R is specialized).
3. Repeat step 2 until R covers only positive examples or no attribute-value pair remains.
4. Post-prune R (generalization, to avoid over-fitting the data):
• Compute a rule quality value: Q(R).
• Check each pair L in R to see if removal of L decreases Q(R).
• If not, L is removed (i.e., R is generalized).
5. Output the conjunctive rule R.
Learn a set of conjunctive rules for a class
Given a training set S and a class C:
1. Initialize R to the most general hypothesis.
2. Specialize R by searching for an attribute-value pair L to add, until R covers only positive examples or no attribute-value pair remains.
3. Post-prune R.
4. Remove from S the examples covered by R.
5. If S contains no positive example, or some other stopping criterion is true, stop and output the disjunctive set of conjunctive rules for class C; otherwise, go back to step 1.
Classification of New Example with Sets of Rules
Three situations are possible when matching a new example with a set of rules:
Single-match
Only one rule is matched with the example. Classify the example into the class indicated by the rule.
Classification of New Example with Sets of Rules
Multiple-match
Multiple rules are matched with the example
Two situations:
The matched rules indicate the same class. Classify the example into that class.
The matched rules indicate different classes.
Method 1: Rank the rules according to a criterion, use the first matched rule to classify the example
Method 2: Compute a decision score for each of the involved classes:
DS(C) = Σi Q(ri)
where Q(ri) is a quality measure of rule ri, summed over the rules that belong to C and match the example.
Choose the class with the highest decision score.
Classification of New Example with Sets of Rules (Cont’d)
No-match
Method 1:
Use a default rule (the majority class) to classify the example
Method 2:
Partial matching is performed.
Calculate a matching score for each partially matched rule:
PMS(ri) = (number of matched attribute-value pairs / number of attribute-value pairs in ri) × Q(ri)
For each class involved, compute a decision score by summing up the partial matching scores of the partially matched rules within that class:
DS(C) = Σi PMS(ri)
Choose the class with the highest decision score.
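A sketch of how the three matching situations could be handled in code, assuming the rule representation from the earlier snippets plus a stored quality value Q for each rule (e.g., the one computed during post-pruning):

```python
def classify(example, rules, default_class):
    """Each rule is a dict with keys "conjuncts", "class" and a quality "Q"."""
    matched = [r for r in rules if covers(r, example)]
    if matched:                                   # single- or multiple-match
        scores = {}
        for r in matched:                         # DS(C) = sum of Q(r_i)
            scores[r["class"]] = scores.get(r["class"], 0.0) + r["Q"]
        return max(scores, key=scores.get)
    # No-match, Method 2: partial matching (Method 1 would return default_class)
    scores = {}
    for r in rules:
        n_matched = sum(1 for a, v in r["conjuncts"] if example.get(a) == v)
        if n_matched == 0 or not r["conjuncts"]:
            continue
        pms = n_matched / len(r["conjuncts"]) * r["Q"]   # PMS(r_i)
        scores[r["class"]] = scores.get(r["class"], 0.0) + pms
    return max(scores, key=scores.get) if scores else default_class
```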