
Data Mining (EECS 4412)
Decision Rule Learning
Parke Godfrey
EECS
Lassonde School of Engineering, York University

Thanks to
Professor Aijun An
for curation & use of these slides.

Outline
What are decision rules?
How to learn decision rules?
Sequential covering algorithm
Classification with rules

What Are Decision Rules?
If-then rules that can be used for classification (also
called classification rules)
If condition1 and condition2 and …, then class
Types of if-then rules
Propositional rules: rules without variables
Relate an example’s class to its attribute values
Example:
If (Outlook = "sunny") and (temperature < 30) then (PlayTennis = yes)
If (Outlook = "overcast") and (wind ≠ "strong") then (PlayTennis = yes)
First-order rules: rules containing variables
Example:
If Parent(x, y), then Ancestor(x, y)
If Parent(x, z) ∧ Ancestor(z, y), then Ancestor(x, y)

Decision Rules vs Decision Trees
Decision rules are easier to understand: they are the most human-readable representation.
Rules are more flexible than decision trees: in a decision tree there is no overlapping among branches, and all branches share at least one attribute (the attribute at the root); rules have neither restriction.
If-then rules can be used with expert systems.

Learning Rules
Two general ways to learn rules from data:
Rules can be derived from other representations (e.g., decision trees).
Rules can be learned directly. Here, we concentrate on the direct method.
Advantage of direct rule-learning algorithms: more flexible rules can be learned. Some algorithms can learn sets of first-order rules, which have much more representational power than the propositional rules that can be derived from decision trees.
Decision rule learning algorithms: CN2, the AQ family, HYDRA, PRISM, ELEM2, FOIL, etc.

Learning Rules Directly
Most direct rule learning algorithms use the sequential covering strategy.
A sequential covering algorithm learns a set of conjunctive rules for each class in turn.
A conjunctive rule for class ci has the form
(A1 rel1 v1) ∧ (A2 rel2 v2) ∧ ... ∧ (Ak relk vk) → (Class = ci),
where A1, A2, ..., Ak are attributes, v1, v2, ..., vk are attribute values, and rel1, rel2, ..., relk are relational operators (=, ≠, ≤, >, ∈, etc.). Each attribute-value pair is called a conjunct.
Examples:
(Outlook = "sunny") ∧ (temperature < 30) → (PlayTennis = yes)
(Outlook = "overcast") ∧ (wind ≠ "strong") → (PlayTennis = yes)
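To make the rule format concrete, here is a minimal Python sketch (not from the slides; the names Rule, OPS and covers are illustrative) of a propositional conjunctive rule stored as a list of (attribute, operator, value) conjuncts, with a test for whether it covers an example.

```python
# Illustrative sketch: a propositional conjunctive rule is a list of
# (attribute, operator, value) conjuncts plus the class in the consequent.
OPS = {
    "=":  lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    "<":  lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
    ">":  lambda a, b: a > b,
}

class Rule:
    def __init__(self, conjuncts, cls):
        self.conjuncts = conjuncts   # e.g. [("Outlook", "=", "overcast"), ("wind", "!=", "strong")]
        self.cls = cls               # e.g. "yes"

    def covers(self, example):
        """True iff the example (a dict of attribute -> value) satisfies every conjunct."""
        return all(OPS[op](example[att], val) for att, op, val in self.conjuncts)

# Usage: the second PlayTennis rule above.
r = Rule([("Outlook", "=", "overcast"), ("wind", "!=", "strong")], cls="yes")
print(r.covers({"Outlook": "overcast", "temperature": 25, "wind": "weak"}))  # True
print(r.covers({"Outlook": "sunny",    "temperature": 25, "wind": "weak"}))  # False
```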
Sequential Covering Algorithm
To learn rules for class ci from training set S, the training set is separated into
positive examples: examples that belong to ci, and
negative examples: examples that do not belong to ci.
Idea: greedily (sequentially) find rules that apply to (cover) the positive examples in the training data:
Learn one rule.
Remove the positive examples covered by this rule from the training data.
Repeat.

A Sequential Covering Example
(figure)

Sequential Covering Algorithm (flowchart)
Learn a set of rules for class ci from training set S:
1. Learn one conjunctive rule (L1 ∧ L2 ∧ ... → ci), trying to cover as many positive examples and as few negative examples as possible.
2. Remove from S the positive examples covered by the rule.
3. If the stopping criterion is satisfied (no positive example remains in S, or some other criterion holds), output the set of conjunctive rules for class ci; otherwise go back to step 1.
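The covering loop itself is short. Below is a minimal Python sketch of it (illustrative, not the slides' code; learn_one_rule stands in for the rule-growing search described next, and covers is as in the earlier sketch).

```python
def sequential_covering(examples, target_class, learn_one_rule):
    """Sketch of the covering loop: learn rules for target_class until no
    positive example remains (or no further rule can be found)."""
    rules = []
    remaining = list(examples)            # each example: (attribute_dict, class_label)
    while any(cls == target_class for _, cls in remaining):
        rule = learn_one_rule(remaining, target_class)   # grows one conjunctive rule
        if rule is None:                                  # no acceptable rule found
            break
        rules.append(rule)
        # Remove only the *positive* examples the new rule covers.
        remaining = [(x, cls) for x, cls in remaining
                     if not (cls == target_class and rule.covers(x))]
    return rules
```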
How to Learn One Conjunctive Rule
General-to-specific search: start with the empty condition (the most general condition) and add conjuncts (attribute-value pairs).
Specific-to-general search: start with the most specific condition (such as an example) and drop conjuncts (attribute-value pairs).
Bi-directional search.

General to Specific Search
(each step specializes a rule by adding one conjunct; the second level shows specializations of IF {Humidity = Normal})
IF {} THEN Play-Tennis = Yes
   IF {Wind = Weak} THEN Play-Tennis = Yes
   IF {Wind = Strong} THEN Play-Tennis = Yes
   IF {Humidity = Normal} THEN Play-Tennis = Yes
   IF {Humidity = High} THEN Play-Tennis = Yes
   ...
      IF {Humidity = Normal, Outlook = Rain} THEN Play-Tennis = Yes
      IF {Humidity = Normal, Wind = Weak} THEN Play-Tennis = Yes
      IF {Humidity = Normal, Wind = Strong} THEN Play-Tennis = Yes
      IF {Humidity = Normal, Outlook = Sunny} THEN Play-Tennis = Yes

Learning One Conjunctive Rule with General-to-Specific Search (flowchart)
1. Initialize rule R to the most general hypothesis: True → ci.
2. Specialization: select an attribute-value pair L (such as A = ai) based on an evaluation function (search heuristic), and add L to the antecedent of R as a conjunct (R is specialized).
3. If R covers only positive examples, or no attribute-value pair can be added, output the conjunctive rule R; otherwise repeat step 2.

Evaluation Functions for Selecting an Attribute-Value Pair
Training accuracy of the rule (AQ15):
p/t, the accuracy of the rule on the training examples, where t is the total number of training examples that the new rule covers, and p is the number of these that are positive (i.e., belong to the class in question).
Information gain (CN2, PRISM):
log2(p/t) − log2(P/T), where p and t are as above, and P and T are the corresponding numbers of examples that satisfied the rule before the new attribute-value pair was added. Since P and T are fixed while the candidate pairs are compared, this is equivalent to using p/t.

Evaluation Functions for Selecting an Attribute-Value Pair (Cont'd)
Training accuracy by itself is not a reliable indicator of the predictive accuracy on future test examples: rules that have a high training accuracy may be too specific and overfit the data.
The coverage of the rule should also be considered in the evaluation function.
Information gain with coverage (used in FOIL):
p × [log2(p/t) − log2(P/T)]
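As an aside (not on the slides), all three heuristics are easy to compute from the four counts p, t, P and T. The sketch below uses the counts that arise in the contact-lens example on the following slides when "tear production rate = normal" is added to the rule "astigmatism = yes".

```python
from math import log2

def accuracy(p, t):
    """Training accuracy of the specialized rule: fraction of covered examples that are positive."""
    return p / t

def info_gain(p, t, P, T):
    """Information gain of adding the candidate conjunct: log2(p/t) - log2(P/T)."""
    return log2(p / t) - log2(P / T)

def foil_gain(p, t, P, T):
    """FOIL's gain: information gain weighted by the number of positives still covered."""
    return p * (log2(p / t) - log2(P / T))

# Refining "astigmatism = yes" (P = 4 of T = 12 covered examples are positive)
# with the candidate conjunct "tear production rate = normal" (p = 4 of t = 6):
print(accuracy(4, 6))           # 0.666...
print(info_gain(4, 6, 4, 12))   # 1.0  (the rule became twice as pure)
print(foil_gain(4, 6, 4, 12))   # 4.0
```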
Contact Lens Data

age          spectacle prescription  astigmatism  tear production rate  recommended lenses
young        myope                   no           reduced               none
young        myope                   no           normal                soft
young        myope                   yes          reduced               none
young        myope                   yes          normal                hard
young        hypermetrope            no           reduced               none
young        hypermetrope            no           normal                soft
young        hypermetrope            yes          reduced               none
young        hypermetrope            yes          normal                hard
middle-aged  myope                   no           reduced               none
middle-aged  myope                   no           normal                soft
middle-aged  myope                   yes          reduced               none
middle-aged  myope                   yes          normal                hard
middle-aged  hypermetrope            no           reduced               none
middle-aged  hypermetrope            no           normal                soft
middle-aged  hypermetrope            yes          reduced               none
middle-aged  hypermetrope            yes          normal                none
old          myope                   no           reduced               none
old          myope                   no           normal                none
old          myope                   yes          reduced               none
old          myope                   yes          normal                hard
old          hypermetrope            no           reduced               none
old          hypermetrope            no           normal                soft
old          hypermetrope            yes          reduced               none
old          hypermetrope            yes          normal                none

Learn the first rule from the data for class "hard"
Assume training accuracy is used as the attribute selection criterion.
To begin, we seek a rule: If ? then recommendation = hard
For ?, we have 9 choices:

attribute-value pair              p/t
age = young                       2/8
age = middle-aged                 1/8
age = old                         1/8
spectacle pres. = myope           3/12
spectacle pres. = hypermetrope    1/12
astigmatism = no                  0/12
astigmatism = yes                 4/12
tear prod. rate = reduced         0/12
tear prod. rate = normal          4/12

The highest fraction is 4/12, achieved by both astigmatism = yes and tear prod. rate = normal. Arbitrarily choose one, or choose the first one:
If astigmatism = yes then recommendation = hard

Learn the first rule from the data for class "hard" (Cont'd)
The rule "If astigmatism = yes then recommendation = hard" is not very accurate: it covers 12 examples, only 4 of which are positive.
We need to refine it: If astigmatism = yes and ? then recommendation = hard
(table: the 12 examples with astigmatism = yes, i.e., the examples the rule covers)

Learn the first rule from the data for class "hard" (Cont'd)
For ? in "If astigmatism = yes and ? then recommendation = hard", consider 7 choices:

attribute-value pair              p/t
age = young                       2/4
age = middle-aged                 1/4
age = old                         1/4
spectacle pres. = myope           3/6
spectacle pres. = hypermetrope    1/6
tear prod. rate = reduced         0/6
tear prod. rate = normal          4/6

The last one is the winner, so the rule is refined to:
If astigmatism = yes and tear production rate = normal then recommendation = hard

Learn the first rule from the data for class "hard" (Cont'd)
The rule "If astigmatism = yes and tear production rate = normal then recommendation = hard" is still not very accurate.
Refine it further: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
(table: the 6 examples with astigmatism = yes and tear production rate = normal)
Consider 5 choices:

attribute-value pair              p/t
age = young                       2/2
age = middle-aged                 1/2
age = old                         1/2
spectacle pres. = myope           3/3
spectacle pres. = hypermetrope    1/3

Choose spectacle pres. = myope because it has the highest ratio and its coverage is better than age = young. The rule is refined to:
If astigmatism = yes and tear production rate = normal and spectacle pres. = myope then recommendation = hard

Contact Lens Data (coverage of the first rule)
(figure: the 24-example table again, with the positive examples for "hard" highlighted; the first rule covers three of them, and the positive example young / hypermetrope / yes / normal is not covered)

Learn the second rule from the data for class "hard"
The rule "If astigmatism = yes and tear production rate = normal and spectacle pres. = myope then recommendation = hard" is accurate, but it covers only 3 positive examples.
We need to generate other rules to cover the rest of the positive examples:
Remove the 3 covered examples from the training data.
Look for another rule using the same process, starting again with a rule of the form: If ? then recommendation = hard
Using the same process, we find the following rule:
If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
This rule covers 2 of the original positive examples (the first two positive examples in the table), one of which was already covered by the first rule; that is ok.
Remove the example covered by the second rule from the training data. No positive example is left, so the process for learning rules for the "hard" class stops.
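The whole worked example can be rerun in a few lines of code. The sketch below is illustrative, not the slides' program: it encodes the 24 examples, grows one rule at a time with the p/t heuristic, breaks ties by coverage as on the previous slide, and removes covered positives between rules. Under these assumptions it should reproduce the two rules for "hard" derived above.

```python
# Illustrative sketch: greedy sequential covering on the contact-lens data.
ATTRS = ["age", "spectacle", "astigmatism", "tear"]

RAW = """young myope no reduced none
young myope no normal soft
young myope yes reduced none
young myope yes normal hard
young hypermetrope no reduced none
young hypermetrope no normal soft
young hypermetrope yes reduced none
young hypermetrope yes normal hard
middle-aged myope no reduced none
middle-aged myope no normal soft
middle-aged myope yes reduced none
middle-aged myope yes normal hard
middle-aged hypermetrope no reduced none
middle-aged hypermetrope no normal soft
middle-aged hypermetrope yes reduced none
middle-aged hypermetrope yes normal none
old myope no reduced none
old myope no normal none
old myope yes reduced none
old myope yes normal hard
old hypermetrope no reduced none
old hypermetrope no normal soft
old hypermetrope yes reduced none
old hypermetrope yes normal none"""

DATA = []                                   # each example: (attribute dict, class label)
for line in RAW.splitlines():
    *vals, lens = line.split()
    DATA.append((dict(zip(ATTRS, vals)), lens))
VALUES = {a: sorted({x[a] for x, _ in DATA}) for a in ATTRS}

def learn_one_rule(examples, target):
    """Greedy general-to-specific search: add the conjunct with the best p/t,
    breaking ties by higher coverage p, until the rule covers only positives."""
    conjuncts, covered = [], examples
    while any(cls != target for _, cls in covered):
        best = None
        for att in ATTRS:
            if any(a == att for a, _ in conjuncts):
                continue                              # use each attribute at most once
            for val in VALUES[att]:
                cov = [(x, c) for x, c in covered if x[att] == val]
                p = sum(1 for _, c in cov if c == target)
                if p == 0:
                    continue
                key = (p / len(cov), p)               # (training accuracy, coverage)
                if best is None or key > best[0]:
                    best = (key, (att, val), cov)
        if best is None:                              # no attribute-value pair can be added
            break
        conjuncts.append(best[1])
        covered = best[2]
    return conjuncts

def learn_rules(examples, target):
    """The covering loop: learn a rule, drop the positives it covers, repeat."""
    rules, remaining = [], list(examples)
    while any(c == target for _, c in remaining):
        rule = learn_one_rule(remaining, target)
        if not rule:
            break
        rules.append(rule)
        remaining = [(x, c) for x, c in remaining
                     if not (c == target and all(x[a] == v for a, v in rule))]
    return rules

for rule in learn_rules(DATA, "hard"):
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in rule) + " THEN hard")
```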
Contact Lens Data (coverage of the two rules)
(figure: the table again, now marking the three examples covered by the first rule and the two covered by the second rule; together the two rules cover all four positive examples for "hard")

Overfitting Problem
The learn-one-rule algorithm on the previous slides tries to learn a rule that is as accurate as possible. Such a rule may fit noise, be very complicated (contain many conjuncts), and exhibit low predictive accuracy on unseen examples.
Solution: approximate, but simpler, rules, obtained by
Pre-pruning
Post-pruning

Prepruning
Stop the refinement of rules even though they are not yet accurate.
Most commonly used stopping criteria:
Minimum purity criterion: requires that a certain percentage of the examples covered by the rule be positive (i.e., a minimum training accuracy).
Difference testing: tests the difference between the distribution of positive and negative examples covered by a rule and the overall distribution of positive and negative examples, or the difference between the distribution of instances covered by a rule and that covered by its direct predecessor; a new attribute-value pair is admitted only when the difference is above a user-set threshold (cutoff).
Pre-pruning is not as successful as post-pruning.

Postpruning
General idea:
Learn a "pure" rule first.
Remove some attribute-value pairs from the rule to make it more general.
Test the effect of the removal on the quality of the rule, on a validation/pruning set.
Method 1:
Separate the training data into a growing set and a validation set.
Learn rules from the growing set.
Test the accuracy of the pruned rule on the validation set.

Postpruning (Cont'd)
Method 2:
Use a rule quality measure and test the quality of the pruned rule on the same training set from which the rule was learned.
For example, ELEM2 uses the rule quality formula
Q(R) = log [ P(R|C) (1 − P(R|¬C)) ] / [ P(R|¬C) (1 − P(R|C)) ],
where P(R|C) is the probability that an example is covered by rule R given that the example belongs to C, and P(R|¬C) is the probability that an example is covered by R given that it does not belong to C.
Procedure of postpruning with a rule quality measure:
Compute the rule quality value Q(R).
Check each attribute-value pair L in R, in the reverse order of their generation, to see whether removing L decreases Q(R).
If not, remove L (i.e., R is generalized) and repeat the process.

Learning One Conjunctive Rule with General-to-Specific Search and Postpruning (flowchart)
1. Initialize R to the most general hypothesis: True → ci.
2. Specialization: select an attribute-value pair L (such as A = ai) based on an evaluation function (search heuristic), and add L to the antecedent of R as a conjunct (R is specialized).
3. Repeat step 2 until R covers only positive examples or no attribute-value pair remains.
4. Post-prune R (generalization, to avoid over-fitting the data): compute the rule quality value Q(R); check each pair L in R to see whether removing L decreases Q(R); if not, remove L (R is generalized).
5. Output the conjunctive rule R.
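Here is a minimal sketch of post-pruning with a rule quality measure (illustrative; quality stands in for Q(R), e.g. the ELEM2 formula above, and a rule is a list of (attribute, value) conjuncts as in the earlier sketches).

```python
def post_prune(rule, quality):
    """Drop conjuncts, most recently added first, whenever dropping one
    does not decrease the rule's quality Q(R)."""
    pruned = list(rule)
    changed = True
    while changed:
        changed = False
        # Check conjuncts in the reverse order of their generation.
        for i in reversed(range(len(pruned))):
            candidate = pruned[:i] + pruned[i + 1:]
            if candidate and quality(candidate) >= quality(pruned):
                pruned = candidate          # removal did not hurt: generalize R
                changed = True
                break                       # re-check the shorter rule from scratch
    return pruned

# Toy usage with a made-up quality function that only needs the first two conjuncts.
def toy_quality(conjuncts):
    needed = {("astigmatism", "yes"), ("tear", "normal")}
    return 1.0 if needed <= set(conjuncts) else 0.5

rule = [("astigmatism", "yes"), ("tear", "normal"), ("spectacle", "myope")]
print(post_prune(rule, toy_quality))   # [('astigmatism', 'yes'), ('tear', 'normal')]
```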
Learn a Set of Conjunctive Rules for a Class (flowchart)
Given a training set S and a class C:
1. Initialize R to the most general hypothesis.
2. Specialize R by searching for an attribute-value pair L, until R covers only positive examples or no attribute-value pair remains.
3. Post-prune R.
4. Remove from S the examples covered by R.
5. If S contains no positive example, or some other stopping criterion is true, output the disjunctive set of conjunctive rules learned for class C; otherwise go back to step 1.

Classification of a New Example with a Set of Rules
Three situations are possible when matching a new example against a set of rules: single-match, multiple-match, and no-match.

Single-match
Only one rule matches the example. Classify the example into the class indicated by that rule.

Multiple-match
Multiple rules match the example. Two situations:
The matched rules indicate the same class: classify the example into that class.
The matched rules indicate different classes:
Method 1: rank the rules according to some criterion and use the first matched rule to classify the example.
Method 2: compute a decision score for each of the classes involved,
DS(C) = Σi Q(ri),
where Q(ri) is a quality measure of rule ri and the sum is over the rules of class C that match the example; choose the class with the highest decision score.

No-match
Method 1: use a default rule (the majority class) to classify the example.
Method 2: perform partial matching. Calculate a matching score for each partially matched rule,
PMS(ri) = (number of matched attribute-value pairs in ri / number of attribute-value pairs in ri) × Q(ri),
and, for each class involved, compute a decision score by summing the partial matching scores of that class's partially matched rules:
DS(C) = Σi PMS(ri).
Choose the class with the highest decision score.
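To round things off, here is an illustrative sketch (names and qualities invented) of classifying a new example with a set of scored rules, handling the multiple-match and no-match cases via the decision scores and partial matching scores defined above.

```python
from collections import defaultdict

def classify(example, rules, default_class):
    """rules: list of (conjuncts, cls, quality), where conjuncts is a list of
    (attribute, value) pairs tested with '=' and quality plays the role of Q(r)."""
    full = [(conjs, cls, q) for conjs, cls, q in rules
            if all(example[a] == v for a, v in conjs)]
    if full:
        # Single-match or multiple-match: DS(C) = sum of Q(r_i) over matched rules of C.
        score = defaultdict(float)
        for _, cls, q in full:
            score[cls] += q
        return max(score, key=score.get)
    # No-match: partial matching, PMS(r) = (#matched conjuncts / #conjuncts) * Q(r).
    score = defaultdict(float)
    for conjs, cls, q in rules:
        matched = sum(1 for a, v in conjs if example.get(a) == v)
        if matched:
            score[cls] += matched / len(conjs) * q
    return max(score, key=score.get) if score else default_class

# Usage with two toy rules for the contact-lens data (the qualities are made up).
rules = [
    ([("astigmatism", "yes"), ("tear", "normal"), ("spectacle", "myope")], "hard", 0.9),
    ([("astigmatism", "no"), ("tear", "normal")], "soft", 0.8),
]
print(classify({"age": "young", "spectacle": "myope", "astigmatism": "yes", "tear": "normal"},
               rules, default_class="none"))   # hard
```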