
IT enabled Business Intelligence, CRM, Database Applications

Sep-18
Classification using Decision Trees
Prof. Vibs Abhishek
The Paul Merage School of Business
University of California, Irvine
BANA 273 Session 6

1

Agenda
Using Decision Tree for Classification
Building Decision Trees
Review Assignment 2

2

Reading Rules off the Decision Tree
For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules.

[Figure: a tree whose root splits on Income (Low / High); each branch then splits on Balance (Low / High); the High-Income, Low-Balance branch splits further on Age (<45 / >45); the leaves are labeled Responder or Non-Responder.]

IF Income=Low AND Balance=Low THEN Non-Responder
IF Income=Low AND Balance=High THEN Responder
IF Income=High AND Balance=Low AND Age<45 THEN Non-Responder
IF Income=High AND Balance=Low AND Age>45 THEN Responder
IF Income=High AND Balance=High THEN Non-Responder
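As an aside (not from the slides), the same rule set can be written directly as nested if/else logic; the function name classify and the string values below are simply the labels used on this slide.

def classify(income, balance, age):
    # Follow the path from the root: Income first, then Balance, then Age.
    if income == "Low":
        return "Non-Responder" if balance == "Low" else "Responder"
    if balance == "High":  # income == "High"
        return "Non-Responder"
    return "Non-Responder" if age < 45 else "Responder"

print(classify("High", "Low", 50))  # Responder, per the fourth rule above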

3

Goal of Decision Tree Construction
Partition the training instances into purer sub-groups
Pure: the instances in a sub-group mostly belong to the same class

[Figure: the entire population is split first on Age (<45 / >=45), then on Balance (<50K / >=50K), and again on Age, producing leaves labeled with the class variable Status (Default / Not default).]
How to build a tree: how to split instances into purer sub-groups

4

Purity Measures
Purity measures: many are available
Gini (population diversity)
Entropy (information gain)
Information Gain
Chi-square test
The most common one (from information theory) is: Information Gain

5

Why do we want to identify pure sub-groups?
To classify a new instance, we determine the leaf that the instance belongs to based on its attributes.
If the leaf is very pure (e.g. all have defaulted), we can determine with greater confidence that the new instance belongs to this class (i.e., the "Default" class).
If the leaf is not very pure (e.g. a 50%/50% mixture of the two classes, Default and Not default), our prediction for the new instance is more like a random guess.

6

Impurity
[Figure: three distributions of the class variable Status (Default / Not default): a very impure group, a less impure group, and a group with minimum impurity.]

7

Example Split
Consider the following two splits. Which one is more informative?
Split over whether Balance exceeds 50K (over 50K vs. less than or equal to 50K)
Split over whether Income exceeds 100K (over 100K vs. less than or equal to 100K)
[Figure: the resulting distributions of the class variable Status (Default / Not default) under each split.]

8

Decision Tree Construction
A tree is constructed by recursively partitioning the examples.
With each partition the examples are split into increasingly purer sub-groups.
The key in building a tree: how to split.

9

Choosing a Split

ApplicantID  Children  Income  City    Status
1            Many      Medium  Philly  DEFAULTS
2            Many      Low     Philly  DEFAULTS
3            Few       Medium  Philly  PAYS
4            Few       High    Philly  PAYS

Try a split on the Children attribute (Many / Few) and a split on the Income attribute (Low / Medium / High).
Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first split (and in this case the only split, because the two sub-groups are 100% pure).

10

Recursive Steps in Building a Tree Example
STEP 1: Split Option A: not good, as the sub-nodes are still very heterogeneous!
STEP 1: Split Option B: better, as the purity of the sub-nodes is improving.
STEP 2: Choose Split Option B as it is the better split.
STEP 3: Try out splits on each of the sub-nodes of Split Option B. Eventually, we arrive at the final tree.
Notice how examples in a parent node are split between sub-nodes, i.e. how the training examples are partitioned into smaller and smaller subsets. Also notice that sub-nodes are purer than parent nodes.

11

Example 1: Riding Mower

12

Scatterplot of Lot Size versus Income

13

Splitting the Observations by Lot Size Value of 19

14

Tree Diagram: First Split

15

Second Split: Lot Size Value of 19K and then Income Value of 84.75K

16

Tree Diagram: First Two Splits

17

18

19

Final Partitioning

20

Full Tree
[Figure: the full tree with owner / non-owner leaves.]

21

Calculate the probability of each branch
[Figure: the full tree with the fraction of observations along each branch, e.g. 12/24, 10/12, 2/12, 3/12, 9/12, 7/9, 2/9, 7/10, 3/10, 1/2, 1/3, 2/3.]

22

Given lot size = 20, what is the probability of owner?
[Figure: the same tree with branch probabilities.]
P(Owner | Lot size = 20) = P(Owner & Lot size = 20) / (P(Owner & Lot size = 20) + P(Non-owner & Lot size = 20))

23
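A quick numeric sketch of the calculation above; the joint probabilities here are placeholders, since the exact branch probabilities belong to the tree figure, and only the ratio matters.

p_owner_and_lot20 = (12/24) * (7/10)      # assumed path probability to an Owner leaf with Lot size = 20
p_nonowner_and_lot20 = (12/24) * (3/10)   # assumed path probability to a Non-owner leaf with Lot size = 20
p_owner_given_lot20 = p_owner_and_lot20 / (p_owner_and_lot20 + p_nonowner_and_lot20)
print(p_owner_given_lot20)  # 0.7 under these placeholder values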
Given Income = 60, what is the probability of owner?
[Figure: the same tree with branch probabilities.]

24

Calculating Impurity
Impurity = Entropy = -Σ pi log2(pi), where pi is the proportion of class i
For example: our initial population is composed of 16 cases of class "Default" and 14 cases of class "Not default"
Entropy(entire population of examples) = -(14/30)log2(14/30) - (16/30)log2(16/30) = 0.997

25

Calculating the Information Gain of a Split
For each sub-group produced by the split, calculate the impurity/entropy of that sub-group.
Calculate the weighted entropy of the split by weighting each sub-group's entropy by the proportion of training examples (out of the training examples in the parent node) that are in that sub-group.
Calculate the entropy of the parent node, and subtract the weighted entropy of the child nodes to obtain the information gain for the split.

26
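The two slides above translate directly into a few lines of Python; this is a minimal sketch (not the lecture's own code), using the class counts from slide 25 as the example.

import math

def entropy(counts):
    # counts: number of instances of each class in a node, e.g. [16, 14]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # Weight each child's entropy by its share of the parent's instances.
    parent_total = sum(parent_counts)
    weighted_children = sum((sum(child) / parent_total) * entropy(child)
                            for child in child_counts_list)
    return entropy(parent_counts) - weighted_children

print(round(entropy([16, 14]), 3))  # 0.997, the entropy of the entire population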

Calculating Information Gain

Information Gain = Entropy (parent) - Entropy (children)

[Figure: the entire population (30 instances) is split on Balance into Balance < 50K (17 instances) and Balance >= 50K (13 instances).]

Entropy(parent) = -(14/30)log2(14/30) - (16/30)log2(16/30) = 0.997
Entropy(Balance < 50K child, 17 instances) = -(13/17)log2(13/17) - (4/17)log2(4/17) = 0.787
Entropy(Balance >= 50K child, 13 instances) = -(1/13)log2(1/13) - (12/13)log2(12/13) = 0.391
(Weighted) Average Entropy of Children = (17/30)(0.787) + (13/30)(0.391) = 0.615
Information Gain = 0.997 - 0.615 = 0.382

27
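For reference, a standalone check of the numbers above; the class counts 13/4 and 1/12 in the two children are taken from the entropy expressions on this slide.

import math

h = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # two-class entropy
children = (17/30) * h(13/17) + (13/30) * h(12/13)  # 0.615, the weighted child entropy
print(round(h(14/30) - children, 3))  # 0.381; the slide's 0.382 comes from subtracting the rounded values 0.997 - 0.615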

Information Gain

Information Gain = Entropy (parent) - Entropy (children)

[Figure: node A is the entire population; the Balance < 50K / >= 50K split produces nodes B and C; node B is then split on Age (< 45 / >= 45) into nodes D and E.]

Entropy(A) = 0.997
Entropy(B) = 0.787
Entropy(C) = 0.391
Entropy(B,C) = 0.615, so the gain of the Balance split at A: Gain = 0.997 - 0.615 = 0.382
Entropy(D) = -0 log2(0) - 1 log2(1) = 0 (a pure node)
Entropy(E) = -(3/7)log2(3/7) - (4/7)log2(4/7) = 0.985
Entropy(D,E) = 0.406, so the gain of the Age split at B: Gain = 0.787 - 0.406 = 0.381

28
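A similar check for the Age split at node B: node E's 3/7 vs 4/7 mixture is given on the slide, while node D's size of 10 is inferred from B's 17 instances minus E's 7, so treat that weight as an assumption.

import math

h = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # two-class entropy
entropy_b = h(13/17)                              # 0.787
entropy_children = (10/17) * 0 + (7/17) * h(3/7)  # Entropy(D)=0, Entropy(E)=0.985, weighted: 0.406
print(round(entropy_b - entropy_children, 3))     # 0.381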

Which attribute to split over?
At each node examine splits over each of the attributes
Select the attribute for which the maximum information gain is obtained
For a continuous attribute, also need to consider different ways of splitting (e.g. >50 or <=50; >60 or <=60)
For a categorical attribute with many possible values, sometimes also need to consider how to group these values (e.g. branch 1 corresponds to {A,B,E} and branch 2 corresponds to {C,D,F,G})

29

Example 2

Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto     10"          180     38   M
Krusty   6"           200     45   M
Comic    8"           290     38   ?

30

Let us try splitting on Hair Length
Gain = Entropy of parent - Weighted average of entropies of the children
Hair Length <= 5? (yes / no)
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

31

Let us try splitting on Weight
Weight <= 160? (yes / no)
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

32

Let us try splitting on Age
Age <= 40? (yes / no)
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183

33

Of the three features we had, Weight was the best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we simply continue splitting. This time we find that we can split on Hair Length, and then we are done!
[Figure: the resulting tree: Weight <= 160? (yes / no), and on the "yes" branch, Hair Length <= 2? (yes / no).]

34
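The three gains on slides 31-33 can be recomputed from the table on slide 30; the sketch below is illustrative (the column indices and variable names are mine, not from the lecture).

import math

data = [  # (name, hair length in inches, weight, age, class)
    ("Homer", 0, 250, 36, "M"), ("Marge", 10, 150, 34, "F"), ("Bart", 2, 90, 10, "M"),
    ("Lisa", 6, 78, 8, "F"), ("Maggie", 4, 20, 1, "F"), ("Abe", 1, 170, 70, "M"),
    ("Selma", 8, 160, 41, "F"), ("Otto", 10, 180, 38, "M"), ("Krusty", 6, 200, 45, "M"),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain(test):
    # test maps a row to True/False; gain = parent entropy minus weighted child entropy.
    parent = [row[4] for row in data]
    yes = [row[4] for row in data if test(row)]
    no = [row[4] for row in data if not test(row)]
    children = (len(yes) / len(parent)) * entropy(yes) + (len(no) / len(parent)) * entropy(no)
    return entropy(parent) - children

print(f"Gain(Hair Length <= 5) = {gain(lambda r: r[1] <= 5):.4f}")   # 0.0911
print(f"Gain(Weight <= 160)    = {gain(lambda r: r[2] <= 160):.4f}") # 0.5900
print(f"Gain(Age <= 40)        = {gain(lambda r: r[3] <= 40):.4f}")  # 0.0183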
Example 3: Which attribute to split on?

Exercise - Decision Tree

Customer ID  Student  Credit Rating  Class: Buy PDA
1            No       Fair           No
2            No       Excellent      No
3            No       Fair           Yes
4            No       Fair           Yes
5            Yes      Fair           Yes
6            Yes      Excellent      No
7            Yes      Excellent      Yes
8            No       Excellent      No

Which attribute to split on first?
Hint: log2(2/3) = -0.585, log2(1/3) = -1.585, log2(1/2) = -1, log2(3/5) = -0.737, log2(2/5) = -1.322, log2(1/4) = -2, log2(3/4) = -0.415

36

Building a Tree - Stopping Criteria
You can stop building the tree when:
The impurity of all nodes is zero: the problem is that this tends to lead to bushy, highly-branching trees, often with one example at each node.
No split achieves a significant gain in purity (information gain not high enough).
Node size is too small: that is, there are fewer than a certain number of examples, or proportion of the training set, at each node.

37

Over-fitting

38

Overfitting & Underfitting
Notice how the error rate on the testing data increases for overly large trees.
Overfitting: the model performs poorly on new examples (e.g. testing examples) as it is too highly tuned to the specific training examples (it picks up their patterns and noise).
Underfitting: the model performs poorly on new examples as it is too simplistic to distinguish between them (i.e. it has not picked up the important patterns from the training examples).
[Figure: error rate versus tree size, with the underfitting and overfitting regions marked.]

39

Pruning
A decision tree is typically more accurate on its training data than on its test data.
Removing branches from a tree can often improve its accuracy on a test set.
Classification and Regression Tree (CART): use validation data to delete "weak" sub-trees.
Assess whether splitting a node improves purity by a statistically significant amount.

40

There are many possible splitting rules that perfectly classify the training data, but will not generalize to future datasets.
[Figure: two candidate trees: Tree 1 splits on Glasses (Yes / No) and then on Hair (Short / Long); Tree 2 splits on Hair only. Tree 1 is 100% accurate on the training data.]

Training data:
Name  Hair   Glasses  Class
Mary  Long   No       Female
Mike  Short  No       Male
Bill  Short  No       Male
Jane  Long   No       Female
Ann   Short  Yes      Female

Testing data (with each tree's predictions):
Hair   Glasses  Tree 1   Tree 2   TRUE
Short  Yes      Female   Male     Male
Short  No       Male     Male     Female
Long   No       Female   Female   Female
Short  Yes      Female   Male     Male
Error:          75%      25%

41
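A short sketch that scores the two trees on the testing examples above. The tree logic (Tree 1 splits on Glasses and then Hair; Tree 2 splits on Hair only) is read off the figure, so treat it as an assumed reconstruction.

test = [  # (hair, glasses, true class)
    ("Short", "Yes", "Male"), ("Short", "No", "Female"),
    ("Long", "No", "Female"), ("Short", "Yes", "Male"),
]
tree1 = lambda hair, glasses: "Female" if glasses == "Yes" else ("Male" if hair == "Short" else "Female")
tree2 = lambda hair, glasses: "Male" if hair == "Short" else "Female"
for name, tree in (("Tree 1", tree1), ("Tree 2", tree2)):
    errors = sum(tree(hair, glasses) != cls for hair, glasses, cls in test)
    print(name, f"test error = {errors / len(test):.0%}")  # Tree 1: 75%, Tree 2: 25%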
Decision Tree Classification in a Nutshell
Decision tree: a tree structure
An internal node denotes a test on an attribute
A branch represents an outcome of the test
Leaf nodes represent class labels
Decision tree generation consists of two phases:
Tree construction
At the start, all the training examples are at the root
Partition the examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
To avoid overfitting
Use of a decision tree: classifying an unknown sample
Test the attribute values of the sample against the decision tree

42

Strengths
In practice: one of the most popular methods
Very comprehensible: the tree structure specifies the entire decision structure
Easy for decision makers to understand the model's rationale
Maps nicely to a set of business rules
Relatively easy to implement
Very fast to run (to classify examples) with large data sets
Good at handling missing values: just treat "missing" as a value, which can become a good predictor
Weakness
Bad at handling continuous data; good with categorical input and output

43

Which attribute will you use as the root of the tree, given the following information:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
A: Outlook
B: Humidity
C: Windy
D: Temperature
E: None of the above

44

What is overfitting?
A: When the model fit is better on the top side
B: When the model fit is worse on the top side
C: When the model captures the correct trend and has best accuracy
D: When the model captures noise in the data, hurting accuracy
E: None of the above

45

Weka Example - Classification using Naïve Bayes
Download file from Canvas: 4bank-data-8.arff
Switch tab to "Classify"
Select method: NaiveBayes
Verify the class variable is set to "pep"
Use 10-fold cross validation
Run the classifier
Examine the confusion matrix

Weka Exercise
Follow the instructions on http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/classify.html
Data files are posted on Canvas
We will use J48, which is an implementation of the C4.5 algorithm

47

Next Session
Association Rules

48