Data Mining: Concepts and Techniques
— Chapter 8 —
Qiang (Chan) Ye, Faculty of Computer Science, Dalhousie University
Chapter 8. Classification: Basic Concepts
n Classification: Basic Concepts
n Decision Tree Induction
n Bayes Classification Methods
n Rule-Based Classification
n Model Evaluation and Selection
n Summary
Supervised vs. Unsupervised Learning
n Supervised learning (classification)
n Each item in the training data set is associated with a label,
indicating the class of the item
n The training data set is used to construct a classification model
n New data is classified using the classification model
n Unsupervised learning (clustering)
n The class labels of the training data are unknown
n Given a data set, the goal is to properly divide the data set into a group of clusters
Prediction Problems: Classification vs. Numeric Prediction
n Classification
n Predicts class labels
n e.g. risky vs. non-risky loan applicants for banks
n The class labels can be represented using alphabetic names or discrete values (where the ordering among values has no meaning).
n For example, the discrete values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.
Prediction Problems: Classification vs. Numeric Prediction
n Numeric Prediction
n Predicts a continuous value or an ordered value
n e.g. Predicting air ticket price
n Regression analysis is a statistical method often used for
numeric prediction
n We focus on classification in this chapter. Here are a few typical classification applications:
n Credit/loan approval
n Medical diagnosis: whether a tumor is cancerous or benign
n Fraud detection: whether a transaction is fraudulent
Classification—A Two-Step Process
n Step 1 – Learning: Training data is analyzed by a classification algorithm to construct a classification model
n Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
n The set of tuples used for model construction forms the training dataset
n The model is described using decision trees, classification rules, or mathematical formulae
Classification—A Two-Step Process
n Step 2 – Classifying: The classification model is used to classify future or unknown objects
n Estimate accuracy of the model
n Test dataset should be independent of training dataset
(otherwise overfitting will occur)
n For each item in the testing dataset, the known label of the item is compared with the label generated by the classification model
n Accuracy rate is the percentage of test set samples that are correctly classified by the model
n If the accuracy is acceptable, use the model to classify new data
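n As an illustration of the two steps, the sketch below trains a classifier on labeled training data and then estimates its accuracy on an independent test set. The use of scikit-learn, the decision tree classifier, and the synthetic data are assumptions made here for illustration only.
```python
# A minimal sketch of the two-step process (illustrative; library choice and
# synthetic data are assumptions, not part of the original slides).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic class-labeled data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Keep the test set independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 1 - Learning: construct the classification model from the training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 2 - Classifying: compare known test labels with the labels generated
# by the model, and compute the accuracy rate.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```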
Process (1): Learning
Process (2): Classifying
Chapter 8. Classification: Basic Concepts
n Classification: Basic Concepts
n Decision Tree Induction
n Bayes Classification Methods
n Rule-Based Classification
n Model Evaluation and Selection
n Summary
Decision Tree Induction
n Decision tree induction is the construction of decision trees from class-labeled training tuples.
n A decision tree is a flowchart-like tree structure:
n Each non-leaf node (or internal node) denotes a test based
on one attribute
n Each branch represents an outcome of the test
n Each leaf node (or terminal node) holds a class label.
n The topmost node in a tree is the root node.
n Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non-binary trees.
Decision Tree Induction
n A typical decision tree is shown in the following figure.
Decision Tree Induction
n ID3, C4.5, and CART are three classic decision tree induction algorithms.
n Most algorithms for decision tree induction follow a top-down approach.
n The approach starts with a training set of tuples and their associated class labels.
n The training set is recursively partitioned into smaller subsets as the tree is being built.
n A basic decision tree algorithm is summarized in the next slide.
n At first glance, the algorithm may appear long. However, it is quite straightforward.
Decision Tree Induction – Basic Alg.
Decision Tree Induction
Decision Tree Induction – Basic Alg.
n The tree starts as a single node, N, representing the training tuples in D (step 1).
n If the tuples in D all belong to the same class, then node N becomes a leaf and is labeled with that class (steps 2 and 3).
n Note that steps 4 and 5 are terminating conditions. All terminating conditions are explained at the end of the algorithm.
n Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion.
n The splitting criterion tells us which attribute to test at node N by determining the “best” way to separate or partition the tuples in D into individual classes (step 6).
Decision Tree Induction – Basic Alg.
n The splitting criterion also tells us what branches to grow from node N with respect to the outcomes of the chosen test.
n More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a split-point (e.g. part (b) of Figure 8.4) or a splitting subset (e.g. part (c) of Figure 8.4).
n The splitting criterion is determined so that, ideally, the resulting partition at each branch is as “pure” as possible.
n A partition is pure if all the tuples in the partition belong to the same class. In other words, if we split up the tuples in D according to the mutually exclusive outcomes of the splitting criterion, we hope for the resulting partitions to be as pure as possible.
Decision Tree Induction – Basic Alg.
n Then node N is labeled with the splitting criterion, which serves as a test at the node (step 7). A branch is grown from node N for each of the outcomes of the splitting criterion.
n The tuples in D are partitioned accordingly (steps 10 to 11).
n When partitioning is carried out, there are three possible scenarios, as illustrated in Figure 8.4 (in the next slide).
n Let A be the splitting attribute.
n A has v distinct values, {a1, a2, …, av}, based on the training data.
Decision Tree Induction
n The classification model is used to classify future or unknown objects
Decision Tree Induction – Basic Alg.
n The algorithm uses the same process recursively to form a decision tree for the tuples in each resulting partition, Dj , of D (step 14).
n The recursive partitioning stops only when any one of the following terminating conditions is true:
n Condition # 1: All the tuples in partition Dj (represented at node N) belong to the same class (steps 2 and 3).
n Condition # 2: There is no remaining attribute on which the tuples may be further partitioned (step 4).
n In this case, majority voting is employed (step 5). With this method, we need to convert node N into a leaf and label it with the most common class in Dj. Alternatively, the class distribution of the node tuples may be stored.
Decision Tree Induction – Basic Alg.
n Condition # 3: There is no tuple for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is created
and the class label is the most common class label in D (step 13).
n Finally, the resulting decision tree is returned (step 15).
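n The following is a simplified sketch of this recursion (not the textbook's exact pseudocode). It assumes each tuple is a Python dict with a "class" key, that the full domain of each attribute is supplied so that an empty partition (condition # 3) can arise, and that the attribute selection method is passed in as a function.
```python
from collections import Counter

def majority_class(tuples):
    # Majority voting: the most common class among the given tuples.
    return Counter(t["class"] for t in tuples).most_common(1)[0][0]

def generate_decision_tree(tuples, attributes, domains, select_attribute):
    classes = {t["class"] for t in tuples}
    if len(classes) == 1:                        # Condition # 1: all one class
        return {"leaf": classes.pop()}
    if not attributes:                           # Condition # 2: no attributes left
        return {"leaf": majority_class(tuples)}
    best = select_attribute(tuples, attributes)  # splitting attribute
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in domains[best]:                  # one branch per outcome
        subset = [t for t in tuples if t[best] == value]
        if not subset:                           # Condition # 3: empty partition
            node["branches"][value] = {"leaf": majority_class(tuples)}
        else:
            node["branches"][value] = generate_decision_tree(
                subset, remaining, domains, select_attribute)
    return node
```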
Attribute Selection Measures
n An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition (i.e. D) of class-labeled training tuples into individual classes.
n If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each smaller partition would be pure (i.e., all the tuples that fall into a given smaller partition would belong to the same class).
n Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
Attribute Selection Measures – Information Gain
n The notations to be used are listed as follows.
n Let D, the data partition, be a training set of class-labeled tuples.
n Suppose the class label attribute has m distinct values,
defining m classes, Ci (for i=1, 2, …, m).
n Let Ci,D be the set of tuples belonging to class Ci in D.
n Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
n ID3 uses information gain as its attribute selection measure.
n This measure is based on the pioneering work on information theory by Claude Shannon, who studied the value or “information content” of messages.
Attribute Selection Measures – Information Gain
n Let node N represent the set of tuples in partition D.
n The attribute with the highest information gain is chosen as
the splitting attribute for node N.
n This attribute can minimize the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions.
n Such an approach can minimize the expected number of tests needed to classify a given tuple and guarantees that a simple decision tree (but not necessarily the simplest) is found.
Attribute Selection Measures – Information Gain
n The expected information needed to classify a tuple in D is given by:
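Info(D) = − Σ_{i=1..m} pi log2(pi)    (Eq. 8.1)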
n pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci. Mathematically, it is equal to |Ci,D|/|D|.
n A log function to the base 2 is used, because the information is encoded in bits.
n Essentially, Info(D) is the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
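n As a minimal sketch, assuming the class distribution of D is given as a list of per-class tuple counts (the function name is illustrative, not library code), Info(D) can be computed as follows:
```python
from math import log2

def info(class_counts):
    # Info(D): entropy of a partition, given the tuple count of each class.
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(info([9, 5]))  # about 0.940 bits for 9 "yes" and 5 "no" tuples
```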
Attribute Selection Measures – Information Gain
n Now, suppose we were to partition the tuples in D via attribute A with v distinct values, {a1, a2, …, av}.
n If A is discrete-valued, these values correspond directly to the v outcomes of a test on A.
n Attribute A can be used to split D into v sub-partitions, {D1, D2, … , Dv}, where Dj contains the tuples in D whose value of A is aj.
n These partitions would correspond to the branches grown from node N.
n Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like each partition to be pure. However, it is quite likely that the partitions will be impure.
Attribute Selection Measures – Information Gain
n How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by:
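InfoA(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)    (Eq. 8.2)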
n |Dj|/|D| acts as the weight of the j-th partition.
n InfoA(D) is the expected information required to classify a tuple
in D when attribute A is used to partition D.
n The smaller InfoA(D) is, the greater the purity of the resulting partitions.
Attribute Selection Measures – Information Gain
n Information gain is defined as the difference between the original information requirement (i.e., based on just the
proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
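Gain(A) = Info(D) − InfoA(D)    (Eq. 8.3)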
n In other words, Gain(A) tells us how much would be gained by branching via A. It is the expected reduction in the information requirement caused by knowing the value of A.
n The attribute with the highest information gain should be chosen as the splitting attribute at node N.
Attribute Selection Measures – Information Gain
n This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum InfoA(D)).
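n A self-contained sketch of InfoA(D) and Gain(A) is given below. Here `partitions` is assumed to be one list of per-class tuple counts for each outcome of A; the helper names are illustrative only.
```python
from math import log2

def info(class_counts):
    # Info(D): same entropy helper as in the earlier sketch.
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def info_a(partitions):
    # InfoA(D): weighted average of the entropy of each partition (Eq. 8.2).
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(class_counts, partitions):
    # Gain(A) = Info(D) - InfoA(D) (Eq. 8.3).
    return info(class_counts) - info_a(partitions)
```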
Attribute Selection Measures – Information Gain
n Table 8.1 presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.
n The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (i.e., m=2).
n There are 9 tuples with class label “yes” and 5 tuples with class label “no”.
n A (root) node N is created for the tuples in D.
n To find the splitting criterion for these tuples, we must compute the information gain of each attribute.
Attribute Selection Measures – Information Gain
n We first use Eq. (8.1) to compute the expected information needed to classify a tuple in D, Info(D):
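Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits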
n Next, we need to compute the expected information requirement for each attribute, InfoA(D).
n Let’s start with the attribute age.
n We need to look at the distribution of yes and no tuples for each category of age.
Attribute Selection Measures – Information Gain
n For the category “youth”, there are 2 yes tuples and 3 no tuples.
n For the category “middle_aged”, there are 4 yes tuples and 0 no tuples.
n For the category “senior”, there are 3 yes tuples and 2 no tuples.
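n The expected information needed to classify a tuple in D if the tuples are partitioned according to age is therefore:
Infoage(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × (−(4/4) log2(4/4)) + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits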
Attribute Selection Measures – Information Gain
n Hence, the gain in information from such a partitioning would be:
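Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits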
n Similarly, we can arrive at:
n Gain(income) = 0.029 bits
n Gain(student) = 0.151 bits
n Gain(credit_rating) = 0.048 bits
n Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
n Node N is labeled with age, and branches are grown for each of the attribute’s values. The tuples are then partitioned accordingly, as shown in Figure 8.5.
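n Using the class counts above (2 yes/3 no for youth, 4 yes/0 no for middle_aged, 3 yes/2 no for senior, and 9 yes/5 no overall), this result can be checked with a short self-contained sketch:
```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

info_d = entropy([9, 5])                                    # about 0.940 bits
branches = [[2, 3], [4, 0], [3, 2]]                         # youth, middle_aged, senior
info_age = sum(sum(b) / 14 * entropy(b) for b in branches)  # about 0.694 bits
print(round(info_d - info_age, 3))                          # Gain(age), about 0.246 bits
```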
Attribute Selection Measures – Information Gain
n Note that the tuples falling into the partition for “age = middle_aged” all belong to the same class. Because they all belong to class “yes,” a leaf should therefore be created at the end of this branch and labeled “yes.”
n The final decision tree returned by the algorithm was shown earlier in Figure 8.2.
Attribute Selection Measures – Information Gain
n How can we compute the information gain of an attribute that is continuous-valued, unlike in the example?
n Suppose, instead, that we have an attribute A that is continuous- valued, rather than discrete-valued.
n For such a scenario, we must determine the “best” split-point for A.
n We first sort the values of A in increasing order. Typically, the midpoint between each pair of adjacent values is considered as a possible split-point.
Attribute Selection Measures – Information Gain
n Therefore, given v values of A, (v-1) possible split-points are evaluated.
n For example, the midpoint between the values ai and a(i+1) of A is:
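midpoint = (ai + a(i+1)) / 2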
n If the values of A are sorted in advance, then determining the best split for A requires only one pass through the values.
n For each possible split-point for A, we evaluate InfoA(D), where the number of partitions is two, that is, v=2 (or j= 1, 2) in Eq. 8.2.
Attribute Selection Measures – Information Gain
n The split point with the minimum expected information requirement for A is selected as the split point for A.
n D1 is the set of tuples in D satisfying A <= split point, and D2 is the set of tuples in D satisfying A > split point. Eq. 8.2 is used to calculate the expected information requirement.
n Once the split point is selected, we can compare A to other available attributes and choose the attribute that leads to the highest information gain.
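n A sketch of this procedure is shown below, assuming the tuples in D are given as two parallel lists (continuous attribute values and class labels); the helper names are illustrative only.
```python
from math import log2

def entropy(labels):
    # Info of a list of class labels.
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split_point(values, labels):
    # Sort once, then evaluate the midpoint between each adjacent pair of values.
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        midpoint = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [c for v, c in pairs if v <= midpoint]    # D1: A <= split point
        right = [c for v, c in pairs if v > midpoint]    # D2: A > split point
        # Expected information requirement with two partitions (v = 2 in Eq. 8.2).
        info_a = (len(left) / len(pairs)) * entropy(left) \
               + (len(right) / len(pairs)) * entropy(right)
        if info_a < best_info:                           # keep the minimum InfoA(D)
            best_point, best_info = midpoint, info_a
    return best_point, best_info
```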