
CS699 Lecture 4 Classification 1

Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are accompanied by class labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
2

Prediction Problems: Classification vs. Numeric Prediction
 Classification
 Predicts categorical class labels (discrete or nominal)
 Constructs a model based on the training dataset which has known class labels and uses it to classify new data (or determine the class labels of new data)
 Numeric Prediction
 Models continuous‐valued functions, i.e., the class attribute is a numeric attribute
 Can be also used to predict missing values
3

Prediction Problems: Classification vs. Numeric Prediction
 Typical applications
 Credit/loan approval: whether to approve an application
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization: which category a page belongs to
4

Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, mathematical formulae, etc.
5

Classification—A Two-Step Process
 Model usage: classify future or unknown tuples
 Estimate accuracy of the model
 The known label of test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test set samples that are correctly classified by the model
 Test set is independent of training set
 If the accuracy is acceptable, use the model to
classify data tuples with unknown class labels. Otherwise, build another model and repeat.
6

Process (1): Model Construction

Training Data:

NAME  SALARY  YEARS  MANAGER
Mike  85K     3      no
Mary  105K    5      yes
Bill  75K     7      yes
Jim   90K     5      no
Dave  85K     6      yes
Anne  80K     3      no

Classification Algorithms → Classifier (Model):

IF salary > 100K OR years > 5
THEN manager = ‘yes’
7

Process (2): Using the Model in Prediction

Testing Data:

NAME     SALARY  YEARS  MANAGER
Tom      90K     2      no
Merlisa  75K     4      no
George   95K     6      yes
Joseph   120K    5      yes

Unseen Data: (Jeff, 80K, 4)  Manager? → No
8
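To make the two-step process concrete, here is a minimal Python sketch (not from any library; the function name is illustrative) that applies the classifier learned in Process (1) to the testing data and to the unseen tuple above:

# Minimal sketch of Process (2): applying the learned rule model.
# The rule and the tuples come from the slides; everything else is illustrative.

def predict_manager(salary_k, years):
    """Classifier from Process (1): IF salary > 100K OR years > 5 THEN manager = 'yes'."""
    return "yes" if salary_k > 100 or years > 5 else "no"

# Testing data: (name, salary in K, years, known label)
testing_data = [
    ("Tom", 90, 2, "no"),
    ("Merlisa", 75, 4, "no"),
    ("George", 95, 6, "yes"),
    ("Joseph", 120, 5, "yes"),
]

# Estimate accuracy on the test set by comparing known labels with predictions.
correct = sum(predict_manager(s, y) == label for _, s, y, label in testing_data)
print("accuracy:", correct / len(testing_data))   # 1.0: all four test tuples classified correctly

# Classify the unseen tuple (Jeff, 80K, 4).
print("Jeff:", predict_manager(80, 4))            # 'no'

The rule classifies all four test tuples correctly and predicts ‘no’ for Jeff, matching the slide.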

Example (training) Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
9

1R Classifier
 Simple and cheap
 Idea: Make rules based on a single attribute
 Compute the error rate for each attribute
 Choose the one with the smallest error rate
10

1R Classifier

 Algorithm
For each attribute
    For each value of the attribute
        count the frequency of each class
        find the most frequent class
        make rule: assign that class to this attribute‐value
    Compute the error rate of the rules (of this attribute)
Choose the rules with the smallest error rate
11

1R Classifier

 Example dataset

outlook   temperature  humidity  windy  play
sunny     hot          high      F      N
sunny     hot          high      T      N
overcast  hot          high      F      Y
rainy     mild         high      F      Y
rainy     cool         normal    F      Y
rainy     cool         normal    T      N
overcast  cool         normal    T      Y
sunny     mild         high      F      N
sunny     cool         normal    F      Y
rainy     mild         normal    F      Y
sunny     mild         normal    T      Y
overcast  mild         high      T      Y
overcast  hot          normal    F      Y
rainy     mild         high      T      N
12

1R Classifier
 Make rules for each attribute and calculate error rate
Attribute outlook
sunny: 2 yes’s and 3 no’s
    rule: outlook = sunny → no, error rate = 2/5
overcast: 4 yes’s and 0 no’s
    rule: outlook = overcast → yes, error rate = 0
rainy: 3 yes’s and 2 no’s
    rule: outlook = rainy → yes, error rate = 2/5
total error = 4/14
13

1R Classifier
 Make rules for each attribute and calculate error rate
 Repeat the same for all other attributes.
Attribute temperature
temperature = hot → no (2/4), error rate = 2/4
(arbitrary tie breaking)
temperature = mild → yes (4/6), error rate = 2/6
temperature = cool → yes (3/4), error rate = 1/4
total error = 5/14
14

1R Classifier
Attribute humidity
humidity = high → no (4/7), error rate = 3/7
humidity = normal → yes (6/7), error rate = 1/7
total error = 4/14
Attribute windy
windy = false → yes (6/8), error rate = 2/8
windy = true → no (3/6), error rate = 3/6 (arbitrary tie breaking)
total error = 5/14
15

1R Classifier
 outlook and humidity have the same total error rate of 4/14. If we (randomly) choose outlook, the rules are
If outlook = sunny → play = no
If outlook = overcast → play = yes
If outlook = rainy → play = yes
16
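As a companion to this worked example, here is a minimal Python sketch of the 1R procedure on the weather dataset above (illustrative names; ties are broken by keeping the first attribute found, which matches the arbitrary choice of outlook over humidity here):

from collections import Counter, defaultdict

# Weather dataset from the example: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny","hot","high","F","N"), ("sunny","hot","high","T","N"),
    ("overcast","hot","high","F","Y"), ("rainy","mild","high","F","Y"),
    ("rainy","cool","normal","F","Y"), ("rainy","cool","normal","T","N"),
    ("overcast","cool","normal","T","Y"), ("sunny","mild","high","F","N"),
    ("sunny","cool","normal","F","Y"), ("rainy","mild","normal","F","Y"),
    ("sunny","mild","normal","T","Y"), ("overcast","mild","high","T","Y"),
    ("overcast","hot","normal","F","Y"), ("rainy","mild","high","T","N"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def one_r(data, attributes):
    best = None
    for i, attr in enumerate(attributes):
        # Count the class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in data:
            counts[row[i]][row[-1]] += 1
        # One rule per value: predict the most frequent class for that value.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors = tuples not covered by the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:   # keep first attribute on ties
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(data, attributes)
print(attr, rules, f"total error = {errors}/{len(data)}")
# -> outlook {'sunny': 'N', 'overcast': 'Y', 'rainy': 'Y'} total error = 4/14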

Bayesian Classification
 A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
 Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
17

P(H |X) P(X|H)P(H) P(X)
, where
Bayesian Theorem
 Based on Bayes’ rule (or Bayes’ theorem) of conditional probability:
 P(H) (prior probability), the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X) (evidence): probability that a sample data is observed
 P(X|H) (likelihood), the probability of observing the sample X,
given that the hypothesis holds
 E.g., Given that X buys a computer, the prob. that X is 31..40, medium income, …
18

Bayes’ Theorem
 A simple example of using Bayes’ theorem for inference: If a person has muscle pain, what is the probability that the person has flu?
19

Bayes’ Theorem
 F: A patient has flu, M: A patient has muscle pain
 We know from historical data:
P(M|F) = 0.75 (if a person catches flu, he/she has muscle pain 75% of the time)
P(F) = 0.00002 (the probability that a person has flu)
P(M) = 0.005 (the probability that a person has muscle pain)
 The probability that a person has flu if that person has muscle pain:
P(F |M ) P(M |F)P(F) P(M )
= 0.75 * 0.00002 / 0.005 = 0.003
20
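A quick numeric check of this calculation (a sketch; the three probabilities are the ones given on the slide):

# Bayes' theorem: P(F|M) = P(M|F) * P(F) / P(M), with the numbers from the slide
p_m_given_f = 0.75     # P(muscle pain | flu)
p_flu = 0.00002        # P(flu)
p_m = 0.005            # P(muscle pain)

print(p_m_given_f * p_flu / p_m)   # ≈ 0.003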

Bayes’ Theorem
 Given a data sample X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

P(H|X) = P(X|H) P(H) / P(X)

 Informally, this can be written as
posterior = (likelihood × prior) / evidence
21

Towards Naïve Bayesian Classifier
 Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n‐dimensional attribute‐value vector X = (x1, x2, …, xn), where xi is the value of attribute Ai.
 Suppose there are m classes C1, C2, …, Cm.
 Classification derives the maximum posterior probability, i.e., the maximal P(Ci|X)
 In other words, compute P(C1|X), P(C2|X), …, P(Cm|X) and predict X belongs to the class with the highest probability.
 Practical difficulty: require initial knowledge of many probabilities, significant computational cost
22

Towards Naïve Bayesian Classifier
 P(Ci|X) can be computed using Bayes’ theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 Buys_computer dataset example: given the known properties of a (new) customer (i.e., the evidence X), we want the probability that the customer will buy a computer (C1: buys_computer = yes) and the probability that the customer will not (C2: buys_computer = no)
23

Towards Naïve Bayesian Classifier
 Class label, buys_computer, has two values, Y and N:

P(classY|X) = P(X|classY) P(classY) / P(X)
P(classN|X) = P(X|classN) P(classN) / P(X)
 And, assign the class with higher probability to X (or we predict the class label of X will be the class with higher probability).
24

Towards Naïve Bayesian Classifier
 In the equations, P(X) is the same for all classes. So we compute and compare only the numerators.
P(X|Ci) P(Ci)
 For a class label with two values, Y and N, we compute and compare the following two:
P(X|classY) P(classY)
P(X|classN) P(classN)
25

Derivation of Naïve Bayes Classifier
 We need to compute P(X|Ci) and P(Ci).
 P(Ci) can be easily estimated from the training dataset
 For the Buys_computer dataset, there are two classes, yes and no. Let yes be C1 and no be C2.
 We can estimate P(Ci) as follows:
 P(C1) = (# yes tuples) / (total # tuples)
 P(C2) = (# no tuples) / (total # tuples)
26

Derivation of Naïve Bayes Classifier
 Computation of P(X|Ci) is not easy.
 It can be simplified with class‐conditional
independence assumption (a naïve assumption)
 class‐conditional independence assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
 This greatly reduces the computation cost
27

Derivation of Naïve Bayes Classifier
 Based on this assumption, we compute P(xk|Ci) for each attribute Ak and multiply them all to obtain P(X|Ci) (this is possible because we assumed all attributes are independent of each other).
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
P(X| )nP( | )P( | )P( | )…P( | ) CixkCix1Cix2Ci xnCi
k1
(Example follows in the next slides)
28

Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Data sample:
X = (age > 40, income = high, student = no, credit_rating = excellent)
29

Naïve Bayesian Classifier: An Example
 Class prior probabilities are:
P(C1) = P(buys_computer = “yes”) = 9/14 = 0.643
P(C2) = P(buys_computer = “no”) = 5/14 = 0.357
 Next, we compute P(X|Ci) for each class
P(age = “>40” | buys_computer = “yes”) = 3/9 = 0.333
P(income = “high” | buys_computer = “yes”) = 2/9 = 0.222
P(student = “no” | buys_computer = “yes”) = 3/9 = 0.333
P(credit_rating = “excellent” | buys_computer = “yes”) = 3/9 = 0.333
P(X|buys_computer = “yes”) = 0.333 x 0.222 x 0.333 x 0.333 = 0.008
P(age = “>40” | buys_computer = “no”) = 2/5 = 0.4
P(income = “high” | buys_computer = “no”) = 2/5 = 0.4
P(student = “no” | buys_computer = “no”) = 4/5 = 0.8
P(credit_rating = “excellent” | buys_computer = “no”) = 3/5 = 0.6
P(X|buys_computer = “no”) = 0.4 x 0.4 x 0.8 x 0.6 = 0.077
30

Naïve Bayesian Classifier: An Example
P(X|Ci) :
P(X|buys_computer = “yes”) = 0.008
P(X|buys_computer = “no”) = 0.077
P(X|Ci)*P(Ci) :
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.008 * 0.643 = 0.005
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.077 * 0.357 = 0.027
Therefore, the model predicts that X belongs to class (“buys_computer = no”)
31
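The whole calculation above can be reproduced with a short sketch. This is a minimal, illustrative naïve Bayesian classifier for categorical attributes (no Laplace smoothing; the function and variable names are made up for this example), run on the Buys_computer training data and the sample X:

from collections import Counter

# Buys_computer training data: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]

def naive_bayes_predict(data, x):
    classes = Counter(row[-1] for row in data)           # class counts: no=5, yes=9
    n = len(data)
    scores = {}
    for c, n_c in classes.items():
        score = n_c / n                                   # prior P(Ci)
        for k, value in enumerate(x):
            # P(xk|Ci): fraction of class-Ci tuples having this attribute value
            match = sum(1 for row in data if row[-1] == c and row[k] == value)
            score *= match / n_c
        scores[c] = score                                 # P(X|Ci) P(Ci)
    return max(scores, key=scores.get), scores

x = (">40", "high", "no", "excellent")                    # the data sample X from the slide
label, scores = naive_bayes_predict(data, x)
print(scores)   # ≈ {'no': 0.027, 'yes': 0.005}
print(label)    # 'no'

The two scores are P(X|Ci) P(Ci), i.e., the posterior up to the common factor P(X), so comparing them is enough to pick the class.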

Naïve Bayesian Classifier: Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore loss
of accuracy
 Practically, dependencies exist among variables:
 patient_profile: age, height, weight, etc.
 Dependencies among these cannot be modeled by a Naïve Bayesian classifier
 How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
32

Decision Tree

 A popular classifier
 A model is built as a decision tree
 Internal node represents a test on an attribute
 Branch represents an outcome of the test
 Leaf node has a class label
 See example in the next slide
33

Decision Tree Induction Example

 Training data set: Buys_computer (the dataset shown earlier)
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)
 Resulting tree:

age?
  <=30:   student?
            no  → no
            yes → yes
  31..40: yes
  >40:    credit_rating?
            excellent → no
            fair      → yes
34
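The resulting tree can be read as a set of nested tests. A sketch (the structure follows the tree above; the function name is illustrative, and income does not appear because the tree never tests it):

def buys_computer(age, student, credit_rating):
    """Classify a customer with the decision tree shown above."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31..40":
        return "yes"
    else:                      # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer(">40", "no", "excellent"))   # -> 'no'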

Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top‐down, recursive
manner
 Attributes are categorical (if continuous‐ valued, they are discretized first)
 At start, all training samples are considered to be at the root.
35

Algorithm for Decision Tree Induction
 Basic algorithm (continued)
 A test attribute is selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
 samples are partitioned based on the selected
attribute – children nodes are created
 The same process is repeated at each child
node (recursively)
36

Algorithm for Decision Tree Induction
 Conditions for stopping partitioning
 All samples for a given node belong to the
same class
 There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
 There are no samples left
37
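The basic algorithm on the last three slides can be sketched in a few lines of Python. This is an illustrative greedy, recursive implementation for categorical attributes that uses information gain (defined in detail later in this lecture) as the selection measure; it is a sketch, not the actual ID3/C4.5 code:

import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """rows: list of dicts mapping attribute name -> categorical value."""
    # Stopping condition 1: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no remaining attributes -> majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Greedy step: select the attribute with the highest information gain.
    def gain(attr):
        partitions = Counter(row[attr] for row in rows)
        info_a = sum((count / len(rows)) *
                     info([lab for row, lab in zip(rows, labels) if row[attr] == v])
                     for v, count in partitions.items())
        return info(labels) - info_a

    best = max(attributes, key=gain)

    # Partition the samples on the chosen attribute and recurse on each child node.
    tree = {"test": best, "branches": {}}
    for value in sorted(set(row[best] for row in rows)):
        child_rows = [row for row in rows if row[best] == value]
        child_labels = [lab for row, lab in zip(rows, labels) if row[best] == value]
        tree["branches"][value] = build_tree(
            child_rows, child_labels, [a for a in attributes if a != best])
    return tree

Called on the Buys_computer tuples (as dictionaries keyed by attribute name) with attributes age, income, student, and credit_rating, this sketch would select age at the root, consistent with the gains computed later in this lecture.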

Algorithm for Decision Tree Induction
 Initially all samples are in the same partition, associated with the root node.
38

Algorithm for Decision Tree Induction
 Age is chosen as the test attribute, and samples are partitioned into three nodes based on the values of Age, i.e., “≤ 30”, “31..40”, and “>40.”
(Figure: the “≤30” and “>40” child nodes need to be further partitioned; all 4 tuples in the “31..40” node are Yes, so that node is not partitioned further.)
39

Algorithm for Decision Tree Induction
 The same process is repeated (recursively) on the left node and on the right node.
40

Attribute Selection Measure:
 Consider the following dataset:
Temp  Wind  Class
high  east  Y
low   west  N
low   east  N
high  west  Y
high  west  Y

 Which split is better: split by Temp or split by Wind?
41

Attribute Selection Measure:
 Test attribute is selected based on “purity measure.”
 Purity measures: information gain, gain ratio, Gini index
 Information gain is based on info.
 Info represents how pure a dataset is with regard to class labels.
 Suppose a dataset has 10 tuples with two class values, Y and N.
 If the dataset has all Y’s or all N’s, then the dataset is purest and the value of info is 0.
 If the dataset has 5 Y’s and 5 N’s, this is the most impure case, and the value of info is 1.
42

Attribute Selection Measure: Info Examples
 Notation: I(x1, x2, …, xm), where
    m is the number of distinct class values,
    xi is the number of tuples belonging to the i‐th class,
    |D| = x1 + x2 + … + xm
 Computation of info:
    I(x1, x2, …, xm) = − Σ (i=1..m) pi log2(pi),  where pi = xi / |D|
 Info is also referred to as entropy
43

Attribute Selection Measure: Info Examples
 Computing log base‐2:
    log2 x = log10 x / log10 2 = log10 x / 0.301
 Example:
    log2 0.6 = log10 0.6 / log10 2 = −0.2218 / 0.301 = −0.7369
44

Attribute Selection Measure: Info Examples
 A dataset has 6 tuples with 5 Y’s and 1 N:
    I(5, 1) = −(5/6) log2(5/6) − (1/6) log2(1/6) = 0.650
 A dataset has 10 tuples, all Y’s:
    I(10, 0) = −(10/10) log2(10/10) − (0/10) log2(0/10) = 0
 A dataset has 10 tuples with 5 Y’s and 5 N’s:
    I(5, 5) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
45
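A short sketch that reproduces the three values above (illustrative, not from a library):

import math

def info(*counts):
    """I(x1, ..., xm) = -sum(pi * log2(pi)) with pi = xi/|D|; the 0*log2(0) term is taken as 0."""
    total = sum(counts)
    # '+ 0.0' just normalizes -0.0 to 0.0 for pure datasets
    return -sum((x / total) * math.log2(x / total) for x in counts if x > 0) + 0.0

print(round(info(5, 1), 3))    # 0.65
print(round(info(10, 0), 3))   # 0.0
print(round(info(5, 5), 3))    # 1.0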

Attribute Selection Measure: Information Gain (ID3/C4.5)
 Compute the information gain of each attribute
 Select the attribute with the highest information gain
 Informal description of information gain:
    Info(D): the amount of information we need to classify a sample in D.
        Example: we need 0.9
    InfoA(D): after splitting D on an attribute A, the amount of information needed to classify a sample.
        Example: after splitting on A, now we need only 0.3
    Gain(A) = Info(D) – InfoA(D).
        Example: the gain is 0.9 – 0.3 = 0.6
46

Attribute Selection Measure: Information Gain (ID3/C4.5)
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D (m: # classes):
    Info(D) = − Σ (i=1..m) pi log2(pi)
 Information needed to classify D, after using attribute A to split D into v partitions (v: # distinct values of A):
    InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)
 Information gained by splitting by A:
    Gain(A) = Info(D) − InfoA(D)
47

Attribute Selection Measure: Information Gain (ID3/C4.5)

 Let D be the Buys_computer dataset (shown earlier)
 There are 14 tuples in D: 9 Yes’s and 5 No’s

Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
48

Attribute Selection Measure: Information Gain (ID3/C4.5)
 If we split D based on Age, three partitions are created.
 Let’s call them Node‐1 (≤ 30), Node‐2 (31..40), and Node‐3(> 40).
49

Attribute Selection Measure: Information Gain (ID3/C4.5)
 Info of Node‐1 is: I(2, 3) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
 Info of Node‐2 is: I(4, 0) = 0
 Info of Node‐3 is: I(3, 2) = I(2, 3) = 0.971
 After splitting by Age, the amount of information needed to classify a tuple is computed as the weighted average of the above three:
    Infoage(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
 So, the information gain obtained by splitting by Age is:
    Gain(age) = Info(D) − Infoage(D) = 0.246
50
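The numbers on this slide can be checked with a few lines (a sketch that recomputes Info(D), Infoage(D), and Gain(age) from the class counts above):

import math

def info(*counts):
    total = sum(counts)
    return -sum((x / total) * math.log2(x / total) for x in counts if x > 0)

info_d = info(9, 5)                       # Info(D): 9 yes's and 5 no's in D
# Age partitions: <=30 has 2 yes / 3 no, 31..40 has 4 yes / 0 no, >40 has 3 yes / 2 no
info_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)

print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# -> 0.94 0.694 0.247  (the slide's 0.246 subtracts the already-rounded 0.940 and 0.694)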

Information Gain – Another Example
 If we split by income
51

Information Gain – Another Example
 high: 4 tuples; 2 yes’s and 2 no’s
 medium: 6 tuples; 4 yes’s and 2 no’s
 low: 4 tuples; 3 yes’s and 1 no

Infoincome(D) = (4/14) I(2, 2) + (6/14) I(4, 2) + (4/14) I(3, 1) = 0.911

I(2, 2) = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1.0
I(4, 2) = −(4/6) log2(4/6) − (2/6) log2(2/6) = 0.918
I(3, 1) = −(3/4) log2(3/4) − (1/4) log2(1/4) = 0.811

Gain(income) = Info(D) − Infoincome(D) = 0.940 − 0.911 = 0.029
52

Attribute Selection: Information Gain
 We can compute Gain(student) and Gain(credit_rating) in the same way.
 Then, we have:
Gain(Age) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
 We select Age as the test attribute because it has the highest information gain.
53

Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a large number of values.
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem.
    SplitInfoA(D) = − Σ (j=1..v) (|Dj| / |D|) log2(|Dj| / |D|)

    Here, attribute A has v distinct values and |Dj| is the number of tuples with the j‐th value.
 GainRatio(A) = Gain(A) / SplitInfoA(D)
 The attribute with the maximum gain ratio is selected as the splitting attribute.
54
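A sketch of the gain-ratio computation for the income attribute; the 4/6/4 partition sizes and Gain(income) = 0.029 come from the earlier slides, and the numeric results are computed rather than quoted:

import math

def split_info(*partition_sizes):
    """SplitInfoA(D) = -sum((|Dj|/|D|) * log2(|Dj|/|D|)) over the v partitions."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

# income splits the 14 tuples into partitions of size 4 (high), 6 (medium), 4 (low)
si = split_info(4, 6, 4)
print(round(si, 3))               # ≈ 1.557
print(round(0.029 / si, 3))       # GainRatio(income) = Gain(income) / SplitInfo ≈ 0.019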

Other Attribute Selection Measures

 CHAID
 GINI index
 C‐SEP
 G‐statistic
 MDL (Minimal Description Length) principle
 Multivariate splits (partition based on multiple variable combinations)
    Example: CART
 Which attribute selection measure is the best?
    Most give good results; none is significantly superior to the others
55

Overfitting and Tree Pruning
 Overfitting:
 A model reflects every detail of the training dataset, even anomalies or noise.
 A model becomes complex.
 Accuracy on the training dataset is high.
 Accuracy on the test dataset becomes low.
 So, an overfitted model does not generalize well.
56

Overfitting and Tree Pruning
 Overfitting of decision tree:
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning (more common): Remove branches from a “fully
grown” tree – get a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the “best pruned tree”
57

References

• Han, J., Kamber, M., and Pei, J., “Data Mining: Concepts and Techniques,” 3rd Ed., Morgan Kaufmann, 2012. http://www.cs.illinois.edu/~hanj/bk3/
• Witten, I. H., Frank, E., and Hall, M. A., “Data Mining: Practical Machine Learning Tools and Techniques,” 3rd Ed., Morgan Kaufmann, 2011.

58