Classification
What is a classification?
• Examples:
Copyright By PowCoder代写 加微信 powcoder
• Analyze data to learn which loan applicants are risky and which are safe
• Analyze breast cancer data to predict which of the three specific treatments is better
• The classification task is aimed at predicting the class (category or label) of data
• It is a prediction problem
• Classes are discrete values that are not characterized by an order relationship
Classification: a two-step process
Learning (training) step
• Each tuple/sample is assumed to belong to a predefined class, as determined by the
class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or mathematical formulae
Classification step
• Estimate accuracy of the model
• Theknownlabeloftestsampleiscomparedwiththeclassifiedresultfromthemodel
• Accuracyrateisthepercentageoftestsetsamplesthatarecorrectlyclassifiedbythemodel • Testsetisindependentoftrainingset(otherwiseoverfitting)
• If the accuracy is acceptable, use the model to classify new data
Learning step: example
Training Data
Classification Algorithms
Classifier (Model)
Assistant Prof
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification step: example
Classifier
Testing Data
Unseen Data
(Jeff, Professor, 4)
Assistant Prof
Associate Prof
Decision tree (1)
• Flow-chart like structure
• Each internal node denotes a test on an attribute • Each branch represents an outcome of the test
• Each leaf node represents a label/class
• Given a tuple X, its attribute values are tested against the decision tree, visiting a path in the tree
• The leaf node reached holds the predicted class for X
• Decision trees can easily be converted into classification rules
Decision tree (2)
Why are decision tree classifiers so popular?
• Their construction does not require domain knowledge or parameter setting
• They can handle multidimensional data
• Their representation is intuitive
• They usually have good accuracy
Decision tree: example
o3v1e.r.4c0ast student? yes
age income student credit_rating buys_comput
credit rating?
excellent fair
Decision tree induction: algorithm
• Principle
• Greedy algorithm
• Tree is constructed in a top-down recursive divide-and-conquer manner
• Algorithm
• At start, all the training examples are at the root
• Tuple are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
• Stopping conditions
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
• There are no tuples left in a branch
Partitioning scenarios
Discrete values
Continuous values
Discrete values producing a binary tree
age income student credit_rating buys_comput
credit_ra ting
buys_comp uter
credit_rat ing
buys_comp uter
credit_rat ing
buys_comp uter
Decision tree induction: example (1)
o3v1e.r.4c0ast
Decision tree induction: example (2)
o3v1e.r.4c0ast student? yes
age income student credit_rating buys_comput
credit rating?
excellent fair
credit_ra ting
buys_comp uter
buys_comp uter
buys_comp uter
credit_ra ting
buys_comp uter
Decision tree induction: example (3)
o3v1e.r.4c0ast student? yes
age income student credit_rating buys_comput
credit rating?
excellent fair
Attribute Selection Measure
• Heuristic for selecting the splitting criterion that best separates a given data partition of class-labeled training tuples into individual classes
• Ideally each partition would be pure, that is, all tuples in a given partition belong to the same class
• Provides a ranking of attributes
• Select the attribute with highest score for the measure • Determine a split point or a splitting subset
• Split tuples according to the selected criterion
• Popular measures: information gain, gain ratio, gini index, …
Information gain: entropy
• Entropy: measures the uncertainty associated with a random variable • For a discrete random variable Y taking m distinct values {y1,…,ym}:
H ” = − ∑ /’ 0 1 & ‘ l o g ( & – )
where pi = P(Y=yi)
• Higher entropyàhigher uncertainty • Lower entropyàlower uncertainty
Information gain
• Choose the attribute with highest information gain
• Minimizes the information needed to classify the tuples in the resulting
partitions
• Reflects least randomness or impurity in partitions
• Information gain minimizes the expected number of tests for classification
Information gain: Info(D)
• Expected information needed to classify a tuple in D is
I n f o 5 = − ∑ /’ 0 1 & ‘ l o g 2 ( & – )
• m: number of classes
• pi: probability that an arbitrary tuple in D belongs to class Ci, estimated as
|Ci,D|/|D| where Ci,D is the set of tuples in class Ci in D
• log2: base 2 is used since information is encoded in bits
• Info(D) is the entropy of D
• Info(D) measures the average amount of information needed to identify the class label of a tuple in D
Info(D): example
Credit_rating
Buy_computer
middle aged
middle aged
middle aged
middle aged
C1 represents ‘yes’ C2 represents ‘no’
Info(D): example
Credit_rating
Buy_computer
middle aged
middle aged
middle aged
middle aged
C1 represents ‘yes’, |C1|=9 C2 represents ‘no’, |C2|=5
Info(D)= − 7 9:; 7 − < 9:; < 18 2 18 18 2 18
Information gain: InfoA(D)
• If A is a discrete attribute with v values {a1,...,av}, D is split into v partitions {D1,...,Dv}
• Measure the amount of information still needed to obtain an exact classification after attribute A has been used for partitioning
I n f o A 5 = ∑ D@ 0 1 | ? @ | A B C : 5 @ |?|
• It measures the expected information required to classify a tuple in D based on the partitioning of A
• The smaller the expected information still required, the greater the purity of the partitions
InfoA(D): example
Credit_rating
Buy_computer
middle aged
middle aged
middle aged
middle aged
Consider attribute age
• D1 represents ‘youth’
• D2 prepresents ‘middle aged’
• D3 represents ‘senior’
• 3yes • 2no
Infoage(D)=
InfoA(D): example
Credit_rating
Buy_computer
middle aged
middle aged
middle aged
middle aged
Consider attribute age
• D1 represents ‘youth’
• D2 prepresents ‘middle aged’
• D3 represents ‘senior’
• 3yes • 2no
Info (D)= < −E9:; E−F9:; F + age 18 < 2< < 2<
8 −89:;8 + 18 8 28
< −F9:; F−E9:; E 18 < 2< < 2<
Information gain: Gain(A)
• Information gain for attribute A is the difference between the original information requirement and the one obtained partitioning according
Gain K =Info 5 −InfoA 5
• The attribute with highest gain is chosen as splitting attribute at node N
Gain(A): example
Credit_rating
Buy_computer
middle aged
middle aged
middle aged
middle aged
Info(D) = 0.940
Infoage(D) = 0.694
Gain(age) = 0.940 - 0.694 = 0.246
Gain(income) = 0.029 Gain(student) = 0.151 Gain(credit_rating) = 0.048
Age has the highest gain and is selected
Continuous valued attributes
• If A is a continuous attribute, it is necessary to find the best split point
• order the values of A in increasing order
• the midpoint between each pair of adjacent values is considered as a possible split-point (ai + ai+1)/2 is the split point between ai and ai+1
• evaluate InfoA 5 for each split point and choose the one with minimum InfoA 5
• Once identified the split point
• D1: set of tuples with A <= split_point • D2: set of tuples with A > split_point
Gain ratio
• Information gain measure is biased towards attributes with a large number of values
• Gain ratio applies a normalization to information gain using a split information value computed as
SplitInfo 5 =−∑D 9:; A @01|?| 2 |?|
GainRatio(A)=Gain(A)/SplitInfoA 5
GainRatio(A): example
Credit_rating
Buy_computer
middle aged
middle aged
middle aged
middle aged
Consider attribute income
• D1 represents ‘high’ (4 tuples)
• D2 prepresents ‘medium’ (6 tuples)
• D3 represents ‘low’ (4 tuples)
SplitInfo (D)=−89:;8 −
6 9:; 6 − 14 2 14
8 9:; 8 18 2 18
= 1.557 GainRatio(income)= 0.029/1.557 = 0.019
Overfitting and tree pruning
• Overfitting: An induced tree may overfit the training data
• Too many branches, some may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples
• Two approaches to avoid overfitting
• Prepruning: Halt tree construction early ̵ do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
• Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the “best pruned tree”
Tree pruning: example
Bayesian classification
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally increase/decrease the probability that an hypothesis is correct — prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes’ Theorem
• X is a tuple, called evidence
• H is an hypothesis, such as the fact that X belongs to class C
• Classification problem: determine P(H|X)
• P(H|X) is the probability that H holds given evidence of tuple X (posterior
probability)
• According to Bayes’ theorem R S T = U T S U(V) U(W)
Probabilities
• P(H) is the prior probability that H holds, independently from X
• the probability that a costumer will buy a computer (H), regardless of age, income, or
any other information
• P(X) is the probability that tuple X is observed
• the probability that a person from our set of costumers is 35 years-old and earns 40k (X)
• P(H|X) is the posterior probability of H conditioned X
• probability that costumer will buy a computer (H) given that we know the costumers’
age, which is 35, and income, which is 40k (X)
• P(X|H) is the posterior probability of X conditioned H
• probability that costumer X, is 35 years-old and earns 40k (X), given that we know that the costumer will buy a computer (H)
Naıve Bayesian Classification (1)
• D: training set of tuples, each represented by an n-dimensional vector X=(x1,…,xn) over attributes A1,…,An
• C1, …, Cm: set of m classes
• Given a tuple X, it belongs to the class having the highest posterior
probability conditioned on X:
P(Ci|X) > P(Cj|X) for 1<=j<=m, j<>i
• Note that R X- T = U T X- U(Y’) and P(X) is constant for all classes,
thus we need to maximize R T X- R(X-)
Naıve Bayesian Classification (2)
• If class prior probabilities are not known, it is common to assume that P(C1)=P(C2)=…=P(Cn) and we need to maximize R T X-
• For datasets with many attributes, it is expensive to compute R T X- • Naive assumption: class conditional independence, hence
R T X- = ∏]^01 R([\|X-) = P x1 Ci × ⋯×P(xn|Ci)
Estimating R([\|X-)
• R([\|X-) is estimated from training tuples
• Categorical attributes: number of tuples in class Ci in D having value xk for attribute Ak divided by |Ci,D|
• Continuous valued attributes: assuming that Ak has a Gaussian (normal) distribution with mean d and standard deviation e
2s2 g(x,μ,s)= 1 e-(x-μ)2
2p s P(X|Ci)=g(xk,μCi ,sCi )
Naıve Bayesian Classifier: example (1)
C1: buys_computer = ‘yes’ C2: buys_computer = ‘no’
Data to be classified: X = (age <=30,
income = medium, student = yes credit_rating = fair)
redit_rbautiyns
Naıve Bayesian Classifier: example (2)
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
Compute P(X|Ci)
[X = (age<=30 , income=medium, student=yes, credit_rating = fair)]
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Compute P(X|Ci)*P(Ci)
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
àX belongs to class (“buys_computer = yes”)
age income studentcredit_rbautiynsg_comp
0-probability problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero. Otherwise, the predicted probability will be zero
• Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator) • Adding 1 to each case
Prob(income = low) = 1/1003 Prob(income = medium) = 991/1003 Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their “uncorrected” counterparts
Naıve Bayesian Classifier: pros and cons
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayes Classifier
Accuracy and error measures
• How can we measure the accuracy of a classi
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com