COMP9318: Data Warehousing
and Data Mining
— L7: Classification and Prediction —
n Problem definition and preliminaries
Classification vs. Prediction
n Classification:
n predicts categorical class labels (discrete or nominal)
n classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
n Prediction (aka. Regression):
n models continuous-valued functions, i.e., predicts unknown or missing values
n Typical Applications
n credit approval
n target marketing
n medical diagnosis
n treatment effectiveness analysis
Classification and Regression
• Given a new object o, map it to a feature vector x = (x1, x2, . . . , xd)⊤
• Predict the output (class label) y
• Binary classification: y ∈ {0, 1} (sometimes {−1, +1})
• Multi-class classification: y takes one of more than two class labels
• Learn a classification function that maps feature vectors to class labels
• Regression: the output is a continuous (real) value
Examples of Classification Problem
n Text categorization:
n Input object: a document = a sequence of words
n Input features x: word frequencies
n freq(democrats) = 2, freq(basketball) = 0, …
n x = [1, 2, 0, …]⊤
n Class label y:
n ‘Politics’
n ‘Sport’

Example doc: “Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …” è Topic: Politics
Examples of Classification Problem
n Image Classification:
n Input object: an image = a matrix of RGB values
n Input features x: a color histogram
n pixel_count(red) = 1004, pixel_count(blue) = 23000, …
n x = [1004, 23000, …]⊤
n Class label y:
n ‘bird image’
n ‘non-bird image’

(Figure: example images; which images are birds, which are not?)
Supervised Learning
• How to find f()?
• In supervised learning, we are given a set of
training examples:
• Independent and identically distributed (i.i.d.) assumption
• A critical assumption for machine learning
theory
Machine Learning Terminologies
n Supervised learning has input labelled data
n #instances x #attributes matrix/table
n #attributes = #features + 1
n 1 (usu. the last attribute) is for the class attribute
n Labelled data split into 2 or 3 disjoint subsets
n Training data è Build a model
n Training error = 1.0 - #correctly_classified / #training_instances
n Validation/development data è Select/refine the model
n Testing data è Evaluate the model
n Testing/generalization error = 1.0 - #correctly_classified / #testing_instances
n We mainly discuss binary classification here
n i.e., #labels = 2
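To make the split and the two error definitions above concrete, here is a minimal scikit-learn sketch; the toy data, the 60/20/20 split and the decision-tree model are illustrative assumptions, not part of the course material.

# Sketch: train/validation/test split and error rates (illustrative only)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(1000, 5)                   # toy feature matrix: 1000 instances, 5 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)       # toy binary class attribute

# split labelled data into 3 disjoint subsets: 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model
train_error = 1.0 - model.score(X_train, y_train)        # 1 - accuracy on training data
val_error = 1.0 - model.score(X_val, y_val)              # used to select/refine the model
test_error = 1.0 - model.score(X_test, y_test)           # generalization estimate
print(train_error, val_error, test_error)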
n Overview of the whole process (not including
cross-validation)
Classification—A Two-Step Process
n Model construction: describing a set of predetermined classes
n Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
n The set of tuples used for model construction is training set
n The model is represented as classification rules, decision trees,
or mathematical formulae
n Model usage: for classifying future or unknown objects
n Estimate accuracy of the model
n The known label of test sample is compared with the
classified result from the model
n Accuracy rate is the percentage of test set samples that are
correctly classified by the model
n Test set is independent of training set, otherwise over-fitting
will occur
n If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
In practice, much more than this simple view
Classification Process (1):
Preprocessing & Feature Engineering

Raw Data:
    EID   Name    Title               EMP_Date     Track
    101   Mike    Assistant Prof      2013-01-01   3-year contract
    102   Mary    Assistant Prof      2009-04-15   Continuing
    103   Bill    Scientia Professor  2014-02-03   Continuing
    110   Jim     Associate Prof      2009-03-14   Continuing
    121   Dave    Associate Prof      2009-12-02   1-year contract
    234   Anne    Professor           2013-03-21   Future Fellow
    188   Sarah   Student Officer     2008-01-17   Continuing

è Training Data (class label attribute: TENURED):
    Gender   RANK             YEARS   TENURED
    M        Assistant Prof   3       no
    F        Assistant Prof   7       yes
    M        Professor        2       yes
    M        Associate Prof   7       yes
    M        Associate Prof   6       no
    F        Professor        3       no
Classification Process (2): Training

Training Data (class label attribute: TENURED), together with the model’s predictions:
    Gender   RANK             YEARS   TENURED   Predicted
    M        Assistant Prof   3       no        no
    F        Assistant Prof   7       yes       yes
    M        Professor        2       yes       yes
    M        Associate Prof   7       yes       yes
    M        Associate Prof   6       no        yes
    F        Professor        3       no        yes

Training data + classification algorithm (with parameters θ) è Classifier (Model):
    IF rank = ‘professor’ OR years ≥ 6 THEN tenured = ‘yes’

training error = 33.3% (2 of the 6 training tuples are misclassified)
Classification Process (3): Evaluate the
Model on Testing Data
Classifier
(Model)
Testing Data
(M, Jeff, Professor, 4, yes)
(F, Merlisa, Asso Prof, 5, yes)
IF rank = ‘professor’
OR years ≥ 6
THEN tenured = ‘yes’
testing error = 50%
Classification Process (4): Use the Model
in Production
Classifier
(Model)
IF rank = ‘professor’ OR years ≥ 6 THEN tenured = ‘yes’

Unseen Data: (Pedro, Assi Prof, 8, ?)
How to judge a model?
n Based on training error or testing error?
n Testing error
n Otherwise, this is a kind of data snooping è overfitting
n What if there are multiple models to choose from?
n Use the validation/development set to choose among them
n Can we trust the error values on the development set?
n Need “large” dev set è less data for training
n k-fold cross-validation
n k=n: leave-one-out

(Figure: 10-fold CV. All the labelled data is first split into training data and a test set; a “test/development set” is further split from the training set.)
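A minimal k-fold cross-validation sketch with scikit-learn; the data and the decision-tree model are illustrative assumptions.

# Sketch: 10-fold cross-validation to estimate generalization error
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 4)                    # toy data
y = (X[:, 0] > 0.5).astype(int)

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)   # 10-fold CV accuracies
print("mean CV error:", 1.0 - scores.mean())
# leave-one-out (k = n) is available via sklearn.model_selection.LeaveOneOut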
Exercise: Problem definition and
Feature Engineering
n How to formulate the following into ML problems?
What kind of resources do you need? What are
the features you think may be most relevant?
1. Predict the sale trend of a particular product
in the next month
2. Design an algorithm to produce better top-10
ranking results for queries
Supervised vs. Unsupervised
Learning
n Supervised learning (classification)
n Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
n New data is classified based on the training set
n Unsupervised learning (clustering)
n The class labels of training data are unknown
n Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
n Decision Tree classifier
n ID3
n Other variants
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows an example from Quinlan’s ID3.
Output: A Decision Tree for “buys_computer”

age?
  <=30:   student?
            no  -> no
            yes -> yes
  30..40: -> yes
  >40:    credit rating?
            excellent -> no
            fair      -> yes

(The figure also shows the training data partitioned at the root: the age <= 30 tuples under the left branch, the 31…40 tuples under the middle branch, which are all “yes”, and the >40 tuples under the right branch.)
Extracting Classification Rules from Trees
n Represent the knowledge in the form of IF-THEN rules
n Rules are easier for humans to understand
n One rule is created for each path from the root to a leaf
n Each attribute-value pair along a path forms a conjunction
n The leaf node holds the class prediction
n Example
IF age = “<=30” AND student = “no”  THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
…
Algorithm for Decision Tree Induction
n Basic algorithm (a greedy algorithm)
n Input: Attributes are categorical (if continuous-valued, they are
discretized in advance)
n Overview: Tree is constructed in a top-down recursive divide-and-
conquer manner
n At start, all the training examples are at the root
n Samples are partitioned recursively based on selected test-
attributes
n Test-attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
n Three conditions for stopping partitioning (i.e., boundary conditions)
n There are no samples left, OR
n All samples for a given node belong to the same class, OR
n There are no remaining attributes for further partitioning
(majority voting is employed for classifying the leaf)
Exercise: Write down the pseudo-code of the induction algorithm
Decision Tree Induction Algorithm
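As suggested by the exercise above, here is one possible sketch of the greedy, top-down induction procedure in Python; the dictionary-based row representation and the pluggable attribute-selection function are assumptions for illustration, not the exact algorithm figure.

# Sketch: top-down, recursive divide-and-conquer decision-tree induction
from collections import Counter

def build_tree(rows, attributes, select_attribute):
    # rows: list of dicts with attribute values plus a "class" key
    labels = [r["class"] for r in rows]
    if not rows:                                   # boundary: no samples left
        return None
    if len(set(labels)) == 1:                      # boundary: all samples in one class
        return labels[0]
    if not attributes:                             # boundary: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(rows, attributes)      # heuristic, e.g. information gain
    node = {"attribute": best, "children": {},
            "majority": Counter(labels).most_common(1)[0][0]}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):       # partition rows by the chosen attribute
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = build_tree(subset, remaining, select_attribute)
    return node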
Attribute Selection Measure:
Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n S contains si tuples of class Ci for i = {1, …, m}
n I(s1, …, sm) measures the information required to classify an arbitrary tuple:
    I(s_1, \dots, s_m) = - \sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
n entropy of attribute A with values {a1, a2, …, av}:
    E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj})
n information gained by branching on attribute A:
    Gain(A) = I(s_1, \dots, s_m) - E(A)
Attribute Selection by Information
Gain Computation
g Class P: buys_computer = “yes”
g Class N: buys_computer = “no”
g I(p, n) = I(9, 5) = 0.940
g Compute the entropy for age:

    age      pi   ni   I(pi, ni)
    <=30     2    3    0.971
    30…40    4    0    0
    >40      3    2    0.971

The term \frac{5}{14} I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

    E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

    Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

    income   pi   ni   I(pi, ni)
    high     2    2    1
    medium   4    2    0.918
    low      3    1    0.811

Q: what’s the extreme/worst case?
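A small Python sketch that reproduces these numbers; the age/class pairs are typed in from the training table above, purely for illustration.

# Sketch: entropy I(.), E(age) and Gain(age) for the buys_computer data
import math
from collections import Counter, defaultdict

rows = [("<=30", "no"), ("<=30", "no"), ("31...40", "yes"), (">40", "yes"), (">40", "yes"),
        (">40", "no"), ("31...40", "yes"), ("<=30", "no"), ("<=30", "yes"), (">40", "yes"),
        ("<=30", "yes"), ("31...40", "yes"), ("31...40", "yes"), (">40", "no")]  # (age, class)

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

I_total = info(list(Counter(c for _, c in rows).values()))            # I(9, 5) = 0.940

by_age = defaultdict(Counter)
for age, c in rows:
    by_age[age][c] += 1
E_age = sum(sum(cnt.values()) / len(rows) * info(list(cnt.values())) for cnt in by_age.values())

print(round(I_total, 3), round(E_age, 3), round(I_total - E_age, 3))  # 0.94 0.694 0.246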
Other Attribute Selection Measures
and Splitting Choices
n Gini index (CART, IBM IntelligentMiner)
n All attributes are assumed continuous-valued
n Assume there exist several possible split values for
each attribute
n May need other tools, such as clustering, to get the
possible split values
n Can be modified for categorical attributes
n Induces binary split => binary decision trees
Gini Index (IBM IntelligentMiner)
n If a data set T contains examples from n classes, the gini index gini(T) is defined as
    gini(T) = 1 - \sum_{j=1}^{n} p_j^2
  where pj is the relative frequency of class j in T.
n If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
    gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)
n The attribute that provides the smallest gini_{split}(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
Case I: Numerical Attributes
Age Car Class
20 … Y
20 … N
20 … N
25 … N
25 … Y
30 … Y
30 … Y
30 … Y
40 … Y
40 … Y
Candidate split values: 22.5, 27.5, …

    Cut = 22.5        Y   N
    Age <  22.5       1   2
    Age >= 22.5       6   1

    Cut = 27.5        Y   N
    Age <  27.5       2   3
    Age >= 27.5       5   0

Gini(S) = 1 - \sum_j p_j^2
Gini_{split}(S) = \frac{n_1}{n} Gini(S_1) + \frac{n_2}{n} Gini(S_2)

Gini_{split}(S), cut = 22.5:  3/10 · Gini(1,2) + 7/10 · Gini(6,1) = 0.30
Gini_{split}(S), cut = 27.5:  5/10 · Gini(2,3) + 5/10 · Gini(5,0) = 0.24
…
Exercise: compute gini indexes for other splits
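A short Python sketch for the exercise; the helper functions and count pairs are illustrative, with class counts written as (#Y, #N).

# Sketch: Gini index of a binary split on a numeric attribute (e.g., the Age data above)
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    n = sum(left_counts) + sum(right_counts)
    return (sum(left_counts) / n) * gini(left_counts) + (sum(right_counts) / n) * gini(right_counts)

print(f"{gini_split([1, 2], [6, 1]):.2f}")   # cut = 22.5 -> 0.30
print(f"{gini_split([2, 3], [5, 0]):.2f}")   # cut = 27.5 -> 0.24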
Case II: Categorical Attributes
Count matrix (attribute list for Car):
    Car   Class=Y   Class=N
    M     5         0
    T     0         2
    S     2         1

Candidate binary groupings:

            Y   N
    {M, T}  5   2
    {S}     2   1
    Gini_{split}(S) = 7/10 · Gini(5,2) + 3/10 · Gini(2,1) = 0.42

            Y   N
    {M, S}  7   1
    {T}     0   2
    Gini_{split}(S) = 8/10 · Gini(7,1) + 2/10 · Gini(0,2) = 0.18

            Y   N
    {T, S}  2   3
    {M}     5   0
    Gini_{split}(S) = 5/10 · Gini(2,3) + 5/10 · Gini(5,0) = 0.24
Need to consider all possible splits !
Age Car Class
20 M Y
30 M Y
25 T N
30 S Y
40 S Y
20 T N
30 M Y
25 M Y
40 M Y
20 S N
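One way to enumerate all candidate groupings of a categorical attribute is sketched below; the count dictionary is copied from the matrix above and each split appears twice (once per side), which is harmless for finding the minimum.

# Sketch: enumerate binary groupings of a categorical attribute and compute Gini_split
from itertools import combinations

counts = {"M": (5, 0), "T": (0, 2), "S": (2, 1)}     # Car value -> (#Y, #N)

def gini(y, n):
    t = y + n
    return 1.0 - (y / t) ** 2 - (n / t) ** 2 if t else 0.0

values = list(counts)
for r in range(1, len(values)):                      # every non-empty proper subset
    for left in combinations(values, r):
        right = [v for v in values if v not in left]
        ly = sum(counts[v][0] for v in left);  ln = sum(counts[v][1] for v in left)
        ry = sum(counts[v][0] for v in right); rn = sum(counts[v][1] for v in right)
        total = ly + ln + ry + rn
        g = (ly + ln) / total * gini(ly, ln) + (ry + rn) / total * gini(ry, rn)
        print(set(left), "vs", set(right), f"{g:.3f}")   # 0.419, 0.175, 0.240 (as on the slide, rounded)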
ID3
age?
  <=30:   student?
            no  -> no
            yes -> yes
  30..40: -> yes
  >40:    credit rating?
            excellent -> no
            fair      -> yes

(The figure shows the training data recursively partitioned by the selected attributes: the full table at the root, with the age <= 30, 31…40 and >40 subsets under the corresponding branches.)
CART/SPRINT
(Figure, illustrative of the shape only: CART/SPRINT induces a binary tree, e.g. a root split on age <= 30 vs. > 30, followed by further binary splits such as student?, credit rating?, and possibly age again.)
Avoid Overfitting in Classification
n Overfitting: An induced tree may overfit the training data
n Too many branches, some may reflect anomalies due
to noise or outliers
n Poor accuracy for unseen samples
n Two approaches to avoid overfitting
n Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
n Difficult to choose an appropriate threshold
n Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
n Use a set of data different from the training data to
decide which is the “best pruned tree”
Overfitting Example /1
Name        Body Temperature   Gives Birth   Four-legged   Hibernates   Is Mammal?
Salamander Cold-blooded No Yes Yes No
Guppy Cold-blooded Yes No No No
Eagle Warm-blooded No No No No
Poorwill Warm-blooded No No Yes No
Platypus Warm-blooded No Yes Yes Yes
• Lack of representative samples
• Existence of noise
Overfitted tree (training error = 0%, but does not generalize!):
    Body temperature?
      cold -> no
      warm -> Hibernates?
                no  -> no
                yes -> Four-legged?
                         yes -> yes
                         no  -> no
Unseen mammals such as Human and Dog (warm-blooded, do not hibernate) would be classified as non-mammals.
Overfitting Example /2
Name        Body Temperature   Gives Birth   Four-legged   Hibernates   Is Mammal?
Salamander Cold-blooded No Yes Yes No
Guppy Cold-blooded Yes No No No
Eagle Warm-blooded No No No No
Poorwill Warm-blooded No No Yes No
Platypus Warm-blooded No Yes Yes Yes
• Lack of representative samples
• Existence of noise
Simpler tree (training error = 20%, but generalizes!):
    Body temperature?
      cold -> no
      warm -> Gives Birth?
                yes -> yes
                no  -> no
It misclassifies Platypus in the training data, but classifies unseen mammals such as Human and Dog correctly.
Overfitting
n Overfitting: model too complex è training error keeps decreasing, but testing error increases
n Underfitting: model too simple è both training and testing errors are large
Overfitting
Overfitting examples in Regression &
Classification
DT Pruning Methods
n Use a separate validation set
n Estimation of generalization/test errors
n Use all the data for training
n but apply a statistical test (e.g., chi-square)
to estimate whether expanding or pruning a
node may improve the entire distribution
n Use minimum description length (MDL) principle
n halting growth of the tree when the encoding
is minimized
Pessimistic Post-pruning
n Observed on the training data
n e(t): #errors on a leaf node t of the tree T
n e(T) = ∑t in T e(t)
n What’s the generalization error (i.e., error on testing data) of T?
n Use pessimistic estimates
n e′(t) = e(t) + 0.5
n e′(T) = e(T) + 0.5·N, where N is the number of leaf nodes in T
n What’s the generalization error if T is collapsed into a single leaf (its root)?
n e′(root(T)) = e(root(T)) + 0.5, where e(root(T)) is the number of training errors made when predicting the majority class at that node
n Post-pruning from bottom-up
n If generalization error reduces after pruning, replace sub-tree by a
leaf node
n Use majority voting to decide the class label
Better estimate of generalization
error in C4.5:
Use the upper 75% confidence
bound from the training error of a
node, assuming a binomial
distribution
Example
Before splitting on A (single node): Class = Yes: 20, Class = No: 10; error = 10/30
After splitting on A into A1 … A4:
    A1: Yes 8, No 4     A2: Yes 3, No 4     A3: Yes 4, No 1     A4: Yes 5, No 1

n Training error before splitting on A = 10/30
n Pessimistic error = (10 + 0.5)/30 = 10.5/30
n Training error after splitting on A = 9/30
n Pessimistic error = (9 + 4 × 0.5)/30 = 11/30
è Prune the subtree at A
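A few lines of Python reproduce the pessimistic-pruning decision above; the numbers are typed in from the example and the helper names are illustrative.

# Sketch: pessimistic-error comparison for the example above
leaf_errors = [4, 3, 1, 1]                      # misclassified training instances in leaves A1..A4
n_total = 30
err_subtree = (sum(leaf_errors) + 0.5 * len(leaf_errors)) / n_total   # (9 + 4*0.5)/30
err_pruned = (10 + 0.5) / n_total                                     # node collapsed to one leaf
print("prune" if err_pruned <= err_subtree else "keep")               # -> prune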
Classification in Large Databases
n Classification—a classical problem extensively studied by
statisticians and machine learning researchers
n Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
n Why decision tree induction in data mining?
n relatively faster learning speed (than other classification
methods)
n convertible to simple and easy to understand
classification rules
n can use SQL queries for accessing databases
n comparable classification accuracy with other methods
Enhancements to basic decision
tree induction
n Allow for continuous-valued attributes
n Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals, one-hot encoding, or specialized DT
learning algorithms
n Handle missing attribute values
n Assign the most common value of the attribute
n Assign probability to each of the possible values
n Attribute construction
n Create new attributes based on existing ones that are
sparsely represented
n This reduces fragmentation, repetition, and replication
n Bayesian Classifiers
Bayesian Classification: Why?
n Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
n Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
n Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
n Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayesian Theorem: Basics
n Let X be a data sample whose class label is unknown
n Let h be a hypothesis that X belongs to class C
n For classification problems, determine P(h|X): the
probability that the hypothesis holds given the observed
data sample X
n P(h): prior probability of hypothesis h (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
n P(X): probability that sample data is observed
n P(X|h) : probability of observing the sample X, given that
the hypothesis holds
n Given training data X, posteriori probability of a hypothesis
h, P(h|X) follows the Bayes theorem
n Informally, this can be written as
posterior = likelihood x prior / evidence
n MAP (maximum posteriori) hypothesis
n Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
Bayesian Theorem
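In the notation of the preceding slide (hypothesis h, observed data sample X), Bayes’ theorem and the MAP hypothesis take their standard forms:

P(h \mid X) = \frac{P(X \mid h)\, P(h)}{P(X)}

h_{MAP} = \arg\max_{h} P(h \mid X) = \arg\max_{h} P(X \mid h)\, P(h)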
Training dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
30…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Hypotheses:
C1:buys_computer=
‘yes’
C2:buys_computer=
‘no’
Data sample
X =(age<=30,
Income=medium,
Student=yes,
Credit_rating=
Fair)
Sparsity Problem
n Maximum likelihood Estimate of P(h)
n Let p be the probability that the class is C1
n Consider a training example x and its label y
n L(x) = p^y (1 - p)^{1 - y}
n Data likelihood: L(X) = \prod_{x \in X} L(x)
n Log data likelihood: \ell(X) = \log L(X) = \sum_{x \in X} \big[ y \log p + (1 - y) \log(1 - p) \big]
n To maximize \ell(X), let d\ell(X)/dp = 0 è p = (\sum y) / n
n P(C1) = 9/14, P(C2) = 5/14
n ML estimate of P(X|h) = ?
n Requires O(2^d) training examples, where d is the #features (curse of dimensionality)
Naïve Bayes Classifier
n Use a model
n Assumption: attributes are conditionally independent given the class:
n e.g., for two attribute values x1 and x2 and class C, P([x1, x2] | C) = P(x1 | C) · P(x2 | C)
n No dependence relation between attributes given the class
n Greatly reduces the computation cost: only class-conditional counts are needed è only need to estimate each P(xk | Ci)
    P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
Naïve Bayesian Classifier: Example
n Compute P(X|Ci) for each class
X=(age<=30 , income =medium, student=yes, credit_rating=fair)
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
P(X|Ci): P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)·P(Ci): P(X|buys_computer=“yes”) · P(buys_computer=“yes”) = 0.028
P(X|buys_computer=“no”) · P(buys_computer=“no”) = 0.007
è X belongs to class “buys_computer = yes”
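A few lines of Python reproduce these numbers; the probabilities are typed in from the table above, purely as a check.

# Sketch: naive Bayes score for X = (age<=30, income=medium, student=yes, credit_rating=fair)
p_yes = (9/14) * (2/9) * (4/9) * (6/9) * (6/9)   # P(yes) * product of class-conditional probs ~ 0.028
p_no = (5/14) * (3/5) * (2/5) * (1/5) * (2/5)    # P(no)  * product of class-conditional probs ~ 0.007
print("yes" if p_yes > p_no else "no")           # -> yes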
The Need for Smoothing
n Pr[Xi = vj | Ck] could still be 0, if not observed in
the training data
n makes Pr[Ck | X] = 0, regardless of the other likelihood values Pr[Xt = vt | Ck]
n Add-1 Smoothing
n reserve a small amount of probability for
unseen probabilities
n (conditional) probabilities of observed events
have to be adjusted to make the total
probability equals 1.0
Add-1 Smoothing
n Pr[Xi = vj | Ck] = Count(Xi = vj, Ck) / Count(Ck)
n Pr[Xi = vj | Ck] = [Count(Xi = vj, Ck) +1] / [Count(Ck) +B]
n What’s the right value for B?
n make ∑vj Pr[Xi = vj | Ck] = 1
n Explain the above constraint
n Pr[Xi | Ck]: Given Ck, the (conditional) probability of Xi taking a
specific value
n Xi must take one of the values, hence summing over all
possible value for Xi, the probabilities should sum up to 1.0
n B = |dom(Xi)|, i.e., the # of values Xi can take
Many other kinds of smoothing exist (esp. in Natural Language Modelling)
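A possible sketch of add-1 smoothed estimation; the age counts below match the example that follows, everything else is illustrative.

# Sketch: add-1 (Laplace) smoothed estimates of Pr[Xi = vj | Ck]
from collections import Counter

def smoothed_conditionals(values_in_class, domain):
    # values_in_class: attribute values observed in the training rows of class Ck
    counts = Counter(values_in_class)
    B = len(domain)                              # B = |dom(Xi)|
    n = len(values_in_class)                     # Count(Ck)
    return {v: (counts[v] + 1) / (n + B) for v in domain}

# the five "No"-class age values from the example on the next slide (none are "<=30")
print(smoothed_conditionals(["31...40", "31...40", ">40", "31...40", ">40"],
                            domain=["<=30", "31...40", ">40"]))
# -> {'<=30': 0.125, '31...40': 0.5, '>40': 0.375}; the three probabilities sum to 1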
Smoothing Example
age income student credit_rating buys_computer
30…40 high no fair no
30…40 high no excellent no
30…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
30…40 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Class:
C1:buys_computer=
‘yes’
C2:buys_computer=
‘no’
Data sample
X =(age<=30,
Income=medium,
Student=yes,
Credit_rating=
Fair)
Note: there is no instance of age <= 30 in the “No” class.
B values for add-1 smoothing: age B=3, income B=3, student B=2, credit_rating B=2
Consider the “No” Class
n Compute P(X|Ci) for the “No” class
X=(age<=30 , income =medium, student=yes, credit_rating=fair)
P(age=“<30” | buys_computer=“no”) = (0 +1) / (5 +3) = 0.125
P(income=“medium” | buys_computer=“no”) = (2 +1) / (5 +3) = 0.375
P(student=“yes” | buys_computer=“no”)= (1 +1) / (5 +2) = 0.286
P(credit_rating=“fair” | buys_computer=“no”)= (2 +1) / (5 +2) = 0.429
P(X|Ci) : P(X|buys_computer=“no”)= 0.125 x 0.375 x 0.286 x 0.429 = 0.00575
P(X|Ci)*P(Ci ) : P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.00205
Probability          Without Smoothing    With Smoothing
Pr[<=30 | No]        0 / 5                1 / 8
Pr[30..40 | No]      3 / 5                4 / 8
Pr[>40 | No]         2 / 5                3 / 8
(with smoothing, the three estimates sum up to 1)
How to Handle Numeric Values
n Need to model the distribution of Pr[Xi | Ck]
n Method 1:
n Assume it to be Gaussian è Gaussian Naïve
Bayes
n Pr[X_i = v_j \mid C_k] = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\!\left( -\frac{(v_j - \mu_{ik})^2}{2\sigma_{ik}^2} \right)
n Method 2:
n Use binning to discretize the feature values
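A minimal scikit-learn sketch of Method 1; the numeric features and labels are made up for illustration.

# Sketch: Gaussian naive Bayes on numeric features
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[25, 40000], [45, 90000], [30, 52000], [50, 110000]], dtype=float)  # e.g. age, income
y = np.array([0, 1, 0, 1])                       # toy class labels
clf = GaussianNB().fit(X, y)                     # estimates a per-class mean/variance for each feature
print(clf.predict([[28, 48000]]))                # -> [0]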
Text Classification
n NB has been widely used in text classification
(aka., text categorization)
n Outline:
n Applications
n Language Model
n Two Classification Methods
Based on “Chap 13: Text classification & Naive Bayes”
in Introduction to Information Retrieval
•http://nlp.stanford.edu/IR-book/
Document Classification

Classes (grouped in the figure under (AI), (Programming) and (HCI)):
    ML, Planning, Semantics, Garb.Coll., Multimedia, GUI

Training Data (characteristic words for each class):
    ML:          learning, intelligence, algorithm, reinforcement, network, …
    Planning:    planning, temporal, reasoning, plan, language, …
    Semantics:   programming, semantics, language, proof, …
    Garb.Coll.:  garbage, collection, memory, optimization, region, …

Test Data: “planning language proof intelligence”

(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)  [IIR 13.1]
More Text Classification Examples:
Many search engine functionalities use classification
Assign labels to each document or web-page:
n Labels are most often topics such as Yahoo-categories
  e.g., “finance,” “sports,” “news>world>asia>business”
n Labels may be genres
  e.g., “editorials”, “movie-reviews”, “news”
n Labels may be opinion on a person/product
  e.g., “like”, “hate”, “neutral”
n Labels may be domain-specific
  e.g., “interesting-to-me” : “not-interesting-to-me”
  e.g., “contains adult language” : “doesn’t”
  e.g., language identification: English, French, Chinese, …
  e.g., search vertical: about Linux versus not
  e.g., “link spam” : “not link spam”
Challenge in Applying NB to Text
n Need to model P(text | class)
n e.g., P(“a b c d” | class)
n Extremely sparse è Need a model to help
n 1st method:
n Based on a statistical language model
n Turns out to be the bag of words model if using unigrams
n è multinomial NB
n 2nd method:
n View a text as a set of tokens è a Boolean vector in
{0, 1}|V|, where V is the vocabulary.
n è Bernoulli NB
Data Mining: Concepts and Techniques 59
sklearn.naive_bayes.MultinomialNB
sklearn.naive_bayes.BernoulliNB
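The first method maps directly onto the scikit-learn classes named above; a minimal sketch, where the documents and labels are made-up examples.

# Sketch: multinomial NB text classifier over bag-of-words counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the election campaign in iowa", "the team won the basketball game",
        "senate vote on the new bill", "final score of the football match"]
labels = ["politics", "sport", "politics", "sport"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())   # word counts -> multinomial NB
clf.fit(docs, labels)
print(clf.predict(["campaign vote tonight"]))             # likely ['politics']
# swapping MultinomialNB for BernoulliNB gives the second (Boolean-vector) method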
Unigram and higher-order Language
models in Information Retrieval
n P(“a b c d”) =
n P(a) * P(b | a) * P(c | a b) * P(d | a b c)
n Unigram model: 0-th order Markov Model
n P(d | a b c) = P(d)
n Bigram Language Models: 1st order Markov
Model
n P(d | a b c) = P(d | c)
n P(a b c d) = P(a) * P(b | a) * P(c | b) * P(d | c)
n The same with class-conditional probabilities,
i.e., P(“a b c d” | C)
Easy.
Effective!
Naïve Bayes via a class conditional
language model = multinomial NB
n Effectively, the probability of each class is
done as a class-specific unigram language
model
Class
w1 w2 w3 w4 w5 w6
Probabilistic Graphical Model: arrow means conditional
dependency.
Using Multinomial Naive Bayes
Classifiers to Classify Text: Basic method
n Essentially, the classification is independent of the positions of the words
n Use the same parameters for each position
n Result is the bag-of-words model, i.e. text \equiv \{x_i : \theta_i\}_{i=1}^{|V|}

P(C_j \mid w_1 w_2 w_3 w_4) \propto P(C_j)\, P(w_1 w_2 w_3 w_4 \mid C_j)
    \propto P(C_j) \prod_{i=1}^{4} P(w_i \mid C_j)                (unigram model; an example text of 4 tokens)
    \propto P(C_j) \prod_{i=1}^{|V|} P(x_i \mid C_j)^{\theta_i}   (bag of words; multinomial)
e.g., “to be or not to be”
Naïve Bayes: Learning
n From training corpus, extract V = Vocabulary
n Calculate required P(cj) and P(xk | cj) terms
n For each cj in C do
n docsj ← subset of documents for which the target class is cj
n P(cj) ← |docsj| / |total # documents|
n Textj ← single document containing all docsj
n for each word xk in Vocabulary
n nk ← number of occurrences of xk in Textj
n P(xk | cj) ← (nk + α) / (n + α·|Vocabulary|), where n is the total number of word positions in Textj
Naïve Bayes: Classifying
n positions ¬ all word positions in current document
which contain tokens found in Vocabulary
n Return cNB, where
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)
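Putting the learning and classification rules together, a compact from-scratch sketch; the toy training documents are invented, and log-probabilities are used (which also avoids the underflow issue discussed shortly).

# Sketch: multinomial naive Bayes with add-alpha smoothing, log-space scoring
import math
from collections import Counter, defaultdict

def train(docs, alpha=1.0):
    # docs: list of (list_of_tokens, class_label)
    vocab = {w for tokens, _ in docs for w in tokens}
    prior, word_counts, total = {}, defaultdict(Counter), Counter()
    n_docs = Counter(c for _, c in docs)
    for tokens, c in docs:
        word_counts[c].update(tokens)                  # word counts in the class "mega-document"
    for c in n_docs:
        prior[c] = n_docs[c] / len(docs)               # P(cj)
        total[c] = sum(word_counts[c].values())        # n: word positions in Textj
    return vocab, prior, word_counts, total, alpha

def classify(tokens, model):
    vocab, prior, word_counts, total, alpha = model
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in tokens:
            if w in vocab:                             # skip out-of-vocabulary tokens
                p = (word_counts[c][w] + alpha) / (total[c] + alpha * len(vocab))
                score += math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)                 # argmax of log P(c) + sum of log P(w|c)

model = train([("planning temporal reasoning plan".split(), "Planning"),
               ("learning intelligence algorithm".split(), "ML")])
print(classify("planning temporal plan".split(), model))   # -> Planning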
Naive Bayes: Time Complexity
n Training Time: O(|D|·Ld + |C|·|V|)
where Ld is the average length of a document in D.
n Assumes V and all Di , ni, and nij pre-computed in O(|D|Ld)
time during one pass through all of the data.
n Generally just O(|D|Ld) since usually |C||V| < |D|Ld
n Test Time: O(|C| Lt) where Lt is the average length of a test document.
n Very efficient overall, linearly proportional to the time needed to just read in all the data. Why?

Underflow Prevention: log space
n Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
n Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
n Class with highest final un-normalized log probability score is still the most probable.
n Note that the model is now just a max of a sum of weights:
    c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \Big]

Bernoulli Model
n V = {a, b, c, d, e} = {x1, x2, x3, x4, x5}
n Feature functions fi(text) = 1 if text contains xi, 0 otherwise
n The feature functions extract a vector in {0, 1}^{|V|} from any text
n Apply NB directly

    Text        a  b  c  d  e  C
    “a b c d”   1  1  1  1  0  +
    “d e b”     0  1  0  1  1  -

Two Models /1
n Model 1: Multinomial = Class conditional unigram
n One feature Xi for each word position in document
n feature’s values are all words in dictionary
n Value of Xi is the word in position i
n Naïve Bayes assumption:
n Given the document’s topic, word in one position in the document tells us nothing about words in other positions
n Second assumption:
n Word appearance does not depend on position
    P(X_i = w \mid c) = P(X_j = w \mid c)  for all positions i, j, word w, and class c
n Just have one multinomial feature predicting all words

Two Models /2
n Model 2: Multivariate Bernoulli
n One feature Xw for each word in dictionary
n Xw = true in document d if w appears in d
n Naive Bayes assumption:
n Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears
n This is the model used in the binary independence model in classic probabilistic relevance feedback in hand-classified data (Maron in IR was a very early user of NB)

Parameter estimation
n Multivariate Bernoulli model:
    \hat{P}(X_w = t \mid c_j) = fraction of documents of topic cj in which word w appears
n Multinomial model:
    \hat{P}(X_i = w \mid c_j) = fraction of times in which word w appears across all documents of topic cj
n Can create a mega-document for topic j by concatenating all documents in this topic
n Use frequency of w in mega-document

Classification
n Multinomial vs Multivariate Bernoulli?
n Multinomial model is almost always more effective in text applications!
n See results figures later
n See IIR sections 13.2 and 13.3 for worked examples with each model

Feature selection for NB
n In general feature selection is necessary for multivariate Bernoulli NB.
n Otherwise you suffer from noise, multi-counting
n “Feature selection” really means something different for multinomial NB. It means dictionary truncation
n The multinomial NB model only has 1 feature
n This “feature selection” normally isn’t needed for multinomial NB, but may help a fraction with quantities that are badly estimated

NB Model Comparison: WebKB
(results figure)

Naïve Bayes on spam email
(results figure; IIR 13.6)

Violation of NB Assumptions
n Conditional independence
n “Positional independence”
n Examples?

n Two identical sensors at the same location
n Equivalent training dataset
n Note: P(s1|C) = P(s2|C) no matter what

Observations (S1, S2, Class), with r = Raining and s = Sunny:
    (+, +, r)  (+, +, r)  (+, +, r)  (-, -, r)
    (+, +, s)  (-, -, s)  (-, -, s)  (-, -, s)
Joint distribution: P(+,+,r) = 3/8, P(-,-,r) = 1/8, P(+,+,s) = 1/8, P(-,-,s) = 3/8

P(si = + | r) = ?    P(si = - | r) = ?
P(si = + | s) = ?    P(si = - | s) = ?
P(r) = ?             P(s) = ?

n P(r | ++) = ?
n P(s | ++) = ?

Prediction: S1 = +, S2 = +, Class = ???
P(si = + | r) = 3/4    P(si = - | r) = 1/4
P(si = + | s) = 1/4    P(si = - | s) = 3/4
P(r) = 1/2             P(s) = 1/2

n P(r | ++) ∝ P(r) · P(s1 = + | r) · P(s2 = + | r) = 9/32
n P(s | ++) ∝ P(s) · P(s1 = + | s) · P(s2 = + | s) = 1/32
n Normalized: P(r | ++) = 9/10, P(s | ++) = 1/10
n (The empirical posterior from the training data is P(r | ++) = (3/8) / (3/8 + 1/8) = 3/4, so the NB estimate of 9/10 is too extreme.)

Problem: posterior probability estimation not accurate
Reason: correlation between features
Fix: use a logistic regression classifier

Naïve Bayes Posterior Probabilities
n Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
n However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
n Output probabilities are commonly very close to 0 or 1.
n Correct estimation Þ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (just need the right ordering of probabilities)

Naïve Bayesian Classifier: Comments
n Advantages:
n Easy to implement
n Good results obtained in most of the cases
n Disadvantages:
n Assumption: class conditional independence, therefore loss of accuracy
n Practically, dependencies exist among variables
n E.g., hospital patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
n Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
n Better methods?
n Bayesian Belief Networks
n Logistic regression / maxent

n (Linear Regression and) Logistic Regression Classifier
n See LR slides

n Other Classification Methods

Instance-Based Methods
n Instance-based learning:
n Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
n Typical approaches
n k-nearest neighbor approach
n Instances represented as points in a Euclidean space.

The k-Nearest Neighbor Algorithm
n All instances correspond to points in the n-D space.
n The nearest neighbors are defined in terms of Euclidean distance.
n The target function could be discrete- or real-valued.
n For discrete-valued, the k-NN returns the most common value among the k training examples nearest to xq.
n Voronoi diagram: the decision boundary induced by 1-NN for a typical set of training examples.

Effect of k
• k acts as a smoother

Discussion on the k-NN Algorithm
n The k-NN algorithm for continuous-valued target functions
n Calculate the mean values of the k nearest neighbors
n Distance-weighted nearest neighbor algorithm
n Weight the contribution of each of the k neighbors according to their distance to the query point xq
n giving greater weight to closer neighbors, e.g., w \equiv \frac{1}{d(x_q, x_i)^2}
n Similarly, for real-valued target functions
n Robust to noisy data by averaging k-nearest neighbors
n Curse of dimensionality:
n kNN search becomes very expensive in high dimensional space
n High-dimensional indexing methods, e.g., LSH
n distance between neighbors could be dominated by irrelevant attributes.
n To overcome it, axes stretch or elimination of the least relevant attributes.

Remarks on Lazy vs. Eager Learning
n Instance-based learning: lazy evaluation
n Decision-tree and Bayesian classification: eager evaluation
n Key differences
n Lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
n Eager method cannot, since it has already chosen its global approximation before seeing the query
n Efficiency: Lazy - less time training but more time predicting
n Accuracy
n Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
n Eager: must commit to a single hypothesis that covers the entire instance space
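As a concrete instance of the lazy, instance-based approach discussed above, a minimal k-NN sketch with Euclidean distance and majority voting; the training points and query are entirely illustrative.

# Sketch: k-nearest-neighbour classification (lazy: all work happens at query time)
import math
from collections import Counter

def knn_predict(query, examples, k=3):
    # examples: list of (feature_vector, class_label)
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
         ((3.0, 3.2), "-"), ((2.9, 3.1), "-"), ((3.3, 2.8), "-")]
print(knn_predict((1.1, 0.9), train, k=3))   # -> '+'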