Naïve Bayes Classifiers
Jiayu Zhou
Department of Computer Science and Engineering Michigan State University
East Lansing, MI USA
Based on Slides from Eric Xing @ CMU
Types of Classifiers
We can divide the large variety of classification approaches into three major types.
Discriminative: directly estimate a decision rule/boundary (e.g., logistic regression, decision trees)
Generative: build a generative statistical model (e.g., Bayesian networks)
Instance-based classifiers: use the observations directly, with no model (e.g., k-nearest neighbors)
Problem Definition
Consider a supervised classification problem,
We want to approximate a target function f : X → Y, or equivalently, we want to estimate P(Y|X).
Assume the input X is a d-dimensional vector, X = [x_1, x_2, ···, x_d].
By applying Bayes' rule, we can represent P(Y|X) as:
P(Y|X) = P(X|Y) P(Y) / P(X)
Bayes decision rule
If we know the conditional probability P(X|Y), we can determine the appropriate class by using Bayes' rule:
P(Y = y_k | X) = P(X | Y = y_k) P(Y = y_k) / P(X)
But how do we determine P(X|Y)?
Use training data to estimate P(X|Y) and P(Y)!
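To make the rule concrete, here is a minimal sketch in Python with hypothetical, hand-picked values for P(X|Y = y_k) and P(Y = y_k); since P(X) is the same for every class, comparing the numerators is enough.

```python
# Bayes decision rule with hypothetical estimates of P(X | Y = y_k) and P(Y = y_k).
likelihood = {"class_1": 0.20, "class_2": 0.05}  # P(X | Y = y_k), assumed given
prior = {"class_1": 0.30, "class_2": 0.70}       # P(Y = y_k), assumed given

# P(X) cancels out, so we only compare likelihood * prior.
scores = {y: likelihood[y] * prior[y] for y in prior}
prediction = max(scores, key=scores.get)
print(scores)      # class_1: 0.06, class_2: 0.035 (up to float rounding)
print(prediction)  # class_1
```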
Computing p(X|Y)
Consider a dataset with 16 attributes (let's assume they are all binary). How many parameters do we need to estimate to fully determine p(X|Y)?
Learning the values for the full conditional probability table would require
enormous amounts of data.
Computing p(X|Y): Simple Example
Binary feature vectors with d = 3 (all 2^3 = 8 rows shown below), plus a binary output Y ∈ {0, 1}.
Number of parameters required for the d-dimensional case:
θ_{j,k} = P(X = x_j | Y = y_k)  →  2(2^d − 1) parameters
When d = 30, we need to estimate billions of parameters!
x1 x2 x3
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1
1  1  0
1  1  1
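A quick sanity check of this count in Python (each class needs one free parameter per joint configuration of the d binary features, minus one for normalization):

```python
# Free parameters of the full conditional table P(X | Y) with d binary
# features and a binary label Y: (2^d - 1) per class, times 2 classes.
def full_table_params(d: int) -> int:
    return 2 * (2 ** d - 1)

print(full_table_params(3))   # 14, for the 3-attribute table above
print(full_table_params(16))  # 131070, for the 16-attribute question earlier
print(full_table_params(30))  # 2147483646 -- billions, as claimed
```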
Naïve Bayes Classifier
Naïve Bayes classifiers assume that, given the class label Y, the features are conditionally independent of each other:
p(X|Y) = ∏_j p_j(x_j|Y)
p_j: a specific model for attribute j.
Using this idea, the full classification rule becomes:
y* = argmax_{y_k} p(Y = y_k | X)
   = argmax_{y_k} p(X | Y = y_k) p(Y = y_k) / p(X)
   = argmax_{y_k} ∏_j p(x_j | Y = y_k) p(Y = y_k)
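A sketch of the factorized rule, with hypothetical per-feature conditional probability tables; in practice the product is usually computed as a sum of logs to avoid numerical underflow.

```python
import math

def naive_bayes_log_score(x, cond_prob, prior):
    """log( p(Y = y_k) * prod_j p(x_j | Y = y_k) ) for one class.

    cond_prob[j][v] is an assumed lookup table for p(x_j = v | Y = y_k).
    """
    return math.log(prior) + sum(math.log(cond_prob[j][v]) for j, v in enumerate(x))

# Hypothetical tables for one class and three binary features.
cond_prob_k = [{0: 0.3, 1: 0.7}, {0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}]
print(naive_bayes_log_score([1, 0, 1], cond_prob_k, prior=0.5))
```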
Conditional likelihood: Full version
p(x_i | y_i = 1, Θ) = ∏_{j=1}^{d} p(x_{ji} | y_i = 1, θ_{1j})
Θ: the set of all parameters in the NB model.
θ_{1j}: the specific parameters for feature j in class 1.
Note the following:
We assume conditional independence between attributes given the class label.
We learn a different set of parameters for each of the two classes (e.g., class 1 and class 2).
Learning parameters
p(x_i | y_i = 1, Θ) = ∏_{j=1}^{d} p(x_{ji} | y_i = 1, θ_{1j})
Let x_1, . . . , x_{p_1} be the set of input samples with label y = 1.
Assume all attributes are binary.
To determine the MLE parameters for p(x_j | y = 1), we simply count how many times the j-th entry of those samples in class 1 is 0 (termed n_0) and how many times it is 1 (termed n_1). Then we set
p(x_j = 1 | y = 1) = n_1 / (n_0 + n_1).
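A minimal sketch of this counting estimator for one class, assuming the class-1 samples are given as a list of binary vectors (toy data below):

```python
def mle_bernoulli_params(samples_class1):
    """MLE of p(x_j = 1 | y = 1) for each binary attribute j.

    samples_class1: list of equal-length 0/1 lists, all carrying label y = 1.
    """
    n = len(samples_class1)           # n = n_0 + n_1 for every column
    d = len(samples_class1[0])
    # The fraction of 1s in column j is exactly n_1 / (n_0 + n_1).
    return [sum(row[j] for row in samples_class1) / n for j in range(d)]

# Toy example: four class-1 samples with three binary attributes.
samples = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
print(mle_bernoulli_params(samples))  # [0.75, 0.25, 0.75]
```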
Final classification
Once we have computed all parameters for the attributes in both classes, we can easily decide on the label of a new sample x:
ŷ = argmax_{y_k} p(y = y_k | x)
  = argmax_{y_k} p(x | y = y_k) p(y = y_k) / p(x)
  = argmax_{y_k} ∏_j p(x_j | y = y_k) p(y = y_k)
Perform this computation for both class 1 and class 2, and select the class that leads to the higher probability as your decision.
How about p(y = y_k)? It is the prior on the prevalence of samples from each class.
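Putting the pieces together, a sketch of the predict step for binary attributes; the per-class parameters and priors below are hypothetical placeholders standing in for the counting estimates above.

```python
def predict(x, params, priors):
    """Return the class maximizing p(y = y_k) * prod_j p(x_j | y = y_k).

    params[y][j]: estimated p(x_j = 1 | y); priors[y]: estimated p(y).
    """
    best_class, best_score = None, float("-inf")
    for y, theta in params.items():
        score = priors[y]
        for j, v in enumerate(x):
            # p(x_j = v | y) is theta[j] if v == 1, otherwise 1 - theta[j].
            score *= theta[j] if v == 1 else 1.0 - theta[j]
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Hypothetical learned parameters for two classes and three attributes.
params = {1: [0.75, 0.25, 0.75], 2: [0.20, 0.60, 0.10]}
priors = {1: 0.5, 2: 0.5}
print(predict([1, 0, 1], params, priors))  # 1
```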
Example: Text classification
What is the major topic of this article?
Text classification is all around us
Feature transformation
How do we encode the set of features (words) in the document?
What type of information do we wish to represent? What can we ignore?
Most common encoding: “Bag of Words”
Treat the document as a collection of words and encode each document as a vector based on some dictionary.
The vector can either be binary (present/absent information for each word) or discrete (number of appearances).
Google is a good example.
Other applications include job search ads, spam filtering, and many more.
Feature transformation: Bag of Words
In this example we will use a binary vector
For document x_i we will use a vector of m indicator features {φ_j(x_i)} indicating whether each word appears in the document:
φ_j(x_i) = 1 if word j appears in document x_i
φ_j(x_i) = 0 if it does not appear in the document
Φ(x_i) = [φ_1(x_i), …, φ_m(x_i)]^T is the resulting feature vector over the entire dictionary for document x_i.
For notational simplicity we will replace each document x_i with a fixed-length vector Φ_i = [φ_{1i}, …, φ_{mi}]^T, where φ_{ji} = φ_j(x_i).
The size of the vector for English is usually around 10,000 words.
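A minimal sketch of the binary bag-of-words encoding, using a tiny hypothetical dictionary instead of a realistic ~10,000-word one:

```python
def bag_of_words(document: str, dictionary: list[str]) -> list[int]:
    """Binary indicator vector: phi_j = 1 iff dictionary word j appears in the document."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in dictionary]

# Hypothetical 5-word dictionary.
dictionary = ["washington", "congress", "romney", "obama", "nader"]
doc = "Obama and Romney debated in Washington"
print(bag_of_words(doc, dictionary))  # [1, 0, 1, 1, 0]
```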
Example
Assume we would like to classify documents as election related or not.
Dictionary:
· Washington
· Congress
…
54. Romney
55. Obama
56. Nader
For a document x_i that mentions "Romney" and "Obama" but not "Nader":
φ_{54i} = φ_{54}(x_i) = 1, φ_{55i} = φ_{55}(x_i) = 1, φ_{56i} = φ_{56}(x_i) = 0
Example: cont.
We would like to classify documents as election related or not.
Given a collection of documents with their labels (training data) we learn the parameters for our model.
For example, if we see the word "Obama" in n_1 out of the n documents labeled as "election", we set p(obama | election) = n_1/n.
Similarly we compute the priors (p(election)) based on the proportion of the documents from both classes.
Example: Classifying Election (E) or Sports (S)
Assume we learned the following model
P(φ_romney = 1 | E) = 0.8      P(φ_romney = 1 | S) = 0.1
P(φ_obama = 1 | E) = 0.9       P(φ_obama = 1 | S) = 0.05
P(φ_clinton = 1 | E) = 0.9     P(φ_clinton = 1 | S) = 0.05
P(φ_football = 1 | E) = 0.1    P(φ_football = 1 | S) = 0.7
P(E) = 0.5                     P(S) = 0.5
For a specific document we have the following feature vector: φ_romney = 1, φ_obama = 1, φ_clinton = 1, φ_football = 0.
p(y = E | 1, 1, 1, 0) ∝ 0.8 × 0.9 × 0.9 × (1 − 0.1) × 0.5 = 0.2916
p(y = S | 1, 1, 1, 0) ∝ 0.1 × 0.05 × 0.05 × (1 − 0.7) × 0.5 = 0.0000375
So the document is classified as “Election”
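The computation on this slide can be reproduced directly; a short sketch using the numbers above (an absent word contributes 1 − P(φ = 1 | class)):

```python
# Reproduce the Election-vs-Sports example.
p_word = {
    "E": {"romney": 0.8, "obama": 0.9, "clinton": 0.9, "football": 0.1},
    "S": {"romney": 0.1, "obama": 0.05, "clinton": 0.05, "football": 0.7},
}
prior = {"E": 0.5, "S": 0.5}
features = {"romney": 1, "obama": 1, "clinton": 1, "football": 0}

scores = {}
for c in ("E", "S"):
    score = prior[c]
    for word, present in features.items():
        p1 = p_word[c][word]
        score *= p1 if present else 1.0 - p1  # absent word: 1 - P(phi = 1 | c)
    scores[c] = score

print(scores)                       # E is about 0.2916, S is about 3.75e-05
print(max(scores, key=scores.get))  # 'E' -> classified as "Election"
```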
Naïve Bayes classifiers for continuous values
So far we have assumed a binomial or discrete distribution for the data given the model (p(x_i | y)).
However, in many cases the data contains continuous features: height, weight, levels of genes in cells, brain activity.
For these types of data we often use a Gaussian model.
In this model we assume that the observed input vector x is generated from the following distribution
x ∼ N(μ,Σ)
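Under the Naïve Bayes assumption each feature gets its own class-conditional Gaussian (i.e., a diagonal Σ), with per-class means and variances estimated from the data. A minimal sketch with hypothetical estimates:

```python
import math

def gaussian_logpdf(x, mu, var):
    """Log density of N(mu, var) evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def gaussian_nb_log_score(x, means, variances, prior):
    """log prior + sum_j log N(x_j; mu_j, sigma_j^2) for one class."""
    return math.log(prior) + sum(
        gaussian_logpdf(xj, mu, var) for xj, mu, var in zip(x, means, variances)
    )

# Hypothetical per-class estimates for two continuous features (e.g., height, weight).
score_a = gaussian_nb_log_score([172.0, 70.0], [175.0, 72.0], [36.0, 25.0], prior=0.5)
score_b = gaussian_nb_log_score([172.0, 70.0], [162.0, 58.0], [30.0, 20.0], prior=0.5)
print("class A" if score_a > score_b else "class B")  # class A
```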
Possible problems with Naïve Bayes classifiers: Assumptions
In most cases, the assumption of conditional independence given the class label is violated.
For example, we are much more likely to find the word "Barack" if we saw the word "Obama", regardless of the class.
This is, unfortunately, a major shortcoming that makes these classifiers inferior in many real-world applications (though not always).
There are models that can improve upon this assumption without using the full conditional model (one such model is the Bayesian network).
Possible problems with Naïve Bayes classifiers: Parameter estimation
Even though we need far less data than the full Bayes model, there may be cases when the data we have is not enough.
For example, what is p(S = 1, N = 1 | E = 2)?
This can get worse: assume we have 20 variables, almost all pointing in the direction of the same class, except for one for which we have no record for this class.
Solutions?
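One standard remedy (a common practice, not spelled out on this slide) is additive Laplace smoothing: add pseudo-counts so that a value never observed for a class still gets a small nonzero probability. A minimal sketch for a binary attribute:

```python
def smoothed_estimate(n1: int, n0: int, alpha: float = 1.0) -> float:
    """Laplace-smoothed estimate of p(x_j = 1 | y): alpha pseudo-counts per value."""
    return (n1 + alpha) / (n0 + n1 + 2 * alpha)

# A value never seen for this class no longer gets probability exactly 0.
print(smoothed_estimate(n1=0, n0=10))  # about 0.083 instead of 0.0
print(smoothed_estimate(n1=7, n0=3))   # about 0.667, close to the raw 7/10
```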
Important points
Problems with estimating full joint distributions
Advantages of the Naïve Bayes assumptions
Applications to discrete and continuous cases
Problems with Naïve Bayes classifiers