Document Classification 3: The Naïve Bayes Classifier
This time
Some elementary probability theory
  Random variables
  Probability distributions
  Bayes’ Law
The Naïve Bayes classifier
Parameter learning
  Multinomial Naïve Bayes
  Bernoulli Naïve Bayes
The zero probability problem and smoothing
Data Science Group (Informatics) · NLE/ANLP · Autumn 2015
Learning a Document Classifier
Learning to classify data:
  Classification based on features (e.g. words) of document
Supervised learning of classifier:
  Labelled training data: correct classification for each document
Naïve Bayes: a probabilistic classifier model
Elementary Probability Theory
Random variable:
Ranges over a predefined set of values
Probability that X has value v written P(X = v)
Written P(v) when identity of X can be assumed
Sum of probabilities over all possible values of X is 1:
∑_v P(X = v) = 1
Or when random variable X is clear from context:
∑_v P(v) = 1
Elementary Probability Theory (cont.)
Random variable for the throw of a fair die
Possible values: { 1, 2, 3, 4, 5, 6 }
Because it is a fair die:
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Outcomes of different throws are independent of each other, so…
Probability of throws of 1 and 3:
P(1) × P(3) = 1/36
Probability of a throw of either 1 or 3:
P(1) + P(3) = 1/3
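A minimal sketch of these calculations in Python, purely for illustration (the fractions module just keeps the arithmetic exact):

```python
from fractions import Fraction

# A fair die: a uniform distribution over the six faces
P = {v: Fraction(1, 6) for v in range(1, 7)}

# Two independent throws: multiply the marginal probabilities
p_one_then_three = P[1] * P[3]
print(p_one_then_three)   # 1/36

# One throw, mutually exclusive outcomes: add the marginal probabilities
p_one_or_three = P[1] + P[3]
print(p_one_or_three)     # 1/3
```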
Elementary Probability Theory (cont.)
Some useful terminology:
Marginal probability, written P(x)
Joint probability, written P(x, y) or P(x ∩ y)
Conditional probability, written P(x | y)
Relating Conditional and Joint Probability
We can rewrite joint probability in terms of conditional and marginal probability:
P(x, y) = P(x) · P(y | x)
We can rewrite conditional probability in terms of joint and marginal probability:
P(x | y) = P(x, y) / P(y)
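For example, with the fair die from the previous slide, let y be the event "the throw is even" and x the event "the throw is 2". Then P(y) = 1/2 and P(x, y) = 1/6, so P(x | y) = (1/6) / (1/2) = 1/3; equivalently, P(x, y) = P(y) · P(x | y) = 1/2 × 1/3 = 1/6.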
Bayes’ Law
From the previous slide we know that:
P(x, y) = P(x) · P(y | x) = P(y) · P(x | y)
From this we know that:
P(x | y) = P(y | x) · P(x) / P(y)
So we can rewrite a conditional probability in terms of the reverse conditional probability and two marginal probabilities
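Continuing the die example: P(even | 2) = 1, P(2) = 1/6 and P(even) = 1/2, so Bayes’ Law gives P(2 | even) = 1 × (1/6) / (1/2) = 1/3, matching the value obtained directly from the joint probability.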
Naïve Bayes Classification
Problem instance:
An item to classify is viewed as a sequence of features (f_1, …, f_n)
As a shorthand for (f_1, …, f_n) we write f_1^n
The goal is to assign a class based on f_1^n
The set of possible classes is C
A particular class will be denoted c, where c ∈ C
Naïve Bayes Classification
Probability of a class:
Probability of some class c given features f_1^n is written P(c | f_1^n)
We want the most probable class:
argmax_c P(c | f_1^n)
Naïve Bayes Classification
argmax_c P(c | f_1^n) = argmax_c P(c) × P(f_1^n | c) / P(f_1^n)
                      = argmax_c P(c) × P(f_1^n | c)
                      ≈ argmax_c P(c) × ∏_{i=1}^{n} P(f_i | c)
The first step is Bayes’ Law; the denominator P(f_1^n) can be dropped because it does not depend on c, so it does not affect the argmax
The approximation in the last step is based on the naïve assumption that for each i:
P(f_i | c) ≈ P(f_i | c, f_1^{i−1}, f_{i+1}^n)
i.e. the features are conditionally independent of each other given the class
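An illustrative sketch of this decision rule in Python. The classes, features and probability values below are invented for the example; logarithms are used only to avoid underflow on long feature sequences and do not change the argmax:

```python
import math

# Invented estimates, purely for illustration
prior = {"pos": 0.5, "neg": 0.5}
cond = {
    "pos": {"good": 0.20, "boring": 0.02},
    "neg": {"good": 0.05, "boring": 0.15},
}

def nb_score(c, features):
    # log P(c) + sum of log P(f_i | c): monotone in P(c) × Π P(f_i | c)
    return math.log(prior[c]) + sum(math.log(cond[c][f]) for f in features)

features = ["good", "good", "boring"]
best = max(prior, key=lambda c: nb_score(c, features))
print(best)   # "pos" for these particular numbers
```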
Learning Parameters
We need a way of estimating the various probabilities:
An estimate of P(c) for each c
An estimate of P(f | c) for each f and c
We can estimate these probabilities from a sample of labelled data
Sample of Labelled Data
Suppose we have a sample of labelled data: α_1, …, α_k
For each element α_i of the sample, we know its correct class, written label(α_i)
For each element α_i of the sample, we write the features of α_i as feats(α_i)
Estimating the Priors
Maximum likelihood estimate of P(c): the proportion of the labels of α_1, …, α_k that are equal to c
|{ i | label(α_i) = c }| / k
{ i | label(α_i) = c } is the set of indices of documents labelled c, based on the fixed enumeration of documents α_1, …, α_k
|A| denotes the number of elements of a set A
|{ i | label(α_i) = c }| is therefore the number of documents labelled c
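A minimal sketch of this estimate in Python, assuming the labelled sample is given as (features, label) pairs; the documents and labels below are invented:

```python
from collections import Counter
from fractions import Fraction

# Invented labelled sample: (features, label) pairs
sample = [
    (["good", "fun"], "pos"),
    (["boring"], "neg"),
    (["good", "boring"], "pos"),
    (["dull", "boring"], "neg"),
]

k = len(sample)
label_counts = Counter(label for _, label in sample)

# Maximum likelihood estimate of P(c): proportion of documents labelled c
prior = {c: Fraction(n, k) for c, n in label_counts.items()}
# Here both "pos" and "neg" get prior probability 1/2
```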
Estimating the Priors (cont.)
|{ i | label(α_i) = c }| / k
This is the prior probability of class c
Probability to use in the absence of any evidence about a particular piece of data
Two Naïve Bayes Models
Bernoulli Naïve Bayes model: only considers whether a feature is in a document or not
— document can be represented as a vector of booleans, one for each feature
Multinomial Naïve Bayes model: takes account of the number of times a feature occurs in a document
— document can be represented as a vector of counts, one for each feature
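A small sketch of the two representations, for an invented feature set and document:

```python
# Invented feature set (vocabulary) and document
vocabulary = ["good", "boring", "fun", "dull"]
document = ["good", "good", "boring"]   # the document as a sequence of features

# Bernoulli view: one boolean per feature (present or not)
bernoulli_vector = [f in document for f in vocabulary]
# [True, True, False, False]

# Multinomial view: one count per feature (number of occurrences)
multinomial_vector = [document.count(f) for f in vocabulary]
# [2, 1, 0, 0]
```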
Estimating the Conditional Probabilities
Bernoulli Naïve Bayes model
Maximum likelihood estimate for P(f | c):
|{ i | label(α_i) = c and f ∈ feats(α_i) }| / |{ i | label(α_i) = c }|
Out of all documents labelled c, the proportion of them that have feature f
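A sketch of this estimate in Python, on the same invented sample as before:

```python
from fractions import Fraction

# Invented labelled sample: (features, label) pairs
sample = [
    (["good", "fun"], "pos"),
    (["boring"], "neg"),
    (["good", "boring"], "pos"),
    (["dull", "boring"], "neg"),
]

def bernoulli_estimate(f, c):
    # Of the documents labelled c, the proportion that contain feature f
    docs_c = [feats for feats, label in sample if label == c]
    return Fraction(sum(1 for feats in docs_c if f in feats), len(docs_c))

print(bernoulli_estimate("good", "pos"))    # 1 (both pos documents contain "good")
print(bernoulli_estimate("boring", "pos"))  # 1/2
```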
Estimating the Conditional Probabilities (cont.)
Multinomial Naïve Bayes model
Maximum likelihood estimate for P(f | c):
|{ (i, j) | label(α_i) = c, feats(α_i) = (f_1, …, f_n) and f_j = f }| / |{ (i, j) | label(α_i) = c, feats(α_i) = (f_1, …, f_n) and 1 ≤ j ≤ n }|
Out of all documents labelled c, the proportion of feature occurrences that are of feature f
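And a sketch of the multinomial estimate on the same invented sample:

```python
from fractions import Fraction

# Invented labelled sample: (features, label) pairs
sample = [
    (["good", "fun"], "pos"),
    (["boring"], "neg"),
    (["good", "boring"], "pos"),
    (["dull", "boring"], "neg"),
]

def multinomial_estimate(f, c):
    # Of all feature occurrences in documents labelled c, the proportion that are f
    occurrences = [g for feats, label in sample if label == c for g in feats]
    return Fraction(occurrences.count(f), len(occurrences))

print(multinomial_estimate("good", "pos"))    # 1/2 (2 of the 4 occurrences in pos documents)
print(multinomial_estimate("boring", "pos"))  # 1/4
```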
Labelling Unseen (Unlabelled) Data
Suppose we have an unseen item of data α
We want to use our model to predict the correct label for α
Let the features of α be (f_1, …, f_n)
We return the result of computing
argmax_c P(c) × ∏_{i=1}^{n} P(f_i | c)
where the probabilities are estimated from the labelled data
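Putting the pieces together, a minimal end-to-end sketch in Python using the multinomial estimates and the invented sample from the previous slides (no smoothing yet):

```python
from collections import Counter
from fractions import Fraction

# Invented labelled sample: (features, label) pairs
sample = [
    (["good", "fun"], "pos"),
    (["boring"], "neg"),
    (["good", "boring"], "pos"),
    (["dull", "boring"], "neg"),
]

prior = {c: Fraction(n, len(sample))
         for c, n in Counter(label for _, label in sample).items()}

def cond(f, c):
    # Multinomial maximum likelihood estimate of P(f | c), unsmoothed
    occurrences = [g for feats, label in sample if label == c for g in feats]
    return Fraction(occurrences.count(f), len(occurrences))

def classify(features):
    # argmax over classes c of P(c) × Π P(f_i | c)
    def score(c):
        s = prior[c]
        for f in features:
            s *= cond(f, c)
        return s
    return max(prior, key=score)

print(classify(["good", "fun"]))   # "pos" for this sample
```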
The Zero Probability Problem
Consider this scenario:
One of the features of a data item α, say f, occurs in the training data, but never with class c
This will mean that for a data item α having feature f:
∏_{i=1}^{n} P(f_i | c) = 0
As a result, P(c | f_1^n) will be zero, no matter how strongly the other features support class c
Add-one Smoothing
We need to perform smoothing
Add-one smoothing
Add one to all counts when estimating P(f | c):
(1 + |{ (i, j) | label(α_i) = c, feats(α_i) = (f_1, …, f_n) and f_j = f }|) / (N + |{ (i, j) | label(α_i) = c, feats(α_i) = (f_1, …, f_n) and 1 ≤ j ≤ n }|)
where N is the total number of features
Gives low probability to unseen events involving seen features
Ignore features not seen anywhere in training data
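A sketch of the smoothed multinomial estimate on the same invented sample; here N is the number of distinct features seen anywhere in the training data:

```python
from fractions import Fraction

# Invented labelled sample: (features, label) pairs
sample = [
    (["good", "fun"], "pos"),
    (["boring"], "neg"),
    (["good", "boring"], "pos"),
    (["dull", "boring"], "neg"),
]

# N: total number of distinct features seen in the training data
N = len({f for feats, _ in sample for f in feats})

def smoothed_estimate(f, c):
    # Add-one smoothed estimate of P(f | c) for the multinomial model
    occurrences = [g for feats, label in sample if label == c for g in feats]
    return Fraction(1 + occurrences.count(f), N + len(occurrences))

print(smoothed_estimate("good", "pos"))   # (1 + 2) / (4 + 4) = 3/8
print(smoothed_estimate("good", "neg"))   # (1 + 0) / (4 + 3) = 1/7, no longer zero
```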
Next Topic: Evaluating Classifiers
Confusion matrices
Measures of performance:
Accuracy and Error Rate
Precision and Recall
Trading off Precision and Recall
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUROC)