Classification (2)
COMP9417 Machine Learning and Data Mining
Term 2, 2021
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
Aims
This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline the logistic regression classification algorithm
• describe the use of Bayes Theorem for decision-making under
uncertainty
• outline Bayes Theorem as applied in machine learning
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
Linear classifiers
The “classic” classification learning algorithm is the linear discriminant; there are many forms of linear discriminant in statistics and machine learning, e.g.,
• Fisher’s Linear Discriminant Analysis
• basic linear classifier (last lecture) is a simplified version of this
• Logistic Regression
• a probabilistic linear classifier (next slides)
• Perceptron
• a linear threshold classifier (last lecture)
• an early version of an artificial “neuron”
• still a useful method, and source of ideas (later lecture)
• Winnow
• like Perceptron, another linear threshold classifier
• uses a different learning method, with interesting properties
• comes from Learning Theory (later lecture)
Linear classifiers
Logistic Regression
Linear Regression for classification
Question: can we use Linear Regression for classification ?
Answer: not really, unless we use an approach like the following, e.g.,
Training: train a separate linear regression model for each class
• set y = 1 if example in class
• set y = 0 otherwise
Prediction: for each example
• run example through all regression models
• predict the class with the largest output value for y
Called multi-response linear regression.1
Note: does not obey linear regression assumptions, but may work in
practice.
1See: Witten et al. (2017).
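A minimal numpy sketch of this scheme (the toy data and function names are mine, not from the lecture): fit one least-squares model per class on 0/1 targets, then predict the class whose model gives the largest output.

import numpy as np

def fit_multiresponse(X, y, n_classes):
    """Fit one least-squares linear model per class on 0/1 targets."""
    Xb = np.column_stack([np.ones(len(X)), X])        # add intercept
    W = []
    for c in range(n_classes):
        t = (y == c).astype(float)                    # y = 1 if example in class c, else 0
        w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
        W.append(w)
    return np.array(W)

def predict_multiresponse(W, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.argmax(Xb @ W.T, axis=1)                # class with the largest output

# toy 1-D, two-class data
X = np.array([[0.1], [0.4], [0.8], [2.0], [2.4], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
W = fit_multiresponse(X, y, n_classes=2)
print(predict_multiresponse(W, X))                    # recovers the labels on this easy data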
Linear classifiers
Logistic Regression
Logistic regression
In the case of a two-class classification problem, if we model the probability P(Y = 1) of an instance x being a positive example like this:

P(Y = 1|x) = \frac{1}{1 + e^{-w^T x}}

then this probability vs. the alternative (1 − P(Y = 1)) can be written like this:

\ln \frac{P(Y = 1|x)}{1 - P(Y = 1|x)} = w^T x

The quantity on the l.h.s. is called the logit and we are defining a linear model for the logit.
Linear classifiers
Logistic Regression
Logistic regression
Unlike linear regression, no analytical maximum likelihood (ML) solution to find weights w.
An iterative gradient ascent method can be used to maximize log likelihood.
The (conditional) log likelihood is:
\sum_{i=1}^{N} y^{(i)} \log P(1|x^{(i)}) + (1 - y^{(i)}) \log (1 - P(1|x^{(i)}))
Over a set of N examples, choose w to maximise this expression, noting
that y(i) is always either 0 or 1.
Generalises to multiple class versions (Y can have more than two values).
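A minimal sketch of fitting this model by gradient ascent on the log likelihood (numpy assumed; the toy data, learning rate and function names are illustrative, not from the lecture):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Gradient ascent on the conditional log likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept term
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ w)          # P(Y=1|x) for each example
        w += lr * Xb.T @ (y - p)     # gradient of the log likelihood
    return w

# toy 1-D data: the positive class tends to have larger feature values
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
print(w, sigmoid(np.array([1.0, 2.25]) @ w))     # P(Y=1) near the class boundary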
Some questions about classifier learning
Step back
Some questions about classifier learning
Classifier learning
What do we understand about the problem of learning classifiers?
. . . how can we know when classifier learning succeeds?
and . . . can we use this to build practical algorithms?
Some questions about classifier learning
Inductive Bias
Inductive Bias
All models are wrong, but some models are useful.
Box & Draper (1987)
Some questions about classifier learning
Inductive Bias
Inductive Bias
Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially it means that the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Success in machine learning requires understanding the inductive bias of algorithms and models, and choosing them appropriately for the task 2.
2Even true for “deep learning”; watch Andrew Ng’s talk on this at http://www.youtube.com/watch?v=F1ka6a13S9I&t=48s.
Some questions about classifier learning
Inductive Bias
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
• Nearest Neighbour ?
Some questions about classifier learning
Inductive Bias
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
  • target function has the form y = ax + b
  • approximate by fitting using MSE
• Nearest Neighbour ?
  • target function is a complex non-linear function of the data
  • predict using nearest neighbour by Euclidean distance in feature space
Some questions about classifier learning
Inductive Bias
Inductive Bias
What we would really like:
• a framework for machine learning algorithms
• with a way of representing the inductive bias
• ideally, should be a declarative specification
• also should quantify uncertainty in the inductive bias
A probabilistic approach
Why model uncertainty ?
A probabilistic approach
A probabilistic approach
Why model uncertainty ?
Uncertainty
As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality.
–Albert Einstein
Bayesian Machine Learning
Bayesian Machine Learning
Why Bayes ?
Two Roles for Bayesian Methods
Provides useful conceptual framework:
• a general framework for decision making under uncertainty
• a “gold standard” for evaluating other learning algorithms
Provides practical learning algorithms:
• Naive Bayes classifier learning
• Bayesian network learning, etc. (not in this course)
• combines prior knowledge (prior probabilities) with observed data
Bayesian Machine Learning
Why Bayes ?
Basic Formulas for Probabilities
Product Rule: probability P (A ∧ B) of conjunction of two events A and B: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Sum Rule: probability of disjunction of two events A and B: P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
Theorem of total probability: if events A_1, …, A_n are mutually exclusive with \sum_{i=1}^{n} P(A_i) = 1, then:

P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)
Bayesian Machine Learning
Why Bayes ?
Basic Formulas for Probabilities
Also worth remembering:
• Conditional Probability: probability of A given B:
P(A|B) = \frac{P(A ∧ B)}{P(B)}
• Rearrange sum rule to get:
P (A ∧ B) = P (A) + P (B) − P (A ∨ B)
Exercise: (if you haven’t done it before) derive Bayes Theorem.
Bayesian Machine Learning
Why Bayes ?
Bayes Theorem
For estimating the probability of a hypothesis (model) from data:
P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}
where
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
Bayesian Machine Learning
Why Bayes ?
Choosing Hypotheses
P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}

Generally, we want the most probable hypothesis given the training data, i.e., the maximum a posteriori hypothesis h_{MAP}:

h_{MAP} = \arg\max_{h \in H} P(h|D)
        = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)}
        = \arg\max_{h \in H} P(D|h)\,P(h)
Bayesian Machine Learning
Why Bayes ?
Choosing Hypotheses
If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis

h_{ML} = \arg\max_{h_i \in H} P(D|h_i)
Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) =                 P(¬cancer) =
P(⊕ | cancer) =             P(⊖ | cancer) =
P(⊕ | ¬cancer) =            P(⊖ | ¬cancer) =
Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = .008            P(¬cancer) = .992
P(⊕ | cancer) = .98         P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03        P(⊖ | ¬cancer) = .97
Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not ?
We can find the maximum a posteriori (MAP) hypothesis
P(⊕ | cancer)P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.02976

Thus h_{MAP} = . . .
Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not ?
We can find the maximum a posteriori (MAP) hypothesis
P(⊕ | cancer)P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | ¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.02976

Thus h_{MAP} = ¬cancer.

Also note: the posterior probability of the hypothesis cancer is higher than its prior.
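The worked example can be checked in a few lines of Python (values from the slides; variable names are mine):

p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# unnormalised posteriors P(⊕|h) P(h)
num_cancer = p_pos_given_cancer * p_cancer       # 0.00784
num_not = p_pos_given_not * p_not                # 0.02976 -> MAP picks ¬cancer

# normalise by P(⊕) to get the true posterior
p_pos = num_cancer + num_not
print(num_cancer, num_not)
print(num_cancer / p_pos)                        # P(cancer|⊕) ≈ 0.21, up from the prior 0.008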
Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
How to get the posterior probability of a hypothesis h ? Divide by probability of data, P (⊕), to normalize result for h.
P(h|D) = \frac{P(D|h)\,P(h)}{\sum_{h_i \in H} P(D|h_i)\,P(h_i)}
Denominator ensures we obtain posterior probabilities that sum to 1.
We sum over all possible numerator values (i.e., over all hypotheses), since the hypotheses are mutually exclusive (e.g., the patient either has cancer or does not).

The denominator is the marginal likelihood (the likelihood marginalized over all possible values in the hypothesis space), i.e., the prior probability of the data.
Bayesian Machine Learning
Why Bayes ?
A Bayesian framework for Classification
1 Define a prior on hypotheses
2 Define the likelihood, i.e., the probability of the data given the
hypothesis
3 Learning is finding the required parameters by fitting hypotheses to data
4 Predict (classify) using, e.g., the MAP hypothesis
Bayesian Machine Learning
Why Bayes ?
Brute Force MAP Hypothesis Learner
Idea: view learning as finding the most probable hypothesis
• For each hypothesis h in H, calculate the posterior probability
P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}

• Output the hypothesis h_{MAP} with the highest posterior probability

h_{MAP} = \arg\max_{h \in H} P(h|D)
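A minimal sketch of the brute-force MAP learner, assuming we can enumerate H and are given prior(h) and likelihood(D, h) functions (all names and the toy usage below are illustrative, not from the lecture):

def brute_force_map(hypotheses, prior, likelihood, D):
    """Return (h_MAP, posterior dict), scoring every h in H."""
    scores = {h: likelihood(D, h) * prior(h) for h in hypotheses}   # P(D|h) P(h)
    evidence = sum(scores.values())                                 # P(D)
    posteriors = {h: s / evidence for h, s in scores.items()}
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors

# toy usage: two hypotheses for the cancer example, D = a single positive test result
hypotheses = ["cancer", "not_cancer"]
prior = {"cancer": 0.008, "not_cancer": 0.992}.get
likelihood = lambda D, h: {"cancer": 0.98, "not_cancer": 0.03}[h]   # P(⊕|h)
print(brute_force_map(hypotheses, prior, likelihood, "positive"))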
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Consider a basic form of classification, concept learning3
• two classes: positive examples are in the concept
  • e.g., pink elephants
• instances x_i described by Boolean features
  • e.g., has trunk(x17), pink(x17), . . .
• hypotheses described by predicates
  • e.g., has trunk(X) ∧ pink(X), where X is a variable
• examples are instances labelled with class values
• assume perfect classification is possible
  • noise-free examples
• a hypothesis is consistent with a set of examples if
  • the hypothesis is true for all positive examples
  • the hypothesis is false for all negative examples

3See: Mitchell (1997).
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Canonical concept learning task:
• instance space X, hypothesis space H, training examples D
• consider a learning algorithm that outputs the most specific hypothesis from the version space VS_{H,D} (i.e., the set of all consistent or “zero-error” classification rules)

What would Bayes rule produce as the MAP hypothesis? Does this algorithm output a MAP hypothesis?
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Brute Force MAP Framework for Concept Learning:
Assume fixed set of instances ⟨x1, . . . , xm⟩
Assume D is the set of classifications D = ⟨c(x1), . . . , c(xm)⟩
Choose P(h) to be the uniform distribution:
• P(h) = \frac{1}{|H|} for all h in H
Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Then:

P(h|D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Note that since the likelihood is zero if h is inconsistent with D, the posterior is also zero. But how did we obtain the posterior for a consistent h?

P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)} = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
How did we obtain P(D)? From the theorem of total probability:

P(D) = \sum_{h_i \in H} P(D|h_i)\,P(h_i)
     = \sum_{h_i \in VS_{H,D}} 1 \cdot \frac{1}{|H|} + \sum_{h_i \notin VS_{H,D}} 0 \cdot \frac{1}{|H|}
     = \frac{|VS_{H,D}|}{|H|}
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Evolution of Posterior Probabilities
[Figure: evolution of posterior probabilities over the hypothesis space as data arrives: (a) the prior P(h), (b) the posterior P(h|D1), (c) the posterior P(h|D1, D2).]
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Every hypothesis consistent with D is a MAP hypothesis, if we assume
• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)
So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.
A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
E.g., learning a linear target function f from noisy examples:

[Figure: noisy training examples plotted against x, with the true linear target function f and the maximum likelihood hypothesis h_{ML}; e marks the noise/error on one example.]
A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
Consider any real-valued target function f
Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean 0

Then the maximum likelihood hypothesis h_{ML} is the one that minimizes the sum of squared errors:

h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
How did we obtain this?

h_{ML} = \arg\max_{h \in H} p(D|h)
       = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h)
       = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}

Recall that we treat each probability p(d_i|h) as if h = f, i.e., we assume μ = f(x_i) = h(x_i), which is the key idea behind maximum likelihood!
A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
Maximize the natural log to give a simpler expression:

h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2
       = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2
       = \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2

Equivalently, we can minimize the positive version of the expression:

h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2
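A small numerical check of this equivalence (my own toy setup, not from the slides): over a grid of candidate slopes w for h(x) = w·x, the Gaussian log likelihood and the sum of squared errors are optimised by the same w.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
d = 2.0 * x + rng.normal(scale=0.1, size=x.shape)   # noisy targets, true slope 2.0
sigma = 0.1

def log_likelihood(w):
    resid = d - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (resid / sigma) ** 2)

def sse(w):
    return np.sum((d - w * x) ** 2)

ws = np.linspace(1.5, 2.5, 201)
print(ws[np.argmax([log_likelihood(w) for w in ws])])   # ML estimate of the slope
print(ws[np.argmin([sse(w) for w in ws])])              # least-squares estimate: the same slope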
A Bayesian approach to learning algorithms
Bayes Error
Bayes Error
What is the best performance attainable by a (two-class) classifier?

Define the probability of error for classifying some instance x by

P(error|x) = \begin{cases} P(class_1|x) & \text{if we predict } class_2 \\ P(class_2|x) & \text{if we predict } class_1 \end{cases}

This gives

P(error) = \sum_{x} P(error|x)\,P(x)

So we can justify the use of the decision rule

if P(class_1|x) > P(class_2|x) then predict class_1
else predict class_2

On average, this decision rule minimises the probability of classification error.
Bayes Optimal Classifier
Most Probable Classification of New Instances
So far we've sought the most probable hypothesis given the data D (i.e., h_{MAP}).

Given new instance x, what is its most probable classification?
• h_{MAP}(x) is not the most probable classification!
Bayes Optimal Classifier
Most Probable Classification of New Instances
Consider:
• Three possible hypotheses:
  P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x,
  h1(x) = +, h2(x) = −, h3(x) = −
• What's the most probable classification of x?
Bayes Optimal Classifier
Bayes Optimal Classifier
Bayes optimal classification:

\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D)

Example:

P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0
Bayes Optimal Classifier
Bayes Optimal Classifier
therefore

\sum_{h_i \in H} P(+|h_i)\,P(h_i|D) = .4

and

\sum_{h_i \in H} P(−|h_i)\,P(h_i|D) = .6

and

\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\,P(h_i|D) = −

Key point: no other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.
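The example can be checked directly in Python (variable names are mine):

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}        # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}       # h(x)

def bayes_optimal(labels=("+", "-")):
    # weight each hypothesis's vote for a label by its posterior probability
    votes = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in labels}
    return max(votes, key=votes.get), votes

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})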
Naive Bayes
Bayesian classifier learning in practice
Problem: Bayes optimal classification is typically intractable in practice
• requires executing all hypotheses
• not practical in applications
Naive Bayes
Discriminative and generative probabilistic models
• Discriminative models model the posterior probability distribution P (Y |X), where Y is the target variable and X are the features. That is, given X they return a probability distribution over Y .
• example: Logistic regression
• Generative models model the joint distribution P(Y, X) of the target Y and the feature vector X. Once we have access to this joint distribution we can derive any conditional or marginal distribution involving the same variables. In particular, since P(X) = \sum_y P(Y = y, X), it follows that the posterior distribution can be obtained as P(Y|X) = \frac{P(Y, X)}{\sum_y P(Y = y, X)}.
• Alternatively, generative models can be described by the likelihood function P (X|Y ), since P (Y, X) = P (X|Y )P (Y ) and the target or prior distribution (usually abbreviated to ‘prior’) can be easily estimated or postulated.
• example: Naive Bayes classifier
Naive Bayes
Naive Bayes Classifier
A simplified version of a generative probabilistic model for classification
• widely-used algorithm
• easy to train, a practical learning method
• not the most accurate, but better than others for some applications
• estimated probabilities may be off, but the key issue is classification
When to apply
• moderate or large training set available
• attributes that describe instances are conditionally independent given classification
Successful applications
• classifying text documents
• Gaussian Naive Bayes for real-valued data
Naive Bayes
Naive Bayes Classifier
Assume a target function f : X → V, where each instance x is described by attributes ⟨a1, a2 . . . an⟩.

The most probable value of f(x) is:

v_{MAP} = \arg\max_{v_j \in V} P(v_j|a_1, a_2, \ldots, a_n)
        = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n|v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)}
        = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n|v_j)\,P(v_j)

by application of Bayes Theorem.
Naive Bayes
Naive Bayes Classifier
Naive Bayes assumption:

P(a_1, a_2, \ldots, a_n|v_j) = \prod_i P(a_i|v_j)

• attributes a_i are statistically independent, given the class value v_j
• this means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)

which gives the Naive Bayes classifier:

v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i|v_j)
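A minimal sketch of this decision rule for categorical attributes, assuming the priors P(v_j) and conditionals P(a_i|v_j) have already been estimated; all names and numbers below are illustrative, and logs are used to avoid underflow from multiplying many small probabilities:

import math

priors = {"+": 0.5, "-": 0.5}                       # P(v_j)
cond = {                                            # P(a_i = value | v_j)
    "+": {"a1": {"t": 0.7, "f": 0.3}, "a2": {"t": 0.4, "f": 0.6}},
    "-": {"a1": {"t": 0.2, "f": 0.8}, "a2": {"t": 0.5, "f": 0.5}},
}

def predict_nb(x):
    """x maps attribute name -> observed value; returns argmax_v P(v) prod_i P(a_i|v)."""
    scores = {}
    for v in priors:
        scores[v] = math.log(priors[v]) + sum(
            math.log(cond[v][a][val]) for a, val in x.items())
    return max(scores, key=scores.get)

print(predict_nb({"a1": "t", "a2": "f"}))   # '+'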
Naive Bayes
Probabilistic modelling
Binomial and Bernoulli distributions4
“Coin tossing”
Let X ∈ {0, …, n} be the number of heads in a sequence of n tosses. If the probability of heads is θ, then X has a binomial distribution

Bin(k|n, θ) = \binom{n}{k} θ^k (1 − θ)^{n−k}

For a single coin toss, let X ∈ {0, 1}. If θ is the probability of heads, then X has a Bernoulli distribution

Ber(x|θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}
4“Machine Learning: A Probabilistic Perspective”, K. Murphy (2012)
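A quick numerical check of these formulas (scipy is assumed to be available; the numbers are illustrative):

from math import comb
from scipy.stats import binom, bernoulli

n, k, theta = 10, 3, 0.5
# binomial pmf by the formula above vs. scipy
print(comb(n, k) * theta**k * (1 - theta)**(n - k))      # 0.1171875
print(binom.pmf(k, n, theta))                            # same value

# Bernoulli: a single toss, x in {0, 1}
print(bernoulli.pmf(1, theta), bernoulli.pmf(0, theta))  # theta, 1 - theta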
Naive Bayes
Probabilistic modelling
Multinomial and Multinoulli distributions
“K-sided Dice throwing”
Let x = (x_1, …, x_K) be a random vector, where x_j is the number of times side j of the die is thrown.
If θ_j is the probability of side j, then x has a multinomial distribution

Mu(x|n, θ) = \binom{n}{x_1 \ldots x_K} \prod_{j=1}^{K} θ_j^{x_j}

For a single dice throw, x will be a vector of zeros except for a single ‘1’ at index j indicating that side j of the die is thrown. If θ_j is as above, then x has a multivariate Bernoulli (Multinoulli) distribution

Mu(x|1, θ) = \prod_{j=1}^{K} θ_j^{I(x_j = 1)}
Naive Bayes
Naive Bayes classifier for document (email) classification
Categorical random variables
Categorical variables or features (also called discrete or nominal) are ubiquitous in machine learning.
• Perhaps the most common form of the Bernoulli distribution models whether or not a word occurs in a document. That is, for the i-th word in our vocabulary we have a random variable Xi governed by a Bernoulli distribution. The joint distribution over the bit vector
X = (X1, . . . , Xk) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for example, every word position in an e-mail corresponds to a categorical variable with k outcomes, where k is the size of the vocabulary. The multinomial distribution manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words in a document. This establishes an alternative way of modelling text documents that allows the number of occurrences of a word to influence the classification of a document.
Naive Bayes
Naive Bayes classifier for document (email) classification
Categorical random variables
Both these document models are in common use. Despite their differences, they both assume independence between word occurrences, generally referred to as the Naive Bayes assumption.
• In the multinomial document model, this follows from the very use of the multinomial distribution, which assumes that words at different word positions are drawn independently from the same categorical distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit vector are statistically independent, which allows us to compute the joint probability of a particular bit vector (x1, . . . , xk) as the product of the probabilities of each component P(Xi = xi).
• In practice, such word independence assumptions are often not true: if we know that an e-mail contains the word ‘Viagra’, we can be quite sure that it will also contain the word ‘pill’. Violated independence assumptions reduce the quality of probability estimates but may still
allow good classification performance.
Naive Bayes
Naive Bayes classifier for document (email) classification
Probabilistic decision rules
We choose one of the possible distributions to model our data X as coming from either class.
• The more different P(X|Y = spam) and P(X|Y = ham) are, the more useful the features X are for classification.
• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam) and P (X = x|Y = ham), and apply one of several possible decision rules:
maximum likelihood (ML) – predict \arg\max_y P(X = x|Y = y);
maximum a posteriori (MAP) – predict \arg\max_y P(X = x|Y = y)\,P(Y = y);
The relation between the first two decision rules is that ML classification is
equivalent to MAP classification with a uniform class distribution.
Naive Bayes
Naive Bayes classifier for document (email) classification
Probabilistic decision rules
We use the example of Naive Bayes for text classification to illustrate, using both the multinomial and multivariate Bernoulli models.
Naive Bayes
Naive Bayes classifier for document (email) classification
Training a Naive Bayes model
Consider the following e-mails consisting of five words a, b, c, d, e:
e1: b d e b b d e            e5: a b a b a b a e d
e2: b c e b b d d e c c      e6: a c a c a c a e d
e3: a d a d e a e e          e7: e a e d a e a
e4: b a d b e d a b          e8: d e d e d
We are told that the e-mails on the left are spam and those on the right are ham, and so we use them as a small training set to train our Bayesian classifier.
• First, we decide that d and e are so-called stop words that are too common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary (features).
Naive Bayes
Naive Bayes classifier for document (email) classification
Training data for Naive Bayes
In the multivariate Bernoulli model e-mails are represented by bit vectors. Here is our email data set described by bit vectors.
E-mail   a?   b?   c?   Class
e1       0    1    0    +
e2       0    1    1    +
e3       1    0    0    +
e4       1    1    0    +
e5       1    1    0    −
e6       1    0    1    −
e7       1    0    0    −
e8       0    0    0    −
Naive Bayes
Naive Bayes classifier for document (email) classification
Training a Naive Bayes model
• Adding the bit vectors for each class results in (2, 3, 1) for spam and (3, 1, 1) for ham.
• Each count is divided by the number of documents in a class, to get an estimate of the probability of a document containing a particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one containing each word and one containing none of them.
• This results in the estimated parameter vectors
θˆ⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and θˆ⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham.
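The estimates above can be reproduced as follows (numpy assumed; the smoothing mirrors the slide's two pseudo-documents per class):

import numpy as np

# bit vectors (a?, b?, c?) from the table above
spam = np.array([[0,1,0], [0,1,1], [1,0,0], [1,1,0]])
ham  = np.array([[1,1,0], [1,0,1], [1,0,0], [0,0,0]])

def bernoulli_params(docs):
    counts = docs.sum(axis=0) + 1            # +1: pseudo-document containing each word
    n_docs = len(docs) + 2                   # +2: the two pseudo-documents
    return counts / n_docs

print(bernoulli_params(spam))   # [0.5   0.667 0.333]  = theta_spam
print(bernoulli_params(ham))    # [0.667 0.333 0.333]  = theta_ham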
Naive Bayes
Naive Bayes classifier for document (email) classification
Training data for Naive Bayes
For the multinomial model, we represent each e-mail as a count vector. Here is our e-mail data set described by count vectors.
E-mail   #a   #b   #c   Class
e1       0    3    0    +
e2       0    3    3    +
e3       3    0    0    +
e4       2    3    0    +
e5       4    3    0    −
e6       4    0    3    −
e7       3    0    0    −
e8       0    0    0    −
Naive Bayes
Naive Bayes classifier for document (email) classification
Training a Naive Bayes model
• To estimate the parameters of the multinomial, we sum up the count vectors for each class, which gives (5, 9, 3) for spam and (11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for each vocabulary word, which brings the total number of occurrences of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
θˆ⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and θˆ⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham.
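And similarly for the multinomial model (numpy assumed; one pseudo-count per vocabulary word, as on the slide):

import numpy as np

spam_counts = np.array([[0,3,0], [0,3,3], [3,0,0], [2,3,0]])
ham_counts  = np.array([[4,3,0], [4,0,3], [3,0,0], [0,0,0]])

def multinomial_params(docs):
    counts = docs.sum(axis=0) + 1        # add one pseudo-count per word
    return counts / counts.sum()         # normalise over the vocabulary

print(multinomial_params(spam_counts))   # [0.3 0.5 0.2]
print(multinomial_params(ham_counts))    # [0.6 0.2 0.2]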
Naive Bayes
Naive Bayes classifier for document (email) classification
Prediction using a Naive Bayes model
Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our e-mails, with parameters
θ⊕ = (0.5, 0.67, 0.33)
θ⊖ = (0.67, 0.33, 0.33)
This means, for example, that the presence of b is twice as likely in spam (+), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain likelihoods
P(x|⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
P(x|⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148
The ML classification of x is thus spam.
Naive Bayes
Naive Bayes classifier for document (email) classification
Prediction using a Naive Bayes model
In the case of two classes it is often convenient to work with likelihood ratios and odds.
• The likelihood ratio can be calculated as

\frac{P(x|⊕)}{P(x|⊖)} = \frac{0.5}{0.67} \cdot \frac{0.67}{0.33} \cdot \frac{1 − 0.33}{1 − 0.33} = 3/2 > 1

• This means that the MAP classification of x is also spam if the prior odds are more than 2/3, but ham if they are less than that.
• For example, with 33% spam and 67% ham the prior odds are \frac{P(⊕)}{P(⊖)} = \frac{0.33}{0.67} = 1/2, resulting in posterior odds of

\frac{P(⊕|x)}{P(⊖|x)} = \frac{P(x|⊕)}{P(x|⊖)} \cdot \frac{P(⊕)}{P(⊖)} = 3/2 \cdot 1/2 = 3/4 < 1

• In this case the likelihood ratio for x is not strong enough to push the decision away from the prior.
Naive Bayes
Naive Bayes classifier for document (email) classification
Prediction using a Naive Bayes model
Alternatively, we can employ a multinomial model. The parameters of a multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2)
θ⊖ = (0.6, 0.2, 0.2)

The e-mail to be classified contains three occurrences of word a, one single occurrence of word b and no occurrences of word c, and hence is described by the count vector x = (3, 1, 0). The total number of vocabulary word occurrences is n = 4. We obtain likelihoods

P(x|⊕) = 4! \cdot \frac{0.3^3}{3!} \cdot \frac{0.5^1}{1!} \cdot \frac{0.2^0}{0!} = 0.054

P(x|⊖) = 4! \cdot \frac{0.6^3}{3!} \cdot \frac{0.2^1}{1!} \cdot \frac{0.2^0}{0!} = 0.1728

The likelihood ratio is \left(\frac{0.3}{0.6}\right)^3 \left(\frac{0.5}{0.2}\right)^1 \left(\frac{0.2}{0.2}\right)^0 = 5/16. The ML classification of x is thus ham, the opposite of the multivariate Bernoulli model. This is mainly because of the three occurrences of word a, which provide strong evidence for ham.
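Both worked predictions can be checked in Python (parameters as estimated earlier; exact fractions are used where the slides round to 0.67 and 0.33):

from math import factorial

# multivariate Bernoulli model, x = (1, 1, 0)
theta_spam = (0.5, 2/3, 1/3)
theta_ham  = (2/3, 1/3, 1/3)
x_bits = (1, 1, 0)
def bernoulli_lik(theta, x):
    p = 1.0
    for t, xi in zip(theta, x):
        p *= t if xi else (1 - t)
    return p
print(bernoulli_lik(theta_spam, x_bits), bernoulli_lik(theta_ham, x_bits))  # 0.222 vs 0.148 -> spam

# multinomial model, x = (3, 1, 0), n = 4
th_spam = (0.3, 0.5, 0.2)
th_ham  = (0.6, 0.2, 0.2)
x_counts = (3, 1, 0)
def multinomial_lik(theta, x):
    p = factorial(sum(x))
    for t, xi in zip(theta, x):
        p *= t**xi / factorial(xi)
    return p
print(multinomial_lik(th_spam, x_counts), multinomial_lik(th_ham, x_counts))  # 0.054 vs 0.1728 -> ham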
Naive Bayes
Naive Bayes classifier for document (email) classification
Naive Bayes: numeric attributes
Note that Naive Bayes can be used with probabilistic models other than categorical variables, e.g., for numeric attributes
• usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• the probability density function for the normal distribution is defined by two parameters:
The sample mean μ:

μ = \frac{1}{n} \sum_{i=1}^{n} x_i

The standard deviation σ:

σ = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i − μ)^2}
Naive Bayes
Naive Bayes classifier for document (email) classification
Naive Bayes: numeric attributes
Then we have the density function f(x):

f(x) = \frac{1}{\sqrt{2\pi}\,σ} e^{-\frac{(x−μ)^2}{2σ^2}}

Example: continuous attribute temperature with mean = 73 and standard deviation = 6.2. Density value:

f(temperature = 66 | “yes”) = \frac{1}{\sqrt{2\pi}\,6.2} e^{-\frac{(66−73)^2}{2 \times 6.2^2}} = 0.0340

Missing values during training are not included in the calculation of the mean and standard deviation.
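Checking the density value above (standard library only; the function name is mine):

import math

def gaussian_pdf(x, mu, sigma):
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# f(temperature = 66 | "yes") with mean 73 and standard deviation 6.2
print(gaussian_pdf(66, 73, 6.2))   # ~0.0340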
Naive Bayes
Naive Bayes classifier for document (email) classification
Naive Bayes: numeric attributes
Note: the normal distribution is based on the simple exponential function

f(x) = e^{−|x|^m}

As the power m in the exponent increases, the function approaches a step function. Where m = 2,

f(x) = e^{−|x|^2}

and this is the basis of the normal distribution — the various constants are the result of scaling so that the integral (the area under the curve from −∞ to +∞) is equal to 1.
from “Statistical Computing” by Michael J. Crawley (2002) Wiley.
Summary
Summary
• We described the classification problem in machine learning
• We also outlined the issue of Inductive Bias
• Two major frameworks for classification were covered
Distance-based. The key ideas are geometric. We discussed distance-based classification (Nearest Neighbour).
Probabilistic. The key ideas are (mostly) Bayesian. We discussed generative (Naive Bayes) and discriminative (Logistic Regression) models.
• So we have established a basis for learning classifiers
• Later we will see how to extend by building on these ideas
References
Mitchell, T. (1997). Machine Learning. McGraw-Hill, New York.
Witten, I., Frank, E., Hall, M., and Pal, C. (2017). Data Mining (Fourth Edition). Morgan
Kaufmann.