Classification (2)
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
– explain the concept of inductive bias in machine learning
– outline Bayes Theorem as applied in machine learning
– define MAP and ML inference using Bayes theorem
– define the Bayes optimal classification rule in terms of MAP inference
– outline the Naive Bayes classification algorithm
– describe typical applications of Naive Bayes for text classification
– outline the logistic regression classification algorithm
COMP9417 T1, 2021 1
Introduction
What do we understand about the problem of learning classifiers? ... how can we know when classifier learning succeeds?
and . . . can we use this to build practical algorithms ?
COMP9417 T1, 2021 2
Inductive Bias
“All models are wrong, but some models are useful.”
Box & Draper (1987)
COMP9417 T1, 2021 3
Inductive Bias
Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially, it encodes the assumption that the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Success in machine learning requires understanding the inductive bias of algorithms and models and choosing them appropriately for the task.
COMP9417 T1, 2021 4
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
• Nearest Neighbour ?
COMP9417 T1, 2021 5
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
o target function has the form 𝑦 = 𝑎𝑥 + 𝑏
o approximate by fitting using MSE
• Nearest Neighbour ?
o target function is a complex non-linear function of the data
o predict using the nearest neighbour by Euclidean distance in feature space
COMP9417 T1, 2021 6
Inductive Bias
What we would really like:
o a framework for machine learning algorithms
o with a way of representing the inductive bias
o ideally, should be a declarative specification
o also should quantify uncertainty in the inductive bias
COMP9417 T1, 2021 7
A probabilistic approach
COMP9417 T1, 2021 8
A probabilistic approach
A simple probabilistic model
‘Viagra’ and ‘lottery’ are two Boolean features; Y is the class variable, with values ‘spam’ and ‘ham’. In each row the most likely class is indicated in bold.
Viagra  lottery  P(Y = spam | Viagra, lottery)  P(Y = ham | Viagra, lottery)
  0        0                0.31                          0.69
  0        1                0.65                          0.35
  1        0                0.80                          0.20
  1        1                0.40                          0.60
COMP9417 T1, 2021 9
Decision rule
Assuming that 𝑥 and 𝑦 are the only variables we know and care about, the posterior distribution 𝑃(𝑦 |𝑥) helps us to answer many questions of interest.
• For instance, to classify a new e-mail we determine whether the words ‘Viagra’ and ‘lottery’ occur in it, look up the corresponding probability P(y = spam | Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
• Such a recipe to predict a value of y on the basis of the values of 𝑥 and the posterior distribution 𝑃(𝑦 |𝑥) is called a decision rule.
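The decision rule above can be illustrated with a minimal Python sketch (not part of the original slides); the table values are the spam posteriors from the previous slide, and the function name is illustrative only.

```python
# MAP decision rule read off the posterior table: P(Y = spam | Viagra, lottery).
posterior_spam = {
    (0, 0): 0.31,
    (0, 1): 0.65,
    (1, 0): 0.80,
    (1, 1): 0.40,
}

def classify(viagra, lottery):
    """Predict spam if the posterior probability of spam exceeds 0.5, otherwise ham."""
    return "spam" if posterior_spam[(viagra, lottery)] > 0.5 else "ham"

print(classify(1, 0))  # 'spam' (posterior 0.80)
print(classify(1, 1))  # 'ham'  (posterior 0.40)
```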
COMP9417 T1, 2021 10
Bayesian Machine Learning
COMP9417 T1, 2021 11
Two Roles for Bayesian Methods
Provides practical learning algorithms:
– Naive Bayes classifier learning
– Bayesian network learning, etc.
– Combines prior knowledge (prior probabilities) with observed data
Provides useful conceptual framework:
– Provides a “gold standard” for evaluating other learning algorithms
– Some additional insight into Occam’s razor
COMP9417 T1, 2021 12
Classification Question:
Can we do something other than finding a discriminative line (or some other boundary) to separate the two groups?
[Figure: scatter plot of fish measurements, Lightness (x-axis) vs Width (y-axis), showing the Sea bass and Salmon groups]
COMP9417 T1, 2021 13
Classification Answer:
Yes, we can. Instead of finding a discriminative line, maybe we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. Models of this type are called generative learning algorithms.
[Figure: the same scatter plot, now with a model P(x | y = Sea bass), P(y = Sea bass) fitted to the sea bass group and P(x | y = salmon), P(y = salmon) fitted to the salmon group]
COMP9417 T1, 2021 14
Classification – Bayesian methods
Example: imagine we want to classify fish type: salmon or sea bass.
If from past experience we have (c_i is the class):
P(c_i)    Salmon   Sea bass
Prior      0.3       0.7
• If we decide only based on the prior, we always have to choose “sea bass”. This is called a “decision rule based on the prior”.
o This can behave very poorly
o It never predicts other classes
COMP9417 T1, 2021 15
Classification – Bayesian methods
Example: now if we have some more information on the length of the fish in each class, then how can we update our decision, if we want to predict the class for a fish with 70cm length?
These are called “class conditionals”, “class conditioned probabilities”
P(x | c_i)                    Salmon   Sea bass
length > 100 cm                0.5       0.3
50 cm < length < 100 cm        0.4       0.5
length < 50 cm                 0.1       0.2
What we are interested in is P(c_i | x), but we have P(c_i) and P(x | c_i), so how should we go about this?
COMP9417 T1, 2021 16
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
where:
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
COMP9417 T1, 2021 17
Decision Rule from Posteriors Example:
If the output belongs to a set of k classes: y ∈ {C_1, C_2, ..., C_k} for 1 ≤ i ≤ k, then in the Bayesian framework:
P(y = C_i | x) = P(x | C_i) P(C_i) / P(x)
– P(y = C_i | x): posterior probability
– P(x | C_i): class conditional (class likelihood)
– P(C_i): prior
– P(x): marginal (P(x) = Σ_i P(x | C_i) P(C_i))
The decision rule is to select a class which maximizes the posterior probability for the prediction
COMP9417 T1, 2021 18
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)
Generally want the most probable hypothesis given the training data, the Maximum a posteriori hypothesis h_MAP:
h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) P(h) / P(D)
      = argmax_{h∈H} P(D|h) P(h)
COMP9417 T1, 2021 19
Choosing Hypotheses
If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the Maximum Likelihood (ML) hypothesis:
h_ML = argmax_{h_i∈H} P(D|h_i)
COMP9417 T1, 2021 20
Classification – Bayesian methods
Example: now if we have some more information on the length of the fish in each class, then how can
we update our decision, if we want to predict the class for a fish with 70cm length?
P(x | c_i)                    Salmon   Sea bass
length > 100 cm                0.5       0.3
50 cm < length < 100 cm        0.4       0.5
length < 50 cm                 0.1       0.2
P(c = salmon | x = 70cm) ∝ P(70cm | salmon) × P(salmon) = 0.4 × 0.3 = 0.12
P(c = sea bass | x = 70cm) ∝ P(70cm | sea bass) × P(sea bass) = 0.5 × 0.7 = 0.35
So based on these probabilities, our model predicts the type as “sea bass”.
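A small sketch of this calculation (illustrative code only; the numbers are the priors and class conditionals from the tables above):

```python
# Unnormalised posterior: P(c | x) is proportional to P(x | c) * P(c), for a fish of length 70cm.
priors = {"salmon": 0.3, "sea bass": 0.7}
# Class conditionals for the length bin 50cm < length < 100cm, which covers 70cm.
likelihood_70cm = {"salmon": 0.4, "sea bass": 0.5}

scores = {c: likelihood_70cm[c] * priors[c] for c in priors}
print(scores)                        # {'salmon': 0.12, 'sea bass': 0.35}
print(max(scores, key=scores.get))   # 'sea bass'
```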
COMP9417 T1, 2021 21
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = ?            P(not cancer) = ?
P(⊕ | cancer) = ?        P(⊖ | cancer) = ?
P(⊕ | not cancer) = ?    P(⊖ | not cancer) = ?
COMP9417 T1, 2021 22
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = 0.008          P(not cancer) = 0.992
P(⊕ | cancer) = 0.98       P(⊖ | cancer) = 0.02
P(⊕ | not cancer) = 0.03   P(⊖ | not cancer) = 0.97
COMP9417 T1, 2021 23
Applying Bayes Theorem
Does patient have cancer or not?
P(cancer | ⊕) = ?   P(not cancer | ⊕) = ?
We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = ⋯
COMP9417 T1, 2021 24
Applying Bayes Theorem
Does patient have cancer or not?
P(cancer | ⊕) = ?   P(not cancer | ⊕) = ?
We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = not cancer
COMP9417 T1, 2021 25
Applying Bayes Theorem
How to get the posterior probability of a hypothesis h? Divide by P(⊕) (probability of data) to normalize the result for h:
P(h|D) = P(D|h) P(h) / Σ_{h_i∈H} P(D|h_i) P(h_i)
Denominator ensures we obtain posterior probabilities that sum to 1. Sum for all possible numerator values, since hypotheses are mutually exclusive (e.g., patient either has cancer or does not).
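A minimal sketch of this normalisation for the cancer example (numbers from the previous slide; code is illustrative only):

```python
# Unnormalised scores P(D | h) P(h) for each hypothesis given a positive test.
scores = {
    "cancer": 0.98 * 0.008,       # 0.00784
    "not cancer": 0.03 * 0.992,   # 0.02976
}
evidence = sum(scores.values())   # P(+), the normalising denominator

posteriors = {h: s / evidence for h, s in scores.items()}
print(posteriors)                            # cancer ~ 0.21, not cancer ~ 0.79; sums to 1
print(max(posteriors, key=posteriors.get))   # h_MAP = 'not cancer'
```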
COMP9417 T1, 2021 26
Basic Formulas for Probabilities
Product Rule: probability 𝑃 (𝐴 ∧ 𝐵) of conjunction of two events 𝐴 and 𝐵: 𝑃(𝐴 ∧ 𝐵) = 𝑃(𝐴|𝐵)𝑃(𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴)
Sum Rule: probability of disjunction of two events 𝐴 and 𝐵: 𝑃 (𝐴 ∨ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∧ 𝐵)
Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then:
P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
COMP9417 T1, 2021 27
Basic Formulas for Probabilities
Also worth remembering:
o Conditional Probability: probability of A given B:
P(A|B) = P(A ∧ B) / P(B)
o Rearrange sum rule to get:
𝑃 (𝐴 ∧ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∨ 𝐵)
COMP9417 T1, 2021 28
Posterior Probabilities & Decision
If we have two competing hypotheses h_1 and h_2, then:
– P(h_1 | D) = posterior probability of h_1
– P(h_2 | D) = posterior probability of h_2
• So far, the potential decision is made based on the higher posterior probability
– Decide h_1 if P(h_1 | D) > P(h_2 | D), and otherwise decide h_2
• An alternative: using a loss function 𝐿(h), where 𝐿(h) is the loss that occurs
when decision h is made.
– In this setup the Bayesian testing procedure minimizes the posterior expected loss
COMP9417 T1, 2021 29
Bayesian Expected Loss
Example: if, in the example of the patient with a positive test result, we know that the cost of misclassifying a patient who has cancer as “not cancer” is 10 times more than misclassifying a patient who doesn’t have cancer as “cancer”, how will that affect our decision?
COMP9417 T1, 2021 30
Bayesian Expected Loss (Risk)
If the cost of misclassification is not the same for different classes, then instead of maximizing a posteriori, we have to minimize the expected loss:
o So, if we define the loss associated with action α_i as λ(α_i | h)
o Then the expected loss associated with action α_i is:
EL(α_i) = R(α_i | x) = Σ_{h∈H} λ(α_i | h) P(h | x)
An optimal Bayesian decision strategy is to minimize the expected loss. And if the loss associated with misclassification is the same for all classes, then maximizing the posterior is equivalent to minimizing the expected loss.
COMP9417 T1, 2021 31
Bayesian Expected Loss
Example: let’s revisit the example of patient with positive result for cancer,
given the loss function below:
λ(α_i | c_j)              Cancer   Not cancer
If predicted cancer          0          1
If predicted not cancer     10          0
R(predict cancer | ⊕) = λ(predict cancer | cancer) P(cancer | ⊕) + λ(predict cancer | not cancer) P(not cancer | ⊕) ∝ 0 + 1 × 0.02976 = 0.02976
R(predict not cancer | ⊕) = λ(predict not cancer | cancer) P(cancer | ⊕) + λ(predict not cancer | not cancer) P(not cancer | ⊕) ∝ 10 × 0.00784 + 0 = 0.0784
R(predict cancer | ⊕) < R(predict not cancer | ⊕)
Therefore the expected loss is less if we predict that the patient has cancer.
COMP9417 T1, 2021 32
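A minimal sketch of the expected-loss calculation above (loss matrix and unnormalised posteriors as on this slide; names are illustrative):

```python
# Values proportional to P(h | +) from the earlier slide; normalisation does not change the decision.
posterior = {"cancer": 0.00784, "not cancer": 0.02976}
# Loss lambda(action | true hypothesis): missing a cancer costs 10x a false alarm.
loss = {
    ("predict cancer", "cancer"): 0,       ("predict cancer", "not cancer"): 1,
    ("predict not cancer", "cancer"): 10,  ("predict not cancer", "not cancer"): 0,
}

def risk(action):
    """Expected loss R(action | x) = sum over h of lambda(action | h) * P(h | x)."""
    return sum(loss[(action, h)] * p for h, p in posterior.items())

for action in ("predict cancer", "predict not cancer"):
    print(action, risk(action))   # 0.02976 vs 0.0784 -> predicting cancer minimises expected loss
```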
Bayesian Expected Loss
In summary:
• The Bayesian framework allows losses to be integrated into the decision-making rule
• In such a framework, we minimize the posterior expected loss
COMP9417 T1, 2021 33
Bayes Theorem
• To compute 𝑃 𝐷 h and 𝑃(h), we can use an empirical method based on the given data
• Or we may assume a parametric model, then we estimate parameters using the data
COMP9417 T1, 2021 34
Learning A Real Valued Function
Consider any real-valued target function 𝑓
Training examples ⟨x_i, y_i⟩, where y_i is a noisy training value:
• y_i = f(x_i) + ε_i
• ε_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean zero
Then the maximum likelihood hypothesis h_ML (our estimate f̂ = h_ML) is the one that minimizes the sum of squared errors:
h_ML = argmax_{h∈H} P(D|h)
     = argmax_{h∈H} ∏_{i=1}^{n} P(y_i | h)
     = argmax_{h∈H} ∏_{i=1}^{n} (1/√(2πσ²)) exp( −(y_i − h(x_i))² / (2σ²) )
COMP9417 T1, 2021
35
Learning A Real Valued Function
Maximize natural log to give simpler expression:
h_ML = argmax_{h∈H} Σ_{i=1}^{n} [ ln( 1/√(2πσ²) ) − (1/2) ( (y_i − h(x_i)) / σ )² ]
     = argmax_{h∈H} Σ_{i=1}^{n} − (1/2) ( (y_i − h(x_i)) / σ )²
     = argmax_{h∈H} Σ_{i=1}^{n} −( y_i − h(x_i) )²
Equivalently, we can minimize the positive version of the expression:
h_ML = argmin_{h∈H} Σ_{i=1}^{n} ( y_i − h(x_i) )²
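This equivalence can be checked numerically: under zero-mean Gaussian noise, the least-squares fit is the ML hypothesis. A hedged sketch with invented synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: y = f(x) + eps with f(x) = 2x + 1 and zero-mean Gaussian noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)

# Minimising sum_i (y_i - h(x_i))^2 over linear hypotheses h(x) = a*x + b
# is the least-squares fit, i.e. the ML hypothesis under this noise model.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)   # close to the true parameters 2 and 1
```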
COMP9417 T1, 2021 36
Discriminative vs Generative Probabilistic Models
• Discriminative models model the posterior probability distribution 𝑃 (𝑦|𝑥), where 𝑦 is the target variable and 𝑥 is the feature vector. That is, given 𝑥 they return a probability distribution over 𝑦 .
• Generative models model the joint distribution P(y, x) of the target y and the feature vector x. Once we have access to this joint distribution, we can derive any conditional or marginal distribution involving the same variables. In particular, since P(x) = Σ_c P(y = c, x), it follows that the posterior distribution can be obtained as P(y|x) = P(y, x) / Σ_c P(y = c, x).
• Such models are called ‘generative’ because we can sample from the joint distribution to obtain new data points together with their labels. Alternatively, we can use 𝑃(𝑦) to sample a class and 𝑃(𝑥|𝑦) to sample an instance for that class.
COMP9417 T1, 2021 37
Bayesian Approach
What should h be?
o A collection of possible predictions, or
o A collection of functions that map the data 𝑥 to a possible prediction 𝑦
COMP9417 T1, 2021 38
Most Probable Classification of New Instances
So far, we’ve sought the most probable hypothesis given the data D (i.e., h_MAP)
But the most important question is: given a new instance x, what is its most probable classification?
o Although it may seem that h_MAP can answer this question, in fact we can do better
o h_MAP(x) is not necessarily the most probable classification! (e.g. if we are dealing with multiple hypotheses supporting each class)
COMP9417 T1, 2021 39
Most Probable Classification of New Instances
Consider:
o Three possible hypotheses:
P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3
o Given new instance x,
h_1(x) = +, h_2(x) = −, h_3(x) = −
o What’s most probable classification of 𝑥?
COMP9417 T1, 2021 40
Bayes Optimal Classifier
Bayes optimal classification:
argmax_{v_j∈V} Σ_{h_i∈H} P(v_j | h_i) P(h_i | D)
Example:
P(h_1|D) = 0.4,   P(−|h_1) = 0,   P(+|h_1) = 1
P(h_2|D) = 0.3,   P(−|h_2) = 1,   P(+|h_2) = 0
P(h_3|D) = 0.3,   P(−|h_3) = 1,   P(+|h_3) = 0
COMP9417 T1, 2021 41
Bayes Optimal Classifier
therefore
Σ_{h_i∈H} P(+|h_i) P(h_i|D) = 0.4
and
Σ_{h_i∈H} P(−|h_i) P(h_i|D) = 0.6
so
argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = −
(v_j is the class value from the set of possible class values, v_j ∈ V)
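A small sketch reproducing this example (hypothesis posteriors and predictions as above; the code is illustrative):

```python
# Bayes optimal classification: argmax_v sum_h P(v | h) P(h | D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}     # each hypothesis' deterministic prediction

def score(v):
    # P(v | h) is 1 if h predicts v and 0 otherwise, so sum the posteriors of agreeing hypotheses.
    return sum(p for h, p in posterior.items() if prediction[h] == v)

scores = {v: score(v) for v in ("+", "-")}
print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-', even though h_MAP = h1 predicts '+'
```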
COMP9417 T1, 2021 42
Bayes Optimal Classifier
The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new sample based on the training data. This is different from MAP inference.
• In the MAP framework, we seek the most probable hypothesis in the hypothesis space.
• But in Bayes optimal classification, the most probable classification of the new instance is obtained by combining the predictions of ALL hypotheses (in the hypothesis space), weighted by their posterior probabilities:
P(v_j | D) = Σ_{h_i∈H} P(v_j | h_i) P(h_i | D)
The optimal classification of the new instance is the value v_j for which P(v_j | D) is maximum.
COMP9417 T1, 2021 43
Bayes Optimal Classifier
It can be mathematically shown that no other classification method using the same hypothesis space and same prior knowledge can outperform the Bayes Optimal Classifier method on average.
COMP9417 T1, 2021 44
Bayes Optimal Classifier
Why do we need any other methods?
COMP9417 T1, 2021 45
Bayes Optimal Classifier
The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation of those quantities.
COMP9417 T1, 2021 46
Gibbs Classifier
Bayes optimal classifier is very inefficient.
An alternative, less optimal method is Gibbs algorithm.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: assuming target concepts are drawn at random from H according to the priors on H, then
E[error_Gibbs] ≤ 2 × E[error_BayesOptimal]
COMP9417 T1, 2021 47
Bayes Error
What is the best performance attainable by a (two-class) classifier ?
Define the probability of error for classifying some instance x by:
P(error | x) = P(class_1 | x)  if we predict class_2
             = P(class_2 | x)  if we predict class_1
This gives
P(error) = Σ_x P(error | x) P(x).
So we can justify the use of the decision rule:
if P̂(class_1 | x) > P̂(class_2 | x) then predict class_1
else predict class_2
On average, this decision rule minimises probability of classification error.
COMP9417 T1, 2021 48
Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the most practical learning methods.
When to use
o Moderate or large training set available
o Attributes that describe instances are conditionally independent given classification
Successful applications:
o Classifying text documents
o Gaussian Naive Bayes for real-valued data
COMP9417 T1, 2021 49
Naive Bayes Classifier
Assume target function f : X → V, where each instance x is described by attributes ⟨x_1, x_2, ..., x_n⟩.
Most probable value of f(x) is:
v_MAP = argmax_{v_j∈V} P(v_j | x_1, x_2, ..., x_n)
      = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j) / P(x_1, x_2, ..., x_n)
      = argmax_{v_j∈V} P(x_1, x_2, ..., x_n | v_j) P(v_j)
COMP9417 T1, 2021 50
Naive Bayes Classifier
Naive Bayes assumption:
P(x_1, x_2, ..., x_n | v_j) = ∏_i P(x_i | v_j)
• Attributes are statistically independent (given the class value)
o which means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)
Which gives Naive Bayes classifier:
v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(x_i | v_j)
COMP9417 T1, 2021
51
Naive Bayes Algorithm
Naive Bayes Learn(examples):
  for each target value v_j:
    P̂(v_j) ← estimate P(v_j)
    for each attribute value x_i:
      P̂(x_i | v_j) ← estimate P(x_i | v_j)
Classify New Instance(for sample x):
  v_NB = argmax_{v_j∈V} P̂(v_j) ∏_i P̂(x_i | v_j)
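A compact sketch of the algorithm above for categorical attributes, without smoothing; function and variable names are illustrative only:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, class_value). Estimates P(v) and P(x_i | v)."""
    class_counts = Counter(v for _, v in examples)
    value_counts = defaultdict(Counter)              # (class, attribute index) -> value counts
    for x, v in examples:
        for i, xi in enumerate(x):
            value_counts[(v, i)][xi] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    def cond_prob(xi, i, v):                         # estimate of P(x_i | v)
        return value_counts[(v, i)][xi] / class_counts[v]
    return priors, cond_prob

def naive_bayes_classify(priors, cond_prob, x):
    """v_NB = argmax_v P(v) * prod_i P(x_i | v)."""
    def score(v):
        s = priors[v]
        for i, xi in enumerate(x):
            s *= cond_prob(xi, i, v)
        return s
    return max(priors, key=score)
```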
COMP9417 T1, 2021
52
Naive Bayes Example: PlayTennis

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no

COMP9417 T1, 2021 53
Naive Bayes Example: PlayTennis
What are the required probabilities to predict PlayTennis?

Counts:
           Outlook           Temperature        Humidity           Windy             Play
           Yes  No           Yes  No            Yes  No            Yes  No           Yes  No
Sunny       2    3    Hot     2    2    High     3    4    False    6    2            9    5
Overcast    4    0    Mild    4    2    Normal   6    1    True     3    3
Rainy       3    2    Cool    3    1

Probabilities:
Sunny     2/9  3/5   Hot    2/9  2/5   High    3/9  4/5   False   6/9  2/5          9/14  5/14
Overcast  4/9  0/5   Mild   4/9  2/5   Normal  6/9  1/5   True    3/9  3/5
Rainy     3/9  2/5   Cool   3/9  1/5

COMP9417 T1, 2021 54
Naive Bayes Example: PlayTennis
Say we have the new instance:
⟨Outlk = sun, Temp = cool, Humid = high, Wind = true⟩
We want to compute:
v_NB = argmax_{v_j ∈ {“yes”,“no”}} P(v_j) ∏_i P(x_i | v_j)
COMP9417 T1, 2021
55
Naive Bayes Example: PlayTennis
So we first calculate the likelihood of the two classes, “yes” and “no”
For “yes”: P(“yes”) × P(sun|“yes”) × P(cool|“yes”) × P(high|“yes”) × P(true|“yes”)
         = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053
For “no”:  P(“no”) × P(sun|“no”) × P(cool|“no”) × P(high|“no”) × P(true|“no”)
         = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206
COMP9417 T1, 2021 56
Naive Bayes Example: PlayTennis
Then convert to a probability by normalisation
P(“yes”|x) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”|x) = 0.0206 / (0.0053 + 0.0206) = 0.795
The Naive Bayes classification is “no”.
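The same numbers can be reproduced with a short sketch (probabilities read off the tables above):

```python
# P(yes) * P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes), and likewise for "no".
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9
p_no = 5/14 * 3/5 * 1/5 * 4/5 * 3/5
print(round(p_yes, 4), round(p_no, 4))     # 0.0053 0.0206
print(round(p_yes / (p_yes + p_no), 3))    # 0.205 = P("yes" | x) after normalisation
print("no" if p_no > p_yes else "yes")     # Naive Bayes prediction: "no"
```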
COMP9417 T1, 2021
57
Naive Bayes: Subtleties
Conditional independence assumption is often violated
P(x_1, x_2, ..., x_n | v_j) = ∏_i P(x_i | v_j)
– ...but it works surprisingly well anyway. Note that you don’t need the estimated posteriors P̂(v_j | x) to be correct; you need only that
argmax_{v_j∈V} P̂(v_j) ∏_i P̂(x_i | v_j) = argmax_{v_j∈V} P(v_j) P(x_1, x_2, ..., x_n | v_j)
i.e. maximum probability is assigned to correct class
o see [Domingos & Pazzani, 1996] for analysis
o Naive Bayes posteriors often unrealistically close to 1 or 0
o adding too many redundant attributes will cause problems (e.g. identical attributes)
COMP9417 T1, 2021 58
Naive Bayes: “zero-frequency” problem
What if none of the training instances with target value v_j have attribute value x_i? Then
P̂(x_i | v_j) = 0, and ...
P̂(v_j) ∏_i P̂(x_i | v_j) = 0
Pseudo-counts add 1 to each count (a version of the Laplace Estimator)
(In some cases adding a constant different from 1 might be more appropriate)
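A hedged sketch of the add-one (Laplace) estimate for a single attribute value; the counts in the usage line are taken from the PlayTennis tables above:

```python
def laplace_estimate(value_and_class_count, class_count, n_values, alpha=1):
    """P_hat(x_i = value | v_j) with pseudo-counts: (count + alpha) / (class count + alpha * n_values)."""
    return (value_and_class_count + alpha) / (class_count + alpha * n_values)

# Outlook = overcast never occurs with class "no" (0 of 5 examples, 3 possible Outlook values):
print(laplace_estimate(0, 5, 3))   # 0.125 instead of 0, so the product no longer collapses to zero
```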
COMP9417 T1, 2021 59
Naive Bayes: numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
x | v_j ~ N(μ, σ²)
• The probability density function for the normal distribution is defined by two parameters:
– The sample mean μ:
μ = (1/n) Σ_{i=1}^{n} x_i
– The standard deviation σ:
σ = √( (1/(n−1)) Σ_{i=1}^{n} (x_i − μ)² )
These parameters have to be estimated for each class separately.
COMP9417 T1, 2021 60
Naive Bayes: numeric attributes
Then we have the density function 𝑓(𝑥):
f(x) = (1/√(2πσ²)) e^( −(x − μ)² / (2σ²) )
Example: continuous attribute temperature with mean = 73 and standard deviation = 6.2. Density value:
f(temperature = 66 | “yes”) = (1/√(2π·6.2²)) e^( −(66 − 73)² / (2×6.2²) ) = 0.0340
Missing values during training are not included in calculation of mean and standard deviation.
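A one-function sketch of the Gaussian class-conditional density, reproducing the temperature example above:

```python
import math

def gaussian_density(x, mu, sigma):
    """f(x) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(round(gaussian_density(66, mu=73, sigma=6.2), 4))   # 0.0340, as in the example above
```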
COMP9417 T1, 2021 61
Categorical random variables
Categorical variables or features (also called discrete or nominal) are ubiquitous in machine learning. For example in text classification.
• Perhaps the most common form of the Bernoulli distribution models whether or not a word occurs in a document. That is, for the i-th word in our vocabulary we have a random variable X_i governed by a Bernoulli distribution. The joint distribution over the bit vector X = (X_1, ..., X_k) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for example, every word position in an e-mail corresponds to a categorical variable with 𝑘 outcomes, where 𝑘 is the size of the vocabulary. The multinomial distribution manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words in a document. This establishes an alternative way of modelling text documents that allows the number of occurrences of a word to influence the classification of a document.
COMP9417 T1, 2021 62
Categorical random variables
Both these document models are in common use. Despite their differences, they both assume independence between word occurrences, generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of the multinomial distribution, which assumes that words at different word positions are drawn independently from the same categorical distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit vector are statistically independent, which allows us to compute the joint probability of a particular bit vector (x_1, ..., x_k) as the product of the probabilities of each component P(X_i = x_i).
• In practice, such word independence assumptions are often not true: if we know that an e-mail contains the word ‘Viagra’, we can be quite sure that it will also contain the word ‘pill’. Violated independence assumptions reduce the quality of probability estimates but may still allow good classification performance.
COMP9417 T1, 2021 63
Example application: Learning to Classify Text
In machine learning, the classic example of applications of Naive Bayes is learning to classify text documents.
COMP9417 T1, 2021 64
Learning to Classify Text
For example:
o Learn which news articles are of interest
o Learn to classify web pages by topic
Naive Bayes is among most effective algorithms
What attributes shall we use to represent text documents?
COMP9417 T1, 2021 65
Learning to Classify Text
Target concept Interesting? : Document → {+, −}
1. Represent each document by vector of words
o one attribute per word position in document
2. Learning: Use training examples to estimate
o 𝑃(+)
o 𝑃(−)
o P(doc|+)
o P(doc|−)
COMP9417 T1, 2021 66
Learning to Classify Text
Naive Bayes conditional independence assumption
P(doc | v_j) = ∏_{i=1}^{length(doc)} P(x_i = w_k | v_j)
where P(x_i = w_k | v_j) is the probability that the word in position i is w_k, given v_j
one more assumption:
P(x_i = w_k | v_j) = P(x_m = w_k | v_j), ∀ i, m
“bag of words”
COMP9417 T1, 2021 67
Application: 20 Newsgroups

Given: 1000 training documents from each group
Learning task: classify each new document by the newsgroup it came from

comp.graphics               misc.forsale          alt.atheism
comp.os.ms-windows.misc     rec.autos             soc.religion.christian
comp.sys.ibm.pc.hardware    rec.motorcycles       talk.religion.misc
comp.sys.mac.hardware       rec.sport.baseball    talk.politics.mideast
comp.windows.x              rec.sport.hockey      talk.politics.misc
sci.space                   sci.crypt             talk.politics.guns
sci.electronics             sci.med

Naive Bayes: 89% classification accuracy
COMP9417 T1, 2021 68
Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided ...
COMP9417 T1, 2021 69
Learning Curve for 20 Newsgroups

[Figure: learning curves on the 20News data — classification accuracy vs. training set size (1/3 withheld for test) for Bayes, TFIDF and PRTFIDF]
COMP9417 T1, 2021 70
Probabilistic decision rules
We again use the example of Naive Bayes for text classification to illustrate, using both the multinomial and multivariate Bernoulli models.
COMP9417 T1, 2021 71
Training a naive Bayes model
Consider the following e-mails consisting of five words a, b, c, d, e:
e1: b d e b b d e            e5: a b a b a b a e d
e2: b c e b b d d e c c      e6: a c a c a c a e d
e3: a d a d e a e e          e7: e a e d a e a
e4: b a d b e d a b          e8: d e d e d
We are told that the e-mails on the left are spam and those on the right are ham, and so we use them as a small training set to train our Bayesian classifier.
• First, we decide that 𝑑 and 𝑒 are so-called stop words that are too common to convey class information.
• The remaining words, 𝑎, 𝑏 and 𝑐, constitute our vocabulary.
COMP9417 T1, 2021 72
Training data for naive Bayes
The same data set described by bit vectors (presence of words). We can use the multivariate Bernoulli model.

E-mail  a?  b?  c?  Class
e1      0   1   0   +
e2      0   1   1   +
e3      1   0   0   +
e4      1   1   0   +
e5      1   1   0   -
e6      1   0   1   -
e7      1   0   0   -
e8      0   0   0   -

COMP9417 T1, 2021 73
Training a naive Bayes model
In the multivariate Bernoulli model e-mails are represented by bit vectors, as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and (3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in order to get an estimate of the probability of a document containing a particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one containing each word and one containing none of them.
• This results in the estimated parameter vectors
𝜃k ⊕ = BC , DC , $C = 0.5,0.67,0.33 𝜃k ⊖ = DC , $C , $C = (0.67,0.33,0.33)
for spam for ham
COMP9417 T1, 2021
74
Training data for naive Bayes
A small e-mail data set described by count vectors. We can use the multinomial model.

E-mail  #a  #b  #c  Class
e1      0   3   0   +
e2      0   3   3   +
e3      3   0   0   +
e4      2   3   0   +
e5      4   3   0   -
e6      4   0   3   -
e7      3   0   0   -
e8      0   0   0   -

COMP9417 T1, 2021 75
Training a naive Bayes model
For the multinomial model, we represent each e-mail as a count vector, as before.
• In order to estimate the parameters of the multinomial, we sum up the count vectors for each class, which gives (5, 9, 3) for spam and (11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for each vocabulary word, which brings the total number of occurrences of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
θ̂⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam
θ̂⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham
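A short sketch reproducing the smoothed multinomial estimates from the count vectors above (illustrative code only):

```python
# Count vectors (#a, #b, #c) for the spam and ham e-mails above.
spam_counts = [(0, 3, 0), (0, 3, 3), (3, 0, 0), (2, 3, 0)]
ham_counts = [(4, 3, 0), (4, 0, 3), (3, 0, 0), (0, 0, 0)]

def multinomial_params(count_vectors, alpha=1):
    """Smoothed estimates: (total count of word + alpha) / (total of all counts + alpha * vocab size)."""
    totals = [sum(col) for col in zip(*count_vectors)]   # per-word totals, e.g. (5, 9, 3) for spam
    denom = sum(totals) + alpha * len(totals)            # 20 for each class here
    return [round((t + alpha) / denom, 2) for t in totals]

print(multinomial_params(spam_counts))   # [0.3, 0.5, 0.2]
print(multinomial_params(ham_counts))    # [0.6, 0.2, 0.2]
```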
COMP9417 T1, 2021
76
Prediction using a naive Bayes model
Suppose our vocabulary contains three words 𝑎, 𝑏 and 𝑐, and we use a multivariate Bernoulli model for our e-mails, with parameters
θ⊕ = (0.5, 0.67, 0.33)   θ⊖ = (0.67, 0.33, 0.33)
This means, for example, that the presence of 𝑏 is twice as likely in spam (+),
compared with ham (−).
The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain likelihoods:
P(x | ⊕) = 0.50 × 0.67 × (1 − 0.33) = 0.222
P(x | ⊖) = 0.67 × 0.33 × (1 − 0.33) = 0.148
The ML classification of 𝑥 is thus spam.
COMP9417 T1, 2021 77
Prediction using a naive Bayes model
In the case of two classes it is often convenient to work with likelihood ratios and odds.
o The likelihood ratio can be calculated as
P(x|⊕) / P(x|⊖) = ( 0.5 × 0.67 × (1 − 0.33) ) / ( 0.67 × 0.33 × (1 − 0.33) ) = 3/2 > 1
o This means that the MAP classification of x is also spam if the prior odds are more than 2/3, but ham if they are less than that.
o For example, with 33% spam and 67% ham the prior odds are P(⊕)/P(⊖) = 0.33/0.67 = 1/2, resulting in posterior odds of
P(⊕|x) / P(⊖|x) = ( P(x|⊕) / P(x|⊖) ) × ( P(⊕) / P(⊖) ) = 3/2 × 1/2 = 3/4 < 1
In this case the likelihood ratio for x is not strong enough to push the decision away from the prior
COMP9417 T1, 2021 78
Prediction using a naive Bayes model
Alternatively, we can employ a multinomial model. The parameters of a multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2)   θ⊖ = (0.6, 0.2, 0.2)
The e-mail to be classified contains three occurrences of word 𝑎, one single occurrence of word 𝑏 and no occurrences of word 𝑐, and hence is described by the count vector 𝑥 = (3,1,0). The total number of vocabulary word occurrences is 𝑛 = 4. We obtain likelihoods:
P(x | ⊕) = 4! × (0.3³/3!) × (0.5¹/1!) × (0.2⁰/0!) = 0.054
P(x | ⊖) = 4! × (0.6³/3!) × (0.2¹/1!) × (0.2⁰/0!) = 0.1728
The likelihood ratio is (0.3/0.6)³ (0.5/0.2)¹ (0.2/0.2)⁰ = 5/16. The ML classification of x is thus ham, the opposite of the multivariate Bernoulli model. This is mainly because of the three occurrences of word a, which provide strong evidence for ham.
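These likelihoods can be checked with a short sketch (standard library only; names are illustrative):

```python
from math import factorial, prod

def multinomial_likelihood(counts, theta):
    """P(x | class) for a count vector x under a multinomial with parameter vector theta."""
    n = sum(counts)
    coeff = factorial(n) / prod(factorial(c) for c in counts)
    return coeff * prod(t ** c for t, c in zip(theta, counts))

x = (3, 1, 0)                                       # three a's, one b, no c
p_spam = multinomial_likelihood(x, (0.3, 0.5, 0.2))
p_ham = multinomial_likelihood(x, (0.6, 0.2, 0.2))
print(round(p_spam, 4), round(p_ham, 4))            # 0.054 0.1728
print(round(p_spam / p_ham, 4))                     # 0.3125 = 5/16 -> ML classification is ham
```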
COMP9417 T1, 2021 79
Naive Bayes: missing values
What about missing values ? A very basic approach is:
– Training: instance is not included in frequency count for attribute value-class combination
– Classification: attribute will be omitted from calculation
Can we do better?
COMP9417 T1, 2021 80
Missing values
Let’s use the spam-ham data in slide 9. Suppose we skimmed an e-mail and noticed that it contains the word ‘lottery’ but we haven’t looked closely enough to determine whether it uses the word ‘Viagra’. This means that we don’t know whether to use the second or the fourth row (in slide 9) to make a prediction. This is a problem, as we would predict spam if the e-mail contained the word ‘Viagra’ (second row) and ham if it didn’t (fourth row).
The solution is to average these two rows, using the probability of ‘Viagra’ occurring in any e-mail (spam or not):
P(y | lottery) = P(y | Viagra = 0, lottery) P(Viagra = 0) + P(y | Viagra = 1, lottery) P(Viagra = 1)
COMP9417 T1, 2021 81
Missing values
For instance, suppose for the sake of argument that one in ten e-mails contain the word ‘Viagra’, then 𝑃(𝑉𝑖𝑎𝑔𝑟𝑎 = 1) = 0.10 and 𝑃(𝑉𝑖𝑎𝑔𝑟𝑎 = 0) = 0.90. Using the above formula, we obtain
P(y = spam | lottery = 1) = 0.65 × 0.90 + 0.40 × 0.10 = 0.625
and
P(y = ham | lottery = 1) = 0.35 × 0.90 + 0.60 × 0.10 = 0.375
Because the occurrence of ‘Viagra’ in any e-mail is relatively rare, the resulting distribution deviates only a little from the second row in the data.
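The marginalisation can be checked in a few lines (posterior values from the table in slide 9 and the assumed P(Viagra = 1) = 0.10):

```python
p_viagra = {0: 0.90, 1: 0.10}        # assumed marginal distribution of the missing feature
p_spam_given = {0: 0.65, 1: 0.40}    # P(spam | Viagra = v, lottery = 1) from the posterior table

p_spam = sum(p_spam_given[v] * p_viagra[v] for v in (0, 1))
print(round(p_spam, 3), round(1 - p_spam, 3))   # 0.625 0.375
```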
COMP9417 T1, 2021 82
Likelihood ratio
As a matter of fact, statisticians work very often with different conditional probabilities, given by the likelihood function 𝑃(𝑥|𝑦).
• I like to think of these as thought experiments: if somebody were to send me a spam e-mail, how likely would it be that it contains exactly the words of the e-mail I’m looking at? And how likely if it were a ham e-mail instead?
• What really matters is not the magnitude of these likelihoods, but their ratio: how much more likely is it to observe this combination of words in a spam e- mail than it is in a non-spam e-mail.
• For instance, suppose that for a particular e-mail described by x we have P(x | y = spam) = 3.5×10⁻⁵ and P(x | y = ham) = 7.4×10⁻⁶; then observing x in a spam e-mail is nearly five times more likely than it is in a ham e-mail.
• This suggests the following decision rule: predict spam if the likelihood ratio is larger than 1 and ham otherwise.
COMP9417 T1, 2021 83
When to use likelihoods
Use likelihoods if you want to ignore the prior distribution or assume it uniform, and posterior probabilities otherwise.
COMP9417 T1, 2021 84
Posterior odds
P(y = spam | Viagra = 0, lottery = 0) / P(y = ham | Viagra = 0, lottery = 0) = 0.31/0.69 = 0.45
P(y = spam | Viagra = 1, lottery = 1) / P(y = ham | Viagra = 1, lottery = 1) = 0.40/0.60 = 0.67
P(y = spam | Viagra = 0, lottery = 1) / P(y = ham | Viagra = 0, lottery = 1) = 0.65/0.35 = 1.9
P(y = spam | Viagra = 1, lottery = 0) / P(y = ham | Viagra = 1, lottery = 0) = 0.80/0.20 = 4.0
Using a MAP decision rule we predict ham in the top two cases and spam in the bottom two. Given that the full posterior distribution is all there is to know about the domain in a statistical sense, these predictions are the best we can do: they are Bayes-optimal.
COMP9417 T1, 2021 85
Example marginal likelihoods

Y      P(Viagra = 1 | Y)  P(Viagra = 0 | Y)      Y      P(lottery = 1 | Y)  P(lottery = 0 | Y)
spam         0.40               0.60             spam          0.21                0.79
ham          0.12               0.88             ham           0.13                0.87

COMP9417 T1, 2021 86
Using marginal likelihoods
Using the marginal likelihoods from before, we can approximate the likelihood ratios for Naïve Bayes as follows (the previously calculated odds from the full posterior distribution are shown in brackets):
P(Viagra = 0 | y = spam) P(lottery = 0 | y = spam) / [ P(Viagra = 0 | y = ham) P(lottery = 0 | y = ham) ] = (0.60 × 0.79) / (0.88 × 0.87) = 0.62   (0.45)
P(Viagra = 0 | y = spam) P(lottery = 1 | y = spam) / [ P(Viagra = 0 | y = ham) P(lottery = 1 | y = ham) ] = (0.60 × 0.21) / (0.88 × 0.13) = 1.1    (1.9)
P(Viagra = 1 | y = spam) P(lottery = 0 | y = spam) / [ P(Viagra = 1 | y = ham) P(lottery = 0 | y = ham) ] = (0.40 × 0.79) / (0.12 × 0.87) = 3.0    (4.0)
P(Viagra = 1 | y = spam) P(lottery = 1 | y = spam) / [ P(Viagra = 1 | y = ham) P(lottery = 1 | y = ham) ] = (0.40 × 0.21) / (0.12 × 0.13) = 5.4    (0.67)
We see that, using a maximum likelihood decision rule, our very simple model arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth (‘Viagra’ and ‘lottery’ both present), where the marginal likelihoods are actually very misleading.
COMP9417 T1, 2021 87
Naive Bayes
Pros:
• Easy and fast at prediction. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs well compared to other models such as logistic regression, and needs less training data.
• Performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Cons:
• If the zero-frequency case occurs, it is unable to make a sensible prediction (solution: use a smoothing technique such as the Laplace estimator).
• On the other hand, Naive Bayes is also known to be a poor estimator, so its probability outputs should not be taken too seriously.
• Strong assumption of independent attributes. In real life, it is almost impossible to get a set of attributes which are completely independent.
COMP9417 T1, 2021 88
Logistic Regression
a probabilistic linear classifier
COMP9417 T1, 2021 89
Logistic Regression
• In a binary classification problem with one input variable, if we show the output on the y-axis we may get something like below:
[Figure: binary outputs (y = 0 or 1) plotted against the input variable x]
• Can we use a linear regression to model this data?
COMP9417 T1, 2021 90
Logistic Regression
Why regular regression doesn’t work here:
• Univariate or multivariate regression is used to predict a real-valued output from one or more independent variables
• Binary data doesn’t have a normal distribution, which is a condition needed for most other types of regression
• What we expect to get is an output of 0 or 1, but linear regression can obviously produce values beyond that range.
COMP9417 T1, 2021 91
Logistic Regression
• In binary classification, we can transform the y values into probability values (values are in the range [0, 1])
• We can model this with an s-curve (sigmoid curve) as below:
[Figure: sigmoid curve of P(y = 1 | x) against x]
P(y = 1 | x) = 1 / (1 + e^(−f(x)))   ⟹   f(x) = log( P(y = 1 | x) / (1 − P(y = 1 | x)) )
Now f(x) can have a value between −∞ and +∞, and in Logistic Regression we estimate f(x) with a line:
f̂(x) = xᵀβ   ⟹   log( P(y = 1 | x) / (1 − P(y = 1 | x)) ) = xᵀβ
COMP9417 T1, 2021 92
Logistic Regression
Logistic regression seeks to:
• Model the probability of a class given the values of the independent input variables
• Estimate the probability that a class occurs for a random observation (versus the probability that the class doesn’t occur)
• Classify an observation based on the probability estimations
COMP9417 T1, 2021 93
Logistic Regression
P̂(y = 1 | x) = 1 / (1 + e^(−xᵀβ))
If P̂(y = 1 | x) ≥ 0.5 (same as saying xᵀβ ≥ 0), then predict class 1
If P̂(y = 1 | x) < 0.5 (same as saying xᵀβ < 0), then predict class 0
• The interpretation of this model in the input space (feature space) is equivalent to having a linear decision boundary (xᵀβ = 0) separating the two classes
• Now we have a linear solution to our problem, and this is what makes Logistic Regression a linear model.
COMP9417 T1, 2021 94
Logistic Regression Parameter Estimation
• We cannot use the same cost function as we used in linear regression here, because it would result in a non-convex function with many local minima, and it would be very difficult to find the global minimum.
• Instead, the following cost function is used. Let’s define P̂(y = 1 | x) = h_β(x):
cost(h_β(x), y) = −log( h_β(x) )        if y = 1
                = −log( 1 − h_β(x) )    if y = 0
J(β) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log( h_β(x⁽ⁱ⁾) ) + (1 − y⁽ⁱ⁾) log( 1 − h_β(x⁽ⁱ⁾) ) ]
Now, to find the values of the parameters that minimize J(β), we can use the Gradient Descent algorithm.
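A minimal sketch of logistic regression trained by gradient descent on the cost J(β) above; the synthetic data, learning rate and iteration count are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D problem: class 1 tends to have larger x values.
x = rng.normal(0, 1, size=200)
y = (x + 0.5 * rng.normal(0, 1, size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])           # add an intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.zeros(2)
learning_rate = 0.1
for _ in range(2000):
    p = sigmoid(X @ beta)                           # h_beta(x) = P_hat(y = 1 | x)
    gradient = X.T @ (p - y) / len(y)               # gradient of the cross-entropy cost J(beta)
    beta -= learning_rate * gradient

predictions = (sigmoid(X @ beta) >= 0.5).astype(float)   # decision rule: predict 1 if P_hat >= 0.5
print(beta, (predictions == y).mean())                   # learned coefficients and training accuracy
```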
COMP9417 T1, 2021 95
Minimum Description Length Principle
COMP9417 T1, 2021 96
Minimum Description Length Principle
Once again, the MAP hypothesis:
h_MAP = argmax_{h∈H} P(D|h) P(h)
Which is equivalent to
h_MAP = argmax_{h∈H} [ log₂ P(D|h) + log₂ P(h) ]
or
h_MAP = argmin_{h∈H} [ −log₂ P(D|h) − log₂ P(h) ]
COMP9417 T1, 2021 97
Minimum Description Length Principle
Interestingly, this is an expression about a quantity of bits:
h_MAP = argmin_{h∈H} [ −log₂ P(D|h) − log₂ P(h) ]     (1)
From information theory:
The optimal (shortest expected coding length) code for an event with probability p is −log₂ p bits.
COMP9417 T1, 2021 98
Minimum Description Length Principle
So interpret (1):
o −log₂ P(h) is length of h under optimal code. The length of your hypothesis
o −log₂ P(D|h) is length of D given h under optimal code. This is a notion of error/misclassification.
Note: assumes optimal encodings, when the priors and likelihoods are known. In practice, this is difficult, and makes a difference.
COMP9417 T1, 2021 99
Minimum Description Length Principle
Occam’s razor: a principle stating that, given all other things being equal, a shorter hypothesis for observed data should be favoured over a lengthier hypothesis.
MDL: prefer the hypothesis h that minimizes
h_MAP = argmin_{h∈H} [ −log₂ P(D|h) − log₂ P(h) ]
So, from an information theory perspective, the best hypothesis is the one that minimizes both the error and the size of the hypothesis: you want the simplest hypothesis that minimizes the error. This is a kind of Bayesian argument for Occam’s razor.
COMP9417 T1, 2021 100
Summary
• We described the classification problem in machine learning
• We also outlined the issue of Inductive Bias
• Two major frameworks for classification were covered
o Distance-based. The key ideas are geometric.
o Probabilistic. The key ideas are Bayesian.
• We also discussed Logistic Regression
• So we have established the basis for learning classifiers
• Later we will see how to extend by building on these ideas
COMP9417 T1, 2021 101
Acknowledgements
• Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997) http://www- 2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 T1, 2021 102