
Classification (2)
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain

Aims
This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
– explain the concept of inductive bias in machine learning
– outline Bayes Theorem as applied in machine learning
– define MAP and ML inference using Bayes theorem
– define the Bayes optimal classification rule in terms of MAP inference
– outline the Naive Bayes classification algorithm
– describe typical applications of Naive Bayes for text classification
– outline the logistic regression classification algorithm
COMP9417 T1, 2021 1

Introduction
What do we understand about the problem of learning classifiers? ... how can we know when classifier learning succeeds?
and ... can we use this to build practical algorithms?
COMP9417 T1, 2021 2

Inductive Bias
“All models are wrong, but some models are useful.”
Box & Draper (1987)
COMP9417 T1, 2021 3

Inductive Bias
Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially it means that the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Success in machine learning requires understanding the inductive bias of algorithms and models and choosing them appropriately for the task.
COMP9417 T1, 2021 4

Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
  o …
  o …
• Nearest Neighbour ?
  o …
  o …
COMP9417 T1, 2021 5

Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
o target function has the form 𝑦 = 𝑎𝑥 + 𝑏
o approximate by fitting using MSE
• Nearest Neighbour ?
o target function is a complex non-linear function of the data
o predict using the nearest neighbour by Euclidean distance in feature space
COMP9417 T1, 2021 6

Inductive Bias
What we would really like:
o a framework for machine learning algorithms
o with a way of representing the inductive bias
o ideally, should be a declarative specification
o also should quantify uncertainty in the inductive bias
COMP9417 T1, 2021 7

A probabilistic approach
COMP9417 T1, 2021 8

A probabilistic approach
A simple probabilistic model
‘Viagra’ and ‘lottery’ are two Boolean features; Y is the class variable, with values ‘spam’ and ‘ham’. In each row the most likely class is the one with the higher probability.

Viagra  lottery  P(Y = spam | Viagra, lottery)  P(Y = ham | Viagra, lottery)
  0       0                 0.31                           0.69
  0       1                 0.65                           0.35
  1       0                 0.80                           0.20
  1       1                 0.40                           0.60
COMP9417 T1, 2021 9

Decision rule
Assuming that 𝑥 and 𝑦 are the only variables we know and care about, the posterior distribution 𝑃(𝑦 |𝑥) helps us to answer many questions of interest.
• For instance, to classify a new e-mail we determine whether the words ‘Viagra’ and ‘lottery’ occur in it, look up the corresponding probability P(y = spam | Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
• Such a recipe to predict a value of y on the basis of the values of 𝑥 and the posterior distribution 𝑃(𝑦 |𝑥) is called a decision rule.
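To make the decision rule concrete, here is a minimal Python sketch (illustrative, not from the slides; the table is the posterior distribution from the previous slide):

```python
# Posterior table P(Y = spam | Viagra, lottery) from the previous slide.
POSTERIOR_SPAM = {
    (0, 0): 0.31,
    (0, 1): 0.65,
    (1, 0): 0.80,
    (1, 1): 0.40,
}

def classify(viagra: int, lottery: int) -> str:
    """Decision rule: predict spam if P(spam | features) > 0.5, else ham."""
    p_spam = POSTERIOR_SPAM[(viagra, lottery)]
    return "spam" if p_spam > 0.5 else "ham"

# Example: an e-mail containing 'lottery' but not 'Viagra' is predicted spam.
print(classify(viagra=0, lottery=1))  # spam (P = 0.65)
```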
COMP9417 T1, 2021 10

Bayesian Machine Learning
COMP9417 T1, 2021 11

Two Roles for Bayesian Methods
Provides practical learning algorithms:
– Naive Bayes classifier learning
– Bayesian network learning, etc.
– Combines prior knowledge (prior probabilities) with observed data
Provides useful conceptual framework:
– Provides a “gold standard” for evaluating other learning algorithms
– Some additional insight into Occam’s razor
COMP9417 T1, 2021 12

Classification Question:
Can we do something different from finding a discriminative line (or some other boundary) to separate the two groups?
[Figure: fish measurements plotted with Lightness on the x-axis and Width on the y-axis; the sea bass and salmon examples form two clusters]
COMP9417 T1, 2021
13

Classification Answer:
Yes, we can. Instead of finding a discriminative line, we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. Models of this type are called generative learning algorithms.
[Figure: the same Lightness/Width plot, now with a class-conditional model fitted to each cluster: P(x | y = sea bass) with prior P(y = sea bass), and P(x | y = salmon) with prior P(y = salmon)]
COMP9417 T1, 2021
14

Classification – Bayesian methods
Example: imagine we want to classify fish into two types: salmon and sea bass.
If from past experience we have (c_i is the class):

P(c_i)    Salmon   Sea bass
Prior       0.3       0.7

• If we decide only based on the prior, we always have to choose “sea bass”. This is called a “decision rule based on the prior”
  o This can behave very poorly
  o It never predicts the other classes
COMP9417 T1, 2021 15

Classification – Bayesian methods
Example: now suppose we have some more information, on the length of the fish in each class. How can we update our decision if we want to predict the class for a fish of length 70 cm?
These are called “class conditionals” or “class-conditional probabilities”:

P(x | c_i)                 Salmon   Sea bass
length > 100 cm              0.5       0.3
50 cm < length < 100 cm      0.4       0.5
length < 50 cm               0.1       0.2

What we are interested in is P(c_i | x), but we have P(c_i) and P(x | c_i), so how should we go about this?
COMP9417 T1, 2021 16

Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

where:
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h | D) = probability of h given D
P(D | h) = probability of D given h
COMP9417 T1, 2021 17

Decision Rule from Posteriors
Example: if the output belongs to a set of k classes, y ∈ {C_1, C_2, ..., C_k} for 1 ≤ i ≤ k, then in the Bayesian framework:

$$P(y = C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{P(x)}$$

– P(y = C_i | x): posterior probability
– P(x | C_i): class conditional (class likelihood)
– P(C_i): prior
– P(x): marginal, $P(x) = \sum_i P(x \mid C_i)\,P(C_i)$
The decision rule is to select the class which maximizes the posterior probability for the prediction.
COMP9417 T1, 2021 18

Choosing Hypotheses

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Generally we want the most probable hypothesis given the training data, the Maximum a posteriori (MAP) hypothesis h_MAP:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

COMP9417 T1, 2021 19

Choosing Hypotheses
If we assume P(h_i) = P(h_j) for all i, j, then we can simplify further and choose the Maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$

COMP9417 T1, 2021 20

Classification – Bayesian methods
Example: returning to the fish of length 70 cm, using the class conditionals

P(x | c_i)                 Salmon   Sea bass
length > 100 cm              0.5       0.3
50 cm < length < 100 cm      0.4       0.5
length < 50 cm               0.1       0.2

P(c = salmon | x = 70 cm) ∝ P(70 cm | salmon) × P(salmon) = 0.4 × 0.3 = 0.12
P(c = sea bass | x = 70 cm) ∝ P(70 cm | sea bass) × P(sea bass) = 0.5 × 0.7 = 0.35
So based on these probabilities, our model predicts the type as “sea bass”.
COMP9417 T1, 2021 21

Applying Bayes Theorem
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) = ?          P(not cancer) = ?
P(⊕ | cancer) = ?      P(⊖ | cancer) = ?
P(⊕ | not cancer) = ?  P(⊖ | not cancer) = ?
COMP9417 T1, 2021 22

Applying Bayes Theorem
Filling in the quantities from the description:
P(cancer) = 0.008          P(not cancer) = 0.992
P(⊕ | cancer) = 0.98       P(⊖ | cancer) = 0.02
P(⊕ | not cancer) = 0.03   P(⊖ | not cancer) = 0.97
COMP9417 T1, 2021 23

Applying Bayes Theorem
Does the patient have cancer or not? P(cancer | ⊕) = ?  P(not cancer | ⊕) = ?
We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = not cancer.
COMP9417 T1, 2021 25

Applying Bayes Theorem
How do we get the posterior probability of a hypothesis h? Divide by P(⊕) (the probability of the data) to normalize the result for h:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{\sum_{h_i \in H} P(D \mid h_i)\,P(h_i)}$$

The denominator ensures we obtain posterior probabilities that sum to 1. We sum over all possible numerator values, since the hypotheses are mutually exclusive (e.g., the patient either has cancer or does not).
COMP9417 T1, 2021 26

Basic Formulas for Probabilities
Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Sum rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then:

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$

COMP9417 T1, 2021 27

Basic Formulas for Probabilities
Also worth remembering:
o Conditional probability: probability of A given B: P(A | B) = P(A ∧ B) / P(B)
o Rearrange the sum rule to get: P(A ∧ B) = P(A) + P(B) − P(A ∨ B)
COMP9417 T1, 2021 28

Posterior Probabilities & Decision
If we have two competing hypotheses h_1 and h_2, then
– P(h_1 | D) = posterior probability of h_1
– P(h_2 | D) = posterior probability of h_2
• So far, the decision is made based on the higher posterior probability
– Decide h_1 if P(h_1 | D) > P(h_2 | D), and otherwise decide h_2
• An alternative: use a loss function L(h), where L(h) is the loss that occurs when decision h is made
– In this setup the Bayesian testing procedure minimizes the posterior expected loss
COMP9417 T1, 2021 29
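The cancer test example above can be reproduced in a few lines. This is an illustrative sketch (not from the lecture); it computes the unnormalized MAP scores and then the normalized posteriors:

```python
# Quantities from the cancer test example (slides 22-26).
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98       # test sensitivity
p_pos_given_not_cancer = 0.03   # false positive rate

# Unnormalized posterior scores P(+|h) * P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer               # 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer   # 0.02976

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map)  # not cancer

# Normalize to obtain posterior probabilities given a positive test.
evidence = score_cancer + score_not_cancer
print(score_cancer / evidence)      # ~0.21  P(cancer | +)
print(score_not_cancer / evidence)  # ~0.79  P(not cancer | +)
```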

Bayesian Expected Loss
Example: in the example of the patient with a positive test result, if we know that the cost of misclassifying a patient who has cancer as “not cancer” is 10 times greater than the cost of misclassifying a patient who does not have cancer as “cancer”, how will that affect our decision?
COMP9417 T1, 2021 30

Bayesian Expected Loss (Risk)
If the cost of misclassification is not the same for different classes, then instead of maximizing the posterior we have to minimize the expected loss:
o If we define the loss associated with action α_i when the true state is h as λ(α_i | h)
o Then the expected loss (risk) associated with action α_i is:

$$EL(\alpha_i) = R(\alpha_i \mid x) = \sum_{h \in H} \lambda(\alpha_i \mid h)\,P(h \mid x)$$

An optimal Bayesian decision strategy is to minimize the expected loss. If the loss associated with misclassification is the same for all classes, then maximizing the posterior is equivalent to minimizing the expected loss.
COMP9417 T1, 2021 31
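The worked example on the next slide applies this rule to the cancer test, with a loss of 10 for predicting "not cancer" when the patient has cancer and 1 for the opposite error. The sketch below (illustrative, not from the slides) carries out the same calculation using the unnormalized posterior scores computed earlier:

```python
# Unnormalized posterior scores from the Bayes-theorem slides
# (proportional to P(cancer | +) and P(not cancer | +)).
scores = {"cancer": 0.00784, "not cancer": 0.02976}

# Loss matrix lambda(action | true state) used in the worked example:
# a missed cancer costs 10, a false alarm costs 1, correct decisions cost 0.
loss = {
    ("predict cancer", "cancer"): 0,
    ("predict cancer", "not cancer"): 1,
    ("predict not cancer", "cancer"): 10,
    ("predict not cancer", "not cancer"): 0,
}

def expected_loss(action):
    """R(action | x), up to the same normalising constant as the scores."""
    return sum(loss[(action, h)] * p for h, p in scores.items())

risks = {a: expected_loss(a) for a in ("predict cancer", "predict not cancer")}
print(risks)  # {'predict cancer': 0.02976, 'predict not cancer': 0.0784}
print(min(risks, key=risks.get))  # predict cancer
```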

Bayesian Expected Loss
Example: let's revisit the example of the patient with a positive test result for cancer, given the loss function below:

λ(α_i | c_i)               Cancer   Not cancer
If predicted cancer           0          1
If predicted not cancer      10          0

R(predict cancer | ⊕) = λ(predict cancer | cancer) P(cancer | ⊕) + λ(predict cancer | not cancer) P(not cancer | ⊕) ∝ 0 + 1 × 0.02976 = 0.02976
R(predict not cancer | ⊕) = λ(predict not cancer | cancer) P(cancer | ⊕) + λ(predict not cancer | not cancer) P(not cancer | ⊕) ∝ 10 × 0.00784 + 0 = 0.0784
R(predict cancer | ⊕) < R(predict not cancer | ⊕)
Therefore the expected loss is smaller if we predict that the patient has cancer.
COMP9417 T1, 2021 32

Bayesian Expected Loss
In summary:
• The Bayesian framework allows losses to be integrated into the decision-making rule
• In such a framework, we minimize the posterior expected loss
COMP9417 T1, 2021 33

Bayes Theorem
• To compute P(D | h) and P(h), we can use an empirical method based on the given data
• Or we may assume a parametric model, and then estimate its parameters using the data
COMP9417 T1, 2021 34

Learning A Real Valued Function
Consider any real-valued target function f.
Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:
• d_i = f(x_i) + ε_i
• ε_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean zero
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

$$h_{ML} = \arg\max_{h \in H} P(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$

COMP9417 T1, 2021 35

Learning A Real Valued Function
Maximize the natural log to give a simpler expression:

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 = \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2$$

Equivalently, we can minimize the positive version of the expression:

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$

COMP9417 T1, 2021 36

Discriminative vs Generative Probabilistic Models
• Discriminative models model the posterior probability distribution P(y | x), where y is the target variable and x is the feature vector. That is, given x they return a probability distribution over y.
• Generative models model the joint distribution P(y, x) of the target y and the feature vector x. Once we have access to this joint distribution, we can derive any conditional or marginal distribution involving the same variables. In particular, since $P(x) = \sum_{c} P(y = c, x)$, it follows that the posterior distribution can be obtained as $P(y \mid x) = \frac{P(y, x)}{\sum_{c} P(y = c, x)}$.
• Such models are called ‘generative’ because we can sample from the joint distribution to obtain new data points together with their labels. Alternatively, we can use P(y) to sample a class and P(x | y) to sample an instance for that class.
COMP9417 T1, 2021 37

Bayesian Approach
What should h be?
o A collection of possible predictions, or
o A collection of functions that map the data x to a possible prediction y
COMP9417 T1, 2021 38

Most Probable Classification of New Instances
So far, we've sought the most probable hypothesis given the data D (i.e., h_MAP).
But the most important question is: given a new instance x, what is its most probable classification?
o Although it may seem that h_MAP can answer this question, in fact we can do better
o h_MAP(x) is not necessarily the most probable classification! (e.g. if we are dealing with multiple hypotheses supporting each class)
COMP9417 T1, 2021 39

Most Probable Classification of New Instances
Consider:
o Three possible hypotheses: P(h_1 | D) = 0.4, P(h_2 | D) = 0.3, P(h_3 | D) = 0.3
o Given a new instance x, h_1(x) = +, h_2(x) = −, h_3(x) = −
o What's the most probable classification of x?
COMP9417 T1, 2021 40

Bayes Optimal Classifier
Bayes optimal classification:

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:
P(h_1 | D) = 0.4,  P(− | h_1) = 0,  P(+ | h_1) = 1
P(h_2 | D) = 0.3,  P(− | h_2) = 1,  P(+ | h_2) = 0
P(h_3 | D) = 0.3,  P(− | h_3) = 1,  P(+ | h_3) = 0
COMP9417 T1, 2021 41

Bayes Optimal Classifier
therefore

$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = 0.4 \qquad \sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = 0.6$$

and

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$$

v_j is the class value from the set of possible class values (v_j ∈ V).
COMP9417 T1, 2021 42

Bayes Optimal Classifier
The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new sample based on the training data. This is different from MAP inference.
• In the MAP framework, we seek the most probable hypothesis among the space of hypotheses.
• But in Bayes optimal classification, the most probable classification of the new instance is obtained by combining the predictions of ALL hypotheses (in the hypothesis space), weighted by their posterior probabilities:

$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

The optimal classification of the new instance is the value v_j for which P(v_j | D) is maximum.
COMP9417 T1, 2021 43

Bayes Optimal Classifier
It can be shown mathematically that no other classification method using the same hypothesis space and the same prior knowledge can outperform the Bayes optimal classifier on average.
COMP9417 T1, 2021 44

Bayes Optimal Classifier
Why do we need any other methods?
COMP9417 T1, 2021 45

Bayes Optimal Classifier
The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation of those quantities.
COMP9417 T1, 2021 46

Gibbs Classifier
The Bayes optimal classifier is very inefficient. An alternative, less optimal method is the Gibbs algorithm.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h | D)
2. Use this to classify the new instance
Surprising fact: assuming target concepts are drawn at random from H (h ∈ H) according to the priors on H, then
E[error_Gibbs] ≤ 2 × E[error_BayesOptimal]
COMP9417 T1, 2021 47

Bayes Error
What is the best performance attainable by a (two-class) classifier?
Define the probability of error for classifying some instance x by
P(error | x) = P(class_1 | x)  if we predict class_2
P(error | x) = P(class_2 | x)  if we predict class_1
This gives

$$P(error) = \sum_{x} P(error \mid x)\,P(x)$$

So we can justify the use of the decision rule:
if P̂(class_1 | x) > P̂(class_2 | x) then predict class_1, else predict class_2
On average, this decision rule minimises the probability of classification error.
COMP9417 T1, 2021 48
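A minimal sketch (not from the slides) of the Bayes optimal combination for the three-hypothesis example above, contrasted with simply following the MAP hypothesis:

```python
# Three-hypothesis example from the Bayes optimal classifier slides.
# posterior[h] = P(h | D); prediction[h] = the class h assigns to the new x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

# MAP prediction: follow the single most probable hypothesis.
h_map = max(posterior, key=posterior.get)
print(prediction[h_map])  # '+'

# Bayes optimal prediction: weight each hypothesis's vote by its posterior.
votes = {"+": 0.0, "-": 0.0}
for h, p in posterior.items():
    votes[prediction[h]] += p
print(votes)                      # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))  # '-'
```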

Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the most practical learning methods.
When to use
o Moderate or large training set available
o Attributes that describe instances are conditionally independent given classification
Successful applications:
o Classifying text documents
o Gaussian Naive Bayes for real-valued data
COMP9417 T1, 2021 49

Naive Bayes Classifier
Assume a target function f : X → V, where each instance x is described by attributes ⟨x_1, x_2, ..., x_n⟩.
The most probable value of f(x) is:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{v_j \in V} \frac{P(x_1, x_2, \ldots, x_n \mid v_j)\,P(v_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\,P(v_j)$$
COMP9417 T1, 2021 50

Naive Bayes Classifier
Naive Bayes assumption:
$$P(x_1, x_2, \ldots, x_n \mid v_j) = \prod_{i} P(x_i \mid v_j)$$

• Attributes are statistically independent (given the class value)
  o which means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)
Which gives the Naive Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(x_i \mid v_j)$$
COMP9417 T1, 2021
51

Naive Bayes Algorithm

NaiveBayesLearn(examples):
  for each target value v_j:
    P̂(v_j) ← estimate P(v_j)
    for each attribute value x_i:
      P̂(x_i | v_j) ← estimate P(x_i | v_j)

ClassifyNewInstance(x):
  v_NB = argmax over v_j ∈ V of  P̂(v_j) ∏_i P̂(x_i | v_j)
COMP9417 T1, 2021
52

Naive Bayes Example: PlayTennis

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         normal    true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no
COMP9417 T1, 2021
53

Naive Bayes Example: PlayTennis
What are the required probabilities to predict PlayTennis?

Counts:
Outlook      Yes  No     Temperature  Yes  No
Sunny         2    3     Hot           2    2
Overcast      4    0     Mild          4    2
Rainy         3    2     Cool          3    1

Humidity     Yes  No     Windy        Yes  No
High          3    4     False         6    2
Normal        6    1     True          3    3

Play: Yes 9, No 5

Estimated probabilities:
Outlook      Yes   No    Temperature  Yes   No
Sunny        2/9   3/5   Hot          2/9   2/5
Overcast     4/9   0/5   Mild         4/9   2/5
Rainy        3/9   2/5   Cool         3/9   1/5

Humidity     Yes   No    Windy        Yes   No
High         3/9   4/5   False        6/9   2/5
Normal       6/9   1/5   True         3/9   3/5

Play: Yes 9/14, No 5/14
COMP9417 T1, 2021 54

Naive Bayes Example: PlayTennis
Say we have the new instance:
⟨Outlook = sunny, Temperature = cool, Humidity = high, Windy = true⟩
We want to compute:

$$v_{NB} = \arg\max_{v_j \in \{yes,\,no\}} P(v_j) \prod_{i} P(x_i \mid v_j)$$
COMP9417 T1, 2021
55

Naive Bayes Example: PlayTennis
So we first calculate the likelihood of the two classes, “yes” and “no”
For “yes”: P(“yes”) × P(sunny | “yes”) × P(cool | “yes”) × P(high | “yes”) × P(true | “yes”)
         = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053
For “no”:  P(“no”) × P(sunny | “no”) × P(cool | “no”) × P(high | “no”) × P(true | “no”)
         = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206
COMP9417 T1, 2021 56

Naive Bayes Example: PlayTennis
Then convert to a probability by normalisation
P(“yes” | x) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no” | x) = 0.0206 / (0.0053 + 0.0206) = 0.795
The Naive Bayes classification is “no”.
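The whole calculation can be reproduced directly from the probability tables above. A minimal sketch (illustrative, not part of the original slides):

```python
# Estimated probabilities from the PlayTennis tables (slide 54).
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"outlook=sunny": 2 / 9, "temp=cool": 3 / 9,
            "humidity=high": 3 / 9, "windy=true": 3 / 9},
    "no":  {"outlook=sunny": 3 / 5, "temp=cool": 1 / 5,
            "humidity=high": 4 / 5, "windy=true": 3 / 5},
}

# Unnormalized Naive Bayes scores P(v) * prod_i P(x_i | v).
scores = {}
for v in ("yes", "no"):
    score = prior[v]
    for p in cond[v].values():
        score *= p
    scores[v] = score

print(scores)  # {'yes': ~0.0053, 'no': ~0.0206}
total = sum(scores.values())
print({v: round(s / total, 3) for v, s in scores.items()})  # {'yes': 0.205, 'no': 0.795}
print(max(scores, key=scores.get))  # no
```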
COMP9417 T1, 2021
57

Naive Bayes: Subtleties
Conditional independence assumption is often violated
$$P(x_1, x_2, \ldots, x_n \mid v_j) = \prod_i P(x_i \mid v_j)$$

– ...but it works surprisingly well anyway. Note that you don't need the estimated posteriors P̂(v_j | x) to be correct; you need only that

$$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(x_i \mid v_j) = \arg\max_{v_j \in V} P(v_j)\,P(x_1, x_2, \ldots, x_n \mid v_j)$$
i.e. maximum probability is assigned to correct class
o see [Domingos & Pazzani, 1996] for analysis
o Naive Bayes posteriors often unrealistically close to 1 or 0
o adding too many redundant attributes will cause problems (e.g. identical attributes)
COMP9417 T1, 2021 58

Naive Bayes: “zero-frequency” problem
What if none of the training instances with target value v_j have attribute value x_i? Then
P̂(x_i | v_j) = 0, and so
P̂(v_j) ∏_i P̂(x_i | v_j) = 0
Pseudo-counts add 1 to each count (a version of the Laplace Estimator)
(In some cases adding a constant different from 1 might be more appropriate)
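A minimal sketch of Laplace (add-one) smoothing for the conditional estimates (illustrative; `k` denotes the number of distinct values of the attribute):

```python
def smoothed_estimate(count_xi_and_v: int, count_v: int, k: int, alpha: float = 1.0) -> float:
    """Laplace-smoothed estimate of P(x_i | v_j).

    count_xi_and_v: training instances with class v_j and attribute value x_i
    count_v:        training instances with class v_j
    k:              number of possible values of this attribute
    alpha:          pseudo-count (1 gives the Laplace estimator)
    """
    return (count_xi_and_v + alpha) / (count_v + alpha * k)

# Zero-frequency case from PlayTennis: Outlook=overcast never occurs with play=no
# (0 of 5), yet the smoothed estimate is non-zero, so the product no longer collapses to 0.
print(smoothed_estimate(0, 5, k=3))  # 0.125
```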
COMP9417 T1, 2021 59

Naive Bayes: numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
x | v_j ~ N(μ, σ²)
• The probability density function for the normal distribution is defined by two parameters:
– The sample mean μ:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

– The standard deviation σ:

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}$$

These parameters have to be defined for each class separately.
COMP9417 T1, 2021 60

Naive Bayes: numeric attributes
Then we have the density function f(x):

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Example: continuous attribute temperature with mean = 73 and standard deviation = 6.2. The density value is

$$f(temperature = 66 \mid \text{"yes"}) = \frac{1}{\sqrt{2\pi}\,6.2}\, e^{-\frac{(66-73)^2}{2 \times 6.2^2}} = 0.0340$$

Missing values during training are not included in the calculation of the mean and standard deviation.
COMP9417 T1, 2021 61
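The worked density value can be checked with a short sketch (illustrative, not from the slides):

```python
import math

def gaussian_density(x: float, mu: float, sigma: float) -> float:
    """Gaussian probability density f(x) with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Continuous attribute temperature for class "yes": mean 73, standard deviation 6.2.
print(round(gaussian_density(66, mu=73, sigma=6.2), 4))  # 0.034
```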
Categorical random variables
Categorical variables or features (also called discrete or nominal) are ubiquitous in machine learning, for example in text classification.
• Perhaps the most common form of the Bernoulli distribution models whether or not a word occurs in a document. That is, for the i-th word in our vocabulary we have a random variable X_i governed by a Bernoulli distribution. The joint distribution over the bit vector X = (X_1, ..., X_k) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for example, every word position in an e-mail corresponds to a categorical variable with k outcomes, where k is the size of the vocabulary. The multinomial distribution manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words in a document. This establishes an alternative way of modelling text documents that allows the number of occurrences of a word to influence the classification of a document.
COMP9417 T1, 2021 62

Categorical random variables
Both these document models are in common use. Despite their differences, they both assume independence between word occurrences, generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of the multinomial distribution, which assumes that words at different word positions are drawn independently from the same categorical distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit vector are statistically independent, which allows us to compute the joint probability of a particular bit vector (x_1, ..., x_k) as the product of the probabilities of each component P(X_i = x_i).
• In practice, such word independence assumptions are often not true: if we know that an e-mail contains the word ‘Viagra’, we can be quite sure that it will also contain the word ‘pill’. Violated independence assumptions reduce the quality of probability estimates but may still allow good classification performance.
COMP9417 T1, 2021 63

Example application: Learning to Classify Text
In machine learning, the classic example of an application of Naive Bayes is learning to classify text documents.
COMP9417 T1, 2021 64

Learning to Classify Text
For example:
o Learn which news articles are of interest
o Learn to classify web pages by topic
Naive Bayes is among the most effective algorithms.
What attributes shall we use to represent text documents?
COMP9417 T1, 2021 65

Learning to Classify Text
Target concept Interesting? : Document → {+, −}
1. Represent each document by a vector of words
   o one attribute per word position in the document
2. Learning: use training examples to estimate
   o P(+)
   o P(−)
   o P(doc | +)
   o P(doc | −)
COMP9417 T1, 2021 66

Learning to Classify Text
Naive Bayes conditional independence assumption:

$$P(doc \mid v_j) = \prod_{i=1}^{length(doc)} P(x_i = w_k \mid v_j)$$

where P(x_i = w_k | v_j) is the probability that the word in position i is w_k, given v_j.
One more assumption: P(x_i = w_k | v_j) = P(x_m = w_k | v_j), ∀ i, m  (“bag of words”)
COMP9417 T1, 2021 67

Application: 20 Newsgroups
Given: 1000 training documents from each of 20 newsgroups
Learning task: classify each new document by the newsgroup it came from
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns, sci.space, sci.crypt, sci.electronics, sci.med
Naive Bayes: 89% classification accuracy
COMP9417 T1, 2021 68

Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he’s clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he’s only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided ...
COMP9417 T1, 2021 69

Learning Curve for 20 Newsgroups
[Figure: classification accuracy vs. training set size (100 to 10000 documents, 1/3 withheld for testing) on 20 Newsgroups, comparing Bayes, TFIDF and PRTFIDF]
COMP9417 T1, 2021 70

Probabilistic decision rules
We again use the example of Naive Bayes for text classification to illustrate, using both the multinomial and multivariate Bernoulli models.
COMP9417 T1, 2021 71

Training a naive Bayes model
Consider the following e-mails consisting of the five words a, b, c, d, e:
e1: b d e b b d e          e5: a b a b a b a e d
e2: b c e b b d d e c c    e6: a c a c a c a e d
e3: a d a d e a e e        e7: e a e d a e a
e4: b a d b e d a b        e8: d e d e d
We are told that the e-mails on the left (e1 to e4) are spam and those on the right (e5 to e8) are ham, and so we use them as a small training set to train our Bayesian classifier.
• First, we decide that d and e are so-called stop words that are too common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary.
COMP9417 T1, 2021 72
Training data for naive Bayes
The same data set described by bit vectors (presence of words). We can use the multivariate Bernoulli model.

E-mail  a?  b?  c?  Class
e1      0   1   0    +
e2      0   1   1    +
e3      1   0   0    +
e4      1   1   0    +
e5      1   1   0    −
e6      1   0   1    −
e7      1   0   0    −
e8      0   0   0    −
COMP9417 T1, 2021 73

Training a naive Bayes model
In the multivariate Bernoulli model e-mails are represented by bit vectors, as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and (3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in order to get an estimate of the probability of a document containing a particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one containing each word and one containing none of them.
• This results in the estimated parameter vectors
θ̂⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam
θ̂⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham
COMP9417 T1, 2021 74

Training data for naive Bayes
A small e-mail data set described by count vectors. We can use the multinomial model.

E-mail  #a  #b  #c  Class
e1      0   3   0    +
e2      0   3   3    +
e3      3   0   0    +
e4      2   3   0    +
e5      4   3   0    −
e6      4   0   3    −
e7      3   0   0    −
e8      0   0   0    −
COMP9417 T1, 2021 75

Training a naive Bayes model
For the multinomial model, we represent each e-mail as a count vector, as before.
• In order to estimate the parameters of the multinomial, we sum up the count vectors for each class, which gives (5, 9, 3) for spam and (11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for each vocabulary word, which brings the total number of occurrences of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
θ̂⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam
θ̂⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham
COMP9417 T1, 2021 76
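Both parameter estimates can be checked with a short script (an illustrative sketch, not from the slides):

```python
# Bit vectors (a, b, c) and count vectors for the spam (+) and ham (-) e-mails.
spam_bits = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
ham_bits = [(1, 1, 0), (1, 0, 1), (1, 0, 0), (0, 0, 0)]
spam_counts = [(0, 3, 0), (0, 3, 3), (3, 0, 0), (2, 3, 0)]
ham_counts = [(4, 3, 0), (4, 0, 3), (3, 0, 0), (0, 0, 0)]

def bernoulli_params(bit_vectors):
    """Per-word document frequencies, smoothed with two pseudo-documents."""
    totals = [sum(col) for col in zip(*bit_vectors)]
    n_docs = len(bit_vectors)
    return [round((t + 1) / (n_docs + 2), 2) for t in totals]

def multinomial_params(count_vectors):
    """Per-word occurrence proportions, smoothed with one pseudo-count per word."""
    totals = [sum(col) for col in zip(*count_vectors)]
    grand = sum(totals) + len(totals)
    return [round((t + 1) / grand, 2) for t in totals]

print(bernoulli_params(spam_bits), bernoulli_params(ham_bits))       # [0.5, 0.67, 0.33] [0.67, 0.33, 0.33]
print(multinomial_params(spam_counts), multinomial_params(ham_counts))  # [0.3, 0.5, 0.2] [0.6, 0.2, 0.2]
```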
Prediction using a naive Bayes model
Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our e-mails, with parameters
θ⊕ = (0.5, 0.67, 0.33)
θ⊖ = (0.67, 0.33, 0.33)
This means, for example, that the presence of b is twice as likely in spam (+) as in ham (−).
The e-mail to be classified contains the words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain likelihoods
P(x | ⊕) = 0.5 × 0.67 × (1 − 0.33) = 0.222
P(x | ⊖) = 0.67 × 0.33 × (1 − 0.33) = 0.148
The ML classification of x is thus spam.
COMP9417 T1, 2021 77

Prediction using a naive Bayes model
In the case of two classes it is often convenient to work with likelihood ratios and odds.
o The likelihood ratio can be calculated as

$$\frac{P(x \mid \oplus)}{P(x \mid \ominus)} = \frac{0.5}{0.67} \cdot \frac{0.67}{0.33} \cdot \frac{1 - 0.33}{1 - 0.33} = \frac{3}{2} > 1$$

o This means that the MAP classification of x is also spam if the prior odds are more than 2/3, but ham if they are less than that.
o For example, with 33% spam and 67% ham the prior odds are P(⊕)/P(⊖) = 0.33/0.67 = 1/2, resulting in a posterior odds of

$$\frac{P(\oplus \mid x)}{P(\ominus \mid x)} = \frac{P(x \mid \oplus)\,P(\oplus)}{P(x \mid \ominus)\,P(\ominus)} = \frac{3}{2} \times \frac{1}{2} = \frac{3}{4} < 1$$

In this case the likelihood ratio for x is not strong enough to push the decision away from the prior.
COMP9417 T1, 2021 78

Prediction using a naive Bayes model
Alternatively, we can employ a multinomial model. The parameters of a multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2)
θ⊖ = (0.6, 0.2, 0.2)
The e-mail to be classified contains three occurrences of word a, a single occurrence of word b and no occurrences of word c, and hence is described by the count vector x = (3, 1, 0). The total number of vocabulary word occurrences is n = 4. We obtain likelihoods:

$$P(x \mid \oplus) = \frac{4!}{3!\,1!\,0!}\, 0.3^3\, 0.5^1\, 0.2^0 = 0.054 \qquad P(x \mid \ominus) = \frac{4!}{3!\,1!\,0!}\, 0.6^3\, 0.2^1\, 0.2^0 = 0.1728$$

The likelihood ratio is $\frac{(0.3)^3 (0.5)^1 (0.2)^0}{(0.6)^3 (0.2)^1 (0.2)^0} = \frac{5}{16}$. The ML classification of x is thus ham, the opposite of the multivariate Bernoulli model. This is mainly because of the three occurrences of word a, which provide strong evidence for ham.
COMP9417 T1, 2021 79

Naive Bayes: missing values
What about missing values? A very basic approach is:
– Training: the instance is not included in the frequency count for the attribute value-class combination
– Classification: the attribute is omitted from the calculation
Can we do better?
COMP9417 T1, 2021 80

Missing values
Let's use the spam-ham posterior table from slide 9. Suppose we skimmed an e-mail and noticed that it contains the word ‘lottery’ but we haven't looked closely enough to determine whether it uses the word ‘Viagra’. This means that we don't know whether to use the second or the fourth row (in slide 9) to make a prediction. This is a problem, as we would predict spam if the e-mail contained the word ‘Viagra’ (second row) and ham if it didn't (fourth row).
The solution is to average these two rows, using the probability of ‘Viagra’ occurring in any e-mail (spam or not):
P(y | lottery) = P(y | Viagra = 0, lottery) P(Viagra = 0) + P(y | Viagra = 1, lottery) P(Viagra = 1)
COMP9417 T1, 2021 81

Missing values
For instance, suppose for the sake of argument that one in ten e-mails contain the word ‘Viagra’; then P(Viagra = 1) = 0.10 and P(Viagra = 0) = 0.90. Using the above formula, we obtain
P(y = spam | lottery = 1) = 0.65 × 0.90 + 0.40 × 0.10 = 0.625
P(y = ham | lottery = 1) = 0.35 × 0.90 + 0.60 × 0.10 = 0.375
Because the occurrence of ‘Viagra’ in any e-mail is relatively rare, the resulting distribution deviates only a little from the second row of the table.
COMP9417 T1, 2021 82

Likelihood ratio
As a matter of fact, statisticians very often work with a different conditional probability, given by the likelihood function P(x | y).
• I like to think of these as thought experiments: if somebody were to send me a spam e-mail, how likely would it be that it contains exactly the words of the e-mail I'm looking at? And how likely if it were a ham e-mail instead?
• What really matters is not the magnitude of these likelihoods, but their ratio: how much more likely is it to observe this combination of words in a spam e-mail than it is in a non-spam e-mail.
• For instance, suppose that for a particular e-mail described by x we have P(x | y = spam) = 3.5 × 10⁻⁵ and P(x | y = ham) = 7.4 × 10⁻⁶; then observing x in a spam e-mail is nearly five times more likely than it is in a ham e-mail.
• This suggests the following decision rule: predict spam if the likelihood ratio is larger than 1 and ham otherwise.
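The Bernoulli and multinomial predictions above can be verified with the following sketch (illustrative only):

```python
from math import factorial

theta_bern = {"spam": (0.5, 0.67, 0.33), "ham": (0.67, 0.33, 0.33)}
theta_multi = {"spam": (0.3, 0.5, 0.2), "ham": (0.6, 0.2, 0.2)}

def bernoulli_likelihood(bits, theta):
    """P(x | class) under the multivariate Bernoulli model."""
    p = 1.0
    for x, t in zip(bits, theta):
        p *= t if x == 1 else (1 - t)
    return p

def multinomial_likelihood(counts, theta):
    """P(x | class) under the multinomial model."""
    n = sum(counts)
    coeff = factorial(n)
    p = 1.0
    for c, t in zip(counts, theta):
        coeff //= factorial(c)
        p *= t ** c
    return coeff * p

x_bits = (1, 1, 0)    # e-mail contains a and b but not c
x_counts = (3, 1, 0)  # e-mail with three a's and one b

print(bernoulli_likelihood(x_bits, theta_bern["spam"]))       # ~0.222 -> ML prediction: spam
print(bernoulli_likelihood(x_bits, theta_bern["ham"]))        # ~0.148
print(multinomial_likelihood(x_counts, theta_multi["spam"]))  # 0.054  -> ML prediction: ham
print(multinomial_likelihood(x_counts, theta_multi["ham"]))   # ~0.173
```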
COMP9417 T1, 2021 83

When to use likelihoods
Use likelihoods if you want to ignore the prior distribution or assume it uniform, and posterior probabilities otherwise.
COMP9417 T1, 2021 84

Posterior odds
P(y = spam | Viagra = 0, lottery = 0) / P(y = ham | Viagra = 0, lottery = 0) = 0.31 / 0.69 = 0.45
P(y = spam | Viagra = 1, lottery = 1) / P(y = ham | Viagra = 1, lottery = 1) = 0.40 / 0.60 = 0.67
P(y = spam | Viagra = 0, lottery = 1) / P(y = ham | Viagra = 0, lottery = 1) = 0.65 / 0.35 = 1.9
P(y = spam | Viagra = 1, lottery = 0) / P(y = ham | Viagra = 1, lottery = 0) = 0.80 / 0.20 = 4.0
Using a MAP decision rule we predict ham in the top two cases and spam in the bottom two. Given that the full posterior distribution is all there is to know about the domain in a statistical sense, these predictions are the best we can do: they are Bayes-optimal.
COMP9417 T1, 2021 85

Example marginal likelihoods

Y      P(Viagra = 1 | Y)  P(Viagra = 0 | Y)  P(lottery = 1 | Y)  P(lottery = 0 | Y)
spam         0.40               0.60               0.21                0.79
ham          0.12               0.88               0.13                0.87
COMP9417 T1, 2021 86

Using marginal likelihoods
Using the marginal likelihoods from before, we can approximate the likelihood ratios for Naive Bayes as follows (the previously calculated odds from the full posterior distribution are shown in brackets):
[P(Viagra = 0 | spam) P(lottery = 0 | spam)] / [P(Viagra = 0 | ham) P(lottery = 0 | ham)] = (0.60 × 0.79) / (0.88 × 0.87) = 0.62   (0.45)
[P(Viagra = 0 | spam) P(lottery = 1 | spam)] / [P(Viagra = 0 | ham) P(lottery = 1 | ham)] = (0.60 × 0.21) / (0.88 × 0.13) = 1.1    (1.9)
[P(Viagra = 1 | spam) P(lottery = 0 | spam)] / [P(Viagra = 1 | ham) P(lottery = 0 | ham)] = (0.40 × 0.79) / (0.12 × 0.87) = 3.0    (4.0)
[P(Viagra = 1 | spam) P(lottery = 1 | spam)] / [P(Viagra = 1 | ham) P(lottery = 1 | ham)] = (0.40 × 0.21) / (0.12 × 0.13) = 5.4    (0.67)
We see that, using a maximum likelihood decision rule, our very simple model arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth (‘Viagra’ and ‘lottery’ both present), where the marginal likelihoods are actually very misleading.
COMP9417 T1, 2021 87
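The comparison can be scripted directly from the two tables (an illustrative sketch, not from the slides):

```python
# Full posterior P(spam | Viagra, lottery) from slide 9, and the
# marginal likelihoods P(word | class) from slide 86.
posterior_spam = {(0, 0): 0.31, (0, 1): 0.65, (1, 0): 0.80, (1, 1): 0.40}
lik_viagra = {"spam": {1: 0.40, 0: 0.60}, "ham": {1: 0.12, 0: 0.88}}
lik_lottery = {"spam": {1: 0.21, 0: 0.79}, "ham": {1: 0.13, 0: 0.87}}

for v, l in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    post_odds = posterior_spam[(v, l)] / (1 - posterior_spam[(v, l)])
    nb_ratio = (lik_viagra["spam"][v] * lik_lottery["spam"][l]) / (
        lik_viagra["ham"][v] * lik_lottery["ham"][l])
    print(f"Viagra={v} lottery={l}: posterior odds {post_odds:.2f}, "
          f"naive Bayes likelihood ratio {nb_ratio:.2f}")
# Only in the last case (both words present) does the naive Bayes ratio (5.4)
# point the opposite way from the Bayes-optimal posterior odds (0.67).
```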
Naive Bayes
Pros:
• Easy and fast at prediction; it also performs well in multi-class prediction
• When the assumption of independence holds, a Naive Bayes classifier performs better than other models such as logistic regression, with less training data
• Performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption)
Cons:
• If the zero-frequency case happens, it is unable to make a prediction (solution: use a smoothing technique)
• Naive Bayes is also known to be a bad estimator, so its probability outputs should not be taken too seriously
• Strong assumption of independent attributes. In real life, it is almost impossible to get a set of attributes which are completely independent
COMP9417 T1, 2021 88

Logistic Regression
a probabilistic linear classifier
COMP9417 T1, 2021 89

Logistic Regression
• In a binary classification problem with one input variable, if we show the output on the y-axis we may get something like below:
[Figure: binary outcomes y ∈ {0, 1} plotted against a single input x]
• Can we use linear regression to model this data?
COMP9417 T1, 2021 90

Logistic Regression
Why regular regression doesn't work here:
• Univariate or multivariate regression is used to predict a real-valued output from one or more independent variables
• Binary data doesn't have a normal distribution, which is a condition needed for most other types of regression
• What we expect to get is an output of 0 or 1, but linear regression can obviously produce values beyond that range
COMP9417 T1, 2021 91

Logistic Regression
• In binary classification, we can transform the y values into probability values (values in the range [0, 1])
• We can model this with an s-curve (sigmoid curve):
[Figure: sigmoid curve fitted to the binary data, with y on the vertical axis and x on the horizontal axis]

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-f(x)}} \;\Longrightarrow\; f(x) = \log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}$$

Now f(x) can take any value between −∞ and +∞, and in logistic regression we estimate f(x) with a line:

$$\hat{f}(x) = x^T \beta \;\Longrightarrow\; \log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = x^T \beta$$

COMP9417 T1, 2021 92

Logistic Regression
Logistic regression seeks to:
• Model the probability of a class given the values of the independent input variables
• Estimate the probability that a class occurs for a random observation (versus the probability that the class doesn't occur)
• Classify an observation based on the probability estimates
COMP9417 T1, 2021 93

Logistic Regression

$$\hat{P}(y = 1 \mid x) = \frac{1}{1 + e^{-x^T \beta}}$$

If P(y = 1 | x) ≥ 0.5 (the same as saying xᵀβ ≥ 0), then predict class 1
If P(y = 1 | x) < 0.5 (the same as saying xᵀβ < 0), then predict class 0
• Interpreted in the input space (feature space), this model is equivalent to having a linear decision boundary (xᵀβ = 0) separating the two classes
• Now we have a linear solution to our problem, and this is what makes logistic regression a linear model
COMP9417 T1, 2021 94

Logistic Regression
Parameter Estimation
• We cannot use a cost function similar to the one used in linear regression here, because it would result in a non-convex function with many local minima, making it very difficult to find the global minimum
• Instead, the following cost function is used. Let's define P̂(y = 1 | x) = h_β(x):

$$cost(h_\beta(x), y) = \begin{cases} -\log(h_\beta(x)) & \text{if } y = 1 \\ -\log(1 - h_\beta(x)) & \text{if } y = 0 \end{cases}$$

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\beta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\beta(x^{(i)})) \right]$$

Now, to find the values of the parameters that minimize J(β), we can use the gradient descent algorithm.
COMP9417 T1, 2021 95
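A minimal sketch of logistic regression trained by gradient descent on the cost J(β) above. This is illustrative only; the tiny one-feature dataset is made up for the example:

```python
import math

# Toy one-feature data set (hypothetical): small x -> class 0, large x -> class 1.
# The first column of each row is the intercept term.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    """P(y = 1 | x) = sigmoid(x^T beta)."""
    return sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))

beta = [0.0, 0.0]
lr = 0.5
for _ in range(5000):
    # Gradient of J(beta): (1/m) * sum_i (h(x_i) - y_i) * x_i
    grad = [0.0, 0.0]
    for xi, yi in zip(X, y):
        err = predict_proba(xi, beta) - yi
        for j in range(len(beta)):
            grad[j] += err * xi[j] / len(X)
    beta = [bj - lr * gj for bj, gj in zip(beta, grad)]

print([round(b, 2) for b in beta])                 # learned weights
print(round(predict_proba([1.0, 2.0], beta), 2))   # probability that x = 2.0 is class 1
```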
Minimum Description Length Principle
COMP9417 T1, 2021 96

Minimum Description Length Principle
Once again, the MAP hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

which is equivalent to

$$h_{MAP} = \arg\max_{h \in H} \left[ \log_2 P(D \mid h) + \log_2 P(h) \right]$$

or

$$h_{MAP} = \arg\min_{h \in H} \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right]$$

COMP9417 T1, 2021 97

Minimum Description Length Principle
Interestingly, this is an expression about a quantity of bits:

$$h_{MAP} = \arg\min_{h \in H} \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right] \qquad (1)$$

From information theory: the optimal (shortest expected coding length) code for an event with probability p is −log₂ p bits.
COMP9417 T1, 2021 98

Minimum Description Length Principle
So interpret (1):
o −log₂ P(h) is the length of h under the optimal code: the length of your hypothesis
o −log₂ P(D | h) is the length of D given h under the optimal code: this is a notion of error/misclassification
Note: this assumes optimal encodings, when the priors and likelihoods are known. In practice this is difficult, and makes a difference.
COMP9417 T1, 2021 99

Minimum Description Length Principle
Occam's razor: a principle stating that, all other things being equal, a shorter hypothesis for observed data should be favoured over a lengthier hypothesis.
MDL: prefer the hypothesis h that minimizes

$$h_{MAP} = \arg\min_{h \in H} \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right]$$

So from an information theory perspective, the best hypothesis is the one that minimizes the combination of the error and the size of the hypothesis. You want the simplest hypothesis that minimizes the error; this is a kind of Bayesian argument for Occam's razor.
COMP9417 T1, 2021 100

Summary
• We described the classification problem in machine learning
• We also outlined the issue of inductive bias
• Two major frameworks for classification were covered:
  o Distance-based: the key ideas are geometric
  o Probabilistic: the key ideas are Bayesian
• We also discussed logistic regression
• So we have established the basis for learning classifiers
• Later we will see how to extend by building on these ideas
COMP9417 T1, 2021 101

Acknowledgements
• Material derived from slides for the book “The Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman, Springer (2009). http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book “Machine Learning: A Probabilistic Perspective” by K. P. Murphy, MIT Press (2012). http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach, Cambridge University Press (2012). http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book “Bayesian Reasoning and Machine Learning” by D. Barber, Cambridge University Press (2012). http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell, McGraw-Hill (1997). http://www-2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan, BITS Pilani, Goa, India (2016)
COMP9417 T1, 2021 102