Classification (2)
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael
This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to:
– explain the concept of inductive bias in machine learning
– outline Bayes Theorem as applied in machine learning
– define MAP and ML inference using Bayes theorem
– outline the Naive Bayes classification algorithm
– outline the logistic regression classification algorithm
COMP9417 T1, 2022 1
Introduction
What do we understand about the problem of learning classifiers?
... how can we know when classifier learning succeeds?
and . . . can we use this to build practical algorithms ?
COMP9417 T1, 2022 2
Inductive Bias
“All models are wrong, but some models are useful.”
Box & Draper (1987)
COMP9417 T1, 2022 3
Inductive Bias
Confusingly, "inductive bias" is NOT the same "bias" as in the "bias-variance" decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially it means that the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Success in machine learning requires understanding the inductive bias of algorithms and models and choosing them appropriately for the task.
COMP9417 T1, 2022 4
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
o target function has the form 𝑦 = 𝑎𝑥 + 𝑏
o approximate by fitting using MSE
• Nearest Neighbour ?
o target function is a complex non-linear function of the data
o predict using nearest neighbour by Euclidean distance in feature space
COMP9417 T1, 2022 5
Inductive Bias
What we would really like is a framework for machine learning algorithms:
o with a way of representing the inductive bias
o ideally, should be a declarative specification
o also should quantify uncertainty in the inductive bias
COMP9417 T1, 2022 6
A probabilistic approach
COMP9417 T1, 2022 7
A probabilistic approach
Probabilistic models

A simple probabilistic model

'Viagra' and 'lottery' are two Boolean features; Y is the class variable, with values 'spam' and 'ham'. In each row the most likely class is indicated in bold.

Viagra  lottery  P(Y = spam | Viagra, lottery)  P(Y = ham | Viagra, lottery)
0       0        0.31                           0.69
0       1        0.65                           0.35
1       0        0.80                           0.20
1       1        0.40                           0.60
COMP9417 T1, 2022 8
Decision rule
Assuming that 𝑥 and 𝑦 are the only variables we know and care about, the posterior distribution 𝑃(𝑦 |𝑥) helps us to answer many questions of interest.
• For instance, to classify a new e-mail we determine whether the words 'Viagra' and 'lottery' occur in it, look up the corresponding probability P(y = spam | Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
• Such a recipe to predict a value of 𝑦 on the basis of the values of 𝑥 and the posterior distribution 𝑃(𝑦 |𝑥) is called a decision rule.
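A minimal Python sketch of this decision rule, using the posterior table from the earlier slide; the function and variable names are illustrative, not part of the slides.

# Posterior table P(Y = spam | Viagra, lottery) from the earlier slide
posterior_spam = {
    (0, 0): 0.31,
    (0, 1): 0.65,
    (1, 0): 0.80,
    (1, 1): 0.40,
}

def classify_email(has_viagra: bool, has_lottery: bool) -> str:
    """Predict 'spam' if P(y = spam | Viagra, lottery) > 0.5, else 'ham'."""
    p_spam = posterior_spam[(int(has_viagra), int(has_lottery))]
    return "spam" if p_spam > 0.5 else "ham"

print(classify_email(True, False))   # 'spam' (posterior 0.80)
print(classify_email(True, True))    # 'ham'  (posterior 0.40)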
COMP9417 T1, 2022 9
Bayesian Machine Learning
COMP9417 T1, 2022 10
Two Roles for Bayesian Methods
Provides practical learning algorithms:
– Naive Bayes classifier learning
– Bayesian network learning, etc.
– Combines prior knowledge (prior probabilities) with observed data
Provides useful conceptual framework:
– Provides a “gold standard” for evaluating other learning algorithms
– Some additional insight into Occam’s razor
COMP9417 T1, 2022 11
Classification Question:
Can we do something different than finding the discriminative line (or some boundary) to be able to separate the two groups?
COMP9417 T1, 2022
Classification Answer:
Yes, we can. Instead of finding a discriminative line, maybe we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. Models of this type are called generative learning algorithms.
P(x | y = sea bass), P(y = sea bass)
P(x | y = salmon), P(y = salmon)
COMP9417 T1, 2022
Classification – Bayesian methods
Example: imagine we want to classify fish by type: salmon or sea bass.
If from past experience we have (C_i is the class): P(sea bass) = 0.7, P(salmon) = 0.3
• If we decide based only on the prior, we always have to choose "sea bass". This is called a "decision rule based on the prior"
o This can behave very poorly
o It never predicts other classes
COMP9417 T1, 2022 14
Classification – Bayesian methods
Example: now if we have some more information on the length of the fish in each class, then how can we update our decision, if we want to predict the class for a fish with 70cm length?
These are called "class conditionals" (class-conditional probabilities), P(x | c_i).
[Table of class-conditional probabilities of length for each class, over the bins: length > 100 cm, 50 cm < length < 100 cm, length < 50 cm]
What we are interested in is P(c_i | x), but we have P(c_i) and P(x | c_i), so how should we go about this?
COMP9417 T1, 2022 15
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)

P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
COMP9417 T1, 2022 16
Decision Rule from Posteriors

Example:
If the output belongs to a set of k classes: y ∈ {C_1, C_2, ..., C_k} for 1 ≤ i ≤ k, then in the Bayesian framework:

P(y = C_i | x) = P(x | C_i) P(C_i) / P(x)

– P(y = C_i | x): posterior probability
– P(x | C_i): class conditional (class likelihood)
– P(C_i): prior
– P(x): marginal (P(x) = Σ_i P(x | C_i) P(C_i))
The decision rule is to select a class which maximizes the posterior probability for the prediction
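As a sketch, this decision rule can be written directly as an argmax over unnormalized posteriors; the function below and its argument names are illustrative assumptions, not part of the slides.

# Bayes decision rule: pick the class C_i that maximizes P(x | C_i) P(C_i).
# Since P(x) is the same for every class, it can be ignored in the comparison.
def bayes_predict(x, priors, likelihoods):
    """priors: dict class -> P(C_i); likelihoods: dict class -> function x -> P(x | C_i)."""
    scores = {c: likelihoods[c](x) * priors[c] for c in priors}
    return max(scores, key=scores.get)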
COMP9417 T1, 2022 17
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)

Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)
COMP9417 T1, 2022 18
Choosing Hypotheses
If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the Maximum Likelihood (ML) hypothesis:

h_ML = argmax_{h_i ∈ H} P(D | h_i)
COMP9417 T1, 2022 19
Classification – Bayesian methods
Example: now if we have some more information on the length of the fish in each class, then how can we update our decision, if we want to predict the class for a fish with 70cm length?
[Table of class-conditional probabilities of length for each class, over the bins: length > 100 cm, 50 cm < length < 100 cm, length < 50 cm]

P(c = salmon | x = 70 cm) ∝ P(70 cm | salmon) × P(salmon) = 0.4 × 0.3 = 0.12
P(c = sea bass | x = 70 cm) ∝ P(70 cm | sea bass) × P(sea bass) = 0.5 × 0.7 = 0.35

So based on these probabilities, our model predicts the type as "sea bass".
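A short worked version of this calculation in Python; the numbers are taken from the slide above, and the variable names are illustrative.

# Unnormalized posteriors for the fish example: P(c | x) ∝ P(x | c) P(c)
prior = {"salmon": 0.3, "sea bass": 0.7}
likelihood_70cm = {"salmon": 0.4, "sea bass": 0.5}   # P(x = 70 cm | class)

scores = {c: likelihood_70cm[c] * prior[c] for c in prior}
print(scores)                       # {'salmon': 0.12, 'sea bass': 0.35}
print(max(scores, key=scores.get))  # 'sea bass'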
COMP9417 T1, 2022 20
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = ?            P(not cancer) = ?
P(⊕ | cancer) = ?        P(⊖ | cancer) = ?
P(⊕ | not cancer) = ?    P(⊖ | not cancer) = ?
COMP9417 T1, 2022 21
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = .008          P(not cancer) = .992
P(⊕ | cancer) = .98       P(⊖ | cancer) = .02
P(⊕ | not cancer) = .03   P(⊖ | not cancer) = .97
COMP9417 T1, 2022 22
Applying Bayes Theorem
Does patient have cancer or not?
P(cancer | ⊕) = ?    P(not cancer | ⊕) = ?
We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = ...
COMP9417 T1, 2022 23
Applying Bayes Theorem
Does patient have cancer or not?
P(cancer | ⊕) = ?    P(not cancer | ⊕) = ?
We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = not cancer
COMP9417 T1, 2022 24
Applying Bayes Theorem
How to get the posterior probability of a hypothesis h? Divide by P(⊕) (the probability of the data) to normalize the result for h:

P(h|D) = P(D|h) P(h) / Σ_{h_i ∈ H} P(D|h_i) P(h_i)
Denominator ensures we obtain posterior probabilities that sum to 1. Sum for all possible numerator values, since hypotheses are mutually exclusive (e.g., patient either has cancer or does not).
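A small sketch of the cancer example including this normalization step, using the numbers from the slides; variable names are illustrative.

p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors (numerators of Bayes theorem)
num_cancer = p_pos_given_cancer * p_cancer       # 0.00784
num_not = p_pos_given_not * p_not                # 0.02976

# h_MAP: the hypothesis with the larger numerator
print("cancer" if num_cancer > num_not else "not cancer")   # 'not cancer'

# Normalize to obtain posteriors that sum to 1
evidence = num_cancer + num_not                  # P(⊕)
print(num_cancer / evidence)                     # ≈ 0.21  P(cancer | ⊕)
print(num_not / evidence)                        # ≈ 0.79  P(not cancer | ⊕)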
COMP9417 T1, 2022 25
Basic Formulas for Probabilities
Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Sum Rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then

P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)
COMP9417 T1, 2022 26
Basic Formulas for Probabilities
Also worth remembering:
o Conditional Probability: probability of A given B:
P(A|B) = P(A ∧ B) / P(B)
o Rearrange sum rule to get:
𝑃 (𝐴 ∧ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∨ 𝐵)
COMP9417 T1, 2022 27
Posterior Probabilities & Decision
If we have two competing hypotheses h_1 and h_2, then
– P(h_1|D) = posterior probability of h_1
– P(h_2|D) = posterior probability of h_2
• So far, the potential decision is made based on the higher posterior probability
– Decide h_1 if P(h_1|D) > P(h_2|D), and otherwise decide h_2
• An alternative: using a loss function 𝐿(h), where 𝐿(h) is the loss that occurs
when decision h is made.
– In this setup the Bayesian testing procedure minimizes the posterior expected loss
COMP9417 T1, 2022 28
Bayesian Expected Loss
Example: If, in the example of the patient with a positive test result, we know that the cost of misclassifying a patient who has cancer as "not cancer" is 10 times more than misclassifying a patient who doesn't have cancer as "cancer", how will that affect our decision?
COMP9417 T1, 2022 29
Bayesian Expected Loss (Risk)
If the cost of misclassification is not the same for different classes, then instead of maximizing a posteriori, we have to minimize the expected loss:
o So, if we define the loss associated with action α_i as λ(α_i | h)
o Then the expected loss associated with action α_i is:

EL(α_i) = R(α_i | x) = Σ_{h ∈ H} λ(α_i | h) P(h | x)
An optimal Bayesian decision strategy is to minimize the expected loss. And if the loss associated to misclassification is the same for different classes, then maximum a posteriori is equal to minimizing the expected loss.
COMP9417 T1, 2022 30
Bayesian Expected Loss
Example: let's revisit the example of the patient with a positive test result for cancer, given the loss function below:

λ                          cancer   not cancer
if predicted cancer        0        1
if predicted not cancer    10       0

R(predict cancer | ⊕) = λ(predict cancer | cancer) P(cancer | ⊕)
                        + λ(predict cancer | not cancer) P(not cancer | ⊕) ∝ 0 + 1 × 0.02976 = 0.02976

R(predict not cancer | ⊕) = λ(predict not cancer | cancer) P(cancer | ⊕)
                        + λ(predict not cancer | not cancer) P(not cancer | ⊕) ∝ 10 × 0.00784 + 0 = 0.0784

R(predict cancer | ⊕) < R(predict not cancer | ⊕)

Therefore the expected loss is less if we predict that the patient has cancer.
COMP9417 T1, 2022 31
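A sketch of this risk calculation in Python, reusing the unnormalized posteriors from the MAP slide; the dictionary layout and names are illustrative.

post = {"cancer": 0.00784, "not cancer": 0.02976}   # ∝ P(h | ⊕)
loss = {  # loss[action][true state], values from the loss table above
    "predict cancer":     {"cancer": 0, "not cancer": 1},
    "predict not cancer": {"cancer": 10, "not cancer": 0},
}

def risk(action):
    """Expected loss R(action | ⊕) ∝ Σ_h λ(action | h) P(h | ⊕)."""
    return sum(loss[action][h] * post[h] for h in post)

for a in loss:
    print(a, risk(a))          # predict cancer: 0.02976, predict not cancer: 0.0784
print(min(loss, key=risk))     # 'predict cancer'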
Bayesian Expected Loss
In summary:
• Bayesian framework allows for integration of losses into the decision-making rule
• In such framework, we can minimize the posterior expected loss
COMP9417 T1, 2022 32
Bayes Theorem
• To compute P(D|h) and P(h), we can use an empirical method based on the given data
• Or we may assume a parametric model, then we estimate parameters using the data
COMP9417 T1, 2022 33
Learning A Real Valued Function
Consider any real-valued target function f.
Training examples ⟨x_i, y_i⟩, where y_i is a noisy training value:
– y_i = f(x_i) + ε_i
– ε_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean zero

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmax_{h ∈ H} P(D|h) = argmax_{h ∈ H} ∏_{i=1}^{m} P(y_i | f̂)

where f̂ = h is the candidate hypothesis

     = argmax_{h ∈ H} ∏_{i=1}^{m} (1 / √(2πσ²)) e^(−(1/2)((y_i − f̂(x_i)) / σ)²)
COMP9417 T1, 2022 34
Learning A Real Valued Function
Maximize the natural log to give a simpler expression:

h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln(1 / √(2πσ²)) − (1/2)((y_i − f̂(x_i)) / σ)² ]

     = argmax_{h ∈ H} Σ_{i=1}^{m} − (1/2)((y_i − f̂(x_i)) / σ)²

     = argmax_{h ∈ H} Σ_{i=1}^{m} − (y_i − f̂(x_i))²

Equivalently, we can minimize the positive version of the expression:

h_ML = argmin_{h ∈ H} Σ_{i=1}^{m} (y_i − f̂(x_i))²
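A brief numerical illustration of this result, assuming a toy linear target with Gaussian noise; the data and the use of np.polyfit are illustrative, not from the slides.

# With zero-mean Gaussian noise, the maximum-likelihood fit coincides
# with the least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape)  # y_i = f(x_i) + eps_i

a, b = np.polyfit(x, y, deg=1)   # least-squares line = ML line under Gaussian noise
print(a, b)                      # close to the true slope 2.0 and intercept 1.0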
COMP9417 T1, 2022 35
Discriminative vs Generative Probabilistic Models
• Discriminative models model the posterior probability distribution 𝑃 (𝑦|𝑥), where 𝑦 is the target variable and 𝑥 is the feature vector. That is, given 𝑥 they return a probability distribution over 𝑦 .
• Generative models model the joint distribution P(y, x) of the target y and the feature vector x. Once we have access to this joint distribution, we can derive any conditional or marginal distribution involving the same variables. In particular, since P(x) = Σ_c P(y = c, x), it follows that the posterior distribution can be obtained as P(y|x) = P(y, x) / Σ_c P(y = c, x).
• Such models are called ‘generative’ because we can sample from the joint distribution to obtain new data points together with their labels. Alternatively, we can use 𝑃(𝑦) to sample a class and 𝑃(𝑥|𝑦) to sample an instance for that class.
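A hedged sketch of such sampling: the class priors below reuse the fish example, but the per-class Gaussians over length are made up purely for illustration.

import random

prior = {"salmon": 0.3, "sea bass": 0.7}
length_params = {"salmon": (60, 10), "sea bass": (90, 15)}  # assumed (mean, std) in cm

def sample():
    """Sample a class y ~ P(y), then a length x ~ P(x | y)."""
    y = random.choices(list(prior), weights=list(prior.values()))[0]
    mu, sigma = length_params[y]
    return random.gauss(mu, sigma), y

print(sample())   # e.g. (93.4, 'sea bass')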
COMP9417 T1, 2022 36
Bayesian Approach
What should h be?
o A collection of possible predictions, or
o A collection of functions that map the data 𝑥 to a possible prediction 𝑦
COMP9417 T1, 2022 37
Most Probable Classification of New Instances
So far, we've sought the most probable hypothesis given the data D (i.e., h_MAP)
But the most important question is: given a new instance x, what is its most probable classification?
o Although it may seem that h_MAP can answer this question, in fact we can do better
o h_MAP(x) is not necessarily the most probable classification! (e.g., if we are dealing with multiple hypotheses supporting each class)
COMP9417 T1, 2022 38
Most Probable Classification of New Instances
o Three possible hypotheses:
P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3
o Given a new instance x,
h_1(x) = +, h_2(x) = −, h_3(x) = −
o What is the most probable classification of x?
COMP9417 T1, 2022 39
Bayes Optimal Classifier
Bayes optimal classification:

P(h_1|D) = 0.4,  P(−|h_1) = 0,  P(+|h_1) = 1
P(h_2|D) = 0.3,  P(−|h_2) = 1,  P(+|h_2) = 0
P(h_3|D) = 0.3,  P(−|h_3) = 1,  P(+|h_3) = 0

argmax_{ν_j ∈ V} Σ_{h_i ∈ H} P(ν_j | h_i) P(h_i | D)
COMP9417 T1, 2022
Bayes Optimal Classifier
Σ_{h_i ∈ H} P(+ | h_i) P(h_i | D) = 0.4

Σ_{h_i ∈ H} P(− | h_i) P(h_i | D) = 0.6

argmax_{ν_j ∈ V} Σ_{h_i ∈ H} P(ν_j | h_i) P(h_i | D) = −

ν_j is a class value from the set of possible class values (ν_j ∈ V)
COMP9417 T1, 2022 41
Bayes Optimal Classifier
The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new sample based on the training data. This is different from MAP inference.
• In the MAP framework, we seek the most probable hypothesis in the hypothesis space.
• But in Bayes optimal classification, the most probable classification of the new instance is obtained by combining the predictions of ALL hypotheses (in the hypothesis space), weighted by their posterior probabilities:

P(ν_j | D) = Σ_{h_i ∈ H} P(ν_j | h_i) P(h_i | D)

The optimal classification of the new instance is the value ν_j for which P(ν_j | D) is maximum.
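A minimal sketch of the Bayes optimal classifier for the three-hypothesis toy example above; names are illustrative.

posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | D)
pred = {"h1": "+", "h2": "-", "h3": "-"}           # each hypothesis's prediction

def bayes_optimal(classes=("+", "-")):
    """Return argmax_v Σ_h P(v | h) P(h | D), where P(v | h) is 1 iff h predicts v."""
    votes = {v: sum(p for h, p in posterior_h.items() if pred[h] == v) for v in classes}
    return max(votes, key=votes.get), votes

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})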
COMP9417 T1, 2022 42
Bayes Optimal Classifier
It can be mathematically shown that no other classification method using the same hypothesis space and same prior knowledge can outperform the Bayes Optimal Classifier method on average.
COMP9417 T1, 2022 43
Bayes Optimal Classifier
Why do we need any other methods?
COMP9417 T1, 2022 44
Bayes Optimal Classifier
The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation of those quantities.
COMP9417 T1, 2022 45
The Bayes optimal classifier is very inefficient.
An alternative, less optimal method is the Gibbs algorithm.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance

Surprising fact: assuming target concepts are drawn at random from H (h ∈ H) according to the priors on H, then

E[error_Gibbs] ≤ 2 × E[error_BayesOptimal]
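A sketch of the Gibbs algorithm on the same toy hypotheses; names are illustrative.

import random

posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h | D)
pred = {"h1": "+", "h2": "-", "h3": "-"}

def gibbs_classify():
    """Draw one hypothesis h ~ P(h | D) and return its prediction."""
    h = random.choices(list(posterior_h), weights=list(posterior_h.values()))[0]
    return pred[h]

print(gibbs_classify())   # '+' with probability 0.4, '-' with probability 0.6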
COMP9417 T1, 2022 46
Bayes Error
What is the best performance attainable by a (two-class) classifier ?
Define the probability of error for classifying some instance x by

P(error | x) = P(class_1 | x)  if we predict class_2
             = P(class_2 | x)  if we predict class_1

This gives

P(error) = Σ_x P(error | x) P(x)

So we can justify the use of the decision rule:

if P̂(class_1 | x) > P̂(class_2 | x) then predict class_1
else predict class_2
On average, this decision rule minimises probability of classification error.
COMP9417 T1, 2022 47
Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the most practical learning methods.
When to use
o Moderate or large training set available
o Attributes that describe instances are conditionally independent given classification
Successful applications:
o Classifying text documents
o Gaussian Naive Bayes for real-valued data
COMP9417 T1, 2022 48
Naive Bayes Classifier
Assume target function f : X → V, where each instance x is described by attributes ⟨x_1, x_2, ..., x_n⟩.
Most probable value of f(x) is:

ν_MAP = argmax_{ν_j ∈ V} P(ν_j | x_1, x_2, ..., x_n)

ν_MAP = argmax_{ν_j ∈ V} P(x_1, x_2, ..., x_n | ν_j) P(ν_j) / P(x_1, x_2, ..., x_n)

      = argmax_{ν_j ∈ V} P(x_1, x_2, ..., x_n | ν_j) P(ν_j)
COMP9417 T1, 2022 49
Naive Bayes Classifier
Naive Bayes assumption:
P(x_1, x_2, ..., x_n | ν_j) = ∏_i P(x_i | ν_j)
• Attributes are statistically independent (given the class value)
o which means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)
Which gives Naive Bayes classifier:
ν_NB = argmax_{ν_j ∈ V} P(ν_j) ∏_i P(x_i | ν_j)
COMP9417 T1, 2022
Naive Bayes Algorithm

Naive_Bayes_Learn(examples):
for each target value ν_j
    P̂(ν_j) ← estimate P(ν_j)
    for each attribute value x_i:
        P̂(x_i | ν_j) ← estimate P(x_i | ν_j)

Classify_New_Instance(x):
ν_NB = argmax_{ν_j ∈ V} P̂(ν_j) ∏_i P̂(x_i | ν_j)
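A hedged Python sketch of this learn/classify procedure for categorical attributes; the function names and frequency-count estimates are illustrative (no smoothing is applied).

from collections import Counter, defaultdict

def naive_bayes_learn(examples, labels):
    """Estimate P(v_j) and P(x_i | v_j) from data by relative frequencies."""
    class_counts = Counter(labels)
    priors = {v: c / len(labels) for v, c in class_counts.items()}
    cond = defaultdict(Counter)                       # per-class counts of (attr index, value)
    for x, v in zip(examples, labels):
        for i, xi in enumerate(x):
            cond[v][(i, xi)] += 1
    likelihoods = {
        v: {key: n / class_counts[v] for key, n in counts.items()}
        for v, counts in cond.items()
    }
    return priors, likelihoods

def naive_bayes_classify(x, priors, likelihoods):
    """Return argmax_v P(v) * prod_i P(x_i | v); unseen values get probability 0."""
    scores = {}
    for v, p in priors.items():
        for i, xi in enumerate(x):
            p *= likelihoods[v].get((i, xi), 0.0)
        scores[v] = p
    return max(scores, key=scores.get)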
COMP9417 T1, 2022
Naive Bayes Example: PlayTennis

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
COMP9417 T1, 2022
Naive Bayes Example: PlayTennis

What are the required probabilities to predict PlayTennis?

Outlook    Yes  No         Temperature  Yes  No
Sunny      2    3          Hot          2    2
Overcast   4    0          Mild         4    2
Rainy      3    2          Cool         3    1
Sunny      2/9  3/5        Hot          2/9  2/5
Overcast   4/9  0/5        Mild         4/9  2/5
Rainy      3/9  2/5        Cool         3/9  1/5

Humidity   Yes  No         Windy   Yes  No         Play  Yes   No
High       3    4          False   6    2                9     5
Normal     6    1          True    3    3                9/14  5/14
High       3/9  4/5        False   6/9  2/5
Normal     6/9  1/5        True    3/9  3/5
COMP9417 T1, 2022 53
Naive Bayes Example: PlayTennis

Say we have the new instance:

⟨Outlk = sun, Temp = cool, Humid = high, Wind = true⟩

We want to compute:

ν_NB = argmax_{ν_j ∈ {yes, no}} P(ν_j) ∏_i P(x_i | ν_j)
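A worked sketch of this prediction using the probability tables above; variable names are illustrative.

p_yes, p_no = 9 / 14, 5 / 14

cond_yes = {"outlook=sunny": 2 / 9, "temp=cool": 3 / 9, "humidity=high": 3 / 9, "windy=true": 3 / 9}
cond_no  = {"outlook=sunny": 3 / 5, "temp=cool": 1 / 5, "humidity=high": 4 / 5, "windy=true": 3 / 5}

score_yes = p_yes
for p in cond_yes.values():
    score_yes *= p                    # P(yes) * prod_i P(x_i | yes)

score_no = p_no
for p in cond_no.values():
    score_no *= p                     # P(no) * prod_i P(x_i | no)

print(score_yes, score_no)            # ≈ 0.0053 vs ≈ 0.0206
print("yes" if score_yes > score_no else "no")   # 'no'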