
Classification (2)
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael


This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to:
– explain the concept of inductive bias in machine learning
– outline Bayes Theorem as applied in machine learning
– define MAP and ML inference using Bayes theorem
– outline the Naive Bayes classification algorithm
– outline the logistic regression classification algorithm
COMP9417 T1, 2022 1

Introduction
What do we understand about the problem of learning classifiers ? . . . how can we know when classifier learning succeeds ?
and . . . can we use this to build practical algorithms ?
COMP9417 T1, 2022 2

Inductive Bias
“All models are wrong, but some models are useful.”
Box & Draper (1987)
COMP9417 T1, 2022 3

Inductive Bias
Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially, it encodes the assumptions under which the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Success in machine learning requires understanding the inductive bias of algorithms and models and choosing them appropriately for the task.
COMP9417 T1, 2022 4

Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
o target function has the form 𝑦 = 𝑎𝑥 + 𝑏
o approximate by fitting using MSE
• Nearest Neighbour ?
o target function is a complex non-linear function of the data
o predict using nearest neighbour by Euclidean distance in feature space
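These two biases can be seen directly in code. Below is a minimal sketch, not from the slides, using scikit-learn on toy data; the data, random seed and model settings are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data (illustrative only, not from the course material).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=30)

# Linear regression: biased towards globally linear target functions y = a*x + b.
linear = LinearRegression().fit(X, y)

# 1-nearest neighbour: biased towards locally constant, possibly very non-linear functions.
knn = KNeighborsRegressor(n_neighbors=1).fit(X, y)

x_new = np.array([[2.5]])
print(linear.predict(x_new), knn.predict(x_new))  # the two biases give different predictions
```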
COMP9417 T1, 2022 5

Inductive Bias
What we would really like is a framework for machine learning algorithms:
o with a way of representing the inductive bias
o ideally, should be a declarative specification
o also should quantify uncertainty in the inductive bias
COMP9417 T1, 2022 6

A probabilistic approach
COMP9417 T1, 2022 7

A probabilistic approach
Probabilistic models

A simple probabilistic model

‘Viagra’ and ‘lottery’ are two Boolean features; Y is the class variable, with values ‘spam’ and ‘ham’. In each row the most likely class is indicated in bold.
Viagra  lottery  P(Y = spam | Viagra, lottery)  P(Y = ham | Viagra, lottery)
0       0        0.31                           0.69
0       1        0.65                           0.35
1       0        0.80                           0.20
1       1        0.40                           0.60
COMP9417 T1, 2022 8

Decision rule
Assuming that 𝑥 and 𝑦 are the only variables we know and care about, the posterior distribution 𝑃(𝑦 |𝑥) helps us to answer many questions of interest.
• For instance, to classify a new e-mail we determine whether the words ‘Viagra’ and ‘lottery’ occur in it, look up the corresponding probability P(Y = spam | Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
• Such a recipe to predict a value of 𝑦 on the basis of the values of 𝑥 and the posterior distribution 𝑃(𝑦 |𝑥) is called a decision rule.
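As a small illustration (not part of the original slides), the decision rule can be applied directly to the posterior table above; the probabilities and the 0.5 threshold are taken from the slide, everything else is an assumed sketch.

```python
# Hedged sketch: applying the decision rule to the posterior table above.
# Keys are (viagra, lottery); values are P(Y = spam | Viagra, lottery) from the slide.
posterior_spam = {
    (0, 0): 0.31,
    (0, 1): 0.65,
    (1, 0): 0.80,
    (1, 1): 0.40,
}

def classify_email(viagra: int, lottery: int) -> str:
    """Predict 'spam' if the posterior probability of spam exceeds 0.5, else 'ham'."""
    return "spam" if posterior_spam[(viagra, lottery)] > 0.5 else "ham"

print(classify_email(1, 0))  # 'spam'  (P = 0.80)
print(classify_email(1, 1))  # 'ham'   (P = 0.40)
```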
COMP9417 T1, 2022 9

Bayesian Machine Learning
COMP9417 T1, 2022 10

Two Roles for Bayesian Methods
Provides practical learning algorithms:
– Naive Bayes classifier learning
– Bayesian network learning, etc.
– Combines prior knowledge (prior probabilities) with observed data
Provides useful conceptual framework:
– Provides a “gold standard” for evaluating other learning algorithms
– Some additional insight into Occam’s razor
COMP9417 T1, 2022 11

Classification Question:
Can we do something other than finding a discriminative line (or some boundary) to separate the two groups?
COMP9417 T1, 2022

Classification Answer:
Yes, we can. Instead of finding a discriminative line, we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. Models of this type are called generative learning algorithms.
P(x | y = sea bass), P(y = sea bass)
P(x | y = salmon), P(y = salmon)
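A hedged sketch of this generative recipe for the fish example: fit a class prior P(y) and a class-conditional model P(x | y) for each class, then compare P(x | y)P(y). The priors match the later fish example, but the lengths and the choice of a Gaussian class-conditional below are made-up illustrative assumptions, not values from the slides.

```python
import numpy as np

# Illustrative (made-up) fish lengths in cm for each class; not data from the slides.
lengths = {
    "salmon":   np.array([50.0, 55.0, 60.0, 65.0, 70.0]),
    "sea bass": np.array([65.0, 75.0, 85.0, 95.0, 105.0]),
}
prior = {"salmon": 0.3, "sea bass": 0.7}  # class priors P(y), as in the later example

# Class conditionals P(x | y): fit one Gaussian per class (a modelling assumption).
params = {c: (xs.mean(), xs.std(ddof=1)) for c, xs in lengths.items()}

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def predict(length_cm):
    # Unnormalised posterior P(x | y) * P(y); the marginal P(x) is a common factor.
    scores = {c: gaussian_pdf(length_cm, *params[c]) * prior[c] for c in prior}
    return max(scores, key=scores.get)

print(predict(70.0))  # 'sea bass' with these illustrative numbers
```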
COMP9417 T1, 2022

Classification – Bayesian methods

Example: Imagine we want to classify fish by type: salmon or sea bass.
If from past experience we have (C_i is the class): P(C = sea bass) = 0.7, P(C = salmon) = 0.3
• If we decide based only on the prior, we always have to choose “sea bass”. This is called a “decision rule based on the prior”.
o This can behave very poorly
o It never predicts other classes
COMP9417 T1, 2022 14

Classification – Bayesian methods
Example: now if we have some more information on the length of the fish in each class, then how can we update our decision, if we want to predict the class for a fish with 70cm length?
These are called “class conditionals”, “class conditioned probabilities”
[Figure: class-conditional probabilities P(length | class) for each class, over the bins length < 50 cm, 50 cm < length < 100 cm, and length > 100 cm; the numeric values were shown in the figure.]

What we are interested in is P(c_i | x), but we have P(c_i) and P(x | c_i), so how should we go about this?

COMP9417 T1, 2022 15

Bayes Theorem

P(h | D) = P(D | h) P(h) / P(D)

P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h | D) = probability of h given D
P(D | h) = probability of D given h

COMP9417 T1, 2022 16

Decision Rule from Posteriors

Example: If the output belongs to a set of k classes: y ∈ {C_1, C_2, ..., C_k} for 1 ≤ i ≤ k
Then in the Bayesian framework:

P(y = C_i | x) = P(x | C_i) P(C_i) / P(x)

– P(y = C_i | x): posterior probability
– P(x | C_i): class conditional (class likelihood)
– P(C_i): prior
– P(x): marginal (P(x) = ∑_i P(x | C_i) P(C_i))

The decision rule is to select the class which maximizes the posterior probability for the prediction.

COMP9417 T1, 2022 17

Choosing Hypotheses

P(h | D) = P(D | h) P(h) / P(D)

Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)

COMP9417 T1, 2022 18

Choosing Hypotheses

If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h_i∈H} P(D | h_i)

COMP9417 T1, 2022 19

Classification – Bayesian methods

Example: now that we have some more information on the length of the fish in each class, how can we update our decision if we want to predict the class for a fish of 70 cm length?

[Class-conditional length figure as on the earlier slide.]

P(c = salmon | x = 70 cm) ∝ P(70 cm | salmon) × P(salmon) = 0.4 × 0.3 = 0.12
P(c = sea bass | x = 70 cm) ∝ P(70 cm | sea bass) × P(sea bass) = 0.5 × 0.7 = 0.35

So based on these probabilities, our model predicts the type as “sea bass”.

COMP9417 T1, 2022 20

Applying Bayes Theorem

Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

P(cancer) = ?            P(not cancer) = ?
P(⊕ | cancer) = ?        P(⊖ | cancer) = ?
P(⊕ | not cancer) = ?    P(⊖ | not cancer) = ?

COMP9417 T1, 2022 21

Applying Bayes Theorem

Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

P(cancer) = 0.008        P(not cancer) = 0.992
P(⊕ | cancer) = 0.98     P(⊖ | cancer) = 0.02
P(⊕ | not cancer) = 0.03 P(⊖ | not cancer) = 0.97

COMP9417 T1, 2022 22

Applying Bayes Theorem

Does the patient have cancer or not?
P(cancer | ⊕) = ?    P(not cancer | ⊕) = ?

We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = ...

COMP9417 T1, 2022 23

Applying Bayes Theorem

Does the patient have cancer or not?
P(cancer | ⊕) = ?    P(not cancer | ⊕) = ?

We can find the maximum a posteriori (MAP) hypothesis:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | not cancer) P(not cancer) = 0.03 × 0.992 = 0.02976
Thus h_MAP = not cancer

COMP9417 T1, 2022 24

Applying Bayes Theorem

How to get the posterior probability of a hypothesis h? Divide by P(⊕) (the probability of the data) to normalize the result for h:

P(h | D) = P(D | h) P(h) / ∑_{h_i∈H} P(D | h_i) P(h_i)

The denominator ensures we obtain posterior probabilities that sum to 1. Sum over all possible numerator values, since the hypotheses are mutually exclusive (e.g., the patient either has cancer or does not).

COMP9417 T1, 2022 25

Basic Formulas for Probabilities

Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)

Sum Rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with ∑_{i=1}^{n} P(A_i) = 1, then
P(B) = ∑_{i=1}^{n} P(B | A_i) P(A_i)

COMP9417 T1, 2022 26

Basic Formulas for Probabilities

Also worth remembering:
o Conditional probability: probability of A given B: P(A | B) = P(A ∧ B) / P(B)
o Rearrange the sum rule to get: P(A ∧ B) = P(A) + P(B) − P(A ∨ B)

COMP9417 T1, 2022 27

Posterior Probabilities & Decision

If we have two competing hypotheses h_1 and h_2, then
– P(h_1 | D) = posterior probability of h_1
– P(h_2 | D) = posterior probability of h_2
• So far, the decision is made based on the higher posterior probability
– Decide h_1 if P(h_1 | D) > P(h_2 | D), and otherwise decide h_2
• An alternative: using a loss function 𝐿(h), where 𝐿(h) is the loss that occurs
when decision h is made.
– In this setup the Bayesian testing procedure minimizes the posterior expected loss
COMP9417 T1, 2022 28
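A quick numeric check (not from the slides) of the MAP calculation in the cancer example above, using the stated prior and test accuracies.

```python
# Hedged sketch: MAP decision for the cancer test example above.
p_cancer      = 0.008
p_not_cancer  = 0.992
p_pos_given_cancer     = 0.98   # correct positive rate from the slide
p_pos_given_not_cancer = 0.03   # false-positive rate implied by the slide

# Unnormalised posteriors P(+ | h) * P(h) for each hypothesis.
score_cancer     = p_pos_given_cancer * p_cancer          # 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map)  # 'not cancer'

# Normalising gives posterior probabilities that sum to 1.
evidence = score_cancer + score_not_cancer
print(score_cancer / evidence, score_not_cancer / evidence)  # ~0.21, ~0.79
```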

Bayesian Expected Loss
Example: In the example of the patient with a positive test result, suppose we know that the cost of misclassifying a patient who has cancer as “not cancer” is 10 times that of misclassifying a patient who doesn’t have cancer as “cancer”. How will that affect our decision?
COMP9417 T1, 2022 29

Bayesian Expected Loss (Risk)
If the cost of misclassification is not the same for different classes, then instead of maximizing the posterior we have to minimize the expected loss:
o So, if we define the loss associated with action α_i as λ(α_i | h)
o Then the expected loss associated with action α_i is:

EL(α_i) = R(α_i | x) = ∑_{h∈H} λ(α_i | h) P(h | x)

An optimal Bayesian decision strategy is to minimize the expected loss. If the loss associated with misclassification is the same for all classes, then maximizing the posterior is equivalent to minimizing the expected loss.
COMP9417 T1, 2022 30
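A minimal sketch of this risk computation, reusing the (unnormalised) posteriors from the cancer example and the asymmetric loss described earlier; the worked example on the next slide gives the same numbers.

```python
# Hedged sketch: choosing the action with minimum posterior expected loss.
# Unnormalised posteriors P(h | +) from the cancer example (the common factor P(+) cancels).
posterior = {"cancer": 0.00784, "not cancer": 0.02976}

# Loss lambda(action | true state): missing a cancer is 10x worse than a false alarm.
loss = {
    ("predict cancer", "cancer"): 0,      ("predict cancer", "not cancer"): 1,
    ("predict not cancer", "cancer"): 10, ("predict not cancer", "not cancer"): 0,
}

def expected_loss(action: str) -> float:
    return sum(loss[(action, h)] * posterior[h] for h in posterior)

risks = {a: expected_loss(a) for a in ("predict cancer", "predict not cancer")}
print(risks)                      # {'predict cancer': 0.02976, 'predict not cancer': 0.0784}
print(min(risks, key=risks.get))  # 'predict cancer'
```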

Bayesian Expected Loss
Example: let’s revisit the example of the patient with a positive test result for cancer, given the loss function below, λ(prediction | true state):

                          Cancer   Not cancer
If predicted cancer         0          1
If predicted not cancer    10          0
R(predict cancer | ⊕) = λ(predict cancer | cancer) P(cancer | ⊕) + λ(predict cancer | not cancer) P(not cancer | ⊕)
                      ∝ 0 + 1 × 0.02976 = 0.02976

R(predict not cancer | ⊕) = λ(predict not cancer | cancer) P(cancer | ⊕) + λ(predict not cancer | not cancer) P(not cancer | ⊕)
                          ∝ 10 × 0.00784 + 0 = 0.0784

R(predict cancer | ⊕) < R(predict not cancer | ⊕)

Therefore the expected loss is less if we predict that the patient has cancer.

COMP9417 T1, 2022 31

Bayesian Expected Loss

In summary:
• The Bayesian framework allows for the integration of losses into the decision-making rule
• In such a framework, we can minimize the posterior expected loss

COMP9417 T1, 2022 32

Bayes Theorem

• To compute P(D | h) and P(h), we can use an empirical method based on the given data
• Or we may assume a parametric model, and then estimate the parameters using the data

COMP9417 T1, 2022 33

Learning A Real-Valued Function

Consider any real-valued target function f.
Training examples ⟨x_i, y_i⟩, where y_i is a noisy training value:
– y_i = f(x_i) + ε_i
– ε_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean zero

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmax_{h∈H} P(D | h) = argmax_{h∈H} ∏_{i=1}^{m} P(y_i | f̂)    (writing f̂ for the hypothesis h)
     = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(1/2)((y_i − f̂(x_i))/σ)²)

COMP9417 T1, 2022 34

Learning A Real-Valued Function

Maximize the natural log to give a simpler expression:

h_ML = argmax_{h∈H} ∑_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/2)((y_i − f̂(x_i))/σ)² ]
     = argmax_{h∈H} ∑_{i=1}^{m} −(1/2)((y_i − f̂(x_i))/σ)²
     = argmax_{h∈H} ∑_{i=1}^{m} −(y_i − f̂(x_i))²

Equivalently, we can minimize the positive version of the expression:

h_ML = argmin_{h∈H} ∑_{i=1}^{m} (y_i − f̂(x_i))²

COMP9417 T1, 2022 35

Discriminative vs Generative Probabilistic Models

• Discriminative models model the posterior probability distribution P(y | x), where y is the target variable and x is the feature vector. That is, given x they return a probability distribution over y.
• Generative models model the joint distribution P(y, x) of the target y and the feature vector x. Once we have access to this joint distribution, we can derive any conditional or marginal distribution involving the same variables. In particular, since P(x) = ∑_c P(y = c, x), it follows that the posterior distribution can be obtained as P(y | x) = P(y, x) / ∑_c P(y = c, x).
• Such models are called ‘generative’ because we can sample from the joint distribution to obtain new data points together with their labels. Alternatively, we can use P(y) to sample a class and P(x | y) to sample an instance for that class.

COMP9417 T1, 2022 36

Bayesian Approach

What should h be?
o A collection of possible predictions, or
o A collection of functions that map the data x to a possible prediction y

COMP9417 T1, 2022 37

Most Probable Classification of New Instances

So far, we’ve sought the most probable hypothesis given the data D (i.e., h_MAP).
But the most important question is: given a new instance x, what is its most probable classification?
o Although it may seem that h_MAP can answer this question, in fact we can do better
o h_MAP(x) is not necessarily the most probable classification! (e.g. if we are dealing with multiple hypotheses supporting each class)

COMP9417 T1, 2022 38

Most Probable Classification of New Instances

o Three possible hypotheses: P(h_1 | D) = 0.4, P(h_2 | D) = 0.3, P(h_3 | D) = 0.3
o Given new instance x: h_1(x) = +, h_2(x) = −, h_3(x) = −
o What is the most probable classification of x?

COMP9417 T1, 2022 39

Bayes Optimal Classifier

Bayes optimal classification:
P(h_1 | D) = 0.4,  P(h_2 | D) = 0.3,  P(h_3 | D) = 0.3
P(− | h_1) = 0,    P(− | h_2) = 1,    P(− | h_3) = 1
P(+ | h_1) = 1,    P(+ | h_2) = 0,    P(+ | h_3) = 0

argmax_{ν_j∈V} ∑_{h_i∈H} P(ν_j | h_i) P(h_i | D)

COMP9417 T1, 2022 40

Bayes Optimal Classifier

∑_{h_i∈H} P(+ | h_i) P(h_i | D) = 0.4
∑_{h_i∈H} P(− | h_i) P(h_i | D) = 0.6

argmax_{ν_j∈V} ∑_{h_i∈H} P(ν_j | h_i) P(h_i | D) = −

ν_j is the class value from the set of possible class values (ν_j ∈ V)

COMP9417 T1, 2022 41

Bayes Optimal Classifier

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new sample based on the training data. This is different from MAP inference.
• In the MAP framework, we seek the most probable hypothesis in the space of hypotheses.
• But in Bayes optimal classification, the most probable classification of the new instance is obtained by combining the predictions of ALL hypotheses (in the hypothesis space), weighted by their posterior probabilities:

P(ν_j | D) = ∑_{h_i∈H} P(ν_j | h_i) P(h_i | D)

The optimal classification of the new instance is the value ν_j for which P(ν_j | D) is maximum.

COMP9417 T1, 2022 42

Bayes Optimal Classifier

It can be mathematically shown that no other classification method using the same hypothesis space and the same prior knowledge can outperform the Bayes optimal classifier on average.

COMP9417 T1, 2022 43

Bayes Optimal Classifier

Why do we need any other methods?

COMP9417 T1, 2022 44

Bayes Optimal Classifier

The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation of those quantities.

COMP9417 T1, 2022 45

The Bayes optimal classifier is very inefficient. An alternative, less optimal method is the Gibbs algorithm.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h | D)
2. Use this to classify the new instance
Surprising fact: assuming target concepts are drawn at random from H (h ∈ H) according to the priors on H, then
E[error_Gibbs] ≤ 2 × E[error_BayesOptimal]

COMP9417 T1, 2022 46

Bayes Error

What is the best performance attainable by a (two-class) classifier?
Define the probability of error for classifying some instance x by
P(error | x) = P(class_1 | x) if we predict class_2
             = P(class_2 | x) if we predict class_1
This gives P(error) = ∑_x P(error | x) P(x).
So we can justify the use of the decision rule:
if P̂(class_1 | x) > P̂(class_2 | x) then predict class_1
else predict class_2
On average, this decision rule minimises probability of classification error.
COMP9417 T1, 2022 47
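Returning to the three-hypothesis example above, here is a small sketch (not from the slides) of the Bayes optimal combination, where h1 predicts + and h2, h3 predict −.

```python
# Hedged sketch: Bayes optimal classification for the three-hypothesis example above.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # each hypothesis's label for x

def bayes_optimal(classes=("+", "-")) -> str:
    # P(v | D) = sum_h P(v | h) P(h | D); here P(v | h) is 1 if h predicts v, else 0.
    score = {v: sum(p for h, p in posterior.items() if prediction[h] == v) for v in classes}
    return max(score, key=score.get)

print(bayes_optimal())  # '-'  (0.6 vs 0.4), even though h_MAP = h1 predicts '+'
```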

Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the most practical learning methods.
When to use
o Moderate or large training set available
o Attributes that describe instances are conditionally independent given classification
Successful applications:
o Classifying text documents
o Gaussian Naive Bayes for real-valued data
COMP9417 T1, 2022 48

Naive Bayes Classifier
Assume target function f : X → V, where each instance x is described by attributes ⟨x_1, x_2, ..., x_n⟩.
Most probable value of f(x) is:

ν_MAP = argmax_{ν_j∈V} P(ν_j | x_1, x_2, ..., x_n)

ν_MAP = argmax_{ν_j∈V} P(x_1, x_2, ..., x_n | ν_j) P(ν_j) / P(x_1, x_2, ..., x_n)
      = argmax_{ν_j∈V} P(x_1, x_2, ..., x_n | ν_j) P(ν_j)
COMP9417 T1, 2022 49

Naive Bayes Classifier
Naive Bayes assumption:
P(x_1, x_2, ..., x_n | ν_j) = ∏_i P(x_i | ν_j)
• Attributes are statistically independent (given the class value)
o which means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)
Which gives Naive Bayes classifier:
ν_NB = argmax_{ν_j∈V} P(ν_j) ∏_i P(x_i | ν_j)
COMP9417 T1, 2022

Naive Bayes Algorithm

Naive Bayes Learn (examples)
  for each target value ν_j
    P̂(ν_j) ← estimate P(ν_j)
    for each attribute value x_i:
      P̂(x_i | ν_j) ← estimate P(x_i | ν_j)

Classify New Instance (for sample x)
  ν_NB = argmax_{ν_j∈V} P̂(ν_j) ∏_i P̂(x_i | ν_j)
COMP9417 T1, 2022
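Below is a minimal sketch of this learn/classify pair for categorical attributes (an illustrative implementation, not the course's reference code); it uses the simple relative-frequency estimates from the slide, with no smoothing.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, class_value). Returns (priors, conditionals)."""
    class_counts = Counter(v for _, v in examples)
    n = len(examples)
    priors = {v: c / n for v, c in class_counts.items()}

    # cond[(attribute_index, attribute_value, class_value)] = P(x_i = value | class)
    cond = defaultdict(float)
    counts = Counter()
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            counts[(i, a, v)] += 1
    for (i, a, v), c in counts.items():
        cond[(i, a, v)] = c / class_counts[v]
    return priors, cond

def naive_bayes_classify(priors, cond, attrs):
    """Return argmax_v P(v) * prod_i P(x_i | v)."""
    def score(v):
        s = priors[v]
        for i, a in enumerate(attrs):
            s *= cond[(i, a, v)]
        return s
    return max(priors, key=score)
```

Applied to the PlayTennis table on the next slide, these frequency estimates reproduce the probabilities tabulated there.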

Naive Bayes Example: PlayTennis

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

COMP9417 T1, 2022

Naive Bayes Example: PlayTennis

What are the required probabilities to predict PlayTennis?

Outlook     Yes      No            Temperature  Yes      No
Sunny       2 (2/9)  3 (3/5)       Hot          2 (2/9)  2 (2/5)
Overcast    4 (4/9)  0 (0/5)       Mild         4 (4/9)  2 (2/5)
Rainy       3 (3/9)  2 (2/5)       Cool         3 (3/9)  1 (1/5)

Humidity    Yes      No            Windy        Yes      No
High        3 (3/9)  4 (4/5)       False        6 (6/9)  2 (2/5)
Normal      6 (6/9)  1 (1/5)       True         3 (3/9)  3 (3/5)

Play        Yes      No
            9 (9/14) 5 (5/14)
COMP9417 T1, 2022 53

Naive Bayes Example: PlayTennis

Say we have the new instance:
⟨Outlk = sun, Temp = cool, Humid = high, Wind = true⟩

We want to compute:

ν_NB = argmax_{ν_j ∈ {yes, no}} P(ν_j) ∏_i P(x_i | ν_j)
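The slide is truncated here; as a hedged completion, the standard calculation with the table probabilities above would look like the following sketch (the printed values are what those probabilities imply, not copied from the slides).

```python
# Hedged sketch: Naive Bayes prediction for <sunny, cool, high, true> using the tables above.
p_yes, p_no = 9/14, 5/14

cond_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
cond_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}

instance = ["sunny", "cool", "high", "true"]

score_yes = p_yes
score_no  = p_no
for value in instance:
    score_yes *= cond_yes[value]
    score_no  *= cond_no[value]

print(score_yes, score_no)                      # ~0.0053 vs ~0.0206
print("yes" if score_yes > score_no else "no")  # 'no'
```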
