
Classification (2)
COMP9417 Machine Learning and Data Mining
Term 2, 2020
COMP9417 ML & DM Classification (2) Term 2, 2020 1 / 104

Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www- 2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM
Classification (2)
Term 2, 2020
2 / 104

Aims
This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• explain the concept of inductive bias in machine learning
• outline Bayes Theorem as applied in machine learning
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
• outline the logistic regression classification algorithm
COMP9417 ML & DM Classification (2) Term 2, 2020 3 / 104

Introduction
Introduction
What do we understand about the problem of learning classifiers ? . . . how can we know when classifier learning succeeds ?
and . . . can we use this to build practical algorithms ?
COMP9417 ML & DM Classification (2) Term 2, 2020 4 / 104

Introduction
Inductive Bias
Step back
COMP9417 ML & DM Classification (2) Term 2, 2020 5 / 104

Introduction
Inductive Bias
Inductive Bias
All models are wrong, but some models are useful.
Box & Draper (1987)
COMP9417 ML & DM Classification (2) Term 2, 2020 6 / 104

Introduction
Inductive Bias
Inductive Bias
Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially, whether the algorithm and model combination you are using to solve the learning problem is appropriate for the task depends on how well its inductive bias matches that task.
Success in machine learning requires understanding the inductive bias of algorithms and models, and choosing them appropriately for the task.
COMP9417 ML & DM Classification (2) Term 2, 2020 7 / 104

Introduction
Inductive Bias
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
• Nearest Neighbour ?
COMP9417 ML & DM
Classification (2)
Term 2, 2020
8 / 104

Introduction
Inductive Bias
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression ?
  • target function has the form y = ax + b
  • approximate by fitting using MSE
• Nearest Neighbour ?
  • target function is a complex non-linear function of the data
  • predict using nearest neighbour by Euclidean distance in feature space
COMP9417 ML & DM Classification (2) Term 2, 2020 9 / 104

Introduction
Inductive Bias
Inductive Bias
What we would really like:
• a framework for machine learning algorithms
• with a way of representing the inductive bias
• ideally, should be a declarative specification
• also should quantify uncertainty in the inductive bias
COMP9417 ML & DM Classification (2) Term 2, 2020 10 / 104

A probabilistic approach
Why model uncertainty ?
A probabilistic approach
COMP9417 ML & DM Classification (2) Term 2, 2020 11 / 104

A probabilistic approach
Why model uncertainty ?
Uncertainty
As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality.
–Albert Einstein
COMP9417 ML & DM Classification (2) Term 2, 2020 12 / 104

A probabilistic approach
Probabilistic models
A simple probabilistic model
‘Viagra’ and ‘lottery’ are two Boolean features; Y is the class variable,
with values ‘spam’ and ‘ham’. In each row the most likely class is
indicated in bold.
Viagra  lottery  P(Y = spam | Viagra, lottery)  P(Y = ham | Viagra, lottery)
  0        0                0.31                          0.69
  0        1                0.65                          0.35
  1        0                0.80                          0.20
  1        1                0.40                          0.60
COMP9417 ML & DM Classification (2) Term 2, 2020 13 / 104

A probabilistic approach
Probabilistic models
Decision rule
Assuming that X and Y are the only variables we know and care about, the posterior distribution P(Y |X) helps us to answer many questions of interest.
• For instance, to classify a new e-mail we determine whether the words ‘Viagra’ and ‘lottery’ occur in it, look up the corresponding probability P (Y = spam|Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
• Such a recipe to predict a value of Y on the basis of the values of X and the posterior distribution P(Y |X) is called a decision rule.
COMP9417 ML & DM Classification (2) Term 2, 2020 14 / 104

Bayesian Machine Learning
Bayesian Machine Learning
COMP9417 ML & DM Classification (2) Term 2, 2020 15 / 104

Bayesian Machine Learning
Why Bayes ?
Two Roles for Bayesian Methods
Provides practical learning algorithms:
• Naive Bayes classifier learning
• Bayesian network learning, etc.
• Combines prior knowledge (prior probabilities) with observed data
• How to get prior probabilities ?
Provides useful conceptual framework:
• Provides a “gold standard” for evaluating other learning algorithms
• Gives some additional insight into “Occam’s razor”
COMP9417 ML & DM Classification (2) Term 2, 2020 16 / 104

Bayesian Machine Learning
Why Bayes ?
Basic Formulas for Probabilities
Product Rule: probability P(A ∧ B) of the conjunction of two events A and B:

P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Sum Rule: probability of the disjunction of two events A and B:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then:

P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)
COMP9417 ML & DM Classification (2) Term 2, 2020 17 / 104

Bayesian Machine Learning
Why Bayes ?
Basic Formulas for Probabilities
Also worth remembering:
• Conditional Probability: probability of A given B:

  P(A|B) = P(A ∧ B) / P(B)

• Rearrange sum rule to get:

  P(A ∧ B) = P(A) + P(B) − P(A ∨ B)
Exercise: (if you haven’t done it before) derive Bayes Theorem.
COMP9417 ML & DM Classification (2) Term 2, 2020 18 / 104

Bayesian Machine Learning
Why Bayes ?
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)

where

P(h)   = prior probability of hypothesis h
P(D)   = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
COMP9417 ML & DM Classification (2) Term 2, 2020 19 / 104

Bayesian Machine Learning
Why Bayes ?
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)

Generally, we want the most probable hypothesis given the training data.

Maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

COMP9417 ML & DM Classification (2) Term 2, 2020 20 / 104

Bayesian Machine Learning
Why Bayes ?
Choosing Hypotheses
If we assume P(h_i) = P(h_j) for all i, j then we can further simplify, and choose the Maximum likelihood (ML) hypothesis

h_ML = argmax_{h_i ∈ H} P(D|h_i)
COMP9417 ML & DM Classification (2) Term 2, 2020 21 / 104

Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) =              P(¬cancer) =
P(⊕ | cancer) =          P(⊖ | cancer) =
P(⊕ | ¬cancer) =         P(⊖ | ¬cancer) =
COMP9417 ML & DM
Classification (2) Term 2, 2020 22 / 104

Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = .008         P(¬cancer) = .992
P(⊕ | cancer) = .98      P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03     P(⊖ | ¬cancer) = .97
COMP9417 ML & DM
Classification (2) Term 2, 2020 23 / 104

Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
We can find the maximum a posteriori (MAP) hypothesis
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.02976

Thus h_MAP = . . . .
COMP9417 ML & DM Classification (2)
Term 2, 2020
24 / 104

Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
We can find the maximum a posteriori (MAP) hypothesis
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.02976
Thus hMAP = ¬cancer.
Also note: posterior probability of hypothesis cancer higher than prior.
COMP9417 ML & DM Classification (2) Term 2, 2020 25 / 104
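A minimal sketch of this MAP calculation in Python, using the values from the slide above; the hypothesis names in the dictionary are illustrative.

# MAP inference for the cancer test example (values from the slide above)
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalised posteriors P(+|h) P(h) for each hypothesis h
scores = {
    "cancer":  p_pos_given_cancer * p_cancer,    # 0.00784
    "~cancer": p_pos_given_not * p_not_cancer,   # 0.02976
}
h_map = max(scores, key=scores.get)              # '~cancer'

# Normalise by P(+) to obtain posterior probabilities that sum to 1
p_data = sum(scores.values())
posteriors = {h: s / p_data for h, s in scores.items()}
print(h_map, posteriors)                         # P(cancer|+) is about 0.21, up from the 0.008 prior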

Bayesian Machine Learning
Why Bayes ?
Applying Bayes Theorem
How to get the posterior probability of a hypothesis h ? Divide by P (⊕), probability of data, to normalize result for h:
P(h|D) = P(D|h) P(h) / Σ_{h_i ∈ H} P(D|h_i) P(h_i)
Denominator ensures we obtain posterior probabilities that sum to 1.
Sum for all possible numerator values, since hypotheses are mutually exclusive (e.g., patient either has cancer or does not).
The denominator is the marginal likelihood (the likelihood marginalised over all hypotheses in the hypothesis space), i.e., the prior probability of the data.
COMP9417 ML & DM Classification (2) Term 2, 2020 26 / 104

Bayesian Machine Learning
Why Bayes ?
A Bayesian framework for Classification
First, define a prior on the hypothesis
Next, define the likelihood, i.e., the probability of the data given the hypothesis
Then learning is finding the required parameters by fitting hypotheses to data
Predict (classify) using, e.g., the MAP hypothesis
COMP9417 ML & DM Classification (2) Term 2, 2020 27 / 104

Bayesian Machine Learning
Why Bayes ?
Brute Force MAP Hypothesis Learner
Idea: view learning as finding the most probable hypothesis
• For each hypothesis h in H, calculate the posterior probability
P(h|D) = P(D|h) P(h) / P(D)

• Output the hypothesis h_MAP with the highest posterior probability

h_MAP = argmax_{h ∈ H} P(h|D)
COMP9417 ML & DM Classification (2) Term 2, 2020 28 / 104
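A sketch of the brute-force MAP learner above, assuming a hypothesis space small enough to enumerate; the candidate coin biases, data and uniform prior used to exercise it are purely illustrative.

def brute_force_map(hypotheses, data, prior, likelihood):
    # Return the h maximising P(D|h) P(h); dividing by P(D) does not change the argmax.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

def likelihood(D, h):
    # P(D|h) for independent coin tosses, where h is the probability of heads (1)
    p = 1.0
    for x in D:
        p *= h if x == 1 else (1 - h)
    return p

hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]           # candidate values of the heads probability
data = [1, 1, 0, 1, 1]                           # four heads, one tail
print(brute_force_map(hypotheses, data, lambda h: 1 / len(hypotheses), likelihood))   # 0.7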

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Canonical concept learning task:
• instance space X, hypothesis space H, training examples D
• consider a learning algorithm that outputs the most specific hypothesis from the version space VS_{H,D} (i.e., the set of all consistent or ”zero-error” classification rules)
What would Bayes rule produce as the MAP hypothesis? Does this algorithm output a MAP hypothesis??
COMP9417 ML & DM Classification (2) Term 2, 2020 29 / 104

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Brute Force MAP Framework for Concept Learning:
Assume fixed set of instances ⟨x1, . . . , xm⟩
Assume D is the set of classifications D = ⟨c(x1), . . . , c(xm)⟩
Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H
Choose P(D|h):
• P(D|h) = 1 if h consistent with D
• P(D|h) = 0 otherwise
COMP9417 ML & DM Classification (2) Term 2, 2020 30 / 104

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Then:
P(h|D) = 1/|VS_{H,D}|   if h is consistent with D
P(h|D) = 0              otherwise
COMP9417 ML & DM
Classification (2) Term 2, 2020 31 / 104

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Note that since likelihood is zero if h is inconsistent then the posterior is also zero. But how did we obtain the posterior for consistent h ?
P(h|D) = (1 · 1/|H|) / P(D)
       = (1 · 1/|H|) / (|VS_{H,D}| / |H|)
       = 1 / |VS_{H,D}|
COMP9417 ML & DM
Classification (2)
Term 2, 2020
32 / 104

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
How did we obtain P(D) ? From theorem of total probability:
P(D) = Σ_{h_i ∈ H} P(D|h_i) P(h_i)
     = Σ_{h_i ∈ VS_{H,D}} 1 · 1/|H|  +  Σ_{h_i ∉ VS_{H,D}} 0 · 1/|H|
     = Σ_{h_i ∈ VS_{H,D}} 1 · 1/|H|
     = |VS_{H,D}| / |H|
COMP9417 ML & DM
Classification (2)
Term 2, 2020
33 / 104

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Evolution of Posterior Probabilities
[Figure: evolution of posterior probabilities over the hypothesis space: (a) the prior P(h); (b) the posterior P(h|D1) after seeing data D1; (c) the posterior P(h|D1, D2) after seeing D1 and D2. The posterior concentrates on the hypotheses consistent with the data.]
COMP9417 ML & DM
Classification (2)
Term 2, 2020 34 / 104

A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Every hypothesis consistent with D is a MAP hypothesis, if we assume
• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)
So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.
COMP9417 ML & DM Classification (2) Term 2, 2020 35 / 104
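A numerical check of this result on a toy concept learning problem; the instance and hypothesis spaces below are illustrative (every possible labelling of four boolean instances, a uniform prior, and the 0/1 likelihood defined above).

from itertools import product

instances = list(product([0, 1], repeat=2))                 # four boolean instances
hypotheses = list(product([0, 1], repeat=len(instances)))   # |H| = 16; h[i] labels instances[i]

D = {(0, 0): 0, (1, 1): 1}                                  # observed labels c(x)
consistent = lambda h: all(h[instances.index(x)] == y for x, y in D.items())

prior = 1 / len(hypotheses)                                 # uniform prior over H
unnorm = [(1.0 if consistent(h) else 0.0) * prior for h in hypotheses]
p_data = sum(unnorm)                                        # = |VS_H,D| / |H|
posterior = [u / p_data for u in unnorm]

print(sum(1 for h in hypotheses if consistent(h)))          # |VS_H,D| = 4
print(sorted(set(posterior)))                               # [0.0, 0.25]: every consistent h is a MAP hypothesis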

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
E.g., learning a linear target function f from noisy examples:

[Figure: noisy training examples (x, y) scattered around the linear target function f, together with the maximum likelihood hypothesis h_ML fitted to them; e denotes the noise on a training example.]
COMP9417 ML & DM
Classification (2)
Term 2, 2020
36 / 104

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
Consider any real-valued target function f
Training examples ⟨xi,di⟩, where di is noisy training value
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean 0

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i − h(x_i))²
COMP9417 ML & DM Classification (2) Term 2, 2020 37 / 104

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
How did we obtain this ?
h_ML = argmax_{h ∈ H} p(D|h)
     = argmax_{h ∈ H} Π_{i=1}^{m} p(d_i|h)
     = argmax_{h ∈ H} Π_{i=1}^{m} (1/√(2πσ²)) e^{−(1/2)((d_i − h(x_i))/σ)²}

Recall that we treat each probability p(d_i|h) as if h = f, i.e., we assume μ = f(x_i) = h(x_i), which is the key idea behind maximum likelihood !
COMP9417 ML & DM Classification (2) Term 2, 2020 38 / 104

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Learning A Real Valued Function
Maximize natural log to give simpler expression:
h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/2)((d_i − h(x_i))/σ)² ]
     = argmax_{h ∈ H} Σ_{i=1}^{m} − (1/2)((d_i − h(x_i))/σ)²
     = argmax_{h ∈ H} Σ_{i=1}^{m} − (d_i − h(x_i))²

Equivalently, we can minimize the positive version of the expression:

h_ML = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i − h(x_i))²

COMP9417 ML & DM Classification (2) Term 2, 2020 39 / 104
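A numerical check, under the Gaussian noise assumption above, that ranking candidate hypotheses by log-likelihood or by (negative) sum of squared errors selects the same one; the target function and candidate slopes are illustrative.

import math, random

random.seed(0)
sigma = 1.0
f = lambda x: 2.0 * x                                       # unknown target (illustrative)
xs = [i / 10 for i in range(20)]
ds = [f(x) + random.gauss(0, sigma) for x in xs]            # noisy training values d_i

candidates = [lambda x, a=a: a * x for a in (1.0, 1.5, 2.0, 2.5)]   # candidate hypotheses

def sse(h):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

def log_likelihood(h):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2) - (d - h(x)) ** 2 / (2 * sigma ** 2)
               for x, d in zip(xs, ds))

h_ml, h_lse = max(candidates, key=log_likelihood), min(candidates, key=sse)
print(h_ml is h_lse)                                        # True: same hypothesis wins both ways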

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Discriminative and generative probabilistic models
• Discriminative models model the posterior probability distribution P (Y |X), where Y is the target variable and X are the features. That is, given X they return a probability distribution over Y .
• Generative models model the joint distribution P(Y, X) of the target Y and the feature vector X. Once we have access to this joint distribution we can derive any conditional or marginal distribution involving the same variables. In particular, since P(X) = Σ_y P(Y = y, X) it follows that the posterior distribution can be obtained as P(Y|X) = P(Y, X) / Σ_y P(Y = y, X).
• Alternatively, generative models can be described by the likelihood function P (X|Y ), since P (Y, X) = P (X|Y )P (Y ) and the target or prior distribution (usually abbreviated to ‘prior’) can be easily estimated or postulated.
• Such models are called ‘generative’ because we can sample from the joint distribution to obtain new data points together with their labels. Alternatively, we can use P (Y ) to sample a class and P (X |Y ) to sample an instance for that class.
COMP9417 ML & DM Classification (2) Term 2, 2020 40 / 104

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
Assessing uncertainty in estimates
Suppose we want to estimate the probability θ that an arbitrary e-mail is spam, so that we can use the appropriate prior distribution.
• The natural thing to do is to inspect n e-mails, determine the number of spam e-mails d, and set θˆ = d/n; we don’t really need any complicated statistics to tell us that.
• However, while this is the most likely estimate of θ – the maximum a posteriori (MAP) estimate – this doesn’t mean that other values of θ are completely ruled out.
• We model this by a probability distribution over θ (a Beta distribution in this case) which is updated each time new information comes in. This is further illustrated in the figure for a distribution that is more and more skewed towards spam.
• For each curve, its bias towards spam is given by the area under the curve and to the right of θ = 1/2.
COMP9417 ML & DM Classification (2) Term 2, 2020 41 / 104

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Assessing uncertainty in estimates
[Figure: Beta distributions over θ ∈ [0, 1] for increasing numbers of inspected e-mails; the curves become narrower and shift towards higher θ as more spam is observed.]
Each time we inspect an e-mail, we are reducing our uncertainty regarding the prior spam probability θ. After we inspect two e-mails and observe one spam, the possible θ values are characterised by a symmetric distribution around 1/2. If we inspect a third, fourth, . . . , tenth e-mail and each time (except the first one) it is spam, then this distribution narrows and shifts a little bit to the right each time. The distribution for n e-mails reaches its
maximum at θ̂_MAP = (n − 1)/n (e.g., θ̂_MAP = 0.8 for n = 5).
COMP9417 ML & DM
Classification (2) Term 2, 2020 42 / 104
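A grid-based sketch of this posterior update, assuming a uniform prior over θ so that after observing s spam in s + h e-mails the posterior is proportional to θ^s (1 − θ)^h; the grid resolution and printed quantities are illustrative.

grid = [i / 1000 for i in range(1, 1000)]                   # θ values in (0, 1)

def posterior(spam, ham):
    unnorm = [t ** spam * (1 - t) ** ham for t in grid]     # proportional to Beta(spam+1, ham+1)
    z = sum(unnorm)
    return [u / z for u in unnorm]

# n = 5 e-mails: one ham then four spam, as in the figure
post = posterior(spam=4, ham=1)
theta_map = grid[max(range(len(grid)), key=lambda i: post[i])]
print(theta_map)                                            # 0.8 = (n-1)/n for n = 5

# After one ham and nine spam, the probability that e-mails are biased towards ham
# is the mass to the left of θ = 1/2: about 0.006, the 0.6% mentioned on the next slide
post10 = posterior(spam=9, ham=1)
print(sum(p for t, p in zip(grid, post10) if t < 0.5))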

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
The Bayesian perspective
Explicitly modelling the posterior distribution over the parameter θ has a number of advantages that are usually associated with the ‘Bayesian’ perspective:
• We can precisely characterise the uncertainty that remains about our estimate by quantifying the spread of the posterior distribution.
• We can obtain a generative model for the parameter by sampling from the posterior distribution, which contains much more information than a summary statistic such as the MAP estimate can convey – so, rather than using a single e-mail with θ = θMAP, our generative model can contain a number of e-mails with θ sampled from the posterior distribution.
COMP9417 ML & DM Classification (2) Term 2, 2020 43 / 104

A Bayesian approach to learning algorithms
Numeric prediction in Bayesian terms
The Bayesian perspective
• We can quantify the probability of statements such as ‘e-mails are biased towards ham’ (the tiny shaded area in the figure demonstrates that after observing one ham and nine spam e-mails this probability is very small, about 0.6%).
• We can use one of these distributions to encode our prior beliefs: e.g., if we believe that the proportions of spam and ham are typically 50–50, we can take the distribution for n = 2 (the lowest, symmetric one in the figure on the previous slide) as our prior.
The key point is that probabilities do not have to be interpreted as estimates of relative frequencies, but can carry the more general meaning of (possibly subjective) degrees of belief.
Consequently, we can attach a probability distribution to almost anything: not just features and targets, but also model parameters and even models.
COMP9417 ML & DM Classification (2) Term 2, 2020 44 / 104

Minimum Description Length Principle
Minimum Description Length Principle
Once again, the MAP hypothesis
h_MAP = argmax_{h ∈ H} P(D|h) P(h)

which is equivalent to

h_MAP = argmax_{h ∈ H} [ log2 P(D|h) + log2 P(h) ]

or

h_MAP = argmin_{h ∈ H} [ −log2 P(D|h) − log2 P(h) ]
COMP9417 ML & DM
Classification (2) Term 2, 2020 45 / 104

Minimum Description Length Principle
Minimum Description Length Principle
Interestingly, this is an expression about a quantity of bits:

h_MAP = argmin_{h ∈ H} [ −log2 P(D|h) − log2 P(h) ]     (1)

From information theory:

The optimal (shortest expected coding length) code for an event with probability p is −log2 p bits.
COMP9417 ML & DM Classification (2) Term 2, 2020
46 / 104

Minimum Description Length Principle
Minimum Description Length Principle
So interpret (1):
• − log2 P (h) is length of h under optimal code
• − log2 P (D|h) is length of D given h under optimal code
Note well: assumes optimal encodings, when the priors and likelihoods are known. In practice, this is difficult, and makes a difference.
COMP9417 ML & DM Classification (2) Term 2, 2020 47 / 104

Minimum Description Length Principle
Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes

h_MDL = argmin_{h ∈ H} [ L_{C1}(h) + L_{C2}(D|h) ]

where L_C(x) is the description length of x under optimal encoding C
COMP9417 ML & DM Classification (2) Term 2, 2020 48 / 104
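A small check that choosing the hypothesis with the shortest total code length in bits coincides with the MAP choice; the priors and likelihoods below are illustrative, and the code lengths are the −log2 p values from the previous slide.

import math

prior      = {"h1": 0.50, "h2": 0.30, "h3": 0.20}   # illustrative P(h)
likelihood = {"h1": 0.02, "h2": 0.10, "h3": 0.08}   # illustrative P(D|h)

bits = lambda p: -math.log2(p)                      # optimal code length for probability p

def description_length(h):
    # L_C1(h) + L_C2(D|h) under optimal codes = -log2 P(h) - log2 P(D|h)
    return bits(prior[h]) + bits(likelihood[h])

h_mdl = min(prior, key=description_length)
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_mdl, h_map, h_mdl == h_map)                 # both pick h2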

Bayes Optimal Classifier
Most Probable Classification of New Instances
So far we’ve sought the most probable hypothesis given the data D (i.e., hMAP)
Given new instance x, what is its most probable classification?
• h_MAP(x) is not the most probable classification!
COMP9417 ML & DM Classification (2) Term 2, 2020 49 / 104

Bayes Optimal Classifier
Most Probable Classification of New Instances
Consider:
• Three possible hypotheses:
  P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x,
  h1(x) = +, h2(x) = −, h3(x) = −
• What’s most probable classification of x?
COMP9417 ML & DM Classification (2) Term 2, 2020 50 / 104

Bayes Optimal Classifier
Bayes Optimal Classifier
Bayes optimal classification:

argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

Example:

P(h1|D) = .4,   P(−|h1) = 0,   P(+|h1) = 1
P(h2|D) = .3,   P(−|h2) = 1,   P(+|h2) = 0
P(h3|D) = .3,   P(−|h3) = 1,   P(+|h3) = 0
COMP9417 ML & DM
Classification (2)
Term 2, 2020
51 / 104

Bayes Optimal Classifier
Bayes Optimal Classifier
therefore

Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6

and

argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = −
No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average
COMP9417 ML & DM Classification (2) Term 2, 2020 52 / 104
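A sketch of the Bayes optimal classification rule on the three-hypothesis example above; here P(v|h) is simply 1 if h predicts v for the new instance and 0 otherwise.

posterior  = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}      # h(x) for the new instance x

def bayes_optimal(values=("+", "-")):
    # For each class value v, sum P(v|h) P(h|D) over hypotheses, then take the argmax
    vote = {v: sum(p for h, p in posterior.items() if prediction[h] == v) for v in values}
    return max(vote, key=vote.get), vote

print(bayes_optimal())                              # ('-', {'+': 0.4, '-': 0.6})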

Bayes Optimal Classifier
Bayes Error
What is the best performance attainable by a (two-class) classifier ?

Define the probability of error for classifying some instance x by

P(error|x) = P(class1|x)   if we predict class2
           = P(class2|x)   if we predict class1

This gives

P(error) = Σ_x P(error|x) P(x)

So we can justify the use of the decision rule

if P(class1|x) > P(class2|x) then predict class1
else predict class2
On average, this decision rule minimises probability of classification error.
COMP9417 ML & DM Classification (2) Term 2, 2020 53 / 104
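A small sketch of this error calculation on the ‘Viagra’/‘lottery’ posteriors from earlier in the lecture; the instance distribution P(x) over the four feature combinations is an assumed, illustrative one.

# Bayes error: for each instance, the probability of the class we do not predict
p_spam_given_x = {(0, 0): 0.31, (0, 1): 0.65, (1, 0): 0.80, (1, 1): 0.40}
p_x            = {(0, 0): 0.70, (0, 1): 0.15, (1, 0): 0.10, (1, 1): 0.05}   # assumed P(x)

bayes_error = sum(min(p, 1 - p) * p_x[x] for x, p in p_spam_given_x.items())
print(bayes_error)   # no rule using only these two features can do better on average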

Naive Bayes
Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the most practical learning methods.
When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given classification
Successful applications:
• Classifying text documents
• Gaussian Naive Bayes for real-valued data
COMP9417 ML & DM Classification (2) Term 2, 2020 54 / 104

Naive Bayes
Naive Bayes Classifier
Assume target function f : X → V , where each instance x described by attributes ⟨a1, a2 . . . an⟩.
Most probable value of f(x) is:
v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2 ... a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2 ... a_n | v_j) P(v_j) / P(a_1, a_2 ... a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2 ... a_n | v_j) P(v_j)
COMP9417 ML & DM
Classification (2) Term 2, 2020 55 / 104

Naive Bayes
Naive Bayes Classifier
Naive Bayes assumption:

P(a_1, a_2 ... a_n | v_j) = Π_i P(a_i | v_j)

• Attributes are statistically independent (given the class value)
• which means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)

which gives the Naive Bayes classifier:

v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)

COMP9417 ML & DM Classification (2) Term 2, 2020 56 / 104
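A sketch of the Naive Bayes decision rule itself; the class prior and the conditional probability tables below are illustrative placeholders (in practice they are estimated from training-data counts, as on the following slides).

def naive_bayes_classify(x, class_prior, cond_prob):
    # x: dict attribute -> value; cond_prob[v][attr][value] approximates P(attr = value | v)
    scores = {}
    for v, p_v in class_prior.items():
        score = p_v
        for attr, value in x.items():
            score *= cond_prob[v][attr][value]
        scores[v] = score
    return max(scores, key=scores.get), scores      # argmax_v P(v) * prod_i P(a_i|v)

class_prior = {"spam": 0.4, "ham": 0.6}             # assumed class prior
cond_prob = {
    "spam": {"viagra": {0: 0.60, 1: 0.40}, "lottery": {0: 0.79, 1: 0.21}},
    "ham":  {"viagra": {0: 0.88, 1: 0.12}, "lottery": {0: 0.87, 1: 0.13}},
}
print(naive_bayes_classify({"viagra": 1, "lottery": 0}, class_prior, cond_prob))   # spam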

Naive Bayes
Binomial and Bernoulli distributions1
“Coin tossing”
Let X ∈ {0,…,n} be number of heads in a sequence of n tosses. If the probability of heads is θ, then X has a binomial distribution
Bin(k | n, θ) = C(n, k) θ^k (1 − θ)^(n−k)

where C(n, k) is the binomial coefficient "n choose k".
For a single coin toss, let X ∈ {0, 1}. If θ is the probability of heads, then X has a Bernoulli distribution
Ber(x | θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}
1“Machine Learning: A Probabilistic Perspective”, K. Murphy (2012)
COMP9417 ML & DM Classification (2) Term 2, 2020 57 / 104

Naive Bayes
Multinomial and Multinoulli distributions
“K-sided Dice throwing”
Let x = (x1,…,xK) be a random vector, where xj is the number of
times side j of the die is thrown.
If θ_j is the probability of side j, then x has a multinomial distribution

Mu(x | n, θ) = ( n! / (x_1! ... x_K!) ) Π_{j=1}^{K} θ_j^{x_j}

For a single die throw, x will be a vector of zeros except for a single ‘1’ at index j indicating that side j of the die is thrown. If θ_j is as above, then x has a multivariate Bernoulli (Multinoulli) distribution

Mu(x | 1, θ) = Π_{j=1}^{K} θ_j^{I(x_j = 1)}

COMP9417 ML & DM Classification (2) Term 2, 2020 58 / 104
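The distributions above as small Python functions; the test values are taken from the worked document-model examples later in the lecture, and the last function follows the multivariate Bernoulli document model used on those slides (absent words contribute 1 − θ).

from math import comb, factorial, prod

def binomial_pmf(k, n, theta):
    # Bin(k | n, theta): probability of k heads in n tosses
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def multinomial_pmf(x, theta):
    # Mu(x | n, theta) for a count vector x and probability vector theta
    n = sum(x)
    coefficient = factorial(n) // prod(factorial(xj) for xj in x)
    return coefficient * prod(t ** xj for xj, t in zip(x, theta))

def multivariate_bernoulli_pmf(x, theta):
    # Joint probability of a bit vector x under independent Bernoulli components
    return prod(t if xj == 1 else (1 - t) for xj, t in zip(x, theta))

print(binomial_pmf(3, 10, 0.5))                                   # 0.1171875
print(multinomial_pmf((3, 1, 0), (0.3, 0.5, 0.2)))                # 0.054, as in the later example
print(multivariate_bernoulli_pmf((1, 1, 0), (0.5, 0.67, 0.33)))   # about 0.22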

Naive Bayes
Categorical random variables
Categorical variables or features (also called discrete or nominal) are ubiquitous in machine learning.
• Perhaps the most common form of the Bernoulli distribution models whether or not a word occurs in a document. That is, for the i-th word in our vocabulary we have a random variable Xi governed by a Bernoulli distribution. The joint distribution over the bit vector
X = (X1, . . . , Xk) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for example, every word position in an e-mail corresponds to a categorical variable with k outcomes, where k is the size of the vocabulary. The multinomial distribution manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words in a document. This establishes an alternative way of modelling text documents that allows the number of occurrences of a word to influence the classification of a document.
COMP9417 ML & DM Classification (2) Term 2, 2020 59 / 104

Naive Bayes
Categorical random variables
Both these document models are in common use. Despite their differences, they both assume independence between word occurrences, generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of the multinomial distribution, which assumes that words at different word positions are drawn independently from the same categorical distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit vector are statistically independent, which allows us to compute the joint probability of a particular bit vector (x1, . . . , xk) as the product of the probabilities of each component P(Xi = xi).
• In practice, such word independence assumptions are often not true: if we know that an e-mail contains the word ‘Viagra’, we can be quite sure that it will also contain the word ‘pill’. Violated independence assumptions reduce the quality of probability estimates but may still
allow good classification performance.
COMP9417 ML & DM Classification (2) Term 2, 2020 60 / 104

Naive Bayes
Probabilistic decision rules
We have chosen one of the possible distributions to model our data X as coming from either class.
• The more different P(X|Y = spam) and P(X|Y = ham) are, the more useful the features X are for classification.
• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam) and P (X = x|Y = ham), and apply one of several possible decision rules:
maximum likelihood (ML) – predict argmax_y P(X = x|Y = y);
maximum a posteriori (MAP) – predict argmax_y P(X = x|Y = y) P(Y = y).
The relation between these two decision rules is that ML classification is equivalent to MAP classification with a uniform class distribution.
COMP9417 ML & DM Classification (2) Term 2, 2020 61 / 104

Naive Bayes
Probabilistic decision rules
We again use the example of Naive Bayes for text classification to illustrate, using both the multinomial and multivariate Bernoulli models.
COMP9417 ML & DM Classification (2) Term 2, 2020 62 / 104

Naive Bayes
Prediction using a naive Bayes model
Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our e-mails, with parameters
θ⊕ = (0.5, 0.67, 0.33)
θ⊖ = (0.67, 0.33, 0.33)
This means, for example, that the presence of b is twice as likely in spam (+), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain likelihoods
P(x|⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
P(x|⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148
The ML classification of x is thus spam.
COMP9417 ML & DM Classification (2) Term 2, 2020 63 / 104
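A sketch reproducing this calculation, plus the MAP variant with prior odds discussed on the next slides; the 33%/67% class prior is the one used there.

theta_spam = (0.5, 0.67, 0.33)                      # parameters from the slide above
theta_ham  = (0.67, 0.33, 0.33)
x = (1, 1, 0)                                       # contains a and b but not c

def bernoulli_likelihood(x, theta):
    p = 1.0
    for xj, t in zip(x, theta):
        p *= t if xj == 1 else (1 - t)
    return p

lik_spam = bernoulli_likelihood(x, theta_spam)      # about 0.22
lik_ham  = bernoulli_likelihood(x, theta_ham)       # about 0.15
print("ML prediction:", "spam" if lik_spam > lik_ham else "ham")     # spam

# MAP with 33% spam / 67% ham: posterior odds = likelihood ratio * prior odds
posterior_odds = (lik_spam / lik_ham) * (0.33 / 0.67)
print("MAP prediction:", "spam" if posterior_odds > 1 else "ham")    # ham (odds about 3/4)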

Naive Bayes
Prediction using a naive Bayes model
In the case of two classes it is often convenient to work with likelihood ratios and odds.
• The likelihood ratio can be calculated as
P(x|⊕) / P(x|⊖) = (0.5 / 0.67) · (0.67 / 0.33) · ((1 − 0.33) / (1 − 0.33)) = 3/2 > 1
• This means that the MAP classification of x is also spam if the prior
odds are more than 2/3, but ham if they are less than that.
• For example, with 33% spam and 67% ham the prior odds are P(⊕)/P(⊖) = 0.33/0.67 = 1/2, resulting in a posterior odds of
P(⊕|x) = P(x|⊕)P(⊕) =3/2·1/2=3/4<1 P (⊖|x) P (x|⊖) P (⊖) In this case the likelihood ratio for x is not strong enough to push the decision away from the prior. COMP9417 ML & DM Classification (2) Term 2, 2020 64 / 104 Naive Bayes Prediction using a naive Bayes model Alternatively, we can employ a multinomial model. The parameters of a multinomial establish a distribution over the words in the vocabulary, say θ⊕ = (0.3, 0.5, 0.2) θ⊖ = (0.6, 0.2, 0.2) The e-mail to be classified contains three occurrences of word a, one single occurrence of word b and no occurrences of word c, and hence is described by the count vector x = (3, 1, 0). The total number of vocabulary word occurrences is n = 4. We obtain likelihoods 0.33 0.51 0.20 P(x|⊕)=4!3! 1! 0!=0.054 0.63 0.21 0.20 P(x|⊖) = 4! 3! 1! 0! = 0.1728 The likelihood ratio is 􏰈 0.3 􏰉3 􏰈 0.5 􏰉1 􏰈 0.2 􏰉0 = 5/16. The ML classification 0.6 0.2 0.2 of x is thus ham, the opposite of the multivariate Bernoulli model. This is mainly because of the three occurrences of word a, which provide strong evidence for ham. COMP9417 ML & DM Classification (2) Term 2, 2020 65 / 104 Naive Bayes Training data for naive Bayes A small e-mail data set described by count vectors. E-mail #a #b #c Class e1 0 3 0 + e2 0 3 3 + e3 3 0 0 + e4 2 3 0 + e5 4 3 0 − e6 4 0 3 − e7 3 0 0 − e8 0 0 0 − COMP9417 ML & DM Classification (2) Term 2, 2020 66 / 104 Naive Bayes Training data for naive Bayes The same data set described by bit vectors. E-mail a? b? c? Class e1 0 1 0 + e2 0 1 1 + e3 1 0 0 + e4 1 1 0 + e5 1 1 0 − e6 1 0 1 − e7 1 0 0 − e8 0 0 0 − COMP9417 ML & DM Classification (2) Term 2, 2020 67 / 104 Naive Bayes Training a naive Bayes model Consider the following e-mails consisting of five words a, b, c, d, e: e1: b d e b b d e e2: b c e b b d d e c c e3: a d a d e a e e e4: b a d b e d a b e5: a b a b a b a e d e6: a c a c a c a e d e7: e a e d a e a e8: d e d e d We are told that the e-mails on the left are spam and those on the right are ham, and so we use them as a small training set to train our Bayesian classifier. • First, we decide that d and e are so-called stop words that are too common to convey class information. • The remaining words, a, b and c, constitute our vocabulary. COMP9417 ML & DM Classification (2) Term 2, 2020 68 / 104 Naive Bayes Training a naive Bayes model For the multinomial model, we represent each e-mail as a count vector, as before. • In order to estimate the parameters of the multinomial, we sum up the count vectors for each class, which gives (5, 9, 3) for spam and (11, 3, 3) for ham. • To smooth these probability estimates we add one pseudo-count for each vocabulary word, which brings the total number of occurrences of vocabulary words to 20 for each class. • The estimated parameter vectors are thus θˆ⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and θˆ⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham. COMP9417 ML & DM Classification (2) Term 2, 2020 69 / 104 Naive Bayes Training a naive Bayes model In the multivariate Bernoulli model e-mails are represented by bit vectors, as before. • Adding the bit vectors for each class results in (2, 3, 1) for spam and (3, 1, 1) for ham. • Each count is to be divided by the number of documents in a class, in order to get an estimate of the probability of a document containing a particular vocabulary word. • Probability smoothing now means adding two pseudo-documents, one containing each word and one containing none of them. 
• This results in the estimated parameter vectors θˆ⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and θˆ⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham. COMP9417 ML & DM Classification (2) Term 2, 2020 70 / 104 Naive Bayes Likelihood ratio As a matter of fact, statisticians work very often with different conditional probabilities, given by the likelihood function P(X|Y ). • I like to think of these as thought experiments: if somebody were to send me a spam e-mail, how likely would it be that it contains exactly the words of the e-mail I’m looking at? And how likely if it were a ham e-mail instead? • What really matters is not the magnitude of these likelihoods, but their ratio: how much more likely is it to observe this combination of words in a spam e-mail than it is in a non-spam e-mail. • For instance, suppose that for a particular e-mail described by X we have P(X|Y = spam) = 3.5 · 10−5 and P(X|Y = ham) = 7.4 · 10−6, then observing X in a spam e-mail is nearly five times more likely than it is in a ham e-mail. • This suggests the following decision rule: predict spam if the likelihood ratio is larger than 1 and ham otherwise. COMP9417 ML & DM Classification (2) Term 2, 2020 71 / 104 Naive Bayes When to use likelihoods Use likelihoods if you want to ignore the prior distribution or assume it uniform, and posterior probabilities otherwise. COMP9417 ML & DM Classification (2) Term 2, 2020 72 / 104 Naive Bayes Posterior odds P (Y = spam|Viagra = 0, lottery = 0) = 0.31 = 0.45 P (Y = ham|Viagra = 0, lottery = 0) 0.69 P (Y = spam|Viagra = 1, lottery = 1) = 0.40 = 0.67 P (Y = ham|Viagra = 1, lottery = 1) 0.60 P (Y = spam|Viagra = 0, lottery = 1) = 0.65 = 1.9 P (Y = ham|Viagra = 0, lottery = 1) 0.35 P (Y = spam|Viagra = 1, lottery = 0) = 0.80 P (Y = ham|Viagra = 1, lottery = 0) 0.20 = 4.0 Using a MAP decision rule we predict ham in the top two cases and spam in the bottom two. Given that the full posterior distribution is all there is to know about the domain in a statistical sense, these predictions are the best we can do: they are Bayes-optimal. COMP9417 ML & DM Classification (2) Term 2, 2020 73 / 104 Naive Bayes Example marginal likelihoods Y spam ham Y spam ham P(Viagra = 1|Y ) 0.40 0.12 P(lottery = 1|Y ) 0.21 0.13 P(Viagra = 0|Y ) 0.60 0.88 P(lottery = 0|Y ) 0.79 0.87 COMP9417 ML & DM Classification (2) Term 2, 2020 74 / 104 Naive Bayes Using marginal likelihoods Using the marginal likelihoods from before, we can approximate the likelihood ratios (the previously calculated odds from the full posterior distribution are shown in brackets): P (Viagra = 0|Y = spam) P (lottery = 0|Y = spam) = P (Viagra = 0|Y = ham) P (lottery = 0|Y = ham) P (Viagra = 0|Y = spam) P (lottery = 1|Y = spam) = P (Viagra = 0|Y = ham) P (lottery = 1|Y = ham) P (Viagra = 1|Y = spam) P (lottery = 0|Y = spam) = P (Viagra = 1|Y = ham) P (lottery = 0|Y = ham) P (Viagra = 1|Y = spam) P (lottery = 1|Y = spam) = P (Viagra = 1|Y = ham) P (lottery = 1|Y = ham) 0.60 0.79 0.88 0.87 0.60 0.21 0.88 0.13 0.40 0.79 0.12 0.87 0.40 0.21 0.12 0.13 = 0.62 = 1.1 = 3.0 = 5.4 (0.45) (1.9) (4.0) (0.67) We see that, using a maximum likelihood decision rule, our very simple model arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth (‘Viagra’ and ‘lottery’ both present), where the marginal likelihoods are actually very misleading. 
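A sketch of this comparison, computing the naive Bayes likelihood-ratio approximation from the marginal likelihood table and printing the full-posterior odds alongside; all numbers are taken from the slides above.

p_viagra  = {"spam": 0.40, "ham": 0.12}             # P(Viagra = 1 | Y)
p_lottery = {"spam": 0.21, "ham": 0.13}             # P(lottery = 1 | Y)
full_posterior_odds = {(0, 0): 0.45, (0, 1): 1.9, (1, 0): 4.0, (1, 1): 0.67}

for v, l in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    num = (p_viagra["spam"] if v else 1 - p_viagra["spam"]) * \
          (p_lottery["spam"] if l else 1 - p_lottery["spam"])
    den = (p_viagra["ham"] if v else 1 - p_viagra["ham"]) * \
          (p_lottery["ham"] if l else 1 - p_lottery["ham"])
    # Prints 0.62, 1.1, 3.0 and 5.4 against 0.45, 1.9, 4.0 and 0.67 respectively
    print((v, l), round(num / den, 2), "vs", full_posterior_odds[(v, l)])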
COMP9417 ML & DM Classification (2) Term 2, 2020 75 / 104 Naive Bayes Naive Bayes Algorithm Naive Bayes Learn(examples) For each target value vj Pˆ(vj ) ← estimate P (vj ) For each attribute value ai of each attribute a Pˆ(ai|vj ) ← estimate P (ai|vj ) Classify New Instance(x) vNB = argmaxPˆ(vj) 􏰊 Pˆ(ai|vj) vj ∈V ai ∈x COMP9417 ML & DM Classification (2) Term 2, 2020 76 / 104 Naive Bayes Naive Bayes Example: PlayTennis outlook temperature sunny hot sunny hot overcast hot rainy mild rainy cool rainy cool overcast cool sunny mild sunny cool rainy mild sunny mild overcast mild overcast hot rainy mild humidity high high high high normal normal normal high normal normal normal high normal high windy play false no true no false yes false yes false yes true no true yes false no false yes false yes true yes true yes false yes true no COMP9417 ML & DM Classification (2) Term 2, 2020 77 / 104 Naive Bayes Naive Bayes Example: PlayTennis What are the required probabilities to predict PlayTennis ? Outlook Temperature Humidity Windy Yes No Yes No Yes No Yes No Sunny Overcast Rainy 2 3 4 0 3 2 Hot 2 2 Mild 4 2 Cool 3 1 High 3 4 Normal 6 1 False 6 2 True 3 3 Sunny Overcast Rainy 2/9 3/5 4/9 0/5 3/9 2/5 Hot 2/9 2/5 Mild 4/9 2/5 Cool 3/9 1/5 High 3/9 4/5 Normal 6/9 1/5 False 6/9 2/5 True 3/9 3/5 Play Yes No 95 9/14 5/14 COMP9417 ML & DM Classification (2) Term 2, 2020 78 / 104 Naive Bayes Naive Bayes Example: PlayTennis Say we have the new instance: ⟨Outlk = sun, T emp = cool, Humid = high, W ind = true⟩ We want to compute: vNB = argmax P(vj)􏰊P(ai|vj) vj ∈{“yes”,“no”} i COMP9417 ML & DM Classification (2) Term 2, 2020 79 / 104 Naive Bayes Naive Bayes Example: PlayTennis So we first calculate the likelihood of the two classes, “yes” and “no” for “yes” = P (y) × P (sun|y) × P (cool|y) × P (high|y) × P (true|y) 0.0053 = 9 ×2×3×3×3 14 9 9 9 9 for “no” = P (n) × P (sun|n) × P (cool|n) × P (high|n) × P (true|n) 0.0206 = 5 ×3×1×4×3 14 5 5 5 5 COMP9417 ML & DM Classification (2) Term 2, 2020 80 / 104 Naive Bayes Naive Bayes Example: PlayTennis Then convert to a probability by normalisation P (“yes”) = = 0.205 0.0053 (0.0053 + 0.0206) P (“no”) = = 0.795 0.0206 (0.0053 + 0.0206) The Naive Bayes classification is “no”. COMP9417 ML & DM Classification (2) Term 2, 2020 81 / 104 Naive Bayes Naive Bayes: Subtleties Conditional independence assumption is often violated P(a1,a2 ...an|vj) = 􏰊P(ai|vj) i • ...but it works surprisingly well anyway. Note don’t need estimated posteriors Pˆ(vj|x) to be correct; need only that argmaxPˆ(vj)􏰊Pˆ(ai|vj) = argmaxP(vj)P(a1 ...,an|vj) vj∈V i vj∈V i.e. maximum probability is assigned to correct class • see [Domingos & Pazzani, 1996] for analysis • Naive Bayes posteriors often unrealistically close to 1 or 0 • adding too many redundant attributes will cause problems (e.g. identical attributes) COMP9417 ML & DM Classification (2) Term 2, 2020 82 / 104 Naive Bayes Naive Bayes: “zero-frequency” problem What if none of the training instances with target value vj have attribute value ai? Then Pˆ(ai|vj) = 0, and... 
Pˆ ( v j ) 􏰊 Pˆ ( a i | v j ) = 0 i Pseudo-counts add 1 to each count (a version of the Laplace Estimator) COMP9417 ML & DM Classification (2) Term 2, 2020 83 / 104 Naive Bayes Naive Bayes: “zero-frequency” problem • In some cases adding a constant different from 1 might be more appropriate • Example: attribute outlook for class yes Sunny Overcast Rainy 2+μ 4+μ 3+μ 243 9+μ 9+μ 9+μ • Weights don’t need to be equal (if they sum to 1) – a form of prior Sunny Overcast Rainy 2+μp1 4+μp2 9+μ 9+μ 9+μ 3+μp3 COMP9417 ML & DM Classification (2) Term 2, 2020 84 / 104 Naive Bayes Naive Bayes: numeric attributes • Usual assumption: attributes have a normal or Gaussian probability distribution (given the class) • The probability density function for the normal distribution is defined by two parameters: The sample mean μ: 1 􏰂n μ=n xi i=1 The standard deviation σ: 1 􏰂n σ = n − 1 (xi − μ)2 i=1 COMP9417 ML & DM Classification (2) Term 2, 2020 85 / 104 Naive Bayes Naive Bayes: numeric attributes Then we have the density function f(x): 1 − (x−μ)2 e 2σ2 Example: continuous attribute temperature with mean = 73 and standard deviation = 6.2. Density value 1 − (66−73)2 f(temperature = 66|“yes”) = √ Missing values during training are not included in calculation of mean and standard deviation. f(x) = √ 2πσ 2π6.2 e 2×6.22 = 0.0340 COMP9417 ML & DM Classification (2) Term 2, 2020 86 / 104 Naive Bayes Naive Bayes: numeric attributes Note: the normal distribution is based on the simple exponential function f(x) = e−|x|m As the power m in the exponent increases, the function approaches a step function. Where m = 2 f(x) = e−|x|2 and this is the basis of the normal distribution – the various constants are the result of scaling so that the integral (the area under the curve from −∞ to +∞) is equal to 1. from “Statistical Computing” by Michael J. Crawley (2002) Wiley. COMP9417 ML & DM Classification (2) Term 2, 2020 87 / 104 Naive Bayes Learning and using a naive Bayes model for classification Example application: Learning to Classify Text In machine learning, the classic example of applications of Naive Bayes is learning to classify text documents. Here is a simplified version in the multinomial document model. COMP9417 ML & DM Classification (2) Term 2, 2020 88 / 104 Naive Bayes Learning and using a naive Bayes model for classification Learning to Classify Text Why? • Learn which news articles are of interest • Learn to classify web pages by topic Naive Bayes is among most effective algorithms What attributes shall we use to represent text documents?? COMP9417 ML & DM Classification (2) Term 2, 2020 89 / 104 Naive Bayes Learning and using a naive Bayes model for classification Learning to Classify Text Target concept Interesting? 
: Document → {+, −} 1 2 Represent each document by vector of words • one attribute per word position in document Learning: Use training examples to estimate • P(+) • P(−) • P(doc|+) • P(doc|−) COMP9417 ML & DM Classification (2) Term 2, 2020 90 / 104 Naive Bayes Learning and using a naive Bayes model for classification Learning to Classify Text Naive Bayes conditional independence assumption length(doc) P(doc|vj) = 􏰊 P(ai = wk|vj) i=1 where P(ai = wk|vj) is probability that word in position i is wk, given vj one more assumption: P(ai = wk|vj) = P(am = wk|vj),∀i,m “bag of words” COMP9417 ML & DM Classification (2) Term 2, 2020 91 / 104 Naive Bayes Learning and using a naive Bayes model for classification Application: 20 Newsgroups Given: 1000 training documents from each group Learning task: classify each new document by newsgroup it came from comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x alt.atheism soc.religion.christian talk.religion.misc talk.politics.mideast talk.politics.misc talk.politics.guns Naive Bayes: 89% classification accuracy misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.space sci.crypt sci.electronics sci.med COMP9417 ML & DM Classification (2) Term 2, 2020 92 / 104 Naive Bayes Learning and using a naive Bayes model for classification Article from rec.sport.hockey Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu From: xxx@yyy.zzz.edu (John Doe) Subject: Re: This year’s biggest and worst (opinion)... Date: 5 Apr 93 09:53:39 GMT I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he’s clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he’s only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided ... COMP9417 ML & DM Classification (2) Term 2, 2020 93 / 104 Naive Bayes Learning and using a naive Bayes model for classification Learning Curve for 20 Newsgroups 100 90 80 70 60 50 40 30 20 10 0 20News Bayes TFIDF P RT FID F 100 1000 10000 Accuracy vs. Training set size (1/3 withheld for test) COMP9417 ML & DM Classification (2) Term 2, 2020 94 / 104 Naive Bayes Learning and using a naive Bayes model for classification Naive Bayes: “zero-frequency” problem This generalisation is a Bayesian estimate for Pˆ(ai|vj) Pˆ(ai|vj) ← nc + mp n+m where • n is number of training examples for which v = vj, • nc numberofexamplesforwhichv=vj anda=ai • p is prior estimate for Pˆ(ai|vj) • m is weight given to prior (i.e. number of “virtual” examples) This is called the m-estimate of probability ( a form of “prior”). COMP9417 ML & DM Classification (2) Term 2, 2020 95 / 104 Naive Bayes Learning and using a naive Bayes model for classification Naive Bayes: missing values What about missing values ? A very basic approach is: • Training: instance is not included in frequency count for attribute value-class combination • Classification: attribute will be omitted from calculation Can we do better ? 
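A sketch showing add-one (Laplace) smoothing in action, the simplest instance of the pseudo-count idea above, re-deriving the multinomial parameters estimated earlier from the toy e-mail count vectors; the uniform class prior used for the final prediction is an assumption.

spam_counts = [(0, 3, 0), (0, 3, 3), (3, 0, 0), (2, 3, 0)]     # e1-e4, word counts for a, b, c
ham_counts  = [(4, 3, 0), (4, 0, 3), (3, 0, 0), (0, 0, 0)]     # e5-e8

def estimate_theta(docs, vocab_size=3, pseudo=1):
    # Add `pseudo` to each word count so no word ends up with zero probability
    totals = [sum(doc[j] for doc in docs) + pseudo for j in range(vocab_size)]
    z = sum(totals)
    return [t / z for t in totals]

theta_spam = estimate_theta(spam_counts)            # [0.3, 0.5, 0.2], as on the earlier slide
theta_ham  = estimate_theta(ham_counts)             # [0.6, 0.2, 0.2]
print(theta_spam, theta_ham)

# Classify the count vector (3, 1, 0) with a uniform class prior; the multinomial
# coefficient is the same for both classes, so the word probabilities decide.
x = (3, 1, 0)
score = lambda theta: theta[0] ** x[0] * theta[1] ** x[1] * theta[2] ** x[2]
print("spam" if score(theta_spam) > score(theta_ham) else "ham")   # ham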
COMP9417 ML & DM Classification (2) Term 2, 2020 96 / 104 Naive Bayes Learning and using a naive Bayes model for classification Missing values Suppose we skimmed an e-mail and noticed that it contains the word ‘lottery’ but we haven’t looked closely enough to determine whether it uses the word ‘Viagra’. This means that we don’t know whether to use the second or the fourth row in to make a prediction. This is a problem, as we would predict spam if the e-mail contained the word ‘Viagra’ (second row) and ham if it didn’t (fourth row). The solution is to average these two rows, using the probability of ‘Viagra’ occurring in any e-mail (spam or not): P (Y |lottery) =P (Y |Viagra = 0, lottery)P (Viagra = 0) + P (Y |Viagra = 1, lottery)P (Viagra = 1) COMP9417 ML & DM Classification (2) Term 2, 2020 97 / 104 Naive Bayes Learning and using a naive Bayes model for classification Missing values For instance, suppose for the sake of argument that one in ten e-mails contain the word ‘Viagra’, then P (Viagra = 1) = 0.10 and P (Viagra = 0) = 0.90. Using the above formula, we obtain P(Y = spam|lottery = 1) = 0.65 · 0.90 + 0.40 · 0.10 = 0.625 and P (Y = ham|lottery = 1) = 0.35 · 0.90 + 0.60 · 0.10 = 0.375. Because the occurrence of ‘Viagra’ in any e-mail is relatively rare, the resulting distribution deviates only a little from the second row in the data. COMP9417 ML & DM Classification (2) Term 2, 2020 98 / 104 Linear discriminants Linear discriminants The “classic” classification learning algorithm is a linear discriminant — many forms of linear discriminant from statistics and machine learning, e.g., • Fisher’s Linear Discriminant Analysis • basic linear classifier (last lecture) is a simplified version of this • Logistic Regression • a probabilistic linear classifier (next slides) • Perceptron • a linear threshold classifier • an early version of an artificial “neuron” • still a useful method, and source of ideas (later lecture) • Winnow • like Perceptron, another linear threshold classifier • uses a different learning method, with interesting properties • comes from Learning Theory (later lecture) COMP9417 ML & DM Classification (2) Term 2, 2020 99 / 104 Linear discriminants Logistic Regression Linear Regression for classification Question: can we use Linear Regression for classification ? Answer: yes, e.g., Training: Prediction: train a separate linear regression model for each class • sety=1ifexampleinclass • set y = 0 otherwise for each example • run example through all regression models • predict the class with the largest output value for y Called multi-response linear regression.2 Note: does not obey linear regression assumptions, may work in practice. 2“Data Mining” (4th edn.) by Witten et al. (2017). COMP9417 ML & DM Classification (2) Term 2, 2020 100 / 104 Linear discriminants Logistic Regression Logistic regression COMP9417 ML & DM Classification (2) Term 2, 2020 101 / 104 Linear discriminants Logistic Regression Logistic regression In the case of a two-class classification problem, if we model the probability P (Y = 1) of an instance x being a positive example like this: P(Y =1|x) = 1 1 + e−wT x then this probability vs. the alternative (1 − P (Y = 1)) can be written like this: lnP(Y=1|x) =wTx 1−P(Y =1|x) The quantity on the l.h.s. is called the logit and we are defining a linear model for the logit. 
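A sketch of two-class logistic regression fitted by batch gradient ascent on the conditional log-likelihood described on the following slide; the toy one-dimensional data, learning rate and number of iterations are illustrative.

import math, random

random.seed(1)
# Toy 1-D data with an intercept term: x = (1, feature), y in {0, 1}
data = [((1.0, random.gauss(mu, 1.0)), y)
        for y, mu in ((0, -1.0), (1, 1.0)) for _ in range(50)]

w = [0.0, 0.0]
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
predict = lambda x: sigmoid(sum(wi * xi for wi, xi in zip(w, x)))   # P(Y = 1 | x)

eta = 0.1
for _ in range(200):
    # Gradient of the conditional log-likelihood with respect to w is sum_i (y_i - p_i) x_i
    grad = [0.0, 0.0]
    for x, y in data:
        err = y - predict(x)
        for j in range(len(w)):
            grad[j] += err * x[j]
    w = [wj + eta * gj / len(data) for wj, gj in zip(w, grad)]      # ascent step

print(w, predict((1.0, 2.0)))      # P(Y = 1 | x = 2) should come out close to 1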
COMP9417 ML & DM Classification (2) Term 2, 2020 102 / 104 Linear discriminants Logistic Regression Logistic regression Unlike linear regression, no analytical maximum likelihood (ML) solution to find weights w. An iterative gradient ascent method can be used to maximize log likelihood. The (conditional) log likelihood is: N 􏰂 y(i) log P (1|x(i)) + (1 − y(i)) log(1 − P (1|x(i))) i=1 Over a set of N examples, choose w to maximise this expression, noting that y(i) is always either 0 or 1. Generalises to multiple class versions (Y can have more than two values). COMP9417 ML & DM Classification (2) Term 2, 2020 103 / 104 Summary Summary • We described the classification problem in machine learning • We also outlined the issue of Inductive Bias • Two major frameworks for classification were covered Distance-based. The key ideas are geometric. We discussed distance-based classification (Nearest Neighbour) Probabilistic. The key ideas are (mostly) Bayesian. We discussed generative (Naive Bayes) and discriminative (Logistic Regression) models • So we have established a basis for learning classifiers • Later we will see how to extend by building on these ideas COMP9417 ML & DM Classification (2) Term 2, 2020 104 / 104