Classification (2)
COMP9417 Machine Learning and Data Mining
Term 2, 2020
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy, MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
Aims
This lecture will continue your exposure to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• explain the concept of inductive bias in machine learning
• outline Bayes Theorem as applied in machine learning
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
• outline the logistic regression classification algorithm
Introduction
What do we understand about the problem of learning classifiers?
. . . how can we know when classifier learning succeeds?
and . . . can we use this to build practical algorithms?
Inductive Bias
Step back
Inductive Bias
All models are wrong, but some models are useful.
Box & Draper (1987)
Inductive Bias
Confusingly, “inductive bias” is NOT the same “bias” as in the “bias-variance” decomposition.
“Inductive bias” is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem.
Essentially, a good inductive bias means that the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Success in machine learning requires understanding the inductive bias of algorithms and models, and choosing them appropriately for the task.
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression?
• Nearest Neighbour?
Inductive Bias
Unfortunately, for most machine learning algorithms it is not always easy to know what their inductive bias is.
For example, what is the inductive bias of:
• Linear Regression?
  • target function has the form y = ax + b
  • approximate by fitting using MSE
• Nearest Neighbour?
  • target function is a complex non-linear function of the data
  • predict using nearest neighbour by Euclidean distance in feature space
Inductive Bias
What we would really like:
• a framework for machine learning algorithms
• with a way of representing the inductive bias
• ideally, should be a declarative specification
• also should quantify uncertainty in the inductive bias
Why model uncertainty ?
A probabilistic approach
Uncertainty
As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality.
–Albert Einstein
Probabilistic models
A simple probabilistic model
‘Viagra’ and ‘lottery’ are two Boolean features; Y is the class variable,
with values ‘spam’ and ‘ham’. In each row the most likely class is
indicated in bold.
Viagra  lottery  P(Y = spam|Viagra, lottery)  P(Y = ham|Viagra, lottery)
  0        0               0.31                        0.69
  0        1               0.65                        0.35
  1        0               0.80                        0.20
  1        1               0.40                        0.60
Decision rule
Assuming that X and Y are the only variables we know and care about, the posterior distribution P(Y |X) helps us to answer many questions of interest.
• For instance, to classify a new e-mail we determine whether the words ‘Viagra’ and ‘lottery’ occur in it, look up the corresponding probability P (Y = spam|Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham otherwise.
• Such a recipe to predict a value of Y on the basis of the values of X and the posterior distribution P(Y |X) is called a decision rule.
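As a concrete illustration of this decision rule (not from the original slides), here is a minimal Python sketch that looks up the posterior from the table above and predicts the most likely class:

    # Posterior P(Y = spam | Viagra, lottery) from the table above; P(ham | ...) = 1 - P(spam | ...).
    posterior_spam = {(0, 0): 0.31, (0, 1): 0.65, (1, 0): 0.80, (1, 1): 0.40}

    def classify(viagra, lottery):
        """MAP decision rule: predict the class with the larger posterior probability."""
        p_spam = posterior_spam[(viagra, lottery)]
        return "spam" if p_spam > 0.5 else "ham"

    print(classify(1, 0))   # 'spam' (posterior 0.80)
    print(classify(1, 1))   # 'ham'  (posterior 0.40)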
Bayesian Machine Learning
Why Bayes ?
Two Roles for Bayesian Methods
Provides practical learning algorithms:
• Naive Bayes classifier learning
• Bayesian network learning, etc.
• Combines prior knowledge (prior probabilities) with observed data
• How to get prior probabilities?
Provides useful conceptual framework:
• Provides a “gold standard” for evaluating other learning algorithms
• Gives some additional insight into “Occam’s razor”
Basic Formulas for Probabilities
Product Rule: probability P (A ∧ B) of conjunction of two events A and B: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Sum Rule: probability of disjunction of two events A and B: P (A ∨ B) = P (A) + P (B) − P (A ∧ B)
Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then:

    P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)
Basic Formulas for Probabilities
Also worth remembering:
• Conditional Probability: probability of A given B:
    P(A|B) = P(A ∧ B) / P(B)
• Rearrange sum rule to get:
P (A ∧ B) = P (A) + P (B) − P (A ∨ B)
Exercise: (if you haven’t done it before) derive Bayes Theorem.
Bayes Theorem
    P(h|D) = P(D|h) P(h) / P(D)

where
    P(h)   = prior probability of hypothesis h
    P(D)   = prior probability of training data D
    P(h|D) = probability of h given D
    P(D|h) = probability of D given h
Choosing Hypotheses
    P(h|D) = P(D|h) P(h) / P(D)

Generally, we want the most probable hypothesis given the training data.
Maximum a posteriori hypothesis h_MAP:

    h_MAP = argmax_{h∈H} P(h|D)
          = argmax_{h∈H} P(D|h) P(h) / P(D)
          = argmax_{h∈H} P(D|h) P(h)
Choosing Hypotheses
If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify, and choose the maximum likelihood (ML) hypothesis

    h_ML = argmax_{h_i∈H} P(D|h_i)
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
    P(cancer) =                P(¬cancer) =
    P(⊕ | cancer) =            P(⊖ | cancer) =
    P(⊕ | ¬cancer) =           P(⊖ | ¬cancer) =
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
    P(cancer) = .008           P(¬cancer) = .992
    P(⊕ | cancer) = .98        P(⊖ | cancer) = .02
    P(⊕ | ¬cancer) = .03       P(⊖ | ¬cancer) = .97
Applying Bayes Theorem
Does patient have cancer or not?
We can find the maximum a posteriori (MAP) hypothesis
    P(⊕ | cancer) P(cancer)   = 0.98 × 0.008 = 0.00784
    P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.02976

Thus h_MAP = . . . .
Applying Bayes Theorem
Does patient have cancer or not?
We can find the maximum a posteriori (MAP) hypothesis
    P(⊕ | cancer) P(cancer)   = 0.98 × 0.008 = 0.00784
    P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.02976
Thus hMAP = ¬cancer.
Also note: posterior probability of hypothesis cancer higher than prior.
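The arithmetic on this slide is easy to check in a few lines of Python; a sketch using the slide's numbers:

    p_cancer, p_not_cancer = 0.008, 0.992
    p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

    # Unnormalised posteriors P(+|h) P(h)
    score_cancer = p_pos_given_cancer * p_cancer            # 0.00784
    score_not = p_pos_given_not * p_not_cancer              # 0.02976
    h_map = "cancer" if score_cancer > score_not else "no cancer"   # 'no cancer'

    # Normalising gives P(cancer|+) of about 0.21, far above the 0.008 prior
    p_cancer_given_pos = score_cancer / (score_cancer + score_not)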
Applying Bayes Theorem
How to get the posterior probability of a hypothesis h ? Divide by P (⊕), probability of data, to normalize result for h:
    P(h|D) = P(D|h) P(h) / Σ_{h_i∈H} P(D|h_i) P(h_i)
Denominator ensures we obtain posterior probabilities that sum to 1.
Sum for all possible numerator values, since hypotheses are mutually exclusive (e.g., patient either has cancer or does not).
The denominator is the marginal likelihood (the likelihood marginalised over all hypotheses in the hypothesis space), i.e., the prior probability of the data.
A Bayesian framework for Classification
First, define a prior on the hypothesis
Next, define the likelihood, i.e., the probability of the data given the hypothesis
Then learning is finding the required parameters by fitting hypotheses to data
Predict (classify) using, e.g., the MAP hypothesis
Brute Force MAP Hypothesis Learner
Idea: view learning as finding the most probable hypothesis
• For each hypothesis h in H, calculate the posterior probability

      P(h|D) = P(D|h) P(h) / P(D)

• Output the hypothesis h_MAP with the highest posterior probability

      h_MAP = argmax_{h∈H} P(h|D)
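A minimal Python sketch of this brute-force learner, assuming the hypothesis space is small enough to enumerate; the toy coin-bias example and function names are illustrative, not part of the lecture material:

    import math

    def brute_force_map(hypotheses, prior, likelihood, data):
        """Return the hypothesis maximising P(D|h) P(h); P(D) is constant over h."""
        return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

    # Toy example: hypotheses are possible coin biases, data is a list of 0/1 tosses.
    hypotheses = [0.1, 0.5, 0.9]
    prior = lambda h: 1.0 / len(hypotheses)                               # uniform prior
    likelihood = lambda D, h: math.prod(h if x else 1 - h for x in D)     # i.i.d. Bernoulli
    print(brute_force_map(hypotheses, prior, likelihood, [1, 1, 0, 1]))   # 0.9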
A Bayesian approach to learning algorithms
Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Canonical concept learning task:
• instance space X, hypothesis space H, training examples D
• consider a learning algorithm that outputs the most specific hypothesis from the version space VS_{H,D} (i.e., the set of all consistent or “zero-error” classification rules)
What would Bayes rule produce as the MAP hypothesis? Does this algorithm output a MAP hypothesis?
Relation to Concept Learning (i.e., classification)
Brute Force MAP Framework for Concept Learning:
Assume fixed set of instances ⟨x1, . . . , xm⟩
Assume D is the set of classifications D = ⟨c(x1), . . . , c(xm)⟩
Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H
Choose P(D|h):
• P(D|h) = 1 if h consistent with D
• P(D|h) = 0 otherwise
Relation to Concept Learning (i.e., classification)
Then:
    P(h|D) = 1/|VS_{H,D}|   if h is consistent with D
           = 0              otherwise
Relation to Concept Learning (i.e., classification)
Note that since likelihood is zero if h is inconsistent then the posterior is also zero. But how did we obtain the posterior for consistent h ?
    P(h|D) = (1 · 1/|H|) / P(D)
           = (1 · 1/|H|) / (|VS_{H,D}| / |H|)
           = 1 / |VS_{H,D}|
Relation to Concept Learning (i.e., classification)
How did we obtain P(D) ? From theorem of total probability:
    P(D) = Σ_{h_i∈H} P(D|h_i) P(h_i)
         = Σ_{h_i∈VS_{H,D}} 1 · 1/|H|  +  Σ_{h_i∉VS_{H,D}} 0 · 1/|H|
         = Σ_{h_i∈VS_{H,D}} 1 · 1/|H|
         = |VS_{H,D}| / |H|
Evolution of Posterior Probabilities
[Figure: evolution of posterior probabilities over the hypothesis space: (a) the prior P(h), (b) the posterior P(h|D1) after one example, (c) the posterior P(h|D1, D2) after two examples.]
Relation to Concept Learning (i.e., classification)
Every hypothesis consistent with D is a MAP hypothesis, if we assume
• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)
So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.
Numeric prediction in Bayesian terms
Learning A Real Valued Function
E.g., learning a linear target function f from noisy examples:

[Figure: noisy training examples (with errors e) scattered around the true linear target function f, together with the maximum likelihood fit h_ML; axes x and y.]
Learning A Real Valued Function
Consider any real-valued target function f
Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian (normal) distribution with mean = 0

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

    h_ML = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
Learning A Real Valued Function
How did we obtain this ?
    h_ML = argmax_{h∈H} p(D|h)
         = argmax_{h∈H} Π_{i=1}^{m} p(d_i|h)
         = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) exp( −(1/2) ((d_i − h(x_i))/σ)² )

Recall that we treat each probability p(D|h) as if h = f, i.e., we assume μ = f(x_i) = h(x_i), which is the key idea behind maximum likelihood!
Learning A Real Valued Function
Maximize natural log to give simpler expression:
    h_ML = argmax_{h∈H} Σ_{i=1}^{m} ( ln(1/√(2πσ²)) − (1/2) ((d_i − h(x_i))/σ)² )
         = argmax_{h∈H} Σ_{i=1}^{m} −(1/2) ((d_i − h(x_i))/σ)²
         = argmax_{h∈H} − Σ_{i=1}^{m} (d_i − h(x_i))²
Equivalently, we can minimize the positive version of the expression:
    h_ML = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
Discriminative and generative probabilistic models
• Discriminative models model the posterior probability distribution P (Y |X), where Y is the target variable and X are the features. That is, given X they return a probability distribution over Y .
• Generative models model the joint distribution P(Y, X) of the target Y and the feature vector X. Once we have access to this joint distribution we can derive any conditional or marginal distribution involving the same variables. In particular, since P(X) = Σ_y P(Y = y, X), it follows that the posterior distribution can be obtained as P(Y|X) = P(Y, X) / Σ_y P(Y = y, X).
• Alternatively, generative models can be described by the likelihood function P (X|Y ), since P (Y, X) = P (X|Y )P (Y ) and the target or prior distribution (usually abbreviated to ‘prior’) can be easily estimated or postulated.
• Such models are called ‘generative’ because we can sample from the joint distribution to obtain new data points together with their labels. Alternatively, we can use P (Y ) to sample a class and P (X |Y ) to sample an instance for that class.
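To make the word ‘generative’ concrete, here is a toy sampler (the numbers are invented for illustration, not taken from the slides) that first draws a class from P(Y) and then a single Boolean feature from P(X|Y):

    import random

    p_y = {"spam": 0.4, "ham": 0.6}                   # prior P(Y) (illustrative values)
    p_viagra_given_y = {"spam": 0.40, "ham": 0.12}    # P(Viagra = 1 | Y) (illustrative values)

    def sample_labelled_instance():
        y = random.choices(list(p_y), weights=list(p_y.values()))[0]   # sample class from P(Y)
        x = int(random.random() < p_viagra_given_y[y])                 # sample feature from P(X|Y)
        return x, y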
Assessing uncertainty in estimates
Suppose we want to estimate the probability θ that an arbitrary e-mail is spam, so that we can use the appropriate prior distribution.
• The natural thing to do is to inspect n e-mails, determine the number of spam e-mails d, and set θˆ = d/n; we don’t really need any complicated statistics to tell us that.
• However, while this is the most likely estimate of θ – the maximum a posteriori (MAP) estimate – this doesn’t mean that other values of θ are completely ruled out.
• We model this by a probability distribution over θ (a Beta distribution in this case) which is updated each time new information comes in. This is further illustrated in the figure for a distribution that is more and more skewed towards spam.
• For each curve, its bias towards spam is given by the area under the curve and to the right of θ = 1/2.
Assessing uncertainty in estimates
Each time we inspect an e-mail, we are reducing our uncertainty regarding the prior spam probability θ. After we inspect two e-mails and observe one spam, the possible θ values are characterised by a symmetric distribution around 1/2. If we inspect a third, fourth, . . . , tenth e-mail and each time (except the first one) it is spam, then this distribution narrows and shifts a little bit to the right each time. The distribution for n e-mails reaches its
maximum at θ̂_MAP = (n−1)/n (e.g., θ̂_MAP = 0.8 for n = 5).
The Bayesian perspective
Explicitly modelling the posterior distribution over the parameter θ has a number of advantages that are usually associated with the ‘Bayesian’ perspective:
• We can precisely characterise the uncertainty that remains about our estimate by quantifying the spread of the posterior distribution.
• We can obtain a generative model for the parameter by sampling from the posterior distribution, which contains much more information than a summary statistic such as the MAP estimate can convey – so, rather than using a single e-mail with θ = θMAP, our generative model can contain a number of e-mails with θ sampled from the posterior distribution.
The Bayesian perspective
• We can quantify the probability of statements such as ‘e-mails are biased towards ham’ (the tiny shaded area in the figure demonstrates that after observing one ham and nine spam e-mails this probability is very small, about 0.6%).
• We can use one of these distributions to encode our prior beliefs: e.g., if we believe that the proportions of spam and ham are typically 50–50, we can take the distribution for n = 2 (the lowest, symmetric one in the figure on the previous slide) as our prior.
The key point is that probabilities do not have to be interpreted as estimates of relative frequencies, but can carry the more general meaning of (possibly subjective) degrees of belief.
Consequently, we can attach a probability distribution to almost anything: not just features and targets, but also model parameters and even models.
Minimum Description Length Principle
Once again, the MAP hypothesis
    h_MAP = argmax_{h∈H} P(D|h) P(h)

which is equivalent to

    h_MAP = argmax_{h∈H} ( log₂ P(D|h) + log₂ P(h) )

or

    h_MAP = argmin_{h∈H} ( − log₂ P(D|h) − log₂ P(h) )
Minimum Description Length Principle
Interestingly, this is an expression about a quantity of bits:

    h_MAP = argmin_{h∈H} ( − log₂ P(D|h) − log₂ P(h) )        (1)

From information theory:
    The optimal (shortest expected coding length) code for an event with probability p is − log₂ p bits.
Minimum Description Length Principle
So interpret (1):
• − log2 P (h) is length of h under optimal code
• − log2 P (D|h) is length of D given h under optimal code
Note well: assumes optimal encodings, when the priors and likelihoods are known. In practice, this is difficult, and makes a difference.
Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes

    h_MDL = argmin_{h∈H} ( L_{C1}(h) + L_{C2}(D|h) )

where L_C(x) is the description length of x under optimal encoding C.
Bayes Optimal Classifier
Most Probable Classification of New Instances
So far we’ve sought the most probable hypothesis given the data D (i.e., hMAP)
Given new instance x, what is its most probable classification?
• h_MAP(x) is not the most probable classification!
Most Probable Classification of New Instances
Consider:
• Three possible hypotheses:
      P(h1|D) = .4,  P(h2|D) = .3,  P(h3|D) = .3
• Given new instance x:
      h1(x) = +,  h2(x) = −,  h3(x) = −
• What’s most probable classification of x?
Bayes Optimal Classifier
Bayes optimal classification:

    argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

Example:
    P(h1|D) = .4,   P(−|h1) = 0,   P(+|h1) = 1
    P(h2|D) = .3,   P(−|h2) = 1,   P(+|h2) = 0
    P(h3|D) = .3,   P(−|h3) = 1,   P(+|h3) = 0
Bayes Optimal Classifier
therefore

    Σ_{h_i∈H} P(+|h_i) P(h_i|D) = .4
    Σ_{h_i∈H} P(−|h_i) P(h_i|D) = .6

and

    argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = −
No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average
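A short Python sketch that reproduces this example; the dictionaries simply encode the posteriors P(h|D) and the (here deterministic) predictions h(x):

    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
    predictions = {"h1": "+", "h2": "-", "h3": "-"}    # h(x) for the new instance x

    def bayes_optimal(classes=("+", "-")):
        """Weight each hypothesis' vote by its posterior; return the class with the largest total."""
        vote = {v: sum(p for h, p in posteriors.items() if predictions[h] == v) for v in classes}
        return max(vote, key=vote.get)

    print(bayes_optimal())   # '-' (total weight 0.6 versus 0.4 for '+')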
Bayes Error
What is the best performance attainable by a (two-class) classifier?

Define the probability of error for classifying some instance x by

    P(error|x) = P(class1|x)   if we predict class2
               = P(class2|x)   if we predict class1

This gives

    P(error) = Σ_x P(error|x) P(x)

So we can justify the use of the decision rule

    if P(class1|x) > P(class2|x) then predict class1
    else predict class2
On average, this decision rule minimises probability of classification error.
Naive Bayes
Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the most practical learning methods.
When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given classification
Successful applications:
• Classifying text documents
• Gaussian Naive Bayes for real-valued data
Naive Bayes Classifier
Assume target function f : X → V, where each instance x is described by attributes ⟨a1, a2, …, an⟩.
Most probable value of f(x) is:
    v_MAP = argmax_{v_j∈V} P(v_j | a1, a2, …, an)
          = argmax_{v_j∈V} P(a1, a2, …, an | v_j) P(v_j) / P(a1, a2, …, an)
          = argmax_{v_j∈V} P(a1, a2, …, an | v_j) P(v_j)
Naive Bayes Classifier
Naive Bayes assumption:
    P(a1, a2, …, an | v_j) = Π_i P(a_i | v_j)

• Attributes are statistically independent (given the class value)
• which means knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)

which gives the Naive Bayes classifier:

    v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i | v_j)
Binomial and Bernoulli distributions [1]
“Coin tossing”
Let X ∈ {0,…,n} be number of heads in a sequence of n tosses. If the probability of heads is θ, then X has a binomial distribution
    Bin(k | n, θ) = (n choose k) θ^k (1 − θ)^(n−k)

For a single coin toss, let X ∈ {0, 1}. If θ is the probability of heads, then X has a Bernoulli distribution

    Ber(x | θ) = θ^I(x=1) (1 − θ)^I(x=0)
[1] “Machine Learning: A Probabilistic Perspective”, K. Murphy (2012)
Multinomial and Multinoulli distributions
“K-sided Dice throwing”
Let x = (x1,…,xK) be a random vector, where xj is the number of
times side j of the die is thrown.
If θj is the probability of side j, then x has a multinomial distribution
    Mu(x | n, θ) = ( n! / (x_1! ⋯ x_K!) ) Π_{j=1}^{K} θ_j^{x_j}

For a single dice throw, x will be a vector of zeros except for a single ‘1’ at index j indicating that side j of the die is thrown. If θ_j is as above, then x has a multivariate Bernoulli (Multinoulli) distribution

    Mu(x | 1, θ) = Π_{j=1}^{K} θ_j^{I(x_j = 1)}
Categorical random variables
Categorical variables or features (also called discrete or nominal) are ubiquitous in machine learning.
• Perhaps the most common form of the Bernoulli distribution models whether or not a word occurs in a document. That is, for the i-th word in our vocabulary we have a random variable Xi governed by a Bernoulli distribution. The joint distribution over the bit vector
X = (X1, . . . , Xk) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for example, every word position in an e-mail corresponds to a categorical variable with k outcomes, where k is the size of the vocabulary. The multinomial distribution manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words in a document. This establishes an alternative way of modelling text documents that allows the number of occurrences of a word to influence the classification of a document.
Categorical random variables
Both these document models are in common use. Despite their differences, they both assume independence between word occurrences, generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of the multinomial distribution, which assumes that words at different word positions are drawn independently from the same categorical distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit vector are statistically independent, which allows us to compute the joint probability of a particular bit vector (x1, . . . , xk) as the product of the probabilities of each component P(Xi = xi).
• In practice, such word independence assumptions are often not true: if we know that an e-mail contains the word ‘Viagra’, we can be quite sure that it will also contain the word ‘pill’. Violated independence assumptions reduce the quality of probability estimates but may still
allow good classification performance.
Probabilistic decision rules
We have chosen one of the possible distributions to model our data X as coming from either class.
• The more different P(X|Y = spam) and P(X|Y = ham) are, the more useful the features X are for classification.
• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam) and P (X = x|Y = ham), and apply one of several possible decision rules:
    maximum likelihood (ML) – predict argmax_y P(X = x|Y = y);
    maximum a posteriori (MAP) – predict argmax_y P(X = x|Y = y) P(Y = y).

The relation between these two decision rules is that ML classification is equivalent to MAP classification with a uniform class distribution.
Probabilistic decision rules
We again use the example of Naive Bayes for text classification to illustrate, using both the multinomial and multivariate Bernoulli models.
Prediction using a naive Bayes model
Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our e-mails, with parameters
θ⊕ = (0.5, 0.67, 0.33) θ⊖ = (0.67, 0.33, 0.33)
This means, for example, that the presence of b is twice as likely in spam (+), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain likelihoods
    P(x|⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
    P(x|⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148
The ML classification of x is thus spam.
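These likelihoods are simple products and are easy to reproduce; a sketch follows (the small difference from the slide's 0.222 comes from the slide using the exact fractions 1/3 and 2/3 rather than 0.33 and 0.67):

    theta_spam = (0.5, 0.67, 0.33)    # P(word present | spam) for words a, b, c
    theta_ham = (0.67, 0.33, 0.33)    # P(word present | ham)
    x = (1, 1, 0)                     # e-mail contains a and b but not c

    def bernoulli_likelihood(x, theta):
        p = 1.0
        for xi, ti in zip(x, theta):
            p *= ti if xi else (1 - ti)
        return p

    print(bernoulli_likelihood(x, theta_spam))   # about 0.22
    print(bernoulli_likelihood(x, theta_ham))    # about 0.15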
Prediction using a naive Bayes model
In the case of two classes it is often convenient to work with likelihood ratios and odds.
• The likelihood ratio can be calculated as
    P(x|⊕) / P(x|⊖) = (0.5/0.67) · (0.67/0.33) · ((1 − 0.33)/(1 − 0.33)) = 3/2 > 1

• This means that the MAP classification of x is also spam if the prior odds are more than 2/3, but ham if they are less than that.
• For example, with 33% spam and 67% ham the prior odds are P(⊕)/P(⊖) = 0.33/0.67 = 1/2, resulting in posterior odds of

    P(⊕|x) / P(⊖|x) = (P(x|⊕)/P(x|⊖)) · (P(⊕)/P(⊖)) = 3/2 · 1/2 = 3/4 < 1
In this case the likelihood ratio for x is not strong enough to push the
decision away from the prior.
Prediction using a naive Bayes model
Alternatively, we can employ a multinomial model. The parameters of a multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2) θ⊖ = (0.6, 0.2, 0.2)
The e-mail to be classified contains three occurrences of word a, one single occurrence of word b and no occurrences of word c, and hence is described by the count vector x = (3, 1, 0). The total number of vocabulary word occurrences is n = 4. We obtain likelihoods
    P(x|⊕) = 4! · (0.3³/3!) · (0.5¹/1!) · (0.2⁰/0!) = 0.054
    P(x|⊖) = 4! · (0.6³/3!) · (0.2¹/1!) · (0.2⁰/0!) = 0.1728

The likelihood ratio is (0.3/0.6)³ · (0.5/0.2)¹ · (0.2/0.2)⁰ = 5/16. The ML classification
of x is thus ham, the opposite of the multivariate Bernoulli model. This is
mainly because of the three occurrences of word a, which provide strong
evidence for ham.
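The multinomial likelihoods can be checked the same way; a sketch using the count vector x = (3, 1, 0):

    from math import factorial

    theta_spam = (0.3, 0.5, 0.2)
    theta_ham = (0.6, 0.2, 0.2)
    x = (3, 1, 0)                               # three a's, one b, no c; n = 4

    def multinomial_likelihood(x, theta):
        n = sum(x)
        coeff = factorial(n)
        for c in x:
            coeff //= factorial(c)              # multinomial coefficient n! / (x1! ... xK!)
        p = float(coeff)
        for c, t in zip(x, theta):
            p *= t ** c
        return p

    print(multinomial_likelihood(x, theta_spam))   # 0.054
    print(multinomial_likelihood(x, theta_ham))    # 0.1728 (ratio 5/16, so ML predicts ham)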
Training data for naive Bayes
A small e-mail data set described by count vectors.
E-mail   #a  #b  #c  Class
e1        0   3   0    +
e2        0   3   3    +
e3        3   0   0    +
e4        2   3   0    +
e5        4   3   0    −
e6        4   0   3    −
e7        3   0   0    −
e8        0   0   0    −
Training data for naive Bayes
The same data set described by bit vectors.
E-mail   a?  b?  c?  Class
e1        0   1   0    +
e2        0   1   1    +
e3        1   0   0    +
e4        1   1   0    +
e5        1   1   0    −
e6        1   0   1    −
e7        1   0   0    −
e8        0   0   0    −
Training a naive Bayes model
Consider the following e-mails consisting of five words a, b, c, d, e:
e1: b d e b b d e              e5: a b a b a b a e d
e2: b c e b b d d e c c        e6: a c a c a c a e d
e3: a d a d e a e e            e7: e a e d a e a
e4: b a d b e d a b            e8: d e d e d
We are told that the e-mails on the left are spam and those on the right are ham, and so we use them as a small training set to train our Bayesian classifier.
• First, we decide that d and e are so-called stop words that are too common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary.
Training a naive Bayes model
For the multinomial model, we represent each e-mail as a count vector, as before.
• In order to estimate the parameters of the multinomial, we sum up the count vectors for each class, which gives (5, 9, 3) for spam and (11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for each vocabulary word, which brings the total number of occurrences of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
θˆ⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and θˆ⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham.
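A sketch of this estimation step, using the count vectors from the table above and one pseudo-count per vocabulary word:

    spam_counts = [(0, 3, 0), (0, 3, 3), (3, 0, 0), (2, 3, 0)]   # e1-e4
    ham_counts = [(4, 3, 0), (4, 0, 3), (3, 0, 0), (0, 0, 0)]    # e5-e8

    def estimate_multinomial(count_vectors, pseudo=1):
        """Smoothed per-word probabilities: (word total + pseudo) / (sum of all smoothed totals)."""
        totals = [sum(col) + pseudo for col in zip(*count_vectors)]
        denom = sum(totals)
        return tuple(t / denom for t in totals)

    print(estimate_multinomial(spam_counts))   # (0.3, 0.5, 0.2)
    print(estimate_multinomial(ham_counts))    # (0.6, 0.2, 0.2)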
Training a naive Bayes model
In the multivariate Bernoulli model e-mails are represented by bit vectors, as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and (3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in order to get an estimate of the probability of a document containing a particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one containing each word and one containing none of them.
• This results in the estimated parameter vectors
θˆ⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and θˆ⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham.
Likelihood ratio
As a matter of fact, statisticians very often work with a different conditional probability, given by the likelihood function P(X|Y).
• I like to think of these as thought experiments: if somebody were to send me a spam e-mail, how likely would it be that it contains exactly the words of the e-mail I’m looking at? And how likely if it were a ham e-mail instead?
• What really matters is not the magnitude of these likelihoods, but their ratio: how much more likely is it to observe this combination of words in a spam e-mail than it is in a non-spam e-mail.
• For instance, suppose that for a particular e-mail described by X we have P(X|Y = spam) = 3.5 · 10−5 and P(X|Y = ham) = 7.4 · 10−6, then observing X in a spam e-mail is nearly five times more likely than it is in a ham e-mail.
• This suggests the following decision rule: predict spam if the likelihood ratio is larger than 1 and ham otherwise.
When to use likelihoods
Use likelihoods if you want to ignore the prior distribution or assume it uniform, and posterior probabilities otherwise.
Posterior odds
    P(Y = spam|Viagra = 0, lottery = 0) / P(Y = ham|Viagra = 0, lottery = 0) = 0.31/0.69 = 0.45
    P(Y = spam|Viagra = 1, lottery = 1) / P(Y = ham|Viagra = 1, lottery = 1) = 0.40/0.60 = 0.67
    P(Y = spam|Viagra = 0, lottery = 1) / P(Y = ham|Viagra = 0, lottery = 1) = 0.65/0.35 = 1.9
    P(Y = spam|Viagra = 1, lottery = 0) / P(Y = ham|Viagra = 1, lottery = 0) = 0.80/0.20 = 4.0
Using a MAP decision rule we predict ham in the top two cases and spam in the bottom two. Given that the full posterior distribution is all there is to know about the domain in a statistical sense, these predictions are the best we can do: they are Bayes-optimal.
Example marginal likelihoods
Y       P(Viagra = 1|Y)   P(Viagra = 0|Y)
spam         0.40              0.60
ham          0.12              0.88

Y       P(lottery = 1|Y)  P(lottery = 0|Y)
spam         0.21              0.79
ham          0.13              0.87
Using marginal likelihoods
Using the marginal likelihoods from before, we can approximate the likelihood ratios (the previously calculated odds from the full posterior distribution are shown in brackets):
    P(Viagra = 0|Y = spam)/P(Viagra = 0|Y = ham) · P(lottery = 0|Y = spam)/P(lottery = 0|Y = ham) = (0.60/0.88)(0.79/0.87) = 0.62   (0.45)
    P(Viagra = 0|Y = spam)/P(Viagra = 0|Y = ham) · P(lottery = 1|Y = spam)/P(lottery = 1|Y = ham) = (0.60/0.88)(0.21/0.13) = 1.1    (1.9)
    P(Viagra = 1|Y = spam)/P(Viagra = 1|Y = ham) · P(lottery = 0|Y = spam)/P(lottery = 0|Y = ham) = (0.40/0.12)(0.79/0.87) = 3.0    (4.0)
    P(Viagra = 1|Y = spam)/P(Viagra = 1|Y = ham) · P(lottery = 1|Y = spam)/P(lottery = 1|Y = ham) = (0.40/0.12)(0.21/0.13) = 5.4    (0.67)
We see that, using a maximum likelihood decision rule, our very simple model arrives at the Bayes-optimal prediction in the first three cases, but not in the fourth (‘Viagra’ and ‘lottery’ both present), where the marginal likelihoods are actually very misleading.
Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
    For each target value v_j
        P̂(v_j) ← estimate P(v_j)
        For each attribute value a_i of each attribute a
            P̂(a_i|v_j) ← estimate P(a_i|v_j)

Classify_New_Instance(x)
    v_NB = argmax_{v_j∈V} P̂(v_j) Π_{a_i∈x} P̂(a_i|v_j)
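A compact Python sketch of this algorithm for categorical attributes is given below. The function names are mine, and add-one smoothing is included to avoid the zero-frequency problem discussed later; this is one reasonable reading of the pseudocode, not the course's reference implementation.

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, class_value) pairs."""
        class_counts = Counter(v for _, v in examples)
        value_counts = defaultdict(Counter)     # (attribute index, class) -> counts of values
        domains = defaultdict(set)              # attribute index -> observed values
        for x, v in examples:
            for i, a in enumerate(x):
                value_counts[(i, v)][a] += 1
                domains[i].add(a)
        return class_counts, value_counts, domains, len(examples)

    def naive_bayes_classify(model, x):
        class_counts, value_counts, domains, n = model
        def score(v):                           # P(v) * prod_i P(a_i|v), with add-one smoothing
            p = class_counts[v] / n
            for i, a in enumerate(x):
                p *= (value_counts[(i, v)][a] + 1) / (class_counts[v] + len(domains[i]))
            return p
        return max(class_counts, key=score)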
Naive Bayes Example: PlayTennis
outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no
Naive Bayes Example: PlayTennis
What are the required probabilities to predict PlayTennis ?
Outlook       Yes      No         Temperature   Yes      No
Sunny         2 (2/9)  3 (3/5)    Hot           2 (2/9)  2 (2/5)
Overcast      4 (4/9)  0 (0/5)    Mild          4 (4/9)  2 (2/5)
Rainy         3 (3/9)  2 (2/5)    Cool          3 (3/9)  1 (1/5)

Humidity      Yes      No         Windy         Yes      No
High          3 (3/9)  4 (4/5)    False         6 (6/9)  2 (2/5)
Normal        6 (6/9)  1 (1/5)    True          3 (3/9)  3 (3/5)

Play          Yes       No
              9 (9/14)  5 (5/14)
Naive Bayes Example: PlayTennis
Say we have the new instance:
⟨Outlk = sun, Temp = cool, Humid = high, Wind = true⟩

We want to compute:

    v_NB = argmax_{v_j ∈ {“yes”,“no”}} P(v_j) Π_i P(a_i|v_j)
Naive Bayes Example: PlayTennis
So we first calculate the likelihood of the two classes, “yes” and “no”
    for “yes” = P(y) × P(sun|y) × P(cool|y) × P(high|y) × P(true|y)
              = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053
    for “no”  = P(n) × P(sun|n) × P(cool|n) × P(high|n) × P(true|n)
              = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206
Naive Bayes Example: PlayTennis
Then convert to a probability by normalisation
    P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
    P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795
The Naive Bayes classification is “no”.
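The arithmetic above is easily verified; a short sketch with the values read off the counts table (no smoothing, to match the slide):

    p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(yes) * P(sun|yes) * P(cool|yes) * P(high|yes) * P(true|yes)
    p_no = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)

    print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206
    print(round(p_no / (p_yes + p_no), 3))           # 0.795, so predict "no"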
Naive Bayes: Subtleties
Conditional independence assumption is often violated
    P(a1, a2, …, an | v_j) = Π_i P(a_i | v_j)
• ...but it works surprisingly well anyway. Note don’t need estimated posteriors Pˆ(vj|x) to be correct; need only that
    argmax_{v_j∈V} P̂(v_j) Π_i P̂(a_i|v_j) = argmax_{v_j∈V} P(v_j) P(a1, …, an|v_j)
i.e. maximum probability is assigned to correct class
• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g. identical attributes)
Naive Bayes: “zero-frequency” problem
What if none of the training instances with target value vj have attribute value ai? Then
    P̂(a_i|v_j) = 0, and . . .   P̂(v_j) Π_i P̂(a_i|v_j) = 0

Pseudo-counts: add 1 to each count (a version of the Laplace estimator).
Naive Bayes: “zero-frequency” problem
• In some cases adding a constant different from 1 might be more appropriate
• Example: attribute outlook for class yes
          Sunny                 Overcast              Rainy
    (2 + μ/3) / (9 + μ)   (4 + μ/3) / (9 + μ)   (3 + μ/3) / (9 + μ)
• Weights don’t need to be equal (if they sum to 1) – a form of prior
          Sunny                  Overcast               Rainy
    (2 + μp₁) / (9 + μ)   (4 + μp₂) / (9 + μ)   (3 + μp₃) / (9 + μ)
Naive Bayes: numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
The sample mean μ:

    μ = (1/n) Σ_{i=1}^{n} x_i

The standard deviation σ:

    σ = √( (1/(n−1)) Σ_{i=1}^{n} (x_i − μ)² )
Naive Bayes: numeric attributes
Then we have the density function f(x):
    f(x) = (1/(√(2π) σ)) e^(−(x−μ)²/(2σ²))

Example: continuous attribute temperature with mean = 73 and standard deviation = 6.2. Density value:

    f(temperature = 66 | “yes”) = (1/(√(2π) · 6.2)) e^(−(66−73)²/(2 × 6.2²)) = 0.0340

Missing values during training are not included in the calculation of mean and standard deviation.
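A sketch of the Gaussian likelihood used for a numeric attribute, reproducing the temperature example:

    from math import exp, pi, sqrt

    def gaussian_density(x, mu, sigma):
        return (1 / (sqrt(2 * pi) * sigma)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034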
Naive Bayes: numeric attributes
Note: the normal distribution is based on the simple exponential function
    f(x) = e^(−|x|^m)
As the power m in the exponent increases, the function approaches a step function.
Where m = 2
    f(x) = e^(−|x|²)
and this is the basis of the normal distribution – the various constants are the result of scaling so that the integral (the area under the curve from −∞ to +∞) is equal to 1.
from “Statistical Computing” by Michael J. Crawley (2002) Wiley.
Learning and using a naive Bayes model for classification
Example application: Learning to Classify Text
In machine learning, the classic example of applications of Naive Bayes is learning to classify text documents.
Here is a simplified version in the multinomial document model.
Learning to Classify Text
Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic
Naive Bayes is among most effective algorithms
What attributes shall we use to represent text documents?
Learning to Classify Text
Target concept Interesting? : Document → {+, −}
1. Represent each document by a vector of words
   • one attribute per word position in document
2. Learning: use training examples to estimate
   • P(+)
   • P(−)
   • P(doc|+)
   • P(doc|−)
Learning to Classify Text
Naive Bayes conditional independence assumption
    P(doc|v_j) = Π_{i=1}^{length(doc)} P(a_i = w_k | v_j)

where P(a_i = w_k|v_j) is the probability that the word in position i is w_k, given v_j.

One more assumption: P(a_i = w_k|v_j) = P(a_m = w_k|v_j), ∀ i, m (the “bag of words” assumption).
Application: 20 Newsgroups
Given: 1000 training documents from each group
Learning task: classify each new document by newsgroup it came from
comp.graphics               misc.forsale            alt.atheism                 sci.space
comp.os.ms-windows.misc     rec.autos               soc.religion.christian      sci.crypt
comp.sys.ibm.pc.hardware    rec.motorcycles         talk.religion.misc          sci.electronics
comp.sys.mac.hardware       rec.sport.baseball      talk.politics.mideast       sci.med
comp.windows.x              rec.sport.hockey        talk.politics.misc
                                                    talk.politics.guns

Naive Bayes: 89% classification accuracy
Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided ...
Learning Curve for 20 Newsgroups
[Figure: accuracy (0–100%) against training set size (100 to 10,000 documents) for the Bayes, TFIDF and PRTFIDF classifiers on 20 Newsgroups.]
Accuracy vs. Training set size (1/3 withheld for test)
Naive Bayes: “zero-frequency” problem
This generalisation is a Bayesian estimate for P̂(a_i|v_j):

    P̂(a_i|v_j) ← (n_c + mp) / (n + m)
where
• n is number of training examples for which v = vj,
• n_c is the number of examples for which v = v_j and a = a_i
• p is prior estimate for Pˆ(ai|vj)
• m is weight given to prior (i.e. number of “virtual” examples)
This is called the m-estimate of probability ( a form of “prior”).
Naive Bayes: missing values
What about missing values ? A very basic approach is:
• Training: instance is not included in frequency count for attribute value-class combination
• Classification: attribute will be omitted from calculation

Can we do better?
Missing values
Suppose we skimmed an e-mail and noticed that it contains the word ‘lottery’ but we haven’t looked closely enough to determine whether it uses the word ‘Viagra’. This means that we don’t know whether to use the second or the fourth row in the table to make a prediction. This is a problem, as we would predict spam if the e-mail contained the word ‘Viagra’ (second row) and ham if it didn’t (fourth row).
The solution is to average these two rows, using the probability of ‘Viagra’ occurring in any e-mail (spam or not):
    P(Y|lottery) = P(Y|Viagra = 0, lottery) P(Viagra = 0) + P(Y|Viagra = 1, lottery) P(Viagra = 1)
Missing values
For instance, suppose for the sake of argument that one in ten e-mails contain the word ‘Viagra’, then P (Viagra = 1) = 0.10 and
P (Viagra = 0) = 0.90. Using the above formula, we obtain
P(Y = spam|lottery = 1) = 0.65 · 0.90 + 0.40 · 0.10 = 0.625 and
P (Y = ham|lottery = 1) = 0.35 · 0.90 + 0.60 · 0.10 = 0.375. Because the occurrence of ‘Viagra’ in any e-mail is relatively rare, the resulting distribution deviates only a little from the second row in the data.
Linear discriminants
The “classic” classification learning algorithm is a linear discriminant — many forms of linear discriminant from statistics and machine learning, e.g.,
• Fisher’s Linear Discriminant Analysis
• basic linear classifier (last lecture) is a simplified version of this
• Logistic Regression
• a probabilistic linear classifier (next slides)
• Perceptron
• a linear threshold classifier
• an early version of an artificial “neuron”
• still a useful method, and source of ideas (later lecture)
• Winnow
• like Perceptron, another linear threshold classifier
• uses a different learning method, with interesting properties
• comes from Learning Theory (later lecture)
Logistic Regression
Linear Regression for classification
Question: can we use Linear Regression for classification ? Answer: yes, e.g.,
Training: train a separate linear regression model for each class
• set y = 1 if example in class
• set y = 0 otherwise

Prediction: for each example
• run example through all regression models
• predict the class with the largest output value for y
Called multi-response linear regression [2].
Note: does not obey linear regression assumptions, may work in practice.
[2] “Data Mining” (4th edn.) by Witten et al. (2017).
Logistic regression
Logistic regression
In the case of a two-class classification problem, if we model the probability P (Y = 1) of an instance x being a positive example like this:
    P(Y = 1|x) = 1 / (1 + e^(−wᵀx))
then this probability vs. the alternative (1 − P (Y = 1)) can be written like this:
    ln( P(Y = 1|x) / (1 − P(Y = 1|x)) ) = wᵀx
The quantity on the l.h.s. is called the logit and we are defining a linear model for the logit.
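A small sketch (NumPy assumed; the weights and instance are invented for illustration) showing that the sigmoid form and the linear logit are two views of the same model:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([0.5, -1.0, 2.0])    # illustrative weights; w[0] is the intercept
    x = np.array([1.0, 0.3, 0.7])     # leading 1 so the intercept is absorbed into w

    p = sigmoid(w @ x)                # P(Y = 1 | x)
    logit = np.log(p / (1 - p))       # equals w @ x up to floating-point error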
Logistic regression
Unlike linear regression, no analytical maximum likelihood (ML) solution to find weights w.
An iterative gradient ascent method can be used to maximize log likelihood.
The (conditional) log likelihood is:
    Σ_{i=1}^{N} y^(i) log P(1|x^(i)) + (1 − y^(i)) log(1 − P(1|x^(i)))
Over a set of N examples, choose w to maximise this expression, noting
that y(i) is always either 0 or 1.
Generalises to multiple class versions (Y can have more than two values).
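A minimal gradient-ascent sketch for this log likelihood; the learning rate, iteration count and toy data are illustrative choices, not from the lecture:

    import numpy as np

    def fit_logistic(X, y, lr=0.1, iters=1000):
        """Maximise the conditional log likelihood by gradient ascent.
        X: (N, d) matrix with a leading column of ones; y: (N,) array of 0/1 labels."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ w))   # P(Y = 1 | x) for every example
            w += lr * (X.T @ (y - p))          # gradient of the log likelihood
        return w

    # Tiny illustration: one feature plus an intercept column.
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    w = fit_logistic(X, y)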
Summary
• We described the classification problem in machine learning
• We also outlined the issue of Inductive Bias
• Two major frameworks for classification were covered
Distance-based. The key ideas are geometric.
    We discussed distance-based classification (Nearest Neighbour)
Probabilistic. The key ideas are (mostly) Bayesian.
    We discussed generative (Naive Bayes) and discriminative (Logistic Regression) models
• So we have established a basis for learning classifiers
• Later we will see how to extend by building on these ideas