
Supervised Learning – Classification


COMP9417 Machine Learning and Data Mining

Last revision: 14 March 2018

COMP9417 ML & DM Classification Semester 1, 2018 1 / 132

Acknowledgements

Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/

Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook

Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook

Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml

Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html

Material derived from slides for the course
“Machine Learning” by A. Srinivasan
BITS Pilani, Goa, India (2016)

COMP9417 ML & DM Classification Semester 1, 2018 2 / 132


Aims

This lecture will introduce you to machine learning approaches to the
problem of classification. Following it you should be able to reproduce
theoretical results, outline algorithmic techniques and describe practical
applications for the topics:

• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
• outline the Perceptron classification algorithm
• outline the logistic regression classification algorithm

Note: slides with titles marked ∗ are for background only.

COMP9417 ML & DM Classification Semester 1, 2018 3 / 132

Introduction

Introduction

Classification (sometimes called concept learning) methods dominate
machine learning . . .

. . . however, they often don’t have convenient mathematical properties like
regression, so are more complicated to analyse. The idea is to learn a
classifier, which is usually a function mapping from an input data point to
one of a set of discrete outputs, i.e., the classes.

We will mostly focus on their advantages and disadvantages as learning
methods first, and point to unifying ideas and approaches where
applicable. In this lecture we focus on classification methods that are
essentially linear models . . .

and in later lectures we will see other, more expressive, classifier learning
methods.

COMP9417 ML & DM Classification Semester 1, 2018 4 / 132

Nearest neighbour classification

Nearest neighbour classification

• Related to the simplest form of learning: rote learning or memorization
• Training instances are searched for the instance that most closely
  resembles the new or query instance
• The instances themselves represent the knowledge
• Called instance-based, memory-based or case-based learning; often a
  form of local learning
• The similarity or distance function defines “learning”, i.e., how to go
  beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars,
  should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, lowess, . . .
• Ideas also important for unsupervised methods, e.g., clustering (later
  lectures)

COMP9417 ML & DM Classification Semester 1, 2018 5 / 132

Distance-based models

Minkowski distance

Minkowski distance If X = ℝ^d, the Minkowski distance of order p > 0 is
defined as

    Disp(x, y) = ( Σ_{j=1}^{d} |xj − yj|^p )^{1/p} = ||x − y||p

where ||z||p = ( Σ_{j=1}^{d} |zj|^p )^{1/p} is the p-norm (sometimes denoted
the Lp norm) of the vector z.

COMP9417 ML & DM Classification Semester 1, 2018 6 / 132

Distance-based models

Minkowski distance

• The 2-norm refers to the familiar Euclidean distance

    Dis2(x, y) = √( Σ_{j=1}^{d} (xj − yj)^2 ) = √( (x − y)^T (x − y) )

  which measures distance ‘as the crow flies’.

• The 1-norm denotes Manhattan distance, also called cityblock distance:

    Dis1(x, y) = Σ_{j=1}^{d} |xj − yj|

  This is the distance if we can only travel along coordinate axes.

COMP9417 ML & DM Classification Semester 1, 2018 7 / 132

Distance-based models

Minkowski distance

• If we now let p grow larger, the distance will be more and more
  dominated by the largest coordinate-wise distance, from which we can
  infer that Dis∞(x, y) = max_j |xj − yj|; this is also called Chebyshev
  distance.

• You will sometimes see references to the 0-norm (or L0 norm) which
  counts the number of non-zero elements in a vector. The
  corresponding distance then counts the number of positions in which
  vectors x and y differ. This is not strictly a Minkowski distance;
  however, we can define it as

    Dis0(x, y) = Σ_{j=1}^{d} (xj − yj)^0 = Σ_{j=1}^{d} I[xj ≠ yj]

  under the understanding that x^0 = 0 for x = 0 and 1 otherwise.

COMP9417 ML & DM Classification Semester 1, 2018 8 / 132
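To make these distances concrete, here is a minimal Python sketch (the function name and test vectors are illustrative only, not from the lecture); p = 1 gives Manhattan, p = 2 Euclidean, and p = ∞ Chebyshev, while counting differing positions gives the Hamming-style 0-"norm" distance:

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)                      # Chebyshev: largest coordinate-wise difference
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = [0, 3, 4], [7, 3, 0]
print(minkowski(x, y, 1))             # 11.0   (Manhattan)
print(minkowski(x, y, 2))             # 8.06...(Euclidean)
print(minkowski(x, y, float('inf')))  # 7      (Chebyshev)

# "0-norm" style count of positions in which the vectors differ
print(sum(a != b for a, b in zip(x, y)))   # 2
```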

Distance-based models

Minkowski distance

Sometimes the data X is not naturally in ℝ^d, but if we can turn it into
Boolean features, or character sequences, we can still apply distance
measures. For example:

• If x and y are binary strings, this is also called the Hamming
distance. Alternatively, we can see the Hamming distance as the
number of bits that need to be flipped to change x into y.

• For non-binary strings of unequal length this can be generalised to the
notion of edit distance or Levenshtein distance.

COMP9417 ML & DM Classification Semester 1, 2018 9 / 132

Distance-based models

Circles and ellipses

Lines connecting points at order-p Minkowski distance 1 from the origin
for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square
in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 4;
p = 8; and p =∞ (Chebyshev distance, the blue rectangle). Notice that
for points on the coordinate axes all distances agree. For the other points,
our reach increases with p; however, if we require a rotation-invariant
distance metric then Euclidean distance is our only choice.

COMP9417 ML & DM Classification Semester 1, 2018 10 / 132

Distance-based models

Distance metric

Distance metric Given an instance space X , a distance metric is a
function Dis : X × X → ℝ such that for any x, y, z ∈ X :
• distances between a point and itself are zero: Dis(x, x) = 0;
• all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
• distances are symmetric: Dis(y, x) = Dis(x, y);
• detours can not shorten the distance: Dis(x, z) ≤ Dis(x, y) + Dis(y, z).

If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x, y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.

COMP9417 ML & DM Classification Semester 1, 2018 11 / 132

Distance-based models

The triangle inequality – Minkowski distance for p = 2


The green circle connects points the same Euclidean distance (i.e.,
Minkowski distance of order p = 2) away from the origin as A. The orange
circle shows that B and C are equidistant from A. The red circle
demonstrates that C is closer to the origin than B, which conforms to the
triangle inequality.

COMP9417 ML & DM Classification Semester 1, 2018 12 / 132
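A quick numerical check of the triangle inequality for Minkowski distances (the points are made up, not the A, B, C of the figures); as the next slide discusses, the inequality can fail for p < 1:

```python
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

o, a, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)     # origin and two other points
for p in (2.0, 0.8):
    direct = minkowski(o, c, p)                  # straight from the origin to c
    via_a  = minkowski(o, a, p) + minkowski(a, c, p)   # detour through a
    print(p, direct, via_a, direct <= via_a)
# p = 2.0: 1.414... <= 2.0  -> triangle inequality holds
# p = 0.8: 2.378... >  2.0  -> violated, so the order-0.8 distance is not a metric
```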

Distance-based models

The triangle inequality – Minkowski distance for p ≤ 1


With Manhattan distance (p = 1), B and C are equally close to the origin
and also equidistant from A. With p < 1 (here, p = 0.8) C is further away
from the origin than B; since both are again equidistant from A, it follows
that travelling from the origin to C via A is quicker than going there
directly, which violates the triangle inequality.

COMP9417 ML & DM Classification Semester 1, 2018 13 / 132

Distance-based models

Mahalanobis distance *

Often, the shape of the ellipse is estimated from data as the inverse of the
covariance matrix: M = Σ^{−1}. This leads to the definition of the
Mahalanobis distance

    DisM(x, y|Σ) = √( (x − y)^T Σ^{−1} (x − y) )

Using the covariance matrix in this way has the effect of decorrelating and
normalising the features. Clearly, Euclidean distance is a special case of
Mahalanobis distance with the identity matrix I as covariance matrix:
Dis2(x, y) = DisM(x, y|I).

COMP9417 ML & DM Classification Semester 1, 2018 14 / 132

Distance-based models Neighbours and exemplars

Means and distances

The arithmetic mean minimises squared Euclidean distance
The arithmetic mean µ of a set of data points D in a Euclidean space is
the unique point that minimises the sum of squared Euclidean distances to
those data points.

Proof. We will show that arg min_y Σ_{x∈D} ||x − y||^2 = µ, where ||·||
denotes the 2-norm. We find this minimum by taking the gradient (the
vector of partial derivatives with respect to yi) of the sum and setting it
to the zero vector:

    ∇y Σ_{x∈D} ||x − y||^2 = −2 Σ_{x∈D} (x − y) = −2 Σ_{x∈D} x + 2|D|y = 0

from which we derive y = (1/|D|) Σ_{x∈D} x = µ.

COMP9417 ML & DM Classification Semester 1, 2018 15 / 132

Distance-based models Neighbours and exemplars

Means and distances

• Notice that minimising the sum of squared Euclidean distances of a
  given set of points is the same as minimising the average squared
  Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t it
  be more natural to take the point that minimises total Euclidean
  distance as exemplar?
• This point is known as the geometric median, as for univariate data it
  corresponds to the median or ‘middle value’ of a set of numbers.
  However, for multivariate data there is no closed-form expression for
  the geometric median, which needs to be calculated by successive
  approximation.

COMP9417 ML & DM Classification Semester 1, 2018 16 / 132

Distance-based models Neighbours and exemplars

Means and distances

• In certain situations it makes sense to restrict an exemplar to be one
  of the given data points. In that case, we speak of a medoid, to
  distinguish it from a centroid which is an exemplar that doesn’t have
  to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the
  total distance to all other data points, in order to choose the point
  that minimises it. Regardless of the distance metric used, this is an
  O(n^2) operation for n points.
• So for medoids there is no computational reason to prefer one
  distance metric over another.
• There may be more than one medoid.

COMP9417 ML & DM Classification Semester 1, 2018 17 / 132

Distance-based models Neighbours and exemplars

Centroids and medoids

[Figure: a scatter plot with legend: data points; squared 2-norm centroid
(mean); 2-norm centroid (geometric median); squared 2-norm medoid;
2-norm medoid; 1-norm medoid.]

A small data set of 10 points, with circles indicating centroids and squares
indicating medoids (the latter must be data points), for different distance
metrics. Notice how the outlier on the bottom-right ‘pulls’ the mean away
from the geometric median; as a result the corresponding medoid changes
as well.
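A small sketch of these exemplars in Python (assuming numpy; the data points are made up, not those in the figure), contrasting the mean, a successively approximated geometric median, and medoids under two metrics:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.5], [-0.3, 0.4], [0.1, -0.2],
              [2.0, -1.8]])                    # last point plays the role of an outlier

mean = X.mean(axis=0)                          # minimises the sum of squared 2-norm distances

def geometric_median(X, iters=100):
    """Successive approximation (Weiszfeld-style) of the point minimising
    the sum of unsquared Euclidean distances."""
    y = X.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1)
        d = np.where(d == 0, 1e-12, d)         # guard against division by zero
        y = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return y

def medoid(X, dist):
    """Data point with the smallest total distance to all others: an O(n^2) scan."""
    totals = [sum(dist(x, z) for z in X) for x in X]
    return X[int(np.argmin(totals))]

euclid = lambda a, b: np.linalg.norm(a - b)
manhattan = lambda a, b: np.abs(a - b).sum()

print("mean:             ", mean)
print("geometric median: ", geometric_median(X))
print("2-norm medoid:    ", medoid(X, euclid))
print("1-norm medoid:    ", medoid(X, manhattan))
```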
COMP9417 ML & DM Classification Semester 1, 2018 18 / 132

Distance-based models Neighbours and exemplars

The basic linear classifier is distance-based

• The basic linear classifier constructs the decision boundary as the
  perpendicular bisector of the line segment connecting the two
  exemplars (one for each class).
• An alternative, distance-based way to classify instances without direct
  reference to a decision boundary is by the following decision rule: if x
  is nearest to µ⊕ then classify it as positive, otherwise as negative; or
  equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple
  geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance-based
  perspective as constructing exemplars that minimise squared Euclidean
  distance within each class, and then applying a nearest-exemplar
  decision rule.

COMP9417 ML & DM Classification Semester 1, 2018 19 / 132

Distance-based models Neighbours and exemplars

Two-exemplar decision boundaries

(left) For two exemplars the nearest-exemplar decision rule with Euclidean
distance results in a linear decision boundary coinciding with the
perpendicular bisector of the line connecting the two exemplars. (right)
Using Manhattan distance the circles are replaced by diamonds.

COMP9417 ML & DM Classification Semester 1, 2018 20 / 132

Distance-based models Neighbours and exemplars

Three-exemplar decision boundaries

(left) Decision regions defined by the 2-norm nearest-exemplar decision
rule for three exemplars. (right) With Manhattan distance the decision
regions become non-convex.

COMP9417 ML & DM Classification Semester 1, 2018 21 / 132

Distance-based models Neighbours and exemplars

Distance-based models

To summarise, the main ingredients of distance-based models are
• distance metrics, which can be Euclidean, Manhattan, Minkowski or
  Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen
  distance metric, or medoids that find the most centrally located data
  point; and
• distance-based decision rules, which take a vote among the k nearest
  exemplars.

COMP9417 ML & DM Classification Semester 1, 2018 22 / 132

Distance-based models k-Nearest neighbour Classification

Nearest Neighbour

Stores all training examples 〈xi, f(xi)〉.

Nearest neighbour:
• Given query instance xq, first locate nearest training example xn, then
  estimate f̂(xq) ← f(xn)

k-Nearest neighbour:
• Given xq, take vote among its k nearest neighbours (if discrete-valued
  target function)
• take mean of f values of k nearest neighbours (if real-valued)

    f̂(xq) ← ( Σ_{i=1}^{k} f(xi) ) / k

COMP9417 ML & DM Classification Semester 1, 2018 23 / 132

Distance-based models k-Nearest neighbour Classification

k-Nearest Neighbour Algorithm

Training algorithm:
• For each training example 〈xi, f(xi)〉, add the example to the list
  training examples.

Classification algorithm:
• Given a query instance xq to be classified,
  • Let x1 . . . xk be the k instances from training examples that are
    nearest to xq by the distance function
  • Return

        f̂(xq) ← arg max_{v∈V} Σ_{i=1}^{k} δ(v, f(xi))

    where δ(a, b) = 1 if a = b and 0 otherwise.

COMP9417 ML & DM Classification Semester 1, 2018 24 / 132
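The algorithm above, rendered directly as Python (class and function names are mine); Euclidean distance is used as the distance function:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

class KNNClassifier:
    def __init__(self, k=3, dist=euclidean):
        self.k, self.dist = k, dist
        self.examples = []                       # "training" just stores <x, f(x)> pairs

    def fit(self, X, y):
        self.examples = list(zip(X, y))
        return self

    def predict(self, xq):
        # the k stored examples nearest to the query instance
        nearest = sorted(self.examples, key=lambda e: self.dist(xq, e[0]))[:self.k]
        # majority vote: arg max_v sum_i delta(v, f(x_i))
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

knn = KNNClassifier(k=3).fit([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]],
                             ['-', '-', '-', '+', '+', '+'])
print(knn.predict([4.5, 5.0]))   # '+'
```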
Distance-based models k-Nearest neighbour Classification

“Hypothesis Space” for Nearest Neighbour

[Figure: (left) + and − training points around a query point xq;
(right) the Voronoi tessellation of the instance space induced by 1-NN.]

2 classes, + and − and query point xq. On left, note effect of varying k.
On right, 1-NN induces a Voronoi tessellation of the instance space.
Formed by the perpendicular bisectors of lines between points.

COMP9417 ML & DM Classification Semester 1, 2018 25 / 132

Distance-based models k-Nearest neighbour Classification

Distance function again

The distance function defines what is learned.

Instance x is described by a feature vector (list of attribute-value pairs)

    〈a1(x), a2(x), . . . , am(x)〉

where ar(x) denotes the value of the rth attribute of x.

Most commonly used distance function is Euclidean distance . . .
• distance between two instances xi and xj is defined to be

    d(xi, xj) = √( Σ_{r=1}^{m} (ar(xi) − ar(xj))^2 )

COMP9417 ML & DM Classification Semester 1, 2018 26 / 132

Distance-based models k-Nearest neighbour Classification

Distance function again

Many other distance functions could be used . . .
• e.g., Manhattan or city-block distance (sum of absolute values of
  differences between attributes)

    d(xi, xj) = Σ_{r=1}^{m} |ar(xi) − ar(xj)|

Vector-based formalization – use norm L1, L2, . . .

The idea of distance functions will appear again in kernel methods.

COMP9417 ML & DM Classification Semester 1, 2018 27 / 132

Distance-based models k-Nearest neighbour Classification

Normalization and other issues

• Different attributes measured on different scales
• Need to be normalized (why ?)

      ar = (vr − min vr) / (max vr − min vr)

  where vr is the actual value of attribute r
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant
  (given normalized attributes)

COMP9417 ML & DM Classification Semester 1, 2018 28 / 132

Distance-based models k-Nearest neighbour Classification

When To Consider Nearest Neighbour

• Instances map to points in ℝ^n

• Warning: in high-dimensional spaces everything is far away from
everything and so pairwise distances are uninformative (curse of
dimensionality)

COMP9417 ML & DM Classification Semester 1, 2018 33 / 132

Distance-based models k-Nearest neighbour Classification

Distance-Weighted kNN

• Might want to weight nearer neighbours more heavily …
• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:

    f̂(xq) ← arg max_{v∈V} Σ_{i=1}^{k} wi δ(v, f(xi))

where

    wi ≡ 1 / d(xq, xi)^2

and d(xq, xi) is distance between xq and xi

COMP9417 ML & DM Classification Semester 1, 2018 34 / 132

Distance-based models k-Nearest neighbour Classification

Distance-Weighted kNN

For real-valued target functions replace the final line of the algorithm by:

    f̂(xq) ← ( Σ_{i=1}^{k} wi f(xi) ) / ( Σ_{i=1}^{k} wi )

(denominator normalizes contribution of individual weights).

Now we can consider using all the training examples instead of just k
→ using all examples (i.e., when k = n) with the rule above is called
Shepard’s method

COMP9417 ML & DM Classification Semester 1, 2018 35 / 132
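Both distance-weighted variants as a Python sketch (names are mine; the guard for an exact match at zero distance is an added detail); calling the regression version with k = n gives Shepard's method:

```python
import math
from collections import defaultdict

def weighted_knn_classify(examples, xq, k, dist):
    """Weighted vote among the k nearest neighbours, with w_i = 1 / d(x_q, x_i)^2."""
    nearest = sorted(examples, key=lambda e: dist(xq, e[0]))[:k]
    votes = defaultdict(float)
    for x, label in nearest:
        d = dist(xq, x)
        if d == 0:
            return label                 # query coincides with a training example
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)

def weighted_knn_regress(examples, xq, k, dist):
    """Weighted average of real-valued targets; k = len(examples) is Shepard's method."""
    nearest = sorted(examples, key=lambda e: dist(xq, e[0]))[:k]
    num = den = 0.0
    for x, f in nearest:
        d = dist(xq, x)
        if d == 0:
            return f
        w = 1.0 / d ** 2
        num += w * f
        den += w
    return num / den

dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
data = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 5.0)]
print(weighted_knn_regress(data, [1.5], k=len(data), dist=dist))  # Shepard's method
```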

Distance-based models k-Nearest neighbour Classification

Evaluation

Lazy learners do not construct an explicit model, so how do we evaluate
the output of the learning process ?

• 1-NN – training set error is always zero !
• each training example is always closest to itself

• k-NN – overfitting may be hard to detect

Leave-one-out cross-validation (LOOCV) – leave out each example and
predict it given the rest:

(x1, y1), (x2, y2), . . . , (xi−1, yi−1), (xi+1, yi+1), . . . , (xn, yn)

Error is mean over all predicted examples. Fast – no models to be built !

COMP9417 ML & DM Classification Semester 1, 2018 36 / 132
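A minimal LOOCV sketch for a lazy learner such as 1-NN (names are illustrative): each example is held out in turn and predicted from the rest, so no model is ever built:

```python
def loocv_error(examples, classify):
    """classify(train, xq) -> predicted label; returns mean 0/1 error over all hold-outs."""
    errors = 0
    for i, (x, y) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]      # leave example i out
        if classify(train, x) != y:
            errors += 1
    return errors / len(examples)

# example: 1-NN as the classifier being evaluated
def one_nn(train, xq):
    return min(train, key=lambda e: sum((a - b) ** 2 for a, b in zip(e[0], xq)))[1]

data = [([0, 0], '-'), ([0, 1], '-'), ([5, 5], '+'), ([6, 5], '+')]
print(loocv_error(data, one_nn))   # 0.0 on this toy set
```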

Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

Bellman (1960) coined this term in the context of dynamic programming

Imagine instances described by 20 attributes, but only 2 are relevant to
target function — “similar” examples will appear “distant”.

Curse of dimensionality: nearest neighbour is easily misled when
high-dimensional X – problem of irrelevant attributes

One approach:

• Stretch jth axis by weight zj , where z1, . . . , zn chosen to minimize
prediction error

• Use cross-validation to automatically choose weights z1, . . . , zn
• Note setting zj to zero eliminates this dimension altogether

COMP9417 ML & DM Classification Semester 1, 2018 37 / 132

Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

• number of “cells” in the instance space grows exponentially in the
number of features

• with exponentially many cells we would need exponentially many data
points to ensure that each cell is sufficiently populated to make
nearest-neighbour predictions reliably

COMP9417 ML & DM Classification Semester 1, 2018 38 / 132

Distance-based models k-Nearest neighbour Classification

Curse of Dimensionality

Some ideas to address this for instance-based (nearest-neighbour) learning

• Euclidean distance with weights on attributes

    √( Σ_r wr (ar(xq) − ar(x))^2 )

• updating of weights based on nearest neighbour classification error
  • class correct/incorrect: weight increased/decreased
• can be useful if not all features used in classification

See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error”

COMP9417 ML & DM Classification Semester 1, 2018 39 / 132

Distance-based models k-Nearest neighbour Classification

Instance-based (nearest-neighbour) learning

Recap – Practical problems of 1-NN scheme:

• Slow (but fast k-d tree-based approaches exist)
• Remedy: removing irrelevant data

• Noise (but k-NN copes quite well with noise)
• Remedy: removing noisy instances

• All attributes deemed equally important
• Remedy: attribute weighting (or simply selection)

• Doesn’t perform explicit generalization
• Remedy: rule-based or tree-based NN approaches

COMP9417 ML & DM Classification Semester 1, 2018 40 / 132

Distance-based models k-Nearest neighbour Classification

Refinements of instance-based (nearest-neighbour)
classifiers *

• Edited NN classifiers discard some of the training instances before
making predictions

• Saves memory and speeds up classification
• IB2: incremental NN learner: only incorporates misclassified instances

into the classifier
• Problem: noisy data gets incorporated

• IB3: store classification performance information with each instance
& only use in prediction if above a threshold

COMP9417 ML & DM Classification Semester 1, 2018 41 / 132

Distance-based models k-Nearest neighbour Classification

Dealing with noise *

Use larger values of k (why ?) How to find the “right” k ?

• One way: cross-validation-based k-NN classifier (but slow)
• Different approach: discarding instances that don’t perform well by
  keeping success records of how well an instance does at prediction (IB3)
  • Computes confidence interval for an instance’s success rate and for
    default accuracy of its class
  • If lower limit of first interval is above upper limit of second one,
    instance is accepted (IB3: 5%-level)
  • If upper limit of first interval is below lower limit of second one,
    instance is rejected (IB3: 12.5%-level)

COMP9417 ML & DM Classification Semester 1, 2018 42 / 132

Naive Bayes Why model uncertainty ?

Uncertainty

As far as the laws of mathematics refer to reality, they are not
certain; as far as they are certain, they do not refer to reality.

–Albert Einstein

COMP9417 ML & DM Classification Semester 1, 2018 43 / 132

Naive Bayes Why Bayes ?

Two Roles for Bayesian Methods

Provides practical learning algorithms:

• Naive Bayes classifier learning
• Bayesian network learning, etc.
• Combines prior knowledge (prior probabilities) with observed data
• How to get prior probabilities ?

Provides useful conceptual framework:

• Provides a “gold standard” for evaluating other learning algorithms
• Some additional insight into Occam’s razor

COMP9417 ML & DM Classification Semester 1, 2018 44 / 132

Naive Bayes Why Bayes ?

Bayes Theorem

    P(h|D) = P(D|h) P(h) / P(D)

where
  P(h)   = prior probability of hypothesis h
  P(D)   = prior probability of training data D
  P(h|D) = probability of h given D
  P(D|h) = probability of D given h

COMP9417 ML & DM Classification Semester 1, 2018 45 / 132

Naive Bayes Why Bayes ?

Choosing Hypotheses

    P(h|D) = P(D|h) P(h) / P(D)

Generally want the most probable hypothesis given the training data
Maximum a posteriori hypothesis hMAP:

    hMAP = arg max_{h∈H} P(h|D)
         = arg max_{h∈H} P(D|h) P(h) / P(D)
         = arg max_{h∈H} P(D|h) P(h)

COMP9417 ML & DM Classification Semester 1, 2018 46 / 132

Naive Bayes Why Bayes ?

Choosing Hypotheses

If assume P(hi) = P(hj) then can further simplify, and choose the
Maximum likelihood (ML) hypothesis

    hML = arg max_{hi∈H} P(D|hi)

COMP9417 ML & DM Classification Semester 1, 2018 47 / 132

Naive Bayes Why Bayes ?

Applying Bayes Theorem

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.

P(cancer) =                 P(¬cancer) =
P(⊕ | cancer) =             P(⊖ | cancer) =
P(⊕ | ¬cancer) =            P(⊖ | ¬cancer) =

COMP9417 ML & DM Classification Semester 1, 2018 48 / 132

Naive Bayes Why Bayes ?

Applying Bayes Theorem

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.

P(cancer) = .008            P(¬cancer) = .992
P(⊕ | cancer) = .98         P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03        P(⊖ | ¬cancer) = .97

COMP9417 ML & DM Classification Semester 1, 2018 49 / 132

Naive Bayes Why Bayes ?

Applying Bayes Theorem

Does patient have cancer or not?

We can find the maximum a posteriori (MAP) hypothesis

P (⊕ | cancer)P (cancer) = 0.98× 0.008 = 0.00784
P (⊕ | ¬cancer)P (¬cancer) = 0.03× 0.992 = 0.02976

Thus hMAP = . . ..

COMP9417 ML & DM Classification Semester 1, 2018 50 / 132

Naive Bayes Why Bayes ?

Applying Bayes Theorem

Does patient have cancer or not?

We can find the maximum a posteriori (MAP) hypothesis

P (⊕ | cancer)P (cancer) = 0.98× 0.008 = 0.00784
P (⊕ | ¬cancer)P (¬cancer) = 0.03× 0.992 = 0.02976

Thus hMAP = ¬cancer.

Also note: posterior probability of hypothesis cancer higher than prior.

COMP9417 ML & DM Classification Semester 1, 2018 51 / 132

Naive Bayes Why Bayes ?

Applying Bayes Theorem

How to get the posterior probability of a hypothesis h ?
Divide by P (⊕), probability of data, to normalize result for h:

    P(h|D) = P(D|h) P(h) / Σ_{hi∈H} P(D|hi) P(hi)

Denominator ensures we obtain posterior probabilities that sum to 1.
Sum for all possible numerator values, since hypotheses are mutually
exclusive (e.g., patient either has cancer or does not).
Marginal likelihood (marginalizing out over hypothesis space) — prior
probability of the data.

COMP9417 ML & DM Classification Semester 1, 2018 52 / 132
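The cancer example as a few lines of Python, normalising by the marginal likelihood exactly as above:

```python
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# unnormalised MAP scores P(+|h) P(h)
joint_cancer = p_pos_given_cancer * p_cancer    # 0.00784
joint_not    = p_pos_given_not * p_not          # 0.02976

p_data = joint_cancer + joint_not               # P(+), the marginal likelihood
print(joint_cancer / p_data)   # P(cancer | +)  ~ 0.21 (higher than the 0.008 prior)
print(joint_not / p_data)      # P(~cancer | +) ~ 0.79, so hMAP = ~cancer
```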

Naive Bayes Why Bayes ?

Basic Formulas for Probabilities

Product Rule: probability P(A ∧ B) of conjunction of two events A and B:

    P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Sum Rule: probability of disjunction of two events A and B:

    P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if events A1, . . . , An are mutually exclusive
with Σ_{i=1}^{n} P(Ai) = 1, then:

    P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)

COMP9417 ML & DM Classification Semester 1, 2018 53 / 132

Naive Bayes Why Bayes ?

Basic Formulas for Probabilities

Also worth remembering:

• Conditional Probability: probability of A given B:

    P(A|B) = P(A ∧ B) / P(B)

• Rearrange sum rule to get:

    P(A ∧ B) = P(A) + P(B) − P(A ∨ B)

Exercise: Derive Bayes Theorem.

COMP9417 ML & DM Classification Semester 1, 2018 54 / 132

Naive Bayes Why Bayes ?

Brute Force MAP Hypothesis Learner

Idea: view learning as finding the most probable hypothesis

• For each hypothesis h in H, calculate the posterior probability

    P(h|D) = P(D|h) P(h) / P(D)

• Output the hypothesis hMAP with the highest posterior probability

    hMAP = arg max_{h∈H} P(h|D)

COMP9417 ML & DM Classification Semester 1, 2018 55 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning (i.e., classification)

Canonical concept learning task:

• instance space X, hypothesis space H, training examples D
• consider a learning algorithm that outputs the most specific hypothesis
  from the version space VS_{H,D} (i.e., the set of all consistent or
  ”zero-error” classification rules)

What would Bayes rule produce as the MAP hypothesis?

Does this algorithm output a MAP hypothesis??

COMP9417 ML & DM Classification Semester 1, 2018 56 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Brute Force MAP Framework for Concept Learning:

Assume fixed set of instances 〈x1, . . . , xm〉

Assume D is the set of classifications D = 〈c(x1), . . . , c(xm)〉

Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H

Choose P(D|h):
• P(D|h) = 1 if h consistent with D
• P(D|h) = 0 otherwise

COMP9417 ML & DM Classification Semester 1, 2018 57 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Then:

    P(h|D) = 1 / |VS_{H,D}|    if h is consistent with D
    P(h|D) = 0                 otherwise

COMP9417 ML & DM Classification Semester 1, 2018 58 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Note that since likelihood is zero if h is inconsistent then the posterior is
also zero. But how did we obtain the posterior for consistent h ?

    P(h|D) = (1 · 1/|H|) / P(D)
           = (1 · 1/|H|) / ( |VS_{H,D}| / |H| )
           = 1 / |VS_{H,D}|

COMP9417 ML & DM Classification Semester 1, 2018 59 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

How did we obtain P(D) ? From theorem of total probability:

    P(D) = Σ_{hi∈H} P(D|hi) P(hi)
         = Σ_{hi∈VS_{H,D}} 1 · (1/|H|)  +  Σ_{hi∉VS_{H,D}} 0 · (1/|H|)
         = Σ_{hi∈VS_{H,D}} 1 · (1/|H|)
         = |VS_{H,D}| / |H|

COMP9417 ML & DM Classification Semester 1, 2018 60 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Evolution of Posterior Probabilities

[Figure: (a) the prior P(h), (b) the posterior P(h|D1), and (c) the posterior
P(h|D1, D2), each shown as a distribution over the hypotheses.]

COMP9417 ML & DM Classification Semester 1, 2018 61 / 132

A Bayesian approach to learning algorithms Concept learning in Bayesian terms

Relation to Concept Learning

Every hypothesis consistent with D is a MAP hypothesis, if we assume

• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)

So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.

COMP9417 ML & DM Classification Semester 1, 2018 62 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

E.g., learning a linear target function f from noisy examples:

[Figure: noisy examples (x, y) scattered around the target function f, with
error e and the maximum likelihood hypothesis hML shown.]

COMP9417 ML & DM Classification Semester 1, 2018 63 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

Consider any real-valued target function f
Training examples 〈xi, di〉, where di is noisy training value
• di = f(xi) + ei
• ei is random variable (noise) drawn independently for each xi
  according to some Gaussian (normal) distribution with mean = 0

Then the maximum likelihood hypothesis hML is the one that
minimizes the sum of squared errors:

    hML = arg min_{h∈H} Σ_{i=1}^{m} (di − h(xi))^2

COMP9417 ML & DM Classification Semester 1, 2018 64 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

How did we obtain this ?

    hML = arg max_{h∈H} p(D|h)
        = arg max_{h∈H} Π_{i=1}^{m} p(di|h)
        = arg max_{h∈H} Π_{i=1}^{m} (1/√(2πσ^2)) e^{ −(1/2) ((di − h(xi))/σ)^2 }

Recall that we treat each probability p(D|h) as if h = f , i.e., we assume
µ = f(xi) = h(xi), which is the key idea behind maximum likelihood !

COMP9417 ML & DM Classification Semester 1, 2018 65 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Learning A Real Valued Function

Maximize natural log to give simpler expression:

    hML = arg max_{h∈H} Σ_{i=1}^{m} [ ln (1/√(2πσ^2)) − (1/2) ((di − h(xi))/σ)^2 ]
        = arg max_{h∈H} Σ_{i=1}^{m} − (1/2) ((di − h(xi))/σ)^2
        = arg max_{h∈H} Σ_{i=1}^{m} − (di − h(xi))^2

Equivalently, we can minimize the positive version of the expression:

    hML = arg min_{h∈H} Σ_{i=1}^{m} (di − h(xi))^2

COMP9417 ML & DM Classification Semester 1, 2018 66 / 132
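A tiny numerical illustration (assuming numpy; the data are synthetic) that maximising the Gaussian log-likelihood and minimising the sum of squared errors select the same hypothesis, here a slope-only linear model scanned over a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
d = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)     # noisy targets, true slope 2

slopes = np.linspace(0.0, 4.0, 401)
sse    = [np.sum((d - w * x) ** 2) for w in slopes]                  # sum of squared errors
loglik = [np.sum(-0.5 * ((d - w * x) / 0.3) ** 2) for w in slopes]   # Gaussian log-likelihood, up to constants

print(slopes[np.argmin(sse)], slopes[np.argmax(loglik)])   # the same slope is chosen
```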

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Discriminative and generative probabilistic models

• Discriminative models model the posterior probability distribution P (Y |X),
where Y is the target variable and X are the features. That is, given X they
return a probability distribution over Y .

• Generative models model the joint distribution P (Y,X) of the target Y and
  the feature vector X. Once we have access to this joint distribution we can
  derive any conditional or marginal distribution involving the same variables.
  In particular, since P(X) = Σ_y P(Y = y, X) it follows that the posterior
  distribution can be obtained as P(Y|X) = P(Y, X) / Σ_y P(Y = y, X).

• Alternatively, generative models can be described by the likelihood function
P (X|Y ), since P (Y,X) = P (X|Y )P (Y ) and the target or prior distribution
(usually abbreviated to ‘prior’) can be easily estimated or postulated.

• Such models are called ‘generative’ because we can sample from the joint
distribution to obtain new data points together with their labels.
Alternatively, we can use P (Y ) to sample a class and P (X|Y ) to sample an
instance for that class.

COMP9417 ML & DM Classification Semester 1, 2018 67 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Assessing uncertainty in estimates

Suppose we want to estimate the probability θ that an arbitrary e-mail is
spam, so that we can use the appropriate prior distribution.

• The natural thing to do is to inspect n e-mails, determine the number
of spam e-mails d, and set θ̂ = d/n; we don’t really need any
complicated statistics to tell us that.

• However, while this is the most likely estimate of θ – the maximum a
posteriori (MAP) estimate – this doesn’t mean that other values of θ
are completely ruled out.

• We model this by a probability distribution over θ (a Beta distribution
in this case) which is updated each time new information comes in.
This is further illustrated in the figure for a distribution that is more
and more skewed towards spam.

• For each curve, its bias towards spam is given by the area under the
curve and to the right of θ = 1/2.

COMP9417 ML & DM Classification Semester 1, 2018 68 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

Assessing uncertainty in estimates

[Figure: a sequence of distributions over θ on [0, 1], progressively
narrowing and shifting to the right as more e-mails are inspected.]

Each time we inspect an e-mail, we are reducing our uncertainty regarding
the prior spam probability θ. After we inspect two e-mails and observe one
spam, the possible θ values are characterised by a symmetric distribution
around 1/2. If we inspect a third, fourth, . . . , tenth e-mail and each time
(except the first one) it is spam, then this distribution narrows and shifts a
little bit to the right each time. The distribution for n e-mails reaches its
maximum at θ̂MAP = (n − 1)/n (e.g., θ̂MAP = 0.8 for n = 5).
COMP9417 ML & DM Classification Semester 1, 2018 69 / 132
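A sketch of this updating process using scipy, assuming a uniform Beta(1, 1) prior over θ (the slides only say "a Beta distribution", so the choice of prior is an assumption here):

```python
from scipy.stats import beta

def spam_posterior(n_emails, n_spam):
    """Posterior over theta after observing n_spam spam in n_emails,
    assuming a uniform Beta(1, 1) prior."""
    return beta(n_spam + 1, n_emails - n_spam + 1)

for n in (2, 5, 10):
    post = spam_posterior(n, n - 1)     # all but the first e-mail were spam
    mode = (n - 1) / n                  # MAP estimate (n - 1)/n, as on the slide
    bias_to_ham = post.cdf(0.5)         # area to the left of theta = 1/2
    print(n, mode, round(bias_to_ham, 4))
# for n = 10 the probability that e-mails are biased towards ham is about 0.6%
```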

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

The Bayesian perspective

Explicitly modelling the posterior distribution over the parameter θ has a
number of advantages that are usually associated with the ‘Bayesian’
perspective:

• We can precisely characterise the uncertainty that remains about our
estimate by quantifying the spread of the posterior distribution.

• We can obtain a generative model for the parameter by sampling
from the posterior distribution, which contains much more
information than a summary statistic such as the MAP estimate can
convey – so, rather than using a single e-mail with θ = θMAP, our
generative model can contain a number of e-mails with θ sampled
from the posterior distribution.

COMP9417 ML & DM Classification Semester 1, 2018 70 / 132

A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms

The Bayesian perspective

• We can quantify the probability of statements such as ‘e-mails are
biased towards ham’ (the tiny shaded area in the figure demonstrates
that after observing one ham and nine spam e-mails this probability is
very small, about 0.6%).

• We can use one of these distributions to encode our prior beliefs: e.g.,
if we believe that the proportions of spam and ham are typically
50–50, we can take the distribution for n = 2 (the lowest, symmetric
one in the figure on the previous slide) as our prior.

The key point is that probabilities do not have to be interpreted as
estimates of relative frequencies, but can carry the more general meaning
of (possibly subjective) degrees of belief.

Consequently, we can attach a probability distribution to almost anything:
not just features and targets, but also model parameters and even models.

COMP9417 ML & DM Classification Semester 1, 2018 71 / 132

Minimum Description Length Principle

Minimum Description Length Principle

Once again, the MAP hypothesis

    hMAP = arg max_{h∈H} P(D|h) P(h)

Which is equivalent to

    hMAP = arg max_{h∈H} [ log2 P(D|h) + log2 P(h) ]

Or

    hMAP = arg min_{h∈H} [ − log2 P(D|h) − log2 P(h) ]

COMP9417 ML & DM Classification Semester 1, 2018 72 / 132

Minimum Description Length Principle

Minimum Description Length Principle

Interestingly, this is an expression about a quantity of bits.

    hMAP = arg min_{h∈H} [ − log2 P(D|h) − log2 P(h) ]     (1)

From information theory:

The optimal (shortest expected coding length) code for an event
with probability p is − log2 p bits.

COMP9417 ML & DM Classification Semester 1, 2018 73 / 132

Minimum Description Length Principle

Minimum Description Length Principle

So interpret (1):

• − log2 P (h) is length of h under optimal code
• − log2 P (D|h) is length of D given h under optimal code

Note well: assumes optimal encodings, when the priors and likelihoods
are known. In practice, this is difficult, and makes a difference.

COMP9417 ML & DM Classification Semester 1, 2018 74 / 132

Minimum Description Length Principle

Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis

MDL: prefer the hypothesis h that minimizes

    hMDL = arg min_{h∈H} [ L_{C1}(h) + L_{C2}(D|h) ]

where L_C(x) is the description length of x under optimal encoding C

COMP9417 ML & DM Classification Semester 1, 2018 75 / 132

Minimum Description Length Principle

Minimum Description Length Principle

Without loss of generality, classifier is here assumed to be a decision tree

Example: H = decision trees, D = training data labels

• LC1(h) is # bits to describe tree h
• LC2(D|h) is # bits to describe D given h

• Note LC2(D|h) = 0 if examples classified perfectly by h. Need only
describe exceptions

• Hence hMDL trades off tree size for training errors
• i.e., prefer the hypothesis that minimizes

length(h) + length(misclassifications)

COMP9417 ML & DM Classification Semester 1, 2018 76 / 132

Bayes Optimal Classifier

Most Probable Classification of New Instances

So far we’ve sought the most probable hypothesis given the data D (i.e.,
hMAP )

Given new instance x, what is its most probable classification?

• hMAP (x) is not the most probable classification!

COMP9417 ML & DM Classification Semester 1, 2018 77 / 132

Bayes Optimal Classifier

Most Probable Classification of New Instances

Consider:

• Three possible hypotheses:
P (h1|D) = .4, P (h2|D) = .3, P (h3|D) = .3

• Given new instance x,
h1(x) = +, h2(x) = −, h3(x) = −

• What’s most probable classification of x?

COMP9417 ML & DM Classification Semester 1, 2018 78 / 132

Bayes Optimal Classifier

Bayes Optimal Classifier

Bayes optimal classification:

    arg max_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Example:

    P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
    P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
    P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0

COMP9417 ML & DM Classification Semester 1, 2018 79 / 132

Bayes Optimal Classifier

Bayes Optimal Classifier

therefore

    Σ_{hi∈H} P(+|hi) P(hi|D) = .4
    Σ_{hi∈H} P(−|hi) P(hi|D) = .6

and

    arg max_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = −

No other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average

COMP9417 ML & DM Classification Semester 1, 2018 80 / 132
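The same example in code: the MAP hypothesis h1 predicts +, but the Bayes optimal vote is −:

```python
posteriors  = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}      # P(h_i | D)
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}      # h_i(x) for the new instance

def bayes_optimal(posteriors, predictions, values=('+', '-')):
    # arg max_v sum_i P(v | h_i) P(h_i | D); here P(v | h_i) is 1 if h_i predicts v, else 0
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v) for v in values}
    return max(score, key=score.get), score

print(bayes_optimal(posteriors, predictions))   # ('-', {'+': 0.4, '-': 0.6})
```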

Bayes Optimal Classifier

Bayes Error

What is the best performance attainable by a (two-class) classifier ?

Define the probability of error for classifying some instance x by

    P(error|x) = P(class1|x)  if we predict class2
               = P(class2|x)  if we predict class1

This gives

    P(error) = Σ_x P(error|x) P(x)

So we can justify the use of the decision rule

    if P(class1|x) > P(class2|x) then predict class1
    else predict class2

On average, this decision rule minimises probability of classification error.
COMP9417 ML & DM Classification Semester 1, 2018 81 / 132

Naive Bayes

Naive Bayes Classifier

Along with decision trees, neural networks, nearest neighbour, one of the
most practical learning methods.

When to use

• Moderate or large training set available
• Attributes that describe instances are conditionally independent given

classification

Successful applications:

• Classifying text documents
• Gaussian Naive Bayes for real-valued data

COMP9417 ML & DM Classification Semester 1, 2018 82 / 132

Naive Bayes

Naive Bayes Classifier

Assume target function f : X → V , where each instance x described by
attributes 〈a1, a2 . . . an〉.
Most probable value of f(x) is:

    vMAP = arg max_{vj∈V} P(vj | a1, a2 . . . an)
         = arg max_{vj∈V} P(a1, a2 . . . an | vj) P(vj) / P(a1, a2 . . . an)
         = arg max_{vj∈V} P(a1, a2 . . . an | vj) P(vj)

COMP9417 ML & DM Classification Semester 1, 2018 83 / 132

Naive Bayes

Naive Bayes Classifier

Naive Bayes assumption:

    P(a1, a2 . . . an | vj) = Π_i P(ai | vj)

• Attributes are statistically independent (given the class value)
• which means knowledge about the value of a particular attribute tells
  us nothing about the value of another attribute (if the class is known)

which gives

Naive Bayes classifier:  vNB = arg max_{vj∈V} P(vj) Π_i P(ai | vj)

COMP9417 ML & DM Classification Semester 1, 2018 84 / 132

Naive Bayes

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)

    vNB = arg max_{vj∈V} P̂(vj) Π_{ai∈x} P̂(ai|vj)

COMP9417 ML & DM Classification Semester 1, 2018 85 / 132

Naive Bayes

Naive Bayes Example

Consider PlayTennis again . . .

Outlook Temperature Humidity Windy
Yes No Yes No Yes No Yes No

Sunny 2 3 Hot 2 2 High 3 4 False 6 2
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1

Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Play
Yes No
9 5

9/14 5/14

COMP9417 ML & DM Classification Semester 1, 2018 86 / 132

Naive Bayes

Naive Bayes Example

Say we have the new instance:

〈Outlk = sun, Temp = cool, Humid = high, Wind = true〉

We want to compute:

    vNB = arg max_{vj∈{“yes”,“no”}} P(vj) Π_i P(ai|vj)

COMP9417 ML & DM Classification Semester 1, 2018 87 / 132

Naive Bayes

Naive Bayes Example

So we first calculate the likelihood of the two classes, “yes” and “no”

for “yes” = P(y) × P(sun|y) × P(cool|y) × P(high|y) × P(true|y)

    0.0053 = 9/14 × 2/9 × 3/9 × 3/9 × 3/9

for “no” = P(n) × P(sun|n) × P(cool|n) × P(high|n) × P(true|n)

    0.0206 = 5/14 × 3/5 × 1/5 × 4/5 × 3/5

COMP9417 ML & DM Classification Semester 1, 2018 88 / 132

Naive Bayes

Naive Bayes Example

Then convert to a probability by normalisation

    P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
    P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795

The Naive Bayes classification is “no”.

COMP9417 ML & DM Classification Semester 1, 2018 89 / 132
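The same calculation in Python, reading the relative frequencies straight off the PlayTennis counts (no smoothing, exactly as in the worked example):

```python
# P(attribute value | class) taken from the PlayTennis count table above
likelihoods = {
    'yes': {'sunny': 2/9, 'cool': 3/9, 'high': 3/9, 'windy_true': 3/9},
    'no':  {'sunny': 3/5, 'cool': 1/5, 'high': 4/5, 'windy_true': 3/5},
}
priors = {'yes': 9/14, 'no': 5/14}

scores = {}
for cls in priors:
    score = priors[cls]
    for attr_value in ('sunny', 'cool', 'high', 'windy_true'):
        score *= likelihoods[cls][attr_value]
    scores[cls] = score

total = sum(scores.values())
print({cls: round(s, 4) for cls, s in scores.items()})           # {'yes': 0.0053, 'no': 0.0206}
print({cls: round(s / total, 3) for cls, s in scores.items()})   # {'yes': 0.205, 'no': 0.795}
```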

Naive Bayes

Naive Bayes: Subtleties

Conditional independence assumption is often violated

    P(a1, a2 . . . an|vj) = Π_i P(ai|vj)

• …but it works surprisingly well anyway. Note don’t need estimated
  posteriors P̂(vj|x) to be correct; need only that

    arg max_{vj∈V} P̂(vj) Π_i P̂(ai|vj) = arg max_{vj∈V} P(vj) P(a1 . . . , an|vj)

  i.e. maximum probability is assigned to correct class

• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g.

identical attributes)

COMP9417 ML & DM Classification Semester 1, 2018 90 / 132

Naive Bayes

Naive Bayes: “zero-frequency” problem

What if none of the training instances with target value vj have attribute
value ai? Then

    P̂(ai|vj) = 0, and…

    P̂(vj) Π_i P̂(ai|vj) = 0

Pseudo-counts add 1 to each count (a version of the Laplace Estimator)

COMP9417 ML & DM Classification Semester 1, 2018 91 / 132

Naive Bayes

Naive Bayes: “zero-frequency” problem

• In some cases adding a constant different from 1 might be more
  appropriate
• Example: attribute outlook for class yes

      Sunny               Overcast            Rainy
      (2 + µ/3)/(9 + µ)   (4 + µ/3)/(9 + µ)   (3 + µ/3)/(9 + µ)

• Weights don’t need to be equal (if they sum to 1) – a form of prior

      Sunny               Overcast            Rainy
      (2 + µp1)/(9 + µ)   (4 + µp2)/(9 + µ)   (3 + µp3)/(9 + µ)

COMP9417 ML & DM Classification Semester 1, 2018 92 / 132

Naive Bayes

Naive Bayes: “zero-frequency” problem

This generalisation is a Bayesian estimate for P̂(ai|vj)

    P̂(ai|vj) ← (nc + mp) / (n + m)

where

• n is number of training examples for which v = vj ,
• nc number of examples for which v = vj and a = ai
• p is prior estimate for P̂ (ai|vj)
• m is weight given to prior (i.e. number of “virtual” examples)

This is called the m-estimate of probability.

COMP9417 ML & DM Classification Semester 1, 2018 93 / 132
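The m-estimate as a small helper function; Laplace's add-one rule corresponds to m equal to the number of attribute values with a uniform prior p:

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate of P(a_i | v_j): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# outlook = overcast given class yes occurred 4 times in 9 examples;
# Laplace smoothing = m-estimate with m = 3 (three outlook values) and uniform p = 1/3
print(m_estimate(4, 9, p=1/3, m=3))   # (4 + 1) / (9 + 3) = 0.4166...
```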

Naive Bayes

Naive Bayes: missing values

• Training: instance is not included in frequency count for attribute
value-class combination

• Classification: attribute will be omitted from calculation

COMP9417 ML & DM Classification Semester 1, 2018 94 / 132

Naive Bayes

Naive Bayes: numeric attributes

• Usual assumption: attributes have a normal or Gaussian probability
  distribution (given the class)
• The probability density function for the normal distribution is defined
  by two parameters:

  The sample mean µ:

      µ = (1/n) Σ_{i=1}^{n} xi

  The standard deviation σ:

      σ = √( (1/(n − 1)) Σ_{i=1}^{n} (xi − µ)^2 )

COMP9417 ML & DM Classification Semester 1, 2018 95 / 132

Naive Bayes

Naive Bayes: numeric attributes

Then we have the density function f(x):

    f(x) = (1/(√(2π) σ)) e^{ −(x − µ)^2 / (2σ^2) }

Example: continuous attribute temperature with mean = 73 and standard
deviation = 6.2. Density value

    f(temperature = 66|“yes”) = (1/(√(2π) · 6.2)) e^{ −(66 − 73)^2 / (2 · 6.2^2) } = 0.0340

Missing values during training are not included in calculation of mean and
standard deviation.

COMP9417 ML & DM Classification Semester 1, 2018 96 / 132
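The density calculation in code, reproducing the value above:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(round(gaussian_pdf(66, mu=73, sigma=6.2), 4))   # 0.034, i.e. the 0.0340 above
```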

Naive Bayes

Naive Bayes: numeric attributes

Note: the normal distribution is based on the simple exponential function

    f(x) = e^{−|x|^m}

As the power m in the exponent increases, the function approaches a step
function. Where m = 2,

    f(x) = e^{−|x|^2}

and this is the basis of the normal distribution – the various constants are
the result of scaling so that the integral (the area under the curve from
−∞ to +∞) is equal to 1.
from “Statistical Computing” by Michael J. Crawley (2002) Wiley.

COMP9417 ML & DM Classification Semester 1, 2018 97 / 132

Naive Bayes

Categorical random variables

Categorical variables or features (also called discrete or nominal) are
ubiquitous in machine learning.

• Perhaps the most common form of the Bernoulli distribution models
whether or not a word occurs in a document. That is, for the i-th
word in our vocabulary we have a random variable Xi governed by a
Bernoulli distribution. The joint distribution over the bit vector
X = (X1, . . . , Xk) is called a multivariate Bernoulli distribution.

• Variables with more than two outcomes are also common: for
example, every word position in an e-mail corresponds to a categorical
variable with k outcomes, where k is the size of the vocabulary. The
multinomial distribution manifests itself as a count vector: a
histogram of the number of occurrences of all vocabulary words in a
document. This establishes an alternative way of modelling text
documents that allows the number of occurrences of a word to
influence the classification of a document.

COMP9417 ML & DM Classification Semester 1, 2018 98 / 132

Naive Bayes

Categorical random variables

Both these document models are in common use. Despite their
differences, they both assume independence between word occurrences,
generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of

the multinomial distribution, which assumes that words at different
word positions are drawn independently from the same categorical
distribution.

• In the multivariate Bernoulli model we assume that the bits in a bit
vector are statistically independent, which allows us to compute the
joint probability of a particular bit vector (x1, . . . , xk) as the product
of the probabilities of each component P (Xi = xi).

• In practice, such word independence assumptions are often not true:
if we know that an e-mail contains the word ‘Viagra’, we can be quite
sure that it will also contain the word ‘pill’. Violated independence
assumptions reduce the quality of probability estimates but may still
allow good classification performance.

COMP9417 ML & DM Classification Semester 1, 2018 99 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Example application: Learning to Classify Text

In machine learning, the classic example of applications of Naive Bayes is
learning to classify text documents.

Here is a simplified version in the multinomial document model.

COMP9417 ML & DM Classification Semester 1, 2018 100 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Why?

• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among most effective algorithms

What attributes shall we use to represent text documents??

COMP9417 ML & DM Classification Semester 1, 2018 101 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Target concept Interesting? : Document→ {+,−}
1 Represent each document by vector of words

• one attribute per word position in document
2 Learning: Use training examples to estimate

• P (+)
• P (−)
• P (doc|+)
• P (doc|−)

COMP9417 ML & DM Classification Semester 1, 2018 102 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Naive Bayes conditional independence assumption

    P(doc|vj) = Π_{i=1}^{length(doc)} P(ai = wk|vj)

where P(ai = wk|vj) is probability that word in position i is wk, given vj

one more assumption: P(ai = wk|vj) = P(am = wk|vj), ∀i, m

“bag of words”

COMP9417 ML & DM Classification Semester 1, 2018 103 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Learn_naive_Bayes_text(Examples, V)

// collect all words and other tokens that occur in Examples
Vocabulary ← all distinct words and other tokens in Examples

// calculate the required P(vj) and P(wk|vj) probability terms
for each target value vj in V do
    docsj ← subset of Examples for which the target value is vj
    P(vj) ← |docsj| / |Examples|
    Textj ← a single document created by concatenating all members of docsj
    n ← total number of words in Textj (counting duplicate words multiple times)
    for each word wk in Vocabulary
        nk ← number of times word wk occurs in Textj
        P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

COMP9417 ML & DM Classification Semester 1, 2018 104 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Learning to Classify Text

Classify_naive_Bayes_text(Doc)

• positions ← all word positions in Doc that contain tokens found in
  Vocabulary
• Return vNB, where

    vNB = arg max_{vj∈V} P(vj) Π_{i∈positions} P(ai|vj)

COMP9417 ML & DM Classification Semester 1, 2018 105 / 132
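A compact Python version of the two procedures above for the multinomial ("bag of words") model with add-one smoothing (the whitespace tokeniser and the names are simplifications of mine):

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, class_label). Returns priors, word probs, vocabulary."""
    vocab = {w for doc, _ in examples for w in doc.split()}
    classes = {label for _, label in examples}
    priors, word_probs = {}, {}
    for v in classes:
        docs_v = [doc for doc, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        counts = Counter(w for doc in docs_v for w in doc.split())
        n = sum(counts.values())
        # add-one smoothing: (n_k + 1) / (n + |Vocabulary|)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, word_probs, vocab

def classify_naive_bayes_text(doc, priors, word_probs, vocab):
    words = [w for w in doc.split() if w in vocab]     # positions with known tokens only
    scores = {v: math.log(priors[v]) +
                 sum(math.log(word_probs[v][w]) for w in words)
              for v in priors}                         # log-space avoids underflow
    return max(scores, key=scores.get)

examples = [("b d e b b d e", "spam"), ("a d a d e a e e", "ham")]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("b b a", *model))      # 'spam'
```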

Naive Bayes Learning and using a naive Bayes model for classification

Application: 20 Newsgroups

Given: 1000 training documents from each group
Learning task: classify each new document by newsgroup it came from

comp.graphics misc.forsale

comp.os.ms-windows.misc rec.autos

comp.sys.ibm.pc.hardware rec.motorcycles

comp.sys.mac.hardware rec.sport.baseball

comp.windows.x rec.sport.hockey

alt.atheism sci.space

soc.religion.christian sci.crypt

talk.religion.misc sci.electronics

talk.politics.mideast sci.med

talk.politics.misc

talk.politics.guns

Naive Bayes: 89% classification accuracy

COMP9417 ML & DM Classification Semester 1, 2018 106 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Article from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu

From: xxx@yyy.zzz.edu (John Doe)

Subject: Re: This year’s biggest and worst (opinion)…

Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate

for pleasant surprise is Alex Zhitnik. He came highly touted as

a defensive defenseman, but he’s clearly much more than that.

Great skater and hard shot (though wish he were more accurate).

In fact, he pretty much allowed the Kings to trade away that

huge defensive liability Paul Coffey. Kelly Hrudey is only the

biggest disappointment if you thought he was any good to begin

with. But, at best, he’s only a mediocre goaltender. A better

choice would be Tomas Sandstrom, though not through any fault

of his own, but because some thugs in Toronto decided …

COMP9417 ML & DM Classification Semester 1, 2018 107 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Learning Curve for 20 Newsgroups

[Figure: learning curves on 20News — accuracy vs. training set size
(100 to 10000 documents, 1/3 withheld for test) for Bayes, TFIDF and
PRTFIDF.]

COMP9417 ML & DM Classification Semester 1, 2018 108 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Probabilistic decision rules

We have chosen one of the possible distributions to model our data X as
coming from either class.

• The more different P (X|Y = spam) and P (X|Y = ham) are, the
more useful the features X are for classification.

• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam)
and P (X = x|Y = ham), and apply one of several possible decision
rules:

  maximum likelihood (ML) – predict arg max_y P(X = x|Y = y);
  maximum a posteriori (MAP) – predict arg max_y P(X = x|Y = y) P(Y = y).

The relation between these two decision rules is that ML classification is
equivalent to MAP classification with a uniform class distribution.

COMP9417 ML & DM Classification Semester 1, 2018 109 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Probabilistic decision rules

We again use the example of Naive Bayes for text classification to
illustrate, using both the multinomial and multivariate Bernoulli models.

COMP9417 ML & DM Classification Semester 1, 2018 110 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model

Suppose our vocabulary contains three words a, b and c, and we use a
multivariate Bernoulli model for our e-mails, with parameters

θ⊕ = (0.5, 0.67, 0.33)        θ⊖ = (0.67, 0.33, 0.33)

This means, for example, that the presence of b is twice as likely in spam
(+), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is
described by the bit vector x = (1, 1, 0). We obtain likelihoods

    P(x|⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
    P(x|⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148

The ML classification of x is thus spam.

COMP9417 ML & DM Classification Semester 1, 2018 111 / 132
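Both likelihood computations as a Python sketch, using the parameters of this worked example and of the multinomial example that follows a couple of slides later (the multinomial coefficient cancels in the likelihood ratio but is included for completeness):

```python
from math import factorial

def bernoulli_likelihood(bits, theta):
    """P(x | class) under the multivariate Bernoulli model for a bit vector."""
    p = 1.0
    for x, t in zip(bits, theta):
        p *= t if x == 1 else (1 - t)
    return p

def multinomial_likelihood(counts, theta):
    """P(x | class) under the multinomial model for a count vector."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, t in zip(counts, theta):
        p *= t ** c
    return p

# multivariate Bernoulli: the e-mail contains a and b but not c
print(bernoulli_likelihood((1, 1, 0), (0.5, 0.67, 0.33)))   # ~0.222 (spam)
print(bernoulli_likelihood((1, 1, 0), (0.67, 0.33, 0.33)))  # ~0.148 (ham)

# multinomial (later slide): three a's, one b, no c
print(multinomial_likelihood((3, 1, 0), (0.3, 0.5, 0.2)))   # 0.054  (spam)
print(multinomial_likelihood((3, 1, 0), (0.6, 0.2, 0.2)))   # 0.1728 (ham)
```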

Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model

In the case of two classes it is often convenient to work with likelihood
ratios and odds.

• The likelihood ratio can be calculated as

    P(x|⊕) / P(x|⊖) = (0.5/0.67) · (0.67/0.33) · ((1 − 0.33)/(1 − 0.33)) = 3/2 > 1

• This means that the MAP classification of x is also spam if the prior
  odds are more than 2/3, but ham if they are less than that.

• For example, with 33% spam and 67% ham the prior odds are
  P(⊕)/P(⊖) = 0.33/0.67 = 1/2, resulting in a posterior odds of

    P(⊕|x) / P(⊖|x) = ( P(x|⊕) / P(x|⊖) ) · ( P(⊕) / P(⊖) )

                    = 3/2 · 1/2 = 3/4 < 1

  In this case the likelihood ratio for x is not strong enough to push the
  decision away from the prior.

COMP9417 ML & DM Classification Semester 1, 2018 112 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Prediction using a naive Bayes model

Alternatively, we can employ a multinomial model. The parameters of a
multinomial establish a distribution over the words in the vocabulary, say

    θ⊕ = (0.3, 0.5, 0.2)        θ⊖ = (0.6, 0.2, 0.2)

The e-mail to be classified contains three occurrences of word a, one
single occurrence of word b and no occurrences of word c, and hence is
described by the count vector x = (3, 1, 0). The total number of
vocabulary word occurrences is n = 4. We obtain likelihoods

    P(x|⊕) = (4!/(3! 1! 0!)) · 0.3^3 · 0.5^1 · 0.2^0 = 0.054
    P(x|⊖) = (4!/(3! 1! 0!)) · 0.6^3 · 0.2^1 · 0.2^0 = 0.1728

The likelihood ratio is (0.3/0.6)^3 (0.5/0.2)^1 (0.2/0.2)^0 = 5/16. The ML
classification of x is thus ham, the opposite of the multivariate Bernoulli
model. This is mainly because of the three occurrences of word a, which
provide strong evidence for ham.

COMP9417 ML & DM Classification Semester 1, 2018 113 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Training data for naive Bayes

A small e-mail data set described by count vectors.

    E-mail  #a  #b  #c  Class
    e1       0   3   0    +
    e2       0   3   3    +
    e3       3   0   0    +
    e4       2   3   0    +
    e5       4   3   0    −
    e6       4   0   3    −
    e7       3   0   0    −
    e8       0   0   0    −

COMP9417 ML & DM Classification Semester 1, 2018 114 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Training data for naive Bayes

The same data set described by bit vectors.

    E-mail  a?  b?  c?  Class
    e1       0   1   0    +
    e2       0   1   1    +
    e3       1   0   0    +
    e4       1   1   0    +
    e5       1   1   0    −
    e6       1   0   1    −
    e7       1   0   0    −
    e8       0   0   0    −

COMP9417 ML & DM Classification Semester 1, 2018 115 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

Consider the following e-mails consisting of five words a, b, c, d, e:

    e1: b d e b b d e          e5: a b a b a b a e d
    e2: b c e b b d d e c c    e6: a c a c a c a e d
    e3: a d a d e a e e        e7: e a e d a e a
    e4: b a d b e d a b        e8: d e d e d

We are told that the e-mails on the left are spam and those on the right
are ham, and so we use them as a small training set to train our Bayesian
classifier.
• First, we decide that d and e are so-called stop words that are too
  common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary.

COMP9417 ML & DM Classification Semester 1, 2018 116 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

For the multinomial model, we represent each e-mail as a count vector, as
before.
• In order to estimate the parameters of the multinomial, we sum up the
  count vectors for each class, which gives (5, 9, 3) for spam and
  (11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for
  each vocabulary word, which brings the total number of occurrences of
  vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
  θ̂⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and
  θ̂⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham.

COMP9417 ML & DM Classification Semester 1, 2018 117 / 132

Naive Bayes Learning and using a naive Bayes model for classification

Training a naive Bayes model

In the multivariate Bernoulli model e-mails are represented by bit vectors,
as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and
  (3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in order to get an estimate of the probability of a document containing a particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one containing each word and one containing none of them.
• This results in the estimated parameter vectors θ̂⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and θ̂⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham.

COMP9417 ML & DM Classification Semester 1, 2018 118 / 132

Linear discriminants

Linear discriminants

Many forms of linear discriminant exist in statistics and machine learning, e.g.,

• Fisher's Linear Discriminant Analysis
  • the basic linear classifier in lecture 1 is a version of this
• Logistic Regression
  • a probabilistic linear classifier
• Perceptron
  • a linear threshold classifier
  • an early version of an artificial "neuron"
  • still a useful method, and a source of ideas

COMP9417 ML & DM Classification Semester 1, 2018 119 / 132

Linear discriminants

Logistic regression

In the case of a two-class problem, model the probability of one class, P(Y = 1), vs. the alternative, 1 − P(Y = 1):

P(Y = 1|x) = 1 / (1 + e^(−wᵀx))

or, equivalently,

ln [ P(Y = 1|x) / (1 − P(Y = 1|x)) ] = wᵀx

The quantity on the l.h.s. is called the logit, and all we are doing is fitting a linear model for the logit.

Note: fitting this model is actually more complex than fitting a linear regression, so we omit the details.

Generalises to multi-class versions (Y can take more than two values).

COMP9417 ML & DM Classification Semester 1, 2018 120 / 132

Perceptrons What are perceptrons ?

Perceptron

A linear classifier that can achieve perfect separation on linearly separable data is the perceptron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.

COMP9417 ML & DM Classification Semester 1, 2018 121 / 132

Perceptrons What are perceptrons ?

Perceptron

Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.

Source: http://en.wikipedia.org/w/index.php?curid=47541432

COMP9417 ML & DM Classification Semester 1, 2018 122 / 132

Perceptrons What are perceptrons ?

Perceptron

The output o is a thresholded sum of products of the inputs and their weights:

o(x1, . . . , xn) =
    +1 if w0 + w1x1 + · · · + wnxn > 0
    −1 otherwise

COMP9417 ML & DM Classification Semester 1, 2018 123 / 132

Perceptrons What are perceptrons ?

Perceptron

Or in vector notation:

o(x) =
    +1 if w · x > 0
    −1 otherwise

COMP9417 ML & DM Classification Semester 1, 2018 124 / 132
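
Both forms of the decision rule are straightforward to express in code. The following is a minimal sketch (our own illustration, not part of the lecture materials) in Python with NumPy; the function names and the toy weights are assumptions made for the example:

    import numpy as np

    def perceptron_output(w, x, w0=0.0):
        # Threshold form: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1
        return 1 if w0 + np.dot(w, x) > 0 else -1

    def perceptron_output_vec(w_aug, x):
        # Vector form with the bias absorbed into the weight vector,
        # i.e. x is augmented with a constant input x0 = 1
        x_aug = np.append(x, 1.0)
        return 1 if np.dot(w_aug, x_aug) > 0 else -1

    w, w0 = np.array([0.5, -0.3]), 0.1
    x = np.array([1.0, 1.0])
    print(perceptron_output(w, x, w0))                  # 1, since 0.1 + 0.5 - 0.3 > 0
    print(perceptron_output_vec(np.append(w, w0), x))   # same output

Absorbing the bias as a weight on a constant input is the same trick as the "homogeneous coordinates" used by the training algorithm later in these slides.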

Perceptrons Perceptrons are linear classifiers

Decision Surface of a Perceptron

[Figure: two scatter plots over inputs x1 and x2. (a) Positive (+) and negative (−) examples that a perceptron's linear decision surface separates correctly. (b) A configuration of + and − examples (the XOR arrangement) that no single line separates.]

Represents some useful functions

• What weights represent o(x1, x2) = AND(x1, x2)?
• What weights represent o(x1, x2) = XOR(x1, x2)?
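
As a sketch of an answer to the first question (our own example, not the lecture's): with inputs in {0, 1}, the weights w0 = −1.5, w1 = w2 = 1 represent AND, since the weighted sum exceeds zero only when both inputs are 1. For XOR no weights exist: the constraints o(0,0) = −1, o(1,1) = −1, o(0,1) = +1 and o(1,0) = +1 require w0 ≤ 0, w0 + w1 + w2 ≤ 0, w0 + w2 > 0 and w0 + w1 > 0, which are contradictory. The check below verifies the AND weights exhaustively:

    # Verify that w0 = -1.5, w1 = w2 = 1 computes AND(x1, x2) as a threshold unit.
    def threshold_unit(w0, w1, w2, x1, x2):
        return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

    for x1 in (0, 1):
        for x2 in (0, 1):
            want = 1 if (x1 == 1 and x2 == 1) else -1
            assert threshold_unit(-1.5, 1.0, 1.0, x1, x2) == want
    print("w0 = -1.5, w1 = w2 = 1 reproduces AND on all four inputs")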

COMP9417 ML & DM Classification Semester 1, 2018 125 / 132

Perceptrons Perceptrons are linear classifiers

Decision Surface of a Perceptron

So some functions are not representable by a single perceptron

• e.g., not linearly separable
• a labelled data set is linearly separable if there is a linear decision boundary that separates the classes

• for non-linearly separable data we’ll need something else
• e.g., networks of these …
• the start of “deep” networks …
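
To make the "networks of these" remark concrete, here is a small sketch (our own construction, not from the lecture) showing that XOR becomes representable once perceptron outputs feed into another perceptron, using XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)):

    # A two-layer arrangement of threshold units computing XOR.
    # Units here output 0/1 rather than -1/+1, purely for convenience.
    def unit(w0, w1, w2, x1, x2):
        return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

    def xor_net(x1, x2):
        h_or   = unit(-0.5,  1.0,  1.0, x1, x2)    # OR(x1, x2)
        h_nand = unit( 1.5, -1.0, -1.0, x1, x2)    # NAND(x1, x2)
        return unit(-1.5, 1.0, 1.0, h_or, h_nand)  # AND of the two hidden outputs

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table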

COMP9417 ML & DM Classification Semester 1, 2018 126 / 132

Perceptrons How to train

Perceptron learning

Key idea:

Learning is “finding a good set of weights”

Perceptron learning is simply an iterative weight-update scheme:

wi ← wi + ∆wi

where the weight update ∆wi depends only on misclassified examples and
is modulated by a “smoothing” parameter η typically referred to as the
“learning rate”.

Can prove that perceptron learning will converge:

• if training data is linearly separable
• and η sufficiently small

COMP9417 ML & DM Classification Semester 1, 2018 127 / 132
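
As a concrete illustration of this iterative weight-update scheme, the loop below is a minimal Python/NumPy sketch (not the official course code) that mirrors the "Perceptron training algorithm" pseudocode given a few slides further on. It uses homogeneous coordinates (a constant 1 appended to each input) and the update w ← w + ηyx on misclassified examples; the function name and the toy data set are our own:

    import numpy as np

    def perceptron_train(X, y, eta=1.0, max_epochs=1000):
        # X: n x d array of inputs, y: array of +1/-1 labels.
        Xh = np.hstack([X, np.ones((X.shape[0], 1))])    # homogeneous coordinates
        w = np.zeros(Xh.shape[1])                        # other initialisations possible
        for _ in range(max_epochs):
            converged = True
            for xi, yi in zip(Xh, y):
                if yi * np.dot(w, xi) <= 0:              # misclassified (or on the boundary)
                    w += eta * yi * xi                   # perceptron update rule
                    converged = False
            if converged:
                break
        return w

    # Toy linearly separable data in two dimensions
    X = np.array([[0.0, 0.0], [0.0, 1.0], [0.2, 0.2], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])
    w = perceptron_train(X, y)
    print(w)   # a weight vector that separates the two classes

With η = 1 this converges after a small number of passes on the toy data; if the data were not linearly separable, the loop would only terminate because of max_epochs.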

Perceptrons How to train

Perceptron learning

The perceptron iterates over the training set, updating the weight vector
every time it encounters an incorrectly classified example.

• For example, let xi be a misclassified positive example, then we have
yi = +1 and w · xi < t. We therefore want to find w′ such that w′ · xi > w · xi, which moves the decision boundary towards and
hopefully past xi.

• This can be achieved by calculating the new weight vector as w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume it is set to 1 for now). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.

• Similarly, if xj is a misclassified negative example, then we have
yj = −1 and w · xj > t. In this case we calculate the new weight
vector as w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.

COMP9417 ML & DM Classification Semester 1, 2018 128 / 132

Perceptrons How to train

Perceptron learning

• The two cases can be combined in a single update rule:

w′ = w + ηyixi

• Here yi acts to change the sign of the update, corresponding to whether a positive or a negative example was misclassified
• This is the basis of the perceptron training algorithm for linear classification
• The algorithm just iterates over the training examples, applying the weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.

COMP9417 ML & DM Classification Semester 1, 2018 129 / 132

Perceptrons How to train

Perceptron training algorithm

Algorithm Perceptron(D, η)    // perceptron training for linear classification
Input:  labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
    w ← 0                            // other initialisations of the weight vector are possible
    converged ← false
    while converged = false do
        converged ← true
        for i = 1 to |D| do
            if yi (w · xi) ≤ 0 then  // i.e., ŷi ≠ yi
                w ← w + ηyixi
                converged ← false    // we changed w, so we haven't converged yet
            end
        end
    end

COMP9417 ML & DM Classification Semester 1, 2018 130 / 132

Perceptrons How to train

Perceptron training – varying learning rate

[Figure: three scatter plots of the same two-class data set (axes from −3 to 3), showing the decision boundaries learned with three different learning rates.]

(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update. (middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence. The starting point in all three cases was the basic linear classifier.

COMP9417 ML & DM Classification Semester 1, 2018 131 / 132

Summary

Summary

• Two major frameworks for classification by linear models:
  • Distance-based. The key ideas are geometric.
  • Probabilistic. The key ideas are Bayesian.
• We also discussed Logistic Regression and the Perceptron, a simple form of threshold model.
• These classifiers are also, in some sense, linear models
• So we have established the basis for learning classifiers
• Later we will see how to extend these methods by building on the same ideas

COMP9417 ML & DM Classification Semester 1, 2018 132 / 132