Supervised Learning – Classification
COMP9417 Machine Learning and Data Mining
Last revision: 14 March 2018
COMP9417 ML & DM Classification Semester 1, 2018 1 / 132
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course
“Machine Learning” by A. Srinivasan
BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Classification Semester 1, 2018 2 / 132
Aims
This lecture will introduce you to machine learning approaches to the
problem of classification. Following it you should be able to reproduce
theoretical results, outline algorithmic techniques and describe practical
applications for the topics:
• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method
• define MAP and ML inference using Bayes theorem
• define the Bayes optimal classification rule in terms of MAP inference
• outline the Naive Bayes classification algorithm
• describe typical applications of Naive Bayes for text classification
• outline the Perceptron classification algorithm
• outline the logistic regression classification algorithm
Note: slides with titles marked ∗ are for background only.
COMP9417 ML & DM Classification Semester 1, 2018 3 / 132
Introduction
Introduction
Classification (sometimes called concept learning) methods dominate
machine learning . . .
. . . however, they often lack the convenient mathematical properties of
regression, so they are more complicated to analyse. The idea is to learn a
classifier, which is usually a function mapping from an input data point to
one of a set of discrete outputs, i.e., the classes.
We will mostly focus on their advantages and disadvantages as learning
methods first, and point to unifying ideas and approaches where
applicable. In this lecture we focus on classification methods that are
essentially linear models . . .
and in later lectures we will see other, more expressive, classifier learning
methods.
COMP9417 ML & DM Classification Semester 1, 2018 4 / 132
Nearest neighbour classification
Nearest neighbour classification
• Related to the simplest form of learning: rote learning or
memorization
• Training instances are searched for instance that most closely
resembles new or query instance
• The instances themselves represent the knowledge
• Called: instance-based, memory-based learning or case-based learning;
often a form of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars,
should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, lowess, . . .
• Ideas also important for unsupervised methods, e.g., clustering (later
lectures)
COMP9417 ML & DM Classification Semester 1, 2018 5 / 132
Distance-based models
Minkowski distance
Minkowski distance If X = ℝ^d, the Minkowski distance of order p > 0 is
defined as

Disp(x, y) = ( ∑_{j=1}^{d} |xj − yj|^p )^{1/p} = ||x − y||p

where ||z||p = ( ∑_{j=1}^{d} |zj|^p )^{1/p} is the p-norm (sometimes denoted the Lp
norm) of the vector z.
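For concreteness, a minimal NumPy sketch of this definition; the function name and the example points are illustrative, not from the slides.

import numpy as np

def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p (p > 0) between vectors x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski_distance(x, y, p=1))              # Manhattan distance: 7.0
print(minkowski_distance(x, y, p=2))              # Euclidean distance: 5.0
print(max(abs(a - b) for a, b in zip(x, y)))      # Chebyshev (p -> infinity): 4.0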
COMP9417 ML & DM Classification Semester 1, 2018 6 / 132
Distance-based models
Minkowski distance
• The 2-norm refers to the familiar Euclidean distance
Dis2(x, y) = √( ∑_{j=1}^{d} (xj − yj)² ) = √( (x − y)^T (x − y) )
which measures distance ‘as the crow flies’.
• The 1-norm denotes Manhattan distance, also called cityblock
distance:
Dis1(x, y) = ∑_{j=1}^{d} |xj − yj|
This is the distance if we can only travel along coordinate axes.
COMP9417 ML & DM Classification Semester 1, 2018 7 / 132
Distance-based models
Minkowski distance
• If we now let p grow larger, the distance will be more and more
dominated by the largest coordinate-wise distance, from which we can
infer that Dis∞(x,y) = maxj |xj − yj |; this is also called Chebyshev
distance.
• You will sometimes see references to the 0-norm (or L0 norm) which
counts the number of non-zero elements in a vector. The
corresponding distance then counts the number of positions in which
vectors x and y differ. This is not strictly a Minkowski distance;
however, we can define it as

Dis0(x, y) = ∑_{j=1}^{d} (xj − yj)⁰ = ∑_{j=1}^{d} I[xj ≠ yj]

under the understanding that x⁰ = 0 for x = 0 and 1 otherwise.
COMP9417 ML & DM Classification Semester 1, 2018 8 / 132
Distance-based models
Minkowski distance
Sometimes the data X is not naturally in Rd, but if we can turn it into
Boolean features, or character sequences, we can still apply distance
measures. For example:
• If x and y are binary strings, this is also called the Hamming
distance. Alternatively, we can see the Hamming distance as the
number of bits that need to be flipped to change x into y.
• For non-binary strings of unequal length this can be generalised to the
notion of edit distance or Levenshtein distance.
COMP9417 ML & DM Classification Semester 1, 2018 9 / 132
Distance-based models
Circles and ellipses
Lines connecting points at order-p Minkowski distance 1 from the origin
for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square
in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 4;
p = 8; and p =∞ (Chebyshev distance, the blue rectangle). Notice that
for points on the coordinate axes all distances agree. For the other points,
our reach increases with p; however, if we require a rotation-invariant
distance metric then Euclidean distance is our only choice.
COMP9417 ML & DM Classification Semester 1, 2018 10 / 132
Distance-based models
Distance metric
Distance metric Given an instance space X , a distance metric is a
function Dis : X × X → R such that for any x, y, z ∈ X :
• distances between a point and itself are zero: Dis(x, x) = 0;
• all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
• distances are symmetric: Dis(y, x) = Dis(x, y);
• detours can not shorten the distance:
Dis(x, z) ≤ Dis(x, y) + Dis(y, z).
If the second condition is weakened to a non-strict inequality – i.e.,
Dis(x, y) may be zero even if x ≠ y – the function Dis is called a
pseudo-metric.
COMP9417 ML & DM Classification Semester 1, 2018 11 / 132
Distance-based models
The triangle inequality – Minkowski distance for p = 2
The green circle connects points the same Euclidean distance (i.e.,
Minkowski distance of order p = 2) away from the origin as A. The orange
circle shows that B and C are equidistant from A. The red circle
demonstrates that C is closer to the origin than B, which conforms to the
triangle inequality.
COMP9417 ML & DM Classification Semester 1, 2018 12 / 132
Distance-based models
The triangle inequality – Minkowski distance for p ≤ 1
With Manhattan distance (p = 1), B and C are equally close to the origin
and also equidistant from A. With p < 1 (here, p = 0.8) C is further away
from the origin than B; since both are again equidistant from A, it follows
that travelling from the origin to C via A is quicker than going there
directly, which violates the triangle inequality.
COMP9417 ML & DM Classification Semester 1, 2018 13 / 132
Distance-based models
Mahalanobis distance *
Often, the shape of the ellipse is estimated from data as the inverse of the
covariance matrix: M = Σ−1. This leads to the definition of the
Mahalanobis distance
DisM(x, y|Σ) = √( (x − y)^T Σ⁻¹ (x − y) )
Using the covariance matrix in this way has the effect of decorrelating and
normalising the features.
Clearly, Euclidean distance is a special case of Mahalanobis distance with
the identity matrix I as covariance matrix: Dis2(x,y) = DisM (x,y|I).
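A small sketch, assuming NumPy, of the Mahalanobis distance with a covariance matrix estimated from data; the synthetic data and names are illustrative.

import numpy as np

def mahalanobis_distance(x, y, cov):
    """Mahalanobis distance between x and y for a given covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated data
Sigma = np.cov(X, rowvar=False)                                     # M = Sigma^{-1} used above
x, y = X[0], X[1]
print(mahalanobis_distance(x, y, Sigma))
print(mahalanobis_distance(x, y, np.eye(2)))   # identity covariance: plain Euclidean distance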
COMP9417 ML & DM Classification Semester 1, 2018 14 / 132
Distance-based models Neighbours and exemplars
Means and distances
The arithmetic mean minimises squared Euclidean distance
The arithmetic mean µ of a set of data points D in a Euclidean space is the
unique point that minimises the sum of squared Euclidean distances to
those data points.

Proof. We will show that arg min_y ∑_{x∈D} ||x − y||² = µ, where || · ||
denotes the 2-norm. We find this minimum by taking the gradient (the
vector of partial derivatives with respect to yi) of the sum and setting it to
the zero vector:

∇_y ∑_{x∈D} ||x − y||² = −2 ∑_{x∈D} (x − y) = −2 ∑_{x∈D} x + 2|D| y = 0

from which we derive y = (1/|D|) ∑_{x∈D} x = µ.
COMP9417 ML & DM Classification Semester 1, 2018 15 / 132
Distance-based models Neighbours and exemplars
Means and distances
• Notice that minimising the sum of squared Euclidean distances of a
given set of points is the same as minimising the average squared
Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t
it be more natural to take the point that minimises total Euclidean
distance as exemplar?
• This point is known as the geometric median, as for univariate data it
corresponds to the median or ‘middle value’ of a set of numbers.
However, for multivariate data there is no closed-form expression for
the geometric median, which needs to be calculated by successive
approximation.
COMP9417 ML & DM Classification Semester 1, 2018 16 / 132
Distance-based models Neighbours and exemplars
Means and distances
• In certain situations it makes sense to restrict an exemplar to be one
of the given data points. In that case, we speak of a medoid, to
distinguish it from a centroid which is an exemplar that doesn’t have
to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the
total distance to all other data points, in order to choose the point
that minimises it. Regardless of the distance metric used, this is an
O(n2) operation for n points.
• So for medoids there is no computational reason to prefer one
distance metric over another.
• There may be more than one medoid.
COMP9417 ML & DM Classification Semester 1, 2018 17 / 132
Distance-based models Neighbours and exemplars
Centroids and medoids
[Figure: the 10 data points with markers for the squared 2-norm centroid (mean), the 2-norm centroid (geometric median), the squared 2-norm medoid, the 2-norm medoid and the 1-norm medoid.]
A small data set of 10 points, with circles indicating centroids and squares
indicating medoids (the latter must be data points), for different distance metrics.
Notice how the outlier on the bottom-right ‘pulls’ the mean away from the
geometric median; as a result the corresponding medoid changes as well.
COMP9417 ML & DM Classification Semester 1, 2018 18 / 132
Distance-based models Neighbours and exemplars
The basic linear classifier is distance-based
• The basic linear classifier constructs the decision boundary as the
perpendicular bisector of the line segment connecting the two
exemplars (one for each class).
• An alternative, distance-based way to classify instances without direct
reference to a decision boundary is by the following decision rule: if x
is nearest to µ⊕ then classify it as positive, otherwise as negative; or
equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple
geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance-based
perspective as constructing exemplars that minimise squared
Euclidean distance within each class, and then applying a
nearest-exemplar decision rule.
COMP9417 ML & DM Classification Semester 1, 2018 19 / 132
Distance-based models Neighbours and exemplars
Two-exemplar decision boundaries
(left) For two exemplars the nearest-exemplar decision rule with Euclidean
distance results in a linear decision boundary coinciding with the
perpendicular bisector of the line connecting the two exemplars. (right)
Using Manhattan distance the circles are replaced by diamonds.
COMP9417 ML & DM Classification Semester 1, 2018 20 / 132
Distance-based models Neighbours and exemplars
Three-exemplar decision boundaries
(left) Decision regions defined by the 2-norm nearest-exemplar decision
rule for three exemplars. (right) With Manhattan distance the decision
regions become non-convex.
COMP9417 ML & DM Classification Semester 1, 2018 21 / 132
Distance-based models Neighbours and exemplars
Distance-based models
To summarise, the main ingredients of distance-based models are
• distance metrics, which can be Euclidean, Manhattan, Minkowski or
Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen
distance metric, or medoids that find the most centrally located data
point; and
• distance-based decision rules, which take a vote among the k nearest
exemplars.
COMP9417 ML & DM Classification Semester 1, 2018 22 / 132
Distance-based models k-Nearest neighbour Classification
Nearest Neighbour
Stores all training examples 〈xi, f(xi)〉.
Nearest neighbour:
• Given query instance xq, first locate nearest training example xn, then
estimate f̂(xq)← f(xn)
k-Nearest neighbour:
• Given xq, take vote among its k nearest neighbours (if discrete-valued
target function)
• take mean of f values of k nearest neighbours (if real-valued)
f̂(xq) ← ( ∑_{i=1}^{k} f(xi) ) / k
COMP9417 ML & DM Classification Semester 1, 2018 23 / 132
Distance-based models k-Nearest neighbour Classification
k-Nearest Neighbour Algorithm
Training algorithm:
• For each training example 〈xi, f(xi)〉, add the example to the list
training examples.
Classification algorithm:
• Given a query instance xq to be classified,
• Let x1 . . . xk be the k instances from training examples that are
nearest to xq by the distance function
• Return
f̂(xq) ← arg max_{v∈V} ∑_{i=1}^{k} δ(v, f(xi))
where δ(a, b) = 1 if a = b and 0 otherwise.
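A compact Python sketch of this classification algorithm, using Euclidean distance and a majority vote; the data set and function name are illustrative.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Return the majority class among the k training points nearest to x_query."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)     # tally delta(v, f(x_i)) per class
    return votes.most_common(1)[0][0]

X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y = ['+', '+', '+', '-', '-', '-']
print(knn_classify(X, y, [2, 2], k=3))   # '+'
print(knn_classify(X, y, [6, 5], k=3))   # '-'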
COMP9417 ML & DM Classification Semester 1, 2018 24 / 132
Distance-based models k-Nearest neighbour Classification
“Hypothesis Space” for Nearest Neighbour
2 classes, + and − and query point xq. On left, note effect of varying k.
On right, 1−NN induces a Voronoi tessellation of the instance space.
Formed by the perpendicular bisectors of lines between points.
COMP9417 ML & DM Classification Semester 1, 2018 25 / 132
Distance-based models k-Nearest neighbour Classification
Distance function again
The distance function defines what is learned.
Instance x is described by a feature vector (list of attribute-value pairs)
〈a1(x), a2(x), . . . am(x)〉
where ar(x) denotes the value of the rth attribute of x.
Most commonly used distance function is Euclidean distance . . .
• distance between two instances xi and xj is defined to be
d(xi, xj) = √( ∑_{r=1}^{m} (ar(xi) − ar(xj))² )
COMP9417 ML & DM Classification Semester 1, 2018 26 / 132
Distance-based models k-Nearest neighbour Classification
Distance function again
Many other distance functions could be used . . .
• e.g., Manhattan or city-block distance (sum of absolute values of
differences between attributes)
d(xi, xj) = ∑_{r=1}^{m} |ar(xi) − ar(xj)|
Vector-based formalization – use norm L1, L2, . . .
The idea of distance functions will appear again in kernel methods.
COMP9417 ML & DM Classification Semester 1, 2018 27 / 132
Distance-based models k-Nearest neighbour Classification
Normalization and other issues
• Different attributes measured on different scales
• Need to be normalized (why ?)
ar = (vr − min vr) / (max vr − min vr)
where vr is the actual value of attribute r (see the sketch after this list)
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant
(given normalized attributes)
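A small sketch of min–max normalisation of each attribute to [0, 1], assuming NumPy; the data values are illustrative.

import numpy as np

def min_max_normalise(X):
    """Scale each attribute (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant attributes
    return (X - lo) / span

X = np.array([[170.0, 60.0], [180.0, 90.0], [160.0, 75.0]])
print(min_max_normalise(X))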
COMP9417 ML & DM Classification Semester 1, 2018 28 / 132
Distance-based models k-Nearest neighbour Classification
When To Consider Nearest Neighbour
• Instances map to points in ℝ^m
• Warning: in high-dimensional spaces everything is far away from
everything and so pairwise distances are uninformative (curse of
dimensionality)
COMP9417 ML & DM Classification Semester 1, 2018 33 / 132
Distance-based models k-Nearest neighbour Classification
Distance-Weighted kNN
• Might want to weight nearer neighbours more heavily …
• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:
f̂(xq) ← arg max_{v∈V} ∑_{i=1}^{k} wi δ(v, f(xi))
where
wi ≡ 1 / d(xq, xi)²
and d(xq, xi) is the distance between xq and xi
COMP9417 ML & DM Classification Semester 1, 2018 34 / 132
Distance-based models k-Nearest neighbour Classification
Distance-Weighted kNN
For real-valued target functions replace the final line of the algorithm by:
f̂(xq) ← ( ∑_{i=1}^{k} wi f(xi) ) / ( ∑_{i=1}^{k} wi )
(denominator normalizes contribution of individual weights).
Now we can consider using all the training examples instead of just k
→ using all examples (i.e., when k = n) with the rule above is called
Shepard’s method
COMP9417 ML & DM Classification Semester 1, 2018 35 / 132
Distance-based models k-Nearest neighbour Classification
Evaluation
Lazy learners do not construct an explicit model, so how do we evaluate
the output of the learning process ?
• 1-NN – training set error is always zero !
• each training example is always closest to itself
• k-NN – overfitting may be hard to detect
Leave-one-out cross-validation (LOOCV) – leave out each example and
predict it given the rest:
(x1, y1), (x2, y2), . . . , (xi−1, yi−1), (xi+1, yi+1), . . . , (xn, yn)
Error is mean over all predicted examples. Fast – no models to be built !
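A minimal, self-contained sketch of LOOCV for a k-NN classifier; the toy data set is illustrative.

import numpy as np
from collections import Counter

def knn_classify(X, y, x_query, k=1):
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def loocv_error(X, y, k=1):
    """Leave each example out in turn, predict it from the rest, return mean error."""
    X = np.asarray(X, dtype=float)
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                 # all examples except the i-th
        y_rest = [y[j] for j in range(len(y)) if j != i]
        errors += (knn_classify(X[mask], y_rest, X[i], k) != y[i])
    return errors / len(X)

X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y = ['+', '+', '+', '-', '-', '-']
print(loocv_error(X, y, k=1))   # 0.0 on this toy set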
COMP9417 ML & DM Classification Semester 1, 2018 36 / 132
Distance-based models k-Nearest neighbour Classification
Curse of Dimensionality
Bellman (1960) coined this term in the context of dynamic programming
Imagine instances described by 20 attributes, but only 2 are relevant to
target function — “similar” examples will appear “distant”.
Curse of dimensionality: nearest neighbour is easily misled when X is
high-dimensional – the problem of irrelevant attributes
One approach:
• Stretch jth axis by weight zj , where z1, . . . , zn chosen to minimize
prediction error
• Use cross-validation to automatically choose weights z1, . . . , zn
• Note setting zj to zero eliminates this dimension altogether
COMP9417 ML & DM Classification Semester 1, 2018 37 / 132
Distance-based models k-Nearest neighbour Classification
Curse of Dimensionality
• number of “cells” in the instance space grows exponentially in the
number of features
• with exponentially many cells we would need exponentially many data
points to ensure that each cell is sufficiently populated to make
nearest-neighbour predictions reliably
COMP9417 ML & DM Classification Semester 1, 2018 38 / 132
Distance-based models k-Nearest neighbour Classification
Curse of Dimensionality
Some ideas to address this for instance-based (nearest-neighbour) learning
• Euclidean distance with weights on attributes: √( ∑_r wr (ar(xq) − ar(x))² )
• updating of weights based on nearest neighbour classification error
• class correct/incorrect: weight increased/decreased
• can be useful if not all features used in classification
See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error”
COMP9417 ML & DM Classification Semester 1, 2018 39 / 132
Distance-based models k-Nearest neighbour Classification
Instance-based (nearest-neighbour) learning
Recap – Practical problems of 1-NN scheme:
• Slow (but fast k-d tree-based approaches exist)
• Remedy: removing irrelevant data
• Noise (but k-NN copes quite well with noise)
• Remedy: removing noisy instances
• All attributes deemed equally important
• Remedy: attribute weighting (or simply selection)
• Doesn’t perform explicit generalization
• Remedy: rule-based or tree-based NN approaches
COMP9417 ML & DM Classification Semester 1, 2018 40 / 132
Distance-based models k-Nearest neighbour Classification
Refinements of instance-based (nearest-neighbour)
classifiers *
• Edited NN classifiers discard some of the training instances before
making predictions
• Saves memory and speeds up classification
• IB2: incremental NN learner: only incorporates misclassified instances
into the classifier
• Problem: noisy data gets incorporated
• IB3: store classification performance information with each instance
& only use in prediction if above a threshold
COMP9417 ML & DM Classification Semester 1, 2018 41 / 132
Distance-based models k-Nearest neighbour Classification
Dealing with noise *
Use larger values of k (why ?) How to find the “right” k ?
• One way: cross-validation-based k-NN classifier (but slow)
• Different approach: discarding instances that don’t perform well by
keeping success records of how well an instance does at prediction
(IB3)
• Computes confidence interval for an instance’s success rate and for
default accuracy of its class
• If lower limit of first interval is above upper limit of second one,
instance is accepted (IB3: 5%-level)
• If upper limit of first interval is below lower limit of second one,
instance is rejected (IB3: 12.5%-level)
COMP9417 ML & DM Classification Semester 1, 2018 42 / 132
Naive Bayes Why model uncertainty ?
Uncertainty
As far as the laws of mathematics refer to reality, they are not
certain; as far as they are certain, they do not refer to reality.
–Albert Einstein
COMP9417 ML & DM Classification Semester 1, 2018 43 / 132
Naive Bayes Why Bayes ?
Two Roles for Bayesian Methods
Provides practical learning algorithms:
• Naive Bayes classifier learning
• Bayesian network learning, etc.
• Combines prior knowledge (prior probabilities) with observed data
• How to get prior probabilities ?
Provides useful conceptual framework:
• Provides a “gold standard” for evaluating other learning algorithms
• Some additional insight into Occam’s razor
COMP9417 ML & DM Classification Semester 1, 2018 44 / 132
Naive Bayes Why Bayes ?
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)

where

P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
COMP9417 ML & DM Classification Semester 1, 2018 45 / 132
Naive Bayes Why Bayes ?
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)

Generally want the most probable hypothesis given the training data

Maximum a posteriori hypothesis hMAP:

hMAP = arg max_{h∈H} P(h|D)
     = arg max_{h∈H} P(D|h) P(h) / P(D)
     = arg max_{h∈H} P(D|h) P(h)
COMP9417 ML & DM Classification Semester 1, 2018 46 / 132
Naive Bayes Why Bayes ?
Choosing Hypotheses
If we assume P(hi) = P(hj), then we can further simplify and choose the
Maximum likelihood (ML) hypothesis
hML = arg max_{hi∈H} P(D|hi)
COMP9417 ML & DM Classification Semester 1, 2018 47 / 132
Naive Bayes Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) =              P(¬cancer) =
P(⊕ | cancer) =          P(⊖ | cancer) =
P(⊕ | ¬cancer) =         P(⊖ | ¬cancer) =
COMP9417 ML & DM Classification Semester 1, 2018 48 / 132
Naive Bayes Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = .008         P(¬cancer) = .992
P(⊕ | cancer) = .98      P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03     P(⊖ | ¬cancer) = .97
COMP9417 ML & DM Classification Semester 1, 2018 49 / 132
Naive Bayes Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
We can find the maximum a posteriori (MAP) hypothesis
P (⊕ | cancer)P (cancer) = 0.98× 0.008 = 0.00784
P (⊕ | ¬cancer)P (¬cancer) = 0.03× 0.992 = 0.02976
Thus hMAP = . . ..
COMP9417 ML & DM Classification Semester 1, 2018 50 / 132
Naive Bayes Why Bayes ?
Applying Bayes Theorem
Does patient have cancer or not?
We can find the maximum a posteriori (MAP) hypothesis
P (⊕ | cancer)P (cancer) = 0.98× 0.008 = 0.00784
P (⊕ | ¬cancer)P (¬cancer) = 0.03× 0.992 = 0.02976
Thus hMAP = ¬cancer.
Also note: posterior probability of hypothesis cancer higher than prior.
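A quick arithmetic check of this example in Python (numbers as on the slides).

p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

num_cancer = p_pos_given_cancer * p_cancer     # 0.00784
num_not = p_pos_given_not * p_not              # 0.02976
p_pos = num_cancer + num_not                   # P(+), by the theorem of total probability

print(num_cancer, num_not)
print("P(cancer | +) =", num_cancer / p_pos)   # ~0.21, higher than the 0.008 prior
print("h_MAP =", "cancer" if num_cancer > num_not else "not cancer")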
COMP9417 ML & DM Classification Semester 1, 2018 51 / 132
Naive Bayes Why Bayes ?
Applying Bayes Theorem
How to get the posterior probability of a hypothesis h ?
Divide by P (⊕), probability of data, to normalize result for h:
P(h|D) = P(D|h) P(h) / ∑_{hi∈H} P(D|hi) P(hi)
Denominator ensures we obtain posterior probabilities that sum to 1.
Sum for all possible numerator values, since hypotheses are mutually
exclusive (e.g., patient either has cancer or does not).
Marginal likelihood (marginalizing out over hypothesis space) — prior
probability of the data.
COMP9417 ML & DM Classification Semester 1, 2018 52 / 132
Naive Bayes Why Bayes ?
Basic Formulas for Probabilities
Product Rule: probability P (A∧B) of conjunction of two events A and B:
P (A ∧B) = P (A|B)P (B) = P (B|A)P (A)
Sum Rule: probability of disjunction of two events A and B:
P (A ∨B) = P (A) + P (B)− P (A ∧B)
Theorem of total probability : if events A1, . . . , An are mutually exclusive
with ∑_{i=1}^{n} P(Ai) = 1, then:

P(B) = ∑_{i=1}^{n} P(B|Ai) P(Ai)
COMP9417 ML & DM Classification Semester 1, 2018 53 / 132
Naive Bayes Why Bayes ?
Basic Formulas for Probabilities
Also worth remembering:
• Conditional Probability: probability of A given B:
P(A|B) = P(A ∧ B) / P(B)
• Rearrange sum rule to get:
P (A ∧B) = P (A) + P (B)− P (A ∨B)
Exercise: Derive Bayes Theorem.
COMP9417 ML & DM Classification Semester 1, 2018 54 / 132
Naive Bayes Why Bayes ?
Brute Force MAP Hypothesis Learner
Idea: view learning as finding the most probable hypothesis
• For each hypothesis h in H, calculate the posterior probability
P(h|D) = P(D|h) P(h) / P(D)
• Output the hypothesis hMAP with the highest posterior probability
hMAP = arg max_{h∈H} P(h|D)
COMP9417 ML & DM Classification Semester 1, 2018 55 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Relation to Concept Learning (i.e., classification)
Canonical concept learning task:
• instance space X, hypothesis space H, training examples D
• consider a learning algorithm that outputs most specific hypothesis
from the version space V SH,D (i.e., set of all consistent or
”zero-error” classification rules)
What would Bayes rule produce as the MAP hypothesis?
Does this algorithm output a MAP hypothesis??
COMP9417 ML & DM Classification Semester 1, 2018 56 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Relation to Concept Learning
Brute Force MAP Framework for Concept Learning:
Assume fixed set of instances 〈x1, . . . , xm〉
Assume D is the set of classifications D = 〈c(x1), . . . , c(xm)〉
Choose P (h) to be uniform distribution:
• P(h) = 1/|H| for all h in H
Choose P (D|h):
• P (D|h) = 1 if h consistent with D
• P (D|h) = 0 otherwise
COMP9417 ML & DM Classification Semester 1, 2018 57 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Relation to Concept Learning
Then:
P(h|D) = 1 / |VSH,D|   if h is consistent with D
       = 0             otherwise
COMP9417 ML & DM Classification Semester 1, 2018 58 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Relation to Concept Learning
Note that since likelihood is zero if h is inconsistent then the posterior is
also zero. But how did we obtain the posterior for consistent h ?
P(h|D) = (1 · (1/|H|)) / P(D) = (1 · (1/|H|)) / (|VSH,D| / |H|) = 1 / |VSH,D|
COMP9417 ML & DM Classification Semester 1, 2018 59 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Relation to Concept Learning
How did we obtain P (D) ? From theorem of total probability:
P(D) = ∑_{hi∈H} P(D|hi) P(hi)
     = ∑_{hi∈VSH,D} 1 · (1/|H|) + ∑_{hi∉VSH,D} 0 · (1/|H|)
     = ∑_{hi∈VSH,D} 1 · (1/|H|)
     = |VSH,D| / |H|
COMP9417 ML & DM Classification Semester 1, 2018 60 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Evolution of Posterior Probabilities
[Figure: evolution of posterior probabilities over the hypothesis space — (a) P(h), (b) P(h|D1), (c) P(h|D1, D2).]
COMP9417 ML & DM Classification Semester 1, 2018 61 / 132
A Bayesian approach to learning algorithms Concept learning in Bayesian terms
Relation to Concept Learning
Every hypothesis consistent with D is a MAP hypothesis, if we assume
• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)
So, this learning algorithm will output a MAP hypothesis, even though it
does not explicitly use probabilities in learning.
COMP9417 ML & DM Classification Semester 1, 2018 62 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Learning A Real Valued Function
E.g., learning a linear target function f from noisy examples:
[Figure: noisy training examples of target function f plotted as y against x, with the maximum likelihood hypothesis hML; e denotes the noise on an example.]
COMP9417 ML & DM Classification Semester 1, 2018 63 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Learning A Real Valued Function
Consider any real-valued target function f
Training examples 〈xi, di〉, where di is noisy training value
• di = f(xi) + ei
• ei is random variable (noise) drawn independently for each xi
according to some Gaussian (normal) distribution with mean=0
Then the maximum likelihood hypothesis hML is the one that
minimizes the sum of squared errors:
hML = arg min_{h∈H} ∑_{i=1}^{m} (di − h(xi))²
COMP9417 ML & DM Classification Semester 1, 2018 64 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Learning A Real Valued Function
How did we obtain this ?
hML = arg max_{h∈H} p(D|h)
    = arg max_{h∈H} ∏_{i=1}^{m} p(di|h)
    = arg max_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(1/2)((di − h(xi))/σ)²)

Recall that we treat each probability p(D|h) as if h = f , i.e., we assume
µ = f(xi) = h(xi), which is the key idea behind maximum likelihood !
COMP9417 ML & DM Classification Semester 1, 2018 65 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Learning A Real Valued Function
Maximize natural log to give simpler expression:
hML = arg max_{h∈H} ∑_{i=1}^{m} [ ln (1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² ]
    = arg max_{h∈H} ∑_{i=1}^{m} − (1/2)((di − h(xi))/σ)²
    = arg max_{h∈H} ∑_{i=1}^{m} − (di − h(xi))²

Equivalently, we can minimize the positive version of the expression:

hML = arg min_{h∈H} ∑_{i=1}^{m} (di − h(xi))²
COMP9417 ML & DM Classification Semester 1, 2018 66 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Discriminative and generative probabilistic models
• Discriminative models model the posterior probability distribution P (Y |X),
where Y is the target variable and X are the features. That is, given X they
return a probability distribution over Y .
• Generative models model the joint distribution P (Y,X) of the target Y and
the feature vector X. Once we have access to this joint distribution we can
derive any conditional or marginal distribution involving the same variables.
In particular, since P(X) = ∑_y P(Y = y, X), it follows that the posterior
distribution can be obtained as P(Y|X) = P(Y, X) / ∑_y P(Y = y, X).
• Alternatively, generative models can be described by the likelihood function
P (X|Y ), since P (Y,X) = P (X|Y )P (Y ) and the target or prior distribution
(usually abbreviated to ‘prior’) can be easily estimated or postulated.
• Such models are called ‘generative’ because we can sample from the joint
distribution to obtain new data points together with their labels.
Alternatively, we can use P (Y ) to sample a class and P (X|Y ) to sample an
instance for that class.
COMP9417 ML & DM Classification Semester 1, 2018 67 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Assessing uncertainty in estimates
Suppose we want to estimate the probability θ that an arbitrary e-mail is
spam, so that we can use the appropriate prior distribution.
• The natural thing to do is to inspect n e-mails, determine the number
of spam e-mails d, and set θ̂ = d/n; we don’t really need any
complicated statistics to tell us that.
• However, while this is the most likely estimate of θ – the maximum a
posteriori (MAP) estimate – this doesn’t mean that other values of θ
are completely ruled out.
• We model this by a probability distribution over θ (a Beta distribution
in this case) which is updated each time new information comes in.
This is further illustrated in the figure for a distribution that is more
and more skewed towards spam.
• For each curve, its bias towards spam is given by the area under the
curve and to the right of θ = 1/2.
COMP9417 ML & DM Classification Semester 1, 2018 68 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
Assessing uncertainty in estimates
[Figure: posterior (Beta) distributions over θ after inspecting 2, 3, . . . , 10 e-mails.]
Each time we inspect an e-mail, we are reducing our uncertainty regarding
the prior spam probability θ. After we inspect two e-mails and observe one
spam, the possible θ values are characterised by a symmetric distribution
around 1/2. If we inspect a third, fourth, . . . , tenth e-mail and each time
(except the first one) it is spam, then this distribution narrows and shifts a
little bit to the right each time. The distribution for n e-mails reaches its
maximum at θ̂MAP = (n − 1)/n (e.g., θ̂MAP = 0.8 for n = 5).
COMP9417 ML & DM Classification Semester 1, 2018 69 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
The Bayesian perspective
Explicitly modelling the posterior distribution over the parameter θ has a
number of advantages that are usually associated with the ‘Bayesian’
perspective:
• We can precisely characterise the uncertainty that remains about our
estimate by quantifying the spread of the posterior distribution.
• We can obtain a generative model for the parameter by sampling
from the posterior distribution, which contains much more
information than a summary statistic such as the MAP estimate can
convey – so, rather than using a single e-mail with θ = θMAP, our
generative model can contain a number of e-mails with θ sampled
from the posterior distribution.
COMP9417 ML & DM Classification Semester 1, 2018 70 / 132
A Bayesian approach to learning algorithms Numeric prediction in Bayesian terms
The Bayesian perspective
• We can quantify the probability of statements such as ‘e-mails are
biased towards ham’ (the tiny shaded area in the figure demonstrates
that after observing one ham and nine spam e-mails this probability is
very small, about 0.6%).
• We can use one of these distributions to encode our prior beliefs: e.g.,
if we believe that the proportions of spam and ham are typically
50–50, we can take the distribution for n = 2 (the lowest, symmetric
one in the figure on the previous slide) as our prior.
The key point is that probabilities do not have to be interpreted as
estimates of relative frequencies, but can carry the more general meaning
of (possibly subjective) degrees of belief.
Consequently, we can attach a probability distribution to almost anything:
not just features and targets, but also model parameters and even models.
COMP9417 ML & DM Classification Semester 1, 2018 71 / 132
Minimum Description Length Principle
Minimum Description Length Principle
Once again, the MAP hypothesis
hMAP = arg max_{h∈H} P(D|h) P(h)

which is equivalent to

hMAP = arg max_{h∈H} [ log2 P(D|h) + log2 P(h) ]

or

hMAP = arg min_{h∈H} [ − log2 P(D|h) − log2 P(h) ]
COMP9417 ML & DM Classification Semester 1, 2018 72 / 132
Minimum Description Length Principle
Minimum Description Length Principle
Interestingly, this is an expression about a quantity of bits.
hMAP = arg min_{h∈H} [ − log2 P(D|h) − log2 P(h) ]     (1)
From information theory:
The optimal (shortest expected coding length) code for an event
with probability p is − log2 p bits.
COMP9417 ML & DM Classification Semester 1, 2018 73 / 132
Minimum Description Length Principle
Minimum Description Length Principle
So interpret (1):
• − log2 P (h) is length of h under optimal code
• − log2 P (D|h) is length of D given h under optimal code
Note well: assumes optimal encodings, when the priors and likelihoods
are known. In practice, this is difficult, and makes a difference.
COMP9417 ML & DM Classification Semester 1, 2018 74 / 132
Minimum Description Length Principle
Minimum Description Length Principle
Occam’s razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
hMDL = arg min_{h∈H} [ LC1(h) + LC2(D|h) ]
where LC(x) is the description length of x under optimal encoding C
COMP9417 ML & DM Classification Semester 1, 2018 75 / 132
Minimum Description Length Principle
Minimum Description Length Principle
Without loss of generality, classifier is here assumed to be a decision tree
Example: H = decision trees, D = training data labels
• LC1(h) is # bits to describe tree h
• LC2(D|h) is # bits to describe D given h
• Note LC2(D|h) = 0 if examples classified perfectly by h. Need only
describe exceptions
• Hence hMDL trades off tree size for training errors
• i.e., prefer the hypothesis that minimizes
length(h) + length(misclassifications)
COMP9417 ML & DM Classification Semester 1, 2018 76 / 132
Bayes Optimal Classifier
Most Probable Classification of New Instances
So far we’ve sought the most probable hypothesis given the data D (i.e.,
hMAP )
Given new instance x, what is its most probable classification?
• hMAP (x) is not the most probable classification!
COMP9417 ML & DM Classification Semester 1, 2018 77 / 132
Bayes Optimal Classifier
Most Probable Classification of New Instances
Consider:
• Three possible hypotheses:
P (h1|D) = .4, P (h2|D) = .3, P (h3|D) = .3
• Given new instance x,
h1(x) = +, h2(x) = −, h3(x) = −
• What’s most probable classification of x?
COMP9417 ML & DM Classification Semester 1, 2018 78 / 132
Bayes Optimal Classifier
Bayes Optimal Classifier
Bayes optimal classification:
arg max_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D)
Example:
P (h1|D) = .4, P (−|h1) = 0, P (+|h1) = 1
P (h2|D) = .3, P (−|h2) = 1, P (+|h2) = 0
P (h3|D) = .3, P (−|h3) = 1, P (+|h3) = 0
COMP9417 ML & DM Classification Semester 1, 2018 79 / 132
Bayes Optimal Classifier
Bayes Optimal Classifier
therefore

∑_{hi∈H} P(+|hi) P(hi|D) = .4
∑_{hi∈H} P(−|hi) P(hi|D) = .6

and

arg max_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D) = −
No other classification method using the same hypothesis space and same
prior knowledge can outperform this method on average
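A short numerical check of the three-hypothesis example above.

posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}      # P(h_i | D)
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}     # h_i(x)

# P(v | D) = sum over hypotheses of P(v | h_i) P(h_i | D)
class_probs = {'+': 0.0, '-': 0.0}
for h, p in posteriors.items():
    class_probs[predictions[h]] += p

print(class_probs)                                  # {'+': 0.4, '-': 0.6}
print(max(class_probs, key=class_probs.get))        # '-' : the Bayes optimal classification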
COMP9417 ML & DM Classification Semester 1, 2018 80 / 132
Bayes Optimal Classifier
Bayes Error
What is the best performance attainable by a (two-class) classifier ?
Define the probability of error for classifying some instance x by
P(error|x) = P(class1|x)  if we predict class2
           = P(class2|x)  if we predict class1
This gives
P(error) = ∑_x P(error|x) P(x)
So we can justify the use of the decision rule
if P (class1|x) > P (class2|x) then predict class1
else predict class2
On average, this decision rule minimises probability of classification error.
COMP9417 ML & DM Classification Semester 1, 2018 81 / 132
Naive Bayes
Naive Bayes Classifier
Along with decision trees, neural networks, nearest neighbour, one of the
most practical learning methods.
When to use
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given
classification
Successful applications:
• Classifying text documents
• Gaussian Naive Bayes for real-valued data
COMP9417 ML & DM Classification Semester 1, 2018 82 / 132
Naive Bayes
Naive Bayes Classifier
Assume target function f : X → V , where each instance x described by
attributes 〈a1, a2 . . . an〉.
Most probable value of f(x) is:
vMAP = arg max_{vj∈V} P(vj|a1, a2 . . . an)
     = arg max_{vj∈V} P(a1, a2 . . . an|vj) P(vj) / P(a1, a2 . . . an)
     = arg max_{vj∈V} P(a1, a2 . . . an|vj) P(vj)
COMP9417 ML & DM Classification Semester 1, 2018 83 / 132
Naive Bayes
Naive Bayes Classifier
Naive Bayes assumption:
P(a1, a2 . . . an|vj) = ∏_i P(ai|vj)
• Attributes are statistically independent (given the class value)
• which means knowledge about the value of a particular attribute tells
us nothing about the value of another attribute (if the class is known)
which gives
Naive Bayes classifier: vNB = arg max_{vj∈V} P(vj) ∏_i P(ai|vj)
COMP9417 ML & DM Classification Semester 1, 2018 84 / 132
Naive Bayes
Naive Bayes Algorithm
Naive Bayes Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

Classify New Instance(x)
  vNB = arg max_{vj∈V} P̂(vj) ∏_{ai∈x} P̂(ai|vj)
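A compact Python sketch of this algorithm for categorical attributes, using raw frequency estimates without smoothing; the toy data set and names are illustrative.

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, class). Returns class priors and
    per-class conditional frequency estimates for each attribute value."""
    class_counts = Counter(v for _, v in examples)
    priors = {v: n / len(examples) for v, n in class_counts.items()}
    cond = defaultdict(Counter)                     # (class, attribute index) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond[(v, i)][a] += 1
    likelihood = {k: {a: n / sum(c.values()) for a, n in c.items()} for k, c in cond.items()}
    return priors, likelihood

def naive_bayes_classify(priors, likelihood, attrs):
    scores = {}
    for v, p in priors.items():
        for i, a in enumerate(attrs):
            p *= likelihood[(v, i)].get(a, 0.0)     # zero if value unseen for this class
        scores[v] = p
    return max(scores, key=scores.get), scores

# Tiny illustrative data set: (Outlook, Windy) -> Play
data = [(('sunny', False), 'no'), (('sunny', True), 'no'),
        (('overcast', False), 'yes'), (('rainy', False), 'yes'),
        (('rainy', True), 'no'), (('overcast', True), 'yes')]
priors, lik = naive_bayes_learn(data)
print(naive_bayes_classify(priors, lik, ('sunny', False)))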
COMP9417 ML & DM Classification Semester 1, 2018 85 / 132
Naive Bayes
Naive Bayes Example
Consider PlayTennis again . . .
Outlook Temperature Humidity Windy
Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5
Play
Yes No
9 5
9/14 5/14
COMP9417 ML & DM Classification Semester 1, 2018 86 / 132
Naive Bayes
Naive Bayes Example
Say we have the new instance:
〈Outlk = sun, Temp = cool,Humid = high,Wind = true〉
We want to compute:
vNB = arg max_{vj∈{“yes”,“no”}} P(vj) ∏_i P(ai|vj)
COMP9417 ML & DM Classification Semester 1, 2018 87 / 132
Naive Bayes
Naive Bayes Example
So we first calculate the likelihood of the two classes, “yes” and “no”
for “yes”: P(y) × P(sun|y) × P(cool|y) × P(high|y) × P(true|y)
         = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053

for “no”:  P(n) × P(sun|n) × P(cool|n) × P(high|n) × P(true|n)
         = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206
COMP9417 ML & DM Classification Semester 1, 2018 88 / 132
Naive Bayes
Naive Bayes Example
Then convert to a probability by normalisation
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795
The Naive Bayes classification is “no”.
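The same calculation done numerically (numbers from the table above).

# likelihood-times-prior scores for <sunny, cool, high, true>
score_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9     # ~0.0053
score_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5     # ~0.0206

total = score_yes + score_no
print(round(score_yes, 4), round(score_no, 4))
print("P('yes') =", round(score_yes / total, 3))   # ~0.205
print("P('no')  =", round(score_no / total, 3))    # ~0.795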
COMP9417 ML & DM Classification Semester 1, 2018 89 / 132
Naive Bayes
Naive Bayes: Subtleties
Conditional independence assumption is often violated
P(a1, a2 . . . an|vj) = ∏_i P(ai|vj)
• …but it works surprisingly well anyway. Note don’t need estimated
posteriors P̂ (vj |x) to be correct; need only that
arg max_{vj∈V} P̂(vj) ∏_i P̂(ai|vj) = arg max_{vj∈V} P(vj) P(a1, . . . , an|vj)
i.e., maximum probability is assigned to the correct class
• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g.
identical attributes)
COMP9417 ML & DM Classification Semester 1, 2018 90 / 132
Naive Bayes
Naive Bayes: “zero-frequency” problem
What if none of the training instances with target value vj have attribute
value ai? Then
P̂(ai|vj) = 0, and . . .
P̂(vj) ∏_i P̂(ai|vj) = 0
Pseudo-counts: add 1 to each count (a version of the Laplace Estimator)
COMP9417 ML & DM Classification Semester 1, 2018 91 / 132
Naive Bayes
Naive Bayes: “zero-frequency” problem
• In some cases adding a constant different from 1 might be more
appropriate
• Example: attribute outlook for class yes

Sunny: (2 + µ/3) / (9 + µ)    Overcast: (4 + µ/3) / (9 + µ)    Rainy: (3 + µ/3) / (9 + µ)

• Weights don't need to be equal (if they sum to 1) – a form of prior

Sunny: (2 + µp1) / (9 + µ)    Overcast: (4 + µp2) / (9 + µ)    Rainy: (3 + µp3) / (9 + µ)
COMP9417 ML & DM Classification Semester 1, 2018 92 / 132
Naive Bayes
Naive Bayes: “zero-frequency” problem
This generalisation is a Bayesian estimate for P̂ (ai|vj)
P̂(ai|vj) ← (nc + m p) / (n + m)
where
• n is number of training examples for which v = vj ,
• nc number of examples for which v = vj and a = ai
• p is prior estimate for P̂ (ai|vj)
• m is weight given to prior (i.e. number of “virtual” examples)
This is called the m-estimate of probability.
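A small sketch of the m-estimate; the example numbers are taken from the outlook = overcast count for class yes above, with a uniform prior p = 1/3 and m = 3 virtual examples (an illustrative choice).

def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# outlook = overcast for class 'yes': n_c = 4 of n = 9
print(m_estimate(4, 9, 1/3, 3))    # 5/12 ~ 0.4167
# m = 0 recovers the raw relative frequency; large m pulls the estimate towards the prior p
print(m_estimate(4, 9, 1/3, 0), m_estimate(4, 9, 1/3, 100))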
COMP9417 ML & DM Classification Semester 1, 2018 93 / 132
Naive Bayes
Naive Bayes: missing values
• Training: instance is not included in frequency count for attribute
value-class combination
• Classification: attribute will be omitted from calculation
COMP9417 ML & DM Classification Semester 1, 2018 94 / 132
Naive Bayes
Naive Bayes: numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability
distribution (given the class)
• The probability density function for the normal distribution is defined
by two parameters:
The sample mean µ:

µ = (1/n) ∑_{i=1}^{n} xi

The standard deviation σ:

σ = √( (1/(n − 1)) ∑_{i=1}^{n} (xi − µ)² )
COMP9417 ML & DM Classification Semester 1, 2018 95 / 132
Naive Bayes
Naive Bayes: numeric attributes
Then we have the density function f(x):

f(x) = (1 / (√(2π) σ)) e^(−(x − µ)² / (2σ²))

Example: continuous attribute temperature with mean = 73 and standard
deviation = 6.2. Density value

f(temperature = 66|“yes”) = (1 / (√(2π) × 6.2)) e^(−(66 − 73)² / (2 × 6.2²)) = 0.0340
Missing values during training are not included in calculation of mean and
standard deviation.
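A quick check of the temperature example, using only the standard library (mean and standard deviation as on the slide).

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

print(gaussian_density(66, mu=73, sigma=6.2))   # ~0.0340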
COMP9417 ML & DM Classification Semester 1, 2018 96 / 132
Naive Bayes
Naive Bayes: numeric attributes
Note: the normal distribution is based on the simple exponential function
f(x) = e^(−|x|^m)
As the power m in the exponent increases, the function approaches a step
function.
Where m = 2
f(x) = e^(−|x|²)
and this is the basis of the normal distribution – the various constants are
the result of scaling so that the integral (the area under the curve from
−∞ to +∞) is equal to 1.
from “Statistical Computing” by Michael J. Crawley (2002) Wiley.
COMP9417 ML & DM Classification Semester 1, 2018 97 / 132
Naive Bayes
Categorical random variables
Categorical variables or features (also called discrete or nominal) are
ubiquitous in machine learning.
• Perhaps the most common form of the Bernoulli distribution models
whether or not a word occurs in a document. That is, for the i-th
word in our vocabulary we have a random variable Xi governed by a
Bernoulli distribution. The joint distribution over the bit vector
X = (X1, . . . , Xk) is called a multivariate Bernoulli distribution.
• Variables with more than two outcomes are also common: for
example, every word position in an e-mail corresponds to a categorical
variable with k outcomes, where k is the size of the vocabulary. The
multinomial distribution manifests itself as a count vector: a
histogram of the number of occurrences of all vocabulary words in a
document. This establishes an alternative way of modelling text
documents that allows the number of occurrences of a word to
influence the classification of a document.
COMP9417 ML & DM Classification Semester 1, 2018 98 / 132
Naive Bayes
Categorical random variables
Both these document models are in common use. Despite their
differences, they both assume independence between word occurrences,
generally referred to as the naive Bayes assumption.
• In the multinomial document model, this follows from the very use of
the multinomial distribution, which assumes that words at different
word positions are drawn independently from the same categorical
distribution.
• In the multivariate Bernoulli model we assume that the bits in a bit
vector are statistically independent, which allows us to compute the
joint probability of a particular bit vector (x1, . . . , xk) as the product
of the probabilities of each component P (Xi = xi).
• In practice, such word independence assumptions are often not true:
if we know that an e-mail contains the word ‘Viagra’, we can be quite
sure that it will also contain the word ‘pill’. Violated independence
assumptions reduce the quality of probability estimates but may still
allow good classification performance.
COMP9417 ML & DM Classification Semester 1, 2018 99 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Example application: Learning to Classify Text
In machine learning, the classic example of applications of Naive Bayes is
learning to classify text documents.
Here is a simplified version in the multinomial document model.
COMP9417 ML & DM Classification Semester 1, 2018 100 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Learning to Classify Text
Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic
Naive Bayes is among most effective algorithms
What attributes shall we use to represent text documents??
COMP9417 ML & DM Classification Semester 1, 2018 101 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Learning to Classify Text
Target concept Interesting? : Document→ {+,−}
1 Represent each document by vector of words
• one attribute per word position in document
2 Learning: Use training examples to estimate
• P (+)
• P (−)
• P (doc|+)
• P (doc|−)
COMP9417 ML & DM Classification Semester 1, 2018 102 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Learning to Classify Text
Naive Bayes conditional independence assumption
P(doc|vj) = ∏_{i=1}^{length(doc)} P(ai = wk|vj)
where P (ai = wk|vj) is probability that word in position i is wk, given vj
one more assumption: P (ai = wk|vj) = P (am = wk|vj),∀i,m
“bag of words”
COMP9417 ML & DM Classification Semester 1, 2018 103 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Learning to Classify Text
Learn naive Bayes text(Examples, V )
// collect all words and other tokens that occur in Examples
V ocabulary ← all distinct words and other tokens in Examples
// calculate the required P (vj) and P (wk|vj) probability terms
for each target value vj in V do
docsj ← subset of Examples for which the target value is vj
P (vj) ← |docsj| / |Examples|
Textj ← a single document created by concatenating all members of docsj
n← total number of words in Textj (counting duplicate words multiple times)
for each word wk in V ocabulary
nk ← number of times word wk occurs in Textj
P (wk|vj) ← (nk + 1) / (n + |V ocabulary|)
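A compact Python sketch of Learn naive Bayes text and Classify naive Bayes text for the multinomial model with Laplace smoothing; the toy documents and function names are illustrative.

import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (list_of_words, class_label).
    Returns class priors, smoothed word probabilities per class, and the vocabulary."""
    vocabulary = {w for words, _ in examples for w in words}
    priors, word_probs = {}, {}
    for v in {label for _, label in examples}:
        docs_v = [words for words, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        counts = Counter(w for words in docs_v for w in words)   # word counts in Text_j
        n = sum(counts.values())
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return priors, word_probs, vocabulary

def classify_naive_bayes_text(doc_words, priors, word_probs, vocabulary):
    scores = {}
    for v in priors:
        # sum of logs rather than product of probabilities, to avoid underflow
        scores[v] = math.log(priors[v]) + sum(
            math.log(word_probs[v][w]) for w in doc_words if w in vocabulary)
    return max(scores, key=scores.get)

train = [("b d e b b d e".split(), "+"), ("a d a d e a e e".split(), "-")]
priors, wp, vocab = learn_naive_bayes_text(train)
print(classify_naive_bayes_text("b b a".split(), priors, wp, vocab))   # '+'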
COMP9417 ML & DM Classification Semester 1, 2018 104 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Learning to Classify Text
Classify naive Bayes text(Doc)
• positions← all word positions in Doc that contain tokens found in
V ocabulary
• Return vNB, where
vNB = arg max_{vj∈V} P(vj) ∏_{i∈positions} P(ai|vj)
COMP9417 ML & DM Classification Semester 1, 2018 105 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Application: 20 Newsgroups
Given: 1000 training documents from each group
Learning task: classify each new document by newsgroup it came from
comp.graphics misc.forsale
comp.os.ms-windows.misc rec.autos
comp.sys.ibm.pc.hardware rec.motorcycles
comp.sys.mac.hardware rec.sport.baseball
comp.windows.x rec.sport.hockey
alt.atheism sci.space
soc.religion.christian sci.crypt
talk.religion.misc sci.electronics
talk.politics.mideast sci.med
talk.politics.misc
talk.politics.guns
Naive Bayes: 89% classification accuracy
COMP9417 ML & DM Classification Semester 1, 2018 106 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year’s biggest and worst (opinion)…
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided …
COMP9417 ML & DM Classification Semester 1, 2018 107 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Learning Curve for 20 Newsgroups
[Figure: accuracy vs. training set size (1/3 withheld for test) on 20 Newsgroups, comparing Bayes, TFIDF and PRTFIDF.]
COMP9417 ML & DM Classification Semester 1, 2018 108 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Probabilistic decision rules
We have chosen one of the possible distributions to model our data X as
coming from either class.
• The more different P (X|Y = spam) and P (X|Y = ham) are, the
more useful the features X are for classification.
• Thus, for a specific e-mail x we calculate both P (X = x|Y = spam)
and P (X = x|Y = ham), and apply one of several possible decision
rules:
maximum likelihood (ML) – predict arg max_y P(X = x|Y = y);
maximum a posteriori (MAP) – predict arg max_y P(X = x|Y = y) P(Y = y).
The relation between these two decision rules is that ML classification is
equivalent to MAP classification with a uniform class distribution.
COMP9417 ML & DM Classification Semester 1, 2018 109 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Probabilistic decision rules
We again use the example of Naive Bayes for text classification to
illustrate, using both the multinomial and multivariate Bernoulli models.
COMP9417 ML & DM Classification Semester 1, 2018 110 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Prediction using a naive Bayes model
Suppose our vocabulary contains three words a, b and c, and we use a
multivariate Bernoulli model for our e-mails, with parameters
θ⊕ = (0.5, 0.67, 0.33)    θ⊖ = (0.67, 0.33, 0.33)
This means, for example, that the presence of b is twice as likely in spam
(+), compared with ham.
The e-mail to be classified contains words a and b but not c, and hence is
described by the bit vector x = (1, 1, 0). We obtain likelihoods
P(x|⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
P(x|⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148
The ML classification of x is thus spam.
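The same multivariate Bernoulli calculation as a short script; exact fractions 1/2, 2/3 and 1/3 are used so that the rounded figures match the slide.

theta_spam = (1/2, 2/3, 1/3)   # P(word present | spam) for words a, b, c (0.5, 0.67, 0.33 rounded)
theta_ham  = (2/3, 1/3, 1/3)   # P(word present | ham)
x = (1, 1, 0)                  # the e-mail contains a and b but not c

def bernoulli_likelihood(x, theta):
    """Product over words of theta_i if the word is present, (1 - theta_i) otherwise."""
    p = 1.0
    for xi, ti in zip(x, theta):
        p *= ti if xi else (1 - ti)
    return p

p_spam, p_ham = bernoulli_likelihood(x, theta_spam), bernoulli_likelihood(x, theta_ham)
print(round(p_spam, 3), round(p_ham, 3))     # ~0.222, ~0.148 -> ML classification: spam
print("likelihood ratio:", p_spam / p_ham)   # 3/2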
COMP9417 ML & DM Classification Semester 1, 2018 111 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Prediction using a naive Bayes model
In the case of two classes it is often convenient to work with likelihood
ratios and odds.
• The likelihood ratio can be calculated as
P (x|⊕)
P (x| )
=
0.5
0.67
0.67
0.33
1− 0.33
1− 0.33
= 3/2 > 1
• This means that the MAP classification of x is also spam if the prior
odds are more than 2/3, but ham if they are less than that.
• For example, with 33% spam and 67% ham the prior odds are
P (⊕)
P ( ) =
0.33
0.67
= 1/2, resulting in a posterior odds of
P (⊕|x)
P ( |x)
=
P (x|⊕)
P (x| )
P (⊕)
P ( )
= 3/2 · 1/2 = 3/4 < 1
In this case the likelihood ratio for x is not strong enough to push the
decision away from the prior.
COMP9417 ML & DM Classification Semester 1, 2018 112 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Prediction using a naive Bayes model
Alternatively, we can employ a multinomial model. The parameters of a
multinomial establish a distribution over the words in the vocabulary, say
θ⊕ = (0.3, 0.5, 0.2)    θ⊖ = (0.6, 0.2, 0.2)
The e-mail to be classified contains three occurrences of word a, one single
occurrence of word b and no occurrences of word c, and hence is described
by the count vector x = (3, 1, 0). The total number of vocabulary word
occurrences is n = 4. We obtain likelihoods
P(x|⊕) = 4! · (0.3³/3!) · (0.5¹/1!) · (0.2⁰/0!) = 0.054
P(x|⊖) = 4! · (0.6³/3!) · (0.2¹/1!) · (0.2⁰/0!) = 0.1728
The likelihood ratio is (0.3/0.6)³ · (0.5/0.2)¹ · (0.2/0.2)⁰ = 5/16. The ML classification
of x is thus ham, the opposite of the multivariate Bernoulli model. This is
mainly because of the three occurrences of word a, which provide strong
evidence for ham.
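And the corresponding calculation for the multinomial model (parameters as on the slide).

from math import factorial

theta_spam = (0.3, 0.5, 0.2)   # multinomial parameters over words a, b, c for spam
theta_ham  = (0.6, 0.2, 0.2)   # and for ham
x = (3, 1, 0)                  # three occurrences of a, one of b, none of c

def multinomial_likelihood(counts, theta):
    """n! * prod_i theta_i^{x_i} / x_i!  for count vector x with n = sum(x)."""
    n = sum(counts)
    p = factorial(n)
    for xi, ti in zip(counts, theta):
        p *= ti ** xi / factorial(xi)
    return p

p_spam, p_ham = multinomial_likelihood(x, theta_spam), multinomial_likelihood(x, theta_ham)
print(round(p_spam, 4), round(p_ham, 4))     # 0.054, 0.1728 -> ML classification: ham
print("likelihood ratio:", p_spam / p_ham)   # 5/16 = 0.3125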
COMP9417 ML & DM Classification Semester 1, 2018 113 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Training data for naive Bayes
A small e-mail data set described by count vectors.
E-mail #a #b #c Class
e1 0 3 0 +
e2 0 3 3 +
e3 3 0 0 +
e4 2 3 0 +
e5 4 3 0 −
e6 4 0 3 −
e7 3 0 0 −
e8 0 0 0 −
COMP9417 ML & DM Classification Semester 1, 2018 114 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Training data for naive Bayes
The same data set described by bit vectors.
E-mail a? b? c? Class
e1 0 1 0 +
e2 0 1 1 +
e3 1 0 0 +
e4 1 1 0 +
e5 1 1 0 −
e6 1 0 1 −
e7 1 0 0 −
e8 0 0 0 −
COMP9417 ML & DM Classification Semester 1, 2018 115 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Training a naive Bayes model
Consider the following e-mails consisting of five words a, b, c, d, e:
e1: b d e b b d e
e2: b c e b b d d e c c
e3: a d a d e a e e
e4: b a d b e d a b
e5: a b a b a b a e d
e6: a c a c a c a e d
e7: e a e d a e a
e8: d e d e d
We are told that e-mails e1–e4 are spam and e5–e8 are ham, and so we use them as a
small training set to train our Bayesian classifier.
• First, we decide that d and e are so-called stop words that are too
common to convey class information.
• The remaining words, a, b and c, constitute our vocabulary.
COMP9417 ML & DM Classification Semester 1, 2018 116 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Training a naive Bayes model
For the multinomial model, we represent each e-mail as a count vector, as
before.
• In order to estimate the parameters of the multinomial, we sum up
the count vectors for each class, which gives (5, 9, 3) for spam and
(11, 3, 3) for ham.
• To smooth these probability estimates we add one pseudo-count for
each vocabulary word, which brings the total number of occurrences
of vocabulary words to 20 for each class.
• The estimated parameter vectors are thus
  θ̂⊕ = (6/20, 10/20, 4/20) = (0.3, 0.5, 0.2) for spam and
  θ̂⊖ = (12/20, 4/20, 4/20) = (0.6, 0.2, 0.2) for ham.
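A sketch of this parameter estimation, with the eight e-mails and the vocabulary
{a, b, c} from the training slide written out by hand:

    from collections import Counter

    spam = ["b d e b b d e", "b c e b b d d e c c", "a d a d e a e e", "b a d b e d a b"]  # e1–e4
    ham  = ["a b a b a b a e d", "a c a c a c a e d", "e a e d a e a", "d e d e d"]        # e5–e8
    vocab = ["a", "b", "c"]   # d and e removed as stop words

    def multinomial_params(docs, vocab):
        counts = Counter()
        for doc in docs:
            counts.update(w for w in doc.split() if w in vocab)
        # Laplace smoothing: one pseudo-count per vocabulary word
        total = sum(counts[w] + 1 for w in vocab)
        return [(counts[w] + 1) / total for w in vocab]

    print(multinomial_params(spam, vocab))  # [0.3, 0.5, 0.2]
    print(multinomial_params(ham, vocab))   # [0.6, 0.2, 0.2]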
COMP9417 ML & DM Classification Semester 1, 2018 117 / 132
Naive Bayes Learning and using a naive Bayes model for classification
Training a naive Bayes model
In the multivariate Bernoulli model e-mails are represented by bit vectors,
as before.
• Adding the bit vectors for each class results in (2, 3, 1) for spam and
(3, 1, 1) for ham.
• Each count is to be divided by the number of documents in a class, in
order to get an estimate of the probability of a document containing a
particular vocabulary word.
• Probability smoothing now means adding two pseudo-documents, one containing
every vocabulary word and one containing none of them.
• This results in the estimated parameter vectors
  θ̂⊕ = (3/6, 4/6, 2/6) = (0.5, 0.67, 0.33) for spam and
  θ̂⊖ = (4/6, 2/6, 2/6) = (0.67, 0.33, 0.33) for ham.
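A matching sketch for the multivariate Bernoulli estimates (the e-mail lists are
repeated so the snippet stands on its own):

    spam = ["b d e b b d e", "b c e b b d d e c c", "a d a d e a e e", "b a d b e d a b"]  # e1–e4
    ham  = ["a b a b a b a e d", "a c a c a c a e d", "e a e d a e a", "d e d e d"]        # e5–e8
    vocab = ["a", "b", "c"]

    def bernoulli_params(docs, vocab):
        n = len(docs)
        # smoothing: two pseudo-documents, one containing every vocabulary word, one containing none
        return [(sum(1 for doc in docs if w in doc.split()) + 1) / (n + 2) for w in vocab]

    print([round(p, 2) for p in bernoulli_params(spam, vocab)])  # [0.5, 0.67, 0.33]
    print([round(p, 2) for p in bernoulli_params(ham, vocab)])   # [0.67, 0.33, 0.33]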
COMP9417 ML & DM Classification Semester 1, 2018 118 / 132
Linear discriminants
Linear discriminants
Many forms of linear discriminant from statistics and machine learning,
e.g.,
• Fisher’s Linear Discriminant Analysis
• the basic linear classifier in lecture 1 is a version of this
• Logistic Regression
• a probabilistic linear classifier
• Perceptron
• a linear threshold classifier
• an early version of an artificial “neuron”
• still a useful method, and source of ideas
COMP9417 ML & DM Classification Semester 1, 2018 119 / 132
Linear discriminants
Logistic regression
In the case of a two-class problem, model the probability of one class
P (Y = 1) vs. the alternative (1− P (Y = 1)):
P(Y = 1|x) = 1 / (1 + e^(−wᵀx))
or
ln [ P(Y = 1|x) / (1 − P(Y = 1|x)) ] = wᵀx
The quantity on the l.h.s. is called the logit, so all we are doing is fitting a
linear model for the logit.
Note: fitting this model is actually more complex than fitting linear regression, so
we omit the details.
Generalises to multiple class versions (Y can have more than two values).
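As a minimal numeric sketch (the weights below are hypothetical, not from the
slides), the two forms of the model are just the sigmoid and its inverse, the logit:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([-1.0, 2.0, 0.5])   # hypothetical weights; w[0] acts as the intercept
    x = np.array([1.0, 0.8, -0.3])   # input in homogeneous form (leading 1)

    p = sigmoid(w @ x)               # P(Y = 1 | x)
    logit = np.log(p / (1.0 - p))    # equals w @ x up to floating-point error
    y_hat = 1 if p > 0.5 else 0      # classify by thresholding the probability
    print(p, logit, y_hat)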
COMP9417 ML & DM Classification Semester 1, 2018 120 / 132
Perceptrons What are perceptrons ?
Perceptron
A linear classifier that can achieve perfect separation on linearly separable
data is the perceptron, originally proposed as a simple neural network by
F. Rosenblatt in the late 1950s.
COMP9417 ML & DM Classification Semester 1, 2018 121 / 132
Perceptrons What are perceptrons ?
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron
from the 1940s), then in hardware as a 20x20 visual sensor array with
potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
COMP9417 ML & DM Classification Semester 1, 2018 122 / 132
Perceptrons What are perceptrons ?
Perceptron
Output o is the thresholded sum of products of inputs and their weights:
o(x1, . . . , xn) = +1 if w0 + w1x1 + · · · + wnxn > 0, and −1 otherwise.
COMP9417 ML & DM Classification Semester 1, 2018 123 / 132
Perceptrons What are perceptrons ?
Perceptron
Or in vector notation:
o(x) = +1 if w · x > 0, and −1 otherwise.
COMP9417 ML & DM Classification Semester 1, 2018 124 / 132
Perceptrons Perceptrons are linear classifiers
Decision Surface of a Perceptron
[Figure: two data sets plotted over inputs x1 and x2 with classes + and −:
(a) a linearly separable set with a perceptron decision boundary; (b) a set that is
not linearly separable.]
A perceptron can represent some useful boolean functions (see the sketch below):
• What weights represent o(x1, x2) = AND(x1, x2)?
• What weights represent o(x1, x2) = XOR(x1, x2)?
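As a sketch of the first question (the weights below are one standard choice,
assuming inputs in {0, 1}): w0 = −0.8 and w1 = w2 = 0.5 give AND, while no choice of
weights gives XOR, because XOR is not linearly separable (see the next slide).

    def o(x1, x2, w0=-0.8, w1=0.5, w2=0.5):
        # thresholded weighted sum, as in the perceptron definition above
        return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((x1, x2), o(x1, x2))   # outputs +1 only for (1, 1), i.e. AND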
COMP9417 ML & DM Classification Semester 1, 2018 125 / 132
Perceptrons Perceptrons are linear classifiers
Decision Surface of a Perceptron
So some functions not representable
• e.g., not linearly separable
• a labelled data set is linearly separable if there is a linear decision
boundary that separates the classes
• for non-linearly separable data we’ll need something else
• e.g., networks of these …
• the start of “deep” networks …
COMP9417 ML & DM Classification Semester 1, 2018 126 / 132
Perceptrons How to train
Perceptron learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and
is modulated by a “smoothing” parameter η typically referred to as the
“learning rate”.
Can prove that perceptron learning will converge:
• if training data is linearly separable
• and η sufficiently small
COMP9417 ML & DM Classification Semester 1, 2018 127 / 132
Perceptrons How to train
Perceptron learning
The perceptron iterates over the training set, updating the weight vector
every time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example, then we have
yi = +1 and w · xi < t. We therefore want to find w′ such that
w′ · xi > w · xi, which moves the decision boundary towards and
hopefully past xi.
• This can be achieved by calculating the new weight vector as
w′ = w +ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set
to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
• Similarly, if xj is a misclassified negative example, then we have
yj = −1 and w · xj > t. In this case we calculate the new weight
vector as w′ = w− ηxj , and thus w′ ·xj = w ·xj − ηxj ·xj < w ·xj .
COMP9417 ML & DM Classification Semester 1, 2018 128 / 132
Perceptrons How to train
Perceptron learning
• The two cases can be combined in a single update rule:
w′ = w + ηyixi
• Here yi acts to change the sign of the update, corresponding to
whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear
classification
• The algorithm just iterates over the training examples applying the
weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative
examples, i.e., the data is linearly separable, it can be shown that the
perceptron training algorithm will converge in a finite number of steps.
COMP9417 ML & DM Classification Semester 1, 2018 129 / 132
Perceptrons How to train
Perceptron training algorithm
Algorithm Perceptron(D, η) // perceptron training for linear classification
Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
w ← 0                      // other initialisations of the weight vector are possible
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if yi(w · xi) ≤ 0 then   // i.e., ŷi ≠ yi
            w ← w + η yi xi
            converged ← false    // we changed w, so we haven't converged yet
        end
    end
end
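A runnable sketch of this training loop in NumPy (the epoch cap is an added
safeguard for non-separable data, not part of the slide's algorithm):

    import numpy as np

    def perceptron_train(X, y, eta=1.0, max_epochs=100):
        # X: (n, d) array in homogeneous coordinates (first column all ones)
        # y: (n,) array of labels in {-1, +1}; eta: learning rate
        n, d = X.shape
        w = np.zeros(d)                        # other initialisations are possible
        for _ in range(max_epochs):
            converged = True
            for i in range(n):
                if y[i] * np.dot(w, X[i]) <= 0:   # example i is misclassified
                    w = w + eta * y[i] * X[i]
                    converged = False
            if converged:
                break
        return w                               # classifier: y_hat = sign(w · x)

    # small linearly separable example
    X = np.array([[1.0,  2.0,  1.0], [1.0,  1.0,  3.0],
                  [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    print(perceptron_train(X, y))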
COMP9417 ML & DM Classification Semester 1, 2018 130 / 132
Perceptrons How to train
Perceptron training – varying learning rate
[Figure: three scatter plots (axes from −3 to 3) of the same data set, showing the
perceptron decision boundary learned with different learning rates; see caption
below.]
(left) A perceptron trained with a small learning rate (η = 0.2). The
circled examples are the ones that trigger the weight update. (middle)
Increasing the learning rate to η = 0.5 leads in this case to a rapid
convergence. (right) Increasing the learning rate further to η = 1 may lead
to too aggressive weight updating, which harms convergence.
The starting point in all three cases was the basic linear classifier.
COMP9417 ML & DM Classification Semester 1, 2018 131 / 132
Summary
Summary
• Two major frameworks for classification by linear models:
  • Distance-based: the key ideas are geometric.
  • Probabilistic: the key ideas are Bayesian.
• We also discussed Logistic Regression and the Perceptron, a simple
form of threshold model.
• These classifiers are also, in some sense, linear models
• So we have established the basis for learning classifiers
• Later we will see how to extend by building on these ideas
COMP9417 ML & DM Classification Semester 1, 2018 132 / 132