Classification (1)
COMP9417 Machine Learning and Data Mining
Term 2, 2021
COMP9417 ML & DM Classification (1) Term 2, 2021 1 / 86
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. P. Murphy. MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Classification (1) Term 2, 2021 2 / 86
Aims
This lecture will introduce you to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline the problem of learning to classify
• outline a framework for solving machine learning problems
• describe issues of generalisation and evaluation for classification
• outline the use of a linear model as a 2-class classifier
• outline the Perceptron classification algorithm
• describe distance measures and how they are used in classification
• describe the k-nearest neighbour method for classification (and regression)
COMP9417 ML & DM Classification (1) Term 2, 2021 3 / 86
Introduction
Introduction
Classification (sometimes called concept learning) methods dominate machine learning . . .
. . . however, unlike regression they often lack convenient mathematical properties, so they are more complicated to analyse. The idea is to learn a classifier, which is usually a function mapping from an input data point to one of a set of discrete outputs, i.e., the classes.
We will mostly focus on their advantages and disadvantages as learning methods first, and point to unifying ideas and approaches where applicable. In this and the next lecture we focus on classification methods that are essentially linear models . . .
and in later lectures we will see other, more expressive, classifier learning methods.
COMP9417 ML & DM Classification (1) Term 2, 2021 4 / 86
Introduction
Assassinating spam e-mail
SpamAssassin is a widely used open-source spam filter. It calculates a score for an incoming e-mail, based on a number of built-in rules or ‘tests’ in SpamAssassin’s terminology, and adds a ‘junk’ flag and a summary report to the e-mail’s headers if the score is 5 or more.
-0.1  RCVD_IN_MXRATE_WL        RBL: MXRate recommends allowing [123.45.6.789 listed in sub.mxrate.net]
 0.6  HTML_IMAGE_RATIO_02      BODY: HTML has a low ratio of text to image area
 1.2  TVD_FW_GRAPHIC_NAME_MID  BODY: TVD_FW_GRAPHIC_NAME_MID
 0.0  HTML_MESSAGE             BODY: HTML included in message
 0.6  HTML_FONT_FACE_BAD       BODY: HTML font face is not a word
 1.4  SARE_GIF_ATTACH          FULL: Email has an inline gif
 0.1  BOUNCE_MESSAGE           MTA bounce message
 0.1  ANY_BOUNCE_MESSAGE       Message is some kind of bounce message
 1.4  AWL                      AWL: From: address is in the auto white-list
From left to right you see the score attached to a particular test, the test identifier, and a short description including a reference to the relevant part of the e-mail. As you see, scores for individual tests can be negative (indicating evidence suggesting the e-mail is ham rather than spam) as well as positive. The overall score of 5.3 suggests the e-mail might be spam.
COMP9417 ML & DM Classification (1) Term 2, 2021 5 / 86
Introduction
Linear classification
Suppose we have only two tests and four training e-mails, one of which is spam. Both tests succeed for the spam e-mail; for one ham e-mail neither test succeeds, for another the first test succeeds and the second doesn’t, and for the third ham e-mail the first test fails and the second succeeds.
It is easy to see that assigning both tests a weight of 4 correctly ‘classifies’ these four e-mails into spam and ham. In the mathematical notation introduced above we could describe this classifier as 4x1 + 4x2 > 5 or
(4, 4) · (x1, x2) > 5.
In fact, any weight between 2.5 and 5 will ensure that the threshold of 5 is only exceeded when both tests succeed. We could even consider assigning different weights to the tests – as long as each weight is less than 5 and their sum exceeds 5 – although it is hard to see how this could be justified by the training data.
COMP9417 ML & DM Classification (1) Term 2, 2021 6 / 86
Introduction
Spam filtering as a classification task
The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4x1 + 4x2 at 5, we can separate spam from ham.

E-mail   x1   x2   Spam?   4x1 + 4x2
   1      1    1     1         8
   2      0    0     0         0
   3      1    0     0         4
   4      0    1     0         4
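A minimal sketch of the thresholding described above (not from the lecture; the data is simply the four-row table), re-computing the score 4x1 + 4x2 for each e-mail and comparing it with the threshold 5:

```python
# Sketch: the two-test linear spam classifier on the four training e-mails.
emails = [  # (x1, x2, is_spam)
    (1, 1, True),
    (0, 0, False),
    (1, 0, False),
    (0, 1, False),
]

w = (4, 4)   # weights assigned to the two tests
t = 5        # decision threshold

for x1, x2, is_spam in emails:
    score = w[0] * x1 + w[1] * x2
    predicted_spam = score > t
    print(x1, x2, score, predicted_spam, predicted_spam == is_spam)
```

Only the first e-mail exceeds the threshold, so all four are classified correctly.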
COMP9417 ML & DM Classification (1) Term 2, 2021 7 / 86
Introduction
Linear classification in two dimensions
[Figure: positive (+) and negative (–) examples in the (x1, x2) plane, separated by a straight line; the weight vector w is perpendicular to the boundary, and x0 is the point on the boundary in the direction of w.]
• straight line separates positives from negatives (linear “discriminant”)
• defined by w · xi = t
• w is perpendicular to decision boundary
• w points in direction of positives
• t is the decision threshold
COMP9417 ML & DM Classification (1) Term 2, 2021 8 / 86
Introduction
Linear classification in two dimensions
Note: xi points to a point on the decision boundary. In particular, x0 points in the same direction as w,
from which it follows that w · x0 = ||w|| ||x0|| = t (where ||x|| denotes the length of the vector x).
COMP9417 ML & DM Classification (1) Term 2, 2021 9 / 86
Introduction
Homogeneous coordinates
It is sometimes convenient to simplify notation further by introducing an extra constant ‘variable’ x0 = 1, the weight of which is fixed to w0 = −t.
The extended data point is then x◦ = (1, x1, . . . , xn) and the extended weight vector is w◦ = (−t, w1, . . . , wn), leading to the decision rule w◦ · x◦ > 0 and the decision boundary w◦ · x◦ = 0.
Thanks to these so-called homogeneous coordinates the decision boundary passes through the origin of the extended coordinate system, at the expense of needing an additional dimension.
Note: this doesn’t really affect the data, as all data points and the ‘real’ decision boundary live in the plane x0 = 1.
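As a small illustration (my own sketch, not from the slides), moving to homogeneous coordinates just means prepending a constant 1 to each data point and −t to the weight vector; the spam example’s weights (4, 4) and threshold 5 are reused here:

```python
import numpy as np

# Sketch: homogeneous coordinates for the spam example (assumed toy data).
X = np.array([[1.0, 1.0],
              [0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])   # original data points
w = np.array([4.0, 4.0])     # original weights
t = 5.0                      # decision threshold

X_ext = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
w_ext = np.concatenate([[-t], w])                  # prepend w0 = -t

# The decision rule w_ext . x_ext > 0 is equivalent to w . x > t:
print(X_ext @ w_ext > 0)   # [ True False False False]
print(X @ w > t)           # same result
```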
COMP9417 ML & DM Classification (1) Term 2, 2021 10 / 86
Introduction
Machine learning for spam filtering
[Diagram: e-mails are converted by SpamAssassin’s tests into data, which a linear classifier maps to a ‘spam or ham’ decision; in the machine-learning part, training data is used to learn the classifier’s weights.]
At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin’s built-in tests, and a linear classifier is applied to obtain a ‘spam or ham’ decision. At the bottom (in blue) we see the bit that is done by machine learning.
COMP9417 ML & DM Classification (1) Term 2, 2021 11 / 86
Introduction
Example: a Bayesian classifier I
Bayesian spam filters maintain a vocabulary of words and phrases – potential spam or ham indicators – for which statistics are collected from a training set.
• For instance, suppose that the word ‘Viagra’ occurred in four spam e-mails and in one ham e-mail. If we then encounter a new e-mail that contains the word ‘Viagra’, we might reason that the odds that this e-mail is spam are 4:1, or the probability of it being spam is 0.80 and the probability of it being ham is 0.20.
• The situation is slightly more subtle because we have to take into account the prevalence of spam. Suppose that I receive on average one spam e-mail for every six ham e-mails. This means that I would estimate the odds of an unseen e-mail being spam as 1:6, i.e., non-negligible but not very high either.
COMP9417 ML & DM Classification (1) Term 2, 2021 12 / 86
Introduction
Example: a Bayesian classifier II
• If I then learn that the e-mail contains the word ‘Viagra’, which occurs four times as often in spam as in ham, I need to combine these two odds. As we shall see later, Bayes’ rule tells us that we should simply multiply them: 1:6 times 4:1 is 4:6, corresponding to a spam probability of 0.4.
In this way you are combining two independent pieces of evidence, one concerning the prevalence of spam, and the other concerning the occurrence of the word ‘Viagra’, pulling in opposite directions.
COMP9417 ML & DM Classification (1) Term 2, 2021 13 / 86
Introduction
Example: a Bayesian classifier III
The nice thing about this ‘Bayesian’ classification scheme is that it can be repeated if you have further evidence. For instance, suppose that the odds in favour of spam associated with the phrase ‘blue pill’ are estimated at 3:1, and suppose our e-mail contains both ‘Viagra’ and ‘blue pill’; then the combined odds are 4:1 times 3:1 = 12:1, which is ample to outweigh the 1:6 odds associated with the low prevalence of spam (total odds are 2:1, or a spam probability of 0.67, up from 0.40 without the ‘blue pill’).
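A sketch of the odds arithmetic above in code; `combine_odds` is a hypothetical helper for illustration, not part of any spam filter:

```python
# Sketch: combine prior odds with per-feature likelihood ratios by multiplying.
def combine_odds(prior_odds, likelihood_ratios):
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr          # Bayes' rule for independent pieces of evidence
    return odds

prior = 1 / 6                      # 1:6 odds that an unseen e-mail is spam
viagra, blue_pill = 4 / 1, 3 / 1   # odds ratios associated with each phrase

odds = combine_odds(prior, [viagra, blue_pill])
print(odds, odds / (1 + odds))     # 2.0, i.e. spam probability 0.666...
```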
COMP9417 ML & DM Classification (1) Term 2, 2021 14 / 86
Introduction
Example: a rule-based classifier
• if the e-mail contains the word ‘Viagra’ then estimate the odds of spam as 4:1;
• otherwise, if it contains the phrase ‘blue pill’ then estimate the odds of spam as 3:1;
• otherwise, estimate the odds of spam as 1:6.
The first rule covers all e-mails containing the word ‘Viagra’, regardless of whether they contain the phrase ‘blue pill’, so no overcounting occurs. The second rule only covers e-mails containing the phrase ‘blue pill’ but not the word ‘Viagra’, by virtue of the ‘otherwise’ clause. The third rule covers all remaining e-mails: those which contain neither ‘Viagra’ nor ‘blue pill’.
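The three rules translate almost directly into code. The sketch below assumes an e-mail is just a string and that substring matching stands in for SpamAssassin-style tests:

```python
# Sketch: a direct transcription of the three rules above.
def spam_odds(email_text):
    if 'Viagra' in email_text:
        return 4 / 1       # first rule: covers every e-mail containing 'Viagra'
    elif 'blue pill' in email_text:
        return 3 / 1       # second rule: 'blue pill' but not 'Viagra'
    else:
        return 1 / 6       # default rule: prior odds only

print(spam_odds("Buy Viagra and a blue pill"))  # 4.0 -- only the first rule fires
print(spam_odds("Meeting at noon"))             # 0.1666... -- default odds
```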
COMP9417 ML & DM Classification (1) Term 2, 2021 15 / 86
Some ingredients of machine learning
Step back: some ingredients of machine learning
COMP9417 ML & DM Classification (1) Term 2, 2021 16 / 86
Some ingredients of machine learning
How machine learning helps to solve a task
[Diagram: domain objects are described by features to give data; a model maps the data to outputs, addressing the task; training data and a learning algorithm are used to solve the learning problem whose solution is the model.]
An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).
COMP9417 ML & DM Classification (1) Term 2, 2021 17 / 86
Some ingredients of machine learning
Some terminology I
Tasks are addressed by models, whereas learning problems are solved by learning algorithms that produce models.
COMP9417 ML & DM Classification (1) Term 2, 2021 18 / 86
Some ingredients of machine learning
Some terminology II
Machine learning is concerned with using the right features to build the right models that achieve the right tasks.
COMP9417 ML & DM Classification (1) Term 2, 2021 19 / 86
Some ingredients of machine learning
Some terminology III
Models lend the machine learning field diversity, but tasks and features give it unity.
COMP9417 ML & DM Classification (1) Term 2, 2021 20 / 86
Some ingredients of machine learning
Some terminology IV
Does the algorithm require all training data to be present before the start of learning? If yes, then it is categorised as a batch learning algorithm.
If, however, it can continue to learn as new data arrives, it is an online learning algorithm.
COMP9417 ML & DM Classification (1) Term 2, 2021 21 / 86
Some ingredients of machine learning
Some terminology V
If the model has a fixed number of parameters it is categorised as parametric.
Otherwise, if the number of parameters grows with the amount of training data it is categorised as non-parametric.
COMP9417 ML & DM Classification (1) Term 2, 2021 22 / 86
Some ingredients of machine learning
Example: a “geometric” classification model
Example: basic linear classifier I
The basic linear classifier constructs a decision boundary by half-way intersecting the line segment between the positive and negative centres of mass.
[Figure: positive centre of mass p and negative centre of mass n, with w = p − n and the decision boundary passing through (p + n)/2, perpendicular to w.]
COMP9417 ML & DM Classification (1) Term 2, 2021 23 / 86
Some ingredients of machine learning
Example: a “geometric” classification model
Example: basic linear classifier II
The basic linear classifier is described by the equation w · x = t, with w = p − n; the decision threshold can be found by noting that (p + n)/2 is on the decision boundary, and hence

t = (p − n) · (p + n)/2 = (||p||² − ||n||²)/2,

where ||x|| denotes the length of vector x.
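A sketch of this construction on assumed toy two-dimensional data; `fit_basic_linear_classifier` and `predict` are illustrative names, not standard library functions:

```python
import numpy as np

# Sketch: the basic linear classifier, with exemplars given by the class means,
# w = p - n and t = (||p||^2 - ||n||^2) / 2.
def fit_basic_linear_classifier(X_pos, X_neg):
    p = X_pos.mean(axis=0)          # centre of mass of the positives
    n = X_neg.mean(axis=0)          # centre of mass of the negatives
    w = p - n
    t = (p @ p - n @ n) / 2         # (||p||^2 - ||n||^2) / 2
    return w, t

def predict(w, t, X):
    return np.where(X @ w > t, +1, -1)

X_pos = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0]])
X_neg = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
w, t = fit_basic_linear_classifier(X_pos, X_neg)
print(predict(w, t, np.vstack([X_pos, X_neg])))   # +1 for the first three points
```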
COMP9417 ML & DM Classification (1) Term 2, 2021 24 / 86
Some ingredients of machine learning
Algorithm-independent aspects of machine learning
The philosophical problem
Recall: Deduction: derive specific consequences from general theories
Induction: derive general theories from specific observations
Deduction is well-founded (mathematical logic).
Induction is (philosophically and practically) problematic – yet induction is useful since it often seems to work – which is itself an inductive argument!
COMP9417 ML & DM Classification (1) Term 2, 2021 25 / 86
Some ingredients of machine learning
Algorithm-independent aspects of machine learning
Generalisation – the key objective of machine learning
What we are really interested in is generalising from the sample of data in our training set. This requires an assumption that can be stated as (last lecture):
The inductive learning hypothesis
Any hypothesis found to approximate the target (true) function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.¹
¹ T. Mitchell (1997) “Machine Learning”
COMP9417 ML & DM Classification (1) Term 2, 2021 26 / 86
Some ingredients of machine learning
Algorithm-independent aspects of machine learning
Generalisation – the key objective of machine learning
A corollary of the inductive learning hypothesis is that it is necessary to make some assumptions about the type of target function in a task for an algorithm to go beyond the data, i.e., generalise or learn.
The set of assumptions required for a machine learning algorithm to be able to generalise is known as the inductive bias of the algorithm.
We’ll see that this idea of inductive bias is often used when comparing algorithms, and can be formalised in some cases (later lecture).
COMP9417 ML & DM Classification (1) Term 2, 2021 27 / 86
Some ingredients of machine learning
Evaluating performance on a task
Evaluating classification – contingency table I
For the two-class prediction case:
                         Predicted Class
                         Yes                    No
Actual Class   Yes       True Positive (TP)     False Negative (FN)
               No        False Positive (FP)    True Negative (TN)
COMP9417 ML & DM Classification (1) Term 2, 2021 28 / 86
Some ingredients of machine learning
Evaluating performance on a task
Evaluating classification – contingency table II
Classification Accuracy on a sample of labelled pairs (x, c(x)), given a learned classification model that predicts, for each instance x, a class value ĉ(x):

acc = (1 / |Test|) Σ_{x ∈ Test} I[ĉ(x) = c(x)]

where Test is a test set and I[] is the indicator function, which is 1 iff its argument evaluates to true, and 0 otherwise.
Classification Error is 1 − acc.
COMP9417 ML & DM Classification (1) Term 2, 2021 29 / 86
Some ingredients of machine learning
Evaluating performance on a task
Cross-validation I
There are certain parameters (e.g., hyperparameters such as k in k-nearest neighbour) that need to be estimated during learning. To estimate them we use data, but NOT the training set, OR the test set. Instead, we
use a separate validation or development set.
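The figures on the following cross-validation slides are not reproduced here. As a stand-in, here is a minimal k-fold cross-validation sketch; `train_and_score` is a placeholder for any learner that fits on the training folds and reports validation accuracy:

```python
import numpy as np

# Sketch: k-fold cross-validation (assumed example, not from the slides).
def k_fold_indices(n, k, rng):
    idx = rng.permutation(n)          # shuffle, then split into k folds
    return np.array_split(idx, k)

def cross_validate(X, y, k, train_and_score, seed=0):
    rng = np.random.default_rng(seed)
    folds = k_fold_indices(len(X), k, rng)
    scores = []
    for i in range(k):
        val_idx = folds[i]            # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return np.mean(scores)

# Example usage with a trivial majority-class "learner":
X = np.arange(20).reshape(-1, 1).astype(float)
y = np.array([0] * 12 + [1] * 8)
majority = lambda Xtr, ytr, Xv, yv: np.mean(yv == np.bincount(ytr).argmax())
print(cross_validate(X, y, k=5, train_and_score=majority))
```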
COMP9417 ML & DM Classification (1) Term 2, 2021 30 / 86
Some ingredients of machine learning
Evaluating performance on a task
Cross-validation II
COMP9417 ML & DM Classification (1) Term 2, 2021 31 / 86
Some ingredients of machine learning
Evaluating performance on a task
Cross-validation III
COMP9417 ML & DM Classification (1) Term 2, 2021 32 / 86
Some ingredients of machine learning
Evaluating performance on a task
Cross-validation IV
COMP9417 ML & DM Classification (1) Term 2, 2021 33 / 86
Perceptrons
What are perceptrons ?
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
COMP9417 ML & DM Classification (1) Term 2, 2021 34 / 86
Perceptrons
What are perceptrons ?
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
COMP9417 ML & DM Classification (1) Term 2, 2021 35 / 86
Perceptrons
What are perceptrons ?
Perceptron
Output o is the thresholded sum of the products of the inputs and their weights:

o(x1, . . . , xn) = +1 if w0 + w1x1 + · · · + wnxn > 0, and −1 otherwise.
COMP9417 ML & DM Classification (1) Term 2, 2021 36 / 86
Perceptrons
What are perceptrons ?
Perceptron
Or in vector notation:

o(x) = +1 if w · x > 0, and −1 otherwise.
COMP9417 ML & DM Classification (1) Term 2, 2021 37 / 86
Perceptrons
Perceptrons are linear classifiers
Decision Surface of a Perceptron
[Figure: (a) a set of + and − examples in the (x1, x2) plane that can be separated by a straight line; (b) the XOR-style arrangement, with + and − at opposite corners, which cannot.]
Represents some useful functions
• What weights represent o(x1, x2) = AND(x1, x2)?
• What weights represent o(x1, x2) = XOR(x1, x2)?
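One possible answer to the first question, as a sketch (with inputs in {0, 1}): weights w1 = w2 = 1 and bias w0 = −1.5 implement AND. No analogous choice exists for XOR, which is the point of the next slide.

```python
# Sketch: a perceptron computing AND on {0, 1} inputs with an assumed weight choice.
def perceptron_output(w0, w, x):
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(-1.5, (1, 1), x))   # only (1, 1) gives +1
```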
COMP9417 ML & DM Classification (1) Term 2, 2021 38 / 86
Perceptrons
Perceptrons are linear classifiers
Decision Surface of a Perceptron
So some functions are not representable
• e.g., not linearly separable
• a labelled data set is linearly separable if there is a linear decision boundary that separates the classes
• for non-linearly separable data we’ll need something else
• e.g., networks of these …
• the start of “deep” networks …
COMP9417 ML & DM Classification (1) Term 2, 2021 39 / 86
Perceptrons
How to train
Perceptron learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
Can prove that perceptron learning will converge:
• if training data is linearly separable
• and η sufficiently small
COMP9417 ML & DM Classification (1) Term 2, 2021 40 / 86
Perceptrons
How to train
Perceptron learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example; then we have yi = +1 and w · xi < t, so we want a new weight vector w′ such that w′ · xi > w · xi, moving the decision boundary towards (and hopefully past) xi.
• This can be achieved by calculating the new weight vector as w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume it is set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi, as required.
• Similarly, if xj is a misclassified negative example, then we have yj = −1 and w · xj > t. In this case we calculate the new weight vector as w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.
COMP9417 ML & DM Classification (1) Term 2, 2021 41 / 86
Perceptrons
How to train
Perceptron learning
• The two cases can be combined in a single update rule: w′ = w + ηyixi
• Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear classification
• The algorithm just iterates over the training examples applying the weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
COMP9417 ML & DM Classification (1) Term 2, 2021 42 / 86
Perceptrons
How to train
Perceptron training algorithm
Algorithm Perceptron(D, η) // perceptron training for linear classification
Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
w ← 0                            // other initialisations of the weight vector are possible
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if yi w · xi ≤ 0 then    // i.e., ŷi ≠ yi
            w ← w + ηyixi
            converged ← false    // we changed w so haven’t converged yet
        end
    end
end
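A NumPy sketch of the algorithm above. The `max_epochs` cap is an added safeguard not in the pseudocode (which loops until convergence), and the data at the bottom is an assumed toy example:

```python
import numpy as np

# Sketch: perceptron training in homogeneous coordinates
# (each row of X already starts with the constant feature 1).
def perceptron(X, y, eta=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])             # other initialisations are possible
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:       # i.e. the example is misclassified
                w = w + eta * yi * xi    # weight update on a misclassified example
                converged = False
        if converged:
            break
    return w

# Tiny linearly separable example (assumed data, labels in {-1, +1}):
X = np.array([[1, 2.0, 2.0], [1, 3.0, 3.0], [1, -1.0, -2.0], [1, -2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))                 # predictions match y on separable data
```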
COMP9417 ML & DM Classification (1) Term 2, 2021 43 / 86
Perceptrons
How to train
Perceptron training – varying learning rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update. (middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
The starting point in all three cases was the basic linear classifier.
COMP9417 ML & DM Classification (1) Term 2, 2021 44 / 86
Distance-based models
Nearest Neighbour
Nearest Neighbour is a classification or regression algorithm that predicts whatever is the output value of the nearest data point to some query point.
COMP9417 ML & DM Classification (1) Term 2, 2021 45 / 86
Distance-based models
Minkowski distance
Minkowski distance If X = R^d, the Minkowski distance of order p > 0 is defined as

Dis_p(x, y) = ( Σ_{j=1}^{d} |xj − yj|^p )^{1/p} = ||x − y||_p,

where ||z||_p = ( Σ_{j=1}^{d} |zj|^p )^{1/p} is the p-norm (sometimes denoted the Lp norm) of the vector z.
Note: sometimes p is omitted when writing the norm, often when p = 2.
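A sketch of the definition above; `minkowski` is an illustrative helper written directly from the formula:

```python
import numpy as np

# Sketch: Minkowski distance of order p (the p-norm of x - y).
def minkowski(x, y, p):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))                           # 7.0  (Manhattan)
print(minkowski(x, y, 2))                           # 5.0  (Euclidean)
print(np.max(np.abs(np.array(x) - np.array(y))))    # 4.0  (Chebyshev, p -> infinity)
```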
COMP9417 ML & DM Classification (1) Term 2, 2021 46 / 86
Distance-based models
Minkowski distance
• The 2-norm refers to the familiar Euclidean distance

  Dis_2(x, y) = sqrt( Σ_{j=1}^{d} (xj − yj)² ) = sqrt( (x − y)ᵀ(x − y) ),

  which measures distance ‘as the crow flies’.
• The 1-norm denotes Manhattan distance, also called cityblock distance:

  Dis_1(x, y) = Σ_{j=1}^{d} |xj − yj|

  This is the distance if we can only travel along coordinate axes.
COMP9417 ML & DM Classification (1) Term 2, 2021 47 / 86
Distance-based models
Minkowski distance
• If we now let p grow larger, the distance will be more and more dominated by the largest coordinate-wise distance, from which we can infer that Dis_∞(x, y) = max_j |xj − yj|; this is also called Chebyshev distance.
• You will sometimes see references to the 0-norm (or L0 norm), which counts the number of non-zero elements in a vector. The corresponding distance then counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as

  Dis_0(x, y) = Σ_{j=1}^{d} (xj − yj)⁰ = Σ_{j=1}^{d} I[xj ≠ yj]

  under the understanding that x⁰ = 0 for x = 0 and 1 otherwise.
COMP9417 ML & DM Classification (1) Term 2, 2021 48 / 86
Distance-based models
Minkowski distance
Sometimes the data X is not naturally in Rd, but if we can turn it into Boolean features, or character sequences, we can still apply distance measures. For example:
• If x and y are binary strings, this is also called the Hamming distance. Alternatively, we can see the Hamming distance as the number of bits that need to be flipped to change x into y.
• For non-binary strings of unequal length this can be generalised to the notion of edit distance or Levenshtein distance.
COMP9417 ML & DM Classification (1) Term 2, 2021 49 / 86
Distance-based models
Circles and ellipses
Lines connecting points at order-p Minkowski distance 1 from the origin for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 4;
p = 8; and p = ∞ (Chebyshev distance, the blue square). Notice that for points on the coordinate axes all distances agree. For the other points, our reach increases with p; however, if we require a rotation-invariant distance metric then Euclidean distance is our only choice.
COMP9417 ML & DM Classification (1) Term 2, 2021 50 / 86
Distance-based models
Distance metric
Distance metric Given an instance space X , a distance metric is a function Dis : X × X → R such that for any x, y, z ∈ X :
1 distances between a point and itself are zero: Dis(x, x) = 0;
2 all other distances are larger than zero: if x ≠ y then Dis(x, y) > 0;
3 distances are symmetric: Dis(y, x) = Dis(x, y);
4 detours can not shorten the distance: Dis(x, z) ≤ Dis(x, y) + Dis(y, z).
If the second condition is weakened to a non-strict inequality, so that Dis(x, y) may be zero even if x ≠ y, then the function Dis is called a pseudo-metric.
COMP9417 ML & DM Classification (1) Term 2, 2021 51 / 86
Distance-based models
The triangle inequality – Minkowski distance for p = 2
The green circle connects points the same Euclidean distance (i.e., Minkowski distance of order p = 2) away from the origin as A. The orange circle shows that B and C are equidistant from A. The red circle demonstrates that C is closer to the origin than B, which conforms to the triangle inequality.
COMP9417 ML & DM Classification (1) Term 2, 2021 52 / 86
Distance-based models
The triangle inequality – Minkowski distance for p ≤ 1
With Manhattan distance (p = 1), B and C are equally close to the origin and also equidistant from A. With p < 1 (here, p = 0.8) C is further away from the origin than B; since both are again equidistant from A, it follows that travelling from the origin to C via A is quicker than going there directly, which violates the triangle inequality.
COMP9417 ML & DM Classification (1) Term 2, 2021 53 / 86
Distance-based models
Neighbours and exemplars
Means and distances
The arithmetic mean minimises squared Euclidean distance The arithmetic mean μ of a set of data points D in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that arg min_y Σ_{x∈D} ||x − y||² = μ, where || · || denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial derivatives with respect to yi) of the sum and setting it to the zero vector:

∇_y Σ_{x∈D} ||x − y||² = −2 Σ_{x∈D} (x − y) = −2 Σ_{x∈D} x + 2|D| y = 0,

from which we derive y = (1/|D|) Σ_{x∈D} x = μ.
COMP9417 ML & DM Classification (1) Term 2, 2021 54 / 86
Distance-based models
Neighbours and exemplars
Means and distances
• Notice that minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t it be more natural to take the point that minimises total Euclidean distance as exemplar?
• This point is known as the geometric median, as for univariate data it corresponds to the median or ‘middle value’ of a set of numbers. However, for multivariate data there is no closed-form expression for the geometric median, which needs to be calculated by successive approximation.
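One standard successive-approximation scheme is Weiszfeld’s algorithm; the sketch below is an illustration (not from the slides) and assumes no iterate lands exactly on a data point:

```python
import numpy as np

# Sketch: geometric median by iteratively re-weighted averaging (Weiszfeld).
def geometric_median(X, tol=1e-6, max_iter=1000):
    y = X.mean(axis=0)                       # start from the arithmetic mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - y, axis=1)
        w = 1.0 / np.maximum(d, 1e-12)       # inverse-distance weights
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(X.mean(axis=0))        # the outlier pulls the mean out to (2.75, 2.75)
print(geometric_median(X))   # the geometric median stays near the main cluster
```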
COMP9417 ML & DM Classification (1) Term 2, 2021 55 / 86
Distance-based models
Neighbours and exemplars
Means and distances
• In certain situations it makes sense to restrict an exemplar to be one of the given data points. In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that doesn’t have to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the total distance to all other data points, in order to choose the point that minimises it. Regardless of the distance metric used, this is an O(n²) operation for n points.
• So for medoids there is no computational reason to prefer one distance metric over another.
• There may be more than one medoid.
COMP9417 ML & DM Classification (1) Term 2, 2021 56 / 86
Distance-based models
Neighbours and exemplars
Centroids and medoids
A small data set of 10 points, with circles indicating centroids and squares indicating medoids (the latter must be data points), for different distance metrics. Notice how the outlier on the bottom-right ‘pulls’ the mean away from the geometric median; as a result the corresponding medoid changes as well.
[Figure legend: data points; squared 2-norm centroid (mean); 2-norm centroid (geometric median); squared 2-norm medoid; 2-norm medoid; 1-norm medoid.]
COMP9417 ML & DM Classification (1) Term 2, 2021 57 / 86
Distance-based models
Neighbours and exemplars
The basic linear classifier is distance-based
• The basic linear classifier constructs the decision boundary as the perpendicular bisector of the line segment connecting the two exemplars (one for each class).
• An alternative, distance-based way to classify instances without direct reference to a decision boundary is by the following decision rule: if x is nearest to μ⊕ then classify it as positive, otherwise as negative; or equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance-based perspective as constructing exemplars that minimise squared Euclidean distance within each class, and then applying a nearest-exemplar decision rule.
COMP9417 ML & DM Classification (1) Term 2, 2021 58 / 86
Distance-based models
Neighbours and exemplars
Two-exemplar decision boundaries
(left) For two exemplars the nearest-exemplar decision rule with Euclidean distance results in a linear decision boundary coinciding with the perpendicular bisector of the line connecting the two exemplars. (right) Using Manhattan distance the circles are replaced by diamonds.
COMP9417 ML & DM Classification (1) Term 2, 2021 59 / 86
Distance-based models
Neighbours and exemplars
Three-exemplar decision boundaries
(left) Decision regions defined by the 2-norm nearest-exemplar decision rule for three exemplars. (right) With Manhattan distance the decision regions become non-convex.
COMP9417 ML & DM Classification (1) Term 2, 2021 60 / 86
Distance-based models
Neighbours and exemplars
Distance-based models
To summarise, the main ingredients of distance-based models are
• distance metrics, which can be Euclidean, Manhattan, Minkowski or Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen distance metric, or medoids that find the most centrally located data point; and
• distance-based decision rules, which take a vote among the k nearest exemplars.
COMP9417 ML & DM Classification (1) Term 2, 2021 61 / 86
Nearest neighbour classification
Nearest neighbour classification
COMP9417 ML & DM Classification (1) Term 2, 2021 62 / 86
Nearest neighbour classification
Nearest neighbour classification
• Related to the simplest form of learning: rote learning or memorization
• Training instances are searched for instance that most closely resembles new or query instance
• The instances themselves represent the knowledge
• Called: instance-based, memory-based learning or case-based learning;
often a form of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars, should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, lowess, . . .
• Ideas also important for unsupervised methods, e.g., clustering (later lectures)
COMP9417 ML & DM Classification (1) Term 2, 2021 63 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
Nearest Neighbour
Stores all training examples ⟨xi, f(xi)⟩.
Nearest neighbour:
• Given query instance xq, first locate nearest training example xn, then estimate f̂(xq) ← f(xn)
k-Nearest neighbour:
• Given xq, take vote among its k nearest neighbours (if discrete-valued target function)
• take mean of f values of k nearest neighbours (if real-valued):

  f̂(xq) ← ( Σ_{i=1}^{k} f(xi) ) / k
COMP9417 ML & DM Classification (1) Term 2, 2021 64 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
k-Nearest Neighbour (kNN) Algorithm
Training algorithm:
• For each training example ⟨xi, f(xi)⟩, add the example to the list of training examples.
Classification algorithm:
• Given a query instance xq to be classified,
• Let x1 . . . xk be the k instances from the training examples that are nearest to xq by the distance function
• Return

  f̂(xq) ← arg max_{v∈V} Σ_{i=1}^{k} I[v, f(xi)]

  where I[a, b] = 1 if a = b and 0 otherwise.
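A sketch of the classification algorithm above with Euclidean distance; the toy data and the function name `knn_predict` are assumptions for illustration:

```python
import numpy as np
from collections import Counter

# Sketch: k-NN classification with Euclidean distance and majority vote.
def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every example
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                    # arg max over class values

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(['ham', 'ham', 'spam', 'spam'])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # 'spam'
```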
COMP9417 ML & DM Classification (1) Term 2, 2021 65 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
“Hypothesis Space” for Nearest Neighbour
2 classes, + and −, and query point xq. On the left, note the effect of varying k. On the right, 1-NN induces a Voronoi tessellation of the instance space, formed by the perpendicular bisectors of lines between points.
COMP9417 ML & DM Classification (1) Term 2, 2021 66 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
Distance function again
The distance function defines what is learned, i.e., how an instance x will be classified, given a training dataset.
Instance x is described by a feature vector (list of attribute-value pairs) ⟨a1(x), a2(x), . . . ad(x)⟩
where ar(x) denotes the value of the rth attribute (feature) of x. Most commonly used distance function is Euclidean distance . . .
• distance between two instances xi and xj is defined to be

  Dis_2(xi, xj) = sqrt( Σ_{r=1}^{d} (ar(xi) − ar(xj))² )
COMP9417 ML & DM Classification (1) Term 2, 2021 67 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
Distance function again
Many other distance functions could be used . . .
• e.g., Manhattan or city-block distance (sum of absolute values of
differences between attributes)
  Dis_1(xi, xj) = Σ_{r=1}^{d} |ar(xi) − ar(xj)|
Typically use a vector-based formalization — norm L1, L2, . . . The idea of distance functions will appear again in kernel methods.
COMP9417 ML & DM Classification (1) Term 2, 2021 68 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
Normalization and other issues
• Different attributes measured on different scales
• Need to be normalized (why ?)
ar = (vr − min vr) / (max vr − min vr)
where vr is the actual value of attribute r
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant (given normalized attributes)
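A sketch of the normalisation formula above applied column-wise to a feature matrix; the guard against zero-range attributes is an added detail not in the slide:

```python
import numpy as np

# Sketch: min-max normalisation of each attribute to [0, 1].
def min_max_normalise(X):
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
    return (X - mins) / ranges

X = np.array([[170.0, 60.0], [180.0, 90.0], [160.0, 75.0]])   # e.g. height (cm), weight (kg)
print(min_max_normalise(X))   # both columns now range over [0, 1]
```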
COMP9417 ML & DM Classification (1) Term 2, 2021 69 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
When To Consider Nearest Neighbour
• Instances map to points in Rn
• Low-dimensional data, say, less than 20 attributes per instance
• or number of attributes can be reduced . . .
• Lots of training data
• No requirement for “explanatory” model to be learned
COMP9417 ML & DM Classification (1) Term 2, 2021 70 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
When To Consider Nearest Neighbour
Advantages:
• Statisticians have used k-NN since the early 1950s
• Can be very accurate
  • at most twice the “Bayes error” for 1-NN²
• Training is very fast
• Can learn complex target functions
• Don’t lose information by generalization - keep all instances
² See Hastie et al. (2009).
COMP9417 ML & DM Classification (1) Term 2, 2021 71 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
When To Consider Nearest Neighbour
Disadvantages:
• Slow at query time: basic algorithm scans entire training data to derive a prediction
• “Curse of dimensionality”
• Assumes all attributes are equally important, so easily fooled by irrelevant attributes
• Remedy: attribute selection or weights
• Problem of noisy instances:
• Remedy: remove from data set
• not easy – how to know which are noisy ?
COMP9417 ML & DM Classification (1) Term 2, 2021 72 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
When To Consider Nearest Neighbour
What is the inductive bias of k-NN ?
• an assumption that the classification of query instance xq will be most similar to the classification of other instances that are nearby according to the distance function
k-NN uses terminology from statistical pattern recognition (see below):
• Regression: approximating a real-valued target function
• Residual: the error f̂(x) − f(x) in approximating the target function
• Kernel function: a function of distance used to determine the weight of each training example, i.e., the kernel function is the function K such that wi = K(Dis(xi, xq))
COMP9417 ML & DM Classification (1) Term 2, 2021 73 / 86
Nearest neighbour classification
k-Nearest Neighbour Classification
Nearest-neighbour classifier
• kNN uses the training data as exemplars, so training is O(n) (but prediction is also O(n)!)
• 1NN perfectly separates training data, so low bias but high variance³
• By increasing the number of neighbours k we increase bias and
decrease variance (what happens when k = n?)
• Easily adapted to real-valued targets, and even to structured objects (nearest-neighbour retrieval). Can also output probabilities when k>1
• Warning: in high-dimensional spaces everything is far away from everything and so pairwise distances are uninformative (curse of dimensionality)
³ See Hastie et al. (2009).
COMP9417 ML & DM Classification (1) Term 2, 2021 74 / 86
Local (nearest-neighbour) regression
Local (nearest-neighbour) regression
COMP9417 ML & DM Classification (1) Term 2, 2021 75 / 86
Local (nearest-neighbour) regression
Local learning
• Related to the simplest form of learning: rote learning, or memorization
• Training instances are searched for instance that most closely
resembles query or test instance
• The instances themselves represent the knowledge
• Called: nearest-neighbour, instance-based, memory-based or case-based learning; all forms of local learning
• The similarity or distance function defines “learning”, i.e., how to go beyond simple memorization
• Intuition — classify an instance similarly to examples “close by” — neighbours or exemplars
• A form of lazy learning – don’t need to build a model!
COMP9417 ML & DM Classification (1) Term 2, 2021 76 / 86
Local (nearest-neighbour) regression
Nearest neighbour for numeric prediction
Store all training examples ⟨xi, f(xi)⟩.
Nearest neighbour:
• Given query instance xq,
• first locate nearest training example xn,
• then estimate ŷ = f̂(xq) = f(xn)
k-Nearest neighbour:
• Given xq, take mean of f values of k nearest neighbours:

  ŷ = f̂(xq) = ( Σ_{i=1}^{k} f(xi) ) / k
COMP9417 ML & DM Classification (1) Term 2, 2021 77 / 86
Local (nearest-neighbour) regression
Distance function
The distance function defines what is “learned”, i.e., the value predicted. Instance xi is described by a d-vector of feature values:
⟨xi1, xi2, . . . xid⟩
where xir denotes the value of the rth feature of xi.
Most commonly used distance function is Euclidean distance.
COMP9417 ML & DM Classification (1) Term 2, 2021 78 / 86
Local (nearest-neighbour) regression
Local regression
Use kNN to form a local approximation to f for each query point xq, using a linear function of the form

  f̂(x) = b0 + b1x1 + . . . + bdxd

where xi denotes the value of the ith feature of instance x.
Where does this linear regression model come from?
• fit linear function to k nearest neighbours
• or quadratic or higher-order polynomial . . .
• produces “piecewise approximation” to f
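A sketch of this idea (illustrative only): for each query, fit an ordinary least-squares line to the k nearest training points and read off the prediction.

```python
import numpy as np

# Sketch: local (k-nearest-neighbour) linear regression.
def local_linear_predict(X, y, x_query, k=5):
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    A = np.hstack([np.ones((k, 1)), X[nearest]])        # columns [1, x1, ..., xd]
    b, *_ = np.linalg.lstsq(A, y[nearest], rcond=None)  # coefficients b0, b1, ..., bd
    return b[0] + x_query @ b[1:]

rng = np.random.default_rng(0)
X = np.linspace(0, 3, 30).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(30)
print(local_linear_predict(X, y, np.array([1.5]), k=5))  # close to sin(1.5) ~ 0.997
```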
COMP9417 ML & DM Classification (1) Term 2, 2021 79 / 86
Local learning — further issues
Distance-Weighted kNN
• Might want to weight nearer neighbours more heavily …
• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:

  f̂(xq) ← arg max_{v∈V} Σ_{i=1}^{k} wi I[v, f(xi)]

where

  wi ≡ 1 / Dis(xq, xi)

and Dis(xq, xi) is the distance between xq and xi, such as Euclidean distance.
COMP9417 ML & DM Classification (1) Term 2, 2021 80 / 86
Local learning — further issues
Distance-Weighted kNN
For real-valued target functions replace the final line of the algorithm by:

  f̂(xq) ← ( Σ_{i=1}^{k} wi f(xi) ) / ( Σ_{i=1}^{k} wi )

(the denominator normalizes the contribution of the individual weights).
Now we can consider using all the training examples instead of just k
→ using all examples (i.e., when k = n) with the rule above is called “Shepard’s method”
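A sketch of the distance-weighted prediction rule above; the exact-match guard is an added detail to avoid dividing by a zero distance:

```python
import numpy as np

# Sketch: distance-weighted k-NN regression with w_i = 1 / Dis(x_q, x_i);
# with k = n (all training examples) this is Shepard's method.
def weighted_knn_regress(X, y, x_query, k):
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    if np.any(dists[nearest] == 0):            # exact match: return its target value
        return y[nearest][dists[nearest] == 0][0]
    w = 1.0 / dists[nearest]
    return np.sum(w * y[nearest]) / np.sum(w)  # weighted mean, normalised by sum of weights

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_regress(X, y, np.array([1.2]), k=2))        # 1.6, nearer to f(1)
print(weighted_knn_regress(X, y, np.array([1.2]), k=len(X)))   # Shepard's method
```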
COMP9417 ML & DM Classification (1) Term 2, 2021 81 / 86
Local learning — further issues
Evaluation
Lazy learners do not construct an explicit model, so how do we evaluate the output of the learning process ?
• 1-NN – training set error is always zero !
• each training example is always closest to itself
• k-NN – overfitting may be hard to detect
Use leave-one-out cross-validation (LOOCV) – leave out each example and
predict it given the rest:
(x1, y1), (x2, y2), . . . , (xi−1, yi−1), (xi+1, yi+1), . . . , (xn, yn)
Error is mean over all predicted examples. Fast — no models to be built !
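A sketch of LOOCV for a k-NN classifier on assumed toy data: each example is predicted from all the others and the mean error is reported.

```python
import numpy as np
from collections import Counter

# Sketch: leave-one-out cross-validation (LOOCV) error of a k-NN classifier.
def loocv_knn_error(X, y, k=3):
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                   # leave example i out
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        nearest = np.argsort(dists)[:k]
        pred = Counter(y[mask][nearest]).most_common(1)[0][0]
        errors += (pred != y[i])
    return errors / len(X)

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loocv_knn_error(X, y, k=3))   # 0.0 on this well-separated toy data
```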
COMP9417 ML & DM Classification (1) Term 2, 2021 82 / 86
Local learning — further issues
Curse of Dimensionality
Bellman (1960) coined this term in the context of dynamic programming.
Curse of dimensionality: nearest neighbour is hard for high-dimensional instances x.
One approach:
• “Stretch” the jth axis by weight zj, where z1, . . . , zn are chosen to minimize prediction error
• Use cross-validation to automatically choose the weights z1, . . . , zn
• Note: setting zj to zero eliminates this dimension altogether
COMP9417 ML & DM Classification (1) Term 2, 2021 83 / 86
Local learning — further issues
Curse of Dimensionality
• number of “cells” in the instance space grows exponentially in the number of features
• with exponentially many cells we would need exponentially many data points to ensure that each cell is sufficiently populated to make nearest-neighbour predictions reliably
COMP9417 ML & DM Classification (1) Term 2, 2021 84 / 86
Local learning — further issues
Instance-based (nearest-neighbour) learning
Recap – Practical problems of 1-NN scheme:
• Slow (but fast k-d tree-based approaches exist)
• Remedy: removing irrelevant data
• Noise (but k-NN copes quite well with noise)
• Remedy: removing noisy instances
• All attributes deemed equally important
• Remedy: attribute weighting (or simply selection)
• Doesn’t perform explicit generalization
• Remedy: rule-based (like local regression) or tree-based NN approaches
COMP9417 ML & DM Classification (1) Term 2, 2021 85 / 86
Summary
Summary
• A framework for classification
• Classification viewed in terms of distance in feature space – distance-based methods, where the key ideas are geometric
• A classifier as a linear model
• Nearest neighbour classifiers
• Later we will see how to extend by building on these ideas
COMP9417 ML & DM Classification (1) Term 2, 2021 86 / 86
References
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, 2nd edition.
COMP9417 ML & DM Classification (1) Term 2, 2021 86 / 86