Classification (1)
COMP9417 Machine Learning and Data Mining
Term 2, 2020
COMP9417 ML & DM Classification (1) Term 2, 2020 1 / 72
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www- 2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM
Classification (1)
Term 2, 2020
2 / 72
Aims
This lecture will introduce you to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline a framework for solving machine learning problems
• outline the general problem of induction
• describe issues of generalisation and evaluation for classification
• outline the use of a linear model as a 2-class classifier
• describe distance measures and how they are used in classification • outline the basic k-nearest neighbour classification method
COMP9417 ML & DM Classification (1) Term 2, 2020 3 / 72
Introduction
Introduction
Classification (sometimes called concept learning) methods dominate machine learning . . .
. . . however, they often don’t have convenient mathematical properties like regression, so are more complicated to analyse. The idea is to learn a classifier, which is usually a function mapping from an input data point to one of a set of discrete outputs, i.e., the classes.
We will mostly focus on their advantages and disadvantages as learning methods first, and point to unifying ideas and approaches where applicable. In this and the next lecture we focus on classification methods that are essentially linear models . . .
and in later lectures we will see other, more expressive, classifier learning methods.
COMP9417 ML & DM Classification (1) Term 2, 2020 4 / 72
Introduction
Classification
Customer103: (time=t0) Years of credit: 9
Loan balance: $2,400 Income: $52k
Own House: Yes
Other delinquent accts: 2 Max billing cycles late: 3 Profitable customer?: ? …
Customer103: (time=t1) Years of credit: 9
Loan balance: $3,250 Income: ?
Own House: Yes
Other delinquent accts: 2 Max billing cycles late: 4 Profitable customer?: ? …
…
Customer103: (time=tn) Years of credit: 9
Loan balance: $4,500
Income: ?
Own House: Yes
Other delinquent accts: 3
Max billing cycles late: 6 Profitable customer?: No …
If OtherDelinquentAccounts > 2, and NumberDelinquentBillingCycles > 1
Then ProfitableCustomer = No
If OtherDelinquentAccounts = 0, and Income > 30k OR Y earsOfCredit > 3
Then ProfitableCustomer = Y es
COMP9417 ML & DM Classification (1)
Term 2, 2020
5 / 72
Introduction
Assassinating spam e-mail
SpamAssassin is a widely used open-source spam filter. It calculates a score for an incoming e-mail, based on a number of built-in rules or ‘tests’ in SpamAssassin’s terminology, and adds a ‘junk’ flag and a summary report to the e-mail’s headers if the score is 5 or more.
-0.1 RCVD_IN_MXRATE_WL
0.6 HTML_IMAGE_RATIO_02
1.2 TVD_FW_GRAPHIC_NAME_MID 0.0 HTML_MESSAGE
0.6 HTML_FONx_FACE_BAD
1.4 SARE_GIF_ATTACH
0.1 BOUNCE_MESSAGE
0.1 ANY_BOUNCE_MESSAGE
1.4 AWL
RBL: MXRate recommends allowing
[123.45.6.789 listed in sub.mxrate.net]
BODY: HTML has a low ratio of text to image area BODY: TVD_FW_GRAPHIC_NAME_MID
BODY: HTML included in message
BODY: HTML font face is not a word
FULL: Email has a inline gif
MTA bounce message
Message is some kind of bounce message
AWL: From: address is in the auto white-list
From left to right you see the score attached to a particular test, the test identifier, and a short description including a reference to the relevant part of the e-mail. As you see, scores for individual tests can be negative (indicating evidence suggesting the e-mail is ham rather than spam) as well as positive. The overall score of 5.3 suggests the e-mail might be spam.
COMP9417 ML & DM Classification (1) Term 2, 2020 6 / 72
Introduction
Linear classification
Suppose we have only two tests and four training e-mails, one of which is spam. Both tests succeed for the spam e-mail; for one ham e-mail neither test succeeds, for another the first test succeeds and the second doesn’t, and for the third ham e-mail the first test fails and the second succeeds.
It is easy to see that assigning both tests a weight of 4 correctly ‘classifies’ these four e-mails into spam and ham. In the mathematical notation introduced above we could describe this classifier as 4×1 + 4×2 > 5 or
(4, 4) · (x1, x2) > 5.
In fact, any weight between 2.5 and 5 will ensure that the threshold of 5 is only exceeded when both tests succeed. We could even consider assigning different weights to the tests – as long as each weight is less than 5 and their sum exceeds 5 – although it is hard to see how this could be justified by the training data.
COMP9417 ML & DM Classification (1) Term 2, 2020 7 / 72
Introduction
Spam filtering as a classification task
The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4×1 + 4×2 at 5, we can separate spam from ham.
E-mail x1 x2 Spam? 4×1 + 4×2 11118 20000 31004 40104
COMP9417 ML & DM Classification (1) Term 2, 2020 8 / 72
Introduction
Linear classification in two dimensions
w++++ ++
++
x2 –––– –––
x0
x1 –
• straight line separates positives from negatives • defined by w · xi = t
• w is perpendicular to decision boundary
• w points in direction of positives
• t is the decision threshold
COMP9417 ML & DM Classification (1) Term 2, 2020 9 / 72
Introduction
Linear classification in two dimensions
Note: xi points to a point on the decision boundary. In particular, x0 points in the same direction as w, from which it follows that
w · x0 = ||w|| ||x0|| = t (where ||x|| denotes the length of the vector x).
COMP9417 ML & DM Classification (1) Term 2, 2020 10 / 72
Introduction
Homogeneous coordinates
It is sometimes convenient to simplify notation further by introducing an extra constant ‘variable’ x0 = 1, the weight of which is fixed to w0 = −t.
The extended data point is then x◦ = (1, x1, . . . , xn) and the extended weight vector is w◦ = (−t, w1, . . . , wn), leading to the decision rule w◦·x◦ >0andthedecisionboundaryw◦·x◦ =0.
Thanks to these so-called homogeneous coordinates the decision boundary passes through the origin of the extended coordinate system, at the expense of needing an additional dimension.
Note: this doesn’t really affect the data, as all data points and the ‘real’ decision boundary live in the plane x0 = 1.
COMP9417 ML & DM Classification (1) Term 2, 2020 11 / 72
Introduction
Machine learning for spam filtering
E-mails SpamAssassin Data Spam? tests Linear classifier
At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin’s built-in tests, and a linear classifier is applied to obtain a ‘spam or ham’ decision. At the bottom (in blue) we see the bit that is done by machine learning.
weights
Training data
Learn weights
COMP9417 ML & DM Classification (1) Term 2, 2020 12 / 72
Introduction
A Bayesian classifier I
Bayesian spam filters maintain a vocabulary of words and phrases – potential spam or ham indicators – for which statistics are collected from a training set.
• For instance, suppose that the word ‘Viagra’ occurred in four spam e-mails and in one ham e-mail. If we then encounter a new e-mail that contains the word ‘Viagra’, we might reason that the odds that this e-mail is spam are 4:1, or the probability of it being spam is 0.80 and the probability of it being ham is 0.20.
• The situation is slightly more subtle because we have to take into account the prevalence of spam. Suppose that I receive on average one spam e-mail for every six ham e-mails. This means that I would estimate the odds of an unseen e-mail being spam as 1:6, i.e., non-negligible but not very high either.
COMP9417 ML & DM Classification (1) Term 2, 2020 13 / 72
Introduction
A Bayesian classifier II
• If I then learn that the e-mail contains the word ‘Viagra’, which occurs four times as often in spam as in ham, I need to combine these two odds. As we shall see later, Bayes’ rule tells us that we should simply multiply them: 1:6 times 4:1 is 4:6, corresponding to a spam probability of 0.4.
In this way you are combining two independent pieces of evidence, one concerning the prevalence of spam, and the other concerning the occurrence of the word ‘Viagra’, pulling in opposite directions.
COMP9417 ML & DM Classification (1) Term 2, 2020 14 / 72
Introduction
A Bayesian classifier III
The nice thing about this ‘Bayesian’ classification scheme is that it can be repeated if you have further evidence. For instance, suppose that the odds in favour of spam associated with the phrase ‘blue pill’ is estimated at 3:1, and suppose our e-mail contains both ‘Viagra’ and ‘blue pill’, then the combined odds are 4:1 times 3:1 is 12:1, which is ample to outweigh the 1:6 odds associated with the low prevalence of spam (total odds are 2:1, or a spam probability of 0.67, up from 0.40 without the ‘blue pill’).
COMP9417 ML & DM Classification (1) Term 2, 2020 15 / 72
Introduction
A rule-based classifier
• if the e-mail contains the word ‘Viagra’ then estimate the odds of spam as 4:1;
• otherwise, if it contains the phrase ‘blue pill’ then estimate the odds of spam as 3:1;
• otherwise, estimate the odds of spam as 1:6.
The first rule covers all e-mails containing the word ‘Viagra’, regardless of whether they contain the phrase ‘blue pill’, so no overcounting occurs. The second rule only covers e-mails containing the phrase ‘blue pill’ but not the word ‘Viagra’, by virtue of the ‘otherwise’ clause. The third rule covers all remaining e-mails: those which neither contain neither ‘Viagra’ nor ‘blue pill’.
COMP9417 ML & DM Classification (1) Term 2, 2020 16 / 72
The ingredients of machine learning
How machine learning helps to solve a task
Task
Domain
objects
Features
Data
Output
Model
Training data
Learning problem
Learning algorithm
An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).
COMP9417 ML & DM Classification (1) Term 2, 2020 17 / 72
The ingredients of machine learning
Some terminology I
Tasks are addressed by models, whereas learning problems are solved by learning algorithms that produce models.
COMP9417 ML & DM Classification (1) Term 2, 2020 18 / 72
The ingredients of machine learning
Some terminology II
Machine learning is concerned with using the right features to build the right models that achieve the right tasks.
COMP9417 ML & DM Classification (1) Term 2, 2020 19 / 72
The ingredients of machine learning
Some terminology III
Models lend the machine learning field diversity, but tasks and features give it unity.
COMP9417 ML & DM Classification (1) Term 2, 2020 20 / 72
The ingredients of machine learning
Some terminology IV
Does the algorithm require all training data to be present before the start of learning ? If yes, then it is categorised as batch learning algorithm.
If however, it can continue to learn a new data arrives, it is an online learning algorithm.
COMP9417 ML & DM Classification (1) Term 2, 2020 21 / 72
The ingredients of machine learning
Some terminology V
If the model has a fixed number of parameters it is categorised as parametric.
Otherwise, if the number of parameters grows with the amount of training data it is categorised as non-parametric.
COMP9417 ML & DM Classification (1) Term 2, 2020 22 / 72
The ingredients of machine learning
Geometric models
Basic linear classifier I
p–
The basic linear classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass.
++++ ++
++
w=p–n ––––
(p+n)/2 – – n–
COMP9417 ML & DM Classification (1) Term 2, 2020 23 / 72
The ingredients of machine learning
Geometric models
Basic linear classifier II
The basic linear classifier is described by the equation w · x = t, with
w = p − n; the decision threshold can be found by noting that (p + n)/2 is on the decision boundary, and hence
t = (p − n) · (p + n)/2 = (||p||2 − ||n||2)/2, where ||x|| denotes the length of vector x.
COMP9417 ML & DM Classification (1) Term 2, 2020 24 / 72
Algorithm-independent aspects of machine learning
The philosophical problem
Deduction: derive specific consequences from general theories
Induction: derive general theories from specific observations
Deduction is well-founded (mathematical logic).
Induction is (philosophically) problematic – induction is useful since it often seems to work – an inductive argument !
COMP9417 ML & DM Classification (1) Term 2, 2020 25 / 72
Algorithm-independent aspects of machine learning
Generalisation – the key objective of machine learning
What we are really interested in is generalising from the sample of data in our training set. This can be stated as:
The inductive learning hypothesis
Any hypothesis found to approximate the target (true) function well over a sufficiently large set of training examples will also ap- proximate the target function well over other unobserved examples.
A corollary of this is that it is necessary to make some assumptions about the type of target function in a task for an algorithm to go beyond the data, i.e., generalise or learn.
COMP9417 ML & DM Classification (1) Term 2, 2020 26 / 72
Algorithm-independent aspects of machine learning
Evaluating performance on a task
Cross-validation I
There are certain parameters that need to be estimated during learning. We use the data, but NOT the training set, OR the test set. Instead, we use a separate validation or development set.
COMP9417 ML & DM Classification (1) Term 2, 2020 27 / 72
Algorithm-independent aspects of machine learning
Evaluating performance on a task
Cross-validation II
COMP9417 ML & DM Classification (1) Term 2, 2020 28 / 72
Algorithm-independent aspects of machine learning
Evaluating performance on a task
Cross-validation III
COMP9417 ML & DM Classification (1) Term 2, 2020 29 / 72
Algorithm-independent aspects of machine learning
Evaluating performance on a task
Cross-validation IV
COMP9417 ML & DM Classification (1) Term 2, 2020 30 / 72
Algorithm-independent aspects of machine learning
Evaluating performance on a task
Contingency table I
Two-class prediction case:
Actual Class
Predicted Class
Yes
No
Yes
True Positive (TP)
False Negative (FN)
No
False Positive (FP)
True Negative (TN)
COMP9417 ML & DM Classification (1) Term 2, 2020 31 / 72
Algorithm-independent aspects of machine learning
Evaluating performance on a task
Contingency table II
Classification Accuracy on a sample of labelled pairs (x, c(x)) given a learned classification model that predicts, for each instance x, a class value cˆ(x):
acc= 1 I[cˆ(x)=c(x)] |Test| x∈Test
where Test is a test set and I[] is the indicator function which is 1 iff its argument evaluates to true, and 0 otherwise.
Classification Error is 1-acc.
COMP9417 ML & DM Classification (1) Term 2, 2020 32 / 72
Distance-based models
Nearest Neighbour
Nearest Neighbour is a regression or classification algorithm that predicts whatever is the output value of the nearest data point to some query.
COMP9417 ML & DM Classification (1) Term 2, 2020 33 / 72
Distance-based models
Minkowski distance
Minkowski distance If X = Rd, the Minkowski distance of order p > 0 is defined as
d 1/p
Disp(x,y)= |xj −yj|p =||x−y||p j=1
d p 1/p
where ||z||p = j=1 |zj| is the p-norm (sometimes denoted Lp
norm) of the vector z.
COMP9417 ML & DM Classification (1) Term 2, 2020 34 / 72
Distance-based models
Minkowski distance
• The 2-norm refers to the familiar Euclidean distance d
(xj −yj)2 = (x−y)T(x−y) which measures distance ‘as the crow flies’.
• The 1-norm denotes Manhattan distance, also called cityblock
distance:
d Dis1(x,y)=|xj −yj|
Dis2(x,y)=
j=1
j=1
This is the distance if we can only travel along coordinate axes.
COMP9417 ML & DM Classification (1) Term 2, 2020 35 / 72
Distance-based models
Minkowski distance
• If we now let p grow larger, the distance will be more and more dominated by the largest coordinate-wise distance, from which we can inferthatDis∞(x,y)=maxj|xj −yj|;thisisalsocalledChebyshev distance.
• You will sometimes see references to the 0-norm (or L0 norm) which counts the number of non-zero elements in a vector. The corresponding distance then counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as
dd Dis0(x,y)=(xj −yj)0 =I[xj =yj]
j=1 j=1
under the understanding that x0 = 0 for x = 0 and 1 otherwise.
COMP9417 ML & DM Classification (1) Term 2, 2020 36 / 72
Distance-based models
Minkowski distance
Sometimes the data X is not naturally in Rd, but if we can turn it into Boolean features, or character sequesnces, we can still apply distance measures. For example:
• If x and y are binary strings, this is also called the Hamming distance. Alternatively, we can see the Hamming distance as the number of bits that need to be flipped to change x into y.
• For non-binary strings of unequal length this can be generalised to the notion of edit distance or Levenshtein distance.
COMP9417 ML & DM Classification (1) Term 2, 2020 37 / 72
Circles and ellipses
Distance-based models
Lines connecting points at order-p Minkowski distance 1 from the origin for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 4;
p = 8; and p = ∞ (Chebyshev distance, the blue rectangle). Notice that for points on the coordinate axes all distances agree. For the other points, our reach increases with p; however, if we require a rotation-invariant distance metric then Euclidean distance is our only choice.
COMP9417 ML & DM Classification (1) Term 2, 2020 38 / 72
Distance-based models
Distance metric
Distance metric Given an instance space X , a distance metric is a function Dis : X × X → R such that for any x, y, z ∈ X :
• distances between a point and itself are zero: Dis(x, x) = 0;
• all other distances are larger than zero: if x ̸= y then Dis(x, y) > 0;
• distances are symmetric: Dis(y, x) = Dis(x, y);
• detours can not shorten the distance: Dis(x, z) ≤ Dis(x, y) + Dis(y, z).
If the second condition is weakened to a non-strict inequality – i.e., Dis(x, y) may be zero even if x ̸= y – the function Dis is called a pseudo-metric.
COMP9417 ML & DM Classification (1) Term 2, 2020 39 / 72
Distance-based models
The triangle inequality – Minkowski distance for p = 2
B
!
AC
!!
The green circle connects points the same Euclidean distance (i.e., Minkowski distance of order p = 2) away from the origin as A. The orange circle shows that B and C are equidistant from A. The red circle demonstrates that C is closer to the origin than B, which conforms to the triangle inequality.
COMP9417 ML & DM Classification (1) Term 2, 2020 40 / 72
Distance-based models
The triangle inequality – Minkowski distance for p ≤ 1
B
!
AC
!!
With Manhattan distance (p = 1), B and C are equally close to the origin and also equidistant from A. With p < 1 (here, p = 0.8) C is further away from the origin than B; since both are again equidistant from A, it follows that travelling from the origin to C via A is quicker than going there directly, which violates the triangle inequality.
COMP9417 ML & DM Classification (1) Term 2, 2020 41 / 72
Distance-based models
Neighbours and exemplars
Means and distances
The arithmetic mean minimises squared Euclidean distance The arithmetic mean μ of a set of data points D in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that arg miny x∈D ||x − y||2 = μ, where || · || denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial derivatives with respect to yi) of the sum and setting it to the zero vector:
∇y ||x−y||2 =−2(x−y)=−2x+2|D|y=0
x∈D x∈D fromwhichwederivey= 1 x=μ.
x∈D
|D| x∈D
COMP9417 ML & DM
Classification (1)
Term 2, 2020
42 / 72
Distance-based models
Neighbours and exemplars
Means and distances
• Notice that minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t it be more natural to take the point that minimises total Euclidean distance as exemplar?
• This point is known as the geometric median, as for univariate data it corresponds to the median or ‘middle value’ of a set of numbers. However, for multivariate data there is no closed-form expression for the geometric median, which needs to be calculated by successive approximation.
COMP9417 ML & DM Classification (1) Term 2, 2020 43 / 72
Distance-based models
Neighbours and exemplars
Means and distances
• In certain situations it makes sense to restrict an exemplar to be one of the given data points. In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that doesn’t have to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the total distance to all other data points, in order to choose the point that minimises it. Regardless of the distance metric used, this is an O(n2) operation for n points.
• So for medoids there is no computational reason to prefer one distance metric over another.
• There may be more than one medoid.
COMP9417 ML & DM Classification (1) Term 2, 2020 44 / 72
Distance-based models
Neighbours and exemplars
Centroids and medoids
2 1.5 1 0.5 0 −0.5 −1 −1.5 −2
data points
squared 2−norm centroid (mean) 2−norm centroid (geometric median) squared 2−norm medoid
2−norm medoid
1−norm medoid
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
A small data set of 10 points, with circles indicating centroids and squares indicating medoids (the latter must be data points), for different distance metrics. Notice how the outlier on the bottom-right ‘pulls’ the mean away from the geometric median; as a result the corresponding medoid changes as well.
COMP9417 ML & DM Classification (1) Term 2, 2020 45 / 72
Distance-based models
Neighbours and exemplars
The basic linear classifier is distance-based
• The basic linear classifier constructs the decision boundary as the perpendicular bisector of the line segment connecting the two exemplars (one for each class).
• An alternative, distance-based way to classify instances without direct reference to a decision boundary is by the following decision rule: if x is nearest to μ⊕ then classify it as positive, otherwise as negative; or equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance-based perspective as constructing exemplars that minimise squared Euclidean distance within each class, and then applying a nearest-exemplar decision rule.
COMP9417 ML & DM Classification (1) Term 2, 2020 46 / 72
Two-exemplar decision boundaries
Distance-based models Neighbours and exemplars
(left) For two exemplars the nearest-exemplar decision rule with Euclidean distance results in a linear decision boundary coinciding with the perpendicular bisector of the line connecting the two exemplars. (right) Using Manhattan distance the circles are replaced by diamonds.
COMP9417 ML & DM Classification (1) Term 2, 2020 47 / 72
Three-exemplar decision boundaries
Distance-based models Neighbours and exemplars
(left) Decision regions defined by the 2-norm nearest-exemplar decision rule for three exemplars. (right) With Manhattan distance the decision regions become non-convex.
COMP9417 ML & DM Classification (1) Term 2, 2020 48 / 72
Distance-based models
Neighbours and exemplars
Distance-based models
To summarise, the main ingredients of distance-based models are
• distance metrics, which can be Euclidean, Manhattan, Minkowski or Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen distance metric, or medoids that find the most centrally located data point; and
• distance-based decision rules, which take a vote among the k nearest exemplars.
COMP9417 ML & DM Classification (1) Term 2, 2020 49 / 72
Nearest neighbour classification
Nearest neighbour classification
COMP9417 ML & DM Classification (1) Term 2, 2020 50 / 72
Nearest neighbour classification
Nearest neighbour classification
• Related to the simplest form of learning: rote learning or memorization
• Training instances are searched for instance that most closely resembles new or query instance
• The instances themselves represent the knowledge
• Called: instance-based, memory-based learning or case-based learning;
often a form of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars, should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, lowess, . . .
• Ideas also important for unsupervised methods, e.g., clustering (later lectures)
COMP9417 ML & DM Classification (1) Term 2, 2020 51 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Nearest Neighbour
Stores all training examples ⟨xi,f(xi)⟩. Nearest neighbour:
• Given query instance xq, first locate nearest training example xn, then ˆ
estimate f(xq) ← f(xn) k-Nearest neighbour:
• Given xq, take vote among its k nearest neighbours (if discrete-valued target function)
• take mean of f values of k nearest neighbours (if real-valued) ˆ ki=1 f(xi)
k
f(xq) ←
COMP9417 ML & DM Classification (1)
Term 2, 2020 52 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
k-Nearest Neighbour Algorithm
Training algorithm:
• For each training example ⟨xi,f(xi)⟩, add the example to the list
training examples. Classification algorithm:
• Given a query instance xq to be classified,
• Let x1 . . . xk be the k instances from training examples that are
nearest to xq by the distance function • Return
k
ˆ
f(xq) ← arg max
δ(v, f(xi)) where δ(a,b) = 1 if a = b and 0 otherwise.
v∈V
i=1
COMP9417 ML & DM Classification (1) Term 2, 2020 53 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
“Hypothesis Space” for Nearest Neighbour
+
− −
− +
xq
+ −
+
−
2 classes, + and − and query point xq. On left, note effect of varying k. On right, 1−NN induces a Voronoi tessellation of the instance space. Formed by the perpendicular bisectors of lines between points.
COMP9417 ML & DM Classification (1) Term 2, 2020 54 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Distance function again
The distance function defines what is learned.
Instance x is described by a feature vector (list of attribute-value pairs)
⟨a1(x), a2(x), . . . am(x)⟩ where ar(x) denotes the value of the rth attribute of x.
Most commonly used distance function is Euclidean distance . . . • distance between two instances xi and xj is defined to be
m
d(xi , xj ) =
r=1
(ar (xi ) − ar (xj ))2
COMP9417 ML & DM Classification (1) Term 2, 2020 55 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Distance function again
Many other distance functions could be used . . .
• e.g., Manhattan or city-block distance (sum of absolute values of
differences between attributes)
m
d(xi , xj ) = |ar (xi ) − ar (xj )|
r=1 Vector-based formalization – use norm L1, L2, . . .
The idea of distance functions will appear again in kernel methods.
COMP9417 ML & DM Classification (1) Term 2, 2020 56 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Normalization and other issues
• Different attributes measured on different scales
• Need to be normalized (why ?)
ar = vr − minvr maxvr − −minvr
where vr is the actual value of attribute r
• Nominal attributes: distance either 0 or 1
• Common policy for missing values: assumed to be maximally distant (given normalized attributes)
COMP9417 ML & DM Classification (1) Term 2, 2020 57 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
When To Consider Nearest Neighbour
• Instances map to points in Rn
• Less than 20 attributes per instance
• or number of attributes can be reduced . . .
• Lots of training data
• No requirement for “explanatory” model to be learned
COMP9417 ML & DM Classification (1) Term 2, 2020 58 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
When To Consider Nearest Neighbour
Advantages:
• Statisticians have used k-NN since early 1950s • Can be very accurate
• at most twice the “Bayes error” for 1-NN (Cover & Hart, 1967) • Training is very fast
• Can learn complex target functions
• Don’t lose information by generalization - keep all instances
COMP9417 ML & DM Classification (1) Term 2, 2020 59 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
When To Consider Nearest Neighbour
Disadvantages:
• Slow at query time: basic algorithm scans entire training data to derive a prediction
• “Curse of dimensionality”
• Assumes all attributes are equally important, so easily fooled by irrelevant attributes
• Remedy: attribute selection or weights
• Problem of noisy instances:
• Remedy: remove from data set
• not easy – how to know which are noisy ?
COMP9417 ML & DM Classification (1) Term 2, 2020 60 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
When To Consider Nearest Neighbour
What is the inductive bias of k-NN ?
• an assumption that the classification of query instance xq will be most similar to the classification of other instances that are nearby according to the distance function
k-NN uses terminology from statistical pattern recognition (see below) • Regression approximating a real-valued target function
•ˆ
Residual the error f (x) − f (x) in approximating the target function
• Kernel function function of distance used to determine weight of each training example, i.e., kernel function is the function K s.t.
wi =K(d(xi,xq))
COMP9417 ML & DM Classification (1) Term 2, 2020 61 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Nearest-neighbour classifier
• kNN uses the training data as exemplars, so training is O(n) (but prediction is also O(n)!)
• 1NN perfectly separates training data, so low bias but high variance
• By increasing the number of neighbours k we increase bias and
decrease variance (what happens when k = n?)
• Easily adapted to real-valued targets, and even to structured objects (nearest-neighbour retrieval). Can also output probabilities when k>1
• Warning: in high-dimensional spaces everything is far away from everything and so pairwise distances are uninformative (curse of dimensionality)
COMP9417 ML & DM Classification (1) Term 2, 2020 62 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Distance-Weighted kNN
• Might want to weight nearer neighbours more heavily …
• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:
where
ˆ
f(xq) ← arg max
wiδ(v, f(xi))
k
v∈V
wi≡ 1 d(xq,xi)2
i=1
and d(xq,xi) is distance between xq and xi
COMP9417 ML & DM Classification (1) Term 2, 2020 63 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Distance-Weighted kNN
For real-valued target functions replace the final line of the algorithm by:
ki=1 wif(xi) f(xq) ← ki=1 wi
(denominator normalizes contribution of individual weights).
Now we can consider using all the training examples instead of just k
→ using all examples (i.e., when k = n) with the rule above is called Shepard’s method
ˆ
COMP9417 ML & DM Classification (1) Term 2, 2020 64 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Evaluation
Lazy learners do not construct an explicit model, so how do we evaluate the output of the learning process ?
• 1-NN – training set error is always zero !
• each training example is always closest to itself
• k-NN – overfitting may be hard to detect
Leave-one-out cross-validation (LOOCV) – leave out each example and
predict it given the rest:
(x1, y1), (x2, y2), . . . , (xi−1, yi−1), (xi+1, yi+1), . . . , (xn, yn)
Error is mean over all predicted examples. Fast – no models to be built !
COMP9417 ML & DM Classification (1) Term 2, 2020 65 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Curse of Dimensionality
Bellman (1960) coined this term in the context of dynamic programming
Imagine instances described by 20 attributes, but only 2 are relevant to target function — “similar” examples will appear “distant”.
Curse of dimensionality: nearest neighbour is easily mislead when high-dimensional X – problem of irrelevant attributes
One approach:
•
• •
Stretch jth axis by weight zj , where z1, . . . , zn chosen to minimize prediction error
Use cross-validation to automatically choose weights z1, . . . , zn Note setting zj to zero eliminates this dimension altogether
COMP9417 ML & DM Classification (1) Term 2, 2020 66 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Curse of Dimensionality
• number of “cells” in the instance space grows exponentially in the number of features
• with exponentially many cells we would need exponentially many data points to ensure that each cell is sufficiently populated to make nearest-neighbour predictions reliably
COMP9417 ML & DM Classification (1) Term 2, 2020 67 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Curse of Dimensionality
Some ideas to address this for instance-based (nearest-neighbour) learning • Euclidean distance with weights on attributes
wr(ar(xq) − ar(x))2
• updating of weights based on nearest neighbour classification error • class correct/incorrect: weight increased/decreased
• can be useful if not all features used in classification
See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error”
COMP9417 ML & DM Classification (1) Term 2, 2020 68 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Instance-based (nearest-neighbour) learning
Recap – Practical problems of 1-NN scheme:
• Slow (but fast k − d tree-based approaches exist)
• Remedy: removing irrelevant data
• Noise (but k-NN copes quite well with noise)
• Remedy: removing noisy instances
• All attributes deemed equally important
• Remedy: attribute weighting (or simply selection) • Doesn’t perform explicit generalization
• Remedy: rule-based or tree-based NN approaches
COMP9417 ML & DM Classification (1) Term 2, 2020 69 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Some refinements of instance-based classifiers
• Edited NN classifiers discard some of the training instances before making predictions
• Saves memory and speeds up classification
• IB2: incremental NN learner: only incorporates misclassified instances into the classifier
• Problem: noisy data gets incorporated
• IB3: store classification performance information with each instance
& only use in prediction if above a threshold
COMP9417 ML & DM Classification (1) Term 2, 2020 70 / 72
Nearest neighbour classification
k-Nearest neighbour Classification
Dealing with noise
Use larger values of k (why ?) How to find the “right” k ?
• One way: cross-validation-based k-NN classifier (but slow)
• Different approach: discarding instances that don’t perform well by keeping success records of how well an instance does at prediction (IB3)
• Computes confidence interval for an instance’s success rate and for default accuracy of its class
• If lower limit of first interval is above upper limit of second one, instance is accepted (IB3: 5%-level)
• If upper limit of first interval is below lower limit of second one, instance is rejected (IB3: 12.5%-level)
COMP9417 ML & DM Classification (1) Term 2, 2020 71 / 72
Summary
Summary
• A framework for classification
• Classification viewed in terms of distance in feature space
Distance-based. The key ideas are geometric.
• A classifier as a linear model
• Nearest neighbour classifiers
• Later we will see how to extend by building on these ideas
COMP9417 ML & DM Classification (1) Term 2, 2020 72 / 72