Classification (1)
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will introduce you to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline a framework for solving machine learning problems
• outline the general problem of induction
• describe issues of generalisation and evaluation for classification
• outline the use of a linear model as a 2-class classifier
• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method
COMP9417 T1, 2021 1
Introduction
Classification (sometimes called concept learning) methods dominate machine learning . . .
. . . however, they often don’t have convenient mathematical properties like regression, so are more complicated to analyse. The idea is to learn a classifier, which is usually a function mapping from an input data point to one of a set of discrete outputs, i.e., the classes.
We will mostly focus first on the advantages and disadvantages of classifiers as learning methods, and point to unifying ideas and approaches where applicable.
COMP9417 T1, 2021 2
Classification
Example: Imagine that we want to automate the process of sorting incoming fish in a fish-packing plant. As a pilot, we start by separating sea bass from salmon using some information collected through sensing.
COMP9417 T1, 2021 3
Classification
Example: classifying sea bass vs. salmon
Features that can be used: width, length, weight, lightness, fins, eyes/mouth position, etc.
Question: how to separate these two classes?
[Figure: sea bass and salmon measurements plotted in the lightness-width feature space]
COMP9417 T1, 2021 4
Classification
Example: Maybe we can find a line that separates the two classes.
[Figure: a line separating sea bass from salmon in the lightness-width feature space]
COMP9417 T1, 2021 5
Classification
Example: If we find a line that separates the two classes, how does our algorithm make predictions? The line equation will look like:
$ax_1 + bx_2 + c = 0$
We can define $a$, $b$ and $c$ such that:
for any point above the line:
$ax_1 + bx_2 + c > 0$
and for any point below the line:
$ax_1 + bx_2 + c < 0$
[Figure: the separating line in the $x_1$ (lightness) vs. $x_2$ (width) plane, with sea bass above the line and salmon below it]
This type of classifier is called a linear classifier. It is also a type of discriminative learning algorithm.
COMP9417 T1, 2021 6
Classification Example:
Can we do something other than finding a discriminative line (or some other boundary) to separate the two groups?
[Figure: sea bass and salmon in the lightness-width feature space]
COMP9417 T1, 2021 7
Classification Example:
Instead of finding a discriminative line, maybe we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. This type of model is called a generative learning algorithm.
[Figure: a model fitted to each class separately in the lightness-width feature space]
COMP9417 T1, 2021 8
Classification
Generative algorithm: builds a model for each of the classes and then makes classification predictions by looking at the test example and deciding which of the class models it is more similar to.
– Learns $p(x|y)$ (and also $p(y)$, called the class prior)
– So we can get $p(x, y) = p(x|y)\,p(y)$
– It learns the mechanism by which the data has been generated
Discriminative algorithm: does not build models for the different classes, but rather focuses on finding a decision boundary that separates the classes
– Learns $p(y|x)$
COMP9417 T1, 2021 9
Classification
• To predict the output for a sample $x$ with a generative algorithm, we have to estimate $p(y|x)$:
$p(y=0|x) = \frac{p(x|y=0)\,p(y=0)}{p(x)}$
$p(y=1|x) = \frac{p(x|y=1)\,p(y=1)}{p(x)}$
If $p(y=0|x) > p(y=1|x)$, then $x$ belongs to class $y=0$, and otherwise to class $y=1$.
• For a discriminative algorithm, we directly have $p(y=0|x)$ and $p(y=1|x)$ and, similarly to the above, if $p(y=0|x) > p(y=1|x)$ then $x$ belongs to class $y=0$ and otherwise to class $y=1$.
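As a rough illustration of the generative decision rule (not part of the original slides), the following sketch fits a toy generative model with one univariate Gaussian per class plus class priors, and classifies by comparing $p(x|y)\,p(y)$ for the two classes; the data values and the Gaussian assumption are made up for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sd):
    """Univariate Gaussian density, used here as the class-conditional p(x|y)."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Toy 1-D training data for the two classes (made-up values)
x0 = np.array([1.0, 1.2, 0.8, 1.1])   # samples with y = 0
x1 = np.array([2.9, 3.1, 3.0, 2.8])   # samples with y = 1

# Generative model: p(x|y) as one Gaussian per class, plus class priors p(y)
mu0, sd0 = x0.mean(), x0.std(ddof=1)
mu1, sd1 = x1.mean(), x1.std(ddof=1)
prior0 = len(x0) / (len(x0) + len(x1))
prior1 = 1.0 - prior0

def predict(x):
    # Compare p(x|y=0) p(y=0) with p(x|y=1) p(y=1); the evidence p(x) cancels out
    score0 = gauss_pdf(x, mu0, sd0) * prior0
    score1 = gauss_pdf(x, mu1, sd1) * prior1
    return 0 if score0 > score1 else 1

print(predict(1.5), predict(2.7))   # expected: 0 1
```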
COMP9417 T1, 2021 10
Linear classification in two dimensions
• We find the line that separates the two classes: $ax_1 + bx_2 + c = 0$
• We define a weight vector $w^T = [a, b]$ and $x^T = [x_1, x_2]$
• So the line can be defined by $x^T w = -c = t$
• $w$ is perpendicular to the decision boundary (pointing in the direction of the positive class)
• $t$ is the decision threshold (if $x^T w > t$ then $x$ belongs to the positive class, and if $x^T w < t$ then $x$ belongs to the negative class)
[Figure: the decision boundary in the $x_1$ (lightness) vs. $x_2$ (width) plane, with the weight vector $w$ pointing towards the positive class]
COMP9417 T1, 2021 11
Basic Linear Classifier
The basic linear classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass.
[Figure: the basic linear classifier. The centres of mass p (sea bass) and n (salmon) define the weight vector w = p − n; the decision boundary passes half-way between them.]
COMP9417 T1, 2021 12
Basic Linear Classifier
The basic linear classifier is described by the equation $x^T w = t$, with $w = p - n$.
As we know, $\frac{p+n}{2}$ is on the decision boundary, so we have:
$t = \left(\frac{p+n}{2}\right)^T (p - n) = \frac{||p||^2 - ||n||^2}{2}$
where $||x||$ denotes the length (2-norm) of the vector $x$.
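The formulas above translate directly into code. A minimal sketch with made-up 2-D data (not the fish measurements), computing $w = p - n$, the threshold $t$, and applying the decision rule $x^T w > t$:

```python
import numpy as np

# Hypothetical 2-D training data: rows are [lightness, width]
pos = np.array([[4.0, 6.0], [5.0, 7.0], [4.5, 6.5]])   # positive class
neg = np.array([[2.0, 3.0], [1.5, 2.5], [2.5, 3.5]])   # negative class

p = pos.mean(axis=0)            # centre of mass of the positive class
n = neg.mean(axis=0)            # centre of mass of the negative class
w = p - n                       # weight vector
t = (p @ p - n @ n) / 2         # threshold: (||p||^2 - ||n||^2) / 2

def classify(x):
    return "positive" if x @ w > t else "negative"

print(classify(np.array([4.2, 6.1])))   # expected: positive
print(classify(np.array([2.1, 3.0])))   # expected: negative
```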
COMP9417 T1, 2021 13
The ingredients of machine learning
How machine learning helps to solve a task
[Figure: Task, Output, Data, Model, Training data, Learning problem, Learning algorithm, Domain objects, Features]
An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).
COMP9417 T1, 2021 14
Some terminology
Tasks are addressed by models, whereas learning problems are solved by learning algorithms that produce models.
COMP9417 T1, 2021 15
Some terminology
Machine learning is concerned with using the right features to build the right models that achieve the right tasks.
COMP9417 T1, 2021 16
Some terminology
Does the algorithm require all training data to be present before the start of learning? If yes, then it is categorised as a batch learning (a.k.a. offline learning) algorithm.
If, however, it can continue to learn as new data arrives, it is an online learning algorithm.
COMP9417 T1, 2021 17
Some terminology
If the model has a fixed number of parameters, it is categorised as parametric.
Otherwise, if the number of parameters grows with the amount of training data, it is categorised as non-parametric. Non-parametric models do not make strong assumptions about the underlying model and so are more flexible.
COMP9417 T1, 2021 18
The philosophical problem
Deduction: derive specific consequences from general theories.
Induction: derive general theories from specific observations.
Deduction is well-founded (mathematical logic).
Induction is (philosophically) problematic – induction is justified mainly because it often seems to work – which is itself an inductive argument!
COMP9417 T1, 2021 19
Generalisation - the key objective of machine learning
What we are really interested in is generalising from the sample of data in our training set. This can be stated as:
The inductive learning hypothesis
Any hypothesis found to approximate the target (true) function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
A corollary of this is that it is necessary to make some assumptions about the type of target function in a task for an algorithm to go beyond the data, i.e., generalise or learn.
COMP9417 T1, 2021 20
Cross-validation
Also known as out-of-sample testing, cross-validation is a validation technique to assess how the results of a model generalise to an independent data set.
1. Holdout method:
[Figure: the data is split into a Train portion and a Test portion]
COMP9417 T1, 2021 21
Cross-validation
2. Leave-One-Out Cross validation (LOOCV):
[Figure: in each of m iterations, a single example is held out for testing and the remaining m − 1 examples are used for training]
COMP9417 T1, 2021 22
Cross-validation
3. K-fold Cross Validation
[Figure: the data is split into K folds; in each iteration one fold is held out for testing and the remaining K − 1 folds are used for training]
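A minimal sketch of generating K-fold train/test splits with plain numpy (index bookkeeping only; model fitting is left out, and the helper name k_fold_indices is just for illustration):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)      # shuffle the indices once
    folds = np.array_split(idx, k)        # k (nearly) equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 10 samples, 5 folds
for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "test:", test_idx)
```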
COMP9417 T1, 2021 23
Cross-validation
There are certain parameters (e.g. hyperparameters) that need to be estimated during learning. We use data for this, but NOT the training set OR the test set. Instead, we use a separate validation or development set.
COMP9417 T1, 2021 24
Cross-validation
Validation set: To make hyperparameter tuning and model selection independent of the test set, we define another set within the training data:
[Figure: the data is split into Train, Validation and Test portions]
COMP9417 T1, 2021 25
Data Types
In the machine learning world, two general types of data are defined:
o Numerical: anything represented by numbers (e.g., integer, floating point)
o Categorical: everything that is not numerical (e.g., discrete labelled groups)
In general, for machine learning algorithms, data has to be represented in numeric form
COMP9417 T1, 2021 26
Data Types
Another taxonomy of data types:
1. Irrelevant: it might be represented with strings or numbers but has no relationship with the outcome (e.g. a participant's name or code)
2. Nominal: discrete values with no numerical relationship between different categories (e.g. animal types, colours, nationality)
3. Binary: discrete data with only two possibilities (e.g. cancerous vs. non-cancerous)
COMP9417 T1, 2021 27
Data Types
4. Ordinal: discrete integers that can be ranked, but the relative distance between any two numbers cannot be defined (e.g. students ranked by GPA)
5. Count: discrete whole numbers without any negatives
6. Time: a cyclical, repeating continuous form of data (e.g. days, weeks)
7. Interval: data for which we can measure the distance between different values (e.g. temperature, income)
COMP9417 T1, 2021 28
Binary Classification task
In a binary classification (or binomial classification) task, we want to classify the data of a given set into two groups. We usually define one of the classes as positive and the other as negative.
o Sometimes the classes are equally important (e.g. recognition of dog vs. cat in image classification)
o Sometimes misclassification in one of the classes is more costly than misclassification in the other class (e.g. predicting that someone has cancer while (s)he doesn't vs. predicting that someone doesn't have cancer while (s)he does); therefore we may prefer better classification in one class at the cost of more errors in the other class
COMP9417 T1, 2021 29
Evaluation of error
If we have a binary classification, then we have two classes $y \in \{0, 1\}$, where we call the class $y = 1$ the positive class and $y = 0$ the negative class.
Some evaluation metrics:
o True positive: number of instances from class one that have been predicted as one
o True negative: number of instances from class zero that have been predicted as zero
o False positive: number of instances from class zero that have been predicted as one
o False negative: number of instances from class one that have been predicted as zero
COMP9417 T1, 2021 30
Contingency table
For the two-class prediction case:

                      Predicted Positive      Predicted Negative
Actual Positive       True Positive (TP)      False Negative (FN)
Actual Negative       False Positive (FP)     True Negative (TN)

This is also called a confusion matrix.
COMP9417 T1, 2021
31
Classification Accuracy
Classification accuracy on a sample of labelled pairs $(x, c(x))$, given a learned classification model that predicts, for each instance $x$, a class value $\hat{c}(x)$:
$acc = \frac{1}{|Test|} \sum_{x \in Test} I[\hat{c}(x) = c(x)]$
where $Test$ is a test set and $I[\cdot]$ is the indicator function, which is 1 iff its argument evaluates to true, and 0 otherwise.
Classification error $= 1 - acc$.
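A minimal sketch of this accuracy computation on a hypothetical set of true and predicted labels:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # c(x): true classes (hypothetical)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # c_hat(x): predicted classes

acc = np.mean(y_true == y_pred)   # fraction of indicators I[c_hat(x) = c(x)] that are 1
err = 1 - acc
print(acc, err)                   # 0.75 0.25
```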
COMP9417 T1, 2021 32
Other evaluation metrics
Precision / correctness
– the number of relevant objects classified correctly divided by the total number of objects classified as relevant:
$Precision = \frac{TP}{TP + FP}$
Recall / sensitivity / completeness / true positive rate (TPR)
– the number of relevant objects classified correctly divided by the total number of relevant/correct objects:
$Recall = \frac{TP}{TP + FN}$
https://en.wikipedia.org/wiki/Precision_and_recall
COMP9417 T1, 2021 33
Other evaluation metrics
F1 score: a measure of accuracy, which is the harmonic mean of precision and recall and is defined as:
$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$
This measure gives equal importance to precision and recall, which is sometimes undesirable; so we have to decide which metric to use depending on the task and what is important for it.
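Continuing the hypothetical labels used above, precision, recall and F1 can be computed directly from the confusion-matrix counts; a minimal sketch:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75
```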
COMP9417 T1, 2021 34
Other evaluation metrics
AUC-ROC curve: the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC) curve is one of the most important evaluation metrics for the performance of classification models. This metric evaluates the model at different threshold settings and can inform us about the capability of the model to distinguish between classes.
• $TPR = \frac{TP}{TP + FN}$
• $FPR = \frac{FP}{FP + TN}$
• A good model has AUC close to 1
• A very poor model has AUC close to 0
• AUC = 0.5 means no class separation
[Figure: ROC curves with different AUC values]
COMP9417 T1, 2021 35
Missing Value: An issue to consider
COMP9417 T1, 2021 36
Missing Values
• In practice it rarely happens that the data is complete and homogeneous.
• Why data may be incomplete:
o Human errors
o Sensor errors
o Software bugs
o Faulty preprocessing
o ...
COMP9417 T1, 2021 37
Missing Values
How to handle missing values (common approaches):
o Deleting samples with missing values
o Replacing the missing value with some statistics from the data (mean, median, ...)
o Assigning a unique category
o Predicting the missing values
o Using algorithms that support missing values
COMP9417 T1, 2021 38
Missing Values
Deleting samples with missing values:
– Pros:
o A robust and probably more accurate model
– Cons:
o Loss of information and data
o Works poorly if the percentage of missing values is high
COMP9417 T1, 2021 39
Missing Values
Replacing the missing value with mean/median/mode:
– Pros:
o When the data size is small, it is better than deleting
o It can prevent data loss
– Cons:
o Imputing the approximations adds bias to the model (it reduces the variance of the variable)
o Works poorly compared to other methods
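A minimal sketch of mean imputation with numpy, assuming missing entries are encoded as NaN (the feature matrix is made up):

```python
import numpy as np

# Feature matrix with missing values encoded as NaN (hypothetical data)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [3.0, np.nan],
              [5.0, 7.0]])

col_means = np.nanmean(X, axis=0)        # per-feature mean, ignoring NaNs
rows, cols = np.where(np.isnan(X))       # positions of the missing entries
X[rows, cols] = col_means[cols]          # replace each NaN by its column mean
print(X)
```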
COMP9417 T1, 2021 40
Missing Values
If categorical, assigning a unique category or the most frequent category:
– Pros:
o Works well with small datasets and is easy to implement
o No loss of data
– Cons:
o Works only for categorical features
o Adding another feature (e.g. a new unique category) to the model may result in higher variance in the model
o Adding the most frequent category can increase the bias in the model
COMP9417 T1, 2021 41
Missing Values
Predicting the missing values:
– Pros:
o Imputing the missing variable is an improvement as long as the bias from it is smaller than the omitted variable bias
o Yields unbiased estimates of the model parameters
– Cons:
o Bias also arises when an incomplete conditioning set is used for a categorical variable
o Considered only as a proxy for the true values
COMP9417 T1, 2021 42
Missing Values
Using algorithms that support missing values:
– Pros:
o Does not require creation of a predictive model
– Cons:
o Correlation of the data is neglected
o Some of these algorithms are very time-consuming and it can be critical in data mining where large databases are being extracted
COMP9417 T1, 2021 43
Nearest Neighbor Algorithm for Classification
COMP9417 T1, 2021 44
Distance-based models
Nearest Neighbour
Nearest Neighbour is a regression or classification algorithm that predicts whatever is the output value of the nearest data point to some query.
To find the nearest data point, we have to find the distance between the query and other points. So we have to decide how to define the distance.
COMP9417 T1, 2021 45
Minkowski distance
Minkowski distance: If $\mathcal{X} = \mathbb{R}^d$ and $x, y \in \mathcal{X}$, the Minkowski distance of order $p > 0$ is defined as:
$Dis_p(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p} = ||x - y||_p$
where $||z||_p = \left( \sum_{j=1}^{d} |z_j|^p \right)^{1/p}$ is the $p$-norm (sometimes denoted $L_p$ norm) of the vector $z$.
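A minimal sketch of the order-p Minkowski distance, directly following the formula above:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p > 0 between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # 7.0  (Manhattan)
print(minkowski(x, y, 2))   # 5.0  (Euclidean)
```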
COMP9417 T1, 2021 46
Minkowski distance
• The 2-norm refers to the familiar Euclidean distance:
$Dis_2(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2} = \sqrt{(x - y)^T (x - y)}$
• The 1-norm denotes Manhattan distance, also called cityblock distance:
$Dis_1(x, y) = \sum_{j=1}^{d} |x_j - y_j|$
COMP9417 T1, 2021 47
Minkowski distance
• If we now let $p$ grow larger, the distance will be more and more dominated by the largest coordinate-wise distance, from which we can infer that $Dis_\infty(x, y) = \max_j |x_j - y_j|$; this is also called Chebyshev distance.
• You will sometimes see references to the 0-norm (or $L_0$ norm), which counts the number of non-zero elements in a vector. The corresponding distance then counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as:
$Dis_0(x, y) = \sum_{j=1}^{d} (x_j - y_j)^0 = \sum_{j=1}^{d} I[x_j \neq y_j]$
under the understanding that $x^0 = 0$ for $x = 0$ and 1 otherwise.
COMP9417 T1, 2021 48
Minkowski distance
Sometimes the data is not naturally in $\mathbb{R}^d$, but if we can turn it into Boolean features, or character sequences, we can still apply distance measures. For example:
• If 𝑥 and 𝑦 are binary strings, this is also called the Hamming distance. Alternatively, we can see the Hamming distance as the number of bits that need to be flipped to change 𝑥 into 𝑦.
• For non-binary strings of unequal length this can be generalised to the notion of edit distance or Levenshtein distance.
COMP9417 T1, 2021 49
Circles and ellipses
Unit circles for different order-p Minkowski distances
• Notice that for points on the coordinate axes all distances agree
• If we require a rotation-invariant distance metric, then Euclidean distance is our only choice
[Figure: lines connecting points at order-p Minkowski distance 1 from the origin, for (from inside) p = 0.8; p = 1 (Manhattan distance, the rotated square in red); p = 1.5; p = 2 (Euclidean distance, the violet circle); p = 8; and p = ∞ (Chebyshev distance, the blue rectangle)]
COMP9417 T1, 2021 50
Distance metric
Distance metric: Given an instance space $\mathcal{X}$, a distance metric is a function $Dis: \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty)$ such that for any $x, y, z \in \mathcal{X}$:
– distances between a point and itself are zero: $Dis(x, x) = 0$
– all other distances are larger than zero: if $x \neq y$ then $Dis(x, y) > 0$
– distances are symmetric: $Dis(y, x) = Dis(x, y)$
– detours cannot shorten the distance (triangle inequality): $Dis(x, z) \leq Dis(x, y) + Dis(y, z)$
o It can be shown that the triangle inequality does not hold for Minkowski distances with $p < 1$
If the second condition is weakened to a non-strict inequality – i.e., 𝐷𝑖𝑠(𝑥, 𝑦) may be zero even if 𝑥 ≠ 𝑦 – the function 𝐷𝑖𝑠 is called a
pseudo-metric.
COMP9417 T1, 2021 51
Means and distances
The arithmetic mean minimises squared Euclidean distance
The arithmetic mean $\mu$ of a set of data points $D$ in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that $\arg\min_y \sum_{x \in D} ||x - y||^2 = \mu$, where $||\cdot||$ denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial derivatives with respect to $y$) of the sum and setting it to the zero vector:
$\nabla_y \sum_{x \in D} ||x - y||^2 = -2 \sum_{x \in D} (x - y) = -2 \sum_{x \in D} x + 2|D|\,y = 0$
from which we derive $y = \frac{1}{|D|} \sum_{x \in D} x = \mu$.
COMP9417 T1, 2021
52
Means and distances
• Notice that minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t it be more natural to take the point that minimises total Euclidean distance as exemplar?
• This point is known as the geometric median, as for univariate data it corresponds to the median or ‘middle value’ of a set of numbers. However, for multivariate data there is no closed-form expression for the geometric median, which needs to be calculated by successive approximation.
COMP9417 T1, 2021 53
Means and distances
• In certain situations, it makes sense to restrict an exemplar to be one of the given data points. In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that doesn’t have to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the total distance to all other data points, in order to choose the point that minimises it. Regardless of the distance metric used, this is an $O(n^2)$ operation for $n$ points.
• So, for medoids there is no computational reason to prefer one distance metric over another.
• There may be more than one medoid.
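A minimal sketch of the $O(n^2)$ medoid computation for an arbitrary distance function (the toy points are made up; note that the outlier is not selected):

```python
import numpy as np

def medoid(X, dist):
    """Return the data point in X minimising the total distance to all other points."""
    n = len(X)
    totals = np.array([sum(dist(X[i], X[j]) for j in range(n)) for i in range(n)])
    return X[np.argmin(totals)]           # any minimiser is a valid medoid

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
euclid = lambda a, b: np.linalg.norm(a - b)
print(medoid(X, euclid))                  # [1. 0.] -- a centrally located point
```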
COMP9417 T1, 2021 54
Centroids and medoids
[Figure: a small data set of 10 points, with circles indicating centroids and squares indicating medoids (the latter must be data points) for different distance metrics: the squared 2-norm centroid (mean), the 2-norm centroid (geometric median), the squared 2-norm medoid, the 2-norm medoid and the 1-norm medoid]
Notice how the outlier on the bottom-right 'pulls' the mean away from the geometric median; as a result, the corresponding medoid changes as well.
COMP9417 T1, 2021 55
Two-exemplar decision boundaries
[Figure] (left) For two exemplars the nearest-exemplar decision rule with Euclidean distance results in a linear decision boundary coinciding with the perpendicular bisector of the line connecting the two exemplars. (right) Using Manhattan distance the circles are replaced by diamonds.
COMP9417 T1, 2021 56
Three-exemplar decision boundaries
[Figure] (left) Decision regions defined by the 2-norm nearest-exemplar decision rule for three exemplars. (right) With Manhattan distance the decision regions become non-convex.
COMP9417 T1, 2021 57
Distance-based models
To summarise, the main ingredients of distance-based models are:
• distance metrics, which can be Euclidean, Manhattan, Minkowski or Mahalanobis, among many others;
• exemplars: centroids that find a centre of mass according to a chosen distance metric, or medoids that find the most centrally located data point; and
• distance-based decision rules, which classify a query according to its nearest exemplar(s).
COMP9417 T1, 2021 58
Nearest Centroid Classifier
COMP9417 T1, 2021 59
Nearest Centroid Classifier
• This is a classifier based on the minimum-distance principle, where the class exemplars are just the centroids (or means)
[Figure: two class centroids in feature space]
• Training: for training sample pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i$ is the feature vector for sample $i$ and $y_i$ is the class label, the class centroids are:
$\mu_k = \frac{1}{|C_k|} \sum_{j \in C_k} x_j$
• Test: a new, unknown object with feature vector $x$ is classified as class $k$ if it is closer to the mean vector of class $k$ than to any other class mean vector
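A minimal sketch of nearest-centroid training and prediction with Euclidean distance (the data and helper names are made up for illustration):

```python
import numpy as np

def fit_centroids(X, y):
    """Compute one centroid (mean feature vector) per class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(x, centroids):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y)
print(predict(np.array([1.1, 1.0]), centroids))   # expected: 0
print(predict(np.array([4.9, 5.0]), centroids))   # expected: 1
```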
COMP9417 T1, 2021 60
Basic Linear Classifier & Nearest Centroid Classifier
• The basic linear classifier is distance-based.
• An alternative, distance-based way to classify instances without direct reference to a decision boundary is by the following decision rule: if 𝑥 is nearest to 𝜇⊕ then classify it as positive, otherwise as negative; or equivalently, classify an instance to the class of the nearest exemplar.
• If we use Euclidean distance as our closeness measure, simple geometry tells us we get exactly the same decision boundary.
• So the basic linear classifier can be interpreted from a distance- based perspective as constructing exemplars that minimise squared
Euclidean distance within each class, and then applying a nearest- exemplar decision rule.
COMP9417 T1, 2021 61
Nearest Centroid Classifier
• What happens if a class has more than one mode? (as in the figure below)
1. If there is only one centroid per class, then it will perform poorly
2. If we can somehow find the different modes, we can define one centroid per mode, which helps the classifier's generalisability
[Figure: (1) one centroid per class vs. (2) one centroid per mode]
COMP9417 T1, 2021
62
Nearest Centroid Classifier
Advantages:
o Simple
o Fast
o works well when classes are compact and far from each other.
COMP9417 T1, 2021 63
Nearest Centroid Classifier
Disadvantages:
o For complex classes (e.g., multimodal, non-spherical) it may give very poor results
o Cannot handle outliers and noisy data well
o Cannot handle missing data
COMP9417 T1, 2021 64
Nearest neighbour classification
COMP9417 T1, 2021 65
Nearest neighbour classification
• Related to the simplest form of learning: rote learning or memorization
o Training instances are searched for the instance that most closely resembles the new or query instance
o The instances themselves represent the knowledge
o Called: instance-based, memory-based learning or case-based learning; often a form of local learning
• The similarity or distance function defines “learning”, i.e., how to go beyond simple memorization
• Intuitive idea — instances “close by”, i.e., neighbours or exemplars, should be classified similarly
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, ...
• Ideas also important for unsupervised methods, e.g., clustering (later lectures)
COMP9417 T1, 2021 66
Nearest Neighbour
Stores all training examples $(x_j, f(x_j))$.
Nearest neighbour:
o Given query instance $x_q$, first locate the nearest training example $x_n$, then estimate $\hat{f}(x_q) \leftarrow f(x_n)$
$k$-Nearest neighbour:
o Given $x_q$, take a vote among its $k$ nearest neighbours (if discrete-valued target function) (see next slide)
o Take the mean of the $f$ values of the $k$ nearest neighbours (if real-valued):
$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$
COMP9417 T1, 2021
67
$k$-Nearest Neighbour Algorithm
Training algorithm:
o For each training example $(x_j, f(x_j))$, add the example to the list training_examples.
Classification algorithm:
– Given a query instance $x_q$ to be classified,
o Let $x_1, \ldots, x_k$ be the $k$ instances from training_examples that are nearest to $x_q$ by the distance function
o Return
$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.
COMP9417 T1, 2021 68
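A minimal sketch of this k-NN classification algorithm with Euclidean distance and majority voting (toy data; the helper name knn_predict is just for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(x_q, X_train, y_train, k):
    """Majority vote among the k training examples nearest to the query x_q."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # distance to every training example
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 1.2], [4.8, 5.0], [5.2, 4.9], [5.0, 5.3]])
y_train = np.array(["salmon", "salmon", "sea bass", "sea bass", "sea bass"])
print(knn_predict(np.array([1.2, 1.1]), X_train, y_train, k=3))   # expected: salmon
```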
Distance function again
The distance function defines what is learned.
Instance $x_j$ is described by a feature vector (list of attribute-value pairs)
$x_j = (x_{j1}, \ldots, x_{jd})^T$
where $x_{jr}$ denotes the value of the $r$th attribute/feature of $x_j$.
Most commonly used distance function is Euclidean distance . . .
o distance between two instances $x_i$ and $x_j$ is defined to be
$Dis(x_i, x_j) = \sqrt{\sum_{r=1}^{d} (x_{ir} - x_{jr})^2}$
COMP9417 T1, 2021 69
Distance function again
Many other distance functions could be used . . .
o e.g., Manhattan, also referred to as city-block distance (sum of absolute values of differences between attributes):
$Dis(x_i, x_j) = \sum_{r=1}^{d} |x_{ir} - x_{jr}|$
Vector-based formalisation – use norms $L_1$, $L_2$, ...
COMP9417 T1, 2021 70
kNN Example
• What is the predicted class for the green point, given the data, for k = 12?
https://towardsdatascience.com/knn-using-scikit-learn-c6bed765be75
COMP9417 T1, 2021 71
Normalization and other issues
• Different attributes are measured on different scales (for example, one attribute/feature may have a range of [0, 100] and another a range of [−1, 1])
• They need to be normalised (why?)
$x'_{jr} = \frac{x_{jr} - \min_j(x_{jr})}{\max_j(x_{jr}) - \min_j(x_{jr})}$
where $x_{jr}$ is the actual value of attribute/feature $r$ and $x'_{jr}$ is the normalised value.
• Nominal attributes: distance is either 0 or 1
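A minimal sketch of this per-attribute min-max normalisation:

```python
import numpy as np

X = np.array([[10.0, -1.0],
              [55.0,  0.0],
              [100.0, 1.0]])   # attribute 1 in [0, 100], attribute 2 in [-1, 1] (made-up)

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # every attribute now lies in [0, 1]
print(X_norm)
```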
COMP9417 T1, 2021
72
When To Consider Nearest Neighbour
• Instances map to points in $\mathbb{R}^d$
• Less than 20 attributes per instance
o or number of attributes can be reduced . . .
• Lots of training data
• No requirement for “explanatory” model to be learned
COMP9417 T1, 2021 73
K-Nearest Neighbour
Advantages:
• Statisticians have used 𝑘-NN since early 1950s
• Can be very accurate
• Training is very fast
• Can learn complex target functions
COMP9417 T1, 2021 74
K-Nearest Neighbour
Disadvantages:
o Slow at query time: basic algorithm scans entire training data to derive a prediction
o “Curse of dimensionality”
o Assumes all attributes are equally important, so easily fooled by irrelevant attributes
§ Remedy: attribute selection or weights
o Problem of noisy instances:
§ Remedy: remove from data set
§ not easy – how to know which are noisy ?
o Needs homogeneous feature type and scale
o Finding the optimal number of neighbours ($k$) can be challenging
COMP9417 T1, 2021 75
Inductive Bias of KNN
The inductive bias (a.k.a. learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs given unseen inputs.
What is the inductive bias of KNN ?
o an assumption that the classification of query instance $x_q$ will be most similar to the classification of other instances that are nearby according to the distance function
COMP9417 T1, 2021 76
Nearest-neighbour classifier
• 1NN perfectly separates training data, so low bias but high variance
• By increasing the number of neighbours 𝑘 we increase bias and decrease variance (what happens when 𝑘 = 𝑚? 𝑚 is the number of observations)
• Easily adapted to real-valued targets, and even to structured objects (nearest-neighbour retrieval). Can also output probabilities when 𝑘>1
COMP9417 T1, 2021 77
Distance-Weighted KNN
• Might want to weight nearer neighbours more heavily …
• Use the distance function to construct a weight $w_i$
• Replace the final line of the classification algorithm by:
$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$
where
$w_i = \frac{1}{Dis(x_q, x_i)^2}$
and $Dis(x_q, x_i)$ is the distance between $x_q$ and $x_i$.
COMP9417 T1, 2021 78
Distance-Weighted KNN
For real-valued target functions, replace the final line of the algorithm by:
$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
(denominator normalizes contribution of individual weights).
Now we can consider using all the training examples instead of just 𝑘:
o using all examples (i.e., when 𝑘 = 𝑚 and 𝑚 is number of training samples) with the rule above is called Shepard’s method
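A minimal sketch of the distance-weighted rule for a real-valued target (a small epsilon guards against division by zero when the query coincides with a training point; the data is made up):

```python
import numpy as np

def weighted_knn_regress(x_q, X_train, y_train, k):
    """Distance-weighted average of the targets of the k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + 1e-12)          # w_i = 1 / Dis(x_q, x_i)^2
    return np.sum(w * y_train[nearest]) / np.sum(w)  # normalised weighted mean

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_regress(np.array([1.5]), X_train, y_train, k=2))   # 2.5
```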
COMP9417 T1, 2021 79
Evaluation
Lazy learners do not construct an explicit model, so how do we evaluate the output of the learning process ?
o 1-NN – training set error is always zero !
§ each training example is always closest to itself
o $k$-NN – overfitting may be hard to detect
Solution:
Leave-one-out cross-validation (LOOCV) – leave out each example and predict it given the rest:
$(x_1, y_1), (x_2, y_2), \ldots, (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), \ldots, (x_m, y_m)$
Error is the mean over all predicted examples. Fast – no models to be built!
COMP9417 T1, 2021 80
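A minimal sketch of LOOCV for a k-NN classifier (toy data; the error value depends entirely on the made-up points):

```python
import numpy as np
from collections import Counter

def knn_predict(x_q, X, y, k):
    """Majority vote among the k nearest training examples."""
    nearest = np.argsort(np.linalg.norm(X - x_q, axis=1))[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

def loocv_error(X, y, k):
    """Leave each example out, predict it from the rest, and average the errors."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                 # everything except example i
        errors += knn_predict(X[i], X[mask], y[mask], k) != y[i]
    return errors / len(X)

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [3.0, 3.0]])
y = np.array([0, 0, 1, 1, 0])
print(loocv_error(X, y, k=3))
```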
KNN Computational Time
• KNN uses the training data as exemplars, so using simple search for prediction is 𝑂(𝑛)!
• There are algorithms to search for neighbours more efficiently with 𝑂(log 𝑛) but they do not work very well for above 10 dimensions (more than 10 features/attributes)
• For above 10 dimension, there are some approximate nearest neighbour approaches that can improve computation by orders of magnitude
• In high dimensional space (e.g., above 20 dimensions) even with using such algorithms, the KNN doesn’t work well
• In high-dimensional spaces everything is far away from everything and so pairwise distances are uninformative (curse of dimensionality)
COMP9417 T1, 2021 81
When is KNN meaningful?
You may think that this is an exceptional example, and this doesn’t really happen in practice!!
COMP9417 T1, 2021 82
Curse of Dimensionality
• It can be shown that as dimensions increase the effectiveness of distance metrics decrease and the concept of proximity may not be qualitatively meaningful as all points look equidistant
• This is one symptom of having high dimensional space (curse of dimensionality)
• There are also other problems arising from curse of dimensionality:
– It becomes polynomially harder to estimate many parameters (e.g., covariances)
– It becomes more difficult to visualize data
– Enormous amount of data is needed to train a model
COMP9417 T1, 2021 83
Curse of Dimensionality
• the number of "cells" in the instance space grows exponentially in the number of features
• with exponentially many cells we would need exponentially many data points to ensure that each cell is sufficiently populated to make nearest-neighbour predictions reliably
COMP9417 T1, 2021 84
Curse of Dimensionality
Bellman (1960) coined this term in the context of dynamic programming
Imagine instances described by 20 attributes, but only 2 are relevant to target function — “similar” examples will appear “distant”.
Curse of dimensionality: nearest neighbour is easily mislead when dealing with high-dimensional 𝑥 in terms of the number of features – problem of irrelevant attributes
One approach:
o Stretch the $j$th axis by weight $z_j$, where $z_1, \ldots, z_d$ are chosen to minimise prediction error
o Use cross-validation to automatically choose weights $z_1, \ldots, z_d$
o Note: setting $z_j$ to zero eliminates this dimension altogether
COMP9417 T1, 2021 85
Curse of Dimensionality
Some ideas to address this for instance-based (nearest-neighbour) learning
o Euclidean distance with weights on attributes:
$Dis(x_q, x_i) = \sqrt{\sum_{r=1}^{d} z_r (x_{qr} - x_{ir})^2}$
o updating of weights based on nearest neighbour classification error
§ class correct/incorrect: weight increased/decreased § can be useful if not all features used in classification
See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross Validation Error” COMP9417 T1, 2021 86
Instance-based (nearest-neighbour) learning
Recap – Practical problems of NN scheme:
o Slow (but fast k-dimensional tree-based approaches exist) § Remedy: removing irrelevant data
o Noise (but KNN copes quite well with noise) § Remedy: removing noisy instances
o All attributes deemed equally important
§ Remedy: attribute weighting (or simply selection)
COMP9417 T1, 2021 87
Some refinements of instance-based classifiers
• Edited NN classifiers discard some of the training instances before making predictions (e.g. removing data points that do not agree with the majority of their k nearest neighbours)
– Saves memory and speeds up classification
• IB2: incremental NN learner that only incorporates misclassified instances into the classifier
o Problem: noisy data gets incorporated
• IB3: store classification performance information with each instance and only use it in prediction if above a threshold
COMP9417 T1, 2021 88
Dealing with noise
Use larger values of k (why?)
How to find the "right" k?
o One way: cross-validation-based 𝑘-NN classifier (but slow)
o Different approach: discarding instances that don’t perform well by keeping success records of how well an instance does at prediction (IB3)
COMP9417 T1, 2021 89
kNN Example
COMP9417 T1, 2021 90
kNN Example
• Automated MS-lesion segmentation by KNN
• They have used some manually labeled image as the training set
• They used 4 features: Intensity and voxel locations (x,y,z coordinates)
• Ref: Anbeek et al., "Automated MS-lesion segmentation by K-nearest neighbor classification", MIDAS Journal, 2008
COMP9417 T1, 2021 91
Summary
• A framework for classification
• Classification viewed in terms of distance in feature space
• Distance-based learning
• Nearest neighbour classifiers
• Later we will see how to extend by building on these ideas
COMP9417 T1, 2021 92
Acknowledgements
• Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997) http://www- 2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 T1, 2021 93