Classification (1)
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael
Copyright By PowCoder代写 加微信 powcoder
This lecture will introduce you to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline a framework for solving classification problems
• describe issues of generalisation and evaluation for classification
• outline the use of a linear model as a 2-class classifier
• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method
COMP9417 T1, 2022 1
Introduction
Classification (sometimes called concept learning) methods dominate machine learning . . .
. . . however, they often don’t have convenient mathematical properties like regression, so are more complicated to analyse. The idea is to learn a classifier, which is usually a function mapping from an input data point to one of a set of discrete outputs, i.e., the classes.
We will mostly focus on classifier advantages and disadvantages as learning methods first and point to unifying ideas and approaches where applicable.
COMP9417 T1, 2022 2
Classification
Example: Imagine that we want to automate the process of sorting incoming fish in a fish-packing plant. And as a pilot, we start by separating sea bass from salmon using some information collected through sensing.
COMP9417 T1, 2022 3
Classification
Example: classifying sea bass vs. salmon
Features that can be used: width, length, weight, lightness, fins, eyes/mouth position, etc.
Question: how to separate these two classes?
COMP9417 T1, 2022
Classification
Example: Maybe we can find a line that separates the two classes.
COMP9417 T1, 2022
Classification
Example: If we find the line that separated the two classes, then how
our algorithm makes prediction? The line equation will look like:
𝑎𝑥! + 𝑏𝑥” + 𝑐 = 0
We can define 𝑎, 𝑏 & 𝑐 such that:
for any point above the line:
𝑎𝑥! + 𝑏𝑥” + 𝑐 > 0
and for any point below the line:
𝑎𝑥! + 𝑏𝑥” + 𝑐 < 0
X2 : Width
This type of classifier is called linear classifier. It is also a type of X1 : Lightness discriminative learning algorithm.
COMP9417 T1, 2022 6
Classification Example:
Can we do something different than finding the discriminative line (or some boundary) to be able to separate the two groups?
COMP9417 T1, 2022
Classification Example:
Instead of finding a discriminative line, maybe we can focus on one class at a time and build a model that describes how that class looks like; and then do the same for the other class. This type of models are called generative learning algorithm.
COMP9417 T1, 2022
Classification
Generative algorithm: builds some models for each of the classes and then makes classification predictions based on looking at the test example and see it is more similar to which of the models.
– Learns 𝑝(𝑥|𝑦) (and also 𝑝 𝑦 , called class prior)
– So,wecanget𝑝 𝑥,𝑦 =𝑝 𝑥𝑦 𝑝(𝑦)
– It learns the mechanism by which the data has been generated
Discriminative algorithm: Do not build models for different classes, but rather focuses on finding a decision boundary that separates classes
– Learns 𝑝(𝑦|𝑥)
COMP9417 T1, 2022 9
Classification
To predict the output for sample 𝑥, in generative algorithm, we have
to estimate 𝑝(𝑦|𝑥): 𝑝𝑦=0𝑥=
𝑝 𝑥 𝑦 = 0 𝑝(𝑦 = 0)
𝑝 𝑥 𝑦 = 1 𝑝(𝑦 = 1)
If𝑝 𝑦=0𝑥 >𝑝 𝑦=1𝑥 ,then𝑥belongstoclass𝑦=0andotherwise
to class 𝑦 = 1.
• For discriminative algorithm, we can directly have 𝑝 𝑦 = 0 𝑥 and 𝑝 𝑦=1𝑥 andsimilartoabove,if𝑝 𝑦=0𝑥 >𝑝 𝑦=1𝑥 ,then𝑥 belongs to class 𝑦 = 0 and otherwise to class 𝑦 = 1.
COMP9417 T1, 2022 10
Linear classification in two dimensions x2 : Width
Positive class
• We find the line that separates the two class: 𝑎𝑥! + 𝑏𝑥” + 𝑐 = 0
• We define a weight vector 𝑤# = [𝑎, 𝑏], 𝑥# = [𝑥!, 𝑥”]
• So,thelinecanbedefinedby𝑥#𝑤=−𝑐=𝑡
• 𝑤 is perpendicular to decision boundary (in direction of positive class)
• 𝑡 is the decision threshold (if 𝑥#𝑤 > t then 𝑥 belongs to positive class and if 𝑥#𝑤 < t then 𝑥 belongs to negative class)
w Negative class
x1 : Lightness
COMP9417 T1, 2022 11
Basic Linear Classifier
The basic linear classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass.
COMP9417 T1, 2022
Basic Linear Classifier
The basic linear classifier is described by the equation 𝑥#𝑤 = 𝑡, and 𝑤=𝑝−𝑛
As we know, $%& is on the decision boundary, so we have: " 𝑝 + 𝑛 # | 𝑝 |" − ||𝑛||"
𝑡=(2).𝑝−𝑛= 2 Where | 𝑥 |, denotes the length of vector 𝑥
COMP9417 T1, 2022 13
The ingredients of machine learning
HoHwow smoaclhvine laeartnainsgkhewlpsittho smolvae ca htaisnk e learning
Training data
Learning problem
Learning algorithm
Domain objects
An overview of how machine learning is used to address a given task. A
An overview of how machine learning is used to address a given task. A
task (red box) requires an appropriate mapping – a model – from data
task (red box) requires an appropriate mapping – a model – from data
described by features to outputs. Obtaining such a mapping from
described by features to outputs. Obtaining such a mapping from training traindiantga idsawtahaitscwonhsatittuctoens satilteuatrensinga pleroabrlneming(bplureobloexm). (blue box).
COMP9417 ML & DM Classification (1) Term 2, 2019 17 / 72
COMP9417 T1, 2022 14
is a key objective of machine learning.
What we are really interested in is generalising from the sample of data in our training set.
In machine learning, generalization means how well a trained model can classify or forecast unseen data. Training a generalized machine learning model means, in general, it works for all subset of unseen data.
An example is when we train a model to classify between dogs and cats. If the model is provided with dogs images dataset with only two breeds, it may obtain a good performance. But, it possibly gets a low classification score when it is tested by other breeds of dogs as well.
COMP9417 T1, 2022 15
Generalisation
There are three basic assumptions for generalization:
• Examples are drawn independently and identically (i.i.d) at random from the distribution;
• The distribution is stationary; that is the distribution doesn't change within the data set
• We always pull from the same distribution (for training, validation and test samples)
In practice, we sometimes violate these assumptions.
COMP9417 T1, 2022 16
Cross-validation
Also known as out-of-sample testing is validation technique to assess the
results of a model to an independent data set 1. Holdout method:
Train Test
COMP9417 T1, 2022 17
Cross-validation
2. Leave-One-Out Cross validation (LOOCV):
Test Iteration 1
Iteration 2 Iteration 3
Iteration 4
Iteration m
COMP9417 T1, 2022 18
Cross-validation
3. K-fold Cross Validation
Iteration 1 Iteration 2 Iteration 3
Iteration 4
Iteration 7
COMP9417 T1, 2022 19
Cross-validation
There are certain parameters that need to be estimated during learning. We use the data, but NOT the training set, OR the test set. Instead, we use a separate validation or development set.
COMP9417 T1, 2022 20
Cross-validation
Validation set: To make the hyperparameter tuning and model selection independent from the test set, we define another set within the train set
Train Validation Test
COMP9417 T1, 2022 21
Data Types
In Machine Learning world, in general two types of data is defined:
o Numerical: Anything represented by numbers (e.g., integer , floating point)
o Categorical: everything that is not numerical (e.g. discrete labeled groups)
In general, for machine learning algorithms, data has to be represented in numeric form
COMP9417 T1, 2022 22
Data Types
Another taxonomy of data types:
1.Irrelevant: it might be represented with strings or numbers but has no relationship with the outcome (e.g. participants name or code)
2.Nominal: discrete values with no numerical relationship between different categories (e.g. animal types, colors, nationality)
3.Binary: discrete data with only two possibilities (e.g cancerous vs. non-cancerous)
COMP9417 T1, 2022 23
Data Types
4. Ordinal: discrete integers that can be ranked, but the relative distance between any two number can not be defined (e.g., students rank based on GPA)
5. Count: discrete whole numbers without any negatives 6. Time: a cyclical, repeating continuous form of data (e.g
days, weeks)
7. Interval: data that the we can measure the distance between different values. (e.g., temperature, income)
COMP9417 T1, 2022 24
Binary Classification task
In a binary classification (or binomial classification) task, we always want to classify the data of a given set into two groups. We usually define one of the classes as positive and one as negative.
o Sometimes the classes are equally important (e.g. recognition of dog vs cat in image classification
o Sometimes misclassification in one of the classes is more costly than misclassification in the other class (e.g. predicting that someone has cancer while (s)he doesn’t have vs predicting that someone doesn’t have cancer while (s)he has) therefore we may prefer to have better classification in one class in the cost of more errors in the other class
COMP9417 T1, 2022 25
Evaluation of error
If we have a binary classification, then we have two classes of y ∈ {0,1} , where we call the class 𝑦 = 1, positive class and 𝑦 = 0, negative class.
Some evaluation metrics:
o True positive: number of instances from class one that have
been predicted as one
o True negative: number of instances from class zero that have been predicted as zero
o False positive: number of instances from class zero that have been predicted as one
o False negative: number of instances from class one that have been predicted as zero
COMP9417 T1, 2022 26
Algorithm-independent aspects of machine learning Evaluating performance on a task
Contingency table I
Contingency table
For two-class prediction case:
Two-class prediction case:
Predicted Class
Actual Class
PoYseitisve
NegNatoive
PoYsietisve
True Positive (TP)
False Negative (FN)
False Positive (FP)
True Negative (TN)
This is also called confusion matrix COMP9417 ML & DM Classification (1)
Term 2, 2019
COMP9417 T1, 2022
Classification Accuracy
Classification Accuracy on a sample of labelled pairs (𝑥, 𝑐(𝑥)) given a learned classification model that predicts, for each instance 𝑥, a class
value 𝑐̂(𝑥):
𝑎𝑐𝑐= 1 C 𝐼[𝑐̂𝑥 =𝑐(𝑥)] |𝑇𝑒𝑠𝑡| '∈#)*+
where 𝑇𝑒𝑠𝑡 is a test set and 𝐼[] is the indicator function which is 1 iff its argument evaluates to true, and 0 otherwise.
𝐶𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛 𝐸𝑟𝑟𝑜𝑟 𝑖𝑠 = 1 − 𝑎𝑐𝑐.
COMP9417 T1, 2022 28
Other evaluation metrics Precision/correctness
– is the number of relevant objects classified correctly divided by the total number of relevant objects classified 𝑇𝑃
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 + 𝐹𝑃
Recall/sensitivity/completeness/true positive rate (TPR)
– is the number of relevant objects classified correctly divided by total number of relevant/correct objects
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃 𝑇𝑃 + 𝐹𝑁
https://en.wikipedia.org/wiki/Precision_and_recall
COMP9417 T1, 2022 29
Other evaluation metrics
F1 score: a measure of accuracy, which is the harmonic mean of precision and recall and is defined as:
𝐹! = 2 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛× 𝑟𝑒𝑐𝑎𝑙𝑙 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙
This measure gives equal importance to precision and recall which is sometime undesirable; so, we have to decide which metric to use depending on the task and what’s important for the task.
COMP9417 T1, 2022 30
Other evaluation metrics
AUC-ROC curve: Area Under the Curve (AUC) – Receiver Operating Characteristics (ROC) curve is one of the most important evaluation metric for performance of classification models. This metric evaluates the model at different threshold settings and can inform us on the capability of the model in distinguishing between classes.
• 𝑇𝑃𝑅= #, #,%-.
• 𝐹𝑃𝑅= -, -,%#.
• A good model has 𝐴𝑈𝐶 close to 1
• A very poor model has 𝐴𝑈𝐶 close to 0
• 𝐴𝑈𝐶 = 0.5 means no class separation
COMP9417 T1, 2022 31
Missing Value: An issue to consider
COMP9417 T1, 2022 32
Missing Values
• In practice it rarely happens that the data is complete and homogenous.
• Why data is incomplete:
o Human errors
o Sensor errors
o Software bugs
o Faulty preprocessing o...
COMP9417 T1, 2022 33
Missing Values
How to handle missing values (common approaches): o Deleting samples with missing values
o Replacing the missing value with some statistics from the data (mean, median, ...)
o Assigning a unique category
o Predicting the missing values
o Using algorithms that support missing values
COMP9417 T1, 2022 34
Missing Values
Deleting samples with missing values: – Pros:
o A robust and probably more accurate model – Cons:
o Loss of information and data
o Works poorly if the percentage of missing values is high
COMP9417 T1, 2022 35
Missing Values
Replacing the missing value with mean/median/mode: – Pros:
o When the data size is small, it is better than deleting
o It can prevent data loss – Cons:
o Imputing the approximations adds bias to the model (it reduces the variance of the variable)
o Works poorly compared to other methods
COMP9417 T1, 2022 36
Missing Values
If categorical, assigning a unique category or the most frequent category:
o Works well with small datasets and easy to implement o No loss of data
o Unique category works only for categorical features
o Adding another feature (e.g., a new unique category) to the model may result higher variance in the model
o Adding the most frequent category can increase the bias in the model
COMP9417 T1, 2022 37
Missing Values
Predicting the missing values: – Pros:
o Imputing the missing variable is an improvement as long as the bias from it is smaller than the omitted variable bias
o Yields unbiased estimates of the model parameters – Cons:
o Bias also arises when an incomplete conditioning set is used for a categorical variable
o Considered only as a proxy for the true values
COMP9417 T1, 2022 38
Missing Values
Using algorithms that support missing values: – Pros:
o Does not require creation of a predictive model
o Correlation of the data is neglected – Cons:
o Some of these algorithms are very time-consuming and it can be critical in data mining where large databases are being extracted
COMP9417 T1, 2022 39
Nearest Neighbor Algorithm for Classification
COMP9417 T1, 2022 40
Distance-based models
earest Neighbour
Nearest Neighbour
Nearest Neighbour is a regression or classification algorithm that predicts whatever is the output value of the nearest data point to some query.
earest Neighbour is a regression or classification algorithm that predicts
To find the nearest data point, we have to find the distance between the query and other points. So we hatever is the output value of the nearest data point to some query.
have to decide how to define the distance.
COMP9417 T1, 2022 41
COMP9417 ML & DM Classification (1) Term 2, 2019 33 / 72
Minkowski distance
Minkowski distance If 𝒳 → R/ , 𝑥, 𝑦 ∈ 𝜒, the Minkowski distance of order 𝑝 > 0isdefinedas:
𝐷𝑖𝑠$ 𝑥,𝑦 =(C|𝑥0−𝑦0|$)!2$=||x−y||$ 01!
Where | 𝑧 | = (∑/ |𝑧 |$)!2″ is the 𝑝 − 𝑛𝑜𝑟𝑚 (sometimes denoted 𝐿 $ 01!0 $
norm) of the vector 𝑧.
COMP9417 T1, 2022 42
Minkowski distance
• The 2-norm refers to the familiar Euclidean distance &
𝐷𝑖𝑠” 𝑥,𝑦 = C(𝑥0 −𝑦0)” = (𝑥−𝑦)#(𝑥−𝑦) 01!
• The 1-norm denotes Manhattan distance, also called cityblock
𝐷𝑖𝑠! 𝑥,𝑦 =C|𝑥0−𝑦0| 01!
COMP9417 T1, 2022 43
Minkowski distance
• If we now let 𝑝 grow larger, the distance will be more and more dominated by the largest coordinate-wise distance, from which we can infer that 𝐷𝑖𝑠3 = 𝑚𝑎𝑥0 |𝑥0 − 𝑦0 |; this is also called Chebyshev
• You will sometimes see references to the 0-norm (or 𝐿4 norm) which counts the number of non-zero elements in a vector. The corresponding distance then counts the number of positions in which vectors x and y differ. This is not strictly a Minkowski distance; however, we can define it as:
𝐷𝑖𝑠4 𝑥,𝑦 =C(𝑥0−𝑦0)4=C𝐼[𝑥0≠𝑦0]
under the understanding that 𝑥4 = 0 for 𝑥 = 0 and 1 otherwise. COMP9417 T1, 2022 44
Minkowski distance
Sometimes the data is not naturally in R/, but if we can turn it into Boolean features, or character sequences, we can still apply distance measures. For example:
• If 𝑥 and 𝑦 are binary strings, this is also called the Hamming distance. Alternatively, we can see the Hamming distance as the number of bits that need to be flipped to change 𝑥 into 𝑦.
• For non-binary strings of unequal length this can be generalised to the notion of edit distance or Levenshtein distance.
COMP9417 T1, 2022 45
Circles and ellipses
Circles and ellipses
Unite circles with different order-p Minkowski distance
• Notice that for points on the coordinate axes all distances agree
• If we require a rotation invariant distance metric, then Euclidean distance is our only choice
Lines connecting points at order-p Minkowski distance 1 from
for (from inside) p = 0.8; p = 1 (Manhattan distance, the rot
in red); p = 1.5; p = 2 (Euclidean distance, the violet circle);
Distance-based models
p = 8; and p = 1 (Chebyshev distance, the blue rectangle). N COMP9417 T1, 2022 46
for points on the coordinate axes all distances agree. For the our reach increases with p; however, if we require a rotation-in
Distance metric
Distance metric Given an instance space 𝒳 , a distance metric is a function 𝐷𝑖𝑠 ∶ 𝒳×𝒳 → 0,∞ such that for any 𝑥,𝑦,𝑧 ∈ 𝒳:
– distances between a point and itself are zero: 𝐷𝑖𝑠(𝑥, 𝑥) = 0
– all other distances are larger than zero: if 𝑥 ≠ 𝑦 then 𝐷𝑖𝑠(𝑥, 𝑦) > 0 – distancesaresymmetric:𝐷𝑖𝑠(𝑦,𝑥)=𝐷𝑖𝑠(𝑥,𝑦)
– detours can not shorten the distance (triangle inequality):
𝐷𝑖𝑠(𝑥, 𝑧) ≤ 𝐷𝑖𝑠(𝑥, 𝑦) + 𝐷𝑖𝑠(𝑦, 𝑧)
o It can be shown that triangle inequality does not hold for 𝑝 < 1
If the second condition is weakened to a non-strict inequality – i.e., 𝐷𝑖𝑠(𝑥, 𝑦) may be zero even if 𝑥 ≠ 𝑦 – the function 𝐷𝑖𝑠 is called a
pseudo-metric.
COMP9417 T1, 2022 47
Means and distances
The arithmetic mean minimises squared Euclidean distance The arithmetic mean 𝜇 of a set of data points 𝐷 in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that argmin∑'∈6 | 𝑥 − 𝑦 |" = 𝜇 , where 5
||. || denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial derivatives with respect to 𝑦) of the sum and setting it to the zero vector:
∇5C|𝑥−𝑦|"=−2C𝑥−𝑦 =−2C𝑥+2𝐷𝑦=0 '∈6 '∈6 '∈6
Fromwhichwederive𝑦= ! ∑'∈6𝑥=𝜇 |6|
COMP9417 T1, 2022
Means and distances
• Notice that minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn’t it be more natural to take the point that minimises total Euclidean distance as exemplar?
• This point is known as the geometric median, as for univariate data it corresponds to the median or ‘middle value’ of a set of numbers. However, for multivariate data there is no closed-form expression for the geometric median, which needs to be calculated by successive approximation.
COMP9417 T1, 2022 49
Means and distances
• In certain situations, it makes sense to restrict an exemplar to be one of the given data points. In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that doesn’t have to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the total distance to all other data poin
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com