
Classification (1)
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael


This lecture will introduce you to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline a framework for solving classification problems
• describe issues of generalisation and evaluation for classification
• outline the use of a linear model as a 2-class classifier
• describe distance measures and how they are used in classification
• outline the basic k-nearest neighbour classification method

Introduction
Classification (sometimes called concept learning) methods dominate machine learning . . .
. . . however, they often don’t have convenient mathematical properties like regression, so are more complicated to analyse. The idea is to learn a classifier, which is usually a function mapping from an input data point to one of a set of discrete outputs, i.e., the classes.
We will mostly focus on the advantages and disadvantages of classifiers as learning methods, and point to unifying ideas and approaches where applicable.

Classification
Example: Imagine that we want to automate the process of sorting incoming fish in a fish-packing plant. As a pilot, we start by separating sea bass from salmon using some information collected through sensing.

Classification
Example: classifying sea bass vs. salmon
Features that can be used: width, length, weight, lightness, fins, eyes/mouth position, etc.
Question: how to separate these two classes?

Classification
Example: Maybe we can find a line that separates the two classes.

Classification
Example: If we find a line that separates the two classes, how does our algorithm make predictions? The line equation will look like:
$ax_1 + bx_2 + c = 0$
We can define $a$, $b$ and $c$ such that:
for any point above the line:
$ax_1 + bx_2 + c > 0$
and for any point below the line:
$ax_1 + bx_2 + c < 0$
(Here $x_1$ is lightness and $x_2$ is width.) This type of classifier is called a linear classifier. It is also a type of discriminative learning algorithm.

Classification
Example: Can we do something different than finding a discriminative line (or some boundary) to separate the two groups?

Classification
Example: Instead of finding a discriminative line, maybe we can focus on one class at a time and build a model that describes what that class looks like, and then do the same for the other class. This type of model is called a generative learning algorithm.

Classification
Generative algorithm: builds a model for each of the classes and then makes classification predictions by looking at the test example and seeing which of the models it is more similar to.
– Learns $p(x|y)$ (and also $p(y)$, called the class prior)
– So we can get $p(x, y) = p(x|y)\,p(y)$
– It learns the mechanism by which the data has been generated
Discriminative algorithm: does not build models for the different classes, but rather focuses on finding a decision boundary that separates the classes
– Learns $p(y|x)$

Classification
• To predict the output for a sample $x$ with a generative algorithm, we have to estimate $p(y|x)$ via Bayes' rule:
$$p(y=0 \mid x) = \frac{p(x \mid y=0)\,p(y=0)}{p(x \mid y=0)\,p(y=0) + p(x \mid y=1)\,p(y=1)}$$
If $p(y=0 \mid x) > p(y=1 \mid x)$, then $x$ belongs to class $y = 0$, and otherwise to class $y = 1$.
• For a discriminative algorithm, we directly have $p(y=0 \mid x)$ and $p(y=1 \mid x)$, and similarly to the above, if $p(y=0 \mid x) > p(y=1 \mid x)$ then $x$ belongs to class $y = 0$, and otherwise to class $y = 1$.
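To make the generative route concrete, here is a minimal sketch (not from the lecture; the one-feature Gaussian class-conditional model and all function names are illustrative assumptions) that fits $p(x|y)$ per class and predicts with Bayes' rule:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_generative(X, y):
    """Estimate p(x|y) as a Gaussian per class, plus the class priors p(y)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(), Xc.std(ddof=1), len(Xc) / len(X))  # mu, sigma, prior
    return params

def predict_generative(x, params):
    """Bayes' rule: pick the class with the larger p(x|y) * p(y)."""
    scores = {c: gaussian_pdf(x, mu, sigma) * prior
              for c, (mu, sigma, prior) in params.items()}
    return max(scores, key=scores.get)

# Toy 1-D data: class 0 clustered around 2.0, class 1 around 5.0.
X = np.array([1.8, 2.1, 2.4, 4.8, 5.1, 5.3])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit_generative(X, y)
print(predict_generative(3.0, params))  # closer to class 0's model -> 0
```

A discriminative method would instead model $p(y|x)$ directly, skipping the per-class density estimates.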

Linear classification in two dimensions
[Figure: the two classes plotted against $x_1$ (lightness) and $x_2$ (width), with the positive class on one side of the decision boundary, the negative class on the other, and the weight vector $w$ perpendicular to the boundary.]
• We find the line that separates the two classes: $ax_1 + bx_2 + c = 0$
• We define a weight vector $w^T = [a, b]$ and $x^T = [x_1, x_2]$
• So the line can be defined by $x^T w = -c = t$
• $w$ is perpendicular to the decision boundary (pointing in the direction of the positive class)
• $t$ is the decision threshold: if $x^T w > t$ then $x$ belongs to the positive class, and if $x^T w < t$ then $x$ belongs to the negative class.

Basic Linear Classifier
The basic linear classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass.

Basic Linear Classifier
The basic linear classifier is described by the equation $x^T w = t$, with
$$w = p - n$$
As we know, $\frac{p+n}{2}$ is on the decision boundary, so we have:
$$t = \left(\frac{p+n}{2}\right)^T (p - n) = \frac{||p||^2 - ||n||^2}{2}$$
where $||x||$ denotes the length of vector $x$.

The ingredients of machine learning
How machine learning helps to solve a task
[Figure: an overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).]

Generalisation
Generalisation is a key objective of machine learning. What we are really interested in is generalising from the sample of data in our training set. In machine learning, generalisation means how well a trained model can classify or forecast unseen data. Training a generalised machine learning model means, in general, that it works for any subset of unseen data. An example is when we train a model to classify between dogs and cats: if the model is provided with a dog image dataset containing only two breeds, it may obtain good performance, but it will possibly get a low classification score when tested on other breeds of dogs.

Generalisation
There are three basic assumptions for generalisation:
• Examples are drawn independently and identically distributed (i.i.d.) at random from the distribution;
• The distribution is stationary; that is, the distribution doesn't change within the data set;
• We always pull from the same distribution (for training, validation and test samples).
In practice, we sometimes violate these assumptions.

Cross-validation
Also known as out-of-sample testing, this is a validation technique for assessing the results of a model on an independent data set.
1. Holdout method: split the data once into a Train set and a Test set.

Cross-validation
2. Leave-One-Out Cross-Validation (LOOCV): with $m$ samples, run $m$ iterations, each holding out a single sample for testing and training on the rest.

Cross-validation
3. K-fold Cross-Validation: split the data into $K$ folds and run $K$ iterations, each holding out a different fold for testing and training on the remaining folds.

Cross-validation
There are certain parameters that need to be estimated during learning. We use the data, but NOT the training set, OR the test set. Instead, we use a separate validation or development set; a minimal splitting sketch follows below.
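To make the splitting schemes concrete, here is a minimal sketch (an illustration, not from the slides; the sizes and names are arbitrary) of a k-fold split that also carves a validation subset out of each training portion:

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)

n, k = 20, 5
folds = k_fold_indices(n, k)

for i, test_idx in enumerate(folds):
    # Every fold except fold i is available for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Hold out part of the training data as a validation set for
    # hyperparameter tuning, keeping the test fold untouched.
    n_val = len(train_idx) // 4
    val_idx, train_idx = train_idx[:n_val], train_idx[n_val:]
    print(f"Iteration {i+1}: train={len(train_idx)}, "
          f"val={len(val_idx)}, test={len(test_idx)}")
```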
Cross-validation
Validation set: To make the hyperparameter tuning and model selection independent from the test set, we define another set within the train set, giving a three-way split: Train | Validation | Test.

Data Types
In the Machine Learning world, in general two types of data are defined:
o Numerical: anything represented by numbers (e.g., integer, floating point)
o Categorical: everything that is not numerical (e.g., discrete labelled groups)
In general, for machine learning algorithms, data has to be represented in numeric form.

Data Types
Another taxonomy of data types:
1. Irrelevant: might be represented with strings or numbers but has no relationship with the outcome (e.g., participant names or codes)
2. Nominal: discrete values with no numerical relationship between different categories (e.g., animal types, colours, nationality)
3. Binary: discrete data with only two possibilities (e.g., cancerous vs. non-cancerous)

Data Types
4. Ordinal: discrete integers that can be ranked, but where the relative distance between any two numbers cannot be defined (e.g., students' rank based on GPA)
5. Count: discrete whole numbers without any negatives
6. Time: a cyclical, repeating continuous form of data (e.g., days, weeks)
7. Interval: data for which we can measure the distance between different values (e.g., temperature, income)

Binary Classification task
In a binary classification (or binomial classification) task, we want to classify the data of a given set into two groups. We usually define one of the classes as positive and one as negative.
o Sometimes the classes are equally important (e.g., recognition of dog vs. cat in image classification).
o Sometimes misclassification in one of the classes is more costly than misclassification in the other class (e.g., predicting that someone has cancer while (s)he doesn't vs. predicting that someone doesn't have cancer while (s)he has); therefore we may prefer better classification in one class at the cost of more errors in the other class.

Evaluation of error
In binary classification we have two classes $y \in \{0, 1\}$, where we call the class $y = 1$ the positive class and $y = 0$ the negative class.
Some evaluation counts:
o True positive: number of instances from class one that have been predicted as one
o True negative: number of instances from class zero that have been predicted as zero
o False positive: number of instances from class zero that have been predicted as one
o False negative: number of instances from class one that have been predicted as zero

Contingency table
For the two-class prediction case:

                        Predicted Positive      Predicted Negative
  Actual Positive       True Positive (TP)      False Negative (FN)
  Actual Negative       False Positive (FP)     True Negative (TN)

This is also called a confusion matrix.

Classification Accuracy
Classification accuracy on a sample of labelled pairs $(x, c(x))$, given a learned classification model that predicts, for each instance $x$, a class value $\hat{c}(x)$:
$$acc = \frac{1}{|Test|} \sum_{x \in Test} I[\hat{c}(x) = c(x)]$$
where $Test$ is a test set and $I[\cdot]$ is the indicator function, which is 1 iff its argument evaluates to true, and 0 otherwise.
Classification Error is $1 - acc$.
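Connecting the counts to the accuracy formula above, the following minimal sketch (illustrative data, not from the slides) tallies the confusion-matrix cells and computes accuracy, together with the precision and recall defined on the next slides:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # 3
tn = np.sum((y_pred == 0) & (y_true == 0))  # 3
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # 1

accuracy = (tp + tn) / len(y_true)  # 6/8 = 0.75
precision = tp / (tp + fp)          # defined below: 3/4
recall = tp / (tp + fn)             # defined below: 3/4
print(accuracy, precision, recall)
```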
Other evaluation metrics
Precision/correctness – the number of relevant objects classified correctly divided by the total number of objects classified as relevant:
$$Precision = \frac{TP}{TP + FP}$$
Recall/sensitivity/completeness/true positive rate (TPR) – the number of relevant objects classified correctly divided by the total number of relevant objects:
$$Recall = \frac{TP}{TP + FN}$$
https://en.wikipedia.org/wiki/Precision_and_recall

Other evaluation metrics
F1 score: a measure of accuracy, which is the harmonic mean of precision and recall:
$$F_1 = 2\,\frac{precision \times recall}{precision + recall}$$
This measure gives equal importance to precision and recall, which is sometimes undesirable; so we have to decide which metric to use depending on the task and what's important for it.

Other evaluation metrics
AUC-ROC curve: the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve is one of the most important evaluation metrics for the performance of classification models. It evaluates the model at different threshold settings and can inform us about the capability of the model in distinguishing between classes.
• $TPR = \frac{TP}{TP + FN}$
• $FPR = \frac{FP}{FP + TN}$
• A good model has AUC close to 1
• A very poor model has AUC close to 0
• AUC = 0.5 means no class separation

Missing Value: An issue to consider

Missing Values
• In practice it rarely happens that the data is complete and homogeneous.
• Why data is incomplete:
o Human errors
o Sensor errors
o Software bugs
o Faulty preprocessing
o ...

Missing Values
How to handle missing values (common approaches):
o Deleting samples with missing values
o Replacing the missing value with some statistics from the data (mean, median, ...)
o Assigning a unique category
o Predicting the missing values
o Using algorithms that support missing values

Missing Values
Deleting samples with missing values:
– Pros:
o A robust and probably more accurate model
– Cons:
o Loss of information and data
o Works poorly if the percentage of missing values is high

Missing Values
Replacing the missing value with mean/median/mode:
– Pros:
o When the data size is small, it is better than deleting
o It can prevent data loss
– Cons:
o Imputing the approximations adds bias to the model (it reduces the variance of the variable)
o Works poorly compared to other methods

Missing Values
If categorical, assigning a unique category or the most frequent category:
o Works well with small datasets and is easy to implement
o No loss of data
o A unique category works only for categorical features
o Adding another feature (e.g., a new unique category) to the model may result in higher variance in the model
o Adding the most frequent category can increase the bias in the model

Missing Values
Predicting the missing values:
– Pros:
o Imputing the missing variable is an improvement as long as the bias from it is smaller than the omitted-variable bias
o Yields unbiased estimates of the model parameters
– Cons:
o Bias also arises when an incomplete conditioning set is used for a categorical variable
o Considered only as a proxy for the true values

Missing Values
Using algorithms that support missing values:
– Pros:
o Does not require creation of a predictive model
o Correlation of the data is neglected
– Cons:
o Some of these algorithms are very time-consuming, and this can be critical in data mining where large databases are being extracted

Nearest Neighbor Algorithm for Classification

Distance-based models: Nearest Neighbour
Nearest Neighbour is a regression or classification algorithm that predicts whatever is the output value of the nearest data point to some query.
To find the nearest data point, we have to find the distance between the query and other points. So we have to decide how to define the distance.

Minkowski distance
If $\mathcal{X} = \mathbb{R}^d$ and $x, y \in \mathcal{X}$, the Minkowski distance of order $p > 0$ is defined as:
$$Dis_p(x, y) = \left(\sum_{j=1}^{d} |x_j - y_j|^p\right)^{1/p} = ||x - y||_p$$
where $||z||_p = (\sum_{j=1}^{d} |z_j|^p)^{1/p}$ is the $p$-norm (sometimes denoted $L_p$ norm) of the vector $z$.

Minkowski distance
• The 2-norm refers to the familiar Euclidean distance:
$$Dis_2(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2} = \sqrt{(x - y)^T (x - y)}$$
• The 1-norm denotes Manhattan distance, also called cityblock distance:
$$Dis_1(x, y) = \sum_{j=1}^{d} |x_j - y_j|$$

Minkowski distance
• If we now let $p$ grow larger, the distance will be more and more dominated by the largest coordinate-wise distance, from which we can infer that $Dis_\infty(x, y) = \max_j |x_j - y_j|$; this is also called Chebyshev distance.
• You will sometimes see references to the 0-norm (or $L_0$ norm), which counts the number of non-zero elements in a vector. The corresponding distance then counts the number of positions in which vectors $x$ and $y$ differ. This is not strictly a Minkowski distance; however, we can define it as:
$$Dis_0(x, y) = \sum_{j=1}^{d} (x_j - y_j)^0 = \sum_{j=1}^{d} I[x_j \neq y_j]$$
under the understanding that $x^0 = 0$ for $x = 0$ and 1 otherwise.
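As a quick illustrative sketch (not from the slides), the following computes these distances for one pair of vectors, directly from the formulas above:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p > 0 between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 0.0])

print(minkowski(x, y, 1))     # Manhattan: 7.0
print(minkowski(x, y, 2))     # Euclidean: 5.0
print(np.max(np.abs(x - y)))  # Chebyshev (p -> infinity): 4.0
print(np.sum(x != y))         # "0-norm" distance: 2 positions differ
```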

Minkowski distance
Sometimes the data is not naturally in $\mathbb{R}^d$, but if we can turn it into Boolean features, or character sequences, we can still apply distance measures. For example:
• If 𝑥 and 𝑦 are binary strings, this is also called the Hamming distance. Alternatively, we can see the Hamming distance as the number of bits that need to be flipped to change 𝑥 into 𝑦.
• For non-binary strings of unequal length this can be generalised to the notion of edit distance or Levenshtein distance.
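As an illustration (not part of the slides), here is a minimal dynamic-programming sketch of the Levenshtein edit distance, with Hamming distance as the simpler equal-length special case; the example strings are arbitrary:

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (dynamic programming)."""
    m, n = len(s), len(t)
    # dist[i][j] = edit distance between s[:i] and t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(hamming("10011", "11010"))        # 2 bits need to be flipped
print(levenshtein("kitten", "sitting"))  # classic example: 3
```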

Circles and ellipses
Unit circles with different order-$p$ Minkowski distances
• Notice that for points on the coordinate axes all distances agree
• If we require a rotation-invariant distance metric, then Euclidean distance is our only choice
[Figure: lines connecting points at order-$p$ Minkowski distance 1 from the origin, for (from inside) $p = 0.8$; $p = 1$ (Manhattan distance, the rotated square in red); $p = 1.5$; $p = 2$ (Euclidean distance, the violet circle); $p = 8$; and $p = \infty$ (Chebyshev distance, the blue rectangle). Notice that for points on the coordinate axes all distances agree; for the other points, our reach increases with $p$. However, if we require a rotation-invariant distance metric, Euclidean distance is our only choice.]

Distance metric
Given an instance space $\mathcal{X}$, a distance metric is a function $Dis: \mathcal{X} \times \mathcal{X} \to [0, \infty)$ such that for any $x, y, z \in \mathcal{X}$:
– distances between a point and itself are zero: $Dis(x, x) = 0$
– all other distances are larger than zero: if $x \neq y$ then $Dis(x, y) > 0$
– distances are symmetric: $Dis(y, x) = Dis(x, y)$
– detours cannot shorten the distance (triangle inequality): $Dis(x, z) \leq Dis(x, y) + Dis(y, z)$
o It can be shown that the triangle inequality does not hold for $p < 1$. For example, with $Dis_{0.5}$ in two dimensions, taking $x = (0,0)$, $y = (1,0)$ and $z = (1,1)$ gives $Dis(x, z) = (1 + 1)^2 = 4$, yet $Dis(x, y) + Dis(y, z) = 1 + 1 = 2$.
If the second condition is weakened to a non-strict inequality – i.e., $Dis(x, y)$ may be zero even if $x \neq y$ – the function $Dis$ is called a pseudo-metric.

Means and distances
The arithmetic mean minimises squared Euclidean distance: the arithmetic mean $\mu$ of a set of data points $D$ in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that $\arg\min_y \sum_{x \in D} ||x - y||^2 = \mu$, where $||\cdot||$ denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial derivatives with respect to $y$) of the sum and setting it to the zero vector:
$$\nabla_y \sum_{x \in D} ||x - y||^2 = -2 \sum_{x \in D} (x - y) = -2 \sum_{x \in D} x + 2|D|\,y = 0$$
from which we derive $y = \frac{1}{|D|} \sum_{x \in D} x = \mu$.

Means and distances
• Notice that minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance.
• You may wonder what happens if we drop the square here: wouldn't it be more natural to take the point that minimises total Euclidean distance as exemplar?
• This point is known as the geometric median, as for univariate data it corresponds to the median or 'middle value' of a set of numbers. However, for multivariate data there is no closed-form expression for the geometric median, which needs to be calculated by successive approximation.

Means and distances
• In certain situations, it makes sense to restrict an exemplar to be one of the given data points. In that case we speak of a medoid, to distinguish it from a centroid, which is an exemplar that doesn't have to occur in the data.
• Finding a medoid requires us to calculate, for each data point, the total distance to all other data points.
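Tying the distance material back to the nearest-neighbour method introduced earlier, here is a minimal k-nearest-neighbour sketch (an illustration, not the lecture's reference implementation; the data and names are made up) using Euclidean distance and majority vote:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean (2-norm) distance."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy 2-D data in the spirit of the fish example:
# feature 1 = lightness, feature 2 = width.
X_train = np.array([[1.0, 6.0], [1.5, 5.5], [2.0, 6.5],   # class 0 (salmon, say)
                    [6.0, 2.0], [6.5, 2.5], [7.0, 1.5]])  # class 1 (sea bass, say)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2.0, 5.0]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([6.2, 2.2]), k=1))  # -> 1 (1-NN)
```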