Classification
COMP9417 Machine Learning and Data Mining Term 2, 2022
COMP9417 ML & DM Classification Term 2, 2022 1 / 52
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy. MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber. Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell. McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Classification Term 2, 2022 2 / 52
This lecture will introduce you to machine learning approaches to the problem of classification. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• outline the problem of learning to classify
• outline a framework for solving machine learning problems
• describe issues of generalisation and evaluation for classification
• outline the use of a linear model as a 2-class classifier
• outline the Perceptron classification algorithm
• outline the Logistic Regression classification algorithm
• compare Maximum Likelihood and Bayesian classification
COMP9417 ML & DM Classification Term 2, 2022 3 / 52
Introduction
Introduction
Classification (sometimes called concept learning) methods dominate machine learning . . .
. . . however, they often don’t have convenient mathematical properties like regression, so can be more complicated to analyse. The idea is to learn a classifier, which is usually a function mapping from an input data point to one of a set of discrete outputs, i.e., the classes.
We will mostly focus on their advantages and disadvantages as learning methods first, and point to unifying ideas and approaches where applicable. In this lecture we focus on classification methods that are essentially linear models . . .
and in later lectures we will see other, more expressive, classifier learning methods.
COMP9417 ML & DM Classification Term 2, 2022 4 / 52
Introduction
How machine learning helps to solve a task
[Figure: domain objects → features → data → model → output (the task); training data → learning problem → learning algorithm → model]
An overview of how machine learning is used to address a given task. A task (red box) requires an appropriate mapping – a model – from data described by features to outputs. Obtaining such a mapping from training data is what constitutes a learning problem (blue box).
COMP9417 ML & DM Classification Term 2, 2022 5 / 52
Introduction
Some terminology I
Tasks are addressed by models, whereas learning problems are solved by learning algorithms that produce models.
COMP9417 ML & DM Classification Term 2, 2022 6 / 52
Introduction
Some terminology II
Machine learning is concerned with using the right features to build the right models that achieve the right tasks.
COMP9417 ML & DM Classification Term 2, 2022 7 / 52
Introduction
Some terminology III
Models lend the machine learning field diversity, but tasks and features give it unity.
COMP9417 ML & DM Classification Term 2, 2022 8 / 52
Introduction
Some terminology IV
Does the algorithm require all training data to be present before the start of learning? If yes, then it is categorised as a batch learning algorithm.
If, however, it can continue to learn as new data arrives, it is an online learning algorithm.
COMP9417 ML & DM Classification Term 2, 2022 9 / 52
Introduction
Some terminology V
If the model has a fixed number of parameters, it is categorised as parametric.
Otherwise, if the number of parameters grows with the amount of training data, it is categorised as non-parametric.
COMP9417 ML & DM Classification Term 2, 2022 10 / 52
Introduction
Example: assassinating spam e-mail
SpamAssassin is a widely used open-source spam filter. It calculates a score for an incoming e-mail, based on a number of built-in rules or ‘tests’ in SpamAssassin’s terminology, and adds a ‘junk’ flag and a summary report to the e-mail’s headers if the score is 5 or more.
-0.1 RCVD_IN_MXRATE_WL        RBL: MXRate recommends allowing
                              [123.45.6.789 listed in sub.mxrate.net]
 0.6 HTML_IMAGE_RATIO_02      BODY: HTML has a low ratio of text to image area
 1.2 TVD_FW_GRAPHIC_NAME_MID  BODY: TVD_FW_GRAPHIC_NAME_MID
 0.0 HTML_MESSAGE             BODY: HTML included in message
 0.6 HTML_FONx_FACE_BAD       BODY: HTML font face is not a word
 1.4 SARE_GIF_ATTACH          FULL: Email has a inline gif
 0.1 BOUNCE_MESSAGE           MTA bounce message
 0.1 ANY_BOUNCE_MESSAGE       Message is some kind of bounce message
 1.4 AWL                      AWL: From: address is in the auto white-list
From left to right you see the score attached to a particular test, the test identifier, and a short description including a reference to the relevant part of the e-mail. As you see, scores for individual tests can be negative (indicating evidence suggesting the e-mail is ham rather than spam) as well as positive. The overall score of 5.3 suggests the e-mail might be spam.
COMP9417 ML & DM Classification Term 2, 2022 11 / 52
Introduction
Example: simple linear classification
Linear classification
Suppose we have only two tests and four training e-mails, one of which is spam. Both tests succeed for the spam e-mail; for one ham e-mail neither test succeeds, for another the first test succeeds and the second doesn’t, and for the third ham e-mail the first test fails and the second succeeds.
It is easy to see that assigning both tests a weight of 4 correctly ‘classifies’ these four e-mails into spam and ham. In the mathematical notation introduced above we could describe this classifier as 4x1 + 4x2 > 5 or (4, 4) · (x1, x2) > 5.
In fact, any weight between 2.5 and 5 will ensure that the threshold of 5 is only exceeded when both tests succeed. We could even consider assigning different weights to the tests – as long as each weight is less than 5 and their sum exceeds 5 – although it is hard to see how this could be justified by the training data.
COMP9417 ML & DM Classification Term 2, 2022 12 / 52
Introduction
Example: simple linear classification
Spam filtering as a classification task
The columns marked x1 and x2 indicate the results of two tests on four different e-mails. The fourth column indicates which of the e-mails are spam. The right-most column demonstrates that by thresholding the function 4x1 + 4x2 at 5, we can separate spam from ham.

E-mail   x1   x2   Spam?   4x1 + 4x2
  1       1    1     1         8
  2       0    0     0         0
  3       1    0     0         4
  4       0    1     0         4
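A minimal Python sketch of this check (the arrays simply restate the table; variable names are illustrative):

import numpy as np

X = np.array([[1, 1],    # e-mail 1: both tests succeed (spam)
              [0, 0],    # e-mail 2: neither test succeeds
              [1, 0],    # e-mail 3: only the first test succeeds
              [0, 1]])   # e-mail 4: only the second test succeeds
y = np.array([1, 0, 0, 0])          # 1 = spam, 0 = ham

w, t = np.array([4, 4]), 5          # weights and decision threshold
scores = X @ w                      # 4*x1 + 4*x2 for each e-mail
y_hat = (scores > t).astype(int)    # classify as spam iff the score exceeds 5

print(scores)   # [8 0 4 4]
print(y_hat)    # [1 0 0 0] -- agrees with the Spam? column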
COMP9417 ML & DM Classification Term 2, 2022 13 / 52
Introduction
Example: simple linear classification
Linear classification in two dimensions
• straight line separates positives from negatives (linear “discriminant”)
• defined by w · xi = t
• w is perpendicular to decision boundary
• w points in direction of positives
• t is the decision threshold
COMP9417 ML & DM Classification Term 2, 2022 14 / 52
Introduction
Example: simple linear classification
Linear classification in two dimensions
Note: xi points to a point on the decision boundary. In particular, x0 points in the same direction as w,
from which it follows that w · x0 = ||w|| ||x0|| = t (where ||x|| denotes the length of the vector x).
COMP9417 ML & DM Classification Term 2, 2022 15 / 52
Introduction
Example: simple linear classification
Homogeneous coordinates
It is sometimes convenient to simplify notation further by introducing an extra constant ‘variable’ x0 = 1, the weight of which is fixed to w0 = −t.
The extended data point is then x = (1, x1, . . . , xn) and the extended weight vector is w = (−t, w1, . . . , wn), leading to the decision rule w · x > 0 and the decision boundary w · x = 0.
Thanks to these so-called homogeneous coordinates the decision boundary passes through the origin of the extended coordinate system, at the expense of needing an additional dimension.
Note: this doesn’t really affect the data, as all data points and the ‘real’ decision boundary live in the plane x0 = 1.
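A small sketch showing that the homogeneous form gives the same decisions (the weights and data point are illustrative assumptions):

import numpy as np

w, t = np.array([4.0, 4.0]), 5.0
x = np.array([1.0, 1.0])                          # some data point

decision = np.dot(w, x) > t                       # original rule: w . x > t

x_ext = np.concatenate(([1.0], x))                # prepend the constant x0 = 1
w_ext = np.concatenate(([-t], w))                 # prepend the weight w0 = -t
decision_homogeneous = np.dot(w_ext, x_ext) > 0   # boundary now passes through the origin

assert decision == decision_homogeneous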
COMP9417 ML & DM Classification Term 2, 2022 16 / 52
Introduction
Example: simple linear classification
Example: a simple linear classifier I
This classifier constructs a decision boundary by half-way intersecting the line between the positive and negative centres of mass.
COMP9417 ML & DM Classification Term 2, 2022 17 / 52
[Figure: basic linear classifier, with positive centre of mass p, negative centre of mass n, their midpoint (p + n)/2 on the decision boundary, and w = p − n]
Introduction
Example: simple linear classification
Example: a simple linear classifier II
The linear classifier is described by the equation w · x = t, with w = p − n; the decision threshold can be found by noting that (p + n)/2 is on the decision boundary, and hence t = (p − n) · (p + n)/2 = (||p||^2 − ||n||^2)/2, where ||x|| denotes the length of vector x.
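A sketch of this basic linear classifier on synthetic data (the Gaussian clusters are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))   # positive examples
neg = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))   # negative examples

p, n = pos.mean(axis=0), neg.mean(axis=0)     # centres of mass
w = p - n                                     # normal of the decision boundary
t = (np.dot(p, p) - np.dot(n, n)) / 2         # t = (||p||^2 - ||n||^2) / 2

predict = lambda x: 1 if np.dot(w, x) > t else -1
print(predict(p), predict(n))                 # 1 -1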
COMP9417 ML & DM Classification Term 2, 2022 18 / 52
Introduction
Example: simple linear classification
Machine learning for spam filtering
[Figure: e-mails → SpamAssassin tests → data → linear classifier → spam?; training data → learn weights → linear classifier]
At the top we see how SpamAssassin approaches the spam e-mail classification task: the text of each e-mail is converted into a data point by means of SpamAssassin’s built-in tests, and a linear classifier is applied to obtain a ‘spam or ham’ decision. At the bottom (in blue) we see the bit that is done by machine learning.
COMP9417 ML & DM Classification Term 2, 2022 19 / 52
Introduction
Evaluating performance on a task
Evaluating classification – contingency table I
For the two-class prediction case:
                               Actual Class
                               Positive                Negative
Predicted   Positive           True Positive (TP)      False Positive (FP)
Class       Negative           False Negative (FN)     True Negative (TN)
COMP9417 ML & DM Classification Term 2, 2022 20 / 52
Introduction
Evaluating performance on a task
Evaluating classification – contingency table II
Classification Accuracy on a sample of labelled pairs (x, c(x)), given a learned classification model that predicts, for each instance x, a class value ĉ(x):

acc = (1 / |Test|) Σ_{x ∈ Test} I[ĉ(x) = c(x)]
where Test is a test set and I[] is the indicator function which is 1 iff its argument evaluates to true, and 0 otherwise.
Classification Error is 1-acc.
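In code, accuracy and error on a test set are one line each (a sketch with hypothetical labels and predictions):

import numpy as np

c_true = np.array([1, 0, 1, 1, 0, 1])     # c(x): true classes on the test set
c_hat  = np.array([1, 0, 0, 1, 0, 1])     # c^(x): model predictions

acc = np.mean(c_hat == c_true)            # average of the indicator I[c^(x) = c(x)]
err = 1 - acc
print(acc, err)                           # 0.8333... 0.1666...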
COMP9417 ML & DM Classification Term 2, 2022 21 / 52
Introduction
Evaluating performance on a task
Cross-validation I
There are certain parameters that need to be estimated during learning. To set them we use the data, but NOT the training set OR the test set. Instead, we use a separate validation or development set.
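A minimal k-fold cross-validation loop for such tuning (a sketch; fit and score are placeholders for whatever model and metric are being used):

import numpy as np

def k_fold_indices(n_examples, k, seed=0):
    # Shuffle the example indices and split them into k roughly equal folds.
    idx = np.random.default_rng(seed).permutation(n_examples)
    return np.array_split(idx, k)

def cross_validate(X, y, fit, score, k=5):
    # Train on k-1 folds, evaluate on the held-out fold, and average the scores.
    folds = k_fold_indices(len(y), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[val_idx], y[val_idx]))
    return np.mean(scores)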
COMP9417 ML & DM Classification Term 2, 2022 22 / 52
Introduction
Evaluating performance on a task
Cross-validation II
COMP9417 ML & DM Classification Term 2, 2022 23 / 52
Introduction
Evaluating performance on a task
Cross-validation III
COMP9417 ML & DM Classification Term 2, 2022 24 / 52
Introduction
Evaluating performance on a task
Cross-validation IV
COMP9417 ML & DM Classification Term 2, 2022 25 / 52
Perceptrons
What are perceptrons ?
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
COMP9417 ML & DM Classification Term 2, 2022 26 / 52
Perceptrons
What are perceptrons ?
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
COMP9417 ML & DM Classification Term 2, 2022 27 / 52
Perceptrons
What are perceptrons ?
Perceptron
Output o is the thresholded sum of products of inputs and their weights:

o(x1, . . . , xn) = +1 if w0 + w1 x1 + · · · + wn xn > 0
                    −1 otherwise.
COMP9417 ML & DM Classification Term 2, 2022 28 / 52
Perceptrons
What are perceptrons ?
Perceptron
Or in vector notation:
o(x) = +1 if w · x > 0
       −1 otherwise.
COMP9417 ML & DM Classification Term 2, 2022 29 / 52
Perceptrons are linear classifiers
Decision Surface of a Perceptron
Represents some useful functions
• What weights represent o(x1, x2) = AND(x1, x2)? (see the sketch below)
• What weights represent o(x1, x2) = XOR(x1, x2)?
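A sketch answering the first question (the weights below are one illustrative choice, not the only one):

import numpy as np

def perceptron_output(w, x):
    # o(x) = +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (x given without the constant 1).
    return 1 if np.dot(w, np.concatenate(([1.0], x))) > 0 else -1

w_and = np.array([-1.5, 1.0, 1.0])   # w0 = -1.5, w1 = w2 = 1 realises AND on {0,1} inputs
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(w_and, np.array(x, dtype=float)))
# Only (1, 1) maps to +1. No weight vector reproduces XOR, because XOR's
# positive and negative points are not linearly separable.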
COMP9417 ML & DM Classification Term 2, 2022 30 / 52
Perceptrons are linear classifiers
Decision Surface of a Perceptron
So some functions not representable
• e.g., not linearly separable
• a labelled data set is linearly separable if there is a linear decision boundary that separates the classes
• for non-linearly separable data we’ll need something else
• e.g., networks of these . . .
• the start of “deep” networks …
COMP9417 ML & DM Classification Term 2, 2022 31 / 52
Perceptrons
How to train
Perceptron learning
Online learner (“prequential”).
Learning is “finding a good set of weights”. Data: (x, y); given x, predict ŷ, then check it against y. Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
Can prove that perceptron learning will converge:
• if training data is linearly separable
• and η sufficiently small
COMP9417 ML & DM Classification Term 2, 2022 32 / 52
Perceptrons
How to train
Perceptron learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example, then we have yi = +1 and w · xi < t. We therefore want to find w′ such that w′ · xi > w · xi, which moves the decision boundary towards and hopefully past xi.
• This can be achieved by calculating the new weight vector as w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
• Similarly, if xj is a misclassified negative example, then we have yj = −1 and w · xj > t. In this case we calculate the new weight vector as w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.
COMP9417 ML & DM Classification Term 2, 2022 33 / 52
Perceptrons
How to train
Perceptron learning
• The two cases can be combined in a single update rule: w′ = w + ηyixi
• Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear classification
• The algorithm just iterates over the training examples applying the weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
COMP9417 ML & DM Classification Term 2, 2022 34 / 52
Perceptrons
How to train
Perceptron training algorithm
Algorithm Perceptron(D, η) // perceptron training for linear classification
Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
1: w ← 0 // Other initialisations of the weight vector are possible
2: converged ← false
3: while converged = false do
4:   converged ← true
5:   for i = 1 to |D| do
6:     if yi w · xi ≤ 0 then // i.e., ŷi ≠ yi
7:       w ← w + ηyixi
8:       converged ← false // We changed w so haven’t converged yet
[Handwritten annotations: sketch of error rate against number of examples; note “make η depend on epoch”.]
COMP9417 ML & DM Classification Term 2, 2022 35 / 52
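A Python transcription of this pseudocode (a sketch; it assumes X is already in homogeneous coordinates and y uses labels in {−1, +1}):

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    # Perceptron training for linear classification; returns w for sign(w . x).
    w = np.zeros(X.shape[1])              # other initialisations are possible
    for _ in range(max_epochs):           # cap epochs in case the data is not separable
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # i.e., predicted label differs from yi
                w = w + eta * yi * xi     # the perceptron update rule
                converged = False         # we changed w, so haven't converged yet
        if converged:
            break
    return w

# Example: the four e-mails from the spam table, in homogeneous coordinates.
X = np.array([[1, 1, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1]], dtype=float)
y = np.array([1, -1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))    # [ 1. -1. -1. -1.]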
Perceptrons
How to train
Perceptron training – varying learning rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update. (middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
The starting point in all three cases was the simple linear classifier.
COMP9417 ML & DM Classification Term 2, 2022 36 / 52
Logistic Regression
Linear Regression for Classification
Linear Regression for classification
Question: can we use Linear Regression for classification ?
Answer: not really, unless we use an approach like the following, e.g.,
train a separate linear regression model for each class
• set y = 1 if example in class
• set y = 0 otherwise
(e.g., classes c ∈ {r, g, b} give 3 models: ŷr, ŷg, ŷb)
Prediction: for each example
• run example through all regression models
• predict the class with the largest output value for y
Called multi-response linear regression.1
Note: does not obey linear regression assumptions, but may work in practice.
1See: Witten et al. (2017).
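A sketch of multi-response linear regression using plain least squares (function names and data handling are illustrative):

import numpy as np

def fit_multiresponse(X, y, classes):
    # One least-squares linear model per class, trained on 0/1 targets.
    X1 = np.hstack([np.ones((len(X), 1)), X])          # add an intercept column
    W = []
    for c in classes:
        target = (y == c).astype(float)                # y = 1 if example in class c, else 0
        w, *_ = np.linalg.lstsq(X1, target, rcond=None)
        W.append(w)
    return np.array(W)                                 # one weight vector per class

def predict_multiresponse(W, X, classes):
    # Run each example through all models and predict the class with the largest output.
    X1 = np.hstack([np.ones((len(X), 1)), X])
    return np.array(classes)[np.argmax(X1 @ W.T, axis=1)]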
COMP9417 ML & DM Classification Term 2, 2022 37 / 52
Logistic Regression
Logistic Regression for Classification
Logistic regression
COMP9417 ML & DM Classification Term 2, 2022 38 / 52
Maximum Likelihood Principle
Likelihood: L(θ; X) ≡ p(X; θ)

θ̂ = argmax_θ ∏_i p(x_i; θ)

Assumes: the x_i (data) are independent. Often we work with the log-likelihood, log L(θ; X).

ML estimation of parameter θ of a Bernoulli distribution (coin flip).
Let X be a Bernoulli r.v., X ∈ {0, 1}, with p(x; θ) = θ^x (1 − θ)^(1−x).
The model is a probabilistic/random “coin-flipping” mechanism that generates data.

L(θ; x_1, . . . , x_n) = ∏_i θ^{x_i} (1 − θ)^{1−x_i} = θ^h (1 − θ)^{n−h}, where h = Σ_i x_i and 0 ≤ θ ≤ 1.

θ̂ = argmax_θ θ^h (1 − θ)^{n−h}

Take the derivative and set it to zero:

d/dθ [θ^h (1 − θ)^{n−h}] = h θ^{h−1} (1 − θ)^{n−h} − θ^h (n − h)(1 − θ)^{n−h−1} = 0

⇒ θ̂ = h / n, the proportion of x = 1, i.e., the number of heads in a sequence of n coin flips divided by n.
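A quick numerical check of this result (a sketch with a made-up coin-flip sequence): the log-likelihood h log θ + (n − h) log(1 − θ) peaks at θ = h/n.

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])       # observed coin flips
n, h = len(x), x.sum()                       # n flips, h heads

thetas = np.linspace(0.001, 0.999, 999)      # grid of candidate parameter values
log_lik = h * np.log(thetas) + (n - h) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])            # ~0.625, the grid point at the maximum
print(h / n)                                 # 0.625, the closed-form MLE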
Logistic Regression
Logistic Regression for Classification
Logistic regression Conditional Likelihood
Training data: {(x1, y1), (x2, y2), . . . , (xn, yn)}, with y ∈ {0, 1}.

In the case of a two-class classification problem, if we model the probability P(Y = 1) of an instance x being a positive example like this:

P(Y = 1|x) = 1 / (1 + e^(−w^T x))

then this probability vs. the alternative (1 − P(Y = 1)) can be written like this:

ln [ P(Y = 1|x) / (1 − P(Y = 1|x)) ] = w^T x
The quantity on the l.h.s. is called the logit and we are defining a linear model for the logit.
(Note: P ∈ [0, 1]; the odds P/(1 − P) lie in (0, ∞), so the log odds range over (−∞, +∞).)
COMP9417 ML & DM Classification Term 2, 2022 39 / 52
Logistic Regression
Logistic Regression for Classification
Logistic regression
Unlike linear regression, no analytical maximum likelihood (ML) solution to find weights w.
An iterative gradient ascent method can be used to maximize log likelihood.
The (conditional) log likelihood is:
Σ_{i=1}^{N} [ y^(i) log P(1|x^(i)) + (1 − y^(i)) log(1 − P(1|x^(i))) ]
Over a set of N examples, choose w to maximise this expression, noting that y(i) is always either 0 or 1.
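A gradient-ascent sketch for this maximisation (the step size and iteration count are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    # Gradient ascent on the conditional log-likelihood; y holds 0/1 labels.
    X1 = np.hstack([np.ones((len(X), 1)), X])     # intercept term
    w = np.zeros(X1.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X1 @ w)                       # P(Y = 1 | x) for every example
        grad = X1.T @ (y - p)                     # gradient of the log-likelihood in w
        w += eta * grad / len(y)                  # move uphill (maximise), averaged over N
    return w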
Generalises to multiple class versions (Y can have more than two values).
COMP9417 ML & DM Classification Term 2, 2022 40 / 52
Bayes Theorem and Maximum Likelihood
Bayes Theorem
Bayes Theorem
Bayes Theorem for estimating the probability of model m (i.e., its parameters θ) from data D:

P(m|D) = P(D|m) P(m) / P(D)

where P(m|D) is the posterior probability of m, P(D|m) is the likelihood and P(m) is the prior over a finite set of models.
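As a small illustration (made-up numbers), the posterior over a finite set of models is the likelihood times the prior, renormalised by P(D):

import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # P(m) for three candidate models
likelihood = np.array([0.01, 0.10, 0.05])  # P(D | m) that each model assigns to the data D

unnormalised = likelihood * prior                 # P(D | m) P(m)
posterior = unnormalised / unnormalised.sum()     # divide by P(D) = sum_m P(D | m) P(m)
print(posterior)                                  # posterior probabilities P(m | D)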