Kernel Methods
COMP9417 Machine Learning and Data Mining Term 2, 2022
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. Murphy. MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber. Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell. McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani Goa Campus, India (2016)
This lecture will develop your understanding of kernel methods in machine learning. Following it you should be able to:
• describe learning with the dual perceptron
• outline the idea of learning in a dual space
• describe the concept of maximising the margin in linear classification
• outline the typical loss function for maximising the margin
• describe the method of support vector machines (SVMs)
• describe the concept of kernel functions
• outline the idea of using a kernel in a learning algorithm
• outline non-linear classification with kernel methods
Introduction
Classification and loss functions
Predictive machine learning scenarios
• Regression: label space Y = ℝ, output space Y = ℝ; learning problem: learn an approximation f̂ : X → ℝ to the true labelling function f
• Classification: label space C, output space C; learning problem: learn an approximation ĉ : X → C to the true labelling function c
• Scoring and ranking: label space C, output space ℝ^|C|; learning problem: learn a model that outputs a score vector over classes
• Probability estimation: label space C, output space Y = [0, 1]^|C|; learning problem: learn a model that outputs a probability vector over classes
Classification
A classifier is a mapping ĉ : X → C, where C = {C1, C2, . . . , Ck} is a finite and usually small set of class labels. We will sometimes also use Ci to indicate the set of examples of that class.
We use the ‘hat’ to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance (sometimes contaminated by noise).
Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space X).
A decision tree
[Figure: a decision tree splitting on ‘Viagra’ and then ‘lottery’, with leaf class counts spam:20/ham:5, spam:10/ham:5 and spam:20/ham:40; the corresponding majority-class predictions are ĉ(x) = spam, ĉ(x) = spam and ĉ(x) = ham.]
(left) A tree with the training set class distribution in the leaves. (right) A tree with the majority class prediction rule in the leaves.
Scoring classifier
A scoring classifier is a mapping ŝ : X → ℝ^k, i.e., a mapping from the instance space to a k-vector of real numbers.
The boldface notation indicates that a scoring classifier outputs a vector ŝ(x) = (ŝ1(x), . . . , ŝk(x)) rather than a single number; ŝi(x) is the score assigned to class Ci for instance x.
This score indicates how likely it is that class label Ci applies.
If we only have two classes, it usually suffices to consider the score for only one of the classes; in that case, we use ŝ(x) to denote the score of the positive class for instance x.
A scoring tree
[Figure: the same tree, with leaf scores ŝ(x) = +2, ŝ(x) = +1 and ŝ(x) = −1 in place of the class counts.]
(left) A tree with the training set class distribution in the leaves. (right) A tree using the logarithm of the class ratio as scores; spam is taken as the positive class.
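The scores in the right-hand tree can be reproduced directly from the leaf class counts. A minimal sketch (assuming base-2 logarithms of the spam:ham ratio, with spam as the positive class):

import math

# leaf class counts (spam, ham) read off the tree above
leaves = {"viagra=1": (20, 5), "viagra=0, lottery=1": (10, 5), "viagra=0, lottery=0": (20, 40)}
for leaf, (spam, ham) in leaves.items():
    score = math.log2(spam / ham)   # logarithm of the class ratio
    print(leaf, score)              # +2.0, +1.0, -1.0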
Margins and loss functions
If we take the true class c(x) as +1 for positive examples and −1 for negative examples, then the quantity z(x) = c(x)ŝ(x) is positive for correct predictions and negative for incorrect predictions: this quantity is called the margin assigned by the scoring classifier to the example.
We would like to reward large positive margins, and penalise large negative values. This is achieved by means of a so-called loss function L : ℝ → [0, ∞) which maps each example’s margin z(x) to an associated loss L(z(x)).
We will assume that L(0) = 1, which is the loss incurred by having an example on the decision boundary. We furthermore have L(z) ≥ 1 for z < 0, and usually also 0 ≤ L(z) < 1 for z > 0.
The average loss over a test set Te is (1/|Te|) Σ_{x∈Te} L(z(x)).
Loss functions
From bottom-left in the figure:
(i) 0–1 loss L01(z) = 1 if z ≤ 0, and L01(z) = 0 if z > 0;
(ii) hinge loss Lh(z) = (1 − z) if z ≤ 1, and Lh(z) = 0 if z > 1;
(iii) logistic loss Llog(z) = log2(1 + exp(−z));
(iv) exponential loss Lexp(z) = exp(−z);
(v) squared loss Lsq(z) = (1 − z)² (can be set to 0 for z > 1, just like hinge loss).
Notes from the annotated examples:
• 0–1 loss ignores the magnitude of the margin: margins z = +1 × 1.2 (correct) and z = +1 × (−0.7) (incorrect) incur losses of 0 and 1 respectively.
• Hinge loss does not: Lh(+1 × 2.4) = 0 (correct), while Lh(+1 × (−1.7)) = 1 − (−1.7) = 2.7 and Lh(+1 × (−0.7)) = 1 − (−0.7) = 1.7 (incorrect).
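These losses are straightforward to compute from the margin z(x) = c(x)ŝ(x). A minimal sketch (the margin values below are the illustrative ones from the annotations, not a real data set):

import math

def loss_01(z):    return 1.0 if z <= 0 else 0.0         # 0-1 loss: ignores margin magnitude
def loss_hinge(z): return max(0.0, 1.0 - z)              # hinge loss
def loss_log(z):   return math.log2(1.0 + math.exp(-z))  # logistic loss
def loss_exp(z):   return math.exp(-z)                   # exponential loss
def loss_sq(z):    return (1.0 - z) ** 2                 # squared loss

for z in [1.2, -0.7, 2.4, -1.7]:                         # margins: positive = correct prediction
    print(z, loss_01(z), loss_hinge(z), round(loss_log(z), 3))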
Review: linear models for classification
Review: Linear classification
• Example: a two-class classifier “separates” instances in feature space: f(x) = sign(wᵀx + b)
Issues in linear classification
• A linear model defines a hyperplane in feature space
• This decision boundary can be used for classification
Issues in linear classification
• Many possible linear decision boundaries: which one to choose ?
Issues in linear classification
Is there an optimal linear classification learning method ?
• one approach is to define the empirical risk, or error on the data
• learn the maximum margin separating hyperplane
• unique solution
• minimises empirical risk
• is there a way to trade-off minimising risk with model complexity?
• answer: yes, under Vapnik’s framework for statistical learning
• structural risk minimization
Issues in linear classification
• Recall: may not be possible to learn a linear separating hyperplane
• filled / empty circles are in / out of the target concept
• AND is linearly separable – but not XOR
Extending linear classification
• Linear classifiers can’t model nonlinear class boundaries
• But there is a simple trick to allow them to do that
• Nonlinear mapping: map features into new higher-dimensional space consisting of combinations of attribute values
• For example: all products with n factors that can be constructed from the features (feature construction)
• e.g., for 2 features, all products with n = 3 factors
y = w1x1³ + w2x1²x2 + w3x1x2² + w4x2³
• y is the predicted output for instances with two features x1 and x2
• A function in the new higher-dimensional feature space
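As a concrete illustration of this feature construction (a sketch; the weights are made up for the example):

import numpy as np

def cubic_features(x1, x2):
    # all products with n = 3 factors of the two original features
    return np.array([x1**3, x1**2 * x2, x1 * x2**2, x2**3])

w = np.array([0.5, -1.0, 2.0, 0.25])        # illustrative weights w1..w4
x1, x2 = 1.5, -2.0
y = w @ cubic_features(x1, x2)              # linear in the new space, nonlinear in x1, x2
print(y)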
Two main problems
• Efficiency:
• With 10 features and n = 5 have to learn more than 2000 coefficients
• Fitting, e.g., a linear regression model: runtime can be cubic in the number of features
• Overfitting:
• “Too nonlinear” – number of coefficients large relative to number of training instances
• Curse of dimensionality applies . . .
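One way to arrive at the “2000 coefficients” figure: count the monomials with exactly n = 5 factors drawn (with repetition) from 10 features, i.e. C(10 + 5 − 1, 5). A quick check:

from math import comb

d, n = 10, 5
print(comb(d + n - 1, n))    # 2002 coefficients to learn, i.e. more than 2000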
Revisiting training of linear classifiers
Linear classifiers in dual form
An observation about the Perceptron training rule:
Every time an example xi is misclassified, add yixi to the weight vector. This leads to a different perspective on training linear classifiers.
• After training has completed, each example has been misclassified zero or more times. Denoting this number as αi for example xi, the weight vector can be expressed as
w = Σ_{i=1}^n αi yi xi
• In the dual, instance-based view of linear classification we are learning instance weights αi rather than feature weights wj. An instance x is classified as
ŷ = sign( Σ_{i=1}^n αi yi xi · x )
• During training, the only information needed about the training data is all pairwise dot products: the n-by-n matrix G = XXᵀ containing these dot products is called the Gram matrix.
Perceptron training from initialisation
Both the primal Perceptron (from the “Classification” lecture) and the Dual Perceptron start with their parameters initialised to zero.
• Perceptron (primal): for the first example (x1, y1), the prediction check (line 6) gives y1 w · x1 = 0, a mistake; the update (line 7) makes the weights w non-zero, and training continues.
• Perceptron (dual): for the first example (x1, y1), the check (line 6) gives y1 Σ_j αj yj x1 · xj = 0, a mistake; the update (line 7) makes α1 non-zero, and training continues.
So for both versions of the training procedure the parameters (either w or α1, . . . , αn) are initialised to zero, both misclassify the first example, and both are then updated to be non-zero.
Perceptron training in dual form
Algorithm DualPerceptron(D) // perceptron training in dual form
Input: labelled training data D in homogeneous coordinates
Output: coefficients αi defining weight vector w = Σ_{i=1}^{|D|} αi yi xi
1: αi ← 0 for 1 ≤ i ≤ |D|
2: converged ← false
3: while converged = false do
4:   converged ← true
5:   for i = 1 to |D| do
6:     if yi Σ_{j=1}^{|D|} αj yj xi · xj ≤ 0 then
7:       αi ← αi + 1
8:       converged ← false
9:     end
10:  end
11: end
Note (from the annotations): the primal perceptron is formulated in terms of the model parameters w, whereas the dual perceptron is formulated in terms of per-instance parameters αi; the two formulations are equivalent.
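A minimal Python sketch of this training loop (the toy data set and names are illustrative, not from the slides):

import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    # X: n-by-d array in homogeneous coordinates, y: labels in {-1, +1}
    n = len(y)
    G = X @ X.T                                   # Gram matrix of pairwise dot products
    alpha = np.zeros(n)                           # instance weights = mistake counts
    for _ in range(max_epochs):
        converged = True
        for i in range(n):
            if y[i] * np.sum(alpha * y * G[i]) <= 0:   # line 6: misclassified
                alpha[i] += 1                          # line 7
                converged = False
        if converged:
            break
    w = (alpha * y) @ X                           # equivalent primal weight vector
    return alpha, w

X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = dual_perceptron(X, y)
print(alpha, w, np.sign(X @ w))                   # predictions match y after convergence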
Support Vector Machines
Support vector machine
The decision boundary learned by a support vector machine maximises the margin, which is indicated by the dotted lines. The circled data points are the support vectors.
Support vector machines
• Support vector machines (machine ≡ algorithm) learn linear classifiers
• Can avoid overfitting – learn a form of decision boundary called the maximum margin hyperplane
• Fast for mappings to nonlinear spaces
• employ a mathematical trick to avoid the actual creation of new “pseudo-attributes” in the transformed instance space
• i.e. the nonlinear space is created implicitly
Training a support vector machine
Training a support vector machine
• learning problem: fit maximum margin hyperplane, i.e. a kind of linear model
• for a linearly separable two-class data set the maximum margin hyperplane is the classification surface which
• correctly classifies all examples in the data set
• has the greatest separation between classes
• “convex hull” of instances in each class is tightest enclosing convex polygon
• for a linearly separable two-class data set convex hulls do not overlap
• maximum margin hyperplane is orthogonal to the shortest line connecting the convex hulls, and intersects it halfway
• the more “separated” the classes, the larger the margin, the better the generalization
Support vectors
• Instances closest to maximum margin hyperplane are support vectors
• Important observation: support vectors define maximum margin hyperplane!
• All other instances can be deleted without changing position and orientation of the hyperplane!
Finding support vectors
• Determining parameters is a constrained quadratic optimization problem
• standard algorithms, or
• special-purpose algorithms are faster, e.g. Platt’s sequential minimal optimization (SMO), or LibSVM
• Note: all this assumes separable data!
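In practice the optimiser is rarely implemented by hand; for example, scikit-learn’s SVC wraps LibSVM. A sketch (assuming scikit-learn is installed; a very large C approximates the hard-margin case):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [-1.0, 2.0], [-1.0, -2.0]])   # the three points used in the worked example later
y = np.array([-1, -1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)          # training examples defining the boundary
print(clf.coef_, clf.intercept_)     # weight vector and intercept (sign conventions may differ from the slides)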
Support vector machine
[Figure: decision boundary w·x = t with parallel margin planes w·x = t + m and w·x = t − m, each at distance m/||w|| from the boundary.]
The geometry of a support vector classifier. The circled data points are the support vectors, which are the training examples nearest to the decision boundary. The support vector machine finds the decision boundary that maximises the margin m/||w||.
Maximising the margin
Since we are free to rescale t, ||w|| and m, it is customary to choose m = 1. Maximising the margin then corresponds to minimising ||w|| or, more conveniently, (1/2)||w||², provided of course that none of the training points fall inside the margin.
This leads to a quadratic, constrained optimisation problem:
w*, t* = argmin_{w,t} (1/2)||w||²   subject to   yi(w·xi − t) ≥ 1, 1 ≤ i ≤ n
where the labels are yi ∈ {+1, −1}.
Using the method of Lagrange multipliers, the dual form of this constrained optimisation problem can be derived.
Deriving the dual problem
Adding the constraints with multipliers αi for each training example gives the Lagrange function
Λ(w, t, α1, . . . , αn) = (1/2)||w||² − Σ_{i=1}^n αi (yi(w·xi − t) − 1)
= (1/2)||w||² − Σ_{i=1}^n αi yi (w·xi) + Σ_{i=1}^n αi yi t + Σ_{i=1}^n αi
= (1/2) w·w − w·(Σ_{i=1}^n αi yi xi) + t (Σ_{i=1}^n αi yi) + Σ_{i=1}^n αi
• By taking the partial derivative of the Lagrange function with respect to t and setting it to 0 we find Σ_{i=1}^n αi yi = 0.
• Similarly, by taking the partial derivative of the Lagrange function with respect to w and setting it to 0 we obtain w = Σ_{i=1}^n αi yi xi – the same expression as we derived for the perceptron.
Deriving the dual problem
• For the perceptron, the instance weights αi are non-negative integers denoting the number of times an example has been misclassified in training. For a support vector machine, the αi are non-negative reals.
• What they have in common is that, if αi = 0 for a particular example xi, that example could be removed from the training set without affecting the learned decision boundary. In the case of support vector machines this means that αi > 0 only for the support vectors: the training examples nearest to the decision boundary.
These expressions allow us to eliminate w and t and lead to the dual Lagrangian
Λ(α1, . . . , αn) = −(1/2) (Σ_{i=1}^n αi yi xi) · (Σ_{j=1}^n αj yj xj) + Σ_{i=1}^n αi
= −(1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi·xj + Σ_{i=1}^n αi
SVM in dual form
The dual optimisation problem for support vector machines is to maximise the dual Lagrangian under positivity constraints and one equality constraint:
α1*, . . . , αn* = argmax_{α1,...,αn} −(1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi·xj + Σ_{i=1}^n αi
subject to αi ≥ 0, 1 ≤ i ≤ n, and Σ_{i=1}^n αi yi = 0
This can be solved with a standard quadratic optimisation algorithm; the next slides work through a small example by hand.
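For a small data set this dual can also be handed to a general-purpose solver. A sketch using scipy (we minimise the negated objective; names and data are illustrative):

import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    n = len(y)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                            # Q_ij = y_i y_j x_i . x_j
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()          # negated dual Lagrangian
    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
    bounds = [(0, None)] * n                                 # alpha_i >= 0
    return minimize(objective, np.zeros(n), bounds=bounds, constraints=constraints).x

X = np.array([[1.0, 2.0], [-1.0, 2.0], [-1.0, -2.0]])
y = np.array([-1.0, -1.0, 1.0])
print(np.round(svm_dual(X, y), 3))           # approximately (0, 0.125, 0.125) for this data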
A maximum-margin classifier
A maximum-margin classifier built from three examples, with w = (0, −1/2) and margin 2.
The circled examples are the support vectors: they receive non-zero Lagrange multipliers and define the decision boundary.
A maximum-margin classifier
X = |  1   2 |    y = | −1 |    X′ = | −1  −2 |
    | −1   2 |        | −1 |         |  1  −2 |
    | −1  −2 |        | +1 |         | −1  −2 |
The matrix X′ on the right incorporates the class labels; i.e., the rows are yi xi. The Gram matrix is (without and with class labels):
XXᵀ = |  5   3  −5 |        X′X′ᵀ = | 5  3  5 |
      |  3   5  −3 |                | 3  5  3 |
      | −5  −3   5 |                | 5  3  5 |
The dual optimisation problem is thus
argmax_{α1,α2,α3} −(1/2)(5α1² + 3α1α2 + 5α1α3 + 3α2α1 + 5α2² + 3α2α3 + 5α3α1 + 3α3α2 + 5α3²) + α1 + α2 + α3
= argmax_{α1,α2,α3} −(1/2)(5α1² + 6α1α2 + 10α1α3 + 5α2² + 6α2α3 + 5α3²) + α1 + α2 + α3
subject to α1 ≥ 0, α2 ≥ 0, α3 ≥ 0 and −α1 − α2 + α3 = 0.
A maximum-margin classifier
• Using the equality constraint we can eliminate one of the variables, say α3, and simplify the objective function to
argmax_{α1,α2} −(1/2)(20α1² + 32α1α2 + 16α2²) + 2α1 + 2α2
• Setting partial derivatives to 0 we obtain −20α1 − 16α2 + 2 = 0 and −16α1 − 16α2 + 2 = 0 (notice that, because the objective function is quadratic, these equations are guaranteed to be linear).
• We therefore obtain the solution α1 = 0 and α2 = α3 = 1/8. We then have w = 1/8 (x3 − x2) = (0, −1/2), resulting in a margin of 1/||w|| = 2.
• Finally, t can be obtained from any support vector, say x2, since y2(w·x2 − t) = 1; this gives −1·(−1 − t) = 1, hence t = 0.
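The hand-derived solution can be verified numerically (a quick sketch reusing the three training examples above):

import numpy as np

X = np.array([[1.0, 2.0], [-1.0, 2.0], [-1.0, -2.0]])
y = np.array([-1.0, -1.0, 1.0])
alpha = np.array([0.0, 1/8, 1/8])

w = (alpha * y) @ X                   # w = sum_i alpha_i y_i x_i = (0, -1/2)
margin = 1 / np.linalg.norm(w)        # 1 / ||w|| = 2
t = X[1] @ w - y[1]                   # from y_2 (w . x_2 - t) = 1, i.e. t = w . x_2 - y_2 = 0
print(w, margin, t)
print(np.sign(X @ w - t))             # all three examples lie on the correct side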
• So far we have assumed that the data is separable (in original or transformed space)
• Misclassified examples may break the separability assumption
• Introduce “slack” variables ξi to allow misclassification of instances
• This “soft margin” allows SVMs to handle noisy data
Allowing margin errors
The idea is to introduce slack variables ξi, one for each example, which allow some of them to be inside the margin or even at the wrong side of the decision boundary.
We will call these margin errors.
Thus, we change the constraints to yi(w·xi − t) ≥ 1 − ξi and add the sum of all slack variables to the objective function to be minimised, resulting in the following soft margin optimisation problem:
w*, t*, ξ1*, . . . , ξn* = argmin_{w,t,ξi} (1/2)||w||² + C Σ_{i=1}^n ξi
subject to yi(w·xi − t) ≥ 1 − ξi and ξi ≥ 0, 1 ≤ i ≤ n
Allowing margin errors
• C is a user-defined parameter trading off margin maximisation against slack variable minimisation: a high value of C means that margin errors incur a high penalty, while a low value permits more margin errors (possibly including misclassifications) in order to achieve a large margin.
• If we allow more margin errors we need fewer support vectors, hence C controls to some extent the ‘complexity’ of the learned model
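The effect of C can be explored by sweeping it over overlapping classes. A sketch (assuming scikit-learn; the data is synthetic):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)   # overlapping classes

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_), np.linalg.norm(clf.coef_))   # number of support vectors and ||w|| as C varies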