
Kernel Methods
COMP9417 Machine Learning and Data Mining
Term 2, 2020
COMP9417 ML & DM Kernel Methods Term 2, 2020 1 / 63

Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani Goa Campus, India (2016)
COMP9417 ML & DM Kernel Methods Term 2, 2020 2 / 63

Aims
Aims
This lecture will develop your understanding of kernel methods in machine learning. Following it you should be able to:
• describe learning with the dual perceptron
• outline the idea of learning in a dual space
• describe the concept of maximising the margin in linear classification
• outline the typical loss function for maximising the margin
• describe the method of support vector machines (SVMs)
• describe the concept of kernel functions
• outline the idea of using a kernel in a learning algorithm
• outline non-linear classification with kernel methods
COMP9417 ML & DM Kernel Methods Term 2, 2020 3 / 63

Introduction
Classification and loss functions
Predictive machine learning scenarios
Task, label space, output space and learning problem for each scenario:
• Classification: label space L = C, output space Y = C; learn an approximation ĉ : X → C to the true labelling function c.
• Scoring and ranking: label space L = C, output space Y = R^|C|; learn a model that outputs a score vector over classes.
• Probability estimation: label space L = C, output space Y = [0, 1]^|C|; learn a model that outputs a probability vector over classes.
• Regression: label space L = R, output space Y = R; learn an approximation f̂ : X → R to the true labelling function f.
COMP9417 ML & DM
Kernel Methods
Term 2, 2020 4 / 63

Introduction
Classification and loss functions
Classification
A classifier is a mapping ĉ : X → C, where C = {C1, C2, …, Ck} is a finite and usually small set of class labels. We will sometimes also use Ci to indicate the set of examples of that class.
We use the 'hat' to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance (sometimes contaminated by noise).
Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space X).
COMP9417 ML & DM Kernel Methods Term 2, 2020 5 / 63

Introduction
Classification and loss functions
A decision tree
[Figure: a feature tree splitting on 'Viagra' and 'lottery', with leaf class distributions such as spam: 20, ham: 5; spam: 20, ham: 40; spam: 10, ham: 5.]
(left) A tree with the training set class distribution in the leaves. (right) A tree with the majority class prediction rule in the leaves (ĉ(x) = spam or ĉ(x) = ham).
COMP9417 ML & DM Kernel Methods Term 2, 2020 6 / 63

Introduction
Classification and loss functions
Scoring classifier
A scoring classifier is a mapping ŝ : X → R^k, i.e., a mapping from the instance space to a k-vector of real numbers.
The boldface notation indicates that a scoring classifier outputs a vector ŝ(x) = (ŝ1(x), …, ŝk(x)) rather than a single number; ŝi(x) is the score assigned to class Ci for instance x.
This score indicates how likely it is that class label Ci applies.
If we only have two classes, it usually suffices to consider the score for only one of the classes; in that case, we use ŝ(x) to denote the score of the positive class for instance x.
COMP9417 ML & DM Kernel Methods Term 2, 2020 7 / 63

Introduction
Classification and loss functions
A scoring tree
[Figure: the same feature tree, now with leaf scores ŝ(x) = +2, −1 and +1.]
(left) A tree with the training set class distribution in the leaves. (right) A tree using the logarithm of the class ratio as scores; spam is taken as the positive class.
COMP9417 ML & DM Kernel Methods Term 2, 2020 8 / 63

Introduction
Classification and loss functions
Margins and loss functions
If we take the true class c(x) as +1 for positive examples and −1 for negative examples, then the quantity z(x) = c(x)ŝ(x) is positive for correct predictions and negative for incorrect predictions: this quantity is called the margin assigned by the scoring classifier to the example.
We would like to reward large positive margins, and penalise large negative values. This is achieved by means of a so-called loss function L : R → [0, ∞) which maps each example's margin z(x) to an associated loss L(z(x)).
We will assume that L(0) = 1, which is the loss incurred by having an example on the decision boundary. We furthermore have L(z) ≥ 1 for z < 0, and usually also 0 ≤ L(z) < 1 for z > 0.
The average loss over a test set Te is (1/|Te|) Σ_{x ∈ Te} L(z(x)).
COMP9417 ML & DM Kernel Methods Term 2, 2020 9 / 63

Introduction
Classification and loss functions
Loss functions
[Figure: the five loss functions below plotted as L(z) against the margin z, for z from −2 to 2.]
From bottom-left: (i) 0–1 loss L01(z) = 1 if z ≤ 0, and L01(z) = 0 if z > 0; (ii) hinge loss Lh(z) = (1 − z) if z ≤ 1, and Lh(z) = 0 if z > 1; (iii) logistic loss Llog(z) = log2(1 + exp(−z)); (iv) exponential loss Lexp(z) = exp(−z); (v) squared loss Lsq(z) = (1 − z)² (can be set to 0 for z > 1, just like hinge loss).
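For reference, these losses are simple to compute directly from the margin z = c(x)ŝ(x); the following is a minimal NumPy sketch (the function names are mine, not from the slides):

import numpy as np

# Margin-based loss functions from the figure above, vectorised over z.
def zero_one_loss(z):    return np.where(z <= 0, 1.0, 0.0)
def hinge_loss(z):       return np.maximum(0.0, 1.0 - z)
def logistic_loss(z):    return np.log2(1.0 + np.exp(-z))
def exponential_loss(z): return np.exp(-z)
def squared_loss(z):     return (1.0 - z) ** 2

z = np.linspace(-2, 2, 9)
print(hinge_loss(z))     # zero for margins beyond 1, linear penalty otherwise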
COMP9417 ML & DM Kernel Methods Term 2, 2020 10 / 63

Introduction
Review: linear models for classification
Review: Linear classification
• Two-class classifier “separates” instances in feature space: f(x) = sign(wᵀx + b)
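As a concrete reminder of what this model computes, here is a minimal NumPy sketch (illustrative only, not course code):

import numpy as np

def linear_classify(X, w, b):
    """Predict +1/-1 for each row of X using f(x) = sign(w.x + b)."""
    return np.sign(X @ w + b)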
COMP9417 ML & DM Kernel Methods Term 2, 2020 11 / 63

Introduction
Review: linear models for classification
Issues in linear classification
• Define a decision boundary by a hyperplane in feature space
• A linear model can be used for classification
COMP9417 ML & DM Kernel Methods Term 2, 2020 12 / 63

Introduction
Review: linear models for classification
Issues in linear classification
• Many possible linear decision boundaries: which one to choose ?
COMP9417 ML & DM Kernel Methods Term 2, 2020 13 / 63

Introduction
Review: linear models for classification
Issues in linear classification
Different classification learning algorithms use different criteria:
• Basic linear classifier finds class means (centroids), joins them by a straight line, and its perpendicular bisector is the separating hyperplane
• Nearest-neighbour, naive Bayes, logistic regression all use different criteria
• Perceptron training uses iterative reweighting (gradient descent)
• Perceptron may find different models depending on starting conditions
COMP9417 ML & DM Kernel Methods Term 2, 2020 14 / 63

Introduction
Review: linear models for classification
Issues in linear classification
Is there an optimal linear classification learning method ?
• one approach is to define the empirical risk, or error on the data
• the maximum margin separating hyperplane
• unique solution
• minimises empirical risk
• is there a way to trade-off minimising risk with model complexity?
• answer: yes, under Vapnik’s framework for statistical learning
  • structural risk minimization
COMP9417 ML & DM Kernel Methods Term 2, 2020 15 / 63

Introduction
Review: linear models for classification
Issues in linear classification
• May not be possible to find a linear separating hyperplane
• filled / empty circles are in / out of the target concept
• AND is linearly separable – but not XOR
COMP9417 ML & DM Kernel Methods Term 2, 2020 16 / 63

Introduction
Review: linear models for classification
Extending linear classification
• Linear classifiers can’t model nonlinear class boundaries
• Simple trick to allow them to do that:
  • Nonlinear mapping: map attributes into new space consisting of combinations of attribute values
  • For example: all products with n factors that can be constructed from the attributes (feature construction or basis expansion)
  • e.g., for 2 attributes, all products with n = 3 factors:
      y = w1x1³ + w2x1²x2 + w3x1x2² + w4x2³
  • y is the predicted output for instances with two attributes x1 and x2 (a small sketch of this expansion follows after this slide)
COMP9417 ML & DM Kernel Methods Term 2, 2020 17 / 63
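As an illustration of this basis expansion, the sketch below (my own code, not from the lecture) builds the four cubic products for two attributes and fits the weights by least squares on a toy target:

import numpy as np

def cubic_expansion(x1, x2):
    # All products with n = 3 factors built from two attributes:
    # x1^3, x1^2*x2, x1*x2^2, x2^3
    return np.column_stack([x1**3, x1**2 * x2, x1 * x2**2, x2**3])

# A linear model in the expanded space, y = w1*x1^3 + w2*x1^2*x2 + w3*x1*x2^2 + w4*x2^3,
# fitted by ordinary least squares on the new features.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 2 * x1**3 - x1 * x2**2                       # toy target
w, *_ = np.linalg.lstsq(cubic_expansion(x1, x2), y, rcond=None)
print(w)                                         # close to [2, 0, -1, 0]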

Introduction
Review: linear models for classification
Two main problems
• efficiency:
  • With 10 attributes and n = 5 have to learn more than 2000 coefficients (weights)
  • Linear regression (with attribute selection) running time is cubic in the number of attributes
• overfitting:
  • “Too nonlinear” – number of coefficients large relative to number of training instances
  • Curse of dimensionality applies . . .
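The efficiency claim can be checked with a one-line count: the number of distinct products of exactly n = 5 factors (with repetition) drawn from 10 attributes is a combination count. A small sketch, assuming this is the intended way of counting:

from math import comb

# Monomials of degree exactly 5 in 10 attributes (combinations with repetition):
# C(10 + 5 - 1, 5) = 2002 coefficients to learn.
print(comb(10 + 5 - 1, 5))   # 2002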
COMP9417 ML & DM Kernel Methods Term 2, 2020 18 / 63

Revisiting training of linear classifiers
Linear classifiers in dual form
Every time an example xi is misclassified, add yixi to the weight vector.
• After training has completed, each example has been misclassified zero or more times. Denoting this number as αi for example xi, the weight vector can be expressed as
      w = Σ_{i=1}^{n} αi yi xi
• In the dual, instance-based view of linear classification we are learning instance weights αi rather than feature weights wj. An instance x is classified as
      ŷ = sign( Σ_{i=1}^{n} αi yi (xi · x) )
• During training, the only information needed about the training data is all pairwise dot products: the n-by-n matrix G = XXᵀ containing these dot products is called the Gram matrix.
COMP9417 ML & DM Kernel Methods Term 2, 2020 19 / 63

Revisiting training of linear classifiers
Perceptron training in dual form
Algorithm DualPerceptron(D) // perceptron training in dual form
Input: labelled training data D in homogeneous coordinates
Output: coefficients αi defining weight vector w = Σ_{i=1}^{|D|} αi yi xi
αi ← 0 for 1 ≤ i ≤ |D|
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if yi Σ_{j=1}^{|D|} αj yj (xi · xj) ≤ 0 then
            αi ← αi + 1
            converged ← false
        end
    end
end
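The algorithm above translates almost line by line into NumPy, using the Gram matrix G = XXᵀ of pairwise dot products. A minimal sketch (my own code, not the official course implementation):

import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron: learn instance weights alpha such that w = sum_i alpha_i y_i x_i.
    X: (n, d) array in homogeneous coordinates, y: (n,) array of +1/-1 labels."""
    n = len(y)
    G = X @ X.T                     # Gram matrix of pairwise dot products
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        converged = True
        for i in range(n):
            if y[i] * np.sum(alpha * y * G[i]) <= 0:   # xi is misclassified
                alpha[i] += 1
                converged = False
        if converged:
            break
    return alpha

# The primal weight vector can be recovered afterwards as w = (alpha * y) @ X.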
COMP9417 ML & DM Kernel Methods Term 2, 2020 20 / 63

Support Vector Machines
Support vector machine
[Figure: positive (+) and negative (–) training examples on either side of the learned decision boundary w, with the margin shown as dotted lines.]
The decision boundary learned by a support vector machine maximises the margin, which is indicated by the dotted lines. The circled data points are the support vectors.
COMP9417 ML & DM Kernel Methods Term 2, 2020 21 / 63

Support Vector Machines
Support vector machines
• Support vector machines (machine ≡ algorithm) learn linear classifiers
• Can avoid overfitting – learn a form of decision boundary called the maximum margin hyperplane
• Fast for mappings to nonlinear spaces
• employ a mathematical trick to avoid the actual creation of new “pseudo-attributes” in transformed instance space
• i.e. the nonlinear space is created implicitly
COMP9417 ML & DM Kernel Methods Term 2, 2020 22 / 63

Support Vector Machines
The kernel trick
Let x1 = (x1, y1) and x2 = (x2, y2) be two data points, and consider the mapping (x, y) ↦ (x², y², √2xy) to a three-dimensional feature space. The points in feature space corresponding to x1 and x2 are x′1 = (x1², y1², √2x1y1) and x′2 = (x2², y2², √2x2y2). The dot product of these two feature vectors is
    x′1 · x′2 = x1²x2² + y1²y2² + 2x1y1x2y2 = (x1x2 + y1y2)² = (x1 · x2)²
That is, by squaring the dot product in the original space we obtain the dot product in the new space without actually constructing the feature vectors! A function that calculates the dot product in feature space directly from the vectors in the original space is called a kernel – here the kernel is κ(x1, x2) = (x1 · x2)².
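This identity is easy to verify numerically; a quick sketch (the vectors are arbitrary):

import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

phi = lambda v: np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

lhs = phi(x1) @ phi(x2)        # dot product in the 3-D feature space
rhs = (x1 @ x2) ** 2           # kernel kappa(x1, x2) = (x1 . x2)^2 in the original space
print(np.isclose(lhs, rhs))    # True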
COMP9417 ML & DM Kernel Methods Term 2, 2020 23 / 63

Support Vector Machines
Training a support vector machine
COMP9417 ML & DM Kernel Methods Term 2, 2020 24 / 63

Support Vector Machines
Training a support vector machine
• learning problem: fit maximum margin hyperplane, i.e. a kind of linear model
• for a linearly separable two-class data set the maximum margin hyperplane is the classification surface which
  • correctly classifies all examples in the data set
  • has the greatest separation between classes
• “convex hull” of instances in each class is tightest enclosing convex polygon
• for a linearly separable two-class data set convex hulls do not overlap
• maximum margin hyperplane is orthogonal to shortest line connecting convex hulls, intersects with it halfway
• the more “separated” the classes, the larger the margin, the better the generalization
COMP9417 ML & DM Kernel Methods Term 2, 2020 25 / 63

Support Vector Machines
Support vectors
• Instances closest to maximum margin hyperplane are support vectors
• Important observation: support vectors define maximum margin hyperplane!
• All other instances can be deleted without changing position and orientation of the hyperplane!
COMP9417 ML & DM Kernel Methods Term 2, 2020 26 / 63

Support Vector Machines
Finding support vectors
• Determining parameters is a constrained quadratic optimization problem
  • standard algorithms, or
  • special-purpose algorithms are faster, e.g. Platt’s sequential minimal optimization (SMO), or LibSVM
• Note: all this assumes separable data!
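In practice the optimiser is rarely written by hand. As one example (a sketch assuming scikit-learn is available; not part of the lecture), scikit-learn's SVC wraps a LIBSVM solver and exposes the support vectors directly:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [-1, 2], [-1, -2]], dtype=float)
y = np.array([-1, -1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C approximates the hard-margin (separable) case
clf.fit(X, y)
print(clf.support_)                  # indices of the support vectors
print(clf.coef_, clf.intercept_)     # w and the intercept (equal to -t in the slides' notation)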
COMP9417 ML & DM Kernel Methods Term 2, 2020 27 / 63

Support Vector Machines
Support vector machine
[Figure: the decision boundary w · x = t with margin hyperplanes w · x = t − m and w · x = t + m; the distance between the two margin hyperplanes is 2m/||w||.]
The geometry of a support vector classifier. The circled data points are the support vectors, which are the training examples nearest to the decision boundary. The support vector machine finds the decision boundary that maximises the margin m/||w||.
COMP9417 ML & DM Kernel Methods Term 2, 2020 28 / 63

Support Vector Machines
Maximising the margin
Since we are free to rescale t, ||w|| and m, it is customary to choose m = 1. Maximising the margin then corresponds to minimising ||w|| or, more conveniently, ½||w||², provided of course that none of the training points fall inside the margin.
This leads to a quadratic, constrained optimisation problem:
    w*, t* = argmin_{w,t} ½||w||²   subject to yi(w · xi − t) ≥ 1, 1 ≤ i ≤ n
Using the method of Lagrange multipliers, the dual form of this problem can be derived.
COMP9417 ML & DM Kernel Methods Term 2, 2020 29 / 63

Support Vector Machines
Deriving the dual problem
Adding the constraints with multipliers αi for each training example gives the Lagrange function
    Λ(w, t, α1, …, αn) = ½||w||² − Σ_{i=1}^{n} αi (yi(w · xi − t) − 1)
                       = ½||w||² − Σ_{i=1}^{n} αi yi (w · xi) + Σ_{i=1}^{n} αi yi t + Σ_{i=1}^{n} αi
                       = ½ w · w − w · (Σ_{i=1}^{n} αi yi xi) + t (Σ_{i=1}^{n} αi yi) + Σ_{i=1}^{n} αi
• By taking the partial derivative of the Lagrange function with respect to t and setting it to 0 we find Σ_{i=1}^{n} αi yi = 0.
• Similarly, by taking the partial derivative of the Lagrange function with respect to w and setting it to 0 we obtain w = Σ_{i=1}^{n} αi yi xi – the same expression as we derived for the perceptron.
COMP9417 ML & DM Kernel Methods Term 2, 2020 30 / 63

Support Vector Machines
Deriving the dual problem
• For the perceptron, the instance weights αi are non-negative integers denoting the number of times an example has been misclassified in training. For a support vector machine, the αi are non-negative reals.
• What they have in common is that, if αi = 0 for a particular example xi, that example could be removed from the training set without affecting the learned decision boundary. In the case of support vector machines this means that αi > 0 only for the support vectors: the training examples nearest to the decision boundary.
These expressions allow us to eliminate w and t and lead to the dual Lagrangian
    Λ(α1, …, αn) = −½ (Σ_{i=1}^{n} αi yi xi) · (Σ_{j=1}^{n} αj yj xj) + Σ_{i=1}^{n} αi
                 = −½ Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj) + Σ_{i=1}^{n} αi
COMP9417 ML & DM Kernel Methods Term 2, 2020 31 / 63

Support Vector Machines
SVM in dual form
The dual optimisation problem for support vector machines is to maximise the dual Lagrangian under positivity constraints and one equality constraint:
    α1*, …, αn* = argmax_{α1,…,αn} −½ Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xi · xj) + Σ_{i=1}^{n} αi
    subject to αi ≥ 0 for 1 ≤ i ≤ n, and Σ_{i=1}^{n} αi yi = 0
COMP9417 ML & DM Kernel Methods Term 2, 2020 32 / 63

Support Vector Machines
Two maximum-margin classifiers
[Figure: (left) three examples x1, x2, x3 with the maximum-margin boundary defined by w; (right) the same data with a fourth, positive example added and the rotated boundary.]
(left) A maximum-margin classifier built from three examples, with
w = (0, −1/2) and margin 2. The circled examples are the support vectors: they receive non-zero Lagrange multipliers and define the decision boundary. (right) By adding a second positive the decision boundary is rotated to w = (3/5, −4/5) and the margin decreases to 1.
COMP9417 ML & DM Kernel Methods Term 2, 2020 33 / 63

Support Vector Machines
Two maximum-margin classifiers
    X = [ (1, 2); (−1, 2); (−1, −2) ],   y = (−1, −1, +1),   X′ = [ (−1, −2); (1, −2); (−1, −2) ]
The matrix X′ on the right incorporates the class labels; i.e., the rows are yixi. The Gram matrix is (without and with class labels):
    XXᵀ = [ 5 3 −5; 3 5 −3; −5 −3 5 ],   X′X′ᵀ = [ 5 3 5; 3 5 3; 5 3 5 ]
The dual optimisation problem is thus
    argmax_{α1,α2,α3} −½ (5α1² + 3α1α2 + 5α1α3 + 3α2α1 + 5α2² + 3α2α3 + 5α3α1 + 3α3α2 + 5α3²) + α1 + α2 + α3
    = argmax_{α1,α2,α3} −½ (5α1² + 6α1α2 + 10α1α3 + 5α2² + 6α2α3 + 5α3²) + α1 + α2 + α3
subject to α1 ≥ 0, α2 ≥ 0, α3 ≥ 0 and −α1 − α2 + α3 = 0.
COMP9417 ML & DM Kernel Methods Term 2, 2020 34 / 63

Support Vector Machines
Two maximum-margin classifiers
• Using the equality constraint we can eliminate one of the variables, say α3, and simplify the objective function to
      argmax_{α1,α2} −½ (20α1² + 32α1α2 + 16α2²) + 2α1 + 2α2
• Setting partial derivatives to 0 we obtain −20α1 − 16α2 + 2 = 0 and −16α1 − 16α2 + 2 = 0 (notice that, because the objective function is quadratic, these equations are guaranteed to be linear).
• We therefore obtain the solution α1 = 0 and α2 = α3 = 1/8. We then have w = 1/8(x3 − x2) = (0, −1/2), resulting in a margin of 1/||w|| = 2.
• Finally, t can be obtained from any support vector, say x2, since y2(w · x2 − t) = 1; this gives −1 · (−1 − t) = 1, hence t = 0.
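The hand calculation can be checked with a generic constrained optimiser; the sketch below assumes SciPy is available and uses my own variable names:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1., 2.], [-1., 2.], [-1., -2.]])
y = np.array([-1., -1., 1.])
G = (y[:, None] * X) @ (y[:, None] * X).T       # X'X'^T, the label-weighted Gram matrix

neg_dual = lambda a: 0.5 * a @ G @ a - a.sum()  # negative of the dual Lagrangian
res = minimize(neg_dual, np.zeros(3),
               bounds=[(0, None)] * 3,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x                                   # approx. [0, 1/8, 1/8]
w = (alpha * y) @ X                             # approx. [0, -1/2]
print(alpha, w)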
COMP9417 ML & DM Kernel Methods Term 2, 2020 35 / 63

Support Vector Machines
Two maximum-margin classifiers
We now add an additional positive at (3, 1). This gives the following data matrices:
    X′ = [ (−1, −2); (1, −2); (−1, −2); (3, 1) ],   X′X′ᵀ = [ 5 3 5 −5; 3 5 3 1; 5 3 5 −5; −5 1 −5 10 ]
• It can be verified by similar calculations to those above that the margin decreases to 1 and the decision boundary rotates to w = (3/5, −4/5).
• The Lagrange multipliers now are α1 = 1/2, α2 = 0, α3 = 1/10 and α4 = 2/5. Thus, only x3 is a support vector in both the original and the extended data set.
COMP9417 ML & DM Kernel Methods Term 2, 2020 36 / 63

Support Vector Machines
Noise
• Misclassified examples may break the separability assumption
• Introduce “slack” variables ξi to allow misclassification of instances
• Minimise Φ(w) = wᵀw + C Σ_i ξi
COMP9417 ML & DM Kernel Methods Term 2, 2020 37 / 63

Support Vector Machines
Noise
• “soft margin” SVMs to handle noisy data
• Parameter C bounds the influence of any one training instance on the decision boundary
• Constraint 0 ≤ αi ≤ C
• Still a quadratic optimization problem
• C has to be found by, e.g., cross-validation
COMP9417 ML & DM Kernel Methods Term 2, 2020 38 / 63

Support Vector Machines
Allowing margin errors
The idea is to introduce slack variables ξi, one for each example, which allow some of them to be inside the margin or even at the wrong side of the decision boundary. We will call these margin errors. Thus, we change the constraints to w · xi − t ≥ 1 − ξi and add the sum of all slack variables to the objective function to be minimised, resulting in the following soft margin optimisation problem:
    w*, t*, ξi* = argmin_{w,t,ξi} ½||w||² + C Σ_{i=1}^{n} ξi
    subject to yi(w · xi − t) ≥ 1 − ξi and ξi ≥ 0, 1 ≤ i ≤ n
COMP9417 ML & DM Kernel Methods Term 2, 2020 39 / 63

Support Vector Machines
Allowing margin errors
• C is a user-defined parameter trading off margin maximisation against slack variable minimisation: a high value of C means that margin errors incur a high penalty, while a low value permits more margin errors (possibly including misclassifications) in order to achieve a large margin.
• If we allow more margin errors we need fewer support vectors, hence C controls to some extent the ‘complexity’ of the SVM and hence is often referred to as the complexity parameter.
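The effect of C is easy to observe empirically; a sketch assuming scikit-learn, on an arbitrary synthetic data set:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)

for C in (100, 1, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C tolerates more margin errors, which typically means
    # more support vectors and a larger margin (smaller ||w||).
    print(C, len(clf.support_), np.linalg.norm(clf.coef_))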
COMP9417 ML & DM Kernel Methods Term 2, 2020 40 / 63

Support Vector Machines
Allowing margin errors
The Lagrange function is then as follows:
    Λ(w, t, ξi, αi, βi) = ½||w||² + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} αi (yi(w · xi − t) − (1 − ξi)) − Σ_{i=1}^{n} βi ξi
                        = Λ(w, t, αi) + Σ_{i=1}^{n} (C − αi − βi) ξi
• For an optimal solution every partial derivative with respect to ξi should be 0, from which it follows that the added term vanishes from the dual problem.
• Furthermore, since both αi and βi are positive, this means that αi cannot be larger than C:
COMP9417 ML & DM Kernel Methods Term 2, 2020 41 / 63

Support Vector Machines
Three cases for the training instances
What is the significance of the upper bound C on the αi multipliers?
• Since C − αi − βi = 0 for all i, αi = C implies βi = 0. The βi multipliers come from the ξi ≥ 0 constraint, and a multiplier of 0 means that the lower bound is not reached, i.e., ξi > 0 (analogous to the fact that αj = 0 means that xj is not a support vector and hence w · xj − t > 1).
• In other words, a solution to the soft margin optimisation problem in dual form divides the training examples into three cases:
αi = 0 these are outside or on the margin;
0 < αi < C these are the support vectors on the margin;
αi = C these are on or inside the margin.
• Notice that we still have w = Σ_{i=1}^{n} αi yi xi, and so both second and third case examples participate in spanning the decision boundary.
COMP9417 ML & DM Kernel Methods Term 2, 2020 42 / 63

Support Vector Machines
Soft margins
[Figure: the four examples x1–x4 with their class means (asterisks) and the two soft margin decision boundaries described in the caption.]
(left) The soft margin classifier learned with C = 5/16, at which point x2 is about to become a support vector. (right) The soft margin classifier learned with C = 1/10: all examples contribute equally to the weight vector. The asterisks denote the class means, and the decision boundary is parallel to the one learned by the basic linear classifier.
COMP9417 ML & DM Kernel Methods Term 2, 2020 43 / 63

Support Vector Machines
Soft margins
• Recall that the Lagrange multipliers for the classifier above are α1 = 1/2, α2 = 0, α3 = 1/10 and α4 = 2/5. So α1 is the largest multiplier, and as long as C > α1 = 1/2 no margin errors are tolerated.
• For C = 1/2 we have α1 = C, and hence for C < 1/2 we have that x1 becomes a margin error and the optimal classifier is a soft margin classifier. Effectively, with decreasing C the decision boundary and the upper margin move upward, while the lower margin stays the same.
• The upper margin reaches x2 for C = 5/16, at which point we have w = (3/8, −1/2), t = 3/8 and the margin has increased to 1.6. Furthermore, we have ξ1 = 6/8, α1 = C = 5/16, α2 = 0, α3 = 1/16 and α4 = 1/4.
COMP9417 ML & DM Kernel Methods Term 2, 2020 44 / 63

Support Vector Machines
Soft margins
• If we now decrease C further, the decision boundary starts to rotate clockwise, so that x4 becomes a margin error as well, and only x2 and x3 are support vectors. The boundary rotates until C = 1/10, at which point we have w = (1/5, −1/2), t = 1/5 and the margin has increased to 1.86. Furthermore, we have ξ1 = 4/10 and ξ4 = 7/10, and all multipliers have become equal to C.
• Finally, when C decreases further the decision boundary stays where it is, but the norm of the weight vector gradually decreases and all points become margin errors.
COMP9417 ML & DM Kernel Methods Term 2, 2020 45 / 63

Support Vector Machines
Soft margins
NOTE: a minimal-complexity soft margin classifier summarises the classes by their class means in a way very similar to the basic linear classifier.
COMP9417 ML & DM Kernel Methods Term 2, 2020 46 / 63

Support Vector Machines
Sparse data
• SVM algorithms can be sped up dramatically if the data is sparse (i.e. many values are 0)
• Why? Because they compute lots and lots of dot products
• With sparse data dot products can be computed very efficiently
  • Just need to iterate over the values that are non-zero
• SVMs can process sparse datasets with tens of thousands of attributes
COMP9417 ML & DM Kernel Methods Term 2, 2020 47 / 63

Nonlinear SVMs
Nonlinear SVMs
• Apply same trick: “pseudo attributes” representing attribute combinations
• Overfitting not (such) a (big) problem because the maximum margin hyperplane is stable (only replacement of “support vectors” changes the decision boundary)
  • usually few support vectors relative to the size of the training set
  • error bounded by number of support vectors, irrespective of dimensionality
• Efficiency problem remains?
  • Every time a dot product is computed we need to go through all the “pseudo attributes”
COMP9417 ML & DM Kernel Methods Term 2, 2020 48 / 63

Nonlinear SVMs
Kernel trick
• Can avoid computing the “pseudo attributes”!
• Can compute the dot product before the nonlinear mapping is performed
• All computation done in the original low-dimensional space
• Instead of computing
      x = b + Σ_{i is a support vector} αi yi (a(i) · a)
  we can compute
      x = b + Σ_{i is a support vector} αi yi (a(i) · a)^n
  where n is the number of factors
COMP9417 ML & DM Kernel Methods Term 2, 2020 49 / 63

Nonlinear SVMs
Kernel trick
• Corresponds to a map into the instance space spanned by all products of n attributes
• bound αi to prevent overfitting
This mathematical trick is a general approach to solve non-linearly separable problems by an implicit mapping into the (possibly very) high-dimensional feature space. Applies as long as certain conditions are satisfied.
E.g., polynomial kernel function K(x, y) = (1 + x · y)^d
Next slide: classic example for 2 attributes x1, x2 where d = 2.
COMP9417 ML & DM Kernel Methods Term 2, 2020 50 / 63

Nonlinear SVMs
Kernel trick
[Figure: for 2 attributes x1, x2 with d = 2, the kernel K(x, y) = (x · y)² corresponds to a mapping into a (z1, z2, z3) space in which the Xs and Os, not linearly separable in the original space, become linearly separable. Figure by Avrim Blum, CS Dept, CMU.]
COMP9417 ML & DM Kernel Methods Term 2, 2020 51 / 63

Nonlinear SVMs
Kernel functions
• The mapping is performed by the kernel function
  • (x · y)^n computes the dot product of vectors x and y and raises it to the power n
  • polynomial kernel
• Can use other kernel functions than the polynomial kernel
      x = b + Σ_{i is a support vector} αi yi K(a(i) · a)
• Requirement: K(xi, xj) = φ(xi) · φ(xj)
  • must satisfy matrix conditions to be a Mercer kernel
COMP9417 ML & DM Kernel Methods Term 2, 2020 52 / 63

Nonlinear SVMs
Kernel functions
• Common choices for kernel functions include:
  • polynomial kernel of degree d: K(xi, xj) = (xi · xj + 1)^d
  • radial basis function kernel: K(xi, xj) = exp(−||xi − xj||² / 2σ²)
  • sigmoid kernel: K(xi, xj) = tanh(β xi · xj + b)
COMP9417 ML & DM Kernel Methods Term 2, 2020 53 / 63

Nonlinear SVMs
‘Kernelising’ the perceptron
The perceptron algorithm is a simple counting algorithm – the only operation that is somewhat involved is testing whether example xi is correctly classified by evaluating yi Σ_{j=1}^{|D|} αj yj (xi · xj).
• The key component of this calculation is the dot product xi · xj.
• Assuming bivariate examples xi = (xi, yi) and xj = (xj, yj) for notational simplicity, the dot product can be written as xi · xj = xixj + yiyj.
• The corresponding instances in the quadratic feature space are (xi², yi²) and (xj², yj²), and their dot product is (xi², yi²) · (xj², yj²) = xi²xj² + yi²yj².
• This is almost equal to (xi · xj)² = (xixj + yiyj)² = (xixj)² + (yiyj)² + 2xixjyiyj, but not quite because of the third term of cross-products.
• We can capture this term by extending the feature vector with a third feature √2xy.
COMP9417 ML & DM Kernel Methods Term 2, 2020 54 / 63

Nonlinear SVMs
‘Kernelising’ the perceptron
This gives the following feature space:
    φ(xi) = (xi², yi², √2xiyi)    φ(xj) = (xj², yj², √2xjyj)
    φ(xi) · φ(xj) = xi²xj² + yi²yj² + 2xixjyiyj = (xi · xj)²
• We now define κ(xi, xj) = (xi · xj)², and replace xi · xj with κ(xi, xj) in the dual perceptron algorithm to obtain the kernel perceptron
• This would work for many other kernels satisfying certain conditions.
COMP9417 ML & DM Kernel Methods Term 2, 2020 55 / 63

Nonlinear SVMs
‘Kernelising’ the perceptron
Algorithm KernelPerceptron(D, κ) // perceptron training algorithm using a kernel
Input: labelled training data D in homogeneous coordinates, plus kernel function κ
Output: coefficients αi defining non-linear decision boundary
αi ← 0 for 1 ≤ i ≤ |D|
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if yi Σ_{j=1}^{|D|} αj yj κ(xi, xj) ≤ 0 then
            αi ← αi + 1
            converged ← false
        end
    end
end
COMP9417 ML & DM Kernel Methods Term 2, 2020 56 / 63

Nonlinear SVMs
Other kernels
We can define a polynomial kernel of any degree p as κ(xi, xj) = (xi · xj)^p. This transforms a d-dimensional input space into a high-dimensional feature space, such that each new feature is a product of p terms (possibly repeated).
If we include a constant, say κ(xi, xj) = (xi · xj + 1)^p, we would get all lower-order terms as well. So, for example, in a bivariate input space and setting p = 2 the resulting feature space is
    φ(x) = (x², y², √2xy, √2x, √2y, 1)
with linear as well as quadratic features.
COMP9417 ML & DM Kernel Methods Term 2, 2020 57 / 63

Nonlinear SVMs
Other kernels
An often-used kernel is the Gaussian kernel, defined as
    κ(xi, xj) = exp(−||xi − xj||² / 2σ²)
where σ is a parameter known as the bandwidth.
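Pulling the last few slides together, a kernel is just a function of two instances, and the kernel perceptron is the dual perceptron with dot products replaced by kernel evaluations. A minimal sketch (my own code, not from the course), reusing the same training loop as the dual perceptron sketch earlier with the kernel matrix K in place of the Gram matrix G:

import numpy as np

def poly_kernel(a, b, p=2):             # kappa(xi, xj) = (xi . xj)^p
    return (a @ b) ** p

def gaussian_kernel(a, b, sigma=1.0):   # kappa(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_perceptron(X, y, kappa, max_epochs=100):
    n = len(y)
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        converged = True
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[i]) <= 0:   # xi misclassified under the kernel
                alpha[i] += 1
                converged = False
        if converged:
            break
    # A new point x is classified as sign(sum_j alpha_j y_j kappa(x, X[j])).
    return alpha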
Notice that the soft margin optimisation problem (above) is defined in terms of dot products between training instances, and hence the ‘kernel trick’ can be applied to SVMs by replacing each dot product xi · xj with κ(xi, xj).
COMP9417 ML & DM Kernel Methods Term 2, 2020 58 / 63

Nonlinear SVMs
Other kernels
• The decision boundary learned with a non-linear kernel cannot be represented by a simple weight vector in input space. Thus, in order to classify a new example x we need to evaluate yi Σ_{j=1}^{n} αj yj κ(x, xj), which is an O(n) computation involving all training examples, or at least the ones with non-zero multipliers αj.
• This is why support vector machines are a popular choice as a kernel method, since they naturally promote sparsity in the support vectors.
• Although we have restricted attention to numerical features here, kernels can be defined over discrete structures, including trees, graphs, and logical formulae, opening the way to extending geometric models to non-numerical data.¹
¹ See, for example, Schölkopf and Smola (2002).
COMP9417 ML & DM Kernel Methods Term 2, 2020 59 / 63

Applications
SVM Applications
• Machine vision: e.g. face identification
  • Prior to deep learning, achieved the lowest error
• Handwritten digit recognition:
  • Comparable to best alternative
• Bioinformatics: e.g. prediction of protein secondary structure, microarray classification
• Text classification
• Algorithm can be modified to deal with numeric prediction problems – support vector regression
COMP9417 ML & DM Kernel Methods Term 2, 2020 60 / 63

Applications
Example - pedestrian detection
COMP9417 ML & DM Kernel Methods Term 2, 2020 61 / 63

Applications
Example - classification from gene expression
COMP9417 ML & DM Kernel Methods Term 2, 2020 62 / 63

Summary
Summary: Learning with Kernel Methods
• Kernel methods around for a long time in statistics
• Main example of the “optimisation” approach to machine learning
• Kernelisation a “modular” approach to machine learning
• Algorithms that can be kernelised can learn different model classes simply by changing the kernel, e.g., string kernels for sequence data
• SVMs exemplify this – mostly for classification (but also regression, “one-class” classification, etc.)
• SVMs one of the most widely used “off-the-shelf” classifier learning methods, especially for “small n (examples), large p (dimensionality)” classification problems
COMP9417 ML & DM Kernel Methods Term 2, 2020 63 / 63

References
Schölkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
COMP9417 ML & DM Kernel Methods Term 2, 2020 63 / 63