Kernel Methods
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will develop your understanding of kernel methods in machine learning. Following it you should be able to:
– describe perceptron learning
– describe learning with the dual perceptron
– outline the idea of learning in a dual space
– describe the concept of maximizing the margin in linear classification
– outline the typical loss function for maximizing the margin
– describe the method of support vector machines (SVMs)
– describe the concept of kernel functions
– outline the idea of using a kernel in a learning algorithm
– outline non-linear classification with kernel methods
Predictive machine learning scenarios
Classification
• A classifier is a mapping ĉ: 𝒳 → C, where C = {C_1, C_2, ..., C_k} is a finite and usually small set of class labels. We will sometimes also use C_i to indicate the set of examples of that class.
• We use the 'hat' to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ 𝒳 is an instance and c(x) is the true class of the instance (sometimes contaminated by noise).
• Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space 𝒳).
A decision tree
(left) A tree with the training set class distribution in the leaves. (right) A tree with the majority class prediction rule in the leaves
Scoring classifier
• A scoring classifier is a mapping ŝ: 𝒳 → ℝ^k, i.e., a mapping from the instance space 𝒳 to a k-vector of real numbers.
• The notation indicates that a scoring classifier outputs a vector ŝ(x) = (ŝ_1(x), ..., ŝ_k(x)) rather than a single number; ŝ_i(x) is the score assigned to class C_i for instance x.
• This score indicates how likely it is that class label C_i applies.
• If we only have two classes, it usually suffices to consider the score for only one of the classes; in that case, we use ŝ(x) to denote the score of the positive class for instance x.
A scoring tree
(left) A tree with the training set class distribution in the leaves.
(right) A tree using the logarithm of the class ratio as scores; spam is taken as the positive class.
Margins and loss functions
If we take the true class 𝑐(𝑥) as +1 for positive examples and −1 for negative examples, then the quantity 𝑧(𝑥) = 𝑐(𝑥)×𝑠̂(𝑥) is positive for correct predictions and negative for incorrect predictions: this quantity is called the margin assigned by the scoring classifier to the example.
For example, in a linear classifier, we can define the score to be the distance between the example and the line (the decision boundary).

(Figure: positive and negative examples in the (x1, x2) plane, separated by a linear decision boundary.)
Margins and loss functions
We would like to reward large positive margins and penalize large negative values. This is achieved by means of a so-called loss function 𝐿 ∶ 𝑅 → [0, ∞) which maps each example’s margin 𝑧(𝑥) to an associated loss 𝐿(𝑧(𝑥)).
We will assume that 𝐿(0) = 1, which is the loss incurred by having an example on the decision boundary. We furthermore have 𝐿(𝑧) ≥ 1 for 𝑧 < 0, and usually also 0 ≤ 𝐿(𝑧) < 1 for 𝑧 > 0. The average loss over a test set 𝑇𝑒𝑠𝑡 is
(1/|Test|) Σ_{x∈Test} L(z(x))
Loss functions
From bottom-left: (i) 0–1 loss L_01(z) = 1 if z ≤ 0, and L_01(z) = 0 if z > 0; (ii) hinge loss L_h(z) = (1 − z) if z ≤ 1, and L_h(z) = 0 if z > 1; (iii) logistic loss L_log(z) = log_2(1 + exp(−z)); (iv) exponential loss L_exp(z) = exp(−z); (v) squared loss L_sq(z) = (1 − z)² (can be set to 0 for z > 1, just like hinge loss).
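These loss functions, and the average loss above, are one-liners in numpy (a sketch; the function names and the toy scores are mine):

```python
import numpy as np

def zero_one_loss(z):    return np.where(z <= 0, 1.0, 0.0)
def hinge_loss(z):       return np.maximum(0.0, 1.0 - z)
def logistic_loss(z):    return np.log2(1.0 + np.exp(-z))
def exponential_loss(z): return np.exp(-z)
def squared_loss(z):     return (1.0 - z) ** 2

# Margin z(x) = c(x) * s_hat(x), then average loss over a test set:
c = np.array([+1, -1, +1, -1])         # true classes
s = np.array([0.9, 0.4, -0.2, -1.5])   # scores s_hat(x)
z = c * s
print(hinge_loss(z).mean())            # average hinge loss over the set
```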
Review: Linear classification
• Two-class classifier “separates” instances in feature space:
f(x) = sign(w·x + b)

w·x + b > 0
w·x + b < 0
Issues in linear classification
• Define a decision boundary by a hyperplane in feature space
• A linear model can be used for classification
Issues in linear classification
• Many possible linear decision boundaries: which one to choose?
w·x + b > 0
w·x + b < 0
Linear classification: Perceptron
The perceptron is an algorithm for binary classification that uses a linear prediction function.

If we have two attributes/features x_1 and x_2, then we can predict the target function f(x) with:

f(x) = +1 if w_0 + w_1 x_1 + w_2 x_2 > 0, and −1 otherwise.

(Figure: 'Salmon' vs 'Seabass' examples plotted against x1 = Lightness and x2 = Width, separated by the decision boundary w_0 + w_1 x_1 + w_2 x_2 = 0.)
Linear classification: Perceptron
For a general case with n attributes:

f(x) = +1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and −1 otherwise.

If we add x_0 = 1 to the feature vector, the weighted sum becomes a dot product:

Σ_{i=0}^{n} w_i x_i = w·x

Dot product: a·b = a_1 b_1 + a_2 b_2 + ... + a_n b_n

so the prediction rule can be written as

f(x) = +1 if Σ_{i=0}^{n} w_i x_i > 0, and −1 otherwise.
Perceptron learning
f(x) = +1 if w·x > 0, and −1 otherwise, i.e.

ŷ = f(x) = sgn(w·x)

where sgn is the sign function.

Now we have to find a good set of weights using our training set x_1, x_2, ..., x_m with labels y_1, y_2, ..., y_m.

Please note that, here, x_j corresponds to observation j and is a vector of n features: x_j = [x_{j1}, ..., x_{jn}]
Perceptron learning
The perceptron algorithm initializes all weights w_i to zero, and learns the weights using the following update rule:

w := w + ½ (y_j − f(x_j)) x_j

There are 4 cases:

y_j = +1, f(x_j) = +1 ⟹ y_j − f(x_j) = 0
y_j = +1, f(x_j) = −1 ⟹ y_j − f(x_j) = +2
y_j = −1, f(x_j) = +1 ⟹ y_j − f(x_j) = −2
y_j = −1, f(x_j) = −1 ⟹ y_j − f(x_j) = 0
Perceptron learning
w gets updated only if the prediction mismatches the actual class label (a misclassification), and otherwise remains the same. Therefore, for misclassified instances we can write:

w := w + y_j x_j

Perceptron algorithm:
1. Initialize all weights w to zero
2. Iterate through the training data. For each training sample, classify the sample:
   a) If the prediction was correct, don't do anything
   b) If the prediction was wrong, modify the weights using the update rule
3. Repeat step 2 some number of times
Perceptron training algorithm
Algorithm Perceptron(D) // perceptron training for linear classification
Input: labelled training data D in homogeneous coordinates
Output: weight vector w

w ← 0
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if y_i (w·x_i) ≤ 0 then
            w ← w + y_i x_i
            converged ← false
        end
    end
end
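A minimal Python sketch of this training loop (assuming numpy arrays: X of shape (m, n+1) with a leading column of ones for homogeneous coordinates, and labels y in {−1, +1}; the function name and the max_epochs cap are my additions, since the pseudocode loops until convergence):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Primal perceptron training; X is (m, n+1) with x_0 = 1, y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        converged = True
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i              # update rule: w := w + y_i x_i
                converged = False
        if converged:
            break
    return w

# Predictions for new (homogeneous) instances: np.sign(X_new @ w)
```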
Issues in linear classification
• May not be possible to find a linear separating hyperplane
• Filled / empty circles are in / out of the target concept
• AND is linearly separable – but not XOR
Extending linear classification
• Linear classifiers can’t model nonlinear class boundaries
• Simple trick to allow them to do that:
– Nonlinear mapping: map attributes into new space consisting of
combinations of attribute values
– For example: all products with 𝑛 factors that can be constructed from the attributes (feature construction or basis expansion)
• e.g., for 2 attributes, all products with n = 3 factors (a code sketch of this kind of expansion follows below):
  y = w_1 x_1³ + w_2 x_1² x_2 + w_3 x_1 x_2² + w_4 x_2³
• y is the predicted output for instances with two attributes x_1 and x_2
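A hedged sketch of this kind of basis expansion with scikit-learn's PolynomialFeatures (my choice of tool, not something the slides prescribe; note it generates all monomials up to degree 3, a superset of the "exactly n = 3 factors" terms above):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])              # two attributes x1, x2

expand = PolynomialFeatures(degree=3, include_bias=False)
Z = expand.fit_transform(X)              # constructed features: x1, x2, x1^2, ..., x2^3
print(expand.get_feature_names_out())    # which monomial each column is
print(Z.shape)                           # (2, 9) constructed features
```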
Two main problems
• Efficiency:
– With 10 attributes and 𝑛 = 5 more than 2000 coefficients
(weights) have to be learned
– Linear regression (with attribute selection/regularisation) running time is cubic in the number of attributes
• Overfitting:
– “Too nonlinear” – number of coefficients large relative to number of training instances
– Curse of dimensionality applies …
Duality (optimization)
• Essentially, an optimization approach has to be used to find the separating line (decision boundary)
• The concept of duality in optimization can help us view the problem from another perspective and make it simpler
• Sometimes solving the dual form is much easier (a computational advantage)
• We will see how using the dual form and the kernel trick simplifies the computations
Duality (optimization)
One way of thinking about duality: given an optimization problem (the primal problem), we can construct a related optimization problem, called the dual problem, which can be useful in solving the primal problem.
In convex optimization problems, the optimal values of the primal and dual problems are equal under a constraint qualification condition.
Perceptron classifiers in dual form
Every time an example x_i is misclassified, add y_i x_i to the weight vector.

• After training has completed, each example has been misclassified zero or more times. Denoting this number as α_i for example x_i, the weight vector for m observations can be expressed as

w = Σ_{i=1}^{m} α_i y_i x_i

• In the dual, instance-based view of linear classification we are learning instance weights α_i rather than feature weights w_j. An instance x is classified as

ŷ = f(x) = sgn(w·x) = sgn( Σ_{i=1}^{m} α_i y_i (x_i·x) )
Perceptron training in dual form

Algorithm Dual-Perceptron(D) // perceptron training in dual form
Input: labelled training data D in homogeneous coordinates
Output: coefficients α_i defining the weight vector w = Σ_{i=1}^{|D|} α_i y_i x_i

α_i ← 0 for all i
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if y_i Σ_{j=1}^{|D|} α_j y_j (x_j·x_i) ≤ 0 then
            α_i ← α_i + 1
            converged ← false
        end
    end
end
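A numpy sketch of this dual training loop (the Gram matrix G = XXᵀ is precomputed, as described on the next slide; the function name and epoch cap are my additions):

```python
import numpy as np

def dual_perceptron_train(X, y, max_epochs=1000):
    """Dual perceptron: learn one alpha per training instance; y in {-1, +1}."""
    m = X.shape[0]
    G = X @ X.T                      # Gram matrix of all pairwise dot products
    alpha = np.zeros(m)
    for _ in range(max_epochs):
        converged = True
        for i in range(m):
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1        # one more mistake on x_i
                converged = False
        if converged:
            break
    return alpha

# Equivalent primal weights: w = (alpha * y) @ X
```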
Perceptron classifiers in dual form
• Using the dual form of the perceptron, we estimate the values of α_i instead of w
• During training, the only information needed about the training data is all pairwise dot products: the m-by-m matrix G = XXᵀ containing these dot products is called the Gram matrix:

G(x_1, ..., x_m) = [ x_1·x_1   x_1·x_2   ...   x_1·x_m ]
                   [ x_2·x_1   x_2·x_2   ...   x_2·x_m ]
                   [    ...       ...    ...      ...  ]
                   [ x_m·x_1   x_m·x_2   ...   x_m·x_m ]
Nonlinear dual perceptron
• We can use a nonlinear mapping to map attributes into a new space consisting of combinations of attribute values:

x → φ(x)

• The perceptron decision will be:

ŷ = sign( Σ_{i=1}^{m} α_i y_i (φ(x_i)·φ(x)) )

• So the only thing we need is the dot product in the new feature space, φ(x_i)·φ(x) (also written ⟨φ(x_i), φ(x)⟩)
• Why does this matter?
The kernel trick
Let x = (x_1, x_2) and x′ = (x_1′, x_2′) be two data points, and consider the following mapping to a three-dimensional feature space:

(x_1, x_2) → (x_1², x_2², √2 x_1 x_2)

𝒳 (original feature space) ⟶ 𝒵 (new feature space)

The points in feature space corresponding to x and x′ are

z = (x_1², x_2², √2 x_1 x_2) and z′ = (x_1′², x_2′², √2 x_1′ x_2′)

The dot product of these two feature vectors is

z·z′ = x_1² x_1′² + x_2² x_2′² + 2 x_1 x_1′ x_2 x_2′ = (x_1 x_1′ + x_2 x_2′)² = (x·x′)²
The kernel trick
• By squaring the dot product in the original space we obtain the dot product in the new space without actually constructing the feature vectors! A function that calculates the dot product in feature space directly from the vectors in the original space is called a kernel – here the kernel is K(x, x′) = (x·x′)².
• In this example the order is only 2, so the computational gain may not be very obvious, but for higher orders, for example 20, the computational advantage becomes much clearer.
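A quick numpy check of this identity (the two vectors are arbitrary examples of mine):

```python
import numpy as np

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
phi = lambda v: np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

print(phi(x) @ phi(xp))   # explicit mapping, then dot product -> 1.0
print((x @ xp) ** 2)      # kernel K(x, x') = (x . x')^2       -> 1.0
```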
The kernel trick
• A valid kernel function is equivalent to a dot product in some space.
• The kernel trick helps to go to a high dimensional space without paying the price (!!), because if we find the right kernel, we do not need to explicitly map the features into the other high dimensional feature space.
Kernel function
Example: if x = (x_1, x_2) and x′ = (x_1′, x_2′), is the following function a valid kernel? If so, what is the feature map function?

K(x, x′) = (1 + x·x′)²

x·x′ = x_1 x_1′ + x_2 x_2′

(1 + x·x′)² = 1 + x_1² x_1′² + x_2² x_2′² + 2 x_1 x_1′ + 2 x_2 x_2′ + 2 x_1 x_1′ x_2 x_2′

This is the dot product of

(1, x_1², x_2², √2 x_1, √2 x_2, √2 x_1 x_2) and (1, x_1′², x_2′², √2 x_1′, √2 x_2′, √2 x_1′ x_2′)

So this is a valid kernel.
The kernel trick
• The kernel trick avoids the explicit mapping from original feature space (𝒳) to another space (𝒵).
• For x, x′ ∈ 𝒳, certain functions K(x, x′) can be expressed as a dot product in another space; such a K is called a kernel or kernel function.
• If φ(x) is the image of input x in the new (higher-dimensional) feature space, then computation becomes much simpler if we have a kernel function such that:

K(x, x′) = φ(x)·φ(x′)
The kernel trick
• A kernel function is a similarity function that corresponds to a dot product in some expanded feature space
• Some very useful kernels in machine learning are the polynomial kernel and the radial basis function (RBF) kernel
• The polynomial kernel is defined as:

K(x, x′) = (x·x′ + c)^d

• The RBF kernel is defined as:

K(x, x′) = exp( −||x − x′||² / (2σ²) )

Using a Taylor expansion, it can be shown that the RBF kernel is equivalent to mapping the features into an infinite-dimensional space.
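Both kernels are one-liners in numpy (a sketch; the parameter names c, d and sigma follow the formulas above, and the test vectors are mine):

```python
import numpy as np

def polynomial_kernel(x, xp, c=1.0, d=2):
    return (np.dot(x, xp) + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    diff = x - xp
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.0, -1.0])
print(polynomial_kernel(x, xp), rbf_kernel(x, xp))
```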
Nonlinear dual perceptron
Using the kernel trick, the nonlinear perceptron decision

ŷ = sign( Σ_{i=1}^{m} α_i y_i (φ(x_i)·φ(x)) )

can be computed in dual form using the kernel as follows:

ŷ = sign( Σ_{i=1}^{m} α_i y_i K(x_i, x) )

The same algorithm as for the linear perceptron can be used, but each dot product x_j·x_i is replaced with K(x_j, x_i).
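In code this is a one-line change to the dual perceptron sketch above: build the Gram matrix from the kernel instead of from raw dot products (a sketch; the helper name is mine):

```python
import numpy as np

def kernel_gram(X, kernel):
    """Gram matrix G[i, j] = K(x_i, x_j) for an arbitrary kernel function."""
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# e.g. G = kernel_gram(X, lambda a, b: (a @ b) ** 2)   # quadratic kernel
# Then run the dual perceptron loop above with this G, and predict with
#   y_hat = np.sign(sum(alpha[i] * y[i] * K(X[i], x_new) for i in range(len(y))))
```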
Issues in linear classification
Different classification learning algorithms use different criteria:
• Basic linear classifier finds class means (centroids), joins them by a straight line, and its perpendicular bisector is the separating hyperplane
• Perceptron training uses iterative reweighting (gradient descent)
• Perceptron may find different models depending on starting conditions
Issues in linear classification
Which line is a better classifier?
What is the advantage?
Issues in linear classification
Which line is a better classifier? – The line with the bigger margin
• Why is a bigger margin better?
• Can I find a 𝑤 that maximizes the margin?
Issues in linear classification
Is there an optimal linear classification learning method?
Answer: yes, under Vapnik's framework for statistical learning, which leads to the Support Vector Machine (SVM)
– define the empirical risk, or error on the data
– maximum margin separating hyperplane
– minimizes empirical risk
– SVM lets us make a trade-off between model complexity and the error
– unique solution
– structural risk minimization
Support Vector Machine
Support vector machine
The decision boundary learned by a support vector machine maximizes the margin, which is indicated by the dotted lines. The circled data points are the support vectors.
Support vector machine
• Support vector machines (machine ≡ algorithm) learn linear classifiers
• Can avoid overfitting – learn a form of decision boundary called the maximum margin hyperplane
• Fast for mappings to nonlinear spaces
– employ a mathematical trick (kernel) to avoid the actual creation of new “pseudo-attributes” in transformed instance space
– i.e. the nonlinear space is created implicitly
Support vector machine
The geometry of a support vector classifier. The circled data points are the support vectors, which are the training examples nearest to the decision boundary. The support vector machine finds the decision boundary that maximizes the margin 𝑚/||𝑤||.
Training a support vector machine
• Learning problem: fit maximum margin hyperplane, i.e. a kind of linear model
• For a linearly separable two-class data set the maximum margin hyperplane is the classification surface which
– correctly classifies all examples in the data set
– has the greatest separation between classes
Support vector machine
Let x* be the closest point to the separating hyperplane (a line in 2D), where the hyperplane has equation:

w·x = t

Let's use two minor technicalities to simplify the math later:

1. Pull out w_0: write w = [w_1, ..., w_n] with w_0 = −t, so the hyperplane becomes w·x − t = 0.

2. Normalize w: we know that |w·x* − t| > 0, and we can scale w and t together without having any effect on the hyperplane, so we choose the scale such that:

|w·x* − t| = 1

This means m = 1 (the numerator of the margin m/||w||).
Support vector machine
• We have the hyperplane equation: w·x − t = 0
• And we have the constraint |w·x* − t| = 1, where x* is the closest point to the separating hyperplane
• w is perpendicular to the hyperplane. How?
For every two points x′ and x″ on the hyperplane, we can write: w·x′ − t = 0 and w·x″ − t = 0
⟹ w·(x′ − x″) = 0
When the dot product of two vectors is zero, they are perpendicular; since x′ − x″ is an arbitrary direction along the hyperplane, w is perpendicular to the hyperplane.
Support vector machine
From geometry, we know that the distance between the point x* and the hyperplane w·x − t = 0 is:

distance = |w·x* − t| / ||w||

And we have |w·x* − t| = 1, so:

distance = 1 / ||w||

This distance is the margin of our classifier, which we want to maximize:

max_w 1/||w||   subject to   min_{i=1,...,m} |w·x_i − t| = 1

This is not a friendly optimisation, as it has a "min" in the constraint!
Support vector machine
Our focus is on the points which the hyperplane predicts correctly, so we can write:

|w·x_i − t| = y_i (w·x_i − t)

and we can replace the "min" in the constraint as follows:

y_i (w·x_i − t) ≥ 1   for i = 1, ..., m

We can also transform the maximization problem into the following minimization problem:

min_{w,t} ½ ||w||²   subject to   y_i (w·x_i − t) ≥ 1 for i = 1, ..., m,   w ∈ ℝⁿ, t ∈ ℝ

How to solve this? Lagrangian multipliers.
Lagrangian multipliers
• Constrained optimization problems are generally expressed as:

min_{x_1,...,x_n} f(x_1, ..., x_n)

subject to:

g_1(x_1, ..., x_n) ≤ 0,  g_2(x_1, ..., x_n) ≤ 0,  ...,  g_k(x_1, ..., x_n) ≤ 0

• Lagrange multiplier methods involve the modification of the objective function through the addition of terms that describe the constraints. The objective function f(x) is augmented by the constraint functions through a set of non-negative multiplicative Lagrange multipliers, α_j ≥ 0, giving the Lagrangian:

L(x_1, ..., x_n, α_1, ..., α_k) = f(x_1, ..., x_n) + Σ_{j=1}^{k} α_j g_j(x_1, ..., x_n)
Lagrangian multipliers
In Lagrangian form, the optimization problem becomes:

max_α min_{x_1,...,x_n} L(x_1, ..., x_n, α_1, ..., α_k)   such that α_j ≥ 0 ∀ j

For the SVM problem, the solution has α_i > 0 only for the support vectors: the training examples nearest to the decision boundary.
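To connect this to the SVM primal above (a standard derivation, sketched here to bridge the step to the dual problem on the next slide): write the constraints as g_i(w, t) = 1 − y_i (w·x_i − t) ≤ 0, so the Lagrangian is

L(w, t, α_1, ..., α_m) = ½ ||w||² + Σ_{i=1}^{m} α_i (1 − y_i (w·x_i − t))

Setting the partial derivatives with respect to w and t to zero gives

∂L/∂w = w − Σ_i α_i y_i x_i = 0  ⟹  w = Σ_i α_i y_i x_i
∂L/∂t = Σ_i α_i y_i = 0

and substituting these back into L eliminates w and t, leaving the dual objective −½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) + Σ_i α_i that appears on the next slide.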
SVM in dual form
• The dual optimization problem for support vector machines is to maximize the dual Lagrangian under positivity constraints and one equality constraint:

α_1*, ..., α_m* = argmax_{α_1,...,α_m}  −½ Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j (x_i·x_j) + Σ_{i=1}^{m} α_i

subject to α_i ≥ 0 for 1 ≤ i ≤ m, and Σ_{i=1}^{m} α_i y_i = 0
Finding support vectors
α_1*, ..., α_m* = argmin_{α_1,...,α_m}  ½ Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j (x_i·x_j) − Σ_{i=1}^{m} α_i

subject to α_i ≥ 0 for 1 ≤ i ≤ m, and Σ_{i=1}^{m} α_i y_i = 0
• Determining parameters is a constrained quadratic optimization problem
• standard algorithms, or special-purpose algorithms (usually faster, e.g. Platt’s sequential minimal optimization (SMO), or LibSVM)
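In practice the QP is rarely coded by hand; a hedged example using scikit-learn's SVC, which wraps LibSVM (the toy data below is mine):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)    # linearly separable toy data
clf.fit(X, y)

print(clf.support_vectors_)          # the support vectors
print(clf.dual_coef_)                # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)     # w and the bias (linear kernel only)
print(clf.predict([[4, 4]]))
```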
Training a support vector machine
When you solve this optimization problem and get α_1*, ..., α_m* back, you will see that most of them are zero; α_i is non-zero (and positive) only for the points closest to the separating hyperplane. Those points are called the support vectors.

w = Σ_{x_i ∈ support vectors} α_i y_i x_i

Solve for t using any of the support vectors:

y_i (w·x_i − t) = 1

It can be shown that the error is bounded by the number of support vectors, irrespective of dimensionality.

Note: all this assumes separable data!
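Given the multipliers, recovering w and t takes a few lines of numpy (a sketch; alpha is assumed to come from a QP/SMO solver such as the ones above, and y is in {−1, +1}):

```python
import numpy as np

def recover_w_t(alpha, X, y, tol=1e-8):
    """Recover primal w and t from dual multipliers alpha."""
    w = (alpha * y) @ X               # w = sum_i alpha_i y_i x_i
    sv = np.where(alpha > tol)[0]     # indices of the support vectors
    i = sv[0]                         # any support vector will do
    t = np.dot(w, X[i]) - y[i]        # from y_i (w . x_i - t) = 1
    return w, t
```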
Training a support vector machine
• The more “separated” the classes, the larger the margin, the better the generalization
• Instances closest to maximum margin hyperplane are support vectors
• Important observation: support vectors define maximum margin hyperplane!
– All other instances can be deleted without changing position and orientation of the hyperplane!
Two maximum-margin classifiers
(left) A maximum-margin classifier built from three examples, with w = (0, −1/2) and margin 2. The circled examples are the support vectors: they receive non-zero Lagrange multipliers and define the decision boundary. (right) By adding a second positive example, the decision boundary is rotated to w = (3/5, −4/5) and the margin decreases to 1.
Two maximum-margin classifiers
X  = [  1   2 ]      y = [ −1 ]      X′ = [ −1  −2 ]
     [ −1   2 ]          [ −1 ]           [  1  −2 ]
     [ −1  −2 ]          [ +1 ]           [ −1  −2 ]

The matrix X′ on the right incorporates the class labels; i.e., the rows are y_i x_i. The Gram matrices are (without and with class labels):

X Xᵀ = [  5   3  −5 ]      X′ X′ᵀ = [ 5  3  5 ]
       [  3   5  −3 ]               [ 3  5  3 ]
       [ −5  −3   5 ]               [ 5  3  5 ]
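These Gram matrices are easy to check in numpy:

```python
import numpy as np

X = np.array([[1, 2], [-1, 2], [-1, -2]])
y = np.array([-1, -1, 1])

Xp = y[:, None] * X          # rows are y_i * x_i
print(X @ X.T)               # Gram matrix without class labels
print(Xp @ Xp.T)             # Gram matrix with class labels
```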
Two maximum-margin classifiers
The dual optimization problem is thus
argmax_{α_1,α_2,α_3}  −½ (5α_1² + 3α_1α_2 + 5α_1α_3 + 3α_2α_1 + 5α_2² + 3α_2α_3 + 5α_3α_1 + 3α_3α_2 + 5α_3²) + α_1 + α_2 + α_3

subject to α_1, α_2, α_3 ≥ 0 and −α_1 − α_2 + α_3 = 0.

Soft margins
In the hard margin solution of the four-point example, α_1 = 1/2, so for C ≥ 1/2 no margin errors are tolerated. For C = 1/2 we have α_1 = C, and hence for C < 1/2, x_1 becomes a margin error and the optimal classifier is a soft margin classifier. Effectively, with decreasing C the decision boundary and the upper margin move upward, while the lower margin stays the same.
The upper margin reaches x_2 for C = 5/16, at which point we have w = (3/8, −1/2) and the margin has increased to 1.6. Furthermore, we have ξ_1 = 6/8, α_1 = C = 5/16, α_2 = 0, α_3 = 1/16 and α_4 = 1/4.
Soft margins
• If we now decrease C further, the decision boundary starts to rotate clockwise, so that x_4 becomes a margin error as well, and only x_2 and x_3 are support vectors. The boundary rotates until C = 1/10, at which point we have w = (1/5, −1/2), t = −1/5 and the margin has increased to 1.86. Furthermore, we have ξ_1 = 4/10 and ξ_4 = 7/10, and all multipliers have become equal to C.
• Finally, when C decreases further the decision boundary stays where it is, but the norm of the weight vector gradually decreases and all points become margin errors.
Soft margins
NOTE: a minimal-complexity soft margin classifier summarizes the classes by their class means in a way very similar to the basic linear classifier.
Sparse data
• SVM algorithms can be sped up dramatically if the data is sparse (i.e. many values are 0)
• Why? Because they compute lots and lots of dot products
• With sparse data, dot products can be computed very efficiently
• Just need to iterate over the values that are non-zero (see the sketch below)
• SVMs can process sparse datasets with tens of thousands of attributes
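A small illustration with scipy's sparse matrices (the toy data is mine; the point is simply that the dot products only ever touch the non-zero entries):

```python
import numpy as np
from scipy.sparse import csr_matrix

# 3 instances, 10,000 attributes, almost all zero
rows = np.array([0, 1, 2])
cols = np.array([5, 5, 9999])
vals = np.array([1.0, 2.0, 3.0])
X = csr_matrix((vals, (rows, cols)), shape=(3, 10_000))

G = X @ X.T                  # Gram matrix of dot products, computed sparsely
print(G.toarray())
```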
SVM Applications
• Machine vision: e.g. face identification
– Prior to deep learning, achieved the lowest error
• Handwritten digit recognition:
– Comparable to best alternative
• Bioinformatics: e.g. prediction of protein secondary structure, microarray classification
• Text classification
• Algorithm can be modified to deal with numeric prediction problems – support vector regression
Example - pedestrian detection
Summary: Learning with Kernel Methods
• Kernel methods have been around for a long time in statistics
• Kernelization is a "modular" approach to machine learning
• Algorithms that can be kernelized can learn different model classes simply by changing the kernel
• SVMs exemplify this – mostly for classification (but also regression, "one-class" classification, etc.)
• SVMs one of the most widely used “off-the-shelf” classifier learning methods, especially for “small 𝑚 (examples), large 𝑛 (dimensionality)” classification problems
Acknowledgements
• “Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997) http://www- 2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani Goa Campus, India (2016)