Lecture 9. Support Vector Machines
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Support vector machines (SVMs) as maximum-margin classifiers
• Deriving hard-margin SVM objective
• SVM objective as regularised loss function
Maximum-Margin Classifier: Motivation
A new twist to binary linear classification
Beginning: linear SVMs
• In the first part, we will consider a basic setup of SVMs: the linear hard-margin SVM. These terms will not mean much yet, but we will return to them later in the lecture
• Keep in mind: SVMs are more powerful than they may initially appear
• For now, we model the data as linearly separable, i.e., there exists a hyperplane perfectly separating the classes
• We will consider training using all data at once
SVM is a linear binary classifier
Predict class A if $s \ge 0$
Predict class B if $s < 0$
where $s = b + \sum_{i=1}^{m} x_i w_i$
SVM is a linear classifier: $s$ is a linear function of the inputs, and the separating boundary is linear
[Figure: example for 2D data – a plane with the data points $(x_1, x_2)$, a plane with the values of $s$, and the separating boundary]
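To make this concrete, here is a minimal sketch of the decision rule in Python with NumPy (the parameter values are hypothetical, not from the slides):

```python
import numpy as np

def svm_predict(w, b, x):
    """Linear decision rule: s = b + sum_i x_i * w_i.
    Predict class A (+1) if s >= 0, class B (-1) otherwise."""
    s = b + np.dot(w, x)
    return 1 if s >= 0 else -1

# Hypothetical parameters and a 2D point
w = np.array([1.0, -2.0])
b = 0.5
print(svm_predict(w, b, np.array([3.0, 1.0])))   # s = 0.5 + 3 - 2 = 1.5 -> +1
```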
SVM and the perceptron
• Previous slide taken from the perceptron lecture
• Given learned parameter values, an SVM makes predictions exactly like a perceptron
• How SVM is different: the way parameters are learned
∗ Perceptron: min perceptron loss as studied earlier
∗ SVMs: different criterion for choosing parameters
Choosing separation boundary
• An SVM is a linear binary classifier: choosing parameters means choosing a separating boundary (hyperplane)
• In 2D:
[Figure: three candidate separating lines A, B, C in the $(x_1, x_2)$ plane]
Which boundary should we use?
• Provided the dataset is linearly separable, the perceptron will find a boundary that separates the classes perfectly. This can be any such boundary, e.g., A or B
[Figure: lines A and B both separate the classes in the $(x_1, x_2)$ plane]
For the perceptron, all such boundaries are equally good, because the perceptron loss is zero for each of them.
But they don't look equally good to us. Line A seems more reliable: when a new data point arrives, line B is more likely to misclassify it
Aiming for the safest boundary
• Intuitively, the most reliable boundary would be the one that is between the classes and as far away from both classes as possible
SVM objective captures this observation
SVMs aim to find the separation boundary that maximises the margin between the classes
Maximum-margin classifier
• An SVM is a linear binary classifier. SVM training aims to find the separating boundary that maximises margin
• For this reason, SVMs are also known as maximum-margin classifiers
• The training data is fixed, so the margin is defined by the location and orientation of the separating boundary which, of course, are defined by SVM parameters
• Our next step is to formalise our objective by expressing margin width as a function of parameters (and data)
Maximum-Margin Classifier: Derivation
A geometric derivation of the SVM’s objective
Margin width
• While the margin can be thought of as the space between the two margin boundaries (dashed lines), it is more convenient to define margin width as the distance between the separating boundary and the nearest data point(s)
This separating boundary is exactly “between the classes”: distances to the nearest red and blue points are the same
Question: this must be so for the maximum-margin boundary. Why?
Point(s) on the margin boundaries are called support vectors
We want to maximise the distance to support vectors
However, before doing this, let’s derive the expression for distance to an arbitrary point
Distance from point to hyperplane 1/3
• Consider an arbitrary point $X$ (from either of the classes, and not necessarily the closest one to the boundary), and let $X_p$ denote the projection of $X$ onto the separating boundary
• Now, let $\mathbf{r}$ be the vector $X_p - X$. Note that $\mathbf{r}$ is perpendicular to the boundary, and also that $\|\mathbf{r}\|$ is the required distance
[Figure: point $X$, its projection $X_p$ onto the boundary, the vector $\mathbf{r}$ between them, and the normal vector $\mathbf{w}$, drawn in the $(x_1, x_2)$ plane]
The separating boundary is defined by parameters $\mathbf{w}$ and $b$.
From our linear algebra slides, recall that $\mathbf{w}$ is a vector normal (perpendicular) to the boundary.
In the figure, $\mathbf{w}$ is drawn from an arbitrary starting point on the boundary.
Remember that $\|\mathbf{w}\| = \sqrt{w_1^2 + \cdots + w_m^2}$
Distance from point to hyperplane 2/3
• Next, points $X$ and $X_p$ can be viewed as vectors $\mathbf{x}$ and $\mathbf{x}_p$. Trivially, $\mathbf{r} = \|\mathbf{r}\| \frac{\mathbf{w}}{\|\mathbf{w}\|}$ (vectors $\mathbf{r}$ and $\mathbf{w}$ are parallel, but not generally of the same length)
• By vector addition, we have that $\mathbf{x} + \mathbf{r} = \mathbf{x}_p$, or $\mathbf{x} + \|\mathbf{r}\| \frac{\mathbf{w}}{\|\mathbf{w}\|} = \mathbf{x}_p$
Now let's multiply both sides of this equation by $\mathbf{w}'$ and also add $b$:
$\mathbf{w}'\mathbf{x} + b + \|\mathbf{r}\| \frac{\mathbf{w}'\mathbf{w}}{\|\mathbf{w}\|} = \mathbf{w}'\mathbf{x}_p + b$
Since $\mathbf{x}_p$ lies on the boundary, we have $\mathbf{w}'\mathbf{x} + b + \|\mathbf{r}\| \frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} = 0$
Distance is $\|\mathbf{r}\| = -\frac{\mathbf{w}'\mathbf{x} + b}{\|\mathbf{w}\|}$
[Figure: point $\mathbf{x}$, its projection $\mathbf{x}_p$, the vector $\mathbf{r}$ between them, and the normal $\mathbf{w}$]
Distance from point to hyperplane 3/3
• However, if we took our point from the other side of the boundary, vectors $\mathbf{r}$ and $\mathbf{w}$ would be anti-parallel, giving us $\mathbf{r} = -\|\mathbf{r}\| \frac{\mathbf{w}}{\|\mathbf{w}\|}$
• In this case, the distance is $\|\mathbf{r}\| = \frac{\mathbf{w}'\mathbf{x} + b}{\|\mathbf{w}\|}$
We will return to this fact shortly, and for now we combine the two cases in the following result:
Distance is $\|\mathbf{r}\| = \pm\frac{\mathbf{w}'\mathbf{x} + b}{\|\mathbf{w}\|}$
[Figure: a point $\mathbf{x}$ on the other side of the boundary, with $\mathbf{r}$ and $\mathbf{w}$ anti-parallel]
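As a sanity check of this result, here is a small sketch (hypothetical values, not from the slides) that computes the distance from a point to the hyperplane $\mathbf{w}'\mathbf{x} + b = 0$, taking the absolute value to cover both signs:

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w'x + b = 0:
    |w'x + b| / ||w|| (the sign tells us which side x is on)."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical example in 2D
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
print(distance_to_hyperplane(w, b, np.array([2.0, 1.0])))  # |6 + 4 - 5| / 5 = 1.0
```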
Encoding the side using labels
• Training data is a collection $\{\mathbf{x}_i, y_i\}$, $i = 1, \dots, n$, where each $\mathbf{x}_i$ is an $m$-dimensional instance and $y_i$ is the corresponding binary label encoded as $-1$ or $1$
• Given a perfect separation boundary, the $y_i$ encode the side of the boundary each $\mathbf{x}_i$ is on
• Thus the distance from the $i$-th point to a perfect boundary can be encoded as
$\|\mathbf{r}_i\| = \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$
Maximum margin objective
• The distance from the $i$-th point to a perfect boundary can be encoded as $\|\mathbf{r}_i\| = \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$
• The margin width is the distance to the closest point
• Thus SVMs aim to maximise $\min_{i=1,\dots,n} \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$ as a function of $\mathbf{w}$ and $b$
Do you see any problems with this objective?
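Combining the last two slides, here is a minimal sketch (toy data, purely illustrative) of the label-encoded distances and the margin width they define:

```python
import numpy as np

def margin_width(w, b, X, y):
    """Label-encoded distances r_i = y_i (w'x_i + b) / ||w||.
    For a boundary that separates perfectly, all r_i > 0;
    the margin width is the smallest of them."""
    r = y * (X @ w + b) / np.linalg.norm(w)
    return r.min()

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(margin_width(np.array([1.0, 1.0]), 0.0, X, y))  # ~1.414: distance to the closest point [-1, -1]
```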
Non-unique representation
• A separating boundary (e.g., a line in 2D) is a set of points that satisfy $\mathbf{w}'\mathbf{x} + b = 0$ for some given $\mathbf{w}$ and $b$
• However, the same set of points will also satisfy $\tilde{\mathbf{w}}'\mathbf{x} + \tilde{b} = 0$, with $\tilde{\mathbf{w}} = \alpha\mathbf{w}$ and $\tilde{b} = \alpha b$, where $\alpha > 0$ is arbitrary
The same boundary, and essentially the same classifier, can be expressed with infinitely many parameter combinations – that diverge!
Resolving ambiguity
• Consider a “candidate” separating line. Which parameter combinations should we choose to represent it?
∗ As humans, we do not really care
∗ Maths/machines require a precise answer
• A possible way to resolve the ambiguity: measure the distance to the closest point ($i^*$), and rescale the parameters such that
$\frac{y_{i^*}(\mathbf{w}'\mathbf{x}_{i^*} + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$
• For a given “candidate” boundary, and fixed training points there will be only one way of scaling 𝒘𝒘 and 𝑏𝑏 in order to satisfy this requirement
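This rescaling can be illustrated with a short sketch (hypothetical numbers, not from the slides): scaling $(\mathbf{w}, b)$ by any $\alpha > 0$ leaves the predictions unchanged, and exactly one $\alpha$ makes the closest point satisfy the requirement:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0          # a hypothetical separating boundary

scores = y * (X @ w + b)                  # y_i (w'x_i + b); all positive if separation is perfect
alpha = 1.0 / scores.min()                # the one rescaling factor that satisfies the requirement
w_hat, b_hat = alpha * w, alpha * b

# Same boundary and same predictions (signs unchanged) ...
print(np.array_equal(np.sign(X @ w + b), np.sign(X @ w_hat + b_hat)))  # True
# ... but now the closest point i* satisfies y_i* (w'x_i* + b) = 1
print((y * (X @ w_hat + b_hat)).min())                                  # 1.0
```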
Constraining the objective
• SVMs aim to maximise $\min_{i=1,\dots,n} \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$
• Introduce the (arbitrary) extra requirement $\frac{y_{i^*}(\mathbf{w}'\mathbf{x}_{i^*} + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$
∗ $i^*$ denotes the index of the example closest to the boundary
• Under this requirement the margin width is exactly $\frac{1}{\|\mathbf{w}\|}$, so maximising the margin is equivalent to minimising $\|\mathbf{w}\|$. The SVM thus aims to find
$\mathrm{argmin}_{\mathbf{w}}\ \|\mathbf{w}\|$
s.t. $y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1$ for $i = 1, \dots, n$
Hard margin SVM objective
We now have a major result: SVMs aim to find
$\mathrm{argmin}_{\mathbf{w}}\ \|\mathbf{w}\|$
s.t. $y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1$ for $i = 1, \dots, n$
[Figure: the margin boundaries lie at distance $\frac{1}{\|\mathbf{w}\|}$ on either side of the separating boundary in the $(x_1, x_2)$ plane]
Note 1: parameter $b$ is optimised indirectly by influencing the constraints
Note 2: all points are enforced to be on or outside the margin. Therefore, this version of SVM is called hard-margin SVM
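For concreteness, here is one way this objective could be solved numerically. The sketch below uses the cvxpy convex-optimisation library and toy data (both assumptions of this example, not part of the lecture); minimising $\|\mathbf{w}\|^2$ is equivalent to minimising $\|\mathbf{w}\|$ and yields a standard quadratic program:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (hypothetical); labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
n, m = X.shape

w = cp.Variable(m)
b = cp.Variable()

# Hard-margin SVM: minimise ||w||^2 subject to y_i (w'x_i + b) >= 1 for all i
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w:", w.value, "b:", b.value)
print("margin width:", 1 / np.linalg.norm(w.value))   # distance from boundary to the support vectors
```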
Geometry of SVM training 1/2
• Training a linear SVM essentially means moving/rotating plane B so that the separating boundary changes
• This is achieved by changing $w_i$ and $b$
[Figure: plane A – data in $(x_1, x_2)$; plane B – values of $b + \sum_{i=1}^{m} x_i w_i$; the separating boundary]
Geometry of SVM training 2/2
• However, we can also rotate plane B along the separating boundary
• In this case, the boundary does not change
• Same classifier!
• The additional requirement fixes the angle between planes A and B to a particular constant
[Figure: plane A – data; plane B – values of $b + \sum_{i=1}^{m} x_i w_i$; the separating boundary stays fixed while plane B rotates about it]
SVM Objective as Regularised Loss
Relating the resulting objective function to that of other machine learning methods
Previously in COMP90051 …
1. Choose/design a model
2. Choose/design loss function
3. Find parameter values that minimise discrepancy on training data
A loss function measures discrepancy between prediction and true value for a single example
Training error is the average loss over all training examples
We defined loss functions for the perceptron and the ANN, and aimed to minimise the loss during training
But how do SVMs fit this pattern?
SVM as Regularised ERM
• Recall the ridge regression objective:
minimise $\sum_{i=1}^{n} (y_i - \mathbf{w}'\mathbf{x}_i)^2 + \lambda\|\mathbf{w}\|_2^2$
(the sum is the data-dependent training error; $\lambda\|\mathbf{w}\|_2^2$ is the data-independent regularisation term)
• Hard-margin SVM objective:
$\mathrm{argmin}_{\mathbf{w}}\ \|\mathbf{w}\|$ s.t. $y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1$ for $i = 1, \dots, n$
• The constraints can be interpreted as the loss
$l_\infty = \begin{cases} 0 & 1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \le 0 \\ \infty & 1 - y_i(\mathbf{w}'\mathbf{x}_i + b) > 0 \end{cases}$
Hard margin SVM loss
• The constraints can be interpreted as the loss
$l_\infty = \begin{cases} 0 & 1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \le 0 \\ \infty & 1 - y_i(\mathbf{w}'\mathbf{x}_i + b) > 0 \end{cases}$
• In other words, for each point:
∗ If it's on the right side of the boundary and at least $\frac{1}{\|\mathbf{w}\|}$ units away from the boundary, we're OK: the loss is 0
∗ If the point is on the wrong side, or too close to the boundary, we immediately give infinite loss, thus prohibiting such a solution altogether
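Here is a minimal sketch of this $0/\infty$ loss on toy (hypothetical) data, showing that a boundary violating any constraint receives infinite loss:

```python
import numpy as np

def hard_margin_loss(w, b, X, y):
    """Per-point hard-margin 'loss': 0 if 1 - y_i (w'x_i + b) <= 0, infinity otherwise."""
    slack = 1 - y * (X @ w + b)
    return np.where(slack <= 0, 0.0, np.inf)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

print(hard_margin_loss(np.array([0.5, 0.5]), 0.0, X, y))   # all constraints met: [0. 0. 0. 0.]
print(hard_margin_loss(np.array([0.1, 0.1]), 0.0, X, y))   # points too close to the boundary: inf entries
```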
This lecture
• Support vector machines (SVMs) as maximum-margin classifiers
• Deriving hard-margin SVM objective
• SVM as regularised ERM
• Next lecture: Soft-margin SVM