Slide 1
BUSINESS SCHOOL
Discipline of Business Analytics
QBUS6850 Team
2
Topics covered
Support Vector Machine (SVM)
Kernel method
References
Friedman et al. (2001), Chapters 12.1–12.3
James et al. (2014), Chapter 9
Bishop (2006), Chapter 7.1
Alpaydin (2014), Chapter 13
3
Learning Objectives
Understand the intuition behind the Support Vector Machine
Understand the loss function of the SVM
Understand how the SVM works
Understand the Gaussian kernel and its decision boundary
Understand how to combine the kernel method with the SVM
4
What we learnt
Supervised learning algorithms:
Linear regression
Logistic regression for classification
k-NN
Unsupervised learning algorithms:
K-means
PCA
Loss function optimization, cross validation and regularization
Other important skills:
Skills in applying these algorithms, e.g. choice of the algorithm
Choice of the features you design to give to the learning algorithms
Choice of the regularization parameter
The amount of data you really need
5
6
Support Vector Machine
The Support Vector Machine (SVM) algorithm was first introduced in the mid-1990s.
It is one of the most powerful and most widely used ‘black box’ learning algorithms today.
It offers a powerful way of learning complex non-linear functions, thanks to a cleverly chosen optimization objective.
Support vector machines (SVMs) are an important machine learning method with many applications.
In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
7
Support Vector Machine
A straightforward engineering solution for classification tasks.
Support vector machines have nice computational properties.
Key idea:
Construct a separating hyperplane in a high-dimensional feature
space.
Maximize separability.
Express the hyperplane in the original space using a small set of
training vectors, the “support vectors”.
Links to a wealth of material (books, software, etc.) on SVMs can be found at http://www.support-vector-machines.org/
Goal: construct a separating hyperplane that maximizes the margin
of separation.
8
Intuition
[Figure: two-class training data plotted in the $(x_1, x_2)$ plane]
9
Intuition
[Figure: the same two-class data in the $(x_1, x_2)$ plane, with more than one line that could separate the classes]
Why?
10
Margin
Largest Margin with SVM
[Figure: the largest-margin decision boundary in the $(x_1, x_2)$ plane, with the margin drawn as the widest possible "street" between the two classes]
How? "The widest street approach."
If the training data are linearly separable, the SVM can select two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin". Observations on the margin are called the support vectors.
11
12
Calculating Margin
[Figure: a point $\mathbf{x}_n$ and the decision boundary $\beta_1 x_1 + \beta_2 x_2 + \beta_0 = 0$ in the $(x_1, x_2)$ plane]
How do we calculate the distance from $\mathbf{x}_n$ to the decision boundary (which is half of the margin)?
$$\text{dist} = \frac{|\beta_1 x_{n1} + \beta_2 x_{n2} + \beta_0|}{\sqrt{\beta_1^2 + \beta_2^2}} = \frac{|\boldsymbol{\beta}^T \mathbf{x}_n + \beta_0|}{\lVert\boldsymbol{\beta}\rVert}$$
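As a quick numerical check, a minimal NumPy sketch of this distance calculation (the point and coefficients below are made up for illustration):

```python
import numpy as np

# Hypothetical decision boundary: beta1*x1 + beta2*x2 + beta0 = 0
beta = np.array([1.0, 2.0])   # (beta1, beta2)
beta0 = -3.0
x_n = np.array([4.0, 1.0])    # an arbitrary point

# Distance from x_n to the hyperplane: |beta^T x_n + beta0| / ||beta||
dist = np.abs(beta @ x_n + beta0) / np.linalg.norm(beta)
print(dist)  # 3 / sqrt(5) ≈ 1.342
```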
13
Decision Boundary Equation
The boundary equation $\beta_1 x_1 + \beta_2 x_2 + \beta_0 = 0$ (more generally $\boldsymbol{\beta}^T\mathbf{x} + \beta_0 = 0$) can equally be written as $c\beta_1 x_1 + c\beta_2 x_2 + c\beta_0 = 0$ for any constant $c \neq 0$.
That means we can rescale the $\beta$'s such that any point $\mathbf{x}_n = (x_{n1}, x_{n2})$ on the two dashed blue lines (the margin boundaries) satisfies $|\beta_1 x_{n1} + \beta_2 x_{n2} + \beta_0| = 1$.
The margin then becomes
$$\text{margin} = \frac{2}{\lVert\boldsymbol{\beta}\rVert}$$
Define labels $t_n = 1$ if $\mathbf{x}_n$ is on the green side and $t_n = -1$ if $\mathbf{x}_n$ is on the red side. Then, no matter which side it is on, a point on the dashed margin boundaries can be written as
$$t_n(\beta_1 x_{n1} + \beta_2 x_{n2} + \beta_0) = 1 \qquad [\text{in general } t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) = 1]$$
If $\mathbf{x}_n$ is correctly classified and not on the dashed margin boundaries, we have
$$t_n(\beta_1 x_{n1} + \beta_2 x_{n2} + \beta_0) > 1 \qquad [\text{in general } t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) > 1]$$
14
SVM Formulation
Given a dataset $\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \dots, (\mathbf{x}_N, t_N)\}$ where $t_n = 1$ or $-1$ (two classes).
We assume the data are separable by a hyperplane.
SVM learning is then defined as
$$\min_{\boldsymbol{\beta},\, \beta_0} \; \frac{1}{2}\lVert\boldsymbol{\beta}\rVert^2 \qquad \text{subject to} \qquad t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) \ge 1, \;\; \text{for all } n = 1, 2, \dots, N$$
The above problem is a constrained optimization; it cannot be solved by gradient descent directly.
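Because this is a constrained quadratic program rather than something gradient descent handles directly, one option is a generic convex solver. A minimal sketch using cvxpy (an assumption: it is not the package used in the lecture examples) on made-up separable data:

```python
import numpy as np
import cvxpy as cp

# Toy, linearly separable data: two clusters labelled +1 / -1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, size=(20, 2)),
               rng.normal([-2, -2], 0.3, size=(20, 2))])
t = np.hstack([np.ones(20), -np.ones(20)])

beta = cp.Variable(2)
beta0 = cp.Variable()

# min (1/2)||beta||^2  s.t.  t_n (beta^T x_n + beta0) >= 1 for all n
objective = cp.Minimize(0.5 * cp.sum_squares(beta))
constraints = [cp.multiply(t, X @ beta + beta0) >= 1]
cp.Problem(objective, constraints).solve()

print(beta.value, beta0.value)
print("margin width:", 2 / np.linalg.norm(beta.value))
```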
15
SVM Formulation
[Figure: data in the $(x_1, x_2)$ plane with the decision boundary $\beta_1 x_1 + \beta_2 x_2 + \beta_0 = 0$, the margin boundaries, and a point $\mathbf{x}_n$ lying inside the margin with slack $\xi_n > 0$]
What can we do if some data are not trusted and we wish for a wider margin? Equivalently, to get a wider margin we may allow some data to fall inside the margin boundaries.
Denote by $\xi_n > 0$ the distance of such a point to its margin boundary; its (scaled) distance to the decision boundary is then $1 - \xi_n$.
16
SVM Formulation
Given a dataset $\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \dots, (\mathbf{x}_N, t_N)\}$ where $t_n = 1$ or $-1$ (two classes).
The general SVM learning problem (P1) is
$$\min_{\boldsymbol{\beta},\, \beta_0,\, \xi_n \ge 0} \; \frac{1}{2}\lVert\boldsymbol{\beta}\rVert^2 + C\sum_{n=1}^{N}\xi_n$$
$$\text{subject to} \quad t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) \ge 1 - \xi_n \quad \text{and} \quad \xi_n \ge 0, \;\; \text{for all } n = 1, 2, \dots, N$$
In the literature this is called the primal problem of the SVM. It is not easy to solve directly.
By introducing Lagrange multipliers $\alpha_n$ for the first set of $N$ conditions, we can instead solve its dual problem, which is a quadratic programming problem.
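In practice this soft-margin problem is rarely coded by hand; a library solver is used. A minimal scikit-learn sketch (toy data of my own) that fits a linear soft-margin SVM and exposes the fitted $\boldsymbol{\beta}$ and $\beta_0$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, size=(30, 2)),
               rng.normal([-2, -2], 1.0, size=(30, 2))])
t = np.hstack([np.ones(30), -np.ones(30)])

# C controls the penalty on the slack variables xi_n
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, t)

print("beta:", clf.coef_)          # fitted weight vector
print("beta0:", clf.intercept_)    # fitted intercept
print("margin width:", 2 / np.linalg.norm(clf.coef_))
```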
17
Quadratic Programming Problem
The dual problem of the SVM (D1) is
$$\max_{\boldsymbol{\alpha}} \;\; -\frac{1}{2}\sum_{m,n=1}^{N} t_m t_n \mathbf{x}_m^T\mathbf{x}_n\, \alpha_m\alpha_n + \sum_{n=1}^{N}\alpha_n$$
$$\text{subject to} \quad 0 \le \alpha_n \le C \;\; \text{for all } n = 1, 2, \dots, N, \qquad \sum_{n=1}^{N}\alpha_n t_n = 0$$
This can be solved efficiently for all the $\alpha_n$'s, and the solution is given by
$$\boldsymbol{\beta} = \sum_{n=1}^{N}\alpha_n t_n\mathbf{x}_n, \qquad \beta_0 = -\frac{\max_{\{n:\, t_n = -1\}}\boldsymbol{\beta}^T\mathbf{x}_n + \min_{\{n:\, t_n = 1\}}\boldsymbol{\beta}^T\mathbf{x}_n}{2}$$
and the model is
$$f(\mathbf{x}, \boldsymbol{\beta}) = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 = \sum_{n=1}^{N} t_n\alpha_n\mathbf{x}_n^T\mathbf{x} + \beta_0$$
If $f(\mathbf{x}, \boldsymbol{\beta}) > 0$, then $\mathbf{x}$ belongs to the $+1$ class; if $f(\mathbf{x}, \boldsymbol{\beta}) < 0$, then $\mathbf{x}$ belongs to the $-1$ class.
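scikit-learn's SVC solves exactly this dual. A short sketch (toy data again) showing how the dual solution maps onto the fitted attributes: `dual_coef_` stores $t_n\alpha_n$ for the support vectors, so $\boldsymbol{\beta}$ can be rebuilt by hand:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, size=(30, 2)),
               rng.normal([-2, -2], 1.0, size=(30, 2))])
t = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="linear", C=1.0).fit(X, t)

# dual_coef_ holds t_n * alpha_n for the support vectors only
beta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta_from_dual, clf.coef_))   # True: beta = sum_n alpha_n t_n x_n

# decision_function(x) = beta^T x + beta0; its sign gives the predicted class
print(np.sign(clf.decision_function(X[:3])))
```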
18
Properties of Solutions
An interesting property is that many of the resulting $\alpha_n$ values are equal to zero, so the obtained solution vector is sparse.
An $\mathbf{x}_n$ whose corresponding $\alpha_n \neq 0$ is called a Support Vector (SV), and the model can be written using only the support vectors:
$$f(\mathbf{x}, \boldsymbol{\beta}) = \sum_{n \,\in\, \text{SVs}} t_n\alpha_n\mathbf{x}_n^T\mathbf{x} + \beta_0$$
Support vectors are those data points $\mathbf{x}_n$ which are either
(i) on the margin boundaries,
(ii) between the margin boundaries, or
(iii) on the wrong side of the margin boundary.
19
Demo
[Figure: a fitted SVM in the $(x_1, x_2)$ plane; points inside the margin or on its wrong side have $\alpha = C$, points exactly on the margin boundaries have $0 < \alpha < C$, and all remaining points have $\alpha = 0$]
20
21
Revisit the SVM Conditions
The two constraint conditions are
$$t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) \ge 1 - \xi_n \quad \text{and} \quad \xi_n \ge 0$$
or, equivalently,
$$\xi_n \ge 1 - t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) \quad \text{and} \quad \xi_n \ge 0$$
or
$$\xi_n \ge \max\{0,\; 1 - t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0)\} =: L(t_n, \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0)$$
At the optimum each $\xi_n$ equals this lower bound, so problem (P1) can be written without constraints as
$$\min_{\boldsymbol{\beta},\, \beta_0} \; \frac{1}{2}\lVert\boldsymbol{\beta}\rVert^2 + C\sum_{n=1}^{N} L(t_n, \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0)$$
$L$ is called the hinge loss.
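A one-line NumPy version of the hinge loss, plus the unconstrained soft-margin objective, might look like the following sketch (variable names are my own):

```python
import numpy as np

def hinge_loss(t, scores):
    """Element-wise hinge loss L(t_n, beta^T x_n + beta0) = max(0, 1 - t_n * score_n)."""
    return np.maximum(0.0, 1.0 - t * scores)

def svm_objective(beta, beta0, X, t, C):
    """Soft-margin SVM objective: 0.5*||beta||^2 + C * sum of hinge losses."""
    scores = X @ beta + beta0
    return 0.5 * beta @ beta + C * hinge_loss(t, scores).sum()

# Tiny check: a point with t = 1 and score 0.4 sits inside the margin, loss = 0.6
print(hinge_loss(np.array([1.0]), np.array([0.4])))   # [0.6]
```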
22
t = 1 Hinge Loss
$$L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = \max\{0,\; 1 - (\boldsymbol{\beta}^T\mathbf{x} + \beta_0)\}$$
[Figure: hinge loss $L$ plotted against $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0$; zero for $z \ge 1$, increasing linearly as $z$ decreases below 1, equal to 1 at $z = 0$]
If $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 \ge 1$, then $1 - (\boldsymbol{\beta}^T\mathbf{x} + \beta_0) \le 0$, so $L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 0$.
If $\boldsymbol{\beta}^T\mathbf{x} + \beta_0 < 1$, then $1 - (\boldsymbol{\beta}^T\mathbf{x} + \beta_0) > 0$, so $L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 1 - (\boldsymbol{\beta}^T\mathbf{x} + \beta_0)$.
If $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 = 0$, then $1 - (\boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 1$, so $L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 1$.
23
t = −1 Hinge Loss
$$L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = \max\{0,\; 1 + (\boldsymbol{\beta}^T\mathbf{x} + \beta_0)\}$$
[Figure: hinge loss $L$ plotted against $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0$; zero for $z \le -1$, increasing linearly as $z$ increases above $-1$, equal to 1 at $z = 0$]
If $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 \le -1$, then $1 + (\boldsymbol{\beta}^T\mathbf{x} + \beta_0) \le 0$, so $L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 0$.
If $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 > -1$, then $1 + (\boldsymbol{\beta}^T\mathbf{x} + \beta_0) > 0$, so $L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 1 + (\boldsymbol{\beta}^T\mathbf{x} + \beta_0)$.
If $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 = 0$, then $1 + (\boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 1$, so $L(t, \boldsymbol{\beta}^T\mathbf{x} + \beta_0) = 1$.
24
Review: Logistic Regression
If $f(\mathbf{x}, \boldsymbol{\beta}) \to 1$, we should predict $t = 1$, meaning $z = \mathbf{x}^T\boldsymbol{\beta} \gg 0$.
[Figure: the "loss" for a $t = 1$ example plotted against $f(\mathbf{x}, \boldsymbol{\beta}) \in (0, 1)$; it decreases towards 0 as $f$ approaches 1]
Note: this is not the final logistic regression loss function; it is only the loss for one training example:
$$\text{Loss}\big(f(\mathbf{x}_n, \boldsymbol{\beta}), t_n\big) = \begin{cases} -\log f(\mathbf{x}_n, \boldsymbol{\beta}) = -\log\dfrac{1}{1 + e^{-\mathbf{x}_n^T\boldsymbol{\beta}}}, & t = 1 \\[2ex] -\log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big) = -\log\left(1 - \dfrac{1}{1 + e^{-\mathbf{x}_n^T\boldsymbol{\beta}}}\right), & t = 0 \end{cases}$$
We have absorbed $\beta_0$ into $\boldsymbol{\beta}$.
25
Class t = 0 (t = −1 in SVM)
If $f(\mathbf{x}, \boldsymbol{\beta}) \to 0$, we should predict $t = 0$, meaning $z = \mathbf{x}^T\boldsymbol{\beta} \ll 0$.
[Figure: the "loss" for a $t = 0$ example plotted against $f(\mathbf{x}, \boldsymbol{\beta}) \in (0, 1)$; it decreases towards 0 as $f$ approaches 0]
$$\text{Loss}\big(f(\mathbf{x}_n, \boldsymbol{\beta}), t_n\big) = \begin{cases} -\log f(\mathbf{x}_n, \boldsymbol{\beta}) = -\log\dfrac{1}{1 + e^{-\mathbf{x}_n^T\boldsymbol{\beta}}}, & t = 1 \\[2ex] -\log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big) = -\log\left(1 - \dfrac{1}{1 + e^{-\mathbf{x}_n^T\boldsymbol{\beta}}}\right), & t = 0 \end{cases}$$
26
SVM Loss Function
The logistic regression loss function with regularization is
$$L(\boldsymbol{\beta}) = -\frac{1}{N}\sum_{n=1}^{N}\Big[t_n\log f(\mathbf{x}_n, \boldsymbol{\beta}) + (1 - t_n)\log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big)\Big] + \frac{\lambda}{2N}\sum_{j=1}^{d}\beta_j^2$$
For the SVM, we replace the per-example term
$$-\Big[t_n\log f(\mathbf{x}_n, \boldsymbol{\beta}) + (1 - t_n)\log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big)\Big]$$
with the hinge loss
$$L(t_n, \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) := \max\{0,\; 1 - t_n(\boldsymbol{\beta}^T\mathbf{x}_n + \beta_0)\}$$
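To see how close these two per-example losses are, the following matplotlib sketch plots the logistic loss (written as a function of $z = \boldsymbol{\beta}^T\mathbf{x}$, for a $t = 1$ example) next to the hinge loss:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 200)

# Per-example losses for a t = 1 example, as functions of z = beta^T x (+ beta0)
logistic_loss = -np.log(1.0 / (1.0 + np.exp(-z)))   # = log(1 + exp(-z))
hinge_loss = np.maximum(0.0, 1.0 - z)

plt.plot(z, logistic_loss, label="logistic loss (t = 1)")
plt.plot(z, hinge_loss, label="hinge loss (t = 1)")
plt.xlabel("z")
plt.ylabel("loss")
plt.legend()
plt.show()
```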
27
SVM Loss Function
What we did:
Removed the factor $\frac{1}{N}$ from the loss function. Will this change the estimated parameters? (No: scaling the objective by a positive constant does not change its minimizer.)
Used $C$ instead of $\lambda$ for the regularization. $C$ plays the same role as $\frac{1}{\lambda}$; these notational choices are simply the convention for SVMs.
$$L(\boldsymbol{\beta}) = C\sum_{n=1}^{N} L(t_n, \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) + \frac{1}{2}\sum_{j=1}^{d}\beta_j^2$$
28
Summary: SVM Output
Unlike logistic regression, which outputs an estimated probability, the SVM directly produces the $+1$ / $-1$ classification prediction:
$$F(\mathbf{x}_n, \boldsymbol{\beta}) = \begin{cases} 1, & \text{if } \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0 \ge 0 \\ -1, & \text{if } \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0 < 0 \end{cases}$$
[Figure: the decision boundary $\mathbf{x}^T\boldsymbol{\beta} + \beta_0 = 0$ separating the two predicted classes in the $(x_1, x_2)$ plane]
29
Comparison
Logistic regression: if $t = 1$, it wants $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 \ge 0$; if $t = 0$, it wants $z < 0$.
SVM: if $t = 1$, it wants $z = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 \ge 1$; if $t = -1$, it wants $z \le -1$.
So the SVM requires a bit more than logistic regression: an extra confidence factor. We do not want data points to fall inside the margin.
30
Hard Margin SVM
$$L(\boldsymbol{\beta}) = C\sum_{n=1}^{N} L(t_n, \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) + \frac{1}{2}\sum_{j=1}^{d}\beta_j^2$$
• If $C$ is very, very large, then minimizing $L(\boldsymbol{\beta})$ forces the hinge loss to be close to zero, $L(t_n, \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0) \approx 0$:
• If $t = 1$, then $z = \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0 \ge 1$.
• If $t = -1$, then $z = \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0 \le -1$.
This prevents observations from falling inside the margin.
• Minimizing the loss function is then close to minimizing the specification below, also called the hard-margin SVM loss function:
$$L(\boldsymbol{\beta}) = \frac{1}{2}\sum_{j=1}^{d}\beta_j^2 \qquad \text{s.t.} \qquad \begin{cases} \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0 \ge 1, & \text{if } t_n = 1 \\ \boldsymbol{\beta}^T\mathbf{x}_n + \beta_0 \le -1, & \text{if } t_n = -1 \end{cases}$$
34
Decision Boundary & Parameters Vector
[Figure: the line $x_1 + x_2 - 5 = 0$ in the $(x_1, x_2)$ plane, crossing both axes at 5, with $t = 1$ points on one side and $t = 0$ points on the other; the parameter vector $\boldsymbol{\beta}$ is drawn perpendicular to the line]
The decision boundary is:
$$x_1 + x_2 - 5 = 0$$
The parameter vector (intercept not included) is:
$$\boldsymbol{\beta} = [1, 1]^T$$
The decision boundary is orthogonal to the parameter vector. (Proof omitted.)
38
39
Linearly Separable Case
Lecture05_Example01.py
40
SVM Decision Boundary
[Figure: two-class data in the $(x_1, x_2)$ plane with one green outlier; a blue decision boundary that separates every point, and a purple decision boundary with a wider margin that ignores the outlier]
If we have one green outlier, as in the plot, should we choose the purple or the blue decision boundary?
This is decided by $C$:
• If $C$ is very large => hard-margin SVM => blue decision boundary.
• If $C$ is not that large => soft-margin SVM => purple decision boundary.
41
[Figure: fitted decision boundaries and margins for C = 1000 and C = 1]
C = 1000: close to a hard-margin classifier.
C = 1: the decision boundary is $4x_2 - 3 = 0$, and the margin ("street") width is $2/\lVert\boldsymbol{\beta}\rVert = 2/4 = 0.5$.
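As a sanity check on that margin number: the boundary $4x_2 - 3 = 0$ corresponds to $\boldsymbol{\beta} = [0, 4]^T$ and $\beta_0 = -3$, so the street width is $2/\lVert\boldsymbol{\beta}\rVert$:

```python
import numpy as np

beta = np.array([0.0, 4.0])   # from the boundary 0*x1 + 4*x2 - 3 = 0
beta0 = -3.0

margin_width = 2 / np.linalg.norm(beta)
print(margin_width)           # 0.5, matching the slide

# The margin boundaries 4*x2 - 3 = +1 and -1 sit at x2 = 1.0 and x2 = 0.5
print((1 - beta0) / beta[1], (-1 - beta0) / beta[1])  # 1.0  0.5
```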
42
Linearly Non-Separable Case
[Figure: overlapping (linearly non-separable) two-class data in the $(x_1, x_2)$ plane]
The soft-margin SVM is the better choice for the linearly non-separable case.
The parameter $C$ determines the trade-off between increasing the margin size and ensuring that the observations lie on the correct side of the margin.
When $C$ is very, very large, a soft-margin SVM becomes a hard-margin SVM.
43
Linearly Non-Separable Case
[Figure: fitted decision boundaries and margins for C = 1 and C = 100]
Lecture05_Example02.py
44
45
Kernel Method
• The original SVM algorithm was proposed as a linear classifier.
• In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to SVMs.
• SVMs can efficiently perform non-linear classification using the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
• What is the kernel method, or kernel trick?
• We offer two ways of explaining it.
46
Kernel Method Trick
The dual problem of the SVM (D1) is
$$\max_{\boldsymbol{\alpha}} \;\; -\frac{1}{2}\sum_{m,n=1}^{N} t_m t_n \mathbf{x}_m^T\mathbf{x}_n\, \alpha_m\alpha_n + \sum_{n=1}^{N}\alpha_n$$
$$\text{subject to} \quad 0 \le \alpha_n \le C \;\; \text{for all } n = 1, 2, \dots, N, \qquad \sum_{n=1}^{N}\alpha_n t_n = 0$$
with the model
$$f(\mathbf{x}, \boldsymbol{\beta}) = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 = \sum_{n=1}^{N} t_n\alpha_n\mathbf{x}_n^T\mathbf{x} + \beta_0$$
The linear SVM algorithm only ever uses inner products of the data, and this holds in any dimension $d$. The inner product
$$k(\mathbf{x}, \mathbf{x}') := \mathbf{x}^T\mathbf{x}'$$
is called the linear kernel. The kernel trick replaces $\mathbf{x}_m^T\mathbf{x}_n$ with $\varphi(\mathbf{x}_m)^T\varphi(\mathbf{x}_n)$ for some mapping $\varphi$, as shown next.
47
Kernel Method Trick
Suppose we have a mapping $\mathbf{x}_n \to \varphi(\mathbf{x}_n)$ and define the "inner" product $k(\mathbf{x}_m, \mathbf{x}_n) := \varphi(\mathbf{x}_m)^T\varphi(\mathbf{x}_n)$. We can then replace $\mathbf{x}_m^T\mathbf{x}_n$ with $k(\mathbf{x}_m, \mathbf{x}_n)$ in the SVM algorithm.
The dual problem of the SVM (D1) becomes
$$\max_{\boldsymbol{\alpha}} \;\; -\frac{1}{2}\sum_{m,n=1}^{N} t_m t_n k(\mathbf{x}_m, \mathbf{x}_n)\, \alpha_m\alpha_n + \sum_{n=1}^{N}\alpha_n$$
$$\text{subject to} \quad 0 \le \alpha_n \le C \;\; \text{for all } n = 1, 2, \dots, N, \qquad \sum_{n=1}^{N}\alpha_n t_n = 0$$
The learned SVM model is
$$f(\mathbf{x}, \boldsymbol{\beta}) = \sum_{n=1}^{N} t_n\alpha_n k(\mathbf{x}_n, \mathbf{x}) + \beta_0$$
Note: only the terms for the support vectors appear in the summation.
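scikit-learn lets you pass the kernel either by name or as a precomputed Gram matrix, which makes the "replace $\mathbf{x}_m^T\mathbf{x}_n$ by $k(\mathbf{x}_m, \mathbf{x}_n)$" step explicit. A small sketch on made-up data (variable names are my own):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
t = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # nonlinear "inside vs outside" labels

# Option 1: let SVC compute the kernel internally
clf1 = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, t)

# Option 2: hand SVC the Gram matrix K[m, n] = k(x_m, x_n) directly
K = rbf_kernel(X, X, gamma=0.5)
clf2 = SVC(kernel="precomputed", C=1.0).fit(K, t)

# Predicting with a precomputed kernel needs k(x_new, x_train) for the training points
K_test = rbf_kernel(X[:5], X, gamma=0.5)
print(clf1.predict(X[:5]), clf2.predict(K_test))   # the two classifiers agree
```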
48
Suppose we have a data set like the one in the plot. The SVM with the linear kernel that we used in the previous section will not work here.
[Figure: two-class data in the $(x_1, x_2)$ plane that no straight line can separate]
What should the decision boundary be?
49
Python example
For this application, the linear SVM (an SVM with the linear kernel) does not work.
A more flexible decision boundary is needed.
50
[Figure: two-class data in the $(x_1, x_2)$ plane with a nonlinear decision boundary]
Current strategy:
$$f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
A more flexible alternative:
$$f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2$$
51
Here we have 4 features, defined by the functions
$$f_1 = x_1, \qquad f_2 = x_2, \qquad f_3 = x_1^2, \qquad f_4 = x_2^2$$
Suppose the decision boundary is
$$f(\mathbf{x}, \boldsymbol{\beta}) = -5 + x_1 + x_2 + x_1^2 + 2x_2^2 = 0$$
We define a mapping $\varphi$ as
$$\varphi: (x_1, x_2) \mapsto (x_1, x_2, x_1^2, x_2^2)$$
Then we feed these four dimensions into the SVM and solve it with the kernel function
$$k(\mathbf{x}_m, \mathbf{x}_n) = x_{m1}x_{n1} + x_{m2}x_{n2} + x_{m1}^2 x_{n1}^2 + x_{m2}^2 x_{n2}^2$$
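This particular mapping is simple enough to write out by hand, so we can check that a linear SVM on $\varphi(\mathbf{x})$ really does give a nonlinear boundary in the original space. A sketch with made-up data (the labelling rule is my own):

```python
import numpy as np
from sklearn.svm import SVC

def phi(X):
    """Explicit feature map (x1, x2) -> (x1, x2, x1^2, x2^2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2])

# Toy data labelled by an elliptical rule, so no straight line in (x1, x2) separates it
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
t = np.where(X[:, 0] ** 2 + 2 * X[:, 1] ** 2 < 5, 1, -1)

# A linear SVM in the mapped 4-dimensional space ...
clf = SVC(kernel="linear", C=10.0).fit(phi(X), t)

# ... corresponds to a quadratic decision boundary in the original 2-dimensional space
print("training accuracy:", clf.score(phi(X), t))
```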
52
Kernel Function
• However, it is generally hard to design such a mapping $\varphi$ directly.
• If we know $\varphi$, we can define the function
$$k(\mathbf{x}_m, \mathbf{x}_n) := \varphi(\mathbf{x}_m)^T\varphi(\mathbf{x}_n)$$
• This function is symmetric, positive (semi-)definite, etc. (we will not go into the details); we call it a kernel function.
• It turns out to be much easier to find functions with these properties directly, and mathematically, for any such kernel function there must exist a mapping $\varphi$ such that $k(\mathbf{x}_m, \mathbf{x}_n) = \varphi(\mathbf{x}_m)^T\varphi(\mathbf{x}_n)$.
• In the SVM method we ONLY need to be able to compute the kernel $k(\mathbf{x}_m, \mathbf{x}_n)$ (or $k(\mathbf{x}_m, \mathbf{x})$); we do not care whether we know $\varphi$ explicitly.
53
Gaussian Kernel
[Figure: data in the $(x_1, x_2)$ plane with two fixed basis points $\mathbf{b}^{(1)} = \mathbf{x}_4$ and $\mathbf{b}^{(2)} = \mathbf{x}_5$]
The famous Gaussian kernel function evaluates the similarity between any two points $\mathbf{x}$ and $\mathbf{x}'$:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{x}'\rVert^2}{2\sigma^2}\right)$$
It comes from the PDF of the Gaussian distribution,
$$p(\mathbf{x} \mid \boldsymbol{\mu}, \sigma^2) = \frac{1}{\sqrt{(2\pi\sigma^2)^d}}\exp\left(-\frac{\lVert\mathbf{x} - \boldsymbol{\mu}\rVert^2}{2\sigma^2}\right)$$
If we have two fixed basis points, we can calculate the similarity of any point to these two basis points with the Gaussian kernel.
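A small NumPy sketch of this similarity computation, with two made-up points standing in for the basis points $\mathbf{b}^{(1)}$ and $\mathbf{b}^{(2)}$:

```python
import numpy as np

def gaussian_kernel(x, b, sigma=1.0):
    """k(x, b) = exp(-||x - b||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - b) ** 2) / (2 * sigma ** 2))

b1 = np.array([1.0, 1.0])    # hypothetical basis point b^(1)
b2 = np.array([-1.0, -1.0])  # hypothetical basis point b^(2)
x = np.array([0.9, 1.2])     # a point close to b1, far from b2

print(gaussian_kernel(x, b1))   # close to 1 (x is near b1)
print(gaussian_kernel(x, b2))   # close to 0 (x is far from b2)
```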
54
Gaussian Kernel
[Figure: data points $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ and basis points $\mathbf{b}_1 = \mathbf{x}_4$, $\mathbf{b}_2 = \mathbf{x}_5$ in the $(x_1, x_2)$ plane]
Why can the kernel trick produce a nonlinear decision boundary?
$$f_1 = k(\mathbf{x}, \mathbf{b}_1) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{b}_1\rVert^2}{2\sigma^2}\right), \qquad f_2 = k(\mathbf{x}, \mathbf{b}_2) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{b}_2\rVert^2}{2\sigma^2}\right)$$
55
[Figure: the same data, with basis points $\mathbf{b}_1 = \mathbf{x}_4$ and $\mathbf{b}_2 = \mathbf{x}_5$]
$$f_1 = k(\mathbf{x}, \mathbf{b}_1) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{b}_1\rVert^2}{2\sigma^2}\right), \qquad f_2 = k(\mathbf{x}, \mathbf{b}_2) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{b}_2\rVert^2}{2\sigma^2}\right)$$
Training example $\mathbf{x}_1$ is close to basis $\mathbf{b}_1$: then $f_1$ will be close to 1.
Training example $\mathbf{x}_2$ is far away from basis $\mathbf{b}_1$: then $f_1$ will be close to 0.
Training example $\mathbf{x}_3$ is far away from basis $\mathbf{b}_1$: then $f_1$ will be close to 0.
Training example $\mathbf{x}_4$ is the basis $\mathbf{b}_1$ itself: then $f_1$ will be exactly 1.
Training example $\mathbf{x}_5 = \mathbf{b}_2$ is far away from basis $\mathbf{b}_1$: then $f_1$ will be close to 0.
The same process can be carried out for $\mathbf{b}_2 = \mathbf{x}_5$.
56
Gaussian Kernel Visualization
The Gaussian kernel, or radial basis function (RBF) kernel, $k(\cdot, \cdot)$ is a popular kernel function that is commonly used in SVM classification:
$$f_i = k(\mathbf{x}, \mathbf{b}_i) = \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{b}_i\rVert^2}{2\sigma^2}\right)$$
The basis point $\mathbf{b}_i$ controls the location of the kernel, and $\sigma$ controls its shape (width). Try to plot this in Python.
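Taking up that suggestion, here is one possible matplotlib sketch of the kernel over the $(x_1, x_2)$ plane for a made-up basis point and two values of $\sigma$:

```python
import numpy as np
import matplotlib.pyplot as plt

b = np.array([0.0, 0.0])                       # hypothetical basis point b_i
x1, x2 = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, sigma in zip(axes, [0.5, 1.5]):
    sq_dist = (x1 - b[0]) ** 2 + (x2 - b[1]) ** 2
    f = np.exp(-sq_dist / (2 * sigma ** 2))    # k(x, b) evaluated on the grid
    ax.contourf(x1, x2, f, levels=20)
    ax.set_title(f"sigma = {sigma}")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
plt.show()
```

A larger $\sigma$ spreads the bump out (smoother features); a smaller $\sigma$ makes it sharply peaked around $\mathbf{b}_i$.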
57
Decision Boundary
[Figure: points $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$ and basis points $\mathbf{b}_1 = \mathbf{x}_4$, $\mathbf{b}_2 = \mathbf{x}_5$ in the $(x_1, x_2)$ plane]
$$f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 f_1 + \beta_2 f_2$$
If $\beta_0 + \beta_1 f_1 + \beta_2 f_2 \ge 0$, predict $t = 1$. Suppose $\beta_0 = -1$, $\beta_1 = 2.5$, $\beta_2 = 1.5$.
For $\mathbf{x}_1$: $f_1$ will be close to 1 and $f_2$ close to 0, so $\beta_0 + \beta_1 f_1 + \beta_2 f_2 \approx 1.5$, and we predict 1.
For $\mathbf{x}_2$: $f_1$ will be close to 0 and $f_2$ close to 1, so $\beta_0 + \beta_1 f_1 + \beta_2 f_2 \approx 0.5$, and we predict 1.
For $\mathbf{x}_3$: $f_1$ and $f_2$ will both be close to 0, so $\beta_0 + \beta_1 f_1 + \beta_2 f_2$ is negative (approximately $\beta_0 = -1$), and we predict 0 (i.e. $t = -1$).
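The arithmetic in this toy example is easy to reproduce. The sketch below plugs in illustrative kernel values (close to 0 or 1, as described above; the exact numbers are made up) with the given $\beta$'s:

```python
beta0, beta1, beta2 = -1.0, 2.5, 1.5

# Approximate kernel features (f1, f2) for x1, x2, x3 as described on the slide
examples = {
    "x1": (0.98, 0.02),   # near b1, far from b2
    "x2": (0.02, 0.97),   # far from b1, near b2
    "x3": (0.01, 0.02),   # far from both basis points
}

for name, (f1, f2) in examples.items():
    score = beta0 + beta1 * f1 + beta2 * f2
    print(name, round(score, 2), "-> predict", 1 if score >= 0 else -1)
# x1 ~  1.5 -> 1
# x2 ~  0.5 -> 1
# x3 ~ -0.9 -> -1
```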
58
High-dimensional Feature Space
• In the previous example, we only used $\mathbf{b}_1 = \mathbf{x}_4$ and $\mathbf{b}_2 = \mathbf{x}_5$ as basis points. But why these?
• Actually, in the kernel method all of the training examples can be used as basis points.
Training set: $\{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \dots, (\mathbf{x}_N, t_N)\}$
Basis: $\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \dots, \mathbf{x}_N\}$
59
High-dimensional Feature Space
[Figure: training points $\mathbf{x}_1, \dots, \mathbf{x}_5$ in the $(x_1, x_2)$ plane; here $N = 5$ and we take $n = 2$ as the example]
Choose $(\mathbf{x}_2, t_2)$ as an example. For each basis point we can calculate a similarity ($N$ of them in total), giving the feature vector
$$\mathbf{f}^{(2)} = \big(f_0^{(2)}, f_1^{(2)}, f_2^{(2)}, f_3^{(2)}, f_4^{(2)}, f_5^{(2)}\big)^T \in \mathbb{R}^{N+1} = \mathbb{R}^6$$
where
$$f_0^{(2)} = 1 \;\text{(intercept term)}, \qquad f_i^{(2)} = \exp\left(-\frac{\lVert\mathbf{x}_2 - \mathbf{x}_i\rVert^2}{2\sigma^2}\right), \quad i = 1, \dots, 5$$
Note that $f_2^{(2)} = \exp(0) = 1$, since $\mathbf{x}_2$ is compared with itself.
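In code, the whole matrix of such feature vectors (one row per training point) takes a couple of lines. scikit-learn's `rbf_kernel` uses the parameterisation $\gamma = 1/(2\sigma^2)$, which is worth noting when translating:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))      # N = 5 toy training points
sigma = 1.0

# K[n, i] = exp(-||x_n - x_i||^2 / (2 sigma^2)); note gamma = 1 / (2 sigma^2)
K = rbf_kernel(X, X, gamma=1.0 / (2 * sigma ** 2))

# Prepend the intercept feature f_0 = 1 to get one (N+1)-dimensional vector per point
F = np.hstack([np.ones((X.shape[0], 1)), K])

print(F.shape)       # (5, 6): each row is f^(n) in R^{N+1}
print(F[1])          # the feature vector f^(2) for x_2 (0-indexed row 1); F[1, 2] == 1
```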
60
SVM loss function employing the kernel method
Let $\boldsymbol{\beta} \in \mathbb{R}^{N+1}$. The SVM + kernel objective is
$$L(\boldsymbol{\beta}) = C\sum_{n=1}^{N} L\big(t_n, \boldsymbol{\beta}^T\mathbf{f}^{(n)}\big) + \frac{1}{2}\sum_{j=1}^{N}\beta_j^2$$
where, for example with $n = 2$,
$$\boldsymbol{\beta}^T\mathbf{f}^{(n)} = \beta_0 f_0^{(n)} + \beta_1 f_1^{(n)} + \beta_2 f_2^{(n)} + \beta_3 f_3^{(n)} + \dots + \beta_N f_N^{(n)}$$
Prediction rule:
$$\text{predict } \begin{cases} 1, & \text{if } \boldsymbol{\beta}^T\mathbf{f}^{(n)} \ge 0 \\ 0, & \text{if } \boldsymbol{\beta}^T\mathbf{f}^{(n)} < 0 \end{cases}$$
61
Bias & Variance Impact of C
How the parameter $C = \frac{1}{\lambda}$ can impact bias and variance:
Large C: low bias and high variance.
Small C: high bias and low variance.
62
Bias & Variance Impact of σ
Large $\sigma$: $f_i$ will vary more smoothly, giving high bias and low variance.
Small $\sigma$: $f_i$ will vary more sharply, giving low bias and high variance.
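Both knobs are usually tuned together by cross-validation. A short scikit-learn sketch on made-up data (note that SVC's `gamma` corresponds to $1/(2\sigma^2)$, so large `gamma` means small $\sigma$):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] ** 2 + 2 * X[:, 1] ** 2 < 2, 1, -1)   # toy nonlinear labels

# Cross-validate over the regularization strength C and the kernel width gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, t)

print(search.best_params_)
print(search.best_score_)
```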
63
How to choose the kernel?
There are many other types of kernels, e.g. the polynomial kernel, the chi-square kernel, etc.
Use SVM without a kernel (i.e. the linear kernel) when the number of features $d$ is larger than the number of training examples $N$.
Use the Gaussian kernel when $d$ is small and $N$ is medium, e.g. $N = 500$.
If $d$ is small and $N$ is very large, e.g. $N = 100{,}000$, speed can be an issue when using the Gaussian kernel (too many features), so SVM without a kernel is a better choice.
64
Lecture05_Example03.py
65