
Slide 1

BUSINESS SCHOOL

 Discipline of Business Analytics

QBUS6850 Team

2

 Topics covered
 Support Vector Machine (SVM)
 Kernel method

 References
 Friedman et al., (2001), Chapter 12.1 – 12.3
 James et al., (2014), Chapter 9

 Bishop, (2006), Chapter 7.1

 Alpaydin, (2014), Chapter 13

3

Learning Objectives

 Understand the intuition of Support Vector Machine

 Understand the loss function of SVM

 Understand how SVM works

 Understand the Gaussian kernel and its decision boundary

 Understand how to incorporate the kernel method into SVM

4

What we learnt
 Supervised learning algorithms:

 Linear regression
 Logistic regression for classification
 k-NN

 Unsupervised learning algorithms:
 K-means
 PCA

 Loss function optimization, cross validation and regularization

 Other important skills:
 Skills in applying these algorithms, e.g. choice of the algorithm
 Choice of the features you design to give to the learning algorithms
 Choice of the regularization parameter
 The amount of data you really need

5

Presentation notes (further reading):

 SVM – Understanding the math – Part 1 – The margin

 Support Vector Machines Tutorial

6

Support Vector Machine

 The Support Vector Machine (SVM) algorithm was first introduced in the mid-1990s

 One of the most powerful and widely used 'black box' learning algorithms today

 A powerful way of learning complex non-linear functions

 Based on a cleverly-chosen optimization objective

 Support vector machines (SVMs) are an important machine learning method with many applications

 In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the Kernel Trick, implicitly mapping their inputs into high-dimensional feature spaces.

7

Support Vector Machine

 A straightforward engineering solution for classification tasks.
 Support vector machines have nice computational properties.
 Key idea:

 Construct a separating hyperplane in a high-dimensional feature space.
 Maximize separability.
 Express the hyperplane in the original space using a small set of training vectors, the "support vectors".

 Links to a wealth of material (books, software, etc.) on SVMs can be found at: http://www.support-vector-machines.org/

Goal: construct a separating hyperplane that maximizes the margin of separation.

8

Intuition

[Figure: training data from two classes plotted in the (x1, x2) plane]

9

Intuition

[Figure: the same two-class data in the (x1, x2) plane]

Why?

10

Margin

Largest Margin with SVM

[Figure: the largest-margin decision boundary separating the two classes in the (x1, x2) plane]

How? "The widest street approach"

If the training data are linearly separable, SVM can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin". Observations lying on the margin boundaries are called the support vectors.

11

12

Calculating Margin

[Figure: a point x_n and the decision boundary β_1x_1 + β_2x_2 + β_0 = 0 in the (x1, x2) plane]

How do we calculate the distance from x_n to the decision boundary (which, for a point on the margin boundary, is half of the margin)?

   dist = |β_1 x_{n1} + β_2 x_{n2} + β_0| / √(β_1² + β_2²) = |βᵀx_n + β_0| / ‖β‖
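To make the distance formula concrete, here is a minimal NumPy sketch. The numbers are illustrative only and are not taken from the lecture examples:

```python
import numpy as np

beta = np.array([2.0, -1.0])   # [beta_1, beta_2], an illustrative hyperplane
beta0 = 0.5
x_n = np.array([1.5, 3.0])     # an illustrative training point

# Value of the hyperplane function at x_n
score = beta @ x_n + beta0

# Distance from x_n to the decision boundary beta^T x + beta_0 = 0
dist = np.abs(score) / np.linalg.norm(beta)
print(dist)
```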

13

Decision Boundary Equation

 The boundary equation β_1x_1 + β_2x_2 + β_0 = 0 (more generally βᵀx + β_0 = 0) can equally be represented as cβ_1x_1 + cβ_2x_2 + cβ_0 = 0 for any non-zero constant c.

 That means we can scale the βs such that for any point x_n = (x_{n1}, x_{n2}) on the two dashed blue margin boundaries we have |β_1x_{n1} + β_2x_{n2} + β_0| = 1.

 The margin now becomes

   margin = 2 / ‖β‖

 Define labels t_n = 1 if x_n is on the green side and t_n = −1 if x_n is on the red side. Then, no matter which side it is on, for a point on the margin boundaries we can write

   t_n(β_1x_{n1} + β_2x_{n2} + β_0) = 1   [ in general: t_n(βᵀx_n + β_0) = 1 ]

 If x_n is not on the dashed margin boundaries (it lies further from the decision boundary), we have

   t_n(β_1x_{n1} + β_2x_{n2} + β_0) > 1   [ in general: t_n(βᵀx_n + β_0) > 1 ]

14

SVM Formulation
 Given a dataset 𝒟 = {(x_1, t_1), (x_2, t_2), …, (x_N, t_N)} where t_n = 1 or −1 (two classes)

 We assume the data are separable by a hyperplane

 The SVM learning problem is defined as

   min over β, β_0:   (1/2)‖β‖²
   subject to:  t_n(βᵀx_n + β_0) ≥ 1, for all n = 1, 2, …, N

 The above problem is a constrained optimisation; it cannot be solved by gradient descent directly.

15

SVM Formulation

[Figure: the decision boundary β_1x_1 + β_2x_2 + β_0 = 0 with a point x_n that falls inside the margin, a distance ξ_n > 0 from its margin boundary]

What can we do if some data are not trusted and we wish for a wider margin? Or: to get a wider margin, we may allow some data to fall inside the margin boundaries.

Denote by ξ_n > 0 the distance of such a data point to its margin boundary; then its (normalised) distance to the decision boundary is 1 − ξ_n, i.e. t_n(βᵀx_n + β_0) = 1 − ξ_n.

16

SVM Formulation
 Given a dataset 𝒟 = {(x_1, t_1), (x_2, t_2), …, (x_N, t_N)} where t_n = 1 or −1 (two classes)

 The general SVM learning problem (P1) is

   min over β, β_0, ξ_n ≥ 0:   (1/2)‖β‖² + C ∑_{n=1}^{N} ξ_n
   subject to:  t_n(βᵀx_n + β_0) ≥ 1 − ξ_n, for all n = 1, 2, …, N
                ξ_n ≥ 0, for all n = 1, 2, …, N

 In the literature, this is called the primal problem of SVM. It is not easy to solve directly.

 By introducing Lagrange multipliers α_n for the first set of N constraints, we can solve its dual problem instead, which is a Quadratic Programming problem.
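One way to see (P1) in action is to hand it directly to a generic convex optimisation solver. Below is a minimal sketch using the cvxpy library on synthetic data; cvxpy is an assumption here, not a tool used in the lecture files:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1, (20, 2)),     # class +1
               rng.normal([-2, -2], 1, (20, 2))])  # class -1
t = np.array([1.0] * 20 + [-1.0] * 20)
N, d = X.shape
C = 1.0

beta = cp.Variable(d)
beta0 = cp.Variable()
xi = cp.Variable(N)

# Primal soft-margin SVM (P1): 0.5*||beta||^2 + C*sum(xi)
objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
constraints = [cp.multiply(t, X @ beta + beta0) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(beta.value, beta0.value)
```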

17

Quadratic Programming Problem

 The dual problem of SVM (D1) is

   max over α:   ∑_{n=1}^{N} α_n − (1/2) ∑_{m,n=1}^{N} t_m t_n α_m α_n x_mᵀx_n
   subject to:  0 ≤ α_n ≤ C, for all n = 1, 2, …, N
                ∑_{n=1}^{N} α_n t_n = 0

 This can be solved efficiently for all the α_n, and the solution is given by

   β = ∑_{n=1}^{N} α_n t_n x_n,    β_0 = − ( max_{n: t_n = −1} βᵀx_n + min_{n: t_n = 1} βᵀx_n ) / 2

 and the model is

   f(x, β) = βᵀx + β_0 = ∑_{n=1}^{N} t_n α_n x_nᵀx + β_0

 If f(x, β) > 0, then x belongs to the +1 class;
 if f(x, β) < 0, then x belongs to the −1 class.
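In practice the dual is solved by library code. A small scikit-learn sketch (illustrative data, not the lecture's) that fits a linear SVM and reads off the support vectors and the dual quantities discussed on the next slide:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 1, (30, 2)),
               rng.normal([-2, -2], 1, (30, 2))])
t = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, t)

print(clf.support_vectors_)       # the x_n whose alpha_n != 0
print(clf.dual_coef_)             # t_n * alpha_n for each support vector
print(clf.coef_, clf.intercept_)  # beta and beta_0 (available for the linear kernel)
```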
Properties of Solutions

 An interesting property is that many of the resulting α_n values are equal to zero; hence the obtained solution vector is sparse.

 An x_n whose corresponding α_n ≠ 0 is called a Support Vector (SV), hence the model is

   f(x, β) = ∑_{n: x_n is a SV} t_n α_n x_nᵀx + β_0

 Support Vectors are those data points x_n which are either (i) on the margin boundaries, (ii) between the margin boundaries, or (iii) on the wrong side of the margin boundary.

Demo

[Figure: a fitted SVM on two-class data in the (x1, x2) plane; points labelled α = 0 lie outside the margin, points with 0 < α < C lie on the margin boundaries, and points with α = C lie inside the margin or on the wrong side]

Revisit the SVM Conditions

 The two constraint conditions are

   t_n(βᵀx_n + β_0) ≥ 1 − ξ_n,   ξ_n ≥ 0

 Or

   ξ_n ≥ 1 − t_n(βᵀx_n + β_0)  and  ξ_n ≥ 0

 Or

   ξ_n ≥ max(0, 1 − t_n(βᵀx_n + β_0)) := L(t_n, βᵀx_n + β_0)

 Substituting this into the objective, the SVM problem becomes

   min over β, β_0:   (1/2)‖β‖² + C ∑_{n=1}^{N} L(t_n, βᵀx_n + β_0)

 where L is the Hinge Loss.

t = 1 Hinge Loss

 If z = βᵀx + β_0 ≥ 1, then 1 − (βᵀx + β_0) ≤ 0, so L(t, βᵀx + β_0) = 0;

 If z = βᵀx + β_0 < 1, then 1 − (βᵀx + β_0) > 0, so L(t, βᵀx + β_0) = 1 − (βᵀx + β_0);

 If z = βᵀx + β_0 = 0, then 1 − (βᵀx + β_0) = 1, so L(t, βᵀx + β_0) = 1.

[Figure: the t = 1 hinge loss L plotted against z = βᵀx + β_0; L = 0 for z ≥ 1 and rises linearly to L = 1 at z = 0]

   L(t, βᵀx + β_0) = max(0, 1 − (βᵀx + β_0))   (for t = 1)

23

t = −1 Hinge Loss

 If z = βᵀx + β_0 ≤ −1, then 1 + (βᵀx + β_0) ≤ 0, so L(t, βᵀx + β_0) = 0;

 If z = βᵀx + β_0 > −1, then 1 + (βᵀx + β_0) > 0, so L(t, βᵀx + β_0) = 1 + (βᵀx + β_0);

 If z = βᵀx + β_0 = 0, then 1 + (βᵀx + β_0) = 1, so L(t, βᵀx + β_0) = 1.

[Figure: the t = −1 hinge loss L plotted against z = βᵀx + β_0; L = 0 for z ≤ −1 and rises linearly to L = 1 at z = 0]

   L(t, βᵀx + β_0) = max(0, 1 + (βᵀx + β_0))   (for t = −1)
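Both cases can be written as the single formula L(t, z) = max(0, 1 − t·z) with z = βᵀx + β_0. A small NumPy sketch:

```python
import numpy as np

def hinge_loss(t, z):
    """Hinge loss L(t, z) = max(0, 1 - t*z) for labels t in {+1, -1}."""
    return np.maximum(0.0, 1.0 - t * z)

z = np.array([2.0, 0.5, 0.0, -0.5, -2.0])   # z = beta^T x + beta_0
print(hinge_loss(+1, z))   # [0.  0.5 1.  1.5 3. ]
print(hinge_loss(-1, z))   # [3.  1.5 1.  0.5 0. ]
```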

24

Review: Logistic Regression

 If f(x, β) → 1, we should predict t = 1, meaning z = xᵀβ ≫ 0.

[Figure: the t = 1 loss plotted against f(x, β) ∈ (0, 1); the loss is near 0 when f(x, β) is near 1 and grows large as f(x, β) → 0]

Note: this is not the final logistic regression loss function; it is only the loss for one training example.

   Loss(f(x_n, β), t_n) =
     −log f(x_n, β) = −log( 1 / (1 + e^(−x_nᵀβ)) ),          if t = 1
     −log(1 − f(x_n, β)) = −log( 1 − 1 / (1 + e^(−x_nᵀβ)) ),  if t = 0

 We have absorbed β_0 into β.

25

Class t = 0 (t = −1 in SVM)

 If f(x, β) → 0, we should predict t = 0, meaning z = xᵀβ ≪ 0.

[Figure: the t = 0 loss plotted against f(x, β) ∈ (0, 1); the loss is near 0 when f(x, β) is near 0 and grows large as f(x, β) → 1]

   Loss(f(x_n, β), t_n) =
     −log f(x_n, β) = −log( 1 / (1 + e^(−x_nᵀβ)) ),          if t = 1
     −log(1 − f(x_n, β)) = −log( 1 − 1 / (1 + e^(−x_nᵀβ)) ),  if t = 0

26

SVM Loss Function

 Logistic regression loss function with regularization:

   L(β) = −(1/N) ∑_{n=1}^{N} [ t_n log f(x_n, β) + (1 − t_n) log(1 − f(x_n, β)) ] + (λ/(2N)) ∑_{j=1}^{d} β_j²

 For SVM, we replace each term

   −( t_n log f(x_n, β) + (1 − t_n) log(1 − f(x_n, β)) )

 with the hinge loss

   L(t_n, βᵀx_n + β_0) := max(0, 1 − t_n(βᵀx_n + β_0))

[Figure: the hinge loss for t = 1 and for t = −1 plotted against z = βᵀx + β_0]

27

SVM Loss Function

   L(β) = C ∑_{n=1}^{N} L(t_n, βᵀx_n + β_0) + (1/2) ∑_{j=1}^{d} β_j²

What we did:
 Removed N from the loss function. Will this change the estimated parameters? (No: scaling the objective by a positive constant does not change the minimiser.)
 Used C instead of λ for the regularization.
 C plays the same role as 1/λ. These notations are just the convention for SVM.
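Because this form is unconstrained, it can also be minimised directly. A minimal subgradient-descent sketch of the objective above (illustrative only; in practice a QP or a library solver is used):

```python
import numpy as np

def fit_linear_svm(X, t, C=1.0, lr=0.001, epochs=500):
    """Minimise C * sum(hinge loss) + 0.5 * ||beta||^2 by subgradient descent."""
    N, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(epochs):
        z = X @ beta + beta0
        active = (t * z) < 1                     # points with non-zero hinge loss
        # Subgradients of the objective with respect to beta and beta_0
        g_beta = beta - C * (t[active, None] * X[active]).sum(axis=0)
        g_beta0 = -C * t[active].sum()
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0
```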

28

Summary: SVM Output
Unlike logistic regression, which outputs an estimated probability, SVM directly produces the +1 / −1 classification prediction:

   F(x_n, β) =  1,   if βᵀx_n + β_0 ≥ 0
               −1,   if βᵀx_n + β_0 < 0

[Figure: the decision boundary xᵀβ + β_0 = 0 separating the two classes in the (x1, x2) plane]
Comparison

 Logistic regression prediction target: if t = 1, we want z = xᵀβ ≥ 0 (ideally z ≫ 0); if t = 0, we want z = xᵀβ < 0 (ideally z ≪ 0).

 SVM prediction target: if t = 1, we want z = βᵀx_n + β_0 ≥ 1; if t = −1, we want z = βᵀx_n + β_0 ≤ −1.

 SVM requires a bit more than logistic regression: some extra confidence factor. We do not want data points to fall inside the margin.

Hard Margin SVM

• If C is very, very large, then minimising

   L(β) = C ∑_{n=1}^{N} L(t_n, βᵀx_n + β_0) + (1/2) ∑_{j=1}^{d} β_j²

  forces the hinge loss to be close to zero, L(t_n, βᵀx_n + β_0) ≈ 0:
• If t_n = 1, then z = βᵀx_n + β_0 ≥ 1
• If t_n = −1, then z = βᵀx_n + β_0 ≤ −1

• Minimizing the loss function is then close to minimizing the specification below, which prevents observations from falling into the margin. This is also called the hard margin SVM loss function:

   L(β) = (1/2) ∑_{j=1}^{d} β_j²
   s.t.  βᵀx_n + β_0 ≥ 1,   if t_n = 1
         βᵀx_n + β_0 ≤ −1,  if t_n = −1

Decision Boundary & Parameter Vector

[Figure: the boundary x_1 + x_2 − 5 = 0 separating the t = 1 and t = 0 classes, with the parameter vector β drawn perpendicular to it]

 Decision boundary: x_1 + x_2 − 5 = 0
 Parameter vector (intercept is not included): β = [1, 1]ᵀ
 The decision boundary is orthogonal to the parameter vector. Proof omitted. (Sketch: for any two points u, v on the boundary, βᵀ(u − v) = 0, so β is orthogonal to every direction along the boundary.)

Linearly Separable Case

 Lecture05_Example01.py

SVM Decision Boundary

[Figure: linearly separable data with one green outlier; a blue boundary that separates every point and a purple boundary with a wider margin]

 If we have one green outlier as in the plot, should we choose the purple or the blue decision boundary? This is decided by C:
 • C is very large => hard margin SVM => blue decision boundary
 • C is not that large => soft margin SVM => purple decision boundary
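A minimal scikit-learn sketch in the spirit of Lecture05_Example01.py (the actual file is not reproduced here; the data and parameter values are illustrative). It fits a linear SVM for a soft and a nearly hard margin and reads off the boundary and street width:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([3, 3], 0.8, (25, 2)),
               rng.normal([0, 0], 0.8, (25, 2))])
t = np.array([1] * 25 + [-1] * 25)

for C in (1, 1000):                      # soft vs (nearly) hard margin
    clf = SVC(kernel="linear", C=C).fit(X, t)
    b, b0 = clf.coef_[0], clf.intercept_[0]
    print(f"C={C}: boundary {b[0]:.2f}*x1 + {b[1]:.2f}*x2 + {b0:.2f} = 0, "
          f"street width {2 / np.linalg.norm(b):.2f}")

plt.scatter(X[:, 0], X[:, 1], c=t)
plt.show()
```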

41

[Figure: the fitted classifier with C = 1000 (close to a hard margin classifier) and with C = 1]

 Decision boundary: 4x_2 − 3 = 0
 Margin / street width: 2/‖β‖ = 2/4 = 0.5

42

Linearly Non-Separable Case

Soft margin SVM is a better choice for the linearly non-separable case.

The parameter C determines the tradeoff between increasing the margin size and ensuring that the observations lie on the correct side of the margin.

When C is very, very large, a soft margin SVM becomes a hard margin SVM.

[Figure: two classes in the (x1, x2) plane that are not linearly separable]

43

Linearly Non-Separable Case

[Figure: soft margin SVM fits to the non-separable data with C = 1 and with C = 100]

Lecture05_Example02.py
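A sketch in the spirit of Lecture05_Example02.py (again illustrative, not the actual file): fit the same non-separable data with a small and a large C and compare the number of support vectors, which reflects how many points end up inside or on the margin:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Overlapping classes: not linearly separable
X = np.vstack([rng.normal([1, 1], 1.5, (50, 2)),
               rng.normal([-1, -1], 1.5, (50, 2))])
t = np.array([1] * 50 + [-1] * 50)

for C in (1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, t):.2f}")
```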

44

45

Kernel Method

• The original SVM algorithm was proposed by constructing a
linear classifier

• In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir
N. Vapnik suggested a way to create nonlinear classifiers by
applying the kernel trick to SVM

• SVMs can efficiently perform a non-linear classification using
kernel trick, implicitly mapping their inputs into high-
dimensional feature spaces.

• What is the kernel method, or kernel trick?

• We offer two ways of explaining it.

46

Kernel Method Trick

 The dual problem of SVM (D1) is

   max over α:   ∑_{n=1}^{N} α_n − (1/2) ∑_{m,n=1}^{N} t_m t_n α_m α_n x_mᵀx_n
   subject to:  0 ≤ α_n ≤ C, for all n = 1, 2, …, N
                ∑_{n=1}^{N} α_n t_n = 0

 with the learned model

   f(x, β) = βᵀx + β_0 = ∑_{n=1}^{N} t_n α_n x_nᵀx + β_0

 The linear SVM algorithm uses the data only through inner products of the form x_mᵀx_n. This is true for any dimension d.

   k(x, x′) := xᵀx′  is called the Linear Kernel;
   it is the special case φ(x_m)ᵀφ(x_n) where φ is the identity mapping.
47

Kernel Method Trick

 Suppose we have a mapping x_n → φ(x_n) and can define the "inner" product k(x_m, x_n) := φ(x_m)ᵀφ(x_n). Then we can replace x_mᵀx_n with k(x_m, x_n) in the SVM algorithm.

 The dual problem of SVM (D1) becomes

   max over α:   ∑_{n=1}^{N} α_n − (1/2) ∑_{m,n=1}^{N} t_m t_n α_m α_n k(x_m, x_n)
   subject to:  0 ≤ α_n ≤ C, for all n = 1, 2, …, N
                ∑_{n=1}^{N} α_n t_n = 0

 The learned SVM model is

   f(x, β) = ∑_{n=1}^{N} t_n α_n k(x_n, x) + β_0

 Note: only the terms for support vectors (α_n ≠ 0) remain in the summation.
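scikit-learn exposes this directly: the kernel can be passed by name or as a function k(X, X′) returning the Gram matrix, which is exactly the quantity the dual needs. A small sketch with illustrative data:

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernel(A, B):
    """k(x, x') = x^T x' evaluated for all pairs of rows of A and B."""
    return A @ B.T

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 2], 1, (30, 2)),
               rng.normal([-2, -2], 1, (30, 2))])
t = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel=linear_kernel, C=1.0).fit(X, t)   # equivalent to kernel="linear"
print(clf.score(X, t))
```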

48

Suppose we have a data set like this. The SVM (linear kernel) we used in the previous section will not work.

Decision boundary?

[Figure: a two-class data set in the (x1, x2) plane that cannot be separated by a straight line]

49

Python example
 For this application, the linear SVM (SVM with a linear kernel) does not work.
 A more flexible decision boundary is needed.

50

Nonlinear decision boundary

[Figure: the non-linearly-separable data in the (x1, x2) plane, requiring a curved decision boundary]

 Current strategy:   f(x, β) = β_0 + β_1x_1 + β_2x_2

 Extended strategy:  f(x, β) = β_0 + β_1x_1 + β_2x_2 + β_3x_1² + β_4x_2²

51

Here we have 4 features defined by the functions:

   f_1 = x_1;   f_2 = x_2;   f_3 = x_1²;   f_4 = x_2²

Suppose the decision boundary is

   f(x, β) = −5 + x_1 + x_2 + x_1² + 2x_2² = 0

We define a mapping φ as

   (x_1, x_2) → (x_1, x_2, x_1², x_2²)

Then we feed this four-dimensional representation into the SVM; it corresponds to the kernel function

   k(x_m, x_n) = x_{m1}x_{n1} + x_{m2}x_{n2} + x_{m1}²x_{n1}² + x_{m2}²x_{n2}²
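We can check numerically that this kernel is just the inner product after the mapping φ. A minimal sketch (the points are chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Feature map (x1, x2) -> (x1, x2, x1^2, x2^2)."""
    return np.array([x[0], x[1], x[0] ** 2, x[1] ** 2])

def k(xm, xn):
    """Kernel corresponding to phi: k(xm, xn) = phi(xm)^T phi(xn)."""
    return (xm[0] * xn[0] + xm[1] * xn[1]
            + xm[0] ** 2 * xn[0] ** 2 + xm[1] ** 2 * xn[1] ** 2)

xm, xn = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(phi(xm) @ phi(xn), k(xm, xn))   # identical values
```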

52

Kernel Function

• However, it is usually hard to design such a mapping φ directly.

• If we know φ, we can define the function

   k(x_m, x_n) := φ(x_m)ᵀφ(x_n)

• This function is symmetric, positive (semi-)definite, etc. (we will not go into the details). We call it a kernel function.

• It turns out to be much easier to find functions with the above properties directly, and the mathematicians tell us that for any such kernel function there must exist a mapping φ such that k(x_m, x_n) = φ(x_m)ᵀφ(x_n).

• In the SVM method, we ONLY need to be able to calculate the kernel k(x_m, x_n) (or k(x_n, x) for prediction); we do not care whether we know φ explicitly.

53

Gaussian Kernel

 The famous Gaussian kernel function evaluates the similarity between any two points x and x′:

   k(x, x′) = exp( −‖x − x′‖² / (2σ²) )

 It comes from the PDF of the Gaussian distribution:

   p(x | μ, σ²) = (1 / √((2πσ²)^d)) exp( −‖x − μ‖² / (2σ²) )

[Figure: the training data in the (x1, x2) plane with two basis points b_1 = x_4 and b_2 = x_5 marked]

 If we have two fixed basis points, we can calculate the similarity of any point to these two basis points with the Gaussian kernel.
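A minimal implementation of the Gaussian kernel as a Python function (the basis point and σ values are illustrative):

```python
import numpy as np

def gaussian_kernel(x, b, sigma=1.0):
    """k(x, b) = exp(-||x - b||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

b1 = np.array([1.0, 1.0])                 # a basis point
print(gaussian_kernel([1.0, 1.0], b1))    # 1.0  (the point is the basis itself)
print(gaussian_kernel([4.0, 4.0], b1))    # close to 0 (far from the basis)
```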

54

Gaussian kernel

 Why can the kernel trick produce a nonlinear decision boundary?

[Figure: training points x_1, x_2, x_3 and basis points b_1 = x_4, b_2 = x_5 in the (x1, x2) plane]

   f_1 = k(x, b_1) = exp( −‖x − b_1‖² / (2σ²) )
   f_2 = k(x, b_2) = exp( −‖x − b_2‖² / (2σ²) )

55

[Figure: the points x_1, x_2, x_3 and the basis points b_1 = x_4, b_2 = x_5 in the (x1, x2) plane]

 Training example x_1 is close to basis b_1: then f_1 will be close to 1.

 Training example x_2 is far away from basis b_1: then f_1 will be close to 0.

 Training example x_3 is far away from basis b_1: then f_1 will be close to 0.

 Training example x_4 is the basis b_1 itself: then f_1 will be exactly 1.

 Training example x_5 = b_2 is far away from basis b_1: then f_1 will be close to 0.

 The same process can be carried out for b_2 = x_5, using

   f_1 = k(x, b_1) = exp( −‖x − b_1‖² / (2σ²) ),   f_2 = k(x, b_2) = exp( −‖x − b_2‖² / (2σ²) )

56

Gaussian Kernel Visualization

 The Gaussian kernel, or radial basis function (RBF) kernel, K(·, ·) is a popular kernel function that is commonly used in SVM classification:

   f_i = k(x, b_i) = exp( −‖x − b_i‖² / (2σ²) )

 The basis point b_i controls the location of the kernel, and σ controls its shape. Try to plot this in Python.
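The slide suggests plotting this in Python; here is a minimal matplotlib sketch of the kernel value over a grid, for one basis point and one value of σ (both chosen for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

b = np.array([0.0, 0.0])   # basis point (location of the bump)
sigma = 1.0                # controls the width/shape

x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
f = np.exp(-((x1 - b[0]) ** 2 + (x2 - b[1]) ** 2) / (2 * sigma ** 2))

plt.contourf(x1, x2, f, levels=20)
plt.colorbar(label="k(x, b)")
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()
```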

57

Decision Boundary

[Figure: the points x_1, x_2, x_3 and the basis points b_1 = x_4, b_2 = x_5 in the (x1, x2) plane]

 With the features f_1, f_2 the model is  f(x, β) = β_0 + β_1f_1 + β_2f_2,
 and we predict t = 1 if β_0 + β_1f_1 + β_2f_2 ≥ 0.

 Suppose β_0 = −1, β_1 = 2.5, β_2 = 1.5.

 For x_1: f_1 will be close to 1 and f_2 close to 0, so β_0 + β_1f_1 + β_2f_2 ≈ 1.5 ≥ 0, predict 1.

 For x_2: f_1 will be close to 0 and f_2 close to 1, so β_0 + β_1f_1 + β_2f_2 ≈ 0.5 ≥ 0, predict 1.

 For x_3: f_1 and f_2 will both be close to 0, so β_0 + β_1f_1 + β_2f_2 ≈ −1 < 0, predict −1.
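The three predictions above can be checked in a couple of lines; the approximate feature values are the ones assumed on the slide:

```python
beta0, beta1, beta2 = -1.0, 2.5, 1.5
# Approximate (f1, f2) for x1, x2 and x3 as assumed above
for name, (f1, f2) in {"x1": (1, 0), "x2": (0, 1), "x3": (0, 0)}.items():
    score = beta0 + beta1 * f1 + beta2 * f2
    print(name, score, "-> predict", 1 if score >= 0 else -1)
```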

58

High-dimensional Feature Space

 In the previous example we only used b_1 = x_4 and b_2 = x_5.

 But why only these two?

 Actually, in the kernel method all the training examples can be used as basis points:

 Training set:  { (x_1, t_1), (x_2, t_2), (x_3, t_3), …, (x_N, t_N) }

 Basis:         { x_1, x_2, x_3, …, x_N }

59

High-dimensional Feature Space

[Figure: the five training points x_1, …, x_5 in the (x1, x2) plane]

 Take (x_2, t_2) as an example (here n = 2 and N = 5). For each basis point we can calculate the similarity (N of them in total), giving the feature vector

   f⁽²⁾ = ( f_02, f_12, f_22, f_32, f_42, f_52 )ᵀ ∈ ℝ^(N+1) = ℝ⁶

 where

   f_02 = 1  (intercept term)
   f_12 = exp( −‖x_2 − x_1‖² / (2σ²) )
   f_22 = exp( −‖x_2 − x_2‖² / (2σ²) ) = 1
   f_32 = exp( −‖x_2 − x_3‖² / (2σ²) )
   f_42 = exp( −‖x_2 − x_4‖² / (2σ²) )
   f_52 = exp( −‖x_2 − x_5‖² / (2σ²) )
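A sketch of computing these kernel features for every training point (each row is one f⁽ⁿ⁾, with all N training points used as basis), together with the usual shortcut in scikit-learn, where SVC(kernel="rbf") handles the same features implicitly with gamma = 1/(2σ²). Data and parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 2))          # N = 5 toy training points
t = np.array([1, 1, -1, -1, 1])
sigma = 1.0

# Feature matrix F: F[n, i] = exp(-||x_n - x_i||^2 / (2 sigma^2)), plus an intercept column
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
F = np.hstack([np.ones((len(X), 1)), np.exp(-sq_dists / (2 * sigma ** 2))])
print(F.shape)                       # (5, 6) = (N, N + 1)

# Equivalent in practice: the RBF kernel handled implicitly by the solver
clf = SVC(kernel="rbf", gamma=1 / (2 * sigma ** 2), C=1.0).fit(X, t)
print(clf.score(X, t))
```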

60

SVM loss function employing the kernel method

 Let β ∈ ℝ^(N+1). The SVM + Kernel loss function is

   L(β) = C ∑_{n=1}^{N} L(t_n, βᵀf⁽ⁿ⁾) + (1/2) ∑_{j=1}^{N} β_j²

 where, for example for n = 2,

   βᵀf⁽ⁿ⁾ = β_0f_0n + β_1f_1n + β_2f_2n + β_3f_3n + ⋯ + β_N f_Nn

 Prediction rule: predict 1 if βᵀf⁽ⁿ⁾ ≥ 0, and 0 (the −1 class) if βᵀf⁽ⁿ⁾ < 0.

Bias & Variance: Impact of C

 How the parameter C = 1/λ can impact bias and variance:
 Large C: low bias and high variance
 Small C: high bias and low variance

Bias & Variance: Impact of σ

 Large σ: f_i will vary more smoothly, giving high bias and low variance
 Small σ: f_i will vary more sharply, giving low bias and high variance

How to choose the kernel?

 There are many other types of kernels, e.g. the polynomial kernel, the chi-square kernel, etc.

 Use SVM without a kernel (i.e. the linear kernel) when the number of features d is larger than the number of training examples N.

 Use the Gaussian kernel when d is small and N is medium, e.g. N = 500.

 If d is small and N is very large, e.g. N = 100,000, speed could be an issue when using the Gaussian kernel (too many features); therefore SVM without a kernel is a better choice.

Lecture05_Example03.py
0, if 𝜷𝜷𝑇𝑇𝒇𝒇(𝑛𝑛) < 0 61 How the parameter C = 1 λ can impact bias and variance: Large C: Low bias and high variance Small C: High bias and low variance Bias & Variance Impact of C 62 Bias & Variance Impact of 𝝈𝝈 Large 𝝈𝝈 𝑓𝑓𝑖𝑖 will vary more smoothly High bias and low variance Small 𝝈𝝈 𝑓𝑓𝑖𝑖 will vary more shapely Low bias and high variance 63  There are many other types of kernels, e.g. polynomial kernel, chi- square kernel, etc  Use SVM without kernel (linear kernel) when number of features 𝑑𝑑 is larger than number of training examples 𝑁𝑁  Use Gaussian kernel when 𝑑𝑑 is small and 𝑁𝑁 is medium, e.g. 𝑁𝑁 = 500  If 𝑑𝑑 is small and 𝑁𝑁 is very large, e.g. 𝑁𝑁 = 100,000, speed could be an issue when using Gaussian kernel (too many features).Therefore, SVM without kernel is a better choice. How to choose kernel? 64 Lecture05_Example03.py 65 Slide Number 1 Slide Number 2 Slide Number 3 Slide Number 4 Slide Number 5 Slide Number 6 Slide Number 7 Slide Number 8 Slide Number 9 Slide Number 10 Slide Number 11 Slide Number 12 Slide Number 13 Slide Number 14 Slide Number 15 Slide Number 16 Slide Number 17 Slide Number 18 Slide Number 19 Slide Number 20 Slide Number 21 Slide Number 22 Slide Number 23 Slide Number 24 Slide Number 25 Slide Number 26 Slide Number 27 Slide Number 28 Slide Number 29 Slide Number 30 Slide Number 34 Slide Number 38 Slide Number 39 Slide Number 40 Slide Number 41 Slide Number 42 Slide Number 43 Slide Number 44 Slide Number 45 Slide Number 46 Slide Number 47 Slide Number 48 Slide Number 49 Slide Number 50 Slide Number 51 Slide Number 52 Slide Number 53 Slide Number 54 Slide Number 55 Slide Number 56 Slide Number 57 Slide Number 58 Slide Number 59 Slide Number 60 Slide Number 61 Slide Number 62 Slide Number 63 Slide Number 64 Slide Number 65