10_Kernel_methods
Qiuhong Ke
Kernel Methods
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
So far…
Objectives of hard-margin and soft-margin SVM
2

Hard-margin SVM:
min_{w,b} ∥w∥^2 / 2
s.t. 1 − y^{(i)}(w^T x^{(i)} + b) ≤ 0, i = 1,⋯,n (data points)

Soft-margin SVM:
min_{w,b,ξ} ( ∥w∥^2 / 2 + C Σ_{i=1}^{n} ξ_i )
s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1 − ξ_i , ξ_i ≥ 0 (i = 1,⋯,n data points)

• Hard-margin SVM loss:
l_∞ = 0 if 1 − y^{(i)}(w^T x^{(i)} + b) ≤ 0, and ∞ otherwise

• Soft-margin SVM loss (hinge loss):
l_h = 0 if 1 − y^{(i)}(w^T x^{(i)} + b) ≤ 0, and 1 − y^{(i)}(w^T x^{(i)} + b) otherwise

Slack variables: ξ_i = 0 if y^{(i)}(w^T x^{(i)} + b) ≥ 1, and 1 − y^{(i)}(w^T x^{(i)} + b) otherwise
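To make the hinge loss and slack variables concrete, here is a minimal NumPy sketch (my own toy numbers, not from the slides) that evaluates the margins, the hinge loss, and the implied slack ξ_i for a few points under some candidate (w, b).

```python
import numpy as np

# Toy data: three 2-D points with labels in {-1, +1} (made-up values for illustration)
X = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1])

# A candidate hyperplane (also made up)
w = np.array([1.0, 1.0])
b = -1.0

margins = y * (X @ w + b)              # y^(i) (w^T x^(i) + b)
hinge = np.maximum(0.0, 1 - margins)   # hinge loss; equals the slack xi_i for each point

print(margins)   # points with margin >= 1 incur zero loss / zero slack
print(hinge)
```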
So far…
3
min_{w,b,ξ} ( ∥w∥^2 / 2 + C Σ_{i=1}^{n} ξ_i )
s.t. y^{(i)}(w^T x^{(i)} + b) ≥ 1 − ξ_i , ξ_i ≥ 0 (i = 1,⋯,n data points)

Influence of slack penalty C: a large C penalises margin violations heavily (behaviour closer to the hard-margin SVM), while a small C tolerates more violations in exchange for a wider margin.
[Figure: influence of the slack penalty C on the learned margin]
So far…
What’s the dual problem?
4

Consider the primal problem min_x f(x) subject to g(x) ≤ 0, with Lagrangian
L(x, λ) = f(x) + λ g(x),  λ ≥ 0.

• Primal problem:  min_x f(x) = min_x max_λ L(x, λ)
• Dual problem:  max_λ min_x L(x, λ) = max_λ θ_d(λ)

Dual function:  θ_d(λ) = min_x L(x, λ) ≤ L(x, λ) = f(x) + λ g(x) ≤ f(x)

Solutions:  p* = f(x*) = min_x f(x)  and  d* = θ_d(λ*) = max_λ θ_d(λ)

Weak duality:  d* = θ_d(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ* g(x*) ≤ f(x*) = p*
So far…
What are the conditions making Solution(dual) == Solution(primal)?
5

KKT (Karush-Kuhn-Tucker) conditions:
• g(x) ≤ 0  (Primal feasibility)
• λ ≥ 0  (Dual feasibility)
• ∂L/∂x = 0  (Stationarity)
• λ g(x) = 0  (Complementary slackness)

d* = θ_d(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ* g(x*) ≤ f(x*) = p*

d* = p* if min_x L(x, λ*) == L(x*, λ*) and f(x*) + λ* g(x*) == f(x*)
6
Dual problem of hard-margin SVM
So far…
Dual function:
θ_d(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{k=1}^{n} λ_i λ_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}

Dual problem:  max_λ θ_d(λ)
s.t. λ_i ≥ 0 and Σ_{i=1}^{n} λ_i y^{(i)} = 0
So far…
7
KKT conditions in hard-margin SVM

Lagrangian:
L(w, b, λ) = ∥w∥^2 / 2 + Σ_{i=1}^{n} λ_i (1 − y^{(i)}(w^T x^{(i)} + b))

Primal feasibility:  g_i(w, b) = 1 − y^{(i)}(w^T x^{(i)} + b) ≤ 0  (i = 1,⋯,n data points)
Dual feasibility:  λ_i ≥ 0
Complementary slackness:  λ_i g_i(w, b) = 0
Stationarity:
∂L/∂w_j = 0 :  w_j = Σ_{i=1}^{n} λ_i y^{(i)} x_j^{(i)}
∂L/∂b = 0 :  Σ_{i=1}^{n} λ_i y^{(i)} = 0

[Figure: separating hyperplane f(x) = 0 with margin boundaries f(x) = +1 and f(x) = −1; the points on the margins, with λ_i > 0, are the support vectors.]
So far…
8
Dual problem of soft-margin SVM

Dual function:
θ_d(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{k=1}^{n} λ_i λ_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}

Dual problem:  max_λ θ_d(λ)
s.t. 0 ≤ λ_i ≤ C and Σ_{i=1}^{n} λ_i y^{(i)} = 0
So far…
9
KKT conditions in soft-margin SVM

Lagrangian:
L(w, b, λ, β, ξ) = ∥w∥^2 / 2 + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} λ_i g_i(w, b, ξ) + Σ_{i=1}^{n} β_i (−ξ_i)

Primal feasibility:  g_i(w, b, ξ) = 1 − ξ_i − y^{(i)}(w^T x^{(i)} + b) ≤ 0 and −ξ_i ≤ 0  (i = 1,⋯,n data points)
Dual feasibility:  λ_i ≥ 0, β_i ≥ 0
Complementary slackness:  λ_i g_i(w, b, ξ) = 0 and β_i ξ_i = 0
Stationarity:
∂L/∂w_j = 0 :  w_j = Σ_{i=1}^{n} λ_i y^{(i)} x_j^{(i)}
∂L/∂b = 0 :  Σ_{i=1}^{n} λ_i y^{(i)} = 0
∂L/∂ξ_i = 0 :  C − λ_i − β_i = 0

Hence 0 ≤ λ_i ≤ C, and the support vectors are the points with 0 < λ_i ≤ C.
[Figure: support vectors lie on or inside the margin.]
So far…
10
Why bother solving the dual problem rather than the primal problem?
• Inner product: the kernel trick can be used to efficiently handle non-linearly separable data
• Efficient in high-dimensional spaces

Training, solve:  max_λ θ_d(λ)
θ_d(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{k=1}^{n} λ_i λ_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}
s.t. λ_i ≥ 0 and Σ_{i=1}^{n} λ_i y^{(i)} = 0

Prediction:  f(x) = Σ_{i=1}^{n} λ_i y^{(i)} (x^{(i)})^T x + b
Outline
• Kernel Trick
• Common Kernels and Identifying New Kernels
• Kernelised Perceptron
11
Handling non-linear data with the SVM
12
Kernel Trick Common Kernels Kernelised Perceptron
Feature transformation
• Consider a binary classification problem
• Each example has features [x_1, x_2]
• Not linearly separable
• Now ‘add’ a feature x_3 = x_1^2 + x_2^2
• Each point is now [x_1, x_2, x_1^2 + x_2^2]
• Linearly separable! (A small sketch of this transformation follows below.)
13
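As a quick illustration of the idea on this slide, the following sketch (my own toy example, not from the lecture) builds a small centre-vs-ring dataset that is not linearly separable in [x_1, x_2], adds the feature x_3 = x_1^2 + x_2^2, and shows that a simple threshold on x_3 now separates the classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inner cluster (class -1) and outer ring (class +1); not linearly separable in 2-D
angles = rng.uniform(0, 2 * np.pi, 50)
inner = rng.normal(0, 0.3, (50, 2))                     # points near the origin
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]   # points on a radius-2 ring
X = np.vstack([inner, outer])
y = np.r_[-np.ones(50), np.ones(50)]

# 'Add' the feature x3 = x1^2 + x2^2
x3 = (X ** 2).sum(axis=1)

# In the new feature, a simple threshold separates the classes
threshold = 1.0
pred = np.where(x3 > threshold, 1, -1)
print("accuracy with x3 threshold:", (pred == y).mean())   # typically 1.0 on this toy data
```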
Kernel Trick Common Kernels Kernelised Perceptron
Naïve workflow
• Choose/design a linear model
• Choose/design a high-dimensional transformation φ(x)
  * Hoping that after adding a lot of various features, some of them will make the data linearly separable
• For each training example, and for each new instance, compute φ(x)
• Train classifier / do predictions
• Problem: impractical/impossible to compute φ(x) for high/infinite-dimensional φ(x)
14
Kernel Trick Common Kernels Kernelised Perceptron
Hard-margin SVM’s dual formulation
Training, solve:  max_λ θ_d(λ)
θ_d(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{k=1}^{n} λ_i λ_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}
s.t. λ_i ≥ 0 and Σ_{i=1}^{n} λ_i y^{(i)} = 0

Prediction:  f(x) = Σ_{i=1}^{n} λ_i y^{(i)} (x^{(i)})^T x + b
15
Kernel Trick Common Kernels Kernelised Perceptron
Hard-margin SVM in feature space
We just need the dot product!

Training, solve:  max_λ θ_d(λ)
θ_d(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{k=1}^{n} λ_i λ_k y^{(i)} y^{(k)} φ(x^{(i)})^T φ(x^{(k)})
s.t. λ_i ≥ 0 and Σ_{i=1}^{n} λ_i y^{(i)} = 0

Prediction:  f(x) = w^T φ(x) + b = Σ_{i=1}^{n} λ_i y^{(i)} φ(x^{(i)})^T φ(x) + b
16
Kernel Trick Common Kernels Kernelised Perceptron
Observation: Kernel representation
• Both parameter estimation and computing predictions depend on the data only in the form of a dot product
  * In the original space: u^T v = Σ_{j=1}^{m} u_j v_j
  * In the transformed space: φ(u)^T φ(v) = Σ_{j} φ(u)_j φ(v)_j
Benefits of replacing φ(u)^T φ(v) with a kernel K(u, v):
• no need to find the mapping function φ explicitly
• no need to transform the data
• no need to compute the dot product in the (possibly very high-dimensional) feature space: the kernel gives its value directly
17
Kernel Trick Common Kernels Kernelised Perceptron
Kernel as shortcut
• For some φ(x)’s, the kernel is faster to compute directly than first mapping to feature space and then taking the dot product.
• For example, consider two vectors u = [u_1] and v = [v_1] and the transformation φ(x) = [x_1^2, √(2c)·x_1, c], for some c
  * So φ(u) = [u_1^2, √(2c)·u_1, c]^T and φ(v) = [v_1^2, √(2c)·v_1, c]^T
  * Then φ(u)^T φ(v) = u_1^2 v_1^2 + 2c·u_1 v_1 + c^2
• This can alternatively be computed directly as φ(u)^T φ(v) = (u_1 v_1 + c)^2
  * Here K(u, v) = (u_1 v_1 + c)^2 is the corresponding kernel
18
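A tiny numeric check of the shortcut above (my own sketch, using the mapping φ(x) = [x_1^2, √(2c)·x_1, c] from this slide): the explicit feature-space dot product and the direct kernel evaluation agree.

```python
import numpy as np

def phi(x1, c):
    """Explicit feature map for the 1-D example: [x1^2, sqrt(2c)*x1, c]."""
    return np.array([x1 ** 2, np.sqrt(2 * c) * x1, c])

def kernel(u1, v1, c):
    """Direct kernel evaluation K(u, v) = (u1*v1 + c)^2, no feature map needed."""
    return (u1 * v1 + c) ** 2

u1, v1, c = 3.0, -2.0, 1.5
print(phi(u1, c) @ phi(v1, c))   # 20.25, via the explicit feature map
print(kernel(u1, v1, c))         # 20.25, same value computed directly
```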
Kernel Trick Common Kernels Kernelised Perceptron
Hard-margin SVM in feature space
Training: solve  max_λ L(λ)
L(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{k=1}^{n} λ_i λ_k y^{(i)} y^{(k)} φ(x^{(i)})^T φ(x^{(k)}),  where φ(x^{(i)})^T φ(x^{(k)}) = K(x^{(i)}, x^{(k)})

Making predictions:
f(x) = w^T φ(x) + b = Σ_{i=1}^{n} λ_i y^{(i)} φ(x^{(i)})^T φ(x) + b,  where φ(x^{(i)})^T φ(x) = K(x^{(i)}, x)
19
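To show how the kernel plugs into the prediction formula on this slide, here is a minimal sketch (my own; it assumes the dual variables λ, labels y, training points X and bias b have already been obtained from some dual solver) of computing f(x) = Σ_i λ_i y^{(i)} K(x^{(i)}, x) + b.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_predict(x, X, y, lam, b, kernel=rbf_kernel):
    """f(x) = sum_i lam_i * y_i * K(x_i, x) + b; the sign gives the predicted class."""
    score = sum(l * yi * kernel(xi, x) for l, yi, xi in zip(lam, y, X))
    return score + b

# Made-up values standing in for the output of a dual solver
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])   # training points (support vectors)
y = np.array([1, 1, -1])
lam = np.array([0.5, 0.3, 0.8])                      # dual variables lambda_i
b = 0.1

print(np.sign(svm_predict(np.array([0.5, 0.5]), X, y, lam, b)))
```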
Kernel Trick Common Kernels Kernelised Perceptron
More generally: The “kernel trick”
20
Kernel Trick Common Kernels Kernelised Perceptron
Modular learning
21
Kernel Trick Common Kernels Kernelised Perceptron
Approaches to non-linearity
22
Kernel Trick Common Kernels Kernelised Perceptron
Polynomial kernel
K(u, v) = (u^T v + c)^d,  (c ≥ 0 and d ≥ 0)
23

Assume c = 0; if it’s not, add √c as a dummy feature to u and v.

e.g., u = [u_1, u_2], v = [v_1, v_2], c = 0, d = 2:
K(u, v) = (u_1 v_1 + u_2 v_2)^2
        = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2
        = [u_1^2, u_2^2, √2·u_1 u_2]^T [v_1^2, v_2^2, √2·v_1 v_2]
        = φ(u)^T φ(v),  with φ(x) = [x_1^2, x_2^2, √2·x_1 x_2]

What about d = 3? (See the worked expansion below.)
Kernel Trick Common Kernels Kernelised Perceptron
https://www.youtube.com/watch?v=3liCbRZPrZA
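For the “d = 3?” question, here is the analogous expansion for two features (my own working, following the same pattern as the d = 2 case):

```latex
K(u, v) = (u_1 v_1 + u_2 v_2)^3
        = u_1^3 v_1^3 + 3\, u_1^2 u_2 \, v_1^2 v_2 + 3\, u_1 u_2^2 \, v_1 v_2^2 + u_2^3 v_2^3
        = \phi(u)^T \phi(v), \quad \text{with } \phi(x) = [\, x_1^3,\ \sqrt{3}\, x_1^2 x_2,\ \sqrt{3}\, x_1 x_2^2,\ x_2^3 \,]
```

The same pattern continues for larger d: the implicit feature space contains all degree-d monomials, scaled by square roots of the multinomial coefficients.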
Radial Basis Function (RBF) kernel
K(u, v) = exp(−γ ∥u − v∥^2)
24

e.g., u = [u_1], v = [v_1], γ = 1:
K(u, v) = exp(−(u_1 − v_1)^2)
        = exp(−u_1^2) exp(−v_1^2) exp(2 u_1 v_1)
        = exp(−u_1^2) exp(−v_1^2) Σ_{i=0}^{∞} (2 u_1 v_1)^i / i!        (Taylor series of exp)
        = Σ_{i=0}^{∞} ( exp(−u_1^2) √(2^i / i!) u_1^i ) ( exp(−v_1^2) √(2^i / i!) v_1^i )
        = φ(u)^T φ(v)

Infinite-dimensional feature space:
φ(x) = exp(−x^2) · [ 1, √(2/1!)·x, √(2^2/2!)·x^2, ⋯ ]
Kernel Trick Common Kernels Kernelised Perceptron
https://www.youtube.com/watch?v=3liCbRZPrZA
Radial Basis Function (RBF) kernel
K(u, v) = exp(−γ ∥u − v∥^2)

y = exp(−γ x^2) = exp(−x^2 / (2σ^2)),  i.e. γ = 1 / (2σ^2)

[Figure: y = exp(−γx^2) plotted against x for γ = 10 and γ = 0.1]

γ too small: underfitting
γ too large: overfitting
25
Kernel Trick Common Kernels Kernelised Perceptron
https://www.youtube.com/watch?v=3liCbRZPrZA
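To see the effect of γ described above, here is a small sketch (my own, with made-up points) that evaluates the RBF kernel for several values of γ: with small γ even distant points keep sizeable similarity (a smoother model, risking underfitting), while with large γ the similarity collapses quickly with distance (a very local model, risking overfitting).

```python
import numpy as np

def rbf(u, v, gamma):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

u = np.array([0.0, 0.0])
for gamma in (0.1, 1.0, 10.0):
    # Kernel value between u and points at distance 0.1, 1.0 and 3.0
    vals = [rbf(u, np.array([d, 0.0]), gamma) for d in (0.1, 1.0, 3.0)]
    print(gamma, np.round(vals, 4))
```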
Identify new kernels
Mercer’s theorem:
Consider a finite sequence of vectors x_1, ⋯, x_n.
Construct the n×n matrix A (Gram matrix) of pairwise values K(x_i, x_j).
K is a valid kernel if this matrix is positive semi-definite, and this holds for all possible sequences.

A = [ K(x_1, x_1)  K(x_1, x_2)  …  K(x_1, x_n)
      K(x_2, x_1)  K(x_2, x_2)  …  K(x_2, x_n)
      ⋮             ⋮                ⋮
      K(x_n, x_1)  K(x_n, x_2)  …  K(x_n, x_n) ]
26
Kernel Trick Common Kernels Kernelised Perceptron
Identify new kernels
Positive semi-definite matrix: a square symmetric matrix A ∈ ℝ^{n×n} (A = A^T) that satisfies
v^T A v ≥ 0 for any non-zero (column) vector v ∈ ℝ^{n×1}.

A = [ K(x_1, x_1)  K(x_1, x_2)  …  K(x_1, x_n)
      K(x_2, x_1)  K(x_2, x_2)  …  K(x_2, x_n)
      ⋮             ⋮                ⋮
      K(x_n, x_1)  K(x_n, x_2)  …  K(x_n, x_n) ]
27
Kernel Trick Common Kernels Kernelised Perceptron
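As a practical (necessary-condition) check of the Mercer criterion above, this sketch (my own) builds the Gram matrix of an RBF kernel on one random sample and verifies that its eigenvalues are non-negative. Note this only tests one finite sequence, whereas the theorem requires positive semi-definiteness for all possible sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # a finite sequence of vectors x_1, ..., x_n

def rbf(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Gram matrix A with A[i, j] = K(x_i, x_j)
A = np.array([[rbf(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(A)       # A is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)        # True: PSD up to numerical tolerance
```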
Identify new kernels
Constructing new kernels
Let K_1(u, v) and K_2(u, v) be kernels, c > 0 be a constant, and f(x) be a real-valued function.
Then each of the following is also a kernel:
1) K(u, v) = K_1(u, v) + K_2(u, v)
2) K(u, v) = c·K_1(u, v)
3) K(u, v) = f(u) K_1(u, v) f(v)
28
Kernel Trick Common Kernels Kernelised Perceptron
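Why rule 1) holds can be seen directly from Mercer’s criterion; a short argument (my own wording of the standard one):

```latex
v^T (A_1 + A_2)\, v = v^T A_1 v + v^T A_2 v \ge 0 \quad \text{for all } v,
```

where A_1 and A_2 are the Gram matrices of K_1 and K_2 on any finite sequence, so the Gram matrix of K_1 + K_2 is again positive semi-definite. Rule 2) follows similarly, since c·v^T A_1 v ≥ 0 for c > 0.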
Recap of perceptron
29
Kernel Trick Common Kernels Kernelised Perceptron
L(s,y)=max(0,-sy)
Perceptron training algorithm
30
Kernel Trick Common Kernels Kernelised Perceptron
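The training algorithm itself is shown on the lecture slide image; as a reference, here is a minimal sketch of the standard perceptron update consistent with the loss L(s, y) = max(0, −sy) above: update w and b only when a point is misclassified (y·s ≤ 0). The learning rate, epoch count and toy data below are my own assumptions.

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Standard perceptron: on a mistake (y * score <= 0), update w += lr*y*x, b += lr*y."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            s = w @ xi + b          # score
            if yi * s <= 0:         # mistake under the loss max(0, -s*y)
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Tiny linearly separable toy data (made up)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # matches y
```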
Kernelised perceptron
31
Kernel Trick Common Kernels Kernelised Perceptron
Kernelised perceptron
32
Kernel Trick Common Kernels Kernelised Perceptron
Kernelised perceptron
33
Kernel Trick Common Kernels Kernelised Perceptron
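The kernelised-perceptron slides above are mostly worked on the lecture images; as a hedged sketch of the standard kernelisation, the weights are kept implicit as w = Σ_i α_i y^{(i)} φ(x^{(i)}), so both the score and the update only ever touch the data through K(x^{(i)}, x). The variable names, training loop details and toy data below are my own.

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """Kernelised perceptron: alpha[i] counts mistakes on x_i; w is never formed explicitly."""
    n = len(y)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # precomputed Gram matrix
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])    # f(x_i) = sum_j alpha_j y_j K(x_j, x_i)
            if y[i] * score <= 0:                  # mistake: bump alpha_i
                alpha[i] += 1
    return alpha

def kernel_perceptron_predict(x, X, y, alpha, kernel):
    return np.sign(np.sum(alpha * y * np.array([kernel(xj, x) for xj in X])))

# Toy data handled via an RBF kernel (made-up example)
rbf = lambda u, v: np.exp(-1.0 * np.sum((u - v) ** 2))
X = np.array([[0.1, 0.0], [0.0, 0.2], [2.0, 0.0], [0.0, 2.0]])
y = np.array([-1, -1, 1, 1])
alpha = kernel_perceptron_train(X, y, rbf)
print(kernel_perceptron_predict(np.array([1.8, 0.3]), X, y, alpha, rbf))  # +1.0
```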
Summary
• What is the kernel trick
• What is the influence of gamma in the RBF kernel
• How to identify new kernels
• How to kernelise the perceptron
34