
10_Kernel_methods

Qiuhong Ke

Kernel Methods

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

So far…
Objectives of hard-margin and soft-margin SVM

Hard-margin SVM:
min_{w,b} ∥w∥²/2
s.t. 1 − y^(i)(w^T x^(i) + b) ≤ 0, i = 1,⋯,n (data points)

Soft-margin SVM:
min_{w,b,ξ} ( ∥w∥²/2 + C ∑_{i=1}^n ξ_i )
s.t. y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0 (i = 1,⋯,n data points)

• Hard-margin SVM loss:
  0 if 1 − y^(i)(w^T x^(i) + b) ≤ 0, and ∞ otherwise

• Soft-margin SVM loss (hinge loss):
  0 if 1 − y^(i)(w^T x^(i) + b) ≤ 0, and 1 − y^(i)(w^T x^(i) + b) otherwise

• The slack variable: ξ_i = 0 if y^(i)(w^T x^(i) + b) ≥ 1, and ξ_i = 1 − y^(i)(w^T x^(i) + b) otherwise
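A minimal Python sketch of the two per-example losses above (the function names are mine, not from the slides):

```python
import numpy as np

def hard_margin_loss(score, y):
    """0 if the margin constraint y * score >= 1 holds, infinity otherwise."""
    return 0.0 if 1 - y * score <= 0 else np.inf

def hinge_loss(score, y):
    """0 if the margin constraint holds, otherwise the size of the violation (the slack)."""
    return max(0.0, 1 - y * score)

# Example: a correctly classified point that falls inside the margin
print(hard_margin_loss(0.5, +1))  # inf  -- the hard margin forbids it
print(hinge_loss(0.5, +1))        # 0.5  -- the soft margin pays a slack of 0.5
```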

So far…
Influence of slack penalty C

Soft-margin SVM:
min_{w,b,ξ} ( ∥w∥²/2 + C ∑_{i=1}^n ξ_i )
s.t. y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0 (i = 1,⋯,n data points)

[Figure: decision boundaries for different values of C. A large C penalises slack heavily and tolerates few margin violations; a small C allows more violations in exchange for a wider margin.]
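As a hedged illustration (assuming scikit-learn is available; the data here is synthetic and of my own choosing), the slack penalty is exposed as the C argument of a linear SVM, and a smaller C typically leaves more points as support vectors:

```python
# A small sketch: smaller C tolerates more margin violations,
# so more points typically end up as support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, size=(50, 2)),
               rng.normal(+1, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)   # number of support vectors per class
```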

So far…
What's the dual problem?

Constrained problem: min_x f(x) s.t. g(x) ≤ 0
Lagrangian: L(x, λ) = f(x) + λ g(x), with λ ≥ 0

• Primal problem: min_x f(x) = min_x max_λ L(x, λ)
• Dual problem: max_λ min_x L(x, λ) = max_λ θ_d(λ)

For any λ ≥ 0 and any feasible x:
θ_d(λ) = min_x L(x, λ) ≤ L(x, λ) = f(x) + λ g(x) ≤ f(x)

Solutions:
d* = θ_d(λ*) = max_λ θ_d(λ)
p* = f(x*) = min_x f(x)
d* = θ_d(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ* g(x*) ≤ f(x*) = p*
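As a tiny worked example of these definitions (a toy problem of my own, not from the slides): take f(x) = x² and g(x) = 1 − x ≤ 0. Then θ_d(λ) = min_x [x² + λ(1 − x)] = λ − λ²/4, which is maximised at λ* = 2 with d* = 1, equal to p* = f(1) = 1. A quick numeric check:

```python
import numpy as np

# Toy problem: minimise f(x) = x^2 subject to g(x) = 1 - x <= 0.
f = lambda x: x**2
g = lambda x: 1 - x
L = lambda x, lam: f(x) + lam * g(x)

xs = np.linspace(-5, 5, 100001)
lams = np.linspace(0, 10, 10001)

theta_d = np.array([L(xs, lam).min() for lam in lams])   # dual function on a grid
d_star = theta_d.max()                                   # dual optimum
p_star = f(xs[g(xs) <= 0]).min()                         # primal optimum (feasible x only)

print(d_star, p_star)   # both approximately 1.0 -> the two solutions coincide here
```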

So far…
What are the conditions making Solution(dual) == Solution(primal)?

KKT (Karush-Kuhn-Tucker) conditions:

g(x) ≤ 0 (Primal feasibility)
λ ≥ 0 (Dual feasibility)
λ g(x) = 0 (Complementary slackness)
∂L/∂x = 0 (Stationarity)

Recall the chain
d* = θ_d(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ* g(x*) ≤ f(x*) = p*

so d* = p* if min_x L(x, λ*) == L(x*, λ*) and f(x*) + λ* g(x*) == f(x*).
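Continuing the toy problem above (f(x) = x², g(x) = 1 − x), the KKT conditions can be checked directly at x* = 1, λ* = 2:

```python
x_star, lam_star = 1.0, 2.0

print(1 - x_star <= 0)                  # primal feasibility: g(x*) <= 0
print(lam_star >= 0)                    # dual feasibility
print(lam_star * (1 - x_star) == 0)     # complementary slackness
print(2 * x_star - lam_star == 0)       # stationarity: dL/dx = 2x - lambda = 0
```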

So far…
Dual problem of hard-margin SVM

Dual function:
θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)

Dual problem:
max_λ θ_d(λ)
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0
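A minimal NumPy sketch of the dual function θ_d(λ) exactly as written above (the function and variable names are mine):

```python
import numpy as np

def dual_objective(lam, X, y):
    """theta_d(lambda) = sum_i lam_i - 1/2 sum_i sum_k lam_i lam_k y_i y_k (x_i . x_k).
    X: (n, d) data matrix, y: (n,) labels in {-1, +1}, lam: (n,) multipliers."""
    Yx = y[:, None] * X              # rows y_i * x_i
    G = Yx @ Yx.T                    # G[i, k] = y_i y_k (x_i . x_k)
    return lam.sum() - 0.5 * lam @ G @ lam
```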

So far…
KKT conditions in hard-margin SVM

Lagrangian:
L(w, b, λ) = ∥w∥²/2 + ∑_{i=1}^n λ_i (1 − y^(i)(w^T x^(i) + b))

g_i(w, b) = 1 − y^(i)(w^T x^(i) + b) ≤ 0, i = 1,⋯,n data points (Primal feasibility)
λ_i ≥ 0 (Dual feasibility)
λ_i g_i(w, b) = 0 (Complementary slackness)
∂L/∂w_j = 0 : w_j = ∑_{i=1}^n λ_i y^(i) x_j^(i), and ∂L/∂b = 0 : ∑_{i=1}^n λ_i y^(i) = 0 (Stationarity)

Support vectors:

[Figure: separating hyperplane f(x) = 0 with margin hyperplanes f(x) = +1 and f(x) = −1. The points lying on the margin hyperplanes have λ_i > 0; these are the support vectors. By complementary slackness, all other points have λ_i = 0 and do not contribute to w.]
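Given multipliers λ that solve the dual, stationarity recovers w, and any support vector (where the constraint is tight) recovers b; a minimal NumPy sketch under those assumptions (the function name is mine):

```python
import numpy as np

def recover_w_b(lam, X, y, tol=1e-8):
    """w = sum_i lam_i y_i x_i; b from the support vectors, where y_i (w.x_i + b) = 1."""
    w = (lam * y) @ X                   # stationarity condition
    sv = np.flatnonzero(lam > tol)      # support vectors: lam_i > 0
    b = np.mean(y[sv] - X[sv] @ w)      # y_i in {-1,+1}, so b = y_i - w.x_i on the margin
    return w, b
```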

So far…
Dual problem of soft-margin SVM

Dual function:
θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)

Dual problem:
max_λ θ_d(λ)
s.t. 0 ≤ λ_i ≤ C and ∑_{i=1}^n λ_i y^(i) = 0

The dual function is the same as in the hard-margin case; the only change is the box constraint 0 ≤ λ_i ≤ C.
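A hedged sketch of solving this dual numerically with SciPy's general-purpose SLSQP solver (a dedicated SVM solver would normally be used instead; the data X, y and the value of C are assumed given):

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C):
    """Maximise theta_d(lambda) s.t. 0 <= lambda_i <= C and sum_i lambda_i y_i = 0."""
    n = len(y)
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                              # G[i,k] = y_i y_k (x_i . x_k)
    neg_dual = lambda lam: -(lam.sum() - 0.5 * lam @ G @ lam)  # minimise the negative
    res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
    return res.x

# lam = solve_soft_margin_dual(X, y, C=1.0)
# the support vectors are the points with lam_i > 0
```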

So far…
KKT conditions in soft-margin SVM

Lagrangian:
L(w, b, ξ, λ, β) = ∥w∥²/2 + C ∑_{i=1}^n ξ_i + ∑_{i=1}^n λ_i g_i(w, b, ξ) + ∑_{i=1}^n β_i (−ξ_i)

g_i(w, b, ξ) = 1 − ξ_i − y^(i)(w^T x^(i) + b) ≤ 0 and −ξ_i ≤ 0, i = 1,⋯,n data points (Primal feasibility)
λ_i ≥ 0, β_i ≥ 0 (Dual feasibility)
λ_i g_i(w, b, ξ) = 0 and β_i ξ_i = 0 (Complementary slackness)
∂L/∂w_j = 0 : w_j = ∑_{i=1}^n λ_i y^(i) x_j^(i), ∂L/∂b = 0 : ∑_{i=1}^n λ_i y^(i) = 0, ∂L/∂ξ_i = 0 : C − λ_i − β_i = 0 (Stationarity)

Support vectors

[Figure: soft-margin SVM solution. The support vectors are the points with 0 < λ_i ≤ C; every point satisfies 0 ≤ λ_i ≤ C.]

So far…
Why bother solving the dual problem to solve the primal problem?

• Inner product: the kernel trick can be used to efficiently handle non-linearly separable data
• Efficient in high-dimensional space

Training, solve:
max_λ θ_d(λ), θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0

Prediction:
f(x) = ∑_{i=1}^n λ_i y^(i) (x^(i))^T x + b

Outline
• Kernel Trick
• Common Kernels and Identify New Kernels
• Kernelised Perceptron

Handling non-linear data with the SVM

Feature transformation
• Consider a binary classification problem
• Each example has features [x_1, x_2]
• Not linearly separable
• Now 'add' a feature x_3 = x_1² + x_2²
• Each point is now [x_1, x_2, x_1² + x_2²]
• Linearly separable!

Naïve workflow
• Choose/design a linear model
• Choose/design a high-dimensional transformation φ(x)
  * Hoping that after adding a lot of various features some of them will make the data linearly separable
• For each training example, and for each new instance, compute φ(x)
• Train classifier / do predictions
• Problem: impractical/impossible to compute φ(x) for high/infinite-dimensional φ(x)

Hard-margin SVM's dual formulation

Training, solve:
max_λ θ_d(λ), θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0

Prediction:
f(x) = ∑_{i=1}^n λ_i y^(i) (x^(i))^T x + b

Hard-margin SVM in feature space

We just need the dot product!

Training, solve:
max_λ θ_d(λ), θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (φ(x^(i)))^T φ(x^(k))
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0

Prediction:
f(x) = w^T x + b = ∑_{i=1}^n λ_i y^(i) (φ(x^(i)))^T φ(x) + b

Observation: Kernel representation
• Both parameter estimation and computing predictions depend on the data only in the form of a dot product
  * In the original space: u^T v = ∑_i u_i v_i
  * In the transformed space: φ(u)^T φ(v) = ∑_i φ(u)_i φ(v)_i
• Benefits:
  * no need to find the mapping function
  * no need to do the transformation
  * no need to do the dot product in feature space

Kernel as shortcut
• For some φ(x)'s, the kernel is faster to compute directly than first mapping to feature space and then taking the dot product.
• For example, consider two vectors u = [u_1] and v = [v_1] and the transformation φ(x) = [x_1², √(2c)·x_1, c], for some c
  * So φ(u) = [u_1², √(2c)·u_1, c]^T and φ(v) = [v_1², √(2c)·v_1, c]^T
  * Then φ(u)^T φ(v) = u_1² v_1² + 2c·u_1 v_1 + c²
• This can be alternatively computed directly as φ(u)^T φ(v) = (u_1 v_1 + c)²
  * Here K(u, v) = (u_1 v_1 + c)² is the corresponding kernel
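A quick numeric check of this shortcut (the values of u_1, v_1 and c are arbitrary):

```python
import numpy as np

c = 3.0
u1, v1 = 2.0, 1.5
phi = lambda x1: np.array([x1**2, np.sqrt(2 * c) * x1, c])   # the mapping phi above

print(phi(u1) @ phi(v1))      # explicit feature map, then dot product
print((u1 * v1 + c) ** 2)     # kernel computed directly: K(u, v) = (u1 v1 + c)^2
```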
Hard-margin SVM in feature space

Training: solve
max_λ L(λ), L(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (φ(x^(i)))^T φ(x^(k)), where each (φ(x^(i)))^T φ(x^(k)) is replaced by K(x^(i), x^(k))

Making predictions:
f(x) = w^T x + b = ∑_{i=1}^n λ_i y^(i) (φ(x^(i)))^T φ(x) + b, where each (φ(x^(i)))^T φ(x) is replaced by K(x^(i), x)

More generally: the "kernel trick"

Modular learning

Approaches to non-linearity

Polynomial kernel
K(u, v) = (u^T v + c)^d, (c ≥ 0 and d ≥ 0)

Assume c = 0; if it's not, add √c as a dummy feature to u and v.

e.g., u = [u_1, u_2], v = [v_1, v_2], c = 0, d = 2:
K(u, v) = (u_1 v_1 + u_2 v_2)²
        = u_1² v_1² + u_2² v_2² + 2 u_1 u_2 v_1 v_2
        = [u_1², u_2², √2·u_1 u_2]^T [v_1², v_2², √2·v_1 v_2]
        = φ(u)^T φ(v), with φ(x) = [x_1², x_2², √2·x_1 x_2]
What about d = 3?

https://www.youtube.com/watch?v=3liCbRZPrZA

Radial Basis Function (RBF) kernel
K(u, v) = exp(−γ∥u − v∥²)

e.g., u = [u_1], v = [v_1], γ = 1:
K(u, v) = exp(−∥u_1 − v_1∥²)
        = exp(−u_1²) exp(−v_1²) exp(2 u_1 v_1)
        = exp(−u_1²) exp(−v_1²) ( ∑_{i=0}^∞ (2 u_1 v_1)^i / i! )        (Taylor expansion)
        = ∑_{i=0}^∞ exp(−u_1²) exp(−v_1²) √(2^i/i!) √(2^i/i!) u_1^i v_1^i
        = φ(u)^T φ(v), with φ(x) = exp(−x²) · [1, √(2/1!)·x, √(2²/2!)·x², ⋯]

Infinite-dimensional feature space!

Radial Basis Function (RBF) kernel
K(u, v) = exp(−γ∥u − v∥²); in one dimension, y = exp(−γ x²) = exp(−x²/(2σ²)), i.e. γ = 1/(2σ²)
[Figure: the bump y = exp(−γ x²) for γ = 10 (narrow) and γ = 0.1 (wide).]
γ too small: underfitting; γ too large: overfitting

Identify new kernels
Mercer's theorem: Consider a finite sequence of vectors x_1, ⋯, x_n. Construct the n×n matrix A (Gram matrix) of pairwise values K(x_i, x_j). K is a valid kernel if this matrix is positive semi-definite, and this holds for all possible sequences x_1, ⋯, x_n.

A = [ K(x_1, x_1)  K(x_1, x_2)  …  K(x_1, x_n)
      K(x_2, x_1)  K(x_2, x_2)  …  K(x_2, x_n)
      ⋮            ⋮               ⋮
      K(x_n, x_1)  K(x_n, x_2)  …  K(x_n, x_n) ]

Identify new kernels
Positive semi-definite matrix: a square symmetric matrix A ∈ ℝ^{n×n} (A = A^T) that satisfies v^T A v ≥ 0 for any non-zero (column) vector v ∈ ℝ^{n×1}.

Identify new kernels
Let K_1(u, v) and K_2(u, v) be kernels, c > 0 be a constant, and f(x) be a real-valued function.
Then each of the following is also a kernel:

1) K(u, v) = K_1(u, v) + K_2(u, v)
2) K(u, v) = c·K_1(u, v)
3) K(u, v) = f(u)·K_1(u, v)·f(v)
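A hedged empirical check of Mercer's criterion and of rule 1) above (a genuine proof must cover all possible sequences; this only samples one random sequence, and the kernels and their parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                              # a finite sequence of vectors x_1..x_n

poly = lambda u, v: (u @ v + 1.0) ** 2                    # polynomial kernel, c=1, d=2
rbf  = lambda u, v: np.exp(-1.0 * np.sum((u - v) ** 2))   # RBF kernel, gamma=1

def gram(K, X):
    """n x n Gram matrix A with A[i, j] = K(x_i, x_j)."""
    n = len(X)
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(A, tol=1e-9):
    """A symmetric matrix is PSD iff all eigenvalues are >= 0 (up to tolerance)."""
    return np.all(np.linalg.eigvalsh(A) >= -tol)

print(is_psd(gram(poly, X)))                                  # True
print(is_psd(gram(rbf, X)))                                   # True
print(is_psd(gram(lambda u, v: poly(u, v) + rbf(u, v), X)))   # sum of kernels: also True
```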


Recap of perceptron

Perceptron loss: L(s, y) = max(0, −s·y), where s is the predicted score and y ∈ {−1, +1}.

Perceptron training algorithm

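A minimal sketch of the standard perceptron training loop, driven by the loss L(s, y) = max(0, −s·y) above (the function name and the learning-rate parameter are mine):

```python
import numpy as np

def train_perceptron(X, y, epochs=10, eta=1.0):
    """Primal perceptron: on a mistake (y_i * s <= 0), move w towards y_i * x_i."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            s = X[i] @ w + b
            if y[i] * s <= 0:            # mistake: the loss max(0, -s y) is positive
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```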

Kernelised perceptron

• Represent the weights through the training examples: w = ∑_{i=1}^n α_i y^(i) φ(x^(i))
• Then the score of an input x needs only kernel evaluations: s = ∑_{i=1}^n α_i y^(i) K(x^(i), x) + b
• Training mirrors the ordinary perceptron: whenever example i is misclassified, increment α_i (and update b), instead of updating w directly
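A minimal sketch of this dual-representation training loop: a count α_i is kept per training example, and both training and prediction touch the data only through kernel evaluations (the function names are mine):

```python
import numpy as np

def train_kernel_perceptron(X, y, K, epochs=10):
    """Dual perceptron: w = sum_i alpha_i y_i phi(x_i), so s(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(n):
            s = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n)) + b
            if y[i] * s <= 0:            # mistake: weight this example more strongly
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

def predict(x, alpha, b, X, y, K):
    s = sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(y))) + b
    return np.sign(s)

# Example kernel: rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))
```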

Summary

• What is the kernel trick

• What is the influence of γ in the RBF kernel

• How to identify new kernels

• How to kernelise the perceptron
