
10_Kernel_methods

Qiuhong Ke

Kernel Methods

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

So far…
Objectives of hard-margin and soft-margin SVM

Hard-margin SVM:
min_{w,b} ∥w∥²/2
s.t. 1 − y^(i)(w^T x^(i) + b) ≤ 0, i = 1,⋯,n (data points)

Soft-margin SVM:
min_{w,b,ξ} ( ∥w∥²/2 + C ∑_{i=1}^n ξ_i )
s.t. y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0 (i = 1,⋯,n data points)

• Hard-margin SVM loss:
  0 if 1 − y^(i)(w^T x^(i) + b) ≤ 0, and ∞ otherwise

• Soft-margin SVM loss (hinge loss):
  0 if 1 − y^(i)(w^T x^(i) + b) ≤ 0, and 1 − y^(i)(w^T x^(i) + b) otherwise

• The slack variable: ξ_i = 0 if y^(i)(w^T x^(i) + b) ≥ 1, and ξ_i = 1 − y^(i)(w^T x^(i) + b) otherwise
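A minimal Python sketch of the two per-example losses above (the function names are mine, not from the slides):

```python
import numpy as np

def hard_margin_loss(score, y):
    """0 if the margin constraint y * score >= 1 holds, infinity otherwise."""
    return 0.0 if 1 - y * score <= 0 else np.inf

def hinge_loss(score, y):
    """0 if the margin constraint holds, otherwise the size of the violation (the slack)."""
    return max(0.0, 1 - y * score)

# Example: a correctly classified point that falls inside the margin
print(hard_margin_loss(0.5, +1))  # inf  -- the hard margin forbids it
print(hinge_loss(0.5, +1))        # 0.5  -- the soft margin pays a slack of 0.5
```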

So far…
Influence of slack penalty C

Soft-margin SVM:
min_{w,b,ξ} ( ∥w∥²/2 + C ∑_{i=1}^n ξ_i )
s.t. y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0 (i = 1,⋯,n data points)

[Figure: decision boundaries for different values of C. A large C penalises slack heavily and tolerates few margin violations; a small C allows more violations in exchange for a wider margin.]
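As a hedged illustration (assuming scikit-learn is available; the data here is synthetic and of my own choosing), the slack penalty is exposed as the C argument of a linear SVM, and a smaller C typically leaves more points as support vectors:

```python
# A small sketch: smaller C tolerates more margin violations,
# so more points typically end up as support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, size=(50, 2)),
               rng.normal(+1, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)   # number of support vectors per class
```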

So far…
What's the dual problem?

Constrained problem: min_x f(x) s.t. g(x) ≤ 0
Lagrangian: L(x, λ) = f(x) + λ g(x), with λ ≥ 0

• Primal problem: min_x f(x) = min_x max_λ L(x, λ)
• Dual problem: max_λ min_x L(x, λ) = max_λ θ_d(λ)

For any λ ≥ 0 and any feasible x:
θ_d(λ) = min_x L(x, λ) ≤ L(x, λ) = f(x) + λ g(x) ≤ f(x)

Solutions:
d* = θ_d(λ*) = max_λ θ_d(λ)
p* = f(x*) = min_x f(x)
d* = θ_d(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ* g(x*) ≤ f(x*) = p*
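As a tiny worked example of these definitions (a toy problem of my own, not from the slides): take f(x) = x² and g(x) = 1 − x ≤ 0. Then θ_d(λ) = min_x [x² + λ(1 − x)] = λ − λ²/4, which is maximised at λ* = 2 with d* = 1, equal to p* = f(1) = 1. A quick numeric check:

```python
import numpy as np

# Toy problem: minimise f(x) = x^2 subject to g(x) = 1 - x <= 0.
f = lambda x: x**2
g = lambda x: 1 - x
L = lambda x, lam: f(x) + lam * g(x)

xs = np.linspace(-5, 5, 100001)
lams = np.linspace(0, 10, 10001)

theta_d = np.array([L(xs, lam).min() for lam in lams])   # dual function on a grid
d_star = theta_d.max()                                   # dual optimum
p_star = f(xs[g(xs) <= 0]).min()                         # primal optimum (feasible x only)

print(d_star, p_star)   # both approximately 1.0 -> the two solutions coincide here
```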

So far…
What are the conditions making Solution(dual) == Solution(primal)?

KKT (Karush-Kuhn-Tucker) conditions:

g(x) ≤ 0 (Primal feasibility)
λ ≥ 0 (Dual feasibility)
λ g(x) = 0 (Complementary slackness)
∂L/∂x = 0 (Stationarity)

Recall the chain
d* = θ_d(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ* g(x*) ≤ f(x*) = p*

so d* = p* if min_x L(x, λ*) == L(x*, λ*) and f(x*) + λ* g(x*) == f(x*).
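Continuing the toy problem above (f(x) = x², g(x) = 1 − x), the KKT conditions can be checked directly at x* = 1, λ* = 2:

```python
x_star, lam_star = 1.0, 2.0

print(1 - x_star <= 0)                  # primal feasibility: g(x*) <= 0
print(lam_star >= 0)                    # dual feasibility
print(lam_star * (1 - x_star) == 0)     # complementary slackness
print(2 * x_star - lam_star == 0)       # stationarity: dL/dx = 2x - lambda = 0
```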

So far…
Dual problem of hard-margin SVM

Dual function:
θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)

Dual problem:
max_λ θ_d(λ)
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0
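A minimal NumPy sketch of the dual function θ_d(λ) exactly as written above (the function and variable names are mine):

```python
import numpy as np

def dual_objective(lam, X, y):
    """theta_d(lambda) = sum_i lam_i - 1/2 sum_i sum_k lam_i lam_k y_i y_k (x_i . x_k).
    X: (n, d) data matrix, y: (n,) labels in {-1, +1}, lam: (n,) multipliers."""
    Yx = y[:, None] * X              # rows y_i * x_i
    G = Yx @ Yx.T                    # G[i, k] = y_i y_k (x_i . x_k)
    return lam.sum() - 0.5 * lam @ G @ lam
```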

So far…
KKT conditions in hard-margin SVM

Lagrangian:
L(w, b, λ) = ∥w∥²/2 + ∑_{i=1}^n λ_i (1 − y^(i)(w^T x^(i) + b))

g_i(w, b) = 1 − y^(i)(w^T x^(i) + b) ≤ 0, i = 1,⋯,n data points (Primal feasibility)
λ_i ≥ 0 (Dual feasibility)
λ_i g_i(w, b) = 0 (Complementary slackness)
∂L/∂w_j = 0 : w_j = ∑_{i=1}^n λ_i y^(i) x_j^(i), and ∂L/∂b = 0 : ∑_{i=1}^n λ_i y^(i) = 0 (Stationarity)

Support vectors:

[Figure: separating hyperplane f(x) = 0 with margin hyperplanes f(x) = +1 and f(x) = −1. The points lying on the margin hyperplanes have λ_i > 0; these are the support vectors. By complementary slackness, all other points have λ_i = 0 and do not contribute to w.]
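Given multipliers λ that solve the dual, stationarity recovers w, and any support vector (where the constraint is tight) recovers b; a minimal NumPy sketch under those assumptions (the function name is mine):

```python
import numpy as np

def recover_w_b(lam, X, y, tol=1e-8):
    """w = sum_i lam_i y_i x_i; b from the support vectors, where y_i (w.x_i + b) = 1."""
    w = (lam * y) @ X                   # stationarity condition
    sv = np.flatnonzero(lam > tol)      # support vectors: lam_i > 0
    b = np.mean(y[sv] - X[sv] @ w)      # y_i in {-1,+1}, so b = y_i - w.x_i on the margin
    return w, b
```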

So far…
Dual problem of soft-margin SVM

Dual function:
θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)

Dual problem:
max_λ θ_d(λ)
s.t. 0 ≤ λ_i ≤ C and ∑_{i=1}^n λ_i y^(i) = 0

The dual function is the same as in the hard-margin case; the only change is the box constraint 0 ≤ λ_i ≤ C.
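A hedged sketch of solving this dual numerically with SciPy's general-purpose SLSQP solver (a dedicated SVM solver would normally be used instead; the data X, y and the value of C are assumed given):

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C):
    """Maximise theta_d(lambda) s.t. 0 <= lambda_i <= C and sum_i lambda_i y_i = 0."""
    n = len(y)
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                              # G[i,k] = y_i y_k (x_i . x_k)
    neg_dual = lambda lam: -(lam.sum() - 0.5 * lam @ G @ lam)  # minimise the negative
    res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
    return res.x

# lam = solve_soft_margin_dual(X, y, C=1.0)
# the support vectors are the points with lam_i > 0
```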

So far…
KKT conditions in soft-margin SVM

Lagrangian:
L(w, b, ξ, λ, β) = ∥w∥²/2 + C ∑_{i=1}^n ξ_i + ∑_{i=1}^n λ_i g_i(w, b, ξ) + ∑_{i=1}^n β_i (−ξ_i)

g_i(w, b, ξ) = 1 − ξ_i − y^(i)(w^T x^(i) + b) ≤ 0 and −ξ_i ≤ 0, i = 1,⋯,n data points (Primal feasibility)
λ_i ≥ 0, β_i ≥ 0 (Dual feasibility)
λ_i g_i(w, b, ξ) = 0 and β_i ξ_i = 0 (Complementary slackness)
∂L/∂w_j = 0 : w_j = ∑_{i=1}^n λ_i y^(i) x_j^(i), ∂L/∂b = 0 : ∑_{i=1}^n λ_i y^(i) = 0, ∂L/∂ξ_i = 0 : C − λ_i − β_i = 0 (Stationarity)

Support vectors

[Figure: soft-margin SVM solution. The support vectors are the points with 0 < λ_i ≤ C; every point satisfies 0 ≤ λ_i ≤ C.]

So far…
Why bother solving the dual problem to solve the primal problem?

• Inner product: the kernel trick can be used to efficiently handle non-linearly separable data
• Efficient in high-dimensional space

Training, solve:
max_λ θ_d(λ), θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0

Prediction:
f(x) = ∑_{i=1}^n λ_i y^(i) (x^(i))^T x + b

Outline
• Kernel Trick
• Common Kernels and Identify New Kernels
• Kernelised Perceptron

Handling non-linear data with the SVM

Feature transformation
• Consider a binary classification problem
• Each example has features [x_1, x_2]
• Not linearly separable
• Now 'add' a feature x_3 = x_1² + x_2²
• Each point is now [x_1, x_2, x_1² + x_2²]
• Linearly separable!

Naïve workflow
• Choose/design a linear model
• Choose/design a high-dimensional transformation φ(x)
  * Hoping that after adding a lot of various features some of them will make the data linearly separable
• For each training example, and for each new instance, compute φ(x)
• Train classifier / do predictions
• Problem: impractical/impossible to compute φ(x) for high/infinite-dimensional φ(x)

Hard-margin SVM's dual formulation

Training, solve:
max_λ θ_d(λ), θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (x^(i))^T x^(k)
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0

Prediction:
f(x) = ∑_{i=1}^n λ_i y^(i) (x^(i))^T x + b

Hard-margin SVM in feature space

We just need the dot product!

Training, solve:
max_λ θ_d(λ), θ_d(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (φ(x^(i)))^T φ(x^(k))
s.t. λ_i ≥ 0 and ∑_{i=1}^n λ_i y^(i) = 0

Prediction:
f(x) = w^T x + b = ∑_{i=1}^n λ_i y^(i) (φ(x^(i)))^T φ(x) + b

Observation: Kernel representation
• Both parameter estimation and computing predictions depend on the data only in the form of a dot product
  * In the original space: u^T v = ∑_i u_i v_i
  * In the transformed space: φ(u)^T φ(v) = ∑_i φ(u)_i φ(v)_i
• Benefits:
  * no need to find the mapping function
  * no need to do the transformation
  * no need to do the dot product in feature space

Kernel as shortcut
• For some φ(x)'s, the kernel is faster to compute directly than first mapping to feature space and then taking the dot product.
• For example, consider two vectors u = [u_1] and v = [v_1] and the transformation φ(x) = [x_1², √(2c)·x_1, c], for some c
  * So φ(u) = [u_1², √(2c)·u_1, c]^T and φ(v) = [v_1², √(2c)·v_1, c]^T
  * Then φ(u)^T φ(v) = u_1² v_1² + 2c·u_1 v_1 + c²
• This can be alternatively computed directly as φ(u)^T φ(v) = (u_1 v_1 + c)²
  * Here K(u, v) = (u_1 v_1 + c)² is the corresponding kernel
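A quick numeric check of this shortcut (the values of u_1, v_1 and c are arbitrary):

```python
import numpy as np

c = 3.0
u1, v1 = 2.0, 1.5
phi = lambda x1: np.array([x1**2, np.sqrt(2 * c) * x1, c])   # the mapping phi above

print(phi(u1) @ phi(v1))      # explicit feature map, then dot product
print((u1 * v1 + c) ** 2)     # kernel computed directly: K(u, v) = (u1 v1 + c)^2
```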
Hard-margin SVM in feature space

Training: solve
max_λ L(λ), L(λ) = ∑_{i=1}^n λ_i − (1/2) ∑_{i=1}^n ∑_{k=1}^n λ_i λ_k y^(i) y^(k) (φ(x^(i)))^T φ(x^(k)), where each (φ(x^(i)))^T φ(x^(k)) is replaced by K(x^(i), x^(k))

Making predictions:
f(x) = w^T x + b = ∑_{i=1}^n λ_i y^(i) (φ(x^(i)))^T φ(x) + b, where each (φ(x^(i)))^T φ(x) is replaced by K(x^(i), x)

More generally: the "kernel trick"

Modular learning

Approaches to non-linearity

Polynomial kernel
K(u, v) = (u^T v + c)^d, (c ≥ 0 and d ≥ 0)

Assume c = 0; if it's not, add √c as a dummy feature to u and v.

e.g., u = [u_1, u_2], v = [v_1, v_2], c = 0, d = 2:
K(u, v) = (u_1 v_1 + u_2 v_2)²
        = u_1² v_1² + u_2² v_2² + 2 u_1 u_2 v_1 v_2
        = [u_1², u_2², √2·u_1 u_2]^T [v_1², v_2², √2·v_1 v_2]
        = φ(u)^T φ(v), with φ(x) = [x_1², x_2², √2·x_1 x_2]
What about d = 3?

https://www.youtube.com/watch?v=3liCbRZPrZA

Radial Basis Function (RBF) kernel
K(u, v) = exp(−γ∥u − v∥²)

e.g., u = [u_1], v = [v_1], γ = 1:
K(u, v) = exp(−∥u_1 − v_1∥²)
        = exp(−u_1²) exp(−v_1²) exp(2 u_1 v_1)
        = exp(−u_1²) exp(−v_1²) ( ∑_{i=0}^∞ (2 u_1 v_1)^i / i! )        (Taylor expansion)
        = ∑_{i=0}^∞ exp(−u_1²) exp(−v_1²) √(2^i/i!) √(2^i/i!) u_1^i v_1^i
        = φ(u)^T φ(v), with φ(x) = exp(−x²) · [1, √(2/1!)·x, √(2²/2!)·x², ⋯]

Infinite-dimensional feature space!

Radial Basis Function (RBF) kernel
K(u, v) = exp(−γ∥u − v∥²); in one dimension, y = exp(−γ x²) = exp(−x²/(2σ²)), i.e. γ = 1/(2σ²)
[Figure: the bump y = exp(−γ x²) for γ = 10 (narrow) and γ = 0.1 (wide).]
γ too small: underfitting; γ too large: overfitting

Identify new kernels
Mercer's theorem: Consider a finite sequence of vectors x_1, ⋯, x_n. Construct the n×n matrix A (Gram matrix) of pairwise values K(x_i, x_j). K is a valid kernel if this matrix is positive semi-definite, and this holds for all possible sequences x_1, ⋯, x_n.

A = [ K(x_1, x_1)  K(x_1, x_2)  …  K(x_1, x_n)
      K(x_2, x_1)  K(x_2, x_2)  …  K(x_2, x_n)
      ⋮            ⋮               ⋮
      K(x_n, x_1)  K(x_n, x_2)  …  K(x_n, x_n) ]

Identify new kernels
Positive semi-definite matrix: a square symmetric matrix A ∈ ℝ^{n×n} (A = A^T) that satisfies v^T A v ≥ 0 for any non-zero (column) vector v ∈ ℝ^{n×1}.

Identify new kernels
Let K_1(u, v) and K_2(u, v) be kernels, c > 0 be a constant, and f(x) be a real-valued function.
Then each of the following is also a kernel:

1) K(u, v) = K_1(u, v) + K_2(u, v)
2) K(u, v) = c·K_1(u, v)
3) K(u, v) = f(u)·K_1(u, v)·f(v)
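A hedged empirical check of Mercer's criterion and of rule 1) above (a genuine proof must cover all possible sequences; this only samples one random sequence, and the kernels and their parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                              # a finite sequence of vectors x_1..x_n

poly = lambda u, v: (u @ v + 1.0) ** 2                    # polynomial kernel, c=1, d=2
rbf  = lambda u, v: np.exp(-1.0 * np.sum((u - v) ** 2))   # RBF kernel, gamma=1

def gram(K, X):
    """n x n Gram matrix A with A[i, j] = K(x_i, x_j)."""
    n = len(X)
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(A, tol=1e-9):
    """A symmetric matrix is PSD iff all eigenvalues are >= 0 (up to tolerance)."""
    return np.all(np.linalg.eigvalsh(A) >= -tol)

print(is_psd(gram(poly, X)))                                  # True
print(is_psd(gram(rbf, X)))                                   # True
print(is_psd(gram(lambda u, v: poly(u, v) + rbf(u, v), X)))   # sum of kernels: also True
```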


Recap of perceptron

Perceptron loss: L(s, y) = max(0, −s·y), where s is the predicted score and y ∈ {−1, +1}.

Perceptron training algorithm

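A minimal sketch of the standard perceptron training loop, driven by the loss L(s, y) = max(0, −s·y) above (the function name and the learning-rate parameter are mine):

```python
import numpy as np

def train_perceptron(X, y, epochs=10, eta=1.0):
    """Primal perceptron: on a mistake (y_i * s <= 0), move w towards y_i * x_i."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            s = X[i] @ w + b
            if y[i] * s <= 0:            # mistake: the loss max(0, -s y) is positive
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```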

Kernelised perceptron

• Represent the weights through the training examples: w = ∑_{i=1}^n α_i y^(i) φ(x^(i))
• Then the score of an input x needs only kernel evaluations: s = ∑_{i=1}^n α_i y^(i) K(x^(i), x) + b
• Training mirrors the ordinary perceptron: whenever example i is misclassified, increment α_i (and update b), instead of updating w directly
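A minimal sketch of this dual-representation training loop: a count α_i is kept per training example, and both training and prediction touch the data only through kernel evaluations (the function names are mine):

```python
import numpy as np

def train_kernel_perceptron(X, y, K, epochs=10):
    """Dual perceptron: w = sum_i alpha_i y_i phi(x_i), so s(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(n):
            s = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n)) + b
            if y[i] * s <= 0:            # mistake: weight this example more strongly
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

def predict(x, alpha, b, X, y, K):
    s = sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(y))) + b
    return np.sign(s)

# Example kernel: rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2))
```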

Summary

• What is the kernel trick

• What is the influence of γ in the RBF kernel

• How to identify new kernels

• How to kernelise the perceptron
