
09_Soft-margin_SVM

Qiuhong Ke

Soft-margin SVM

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

So far …
Hard-margin SVM


Hard-margin SVM: all points are perfectly classified and lie on or outside the margin (hard constraint).


So far …
Geometric Margin vs Functional Margin

(Figure: decision boundary wTx + b = 0 with margin hyperplanes wTx + b = +b0 and wTx + b = −b0, each at distance γ* = b0 / ∥w∥ from the boundary.)

For data point i:
• functional margin: γ̂i = yi(wTxi + b)
• geometric margin: γi = yi(wTxi + b) / ∥w∥

Margin of boundary = 2γ* = 2b0 / ∥w∥
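As a quick numeric check (a toy example of mine, not from the slides; the weight vector, bias and data point below are made up):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 1.0            # hypothetical weight vector and bias
x_i, y_i = np.array([1.0, 2.0]), 1          # hypothetical data point and its label

functional = y_i * (w @ x_i + b)            # functional margin: y_i (w^T x_i + b) = 12
geometric = functional / np.linalg.norm(w)  # geometric margin: divide by ||w|| = 5, gives 2.4
print(functional, geometric)
```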

So far …
Resolving ambiguity of the function representation

(Figure: the same decision boundary before and after rescaling; margin width 2γ*.)

Rescale the parameters: w / b0 → w and b / b0 → b, so that

wTx + b = −b0, wTx + b = 0, wTx + b = +b0   becomes   wTx + b = −1, wTx + b = 0, wTx + b = +1

Margin = 2γ* = 2b0 / ∥w∥ (before rescaling) = 2 / ∥w∥ (after rescaling, with the new w).

So far …

SVM: Constrained optimisation problem

Maximise the margin 2/∥w∥, which is equivalent to:

min_w ∥w∥²/2
s.t. 1 − y(i)(wTx(i) + b) ≤ 0, i = 1,⋯, n (data points)

i.e. subject to, for every data point i:
if y(i) = +1 : f(x(i)) = wTx(i) + b ≥ +1
if y(i) = −1 : f(x(i)) = wTx(i) + b ≤ −1

(Figure: decision boundary f(x) = 0 with margin hyperplanes f(x) = −1 and f(x) = +1, each at distance 1/∥w∥ from it.)

What if the data is not linearly separable?

• Hard-margin loss is too stringent (hard!)
• Real data is unlikely to be linearly separable
• If the data is not separable, hard-margin SVMs are in trouble

SVMs offer 3 approaches to address this problem:
1. Relax the constraints (soft-margin)
2. Still use hard-margin SVM, but transform the data (kernel)
3. The combination of 1 and 2

Outline

• Deriving Soft-margin SVM Objective

• Lagrange Duality

• Alternate formulation with different training complexity

• Dual form of hard-margin & soft-margin SVM


Soft-margin SVM: ‘soft’ constraint
• Relax constraints to allow points to be inside the margin or even on the wrong side of the boundary



Hard-margin → Soft-margin: add a penalty for violation, i.e. penalise boundaries by the extent of "violation".


Objective of soft-margin SVM

min_w ( ∥w∥²/2 + C ∑_{i=1}^n ξi )
s.t. y(i)(wTx(i) + b) ≥ 1 − ξi , ξi ≥ 0 (i = 1,⋯, n data points)

Use slack variables ξi to 'soften' the constraint: allow violation of the constraint.

ξi = 0 if y(i)(wTx(i) + b) ≥ 1, and ξi = 1 − y(i)(wTx(i) + b) otherwise.

• Hard-margin SVM loss: l∞ = 0 if 1 − y(i)(wTx(i) + b) ≤ 0, and ∞ otherwise
• Soft-margin SVM loss (hinge loss): lh = 0 if 1 − y(i)(wTx(i) + b) ≤ 0, and 1 − y(i)(wTx(i) + b) otherwise

(Figure 7.3 in Pattern Recognition and Machine Learning by Chris Bishop)


Equivalent formulation using hinge loss

hinge loss: lh(x(i), y(i), w, b) = max(0, 1 − y(i)(wTx(i) + b))

i.e. lh = 0 if 1 − y(i)(wTx(i) + b) ≤ 0, and 1 − y(i)(wTx(i) + b) otherwise (so ξi equals the hinge loss at the optimum).

min_w ( ∥w∥²/2 + C ∑_{i=1}^n lh(x(i), y(i), w, b) )

∥w∥²/2 : data-independent regularisation term
C ∑_{i=1}^n lh : data-dependent training error
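A minimal sketch (my own illustration, not course code) of this objective written with the hinge loss; the toy arrays and weights below are made up:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """||w||^2 / 2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)                  # y_i (w^T x_i + b) for every point
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss; equals the slack xi_i at the optimum
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [-1.0, -1.5], [0.5, -0.2]])   # toy features
y = np.array([1, -1, -1])                                # labels in {-1, +1}
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```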


Influence of slack penalty

Slack penalty: C > 0
• If C = ∞: data has to be correctly classified (recovers the hard margin) → overfitting
• If C = 0: the data term is ignored (only the regulariser ∥w∥²/2 remains) → underfitting

min_w ( ∥w∥²/2 + C ∑_{i=1}^n ξi )
s.t. y(i)(wTx(i) + b) ≥ 1 − ξi , ξi ≥ 0 (i = 1,⋯, n data points)
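To see the effect in practice, a short sketch (assuming scikit-learn is available; the blob data and C values are arbitrary choices of mine):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # a small C tolerates many margin violations (risk of underfitting);
    # a large C penalises violations heavily (risk of overfitting)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy = {clf.score(X, y):.2f}")
```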


Recap of underfitting and overfitting

• Lazy model → underfitting: does not fit the data
• Hard-working model → overfitting: fits the data too well




Influence of slack penalty C

min_w ( ∥w∥²/2 + C ∑_{i=1}^n ξi )   s.t.  y(i)(wTx(i) + b) ≥ 1 − ξi ,  ξi ≥ 0  (i = 1,⋯, n data points)

(Figure: decision boundaries obtained with different values of C.)


Quiz


Larger C? Smaller C?


Primal problem → Dual problem

Hard-margin SVM objective:
min_w ∥w∥²/2
s.t. 1 − y(i)(wTx(i) + b) ≤ 0, i = 1,⋯, n (data points)

Soft-margin SVM objective:
min_w ( ∥w∥²/2 + C ∑_{i=1}^n ξi )
s.t. y(i)(wTx(i) + b) ≥ 1 − ξi , ξi ≥ 0 (i = 1,⋯, n data points)

• What's the dual problem?
• Under what conditions are the solutions of the dual and primal problems equal?

Solve the primal problem by solving the dual problem:
• The solutions of primal and dual are equal under some conditions
• More efficient in high-dimensional feature space
• Allows for the kernel trick: efficient in handling non-linearly separable data

Simple constrained optimisation problem
(Figure: parabola y = f(x) = x², with the feasible region x ≤ −1 marked.)

min_x x²
s.t. x ≤ −1


What’s the dual problem?

min_x f(x)
s.t. g(x) = x + 1 ≤ 0     (here f(x) = x², i.e. min_x x² s.t. x ≤ −1)

• Construct the Lagrangian function: L(x, λ) = f(x) + λg(x), with Lagrange multiplier λ ≥ 0

• Construct the primal function: θp(x) = max_λ L(x, λ)

  g(x) ≤ 0 : θp(x) = max_λ L(x, λ) = f(x)  (attained when λ = 0)
  g(x) > 0 : θp(x) = max_λ L(x, λ) = ∞     (as λ → ∞)
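A tiny numeric check of this construction (my own sketch, using the toy problem above with f(x) = x², g(x) = x + 1 and a finite grid of λ values):

```python
import numpy as np

def theta_p(x, lambdas=np.linspace(0.0, 1e6, 1001)):
    L = x**2 + lambdas * (x + 1.0)   # L(x, lambda) = f(x) + lambda * g(x)
    return L.max()                   # maximise over lambda >= 0 (approximated on a grid)

print(theta_p(-2.0))   # feasible: g(-2) = -1 <= 0, maximum at lambda = 0, returns f(-2) = 4
print(theta_p(0.0))    # infeasible: g(0) = 1 > 0, grows without bound as lambda increases
```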


What’s the dual problem?

min_x f(x)
s.t. g(x) = x + 1 ≤ 0

Lagrangian: L(x, λ) = f(x) + λg(x), λ ≥ 0
Primal function: θp(x) = max_λ L(x, λ)

So: min_x f(x) (with the constraint) = min_x θp(x) = min_x max_λ L(x, λ)


What’s the dual problem?

• Primal problem: min_x f(x) = min_x max_λ L(x, λ), where g(x) ≤ 0, L(x, λ) = f(x) + λg(x), λ ≥ 0

• Dual problem: max_λ min_x L(x, λ) = max_λ θd(λ)

  Dual function: θd(λ) = min_x L(x, λ)

For any feasible x and any λ ≥ 0: θd(λ) = min_x L(x, λ) ≤ L(x, λ) = f(x) + λg(x) ≤ f(x)
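Working the toy problem through symbolically (a sketch of mine, assuming sympy is installed) shows the dual reaching the same optimum as the primal:

```python
import sympy as sp

x, lam = sp.symbols('x lambda', real=True)
f, g = x**2, x + 1
L = f + lam * g                                     # Lagrangian L(x, lambda)

x_at_min = sp.solve(sp.diff(L, x), x)[0]            # dL/dx = 0  ->  x = -lambda/2
theta_d = sp.expand(L.subs(x, x_at_min))            # dual function: lambda - lambda**2/4
lam_star = sp.solve(sp.diff(theta_d, lam), lam)[0]  # maximise the dual: lambda* = 2
d_star = theta_d.subs(lam, lam_star)                # d* = 1
p_star = f.subs(x, -1)                              # primal optimum at x* = -1: p* = 1
print(theta_d, lam_star, d_star, p_star)            # here d* == p* (strong duality)
```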


What are the conditions making Solution(dual) == Solution(primal)?

Solutions:
x* minimises f(x): f(x*) = p*
λ* maximises θd(λ): θd(λ*) = d*

• Primal problem: min_x f(x) = min_x max_λ L(x, λ)
• Dual problem: max_λ min_x L(x, λ) = max_λ θd(λ)

θd(λ) = min_x L(x, λ) ≤ L(x, λ) = f(x) + λg(x) ≤ f(x)


What are the conditions making Solution(dual) == Solution(primal)?

• Primal problem: min_x f(x) = min_x max_λ L(x, λ)
• Dual problem: max_λ min_x L(x, λ) = max_λ θd(λ)

Solutions: f(x*) = p* = min_x f(x),  θd(λ*) = d* = max_λ θd(λ)

θd(λ) = min_x L(x, λ) ≤ L(x, λ) = f(x) + λg(x) ≤ f(x)

d* = θd(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ*g(x*) ≤ f(x*) = p*

Under what conditions does d* = p* ?


What are the conditions making Solution(dual) == Solution(primal)?

d* = θd(λ*) = min_x L(x, λ*) ≤ L(x*, λ*) = f(x*) + λ*g(x*) ≤ f(x*) = p*

d* = p* if min_x L(x, λ*) == L(x*, λ*) and f(x*) + λ*g(x*) == f(x*)

KKT (Karush-Kuhn-Tucker) conditions:
∂L/∂x = 0    (Stationarity)
g(x) ≤ 0     (Primal feasibility)
λ ≥ 0        (Dual feasibility)
λg(x) = 0    (Complementary slackness)
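As a sanity check on the toy problem (my own worked example): at x* = −1, λ* = 2 all four conditions hold, and indeed d* = p* = 1.

Stationarity: ∂L/∂x = 2x* + λ* = −2 + 2 = 0
Primal feasibility: g(x*) = x* + 1 = 0 ≤ 0
Dual feasibility: λ* = 2 ≥ 0
Complementary slackness: λ* g(x*) = 2 × 0 = 0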


Dual problem of hard-margin SVM

Primal problem:
min_w ∥w∥²/2
s.t. gi(w, b) = 1 − y(i)(wTx(i) + b) ≤ 0, i = 1,⋯, n data points

(1) Lagrangian function:
L(w, b, λ) = ∥w∥²/2 + ∑_{i=1}^n λi (1 − y(i)(wTx(i) + b)),   λi ≥ 0

(2) Dual function: θd(λ) = min_{w,b} L(w, b, λ)

Stationarity:
∂L/∂wj = 0 : wj = ∑_{i=1}^n λi y(i) x(i)_j
∂L/∂b = 0 : ∑_{i=1}^n λi y(i) = 0


Dual problem of hard-margin SVM

Dual function:
θd(λ) = ∑_{i=1}^n λi − (1/2) ∑_{i=1}^n ∑_{k=1}^n λi λk y(i) y(k) (x(i))T x(k)

Dual problem:
max_λ θd(λ)
s.t. λi ≥ 0 and ∑_{i=1}^n λi y(i) = 0

Solve for n variables (the λi), one per data point in the dataset, using Sequential Minimal Optimisation (OUT OF SCOPE).
Platt, John. "Sequential minimal optimization: A fast algorithm for training support vector machines." (1998).
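For intuition, here is a minimal sketch (not the SMO algorithm; my own toy data, using scipy's generic SLSQP solver) that maximises θd(λ) under the two constraints and then recovers w and b from the stationarity and complementary-slackness conditions:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, -2.0]])  # toy, linearly separable
y = np.array([1.0, 1.0, -1.0, -1.0])                               # labels in {-1, +1}
G = (y[:, None] * X) @ (y[:, None] * X).T    # G[i, k] = y_i y_k (x_i)^T x_k

def neg_dual(lam):
    # negative of theta_d(lambda) = sum_i lambda_i - 1/2 sum_{i,k} lambda_i lambda_k y_i y_k x_i^T x_k
    return -(lam.sum() - 0.5 * lam @ G @ lam)

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                             # lambda_i >= 0
               constraints=({'type': 'eq', 'fun': lambda lam: lam @ y},))  # sum_i lambda_i y_i = 0

lam = res.x
w = (lam * y) @ X                 # stationarity: w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                   # support vectors: lambda_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # y_i (w^T x_i + b) = 1 on the support vectors
print(lam.round(3), w.round(3), b.round(3))
```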


KKT conditions in hard-margin SVM

L(w, b, λ) = ∥w∥²/2 + ∑_{i=1}^n λi (1 − y(i)(wTx(i) + b))

KKT conditions (i = 1,⋯, n data points):
(Stationarity)              ∂L/∂wj = 0 : wj = ∑_{i=1}^n λi y(i) x(i)_j ;  ∂L/∂b = 0 : ∑_{i=1}^n λi y(i) = 0
(Primal feasibility)        gi(w, b) = 1 − y(i)(wTx(i) + b) ≤ 0
(Dual feasibility)          λi ≥ 0
(Complementary slackness)   λi gi(w, b) = 0

(Figure: the support vectors are the points lying on the margin hyperplanes f(x) = −1 and f(x) = +1; only these have λi > 0.)
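The same picture can be reproduced with an off-the-shelf solver (a sketch assuming scikit-learn; SVC exposes the non-zero λi, scaled by the sign of y(i), as dual_coef_ and their indices as support_):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=1)
clf = SVC(kernel='linear', C=1e6).fit(X, y)   # a very large C approximates the hard margin

print("support vector indices:", clf.support_)    # only these points have lambda_i > 0
print("y_i * lambda_i values:", clf.dual_coef_)   # every other training point has lambda_i = 0
```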


Quiz

(Figure: data points A and B relative to the decision boundary f(x) = 0 and the margin hyperplanes f(x) = −1 and f(x) = +1.)

What is the value of λi for data points A and B?


Primal vs Dual (Training)

• Dual problem: solve for n variables (the λi), one per data point in the dataset
max_λ θd(λ) = ∑_{i=1}^n λi − (1/2) ∑_{i=1}^n ∑_{k=1}^n λi λk y(i) y(k) (x(i))T x(k)
s.t. λi ≥ 0 and ∑_{i=1}^n λi y(i) = 0

• Primal problem: solve for d variables (the wj, where d is the dimension of the weight vector w)
min_w ∥w∥²/2
s.t. gi(w, b) = 1 − y(i)(wTx(i) + b) ≤ 0, i = 1,⋯, n data points

If the feature dimension d is larger than the number of data points n (d > n), solving the dual problem is faster than solving the primal problem.


Primal vs Dual (Prediction)

• Dual form:
wj = ∑_{i=1}^n λi y(i) x(i)_j, so
f(x) = ∑_{i=1}^n λi y(i) (x(i))T x + b
(using only the (small number of) support vectors, i.e. the points with λi > 0)

• Primal form:
f(x) = wTx + b
f(x) > 0 : positive class;  f(x) < 0 : negative class
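A minimal prediction sketch in the dual form (my own code, reusing the lam, y, X, b produced by the toy dual solver sketched earlier; only the support vectors enter the sum):

```python
import numpy as np

def predict_dual(x_new, lam, y, X, b, tol=1e-6):
    sv = lam > tol                                      # keep only the support vectors
    f = np.sum(lam[sv] * y[sv] * (X[sv] @ x_new)) + b   # f(x) = sum_i lambda_i y_i x_i^T x + b
    return +1 if f > 0 else -1                          # the sign of f(x) gives the class

# example call with the variables from the earlier hard-margin dual sketch:
# print(predict_dual(np.array([1.5, 2.0]), lam, y, X, b))
```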

Why bother solving the dual problem to solve the primal problem?

Training, solve:
max_λ θd(λ) = ∑_{i=1}^n λi − (1/2) ∑_{i=1}^n ∑_{k=1}^n λi λk y(i) y(k) (x(i))T x(k)
s.t. λi ≥ 0 and ∑_{i=1}^n λi y(i) = 0

Prediction:
f(x) = ∑_{i=1}^n λi y(i) (x(i))T x + b

• Inner product: the kernel trick can be used to efficiently handle non-linearly separable data
• Efficient in high-dimensional space

Dual problem of soft-margin SVM

Primal problem:
min_w ( ∥w∥²/2 + C ∑_{i=1}^n ξi )
s.t. y(i)(wTx(i) + b) ≥ 1 − ξi , ξi ≥ 0 (i = 1,⋯, n data points)
i.e. gi(w, b, ξ) = 1 − ξi − y(i)(wTx(i) + b) ≤ 0 and −ξi ≤ 0

(1) Lagrangian function:
L(w, b, λ, β, ξ) = ∥w∥²/2 + C ∑_{i=1}^n ξi + ∑_{i=1}^n λi gi(w, b, ξ) + ∑_{i=1}^n βi (−ξi),   λi ≥ 0, βi ≥ 0

(2) Dual function: θd(λ) = min_{w, b, ξ} L(w, b, λ, β, ξ)

Stationarity:
∂L/∂wj = 0 : wj = ∑_{i=1}^n λi y(i) x(i)_j
∂L/∂b = 0 : ∑_{i=1}^n λi y(i) = 0
∂L/∂ξi = 0 : C − λi − βi = 0, which with λi ≥ 0, βi ≥ 0 gives 0 ≤ λi ≤ C

Dual function:
θd(λ) = ∑_{i=1}^n λi − (1/2) ∑_{i=1}^n ∑_{k=1}^n λi λk y(i) y(k) (x(i))T x(k)

Dual problem:
max_λ θd(λ)
s.t. 0 ≤ λi ≤ C and ∑_{i=1}^n λi y(i) = 0

KKT conditions in soft-margin SVM

L(w, b, λ, β, ξ) = ∥w∥²/2 + C ∑_{i=1}^n ξi + ∑_{i=1}^n λi gi(w, b, ξ) + ∑_{i=1}^n βi (−ξi)

(i = 1,⋯, n data points)
(Stationarity)              ∂L/∂wj = 0 : wj = ∑_{i=1}^n λi y(i) x(i)_j ;  ∂L/∂b = 0 : ∑_{i=1}^n λi y(i) = 0 ;  ∂L/∂ξi = 0 : C − λi − βi = 0
(Primal feasibility)        gi(w, b, ξ) = 1 − ξi − y(i)(wTx(i) + b) ≤ 0 ;  −ξi ≤ 0
(Dual feasibility)          λi ≥ 0, βi ≥ 0
(Complementary slackness)   λi gi(w, b, ξ) = 0 ;  βi ξi = 0

Support vectors are the points with λi > 0:
if 0 < λi < C : βi = C − λi > 0, so ξi = 0 and y(i)(wTx(i) + b) = 1 (the point lies exactly on the margin)
if λi = C : βi = 0, −ξi ≤ 0, so y(i)(wTx(i) + b) = 1 − ξi ≤ 1 (the point lies inside the margin or is misclassified)

Quiz

What are the values of λi and βi for data points A, B and C?
(Figure: points A, B and C relative to the decision boundary and the margin hyperplanes.)

Summary

• What are the objective and constraints of soft-margin SVM?
• What are the slack variables & slack penalty of soft-margin SVM?
• What are the KKT conditions and dual problems of soft-margin and hard-margin SVM?
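Finally, a small sketch (assuming scikit-learn; the data and the value of C are arbitrary choices of mine) connecting the soft-margin KKT cases above to the dual variables returned by a solver: support vectors with λi strictly below C sit on the margin, while those at λi = C violate it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

C = 1.0
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.5, random_state=2)
clf = SVC(kernel='linear', C=C).fit(X, y)

lam = np.abs(clf.dual_coef_).ravel()   # |y_i * lambda_i| = lambda_i for the support vectors
on_margin = lam < C - 1e-8             # 0 < lambda_i < C : xi_i = 0, exactly on the margin
at_bound = ~on_margin                  # lambda_i = C : inside the margin or misclassified
print(on_margin.sum(), "support vectors on the margin,", at_bound.sum(), "violating it")
```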