09 Soft-margin SVM
COMP90051 Statistical Machine Learning
Qiuhong Ke
Copyright: University of Melbourne
So far … Hard-margin SVM
Hard-margin SVM: all points are perfectly classified and on or outside the margin (hard constraint).
[Figure: linearly separable data with a maximum-margin boundary; the margin is marked.]
So far … Geometric margin vs functional margin
For data point $i$:
functional margin: $\hat{\gamma}_i = y_i(w^T x_i + b)$
geometric margin: $\gamma_i = \dfrac{y_i(w^T x_i + b)}{\|w\|}$
With the decision boundary $w^T x + b = 0$ and margin hyperplanes $w^T x + b = \pm b_0$:
$\gamma^* = \dfrac{b_0}{\|w\|}$, and the margin of the boundary is $2\gamma^* = \dfrac{2 b_0}{\|w\|}$.
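A minimal numpy sketch (the boundary and data point below are made-up values) confirming that the functional margin changes when $(w, b)$ is rescaled, while the geometric margin does not:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 1.0      # hypothetical boundary w^T x + b = 0
x_i, y_i = np.array([1.0, 2.0]), 1.0  # hypothetical data point with label +1

functional = y_i * (w @ x_i + b)             # functional margin, scale-dependent
geometric = functional / np.linalg.norm(w)   # geometric margin, scale-invariant

# Rescaling (w, b) by c = 10 multiplies the functional margin by 10
# but leaves the geometric margin unchanged.
c = 10.0
assert np.isclose(c * functional, y_i * ((c * w) @ x_i + c * b))
assert np.isclose(geometric, y_i * ((c * w) @ x_i + c * b) / np.linalg.norm(c * w))
```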
So far … Resolving ambiguity of function representation
Any positive rescaling of $(w, b)$ describes the same boundary, so fix the scale by dividing through by $b_0$: $\dfrac{w}{b_0} \to w$, $\dfrac{b}{b_0} \to b$. The margin hyperplanes $w^T x + b = \pm b_0$ become $w^T x + b = \pm 1$, and
Margin $= 2\gamma^* = \dfrac{2 b_0}{\|w\|}$ becomes Margin $= 2\gamma^* = \dfrac{2}{\|w\|}$.
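For a concrete check (numbers made up for illustration), take $b_0 = 2$ and $w = (3, 4)^T$; the margin is the same before and after rescaling:

$$
\text{Before: } \frac{2 b_0}{\|w\|} = \frac{4}{5} = 0.8;
\qquad
\text{After } w \to \frac{w}{b_0} = (1.5,\, 2)^T: \ \frac{2}{\|w\|} = \frac{2}{2.5} = 0.8.
$$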
So far … SVM: constrained optimisation problem
$\min_w \dfrac{\|w\|^2}{2}$ s.t. $1 - y^{(i)}(w^T x^{(i)} + b) \le 0$, $i = 1,\dots,n$ (data points)
Equivalently: $\max_w \dfrac{2}{\|w\|}$ subject to
if $y^{(i)} = +1$: $f(x^{(i)}) = w^T x^{(i)} + b \ge +1$
if $y^{(i)} = -1$: $f(x^{(i)}) = w^T x^{(i)} + b \le -1$
$(i = 1,\dots,n$ data points$)$
[Figure: boundary $f(x) = 0$ with margin hyperplanes $f(x) = \pm 1$, each at distance $1/\|w\|$ from the boundary.]
What if the data is not linearly separable?
• Hard-margin loss is too stringent (hard!)
• Real data is unlikely to be linearly separable
• If the data is not separable, hard-margin SVMs are in trouble
SVMs offer 3 approaches to address this problem:
1. Relax the constraints (soft-margin)
2. Still use hard-margin SVM, but transform the data (kernel)
3. The combination of 1 and 2
Outline
• Deriving the soft-margin SVM objective
• Lagrange duality
• Alternate formulation with different training complexity
• Dual form of hard-margin & soft-margin SVM
Soft-margin SVM: ‘soft’ constraint
• Relax the constraints to allow points to be inside the margin, or even on the wrong side of the boundary
[Figure: a boundary with points inside the margin and on the wrong side.]
Add penalty to violation
• Penalise candidate boundaries by the extent of the ‘violation’
[Figure: hard-margin vs soft-margin boundaries on the same data.]
Objective of soft-margin SVM
Use slack variables to ‘soften’ the constraints: allow each constraint to be violated, but penalise the violation.
$$\min_w \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i \right) \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \xi_i \ge 0 \quad (i = 1,\dots,n \text{ data points})$$
At the optimum, the slack equals the violation:
$$\xi_i = \begin{cases} 0, & y^{(i)}(w^T x^{(i)} + b) \ge 1 \\ 1 - y^{(i)}(w^T x^{(i)} + b), & \text{otherwise} \end{cases}$$
[Figure 7.3 in Pattern Recognition and Machine Learning by Chris Bishop]
Equivalent formulation using hinge loss
• Hard-margin SVM loss:
$$l_\infty = \begin{cases} 0, & 1 - y^{(i)}(w^T x^{(i)} + b) \le 0 \\ \infty, & \text{otherwise} \end{cases}$$
• Soft-margin SVM loss (hinge loss):
$$\xi_i = \begin{cases} 0, & 1 - y^{(i)}(w^T x^{(i)} + b) \le 0 \\ 1 - y^{(i)}(w^T x^{(i)} + b), & \text{otherwise} \end{cases}$$
or $l_h(x^{(i)}, y^{(i)}, w, b) = \max\left(0, 1 - y^{(i)}(w^T x^{(i)} + b)\right)$
The soft-margin objective then becomes unconstrained:
$$\min_w \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^n l_h(x^{(i)}, y^{(i)}, w, b) \right)$$
where $\frac{\|w\|^2}{2}$ is the data-independent regularisation term and $C \sum_i l_h$ is the data-dependent training error.
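A minimal numpy sketch of the hinge loss and the resulting unconstrained objective (the helper names and the example values are illustrative, not from the slides):

```python
import numpy as np

def hinge_loss(X, y, w, b):
    # Per-point hinge loss l_h = max(0, 1 - y (w^T x + b)): zero for points
    # on or outside the margin, otherwise the size of the violation.
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(X, y, w, b, C=1.0):
    # Regularisation term ||w||^2 / 2 plus C times the total training error.
    return 0.5 * w @ w + C * hinge_loss(X, y, w, b).sum()

# Example: one point outside the margin (loss 0), one inside it (loss 0.5).
X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0])
w, b = np.array([1.0, 0.0]), 0.0
print(hinge_loss(X, y, w, b))  # [0.  0.5]
```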
Influence of slack penalty
Slack penalty: $C > 0$
$$\min_w \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i \right) \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \xi_i \ge 0 \quad (i = 1,\dots,n \text{ data points})$$
• If $C = \infty$: all data points have to be correctly classified → overfitting
• If $C = 0$: the data is ignored → underfitting
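A hedged scikit-learn sketch of this trade-off (the dataset and the C values are arbitrary illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Noisy, roughly linearly separable two-class data (illustrative only).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates many margin violations (wider margin, towards
    # underfitting); large C penalises them heavily (towards overfitting).
    print(f"C={C}: support vectors per class = {clf.n_support_}")
```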
Recap of underfitting and overfitting
underfitting: the model does not fit the data (a ‘lazy’ model)
overfitting: the model fits the training data too well (a ‘hard-working’ model)
[Figures: examples of an underfit and an overfit model on the same data.]
Influence of slack penalty C
$$\min_w \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i \right) \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \xi_i \ge 0 \quad (i = 1,\dots,n \text{ data points})$$
[Figures: decision boundaries trained with different values of C.]
Quiz
[Figure: two soft-margin decision boundaries on the same data.]
Which was trained with the larger C? The smaller C?
Solve the primal problem by solving the dual problem
Hard-margin SVM objective (primal problem):
$$\min_w \frac{\|w\|^2}{2} \quad \text{s.t. } 1 - y^{(i)}(w^T x^{(i)} + b) \le 0, \ i = 1,\dots,n \text{ (data points)}$$
Soft-margin SVM objective (primal problem):
$$\min_w \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i \right) \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \xi_i \ge 0 \quad (i = 1,\dots,n \text{ data points})$$
Why solve the dual problem instead?
• More efficient in high-dimensional feature spaces
• Allows for the kernel trick: efficient in handling non-linearly separable data
• The solutions of the primal and dual are equal under some conditions
Two questions to answer:
• What is the dual problem?
• Under what conditions are the solutions of the dual and primal problems equal?
Simple constrained optimisation problem
$$\min_x x^2 \quad \text{s.t. } x \le -1$$
[Figure: the parabola $y = f(x) = x^2$; the feasible region is $x \le -1$, so the constrained minimum sits at $x = -1$.]
What’s the dual problem?
$$\min_x f(x) \quad \text{s.t. } g(x) = x + 1 \le 0 \qquad (\text{here } f(x) = x^2)$$
• Construct the Lagrangian function: $L(x, \lambda) = f(x) + \lambda g(x)$, where $\lambda \ge 0$ is the Lagrange multiplier
• Construct the primal function: $\theta_p(x) = \max_\lambda L(x, \lambda)$
If $g(x) \le 0$: $\theta_p(x) = \max_\lambda L(x, \lambda) = f(x)$ (attained at $\lambda = 0$)
If $g(x) > 0$: $\theta_p(x) = \max_\lambda L(x, \lambda) = \infty$ (as $\lambda \to \infty$)
What’s the dual problem?
primal function: $\theta_p(x) = \max_\lambda L(x, \lambda)$, with $L(x, \lambda) = f(x) + \lambda g(x)$, $\lambda \ge 0$
So minimising $f(x)$ with the constraint equals an unconstrained min-max:
$$\min_x f(x) = \min_x \theta_p(x) = \min_x \max_\lambda L(x, \lambda)$$
What’s the dual problem?
• Primal problem: $\min_x f(x) = \min_x \max_\lambda L(x, \lambda)$, with $L(x, \lambda) = f(x) + \lambda g(x)$, $g(x) \le 0$, $\lambda \ge 0$
• Dual function: $\theta_d(\lambda) = \min_x L(x, \lambda)$
• Dual problem: $\max_\lambda \min_x L(x, \lambda) = \max_\lambda \theta_d(\lambda)$
For any feasible $x$ (i.e. $g(x) \le 0$) and any $\lambda \ge 0$:
$$\theta_d(\lambda) = \min_x L(x, \lambda) \le L(x, \lambda) = f(x) + \lambda g(x) \le f(x)$$
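A worked run of this machinery on the toy problem $\min_x x^2$ s.t. $x + 1 \le 0$ (a step-by-step derivation added here for concreteness, not on the original slide):

$$
\begin{aligned}
L(x, \lambda) &= x^2 + \lambda(x + 1), \quad \lambda \ge 0\\
\frac{\partial L}{\partial x} = 2x + \lambda = 0 &\;\Rightarrow\; x = -\frac{\lambda}{2}\\
\theta_d(\lambda) = L\!\left(-\frac{\lambda}{2}, \lambda\right) &= \frac{\lambda^2}{4} - \frac{\lambda^2}{2} + \lambda = -\frac{\lambda^2}{4} + \lambda\\
\max_{\lambda \ge 0} \theta_d(\lambda): \ -\frac{\lambda}{2} + 1 = 0 &\;\Rightarrow\; \lambda^* = 2, \ d^* = \theta_d(2) = 1\\
\text{and indeed } p^* = \min_{x \le -1} x^2 = f(-1) &= 1 = d^*.
\end{aligned}
$$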
What are the conditions making solution(dual) == solution(primal)?
• Primal problem: $\min_x f(x) = \min_x \max_\lambda L(x, \lambda)$
• Dual problem: $\max_\lambda \min_x L(x, \lambda) = \max_\lambda \theta_d(\lambda)$
Solutions:
$x^*$ minimises $f(x)$: $f(x^*) = p^* = \min_x f(x)$
$\lambda^*$ maximises $\theta_d(\lambda)$: $\theta_d(\lambda^*) = d^* = \max_\lambda \theta_d(\lambda)$
Applying $\theta_d(\lambda) = \min_x L(x, \lambda) \le L(x, \lambda) = f(x) + \lambda g(x) \le f(x)$ at $x^*, \lambda^*$:
$$d^* = \theta_d(\lambda^*) = \min_x L(x, \lambda^*) \le L(x^*, \lambda^*) = f(x^*) + \lambda^* g(x^*) \le f(x^*) = p^*$$
Under some conditions, $d^* = p^*$. Which?
What are the conditions making solution(dual) == solution(primal)?
$$d^* = \theta_d(\lambda^*) = \min_x L(x, \lambda^*) \le L(x^*, \lambda^*) = f(x^*) + \lambda^* g(x^*) \le f(x^*) = p^*$$
$d^* = p^*$ if $\min_x L(x, \lambda^*) = L(x^*, \lambda^*)$ and $f(x^*) + \lambda^* g(x^*) = f(x^*)$, which is what the
KKT (Karush-Kuhn-Tucker) conditions guarantee:
$\dfrac{\partial L}{\partial x} = 0$ (stationarity)
$g(x) \le 0$ (primal feasibility)
$\lambda \ge 0$ (dual feasibility)
$\lambda g(x) = 0$ (complementary slackness)
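Checking the KKT conditions on the toy problem above, at $x^* = -1$, $\lambda^* = 2$:

$$
\begin{aligned}
\frac{\partial L}{\partial x}\Big|_{x^*, \lambda^*} &= 2(-1) + 2 = 0 &&\text{(stationarity)}\\
g(x^*) &= -1 + 1 = 0 \le 0 &&\text{(primal feasibility)}\\
\lambda^* &= 2 \ge 0 &&\text{(dual feasibility)}\\
\lambda^* g(x^*) &= 2 \cdot 0 = 0 &&\text{(complementary slackness)}
\end{aligned}
$$

All four hold, consistent with $d^* = p^* = 1$.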
Dual problem of hard-margin SVM
Primal problem:
$$\min_w \frac{\|w\|^2}{2} \quad \text{s.t. } g_i(w, b) = 1 - y^{(i)}(w^T x^{(i)} + b) \le 0, \ i = 1,\dots,n \text{ data points}$$
(1) Lagrangian function:
$$L(w, b, \lambda) = \frac{\|w\|^2}{2} + \sum_{i=1}^n \lambda_i \left(1 - y^{(i)}(w^T x^{(i)} + b)\right), \quad \lambda_i \ge 0$$
(2) Dual function: $\theta_d(\lambda) = \min_{w, b} L(w, b, \lambda)$, using
$$\frac{\partial L}{\partial w_j} = 0: \ w_j = \sum_{i=1}^n \lambda_i y^{(i)} x_j^{(i)}
\qquad
\frac{\partial L}{\partial b} = 0: \ \sum_{i=1}^n \lambda_i y^{(i)} = 0$$
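Substituting the stationarity conditions back into $L(w, b, \lambda)$ eliminates $w$ and $b$; this intermediate step (which the slides jump over) yields the dual function on the next slide:

$$
\begin{aligned}
\theta_d(\lambda) &= \frac{\|w\|^2}{2} + \sum_{i=1}^n \lambda_i - \sum_{i=1}^n \lambda_i y^{(i)} w^T x^{(i)} - b \underbrace{\sum_{i=1}^n \lambda_i y^{(i)}}_{=0}\\
&= \frac{\|w\|^2}{2} + \sum_{i=1}^n \lambda_i - \|w\|^2
\qquad \text{since } w = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}\\
&= \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}
\end{aligned}
$$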
Dual problem of hard-margin SVM
Dual function:
$$\theta_d(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}$$
Dual problem: $\max_\lambda \theta_d(\lambda)$ s.t. $\lambda_i \ge 0$ and $\sum_{i=1}^n \lambda_i y^{(i)} = 0$
Solve for n variables ($\lambda_i$), one per data point in the dataset, using sequential minimal optimisation (OUT OF SCOPE).
Platt, John. “Sequential minimal optimization: A fast algorithm for training support vector machines.” (1998).
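SMO is out of scope, but the dual is just a small quadratic program, so a generic solver can illustrate it. A minimal sketch using scipy (the four data points are made up; this is not the method used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Q[i, k] = y_i y_k (x_i)^T x_k, so theta_d(lam) = sum(lam) - 0.5 lam^T Q lam.
Q = (y[:, None] * X) @ (y[:, None] * X).T

neg_theta_d = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()  # minimise -theta_d
res = minimize(neg_theta_d, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                          # lam_i >= 0
               constraints={"type": "eq", "fun": lambda lam: lam @ y})  # sum_i lam_i y_i = 0
lam = res.x

# Recover the primal solution: w = sum_i lam_i y_i x_i; b comes from any
# support vector s (lam_s > 0), which lies on the margin: y_s (w^T x_s + b) = 1.
w = ((lam * y)[:, None] * X).sum(axis=0)
s = int(np.argmax(lam))
b = y[s] - w @ X[s]
print(lam.round(3), w.round(3), round(b, 3))
```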
KKT conditions in hard-margin SVM
$$L(w, b, \lambda) = \frac{\|w\|^2}{2} + \sum_{i=1}^n \lambda_i \left(1 - y^{(i)}(w^T x^{(i)} + b)\right)$$
(Stationarity) $\dfrac{\partial L}{\partial w_j} = 0$: $w_j = \sum_{i=1}^n \lambda_i y^{(i)} x_j^{(i)}$; $\quad \dfrac{\partial L}{\partial b} = 0$: $\sum_{i=1}^n \lambda_i y^{(i)} = 0$
(Primal feasibility) $g_i(w, b) = 1 - y^{(i)}(w^T x^{(i)} + b) \le 0$ $(i = 1,\dots,n$ data points$)$
(Dual feasibility) $\lambda_i \ge 0$
(Complementary slackness) $\lambda_i g_i(w, b) = 0$
By complementary slackness, $\lambda_i > 0$ is only possible for points with $g_i(w, b) = 0$, i.e. points exactly on the margin hyperplanes $f(x) = \pm 1$: the support vectors.
[Figure: boundary $f(x) = 0$ with margin hyperplanes $f(x) = \pm 1$; the points on them, with $\lambda_i > 0$, are the support vectors.]
Quiz
[Figure: boundary $f(x) = 0$ with margin hyperplanes $f(x) = \pm 1$ and two labelled data points A and B.]
What is the value of $\lambda_i$ for data points A and B?
Primal vs Dual (Training)
• Primal problem: solve for d variables ($w_j$, where d is the dimension of the weight vector $w$):
$$\min_w \frac{\|w\|^2}{2} \quad \text{s.t. } g_i(w, b) = 1 - y^{(i)}(w^T x^{(i)} + b) \le 0, \ i = 1,\dots,n \text{ data points}$$
• Dual problem: solve for n variables ($\lambda_i$), one per data point in the dataset:
$$\max_\lambda \theta_d(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)} \quad \text{s.t. } \lambda_i \ge 0 \text{ and } \sum_{i=1}^n \lambda_i y^{(i)} = 0$$
If the feature dimension d is larger than the number of data points ($d > n$), solving the dual problem is faster than solving the primal problem.
Primal vs Dual (Prediction)
• Primal form: $f(x) = w^T x + b$
• Dual form: $w_j = \sum_{i=1}^n \lambda_i y^{(i)} x_j^{(i)}$, so
$$f(x) = \sum_{i=1}^n \lambda_i y^{(i)} (x^{(i)})^T x + b$$
(using only the (small number of) support vectors, those with $\lambda_i > 0$)
$f(x) > 0$: positive class; $f(x) < 0$: negative class

Why bother solving the dual problem to solve the primal problem?
Training, solve: $\max_\lambda \theta_d(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)}$ s.t. $\lambda_i \ge 0$ and $\sum_{i=1}^n \lambda_i y^{(i)} = 0$
Prediction: $f(x) = \sum_{i=1}^n \lambda_i y^{(i)} (x^{(i)})^T x + b$
• Both touch the data only through inner products: the kernel trick can be used to efficiently handle non-linearly separable data
• Efficient in high-dimensional spaces

Dual problem of soft-margin SVM
Primal problem:
$$\min_w \left( \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i \right) \quad \text{s.t. } y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \xi_i \ge 0 \quad (i = 1,\dots,n \text{ data points})$$
Constraints: $g_i(w, b, \xi) = 1 - \xi_i - y^{(i)}(w^T x^{(i)} + b) \le 0$ and $-\xi_i \le 0$
(1) Lagrangian function:
$$L(w, b, \lambda, \beta, \xi) = \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \lambda_i g_i(w, b, \xi) + \sum_{i=1}^n \beta_i(-\xi_i), \quad \lambda_i \ge 0, \ \beta_i \ge 0$$
(2) Dual function: $\theta_d(\lambda) = \min_{w, b, \xi} L(w, b, \lambda, \beta, \xi)$, using
$$\frac{\partial L}{\partial w_j} = 0: \ w_j = \sum_{i=1}^n \lambda_i y^{(i)} x_j^{(i)}
\qquad
\frac{\partial L}{\partial b} = 0: \ \sum_{i=1}^n \lambda_i y^{(i)} = 0
\qquad
\frac{\partial L}{\partial \xi_i} = 0: \ C - \lambda_i - \beta_i = 0$$
The last condition, with $\lambda_i, \beta_i \ge 0$, gives the box constraint $0 \le \lambda_i \le C$.
Dual problem of soft-margin SVM:
$$\max_\lambda \theta_d(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^T x^{(k)} \quad \text{s.t. } 0 \le \lambda_i \le C \text{ and } \sum_{i=1}^n \lambda_i y^{(i)} = 0$$

KKT conditions in soft-margin SVM
$$L(w, b, \lambda, \beta, \xi) = \frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \lambda_i g_i(w, b, \xi) + \sum_{i=1}^n \beta_i(-\xi_i)$$
(Stationarity) $\dfrac{\partial L}{\partial w_j} = 0$: $w_j = \sum_{i=1}^n \lambda_i y^{(i)} x_j^{(i)}$; $\ \dfrac{\partial L}{\partial b} = 0$: $\sum_{i=1}^n \lambda_i y^{(i)} = 0$; $\ \dfrac{\partial L}{\partial \xi_i} = 0$: $C - \lambda_i - \beta_i = 0$
(Primal feasibility) $g_i(w, b, \xi) = 1 - \xi_i - y^{(i)}(w^T x^{(i)} + b) \le 0$, $-\xi_i \le 0$ $(i = 1,\dots,n$ data points$)$
(Dual feasibility) $\lambda_i \ge 0$, $\beta_i \ge 0$
(Complementary slackness) $\lambda_i g_i(w, b, \xi) = 0$, $\beta_i \xi_i = 0$
Support vectors are the points with $\lambda_i > 0$.

Quiz
If $0 < \lambda_i < C$: $\beta_i = C - \lambda_i > 0$, so $\xi_i = 0$ and $y^{(i)}(w^T x^{(i)} + b) = 1$ (the point is on a margin hyperplane)
If $\lambda_i = C$: $\beta_i = 0$, $-\xi_i \le 0$, so $y^{(i)}(w^T x^{(i)} + b) = 1 - \xi_i \le 1$
[Figure: soft-margin boundary with three labelled data points A, B and C.]
What are the values of $\lambda_i, \beta_i$ for data points A, B and C?

Summary
• What are the objective and constraints of the soft-margin SVM?
• What are the slack variables & slack penalty of the soft-margin SVM?
• What are the KKT conditions and dual problems of the soft-margin and hard-margin SVM?
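To tie training and prediction together, a hedged scikit-learn sketch (toy data; in sklearn's SVC, the attribute dual_coef_ stores $\lambda_i y^{(i)}$ for the support vectors only), checking the box constraint $0 \le \lambda_i \le C$ and that the dual-form prediction matches the primal form:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (illustrative only).
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.5, random_state=0)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds lambda_i * y_i for the support vectors; the box
# constraint of the soft-margin dual gives 0 <= lambda_i <= C.
lam = np.abs(clf.dual_coef_).ravel()
assert np.all((lam >= 0) & (lam <= C + 1e-9))

# Dual-form prediction f(x) = sum_i lambda_i y_i (x_i)^T x + b over the
# support vectors equals the primal-form f(x) = w^T x + b.
f_dual = X @ clf.support_vectors_.T @ clf.dual_coef_.ravel() + clf.intercept_
f_primal = clf.decision_function(X)
assert np.allclose(f_dual, f_primal)
print("support vectors:", len(lam), "of", len(y), "points")
```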