Lecture 10. Soft-Margin SVM, Lagrangian Duality
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Soft-margin SVM
∗ Intuition and problem formulation
• Lagrangian dual
∗ Alternate formulation with different training complexity
∗ Explains support vectors
∗ Sets us up for kernels (next lectures)
Recap: hard-margin SVM
• SVM is a linear binary classifier
• Max margin: aim for boundary robust to noise
• Trick to resolve ambiguity
• Hard-margin program:
$$\underset{\mathbf{w},b}{\operatorname{argmin}} \; \|\mathbf{w}\| \quad \text{s.t. } y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1 \text{ for } i = 1,\dots,n$$
Soft-Margin SVMs
Addressing linear inseparability
When data is not linearly separable
• Hard-margin loss is too stringent (hard!)
• Real data is unlikely to be linearly separable
• If the data is not separable, hard-margin SVMs are in trouble

[Figure: a non-linearly-separable dataset in the $x_1$–$x_2$ plane]

SVMs offer 3 approaches to address this problem:
1. Still use hard-margin SVM, but transform the data (next lecture)
2. Relax the constraints (next slide)
3. The combination of 1 and 2
Soft-margin SVM
• Relax constraints to allow points to be inside the margin or even on the wrong side of the boundary
• However, we penalise boundaries by the extent of “violation”
• In the figure, the objective penalty will take into account the orange distances

[Figure: soft-margin SVM in the $x_1$–$x_2$ plane, with margin violations shown as orange distances]
Hinge loss: soft-margin SVM loss
• Hard-margin SVM loss
$$l_\infty = \begin{cases} 0 & 1 - y(\mathbf{w}'\mathbf{x} + b) \le 0 \\ \infty & \text{otherwise} \end{cases}$$
• Soft-margin SVM loss (hinge loss)
$$l_h = \begin{cases} 0 & 1 - y(\mathbf{w}'\mathbf{x} + b) \le 0 \\ 1 - y(\mathbf{w}'\mathbf{x} + b) & \text{otherwise} \end{cases}$$

[Figure: hinge loss $l_h(\hat{y}, y)$ plotted against $s = y\hat{y}$: zero for $s \ge 1$, increasing linearly as $s$ falls below 1]
Compare this with the perceptron loss.
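As a quick illustration (an added sketch, not from the slides), here is a NumPy comparison of the hinge loss with the perceptron loss $\max(0, -s)$, both written as functions of the score $s = y(\mathbf{w}'\mathbf{x} + b)$:

```python
import numpy as np

def hinge_loss(s):
    """Soft-margin SVM loss as a function of s = y * (w'x + b)."""
    return np.maximum(0.0, 1.0 - s)

def perceptron_loss(s):
    """Perceptron loss: penalises only misclassified points (s < 0)."""
    return np.maximum(0.0, -s)

s = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print("hinge     :", hinge_loss(s))       # also penalises points inside the margin (s < 1)
print("perceptron:", perceptron_loss(s))  # zero as soon as the point is on the correct side
```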
Soft-margin SVM objective
• Soft-margin SVM objective:
$$\underset{\mathbf{w},b}{\operatorname{argmin}} \left( \sum_{i=1}^n l_h(\mathbf{x}_i, y_i, \mathbf{w}, b) \right) + \lambda \|\mathbf{w}\|^2$$
∗ Reminiscent of ridge regression
∗ Hinge loss $l_h = \max\left(0,\, 1 - y_i(\mathbf{w}'\mathbf{x}_i + b)\right)$
• We are going to re-formulate this objective to make it more amenable to analysis
Re-formulating soft-margin objective
• Define slack variables as an upper bound on loss
$$\xi_i \ge l_h = \max\left(0,\, 1 - y_i(\mathbf{w}'\mathbf{x}_i + b)\right)$$
or equivalently $\xi_i \ge 1 - y_i(\mathbf{w}'\mathbf{x}_i + b)$ and $\xi_i \ge 0$
• Re-write the soft-margin SVM objective as:
$$\underset{\mathbf{w},b,\boldsymbol{\xi}}{\operatorname{argmin}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i$$
$$\text{s.t. } \xi_i \ge 1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \text{ for } i = 1,\dots,n; \quad \xi_i \ge 0 \text{ for } i = 1,\dots,n$$
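Why this is the same problem (a short note added here, not on the slide): the objective is increasing in each $\xi_i$, so at any optimum the slack constraints are tight,
$$\xi_i^* = \max\left(0,\; 1 - y_i(\mathbf{w}'\mathbf{x}_i + b)\right) = l_h(\mathbf{x}_i, y_i, \mathbf{w}, b),$$
and minimising over $\boldsymbol{\xi}$ recovers the hinge-loss objective above up to rescaling (with $C$ playing the role of $1/(2\lambda)$).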
Side-by-side: Two variations of SVM
• Hard-margin SVM objective*:
$$\underset{\mathbf{w},b}{\operatorname{argmin}} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t. } y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1 \text{ for } i = 1,\dots,n$$
• Soft-margin SVM objective:
$$\underset{\mathbf{w},b,\boldsymbol{\xi}}{\operatorname{argmin}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i$$
$$\text{s.t. } y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1 - \xi_i \text{ for } i = 1,\dots,n; \quad \xi_i \ge 0 \text{ for } i = 1,\dots,n$$
• In the second case, the constraints are relaxed (“softened”) by allowing violations by $\xi_i$. Hence the name “soft margin”.

*Changed $\|\mathbf{w}\|$ to $\frac{1}{2}\|\mathbf{w}\|^2$ – a monotonic increasing transform. The modified objective yields the same solution.
Lagrangian Duality for the SVM
An equivalent formulation, with important consequences.
Constrained optimisation
• Constrained optimisation: canonical form
$$\text{minimise } f(\mathbf{x}) \quad \text{s.t. } g_i(\mathbf{x}) \le 0,\; i = 1,\dots,n; \quad h_j(\mathbf{x}) = 0,\; j = 1,\dots,m$$
∗ E.g., find the deepest point in the lake, south of the bridge
• Hard-margin SVM: $\underset{\mathbf{w},b}{\operatorname{argmin}} \frac{1}{2}\|\mathbf{w}\|^2$ s.t. $1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \le 0$ for $i = 1,\dots,n$
• Applicable, but gradient descent doesn’t immediately apply
• Method of Lagrange multipliers
∗ Transform to unconstrained optimisation – not necessarily for the solution
∗ Transform the primal program to a related dual program, an alternative to the primal
∗ Analyse necessary & sufficient conditions for solutions of both programs
The Lagrangian and duality
• Introduce an auxiliary objective function via auxiliary variables:
$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_{i=1}^n \lambda_i g_i(\mathbf{x}) + \sum_{j=1}^m \nu_j h_j(\mathbf{x})$$
∗ Primal constraints became penalties
∗ Called the Lagrangian function
∗ New $\boldsymbol{\lambda}$ and $\boldsymbol{\nu}$ are called the Lagrange multipliers or dual variables
• (Old) primal program: $\min_{\mathbf{x}} \max_{\boldsymbol{\lambda} \ge \mathbf{0}, \boldsymbol{\nu}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})$
• (New) dual program: $\max_{\boldsymbol{\lambda} \ge \mathbf{0}, \boldsymbol{\nu}} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})$ – may be easier to solve, advantageous
• Duality theory relates primal/dual:
∗ Weak duality: dual optimum ≤ primal optimum
∗ For convex programs (inc. SVM!) strong duality: optima coincide!
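A tiny worked example to make the primal/dual switch concrete (added for illustration, not from the slides): minimise $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$.
$$\mathcal{L}(x, \lambda) = x^2 + \lambda(1 - x), \qquad \min_x \mathcal{L}(x, \lambda) \text{ is attained at } x = \tfrac{\lambda}{2}, \qquad \text{dual: } \max_{\lambda \ge 0} \; \lambda - \tfrac{\lambda^2}{4}.$$
The dual is maximised at $\lambda^* = 2$ with value $1$, matching the primal optimum $f(x^*) = 1$ at $x^* = 1$: weak duality holds with equality here, i.e., strong duality, as expected for a convex program.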
Karush-Kuhn-Tucker Necessary Conditions
• Lagrangian: $\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \sum_{i=1}^n \lambda_i g_i(\mathbf{x}) + \sum_{j=1}^m \nu_j h_j(\mathbf{x})$
• Necessary conditions for optimality of a primal solution:
• Primal feasibility: $g_i(\mathbf{x}^*) \le 0$ for $i = 1,\dots,n$ and $h_j(\mathbf{x}^*) = 0$ for $j = 1,\dots,m$
• Dual feasibility: $\lambda_i^* \ge 0$ for $i = 1,\dots,n$
• Complementary slackness: $\lambda_i^* g_i(\mathbf{x}^*) = 0$ for $i = 1,\dots,n$ (don’t penalise if the constraint is satisfied)
• Stationarity: $\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}^*, \boldsymbol{\lambda}^*, \boldsymbol{\nu}^*) = \mathbf{0}$

A souped-up version of the necessary condition “derivative is zero” in unconstrained optimisation.
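Checking these on the toy example above (again an added illustration): at $x^* = 1$, $\lambda^* = 2$,
$$g(x^*) = 1 - 1 = 0 \le 0, \qquad \lambda^* = 2 \ge 0, \qquad \lambda^* g(x^*) = 0, \qquad \nabla_x \mathcal{L}(x^*, \lambda^*) = 2x^* - \lambda^* = 0,$$
so all four conditions hold.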
KKT conditions for hard-margin SVM
The Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \lambda_i \left( y_i(\mathbf{w}'\mathbf{x}_i + b) - 1 \right)$$
KKT conditions:
∗ Feasibility: $y_i\left((\mathbf{w}^*)'\mathbf{x}_i + b^*\right) - 1 \ge 0$ for $i = 1,\dots,n$
∗ Feasibility: $\lambda_i^* \ge 0$ for $i = 1,\dots,n$
∗ Complementary slackness: $\lambda_i^* \left( y_i\left((\mathbf{w}^*)'\mathbf{x}_i + b^*\right) - 1 \right) = 0$ for $i = 1,\dots,n$
∗ Stationarity: $\nabla_{\mathbf{w},b} \mathcal{L}(\mathbf{w}^*, b^*, \boldsymbol{\lambda}^*) = \mathbf{0}$
Let’s minimise the Lagrangian w.r.t. the primal variables
• Lagrangian:
$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \lambda_i \left( y_i(\mathbf{w}'\mathbf{x}_i + b) - 1 \right)$$
• Stationarity conditions give us more information:
$$\frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^n \lambda_i y_i = 0 \qquad \text{(new constraint)}$$
$$\frac{\partial \mathcal{L}}{\partial w_j} = w_j^* - \sum_{i=1}^n \lambda_i y_i (\mathbf{x}_i)_j = 0 \qquad \text{(eliminates primal variables)}$$
• The Lagrangian becomes (with the additional constraint above):
$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{x}_i' \mathbf{x}_j$$
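Spelling out the elimination step (added detail): substitute $\mathbf{w}^* = \sum_i \lambda_i y_i \mathbf{x}_i$ and use $\sum_i \lambda_i y_i = 0$ to drop the $b$ term,
$$\mathcal{L} = \frac{1}{2}\Bigl\| \sum_i \lambda_i y_i \mathbf{x}_i \Bigr\|^2 - \sum_{i,j} \lambda_i \lambda_j y_i y_j \mathbf{x}_j' \mathbf{x}_i + \sum_i \lambda_i = \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \mathbf{x}_i' \mathbf{x}_j .$$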
Dual program for hard-margin SVM
• Having minimised the Lagrangian with respect to the primal variables, maximising it w.r.t. the dual variables yields the dual program:
$$\underset{\boldsymbol{\lambda}}{\operatorname{argmax}} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{x}_i' \mathbf{x}_j$$
$$\text{s.t. } \lambda_i \ge 0 \text{ and } \sum_{i=1}^n \lambda_i y_i = 0$$
• Strong duality: solving the dual solves the primal!
• Like the primal: a so-called quadratic program – off-the-shelf software can solve it – more later
• Unlike the primal:
∗ Complexity of the solution is $O(n^3)$ instead of $O(d^3)$ – more later
∗ The program depends on dot products of the data only – more later on kernels!
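As a rough illustration of “off-the-shelf software”, here is a minimal sketch that solves this dual on a tiny made-up dataset with SciPy’s SLSQP solver (the data and tolerances are assumptions, not from the lecture; dedicated QP solvers would normally be used):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q[i, j] = y_i y_j x_i' x_j
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(lam):
    # Negative of the dual objective: the dual maximises, we minimise
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

res = minimize(
    neg_dual,
    x0=np.zeros(n),                                            # feasible starting point
    bounds=[(0.0, None)] * n,                                   # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],   # sum_i lambda_i y_i = 0
    method="SLSQP",
)
lam = res.x
print("lambda*:", np.round(lam, 4))  # non-zero entries mark the support vectors
```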
Making predictions with dual solution
Recovering primal variables
• Recall from stationarity: $w_j^* - \sum_{i=1}^n \lambda_i y_i (\mathbf{x}_i)_j = 0$, i.e., $\mathbf{w}^* = \sum_{i=1}^n \lambda_i^* y_i \mathbf{x}_i$
• Complementary slackness: $b^*$ can be recovered from the dual solution, noting that for any example $j$ with $\lambda_j^* > 0$ we have
$$y_j \left( b^* + \sum_{i=1}^n \lambda_i^* y_i \mathbf{x}_i' \mathbf{x}_j \right) = 1$$
Testing: classify a new instance $\mathbf{x}$ based on the sign of
$$s = b^* + \sum_{i=1}^n \lambda_i^* y_i \mathbf{x}_i' \mathbf{x}$$
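A small sketch of this recovery step, assuming `lam`, `X`, `y` come from a dual solver such as the one above (the names and tolerance are illustrative assumptions):

```python
import numpy as np

def predict_from_dual(lam, X, y, x_new, tol=1e-8):
    """Classify x_new given a dual solution lam for training data (X, y) with y in {-1, +1}."""
    w = (lam * y) @ X                 # w* = sum_i lambda_i y_i x_i (stationarity)
    j = int(np.argmax(lam > tol))     # any example with lambda_j > 0 lies on the margin
    b = y[j] - w @ X[j]               # from y_j (w'x_j + b) = 1 and y_j in {-1, +1}
    return np.sign(b + w @ x_new)     # equivalently sign(b* + sum_i lambda_i y_i x_i'x_new)
```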
Soft-margin SVM’s dual
• Training: find $\boldsymbol{\lambda}$ that solves
$$\underset{\boldsymbol{\lambda}}{\operatorname{argmax}} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j \mathbf{x}_i' \mathbf{x}_j$$
$$\text{s.t. } C \ge \lambda_i \ge 0 \text{ (box constraints) and } \sum_{i=1}^n \lambda_i y_i = 0$$
• Making predictions: same pattern as in the hard-margin case
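For instance, scikit-learn’s SVC fits this box-constrained dual (a quick sketch; the toy data is an assumption made for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (made up for illustration)
X, y = make_blobs(n_samples=40, centers=2, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ stores lambda_i * y_i for the support vectors only,
# so the box constraint 0 <= lambda_i <= C implies |dual_coef_| <= C
lam_y = clf.dual_coef_.ravel()
print("support vectors:", len(clf.support_), "of", len(y))
print("max lambda_i:", np.abs(lam_y).max(), "<= C =", C)
```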
Additional Notes
Complementary slackness, Dot products
• Recall that one of the KKT conditions is complementary slackness:
$$\lambda_i^* \left( y_i\left((\mathbf{w}^*)'\mathbf{x}_i + b^*\right) - 1 \right) = 0$$
• Remember that $y_i(\mathbf{w}'\mathbf{x}_i + b) - 1 > 0$ means that $\mathbf{x}_i$ is outside the margin
• (Likely many) points outside the margin must have $\lambda_i^* = 0$
• The points with non-zero $\lambda$s are support vectors: $\mathbf{w}^* = \sum_{i=1}^n \lambda_i^* y_i \mathbf{x}_i$
• Predictions are made by dot products with the support vectors:
$$s = b^* + \sum_{i=1}^n \lambda_i^* y_i \mathbf{x}_i' \mathbf{x}$$

[Figure: separable data in the $x_1$–$x_2$ plane; support vectors lie on the margin]
Training the SVM
• The SVM dual problems are quadratic programs. Using standard algorithms they can be solved in $O(n^3)$ time, or $O(d^3)$ for the primal.
• This can be inefficient; several specialised solutions have been proposed.
• These solutions mostly decompose the training data and break the program down into smaller programs that can be solved quickly.
• The original SVM training algorithm, chunking, exploits the fact that many $\lambda$s will be zero (sparsity).
• Sequential minimal optimisation (SMO) is another such algorithm, an extreme case of chunking: an iterative procedure that analytically optimises randomly chosen pairs of $\lambda$s per iteration.
This lecture
• Soft-margin SVM
∗ Intuition and problem formulation
• Forming the dual program
∗ Lagrange multipliers, KKT conditions
∗ Weak and strong duality
• Finishing touches
∗ Complementary slackness
∗ Notes on training
• Workshops Week #6: more neural nets
• Next lecture: kernelisation