COMS 4771 Support Vector Machines
Nakul Verma
Last time…
• Decision boundaries for classification
• Linear decision boundary (linear classification)
• The Perceptron algorithm
• Mistake bound for the perceptron
• Generalizing to non-linear boundaries (via Kernel space)
• Problems become linear in Kernel space
• The Kernel trick to speed up computation
Perceptron and Linear Separability
Say there is a linear decision boundary which can perfectly separate the training data
Which linear separator will the Perceptron algorithm return?
The separator with a large margin γ is better for generalization
How can we incorporate the margin in finding the linear boundary?
Solution: Support Vector Machines (SVMs)
Motivation:
• It returns a stable linear classifier by finding the maximum-margin solution
• A slight modification to the problem provides a way to deal with non-separable cases
• It is kernelizable, so it gives an implicit way of yielding non-linear classification
SVM Formulation
Say the training data S is linearly separable by some margin (but the linear separator does not necessarily pass through the origin).
Then:
Decision boundary: $w \cdot x + b = 0$
Linear classifier: $f(x) = \mathrm{sign}(w \cdot x + b)$
Idea: we can try finding two parallel hyperplanes that correctly classify all the points, and maximize the distance between them!
SVM Formulation (contd. 1)
Decision boundary for the two hyperplanes: $w \cdot x + b = +1$ and $w \cdot x + b = -1$
Distance between the two hyperplanes: $\frac{2}{\|w\|}$ (why? see the derivation below)
Training data is correctly classified if:
$w \cdot x_i + b \geq +1$ if $y_i = +1$
$w \cdot x_i + b \leq -1$ if $y_i = -1$
Together: $y_i (w \cdot x_i + b) \geq 1$ for all $i$
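A quick justification of the $\frac{2}{\|w\|}$ distance (a standard calculation filling in the "why?" above): take any point $x_-$ on the hyperplane $w \cdot x + b = -1$ and move along the unit normal $\frac{w}{\|w\|}$ by a distance $d$. Requiring the resulting point to lie on $w \cdot x + b = +1$ gives
$w \cdot \Big(x_- + d\,\frac{w}{\|w\|}\Big) + b = -1 + d\,\|w\| = +1,$
so $d = \frac{2}{\|w\|}$.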
SVM Formulation (contd. 2)
Distance between the hyperplanes: $\frac{2}{\|w\|}$
Training data is correctly classified if: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
Therefore, we want:
Maximize the distance: $\frac{2}{\|w\|}$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
Let’s put it in the standard form…
SVM Formulation (finally!)
Maximize: $\frac{2}{\|w\|}$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
SVM standard (primal) form:
Minimize: $\frac{1}{2}\|w\|^2$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
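Since this is a quadratic program (as discussed later in the lecture), any off-the-shelf convex solver can find $w$ and $b$. Below is a minimal sketch using numpy and the cvxpy modeling library on a tiny made-up dataset; the data and the library choice are illustrative assumptions, not part of the lecture:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: rows of X are the points x_i, y holds labels +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```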
What can we do if the problem is not linearly separable?
SVM Formulation (non-separable case)
Idea: introduce a slack for the misclassified points, and minimize the slack!
SVM standard (primal) form (with slack):
Minimize: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
Such that: $y_i (w \cdot x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ (for all $i$)
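One practical route to a solution (not spelled out on the slides): at the optimum each slack takes the value $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$, so the problem is equivalent to minimizing the unconstrained objective $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(w \cdot x_i + b))$, which can be attacked with (sub)gradient descent. A rough numpy sketch under that reformulation; the data, learning rate, and iteration count are arbitrary illustrations:

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Soft-margin SVM via subgradient descent on the equivalent
    unconstrained objective 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # points whose slack would be positive
        # Subgradient: w minus C * sum over violating points of y_i x_i (similarly for b).
        grad_w = w - C * (y[viol] @ X[viol])
        grad_b = -C * np.sum(y[viol])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [0.5, 0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])    # last point makes the data noisier
w, b = svm_subgradient(X, y)
print(w, b, np.sign(X @ w + b))
```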
SVM: Question
SVM standard (primal) form (with slack):
Minimize: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
Such that: $y_i (w \cdot x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ (for all $i$)
Questions:
1. How do we find the optimal w, b and ξ?
2. Why is it called “Support Vector Machine”?
How to Find the Solution?
Cannot simply take the derivative (wrt w, b and ξ) and examine the stationary points…
Why?
Minimize: $x^2$   Such that: $x \geq 5$
SVM standard (primal) form:
Minimize: $\frac{1}{2}\|w\|^2$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
The gradient is not zero at the function minimum (respecting the constraints)!
(Figure: the parabola $x^2$ with the infeasible region $x < 5$ shaded; the constrained minimum is at $x = 5$.)
Need a way to do optimization with constraints
Detour: Constrained Optimization
Constrained optimization (standard form):
minimize $f(x)$   (objective)
subject to: $g_i(x) \leq 0$ for $1 \leq i \leq n$   (constraints)
We'll assume that the problem is feasible.
What to do?
• Projection methods
start with a feasible solution x0,
find x1 that has a slightly lower objective value,
if x1 violates the constraints, project it back onto the constraint set. Iterate.
• Penalty methods
use a penalty function to incorporate the constraints into the objective
•…
The Lagrange (Penalty) Method
Consider the augmented function:
$L(x, \lambda) := f(x) + \sum_i \lambda_i g_i(x)$   (Lagrange function)
($\lambda_i \geq 0$: Lagrange variables, or dual variables)
Optimization problem:
Minimize: $f(x)$
Such that: $g_i(x) \leq 0$ (for all $i$)
Observation:
For any feasible $x$ and all $\lambda_i \geq 0$, we have $L(x, \lambda) \leq f(x)$, and $\max_{\lambda \geq 0} L(x, \lambda) = f(x)$.
So, the optimal value of the constrained optimization is:
$p^* = \min_x \max_{\lambda \geq 0} L(x, \lambda)$
The problem becomes unconstrained in $x$!
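As a concrete check (using the earlier toy problem; the algebra here is an added illustration, not slide content): for minimize $x^2$ such that $x \geq 5$, write the constraint as $g(x) = 5 - x \leq 0$. Then the Lagrange function is
$L(x, \lambda) = x^2 + \lambda (5 - x).$
For feasible $x$ (i.e. $x \geq 5$) and $\lambda \geq 0$ the penalty term $\lambda(5 - x)$ is non-positive, so $L(x, \lambda) \leq x^2$, and $\max_{\lambda \geq 0} L(x, \lambda)$ equals $x^2$ when $x \geq 5$ and $+\infty$ otherwise, recovering the constrained problem.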
The Dual Problem
Optimization problem:
Minimize: $f(x)$
Such that: $g_i(x) \leq 0$ (for all $i$)
Lagrange function: $L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$
Optimal value: $p^* = \min_x \max_{\lambda \geq 0} L(x, \lambda)$   (also called the primal)
Now, consider the function: $d^* := \max_{\lambda \geq 0} \min_x L(x, \lambda)$   (also called the dual)
Observation:
Since, for any feasible $x$ and all $\lambda_i \geq 0$: $\min_{x'} L(x', \lambda) \leq L(x, \lambda) \leq f(x)$
Thus: $d^* \leq p^*$
(Weak) Duality Theorem
Theorem (weak Lagrangian duality):
$d^* = \max_{\lambda \geq 0} \min_x L(x, \lambda) \;\leq\; \min_x \max_{\lambda \geq 0} L(x, \lambda) = p^*$   (also called the minimax inequality)
$p^* - d^*$ is called the duality gap.
Optimization problem:
Minimize: $f(x)$
Such that: $g_i(x) \leq 0$ (for all $i$)
Lagrange function: $L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$
Primal: $p^* = \min_x \max_{\lambda \geq 0} L(x, \lambda)$   Dual: $d^* = \max_{\lambda \geq 0} \min_x L(x, \lambda)$
Under what conditions can we achieve equality?
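Continuing the toy example (again an added check, not slide content): for minimize $x^2$ such that $x \geq 5$,
$\min_x L(x, \lambda) = \min_x \big(x^2 + \lambda(5 - x)\big) = 5\lambda - \tfrac{\lambda^2}{4}$ (attained at $x = \lambda/2$).
Maximizing over $\lambda \geq 0$ gives $\lambda^* = 10$ and dual value $d^* = 25$, which matches the primal optimum $p^* = 5^2 = 25$ at $x^* = 5$: the duality gap is zero here.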
Convexity
A function $f: \mathbb{R}^d \to \mathbb{R}$ is called convex iff for any two points $x, x'$ and $\beta \in [0,1]$:
$f\big(\beta x + (1-\beta) x'\big) \leq \beta f(x) + (1-\beta) f(x')$
Convexity
A set $S \subset \mathbb{R}^d$ is called convex iff for any two points $x, x' \in S$ and any $\beta \in [0,1]$: $\beta x + (1-\beta) x' \in S$
Examples: (figures of convex and non-convex sets)
Convex Optimization
A constrained optimization
minimize $f(x)$   (objective)
subject to: $g_i(x) \leq 0$ for $1 \leq i \leq n$   (constraints)
is called a convex optimization problem if:
the objective function $f$ is a convex function, and
the feasible set induced by the constraints $g_i$ is a convex set
Why do we care?
We can find the optimal solution of convex problems efficiently!
Convex Optimization: Niceties
• Every local optimum is a global optimum in a convex optimization problem.
Example convex problems:
Linear programs, quadratic programs, conic programs, semi-definite programs.
Several solvers exist to find the optima: CVX, SeDuMi, C-SALSA, …
• We can use a simple ‘descent-type’ algorithm for finding the minimum!
Gradient Descent (for finding local minima)
Theorem (Gradient Descent):
Given a smooth function $f$, then for any $x$ and any sufficiently small step size $\eta > 0$, we have:
$f\big(x - \eta \nabla f(x)\big) \leq f(x)$
Can derive a simple algorithm (the projected Gradient Descent):
Initialize $x_0$
for t = 1, 2, … do
    $z_t \leftarrow x_{t-1} - \eta \nabla f(x_{t-1})$   (step in the gradient direction)
    $x_t \leftarrow$ projection of $z_t$ onto the constraint set   (project back onto the constraints)
terminate when no progress can be made, i.e., when $x_t \approx x_{t-1}$
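A tiny numpy instantiation of this projected scheme on the earlier toy problem (minimize $x^2$ subject to $x \geq 5$), where projecting onto the constraint set is just clipping; the step size, starting point, and tolerance are arbitrary choices for illustration:

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, lr=0.1, iters=200, tol=1e-8):
    """Projected gradient descent: gradient step, then project back onto the constraints."""
    x = x0
    for _ in range(iters):
        z = x - lr * grad(x)          # step in the gradient direction
        x_new = project(z)            # project back onto the feasible set
        if abs(x_new - x) < tol:      # terminate when no progress can be made
            return x_new
        x = x_new
    return x

# Toy problem: minimize x^2 subject to x >= 5.
grad = lambda x: 2.0 * x
project = lambda z: max(z, 5.0)       # projection onto {x : x >= 5}
x_star = projected_gradient_descent(grad, project, x0=20.0)
print(x_star)   # converges to the constrained minimum x = 5
```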
Back to Constrained Opt.: Duality Theorems
Theorem (weak Lagrangian duality):
$d^* \leq p^*$
Optimization problem:
Minimize: $f(x)$
Such that: $g_i(x) \leq 0$ (for all $i$)
Lagrange function: $L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$
Primal: $p^* = \min_x \max_{\lambda \geq 0} L(x, \lambda)$   Dual: $d^* = \max_{\lambda \geq 0} \min_x L(x, \lambda)$
Theorem (strong Lagrangian duality):
If $f$ is convex, and $g_i(x^*) < 0$ for some feasible point $x^*$, or $g$ is affine, then $d^* = p^*$.
Ok, Back to SVMs
Observations:
• the objective function is convex
• the constraints are affine, inducing a polytope constraint set.
So, SVM is a convex optimization problem (in fact a quadratic program)
Moreover, strong duality holds.
Let’s examine the dual… the Lagrangian is:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \big(1 - y_i (w \cdot x_i + b)\big)$
SVM standard (primal) form:
Minimize (over $w, b$): $\frac{1}{2}\|w\|^2$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
SVM Dual
Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \big(1 - y_i (w \cdot x_i + b)\big)$
Primal: $\min_{w,b} \max_{\alpha \geq 0} L(w, b, \alpha)$   Dual: $\max_{\alpha \geq 0} \min_{w,b} L(w, b, \alpha)$
SVM standard (primal) form:
Minimize (over $w, b$): $\frac{1}{2}\|w\|^2$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
The inner minimization over $(w, b)$ is unconstrained, so let’s calculate it:
$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
• when $\alpha_i > 0$, the corresponding $x_i$ is a support vector
• $w$ is only a function of the support vectors!
SVM Dual (contd.)
Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \big(1 - y_i (w \cdot x_i + b)\big)$
Primal: $\min_{w,b} \max_{\alpha \geq 0} L(w, b, \alpha)$   Dual: $\max_{\alpha \geq 0} \min_{w,b} L(w, b, \alpha)$
SVM standard (primal) form:
Minimize (over $w, b$): $\frac{1}{2}\|w\|^2$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
The inner minimization is unconstrained, so let’s calculate:
$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
So, substituting back into $L$, the dual becomes:
Maximize (over $\alpha$): $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
subject to: $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$ (for all $i$)
SVM Optimization Interpretation
SVM standard (primal) form:
Minimize (over $w, b$): $\frac{1}{2}\|w\|^2$
Such that: $y_i (w \cdot x_i + b) \geq 1$ (for all $i$)
SVM standard (dual) form:
Maximize (over $\alpha_i$): $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Such that: $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$ (for all $i$)
Primal interpretation: maximize the margin $\gamma = 2/\|w\|$
Kernelized version: replace the inner products $x_i \cdot x_j$ with kernel evaluations $k(x_i, x_j)$
The solution is only a function of the “support vectors” (the $x_i$ with $\alpha_i > 0$)
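To make the last two annotations concrete: once dual variables $\alpha$ (and an offset $b$) have been obtained from a QP solver, prediction uses only kernel evaluations against the support vectors. A small numpy sketch of this prediction rule; the RBF kernel choice and all numeric values below are hypothetical placeholders, not outputs of an actual solve:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    """Kernel SVM prediction: sign( sum_i alpha_i y_i k(x_i, x) + b ).
    Only points with alpha_i > 0 (the support vectors) contribute."""
    sv = alpha > 1e-8                      # mask of support vectors
    score = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alpha[sv], y_train[sv], X_train[sv])) + b
    return np.sign(score)

# Hypothetical values: alpha and b would come from solving the dual QP.
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [3.0, 3.0], [4.0, 4.0]])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
alpha   = np.array([0.7, 0.0, 0.7, 0.0])   # second and fourth points are not support vectors
b       = -0.1
print(svm_predict(np.array([3.5, 3.0]), X_train, y_train, alpha, b))
```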
What We Learned…
• Support Vector Machines
• Maximum Margin formulation
• Constrained Optimization
• Lagrange Duality Theory
• Convex Optimization
• SVM dual and Interpretation
• How to get the optimal solution
Questions?
Next time…
Parametric and non-parametric Regression