Announcements
• Reminder: ps4 self-grading form out, due Friday 10/30
• pset 5 out today 10/29, due 11/5 (1 week)
• Midterm grades will go up by Monday (don’t discuss it yet)
Support Vector Machines
CS542 Machine Learning
slides based on lecture by R. Urtasun
http://www.cs.toronto.edu/~urtasun/courses/CSC2515/CSC2515_Winter15.html
Support Vector Machine (SVM)
• A maximum-margin method that can be used for classification or regression
• SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
• First, we will derive the linear, hard-margin SVM for linearly separable data; later, the soft-margin SVM for non-separable data and the kernel SVM for nonlinear boundaries (a quick illustration of both the linear and kernel versions follows below)
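As an illustration (a minimal sketch, assuming scikit-learn is available; the toy dataset and hyperparameters are arbitrary choices for illustration, not part of the lecture):

# Minimal sketch: linear vs. kernel SVM on a toy problem
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D input space
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)          # max-margin linear classifier
kernel_svm = SVC(kernel="rbf", gamma=2.0).fit(X_tr, y_tr)  # kernel trick: implicit feature map

print("linear SVM accuracy:", linear_svm.score(X_te, y_te))
print("RBF-kernel SVM accuracy:", kernel_svm.score(X_te, y_te))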
Maximum Margin
Recall: logistic regression
Many possible lines!
Which classifier is best?
this one?
Or this one?
How about the one in the middle?
Intuitively, this classifier is less likely to misclassify new test points drawn from the same distribution as the training points
Max margin classification
Instead of fitting all the points, focus on boundary points
Aim: learn a boundary that leads to the largest margin (buffer) from points on both sides
Why: it is intuitive, and there is theoretical support: the boundary is robust to small perturbations of points near it
And works well in practice!
The vectors that support (determine) the boundary are called the support vectors (circled in the figure)
Max-Margin Classifier
“Expand” the decision boundary to include a margin (until we hit the first point on either side)
Use a margin of 1
Inputs inside the margin are not assigned a class
Why is the margin = 1? Because rescaling (w, b) by a constant does not change the decision boundary, we can fix wᵀx + b = ±1 at the closest points on either side without loss of generality
[Figure: margin geometry, with distances d and a point x]
Aside: vector inner product
[Figure: projecting a vector x onto w]
d is the length of the projection of x onto w
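In symbols (a standard identity; θ is the angle between w and x):

\[
w^\top x = \|w\|\,\|x\|\cos\theta,
\qquad
d = \frac{w^\top x}{\|w\|}
\]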
Computing the Margin
[Figure: x⁺ on the +1 plane and x⁻ on the −1 plane, with x⁺ = x⁻ + λw]
Define the margin M to be the distance between the +1 and −1 planes
We can now express this in terms of w: M = 2/‖w‖ (derivation below) → to maximize the margin, we minimize the length of w
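The derivation the figure suggests, written out (standard; it uses x⁺ = x⁻ + λw and the planes wᵀx + b = ±1):

\[
w^\top x^+ + b = 1,\quad w^\top x^- + b = -1
\;\;\Rightarrow\;\;
w^\top (x^+ - x^-) = \lambda\,\|w\|^2 = 2
\;\;\Rightarrow\;\;
\lambda = \frac{2}{\|w\|^2}
\]
\[
M = \|x^+ - x^-\| = \lambda\,\|w\| = \frac{2}{\|w\|}
\]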
Maximizing the margin is equivalent to regularization
But it is not the same as regularized logistic regression: the SVM loss is different! It only cares about the boundary points.
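Concretely, the regularized logistic regression objective being contrasted here (with labels yᵢ ∈ {−1, +1} and regularization strength λ) penalizes every training point:

\[
\min_{w,b}\ \tfrac{\lambda}{2}\|w\|^2 + \sum_i \log\!\big(1 + e^{-y_i (w^\top x_i + b)}\big)
\]

The SVM objective on the next slide, by contrast, involves the boundary points only, through its margin constraints.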
Linear SVM
Linear SVM Formulation
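For reference, the standard hard-margin primal this slide refers to (n is the number of training points):

\[
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
\]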
This is the primal formulation; it can be optimized directly, e.g., via quadratic programming or (sub)gradient methods
Apply Lagrange multipliers: formulate equivalent problem
Lagrange Multipliers
Convert the primal constrained minimization to an unconstrained optimization problem: represent constraints as penalty terms:
For data {(xi, yi)}, use the penalty term max over αi ≥ 0 of αi [1 − yi(wᵀxi + b)]: it is 0 when the constraint is satisfied and ∞ when it is violated
Note: we are now minimizing with respect to w and b, and maximizing with respect to the αi (additional parameters)
Rewrite the minimization problem, where the {αi} are the Lagrange multipliers:
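In the standard form (J is the Lagrangian; each αi ≥ 0 multiplies one constraint yi(wᵀxi + b) ≥ 1):

\[
\min_{w,b}\ \max_{\alpha \ge 0}\ J(w, b; \alpha)
= \min_{w,b}\ \max_{\alpha \ge 0}\ \Big\{ \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big[\, y_i (w^\top x_i + b) - 1 \,\big] \Big\}
\]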
Solution to Linear SVM
Swap the ‘max’ and ‘min’:
First minimize J() w.r.t. {w,b} for any fixed setting of the Lagrange multipliers:
Then substitute back into J() and simplify to get final optimization:
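Minimizing J over {w, b} for fixed α gives the standard stationarity conditions that are substituted back:

\[
\frac{\partial J}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i} \alpha_i\, y_i\, x_i,
\qquad
\frac{\partial J}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_i\, y_i = 0
\]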
Dual Problem
subject to constraints on the multipliers αi (written out below)
Note that only dot products of the data points are needed
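Written out (the standard hard-margin dual obtained from the substitution above):

\[
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
- \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\quad\text{subject to}\quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\]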
Prediction on Test Example
Now we have the solution for the weights and bias
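In the standard form, using w = Σi αi yi xi and letting S denote the set of support vectors (αi > 0):

\[
\hat{y}(x) = \operatorname{sign}\!\big(w^\top x + b\big)
= \operatorname{sign}\!\Big(\sum_{i \in S} \alpha_i\, y_i\, x_i^\top x + b\Big)
\]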
Dual vs Primal SVM
[Side-by-side: the primal problem over (w, b), i.e. d+1 variables with inequality constraints, vs. the dual problem over α with its constraints]
Dual vs Primal SVM
• Dual: a quadratic programming problem in which we optimize a quadratic function of α subject to a set of inequality constraints
• The solution of a quadratic programming problem in d variables in general has computational complexity O(d³)
• For a fixed set of basis functions whose number d is smaller than the number n of data points, the move to the dual problem appears disadvantageous.
• However, it allows the model to be reformulated using kernels, which permit working in very high-dimensional (even infinite-dimensional) feature spaces (more on this later)
Bishop Ch. 7
Dual vs Primal SVM
• Most of the SVM literature and software solves the Lagrange dual problem formulation
• Why prefer solving the dual problem over the primal?
– provides a way to deal with constraints
– expresses the solution in terms of dot products of data points, allowing kernels
– historical reasons
For an in-depth discussion refer to http://olivier.chapelle.cc/pub/neco07.pdf (optional reading)
Support Vectors
Only a small subset of αi’s will be nonzero, and the corresponding xi’s are the support vectors S
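As an illustration (a minimal sketch, assuming scikit-learn is installed; its SVC solves this dual, and the toy data below is made up):

# Fit a (nearly) hard-margin linear SVM on toy data and inspect its support vectors
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)  # the xi with nonzero alpha_i
print("alpha_i * y_i:", clf.dual_coef_)            # signed dual coefficients
print("w:", clf.coef_, "b:", clf.intercept_)       # linear boundary parameters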
Summary of Linear SVM
• Binary, linearly separable classification (regression is possible too)
• Linear classifier with maximal margin
• Training: maximize Σi αi − ½ Σi Σj αi αj yi yj xiᵀxj over the αi
• Subject to αi ≥ 0 and Σi αi yi = 0
• Weights: w = Σi αi yi xi
• Only a small subset of αi’s will be nonzero, and the corresponding xi’s are the support vectors S
• Prediction on a new example x: y = sign(b + Σi∈S αi yi xiᵀx)
Next Class
Support Vector Machines II
non-separable data; slack variables; kernels; multiclass SVM
Reading: Bishop Ch 6.1-6.2, Ch 7.1.3