Support Vector Machines
CS542 Machine Learning
slides based on lecture by R. Urtasun
http://www.cs.toronto.edu/~urtasun/courses/CSC2515/CSC2515_Winter15.html
Support Vector Machine (SVM)
• A maximum-margin method; can be used for classification or regression
• SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
• First, we will derive linear, hard-margin SVM for linearly separable data, later for non-separable (soft margin SVM), and for nonlinear boundaries (kernel SVM)
Maximum Margin
Recall: logistic regression
Many possible lines!
Which classifier is best?
This one? Or this one? How about the one in the middle?
Intuitively, this classifier avoids misclassifying new test points generated from the same distribution as the training points
Max margin classification
Instead of fitting all the points, focus on boundary points
Aim: learn a boundary that leads to the largest margin (buffer) from points on both sides
Why: intuition; theoretical support: robust to small perturbations near the boundary
And works well in practice!
The subset of training vectors that support (determine) the boundary are called the support vectors (circled)
Max-Margin Classifier
“Expand” the decision boundary to include a margin, until we hit the first point on either side
Use a margin of 1: require w^T x + b ≥ +1 for positive examples and w^T x + b ≤ −1 for negative examples
Inputs inside the margin are of unknown class
Why can we fix the margin at 1? Rescaling w and b does not change the decision boundary, so 1 is just a convenient normalization
Aside: vector inner product
[Figure: projection of a vector x onto w; d is the length of the projection, d = (w^T x) / ||w||]
Computing the Margin
Take a point x− on the −1 plane and the closest point x+ on the +1 plane; the shortest path between the planes is along w, so x+ = x− + λw for some λ
From w^T x+ + b = +1 and w^T x− + b = −1 it follows that λ = 2 / (w^T w)
Define the margin M to be the distance between the +1 and −1 planes. We can now express this in terms of w:
M = ||x+ − x−|| = λ ||w|| = 2 / ||w||
So to maximize the margin we minimize the length of w
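As a quick numerical check of this formula, here is a minimal NumPy sketch (the helper names and the toy data are made up for illustration): it computes the margin width 2 / ||w|| and verifies it equals twice the smallest distance of any training point to the hyperplane.

import numpy as np

def margin_width(w):
    # distance between the planes w.x + b = +1 and w.x + b = -1
    return 2.0 / np.linalg.norm(w)

def geometric_margin(w, b, X, y):
    # smallest signed distance of any training point to the hyperplane w.x + b = 0
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# toy separable data; w, b satisfy y_i (w.x_i + b) >= 1 with equality for the closest points
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margin_width(w))                     # 2 / ||w|| = sqrt(2) ≈ 1.414
print(2 * geometric_margin(w, b, X, y))    # the same value: twice the distance of the closest point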
Maximizing the margin is equivalent to regularization
But it is not the same as regularized logistic regression: the SVM loss is different, and only the boundary points matter.
Linear SVM
Linear SVM Formulation
min over {w, b} of (1/2) ||w||^2   subject to   yi (w^T xi + b) ≥ 1 for all i
This is the primal formulation; it is a quadratic program and can be optimized directly, e.g. with a QP solver or (sub)gradient methods
Apply Lagrange multipliers: formulate an equivalent problem
Lagrange Multipliers
Convert the primal constrained minimization to an unconstrained optimization problem: represent the constraints as penalty terms:
min over {w, b} of  (1/2) ||w||^2 + penalty for violating the constraints
For data {(xi, yi)} use the following penalty term for example i:
max over αi ≥ 0 of  αi [1 − yi (w^T xi + b)]    (0 if the constraint is satisfied, +∞ if it is violated)
Rewrite the minimization problem:
min over {w, b} of max over {αi ≥ 0} of  J(w, b; α) = (1/2) ||w||^2 − Σi αi [yi (w^T xi + b) − 1]
where {αi} are the Lagrange multipliers
Note: we are now minimizing with respect to w and b, and maximizing with respect to α (additional parameters)
Solution to Linear SVM
Swap the ‘max’ and ‘min’:  max over {αi ≥ 0} of min over {w, b} of J(w, b; α)
First minimize J() w.r.t. {w, b} for any fixed setting of the Lagrange multipliers:
∂J/∂w = 0  ⇒  w = Σi αi yi xi        ∂J/∂b = 0  ⇒  Σi αi yi = 0
Then substitute back into J() and simplify to get the final optimization:
Dual Problem
max over α of  Σi αi − (1/2) Σi Σj αi αj yi yj (xi^T xj)
subject to  αi ≥ 0  and  Σi αi yi = 0
Note that only dot products of data points are needed
Prediction on Test Example
Now we have the solution for the weights and bias; a new example x is classified by the sign of the decision function:
y(x) = sign(w^T x + b) = sign( Σi αi yi (xi^T x) + b )
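To make this concrete, here is a minimal NumPy/SciPy sketch of solving the dual (the function name and tolerance are illustrative, and SciPy’s general-purpose SLSQP solver stands in for the specialized QP/SMO solvers real SVM packages use):

import numpy as np
from scipy.optimize import minimize

def fit_hard_margin_svm_dual(X, y):
    # X: (n, d) inputs, y: (n,) labels in {-1, +1}
    n = X.shape[0]
    K = X @ X.T                                   # Gram matrix of dot products xi^T xj

    def neg_dual(alpha):                          # minimize the negative of the dual objective
        v = alpha * y
        return 0.5 * v @ K @ v - alpha.sum()

    res = minimize(neg_dual, np.zeros(n), method='SLSQP',
                   bounds=[(0.0, None)] * n,                              # alpha_i >= 0
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
    alpha = res.x
    w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                             # support vectors have nonzero alpha_i
    b = np.mean(y[sv] - X[sv] @ w)                # from y_i (w^T x_i + b) = 1 at the support vectors
    return w, b, alpha

# usage on a tiny separable dataset
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = fit_hard_margin_svm_dual(X, y)
print(np.sign(X @ w + b))                         # predictions on the training points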
Dual vs Primal SVM
Primal: min over {w, b} of (1/2) ||w||^2  s.t.  yi (w^T xi + b) ≥ 1 for all i   (d + 1 variables)
Dual: max over α of Σi αi − (1/2) Σi Σj αi αj yi yj (xi^T xj)  s.t.  αi ≥ 0, Σi αi yi = 0   (n variables)
Dual vs Primal SVM
• Dual: a quadratic programming problem in which we optimize a quadratic function of α subject to a set of inequality constraints
• The solution to a quadratic programming problem in d variables in general has computational complexity O(d³)
• For a fixed set of basis functions whose number d is smaller than the number n of data points, the move to the dual problem appears disadvantageous
• However, it allows the model to be reformulated using kernels, which allow infinite-dimensional feature spaces (more on this later)
Bishop Ch. 7
Dual vs Primal SVM
• Most of the SVM literature and software solves the Lagrange dual problem formulation
• Why prefer solving the dual problem over the primal?
– provides a way to deal with the constraints
– expresses the solution in terms of dot products of data points, allowing kernels
– historical reasons
For an in-depth discussion refer to http://olivier.chapelle.cc/pub/neco07.pdf (optional reading)
Support Vectors
Only a small subset of αi’s will be nonzero, and the corresponding xi’s are the support vectors S
Summary of Linear SVM
• Binary, linearly separable classification (regression is possible too)
• Linear classifier with maximal margin
• Train the SVM by maximizing  Σi αi − (1/2) Σi Σj αi αj yi yj (xi^T xj)
• Subject to  αi ≥ 0  and  Σi αi yi = 0
• Weights:  w = Σ_{i ∈ S} αi yi xi
• Only a small subset of the αi’s will be nonzero, and the corresponding xi’s are the support vectors S
• Prediction on a new example x:  y(x) = sign( Σ_{i ∈ S} αi yi (xi^T x) + b )
Soft-margin SVM
What if the data is not linearly separable?
• Introduce slack variables ξi and minimize  (1/2) ||w||^2 + C Σi ξi
subject to the constraints (for all i):  yi (w^T xi + b) ≥ 1 − ξi,  ξi ≥ 0
• An example lies on the wrong side of the hyperplane exactly when ξi > 1, so Σi ξi is an upper bound on the number of training errors
• C trades off training error versus model complexity
• This is known as the soft-margin extension
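A minimal NumPy sketch of this objective (the function name, learning rate, and epoch count are arbitrary choices; plain subgradient descent on the primal stands in for the QP/SMO solvers used in practice):

import numpy as np

def soft_margin_svm_subgradient(X, y, C=1.0, lr=1e-3, epochs=500):
    # minimize 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)) by subgradient descent
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                    # points inside the margin or misclassified
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b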
Multi-class SVMs
Multi-class classification
Kernel SVM
Non-linear decision boundaries
• Note that both the learning objective and the decision function depend only on dot products between patterns
• How to form non-linear decision boundaries in input space?
• Basic idea:
1. Map the data into a feature space x → φ(x)
2. Replace dot products between inputs with dot products between feature vectors
3. Find a linear decision boundary in the feature space
• Problem: what is a good feature function φ(x)?
Kernel Trick
• Kernel trick: dot products in feature space can be computed as a kernel function, k(xi, xj) = φ(xi)^T φ(xj)
• Idea: work directly on x, avoid having to compute φ(x)
• Example: for 2-D inputs, k(x, z) = (x^T z)^2 equals φ(x)^T φ(z) with φ(x) = (x1^2, √2 x1 x2, x2^2)
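A quick numerical check of that example (a throwaway NumPy snippet, not part of any SVM library): the kernel value computed on the inputs and the explicit feature-space dot product agree.

import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel (x^T z)^2 in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)      # kernel computed directly on the inputs: (3 - 2)^2 = 1
print(phi(x) @ phi(z))   # the same value via the explicit feature map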
Input transformation
Mapping to a feature space can produce problems:
• High computational burden due to high dimensionality
• Many more parameters
SVM solves these two issues simultaneously
• Kernel trick produces efficient classification
• Dual formulation only assigns parameters to samples, not features
Kernels
Kernels as Similarity Functions
Given x, compute new features f1, f2, f3, ... depending on proximity to “landmark” points l(1), l(2), l(3), ...
Example: Gaussian (RBF) kernel:  fi = similarity(x, l(i)) = exp( −||x − l(i)||^2 / (2σ^2) )
If x is close to a landmark, the similarity is high (≈ 1); if x is far from it, the similarity is low (≈ 0)
Predict label “1” when the linear classifier on the new features f is positive
[Figure: a 2-D example in the (x1, x2) plane with three landmarks and feature vector f = (0.6, 2.3, 0.8)]
Landmarks for Gaussian kernel
Given x, compute  fi = exp( −||x − l(i)||^2 / (2σ^2) )  for each landmark l(i)
Predict “1” if the linear classifier on the resulting features f is positive
So the new features f measure how close the example is to each “landmark” point
Where do the landmarks come from?
Landmarks for Gaussian kernel
Where do the landmarks come from? They are the training points!
[Figure: blue is class 1 and red is class 2; the training points become the landmarks]
So the “landmarks” are points we can use to compute a new feature representation for a point x, by representing it as the similarity to each landmark point (measured using a Gaussian centered at the landmark)
In SVMs with RBF (Gaussian) kernels, we place a Gaussian centered at each training point to compute the nonlinear features. This is equivalent to using all of the training data as landmarks.
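A minimal NumPy sketch of that feature construction (the function name and σ are arbitrary; it simply applies the Gaussian-similarity formula above with the training set as landmarks):

import numpy as np

def rbf_features(landmarks, x, sigma=1.0):
    # f_i = exp(-||x - l(i)||^2 / (2 sigma^2)) for each landmark l(i)
    sq_dists = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# the training points themselves serve as the landmarks
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
x_new = np.array([0.5, 0.5])
print(rbf_features(X_train, x_new))   # one similarity feature per training point, each in (0, 1]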
SVM with Kernels
Given training examples (x1, y1), ..., (xn, yn), choose the landmarks to be the training points: l(i) = xi
Given an example x, compute the feature vector f = (f1, ..., fn) with  fi = exp( −||x − xi||^2 / (2σ^2) )
Kernels
Examples of kernels (kernels measure similarity):
1. Polynomial:  k(x, z) = (x^T z + 1)^d
2. Gaussian:  k(x, z) = exp( −||x − z||^2 / (2σ^2) )
3. Sigmoid:  k(x, z) = tanh( a (x^T z) + r )
Each kernel computation corresponds to a dot product calculation for a particular mapping φ(x): it implicitly maps to a high-dimensional space
Why is this useful?
1. Training examples can be rewritten using more complex features
2. A dataset that is not linearly separable in the original space may be linearly separable in a higher-dimensional space
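Minimal NumPy implementations of these three kernels (the parameter names d, sigma, a, r follow the formulas above; the default values are placeholders):

import numpy as np

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, r=0.0):
    return np.tanh(a * (x @ z) + r)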
Classification with non-linear SVMs
Non-linear SVM using a kernel function K():
L_K(α) = Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)
Maximize L_K w.r.t. {α}, under the constraints αi ≥ 0 and Σi αi yi = 0
Unlike the linear SVM, we cannot write out an explicit weight vector w in input space – we must now retain the support vectors to classify new examples
Final decision function:  y(x) = sign( Σ_{i ∈ S} αi yi K(xi, x) + b )
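A sketch of that decision function (the helper names are hypothetical; alpha and b are assumed to come from maximizing L_K above):

import numpy as np

def kernel_svm_predict(X_sv, y_sv, alpha_sv, b, kernel, x_new):
    # sign( sum_{i in S} alpha_i y_i K(x_i, x_new) + b ), summed over the support vectors only
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(score + b)

# usage with a Gaussian kernel (sigma = 1); alpha_sv and b would come from the kernelized dual
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
X_sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y_sv = np.array([-1.0, 1.0])
alpha_sv = np.array([1.0, 1.0])
print(kernel_svm_predict(X_sv, y_sv, alpha_sv, 0.0, rbf, np.array([0.9, 0.8])))   # +1: closer to the positive support vector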
Decision boundary in Gaussian kernel SVM
[Figure: decision boundaries for different values of gamma = 1/σ^2; source: towardsdatascience.com]
SVM hyper-parameters
C (the penalty on the slack variables ξi). Large C: lower bias, higher variance. Small C: higher bias, lower variance.
Kernel SVM has additional hyperparameters, e.g. the width σ^2 of the Gaussian kernel:
Large σ^2: features vary more smoothly. Higher bias, lower variance.
Small σ^2: features vary less smoothly. Lower bias, higher variance.
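For reference, here is how these two hyperparameters appear in scikit-learn’s SVC, one common implementation (shown only as an illustration; sklearn parameterizes the Gaussian kernel by gamma, which plays the role of 1/(2σ^2)):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])   # tiny XOR-style dataset
y = np.array([0, 0, 1, 1])                                       # not linearly separable

clf = SVC(kernel='rbf', C=1.0, gamma=2.0)   # C: slack penalty, gamma: inverse kernel width
clf.fit(X, y)
print(clf.predict(X))                       # the RBF kernel separates the XOR pattern
print(clf.support_vectors_)                 # the training points that determine the boundary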
Kernel SVM Summary
Advantages:
• Kernels allow very flexible hypotheses
• Poly-time exact optimization methods rather than approximate methods
• Soft-margin extension permits mis-classified examples
• Variable-sized hypothesis space
• Excellent results (1.1% error rate on handwritten digits vs. LeNet’s 0.9%)
Disadvantages:
• Must choose kernel hyperparameters
• Very large problems computationally intractable
• Batch algorithm
Kernel Functions
Mercer’s Theorem (1909): any reasonable kernel corresponds to some feature space
Reasonable means that the Gram matrix it produces is positive semi-definite
The feature space can be very large, e.g. the polynomial kernel (1 + xi^T xj)^d corresponds to a feature space exponential in d
Linear separators in these super high-dim spaces correspond to highly nonlinear decision boundaries in input space
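As a small sanity check of Mercer’s condition (an illustrative snippet, not part of any library): build the Gram matrix of the polynomial kernel on random data and confirm its eigenvalues are non-negative up to round-off.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # 20 random 3-D points

K = (1.0 + X @ X.T) ** 3                 # Gram matrix of the polynomial kernel (1 + xi^T xj)^3
eigvals = np.linalg.eigvalsh(K)          # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-8)            # True: the Gram matrix is positive semi-definite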
Kernelizing
A popular way to make an algorithm more powerful is to develop a kernelized version of it
• We can rewrite many algorithms so that they are defined only in terms of inner products
• For example: k-nearest neighbors, since distances can be expressed using only dot products (see the sketch below)
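A minimal sketch of kernelizing k-NN (helper names are hypothetical): the squared distance in feature space expands into kernel evaluations, ||φ(x) − φ(z)||^2 = K(x, x) − 2 K(x, z) + K(z, z), so φ never has to be computed explicitly.

import numpy as np

def kernel_sq_distance(x, z, K):
    # squared distance between phi(x) and phi(z), computed only through the kernel K
    return K(x, x) - 2.0 * K(x, z) + K(z, z)

def kernel_knn_predict(X_train, y_train, x_new, K, k=3):
    # feature-space distances to every training point, then a majority vote among the k nearest
    d = np.array([kernel_sq_distance(x_new, xi, K) for xi in X_train])
    nearest = np.argsort(d)[:k]
    return np.sign(y_train[nearest].sum())        # labels assumed to be in {-1, +1}

# usage with a Gaussian kernel (an infinite-dimensional feature space)
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.2]])
y_train = np.array([-1, -1, 1, 1])
print(kernel_knn_predict(X_train, y_train, np.array([0.1, 0.0]), rbf, k=3))   # -1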
Techniques for constructing valid kernels
Summary of SVMs
Summary
Software:
• A list of SVM implementations can be found at
http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations
• Several Matlab toolboxes for SVM are also available
Key points:
• Difference between logistic regression and SVMs
• Maximum margin principle
• Target function for SVMs
• Slack variables for mis-classified points
• Kernel trick allows non-linear generalizations
Logistic regression vs. SVMs
Rough guidance (n = number of features, m = number of training examples):
• If n is large relative to m: use logistic regression or an SVM with a linear kernel
• If n is small and m is intermediate: use an SVM with a Gaussian kernel
• If n is small and m is very large: create/add more features, then use logistic regression or a linear-kernel SVM
A neural network is likely to work well for most of these settings, but may be slower to train.
History of SVMs
• The original SVM algorithm was invented by Vladimir Vapnik and Alexey Chervonenkis in 1963.
• In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.[13]
• The soft margin was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.[1]
• SVMs were very popular in the 90’s-00’s until neural networks took over circa 2012
https://en.wikipedia.org/wiki/Support-vector_machine