ISML_3: Linear Classifier and (Linear) Support
Vector Machine
Lingqiao Liu
Outline
• Linear Classifier
• Support Vector Machine: Primal form
– Separable case
– Non-separable case
• Support Vector Machine: Dual form
– KKT condition
– Dual form of separable linear SVM
– Dual form of non-separable linear SVM
• Solving optimization problem with CVX
• Summary
Requirements
• Basic concepts of linear classifier
• Basic ideas and concepts in SVMs
• Hard-margin SVMs:
– formulation (but not derivation)
– How to calculate margin and find support vectors
• Soft-margin SVMs:
– Motivation
– Formulation
– Hinge loss formulation
• SVM dual
– How to derive
– Formulation
– Relationship to primal problem solutions
Outline
• Linear Classifier
• Support Vector Machine: Primal form
– Separable case
– Non-separable case
• Support Vector Machine: Dual form
– KKT condition
– Dual form of separable linear SVM
– Dual form of non-separable linear SVM
• Solving optimization problem with CVX
• Summary
Linear classifier
• Suppose each data sample can be represented as a vector (called a feature vector)
• In a linear classifier, classification is done by computing a linear combination of the feature values, f(x) = w^T x + b
– w and b are the parameters of the classifier
– For binary classification, we only need one set of parameters (w, b): if the decision value w^T x + b is larger than 0, the prediction is class 1, otherwise class -1
– For multi-class classification, we need to learn c sets of parameters (w_k, b_k), one for each class. The class that gives the highest decision value is the predicted class (see the sketch below)
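As a quick illustration, a minimal NumPy sketch of these decision rules (the weights and inputs are made-up values, not from the lecture):

import numpy as np

def binary_predict(w, b, x):
    # Binary linear classifier: class +1 if w^T x + b > 0, otherwise class -1
    return 1 if w @ x + b > 0 else -1

def multiclass_predict(W, b, x):
    # One-vs-rest multiclass: pick the class with the highest decision value
    scores = W @ x + b          # one score per class
    return int(np.argmax(scores))

w = np.array([0.5, -1.0]); b0 = 0.2
print(binary_predict(w, b0, np.array([1.0, 0.3])))      # +1

W = np.array([[0.5, -1.0], [1.2, 0.1], [-0.3, 0.8]])    # 3 classes, 2 features
b = np.array([0.2, -0.1, 0.0])
print(multiclass_predict(W, b, np.array([1.0, 0.3])))   # class index 1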
Geometrically
The linear decision boundary forms a hyperplane in the high-dimensional feature space
Multiclass classification
• Multiclass classification can
be converted into c binary
classification problems by
using the one-vs-rest scheme
– For the k-th problem, we use the k-th class as the positive class and all the other classes as the negative class
– We focus on binary
classification today
Outline
• Linear Classifier
• Support Vector Machine: Primal form
– Separable case
– Non-separable case
• Support Vector Machine: Dual form
– KKT condition
– Dual form of separable linear SVM
– Dual form of non-separable linear SVM
• Solving optimization problem with CVX
• Summary
Motivation
• To separate two classes, which hyperplane should we choose?
What makes a good separating hyperplane
• Training data is just a sampled subset of all possible data
• Suppose the hyperplane is close to one sample i
• If we see new samples close to sample i, they are likely to be wrongly placed on the other side of the hyperplane
– Assumption: similar samples have similar labels
• This leads to poor generalization
What makes a good separating hyperplane
• Hyperplane should be as far as possible from every
sample
• In this way, new samples close to old samples will be
classified correctly
• Good generalization
Support Vector Machine
• Idea: maximize the distance to the closest example =
margin
• This leads to an optimization problem
Properties of the optimal hyperplane
• Distance to the closest negative example = distance to
the closest positive example
– Why?
• Examples closest to the hyperplane are support vectors.
– Removing non-support vectors will not affect the optimal hyperplane
Optimization problem for finding the optimal hyperplane
• The primal optimization problem for the linear support vector machine (hard-margin SVM):
  min_{w,b} (1/2)||w||^2   s.t.  y_i(w^T x_i + b) ≥ 1,  i = 1, ..., n
• Decision rule: predict sign(w^T x + b)
Proof: (1) Related Math Conclusions
• Cosine of the angle between two vectors: cos θ = a^T b / (||a|| ||b||)
• Normal vector of a hyperplane: the hyperplane w^T x + b = 0 has normal vector w
Proof: (1) Related Math Conclusions
• Distance from a point to a hyperplane: d(x_0) = |w^T x_0 + b| / ||w|| (a quick numerical check is sketched below)
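A quick numerical check of the distance formula (w, b and x_0 are made-up values):

import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([2.0, 1.0])

dist = abs(w @ x0 + b) / np.linalg.norm(w)   # |w^T x0 + b| / ||w||
print(dist)                                  # (6 + 4 - 5) / 5 = 1.0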
Proof: (2) Original optimization problem
SVM objective
• (1) maximize the distance to the closest example
• (2) positive class and negative class on each side of the hyperplane
This gives
  max_{w,b}  min_i |w^T x_i + b| / ||w||    ← the minimal distance between the hyperplane and the samples
  s.t.  y_i(w^T x_i + b) > 0 for all i      ← correct classification
Proof: (2) transformed optimization problem
Rescaling (w, b) by a positive constant does not change the hyperplane, so we can fix min_i y_i(w^T x_i + b) = 1; maximizing the margin then becomes maximizing 1/||w||, i.e., minimizing (1/2)||w||^2, subject to y_i(w^T x_i + b) ≥ 1.
Proof: (2) transformed optimization problem (continued)
Take a closer look at the SVM problem
• Margin: the distance from the hyperplane to the closest samples is 1/||w||, so the width between the two margin hyperplanes w^T x + b = ±1 is 2/||w||
• All the support vectors lie on the margin hyperplanes, i.e., satisfy y_i(w^T x_i + b) = 1 (see the sketch below)
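A minimal sketch of computing the margin and locating the support vectors, assuming made-up trained parameters and toy data:

import numpy as np

w = np.array([1.0, 1.0]); b = -3.0
X = np.array([[1.0, 1.0], [0.5, 1.0], [1.0, 0.5],    # class -1
              [2.0, 2.0], [3.0, 2.0], [2.0, 3.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

margin_width = 2.0 / np.linalg.norm(w)     # distance between the two margin hyperplanes
functional_margin = y * (X @ w + b)        # y_i (w^T x_i + b)
support_vectors = X[np.isclose(functional_margin, 1.0)]
print(margin_width)                        # 2 / sqrt(2) ≈ 1.414
print(support_vectors)                     # [[1. 1.] [2. 2.]]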
Non-separable case
What if the training set is not linearly separable?
In that case, not all
inequality constraints can
be satisfied
• Slack variables ξ_i ≥ 0 indicate how much a sample violates the inequality constraint: y_i(w^T x_i + b) ≥ 1 − ξ_i
• We also want to minimize the total violation, which gives the soft-margin SVM
  min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i   s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• C is a hyper-parameter that controls how heavily violations are penalized (larger C allows less violation).
  – Question: how to choose C?
Slack variables and soft-margin SVM
Hinge loss view of soft-margin SVM
• The soft-margin SVM can also be viewed from the perspective of the hinge loss:
  min_{w,b} (1/2)||w||^2 + C Σ_i max(0, 1 − y_i(w^T x_i + b))
Hinge loss view of soft-margin SVM
• If y_i(w^T x_i + b) ≥ 1, which means the sample has been correctly classified with sufficient confidence (the decision value is not just > 0, but ≥ 1), then the loss incurred for this sample is 0
• This means that if the “confidence” is large enough, we do not pursue an even larger “confidence”.
• Instead, we focus on the samples that have not been confidently classified (see the sketch below).
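A minimal NumPy sketch of the hinge loss (the parameters and data are made up):

import numpy as np

def hinge_loss(w, b, X, y):
    # Average of max(0, 1 - y_i (w^T x_i + b)) over the samples
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

w = np.array([1.0, 1.0]); b = -3.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [1.5, 1.5]])
y = np.array([-1, 1, 1])
print(hinge_loss(w, b, X, y))   # the first two samples incur zero loss, the third incurs 1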
Outline
• Linear Classifier
• Support Vector Machine: Primal form
– Separable case
– Non-separable case
• Support Vector Machine: Dual form
– KKT condition
– Dual form of separable linear SVM
– Dual form of non-separable linear SVM
• Solving optimization problem with CVX
• Summary
SVM dual problem
• Sometimes it is beneficial to convert the SVM optimization problem into another equivalent optimization problem, called the dual formulation (or dual problem) of the SVM
• By applying certain transformations, we can derive a dual problem for any optimization problem.
  – An important type of dual problem is the Lagrangian dual problem
  – If the original problem is convex, then there is an equivalence between the optimal solutions of the dual problem and the original problem. In other words, a convex optimization problem can be solved either in its original form (called the primal form) or in its dual form
Convex optimization
Convex function: f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y) for all x, y and all λ ∈ [0, 1]
Convex set: a set C such that for any x, y ∈ C and λ ∈ [0, 1], the point λx + (1−λ)y is also in C
An optimization problem min_x f(x) s.t. x ∈ C is a convex optimization problem if f is a convex function and the feasible set C is a convex set.
If f is a convex quadratic function and the constraint functions g are linear, the problem is convex.
Lagrangian dual problem
• Original problem (primal problem):  min_x f(x)  s.t.  g_i(x) ≤ 0,  i = 1, ..., m
• Define the Lagrangian function:  L(x, λ) = f(x) + Σ_i λ_i g_i(x),  with λ_i ≥ 0
• Lagrangian dual problem:  max_{λ ≥ 0}  min_x L(x, λ)
• Assume f and g_i are convex functions, e.g., linear or quadratic functions
Relationship between dual problem and
primal problem
• Notation: use x* and λ* to denote the optimal solutions of the primal problem and the dual problem, respectively
• Note that the inner problem min_x L(x, λ) is an unconstrained optimization problem, so at the optimum we should have ∇_x L(x*, λ*) = 0
• Complementary slackness: λ_i* g_i(x*) = 0 for every constraint i
• More information (if you want to know “why”) can be found in this tutorial:
  lagrangian_duality.pdf (stanford.edu)
https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf
Summary: Karush–Kuhn–Tucker (KKT)
conditions
1. Stationarity:  ∇_x L(x*, λ*) = ∇f(x*) + Σ_i λ_i* ∇g_i(x*) = 0
2. Primal feasibility:  g_i(x*) ≤ 0
3. Dual feasibility:  λ_i* ≥ 0
4. Complementary slackness:  λ_i* g_i(x*) = 0
The relationship between the solutions of the primal and dual problems is specified by the KKT conditions. The dual problem can be derived by applying the KKT conditions and making some transformations.
Derive dual problem for soft-margin SVMs
1. Stationarity
Step 1: take the derivatives of the Lagrangian function with respect to the primal variables w, b, ξ and set them to zero (a sketch of this step follows below)
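A sketch of this step in standard notation, where α_i and μ_i are the dual variables for the constraints y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0:

\begin{aligned}
L(w, b, \xi, \alpha, \mu)
  &= \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
     - \sum_i \alpha_i\big(y_i(w^\top x_i + b) - 1 + \xi_i\big)
     - \sum_i \mu_i \xi_i \\
\partial L / \partial w = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\partial L / \partial b = 0 &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
\partial L / \partial \xi_i = 0 &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
\end{aligned}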
Derive dual problem for soft-margin SVMs
Step 2: express the primal variables in terms of the dual variables and substitute them into the Lagrangian function
Derive dual problem for soft-margin SVMs
Step 3: check the constraints implied by dual feasibility: the dual variables must be non-negative, i.e., α_i ≥ 0 and μ_i ≥ 0, which together with C − α_i − μ_i = 0 gives 0 ≤ α_i ≤ C. (The resulting dual problem is sketched below.)
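Putting steps 2 and 3 together, a sketch of the resulting dual problem in standard notation:

\begin{aligned}
\max_{\alpha}\;& \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j \\
\text{s.t. }\;& 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0
\end{aligned}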
SVM dual problem
• Dual problem:  max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j   s.t.  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
• Primal problem:  min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i   s.t.  y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
Convert dual solutions to primal solutions
• From the stationarity condition, w = Σ_i α_i y_i x_i
• How about b?
  – By using complementary slackness
Convert dual solutions to primal solutions
• A closer look at the complementary slackness conditions α_i (y_i(w^T x_i + b) − 1 + ξ_i) = 0 and μ_i ξ_i = 0:
  – If α_i = 0, then μ_i = C > 0, so ξ_i = 0 and y_i(w^T x_i + b) ≥ 1
  – If 0 < α_i < C, then μ_i > 0, so ξ_i = 0 and y_i(w^T x_i + b) = 1
  – How about α_i = C? Then μ_i = 0, so ξ_i ≥ 0 and y_i(w^T x_i + b) ≤ 1
Points satisfying the last two conditions are called support vectors
How to solve b
• Find any sample i with 0 < α_i < C (so that y_i(w^T x_i + b) = 1), and solve for b by b = y_i − w^T x_i (see the sketch below)
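A minimal sketch, assuming the dual variables alpha have already been obtained by some solver, of recovering the primal solution (w, b):

import numpy as np

def primal_from_dual(alpha, X, y, C, tol=1e-6):
    # w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # pick a support vector with 0 < alpha_i < C, where y_i (w^T x_i + b) = 1
    idx = np.where((alpha > tol) & (alpha < C - tol))[0][0]
    b = y[idx] - w @ X[idx]
    return w, b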
SVM dual problem
• Primal problem
– Number of variables is proportional to the number of feature
dimensions
– Number of constraints is proportional to the number of samples
• Dual problem
– Number of variables is proportional to the number of samples
• Solving SVM from dual or primal?
– Depends on which one results in a smaller optimization problem, i.e., fewer variables to optimize
– Efficient solvers have also been developed by exploiting the relationship between the dual problem and the primal problem
Outline
• Linear Classifier
• Support Vector Machine: Primal form
– Separable case
– Non-separable case
• Support Vector Machine: Dual form
– KKT condition
– Dual form of separable linear SVM
– Dual form of non-separable linear SVM
• Solving optimization problem with CVX
• Summary
Optimization solver
• Optimization problems are usually solved with existing solvers
• For the optimization problem in SVMs
– Using standard convex optimization solver
• CVX Linear program — CVXPY 1.1.13 documentation
– Using efficient solver specifically developed for SVMs
• LIBSVM — A Library for Support Vector Machines (ntu.edu.tw)
• LIBLINEAR — A Library for Large Linear Classification
(ntu.edu.tw)
https://www.cvxpy.org/examples/basic/linear_program.html
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
https://www.csie.ntu.edu.tw/~cjlin/liblinear/
How to use CVX for solving optimization
problem
• Scientific programming in Python
– Commonly used packages: Numpy, Scipy
– NumPy Tutorial (tutorialspoint.com)
– Data are represented as ndarrays, which support various matrix operations
• Using CVX
– CVX provides a more user-friendly interface to define your
optimization problem
– Without CVX, you would need to convert your problem into a canonical form yourself, e.g., a quadratic programming problem
https://www.tutorialspoint.com/numpy/index.htm
Example
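A minimal sketch of solving the soft-margin SVM primal with CVXPY (the toy data and C = 1 are assumptions made for illustration):

import numpy as np
import cvxpy as cp

# Toy 2-class data: rows of X are samples, labels y are in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
n, d = X.shape
C = 1.0

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n)

# Soft-margin primal: minimize 1/2 ||w||^2 + C * sum(xi)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)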
Example
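Similarly, a sketch of solving the dual form with CVXPY, reusing X, y, n and C from the previous example (the small diagonal term added to Q is an assumption, used only to keep the quadratic form numerically positive semidefinite):

import numpy as np
import cvxpy as cp

G = y[:, None] * X                       # rows are y_i * x_i
Q = G @ G.T + 1e-9 * np.eye(n)           # Q_ij = y_i y_j x_i^T x_j (plus a tiny ridge)

alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = (alpha.value * y) @ X                # recover w = sum_i alpha_i y_i x_i
print(w)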
Important skills
• Converting summation to matrix forms
• For example, Σ_i α_i = 1^T α
• And Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j = α^T Q α, where Q_ij = y_i y_j x_i^T x_j
Important skills
• Linear constraints such as Σ_i α_i y_i = 0 become y^T α = 0
• Box constraints 0 ≤ α_i ≤ C become elementwise vector inequalities (a quick numerical check follows below)
• Please refer to the documents and examples of CVX
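A quick NumPy check of the summation-to-matrix identities above (random values, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
alpha = rng.random(n)

Q = (y[:, None] * X) @ (y[:, None] * X).T    # Q_ij = y_i y_j x_i^T x_j

# sum_i alpha_i == 1^T alpha
print(np.isclose(alpha.sum(), np.ones(n) @ alpha))
# sum_i sum_j alpha_i alpha_j y_i y_j x_i^T x_j == alpha^T Q alpha
double_sum = sum(alpha[i] * alpha[j] * y[i] * y[j] * (X[i] @ X[j])
                 for i in range(n) for j in range(n))
print(np.isclose(double_sum, alpha @ Q @ alpha))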
Summary
• Concept of linear classifier
• Basic idea of linear SVMs: margin maximization
• Primal problem (hard-margin)
• Related concepts
– Support vectors
– Margin
Summary
• Primal problem (soft-margin)
• Dual problem: concepts and derivation
Summary
• Dual problem
– Relationship with primal problem
• Support vector
• Calculating w and b
• Using optimization toolbox
– CVX
– Skills for converting summation into matrix operations