Machine Learning
Lecture 9: SVM and Kernels
Prof. Dr. Günnemann
Data Analytics and Machine Learning Group, Technical University of Munich
Winter term 2020/2021
1. Support Vector Machines (SVM)
2. Soft Margin Support Vector Machines
3. Kernels
Support Vector Machines (SVM)
Linear classifier
A linear classifier assigns all x with wT x + b > 0
to class blue and all x with wT x + b < 0
to class green.
Thus the class of x is given by
h(x) = sign(wT x + b), where
sign(z) = −1 if z < 0, 0 if z = 0, +1 if z > 0.
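As a small illustration (added here, not part of the original slides), this decision rule is a one-liner in NumPy; the weight vector and bias below are made-up placeholder values.

import numpy as np

def predict(X, w, b):
    # linear classifier: returns +1, -1 (or 0 on the boundary) for each row of X
    return np.sign(X @ w + b)

# toy example with made-up parameters
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[3.0, 1.0], [0.0, 1.0]])
print(predict(X, w, b))  # [ 1. -1.]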
Maximum margin classifier
• Intuitively, a wide margin around the dividing line makes it more likely that new samples will fall on the right side of the boundary.
• The actual, rigorous motivation comes from Statistical Learning Theory [1].
• Objective:
Find a hyperplane that separates both classes with the maximum margin.
[1] V. Vapnik, "Statistical Learning Theory", 1995
Linear classifier with margin
We add two more hyperplanes that are parallel to the original hyperplane and require that no training points lie between them.
Thus we now require
wT x + (b − s) > 0 for all x from class blue and
wT x + (b + s) < 0 for all x from class green.
[Figure: the three parallel hyperplanes wT x + (b − s) = 0, wT x + b = 0, wT x + (b + s) = 0]
Size of the margin
The signed distance from the origin to the hyperplane wT x + b = 0 is given by
d = −b / ||w||.
Thus we have
dblue = −(b − s) / ||w||,
dgreen = −(b + s) / ||w||,
and the margin is
m = dblue − dgreen = 2s / ||w||.
Redundancy of parameter s
The size of the margin,
m = 2s / ||w||,
depends only on the ratio of s and ||w||, so w.l.o.g. we can set s = 1 and get
m = 2 / ||w||.
Although the distance from the origin to the black (central) hyperplane,
d = −b / ||w||,
also depends on two parameters, we cannot set b = 1, as this would link the distance d to the size of the margin m.
[Figure: after setting s = 1, the hyperplanes wT x + (b − 1) = 0, wT x + b = 0, wT x + (b + 1) = 0]
Set of constraints
Let xi be the ith sample and yi ∈ {−1, 1} the class assigned to xi.
The constraints
wT xi + b ≥ +1 for yi = +1,
wT xi + b ≤ −1 for yi = −1
can be condensed into
yi(wT xi + b) ≥ 1 for all i.
If these constraints are fulfilled, the margin is
m = 2 / ||w|| = 2 / √(wT w).
SVM's Optimization problem
Let xi be the ith data point, i = 1,...,N, and yi ∈ {−1,1} the class assigned to xi.
To find the separating hyperplane with the maximum margin we need to find {w,b} that
minimize f0(w, b) = (1/2) wT w
subject to fi(w, b) = yi(wT xi + b) − 1 ≥ 0 for i = 1, ..., N.
This is a constrained convex optimization problem (more specifically, a quadratic programming problem).
We go from ||w|| to ||w||² = wT w, since the square root is a monotonic function that does not change the location of the optimum.
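As an added sketch (not from the slides), this primal QP can be handed directly to a generic convex solver; the cvxpy-based code below is a minimal, hedged illustration on made-up separable data.

import numpy as np
import cvxpy as cp

# toy, linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)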
Optimization with inequality constraints
Constrained optimization problem
Given f0 : Rd → R and fi : Rd → R,
minimize_θ f0(θ)
subject to fi(θ) ≤ 0 for i = 1, ..., M.
Feasibility
A point θ ∈ Rd is called feasible if and only if it satisfies the constraints of the optimization problem, i.e. fi(θ) ≤ 0 for all i ∈ {1, ..., M}.
Minimum and minimizer
We call the optimal value the minimum p∗ and the point where the minimum is attained the minimizer θ∗. Thus p∗ = f0(θ∗).
Lagrangian
minimizeθ f0(θ)
subject to fi(θ) ≤ 0 for i = 1,...,M
Definition (Lagrangian)
We define the Lagrangian L : Rd × RM → R associated with the above problem as
L(θ, α) = f0(θ) + Σ_{i=1}^M αi fi(θ).
We refer to αi ≥ 0 as the Lagrange multiplier associated with the inequality constraint fi(θ) ≤ 0.
Lagrange dual function
Definition (Lagrange dual function)
The Lagrange dual function g : RM → R maps α to the minimum of the Lagrangian over θ (possibly −∞ for some values of α),
g(α) = min_{θ ∈ Rd} L(θ, α) = min_{θ ∈ Rd} [ f0(θ) + Σ_{i=1}^M αi fi(θ) ].
It is concave in α since it is the point-wise minimum of a family of affine functions of α.
Interpretation of the Lagrangian
minimizeθ f0(θ)
subject to fi(θ) ≤ 0, i = 1,...,M
L(θ, α) = f0(θ) + Σ_{i=1}^M αi fi(θ)
For every choice of α ≥ 0, the corresponding unconstrained minimum
g(α) = min_{θ ∈ Rd} L(θ, α) is a lower bound on the optimal value of the constrained problem:
min_{θ : fi(θ) ≤ 0} f0(θ) = f0(θ∗) ≥ f0(θ∗) + Σ_{i=1}^M αi fi(θ∗) = L(θ∗, α) ≥ min_{θ ∈ Rd} L(θ, α) = g(α),
where the first inequality holds because αi ≥ 0 and fi(θ∗) ≤ 0.
Hence, f0(θ∗) ≥ g(α) for all α ≥ 0.
Lagrange dual problem
For each α ≥ 0 the Lagrange dual function g(α) gives us a lower bound on the optimal value p∗ of the original optimization problem.
What is the best (highest) lower bound?
Lagrange dual problem
maximizeα g (α)
subject to αi ≥ 0, i = 1, ..., M
The maximum d∗ of the Lagrange dual problem is the best lower bound on p∗ that we can achieve by using the Lagrangian.
Note: since we are maximizing g, we are not interested in dual multipliers α for which g(α) = −∞, so the condition g(α) ≠ −∞ is usually added as an additional constraint to the dual problem. We therefore call α feasible if and only if all αi ≥ 0 and L(θ, α) is bounded from below for θ ∈ Rd.
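As a small worked example (added for illustration, not from the original slides), consider minimizing f0(θ) = θ² subject to f1(θ) = 1 − θ ≤ 0. The Lagrangian is L(θ, α) = θ² + α(1 − θ); minimizing over θ gives θ = α/2 and hence g(α) = α − α²/4. Maximizing g over α ≥ 0 gives α∗ = 2 and d∗ = g(2) = 1, which equals the primal optimum p∗ = f0(θ∗) = 1 at θ∗ = 1, so here the duality gap is zero.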
Weak duality (always)
Since for all α ≥ 0 it holds that g(α) ≤ p∗ we have weak duality, d∗ ≤ p∗ .
The difference p∗ − d∗ ≥ 0 between the solution of the original and the dual problem is called the duality gap.
Strong duality (under certain conditions)
Under certain conditions we have
d∗ = p∗,
i.e. the maximum of the Lagrange dual problem is the minimum of the original (primal) constrained optimization problem (i.e. f0(θ∗) = g(α∗)).
SVM’s Primal problem
We apply a recipe for solving the constrained optimization problem.
1. Calculate the Lagrangian
L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi [yi(wT xi + b) − 1].
2. Minimize L(w, b, α) w.r.t. w and b:
∇w L(w, b, α) = w − Σ_{i=1}^N αi yi xi =! 0,
∂L/∂b = Σ_{i=1}^N αi yi =! 0.
Thus the weights are a linear combination of the training samples,
w = Σ_{i=1}^N αi yi xi.
SVM's Dual problem
Substituting both relations back into L(w, b, α) gives the Lagrange dual function g(α).
Thus we have reformulated our original problem as
maximize g(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N yi yj αi αj xiT xj
subject to αi ≥ 0, i = 1, ..., N,
Σ_{i=1}^N αi yi = 0.
3. Solve this problem.
Solving the dual problem
We can rewrite the dual function g(α) in vector form,
g(α) = (1/2) αT Q α + αT 1N,
where Q is a symmetric negative (semi-)definite matrix (Qij = −yi yj xiT xj) and the constraints on α are linear.
This is an instance of a quadratic programming problem.
There exist efficient algorithms for its solution, such as Sequential Minimal Optimization (SMO) [2].
A number of implementations, such as LIBSVM [3], are available and widely used in practice.
[2] http://cs229.stanford.edu/materials/smo.pdf
[3] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2011
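As a hedged practical aside (added here): scikit-learn's SVC wraps LIBSVM, so the dual problem for a small data set can be solved in a few lines; the data below are made up, and a very large C is used to approximate the hard margin.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.support_)       # indices of the support vectors
print(clf.dual_coef_)     # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)   # w and b (available for the linear kernel)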
Recovering w and b from the dual solution α∗
Having obtained the optimal α∗ using our favorite QP solver, we can
compute the parameters defining the separating hyperplane.
Recall that, from the optimality condition, the weights w are a linear combination of the training samples,
w = Σ_{i=1}^N αi∗ yi xi.
From the complementary slackness condition αi∗fi (θ∗) = 0 we can easily
recover the bias.
Take any vector xi for which αi ≠ 0. The corresponding constraint fi(w, b) must be zero, and thus we have
wT xi + b = yi.
Solving this for b yields the bias
b = yi − wT xi.
We can also average the b over all support vectors to get a more stable solution.
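A minimal NumPy sketch of this recovery step (added for illustration; `alpha` is assumed to hold the optimal dual variables for training data `X`, `y`):

import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    # w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # support vectors are the points with alpha_i > 0 (up to numerical tolerance)
    sv = alpha > tol
    # average b = y_i - w^T x_i over all support vectors for a more stable estimate
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b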
Support vectors
From complementary slackness,
αi∗ fi(θ∗) = 0, i.e. αi [yi(wT xi + b) − 1] = 0 for all i.
Hence a training sample xi can only contribute to the weight vector (αi ≠ 0) if it lies on the margin, that is,
yi(wT xi + b) = 1.
A training sample xi with αi ≠ 0 is called a support vector.
Classifying
The class of x is given by h(x)=sign(wTx +b).
Substituting
w = Σ_{i=1}^N αi yi xi
gives
h(x) = sign( Σ_{i=1}^N αi yi xiT x + b ).
Since the solution is sparse (most αi are zero) we only need to remember the few training samples xi with αi ≠ 0.
Soft Margin Support Vector Machines
Dealing with noisy data
What if the data is not linearly separable due to some noise?
With our current version of the SVM, a single outlier can make the constraints unsatisfiable.
The corresponding Lagrange multiplier αi would go to infinity and destroy the solution.
How to make our model more robust?
Slack variables
Idea: Relax the constraints as much as necessary but punish the relaxation of a constraint.
We introduce a slack variable ξi ≥ 0 for every training sample xi that gives the distance by which the margin is violated by this training sample, measured in units of 1/||w|| (i.e. the geometric distance of the violation is ξi / ||w||).
Hence our relaxed constraints are
wT xi + b ≥ +1 − ξi for yi = +1,
wT xi + b ≤ −1 + ξi for yi = −1.
Again, they can be condensed into
yi(wT xi + b) ≥ 1 − ξi for all i.
The new cost function is
f0(w, b, ξ) = (1/2) wT w + C Σ_{i=1}^N ξi.
The factor C > 0 determines how heavily a violation is punished.
C → ∞ recovers the hard-margin SVM.
Optimization problem with slack variables
Let xi be the ith data point, i = 1,…,N, and yi ∈ {−1,1} the class assigned to xi. Let C > 0 be a constant.
To find the hyperplane that separates most of the data with maximum margin we
minimize f0(w, b, ξ) = (1/2) wT w + C Σ_{i=1}^N ξi
subject to yi(wT xi + b) − 1 + ξi ≥ 0, i = 1, ..., N,
ξi ≥ 0, i = 1, ..., N.
Here we used the 1-norm for the penalty term, Σi ξi. Another choice is to use the 2-norm penalty, Σi ξi².
The penalty that performs better in practice will depend on the data and the type of noise that has influenced it.
Lagrangian with slack variables
1. Calculate the Lagrangian
L(w, b, ξ, α, μ) = (1/2) wT w + C Σ_{i=1}^N ξi − Σ_{i=1}^N αi [yi(wT xi + b) − 1 + ξi] − Σ_{i=1}^N μi ξi.
2. Minimize L(w, b, ξ, α, μ) w.r.t. w, b and ξ:
∇w L(w, b, ξ, α, μ) = w − Σ_{i=1}^N αi yi xi =! 0,
∂L/∂b = Σ_{i=1}^N αi yi =! 0,
∂L/∂ξi = C − αi − μi =! 0 for i = 1, ..., N.
From αi = C − μi and dual feasibility μi ≥ 0, αi ≥ 0 we get 0 ≤ αi ≤ C.
Dual problem with slack variables
This leads to the dual problem:
maximize g(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N yi yj αi αj xiT xj
subject to Σ_{i=1}^N αi yi = 0,
0 ≤ αi ≤ C, i = 1, ..., N.
This is nearly the same dual problem as for the case without slack variables.
Only the constraint αi ≤ C is new. It ensures that αi is bounded and cannot go to infinity.
Influence of the penalty C
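The original slide illustrates this with figures; as an added, hedged substitute, the sketch below fits soft-margin SVMs with different values of C on made-up noisy data and reports the number of support vectors (smaller C tolerates more margin violations, larger C punishes them more heavily).

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# noisy, roughly linearly separable toy data (made up for illustration)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: {clf.n_support_.sum()} support vectors")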
Hinge loss formulation
We can have another look at our constrained optimization problem:
minimize f0(w, b, ξ) = (1/2) wT w + C Σ_{i=1}^N ξi
subject to yi(wT xi + b) − 1 + ξi ≥ 0, i = 1, ..., N,
ξi ≥ 0, i = 1, ..., N.
Clearly, for the optimal solution the slack variables are
ξi = 1 − yi(wT xi + b) if yi(wT xi + b) < 1, and ξi = 0 otherwise,
since we are minimizing over ξ.
Hinge loss formulation
Thus, we can rewrite the objective function as an unconstrained optimization problem known as the hinge loss formulation
min_{w,b} (1/2) wT w + C Σ_{i=1}^N max{0, 1 − yi(wT xi + b)}.
The hinge loss function Ehinge(z) = max{0, 1 − z} penalizes the points that lie within the margin.
The name comes from the shape of the function, as can be seen in the figure to the right.
We can optimize this hinge loss objective directly, using standard gradient-based methods.
Hinge loss (blue) can be viewed as an approximation to zero-one loss (green).
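As an added sketch (not from the slides), a plain subgradient-descent loop for this objective; the learning rate and number of epochs are arbitrary choices.

import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    # minimize 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)) by subgradient descent
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # points violating the margin
        grad_w = w - C * (y[active] @ X[active])   # subgradient w.r.t. w
        grad_b = -C * y[active].sum()              # subgradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b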
Feature space
So far we can only construct linear classifiers.
Before, we used basis functions φ(·) to make the models nonlinear
For example, with the following mapping the data becomes linearly separable:
xi → φ(xi), e.g. φ(x, y) = (y, −x² + y²).
Kernel trick
In the dual formulation of the SVM, the samples xi only enter the dual objective through inner products xiT xj:
g(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N yi yj αi αj xiT xj.
For basis functions this means that
g(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N yi yj αi αj φ(xi)T φ(xj).
Kernel trick
We can define a kernel function k : RD × RD → R,
k(xi, xj) := φ(xi)T φ(xj),
and rewrite the dual as
g(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N yi yj αi αj k(xi, xj).
This operation is referred to as the kernel trick.
It can be used not only for the SVM: the kernel trick can be applied in any model that can be formulated such that it only depends on the inner products xiT xj (e.g. linear regression, k-nearest neighbors).
Kernel trick
The kernel represents an inner product in the feature space spanned by φ. Like before, this makes our models non-linear w.r.t. the data space.
What’s the point of using kernels if we can simply use the basis functions?
• Some kernels are equivalent to using infinite-dimensional basis functions. While computing these feature transformations would be impossible, directly evaluating the kernels is often easy.
• Kernels can be used to encode similarity between arbitrary non-numerical data, from strings to graphs.
For example, we could define
k (lemon, orange) = 10 and k (apple, orange) = −5
SVM using kernels example
[Figure: decision boundaries with a linear kernel (no kernel) vs. a 2nd order polynomial kernel]
What makes a valid kernel?
A kernel is valid if it corresponds to an inner product in some feature space. An equivalent formulation is given by Mercer’s theorem.
Mercer’s theorem
A kernel is valid if it gives rise to a symmetric, positive semidefinite kernel matrix K for any input data X .
The kernel matrix (also known as the Gram matrix) K ∈ RN×N is defined as
K =
[ k(x1, x1)  k(x1, x2)  ···  k(x1, xN) ]
[ k(x2, x1)  k(x2, x2)  ···  k(x2, xN) ]
[    ...         ...    ···     ...    ]
[ k(xN, x1)  k(xN, x2)  ···  k(xN, xN) ]
What happens if we use a non-valid kernel?
Our optimization problem might become non-convex, so we may not get a globally optimal solution.
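A small NumPy check (added for illustration): build the Gram matrix of the Gaussian kernel on some random data and verify that it is symmetric and positive semidefinite up to numerical tolerance.

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

X = np.random.RandomState(0).randn(5, 2)   # toy data
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # all eigenvalues non-negative -> PSD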
Kernel preserving operations
Let k1 : X × X → R and k2 : X × X → R be kernels, with X ⊆ RN . Then the following functions k are kernels as well:
• k(x1, x2) = k1(x1, x2) + k2(x1, x2)
• k(x1, x2) = c · k1(x1, x2), with c > 0
• k(x1, x2) = k1(x1, x2) · k2(x1, x2)
• k(x1, x2) = k3(φ(x1), φ(x2)), with k3 a kernel on X′ ⊆ RM and φ : X → X′
• k(x1, x2) = x1T A x2, with A ∈ RN×N symmetric and positive semidefinite
Examples of kernels
• Polynomial: k(a, b) = (aT b)^p or (aT b + 1)^p
• Gaussian kernel: k(a, b) = exp( −||a − b||² / (2σ²) )
• Sigmoid: k(a, b) = tanh(κ aT b − δ), for κ, δ > 0
In fact, the sigmoid kernel is not PSD, but still works well in practice.
Some kernels introduce additional hyperparameters that affect the behavior of the algorithm.
Note that the sigmoid kernel is different from the sigmoid function from Linear Classification.
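For concreteness (an added sketch), these kernels are one-liners in NumPy; p, σ, κ and δ are hyperparameters that have to be chosen.

import numpy as np

def polynomial_kernel(a, b, p=2, c=1.0):
    return (a @ b + c) ** p          # (a^T b + c)^p; set c = 0 for the homogeneous variant

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(a, b, kappa=1.0, delta=1.0):
    return np.tanh(kappa * (a @ b) - delta)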
Gaussian kernel (C=1000)
[Figure: decision boundaries of the Gaussian-kernel SVM for σ = 0.5, σ = 0.25, σ = 0.05 and σ = 0.005]
Classifying a new point with kernelized SVM
We denote the set of support vectors as S (the points xi for which 0 < αi ≤ C). Note: if 0 < αi < C, then ξi = 0, i.e. the point lies exactly on the margin.
From the complementary slackness condition, points xi ∈ S with ξi = 0 must satisfy
yi ( Σ_{j | xj ∈ S} αj yj k(xj, xi) + b ) = 1.
Like for the regular SVM, the bias can be recovered as
b = yi − Σ_{j | xj ∈ S} αj yj k(xi, xj).
Thus, a new point x can be classified as
h(x) = sign( Σ_{j | xj ∈ S} αj yj k(xj, x) + b ).
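A short NumPy sketch of this decision rule (added here; `alpha_sv`, `y_sv`, `X_sv`, `b` and the kernel function `k` are assumed to come from a previously solved dual problem):

import numpy as np

def svm_predict(x_new, X_sv, y_sv, alpha_sv, b, k):
    # h(x) = sign( sum_{j in S} alpha_j y_j k(x_j, x) + b )
    score = sum(a * y * k(xj, x_new) for a, y, xj in zip(alpha_sv, y_sv, X_sv)) + b
    return np.sign(score)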
How to choose the hyperparameters?
The best setting of the penalty parameter C and the kernel hyperparameters (e.g., γ or σ) can be determined by performing cross validation with random search over the parameter space.
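A minimal sketch of this procedure (added; it uses scikit-learn's RandomizedSearchCV, and the parameter ranges are arbitrary examples; note that sklearn parametrizes the Gaussian kernel via gamma = 1/(2σ²)).

import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_blobs
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

param_distributions = {
    "C": loguniform(1e-2, 1e3),       # penalty parameter
    "gamma": loguniform(1e-3, 1e1),   # RBF kernel width
}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)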
Multiple classes
The standard SVM model cannot handle multiclass data. Two approaches to address this issue are:
One-vs-rest:
Train C SVM models for C classes, where each SVM is trained to classify one class against all the remaining ones. The winner is then the class for which the distance from the hyperplane is maximal.
One-vs-one:
Train C(C − 1)/2 classifiers (one for each possible pairing of classes) and evaluate all of them. The winner is the class with the majority vote (votes are weighted according to the distance from the margin).
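As a practical sketch (added; scikit-learn offers both strategies, and SVC itself already handles multiclass data one-vs-one internally):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # one-vs-rest: 3 binary SVMs
ovo = SVC(kernel="rbf").fit(X, y)                       # one-vs-one internally: C(C-1)/2 = 3 pairwise SVMs
print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))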
• Support Vector Machine is a linear binary classifier that, motivated by learning theory, maximizes the margin between the two classes.
• The SVM objective is convex, so a globally optimal solution can be obtained.
• The dual formulation is a quadratic programming problem that can be solved efficiently, e.g. using standard QP libraries.
• Soft-margin SVM with slack variables can deal with noisy data. Smaller values for the penalty C lead to a larger margin at the cost of misclassifying more samples.
• Linear soft-margin SVM (= hinge loss formulation) can be solved directly using gradient descent.
• We can obtain a nonlinear decision boundary by moving to an implicit high-dimensional feature space by using the kernel trick. This only works in the dual formulation.
Reading material
• Bishop: chapters 7.1.0, 7.1.1, 7.1.2
Acknowledgements
• Slides are based on an older version by S. Urban