
Machine Learning
Lecture 9: SVM and Kernels
Prof. Dr. Günnemann
Data Analytics and Machine Learning Group Technical University of Munich


Winter term 2020/2021

1. Support Vector Machines (SVM)
2. Soft Margin Support Vector Machines
3. Kernels

Support Vector Machines (SVM)

Linear classifier
A linear classifier assigns all x with w^T x + b > 0 to class blue and all x with w^T x + b < 0 to class green. Thus the class of x is given by

h(x) = sign(w^T x + b)

with

sign(z) = −1 if z < 0,   0 if z = 0,   +1 if z > 0.
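As a minimal sketch (not part of the original slides; w, b and the data points are made-up values), the decision rule takes only a few lines of NumPy:

import numpy as np

# Made-up parameters of a linear classifier (for illustration only)
w = np.array([1.0, -2.0])
b = 0.5

def predict(X, w, b):
    """Return sign(w^T x + b) for every row x of X: +1, -1, or 0."""
    return np.sign(X @ w + b)

print(predict(np.array([[2.0, 0.0], [0.0, 1.0]]), w, b))   # -> [ 1. -1.]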


Maximum margin classifier
• Intuitively, a wide margin around the dividing line makes it more likely that new samples will fall on the right side of the boundary.
• The actual rigorous motivation comes from Statistical Learning Theory.¹
• Objective:
Find a hyperplane that separates both classes with the maximum margin.
¹ V. Vapnik – "Statistical Learning Theory", 1995

Linear classifier with margin
We add two more hyperplanes that are parallel to the original hyperplane and require that no training points lie between those hyperplanes.
Thus we now require
w^T x + (b − s) > 0 for all x from class blue and
w^T x + (b + s) < 0 for all x from class green.

(Figure: the hyperplanes w^T x + (b − s) = 0, w^T x + b = 0 and w^T x + (b + s) = 0, with the margin between the two outer ones.)

Size of the margin
The signed distance from the origin to the hyperplane w^T x + b = 0 is given by

d = −b / ||w||.

Thus we have

d_blue = (−b + s) / ||w||,    d_green = (−b − s) / ||w||,

and the margin is

m = d_blue − d_green = 2s / ||w||.

Redundancy of parameter s
The size of the margin,

m = 2s / ||w||,

only depends on the ratio of s and ||w||, so w.l.o.g. we can set s = 1 and get

m = 2 / ||w||.

Although the distance from the origin to the black (middle) hyperplane, d = −b / ||w||, also depends on two parameters, we cannot set b = 1, as this would link the distance d to the size of the margin m.

Set of constraints
Let x_i be the i-th sample, and y_i ∈ {−1, 1} the class assigned to x_i. The constraints

w^T x_i + b ≥ +1  for y_i = +1,
w^T x_i + b ≤ −1  for y_i = −1

can be condensed into

y_i (w^T x_i + b) ≥ 1  for all i.

If these constraints are fulfilled, the margin is m = 2 / ||w|| = 2 / √(w^T w).

SVM's optimization problem
Let x_i be the i-th data point, i = 1, ..., N, and y_i ∈ {−1, 1} the class assigned to x_i. To find the separating hyperplane with the maximum margin we need to find {w, b} that

minimize    f_0(w, b) = ½ w^T w
subject to  f_i(w, b) = y_i (w^T x_i + b) − 1 ≥ 0   for i = 1, ..., N.

This is a constrained convex optimization problem (more specifically, a quadratic programming problem). We go from ||w|| to ||w||² = w^T w since the square root is a monotonic function that doesn't change the location of the optimum.

Optimization with inequality constraints
Constrained optimization problem
Given f_0 : R^d → R and f_i : R^d → R,

minimize_θ  f_0(θ)
subject to  f_i(θ) ≤ 0  for i = 1, ..., M.

Feasibility
A point θ ∈ R^d is called feasible if and only if it satisfies the constraints of the optimization problem, i.e. f_i(θ) ≤ 0 for all i ∈ {1, ..., M}.

Minimum and minimizer
We call the optimal value the minimum p∗, and the point where the minimum is obtained the minimizer θ∗. Thus p∗ = f_0(θ∗).

Lagrangian
minimize_θ f_0(θ) subject to f_i(θ) ≤ 0 for i = 1, ..., M

Definition (Lagrangian)
We define the Lagrangian L : R^d × R^M → R associated with the above problem as

L(θ, α) = f_0(θ) + Σ_{i=1}^M α_i f_i(θ).

We refer to α_i ≥ 0 as the Lagrange multiplier associated with the inequality constraint f_i(θ) ≤ 0.

Lagrange dual function
Definition (Lagrange dual function)
The Lagrange dual function g : R^M → R maps α to the minimum of the Lagrangian over θ (possibly −∞ for some values of α),

g(α) = min_{θ ∈ R^d} L(θ, α) = min_{θ ∈ R^d} [ f_0(θ) + Σ_{i=1}^M α_i f_i(θ) ].

It is concave in α since it is the point-wise minimum of a family of affine functions of α.

Interpretation of the Lagrangian
minimize_θ f_0(θ) subject to f_i(θ) ≤ 0, i = 1, ..., M

L(θ, α) = f_0(θ) + Σ_{i=1}^M α_i f_i(θ)

For every choice of α ≥ 0, the corresponding unconstrained g(α) = min_{θ ∈ R^d} L(θ, α) is a lower bound on the optimal value of the constrained problem:

min_{θ ∈ R^d, f_i(θ) ≤ 0} f_0(θ) = f_0(θ∗) ≥ f_0(θ∗) + Σ_{i=1}^M α_i f_i(θ∗) = L(θ∗, α) ≥ min_{θ ∈ R^d} L(θ, α) = g(α),

where the first inequality holds because each α_i ≥ 0 and f_i(θ∗) ≤ 0. Hence, for all α ≥ 0: f_0(θ∗) ≥ g(α).
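A tiny worked example (not from the slides) makes this lower-bound property concrete. Take f_0(θ) = θ² with the single constraint f_1(θ) = 1 − θ ≤ 0 (i.e. θ ≥ 1), so p∗ = 1 at θ∗ = 1. The Lagrangian is L(θ, α) = θ² + α(1 − θ); minimizing over θ gives θ = α/2 and hence g(α) = α − α²/4. For every α ≥ 0 we indeed have g(α) ≤ 1, e.g. g(1) = 3/4, and the best bound, g(2) = 1, attains p∗.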
Lagrange dual problem
For each α ≥ 0 the Lagrange dual function g(α) gives us a lower bound on the optimal value p∗ of the original optimization problem. What is the best (highest) lower bound?

Lagrange dual problem:

maximize_α  g(α)
subject to  α_i ≥ 0,  i = 1, ..., M.

The maximum d∗ of the Lagrange dual problem is the best lower bound on p∗ that we can achieve by using the Lagrangian.

Note: since we are maximizing g we are not interested in dual multipliers α such that g(α) = −∞, so the condition g(α) ≠ −∞ is usually added as an additional constraint to the dual problem → we call α feasible if and only if all α_i ≥ 0 and L(θ, α) is bounded from below for θ ∈ R^d.

Weak duality (always)
Since for all α ≥ 0 it holds that g(α) ≤ p∗, we have weak duality,

d∗ ≤ p∗.

The difference p∗ − d∗ ≥ 0 between the solution of the original and the dual problem is called the duality gap.

Strong duality (under certain conditions)
Under certain conditions we have

d∗ = p∗,

i.e. the maximum of the Lagrange dual problem is the minimum of the original (primal) constrained optimization problem (i.e. f_0(θ∗) = g(α∗)).

SVM's primal problem
We apply this recipe to solve the constrained optimization problem.

1. Calculate the Lagrangian

L(w, b, α) = ½ w^T w − Σ_{i=1}^N α_i [ y_i (w^T x_i + b) − 1 ].

2. Minimize L(w, b, α) w.r.t. w and b:

∇_w L(w, b, α) = w − Σ_{i=1}^N α_i y_i x_i = 0,    ∂L/∂b = Σ_{i=1}^N α_i y_i = 0.

Thus the weights are a linear combination of the training samples,

w = Σ_{i=1}^N α_i y_i x_i.

SVM's dual problem
Substituting both relations back into L(w, b, α) gives the Lagrange dual function g(α). Thus we have reformulated our original problem as

maximize    g(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N y_i y_j α_i α_j x_i^T x_j
subject to  α_i ≥ 0,  i = 1, ..., N,
            Σ_{i=1}^N α_i y_i = 0.

3. Solve this problem.

Solving the dual problem
We can rewrite the dual function g(α) in vector form,

g(α) = ½ α^T Q α + α^T 1_N,

where Q (with Q_ij = −y_i y_j x_i^T x_j) is a symmetric negative (semi-)definite matrix, and the constraints on α are linear. This is an instance of a quadratic programming problem. There exist efficient algorithms for its solution, such as sequential minimal optimization (SMO).² A number of implementations, such as LIBSVM,³ are available and widely used in practice.

² http://cs229.stanford.edu/materials/smo.pdf
³ C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2011

Recovering w and b from the dual solution α∗
Having obtained the optimal α∗ using our favorite QP solver, we can compute the parameters defining the separating hyperplane. Recall that, from the optimality condition, the weights w are a linear combination of the training samples,

w = Σ_{i=1}^N α_i∗ y_i x_i.

From the complementary slackness condition α_i∗ f_i(θ∗) = 0 we can easily recover the bias: take any vector x_i for which α_i ≠ 0. The corresponding constraint f_i(w, b) must be zero, and thus we have w^T x_i + b = y_i. Solving this for b yields the bias

b = y_i − w^T x_i.

We can also average b over all support vectors to get a more stable solution.

Support vectors
From complementary slackness, Σ_{i=1}^N α_i∗ f_i(θ∗) = 0, i.e.

α_i [ y_i (w^T x_i + b) − 1 ] = 0  for all i.

Hence a training sample x_i can only contribute to the weight vector (α_i ≠ 0) if it lies on the margin, that is

y_i (w^T x_i + b) = 1.
A training sample x_i with α_i ≠ 0 is called a support vector.

Classifying
The class of x is given by

h(x) = sign(w^T x + b).

Substituting w = Σ_{i=1}^N α_i y_i x_i gives

h(x) = sign( Σ_{i=1}^N α_i y_i x_i^T x + b ).

Since the solution is sparse (most α_i are zero) we only need to remember the few training samples x_i with α_i ≠ 0.

Soft Margin Support Vector Machines

Dealing with noisy data
What if the data is not linearly separable due to some noise? With our current version of SVM, a single outlier makes the constraints unsatisfiable. The corresponding Lagrange multiplier α_i would go to infinity and destroy the solution. How can we make our model more robust?

Slack variables
Idea: Relax the constraints as much as necessary, but punish the relaxation of a constraint.

We introduce a slack variable ξ_i ≥ 0 for every training sample x_i that measures how far the margin is violated by this training sample (in units of 1/||w||, i.e. the geometric violation is ξ_i/||w||).

Hence our relaxed constraints are

w^T x_i + b ≥ +1 − ξ_i  for y_i = +1,
w^T x_i + b ≤ −1 + ξ_i  for y_i = −1.

Again, they can be condensed into

y_i (w^T x_i + b) ≥ 1 − ξ_i  for all i.

The new cost function is

f_0(w, b, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i.

The factor C > 0 determines how heavily a violation is punished. C → ∞ recovers the hard-margin SVM.
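As a rough sketch (not from the slides), these quantities can be inspected with scikit-learn, whose SVC solves exactly this soft-margin problem in its dual form; the data is synthetic and C = 1.0 is an arbitrary choice:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data, labels relabelled to -1/+1
X, y = make_blobs(n_samples=60, centers=2, random_state=0)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin SVM

w = clf.coef_[0]                  # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]
support = clf.support_vectors_    # training samples with alpha_i != 0
# clf.dual_coef_ stores the products alpha_i * y_i for the support vectors

# slack of every training sample: xi_i = max(0, 1 - y_i (w^T x_i + b))
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("margin 2/||w|| :", 2 / np.linalg.norm(w))
print("support vectors:", len(support), " total slack:", xi.sum())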

Optimization problem with slack variables
Let x_i be the i-th data point, i = 1, ..., N, and y_i ∈ {−1, 1} the class assigned to x_i. Let C > 0 be a constant.
To find the hyperplane that separates most of the data with maximum margin we

minimize    f_0(w, b, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i
subject to  y_i (w^T x_i + b) − 1 + ξ_i ≥ 0,   i = 1, ..., N,
            ξ_i ≥ 0,                            i = 1, ..., N.

Here we used the 1-norm for the penalty term, Σ_i ξ_i. Another choice is to use the 2-norm penalty, Σ_i ξ_i². Which penalty performs better in practice depends on the data and the type of noise that has influenced it.
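For reference (an illustrative mapping on my part, not something stated on the slides), scikit-learn's LinearSVC exposes exactly this choice: loss="hinge" corresponds to the 1-norm slack penalty and loss="squared_hinge" to the 2-norm penalty:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1-norm slack penalty  C * sum_i xi_i      <-> standard hinge loss
svm_l1_slack = LinearSVC(loss="hinge", dual=True, C=1.0, max_iter=20000).fit(X, y)

# 2-norm slack penalty  C * sum_i xi_i^2    <-> squared hinge loss
svm_l2_slack = LinearSVC(loss="squared_hinge", dual=True, C=1.0, max_iter=20000).fit(X, y)

print(svm_l1_slack.score(X, y), svm_l2_slack.score(X, y))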

Lagrangian with slack variables
1. Calculate the Lagrangian
L(w, b, ξ, α, μ) = ½ w^T w + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [ y_i (w^T x_i + b) − 1 + ξ_i ] − Σ_{i=1}^N μ_i ξ_i.

2. Minimize L(w, b, ξ, α, μ) w.r.t. w, b and ξ:

∇_w L(w, b, ξ, α, μ) = w − Σ_{i=1}^N α_i y_i x_i = 0,    ∂L/∂b = Σ_{i=1}^N α_i y_i = 0,
∂L/∂ξ_i = C − α_i − μ_i = 0   for i = 1, ..., N.

From α_i = C − μ_i and dual feasibility μ_i ≥ 0, α_i ≥ 0 we get 0 ≤ α_i ≤ C.

Dual problem with slack variables
This leads to the dual problem:
maximize    g(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N y_i y_j α_i α_j x_i^T x_j
subject to  Σ_{i=1}^N α_i y_i = 0,
            0 ≤ α_i ≤ C,   i = 1, ..., N.

This is nearly the same dual problem as in the case without slack variables. Only the constraint α_i ≤ C is new; it ensures that α_i is bounded and cannot go to infinity.

Influence of the penalty C
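The original slide illustrates this with figures. As a stand-in sketch (synthetic data, arbitrary C values), one can observe how the margin and the number of support vectors change with C:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping synthetic classes so that the slack variables matter
X, y = make_blobs(n_samples=80, centers=2, cluster_std=2.0, random_state=1)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:8.2f}  margin={margin:.3f}  support vectors={len(clf.support_)}")

# Typically: small C -> wide margin but many violations / support vectors,
# large C  -> narrow margin, few violations (approaching the hard-margin SVM).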

Hinge loss formulation
We can have another look at our constrained optimization problem:

minimize    f_0(w, b, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i
subject to  y_i (w^T x_i + b) − 1 + ξ_i ≥ 0,   i = 1, ..., N,
            ξ_i ≥ 0,                            i = 1, ..., N.

Clearly, for the optimal solution the slack variables are

ξ_i = 1 − y_i (w^T x_i + b)  if y_i (w^T x_i + b) < 1,   and   ξ_i = 0  otherwise,

since we are minimizing over ξ.

Hinge loss formulation
Thus, we can rewrite the objective function as an unconstrained optimization problem known as the hinge loss formulation:

min_{w,b}  ½ w^T w + C Σ_{i=1}^N max{0, 1 − y_i (w^T x_i + b)}.

The hinge loss function E_hinge(z) = max{0, 1 − z} penalizes the points that lie within the margin. The name comes from the shape of the function. We can optimize this hinge loss objective directly, using standard gradient-based methods.

(Figure: the hinge loss (blue) can be viewed as an approximation to the zero-one loss (green).)

Feature space
So far we can only construct linear classifiers. Before, we used basis functions φ(·) to make the models nonlinear,

x_i ↦ φ(x_i).

For example, with the following mapping the data becomes linearly separable:

φ(x, y) = ( x,  y,  −√(x² + y²) )^T.

Kernel trick
In the dual formulation of SVM, the samples x_i only enter the dual objective as inner products x_i^T x_j:

g(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N y_i y_j α_i α_j x_i^T x_j.

For basis functions this means that

g(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N y_i y_j α_i α_j φ(x_i)^T φ(x_j).

We can define a kernel function k : R^D × R^D → R,

k(x_i, x_j) := φ(x_i)^T φ(x_j),

and rewrite the dual as

g(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N y_i y_j α_i α_j k(x_i, x_j).

This operation is referred to as the kernel trick. It can be used not only for SVM: the kernel trick can be used in any model that can be formulated such that it only depends on the inner products x_i^T x_j (e.g. linear regression, k-nearest neighbors).

The kernel represents an inner product in the feature space spanned by φ. Like before, this makes our models non-linear w.r.t. the data space. What's the point of using kernels if we can simply use the basis functions?
• Some kernels are equivalent to using infinite-dimensional basis functions. While computing these feature transformations would be impossible, directly evaluating the kernels is often easy.
• Kernels can be used to encode similarity between arbitrary non-numerical data, from strings to graphs. For example, we could define k(lemon, orange) = 10 and k(apple, orange) = −5.

SVM using kernels example
(Figure: decision boundaries with a linear kernel (no kernel) and with a 2nd-order polynomial kernel.)

What makes a valid kernel?
A kernel is valid if it corresponds to an inner product in some feature space. An equivalent formulation is given by Mercer's theorem.

Mercer's theorem
A kernel is valid if it gives rise to a symmetric, positive semidefinite kernel matrix K for any input data X. The kernel matrix (also known as Gram matrix) K ∈ R^{N×N} is defined as

K = [ k(x_1, x_1)  k(x_1, x_2)  ···  k(x_1, x_N)
      k(x_2, x_1)  k(x_2, x_2)  ···  k(x_2, x_N)
      ⋮            ⋮            ⋱    ⋮
      k(x_N, x_1)  k(x_N, x_2)  ···  k(x_N, x_N) ].

What happens if we use a non-valid kernel? Our optimization problem might become non-convex, so we may not get a globally optimal solution.

Kernel preserving operations
Let k_1 : X × X → R and k_2 : X × X → R be kernels, with X ⊆ R^N. Then the following functions k are kernels as well:
• k(x_1, x_2) = k_1(x_1, x_2) + k_2(x_1, x_2)
• k(x_1, x_2) = c · k_1(x_1, x_2), with c > 0
• k(x_1, x_2) = k_1(x_1, x_2) · k_2(x_1, x_2)
• k(x_1, x_2) = k_3(φ(x_1), φ(x_2)), with a kernel k_3 on X′ ⊆ R^M and φ : X → X′
• k(x_1, x_2) = x_1^T A x_2, with A ∈ R^{N×N} symmetric and positive semidefinite
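As a small sketch (not from the slides), one can build a kernel from the operations above and check Mercer's condition numerically by inspecting the eigenvalues of the Gram matrix; the particular kernels combined here are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 random points in R^3

def k_lin(a, b):                      # linear kernel a^T b
    return a @ b

def k_poly(a, b, p=2):                # polynomial kernel (a^T b + 1)^p
    return (a @ b + 1.0) ** p

def k_combined(a, b):                 # sum and positive scaling preserve validity
    return 2.0 * k_lin(a, b) + k_poly(a, b)

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[k_combined(xi, xj) for xj in X] for xi in X])

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 (up to rounding) for a valid kernel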

Examples of kernels
• Polynomial: k(a, b) = (a^T b)^p  or  (a^T b + 1)^p
• Gaussian kernel: k(a, b) = exp( −‖a − b‖² / (2σ²) )
• Sigmoid: k(a, b) = tanh(κ a^T b − δ)  for κ, δ > 0

In fact, the sigmoid kernel is not PSD, but it still works well in practice.
Some kernels introduce additional hyperparameters that affect the behavior of the algorithm.
Note that the sigmoid kernel is different from the sigmoid function from Linear Classification.
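A hedged sketch of how these kernels look in code; note that scikit-learn parameterizes the Gaussian (RBF) kernel as exp(−γ‖a − b‖²), so γ = 1/(2σ²) is assumed below:

import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(a, b, p=3):
    return (a @ b + 1.0) ** p

def gaussian_kernel(a, b, sigma=0.5):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * sigma ** 2))

# scikit-learn writes the Gaussian (RBF) kernel as exp(-gamma * ||a - b||^2),
# so a given sigma corresponds to gamma = 1 / (2 * sigma^2):
sigma = 0.5
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=1.0)
# clf.fit(X_train, y_train)   # X_train, y_train are placeholders for your data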

Gaussian kernel (C=1000)
(Figure: decision boundaries for σ = 0.5, σ = 0.25, σ = 0.05 and σ = 0.005.)

Classifying a new point with kernelized SVM
We denote the set of support vectors as S (the points x_i for which 0 < α_i ≤ C). Note: if 0 < α_i < C, then μ_i = C − α_i > 0 and hence ξ_i = 0.

From the complementary slackness condition, points x_i ∈ S with ξ_i = 0 must satisfy

y_i ( Σ_{j | x_j ∈ S} α_j y_j k(x_j, x_i) + b ) = 1.

Like for the regular SVM, the bias can be recovered as

b = y_i − Σ_{j | x_j ∈ S} α_j y_j k(x_i, x_j).

Thus, a new point x can be classified as

h(x) = sign( Σ_{j | x_j ∈ S} α_j y_j k(x_j, x) + b ).
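As an illustration (synthetic data; a sketch that mirrors the formula above rather than anything from the slides), the kernelized decision function can be evaluated from the fitted dual coefficients and compared with scikit-learn's own decision_function; note that sklearn's dual_coef_ stores the products α_j y_j:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

x_new = np.array([[0.5, 0.0]])

# h(x) = sign( sum_{j in S} alpha_j y_j k(x_j, x) + b )
K = rbf_kernel(clf.support_vectors_, x_new, gamma=2.0)   # k(x_j, x) for all support vectors
score = clf.dual_coef_ @ K + clf.intercept_              # dual_coef_ = alpha_j * y_j
print(np.sign(score), clf.decision_function(x_new))      # the two scores agree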

How to choose the hyperparameters?
The best setting of the penalty parameter C and the kernel hyperparameters (e.g., γ or σ) can be determined by performing cross validation with random search over the parameter space.
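A minimal sketch of such a search with scikit-learn (the parameter ranges, dataset and number of iterations are arbitrary choices):

from scipy.stats import loguniform
from sklearn.datasets import make_moons
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

param_distributions = {
    "C": loguniform(1e-2, 1e3),       # penalty parameter
    "gamma": loguniform(1e-3, 1e1),   # RBF kernel hyperparameter
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=50,        # number of random samples from the parameter space
    cv=5,             # 5-fold cross validation
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)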

Multiple classes
The standard SVM model cannot handle multiclass data. Two approaches to address this issue are:
One-vs-rest:
Train C SVM models for C classes, where each SVM is trained to classify one class against all the remaining ones. The winner is then the class for which the distance from the hyperplane is maximal.

One-vs-one:
Train (C choose 2) = C(C−1)/2 classifiers (all possible pairings) and evaluate all of them. The winner is the class with the majority vote (votes are weighted according to the distance from the margin).
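For reference (not from the slides): scikit-learn's SVC already applies a one-vs-one scheme internally for more than two classes, and a one-vs-rest scheme can be obtained by wrapping the classifier; a rough sketch:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)              # 3 classes

ovo = SVC(kernel="rbf", C=1.0)                 # one-vs-one internally for >2 classes
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))

print(ovo.fit(X, y).score(X, y), ovr.fit(X, y).score(X, y))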

• Support Vector Machine is a linear binary classifier that, motivated by learning theory, maximizes the margin between the two classes.
• The SVM objective is convex, so a globally optimal solution can be obtained.
• The dual formulation is a quadratic programming problem, which can be solved efficiently, e.g. using standard QP libraries.
• Soft-margin SVM with slack variables can deal with noisy data. Smaller values for the penalty C lead to a larger margin at the cost of misclassifying more samples.
• Linear soft-margin SVM (= hinge loss formulation) can be solved directly using gradient descent.
• We can obtain a nonlinear decision boundary by moving to an implicit high-dimensional feature space by using the kernel trick. This only works in the dual formulation.

Reading material
• Bishop: chapters 7.1.0, 7.1.1, 7.1.2
Acknowledgements
• Slides are based on an older version by S. Urban
