Support Vector Machines
Chapter 9
April 15, 2020
1 9.1. Maximal margin classifier
2 9.2. Support vector classifiers
3 9.3. Support vector machines
4 Appendix: Primal-Dual support vector classifiers
5 9.4. SVMs with more than two classes
6 9.5. Relationship to logistic regression
About this chapter
• The support vector machine is one of the most popular machine learning methodologies.
• Empirically successful, with well developed theory.
• One of the best off-the-shelf methods.
• We mainly address classification.
9.1. Maximal margin classifier
Figure: 9.1. The hyperplane 1 + 2X1 + 3X2 = 0 is shown. The blue region is the set of points for which 1 + 2X1 + 3X2 > 0, and the purple region is the set of points for which 1 + 2X1 + 3X2 < 0.
Separating hyperplane
• Training data: (yi, xi), i = 1, …, n, with input xi = (xi1, …, xip) and two-class output yi = ±1.
• Suppose the two classes are separated by a hyperplane f(x) = β0 + β1x1 + … + βpxp,
meaning that, for all i in one class (yi = 1),
    f(xi) > 0, one side of the hyperplane,
and for all i in the other class (yi = −1),
    f(xi) < 0, the other side of the hyperplane.
• This can be expressed equivalently as
    yif(xi) > 0, for all i = 1, …, n.
• If such a separating hyperplane exists, it can be our classification rule:
For any new/old observation with x∗ such that f(x∗) > 0, classify it into class +1; otherwise, classify it into class −1.
• Problem: If the two classes in the training data are indeed separable by a hyperplane, which hyperplane is best?
Figure: 9.2. Left: There are two classes of observations, shown in blue and in purple, each of which has measurements on two variables. Three separating hyperplanes, out of many possible, are shown in black. Right: A separating hyperplane is shown in black. The blue and purple grid indicates the decision rule made by a classifier based on this separating hyperplane: a test observation that falls in the blue portion of the grid will be assigned to the blue class, and a test observation that falls into the purple portion of the grid will be assigned to the purple class.
Maximal margin classifier
• Maximal margin hyperplane (optimal separating hyperplane): the separating hyperplane that is farthest from the training observations.
• The separating hyperplane for which the minimum distance of any training point to the hyperplane is largest.
• Creates the widest gap between the two classes.
• Points on the margin boundary, those with the smallest distance to the maximal margin hyperplane, are called support vectors.
They “support” the maximal margin hyperplane in the sense that if these points were moved slightly, then the maximal margin hyperplane would move as well.
Figure: 9.3. There are two classes of observations, shown in blue and in purple. The maximal margin hyperplane is shown as a solid line. The margin is the distance from the solid line to either of the dashed lines. The two blue points and the purple point that lie on the dashed lines are the support vectors, and the distance from those points to the hyperplane is indicated by arrows. The purple and blue grid indicates the decision rule
made by a classifier based on this separating hyperplane.
Margin
• Consider a data point x0. The distance from x0 to the hyperplane β0 + βᵀx = 0 is given by
    min_x ∥x0 − x∥2  subject to  β0 + βᵀx = 0.
• This distance equals |β0 + βᵀx0| / ∥β∥.
• For the separable case, all data points can be correctly classified, i.e., yi(β0 + βᵀxi) > 0 for all i.
• The maximum margin problem is then formulated as
    argmax_{β, β0} (1/∥β∥) min_i [ yi(βᵀxi + β0) ].    (1)
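As a small numerical illustration of the distance formula (not part of the original slides; the query point is made up, and the hyperplane coefficients are those of Figure 9.1):

import numpy as np

# Hyperplane beta0 + beta^T x = 0 in R^2, using the coefficients from Figure 9.1: 1 + 2*X1 + 3*X2 = 0
beta0 = 1.0
beta = np.array([2.0, 3.0])

x0 = np.array([0.5, -0.5])   # an arbitrary query point (made up)

# Signed value f(x0) = beta0 + beta^T x0; its sign tells which side of the hyperplane x0 lies on
f_x0 = beta0 + beta @ x0
# Distance from x0 to the hyperplane: |beta0 + beta^T x0| / ||beta||
dist = abs(f_x0) / np.linalg.norm(beta)
print(f_x0, dist)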
Computing the max margin hyperplane
• Note that rescaling β → κβ and β0 → κβ0 does not change (1).
• Therefore, we can equivalently consider
    maximize_{β0, β1, …, βp} M
    subject to Σ_{j=1}^p βj² = 1,
    and yi(β0 + β1xi1 + … + βpxip) ≥ M for all i.
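For concreteness, a minimal scikit-learn sketch (not from the slides; the toy data are made up). In sklearn's penalized formulation a very large C essentially forbids margin violations, so the fit approximates the maximal margin classifier on separable data:

import numpy as np
from sklearn.svm import SVC

# Tiny separable toy data set: two clusters in R^2
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],          # class +1
              [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C (sklearn's C penalizes slack) approximates the hard-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)       # beta and beta0 of the fitted hyperplane
print(clf.support_vectors_)            # the support vectors that determine the margin
print(2 / np.linalg.norm(clf.coef_))   # margin width 2M = 2/||beta|| in the scaled formulation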
• Note that M > 0 is half the width of the strip separating the two classes.
• The eventual solution, the maximal margin hyperplane, is determined by the support vectors.
If an xi on the correct side of the strip varies, the solution remains the same.
• The max margin hyperplane may vary a lot when the support vectors vary. (high variance; see Figure 9.5)
Figure: 9.4. There are two classes of observations, shown in blue and in purple. In this case, the two classes are not separable by a hyperplane, and so the maximal margin classifier cannot be used.
9.2. Support vector classifiers
The non-separable case
• In general, the two classes are usually not separable by any hyperplane.
• Even if they are, the maximal margin classifier may not be desirable because of its high variance, and thus possible over-fitting.
• The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.
• Use a soft-margin in place of the max margin.
Figure: 9.5. Left: Two classes of observations are shown in blue and in purple, along with the maximal margin hyperplane. Right: An additional blue observation has been added, leading to a dramatic shift in the maximal margin hyperplane shown as a solid line. The dashed line indicates the maximal margin hyperplane that was obtained in the absence of this additional point.
Non-perfect separation
• Consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of
1 Greater robustness to individual observations, and
2 Better classification of most of the training observations.
• The soft-margin classifier (support vector classifier) allows some violation of the margin: some observations can be on the wrong side of the margin (in the “river”) or even on the wrong side of the hyperplane; see Figure 9.6.
Figure: 9.6. Left: A support vector classifier was fit to a small data set. The hyperplane is shown as a solid line and the margins are shown as dashed lines. Purple observations: observations 3, 4, 5, and 6 are on the correct side of the margin, observation 2 is on the margin, and observation 1 is on the wrong side of the margin. Blue observations: observations 7 and 10 are on the correct side of the margin, observation 9 is on the margin, and observation 8 is on the wrong side of the margin. No observations are on the wrong side of the hyperplane. Right: Same as left panel with two additional points, 11 and 12. These two observations are on the wrong side of the hyperplane and the wrong side of the margin.
Computing the support vector classifier
    maximize_{β0, β1, …, βp} M
    subject to Σ_{j=1}^p βj² = 1,
    yi(β0 + β1xi1 + … + βpxip) ≥ M(1 − εi), εi ≥ 0 for all i,
    and Σ_{i=1}^n εi ≤ C,
where C is a nonnegative tuning parameter and the εi are slack variables.
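An illustrative scikit-learn sketch (not from the slides; the toy data are made up). sklearn solves the equivalent penalized form minimize ∥β∥²/2 + C Σξi (compare the appendix), so sklearn's C acts roughly inversely to the budget C above:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping two-class toy data, so some slack is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1   # recode labels to +/-1 as in the slides

for C_sklearn in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C_sklearn).fit(X, y)
    # A smaller sklearn C tolerates more margin violations (wider margin),
    # which corresponds to a larger "budget" C in the slide's formulation.
    print(C_sklearn, clf.n_support_, 2 / np.linalg.norm(clf.coef_))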
The support vector classifier
The solution of this optimization is the support vector classifier:
    f(x) = β0 + β1x1 + … + βpxp.
We classify an observation into the +1 class if f(x) > 0, and into the −1 class otherwise.
Understanding the slack variable εi
• εi = 0 ⇐⇒ the i-th observation is on the correct side of the margin
• εi > 0 ⇐⇒ the i-th observation is on the wrong side of the margin
• εi > 1 ⇐⇒ the i-th observation is on the wrong side of the hyperplane.
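To see the three cases concretely, a sketch (illustrative, not from the slides; data are made up) that recovers, for each training point, the slack ξi = max(0, 1 − yi f(xi)) of the equivalent penalized formulation (see the appendix and Section 9.5), which plays the role of εi here:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=1)
y = 2 * y - 1
clf = SVC(kernel="linear", C=1.0).fit(X, y)

f = clf.decision_function(X)          # f(x_i) = beta0 + beta^T x_i (scaled so the margins sit at +/-1)
slack = np.maximum(0.0, 1 - y * f)    # slack of each training point

on_correct_side_of_margin = slack == 0     # slack = 0
violates_margin = slack > 0                # slack > 0 (inside the margin or worse)
wrong_side_of_hyperplane = slack > 1       # slack > 1 (misclassified)
print(on_correct_side_of_margin.sum(), violates_margin.sum(), wrong_side_of_hyperplane.sum())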
Understanding tuning parameter C
• C is a budget for the amount by which the margin can be violated by the n observations.
• C = 0 ⇐⇒ no budget.
As a result, εi = 0 for all i.
The classifier is then the maximal margin classifier, which exists only if the two classes are separable by a hyperplane.
• Larger C: more tolerance of margin violation.
• No more than C observations can be on the wrong side of the hyperplane.
• As C increases, the margin widens and more violations of the margin are tolerated.
Understanding tuning parameter C
• C controls the bias-variance trade-off.
• Small C: high variance, small bias.
• Large C: small variance, high bias.
• C can be determined by using cross validation.
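A possible cross-validation sketch (illustrative; the data and parameter grid are made up, and the tuned parameter is sklearn's penalty C rather than the budget C of the slides):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation over a grid of C values
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)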
Support vectors
• Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.
• Only the support vectors affect the support vector classifier.
• Those strictly on the correct side of the margin do not.
(robustness, analogous to the median)
• Larger C =⇒ more violations =⇒ more support vectors =⇒ smaller variance and a more robust classifier.
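A quick check of this robustness property (an illustrative sketch, not from the slides; data are made up): refitting after deleting a point that is strictly on the correct side of the margin leaves the classifier unchanged, whereas deleting a support vector generally changes it:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.5, random_state=2)
y = 2 * y - 1
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Index of some training point that is NOT a support vector
not_sv = np.setdiff1d(np.arange(len(X)), clf.support_)[0]
clf2 = SVC(kernel="linear", C=1.0).fit(np.delete(X, not_sv, axis=0), np.delete(y, not_sv))

# Coefficients are (numerically) identical: the deleted non-support-vector did not matter
print(np.allclose(clf.coef_, clf2.coef_), np.allclose(clf.intercept_, clf2.intercept_))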
Figure: 9.7. A support vector classifier was fit using four different values of the tuning parameter C in (9.12) and (9.15). The largest value of C was used in the top left panel, and smaller values were used in the top right, bottom left, and bottom right panels. When C is large, then there is a high tolerance for observations being on the wrong side of the margin, and so the margin will be large. As C decreases, the tolerance for observations being on the wrong side of the margin decreases, and the margin narrows.
9.3. Support vector machines
Figure: 9.8. Nonlinear boundaries. Left: The observations fall into two classes, with a non-linear boundary between them. Right: The support vector classifier seeks a linear boundary, and consequently performs very poorly.
Extending to nonlinear boundary
• In practice, we are sometimes faced with non-linear class boundaries.
• A linear classifier could then perform poorly.
• We need a nonlinear classifier.
• As in the extension from linear regression to polynomial regression, we can consider enlarging the feature space from the original p inputs to polynomials (of a certain order) of the inputs.
Extending to quadratic inputs
• Rather than constructing the support vector classifier using the p features X1, …, Xp, we use 2p features: X1, X1², …, Xp, Xp².
• Treat them as 2p original inputs, and fit the support vector classifier.
• The separating hyperplane is a hyperplane in R^{2p}, given by a linear equation:
    β0 + β1X1 + … + βpXp + βp+1X1² + … + β2pXp² = 0
• This is a quadratic equation in X1, …, Xp. Thus the separating hyperplane in R^{2p} corresponds to a quadratic surface in the original space R^p in terms of X1, …, Xp.
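A sketch of this idea (illustrative, not from the slides; data are made up): expand the inputs with degree-2 terms and fit an ordinary linear support vector classifier in the enlarged space. Note that PolynomialFeatures also adds cross terms, slightly more than the 2p features described above:

from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Data with a circular (non-linear) class boundary
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# degree=2 adds squares (and cross terms) of the original inputs;
# the linear classifier in the enlarged space is quadratic in the original X
quad_svc = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         StandardScaler(),
                         LinearSVC(C=1.0))
print(quad_svc.fit(X, y).score(X, y))   # training accuracy; near 1 on this toy example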
Extending to polynomial inputs
• Can extend to polynomial of any given order d.
• Could lead to too many features and too large a feature space, and thus overfitting.
• Higher powers are unstable.
Key observation
• The linear support vector classifier can be represented as
    f(x) = β0 + Σ_{i=1}^n ai ⟨xi, x⟩
• ai ≠ 0 only for the support vectors.
• The ai can be computed based only on the inner products ⟨xj, xk⟩ and the yi.
• Only the inner products of the feature space are relevant in computing the linear support vector classifier.
The kernel trick
• The inner product ⟨·, ·⟩ is a bivariate function (satisfying certain properties).
• It can be generalized to a kernel function K(x, z), which is positive definite.
• The classifier can then be expressed as
    f(x) = β0 + Σ_{i=1}^n ai K(x, xi)
Examples of kernels
• Examples of the kernel function:
• Linear kernel:
    K(xi, xj) = ⟨xi, xj⟩ = xiᵀxj.
• Polynomial kernel of degree d:
    K(xi, xj) = (1 + ⟨xi, xj⟩)^d.
• Gaussian radial kernel:
    K(xi, xj) = exp(−γ∥xi − xj∥²), γ > 0.
• Only the inner products of the feature space are relevant in computing the linear support vector classifier.
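These kernels are available directly in scikit-learn; a minimal sketch with made-up data (illustrative only). sklearn's polynomial kernel is (γ⟨x, z⟩ + coef0)^d, so γ = 1 and coef0 = 1 reproduce the form above:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

svc_linear = SVC(kernel="linear")                                # K(x, z) = <x, z>
svc_poly   = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1)    # K(x, z) = (1 + <x, z>)^3
svc_rbf    = SVC(kernel="rbf", gamma=1.0)                        # K(x, z) = exp(-gamma * ||x - z||^2)

for name, clf in [("linear", svc_linear), ("poly", svc_poly), ("rbf", svc_rbf)]:
    print(name, clf.fit(X, y).score(X, y))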
Figure: 9.9. Left: An SVM with a polynomial kernel of degree 3 is applied to the non-linear data from Figure 9.8, resulting in a far more appropriate decision rule. Right: An SVM with a radial kernel is applied. In this example, either kernel is capable of capturing the decision boundary.
• The radial kernel has local behavior.
• To predict the class for a new observation with input x,
    f(x) = β0 + Σ_{i=1}^n ai exp(−γ∥x − xi∥²)
• A training data point xi that is far away from x will have little effect on the sign of f(x).
The enlarged feature space.
• The support vector machine actually enlarges the original feature space to a space of kernel functions.
xi →K(·,xi).
• The original space of p inputs has dimension p.
• The enlarged space of features, the function space, is infinite-dimensional!
• In actual fitting of the support vector machine, we only need to compute the K(xi,xj) for all xi,xj in training data.
• Do not have to work with the enlarged feature space of infinite dimension.
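To emphasize that only the kernel values K(xi, xj) are needed, a sketch (illustrative; data are made up) that passes a precomputed Gram matrix to scikit-learn instead of the features themselves:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=150, noise=0.2, random_state=0)
gamma = 1.0

# n x n Gram matrix of kernel values K(x_i, x_j) -- all the fit ever needs
K_train = rbf_kernel(X, X, gamma=gamma)
clf = SVC(kernel="precomputed").fit(K_train, y)

# To predict at new points we only need K(x_new, x_i) against the training points
X_new = X[:5]
K_new = rbf_kernel(X_new, X, gamma=gamma)
print(clf.predict(K_new))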
Example: the Heart data.
• In Chapter 8 we apply decision trees and related methods to the Heart data.
• The aim is to use 13 predictors such as Age, Sex, and Chol in order to predict whether an individual has heart disease.
• We now investigate how an SVM compares to LDA on this data.
• 297 subjects, randomly split into 207 training and 90 test observations.
Figure: 9.10. ROC curves for the Heart data training set. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
Figure: 9.11. ROC curves for the test set of the Heart data. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
Appendix: Primal-Dual support vector classifiers
Appendix: Equivalent reformulation of Hard Margin
    maximize_{β0, β1, …, βp} M
    subject to Σ_{j=1}^p βj² = 1,
    and yi(β0 + β1xi1 + … + βpxip) ≥ M for all i
⇔
    minimize_{β0, β1, …, βp} ∥β∥² := Σ_j βj²
    subject to yi(β0 + β1xi1 + … + βpxip) ≥ 1 for all i,
using M = 1/∥β∥.
Figure: Support vector classifier in the separable case: the decision boundary xᵀβ + β0 = 0 is shown with a margin of width M = 1/∥β∥ on each side.
Appendix: Lagrangian
Lagrangian: for αi ≥ 0,
    max_{α ≥ 0} min_{β0, β} L(β, α) := (1/2)∥β∥² − Σ_{i=1}^n αi [ yi(β0 + β1xi1 + … + βpxip) − 1 ]
So
    0 = ∂L/∂β  ⇒ β̂ = Σ_i αi yi xi
    0 = ∂L/∂β0 ⇒ Σ_i α̂i yi = 0
and the complementary condition
    α̂i ( yi(β̂0 + ⟨β̂, xi⟩) − 1 ) = 0, for all i.
Appendix: Support Vectors
The complementary condition α̂i ( yi(β̂0 + ⟨β̂, xi⟩) − 1 ) = 0 for all i implies that
    yi(β̂0 + ⟨β̂, xi⟩) > 1 ⇒ α̂i = 0
    yi(β̂0 + ⟨β̂, xi⟩) = 1 ⇒ α̂i ≥ 0
Those sample points i with α̂i > 0 are called support vectors (sv); they determine the maximal margin hyperplane via
    β̂ = Σ_{i∈sv} α̂i yi xi,
and β̂0 can be uniquely determined from any support vector s using
    β̂0 = ys − ⟨β̂, xs⟩.
Appendix: Dual Formulation
After plugging β̂ into the Lagrangian, it gives
    max_α Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj
    subject to αi ≥ 0,
               Σ_i αi yi = 0.
Appendix: Equivalent reformulation of Soft-margin Classifier
    maximize_{β0, β1, …, βp} M
    subject to Σ_{j=1}^p βj² = 1,
    and yi(β0 + β1xi1 + … + βpxip) ≥ M(1 − εi), εi ≥ 0, for all i,
    and Σ_{i=1}^n εi ≤ C
⇔
    minimize_{β0, β1, …, βp} ∥β∥² + C Σ_{i=1}^n ξi
    subject to yi(β0 + β1xi1 + … + βpxip) ≥ 1 − ξi, ξi ≥ 0 for all i.
Figure: Support vector classifiers. The left panel shows the separable case: the decision boundary xᵀβ + β0 = 0 is the solid line, while broken lines bound the shaded margin of width 2M = 2/∥β∥. The right panel shows the non-separable case, with slack variables ξ* for points on the wrong side of their margin.
Appendix: The Lagrangian
Lagrangian: for αi ≥ 0, μi ≥ 0, ξi ≥ 0,
    L(β, ξ, α, μ) = (1/2)∥β∥² + C Σ_{i=1}^n ξi − Σ_{i=1}^n αi [ yi(β0 + xiᵀβ) − (1 − ξi) ] − Σ_{i=1}^n μi ξi.
So for all i,
    0 = ∂L/∂β  ⇒ β̂ = Σ_i α̂i yi xi        (2)
    0 = ∂L/∂β0 ⇒ Σ_i α̂i yi = 0            (3)
    0 = ∂L/∂ξi ⇒ α̂i + μi = C              (4)
and the complementary conditions
    α̂i [ yi(β̂0 + xiᵀβ̂) − (1 − ξi) ] = 0
    μi ξi = 0.
Appendix: Support Vectors
The complementary conditions α̂i [ yi(β̂0 + xiᵀβ̂) − (1 − ξi) ] = 0, ξi ≥ 0, imply
    yi(β̂0 + ⟨β̂, xi⟩) > 1 ⇒ α̂i = 0
    yi(β̂0 + ⟨β̂, xi⟩) = 1 − ξi ⇒ C ≥ α̂i ≥ 0
    μi ξi = 0 ⇒ if ξi > 0, then μi = 0 and α̂i = C − μi = C.
Those samples with C ≥ α̂i > 0 are called support vectors (sv):
    β̂ = Σ_{i∈sv} α̂i yi xi.
Those with αi = C are within the margin (violators), while those with ξi = 0 lie on the margin (and determine β̂0).
Appendix: Dual Problem
Substituting (2)-(4) into the Lagrangian gives the dual optimization problem
    min_α (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj − Σ_{i=1}^n αi
    subject to C ≥ αi ≥ 0,
               Σ_i αi yi = 0.
One can replace xiᵀxj by K(xi, xj) using the kernel trick for nonlinear SVMs.
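As a numerical sanity check of the dual representation (an illustrative sketch, not part of the slides; data are made up): scikit-learn exposes the products α̂i yi for the support vectors as dual_coef_, so β̂ = Σ_{i∈sv} α̂i yi xi can be recovered directly:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=3)
y = 2 * y - 1
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
beta_from_dual = clf.dual_coef_ @ clf.support_vectors_   # = sum over sv of alpha_i y_i x_i
print(np.allclose(beta_from_dual, clf.coef_))            # matches the primal coefficients

# The dual coefficients respect the box constraint 0 <= alpha_i <= C
alphas = np.abs(clf.dual_coef_)
print(alphas.min() >= 0, alphas.max() <= 1.0 + 1e-8)     # C = 1.0 here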
9.4. SVMs with more than two classes
One-versus-one approach
• With K > 2 classes.
• Run an SVM on each of the K(K − 1)/2 pairs of classes.
• We obtain K(K − 1)/2 SVMs.
• For each test observation, compute the number of times it is classified into class k by all the SVMs, denoted dk.
• Classify it into the class with the highest dk (majority vote).
One-versus-all approach
• With K > 2 classes.
• For each class k, run an SVM on class k (coded as +1) versus the remaining classes, “not-k” (coded as −1), giving fk(x). (Note that the larger fk(x), the more likely x is in class k.)
• For a new test observation with input x, assign it to the class with the largest fk(x).
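A sketch of both multiclass strategies with scikit-learn (illustrative; the iris data is just a convenient 3-class example). SVC trains the K(K − 1)/2 pairwise classifiers internally (one-versus-one), while OneVsRestClassifier fits K one-versus-all machines:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

# One-versus-one: SVC fits all K(K-1)/2 = 3 pairwise SVMs and predicts by voting
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

# One-versus-all: K classifiers, class k vs. the rest; assign to the largest f_k(x)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(ovo.score(X, y), ovr.score(X, y))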
9.5. Relationship to logistic regression
Recall the “Loss + Penalty” formula
• Minimize, for f in a certain function space,
    Σ_{i=1}^n L(yi, f(xi)) + λ P(f)
• Ridge regression: for linear f,
    Σ_{i=1}^n (yi − f(xi))² + λ Σ_{j=1}^p βj²
• Lasso regression: for linear f,
    Σ_{i=1}^n (yi − f(xi))² + λ Σ_{j=1}^p |βj|
Logistic regression for classification
• Data: (yi, xi), yi = ±1.
• The logistic model assumes
    P(Y = 1 | X) = 1/(1 + e^{−f(X)}),  P(Y = −1 | X) = 1/(1 + e^{f(X)}).
That is,
    P(Y = y | X) = 1/(1 + e^{−y f(X)}).
Logistic regression for classification
• The negative logistic log-likelihood (a.k.a. binomial deviance, cross-entropy) is
    Σ_{i=1}^n log(1 + e^{−yi f(xi)})
• Note that AdaBoost instead uses the exponential loss
    Σ_{i=1}^n e^{−yi f(xi)}
Logistic regression with regularization
• Logistic loss with ridge (l2) penalty:
    Σ_{i=1}^n log(1 + e^{−yi f(xi)}) + λ Σ_{j=1}^p βj²
• Logistic loss with lasso (l1) penalty:
    Σ_{i=1}^n log(1 + e^{−yi f(xi)}) + λ Σ_{j=1}^p |βj|
• AdaBoost uses coordinate descent to approximately solve the l1-penalized problem
    Σ_{i=1}^n e^{−yi f(xi)} + λ Σ_{j=1}^p |βj|
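These two penalized logistic problems correspond, up to the parameterization λ ↔ 1/C, to ridge- and lasso-penalized logistic regression in scikit-learn; a brief sketch for illustration (data are made up):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# l2 (ridge) penalized logistic loss; sklearn's C is roughly 1/lambda
ridge_logit = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# l1 (lasso) penalized logistic loss; needs a solver that supports l1
lasso_logit = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print(ridge_logit.score(X, y), lasso_logit.score(X, y))
print((lasso_logit.coef_ == 0).sum(), "coefficients set exactly to zero by the l1 penalty")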
The SVM
• Data: (yi, xi), yi = ±1.
• The SVM is the result of “hinge loss + ridge penalty”:
    Σ_{i=1}^n max[0, 1 − yi f(xi)] + λ Σ_{j=1}^p βj²,
where f(x) = β0 + β1x1 + … + βpxp. Note that
    ξi = max[0, 1 − yi f(xi)] ≥ 0.
• The hinge loss function: l(u) = (1 − u)+.
• The logistic loss function: l(u) = log(1 + e^{−u}).
• The exponential loss function: l(u) = e^{−u}.
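The three loss functions can be compared numerically, in the spirit of Figure 9.12 (an illustrative sketch, not from the slides):

import numpy as np

u = np.linspace(-3, 3, 7)            # u = y_i * f(x_i)

hinge = np.maximum(0.0, 1 - u)       # SVM: (1 - u)_+, exactly zero once u >= 1
logistic = np.log(1 + np.exp(-u))    # logistic loss / binomial deviance
exponential = np.exp(-u)             # AdaBoost's exponential loss

for row in zip(u, hinge, logistic, exponential):
    print("u=%+.1f  hinge=%.3f  logistic=%.3f  exp=%.3f" % row)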
Figure: 9.12. The SVM and logistic regression loss functions are compared, as a function of yi(β0 + β1xi1 + … + βpxip). When yi(β0 + β1xi1 + … + βpxip) is greater than 1, the SVM loss is zero, since this corresponds to an observation that is on the correct side of the margin. Overall, the two loss functions have quite similar behavior.