CS5487 Problem Set 9
Support Vector Machines
Antoni Chan, Department of Computer Science, City University of Hong Kong
Hyperplanes
Problem 9.1 Margin
Let $f(x) = w^T x + b$ and consider the hyperplane $f(x) = 0$.
(a) Show that the Euclidean distance from a point $x_a$ to the hyperplane is $\frac{|f(x_a)|}{\|w\|}$, by minimizing $\|x - x_a\|^2$ subject to $f(x) = 0$.
(b) Show that the distance from the origin to the hyperplane is $\frac{|b|}{\|w\|}$.
(c) Show that the projection of $x_a$ onto the hyperplane is
$$x_p = x_a - \frac{f(x_a)}{\|w\|^2}\, w.$$
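These identities are easy to sanity-check numerically. Below is a minimal Python sketch (not part of the original problem set); the hyperplane parameters w, b and the point x_a are arbitrary made-up values.

import numpy as np

w = np.array([2.0, -1.0, 0.5])
b = 1.5
xa = np.array([3.0, 0.0, -2.0])

def f(x):
    return w @ x + b

dist = abs(f(xa)) / np.linalg.norm(w)             # (a) distance from xa to the hyperplane
dist0 = abs(b) / np.linalg.norm(w)                # (b) distance from the origin
xp = xa - (f(xa) / np.linalg.norm(w)**2) * w      # (c) projection of xa onto the hyperplane

print(dist, dist0)
print(abs(f(xp)) < 1e-12)                         # xp lies on the hyperplane
print(np.isclose(np.linalg.norm(xa - xp), dist))  # and is at distance |f(xa)|/||w|| from xa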
Problem 9.2 SVM dual problem
Consider the SVM problem for linearly separable data $\{X, y\}$,
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i(w^T x_i + b) \ge 1, \; \forall i. \qquad (9.2)$$
We will derive the dual formulation of the SVM.
(a) Show that the Lagrangian of (9.2) is
$$L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\big(y_i(w^T x_i + b) - 1\big), \qquad (9.3)$$
where $\alpha_i \ge 0$ is a Lagrange multiplier for each i.
(b) Show that the minimum of $L(w,b,\alpha)$ w.r.t. $\{w,b\}$ satisfies
$$w^* = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0. \qquad (9.4)$$
(c) Use (9.4) on the Lagrangian to obtain the dual function,
$$L(\alpha) = \min_{w,b} L(w,b,\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j, \qquad (9.5)$$
yielding the SVM dual problem,
$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j \quad \text{s.t. } \sum_{i=1}^n \alpha_i y_i = 0, \; \alpha_i \ge 0, \; \forall i. \qquad (9.6)$$
………
Problem 9.3 Calculating the bias term
Solving the dual SVM problem yields only the hyperplane parameters w. In this problem, we consider two ways to calculate the bias term b. Recall that the set of support vectors is the set of training points with non-zero Lagrange multipliers,
$$SV = \{i \mid \alpha_i > 0\}. \qquad (9.7)$$
Since $\alpha_i > 0$ for points in SV, by the KKT conditions the inequality constraint must be "active" for the support vector $x_i$, i.e.,
$$y_i(w^T x_i + b) = 1. \qquad (9.8)$$
(a) Using the active constraints, show that b can be calculated as
$$b^* = \frac{1}{|SV|}\sum_{i \in SV} \big(y_i - w^T x_i\big). \qquad (9.9)$$
(b) Another method for calculating the bias is to select one support vector on each side of the hyperplane. Given points $x_+$ and $x_-$, which are on the positive and negative margins, show that
$$b^* = -\frac{1}{2}\, w^T (x_+ + x_-). \qquad (9.10)$$
(c) Show that the method in (b) is a special case of (a) when there are only two support vectors.
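As a sanity check on Problems 9.2 and 9.3, the following sketch solves the dual (9.6) directly with scipy's SLSQP solver on a made-up separable dataset, then recovers w from (9.4) and b from (9.9). The data, solver choice, and tolerances are illustrative assumptions, not part of the problem set.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
n = len(y)

K = (y[:, None] * X) @ (y[:, None] * X).T        # K_ij = y_i y_j x_i^T x_j

def neg_dual(a):                                 # maximize (9.5) <=> minimize its negative
    return -(a.sum() - 0.5 * a @ K @ a)

res = minimize(neg_dual, np.zeros(n), method='SLSQP',
               bounds=[(0, None)] * n,                              # alpha_i >= 0
               constraints={'type': 'eq', 'fun': lambda a: a @ y},  # sum_i alpha_i y_i = 0
               options={'maxiter': 500})
alpha = res.x

w = ((alpha * y)[:, None] * X).sum(axis=0)       # eq. (9.4)
SV = np.flatnonzero(alpha > 1e-5)                # support vectors, eq. (9.7)
b = np.mean(y[SV] - X[SV] @ w)                   # eq. (9.9)
print(w, b, len(SV))
print(np.min(y * (X @ w + b)))                   # smallest margin, approximately 1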
Problem 9.4 Soft-margin SVM (1-norm penalty)
Consider the soft-margin SVM we saw in lecture,
$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \forall i, \qquad (9.11)$$
where $\xi_i$ is the slack variable that allows the ith point to violate the margin. The new term in the objective function penalizes large slack variables (with parameter C). Since the penalty is the sum of the non-negative slack variables, the penalty function is the 1-norm of the slack vector. We will derive the dual formulation of this SVM. (This SVM is sometimes called the C-SVM or 1-norm SVM; it is also the most popular one used.)
(a) Introducing Lagrange multipliers $\alpha_i \ge 0$ for the margin constraints and $r_i \ge 0$ for the non-negativity constraints on the slack variables, show that the Lagrangian is
$$L(w,b,\xi,\alpha,r) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\big(y_i(w^T x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^n r_i \xi_i. \qquad (9.12)$$
(b) Show that the minimum of $L(w,b,\xi,\alpha,r)$ w.r.t. $\{w,b,\xi\}$ satisfies
$$w^* = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0, \qquad r_i = C - \alpha_i. \qquad (9.13)$$
(c) Use (9.13) on the Lagrangian to obtain the dual function,
$$L(\alpha) = \min_{w,b,\xi} L(w,b,\xi,\alpha,r) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j. \qquad (9.14)$$
(d) Show that the two non-negativity constraints on the Lagrange multipliers $\{\alpha_i, r_i\}$ and the equality constraint from (9.13),
$$\alpha_i \ge 0, \qquad r_i \ge 0, \qquad r_i = C - \alpha_i, \qquad (9.15)$$
are equivalent to
$$0 \le \alpha_i \le C. \qquad (9.16)$$
Hence, the SVM dual problem is
$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j \quad \text{s.t. } \sum_{i=1}^n \alpha_i y_i = 0, \; 0 \le \alpha_i \le C, \; \forall i. \qquad (9.17)$$
(e) Use the KKT conditions to show that only one of these conditions holds for each point $x_i$:
i. $\xi_i = 0$ and $0 < \alpha_i < C$, and the point is on the margin, i.e., $y_i(w^T x_i + b) = 1$.
ii. $\xi_i > 0$ and $\alpha_i = C$, and the point violates the margin.
iii. $\xi_i = 0$ and $\alpha_i = 0$, and the point is correctly classified.
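The three cases can be checked empirically. The sketch below fits a linear soft-margin SVM with scikit-learn (which solves the dual (9.17) internally) on a made-up overlapping dataset and inspects the support vectors; the dataset, the value of C, and the numerical tolerances are arbitrary choices.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),
               rng.normal([-1, -1], 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

C = 1.0
clf = SVC(kernel='linear', C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()                              # alpha_i of the support vectors
margin = y[clf.support_] * clf.decision_function(X[clf.support_])   # y_i f(x_i)

free = alpha < C - 1e-6          # case i:  0 < alpha_i < C  ->  on the margin, y_i f(x_i) = 1
bound = ~free                    # case ii: alpha_i = C      ->  margin violated, xi_i > 0
print(np.allclose(margin[free], 1, atol=1e-2))
print(np.all(margin[bound] <= 1 + 1e-3))
# every remaining training point is case iii: alpha_i = 0 and y_i f(x_i) >= 1.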
………
Problem 9.5 Soft-margin SVM risk function
Consider the soft-margin SVM in Problem 9.4. In this problem we will interpret the soft-margin SVM as regularized risk minimization.
(a) Show that the slack variables must satisfy,
$$\xi_i \ge \max\big(0,\, 1 - y_i(w^T x_i + b)\big). \qquad (9.20)$$
Given the above, at the optimum of the SVM primal problem, what is the expression for $\xi_i$?
(b) Show that the C-SVM primal problem is equivalent to
$$\min_{w,b} \sum_{i=1}^n \max\big(0,\, 1 - y_i(w^T x_i + b)\big) + \lambda\|w\|^2, \qquad (9.21)$$
where $\lambda = \frac{1}{2C}$.
The first term is the empirical risk, where the loss function is
$$L_{SVM}(z_i) = \max(0,\, 1 - z_i), \quad \text{where } z_i = y_i(w^T x_i + b).$$
This is called the “hinge loss”. The second term is a regularization term on w to control its
complexity. Hence, SVM is optimizing a regularized risk.
(c) Plot the SVM loss function along with the other loss functions from Problem 8.5. Give an intuitive explanation about why the SVM loss function is good (and possibly better than others).
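A plotting sketch for part (c) is given below. The comparison losses (0-1, logistic, exponential) are assumed here to be among those from Problem 8.5; substitute whichever losses that problem actually uses.

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 500)                 # z = y * f(x)
hinge = np.maximum(0, 1 - z)
zero_one = (z < 0).astype(float)
logistic = np.log(1 + np.exp(-z))
exponential = np.exp(-z)

plt.plot(z, hinge, label='hinge: max(0, 1-z)')
plt.plot(z, zero_one, label='0-1 loss')
plt.plot(z, logistic, label='logistic: log(1+e^{-z})')
plt.plot(z, exponential, label='exponential: e^{-z}')
plt.xlabel('z = y f(x)'); plt.ylabel('loss'); plt.ylim(0, 4); plt.legend()
plt.show()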
………
Problem 9.6 Soft-margin SVM (2-norm penalty)
Consider the soft-margin SVM problem using a 2-norm penalty on the slack variables,
$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^n \xi_i^2 \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; \forall i, \qquad (9.23)$$
where $\xi_i$ is the slack variable that allows the ith point to violate the margin. We will derive the dual of this SVM.
(a) Show that the non-negativity constraint on $\xi_i$ is redundant, and hence can be dropped. Hint: show that if $\xi_i < 0$ and the margin constraint is satisfied, then $\xi_i = 0$ is also a feasible solution with lower cost.
(b) Introducing Lagrange multipliers $\alpha_i \ge 0$ for the margin constraints, show that the Lagrangian is
$$L(w,b,\xi,\alpha) = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^n \xi_i^2 - \sum_{i=1}^n \alpha_i\big(y_i(w^T x_i + b) - 1 + \xi_i\big). \qquad (9.24)$$
(c) Show that the minimum of $L(w,b,\xi,\alpha)$ w.r.t. $\{w,b,\xi\}$ satisfies
$$w^* = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \xi_i^* = \frac{\alpha_i}{C}. \qquad (9.25)$$
(d) Use (9.25) on the Lagrangian to obtain the dual function,
$$L(\alpha) = \min_{w,b,\xi} L(w,b,\xi,\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j \left(x_i^T x_j + \frac{\delta_{ij}}{C}\right), \qquad (9.26)$$
where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise. Hence, the SVM dual problem is
$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j \left(x_i^T x_j + \frac{\delta_{ij}}{C}\right) \quad \text{s.t. } \sum_{i=1}^n \alpha_i y_i = 0, \; \alpha_i \ge 0, \; \forall i. \qquad (9.27)$$
(e) (9.27) is the same as the standard SVM dual, except for the extra $\delta_{ij}/C$ term. What is the role of this term, and how does it affect the solution?
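One way to see the role of the extra term: the dual (9.27) is the hard-margin dual with the Gram matrix K replaced by K + (1/C) I, i.e., a ridge-like term on its diagonal. The toy sketch below (arbitrary data and C, not part of the problem set) shows the effect on the spectrum of the Gram matrix.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
C = 10.0

K = X @ X.T                              # Gram matrix, rank <= 3, so it is singular
K_soft = K + np.eye(len(X)) / C          # x_i^T x_j + delta_ij / C
print(np.linalg.eigvalsh(K).min())       # ~0: K is not invertible
print(np.linalg.eigvalsh(K_soft).min())  # >= 1/C: the diagonal term regularizes K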
Problem 9.7 ν-SVM
One limitation of the soft-margin SVM using the 1-norm penalty (Problem 9.4) is that there is no intuition for what the parameter C means, and it can therefore be difficult to find good values for it in practice. In this problem we consider a slightly different, but more intuitive, formulation based on the solution of the following problem, with ν as the parameter:
$$\min_{w,b,\xi,\rho} \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^n \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge \rho - \xi_i, \; \xi_i \ge 0, \; \forall i. \qquad (9.28)$$
(a) Derive the dual problem and the resulting decision function.
(b) Given the dual solution, how would you determine the values of b and $\rho$?
(c) Define the fraction of margin errors as
$$\epsilon_\rho = \frac{1}{n}\big|\{i \mid y_i f(x_i) < \rho\}\big|, \qquad (9.29)$$
and suppose that we solve the optimization problem on a dataset with the result that $\rho > 0$. Show that
i. $\nu$ is an upper bound on $\epsilon_\rho$;
ii. $\nu$ is a lower bound on the fraction of vectors that are support vectors.
(d) Show that if the solution of the ν-SVM problem (9.28) leads to $\rho > 0$, then the soft-margin SVM of Problem 9.4, with C set a priori to $\frac{1}{\rho}$, leads to the same decision function.
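The bounds in part (c) can be checked empirically with scikit-learn's NuSVC, which implements a ν-SVM. The sketch below assumes, as in libsvm's implementation, that the fitted decision function is rescaled so that the margin sits at ±1, so margin errors are counted as points with y_i f(x_i) < 1; the dataset and the value of ν are arbitrary choices.

import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([1, 1], 1.5, (100, 2)),
               rng.normal([-1, -1], 1.5, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])

nu = 0.3
clf = NuSVC(nu=nu, kernel='linear').fit(X, y)

frac_sv = len(clf.support_) / len(y)                          # fraction of support vectors
frac_margin_err = np.mean(y * clf.decision_function(X) < 1)   # fraction with y_i f(x_i) < rho
print(frac_margin_err, nu, frac_sv)   # expect frac_margin_err <= nu <= frac_sv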
Other SVMs
Problem 9.8 Adaptive SVM
In this problem we will consider an adaptive SVM. Suppose we have used a dataset $D_0$ to learn a linear classifier function $f_0(x) = w_0^T x$ with decision rule $y = \mathrm{sign}(f_0(x))$. Since we have the classifier, we then threw away the data $D_0$. Now, suppose we receive a new set of data $D = \{(x_i, y_i)\}_{i=1}^n$, and wish to update our classifier function $f_0(x)$. To do this, we will add a "delta function" $\delta f(x) = w^T x$ to adapt our original classifier,
$$f(x) = f_0(x) + \delta f(x) = f_0(x) + w^T x, \qquad (9.30)$$
where w is the parameter vector of the delta-function.
Let's consider the case when D is linearly separable. We wish to maximize the margin between the updated classifier and the new training set, which yields the adaptive-SVM (ASVM) problem
$$\min_w \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i f_0(x_i) + y_i w^T x_i \ge 1, \; \forall i. \qquad (9.31)$$
(a) Show that the Lagrangian of (9.31) is
$$L(w,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\big(y_i f_0(x_i) + y_i w^T x_i - 1\big), \qquad (9.32)$$
where $\alpha_i \ge 0$ are the Lagrange multipliers.
(b) Show that the minimum of $L(w,\alpha)$ w.r.t. w satisfies
$$w^* = \sum_{i=1}^n \alpha_i y_i x_i. \qquad (9.33)$$
(c) Show that the dual function is
$$L(\alpha) = \sum_{i=1}^n \big(1 - y_i f_0(x_i)\big)\alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j. \qquad (9.34)$$
Hence, the ASVM dual problem is
$$\max_\alpha \sum_{i=1}^n \big(1 - y_i f_0(x_i)\big)\alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j \quad \text{s.t. } \alpha_i \ge 0, \; \forall i. \qquad (9.35)$$
(d) Compare the ASVM dual in (9.34) with the original SVM dual function. What is the interpretation of the ASVM dual (considering the original SVM dual)? What is the role of the original classifier $f_0(x)$?
Note that we can also define a “soft” version of ASVM with slack variables to handle the non-linearly separable case.
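A small numerical sketch of the ASVM follows: since the dual (9.35) has no equality constraint (there is no bias term), it can be solved with a simple box-constrained solver. Everything below (the old classifier f_0, the new dataset D) is made up for illustration and is not part of the problem set.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
w0 = np.array([1.0, 0.0])                        # the old classifier f0(x) = w0^T x
f0 = lambda Xq: Xq @ w0

X = np.vstack([rng.normal([2, 2], 0.4, (15, 2)),
               rng.normal([-2, -2], 0.4, (15, 2))])
y = np.hstack([np.ones(15), -np.ones(15)])
n = len(y)

K = (y[:, None] * X) @ (y[:, None] * X).T
c = 1 - y * f0(X)                                # linear term (1 - y_i f0(x_i))

neg_dual = lambda a: -(c @ a - 0.5 * a @ K @ a)
res = minimize(neg_dual, np.zeros(n), bounds=[(0, None)] * n, method='L-BFGS-B')
alpha = res.x

dw = ((alpha * y)[:, None] * X).sum(axis=0)      # eq. (9.33): the "delta" parameters
f = lambda Xq: f0(Xq) + Xq @ dw                  # adapted classifier (9.30)
print((np.sign(f(X)) == y).mean())               # training accuracy of the adapted classifier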
………
Problem 9.9 Support Vector Regression (SVR)
In this problem, we will consider support vector regression (SVR), which applies margin principles from SVMs to regression. The goal is to learn a linear function,
$$f(x) = w^T x + b, \qquad (9.36)$$
which fits a given dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Suppose we form a "band" or "tube" around f(x) with width $\epsilon$ (see figure below).
[Figure: the regression function f(x) with the surrounding $\epsilon$-tube, bounded by $f(x)+\epsilon$ and $f(x)-\epsilon$.]
We can consider any training pair $(x_i, y_i)$ that falls inside of the tube as correctly regressed, while points falling outside of the tube are errors. Assuming that $\epsilon$ is known, the SVR problem is
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i - (w^T x_i + b) \le \epsilon, \quad (w^T x_i + b) - y_i \le \epsilon, \; \forall i. \qquad (9.37)$$
(a) What are the roles of the inequality constraints and the objective function in (9.37)?
(b) Show that the Lagrangian of (9.37) is
$$L(w,b,\alpha,\hat\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\big(\epsilon - y_i + (w^T x_i + b)\big) - \sum_{i=1}^n \hat\alpha_i\big(\epsilon + y_i - (w^T x_i + b)\big), \qquad (9.38)$$
where $\alpha_i \ge 0$ are the Lagrange multipliers for the first inequality constraint, and $\hat\alpha_i \ge 0$ are for the second inequality constraint.
(c) Show that the minimum of $L(w,b,\alpha,\hat\alpha)$ w.r.t. $\{w,b\}$ satisfies
$$w^* = \sum_{i=1}^n (\alpha_i - \hat\alpha_i)\, x_i, \qquad \sum_{i=1}^n (\alpha_i - \hat\alpha_i) = 0. \qquad (9.39)$$
(d) Show that the SVR dual function is
$$L(\alpha,\hat\alpha) = \min_{w,b} L(w,b,\alpha,\hat\alpha) = \sum_{i=1}^n y_i(\alpha_i - \hat\alpha_i) - \epsilon\sum_{i=1}^n(\alpha_i + \hat\alpha_i) \qquad (9.40)$$
$$\qquad\qquad - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n (\alpha_i - \hat\alpha_i)(\alpha_j - \hat\alpha_j)\, x_i^T x_j. \qquad (9.41)$$
Hence the SVR dual problem is
$$\max_{\alpha,\hat\alpha} L(\alpha,\hat\alpha) \quad \text{s.t. } \sum_{i=1}^n (\alpha_i - \hat\alpha_i) = 0, \quad \alpha_i \ge 0, \; \hat\alpha_i \ge 0, \; \forall i. \qquad (9.42)$$
(e) Use the KKT conditions to show that only one of these conditions holds for the ith data point,
i. $\alpha_i = 0$ and $\hat\alpha_i = 0$, and the point is inside the $\epsilon$-tube.
ii. $\alpha_i > 0$ and $\hat\alpha_i = 0$, and the point is on the positive margin of the tube, i.e., $y_i = f(x_i) + \epsilon$.
iii. $\alpha_i = 0$ and $\hat\alpha_i > 0$, and the point is on the negative margin of the tube, i.e., $y_i = f(x_i) - \epsilon$.
Hence, $\alpha_i \hat\alpha_i = 0$, i.e., the point cannot be on both the positive margin and the negative margin at the same time.
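The sketch below solves the hard-margin SVR dual (9.42) with scipy on made-up 1-D data whose noise is kept smaller than ε (so the problem is feasible), then recovers w from (9.39) and b from an active tube constraint. All values and tolerances are illustrative assumptions, not part of the problem set.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = np.linspace(0, 5, 20)
y = 0.8 * x + 1.0 + rng.uniform(-0.05, 0.05, 20)
X = x[:, None]
eps = 0.1
n = len(y)
K = X @ X.T

def neg_dual(z):                                  # z packs [alpha, alpha_hat]
    a, ah = z[:n], z[n:]
    d = a - ah
    return -(y @ d - eps * (a + ah).sum() - 0.5 * d @ K @ d)

res = minimize(neg_dual, np.zeros(2 * n), method='SLSQP',
               bounds=[(0, None)] * (2 * n),
               constraints={'type': 'eq', 'fun': lambda z: z[:n].sum() - z[n:].sum()},
               options={'maxiter': 500})
a, ah = res.x[:n], res.x[n:]

d = a - ah
w = (d[:, None] * X).sum(axis=0)                        # eq. (9.39)
sv = np.abs(d) > 1e-4                                   # active tube constraints
b = np.mean(y[sv] - X[sv] @ w - eps * np.sign(d[sv]))   # b from y_i = f(x_i) +/- eps
print(w, b)                                             # roughly 0.8 and 1.0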
………
Problem 9.10 "Soft" Support Vector Regression
In this problem we consider a "soft" version of SVR. We introduce a set of slack variables $\xi_i, \hat\xi_i \ge 0$ to allow for some errors on either side of the tube, i.e., some points can be outside of the tube.
[Figure: the $\epsilon$-tube with slack variables allowing some points to lie outside the tube.]
The amount of error is penalized linearly, similar to SVMs. The SVR problem is now:
$$\min_{w,b,\xi,\hat\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n (\xi_i + \hat\xi_i)$$
$$\text{s.t. } y_i - (w^T x_i + b) \le \epsilon + \xi_i, \quad (w^T x_i + b) - y_i \le \epsilon + \hat\xi_i, \quad \xi_i \ge 0, \; \hat\xi_i \ge 0, \; \forall i. \qquad (9.43)$$
At the optimum, if the ith point is inside the positive margin of the $\epsilon$-tube, then the slack variable $\xi_i = 0$. Otherwise, if the ith point is outside the positive margin, then $\xi_i$ is the distance between the point and the margin, i.e., the amount of error $y_i - f(x_i) - \epsilon$. The same holds for $\hat\xi_i$ and the negative margin. In other words, the slack variables $\xi_i$ and $\hat\xi_i$ contain the amount of error outside of the $\epsilon$-tube.
(a) Let $|z|_\epsilon$ be the $\epsilon$-insensitive loss function,
$$|z|_\epsilon = \max(0,\, |z| - \epsilon) = \begin{cases} 0 & |z| \le \epsilon \\ |z| - \epsilon & \text{otherwise.} \end{cases} \qquad (9.44)$$
Show that the primal problem in (9.43) is equivalent to regularized regression with the $\epsilon$-insensitive loss function,
$$\min_{w,b} \sum_{i=1}^n |y_i - f(x_i)|_\epsilon + \lambda\|w\|^2, \qquad (9.45)$$
where $\lambda = \frac{1}{2C}$.
How does this compare with other regularized regression formulations, e.g., regularized least squares? Comment on its robustness to outliers.
(b) Show that the soft SVR dual problem is:
$$\max_{\alpha,\hat\alpha} \sum_{i=1}^n y_i(\alpha_i - \hat\alpha_i) - \epsilon\sum_{i=1}^n(\alpha_i + \hat\alpha_i) - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n (\alpha_i - \hat\alpha_i)(\alpha_j - \hat\alpha_j)\, x_i^T x_j \qquad (9.46)$$
$$\text{s.t. } \sum_{i=1}^n \alpha_i = \sum_{i=1}^n \hat\alpha_i, \quad 0 \le \alpha_i \le C, \; 0 \le \hat\alpha_i \le C, \; \forall i.$$
(c) Use the KKT conditions to show that only one of these conditions holds for the ith data point,
1. $\alpha_i = 0$ and $\hat\alpha_i = 0$, and $|y_i - f(x_i)| < \epsilon$ (strictly inside the tube).
2. $0 < \alpha_i < C$ and $\hat\alpha_i = 0$, and $y_i = f(x_i) + \epsilon$ (on the positive margin).
3. $\alpha_i = 0$ and $0 < \hat\alpha_i < C$, and $y_i = f(x_i) - \epsilon$ (on the negative margin).
4. $\alpha_i = C$ and $\hat\alpha_i = 0$, and $y_i > f(x_i) + \epsilon$ (above the tube).
5. $\alpha_i = 0$ and $\hat\alpha_i = C$, and $y_i < f(x_i) - \epsilon$ (below the tube).
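The case analysis above can be checked with scikit-learn's SVR, which solves the soft SVR dual; its dual_coef_ attribute stores α_i − α̂_i for the support vectors. The data, C, ε, and numerical tolerances below are arbitrary choices, not part of the problem set.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(6)
x = np.linspace(0, 5, 80)
y = 0.8 * x + 1.0 + rng.normal(0, 0.3, 80)
X = x[:, None]

C, eps = 1.0, 0.2
svr = SVR(kernel='linear', C=C, epsilon=eps).fit(X, y)

beta = np.zeros(len(y))                       # beta_i = alpha_i - alpha_hat_i
beta[svr.support_] = svr.dual_coef_.ravel()
resid = np.abs(y - svr.predict(X))

inside = resid < eps - 1e-2                   # strictly inside the tube
outside = resid > eps + 1e-2                  # strictly outside the tube
print(np.allclose(beta[inside], 0))                        # case 1: no dual weight inside the tube
print(np.allclose(np.abs(beta[outside]), C, atol=1e-2))    # cases 4/5: at the box bound C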