Math 123 Problem Set 6 Due: Tuesday 12/14 at 11:59 p.m. EST
Instructions: Read the homework policy. You should submit a PDF copy of the homework
and any associated code on Canvas. Your PDF must be a single file, not multiple images.
1. In this problem, we consider the least squares problem.
(a) Prove that the least squares loss f(x) = ||Ax − b||_2^2 is a convex function.
(b) Sketch the gradient descent algorithm to solve the least squares problem; a reference code sketch follows part (c).
(c) When is it preferable to use gradient descent instead of the normal equations? Explain briefly.
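[Remark: If you include code for part (b), a minimal gradient-descent sketch in Python might look like the following; the step size and stopping rule here are illustrative choices, not part of the problem.]

    import numpy as np

    def least_squares_gd(A, b, step=None, tol=1e-8, max_iter=10000):
        """Gradient descent on f(x) = ||Ax - b||_2^2, whose gradient is
        2 A^T (A x - b). The default step 1 / (2 sigma_max(A)^2) is one
        safe choice (1/L for the Lipschitz constant L of the gradient)."""
        m, n = A.shape
        if step is None:
            step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
        x = np.zeros(n)
        for _ in range(max_iter):
            grad = 2 * A.T @ (A @ x - b)      # gradient of the loss
            if np.linalg.norm(grad) < tol:    # stop when nearly stationary
                break
            x = x - step * grad
        return x

    # np.linalg.lstsq(A, b, rcond=None)[0] gives a reference solution to compare against.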
2. Consider the following minimization problem:

    min_{x ∈ R^n}  x^T C x + b^T x + d

where C is an n × n symmetric matrix, b ∈ R^n is a fixed vector, and d ∈ R is a constant.
(a) Derive the gradient of the objective function; a numerical check is sketched after part (c).
(b) Under what conditions on C is the objective convex?
(c) Consider the case where C is symmetric and positive-definite. What is the solution to the
minimization problem?
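[Remark: You can sanity-check the gradient you derive in (a) against central finite differences. A Python sketch; the helper name and test sizes are arbitrary.]

    import numpy as np

    def grad_check(derived_grad, n=5, eps=1e-6, seed=0):
        """Compare a candidate gradient of f(x) = x^T C x + b^T x + d
        against central finite differences at a random point."""
        rng = np.random.default_rng(seed)
        C = rng.standard_normal((n, n))
        C = (C + C.T) / 2                  # symmetric, as in the problem
        b = rng.standard_normal(n)
        d = 1.0
        x = rng.standard_normal(n)
        f = lambda z: z @ C @ z + b @ z + d
        numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                            for e in np.eye(n)])
        return np.max(np.abs(derived_grad(C, b, x) - numeric))

    # grad_check(lambda C, b, x: <your derived expression>) should return
    # a discrepancy near machine precision if the derivation is correct.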
3. Consider the optimization problem corresponding to the soft SVM problem,

    min_{w ∈ R^d, b, ξ_i ≥ 0}  ||w||^2 + C ∑_{i=1}^n ξ_i   subject to  y_i(w^T x_i + b) ≥ 1 − ξ_i

We consider the effect of the regularization parameter C.
(a) What kind of problem do we obtain as C → ∞? Justify your answer.
(b) For very small C, what can you say about the margin? Briefly explain your answer.
(c) For very large C, what can you say about the margin? Briefly explain your answer. (A small numerical experiment follows below.)
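[Remark: To build intuition for (b) and (c), you may train a linear soft-margin SVM at a few values of C and compare the resulting margins; a sketch using scikit-learn on made-up 2-D data.]

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # Two slightly overlapping Gaussian blobs (toy data, made up here).
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
    y = np.r_[-np.ones(50), np.ones(50)]

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_.ravel()
        print(f"C={C:>7}: margin 2/||w|| = {2 / np.linalg.norm(w):.3f}, "
              f"support vectors = {len(clf.support_)}")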
4. [Extra Credit: 10 points] Given n points and their labels (x_i, y_i) for 1 ≤ i ≤ n, with each x_i ∈ R^d, consider the dual optimization program corresponding to the hard margin support vector machines classification problem:

    max_{α_i ≥ 0, i = 1, …, n}  ∑_{i=1}^n α_i − (1/2) ∑_{j,k} α_j α_k y_j y_k x_j^T x_k   subject to  ∑_{i=1}^n α_i y_i = 0

Let {α_i^∗}_{i=1}^n denote the optimal solution to the above program. Assume that there exists a point x_s such that y_s((w^∗)^T x_s + b^∗) = 1, where y_s is the label of x_s, and w^∗ and b^∗ denote the optimal solutions to the primal optimization program.
(a) What is the form of the classifier function in terms of {α_i^∗}_{i=1}^n and {(x_i, y_i), 1 ≤ i ≤ n}?
[Remark: The classifier function should depend only on the dual variables and the data points and their labels. It should not be given in terms of b^∗ and w^∗].
(b) Suppose that we employ a kernel for the support vector machines classification problem, using the Gaussian kernel K(x, x̂) = exp(−||x − x̂||^2 / (2σ^2)). What is the form of the classifier function in terms of {α_i^∗}_{i=1}^n, {(x_i, y_i), 1 ≤ i ≤ n}, and the Gaussian kernel?
[Remark: The classifier function should depend only on the dual variables and the data points and their labels. It should not be given in terms of b^∗ and w^∗].
(c) Let n = 1000. Assume that α_3^∗ = 0. Suppose we remove the third training data point (x_3, y_3). How does the classifier function change? Justify your answer.
(d) Characterize the support vectors in terms of {α_i^∗}_{i=1}^n.
(e) What is the computational complexity of evaluating the classifier function in (b) given a test data point x_test? The complexity should be in terms of n and d. (A sketch of such an evaluation follows this problem.)
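[Remark: Once you have derived the classifier in (b), it may help to see how its evaluation could be organized; a Python sketch with placeholder names (alpha, s, and sigma are assumed to be available, not given in the problem).]

    import numpy as np

    def gaussian_svm_classify(x_test, X, y, alpha, s, sigma):
        """Evaluate the dual-form Gaussian-kernel classifier at x_test.
        X: (n, d) training points; y: (n,) labels; alpha: (n,) dual optimum;
        s: index of a point on the margin, used to recover the bias from
        the dual variables alone (per the remarks in (a) and (b))."""
        K = lambda Z, x: np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * sigma ** 2))
        b = y[s] - alpha @ (y * K(X, X[s]))   # bias from the margin point x_s
        # One pass over n training points; each kernel value costs O(d).
        return np.sign(alpha @ (y * K(X, x_test)) + b)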
5. [Extra Credit: 20 points] Given n points and their labels (x_i, y_i) for 1 ≤ i ≤ n, with each x_i ∈ R^d, consider the optimization program corresponding to the hard margin support vector machines classification problem:

    min_{b, w ∈ R^d}  (1/2) w^T w   subject to  y_j(w^T x_j + b) ≥ 1 for j = 1, …, n    (1)

Let d = 2. Let x_1 = (0, 0), x_2 = (2, 2), x_3 = (2, 0), and x_4 = (3, 0) be the training points with labels y_1 = −1, y_2 = −1, y_3 = 1, and y_4 = 1, respectively.
(a) Derive the optimal solution to (1), i.e., find the equation of the optimal hyperplane.
(b) Based on your solution in (a), find an explicit form for the classifier function.
(c) Using the classifier in (b), find the labels of the test points (6, 2) and (1, 1).
(d) Compute the margin.
(e) What are the support vectors?
(f) Suppose we remove the data point x_4 = (3, 0). How does the solution of the optimization
problem change? In particular, what is the form of the optimal hyperplane? Justify your
answer.
[Remark: Show detailed work. Answers that simply guess the equation of the optimal
hyperplane receive no credit for (a). To get full credit, you need to solve the problem by
considering the optimization objective and the constraints in (1)].
[Remark: For parts (a) and (b), you can directly use the relations/results that can be established
from the equivalence of the primal hard SVM problem and the dual SVM problem].
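[Remark: After deriving the hyperplane analytically, you can cross-check your answers to (a)–(e) numerically; a sketch that approximates the hard-margin problem with scikit-learn's soft-margin solver at a very large C.]

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]])
    y = np.array([-1, -1, 1, 1])

    # C -> infinity recovers the hard-margin SVM; 1e10 is a stand-in.
    clf = SVC(kernel="linear", C=1e10).fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    print("w =", w, " b =", b, " margin =", 2 / np.linalg.norm(w))
    print("support vector indices:", clf.support_)
    print("labels of (6,2) and (1,1):", clf.predict([[6, 2], [1, 1]]))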
6. [Extra Credit: 30 points] Given n points and their labels (x_i, y_i) for 1 ≤ i ≤ n, with
each x_i ∈ R^d, consider the optimization program corresponding to the soft margin support
vector machines classification problem:
    min_{w, b, ξ_i ≥ 0}  (1/2) w^T w + C ∑_{i=1}^n ξ_i   such that  y_i(w^T x_i + b) ≥ 1 − ξ_i  ∀ (x_i, y_i)
C is a regularization constant that controls overfitting.
(a) What is the Lagrangian L corresponding to the above optimization problem? L should be
a function of w, b, and α, where α = {α_i}_{i=1}^n are the penalty parameters.
(b) What is the primal problem?
(c) Recall that the dual problem is given by max_{α ≥ 0} min_{w,b} L(w, b, α). We study the inner minimization problem, min_{w,b} L(w, b, α).
    (i) Compute ∂L/∂w.
    (ii) Compute ∂L/∂b.
(d) What is the dual problem? [Hint: Set the partial derivatives evaluated in (c) to zero].
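[Remark: You can sanity-check the stationarity conditions from (c) numerically; scikit-learn's SVC exposes the products α_i y_i for the support vectors as dual_coef_, so both conditions can be verified on made-up data.]

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(2.0, 1.0, (30, 2))])
    y = np.r_[-np.ones(30), np.ones(30)]

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    ay = clf.dual_coef_.ravel()     # alpha_i * y_i at the support vectors
    Xs = X[clf.support_]            # the support vectors themselves

    # dL/dw = 0 should tie w to a weighted sum of the support vectors.
    print(np.allclose(clf.coef_.ravel(), ay @ Xs, atol=1e-6))
    # dL/db = 0 should force the weighted labels to balance.
    print(abs(ay.sum()) < 1e-6)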