EECS 189 Introduction to Machine Learning Fall 2020
HW11
This homework is due Tuesday, November 17 at 11:59 p.m.

1 Getting Started
Read through this page carefully. You may typeset your homework in LaTeX or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:
1. Submit a PDF of your writeup, with an appendix for your code, to the appropriate assignment on Gradescope. If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.
2. If there is code, submit all code needed to reproduce your results.
3. If there is a test set, submit your test set evaluation results.
After you’ve submitted your homework, watch out for the self-grade form.
(a) Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?

(b) Please copy the following statement and sign next to it. We just want to make it extra clear so that no one inadvertently cheats.
I certify that all solutions are entirely in my words and that I have not looked at another student’s solutions nor have I looked at any online solutions to any of these problems. I have credited all external sources in this write up.
2 SVM with Custom Margins
In the lecture, we covered the soft-margin SVM. The objective to be optimized over the training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ is
$$\min_{w, b, \{\xi_i\}} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{1}$$
$$\text{s.t. } y_i(w^\top x_i - b) \ge 1 - \xi_i \quad \forall i \tag{2}$$
$$\xi_i \ge 0 \quad \forall i \tag{3}$$
In this problem, we are interested in a modified version of the soft-margin SVM where we have a custom margin for each of the $n$ data points. In the standard soft-margin SVM, we pay a penalty of $\xi_i$ for each data point. In practice, we might not want to treat every training point equally: with prior knowledge, we might know that some data points are more important than others. (This is related to weighted least squares, where we weight each data point by its importance, or where we have a prior belief that some points are noisier than others.)
We formally define the following optimization problem:
$$\min_{w, b, \{\xi_i\}} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \phi_i \xi_i \tag{4}$$
$$\text{s.t. } y_i(w^\top x_i - b) \ge 1 - \xi_i \quad \forall i \tag{5}$$
$$\xi_i \ge 0 \quad \forall i \tag{6}$$
Note that the only difference is that we have a custom weighting factor $\phi_i > 0$ for each of the slack variables $\xi_i$ in the objective function. These $\phi_i$ are constants given by prior knowledge, and thus they can be treated as known constants in the optimization problem. Intuitively, this formulation weights each of the violations ($\xi_i$) differently according to the prior knowledge ($\phi_i$).
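To make the new objective concrete, here is a minimal sketch of how (4)–(6) could be handed to an off-the-shelf convex solver. This is purely illustrative and not part of the assignment; the use of cvxpy and all variable names here are our own assumptions.

```python
# Sketch: solving the custom-margin SVM primal (4)-(6) with cvxpy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))  # toy labels in {-1, +1}
phi = rng.uniform(0.5, 2.0, size=n)              # hypothetical weights phi_i > 0
C = 1.0

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * (phi @ xi))
constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
```

Intuitively, points with larger $\phi_i$ are charged more for their violations, so the solver tries harder to classify them with margin.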
(a) For the standard soft-margin SVM, we have shown that the constrained optimization problem is equivalent to the following unconstrained optimization problem, i.e., a regularized empirical risk minimization problem with the hinge loss:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^\top x_i - b)\bigr) \tag{7}$$
What's the corresponding unconstrained optimization problem for the SVM with custom margins?
(b) The dual of the standard soft-margin SVM is:

$$\max_{\alpha} \; \alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Q \alpha \tag{8}$$
$$\text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{9}$$
$$0 \le \alpha_i \le C, \quad i = 1, \cdots, n \tag{10}$$

where $Q = (\operatorname{diag} y)\, X X^\top (\operatorname{diag} y)$.
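As a quick sanity check on the notation, $Q$ can be formed in a couple of lines of numpy; the names below are hypothetical:

```python
# Sketch: Q = (diag y) X X^T (diag y), i.e. Q_ij = y_i * y_j * <x_i, x_j>.
import numpy as np

X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
Q = np.diag(y) @ X @ X.T @ np.diag(y)
# Equivalent and cheaper: Q = (y[:, None] * X) @ (y[:, None] * X).T
```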
What's the dual form of the SVM with custom margins? Show the derivation steps in detail.
(c) From the dual formulation above, how would you kernelize the SVM with custom margins? What role do the $\phi_i$ play in the kernelized version?
3 Kernel Logistic Regression
We have seen in lecture how to kernelize ridge regression and, by going to the dual formulation, how to kernelize soft-margin SVMs as well. Here, we will consider how to kernelize logistic regression and compare its performance to kernelized SVMs using a real-world dataset.
(a) Imagine we have $n$ different $d$-dimensional data points in an $n \times d$ matrix $X$, with associated labels in an $n$-dimensional vector $y$. Let this be a binary classification problem, so each label $y_i \in \{0, 1\}$. Remember, this means that each training data point is associated with a row of the matrix $X$ and the corresponding entry of the vector $y$.
Recall that logistic regression associates a point $x$ with a real number from 0 to 1 by computing

$$f(x) = \frac{1}{1 + \exp(-w^\top x)}.$$

This number can be interpreted as the estimated probability of the point $x$ having a true label of 1. Since this number is $\frac{1}{2}$ when $w^\top x = 0$, the sign of $w^\top x$ is what predicts the label of the test point $x$.
As you've seen in earlier homeworks, the loss function is defined to be

$$\sum_i \left[ -y_i \ln f(x_i) - (1 - y_i) \ln\bigl(1 - f(x_i)\bigr) \right], \tag{11}$$

where the label of the $i$th point $x_i$ is $y_i$.
Write down the gradient-descent update step for logistic regression, with step size γ.
Assume that we are working with the raw features X for now, with no kernelization. For convenience, define the logistic function s(·) to be
$$s(x) = \frac{1}{1 + e^{-x}}. \tag{12}$$
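As an implementation aside (not required for this part): computing $s(x)$ naively can overflow for large $|x|$, so a numerically stable version is often used in practice. A minimal sketch, with our own function name:

```python
import numpy as np

def sigmoid(x):
    # Stable logistic function: only ever exponentiates non-positive numbers.
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
```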
(b) You should have found that the update $w^{(t+1)} - w^{(t)}$ is a linear combination of the observations $\{x_i\}$. This suggests that gradient descent for logistic regression might be compatible with the kernel trick. After all, this is the same thing that we saw when we were doing least-squares iteratively by gradient descent, and that was certainly something that we could kernelize.
When we argued for the kernelization of least-squares, we did so by means of the augmented- features view of ridge regression. That had the pedagogic advantage of helping you internalize the importance of norm-minimization and made for an argument by which you could naturally discover the kernelization for yourself. Here, we will pursue a more direct path that has an inductive feeling to it.
Imagine that we start with some weight vector $w^{(t)} = X^\top a^{(t)}$ that is a linear combination of the $\{x_i\}$. (Notice that this holds in the base case if we start gradient descent from the zero vector.) Show that after one gradient step, $w^{(t+1)}$ will remain a linear combination of the $\{x_i\}$, and so is expressible as $X^\top a^{(t+1)}$ for some "dual weight vector" $a^{(t+1)}$. Then write down the gradient-descent update step for the dual weights $a^{(t)}$ directly, without referring to $w^{(t)}$ at all. In other words, tell us how $a^{(t+1)}$ is obtained from the data, the step size, and $a^{(t)}$.
(c) You should see from the previous part that the gradient-descent update step for $a^{(t)}$ can be written to depend solely on $XX^\top$, not on the individual $\{x_i\}$ in any other way. Since $XX^\top$ is just the Gram matrix of pairwise inner products of the training inputs, this suggests that we can use the kernel trick to quickly compute gradient steps for $a^{(t)}$, so long as we can compute the inner product of any pair of featurized observations.
Now suppose that you just have access to a similarity kernel function $k(x, z)$ that can be understood in terms of an implicit featurization $\phi(\cdot)$ so that $k(x, z) = \phi(x)^\top \phi(z)$. Describe how you would compute gradient-descent updates for the dual weights $a^{(t)}$, as well as how you would use the final weights together with the training data to classify a test point $x$.
Note: You do not have access to the implicit featurization φ(·). You have to use the similarity kernel k(·, ·) in your final answer.
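For instance, with the RBF kernel, the Gram matrix with entries $K_{ij} = k(x_i, x_j)$ can be computed directly from the raw inputs, without ever forming $\phi$. A minimal sketch under that assumption (the function name is ours):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), evaluated for all pairs of rows.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# rbf_kernel(X, X) plays the role of the Gram matrix in the dual updates;
# rbf_kernel(x_test[None, :], X) gives the similarities needed at test time.
```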
(d) You have now derived kernel logistic regression! Next, we will study how it relates to the kernel SVM, which we will do numerically. Complete all the parts in the associated Jupyter notebook.

4 Adversarial Examples
In this problem, suppose that we are learning parameters $\theta$ by trying to minimize the following robust optimization problem on the training data:

$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \max_{\delta_i \in S} \; \ell\bigl(f_\theta(x_i + \delta_i), y_i\bigr), \tag{13}$$
where $x_i \in \mathbb{R}^d$ is the $i$-th input and $y_i \in \{-1, +1\}$ is the binary label of the $i$-th data point. The loss $\ell(\cdot, \cdot)$ could be the square loss or the logistic loss. This looks familiar, except that there is this $\delta$ hanging around together with the inner maximization. Why?
In general, we might be concerned about how robust we are if an adversary can perturb our inputs. For example, consider the logistic loss with such a perturbation
$$\ell\bigl(f_\theta(x_i + \delta_i), y_i\bigr) = \ln\bigl(1 + \exp(-y_i \langle \theta, x_i + \delta_i \rangle)\bigr),$$

where $\delta_i \in S$ is the perturbation on $x_i$.
What we ideally want is for the adversary to not be able to cause us errors at test time by introducing such perturbations. One approach is to at least try to guarantee robustness to such perturbations on the training set. Our goal here becomes finding model parameters $\theta$ that minimize the loss function on a worst-case perturbation of the training dataset. That is what (13) is trying to achieve.
To have the problem be meaningful, we must constrain the adversary in some manner; otherwise the problem is hopeless. This is where the constraint set $S$ comes in. If we set $S = \{0\}$, then the problem (13) reduces to the standard training problem. To add some robustness, we could set the perturbation set to be $S = \{\delta : \|\delta\| \le \varepsilon\}$, where $\varepsilon$ is small compared with the size of typical training data points. (Note that such an approach to adversarial robustness is somewhat naive in
practice, as has been recently pointed out in a paper (arxiv.org/abs/2006.12655) by Laidlaw, Singla, and Feizi. This is because even large perturbations to inputs might be imperceptible to humans. But it is a start, and you have to start somewhere.)
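As an aside, if you ever want to sanity-check a robust loss empirically, one crude option is to approximate the inner maximization in (13) by sampling from $S$. A hypothetical sketch for the $\ell_\infty$ ball, where for linear models the maximizer sits at a corner:

```python
import numpy as np

def sampled_worst_case_loss(loss_fn, x, y, eps, n_samples=100, seed=0):
    # Monte-Carlo lower bound on max_{||delta||_inf <= eps} loss_fn(x + delta, y):
    # sample random corners of the infinity-norm ball and keep the worst one.
    rng = np.random.default_rng(seed)
    deltas = eps * rng.choice([-1.0, 1.0], size=(n_samples, x.shape[0]))
    return max(loss_fn(x + delta, y) for delta in deltas)
```

This only lower-bounds the true inner maximum; part (a) below shows that for linear models with logistic loss the maximum can instead be computed in closed form.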
Recall that the motivation for training models on worst-case perturbations is that we want our model to provide "robust" predictions. In other words, for test data $x_{\text{test}}$, the learned model $f_\theta$ should predict consistently over the whole perturbation set around $x_{\text{test}}$, i.e.,
$$f_\theta(x_{\text{test}}) \approx f_\theta(x_{\text{test}} + \delta_{\text{test}}), \quad \forall \delta_{\text{test}} \in S.$$
In practice, there exist data points, called adversarial examples, which in the context of images or videos are visually indistinguishable from normal examples but can fool/mislead machine-learning algorithms. This has been found to be quite a ubiquitous phenomenon and impacts audio data as well as textual data. Because it impacts all the kinds of data that humans have direct perceptual intuitions about, this phenomenon is believed to also impact machine learning on data sources that humans have no intuition about.
This problem is designed to help introduce you to this important phenomenon.
(a) One way of constraining the perturbation is to bound it in each entry. This is referred to as the infinity norm: $\|\delta\|_\infty \le \varepsilon$ means that $|\delta[j]| \le \varepsilon$ for all $j$.
For the linear model with logistic loss, prove the following:
$$\frac{1}{n} \sum_{i=1}^{n} \max_{\|\delta_i\|_\infty \le \varepsilon} \ln\bigl(1 + \exp(-y_i \langle \theta, x_i + \delta_i \rangle)\bigr) = \frac{1}{n} \sum_{i=1}^{n} \ln\bigl(1 + \exp(-y_i \langle \theta, x_i \rangle + \varepsilon \|\theta\|_1)\bigr), \tag{14}$$

and along the way prove

$$\delta_i^\star = \arg\max_{\|\delta_i\|_\infty \le \varepsilon} \ln\bigl(1 + \exp(-y_i \langle \theta, x_i + \delta_i \rangle)\bigr) = -\varepsilon \cdot y_i \cdot \operatorname{sign}(\theta), \tag{15}$$

where $y_i \in \{+1, -1\}$. This is all a consequence of the fact that $\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$.
You should be able to see that if the original constraint were on the traditional Euclidean norm of $\delta$, the role of the one-norm would be played by the traditional 2-norm itself, since it is its own dual norm. (Recall from optimization that if $\|x\|$ is a norm, then its dual norm $\|x\|_*$ is defined as $\|x\|_* = \max_{\|z\| \le 1} z^\top x$.)
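For a concrete feel for the dual-norm fact, take $\theta = (3, -1)$ and $\varepsilon = 0.5$. Then

$$\max_{\|\delta\|_\infty \le 0.5} \langle \theta, \delta \rangle = 0.5\,(|3| + |{-1}|) = 0.5\,\|\theta\|_1 = 2,$$

attained at $\delta = 0.5 \cdot \operatorname{sign}(\theta) = (0.5, -0.5)$: each coordinate of $\delta$ is pushed as far as allowed in the direction of the matching coordinate of $\theta$.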
(b) In this part, we consider a simple 2-d example. Without explicitly minimizing the worst-case loss in (13), we explore the decision boundaries of different models and whether those boundaries can avoid intersecting the perturbation sets around the training data points. Run Part (I) in the Jupyter notebook and describe the decision boundaries of (i) neural networks; (ii) linear models using random Fourier features; (iii) kernel ridge regression. Also, describe how the boundary changes when you
(a) Increase the number of Fourier features in (ii).
(b) Vary the $\gamma$ parameter of the RBF kernel from $10^{-3}$ to $10^{6}$.
(c) Now consider that we are trying to learn a binary linear classifier using the logistic loss on a subset of the MNIST dataset. We want to learn a classifier that distinguishes the digit '1' from the digit '3'. For this task, we define the perturbation set as
$$S = \{\delta : \|\delta\|_\infty \le 0.1\}.$$
Run Part (II) in the Jupyter notebook and finish the code for constructing the adversarial perturbations by using the results you proved in Part (a). Then report the accuracy of the learned linear model and of the learned kernel ridge regression on the adversarially perturbed test dataset (the adversarial perturbations are computed using only the information of the linear model).
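For reference, equation (15) translates almost directly into code once the linear model's $\theta$ is known. A minimal sketch with hypothetical names (the notebook's own variables may differ):

```python
import numpy as np

def linf_adversarial_examples(X, y, theta, eps=0.1):
    # Per equation (15): delta_i = -eps * y_i * sign(theta), the worst-case
    # l_inf perturbation for a linear model with logistic loss.
    return X - eps * y[:, None] * np.sign(theta)[None, :]
```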
5 Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous "Bloom's Taxonomy" that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g., make yourself flashcards). But we don't want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.
Contributors:
• Anant Sahai
• Chawin Sitawarin
• Kailas Vodrahalli
• Peter Wang
• Philipp Moritz
• Rahul Arya
• Yang Gao
• Yaodong Yu