
DS-GA-1003 – Spring 2022
Homework 3: SVMs & Kernel Methods
Due: Friday, March 4, 2022 at 11:59PM EST
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LaTeX, LyX, or MathJax via iPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better.
Update 2/23: Following feedback from the students, we wanted to reduce the length of the homework while still rewarding students who had already started answering the problems. So we have marked 7 questions as {BONUS}. {BONUS} questions do not need to be attempted, but if you have already completed them, you can submit them and have them graded. The difference between {BONUS} and Optional questions is that {BONUS} questions count towards this homework, while Optional questions count towards extra credit.


In this problem set we will get up to speed with SVMs and kernels. Though long at first glance, the problem set includes a lot of helpers. You will find a review of kernelization. One section includes a review of ridge regression, which you should be getting familiar with by now. For the second and third problems, some code is provided to save you time. Finally, some reminders on positive (semi)definite matrices are included in the Appendix.
1 Support Vector Machines: SVMs with Pegasos
In this first problem we will use Support Vector Machines to predict whether the sentiment of a movie review was positive or negative. We will represent each review by a vector x ∈ Rd, where d is the size of the word dictionary and xi is equal to the number of occurrences of the i-th word in the review x. The corresponding label is either y = 1 for a positive review or y = −1 for a negative review. In class we have seen how to transform the SVM training objective into a quadratic program using the dual formulation. Here we will use a gradient descent algorithm instead.
Subgradients
Recall that a vector g ∈ Rd is a subgradient of f : Rd → R at x if for all z, f(z) ≥ f(x) + gT (z − x).
There may be 0, 1, or infinitely many subgradients at any point. The subdifferential of f at a point x, denoted ∂f(x), is the set of all subgradients of f at x. A good reference for subgradients is the course notes on Subgradients by Boyd et al. Below we derive a property that will make our life easier for finding a subgradient of the hinge loss.
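To make the definition concrete, here is a standard illustration (not part of the assignment): take f(x) = |x| on R. At any x > 0 the only subgradient is 1, at any x < 0 it is −1, and at x = 0 every g ∈ [−1, 1] works, since |z| ≥ 0 + g(z − 0) = gz for all z. Hence ∂f(0) = [−1, 1], an example of a point with infinitely many subgradients.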
1. {BONUS} Suppose f1, . . . , fm : Rd → R are convex functions, and f(x) = max_{i=1,...,m} fi(x). Let k be any index for which fk(x) = f(x), and choose g ∈ ∂fk(x) (a convex function on Rd has a non-empty subdifferential at all points). Show that g ∈ ∂f(x).
2. {BONUS} Give a subgradient of the hinge loss objective J(w) = max{0, 1 − y wT x}.

3. {BONUS} Suppose we have a function f : Rn → R which is subdifferentiable everywhere, i.e. ∂f(x) ≠ ∅ for all x ∈ Rn. Show that f is convex. Note that, in the general case, a function is convex if for all a, b in the domain of f and for all θ ∈ (0, 1),
θf(a) + (1 − θ)f(b) ≥ f(θa + (1 − θ)b).
Hint: Suppose f is not convex. Then by definition there exists a point x0 ∈ (a, b) in some interval such that f(x0) lies above the line connecting (a, f(a)) and (b, f(b)). Is this possible if the function is subdifferentiable everywhere?
SVM with the Pegasos algorithm
You will train a Support Vector Machine using the Pegasos algorithm.¹ Recall the SVM objective using a linear predictor f(x) = wT x and the hinge loss:
min_{w ∈ Rd}  (λ/2) ∥w∥2 + (1/n) Σ_{i=1}^{n} max{0, 1 − yi wT xi},
where n is the number of training examples and d the size of the dictionary. Note that, for simplicity, we are leaving off the bias term b. Note also that we are using ℓ2 regularization with a parameter λ. Pegasos is stochastic subgradient descent using a step size rule ηt = 1/(λt) for iteration number t. The pseudocode is given below:
Input: λ > 0. Choose w1 = 0, t = 0
While termination condition not met
    For j = 1, . . . , n (assumes data is randomly permuted)
        t = t + 1
        ηt = 1/(tλ)
        If yj wtT xj < 1
            wt+1 = (1 − ηtλ) wt + ηt yj xj
        Else
            wt+1 = (1 − ηtλ) wt
4. Consider the SVM objective function for a single training point²: Ji(w) = (λ/2)∥w∥2 + max{0, 1 − yi wT xi}. The function Ji(w) is not differentiable everywhere. Specify where the gradient of Ji(w) is not defined. Give an expression for the gradient where it is defined.

5. Show that a subgradient of Ji(w) is given by

    gw = λw − yi xi    for yi wT xi < 1,
    gw = λw            for yi wT xi ≥ 1.

You may use the following facts without proof: 1) If f1, . . . , fn : Rd → R are convex functions and f = f1 + · · · + fn, then ∂f(x) = ∂f1(x) + · · · + ∂fn(x). 2) For α ≥ 0, ∂(αf)(x) = α ∂f(x). (Hint: Use the first part of this problem.)

¹ Shalev-Shwartz et al., Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
² Recall that if i is selected uniformly from the set {1, . . . , n}, then this objective function has the same expected value as the full SVM objective function.

Convince yourself that if your step size rule is ηt = 1/(λt), then doing SGD with the subgradient direction from the previous question is the same as given in the pseudocode.

Dataset and sparse representation

We will be using the Polarity Dataset v2.0, constructed by Pang and Lee, provided in the data reviews folder. It has the full text from 2000 movie reviews: 1000 reviews are classified as positive and 1000 as negative. Our goal is to predict whether a review has positive or negative sentiment from the text of the review. Each review is stored in a separate file: the positive reviews are in a folder called “pos”, and the negative reviews are in “neg”. We have provided some code in utils_svm_reviews.py to assist with reading these files. The code removes some special symbols from the reviews and shuffles the data. Load all the data to get an idea of what it looks like.

A usual method to represent text documents in machine learning is bag-of-words. As hinted above, every possible word in the dictionary is a feature, and the value of a word feature for a given text is the number of times that word appears in the text. As most words will not appear in any particular document, many of these counts will be zero. Rather than storing many zeros, we use a sparse representation, in which only the nonzero counts are tracked. The counts are stored in a key/value data structure, such as a dictionary in Python. For example, “Harry Potter and Harry Potter II” would be represented as the following Python dict: x = {'Harry': 2, 'Potter': 2, 'and': 1, 'II': 1}.

6. Write a function that converts an example (a list of words) into a sparse bag-of-words representation. You may find Python’s Counter³ class to be useful here. Note that a Counter is itself a dictionary.

³ https://docs.python.org/2/library/collections.html

7. Load all the data and split it into 1500 training examples and 500 validation examples. Format the training data as a list X_train of dictionaries and y_train as the list of corresponding 1 or −1 labels. Format the test set similarly.

We will be using linear classifiers of the form f(x) = wT x, and we can store the w vector in a sparse format as well, such as w = {'minimal': 1.3, 'Harry': -1.1, 'viable': -4.2, 'and': 2.2, 'product': 9.1}. The inner product between w and x would only involve the features that appear in both x and w, since whatever does not appear is assumed to be zero. For this example, the inner product would be x('Harry') * w('Harry') + x('and') * w('and') = 2*(-1.1) + 1*(2.2).
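To make the sparse representation concrete, here is a minimal sketch of the kind of helpers involved. The names sparse_bag_of_words, sparse_dot, and increment are illustrative; the versions provided in utils_svm_reviews.py (described in the next paragraph) may differ in name and signature.

from collections import Counter

def sparse_bag_of_words(words):
    # words: list of tokens from one review; returns a {word: count} dict
    return dict(Counter(words))

def sparse_dot(d1, d2):
    # inner product of two sparse vectors; missing keys count as zero
    if len(d1) > len(d2):
        d1, d2 = d2, d1                      # iterate over the smaller dict
    return sum(v * d2.get(k, 0) for k, v in d1.items())

def increment(d1, scale, d2):
    # d1 += scale * d2, in place, preserving the sparse representation
    for k, v in d2.items():
        d1[k] = d1.get(k, 0) + scale * v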
To help you along, utils_svm_reviews.py includes two functions for working with sparse vectors: 1) a dot product between two vectors represented as dictionaries, and 2) a function that increments one sparse vector by a scaled multiple of another vector, which is a very common operation. It is worth reading the code, even if you intend to implement it yourself. You may get some ideas on how to make things faster.

8. Implement the Pegasos algorithm to run on a sparse data representation. The output should be a sparse weight vector w represented as a dictionary. Note that our Pegasos algorithm starts at w = 0, which corresponds to an empty dictionary. Note: with this problem, you will need to take some care to code things efficiently. In particular, be aware that making copies of the weight dictionary can slow down your code significantly. If you want to make a copy of your weights (e.g. for checking for convergence), make sure you do not do this more than once per epoch. Also: if you normalize your data in some way, be sure not to destroy the sparsity of your data. Anything that starts as 0 should stay at 0.

Note that in every step of the Pegasos algorithm, we rescale every entry of wt by the factor (1 − ηtλ). Implementing this directly with dictionaries is very slow. We can make things significantly faster by representing w as w = sW, where s ∈ R and W ∈ Rd. You can start with s = 1 and W all zeros (i.e. an empty dictionary). Note that both updates (i.e. whether or not we have a margin error) start with rescaling wt, which we can do simply by setting st+1 = (1 − ηtλ) st.

9. If the update is wt+1 = (1 − ηtλ) wt + ηt yj xj, then verify that the Pegasos update step is equivalent to:

    st+1 = (1 − ηtλ) st
    Wt+1 = Wt + (1/st+1) ηt yj xj.

Implement the Pegasos algorithm with the (s, W) representation described above.⁴ (A sketch of this variant is given below, after question 12.)

⁴ There is one subtle issue with the approach described above: if we ever have 1 − ηtλ = 0, then st+1 = 0, and we will have a divide by 0 in the calculation for Wt+1. This only happens when ηt = 1/λ. With our step-size rule of ηt = 1/(λt), it happens exactly when t = 1. So one approach is to just start at t = 2. More generically, note that if st+1 = 0, then wt+1 = 0. Thus an equivalent representation is st+1 = 1 and W = 0. So if we ever get st+1 = 0, simply set it back to 1 and reset Wt+1 to zero, which is an empty dictionary in a sparse representation.

10. Run both implementations of Pegasos on the training data for a couple of epochs. Make sure your implementations are correct by verifying that the two approaches give essentially the same result. Report on the time taken to run each approach.

11. Write a function classification_error that takes a sparse weight vector w, a list of sparse vectors X and the corresponding list of labels y, and returns the fraction of errors when predicting yi using sign(wT xi). In other words, the function reports the 0-1 loss of the linear predictor f(x) = wT x.

12. Search for the regularization parameter that gives the minimal percent error on your test set. You should now use your faster Pegasos implementation, and run it to convergence. A good search strategy is to start with a set of regularization parameters spanning a broad range of orders of magnitude. Then continue to zoom in until you are convinced that additional search will not significantly improve your test performance. Plot the test errors you obtained as a function of the parameters λ you tested. (Hint: the error you get with the best regularization should be closer to 15% than 20%. If not, maybe you did not train to convergence.)
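Below is a minimal sketch of the (s, W) variant referenced in question 9 and of classification_error from question 11. It assumes the hypothetical sparse helpers sparse_dot and increment sketched earlier, and it is a starting point rather than a reference implementation.

def pegasos_sw(X, y, lambda_reg=0.1, epochs=10):
    # X: list of sparse dicts, y: list of +1/-1 labels (assumed already shuffled)
    s, W = 1.0, {}                              # w = s * W, so w starts at 0
    t = 1                                       # incremented before use, so we effectively start at t = 2 (see footnote 4)
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            t += 1
            eta = 1.0 / (lambda_reg * t)
            s *= (1.0 - eta * lambda_reg)       # rescaling step shared by both updates
            if y_j * s * sparse_dot(W, x_j) < 1:
                increment(W, eta * y_j / s, x_j)   # W += (1/s_{t+1}) * eta * y_j * x_j
    return {k: s * v for k, v in W.items()}     # return w = s * W as an ordinary sparse dict

def classification_error(w, X, y):
    # fraction of points with sign(w^T x_i) != y_i
    mistakes = sum(1 for x_i, y_i in zip(X, y) if y_i * sparse_dot(w, x_i) <= 0)
    return mistakes / len(y)

For question 10 you can compare the dictionary returned here, and its running time, against your direct dictionary implementation.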
Error Analysis

Recall that the score is the value of the prediction f(x) = wT x. We like to think that the magnitude of the score represents the confidence of the prediction. This is something we can directly verify or refute.

13. {BONUS} Break the predictions on the test set into groups based on the score (you can play with the size of the groups to get a result you think is informative). For each group, examine the percentage error. You can make a table or graph. Summarize the results. Is there a correlation between higher magnitude scores and accuracy?

In natural language processing one can often interpret why a model has performed well or poorly on a specific example. The first step in this process is to look closely at the errors that the model makes.

14. (Optional) Choose an input example x = (x1, . . . , xd) ∈ Rd that the model got wrong. We want to investigate what features contributed to this incorrect prediction. One way to rank the importance of the features to the decision is to sort them by the size of their contributions to the score. That is, for each feature we compute |wi xi|, where wi is the weight of the ith feature in the prediction function, and xi is the value of the ith feature in the input x. Create a table of the most important features, sorted by |wi xi|, including the feature name, the feature value xi, the feature weight wi, and the product wi xi. Attempt to explain why the model was incorrect. Include a short analysis for at least 2 incorrect examples. Can you think of new features that might help fix a problem? (Think of making groups of words.)

2 Kernel Methods

2.1 Kernelization review

Consider the following optimization problem on a data set (x1, y1), . . . , (xn, yn) ∈ Rd × Y:

    min_{w ∈ Rd}  R(√⟨w, w⟩) + L(⟨w, x1⟩, . . . , ⟨w, xn⟩),

where w, x1, . . . , xn ∈ Rd, and ⟨·, ·⟩ is the standard inner product on Rd. The function R : [0, ∞) → R is nondecreasing and gives us our regularization term, while L : Rn → R is arbitrary⁵ and gives us our loss term. We noted in lecture that this general form includes soft-margin SVM and ridge regression, though not lasso regression. Using the representer theorem, we showed that if the optimization problem has a solution, there is always a solution of the form w = Σ_{i=1}^{n} αi xi, for some α ∈ Rn. Plugging this into our original problem, we get the following “kernelized” optimization problem:

    min_{α ∈ Rn}  R(√(αT Kα)) + L(Kα),

where K ∈ Rn×n is the Gram matrix (or “kernel matrix”) defined by Kij = k(xi, xj) = ⟨xi, xj⟩. Predictions are given by f(x) = Σ_{i=1}^{n} αi k(xi, x), and we can recover the original w ∈ Rd by w = Σ_{i=1}^{n} αi xi.

⁵ You may be wondering “Where are the yi’s?” They are built into the function L. For example, a square loss on a training set of size 3 could be represented as L(s1, s2, s3) = (1/3)[(s1 − y1)2 + (s2 − y2)2 + (s3 − y3)2], where each si stands for the ith prediction ⟨w, xi⟩.

The kernel trick is to swap out occurrences of the kernel k (and the corresponding Gram matrix K) with another kernel. For example, we could replace k(xi, xj) = ⟨xi, xj⟩ by k′(xi, xj) = ⟨ψ(xi), ψ(xj)⟩ for an arbitrary feature mapping ψ : Rd → Rd. In this case, the recovered w ∈ Rd would be w = Σ_{i=1}^{n} αi ψ(xi) and predictions would be ⟨w, ψ(xi)⟩.
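As a quick numerical sanity check of the feature-map view above (illustrative only; the explicit map ψ below is one valid choice for this particular kernel): for the degree-2 polynomial kernel k(x, z) = (1 + ⟨x, z⟩)2 on R2, the map ψ(x) = (1, √2 x1, √2 x2, x1^2, √2 x1x2, x2^2) satisfies k(x, z) = ⟨ψ(x), ψ(z)⟩.

import numpy as np

def psi(x):
    # explicit feature map for k(x, z) = (1 + <x, z>)^2 with x in R^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
# the kernel evaluation and the inner product in feature space agree
assert np.isclose((1 + x @ z) ** 2, psi(x) @ psi(z))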
More interestingly, we can replace k by another kernel k′′(xi, xj) for which we do not even know or cannot explicitly write down a corresponding feature map ψ. Our main example of this is the RBF kernel

    k(x, x′) = exp(−∥x − x′∥2 / (2σ2)),

for which the corresponding feature map ψ is infinite-dimensional. In this case, we cannot recover w since it would be infinite-dimensional. Predictions must be done using α ∈ Rn, with f(x) = Σ_{i=1}^{n} αi k(xi, x).

Your implementation of kernelized methods below should not make any reference to w or to a feature map ψ. Your learning routine should return α, rather than w, and your prediction function should also use α rather than w. This will allow us to work with kernels that correspond to infinite-dimensional feature vectors.

2.2 Kernel problems

Ridge Regression: Theory

Suppose our input space is X = Rd and our output space is Y = R. Let D = {(x1, y1), . . . , (xn, yn)} be a training set from X × Y. We will use the “design matrix” X ∈ Rn×d, which has the input vectors as rows:

    X = [ −x1− ]
        [   ⋮  ]
        [ −xn− ]

Recall the ridge regression objective function:

    J(w) = ∥Xw − y∥2 + λ∥w∥2,

for λ > 0.
15. Show that for w to be a minimizer of J(w), we must have XT Xw + λIw = XT y. Show that the minimizer of J(w) is w = (XT X + λI)−1 XT y. Justify that the matrix XT X + λI is invertible for λ > 0. (You should use properties of positive (semi)definite matrices. If you need a reminder, look at the Appendix.)
16. Rewrite XT Xw + λIw = XT y as w = (1/λ)(XT y − XT Xw). Based on this, show that we can write w = XT α for some α, and give an expression for α.
17. Based on the fact that w = XT α, explain why we say w is “in the span of the data.”
18. Show that α = (λI + XXT )−1y. Note that XXT is the kernel matrix for the standard vector dot product. (Hint: Replace w by XT α in the expression for α, and then solve for α.)
−x1− .
X= . . −xn −

19. Give a kernelized expression for Xw, the predicted values on the training points. (Hint: Replace w by XT α and α by its expression in terms of the kernel matrix XXT.)
20. Give an expression for the prediction f(x) = xT w∗ for a new point x, not in the training set. The expression should only involve x via inner products with other x’s. (Hint: It is often convenient to define the column vector

    kx = [ xT x1 ]
         [   ⋮   ]
         [ xT xn ]

to simplify the expression.)
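Putting the pieces above together, here is a minimal numpy sketch of kernelized ridge regression. It simply uses the expression α = (λI + XXT)−1 y stated in question 18 and the prediction f(x) = kxT α suggested by the hint; the function names are illustrative, and the linear kernel is used only to keep the example short.

import numpy as np

def train_kernel_ridge(K, y, lambda_reg):
    # alpha = (lambda*I + K)^{-1} y, with K the n x n Gram matrix (question 18)
    n = K.shape[0]
    return np.linalg.solve(lambda_reg * np.eye(n) + K, y)

def predict_kernel_ridge(k_x, alpha):
    # f(x) = k_x^T alpha, where k_x = (x^T x_1, ..., x^T x_n) for the linear kernel
    return k_x @ alpha

# toy example with the linear kernel k(x, x') = <x, x'>
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 2.5])
alpha = train_kernel_ridge(X @ X.T, y, lambda_reg=0.1)
x_new = np.array([1.5])
print(predict_kernel_ridge(X @ x_new, alpha))    # prediction at x = 1.5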
Kernels and Kernel Machines
There are many different families of kernels. So far we have spoken about linear kernels, RBF/Gaussian kernels, and polynomial kernels. The last two kernel types have parameters. In this section, we will implement these kernels in a way that will be convenient for implementing our kernelized ridge regression later on. For simplicity, we will assume that our input space is X = R. This allows us to represent a collection of n inputs in a matrix X ∈ Rn×1. You should now refer to the Jupyter notebook skeleton code kernels.ipynb.
21. Write functions that compute the RBF kernel kRBF(σ)(x, x′) = exp(−∥x − x′∥2 / (2σ2)) and the polynomial kernel kpoly(a,d)(x, x′) = (a + ⟨x, x′⟩)d. The linear kernel klinear(x, x′) = ⟨x, x′⟩ has been done for you in the support code. Your functions should take as input two matrices W ∈ Rn1×d and X ∈ Rn2×d and should return a matrix M ∈ Rn1×n2 where Mij = k(Wi·, Xj·). In words, the (i, j)’th entry of M should be the kernel evaluation between wi (the ith row of W) and xj (the jth row of X). For the RBF kernel, you may use the scipy function cdist(X1, X2, 'sqeuclidean') in the package scipy.spatial.distance.
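One possible implementation sketch follows. The exact function names and signatures in the skeleton code may differ, so treat RBF_kernel and polynomial_kernel here as assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def RBF_kernel(W, X, sigma):
    # M[i, j] = exp(-||W_i - X_j||^2 / (2 sigma^2))
    sq_dists = cdist(W, X, 'sqeuclidean')
    return np.exp(-sq_dists / (2 * sigma ** 2))

def polynomial_kernel(W, X, offset, degree):
    # M[i, j] = (offset + <W_i, X_j>)^degree
    return (offset + W @ X.T) ** degree

def linear_kernel(W, X):
    # M[i, j] = <W_i, X_j>  (already provided in the support code)
    return W @ X.T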
22. Use the linear kernel function defined in the code to compute the kernel matrix on the set of points x0 ∈ DX = {−4, −1, 0, 2}. Include both the code and the output.
23. Suppose we have the data set DX,y = {(−4, 2), (−1, 0), (0, 3), (2, 5)} (in each set of parentheses, the first number is the value of xi and the second number the corresponding value of the target yi). Then by the representer theorem, the final prediction function will be in the span of the functions x ↦ k(x0, x) for x0 ∈ DX = {−4, −1, 0, 2}. This set of functions will look quite different depending on the kernel function we use. The set of functions x ↦ klinear(x0, x) for x0 ∈ DX and for x ∈ [−6, 6] has been provided for the linear kernel.
(a) Plot the set of functions x ↦ kpoly(1,3)(x0, x) for x0 ∈ DX and for x ∈ [−6, 6].
(b) Plot the set of functions x ↦ kRBF(1)(x0, x) for x0 ∈ DX and for x ∈ [−6, 6].
Note that the values of the parameters of the kernels you should use are given in their definitions in (a) and (b).
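For these plots, one possible approach (assuming the hypothetical RBF_kernel sketched after question 21 and matplotlib; the polynomial case is analogous) is:

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-6, 6, 300).reshape(-1, 1)        # evaluation grid as an n x 1 matrix
prototypes = np.array([[-4.0], [-1.0], [0.0], [2.0]])

K = RBF_kernel(prototypes, xs, sigma=1)            # row i is the function x -> kRBF(1)(x0_i, x)
for x0, row in zip(prototypes.ravel(), K):
    plt.plot(xs.ravel(), row, label="x0 = %g" % x0)
plt.legend()
plt.show()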
24. By the representer theorem, the final prediction function will be of the form f(x) = Σ_{i=1}^{n} αi k(xi, x), where x1, . . . , xn ∈ X are the inputs in the training set. We will use the class Kernel_Machine in the skeleton code to make predictions with different kernels.
xTx1 .
kx= .  xT xn

Complete the predict function of the class Kernel_Machine. Construct a Kernel_Machine object with the RBF kernel (sigma=1), with prototype points at −1, 0, 1 and corresponding weights αi equal to 1, −1, 1. Plot the resulting function.
Note: For this last problem, and for other problems below, it may be helpful to use partial application on your kernel functions. For example, if your polynomial kernel function has signature polynomial_kernel(W, X, offset, degree), you can write k = functools.partial(polynomial_kernel, offset=2, degree=2), and then a call to k(W, X) is equivalent to polynomial_kernel(W, X, offset=2, degree=2).
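A sketch of what the Kernel_Machine class and the partial application might look like (the attribute and parameter names here are assumptions rather than the skeleton's actual structure; RBF_kernel is the hypothetical function sketched after question 21):

import functools
import numpy as np

class Kernel_Machine:
    def __init__(self, kernel, prototype_points, weights):
        # kernel(W, X) returns the Gram matrix between the rows of W and X
        self.kernel = kernel
        self.prototype_points = prototype_points   # n x d array of prototype points x_i
        self.weights = weights                     # length-n array of alpha_i

    def predict(self, X):
        # f(x) = sum_i alpha_i k(x_i, x), evaluated for every row x of X
        K = self.kernel(X, self.prototype_points)  # shape: (num queries, num prototypes)
        return K @ self.weights

# RBF kernel with sigma = 1, prototypes -1, 0, 1 and weights 1, -1, 1
k = functools.partial(RBF_kernel, sigma=1)
machine = Kernel_Machine(k, np.array([[-1.0], [0.0], [1.0]]), np.array([1.0, -1.0, 1.0]))
print(machine.predict(np.array([[0.5], [2.0]])))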
