Homework 3: SVM and Sentiment Analysis
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LATEX, LYX, or MathJax via iPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LATEX document. If you are using LYX, then the listings package tends to work better.
1 Introduction
In this assignment, we’ll be working with natural language data. In particular, we’ll be doing sentiment analysis on movie reviews. This problem will give you the opportunity to try your hand at feature engineering, which is one of the most important parts of many data science problems. From a technical standpoint, this homework has two new pieces. First, you’ll be implementing Pegasos. Pegasos is essentially stochastic subgradient descent for the SVM with a particular schedule for the step-size. Second, because in natural langauge domains we typically have huge feature spaces, we work with sparse representations of feature vectors, where only the non-zero entries are explicitly recorded. This will require coding your gradient and SGD code using hash tables (dictionaries in Python), rather than numpy arrays. We begin with some practice with subgradients and an easy problem that introduces the Perceptron algorithm.
2 Calculating Subgradients
Recallthatavectorg∈Rd isasubgradientoff:Rd →Ratxifforallz, f(z) ≥ f(x) + gT (z − x).
As we noted in lecture, there may be 0, 1, or infinitely many subgradients at any point. The subdifferential of f at a point x, denoted ∂f(x), is the set of all subgradients of f at x.
Just as there is a calculus for gradients, there is a calculus for subgradients1. For our purposes, we can usually get by using the definition of subgradient directly. However, in the first problem below we derive a property that will make our life easier for finding a subgradient of the hinge loss and perceptron loss.
1. [Subgradients for pointwise maximum of functions] Suppose f1, . . . , fm : Rd → R are convex functions, and
f(x) = max fi(x). i=1,…,,m
1A good reference for subgradients are the course notes on Subgradients by Boyd et al. 1
Let k be any index for which fk(x) = f(x), and choose g ∈ ∂fk(x). [We are using the fact that a convex function on Rd has a non-empty subdifferential at all points.] Show that g ∈ ∂f(x).
2. [Subgradient of hinge loss for linear prediction] Give a subgradient of
J(w) = max 0, 1 − ywT x .
3 Perceptron
The perceptron algorithm is often the first classification algorithm taught in machine learning classes. Suppose we have a labeled training set (x1, y1) , . . . , (xn, yn) ∈ Rd × {−1, 1}. In the perceptron algorithm, we are looking for a hyperplane that perfectly separates the classes. That is, we’re looking for w ∈ Rd such that
yiwTxi >0∀i∈{1,…,n}.
Visually, this would mean that all the x’s with label y = 1 are on one side of the hyperplane x | wT x = 0, and all the x′s with label y = −1 are on the other side. When such a hyperplane exists, we say that the data are linearly separable. The perceptron algorithm is given in Algorithm 1.
Algorithm 1: Perceptron Algorithm
input: Training set (x1,y1),…,(xn,yn)∈Rd ×{−1,1} w(0) = (0,…,0) ∈ Rd
k=0 # step number
repeat
all_correct = TRUE
for i = 1,2,…,n # loop through data
if (yixTi w(k) ≤ 0)
w(k+1) = w(k) + yixi all_correct = FALSE
else
w(k+1) = w(k)
end if
k=k+1 end for
until (all_correct == TRUE) return w(k)
There is also something called the perceptron loss, given by l(yˆ, y) = max {0, −yˆy} .
In this problem we will see why this loss function has this name.
2
4
1.
2.
3.
Showthatifx|wTx=0isaseparatinghyperplaneforatrainingsetD=((x1,y1),…,(xn,yn)), then the average perceptron loss on D is 0. Thus any separating hyperplane of D is an em- pirical risk minimizer for perceptron loss.
Let H be the linear hypothesis space consisting of functions x → wT x. Consider running stochastic subgradient descent (SSGD) to minimize the empirical risk with the perceptron loss. We’ll use the version of SSGD in which we cycle through the data points in each epoch. Show that if we use a fixed step size 1, we terminate when our training data are separated, and we make the right choice of subgradient, then we are exactly doing the Perceptron algorithm.
Suppose the perceptron algorithm returns w. Show that w is a linear combination of the input points. That is, we can write w = ni=1 αixi for some α1,…,αn ∈ R. The xi for which αi ̸= 0 are called support vectors. Give a characterization of points that are support vectors and not support vectors.
The Data
We will be using the Polarity Dataset v2.0, constructed by Pang and Lee. It has the full text from 2000 movies reviews: 1000 reviews are classified as “positive” and 1000 as “negative.” Our goal is to predict whether a review has positive or negative sentiment from the text of the review. Each review is stored in a separate file: the positive reviews are in a folder called “pos”, and the negative reviews are in “neg”. We have provided some code in load.py to assist with reading these files. You can use the code, or write your own version. The code removes some special symbols from the reviews. Later you can check if this helps or hurts your results.
1. Load all the data and randomly split it into 1500 training examples and 500 validation exam- ples.
5 Sparse Representations
The most basic way to represent text documents for machine learning is with a “bag-of-words” representation. Here every possible word is a feature, and the value of a word feature is the number of times that word appears in the document. Of course, most words will not appear in any particular document, and those counts will be zero. Rather than store a huge number of zeros, we use a sparse representation, in which we only store the counts that are nonzero. The counts are stored in a key/value store (such as a dictionary in Python). For example, “Harry Potter and Harry Potter II” would be represented as the following Python dict: x={’Harry’:2, ’Potter’:2, ’and’:1, ’II’:1}. We will be using linear classifiers of the form f(x) = wTx, and we can store the w vector in a sparse format as well, such as w={’minimal’:1.3, ’Harry’:-1.1, ’viable’:-4.2, ’and’:2.2, ’product’:9.1}. The inner product between w and x would only involve the features that appear in both x and w, since whatever doesn’t appear is assumed to be zero. For this example, the inner product would be x[Harry] * w[Harry] + x[and] * w[and] = 2*(-1.1) + 1*(2.2). To help you along, we’ve included two functions for working with sparse vectors: 1) a dot product between two vectors represented as dict’s and 2) a function
3
that increments one sparse vector by a scaled multiple of another vector, which is a very common operation. These functions are located in util.py. It is worth reading the code, even if you intend to implement it yourself. You may get some ideas on how to make things faster.
1. Write a function that converts an example (e.g. a list of words) into a sparse bag-of-words representation. You may find Python’s Counter class to be useful here: https://docs. python.org/2/library/collections.html. Note that a Counter is also a dict.
6 Support Vector Machine via Pegasos
In this question you will build an SVM using the Pegasos algorithm. To align with the notation used in the Pegasos paper2, we’re considering the following formulation of the SVM objective function:
subgradient descent using a step size rule ηt = 1/ (λt). The pseudocode is given below:
Input: λ>0. Choosew1 =0,t=0 While termination condition not met
For j = 1, . . . , m (assumes data is randomly permuted) t=t+1
ηt = 1/ (tλ);
IfyjwtTxj <1
wt+1 = (1 − ηtλ)wt + ηtyjxj Else
wt+1 = (1 − ηtλ)wt
1. [Written] Consider the “stochastic” SVM objective function, which is the SVM objective func-
tion with a single training point3: Ji(w) = λ ∥w∥2 + max 0, 1 − yiwT xi. The function Ji(θ) 2
is not differentiable everywhere. Give an expression for the gradient of Ji(w) where it’s de- fined, and specify where it is not defined.
λ 1m
min ∥w∥2 + max0,1−yiwTxi.
w∈Rd 2 m
Note that, for simplicity, we are leaving off the unregularized bias term b. Pegasos is stochastic
i=1
2. [Written] Show that a subgradient of Ji(w) is given by λw−yixi foryiwTxi <1
g = λw for yiwT xi ≥ 1.
You may use the following facts without proof: 1) If f1, . . . , fm : Rd → R are convex functions andf =f1+···+fm,then∂f(x)=∂f1(x)+···+∂fm(x). 2)Forα≥0,∂(αf)(x)=α∂f(x). [Hint: Use the rules provided and the calculation in the first problem.]
2Shalev-Shwartz et al.’s “Pegasos: Primal Estimated sub-GrAdient SOlver for SVM”.
3Recall that if i is selected uniformly from the set {1, . . . , m}, then this stochastic objective function has the same expected value as the full SVM objective function.
4
3. [Written] Show that if your step size rule is ηt = 1/ (λt), then doing SGD with the subgradient direction from the previous problem is the same as given in the pseudocode.
4. Implement the Pegasos algorithm to run on a sparse data representation. The output should be a sparse weight vector w. Note that our Pegasos algorithm starts at w = 0. In a sparse representation, this corresponds to an empty dictionary. Note: With this problem, you will need to take some care to code things efficiently. In particular, be aware that making copies of the weight dictionary can slow down your code significantly. If you want to make a copy of your weights (e.g. for checking for convergence), make sure you don’t do this more than once per epoch. Also: If you normalize your data in some way, be sure not to destroy the sparsity of your data. Anything that starts as 0 should stay at 0.
5. Note that in every step of the Pegasos algorithm, we rescale every entry of wt by the factor (1 − ηtλ). Implementing this directly with dictionaries is very slow. We can make things significantly faster by representing w as w = sW, where s ∈ R and W ∈ Rd. You can start with s = 1 and W all zeros (i.e. an empty dictionary). Note that both updates (i.e. whether or not we have a margin error) start with rescaling wt, which we can do simply by setting st+1 = (1−ηtλ)st. If the update is wt+1 = (1−ηtλ)wt +ηtyjxj, then verify that the Pegasos update step is equivalent to:
st+1 = (1−ηtλ)st
Wt+1 = Wt + 1 ηtyjxj.
st+1
There is one subtle issue with the approach described above: if we ever have 1 − ηtλ = 0, then st+1 = 0, and we’ll have a divide by 0 in the calculation for Wt+1. This only happens when ηt = 1/λ. With our step-size rule of ηt = 1/ (λt), it happens exactly when t = 1. So one approach is to just start at t = 2. More generically, note that if st+1 = 0, then wt+1 = 0. Thus an equivalent representation is st+1 = 1 and W = 0. Thus if we ever get st+1 = 0, simply set it back to 1 and reset Wt+1 to zero, which is an empty dictionary in a sparse representation. Implement the Pegasos algorithm with the (s, W ) representation described above. [See section 5.1 of Leon Bottou’s Stochastic Gradient Tricks for a more generic version of this technique, and many other useful tricks.]
6. Run both implementations of Pegasos on the training data for a couple epochs (using the bag-of-words feature representation described above). Make sure your implementations are correct by verifying that the two approaches give essentially the same result. Report on the time taken to run each approach.
7. Write a function that takes a sparse weight vector w and a collection of (x,y) pairs, and returns the percent error when predicting y using sign(wT x). In other words, the function reports the 0-1 loss of the linear predictor x → wT x.
5
8.
9.
10.
7
Using the bag-of-words feature representation described above, search for the regularization parameter that gives the minimal percent error on your test set. (You should now use your faster Pegasos implementation, and run it to convergence.) A good search strategy is to start with a set of regularization parameters spanning a broad range of orders of magnitude. Then, continue to zoom in until you’re convinced that additional search will not significantly im- prove your test performance. Once you have a sense of the general range of regularization parameters that give good results, you do not have to search over orders of magnitude every time you change something (such as adding a new feature).
[Optional] Recall that the “score” is the value of the prediction f(x) = wT x. We like to think that the magnitude of the score represents the confidence of the prediction. This is something we can directly verify or refute. Break the predictions into groups based on the score (you can play with the size of the groups to get a result you think is informative). For each group, examine the percentage error. You can make a table or graph. Summarize the results. Is there a correlation between higher magnitude scores and accuracy?
[Optional] Our objective is not differentiable when yiwT xi = 1. Investigate how often and when we have yiwT xi = 1 (or perhaps within a small distance of 1 – this is for you to ex- plore) . Describe your findings. If we didn’t know about subgradients, one might suggest just skipping the update when ywT xi = 1. Does this seem reasonable? What about shortening the step size by a small percentage?
Error Analysis
The natural language processing domain is particularly nice in that one can often interpret why a model has performed well or poorly on a specific example, and sometimes it is not very difficult to come up with ideas for new features that might help fix a problem. The first step in this process is to look closely at the errors that our model makes.
1. Choose an input example x = (x1, . . . , xd) ∈ Rd that the model got wrong. We want to inves- tigate what features contributed to this incorrect prediction. One way to rank the importance of the features to the decision is to sort them by the size of their contributions to the score. That is, for each feature we compute |wixi|, where wi is the weight of the ith feature in the prediction function, and xi is the value of the ith feature in the input x. Create a table of the most important features, sorted by |wixi|, including the feature name, the feature value xi, the feature weight wi, and the product wixi. Attempt to explain why the model was incorrect. Can you think of a new feature that might be able to fix the issue? Include a short analysis for at least 2 incorrect examples.
8 Features
For a problem like this, the features you use are far more important than the learning model you choose. Whenever you enter a new problem domain, one of your first orders of business is to beg,
6
borrow, or steal the best features you can find. This means looking at any relevant published work and seeing what they’ve used. Maybe it means asking a colleague what features they use. But eventually you’ll need to engineer new features that help in your particular situation. To get ideas for this dataset, you might check the discussion board on this Kaggle competition, which is using a very similar dataset. There are also a very large number of academic research papers on sentiment analysis that you can look at for ideas.
1. [Optional] Based on your error analysis, or on some idea you have, construct a new feature (or group of features) that you hope will improve your test performance. Describe the features and what kind of improvement they give. At this point, it’s important to consider the stan- dard errors p(1 − p)/n (where p is the proportion of the test examples you got correct, and n is the size of the test set) on your performance estimates, to know whether the improvement is statistically significant.
2. [Optional] Try to get the best performance possible by generating lots of new features, chang- ing the pre-processing, or any other method you want, so long as you are using the same core SVM model. Describe what you tried, and how much improvement each thing brought to the model. To get you thinking on features, here are some basic ideas of varying quality: 1) how many words are in the review? 2) How many “negative” words are there? (You’d have to construct or find a list of negative words.) 3) Word n-gram features: Instead of single-word features, you can make every pair of consecutive words a feature. 4) Character n-gram fea- tures: Ignore word boundaries and make every sequence of n characters into a feature (this will be a lot). 5) Adding an extra feature whenever a word is preceded by “not”. For example “not amazing” becomes its own feature. 6) Do we really need to eliminate those funny char- acters in the data loading phase? Might there be useful signal there? 7) Use tf-idf instead of raw word counts. The tf-idf is calculated as
tfidf(fi) = FFi (1) log(DFi)
where F Fi is the feature frequency of feature fi and DFi is the number of document containing fi. In this way we increase the weight of rare words. Sometimes this scheme helps, sometimes it makes things worse. You could try using both! [Extra credit points will be awarded in proportion to how much improvement you achieve.]
7