CSE 5525 Homework #1: Sentiment Classification
The main goal of Assignment #1 is for you to get familiar with extracting features and training classifiers on text data. You’ll get a sense of what the standard machine learning workflow looks like (reading in data, training, and testing), how standard learning algorithms work, and how the feature design process goes.
Timeline & Credit
You will have around two weeks to work on this programming assignment. The homework is graded on a 100-point scale and counts for 10% of your final grade.
Questions?
Please post on Microsoft Teams to get timely help from other students, the TA, and the instructor. Remember that participation counts for 5% of your final grade; you can demonstrate participation by actively answering others’ questions. Everyone also benefits from seeing what has been asked previously, so please avoid emailing the instructor/TA directly. Thanks!
1 Dataset and Code
Please use Python 3.5+ for this project. You may find numpy useful for storing and manipulating vectors in this project, though it is not strictly required. The easiest way to install numpy is to install anaconda,1 which includes useful packages for scientific computing and is a handy package manager that will make it easier to install PyTorch for future assignments.
You’ll be using the movie review dataset in Socher et al. [1]. This is a dataset of movie review snippets taken from Rotten Tomatoes. We are tackling a simplified version of this task which frequently appears in the literature: positive/negative binary sentiment classification of sentences, with neutral sentences discarded from the dataset. The data files given to you contain newline-separated sentiment examples, consisting of a label (0 or 1) followed by a tab, followed by the sentence, which has been tokenized but not lowercased. The data has been split into a train, development (dev), and blind test set. On the blind test set, you do not see the labels; only the sentences are given to you. The framework code reads these in for you.
1 https://docs.anaconda.com/anaconda/install/
1.2 Getting started
Download the code and data, expand the archive, and change into the resulting directory. To confirm everything is working properly, run:
python sentiment_classifier.py --model TRIVIAL --no_run_on_test
This loads the data, instantiates a TrivialSentimentClassifier that always returns 1 (positive), and evaluates it on the training and dev sets. The reported dev accuracy should be Accuracy: 444 / 872 = 0.509174. Always predicting positive isn’t so good!
1.3 Framework code
You have been given the framework code to start with:
sentiment_classifier.py is the main driver. Do not modify this file for your final submission (you can do whatever you need before submission, though). The main method loads the data, initializes the feature extractor, trains the model, evaluates it on train, dev, and blind test, and writes the blind test results to a file. It uses argparse to read in several command line arguments. You should generally not need to modify the paths. --model and --feats control the model specification. This file also contains the evaluation code.
models.py is the primary file you’ll modify. It defines base classes for the FeatureExtractor and the classifiers, and defines train_logistic_regression, which you will be implementing. train_model is your entry point, which you may modify if needed.
2 What You Need to Do
2.1 Perceptron (40 points)
In this part, you should implement a perceptron classifier with a bag-of-words unigram featurization, as discussed in lecture and in the textbook. This will require modifying train_perceptron, UnigramFeatureExtractor, and PerceptronClassifier, all in models.py.
train_perceptron should handle the processing of the training data using the feature extractor, applying the very simple update rule you learned in Lecture 2.
PerceptronClassifier should take the results of that training procedure (model weights) and use them to do inference.
Feature extraction First, you will need a way of mapping from sentences (lists of strings) to feature vectors, a process called feature extraction or featurization. A unigram feature vector will be a sparse vector with a length equal to the vocabulary size. There is no single right way to define unigram features. For example, do you want to throw out low-count words? Do you want to lowercase them? Do you want to discard stopwords? Do you want the value in the feature vector to be 0/1 for the absence or presence of a word, or reflect its count in the given sentence?
You can use the provided Indexer class in utils.py to map from string-valued feature names to indices. Note that later in this assignment when you have other types of features in the mix (e.g., bigrams in Part 3), you can still get away with just using a single Indexer: you can encode your features with “magic words” like Unigram=great and Bigram=great|movie. This is a good strategy for managing complex feature sets.
There are two approaches you can take to extract features for training: (1) extract features “on-the-fly” during training and grow your weight vector as you add features; (2) iterate through all training points and pre-extract features so you know how many there are in advance (optionally: build a feature cache to speed things up for the next pass).
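If you take the second approach, a minimal pre-extraction sketch might look like the following; the example fields (ex.words, ex.label) and the extract_features call are assumptions about the starter code, not its exact API:

    def build_feature_cache(train_exs, feat_extractor):
        # Extract features once so later epochs just look them up.
        # Assumes each training example has .words and .label, and that
        # extract_features returns a sparse Counter (hypothetical signature).
        cache = []
        for ex in train_exs:
            feats = feat_extractor.extract_features(ex.words)
            cache.append((feats, ex.label))
        return cache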
Feature vectors Since there are a large number of possible features, it is always preferable to represent feature vectors sparsely. That is, if you are using unigram features with a 10,000-word vocabulary, you should not instantiate a 10,000-dimensional vector for each example, as this is very inefficient. Instead, you should maintain a list of only the nonzero features and their counts. Our starter code suggests Counter from the collections module as the return type for the extract_features method; this class is a convenient map from objects to floats and is useful for storing sparse vectors like this.
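For illustration, here is a minimal sketch of a unigram extractor that returns a Counter over feature indices; the Indexer methods used (add_and_get_index, index_of) are assumptions about the class in utils.py, so check the actual API before reusing this:

    from collections import Counter

    def extract_unigram_features(sentence, indexer, add_to_indexer=False):
        # Map a tokenized sentence (list of strings) to a sparse Counter of
        # feature index -> count. Lowercasing and using counts (vs. 0/1 values)
        # are design choices, not requirements.
        feats = Counter()
        for word in sentence:
            name = "Unigram=" + word.lower()
            # add_and_get_index / index_of are one common Indexer API; check utils.py
            if add_to_indexer:
                idx = indexer.add_and_get_index(name)
            else:
                idx = indexer.index_of(name)
            if idx >= 0:  # index_of is assumed to return -1 for unseen features
                feats[idx] += 1
        return feats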
Weight vector The most efficient way to store the weight vector is a fixed-size numpy array.
Perceptron and randomness Throughout this course, the examples in our training sets are not necessarily randomly ordered. You should make sure to randomly shuffle the data before iterating through it. Even better, do a fresh random shuffle every epoch.
Random seed If you do use randomness, you can either fix the random seed or leave it variable. Fixing the seed (with random.seed) can make the behavior you observe consistent, which can make debugging or regression testing easier (e.g., ensuring that code refactoring didn’t actually change the results). However, your results are not guaranteed to be exactly the same as in the autograder environment.
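Tying these pieces together, here is a minimal sketch of a perceptron training loop with a numpy weight vector, a fixed seed, and a per-epoch shuffle; the example fields and extractor call are assumed names, and the epoch count and step size are placeholders:

    import random
    import numpy as np

    def train_perceptron_sketch(train_exs, feat_extractor, vocab_size, epochs=20, lr=1.0):
        # Sketch only: field/method names (ex.words, ex.label, extract_features)
        # are assumptions about the starter code, and epochs/lr are placeholders.
        random.seed(27)                  # optional: fix the seed for reproducibility
        weights = np.zeros(vocab_size)   # fixed-size weight vector
        for _ in range(epochs):
            random.shuffle(train_exs)    # reshuffle the training data every epoch
            for ex in train_exs:
                feats = feat_extractor.extract_features(ex.words)
                score = sum(weights[i] * v for i, v in feats.items())
                pred = 1 if score > 0 else 0
                if pred != ex.label:     # mistake-driven update
                    direction = 1 if ex.label == 1 else -1
                    for i, v in feats.items():
                        weights[i] += lr * direction * v
        return weights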
Q1 (40 points) Implement the unigram perceptron. While we don’t require you to obtain a certain accuracy for full credit, for your reference, in this setting you should get ∼74% accuracy on the development set, and training and evaluation (the printed time) should take about 20 seconds.2 To hit this performance target, you may need to adjust the learning rate schedule and tweak the preprocessing.
Exploration (optional) Try at least two different “schedules” for the perceptron step size (one of them can be a constant schedule, e.g., 0.01, 0.1, or 1). One common schedule decreases the step size by some factor every epoch or every few epochs; another decreases it like 1/t. How do the results change?
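If it helps, the two schedules described above can be written as small functions of the epoch number (a sketch with arbitrary base step sizes):

    def constant_schedule(lr0):
        # Same step size every epoch.
        return lambda epoch: lr0

    def inverse_schedule(lr0):
        # Step size decays roughly like 1/t (epoch is 0-based here).
        return lambda epoch: lr0 / (epoch + 1)

    # e.g., inverse_schedule(1.0)(5) -> 1.0 / 6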
Exploration (optional) Compare the training accuracy and development accuracy of the model. Think about why any gap between them might arise.
2.2 Logistic Regression (30 points)
In this part, you’ll implement a logistic regression classifier with the same unigram bag-of-words feature set as in the previous part. Implement logistic regression training in train_logistic_regression and LogisticRegressionClassifier in models.py.
Note that you have a lot of freedom to modify models.py in your implementation; however, you are expected to show how you computed the loss function, the gradients, and the gradient descent algorithm (e.g., at least the simplest version of Algorithm 5 in the Eisenstein textbook3) to minimize the loss function. Do NOT directly call existing libraries (such as sklearn.linear_model.LogisticRegression) or other well-implemented logistic regression classifiers.
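As a rough guide to the expected level of detail (not the required implementation), here is a sketch of a single stochastic gradient step for binary logistic regression on sparse features; all names are illustrative:

    import math

    def logistic_regression_sgd_step(weights, feats, label, lr):
        # One stochastic gradient step on a single example's negative log-likelihood.
        # feats is a sparse Counter of feature index -> value; label is 0 or 1.
        score = sum(weights[i] * v for i, v in feats.items())
        prob_pos = 1.0 / (1.0 + math.exp(-score))   # P(y = 1 | x); clip score if it overflows
        # d(log-likelihood)/dw_i = (y - P(y = 1 | x)) * x_i, so ascend the log-likelihood
        for i, v in feats.items():
            weights[i] += lr * (label - prob_pos) * v
        return weights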
Q2 (30 points) Implement logistic regression. Report your model’s performance on the train/dev dataset. For your reference, you should get ∼77% accuracy on the development set, and it must run in about 20 seconds. If Q1 worked out for you, this should as well. Logistic regression, when implemented correctly, is usually a bit more stable and more effective than the perceptron.
Exploration (optional) Plot (using matplotlib or another tool) the training objective (dataset log-likelihood) and development accuracy of logistic regression vs. number of training iterations for a couple of different step sizes. Think about what you observe and what it means. These plots are very useful for understanding whether your model is training at all, how long it needs to train, and whether you’re seeing overfitting.
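If you try this exploration, a minimal matplotlib sketch, assuming you have already collected per-epoch log-likelihood and dev-accuracy lists, might be:

    import matplotlib.pyplot as plt

    def plot_training_curves(log_likelihoods, dev_accuracies):
        # Plot training log-likelihood and dev accuracy against epoch number.
        epochs = range(1, len(log_likelihoods) + 1)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.plot(epochs, log_likelihoods)
        ax1.set_xlabel("epoch")
        ax1.set_ylabel("training log-likelihood")
        ax2.plot(epochs, dev_accuracies)
        ax2.set_xlabel("epoch")
        ax2.set_ylabel("dev accuracy")
        plt.tight_layout()
        plt.savefig("lr_training_curves.png")  # or plt.show()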
2.3 Part 3: Better Features (30 points)
You’ll implement a more sophisticated set of features in this part. You should implement two additional feature extractors, BigramFeatureExtractor and BetterFeatureExtractor. Note that your features can go beyond word n-grams; for example, you could define a FirstWord=X feature based on what the first word of a sentence is, although this particular feature may not be useful.
2 Our autograder hardware is fairly powerful and we will be lenient about rounding, so don’t worry if you’re close to this threshold.
3 https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
Q3 (15 points) Implement and experiment with BigramFeatureExtractor. Bigram features should be indicators of adjacent pairs of words in the text.
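A minimal sketch of bigram featurization, reusing the “magic word” naming suggested earlier (the indexer methods are again assumptions about the provided Indexer class):

    from collections import Counter

    def extract_bigram_features(sentence, indexer, add_to_indexer=False):
        # Indicator features over adjacent word pairs, e.g. Bigram=great|movie.
        feats = Counter()
        for first, second in zip(sentence, sentence[1:]):
            name = "Bigram=" + first.lower() + "|" + second.lower()
            if add_to_indexer:
                idx = indexer.add_and_get_index(name)   # assumed Indexer API; check utils.py
            else:
                idx = indexer.index_of(name)
            if idx >= 0:
                feats[idx] += 1
        return feats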
Q4 (15 points) Experiment with at least one feature modification in BetterFeatureExtractor. Try it out with logistic regression. Things you might try: other types of n-grams, tf-idf weighting, clipping your word frequencies, discarding rare words, discarding stopwords, etc. Your final code here should be whatever works best (even if that’s one of your other feature extractors). This model should train and evaluate in at most 60 seconds. This feature modification should not just consist of combining unigrams and bigrams.
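As one illustration of the kind of modification meant here (not a claim about what works best), a sketch that filters stopwords and rare words before building unigram features; the stopword list and count threshold are arbitrary placeholders:

    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "it"}  # placeholder list

    def build_vocab_counts(train_sentences):
        # Count word frequencies over the training sentences once, up front.
        counts = Counter()
        for sentence in train_sentences:
            counts.update(w.lower() for w in sentence)
        return counts

    def filtered_unigrams(sentence, vocab_counts, min_count=2):
        # Keep only non-stopword tokens seen at least min_count times in training.
        return [w.lower() for w in sentence
                if w.lower() not in STOPWORDS and vocab_counts[w.lower()] >= min_count]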
3 What You Should Submit and Our Expectation
Submission. You should submit the following files to Carmen as three separate file uploads (not a zip file):
1. A PDF file of your answers to questions Q1-Q4 as your report. Report the required performance, explain your explorations, and summarize your observations.
2. The output of the blind test set in a file named test-blind.output.txt. The code produces this by default, but make sure you include the right version (e.g., the results of your model with the best feature setting)!
3. models.py, which should be submitted as an individual file upload. Do not modify or upload sentiment_classifier.py, sentiment_data.py, or utils.py. Please put all of your code in models.py.
Expectation. Beyond your writeup, your submission will be evaluated on several axes:
1. Execution: your code should train and evaluate within a reasonable amount of time (e.g., dozens of seconds; usually less than 30s) without crashing.
2. Reasonable accuracy on the development set of your logistic regression classifier under each feature setting. You should be able to get at least 70% accuracy; please conduct error analysis if not.
3. Reasonable accuracy on the blind test set. You should run the prediction with your best feature setting (determined based on dev set performance; the setting you used for Q4) and submit the test output file. Our TA will compute the accuracy on this blind test set.
Before you submit, make sure that the following commands work:
python sentiment_classifier.py --model PERCEPTRON --feats UNIGRAM
python sentiment_classifier.py --model LR --feats UNIGRAM
python sentiment_classifier.py --model LR --feats BIGRAM
python sentiment_classifier.py --model LR --feats BETTER
These commands should all print dev results and write blind test output to the file by default (so, make sure you have the right version submitted).
3.1 Acknowledgment
This homework assignment is largely adapted from NLP courses taught at UT Austin.
References
[1] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
Academic Integrity
You can discuss the homework assignments with other students and study course materials together towards solutions. However, all of the code and report you write must be your own!!! Plagiarism will be automatically detected on Carmen (by comparing HWs from previous and current terms) as well as manually checked by the TA.
If the instructor or TA suspects that a student has committed academic misconduct in this course, they are obligated by university rules to report their suspicions to the Committee on Academic Misconduct. Please see the Statements for CSE 5525 on the Carmen pages for more details.