
Assignment 2: Feedforward Neural Networks

Academic Honesty: Please see the course syllabus for information about collaboration in this course.
While you may discuss the assignment with other students, all work you submit must be your own!

Goals The main goal of this assignment is for you to get experience training neural networks over text.
You’ll play around with feedforward neural networks in PyTorch and see the impact of different sets of word
vectors on the sentiment classification problem from Assignment 1.

Code Setup

Please use Python 3.5+ and a recent version of PyTorch for this project.

Installing PyTorch You will need PyTorch for this project. To get it working on your own machine, you
should follow the instructions at https://pytorch.org/get-started/locally/. The assign-
ment is small-scale enough to complete using CPU only, so don’t worry about installing CUDA and getting
GPU support working unless you want to.

Installing via anaconda is typically easiest, especially if you are on OS X, where the system python has
some weird package versions. Installing in a virtual environment is recommended but not essential.

Part 1: Optimization (25 points)

In this part, you’ll get some familiarity with function optimization.

Q1 (25 points) First we start with optimization.py, which defines a quadratic with two variables:

y = (x1 − 1)^2 + 8(x2 − 1)^2

This file contains a manual implementation of SGD for this function. Run:

python optimization.py --lr 1

to optimize the quadratic with a learning rate of 1. However, the code will crash, since the gradient hasn’t
been implemented.

a) Implement the gradient of the provided quadratic function in quadratic_grad. sgd_test_quadratic
will then call this function inside an SGD loop and show a visualization of the learning process. Note: you
should not use PyTorch for this part!
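
For reference, here is a minimal sketch of the gradient and a plain SGD loop for this quadratic, written without PyTorch. The function name quadratic_grad matches the starter code, but the exact signature assumed here (a length-2 NumPy array in, a length-2 NumPy array out) and the loop around it are illustrative; adapt them to whatever optimization.py actually expects.

import numpy as np

def quadratic_grad(x):
    # Gradient of y = (x1 - 1)^2 + 8*(x2 - 1)^2 evaluated at x = [x1, x2].
    return np.array([2.0 * (x[0] - 1.0), 16.0 * (x[1] - 1.0)])

# Plain SGD loop (no PyTorch), starting at the origin.
x = np.zeros(2)
lr = 0.1  # step size; part (b) asks you to tune this
for step in range(100):
    x = x - lr * quadratic_grad(x)
print(x)  # should approach the optimum at (1, 1)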

b) When initializing at the origin, what is the best step size to use? Set your step size so that it gets
within a distance of 0.1 of the optimum in as few iterations as possible. Several answers are possible.
Hardcode this value into your code.

Exploration (optional) What is the “tipping point” of the step size parameter, where step sizes larger than
that cause SGD to diverge rather than find the optimum?


Part 2: Deep Averaging Network (75 points)

In this part, you’ll implement a deep averaging network as discussed in lecture and in Iyyer et al. (2015). If
our input s = (w1, . . . , wn), then we use a feedforward neural network for prediction with input

(1/n) ∑_{i=1}^{n} e(w_i),

where e is a function that maps a word w to its real-valued vector embedding.
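
As a concrete illustration of this averaging, the snippet below computes the averaged embedding for one sentence in PyTorch. The vocabulary size, embedding dimension, and word indices are made up; in your code the embedding layer would be initialized from the pretrained GloVe vectors.

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=5000, embedding_dim=50)  # illustrative sizes

# A sentence represented as a tensor of word indices (values are made up).
word_indices = torch.tensor([4, 17, 293, 4021])

# Average of the word embeddings: the input to the feedforward network.
avg = emb(word_indices).mean(dim=0)  # shape: (50,)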

Getting started Download the code and data; the data is the same as in Assignment 1. Expand the tgz file
and change into the directory. To confirm everything is working properly, run:

python neural_sentiment_classifier.py --model TRIVIAL --no_run_on_test

This loads the data, instantiates a TrivialSentimentClassifier that always returns 1 (positive),
and evaluates it on the training and dev sets. Compared to Assignment 1, this runs an extra word embedding
loading step.

Framework code The framework code you are given consists of several files. neural_sentiment_classifier.py
is the main class. As before, you cannot modify this file for your final submission, though it's okay to add
command line arguments or make changes during development. You should generally not need to modify
the paths. The --model argument controls the model specification. The main method loads in the data,
initializes the feature extractor, trains the model, evaluates it on train, dev, and blind test, and writes the
blind test results to a file.
models.py is the file you'll be modifying for this part, and train_deep_averaging_network
is your entry point, similar to Assignment 1. Data reading in sentiment_data.py and the utilities in
utils.py are similar to Assignment 1. However, read_sentiment_examples now lowercases the
dataset; the GloVe embeddings do not distinguish case and only contain embeddings for lowercase words.

sentiment_data.py additionally contains a WordEmbeddings class and code for reading it from
a file. This class wraps a matrix of word vectors and an Indexer in order to index new words. The
Indexer contains two special tokens: PAD (index 0) and UNK (index 1). UNK can stand in for words that
aren't in the vocabulary, and PAD is useful for implementing batching later. Both are mapped to the zero
vector by default.

Data You are given two sources of pretrained embeddings you can use: data/glove.6B.50d-relativized.txt
and data/glove.6B.300d-relativized.txt, the loading of which is controlled by the --word_vecs_path
argument. These are trained using GloVe (Pennington et al., 2014). These vectors have been relativized to
your data, meaning that they do not contain embeddings for words that don't occur in the train, dev, or test
data. This is purely a runtime and memory optimization.

PyTorch example ffnn_example.py[1] implements the network discussed in lecture for the synthetic
XOR task. It shows a minimal example of the PyTorch network definition, training, and evaluation loop.
Feel free to refer to this code extensively and to copy-paste parts of it into your solution as needed. Most
of this code is self-documenting. The most unintuitive piece is calling zero_grad before calling backward:
gradients are accumulated in place, so they must be zeroed out before every gradient computation.

[1] Available from Exercise 8b on edX.


Implementation Following the example, the rough steps you should take are:

1. Define a subclass of nn.Module that does your prediction. This should return a log-probability
distribution over class labels. Your module should take a list of word indices as input and embed them
using a nn.Embedding layer initialized appropriately.

2. Compute your classification loss based on the prediction. In lecture, we saw how to use the negative log
probability of the correct label as the loss. You can do this directly, or you can use a built-in loss
function like NLLLoss or CrossEntropyLoss. Pay close attention to what these losses expect as
inputs (probabilities, log probabilities, or raw scores).

3. Call network.zero_grad() (zeroes out in-place gradient vectors), loss.backward() (runs the
backward pass to compute gradients), and optimizer.step() to update your parameters.
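
A minimal sketch putting these three steps together is shown below. The class name, layer sizes, and the (word indices, gold label) example format are illustrative assumptions, not the starter code's interfaces; your actual module needs to fit what models.py and the autograder expect.

import torch
import torch.nn as nn

class DANClassifier(nn.Module):
    # Deep averaging network: average word embeddings, then a feedforward net
    # that returns a log-probability distribution over the two classes.
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.hidden = nn.Linear(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, num_classes)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, word_indices):
        # word_indices: 1-D tensor of word indices for a single sentence
        avg = self.embedding(word_indices).mean(dim=0)
        return self.log_softmax(self.out(torch.relu(self.hidden(avg))))

model = DANClassifier(vocab_size=5000, embedding_dim=50, hidden_size=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.NLLLoss()  # expects log probabilities and an integer gold label

# One illustrative (word indices, gold label) training example.
examples = [(torch.tensor([4, 17, 293]), torch.tensor(1))]
for word_indices, gold_label in examples:
    model.zero_grad()                      # zero out accumulated gradients
    log_probs = model(word_indices)
    loss = loss_fn(log_probs.unsqueeze(0), gold_label.unsqueeze(0))
    loss.backward()                        # compute gradients
    optimizer.step()                       # update parameters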

Implementation and Debugging Tips Come back to this section as you tackle the assignment!

• You should print training loss over your models’ epochs; this will give you an idea of how the learning
process is proceeding.

• You should be able to do the vast majority of your parameter tuning in small-scale experiments. Try to
avoid running large experiments on the whole dataset in order to keep your development time fast.

• If you see NaNs in your code, it's likely due to too large a step size; taking log(0) is the main way these arise.

• For creating tensors, torch.tensor and torch.from_numpy are pretty useful. For manipulating
tensors, permute lets you rearrange axes, squeeze can eliminate dimensions of length 1, expand
or repeat can duplicate a tensor across a dimension, etc. You probably won't need to use all of these
in this project, but they're there if you need them. PyTorch supports most basic arithmetic operations
done elementwise on tensors.

• To handle sentence input data, you typically want to treat the input as a sequence of word indices.
You can use torch.nn.Embedding to convert these into their corresponding word embeddings;
you can initialize this layer with data from the WordEmbeddings class using from_pretrained. By
default, this will cause the embeddings to be updated during learning, but this can be stopped by calling
requires_grad_(False) on the layer's weight (a short sketch appears after this list).

• Google/Stack Overflow and the PyTorch documentation[2] are your friends. Although you should not
seek out prepackaged solutions to the assignment itself, you should avail yourself of the resources out
there to learn the tools.
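
As a concrete sketch of the embedding tip above, the snippet below builds an nn.Embedding from a pretrained matrix with from_pretrained. The random NumPy array is a stand-in; in your code the matrix would come from the WordEmbeddings class, and whether you freeze the vectors is a choice to experiment with.

import numpy as np
import torch
import torch.nn as nn

# Stand-in for the pretrained GloVe matrix (vocab_size x embedding_dim).
pretrained = np.random.randn(5000, 50).astype(np.float32)

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained),
    freeze=False,     # set True to keep the pretrained vectors fixed
    padding_idx=0,    # index 0 is PAD; it receives no gradient updates
)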

Q2

a) Implement the deep averaging network. Your implementation should consist of averaging vectors and
using a feedforward network, but otherwise you do not need to exactly reimplement what’s discussed in
Iyyer et al. (2015). Things you can experiment with include varying the number of layers, the hidden
layer sizes, which source of embeddings you use (50d or 300d), your optimizer (Adam is a good choice),
the nonlinearity, whether you add dropout layers (after embeddings? after the hidden layer?), and your
initialization.
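
For example, one way to wire dropout into the feedforward portion is sketched below; the placement, layer sizes, and dropout probability are illustrative choices to experiment with, not a recommended configuration.

import torch.nn as nn

feedforward = nn.Sequential(
    nn.Dropout(p=0.3),       # dropout on the averaged embedding (300d input)
    nn.Linear(300, 100),
    nn.ReLU(),
    nn.Dropout(p=0.3),       # dropout after the hidden layer
    nn.Linear(100, 2),
    nn.LogSoftmax(dim=-1),
)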

[2] https://pytorch.org/docs/stable/index.html


b) Implement batching in your neural network. To do this, you should modify your nn.Module subclass
to take a batch of examples at a time instead of a single example. You should compute the loss over the
entire batch. Otherwise your code can function as before. You can also try out batching at test time by
changing the predict_all method.

Note that different sentences have different lengths; to fit these into an input matrix, you will need to
"pad" the inputs to be the same length. If you use the index 0 (which corresponds to the PAD token in the
indexer), you can set padding_idx=0 in the embedding layer. For the length, you can either dynamically
choose the length of the longest sentence in the batch, use a large enough constant, or use a constant that
isn't quite large enough but truncates some sentences (which will be faster).
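
A minimal padding sketch is shown below. The sentences are made-up index lists and pad_to_length is an illustrative helper, not part of the starter code.

import torch

def pad_to_length(indices, length, pad_index=0):
    # Truncate or right-pad a list of word indices to exactly `length` entries.
    return indices[:length] + [pad_index] * max(0, length - len(indices))

# Made-up batch of sentences as lists of word indices.
batch = [[4, 17, 293], [8, 2], [51, 9, 12, 7, 330]]

# Pad to the longest sentence in this batch.
max_len = max(len(s) for s in batch)
batch_tensor = torch.tensor([pad_to_length(s, max_len) for s in batch])  # shape (3, 5)

# With padding_idx=0 in the embedding layer, PAD positions contribute zero
# vectors to the sum; divide by each sentence's true length (not the padded
# length) to get the correct average.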

Requirements You should get at least 77% accuracy on the development set without the autograder timing
out; you should aim for your model to train in less than 10 minutes or so (and you should be able to get
good performance in 3-5 minutes on a recent laptop).

If the autograder crashes but your code works locally, there is a good chance you are taking too much time
(the autograder VMs are quite weak). Try reducing the number of epochs so your code runs in 3-4 minutes
and resubmitting to at least confirm that it works.

Deliverables and Submission

You will upload your code to Gradescope.

Code Submission You should submit both optimization.py and models.py, which will be evaluated
by our autograder on several axes:

1. Execution: your code should train and evaluate within the time limits without crashing

2. Accuracy on the development set of your deep averaging network model using 300-dimensional
embeddings

3. Accuracy on the blind test set: this is not explicitly reported by the autograder but we may consider it,
particularly if it differs greatly from the dev performance (by more than a few percent)

Note that we will only evaluate your code with 300-dimensional embeddings.
Make sure that the following commands work before you submit (for Parts 1 and 2, respectively):

python optimization.py

python neural_sentiment_classifier.py --word_vecs_path data/glove.6B.300d-relativized.txt

References

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep Unordered Composition
Rivals Syntactic Methods for Text Classification. In Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics (ACL).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Represen-
tation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
