Independent Project 3: Language Modeling

1 Data
For this project, we are going to use the wikitext-2 data for language modeling. I did some additional pre-processing on this dataset, so it is not exactly the same as the version available online.
In the data files, there are four special tokens:
• <unk>: special token for low-frequency words
• <num>: special token for all the numbers
• <start>: special token to indicate the beginning of a sentence
• <stop>: special token to indicate the end of a sentence
Here are some simple statistics about the dataset:

              Training    Development   Test
# Sentences   17,556      1,841         2,183
# Tokens      1,800,391   188,963       212,719

Table 1: Dataset statistics
2 Recurrent Neural Network Language Models
In this section, you are going to implement an RNN for language modeling. To be specific, it is an RNN with LSTM units. As a starting point, you need to implement a very simple RNN with LSTM units, so please read the instructions carefully!
The goals of this part are to

• learn to implement a simple RNN LM with LSTM units
• gain some experience tuning hyper-parameters for a better model
I recommend using PyTorch for all the implementations in this section.
1. (5 points) Please implement a simple RNN with LSTM units that meets the following requirements:
- Input and hidden dimensions: 32
- No mini-batch, or mini-batch size is 1
- No truncation on sentence length: every token in a sentence must be read into the RNN LM to compute a hidden state, except the last token <stop>
- Use SGD with no momentum and no weight decay; you may want to use gradient norm clipping to make sure you can train the model without being interrupted by gradient explosion
- Use a single-layer LSTM
- Use nn.Embedding with default initialization for word embeddings
Please write the code into a Python file with the name simple-rnnlm.py. Please follow the requirements strictly, otherwise you will lose some points for this question and your answers to the following questions will be invalid. If you want to use some technique or deep learning trick that is not covered in the requirements, feel free to use it and explain it in your report.
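Below is a minimal sketch of a model meeting these requirements, in PyTorch. The data-loading side is assumed: `vocab_size` and the list `sentences` of token-id tensors (each including <start> and <stop>) are hypothetical placeholders, and the learning rate and epoch count are illustrative, not prescribed by the assignment.

```python
# Minimal sketch for simple-rnnlm.py; data loading is assumed elsewhere.
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        # Input and hidden dimensions are both 32, per the requirements.
        self.embed = nn.Embedding(vocab_size, dim)    # default initialization
        self.lstm = nn.LSTM(dim, dim, num_layers=1)   # single-layer LSTM
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        # tokens: (seq_len,) LongTensor for one sentence (batch size 1).
        emb = self.embed(tokens).unsqueeze(1)         # (seq_len, 1, dim)
        hidden, _ = self.lstm(emb)                    # (seq_len, 1, dim)
        return self.proj(hidden.squeeze(1))           # (seq_len, vocab_size)

def train(model, sentences, epochs=1, lr=0.1, clip=5.0):
    # Plain SGD, no momentum, no weight decay; clip gradient norms so
    # training is not interrupted by exploding gradients.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for sent in sentences:
            opt.zero_grad()
            # Read every token except the final <stop> into the LSTM;
            # predict each next token (so <stop> is still a target).
            logits = model(sent[:-1])
            loss = loss_fn(logits, sent[1:])
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            opt.step()
```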
2. (3 points) Perplexity. It should be computed at the corpus level. For example, suppose a corpus has only the following two sentences:

    <start>, w_{1,1}, ..., w_{1,N_1}, <stop>
    <start>, w_{2,1}, ..., w_{2,N_2}, <stop>
To compute the perplexity, we first compute the average of the log probabilities:

    avg = \frac{1}{N_1 + 1 + N_2 + 1} \big\{ \log p(w_{1,1}) + \cdots + \log p(w_{1,N_1}) + \log p(\text{<stop>})
          + \log p(w_{2,1}) + \cdots + \log p(w_{2,N_2}) + \log p(\text{<stop>}) \big\}    (1)

and then

    \text{Perplexity} = e^{-avg}.    (2)

Please implement the function to compute perplexity as explained above and write the code into a separate Python file with the name [computingID]-perplexity.py.
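A minimal sketch of this computation, assuming the model interface from the sketch above (a forward pass that maps a sentence's first L-1 tokens to L-1 rows of logits); the function and variable names are placeholders:

```python
# Sketch of corpus-level perplexity, following Eqs. (1) and (2).
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def corpus_perplexity(model, sentences):
    total_logprob, total_tokens = 0.0, 0
    for sent in sentences:
        logits = model(sent[:-1])                 # predict tokens 2..end
        logprobs = F.log_softmax(logits, dim=-1)
        # Sum log p of every predicted token, including <stop> but not <start>.
        total_logprob += logprobs[torch.arange(len(sent) - 1), sent[1:]].sum().item()
        total_tokens += len(sent) - 1             # N_i + 1 tokens per sentence
    avg = total_logprob / total_tokens            # Eq. (1)
    return math.exp(-avg)                         # Eq. (2)
```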
- (3 points) Report the perplexity numbers of your simple RNN LM (as in [computingID]-simple-rnnlm.py) on the training and development datasets. In addition, run your model on the test data, and write the log probabilities into the file [computingID]-tst-logprob.txt with the following format on each line: token\tlog probability
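One possible way to produce the required file, assuming a hypothetical `id2word` list mapping token ids back to strings and the model setup sketched above:

```python
# Sketch of writing [computingID]-tst-logprob.txt: one line per predicted
# token, "token<TAB>log probability". `id2word` is a hypothetical mapping.
import torch
import torch.nn.functional as F

@torch.no_grad()
def write_logprobs(model, test_sentences, id2word, path):
    with open(path, "w") as f:
        for sent in test_sentences:
            logprobs = F.log_softmax(model(sent[:-1]), dim=-1)
            for pos, tok in enumerate(sent[1:].tolist()):
                f.write(f"{id2word[tok]}\t{logprobs[pos, tok].item()}\n")
```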
- (3 points) Stacked LSTM. Based on your implementation in [computingID]-simple-rnnlm.py, modify the model to use a multi-layer LSTM (stacked LSTM). Based on the perplexity on the dev data, you can tune the number of hidden layers n as a hyper-parameter to find a better model. Here, 1 ≤ n ≤ 3. In your report, provide the following information:

  • the value of n in the better model
  • perplexity number on the training data based on the better model
  • perplexity number on the dev data based on the better model

  Submit your code with the file name [computingID]-stackedlstm-rnnlm.py.
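Under the same assumptions as the earlier sketch, the only structural change is exposing `num_layers`; for example:

```python
# Sketch of the stacked-LSTM variant: num_layers becomes a hyper-parameter.
import torch.nn as nn

class StackedRNNLM(nn.Module):
    def __init__(self, vocab_size, dim=32, num_layers=2):  # tune 1 <= n <= 3
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=num_layers)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens).unsqueeze(1)
        hidden, _ = self.lstm(emb)
        return self.proj(hidden.squeeze(1))
```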
- (3 points) Optimization. Based on your implementation in [computingID]-simple-rnnlm.py again, choose different optimization methods (SGD with momentum, AdaGrad, etc.) to find a better model. In your report, provide the following information:

  • the optimization method used in the better model
  • perplexity number on the training data based on the better model
  • perplexity number on the dev data based on the better model

  Submit your code with the file name [computingID]-opt-rnnlm.py.
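A small helper, as one possible way to organize the optimizer comparison; the candidate set and learning rates are illustrative assumptions:

```python
# Sketch of swapping optimizers; the model and training loop stay as in
# simple-rnnlm.py, only the optimizer construction changes.
import torch

def make_optimizer(params, name, lr=0.1):
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "adagrad":
        return torch.optim.Adagrad(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=1e-3)
    raise ValueError(f"unknown optimizer: {name}")
```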
- (3 points) Model Size. Based on your implementation in [computingID]-simple-rnnlm.py again, choose different input/hidden dimensions. Suggested dimensions for both input and hidden are 32, 64, 128, and 256. Try different combinations of input/hidden dimensions to find a better model. In your report, provide the following information:

  • input/hidden dimensions used in the better model
  • perplexity number on the training data based on the better model
  • perplexity number on the dev data based on the better model

  Submit your code with the file name [computingID]-model-rnnlm.py.
- (5 points, extra) Mini-batch. Based on your implementation in [computingID]-simple-rnnlm.py again, modify the code to add mini-batch support. Tune the mini-batch size b over {16, 24, 32, 64} to see whether it makes a difference. In your report, provide the following information:

  • whether different mini-batch sizes make a difference; if the answer is yes, then
  • the best batch size
  • perplexity number on the training data based on the better model
  • perplexity number on the dev data based on the better model

  Submit your code with the file name [computingID]-minibatch-rnnlm.py.
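One way to add mini-batching is to pad the sentences in a batch to a common length and mask the padding out of the loss; the sketch below assumes a dedicated padding id `PAD` that never collides with a real token id:

```python
# Sketch of a mini-batched loss: pad to equal length, mask padding in the loss.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # hypothetical padding id, assumed unused by real tokens

def batch_loss(model, batch):
    # batch: list of b LongTensors (b in {16, 24, 32, 64}), one per sentence.
    inputs = pad_sequence([s[:-1] for s in batch], padding_value=PAD)   # (T, b)
    targets = pad_sequence([s[1:] for s in batch], padding_value=PAD)   # (T, b)
    emb = model.embed(inputs)                     # (T, b, dim)
    hidden, _ = model.lstm(emb)                   # (T, b, dim)
    logits = model.proj(hidden)                   # (T, b, vocab)
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)  # padding does not count
    return loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
```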