1. (10 points) Copy your code comments here.
A. add_link adds a child link to the EncoderDecoder chain and sets an attribute named linkname that points to the child link, so the added link can be accessed as self.linkname or self['linkname']. Because we use a bidirectional LSTM, there are two loops: one builds the forward encoder and the other builds the backward encoder (see the sketch after part F).
B. The initial decoder LSTM state is set to the final encoder state, which is the concatenation of the forward and backward final encoder LSTM hidden states, so the decoder LSTM has hidden size 2*n_units. We therefore embed each target word from vsize_dec into 2*n_units at the input, and apply a linear transform from 2*n_units back to vsize_dec at the output.
C. set_decoder_state initializes the decoder LSTM to the final encoder state, i.e. the concatenation of the forward and backward final encoder LSTM hidden states. c_state is the cell state and h_state is the hidden state.
D. Because we encode the sentence with a bidirectional LSTM: one LSTM reads the sentence left-to-right and the other reads it right-to-left.
E. The cross-entropy loss is computed. Example: suppose the unnormalized log probabilities are [1, 2, 3] and the correct word id is 1; then the loss for this sample is $-\log\left(e^{2}/(e^{1}+e^{2}+e^{3})\right) \approx 1.41$.
F. It tells Chainer to rescale the gradients when necessary so that the gradient norm stays within the specified threshold, which helps avoid the exploding-gradient problem (see the GradientClipping hook in the sketch below).
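The following minimal sketch is not the submitted code; the class name, the layer names (lstm_f0, lstm_b0, lstm_dec0, out), and the constructor parameters are illustrative assumptions. It shows how add_link can register the two encoder stacks in two loops (part A), why the decoder embedding, LSTM, and output projection use 2*n_units (part B), and how a GradientClipping hook bounds the gradient norm (part F):

import chainer
import chainer.links as L


class BiEncoderDecoder(chainer.Chain):
    """Illustrative skeleton only; names and sizes are assumptions."""

    def __init__(self, vsize_enc, vsize_dec, n_layers, n_units):
        super(BiEncoderDecoder, self).__init__()
        # Source and target word embeddings.
        self.add_link("embed_enc", L.EmbedID(vsize_enc, n_units))
        self.add_link("embed_dec", L.EmbedID(vsize_dec, 2 * n_units))
        # Two loops: one LSTM stack per direction (part A).
        for i in range(n_layers):
            self.add_link("lstm_f{:d}".format(i), L.LSTM(n_units, n_units))
        for i in range(n_layers):
            self.add_link("lstm_b{:d}".format(i), L.LSTM(n_units, n_units))
        # The decoder operates on the concatenated forward+backward state,
        # hence 2*n_units, and projects back to the target vocabulary (part B).
        for i in range(n_layers):
            self.add_link("lstm_dec{:d}".format(i), L.LSTM(2 * n_units, 2 * n_units))
        self.add_link("out", L.Linear(2 * n_units, vsize_dec))
        # add_link also sets the attribute, so self.lstm_f0 and
        # self["lstm_f0"] refer to the same child link.


# Gradient clipping hook (part F): rescale gradients whose norm exceeds the threshold.
model = BiEncoderDecoder(vsize_enc=8000, vsize_dec=7000, n_layers=1, n_units=200)
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.GradientClipping(threshold=5.0))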
2. (10 points) Examine the parallel data and answer questions. (Plots may appear in the appendix.)
1. Figure 1 shows the plot. Longer Japanese sentences tend to be translated into longer English sentences.
2. There are 97,643 word tokens in the English data and 143,581 in the Japanese data.
3. There are 7,211 word types in the English data and 8,252 in the Japanese data.
4. 3,624 word tokens are replaced by UNK in English and 4,451 in Japanese.
5. The difference in sentence lengths is relatively small and should not affect the NMT system very much. The type/token ratio of Japanese is much smaller than that of English, which may cause problems for the NMT system. Many tokens are replaced by UNK, so this unknown-word handling may not work very well. (A sketch of how these counts can be computed follows this list.)
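A minimal sketch of how such counts can be computed; the file names and the fixed vocabulary cutoff are assumptions for illustration, and the assignment's actual preprocessing (e.g. a frequency threshold) may differ:

from collections import Counter

def corpus_stats(path, vocab_size=7000):
    # Count whitespace-separated tokens; the tokenization is assumed here.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Tokens whose word type falls outside the top-vocab_size types become UNK.
    kept = sum(c for _, c in counts.most_common(vocab_size))
    n_unk = n_tokens - kept
    return n_tokens, n_types, n_unk

print(corpus_stats("train.en"))  # hypothetical English training file
print(corpus_stats("train.ja"))  # hypothetical Japanese training file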
3. (10 points) What language phenomena might influence what you observed above?
The Kāngxī dictionary lists over 47,000 characters, yet only 2,136 characters are in common use in Japan. In contrast, the Oxford English Dictionary contains over 170,000 words, of which roughly 3,000 are in common use. It is therefore no surprise that the type/token ratio of Japanese is much smaller than that of English.
4. (10 points) Answers to questions about sampling, beam search, and dynamic programming.
1. The no-sampling (argmax) approach may not produce the best translation, whereas the sampling approach produces different sequences on different runs and sometimes produces a better translation than the no-sampling approach. Figure 2 in the appendix shows the examples.
2. At the first time step, keep the k words with the highest probabilities. At each later time step, expand each of the k partial sequences with every possible next word, giving k x n candidates (n being the vocabulary size), and keep the top k of them. Continue this process until the final time step, which leaves k candidate translation sequences (see the sketch after this list).
3. We cannot implement dynamic programming for this model, because there is no conditional independence: the LSTM state depends on the entire output history, so we cannot derive a recurrence suitable for dynamic programming.
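A generic beam-search sketch of the procedure described in answer 2; next_log_probs is a hypothetical callback that, given a partial target sequence, returns (word id, log-probability) pairs for the next position, and in the assignment it would wrap the decoder LSTM:

import heapq

def beam_search(next_log_probs, bos, eos, beam_size=5, max_len=50):
    # Each hypothesis is (cumulative log-probability, partial sequence).
    beams = [(0.0, [bos])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:
                candidates.append((score, seq))       # completed: carry over unchanged
                continue
            for word, lp in next_log_probs(seq):      # expand into k x n candidates
                candidates.append((score + lp, seq + [word]))
        # Keep only the k best partial or completed sequences at this time step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
        if all(seq[-1] == eos for _, seq in beams):
            break                                     # every surviving hypothesis is complete
    # Return the highest-scoring hypothesis found.
    return max(beams, key=lambda x: x[0])[1]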
5. (10 points) Experiment with changes to the model, and explain results.
I changed the number of layers in both the encoder and the decoder to 2. Table 1 shows the result. The perplexity increases and the BLEU score decreases for the changed model, which indicates that the new model performs worse than the baseline. An example is given in Figure 3. The worse performance may be due to overfitting.
6. (10 points) Implement dropout, and explain the results.
I added the dropout code in the function feed_lstm, inserting the line hs = dropout(hs, ratio, train) right before the line hs = self[lstm_layer](hs), with the dropout ratio set to 0.5 (a sketch of the modified function follows). Table 1 shows the result. Both the perplexity and the BLEU score decrease for the changed model. Some of the translations become better; an example is given in Figure 4. The reason is that dropout helps to reduce overfitting.
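A minimal sketch of the modified function, assuming the Chainer v1 F.dropout signature with a train flag and a feed_lstm interface of roughly this shape; the argument list is an assumption about the provided code, not a copy of it:

import chainer.functions as F

def feed_lstm(self, word_ids, embed_layer, lstm_layer_list, train, ratio=0.5):
    # Embed the current words, then feed them through the LSTM stack.
    hs = self[embed_layer](word_ids)
    for lstm_layer in lstm_layer_list:
        # Added line: drop units before each LSTM layer (active only during training).
        hs = F.dropout(hs, ratio=ratio, train=train)
        hs = self[lstm_layer](hs)
    return hs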
7. (20 points) Implement attention. This question will be evaluated on the basis of your code.
8. (10 points) Explain the results of implementing attention.
Table 1 shows the result. The perplexity drops substantially and the BLEU score increases for the attention model. The translations become better; an example is given in Figure 5.
9. (10 points) Explain what you observe with attention. (Figures may appear in an appendix.)
Figures 6, 7, 8, 9, and 10 show the five plots. The plots are reasonable: positions corresponding to aligned Japanese and English words receive high attention values. For example, from Figure 10 we can easily see that "the" attends to その and "question" attends to 質問, which are the correct word-to-word translations.
Optional: you may include an appendix after the line above. Everything above the line must appear in the first
three pages of your submission.
Table 1: Model Performance

Model       Perplexity   BLEU
Baseline    53.7228      19.487
2 Layer     54.6638      17.808
Dropout     44.6820      14.874
Attention   28.0935      22.675
Figure 1: Plot of distribution of sentence lengths in English and Japanese and their correlation
Figure 2: Samples For Q4
Figure 3: Samples For Q5
Figure 4: Example For Q6
Figure 5: Example For Q8
Figure 6: Attention Plot 1
Figure 7: Attention Plot 2
Figure 8: Attention Plot 3
Figure 9: Attention Plot 4
Figure 10: Attention Plot 5