This assignment is due at 16:00 GMT on 9 Feb, 2017. Late submissions will receive a mark of zero,
under the Informatics late policy (http://web.inf.ed.ac.uk/infweb/student-services/ito/admin/coursework-projects/late-coursework-extension-requests). An assignment
submitted at 16:01 GMT on 9 Feb is considered late. Before you start, please read this page
carefully. Submissions that do not follow the Ground Rules (at the bottom of the page) will receive a
mark of zero.
Decoding | Coursework 1
Decoding is the process of taking an input sentence in a source language (e.g. French):
honorables sénateurs , que se est – il passé ici , mardi dernier ?
…And finding its best translation into the target language (e.g. English) under your model:
honourable senators , what happened here last Tuesday ?
To decode, we need to model the probability of an English sentence conditioned on the French sentence.
You did some of the work of creating a simple model of this kind in Lab 1. In this assignment, we will give
you some French sentences and a probabilistic model consisting of a phrase-based translation model
$p_{TM}(f, a \mid e)$ and an n-gram language model $p_{LM}(e)$. We will assume a noisy channel decomposition
and our goal will be to implement the following decision function:

$$e^* = \arg\max_e p(e \mid f) = \arg\max_e \frac{p_{TM}(f \mid e) \times p_{LM}(e)}{p(f)} = \arg\max_e p_{TM}(f \mid e) \times p_{LM}(e) = \arg\max_e \sum_a p_{TM}(f, a \mid e) \times p_{LM}(e)$$

Since multiplying many small probabilities together can lead to numerical underflow, we'll work in
logspace using the equivalent decision function:

$$e^* = \arg\max_e \log \Big( \sum_a p_{TM}(f, a \mid e) \times p_{LM}(e) \Big)$$

We will keep the model fixed for this assignment. Your challenge will be to write a program that, given an
input French sentence, attempts to find its most probable English translation under this model using this
decision function.
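As a quick standalone illustration of why the logspace form matters (this snippet is not part of the provided code): multiplying even a few hundred small probabilities underflows to zero in double precision, while summing their logs stays perfectly representable.

# Not part of the coursework code: multiplying many small probabilities
# underflows, but summing their logs is well-behaved.
from math import log

probs = [1e-4] * 100          # e.g. 100 model probabilities of 0.0001 each

product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0 -- underflows in double precision

log_score = sum(log(p) for p in probs)
print(log_score)              # about -921.03 -- exactly representable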
Part 1: Getting Started [10 marks]
You will need two things for this assignment. The first is an answer sheet on which you are required to
submit your answers to the inline questions below. Follow this link
(https://www.overleaf.com/latex/templates/infr1113-homework-1-template/bkqsppbgkqnn) and open the
template to get started.
The second is code and data. At the DICE command line, clone the coursework repository:
git clone https://github.com/INFR11133/hw1.git
You now have a simple decoder and some data files. Take a look at the input French sentences, from a
speech in the Canadian parliament:
cat data/input
Let’s translate them!
python2.7 decode | tee output.default
The decode program translates the contents of data/input and writes the result to the terminal and
to the file output.default . It translates using a phrase-based translation model: a model that replaces
sequences of French words (called phrases) with sequences of English words. The replacement is
one-to-one and every input word must be accounted for by exactly one phrase, but the phrases can be
reordered. This means that to define the probability of a translation, our model requires probabilities for
the choice of segmentation into phrases, the translation of the phrases, and the permutation of the target
phrases. However, we will make the simplifying assumption that segmentation and ordering probabilities
are uniform across all possible choices, so our translation model will only assign different probabilities to
translations that use different phrases to translate the input; the ordering of those phrases is only
influenced by an n-gram language model. This means that $p(e, a \mid f)$ is proportional to the product of the
n-gram probabilities in $p_{LM}(e)$ and the phrase translation probabilities in $p_{TM}(f, a \mid e)$. In phrase-based
translation, the output sequence can be permuted, but the default decoder does not do this. We'll
discuss the model and the decoding algorithm in more detail below.
You can probably get the gist of the Canadian senator's complaint from this translation, but it isn't very
readable. There are a couple of possible explanations for this:
1. Our model of $p(e \mid f)$ gives high probability to bad translations. This is called model error.
2. Our model is good but our decoder fails to find $\arg\max_e p(e \mid f)$. This is called search error.
How can we tell which problem is the culprit? One way would be to solve problem 2, a purely
computational problem. If we do so and the resulting translations are good, then our work is finished. If the
resulting translations are still poor, then we at least know that we should focus on creating better models.
For this assignment, we will focus only on problem 2. (But don't worry, we'll revisit problem 1 repeatedly
throughout the course.)
If we're going to improve search so that the decoder returns translations with higher probability, we need
some way to measure how well it's doing. You can compute a value that is proportional to $p(e \mid f)$ using
compute-model-score .

python2.7 compute-model-score < output.default

It will take a few minutes. Make a cup of tea.
The compute-model-score program computes the probability of the decoder's output according to the
model. It does this by summing the probabilities of all possible alignments that the model could have
used to generate the English from the French, including translations that permute the phrases. That is,
given a pair $f$ and $e$, it exactly computes:

$$\log \Big( \sum_a p_{TM}(f, a \mid e) \times p_{LM}(e) \Big)$$

This value is proportional to $p(e \mid f)$ up to the constant $\frac{1}{p(f)}$. In general it is intractable to compute this
sum, and if you inspect the code you'll find that the implementation uses an algorithm whose worst-case
runtime is exponential in the length of the input sentence, and indeed this computation is provably NP-
hard. But exponential worst case is an upper bound, and for these particular instances the sum will only
take a few minutes. It is easier to do this than it is to find the optimal translation.
I highly recommend that you look at the code (https://github.com/INFR11133/hw1/blob/master/compute-
model-score) for compute-model-score . It uses an unpruned exact dynamic program to compute the
sum, so it may give you some hints about how to do the assignment! It also contains some useful utility
functions to add probabilities in logarithmic space
(https://github.com/INFR11133/hw1/blob/master/compute-model-score#L16) and manipulate bitmaps
(https://github.com/INFR11133/hw1/master/compute-model-score#L8).
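In case the log-space trick is unfamiliar: two probabilities stored as logs can be added using the identity log(p + q) = log p + log(1 + exp(log q - log p)). The helper in compute-model-score may be written differently; the sketch below (the name logadd is illustrative, not necessarily the one used in the provided code) just shows the identity:

# Illustrative only: add two probabilities without leaving log space.
from math import log, log1p, exp

def logadd(x, y):
    """Return log(exp(x) + exp(y)) while staying in log space."""
    if y > x:
        x, y = y, x          # make x the larger argument for numerical stability
    return x + log1p(exp(y - x))

print(logadd(log(0.001), log(0.003)))   # approximately log(0.004) = -5.52...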
A tour of the default decoding algorithm
Now let’s turn our attention to the decoder you’ve been given. It generates the most probable
translations that it can find, but it uses three common approximations that can cause search error.
The Viterbi approximation
The first approximation of our decoder is the Viterbi approximation. Instead of computing the
intractable sum over all alignments for each sentence, it seeks the best single alignment and uses the
corresponding translation:

$$e^* = \arg\max_e \log \Big( \sum_a p_{TM}(f, a \mid e) \times p_{LM}(e) \Big) \approx \arg\max_e \log \Big( \max_a p_{TM}(f, a \mid e) \times p_{LM}(e) \Big)$$

This approximation vastly simplifies the search problem, since, as we've already discussed, computing
the sum for even a single $e$ can take time exponential in sentence length. This is why we require a
separate script compute-model-score to compute $p(e \mid f)$, rather than having the decoder itself do
this. Computing sums for an exponential number of outputs is doubly exponential (that's bad).
Translation without reordering
The second approximation of our decoder is that it translates French phrases into English without
changing their order. That is, it has a heuristic distortion limit of zero. So, it only reorders words if the
reordering has been memorized as a phrase pair. For example, in the first sentence, we see that mardi
dernier is correctly translated as last Tuesday.
head -1 data/input
head -1 output.default
If we consult data/tm , we will find that this happens because the model has memorized the phrase
translation mardi dernier ||| last Tuesday .
grep "mardi dernier" data/tm
But in the second sentence, we see that Comité de sélection is translated as committee selection, rather
than the more fluent selection committee. To show that this is a search problem rather than a modeling
problem, we can generate the latter output by hand and check that the model really prefers it.
head -2 data/input | tail -1 > example
python2.7 decode -i example | python2.7 compute-model-score -i example
echo a selection committee was achievement . | python2.7 compute-model-score -i example
The scores are reported as log-probabilities, and higher scores (with lower absolute value) are better. We
see that the model prefers selection committee, but the decoder does not consider this word order since
it has never memorized this phrase pair. Moreover, you can see from the language model and translation
model scores of this example that it is the language model that selects the correct ordering.
Since we have only changed the order of the words, the translation model score is identical for both
translations. So, if our decoder could search over more permutations of the translated phrases, the
language model would allow it to choose more fluent sentences.
Dynamic programming
Not searching over permutations of the translated phrases is harmful because the decoder cannot find
translations with higher model score, but it is beneficial in one way, because it admits a straightforward
dynamic program, which we will now define under the simplifying assumption of a bigram language model
(the default decoder uses a trigram language model). Let $f = f_1 \ldots f_I$ and, for each phrase $f_i \ldots f_j$, let
$t(f_i \ldots f_j)$ denote its set of possible phrase translations. The problem our decoder must solve is: what is
the most probable translation under our model, under the constraint that phrase translations are one-to-
one, in the same order as the source, and covering all source words exactly once?

We called this translation $e^*$ above, so let's continue to use that notation. Although $e^*$ must be chosen
from a large set of translations, each of these translations can be decomposed into a smaller set of individual
decisions that factor with $p_{TM}(f, a \mid e)$ and $p_{LM}(e)$. Since our language model must define a stop
probability, we know that the best translation must contain a bigram probability of the STOP symbol,
conditioned on its final word. Let's use $h(j, e)$ to denote the highest-probability sequence of English
words ending in word $e$ that translates $f_1 \ldots f_j$, and $p(h(j, e))$ to denote the probability of this partial
translation: that is, the product of the translation model probability for all phrases used and the language
model probability for the entire sequence up to word $e$. Given this definition, we can define $e^*$ as the
product of two terms:

$$e^* = \arg\max_{h(I, e)} \log p(h(I, e)) + \log p_{LM}(\text{STOP} \mid e)$$

This is a good start, because now we've defined $e^*$ as a choice from at most $|V_E|$ objects (each
corresponding to the choice of $e$ in $h(I, e)$) rather than an exponential number of objects (as a practical
matter, we'll only need to inspect $h(I, e)$ for those $e$ that can actually appear in a valid translation of $f$, a
much smaller set than $V_E$). But now we have a new problem: how do we define $h(I, e)$? To answer this,
we'll actually answer a more general question: how would we define $h(j, e)$ for any $j$ between 1 and $I$?
Since it is constructed from a sequence of phrase translations, let's break it into two parts: a prefix that
translates the words $f_1 \ldots f_i$ and the final English phrase, which must be a translation of the French
words $f_{i+1} \ldots f_j$ for some position $0 \le i < j$. There are many possible choices for $i$, the translation of
$f_1 \ldots f_i$ may end in many possible English words, and there may be many translations of $f_{i+1} \ldots f_j$. We
must maximize over all combinations of these choices:

$$h(j, e) = \arg\max_{h(i, e'),\; e_1 \ldots e_k e} \; \log p(h(i, e')) + \log p_{TM}(f_{i+1} \ldots f_j \mid e_1 \ldots e_k e) + \log p_{LM}(e_1 \mid e') + \sum_{k'=1}^{k-1} \log p_{LM}(e_{k'+1} \mid e_{k'}) + \log p_{LM}(e \mid e_k)$$

Don't be alarmed by the notation: all this says is that $h(j, e)$ must consist of the best choice of a pair
$h(i, e')$ and phrase $e_1 \ldots e_k e$, where the final word of the phrase must be $e$. The first term is the
probability of $h(i, e')$, the second is the probability of translating $f_{i+1} \ldots f_j$ by $e_1 \ldots e_k e$, and the
remaining terms compute the language model probability of the new English words by iterating over
them. All that is left is to define a base case, using $\epsilon$ to denote an empty string:

$$h(0, e) = \begin{cases} \epsilon & \text{if } e = \text{START} \\ \text{undefined} & \text{otherwise} \end{cases}$$

Note that this implies:

$$p(h(0, e)) = \begin{cases} 1 & \text{if } e = \text{START} \\ 0 & \text{otherwise} \end{cases}$$

Using induction, convince yourself that this recursion defines every highest-probability hypothesis (or
partial translation) $h(j, e)$ for every choice of $j$ and $e$, because it recursively considers all possible ways
of reaching that hypothesis. By extension, it must also define the highest-probability translation $e^*$.

To implement this dynamic program, we actually compute the recursion left-to-right, from smaller to
larger values of $j$. In this way, we always have all the information we need when computing $h(j, e)$. As a
practical matter, the way we do this is to ask, for each $h(i, e')$, which larger hypotheses $h(j, e)$ it might
be a maximizing prefix to, and then for each one compute the probability of $h(j, e)$ as if this were true. If
the newly computed probability is indeed higher than any probability we previously computed for $h(j, e)$,
we store it with $h(j, e)$ and make $h(i, e')$ its predecessor. The MT jargon term for this is recombination.

The form of the recursion gives us a strong hint about the upper-bound complexity of the dynamic
program, since if we consider all possible assignments of $i$, $j$, $e$, $e'$, and $e_1, \ldots, e_k$ in the central
recursion, we can deduce that the complexity is $O(I^2)$ if phrases can be arbitrarily long, and $O(IK)$ if
phrases have a maximum length $K$ (which is true in the default decoder and most practical
implementations). However, there is a large factor (constant in the input length) due to language model
computations.
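To make the recursion concrete, here is a minimal, self-contained sketch of the monotone bigram dynamic program on a toy model. It is not the provided decoder: the tiny phrase table and language model, the dictionary layout, and every name below are illustrative assumptions (the real decoder uses the models shipped in the repository and a trigram LM), and accents are dropped to keep the snippet plain ASCII.

# Toy sketch of the monotone bigram dynamic program described above.
from math import log

TM = {  # French phrase -> list of (English phrase, log prob); illustrative only
    ("un",): [(("a",), log(0.7)), (("an",), log(0.3))],
    ("comite",): [(("committee",), log(1.0))],
    ("de", "selection"): [(("selection",), log(1.0))],
}
LM = {  # (previous English word, English word) -> log prob; illustrative only
    ("<s>", "a"): log(0.5), ("<s>", "an"): log(0.5),
    ("a", "committee"): log(0.6), ("an", "committee"): log(0.4),
    ("committee", "selection"): log(0.5), ("selection", "</s>"): log(0.5),
}

def decode(f, max_phrase_len=3):
    # h[(j, e)] = (logprob, predecessor state, English phrase) for the best
    # hypothesis that translates f[0:j] and ends in English word e
    h = {(0, "<s>"): (0.0, None, ())}
    for j in range(1, len(f) + 1):
        for i in range(max(0, j - max_phrase_len), j):
            for (i2, e_prev), (lp, _, _) in list(h.items()):
                if i2 != i:
                    continue
                for phrase, tm_lp in TM.get(tuple(f[i:j]), []):
                    score, prev = lp + tm_lp, e_prev
                    for w in phrase:              # score the new English words
                        score += LM.get((prev, w), log(1e-6))
                        prev = w
                    key = (j, phrase[-1])
                    if key not in h or score > h[key][0]:   # recombination
                        h[key] = (score, (i, e_prev), phrase)
    # best complete hypothesis, including the stop probability
    best_score, best_key = max((lp + LM.get((e, "</s>"), log(1e-6)), (j, e))
                               for (j, e), (lp, _, _) in h.items() if j == len(f))
    words, key = [], best_key
    while h[key][1] is not None:                  # follow predecessor pointers
        _, pred, phrase = h[key]
        words = list(phrase) + words
        key = pred
    return " ".join(words), best_score

print(decode(["un", "comite", "de", "selection"]))   # ('a committee selection', ...)

On the toy model this prints the monotone translation a committee selection; the provided decoder does the same kind of bookkeeping, but with hypothesis objects, stacks, and pruning.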
Beam search, aka stack decoding

The third approximation of our decoder is pruning: as it constructs the dynamic program from left to
right, for each source position $j$ it remembers only the highest-probability values for $h(j, e)$ (over all
$e \in V_E$) and discards the rest. The decoder uses histogram pruning, in which at most $s$ hypotheses are
retained at each position $j$. By default it only keeps 1 hypothesis. Pruning introduces approximation into
the search, because a hypothesis leading to the overall best translation may be pruned during search. At
each position $j$, we keep the best $s$ hypotheses in a stack (the unfortunate MT jargon term for a priority
queue).

The default decoder also implements another common form of pruning, which is that it only considers
the $k$ most probable translations for each phrase. By default, $k = 1$.

This approach to pruning, in which we iterate through a set of successively longer partial translations,
considering only a fixed number at each time step, is also known as beam search.
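Histogram pruning itself is little more than a sort: keep the s best hypotheses in each stack and drop the rest. A minimal sketch, assuming a stack maps states to hypothesis objects that carry a logprob field (both the layout and the attribute name are assumptions based on the walkthrough below):

def prune(stack, s):
    # keep only the s highest-scoring hypotheses in this stack
    return sorted(stack.values(), key=lambda h: -h.logprob)[:s]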
Putting it all together
The code consists of two distinct sections: one defines the dynamic programming recursion, starting at
line 42 (https://github.com/INFR11133/hw1/blob/master/decode#L42).
Line 42 simply defines objects of the form $h(i, e)$, here called state objects.
The function initial_state at Line 45 defines the initial state $h(0, \text{START})$.
The function assign_stack at Line 49 takes a state as input, and returns the index of the stack it
should appear in.
The function extend_state at line 58 computes all possible extensions of a state s with
reference to input sentence f .
For the first part of the homework below (and probably the second), these functions are the only part of
the code that you need to modify!
A generic stack decoding algorithm starting at line 88
(https://github.com/INFR11133/hw1/blob/master/decode#L88) closely follows the pseudocode in
Koehn’s textbook (Figure 6.6 on p. 165).
Lines 91–94 define data structures: we have a hypothesis data structure that summarizes a partial
English translation of the French sentence parameterized by the state objects defined above, but
also including information about the best previous state. There is also a set of n+1 stacks.
stack[i] will eventually hold different hypotheses about ways to translate exactly $i$ words of the
French. The hypotheses in a stack are indexed by their state, so that each state can appear at
most once in a stack.
Line 95 places an initial hypothesis in the initial stack (Fig 6.6 line 1).
At line 96, we iterate over the stacks, processing each in turn. In other words, we process the
hypotheses in order of the number of French words they have translated. (Fig 6.6 line 2).
At line 97, we iterate over all hypotheses in the stack (Fig 6.6 line 3). Prior to this, we sort the
hypotheses according to the LM score and then prune the stack, leaving only the top $s$ hypotheses
(Fig 6.6 line 9).
Lines 98-103 iterate over all possible extensions of the hypothesis (Fig 6.6 line 4-6).
Line 104 adds the new hypothesis to a stack (Fig 6.6 line 7). Notice that in the second condition of
line 49 we also do recombination: if there was already a hypothesis with the same LM state but
lower probability, we replace it with the newly created one, which represents a higher-probability
path with identical completions (Fig 6.6 line 8).
Lines 106–109 print the best hypothesis. First, we choose the best hypothesis in the final stack,
which is simply the one with the highest probability (line 51). Then we trace back through the
sequence of hypotheses that lead to it, and print out the corresponding English words in order
using the recursive function extract_english (this is not a closure. I just included it at this point
to keep all of the logic in a natural order).
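For orientation, here is a purely illustrative, simplified picture of what the state machinery might look like in a monotone bigram setting; the actual code in decode keeps more language model context and may differ in detail. An extend_state function would then, for each reachable j, enumerate the phrase translations of f[s.i:j] and yield the corresponding new states.

# Illustrative only: a simplified (monotone, bigram) view of the state objects
# described above; the provided decoder may differ.
from collections import namedtuple

# a state records how many French words have been translated and the English
# context the (here, bigram) language model needs
state = namedtuple("state", "i, e")

def initial_state():
    return state(0, "<s>")

def assign_stack(s):
    # a hypothesis that has translated exactly i French words goes in stack i
    return s.i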
It’s quite possible to complete the assignment without modifying any part of the stack decoding
algorithm!
Now that we’ve seen how all of this works, it’s time to experiment with some of the pruning parameters,
and see how they trade search accuracy for speed.
time python2.7 decode > output.default
python2.7 compute-model-score < output.default
time python2.7 decode -s 100 -k 10 > output.s=100.k=10
python2.7 compute-model-score < output.s=100.k=10
vimdiff output.default output.s=100.k=10
(Type :q:q to exit vimdiff).
Question 1 [10 marks]: Experiment with different combinations of values for the stack size parameter
-s and the maximum number of translations -k , until you can no longer obtain any improvements.
1. [4 marks] How do changes in these parameters affect the resulting log-probabilities?
2. [2 marks] How do they affect speed?
3. [2 marks] How do they affect the translations? Do the translations change? Do they make more
sense? Are they more fluent? Give examples.
4. [2 marks] What is the maximum log-probability you can obtain?
Part 2: Local Reordering [50 marks]
Your task is to find the most probable English translation.
Remember that our model assumes that any segmentation of the French sentence into phrases followed
by a one-for-one substitution and permutation of those phrases is a valid translation. The baseline
approach that I want you to explore in this part of the assignment is one that considers only a very
limited space of permutations: the ability to swap translations of adjacent phrases. More precisely, if
$\vec{e}_1$ is the translation of French words $f_i \ldots f_k$ and $\vec{e}_2$ is the translation of French words $f_{k+1} \ldots f_j$, then
your output can contain them in the order $\vec{e}_2 \vec{e}_1$. Notice that since phrases can be swapped only if they
are translations of adjacent phrases, further swaps involving $\vec{e}_1$ and $\vec{e}_2$ are no longer possible, because
the new right neighbour of $\vec{e}_1$ was not adjacent to it in the original order, and likewise the left neighbour of
$\vec{e}_2$. In other words, we cannot permute sequences of three or more phrases under this definition of
reordering.
To be very precise about what the new space of possible translations looks like, let’s examine a worked
example.
echo un Comité de sélection > example
To keep things simple, we’ll only consider the case where there are at most two translations per phrase.
You can control this using the -k parameter to the decoder. To see the phrases that will actually be used,
insert the following line just below line 61 (https://github.com/INFR11133/hw1/blob/master/decode#L61):
sys.stderr.write("%s ||| %s ||| %f\n" % (" ".join(f[s.i:j]), phrase.english, phrase.logprob))
Now run the decoder:
python decode -k 2 -i example
You should see the following phrase table on stderr:
un ||| a ||| -0.128525
un ||| an ||| -0.912392
Comité ||| committee ||| -0.277924
Comité ||| Committee ||| -0.330485
Comité de ||| committee ||| -0.511883
Comité de ||| Committee on ||| -0.511883
de ||| of ||| -0.313224
de ||| to ||| -0.746520
de sélection ||| for determining qualified ||| 0.000000
sélection ||| selection ||| -0.054358
sélection ||| determining qualified ||| -0.929419
How many translations can we create from the above phrase table? Our model consists of three steps:
segmentation, permutation, and substitution. (Note that under the assumptions of our model,
permutation and substitution operations are associative and commutative; so we can actually do them in
any order. However, we need to know the segmentation first).
First we segment. There are eight possible segmentations, indicated by the ∙ below:
1. un Comité de sélection
2. un Comité de ∙ sélection
3. un Comité ∙ de sélection
4. un Comité ∙ de ∙ sélection
5. un ∙ Comité de sélection
6. un ∙ Comité de ∙ sélection
7. un ∙ Comité ∙ de sélection
8. un ∙ Comité ∙ de ∙ sélection
Only segmentations 6, 7, and 8 will lead to valid translations, since others contain phrases for which
there is no entry in the phrase table. (This means that our probability model allocates some mass to
segmentations that don’t lead to a translation. But do not worry about this—this assumption just makes
our model easier to work with).
Second, we permute the segments. The baseline decoder only considers the source language order, so
we are left with just the segmentations 6, 7, and 8. We’ll discuss swaps below, but I hope walking
through a complete example of the baseline decoder first will be helpful.
Finally, we translate the segments independently. This leads to a large number of translations. For
segmentation 6 alone, we get 8 translations:
1. a committee selection
2. a committee determining qualified
3. a Committee on selection
4. a Committee on determining qualified
5. an committee selection
6. an committee determining qualified
7. an Committee on selection
8. an Committee on determining qualified
We get four more for segmentation 7:
1. a committee for determining qualified
2. a Committee for determining qualified
3. an committee for determining qualified
4. an Committee for determining qualified
And 16 for segmentation 8:
1. a committee of selection
2. a committee of determining qualified
3. a committee to selection
4. a committee to determining qualified
5. a Committee of selection
6. a Committee of determining qualified
7. a Committee to selection
8. a Committee to determining qualified
9. an committee of selection
10. an committee of determining qualified
11. an committee to selection
12. an committee to determining qualified
13. an Committee of selection
14. an Committee of determining qualified
15. an Committee to selection
16. an Committee to determining qualified
Now consider the case where adjacent phrases are swapped. Let's go back to step 2, and think about what this
means for segmentation 6: un ∙ Comité de ∙ sélection. There are three possible orderings that swap
adjacent phrases:
1. un ∙ Comité de ∙ sélection (the original order)
2. un ∙ sélection ∙ Comité de (swapping the 2nd and 3rd phrase)
3. Comité de ∙ un ∙ sélection (swapping the 1st and 2nd phrase)
In the full space of permutations, more orders are possible, but I’m not asking you to do this here (you
might try it for the challenge part of the assignment). In particular, you don’t need to consider any of
these orders, which would involve swapping a phrase more than once:
Comité de ∙ sélection ∙ un (to get this, you'd have to first swap un with Comité de, and then swap
un again with sélection)
sélection ∙ un ∙ Comité de (to get this, you'd have to first swap sélection with Comité de, and then
swap sélection again with un)
sélection ∙ Comité de ∙ un (similar to the above, also swapping un again with Comité de)
Now consider segmentation 7. Again, 3 permutations are possible:
1. un ∙ Comité ∙ de sélection (original order)
2. un ∙ de sélection ∙ Comité (swap 2nd and 3rd)
3. Comité ∙ un ∙ de sélection (swap 1st and 2nd)
And finally, for segmentation 8, 5 permutations are possible:
1. un ∙ Comité ∙ de ∙ sélection (original order)
2. un ∙ Comité ∙ sélection ∙ de (swap 3rd and 4th phrase)
3. un ∙ de ∙ Comité ∙ sélection (swap 2nd and 3rd phrase)
4. Comité ∙ un ∙ de ∙ sélection (swap 1st and 2nd phrase)
5. Comité ∙ un ∙ sélection ∙ de (swap 1st and 2nd phrase, and also swap 3rd and 4th phrase)
Now, for all permutations, you must consider all possible translations of the segments! This means that
for segmentation 6, we get 3 × 8 = 24 possible translations, for segmentation 7 we get 3 × 4 = 12, and
for segmentation 8, we get 5 × 16 = 80 possible translations. That's 116 possible translations! But you
have probably noticed that many of these translations differ from others by only a single phrase or swap.
Hence, dynamic programming will allow you to implement an efficient search by sharing work across
translations.
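If you want to check the counting for segmentation 6 (3 orderings, each with 2 × 2 × 2 phrase choices), a few lines of Python will enumerate the 24 translations. The option lists below are read off the phrase table output above (accents are dropped from the comments to keep the snippet plain ASCII); the same computation for segmentations 7 and 8 gives 12 and 80.

# Enumerate the translations of segmentation 6 under adjacent-phrase swaps.
from itertools import product

options = [["a", "an"],                             # un
           ["committee", "Committee on"],           # Comite de
           ["selection", "determining qualified"]]  # selection

orders = [(0, 1, 2),   # original order
          (1, 0, 2),   # swap 1st and 2nd phrase
          (0, 2, 1)]   # swap 2nd and 3rd phrase

translations = set()
for choice in product(*options):
    for order in orders:
        translations.add(" ".join(choice[p] for p in order))
print(len(translations))   # 24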
Let's reason about what your dynamic program should look like. Since only adjacent phrases can be
swapped, and once swapped, a phrase cannot interact with any other, there are only two cases to
reason about:
1. We can have a partial translation that completely covers a prefix of the source sentence. Denote
by $h(i, e)$ the best partial translation of the first $i$ French words ending in English word $e$. There are
two ways to reach this situation:
   1. By extending a similar partial translation covering a shorter prefix; or
   2. By filling in a gap in the set of translated words that was created by translating the second phrase of a
   pair first. For example, in the third permutation of segmentation 8, there is a gap
   after we translate "de", because we have skipped translating "Comité". This gap is filled in
   when we translate "Comité".
2. We can have a partial translation that covers the first $i$ words of the French, except those from $k$ to
$j$. This case is the antecedent of 1.2. above. This case can only result if we previously had a
translation of the first $k - 1$ words, and then chose to translate a phrase from $j + 1$ to $i$, skipping
the phrase from $k$ to $j$. Let's denote the best possible translation of these words, ending in $e$, as
$s(k, j, i, e)$.
If you’re not sure where to start, you may want to read more about dynamic programming
(https://people.eecs.berkeley.edu/~vazirani/algorithms/chap6.pdf).
Question 2 [15 marks]: Define a new dynamic program for the search space described above. More
precisely: give correct recursive definitions for $h(i, e)$ and $s(k, j, i, e)$ described above.
You will need a precise mathematical description in order to convert it to code.
Question 3 [5 marks]: What is the computational complexity of your dynamic program? Justify your
answer.
Question 4 [5 marks]: Define a mapping from hypothesis objects of your new dynamic program to
stacks, so that all hypotheses in stack $i$ represent translations of exactly $i$ words. In other words: which
stack should a hypothesis be placed on?
Question 5 [15 marks]: Implement your new dynamic program, by replacing the code from lines 42
through 76 of the default decoder.
HINT: This should require relatively little code; my implementation is about 15 lines longer than the
default decoder. You will need to change the state object according to the new dynamic program that
you’ve written, or you might use multiple types of state objects (but if you do, you’ll need to check the
type in assign_stack and extend_state ). You will also need to think carefully about which stack to
place new hypothesis objects in. If you’ve carefully answered the questions above, you should have a
good idea about how to do this.
Question 6 [10 marks]: Using your new decoder, experiment with different combinations of values for
the stack size parameter -s and the maximum number of translations -k , until you can no longer
obtain any improvements.
[4 marks] How do changes in these parameters affect the resulting log-probabilities?
[2 marks] How do they affect speed?
[2 marks] How do they affect the translations? Do the translations change? Do they make more
sense? Are they more fluent? Give examples.
[2 marks] What is the maximum log-probability you can obtain?
Part 3: Beyond local reordering [40 marks]
Implementing a decoder that can swap adjacent phrases and answering the accompanying questions
will earn you 60 marks (i.e., a B). But swapping adjacent phrases will not get you anywhere close to the
most probable translation according to this model. To get full credit, you must additionally design,
implement, and experiment with a more advanced dynamic program. Specifically, I want you to design
and implement a dynamic program that enforces a variant of what’s called the IBM constraint. In this
setting, the decoder may only translate a word at position $i$ if all words to its left are already translated,
except for at most one phrase. For example, suppose we have the following sentence and segmentation:

il ∙ avait ∙ envoyé ∙ un ∙ remplaçant
Given this segmentation, we may choose to translate either “il” or “avait” first. If we translate “avait” first,
we can then translate “envoyé”, but we can’t translate “un” or “remplaçant” yet, because in either case
that would leave two phrases to the left untranslated. In fact, for this particular segmentation, there are
sixteen possible orders, shown below with words that have been ordered before a word to their left in
italics:
1. il avait envoyé un remplaçant
2. il avait envoyé remplaçant un
3. il avait un envoyé remplaçant
4. il avait un remplaçant envoyé
5. il envoyé avait un remplaçant
6. il envoyé avait remplaçant un
7. il envoyé un avait remplaçant
8. il envoyé un remplaçant avait
9. avait il envoyé un remplaçant
10. avait il envoyé remplaçant un
11. avait il un envoyé remplaçant
12. avait il un remplaçant envoyé
13. avait envoyé il un remplaçant
14. avait envoyé il remplaçant un
15. avait envoyé un il remplaçant
16. avait envoyé un remplaçant il
Before proceeding, you might find it helpful to convince yourself that other orders are not permitted
under the above definition, and also that this search space is larger than the one in Part 2. Also notice
where the above permutations are similar; this should help in designing your dynamic program.
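If you want to sanity-check the list above (and your understanding of the constraint), a brute-force enumeration over all orderings is easy to write. The sketch below treats each word as its own phrase, as in the segmentation above, drops the accents to keep the file plain ASCII, and is only a check on the search space, not the dynamic program you are asked to design.

# Brute-force check of the IBM-constraint example: keep the orders where,
# at the moment a word is translated, at most one word to its left is
# still untranslated.
from itertools import permutations

words = "il avait envoye un remplacant".split()

def obeys_constraint(order):
    covered = set()
    for i in order:
        # count untranslated words strictly to the left of position i
        if sum(1 for j in range(i) if j not in covered) > 1:
            return False
        covered.add(i)
    return True

valid = [order for order in permutations(range(len(words))) if obeys_constraint(order)]
print(len(valid))                                  # 16, matching the list above
for order in valid:
    print(" ".join(words[i] for i in order))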
Question 7 [10 marks]: Define a new dynamic program for the search space described above.
Question 8 [5 marks]: What is the computational complexity of your dynamic program? Justify your
answer.
Question 9 [5 marks]: Define a mapping from hypothesis objects of your new dynamic program to
stacks, so that all hypotheses in stack $i$ represent translations of exactly $i$ words. In other words: which
stack should a hypothesis be placed on?
Question 10 [15 marks]: Implement your new dynamic program, by replacing the code from lines 42
through 76 of the default decoder.
Question 11 [5 marks]: Using your new decoder, experiment with different combinations of values for
the stack size parameter -s and the maximum number of translations -k , until you can no longer
obtain any improvements. Report the final result. What do you conclude?
Self-directed learning: beyond dynamic
programming
In past versions of this course, this coursework contained an open-ended component; unfortunately, due
to this year's enrolment it is not feasible for us to mark open-ended solutions, since it takes a lot of time.
If you’d like to make your implementation closer to a real phrase-based decoder, you could add the
ability to translate words within a particular distortion limit, as discussed in the video lecture. However, if
you’re keen to learn more about decoding, on your own you might consider implementing other
approaches to solving the combinatorial optimization problem implied by the Viterbi approximation:
Implement a greedy decoder (http://www.iro.umontreal.ca/~felipe/bib2webV0.81/cv/papers/paper-
tmi-2007.pdf).
Use chart parsing to search over many permutations in polynomial time
(http://aclweb.org/anthology/C/C04/C04-1030.pdf).
Use a traveling salesman problem (TSP) solver (http://aclweb.org/anthology-new/P/P09/P09-
1038.pdf).
Use finite-state algorithms (http://mi.eng.cam.ac.uk/~wjb31/ppubs/ttmjnle.pdf).
Use Lagrangian relaxation (http://aclweb.org/anthology//D/D13/D13-1022.pdf).
Use integer linear programming (http://aclweb.org/anthology-new/N/N09/N09-2002.pdf).
Use A* search (http://aclweb.org/anthology-new/W/W01/W01-1408.pdf).
These methods all attempt to approximate or solve the Viterbi approximation to decoding. You can also
try to approximate $p(e \mid f)$ directly.
Change the decision function (http://anthology.aclweb.org/N/N04/N04-1022.pdf) to minimize
Bayes risk, which explicitly sums over translations.
Use variational algorithms (http://aclweb.org/anthology//P/P09/P09-1067.pdf).
Use Markov chain Monte Carlo algorithms (http://aclweb.org/anthology//W/W09/W09-1114.pdf).
But there are many other things you might try. There are many ways to decode, and good search is still
considered an open problem.
Ground Rules
You must work individually. If you submit work from someone other than yourself you will receive a
mark of zero for the assignment. Your code and report must be your own work. You can safely
assume that your instructor has software to accurately compute the probability that one piece of
code is a modified copy of the other. On the other hand, sharing questions, clarifications, and
ideas that stop short of actual answers is fine and encouraged, especially through the forum
(https://piazza.com/class/idfwi88bkpo377), since articulating your questions is often a good way to
figure out what you do and don’t know.
You must submit five files. Any files with names other than these will be ignored.
1. answers.pdf : A file containing your answers to Questions 1 through 11 in an A4 PDF. Your
file must be written in LaTeX using this template
(https://www.overleaf.com/latex/templates/infr1113-homework-1-template/bkqsppbgkqnn),
which you should clone and edit to provide your answers. Answers provided in any other
format will receive a mark of zero. Your answers must not exceed two pages, so be
concise. You are however permitted to include graphs or tables on an unlimited number of
additional pages. They should be readable. They should also be numbered and the text
should refer to these numbers.
2. part2.out : The highest-probability translations of the input that you can produce using
your Part 2 decoder.
3. part2.py : Your implementation of the Part 2 decoder.
4. part3.out : The highest-probability translations of the input that you can produce using
your Part 3 decoder.
5. part3.py : Your implementation of the Part 3 decoder.
Questions that are not answered will receive no marks. Even if you aren’t sure of your answer, an
attempt may earn partial marks if it shows evidence of understanding.
You are strongly encouraged to comment your code; markers may review comments if they can’t
understand what your code does, so this is in your best interests.
Your name must not appear in any of the submitted files. If your name appears in the code or
pdf (or output) you will receive a mark of zero.
To submit your files on dice, run:
submit mt 1 answers.pdf part2.py part3.py part2.out part3.out
Acknowledgements
This assignment was developed and refined over several years in collaboration with Chris Callison-Burch
(http://www.cis.upenn.edu/~ccb/), Chris Dyer (http://www.cs.cmu.edu/~cdyer), and Matt Post
(http://cs.jhu.edu/~post/).
Footnotes
1. the machine translation jargon for a sequence of words is phrase, but these aren’t phrases in any
linguistic theory—they can be any sequence of words in the sentence. ↩
2. Technically, this makes the default decoder a hidden semi-Markov model ↩
3. Sometimes these problems combine to create a fortuitous search error
(https://www.aclweb.org/anthology/P/P01/P01-1030.pdf), where inexact search finds a better
translation than the one preferred by a poor model. We will not try to cause fortuitous search
errors! ↩