Sharon Goldwater
(based on slides by Philipp Koehn)
This Unit
Accelerated Natural Language Processing Week 1/Unit 4
Edit Distance
• What is an alignment between two strings?
• What is minimum edit distance and how do we compute it? What’s wrong with a brute force solution, and how do we solve that problem?
Sharon Goldwater
ANLP Week 1/Unit 4
Sharon Goldwater ANLP Week 1/Unit 4 1
Related task: string similarity
Given two strings, how “similar” are they?
• Could indicate morphological relationships:
walk – walks, sleep – slept • Or possible spelling errors (and corrections):
definition – defintion, separate – seperate • Also used in other fields, e.g., bioinformatics: ACCGTA – ACCGATA
Video 1: Edit distance (definition and motivation of algorithm)
Sharon Goldwater
ANLP Week 1/Unit 4 2
Sharon Goldwater ANLP Week 1/Unit 4 3
One measure: minimum edit distance
Example alignments
Let ins/del cost (distance) = 1, sub cost = 2 (0 if no change) (can use other costs, incl diff costs for diff chars)
• Two optimal alignments (cost = 4):
STALL- STA-LL d||s|i d||i|s -TABLE -TABLE
• How many changes to go from string s1 → s2?
STALL
T A L L deletion
T A B L substitution T A B L E insertion
• To solve the problem, we need to find the best alignment between the words.
– Could be several equally good alignments.
Sharon Goldwater ANLP Week 1/Unit 4 4
Example alignments
Let ins/del cost (distance) = 1, sub cost = 2 (0 if no change) (can use other costs, incl diff costs for diff chars)
• Two optimal alignments (cost = 4):
STALL- STA-LL d||s|i d||i|s – T A B L E – T A B L E
• LOTS of non-optimal alignments, such as:
STA-L-L STAL-L- sd|i|id ddssi|i T-ABLE- –TABLE
Sharon Goldwater ANLP Week 1/Unit 4 5
Brute force solution: too slow
How many possible alignments to consider?
• First character could align to any of: —–TABLE-
• Next character can align anywhere to its right •Andsoon…thenumberofalignmentsgrowsexponentiallywith
the length of the sequences.
Sharon Goldwater ANLP Week 1/Unit 4 6
Sharon Goldwater ANLP Week 1/Unit 4 7
Brute force solution: too slow
How many possible alignments to consider?
• First character could align to any of: —–TABLE-
• Next character can align anywhere to its right
• And so on… the number of alignments grows exponentially with
the length of the sequences.
To solve, we use a dynamic programming algorithm
• Store solutions to smaller computations and combine them • Widespread in NLP, e.g. tagging (HMMs), parsing (CKY)
Sharon Goldwater ANLP Week 1/Unit 4 8
Intuition
• Minimum distance D(stall, table) must be the minimum of:
– D(stall, tabl) + 1 (ins e) – D(stal, table) + 1 (del l) – D(stal, tabl) + 2 (sub l/e)
• Similarly for the smaller subproblems
• So proceed as follows:
– solve smallest subproblems first
– store solutions in a table (chart)
– use these to solve and store larger subproblems until we get the
full solution
Sharon Goldwater ANLP Week 1/Unit 4 10
Video 2: Edit distance algorithm (intuition and example)
Sharon Goldwater
ANLP Week 1/Unit 4 9
Chart: starting point
TABLE
S T A L L
0
1
2
3
4
5
1
2
3
4
5
• Chart[i, j] stores D(stall[0..i],table[0..j])
– Ex: Chart[3,0] = D(stall[0..3], table[0..0]) = D(sta, ε)
Sharon Goldwater ANLP Week 1/Unit 4 11
Chart: next step Chart: one more step
TABLE TABLE
SS TT AA LL LL
0
1
2
3
4
5
1
?
2
3
4
5
0
1
2
3
4
5
1
2
2
?
3
4
5
• Chart[1, 1] is D(S, T): the minimum of
D(-,T)+1 (Chart[0,1]+1=2) D(S,-)+1 (Chart[1,0]+1=2)
• Chart[2, 1] is D(ST, T): the minimum of
D(S,T)+1 (Chart[1,1]+1=3) D(ST,-)+1 (Chart[2,0]+1=3)
D(-, -) + 2
Sharon Goldwater
(Chart[0, 0] + 2 = 2) ANLP Week 1/Unit 4
Chart: next steps
D(S, -) + 0
12 Sharon Goldwater
(Chart[1, 0] + 0 = 1)
ANLP Week 1/Unit 4 13
Chart: completed
TABLE TABLE
SS TT AA LL LL
0
1
2
3
4
5
1
2
2
1
3
4
5
0
1
2
3
4
5
1
2
3
4
5
6
2
1
2
3
4
5
3
2
1
2
3
4
4
3
2
3
2
3
5
4
3
4
3
4
• Continue by filling in each full column in order (or go by rows)
Sharon Goldwater ANLP Week 1/Unit 4 14 Sharon Goldwater ANLP Week 1/Unit 4 15
To find alignments
Appendix: More details for the example
TABLE ←1 ←2 ←3 ←4 ←5
S ↑1 ←↖↑2 ←↖↑3 ←↖↑4 ←↖↑5 ←↖↑6 T ↑2 ↖1 ←2 ←3 ←4 ←5 A ↑3 ↑2 ↖1 ←2 ←3 ←4 L ↑4 ↑3 ↑2 ←↖↑3 ↖2 ←3 L ↑5 ↑4 ↑3 ←↖↑4 ↖↑3 ←↖↑4
• also store which subproblem the best score came from • backtrack to get the best alignment
⇒ More complete worked example is below.
Sharon Goldwater ANLP Week 1/Unit 4 16
A note about costs
• The examples uses cost(ins) = cost(del) = 1, and cost(sub) = 2. • But we can choose whatever costs we want. They can even
depend on the particular characters involved.
– Ex: choose cost(sub(c,c′)) to be P(c′|c), the probability of someone accidentally typing c′ when they meant to type c.
– Then we end up computing the most probable sequence of typos that would change one word to the other.
Sharon Goldwater
ANLP Week 1/Unit 4 17
Chart: starting point
TABLE
0
0
S T A L L
• Chart[i, j] stores two things:
– D(stall[0..i],table[0..j]): the MED of substrings of length i, j
– Backpointer(s) showing which sub-alignment(s) was/were extended to create this one.
Sharon Goldwater ANLP Week 1/Unit 4 18
Sharon Goldwater ANLP Week 1/Unit 4 19
Filling first cell
TABLE
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
• Moving down in chart: means we had a deletion (of S). • That is, we’ve aligned (S) with (-).
• Add cost of deletion (1) and backpointer.
Rest of first column
0
←1
0
←1
Sharon Goldwater
ANLP Week 1/Unit 4 20
Rest of first column
TABLE
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
• Each move down first column means another deletion. – D(ST, -) = D(S, -) + cost(del)
Sharon Goldwater ANLP Week 1/Unit 4 21
Start of second column: insertion
TABLE TABLE
0
0
←1
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
• Each move down first column means another deletion.
– D(ST, -) = D(S, -) + cost(del)
– D(STA, -) = D(ST, -) + cost(del) – etc
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
• Moving right in chart (from [0,0]): means we had an insertion. • That is, we’ve aligned (-) with (T).
• Add cost of insertion (1) and backpointer.
Sharon Goldwater ANLP Week 1/Unit 4 22
Sharon Goldwater ANLP Week 1/Unit 4 23
Substitution
Multiple paths
TABLE
TABLE
0
←1 ↖2
0
←1 ↖↑2
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
• Moving down and right: either a substitution or identity. • Here, a substitution: we aligned (S) to (T), so cost is 2.
• For identity (align letter to itself), cost is 0.
• However, we also need to consider other ways to get to this cell:
– Move down from [0,1]: deletion of S, total cost is D(-, T) + cost(del) = 2.
– Same cost, but add a new backpointer.
Sharon Goldwater ANLP Week 1/Unit 4 25
Single best path
Sharon Goldwater
ANLP Week 1/Unit 4 24
Multiple paths
TABLE TABLE
0
←1 ←↖↑2 ↖1
0
←1 ←↖↑2
S↑1 T↑2 A↑3 L↑4 L↑5
• However, we also need to consider other ways to get to this cell:
– Move right from [1,0]: insertion of T, total cost is D(S,-)+cost(ins)=2.
– Same cost, but add a new backpointer.
Sharon Goldwater ANLP Week 1/Unit 4 26
S ↑1 T ↑2 A ↑3 L ↑4 L ↑5
• Now compute D(ST, T). Take the min of three possibilities:
– D(ST, -) + cost(ins) = 2 + 1 = 3. – D(S,T)+cost(del)=2+1=3. – D(S, -) + cost(ident) = 1 + 0 = 1.
Sharon Goldwater ANLP Week 1/Unit 4 27
Final completed chart
TABLE ←1 ←2 ←3 ←4 ←5
S ↑1 ←↖↑2 ←↖↑3 ←↖↑4 ←↖↑5 ←↖↑6 T ↑2 ↖1 ←2 ←3 ←4 ←5
0
A ↑3 L ↑4 L ↑5
• Exercises for you:
– How many different optimal alignments are there?
– Reconstruct all the optimal alignments.
– Redo the chart with all costs = 1 (Levenshtein distance)
Sharon Goldwater ANLP Week 1/Unit 4 28
↑2 ↖1 ←2 ←3 ←4 ↑3 ↑2 ←↖↑3 ↖2 ←3 ↑4 ↑3 ←↖↑4 ↖↑3 ←↖↑4