CS114 (Spring 2020) Programming Assignment 1 Finite State Transducers
Due February 4, 2020
This assignment is on finite state transducers and will require you to create transducers in Python. There are two problems, and we have provided some code to get you started, as well as a few utilities that will make it easier for you to debug and test your code.
Some minimal test cases are provided as well. Before you turn in your work, you should at least pass all of the test cases we provided although it is neither sufficient nor necessary for getting full credit. You should add your own test cases to help yourself debug your program.
1
Problem 1: Soundex1
The Soundex algorithm is a phonetic algorithm commonly used by libraries and the Census Bureau to represent people’s names as they are pronounced in English. It has the advantage that name variations with minor spelling differences will map to the same representation, as long as they have the same pronunciation in English. Here is how the algorithm works:
Step 1: Retain the first letter of the name. This may be uppercased or lowercased.
Step 2: Remove all non-initial occurrences of the following letters: a, e, h, i, o, u, w, y. (To clarify, this step removes all occurrences of the given characters except when they occur in the first position.)
Step 3: Replace the remaining letters (except the first) with numbers:
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2 d, t → 3
l→4
m, n → 5
r→6
If two or more letters from the same number group were adjacent in the original name, then only replace the first of those letters with the corresponding number and ignore the others. This rule also applies at the beginning of the name: retain the first letter, but ignore the others.
Step 4: If there are more than 3 digits in the resulting output, then drop the extra ones.
Step 5: If there are less than 3 digits, then pad at the end with the required number of trailing zeros.
The final output of applying Soundex algorithm to any input string should be of the form Letter Digit Digit Digit. Table 1 shows the out- put of the Soundex algorithm for some example names.
1 This problem is adapted from Jordan Boyd-Graber’s course at UMD.
2
Input
Output
Jurafsky
J612
Jarovski
J612
Resnik
R252
Reznick
R252
Euler
E460
Peterson
P362
Ashcroft
A226
Pfister
P236
Table 1: Example outputs for the Soundex algorithm.
Construct an fst that implements the Soundex algorithm. Obviously, it is non-trivial to implement a single transducer for the entire algorithm. Therefore, the strategy we will adopt is a bottom-up one: implement mul- tiple transducers, each performing a simpler task, and then compose them together to get the final output. One possibility is to partition the algorithm across three transducers:
1. Transducer 1: (letters_to_numbers) Performs steps 1-3 of the al- gorithm, i.e, retaining the first letter, removing letters and replacing letters with numbers.
2. Transducer 2: (truncate_to_three_digits) Performs step 4 of the algorithm, i.e., truncating extra digits.
3. Transducer 3: (add_zero_padding) Performs step 5 of the algo- rithm, i.e., padding with zeros if required.
Note that each of these three transducers will have characters as input/out- put symbols.
To make things easier for you, we have provided the file soundex.py which is where you will write your code. It already imports all needed mod- ules and functions (including fsmutils.py). It also creates three transducer objects—as dictated by the bottom-up strategy outlined above—such that all you should have to do is to figure out the states and arcs required by each transducer. It also contains code that allows you to input a single name on the command line to get the output.
Note that while we have provided you with sample unit tests containing some names, it might be very useful to test your code on other names. For
3
comparison purposes, you may use one of the many Soundex calculators available online to create more test cases. (Note that some calculators may use additional rules; you only need to implement the rules listed above.)
Problem 2: English Morphology
English often requires some spelling changes at the morpheme boundary between the stem and the affixes that indicate the inflection of the verb. Specifically, we will build a finite state transducer to handle the K-insertion rule in English.
If a verb ends with vowel + c, then we add k before adding in the -ed and -ing suffixes. For example:
panicked ↔ panic + ed ↔ panic+past form
panicking ↔ panic + ing ↔ panic+present participle form lick ↔ lick + ed ↔ lick+past form
lick ↔ lick + ing ↔ lick+present participle form
As you can see in the examples, the morphological parsing process can be done in two steps (although you may skip the intermediate step). So you might want to (but don’t have to) build two separate FSTs and compose them. The lexicon FST only has to include two inflectional morphemes (-ed and -ing) and five verbs: want, sync, panic, havoc, and lick. The other FST will handle the actual K-insertion rule.
The parser FST will be used as a morphological parser to transduce the surface form (e.g panicked) to the lexical form (panic + past form). The generator FST will be used to transduce the lexical form to the surface form.
Turning in Your Assignment
Submit your completed code (soundex.py and morphology.py) to LATTE. We are using a grading script to grade the work, so please make sure:
You submit each file individually. Do not zip or tar the files. You do not change filenames, function names, or the api.
Do not add print statements to the code you turn in
Add your name as a comment to all files you turn in
4
How to use the FST code
We’ve provided a modified fst implementation from nltk. However, the documentation is less than perfect. Therefore, in this section, we will give you a brief introduction to everything that you will need to understand how to build finite state transducers in Python.
To start building fsts, you need to first import the fst module into your program’s namespace. Then, you need to instantiate an FST object. Once you have such an object, you can start adding states and arcs to it. Listing 1 shows how to build a very simple finite state transducer—one that removes all vowels from any given word.
Feel free to try out the example (provided as devowelizer.py) to see how it works on some of your own input. There are a few points worth mentioning:
1. The Python string module comes with a few built-in strings that you might be able to use in this assignment for purposes of iteration as used in the example on line 23. These are:
string.ascii letters : All letters, upppercased and lowercased string.ascii lowercase : All lowercased letters
string.ascii uppercase : All uppercased letters
2. States can be added to an FST object by using its add state() method. This method takes a single argument: a unique string iden- tifier for the state. Our example has only one state (line 13). Fur- thermore, there can only be one initial state and this is indicated by assigning the state identifier to the FST object’s initial state field (line 17). However, there may be multiple final states in an FST. In fact, it is almost always necessary to have multiple final states when working with transducers. All final states may be so indicated by using the FST object’s set final() method (line 18).
3. Arcs can be added between the states of an FST object by using its add arc() method. This method takes the following arguments (in order): the starting state, the ending state, the input symbol and, finally, the output symbol.
5
Listing 1: A 1-state transducer that deletes vowels
1 # import the fst module
2 import fst
4 # import the string module
5 import string
7 # Define a list of all vowels for convenience
8 vowels = [‘a’, ‘e’, ‘i’, ‘o’, ‘u’]
10 # Instantiate an FST object with some name
11 f = fst.FST(‘devowelizer’)
13 # All we need is a single state …
14 f.add_state(‘1’)
16 # and this same state is the initial and the final state
17 f.initial_state = ‘1’
18 f.set_final(‘1’)
20 # Now, we need to add an arc for each letter; if the letter is
21 # a vowel then the transition outputs nothing but otherwise it
22 # outputs the same letter that it consumed.
23 for letter in string.ascii_lowercase:
24 if letter in vowels:
25 f.add_arc(‘1’, ‘1’, letter, ”)
26 else:
27 f.add_arc(‘1’, ‘1’, letter, letter)
29 # Evaluate it on some example words
30 print(”.join(f.transduce([‘v’, ‘o’, ‘w’, ‘e’, ‘l’])[0]))
31 print(”.join(f.transduce(‘exception’)[0]))
32 print(”.join(f.transduce(‘consonant’)[0]))
6
4. An FST object can be evaluated against any input string by using its transduce() method. Here’s how:
(a) If your transducer uses characters as input/output symbols, then the input to transduce() must be a list of characters. You may either directly input a list of characters (line 30) or a string (which is just a list of characters) (lines 31 and 32). Note that the output of transduce() is a list of possible trans- ductions. For the assignment, you should make sure your trans- ductions are deterministic; i.e., there should be at most one element in the list.
(b) If your transducer uses words as input/output symbols, then the
input to transduce() should be a list of words. Again, you can
either explicitly use a list of words or call the split method on a
string of words separated by whitespace. For example, say your
FST maps from strings like ten and twenty to number strings 10
and 20, then to evaluate it on the input string ten twenty, you
should use either:
OR
5. In fsmutils.py , we provide the code for composition. It does not
actually perform the composition and return a newly composed FST.
The above function call computes (f3 ◦ f2 ◦ f1)(S). i.e., it will first apply transducer f1 to the given input S, use the output of this trans- duction as input to transducer f2 and so on and so forth. S must be a list of characters. Note that like transduce(), the compose() function returns a list of possible transductions; make sure there is at most one element in the list.
7
f.transduce(‘ten twenty’.split())
f.transduce([‘ten’, ‘twenty’])
It cascades the output from one FST to the next.
from fsmutils import compose output = compose(S, f1, f2, f3)[0]