This Unit
Accelerated Natural Language Processing Week 1/Unit 3 Computational Approaches to Morphology
Sharon Goldwater
(based on slides by Philipp Koehn)
Sharon Goldwater
ANLP Week 1/Unit 3
• What is a Finite State Machine, and what is the relationship between FSMs and regular languages?
• How are FSMs and FSTs used for morphological recognition, analysis and generation?
• How can we deal with spelling changes in morphological analysis?
Sharon Goldwater ANLP Week 1/Unit 3 1
Tasks
• Recognition
– given: wordform (string of characters)
– wanted: yes/no decision if it is in the language
• Generation
– given: lemma and morphological properties – wanted: correctly inflected wordform
• Analysis
– given: wordform
– wanted: lemma and morphological properties
2 Sharon Goldwater ANLP Week 1/Unit 3 3
Video 1: Finite state machines (motivation and definition)
Sharon Goldwater
ANLP Week 1/Unit 3
Word Lists
Finite State Machines: States
• Simple Solution
– create a list of all wordforms and their morphological properties
– solve tasks by checking against list • But…
– list can become very long
– list fails to generalize for productive morphology
• Instead: use finite state machines (also called finite state automatons)
Sharon Goldwater ANLP Week 1/Unit 3 4
Finite State Machines: Transitions
Sharon Goldwater
places we may find ourselves in
ANLP Week 1/Unit 3 5
c
a
moving between the states
emissions: letters produced at each transition
Finite State Machines: Emissions
Sharon Goldwater ANLP Week 1/Unit 3 6
Sharon Goldwater ANLP Week 1/Unit 3 7
c
a
c
a
c
a
b
b
b
b
Finite State Machines: Start and End
The language of an FSM
Every FSM defines a formal language:
• The set of strings that can be generated by moving from start to
end states, emitting symbols on each transition.
• Equivalently, the set of strings that can be recognized by matching input characters to emission symbols.
The language of an FSM may be finite or infinite.
Sharon Goldwater ANLP Week 1/Unit 3 9
Video 2: FSMs (cont); relationship to RegExps; the Chomsky hierarchy
c
START a
END
Sharon Goldwater
begin at start state, finish at end state
ANLP Week 1/Unit 3 8
FSM with Finite Language
c
START a
END
generated language: { acac, acbc, aacc, aabb, bacc, babb } Sharon Goldwater ANLP Week 1/Unit 3 10
Sharon Goldwater ANLP Week 1/Unit 3 11
a
c
a
c
a
c
a
c
c
c
b
b
a
a
b
b
b
b
b
b
FSM with Finite Language
FSM with Infinite Language
c
START
a
END
b
c
START a
END
generated language: { acac, acbc, aacc, aabb, bacc, babb }
Sharon Goldwater ANLP Week 1/Unit 3 12
Regular Languages
• Languages produced by FSMs are called regular languages
• Many convenient properties (e.g., straightforward to determine if
a word is in the language)
• Not all languages are regular
example: anbn = { ab, aabb, aaabbb, aaaabbbb, … } (would require an infinite number of states)
generated language: { acac, acbc, aacc, aabb, bacc, babb, bbacc, bbabb, bbbacc, bbbabb, bbbbacc, bbbbabb, … }
Sharon Goldwater ANLP Week 1/Unit 3 13
Regular Expressions
• Reg. languages can also be described with regular expressions.
• Every RegEx is equivalent to some FSM (and vice versa). Example: ac(ac|bc) | aa(cc|bb) | bb∗a(bb|cc) where ‘|’ means “or” and ‘x∗’ means “zero or more x’s”.
• RegExs are common in programming to describe sets of strings.
– ls *.jpg
– if ($word =~ /^[A-Z].*/) { $name = 1; }
– if ($name =~ /[WB]ill/) { print “Will or Bill”; }
Sharon Goldwater ANLP Week 1/Unit 3 14
Sharon Goldwater ANLP Week 1/Unit 3 15
c
c
a
c
a
c
a
c
a
c
a
a
b
b
b
b
b
b
b
b
Chomsky Hierarchy
Chomsky Hierarchy
• Chomsky discussed four major classes of formal languages
3. regular (generated by finite state machines, usually assumed sufficient to describe phonology and morphology)
2. context-free (will be covered in later lectures on syntax)
1. context-sensitive (possibly needed for some natural language
phenomena)
0. recursively enumerable (anything a computer program can
produce)
• (There are also many classes of “sub-regular” languages.)
Sharon Goldwater ANLP Week 1/Unit 3 16
Video 3: FSMs for inflection and derivation
• Language classes further down the list are increasingly complex – can describe more languages
– but languages in the class are more difficult to compute
for instance: for a type-0 language it is not generally possible to
determine if a specified word can be generated by the language
• Linguists argue about which (if any) of these classes natural languages belong to, but most phenomena of interest can be described by context-free languages.
Sharon Goldwater
ANLP Week 1/Unit 3 17
One Word
S walk E Basic finite state automaton:
• start state
• transition that emits the word
walk
• end state
ANLP Week 1/Unit 3 19
Sharon Goldwater
ANLP Week 1/Unit 3 18
Sharon Goldwater
One Word and One Inflection
One Word and Multiple Inflections
+s
S walk 1 +ed E
+ing
S walk 1 +ed E Two transitions and intermediate state
• first transition emits walk
• second transition emits +ed → walked
Sharon Goldwater ANLP Week 1/Unit 3 20
Multiple Words and Multiple Inflections
Multiple transitions between states • three different paths
→ walks, walked, walking
Sharon Goldwater ANLP Week 1/Unit 3 21
Multiple Words and Multiple Inflections
laugh +s
S walk 1 +ed E
report +ing
Multiple stems
• implements regular verb morphology
→ laughs, laughed, laughing walks, walked, walking reports, reported, reporting
Multiple stems
• implements regular verb morphology
→ what about bake, emit, fuss? more on this later…
laugh +s
S walk 1 +ed E
report +ing
Sharon Goldwater ANLP Week 1/Unit 3 22
Sharon Goldwater ANLP Week 1/Unit 3 23
Derivational Morphology
ion
word
START START
Derivational Morphology
word
y
ion
double lines = end state
Sharon Goldwater ANLP Week 1/Unit 3
Derivational Morphology
ion
word y fy
24
Sharon Goldwater
ANLP Week 1/Unit 3
Derivational Morphology
25
ion
word y fy cate
START START
again: wordify not wordyfy! why a loop? could it be placed differently?
again, will come back to that later…
Sharon Goldwater ANLP Week 1/Unit 3 26 Sharon Goldwater ANLP Week 1/Unit 3 27
Video 4: More about constructing FSMs
Derivational Morphology
word
START
ion y fy
ist
cate
ism
er
Sharon Goldwater
ANLP Week 1/Unit 3
Marking Part of Speech
ion y fy
28
Sharon Goldwater
ANLP Week 1/Unit 3
29
word
cate
word
ion
y fy cate
Marking Part of Speech
NAVV NAVV
START ism ist er START ism ist er NN NN
Now: where to add -less? -ness? Others?
Sharon Goldwater ANLP Week 1/Unit 3 30 Sharon Goldwater ANLP Week 1/Unit 3 31
Concatenation
What is Required?
• Constructing an FSA gets very complicated • Build components as separate FSAs
– L: FSA for lexicon
– D: FSA for derivational morphology – I: FSA for inflectional morphology
• Concatenate L + D + I (there are standard algorithms)
– In fact, each component may consist of multiple components
(e.g., D has different sets of affixes with ordering constraints)
Sharon Goldwater ANLP Week 1/Unit 3 32
Video 6: Spelling changes, nondeterminism, ambiguity
(Video 5 has no associated slides.)
• Lexicon of lemmas
– very large, needs to be collected by hand
• Inflection and derivation rules
– not large, but requires understanding of the language
Recent work: automatically learn lemmas and suffixes from a corpus • OK solution for languages with few resources
• Hand-engineered systems much better when available
Sharon Goldwater ANLP Week 1/Unit 3 33
Generation and analysis
• FSAs used as morphological recognizers • What if we want to generate or analyze?
walk+V+past ↔ walked report+V+prog ↔ reporting
• Use a finite-state transducer (FST)
– Replace output symbols with input-output pairs x : y
Sharon Goldwater
ANLP Week 1/Unit 3 34
Sharon Goldwater ANLP Week 1/Unit 3 35
FSA for verbs
Schematically
laugh s walk ed
report ing
s verb−reg ed
ing
Sharon Goldwater
ANLP Week 1/Unit 3
FST for verbs
36
Sharon Goldwater
ANLP Week 1/Unit 3 37
Accounting for spelling changes
verb−reg
+3sg:s +V: +past:ed
+prog:ing
• We now have:
• How to fix this?
walk+V+past ↔ walked BUT bake+V+past ↔ bakeed
where x means x:x and x: means x:ε.
Sharon Goldwater
ANLP Week 1/Unit 3
38
Sharon Goldwater
ANLP Week 1/Unit 3 39
• We now have:
+3sg:s# +V:^ +past:ed#
+prog:ing#
Accounting for spelling changes
1. Analysis to intermediate form
walk+V+past ↔ walked BUT bake+V+past ↔ bakeed
verb−reg
• How to fix this? Use two FSTs in a row! walk+V+past ↔ walkˆed# ↔ walked
where x means x:x and x: means x:ε.
• Examples:
walk+V+past ↔ walkˆed# bake+V+past ↔ bakeˆed# bake+V+prog ↔ bakeˆing#
Sharon Goldwater ANLP Week 1/Unit 3 41
Plural transducer (J&M, Fig. 3.17)
• Complete FST for English plural (‘other’ = none of {z,s,x,ˆ,#,ε}) • What happens in each case? catˆs# foxˆs# axleˆs#
Sharon Goldwater
bake+V+past ↔ bakeˆed# ↔ baked ANLP Week 1/Unit 3
40
2. Intermediate form to surface form
Simplified version, only handles some aspects of past tense:
e,other
e:
other
^: ed#:ed
where other means any character except ‘e’.
• Examples: walkˆed# ↔ walked, bakeˆed# ↔ baked
• A nondeterministic FST: mulitple transitions may be possible on the same input (where?).
If any path goes to end state, string is accepted.
Sharon Goldwater ANLP Week 1/Unit 3 42
Sharon Goldwater ANLP Week 1/Unit 3
43
Remaining problem: ambiguity
More info and tools
• FSTs often produce multiple analyses for a single form: walks → walk+V+1sg OR walk+N+pl
German ‘the’: 6 surface forms, but 24 possible analyses
• Resolve using context (surrounding words), usually in a
probabilistic system (stay tuned…)
• More information: Oflazer (2009): Computational Morphology http://fsmnlp2009.fastar.org/Program files/Oflazer – slides.pdf
• OpenFST (Google and NYU) http://www.openfst.org/
• Carmel Toolkit http://www.isi.edu/licensed-sw/carmel/
• FSA Toolkit http://www-i6.informatik.rwth-aachen.de/∼kanthak/fsa.html
Sharon Goldwater ANLP Week 1/Unit 3 45
Sharon Goldwater ANLP Week 1/Unit 3 44