l13-formal-language-theory-v3
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
1
Semester 1 2021 Week 7
Jey Han Lau
Formal Language Theory &
Finite State Automata
COMP90042
Natural Language Processing
Lecture 13
COMP90042 L13
2
What Have We Learnt?
• Methods to process sequence of words:
‣ N-gram language Model
‣ Hidden Markov Model
‣ Recurrent Neural Networks
• Nothing is fundamentally linguistic about these
models
COMP90042 L13
3
Formal Language Theory
• Studies classes of languages and their
computational properties
‣ Regular language (this lecture)
‣ Context free language (next lecture)
COMP90042 L13
4
Formal Language Theory
• A language = set of strings
• A string = sequence of elements from a finite
alphabet (AKA vocabulary)
today is a nice day
interstellar is a great movie
trump will never be a vegan
ai will take over the world one day
…
COMP90042 L13
5
Why?
• Main goal is to solve the membership problem
‣ Whether a string is in a language
• How? By defining its grammar
today is a nice day
interstellar is a great movie
trump will never be a vegan
ai will take over the world one day
…
colourless green ideas
sleep furiously
?
COMP90042 L13
6
Examples of Language
• Binary strings that start with 0 and end with 1
‣ { 01, 001, 011, 0001, … } ✓
‣ ( 1, 0, 00, 11, 100, … } ✗
• Even-length sequences from alphabet {a, b}
‣ { aa, ab, ba, bb, aaaa, … } ✓
‣ { aaa, aba, bbb, … } ✗
• English sentences that start with wh-word and end in ?
‣ { what ?, where my pants ?, … } ✓
‣ { hello how are you?, why is the dog so cute! }
COMP90042 L13
7
Beyond Membership Problem…
• Membership
‣ Is the string part of the language? Y/N
• Scoring
‣ Graded membership
‣ “How acceptable is a string?” (language models!)
• Transduction
‣ “Translate” one string into another (stemming!)
COMP90042 L13
8
Outline
• Regular language
• Finite state acceptor
• Finite state transducer
COMP90042 L13
9
Regular Language
COMP90042 L13
10
Regular Language
• The simplest class of languages
• Any regular expression is a regular language
‣ Describes what strings are part of the language
(e.g. ‘0(0|1)*1’)
COMP90042 L13
11
Regular Languages
• Formally, a regular expression includes the
following operations/definitions:
‣ Symbol drawn from alphabet, Σ
‣ Empty string, ε
‣ Concatenation of two regular expressions, RS
‣ Alternation of two regular expressions, R|S
‣ Kleene star for 0 or more repeats, R*
‣ Parenthesis () to define scope of operations
COMP90042 L13
12
Examples of Regular Languages
• Binary strings that start with 0 and end with 1
‣ 0(0|1)*1
• Even-length sequences from alphabet {a, b}
‣ ((aa)|(ab)|(ba)|(bb))*
• English sentences that start with wh-word and end
in ?
‣ ((what)|(where)|(why)|(which)|(whose)|(whom)) Σ* ?
COMP90042 L13
13
Properties of Regular Languages
• Closure: if we take regular languages L1 and L2
and merge them, is the resulting language regular?
• RLs are closed under the following:
‣ concatenation and union
‣ intersection: strings that are valid in both L1 and L2
‣ negation: strings that are not in L
• Extremely versatile! Can have RLs for different
properties of language, and use them together
COMP90042 L13
14
Finite State Acceptor
COMP90042 L13
15
Finite State Acceptor
• Regular expression defines a regular language
• But it doesn’t give an algorithm to check whether a
string belongs to the language
• Finite state acceptors (FSA) describes the
computation involved for membership checking
COMP90042 L13
16
Finite State Acceptors
• FSA consists:
‣ alphabet of input symbols, Σ
‣ set of states, Q
‣ start state, q0 ∈ Q
‣ final states, F ⊆ Q
‣ transition function: symbol and state → next state
• Accepts strings if there is path from q0 to a final
state with transitions matching each symbol
‣ Djisktra’s shortest-path algorithm, O(V log V + E)
COMP90042 L13
17
Example FSA
• Input alphabet {a, b}
• States {q0, q1}
• Start, final states q0, {q1}
• Transition function {(q0,a) → q0, (q0, b) → q1,
(q1,b) → q1}
• Regular expression defined
by this FSA?
9.1. REGULAR LANGUAGES 193
q0start q1
a
b
b
Figure 9.1: State diagram for the finite state acceptor M1.
9.1.1 Finite state acceptors
A regular expression defines a regular language, but does not give an algorithm for de-
termining whether a string is in the language that it defines. Finite state automata are
theoretical models of computation on regular languages, which involve transitions be-
tween a finite number of states. The most basic type of finite state automaton is the finite
state acceptor (FSA), which describes the computation involved in testing if a string is
a member of a language. Formally, a finite state acceptor is a tuple M = (Q,⌃, q0, F, �),
consisting of:
• a finite alphabet ⌃ of input symbols;
• a finite set of states Q = {q0, q1, . . . , qn};
• a start state q0 2 Q;
• a set of final states F ✓ Q;
• a transition function � : Q ⇥ (⌃ [ {✏}) ! 2Q. The transition function maps from a
state and an input symbol (or empty string ✏) to a set of possible resulting states.
A path in M is a sequence of transitions, ⇡ = t1, t2, . . . , tN , where each ti traverses an
arc in the transition function �. The finite state acceptor M accepts a string ! if there is
an accepting path, in which the initial transition t1 begins at the start state q0, the final
transition tN terminates in a final state in Q, and the entire input ! is consumed.
Example
Consider the following FSA, M1.
⌃ ={a, b} [9.1]
Q ={q0, q1} [9.2]
F ={q1} [9.3]
� ={(q0, a) ! q0, (q0, b) ! q1, (q1, b) ! q1}. [9.4]
This FSA defines a language over an alphabet of two symbols, a and b. The transition
function � is written as a set of arcs: (q0, a) ! q0 says that if the machine is in state
Under contract with MIT Press, shared under CC-BY-NC-ND license.
abb ✓
aba ✗
aab ✓
b ✓
aaa ✗
PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
COMP90042 L13
18
COMP90042 L13
19
Derivational Morphology
• Use of affixes to change word to another
grammatical category
• grace → graceful → gracefully
• grace → disgrace → disgracefully
• allure → alluring → alluringly
• allure → *allureful
• allure → *disallure
COMP90042 L13
20
FSA for Morphology
• Fairly consistent process
‣ want to accept valid forms (grace → graceful)
‣ reject invalid ones (allure → *allureful)
‣ generalise to other words, e.g., nouns that
behave like grace or allure
COMP90042 L13
21
FSA for Word Morphology
play
spare
light
COMP90042 L13
22
Weighted FSA
• Some words are more plausible than others
‣ fishful vs. disgracelyful
‣ musicky vs. writey
• Graded measure of acceptability — weighted FSA
changes the following:
‣ start state weight function, λ: Q → ℝ
‣ final state weight function, ρ: Q → ℝ
‣ transition function, δ: (Q, Σ, Q) → ℝ
COMP90042 L13
23
WFSA Shortest-Path
• Total score of a path
‣ is an edge
• Use shortest-path algorithm to find with
minimum cost
‣ O(V log V + E), as before
π = t1, . . . , tN
λ(t0) +
N
∑
i=1
δ(ti) + ρ(tN)
t
π
COMP90042 L13
24
Finite State Transducer
COMP90042 L13
25
Finite State Transducers (FST)
• Often don’t want to just accept or score strings
‣ want to translate them into another string
COMP90042 L13
26
FSA for Word Morphology
Finite state acceptor: allure + ing = allureing
Finite state transducer: allure + ing = alluring
COMP90042 L13
27
Finite State Transducer
• FST add string output capability to FSA
‣ includes an output alphabet
‣ transitions now take input symbol and emit
output symbol (Q, Σ, Σ, Q)
• Can be weighted = WFST
‣ Graded scores for transition
• E.g., edit distance as WFST
‣ distance to transform one string to another
COMP90042 L13
28
Edit Distance Automata
9.1. REGULAR LANGUAGES 201
transition weight, which are themselves probabilities. The � operator is addition, so that
the total score is the sum of the scores (probabilities) for each path. This corresponds to
the probability under the interpolated bigram language model.
9.1.4 Finite state transducers
Finite state acceptors can determine whether a string is in a regular language, and weighted
finite state acceptors can compute a score for every string over a given alphabet. Finite
state transducers (FSTs) extend the formalism further, by adding an output symbol to each
transition. Formally, a finite state transducer is a tuple T = (Q,⌃,⌦,�, ⇢, �), with ⌦ repre-
senting an output vocabulary and the transition function � : Q⇥ (⌃[ ✏)⇥ (⌦[ ✏)⇥Q ! R
mapping from states, input symbols, and output symbols to states. The remaining ele-
ments (Q,⌃,�, ⇢) are identical to their definition in weighted finite state acceptors (sub-
section 9.1.3). Thus, each path through the FST T transduces the input string into an
output.
String edit distance
The edit distance between two strings s and t is a measure of how many operations are
required to transform one string into another. There are several ways to compute edit
distance, but one of the most popular is the Levenshtein edit distance, which counts the
minimum number of insertions, deletions, and substitutions. This can be computed by
a one-state weighted finite state transducer, in which the input and output alphabets are
identical. For simplicity, consider the alphabet ⌃ = ⌦ = {a, b}. The edit distance can be
computed by a one-state transducer with the following transitions,
�(q, a, a, q) = �(q, b, b, q) = 0 [9.12]
�(q, a, b, q) = �(q, b, a, q) = 1 [9.13]
�(q, a, ✏, q) = �(q, b, ✏, q) = 1 [9.14]
�(q, ✏, a, q) = �(q, ✏, b, q) = 1. [9.15]
The state diagram is shown in Figure 9.5.
For a given string pair, there are multiple paths through the transducer: the best-
scoring path from dessert to desert involves a single deletion, for a total score of 1; the
worst-scoring path involves seven deletions and six additions, for a score of 13.
The Porter stemmer
The Porter (1980) stemming algorithm is a “lexicon-free” algorithm for stripping suffixes
from English words, using a sequence of character-level rules. Each rule can be described
Under contract with MIT Press, shared under CC-BY-NC-ND license.
202 CHAPTER 9. FORMAL LANGUAGE THEORY
qstart
a/a, b/b : 0
a/b, b/a : 1
a/✏, b/✏ : 1
✏/a, ✏/b : 1
Figure 9.5: State diagram for the Levenshtein edit distance finite state transducer. The
label x/y : c indicates a cost of c for a transition with input x and output y.
by an unweighted finite state transducer. The first rule is:
-sses ! -ss e.g., dresses ! dress [9.16]
-ies ! -i e.g., parties ! parti [9.17]
-ss ! -ss e.g., dress ! dress [9.18]
-s ! ✏ e.g., cats ! cat [9.19]
The final two lines appear to conflict; they are meant to be interpreted as an instruction
to remove a terminal -s unless it is part of an -ss ending. A state diagram to handle just
these final two lines is shown in Figure 9.6. Make sure you understand how this finite
state transducer handles cats, steps, bass, and basses.
Inflectional morphology
In inflectional morphology, word lemmas are modified to add grammatical information
such as tense, number, and case. For example, many English nouns are pluralized by the
suffix -s, and many verbs are converted to past tense by the suffix -ed. English’s inflectional
morphology is considerably simpler than many of the world’s languages. For example,
Romance languages (derived from Latin) feature complex systems of verb suffixes which
must agree with the person and number of the verb, as shown in Table 9.1.
The task of morphological analysis is to read a form like canto, and output an analysis
like CANTAR+VERB+PRESIND+1P+SING, where +PRESIND describes the tense as present
indicative, +1P indicates the first-person, and +SING indicates the singular number. The
task of morphological generation is the reverse, going from CANTAR+VERB+PRESIND+1P+SING
to canto. Finite state transducers are an attractive solution, because they can solve both
problems with a single model (Beesley and Karttunen, 2003). As an example, Figure 9.7
shows a fragment of a finite state transducer for Spanish inflectional morphology. The
Jacob Eisenstein. Draft of October 15, 2018.
ab → bb: 1
ab → aaab: 2
input = output; no cost
a→b or b→a
delete a character
insert a new character
input symbol
output symbol
COMP90042 L13
29
• Verb inflections in Spanish must match the subject in
person & number
• Goal of morphological analysis:
• canto → cantar+VERB+present+1P+singular
FST for Inflectional Morphology
cantar to sing
1P singular yo canto I sing
2P singular tu cantas you sing
3P singular ella canta she sings
1P plural nostotros cantamos we sing
2P plural vosotros cantáis you sing
3P plural ellas cantan they sing
COMP90042 L13
30
FST for Spanish Inflection
canto →
• canto+Noun+Masc+Sing
• cantar+Verb+PresInd+1P+Sing
canta → cantar+VERB+PresInd+3P+Sing
204 CHAPTER 9. FORMAL LANGUAGE THEORY
start
c/c a/a n/n t/t
o/o
✏/+Noun ✏/+Masc ✏/+Sing
✏/a ✏/r ✏/+Verb o/+PresInd ✏/+1p ✏/+Sing
a/+PresInd
✏/+3p ✏/+Sing
Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two
accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun,
meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accept-
ing path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings).
— failing to accept strings or transductions that are valid. For example, a pluralization
transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the trans-
ducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer
would then be said to overgenerate. If a transducer accepts foot/foots but not foot/feet, then
it simultaneously overgenerates and undergenerates.
Finite state composition
Designing finite state transducers to capture the full range of morphological phenomena
in any real language is a huge task. Modularization is a classic computer science approach
for this situation: decompose a large and unwieldly problem into a set of subproblems,
each of which will hopefully have a concise solution. Finite state automata can be mod-
ularized through composition: feeding the output of one transducer T1 as the input to
another transducer T2, written T2�T1. Formally, if there exists some y such that (x, y) 2 T1
(meaning that T1 produces output y on input x), and (y, z) 2 T2, then (x, z) 2 (T2 � T1).
Because finite state transducers are closed under composition, there is guaranteed to be
a single finite state transducer that T3 = T2 � T1, which can be constructed as a machine
with one state for each pair of states in T1 and T2 (Mohri et al., 2002).
Example: Morphology and orthography In English morphology, the suffix -ed is added
to signal the past tense for many verbs: cook!cooked, want!wanted, etc. However, English
orthography dictates that this process cannot produce a spelling with consecutive e’s, so
that bake!baked, not bakeed. A modular solution is to build separate transducers for mor-
phology and orthography. The morphological transducer TM transduces from bake+PAST
to bake+ed, with the + symbol indicating a segment boundary. The input alphabet of TM
includes the lexicon of words and the set of morphological features; the output alphabet
includes the characters a-z and the + boundary marker. Next, an orthographic transducer
TO is responsible for the transductions cook+ed ! cooked, and bake+ed ! baked. The input
alphabet of TO must be the same as the output alphabet for TM , and the output alphabet
Jacob Eisenstein. Draft of October 15, 2018.
COMP90042 L13
31
• Yes
• No
• Maybe
• Don’t know
PollEv.com/jeyhanlau569
Is natural language regular?
http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
COMP90042 L13
32
COMP90042 L13
33
Sometimes…
• Example:
the mouse that ran.
the cat that killed the mouse that ran.
…
the lion that bullied the hyena that bit the dog that chased
the cat that killed the mouse that ran
…
• Length is unbounded, but structure is local
‣ (Det Noun Prep Verb)*
‣ Can describe with FSA
COMP90042 L13
34
Non-Regular Languages
• Arithmetic expressions with balanced parentheses
‣ (a + (b x ( c / d ) ) )
‣ Can have arbitrarily many opening parentheses
‣ Need to remember how many open parentheses,
to produce the same number of closed
parentheses
‣ Can’t be done with finite number of states
• anbn
COMP90042 L13
35
Center Embedding
• Center embedding of relative clauses
‣ The cat loves Mozart
‣ The cat the dog chased loves Mozart
‣ The cat the dog the rat bit chased loves Mozart
‣ The cat the dog the rat the elephant admired bit chased
loves Mozart
• Need to remember the n subject nouns, to ensure n verbs
follow (and that they agree etc)
• Requires (at least) context-free grammar (next lecture!)
COMP90042 L13
36
Summary
• Concept of a language
• Regular languages
• Finite state automata: acceptors, transducers
• Weighted variants
• Application to edit distance, morphology
COMP90042 L13
37
Reading
• E18, Chapter 9.1