Formal Language Theory &
Finite State Automata
COMP90042
Natural Language Processing Lecture 13
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
1
COMP90042
L13
What is a Language?
• Methodstoprocesssequenceofsymbols: ‣ Language Model
‣ Hidden Markov Model
‣ Recurrent Neural Networks
• Nothingisfundamentallylinguisticaboutthese models
2
COMP90042
L13
Formal Language Theory • Alanguage=setofstrings
• Astring=sequenceofelementsfromafinite alphabet
3
COMP90042
L13
Motivation
• Formallanguagetheorystudiesclassesof languages and their computational properties
‣ Regular language (this lecture)
‣ Context free language (next lecture)
• Maingoalistosolvethemembershipproblem
‣ Whether a string is in a language or not • How?Bydefiningitsgrammar
4
COMP90042
L13
Examples
• Binarystringsthatstartwith0andendwith1 ‣ {01,001,011,0001,…}✓
‣ (1,0,00,11,100,…}✗
• Even-lengthsequencesfromalphabet{a,b} ‣ {aa,ab,ba,bb,aaaa,…}✓
‣ {aaa,aba,bbb,…}✗
• Englishsentencesthatstartwithwh-wordandendin? ‣ {what?,wheremypants?,…}✓
5
COMP90042
L13
Beyond Membership Problem… • Membership
‣ Is the string part of the language? Y/N
• Scoring
‣ Graded membership
‣ “How acceptable is a string?” (language models!)
• Transduction
‣ “Translate” one string into another (stemming!)
6
COMP90042
L13
Overview
• Regularlanguages
• Finitestateacceptors&transducers • Modellingwordmorphology
7
COMP90042
L13
Regular Languages
• Regularlanguage:thesimplestclassof languages
• Anyregularexpressionisaregularlanguage
‣ Describes what strings are part of the language
8
COMP90042
L13
Regular Languages
• Formally,aregularexpressionincludesthe following operations:
‣ Symbol drawn from alphabet, Σ
‣ Empty string, ε
‣ Concatenation of two regular expressions, RS ‣ Alternation of two regular expressions, R|S
‣ Kleene star for 0 or more repeats, R*
‣ Parenthesis () to define scope of operations
9
COMP90042
L13
Examples of Regular Languages • Binarystringsthatstartwith0andendwith1
‣ 0(0|1)*1
• Even-lengthsequencesfromalphabet{a,b}
‣ ((aa)|(ab)|(ba)|(bb))*
• Englishsentencesthatstartwithwh-wordand end in ?
‣ ((what)|(where)|(why)|(which)|(whose)|(whom)) Σ* ?
10
COMP90042
L13
Properties of Regular Languages
• Closure:ifwetakeregularlanguagesL1andL2 and merge them, is the resulting language regular?
• RLsareclosedunderthefollowing:
‣ concatenation and union — follows from definition ‣ intersection: strings that are valid in both L1 and L2 ‣ negation: strings that are not in L
• Extremelyversatile!CanhaveRLsfordifferent properties of language, and use them together
‣ core algorithms will still apply
11
COMP90042
L13
Finite State Acceptors
• Regularexpressiondefinesaregularlanguage
• Butitdoesn’tgiveanalgorithmtocheckwhether a string belongs to the language
• Finitestateacceptors(FSA)describesthe computation involved for membership checking
12
COMP90042
L13
Finite State Acceptors
• FSAconsists:
‣ alphabet of input symbols, Σ
‣ set of states, Q
•
symbol and state → next state
Accepts strings if there is path from q0 to a final
state with transitions matching each symbol ‣ Djisktra’s shortest-path algorithm, O(V log V + E)
‣
‣ transition function
start state, q0 ∈ Q ‣ final states, F ⊆ Q
13
COMP90042
L13
• Inputalphabet
• States
• Start,finalstates REGULAR LANGUAGES
{a,b}
{q0,q1}
q0,{q1}
{(q0,a) → q0, (q0, b) → q1,
(q1,b) → q1}
• Transition function
• Note:seeingainq1
a
b
results in failure • Acceptsa*bb*
start
q0 b q1
Example FSA
14
Figure 9.1: State diagram for the finite state acce
COMP90042
L13
Derivational Morphology
• Useofaffixestochangewordtoanother grammatical category
• grace→graceful→gracefully
• grace→disgrace→disgracefully
• allure→alluring→alluringly
• allure→*allureful
• allure→*disallure
15
COMP90042
L13
FSA for Morphology
• (Fairly)consistentprocess—canwedescribe this as a regular language?
‣ want to accept valid forms, and reject invalid ones
[flagged with *]
‣ generalise to other words, e.g., nouns that behave like grace or allure
16
196
FSA for Word Morphology
COMP90042
L13
CHAPTER9. FORMALLANGUAGETHEORY
grace dis-
allure fair
q -ful q -ly q N1 J1 A1
q grace q -ful q -ly q neg N2 J2 A2
start
q0
-ing q -ly q N3 J3 A3
q
q -ness q q
J4 N4 A4
9.1. REGULARLANGUAGES -ly 197
Figure 9.2: A finite state acceptor for a fragment of English derivational morphology. Each
path represents possible derivations from a single root form.
111
q grace q -ful q -ly q neg N J A
dis-
This FSA can be minimized to the form shown in Figure 9-.i3n,gwhich makes the gen-
grace
erality of the finite state approach more apparent. For example, the transition from q0 to start q0 qN2
q can be made to accept not only fair but any single-morpheme (monomorphemic) ad- J2 allure -ly
jective that takes -ness and -ly as suffixes. In this way, the finite state acceptor can easily be extended: as new word stems are added to the vocabulary, their derived forms will be
fair
accepted automatically. Of course, this FSA would still need to be extended considerably
q -ness q
J2 N3
to cover even this small fragment of English morphology. As shown by cases like music
! musical, athlete ! athletic, English includes several classes of nouns, each with its own
rules for derivation.
Figure 9.3: Minimization of the finite state acceptor shown in Figure 9.2.
17
COMP90042
L13
196
FSA for Word Morphology
grace dis-
allure fair
q -ful q -ly q N1 J1 A1
q grace q -ful q -ly q neg N2 J2 A2
CHAPTER9. FORMALLANGUAGETHEORY
start
q0
-ing q -ly q N3 J3 A3
q
q -ness q q
J4 N4 A4
9.1. REGULARLANGUAGES -ly 197
Figure 9.2: A finite state acceptor for a fragment of English derivational morphology. Each
path represents possible derivations from a single root form.
111
q grace q -ful q -ly q neg N J A
dis-
This FSA can be minimized to the form shown in Figure 9-.i3n,gwhich makes the gen-
grace
erality of the finite state approach more apparent. For example, the transition from q0 to start q0 qN2
q can be made to accept not only fair but any single-morpheme (monomorphemic) ad- J2 allure -ly
jective that takes -ness and -ly as suffixes. In this way, the finite state acceptor can easily be extended: as new word stems are added to the vocabulary, their derived forms will be
fair
accepted automatically. Of course, this FSA would still need to be extended considerably
q -ness q J2 N3
to cover even this small fragment of English morphology. As shown by cases like music
! musical, athlete ! athletic, English includes several classes of nouns, each with its own
rules for derivation.
Figure 9.3: Minimization of the finite state acceptor shown in Figure 9.2.
18
COMP90042
L13
Weighted FSA
19
COMP90042
L13
Weighted FSA
• Somewordsaremorepossiblethanothers ‣ fishful vs. disgracelyful
‣ musicky vs. writey
• Gradedmeasureofacceptability—weightedFSA adds/changes the following:
‣ set of states Q
‣ alphabet of input symbols Σ
‣ start state weight function, λ: Q → R ‣ final state weight function, ρ: Q → R ‣ transition function, δ: (Q, Σ, Q) → R
20
COMP90042
L13
WFSA Shortest-Path Totalscoreofapath𝜋 = 𝑡1, …, 𝑡𝑁now
𝜆(𝑡0) + ∑𝑁 𝛿(𝑡𝑖) + 𝜌(𝑡𝑁)
𝑖=1
each t is an edge, so more formally using from &/or to
states and edge label in score calculation
• Useshortest-pathalgorithmtofindπwithmin. cost
‣ O(V log V + E), as before
•
21
COMP90042
L13
N-gram LMs as WFSA • Recall LM calculates score of string as follows
• Unigram language model:
‣ One state: q0
‣ State transition score:
‣ Initial and final state scores = 0 ‣ Path score for w1, w2, …, wM:
22
COMP90042
L13
Bigram LM
• Bigram sentence probability:
𝑃(𝑤1, 𝑤2, .. 𝑤𝑀) = ∏𝑀 𝑃(𝑤𝑖|𝑤𝑖−1) (bigram) 𝑖=1
• Implemented as WFSA
‣ Σ = set of word types
‣ Q=Σ(no.ofstates=no.ofwordtypes)
23
COMP90042 L13
Finite State Transducer
24
COMP90042
L13
Finite State Transducers (FST) • Oftendon’twanttojustacceptorscorestrings
‣ want to translate them into another language, correct grammar, parse their structure, etc
25
COMP90042
L13
FSA for Word Morphology
FSA: allure + ing = allureing xFST: allure + ing = alluring
26
COMP90042
L13
Finite State Transducers
• FST add string output capability to FSAs
‣ includes an output alphabet
‣ and transitions now take input symbol and emit output symbol (Q, Σ, Σ, Q)
• Canbeweighted=WFST
‣ Graded scores for transition
• E.g.,editdistanceasWFSTwhichtakesone string, and outputs the other
‣ zero cost only if strings are identical
27
rtions, deletions, and substitutions. This can be computed by e
r
e
n g
a label x/y : c indicates a cost of c for a transition with input x and outp
N
COMP90042
state transducer, in which the input and output alphabets are L13
onsider the alphabet ⌃ = ⌦ = {a, b}. The edit distance can be Edit Distance Automata
ansducer with the following transitions,
(q,a,a,q) = (q,b,b,q) = 0 202
(q,a,b,q) = (q,b,a,q) = 1 (q,a,✏,q) = (q,b,✏,q) = 1 (q,✏,a,q) = (q,✏,b,q) = 1.
[9.12]
CHAPTER9. FORMALLA
a/a,b/b : 0
[9.13] [9.14] [9.15]
a/✏, b/✏ : 1 r, there are multiple paths through the transducer: the best-
to desert involves a single deletion, for a total score of 1; the
s seven deletions and six additions, for a score of 13.✏/a, ✏/b : 1
n in Figure 9.5.
ab → bb: 1 ab → aaab: 2
start
q
a/b, b/a : 1
g algorithm is a “lexicon-free” algorithm for stripping suffixes
Figure 9.5: State diagram for the Levenshtein edit distance finite st
a sequence of character-level rules. Each rule can be described
28
COMP90042
L13
FST for Inflectional Morphology
• VerbinflectionsinSpanishmustmatchthesubject in person & number
• Goalofmorphologicalanalysis:
• canto→cantar+VERB+present+1P+singular
cantar
to sing
1P singular
yo canto
I sing
2P singular
tu cantas
you sing
3P singular
ella canta
she sings
1P plural
nostotros cantamos
we sing
2P plural
vosotros cantáis
you sing
3P plural
ellas cantan
they sing
29
COMP90042
L13
204
CHAPTER9. FORMALLANGUAGETHEORY
FST for Spanish Inflection
o/o
✏/+Noun ✏/+Masc
✏/r ✏/+Verb
✏/+Sing
o/+PresInd ✏/+1p
a/+PresInd ✏/+3p
start
c/c a/a n/n t/t
✏/a
✏/+Sing
✏/+Sing
Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two
accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun,
meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accept- canto → canto+Noun+Masc+Sing
ing path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings). canto → cantar+Verb+PresInd+1P+Sing
— failing to accept strings or transductions that are valid. For example, a pluralization transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the trans- ducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer would then be said to overgenerate. If a transducer accepts foot/foots but not foot/feet, then it simultaneously overgenerates and undergenerates. 30
COMP90042
L13
FST Composition
• ComposetwoFSTsbytakingoutputofoneFST, T1, and giving this as input to FST T2
‣ denoted T1 ○ T2; and results in another FST
‣ can also compose FST with FSA, resulting in a FST
31
COMP90042
L13
FST Composition Example
• Allowsdevelopmentofdifferentprocessesas FSTs:
‣ -ed added to signal past tense in English
‣ cook → cooked, want → wanted
‣ But when the word ends with ‘e’, then we want to avoid producing a spelling with consecutive e’s (bakeed)
‣ Solution: build 2 FSTs
‣ FST T1: bake+PAST → bake+ed
‣ FST T2: bake+ed → baked
32
COMP90042 L13
Is Natural Language Regular?
33
COMP90042
L13
• Example:
Sometimes…
the mouse that ran.
the cat that killed the mouse that ran.
…
the lion that bullied the hyena that bit the dog that chased the cat that killed the mouse that ran
…
• Lengthisunbounded(recursive),butstructureis
local → can describe with FSA = Regular • (DetNounPrepVerb)*
34
COMP90042
L13
Non-Regular Languages
• Arithmeticexpressionswithbalancedparentheses ‣ (a+(bx(c/d)))
‣ Can have arbitrarily many opening parentheses
‣ Need to remember how many open parentheses, to produce the same number of closed parentheses
‣ Can’t be done with finite number of states • anbn
35
COMP90042
L13
Center Embedding
• Center embedding of relative clauses
‣ The cat loves Mozart
‣ The cat the dog chased loves Mozart
‣ The cat the dog the rat bit chased loves Mozart
‣ The cat the dog the rat the elephant admired bit chased loves Mozart
• Need to remember the n subject nouns, to ensure n verbs follow (and that they agree etc)
• Requires (at least) context-free grammar (next lecture!)
36
COMP90042
L13
Summary
• Conceptofalanguage,andgrammar
• Regularlanguages
• Finitestateautomata:acceptors,transducers • Closureproperties
• Weightedvariants&shortestpathinference
• Applicationtoeditdistance,morphology
37
COMP90042
L13
Reading ‣ E18, Chapter 9.1
• Reading
38