Formal Language Theory &
Finite State Automata
COMP90042
Natural Language Processing
Lecture 13
Semester 1 2021 Week 7 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
1
COMP90042
L13
•
Methods to process sequence of words: ‣ N-gram language Model
‣ Hidden Markov Model
‣ Recurrent Neural Networks
•
Nothing is fundamentally linguistic about these models
What Have We Learnt?
2
COMP90042
L13
•
Studies classes of languages and their computational properties
‣ Regular language (this lecture)
‣ Context free language (next lecture)
Formal Language Theory
3
COMP90042
L13
• •
Formal Language Theory A language = set of strings
A string = sequence of elements from a finite alphabet (AKA vocabulary)
today is a nice day
ai will take over the world one day
interstellar is a great movie trump will never be a vegan
…
4
COMP90042
L13
•
•
Main goal is to solve the membership problem ‣ Whether a string is in a language
How? By defining its grammar
Why?
colourless green ideas
sleep furiously
?
today is a nice day
ai will take over the world one day
interstellar is a great movie trump will never be a vegan
…
5
COMP90042
L13
Examples of Language
• Binarystringsthatstartwith0andendwith1 ‣ {01,001,011,0001,…}✓
‣ (1,0,00,11,100,…}✗
• Even-lengthsequencesfromalphabet{a,b} ‣ {aa,ab,ba,bb,aaaa,…}✓
‣ {aaa,aba,bbb,…}✗
• Englishsentencesthatstartwithwh-wordandendin? ‣ {what?,wheremypants?,…}✓
‣ { hello how are you?, why is the dog so cute! }
6
COMP90042
L13
Beyond Membership Problem… Membership
‣ Is the string part of the language? Y/N
•
•
•
Scoring
‣ Graded membership
‣ “How acceptable is a string?” (language models!)
Transduction
‣ “Translate” one string into another (stemming!)
7
COMP90042
L13
• • •
Regular language Finite state acceptor Finite state transducer
Outline
8
COMP90042
L13
Regular Language
9
COMP90042
L13
Regular Language
• Thesimplestclassoflanguages
• Anyregularexpressionisaregularlanguage
‣
Describes what strings are part of the language (e.g. ‘0(0|1)*1’)
10
COMP90042
L13
•
Formally, a regular expression includes the following operations/definitions:
‣ Symbol drawn from alphabet, Σ
‣ Empty string, ε
‣ Concatenation of two regular expressions, RS ‣ Alternation of two regular expressions, R|S
‣ Kleene star for 0 or more repeats, R*
‣ Parenthesis () to define scope of operations
Regular Languages
11
COMP90042
L13
•
•
•
Examples of Regular Languages Binary strings that start with 0 and end with 1
‣ 0(0|1)*1
Even-length sequences from alphabet {a, b}
‣ ((aa)|(ab)|(ba)|(bb))*
English sentences that start with wh-word and end
in ?
‣ ((what)|(where)|(why)|(which)|(whose)|(whom)) Σ* ?
12
COMP90042
L13
•
•
Closure: if we take regular languages L1 and L2 and merge them, is the resulting language regular?
•
Extremely versatile! Can have RLs for different properties of language, and use them together
Properties of Regular Languages
RLs are closed under the following:
‣ concatenation and union
‣ intersection: strings that are valid in both L1 and L2 ‣ negation: strings that are not in L
13
COMP90042
L13
Finite State Acceptor
14
COMP90042
L13
Finite State Acceptor
Regular expression defines a regular language
But it doesn’t give an algorithm to check whether a string belongs to the language
•
• •
Finite state acceptors (FSA) describes the computation involved for membership checking
15
COMP90042
L13
•
FSA consists:
‣ alphabet of input symbols, Σ
‣ set of states, Q
•
Accepts strings if there is path from q0 to a final state with transitions matching each symbol
‣ Djisktra’s shortest-path algorithm, O(V log V + E)
Finite State Acceptors
‣
‣ transition function: symbol and state → next state
start state, q0 ∈ Q ‣ final states, F ⊆ Q
16
COMP90042
L13
•
•
•
Input alphabet States
Start, final states
{a, b} {q0, q1} q0, {q1}
Example FSA
abb ✓ aba ✗ aab ✓ b✓ aaa ✗
REGUTrLaAnsRitiLoAnNfuGncUtiAonGES{(q0,a) → q0, (q0, b) → q1,
(q1,b) → q1}
•
•
Regular expression defined
by this FSA?
start
PollEv.com/jeyhanlau569
a
b
q0 b q1
17
Figure 9.1: State diagram for the finite state acce
COMP90042
L13
18
COMP90042
L13
•
• • • • •
Use of affixes to change word to another grammatical category
grace → graceful → gracefully grace → disgrace → disgracefully allure → alluring → alluringly allure → *allureful
allure → *disallure
Derivational Morphology
19
COMP90042
L13
•
Fairly consistent process
‣ want to accept valid forms (grace → graceful) ‣ reject invalid ones (allure → *allureful)
‣ generalise to other words, e.g., nouns that behave like grace or allure
FSA for Morphology
20
COMP90042
L13
FSA for Word Morphology
play spare
light
21
COMP90042
L13
•
Some words are more plausible than others
‣ fishful vs. disgracelyful
‣ musicky vs. writey
•
Graded measure of acceptability — weighted FSA changes the following:
‣ start state weight function, λ: Q → R ‣ final state weight function, ρ: Q → R ‣ transition function, δ: (Q, Σ, Q) → R
Weighted FSA
22
COMP90042
L13
•
WFSA Shortest-Path Total score of a path π = t1,…,tN
λ(t0) + ∑N δ(ti) + ρ(tN) i=1
t is an edge
Use shortest-path algorithm to find π with
minimum cost
‣ O(V log V + E), as before
‣
•
23
COMP90042
L13
Finite State Transducer
24
COMP90042
L13
•
Often don’t want to just accept or score strings ‣ want to translate them into another string
Finite State Transducers (FST)
25
COMP90042
L13
FSA for Word Morphology
Finite state acceptor: allure + ing = allureing Finite state transducer: allure + ing = alluring
26
COMP90042
L13
•
FST add string output capability to FSA
‣ includes an output alphabet
‣ transitions now take input symbol and emit output symbol (Q, Σ, Σ, Q)
•
•
Can be weighted = WFST
‣ Graded scores for transition
E.g., edit distance as WFST
‣ distance to transform one string to another
Finite State Transducer
27
rtions, deletions, and substitutions. This can be computed by e
r
e
n g
a label x/y : c indicates a cost of c for a transition with input x and outp
N
COMP90042
state transducer, in which the input and output alphabets are L13
onsider the alphabet ⌃ = ⌦ = {a, b}. The edit distance can be Edit Distance Automata
ansducer with the following transitions,
(q,a,a,q) = (q,b,b,q) = 0 202
(q,a,b,q) = (q,b,a,q) = 1
(q,a,✏,q) = (q,b,✏,q) = 1
(q,✏,a,q) = (q,✏,b,q) = 1. input symbol
[9.12]
CHAPTER9. FORMALLA
[9.13]
[9.14]
[9.15]
input = output; no cost
a/a,b/b : 0
delete a character
a/✏, b/✏ : 1 r, there are multiple paths through the transducer: the best-
n in Figure 9.5.
output symbol
start
q
to desert involves a single deletion, for a total score of 1; the
s seven deletions and six additions, for a score of 13.✏/a, ✏/b : 1
ab → bb: 1
ab → aaab: 2 insert a new character
a/b, b/a : 1 a→b or b→a
g algorithm is a “lexicon-free” algorithm for stripping suffixes
Figure 9.5: State diagram for the Levenshtein edit distance finite st
a sequence of character-level rules. Each rule can be described
28
COMP90042
L13
FST for Inflectional Morphology
• VerbinflectionsinSpanishmustmatchthesubjectin person & number
• Goalofmorphologicalanalysis:
canto → cantar+VERB+present+1P+singular
•
cantar
to sing
1P singular
yo canto
I sing
2P singular
tu cantas
you sing
3P singular
ella canta
she sings
1P plural
nostotros cantamos
we sing
2P plural
vosotros cantáis
you sing
3P plural
ellas cantan
they sing
29
COMP90042
L13
204
start
CHAPTER9. FORMALLANGUAGETHEORY
FST for Spanish Inflection
c/c a/a n/n t/t
✏/a
✏/+Sing
✏/+Sing
o/o
✏/+Noun ✏/+Masc
✏/r ✏/+Verb
✏/+Sing
o/+PresInd ✏/+1p
a/+PresInd ✏/+3p
Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two
canto →
accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun,
meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accept- ing path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings).
• canto+Noun+Masc+Sing
—fa•ilicngatnotaccre+pVtsetrirnbg+soPrrtreanssIdnudcti+on1sPth+atSarienvgalid.Forexample,apluralization
transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the trans-
ducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer canta → cantar+VERB+PresInd+3P+Sing
would then be said to overgenerate. If a transducer accepts foot/foots but not foot/feet, then
it simultaneously overgenerates and undergenerates.
30
COMP90042
L13
Is natural language regular?
• Yes
• No
• Maybe
• Don’t know
PollEv.com/jeyhanlau569
31
COMP90042
L13
32
COMP90042
L13
•
Example:
…
•
Sometimes…
the mouse that ran.
the cat that killed the mouse that ran.
…
the lion that bullied the hyena that bit the dog that chased the cat that killed the mouse that ran
Length is unbounded, but structure is local ‣ (Det Noun Prep Verb)*
‣ Can describe with FSA
33
COMP90042
L13
•
Arithmetic expressions with balanced parentheses ‣ (a+(bx(c/d)))
‣ Can have arbitrarily many opening parentheses
‣ Need to remember how many open parentheses, to produce the same number of closed parentheses
‣ Can’t be done with finite number of states anbn
•
Non-Regular Languages
34
COMP90042
L13
Center Embedding
• Center embedding of relative clauses
‣ The cat loves Mozart
‣ The cat the dog chased loves Mozart
‣ The cat the dog the rat bit chased loves Mozart
‣ The cat the dog the rat the elephant admired bit chased loves Mozart
• Need to remember the n subject nouns, to ensure n verbs follow (and that they agree etc)
• Requires (at least) context-free grammar (next lecture!)
35
COMP90042
L13
• • • • •
Concept of a language
Regular languages
Finite state automata: acceptors, transducers Weighted variants
Application to edit distance, morphology
Summary
36
COMP90042
L13
•
Reading E18, Chapter 9.1
37