CS计算机代考程序代写 Finite State Automaton Hidden Markov Mode AI algorithm l13-formal-language-theory-v3

l13-formal-language-theory-v3

COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
1

Semester 1 2021 Week 7
Jey Han Lau

Formal Language Theory &

Finite State Automata

COMP90042
Natural Language Processing

Lecture 13

COMP90042 L13

2

What Have We Learnt?

• Methods to process sequence of words:

‣ N-gram language Model

‣ Hidden Markov Model

‣ Recurrent Neural Networks

• Nothing is fundamentally linguistic about these
models

COMP90042 L13

3

Formal Language Theory

• Studies classes of languages and their
computational properties

‣ Regular language (this lecture)

‣ Context free language (next lecture)

COMP90042 L13

4

Formal Language Theory

• A language = set of strings

• A string = sequence of elements from a finite
alphabet (AKA vocabulary)

today is a nice day

interstellar is a great movie

trump will never be a vegan

ai will take over the world one day

COMP90042 L13

5

Why?

• Main goal is to solve the membership problem

‣ Whether a string is in a language

• How? By defining its grammar

today is a nice day

interstellar is a great movie

trump will never be a vegan

ai will take over the world one day

colourless green ideas

sleep furiously

?

COMP90042 L13

6

Examples of Language
• Binary strings that start with 0 and end with 1

‣ { 01, 001, 011, 0001, … } ✓

‣ ( 1, 0, 00, 11, 100, … } ✗

• Even-length sequences from alphabet {a, b}

‣ { aa, ab, ba, bb, aaaa, … } ✓

‣ { aaa, aba, bbb, … } ✗

• English sentences that start with wh-word and end in ?

‣ { what ?, where my pants ?, … } ✓

‣ { hello how are you?, why is the dog so cute! }

COMP90042 L13

7

Beyond Membership Problem…

• Membership
‣ Is the string part of the language? Y/N

• Scoring
‣ Graded membership
‣ “How acceptable is a string?” (language models!)

• Transduction
‣ “Translate” one string into another (stemming!)

COMP90042 L13

8

Outline

• Regular language

• Finite state acceptor

• Finite state transducer

COMP90042 L13

9

Regular Language

COMP90042 L13

10

Regular Language

• The simplest class of languages

• Any regular expression is a regular language

‣ Describes what strings are part of the language
(e.g. ‘0(0|1)*1’)

COMP90042 L13

11

Regular Languages

• Formally, a regular expression includes the
following operations/definitions:
‣ Symbol drawn from alphabet, Σ
‣ Empty string, ε
‣ Concatenation of two regular expressions, RS
‣ Alternation of two regular expressions, R|S
‣ Kleene star for 0 or more repeats, R*
‣ Parenthesis () to define scope of operations

COMP90042 L13

12

Examples of Regular Languages

• Binary strings that start with 0 and end with 1
‣ 0(0|1)*1

• Even-length sequences from alphabet {a, b}
‣ ((aa)|(ab)|(ba)|(bb))*

• English sentences that start with wh-word and end
in ?
‣ ((what)|(where)|(why)|(which)|(whose)|(whom)) Σ* ?

COMP90042 L13

13

Properties of Regular Languages

• Closure: if we take regular languages L1 and L2
and merge them, is the resulting language regular?

• RLs are closed under the following:
‣ concatenation and union
‣ intersection: strings that are valid in both L1 and L2
‣ negation: strings that are not in L

• Extremely versatile! Can have RLs for different
properties of language, and use them together

COMP90042 L13

14

Finite State Acceptor

COMP90042 L13

15

Finite State Acceptor

• Regular expression defines a regular language

• But it doesn’t give an algorithm to check whether a
string belongs to the language

• Finite state acceptors (FSA) describes the
computation involved for membership checking

COMP90042 L13

16

Finite State Acceptors

• FSA consists:
‣ alphabet of input symbols, Σ
‣ set of states, Q
‣ start state, q0 ∈ Q
‣ final states, F ⊆ Q
‣ transition function: symbol and state → next state

• Accepts strings if there is path from q0 to a final
state with transitions matching each symbol
‣ Djisktra’s shortest-path algorithm, O(V log V + E)

COMP90042 L13

17

Example FSA

• Input alphabet {a, b}
• States {q0, q1}
• Start, final states q0, {q1}
• Transition function {(q0,a) → q0, (q0, b) → q1,


(q1,b) → q1}
• Regular expression defined 


by this FSA? 


9.1. REGULAR LANGUAGES 193

q0start q1

a

b

b

Figure 9.1: State diagram for the finite state acceptor M1.

9.1.1 Finite state acceptors

A regular expression defines a regular language, but does not give an algorithm for de-
termining whether a string is in the language that it defines. Finite state automata are
theoretical models of computation on regular languages, which involve transitions be-
tween a finite number of states. The most basic type of finite state automaton is the finite
state acceptor (FSA), which describes the computation involved in testing if a string is
a member of a language. Formally, a finite state acceptor is a tuple M = (Q,⌃, q0, F, �),
consisting of:

• a finite alphabet ⌃ of input symbols;

• a finite set of states Q = {q0, q1, . . . , qn};

• a start state q0 2 Q;

• a set of final states F ✓ Q;

• a transition function � : Q ⇥ (⌃ [ {✏}) ! 2Q. The transition function maps from a
state and an input symbol (or empty string ✏) to a set of possible resulting states.

A path in M is a sequence of transitions, ⇡ = t1, t2, . . . , tN , where each ti traverses an
arc in the transition function �. The finite state acceptor M accepts a string ! if there is
an accepting path, in which the initial transition t1 begins at the start state q0, the final
transition tN terminates in a final state in Q, and the entire input ! is consumed.

Example

Consider the following FSA, M1.

⌃ ={a, b} [9.1]
Q ={q0, q1} [9.2]
F ={q1} [9.3]
� ={(q0, a) ! q0, (q0, b) ! q1, (q1, b) ! q1}. [9.4]

This FSA defines a language over an alphabet of two symbols, a and b. The transition
function � is written as a set of arcs: (q0, a) ! q0 says that if the machine is in state

Under contract with MIT Press, shared under CC-BY-NC-ND license.

abb ✓
aba ✗
aab ✓
b ✓
aaa ✗

PollEv.com/jeyhanlau569

http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569

COMP90042 L13

18

COMP90042 L13

19

Derivational Morphology

• Use of affixes to change word to another
grammatical category

• grace → graceful → gracefully

• grace → disgrace → disgracefully

• allure → alluring → alluringly

• allure → *allureful

• allure → *disallure

COMP90042 L13

20

FSA for Morphology

• Fairly consistent process
‣ want to accept valid forms (grace → graceful)
‣ reject invalid ones (allure → *allureful)
‣ generalise to other words, e.g., nouns that

behave like grace or allure

COMP90042 L13

21

FSA for Word Morphology

play

spare

light

COMP90042 L13

22

Weighted FSA

• Some words are more plausible than others

‣ fishful vs. disgracelyful

‣ musicky vs. writey

• Graded measure of acceptability — weighted FSA
changes the following:

‣ start state weight function, λ: Q → ℝ

‣ final state weight function, ρ: Q → ℝ

‣ transition function, δ: (Q, Σ, Q) → ℝ

COMP90042 L13

23

WFSA Shortest-Path

• Total score of a path

‣ is an edge

• Use shortest-path algorithm to find with
minimum cost

‣ O(V log V + E), as before

π = t1, . . . , tN

λ(t0) +
N


i=1

δ(ti) + ρ(tN)

t
π

COMP90042 L13

24

Finite State Transducer

COMP90042 L13

25

Finite State Transducers (FST)

• Often don’t want to just accept or score strings

‣ want to translate them into another string

COMP90042 L13

26

FSA for Word Morphology

Finite state acceptor: allure + ing = allureing
Finite state transducer: allure + ing = alluring

COMP90042 L13

27

Finite State Transducer

• FST add string output capability to FSA

‣ includes an output alphabet

‣ transitions now take input symbol and emit
output symbol (Q, Σ, Σ, Q)

• Can be weighted = WFST

‣ Graded scores for transition

• E.g., edit distance as WFST

‣ distance to transform one string to another

COMP90042 L13

28

Edit Distance Automata

9.1. REGULAR LANGUAGES 201

transition weight, which are themselves probabilities. The � operator is addition, so that
the total score is the sum of the scores (probabilities) for each path. This corresponds to
the probability under the interpolated bigram language model.

9.1.4 Finite state transducers

Finite state acceptors can determine whether a string is in a regular language, and weighted
finite state acceptors can compute a score for every string over a given alphabet. Finite
state transducers (FSTs) extend the formalism further, by adding an output symbol to each
transition. Formally, a finite state transducer is a tuple T = (Q,⌃,⌦,�, ⇢, �), with ⌦ repre-
senting an output vocabulary and the transition function � : Q⇥ (⌃[ ✏)⇥ (⌦[ ✏)⇥Q ! R
mapping from states, input symbols, and output symbols to states. The remaining ele-
ments (Q,⌃,�, ⇢) are identical to their definition in weighted finite state acceptors (sub-
section 9.1.3). Thus, each path through the FST T transduces the input string into an
output.

String edit distance

The edit distance between two strings s and t is a measure of how many operations are
required to transform one string into another. There are several ways to compute edit
distance, but one of the most popular is the Levenshtein edit distance, which counts the
minimum number of insertions, deletions, and substitutions. This can be computed by
a one-state weighted finite state transducer, in which the input and output alphabets are
identical. For simplicity, consider the alphabet ⌃ = ⌦ = {a, b}. The edit distance can be
computed by a one-state transducer with the following transitions,

�(q, a, a, q) = �(q, b, b, q) = 0 [9.12]
�(q, a, b, q) = �(q, b, a, q) = 1 [9.13]
�(q, a, ✏, q) = �(q, b, ✏, q) = 1 [9.14]
�(q, ✏, a, q) = �(q, ✏, b, q) = 1. [9.15]

The state diagram is shown in Figure 9.5.

For a given string pair, there are multiple paths through the transducer: the best-
scoring path from dessert to desert involves a single deletion, for a total score of 1; the
worst-scoring path involves seven deletions and six additions, for a score of 13.

The Porter stemmer

The Porter (1980) stemming algorithm is a “lexicon-free” algorithm for stripping suffixes
from English words, using a sequence of character-level rules. Each rule can be described

Under contract with MIT Press, shared under CC-BY-NC-ND license.

202 CHAPTER 9. FORMAL LANGUAGE THEORY

qstart

a/a, b/b : 0

a/b, b/a : 1

a/✏, b/✏ : 1

✏/a, ✏/b : 1

Figure 9.5: State diagram for the Levenshtein edit distance finite state transducer. The
label x/y : c indicates a cost of c for a transition with input x and output y.

by an unweighted finite state transducer. The first rule is:

-sses ! -ss e.g., dresses ! dress [9.16]
-ies ! -i e.g., parties ! parti [9.17]
-ss ! -ss e.g., dress ! dress [9.18]

-s ! ✏ e.g., cats ! cat [9.19]

The final two lines appear to conflict; they are meant to be interpreted as an instruction
to remove a terminal -s unless it is part of an -ss ending. A state diagram to handle just
these final two lines is shown in Figure 9.6. Make sure you understand how this finite
state transducer handles cats, steps, bass, and basses.

Inflectional morphology

In inflectional morphology, word lemmas are modified to add grammatical information
such as tense, number, and case. For example, many English nouns are pluralized by the
suffix -s, and many verbs are converted to past tense by the suffix -ed. English’s inflectional
morphology is considerably simpler than many of the world’s languages. For example,
Romance languages (derived from Latin) feature complex systems of verb suffixes which
must agree with the person and number of the verb, as shown in Table 9.1.

The task of morphological analysis is to read a form like canto, and output an analysis
like CANTAR+VERB+PRESIND+1P+SING, where +PRESIND describes the tense as present
indicative, +1P indicates the first-person, and +SING indicates the singular number. The
task of morphological generation is the reverse, going from CANTAR+VERB+PRESIND+1P+SING
to canto. Finite state transducers are an attractive solution, because they can solve both
problems with a single model (Beesley and Karttunen, 2003). As an example, Figure 9.7
shows a fragment of a finite state transducer for Spanish inflectional morphology. The

Jacob Eisenstein. Draft of October 15, 2018.

ab → bb: 1
ab → aaab: 2

input = output; no cost

a→b or b→a

delete a character

insert a new character

input symbol
output symbol

COMP90042 L13

29

• Verb inflections in Spanish must match the subject in
person & number

• Goal of morphological analysis:

• canto → cantar+VERB+present+1P+singular

FST for Inflectional Morphology

cantar to sing

1P singular yo canto I sing
2P singular tu cantas you sing
3P singular ella canta she sings
1P plural nostotros cantamos we sing

2P plural vosotros cantáis you sing
3P plural ellas cantan they sing

COMP90042 L13

30

FST for Spanish Inflection

canto →

• canto+Noun+Masc+Sing

• cantar+Verb+PresInd+1P+Sing

canta → cantar+VERB+PresInd+3P+Sing

204 CHAPTER 9. FORMAL LANGUAGE THEORY

start
c/c a/a n/n t/t

o/o

✏/+Noun ✏/+Masc ✏/+Sing

✏/a ✏/r ✏/+Verb o/+PresInd ✏/+1p ✏/+Sing

a/+PresInd

✏/+3p ✏/+Sing

Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two
accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun,
meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accept-
ing path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings).

— failing to accept strings or transductions that are valid. For example, a pluralization
transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the trans-
ducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer
would then be said to overgenerate. If a transducer accepts foot/foots but not foot/feet, then
it simultaneously overgenerates and undergenerates.

Finite state composition

Designing finite state transducers to capture the full range of morphological phenomena
in any real language is a huge task. Modularization is a classic computer science approach
for this situation: decompose a large and unwieldly problem into a set of subproblems,
each of which will hopefully have a concise solution. Finite state automata can be mod-
ularized through composition: feeding the output of one transducer T1 as the input to
another transducer T2, written T2�T1. Formally, if there exists some y such that (x, y) 2 T1
(meaning that T1 produces output y on input x), and (y, z) 2 T2, then (x, z) 2 (T2 � T1).
Because finite state transducers are closed under composition, there is guaranteed to be
a single finite state transducer that T3 = T2 � T1, which can be constructed as a machine
with one state for each pair of states in T1 and T2 (Mohri et al., 2002).

Example: Morphology and orthography In English morphology, the suffix -ed is added
to signal the past tense for many verbs: cook!cooked, want!wanted, etc. However, English
orthography dictates that this process cannot produce a spelling with consecutive e’s, so
that bake!baked, not bakeed. A modular solution is to build separate transducers for mor-
phology and orthography. The morphological transducer TM transduces from bake+PAST
to bake+ed, with the + symbol indicating a segment boundary. The input alphabet of TM
includes the lexicon of words and the set of morphological features; the output alphabet
includes the characters a-z and the + boundary marker. Next, an orthographic transducer
TO is responsible for the transductions cook+ed ! cooked, and bake+ed ! baked. The input
alphabet of TO must be the same as the output alphabet for TM , and the output alphabet

Jacob Eisenstein. Draft of October 15, 2018.

COMP90042 L13

31

• Yes
• No
• Maybe
• Don’t know

PollEv.com/jeyhanlau569

Is natural language regular?

http://PollEv.com/jeyhanlau569
http://PollEv.com/jeyhanlau569

COMP90042 L13

32

COMP90042 L13

33

Sometimes…

• Example:
the mouse that ran.

the cat that killed the mouse that ran.


the lion that bullied the hyena that bit the dog that chased
the cat that killed the mouse that ran

• Length is unbounded, but structure is local

‣ (Det Noun Prep Verb)*

‣ Can describe with FSA

COMP90042 L13

34

Non-Regular Languages

• Arithmetic expressions with balanced parentheses

‣ (a + (b x ( c / d ) ) )

‣ Can have arbitrarily many opening parentheses

‣ Need to remember how many open parentheses,
to produce the same number of closed
parentheses

‣ Can’t be done with finite number of states

• anbn

COMP90042 L13

35

Center Embedding

• Center embedding of relative clauses
‣ The cat loves Mozart
‣ The cat the dog chased loves Mozart
‣ The cat the dog the rat bit chased loves Mozart
‣ The cat the dog the rat the elephant admired bit chased

loves Mozart

• Need to remember the n subject nouns, to ensure n verbs
follow (and that they agree etc)

• Requires (at least) context-free grammar (next lecture!)

COMP90042 L13

36

Summary

• Concept of a language

• Regular languages

• Finite state automata: acceptors, transducers

• Weighted variants

• Application to edit distance, morphology

COMP90042 L13

37

Reading

• E18, Chapter 9.1