COMP90045 Programming Language Implementation
Finite-State Automata
Harald Søndergaard Lecture 3
Semester 1, 2019
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 1 / 27
Recognizers
A recognizer for a language L is an algorithm that takes an input string x and returns “yes” if x is in L and “no” otherwise.
Each regular expression used as a pattern in the specification of a lexical analyzer describes a regular language (the set of strings that matches the pattern). The union of these regular languages is itself a regular language.
The most complicated part of building a scanner is thus the task of building recognizers for these regular languages.
Simple recognizers can be built by hand; in more complicated cases it is much easier to have them constructed automatically.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 2 / 27
Recognizing Regular Languages
The tools that generate scanners automatically are based on two kinds of mathematical constructs:
DFAs, or deterministic finite automata, and NFAs, or nondeterministic finite automata.
Every DFA and every NFA recognizes a regular language.
A regular language may be recognized by many DFAs and many NFAs.
DFAs and NFAs are usually displayed graphically as state transition diagrams.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 3 / 27
DFAs
Formally, a deterministic finite automaton is a 5-tuple (Q,Σ,δ,q0,F), where
Q is a finite set of states,
Σ is a finite alphabet,
δ : Q × Σ → Q is the transition function, q0 ∈ Q is the start state, and
F ⊆ Q are the accept states.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 4 / 27
Acceptance and Recognition by DFA
Let M = (Q,Σ,δ,q0,F) and let w = v1v2 ···vn be a string from Σ∗. M accepts w iff there is a sequence of states r0, r1, . . . , rn, with each
ri ∈ Q, such that
1 r0=q0
2 δ(ri,vi+1)=ri+1 fori=0,…,n−1 3 rn∈F
M recognises language A iff
A = {w | M accepts w}
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 5 / 27
State Transition Diagrams
Graphically, state transition diagrams show each state as a circle, with accept states shown as double circles and the start state marked by an arrow from nowhere.
Arrows between the states show the transition function. Each arrow is marked with a character or a set of characters.
Here is a DFA that recognizes ba∗b:
a
1b2a3b4 b
The alphabet is implicit in the annotations on the arrows.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 6 / 27
The Error State
a
1b2a3b4 b
If the diagram doesn’t show any arrow labelled with a given input character leaving a given state, then the transition is implicitly to the invisible error state. The error state is not an accept state, and every transition from the error state leads back to the error state.
All states from which no accept state is reachable are equivalent to error states. If a diagram has more than one, they can be merged into one.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 7 / 27
DFAs for Keywords
Building a DFA to recognize a keyword, or any constant string, is trivial. These three DFAs recognize “if”, “then” and “else”:
if
then
else
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 8 / 27
A Single DFA for Several Keywords
One could have a separate DFA to recognize each pattern in the scanner specification, testing them one by one, but that would be inefficient.
One can build DFAs that can recognize more than one pattern, in this case the keywords “if”, “in” and “’is”. Each accept state now needs to say which token’s pattern was found.
f in
s
⟨if ⟩ ⟨in⟩ ⟨is ⟩
PLI (Sem 1, 2019)
Finite-State Automata
⃝c University of Melbourne
9 / 27
Acceptance with Lookahead
DFAs were designed to test whether a string is in a language.
In practice, we want to use it to find the longest prefix that is in the language, so we cannot use the end of the string as the stopping criterion.
Instead, we stop when we transition to an error state (a state from which no accept state is reachable).
Every time we arrive at an accept state, we record which state it is and the length of the matched string. We then keep going, to search for a longer match.
When we transition to an error state, we use the longest match we recorded if there is one. If there isn’t one, we have found an error.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 10 / 27
Acceptance with Lookahead: Example
The following DFA recognizes this regular expression:
< | =< | > | >= | = | == LT
< =
>
GT = GE
EQ < LE
=
EE
PLI (Sem 1, 2019)
Finite-State Automata ⃝c University of Melbourne 11 / 27
Unsigned Numbers
Here is a state transition diagram for unsigned number literals:
1d2.3d4d E
d
E
5 +,- 6 d 7 d d
Here d is [0-9](that is, any digit).
PLI (Sem 1, 2019) Finite-State Automata
⃝c University of Melbourne 12 / 27
QUIZ: Language of a DFA
Which language is recognised by this DFA?
a s
b
a
b
q1 q2b
a a
r1 r2a b
b
PLI (Sem 1, 2019)
Finite-State Automata
⃝c University of Melbourne
13 / 27
QUIZ: Language of a DFA
Which language is recognised by this DFA?
a
b
q1 q2b
a a
r1 r2a b
b
(strings starting and ending in a, or starting and ending in b).
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 13 / 27
a s
b
a | b | a(a|b)*a | b(a|b)*b
DFAs from Regular Expressions
DFAs may grow exponentially in the size of corresponding regular expressions. Take the regular expression for binary strings in which the third last bit is one: (0|1)*1(0|1)(0|1). The simplest DFA is:
0
0
000 0 100 0 010 110
1101 00 01
001 101 1 011 1 111 1
1
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 14 / 27
Nondeterminism
One can construct finite automata from regular expressions more easily if a state is allowed to have more than one transition on a given input symbol:
0,1
q1 1 q2 0,1 q3 0,1 q4
This is a nondeterministic finite automaton, or NFA.
NFAs are also allowed to move from one state to another without
consuming input.
Such a transition is called an ǫ (epsilon) transition.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 15 / 27
NFAs
For any alphabet Σ, let Σǫ denote Σ ∪ {ǫ}. An NFA is a 5-tuple (Q,Σ,δ,q0,F), where
Q is a finite set of states,
Σ is a finite alphabet,
δ : Q × Σǫ → P(Q) is the transition function, q0 ∈ Q is the start state, and
F ⊆ Q are the accept states.
The only part of the 5-tuple that differs from DFAs is δ: the transition function is nondeterministic, and it allows ǫ transitions.
The class of languages recognised by NFAs is the same as the class of
languages recognised by DFAs; the addition of nondeterminism does
not result in more expressiveness.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 16 / 27
Acceptance and Recognition by NFA
The definition of what it means for an NFA N to accept a string says that it has to be possible to make the necessary transitions.
Let N = (Q,Σ,δ,q0,F) be an NFA and let w = v1v2 ···vn where each vi ∈ Σ.
Nacceptswiffthereisaw′=v′v′···v′ whereeachv′∈Σ such 12miǫ
that filtering out all the ǫs from w′ yields w, and there is a sequence of states r0,r1,...,rm, with each ri ∈ Q, such that
1 r0=q0
2 ri+1 ∈δ(ri,vi′+1)fori=0,...,m−1 3 rm∈F
N recognises language A iff A = {w | N accepts w}.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 17 / 27
NFAs from Regular Expressions
We can systematically convert a regular expression r into an NFA that recognises the language denoted by r.
Every NFA we build has a single start state and a single accept state. For ǫ, build
ǫ
For a, build
a
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 18 / 27
NFAs for Unions
For r | s, we build the NFA on the right, with the two big ovals representing the NFAs of r and s and the two states inside them representing their start and accept states.
So we have added two new states.
Note that the combined NFA has a single accept state. Also note that the start and accept states are not part of loops. These are invariants that will be maintained throughout the construction.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 19 / 27
ǫ
ǫ
ǫ
ǫ
NFAs for Concatenations
For the concatenation r s, we overlay the NFAs for r and s.
By overlaying we mean letting r’s accept state coincide with s’s start
state.
The start state of the combined machine (the machine for r s) is r’s
start state, and the accept state is the accept state of s.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 20 / 27
NFAs for Repetition (Kleene Star)
For r∗, we build the NFA sketched below.
That is, we modify the machine for r by making r’s accept state
non-accepting and adding an ǫ transition back to the start state. Then we add two new states that will become the start and accept
states for the machine for r and add ǫ transitions as shown.
ǫǫ
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 21 / 27
ǫ
ǫ
QUIZ: NFA construction
Construct an NFA for ((a | b)c)∗c.
Start from innermost expressions and work out:
a b
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 22 / 27
QUIZ: NFA construction
Construct an NFA for ((a | b)c)∗c. a
ǫǫ
ǫ
ǫ
b
PLI (Sem 1, 2019)
Finite-State Automata
⃝c University of Melbourne
23 / 27
QUIZ: NFA construction
Construct an NFA for ((a | b)c)∗c. a
ǫǫ
c
ǫ
ǫ
b
PLI (Sem 1, 2019)
Finite-State Automata
⃝c University of Melbourne
24 / 27
QUIZ: NFA construction
Construct an NFA for ((a | b)c)∗c. ǫ
a
ǫǫ
ǫ cǫc
ǫ
ǫ
b
ǫ
PLI (Sem 1, 2019)
Finite-State Automata ⃝c University of Melbourne 25 / 27
A Simpler Automaton for ((a | b)c)∗c
There are many obvious simplifications of the generated NFA.
In fact, it is easy to see that this DFA is equivalent:
ac
c
c
b
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 26 / 27
Next Lecture
In the next lecture we shall see that we can always turn an NFA into an equivalent DFA, and also that we can always find the smallest possible DFA needed for the job.
In fact that is what tools like lex, flex and alex do—starting from a regular expression they synthesize an equivalent DFA, producing the DFA in the form of a (C or Haskell) program.
PLI (Sem 1, 2019) Finite-State Automata ⃝c University of Melbourne 27 / 27