Scanner: Lexical Analysis Readings: EAC2 Chapter 2
EECS4302 M: Compilers and Interpreters Winter 2020
CHEN-WEI WANG
Scanner in Context
○ Recall:
Lexical Analysis Source Program
(seq. of characters)
Scanner
Syntactic Analysis seq. of tokens
Parser
Semantic Analysis
AST1 … ASTn
pretty printed
Target Program
○ Treats the input programas as a a sequence of characters ○ Applies rules recognizing character sequences as tokens
[ lexical analysis ] ● Reportscharactersequencesnotrecognizableastokens
● Producesaasequenceoftokens
○ Only part of compiler touching every character in input program. ○ Tokens recognizable by scanner constitute a regular language .
○ Upon termination:
2 of 68
nn
fa. Figure 2.3 shows the relationship between all of these constructio
present these constructions, we must distinguish between determini Scanner: Formulation & Implementation
s, or dfas, and nondeterministic fas, or nfas, in Section 2.4.1. N
RE
Thompson’s Construction
DFA Minimization
DFA
FIGURE2.3 TheCycleofConstructions. 3 of 68
Kleene’s Construction
Code for
a scanner
NFA
Subset Construction
o ae
Alphabets
● An alphabet is a finite, nonempty set of symbols.
○ The convention is to write Σ , possibly with a informative subscript, to denote the alphabet in question.
e.g., Σeng = {a,b,…,z,A,B,…,Z} e.g., Σbin = {0, 1}
e.g.,Σdec ={d ∣0≤d ≤9}
e.g., Σkey
[ the English alphabet ] [ the binary alphabet ] [thedecimalalphabet] [ the keyboard alphabet ]
● Use either a set enumeration or a set comprehension to define your own alphabet.
4 of 68
Strings (1)
● A or a is finite sequence of symbols chosen from
some alphabet.
e.g., Oxford is a string from the English alphabet Σeng e.g., 01010 is a string from the binary alphabet Σbin e.g., 01010.01 is not a string from Σbin
e.g., 57 is a string from the binary alphabet Σdec
string
word
● It is not correct to say, e.g., 01010 ∈ Σbin
● The length of a string w, denoted as ∣w∣, is the number of
characters it contains.
○ e.g., ∣Oxford∣ = 6
○ ε is the empty string (∣ε∣ = 0) that may be from any alphabet.
● Given two strings x and y, their concatenation , denoted as xy, is a new string formed by a copy of x followed by a copy of y.
○ e.g., Let x = 01101 and y = 110, then xy = 01101110 ○ The empty string ε is the identity for concatenation :
εw =w =wεforanystringw 5 of 68
[Why?]
Strings (2)
● Given an alphabet Σ, we write Σk , where k ∈ N, to denote the set of strings of length k from Σ
Σk ={w∣w is from Σ∧∣w∣=k} ○ e.g., {0, 1}2 = {00, 01, 10, 11}
○ ●
● Σ∗ 6 of 68
is {ε} for any alphabet Σ
is the set of nonempty strings from alphabet Σ
Σ+ = Σ1 ∪ Σ2 ∪ Σ3 ∪ . . . = {w ∣ w ∈ Σk ∧ k > 0} = ⋃ Σk k>0
is the set of strings of all possible lengths from alphabet Σ Σ∗ = Σ+ ∪ {ε}
Σ+
Σ0
Review Exercises: Strings
1. What is ∣{a,b,…,z}5∣?
2. Enumerate, in a systematic manner, the set {a, b, c}4.
3. Explain the difference between Σ and Σ1.
Σ is a set of symbols; Σ1 is a set of strings of length 1.
4. Prove or disprove: Σ1 ⊆ Σ2 ⇒ Σ∗1 ⊆ Σ∗2
7 of 68
Languages
● A language L over Σ (where ∣Σ∣ is finite) is a set of strings s.t.
L ⊆ Σ∗
● When useful, include an informative subscript to denote the
language L in question.
○ e.g., The language of valid Java programs
LJava = {prog ∣ prog ∈ Σkey ∧ prog compiles in Eclipse} ○ e.g., The language of strings with n 0’s followed by n 1’s (n ≥ 0)
{ε,01,0011,000111,…} = {0n1n ∣ n ≥ 0}
○ e.g., The language of strings with an equal number of 0’s and 1’s
{ε, 01, 10, 0011, 0101, 0110, 1100, 1010, 1001, . . . } = {w∣#of0’sinw=#of1’sinw}
8 of 68
Review Exercises: Languages
1. Use set comprehensions to define the following languages. Be as formal as possible.
○ A language over {0, 1} consisting of strings beginning with some 0’s (possibly none) followed by at least as many 1’s.
○ A language over {a, b, c} consisting of strings beginning with some a’s (possibly none), followed by some b’s and then some c’s, s.t. the#ofa’sisatleastasmanyasthesumof#’sofb’sandc’s.
2. Explain the difference between the two languages {ε} and ∅.
3. Justify that Σ∗, ∅, and {ε} are all languages over Σ.
4. Prove or disprove: If L is a language over Σ, and Σ2 ⊇ Σ, then L
is also a language over Σ2. ∗ ∗
Hint:ProvethatΣ⊆Σ2∧L⊆Σ ⇒L⊆Σ2
5. Prove or disprove: If L is a language over Σ, and Σ2 ⊆ Σ, then L
is also a language over Σ2. ∗ ∗
Hint:ProvethatΣ2⊆Σ∧L⊆Σ ⇒L⊆Σ2
9 of 68
Problems
● Given a language L over some alphabet Σ, a problem is the decision on whether or not a given string w is a member of L.
w∈L
Is this equivalent to deciding w ∈ Σ∗? [ No ]
● e.g., The Java compiler solves the problem of deciding if the string of symbols typed in the Eclipse editor is a member of LJava (i.e., set of Java programs with no syntax and type errors).
10 of 68
Regular Expressions (RE): Introduction
● (RegExp’s) are:
○ A type of notation
● Thisissimilartotheequally-expressiveDFA,NFA,andε-NFA.
○ Textual and look just like a programming language
● e.g.,01*+10*denotesL={0x ∣x ∈{1}}∪{1x ∣x ∈{0}}
● e.g., (0*10*10*)*10* denotes L = {w ∣ w has odd # of 1’s}
● ThisisdissimilartothediagrammaticDFA,NFA,andε-NFA.
● RegExp’scanbeconsideredasa“user-friendly”alternativetoNFAfor
describing software components. [e.g., text search]
● WritingaRegExpislikewritinganalgebraicexpression,usingthe
defined operators, e.g., ((4 + 3) * 5) % 6
● Despite the programming convenience they provide, RegExp’s,
DFA, NFA, and ε-NFA are all provably equivalent .
○ They are capable of defining all and only regular languages.
Regular expressions
language-defining
11 of 68
RE: Language Operations (1)
● Given Σ of input alphabets, the simplest RegExp is s ∈ Σ1. ○ e.g., Given Σ = {a, b, c}, expression a denotes the language
consisting of a single string a.
● Given two languages L, M ∈ Σ∗ , there are 3 operators for
building a larger language out of them: 1. Union
L ∪ M = {w ∣ w ∈ L ∨ w ∈ M} In the textual form, we write + for union.
2. Concatenation
LM = {xy ∣ x ∈ L ∧ y ∈ M}
In the textual form, we write either . or nothing at all for
concatenation.
12 of 68
RE: Language Operations (2) 3. Kleene Closure (or Kleene Star)
13 of 68
where
L0 =
L1 =
L2 =
…
Li = …
L = ⋃Li i0
{ε}
L {x1x2∣x1∈L∧x2∈L}
{x1x2…xi ∣xj∈L∧1≤j≤i}
i repetations
In the textual form, we write * for closure. Question: What is ∣Li ∣ (i ∈ N)?
Question: Given that L = {0}, what is L?
[ ∣L∣i ] [ L ]
RE: Construction (1)
We may build regular expressions recursively:
● Each (basic or recursive) form of regular expressions denotes a language (i.e., a set of strings that it accepts).
● Base Case:
○ Constants ε and ∅ are regular expressions.
L(ε) = {ε} L(∅) = ∅
○ An input symbol a ∈ Σ is a regular expression. L( a ) = {a}
If we want a regular expression for the language consisting of only
the string w ∈ Σ, we write w as the regular expression.
○ Variables such as L, M, etc., might also denote languages.
14 of 68
RE: Construction (2)
● Recursive Case Given that E and F are regular expressions:
○ The union E + F is a regular expression.
L( E + F ) = L(E ) ∪ L(F )
○ The concatenation EF is a regular expression. L( EF ) = L(E)L(F)
○ Kleene closure of E is a regular expression. L( E ) = (L(E))
○ A parenthesized E is a regular expression. L( (E) ) = L(E)
15 of 68
RE: Construction (3)
Exercises: ● ∅L
● ∅∗
[∅L=∅=L∅]
● ∅∗L ● ∅+L
[∅∗L=L=L∅∗ ] [∅+L=L=∅+L]
16 of 68
∅∗ = ∅0∪∅1∪∅2∪… = {ε}∪∅∪∅∪… = {ε}
RE: Construction (4)
Write a regular expression for the following language
{ w ∣ w has alternating 0’s and 1’s }
● Would (01)∗ work? [alternating 10’s?] ● Would (01)∗ + (10)∗ work? [starting and ending with 1?]
● 0(10)∗ + (01)∗ + (10)∗ + 1(01)∗ ● It seems that:
○ 1st and 3rd terms have (10) as the common factor. ○ 2nd and 4th terms have (01) as the common factor.
● Can we simplify the above regular expression? ● (ε+0)(10)∗+(ε+1)(01)∗
17 of 68
RE: Review Exercises
Write the regular expressions to describe the following languages:
● {w ∣w ends with 01}
● {w ∣w contains 01 as a substring}
● { w ∣ w contains no more than three consecutive 1’s }
● {w∣w ends with 01∨w has an odd # of 0’s} ●
●
18 of 68
⎧ s∈{+,−,ε} ⎪
⎫ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪⎭
⎪ ∧ x∈Σ ⎨ sx.y dec
⎪∧y∈Σ
⎪ dec
⎪⎩ ∧ ¬(x=ε∧y=ε)
⎧⎪ x∈{0,1}∧y∈{0,1}
⎪
⎨ xy ∧ x has alternating 0’s and 1’s ⎬ ⎪ ⎪ ⎪⎩ ∧ yhasanodd#0’sandanodd#1’s ⎪⎭
⎫⎪
⎪
RE: Operator Precedence
● In an order of decreasing precedence: ○ Kleene star operator
○ Concatenation operator ○ Union operator
● When necessary, use parentheses to force the intended order of evaluation.
● e.g.,
○ 10 vs. (10)
○ 01 +1 vs. 0(1 +1) ○ 0+1 vs. (0+1)
[10 is equivalent to 1(0)] [01 +1 is equivalent to (0(1))+(1)] [0+1 is equivalent to (0)+(1)]
19 of 68
DFA: Deterministic Finite Automata (1.1)
● A deterministic finite automata (DFA) is a finite state machine (FSM) that accepts (or recognizes) a pattern of behaviour.
○ For our purpose of this course, we study patterns of strings (i.e., how alphabet symbols are ordered).
○ Unless otherwise specified, we consider strings in {0, 1}
○ Each pattern contains the set of satisfying strings.
○ We describe the patterns of strings using set comprehensions:
● {w ∣w has an odd number of 0’s} ● {w ∣w has an even number of 1’s}
● {w∣ w≠ε } ∧ w has equal # of alternating 0’s and 1’s
● {w ∣w contains 01 as a substring}
● {w ∣ w has an even number of 0’s }
∧ w has an odd number of 1’s
● Given a pattern description, we design a DFA that accepts it.
○ The resulting DFA can be transformed into an executable program. 20 of 68
DFA: Deterministic Finite Automata (1.2)
The transition diagram below defines a DFA which accepts exactly the language
{w ∣w has an odd number of 0’s}
11 0
s0: even 0’s
s1: odd 0’s
○ Each incoming or outgoing arc (called a ) corresponds to an input alphabet symbol.
○ s0 with an unlabelled incoming transition is the start state .
○ s3 drawn as a double circle is a final state .
○ All states have outgoing transitions covering {0, 1}. 21 of 68
0
transition
DFA: Deterministic Finite Automata (1.3)
The transition diagram below defines a DFA which accepts exactly the language
{w ∣ w ≠ ε } ∧ w has equal # of alternating 0’s and 1’s
22 of 68
s0: empty string
0, 1 nating 11
s2: more 1’s
s1: more 0’s
00
s5:
not alter-
1
0
1
0
1
s3: equal
(01)+
s4: equal
(10)+
0
Review Exercises: Drawing DFAs
Draw the transition diagrams for DFAs which accept other example string patterns:
● {w∣w has an even number of 1’s} ● {w ∣w contains 01 as a substring}
● {w ∣
w has an even number of 0’s ∧ w has an odd number of 1’s
}
23 of 68
DFA: Deterministic Finite Automata (2.1)
A deterministic finite automata (DFA) is a 5-tuple M = (Q, Σ, δ, q0, F)
○ Q is a finite set of states.
○ Σ is a finite set of input symbols (i.e., the alphabet). ○ δ∶(Q×Σ)→Qisatransitionfunction
δ takes as arguments a state and an input symbol and returns a state. ○ q0 ∈ Q is the start state.
○ F ⊆ Q is a set of final or accepting states.
24 of 68
DFA: Deterministic Finite Automata (2.2)
● Given a DFA M = (Q, Σ, δ, q0, F):
○ We write L(M) to denote the language of M : the set of strings that M accepts.
○ A string is accepted if it results in a sequence of transitions: beginning from the start state and ending in a final state.
a1a2…an ∣
L(M)={ 1≤i ≤n ∧ ai ∈Σ ∧ δ(qi1,ai)=qi ∧ qn ∈F }
○ M rejects any string w ∈/ L(M).
● We may also consider L(M) as concatenations of labels from the set of all valid paths of M’s transition diagram; each such path starts with q0 and ends in a state in F.
25 of 68
DFA: Deterministic Finite Automata (2.3)
● GivenaDFAM=(Q, Σ, δ, q0, F),wemaysimplifythe definition of L(M) by extending δ (which takes an input symbol) to δˆ (which takes an input string).∗
δˆ ∶ ( Q × Σ ) → Q We may define δˆ recursively, using δ!
δˆ(q,ε) = q
δˆ(q,xa) = δ( δˆ(q,x),a )
where q ∈ Q, x ∈ Σ∗, and a ∈ Σ
● A neater definition of L(M) : the set of strings w ∈ Σ such that
δˆ(q0,w) is an accepting state. ∗ ˆ L(M)={w∣w∈Σ ∧δ(q0,w)∈F}
● A language L is said to be a regular language , if there is some
DFA M such that L = L(M). 26 of 68
∗
DFA: Deterministic Finite Automata (2.4)
11 0
s0: even 0’s
s1: odd 0’s
We formalize the above DFA as M = (Q, Σ, δ, q0, F), where ● Q={s0,s1}
● Σ={0,1}
● δ={((s0,0),s1),((s0,1),s0),((s1,0),s0),((s1,1),s1)}
0
state / input 0 1
s0 s1s0 s1 s0s1
● q0 = s0
● F = {s1} 27 of 68
DFA: Deterministic Finite Automata (2.5.1)
s1: more 0’s
00
s5:
not alter-
1
0
1
0
1
0
s3: equal
(01)+
s0: empty string
0, 1 nating 11
s2: more 1’s
We formalize the above DFA as M = (Q, Σ, δ, q0, F), where ● Q = {s0,s1,s2,s3,s4,s5}
● Σ={0,1}
● q0 = s0
● F={s3,s4} 28 of 68
s4: equal
(10)+
DFA: Deterministic Finite Automata (2.5.2)
●δ=
s1: more 0’s
00
s5:
not alter-
1
0
1
0
s3: equal
(01)+
s0: empty string
1 0
0, 1 nating 11
s2: more 1’s
s4: equal
(10)+
0
29 of 68
state / input 1
s0 s1s2 s1 s5s3 s2 s4s5 s3 s1s5 s4 s5s2 s5 s5s5
Review Exercises: Formalizing DFAs
Formalize DFAs (as 5-tuples) for the other example string patterns mentioned:
● {w∣w has an even number of 0’s} ● {w ∣w contains 01 as a substring}
● {w ∣
w has an even number of 0’s ∧ w has an odd number of 1’s
}
30 of 68
NFA: Nondeterministic Finite Automata (1.1) Problem: Design a DFA that accepts the following language:
L={x01∣x∈{0,1}∗ }
That is, L is the set of strings of 0s and 1s ending with 01.
1
0
q0 0 q1 1 q2
0 1
Given an input string w, we may simplify the above DFA by: ○ nondeterministically treating state q0 as both:
● astatereadytoreadthelasttwoinputsymbolsfromw
● astatenotyetreadytoreadthelasttwoinputsymbolsfromw
○ substantially reducing the outgoing transitions from q1 and q2
31 of 68
Compare the above DFA with the DFA in slide 39.
NFA: Nondeterministic Finite Automata (1.2)
● A non-deterministic finite automata (NFA) that accepts the same language:
0, 1
q0 0 q1 1 q2
● How an NFA determines if an input 00101 should be processed:
32 of 68
NFA: Nondeterministic Finite Automata (2)
● A nondeterministic finite automata (NFA) , like a DFA, is a FSM that accepts (or recognizes) a pattern of behaviour.
● An NFA being nondeterministic means that from a given state, the same input label might corresponds to multiple transitions that lead to distinct states.
○ Each such transition offers an alternative path.
○ Each alternative path is explored independently and in parallel.
○ If there exists an alternative path that succeeds in processing the
input string, then we say the NFA accepts that input string.
○ If all alternative paths get stuck at some point and fail to process
the input string, then we say the NFA rejects that input string.
● NFAs are often more succinct (i.e., fewer states) and easier to
design than DFAs.
● However, NFAs are just as expressive as are DFAs.
○ We can always convert an NFA to a DFA. 33 of 68
NFA: Nondeterministic Finite Automata (3.1)
● A nondeterministic finite automata (NFA) is a 5-tuple M = (Q, Σ, δ, q0, F)
○ Q is a finite set of states.
○ Σ is a finite set of input symbols (i.e., the alphabet). ○ δ∶(Q×Σ)→P(Q)isatransitionfunction
δ takes as arguments a state and an input symbol and returns a set of states.
○ q0 ∈ Q is the start state.
○ F ⊆ Q is a set of final or accepting states.
● What is the difference between a ○ The transition function δ of a
○ The transition function δ of an
and an NFA ?
returns a single state. returns a set of states.
34 of 68
DFA
DFA
NFA
NFA: Nondeterministic Finite Automata (3.2)
● GivenaNFAM=(Q, Σ, δ, q0, F),wemaysimplifythe definition of L(M) by extending δ (which takes an input symbol) to δˆ (which takes an input string).
δˆ ∶ ( Q × Σ ∗ ) → P ( Q ) We may define δˆ recursively, using δ!
δˆ(q,ε) = {q} ′ ′
δˆ(q,xa) = ⋃{δ(q,a)∣q ∈δˆ(q,x)}
where q ∈ Q, x ∈ Σ∗, and a ∈ Σ ∗
● A neater definition of L(M) : the set of strings w ∈ Σ such that
δˆ(q0,w) contains at least one accepting state.
L ( M ) = { w ∣ w ∈ Σ ∗ ∧ δˆ ( q 0 , w ) ∩ F ≠ ∅ }
35 of 68
NFA: Nondeterministic Finite Automata (4)
0, 1
q0 0 q1 1 q2
Given an input string 00101:
● Read 0: δ( , 0) = { q0 , q1 }
● Read0: δ( ,0)∪δ(q1,0)={ q0 ,q1 }∪∅={q0,q1 }
● Read1: δ( ,1)∪δ(q1,1)={q0 }∪{q2 }={ ,q2 }
● Read0: δ( ,0)∪δ(q2,0)={q0,q1 }∪∅={q0, }
● Read1: δ(q0,1)∪δ(q1 ,1)={q0,q1 }∪{q2 }={q0,q1, q2 }
∵{q0,q1,q2 }∩{q2 }≠∅∴00101 is accepted 36 of 68
q0
q0
q0
q0
q0
q1
DFA NFA (1)
● For many languages, constructing an accepting NFA is easier than a DFA.
● From each state of an NFA:
○ Outgoing transitions need not cover the entire Σ.
○ An input symbol may non-deterministically lead to multiple states.
● In practice:
○ An NFA has just as many states as its equivalent DFA does.
○ An NFA often has fewer transitions than its equivalent DFA does.
● In the worst case:
○ While an NFA has n states, its equivalent DFA has 2n states.
● Nonetheless, an NFA is still just as expressive as a DFA.
○ Every language accepted by some NFA can also be accepted by
some DFA. 37 of 68
∀N∶NFA ● (∃D∶DFA ● L(D)=L(N))
DFA NFA (2.2): Lazy Evaluation (1) Given an NFA:
transition table: state / input
{q0 } {q0, q1}
{q0, q2} 38 of 68
(with lazy evaluation) produces a DFA
0 1 δ(q0,1)
0, 1
q0 0 q1 1 q2
Subset construction
δ(q0,0) = {q0,q1}
δ(q0,0)∪δ(q1,0) = {q0,q1}∪∅
=
{q0, q1}
δ(q0,0)∪δ(q2,0) = {q0,q1}∪∅
= {q0,q1}
= {q0 }
δ(q0,1)∪δ(q1,1) = {q0} ∪ {q2}
= {q0,q2}
δ(q0,1)∪δ(q2,1) = {q0}∪∅
= {q0 }
DFA NFA (2.2): Lazy Evaluation (2)
Applying subset construction (with lazy evaluation), we arrive in
a DFA transition table:
state / input
{q0 } {q0, q1} {q0, q2}
0
1
{q0 } {q0, q2} {q0}
{q0, q1}
{q0, q1}
{q0, q1} We then draw the DFA accordingly:
39 of 68
Compare the above DFA with the DFA in slide 31.
1
{q0} 0
0
{q0,q1}
0 1
1 {q0,q2}
DFA NFA (2.2): Lazy Evaluation (3)
● Given an NFA N = (QN,ΣN,δN,q0,FN), often only a small
portion of the ∣ P(QN )∣ subset states is reachable from {q0}.
ALGORITHM: ReachableSubsetStates
INPUT: q0 ∶ QN ; OUTPUT: Reachable ⊆ P(QN )
PROCEDURE:
Reachable := { {q0} } ToDiscover := { {q0 } } while(ToDiscover ≠ ∅) {
choose S∶P(QN) such that S∈ToDiscover remove S from ToDiscover NotYetDiscovered :=
( {δN(s,0) ∣ s∈S}∪{δN(s,1) ∣ s∈S} )∖Reachable Reachable := Reachable∪NotYetDiscovered ToDiscover := ToDiscover ∪ NotYetDiscovered
}
return Reachable
● RT of ReachableSubsetStates? [ O(2∣QN ∣) ] 40 of 68
ε-NFA: Examples (1)
Draw the NFA for the following two languages:
1.
2.
3.
⎧
⎪
⎪
⎨ xy ⎪
x ∈ {0, 1}
∧ x has alternating 0’s and 1’s
∧ yhasanodd#0’sandanodd#1’s
⎫ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪⎭
⎪ ⎪⎩
{ w∶{0,1} ∣
w has alternating 0’s and 1’s
∨ whasanodd#0’sandanodd#1’s
⎧ s∈{+,−,ε} ⎫ ⎪ ⎪
}
41 of 68
∧ y∈{0,1}
⎪ ∧x∈Σ ⎪ ⎨sx.y dec ⎬
⎪∧y∈Σ ⎪ ⎪ dec ⎪
⎪⎩ ∧ ¬(x=ε∧y=ε) ⎪⎭
ε-NFA: Examples (2)
⎧ s∈{+,−,ε} ⎫
⎪ ⎪
⎪ ∧ x∈Σ ⎪
⎨ sx.y dec ⎬
⎪∧y∈Σ ⎪ ⎪ dec ⎪
⎪⎩ ∧ ¬(x=ε∧y=ε) ⎪⎭
0,1,…,9 0,1,…,9
q0 ✏,+,- q1 . q2 0,1,…,9 q3 ✏ q5
0,1,…,9
.
From q0 to q1, reading a sign is optional: a plus or a minus, or
nothing at all (i.e., ε). 42 of 68
q4
ε-NFA: Formalization (1) An ε-NFA is a 5-tuple
M = (Q, Σ, δ, q0, F)
○ Q is a finite set of states.
○ Σ is a finite set of input symbols (i.e., the alphabet). ○ δ∶(Q×(Σ∪{ε}))→P(Q)isatransitionfunction
δ takes as arguments a state and an input symbol, or an empty string ε, and returns a set of states.
○ q0 ∈ Q is the start state.
○ F ⊆ Q is a set of final or accepting states.
43 of 68
ε-NFA: Formalization (2)
0,1,…,9
0,1,…,9
✏,+,-
q0 q1 q2 q3 q5
Draw a transition table for the above NFA’s δ function: ε +,- . 0..9
q0{q1}{q1}∅ ∅
q1 ∅ ∅ {q2} {q1,q4} q2 ∅ ∅ ∅ {q3} q3 {q5} ∅ ∅ {q3} q4 ∅ ∅ {q3} ∅
q5 ∅ ∅ ∅ ∅
. 0,1,…,9
✏
0,1,…,9
.
q4
44 of 68
ε-NFA: Epsilon-Closures (1) ● Given ε-NFA N
N = (Q, Σ, δ, q0, F)
we define the epsilon closure (or ε-closure ) as a function
● Foranystateq∈Q
45 of 68
ECLOSE ∶ Q → P(Q)
ECLOSE(q) = {q} ∪ ⋃ ECLOSE(p) p∈δ(q,ε)
ε-NFA: Epsilon-Closures (2)
46 of 68
✏
q2
0 1
✏
ECLOSE(q0 )
= {δ(q0, ε) = {q1, q2}}
{q0} ∪ ECLOSE(q1) ∪ ECLOSE(q2)
= {ECLOSE(q1), δ(q1, ε) = {q3}, ECLOSE(q2), δ(q2, ε) = ∅}
{q0}∪( {q1}∪ECLOSE(q3) )∪( {q2}∪∅ )
= {ECLOSE(q3), δ(q3, ε) = {q5}}
{q0}∪( {q1}∪( {q3}∪ECLOSE(q5) ) )∪( {q2}∪∅ )
= {ECLOSE(q5), δ(q5, ε) = ∅}
{q0}∪( {q1}∪( {q3}∪( {q5}∪∅ ) ) )∪( {q2}∪∅ )
0,1
✏ 1
✏ q1 ✏
q0 q3 q5 q6
q4
ε-NFA: Formalization (3)
● Given a ε-NFA M = (Q, Σ, δ, q0, F), we may simplify the definition of L(M) by extending δ (which takes an input symbol) to δˆ (which takes an input string).
δˆ ∶ ( Q × Σ ∗ ) → P ( Q ) We may define δˆ recursively, using δ!
δˆ(q, ε) = ECLOSE(q) ′′ ′′ ′ ′
δˆ(q,xa) = ⋃{ECLOSE(q )∣q ∈δ(q,a)∧q ∈δˆ(q,x)}
where q ∈ Q, x ∈ Σ∗, and a ∈ Σ ∗
● Then we define L(M) as the set of strings w ∈ Σ such that
δˆ(q0,w) contains at least one accepting state.
L ( M ) = { w ∣ w ∈ Σ ∗ ∧ δˆ ( q 0 , w ) ∩ F ≠ ∅ }
47 of 68
ε-NFA: Formalization (4)
0,1,…,9
0,1,…,9
✏,+,-
q0 q1 q2 q3 q5
. 0,1,…,9
✏
0,1,…,9
.
Given an input string 5.6: δˆ(q0,ε) = ECLOSE(q0) = {q0,q1}
● Read 5: δ(q0,5)∪δ(q1,5) = ∅∪{q1,q4} = { q1,q4 }
δˆ(q0,5) = ECLOSE(q1)∪ECLOSE(q4) = {q1}∪{q4} = {q1,q4}
● Read .: δ(q1,.)∪δ(q4,.) = {q2}∪{q3} = { q2,q3 }
δˆ(q0,5.) = ECLOSE(q2)∪ECLOSE(q3) = {q2}∪{q3,q5} = {q2,q3,q5}
● Read 6: δ(q2,6)∪δ(q3,6)∪δ(q5,6) = {q3}∪{q3}∪∅ = { q3 }
δˆ(q0,5.6) = ECLOSE(q3) = {q3,q5} [5.6 is accepted] 48 of 68
q4
DFA ε-NFA: Subset Construction (1)
Subset construction (with lazy evaluation and epsilon closures ) produces a DFA transition table.
d∈0..9 s∈{+,−} . {q0, q1} {q1, q4} {q1 } {q2 }
{q1, q4}
{q1} {q1,q4} ∅ {q2} {q2} {q3,q5} ∅ ∅
{q2, q3, q5} {q3, q5}
{q1, q4} {q3, q5}
∅ {q2, q3, q5} ∅ ∅
{q3, q5} Forexample,δ({q0,q1},d)iscalculatedasfollows:
49 of 68
∅ ∅
⋃{ECLOSE(q) ∣ q ∈ δ(q0, d) ∪ δ(q1, d)}
= ⋃{ECLOSE(q)∣q∈∅∪{q1,q4}}
= ⋃{ECLOSE(q) ∣ q ∈ {q1, q4}}
= ECLOSE(q1 ) ∪ ECLOSE(q4 )
= {q1} ∪ {q4}
= {q1,q4}
[d ∈0..9]
DFA ε-NFA: Subset Construction (2)
● Given an ε=NFA N = (QN,ΣN,δN,q0,FN), by applying the extended subset construction to it, the resulting DFA
D = (QD,ΣD,δD,qDstart ,FD) is such that:
ΣD=ΣN ∗ˆ
QD = {S∣S⊆QN∧(∃w∶Σ ●S=δD(q0,w))} qDstart = ECLOSE(q0)
FD = {S∣S⊆QN∧S∩FN≠∅}
δD(S,a) = ⋃{ECLOSE(s′)∣s∈S∧s′ ∈δN(s,a)}
50 of 68
Regular Expression to ε-NFA
● Just as we construct each complex regular expression recursively, we define its equivalent ε-NFA recursively .
● Given a regular expression R, we construct an ε-NFA E, such that L(R) = L(E), with
○ Exactly one accept state.
○ No incoming arc to the start state.
○ No outgoing arc from the accept state.
51 of 68
Regular Expression to ε-NFA
Base Cases: ●ε
ε
ε
ε
●∅
(a) (a) (a)
●a (b) [a∈Σ] (b)
a a
(b)
52 of 68
a
(c)
(c)
Regular Expression to ε-NFA
Recursive Cases: ●R+S
ε R ε εRε
[R and S are RE’s]
εε R εε S
ε
ε
R ε(a) S
ε ε
(Sa) S (a)
● RS ● R∗
RεS (b)
ε
R (b) S
(b) ε εεε
εRε
53of68
ε εε R (c) (c)
ε
R
ε
Regular Expression to ε-NFA: Examples (1.1)
● 0+1
ε0ε ε0ε
εε1εε 1
(a) (a)
ε
ε ε 0 εε ε
● (0+1)∗
εε0εε
54 of 68
ε
εε1ε
ε1ε (b)
ε
1
Regular Expression to ε-NFA(b:)Examples (1.2) ● (0+1)∗1(0+1)
ε
Start ε ε 0 ε ε
ε1ε
ε
ε
ε0ε 1ε
55 of 68
ε1ε
(c)
Minimizing DFA: Motivation
● Recall: → →
● DFA produced by the subset construction (with lazy
evaluation) may not be minimum on its size of state. ● When the required size of memory is sensitive
(e.g., processor’s cache memory),
the fewer number of DFA states, the better.
Regular Expresion
ε-NFA
DFA
56 of 68
Minimizing DFA: Algorithm
57 of 68
ALGORITHM: MinimizeDFAStates
INPUT: DFA M =(Q, Σ, δ, q0, F)
OUTPUT: M s.t. minimum |Q| and equivalent behaviour as M
PROCEDURE:
P := ∅ /* refined partition so far */
T := { F,Q−F } /* last refined partition */ while (P ≠ T):
P := T
T := ∅
for(p ∈ P s.t. |p| > 1):
find the maximal S ⊆ p s.t. splittable(p, S) if S ≠∅ then
T:=T∪{S, p−S} else
T := T ∪ {p} end
splittable(p, S) holds iff there is c ∈ Σ s.t.
● Transitioncleadsalls∈Stostatesinthesamepartitionp1.
● Transitioncleadssomes∈p−Stoadifferentpartitionp2(p2≠p1).
Minimizing DFA: Examples
es2es3 bd2 b f dadc
s0 s1
0 1 b
c d3 c
i
s4 e s5
Curre Partit
{{s3,s5},{s0, {{s3,s5},{s0, {{s3,s5},{s0,
er the dfa fo ubset constr
tion algorith
on the right.
(a) DFA for “fee | fie”
(a) Original DFA
nt ion
q3
11∗ ExamsinqF4eIsGURE2.12 DFAforaq(5b|c ).
{{s,s},{s,s},{s,s}}
350124 01 1
Set
0
0, 1
1
Char Action
00
As a second example, consid
s1,s2,s4}} s1,s2,s4}}
s1,s2,s4}}
— {s3,s5}
— —
f 0 split1{s2} 3
{s,s}
Exercises: Minimize the DFA from here; Q1 & Q2, p59, EAC2.
{s ,s }5,{8sof}6,8{s },{s ,s }} all acllannot bnoenseplit. When the algorithm examin 350124
1
son’s construction and the s
q0
allq1 none q2
The first step of the minimiza
{s0,s1,s2,s4} 10
split {s2,s4} 0 {{d },{d ,d ,d }}, as shown
e
fromanystateinp .Forbothbandc,eachs
2
r u
m
Exercise:
Regular Expression to Minimized DFA
Given regular expression r[0..9]+ which specifies the pattern of a register name, derive the equivalent DFA with the minimum number of states. Show all steps.
59 of 68
Implementing DFA as Scanner
e.g., identifiers
categories of the source language.
● Eachsyntacticcategoryisspecifiedviaaregularexpression.
r1 + r1 +…+ rn syn. cat. 1 syn. cat. 2 syn. cat. n
● Overall,ascannershouldbeimplementedbasedontheminimized DFA accommodating all syntactic categories.
○ Principles of a scanner:
● Returnsonewordatatime
● Eachreturnedwordisthelongestpossiblethatmatchesapattern
● Aprioritymaybespecifiedamongpatterns
(e.g., new is a keyword, not identifier)
○ The source language has a list of syntactic categories: e.g., keyword while
e.g., white spaces
○ A compiler’s scanner must recognize words from all syntactic
[ while ] [ [a-zA-Z][a-zA-Z0-9_]* ] [ [ \t\r]+ ]
60 of 68
tate,cat];
A and s0 s1 s2 se Implementing DFA: Table-Driven Scanner (1)
d) do
invalid invalid register invalid
); me;
r 0,1,2,…,9 EOF Ot ● Consider the syntactic category of register names.
NextWord()
e Table, Type : r[0..9]+
● Specified as a ●Aferconversiontoεc-lNeaFrAs,tatchke;ntoDFA,thentominimizedDFA:
push(bad);
while (state̸=se) do0…9
NextChar(char); sss
Register Digit Other Ot
0…9
2EOF Other
Register Digit Other
The Classifier Table, CharCat 2.5 Implementing Scanners
0r 0, 11, 2, . . ., 9 lexeme ← lexeme + char;
ype[state]; NextWord()
nvalid;state ← s0 ;
●leTxheme f←oll‘o‘’w’;ingtablesencodeknowledgeabouttheaboveDFA:
clear stack; push(bad);
The Classifier Table, CharCat push(state);
s2 se s2 se se se se se
ivenScannerforRegisterNames.
2.5 Implementing Scanners cat ← CharCat[char];
61
TheTransitionTable,δ
(Type)
The Token Type Table, Type 0…9
ifResgtisatetre ∈SA Digit ThtheenUcnldeaerrlsytiancgk;DFA
s1 se s2 se Other
Transition
(δ)
Token Type
state ← δ[state,cat]; (Cehnda;rCat
state ←
Otsher s s 0 1
e
s
e
while (state̸=se) do Classifier
s. The sc
then clear stack;
The Classifier Table, CharCat push(state);
do
end;
61 of 68
The Transition Table, δ
NextChar(char);
regular expression
s0
s1
s2
se
invalid invalid register invalid
truncat
table-driven scanner for the rReolrlB[a0ck.(.).;9] , which was
cat ← CharCat[char];
an re fostratiel←ocδR[ersgetisgatetires,tDceiagrtit]n;Oathmeres. The left side of the
stTahte←Toske;n Typ 0
lexeme ← ‘’;
)
Register
lexreme anner
←0, l1e,x2e,m. .e., 9+ generator
iRfegistterate ∈SDigit Other Othesrtat A
chaEOrF; thenwh
pirlOoeth(desrutact
s0 s1 se se es∈/tSabalneds that drive
s1 se s2 se e̸=bad) do
s2 se s2 se pop();
se se se se e lexeme;
A
+
Digit
Other
a
(h eh
T
i
n .
6
Implementing DFA: Table-Driven Scanner (2)
The scanner then is implemented via a 4-stage skeleton:
NextWord()
— Stage 1: Initialization
state := s0 ; word := ε
initialize an empty stack s ; s.push(bad) — Stage 2: Scanning Loop
while (state ≠ se)
NextChar(char) ; word := word + char if state ∈ F then reset stack s end s.push(state)
cat := CharCat[char]
state := δ[state, cat]
— Stage 3: Rollback Loop while (state ∈/ F ∧ state ≠ bad)
state := s.pop()
truncate word
if state ∈ F then return Type[state] else return invalid
end
62 of 68
— Stage 4: Interpret and Report
Index (1)
Scanner in Context
Scanner: Formulation & Implementation Alphabets
Strings (1)
Strings (2)
Review Exercises: Strings
Languages
Review Exercises: Languages Problems
Regular Expressions (RE): Introduction
RE: Language Operations (1)
63 of 68
Index (2)
RE: Language Operations (2) RE: Construction (1)
RE: Construction (2)
RE: Construction (3)
RE: Construction (4)
RE: Review Exercises
RE: Operator Precedence
DFA: Deterministic Finite Automata (1.1) DFA: Deterministic Finite Automata (1.2) DFA: Deterministic Finite Automata (1.3)
Review Exercises: Drawing DFAs
64 of 68
Index (3)
DFA: Deterministic Finite Automata (2.1) DFA: Deterministic Finite Automata (2.2) DFA: Deterministic Finite Automata (2.3) DFA: Deterministic Finite Automata (2.4) DFA: Deterministic Finite Automata (2.5.1) DFA: Deterministic Finite Automata (2.5.2) Review Exercises: Formalizing DFAs
NFA: Nondeterministic Finite Automata (1.1) NFA: Nondeterministic Finite Automata (1.2) NFA: Nondeterministic Finite Automata (2)
NFA: Nondeterministic Finite Automata (3.1)
65 of 68
Index (4)
NFA: Nondeterministic Finite Automata (3.2) NFA: Nondeterministic Finite Automata (4) DFA ≡ NFA (1)
DFA ≡ NFA (2.2): Lazy Evaluation (1)
DFA ≡ NFA (2.2): Lazy Evaluation (2) DFA ≡ NFA (2.2): Lazy Evaluation (3) ε-NFA: Examples (1)
ε-NFA: Examples (2)
ε-NFA: Formalization (1)
ε-NFA: Formalization (2)
ε-NFA: Epsilon-Closures (1) 66 of 68
Index (5)
ε-NFA: Epsilon-Closures (2)
ε-NFA: Formalization (3)
ε-NFA: Formalization (4)
DFA ≡ ε-NFA: Subset Construction (1) DFA ≡ ε-NFA: Subset Construction (2) Regular Expression to ε-NFA
Regular Expression to ε-NFA
Regular Expression to ε-NFA
Regular Expression to ε-NFA: Examples (1.1) Regular Expression to ε-NFA: Examples (1.2)
Minimizing DFA: Motivation
67 of 68
Index (6)
Minimizing DFA: Algorithm
Minimizing DFA: Examples
Exercise:
Regular Expression to Minimized DFA
Implementing DFA as Scanner Implementing DFA: Table-Driven Scanner (1) Implementing DFA: Table-Driven Scanner (2)
68 of 68