constructions that transform an re into an fa that is suitable for direct imple-
mentation and an algorithm that derives an re for the language accepted by anns.
Tostic fa 2.4.1. Next,
Scanner: Lexical Analysis Readings: EAC2 Chapter 2
EECS4302 M: Compilers and Interpreters Winter 2020
CHEN-WEI WANG
fa. Figure 2.3 shows the relationship between all of these constructio
presenttheseconstructions,wemustdistinguishbetweendi Scanner: Formulation & Implementation
s, or dfas, and nondeterministic fas, or nfas, in Section
etermin
Kleene’s Construction
DFA Minimization
NFA
Code for
a scanner
RE
Thompson’s Construction
DFA
Subset Construction
FIGURE2.3 TheCycleofConstructions. 3 of 68
n
Scanner in Context
○ Recall:
○ Treats the input programas as a a sequence of characters ○ Applies rules recognizing character sequences as tokens
Lexical Analysis Source Program
(seq. of characters)
Scanner
Syntactic Analysis seq. of tokens
Parser
Semantic Analysis
AST1 … ASTn
pretty printed
Target Program
[ lexical analysis ] ● Reportscharactersequencesnotrecognizableastokens
○ Upon termination:
● Producesaasequenceoftokens
○ Only part of compiler touching every character in input program. ○ Tokens recognizable by scanner constitute a regular language .
2 of 68
Alphabets
● An alphabet is a finite, nonempty set of symbols.
[ the English alphabet ] [ the binary alphabet ] [thedecimalalphabet] [ the keyboard alphabet ]
● Use either a set enumeration or a set comprehension to define your own alphabet.
4 of 68
○ The convention is to write ⌃ , possibly with a informative subscript, to denote the alphabet in question.
e.g., ⌃eng = {a,b,…,z,A,B,…,Z} e.g., ⌃bin = {0, 1}
e.g.,⌃dec ={d 0≤d ≤9}
e.g., ⌃key
Strings (1)
● A or a is finite sequence of symbols chosen from
string
word
some alphabet.
e.g., Oxford is a string from the English alphabet ⌃eng e.g., 01010 is a string from the binary alphabet ⌃bin e.g., 01010.01 is not a string from ⌃bin
e.g., 57 is a string from the binary alphabet ⌃dec
● It is not correct to say, e.g., 01010 ∈ ⌃bin
● The length of a string w, denoted as w, is the number of
[Why?] ○ ✏ is the empty string (✏ = 0) that may be from any alphabet.
characters it contains. ○ e.g., Oxford = 6
● Given two strings x and y, their concatenation , denoted as xy, is a new string formed by a copy of x followed by a copy of y.
○ e.g., Let x = 01101 and y = 110, then xy = 01101110
○ The empty string ✏ is the identity for concatenation : 5of68 ✏w =w =w✏foranystringw
Review Exercises: Strings
1. What is {a,b,…,z}5?
2. Enumerate, in a systematic manner, the set {a, b, c}4.
3. Explain the difference between ⌃ and ⌃1.
⌃ is a set of symbols; ⌃1 is a set of strings of length 1.
4. Prove or disprove: ⌃1 ⊆ ⌃2 ⇒ ⌃∗1 ⊆ ⌃∗2
7 of 68
Strings (2)
● Given an alphabet ⌃, we write ⌃k , where k ∈ N, to denote the set of strings of length k from ⌃
⌃k ={ww is from ⌃∧w=k} ○ e.g., {0, 1}2 = {00, 01, 10, 11}
○ ⌃0 is {✏} for any alphabet ⌃
● ⌃+ is the set of nonempty strings from alphabet ⌃
⌃+ = ⌃1 ∪ ⌃2 ∪ ⌃3 ∪ . . . = {w w ∈ ⌃k ∧ k > 0} = ⌃k k>0
● ⌃∗ is the set of strings of all possible lengths from alphabet ⌃
⌃∗ = ⌃+ ∪ {✏} 6 of 68
Languages
● A language L over ⌃ (where ⌃ is finite) is a set of strings s.t.
L ⊆ ⌃∗
● When useful, include an informative subscript to denote the
language L in question.
○ e.g., The language of valid Java programs
LJava = {prog prog ∈ ⌃∗key ∧ prog compiles in Eclipse} ○ e.g., The language of strings with n 0’s followed by n 1’s (n ≥ 0)
{✏,01,0011,000111,…} = {0n1n n ≥ 0}
○ e.g., The language of strings with an equal number of 0’s and 1’s
8 of 68
{✏, 01, 10, 0011, 0101, 0110, 1100, 1010, 1001, . . . } = {w#of0’sinw=#of1’sinw}
Review Exercises: Languages
1. Use set comprehensions to define the following languages. Be as formal as possible.
○ A language over {0, 1} consisting of strings beginning with some 0’s (possibly none) followed by at least as many 1’s.
○ A language over {a, b, c} consisting of strings beginning with some a’s (possibly none), followed by some b’s and then some c’s, s.t. the#ofa’sisatleastasmanyasthesumof#’sofb’sandc’s.
2. Explain the difference between the two languages {✏} and .
3. Justify that ⌃∗, , and {✏} are all languages over ⌃.
4. Prove or disprove: If L is a language over ⌃, and ⌃2 ⊇ ⌃, then L
is also a language over ⌃2.
Hint: Prove that ⌃ ⊆ ⌃2 ∧ L ⊆ ⌃∗ ⇒ L ⊆ ⌃∗2
5. Prove or disprove: If L is a language over ⌃, and ⌃2 ⊆ ⌃, then L
is also a language over ⌃2.
Hint: Provethat⌃2 ⊆⌃∧L⊆⌃∗ ⇒L⊆⌃∗2 9 of 68
Regular Expressions (RE): Introduction
● (RegExp’s) are:
Regular expressions
○ A type of notation
● Thisissimilartotheequally-expressiveDFA,NFA,and✏-NFA.
language-defining
○ Textual and look just like a programming language
● e.g.,01*+10*denotesL={0x x ∈{1}∗}∪{1x x ∈{0}∗}
● e.g., (0*10*10*)*10* denotes L = {w w has odd # of 1’s}
● ThisisdissimilartothediagrammaticDFA,NFA,and✏-NFA.
● RegExp’scanbeconsideredasa“user-friendly”alternativetoNFAfor
describing software components. [e.g., text search] ● WritingaRegExpislikewritinganalgebraicexpression,usingthe
defined operators, e.g., ((4 + 3) * 5) % 6
● Despite the programming convenience they provide, RegExp’s,
DFA, NFA, and ✏-NFA are all provably equivalent .
○ They are capable of defining all and only regular languages.
11 of 68
Problems
● Given a language L over some alphabet ⌃, a problem is the decision on whether or not a given string w is a member of L.
w∈L
Is this equivalent to deciding w ∈ ⌃∗? [ No ]
● e.g., The Java compiler solves the problem of deciding if the string of symbols typed in the Eclipse editor is a member of LJava (i.e., set of Java programs with no syntax and type errors).
10 of 68
RE: Language Operations (1)
● Given ⌃ of input alphabets, the simplest RegExp is s ∈ ⌃1. ○ e.g., Given ⌃ = {a, b, c}, expression a denotes the language
consisting of a single string a.
● Given two languages L, M ∈ ⌃∗, there are 3 operators for
building a larger language out of them:
1. Union
2. Concatenation
In the textual form, we write either . or nothing at all for
concatenation.
12 of 68
L ∪ M = {w w ∈ L ∨ w ∈ M} In the textual form, we write + for union.
LM = {xy x ∈ L ∧ y ∈ M}
RE: Language Operations (2)
3. Kleene Closure (or Kleene Star)
L∗ = Li i≥0
where
L0 = {✏}
L1 = L
L2 = {x1x2x1∈L∧x2∈L}
…
13 of 68
[ Li ] [ L ]
Li = {xx…x x∈L∧1≤j≤i} 12ij
i repetations
…
In the textual form, we write * for closure. Question: What is Li (i ∈ N)?
Question: Given that L = {0}∗, what is L∗?
RE: Construction (2)
● Recursive Case Given that E and F are regular expressions: ○ The union E + F is a regular expression.
L( E + F ) = L(E ) ∪ L(F ) ○ The concatenation EF is a regular expression.
L( EF ) = L(E)L(F) ○ Kleene closure of E is a regular expression.
L( E∗ ) = (L(E))∗
○ A parenthesized E is a regular expression.
15 of 68
L( (E) ) = L(E)
RE: Construction (1)
We may build regular expressions recursively:
● Each (basic or recursive) form of regular expressions denotes a language (i.e., a set of strings that it accepts).
● Base Case:
○ Constants ✏ and are regular expressions.
L(✏) = {✏} L() =
14 of 68
○ An input symbol a ∈ ⌃ is a regular expression. L( a ) = {a}
If we want a regular expression for the language consisting of only
the string w ∈ ⌃∗, we write w as the regular expression.
○ Variables such as L, M, etc., might also denote languages.
RE: Construction (3)
Exercises: ● L
● ∗
●∗L ● +L
∗ = 0∪1∪2∪… = {✏}∪∪∪… = {✏}
[L==L]
[∗L=L=L∗] [+L=L=+L]
16 of 68
RE: Construction (4)
Write a regular expression for the following language
{ w w has alternating 0’s and 1’s }
● Would (01)∗ work? [alternating 10’s?] ● Would (01)∗ + (10)∗ work? [starting and ending with 1?]
● 0(10)∗ + (01)∗ + (10)∗ + 1(01)∗ ● It seems that:
○ 1st and 3rd terms have (10)∗ as the common factor. ○ 2nd and 4th terms have (01)∗ as the common factor.
● Can we simplify the above regular expression? ● (✏+0)(10)∗+(✏+1)(01)∗
17 of 68
RE: Operator Precedence
● In an order of decreasing precedence: ○ Kleene star operator
○ Concatenation operator ○ Union operator
● When necessary, use parentheses to force the intended order of evaluation.
● e.g.,
○ 10∗ vs. (10)∗
[10∗ is equivalent to 1(0∗)] [01∗ +1 is equivalent to (0(1∗))+(1)] [0+1∗ is equivalent to (0)+(1∗)]
19 of 68
○ 01∗ +1 vs. 0(1∗ +1) ○ 0+1∗ vs. (0+1)∗
RE: Review Exercises
Write the regular expressions to describe the following languages: ● {w w ends with 01}
● {w w contains 01 as a substring}
● { w w contains no more than three consecutive 1’s } ● {ww ends with 01∨w has an odd # of 0’s}
● s ∈ {+, −, ✏} sx.y∧ x∈⌃∗dec
∧ y ∈ ⌃∗dec ● ∧ ¬(x=✏∧y=✏)
x ∈{0,1}∗ ∧y ∈{0,1}∗
xy ∧ x has alternating 0’s and 1’s 18of68 ∧ yhasanodd#0’sandanodd#1’s
DFA: Deterministic Finite Automata (1.1)
● A deterministic finite automata (DFA) is a finite state machine (FSM) that accepts (or recognizes) a pattern of behaviour.
○ For our purpose of this course, we study patterns of strings (i.e., how alphabet symbols are ordered).
○ Unless otherwise specified, we consider strings in {0, 1}∗
○ Each pattern contains the set of satisfying strings.
○ We describe the patterns of strings using set comprehensions:
● {w w has an odd number of 0’s} ● {w w has an even number of 1’s}
● w w≠✏ ∧ w has equal # of alternating 0’s and 1’s
● {w w contains 01 as a substring}
● w w has an even number of 0’s
∧ w has an odd number of 1’s
● Given a pattern description, we design a DFA that accepts it.
○ The resulting DFA can be transformed into an executable program. 20 of 68
DFA: Deterministic Finite Automata (1.2)
The transition diagram below defines a DFA which accepts exactly the language
{w w has an odd number of 0’s} 11
s0: even 0’s
0
0
s1: odd 0’s
○ Each incoming or outgoing arc (called a transition ) corresponds to an input alphabet symbol.
○ s0 with an unlabelled incoming transition is the start state .
○ s3 drawn as a double circle is a final state .
○ All states have outgoing transitions covering {0, 1}. 21 of 68
Review Exercises: Drawing DFAs
Draw the transition diagrams for DFAs which accept other example string patterns:
● {ww has an even number of 1’s}
● {w w contains 01 as a substring}
● w w has an even number of 0’s ∧ w has an odd number of 1’s
23 of 68
DFA: Deterministic Finite Automata (1.3)
below defines a DFA which accepts
∧ w has equal # of alternating 0’s and 1’s
The
exactly the language
w
w ≠ ✏
s0: empty string
s1: more 0’s
00
1
0
1
0
1
s3: equal
(01)+
s4: equal
(10)+
0, 1 nating 11
s2: more 1’s
s5:
not alter-
22 of 68
0
transition diagram
DFA: Deterministic Finite Automata (2.1)
A deterministic finite automata (DFA) is a 5-tuple M = (Q, ⌃, , q0, F)
○ Q is a finite set of states.
○ ⌃ is a finite set of input symbols (i.e., the alphabet). ○ ∶(Q×⌃)→Qisatransitionfunction
takes as arguments a state and an input symbol and returns a state. ○ q0 ∈ Q is the start state.
○ F ⊆ Q is a set of final or accepting states.
24 of 68
DFA: Deterministic Finite Automata (2.2)
● Given a DFA M = (Q, ⌃, , q0, F):
○ We write L(M) to denote the language of M : the set of strings that M accepts.
○ A string is accepted if it results in a sequence of transitions: beginning from the start state and ending in a final state.
L(M)= a1a2…an 1≤i ≤n ∧ ai ∈⌃ ∧ (qi−1,ai)=qi ∧ qn ∈F
○ M rejects any string w ∈ L(M).
● We may also consider L(M) as concatenations of labels from the set of all valid paths of M’s transition diagram; each such path starts with q0 and ends in a state in F.
25 of 68
DFA: Deterministic Finite Automata (2.4)
We formalize the above DFA as M = (Q, ⌃, , q0, F), where ● Q={s0,s1}
● ⌃={0,1}
● ={((s0,0),s1),((s0,1),s0),((s1,0),s0),((s1,1),s1)}
state input 0 1 s0 s1s0
● q0 = s0
s1 s0s1
● F = {s1} 27 of 68
11 0
s0: even 0’s
s1: odd 0’s
0
DFA: Deterministic Finite Automata (2.3)
● GivenaDFAM=(Q, ⌃, , q0, F),wemaysimplifythe definition of L(M) by extending (which takes an input symbol) to ˆ (which takes an input string).∗
ˆ ∶ ( Q × ⌃ ) → Q We may define ˆ recursively, using !
(q,✏) = q ˆˆ
ˆ
(q,xa) = ( (q,x),a ) where q ∈ Q, x ∈ ⌃∗, and a ∈ ⌃
● A neater definition of L(M) : the set of strings w ∈ ⌃∗ such that ˆ0
(q ,w) is an accepting state. L(M)={ww∈⌃∗∧ (q ,w)∈F}
ˆ0
● A language L is said to be a regular language , if there is some
DFA M such that L = L(M). 26 of 68
DFA: Deterministic Finite Automata (2.5.1)
s0: empty string
s1: more 0’s
00
s5:
not alter-
1
0
1
0
1
0
s3: equal
(01)+
s4: equal
(10)+
0, 1 nating 11
s2: more 1’s
We formalize the above DFA as M = (Q, ⌃, , q0, F), where ● Q = {s0,s1,s2,s3,s4,s5}
● ⌃={0,1}
● q0 = s0
● F={s3,s4} 28 of 68
DFA: Deterministic Finite Automata (2.5.2)
● =
29 of 68
state input 1
s0 s1s2 s1 s5s3 s2 s4s5 s3 s1s5 s4 s5s2 s5 s5s5
1
s1: more 0’s
s3: equal
(01)+
s4: equal
(10)+
s0: empty string
000
s5:
not alter-
1 0
0, 1 nating 11
s2: more 1’s
1
0
0
NFA: Nondeterministic Finite Automata (1.1)
Problem: Design a DFA that accepts the following language: L={x01x∈{0,1}∗ }
That is, L is the set of strings of 0s and 1s ending with 01.
1
0
q0 0 q1 1 q2
0 1
Given an input string w, we may simplify the above DFA by: ○ nondeterministically treating state q0 as both:
● astatereadytoreadthelasttwoinputsymbolsfromw
● astatenotyetreadytoreadthelasttwoinputsymbolsfromw
○ substantially reducing the outgoing transitions from q1 and q2
31 of 68
Compare the above DFA with the DFA in slide 39.
Review Exercises: Formalizing DFAs
Formalize DFAs (as 5-tuples) for the other example string patterns mentioned:
● {ww has an even number of 0’s}
● {w w contains 01 as a substring}
● w w has an even number of 0’s ∧ w has an odd number of 1’s
30 of 68
NFA: Nondeterministic Finite Automata (1.2)
● A non-deterministic finite automata (NFA) that accepts the same language:
● How an NFA determines if an input 00101 should be processed: 32 of 68
0, 1
q0 0 q1 1 q2
NFA: Nondeterministic Finite Automata (2)
● A nondeterministic finite automata (NFA) , like a DFA, is a FSM that accepts (or recognizes) a pattern of behaviour.
● An NFA being nondeterministic means that from a given state, the same input label might corresponds to multiple transitions that lead to distinct states.
○ Each such transition offers an alternative path.
○ Each alternative path is explored independently and in parallel.
○ If there exists an alternative path that succeeds in processing the
input string, then we say the NFA accepts that input string.
○ If all alternative paths get stuck at some point and fail to process
the input string, then we say the NFA rejects that input string.
● NFAs are often more succinct (i.e., fewer states) and easier to
design than DFAs.
● However, NFAs are just as expressive as are DFAs.
○ We can always convert an NFA to a DFA. 33 of 68
NFA: Nondeterministic Finite Automata (3.2)
● GivenaNFAM=(Q, ⌃, , q0, F),wemaysimplifythe definition of L(M) by extending (which takes an input symbol) to ˆ (which takes an input string).
ˆ ∶ ( Q × ⌃ ∗ ) → P ( Q ) We may define ˆ recursively, using !
(q,✏) = {q} ˆˆ
ˆ
(q,xa) = { (q′,a) q′ ∈ (q,x)} where q ∈ Q, x ∈ ⌃∗, and a ∈ ⌃
● A neater definition of L(M) : the set of strings w ∈ ⌃∗ such that ˆ0
(q ,w) contains at least one accepting state. ˆ0
L(M)={ww∈⌃∗∧ (q ,w)∩F≠} 35 of 68
NFA: Nondeterministic Finite Automata (3.1)
● A nondeterministic finite automata (NFA) is a 5-tuple M = (Q, ⌃, , q0, F)
○ Q is a finite set of states.
○ ⌃ is a finite set of input symbols (i.e., the alphabet). ○ ∶(Q×⌃)→P(Q)isatransitionfunction
takes as arguments a state and an input symbol and returns a set of states.
○ q0 ∈ Q is the start state.
○ F ⊆ Q is a set of final or accepting states.
● What is the difference between a and an NFA ?
○ The transition function of a returns a single state.
○ The transition function of an returns a set of states.
34 of 68
DFA
DFA
NFA
NFA: Nondeterministic Finite Automata (4)
0, 1
q0 0 q1 1 q2
Given an input string 00101:
● Read 0: ( , 0) = { q0 , q1 }
● Read0: ( ,0)∪ (q1,0)={ q0 ,q1 }∪={q0,q1 }
● Read1: ( ,1)∪ (q1,1)={q0 }∪{q2 }={ ,q2 }
● Read0: ( ,0)∪ (q2,0)={q0,q1 }∪={q0, }
● Read1: (q0,1)∪ (q1 ,1)={q0,q1 }∪{q2 }={q0,q1, q2 }
q0
q0
q0
q0
q0
q1
∵{q0,q1,q2 }∩{q2 }≠∴00101 is accepted 36 of 68
DFA ≡ NFA (1)
● For many languages, constructing an accepting NFA is easier than a DFA.
● From each state of an NFA:
○ Outgoing transitions need not cover the entire ⌃.
○ An input symbol may non-deterministically lead to multiple states.
● In practice:
○ An NFA has just as many states as its equivalent DFA does.
○ An NFA often has fewer transitions than its equivalent DFA does.
● In the worst case:
○ While an NFA has n states, its equivalent DFA has 2n states.
● Nonetheless, an NFA is still just as expressive as a DFA.
○ Every language accepted by some NFA can also be accepted by
37 of 68
some DFA.
∀N∶NFA ● (∃D∶DFA ● L(D)=L(N))
DFA ≡ NFA (2.2): Lazy Evaluation (2)
Applying subset construction (with lazy evaluation), we arrive in
a DFA transition table:
state input 0 1
{q0} {q0,q1} {q0} {q0, q1} {q0, q1} {q0, q2} {q0, q2} {q0, q1} {q0}
We then draw the DFA accordingly:
1
{q0} 0
0
{q0,q1}
0 1
1 {q0,q2}
39 of 68
Compare the above DFA with the DFA in slide 31.
DFA ≡ NFA (2.2): Lazy Evaluation (1) Given an NFA:
transition table: state input
0, 1
q0 0 q1 1 q2
Subset construction
(with lazy evaluation) produces a DFA 0 1
(q0,0) (q0,1) = {q0, q1} = {q0 }
{q0 }
{q0, q1} = {q0,q1}∪ = {q0} ∪ {q2}
(q0,0)∪ (q1,0) (q0,1)∪ (q1,1) = {q0, q1} = {q0, q2}
{q0,q2}
(q0,0)∪ (q2,0) (q0,1)∪ (q2,1) = {q0,q1}∪ = {q0}∪
38 of 68
= {q0, q1} = {q0 }
DFA ≡ NFA (2.2): Lazy Evaluation (3)
● Given an NFA N = (QN,⌃N, N,q0,FN), often only a small
portion of the P(QN ) subset states is reachable from {q0}.
ALGORITHM: ReachableSubsetStates
INPUT: q0 ∶ QN ; OUTPUT: Reachable ⊆ P(QN )
PROCEDURE:
Reachable := { {q0} }
ToDiscover := while(ToDiscover ≠ ) {
{ {q0 } }
choose S∶P(QN) such that S∈ToDiscover remove S from ToDiscover NotYetDiscovered :=
}
( { N(s,0) s∈S}∪{ N(s,1) s∈S} )Reachable Reachable := Reachable∪NotYetDiscovered ToDiscover := ToDiscover ∪ NotYetDiscovered
return Reachable
● RT of ReachableSubsetStates? 40 of 68
[ O(2QN ) ]
✏-NFA: Examples (1)
Draw the NFA for the following two languages:
1.
2.
x ∈ {0, 1}∗ xy ∧ y∈{0,1}∗ ∧ x has alternating 0’s and 1’s ∧ yhasanodd#0’sandanodd#1’s
w ∶ {0,1}∗ 3.
41of68
w has alternating 0’s and 1’s ∨ whasanodd#0’sandanodd#1’s
s ∈ {+, −, ✏} sx.y∧ x∈⌃∗dec
∧ y ∈ ⌃∗dec ∧ ¬(x=✏∧y=✏)
✏-NFA: Formalization (1) An ✏-NFA is a 5-tuple
M = (Q, ⌃, , q0, F)
○ Q is a finite set of states.
○ ⌃ is a finite set of input symbols (i.e., the alphabet). ○ ∶(Q×(⌃∪{✏}))→P(Q)isatransitionfunction
43 of 68
takes as arguments a state and an input symbol, or an empty string ✏, and returns a set of states.
○ q0 ∈ Q is the start state.
○ F ⊆ Q is a set of final or accepting states.
✏-NFA: Examples (2)
s ∈ {+, −, ✏} sx.y∧ x∈⌃∗dec
∧ y ∈ ⌃∗dec ∧ ¬(x=✏∧y=✏)
0,1,…,9 0,1,…,9
q1 . q2 0,1,…,9 q3 ✏ q5
q0 ✏,+,-
0,1,…,9
.
From q0 to q1, reading a sign is optional: a plus or a minus, or
nothing at all (i.e., ✏). 42 of 68
q4
✏-NFA: Formalization (2)
Draw a transition table for the above NFA’s function: ✏ +,- . 0..9
q0{q1}{q1}
q1 {q2} {q1,q4}
q2 {q3}
q3 {q5} {q3}
q4 {q3}
q5 44 of 68
0,1,…,9
0,1,…,9
✏,+,-
q0 q1 q2 q3 q5
. 0,1,…,9
✏
0,1,…,9
.
q4
✏-NFA: Epsilon-Closures (1)
● Given ✏-NFA N
we define the epsilon closure (or ✏-closure ) as a function
● Foranystateq∈Q
N = (Q, ⌃, , q0, F) ECLOSE ∶ Q → P(Q)
45 of 68
ECLOSE(q) = {q} ∪ ECLOSE(p) p∈ (q,✏)
✏-NFA: Formalization (3)
● Given a ✏-NFA M = (Q, ⌃, , q0, F), we may simplify the definition of L(M) by extending (which takes an input symbol) to ˆ (which takes an input string).
ˆ ∶ ( Q × ⌃ ∗ ) → P ( Q ) We may define ˆ recursively, using !
(q, ✏) = ECLOSE(q) ˆˆ
ˆ
(q,xa) = {ECLOSE(q′′)q′′ ∈ (q′,a)∧q′ ∈ (q,x)} where q ∈ Q, x ∈ ⌃∗, and a ∈ ⌃
● Then we define L(M) as the set of strings w ∈ ⌃∗ such that ˆ0
(q ,w) contains at least one accepting state. ˆ0
L(M)={ww∈⌃∗∧ (q ,w)∩F≠} 47 of 68
✏-NFA: Epsilon-Closures (2)
ECLOSE(q0 )
= { (q0, ✏) = {q1, q2}}
{q0} ∪ ECLOSE(q1) ∪ ECLOSE(q2)
= {ECLOSE(q1), (q1, ✏) = {q3}, ECLOSE(q2), (q2, ✏) = }
{q0}∪( {q1}∪ECLOSE(q3) )∪( {q2}∪ ) = {ECLOSE(q3), (q3, ✏) = {q5}}
{q0}∪( {q1}∪( {q3}∪ECLOSE(q5) ) )∪( {q2}∪ ) = {ECLOSE(q5), (q5, ✏) = }
{q0}∪( {q1}∪( {q3}∪( {q5}∪ ) ) )∪( {q2}∪ ) 46 of 68
0,1
✏ 1
✏ q1 ✏
q0 q3 q5 q6
✏
q2
0 1
✏
q4
✏-NFA: Formalization (4)
Given an input string 5.6:
(q ,✏)=ECLOSE(q )={q ,q } ˆ0001
● Read 5: (q0,5)∪ (q1,5) = ∪{q1,q4} = { q1,q4 }
(q ,5) = ECLOSE(q )∪ECLOSE(q ) = {q }∪{q } = {q ,q }
ˆ0141414 ● Read .: (q1,.)∪ (q4,.) = {q2}∪{q3} = { q2,q3 }
(q ,5.)=ECLOSE(q )∪ECLOSE(q )={q }∪{q ,q }={q ,q ,q } ˆ023235235
● Read 6: (q2,6)∪ (q3,6)∪ (q5,6) = {q3}∪{q3}∪ = { q3 }
(q ,5.6) = ECLOSE(q ) = {q ,q } [5.6 is accepted]
ˆ0335 48 of 68
0,1,…,9
0,1,…,9
✏,+,-
q0 q1 q2 q3 q5
. 0,1,…,9
✏
0,1,…,9
.
q4
DFA ≡ ✏-NFA: Subset Construction (1)
Subset construction (with lazy evaluation and epsilon closures ) produces a DFA transition table.
d∈0..9 s∈{+,−} .
{q0, q1} {q1, q4} {q1 } {q2 }
{q1, q4} {q1, q4} {q2,q3,q5}
{q1} {q1,q4} {q2}
{q2} {q3,q5} {q2, q3, q5} {q3, q5} {q3,q5} {q3,q5}
Forexample, ({q0,q1},d)iscalculatedasfollows:
[d ∈0..9]
49 of 68
{ECLOSE(q) q ∈ (q0, d) ∪ (q1, d)} = {ECLOSE(q)q∈∪{q1,q4}}
= {ECLOSE(q) q ∈ {q1, q4}}
= ECLOSE(q1 ) ∪ ECLOSE(q4 )
= {q1} ∪ {q4}
= {q1,q4}
Regular Expression to ✏-NFA
● Just as we construct each complex regular expression
recursively, we define its equivalent ✏-NFA recursively .
● Given a regular expression R, we construct an ✏-NFA E, such that L(R) = L(E), with
51 of 68
○ Exactly one accept state.
○ No incoming arc to the start state.
○ No outgoing arc from the accept state.
DFA ≡ ✏-NFA: Subset Construction (2)
● Given an ✏=NFA N = (QN,⌃N, N,q0,FN), by applying the extended subset construction to it, the resulting DFA
D = (QD,⌃D, D,qDstart ,FD) is such that:
⌃D =⌃N
Q = {SS⊆Q ∧(∃w∶⌃∗●S= ˆ(q,w))}
DND0
qD = ECLOSE(q0 )
F start = {SS⊆Q ∧S∩F ≠} DNN
D(S,a) = {ECLOSE(s′)s∈S∧s′ ∈ N(s,a)} 50 of 68
Regular Expression to ✏-NFA
Base Cases: ●✏
● ∈ QD●] a
ε
ε
(b)
(b) [a ∈ ⌃] (b)
a a
ε
(a) (a) (a)
52 of 68
a
(c)
[S
(c) (c)
Regular Expression to ✏-NFA
Recursive Cases: ●R+S
● RS
● (0 + 1)∗1(0 + 1)
● R∗
εεε
53of68
ε
55of68
[R and S are RE’s]
Rε εε
R
εε εεε R εε
ε
S
0
R (b) S
ε0ε ε1ε
(b) ε εRε
R ε εε R
ε
● (0+1)∗
εε0εε
54 of 68
ε
(b)ε
ε
(c) (c)
(c)
Regular Expression to ✏-NFA: Examples (1.1)
Minimizing DFA: Motivation
● 0+1
ε0ε
εε1εε 1
εε ε1ε
ε0ε
ε 0 εε1ε
ε
ε
(a)
(a) ε
● Recall: Regular Expresion → ✏-NFA → DFA
● DFA produced by the subset construction (with lazy
evaluation) may not be minimum on its size of state. ● When the required size of memory is sensitive
(e.g., processor’s cache memory),
the fewer number of DFA states, the better.
56 of 68
(b)
ε1ε
Regular Expression to ✏-NFA(b:)Examples (1.2)
εε (Sa) ε Start ε ε ε Sε
(a) εε
R ε(a) S
ε 1 1ε
(c)
RεS (b)
ε
ε
Minimizing DFA: Algorithm
ALGORITHM: MinimizeDFAStates
INPUT: DFA M =(Q, ⌃, , q0, F)
OUTPUT: M′ s.t. minimum |Q| and equivalent behaviour as M
PROCEDURE:
P := /* refined partition so far */
T := { F,Q−F } /* last refined partition */ while (P ≠ T):
P := T
T :=
for(p ∈ P s.t. |p| > 1):
find the maximal S ⊆ p s.t. splittable(p, S) if S ≠ then
T:=T∪{S, p−S} else
T := T ∪ {p} end
splittable(p, S) holds iff there is c ∈ ⌃ s.t.
● Transitioncleadsalls∈Stostatesinthesamepartitionp1.
57of68 ● Transitioncleadssomes∈p−Stoadifferentpartitionp2(p2≠p1).
Exercise:
Regular Expression to Minimized DFA
Given regular expression r[0..9]+ which specifies the pattern of a register name, derive the equivalent DFA with the minimum number of states. Show all steps.
59 of 68
2.4 From Regular Expression to Scanner 57
Minimizing DFA: Examples
s4 e s5 c d3 c
es2es3 bd2 b f dadc
s0 s1 0 1 b
i
(a) DFA for “fee | fie”
(a) Original DFA
11⇤ q3 ExamninqF4eIsGURE 2.12 DFA for aq(5b|c
) .
nt
ion Set
00
s1,s2,s4}} s1,s2,s4}} s1,s2,s4}}
0
0, 1
Char Action
1
As a second example, consid
— — —
1
son’s construction and the s
q0 {s3,s5} allq1 none q2
The first step of the minimiza
{s0,s1,s2,s4} e split {s2,s4} 0 10
{{d },{d ,d ,d }}, as shown
s,s f0split1s2 3
,s,s 124011
Curre Partit
{{s3,s5},{s0,erthedfafo {{s3,s5},{s0,ubsetconstr {{s3,s5},{s0,tionalgorith
{{s3,s5},{s0,s } { }} { } { } on the right.
Exercises: Minimize the DFA from here; Q1 & Q2, p59, EAC2.
{s ,s }5,{8sof}6,8{s },{s ,s }} all acllannot bnoenseplit. When the algorithm examin 350124
from any state in p . For bothbandc, each s
Implementing DFA as Scanner
○ The source language has a list of syntactic categories:
e.g., keyword while [ while ] e.g., identifiers [ [a-zA-Z][a-zA-Z0-9_]* ] e.g., white spaces [ [ \t\r]+ ]
○ A compiler’s scanner must recognize words from all syntactic categories of the source language.
⇤
● Overall,ascannershouldbeimplementedbasedontheminimized DFA accommodating all syntactic categories.
● Eachsyntacticcategoryisspecifiedviaaregularexpression.
c)
shown in Figure 2.12a.
produced by Thomp-
○ Principles of a scanner:
● Returnsonewordatatime
r1 + r1 +…+ rn syn. cat. 1 syn. cat. 2 syn. cat. n
structs an i●niEtiaclhpraerttuirtnioend word is the longest possible that matches a pattern ● Aprioritymaybespecifiedamongpatterns
p1 has only one state, it
(e.g., new is a keyword, not identifier)
itfind6s0onfo68transitionsona s a transition back into p .
tep
0
1
2
3 4{esp2,
(b) Critical Steps in Minimizing the DFA
2tateha2 Thus, no symbol in 6 splits p2, and the final partition is {{d0}, {d1, d2, d3}}.
r a(b| uction, m con Since
S
The resulting minimal dfa is shown in Figure 2.12b. Recall that this is
tate);
a a
t c
e e
A
p
s
)
s
t
d
s
h
e
a
h
r
s
dt t d
a
a
n
e r r
x n
t
a e
CharCat[char]; The Transition Table, [state,cat];
te 2/ S te6=ba
2.5 Im
and s0 s1 s2 se Implementing DFA: Table-Driven Scanner (1)
A
d) do
invalid invalid register invalid
); me;
r 0,1,2,…,9 EOF Ot ● Consider the syntactic category of register names.
NextWord() ●Aferconversionto✏c-lNeaFrAs,tatchke;ntoDFA,thentominimizedDFA:
push(bad);
while (state6=se) do0…9
● Specified as a ype[state];
s0 ;
●leTxheme foll‘o‘’w’;ing tables encode knowledge about the above DFA:
NextChar(char); sss
0…9
2EOF Other
Register Digit Other
e Table, Type : r[0..9]+
Register Digit Other Ot
The Classifier Table, CharCat 2.5 Implementing Scanners
NextWord() nvalid;state
0r 0, 11, 2, . . ., 9 lexeme lexeme + char;
Otsher s s 0 1
e
s
e
ifResgtisatetre 2 SA Digit ThtheenUcnldeaerrlsytiancgk;DFA
s1 se s2 se Other
s2 se s2 se se se se se
push(bad); 2.5 Implementing Scanners 61
iven Scanner for Register Names. cat CharCat[char]; The Transition Table,
The Classifier Table, CharCat clear stack; push(state);
state (Cehnda;rCat
Transition ( )
[state,cat];
push(state); + table-driven scanner for the rReolrlB[a0ck.(.).;9] , which was
while (state6=se) do Classifier
Token Type
(Type)
NextChar(char);
s. The sc
s0
s1
s2
se
invalid invalid register invalid
lexreme anner
g0,el1ne,ex2e,rma. .e.t,o9+r
state The Classifier Table, CharCat truncat
then clear stack;
chaEOrF; thenwh
cat CharCat[char]; end; The Transition Table, an re fostratieloc R[ersgetisgatetires,tDceiagrtit]n;Oathmeres. The left side of the
stTahte Toske;n Typ 0
regular expression
lexeme ‘ ’;
pirlOoeth(desrutact
)
Register
Digit
Other
s0 s1 se se es2/tSabalneds that drive
s1 se s2 se e6=bad) do
s2 se s2 se pop();
se se se se e lexeme;
iRfegistterate 2 S Digit Other Othesrtat A
A
The Token Type Table, Type
61 of 68
do
plementing Scanners 61 Index (1)
Scanner in Context
Scanner: Formulation & Implementation
Alphabets
Strings (1)
Strings (2)
Review Exercises: Strings
Languages
Review Exercises: Languages
Problems
Regular Expressions (RE): Introduction
RE: Language Operations (1)
63 of 68
pop(her elexeher
k();
2 SA turn T turn i
Table-Dr
ression
s0; canner.
‘ ’; ack;
61
h
ows a
);
tate6=se) mpt at
ar(char);
end;
whisl
if state 2 SA
le
d the underlying dfas. Nostice tshe sismilarity between the code
the
s
k
e
to
n
scanner,s0
s1 se
A
te 2 SA
while(state 2/ S
and
s2
else return invalid;
en clear
l
e
t side shows the tables for
xe
me
+
ch
ar;
1
se then return Type[state];
s0 s1 s2 se
e these
righse
stack;
2e2e
state 6= bad) do
se se se se invalid invalid register invalid
state pop();
char]; truncate TlhexTeramnes;ition Table,
nFIGURE2.14 ATable-DrivenScannerforRegisterNames. The Token Type Table, Type
e,cat];ImplementingDasrFegAular:exTpraessbionlse.T-hDescraninveregeneraStorctheanpnronducesrtab(le2st)
Figure 2.14 shows a table-driven scanner for the re r [0. . . 9] token, and a final section that interprets and reports the
; loop late tnhF
RollBack();
r divides into four sections: initializations, a scanning
hat drive , which was
n
the skeleton scanner.
d
, a rs1
ack sl
oop in case the dfa over-
end;
fa’s behaviosr
oll bs2
d
0
e
do
if state 2 S
The scaAnner then is implemented via a 4-stage skeleton:
e[state];
id;
Scanner for Regis
ou he scanner g fig
r[0 le-driven sc
her re for iloc
n scanneTr,hw
rlying dfa. loo
shown in Fig sho
vides intorefs
a’s behavior ac
en, and a fin op repeats t
as
the
Fig
invalid invalid register invalid
+
then return Type[state];
NextWord() our first attempt at an re for iloc register names. The left side of the
else return invalid;
The Token Type Table, Type
rep-e-atSstathge t1w:o bInaisticialcitzioatnisoonf a scanner: read
figure shows the skeleton scanner, while the right side shows the tables for
state := s ; worrd[0:..=.9]✏ andtheunderlyingdfa.Noticethesimilaritybetweenthecode 0
+
eIGUdREf2a.14’sA Tabclet-Diroiven S.caInnterhfoar RletgsisterwNahmes.n the dfa enters the initialize an emphterye asndtathcekrecsogn;izesr.sphouwsnhi(nbFaigdu)re 2.2 on page 32. — Stage 2: Scanning Loop
while (state ≠ se)
skeleton scanner. loop that models the dfa’s behavior, a roll back loop in case the dfa over-
regularexpressions.ThescannTehregesnkerlaetornthsceannpnreordduicveisdetasbilnetsothfaotudrrsivecetions:initializations,ascanning
NextChar(char) ; word := word + char
ter Names. shoots the end of the token, and a final section that interprets and reports the if state ∈ F then reset stack s en+d
ure 2.14 shows a table-driven scanner for the re r [0. . . 9] , which was
s.push(state) results. The scanning loop repeats the two basic actions of a scanner: read
r first attempt at an re for iloc register names. The left side of the
enerator then produces tables that daricvhearacter and simulate the dfa’s action. It halts when the dfa enters the cat := CharCat[char]
ure shows the skeleton scanner, while the right side shows the tables for
+state := [state, cat]
. . . 9] and the underlying dfa. Notice the similarity between the code
anne-r-forSthtearegre[0.3..:9]+,Rwohilchlwbasck Loop
e and the recognizer shown in Figure 2.2 on page 32.
regwishteirlnaeme(s.sTthaetleft s∈ideFof∧thestate ≠ bad) ehislekethlsettroignahtstcesaidne:nse=hrowdsisv.tihpdeeotsapbi(nlet)sofofrour sections: initializations, a scanning
Notice the similarity between the code
p thattmroudneclsatthedfwao’srbdehavior, a roll back loop in case the dfa over-
ure 2.2 on page 32. ots-th-eeSntdaogfethe4to:ken,IantdearfipnarlestectiaondthaRtienpteorprrtetsandreportsthe
ouulrtssi.ecfTtihoensst:caianntitnieailnizg∈atlioFnosp,tarhespeceananntsinrtgheettuwronbaTsiycpaect[iosntsaotfea]scanner:read , a roellbsaeck lrooeptinucranse tihendvfaloviedr-
haracter and simulate the dfa’s action. It halts when the dfa enters the al section that interprets and reports the
end
he two basic actions of a scanner: read
62 of 68
the dfa’s action. It halts when the dfa enters the
Index (2)
RE: Language Operations (2)
RE: Construction (1)
RE: Construction (2)
RE: Construction (3)
RE: Construction (4)
RE: Review Exercises
RE: Operator Precedence
DFA: Deterministic Finite Automata (1.1)
DFA: Deterministic Finite Automata (1.2)
DFA: Deterministic Finite Automata (1.3)
Review Exercises: Drawing DFAs
64 of 68
ecognizer shown in Figure 2.2 on page 32.
CharCat[ [stat scanne
ate 2/ SA a els the
tate);
ate 6= bad)
of
t
po
te lexeme
canning
ck();
d simu
2 SA eturn Typ eturn inval
A Table-Driven
pressions. T scanner.
shows a tab empt at an s the skeleto nd the unde recognizer
n scanner di odels the df nd of the tok
scanning lo and simulate
p
h
()
e
;
Index (3)
DFA: Deterministic Finite Automata (2.1)
DFA: Deterministic Finite Automata (2.2)
DFA: Deterministic Finite Automata (2.3)
DFA: Deterministic Finite Automata (2.4)
DFA: Deterministic Finite Automata (2.5.1)
DFA: Deterministic Finite Automata (2.5.2)
Review Exercises: Formalizing DFAs
NFA: Nondeterministic Finite Automata (1.1)
NFA: Nondeterministic Finite Automata (1.2)
NFA: Nondeterministic Finite Automata (2)
NFA: Nondeterministic Finite Automata (3.1)
65 of 68
Index (5)
DFA ≡ ✏-NFA: Subset Construction (1) DFA ≡ ✏-NFA: Subset Construction (2) Regular Expression to ✏-NFA
67 of 68
✏-NFA: Epsilon-Closures (2)
✏-NFA: Formalization (3)
✏-NFA: Formalization (4)
Regular Expression to ✏-NFA
Regular Expression to ✏-NFA
Regular Expression to ✏-NFA: Examples (1.1)
Regular Expression to ✏-NFA: Examples (1.2)
Minimizing DFA: Motivation
Index (4)
DFA ≡ NFA (1)
DFA ≡ NFA (2.2): Lazy Evaluation (1) DFA ≡ NFA (2.2): Lazy Evaluation (2) DFA ≡ NFA (2.2): Lazy Evaluation (3) ✏-NFA: Examples (1)
NFA: Nondeterministic Finite Automata (3.2)
NFA: Nondeterministic Finite Automata (4)
✏-NFA: Examples (2)
✏-NFA: Formalization (1)
✏-NFA: Formalization (2)
✏-NFA: Epsilon-Closures (1)
66 of 68
Index (6)
Minimizing DFA: Examples
Exercise:
Regular Expression to Minimized DFA
Implementing DFA as Scanner
Implementing DFA: Table-Driven Scanner (1)
Implementing DFA: Table-Driven Scanner (2)
68 of 68
Minimizing DFA: Algorithm