COMP90045 Programming Language Implementation
How Scanner Generators Work
Harald Søndergaard Lecture 4
Semester 1, 2019
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 1 / 36
Scanner Generators
A scanner generator like flex or alex takes a language definition as input, in the form of regular definitions.
As output it produces a program—a scanner for the regular language specified.
It effectively takes regular expressions and turns them into DFAs.
The generated program has these DFAs encoded, in the form of tables or functions.
Conceptually, the scanner generator makes two important transformations:
From an NFA to a DFA (determinization). From a DFA to a minimal DFA (minimization).
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 2 / 36
How the Scanner Generator Works
Construct an NFA for each pattern pi :
Construct a single NFA representing the union of these NFAs, with a
distinct accept state for each pattern:
ǫ ǫ
ǫ
pi
p1
p2
. .
pn
Convert this NFA to a DFA, keeping track of which action applies in
which accept state.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 3 / 36
Converting an NFA to a DFA
For any NFA N, we can construct an equivalent DFA D.
The idea is to simulate all the possible transitions of the NFA. We create a state in D for each set of states that N can reach after consuming a given input string.
If N has k states, then D may have up to 2k states.
However, this limit is almost never reached in practice, because most of the possible DFA states correspond to a combination of NFA states that cannot be reached simultaneously for any input.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 4 / 36
Converting an NFA to a DFA
Consider the NFA on the right. We systematically construct an equivalent DFA.
The DFA’s start state is {1, 3}.
From this state, a will take us back to {1, 3}.
From {1, 3}, b can only take us to {2}.
Continuing thus (in a demand-driven fashion) gives the DFA.
Any state S which contains an accept state from the NFA will be an accept state for the DFA (why?)
a
2
b
1 a,b
ǫ
a
3
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne
5 / 36
QUIZ: Complete the Construction
a
2
b
1 a,b=⇒
ǫ
a
a
1,3 b 2
3
PLI (Sem 1, 2019)
How Scanner Generators Work
⃝c University of Melbourne
6 / 36
QUIZ: Complete the Construction
a
2
a
1,3 b 2
a 1 a,b=⇒ 3 2,3
ǫ
a
b
b
3
PLI (Sem 1, 2019)
How Scanner Generators Work ⃝c University of Melbourne 7 / 36
QUIZ: Complete the Construction
a
2
a
1,3 b 2
a 1ǫa,b=⇒ 3b2,3
b
b
a
a
3
1,2,3
PLI (Sem 1, 2019)
How Scanner Generators Work ⃝c University of Melbourne 8 / 36
QUIZ: Complete the Construction
a
2
a
1,3 b 2
a 1ǫa,b=⇒ 3b2,3
b
b
a
b
a
3
1,2,3
a
PLI (Sem 1, 2019)
How Scanner Generators Work
⃝c University of Melbourne
9 / 36
QUIZ: Complete the Construction
a
a
1,3 b 2 aba 1ǫa,b=⇒ 3b2,3
2
b
a
b
a 1,2,3 a
3
PLI (Sem 1, 2019)
How Scanner Generators Work
⃝c University of Melbourne
10 / 36
Subset Construction Operations
The algorithm for constructing a DFA D from an NFA N is called the subset construction algorithm. It uses these two operations on N:
ǫ-closure(T) gives the set of states that can be reached from any of the states t ∈ T using ǫ transitions only.
move(T,a) gives the set of states that can be reached from any of the states t ∈ T on input symbol a.
If N = (Q,Σ,δ,s0,F), then the start state of D is ǫ-closure({s0}). Every state in D that contains an accept state of N is an accept
state for D.
If a state in D contains more than one accept state of N, the only
effective one is the one for the rule that occurs first.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 11 / 36
Subset Construction Algorithm
The hard part of building D is computing D’s set of states, Dstates, and D’s transition function, Dtran.
Dstates ← {ǫ-closure({s0})}
while some state T in Dstates is unmarked do
mark T
for each a ∈ Σ do
U ← ǫ-closure(move(T ,a)) if U ̸∈ Dstates then
Dstates ← Dstates ∪ {U} Dtran[T , a] ← U
The entries in Dtran not filled in by this algorithm will transition to the error state.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 12 / 36
QUIZ: Subset Construction
Construct a DFA equivalent to the following NFA:
2 1bb
ǫ
a
ǫ
3
PLI (Sem 1, 2019)
How Scanner Generators Work
⃝c University of Melbourne
13 / 36
QUIZ: Subset Construction
a2 1,2 a b
b
a
2 1bb
ǫ
a
ǫ
move({1,2},a)) = {2} move({1,2},b)) = {3} move({2},a)) = {2} move({2},b)) = {3} move({1,2,3},a)) = {2} move({1,2,3},b)) = {3}
ǫ-closure: {2} ǫ-closure: {1,2,3} ǫ-closure: {2} ǫ-closure: {1,2,3} ǫ-closure: {2} ǫ-closure: {1,2,3}
{1,2} →a {2} {1,2} →b {1,2,3} {2} →a {2}
{2} →b {1,2,3} {1,2,3} →a {2} {1,2,3} →b {1,2,3}
3
1,2,3 b Start state: ǫ-closure({1}) = {1,2}
PLI (Sem 1, 2019) How Scanner Generators Work
⃝c University of Melbourne 14 / 36
How a Scanner Generator Works: An Example
a
abb a∗b+
1a2 act1 3a4b5b6 act2 7 b 8 act3
a
becomes
b 1a2 act1
ǫ 0 ǫ
ǫ
a
3a4b5b6 act2 7b8 act3
b
PLI (Sem 1, 2019)
How Scanner Generators Work ⃝c University of Melbourne 15 / 36
How a Scanner Generator Works: An Example
01a247b
3 7 act1 act3
bab b8b7a68
58
act3
act2
b
Input
Sequence of matched strings
abbaa
act2 act1 act1
abb
a
a
aabbab
act3 act3
aabb
ab
Remember the last accepting state visited and the buffer position at that point.
PLI (Sem 1, 2019)
How Scanner Generators Work
⃝c University of Melbourne 16 / 36
QUIZ: NFA to DFA
Build an NFA for ab* action 1
ab action 2
Then convert it to a DFA.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 17 / 36
QUIZ: NFA to DFA
Build an NFA for ab* action 1
b
1 a 2 act1
ǫ
ab action 2
Now convert it to a DFA.
ǫ
0
3 a 4 b 5 act2
PLI (Sem 1, 2019)
How Scanner Generators Work ⃝c University of Melbourne 18 / 36
QUIZ: NFA to DFA
Build an NFA for ab* action 1
b
1 a 2 act1
ǫ
ab action 2
Now convert it to a DFA.
ǫ
0
3 a 4 b 5 act2 b
1 a 2 act1
PLI (Sem 1, 2019)
How Scanner Generators Work ⃝c University of Melbourne 18 / 36
Minimizing a DFA
Different DFAs may recognise the same language.
For example, the DFA on the right is equivalent to the one we derived in the Quiz a few slides back (on the left):
2a 2a 1,2 a b a b
a b
1,2,3 b 1,2,3 b
For a regular language, it is always possible to construct a minimal
DFA, that is, one with the fewest possible states.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 19 / 36
The Classical Algorithm for DFA Minimization
It proceeds by building finer and finer partitionings π of states.
In the end, π will have the smallest possible number of equivalence
classes of indistinguishable states.
πold := {Q}
π := {Q \ F , F } while π ̸= πold
foreach x ∈ Σ πold := π
π := ∅
foreach S ∈ πold
/* Split S into finer classes */ π:=π∪{{s∈S|δ(s,x)∈T}|T ∈πold}\{∅}
After this we may remove error/unreachable states from π. For a less abstract version of the algorithm, see ICD Algorithm 1.4.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 20 / 36
Minimization Example
DFA to minimize:
a
2 a 4 b 6 aaa∗b a
1∪
a
b
3 a 5 b 7 ba∗b
b
PLI (Sem 1, 2019)
How Scanner Generators Work ⃝c University of Melbourne 21 / 36
Minimization Example
First complete the DFA by adding the error state: a
2 a 4 b 6 aaa∗b
a,b
b 18∪
a,b
b
3 a 5 b 7 ba∗b
b
Initial Partition π0 = {A,B} where A = {1,2,3,4,5,8},B = {6,7}.
a
a,b
a
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 22 / 36
Minimization Example
Consider group A and letter b. We have:
Fromstate 123458 Tostate 387678 Nowingroup A A B B B A
so our new partition is π1 = {A1, A2, B} where A1 = {1, 2, 8}, A2 = {3,4,5}, B = {6,7}.
Now consider A1 and letter b:
so our new partition is π2 = {A11, A12, A2, B} where A11 = {1},
A12 = {2, 8}, A2, B as before.
From state To state
Now in group
1 2 8 3 8 8 A2 A1 A1
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 23 / 36
Minimization Example
Now consider group A12 and letter a.
From state To state
Now in group
2 8
4 8 A2 A12
giving us a final partition π′ = {{1}, {2}, {8}, {3, 4, 5}, {6, 7}} .
It is final because any other choice of partition element with a letter
does not partition the states any further.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 24 / 36
Minimization Example
The minimized DFA looks like this:
a,b
1a2b8
a a,b 3b6
a
Note how the original states 3, 4, and 5 have collapsed into one (the state now labeled 3). Note that state 8 is an error state; no accept state is reachable from it.
b
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 25 / 36
Minimizing States for a Scanner Generator
Minimization is important for a scanner generator, to keep the table implementing the DFA’s transition function small.
However, we need a slight variant of the minimization algorithm.
In this variant, the initial partition doesn’t just distinguish accept states from non-accept states; it also distinguishes accept states with different actions from each other.
So in our example, it will distinguish states 6 and 7.
If there are n actions, the initial partition thus contains n + 1 sets of states.
The rest of the minimization algorithm is unchanged.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 26 / 36
Reserved Words and Identifiers
In most languages, the keywords of the language are reserved: they cannot be used for any other purpose.
@ident = [a-zA-Z][a-zA-Z0-9]*
rules :-
if { \s -> IF }
@ident { IDENT }
Most keywords match the pattern for identifiers.
If the language has many keywords, then using a separate pattern for each keyword will cause the scanner generator to build a large DFA.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 27 / 36
Reserved Words and Identifiers
A generation ago, the memory required to store a large DFA was sometimes a problem. Today it isn’t, but large DFAs may be slow, because accessing them lacks locality.
A common way to reduce the size of the DFA is to create a table (keywords) that associates each keyword with its token.
Then all the rules for keywords can be folded into the rule for identifiers:
@ident { \s -> case lookup s keywords of Just token -> token
Nothing -> IDENT s }
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 28 / 36
Interesting Design Flaws: No Reserved Words
PL/I, a much hyped language from the 1960s, had no reserved words. Since keywords can be used as identifiers, this allows programmers to write statements such as this:
IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
Exploitation of this freedom makes programs hard to read. (The first IF, the second THEN and the second ELSE—in red—delimit the different parts of an if-then-else statement; the other occurrences of IF, THEN and ELSE are identifiers.)
This flexibility makes the scanner much harder to write: it needs the parser to tell it when it is expecting a keyword, and when it is expecting an identifier.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 29 / 36
Right Context
The fact that FORTRAN 77 ignored spaces even inside lexemes caused a similarly annoying problem.
FORTRAN:
a) DO 5 I = 1,25
b) DO 5 I = 1.25
corresponding C:
for (I = 1, I <= 25, I++) ... DO5I = 1.25
We cannot decide whether DO is a keyword or the start of an identifier until we reach the comma or the period.
Flex and alex have mechanisms for handling situations like this. The flex pattern r/s tells flex to commit to pattern r only if the right context (in this case, the input characters following DO) matches pattern s.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 30 / 36
Matching Right Context (in alex)
The rule below tells alex to recognize DO as a keyword only when the input following it matches the pattern of a do loop header:
@stuff = [A-Z0-9\ ]+
rules :-
DO/@stuff\=@stuff\, { \s -> DO }
↑↑
matchStart input examined to here matchLen = 2
Note that the rule is too generous; it allows a match on input such as “DO X = 6Q,”. However, such a lexical error will be caught later.
D
O
5
I
=
1
,
2
5
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 31 / 36
More Design Flaws: Unbounded Lookahead
Some languages do not have bounds on the amount of lookahead required. One is PL/I:
declare(arg1,arg2,…,argn)
Is ‘declare’ a keyword or an identifier denoting an array? PL/I syntax doesn’t allow us to answer that question until we have seen the character after the right parenthesis. Since the number of arguments is unbounded, this requires unbounded lookahead.
Supporting unbounded lookahead complicates the scanner and slows it down.
But the reason why good languages avoid requiring it is that humans find constructs requiring unbounded lookahead hard to read.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 32 / 36
Start Conditions
In some languages, different parts of a program have different notions of what tokens are.
One example is an alex specification itself. In most of an alex specification, tokens include names, string constants, character classes and regular expression operators. However, between lines containing { and }, the tokens are entire lines.
Effectively, after the { token we want the lexer to switch to using another set of rules, and after the } we want it to switch back.
Flex and alex support start conditions. Each start condition has its own set of rules (pattern/action pairs), though a rule may belong to more than one start condition.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 33 / 36
Implementing Start Conditions
Add each start condition to the alphabet and construct the NFA as shown below. If S is the NFA state marked by the name of the current start condition, the lexer starts at the DFA state for ǫ-closure(S ).
sc1
sc1 ǫ ǫ
sc2
sc2
ǫ ǫ
.
.
.
.
PLI (Sem 1, 2019)
How Scanner Generators Work
⃝c University of Melbourne
34 / 36
psc1 1
psc1 n1
psc2 1
psc2 n2
Using Start Conditions in alex
%wrapper “monad”
rules :-
<0> ([^\”] | \n)* ;
<0> \”
{ begin string }
{ stringchar }
{ begin 0 }
This specifies a scanner with two states, 0 (the default start state) and string. We switch to the string state when we see a double quote, and stay there until we see the next double quote.
If a rule is prefixed with a state st, that means the rule should only be applied when the scanner is in state st.
The action begin st changes the scanner’s state to st, and stringchar is some action for dealing with characters inside a string.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 35 / 36
Coming Up Next Week
Next week we look at context-free languages and the parsing problem.
However, first (that is, on Monday), we will look at the use of monads in Haskell.
PLI (Sem 1, 2019) How Scanner Generators Work ⃝c University of Melbourne 36 / 36