Compilers and computer architecture: From strings to ASTs (1):
finite state automata for lexing
Martin Berger
September – October 2015
Recall the function of compilers
Plan for this week
Plan for this week
Remember the shape of compilers?
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program
Plan for this week
Remember the shape of compilers?
We learned about regular expressions (REs). They enable us to specify simple language (finite and infinite).
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program
Plan for this week
Remember the shape of compilers?
We learned about regular expressions (REs). They enable us to specify simple language (finite and infinite).
The question we need to answer is: how to decide, given a string s and a regular expression R, if s ∈ lang(R)?
We will later see that this is the main step towards an algorithm for lexing (tokenisation).
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program
Finite state automata
Finite state automata
A finite state automaton (FSA) is an algorithm that, given a string over an alphabet A, answers with TRUE or FALSE. In other words, an FSA accepts a language with alphabet A.
Finite state automata
A finite state automaton (FSA) is an algorithm that, given a string over an alphabet A, answers with TRUE or FALSE. In other words, an FSA accepts a language with alphabet A.
FSAs are easiest explained in pictures. Here is one with the alphabet {0, 1}
Finite state automata
initial
1
0
1
terminal
0
1
Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to the terminal state such that the edge labels we encounter on this path exactly spell the word w.
Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to the terminal state such that the edge labels we encounter on this path exactly spell the word w. What language does it accept?
Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to the terminal state such that the edge labels we encounter on this path exactly spell the word w. What language does it accept?
(1|01+ )0∗
Finite state automata
Finite state automata
a
A transition or edge s −→ t is to be understood as:
Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.
Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.
If we are at the end of the input, and the automaton is in an terminal (also called accepting) state, the input string as a whole is accepted and in the language of the automaton.
Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.
If we are at the end of the input, and the automaton is in an terminal (also called accepting) state, the input string as a whole is accepted and in the language of the automaton.
If we cannot find a path that terminates at the end of the input, and the automaton is NOT in an accepting state, the input string as a whole is rejected and is NOT in the language of the automaton.
FSA, formal definition
FSA, formal definition
A finite state automaton (FSA) is a tuple A = (A,S,i,F,R) such that the following is true.
Aisafiniteset,calledthealphabetoftheautomaton.
Sisafinitesetofstates.
i∈Sistheinitialstate.
F⊆Sisthesetofterminal,oracceptingstatesofthe automaton. Note: F can be empty. (What happens then?)
Risthetransitionrelation,i.e.itisarelationonstates,
characters and states. More formally, R is a subset of
α
S × A × S. We often write s −→ t instead of (s, α, t) ∈ R
FSA, formal definition
A finite state automaton (FSA) is a tuple A = (A,S,i,F,R) such that the following is true.
Aisafiniteset,calledthealphabetoftheautomaton.
Sisafinitesetofstates.
i∈Sistheinitialstate.
F⊆Sisthesetofterminal,oracceptingstatesofthe automaton. Note: F can be empty. (What happens then?)
Risthetransitionrelation,i.e.itisarelationonstates,
characters and states. More formally, R is a subset of α
S × A × S. We often write s −→ t instead of (s, α, t) ∈ R αα′
We say A is deterministic if whenever s −→ t and s −→ t then t = t′. Otherwise A is non-deterministic.
FSAs, deterministic vs non-deterministic
FSAs, deterministic vs non-deterministic
Which one is deterministic, which on is non-deterministic?
00
111 11
00
FSAs, deterministic vs non-deterministic
Which one is deterministic, which on is non-deterministic?
The finite state automaton on the left is deterministic, that on the right non-deterministic. Each has one accepting state, indicated by double circles. Initial states are often drawn with an incoming arrow without source.
00
111 11
00
FSAs, accepting a string
FSAs, accepting a string
A string (α1, …, αn) is accepted by the automaton if and only if there is a path
α1α2 αn
i −→ s 1 −→ s 2 · · · s n − 1 −→ s n
where i is the initial state, and sn is a terminal state. Note that the states i, s0, …, sn don’t have to be distinct.
FSAs, accepting a string
A string (α1, …, αn) is accepted by the automaton if and only if there is a path
α1α2 αn
i −→ s 1 −→ s 2 · · · s n − 1 −→ s n
where i is the initial state, and sn is a terminal state. Note that the states i, s0, …, sn don’t have to be distinct.
The language of an automaton A is the set of all accepted strings. We write lang(A) for this language.
FSA examples
FSA examples
In class.
FSAs vs REs
FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.
FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.
Foreachnon-deterministicFSAFoveralphabetA,thereis an deterministic FSA F′ over A such that
lang(F′) = lang(F), and vice versa.
FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.
Foreachnon-deterministicFSAFoveralphabetA,thereis an deterministic FSA F′ over A such that
lang(F′) = lang(F), and vice versa.
Foreachε-automaton(seebelow)FoveralphabetA, there is an deterministic FSA F′ over A such that lang(F′) = lang(F), and vice versa.
FSAs vs REs
FSAs vs REs
An aside on the relationship between deterministic and non-deterministic FSAs: why bother at all with non-deterministic FSAs? Two reasons.
FSAs vs REs
An aside on the relationship between deterministic and non-deterministic FSAs: why bother at all with non-deterministic FSAs? Two reasons.
Non-determinsticFSAscanbeconvertedtodeterministic FSAs, so we are not loosing anything by using non-determinstic FSAs in terms of algorithms..
Non-deterministicFSAareusuallymuchsmaller(fewer states) than the deterministic FSAs accepting the same language (often exponentially so: if the NFA has n states, the DFA might have approximately 2n states).
This is a familiar story: we look at something from two angles (1) convenient for humans vs (2) convenient for the machine.
FSAs vs REs
FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?
FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?
Going straight from REs to deterministic FSAs is complicated. So we go there in several steps.
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?
Going straight from REs to deterministic FSAs is complicated. So we go there in several steps.
We are using ε-automata which can be seen as a special case of NFAs. ε-automata make the conversion from REs to Java implementations easier.
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
ε-automata
ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
What is ε for?
ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later.
ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later. What language does this ε-automaton accept?
initial
epsilon epsilon
terminal1 1 terminal2
0
ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later. What language does this ε-automaton accept?
initial
epsilon epsilon
terminal1 1 terminal2
0
ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
Ifweremoveallεfromw′weobtainw.
ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
Ifweremoveallεfromw′weobtainw.
w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.
ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
Ifweremoveallεfromw′weobtainw.
w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.
Example word: “h e l ε l ε o” gives us two chances to change state without consuming input and accept “hello”.
ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
Ifweremoveallεfromw′weobtainw.
w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.
Example word: “h e l ε l ε o” gives us two chances to change state without consuming input and accept “hello”.
So we have
langε(A) = {w | w′ ∈ lang(A),w is w′ with ε removed}
ε-automata are enough for non-deterministic FSA
ε-automata are enough for non-deterministic FSA
Non-determinism can always be translated to ε-automata that are deterministic except for ε-transitions.
initial
1 1
terminal1 1 terminal2
initial
epsilon
epsilon
0
11
terminal1 1 terminal2
0
Translation of REs to FSAs
Translation of REs to FSAs
We will translate every kind of RE (∅, ε, R|R′, …) into an FSA (an ε-FSA to be precise).
Translation of REs to FSAs
We will translate every kind of RE (∅, ε, R|R′, …) into an FSA (an ε-FSA to be precise).
We don’t need to details of each FSA in the translation, we will only be manipulating the initial and final state. All our translations have just one final state. We use the following notation to represent the FSAs arising in our translations.
Translation of ∅
Translation of ∅
A
Translation of ε
Translation of ε
epsilon
Translation of ′c′
Translation of ′c′
c
Translation of (A)
Translation of (A)
A
(A)
Translation of A|B
Translation of A|B
AB
epsilon
epsilon
A
B
epsilon
epsilon
Translation of AB
Translation of AB
AB
A
B
epsilon
Translation of A∗
Translation of A∗
epsilon
A
epsilon
A
epsilon
epsilon
Example translation
Example translation
What’s the automaton that the RE (1|0)∗1 translates to? (Writing e for ε)
Example translation
What’s the automaton that the RE (1|0)∗1 translates to? (Writing e for ε)
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs
From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?
From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
Now we want to translate our NFAs (ε-automata) to DFAs, because we can implement DFAs in e.g. Java (computers can’t handle non-determinism).
From NFAs (ε-automata) to DFAs: ε-closure
From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.
From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs: ε-closure.
From NFAs (ε-automata) to DFAs: ε-closure.
Consider the last example.
From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e
From NFAs (ε-automata) to DFAs
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states, i ∈ S initial state, F ⊆ S final states).
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
a
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
a
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
ThealphabetisA
a
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
ThealphabetisA
Thestateareallnon-emptysetsofstatesinS,i.e.all non-empty subsets of S.
a
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
ThealphabetisA
Thestateareallnon-emptysetsofstatesinS,i.e.all
non-empty subsets of S.
Thestartstateistheε-closureofi.
a
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
ThealphabetisA
Thestateareallnon-emptysetsofstatesinS,i.e.all
non-empty subsets of S.
Thestartstateistheε-closureofi.
Thefinalstatesareallnon-emptysetsX⊂Ssuchthat X ∩ F ̸= ∅.
a
From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
ThealphabetisA
Thestateareallnon-emptysetsofstatesinS,i.e.all
non-empty subsets of S.
Thestartstateistheε-closureofi.
Thefinalstatesareallnon-emptysetsX⊂Ssuchthat X ∩ F ̸= ∅.
We have a transition from X to Y with the label a exactly when Y = ε-closure of a(X).
a
From NFAs (ε-automata) to DFAs
From NFAs (ε-automata) to DFAs The example
e
C1E e
ee D0F
e
e
A e B
G e H e I 1 J
From NFAs (ε-automata) to DFAs
e
C1E e
ee D0F
e
A e B
G e H e I 1 J
e0
ABCDFGHI
01
Becomes
0
ABCDHI
1
1
Unreachable states omitted
ABCDEGHIJ
From NFAs (ε-automata) to DFAs
0
1
01
0
1
Check that the language of the new FSA is (1|0)∗1 as required.
From NFAs (ε-automata) to DFAs
From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new automaton have?
From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…
From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…
This exponential blowup is an intrinsic problem of converting non-deterministic automata into deterministic ones. It has nothing to do with ε-automata.
From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…
This exponential blowup is an intrinsic problem of converting non-deterministic automata into deterministic ones. It has nothing to do with ε-automata.
Fortunately in many cases, most of them are inactive and can be ignored. However in pathological cases, all states are needed.
From NFAs (ε-automata) to DFAs
This example shows that the translation in the naive form presented here is not particularly efficient: of the 1023 states it introduces, only 3 are needed (active). It is possible to improve the translation, so the inactive states disappear.
Implementation of DFAs and NFAs
Implementation of DFAs and NFAs
Remember the lexer construction pipeline?
Implementation of DFAs and NFAs
Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
Implementation of DFAs and NFAs
Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
Now we want to translate DFAs into real programs e.g. Java.
Implementation of DFAs
Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
0
A
B
01
C
0
1
1
Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
In code: p/t/o
Implementation of DFAs
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
Implementation of DFAs
def scan ( input : Array [ Char ] )
: Boolean = {
val table = … // transitions
var i = 0 // current character
var s = A // current state
val acceptingState = C
while ( i < input.length ) {
s = table [ s, input[i] ]
i += 1 }
return ( s == acceptingState ) }
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
Implementation of DFAs
def scan ( input : Array [ Char ] )
: Boolean = {
val table = ... // transitions
var i = 0 // current character
var s = A // current state
val acceptingState = C
while ( i < input.length ) {
s = table [ s, input[i] ]
i += 1 }
return ( s == acceptingState ) }
Question: what if one of the state lacks outgoing transitions on some labels?
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
Implementation of DFAs
def scan ( input : Array [ Char ] )
: Boolean = {
val table = ... // transitions
var i = 0 // current character
var s = A // current state
val acceptingState = C
while ( i < input.length ) {
s = table [ s, input[i] ]
i += 1 }
return ( s == acceptingState ) }
Question: what if one of the state lacks outgoing transitions on some labels? Answer: add artificial error states, and from the error state a transition back to itself for every character.
Implementation of DFAs
Implementation of DFAs
This idea, using a 2-dimensional table to implement an FSA is fundamental. Most (all?) real-world implementations of REs, FSAs etc use variants of it. It is worth understanding well.
Implementation of DFAs
Implementation of DFAs
Many rows in the array are identical (all in the example below, first and third row in the previous example). That is often the case in the implementation of lexers. We can save space by sharing rows (or columns):
0
0
A
A
B C
B
01
C
01
1
1
B
C
Outputting a token list
Outputting a token list
We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem.
Outputting a token list
We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem.
Hello ( 123 then ...
should yield a token list:
Outputting a token list
We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem.
Hello ( 123 then ...
should yield a token list:
T_Ident ( "Hello" ),
T_Left Bracket,
T_Num ( 123 ),
T_Then,
...
Mealy automata
Mealy automata
We use Mealy automata, which is a variant of FSAs which have not only an input action, but also an output action. A picture says more than a 1000 words.
initial I F Input eps T_if
T Output eps
HEN
eps
eps
T_then
eps
eps
Mealy automata
Mealy automata
initial I F Input eps T_if
T Output eps
HEN
eps
eps
T_then
eps
eps
With a Mealy automaton, when we have a path w1w2 wn
i −→ s 1 −→ s 2 · · · s n − 1 −→ s n u1u2 un
whenever we accept (and consume) the input string w1...wn we create an output string u1...un.
Lexer generators
Lexer generators
Lexers can be written by hand, but for realistic languages it is much easier to let the computer do most of the work. Lexer generators exist that take as input an ordered list of REs (ordering gives priority, see below) together with actions associated with each RE, and returns a working lexer. Actions allow you to associate Java code with regular expressions. Examples:
Flex
JFlex
JavaCC(lexerandparsergeneratorcombined)
There are many others. It is worth learning one of them, since they make it very easy to produce fast lexers.
Summary
Summary
In first approximation, lexing works like this
Summary
In first approximation, lexing works like this
WriteanREforthelexemesofeachtokenclass,e.g. Number = [0-9]+, Keywords = ...
Summary
In first approximation, lexing works like this
WriteanREforthelexemesofeachtokenclass,e.g.
Number = [0-9]+, Keywords = ...
ConstructabigRE,matchingalllexemesforalltokens.
R = Keywords Identifier Number LBracket ...
Summary
In first approximation, lexing works like this
WriteanREforthelexemesofeachtokenclass,e.g.
Number = [0-9]+, Keywords = ...
ConstructabigRE,matchingalllexemesforalltokens.
R = Keywords Identifier Number LBracket ...
ConstructanFSA(Mealyautomaton)forR.Letalexer generator do this work.
Error handling in lexers
Error handling in lexers
What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas).
Error handling in lexers
What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas).
There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers.
Error handling in lexers
What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas).
There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers.
Use an RE that matches any character in the alphabet, but give it lowest priority. What effect does this have?
Error handling in lexers
What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas).
There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers.
Use an RE that matches any character in the alphabet, but give it lowest priority. What effect does this have?
The error RE is really easy to write: use a RE that matches any character in the alphabet. Give this RE the lowest priority. Because it matches any character it will also always be a shortest possible match.
It catches anything that is not allowed by all previous REs. The output associated with this RE can be used for error messages.
Conclusion
Conclusion
Lexer: takes a program as string, returns a list of tokens.
Conclusion
Lexer: takes a program as string, returns a list of tokens.
The point of lexing is to have a ROUGH classification of the input program that enables the next stage (parsing) to determine of the program is syntactically well-formed, and to construct the AST. Regular expressions and FSAs are convenient tools for implementation of lexers.
The material in the textbooks
The material in the textbooks
DragonBook:Chapter2.6,Chapter3.
Appel,Palsberg:Chapter2.
“Engineeringacompiler”:Chapter2:especiallysections 2.1 to 2.5.