程序代写代做代考 Finite State Automaton flex compiler Java computer architecture algorithm Compilers and computer architecture: From strings to ASTs (1):

Compilers and computer architecture: From strings to ASTs (1):
finite state automata for lexing
Martin Berger
September – October 2015

Recall the function of compilers

Plan for this week

Plan for this week
Remember the shape of compilers?
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program

Plan for this week
Remember the shape of compilers?
We learned about regular expressions (REs). They enable us to specify simple language (finite and infinite).
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program

Plan for this week
Remember the shape of compilers?
We learned about regular expressions (REs). They enable us to specify simple language (finite and infinite).
The question we need to answer is: how to decide, given a string s and a regular expression R, if s ∈ lang(R)?
We will later see that this is the main step towards an algorithm for lexing (tokenisation).
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program

Finite state automata

Finite state automata
A finite state automaton (FSA) is an algorithm that, given a string over an alphabet A, answers with TRUE or FALSE. In other words, an FSA accepts a language with alphabet A.

Finite state automata
A finite state automaton (FSA) is an algorithm that, given a string over an alphabet A, answers with TRUE or FALSE. In other words, an FSA accepts a language with alphabet A.
FSAs are easiest explained in pictures. Here is one with the alphabet {0, 1}

Finite state automata
initial
1
0
1
terminal
0
1

Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to the terminal state such that the edge labels we encounter on this path exactly spell the word w.

Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to the terminal state such that the edge labels we encounter on this path exactly spell the word w. What language does it accept?

Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to the terminal state such that the edge labels we encounter on this path exactly spell the word w. What language does it accept?
(1|01+ )0∗

Finite state automata

Finite state automata
a
A transition or edge s −→ t is to be understood as:

Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.

Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.
If we are at the end of the input, and the automaton is in an terminal (also called accepting) state, the input string as a whole is accepted and in the language of the automaton.

Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.
If we are at the end of the input, and the automaton is in an terminal (also called accepting) state, the input string as a whole is accepted and in the language of the automaton.
If we cannot find a path that terminates at the end of the input, and the automaton is NOT in an accepting state, the input string as a whole is rejected and is NOT in the language of the automaton.

FSA, formal definition

FSA, formal definition
A finite state automaton (FSA) is a tuple A = (A,S,i,F,R) such that the following is true.
􏹩 Aisafiniteset,calledthealphabetoftheautomaton.
􏹩 Sisafinitesetofstates.
􏹩 i∈Sistheinitialstate.
􏹩 F⊆Sisthesetofterminal,oracceptingstatesofthe automaton. Note: F can be empty. (What happens then?)
􏹩 Risthetransitionrelation,i.e.itisarelationonstates,
characters and states. More formally, R is a subset of
α
S × A × S. We often write s −→ t instead of (s, α, t) ∈ R

FSA, formal definition
A finite state automaton (FSA) is a tuple A = (A,S,i,F,R) such that the following is true.
􏹩 Aisafiniteset,calledthealphabetoftheautomaton.
􏹩 Sisafinitesetofstates.
􏹩 i∈Sistheinitialstate.
􏹩 F⊆Sisthesetofterminal,oracceptingstatesofthe automaton. Note: F can be empty. (What happens then?)
􏹩 Risthetransitionrelation,i.e.itisarelationonstates,
characters and states. More formally, R is a subset of α
S × A × S. We often write s −→ t instead of (s, α, t) ∈ R αα′
We say A is deterministic if whenever s −→ t and s −→ t then t = t′. Otherwise A is non-deterministic.

FSAs, deterministic vs non-deterministic

FSAs, deterministic vs non-deterministic
Which one is deterministic, which on is non-deterministic?
00
111 11
00

FSAs, deterministic vs non-deterministic
Which one is deterministic, which on is non-deterministic?
The finite state automaton on the left is deterministic, that on the right non-deterministic. Each has one accepting state, indicated by double circles. Initial states are often drawn with an incoming arrow without source.
00
111 11
00

FSAs, accepting a string

FSAs, accepting a string
A string (α1, …, αn) is accepted by the automaton if and only if there is a path
α1α2 αn
i −→ s 1 −→ s 2 · · · s n − 1 −→ s n
where i is the initial state, and sn is a terminal state. Note that the states i, s0, …, sn don’t have to be distinct.

FSAs, accepting a string
A string (α1, …, αn) is accepted by the automaton if and only if there is a path
α1α2 αn
i −→ s 1 −→ s 2 · · · s n − 1 −→ s n
where i is the initial state, and sn is a terminal state. Note that the states i, s0, …, sn don’t have to be distinct.
The language of an automaton A is the set of all accepted strings. We write lang(A) for this language.

FSA examples

FSA examples
In class.

FSAs vs REs

FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?

FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.

FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
􏹩 ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.

FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
􏹩 ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.
􏹩 Foreachnon-deterministicFSAFoveralphabetA,thereis an deterministic FSA F′ over A such that
lang(F′) = lang(F), and vice versa.

FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
􏹩 ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.
􏹩 Foreachnon-deterministicFSAFoveralphabetA,thereis an deterministic FSA F′ over A such that
lang(F′) = lang(F), and vice versa.
􏹩 Foreachε-automaton(seebelow)FoveralphabetA, there is an deterministic FSA F′ over A such that lang(F′) = lang(F), and vice versa.

FSAs vs REs

FSAs vs REs
An aside on the relationship between deterministic and non-deterministic FSAs: why bother at all with non-deterministic FSAs? Two reasons.

FSAs vs REs
An aside on the relationship between deterministic and non-deterministic FSAs: why bother at all with non-deterministic FSAs? Two reasons.
􏹩 Non-determinsticFSAscanbeconvertedtodeterministic FSAs, so we are not loosing anything by using non-determinstic FSAs in terms of algorithms..
􏹩 Non-deterministicFSAareusuallymuchsmaller(fewer states) than the deterministic FSAs accepting the same language (often exponentially so: if the NFA has n states, the DFA might have approximately 2n states).
This is a familiar story: we look at something from two angles (1) convenient for humans vs (2) convenient for the machine.

FSAs vs REs

FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?

FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?
Going straight from REs to deterministic FSAs is complicated. So we go there in several steps.
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA

FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?
Going straight from REs to deterministic FSAs is complicated. So we go there in several steps.
We are using ε-automata which can be seen as a special case of NFAs. ε-automata make the conversion from REs to Java implementations easier.
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA

ε-automata

ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.

ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
What is ε for?

ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later.

ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later. What language does this ε-automaton accept?
initial
epsilon epsilon
terminal1 1 terminal2
0

ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later. What language does this ε-automaton accept?
initial
epsilon epsilon
terminal1 1 terminal2
0

ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:

ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
􏹩 Ifweremoveallεfromw′weobtainw.

ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
􏹩 Ifweremoveallεfromw′weobtainw.
􏹩 w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.

ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
􏹩 Ifweremoveallεfromw′weobtainw.
􏹩 w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.
Example word: “h e l ε l ε o” gives us two chances to change state without consuming input and accept “hello”.

ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
􏹩 Ifweremoveallεfromw′weobtainw.
􏹩 w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.
Example word: “h e l ε l ε o” gives us two chances to change state without consuming input and accept “hello”.
So we have
langε(A) = {w | w′ ∈ lang(A),w is w′ with ε removed}

ε-automata are enough for non-deterministic FSA

ε-automata are enough for non-deterministic FSA
Non-determinism can always be translated to ε-automata that are deterministic except for ε-transitions.
initial
1 1
terminal1 1 terminal2
initial
epsilon
epsilon
0
11
terminal1 1 terminal2
0

Translation of REs to FSAs

Translation of REs to FSAs
We will translate every kind of RE (∅, ε, R|R′, …) into an FSA (an ε-FSA to be precise).

Translation of REs to FSAs
We will translate every kind of RE (∅, ε, R|R′, …) into an FSA (an ε-FSA to be precise).
We don’t need to details of each FSA in the translation, we will only be manipulating the initial and final state. All our translations have just one final state. We use the following notation to represent the FSAs arising in our translations.

Translation of ∅

Translation of ∅
A

Translation of ε

Translation of ε
epsilon

Translation of ′c′

Translation of ′c′
c

Translation of (A)

Translation of (A)
A
(A)

Translation of A|B

Translation of A|B
AB
epsilon
epsilon
A
B
epsilon
epsilon

Translation of AB

Translation of AB
AB
A
B
epsilon

Translation of A∗

Translation of A∗
epsilon
A
epsilon
A
epsilon
epsilon

Example translation

Example translation
What’s the automaton that the RE (1|0)∗1 translates to? (Writing e for ε)

Example translation
What’s the automaton that the RE (1|0)∗1 translates to? (Writing e for ε)
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs

From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?

From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA

From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
Now we want to translate our NFAs (ε-automata) to DFAs, because we can implement DFAs in e.g. Java (computers can’t handle non-determinism).

From NFAs (ε-automata) to DFAs: ε-closure

From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.

From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs: ε-closure.

From NFAs (ε-automata) to DFAs: ε-closure.
Consider the last example.

From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states, i ∈ S initial state, F ⊆ S final states).

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
a

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
a

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
􏹩 ThealphabetisA
a

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
􏹩 ThealphabetisA
􏹩 Thestateareallnon-emptysetsofstatesinS,i.e.all non-empty subsets of S.
a

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
􏹩 ThealphabetisA
􏹩 Thestateareallnon-emptysetsofstatesinS,i.e.all
non-empty subsets of S.
􏹩 Thestartstateistheε-closureofi.
a

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
􏹩 ThealphabetisA
􏹩 Thestateareallnon-emptysetsofstatesinS,i.e.all
non-empty subsets of S.
􏹩 Thestartstateistheε-closureofi.
􏹩 Thefinalstatesareallnon-emptysetsX⊂Ssuchthat X ∩ F ̸= ∅.
a

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
􏹩 ThealphabetisA
􏹩 Thestateareallnon-emptysetsofstatesinS,i.e.all
non-empty subsets of S.
􏹩 Thestartstateistheε-closureofi.
􏹩 Thefinalstatesareallnon-emptysetsX⊂Ssuchthat X ∩ F ̸= ∅.
􏹩 We have a transition from X to Y with the label a exactly when Y = ε-closure of a(X).
a

From NFAs (ε-automata) to DFAs

From NFAs (ε-automata) to DFAs The example
e
C1E e
ee D0F
e
e
A e B
G e H e I 1 J

From NFAs (ε-automata) to DFAs
e
C1E e
ee D0F
e
A e B
G e H e I 1 J
e0
ABCDFGHI
01
Becomes
0
ABCDHI
1
1
Unreachable states omitted
ABCDEGHIJ

From NFAs (ε-automata) to DFAs
0
1
01
0
1
Check that the language of the new FSA is (1|0)∗1 as required.

From NFAs (ε-automata) to DFAs

From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new automaton have?

From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?

From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…

From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…
This exponential blowup is an intrinsic problem of converting non-deterministic automata into deterministic ones. It has nothing to do with ε-automata.

From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…
This exponential blowup is an intrinsic problem of converting non-deterministic automata into deterministic ones. It has nothing to do with ε-automata.
Fortunately in many cases, most of them are inactive and can be ignored. However in pathological cases, all states are needed.

From NFAs (ε-automata) to DFAs
This example shows that the translation in the naive form presented here is not particularly efficient: of the 1023 states it introduces, only 3 are needed (active). It is possible to improve the translation, so the inactive states disappear.

Implementation of DFAs and NFAs

Implementation of DFAs and NFAs
Remember the lexer construction pipeline?

Implementation of DFAs and NFAs
Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA

Implementation of DFAs and NFAs
Remember the lexer construction pipeline?
Lexical Regular NFA, epsilon specification expressions automaton
Table-driven DFA implementation of
DFA
Now we want to translate DFAs into real programs e.g. Java.

Implementation of DFAs

Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.

Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
0
A
B
01
C
0
1
1

Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C

Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and label l stores the the next state in the automaton when starting in state X and consuming l.
0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
In code: p/t/o

Implementation of DFAs

0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
Implementation of DFAs
def scan ( input : Array [ Char ] )
: Boolean = {
val table = … // transitions
var i = 0 // current character
var s = A // current state
val acceptingState = C
while ( i < input.length ) { s = table [ s, input[i] ] i += 1 } return ( s == acceptingState ) } 0 A 01 A B C B 00 C 1 1 1 B C C B B C Implementation of DFAs def scan ( input : Array [ Char ] ) : Boolean = { val table = ... // transitions var i = 0 // current character var s = A // current state val acceptingState = C while ( i < input.length ) { s = table [ s, input[i] ] i += 1 } return ( s == acceptingState ) } Question: what if one of the state lacks outgoing transitions on some labels? 0 A 01 A B C B 00 C 1 1 1 B C C B B C Implementation of DFAs def scan ( input : Array [ Char ] ) : Boolean = { val table = ... // transitions var i = 0 // current character var s = A // current state val acceptingState = C while ( i < input.length ) { s = table [ s, input[i] ] i += 1 } return ( s == acceptingState ) } Question: what if one of the state lacks outgoing transitions on some labels? Answer: add artificial error states, and from the error state a transition back to itself for every character. Implementation of DFAs Implementation of DFAs This idea, using a 2-dimensional table to implement an FSA is fundamental. Most (all?) real-world implementations of REs, FSAs etc use variants of it. It is worth understanding well. Implementation of DFAs Implementation of DFAs Many rows in the array are identical (all in the example below, first and third row in the previous example). That is often the case in the implementation of lexers. We can save space by sharing rows (or columns): 0 0 A A B C B 01 C 01 1 1 B C Outputting a token list Outputting a token list We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem. Outputting a token list We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem. Hello ( 123 then ... should yield a token list: Outputting a token list We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem. Hello ( 123 then ... should yield a token list: T_Ident ( "Hello" ), T_Left Bracket, T_Num ( 123 ), T_Then, ... Mealy automata Mealy automata We use Mealy automata, which is a variant of FSAs which have not only an input action, but also an output action. A picture says more than a 1000 words. initial I F Input eps T_if T Output eps HEN eps eps T_then eps eps Mealy automata Mealy automata initial I F Input eps T_if T Output eps HEN eps eps T_then eps eps With a Mealy automaton, when we have a path w1w2 wn i −→ s 1 −→ s 2 · · · s n − 1 −→ s n u1u2 un whenever we accept (and consume) the input string w1...wn we create an output string u1...un. Lexer generators Lexer generators Lexers can be written by hand, but for realistic languages it is much easier to let the computer do most of the work. Lexer generators exist that take as input an ordered list of REs (ordering gives priority, see below) together with actions associated with each RE, and returns a working lexer. Actions allow you to associate Java code with regular expressions. Examples: 􏹩 Flex 􏹩 JFlex 􏹩 JavaCC(lexerandparsergeneratorcombined) There are many others. It is worth learning one of them, since they make it very easy to produce fast lexers. Summary Summary In first approximation, lexing works like this Summary In first approximation, lexing works like this 􏹩 WriteanREforthelexemesofeachtokenclass,e.g. Number = [0-9]+, Keywords = ... Summary In first approximation, lexing works like this 􏹩 WriteanREforthelexemesofeachtokenclass,e.g. Number = [0-9]+, Keywords = ... 􏹩 ConstructabigRE,matchingalllexemesforalltokens. R = Keywords Identifier Number LBracket ... Summary In first approximation, lexing works like this 􏹩 WriteanREforthelexemesofeachtokenclass,e.g. Number = [0-9]+, Keywords = ... 􏹩 ConstructabigRE,matchingalllexemesforalltokens. R = Keywords Identifier Number LBracket ... 􏹩 ConstructanFSA(Mealyautomaton)forR.Letalexer generator do this work. Error handling in lexers Error handling in lexers What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas). Error handling in lexers What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas). There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers. Error handling in lexers What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas). There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers. Use an RE that matches any character in the alphabet, but give it lowest priority. What effect does this have? Error handling in lexers What if the lexer encounters a character in the input that does match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas). There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers. Use an RE that matches any character in the alphabet, but give it lowest priority. What effect does this have? The error RE is really easy to write: use a RE that matches any character in the alphabet. Give this RE the lowest priority. Because it matches any character it will also always be a shortest possible match. It catches anything that is not allowed by all previous REs. The output associated with this RE can be used for error messages. Conclusion Conclusion Lexer: takes a program as string, returns a list of tokens. Conclusion Lexer: takes a program as string, returns a list of tokens. The point of lexing is to have a ROUGH classification of the input program that enables the next stage (parsing) to determine of the program is syntactically well-formed, and to construct the AST. Regular expressions and FSAs are convenient tools for implementation of lexers. The material in the textbooks The material in the textbooks 􏹩 DragonBook:Chapter2.6,Chapter3. 􏹩 Appel,Palsberg:Chapter2. 􏹩 “Engineeringacompiler”:Chapter2:especiallysections 2.1 to 2.5.