程序代写代做代考 flex compiler go algorithm computer architecture C Finite State Automaton Java Compilers and computer architecture: From strings to ASTs (1):

Compilers and computer architecture: From strings to ASTs (1):
finite state automata for lexing
Martin Berger 1 October 2019
1Email: M.F.Berger@sussex.ac.uk, Office hours: Wed 12-13 in Chi-2R312
1/1

Recall the function of compilers
2/1

Plan for this week
Source program
Lexical analysis
Syntax analysis
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program
Remember the shape of compilers?
We learned about regular expressions (REs). They enable us to specify simple language (finite and infinite).
The question we need to answer is: how to decide, given a string s and a regular expression R, if s ∈ lang(R)?
We will later see that this is the main step towards an algorithm for lexing (tokenisation).
3/1

Finite state automata
A finite state automaton (FSA) is an algorithm that, given a string over an alphabet A, answers with TRUE or FALSE. The strings that the FSA says TRUE to is the language of the FSA.
In other words, FSAs decide languages.
FSAs are easiest explained in pictures. Here is one with the alphabet {0, 1}
4/1

Finite state automata
initial
1
0
1
terminal
0
1
A word w is accepted by an FSA exactly if there is a path in the FSA from the initial state to a terminal state such that the edge labels we encounter on this path exactly spell the word w.
What language does the FSA above accept? (1|01+ )0∗
5/1

Finite state automata
a
A transition or edge s −→ t is to be understood as:
If the automaton is in state s and reads (’eats’) the character a then it moves to state t.
If we are at the end of the input, and the automaton is in an terminal (also called accepting) state, the input string as a whole is accepted and in the language of the automaton.
If we cannot find a path that terminates at the end of the input, and the automaton is NOT in an accepting state, the input string as a whole is rejected and is NOT in the language of the automaton.
6/1

FSA, formal definition
A finite state automaton (FSA) is a tuple A = (A,S,i,F,R) such that the following is true.
􏰉 Aisafiniteset,calledthealphabetoftheautomaton.
􏰉 Sisanon-emptyfinitesetofstates.
􏰉 i∈Sistheinitialstate.
􏰉 F⊆Sisthesetofterminal,oracceptingstatesofthe automaton. Note: F can be empty. (What happens then?)
􏰉 Risthetransitionrelation,i.e.itisarelationonstates,
characters and states. More formally, R is a subset of α
S × A × S. We often write s −→ t instead of (s, α, t) ∈ R αα′
We say A is deterministic if whenever s −→ t and s −→ t then t = t′. Otherwise A is non-deterministic.
7/1

FSAs, deterministic vs non-deterministic
Which one is deterministic, which on is non-deterministic?
The finite state automaton on the left is deterministic, that on the right non-deterministic. Each has one accepting state, indicated by double circles. Initial states are often drawn with an incoming arrow without source.
00
111 11
00
8/1

FSAs, accepting a string
A string (α1, …, αn) is accepted by the automaton if and only if there is a path
α1α2 αn
i −→ s 1 −→ s 2 · · · s n − 1 −→ s n
where i is the initial state, and sn is a terminal state. Note that the states i, s0, …, sn don’t have to be distinct.
The language of an automaton A is the set of all accepted strings. We write lang(A) for this language.
9/1

FSA examples
In class.
10/1

FSAs vs REs
Why do we bother introducing FSAs when we’ve got REs to specify the lexical structure of a programming language?
Because we need an algorithm to decide membership in the language specified by the RE, and convert the input to a token list. FSA are (almost) algorithms. REs and FSAs are connected by the following amazing and surprising facts.
􏰉 ForeachregularexpressionRoveralphabetA,thereisan deterministic FSA F over A such that lang(R) = lang(F), and vice versa.
􏰉 Foreachnon-deterministicFSAFoveralphabetA,thereis an deterministic FSA F′ over A such that
lang(F′) = lang(F), and vice versa.
11/1

Deterministic vs non-deterministic FSA: why bother?
An aside on the relationship between deterministic and non-deterministic FSAs: why bother at all with non-deterministic FSAs? Two reasons.
􏰉 Non-deterministicFSAareusuallymuchsmaller(fewer states) than the deterministic FSAs accepting the same language (often exponentially so: if the NFA has n states, the DFA might have approximately 2n states).
􏰉 DeterminsticFSAscanbeimplementedonreal machines. Question: Can non-deterministic FSAs be implemented (directly)?
􏰉 Non-deterministicFSAscanbeconvertedtodeterministic automata recognising the same language.
This is a familiar story: we look at something from two angles (1) convenient for humans vs (2) convenient for the machine.
12/1

FSAs vs REs
Given that REs and FSAs can describe the same language, how can we get from an RE to an FSA?
Going straight from REs to deterministic FSAs is complicated. So we go there in several steps.
We are using ε-automata which can be seen as a special case of NFAs. ε-automata make the conversion from REs to Java implementations easier.
13/1
Lexical specification
Regular expressions
NFA, epsilon automaton
Table-driven DFA implementation of
DFA
Brzozowski derivatives

ε-automata
Formally, an ε-automaton with alphabet A is a (usually non-deterministic) FSA with alphabet A ∪ {ε}.
The definition of language ε-automaton accepted by an ε-automaton is slightly different from the definition for non-deterministic) FSAs.
ε
What is ε for? We use ε-labelled transitions s −→ t to move from state s to state t, but without consuming input. This will be convenient later. What language does this ε-automaton accept?
epsilon
terminal1
initial
1
epsilon
terminal2
0
The language 0∗|1∗ as a regular expression.
14/1

ε-automata
So, an ε-automaton with alphabet A is an FSA with alphabet A ∪ {ε}, but the language is different: the word w over the alphabet A is accepted by ε-automaton A precisely when there isawordw′ overA∪{ε}suchthat:
􏰉 Ifweremoveallεfromw′weobtainw.
􏰉 w′∈lang(A)asanormal(i.e.walkinganyedgeconsumes
the first character of the input string).
We write langε(A) for the language of an ε-automaton A.
Example word: “h e l ε l ε o” gives us two chances to change state without consuming input and accept “hello”.
So we have
langε(A) = {w | w′ ∈ lang(A),w is w′ with ε removed}
15/1

ε-automata are enough for non-deterministic FSA
Non-determinism can always be translated to ε-automata that are deterministic except for ε-transitions.
16/1
initial
1 1
terminal1 1 terminal2
initial
epsilon
epsilon
0
11
terminal1 1 terminal2
0

Translation of REs to FSAs
We will translate every kind of RE (∅, ε, R|R′, …) into an FSA (an ε-FSA to be precise).
We don’t need to details of each FSA in the translation, we will only be manipulating the initial and final state. All our translations have just one final state. We use the following notation to represent the FSAs arising in our translations.
17/1

Translation of ∅
A
18/1

Translation of ε
epsilon
19/1

Translation of ′c′
c
20/1

Translation of (A)
A
(A)
21/1

Translation of A|B
AB
epsilon
epsilon
A
B
epsilon
epsilon
22/1

Translation of AB
AB
epsilon
A
B
23/1

Translation of A∗
24/1
epsilon
A
epsilon
A
epsilon
epsilon

Example translation
What’s the automaton that the RE (1|0)∗1 translates to? (Writing e for ε)
e
1
e
e
e ee1
ee 0
e
25/1

From NFAs (ε-automata) to DFAs Remember the lexer construction pipeline?
Now we want to translate our NFAs (ε-automata) to DFAs, because we can implement DFAs in e.g. Java (computers can’t handle non-determinism).
26/1
Lexical specification
Regular expressions
NFA, epsilon automaton
Table-driven DFA implementation of
DFA
Brzozowski derivatives

From NFAs (ε-automata) to DFAs: ε-closure Consider the last example.
e
1
e
e
e ee1
ee 0
e
The ε-closure of a set of states S in an automaton is the set of all states reachable from a state in S by 0 or more ε-transitions.
27/1

From NFAs (ε-automata) to DFAs: ε-closure. Consider the last example.
28/1
e
1
e
e
e ee1
ee 0
e

From NFAs (ε-automata) to DFAs
Let (A,S,i,F,→) be an ε-automaton (A alphabet, S states,
i ∈ S initial state, F ⊆ S final states).
For each a ∈ A and X ⊆ S let a(X) = {y ∈ S | x ∈ X,x −→ y}
Now the corresponding DFA (accepting the same language) is given as follows.
􏰉 ThenewalphabetisA
􏰉 Thenewstatesareallnon-emptysubsetsofS
􏰉 Thenewstartstateistheε-closureofi.
􏰉 Thenewfinalstatesareallnon-emptysetsX⊆Ssuch that X ∩ F ̸= ∅. (Why non-empty?)
􏰉 We have a new transition from X to Y with the label a exactly when Y = ε-closure of a(X ).
a
29/1

From NFAs (ε-automata) to DFAs The example
e
C1E e
ee D0F
e
e
A e B
G e H e I 1 J
30/1
e
C1E e
e
e ee1

From NFAs (ε-automata) to DFAs
0
1
01
0
1
Check that the language of the new FSA is (1|0)∗1 as required.
31/1

From NFAs (ε-automata) to DFAs
Do you notice something? How many states does the new
automaton have?
Consider our running example. It has 10 states. How many states would it DFA version have?
Ithas210 −1=1023…
This exponential blowup is an intrinsic problem of converting non-deterministic automata into deterministic ones. It has nothing to do with ε-automata.
Fortunately in many cases, most of them are inactive and can be ignored. However in pathological cases, all states are needed.
32/1

From NFAs (ε-automata) to DFAs
This example shows that the translation in the naive form presented here is not particularly efficient: of the 1023 states it introduces, only 3 are needed (active). It is possible to improve the translation, so the inactive states disappear.
33/1

Implementation of DFAs and NFAs
Remember the lexer construction pipeline?
Now we want to translate DFAs into real programs e.g. Java.
34/1
Lexical specification
Regular expressions
NFA, epsilon automaton
Table-driven DFA implementation of
DFA
Brzozowski derivatives

Implementation of DFAs
A DFA is naturally implemented as a 2-dimensional table (array) T. Columns are indexed by the alphabet, rows are indexed by the states. Array element at row X and character c stores the the next state in the automaton when starting in state X and consuming c.
code: p/t/o
35 / 1
0
A
B
01
C
0
1
1
0
0
A
01 A
B C
B
C
1
B
C
C
B
B
C

0
A
01 A
B C
B
00
C
1
1
1
B
C
C
B
B
C
Implementation of DFAs
def scan ( input : Array [ Char ] )
: Boolean = {
val table = … // transitions
var i = 0 // current character
var s = A // current state
val acceptingState = C
while ( i < input.length ) { s = table [ s, input[i] ] i += 1 } return ( s == acceptingState ) } Question: what if one of the state lacks outgoing transitions on some labels? Answer: add artificial error states, and from the error state a transition back to itself for every character. 36/1 Implementation of DFAs This idea, using a 2-dimensional table to implement an FSA is fundamental. Most (all?) real-world implementations of REs, FSAs etc use variants of it. It is worth understanding well. 37/1 Implementation of DFAs Many rows in the array are identical (all in the example below, first and third row in the previous example). That is often the case in the implementation of lexers. We can save space by sharing rows (or columns): 38/1 0 A A B C B 01 C 01 0 1 1 B C Outputting a token list We have reached our intermediate goal: going from REs to algorithms that decide the language of the RE, i.e. respond with TRUE/FALSE for each input string. But in lexing we want a token list (or an error message). Fortunately, this is only a small variant of the decision problem. Hello ( 123 then ... should yield a token list: T_Ident ( "Hello" ), T_Left Bracket, T_Num ( 123 ), T_Then, ... 39/1 Mealy automata We use Mealy automata, which is a variant of FSAs which have not only an input action, but also an output action. A picture says more than a 1000 words. initial I F Input eps T_if T Output eps HEN eps eps T_then eps eps 40/1 Mealy automata initial I F Input eps T_if T Output eps HEN eps eps T_then eps eps With a Mealy automaton, when we have a path w1w2 wn i −→ s 1 −→ s 2 · · · s n − 1 −→ s n u1u2 un whenever we accept (and consume) the input string w1...wn we create an output u1, ..., un. There are many variants, e.g. Moore automata. 41/1 Mealy automata: implementation We can implement Mealy automata by agumenting the 2-dimensional table with appropriate outputs that we accumulate as we consume the input string. 42/1 Lexer generators Lexers can be written by hand, but much easier to let the computer do that work. Lexer generators take as input an ordered list of REs (ordering gives priority, see below) together with actions (think Mealy automaton) associated with each RE, and returns a working lexer. Actions allow you to associate Java code with regular expressions. Examples: Flex, JFlex. Lexer generator upside: 􏰉 Lexergeneratorsproduceveryfastlexers 􏰉 Lexergeneratorsisolatethecompilerwriterfromhavingto worry about fast lexer implementations. Lexer generator downside: 􏰉 Yetanotherthingtolearn,and(likemostsoftware)tendto be badly documented. 􏰉 Anexpertcanprobablyproducefasterlexersthana generator. 43/1 Implementing lexers using regular expression libraries Modern programming languages often have elaborate regular expression libraries. They can be used for implementing lexers too. But you have to ensure things like “longest-match” and “keywords-first” heuristics. Key disadvantage: regular expressions tend to be slow, so not suitable for industrial strength compilers. But OK for toy compilers. 44/1 Summary In first approximation, lexing works like this 􏰉 WriteanREforthelexemesofeachtokenclass,e.g. Number = [0-9]+, Keywords = ... 􏰉 ConstructabigRE,matchingalllexemesforalltokens. R = Keywords | Identifier | Number | ... 􏰉 ConstructanFSA(Mealyautomaton)forR.Letalexer generator do this work. 45/1 Error handling in lexers What if the lexer encounters a character in the input that does not match any RE defining the lexical level of the language? It’s important for good compilers to return helpful error messages (not all compilers do this alas). There’s a neat way using regular expressions, longest match and priorities can also be used for error handling in lexers. Use a RE that matches any character in the alphabet. Give this RE the lowest priority. Because it matches any character it will also always be a shortest possible match. It catches anything that is not allowed by all previous REs. The output associated with this RE can be used for error messages. 46/1 Conclusion Lexer: takes a program as string, returns a list of tokens. The point of lexing is to have a ROUGH classification of the input program that enables the next stage (parsing) to determine of the program is syntactically well-formed, and to construct the AST. Regular expressions and FSAs are convenient tools for implementation of lexers. 47/1 The material in the textbooks 􏰉 DragonBook:Chapter2.6,Chapter3. 􏰉 Appel,Palsberg:Chapter2. 􏰉 “Engineeringacompiler”:Chapter2:especiallysections 2.1 to 2.5. 48/1