Compilers (contd.)
What we have said so far about how compilers and interpreters work is a bit too simplistic
Every practical compiler and interpreter has an initial component that takes the input program, checks that it matches the syntactic requirements of the language
Copyright By PowCoder代写 加微信 powcoder
Assuming that the input program does meet the requirements, converts it into an internal representation, typically a parse tree (The conversion might be done simultaneously with the checking)
The next component, in the case of a compiler, would be code generation, i.e., produce the m.l. program equivalent to the input program
In the case of an interpreter, the next component would bethe executor (which will execute the input program)
In both cases, the compiler/interpreter works with the parse tree, not the original source program
Typical structure of an interpreter:
Parser; Pretty Printer; Executor
Typical structure of a compiler:
Parser; Pretty Printer; Code generator
The second and third stages use the parse tree that has been produced by the first stage
Before we look at any of these, we will talk about how the syntax of the language is specified formally
The structure of the Parser will be driven by the syntax
That is why we will first talk about how syntax is (typically) specified
That, in turn, will be divided into two pieces:
Regular Expressions and Context Free Grammars
We will start with Regular Expressions
Structure of Compilers, Interpreters
The next several slides based on pages 51–58 of the Scott book
A formal grammar such as an RE is a way to specify the set of legal strings of the language
An RE is one of the following:
A given character such as C
Concatenation of two REs: such as “R1 R2”
(the quote signs are not part of the new RE)
Alternation denoted with “|”
Repetition denoted by *
The repetition means zero or more occurrences
Note: This is slightly less general than used in theory of computing courses but is good enough for PL discussions
Regular Expressions
Knowledge Check
Can the set of all decimal digits be described by an RE?
Can the set of all unsigned integers be described by an RE?
Can the set of all unsigned integers that have no leading zeroes be described by an RE?
Can the set of all integers that may be signed or unsigned be described by an RE?
The Tokenizer for an RE
Given a RE, we can create a program, as an FSA that recognizes all the strings in the language defined by the RE and rejects those that do not belong to the language
That is a Tokenizer
Although a Tokenizer can be hacked together for a given RE, we will not do that; we will build it as a DFA (deterministic finite state automaton); (This is what is done in practice)
Given an RE, the Tokenizer libraries in Java, Python, etc. do this automatically
But we will do it by hand so you will see what those libraries do
For the Interpreter project for the course, we will define the RE for the set of legal tokens in a simple language called Core
The first part of the project will require you to write the Tokenizer for Core as an FSA,
What is a DFA?
A DFA, F, consists of:
A finite (non-empty) alphabet of letters, a1, …, am
A finite (non-empty) set of states, s0, s1, …, sn
One of those states, say, s0 is the starting state
One (or more) of the states is a final (or accepting) state
A finite set of transitions (si, aj, sk) which means:
If F is in state si and the next letter in the input string is aj then:
F will consume aj and move to the state sk
F accepts st if-and-only-if starting in s0, F goes through its transitions, consuming the characters in st, so that when it has consumed all of st, F is in an accepting state
F recognizes the set of strings S if [[F accepts st] ↔ [st S]
Non-deterministic FAs (NFAs)
F is a non-deterministic (NFA) if:
If for any si, aj, there is more than one possible transition, i.e., we have sk, sk’ such that (si, aj, sk) and (si, aj, sk’) are both transitions
In an NFA, there may also be ԑ transitions
If NF is an NFA, the language accepted by NF consists of all the strings st for which any possible path that NF might take when consuming st leads to an accepting state
It is as if NF tries all possible paths for st and if it finds even one that leads to a final state, then NF accepts st
NFAs ≡ DFAs!!
Surprising fact: If NF is an NFA, there is a DFA F that accepts the same language as NF
Construction:
Take the set of states SF of F as the powerset of the set of states SNF of NF
Take, as accepting states of F, only those with one or more accepting states of NF
Define the transitions TF of F in terms of the transitions TNF of NF to consist of exactly those transitions (si,aj,sk) for which:
{ Ǝ[(sni si) ˄ (snk sk) ˄ ((sni, aj, snk) TNF) ] }
Claim (Knowledge Check): Show that the above construction ensures that the language accepted by F is the same as the one accepted by NF
Note: We will refer to accepting states also as final states
Compute ԑ-closure:
That means combine states 1, 2, 3
Powerset construction:
From the state {1,2,3} a 0-trans. will go to 2 or 4; a 1-trans. will go to 4
From the state {2,4}, a 0-trans will go to 3 which has an ԑ trans. to 2, so to either 2 or 3; a 1-trans will go to 2 or 4
From {2,3}, a 0-trans will go to 4
From 4, a 0-trans will go to 3; and a
1-trans will lead to rejection
The final result is the DFA equivalent to the original NFA
Example of NFA to DFA (from: https://en.wikipedia.org/wiki/Powerset_construction
Lesson I learned: Detailed discussions of NFAs and their equivalence to DFAs belong in CSE 3321, NOT in CSE 3341!
Slide 16 notes
How to do this in pure BNF? Using ε it is easy.
Without it, the number of productions increases quite a bit.
But using ε can cause problems for compilers. In homeworks, exams, etc. you may use it unless I say otherwise.
Relation to book: So far, mostly chapter 1; and 3.1., 3.2, 3.3; rest of chapter 3 not inclded.
We will move to chapter 4; that will lead us to the project.
A lot of this should be familiar from 321 and 625. But going over it again should make it easier to see how it relates to PLs and lang. implementations. The project also has some relation to 560.
Given alphabet , an RE over is:
A given character such as C
If R, S are two REs over , so is RS:
RS = { r s | r R s S }
i.e., RS is the set of all strings that can be obtained by concatenating an element of R & an element of S
If R, S are REs over , so is R S
written in PL discussions as R | S
If R is an RE over , so is R*
Recall: Regular Expressions
Slide 16 notes
How to do this in pure BNF? Using ε it is easy.
Without it, the number of productions increases quite a bit.
But using ε can cause problems for compilers. In homeworks, exams, etc. you may use it unless I say otherwise.
Relation to book: So far, mostly chapter 1; and 3.1., 3.2, 3.3; rest of chapter 3 not inclded.
We will move to chapter 4; that will lead us to the project.
A lot of this should be familiar from 321 and 625. But going over it again should make it easier to see how it relates to PLs and lang. implementations. The project also has some relation to 560.
REs and FAs
If R is a given RE, we can construct FR, a DFA that accepts all the strings generated by R and only those strings
Construct NFR, an NFA, corresponding to R
Convert NFR into FR by following our previous construction taking the powerset of the states of NFR as the set of states of FR
In practice (but not in our Tokenizer project): Minimize the number of states required in FR
Key question: Given R, how do we construct NFR
From RE to NFA
Knowledge check: For each item below, show it is appropriate
We will assume that each NFA will have a single final state
C: NFA has a starting state, a final state and a transition from the starting state to the final state labeled C
R1 | R2: Define new NFA N with Ԑ transitions from starting state of N to the starting states of N1, N2 and Ԑ transitions from the final states of N1, N2 to the final state of N
R1 R2: Define new NFA N with Ԑ transition from starting state of N to starting state of N1, Ԑ transition from final state of N1 to starting state of N2 and Ԑ transition from final state of N2 to final state of N
R*: Define new NFA N* by simply adding Ԑ transition from final state of N to starting state of N
Reserved words: program, begin, end, int, if, then, else,
while, loop, read, write
Operators/special symbols:
; , = ! [ ] && || ( ) + – * != == < > <= >=
Integers (unsigned)
Identifiers: Start with uppercase letter, followed by zero or
more uppercase letters and zero or more digits
White space (WS): blank, tab, carriage return, linefeed
WS requirement: Between each pair of tokens, there must
be one or more ws unless one of the tokens is an op/sp.
symbol in which case ws is optional
Tokens for Core
Homework (to be posted on Carmen and Piazza):
Define the RE corresponding to the Core’s tokens
Design an FA corresponding to your RE
Tokenizer Project (to be posted on Carmen and Piazza): Implement your FA and the methods listed on the next slide
But you do not have to strictly follow the approach we have described
What you will implement will, in fact, be a scanner that will convert the input string into a stream of Core tokens
Important: Practical tokenizers/scanners are greedy
E.g.: “====“, is parsed as “==“ followed by “==“
not as, say, “=“ followed by “==“followed by “=“,
or “==“ followed by “=“followed by “=“, etc.
Your Tokenizer should do the same
Both the homework and project will be posted on Carmen as well as Homework, Project
getToken():
returns (info about) current token;
Repeated calls to getToken() return same token.
skipToken():
skips current token;
next token becomes current token;
so next call to getToken() will return new token.
returns the value of the current (integer) token;
(what if current token is not an integer? — error!)
returns the name (string) of the current (id) token.
(what if current token is not an id? — error!)
Tokenizer methods …
Easier to think in terms of FAs rather than in terms of REs
An FA, by definition, has a finite set of states
Knowledge check:
Show that we cannot define an FA/RE that corresponds to an integer that is enclosed inside a balanced sequence of parentheses
Context-free grammars (aka BNF grammars) are designed to handle such requirements
Push-down automata (PDAs) are equivalent to CFGs in the same way that FAs are equivalent to REs but we won’t discuss PDAs
Limitations of REs
/docProps/thumbnail.jpeg
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com