CS计算机代考程序代写 ada python Fortran flex Haskell algorithm compiler Programming Language Syntax

Programming Language Syntax
Chapter 2

Regular Expressions
§ Token: a shortest string of characters with meaning
§ Tokens – specified by regular expressions
§ An alphabet S is any finite nonempty set § Examples:
§ English: {a, b, …, z},
§ binary: {0, 1}
§ {a, b, …, z, 0, 1, …, 9, ⋅, |, *, e}
§ The set of all finite strings over S is denoted S* § The empty string: e ∈ S* (has zero characters)
2

Regular Expressions
§Regular expressions
§ Regular expressions over an alphabet S are all strings obtained as follows:
§ e is a regular expression
§ any character a ∈ S is a regular expression
§ For reg. exp. a, b, the following are reg. exp.:
§ a⋅b – concatenation (‘⋅’ omitted: ab)
§ a | b – union (‘|’ = or) (sometimes denoted a + b) § a* – Kleene star (0 or more repetitions)
§ a+ = a a* (1 or more repetitions)
3

Regular Expressions
§ Example: Signed integers:
sign_int → (+|-| e )(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*
§ Example: Numerical constants: number → integer | real
integer → digit digit*
real → integer exponent | decimal ( exponent | e ) decimal → digit* ( . digit | digit . ) digit* exponent → (e|E) (+|-| e ) integer
digit →0|1|2|3|4|5|6|7|8|9 § ‘→’ means “can be”
§ Precedence order: ‘*’ > ‘⋅’ > ‘|’
4

Regular Expressions
§ Other applications:
§ grep family of tools in Unix § many editors
§ scripting languages:
§ Perl
§ Python § Ruby § awk § sed
5

Formatting issues
§ Upper vs. lower case
§ distinct in some languages: C, Python, Perl § same in others: Fortran, Lisp, Ada
§ Identifiers: letters, digits, underscore (most languages) § camel case: someIdentifierName
§ underscore: some_identifier_name
§ Unicode
§ non-Latin characters have become important
§ White spaces
§ usually ignored
§ separate statements: Python, Haskell, Go, Swift § indentation important: Python, Haskell
6

Context-Free Grammars
Context Free Grammar (CFG)
§ CFG consists of:
§ A set of terminals, T
§ A set of non-terminals, N
§ Astartsymbol,S∈N
§ Asetofproductions;subsetofN ́(N∪T)*
§ Example: Balanced parentheses:
S®e S ® SS S ® (S)
7

Context-Free Grammars
§ Example: CFG for arithmetic expressions:
expr →id|number|-expr |(expr)| expr op expr op →+|-|*|/|
§ Derivation: start with S, continue with productions § replace LHS nonterminal by the RHS
§ Example: generate the string: slope * x + intercept
expr ⟹ expr op expr ⟹ expropid
(S = expr) (‘⟹’=“derives”)
⟹ expr+id
⟹ expropexpr+id
⟹ expr opid+id
⟹ expr*id+id
⟹ id * id+ id
(slope) (x) (intercept)
§ Sentential form: any string along the way
8

Context-Free Grammars
§ Right-most derivation: the rightmost nonterminal is replaced
expr ⟹ expr op expr ⟹ expr op id
⟹ expr + id
⟹ expr op expr + id ⟹ expr opid+id ⟹ expr*id+id ⟹ id * id + id
§ Left-most derivation: the leftmost nonterminal is replaced expr ⟹ expr op expr
⟹ expr op expr op expr ⟹ id op expr op expr ⟹ id * expr op expr ⟹ id * id op expr ⟹ id * id + expr ⟹ id * id + id
9

Context-Free Grammars
Parse Tree
§ Represents a derivation graphically § Example: Parse tree for the string:
slope * x + intercept
10

Context-Free Grammars
§ Different parse tree for: slope * x + intercept
§ Tree allowed by the grammar but incorrect for the expression
§ Ambiguous grammar: two different parse trees for one string § Ambiguity is a problem for parsers
§ We want unambiguous grammars
11

Context-Free Grammars
§Parse tree for:3 + 4 * 5 § Precedence rules
13

Context-Free Grammars
§Parse tree for:10 – 4 – 3 § Left-associativity rules
14

Scanning
Scanning = Lexical Analysis
§ tokenizing source
§ removing comments
§ saving text of identifiers, numbers, strings
§ saving source locations (file, line, column) for error messages
15

Scanning
§ Example: simple calculator language
assign → :=
plus → +
minus → –
times → *
div → /
lparen → (
rparen → )
id → letter ( letter | digit )*
number → digit digit* | digit* ( . digit | digit . ) digit* comment → /* ( non-* | * non-/ )* *+/
(Algol style; C has ‘=’)
(except for read and write) |//(non-newline)* newline
16

Scanning
Ad-hoc scanner § Longest possible
token extracted § White spaces
are delimiters
17

Scanning
§ Structured scanner
§ DFA – Deterministic Finite Automaton
§ Separate final state for each token category
18

Scanning
§DFA
§ Built automatically from regular expressions § Tools: lex, flex, scagen
§ Difficult to build directly
§ build first an NFA – Nondeterministic FA § convert to DFA
§ minimize DFA (smallest number of states)
19

Scanning
§ Reg.exp. to NFA
§ Follows the structural definition of regular expressions
20

Scanning
§ Reg.exp. to NFA
§ Example:
d* ( . d | d . ) d*
21

Scanning
§ NFA to DFA
§ Example:
d* ( . d | d . ) d*
22

Scanning
§ DFA minimization §Example: d* (.d | d.) d*
23

Scanning
§ Scanners are built three ways: § ad-hoc:
§ fastest, most compact code § semi-mechanical pure DFA
§ nested case statements § table-driven DFA
§ automatically-generated scanners
§ “Longest-possible token” rule
§ return only when next character cannot continue current token § the next character needs to be saved for the next token
§ Keywords
§ DFA would need many states to identify
§ Better treat keywords as exceptions to the identifier rule
24

Scanning
§ Nested case statement DFA
state := 1 loop
read cur_char case state of
1: case cur_char of ‘ ‘, ‘\t’, ‘\n’: …
‘a‘ … ‘z’: … ‘0‘ … ‘9’: … ‘>’: …
…
2: case cur_char of
… …
n: case cur_char of …
25

Scanning
§Look-ahead
§ May need to peek at more than one character
§ look-ahead – characters necessary to decide
§ Example: Pascal
§ have 3 so far and see ‘.’ § 3.14 or 3..5 may follow
§ Example: Fortran
§ arbitrarily long look-ahead § DO 5 I = 1,25
§ execute statements up to 5 for I from 1 to 25 § DO 5 I = 1.25
§ assign 1.25 to the variable DO5I
§ NASA’s Mariner 1 may have been lost due to ‘.’ i.o ‘,’ § Fortran 77 has better syntax: DO 5,I = 1,25
26

Scanning
§Table-driven scanning
(continued on next slide)
27

Scanning
Table-driven scanning (cont’d)
28

Scanning
§ Scanner table used by previous code
§ state 17: white spaces; state 18: comments
§ scan_tab: entire table but last column § token_tab: last column
§ keyword_tab = {read, write}
29

Scanning
§Lexical errors
§ Very few – most strings correspond to some token
§ Should recover to enable the compiler detect more errors
§ throw away the current, invalid, token
§ skip forward to the next possible beginning of a new token
§ restart the scanning algorithm
§ count on the error-recovery mechanism of the parser to cope with a syntactically invalid sequence of tokens
30

Parsing
§Parser
§ in charge of the entire compilation process § Syntax-directed translation
§ calls the scanner to obtain tokens
§ assembles the tokens into a syntax tree
§ passes the tree to the later phases of the compiler
§ semantic analysis § code generation
§ code improvement
§ a parser is a language recognizer
§ context-free grammar is a language generator
31

Parsing
§Context-free language recognition § Earley, Cocke-Younger-Kasami alg’s
§ O(n3) time
§ too slow
§ There are classes of grammars with O(n) parsers: § LL: ‘Left-to-right, Leftmost derivation’.
§ LR: ‘Left-to-right, Rightmost derivationʼ
32

Parsing
§Top-down vs. Bottom-up
§ Top-down
§ predict based on next
token
§ Bottom-up
§ reduce right-hand side
§ Example:
A, B, C;
33

Parsing
§ Bottom-up
§ better
grammar
§ cannot be parsed top- down
§ Example:
A, B, C;
34

Parsing
§LL(k), LR(k)
§ k = no. tokens of look-ahead required to parse
§ almost all real compilers use LL(1), LR(1) § LR(0) – prefix property:
§ no valid string is a prefix of another valid string
35

LL Parsing
§ Recursive descent parser
§ one subroutine for each
nonterminal
§ Example: read A
read B
sum := A + B
write sum
write sum / 2
§ Continued on the next slide
37

LL Parsing
38

LL Parsing
§ Parse tree
39

LL Parsing
§Table-driven LL parsing:
§ repeatedly look up action in 2D table based on: § current leftmost non-terminal and
§ current input token
§ actions:
§ (1) match a terminal
§ (2) predict a production
§ (3) announce a syntax error
§ Stack: containing the expected symbols
§ initially contains the starting symbol
§ when predicting a production: push the right-hand side in reverse order
40

§ Table- driven
LL parsing:
41
LL Parsing

LL Parsing
§ LL(1) parse_tab for parsing for calculator language
§ each entry is the predicted production; ‘-’ means error § productions are numbered in order 1..19
§ prod_tab (not shown) gives the right-hand sides
42

§ Example: LL Parsing
read A
read B
sum := A + B
write sum
write sum / 2
43

LL Parsing
44

LL Parsing
§How to build the table:
§FIRST(α) – tokens that can start an α §FOLLOW(A) – tokens that can come after an A
EPS(α) ≡ if α ⟹* ε then true else false FIRST(α) ≡ {c | α⟹* cβ}
FOLLOW(A) ≡ {c | S ⟹+ αAcβ }
PREDICT(A → α) ≡ FIRST(α) ∪
if EPS(α) then FOLLOW(A) else ∅
§ If a token belongs to the predict set of more than one production with the same left-hand side, then the grammar is not LL(1)
§ Compute: pass over the grammar until nothing changes
§ Algorithm and examples on the next slides
45

LL Parsing
§ Constructing EPS, FIRST, FOLLOW, PREDICT
46

LL Parsing
§ Algorithm for constructing EPS, FIRST, FOLLOW, PREDICT (Continued on the next slide)
47

LL Parsing
§ Algorithm for constructing EPS, FIRST, FOLLOW, PREDICT
48

LL Parsing
§ Example: the sets EPS, FIRST, FOLLOW, PREDICT
49

LL Parsing
§ Problems trying to make a grammar LL(1) § left recursion: A ⟹+ Aα
§ example – cannot be parsed top-down
id_list → id_list_prefix; id_list_prefix → id_list_prefix , id id_list_prefix → id
§ solved by left-recursion elimination
id_list → id id_list_tail id_list_tail → , id id_list_tail id_list_tail → ;
§ General left-recursion elimination:
A → Aα1 | Aα2 | … | Aαn | 𝛽1 | 𝛽2 | … | 𝛽m
replaced by:
A → 𝛽1B| 𝛽2B| … | 𝛽mB
B → α1B| α2B| … | αnB | 𝜀
50

LL Parsing
§ Problems trying to make a grammar LL(1) § common prefixes
§ example
stmt → id := expr
stmt → id ( argument_list ) § solved by left-factoring
stmt → id stmt_list_tail stmt_list_tail → := expr stmt_list_tail → ( argument_list )
§ Note: Eliminating left recursion and common prefixes does NOT make a grammar LL; there are infinitely many non-LL languages, and the automatic transformations work on them just fine
51

LL Parsing
§ Problems trying to make a grammar LL(1)
§ the dangling else problem
§ prevents grammars from being LL(k) for any k § Example: ambiguous (Pascal)
stmt → if cond then_clause else_clause | other_stmt then_clause → then stmt
else_clause → else stmt | ε
if C1 then if C2 then S1 else S2
52

LL Parsing
§ Dangling else problem
§ Solution: unambiguous grammar
§ can be parsed bottom-up but not top-down
§ there is no top-down grammar stmt → balanced_stmt | unbalanced_stmt
balanced_stmt → if cond then balanced_stmt else balanced_stmt | other_stmt
unbalanced_stmt → if cond then stmt
| if cond then balanced_stmt else unbalanced_stmt
53

LL Parsing
§ Dangling else problem
§ Another solution – end-markers
stmt → IF cond then_clause else_clause END | other_stmt then_clause → THEN stmt_list
else_clause → ELSE stmt_list | ε
§ Modula-2, for example, one says:
if A = B then
if C = D then E := F end
else
G := H
end
§ Ada: end if
§ other languages: fi
54

LL Parsing
§ Problem with end markers: they tend to bunch up
if A = B then …
else if A = C then …
else if A = D then …
else if A = E then …
else …
end end end end
§ To avoid this: elsif if A = B then …
elsif A = C then …
elsif A = D then …
elsif A = E then …
else …
end
55

LR Parsing
§LR parsers
§ maintain a forest of subtrees of the parse tree
§ join trees together when recognizing a RHS
§ keeps the roots of subtrees in a stack
§ shift: tokens from scanner into the stack
§ reduce: when recognizing a RHS, pop it, push LHS § discovers a right-most derivation in reverse
Stack contents (roots of partial trees) ε
id (A)
id (A),
Remaining input
A, B, C;
, B, C;
B, C;
, C;
C; ;
id (A),
id (A),
id (A),
id (A), id (B), id (C) ;
id (A), id (B) , id (C) id_list_tail id (A) , id (B) id_list_tail
id (A) id_list_tail id_list
id (B)
id (B),
id (B),
id (C)
56

LR Parsing
§ Example: LR(1) grammar for calculator language
1. program → stmt_list $$
2. stmt_list → stmt_list stmt
3. stmt_list → stmt
4. stmt → id := expr
5. stmt → read id
6. stmt → write expr
7. expr → term
8. expr → expr add_op term
9. term → factor
10. term → term mult_op factor
11. factor → ( expr )
12. factor → id
13. factor → number
14. add_op → +
15. add_op → –
16. mult_op → *
17. mult_op → /
§ Compare with previous LL(1)
§ left recursive prod. is better § keeps operands together
program → stmt list $$
stmt_list → stmt stmt_list | ε
stmt → id := expr | read id | write expr expr → term term_tail
term_tail → add op term term_tail | ε
term → factor fact_tail
fact_tail → mult_op fact fact_tail | ε
factor →(expr)|id|number
add_op →+|-
mult_op → * | /
57

LR Parsing
§LR parser
§ recognizes right-hand sides of productions
§ keep track of productions we might be in the middle of § and where: represent the location in an RHS by a ‘⦁’
§ Example:
read A
read B
sum := A + B
write sum
write sum / 2
58

LR Parsing
§ start with:
program → ⦁ stmt_list $$ – this is called an LR-item
§ ‘⦁’ in front of stmt_list means we may be about to see the yield of stmt_list, that is, we could also be at the beginning of a production with stmt_list on LHS:
stmt_list → ⦁ stmt_list stmt stmt_list → ⦁ stmt
§ similarly, we need to include also:
stmt → stmt → stmt →
⦁ id := expr ⦁ read id
⦁ write expr
§ Only terminals follow, so we stop
59

LR Parsing
§ the state we have obtained is: program → ⦁ stmt_list $$
stmt_list → ⦁ stmt_list stmt
stmt_list → ⦁ stmt
stmt→ ⦁id:=expr … stmt→ ⦁readid …
stmt → ⦁ write expr … )
§ next token: read – the next state is:
stmt → read ⦁ id (empty closure)
§ next token: A – the next state is: stmt → read id ⦁
§ ‘⦁’ at the end means we can reduce § what is the new state?
(the basis)
(state 0)
(closure … …
(state 1) (state 1’)
60

LR Parsing
§ replace read id with stmt stmt_list → ⦁ stmt becomes stmt_list → stmt ⦁
§ we reduce again: replace stmt with stmt_list § this means shifting a stmt_list in state 0:
(state 0’) (state 2)
program → stmt_list ⦁ $$
stmt_list → stmt_list ⦁ stmt stmt → ⦁ id := expr stmt→ ⦁readid
stmt → ⦁ write expr
§Complete states on next slides
(basis …
… ) (closure …
…
… )
61

LR Parsing
62

LR Parsing
§ LL(1) parser: decides using nonterminal + token
§ LR(1) parser: decides using state + token § CFSM: Characteristic Finite State Machine § Almost always table-driven
66

LR Parsing
§ Parse table parse_tab § shift (s) followed by state
§ reduce (r), shift + reduce (b) followed by production
67

LR Parsing
§ Algorithm
§ uses the parse_tab (previous slide) and prod_tab (not shown)
§ example after algorithm for:
read A
read B
sum := A + B
write sum
write sum / 2
68

LR Parsing
69

LR Parsing
70

LR Parsing
71

LR Parsing
§Shift/reduce conflict
§ two items in a state:
§ one with ‘⦁’ in front of terminal (shift) § one with ‘⦁’ at the end (reduce)
§ SLR (simple LR)
§ conflict can be resolved using FIRST and FOLLOW
§ Example: state 6
§ stmt → write expr ⦁
§ expr → expr ⦁ add_op term
§ FIRST(add_op) ∩ FOLLOW(stmt) = ∅
72

LL(1) vs SLR(1)
§ LL(1) – no ambiguity, no left recursion, and, for any productions A → u | v:
§ FIRST(u) ∩ FIRST(v) = ∅
§ at most one of u and v can derive the empty string ε § if v =>* ε, then FIRST(u) ∩ FOLLOW (A) = ∅
§ SLR(1)
§ No shift/reduce conflict: cannot have in the same state:
A → u ⦁ xv, B → w ⦁ , with FIRST(x) ∩ FOLLOW(B) ≠ ∅ § No reduce/reduce conflict: cannot have in the same state:
A → u ⦁ , B → v ⦁ , with FOLLOW(A) ∩ FOLLOW(B) ≠ ∅
73

Related Posts