程序代写代做代考 compiler algorithm flex Programming Language Syntax: Scanning and Parsing

Programming Language Syntax: Scanning and Parsing
Read: Scott, Chapter 2.2 and 2.3.1

Lecture Outline
n Overview of scanning
n Overview of top-down and bottom-up parsing
n Top-down parsing
n Recursive descent n LL(1) parsing tables
Programming Languages CSCI 4430, A. Milanova 2

Scanning
n Scanner groups characters into tokens
n Scanner simplifies the job of the parser
position = initial + rate * 60;
Scanner
id = id + id * num ;
Parser
n Scanner is essentially a Finite Automaton
n Regular expressions specify the syntax of tokens
n Scanner recognizes the tokens in the program Programming Languages CSCI 4430, A. Milanova 3

Question
n Why most programming languages disallow nested multi-line comments?
n Comments are usually handled by the scanner, which essentially is a DFA. Handling multiline comments would require recognizing (/*)n(*/)n which is beyond the power of a DFA.
Programming Languages CSCI 4430, A. Milanova 4

Calculator Language
n Tokens times ® *
plus ® +
id ® letter ( letter | digit )*
except for read and write which are keywords (keywords are tokens as well)
Programming Languages CSCI 4430, A. Milanova 5

Ad-hoc (By hand) Scanner
skip any initial white space (space, tab, newline) if current_char in { +, * }
return corresponding single-character token (plus or times)
if current_char is a letter
read any additional letters and digits
check to see if the resulting string is read or write
if so, then return the corresponding token else return id
else announce an ERROR
Programming Languages CSCI 4430, A. Milanova 6

The Scanner as a DFA
Start 1
+
letter
4
space, tab, newline
*
23
letter, digit
Programming Languages CSCI 4430, A. Milanova 7

Building a Scanner
n Scanners are (usually) automatically generated from regular expressions:
Step 1: From a Regular Expression to an NFA
Step 2: From an NFA to a DFA Step 3: Minimizing the DFA
n lex/flex utilities generate scanner code
n Scanner code explicitly captures the states
and transitions of the DFA
Programming Languages CSCI 4430, A. Milanova 8

Table-Driven Scanning

cur_state := 1 loop
read cur_char
case scan_tab[cur_char, cur_state].action of
move: …
cur_state = scan_tab[cur_char, cur_state].new_state recognize: // emits the token
tok = token_tab[current_state] unread cur_char — push back char exit loop
error:

Programming Languages CSCI 4430, A. Milanova 9

Table-Driven Scanning
space,tab,newline
* + digit letter other
15 23-4-
2- —– times
3- —– plus
4- –44- id
55 —– space
Sketch of table: scan_tab and token_tab. See Scott for details.
Programming Languages CSCI 4430, A. Milanova 10

Today’s Lecture Outline
n Overview of scanning
n Overview of top-down and bottom-up parsing
n Top-down parsing
n Recursive descent n LL(1) parsing tables
Programming Languages CSCI 4430, A. Milanova 11

A Simple Calculator Language
asst_stmt ® id = expr // asst_stmt is the start symbol expr ® expr+expr | expr*expr |id
Character stream: position = initial + rate * time Token stream:id = id + id * id
Scanner
Parser
Parse tree:
asst_stmt id = expr
CSCI 4430, A. Milanova
(Parse tree simplified to fit on slide.) 12
id
id
expr
+
id
*

A Simple Calculator Language
asst_stmt ® id = expr // asst_stmt is the start symbol expr ® expr+expr | expr*expr |id
Character stream: position + initial = rate * time Token stream: id + …
Parse tree: Token stream is ill-formed according to our grammar, parse tree construction fails, therefore Syntax error!
Most compiler errors occur in the parser.
Scanner
Parser
Programming Languages CSCI 4430, A. Milanova 13

Parsing
n For any CFG, one can build a parser that runs in O(n3)
n Well-known algorithms
n But O(n3) time is unacceptable for a parser in a compiler!
Programming Languages CSCI 4430, A. Milanova 14

Parsing
n Objective: build a parse tree for an input string of tokens from a single scan of input
n Only special subclasses of context-free grammars (LL and LR) can do this
n Two approaches
n Top-down: builds parse tree from the root to the
leaves
n Bottom-up: builds parse tree from the leaves to the top
n Both are easily automated
Programming Languages CSCI 4430, A. Milanova 15

Grammar for Comma-separated Lists
list ® id list_tail // list is the start symbol list_tail®,idlist_tail |;
Generates comma-separated lists of id’s. E.g.,id; id,id,id;
Example derivation:
list Þ id list_tail
Þ id , id list_tail Þ id , id ;
Programming Languages CSCI 4430, A. Milanova 16

list ® id list_tail list_tail®,idlist_tail |;
Top-down Parsing
n Terminals are seen in the order of appearance in the token stream
id , id , id ;
n The parse tree is constructed n From the top to the leaves
list
id
list_tail
id list_tail
, id list_tail ;
,
n Corresponds to a left-most derivation
n Look at left-most nonterminal in current sentential form, and lookahead terminal and “predict” which production to apply
17

list ® id list_tail list_tail®,idlist_tail |;
Bottom-up Parsing
n Terminals are seen in the
order of appearance in the token stream
id , id , id ;
n The parse tree is constructed
n From the leaves to the top
n A right-most derivation in reverse
list
id
list_tail
id list_tail
,
,
id list_tail ;
Programming Languages CSCI 4430, A. Milanova
18

Today’s Lecture Outline
n Overview of scanning
n Overview of top-down and bottom-up parsing
n Top-down parsing
n Recursive descent n LL(1) parsing tables
Programming Languages CSCI 4430, A. Milanova 19

Top-down Predictive Parsing
n “Predicts” production to apply based on one or more lookahead token(s)
n Predictive parsers work with LL(k) grammars
n First L stands for “left-to-right” scan of input
n Second L stands for “left-most” derivation n Parse corresponds to left-most derivation
n k stands for “need k tokens of lookahead to predict”
n We are interested in LL(1)
Programming Languages CSCI 4430, A. Milanova 20

list ® id list_tail list_tail®,idlist_tail |;
Question
n Can we always predict (i.e., for any input)
what production to applies, based on
list
one token of lookahead?
id , id , id ;
id
list_tail
id list_tail
n Yes, there is at most one choice (i.e., at most one production applies)
n This grammar is an LL(1) grammar Programming Languages CSCI 4430, A. Milanova
,
id list_tail ;
,
21

list ® list_prefix ;
list_prefix ® list_prefix , id | id
Question
n A new grammar
n What language does it generate?
n Same, comma-separated lists
n Can we predict based on one token of lookahead?
id , id , id ;
list
list_prefix ; ?
Programming Languages CSCI 4430, A. Milanova
22

Top-down Predictive Parsing
n Back to predictive parsing
n “Predicts” production to apply based on one
or more lookahead token(s)
n Parser always gets it right!
n There is no need to backtrack, undo expansion, then try a different production
n Predictive parsers work with LL(k) grammars Programming Languages CSCI 4430, A. Milanova 23

Top-down Predictive Parsing
n Expression grammar: n Not LL(1)
n Unambiguous version: n Still not LL(1). Why?
expr ® expr + expr
| expr * expr |id
expr ® expr + term | term term ® term*id|id
expr ® term term_tail term_tail®+term term_tail |ε term ® id factor_tail
factor_tail ® * id factor_tail | ε
n LL(1) version:
Programming Languages CSCI 4430, A. Milanova 24

expr ® term term_tail term_tail®+term term_tail |ε term ® id factor_tail
factor_tail ® * id factor_tail | ε
Exercise
n Draw parse tree for expression id + id * id + id
Programming Languages CSCI 4430, A. Milanova 25

Recursive Descent
n Each nonterminal has a procedure
n The right-hand-sides (rhs) for the nonterminal
form the body of its procedure
n lookahead()
n Peeks at current token in input stream
n match(t)
n if lookahead() == t then consume current token,
else PARSE_ERROR
Programming Languages CSCI 4430, A. Milanova 26

Recursive Descent
start()
case lookahead() of
id: expr(); match($$) otherwise PARSE_ERROR
expr()
case lookahead() of
id: term(); term_tail() otherwise PARSE_ERROR
term_tail()
$$: skip
otherwise: PARSE_ERROR
($$ – end-of-input marker)
start ® expr $$
expr®termterm_tail term_tail®+term term_tail|ε term ® id factor_tail factor_tail ® * id factor_tail | ε
Predicting production term_tail ® + term term_tail +: match(‘+’); term(); term_tail()
case lookahead() of
Predicting epsilon production term_tail ® ε 27

Recursive Descent
start ® expr $$
expr®termterm_tail term_tail®+term term_tail|ε term ® id factor_tail factor_tail ® * id factor_tail | ε
term()
case lookahead() of
id: match(‘id’); factor_tail() otherwise: PARSE_ERROR
factor_tail()
Predicting production factor_tail ® *id factor_tail
case lookahead() of
*: match(‘*’); match(‘id’); factor_tail();
+,$$: skip
otherwise PARSE_ERROR
Predicting production factor_tail ® ε
Programming Languages CSCI 4430, A. Milanova 28

LL(1) Parsing Table
n But how does the parser “predict”?
n E.g., how does the parser know to expand a
factor_tail by factor_tail ® ε on + and $$?
n It uses the LL(1) parsing table
n One dimension is nonterminal to expand
n Other dimension is lookahead token
n We are interested in one token of lookahead
n Entry “nonterminal on token” contains the production to apply or contains nothing
Programming Languages CSCI 4430, A. Milanova 29

LL(1) Parsing Table
n One dimension is nonterminal to expand n Other dimension is lookahead token
a
n E.g., entry “nonterminal A on terminal a” contains production A ® α
Meaning: when parser is at nonterminal A and lookahead token is a, then parser expands A
by production A ® α
A
α
30

LL(1) Parsing Table
start ® expr $$
expr®termterm_tail term_tail®+term term_tail|ε term ® id factor_tail factor_tail ® * id factor_tail | ε
start
expr
term_tail
term
factor_tail
id
expr $$
term term_tail

id factor_tail

+


+ term term_tail

ε
*




* id factor_tail


ε

ε
Programming Languages CSCI 4430, A. Milanova 31
$$

Question
• Fill in the LL(1) parsing table for the comma- separated list grammar
start ® list $$
list ® id list_tail list_tail®,idlist_tail |;
start
list
list_tail
id
list $$
id list_tail

,


, id list_tail


;



Programming Languages CSCI 4430, A. Milanova 32
;
$$

The End
Programming Languages CSCI 4430, A. Milanova 33