程序代写代做代考 C algorithm compiler Java Programming Language Syntax

Programming Language Syntax
Read: Scott, Chapter 2.1

Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 2

Last Class: Compiler
character stream token stream
parse tree
modified intermediate form
target language (assember)
modified
Scanner
Parser
Semantic analysis and intermediate code generation
Code generation
abstract syntax tree and intermediate form
Machine-dependent code improvement
Machine-independent code improvement
Programming Languages CSCI 4430, A. Milanova
3
target language

Syntax and Semantics
n Syntax is the form or structure of expressions, statements, and program units of a given language
n Syntax of a Java while statement: n while(boolean_expr)statement
n Semantics is the meaning of expressions, statements and program units of a given language
n Semanticsofwhile(boolean_expr)statement
n Execute statement repeatedly (0 or more times) as long as
boolean_expr evaluates to true
Programming Languages CSCI 4430, A. Milanova 4

Formal Languages
n Theoretical foundations – Automata theory n A language is a set of strings (also called
sentences) over a finite alphabet
n A generator is a set of rules that generate the strings in the language
n A recognizer reads input strings and determines whether they belong to the language
n Languages are characterized by the complexity of generation/recognition rules
n E.g., regular languages
n E.g., context-free languages
Programming Languages CSCI 4430, A. Milanova 5

Question
n What are the classes of formal languages?
n The Chomsky hierarchy: n Regular languages
n Context-free languages
n Context-sensitive languages
n Recursively enumerable languages
Programming Languages CSCI 4430, A. Milanova 6

Formal Languages
n Generators and recognizers become more complex as languages become more complex
n Regular languages
n Describe PL tokens (e.g., keywords, identifiers, numeric literals) n Generated by Regular Expressions
n Recognized by a Finite Automaton (scanner)
n Context-free languages
n Describe more complex PL constructs (e.g., expressions and
statements)
n Generated by a Context-free Grammar
n Recognized by a Push-down Automaton (parser)
n Even more complex constructs
Programming Languages CSCI 4430, A. Milanova 7

Formal Languages
n Main application of formal languages: enable proof of relative difficulty of computational problems
n Our focus: formal languages provide the formalism for describing PL constructs
n A compelling application of formal languages!
n Building a scanner
n Building a parser
n Central issue: build efficient, linear-time parsers Programming Languages CSCI 4430, A. Milanova 8

A Single Pass
position = initial + rate * 60;
• Scanner emits next token
• Parser consumes the token
and continues building the parse tree (typically bottom up)
Scanner
id = …
id =
Programming Languages CSCI 4430, A. Milanova … 9
assign
Parser
expr

Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 10

Regular Expressions
n Simplest structure
n Formalism to describe the simplest programming language constructs, the tokens
n each symbols (e.g., “+”, “-”) is a token
n an identifier (e.g., position, rate, initial) is a token n a numeric constant (e.g., 59) is a token
n etc.
n Recognized by a finite automaton
Programming Languages CSCI 4430, A. Milanova 11

Regular Expressions
n A Regular Expression is one of the following: n A character, e.g., a
n The empty string, denoted by e
n Two regular expressions next to each other,
R1 R2
n Meaning: R1 R2 generates the language of strings that are
made up of any string generated by R1, followed by any string generated by R2
n Two regular expressions separated by |, R1 | R2
n Meaning: R1 | R2 generates the language that is the union of
the strings generated by R1 with the strings generated by R2 Programming Languages CSCI 4430, A. Milanova 12

Question
n What is the language defined by reg. exp. (a|b)(aa|bb) ?
n We saw concatenation and alternation. What operation is still missing?
Programming Language CSCI 4430, A. Milanova 13

Regular Expressions
n A Regular Expression is one of the following: n A character, e.g., a
n The empty string, denoted by e
n R1 R2
n R1 | R2
n Regular expression followed by a Kleene star, R*
n Meaning: the concatenation of zero or more strings generated by R
n E.g., a* generates {e, a, aa, aaa, … }
n E.g., (a|b)* generates all strings of a’s and b’s Programming Languages CSCI 4430, A. Milanova 14

Regular Expressions
n Precedence
n Kleene * has highest precedence
n Followed by concatenation n Followed by alternation |
n E.g., a b | c is (a b) | c not a (b | c) n Generates {ab,c} not {ab,ac}
n E.g., a b* generates {a,ab,abb,…} not {ε, ab, abab, ababab,…}
Programming Languages CSCI 4430, A. Milanova 15

Question
n What is the language defined by regular expression (0 | 1)* 1 ?
nWhatabout 0*(10*10*)* ?
Programming Languages CSCI 4430, A. Milanova 16

Regular Expressions in Programming Languages
n Describe tokens n Let
letter ® a|b|c| … |z
digit ® 1|2|3|4|5|6|7|8|9|0
n Which token is this? 1. letter ( letter | digit )*
2. digit digit *
3. digit * . digit digit *
? ?
?
Programming Languages CSCI 4430, A. Milanova
17

Regular Expressions in Programming Languages
n Which token is this:
number ® integer | real
real ® integer exponent | decimal ( exponent | ε ) decimal ® digit* ( . digit | digit . ) digit*
exponent ® (e|E) (+|-| ε ) integer
integer ® digit digit*
digit ® 1|2|3|4|5|6|7|8|9|0
Programming Languages CSCI 4430, A. Milanova 18

Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 19

Context-Free Grammars
n Unfortunately, regular languages cannot specify all constructs in programming
n E.g., can we write a regular expression that specifies valid arithmetic expressions?
n id * ( id + id * ( number – id ) ) n Among other things, we need to ensure that
parentheses are matched!
n Answer is no. We need context-free languages and context-free grammars!
Programming Languages CSCI 4430, A. Milanova 20

Grammar
n A grammar is a formalism to describe the strings of a (formal) language
n A grammar consists of a set of terminals, set of nonterminals, a set of productions, and a start symbol
n Terminals are the characters in the alphabet
n Nonterminals represent language constructs
n Productions are rules for forming syntactically correct constructs
n Start symbol tells where to start applying the rules Programming Languages CSCI 4430, A. Milanova 21

Notation
Specification of identifier:
Regular expression: letter ( letter | digit )*
BNF: ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 ::= a | b | c | … | x | y | z
::= | |
Textbook and slides: digit ® 1|2|3|4|5|6|7|8|9|0 (also BNF) letter ® a|b|c|d|…|z
id ® letter | id letter | id digit Programming Languages CSCI 4430, A. Milanova 22
Nonterminals shown in italic
Terminals shown in typewriter

Regular Grammars
n Regular grammars generate regular languages
n The rules in regular grammars are of the form:
n Each left-hand-side (lhs) has exactly one nonterminal
n Each right-hand-side (rhs) is one of the following n A single terminal symbol or
n A single nonterminal symbol or
n A nonterminal followed by a terminal
e.g.,12*|0+ S® A|B A®1|A2 B®0|B0
Programming Languages CSCI 4430, A. Milanova 23

Question
n Is this a regular grammar:
S®0A A®S1 S®ε
n No, this is a context-free grammar
n It generates 0n1n, the canonical example of a context-
free language
n rhs should be nonterminal followed by a terminal, thus,
S ® 0 A is not a valid production
Programming Languages CSCI 4430, A. Milanova 24

Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 25

Context-free Grammars (CFGs)
n Context-free grammars generate context-free languages
n Most of what we need in programming languages can be specified with CFGs
n Context-free grammars have rules of the form:
n Each left-hand-side has exactly one nonterminal
n Each right-hand-side contains an arbitrary sequence of terminals and nonterminals
n A context-free grammar e.g.0n1n ,n≥1 S ®0S1
S®01
Programming Languages CSCI 4430, A. Milanova 26

Question
n Examples of a non-context-free languages?
n E.g., anbmcndm n E.g., wcw
n E.g., anbncn
n≥1, m≥1
where w is in (0|1)* n≥1 (canonical example)
Programming Languages CSCI 4430, A. Milanova
27

Context-free Grammars
n Can be used to generate strings in the context-free language (derivation)
n Can be used to recognize well-formed strings in the context-free language (parse)
n In Programming Languages and compilers, we are concerned with two special CFGs, called LL and LR grammars
Programming Languages CSCI 4430, A. Milanova 28

Derivation
Simple context-free grammar for expressions:
expr ®id|(expr) | expr op expr op ® + | *
We can generate (derive) expressions:
expr Þ expr op expr Þ expr op id
Þ expr + id
Þ expr op expr + id Þexprop id+id Þexpr *id+id Þid *id+id
sentential form
sentence, string or yield
29

Derivation
n A derivation is the process that starts from the start symbol, and at each step, replaces a
nonterminal with the right-hand-side of a production
n E.g., expr op expr derives expr op id
We replaced the right (underlined) expr with id due to production expr ® id
n An intermediate sentence is called a sentential form
n E.g., expr op id is a sentential form
30

Derivation
n The resulting sentence is called yield
n E.g., id*id+id is the yield of our derivation
n What is a left-most derivation?
n Replaces the left-most nonterminal in the
sentential form at each step
n What is a right-most derivation?
n Replaces the right-most nonterminal in the sentential form at each step
n There are derivations that are neither left- nor
right-most
31

Question
n What kind of derivation is this: expr Þ expr op expr
Þ expr op id
Þ expr + id
Þ expr op expr + id Þexprop id+id Þexpr *id+id Þid *id+id
n A right-most derivation. At each step we replace the right-most nonterminal
Programming Languages CSCI 4430, A. Milanova 32

Question
n What kind of derivation is this: expr Þ expr op expr
Þ expr op id
Þ expr + id
Þ expr op expr + id
Þ id op expr + id Þ id op id + id Þ id * id + id
n Neither left-most nor right-most
Programming Languages CSCI 4430, A. Milanova 33

Parse
Recall our context-free grammar for expressions:
expr ®id|(expr) | expr op expr op ® + | *
n A parse is the reverse of a derivation
id*id+id
Þexpr Þ expr Þ expr Þ expr Þ expr Þ expr
*id+id op id + id op expr + id
+ id op id op expr
Þ expr Programming Languages CSCI 4430, A. Milanova
34

Parse
n A parse starts with the string of terminals, and at each step, replaces the right-hand-
side (rhs) of a production with the left-hand- side (lhs) of that production. E.g.,
… Þ expr op expr + id
Þ expr + id
Here we replaced expr op expr (the rhs of production expr ® expr op expr) with expr (the lhs of the production)
Programming Languages CSCI 4430, A. Milanova 35

Parse Tree
expr ®id|(expr) | expr op expr op ® + | *
expr Þ expr op expr Þ expr op id
Þ expr + id
Þ expr op expr + id Þexprop id+id Þexpr*id+id Þid *id+id
expr
expr op expr
expr op expr + id* id id
Internal nodes are nonterminals. Children are the rhs of a rule for that nonterminal.
Leaf nodes are terminals.
36

Ambiguity
n Ambiguity
n A grammar is ambiguous if some string can be
generated by two or more distinct parse trees
n There is no algorithm that can tell if an arbitrary context-free grammar is ambiguous
n Ambiguity arises in programming language grammars
n Arithmetic expressions
n If-then-else: the dangling else problem
n Ambiguity is bad
Programming Languages CSCI 4430, A. Milanova 37

Ambiguity
expr ®id|(expr) | expr op expr op ® + | *
nHow many parse trees forid * id + id?
Tree 1:expr expr op
+
id id+id id*id id
n Which one is “correct”?
expr
expr op expr
Tree 2:
expr
expr
op expr
*
expr op expr
38

Ambiguity
expr ®id|(expr) | expr op expr op ® + | *
nHow many parse trees forid + id + id?
Tree 1:expr expr op
+
id id+id id+id id
n Which one is “correct”?
expr
expr op expr
Tree 2:
expr
expr
op expr
+
expr op expr
39

Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 40

Expression Grammars
n Generate expressions n Arithmetic expressions
n Regular expressions n Other
n Terminals: operands, operators, and parentheses
expr ®id|(expr) | expr op expr op ® + | *
Programming Languages CSCI 4430, A. Milanova 41

Handling Ambiguity
Our ambiguous grammar, slightly simplified:
expr ®id|(expr) | expr+expr | expr*expr
n Rewrite the grammar into unambiguous one: expr ® expr + term | term
term ® term * factor | factor
factor ®id|( expr)
n Forces left associativity of + and *
n Forces higher precedence of * over +
42

Rewriting Expression Grammars: Intuition
expr ®id|(expr) | expr+expr | expr*expr n A new nonterminal, term
n expr * expr becomes term. Thus, * gets pushed down the tree, forcing higher
precedence of *
n expr + becomes expr + term. Pushes leftmost + down the tree, forcing operand to associate with + on its left
n expr ® expr + expr becomes expr ® expr + term Programming Languages CSCI 4430, A. Milanova | term 43

Rewriting Expression Grammars:
term
id
Intuition
terms in the sum
E.g., look at id + id*id*id + id + id*id
expr
+ term
id*id
id
id*id*id
expr expr + term
expr + term
Programming Languages CSCI 4430, A. Milanova
44

Rewriting Expression Grammars: Intuition
n Another new nonterminal, factor and productions:
n term ® term * factor | factor n factor ®id| ( expr )
Programming Languages CSCI 4430, A. Milanova 45

Exercise
expr ® expr × expr | expr ^ expr | id
n How many parse trees for id×id^id×id ? n No need to draw them all
n Rewrite this grammar into an equivalent unambiguous grammar where
^ has higher precedence than × ^ is right-associative
× is left-associative
Programming Languages CSCI 4430, A. Milanova 46

The End
Programming Languages CSCI 4430, A. Milanova 47