Programming Language Syntax
Read: Scott, Chapter 2.1
Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 2
Last Class: Compiler
character stream token stream
parse tree
modified intermediate form
target language (assember)
modified
Scanner
Parser
Semantic analysis and intermediate code generation
Code generation
abstract syntax tree and intermediate form
Machine-dependent code improvement
Machine-independent code improvement
Programming Languages CSCI 4430, A. Milanova
3
target language
Syntax and Semantics
n Syntax is the form or structure of expressions, statements, and program units of a given language
n Syntax of a Java while statement: n while(boolean_expr)statement
n Semantics is the meaning of expressions, statements and program units of a given language
n Semanticsofwhile(boolean_expr)statement
n Execute statement repeatedly (0 or more times) as long as
boolean_expr evaluates to true
Programming Languages CSCI 4430, A. Milanova 4
Formal Languages
n Theoretical foundations – Automata theory n A language is a set of strings (also called
sentences) over a finite alphabet
n A generator is a set of rules that generate the strings in the language
n A recognizer reads input strings and determines whether they belong to the language
n Languages are characterized by the complexity of generation/recognition rules
n E.g., regular languages
n E.g., context-free languages
Programming Languages CSCI 4430, A. Milanova 5
Question
n What are the classes of formal languages?
n The Chomsky hierarchy: n Regular languages
n Context-free languages
n Context-sensitive languages
n Recursively enumerable languages
Programming Languages CSCI 4430, A. Milanova 6
Formal Languages
n Generators and recognizers become more complex as languages become more complex
n Regular languages
n Describe PL tokens (e.g., keywords, identifiers, numeric literals) n Generated by Regular Expressions
n Recognized by a Finite Automaton (scanner)
n Context-free languages
n Describe more complex PL constructs (e.g., expressions and
statements)
n Generated by a Context-free Grammar
n Recognized by a Push-down Automaton (parser)
n Even more complex constructs
Programming Languages CSCI 4430, A. Milanova 7
Formal Languages
n Main application of formal languages: enable proof of relative difficulty of computational problems
n Our focus: formal languages provide the formalism for describing PL constructs
n A compelling application of formal languages!
n Building a scanner
n Building a parser
n Central issue: build efficient, linear-time parsers Programming Languages CSCI 4430, A. Milanova 8
A Single Pass
position = initial + rate * 60;
• Scanner emits next token
• Parser consumes the token
and continues building the parse tree (typically bottom up)
Scanner
id = …
id =
Programming Languages CSCI 4430, A. Milanova … 9
assign
Parser
expr
Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 10
Regular Expressions
n Simplest structure
n Formalism to describe the simplest programming language constructs, the tokens
n each symbols (e.g., “+”, “-”) is a token
n an identifier (e.g., position, rate, initial) is a token n a numeric constant (e.g., 59) is a token
n etc.
n Recognized by a finite automaton
Programming Languages CSCI 4430, A. Milanova 11
Regular Expressions
n A Regular Expression is one of the following: n A character, e.g., a
n The empty string, denoted by e
n Two regular expressions next to each other,
R1 R2
n Meaning: R1 R2 generates the language of strings that are
made up of any string generated by R1, followed by any string generated by R2
n Two regular expressions separated by |, R1 | R2
n Meaning: R1 | R2 generates the language that is the union of
the strings generated by R1 with the strings generated by R2 Programming Languages CSCI 4430, A. Milanova 12
Question
n What is the language defined by reg. exp. (a|b)(aa|bb) ?
n We saw concatenation and alternation. What operation is still missing?
Programming Language CSCI 4430, A. Milanova 13
Regular Expressions
n A Regular Expression is one of the following: n A character, e.g., a
n The empty string, denoted by e
n R1 R2
n R1 | R2
n Regular expression followed by a Kleene star, R*
n Meaning: the concatenation of zero or more strings generated by R
n E.g., a* generates {e, a, aa, aaa, … }
n E.g., (a|b)* generates all strings of a’s and b’s Programming Languages CSCI 4430, A. Milanova 14
Regular Expressions
n Precedence
n Kleene * has highest precedence
n Followed by concatenation n Followed by alternation |
n E.g., a b | c is (a b) | c not a (b | c) n Generates {ab,c} not {ab,ac}
n E.g., a b* generates {a,ab,abb,…} not {ε, ab, abab, ababab,…}
Programming Languages CSCI 4430, A. Milanova 15
Question
n What is the language defined by regular expression (0 | 1)* 1 ?
nWhatabout 0*(10*10*)* ?
Programming Languages CSCI 4430, A. Milanova 16
Regular Expressions in Programming Languages
n Describe tokens n Let
letter ® a|b|c| … |z
digit ® 1|2|3|4|5|6|7|8|9|0
n Which token is this? 1. letter ( letter | digit )*
2. digit digit *
3. digit * . digit digit *
? ?
?
Programming Languages CSCI 4430, A. Milanova
17
Regular Expressions in Programming Languages
n Which token is this:
number ® integer | real
real ® integer exponent | decimal ( exponent | ε ) decimal ® digit* ( . digit | digit . ) digit*
exponent ® (e|E) (+|-| ε ) integer
integer ® digit digit*
digit ® 1|2|3|4|5|6|7|8|9|0
Programming Languages CSCI 4430, A. Milanova 18
Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 19
Context-Free Grammars
n Unfortunately, regular languages cannot specify all constructs in programming
n E.g., can we write a regular expression that specifies valid arithmetic expressions?
n id * ( id + id * ( number – id ) ) n Among other things, we need to ensure that
parentheses are matched!
n Answer is no. We need context-free languages and context-free grammars!
Programming Languages CSCI 4430, A. Milanova 20
Grammar
n A grammar is a formalism to describe the strings of a (formal) language
n A grammar consists of a set of terminals, set of nonterminals, a set of productions, and a start symbol
n Terminals are the characters in the alphabet
n Nonterminals represent language constructs
n Productions are rules for forming syntactically correct constructs
n Start symbol tells where to start applying the rules Programming Languages CSCI 4430, A. Milanova 21
Notation
Specification of identifier:
Regular expression: letter ( letter | digit )*
BNF:
Textbook and slides: digit ® 1|2|3|4|5|6|7|8|9|0 (also BNF) letter ® a|b|c|d|…|z
id ® letter | id letter | id digit Programming Languages CSCI 4430, A. Milanova 22
Nonterminals shown in italic
Terminals shown in typewriter
Regular Grammars
n Regular grammars generate regular languages
n The rules in regular grammars are of the form:
n Each left-hand-side (lhs) has exactly one nonterminal
n Each right-hand-side (rhs) is one of the following n A single terminal symbol or
n A single nonterminal symbol or
n A nonterminal followed by a terminal
e.g.,12*|0+ S® A|B A®1|A2 B®0|B0
Programming Languages CSCI 4430, A. Milanova 23
Question
n Is this a regular grammar:
S®0A A®S1 S®ε
n No, this is a context-free grammar
n It generates 0n1n, the canonical example of a context-
free language
n rhs should be nonterminal followed by a terminal, thus,
S ® 0 A is not a valid production
Programming Languages CSCI 4430, A. Milanova 24
Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 25
Context-free Grammars (CFGs)
n Context-free grammars generate context-free languages
n Most of what we need in programming languages can be specified with CFGs
n Context-free grammars have rules of the form:
n Each left-hand-side has exactly one nonterminal
n Each right-hand-side contains an arbitrary sequence of terminals and nonterminals
n A context-free grammar e.g.0n1n ,n≥1 S ®0S1
S®01
Programming Languages CSCI 4430, A. Milanova 26
Question
n Examples of a non-context-free languages?
n E.g., anbmcndm n E.g., wcw
n E.g., anbncn
n≥1, m≥1
where w is in (0|1)* n≥1 (canonical example)
Programming Languages CSCI 4430, A. Milanova
27
Context-free Grammars
n Can be used to generate strings in the context-free language (derivation)
n Can be used to recognize well-formed strings in the context-free language (parse)
n In Programming Languages and compilers, we are concerned with two special CFGs, called LL and LR grammars
Programming Languages CSCI 4430, A. Milanova 28
Derivation
Simple context-free grammar for expressions:
expr ®id|(expr) | expr op expr op ® + | *
We can generate (derive) expressions:
expr Þ expr op expr Þ expr op id
Þ expr + id
Þ expr op expr + id Þexprop id+id Þexpr *id+id Þid *id+id
sentential form
sentence, string or yield
29
Derivation
n A derivation is the process that starts from the start symbol, and at each step, replaces a
nonterminal with the right-hand-side of a production
n E.g., expr op expr derives expr op id
We replaced the right (underlined) expr with id due to production expr ® id
n An intermediate sentence is called a sentential form
n E.g., expr op id is a sentential form
30
Derivation
n The resulting sentence is called yield
n E.g., id*id+id is the yield of our derivation
n What is a left-most derivation?
n Replaces the left-most nonterminal in the
sentential form at each step
n What is a right-most derivation?
n Replaces the right-most nonterminal in the sentential form at each step
n There are derivations that are neither left- nor
right-most
31
Question
n What kind of derivation is this: expr Þ expr op expr
Þ expr op id
Þ expr + id
Þ expr op expr + id Þexprop id+id Þexpr *id+id Þid *id+id
n A right-most derivation. At each step we replace the right-most nonterminal
Programming Languages CSCI 4430, A. Milanova 32
Question
n What kind of derivation is this: expr Þ expr op expr
Þ expr op id
Þ expr + id
Þ expr op expr + id
Þ id op expr + id Þ id op id + id Þ id * id + id
n Neither left-most nor right-most
Programming Languages CSCI 4430, A. Milanova 33
Parse
Recall our context-free grammar for expressions:
expr ®id|(expr) | expr op expr op ® + | *
n A parse is the reverse of a derivation
id*id+id
Þexpr Þ expr Þ expr Þ expr Þ expr Þ expr
*id+id op id + id op expr + id
+ id op id op expr
Þ expr Programming Languages CSCI 4430, A. Milanova
34
Parse
n A parse starts with the string of terminals, and at each step, replaces the right-hand-
side (rhs) of a production with the left-hand- side (lhs) of that production. E.g.,
… Þ expr op expr + id
Þ expr + id
Here we replaced expr op expr (the rhs of production expr ® expr op expr) with expr (the lhs of the production)
Programming Languages CSCI 4430, A. Milanova 35
Parse Tree
expr ®id|(expr) | expr op expr op ® + | *
expr Þ expr op expr Þ expr op id
Þ expr + id
Þ expr op expr + id Þexprop id+id Þexpr*id+id Þid *id+id
expr
expr op expr
expr op expr + id* id id
Internal nodes are nonterminals. Children are the rhs of a rule for that nonterminal.
Leaf nodes are terminals.
36
Ambiguity
n Ambiguity
n A grammar is ambiguous if some string can be
generated by two or more distinct parse trees
n There is no algorithm that can tell if an arbitrary context-free grammar is ambiguous
n Ambiguity arises in programming language grammars
n Arithmetic expressions
n If-then-else: the dangling else problem
n Ambiguity is bad
Programming Languages CSCI 4430, A. Milanova 37
Ambiguity
expr ®id|(expr) | expr op expr op ® + | *
nHow many parse trees forid * id + id?
Tree 1:expr expr op
+
id id+id id*id id
n Which one is “correct”?
expr
expr op expr
Tree 2:
expr
expr
op expr
*
expr op expr
38
Ambiguity
expr ®id|(expr) | expr op expr op ® + | *
nHow many parse trees forid + id + id?
Tree 1:expr expr op
+
id id+id id+id id
n Which one is “correct”?
expr
expr op expr
Tree 2:
expr
expr
op expr
+
expr op expr
39
Lecture Outline
n Formal languages
n Regular expressions
n Context-free grammars n Derivation
n Parse
n Parse trees n Ambiguity
n Expression Grammars
Programming Languages CSCI 4430, A. Milanova 40
Expression Grammars
n Generate expressions n Arithmetic expressions
n Regular expressions n Other
n Terminals: operands, operators, and parentheses
expr ®id|(expr) | expr op expr op ® + | *
Programming Languages CSCI 4430, A. Milanova 41
Handling Ambiguity
Our ambiguous grammar, slightly simplified:
expr ®id|(expr) | expr+expr | expr*expr
n Rewrite the grammar into unambiguous one: expr ® expr + term | term
term ® term * factor | factor
factor ®id|( expr)
n Forces left associativity of + and *
n Forces higher precedence of * over +
42
Rewriting Expression Grammars: Intuition
expr ®id|(expr) | expr+expr | expr*expr n A new nonterminal, term
n expr * expr becomes term. Thus, * gets pushed down the tree, forcing higher
precedence of *
n expr + becomes expr + term. Pushes leftmost + down the tree, forcing operand to associate with + on its left
n expr ® expr + expr becomes expr ® expr + term Programming Languages CSCI 4430, A. Milanova | term 43
Rewriting Expression Grammars:
term
id
Intuition
terms in the sum
E.g., look at id + id*id*id + id + id*id
expr
+ term
id*id
id
id*id*id
expr expr + term
expr + term
Programming Languages CSCI 4430, A. Milanova
44
Rewriting Expression Grammars: Intuition
n Another new nonterminal, factor and productions:
n term ® term * factor | factor n factor ®id| ( expr )
Programming Languages CSCI 4430, A. Milanova 45
Exercise
expr ® expr × expr | expr ^ expr | id
n How many parse trees for id×id^id×id ? n No need to draw them all
n Rewrite this grammar into an equivalent unambiguous grammar where
^ has higher precedence than × ^ is right-associative
× is left-associative
Programming Languages CSCI 4430, A. Milanova 46
The End
Programming Languages CSCI 4430, A. Milanova 47