lec2.key
CS 314 Principles of Programming Languages
Prof. Zheng Zhang
Rutgers University
Lecture 2: Syntax Analysis (Scanning)
September 6, 2018
Announcement
2
• First recitation starts this coming Wednesday
• Homework 1 will be released after lecture 3.
• My office hour:
Thursday 2pm – 3pm at CoRE 315
• TA office hours will be announced soon.
Last Class
3
• Overview of compilation
• Syntax and semantics
• Formal language definition
• A rule-based rewriting system
• Introduction to regular expression
Compilation and Interpretation
4
Source
Code
Compiler
Machine
Code
Error
Compiler
• Recognize legal (and illegal) programs
• Generate correct code
• Manage storage of all variables and code
• Need format for object (or assembly) code
Big step up from assembler to higher level notations
Compilation and Interpretation
5
Source
Program
Compiler
Target
Program
Error
Input Output
Pure Compilation
• Mainly refers to translation
• Take a program in source language, output a program in target language
(usually machine code)
Compilation and Interpretation
6
Interpreter
Input
Output
Source
Program
Interpretation
• Interpreter stays around for the execution of the program
• Interpreter is the locus of control during execution
Compilation and Interpretation
7
• Most language implementations include a mixture of both
compilation and interpretation.
• Common case is compilation or simple pre-processing,
followed by interpretation.
Source
Program
Translator
Intermediate
Program
Input
Virtual Machine Output
Compiled V.S. Interpreted Languages
8
• We generally say that:
A language is “interpreted” if the initial translator is “simple”,
or “compiled” if the initial translator is “complicated”
• Very subjective, but a language is still “compiled” if the translator has
thorough analysis and non-trivial transformation.
Source
Program
Translator
Intermediate
Program
Input
Virtual Machine Output
Smalltalk and python: “interpreted”
Fortran and C: “compiled”
Syntax and Semantics of Programming Languages
9
Syntax:
Describes what a legal program looks like
Semantics:
Describes what a correct (legal) program means
Syntax of Programming Languages
10
• tokens – legal combination of characters in the language
The syntax of programming languages is often defined in two layers:
tokens and sentences.
Question: How to spell a token (word)?
Answer: Regular expressions
• sentences – legal combinations of tokens in the language
Question: How to build correct sentences with tokens?
Answer: (Context – free) grammars (CFG)
Lexical Analysis ( Scott 2.1, 2.2)
11
Character Sequence:
scanner
i f a < b t h e n c : d= = if id <= id then
:=id
Token Sequence:
Syntax Analysis ( Scott, Chapter 2.3)
12
if id <= id then
:=id
parser
Token Sequence:
if then
Parse Tree:
Tokens ( Scott 2.1, 2.2)
13
Tokens (Analogous to Words of Language)
• Smallest “atomic” units of further syntax analysis
• Used to build all the other constructs
• Example, in C:
Keywords: for if goto volatile…
= * / – < > == <= >= <> ( ) [ ] ; := . , …
Number: (Example: 3.14 28 …)
Identifier: (Example: b square addEntry …)
Formalisms for Lexical and Syntactic Analysis
14
Two issues in Formal Languages:
• Language Specification → formalism to describe what a valid
program (word/sentence) looks like.
• Language Recognition → formalism to describe a machine and an
algorithm that can verify that a program is valid or not.
Formalisms for Lexical and Syntactic Analysis
15
Two issues in Formal Languages:
• Language Specification → formalism to describe what a valid
program (word/sentence) looks like.
• Language Recognition → formalism to describe a machine and an
algorithm that can verify that a program is valid or not.
We use regular expression to specify tokens (words)
Regular Expressions
16
RE p Language L(p)
r | s L(r) ∪ L(s)
rs {RS | R ∈ L(r), S ∈ L(s)}
r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…
r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…
(s) L(s)
a {a}
ϵ {ϵ}
A syntax (notation) to specify regular languages.
A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ
L(p): the set of strings that can be represented using the expression p
Regular Expressions
17
RE p Language L(p)
r | s L(r) ∪ L(s)
rs {RS | R ∈ L(r), S ∈ L(s)}
r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…
r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…
(s) L(s)
a {a}
ϵ {ϵ}
A syntax (notation) to specify regular languages.
A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ
L(p): the set of strings that can be represented using the expression p
Either r or s is a
regular expression,
i.e. 0|11
Regular Expressions
18
RE p Language L(p)
r | s L(r) ∪ L(s)
rs {RS | R ∈ L(r), S ∈ L(s)}
r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…
r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…
(s) L(s)
a {a}
ϵ {ϵ}
A syntax (notation) to specify regular languages.
A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ
L(p): the set of strings that can be represented using the expression p
Regular Expressions
19
RE p Language L(p)
r | s L(r) ∪ L(s)
rs {RS | R ∈ L(r), S ∈ L(s)}
r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…
r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…
(s) L(s)
a {a}
ϵ {ϵ}
A syntax (notation) to specify regular languages.
A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ
L(p): the set of strings that can be represented using the expression p
Regular Expressions
20
RE p Language L(p)
r | s L(r) ∪ L(s)
rs {RS | R ∈ L(r), S ∈ L(s)}
r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…
r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…
(s) L(s)
a {a}
ϵ {ϵ}
A syntax (notation) to specify regular languages.
A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ
Any number of r’s
concatenated.
Regular Expressions
21
RE p Language L(p)
r | s L(r) ∪ L(s)
rs {RS | R ∈ L(r), S ∈ L(s)}
r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…
r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…
(s) L(s)
a {a}
ϵ {ϵ}
A syntax (notation) to specify regular languages.
Any number of r’s
concatenated.
A RE can simply be a letter from the
alphabet Σ or an empty string ϵ
Examples of Expressions— Solution
22
RE Language
a|bc {a, bc}
(b|c)a {ba, ca}
a ϵ {a}
a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}
ab* {a, ab, abb, abbb, abbbb, . . .}
ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}
(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}
(0|1)*0 binary numbers ending in 0
Examples of Expressions— Solution
23
RE Language
a|bc {a, bc}
(b|c)a {ba, ca}
a ϵ {a}
a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}
ab* {a, ab, abb, abbb, abbbb, . . .}
ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}
(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}
(0|1)*0 binary numbers ending in 0
Examples of Expressions— Solution
24
RE Language
a|bc {a, bc}
(b|c)a {ba, ca}
a ϵ {a}
a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}
ab* {a, ab, abb, abbb, abbbb, . . .}
ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}
(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}
(0|1)*0 binary numbers ending in 0
Examples of Expressions— Solution
25
RE Language
a|bc {a, bc}
(b|c)a {ba, ca}
a ϵ {a}
a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}
ab* {a, ab, abb, abbb, abbbb, . . .}
ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}
(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}
(0|1)*0 binary numbers ending in 0
Examples of Expressions— Solution
26
RE Language
a|bc {a, bc}
(b|c)a {ba, ca}
a ϵ {a}
a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}
ab* {a, ab, abb, abbb, abbbb, . . .}
ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}
(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}
(0|1)*0 binary numbers ending in 0
Examples of Expressions— Solution
27
RE Language
a|bc {a, bc}
(b|c)a {ba, ca}
a ϵ {a}
a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}
ab* {a, ab, abb, abbb, abbbb, . . .}
ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}
(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}
(0|1)*0 binary numbers ending in 0
Concatenation has
higher precedence
over alternation |.
28
Let letter stand for A | B | C | . . . | Z
Let digit stand for 0 | 1 | 2 | . . . | 9
integer constant:
identifier:
real constant:
Regular Expressions for Programming Languages
29
Let letter stand for A | B | C | . . . | Z
Let digit stand for 0 | 1 | 2 | . . . | 9
integer constant:
identifier:
real constant:
Regular Expressions for Programming Languages
digit+
letter(letter | digit)*
digit*.digit+
Formalisms for Lexical and Syntactic Analysis
30
Two issues in Formal Languages:
• Language Specification → formalism to describe what a valid
program (word/sentence) looks like.
• Language Recognition → formalism to describe a machine and an
algorithm that can verify that a program is valid or not.
We use finite state automata to recognize regular language
Finite State Automata
31
A Finite-State Automaton is a quadruple: < S, s, F, T >
• S is a set of states, e.g., {S0, S1, S2, S3}
• s is the start state, e.g., S0
• F is a set of final states, e.g., {S3}
• T is a set of labeled transitions, of the form
(state, input) → state
formally,
S × Σ → S
S2
S1
S3
0
1
0
1
S0
32
integer constant:
identifier:
real constant:
Regular Expressions for Programming Languages
digit+
letter(letter | digit)*
digit*.digit+
Let letter stand for A | B | C | . . . | Z
Let digit stand for 0 | 1 | 2 | . . . | 9
Recognizers for Regular Expressions
33
Example 1:
Integer Constant
RE: digit+
FSA: S0 S1
digit
digit
Recognizers for Regular Expressions(Cont.)
34
Example 2:
Identifier
RE: letter(letter | digit)*
Recognizers for Regular Expressions(Cont.)
35
S0 S1FSA:
letter
letter
digit
Example 2:
Identifier
RE: letter(letter | digit)*
Recognizers for Regular Expressions(Cont.)
36
Example 3:
Real constant
RE: digit*.digit+
Recognizers for Regular Expressions(Cont.)
37
Example 3:
Real constant
RE: digit*.digit+
S2FSA: S1. digitS0
digit digit
Finite State Automata
38
An FSA accepts or recognizes an input string N iff there is some path
from start state to a final state such that the labels on the path spell N.
Lack of entry in the table (or no arc for a given character) indicates
an error—reject.
Transitions can be represented using a transition table:
0 1
S0 S1 S2
S1 S3 –
S2 – S3
State
Input
S2
S1
S3
0
1
0
1
S0
Finite State Automata
39
An FSA accepts or recognizes an input string N iff there is some path
from start state to a final state such that the labels on the path spell N.
Lack of entry in the table (or no arc for a given character) indicates
an error—reject.
Transitions can be represented using a transition table:
0 1
S0 S1 S2
S1 S3 –
S2 – S3
State
S2
S1
S3
0
1
0
1
S0
Error
Input
Practical Recognizers
40
• Recognizer should be a deterministic finite automaton (DFA)
• Read until the end of a token
• Report errors (error recovery)
Practical Recognizers
41
“identifier” regular expression:
letter → (a | b | c | … | z | A | B | C | … | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter (letter | digit)∗
Practical Recognizers
42
“identifier” regular expression:
letter → (a | b | c | … | z | A | B | C | … | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter (letter | digit)∗
Recognizer for “identifier”:
S0 S1 S2
S3 error
digit
other letter/digit
letter other
error
Implementation: Code for the recognizer
43
char ← next_char();
state ← S0;
done ← false;
while( not done ) {
class ← char_class[char];
state ← next_state[class,state];
switch(state) {
case S1:
/* building an id */
token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)
done = true;
break;
case S2: /* error state */
case S3: /* error state */
token type = error;
done = true;
break;
}
}
return token_type;
class S0 S1 S2 S3
letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —
S0 S1 S2
S3 error
digit
other letter/digit
letter other
error
Implementation: Code for the recognizer
44
char ← next_char();
state ← S0;
done ← false;
while( not done ) {
class ← char_class[char];
state ← next_state[class,state];
switch(state) {
case S1:
/* building an id */
token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)
done = true;
break;
case S2: /* error state */
case S3: /* error state */
token type = error;
done = true;
break;
}
}
return token_type;
class S0 S1 S2 S3
letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —
S0 S1 S2
S3 error
digit
other letter/digit
letter other
error
Implementation: Code for the recognizer
45
char ← next_char();
state ← S0;
done ← false;
while( not done ) {
class ← char_class[char];
state ← next_state[class, state];
switch(state) {
case S1:
/* building an id */
token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)
done = true;
break;
case S2: /* error state */
case S3: /* error state */
token type = error;
done = true;
break;
}
}
return token_type;
class S0 S1 S2 S3
letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —
S0 S1 S2
S3 error
digit
other letter/digit
letter other
error
Implementation: Code for the recognizer
46
char ← next_char();
state ← S0;
done ← false;
while( not done ) {
class ← char_class[char];
state ← next_state[class, state];
switch(state) {
case S1:
/* building an id */
token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)
done = true;
break;
case S2: /* error state */
case S3: /* error state */
token type = error;
done = true;
break;
}
}
return token_type;
class S0 S1 S2 S3
letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —
S0 S1 S2
S3 error
digit
other letter/digit
letter other
error
Next Lecture
47
• Read Scott, Chapters 2.3 – 2.5
Things to do: