程序代写代做代考 python assembly algorithm interpreter Fortran assembler compiler lec2.key

lec2.key

CS 314 Principles of Programming Languages

Prof. Zheng Zhang
Rutgers University

Lecture 2: Syntax Analysis (Scanning)

September 6, 2018

Announcement

2

• First recitation starts this coming Wednesday
• Homework 1 will be released after lecture 3.
• My office hour: 


Thursday 2pm – 3pm at CoRE 315
• TA office hours will be announced soon.

Last Class

3

• Overview of compilation
• Syntax and semantics
• Formal language definition
• A rule-based rewriting system
• Introduction to regular expression

Compilation and Interpretation

4

Source
Code

Compiler

Machine
Code

Error

Compiler
• Recognize legal (and illegal) programs
• Generate correct code
• Manage storage of all variables and code
• Need format for object (or assembly) code

Big step up from assembler to higher level notations

Compilation and Interpretation

5

Source
Program

Compiler

Target
Program

Error

Input Output

Pure Compilation
• Mainly refers to translation
• Take a program in source language, output a program in target language

(usually machine code)

Compilation and Interpretation

6

Interpreter

Input

Output

Source
Program

Interpretation
• Interpreter stays around for the execution of the program
• Interpreter is the locus of control during execution

Compilation and Interpretation

7

• Most language implementations include a mixture of both
compilation and interpretation.

• Common case is compilation or simple pre-processing,
followed by interpretation.

Source
Program

Translator

Intermediate
Program

Input

Virtual Machine Output

Compiled V.S. Interpreted Languages

8

• We generally say that:
A language is “interpreted” if the initial translator is “simple”, 

or “compiled” if the initial translator is “complicated”

• Very subjective, but a language is still “compiled” if the translator has
thorough analysis and non-trivial transformation.

Source
Program

Translator

Intermediate
Program

Input

Virtual Machine Output

Smalltalk and python: “interpreted”
Fortran and C: “compiled”

Syntax and Semantics of Programming Languages

9

Syntax:

Describes what a legal program looks like

Semantics:

Describes what a correct (legal) program means

Syntax of Programming Languages

10

• tokens – legal combination of characters in the language

The syntax of programming languages is often defined in two layers: 

tokens and sentences.

Question: How to spell a token (word)?
Answer: Regular expressions

• sentences – legal combinations of tokens in the language

Question: How to build correct sentences with tokens?
Answer: (Context – free) grammars (CFG)

Lexical Analysis ( Scott 2.1, 2.2)

11

Character Sequence:

scanner

i f a < b t h e n c : d= = if id <= id then

:=id id

Token Sequence:

Syntax Analysis ( Scott, Chapter 2.3)

12

if id <= id then

:=id id

parser

Token Sequence:

if then

id <= id id := id

Parse Tree:

Tokens ( Scott 2.1, 2.2)

13

Tokens (Analogous to Words of Language)
• Smallest “atomic” units of further syntax analysis
• Used to build all the other constructs
• Example, in C:

Keywords: for if goto volatile…
= * / – < > == <= >= <> ( ) [ ] ; := . , …
Number: (Example: 3.14 28 …)
Identifier: (Example: b square addEntry …)

Formalisms for Lexical and Syntactic Analysis

14

Two issues in Formal Languages:
• Language Specification → formalism to describe what a valid
program (word/sentence) looks like.

• Language Recognition → formalism to describe a machine and an
algorithm that can verify that a program is valid or not.

Formalisms for Lexical and Syntactic Analysis

15

Two issues in Formal Languages:
• Language Specification → formalism to describe what a valid
program (word/sentence) looks like.

• Language Recognition → formalism to describe a machine and an
algorithm that can verify that a program is valid or not.

We use regular expression to specify tokens (words)

Regular Expressions

16

RE p Language L(p)

r | s L(r) ∪ L(s)

rs {RS | R ∈ L(r), S ∈ L(s)}

r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…

r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…

(s) L(s)

a {a}

ϵ {ϵ}

A syntax (notation) to specify regular languages.

A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ

L(p): the set of strings that can be represented using the expression p

Regular Expressions

17

RE p Language L(p)

r | s L(r) ∪ L(s)

rs {RS | R ∈ L(r), S ∈ L(s)}

r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…

r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…

(s) L(s)

a {a}

ϵ {ϵ}

A syntax (notation) to specify regular languages.

A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ

L(p): the set of strings that can be represented using the expression p

Either r or s is a
regular expression,
i.e. 0|11

Regular Expressions

18

RE p Language L(p)

r | s L(r) ∪ L(s)

rs {RS | R ∈ L(r), S ∈ L(s)}

r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…

r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…

(s) L(s)

a {a}

ϵ {ϵ}

A syntax (notation) to specify regular languages.

A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ

L(p): the set of strings that can be represented using the expression p

Regular Expressions

19

RE p Language L(p)

r | s L(r) ∪ L(s)

rs {RS | R ∈ L(r), S ∈ L(s)}

r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…

r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…

(s) L(s)

a {a}

ϵ {ϵ}

A syntax (notation) to specify regular languages.

A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ

L(p): the set of strings that can be represented using the expression p

Regular Expressions

20

RE p Language L(p)

r | s L(r) ∪ L(s)

rs {RS | R ∈ L(r), S ∈ L(s)}

r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…

r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…

(s) L(s)

a {a}

ϵ {ϵ}

A syntax (notation) to specify regular languages.

A RE can simply be a letter from the
alphabet Σ or an empty symbol ϵ

Any number of r’s
concatenated.

Regular Expressions

21

RE p Language L(p)

r | s L(r) ∪ L(s)

rs {RS | R ∈ L(r), S ∈ L(s)}

r+ L(r) ∪ L(rr) ∪ L(rrr) ∪…

r* (r* = r+ | ϵ) {ϵ} ∪ L(r) ∪ L(rr) ∪…

(s) L(s)

a {a}

ϵ {ϵ}

A syntax (notation) to specify regular languages.

Any number of r’s
concatenated.

A RE can simply be a letter from the
alphabet Σ or an empty string ϵ

Examples of Expressions— Solution

22

RE Language

a|bc {a, bc}

(b|c)a {ba, ca}

a ϵ {a}

a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}

ab* {a, ab, abb, abbb, abbbb, . . .}

ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}

(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}

(0|1)*0 binary numbers ending in 0

Examples of Expressions— Solution

23

RE Language

a|bc {a, bc}

(b|c)a {ba, ca}

a ϵ {a}

a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}

ab* {a, ab, abb, abbb, abbbb, . . .}

ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}

(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}

(0|1)*0 binary numbers ending in 0

Examples of Expressions— Solution

24

RE Language

a|bc {a, bc}

(b|c)a {ba, ca}

a ϵ {a}

a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}

ab* {a, ab, abb, abbb, abbbb, . . .}

ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}

(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}

(0|1)*0 binary numbers ending in 0

Examples of Expressions— Solution

25

RE Language

a|bc {a, bc}

(b|c)a {ba, ca}

a ϵ {a}

a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}

ab* {a, ab, abb, abbb, abbbb, . . .}

ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}

(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}

(0|1)*0 binary numbers ending in 0

Examples of Expressions— Solution

26

RE Language

a|bc {a, bc}

(b|c)a {ba, ca}

a ϵ {a}

a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}

ab* {a, ab, abb, abbb, abbbb, . . .}

ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}

(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}

(0|1)*0 binary numbers ending in 0

Examples of Expressions— Solution

27

RE Language

a|bc {a, bc}

(b|c)a {ba, ca}

a ϵ {a}

a*|b {ϵ, a, aa, aaa, aaaa, …}∪{b}

ab* {a, ab, abb, abbb, abbbb, . . .}

ab*|c+ {a, ab, abb, abbb, abbbb, . . . } ∪ {c,cc,ccc,…}

(a|b)* {ϵ, a, b, aa, ab, ba, bb, aaa, aab, . . .}

(0|1)*0 binary numbers ending in 0

Concatenation has
higher precedence
over alternation |.

28

Let letter stand for A | B | C | . . . | Z
Let digit stand for 0 | 1 | 2 | . . . | 9

integer constant:

identifier:

real constant:

Regular Expressions for Programming Languages

29

Let letter stand for A | B | C | . . . | Z
Let digit stand for 0 | 1 | 2 | . . . | 9

integer constant:

identifier:

real constant:

Regular Expressions for Programming Languages

digit+

letter(letter | digit)*

digit*.digit+

Formalisms for Lexical and Syntactic Analysis

30

Two issues in Formal Languages:
• Language Specification → formalism to describe what a valid
program (word/sentence) looks like.

• Language Recognition → formalism to describe a machine and an
algorithm that can verify that a program is valid or not.

We use finite state automata to recognize regular language

Finite State Automata

31

A Finite-State Automaton is a quadruple: < S, s, F, T >
• S is a set of states, e.g., {S0, S1, S2, S3}
• s is the start state, e.g., S0
• F is a set of final states, e.g., {S3}
• T is a set of labeled transitions, of the form 


(state, input) → state 

formally, 

S × Σ → S

S2

S1

S3

0

1

0

1

S0

32

integer constant:

identifier:

real constant:

Regular Expressions for Programming Languages

digit+

letter(letter | digit)*

digit*.digit+

Let letter stand for A | B | C | . . . | Z
Let digit stand for 0 | 1 | 2 | . . . | 9

Recognizers for Regular Expressions

33

Example 1:

Integer Constant
RE: digit+

FSA: S0 S1
digit

digit

Recognizers for Regular Expressions(Cont.)

34

Example 2:

Identifier
RE: letter(letter | digit)*

Recognizers for Regular Expressions(Cont.)

35

S0 S1FSA:
letter

letter

digit

Example 2:

Identifier
RE: letter(letter | digit)*

Recognizers for Regular Expressions(Cont.)

36

Example 3:

Real constant
RE: digit*.digit+

Recognizers for Regular Expressions(Cont.)

37

Example 3:

Real constant
RE: digit*.digit+

S2FSA: S1. digitS0

digit digit

Finite State Automata

38

An FSA accepts or recognizes an input string N iff there is some path
from start state to a final state such that the labels on the path spell N.
Lack of entry in the table (or no arc for a given character) indicates
an error—reject.

Transitions can be represented using a transition table:

0 1
S0 S1 S2
S1 S3 –
S2 – S3

State

Input

S2

S1

S3

0

1

0

1

S0

Finite State Automata

39

An FSA accepts or recognizes an input string N iff there is some path
from start state to a final state such that the labels on the path spell N.
Lack of entry in the table (or no arc for a given character) indicates
an error—reject.

Transitions can be represented using a transition table:

0 1
S0 S1 S2
S1 S3 –
S2 – S3

State

S2

S1

S3

0

1

0

1

S0

Error

Input

Practical Recognizers

40

• Recognizer should be a deterministic finite automaton (DFA)
• Read until the end of a token
• Report errors (error recovery)

Practical Recognizers

41

“identifier” regular expression:
letter → (a | b | c | … | z | A | B | C | … | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)

id → letter (letter | digit)∗

Practical Recognizers

42

“identifier” regular expression:
letter → (a | b | c | … | z | A | B | C | … | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)

id → letter (letter | digit)∗

Recognizer for “identifier”:

S0 S1 S2

S3 error

digit
other letter/digit

letter other

error

Implementation: Code for the recognizer

43

char ← next_char();
state ← S0;
done ← false;
while( not done ) {

class ← char_class[char];
state ← next_state[class,state];
switch(state) {

case S1: 

/* building an id */

token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)

done = true;
break;

case S2: /* error state */
case S3: /* error state */

token type = error;
done = true;
break;

}
}
return token_type;

class S0 S1 S2 S3

letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —

S0 S1 S2

S3 error

digit
other letter/digit

letter other

error

Implementation: Code for the recognizer

44

char ← next_char();
state ← S0;
done ← false;
while( not done ) {

class ← char_class[char];
state ← next_state[class,state];
switch(state) {

case S1: 

/* building an id */

token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)

done = true;
break;

case S2: /* error state */
case S3: /* error state */

token type = error;
done = true;
break;

}
}
return token_type;

class S0 S1 S2 S3

letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —

S0 S1 S2

S3 error

digit
other letter/digit

letter other

error

Implementation: Code for the recognizer

45

char ← next_char();
state ← S0;
done ← false;
while( not done ) {

class ← char_class[char];
state ← next_state[class, state];
switch(state) {

case S1: 

/* building an id */

token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)

done = true;
break;

case S2: /* error state */
case S3: /* error state */

token type = error;
done = true;
break;

}
}
return token_type;

class S0 S1 S2 S3

letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —

S0 S1 S2

S3 error

digit
other letter/digit

letter other

error

Implementation: Code for the recognizer

46

char ← next_char();
state ← S0;
done ← false;
while( not done ) {

class ← char_class[char];
state ← next_state[class, state];
switch(state) {

case S1: 

/* building an id */

token_value ← token_value + char;
char ← next_char();
if (char == DELIMITER)

done = true;
break;

case S2: /* error state */
case S3: /* error state */

token type = error;
done = true;
break;

}
}
return token_type;

class S0 S1 S2 S3

letter S1 S1 — —
digit S3 S1 — —
other S3 S2 — —

S0 S1 S2

S3 error

digit
other letter/digit

letter other

error

Next Lecture

47

• Read Scott, Chapters 2.3 – 2.5

Things to do: