Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
2. Lexical Analysis
NYU Courant Institute
Compiler Construction (CSCI-GA.2130-001)
Copyright By PowCoder代写 加微信 powcoder
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 1 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Acknowledgments
Adapted from CSCI-GA.2130-001 slides by and .
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 2 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Lexical Analysis Token Specifications Token Recognition Finite Automata
RE to ical Analyzers
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 3 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
First compilation phase
source program
Characters Tokens Tree Tree IR IR Machine code
— — — —
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Representation Generator Optimizer
Code Generator Machine-Dependent Code Optimizer
Symbol Table
target machine code
Machine code
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 4 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Primary roles of the lexer:
read input characters and expand macros, identify lexeme patterns for tokens,
identify token attributes,
construct the symbol table,
error handling and reporting,
strip whitespace and comments,
and sometimes even modify the input stream!
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 5 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Software modules involved
Source Program
Lexical Analysis
Syntax Analysis
Semantic Analysis
Stream getNextChar getNextToken
Character mm
Lexer/scanner nn
Syntax Tree
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 6 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Why separate lexical and syntactic analysis?
Simplifies overall compiler design.
Improves compiler efficiency (allows specialized
techniques for lexer issues vs parser issues).
Better portability (input-device specific peculiarities restricted to lexer).
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 7 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
The stream of characters:
position = initial + rate * 60
Scanned into list of tokens, one for each lexeme: ⟨id,1⟩ ⟨=⟩ ⟨id,2⟩ ⟨+⟩ ⟨id,3⟩ ⟨∗⟩ ⟨num,60⟩
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 8 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Lexeme pattern (informal)
Attribute value
⟨id, 1⟩ ⟨=⟩ ⟨id, 2⟩ ⟨+⟩ ⟨id, 3⟩ ⟨∗⟩ ⟨num, 60⟩
identifier string equality symbol identifier string addition symbol identifier string multiplication symbol numeric constant
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 9 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Typical classes of tokens in a programming language: keywords,
operators,
one token for identifiers,
tokens for constants (numbers, literal strings, …),
tokens for punctuation symbols (comma, semicolon, …).
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 10 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Token attributes
Typical attribute values for the identifier token (id): the precise lexeme instance,
its type/storage requirements,
location where first found (error reporting).
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 11 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
float limitedSquare(float x) {
/* Returns squared x capped at 100. */
return (x <= -10.0 || x >= 10.0) ? 100 : x * x;
What tokens will likely be generated from this code?
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 12 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
keyword tokens float, return,
identifier tokens limitedSquare, x,
comment token or skipped /* … */
comparison tokens <=,>=,
logical tokens ||,
number tokens 100, 10.0, -10.0 or two separate tokens,
miscellaneous (-token, )-token, ?-token, :-token, ∗-token, ;-token {-token, }-token.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 13 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Dealing with errors
What if no pattern matches?
Delete input characters until a match is found. Insert missing character.
Replace a character with another.
Transpose two adjacent characters.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 14 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Input buffering
Buffering issues
Lexer may need to look at least a character ahead to make a token decision.
Example: Assume two tokens are defined as ‘*’ and ‘***’. Now consider the input character sequence ‘**a’. We need to scan and buffer all three characters to decide the tokens are two ‘*’.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 15 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Lexical Analysis
Token Specifications
Token Recognition Finite Automata RE to ical Analyzers
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 16 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Specifying tokens: lexeme patterns
Some lexeme patterns: +, 6-0, =, =-=, p-o-s-i-t-i-o-n, … Lexeme patterns include finite, infinite, cyclic, and balanced
Partition into:
Regular expressions (RE) (formal token descriptions), other patterns.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 17 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Formal token specification
Some lexeme patterns/strings as RE:
operator: +
equality:(=|==)
number:{d+ |d ∈ Digit}
identifiers: { l(d|l)∗ | d ∈ Digit, l ∈ Letter } empty string: ε (epsilon)
Digit and Letter stand for alphabets.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 18 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Definition
Regular expression concepts:
Alphabet any finite set of symbols.
String finite sequence of symbols drawn from an alphabet.
Language countable set of strings over some fixed alphabet.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 19 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Definition
Basic RE taxonomy:
Character Concatenation Choice Repetition(≥0) Grouping
e1 e2 e1 | e2 e∗
a, bc ε,a,aa,… ac, bc
where ∗ denotes the Kleene closure of the generated language.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 20 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Describe the languages (patterns/strings) generated by REs:
(ε|a)b∗ – strings of 0 or more ’b’s, prefixed by 0 or one ’a’. a(a|b)∗a – strings starting and ending with (separate) ’a’s
with substrings of 0 or more ’a’s and ’b’s in between.
a∗ba∗ba∗ba∗ – all strings which are a mix of ‘a’s and ’b’s containing exactly 3 ’b’s and 0 or more ’a’s.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 21 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Describe the languages (patterns/strings) generated by REs:
(ε|a)b∗ – strings of 0 or more ’b’s, prefixed by 0 or one ’a’.
a(a|b)∗a – strings starting and ending with (separate) ’a’s with substrings of 0 or more ’a’s and ’b’s in between.
a∗ba∗ba∗ba∗ – all strings which are a mix of ‘a’s and ’b’s containing exactly 3 ’b’s and 0 or more ’a’s.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 21 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Describe the languages (patterns/strings) generated by REs:
(ε|a)b∗ – strings of 0 or more ’b’s, prefixed by 0 or one ’a’.
a(a|b)∗a – strings starting and ending with (separate) ’a’s with substrings of 0 or more ’a’s and ’b’s in between.
a∗ba∗ba∗ba∗ – all strings which are a mix of ‘a’s and ’b’s containing exactly 3 ’b’s and 0 or more ’a’s.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 21 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Describe the languages (patterns/strings) generated by REs:
(ε|a)b∗ – strings of 0 or more ’b’s, prefixed by 0 or one ’a’.
a(a|b)∗a – strings starting and ending with (separate) ’a’s with substrings of 0 or more ’a’s and ’b’s in between.
a∗ba∗ba∗ba∗ – all strings which are a mix of ‘a’s and ’b’s containing exactly 3 ’b’s and 0 or more ’a’s.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 21 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Definition
Extended RE taxonomy: RE extended with
Optional Repetition(≥1) Character class
[0-9 _ a-z]
ε,a,b a,aa,… 7,_,c
where + denotes the positive closure of the generated language.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 22 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
State REs for the following languages:
all non-empty strings over the alphabet of ‘0’s and ‘1’s:
all strings beginning with a digit, followed by either a or b, andendingwithanuppercaseletter: [0-9](a|b)[A-Z]
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 23 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
State REs for the following languages:
all non-empty strings over the alphabet of ‘0’s and ‘1’s:
all strings beginning with a digit, followed by either a or b, andendingwithanuppercaseletter: [0-9](a|b)[A-Z]
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 23 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
State REs for the following languages:
all non-empty strings over the alphabet of ‘0’s and ‘1’s:
all strings beginning with a digit, followed by either a or b, andendingwithanuppercaseletter: [0-9](a|b)[A-Z]
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 23 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Simulation
Extended RE features as basic RE:
Extended syntax
Basic syntax
[e1 − e2 . . . ei − ei+1]
ee∗ e1|…|ei
where e1 ∈ {e1,…,e2},…,ei ∈ {ei,…,ei+1},i ≥ 1
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 24 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Example Tokenizer
int = [0-9]+
float = [0-9]+[.][0-9]+
id = [a-z]+([_]?[0-9]+)?
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 25 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Additional string vocabulary
a string obtained by removing zero or more symbols from the end.
a string obtained by removing zero or more symbols from the beginning.
a string obtained by removing any prefix and suffix.
subsequence a string formed by removing zero or more arbitrary positions.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 26 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Lexical Analysis Token Specifications Token Recognition Finite Automata
RE to ical Analyzers
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 27 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Formal Languages
Approaches for defining formal languages (“sets of strings”): Basic RE
Extended RE
Deterministic finite automata
Nondeterministic finite automata
Context-free grammars (CFG) – next time
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 28 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Formal Languages
Approaches for defining formal languages (“sets of strings”): Basic RE
Extended RE
Deterministic finite automata
Nondeterministic finite automata
Context-free grammars (CFG) – next time
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 28 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Example Token Definitions
if → if then → then else → else
digit→ [0−9] digits → digit+
number→ digits(.digits)?(e[+−]?digits)? letter → [A−Za−z]
id → letter (letter | digits)∗
relop→ <|>|<=|>=|=|<>
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 29 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Transition diagram
Diagram/automaton recognizes a relop token:
//2 return(relop, LE) 3 return(relop, NE)
4 return(relop, LT)
5 return(relop, EQ)
6 = other
//7 return(relop, GE)
8 return(relop, GT)
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 30 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Expressive Language Power
“Language power” given by the possibly generated strings
Comparisons:
extended and basic RE have same power
finite automata and RE have same power
CFG are more powerful than RE … more next time
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 31 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Expressive Language Power
“Language power” given by the possibly generated strings
Comparisons:
extended and basic RE have same power
finite automata and RE have same power
CFG are more powerful than RE … more next time
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 31 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Expressive Language Power
“Language power” given by the possibly generated strings
Comparisons:
extended and basic RE have same power
finite automata and RE have same power
CFG are more powerful than RE … more next time
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 31 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Lexical Analysis Token Specifications Token Recognition Finite Automata
RE to ical Analyzers
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 32 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Definition
Finite state automata (FSA) are string recognizers: answering “yes” or “no” to each input string.
Nondeterministic finite automata (NFA) transition diagrams with no restrictions on the labels of their edges.
Deterministic automata (DFA) transition diagrams where for each state and each symbol of the input alphabet there is exactly one edge leaving that state.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 33 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
NFA: Nondeterministic Finite (State) Automata
NFA accepting a(a|b)∗b:
start // a // a,b // b //
1 2 3 BB4 b
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 34 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
NFA accepting a(a|b)∗b
States={1,2,3,4}
{2} {3} {} {}
{} {3, 4} {4} {}
{} {} {2} {}
Start = 1
Finals = {4}
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 35 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
NFA core definition
finite set of states S (circles),
set of input symbols Σ (labels),
next-state transition function τ : S × Σ → S (edges/table), dedicated start state s0 ∈ S (start arrow),
set of accepting states ⊆ S (double circles).
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 36 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
DFA: Deterministic Finite (State) Automata
DFA accepting a(a|b)∗b:
// A a // B a bba
bb Ea,b D
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 37 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
DFA accepting a(a|b)∗b
States = {A,B,C,D,E}
Start = A
Finals = {D}
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 38 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
DFA core definition
Special case of NFA:
no moves on ε,
foreach(s,a)∈S×Σ:∃τ(s,a)unique!
Transition graph: there is exactly one edge out of each state s and each label a.
Transition table: each entry is a single state.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 39 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Some definitions
Operation ε-closure(s)
ε-closure(T) move(T,a)
Description
Set of NFA states reachable from state s on ε-transitions alone.
Set of NFA states reachable from some state s in T on ε- transitions alone. SetofNFAstatestowhichthereisatransitionon input symbol a from some state s in T
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 40 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
Subset construction
NFA to DFA (Algorithm 3.20.)
// Initially, ε-closure(s0) is the only
// state in Dstates, and it is unmarked.
while (there is an unmarked state T in Dstates) {
for (each input symbol a) {
U = epsilon-closure(move(T, a)) if (U not in Dstates)
add U as an unmarked state to Dstates
DTran[T, a] = U
Dstates states of the DFA we are constructing.
Compiler Construction (CSCI-GA.2130-001) 2. Lexical Analysis 41 / 59
Lexical Analysis Token Specifications Token Recognition Finite Automata RE to
NFA → DFA Translation for (a|b)∗abb
//0 //1 ::2 a //3 start ε ε
ε$$ b // ε 45
$$::6 //7 //8 //9 // 10
ε ε<CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com