Finite-state machines
Finite-State Machines (FSMs)
CS 536
Some announcements
P1
TA office hours
3
A compiler is a
recognizer of language S (Source)
a translator from S to T (Target)
a program in language H (Host)
For example, gcc: S is C, T is x86, H is C
Last time
Why do we need a compiler?
Processors can execute only binaries (machine-code/assembly programs)
Writing assembly programs will make you lose your mind
Write programs in a nice(ish) high-level language like C; compile to binaries
Last time
5
front end = understand source code S
IR = intermediate representation
back end = map IR to T
Last time
6
front end
back end
Symbol
table
P1
P2
P3
P4, P5
P6
Last time
Special linkage between scanner and parser in most compilers
7
Source Program
lexical analyzer (scanner)
syntax analyzer (parser)
Sequence of characters
Sequence of tokens
…
Conceptual
organization
syntax analyzer (parser)
…
next token,
please
source code
a < = p
lexical analyzer (scanner)
The scanner
Translates sequence of chars into a sequence of tokens (ignoring whitespace)
Each time the scanner is called it should:
find the longest prefix (lexeme) of the remaining input that corresponds to a token
return that token
8
a = 2 * b + abs(-71)
ident
(a)
asgn
int lit
(2)
times
ident
(b)
plus
ident
(abs)
lparens
int lit
(-71)
rparens
How to create a scanner?
For every possible lexeme that can occur in source program, return corresponding token
Inefficient
Error-prone
Scanner generator
Generates a scanner
Inputs:
- one regular expression for each token
- one regular expressions for each item to ignore (comments, whitespace, etc.)
Output: scanner program
How does a scanner generator work?
- Finite-state machines (FSMs)
10
FSMs: Finite State Machines
(A.k.a. finite automata, finite-state automata, etc.)
Input: string (sequence of chars)
Output: accept / reject
i.e., input is legal in language
Language defined by an FSM is the set of strings accepted by the FSM
11
Example 1
Language: single line comments with //
12
Nodes are states
Edges are transitions
Start state has an arrow (only one start state)
Final states are double circles (one or more)
Example 1
Language: single line comments with //
13
“// this is a comment.”
“/ / this is not.”
“// \n”
“Not // a comment”
Example 2
Language: Integer literals with an optional + or – (token: int-lit)
e.g., -543, +15, 0007
14
1
‘+’
‘-’
2
3
digit
digit
digit
FSMs, formally
15
finite set of states
the alphabet
(characters)
start state
final states
transition function
‘+’ ‘-’ digit
1 2 2 3
2 3
3 3
M ≡
L(M) = set of integer literals
FSM example, formally
16
anything else, machine is stuck
s1
s0
a
b
c
s0
s1
M ≡
What is L(M)?
L(M) = {ε, ab, abab, ababab, abababab, …. }
Coding an FSM
curr_state = start_state
done = false
while (!done)
ch = nextChar()
next = table[curr_state][ch]
if (next == stuck || ch == EOF)
done = true
else
curr_state = next
return final_states.contains(curr_state) &&
next!=stuck
17
FSM types: DFA & NFA
Deterministic
no state has >1 outgoing edge with same label
Nondeterministic
states may have multiple outgoing edges with same label
edges may be labelled with special symbol ɛ (empty string)
ɛ-transitions can happen without reading input
18
NFA Example
Language: Integer literals with an optional + or – (token: int-lit)
e.g., -543, +15, 0007
19
1
‘+’
‘-’
2
3
digit
digit
digit
1
‘+’, ‘-’
2
‘ε’
3
digit
digit
A string is accepted by an NFA if there exists a
sequence of transitions leading to a final state
Why NFA?
Simpler and more intuitive than DFA
Language: sequence of 0s and 1s, ending with 00
20
Extra example
A C/C++ identifier is a sequence of one or more letters, digits, or underscores. It cannot start with a digit.
21
Extra Example – Part 1
A C/C++ identifier is a sequence of one or more letters, digits, or underscores. It cannot start with a digit.
22
1
‘_’, letter
2
digit, letter, ‘_’
Extra example
A C/C++ identifier is a sequence of one or more letters, digits, or underscores. It cannot start with a digit.
What if you wanted to add the restriction that it can’t end with an underscore?
23
Extra Example – Part 2
What if you wanted to add the restriction that it can’t end with an underscore?
25
1
‘_’, letter
3
digit, letter, ‘_’
2
digit, letter
letter
Recap
The scanner reads a stream of characters and tokenizes it (i.e., finds tokens)
Tokens are defined using regular expressions, scanners are implemented using FSMs
FSMs can be non-deterministic
Next time: understand connection between DFA and NFA, regular languages and regular expressions
27
automatatutor.com
Loris D’Antoni
Play with automata!