Finite-State
Machines (FSMs)
CS 536
Some announcements
P1
TA office hours
3
A compiler is a
recognizer of language S (Source)
a translator from S to T (Target)
a program in language H (Host)
For example, gcc: S is C, T is x86, H is C
Last time
Why do we need a compiler?
• Processors can execute only binaries
(machine-code/assembly programs)
• Writing assembly programs will make you
lose your mind
• Write programs in a nice(ish) high-level
language like C; compile to binaries
Last time
5
front end = understand source code S
IR = intermediate representation
back end = map IR to T
Last time
6
front end
back end
Symbol
table
P1
P2
P3
P4, P5
P6
Last time
Special linkage between scanner
and parser in most compilers
Source
Program
lexical analyzer
(scanner)
syntax analyzer
(parser)
Sequence of characters
Sequence of tokens
…
Conceptual
organization
syntax analyzer
(parser)
…
next token,
please
source code
a < = p
lexical analyzer
(scanner)
The scanner
Translates sequence of chars into a sequence of
tokens (ignoring whitespace)
Each time the scanner is called it should:
• find the longest prefix (lexeme) of the remaining input that corresponds to a
token
• return that token
8
a = 2 * b + abs(-71)
ident
(a)
asgn int lit
(2)
times ident
(b)
plus ident
(abs)
lparens int lit
(-71)
rparens
How to create a scanner?
• For every possible lexeme that can occur in
source program, return corresponding token
• Inefficient
• Error-prone
Scanner generator
• Generates a scanner
• Inputs:
- one regular expression for each token
- one regular expressions for each item to ignore
(comments, whitespace, etc.)
• Output: scanner program
• How does a scanner generator work?
- Finite-state machines (FSMs)
10
FSMs: Finite State Machines
(A.k.a. finite automata, finite-state automata, etc.)
Input: string (sequence of chars)
Output: accept / reject
i.e., input is legal in language
Language defined by an FSM is the set of strings
accepted by the FSM
11
Example 1
Language: single line comments with //
12
• Nodes are states
• Edges are transitions
• Start state has an arrow (only one start state)
• Final states are double circles (one or more)
Example 1
Language: single line comments with //
13
1. “// this is a comment.”
2. “/ / this is not.”
3. “// \n”
4. “Not // a comment”
Example 2
Language: Integer literals with an optional + or –
(token: int-lit)
e.g., -543, +15, 0007
14
1
‘+’
‘-’
2
3
digit
digit
digit
FSMs, formally
15
finite set of states
the alphabet
(characters) start state
final states
transition function
‘+’ ‘-’ digit
1 2 2 3
2 3
3 3
M ≡
L(M) = set of integer literals
FSM example, formally
16
anything else, machine is stuck
s1
s0
a b c
s0
s1
M ≡
What is L(M)?
L(M) = {ε, ab, abab, ababab, abababab, …. }
Coding an FSM
curr_state = start_state
done = false
while (!done)
ch = nextChar()
next = table[curr_state][ch]
if (next == stuck || ch == EOF)
done = true
else
curr_state = next
return final_states.contains(curr_state) &&
next!=stuck
17
FSM types: DFA & NFA
Deterministic
no state has >1 outgoing edge with same label
Nondeterministic
states may have multiple outgoing edges with same label
edges may be labelled with special symbol ɛ (empty string)
ɛ-transitions can happen without reading input
18
NFA Example
Language: Integer literals with an optional + or –
(token: int-lit)
e.g., -543, +15, 0007
19
1
‘+’
‘-’
2
3
digit
digit
digit
1
‘+’, ‘-’
2
‘ε’
3
digit
digit
A string is accepted by an NFA if there exists a
sequence of transitions leading to a final state
Why NFA?
Simpler and more intuitive than DFA
Language: sequence of 0s and 1s, ending with 00
20
Extra example
A C/C++ identifier is a sequence of one or more
letters, digits, or underscores. It cannot start
with a digit.
21
Extra Example – Part 1
A C/C++ identifier is a sequence of one or more
letters, digits, or underscores. It cannot start
with a digit.
22
1
‘_’, letter
2
digit, letter, ‘_’
Extra example
A C/C++ identifier is a sequence of one or more
letters, digits, or underscores. It cannot start
with a digit.
What if you wanted to add the restriction that it
can’t end with an underscore?
23
Extra Example – Part 2
What if you wanted to add the restriction that it
can’t end with an underscore?
24
1
‘_’, letter
3
digit, letter,
‘_’
2
digit, letter
letter
Recap
The scanner reads a stream of characters and
tokenizes it (i.e., finds tokens)
Tokens are defined using regular expressions,
scanners are implemented using FSMs
FSMs can be non-deterministic
Next time: understand connection between
DFA and NFA, regular languages and regular
expressions
25
automatatutor.com
Loris D’Antoni
Play with automata!
http://automatatutor.com