RegExps & DFAs
RegExps & DFAs
CS 536
Pre-class warm up
Write the regexp for Fortran real literals
An optional sign (‘+’ or ‘-‘)
An integer or:
1 or more digits followed by a ‘.’ followed by 0 or more digits
or: A ‘.’ followed by one or more digits
(‘+’|’-’|ε) (digit+(‘.’|ε) | (digit*’.’digit+))
2
Last time
Explored NFAs
for every NFA there is an equivalent DFA
epsilon edges add no expressive power
Introduce regular languages / expressions
3
Today
Convert regexps to DFAs
From language recognizers to tokenizers
4
Regexp to NFAs
Literals/epsilon correspond to simple DFAs
Operators correspond to methods of joining DFAs
5
x^n, where n is even or divisible by 3
Regexp to NFA rules
Rules for operands
6
Regexp to NFA rules
Rules for alternation A|B
7
Make new start state q’ and new final state f’
Make original final states non-final
Add to δ:
q’,ε → qA
q’,ε → qB
Fa,ε→f’
Fb,ε→f’
Regexp to NFA rules
Rule for catenation A.B
8
Make new start state q’ and new final state f’
Make original final states non-final
Add to δ:
q’,ε → qA
fA,ε → qB
fb,ε→f’
Regexp to NFA rules
Rule for iteration A*
9
Make new start state q’ and new final state f’
Make original final states non-final
Add to δ:
q’,ε → qA
q’,ε→f’
f’,ε→qA
Regexp operator precedence
10
Operator Precedence Analogous math operator
| low addition
. medium multiplication
* high exponentiation
Tree representation of a regexp
11
Operator Precedence
| low
. medium
* high
Bottom-up conversion
12
Bottom-up conversion
13
Bottom-up conversion
14
Bottom-up conversion
15
Bottom-up conversion
16
Bottom-up conversion
17
Bottom-up conversion
18
Bottom-up conversion
19
Bottom-up conversion
20
Regexp to DFAs
We now have an NFA
We need to go to DFA
21
But what’s so great about DFAs?
Table-driven DFAs
Recall that δ can be expressed as a table
This leads to a very efficient array representation
22
s = start state
while (more input){
c = read char
s = table[s][c]
}
if s is final, accept
FSMs for tokenization
FSMs only check for language membership of a string
the scanner needs to recognize a stream of many different tokens using the longest match
the scanner needs to know what was matched
Idea: imbue states with actions that will fire when state is reached
23
A first cut at actions
Consider the language of Pascal identifiers
24
BAD: not longest match
Accounting for longest matches
BAD: maybe we needed that
character
A second take at actions
Give our FSMs ability to put chars back
25
Our first scanner
Consider a language with two statements
assignments: ID = expr
increments: ID += expr
where expr is of the form
ID + ID
ID ^ ID
ID < ID
ID <= ID
Identifiers ID follow C conventions
26
Combined DFA
27
29
do{
read char
perform action / update state
if (action was to return a token){
start again in start state
}
} (while not EOF or stuck);
Lexical analyzer generators
aka scanner generators
The transformation from regexp to scanner is formally defined
Can write tools to synthesize a lexer automatically
Lex: unix scanner generator
Flex: fast lex
JLex: Java version of Lex
30
JLex
Declarative specification
tell it what you want scanned, it will figure out the rest
Input: set of regexps + associated actions
xyz.jlex file
Output: Java source code for a scanner
xyz.jlex.java source code of scanner
31
jlex format
3 sections separated by %%
user code section
directives
regular expressions + actions
32
33
Rules section
Format is
can use macros from the directive sections in regex, surround with curly braces
Conventions
chars represent themselves (except special characters)
chars inside “” represent themselves (except \)
Regexp operators
| * + ? () .
Character class operators
– range
^ not
\ escape
34
35