RegExps & DFAs
CS 536
Pre-class warm up
Write the regexp for Fortran real literals
An optional sign (¡®+¡¯ or ¡®-¡®)
An integer or:
1 or more digits followed by a ¡®.¡¯ followed by 0 or more digits or: A ¡®.¡¯ followed by one or more digits
(¡®+¡¯|¡¯-¡¯|¦Å) (digit+(¡®.¡¯|¦Å) | (digit*¡¯.¡¯digit+))
2
Last time
Explored NFAs
for every NFA there is an equivalent DFA epsilon edges add no expressive power
Introduce regular languages / expressions
3
Today
Convert regexps to DFAs
From language recognizers to tokenizers
4
Regexp to NFAs
Literals/epsilon correspond to simple DFAs
Operators correspond to methods of joining
x^n, where n is even or divisible by 3
DFAs
5
Regexp to NFA rules
Rules for operands
6
Regexp to NFA rules
Rules for alternation A|B
Make new start state q¡¯ and new final state f¡¯
Make original final states non-final
Add to ¦Ä:
q¡¯,¦Å ¡ú qA q¡¯,¦Å ¡ú qB Fa,¦Å¡úf¡¯ Fb,¦Å¡úf¡¯
7
Regexp to NFA rules
Rule for catenation A.B
Make new start state q¡¯ and new final state f¡¯
Make original final states non-final
Add to ¦Ä:
q¡¯,¦Å ¡ú qA fA,¦Å ¡ú qB fb,¦Å¡úf¡¯
8
Regexp to NFA rules
Rule for iteration A*
Make new start state q¡¯ and new final state f¡¯
Make original final states non-final
Add to ¦Ä:
q¡¯,¦Å ¡ú qA q¡¯,¦Å¡úf¡¯ f¡¯,¦Å¡úqA
9
Regexp operator precedence
Operator Precedence
Analogous math operator
|
low
addition
.
medium multiplication
*
high
exponentiation
10
Tree representation of a regexp
Operator Precedence
| *
low
.
medium
high
11
Bottom-up conversion
12
Bottom-up conversion
13
Bottom-up conversion
14
Bottom-up conversion
15
Bottom-up conversion
16
Bottom-up conversion
17
Bottom-up conversion
18
Bottom-up conversion
19
Bottom-up conversion
20
Regexp to DFAs
We now have an NFA We need to go to DFA
But what¡¯s so great about DFAs?
21
Table-driven DFAs
Recall that ¦Ä can be expressed as a table
This leads to a very efficient array representation
s = start state
while (more input){
c = read char
s = table[s][c]
}
if s is final, accept
22
FSMs for tokenization
FSMs only check for language membership of a string
the scanner needs to recognize a stream of many different tokens using the longest match
the scanner needs to know what was matched
Idea: imbue states with actions that will fire when state is reached
23
A first cut at actions
Consider the language of Pascal identifiers
BAD: not longest match
Accounting for longest matches
BAD: maybe we needed that character
24
A second take at actions
Give our FSMs ability to put chars back
25
Our first scanner
Consider a language with two statements assignments: ID = expr
increments: ID += expr
where expr is of the form ID + ID
ID ^ ID
ID < ID
ID <= ID
Identifiers ID follow C conventions
26
Combined DFA
27
do{
read char
perform action / update state
if (action was to return a token){
start again in start state
}
} (while not EOF or stuck);
29
Lexical analyzer generators
aka scanner generators
The transformation from regexp to scanner is formally defined
Can write tools to synthesize a lexer automatically
Lex: unix scanner generator Flex: fast lex
JLex: Java version of Lex
30
JLex
Declarative specification
tell it what you want scanned, it will figure out the rest
Input: set of regexps + associated actions xyz.jlex file
Output: Java source code for a scanner xyz.jlex.java source code of scanner
31
jlex format
3 sections separated by %%
user code section
directives
regular expressions + actions
32
33
Rules section
Format is
can use macros from the directive sections in regex, surround with curly braces
Conventions
chars represent themselves (except special characters) chars inside ¡°¡± represent themselves (except \)
Regexp operators | * + ? () .
Character class operators – range
^ not
\ escape
34
35