RegExps & DFAs
Pre-class warm up
Copyright By PowCoder代写 加微信 powcoder
Write the regexp for Fortran real literals
An optional sign (‘+’ or ‘-‘)
An integer or:
1 or more digits followed by a ‘.’ followed by 0 or more digits
or: A ‘.’ followed by one or more digits
(‘+’|’-’|ε) (digit+(‘.’|ε) | (digit*’.’digit+))
Explored NFAs
for every NFA there is an equivalent DFA
epsilon edges add no expressive power
Introduce regular languages / expressions
Convert regexps to DFAs
From language recognizers to tokenizers
Regexp to NFAs
Literals/epsilon correspond to simple DFAs
Operators correspond to methods of joining
x^n, where n is even or divisible by 3
Regexp to NFA rules
Rules for operands
Regexp to NFA rules
Rules for alternation A|B
Make new start state q’ and new final state f’
Make original final states non-final
Regexp to NFA rules
Rule for catenation A.B
Make new start state q’ and new final state f’
Make original final states non-final
Regexp to NFA rules
Rule for iteration A*
Make new start state q’ and new final state f’
Make original final states non-final
Regexp operator
precedence
Operator Precedence Analogous math operator
| low addition
. medium multiplication
* high exponentiation
Tree representation of a
Operator Precedence
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Bottom-up conversion
Regexp to DFAs
We now have an NFA
We need to go to DFA
But what’s so great about DFAs?
Table-driven DFAs
Recall that δ can be expressed as a table
This leads to a very efficient array representation
s = start state
while (more input){
c = read char
s = table[s][c]
if s is final, accept
FSMs for tokenization
FSMs only check for language membership of a
the scanner needs to recognize a stream of many different
tokens using the longest match
the scanner needs to know what was matched
Idea: imbue states with actions that will fire
when state is reached
A first cut at actions
Consider the language of Pascal identifiers
BAD: not longest match
Accounting for longest matches
BAD: maybe we needed that
A second take at actions
Give our FSMs ability to put chars back
Our first scanner
Consider a language with two statements
assignments: ID = expr
increments: ID += expr
where expr is of the form
Identifiers ID follow C conventions
Combined DFA
perform action / update state
if (action was to return a token){
start again in start state
} (while not EOF or stuck);
Lexical analyzer generators
aka scanner generators
The transformation from regexp to scanner is
formally defined
Can write tools to synthesize a lexer
automatically
Lex: unix scanner generator
Flex: fast lex
JLex: Java version of Lex
J specification
tell it what you want scanned, it will figure out the rest
Input: set of regexps + associated actions
xyz.jlex file
Output: Java source code for a scanner
xyz.jlex.java source code of scanner
jlex format
3 sections separated by %%
user code section
directives
regular expressions + actions
Rules section
Format is
can use macros from the directive sections in regex, surround with curly braces
Conventions
chars represent themselves (except special characters)
chars inside “” represent themselves (except \)
Regexp operators
| * + ? () .
Character class operators
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com