程序代写代做代考 assembly c/c++ compiler C Finite-State

Finite-State
Machines (FSMs)
CS 536

Some announcements
P1
TA office hours

Last time
A compiler is a
recognizer of language S (Source) a translator from S to T (Target)
a program in language H (Host)
For example, gcc: S is C, T is x86, H is C
3

Last time
Why do we need a compiler?
• Processors can execute only binaries (machine-code/assembly programs)
• Writing assembly programs will make you lose your mind
• Write programs in a nice(ish) high-level language like C; compile to binaries

Last time
front end = understand source code S IR = intermediate representation
back end = map IR to T
5

Last time
P1
P2 P3
P4, P5
Symbol table
P6
front end back end
6

Special linkage between scanner and parser in most compilers
Source Program
Sequence of characters
lexical analyzer (scanner)
Sequence of tokens
syntax analyzer (parser)
syntax analyzer (parser)
lexical analyzer (scanner)
source code
next token, please
a
< = p Conceptual organization ... ... The scanner Translates sequence of chars into a sequence of tokens (ignoring whitespace) a = 2 * b + abs(-71) ident asgn int lit times ident plus ident lparens int lit rparens (a) (2) (b) (abs) (-71) Each time the scanner is called it should: • find the longest prefix (lexeme) of the remaining input that corresponds to a token • return that token 8 How to create a scanner? • For every possible lexeme that can occur in source program, return corresponding token • Inefficient • Error-prone Scanner generator • Generates a scanner • Inputs: - one regular expression for each token - one regular expressions for each item to ignore (comments, whitespace, etc.) • Output: scanner program • How does a scanner generator work? - Finite-state machines (FSMs) 10 FSMs: Finite State Machines (A.k.a. finite automata, finite-state automata, etc.) Input: string (sequence of chars) Output: accept / reject i.e., input is legal in language Language defined by an FSM is the set of strings accepted by the FSM 11 Example 1 Language: single line comments with // • Nodes are states • Edges are transitions • Start state has an arrow (only one start state) • Final states are double circles (one or more) 12 Example 1 Language: single line comments with // 1. “// this is a comment.” 2. “/ / this is not.” 3. “// \n” 4. “Not // a comment” 13 Example 2 Language: Integer literals with an optional + or – (token: int-lit) e.g., -543, +15, 0007 digit ‘+’ 12 ‘-’ digit 3 digit 14 FSMs, formally M≡ finite set of states L(M) = set of integer literals the alphabet final states (characters) transition function start state ‘+’ ‘-’ digit 1 2 2 3 2 3 3 3 15 FSM example, formally M≡ What is L(M)? L(M) = {ε, ab, abab, ababab, abababab, .... } s0 s1 anything else, machine is stuck abc s1 s0 16 Coding an FSM curr_state = start_state done = false while (!done) ch = nextChar() next = table[curr_state][ch] if (next == stuck || ch == EOF) done = true else curr_state = next return final_states.contains(curr_state) && next!=stuck 17 FSM types: DFA & NFA Deterministic no state has >1 outgoing edge with same label
Nondeterministic
states may have multiple outgoing edges with same label
edges may be labelled with special symbol ɛ (empty string) ɛ-transitions can happen without reading input
18

NFA Example
Language: Integer literals with an optional + or – (token: int-lit)
e.g., -543, +15, 0007
digit
digit
digit ‘+’
33
digit ‘+’, ‘-’ digit
1212
‘-’ ‘ε’
A string is accepted by an NFA if there exists a sequence of transitions leading to a final state
19

Why NFA?
Simpler and more intuitive than DFA Language: sequence of 0s and 1s, ending with 00
20

Extra example
A C/C++ identifier is a sequence of one or more letters, digits, or underscores. It cannot start with a digit.
21

Extra Example – Part 1
A C/C++ identifier is a sequence of one or more letters, digits, or underscores. It cannot start with a digit.
digit, letter, ‘_’ 1 ‘_’, letter 2
22

Extra example
A C/C++ identifier is a sequence of one or more letters, digits, or underscores. It cannot start with a digit.
What if you wanted to add the restriction that it can’t end with an underscore?
23

Extra Example – Part 2
What if you wanted to add the restriction that it can’t end with an underscore?
digit, letter, ‘_’
1 ‘_’, letter 2 digit, letter 3 letter
24

Recap
The scanner reads a stream of characters and tokenizes it (i.e., finds tokens)
Tokens are defined using regular expressions, scanners are implemented using FSMs
FSMs can be non-deterministic
Next time: understand connection between DFA and NFA, regular languages and regular expressions
25

Play with automata!
automatatutor.com
Loris D’Antoni