Monash University Faculty of Information Technology 2nd Semester 2017
FIT2014
Assignment 2
Finite Automata, Context-Free Languages, Lexical analysis, Parsing, Turing machines, Universality
DUE: 11:55pm, Friday 6 October 2017
In these exercises, you will
• implement lexical analysers using lex (Problem 3);
• implement parsers using lex and yacc (Problems 1, 4);
• practise using the Pumping Lemma for regular languages (Problem 2);
• build a Turing machine (Problem 5);
• learn more about Finite Automata and universality (Problems 6–7).
Solutions to Problem 5 must be implemented in the simulator tuataraMonash, which is available on Moodle.
How to manage this assignment
- You should start working on this assignment now, and spread the work over the time until it is due. Aim to do at least three questions before the mid-semester break. Do as much as possible before your week 10 prac class. There will not be time during the class itself to do the assignment from scratch; there will only be time to get some help and clarification.
- Don’t be deterred by the length of this document! Much of it is an extended tutorial to get you started with lex and yacc (pp. 2–5) and documentation for functions, written in C, that are provided for you to use (pp. 5–7); some sample outputs also take up a fair bit of space. Although lex and yacc are new to you, the questions about them only require you to modify some existing input files for them rather than write your own input files from scratch.
Instructions
Instructions are as for Assignment 1, except that some of the filenames have changed. The file to download is now asgn2.tar.gz, and unpacking it will create the directory asgn2 within your FIT2014 directory. You need to construct new lex files, using chain.l as a starting point, for Problems 1, 3 & 4, and you’ll need to construct a new yacc file from chain.y for Problem 4. Your submission must include (as well as the appropriate PDF files for the exercises requiring written solutions):
- a lex file prob1.l which should be obtained by modifying a copy of chain.l
- a lex file prob3.l which should also be obtained by modifying a copy of chain.l
- a lex file prob4.l which should be obtained by modifying a copy of prob3.l
- a yacc file prob4.y which should be obtained by modifying a copy of chain.y
- a Tuatara Turing machine file prob5.tm
- PDF files for the exercises requiring written solutions, namely, prob1.pdf, prob2.pdf, prob6.pdf, and prob7.pdf.
To submit your work, simply enter the command ‘make’ from within the asgn2 directory, and then submit the resulting .tar.gz file to Moodle. As last time, make sure that you have tested the submission mechanism and that you understand the effect of make on your directory tree.
INTRODUCTION: Lex, Yacc and the CHAIN language
In this part of the Assignment, you will use the lexical analyser generator lex or its variant flex, initially by itself, and then with the parser generator yacc.
Some useful references on Lex and Yacc:
• T. Niemann, Lex & Yacc Tutorial, http://epaperpress.com/lexandyacc/
• Doug Brown, John Levine, and Tony Mason, lex and yacc (2nd edn.), O’Reilly, 2012.
• the lex and yacc man pages
We will illustrate the use of these programs with a language CHAIN based on certain expressions involving strings. Then you will use lex and yacc on a language CRYPT of expressions based on cryptographic operations.
CHAIN
The language CHAIN consists of expressions of the following type. An expression consists of a number of terms, with # between each pair of consecutive terms, where each term is either a string of lower-case letters or an application of the Reverse function to such a string. Examples of such expressions include
mala # y # Reverse(mala)
block # drive # cut # pull # hook # sweep # Reverse(sweep)
Reverse(side) # Reverse(direction) # Reverse(gear)
For lexical analysis, we wish to treat every lower-case alphabetical string as a lexeme for the token STRING, and the word Reverse as a lexeme for the token REVERSE.
Lex
An input file to lex is, by convention, given a name ending in .l. Such a file has three parts:
• definitions,
• rules,
• C code.
These are separated by double-percent, %%. Comments begin with /* and end with */. Any comments are ignored when lex is run on the file.
You will find an input file, chain.l, among the files for this Assignment. Study its structure now, identifying the three sections and noticing that various pieces of code have been commented out. Those pieces of code are not needed yet, but some will be needed later.
We focus mainly on the Rules section, in the middle of the file. It consists of a series of statements of the form
pattern { action }
where the pattern is a regular expression and the action consists of instructions, written in C, specifying what to do with text that matches the pattern.1 In our file, each pattern represents a set of possible lexemes which we wish to identify. These are:
• a string of lower-case letters;
– This is taken to be an instance of the token STRING (i.e., a lexeme for that token).
1 This may seem reminiscent of awk, but note that: the pattern is not delimited by slashes, /. . . /, as in awk; the action code is in C, whereas in awk the actions are specified in awk’s own language, which has similarities with C but is not the same; and the action pertains only to the text that matches the pattern, whereas in awk the action pertains to the entire line in which the matching text is found.
2
• the specific string Reverse;
– Such a string is taken to be an instance of the token REVERSE.
• certain specific characters: #, (, );
• white space, being any sequence of spaces and tabs;
• the newline character.
Note that all matching is case-sensitive.
Our action is, in most cases, to print a message saying what token and lexeme have been found.
For white space, we take no action at all. A character that cannot be matched by any pattern yields an error message.
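To make the pattern { action } format concrete, here is a minimal sketch of the kind of rule this Rules section contains (an illustration only; the exact patterns and messages in the provided chain.l may differ):

    [a-z]+    { printf("Token: STRING; Lexeme: %s\n", yytext); }
    Reverse   { printf("Token: REVERSE; Lexeme: %s\n", yytext); }
    [ \t]+    { /* white space: take no action */ }

Here yytext is the buffer in which lex makes the text of the matched lexeme available to the action code.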
If you run lex on the file chain.l, then lex generates the C program lex.yy.c.2 This is the source code for the lexical analyser. You compile it using a C compiler such as cc.
$ flex chain.l
$ cc lex.yy.c
By default, cc puts the executable program in a file called a.out.3 This can be executed in the usual way, by just entering ./a.out at the command line. If you prefer to give the executable program another name, such as chain-lex, then you can tell this to the compiler using the -o option: cc lex.yy.c -o chain-lex.
When you run the program, it will initially wait for you to input a line of text to analyse. Do so, pressing Return at the end of the line. Then the lexical analyser will print, to standard output, messages showing how it has analysed your input. The printing of these messages is done by the printf statements from the file chain.l. Note how it skips over white space, and only reports on the lexemes and tokens.
$ ./a.out
mala # y #Reverse( mala)
Token: STRING; Lexeme: mala
Token and Lexeme: #
Token: STRING; Lexeme: y
Token and Lexeme: #
Token: REVERSE; Lexeme: Reverse
Token and Lexeme: (
Token: STRING; Lexeme: mala
Token and Lexeme: )
Token and Lexeme: <newline>
Try running this program with some input expressions of your own.
Yacc
We now turn to parsing, using yacc.
Consider the following grammar for CHAIN.
S → E
S → ε
E → E # E
E → STRING
E → REVERSE ( STRING )
2 The C program will have this same name, lex.yy.c, regardless of the name you gave to the lex input file.
3 a.out is short for assembler output.
In this grammar, the non-terminals are S and E. Treat STRING and REVERSE as just single tokens, and hence single terminal symbols in this grammar.
We now generate a parser for this grammar, which will also evaluate the expressions, with # interpreted as concatenation and Reverse(. . . ) interpreted as reversing a string.
To generate this parser, you need two files, prob1.l (for lex) and chain.y (for yacc):
• Copy chain.l to a new file prob1.l, and then modify prob1.l as follows:
– in the Definitions section, uncomment the statement #include "y.tab.h";
– in the Rules section, in each action:
  * uncomment the statements of the form
    · yylval.str = ...;
    · return TOKENNAME;
    · return *yytext;
  * comment out the printf statements. These may still be handy if debugging is needed, so don’t delete them altogether, but the lexical analyser’s main role now is to report the tokens and lexemes to the parser, not to the user. (A sketch of what one modified rule might look like appears just after this list.)
– in the C code section, comment out the function main(), which in this case occupies about four lines at the end of the file.
• chain.y, the input file for yacc, is provided for you. You don’t need to modify this yet.
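As an illustration only, a modified STRING rule might end up looking roughly like the sketch below. The statements to uncomment are already present in chain.l and may differ in detail; in particular, the right-hand side of the yylval.str assignment shown here (strdup(yytext)) is just a typical choice, not necessarily what the provided file uses.

    [a-z]+   { yylval.str = strdup(yytext);   /* pass the lexeme to yacc */
               /* printf("Token: STRING; Lexeme: %s\n", yytext); */
               return STRING;                 /* tell the parser which token was seen */
             }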
An input file for yacc is, by convention, given a name ending in .y, and has three parts, very loosely analogous to the three parts of a lex file but very different in their details and functionality:
• Declarations,
• Rules,
• Programs.
These are separated by double-percent, %%. Comments begin with /* and end with */.
Peruse the provided file chain.y, identify its main components, and pay particular attention to the following, since you will need to modify some of them later.
• in the Declarations section:
– lines like
      char *reverse(char *);
      char *simpleSub(char *, char *);
      ...
  which are declarations of functions (but they are defined later, in the Programs section);
– declarations of the tokens to be used:
      %token <str> STRING
      %token <str> REVERSE
– declarations of the nonterminal symbols to be used (which don’t need to start with an upper-case letter):
      %type <str> start
      %type <str> expr
– nomination of which nonterminal is the Start symbol:
      %start start
• in the Rules section, a list of grammar rules in BNF, except that the colon “:” is used instead of →, and there must be a semicolon at the end of each rule. Rules with a common left-hand side may be written in the usual compact form, by listing their right-hand sides separated by vertical bars, and one semicolon at the very end. The terminals may be token names, in which case they must be declared in the Declarations section and also used in the lex file, or single characters enclosed in forward-quote symbols. Each rule has an action, enclosed in braces { ... }. A rule for a Start symbol may print output, but most other rules will have an action of the form $$ = ... . The special variable $$ represents the value to be returned for that rule, and in effect specifies how that rule is to be interpreted for evaluating the expression. The variables $1, $2, ... refer to the values of the first, second, ... symbols in the right-hand side of the rule. (A sketch of one such rule-with-action appears just after this list.)
• in the Programs section, various functions, written in C, that your parsers will be able to use. You do not need to modify these functions, and indeed should not try to do so unless you are an experienced C programmer and know exactly what you are doing! Most of these functions are not used yet; some will only be used later, in Problem 4.
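For instance, a rule-with-action in this format might look like the following sketch, which uses the provided reverse(...) function. This is only an illustration of the notation; the actual rules and actions in chain.y may differ in detail.

    expr : REVERSE '(' STRING ')'   { $$ = reverse($3); }
         ;

Here $3 is the value of the third symbol on the right-hand side (the STRING lexeme passed up from lex), and the string returned by reverse becomes the value of the whole expr.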
After constructing the new lex file prob1.l as above, the parser can be generated by:
$ yacc -d chain.y
$ flex prob1.l
$ cc lex.yy.c y.tab.c
The executable program, which is now a parser for CHAIN, is again named a.out by default, and will replace any other program of that name that happened to be sitting in the same directory.
$ ./a.out
mala # y #Reverse( mala)
malayalam
Run it with some input expressions of your own.
Problem 1. [7 marks]
(a) Construct prob1.l, as described above, so that it can be used with chain.y to build a parser for CHAIN.
(b) Show that the grammar for CHAIN given above is ambiguous.
(c) Find an equivalent grammar (i.e., one that generates the same language) that is not ambiguous.
Cryptographic expressions
A cryptographic calculator performs simple operations on strings of a kind that are used in classical cryptosystems. It is also able to combine these operations in a natural way.
Suppose x = x_1x_2···x_n and k = k_1k_2···k_t are two strings, where x_i and k_i denote the i-th letters of x and k respectively.
The available operations are:
• sum: this is written x + k in our expressions, though our C function that computes it is called sum(...). The resulting string has length min{n, t}, and its i-th letter is (x_i + k_i) mod 26, where the letters of the English alphabet correspond to numbers via a = 0, b = 1, ..., z = 25. (A rough C sketch of this arithmetic, and of the other operations below, appears just after this list of operations.)
4 Malayalam is the main language of the southern Indian state of Kerala. The word was given as an example of a palindrome by a student in the first lecture.
For example,
    x     = thebushwasalivewithexcitement
    k     = mrskoalahadabrandnewbabyandthenewsspreadlikewildfire
    x + k = fywlisswhsdljmejlglaycjrezhga
• difference: this is written x - k, and is computed by the C function diff(...). The resulting string has length min{n, t}, and its i-th letter is (x_i - k_i) mod 26.
• Vigenère cypher: this is written Vigenere(x, k), where x is the plaintext and k is the key. We first concatenate k with itself as many times as necessary in order to make it at least as long as x. Then we form the sum. The result is a string whose i-th letter is (x_i + k_j) mod 26, where j = ((i - 1) mod t) + 1.
For example, if the plaintext is
inaholeinthegroundtherelivedahobbit
and the key is
bilbo
then Vigenere(inaholeinthegroundtherelivedahobbit, bilbo) returns the cyphertext
jvlicmmtohimrscvvouvfzpmwwmobvpjmjh
• Simple Substitution: this is written SimpleSub(x, k). Again, x is plaintext, and k is the key, but this time k must be a permutation of the 26-letter English alphabet, represented as a string in which each letter appears exactly once (and t = 26). Every a in the plaintext is replaced by the 1st letter, k_1, of the key; every b in the plaintext is replaced by the 2nd letter, k_2, of the key; and so on. In general, the plaintext letter x_i is replaced by k_{x_i}.
For example, if the plaintext is
thequickbrownfoxjumpsoverthelazydog
and the key is
qwertyuiopasdfghjklzxcvbnm
then SimpleSub(thequickbrownfoxjumpsoverthelazydog,qwertyuiopasdfghjklzxcvbnm) re-
turns the cyphertext
zitjxoeawkgvfygbpxdhlgctkzitsqmnrgu
• Local Transposition: this is written LocTran(x, k). The letters of the plaintext x are not replaced, as happens in the previous two cyphers, but rather rearranged according to a permutation, which is represented by a string k of w digits, where w = |k|. This permutation string k has length at most 10, and it consists of the digits 0, 1, ..., w - 1 arranged in some order. For example, if k has length 3, then it can be any of the strings 012, 021, 102, 120, 201, 210. The plaintext x is divided into blocks of w letters each, and the letters within each block are permuted according to k. If there are extra letters at the end (too few letters to make up another full block), then these are just copied across, with no change in their positions.
For example, if the plaintext is
thefamilyofdashwood
and the key is
201
then LocTran(thefamilyofdashwood,201) returns the cyphertext
5 w stands for width, the traditional term for the size of a permutation used in local transposition.
ethmfayildofhasowod
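As promised above, here is a rough C sketch of these operations, written only to make the definitions concrete. These are not the official implementations: the real functions, already written for you, are in the Programs section of the yacc file and may differ in names and details. The block-permutation convention in loctran_sketch is the one consistent with the LocTran(thefamilyofdashwood,201) example above.

    #include <stdlib.h>
    #include <string.h>

    /* Sum: letter-by-letter addition mod 26; result length is min(n, t). */
    char *sum_sketch(const char *x, const char *k) {
        size_t n = strlen(x), t = strlen(k), len = (n < t) ? n : t;
        char *z = malloc(len + 1);
        for (size_t i = 0; i < len; i++)
            z[i] = 'a' + ((x[i] - 'a') + (k[i] - 'a')) % 26;
        z[len] = '\0';
        return z;
    }

    /* Vigenere: the key is repeated to the length of the plaintext, then summed. */
    char *vigenere_sketch(const char *x, const char *k) {
        size_t n = strlen(x), t = strlen(k);
        char *z = malloc(n + 1);
        for (size_t i = 0; i < n; i++)
            z[i] = 'a' + ((x[i] - 'a') + (k[i % t] - 'a')) % 26;
        z[n] = '\0';
        return z;
    }

    /* SimpleSub: each plaintext letter indexes into the 26-letter key permutation. */
    char *simplesub_sketch(const char *x, const char *k) {
        size_t n = strlen(x);
        char *z = malloc(n + 1);
        for (size_t i = 0; i < n; i++)
            z[i] = k[x[i] - 'a'];
        z[n] = '\0';
        return z;
    }

    /* LocTran: permute each full block of w letters; copy any leftover letters unchanged. */
    char *loctran_sketch(const char *x, const char *k) {
        size_t n = strlen(x), w = strlen(k), i;
        char *z = malloc(n + 1);
        for (i = 0; i + w <= n; i += w)
            for (size_t j = 0; j < w; j++)
                z[i + j] = x[i + (k[j] - '0')];
        for (; i < n; i++)
            z[i] = x[i];
        z[n] = '\0';
        return z;
    }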
These operations can be combined. Any valid expression can be given as the first argument to one of the cypher functions, or as any argument of a sum or difference, to give another valid expression. So you can form expressions like
Vigenere(LocTran(triantiwontigongolope,3201),bunyip) + muldjewangk
Let CRYPT be the language of cryptographic expressions of this type that can be generated by the following grammar.
S → E
S → ε
E → E + E
E → E - E
E → SIMPLESUB ( E , STRING )
E → VIGENERE ( E , STRING )
E → LOCTRAN ( E , DIGITS )
E → STRING
In this grammar, the non-terminals are S and E. Treat SIMPLESUB, VIGENERE, LOCTRAN, STRING and DIGITS as just single tokens. For SIMPLESUB, VIGENERE, and LOCTRAN, we allow any nonempty prefix of the function name as well as the full name; e.g., S, Si, Sim, . . . , SimpleSu, SimpleSub are all acceptable lexemes for the token SIMPLESUB.
Problem 2. [7 marks]
Prove that CRYPT is not regular.
Problem 3. [7 marks]
Using the file provided for CHAIN as a starting point, construct a lex file, prob3.l, and use it to build a lexical analyser for CRYPT.
Sample output:
$ ./a.out
Loc(Sim(Vig(therewasmovementatthestation,banjo),thequickbrownfxjmpsvlazydg),10)
Token: LOCTRAN; Lexeme: Loc
Token and Lexeme: (
Token: SIMPLESUB; Lexeme: Sim
Token and Lexeme: (
Token: VIGENERE; Lexeme: Vig
Token and Lexeme: (
Token: STRING; Lexeme: therewasmovementatthestation
Token and Lexeme: ,
Token: STRING; Lexeme: banjo
Token and Lexeme: )
Token and Lexeme: ,
Token: STRING; Lexeme: thequickbrownfxjmpsvlazydg
Token and Lexeme: )
Token and Lexeme: ,
Token: DIGITS; Lexeme: 10
Token and Lexeme: )
Token and Lexeme: <newline>
Control-D
$ ./a.out
V(twentysix - eleven, eleven)
Token: VIGENERE; Lexeme: V
Token and Lexeme: (
Token: STRING; Lexeme: twentysix
Token and Lexeme: -
Token: STRING; Lexeme: eleven
Token and Lexeme: ,
Token: STRING; Lexeme: eleven
Token and Lexeme: )
Token and Lexeme: <newline>
Problem 4. [7 marks]
Make a copy of prob3.l, call it prob4.l, then modify it so that it can be used with yacc. Then construct a yacc file prob4.y from chain.y. Then use these lex and yacc files to build a parser for CRYPT.
Note that you do not have to program any of the cryptographic functions yourself. They have already been written: see the Programs section of the yacc file. The actions in your yacc file will need to call these functions, and you can do that by using the function call for reverse(. . . ) in chain.y as a template.
The core of your task is to write the grammar rules in the Rules section, in yacc format, with associated actions, using the examples in chain.y as a guide. You also need to do some modifications in the Declarations section to declare all tokens, using %token, and declare all nonterminal symbols, using %type. See page 4.
Note: you should still use start as your Start symbol. If you use another name instead, you will need to modify the %start line too.
Sample output:
$ ./a.out
Loc(Sim(Vig(therewasmovementatthestation,banjo),thequickbrownfxjmpsvlazydg),10)
kltpysiteauzfglhctaesircrktx
Control-D
$ ./a.out
V(twentysix - eleven, eleven)
twenty
The binary Vigenère cypher
The binary Vigenère cypher is just like the Vigenère cypher described above, except that it works with the binary alphabet and uses arithmetic mod 2 instead of mod 26. Given binary strings x = x_1x_2···x_n as plaintext and k = k_1k_2···k_t as key (with the key repeated as needed, as in the mod-26 case), the cyphertext z = z_1z_2···z_n is given by

    z_i = (x_i + k_i) mod 2,   for i = 1, ..., n.
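For example (a small worked case, with the key repeated to the length of the plaintext): if x = 10110 and k = 110, then the repeated key is 11011 and z = 01101, since each output bit is the mod-2 sum of the corresponding bits of x and the repeated key.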
Problem 5. [8 marks]
In this question, the Turing machine you build in Tuatara must have q1 as its Start State and q2 as its Accept State.
Build a Turing machine to compute the binary Vigenère cypher. Its input has the form xyk, where the character y is used simply to separate the binary strings x and k. The output consists simply of the cyphertext z, a binary string of n bits; no y and no key bits k should remain after the cyphertext.
Your work on Tute 5, Q8, should help you get started on this Problem.
Multistart Finite Automata
A Multistart Finite Automaton (MFA) is an ordinary (deterministic) FA, except for the following alterations.
- It may have any number of Start states.
- At the start of computation, before reading any characters of the input string, it chooses any
Start state, arbitrarily, and then proceeds with normal deterministic FA computation.
- An input string is accepted if there is some Start state of the MFA such that, if computation starts in that state, it finishes in a Final state. It is rejected if, no matter what Start state is used, the computation always finishes in a non-Final state.
The language accepted by an MFA is the set of strings accepted by it.
Encoding Finite Automata
We can encode Finite Automata as strings, somewhat similar in spirit to the Code-Word Language (CWL) we used for Turing Machines (see Lecture 15), although the details are quite different.
Suppose that each FA is encoded as follows. Assume the states are numbered 1, ..., N, and that state 1 is the Start state. For each state i, we give two numbers α_i, β_i followed by a single letter f_i. The numbers α_i, β_i indicate the state the FA goes to when the input is a or b, respectively. These two numbers are represented in unary (so a number x becomes the string a^x, i.e., x copies of a), with a single b after each of them. Then the single letter f_i ∈ {a, b} indicates whether or not state i is a Final state:

    f_i = a, if state i is a Final state,
    f_i = b, if state i is not a Final state.

If a number α_i or β_i is > N, then we interpret it as representing state N. So state N can be represented in more than one way, and so any FA has more than one representation in this scheme.
For example, consider the first FA given in Lecture 6, which has the following table.

                  a   b
    Start   1     2   1
            2     3   1
    Final   3     3   3
We may represent its three rows by the strings aababb, aaababb, aaabaaaba. So the entire FA is represented by the string

aababbaaababbaaabaaaba

Problem 6. [6 marks]
Prove that a language is regular if and only if it is accepted by a Multistart FA.
Let FA-REP be the language of all strings that represent FAs according to this convention.
Universality
In Lecture 15, you met Universal Turing Machines (UTMs). We had never mentioned the idea of universality when discussing our earlier, simpler computational models. We now investigate the relationship between universality and Finite Automata.
A Universal Finite Automaton (UFA) is a Finite Automaton that accepts the language of strings of the form
s#x
where s ∈ FA-REP encodes some Finite Automaton A in the manner described above, x is a string accepted by A, and # does not appear in the input alphabet for A (since it is used as a separator to distinguish between the strings s and x).
Problem 7. [8 marks]
(a) Show that FA-REP is regular by giving a regular expression for it.
(b) Does a UFA exist? Give a proof for your answer.
(It should help to use Problem 6 at some point.)
References
- Jane Austen, Sense and Sensibility, Thomas Egerton, London, 1811.
- A. B. ‘Banjo’ Paterson, The Man from Snowy River, poem first published in The Bulletin on 26 April 1890.
- J. R. R. Tolkien, The Hobbit, Allen & Unwin, London, 1937.
- Dorothy Wall, Blinky Bill, Angus & Robertson, Sydney, 1933.