COMP0012 Lexing and Parsing Coursework, Part 2
Submission Deadline Friday 6th March 2020 @ 11:55AM
The goal of this parsing coursework is to build a lexer and parser for the Z ̃λ programming language. Use JFlex 1.6.1 and Cup version 11b-20160615, using only the specified versions, to automatically generate code for your scanner and parser1. This coursework broadly requires writing regular expressions covering all legal words in the language (Lexer.lex), and a context free grammar describing its rules (Parser.cup).
You must work on this coursework individually and any kind of collaboration, including sharing your own tests with others, is strictly forbidden. You will get a single mark, comprising 8% of your mark for the COMP0012 module. Please submit your work (JFLex/CUP specifications) before Friday 6th March 2020 @ 11:55AM.
Detailed submission instructions are given at the end of the document. “There will always be noise on the line.”
—Engineering folklore
Despite our best efforts, this project description contains errors2. Part of this coursework is then to start to develop the skills you will need to cope with such problems now. First, you think hard about issues you discover (of which only a small subset will be errors in this problem statement) and make a reasonable decision and document your decision and its justification accompanying your submission. Second, you can ask stakeholders for clarification.
1 Interpreting the Specification
Throughout your career in IT, you will have to contend with interpreting and understanding specifications. If you fail a test case because of some ambiguities in the specification, you must go on to explain why it is a problem and justify how you decide to resolve it. When marking, we will consider the issues you discover; if your justification is sound, you will get full marks for the relevant test cases.
We have numbered each paragraph, table and program listing in this specification. For each issue that you find, make a note in the following format:
Paragraph: 72
Problem: it is not clear whether we can omit both the start and end indices
in sequence slicing, like “foo := bar[:]”.
Our solution: this is possible in many other languages that support list
slicing (like Python), so our compiler accepts this syntax.
Paragraph: 99
Problem: the spec has an assignment using the equality operator,
“foo = bar;”.
Our solution: we think this is a mistake, and our compiler does not accept
this statement.
Call this file ambiguities.txt and turn it along with your implementation of the parser.
1Section 4 explains why I have imposed these constraints on the permitted versions of these tools.
2Nonetheless, this document is already a clearer specification than any you will find in your subsequent careers in industrial IT.
1
2 TheZ ̃λLanguage
§1 You are to build a lexer and a parser for Z ̃λ .
§2 A program in Z ̃λ consists of a list of declarations which defines global variables and new data types as well as
functions. This list cannot be empty: it must have a definition for the special main function. The execution of a Z ̃λ program starts from main.
§3 Z ̃λ has two types of comments. First, any text, other than a newline, following # to the end of the line is a comment. Second, any text, including newlines, enclosed within /# … #/ is a comment, and may span multiple lines.
§4 An identifier starts with a letter, followed by an arbitrary number of underscores, letters, or digits. Identifiers are case-sensitive. Punctuation other than underscore is not allowed.
§5 Z ̃λ has anonymous functions in the form of lambda expressions, as in languages like Python and Haskell. A lambda expression is introduced by the keyword lambda and is followed by a non-empty list of comma-separated variables. These variables are assumed to be type top and do not take type annotations. The body of the lambda expressioncomesafterthe:symbol,i.e.f:int := lambda x, y : x + y + 5;,wherethetypef:int representstheexpectedreturntypeofx + y + 5.Lambdaboundvariabletypesaredisambiguatedatruntime, allowing Z ̃λ to have an element of dynamic typing.
§6 Z ̃λ has support for Python-like list comprehensions. A list comprehension is surrounded by square brackets andusesthebuiltinrangefunction.Theyhavetheform[x * 2 for x in range(10)]whereforandin are keywords. range is a function that generates an integer sequence. It can take up to three arguments, but must be called with at least one. These arguments are either integer literals, or variables, with the first argument being the starting point, the second the terminating point, and the third the increment. This is similar to the range function inPython.range(10)buildsanintegersequencestartingfrom0;range(1, 10)buildsfrom1;andrange (1, 20, step)buildsalistfrom1to20inincrementsofstep,auser-definedvariable.Anoptionalboolean valuedfilter,startingwithif,canalsobeused[x * 3 for x in range(100), if x / 2 = 5].The if expression must follow a comma, and is not terminated with fi.
§7 A character is a single letter, punctuation symbol, or digit wrapped in ’’ and has type char. The allowed punctuation symbols are space (See http://en.wikipedia.org/wiki/Punctuation) and the ASCII symbols, other than digits, on this page http://www.kerryr.net/pioneers/ascii3.htm.
§8 The boolean constants are T and F and have type bool. T and F cannot be used as identifiers.
§9 Numbers are integers (type int), rationals (type rat), or floats (type float). Negative numbers are
represented by the ’-’ symbol before the digits. Examples of integers include 1 and -1234; examples of rationals include 1/3 and -345_11/3; examples of floats are -0.1 and 3.14.
§10 Sequences (type seq) are ordered containers of elements. Sequences have nonnegative length. A sequence hasatype:itsdeclarationspecifiesthetypeofelementsitcontains.Forinstance,l : seq
§11 Z ̃λ sequences support the standard indexing syntax. For any sequence s, the operator len(s) : seq → N returns the length of s and the indices of s range from 0 to len(s)-1. The expression s[index] returns the element in s at index. String literals are syntactic sugar for character sequences, so “abc” is [ ’a’, ’b’, ’c’]. For the sequence s : seq
2
Kind
Boolean Numeric
Defined Over
bool int,rat,float
Syntax
!, &&, ||
+ – * / ^
in, ::, len(s), s[i], s[i:j], s[i:], s[:i]
< <= = !=
Primitive Data Types Aggregate Data Types Literal
Sequence seq
Comparison
int,rat,float bool, int, rat, float
bool, int, rat, float, char seq
string
Table 1: Z ̃λ data types.
Table 2: Z ̃λ operators.
§12 Sequences in Z ̃λ also support sequence slicing as in languages like Python or Ruby: id[i:j] returns another sequence, which is a subsequence of id starting at id[i] and ending at id[j]. Given a := [1,2,3,4,5], a[1:3] is [2,3,4]. When the start index is not given, it implies that the subsequence starts from index 0 of the original sequence (e.g., a[:2] is [1,2]). Similarly, when the end index is not given, the subsequence ends with the last element of the original sequence (e.g., a[3:] is [4,5]). Finally, indices can be negative, in which case its value is determined by counting backwards from the end of the original sequence: a[2:-1] is equivalent to a[2:a.len-1] and, therefore, is [3,4,5], while s[-2] is 4. The lower index in a slice must be positive and smaller than the upper index, after the upper index has been subtracted from the sequences length if it was negative.
§13 Table 1 defines Z ̃λ ’s builtin data types. In Table 2, “!” denotes logical not, “&&” logical and, and “||” logical or, as is typical in the C language family. Note that “=” is referential equality and “:=” is the assignment operatorinZ ̃λ.Theinoperatorcheckswhetheranelement(key)ispresentinasequence,asin2 in [1,2,3] andreturnsaboolean.Notethatinonlyoperatesontheoutermostsequence:3 in [[1],[2],[3]]isF,or false. “::” operator denotes concatenation, “s[i]” returns the ith entry in s and “len(s)” returns the length of s as defined in the discussion of sequences and their indexing above.
2.1 Declarations
§14 The syntax of field or variable declaration is “id : type”. A data type declaration is
tdef type_id { declaration_list } ;
where declaration_list is a comma-separated list of field/variable declarations. Once declared, a data type
can be used as a type in subsequent declarations.
§15 For readability, Z ̃λ supports type aliasing: the directive “alias old_name new_name ;” can appear
in a declaration list and allows the use of new_name in place of old_name.
alias seq
tdef person { name : string, surname : string, age : int };
tdef family { mother : person, father : person, children : seq
Listing 1: Z ̃λ data type declaration examples.
§16 Listing 2 shows the syntax of function declarations in Z ̃λ . Specifically, a function’s name is an identifier. The formal_parameter_list separates parameter declarations, which follow the variable/field declaration syntaxid : type,withcommas.Afunction’sbodycannotbeempty;itconsistsoflocalvariabledeclarations,if
3
fdef name (formal_parameter_list) { body } : returnType ; fdef name (formal_parameter_list) { body } ;
Listing 2: Z ̃λ function declaration syntax.
Assumes“person p”previouslydeclared
p.age + 10
b – foo(sum(10, c),
s1 :: s2 :: [1,2]
bar()) = 30
Illustrates method calls Assumess1ands2havetypeseq
Table 3: Z ̃λ expression examples.
any, followed by statements. The return type of a function is returnType; it can be omitted when the function does not return a value. The main function does not return a value; that is, return statements in main cannot have a value.
2.2 Expressions
§17 Z ̃λ expressions are applications of the operators defined above. Parentheses enforce precedence. For user-defined data type definitions, field references are expressions and have the form id.field. Function calls are expressions; the actual parameters of function calls are also expressions that, in the semantic phase (i.e. not this coursework), would be required to produce a type that can unify with the type of their parameter. Table 3 contains example expressions.
2.3 Statements
§18 In Table 4, var indicates a variable. An expression_list is a comma-separated list of expressions. As above, a body consists of local variable declarations (if any), followed by statements. Statements, apart from if-else and loop, terminate with a semicolon. The return statement appears in a function body, where it is optional. In any if statement, there can be zero or one else branch.
Assignment Input
Output Function Call
Control Flow
var := expression ;
read var ;
print expression ;
functionId ( expression_list );
if ( expression )then body fi
if ( expression ) then body else body fi loop body pool
break N; # N is optional and defaults to 1. return expression ;
Table 4: Z ̃λ statements.
§19 Variables may be initialised at the time of declaration: “id : type := init ;”. For newly defined data types, initialisation consists of a sequence of comma-separated values, each of which is assigned to the data type fields in the order of declaration. Listing 3 contains examples.
§20 The statement read var; reads a value from the standard input and stores it in var; the statement print prints evaluation of its expression parameter, followed by a newline.
§21 The if statement behaves like that in the C family language. The unguarded loop statement is the only explicitloopconstructinZ ̃λ.Toexitaloop,onemustusebreak N,usuallycoupledwithanifstatement;the optional argument N is a positive integer that specifies the number of nested loops to exit and defaults to 1. The use of break statement is forbidden outside a loop. Listing 4 shows how to use loop and break.
4
b : int := 10;
c : string := “hello world!”;
d : person := “Shin”, “Yoo”, 30;
e : char := ’a’;
f : seq
Listing 3: Z ̃λ variable declaration and initialization examples.
a : seq
b : seq
c : seq
j : int := 0;
loop
if (2 < i) then break;
fi loop
if (2 < j) then
break; # break to the outer loop
fi
if (b[j] < a[i]) then
break 2; # break out of two loops fi
j := j + 1;
pool
i := i + 1;
j := 0;
pool
Listing 4: Z ̃λ loop example.
Listing 5 shows an example program in Z ̃λ which defines a function reverse after main.
Error Handling
§22
3 §23
§24 The provided SC class uses a boolean field syntaxErrors of the parser object to decide whether parsing is successful. So please find such a public field in the Parser class and set it to true when a syntax error is generated.
Your parser will be tested against a test suite of positive and negative tests. This testing is scripted; so it is important for your output to match what the script expects.
4 §25
Submission Requirements and Instructions
Your scanner (lexer) must
• Use JLex (or JFlex) to automatically generate a scanner for the Z ̃λ language;
• Make use of macro definitions where necessary. Choose meaningful token type names to make your specification readable and understandable;
5
• •
• • •
main { # Main is not necessarily last.
a : seq
b : seq
};
fdef reverse (inseq : seq
i : int := 0;
loop
if (len(inseq) <= i) then break;
fi
outseq := inseq[i] :: outseq;
i := i + 1;
pool return outseq;
} : seq
Listing 5: Z ̃λ example program. Ignore whitespace and comments; and
Report the line and the column (offset into the line) where an error, usually unexpected input, first occurred. Use the code in Section 3, which specifies the format that will be matched by the grading script.
Your parser must
Use CUP to automatically produce a parser for the Z ̃λ language;
Resolve ambiguities in expressions using the precedence and associativity rules;
Print “Parsing successful.”, followed by a newline, if the program is syntactically correct.
Your scanner and parser must work together.
§26
§27 §28
§29 I have provided a Makefile on Moodle. This makefile must build your project, from source, when make is issued,
• using JFlex 1.6.1
• using Cup version 11b-20160615 • using Java SE 8 Update 242
• on CentOS 7.6
§30 If your submission fails to build using this Makefile with these versions of the tools and on the specified operating system, your mark will be zero.
§31 The provided makefile has a test rule. The marking script will call this rule to run your parser against a set of test cases. This set of test cases includes public test cases provided via Moodle and private ones; they include positive tests, on which your parser must emit “Parsing successful.” followed by a newline and nothing else, and negative tests on which your parser must emit the correct line and column of the error, as specified in Section 3 above. Your mark will be the number of positive tests cases you correctly pass and the number of negative test cases you correctly fail divided by the total number of test cases.
Once the scanner and parser have been successfully produced using JFlex and CUP, use the provided SC class to test your code on the test files given on the course webpage.
6
§32 Each student must use the automatic testing system (detailed in Section 5) to submit the coursework.
§33 The deadline for this coursework is Friday 6th March 2020 @ 11:55AM. We strictly adhere to UCL Academic Manual – Chapter 4: Assessment Framework for Taught Programmes and therefore impose the university’s late submission penalties (visit this page for more information). However, COMPILERBOTwill reject any submissions 5 working days after the deadline. If you still would like to submit after that, please attach your code in an email to Zheng Gao
§34 To distribute coursework workload more evenly across the academic year, the department has implemented a policy for spacing deadlines. As a result, you are getting more time to complete this coursework than the students from previous years. COMPILERBOTwill start accepting submissions from Friday 14th February 2020 @ 11:55AM.
5 Automatic Testing Guidelines
Your coursework should be developed using Git. We have created bare git repositories for each of you on a department server. Whenever you push a commit to them, a test suite consisting of a set of public (already given to you) and private test cases is run and then the results of the test run will be emailed to you. You can push your work to the repository as many times as you like before the deadline. All pushes will be automatically rejected 5 working days after the deadline. We will mark the last commit that you have pushed. We advise you to test access to your repository as soon as possible and to push your final commit well before the deadline.
5.1 SSH access to the Computer Science department
Ensure that you can SSH into the department before continuing. Run the following command:
ssh YOUR_CS_USERNAME@scm4.cs.ucl.ac.uk
If you do not have access, you are probably using the wrong username: your CS username is different to your UCL username (which you use to log in to Moodle, Portico etc.) Do not try to log in more than three times with an incorrect username, as your IP address will be banned by the department. Please visit Tech Support on the 4th floor of the Malet Place Engineering Building to figure out what your CS username is.
5.2 Pushing your work
Having confirmed that you indeed can ssh into a department server, download and decompress project.tar.gz we have provided on Moodle. This results in a directory project with the following structure:
project/
src/
lib/
bin/
tests/
open/
Change your working directory to tool-demo using the cd command on Unix-like operating systems. Then initialisetool-demointoagitrepositoryusinggit init.AddyourUCLrepositoryasaremote(thecommand below is a single line):
git remote add origin YOUR_CS_USERNAME@scm4.cs.ucl.ac.uk:/cs/student/
comp0012/YOUR_CS_USERNAME
Develop your lexer (i.e. Lexer.lex) and parser (i.e. Parser.cup) in the src/ subdirectory. Stage and commit all your local changes using git add and git commit. Once your coursework builds and you are ready to submit, create a text file called student.txt in the root of your repository with a single line containing your student number, UCL email, and name, in this format:
12345678
7
You MUST provide the correct student number and UCL email address; otherwise, your submission will be rejected. You MUST use the email address containing your name and year of entry (e.g. f.bloggs.14@ucl.ac.uk), whereas UCLID@ucl.ac.uk (e.g. ucaaxxx@ucl.ac.uk) is your Windows Live ID, not the actual address.
Next, stage and commit student.txt. At this point, your repository should contain at least the following files:
• student.txt
• src/SC.java
• src/Parser.cup • src/Lexer.lex
Push your repository via git push origin master. If all goes well, you should receive an email from COMPILERBOTafter a few minutes, which contains your test scores. The server will reject submissions that do not contain a student.txt file or other important files; if your push is rejected, read the error message.
If you think there is an error with this automatic testing system, please contact Zheng Gao
8