COMP0012 Compilers Lexing and Parsing Coursework
Submission Deadline Monday 1st April 2019 @ 11:55PM
The goal of this COMP0012 parsing coursework is to build a lexer and parser for the Z ̃sec programming language. Use JFlex 1.6.1 and Cup version 11b-20160615, using only the specified versions, to automatically generate code for your scanner and parser1. This coursework broadly requires writing regular expressions covering all legal words in the language (Lexer.lex), and a context free grammar describing its rules (Parser.cup).
You must work on this coursework individually and any kind of collaboration, including sharing your own tests with others, is strictly forbidden. You will get a single mark, comprising 10% of your mark for the COMP0012 module. Please submit your work (JFLex/CUP specifications) before Monday 1st April 2019 @ 11:55PM.
Detailed submission instructions are given at the end of the document. “There will always be noise on the line.”
—Engineering folklore
Despite our best efforts, this project description contains errors2. Part of this coursework is then to start to develop the skills you will need to cope with such problems now. First, you think hard about issues you discover (of which only a small subset will be errors in this problem statement) and make a reasonable decision and document your decision and its justification accompanying your submission. Second, you can ask stakeholders for clarification.
1 Interpreting the Specification
Throughout your career in IT, you will have to contend with interpreting and understanding specifications. If you fail a test case because of some ambiguities in the specification, you must go on to explain why it is a problem and justify how you decide to resolve it. When marking, we will consider the issues you discover; if your justification is sound, you will get full marks for the relevant test cases.
We have numbered each paragraph, table and program listing in this specification. For each issue that you find, make a note in the following format:
Paragraph: 72
Problem: it is not clear whether we can omit both the start and end indices
in sequence slicing, like “foo := bar[:]”.
Our solution: this is possible in many other languages that support list
slicing (like Python), so our compiler accepts this syntax.
Paragraph: 99
Problem: the spec has an assignment using the equality operator,
“foo = bar;”.
Our solution: we think this is a mistake, and our compiler does not accept
this statement.
Call this file ambiguities.txt and turn it along with your implementation of the parser.
1Section 4 explains why I have imposed these constraints on the permitted versions of these tools.
2Nonetheless, this document is already a clearer specification than any you will find in your subsequent careers in industrial IT.
1
2 The Z ̃sec Language
§1 You are to build a lexer and a parser for Z ̃sec.
§2 A program in Z ̃sec consists of a list of declarations which defines global variables and new data types as well
as functions. This list cannot be empty: it must have a definition for a main function.
§3 Z ̃sec has two types of comments. First, any character, other than a newline, following # to the end of the line is a comment. Second, any character, including newlines, enclosed within /# … #/ is a comment, and may span multiple lines.
§4 An identifier starts with a letter, followed by an arbitrary number of underscores, letters, or digits. Identifiers are case-sensitive. Punctuation other than underscore is not allowed.
§5 Z ̃sec is a security typed language3. Every primitive type must be labelled as either low or high4. A low label is declared with L and a high with H. Information is allowed to flow from L to H, but not vice versa. Security labels allow the compiler to check whether confidential information has been leaked publicly. Checking that the information flow is correct is not part of the coursework. We let S range over H and L in the remainder of this specification.
§6 A character is a single letter, punctuation symbol, or digit wrapped in ’’ and has type char S. The allowed punctuation symbols are space (See http://en.wikipedia.org/wiki/Punctuation) and the ASCII symbols, other than digits, on this page http://www.kerryr.net/pioneers/ascii3.htm.
§7 The boolean constants are T and F and have type bool S.
§8 Numbers are integers (type int S), rationals (type rat S), or floats (type float S). Negative numbers are
represented by the ’-’ symbol before the digits. Examples of integers include 1 and -1234; examples of rationals include 1/3 and -345_11/3; examples of floats are -0.1 and 3.14.
§9 Sequences (type seq) are ordered containers of elements. Sequences have nonnegative length. A sequence has a type: its declaration specifies the type of elements it contains. For instance, l : seq
§10 Z ̃sec sequences support the standard indexing syntax. For any sequence s, the operator len(s) : seq ! N returns the length of s and the indices of s range from 0 to len(s)-1. The expression s[index] returns the element in s at index. String literals are syntactic sugar for character sequences, so “abc” is [ ’a’, ’b’ , ’c’]. String literals are treated as low by default: they do not require a security annotation S. For the sequence s : seq
§11 Sequences in Z ̃sec also support sequence slicing as in languages like Python or Ruby: id[i:j] re- turns another sequence, which is a subsequence of id starting at id[i] and ending at id[j]. Given a := [1,2,3,4,5], a[1:3] is [2,3,4]. When the start index is not given, it implies that the subsequence starts from index 0 of the original sequence (e.g., a[:2] is [1,2]). Similarly, when the end index is not given, the subsequence ends with the last element of the original sequence (e.g., a[3:] is [4,5]). Finally, indices can be negative, in which case its value is determined by counting backwards from the end of the original sequence: a[2:-1] is equivalent to a[2:len(a)-1] and, therefore, is [3,4,5], while s[-2] is 4. The lower index in a slice must be positive and smaller than the upper index, after the upper index has been subtracted from the sequences length if it was negative.
3 See https://goo.gl/XN8ebs and https://goo.gl/ZawHTW. 4Language-Based Information-Flow Security, Sabelfeld and Myers, 2003.
2
Primitive Data Types Aggregate Data Types Literal
bool S, int S, rat S, float S, char S seq
string
Table 1: Z ̃sec data types.
Kind
Boolean Numeric
Defined Over
bool int,rat,float
Syntax
!, &&, ||
+ – * / ^
in, ::, len(s), s[i], s[i:j], s[i:], s[:i]
int, rat, float bool,int,rat,float =!=
Table 2: Z ̃sec operators.
Sequence seq
Comparison
< <=
§12 Table 1 defines Z ̃sec’s builtin data types. In Table 2, “!” denotes logical not, “&&” logical and and “||” logical or, as is typical in the C language family. Note that “=” is referential equality and “:=” is the assignment operator in Z ̃sec. The in operator checks whether an element (key) is present in a sequence, as in 2 in [1,2,3] and returns a boolean. Note that in only operates on the outermost sequence: 3 in [[1],[2],[3]] is F, or false. “::” operator denotes concatenation, “s[i]” returns the it h entry in s and “len(s)” returns the length of s as defined in the discussion of sequences and their indexing above.
2.1 Declarations
§13 The syntax of field or variable declaration is “id : type”. A data type declaration is
tdef type_id { declaration_list } ;
where declaration_list is a comma-separated list of field/variable declarations. Once declared, a data type can be used as a type in subsequent declarations. The security label of a new data type is the least upper bound of the labels of its constituent fields. It is not declared by the programmer but calculated during the compiler’s semantic analysis (i.e. not in this coursework).
§14 For readability, Z ̃sec supports type aliasing: the directive “alias old_name new_name ;” can appear in a declaration list and allows the use of new_name in place of old_name.
alias seq
tdef person { name : string, surname : string, age : int H };
tdef family { mother : person, father : person, children : sec
Listing 1: Z ̃sec data type declaration examples.
§15 Listing 2 shows the syntax of function declarations in Z ̃sec. Specifically, a function’s name is an identifier. The formal_parameter_list separates parameter declarations, which follow the variable/field declaration syntax id : type, with commas. A function’s body cannot be empty; it consists of local variable declarations, if any, followed by statements. The return type of a function is returnType; it is omitted when the function is main, or does not return a value.
2.2 Expressions
§16 Z ̃sec expressions are applications of the operators defined above. Parentheses enforce precedence. For
user-defined data type definitions, field references are expressions and have the form id.field. Function calls 3
fdef name (formal_parameter_list) { body } : returnType ; fdef name (formal_parameter_list) { body } ;
Listing 2: Z ̃sec function declaration syntax.
p.age + 10
b – foo(sum(10, c), bar()) = 30 s1 :: s2 :: [1,2]
Assumes “person p;” previously declared Illustrates method calls
Assumes s1 and s2 have type seq
Table 3: Z ̃sec expression examples.
can be either a statement or an expression; their parameters are expressions that, in the semantic phase (i.e. not this coursework), would be required to produce a type that can unify with the type of their parameter. Table 3 contains example expressions.
2.3 Statements
§17 In Table 4, var indicates a variable. An expression_list is a comma-separated list of expressions. As above, a body consists of local variable declarations (if any), followed by statements. Statements, apart from if-else and loop, terminate with a semicolon. The return statement appears in a function body, where it is optional. In any if statement, there can be zero or one else branch.
Assignment Input
Output Function Call
Control Flow
var := expression ;
read var ;
print expression ; _
functionId ( expression list );
if ( expression )then body fi
if ( expression ) then body else body fi loop body pool
break N; # N is optional and defaults to 1. return expression ;
Table 4: Z ̃sec statements.
§18 Variables may be initialised at the time of declaration: “id : type := init ;”. For newly defined data types, initialisation consists of a sequence of comma-separated values, each of which is assigned to the data type fields in the order of declaration. Listing 3 contains examples.
§19 The statement read var; reads a value from the standard input and stores it in var; the statement print prints evaluation of its expression parameter, followed by a newline.
§20 The if statement behaves like that in the C family language. The unguarded loop statement is the only loop construct in Z ̃sec. To exit a loop, one must use break N, usually coupled with an if statement; the optional argument N is a positive integer that specifies the number of nested loops to exit and defaults to 1. The use of break statement is forbidden outside a loop. Listing 4 shows how to use loop and break.
§21 Listing 5 shows an example program, contain two functions. The function main is the special Z ̃sec function where execution starts. Z ̃sec’s main returns no value.
4
3
§22
b : int H := 10;
c : string := “hello world!”;
d : person := “Shin”, “Yoo”, 30;
e : char L := ’a’;
f : seq
Listing 3: Z ̃sec variable declaration and initialization examples.
a : seq
j : int L := 0;
loop
if (2 < i) then
break; fi
loop
if (2 < j) then
break; # break to the outer loop fi
if (b[j] < a[i]) then
break 2; # break out of two loops
fi
j := j + 1;
pool
i := i + 1;
j := 0;
pool
Error Handling
Listing 4: Z ̃sec loop example.
Your parser will be tested against a test suite of positive and negative tests. This testing is scripted; so it is important for your output to match what the script expects. Add the following function definition into the "parser code" section of your Cup file, between its {: and :} delimiters.
public void syntax_error(Symbol current_token) { report_error(
"Syntax error at line " + (current_token.left+1) + ", column "
+ current_token.right + ". ", null );
}
Listing 6: Z ̃sec compiler error message format.
§23 The provided SC class uses a boolean field syntaxErrors of the parser object to decide whether parsing is successful. So please add such a public field to the Parser class and set it to true when a syntax error is generated.
5
• •
• •
• • •
main { # Main is not necessarily last.
a : seq
b : seq
};
fdef seq
i : int L := 0;
loop
if (l.len < i) then break;
fi
outseq := inseq[i] :: outseq;
i := i + 1;
pool return outseq;
} : seq
Listing 5: Z ̃sec example program. Submission Requirements and Instructions
Your scanner (lexer) must
Use JLex (or JFlex) to automatically generate a scanner for the Z ̃sec language;
Make use of macro definitions where necessary. Choose meaningful token type names to make your
specification readable and understandable;
Ignore whitespace and comments; and
Report the line and the column (offset into the line) where an error, usually unexpected input, first occurred. Use the code in Section 3, which specifies the format that will be matched by the grading script.
Your parser must
Use CUP to automatically produce a parser for the Z ̃sec language;
Resolve ambiguities in expressions using the precedence and associativity rules;
Print “parsing successful”, followed by a newline, if the program is syntactically correct.
Your scanner and parser must work together.
4 §24
§25
§26 §27
§28 I have provided a Makefile on Moodle. This makefile must build your project, from source, when make is issued,
• using JFlex 1.6.1
• using Cup version 11b-20160615 • using Java SE 8 Update 201
• on CentOS 7.6
Once the scanner and parser have been successfully produced using JFlex and CUP, use the provided SC class to test your code on the test files given on the course webpage.
6