Principles of Programming Languages winter 2017
Assignment 4 (Programming) Due: Wed 8 March ’17 (via svn)
Programming Part: Building a Lexical Analysis Program.
This is divided into two parts: Building a tokenizer and a lexer.
A tokenizer accepts an i/p stream (in this case, a file) and separates the data into tokens.
For programming languages, tokens represent language structures that have a specific pur-
pose in the code (e.g. numbers, brackets, etc..
A lexer receives some valid data stream (in this case the tokenizer) and ensures that every
datum in the stream is correctly categorized in the language. This ensures that the source
code is using the same dictionary of terms as represented by the language. For instance,
the language we will implement utilizes integers as the primary data type. Thus whenever
the lexer receives a chunk of data from the tokenizer that is an integer, it should mark it
as INTEGER for use later for semantic analysis.
(1) a) Build a tokenizer capable of reading space-delimited source files. One such source
file is provided on the website (alongside) for you to test your tokenizer. The to-
kenizer is to be written in Prolog, using predicates (no grammars), saved as a file
named tokenizer.pl , and should have the following functionality:
The tokenizer should be capable of reading in a single source file. Once a file
is opened, the file should be read from start to finish, storing every token as an
atom in a list, then the list should be unified to a single variable. The predicate for
retrieving tokens from a given source file should be tokenize file/2 as follows:
tokenize file( FileName, TokenList )
Where the first argument is an atom representing the path to a file and the second
argument is the token list created by tokenizing the file given in the first argument.
tokenizer should be capable of handling instances of multiple whitespace characters,
such as “Hello World” VS. “Hello World”, where the white space can be any
assortment of regular spaces, tab spaces, new lines, and carriage returns.
Note: The above material is covered in the notes for tutorial 5 (available on
Brightspace). The notes for the tutorial cover file I/O and have some comments on
building a tokenizer.
1
(2) Build a lexer, in Prolog, using predicates (no grammars), saved as a file named
lexer.pl, and it should perform the following functionality:
Your lexer should be implemented using a single predicate, lexer/2 as follows:
lexer( TokenList, LexedList )
Where the first argument is a list of atoms and the second argument is a list of
atoms representing the token labels as defined below:
TYPE_INT: int
TYPE_BOOL: bool
COMMA: ,
ASSIGN: =
LET: let
LET_IN: in
COND_IF: if
COND_THEN: then
COND_ELSE: else
LOGIC_EQ: ==
LOGIC_NOT_EQ: !=
LOGIC_GT: >
LOGIC_GTEQ: >=
ARITH_ADD: +
ARITH_SUB: –
OPEN_P: (
CLOSE_P: )
INTEGER: any integer
IDENTIFIER: anything else
Whenever your lexer analyzes a token, it should store the corresponding token label
as an atom in a separate list. Once every token has been analyzed, the token label
list is unified to LexedList (as above).
(e.g. if the token list is: [ int, main, ‘(’, int, input, ‘)’]
Then the token label list created by the lexer should be:
[ TYPE INT, IDENTIFIER, OPEN P, TYPE INT, IDENTIFIER, CLOSE P ]
• The tokenizer and lexer will be used together: the tokenizer produces a token
list, which the lexer uses to make a token label list. To make things easier,
consider using the :- consult(FileName). directive to load the tokenizer file
at the start of the lexer file. You can find more information in the SWI-Prolog
documentation under “Loading Prolog source files”.
• Carefully consider the order of your lexer clauses when deciding on which labels
to use. While most of the tokens are very easy to label, some will require more
thought to label properly.