Lexical Analyzer
CS 2210 Programming Project (Part I)
In this phase of the project, you will write a lexical analyzer for the CS 2210 programming language, MINI-JAVA. The analyzer will consist of a scanner, written in LEX, and routines to manage a lexical table, written in C. The rest of the compiler project will communicate with these program modules.
Please read the source code and understand how it works.
Copyright By PowCoder代写 加微信 powcoder
Token specification
Figure 1 defines the tokens that must be recognized, with their associated symbolic names. All multi-symbol tokens are separated by blanks, tabs, newlines, comments or delimiters.
Comments are enclosed in /* … */ and cannot be nested. An identifier is a sequence of (upper or lower case) letters or digits, beginning with a letter. Upper and lower case are not distinguished (i.e. the identifier ABC is the same as Abc. There is no limit on the length of identifiers. However, you may impose limits on the total number of distinct identifiers and string lexemes and on the total number of characters in all distinct identifiers and strings taken together. If defined, these limits should be defined as follows.
#define LIMIT1 500
There should be no other limitation on the number of lexemes that the lexical analyzer will process.
An integer constant is an unsigned sequence of digits representing a base 10 number. A string constant is a sequence of characters surrounded also by single quotes, e.g. ¡¯Hello, world¡¯. Hard-to-type or invisible characters can be represented in character and string constants by escape sequences; these sequences look like two characters, but represent only one. The escape sequences support by the MINI-JAVA language are \n for newline, \t for tab, \¡ä for the single quote and \\ for the
backslash. Any other character following a backslash is not treated as escape sequence.
Token attributes
A unique identification of each token (integer aliased with the symbolic token name) must be returned by the lexical analyzer. In addition, the lexical analyzer must pass extra information about some token to the parser. This extra information is passed to the parser as a single value, namely an integer, through a global variable as described below. For integer constants, the numeric value of the constant is passed. In order to allow other passes of the compiler to access the original identifier lexeme, the lexical analyzer passes an integer uniquely identifying an identifier (other than reserved words). String constants are treated in the same way, with a unique identifying number being passed. The unique identifying number for both identifiers and string constants should be an index (pointer) into a string table created by the lexical analyzer to record the lexemes. Same identifiers should return the same index.
Implementation
The central routine of the scanner is yylex, an integer function that returns a token number, indicating the type (identifier, integer constant, semicolon, etc.), of the next token in the input stream. In addition to the token type, yylex must set the global variables yyline and yycolumn to the line and column number at which that token appears. In the case of integer and string constants, store the value into the global integer variable yylval. Lex will write yylex for you, using the patterns and rules defined in your lex input file (which should be called lexer.l. Your rules must include the code to maintain yyline, yycolumn and yylval.
Token name
ASSGNnum DECLARATIONnum DOTnum ENDDECLARATIONSnum EQUALnum
LPARENnum METHODnum
PROGRAMnum RBRACnum
Symbolic Name
declarations
. enddeclarations =
identifier
Token Name
CLASSnum COMMAnum DIVIDEnum ELSEnum EQnum GEnum ICONSTnum IFnum LBRACEnum LEnum LTnum MINUSnum NOTnum PLUSnum RBRACEnum RETURNnum SCONSTnum TIMESnum VOIDnum EOFnum
Symbolic Name
>= integerconstant if
return stringconstant *
end of file
Figure 1: Defined Tokens in Mini Java.
In the case of identifiers and string constant, yylval contains a pointer pointing to a string table that contains the real string. The same index should be returned for the same identifier that appear at different places. Similarly the same index are returned for the same string. However, abc and ¡°abc¡± should return different indice in the string table.
Reserved words may be handled as regular expressions or stored as part of the id table. For example, reserve words may be pre-stored in the string table so your program can determine a reserve word from an identifier by the section of the table in which the lexeme is found. Efficiency should be a factor in the management of the lexical and string table.
You are to write a routine ReportError that takes a message and line and column numbers and reports an error, printing the message and indicating the position of the error. You need only print the line and column number to indicate the position. The #define mechanism should be used to allow the lexical analyzer to return token numbers symbolically. In order to avoid using token names that are reserved or significant in C or in the parser, the token names have been specified for you
in Figure 1.
The parser and the lexical analyzer must agree on the token number to ensure correct communication between them.
The token numbers can be chosen by you, as the compiler writer, or, by default, by Yacc (a parser generator to be used in the next assignment). The default token number for a literal character (i.e. a one-character token) is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257 in the order they are declared in your Yacc specification. Regardless of how token numbers are chosen, the end-marker must has token number 0 or negative, and thus your lexical analyzer must return a 0 ( or a negative) as a token number upon reaching the end of input.
Temporary driver
In order to test your lexcial analyzer without a parser, you will have to write a simple driver program which calls your lexical analyzer and print each token with its value as the input is scanned. For ease in combining the lexical analyzer and parser in the second assignment, the lexical analyzer function should be put in a file by itself. The following shows the structure of a driver, if you use it, please remember to break from the endless loop after recognize the end of file.
{… while (1) {
switch (yylex()) { case ICONTnum: … }
Error handling
Your lexical analyzer should recover from all malformed lexemes, as well as such things as string constants that extend across a line boudary or comments that are never terminated. Specifically, an identifier which starts with a digit is considered to be an error and should be reported.
An example program with output
The program /* Example 1: A hello world program */
program xyz; class Test {
method void main() { System.println(¡¯Hello World !!!¡¯);
The output of Lexical Analyzer
Line Column 2 13
4 23 4 28 4 29 4 30
5 23 5 30 5 31 5 48 5 49
Token PROGRAMnum IDnum SEMInum CLASSnum IDnum LBRACEnum METHODnum VOIDnum IDnum LPARENnum RPARENnum LBRACEnum IDnum DOTnum IDnum LPARENnum SCONSTnum RPARENnum SEMInum RBRACEnum RBRACEnum EOFnum
Index in String table 0
String Table : xyz test main system println Hello World !!! End of file
Assignment submission
No submission.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com