代写 C++ C compiler Compiler Theory

Compiler Theory
Assignment 8, February 14, 2019
Expression Parser Function

The next stage of your compiler project is to write a function that will take
tokens from your lexical analyzer and parse a single arithmetic expression.
At this point, we are not parsing full programs, just one expression, so

x = asdf * (ab + 45) + (sdaf >= 5)

is a complete program for this stage.

For the end of input for an expression, we will use a semicolon in the source
file. So test expression input will be like
3 + x*(6 – z);
Also, the expression parser needs to handle the end-of-file token. If EOF is
the first token in an expression, then we just want to exit normally. Else
it will be an error.

Recovering after errors is much harder here than in the lexical analyzer.
So will will simply print an error message and immediately exit the program.
The lexical analyzer can keep on going past errors like it did before, but any
errors in this parser part will cause an exit.

1. Create a new empty console project, say ExprParser, and copy your lexer2.c,
main.c, and atto-C.h files into the project folder. Add your copies of
lexer2.c and main.c to the project “Source Files” and atto-C.h to the
project “Header Files”. Even if your lexer2.c is not perfected yet,
you can use it here, since we will need only very simple tokens. Except
BE SURE YOUR LEXER DOES NOT RETURN TOKENS FOR COMMENTS AND WHITESPACE!!!
Your lexer should just proceed to the next token without returning.
And since we will not be modifying the lexer in this assignemnt, when you
do get the lexer working properly, you can just copy it over.

2. In atto-C.h, add a token type DOLLAR_TOK, for use in the parsing stack.
Also add UNARY_MINUS_TOK.

3. Create a new C source file in your project, say ExprParser.c, add the
appropriate comment block at the top, and add
#include “atto-C.h”

4. Your parser will need to compare the stack top terminal with the lookahead
token to make shift-reduce decisions. Two ways to handle this:

A. Explicit shift/reduce array.

In atto-C.h, re-number your expression tokens consecutively, 101-121, say,
in the same order they appear in your shift-reduce table from last time.
(There were 21 tokens listed in exercise 2 of Assignment 8). Also, number
SEMICOLON_TOK as 122 and EOF_TOK as 123.
Define some shift-reduce table entry values:
#define SH 1
#define RE 2
#define EQ 3
#define AC 4
#define ER 5
Define a 23 by 23 array filled in with your shift-reduce table

action[23][23] = {{AC,SH,SH,…},

};
OVER

The entries for SEMICOLON_TOK should be the same as DOLLAR_TOK. The entries
for EOF_TOK should be all ER, except action[0][22] is AC.

Then in the parser, you could use action[stacktop_token-101][lookahead-101]
to decide what to do. The advantages of this method is that you don’t need
to figure out f and g values, and you have immediate detection of error
and acceptance. A disadvantage is that is a lot to type in correctly.

B. Have a function f() that returns the f value for each expression token:

int f(int token_type)
{ switch(token_type)
{ case DOLLAR_TOK: return 1;
case NUMBER_TOK: return 85;

default: printf(“Expression parser f: Illegal token type: %d\n”,token_type)
exit(1);
}
}

and likewise for the g value. Don’t forget to include SEMICOLON_TOK and EOF_TOK
in g. Advantages of this method are that you don’t have to number your tokens
any particular way, and it is easy to edit precedence levels if you find
you got them wrong, and it is easy to add new tokens.

Don’t forget to put proper comments in for each function, giving its
purpose and meaning of its return value.

In my parser, I have chosen the second method, but in the past a lot of students
have chosen the first. Either way is fine.

5. The parser needs a stack. Since we will want to keep track of several items
of information about an item in the stack (token type, memory address, lexeme
of identifiers, value of numbers, etc.), the stack will be made of “structures”
of a type we define. In C, a structure is a user-defined datatype that has
several data fields of arbitrary type. The “class” concept of object-oriented
languages is just an elaboration of the structure concept. In fact, when C++
first came out, C++ compilers simply used C as the target language and
translated classes directly into structures. So add a structure definition
to your ExprParser.c file:

struct stack_t {
int token; // type of token in this stack position
};

Here, “stack_t” is the name of this new datatype. At the moment, we only
have token_type to take care of, so just one field in the structure. We
can add more fields later, as needed.

6. Now to define the stack itself. We will allow a maximum stack size of 1000,
since that should be plenty for our test expressions (although a proper compiler
would make the stack expandable). So add this after your structure declaration:

#define STACKMAX 1000
struct stack_t stack[STACKMAX];
int stacktop; // index of the current top of the stack

Every SHIFT action should test for stack overflow, of course.
OVER

7. There are two ways to set up the stack.
A. You could have separate stack entries for each nonterminal and terminal.
You would have to define virtual token types for the nonterminals then.
A disadvantage is that for our parser, we only have one type of
nonterminal, and we are pretty much ignoring the nonterminals when
making shift/reduce decisions, so you have to go hunting on the stack
for the terminals.

B. Have each stack entry be just a terminal. That way, the topmost
terminal is always right on top of the stack, and easy to find. The
presence of a nonterminal after a terminal can be indicated by a field in
the stack structure, since there cannot be more than one nonterminal
in a row in our expression parsing. For this method, add a field to
the struct_t definition:
int expr_after; // 1 for nonterminal following, 0 for not.

I have chosen the second method for my parser.

8. Now for the parsing function itself. I am calling my function expr_parser().
It doesn’t need any input arguments, since it is just going to call lexer()
when it needs the next lookahead token. For the return value, I am going
to use the token type that ended the expression. Right now, that will
be SEMICOLON_TOK or EOF_TOK, but later there will be other possibilities.

To atto-C.h, add the function prototype
int expr_parser(void);
so the rest of the compiler can know about the expr_parser function.

9. In ExprParser.c, add
int expr_parser()
{
}
to ExprParser.c, with a nice comment block before it, explaining this is
the arithmetic parsing function, using shift-reduce parsing. And state
that its return value is the token immediately after the expression.

The following steps through step 18 refer to the body of expr_parser().

10. Add a local variable:

int lookahead; // the current lookahead token type

11. Initialize the stack:

// initialize stack
stack[0].token = DOLLAR_TOK;
stack[0].expr_after = 0;
stacktop = 0;

12. Initialize the lookahead token:

lookahead = lexer();

13. The rest of the parser function will be an infinite loop doing shift/reduce.
There are several ways to do infinite loops in C:
while (1) { … }
or
do { … } while (1);
or
for(;;) { … }

I am currently fond of for(;;). But it’s your own choice.
OVER

14. First thing in the loop is to decide shift or reduce.

If you are using the full shift/reduce table, then you could do

switch ( action[stack[stacktop].token-101][lookahead-101])
{ case SH:
// shift code

break;
case RE:
case EQ:
// reduce code

break;
case AC:
// accept code

break;
case ER:
// error-handling code

break;
}

If you are using the f and g values, you could do

if ( f(stack[stacktop].token) <= g(lookahead) ) { // shift code } else { // reduce code } When using f and g, you handle accept and error at a later stage of this parser. 15. In the shift code, you should a. Print "Shift" so you can follow the action on the screen, like. printf("Shift token %d %s line %d\n",lookahead,lexeme,line_no); b. Increment stacktop. c. Test to see if the stack has overflowed. d. Push the lookahead token on the stack: stack[stacktop].token = lookahead; stack[stacktop].expr_after = 0; e. Get the next lookahead token from lexer(): lookahead = lexer(); 16. In the reduce code, you should a. Print "Reduce" to the screen: printf("Reduce\n"); b. Decide how to reduce, based on the top terminal on the stack. This could be a big "if" statement, with a part for each possible stack top terminal, or a "switch" statement based on the stack top terminal. Switches are preferred over big if's, since a compiler can generate a faster search through switch cases than going through "if" tests one by one. So I have OVER switch ( stack[stacktop].token ) { case DOLLAR_TOK: .... break; case NUMBER_TOK: .... break; ... case NOT_TOK: .... break; default: printf("Bad stack top token %d\n",stack[stacktop].token); exit(2); } You don't have to put in the "...." shown; they represent code to be filled in. A switch should always have a "default" case at the end to catch any cases that didn't match earlier. 17. The reduce action here is pretty simple: pop the last structure on the stack and mark that there is an expression after the previous terminal on the stack. So you can replace most of the .... instances above with stacktop--; stack[stacktop].expr_after = 1; Some special cases: a. For case DOLLAR_TOK, check if stacktop==0 and if stack[0] has an expression after it, and if it does, then the expression exists and is fully parsed, so just do return lookahead; If no expression, it is an error, and print out an error message and exit the program. If you are using f and g values, then for this to work, you must have your f and g values set up so the g value for $ causes a reduce for stacktop $, and shift for everything else (except left parenthesis, which is an error). b. When reducing a right parenthesis, you need to pop two structures off the stack, to get the previous left parenthesis. c. Reducing a left parenthesis on top of the stack is always an error. d. When reducing an assignment token, "=", you need to have an identifier in the previous stack position, so check for that, and if there is not an identifier, then print out an error message and exit. If there is and identifier, then pop two structures. 18. Now you need to insert all the checks to make sure that the nonterminal expressions are in the right places. For example, addition needs expressions before and after, so you need to insert checks into each case, like this: case PLUS_TOK: if ( !stack[stacktop].expr_after ) { printf("PLUS without following expression, line %d\n",line_no); exit(1);} if ( !stack[stacktop-1].expr_after ) { printf("PLUS without preceding expression, line %d\n",line_no); exit(1);} stacktop--; stack[stacktop].expr_after = 1; break; The tests will vary, according to the type of token. Unary tokens must have an expression after but not before. Numbers and identifiers cannot have expressions before or after. Right parentheses must have expression before and left parenthesis before, but not expression after. This is where you can catch all the ERR entries in your shift/reduce table. 19. If you are using the full shift-reduce table, then case AC can simply be return lookahead; and case ER can simply be printf("Parse error. Line %d, lookahead token %s\n",line_no,lexeme); OVER 20. In main.c, change the main() function to main() { int end_tok; do { end_tok = expr_parser(); printf("Accepted.\n"); } while ( end_tok != EOF_TOK ); } 21. Build and test your expression parser. Remember you are using ; as the end of an expression, so test input might look like this: 1; 1 + 3; 1 + 3 * 4; 2 * (x + y); x = 5; If you forget the ; then you can just type it on the next line, since newlines are ignored by your expression parser. Remember that running in "Run with debugging" mode will make the console window disappear immediately after a parser error message, since your program exits, so I suggest you run in "Run without debugging" mode unless you really intend to debug. 22. One small detail we have skipped so far is how to tell apart a unary minus sign from a binary minus sign. You can't use shift/reduce rules to tell them apart, since the two uses have different precedence levels. So a kludge is necessary. A "kludge" is a bit of ugly code that makes the best of a bad situation. Two ways it might be done: a. In the lexical analyzer, remember the previous token (you could make the local lookahead variable in expr_parser into a global variable.) If the previous token is a ) or number or identifier, it is a binary minus sign, else it is a unary minus sign. b. In the expression parser, at the top of the infinite loop, check to see if the lookahead is MINUS_TOK, and if it is, check to see if the stack top has an expression. If not, change the lookahead to UNARY_MINUS_TOK. You also need to check if you should change it back. I have chosen the second way, so I have inserted at the top of my infinite loop the code // Kludge for unary minus if ( (lookahead == MINUS_TOK) && !stack[stacktop].expr_after ) lookahead = UNARY_MINUS_TOK; if ( (lookahead == UNARY_MINUS_TOK) && stack[stacktop].expr_after ) lookahead = MINUS_TOK; 23. Test your parser. I have some test input in expr-test-good.c in BlackBoard. I don't have a expr-test-bad.c file, since the first error in a file makes the compiler exit, so it could only test one error. But here are some test cases for errors: 4 5; xx(5); x 4; (5)xx; ) + 5; 6 + + 7; (); 6 + * 7; (5)(7); 6 + x = 7;