Compilers and computer architecture From strings to ASTs (2): context free grammars
Martin Berger
October 2015
Recall the function of compilers
Recall we are discussing parsing
Source program
Lexical analysis
Intermediate code generation
Optimisation
Syntax analysis
Semantic analysis, e.g. type checking
Code generation
Translated program
Introduction
Introduction
Remember, we want to take a program given as a string and:
Checkifit’ssyntacticallycorrect,e.g.iseveryopened bracket later closed?
ProduceanASTforthecodegenerator.
Introduction
Remember, we want to take a program given as a string and:
Checkifit’ssyntacticallycorrect,e.g.iseveryopened bracket later closed?
ProduceanASTforthecodegenerator.
We split that task into two phases, lexing and parsing. Lexing throws away some information (e.g. how many white-spaces) and prepares a token-list, which is used by the parser. The token-list simplifies the parser, because some detail is not important for syntactic correctness:
ifx <2+3thenP elseQ is syntactically correct exactly when
ify <111+222thenP elseQ
Introduction
The token-list simplifies the parser, because some detail is not important for syntactic correctness:
ifx <2+3thenP elseQ is syntactically correct exactly when
ify <111+222thenP elseQ
So from the point of view of the next stage (parsing), all we
need to know is that the input is
T_if T_var T_less T_int T_plus T_int T_then ...
Of course we cannot throw away the names of variables etc completely, as the later stages (type-checking and code generation) need them. They are just irrelevant for syntax checking. We keep them and our token-lists are like this
T_if T_var ( "x" ) T_less T_int ( 2 ) T_plus ...
Two tasks of syntax analysis
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Specifyingthatthesyntacticallycorrectprograms(token lists) are.
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Specifyingthatthesyntacticallycorrectprograms(token lists) are.
Checkingifaninputprogram(tokenlist)issyntactically correct according to the specification, and output a corresponding AST.
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Specifyingthatthesyntacticallycorrectprograms(token lists) are.
Checkingifaninputprogram(tokenlist)issyntactically correct according to the specification, and output a corresponding AST.
Let’s deal with specification first. What are our options? How about using regular expressions for this purpose?
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Specifyingthatthesyntacticallycorrectprograms(token lists) are.
Checkingifaninputprogram(tokenlist)issyntactically correct according to the specification, and output a corresponding AST.
Let’s deal with specification first. What are our options? How about using regular expressions for this purpose?
Alas not every language can be expressed in these formalisms. Example:
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Specifyingthatthesyntacticallycorrectprograms(token lists) are.
Checkingifaninputprogram(tokenlist)issyntactically correct according to the specification, and output a corresponding AST.
Let’s deal with specification first. What are our options? How about using regular expressions for this purpose?
Alas not every language can be expressed in these formalisms. Example:
Alphabet = {′(′,′ )′}.
Two tasks of syntax analysis
As with the lexical phase, we have to deal with two distinct tasks.
Specifyingthatthesyntacticallycorrectprograms(token lists) are.
Checkingifaninputprogram(tokenlist)issyntactically correct according to the specification, and output a corresponding AST.
Let’s deal with specification first. What are our options? How about using regular expressions for this purpose?
Alas not every language can be expressed in these formalisms. Example:
Alphabet = {′(′,′ )′}.
Language = all balanced parentheses (), ()(), (()), ((()(()())()(()))), ...
FSAs/REs can’t count
FSAs/REs can’t count
Let’s analyse the situation a bit more. Why can we not describe the language of all balanced parentheses using REs or FSAs.
FSAs/REs can’t count
Let’s analyse the situation a bit more. Why can we not describe the language of all balanced parentheses using REs or FSAs.
Each FSA has only a fixed number (say n) of states. But what if we have more than n open brackets before we hit a closing bracket?
FSAs/REs can’t count
Let’s analyse the situation a bit more. Why can we not describe the language of all balanced parentheses using REs or FSAs.
Each FSA has only a fixed number (say n) of states. But what if we have more than n open brackets before we hit a closing bracket?
Since there are only n states, when we reach the n + 1 open bracket, we must have gone back to a state that we already visited earlier, say when we processed the i-th bracket with
i