FOUNDATIONS OF COMPUTATION CM50260
MSC IN COMPUTER SCIENCE COURSEWORK
ISSUED 5 NOVEMBER 2018, DUE 10 DECEMBER 2018
This part of the coursework forms 40% of the assessment for this unit.
1. General description of the task
Context-free grammars are used extensively in modeling the syntax of programming languages. A compiler for such a programming language will embody a parser, that is, an algorithm to determine whether a given word belongs to the language generated by a given context-free grammar, and, if so, construct a parse tree of the word. The compiler would then go on to translate this parse tree into a program in a more basic language, such as assembly language.
The aim of this part of your coursework is to construct two versions of a parser for a given context-free language and to run them on a few particular input words. More precisely, you will be given:
- (1) a particular context-free grammar G generating the context-free language L;
- (2) particular input words in the alphabet of terminals of G;
- (3) a description of an algorithm for transforming a given grammar into Chomsky normal form;
- (4) descriptions of two algorithms for deciding membership in a context-free language.
You will need to:
- (1) transform G into Chomsky normal form [8 marks];
- (2) write C code for the first algorithm [8 marks];
- (3) write C code for the second (dynamic programming) algorithm [6 marks];
- (4) run the first code on each of the given input words and, if a word turns out to belong to L, produce a parse tree for this input [8 marks];
- (5) run the second code on each of the given input words and, if a word turns out to belong to L, produce a parse tree for this input [10 marks].

2. Data
1. The following grammar G generates a fragment of many common programming languages. This fragment consists of all words in the alphabet {(, ), +, ∗, x} that represent syntactically correct arithmetic expressions involving + and ∗. The letter x stands for an identifier, i.e., a variable name. This grammar is unambiguous.
Let G = (Σ,NT,R,E), where E is the starting symbol, while the alphabet of terminals Σ, alphabet of nonterminals NT, and the set of rules R are as follows.
Σ = {+, ∗, (, ), x}, NT = {T, F, E}, R = {
E → E + T, E → T, T → T ∗ F, T → F,
F → (E),
F → x }
The symbols E, T, and F are abbreviations for expression, term, and factor, respectively.
2. Input words for the first algorithm:
(1) x+x
(2) xx
(3) ()
3. Input words for the second algorithm:
(1) (((x+(x)∗x)∗x)∗x)
(2) x∗x+x+x∗(x)
(3) (x∗x)∗x+x+(x)
(4) (x+x))∗((x)∗x
3. Algorithms
The material in this section is adapted from:
M. Sipser, Introduction to the Theory of Computation, second edition, Course Technology, Cengage Learning, 2006.
3.1. Chomsky normal form. [Sipser, pp. 108–111]
When working with context-free grammars, it is often convenient to have them
in simplified form. One of the simplest and most useful forms is called Chomsky normal form.
Definition 3.1. A context-free grammar is in Chomsky normal form if every rule is of the form
A → BC
A → a
where a is any terminal and A, B, and C are any nonterminals, except that B and C may not be the start variable. In addition we permit the rule S → e, where S is the start variable and e is the empty word symbol.
Algorithm. The following algorithm converts any context-free grammar to the equivalent grammar in Chomsky normal form.
Firstly, we add a new start nonterminal S0 and the rule S0 → S, where S is the original start nonterminal. This change guarantees that the start nonterminal doesn’t occur on the right-hand side of a rule.
Secondly, we take care of all e-rules. We remove an e-rule A → e, where A is not the start nonterminal. Then for each occurrence of an A on the right-hand
side of a rule, we add a new rule with that occurrence deleted. In other words, if R → uAv is a rule in which u and v are sentential forms (words of terminals and nonterminals), we add the rule R → uv. We do so for each occurrence of an A, so, for example, the rule R → uAvAw causes us to add R → uvAw, R → uAvw, and R → uvw. If we have the rule R → A, we add R → e unless we had previously removed the rule R → e. We repeat these steps until we eliminate all e-rules not involving the start nonterminal.
Thirdly, we handle all unit rules. We remove a unit rule A → B. Then, whenever a rule B → u appears, we add the rule A → u unless this was a unit rule previously removed. As before, u is a word of terminals and nonterminals. We repeat these steps until we eliminate all unit rules.
Finally, we convert all remaining rules into the proper form. We replace each rule A → u1u2···uk, where k ≥ 3 and each ui is a terminal or nonterminal symbol, with the rules A → u1A1, A1 → u2A2, A2 → u3A3, …, Ak−2 → uk−1uk. The Ai's are new nonterminals. In these rules, and in any remaining rule with a two-symbol right-hand side, we replace each terminal ui with a new nonterminal Ui and add the rule Ui → ui.
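As a small illustration of all four stages (this toy grammar is not part of the coursework; converting G itself is your task), consider the grammar S → aS | e:

```text
Start:        S0 absent;  S → aS,  S → e

Step 1 (new start symbol):
              S0 → S,  S → aS,  S → e

Step 2 (remove e-rules): removing S → e adds S → a (from S → aS with the
        occurrence of S deleted) and S0 → e (from the unit rule S0 → S):
              S0 → S | e,  S → aS | a

Step 3 (remove unit rules): removing S0 → S adds S0 → aS and S0 → a:
              S0 → aS | a | e,  S → aS | a

Step 4 (proper form): the right-hand side aS mixes a terminal with a
        nonterminal, so introduce U → a:
              S0 → US | a | e,  S → US | a,  U → a
```

The resulting grammar is in Chomsky normal form and still generates the same language (all words of a's, including the empty word).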
3.2. First parsing algorithm.
Theorem 3.2 (Sipser, p. 132, Problem 2.26). If G is a context-free grammar in Chomsky normal form, then for any non-empty word w ∈ L(G) with |w| = n, exactly 2n − 1 steps are required for any derivation of w.
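A quick count shows why (a sketch, not part of the brief): a rule of the form A → BC lengthens the sentential form by exactly one symbol, while a rule A → a keeps the length and turns one nonterminal into a terminal. So a derivation of a word of length n uses:

```text
n − 1  applications of rules A → BC   (to grow from 1 symbol to n symbols)
  n    applications of rules A → a    (to turn each nonterminal into a terminal)
--------------------------------------------------------------------------
(n − 1) + n = 2n − 1 steps in total
```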
Algorithm (Sipser, pp. 172–173). On input w for grammar G:
(1) List all derivations with 2n − 1 steps, where n = |w|, except if n = 0, then instead list all derivations with one step.
(2) If any of these derivations generate w, accept; if not, reject.
3.3. Second parsing algorithm. (Sipser, pp. 266–267)
We assume that the grammar G is in Chomsky normal form and that the length of the input word is n.
The algorithm (also known as the Cocke-Kasami-Younger algorithm, or CKY) uses a technique called dynamic programming. It uses the accumulation of information about smaller subproblems to solve larger problems. We record the solution to any subproblem so that we need to solve it only once. We do so by making a table for all subproblems and entering their solutions systematically as we find them.
In this case we consider the subproblems of determining whether each nonterminal in G generates each subword of w. The algorithm enters the solution to this subproblem in an n × n table. For i ≤ j the (i, j)th entry of the table contains the collection of nonterminals that generate the subword wiwi+1 ···wj. For i > j the table entries are unused.
The algorithm fills in the table entries for each subword of w. First it fills in the entries for the subwords of length 1, then those of length 2, and so on. It uses the entries for the shorter lengths to assist in determining the entries for the longer lengths.
For example, suppose the algorithm has already determined which nonterminals generate all subwords up to length k. To determine whether a nonterminal A generates a particular subword of length k + 1 the algorithm splits that subword into two non-empty pieces in the k possible ways. For each split, the algorithm
examines each rule A → BC to determine whether B generates the first piece and C generates the second piece, using table entries previously computed. If both B and C generate the respective pieces, A generates the subword and so is added to the associated table entry.
The algorithm starts the process with the subwords of length 1 by examining the rules A → b. It accepts the word w if and only if the start symbol S belongs to the (1, n)th entry of the table.
Here is pseudocode for this algorithm on input w = w1w2 · · · wn.
if w = e and S → e is a rule then accept;
for i = 1 to n do
    for each nonterminal A do
        test whether A → b is a rule, where b = wi;
        if so, place A in table(i, i);
for l = 2 to n do
    for i = 1 to n − l + 1 do
        let j := i + l − 1;
        for k = i to j − 1 do
            for each rule A → BC do
                if table(i, k) contains B and table(k+1, j) contains C then
                    put A in table(i, j);
if S is in table(1, n) then accept else reject;
This algorithm can be modified to build a parse tree for accepted inputs. To do this we store in each cell table(i, j), along with the nonterminals that generate the subword wiwi+1 · · · wj, also "back-pointers" to the already constructed cells corresponding to the split of the subword considered in this cell (there is only one such split if the grammar is unambiguous). The cells of the table together with the back-pointers form a parse tree for w.
Example. Consider the grammar with one terminal a, two nonterminals S, A, where S is the start symbol, and rules S → a, S → AA, A → a. Note that the grammar is already in Chomsky normal form. It generates the language of two words, {a, aa}.
Applying the dynamic programming algorithm to the word aa we obtain the table:
      1     2
 1   S,A    S
 2         S,A
The start symbol S lies in the cell table(1, 2), thus aa is derivable. The back-pointers for table(1, 2) correspond to the splitting of aa into a and a, hence they point to table(1, 1) and table(2, 2), which gives us a parse tree consisting of the root S, its two children A and A, and a child a of each A.