1.
Monash University Faculty of Information Technology 2nd Semester 2021
FIT2014 Theory of Computation
Tutorial 5
Context-Free Grammars and the Pumping Lemmas
SOLUTIONS
Observe that i and n are both positive integers here. (a) (solution by FIT2014 tutors, 2013)
S → aBcc S → aScc
B → Bb B→b
(b)
Inductive basis:
The shortest string in the language (assuming both n and i are ≥ 1, but the proof can easily be
adapted to allow them both to be 0 as well) is a1b1c2 which equals abcc and has length 4. This string can be generated by the CFG as follows:
S ⇒ aBcc ⇒ abcc.
Inductive step:
Let l denote the length of the string, and assume l ≥ 5 (else we are back in the inductive basis, which we’ve already dealt with).
Assume that any string of the required form of length < l has a derivation using the CFG. (This is the Inductive Hypothesis.) Let anbic2n be any string in the language of length l, so that l = 3n+i. Since l ≥ 5 (else we’d be back in the inductive basis), we must have either n ≥ 2 or i ≥ 2. We deal with these two cases in turn.
If n ≥ 2, then the string has the form aan−1bic2(n−1)cc. The inner string here, an−1bic2(n−1), has length l − 3 which is < l, so we can use the Inductive Hypothesis, which implies that there is a derivation
S ⇒ · · · ⇒ an−1bic2(n−1).
Placing a at the start of every string in this derivation, and cc at the end of each such string, gives
aScc ⇒ · · · ⇒ aan−1bic2(n−1)cc.
This is still a valid sequence of derivation steps, by the context-free property. (In effect, all we’ve done is change the context in the same way throughout, but the derivation steps don’t depend on context, so all the derivation steps are still valid.) Now, the first string aScc can itself be derived
1
in a single step from S, by using the second rule of the CFG. Putting this step at the start gives a new derivation, which now starts with S and therefore gives a complete derivation of the string at the end:
S ⇒ aScc ⇒ · · · ⇒ aan−1bic2(n−1)cc.
Since the string at the end is just anbic2n, we now have a derivation for it.
It remains to deal with the case when i ≥ 2 and n = 1. In this case, the string has the form abbi−1cc. Now, the string with one fewer b in the middle — namely, abi−1cc — is also of the same general form, but has length l − 1, which is < l. So by the Inductive Hypothesis it has a derivation
S ⇒ ··· ⇒ abi−1cc.
Now, observe that the only rule of our CFG which does not have a non-terminal symbol on its right side is the rule B → b. It follows that this must be the last rule used in the above derivation. Furthermore, the b created by that rule is always the leftmost b in the string. If we were to replace just the last step in the above derivation by applying the rule B → Bb instead, leaving all earlier steps unchanged, then we would get a derivation of the string aBbi−1cc:
S ⇒ ··· ⇒ aBbi−1cc.
We now add to this a single application of the rule B → b, which gives us a derivation of the string
we are after, with the extra b in the middle:
S ⇒ ··· ⇒ aBbi−1cc ⇒ abbi−1cc.
This is now a complete derivation of our string abicc.
So, putting the two cases (the first was n ≥ 2, the second was n = 1 and i ≥ 2) together, we have
seen that, whatever form our string of length l takes, we can use the Inductive Hypothesis (applied to a shorter string) to construct a derivation for the string of length l.
Conclusion:
By Mathematical Induction, any string of the required form, of any length, can be generated by this CFG.
(c)
b, B → ε
a, a → ε
b, b → ε c, c → ε
ε, ε → $
ε, ε → S
a, S → ε
ε, ε → c
ε, ε → c
ε, $ → ε
Exercise: prove that the language generated by this grammar is not regular.
Suppose we allowed both i and n to be 0, too. Then the grammar and PDA become (with differences between the two solutions shown in green in each):
2
ε, ε → B ε, ε → S
ε, ε → B ε, B → b
S→B
S → aScc
B → Bb B→ε
ε, S → B
ε, ε → S
a, S → ε
ε, B → ε
a, a → ε
ε, ε → $
ε, $ → ε
b, b → ε c, c → ε
(ii).
W : X: Y :
Z:
∀L (CFL(L) ⇒ Infinite(L)) ∀L (CFL(L) ⇒ ¬Infinite(L))
∃L (CFL(L) ∧ Infinite(L))
Note: the following answer is incorrect: ∃L (CFL(L) ⇒ Infinite(L)). This is because the expression CFL(L) ⇒ Infinite(L) is True whenever L is not context- free, as well as when L is context-free and infinite.
∃L (CFL(L) ∧ ¬Infinite(L))
ε, ε → c
ε, ε → S ε, ε → c
ε, ε → B ε, B → b
Exercise: prove that the language generated by this grammar is not regular.
2.
(a)
(i). Z.
Note, in particular, that X is not the logical negation of W . (ii). Y , Z
(b) (i).
¬W =
= ∃L ¬(CFL(L) ⇒ Infinite(L))
= ∃L ¬(¬CFL(L) ∨ Infinite(L))
¬∀L (CFL(L) ⇒ Infinite(L))
3
(since, in general, A ⇒ B is equivalent to ¬A ∨ B)
= ∃L (¬¬CFL(L) ∧ ¬Infinite(L))
(by De Morgan’s Law)
= ∃L (CFL(L) ∧ ¬Infinite(L))
(iii). Z.
Note, in particular, that X is not logically equivalent to ¬W .
3. Here is the grammar again, for convenience.
(a)
(i). Here is a derivation of the string ha.
S ⇒ P ⇒ haQ
⇒ haε = ha
(using rule (1)) (using rule (3))
(using rule (5))
S→P (1) P→PP (2) P → haQ (3) Q → Qa (4) Q→ε (5)
We now prove that this is a shortest string in LOL.
Any derivation (of any string) from S must start with the production S ⇒ P . So at least one of rules (2), (3) must be used in any derivation. If only (2) is used, then no symbol except P can ever be generated. So rule (3) must be used. This introduces two terminal symbols (in fact, the string ha). So every string in LOL must have at least two letters.
(In fact, we have shown that every string in LOL must contain the two-letter string ha. With this observation, we can say that ha is the shortest string in LOL, not just that it is a shortest string in LOL.)
(ii). The grammar is not regular, since not every production rule has, on its right-hand side, a semiword or a string of terminals. In particular, rules (2) and (4) are not of the required form.
(iii). LOL is regular language. It has the regular expression haa∗(haa∗)∗.
(iv). The grammar is not in Chomsky Normal Form. Apart from (2), the rules are not in the
required form.
(b)
Step 1. How do we know such an x exists? In other words, how do we know LOL isn’t empty? Step 8. Consider: “. . . we can get strings that are even longer than that, and so on, indefinitely.”
This needs formal justification. Don’t “wave hands” to cover gaps in the reasoning, or rely on the reader to fill in gaps. This indicates a need for a proper proof by induction.
(c)
4
We prove this by induction on n.
Inductive basis (n = 2): we already know LOL has a string of length at least 2, since we saw in part (a)(i) that LOL contains the string ha.
Inductive step:
Suppose n ≥ 2. As our Inductive Hypothesis, assume that LOL contains a string of length ≥ n. Call it x. Since x ∈ LOL, there is a derivation for x using the given grammar.
Now we just repeat the reasoning of Steps 3–7 in part (b).
This gives us a string y which is one letter longer than x. Since x has length ≥ n, we deduce that y has length ≥ n + 1. So LOL contains a string of length at least n + 1.
Therefore, by Mathematical Induction, for all n ≥ 2, LOL contains a string of length ≥ n. (d)
This proof is correct, but unnecessarily complicated and long-winded.
The contradiction hinges on the fact that a finite language has a longest string, and therefore cannot have arbitrarily long strings. The deduction that LOL has arbitrarily long strings (Step 2) does indeed contradict the finiteness of LOL (assumed in Step 1), as stated in Step 3.
But you can derive the desired contradiction much more simply. All you need to do is to use the (assumed) finiteness of LOL to drive the construction, rather than “parking” the finiteness at the start of the proof and coming back to it at the very end. Here’s how you do it.
Any finite language has a longest string; but we have already seen how to take any string in LOL and make another string in LOL that is one letter longer. You don’t need to go to the extra trouble of showing that LOL has arbitrarily long strings.
(e)
Assume, by way of contradiction, that LOL is finite.
We know it is nonempty (part (a)(i)).
Since LOL is finite and nonempty, it must have a longest string. Call it x. (Note how the assumed
finiteness of LOL drives our construction of x.)
Our earlier argument shows how to construct a string that is one letter longer than x. This
contradicts the maximality of x.
Therefore our assumption, that LOL is finite, is incorrect. Therefore LOL is infinite.
4.
(b) Suppose L is context-free. Let k be the number of non-terminal symbols in a CNF CFG for L.
ONE APPROACH:
Let n be any positive integer such that n2 > 2k−1. Then w := an2 ∈ L, so by the Pumping
Lemma for Context-Free Languages, there exist strings u,v,x,y,z such that w = uvxyz, and the lengthofvxyis≤2k, 1 andv,yarenotbothempty,anduvixyiz∈Lforalli≥0.
Letlbethesumofthelengthsofvandy. (Note,l≥1.)
From now on, the proof is very close to Tute 4, Q6(b).
Then the length of uvixyiz is n2 + (i − 1)l. So the strings uvixyiz have lengths n2, n2 + l, n2 + 2l, n2 + 3l, . . .. This is an infinite arithmetic sequence of numbers, with each consecutive pair being l apart. But the sequence of lengths of strings in L is the sequence of square numbers, and by Tute 4, Q6(a), the gaps between them increase, eventually exceeding any specific number you care to name. So there comes a point where the gaps exceed l, and some of the numbers n2 + (i − 1)l fall between two squares. When that happens, uvixyiz ̸∈ L, a contradiction. So the assumption that L
1Thanks to FIT2014 tutor Roger Lim for spotting an error in this upper bound in an earlier version.
5
is context-free must be false. Hence L is not context-free.
ANOTHER APPROACH (due to FIT2014 tutor Harald B ̈ogeholz):2
Letn=2k. Thenw:=an2 ∈L,anditslengthn2 satisfiesn2 >2k−1 (sincen2 >n=2k >2k−1),
so it’s long enough for the Pumping Lemma for CFLs to apply. So there exist strings u,v,x,y,z such that w = uvxyz, and |vxy| ≤ 2k, and v,y are not both empty, and uvixyiz ∈ L for all i ≥ 0.
Let l be the sum of the lengths of v and y. (Note, l ≥ 1.) Observe that l ≤ |vxy|, so combining with |vxy| ≤ 2k = n (above), we get l ≤ n. Using i = 2 in the string uvixyiz gives uv2xy2z ∈ L. Now, this string has length |uv2xy2z| = |uvxyz| + |v| + |y| = |w| + l = n2 + l, which falls between the following lower and upper bounds:
n2+l ≥ n2+1>n2,
n2 +l ≤ n2 +n
(c) ∀k∃w ∈ L such that |w| > 2k−1 : 2k : ∃i≥0 uvixyiz̸∈L.
∃u,v,x,y,z such that w = uvxyz and vy ̸= ε and |vxy| ≤
∀u,v,x,y,z such that w = uvxyz and vy ̸= ε and |vxy| ≤
(d) If L is context-free, then the Pumping Lemma for CFLs tells us that Con has a winning strategy. He starts by choosing k to be ≥ the number of nonterminal symbols in a Chomsky Normal Form CFG for L.
If L is non-context-free, then it is harder to determine, in general, who has a winning strategy. If L can be shown to be non-context-free using a proof by contradiction based on the Pumping Lemma for CFLs, then Noni has a winning strategy. But some non-context-free languages cannot be shown to be non-context-free using this Pumping Lemma. Indeed, there exist non-context-free languages for which Con has a winning strategy in the Double Pumping Game. Can you find one?
2This is similar to the previous approach but pins down the details of getting a string uvixyiz to fall in a gap between successive members of L. To do this, it helps to choose w to be longer than it was above.
6
3
Assume that SIS is context-free. Then there exists a CFG in CNF that generates SIS.4 Let k be its number of nonterminals. Then, by the Pumping Lemma for CFLs, every w ∈ SIS with |w| > 2k−1 can be written w = uvxyz where vy ̸= ε, |vxy| ≤ 2k, and for all i ≥ 0 we have uvixyiz ∈ SIS.
For convenience, put N := 2k.
Let us choose w to be the SIS-representation of the sequence (2N , 2N + 1, 2N + 2). This looks like:
100…000 # 100…001 # 100…010
N + 1 bits N + 1 bits N + 1 bits
So it is certainly long enough.
Let uvxyz be any segmentation of w into five substrings such that vy ̸= ε and |vxy| ≤ 2k.
Case 1:
If either v or y includes a # then, in uv3xy3z, either v3 or y3 has three equally spaced #’s. The
two numbers represented by the two stretches of letters between these #’s are equal, due to the way they are obtained by repeating v or y. But it is not allowed, in SIS, for two numbers in the sequence to be equal. So uv3xy3z cannot belong to SIS.
Case 2:
If v and y each fall within the binary representation of one of the first two numbers in the sequence (i.e., either 2N or 2N + 1), then — since at least one of v, y is nonempty — pumping up (i.e., using some i ≥ 2 to form uvixyiz) will enlarge that number, and pumping sufficiently many times will make it larger than the third number in the sequence, 2N +2. This violates the increasing nature of the sequence, so the resulting word is not in SIS.
Case 3:
If v and y each fall within the binary representation of one of the last two numbers in the sequence (i.e., either 2N + 1 or 2N + 2), then instead of pumping up, we pump down, by choosing i = 0 to form uxz. Removing any bit from either of the numbers 2N + 1 or 2N + 2 changes an (N + 1)-bit binary number to an N-bit binary number, which must be < 2N. So it becomes smaller than the first number in the sequence. This destroys the strictly-increasing property, so the resulting string uxz is not in SIS.
Remark:
Had we pumped up (using i ≥ 2) instead of down, we would have made the number(s) even
larger. This would break the rules of SIS if both v and y lie within the middle number (i.e., 2N + 1), since the pumping would make it larger, so that it becomes at least as large as the last number, 2N + 2. But it might not break the rules of SIS if v lies within the second number and y lies within the third number, since in that case it’s possible (depending on the length, position and content of v and y) for the second and third numbers to be pumped up “in tandem” so that the strictly-increasing property of the sequence is preserved.
The only locations for v and y that may appear not to be covered by these three cases are if v and y do not contain a # and they do not fall entirely within one or the other of two consecutive numbers in this sequence. But this would require v to fall entirely within the first number and y to fall entirely within the third number. Then, the string between them, namely x, includes all the N +1 bits of the middle number, so |vxy| ≥ N +1. This contradicts the requirement that |vxy| ≤ 2k, since N = 2k. So this possibility cannot arise.
Having said that, it turns out that we could also deal with this case by pumping. We can either use i = 0, which reduces the third number so that it is no longer greater than the second
3Thanks to FIT2014 tutors Nathan Companez and Harald B ̈ogeholz for pointing out errors with an earlier version of this proof.
4 . . . or, at least, SIS \ {ε}. But, when using a Pumping Lemma to prove that a given language is not regular or context-free, we only find ourselves looking at sufficiently large strings; we never look at the empty string. So the fact that a CFG in CNF does not generate the empty string is not an issue for these proofs.
6.
7
number, or i ≥ 2, which increases the first number so that it is no longer less than the second number.
So, in every case, there exists i ≥ 0 such that uvixyiz ̸∈ SIS. This violates the conclusion of the Pumping Lemma for CFLs. So we have a contradiction. So our initial assumption, that SIS is context-free, was wrong. Therefore SIS is not context-free.
7.
First iteration: single letters.
a can be generated by the nonterminals A, D. b can be generated by the nonterminal C.
Second iteration: pairs of consecutive letters (a.k.a. digraphs). We only need to consider digraphs that actually appear in the target string. In this case, we consider all digraphs except aa.
ab: The a can be generated by A or D, and the b can be generated by C, so the pair ab can come from either of the nonterminal pairs AC and DC. The pair AC can be produced by A, but the pair DC cannot be produced by a single nonterminal. So the sole nonterminal that can produce ab is A.
bb: This pair can come only from the pair CC, but this pair cannot come from a single nonterminal. So there is no nonterminal that can produce bb.
ba: This pair can come from either CA or CD. The former cannot come from a single nonterminal; the latter can come from B. So the only possibility here is B.
We summarise what we have worked out for the digraphs in the following table.
digraph nonterminals that can produce it
ab A bb – ba B
Third iteration: triples of consecutive letters (a.k.a. trigraphs). We only need to consider trigraphs that actually appear in the target string. For our string abbba, these are abb, bbb, bba.
We view each triple as a concatenation of two nonempty shorter strings.
abb: This can be formed as a concatenation of a followed by bb, or as a concatenation of ab followed by b. The former is not possible, because bb cannot be produced by a nonterminal (as we saw above). But for the latter, ab can be produced by A and b can be produced by C. So abb can be produced by AC. This in turn can be produced from the single nonterminal A. No other single nonterminal can do the job.
bbb: followed possible
We summarise what we have found.
trigraph
abb bbb bba
This can be formed as a concatenation of b followed by bb, or as a concatenation of bb by b. But in each case, the bb cannot be produced by any nonterminal. So it is also not
to produce bbb from a nonterminal.
This can be formed as a concatenation of b followed by ba, or as a concatenation of bb
by a. We can rule the latter out, because of bb. For the former, it can be produced from CB, but that in turn cannot be produced by just a single nonterminal.
bba: followed
nonterminals that can produce it
A
– –
Fourth iteration: 4-tuples of consecutive letters (a.k.a. tetragraphs). For our string abbba, the ones we need to consider are abbb, bbba.
8
abbb: The only feasible split is abb,b which gives the nonterminal pair AC. This in turn can be produced from the single nonterminal A.
bbba: There is no feasible split. Each possible split includes a string (either the one before, or the one after, the split) that cannot be produced by a single nonterminal.
tetragraph
abbb bbba
nonterminals that can produce it
A
–
Fifth and last iteration:
This is for our target string abbba.
The split a,bbba does not help, because bbba cannot be produced from a nonterminal. The split ab,bba is similarly unhelpful: see bba above. The split abb,ba can be produced by the nonterminal pair AB, which in turn can be produced only by the single nonterminal S. The split abbb,a can be produced by the nonterminal pairs AA and AD, but neither of these can come from a single nonterminal pair. So we are left with just S.
Since the starting nonterminal S is one of the nonterminals from which the target string can be produced, that string belongs to the language generated by the grammar.
8.
(a) Start with the given grammar:
S→P (1)
P → PP (2) P → haQ (3) Q → Qa (4) Q→ε (5)
First, eliminate the empty productions. There is just one here, Q → ε. To ensure that we keep its effect, we create a new copy of every rule that has Q on its right-hand side, and delete that Q (i.e., replace it by the empty string; in effect, we apply the rule Q → ε to it). (It gets a bit more complicated for rules that have more than one Q on their right-hand side, but that doesn’t happen here. What should we do in that case?) In this case, there are two such rules, (3) and (4). So we get two new rules as well: P → ha and Q → a. We now have the following grammar, equivalent to the original one:
S→P (1)
P → PP (2) P → haQ (3) P → ha
Q → Qa (4) Q→a
We must now eliminate unit productions. We just have one, namely (1), which changes S to P. It can be replaced by rules that replace S by anything that can be produced, by application of a single rule, from P. In this case, we have three rules for P. We therefore get three new rules, and the grammar is now as follows.
S→PP S → haQ
9
S → ha
P → PP (2) P → haQ (3) P → ha
Q → Qa (4) Q→a
We now need rules to generate each terminal, by itself, from a new nonterminal. This gives new rulesA→aandH→h. (Fora,it’snotenoughtorelyontheruleQ→a,whichwealready have, because nonterminal Q can be replaced by other things too; these new rules for the terminals need new symbols that will only ever be replaced by their corresponding terminals.) We obtain the following grammar.
S→PP S → haQ S → ha
P → PP (2) P → haQ (3) P → ha
Q → Qa (4) Q→a
A→a H→h
We then modify all rules with a terminal whose right-hand sides are not just a single terminal, replacing the terminals by the corresponding nonterminals we introduced above.
S→PP
S → HAQ S → HA
P → PP (2) P → HAQ (3) P → HA
Q → QA (4) Q→a
A→a
H→h
The last step is to deal with the rules whose right-hand sides have three or more nonterminals. We
introduce the new nonterminal J and the new rule J → HA, and use it to modify the grammar.
S→PP S → JQ S → HA
P → PP (2) P → JQ (3)
10
P → HA
Q → QA (4) Q→a
A→a
H→h
J → HA
(b)
Applying the CYK algorithm to this grammar and the target string hahaa, gives the following
results at the successive iterations.
substring nonterminals that can produce it
hH
a Q,A ha S,P,J ah – aaQ hah – aha – haa S,P
. . . since these can all produce HA ...sinceQ⇒QA⇒aa
...using split ha,a and nonterminal pair JQ . . . using split ha,ha and nonterminal pair P P
haha
ahaa
hahaa
to the language generated by the grammar.
Supplementary exercises
9.
(i)
S, P
–
S, P
Since S appears in the list of nonterminals that can produce hahaa, we conclude that hahaa belongs
. . . using split ha,haa
a,b
S− Λ
C a
a
A Λ B E+
bb D
11
S→A
A → aA|bA|B B → aC|bD
C → aE
D → bE E→Λ
10.
The NFAs given above are already PDAs; they just don’t use the stack.
(A transition in an NFA, labelled by a letter x, becomes a transition x, ε → ε when the NFA is
(ii)
(iii)
(iv)
S→A
A → aB|bB|C B → aA|bA C→Λ
S→A
A → aB|bC|D B → aA
C → bA D→Λ
B
a,b a,b
S− Λ A
B aa
S− Λ A b
C
B
Λ D+ b
Λ
C+
G ab ba
S − Λ A Λ C a D a E Λ F Λ H+
S−
C Λ D Λ G bb
F
H
ba I
J +
AE abaaab
ba B
(v)
S→A
A → aB|C B → bA
C → aD
D → aE E→F
F → bG|H G → aF H→Λ
S → aA|bB A → bC
B → aC C→D
D → aE|bF|G E → aD
F → bD
G → aH|bI
H → bJ I → aJ J→Λ
viewed as a PDA.)
11. The terminal symbols for our grammar will be just the letters of our alphabet. Create a new nonterminal symbol for each of the regular expressions formed during the construction of the regular expression using any of the three operations: concatenation, alternative, and Kleene star. (Recall the inductive definition of regular expressions.)
12
• • • • •
Create a production rule of the form S → R, where S is the start symbol and R is the new nonterminal symbol that we have introduced to stand for the entire regular expression.
For each regular expression formed by concatenation, i.e., R1 = R2R3, create the production rule R1 → R2R3.
For each regular expression formed by the alternative operation, i.e., R1 = R2 ∪R3, create two production rules: R1 → R2 and R1 → R3.
For each regular expression formed by the Kleene star, i.e., R1 = R2∗, create the three produc- tionrulesR1 →ε,R1 →R2,andR1 →R1R2.
For each regular expression that is just a string w of letters (where w may be empty or nonempty), i.e., R1 = w, create the production rule R1 → w.
12.
(a)
Given a k-limited PDA P, define a NFA N from it as follows.
States of N:
For every possible combination of state and stack contents of P, we are going to create a state
of N. We represent the stack, with symbols s1,...,sd (from top down) where d ≤ k, by the k-tuple (s1,...,sk), with si = ε for j > d. Let Q be the set of states of P, let Σ be the input alphabet of P, and let Γ be the stack alphabet of P. For every q ∈ Q and every k-tuple (s1,…,sk) where each si ∈ Γ ∪ {ε}, we create a state r(q,s1,…,sk) for N. (We can suppose that, if si = ε, then sj = ε for all j > i.)
The number of states we create by this process is ≤ |Q| × (|Γ| + 1)k, which is finite since Q and Γ are both finite.
Transitions for N:
Take any transition in M, from some state q to another state q′. Suppose its label is x,s1 → s′1. We create corresponding transitions in N for x ∈ Σ ∪ {ε} from state r(q, s1, . . . , sk) to state r(q′,s′1,s2,…,sk). We do this for every pair of states in N whose representations have this form. The state changes simulate reading (popping) s1 from the stack and pushing s′1 onto it; the rest of the stack is unchanged.
Supposeinsteadthatourtransitionhaslabel x,ε→s′1. Wecreatecorrespondingtransitionsin N for x from state r(q, s1, . . . , sk) to state r(q′, s′1, s1, . . . , sk−1). (Here, the state change simulates no reading from the stack, and s′1 is pushed onto it.)
Now suppose the transition has label x, s1 → ε. We create corresponding transitions in N for x from state r(q,s1,…,sk) to state r(q′,s2,…,sk,ε).
Finally, suppose our transition has label x, ε → ε. We create corresponding transitions in N for x from state r(q,s1,…,sk) to state r(q′,s1,…,sk). (There is no change to the stack.)
The start state of N is the state r(S,ε,…,ε), where S is the start state of P and ε,…,ε represents the initially-empty stack.
The Final States of N are all states r(F,s1,…,sk) where F is a final state of P.
It can be shown that the NFA N we have constructed simulates the operation of the PDA P, and that an input string is accepted by P if and only if it is accepted by N.
(b) If a language is regular, then it is recognised by a NFA. Now, a NFA is just a PDA which never uses its stack, or in other words, a 0-limited PDA. This, in turn, is a special case of a k-limited PDA. So the language is recognised by a k-limited PDA.
Now suppose that a language is recognised by a k-limited PDA. We know from part (a) that this can be simulated by a NFA, which recognises the same language. So the language is recognised by a NFA. So it is regular.
13
Challenge: What if the limit on stack size is log n, where n is the input string length, rather than a constant? Are (log n)-limited PDAs equivalent to NFAs?
13.
(a)
We prove that FootyScore is not regular.
Suppose FootyScore is regular. Then it has a Finite Automaton, by Kleene’s Theorem. Let k be
the number of states in a Finite Automaton for FootyScore. Let n be any positive integer such that n ≥ k.
Let w be the string (1n, 1n, 17n). This may also be written (11···············1, 11···············1, 11···············1)
goal-string, length n behind-string, length n point-string, length 7n
This string represents a score of a team that has kicked n goals and n behinds, giving 6n + n = 7n points. It satisfies the definition of strings in FootyScore, so it belongs to the language. Therefore, by the Pumping Lemma for Regular Languages, there exist strings x, y, z such that w = xyz, and the length of xy is ≤ n, and y is non-empty, and xyiz ∈ FootyScore for all i ≥ 0.
Firstly, observe that y cannot include either of the two commas, since if it did, repeating y (in forming xyiz) would give more than two commas altogether, in violation of the definition of strings in FootyScore (which must have exactly two commas). Similarly, y cannot contain either of the two parentheses.
Therefore y must fall entirely within the goal-string, or entirely within the behind-string, or entirely within the point-string. In each of these cases, repeating y would upset the score: it would change the number of goals, or the number of behinds, or the number of points, without changing either of the other two numbers. When only one of the three numbers g,b,p is changed, it is no longer true that 6g + b = p, so the corresponding string no longer belongs to FootyScore.
This contradicts the conclusion from the Pumping Lemma, that xyiz ∈ FootyScore for all i. Hence our assumption, that FootyScore is regular, was wrong.
Therefore FootyScore is not regular.
The above proof did not use the fact that |xy| ≤ n (guaranteed by the Pumping Lemma). We could do a slightly different proof by using it. It tells us that the initial part xy of the string has length at most n, and since the goal string is more than n letters long, the substring xy must come before the first comma. This tells us that y lies entirely within the goal string, except that it may also contain the initial opening parenthesis (if x is empty, which is possible). In the former case, repeating y increases the number of goals without a corresponding increase in the number of points; in the latter case, the initial opening parenthesis is repeated. Either way, the rules of the language are violated, and again we can deduce that xyiz ̸∈ FootyScore, so obtaining a contradiction.
This argument illustrates the main purpose of |xy| ≤ n: it gives us more control over where y lies within w.
(b)
FootyScore is context-free, because it is generated by the following CFG.
S → (G) (6) G → 1G111111 (7) G → ,B (8) B → 1B1 (9) B→, (10)
14
14.
We first prove by induction on n that, for every substring y of x, where |y| ≤ n, the algorithm correctly finds all nonterminals in G that can generate y.
For the inductive basis, n = 1. The substrings y consist just of one letter each, and a grammar in Chomsky Normal Form will have, for each letter, some rules with that letter alone on their right- hand sides. The CYK algorithm finds all the nonterminals on the left of such rules, and specifies them as the nonterminals that can generate y, which is correct.
Inductive step: Let n ≥ 1. Suppose the claim is true for substrings y of length ≤ n. So, for each such substring, at the appropriate iteration, the algorithm finds all nonterminals that generate it.
Now consider what happens at the iteration that considers substrings of length n + 1. Let y be such a string. The algorithm considers each way of splitting y into two substrings, a left substring yL and a right substring yR. For each such split, we already know (from previous iterations) the set of nonterminals that can produce yL and the set of nonterminals that can produce yR. Since yL and yR are shorter than y, they each have ≤ n letters, so the inductive hypothesis applies, and tells us that these sets of nonterminals are complete and correct. The algorithm then constructs the set of all nonterminal pairs XY where X is in the first set of nonterminals (for yL) and Y is in the second set of nonterminals (for yR). For each such pair XY , it finds all rules of the form W → XY , and gives these W as the nonterminals that can produce y.
This yields a set of nonterminals that can produce y. We need also to show that every nonterminal that can produce y must be in this set.
For y to be produced from a nonterminal symbol W , the first production must replace W by two other nonterminals, by the structure of the rules in Chomsky Normal Form (and using the fact that |y| > 1, so we cannot replace W by just a single terminal). Call these nonterminals X and Y . The rest of the production of y produces a left substring from X and a right substring from Y . (It is routine also to show that these two substrings cannot be empty. Just note that empty productions do not occur in CNF.) So, if y can be produced from W, then we can split it into two shorter substrings such that, for some nonterminals X and Y that produce the left and right substrings respectively, the grammar has the rule W → XY . By the arguments in the previous paragraph, this shows that the algorithm does indeed find W among the nonterminals that can produce y.
All this works for any generated string y of length n + 1. This completes the inductive step. Therefore, by Mathematical Induction, the claim is true for all n.
A string is generated by the CFG if and only if it can be produced from the Start symbol. We have just proved that the CYK algorithm correctly finds the list of all nonterminals that can produce a given string. Therefore, a string is generated by the CFG if and only if the set of nonterminals that the CYK algorithm finds for that string, as possible producers of it, includes the Start symbol.
15.
Suppose CricketBowlingFigures is regular. Then it has a Finite Automaton, by Kleene’s Theo- rem. Let k be the number of states in a Finite Automaton for CricketBowlingFigures. Let n be any positive integer such that n ≥ k.
Let w be the string (1n,1n,,), representing the bowling figures (n,n,0,0) (for a bowler who bowls n overs without having any runs scored from the bowling at all, but without taking any wickets either). This may also be written
(11···············1,11···············1,,)
n overs n maiden overs
It satisfies the definition of strings in CricketBowlingFigures, so it belongs to the language. Therefore, by the Pumping Lemma for Regular Languages, there exist strings x, y, z such that w = xyz, and |xy| ≤ n, and y is non-empty, and xyiz ∈ CricketBowlingFigures for all i ≥ 0.
We prove that CricketBowlingFigures is not regular.
15
Since |xy| ≤ n, the nonempty string y must lie within the initial portion of the string consisting just of “(11 · · · 1”, before the first comma.
If y includes the left parenthesis at the very start, then repeating or deleting y destroys the pair of parentheses so that the string xyiz ̸∈ CricketBowlingFigures for any i ̸= 1.
If y does not include the left parenthesis, then it only includes 1s from the first unary number representing the number of overs, n. Then consider the string xz. The number of overs represented by this string is < n (specifically, it is n − |y|). But the number of maiden overs is still n, so we have a violation of the constraint that O ≥ M. Therefore xz ̸∈ CricketBowlingFigures.
This contradicts the conclusion of the Pumping Lemma for Regular Languages.
Therefore CricketBowlingFigures is not regular. (b)
We prove that CricketBowlingFigures is not context-free.
Suppose CricketBowlingFigures is context-free. Then it has a CFG G in Chomsky Normal Form. Let k be the number of non-terminal symbols in G. Let n be any positive integer such that n > 2k. Let w be the string (1n,1n,,), representing the bowling figures (n,n,0,6n) (for a bowler who bowls n overs without having any runs scored from the bowling at all, and taking a wicket from
every ball). This may also be written (11···············1,11···············1,,11···············1)
n overs n maiden overs 6n wickets
It satisfies the definition of strings in CricketBowlingFigures, so it belongs to the language. Therefore, by the Pumping Lemma for Context-Free Languages, there exist strings u, v, x, y, z such that w = uvxyz, and |vxy| ≤ 2k, and v and y are not both empty, and uvixyiz ∈ CricketBowlingFigures for all i ≥ 0.
Observe that |vxy| < n, since |vxy| < 2k and 2k < n. Consider the implications of this for the location of vxy within the string w. They could both be within one of the strings of 1s, or they could lie across O and M or across M and W) including the comma(s) between them. But they could not include parts of both O and W.
If either of v or y includes a comma and/or a parenthesis, then choosing i = 2 gives uv2xy2z ̸∈ CricketBowlingFigures, since uv2xy2z has more than three commas and/or an excess of parentheses, violating the definition of the language.
If vxy lies within the string for O, then choosing i = 0 gives the string uxz ̸∈ CricketBowlingFigures, since uxz has O < M.
If v lies within the string for O and y lies within the string for M, then choosing i = 0 gives uxz ̸∈ CricketBowlingFigures, since uxz has O = M < n and W = 6n > 6 O.
If v lies within the string for M (regardless of where y is), then choosing i = 2 gives uv2xy2z ̸∈ CricketBowlingFigures, since uv2xy2z has O < M.
If v and y both lie within the string for W , then choosing i = 2 gives uv2xy2z ̸∈ CricketBowlingFigures, since uv2xy2z has W > 6O.
So, every possible location of vxy leads to a contradiction with the conclusion of the Pumping Lemma for Context-Free Languages.
So the initial assumption that CricketBowlingFigures is context-free was wrong. Therefore CricketBowlingFigures is not context-free.
16