6. Context-free grammars
A canonical first example of a context-free grammar (CFG) is shown in (1). This grammar generates the stringset {aⁿbⁿ | n ≥ 1}, i.e. the stringset {ab, aabb, aaabbb, aaaabbbb, …}.
(1) S → aSb
    S → ab
The grammar in (1) generates the string ‘aaaabbbb’ because we can start from ‘S’ and gradually rewrite nonterminals in accord with the rules, one step at a time, as follows:
(2) S
    aSb
    aaSbb
    aaaSbbb
    aaaabbbb
A sequence of strings related to each other like (2) is known as a derivation.
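The rewriting in (2) can be replayed mechanically. Here is a sketch in Python (not part of the notes; the helper `rewrite_once` and the string representation of sentential forms are our own choices):

```python
# Sketch: replay the derivation in (2) by rewriting the first occurrence
# of the nonterminal 'S', three times with S -> aSb and once with S -> ab.

def rewrite_once(s, lhs, rhs):
    """Rewrite the leftmost occurrence of the nonterminal lhs using lhs -> rhs."""
    return s.replace(lhs, rhs, 1)

derivation = ["S"]
for _ in range(3):
    derivation.append(rewrite_once(derivation[-1], "S", "aSb"))
derivation.append(rewrite_once(derivation[-1], "S", "ab"))

print(derivation)   # ['S', 'aSb', 'aaSbb', 'aaaSbbb', 'aaaabbbb']
```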
1 Formal definition of context-free grammars
(3) A context-free grammar (CFG) is a four-tuple (N, Σ, I, R) where:
• N is a finite set of nonterminal symbols;
• Σ, the alphabet, is a finite set of terminal symbols (disjoint from N);
• I ⊆ N is the set of initial nonterminal symbols; and
• R ⊆ N × (N ∪ Σ)* is a finite set of rules.
So strictly speaking, (1) is an informal representation of the following mathematical object:
(4) G = ({S}, {a, b}, {S}, {(S, aSb), (S, ab)})
We often write u ⇒G v to mean that from the string u we can derive the string v by rewriting one symbol in accord with one of the rules of the grammar G; and write u ⇒G* v to mean that from u we can derive v in one or more such steps.
So what we did above in (2) can be described in symbols like this:
(5) S ⇒G aSb ⇒G aaSbb ⇒G aaaSbbb ⇒G aaaabbbb
(6) S ⇒G* aaaabbbb
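The relations ⇒G and ⇒G* can be made concrete with a small search procedure. Here is a sketch in Python for the grammar in (1); the representation (rules as string pairs, symbols as single characters) is our own choice, not the notes':

```python
# Sketch: the one-step rewriting relation u =>_G v, and u =>*_G v via
# breadth-first search over sentential forms.

RULES = [("S", "aSb"), ("S", "ab")]   # the grammar in (1)

def step(u):
    """All strings v such that u =>_G v (rewrite one nonterminal occurrence)."""
    results = []
    for i, sym in enumerate(u):
        for lhs, rhs in RULES:
            if sym == lhs:
                results.append(u[:i] + rhs + u[i+1:])
    return results

def derives(u, v):
    """True iff u =>*_G v in one or more steps. The length bound makes the
    search terminate, since every rule in (1) strictly lengthens the string."""
    frontier = [u]
    while frontier:
        frontier = [w for x in frontier for w in step(x) if len(w) <= len(v)]
        if v in frontier:
            return True
    return False

print(derives("S", "aaaabbbb"))   # True
```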
A CFG G = (N, Σ, I, R) generates a string w ∈ Σ* iff there’s some way to derive w from some initial nonterminal symbol:
(7) w ∈ L(G) ⇐⇒ ⋁_{n∈N} ( I(n) ∧ n ⇒G* w )
Ling185A, Winter 2021 — Tim Hunter, UCLA
Two handy conventions:
• We generally use uppercase letters for nonterminal symbols and lowercase letters for terminal symbols, in which case we can recover the sets N and Σ if we’re only given a collection of rules.
• Unless otherwise specified, we’ll assume that there is exactly one initial nonterminal symbol, and that it is the symbol on the left-hand side of the first rule listed. This way the grammar can be entirely identified by listing its rules.
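These conventions are easy to operationalize. A sketch in Python (our own representation: rules as (lhs, rhs) string pairs):

```python
# Sketch: recover N, Sigma, and the initial nonterminal from a bare rule
# list, using the conventions above (uppercase = nonterminal,
# lowercase = terminal, first rule's left-hand side = initial symbol).

rules = [("S", "aSb"), ("S", "ab")]   # the grammar in (1)

N     = {c for lhs, rhs in rules for c in lhs + rhs if c.isupper()}
Sigma = {c for lhs, rhs in rules for c in rhs if c.islower()}
I     = {rules[0][0]}

print(sorted(N), sorted(Sigma), sorted(I))   # ['S'] ['a', 'b'] ['S']
```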
2 Equivalent derivations, leftmost derivations, and rightmost derivations
Now let’s consider a more interesting grammar:
(8) NP → NP and NP
    NP → NP or NP
    NP → apples
    NP → bananas
    NP → oranges
Notice that the string ‘apples and oranges’ has two distinct derivations in this grammar:
(9) a. NP
       NP and NP
       apples and NP
       apples and oranges
    b. NP
       NP and NP
       NP and oranges
       apples and oranges
And the string ‘apples and oranges or bananas’ has many distinct derivations, but here are two of them:
(10) a. NP
        NP and NP
        apples and NP
        apples and NP or NP
        apples and oranges or NP
        apples and oranges or bananas
     b. NP
        NP or NP
        NP and NP or NP
        apples and NP or NP
        apples and oranges or NP
        apples and oranges or bananas
But in an important way, the two derivations in (9) seem to be just “different ways of doing the same thing”, whereas the two in (10) seem “actually different”. The underlying distinction is that the two derivations in (9) are both plugging together the following three facts:
(11) NP ⇒* apples and oranges
     NP ⇒* apples
     NP ⇒* oranges
but the two derivations in (10) plug together different collections of facts of this sort:
(12) a. NP ⇒* apples and oranges or bananas
        NP ⇒* apples
        NP ⇒* oranges or bananas
        NP ⇒* oranges
        NP ⇒* bananas
     b. NP ⇒* apples and oranges or bananas
        NP ⇒* apples and oranges
        NP ⇒* apples
        NP ⇒* oranges
        NP ⇒* bananas
In the early generative linguistics literature, these individual constituency-facts like ‘NP ⇒* oranges or bananas’ were known as “is-a relationships” (e.g. “ ‘oranges or bananas’ is a NP”).1
Let’s call such a collection of is-a relationships an analysis of a string. So:
• (9a) and (9b) both correspond to the same analysis of the string ‘apples and oranges’, shown in (11), but
• (10a) and (10b) correspond to two distinct analyses of the string ‘apples and oranges or bananas’, shown in (12a) and (12b), respectively.
An analysis can be represented graphically by a tree. Here are the trees for the analyses above, shown in bracketed form:

(9) / (11):    [NP [NP apples] and [NP oranges]]
(10a) / (12a): [NP [NP apples] and [NP [NP oranges] or [NP bananas]]]
(10b) / (12b): [NP [NP [NP apples] and [NP oranges]] or [NP bananas]]

If two derivations correspond to the same analysis, i.e. they both express the same set of is-a relations, we’ll call them equivalent derivations.

So the relationship between derivations and analyses is many-to-one; usually we only care about distinct analyses of a string, and the differences between equivalent derivations (such as (9a) and (9b)) concern only the order in which the rewrites take place. We can get rid of these “spurious choices” by adopting some fixed strategy that dictates, given any intermediate point in a derivation (such as ‘NP and NP’), which nonterminal symbol must be rewritten next.

(13) a. A leftmost derivation is a derivation where every step rewrites the leftmost nonterminal symbol in the preceding string. There is exactly one leftmost derivation per analysis.
     b. A rightmost derivation is a derivation where every step rewrites the rightmost nonterminal symbol in the preceding string. There is exactly one rightmost derivation per analysis.

The derivation in (9a) is a leftmost derivation; the derivation in (9b) is a rightmost derivation. The two derivations in (10) are both leftmost derivations.

So by considering only leftmost derivations or considering only rightmost derivations, we can leave aside the “irrelevant” differences; if we find two leftmost derivations or two rightmost derivations for a string, then it has two distinct tree structures, i.e. can be analyzed by two distinct sets of is-a relationships.
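Since each analysis determines exactly one leftmost derivation, that derivation can be read off a tree mechanically. A sketch in Python, where trees are encoded (our own choice) as nested lists [label, child1, ...] and terminal words are plain strings:

```python
# Sketch: read the unique leftmost derivation off a tree by repeatedly
# expanding the leftmost unexpanded subtree in the sentential form.

def leftmost_derivation(tree):
    def show(form):
        # A subtree contributes its label; a word contributes itself.
        return " ".join(t[0] if isinstance(t, list) else t for t in form)
    form = [tree]
    steps = [show(form)]
    while any(isinstance(t, list) for t in form):
        i = next(j for j, t in enumerate(form) if isinstance(t, list))
        form = form[:i] + form[i][1:] + form[i+1:]  # expand leftmost nonterminal
        steps.append(show(form))
    return steps

# The tree for (9) / (11), i.e. 'apples and oranges':
tree_9 = ["NP", ["NP", "apples"], "and", ["NP", "oranges"]]
print(leftmost_derivation(tree_9))
# ['NP', 'NP and NP', 'apples and NP', 'apples and oranges'] (matches (9a))
```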
3 Finding analyses
We’ll assume from here on that all CFGs are in Chomsky Normal Form (CNF), which means that each rule’s right-hand side consists of either a single terminal symbol or exactly two nonterminal symbols. So in effect we’ll take R to be of the form R ⊆ N × ((N × N) ∪ Σ). Assuming this form is computationally convenient and doesn’t really restrict what we can do.2
1See e.g. chapter 4 of Chomsky’s Syntactic Structures (1957), or chapter 1 of Lasnik’s Syntactic Structures Revisited (2000).
2Any CFG can be converted into a CFG in CNF that generates the same stringset — modulo the empty string, i.e. for any CFG G we can construct another CFG G′ such that G′ is in CNF and L(G′) = L(G) − {ε}.
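The CNF condition itself is mechanical to check. A sketch in Python (our own encoding: rules as (lhs, rhs) pairs with rhs a tuple of symbols, and uppercase-initial symbols treated as nonterminals):

```python
# Sketch: test whether every rule's right-hand side is a single terminal
# or exactly two nonterminals, as CNF requires.

def is_nonterminal(sym):
    return sym[0].isupper()

def is_cnf(rules):
    return all(
        (len(rhs) == 1 and not is_nonterminal(rhs[0])) or
        (len(rhs) == 2 and is_nonterminal(rhs[0]) and is_nonterminal(rhs[1]))
        for lhs, rhs in rules
    )

print(is_cnf([("S", ("a", "S", "b")), ("S", ("a", "b"))]))   # grammar (1): False
print(is_cnf([("VP", ("V", "NP")), ("V", ("watches",))]))    # True
```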
Here’s an example grammar in CNF:
(14) VP → V NP      V → watches      NP → spies
     VP → VP PP     VP → watches     NP → telescopes
     NP → NP PP     VP → spies       P → with
     PP → P NP      NP → watches

     N = {VP, NP, PP, V, P}
     Σ = {watches, spies, telescopes, with}
     I = {VP}
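For concreteness, the grammar in (14) might be encoded like this (a sketch; the split into binary and lexical rules mirrors the CNF form R ⊆ N × ((N × N) ∪ Σ), and the names are our own):

```python
# Sketch: grammar (14) as data. A binary rule pairs a parent nonterminal
# with a (left, right) tuple of nonterminals; a lexical rule pairs it
# with a terminal word.

BINARY = {("VP", ("V", "NP")), ("VP", ("VP", "PP")),
          ("NP", ("NP", "PP")), ("PP", ("P", "NP"))}
LEXICAL = {("V", "watches"), ("VP", "watches"), ("NP", "watches"),
           ("VP", "spies"), ("NP", "spies"),
           ("NP", "telescopes"), ("P", "with")}

N     = {"VP", "NP", "PP", "V", "P"}
SIGMA = {"watches", "spies", "telescopes", "with"}
I     = {"VP"}

# The predicate R(n, x) from the notes corresponds to membership tests:
print(("V", "watches") in LEXICAL)      # True
print(("VP", ("V", "NP")) in BINARY)    # True
```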
The important idea about how a CFG generates a string can be stated in the same form as what we saw for FSAs:
(15) w ∈ L(G) ⇐⇒ ⋁_{all possible trees t} ( string w can be generated via tree t )
For an FSA, the relevant notion of a path is a linear path through the states; here, think of a tree as a branching path through the nonterminals.
One way in which CFGs are more complicated than FSAs is that, even given a particular string we’re interested in, we don’t know what the shape(s) of the relevant path(s) might be. So spelling out what it means to consider “all possible trees t” is a bit trickier than spelling out “all possible paths p” for an FSA.
But if we leave aside the “middle” of the tree for the moment, and just focus on the top and the bottom, we can at least write the following:
(16) x1x2…xn ∈ L(G) ⇐⇒ ⋁_{r∈N} ⋁_{c1∈N} ··· ⋁_{cn∈N} ( I(r) ∧ ··· ∧ R(c1, x1) ∧ R(c2, x2) ∧ ··· ∧ R(cn, xn) )
For example, what we would need to evaluate to see whether the grammar in (14) will generate ‘watches with telescopes’3 would look something like this (still leaving aside the “middle” of the tree):
(17) ⋁_{r∈N} ⋁_{c1∈N} ⋁_{c2∈N} ⋁_{c3∈N} ( I(r) ∧ ··· ∧ R(c1, watches) ∧ R(c2, with) ∧ R(c3, telescopes) )
Interleaving the growing of structure with the checking of this structure against the grammar, analogously to what we did with forward and backward values for FSAs, will let us sidestep the issue of working out exactly what would need to go into this “global disjunction”.
4 Inside values
For any CFG G there’s a two-place predicate insideG, relating strings to nonterminals:
(18) insideG(w)(n) is true iff the string w is derivable from the nonterminal n in grammar G
Given a way to work out insideG(w)(n) for any string and any nonterminal, we can easily use this to check for membership in L(G):
(19) w ∈ L(G) ⇐⇒ ⋁_{n∈N} ( I(n) ∧ insideG(w)(n) )
We can arrange the values of the predicate insideG in a table or chart. The chart in (20) shows the values of insideG for all non-empty substrings, or infixes, of ‘watches spies with telescopes’, where G is the grammar in (14). Each cell in the chart corresponds to one of these infixes.4
3Notice that this is a length-three string belonging to Σ∗, where Σ is the set of terminal symbols in (14).
4So a cell in this chart, since it corresponds to an infix (and specifies a value for each nonterminal), is analogous to a column in the tables we saw earlier for forward and backward values.
(20)
                  watches           spies             with              telescopes
   watches …   VP:1 NP:1 PP:0   VP:1 NP:0 PP:0   VP:0 NP:0 PP:0   VP:1 NP:0 PP:0
               V:1  P:0         V:0  P:0         V:0  P:0         V:0  P:0
   spies …                      VP:1 NP:1 PP:0   VP:0 NP:0 PP:0   VP:1 NP:1 PP:0
                                V:0  P:0         V:0  P:0         V:0  P:0
   with …                                        VP:0 NP:0 PP:0   VP:0 NP:0 PP:1
                                                 V:0  P:1         V:0  P:0
   telescopes                                                     VP:0 NP:1 PP:0
                                                                  V:0  P:0

Each cell corresponds to the infix that starts at its row’s word and ends at its column’s word; e.g. the top right cell is for ‘watches spies with telescopes’, and the bottom right cell is for ‘telescopes’.
We can fill this table in gradually by “working from small to big”, similarly to what we saw for forward and backward values for FSAs, but with some important differences.
• Here it’s the cells on the diagonal, corresponding to infixes of length one, that can be filled in just by lookups into the grammar.5
• We then work upwards and to the right to fill in values for longer infixes, which can be calculated by looking “just one step back” at values for shorter infixes.
• But exactly what it means to look “one step back” is a bit more complicated because of the fact that we’re dealing with all infixes, not just prefixes and suffixes.
Specifically, every cell in the table can be filled in according to the following recursive definition (which is not as complicated as it looks):
(21) insideG(x)(n) = R(n, x)    (for x a single terminal symbol)
     insideG(x1…xm)(n) = ⋁_{1≤i<m} ⋁_{l∈N} ⋁_{r∈N} ( R(n, (l, r)) ∧ insideG(x1…xi)(l) ∧ insideG(xi+1…xm)(r) )
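The recursion in (21) can be turned directly into a chart computation that fills cells “from small to big”, as described above for chart (20). Here is a sketch in Python using the grammar in (14); the function names and the (i, j) span encoding are our own:

```python
# Sketch: bottom-up computation of inside values (CKY recognition).
# chart[(i, j)] holds the set of nonterminals n such that
# insideG(x_{i+1} ... x_j)(n) is true.

BINARY = {("VP", ("V", "NP")), ("VP", ("VP", "PP")),
          ("NP", ("NP", "PP")), ("PP", ("P", "NP"))}
LEXICAL = {("V", "watches"), ("VP", "watches"), ("NP", "watches"),
           ("VP", "spies"), ("NP", "spies"),
           ("NP", "telescopes"), ("P", "with")}
INITIAL = {"VP"}

def inside_chart(words):
    n = len(words)
    chart = {}
    for i in range(n):  # diagonal cells: length-one infixes, filled by grammar lookups
        chart[(i, i + 1)] = {lhs for (lhs, rhs) in LEXICAL if rhs == words[i]}
    for length in range(2, n + 1):        # then work from small to big
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):     # the split point, like 1 <= i < m in (21)
                for (lhs, (l, r)) in BINARY:
                    if l in chart[(i, k)] and r in chart[(k, j)]:
                        cell.add(lhs)
            chart[(i, j)] = cell
    return chart

def generates(words):
    """w in L(G) iff some initial nonterminal derives all of w, as in (19)."""
    return bool(INITIAL & inside_chart(words)[(0, len(words))])

sentence = ["watches", "spies", "with", "telescopes"]
chart = inside_chart(sentence)
print(chart[(1, 4)])        # nonterminals deriving 'spies with telescopes'
print(generates(sentence))  # True
```

The values this computes agree with chart (20): for instance, the cell for ‘spies with telescopes’ comes out as {VP, NP}, and the cell for the whole string as {VP}.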