
University of Toronto, Department of Computer Science
CSC 2501F—Computational Linguistics, Fall 2018

Reading assignment 2

Due date: In class at 2:10, Friday 28 September 2018.
Late write-ups will not be accepted without documentation of a medical or other emergency.
This assignment is worth 5% of your final grade.

What to read

Lillian Lee, “Fast Context-Free Grammar Parsing Requires Fast Boolean Matrix Multiplica-
tion,” JACM, 49(1), 2002, 1–15.

What to write

Write a brief summary of the paper’s argumentation, with a critical assessment of its merits.

General requirements: Your write-up should be typed, using 12-point font and 1.5-line
spacing; it should fit on one to two sides of a sheet of paper. Hand in one copy at the begin-
ning of class.

Fast Context-Free Grammar Parsing Requires Fast Boolean Matrix

Multiplication

Lillian Lee

Department of Computer Science, Cornell University

Abstract

In 1975, Valiant showed that Boolean matrix multiplication can be used for parsing context-free grammars (CFGs), yielding the asymptotically fastest (although not practical) CFG parsing algorithm known. We prove a dual result: any CFG parser with time complexity O(gn^{3−ε}), where g is the size of the grammar and n is the length of the input string, can be efficiently converted into an algorithm to multiply m × m Boolean matrices in time O(m^{3−ε/3}). Given that practical, substantially sub-cubic Boolean matrix multiplication algorithms have been quite difficult to find, we thus explain why there has been little progress in developing practical, substantially sub-cubic general CFG parsers. In proving this result, we also develop a formalization of the notion of parsing.

1 Introduction

The context-free grammar (CFG) formalism, introduced by Chomsky (1956), has enjoyed wide use
in a variety of fields. CFGs have been used to model the structure of programming languages,
human languages, and even biological data such as the sequences of nucleotides making up DNA
and RNA (Aho, Sethi, and Ullman, 1986; Jurafsky and Martin, 2000; Durbin et al., 1998).

CFGs are generative systems, where strings are derived via successive applications of rewriting
rules. In practice, however, the goal generally is not to generate valid strings from a grammar.
Rather, one typically already has some string of interest, such as a C program or an English
sentence, in hand, and the goal is to analyze — parse — the string with respect to the grammar.

Canonical methods for general CFG parsing are the CKY algorithm (Kasami, 1965; Younger, 1967) and Earley’s algorithm (Earley, 1970). Both have a worst-case running time of O(gn^3) for a CFG of size g and string of length n (Graham, Harrison, and Ruzzo, 1980), although CKY requires the input grammar to be in Chomsky normal form in order to achieve this time bound. Unfortunately, cubic dependence on the string length is prohibitively expensive in applications such as speech recognition, where responses must be made in real time, or in situations where the input sequences are very long, as in computational biology.
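To make the cubic cost concrete, here is a minimal sketch (mine, not the paper’s) of a CKY recognizer in Python for a grammar in Chomsky normal form; the dictionary-based grammar representation and the function name are illustrative assumptions. The three nested loops over span length, start position, and split point give the n^3 factor, and scanning the rule set contributes the grammar-size factor g.

def cky_recognize(words, term_rules, bin_rules, start):
    """term_rules: terminal -> set of nonterminals X with X -> terminal.
    bin_rules: (Y, Z) -> set of nonterminals X with X -> Y Z."""
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, a in enumerate(words):
        chart[i][i + 1] |= term_rules.get(a, set())
    for span in range(2, n + 1):          # O(n) span lengths ...
        for i in range(0, n - span + 1):  # ... times O(n) start positions ...
            j = i + span
            for k in range(i + 1, j):     # ... times O(n) split points
                for Y in chart[i][k]:
                    for Z in chart[k][j]:
                        chart[i][j] |= bin_rules.get((Y, Z), set())
    return start in chart[0][n]

# Toy grammar: S -> A B, A -> 'a', B -> 'b'
term_rules = {'a': {'A'}, 'b': {'B'}}
bin_rules = {('A', 'B'): {'S'}}
print(cky_recognize(['a', 'b'], term_rules, bin_rules, 'S'))  # True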

Asymptotically faster parsing algorithms do exist. Graham, Harrison, and Ruzzo (1980) give a variant of Earley’s algorithm that is based on the so-called “four Russians” algorithm (Arlazarov et al., 1970) for Boolean matrix multiplication (BMM); it runs in time O(gn^3/log n). Rytter (1985) further modifies this parser by a compression technique, improving the dependence on the string length to O(n^3/log^2 n). But Valiant’s (1975) parsing method, which reorganizes the computations of CKY, is the asymptotically fastest known. It also uses BMM; its worst-case running time for a grammar in Chomsky normal form is proportional to M(n), where M(m) is the time it takes to multiply two m × m Boolean matrices together.


[Figure 1 (plot omitted): the y-axis shows the best known matrix multiplication exponent, from 3 down to roughly 2.3, and the x-axis spans the years 1965–2000; the plotted points are labeled standard algorithm, Strassen, Pan, Bini–Capovani–Lotti–Romani, Schoenhage, Romani, Strassen, and Coppersmith–Winograd.]

Figure 1: Lowest known upper bound on the exponent ω for the complexity of matrix multiplication. For instance, before 1969, the fastest known algorithm for matrix multiplication took a number of steps proportional to m^3 (ω = 3).

Since these subcubic parsing algorithms all depend on Boolean matrix multiplication, it is natural to ask how fast BMM can be performed in practice. The asymptotically fastest way known to perform BMM is to rely on algorithms for multiplying arbitrary matrices. There exist matrix multiplication algorithms with time complexity O(m^{3−δ}), thus improving over the standard algorithm’s O(m^3) running time; for instance, Strassen’s (1969) has a worst-case running time of O(m^{2.81}), and the fastest currently known, due to Coppersmith and Winograd (1987; 1990), has time complexity O(m^{2.376}). (See Strassen (1990) for a historical account, plotted graphically in Figure 1.) Unfortunately, the constants involved in the subcubic algorithms improving on Strassen’s result are so large that these fast algorithms cannot be used in practice. As for Strassen’s method itself, its practicality is ambiguous: empirical studies show that the “cross-over” point — the matrix size at which it becomes better to use Strassen’s method — is above 100 (Bailey, 1988; Thottethodi, Chatterjee, and Lebeck, 1998). In summary, despite decades of research effort, there has been little success at finding a clearly practical, simple, fast matrix multiplication algorithm.
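As an illustration of the sub-cubic methods discussed above, the following is a compact sketch of Strassen’s recursion (my own, under the assumption that the inputs are square matrices of power-of-two dimension, using NumPy for block arithmetic); the seven block products in place of the naive eight give the O(m^{log2 7}) ≈ O(m^{2.81}) bound, and the cutoff parameter is a hypothetical stand-in for the empirical cross-over point.

import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices whose dimension is a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # below the cross-over, use the standard method
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight:
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    top = np.hstack([M1 + M4 - M5 + M7, M3 + M5])
    bottom = np.hstack([M2 + M4, M1 - M2 + M3 + M6])
    return np.vstack([top, bottom])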

One might therefore hope to find a way to speed up CFG parsing without relying on matrix multiplication. However, the main theorem of this paper is that fast CFG parsing requires fast Boolean matrix multiplication, in the following precise sense: any parser running in time O(gn^{3−ε}) that represents parse data in a retrieval-efficient way can be converted with little computational overhead into an O(m^{3−ε/3}) BMM algorithm.

The restriction of our result to parsers with a linear dependence on the grammar size is crucial for
relating sub-cubic parsing to sub-cubic BMM. However, as discussed in section 2.3, this restriction
is a reasonable one since canonical parsing algorithms such as CKY and Earley’s algorithm have
this property, and furthermore, in domains like natural language processing, the grammar size is
often the dominating factor.

Our theorem, together with the fact that it has been quite difficult to find practical fast matrix
multiplication algorithms, explains why there has been little success to date in developing practical
CFG parsers running in substantially sub-cubic time.


[Figure 2 (trees omitted): the first parse attaches the PP “on Tuesday” inside the noun phrase, [VP [V List] [NP [DET all] [N′ [N flights] [PP on Tuesday]]]]; the second attaches it to the verb phrase, [VP [V List] [NP all flights] [PP on Tuesday]].]

Figure 2: Two different parse trees for the sentence “List all flights on Tuesday”. The labels on the interior nodes denote linguistic categories.

2 The parsing problem: a formalization

In this section, we motivate and set forth a formalization of the parsing problem.

2.1 Motivation for our definition

In formal language theory, emphasis has been placed on the recognition or membership problem:
deciding whether or not a given string can be derived by a grammar. However, we concentrate here
on the parsing problem: finding the parse structure, or analysis, assigned to a string by a grammar.
(In the case of ambiguous strings, multiple parses exist; we address this point below.)

From a theoretical standpoint, the two problems are almost equivalent. Recognition obviously
reduces to parsing, and indeed to our knowledge there are no CFG recognition algorithms that do
not implicitly compute parse information. Conversely, Ruzzo (1979) demonstrated that any CFG
recognition algorithm that is not already an implicit parser can be converted into an algorithm that
returns a (single) parse of the input string w, at a cost of only a factor of O(log |w|) slowdown.

In practice, however, the parsing problem is much more compelling than the membership prob-
lem. Understanding the structure of the input string is crucial to programming language compi-
lation, natural language understanding, RNA shape determination, and so on. In fact, in speech
recognition systems, a useful assumption is that any input utterance is somehow “valid”, even if
it is ungrammatical, thus making the recognition problem trivial. However, different parses of the
input sentence may lead to radically different interpretations. For example, the classic sentence
“List all flights on Tuesday” has two different parses (see Figure 2): one indicates that all flights
taking off on Tuesday should be listed right now, whereas the other asks to wait until Tuesday, and
then list all flights regardless of their departure date. Another well-known ambiguous sentence is
“I saw the man with the telescope”; observe that here the two possible interpretations seem to be
about equally likely.

The fact that some input strings are ambiguous raises the question of what we should require
the output of a parsing algorithm to be: any single parse of the input string (Ruzzo’s reduction of
parsing to recognition uses this model), or all possible parses? In practice, since multiple analyses
may be valid (as in the natural language examples above), it is clear that any practical parser
should return all parses.

It remains to determine what the format of the output parses should be. One problem is that
there exist grammars for which the number of parse trees for strings of length n grows exponentially


in n; for example, consider the Chomsky normal form CFG with productions S → SS | a. (If we do not impose any restrictions on the form of the grammar, then an infinite number of parse trees can be produced for a single string; for example, consider the production set S → S | a.) Hence, a compressed representation of the parse structures must be used; otherwise, every parser could take exponential time just to print its output. However, we must be careful to impose restrictions on the compression rate: after all, we could perversely consider the input string itself to be a (rather inconvenient) representation of all its parse trees (Ruzzo, 1979). We thus require practical parsers to output all the parses of an input string in a representation that is both compact and yet allows efficient retrieval of parse information. In the next subsection, we make this notion precise.

2.2 C-parsing of context-free grammars

We use the usual definition of a context-free grammar (CFG) as a 4-tuple G = (Σ, V, R, S), where Σ is the set of terminals, V is the set of nonterminals, R is the set of rewrite rules or productions, and S ∈ V is the start symbol. Given a string w = w_1 w_2 · · · w_n in Σ*, where each w_i is an element of Σ, we use the notation w_i^j to denote the substring w_i w_{i+1} · · · w_{j−1} w_j. The size of G, denoted by |G|, is the sum of the lengths of all productions in R.

Our notion of necessary parse information is based on the concept of CFG c-derivations, which are substring derivations that are consistent with some parse of the entire input string.

Definition 1 Let G = (Σ, V, R, S) be a CFG, and let w = w_1 w_2 · · · w_n, w_i ∈ Σ. A nonterminal A ∈ V c-derives (consistently derives) w_i^j if and only if the following conditions hold:

• A ⇒* w_i^j, and

• S ⇒* w_1^{i−1} A w_{j+1}^n.

(These conditions together imply that S ⇒* w.)

We argue, as do Ruzzo (1979) and, for a different formalism, Satta (1994), that a practical
parser must create output from which c-derivation information can be retrieved efficiently. This
information is what allows us to ascertain that there exists an analysis of the input sequence for
which a certain substring forms a constituent, or coherent unit. In contrast, derivation information
records potential subderivations that may not be consistent with any analysis of the full input string.
For example, in the sentence “Only the lonely can play”, “the lonely can” could conceivably, in
isolation, form a noun phrase, but clearly in any reasonable grammar of English no nonterminal
c-derives that substring. While some parsers retain information about derivations that are not
c-derivations, we formulate our definition of parsing to include algorithms that do not.

Definition 2 A c-parser is an algorithm that takes a CFG G = (Σ, V, R, S) and string w ∈ Σ* as input and produces output F_{G,w} that acts as an oracle about parse information as follows: for any A ∈ V,

• If A c-derives w_i^j, then F_{G,w}(A, i, j) = “yes”.

• If A ⇏* w_i^j (which implies that A does not c-derive w_i^j), then F_{G,w}(A, i, j) = “no”.

• F_{G,w} answers queries in constant time.
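One concrete (and entirely hypothetical) reading of Definition 2 is an oracle compiled into a lookup table, with the unconstrained middle case decided by the algorithm designer:

def make_oracle(c_derives, derives):
    """c_derives, derives: sets of (A, i, j) triples precomputed by some parser."""
    def F(A, i, j):
        if (A, i, j) in c_derives:
            return "yes"
        if (A, i, j) not in derives:
            return "no"
        return "yes"  # A derives but does not c-derive w_i^j: either answer is allowed
    return F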


The asymmetry of derivation and c-derivation in our definition of c-parsing is deliberate. We allow F_{G,w}’s answer to be arbitrary if A ⇒* w_i^j but A does not c-derive w_i^j; we leave it to the algorithm designer to decide which answer is appropriate. Thus, our definition makes the class of c-parsers as broad as possible: if we had changed the first condition to “If A derives w_i^j . . .”, then Earley parsers would be excluded, since they do not keep track of all substring derivations; whereas if we had written the second condition as “If A does not c-derive w_i^j, . . .”, then CKY would not be a c-parser, since it tracks all substring derivations, not just c-derivations. In fact, the class of c-parsers contains all tabular parsers, including generalized LR parsing, CKY, and Earley’s algorithm (Nederhof and Satta, 1996). In contrast, Ruzzo (1979) deals with the difference between derivations and c-derivations by defining two different problems (the weak all-parses problem and the all-parses problem).

Our choice of an oracle rather than a specific data structure as the output of a c-parser is also for the purpose of keeping our definition as broad as possible. In tabular algorithms like CKY, the oracle is given in the form of a matrix or chart; indeed, Ruzzo’s (1979) definition of the all-parses and weak all-parses problems requires the output to be a matrix. However (as Ruzzo points out), this is not the only possibility, and furthermore has a liability from a technical point of view: if the output must be a matrix, then all parsing algorithms must take time at least Ω(n^2) even to print their output. Since it may be possible for c-derivations to be represented more compactly, we prefer to allow for this possibility in our definition.

Finally, with regards to the third condition, we observe that Satta (1994) imposes the same
constant-time constraint for a different grammar formalism (tree-adjoining grammars). On the
other hand, we could loosen this to allow query processing to take time polylogarithmic in the
string and grammar size without much effect on our results (see section 3.5).

2.3 Analyzing parser runtimes

It is common in the formal language theory literature to see the running time of parsing algorithms
described as a function of the length of the input string only (e.g., O(n^3) for a string of length n).
That is, the size of the context-free grammar is often treated as a constant. This stems in part
from two characteristics of the programming languages and compilers domains: first, the size of a
computer program’s source code is typically much greater than the size of the grammar describing
the programming language’s syntax, so that the grammar term is negligible; and second, compilers
are constructed to analyze many different programs with respect to a single built-in grammar.

However, in other domains these conditions do not hold. For example, in natural language,
sentences are relatively short (not often longer than one hundred words) compared with the size
of the grammar: Johnson (1998) describes a (probabilistic) CFG for a subset of English that has
22,773 rules. Indeed, Joshi (1997) notes that “the real limiting factor in practice is the size of the
grammar”. Therefore, it is reasonable to include in the analysis of parsing time the dependence on
the grammar size, and we will do so here. As a point of information, we note that both CKY and
Earley’s algorithm can be implemented to run in time O(|G|n^3) (Graham, Harrison, and Ruzzo,
1980), although CKY requires the input grammar to be in Chomsky normal form, conversion to
which may cause a quadratic increase in the number of productions in the grammar (Hopcroft and
Ullman, 1979).


[Figure 3 (diagram omitted): the matrices (A, B) go into a reduction that outputs (G, w); the c-parser maps (G, w) to the oracle F_{G,w}; a querier then reads the product C(A, B) off the oracle, yielding the overall BMM algorithm.]

Figure 3: Converting a c-parser into a BMM algorithm.

3 The reduction

In this section, we provide two efficient reductions of Boolean matrix multiplication to c-parsing,
thus proving that any c-parsing algorithm can be used as a Boolean matrix multiplication algo-
rithm with little computational overhead. The first reduction produces a string and a context-free
grammar; the second is a modification of the first in which the grammar produced is in Chom-
sky normal form. The techniques we use are an adaptation of Satta’s (1994) elegant reduction of
Boolean matrix multiplication to tree-adjoining grammar (TAG) parsing. However, Satta’s results
rely explicitly on properties of TAGs that allow them to generate non-context-free languages, and
so cannot be directly applied to CFGs.

3.1 Boolean matrix multiplication

A Boolean matrix is a matrix with entries from the set {0, 1}. A Boolean matrix multiplication (BMM) algorithm takes as input two m × m Boolean matrices A and B and returns their Boolean product A × B, which is the m × m Boolean matrix C whose entries are defined by

    c_{ij} = ⋁_{k=1}^{m} (a_{ik} ∧ b_{kj}).

That is, c_{ij} = 1 if and only if there exists a number k, 1 ≤ k ≤ m, such that a_{ik} = b_{kj} = 1.

As noted above, the Boolean product C can be computed via standard matrix multiplication, since c_{ij} = ∑_{k=1}^{m} a_{ik} · b_{kj}. This means that we can use the Coppersmith and Winograd (1990) general matrix multiplication algorithm to calculate the Boolean matrix product of two m × m Boolean matrices in time O(m^{2.376}). To our knowledge, the asymptotically fastest algorithms for BMM all rely on general matrix multiplication; the fastest algorithms that do not do so are the so-called “four Russians” algorithm (Arlazarov et al., 1970), with worst-case running time O(m^3/log(m)), and Rytter’s (1985) variant, which uses compression to reduce the time to O(m^3/log^2(m)).
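For reference, here is a minimal sketch (mine, not the paper’s) of both routes to the Boolean product — the definition applied directly, and the reduction to ordinary integer matrix multiplication; the function names are illustrative.

def boolean_product(A, B):
    """c_ij = OR over k of (a_ik AND b_kj), for m x m 0/1 matrices."""
    m = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(m)))
             for j in range(m)] for i in range(m)]

def boolean_product_via_integer_mm(A, B):
    """c_ij = 1 iff the ordinary integer product has a nonzero (i, j) entry."""
    m = len(A)
    return [[int(sum(A[i][k] * B[k][j] for k in range(m)) > 0)
             for j in range(m)] for i in range(m)]

A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0]]
print(boolean_product(A, B))  # [[0, 1], [1, 0]]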

3.2 The reduction: first version

Our goal in this section is to show that Boolean matrix multiplication can be efficiently reduced to
c-parsing of CFGs. That is, we will describe a simple procedure that takes as input an instance of
the BMM problem and converts it into an instance of the CFG parsing problem with the following
property: any c-parsing algorithm run on the new parsing problem yields output from which it
is easy to determine the answer to the original BMM problem. We therefore demonstrate that
any c-parser can be used to solve Boolean matrix multiplication via the three-step process shown
schematically in Figure 3.


Thus, given two Boolean matrices A and B, we need to show how to produce a grammar G and a string w such that c-parsing w with respect to G yields output F_{G,w} from which information about the Boolean product C = A × B can be easily retrieved. Our approach will be to encode almost all the information about A and B in the grammar.

We can sketch the desired behavior of the grammar G as follows. Suppose entries a_{ik} in A and b_{kj} in B are both 1. Assume we have some way to break up array indices into two parts so that i can be reconstructed from i1 and i2, j can be reconstructed from j1 and j2, and k can be reconstructed from k1 and k2 (we will describe a way to do this later; the motivation is to keep the grammar size relatively small). Then, our grammar will permit the following derivation sequence:

    C_{i1,j1} ⇒ A_{i1,k1} B_{k1,j1} ⇒* [w_{i2} · · · w_{k2+δ}] [w_{k2+δ+1} · · · w_{j2+2δ}],

where the first bracketed substring is derived by A_{i1,k1}, the second is derived by B_{k1,j1}, and δ will be defined later. The key thing to observe is that C_{i1,j1} generates two nonterminals whose “inner” indices match, and that these two nonterminals generate substrings that lie exactly next to each other. The “inner” indices constitute a check on k1, and substring adjacency constitutes a check on k2; together, these two checks serve as a proof that a_{ik} = b_{kj} = 1, and hence that c_{ij} is also 1.

We now set up some notation. Let A and B be two Boolean matrices, each of size m × m, and let C be their Boolean matrix product. In the rest of this section, we consider A, B, C, and m to be fixed. Set d = ⌈m^{1/3}⌉, and set δ = d + 2. (The effect of these choices on the efficiency of our reduction is discussed in section 3.5.) We will be constructing a string of length 3δ; we choose δ slightly larger than d in order to avoid having epsilon-productions in our grammar.

Our index encoding function is as follows. Let i be a matrix index, 1 ≤ i ≤ m ≤ d^3. Then, we define the function f(i) = (f1(i), f2(i)) by

    f1(i) = ⌊i/d⌋          (so that 0 ≤ f1(i) ≤ d^2), and
    f2(i) = (i mod d) + 2  (so that 2 ≤ f2(i) ≤ d + 1).

Since f1(i) and f2(i) are essentially the quotient and remainder of integer division of i by d, we can reconstruct i from (f1(i), f2(i)). It may be helpful to think of these two quantities as “high-order” and “low-order” bits, respectively. For convenience, we will employ the notational shorthand of using subscripts instead of the functions f1 and f2; that is, we write i1 and i2 for f1(i) and f2(i).
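A small sketch of the encoding and its inverse (illustrative, not from the paper); note that i = d·f1(i) + (f2(i) − 2) recovers the index:

def encode(i, d):
    """Split matrix index i (1 <= i <= d**3) into high- and low-order parts."""
    return i // d, (i % d) + 2   # (f1(i), f2(i))

def decode(i1, i2, d):
    """Reconstruct i from (f1(i), f2(i))."""
    return d * i1 + (i2 - 2)

d = 3
assert all(decode(*encode(i, d), d) == i for i in range(1, d**3 + 1))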

It is now our job to create a CFG G = (Σ, V, R, S) and a string w ∈ Σ* that encode information about A and B and express constraints about their product C.

We choose the set of terminals to be Σ = {w_ℓ : 1 ≤ ℓ ≤ 3d + 6}. The string we choose is extremely simple, and in fact doesn’t depend on A or B at all: we set w = w_1 w_2 · · · w_{3d+6}. We consider w to be made up of three parts, x, y, and z, each of size δ:

    w = x y z,  where  x = w_1 w_2 · · · w_{d+2},  y = w_{d+3} · · · w_{2d+4},  z = w_{2d+5} · · · w_{3d+6}.

Observe that for any array index i between 1 and m, it is the case that w_{i2} appears in x, w_{i2+δ} appears in y, and w_{i2+2δ} appears in z, since

    i2 ∈ [2, d + 1],
    i2 + δ ∈ [d + 4, 2d + 3], and
    i2 + 2δ ∈ [2d + 6, 3d + 5].


We now turn our attention to constructing the grammar G. Our plan is to include a set of nonterminals {C_{p,q} : 1 ≤ p, q ≤ d^2} in V such that c_{ij} = 1 if and only if C_{i1,j1} c-derives w_{i2}^{j2+2δ}.

3.3 The grammar

To create G = (Σ, V, R, S), we build up the set of nonterminals and productions, starting with V = {S} and R = ∅. We add nonterminal W to V for generating arbitrary non-empty substrings and therefore add productions

    W → w_ℓ W | w_ℓ,   1 ≤ ℓ ≤ 3d + 6.   (W-rules)

Next, we encode the entries of the input matrices A and B in our grammar. We add the nonterminals from the sets {A_{p,q} : 1 ≤ p, q ≤ d^2} and {B_{p,q} : 1 ≤ p, q ≤ d^2}. Then, for every non-zero entry a_{ij} in A, we add the production

    A_{i1,j1} → w_{i2} W w_{j2+δ}.   (A-rules)

For every non-zero entry b_{ij} in B, we add the production

    B_{i1,j1} → w_{i2+1+δ} W w_{j2+2δ}.   (B-rules)

To represent the entries of C, we add the nonterminals from the set {C_{p,q} : 1 ≤ p, q ≤ d^2} and include productions

    C_{p,q} → A_{p,r} B_{r,q},   1 ≤ p, q, r ≤ d^2.   (C-rules)

Finally, we complete the construction with productions for the start symbol S:

    S → W C_{p,q} W,   1 ≤ p, q ≤ d^2.   (S-rules)
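The construction is mechanical. The sketch below (an illustration under the conventions above; all names are hypothetical) emits the productions of G as tuples; it ranges the block indices p, q, r over 0, …, d^2, all values that f1 can take.

def build_grammar(A, B, m):
    """Return (productions, w): productions as (lhs, rhs) pairs, w as a list of terminals."""
    d = 1
    while d ** 3 < m:                      # exact integer d = ceil(m^(1/3))
        d += 1
    delta = d + 2
    enc = lambda i: (i // d, (i % d) + 2)  # (f1, f2)
    t = lambda l: 'w%d' % l                # terminal w_l
    R = []
    for l in range(1, 3 * d + 7):          # W-rules: W -> w_l W | w_l
        R += [('W', (t(l), 'W')), ('W', (t(l),))]
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            (i1, i2), (j1, j2) = enc(i), enc(j)
            if A[i - 1][j - 1]:            # A-rules
                R.append((('A', i1, j1), (t(i2), 'W', t(j2 + delta))))
            if B[i - 1][j - 1]:            # B-rules
                R.append((('B', i1, j1), (t(i2 + 1 + delta), 'W', t(j2 + 2 * delta))))
    idx = range(d * d + 1)                 # f1 takes values 0..d^2
    for p in idx:
        for q in idx:
            R.append(('S', ('W', ('C', p, q), 'W')))                 # S-rules
            for r in idx:
                R.append((('C', p, q), (('A', p, r), ('B', r, q))))  # C-rules
    return R, [t(l) for l in range(1, 3 * d + 7)]

The (d^2 + 1)^3 C-rules dominate the count, in line with the O(m^2) size bound derived later in this section.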

We now prove the following result about the grammar and string we have just described.

Theorem 1 For 1 ≤ i, j ≤ m, the entry c_{ij} in C is non-zero if and only if C_{i1,j1} c-derives w_{i2}^{j2+2δ}.

Proof. Fix i and j.

Let us prove the “only if” direction first. Thus, suppose c_{ij} = 1. Then there exists a k such that a_{ik} = b_{kj} = 1. Figure 4 sketches how C_{i1,j1} c-derives w_{i2}^{j2+2δ}.

Claim 1 C_{i1,j1} ⇒* w_{i2}^{j2+2δ}.

The production C_{i1,j1} → A_{i1,k1} B_{k1,j1} is one of the C-rules in our grammar. Since a_{ik} = 1, A_{i1,k1} → w_{i2} W w_{k2+δ} is one of our A-rules, and since b_{kj} = 1, B_{k1,j1} → w_{k2+1+δ} W w_{j2+2δ} is one
of our B-rules. Finally, since i2 + 1 < (k2 + δ) − 1 and (k2 + 1 + δ) + 1 ≤ (j2 + 2δ) − 1, we have W ⇒* w_{i2+1}^{k2+δ−1} and W ⇒* w_{k2+2+δ}^{j2+2δ−1}, since both substrings are of length at least one. Therefore,

    C_{i1,j1} ⇒ A_{i1,k1} B_{k1,j1} ⇒* [w_{i2} W w_{k2+δ}] [w_{k2+1+δ} W w_{j2+2δ}] ⇒* w_{i2}^{j2+2δ},

where the first bracketed part is derived by A_{i1,k1} and the second by B_{k1,j1}.

[Figure 4 (diagram omitted): S spans the whole string w_1 · · · w_{3d+6}; below it, C_{i1,j1} dominates A_{i1,k1}, which derives w_{i2} · · · w_{k2+δ}, and B_{k1,j1}, which derives w_{k2+1+δ} · · · w_{j2+2δ}, with the three regions x, y, z marked and the flanking material derived by W.]

Figure 4: Schematic of the derivation process when a_{ik} = b_{kj} = 1. The substrings derived by A_{i1,k1} and B_{k1,j1} lie right next to each other.

Claim 2 S ⇒* w_1^{i2−1} C_{i1,j1} w_{j2+2δ+1}^{3d+6}.

This claim is essentially trivial, since by the definition of the S-rules, we know that S ⇒* W C_{i1,j1} W. We need only show that neither w_1^{i2−1} nor w_{j2+2δ+1}^{3d+6} is the empty string (and hence each can be derived by W); since 1 ≤ i2 − 1 and j2 + 2δ + 1 ≤ 3d + 6, the claim holds.

Claims 1 and 2 together prove that C_{i1,j1} c-derives w_{i2}^{j2+2δ}, as required. (This proof would have been simpler if we had allowed W to derive the empty string. However, we avoid epsilon-productions in order to facilitate the conversion to Chomsky normal form discussed in the next section.)

Next we prove the “if” direction. Suppose C_{i1,j1} c-derives w_{i2}^{j2+2δ}, which by definition means C_{i1,j1} ⇒* w_{i2}^{j2+2δ}. This can only arise through the application of a C-rule: C_{i1,j1} ⇒ A_{i1,k′} B_{k′,j1} ⇒* w_{i2}^{j2+2δ} for some k′. It must be the case that for some ℓ, A_{i1,k′} ⇒* w_{i2}^{ℓ} and B_{k′,j1} ⇒* w_{ℓ+1}^{j2+2δ}. But then we must have the productions A_{i1,k′} → w_{i2} W w_{ℓ} and B_{k′,j1} → w_{ℓ+1} W w_{j2+2δ} with ℓ = k′′ + δ for some k′′. But we can only have such productions if there exists a number k such that k1 = k′, k2 = k′′, a_{ik} = 1, and b_{kj} = 1; and this implies that c_{ij} = 1.

Examination of the proof reveals that we also have the following two corollaries.

Corollary 1 For 1 ≤ i, j ≤ m, c_{ij} = 1 if and only if C_{i1,j1} ⇒* w_{i2}^{j2+2δ}. Hence, c-derivation and derivation are equivalent for the C_{p,q} nonterminals.

Corollary 2 S ⇒* w if and only if C is not the all-zeroes matrix.

Let us now calculate the size of G. V consists of roughly 3((d^2)^2) ≈ m^{4/3} nonterminals. R contains about 6d W-rules and (d^2)^2 ≈ m^{4/3} S-rules. There are at most m^2 A-rules, since we have A-rules only for each non-zero entry in A; similarly, there are at most m^2 B-rules. And lastly, there are (d^2)^3 ≈ m^2 C-rules. Therefore, our grammar is of size O(m^2) with a very small constant factor; considering that G encodes m × m matrices A and B, it is not possible to shrink this much further.

3.4 Chomsky normal form

We would like our results to cover as large a class of parsers as possible. Some parsers, such as CKY, require the input grammar to be in Chomsky normal form (CNF), that is, where the right-hand side of every production consists of either exactly two nonterminals or exactly a single terminal. We therefore wish to construct a CNF version G′ of G. However, not only do we want Theorem 1 to hold for G′ as well as G, but, in order to preserve time bounds, we also desire that |G′| = O(|G|).

Unfortunately, the standard algorithm for converting CFGs to CNF can yield a quadratic blow-up in the number of productions in the grammar (Hopcroft and Ullman, 1979) and thus is clearly unsatisfactory for our purposes. However, since G contains no epsilon-productions or unit productions, it is easy to convert G by adding a small number of record-keeping nonterminals and productions, with the resultant grammar G′ having very similar parse trees — in particular, the set of substrings that are c-derived by the C_{p,q} nonterminals is the same in each grammar. Figure 5 gives the productions of G′. Note that G′ has only O(d) more productions and nonterminals, and so |G′| = O(m^2) as well.

    W → W_ℓ W | w_ℓ                  (1 ≤ ℓ ≤ 3d + 6)
    W_ℓ → w_ℓ                        (1 ≤ ℓ ≤ 3d + 6)
    A_{i1,j1} → W_{i2} X_{j2+δ}      (one for each nonzero entry a_{ij} in A)
    X_{j2+δ} → W W_{j2+δ}            (2 ≤ j2 ≤ d + 1)
    B_{i1,j1} → W_{i2+1+δ} X_{j2+2δ} (one for each nonzero entry b_{ij} in B)
    X_{j2+2δ} → W W_{j2+2δ}          (2 ≤ j2 ≤ d + 1)
    C_{p,q} → A_{p,r} B_{r,q}        (1 ≤ p, q, r ≤ d^2)
    S → W T
    T → C_{p,q} W                    (1 ≤ p, q ≤ d^2)

Figure 5: A Chomsky normal form version of the productions of the grammar from the previous section.

3.5 Time bounds

We are now in a position to show the relation between time bounds for Boolean matrix multiplication and time bounds for CFG parsing.

Theorem 2 Any c-parser P with running time O(T(g)t(n)) on grammars of size g and strings of length n can be converted into a BMM algorithm M_P that runs in time O(max(m^2, T(m^2)t(m^{1/3}))). In particular, if P takes time O(gn^{3−ε}), then M_P runs in time O(m^{3−ε/3}).

Proof. M_P acts as sketched in Figure 3. More precisely, given two Boolean m × m matrices A and B, it constructs G (or G′, as required) and w as described above. It feeds G and w to P, which outputs oracle F_{G,w}. To compute the product matrix C, M_P requests from the oracle the value of F_{G,w}(C_{i1,j1}, i2, j2 + 2δ) (that is, whether or not C_{i1,j1} derives or c-derives w_{i2}^{j2+2δ} — by Corollary 1, the two notions are equivalent in this case) for each i and j, 1 ≤ i, j ≤ m, setting c_{ij} to one if and only if the answer is “yes”.

The running time of M_P is computed as follows. It takes O(m^2) time to read the two input matrices. Since G is of size O(m^2) and |w| = O(m^{1/3}), it takes O(m^2) time to build the input to P, which then computes F_{G,w} in time O(T(m^2)t(m^{1/3})). Retrieving C takes O(m^2) since, by definition of c-parser, each query to the oracle takes constant time. So the total time spent by M_P is O(max(m^2, T(m^2)t(m^{1/3}))), as claimed. Note that if we redefine c-parsing so that oracle queries take f(g, n) time instead of constant time, where g is the size of the grammar and n is the length of the string, then the bound changes to O(max(m^2 f(g, n), T(m^2)t(m^{1/3}))); as long as f is polylogarithmic, the second argument of the maximum in the bound surely dominates.

In the case where T(g) = g and t(n) = n^{3−ε}, M_P has a running time of O(m^2 (m^{1/3})^{3−ε}) = O(m^{3−ε/3}). The case in which P takes time linear in the grammar size is of the most interest, since, as mentioned above, in natural language processing applications the grammar tends to be far larger than the strings to be parsed. In this case, our result directly converts any improvement in the exponent for CFG parsing to a reduction in the exponent for BMM.
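As a toy end-to-end check (my own, entirely illustrative; the embedded CKY chart stands in for the black-box c-parser P, and Corollary 1 licenses querying plain derivations), the following sketch builds the CNF grammar G′ of Figure 5 from two small random matrices, parses the fixed string w, and reads the product off the C_{p,q} chart entries exactly as M_P’s querier does:

import random

def boolean_product_via_parsing(A, B, m):
    d = 1
    while d ** 3 < m:                           # d = ceil(m^(1/3)), computed exactly
        d += 1
    delta, n = d + 2, 3 * d + 6
    enc = lambda i: (i // d, (i % d) + 2)       # (f1, f2)
    # CNF grammar G' (Figure 5). Terminal w_l occurs only at position l,
    # so the chart can be indexed by string positions directly.
    binr = {}
    add = lambda lhs, y, z: binr.setdefault((y, z), set()).add(lhs)
    term = {l: {('W',), ('Wl', l)} for l in range(1, n + 1)}   # W -> w_l, W_l -> w_l
    for l in range(1, n + 1):
        add(('W',), ('Wl', l), ('W',))                         # W -> W_l W
    for j2 in range(2, d + 2):
        add(('X', j2 + delta), ('W',), ('Wl', j2 + delta))     # X-rules
        add(('X', j2 + 2 * delta), ('W',), ('Wl', j2 + 2 * delta))
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            (i1, i2), (j1, j2) = enc(i), enc(j)
            if A[i - 1][j - 1]:                                # A-rules
                add(('A', i1, j1), ('Wl', i2), ('X', j2 + delta))
            if B[i - 1][j - 1]:                                # B-rules
                add(('B', i1, j1), ('Wl', i2 + 1 + delta), ('X', j2 + 2 * delta))
    for p in range(d * d + 1):                                 # f1 takes values 0..d^2
        for q in range(d * d + 1):
            for r in range(d * d + 1):
                add(('C', p, q), ('A', p, r), ('B', r, q))     # C-rules
    # CKY over the fixed string w = w_1 ... w_n (standing in for the c-parser P).
    chart = {}                                  # chart[i, j] = nonterminals deriving w_i ... w_j
    for l in range(1, n + 1):
        chart[l, l] = set(term[l])
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            cell = set()
            for k in range(i, j):
                for Y in chart[i, k]:
                    for Z in chart[k + 1, j]:
                        cell |= binr.get((Y, Z), set())
            chart[i, j] = cell
    # The querier: c_ij = 1 iff C_{i1,j1} derives w_{i2} ... w_{j2+2*delta} (Corollary 1).
    C = [[0] * m for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            (i1, i2), (j1, j2) = enc(i), enc(j)
            C[i - 1][j - 1] = int(('C', i1, j1) in chart[i2, j2 + 2 * delta])
    return C

m = 5
A = [[random.randint(0, 1) for _ in range(m)] for _ in range(m)]
B = [[random.randint(0, 1) for _ in range(m)] for _ in range(m)]
expected = [[int(any(A[i][k] and B[k][j] for k in range(m))) for j in range(m)] for i in range(m)]
assert boolean_product_via_parsing(A, B, m) == expected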
3.5.1 Parameter choices

Since Valiant (1975) proved that an O(m^{3−ε}) BMM algorithm can be transformed into a parser with time complexity O(n^{3−ε}) in the string length, it is natural to ask whether our technique could yield the stronger result (if it is in fact true) that a CFG parser running in time O(gn^{3−ε}) can be converted into an O(m^{3−ε}) BMM algorithm. We now explain why such a result cannot be obtained by a straightforward modification of the reduction method we described above.

Our run-time results are based on a particular choice of where to divide matrix indices into “high order bits” and “low order bits”; in particular, we set d, which parametrizes the number of low order bits, to d = ⌈m^{1/3}⌉. We determined this value by considering the effect of d on the size of the resulting grammar and string: roughly speaking, a larger value shrinks the former but expands the latter. For convenience, let us set d = m^ℓ, and consider how to pick ℓ.

Since combining the higher-order bits and the lower-order bits yields a matrix index of magnitude at most m, it follows that the string has size O(m^ℓ) and the grammar will have size O(m^2 + (m^{1−ℓ})^3) (the first term comes from the inclusion of the A- and B-rules, and the second term comes from the fact that the C-rules have to include the higher-order bits for three matrix indices). Hence, a parser with run-time complexity O(gn^{3−ε}) yields a BMM algorithm with run-time complexity O(m^{2+(3−ε)ℓ} + m^{3−εℓ}). Inspection reveals that when ℓ > 1/3, the first term dominates; when ℓ < 1/3, the second term dominates; and the lowest upper bound occurs at the “crossing point” where ℓ = 1/3.

4 Related results

We have shown that the existence of a fast practical CFG parsing algorithm would yield a fast practical BMM algorithm. Given that fast practical BMM algorithms are thought not to exist, this establishes a limitation on the efficiency of practical CFG parsing, and helps explain why there has been very little success in developing practical sub-cubic general CFG parsers.

There have been a number of related results regarding the time complexity of context-free grammar parsing and the relationship between this and other problems. We survey these results below.

As mentioned above, the asymptotically fastest (although not practical) general context-free parsing algorithm is due to Valiant (1975), who showed that the problem can be reduced to Boolean matrix multiplication (this is the “opposite direction” of the reduction we present). His algorithm shows that the worst-case dependence of the speed of CFG parsing on the input string length is O(M(n)), where M(m) is the time it takes to multiply two m × m Boolean matrices together. (Rytter (1995) provides an alternate version of this algorithm with the same asymptotic complexity.)

Methods for reducing Boolean matrix multiplication to context-free grammar parsing were previously considered by Ruzzo (1979). He proved that the problem of producing all possible parses of a string of length n with respect to a context-free grammar is at least as hard as multiplying two √n × √n Boolean matrices together. His technique encodes most of the information about the matrices in strings (as opposed to in the grammar, as in our method). Ruzzo’s result does not serve to explain why practical sub-cubic CFG parsing algorithms have been so difficult to produce, since using his reduction translates even a parser running in time proportional to n^{1.5} to a cubic-time BMM algorithm.

Harrison and Havel (Harrison and Havel, 1974; Harrison, 1978) note that there is a reduction of m × m BMM checking to context-free recognition (a BMM checker takes as input three Boolean matrices A, B, and C and reveals whether or not C is the Boolean product of A and B). These two decision problems are clearly related to the algorithmic problems we consider in this paper. However, this reduction, like Ruzzo’s, also converts a parser running in time proportional to n^{1.5} to a cubic-time BMM checking algorithm, which, again, is not as strong a result as ours.

The problem of on-line CFL recognition is to proceed through each prefix w_1^i of the input string w, determining whether or not w_1^i is generated by the input context-free grammar before reading the next ((i+1)-th) input symbol. The study of the complexity of this problem has a long history; in fact, the landmark paper of Hartmanis and Stearns (1965) that introduced the notions of time and space complexity contains an example of a CFL for which on-line recognition of strings of length n takes more than n steps. Currently, the best known lower bound for this problem is Ω(n^2/log n) (Seiferas, 1986; Gallaire, 1969). However, on-line recognition is a more difficult task than the standard CFL recognition problem (indeed, it is the extra constraints imposed by the on-line requirement that make it easier to prove lower bounds), and so these results do not translate to the usual recognition paradigm.
To date, there are no non-trivial lower bounds known for general CFL recognition.

Relationships between parsing other grammatical formalisms and multiplying Boolean matrices have also been explored. In particular, several researchers have looked at Tree Adjoining Grammar (TAG) (Joshi, Levy, and Takahashi, 1975), an elegant formalism based on modifying tree structures. TAGs have strictly greater generative capacity than context-free grammars, but at the price of being (apparently) harder to parse: standard algorithms run in time proportional to n^6, although Rajasekaran and Yooseph (1998) adapt Valiant’s (1975) technique to get an asymptotically faster parser using BMM. Satta (1994) gives a reduction of Boolean matrix multiplication to tree-adjoining grammar parsing, demonstrating that any substantial improvement over O(gn^6) for TAG parsing would result in a sub-cubic BMM algorithm. Our reduction was inspired by Satta’s and resembles his in the way that matrix information is encoded in a grammar. However, Satta’s reduction explicitly relies on TAG properties that allow non-context-free languages to be generated, and so cannot be directly applied to CFG parsing.

5 Acknowledgments

Thanks to Zvi Galil, Joshua Goodman, Rebecca Hwa, Jon Kleinberg, Giorgio Satta, Stuart Shieber, Les Valiant, and the anonymous referees for many helpful comments and conversations. A preliminary conference version of this paper appeared in the Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pp. 9–15; thanks to those reviewers for their comments and suggestions. This material is based upon work supported in part by the National Science Foundation under Grant No. IRI-9350192, an NSF Graduate Fellowship, and an AT&T GRPW/ALFP grant. Any opinions, findings, and conclusions or recommendations expressed above are those of the author and do not necessarily reflect the views of the National Science Foundation.

References

Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass.

Arlazarov, V. L., E. A. Dinic, M. A. Kronrod, and I. A. Faradžev. 1970. On economical construction of the transitive closure of an oriented graph. Soviet Math. Dokl., 11:1209–1210. English translation of Russian article in Dokl. Akad. Nauk SSSR 194 (1970).

Bailey, David. 1988. Extra high speed matrix multiplication on the Cray-2. SIAM Journal on Scientific and Statistical Computing, 9(3):603–607.

Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.

Coppersmith, Don and Shmuel Winograd. 1987. Matrix multiplication via arithmetic progression. In Symposium on the Theory of Computing, pages 1–6.

Coppersmith, Don and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progression. Journal of Symbolic Computation, 9(3):251–280. Special Issue on Computational Algebraic Complexity.

Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, UK.

Earley, Jay. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102.

Gallaire, Hervé. 1969. Recognition time of context-free languages by on-line Turing machines. Information and Control, 15(3):288–295, September.

Graham, Susan L., Michael A. Harrison, and Walter L. Ruzzo. 1980. An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, 2(3):415–462.
Harrison, Michael and Ivan Havel. 1974. On the parsing of deterministic languages. Journal of the ACM, 21(4):525–548, October.

Harrison, Michael A. 1978. Introduction to Formal Language Theory. Addison-Wesley, Reading, Mass.

Hartmanis, Juris and Richard E. Stearns. 1965. On the computational complexity of algorithms. Transactions of the American Mathematical Society, 117:285–306.

Hopcroft, John E. and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts.

Johnson, Mark. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.

Joshi, Aravind. 1997. Parsing techniques. In Ronald Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Battista Varile, Annie Zaenen, Antonio Zampolli, and Victor Zue, editors, Survey of the State of the Art in Human Language Technology, Studies in Natural Language Processing. Cambridge University Press, Cambridge, UK, chapter 11.4, pages 351–356.

Joshi, Aravind K., Leon S. Levy, and Masako Takahashi. 1975. Tree adjunct grammars. Journal of Computer and System Sciences, 10(1):136–163.

Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, New Jersey. Contributing writers: Andrew Kehler, Keith Vander Linden, and Nigel Ward.

Kasami, Tadao. 1965. An efficient recognition and syntax algorithm for context-free languages. Scientific Report AFCRL-65-758, Air Force Cambridge Research Lab, Bedford, MA.

Nederhof, Mark-Jan and Giorgio Satta. 1996. Efficient tabular LR parsing. In 34th Annual Meeting of the ACL, pages 239–246.

Rajasekaran, Sanguthevar and Shibu Yooseph. 1998. TAL recognition in O(M(n^2)) time. Journal of Computer and System Sciences, 56(1):83–89, February.

Ruzzo, Walter L. 1979. On the complexity of general context-free language parsing and recognition. In Sixth Colloquium on Automata, Languages and Programming (ICALP), volume 71 of Lecture Notes in Computer Science, pages 489–497. Springer-Verlag.

Rytter, Wojciech. 1985. Fast recognition of pushdown automaton and context-free languages. Information and Control, 67:12–22.

Rytter, Wojciech. 1995. Context-free recognition via shortest paths computation: a version of Valiant’s algorithm. Theoretical Computer Science, 143(2):343–352.

Satta, Giorgio. 1994. Tree-adjoining grammar parsing and Boolean matrix multiplication. Computational Linguistics, 20(2):173–191, June.

Seiferas, Joel. 1986. A simplified lower bound for context-free-language recognition. Information and Control, 69:255–260.

Strassen, Volker. 1969. Gaussian elimination is not optimal. Numerische Mathematik, 14(3):354–356.

Strassen, Volker. 1990. Algebraic complexity theory. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A. Elsevier Science Publishers, Amsterdam, chapter 11, pages 633–672.

Thottethodi, Mithuna, Siddartha Chatterjee, and Alvin R. Lebeck. 1998. Tuning Strassen’s matrix multiplication for memory efficiency. In SC98: High Performance Networking and Computing Conference.

Valiant, Leslie G. 1975. General context-free recognition in less than cubic time. Journal of Computer and System Sciences, 10:308–315.

Younger, Daniel H. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208.