University of Toronto, Department of Computer Science
CSC 2501F—Computational Linguistics, Fall 2018
Reading assignment 3
Due date: In class at 11:10, Thursday 11 October 2018.
Late write-ups will not be accepted without documentation of a medical or other emergency.
This assignment is worth 5% of your final grade.
What to read
Fernando Pereira, “Formal grammar and information theory: together again?,” Phil. Trans.
R. Soc. Lond. A, 358, 2000, 1239–1253.
What to write
Write a brief summary of the paper’s argumentation, with a critical assessment of its merits.
General requirements: Your write-up should be typed, using 12-point font and 1.5-line
spacing; it should fit on one to two sides of a sheet of paper. Hand in one copy at the begin-
ning of class.
Formal grammar and information theory:
together again?
By Fernando Pereira
AT & T Laboratories Research, A247, Shannon Laboratory, 180 Park Avenue,
Florham Park, NJ 07932-0971, USA (pereira@research.att.com)
In the last 40 years, research on models of spoken and written language has been split
between two seemingly irreconcilable traditions: formal linguistics in the Chomsky
tradition, and information theory in the Shannon tradition. Zellig Harris had advo-
cated a close alliance between grammatical and information-theoretic principles in
the analysis of natural language, and early formal-language theory provided another
strong link between information theory and linguistics. Nevertheless, in most research
on language and computation, grammatical and information-theoretic approaches
had moved far apart.
Today, after many years on the defensive, the information-theoretic approach has
gained new strength and achieved practical successes in speech recognition, informa-
tion retrieval, and, increasingly, in language analysis and machine translation. The
exponential increase in the speed and storage capacity of computers is the proxi-
mate cause of these engineering successes, allowing the automatic estimation of the
parameters of probabilistic models of language by counting occurrences of linguistic
events in very large bodies of text and speech. However, I will argue that information-
theoretic and computational ideas are also playing an increasing role in the scien-
tific understanding of language, and will help bring together formal-linguistic and
information-theoretic perspectives.
Keywords: formal linguistics; information theory; machine learning
1. The great divide
In the last 40 years, research on models of spoken and written language has been
split between two seemingly irreconcilable points of view: formal linguistics in the
Chomsky tradition, and information theory in the Shannon tradition. The famous
quotation from Chomsky (1957) marks the beginning of the split.
(1) Colourless green ideas sleep furiously.
(2) Furiously sleep ideas green colourless.
… It is fair to assume that neither sentence (1) nor (2) (nor indeed any
part of these sentences) has ever occurred in an English discourse. Hence,
in any statistical model for grammaticalness, these sentences will be ruled
out on identical grounds as equally 'remote' from English. Yet (1), though
nonsensical, is grammatical, while (2) is not.
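Chomsky's point can be made concrete with a toy sketch (mine, not the paper's): a maximum-likelihood bigram model trained on a corpus containing neither sentence assigns both probability zero, and so cannot separate the grammatical (1) from the ungrammatical (2). The corpus and tokenization below are illustrative assumptions.

```python
from collections import Counter

def train_bigram(corpus):
    """Maximum-likelihood bigram and unigram counts from tokenized sentences."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return bigrams, unigrams

def sentence_prob(sent, bigrams, unigrams):
    """P(sentence) under the ML bigram model; zero if any bigram is unseen."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for a, b in zip(tokens[:-1], tokens[1:]):
        if unigrams[a] == 0 or bigrams[(a, b)] == 0:
            return 0.0
        p *= bigrams[(a, b)] / unigrams[a]
    return p

# A tiny corpus in which none of the word pairs from (1) or (2) occur.
corpus = [["revolutionary", "new", "ideas", "appear", "infrequently"],
          ["green", "tea", "is", "popular"]]
bigrams, unigrams = train_bigram(corpus)

s1 = ["colourless", "green", "ideas", "sleep", "furiously"]
s2 = ["furiously", "sleep", "ideas", "green", "colourless"]
# Both sentences are ruled out "on identical grounds": probability zero.
assert sentence_prob(s1, bigrams, unigrams) == 0.0
assert sentence_prob(s2, bigrams, unigrams) == 0.0
```

This is exactly the zero-frequency behaviour of an unsmoothed counting model; the force of Chomsky's objection rests on such a model.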
Phil. Trans. R. Soc. Lond. A (2000) 358, 1239–1253
© 2000 The Royal Society
Before and after the split, Zellig Harris had advocated a close alliance between
grammatical and information-theoretic principles in the analysis of natural lan-
guage (Harris 1951, 1991). Early formal-language theory provided another strong
link between information theory and linguistics. Nevertheless, in most research on
language and computation, those bridges were lost in an urge to take sides that was
as much personal and ideological as scientific.
Today, after many years on the defensive, the information-theoretic approach is
again thriving and has led to practical successes in speech recognition, information
retrieval, and, increasingly, in language analysis and machine translation. The expo-
nential increase in the speed and storage capacity of computers is the proximate cause
of these successes, allowing the automatic estimation of the parameters of computa-
tional models of language by counting occurrences of linguistic events in very large
bodies of text and speech. However, vastly increased computer power would be irrel-
evant if automatically derived models or linguistic data were not able to generalize
to unseen data. I will argue below that progress in the design and analysis of such
models is not only playing a central role in those practical advances, but also car-
ries the promise of fundamentally deeper understanding of information-theoretic and
computational-complexity constraints on language acquisition.
2. Harris’s program
The ascent of Chomskian generative linguistics in the early 1960s swept the focus
of attention away from distributional views of language, especially those based on
the earlier structuralist tradition. In that tradition, Zellig Harris developed what
is probably the best-articulated proposal for a marriage of linguistics and informa-
tion theory. This proposal involves four main so-called constraints (Harris 1988) as
follows.
Partial order '… for each word … there are zero or more classes of words, called
its arguments, such that the given word will not occur in a sentence unless one
word … of each of its argument classes is present.'
There’s a strong similarity between the argument class information for a word
as suggested by Harris and its type in categorial grammar, or subcategoriza-
tion frames in other linguistic formalisms. However, traditional categorial gram-
mar (Lambek 1958) conflates function-argument relationships and linear order,
whereas Harris factors out linear order explicitly. It is only more recently that cat-
egorial grammar has acquired the technical means to investigate such factorizations
(Morrill 1994; Moortgat 1995). It then becomes clear that Harris’s partial order
may be formalized as the partial order among set-theoretic function types. How-
ever, unlike modern categorial grammar, Harris's partial order constraint specifies
only the basic configurations corresponding to elementary clauses, while complex
clauses are a result of applying another constraint, reduction, to several elementary
clauses.
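The partial order constraint can be sketched in code (an illustrative encoding of my own, not Harris's notation): each operator lists the argument classes it requires, and an unordered elementary clause is admitted only if every required class is filled. The lexicon below is a hypothetical example.

```python
# Hypothetical lexicon: operator -> required argument classes.
ARG_CLASSES = {
    "sleep": ["N"],        # operator taking one noun argument
    "read": ["N", "N"],    # operator taking two noun arguments
}
WORD_CLASS = {"ideas": "N", "students": "N", "books": "N"}

def admits(operator, words):
    """True if every argument class of `operator` is filled by some word
    in the (unordered) clause `words`."""
    available = [WORD_CLASS[w] for w in words if w in WORD_CLASS]
    for cls in ARG_CLASSES.get(operator, []):
        if cls in available:
            available.remove(cls)   # each filler satisfies one slot
        else:
            return False
    return True

assert admits("sleep", {"ideas"})
assert admits("read", {"students", "books"})
assert not admits("read", {"students"})   # missing second argument
```

Passing the clause as a set reflects Harris's factoring-out of linear order: the constraint checks only which argument classes are present, not where they occur.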
Likelihood '… each word has a particular and roughly stable likelihood of occurring
as argument, or operator, with a given other word, though there are many cases
of uncertainty, disagreement among speakers, and change through time.'
Using current terminology, one might interpret the likelihood constraint as a proba-
bilistic version of selectional restrictions. However, Harris makes a sharp distinction
between general language, in which likelihoods for fillers of argument positions rep-
resent tendencies, and technical sublanguages, in which there are hard constraints
on argument fillers, and which, thus, correspond more closely to the usual notion
of selectional restriction.
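Using invented counts, one might render the likelihood constraint as relative frequencies of argument fillers for each operator; the graded values in general language contrast with the hard zeros of a technical sublanguage. The (predicate, argument) data below are illustrative assumptions, not corpus figures.

```python
from collections import Counter, defaultdict

# Toy (predicate, argument) co-occurrence counts standing in for corpus data.
pairs = ([("drink", "tea")] * 8 + [("drink", "coffee")] * 6 +
         [("drink", "idea")] * 1 + [("read", "book")] * 9)

counts = defaultdict(Counter)
for pred, arg in pairs:
    counts[pred][arg] += 1

def filler_likelihood(pred, arg):
    """Relative frequency P(arg | pred) over observed argument fillers."""
    total = sum(counts[pred].values())
    return counts[pred][arg] / total if total else 0.0

# General language: graded tendencies rather than hard constraints.
assert filler_likelihood("drink", "tea") > filler_likelihood("drink", "idea") > 0
# A hard selectional restriction would correspond to likelihood zero.
assert filler_likelihood("read", "coffee") == 0.0
```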
Reduction 'It consists, for each language, of a few specifiable types of reduction …
what is reduced is the high-likelihood … material …; an example is zeroing the
repeated corresponding words under and.'
The reduction constraint tries to account both for morphological processes like con-
traction, and for processes that combine elementary clauses into complex clauses,
such as relativization, subordination and coordination. In each case, Harris claims
that high-likelihood material may be elided, although it would seem that additional
constraints on reduction may be necessary. Furthermore, connections between
reduction-based and transformational analyses (Harris 1965; Chomsky 1965) sug-
gest the possibility of modelling string distributions as the overt projection of a
hidden generative process involving operator-argument structures subject to the
likelihood constraint and transformations. Recent work linking transformational
and categorial approaches to syntax makes this possibility especially intriguing
(Stabler 1997; Cornell 1997).
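Harris's example of zeroing under and can be sketched as a toy string operation (my illustration, covering only repeated prefixes; actual reduction is far richer and, as noted above, likely needs further constraints):

```python
def zero_under_and(clause1, clause2):
    """Coordinate two clauses, zeroing the repeated (high-likelihood)
    shared prefix of the second, as in 'zeroing under and'."""
    shared = 0
    while (shared < min(len(clause1), len(clause2))
           and clause1[shared] == clause2[shared]):
        shared += 1
    return clause1 + ["and"] + clause2[shared:]

reduced = zero_under_and(["John", "likes", "tea"], ["John", "likes", "coffee"])
assert reduced == ["John", "likes", "tea", "and", "coffee"]
```

The elided material ("John likes") is exactly the repeated, hence high-likelihood, material the constraint licenses for reduction.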
Linearization 'Since the relation that makes sentences out of words is a partial
order, while speech is linear, a linear projection is involved from the start.'
Harris's theory left this step rather underspecified. Chomskian transformational
grammar can be seen as an effort to fill in the gap with specific mechanisms of
sentence generation that could be tested against native speaker grammaticality
judgements.
Thus, linguistic events involve the generation of basic configurations (unordered
simple clauses) whose structure is determined by the partial order constraint and
whose distribution follows the probabilities associated with the likelihood constraint.
Those probabilities also govern the application of reduction (compression) to indi-
vidual configurations or sets of linked configurations. Finally, linearization yields the
observable aspects of the event. As I will discuss in § 7, though, the likelihood con-
straint as stated by Harris, or its current versions, leaves out dependencies on the
broader discourse context that strongly affect the likelihoods of linguistic events.
For the present discussion, the most important feature of Harris’s constraints is
how they explicitly link linguistic structure with distributional regularities involving
the relative frequencies of different structural configurations. In particular, Harris
suggested how the structural and distributional regularities could work together to
support language acquisition and use:
… when only a small percentage of all possible sound-sequences actually
occurs in utterances, one can