
University of Toronto, Department of Computer Science
CSC 2501F—Computational Linguistics, Fall 2021
Reading assignment 2
Due date: Electronically by 11am, Friday 1 October 2021.
Late write-ups will not be accepted without documentation of a medical or other emergency. This assignment is worth 5% of your final grade.
What to read
Sima’an, “Computational Complexity of Probabilistic Disambiguation: NP-Completeness Results for Parsing Problems That Arise in Speech and Language Processing Applications,” Grammars, 5, 2002, 125–151.
What to write
Write a brief summary of the paper’s argumentation, with a critical assessment of its merits. Indicate how the material relates to the discussion of parsing that we have had in class.
Some points to consider:
• What does the term “parsing” mean to the author?
• How can parsing be cubic or linear time if the author has proved that it is NP-complete?
General requirements: Your write-up should be typed, using 12-point font and 1.5-line spacing; it should fit on one to two sides of a sheet of paper. Submit using the teach.cs submit command:
$ submit -c csc2501h -a Essay2 essay2.pdf

Computational Complexity of Probabilistic Disambiguation
NP-Completeness Results for Parsing Problems That Arise in Speech and Language Processing Applications
KHALIL SIMA’AN
Institute for Logic, Language and Computation (ILLC), University of Amsterdam, Room B.234, 166, 1018 WV Amsterdam, The Netherlands
E-mail:
Abstract. Recent models of natural language processing employ statistical reasoning for dealing with the ambiguity of formal grammars. In this approach, statistics, concerning the various linguistic phenomena of interest, are gathered from actual linguistic data and used to estimate the probabilities of the various entities that are generated by a given grammar, e.g., derivations, parse-trees and sentences. The extension of grammars with probabilities makes it possible to state ambiguity resolution as a constrained optimization formula, which aims at maximizing the probability of some entity that the grammar generates given the input (e.g., maximum probability parse-tree given some input sentence). The implementation of these optimization formulae in efficient algorithms, however, does not always proceed smoothly. In this paper, we address the computational complexity of ambiguity resolution under various kinds of probabilistic models. We provide proofs that some frequently occurring problems of ambiguity resolution are NP-complete. These problems are encountered in various applications, e.g., language understanding for text- and speech-based applications. Assuming the common model of computation, this result implies that, for many existing probabilistic models, it is not possible to devise tractable algorithms for solving these optimization problems.
Key words: computational complexity, formal grammars, NP-Completeness, probabilistic ambiguity
resolution
Abbreviations: NLP – natural language processing
1. Introduction
No matter what its state of knowledge is, a so-called performance model (Chomsky, 1965) of syntactic processing is expected to model, as accurately as possible, the input–output language processing behavior of a human. One particular “input–output property” of the human language processing system is of interest here: a human tends to associate a single, “most plausible” analysis with every utterance that she encounters. Remarkably, this is in sheer contrast with the highly ambiguous formal linguistic grammars of natural languages. In this paper, we take a closer look at some probabilistic performance models of syntactic processing, which are aimed at resolving grammatical ambiguity. We inspect the computational complexity that accompanies the application of these models to various problems of syntactic ambiguity resolution. Before we enter the computational complexity issues, however, we start out this paper with a brief overview of the general approach to probabilistic ambiguity resolution in Natural Language Processing (NLP).
In the field of NLP, it is widely believed that the solution to syntactic grammatical ambiguity lies in the modeling of the extra-linguistic knowledge, e.g., world-knowledge, which plays a pivotal role in human language processing. Because of their vastness and dynamic nature, the major extra-linguistic factors (e.g., world-knowledge or knowledge on cultural preferences) seem to be beyond complete modeling. Therefore, it seems inevitable that a model of syntactic ambiguity resolution must approximate these extra-syntactic factors by suitable means that allow for the optimization of its behavior given the current state of knowledge.
Recently, empirical performance models of syntactic processing started to implement ambiguity resolution through the projection of probabilistic grammars from so-called tree-banks¹ (e.g., Scha, 1990; Bod, 1992; Pereira and Schabes, 1992; Jelinek et al., 1994; Magerman, 1995; Collins, 1997; Charniak, 1999). In the context of this paper we are not so much interested in the different methods of projecting probabilistic grammars as much as in the computational aspects of the disambiguation problems under these probabilistic models.
Informally speaking, a probabilistic grammar (e.g., Salomaa, 1969) attaches to every rewrite-event (or grammar rule) r a conditional probability P(r|C), expressing the probability of applying the rule given some context C. The probabilities are essentially “statistics” over syntactic phenomena in the tree-bank; they can be seen to represent the accumulations of the various extra-syntactic factors. Beyond the probabilistic grammar, a probabilistic language model stipulates how the probability of a derivation of the grammar is estimated from the conditional probabilities of the individual rewrite-events. Consequently, the probabilities of parse-trees and sentences are defined in terms of the probabilities of the derivations that generate them (Section 3 provides example models). In other words, a probabilistic model defines probability functions over derivations, parse-trees and sentences of the probabilistic grammar.
In this probabilistic view, the problem of syntactic ambiguity resolution for language utterances can be expressed as an optimization problem. Given an input utterance U, the goal is to find the parse-tree T∗ with the highest probability (most
1 A tree-bank consists of a bag (or multi-set or distribution) of parse-trees (lexicalized with terminal symbols) from some source grammar. In natural language processing, a tree-bank is usually obtained by collecting utterances from some domain of language use “at random” and manually annotating them with parse-trees. This way, it can be assumed (to some extent) that the tree-bank provides a representative sample of that domain. Most existing tree-banks (e.g., Marcus et al., 1993) are annotated with Phrase-Structure linguistic notions.

probable parse) under the optimization formula:²

$$T^{*} = \arg\max_{T \in \mathcal{T}} P(T \mid U) = \arg\max_{T \in \mathcal{T}} \frac{P(T,U)}{P(U)} = \arg\max_{T \in \mathcal{T}} P(T,U) \qquad (1)$$
where $\mathcal{T}$ is the set of parse-trees defined by the grammar, i.e., the tree-language of the grammar. Hence, T∗ is the most probable parse-tree which co-occurs with utterance U. A major question in applying probabilistic models for syntactic disambiguation is whether it is possible at all, and if so how, to transform optimization formula (1) into efficient algorithms.
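To make formula (1) concrete, here is a minimal sketch (our own illustration, not from the paper), assuming the joint probabilities P(T, U) of the finitely many candidate parse-trees of a fixed utterance U are already available as a table; since P(U) does not depend on T, maximizing P(T | U) amounts to maximizing the joint probability P(T, U):

```python
def most_probable_parse(joint_prob):
    """joint_prob maps each candidate parse-tree T to P(T, U) for one fixed utterance U."""
    # P(T | U) = P(T, U) / P(U); the denominator is constant in T, so the argmax is unchanged.
    return max(joint_prob, key=joint_prob.get)

# Hypothetical toy table of joint probabilities:
print(most_probable_parse({"tree_a": 0.03, "tree_b": 0.12, "tree_c": 0.05}))  # -> tree_b
```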
In Computational Complexity theory, the departure point in addressing these questions lies in classifying the problems according to the time-complexities of the most efficient algorithms that can be devised for solving these problems. A problem that does not have any solutions of some desirable time-complexity constitutes a source of inconvenience and demands special treatment.
The present paper provides proofs that many of the common problems of probabilistic disambiguation, under various existing probabilistic models, belong to a class of problems for which we do not know whether we can devise deterministic polynomial-time algorithms. In fact, there is substantial evidence (see Garey and Johnson, 1981; Barton et al., 1987) that the problems that belong to this class, the NP-complete class, do not have such algorithms. For NP-complete problems, the only known deterministic algorithms have exponential-time complexity. This is inconvenient since exponential-time algorithms imply a serious limitation on the kinds and sizes of applications for which probabilistic models can be applied.
Each of the problems considered in this paper involves probabilistic disambiguation of some kind of input under performance models that are based either on Stochastic Context-Free Grammars (SCFGs) (e.g., Jelinek et al., 1990; Black et al., 1993; Charniak, 1996) or on Stochastic Tree-Substitution Grammars (STSGs) (e.g., Bod, 1992, 1995a; Sekine and Grishman, 1995; Sima’an, 2000). The results of the present proofs apply also to the various Stochastic versions of Tree-Adjoining Grammars (STAG) (Joshi, 1985; Resnik, 1992; Schabes, 1992; Schabes and Waters, 1993). It is noteworthy that the present proofs concern disambiguation problems that arise within various, actually existing applications that range from those that involve parsing and interpretation of text to those that involve speech-understanding and information retrieval. We will elaborate on these applications during the presentation of the problems, and, where possible, we also provide pointers to related literature.
This paper is aimed at readership both from Computer Science and Linguistics. Therefore, it includes an intuitive (rather than formal) brief overview of the notions of intractability and NP-completeness (Section 2), and formalizations of the
2 The meaning of the expression $\arg\max_{x \in X} f(x)$ is “the element x ∈ X for which f(x) is maximal”.

abstract forms of some of the existing models of ambiguity resolution in language processing (Section 3). Section 4 states the disambiguation problems in question. Section 5 provides the proofs of NP-completeness. Finally, Section 6 discusses the informal conclusions of this study and provides pointers to existing research toward approximate, efficient solutions to some of these problems.
2. Intractability and NP-Completeness
This section provides a short, informal overview of the notions of intractability and NP-completeness. Our aim is to provide the basic terminology and explain the intuitive side of intractability, thereby also highlighting some of its limitations. The reader who is interested in further details and formalizations of intractability and NP-completeness is referred to, e.g., Hopcroft and Ullman (1979), Garey and Johnson (1981), Lewis and Papadimitriou (1981), Davis and Weyuker (1983), and Barton et al. (1987). This section also serves as a reference to the 3SAT problem, used in the sequel for our proofs of NP-completeness of disambiguation problems.
2.1. WHAT IS A DECISION PROBLEM?
The notion in focus in this section is that of a tractable decision problem. Informally speaking, a problem is a pair: a generic instance, stating the formal devices and components involved in the problem, and a question asked in terms of the generic instance. A decision problem is a problem where the question can have only one of two possible answers: Yes or No. For example, the well known 3SAT (3-satisfiability) problem³ is stated as follows:
INSTANCE: A Boolean formula in 3-conjunctive normal form (3CNF) over the variables u1, . . . , un.
QUESTION: Is the formula in INSTANCE satisfiable? i.e., is there an assignment of values true (T) or false (F) to the Boolean variables u1, . . . , un such that the given formula is true?
In the sequel, we refer to the (generic) instance of 3SAT with the name INS, and we use the following generic form to express the formula in INS:
$$(d_{11} \vee d_{12} \vee d_{13}) \wedge (d_{21} \vee d_{22} \vee d_{23}) \wedge \cdots \wedge (d_{m1} \vee d_{m2} \vee d_{m3}),$$

where $m \geq 1$ and $d_{ij}$ is a literal⁴ over $\{u_1, \ldots, u_n\}$, for all $1 \leq i \leq m$ and all $1 \leq j \leq 3$. In some cases it is convenient to express the same formula using the notation $C_1 \wedge C_2 \wedge \cdots \wedge C_m$, where $C_i$ represents $(d_{i1} \vee d_{i2} \vee d_{i3})$, for all $1 \leq i \leq m$.
3 The 3SAT problem is a restriction of the more general satisfiability problem SAT which is the first problem proven to be NP-complete (known as Cook’s theorem).
4 A literal is a Boolean variable (e.g., $u_k$), or the negation of a Boolean variable (e.g., $\bar{u}_k$).
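For readers who prefer code to notation, the following brute-force sketch (our own, not from the paper) decides the QUESTION above by trying all 2^n truth assignments; it is, of course, exponential-time in n. The clause encoding and names are our own.

```python
from itertools import product

def satisfiable(clauses, n):
    """clauses: list of conjuncts, each a list of 3 literals; a literal is (variable_index, negated)."""
    for values in product([True, False], repeat=n):
        assignment = dict(zip(range(1, n + 1), values))        # candidate assignment for u1..un
        if all(any(assignment[v] != neg for v, neg in clause) for clause in clauses):
            return True                                        # every conjunct has at least one true literal
    return False

# Hypothetical two-conjunct example: (u1 v ~u2 v u3) & (~u1 v u2 v ~u3)
print(satisfiable([[(1, False), (2, True), (3, False)],
                   [(1, True), (2, False), (3, True)]], n=3))  # -> True
```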

Decision problems are particularly convenient for complexity studies mainly because of the natural correspondence between them and the formal object called “language”; the formal form of a decision problem is usually stated in terms of a language and a question concerning set-membership. The size or length of an instance of a decision problem is the main variable in any measure of the time-complexity of the algorithmic solutions to the problem (in case these solutions exist). Roughly speaking, this length is measured with respect to some reasonable encoding (deterministic polynomial-time computable) from each instance of the decision problem to a string in the corresponding language. In order not to complicate the discussion more than necessary, we follow common practice, as explained in Garey and Johnson (1981), and assume measures of length that are more “natural” to the decision problem at hand (knowing that it is at most a polynomial-time cost to transform an instance to a string in the language corresponding to the decision problem). For 3SAT, for example, the length of an instance is 3m + (m−1) + 2m, i.e., the number of symbols in the formula (not counting the parentheses) is linear in the number of conjuncts m.
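As a quick worked instance of this length measure (our own arithmetic, not an example from the paper), a formula with m = 2 conjuncts contains

$$3m + (m-1) + 2m = 6 + 1 + 4 = 11$$

symbols: six literals, one conjunction symbol and four disjunction symbols.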
2.2. TRACTABLE PROBLEMS, CLASS P AND CLASS NP
Roughly speaking, the theory of intractability and NP-completeness deals with the question whether a given problem has a general solution that is “computationally feasible”. In other words:
For every instance of the problem and for every input that is fed to that instance, of length n ≥ 1: is there a (deterministic) algorithmic solution, which computes the answer in a number of computation steps that is proportional to a “cheap” function in n?
The problem with defining the term “cheap” lies in finding a borderline between those functions that can be considered expensive and those that can be considered cheap. A first borderline that has been drawn by a widely accepted thesis (Cook–Karp) is between polynomials and exponentials. Problems for which there is a deterministic polynomial-time solution (i.e., can be solved by a Deterministic Turing Machine — DTM — in Polynomial-time) are called tractable. Other problems (in the set of problems that can be solved by Turing Machines) for which there are only deterministic exponential-time solutions (Turing Machines) are called intractable. The motivation behind the Cook–Karp discrimination between polynomials and exponentials is the difference in rate of growth between these two families. Generally speaking, exponentials tend to grow much faster than polynomials. Strong support for the Cook–Karp thesis came from practice, but here we will not go into the discussion concerning the stability of this thesis (see the discussions in e.g., Garey and Johnson, 1981; Barton et al., 1987).
In any case, the tractable decision problems, i.e., those that have a deterministic polynomial-time algorithmic solution (i.e., a polynomial-time DTM), are referred

to with the term class P problems. All other decision problems,⁵ that are intractable, are referred to with the term NP-hard problems (see below for the reason for this terminology).
Besides the problems that can be solved in polynomial-time by a deterministic algorithmic solution, there exist problems that are solvable in polynomial-time provided that the algorithmic solution has access to an oracle, which is able to guess the right computation-steps without extra cost (i.e., these problems are solvable by a so-called Non-Deterministic Turing Machine (NDTM) in polynomial-time). Clearly, every problem in class P is found among these so-called Non-deterministic Polynomial-time solvable problems, also called class NP problems (i.e., P ⊆ NP). The question is, of course, are there more problems in class NP than in class P, i.e., P ⊂ NP?
2.3. NP-COMPLETE PROBLEMS
Empirical clues to the hypothesis P ≠ NP are embodied by the discovery of many practical and theoretical problems that are known to be in class NP but for which nobody knows how to devise deterministic polynomial-time algorithmic solutions; these problems are in class NP but are not known to be in class P. This set of problems is called the class of NP-complete problems. NP-complete problems are the hardest problems in class NP.
Further of interest here is the class of NP-hard problems. This class consists of those problems that are at least as hard as any problem that is in NP, i.e., an NP-hard decision problem is one that is at least as hard as those that can be solved by an NDTM in polynomial-time.
To prove that a given problem L is NP-hard, it is sufficient⁶ to show that another problem that is already known to be NP-complete is deterministic polynomial-time reducible to L. This is done by providing a polynomial-time reduction (i.e., transform) from the NP-complete problem to problem L. Such a reduction shows how every instance of the NP-complete problem can be transformed into an instance of problem L. Naturally, the reduction must be answer-preserving, i.e., for every instance of the NP-complete problem and for every possible input, the instance’s answer is Yes to that input iff the L-instance’s (resulting from the reduction) answer is also Yes to the transformed-form of the input. Note that the reducibility relation between problems is a transitive relation.
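Schematically (our own illustration, not from the paper), and assuming we had decision procedures for both problems and a candidate reduction at hand, answer-preservation is simply the property that the two answers agree on every instance:

```python
def is_answer_preserving(instances, solve_source, solve_target, reduction):
    """Check on a finite sample that the reduction maps Yes-instances to Yes-instances and
    No-instances to No-instances (an actual proof must establish this for *every* instance)."""
    return all(solve_source(x) == solve_target(reduction(x)) for x in instances)
```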
Practically speaking, once we lay our hands on one NP-complete problem, we can prove other problems to be NP-hard. A problem that has been proven to be NP-complete is the 3SAT problem stated above. Therefore 3SAT can serve us in proving other problems to be NP-hard. To prove that a new problem is NP-
5 In the class of problems solved by Turing Machines.
6 The fact that every instance of an NP-complete problem T can be reduced into an instance of L
in deterministic polynomial-time implies that L is at least as hard as T: if L were solvable in deterministic polynomial-time, then T would be too.

complete, however, one needs to prove that it is NP-hard and that it is in class NP.
In this work the focus is on optimization problems rather than decision problems. In general, it is possible to derive from every optimization problem a decision problem that is (at most) as hard as the optimization problem (Garey and Johnson, 1981). Therefore, the proof of NP-completeness of an optimization problem goes through proving that its decision counterpart is NP-complete. We will exemplify how an optimization problem is translated into a decision problem in Section 4, which states the problems that this paper addresses: probabilistic ambiguity resolution. But first, however, we consider the formal probabilistic models that play a role in these problems.
2.4. GENERAL NOTATION
In the sequel we employ the notation $x_1^n$ for an ordered sequence of entities $x_1, \ldots, x_n$ (and also for the concatenation of symbols $x_1 \cdots x_n$), for any kind of entity $x_i$, $1 \leq i \leq n$. We also use the symbols T and F to represent respectively the Boolean values true and false.
3. Probabilistic Models of Ambiguity Resolution
The generic devices that are involved in the disambiguation problems considered in this paper are of three kinds: Stochastic Finite State Automata (SFSAs), Stochastic Context-Free Grammars (SCFGs) and Stochastic Tree-Substitution Grammars (STSGs). Next we provide the formal definitions of these devices, and specify the probability formulae common for most existing models of ambiguity resolution. It is relevant here to mention that the proofs of NP-hardness in this paper hold for any “weighted” forms of the above grammars in general. However, in this presentation we stick to “probabilistic/stochastic” versions because of the current wide interest in applying these specific versions to NLP tasks. Moreover, this allows us to specify some general aspects of existing probabilistic models, which might be of interest for the general readership.
Throughout the paper we assume that all automata and grammars are ε-free (i.e., do not allow empty-rewritings) and cycle-free.
3.1. STOCHASTIC FINITE STATE AUTOMATA
Stochastic Finite State Automaton (SFSA): An SFSA is a six-tuple ⟨Q, Σ, T, s, f, P⟩ where Q, Σ and T are finite sets of, respectively, symbols, states and transitions; a transition is a triple ⟨s1, s2, w⟩, where s1, s2 ∈ Σ and w ∈ Q (called the label of the transition). The special states s and f (both in Σ) are respectively the start and target states. Finally, P : T → (0, 1] is a probability function which fulfills the equation

$$\forall l \in \Sigma:\ \sum_{\langle l,\, r,\, w\rangle \in T} P(\langle l, r, w\rangle) = 1.$$

Two transitions ⟨a, b1, w⟩ and ⟨b2, c, v⟩ are called connected iff b1 = b2. An ordered sequence of one or more transitions t1, ..., tn, with ti = ⟨li, ri, wi⟩ for all i, is called a path iff: (1) l1 = s, (2) rn = f, and (3) for all 1 ≤ i < n, ti and ti+1 are connected. For any path ⟨l1, r1, w1⟩, ..., ⟨lm, rm, wm⟩, the concatenation of the labels of the transitions, i.e., w1 ··· wm, is called a sentence generated by the SFSA through that path.⁷ The set of all paths that generate the same sentence X under a given SFSA is denoted by Paths(X). In the present context, two probability definitions are relevant:

Path probability: the probability⁸ of a path $t_1^n = t_1, \ldots, t_n$ is approximated⁹ by the formula $P(t_1^n) = P(t_1)\prod_{i=2}^{n} P(t_i \mid t_1, \ldots, t_{i-1}) \approx \prod_{i=1}^{n} P(t_i)$.

Sentence probability: the probability of a sequence of symbols $w_1^n = w_1 \cdots w_n$ from Q⁺ is given by:

$$P(w_1^n) = \sum_{path \in Paths(w_1^n)} P(path). \qquad (2)$$

Note that if the set $Paths(w_1^n)$ is empty, $P(w_1^n) = 0$ by definition.

Further definitions pertaining to methods of probability estimation and probability computation under SFSAs originate from the well known Hidden-Markov Models (HMM) (Rabiner and Juang, 1986). HMMs are commonly used in speech-recognition as speech and language models. HMMs also provide a general interface between speech-recognizers and more advanced probabilistic language models (e.g., based on linguistic knowledge): the output of the speech-recognizer is cast in an SFSA (over some lexicon), called a “word-lattice” or “word-graph” (Oeder and Ney, 1993), which is fed as input to the language model (e.g., a syntactic analyzer); usually, the output of the language model is the single sentence (among the sentences accepted by the SFSA) which constitutes the system’s “best” guess of the spoken utterance. The formal version of the latter disambiguation task constitutes one of the problems which we prove NP-hard (under various probabilistic language models).

For the purposes of the present paper it is sufficient to consider a special case of SFSAs, which we refer to with the term “Simple FSA” (abbreviated SIFSA). To this end, we define the term “non-labeled transition”: a non-labeled transition is the pair ⟨si, sj⟩ obtained from any transition ⟨si, sj, w⟩ ∈ T. A non-labeled path consists of a sequence of connected non-labeled transitions starting from s and ending at f, i.e., ⟨s, s1⟩, ..., ⟨sn, f⟩.

[Figure 1. Upper: an SIFSA. Lower: SFSAs which are not SIFSAs.]

7 In this paper we will assume that all states in the SFSA are “found” on paths, i.e., (1) all states are reachable (i.e., through sequences of connected transitions) from the start state s, and (2) the final state f is reachable from all other states.
8 We will allow overloading of P since it is a probability mass function over entities which are always clear from the context.
9 Considering the path as the joint-event of the individual transitions, this formula assumes stochastic independence between the different transitions. In Markov models, this independence assumption is relaxed to a certain degree by conditioning the probability of every transition on a few of the transitions that precede it in the path, e.g., for a first-order Markov model $P(t_1^n) \approx P(t_1)\prod_{i=2}^{n} P(t_i \mid t_{i-1})$. Although important for modeling various tasks, this Markovian conditioning of probabilities is of no impact on our complexity results.
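The two probability definitions above translate directly into code. The following sketch (our own; the toy word-graph, its labels and its probabilities are made up) represents an SFSA by its outgoing-transition lists and computes formula (2) by enumerating every path from s to f that yields the requested sentence, multiplying transition probabilities along each path and summing over the matching paths. The enumeration can of course be exponential in the worst case; the point is only to make definition (2) concrete.

```python
def sentence_probability(transitions, start, final, sentence):
    """transitions: dict state -> list of (next_state, label, probability); implements formula (2)."""
    def path_products(state, remaining):
        if not remaining:                                     # consumed the whole sentence
            return [1.0] if state == final else []
        products = []
        for nxt, label, p in transitions.get(state, []):
            if label == remaining[0]:                         # transition label matches next symbol
                products += [p * q for q in path_products(nxt, remaining[1:])]
        return products
    return sum(path_products(start, list(sentence)))          # sum over Paths(w) of path probabilities

# A tiny hypothetical word-graph (two positions, two word hypotheses each):
wg = {"s":  [("q1", "recognize", 0.7), ("q1", "wreck", 0.3)],
      "q1": [("f", "speech", 0.6), ("f", "beach", 0.4)]}
print(sentence_probability(wg, "s", "f", ["recognize", "speech"]))   # 0.7 * 0.6
```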
Simple SFSA: An SIFSA is an SFSA ⟨Q, Σ, T, s, f, P⟩ in which (1) all non-labeled paths constitute exactly the same sequence of connected non-labeled transitions and (2) the probabilities of transitions are uniform, i.e., ∀l ∈ Σ, $P(\langle l, r, w\rangle) = \frac{1}{|NEXT(l)|}$, where NEXT(l) = {r | ⟨l, r, w⟩ ∈ T}, and |Y| represents the cardinality of set Y. Figure 1 exhibits examples of SIFSAs and other SFSAs. The definition of SIFSA implies the following notation for representing SIFSAs:

States: the states of an SIFSA can be mapped to successive positive natural numbers. We will represent the natural number that a state si is mapped to by N(si). For the start state s: N(s) = 1. For every other state sj, if there is a transition ⟨si, sj⟩ ∈ T with N(si) = k, then N(sj) = k + 1.

Sentences: the set of sentences accepted by an SIFSA can be written as the product between sets Q1 × ··· × Qm, where Qi ⊆ Q, for all 1 ≤ i ≤ m. Precisely, the set Qi consists of the labels of the transitions between the state sx for which N(sx) = i and the state sy for which N(sy) = i + 1, ∀ 1 ≤ i ≤ m.

Because in this paper we are mainly interested in the set of sentences Q1 × ··· × Qm accepted by an SIFSA, we will be informal in referring to this set with the term “an SIFSA” (under the understanding that we can readily construct an SIFSA for recognizing this set). Furthermore, we will use the notation $Q_1^m$ as shorthand for Q1 × ··· × Qm.

SIFSAs are often encountered in applications such as spelling-correction, under the assumption that typos do not add or remove white-space¹⁰. Clearly, because the proofs in this paper concern SIFSAs, they also hold for SFSAs in general. We undertook the task of formalizing general SFSAs precisely because they constitute the most common device which plays a major role in practical applications of probabilistic models of ambiguity resolution (Figure 1).

10 For every word containing a typo, it is assumed that a set of words is hypothesized from an external dictionary.

3.2. STOCHASTIC TREE-SUBSTITUTION GRAMMARS

We define Stochastic Context-Free Grammars (SCFGs) (Salomaa, 1969; Jelinek et al., 1990) through the definition of the more general Stochastic Tree-Substitution Grammars (STSG) (Bod, 1992; Schabes, 1992). Let N and T denote respectively the finite set of non-terminal symbols and terminal symbols.

Elementary-tree: An elementary-tree is a tree-structure which fulfills: the non-leaf nodes are labeled with non-terminal symbols from N and the leaf nodes are labeled with symbols from N ∪ T. We use the functional notation root(t) to represent the non-terminal label of the root node of the elementary-tree t.

Stochastic Tree-Substitution Grammar (STSG): An STSG G is a five-tuple ⟨N, T, R, S, P⟩, where the first three components are respectively the finite set of non-terminals, terminals and elementary-trees. The non-terminal S ∈ N is called the start-symbol. The function P : R → (0, 1] fulfills the following set of equations (constraints):

$$\forall A \in N:\ \sum_{\{t \in R\,:\, root(t) = A\}} P(t) = 1.$$

The term partial parse-tree is central in defining the various rewrite notions under STSGs. Every elementary-tree is a partial parse-tree. Further partial parse-trees are generated through partial derivations as defined next. Given a partial parse-tree T, let node μ be the left-most leaf node in T and let μ be labeled with the non-terminal X. A derivation-step from μ under STSG G is a rewriting of non-terminal X with an elementary-tree t ∈ R which fulfills root(t) = X.
The result of this rewrite step is just as in the well known algebraic “variable substitution”: it results in a new partial parse-tree T ◦ t which consists of a duplicate of T with a duplicate of elementary-tree t substituted¹¹ for the duplicate of μ. The operation of (left-most) substitution, denoted by ◦, distinguishes this kind of grammar. Hence “Stochastic Tree-Substitution Grammar”. A sequence of left-most substitutions t1 ◦ ··· ◦ tm is called a partial derivation of G. A partial derivation generates a partial parse-tree under the interpretation (···(t1 ◦ t2) ◦ ··· ◦ tm).

Probability of a partial derivation: The probability of partial derivation d is calculated by the formula $P(d) = \prod_{i=1}^{m} P(t_i)$.

11 In the sense that the duplicate of μ now is a non-leaf node dominating a duplicate of t.

In STSGs (in contrast with SCFGs), it is possible that various partial derivations generate one and the same partial parse-tree. This happens, for example, whenever there are elementary-trees t1, t2, t3, t4 ∈ R such that t1 ◦ t2 generates the same partial parse-tree as t3 ◦ t4.

Probability of a partial parse-tree: Let der(T) denote the finite set of all partial derivations that generate partial parse-tree T. The probability of T being generated is then defined by:

$$P(T) = \sum_{d \in der(T)} P(d). \qquad (3)$$

A partial derivation d = t1 ◦ ··· ◦ tm is also called a derivation iff it fulfills: (1) root(t1) = S, and (2) d generates a partial parse-tree T with leaf nodes labeled only with terminal symbols; in that case, T is called a parse-tree and the ordered sequence of terminal symbols labeling the leaf nodes (left-to-right) of T is called a sentence (generated by d). In natural language processing, often the grammars are highly ambiguous such that one and the same sentence can be generated together with different parse-trees.

Probability of a sentence: Let the set of parse-trees which are generated together with sentence U under G be denoted by par(U) and the set of all derivations that generate U be denoted by der(U), then:

$$P(U) = \sum_{T \in par(U)} P(T) \qquad (4)$$

$$P(U) = \sum_{d \in der(U)} P(d). \qquad (5)$$

Formulae (3) and (5) are quite important in the context of this paper. We will see that this kind of a definition (a sum over multiplications) plays a central role in all the optimization problems which we prove NP-hard.

3.2.1. Some convenient notation

Whenever we do not wish to specify the sets par(U) we may consider the probability of the joint occurrence of a parse-tree T with sentence U, i.e., P(T, U). This probability is defined to be zero if the sequence of terminal labels on the leaf nodes of T is different from U. Under this definition we may write $P(U) = \sum_{T} P_G(T, U)$, where the subscript G implies that the summation ranges over all parse-trees T generated by STSG G. Similarly, the joint probability of a derivation d with parse-tree T (or sentence U) is defined to be zero if d does not generate T (respectively U) in the given STSG. Under this definition we may write $P(U) = \sum_{d} P_G(d, U)$ and $P(T) = \sum_{d} P_G(d, T)$, where again the summation ranges over all derivations possible under STSG G.
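Formulae (3)–(5) translate into a brute-force computation once the derivations are enumerated. The sketch below is our own illustration (the dictionaries, probabilities and tree names are hypothetical and no actual parsing is performed): each derivation is represented by the probabilities of the elementary-trees it uses, together with the parse-tree and sentence it yields.

```python
from math import prod

def derivation_probability(elementary_tree_probs):
    return prod(elementary_tree_probs)                      # P(d) = product of the P(t_i)

def parse_probability(derivations, tree):
    # formula (3): sum over all derivations that generate this parse-tree
    return sum(derivation_probability(d["trees"]) for d in derivations if d["parse"] == tree)

def sentence_probability(derivations, sentence):
    # formula (5): sum over all derivations that yield this sentence
    return sum(derivation_probability(d["trees"]) for d in derivations if d["sentence"] == sentence)

# Hypothetical toy STSG: two distinct derivations build the same parse-tree T1 (as with t1∘t2 = t3∘t4).
derivations = [{"trees": [0.2, 0.5], "parse": "T1", "sentence": "a b"},
               {"trees": [0.1, 0.8], "parse": "T1", "sentence": "a b"},
               {"trees": [0.3, 0.4], "parse": "T2", "sentence": "a b"}]
print(parse_probability(derivations, "T1"))      # 0.2*0.5 + 0.1*0.8 ≈ 0.18
print(sentence_probability(derivations, "a b"))  # 0.18 + 0.3*0.4 ≈ 0.30
```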
3.3. STOCHASTIC CONTEXT-FREE GRAMMARS

Now that we defined STSGs, we can define Stochastic Context-Free Grammars (SCFGs) as a subclass thereof:

Stochastic Context-Free Grammar: An SCFG ⟨N, T, R, S, P⟩ is an STSG which fulfills: every elementary-tree t ∈ R consists of a single “level” of non-leaf nodes, i.e., the root node of t directly dominates all its leaf nodes. In SCFG terminology, elementary-trees are also called productions or (rewrite-)rules.

In contrast with general STSGs, in SCFGs there is a one-to-one mapping between derivations¹² and parse-trees. On the one hand, we will use SCFGs to show that this fact makes the syntactic ambiguity resolution of an input sentence easier under SCFGs than under STSGs (in general). On the other hand, we will also show that when the input to the SCFG becomes ambiguous (i.e., an SFSA), the selection of the most probable sentence is also NP-hard (this is related to formula (2)). The latter problem emerges, e.g., in applications of speech-understanding that involve an SCFG-based language model. This shows that the hard problems of ambiguity resolution are not so much inherent to the more advanced grammar formalisms as to the construction of the problem as optimization over summation formulae as in, e.g., formula (5). In the next section we define each of the problems more formally.

12 Assuming that the choice of what non-terminal (in the current sentential-form) to expand at any derivation-step is fixed, e.g., left-most.

4. Definition of the Optimization Problems

In this section we state the four optimization problems that this study concerns. Subsequently, each of these problems is transformed into a suitable decision problem that will be proven to be NP-complete in Section 5.

4.1. DISAMBIGUATION PROBLEMS

Problem MPP:
INSTANCE: An STSG G and a sentence $w_1^n$.
QUESTION: What is the Most Probable Parse (MPP) of sentence $w_1^n$ under STSG G? I.e., compute $\arg\max_{T} P_G(T, w_1^n)$.

This problem was put forward by Bod (1995b). As mentioned earlier, problem MPP plays a prominent role in various applications: in order to derive the semantics of an input sentence, most NLP systems (and the linguistic theories they are based on) assume that one needs a syntactic representation of that sentence. Under most linguistic theories, the syntactic representation is a parse-tree. For various probabilistic models (Bod, 1995a; Goodman, 1998; Schabes, 1992; Resnik, 1992), it can be shown that (under certain methods for estimating the model parameters from data) the MPP is the parse that maximizes the exact tree-match measure (under the given model) (Goodman, 1998).

Problem MPPWG:
INSTANCE: An STSG G and an SIFSA WG.
QUESTION: What is the MPP of WG under¹³ STSG G? I.e., compute $\arg\max_{T} P_G(T, WG)$.

Applications of this problem are similar to the applications of problem MPP. In these applications, the input is ambiguous and the parser must select the most probable of the set of parses of all sentences which the SIFSA generates. By selecting the MPP, the parser selects a sentence of the SIFSA also. Typical applications lie in Speech Understanding (Oeder and Ney, 1993), morphological analysis (Sima’an et al., 2001), but also in parsing the ambiguous output of a part-of-speech (PoS) tagger¹⁴ or morphological analyzer which provides more than one solution for every word in the input sentence.

Problem MPS:
INSTANCE: An STSG G and an SIFSA WG.
QUESTION: What is the Most Probable Sentence (MPS) in WG under STSG G? I.e., compute $\arg\max_{U} P_G(U, WG)$.
This problem has applications that are similar to problem MPPWG. In Speech Understanding it is often argued that the language model should select the MPS rather than the MPP of the input SIFSA. Selecting the MPS, however, does not entail the selection of a syntactic structure for the sentence (which is important for further interpretation).

Problem MPS-SCFG:
INSTANCE: An SCFG G and an SIFSA WG.
QUESTION: What is the MPS in WG under SCFG G? I.e., compute $\arg\max_{U} P_G(U, WG)$.

This problem is a special case of problem MPS and applications that face this problem are similar to those that face problem MPS described above. The proof that this problem is also NP-complete shows that the source of hardness of MPS is not inherent to the fact that the language model involves an STSG.

13 A parse generated for an SIFSA by some grammar is a parse generated by the grammar for a sentence that is generated by the SIFSA. In formal terms we should speak of a sentence in the intersection between the language of the SIFSA and that of the grammar (van Noord, 1995). We will be slightly less formal than this to avoid complicating the notation more than necessary. Hence, we use the joint probability notation $P_G(T, WG)$ with the understanding that this is a set containing all $P_G(T, U)$ for every sentence U generated by WG.
14 PoS-taggers assign to input words their lexical categories, similar to lexical-analyzers that are used in compilers for programming languages.

4.2. THE CORRESPONDING DECISION PROBLEMS

The decision problems that correspond to problems MPP, MPPWG, MPS and MPS-SCFG are given the same names. In these decision problems, we will supply with every problem instance a “threshold” (or “lower bound”) 0 < Q ≤ 1.0 on the value of the optimization formula. Then, instead of asking for the maximum value, we now ask whether there is any value which exceeds Q.

Decision problem MPP:
INSTANCE: An STSG G, a sentence $w_1^n$ and a real number 0 < Q ≤ 1.
QUESTION: Does STSG G generate for sentence $w_1^n$ a parse T for which it assigns a probability $P_G(T, w_1^n) \geq Q$?

Decision problem MPPWG:
INSTANCE: An STSG G, an SIFSA WG and a real number 0 < Q ≤ 1.
QUESTION: Does STSG G generate for WG a parse T for which it assigns a probability $P_G(T, WG) \geq Q$?

Decision problem MPS:
INSTANCE: An STSG G, an SIFSA WG and a real number 0 < Q ≤ 1.
QUESTION: Does WG accept a sentence U for which STSG G assigns a probability $P_G(U, WG) \geq Q$?

Decision problem MPS-SCFG:
INSTANCE: An SCFG G, an SIFSA WG and a real number 0 < Q ≤ 1.
QUESTION: Does WG accept a sentence U for which SCFG G assigns a probability $P_G(U, WG) \geq Q$?

5. NP-Completeness Proofs

In order to prove the NP-completeness of a given problem, it is sufficient to prove that the problem is NP-hard and is also a member of the class NP. For proving the NP-hardness of the problem, it is sufficient to exhibit a deterministic polynomial-time reduction from every instance of 3SAT to some instance of the problem at hand. This is what we do next for each of the decision problems listed in the preceding section. To this end, we restate the generic instance INS of 3SAT.

[Figure 2. The elementary-trees for the example 3SAT instance.]

INSTANCE: A Boolean formula in 3-conjunctive normal form (3CNF) over the variables $u_1, \ldots, u_n$:

$$(d_{11} \vee d_{12} \vee d_{13}) \wedge (d_{21} \vee d_{22} \vee d_{23}) \wedge \cdots \wedge (d_{m1} \vee d_{m2} \vee d_{m3}),$$

where $m \geq 1$ and $d_{ij}$ is a literal over $\{u_1, \ldots, u_n\}$, for all $1 \leq i \leq m$ and all $1 \leq j \leq 3$.
This formula is also denoted $C_1 \wedge C_2 \wedge \cdots \wedge C_m$, where $C_i$ represents $(d_{i1} \vee d_{i2} \vee d_{i3})$, for all $1 \leq i \leq m$.
QUESTION: Is the formula in INS satisfiable?

PROPOSITION 5.1. Decision problems MPP, MPPWG, MPS and MPS-SCFG are NP-complete.

5.1. A GUIDE TO THE REDUCTIONS

The reductions in the next section are structured as follows. The first two reductions are conducted simultaneously from 3SAT to MPPWG and from 3SAT to MPS. Then the reductions from 3SAT to MPP and from 3SAT to MPS-SCFG are obtained from the preceding reduction by some minor changes.

5.2. 3SAT TO MPPWG AND MPS SIMULTANEOUSLY

In the following, a reduction is devised which proves that both MPPWG and MPS are NP-hard. For convenience, the discussion concentrates on problem MPPWG, but also explains why the same reduction is suitable also for MPS.

The reduction from the 3SAT instance INS to an MPPWG instance must construct an STSG, an SIFSA and a threshold value Q in deterministic polynomial-time. Moreover, the answers to the MPPWG instance must correspond exactly to the answers to INS. The presentation of the reduction shall be accompanied by an example of the following 3SAT instance (Barton et al., 1987): (u1 ∨ u2 ∨ u3) ∧ (u1 ∨ u2 ∨ u3), where u1, u2 and u3 are Boolean variables.

5.2.1. Some Intuition

A 3SAT instance is satisfiable iff at least one of the literals in each conjunct is assigned the value true. Implicit in this, but crucial, the different occurrences of the literals of the same variable must be assigned values consistently. These two observations constitute the basis of the reduction. The reduction must capture these two “satisfiability-requirements” of INS in the problem-instances that it constructs. For example, for MPPWG we will construct an STSG and an SIFSA. The SIFSA will be $WG = \{T, F\}_1^{3m}$, where 3m is the number of literals in the formula of INS. The STSG will be constructed such that it has two kinds of derivations for every path (in WG) which constitutes a solution for INS (if INS is satisfiable): one kind of derivations takes care of the consistent assignment of truth values, and the other takes care of the assignment of the value true for exactly one literal in every conjunct. Crucially, the derivations will be assigned such probabilities that will enable us to know whether a path in WG is a solution for INS by inspecting the probability of that path. In other words, the probability of a path in WG will tell us whether the STSG derives that path by enough derivations of each of the two kinds of derivations in order for that path to be a solution for INS.

5.2.2. The Reduction

The reduction constructs an STSG and an SIFSA. The STSG has start-symbol labeled S, two terminals represented by T and F, non-terminals which include (beside S) all $C_k$, for $1 \leq k \leq m$, and both literals of each Boolean variable of the formula of INS. The set of elementary-trees and probability function and the SIFSA are constructed as follows:

1. The reduction constructs for each Boolean variable $u_i$, $1 \leq i \leq n$, two elementary-trees that correspond to assigning the values true (T) and false (F) to $u_i$ consistently through the whole formula. Each of these elementary-trees has root S, with children $C_k$, $1 \leq k \leq m$, in the same order as they appear in the formula of INS; subsequently the children of $C_k$ are the non-terminals that correspond to its three disjuncts $d_{k1}$, $d_{k2}$ and $d_{k3}$. And, finally, the assignment of true (false) to $u_i$ is achieved by creating a child terminal T (resp.
F) to each non-terminal $u_i$ and F (resp. T) to each $\bar{u}_i$. The two elementary-trees for $u_1$, of our running example, are shown in the top left corner of Figure 2.

2. The reduction constructs three elementary-trees for each conjunct $C_k$. The three elementary-trees for conjunct $C_k$ have the same internal structure: root $C_k$, with three children that correspond to the disjuncts $d_{k1}$, $d_{k2}$ and $d_{k3}$. In each of these elementary-trees exactly one of the disjuncts has a child labeled with the terminal T; in each of the three elementary-trees this T child is a different one. Each of these elementary-trees corresponds to the given conjunct where one of the three possible literals is assigned the value T. For the elementary-trees which are constructed for our example see the top right corner of Figure 2.

3. The reduction constructs for each of the literals of each variable $u_i$ two elementary-trees where the literal is assigned in one case T and in the other F. Figure 2 shows these elementary-trees for variable $u_1$ in the bottom left corner.

4. The reduction constructs one elementary-tree that has root S with children $C_k$, $1 \leq k \leq m$, in the same order as these appear in the formula of INS (see the bottom right corner of Figure 2).

5. The reduction assigns probabilities to the elementary-trees that were constructed by the preceding steps. The probabilities of the elementary-trees that have the same root non-terminal must sum up to 1. The probability of an elementary-tree with root label S that was constructed in step 1 of this reduction is a value $p_i$, $1 \leq i \leq n$, where $u_i$ is the only variable of which the literals in the elementary-tree at hand are lexicalized (i.e., have terminal children). Let $n_i$ denote the number of occurrences of both literals of variable $u_i$ in the formula of INS. Then $p_i = \theta\left(\frac{1}{2}\right)^{n_i}$, for some real θ that has to fulfill some conditions which will be derived in Section 5.2.3. The probability of the tree rooted with S and constructed at step 4 of this reduction must then be $p_0 = \left[1 - 2\sum_{i=1}^{n} p_i\right]$. The probability of the elementary-trees of root $C_k$ (step 2) is $\frac{1}{3}$, and of root $u_i$ or $\bar{u}_i$ (step 3) is $\frac{1}{2}$. Suitable probabilities are shown in Figure 2 for our running example.

Let Q denote a threshold probability that shall be derived below. The MPPWG (MPS) instance produced by this reduction is:

INSTANCE: The STSG produced by the above reduction (probabilities are derived below), the SIFSA $WG = \{T, F\}_1^{3m}$, and a probability threshold 0 < Q ≤ 1.
QUESTION: Does this STSG generate for the SIFSA $WG = \{T, F\}_1^{3m}$ a parse (resp. a sentence) for which it assigns a probability greater than or equal to Q?

5.2.3. Deriving the Probabilities and the Threshold

The parses generated by the constructed STSG differ only in the sentences on their frontiers. Therefore, if a sentence is generated by this STSG then it has exactly one parse. This clarifies why we choose to reduce 3SAT to MPPWG and MPS simultaneously.

[Figure 3. Left: example of the first kind of derivations. Right: example of the second kind. In this figure, the vertical dots signify substitutions.]

By inspecting the STSG resulting from the reduction, one can recognize exactly two types of derivations in this STSG (see Figure 3):
1. The first type corresponds to substituting a suitable elementary-tree for a non-terminal leaf node (i.e., literal) in any of the 2n elementary-trees constructed in step 1 of the reduction. This type of derivation corresponds to the consistent assignment of values to all literals of some variable $u_i$. For all $1 \leq i \leq n$ the probability of a derivation of this type is:

$$p_i\left(\frac{1}{2}\right)^{3m-n_i} = \theta\left(\frac{1}{2}\right)^{3m}$$

2. The second type of derivation corresponds to substituting the elementary-trees that have $C_k$ as root for the leaf-nodes labeled $C_k$ in S → C1 ... Cm, and subsequently substituting suitable elementary-trees for the non-terminal leaf-nodes that correspond to literals. This type of derivation corresponds to the assignment of the value true to at least one literal in each conjunct. The probability of any such derivation is:

$$p_0\left(\frac{1}{2}\right)^{2m}\left(\frac{1}{3}\right)^{m} = \left[1 - 2\theta\sum_{i=1}^{n}\left(\frac{1}{2}\right)^{n_i}\right]\left(\frac{1}{2}\right)^{2m}\left(\frac{1}{3}\right)^{m}$$

Next we derive both the threshold Q and the parameter θ. Any parse (or sentence) that fulfills both the “consistency of assignment” requirements and the requirement that each conjunct has at least one literal with child T, must be generated by n derivations of the first type and at least one derivation of the second type. Note that a parse can never be generated by more than n derivations of the first type. Thus the threshold Q must be set at:

$$Q = n\theta\left(\frac{1}{2}\right)^{3m} + \left[1 - 2\theta\sum_{i=1}^{n}\left(\frac{1}{2}\right)^{n_i}\right]\left(\frac{1}{2}\right)^{2m}\left(\frac{1}{3}\right)^{m}$$

However, θ must fulfill some requirements for our reduction to be acceptable:

1. For all i: $0 < p_i$, and also $0 < p_0$; the latter holds iff $\theta < 1\,/\,\left(2\sum_{i=1}^{n}\left(\frac{1}{2}\right)^{n_i}\right)$.

2. $p_0 \leq \theta\left(\frac{1}{2}\right)^{m}$, i.e., $\theta \geq 1\,/\,\left(2\sum_{i=1}^{n}\left(\frac{1}{2}\right)^{n_i} + \left(\frac{1}{2}\right)^{m}\right)$, so that a parse (or sentence) generated by fewer than n derivations of the first type can never reach the threshold Q, however many derivations of the second type generate it.
3. For the resulting STSG to be a probabilistic model, the “probabilities” of parses and sentences must be in the interval (0, 1]. This is taken care of by demanding that the sum of the probabilities of elementary-trees that have the same root non-terminal is 1, and by the definition of the derivation’s probability, the parse’s probability, and the sentence’s probability.
There exists a θ that fulfills the requirements expressed in the inequalities 1 and 2 because the lower bound $1\,/\,\left(2\sum_{i=1}^{n}\left(\frac{1}{2}\right)^{n_i} + \left(\frac{1}{2}\right)^{m}\right)$ is always larger than zero and is strictly smaller than the upper bound $1\,/\,\left(2\sum_{i=1}^{n}\left(\frac{1}{2}\right)^{n_i}\right)$.
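As a numerical sanity check of the construction (our own sketch, not part of the paper's proof; the two-conjunct formula, the helper names and the choice of θ as the midpoint of the two bounds are ours), the following code computes p_i, p_0 and Q for a small hypothetical instance and verifies by brute force over all of {T, F}^{3m} that a sentence reaches probability Q exactly when it encodes a consistent, satisfying assignment:

```python
from itertools import product

# Hypothetical 3CNF instance, *not* the paper's running example:
# (u1 v u2 v u3) & (~u1 v ~u2 v u3); a literal is (variable, negated?).
clauses = [[(1, False), (2, False), (3, False)],
           [(1, True), (2, True), (3, False)]]
m = len(clauses)
positions = [lit for clause in clauses for lit in clause]        # the 3m literal positions, left to right
variables = sorted({v for v, _ in positions})
n = len(variables)
n_occ = {v: sum(1 for x, _ in positions if x == v) for v in variables}

# theta strictly between the lower and upper bounds of Section 5.2.3 (midpoint).
s = sum(0.5 ** n_occ[v] for v in variables)
theta = (1 / (2 * s + 0.5 ** m) + 1 / (2 * s)) / 2

p = {v: theta * 0.5 ** n_occ[v] for v in variables}              # step-1 trees (two per variable)
p0 = 1 - 2 * sum(p.values())                                     # the step-4 tree S -> C1 ... Cm
type1 = theta * 0.5 ** (3 * m)                                   # probability of one first-type derivation
type2 = p0 * 0.5 ** (2 * m) * (1 / 3) ** m                       # probability of one second-type derivation
Q = n * type1 + type2                                            # the threshold

def sentence_probability(w):
    """P(w) for a sentence w in {T,F}^{3m} under the constructed STSG."""
    consistent = sum(                                            # number of first-type derivations
        len({(w[i] == 'T') != neg for i, (x, neg) in enumerate(positions) if x == v}) == 1
        for v in variables)
    choices = 1                                                  # number of second-type derivations
    for k in range(m):
        choices *= sum(w[3 * k + j] == 'T' for j in range(3))
    return consistent * type1 + choices * type2

def encodes_solution(w):
    """w assigns every variable consistently and makes every conjunct true."""
    assignment = {}
    for i, (v, neg) in enumerate(positions):
        value = (w[i] == 'T') != neg
        if assignment.setdefault(v, value) != value:
            return False
    return all(any(w[3 * k + j] == 'T' for j in range(3)) for k in range(m))

assert all((sentence_probability(w) >= Q) == encodes_solution(w)
           for w in product('TF', repeat=3 * m))
print("Q =", Q, "- it separates solution sentences from all others")
```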
5.2.3.1. Polynomiality of the reduction: This reduction is deterministic polynomial-time in m and n (note that n ≤ 3m always). It constructs not more than 2n + 1 + 3m + 4n elementary-trees, each consisting of at most 7m + 1 nodes. And the computation of the probabilities and the threshold is also conducted in deterministic polynomial-time. Furthermore, the construction of the SIFSA $\{T, F\}_1^{3m}$ is polynomial¹⁵ in m.
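For instance (our own arithmetic, not a count taken from the paper), an instance with n = 3 variables and m = 2 conjuncts yields at most

$$2n + 1 + 3m + 4n = 6 + 1 + 6 + 12 = 25$$

elementary-trees, each with at most 7m + 1 = 15 nodes.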
5.2.3.2. The reduction preserves answers: The proof that this reduction preserves answers concerns the two possible answers Yes and No:

Yes: If INS’s answer is Yes then there is an assignment to the variables that is consistent and where each conjunct has at least one literal assigned true. Any possible assignment is represented by one sentence in WG. A sentence which corresponds to a “successful” assignment must be generated by n derivations of the first type and at least one derivation of the second type; this is because the sentence $w_1^{3m}$ fulfills n consistency requirements (one per Boolean variable) and has at least one T assignment for any of $w_{3k+1}$, $w_{3k+2}$ or $w_{3k+3}$, for all 0 ≤ k < m. Both this sentence and its corresponding parse have probabilities ≥ Q. Thus MPPWG and MPS also answer Yes.

No: If INS’s answer is No, then all possible assignments are either not consistent or result in at least