CS计算机代考程序代写 prolog database Lambda Calculus Hidden Markov Mode algorithm Learning to Map Sentences to Logical Form:

Learning to Map Sentences to Logical Form:
Structured Classification with Probabilistic Categorial Grammars

Luke S. Zettlemoyer and Michael Collins
MIT CSAIL

.edu, .edu

Abstract

This paper addresses the problem of mapping
natural language sentences to lambda–calculus
encodings of their meaning. We describe a learn-
ing algorithm that takes as input a training set of
sentences labeled with expressions in the lambda
calculus. The algorithm induces a grammar for
the problem, along with a log-linear model that
represents a distribution over syntactic and se-
mantic analyses conditioned on the input sen-
tence. We apply the method to the task of learn-
ing natural language interfaces to databases and
show that the learned parsers outperform pre-
vious methods in two benchmark database do-
mains.

1 Introduction

Recently, a number of learning algorithms have been pro-
posed for structured classification problems. Structured
classification tasks involve the prediction of output labels
y from inputs x in cases where the output labels have rich
internal structure. Previous work in this area has focused
on problems such as sequence learning, where y is a se-
quence of state labels (e.g., see (Lafferty, McCallum, &
Pereira, 2001; Taskar, Guestrin, & Koller, 2003)), or nat-
ural language parsing, where y is a context-free parse tree
for a sentence x (e.g., see Taskar et al. (2004)).

In this paper we investigate a new type of structured classi-
fication problem, where the goal is to learn to map natural
language sentences to a lambda calculus encoding of their
semantics. As one example, consider the following sen-
tence paired with a logical form representing its meaning:

Sentence: what states border texas
Logical Form: λx.state(x) ∧ borders(x, texas)

The logical form in this case is an expression representing
the set of entities that are states, and that also border Texas.

The training data in our approach consists of a set of sen-
tences paired with logical forms, as in this example.

This is a particularly challenging problem because the
derivation from each sentence to its logical form is not an-
notated in the training data. For example, there is no di-
rect evidence that the word states in the sentence corre-
sponds to the predicate state in the logical form; in gen-
eral there is no direct evidence of the syntactic analysis of
the sentence. Annotating entire derivations underlying the
mapping from sentences to their semantics is highly labor-
intensive. Rather than relying on full syntactic annotations,
we have deliberately formulated the problem in a way that
requires a relatively minimal level of annotation.

Our algorithm automatically induces a grammar that maps
sentences to logical form, along with a probabilistic model
that assigns a distribution over parses under the grammar.
The grammar formalism we use is combinatory categorial
grammar (CCG) (Steedman, 1996, 2000). CCG is a con-
venient formalism because it has an elegant treatment of a
wide range of linguistic phenomena; in particular, CCG has
an integrated treatment of semantics and syntax that makes
use of a compositional semantics based on the lambda
calculus. We use a log-linear model—similar to models
used in conditional random fields (CRFs) (Lafferty et al.,
2001)—for the probabilistic part of the model. Log-linear
models have previously been applied to CCGs by Clark and
Curran (2003), but our work represents a major departure
from previous work on CCGs and CRFs, in that structure
learning (inducing an underlying discrete structure, i.e., the
grammar or CCG lexicon) forms a substantial part of our
approach.

Mapping sentences to logical form is a central problem in
designing natural language interfaces. We describe exper-
imental results on two database domains: Geo880, a set
of 880 queries to a database of United States geography;
and Jobs640, a set of 640 queries to a database of job list-
ings. Tang and Mooney (2001) described previous work on
these data sets. Previous work by Thompson and Mooney
(2002) and Zelle and Mooney (1996) used a subset of the
Geo880 corpus. We evaluated the algorithm’s accuracy in

returning entirely correct logical forms for each test sen-
tence. Our method achieves over 95% precision on both of
these domains, with recall of 79% on each domain. These
are highly competitive results when compared to the previ-
ous work.

2 Background

This section gives background material underlying our
learning approach. We first describe the lambda–calculus
expressions used to represent logical forms. We then de-
scribe combinatory categorial grammar (CCG), and the ex-
tension of CCG to probabilistic CCGs (PCCGs) through
log-linear models.

2.1 Semantics

The sentences in our training data are annotated with ex-
pressions in a typed lambda–calculus language similar to
the one presented by Carpenter (1997). The system has
three basic types: e, the type of entities; t, the type of truth
values; and r, the type of real numbers. It also allows func-
tional types, for example 〈e, t〉, which is the type assigned
to functions that map from entities to truth values. In spe-
cific domains, we will specify subtype hierarchies for e.
For example, in a geography domain we might distinguish
different entity subtypes such as cities, states, and rivers.

Figure 1 shows several sentences from the geography
(Geo880) domain, together with their associated logical
form. Each logical form is an expression from the lambda
calculus. The lambda–calculus expressions we use are
formed from the following items:

• Constants: Constants can either be entities, numbers
or functions. For example, texas is an entity (i.e., it
is of type e). state is a function that maps entities to
truth values, and is of type 〈e, t〉. size is a function
that maps entities to real numbers, and is therefore of
type 〈e, r〉 (in the geography domain, size(x) returns
the land-area of x).

• Logical connectors: The lambda–calculus expres-
sions include conjunction (∧), disjunction (∨), nega-
tion (¬), and implication (→).

• Quantification: The expressions include universal
quantification (∀) and existential quantification (∃).
For example, ∃x.state(x)∧borders(x, texas) is true
if and only if there is at least one state that borders
Texas. Expressions involving ∀ take a similar form.

• Lambda expressions: Lambda expressions represent
functions. For example, λx.borders(x, texas) is a
function from entities to truth values, which is true of
those states that border Texas.

a) What states border Texas
λx.state(x) ∧ borders(x, texas)

b) What is the largest state
arg max(λx.state(x), λx.size(x))

c) What states border the state that borders the most states
λx.state(x) ∧ borders(x, arg max(λy.state(y),

λy.count(λz.state(z) ∧ borders(y, z))))

Figure 1: Examples of sentences with their logical forms.

• Additional quantifiers: The expressions involve
the additional quantifying terms count, arg max,
arg min, and the definite operator ι. An example
of a count expression is count(λx.state(x)), which
returns the number of entities for which state(x)
is true. arg max expressions are of the form
arg max(λx.state(x), λx.size(x)). The first argu-
ment is a lambda expression denoting some set of en-
tities; the second argument is a function of type 〈e, r〉.
In this case the arg max operator would return the set
of items for which state(x) is true, and for which
size(x) takes its maximum value. arg min expres-
sions are defined analogously. Finally, the definite op-
erator creates expressions such as ι(λx.state(x)). In
this case the argument is a lambda expression denot-
ing some set of entities. ι(λx.state(x)) would return
the unique item for which state(x) is true, if a unique
item exists. If no unique item exists, it causes a pre-
supposition error.

2.2 Combinatory Categorial Grammars

The parsing formalism underlying our approach is that of
combinatory categorial grammar (CCG) (Steedman, 1996,
2000). A CCG specifies one or more logical forms—of the
type described in the previous section—for each sentence
that can be parsed by the grammar.

The core of any CCG is a lexicon, Λ. In a purely syntactic
version of CCG, the entries in Λ consist of a word (lexical
item) paired with a syntactic type. A simple example of a
CCG lexicon is as follows:

Utah := NP
Idaho := NP
borders := (S\NP )/NP

In this lexicon Utah and Idaho have the syntactic type NP ,
and borders has the more complex type (S\NP )/NP . A
syntactic type can be either one of a number of primitive
categories (in the example, NP or S), or it can be a com-
plex type of the form A/B or A\B where both A and B
can themselves be a primitive or complex type. The prim-
itive categories NP and S stand for the linguistic notions

a) Utah borders Idaho

NP (S\NP )/NP NP
utah λx.λy.borders(y, x) idaho

>
(S\NP )

λy.borders(y, idaho)
< S borders(utah, idaho) b) What states border Texas (S/(S\NP ))/N N (S\NP )/NP NP λf.λg.λx.f(x) ∧ g(x) λx.state(x) λx.λy.borders(y, x) texas > >
S/(S\NP ) (S\NP )

λg.λx.state(x) ∧ g(x) λy.borders(y, texas)
>

S
λx.state(x) ∧ borders(x, texas)

Figure 2: Two examples of CCG parses.

of noun-phrase and sentence respectively. Note that a sin-
gle word can have more than one syntactic type, and hence
more than one entry in the lexicon.

In addition to the lexicon, a CCG has a set of combinatory
rules which describe how adjacent syntactic categories in
a string can be recursively combined. The simplest such
rules are rules of functional application, defined as follows:

(1) The functional application rules:
a. A/B B ⇒ A
b. B A\B ⇒ A

Intuitively, a category of the form A/B denotes a string that
is of type A but is missing a string of type B to its right;
similarly, A\B denotes a string of type A that is missing a
string of type B to its left.

The first rule says that a string with type A/B can be com-
bined with a right-adjacent string of type B to form a new
string of type A. As one example, in our lexicon, borders,
(which has the type (S\NP )/NP ) can be combined with
Idaho (which has the type NP ) to form the string borders
Idaho with type S\NP . The second rule is a symmetric
rule applying to categories of the form A\B. We can use
this to combine Utah (type NP ) with borders Idaho (type
S\NP ) to form the string Utah borders Idaho with the type
S. We can draw a parse tree (or derivation) of Utah borders
Idaho as follows:

Utah borders Idaho

NP (S\NP )/NP NP
>

(S\NP )
< S Note that we use the notation −> and −< to denote appli- cation of rules 1(a) and 1(b) respectively. CCGs typically include a semantic type, as well as a syn- tactic type, for each lexical entry. For example, our lexicon would be extended as follows: Utah := NP : utah Idaho := NP : idaho borders := (S\NP )/NP : λx.λy.borders(y, x) We use the notation A : f to describe a category with syn- tactic type A and semantic type f . Thus Utah now has syn- tactic type NP , and semantic type utah. The functional application rules are then extended as follows: (2) The functional application rules (with semantics): a. A/B : f B : g ⇒ A : f(g) b. B : g A\B : f ⇒ A : f(g) Rule 2(a) now specifies how the semantics of the category A is compositionally built out of the semantics for A/B and B. Our derivations are then extended to include a com- positional semantics. See Figure 2(a) for an example parse. This parse shows that Utah borders Idaho has the syntactic type S and the semantics borders(utah, idaho). In spite of their relative simplicity, CCGs can capture a wide range of syntactic and semantic phenomena. As one example, see Figure 2(b) for a more complex parse. Note that in this case we have an additional primitive category, N (for nouns), and the final semantics is a lambda expression denoting the set of entities that are states and that border Texas. In this case, the lexical item what has a relatively complex category, which leads to the correct analysis of the underlying string. A full description of CCG goes beyond the scope of this paper. There are several extensions to the formalism: see (Steedman, 1996, 2000) for more details. In particular, CCG includes rules of combination that go beyond the sim- ple function application rules in 1(a) and 1(b).1 Additional combinatory rules allow CCGs to give an elegant treatment of linguistic phenomena such as coordination and relative clauses. In our work we make use of standard rules of application, forward and backward composition, and type- raising. In addition, we allow lexical entries consisting of strings of length greater than one, for example the Mississippi := NP : mississippi river This leads to a relatively minor change to the formalism, which in practice can be very useful. For example, it is eas- ier to directly represent the fact that the Mississippi refers 1One example of a more complex combinatory rule is that of forward composition: A/B : f B/C : g ⇒ A/C : λx.f(g(x)) Another rule which is frequently used is that of type-raising: NP : f ⇒ S/(S\NP ) : λg.g(f) This would allow NP : Utah to be type-raised to a category S/(S\NP ) : λg.g(Utah). to the Mississippi river with the lexical entry above than it is to try to construct this meaning compositionally from the meanings of the determiner the and the word Mississippi, which refers to the state of Mississippi when used without the determiner. 2.3 Probabilistic CCGs We now describe how to generalize CCGs to probabilis- tic CCGs (PCCGs). A CCG, as described in the previous section, will generate one or more derivations for each sen- tence S that can be parsed by the grammar. We will de- scribe a derivation as a pair (L, T ), where L is the final logical form for the sentence (e.g., borders(utah, idaho) in figure 2(a)), and T is the sequence of steps taken in de- riving L. We will frequently refer to T as a parse tree. A PCCG defines a conditional distribution P (L, T |S) over possible (L, T ) pairs for a given sentence S. In general, various sources of ambiguity can lead to a sen- tence S having more than one valid (L, T ) pair. This is the primary motivation for extending CCGs to PCCGs: PCCGs deal with ambiguity by ranking alternative parses for a sentence in order of probability. One source of am- biguity is lexical items having more than one entry in the lexicon. For example, New York might have entries NP : new york city and NP : new york state. An- other source of ambiguity is where a single logical form L may be derived by multiple derivations T . This latter form of ambiguity can occur in CCG, and is often referred to as spurious ambiguity; the term spurious is used because the different syntactic parses lead to identical semantics. In defining PCCGs, we make use of a conditional log- linear model that is similar to the model form in condi- tional random fields (CRFs) (Lafferty et al., 2001) or log- linear models applied to parsing (Ratnaparkhi, Roukos, & Ward, 1994; Johnson, Geman, Canon, Chi, & Rie- zler, 1999). Log-linear models for CCGs are described in (Clark & Curran, 2003). We assume a function f̄ mapping (L, T, S) triples to feature vectors in Rd. This function is defined by d individual features, so that f̄(L, T, S) = 〈f1(L, T, S), . . . , fd(L, T, S)〉. Each feature fj is typi- cally the count of some sub-structure within (L, T, S). The model is parameterized by a vector θ̄ ∈ Rd. The probabil- ity of a particular (syntax, semantics) pair is defined as P (L, T |S; θ̄) = ef̄(L,T,S)·θ̄∑ (L,T ) e f̄(L,T,S)·θ̄ (1) The sum in the denominator is over all valid parses for S under the CCG grammar. In this paper we make use of lexical features alone. For each lexical entry in the grammar, we have a feature fj that counts the number of times that the lexical entry is used in T . For example, in the simple grammar with en- tries for Utah, Idaho and borders, there would be three fea- tures of this type. While these features are quite simple, we have found them to be quite successful when applied to the Geo880 and Jobs640 data sets. More complex features are certainly possible (e.g., see (Clark & Curran, 2003)). In the future, we would like to explore more general features that have been shown to be useful in other parsing settings. 2.4 Parsing and Parameter Estimation We now turn to issues of parsing and parameter estimation. Parsing under a PCCG involves computing the most prob- able logical form L for a sentence S, arg max L P (L|S; θ̄) = arg max L ∑ T P (L, T |S; θ̄) where the arg max is taken over all logical forms L and the hidden syntax T is marginalized out by summing over all parses that produce L. We use dynamic programming al- gorithms for this step, which are very similar to CKY–style algorithms for parsing probabilistic context-free grammars (PCFGs).2 Dynamic programming is feasible within our approach because the feature-vector definitions f̄(L, T, S) involve local features that keep track of counts of lexical items in the derivation T .3 In parameter estimation, we assume that we have n training examples, {(Si, Li) : i = 1 . . . n}. Si is the i’th sentence in training data, and Li is the lambda expression associ- ated with that sentence. The task is to estimate the param- eter values θ̄ from these examples. Note that the training set does not include derivations Ti, and we therefore view derivations as hidden variables within the approach. The log-likelihood of the training set is given by: O(θ̄) = n∑ i=1 log P (Li|Si; θ̄) = n∑ i=1 log (∑ T P (Li, T |Si; θ̄) ) Differentiating with respect to θj yields: ∂O ∂θj = n∑ i=1 ∑ T fj(Li, T, Si)P (T |Si, Li; θ̄) − n∑ i=1 ∑ L,T fj(L, T, Si)P (L, T |Si; θ̄) The two terms in the derivative involve the calculation of expected values of a feature under the distributions 2CKY–style algorithms for PCFGs (Manning & Schutze, 1999) are related to the Viterbi algorithm for hidden Markov mod- els, or dynamic programming methods for Markov random fields. 3We use beam–search during parsing, where low-probability sub-parses are discarded at some points during parsing, in order to improve efficiency. P (T |Si, Li; θ̄) or P (T,L|Si; θ̄). Expectations of this type can again be calculated using dynamic programming, us- ing a variant of the inside-outside algorithm (Baker, 1979), which was originally formulated for probabilistic context- free grammars. Given this derivative, we can use it directly to maximize the likelihood using a stochastic gradient ascent algorithm (LeCun, Bottou, Bengio, & Haffner, 1998),4 which takes the following form: Set θ̄ to some initial value for k = 0 . . . N − 1 for i = 1 . . . n θ̄ = θ̄ + α0 (1+ct) ∂ log P (Li|Si;θ̄) ∂θ̄ where t = i+k×n is the total number of previous updates, N is a parameter that controls the number of passes over the training data, and α0 and c are learning–rate parameters. 3 Learning In the previous section we saw that a probabilistic Com- binatory Categorial Grammar (PCCG) is defined by a lex- icon Λ, together with a parameter vector θ̄. In this sec- tion, we present an algorithm that learns a PCCG. One input to the algorithm is a training set of n examples, {(Si, Li) : i = 1 . . . n}, where each training example is a string Si paired with a logical form Li. Another input to the algorithm is an initial lexicon, Λ0.5 Note that the training data includes neither direct evidence about the parse trees mapping each Si to Li, nor the set of lexical entries which are required for this mapping. We treat the parse trees as a hidden variable within our model. The set of possible parse trees for a sentence depends on the lexicon, which is itself learned from the training examples. Thus, at a high level, learning will involve the following two sub-problems: • Induction of a lexicon, Λ, which defines a set of parse trees for each training sentence Si. • Estimation of parameter values, which define a distri- bution over parse trees for any sentence. 4The EM algorithm could also be used, but would require some form of gradient ascent for the M–step. Because of this, we found it simpler to use gradient ascent for the entire optimization. 5In our experiments the initial lexicon includes lexical items that are derived directly from the database in the domain; for ex- ample, we have a list of entries {Utah := NP : utah, Idaho := NP : idaho, Nevada := NP : nevada, . . .} including every U.S. state in the geography domain. It also includes lexical items that are domain independent, and easily specified by hand: for ex- ample, the definition for “what” in Figure 2(b) would be included, as it would be useful across many domains. The first problem can be thought of as a form of structure learning, and is a major focus of the current section. The second problem is a more conventional parameter estima- tion problem, which roughly speaking can be solved using the gradient descent methods described in section 2.4. The remainder of this section describes an overall strat- egy for these two problems. We show how to interleave a structure-building step, GENLEX, with a parameter esti- mation step, in a way that results in a PCCG with a compact lexicon and effective parameter estimates for the weights of the log-linear model. Section 3.1 describes the main struc- tural step, GENLEX(S, L), which generates a set of candi- date lexical items that may be useful in deriving L from S. In section 3.2 we describe the overall learning algorithm, which prunes the lexical entries suggested by GENLEX and estimates the parameters of a log-linear model. 3.1 Lexical Learning We now describe the function GENLEX, which takes a sen- tence S and a logical form L and generates a set of lexical items. Our aim is to define GENLEX(S, L) in such a way that the set of lexical items that it generates allows at least one parse of S that results in L. As an example, consider the parse in Figure 2(a). When presented with the input sentence Utah borders Idaho and logical form borders(utah, idaho), we would like GENLEX to produce a lexicon that includes the three lexical items that were used in this parse, namely Utah := NP : utah Idaho := NP : idaho borders := (S\NP )/NP : λx.λy.borders(y, x) Our definition of GENLEX will also produce spurious lexical items, such as borders := NP : idaho and borders utah := (S\NP )/NP : λx.λy.borders(y, x). Later, we will see how these items can be pruned from the lexicon in a later stage of processing. To compute GENLEX, we make use of a function, C(L), that maps a logical form to a set of categories (such as NP : utah, or NP : idaho). GENLEX is then defined as GENLEX(S, L) = {x := y | x ∈ W (S), y ∈ C(L)} where W (S) is the set of all subsequences of words in S. The function C(L) is defined through a set of rules that examine L and produce categories based on its structure. Figure 3 shows the rules that we use. Each rule consists of a trigger that identifies some sub-structure within the logical form L. For each sub-structure in L that matches the trigger, a category is created and added to C(L). As one example, the second row in the table defines a rule that identifies all arity-one predicates p within the logical form Rules Categories produced from logical form Input Trigger Output Category arg max(λx.state(x) ∧ borders(x, texas), λx.size(x)) constant c NP : c NP : texas arity one predicate p1 N : λx.p1(x) N : λx.state(x) arity one predicate p1 S\NP : λx.p1(x) S\NP : λx.state(x) arity two predicate p2 (S\NP )/NP : λx.λy.p2(y, x) (S\NP )/NP : λx.λy.borders(y, x) arity two predicate p2 (S\NP )/NP : λx.λy.p2(x, y) (S\NP )/NP : λx.λy.borders(x, y) arity one predicate p1 N/N : λg.λx.p1(x) ∧ g(x) N/N : λg.λx.state(x) ∧ g(x) literal with arity two predicate p2 and constant second argument c N/N : λg.λx.p2(x, c) ∧ g(x) N/N : λg.λx.borders(x, texas) ∧ g(x) arity two predicate p2 (N\N)/NP : λx.λg.λy.p2(x, y) ∧ g(x) (N\N)/NP : λg.λx.λy.borders(x, y) ∧ g(x) an arg max / min with second argument arity one function f NP/N : λg. arg max / min(g, λx.f(x)) NP/N : λg. arg max(g, λx.size(x)) an arity one numeric-ranged function f S/NP : λx.f(x) S/NP : λx.size(x) Figure 3: The rules that define GENLEX. We use the term predicate to refer to a function that returns a truth value; function to refer to all other functions; and constant to refer to constants of type e. Each row represents a rule. The first column lists the triggers that identify some sub-structure within a logical form L, and then generate a category. The second column lists the category that is created. The third column lists example categories that are created when the rule is applied to the logical form at the top of this column. as triggers for creating a category N : λx.p(x). Given the logical form λx.major(x) ∧ city(x), which has the arity- one predicates major and city, this rule would create the categories N : λx.major(x) and N : λx.city(x). Intuitively, each of the rules in Figure 3 corresponds to a different linguistic sub-category such as noun, transitive verb, adjective, and so on. For example, the rule in the first row generates categories that are noun phrases, and the sec- ond rule generates nouns. The end result is an efficient way to generate a large set of linguistically plausible categories C(L) that could be used to construct a logical form L. 3.2 The Learning Algorithm Figure 4 shows the learning algorithm used within our ap- proach. The output of the algorithm is a PCCG, defined by a lexicon Λ and a parameter vector θ̄. As input, the algo- rithm takes a training set of sentences paired with logical forms, together with an initial lexicon, Λ0. At all stages, the algorithm maintains a parameter vector θ̄ which stores a real value associated with every possible lexical item. The set of possible lexical items is Λ∗ = Λ0 ∪ n⋃ i=1 GENLEX(Si, Li) In our experiments, the parameters were initialized to be 0.1 for all lexical items in Λ0, and 0.01 for all other lexical items. These values were chosen through experiments on the development data; they give a small initial bias towards using lexical items from Λ0 and favor parses that include more lexical items. The goal of the algorithm is to provide a relatively com- pact lexicon, which is a small subset of the entire set of possible lexical items. The algorithm achieves this by al- ternating between two steps. The goal of step 1 is to search for a relatively small number of lexical entries, which are nevertheless sufficient to successfully parse all training ex- amples. Step 2 is then used to re-estimate the parameters of the lexical items that are selected in step 1. In the t’th application of step 1, each sentence in turn is parsed with the current parameters θ̄t−1 and a spe- cial, sentence–specific lexicon which is defined as Λ0 ∪ GENLEX(Si, Li). This will result in one or more highest- scoring parses that have the logical form Li. 6 Lexical items are extracted from these highest-scoring parses alone. The result of this stage is to generate a small subset λi of GENLEX(Si, Li) for each training example. The out- put of step 1, at iteration t, is a subset of Λ∗, defined as Λt = Λ0 ∪ ⋃n i=1 λi. Step 2 re-estimates the parameters of the members of Λt, using stochastic gradient descent. The starting point for gradient descent when estimating θ̄t is θ̄t−1, i.e., the pa- rameter values at the previous iteration. For any lexical item that is not a member of Λt, the associated parameter in θ̄t is set to be the same as the corresponding parameter in θ̄t−1 (i.e., parameter values are simply copied across from the previous iteration). The motivation for cycling between steps 1 and 2 is as fol- lows. In step 1, keeping only those lexical items that occur in the highest scoring parse(s) leading to Li results in a compact lexicon. This step is also guaranteed to produce a lexicon Λt ⊂ Λ∗ such that the accuracy on the training data when running the PCCG (Λt, θ̄t−1) is at least as ac- curate as applying the PCCG (Λ∗, θ̄t−1). In other words, pruning the lexicon in this way cannot hurt parsing perfor- mance on training data in comparison to using all possible lexical entries.7 6Note that this set of highest-scoring parses is identical to the set produced by parsing with Λ∗, rather than the sentence-specific lexicon. This is because Λ0 ∪ GENLEX(Si, Li) contains all lex- ical items that can possibly be used to derive Li. 7To see this, note that restricting the lexicon in this way cannot exclude any of the highest-scoring parses for Si that lead to Li. In practice, it may exclude some parses that lead to logical forms for Si that are incorrect. Because the highest-scoring correct parses Step 2 also has a guarantee, in that the log-likelihood on the training data will improve (assuming that stochastic gradi- ent descent is successful in improving its objective). Step 2 is needed because after each application of step 1, the pa- rameters θ̄t−1 are optimized for Λt−1 rather than Λt, the current lexicon. Step 2 derives new parameter values θ̄t which are optimized for Λt. In summary, steps 1 and 2 together form a greedy, itera- tive method for simultaneously finding a compact lexicon and also optimizing the log-likelihood of the model on the training data. 4 Related Work This section discusses related work on natural language in- terfaces to databases (NLIDBs), in particular focusing on learning approaches, and related work on learning CCGs. There has been a significant amount of work on hand engi- neering NLIDBs. Androutsopoulos, Ritchie, and Thanisch (1995) provide a comprehensive summary of this work. Recent work in this area has focused on improved pars- ing techniques and designing grammars that can be ported easily to new domains (Popescu, Armanasu, Etzioni, Ko, & Yates, 2004). Zelle and Mooney (1996) developed one of the earliest ex- amples of a learning system for NLIDBs. This work made use of a deterministic shift–reduce parser and developed a learning algorithm, called CHILL, based on techniques from Inductive Logic Programming, to learn control rules for parsing. The major limitation of this approach is that it does not learn the lexicon, instead assuming that a lex- icon that pairs words with their semantic content (but not syntax) has been created in advance. Later, Thompson and Mooney (2002) developed a system that learns a lexicon for CHILL that performed almost as well as the original system. Most recently, Tang and Mooney (2001) devel- oped a statistical shift–reduce parser that significantly out- performed these original systems. However, this system, again, does not learn a lexicon. A number of previous learning methods (Papineni, Roukos, & Ward, 1997; Ramaswamy & Kleindienst, 2000; Miller, Stallard, Bobrow, & Schwartz, 1996; He & Young, 2004) have been applied to the ATIS domain, which involves a natural language interface to a travel database of flight in- formation. In the future we plan to test our method on this domain. Miller et al. (1996) describe an approach that as- sumes full annotation of parse trees. Papineni et al. (1997) and Ramaswamy and Kleindienst (2000) use approaches based on methods originally developed for machine trans- lation. He and Young (2004) describe an approach using an extension of hidden Markov models, resulting in a model with some of the power of context-free models. are still allowed, parsing performance cannot deteriorate. Inputs: • Training examples E = {(Si, Li) : i = 1 . . . n} where each Si is a sentence, each Li is a logical form. • An initial lexicon Λ0 Procedures: • PARSE(S, L, Λ, θ̄): takes as input a sentence S, a logical form L, a lexicon Λ, and a parameter vector θ̄. Returns the highest probability parse for S with logical form L, when S is parsed by a PCCG with lexicon Λ and parameters θ̄. If there is more than one parse with the same highest proba- bility, the entire set of highest probability parses is returned. Dynamic programming methods are used when implement- ing PARSE, see section 2.4 of this paper. • ESTIMATE(Λ, E, θ̄): takes as input a lexicon Λ, a train- ing set E, and a parameter vector θ̄. Returns parameter val- ues θ̄ that are the output of stochastic gradient descent on the training set E under the grammar defined by Λ. The input θ̄ is the initial setting for the parameters in the stochastic gra- dient descent algorithm. Dynamic programming methods are used when implementing ESTIMATE, see section 2.4. • GENLEX(S, L): takes as input a sentence S and a logical form L. Returns a set of lexical items. See section 3.1 for a description of GENLEX. Initialization: Define θ̄ to be a real-valued vector of arity |Λ∗|, where Λ∗ = Λ0 ∪ Sn i=1 GENLEX(Si, Li). θ̄ stores a pa- rameter value for each potential lexical item. The initial pa- rameters θ̄0 are taken to be 0.1 for any member of Λ0, and 0.01 for all other lexical items. Algorithm: • For t = 1 . . . T Step 1: (Lexical generation) • For i = 1 . . . n: – Set λ = Λ0 ∪ GENLEX(Si, Li). – Calculate π = PARSE(Si, Li, λ, θ̄ t−1). – Define λi to be the set of lexical entries in π. • Set Λt = Λ0 ∪ Sn i=1 λi Step 2: (Parameter Estimation) • Set θ̄t = ESTIMATE(Λt, E, θ̄t−1) Output: Lexicon ΛT together with parameters θ̄ T . Figure 4: The overall learning algorithm. There have been several pieces of previous work on learn- ing CCGs. Clark and Curran (2003) developed a method for leaning the parameters of a log-linear model for syntac- tic CCG parsing given fully annotated normal–form parse trees. Watkinson and Manandhar (1999) presented an un- supervised approach for learning CCGs that, again, does not perform any semantic analysis. We know of only one previous system (Bos, Clark, Steedman, Curran, & Hock- enmaier, 2004) that learns CCGs with semantics. However, this approach requires fully–annotated CCG derivations as supervised training data. As such, the techniques they em- ployed are not applicable to learning in our framework. Geo880 Jobs640 P R P R Our Method 96.25 79.29 97.36 79.29 COCKTAIL 89.92 79.40 93.25 79.84 Figure 5: The results for our method, and the previous work of COCKTAIL, when applied to the two database query domains. P is precision in recovering entire logical forms, R is recall. 5 Experiments We evaluated the learning algorithm on two domains: Geo880, a set of 880 queries to a database of U.S. geogra- phy; and Jobs640, a set of 640 queries to a database of job listings. The data were originally annotated with Prolog style semantics which we manually converted to equivalent statements in the lambda calculus. We compare the structured classifier results to the COCK- TAIL system (Tang & Mooney, 2001). The COCKTAIL experiments were conducted by performing ten–fold cross validation of the entire data set. We used a slightly differ- ent experimental set-up, where we made an explicit split between training and test data sets.8 The Geo880 data set was divided into 600 training examples and 280 test ex- amples; the Jobs640 set was divided into 500 training and 140 test examples. The parameters of the training algo- rithm were tuned by cross–validation on the training set. We did two passes of the overall learning loop in Figure 4. Each time we used gradient descent to estimate parame- ters, we performed three passes over the training set with the learning-rate parameters α0 = 0.1 and c = 0.001. We give precision and recall for the different algorithms, defined as Precision = # correct/total # parsed, Recall = # correct/total # examples. Sentences are correct if the parser gives a completely correct semantics. Figure 5 shows the results of the experiments. Our ap- proach has higher precision than COCKTAIL on both do- mains, with a very small reduction in recall. When eval- uating these results, it is important to realize that COCK- TAIL is provided with a fairly extensive lexicon that pairs words with semantic predicates. For example, the word borders would be paired with the predicate borders(x, y). This prior information goes substantially beyond the initial lexicon used in our own experiments.9 8This allowed us to use cross-validation experiments on the training set to optimize parameters, and more importantly to de- velop our algorithms while ensuring that we had not implicitly tuned our approach to the final test set. 9Note that the work of (Thompson & Mooney, 2002) does de- scribe a method which automatically learns a lexicon. However, results for this approach were worse than results for CHILL (Zelle & Mooney, 1996), which in turn were considerably worse than results for COCKTAIL on the Geo250 domain, a subset of the ex- amples in Geo880. states := N : λx.state(x) major := N/N : λf.λx.major(x) ∧ f(x) population := N : λx.population(x) cities := N : λx.city(x) rivers := N : λx.river(x) run through := (S\NP )/NP : λx.λy.traverse(y, x) the largest := NP/N : λf. arg max(f, λx.size(x)) river := N : λx.river(x) the highest := NP/N : λf. arg max(f, λx.elev(x)) the longest := NP/N : λf. arg max(f, λx.len(x)) Figure 6: Ten learned lexical items that had highest associated parameter values from a randomly chosen development run in the Geo880 domain. To better understand these results, we examined perfor- mance of our method through cross-validation on the train- ing set. We found that the approach creates a compact lexicon for the training examples that it parses. On the Geo880 domain, the initial number of lexical items created by GENLEX was on average 393.8 per training example After pruning, on average only 5.1 lexical items per train- ing example remained. The Jobs640 domain showed a re- duction from an average of 697.1 lexical items per training example, to 6.6 items. To investigate the disparity between precision and recall, we examined the behavior of the algorithm when trained in the cross-validation (development) regime. We found that on average, the learner failed to parse 9.3% of the training examples in the Geo880 domain, and 8.7% of training ex- amples in the Jobs640 domain. (Note that sentences which cannot be parsed in step 1 of the training algorithm are ex- cluded from the training set during step 2.) These parse failures were caused by sentences whose semantics could not be built from the lexical items that GENLEX created. For example, the learner failed to parse complex sentences such as Through which states does the Mississippi run be- cause GENLEX does not create lexical entries that allow the verb run to find its argument, the preposition through, when it has moved to the front of the sentence. This prob- lem is almost certainly a major cause of the lower recall on test examples. Exploring the addition of more rules to GENLEX is an important area for future work. Figure 6 gives a sample of lexical entries that are learned by the approach. These entries are linguistically plausible and should generalize well to unseen data. 6 Discussion and Future Work In this paper, we presented a learning algorithm that cre- ates accurate structured classifiers for natural language in- terfaces. A major focus for future work is to apply the algo- rithm to a range of larger data sets. Larger data sets should improve the recall performance and allow us to develop a more comprehensive set of rules for GENLEX, ultimately creating a robust system that can quickly learn interfaces for new, unseen domains with little human assistance. Although the experiments in this paper only learned natural language interfaces to databases, there are many other nat- ural language interfaces that the techniques can be gener- alized to handle. In particular, we will explore building in- terfaces to dialogue systems. These interfaces must handle a much wider range of semantic phenomena (for example, anaphora and ellipses). Extending the current algorithm to address these challenges will greatly increase the range of possible interfaces that are successfully learned. Acknowledgements We would like to thank Rohit Kate and Raymond Mooney for their help with obtaining the Geo880 and Jobs640 data sets. We also gratefully acknowledge the support of a ND- SEG graduate research fellowship and the National Science Foundation under grants 0347631 and 0434222. References Androutsopoulos, I., Ritchie, G., & Thanisch, P. (1995). Natural language interfaces to databases–an intro- duction. Journal of Language Engineering, 1(1), 29– 81. Baker, J. (1979). Trainable grammars for speech recogni- tion. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America. Bos, J., Clark, S., Steedman, M., Curran, J. R., & Hock- enmaier, J. (2004). Wide-coverage semantic repre- sentations from a CCG parser. In Proceedings of the 20th International Conference on Computational Linguistics, pp. 1240–1246. Carpenter, B. (1997). Type-Logical Semantics. The MIT Press. Clark, S., & Curran, J. R. (2003). Log-linear models for wide-coverage CCG parsing. In Proceedings of the SIGDAT Conference on Empirical Methods in Natu- ral Language Processing. He, Y., & Young, S. (2004). Semantic processing using the hidden vector state model. Computer Speech and Language. Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the Association for Computational Linguistics, pp. 535–541. Lafferty, J., McCallum, A., & Pereira, F. (2001). Condi- tional random fields: Probabilistic models for seg- menting and labeling sequence data. In Proceed- ings of the 18th International Conference on Ma- chine Learning. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recog- nition. Proceedings of the IEEE, 86(11), 2278–2324. Manning, C. D., & Schutze, H. (1999). Foundations of sta- tistical natural language processing. The MIT Press. Miller, S., Stallard, D., Bobrow, R. J., & Schwartz, R. L. (1996). A fully statistical approach to natural lan- guage interfaces. In Proceedings of the Association for Computational Linguistics, pp. 55–61. Papineni, K. A., Roukos, S., & Ward, T. R. (1997). Feature- based language understanding. In Proceedings of European Conference on Speech Communication and Technology. Popescu, A.-M., Armanasu, A., Etzioni, O., Ko, D., & Yates, A. (2004). Modern natural language interfaces to databases: Composing statistical parsing with se- mantic tractability. In Proceedings of the 20th Inter- national Conference on Computational Linguistics. Ramaswamy, G., & Kleindienst, J. (2000). Hierarchi- cal feature-based translation for scalable natural lan- guage understanding. In Proceedings of 6th Interna- tional Conference on Spoken Language Processing. Ratnaparkhi, A., Roukos, S., & Ward, R. T. (1994). A max- imum entropy model for parsing. In Proceedings of the International Conference on Spoken Language Processing. Steedman, M. (1996). Surface Structure and Interpreta- tion. The MIT Press. Steedman, M. (2000). The Syntactic Process. The MIT Press. Tang, L. R., & Mooney, R. J. (2001). Using multiple clause constructors in inductive logic programming for se- mantic parsing. In Proceedings of the 12th European Conference on Machine Learning, pp. 466–477. Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin markov networks. In Neural Information Processing Systems. Taskar, B., Klein, D., Collins, M., Koller, D., & Manning, C. (2004). Max-margin parsing. In Proceedings of the SIGDAT Conference on Empirical Methods in Natural Language Processing. Thompson, C. A., & Mooney, R. J. (2002). Acquiring word-meaning mappings for natural language inter- faces. Journal of Artificial Intelligence Research, 18. Watkinson, S., & Manandhar, S. (1999). Unsupervised lexical learning with categorial grammars using the LLL corpus. In Proceedings of the 1st Workshop on Learning Language in Logic, pp. 16–27. Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Proceedings of the 14th National Conference on Artificial Intelligence.