University of Toronto, Department of Computer Science
CSC 2501F—Computational Linguistics, Fall 2021
Reading assignment 4
Due date: Electronically by 12:10, Friday 29th October 2021.
Late write-ups will not be accepted without documentation of a medical or other emergency. This assignment is worth 5% of your final grade.
What to read
H. Lee et al., "Deterministic Coreference Resolution Based on Entity-Centric, Precision-Ranked Rules," Computational Linguistics 39(4), 2013, pp. 885–916.
What to write
Write a brief summary of the paper’s argumentation, with a critical assessment of its merits.
Some points to consider:
• How could a rule-based method, even in 2013, be competitive on any task against pattern-recognition based methods?
• How might you try to improve it?
General requirements: Your write-up should be typed, using 12-point font and 1.5-line spacing; it should fit on one to two sides of a sheet of paper. Submit using the teach.cs submit command:
$ submit -c csc2501h -a Essay4 essay4.pdf
Deterministic Coreference Resolution Based on Entity-Centric, Precision-Ranked Rules
Heeyoung Lee∗ Stanford University
Angel Chang∗ Stanford University
Yves Peirsman∗∗ University of Leuven
Nathanael Chambers† United States Naval Academy
Mihai Surdeanu‡ University of Arizona
Dan Jurafsky§ Stanford University
We propose a new deterministic approach to coreference resolution that combines the global information and precise features of modern machine-learning models with the transparency and modularity of deterministic, rule-based systems. Our sieve architecture applies a battery of deterministic coreference models one at a time from highest to lowest precision, where each model builds on the previous model’s cluster output. The two stages of our sieve-based architecture, a mention detection stage that heavily favors recall, followed by coreference sieves that are precision-oriented, offer a powerful way to achieve both high precision and high recall. Further, our approach makes use of global information through an entity-centric model that encourages the sharing of features across all mentions that point to the same real-world entity. Despite its simplicity, our approach gives state-of-the-art performance on several corpora and genres, and has also been incorporated into hybrid state-of-the-art coreference systems for Chinese and
∗ Stanford University, Stanford, CA 94305.
∗∗ University of Leuven, Blijde-Inkomststraat 21 PO Box 03308, B-3000 Leuven, Belgium.
† United States Naval Academy, Annapolis, MD 21402.
‡ University of Arizona, PO Box 210077, Tucson, AZ 85721-0077.
§ Stanford University, Stanford, CA 94305.
Submission received: 27 May 2012; revised submission received: 22 October 2012; accepted for publication: 20 November 2012.
doi:10.1162/COLI_a_00152
© 2013 Association for Computational Linguistics
Arabic. Our system thus offers a new paradigm for combining knowledge in rule-based systems that has implications throughout computational linguistics.
1. Introduction
Coreference resolution, the task of finding all expressions that refer to the same entity in a discourse, is important for natural language understanding tasks like summarization, question answering, and information extraction.
The long history of coreference resolution has shown that the use of highly precise lexical and syntactic features is crucial to high quality resolution (Ng and Cardie 2002b; Lappin and Leass 1994; Poesio et al. 2004a; Zhou and Su 2004; Bengtson and Roth 2008; Haghighi and Klein 2009). Recent work has also shown the importance of global inference—performing coreference resolution jointly for several or all mentions in a document—rather than greedily disambiguating individual pairs of mentions (Morton 2000; Luo et al. 2004; Yang et al. 2004; Culotta et al. 2007; Yang et al. 2008; Poon and Domingos 2008; Denis and Baldridge 2009; Rahman and Ng 2009; Haghighi and Klein 2010; Cai, Mujdricza-Maydt, and Strube 2011).
Modern systems have met this need for carefully designed features and global or entity-centric inference with machine learning approaches to coreference resolution. But machine learning, although powerful, has limitations. Supervised machine learning systems rely on expensive hand-labeled data sets and generalize poorly to new words or domains. Unsupervised systems are increasingly complex, making them hard to tune and difficult to apply to new problems and genres. Rule-based models like Lappin and Leass (1994) were a popular early solution to the subtask of pronominal anaphora resolution. Rules are easy to create and maintain, and error analysis is more transparent. But early rule-based systems relied on hand-tuned weights and were not capable of global inference, two factors that led to poor performance and their replacement by machine learning.
We propose a new approach that brings together the insights of these modern supervised and unsupervised models with the advantages of deterministic, rule-based systems. We introduce a model that performs entity-centric coreference, where all mentions that point to the same real-world entity are jointly modeled, in a rich feature space using solely simple, deterministic rules. Our work is inspired both by the seminal early work of Baldwin (1997), who first proposed that a series of high-precision rules could be used to build a high-precision, low-recall system for anaphora resolution, and by more recent work that has suggested that deterministic rules can outperform machine learning models for coreference (Zhou and Su 2004; Haghighi and Klein 2009) and for named entity recognition (Chiticariu et al. 2010).
Figure 1 illustrates the two main stages of our new deterministic model: mention detection and coreference resolution, as well as a smaller post-processing step. In the mention detection stage, nominal and pronominal mentions are identified using a high-recall algorithm that selects all noun phrases (NPs), pronouns, and named entity mentions, and then filters out non-mentions (pleonastic it, i-within-i, numeric entities, partitives, etc.).
The coreference resolution stage is based on a succession of ten independent coreference models (or "sieves"), applied from highest to lowest precision. Precision can be informed by linguistic intuition, or empirically determined on a coreference corpus (see Section 4.4.3). For example, the first (highest precision) sieve links first-person pronouns inside a quotation with the speaker of a quotation, and the tenth sieve (i.e., low precision but high recall) implements generic pronominal coreference resolution.
Figure 1
The architecture of our coreference system.
Crucially, our approach is entity-centric—that is, our architecture allows each coreference decision to be globally informed by the previously clustered mentions and their shared attributes. In particular, each deterministic rule is run on the entire discourse, using and extending clusters (i.e., groups of mentions pointing to the same real-world entity, built by models in previous tiers). Thus, for example, in deciding whether two mentions i and j should corefer, our system can consider not just the local features of i and j but also any information (head word, named entity type, gender, or number) about the other mentions already linked to i and j in previous steps.
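As a rough sketch of this entity-centric cascade (not the authors' code), the skeleton below applies sieves in precision order while merging clusters so that later sieves see the joint evidence. The mention representation and the single illustrative sieve (exact string match) are hypothetical stand-ins:

```python
# Minimal sketch of the precision-ranked, entity-centric sieve cascade.
# Mentions are dicts in textual order; sieves are ordered highest precision first.

def apply_sieves(mentions, sieves):
    cluster_of = list(range(len(mentions)))        # mention index -> cluster id
    clusters = {i: {i} for i in range(len(mentions))}

    for sieve in sieves:
        for i in range(len(mentions)):
            j = sieve(i, mentions, cluster_of, clusters)   # antecedent or None
            if j is not None and cluster_of[i] != cluster_of[j]:
                src, dst = cluster_of[i], cluster_of[j]
                for k in clusters[src]:            # merge, so later sieves can
                    cluster_of[k] = dst            # use the joint cluster evidence
                clusters[dst] |= clusters.pop(src)
    return clusters

def exact_match_sieve(i, mentions, cluster_of, clusters):
    """Link mention i to the closest earlier mention with identical text."""
    for j in range(i - 1, -1, -1):
        if mentions[j]["text"] == mentions[i]["text"]:
            return j
    return None
```

For instance, `apply_sieves([{"text": "John"}, {"text": "a musician"}, {"text": "John"}], [exact_match_sieve])` places the two John mentions in one cluster and leaves the other as a singleton.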
Finally, the architecture is highly modular, which means that additional coreference resolution models can be easily integrated.
The two-stage architecture offers a powerful way to balance high recall and precision in the system and to make use of entity-level information within a rule-based architecture. The mention detection stage heavily favors recall, and the following sieves favor precision. Our results here and in our earlier papers (Raghunathan et al. 2010; Lee et al. 2011) show that this design leads to state-of-the-art performance despite the simplicity of the individual components, and that the lack of language-specific lexical features makes the system easy to port to other languages. The intuition is not new; in addition to the prior coreference work mentioned earlier and discussed in Section 6, we draw on classic ideas that have proved to be important again and again in the history of natural language processing. The idea of beginning with the most accurate models, or starting with smaller subproblems that allow for high-precision solutions, combines the intuitions of "shaping" or "successive approximations" first proposed for learning by Skinner (1938) and widely used in NLP (e.g., the successively trained IBM MT models of Brown et al. [1993], and the "islands of reliability" approaches to parsing and speech recognition of Borghesi and Favareto [1982] and Corazza et al. [1991]). The idea of beginning with a high-recall list of candidates that is then narrowed by a series of high-precision filters dates back to one of the earliest architectures in natural language processing, the part-of-speech tagging algorithm of the Computational Grammar Coder (Klein and Simmons
1963) and the TAGGIT tagger (Greene and Rubin 1971), which began with a high-recall list of all possible tags for words and then used high-precision rules to filter likely tags based on context.
In the next section we walk through an example of our system applied to a simple made-up text. We then describe our model in detail and test its performance on three different corpora widely used in previous work for the evaluation of coreference resolution. We show that our model outperforms the state-of-the-art on each corpus. Furthermore, in these sections we describe analytic and ablative experiments demonstrating that both aspects of our algorithm (the entity-centric aspect that allows the global sharing of features between mentions assigned to the same cluster, and the precision-based ordering of sieves) independently offer significant improvements to coreference. We also perform an error analysis, and discuss the relationship of our work to previous models and to recent hybrid systems that have used our algorithm as a component to resolve coreference in English, Chinese, and Arabic.
2. Walking Through a Sample Coreference Resolution
Before delving into the details of our method, we illustrate the intuition behind our approach with the simple pedagogical example listed in Table 1.
In the mention detection step, the system extracts mentions by inspecting all noun phrases (NP) and pronouns (PRP) (see Section 3.1 for details). In Table 1, this step identifies 11 different mentions and assigns them initially to distinct entities (entity id and mention id in each step are marked by superscript and subscript). This component also extracts mention attributes—for example, John:{ne:person}, and A girl:{gender:female, number:singular}. These mentions form the input for the following sequence of sieves.
The first coreference resolution sieve (the speaker or quotation sieve) matches pronominal mentions that appear in a quotation block to the corresponding speaker. In general, in all the coreference resolution sieves we traverse mentions left-to-right in a given document (see Section 3.2.1). The first match for this model is my9, which is merged with John10 into the same entity (entity id: 9). This illustrates the advantages of our incremental approach: by assigning a higher priority to the quotation sieve, we avoid linking my9 with A girl5, a common mistake made by generic coreference models, since anaphoric candidates (especially in subject position) are generally preferred to cataphoric ones (Hobbs 1978).
The next sieve searches for anaphoric antecedents that have the exact same string as the mention under consideration. This component resolves the tenth mention, John10, by linking it with John1. When searching for antecedents, we sort candidates in the same sentential clause from left to right, and we prefer sentences that are closer to the mention under consideration (see Section 3.2.2 for details). Thus, the sorted list of candidates for John10 is It7, My favorite8, My9, A girl5, the song6, He3, a new song4, John1, a musician2. The algorithm stops as soon as a matching antecedent is encountered. In this case, the algorithm finds John1 and does not inspect a musician2.
The relaxed string match sieve searches for mentions satisfying a looser set of string matching constraints than exact match (details in Section 3.3.3), but makes no change because there are no such mentions. The precise constructs sieve searches for several high-precision syntactic constructs, such as appositive relations and predicate nominatives. In this example, there are two predicate nominative relations in the first and fourth sentences, so this component clusters together John1 and a musician2, and It7 and my favorite8.
Table 1
A sample run-through of our approach, applied to a made-up sentence. In each step we mark in bold the affected mentions; superscript and subscript indicate entity id and mention id.
(Notation: [mention]^e_m marks entity id e and mention id m, rendering the table's superscripts and subscripts in plain text.)

Input:
John is a musician. He played a new song. A girl was listening to the song. "It is my favorite," John said to her.

Mention Detection:
[John]^1_1 is [a musician]^2_2. [He]^3_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^6_6. "[It]^7_7 is [[my]^9_9 favorite]^8_8," [John]^10_10 said to [her]^11_11.

Speaker Sieve:
[John]^1_1 is [a musician]^2_2. [He]^3_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^6_6. "[It]^7_7 is [[my]^9_9 favorite]^8_8," [John]^9_10 said to [her]^11_11.

String Match:
[John]^1_1 is [a musician]^2_2. [He]^3_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^6_6. "[It]^7_7 is [[my]^1_9 favorite]^8_8," [John]^1_10 said to [her]^11_11.

Relaxed String Match:
(no change)

Precise Constructs:
[John]^1_1 is [a musician]^1_2. [He]^3_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^6_6. "[It]^7_7 is [[my]^1_9 favorite]^7_8," [John]^1_10 said to [her]^11_11.

Strict Head Match A:
[John]^1_1 is [a musician]^1_2. [He]^3_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^4_6. "[It]^7_7 is [[my]^1_9 favorite]^7_8," [John]^1_10 said to [her]^11_11.

Strict Head Match B,C:
(no change)

Proper Head Noun Match:
(no change)

Relaxed Head Match:
(no change)

Pronoun Match:
[John]^1_1 is [a musician]^1_2. [He]^1_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^4_6. "[It]^4_7 is [[my]^1_9 favorite]^4_8," [John]^1_10 said to [her]^5_11.

Post Processing:
[John]^1_1 is a musician. [He]^1_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^4_6. "[It]^4_7 is [my]^1_9 favorite," [John]^1_10 said to [her]^5_11.

Final Output:
[John]^1_1 is a musician. [He]^1_3 played [a new song]^4_4. [A girl]^5_5 was listening to [the song]^4_6. "[It]^4_7 is [my]^1_9 favorite," [John]^1_10 said to [her]^5_11.
The next four sieves (strict head match A–C, proper head noun match) cluster mentions that have the same head word with various other constraints. a new song4 and the song6 are linked in this step.
The last resolution component in this example addresses pronominal coreference resolution. The three pronouns in this text, He3, It7, and her11, are linked to their
compatible antecedents based on their attributes, such as gender, number, and animacy. In this step we assign He3 and her11 to entities 1 and 5, respectively (same gender), and It7 to entity 4, which represents an inanimate concept.
The system concludes with a post-processing component, which implements
corpus-specific rules. For example, to align our output with the OntoNotes annotation standard, we remove mentions assigned to singleton clusters (i.e., entities with a single mention in text) and links obtained through predicate nominative patterns. Note that even though we might remove some coreference links in this step, these links serve an important purpose in the algorithm flow, as they allow new features to be discovered for the corresponding entity and shared between its mentions. See Section 3.2.3 for details on feature extraction.
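The singleton-removal part of this post-processing step can be sketched as follows; the dict-of-sets cluster representation is an assumed simplification, not the system's actual data structure:

```python
# Sketch of the OntoNotes-style singleton removal: entities mentioned only
# once are dropped from the final output, even though their links may have
# helped share features during resolution.

def remove_singletons(clusters):
    """clusters: dict mapping entity id -> set of mention ids."""
    return {eid: ms for eid, ms in clusters.items() if len(ms) > 1}
```

For example, `remove_singletons({1: {1, 10}, 2: {2}})` keeps only entity 1.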
3. The Algorithm
We first describe our mention detection stage, then introduce the general architecture of the coreference stage, followed by a detailed examination of the coreference sieves. In describing the architecture, we will sometimes find it helpful to discuss the precision of individual components, drawn from our later experiments in Section 4.
3.1 Mention Detection
As we suggested earlier, the recall of our mention detection component is more important than its precision. This is because for the OntoNotes corpus and for many practical applications, any missed mentions are guaranteed to affect the final score by decreasing recall, whereas spurious mentions may not impact the overall score if they are assigned to singleton clusters, because singletons are deleted during post-processing. Our mention detection algorithm implements this intuition via a series of simple yet broad-coverage heuristics that take advantage of syntax, named entity recognition, and manually written patterns. Note that these patterns are built based on the OntoNotes annotation guidelines, because mention detection in general depends heavily on the annotation policy.
We start by marking all NPs, pronouns, and named entity mentions (see the named entity tagset in Appendix A) that were not previously marked (i.e., they appear as modifiers in other NPs) as candidate mentions. From this set of candidates we remove the mentions that match any of the following exclusion rules:
1. We remove a mention if a larger mention with the same head word exists (e.g., we remove The five insurance companies in The five insurance companies approved to be established this time).
2. We discard numeric entities such as percents, money, cardinals, and quantities (e.g., 9%, $10,000, Tens of thousands, 100 miles).
3. We remove mentions with partitive or quantifier expressions (e.g., a total of 177 projects, none of them, millions of people).1
1 These are NPs with the word ‘of’ preceded by one of nine quantifiers or 34 partitives.
4. We remove pleonastic it pronouns, detected using a small set of patterns (e.g., It is possible that …, It seems that …, It turns out …). The complete set of patterns, using the tregex2 notation, is shown in Appendix B.
5. We discard adjectival forms of nations or nationality acronyms (e.g., American, U.S., U.K.), following the OntoNotes annotation guidelines.
6. We remove stop words from the following list determined by error analysis on mention detection: there, ltd., etc, ’s, hmm.
Note that some rules change depending on the corpus we use for evaluation. In particular, adjectival forms of nations are valid mentions in the Automated Content Extraction (ACE) corpus (Doddington et al. 2004), thus they would not be removed when processing this corpus.
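A few of these exclusion rules can be sketched as below. The word lists and patterns are simplified stand-ins for the paper's full sets (Appendices A and B), and the pleonastic-it check covers only a couple of the described patterns:

```python
# Illustrative (simplified) versions of mention-detection exclusion rules.
import re

QUANTIFIERS = {"none", "all", "some", "total", "millions", "tens"}   # subset
STOP_MENTIONS = {"there", "ltd.", "etc", "'s", "hmm"}                # rule 6
PLEONASTIC_IT = re.compile(r"^it (is|seems|turns out)\b", re.IGNORECASE)

def is_excluded(mention_text, sentence):
    text = mention_text.lower()
    if text in STOP_MENTIONS:                                  # rule 6: stop words
        return True
    if re.fullmatch(r"[\d$%,. ]+", mention_text):              # rule 2: numeric
        return True
    head, _, rest = text.partition(" of ")
    if rest and head.split() and head.split()[-1] in QUANTIFIERS:
        return True                                            # rule 3: partitives
    if text == "it" and PLEONASTIC_IT.match(sentence.lower()):
        return True                                            # rule 4 (crude)
    return False
```

Under these simplified lists, 9%, millions of people, and pleonastic it are filtered while ordinary names pass through.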
3.2 Resolution Architecture
Traditionally, coreference resolution is implemented as a quadratic problem, where potential coreference links between any two mentions in a document are considered. This is not ideal, however, as it increases both the likelihood of errors and the processing time. In this article, we argue that it is better to cautiously construct high-quality mention clusters,3 and use an entity-centric model that allows the sharing of information across these incrementally constructed clusters. We achieve these goals by: (a) aggressively filtering the search space for which mention to consider for resolution (Section 3.2.1) and which antecedents to consider for a given mention (Section 3.2.2), and (b) constructing features from partially built mention clusters (Section 3.2.3).
3.2.1 Mention Selection in a Given Sieve. Recall that our model is a battery of resolution sieves applied sequentially. Thus, in each given sieve, we have partial mention clusters produced by the previous model. We exploit this information for mention selection, by considering only mentions that are currently first in textual order in their cluster. For example, given the following ordered list of mentions, {m1^1, m2^2, m3^2, m4^3, m5^1, m6^2}, where the superscript indicates cluster id, our model will attempt to resolve only m2^2 and m4^3 (m1^1 is not resolved because it is the first mention in a text). These two are the only mentions that currently appear first in their respective clusters and have potential antecedents in the document. The motivation behind this heuristic is two-fold. First, early mentions are usually better defined than subsequent ones, which are likely to have fewer modifiers or be pronouns (Fox 1993). Because several of our models use features extracted from NP modifiers, it is important to prioritize mentions that include such information. Second, by definition, first mentions appear closer to the beginning of the document, hence there are fewer antecedent candidates to select from, and thus fewer opportunities to make a mistake.
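This selection heuristic can be sketched directly (the list-of-cluster-ids encoding is an assumption for illustration):

```python
# Sketch of mention selection: a sieve only tries to resolve mentions that are
# currently first (in textual order) within their cluster, and the document's
# very first mention is skipped because it has no possible antecedent.

def mentions_to_resolve(cluster_of):
    """cluster_of: list mapping mention index (textual order) -> cluster id."""
    seen, eligible = set(), []
    for i, cid in enumerate(cluster_of):
        if cid not in seen:
            seen.add(cid)
            if i > 0:
                eligible.append(i)
    return eligible
```

For the example clusters above, `mentions_to_resolve([1, 2, 2, 3, 1, 2])` returns `[1, 3]`, i.e., the second and fourth mentions.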
We further prune the search space using a simple model of discourse salience. We disable coreference for mentions appearing first in their corresponding clusters that: (a) are or start with indefinite pronouns (e.g., some, other), (b) start with indefinite articles
2 http://nlp.stanford.edu/software/tregex.shtml.
3 In this article we use the terms mention cluster and entity interchangeably. We prefer the former when discussing technical aspects of our approach and the latter in a more theoretical context.
(e.g., a, an), or (c) are bare plurals. One exception to (a) and (b) is the model deployed in the Exact String Match sieve, which only links mentions if their entire extents match exactly (see Section 3.3.2). This model is triggered for all nominal mentions regardless of discourse salience, because it is possible that indefinite mentions are repeated in a document when concepts are discussed but not instantiated, e.g., a sports bar in the following:
Hanlon, a longtime Broncos fan, thinks it is the perfect place for a sports bar and has put up a blue-and-orange sign reading, “Wanted Broncos Sports Bar On This Site.” . . . In a Nov. 28 letter, Proper states “while we have no objection to your advertising the property as a location for a sports bar, using the Broncos’ name and colors gives the false impression that the bar is or can be affiliated with the Broncos.”
3.2.2 Antecedent Selection for a Given Mention. Given a mention mi, each model may either decline to propose a solution (in the hope that one of the subsequent models will solve it) or deterministically select a single best antecedent from a list of previous mentions m1, …, mi−1. We sort candidate antecedents using syntactic information provided by the Stanford parser. Candidates are sorted using the following criteria:
• In a given sentential clause (i.e., parser constituents whose label starts with S), candidates are sorted using a left-to-right breadth-first traversal of the corresponding syntactic constituent (Hobbs 1978). Figure 2 shows an example of candidate ordering based on this traversal. The left-to-right ordering favors subjects, which tend to appear closer to the beginning of the sentence and are more probable antecedents. The breadth-first traversal promotes syntactic salience by preferring noun phrases that are closer to the top of the parse tree (Haghighi and Klein 2009).
• If the sentence containing the anaphoric mention contains multiple clauses, we repeat the previous heuristic separately in each S* constituent, starting with the one containing the mention.
• Clauses in previous sentences are sorted based on their textual proximity to the anaphoric mention.

Figure 2
Example of left-to-right breadth-first tree traversal. The numbers indicate the order in which the NPs are visited.
The sorting of antecedent candidates is important because our algorithm stops at the first match. Thus, low-quality sorting negatively impacts the actual coreference links created.
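The within-clause ordering can be sketched as a breadth-first traversal; the nested-tuple tree encoding `(label, child, ...)` is an illustrative stand-in for real parser output:

```python
# Sketch of left-to-right breadth-first NP collection over a parse tree:
# NPs nearer the root (e.g., subjects) are visited before deeply embedded ones.
from collections import deque

def nps_breadth_first(tree):
    order, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):          # a leaf token
            continue
        if node[0] == "NP":
            order.append(node)
        queue.extend(node[1:])             # children, left to right
    return order
```

For a sentence like "John saw the owner of the dog", the subject NP is visited first, and the NP embedded in the PP comes last.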
This antecedent selection algorithm applies to all the coreference resolution sieves described in this article, with the exception of the speaker identification sieve (Section 3.3.1) and the sieve that applies appositive and predicate nominative patterns (Section 3.3.4).
3.2.3 Feature Sharing in the Entity-Centric Model. In a significant departure from previous work, each model in our framework gets (possibly incomplete) entity information for each mention from the clusters constructed by the earlier coreference models. In other words, each mention mi may already be assigned to an entity Ej containing a set of mentions: Ej = {mj1, . . . , mjk}; mi ∈ Ej. Unassigned mentions are unique members of their own cluster. We use these clusters to share information between same-entity mentions.
This is especially important for pronominal coreference resolution (discussed later in this section), which can be severely affected by missing attributes (which introduce precision errors because incorrect antecedents are selected due to missing information) and incorrect attributes (which introduce recall errors because correct links are not generated due to attribute mismatch between mention and antecedent). To address this issue, we perform a union of all mention attributes (e.g., number, gender, animacy) for a given entity and share the result with all corresponding mentions. If attributes from different mentions contradict each other we maintain all variants. For example, our naive number detection assigns singular to the mention a group of students and plural to five students. When these mentions end up in the same cluster, the resulting number attribute becomes the set {singular, plural}. Thus this cluster can later be merged with both singular and plural pronouns.
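The attribute union can be sketched as follows (dict-based per-mention attributes are an assumed representation):

```python
# Sketch of entity-level attribute sharing: attribute values from all mentions
# in a cluster are unioned, and contradictory values are all kept as a set
# rather than resolved, so the entity stays mergeable with either variant.

def entity_attributes(mentions):
    """mentions: list of per-mention attribute dicts, e.g. {"number": "singular"}."""
    shared = {}
    for m in mentions:
        for attr, value in m.items():
            shared.setdefault(attr, set()).add(value)
    return shared
```

For the example above, `entity_attributes([{"number": "singular"}, {"number": "plural"}])` yields `{"number": {"singular", "plural"}}`.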
3.3 Coreference Resolution Sieves
We describe next the sequence of coreference models proposed in this article. Table 2 lists all these models in the order in which they are applied. We discuss their individual contribution to the overall system later, in Section 4.4.3.
Table 2
Sequence of sieves as they are applied in the overall model.

Pass 1: Speaker Identification Sieve
Pass 2: Exact String Match Sieve
Pass 3: Relaxed String Match Sieve
Pass 4: Precise Constructs Sieve (e.g., appositives)
Passes 5–7: Strict Head Match Sieves A–C
Pass 8: Proper Head Noun Match Sieve
Pass 9: Relaxed Head Match Sieve
Pass 10: Pronoun Resolution Sieve
3.3.1 Pass 1 – Speaker Identification. This sieve matches speakers to compatible pronouns, using shallow discourse understanding to handle quotations and conversation transcripts, following the early work of Baldwin (1995, 1997). We begin by identifying speakers within text. In non-conversational text, we use a simple heuristic that searches for the subjects of reporting verbs (e.g., say) in the same sentence or neighboring sentences to a quotation. In conversational text, speaker information is provided in the data set.
The extracted speakers then allow us to implement the following sieve heuristics:

• ⟨I⟩s4 assigned to the same speaker are coreferent.
• ⟨you⟩s with the same speaker are coreferent.
• The speaker and ⟨I⟩s in her text are coreferent.

Thus for example I, my, and she in the following sentence are coreferent: "[I] voted for [Nader] because [he] was most aligned with [my] values," [she] said.

In addition to this sieve, we impose speaker constraints on decisions made by subsequent sieves:

• The speaker and a mention which is not ⟨I⟩ in the speaker's utterance cannot be coreferent.
• Two ⟨I⟩s (or two ⟨you⟩s, or two ⟨we⟩s) assigned to different speakers cannot be coreferent.
• Two different person pronouns by the same speaker cannot be coreferent.
• Nominal mentions cannot be coreferent with ⟨I⟩, ⟨you⟩, or ⟨we⟩ in the same turn or quotation.
• In conversations, ⟨you⟩ can corefer only with the previous speaker.

These constraints cause [my] and [he] to not be coreferent in the earlier example (due to the third constraint).

4 We define ⟨I⟩ as I, my, me, or mine, ⟨we⟩ as first person plural pronouns, and ⟨you⟩ as second person pronouns.

3.3.2 Pass 2 – Exact Match. This model links two mentions only if they contain exactly the same extent text, including modifiers and determiners (e.g., [the Shahab 3 ground-ground missile] and [the Shahab 3 ground-ground missile]). As expected, this model is very precise, with a precision over 90% B3 (see Table 8 in Section 4.4.3).

3.3.3 Pass 3 – Relaxed String Match. This sieve considers two nominal mentions as coreferent if the strings obtained by dropping the text following their head words (such as relative clauses and PP and participial postmodifiers) are identical (e.g., [Clinton] and [Clinton, whose term ends in January]).

3.3.4 Pass 4 – Precise Constructs. This model links two mentions if any of the following conditions are satisfied:

• Appositive – the two nominal mentions are in an appositive construction (e.g., [Israel's Deputy Defense Minister], [ ], said . . . ). We use the standard Haghighi and Klein (2009) definition to detect appositives: third children of a parent NP whose expansion begins with (NP , NP), when there is not a conjunction in the expansion.
• Predicate nominative – the two mentions (nominal or pronominal) are in a copulative subject–object relation (e.g., [The New York-based College Board] is [a nonprofit organization that administers the SATs and promotes higher education]) (Poon and Domingos 2008).
• Role appositive – the candidate antecedent is headed by a noun and appears as a modifier in an NP whose head is the current mention (e.g., [[actress] ]). This feature is inspired by Haghighi and Klein (2009), who triggered it only if the mention is labeled as a person by the Stanford named entity recognizer (NER). We constrain this heuristic more in our work: we allow this feature to match only if (a) the mention is labeled as a person, (b) the antecedent is animate (we detail animacy detection in Section 3.3.9), and (c) the antecedent's gender is not neutral.
• Relative pronoun – the mention is a relative pronoun that modifies the head of the antecedent NP (e.g., [the finance street [which] has already formed in the Waitan district]).
• Acronym – both mentions are tagged as NNP and one of them is an acronym of the other (e.g., [Agence France Presse] . . . [AFP]). Our acronym detection algorithm marks a mention as an acronym of another if its text equals the sequence of upper case characters in the other mention. The algorithm is simple, but our error analysis suggests it nonetheless does not lead to errors.
• Demonym5 – one of the mentions is a demonym of the other (e.g., [Israel] . . . [Israeli]). For demonym detection we use a static list of countries and their gentilic forms from Wikipedia.6

All of these constructs are very precise; we show in Section 4.4.3 that the B3 precision of the overall model after adding this sieve is approximately 90%. In the OntoNotes corpus, this sieve does not enhance recall significantly, mainly because appositions and predicate nominatives are not annotated in this corpus (they are annotated in ACE). Regardless of annotation standard, however, this sieve is important because it grows entities with high quality elements, which has a significant impact on the entity's features (as discussed in Section 3.2.3).

5 Demonym is not annotated in OntoNotes but we keep it in the system.
6 http://en.wikipedia.org/wiki/List of adjectival and demonymic forms of place names.

3.3.5 Pass 5 – Strict Head Match. Linking a mention to an antecedent based on the naive matching of their head words generates many spurious links because it completely ignores possibly incompatible modifiers (Elsner and Charniak 2010). For example, Yale University and Harvard University have similar head words, but they are obviously different entities. To address this issue, this pass implements several constraints that must all be matched in order to yield a link:

• Entity head match – the mention head word matches any head word of mentions in the antecedent entity. Note that this feature is actually more relaxed than naive head matching in a pair of mentions because here it is satisfied when the mention's head matches the head of any mention in the candidate entity. We constrain this feature by enforcing a conjunction with the following features.
• Word inclusion – all the non-stop7 words in the current entity to be solved are included in the set of non-stop words in the antecedent entity. This heuristic exploits the discourse property that states that it is uncommon to introduce novel information in later mentions (Fox 1993). Typically, mentions of the same entity become shorter and less informative as the narrative progresses. For example, based on this constraint, the model correctly clusters together the two mentions in the following text:

. . . intervene in the [Florida Supreme Court]'s move . . . does look like very dramatic change made by [the Florida court]

and avoids clustering the two mentions in the following text:

The pilot had confirmed . . . he had turned onto [the correct runway] but pilots behind him say he turned onto [the wrong runway].

• Compatible modifiers only – the mention's modifiers are all included in the modifiers of the antecedent candidate.
• Not i-within-i – the two mentions are not in an i-within-i construct, that is, one is not a child NP in the other's NP constituent.

This pass continues to maintain high precision (over 86% B3) while improving recall significantly (approximately 4.5 B3 points).
3.3.6 Passes 6 and 7 – Variants of Strict Head Match. Sieves 6 and 7 are different relaxations of the feature conjunction introduced in Pass 5, that is, Pass 6 removes the compatible modifiers only feature, and Pass 7 removes the word inclusion constraint. All in all, these two passes yield an improvement of 0.9 B3 F1 points, due to recall improvements. Table 8 in Section 4.4.3 shows that the word inclusion feature is more precise than compatible modifiers only, but the latter has better recall.
3.3.7 Pass 8 – Proper Head Word Match. This sieve marks two mentions headed by proper nouns as coreferent if they have the same head word and satisfy the following constraints:
r r
r
7 Our stopword list includes person titles as well.
Compatible modifiers only – the mention’s modifiers are all included in the modifiers of the antecedent candidate. This feature models the same discourse property as the previous feature, but it focuses on the two individual mentions to be linked, rather than their corresponding entities. For this feature we only use modifiers that are nouns or adjectives.
Not i-within-i – the two mentions are not in an i-within-i construct, that is, one cannot be a child NP in the other’s NP constituent (Chomsky 1981).
Not i-within-i – same as in Pass 5.
No location mismatches – the modifiers of two mentions cannot contain different location named entities, other proper nouns, or spatial modifiers. For example, [Lebanon] and [southern Lebanon] are not coreferent.
No numeric mismatches – the second mention cannot have a number that does not appear in the antecedent, e.g., [people] and [around 200 people] are not coreferent.
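Two of the string-level tests above are simple enough to sketch directly. The snippet below is an illustrative reconstruction, not the authors' code: `is_acronym` mirrors the Pass 4 acronym heuristic (the candidate must equal the sequence of upper-case characters of the other mention), and `has_numeric_mismatch` mirrors the Pass 8 numeric constraint. Both function names are ours, and real mentions would be tokenized parse-tree spans rather than raw strings.

```python
import re

def is_acronym(candidate, other):
    # Pass 4 heuristic: the candidate acronym must equal the sequence of
    # upper-case characters appearing in the other mention.
    return candidate == "".join(ch for ch in other if ch.isupper())

def has_numeric_mismatch(antecedent, mention):
    # Pass 8 constraint: the (later) mention may not contain a number that
    # is absent from the antecedent.
    mention_nums = set(re.findall(r"\d+", mention))
    antecedent_nums = set(re.findall(r"\d+", antecedent))
    return not mention_nums <= antecedent_nums
```

On the paper's own examples, `is_acronym("AFP", "Agence France Presse")` holds, and `has_numeric_mismatch("people", "around 200 people")` correctly blocks the link.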
3.3.8 Pass 9 – Relaxed Head Match. This pass relaxes the entity head match heuristic by allowing the mention head to match any word in the antecedent entity. For example, this heuristic matches the mention Sanders to an entity containing the mentions {Sauls, the judge, Circuit Judge N. }. To maintain high precision, this pass requires that both mention and antecedent be labeled as named entities and that their types coincide. Furthermore, this pass implements a conjunction of the given features with word inclusion and not i-within-i. This pass yields an improvement of less than 0.4 points on most metrics.
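A minimal sketch of this relaxed matching (our illustration: an entity is represented as a list of mention strings split on whitespace, and the example names below are hypothetical, not from the corpus):

```python
def relaxed_head_match(mention_head, antecedent_entity):
    # Pass 9: the mention's head word may match ANY word of ANY mention
    # already clustered into the candidate antecedent entity.
    return any(mention_head in mention.split() for mention in antecedent_entity)
```

In the full system this test is additionally gated by the named-entity-type and word-inclusion constraints described above.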
3.3.9 Pass 10 – Pronominal Coreference Resolution. With one exception (Pass 1), all the previous coreference models focus on nominal coreference resolution. It would be incorrect to say that our framework ignores pronominal coreference in the previous passes, however. In fact, the previous models prepare the stage for pronominal coreference by constructing precise entities with shared mention attributes. These are crucial factors for pronominal coreference.
We implement pronominal coreference resolution using an approach standard for many decades: enforcing agreement constraints between the coreferent mentions. We use the following attributes for these constraints:
• Number – we assign number attributes based on: (a) a static list for pronouns; (b) NER labels: mentions marked as a named entity are considered singular with the exception of organizations, which can be both singular and plural; (c) part of speech tags: NN*S tags are plural and all other NN* tags are singular; and (d) a static dictionary from Bergsma and Lin (2006).

• Gender – we assign gender attributes from static lexicons from Bergsma and Lin (2006), and Ji and Lin (2009).

• Person – we assign person attributes only to pronouns. We do not enforce this constraint when linking two pronouns, however, if one appears within quotes. This is a simple heuristic for speaker detection (e.g., I and she point to the same person in "[I] voted my conscience," [she] said).

• Animacy – we set animacy attributes using: (a) a static list for pronouns; (b) NER labels (e.g., PERSON is animate whereas LOCATION is not); and (c) a dictionary bootstrapped from the Web (Ji and Lin 2009).

• NER label – from the Stanford NER.

• Pronoun distance – sentence distance between a pronoun and its antecedent cannot be larger than 3.

When we cannot extract an attribute, we set the corresponding value to unknown and treat it as a wildcard—that is, it can match any other value. As expected, pronominal coreference resolution has a big impact on the overall score (e.g., 5 B3 F1 points in the development partition of OntoNotes).

3.4 Post Processing

This step implements several transformations required to guarantee that our output matches the annotation specification in the corresponding corpus. Currently this step is deployed only for the OntoNotes corpus and it contains the following two operations:

• We discard singleton clusters.

• We discard the shorter mentions in appositive patterns and the mentions that appear later in text in copulative relations. For example, in the text [[ ], the general manager] or [Mr. Savoca] had been [a consultant. . . ], the mentions and a consultant. . . are removed in this stage.

4. Experimental Results

We start this section with overall results on three corpora widely used for the evaluation of coreference resolution systems. We continue with a series of ablative experiments that analyze the contribution of each aspect of our approach and conclude with error analysis, which highlights cases currently not solved by our approach.

4.1 Corpora

We used the following corpora for development and formal evaluation:

• OntoNotes-Dev – development partition of OntoNotes v4.0 provided in the CoNLL-2011 shared task (Pradhan et al. 2011).

• OntoNotes-Test – test partition of OntoNotes v4.0 provided in the CoNLL-2011 shared task.

• ACE2004-Culotta-Test – partition of the ACE 2004 corpus reserved for testing by several previous studies (Culotta et al. 2007; Bengtson and Roth 2008; Haghighi and Klein 2009).

• ACE2004-nwire – newswire subset of the ACE 2004 corpus, utilized by Poon and Domingos (2008) and Haghighi and Klein (2009) for testing.

• MUC6-Test – test corpus from the sixth Message Understanding Conference (MUC-6) evaluation.

The corpora statistics are shown in Table 3. We used the first corpus (OntoNotes-Dev) for development and all others for the formal evaluation. We parsed all documents in the ACE and MUC corpora using the Stanford parser (Klein and Manning 2003) and the Stanford NER (Finkel, Grenager, and Manning 2005). We used the provided parse trees and named entity labels (not gold) in the OntoNotes corpora to facilitate the comparison with other systems.

Table 3
Corpora statistics.

Corpora                # Documents   # Sentences   # Words   # Entities   # Mentions
OntoNotes-Dev              303           6,894       136K       3,752       14,291
OntoNotes-Test             322           8,262       142K       3,926       16,291
ACE2004-Culotta-Test       107           1,993        33K       2,576        5,455
ACE2004-nwire              128           3,594        74K       4,762       11,398
MUC6-Test                   30             576        13K         496        2,136
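Returning to the pronominal sieve of Section 3.3.9, the attribute-agreement check with unknown treated as a wildcard can be sketched as follows (our illustration; the attribute names and dictionary representation are assumptions, not the authors' data structures):

```python
UNKNOWN = "unknown"

def attributes_agree(mention_attrs, antecedent_attrs):
    # Enforce agreement on every attribute both mentions carry (number,
    # gender, person, animacy, NER label); 'unknown' matches any value.
    shared = mention_attrs.keys() & antecedent_attrs.keys()
    return all(
        mention_attrs[k] == UNKNOWN
        or antecedent_attrs[k] == UNKNOWN
        or mention_attrs[k] == antecedent_attrs[k]
        for k in shared
    )
```

Because attributes are shared across all mentions of an entity, a single informative mention (e.g., a name with known gender) constrains every later pronoun link to that entity.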
4.2 Evaluation Metrics
We use five evaluation metrics widely used in the literature. B3 and CEAF have implementation variations in how they take system mentions into account. We followed the same implementation as used in the CoNLL-2011 shared task.
• MUC (Vilain et al. 1995) – link-based metric which measures how many predicted and gold mention clusters need to be merged to cover the gold and predicted clusters, respectively:

R = Σi (|Gi| − |p(Gi)|) / Σi (|Gi| − 1),  P = Σi (|Si| − |p(Si)|) / Σi (|Si| − 1),  F1 = 2PR / (P + R)

(Gi: a gold mention cluster, p(Gi): partitions of Gi; Si: a system mention cluster, p(Si): partitions of Si)

• B3 (Bagga and Baldwin 1998) – mention-based metric which measures the proportion of overlap between predicted and gold mention clusters for a given mention. When Gmi is the gold cluster of mention mi and Smi is the system cluster of mention mi,

R = Σi |Gmi ∩ Smi| / |Gmi|,  P = Σi |Gmi ∩ Smi| / |Smi|,  F1 = 2PR / (P + R)

• CEAF (Constrained Entity Aligned F-measure) (Luo 2005) – metric based on entity alignment. For the best alignment g∗ = argmax g∈Gm Φ(g) (Φ(g): total similarity of g, a one-to-one mapping from G: gold mention clusters to S: system mention clusters),

R = Φ(g∗) / Σi φ(Gi, Gi),  P = Φ(g∗) / Σi φ(Si, Si),  F1 = 2PR / (P + R)

If we use φ(G, S) = |G ∩ S|, it is called mention-based CEAF (CEAF-φ3); if we use φ(G, S) = 2|G ∩ S| / (|G| + |S|), it is called entity-based CEAF (CEAF-φ4).

• BLANC (BiLateral Assessment of NounPhrase Coreference) (Recasens and Hovy 2011) – metric applying the Rand index (Rand 1971) to coreference to deal with the imbalance between singletons and coreferent mentions by considering both coreference and non-coreference links:

Pc = rc / (rc + wc),  Pn = rn / (rn + wn),  Rc = rc / (rc + wn),  Rn = rn / (rn + wc),
Fc = 2PcRc / (Pc + Rc),  Fn = 2PnRn / (Pn + Rn),  BLANC = (Fc + Fn) / 2

(rc: the number of correct coreference links, wc: the number of incorrect coreference links, rn: the number of correct non-coreference links, wn: the number of incorrect non-coreference links)

• CoNLL F1 – average of MUC, B3, and CEAF-φ4 F1. This was the official metric in the CoNLL-2011 shared task (Pradhan et al. 2011).

4.3 Experimental Results

Tables 4 and 5 compare the performance of our system with other state-of-the-art systems in the CoNLL-2011 shared task and the ACE and MUC corpora, respectively. For the CoNLL-2011 shared task we report results in the closed track, which did not allow the use of external resources, and the open track, which allowed any other
Table 4
Performance of the top systems in the CoNLL-2011 shared task. All these systems use automatically detected mentions. We report results for both the closed and the open tracks, which allowed the use of resources not provided by the task organizers. MD indicates mention detection, and gold boundaries indicate that mention boundary information is given.
[The table body reports R/P/F1 for mention detection and for the MUC, B3, CEAF-φ4, and BLANC metrics, plus the CoNLL F1 average, in three blocks: closed track, open track, and closed track with gold boundaries; the per-system scores are not reproduced here.]
resources. For the closed track, the organizers provided dictionaries for gender and number information, in addition to parse trees and named entity labels (Pradhan et al. 2011). For the open track, we used the following additional resources: (a) a hand-built list of genders of first names that we created, incorporating frequent names from census lists and other sources (Vogel and Jurafsky 2012); (b) an animacy list (Ji and Lin 2009); (c) a country and state gazetteer; and (d) a demonym list. These resources were also used for the results reported in Table 5.
A significant difference between Tables 4 and 5 is that in the former (other than its last block) we used predicted mentions (detected with the algorithm described in Section 3.1), whereas in the latter we used gold mentions. The only reason for this distinction is to facilitate comparison with previous work (all systems listed in Table 5 used gold mention boundaries).
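As a concrete illustration of the mention-based B3 metric from Section 4.2, a minimal scorer can be sketched as follows (our sketch only: it scores over gold mentions and treats a mention missing from the system output as a singleton; production scorers such as the official CoNLL-2011 scorer handle system mentions with additional conventions):

```python
def b3_scores(gold_clusters, system_clusters):
    # Map each mention to the cluster (as a frozenset) that contains it.
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    sys_of = {m: frozenset(c) for c in system_clusters for m in c}
    mentions = sorted(gold_of)
    # Per-mention overlap ratios, averaged over the gold mentions.
    recall = sum(
        len(gold_of[m] & sys_of.get(m, frozenset([m]))) / len(gold_of[m])
        for m in mentions
    ) / len(mentions)
    precision = sum(
        len(gold_of[m] & sys_of.get(m, frozenset([m])))
        / len(sys_of.get(m, frozenset([m])))
        for m in mentions
    ) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```

For gold clusters {a, b} and {c} against a single system cluster {a, b, c}, this yields recall 1.0 and precision 5/9, reflecting B3's per-mention penalty for over-merging.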
The two tables show that, regardless of evaluation corpus and methodology, our system generally outperforms the previous state of the art. In the CoNLL shared task,
our system scores 1.8 CoNLL F1 points higher than the next system in the closed track and 2.6 points higher than the second-ranked system in the open track. The Chang et al. (2011) system has marginally higher B3 and BLANC F1 scores, but does not outperform our model on the other two metrics or on the average F1 score. Table 5 shows that our model has higher B3 F1 scores than all the other models in the two ACE corpora. The model of Haghighi and Klein (2009) minimally outperforms ours, by 0.6 B3 F1 points, in the MUC corpus. All in all, these results show that our approach compares favorably with a wide range of models that cover most aspects deemed important for coreference resolution, including supervised learning using rich feature sets (Sapena, Padró, and Turmo 2011; Chang et al. 2011), joint inference using spectral clustering (Cai, Mujdricza-Maydt, and Strube 2011), and deterministic rule-based models (Haghighi and Klein 2009). We discuss in more detail the similarities and differences between our approach and previous work in Section 6.
Table 4 shows that using additional