Joint Named Entity Recognition and Disambiguation with Deep Neural Networks
Abstract
Current entity linking methods typically first apply a named entity recognition (NER) model to extract a named
entity and classify it into a predefined category, and then apply an entity disambiguation model to link the named
entity to a corresponding entity in the reference knowledge base. However, these methods ignore the
interrelations between the two tasks. Our work jointly optimizes deep neural models for both NER and entity
disambiguation. For the entity disambiguation task, our deep neural model combines a recursive neural network
and a convolutional neural network with an attention mechanism. The model compares the similarity between a
mention and its candidate entities by simultaneously considering semantic and background information,
including Wikipedia description pages, the contexts in which the mention occurs, and entity typing information.
This work is the first attempt to integrate the NER and entity disambiguation tasks with deep neural networks.
Experiments show that our model effectively leverages the semantic information of context and performs
competitively with conventional approaches.
Keywords: named entity recognition, entity disambiguation, recursive neural network, convolutional neural
network, attention
1 Proposed Method
In the Entity Disambiguation (ED) model, we compute the semantic
similarity between the mention part (context, document,
type) and the candidate entity part (context, document, type).
Figure 1 gives an overview of the model.
Figure 1: The overview of the entity disambiguation model.
Specifically, we model the context part using tree-structured
neural networks (TNNs) with LSTM units (Tree-LSTM), since
Tree-LSTMs are effective at representing the semantic
meaning of long sentences. The document part is encoded
with a CNN augmented with an attention mechanism (ACNN).
Since the document is relatively noisy, the attention
mechanism helps capture the important components of the
input. For the type part, we use a fine-grained type system
that provides a set of types for each mention and entity to
reduce ambiguity.
1.1 Context Representation
We choose Tree-LSTMs to encode context information so
that we can obtain a better representation of text units by
considering the semantic composition of their smaller elements.
In the context representation system, we first obtain the
parse tree of the context and embed the context words as
vectors. We use GloVe vectors [13] as word vectors and the
Stanford Parser to produce parse trees. After the embedding,
we use the constituency Tree-LSTM [11] to generate the
hidden features for each node in the tree. For each node $j$,
$h_{jk}$ and $c_{jk}$ are the hidden state and memory cell of
its $k$th child, respectively.
$$ i_j = \sigma\Big(W^{(i)} x_j + \sum_{k=1}^{N} U^{(i)}_{k} h_{jk} + b^{(i)}\Big) $$
$$ f_{jk} = \sigma\Big(W^{(f)} x_j + \sum_{l=1}^{N} U^{(f)}_{kl} h_{jl} + b^{(f)}\Big) $$
$$ o_j = \sigma\Big(W^{(o)} x_j + \sum_{k=1}^{N} U^{(o)}_{k} h_{jk} + b^{(o)}\Big) $$
$$ u_j = \tanh\Big(W^{(u)} x_j + \sum_{k=1}^{N} U^{(u)}_{k} h_{jk} + b^{(u)}\Big) $$
$$ c_j = i_j \odot u_j + \sum_{k=1}^{N} f_{jk} \odot c_{jk} $$
$$ h_j = o_j \odot \tanh(c_j) $$
Here $N$ is the number of children of node $j$ and $\sigma$ is the sigmoid function.
After running the constituency Tree-LSTM, the hidden states
of the non-leaf nodes serve as context representations,
recursively computed from the representations of their child nodes.
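As a minimal sketch (not the authors' implementation), the binary constituency Tree-LSTM cell can be written in PyTorch as below; following the constituency variant of Tai et al. [11], the input term $W x_j$ is dropped at internal nodes, and the class and dimension choices are illustrative.
```python
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    """Constituency (binary) Tree-LSTM cell: combines the hidden states and
    memory cells of a left and right child into the parent's state."""

    def __init__(self, hidden_size):
        super().__init__()
        # Joint linear map for the input, output and update gates,
        # plus separate forget gates for the left and right child.
        self.iou = nn.Linear(2 * hidden_size, 3 * hidden_size)
        self.f_left = nn.Linear(2 * hidden_size, hidden_size)
        self.f_right = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, left, right):
        (h_l, c_l), (h_r, c_r) = left, right
        children = torch.cat([h_l, h_r], dim=-1)
        i, o, u = self.iou(children).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f_l = torch.sigmoid(self.f_left(children))
        f_r = torch.sigmoid(self.f_right(children))
        c = i * u + f_l * c_l + f_r * c_r   # new memory cell
        h = o * torch.tanh(c)               # new hidden state
        return h, c
```
Applying this cell bottom-up over the parse tree yields the hidden state of each non-leaf node as its context representation.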
1.2 Document and Description Representation
To represent the document information of a mention and the
description page of an entity, we use CNNs with an
attention mechanism to encode the background
information.
Figure 2: The overview of the document and description
representation.
CNNs are good at extracting robust and abstract features
from the input. After the word embedding, we use a CNN to
produce a fixed-length vector for the document. We use
the rectified linear unit as the activation function and combine
the results with max pooling. However, a plain CNN cannot
capture the important components of the document. To
decrease the effect of noise, we add an attention
mechanism that keeps the model focused on the parts that are
important with respect to the mention we need to disambiguate.
Inspired by Zeng et al. [14], for the candidate entity
description part, we use the entity representation as the
attention weight over the output of the convolution layer to
strengthen the key parts of the description page. The entity
representation is computed as the average of the word vectors
of the candidate entity's words. For the mention document part,
we use the candidate entity's description vector as the
attention weight to strengthen the important components of
the mention document.
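The following is a minimal sketch of such an attention-augmented CNN encoder in PyTorch; the way the guide vector (e.g. the averaged entity word vectors) is turned into position-wise attention weights is an illustrative assumption, not the exact formulation used here.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveCNNEncoder(nn.Module):
    """CNN over word vectors, with one attention weight per position derived
    from a guide vector (e.g. the averaged candidate-entity word vectors)."""

    def __init__(self, emb_dim, num_filters=128, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size, padding=1)
        self.guide_proj = nn.Linear(emb_dim, num_filters)

    def forward(self, words, guide):
        # words: (batch, seq_len, emb_dim); guide: (batch, emb_dim)
        feats = F.relu(self.conv(words.transpose(1, 2)))      # (batch, filters, seq_len)
        scores = torch.einsum('bfs,bf->bs', feats, self.guide_proj(guide))
        alpha = torch.softmax(scores, dim=-1)                 # attention over positions
        feats = feats * alpha.unsqueeze(1)                    # re-weight positions
        return feats.max(dim=-1).values                       # max pooling -> fixed-length vector
```
The same encoder can be used for the description page (guided by the entity representation) and for the mention document (guided by the description vector).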
1.3 Type Representation
Fine-grained types provide structured information for an
entity. For the candidate entities, we use FIGER [15] by
Ling and Weld, a publicly available package that contains
112 fine-grained entity types, to encode type information.
FIGER returns a set of types for each candidate entity, and
we compute the probability of each type being relevant to
the entity.
For the mention type, we need the context information to
determine the fine-grained types of each mention. We use
the system of Shimaoka et al. [16], which uses an attentive
neural encoder to predict fine-grained types for mentions.
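As an illustration, a type vector can be built by placing each returned type's relevance probability at its index in the type inventory; the small inventory and the (type, probability) interface below are hypothetical simplifications, not FIGER's actual output format.
```python
import numpy as np

# Hypothetical inventory: in practice this would be the 112 FIGER types.
TYPE_INDEX = {t: i for i, t in enumerate(
    ["/person", "/person/artist", "/location", "/location/city", "/organization"])}

def type_vector(scored_types, num_types=len(TYPE_INDEX)):
    """Map (type, relevance probability) pairs to a dense type vector."""
    vec = np.zeros(num_types, dtype=np.float32)
    for t, prob in scored_types:
        if t in TYPE_INDEX:
            vec[TYPE_INDEX[t]] = prob
    return vec

# e.g. a candidate entity typed as an artist:
entity_type_vec = type_vector([("/person", 0.9), ("/person/artist", 0.7)])
```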
1.4 Entity Disambiguation
Before running the ED system on the mentions, we
need to generate a number of candidate entities. These
candidate entities have prior scores derived from a pre-
computed frequency dictionary [17].
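A minimal sketch of this candidate generation step, assuming the frequency dictionary maps a mention string to anchor counts per entity (the dictionary contents and cutoff below are illustrative, not the actual resource from [17]):
```python
# mention string -> {entity: anchor count}, e.g. built from Wikipedia anchor statistics.
FREQ_DICT = {
    "Washington": {"Washington,_D.C.": 7000, "George_Washington": 2500,
                   "Washington_(state)": 1500},
}

def generate_candidates(mention, top_k=30):
    """Return candidate entities with prior scores (normalized anchor frequencies)."""
    counts = FREQ_DICT.get(mention, {})
    total = sum(counts.values()) or 1
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])[:top_k]
    return [(entity, count / total) for entity, count in ranked]

print(generate_candidates("Washington"))
# [('Washington,_D.C.', 0.636...), ('George_Washington', 0.227...), ('Washington_(state)', 0.136...)]
```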
Similar to Francis-Landau et al. [2], we compute the semantic
similarities between mentions and candidate entities
at multiple granularities. Let $m_c, m_d, m_t, e_c, e_d, e_t$ be the
distributed representations of the mention's context, the mention's
document, the mention's type, the candidate entity's context,
the candidate entity's document, and the candidate entity's type,
respectively. The feature $f(m, e)$ combines multiple
granularities of semantic similarity between these representations.
The final score of our ED system is a combination of the
prior score and the semantic similarity scores. We choose the
candidate entity with the highest final score as
the result of our disambiguation system.
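Since the exact similarity feature and combination weights are not reproduced here, the following sketch assumes cosine similarities over the context/document pairs plus the type pair, combined with the prior score by a simple weighted sum:
```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def similarity_features(m_c, m_d, m_t, e_c, e_d, e_t):
    """Multi-granularity similarities between a mention and a candidate entity:
    context/document vectors are compared pairwise, type vectors directly."""
    return np.array([cos(m_c, e_c), cos(m_c, e_d),
                     cos(m_d, e_c), cos(m_d, e_d),
                     cos(m_t, e_t)])

def final_score(prior, sims, weights=None):
    """Combine the prior score with the semantic similarity scores."""
    weights = np.ones_like(sims) if weights is None else weights
    return prior + float(weights @ sims)

# The candidate with the highest final score is the disambiguation result, e.g.:
# best = max(candidates, key=lambda c: final_score(c.prior, similarity_features(...)))
```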
1.5 Joint Model
Since TNNs leverage linguistic information to represent
sentences, we choose the BRNN-CNN NER system of Li et
al. [12], which also includes Tree-LSTMs, to construct the
joint model. Since both the NER and ED models use deep
neural networks, we perform joint training by combining
the loss functions of the NER and ED models and
minimizing the resulting function.
$$ \theta^{*} = \arg\min_{\theta}\,\big(L_{\mathrm{NER}} + L_{\mathrm{ED}}\big) $$
Here, $\theta$ is the set of parameters for both the NER and ED
tasks.
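A minimal sketch of this joint optimization, assuming both models are PyTorch modules that expose a placeholder .loss(batch) method (names and training-loop details are illustrative):
```python
import torch

def joint_train(ner_model, ed_model, train_loader, epochs=1, lr=1e-3):
    """Jointly optimize the NER and ED models by minimizing L_NER + L_ED."""
    params = list(ner_model.parameters()) + list(ed_model.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = ner_model.loss(batch) + ed_model.loss(batch)  # L_NER + L_ED
            loss.backward()
            optimizer.step()
```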
2 Experiment and Evaluation
2.1 Training Data
Our training data come from the English Wikipedia dump
exported on November 3, 2017. We randomly select 6,000
inner links and use their anchor texts as mentions. Each inner
link is regarded as a true named entity.
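As a rough illustration of this preprocessing, inner-link anchors can be extracted from the Wikipedia markup with the mwparserfromhell library; the sampling and filtering below are simplified assumptions rather than the exact pipeline used here.
```python
import random
import mwparserfromhell

def extract_mention_entity_pairs(wikitext):
    """Each inner link [[Target|anchor]] yields an (anchor text, linked entity) pair."""
    pairs = []
    for link in mwparserfromhell.parse(wikitext).filter_wikilinks():
        entity = str(link.title).strip()
        mention = str(link.text or link.title).strip()
        pairs.append((mention, entity))
    return pairs

pairs = extract_mention_entity_pairs(
    "[[Barack Obama|Obama]] was born in [[Honolulu]].")
# [('Obama', 'Barack Obama'), ('Honolulu', 'Honolulu')]
sample = random.sample(pairs, k=min(len(pairs), 6000))  # select 6,000 links in total
```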
2.2 Test Datasets
AQUAINT [18]: This dataset consists of newswire text
data. It has 50 articles with over 700 mentions.
ACE04 [19]: This dataset is a subset of the ACE 2004
coreference documents annotated via Amazon Mechanical
Turk. It has 35 articles, each with 7 mentions on
average.
CoNLL-YAGO [20]: This dataset is derived from the CoNLL 2003
shared task and consists of training, development, and test
sets. We train our model on the training set and test
on the test set, which contains 231 news articles and
several rarer entities.
2.3 Results
We use the Gerbil testing platform [21] with the
Disambiguate to Knowledge Base (D2KB) task to compare
our ED model with AGDISTIS [22], AIDA [20], Babelfy [23],
and DBpedia Spotlight [24]. The results are measured by
micro F1 scores. Our work outperforms all these four
systems..
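For reference, micro F1 pools true positives, false positives, and false negatives over all mentions before computing precision and recall; a small sketch with illustrative counts:
```python
def micro_f1(tp, fp, fn):
    """Micro-averaged F1 from counts pooled over all documents."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(micro_f1(tp=620, fp=90, fn=105), 4))  # ~0.8641 with these illustrative counts
```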
Table 1: Results for Entity Disambiguation (micro F1)

System                  AQUAINT   ACE04   CoNLL-YAGO
AGDISTIS                48.69     68.30   60.11
AIDA                    55.28     71.74   69.12
Babelfy                 68.15     56.15   65.70
DBpedia Spotlight       49.95     47.53   46.67
Tree-ACNN (proposed)    75.86     78.79   81.66
We also compare the joint model using Stanford NER with
the joint model using the BRNN-CNN NER model [12] on the
AQUAINT dataset. The results of the joint model are measured
by the ED F1 scores. The results improve when the ED
system is combined with a more accurate NER system.
Table 2: Results for Joint NER and ED (micro F1, AQUAINT)

NER                                      Micro F1
Stanford NER                             79.97
BRNN-CNN NER                             86.67

Joint NER and ED                         Micro F1
Stanford NER + Tree-ACNN                 75.86
BRNN-CNN NER + Tree-ACNN (proposed)      79.39
References
[1] Z. He et al., "Learning entity representation for entity disambiguation," In Proceedings of the 51st ACL, vol. 2, pp. 30-34, 2013.
[2] M. Francis-Landau, G. Durrett, and D. Klein, "Capturing semantic similarity for entity linking with convolutional neural networks," In Proceedings of NAACL-HLT, pp. 1256-1261, 2016.
[3] N. Gupta, S. Singh, and D. Roth, "Entity linking via joint encoding of types, description, and context," In Proceedings of EMNLP, pp. 2681-2690, 2017.
[4] O. E. Ganea and T. Hofmann, "Deep joint entity disambiguation with local neural attention," In Proceedings of EMNLP, pp. 2619-2629, 2017.
[5] J. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," TACL, vol. 4, pp. 357-370, 2016.
[6] G. Lample et al., "Neural architectures for named entity recognition," In Proceedings of NAACL-HLT, pp. 260-270, 2016.
[7] G. Luo et al., "Joint named entity recognition and disambiguation," In Proceedings of EMNLP, pp. 879-888, 2015.
[8] D. B. Nguyen, M. Theobald, and G. Weikum, "J-NERD: joint named entity recognition and disambiguation with rich linguistic features," TACL, vol. 4, pp. 215-229, 2016.
[9] J. B. Pollack, "Recursive distributed representations," Artificial Intelligence, vol. 46, pp. 77-105, 1990.
[10] R. Socher et al., "Recursive deep models for semantic compositionality over a sentiment treebank," In Proceedings of EMNLP, pp. 1631-1642, 2013.
[11] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," In Proceedings of the 53rd ACL, pp. 1556-1566, 2015.
[12] P. Li et al., "Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks," In Proceedings of EMNLP, pp. 2664-2669, 2017.
[13] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," In Proceedings of EMNLP, pp. 1532-1543, 2014.
[14] W. Zeng, J. Tang, and X. Zhao, "Entity linking on Chinese microblogs via deep neural network," IEEE Access, vol. 6, pp. 25908-25920, 2018.
[15] X. Ling and D. S. Weld, "Fine-grained entity recognition," In Proceedings of AAAI, pp. 94-100, 2012.
[16] S. Shimaoka et al., "Neural architectures for fine-grained entity type classification," In Proceedings of EACL, pp. 1271-1280, 2017.
[17] V. Spitkovsky and A. Chang, "A cross-lingual dictionary for English Wikipedia concepts," In Proceedings of LREC, 2012.
[18] D. Milne and I. H. Witten, "Learning to link with Wikipedia," In Proceedings of ACM CIKM, pp. 509-518, 2008.
[19] L. Ratinov et al., "Local and global algorithms for disambiguation to Wikipedia," In Proceedings of the 49th ACL, vol. 1, pp. 1375-1384, 2011.
[20] J. Hoffart et al., "Robust disambiguation of named entities in text," In Proceedings of EMNLP, pp. 782-792, 2011.
[21] R. Usbeck et al., "GERBIL: general entity annotator benchmarking framework," In Proceedings of the 24th WWW, pp. 1133-1143, 2015.
[22] R. Usbeck et al., "AGDISTIS: graph-based disambiguation of named entities using linked data," In Proceedings of ISWC, pp. 457-471, 2014.
[23] A. Moro, F. Cecconi, and R. Navigli, "Multilingual word sense disambiguation and entity linking for everybody," In Proceedings of ISWC, pp. 25-28, 2014.
[24] P. N. Mendes et al., "DBpedia Spotlight: shedding light on the web of documents," In Proceedings of the 7th International Conference on Semantic Systems, pp. 1-8, 2011.