Entity Linking and Relation Recognition
Data Science Group (Informatics) NLE/ANLP Autumn 2015
This time:
Entity Linking
The challenge of entity linking
Techniques for entity linking
Relation Recognition
What is relation recognition?
Identifying related entities
Classifying relations
Information Extraction
Recall: IE is the task of extracting information from unstructured text:
Detect entities of interest
e.g. companies, locations, products
Detect relations of interest between entities:
COMPANY in LOCATION
COMPANY sell PRODUCT
COMPANY acquire COMPANY
etc.
Determining the Identity of Entities
The task is called:
Named Entity Disambiguation
Entity Linking
The Problem
A problem instance consists of:
A knowledge base (KB) such as Wikipedia
An entity mention in a textual context
Goal:
return the canonical entry in the KB for the entity being mentioned
or
return NIL if the entity is not in the KB
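A minimal sketch of this problem setting in Python. The `Mention` class, the `link` function and the toy `kb` dictionary are hypothetical names introduced only for illustration. Exact title lookup is a naive stand-in for a real linker, and a KB such as Wikipedia would not be a small dictionary.

```python
from dataclasses import dataclass
from typing import Dict

NIL = "NIL"  # returned when the mentioned entity has no entry in the KB


@dataclass(frozen=True)
class Mention:
    text: str     # the entity mention string
    context: str  # the text surrounding the mention


def link(mention: Mention, kb: Dict[str, str]) -> str:
    """Return the canonical KB title for the mention, or NIL.

    Naive baseline (an assumption for illustration): the mention must
    match a KB title exactly, and the context is ignored.
    """
    return mention.text if mention.text in kb else NIL


# Toy KB mapping canonical titles to short descriptions (illustrative only).
kb = {"Bill Clinton": "42nd US president"}

known = Mention("Bill Clinton", "Bill Clinton gave a speech yesterday.")
unknown = Mention("Jane Smith", "Jane Smith opened a bakery last week.")
print(link(known, kb))
print(link(unknown, kb))
```

The first call finds a KB entry and the second falls through to NIL, which are the two outcomes the goal describes.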
Why This is Challenging
Absence from KB
Not realistic that KB contains all entities being mentioned
— open class concepts hard to maintain up-to-date and complete
— e.g. lists of people being talked about
Entity Ambiguity
— many different entities potentially referred to with same string
Manchester
— the city in England
— the town in Bolivia
— one of 32 towns in the USA
— the football club
— the University
— the Airport
— song by The Beautiful South
Wikipedia as the KB
Many named entities have their own page in Wikipedia
The title of the page is a canonical way of naming the entity
Title of Wikipedia page for 42nd US president is
Bill Clinton
not
William Jefferson Blythe III
William Jefferson Clinton
Why This is Challenging
Name Variations
— many different ways of referring to the same entity
Manchester United Football Club
MUFC
Manchester United FC
Manchester United
Man United
Man U
The Reds
Busby Babes
Lancashire & Yorkshire Railway
Newton Heath
The Heathens
Techniques for Entity Linking
Two phases
Find candidates in KB for given entity mention
Rank candidates to find most probable
Generating Candidates
The name variants challenge is addressed here
Need to find all potentially relevant candidates
Familiar tradeoff:
precision versus recall
Need good recall
— so that the correct entity is among candidates
but too many candidates can hurt precision (and efficiency)
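The tradeoff can be made concrete by measuring a candidate generator on mentions whose correct entity is known. The sketch below is one assumed way of doing this, not a prescribed evaluation: `candidate_recall` and `mean_candidates` are hypothetical helper names, and the candidate sets are toy data.

```python
from typing import List, Set


def candidate_recall(candidate_sets: List[Set[str]], gold: List[str]) -> float:
    """Fraction of mentions whose correct entity is among the candidates."""
    hits = sum(1 for cands, g in zip(candidate_sets, gold) if g in cands)
    return hits / len(gold)


def mean_candidates(candidate_sets: List[Set[str]]) -> float:
    """Average number of candidates the ranker must choose between."""
    return sum(len(c) for c in candidate_sets) / len(candidate_sets)


# Toy data: two mentions of "Manchester" with known correct entities.
gold = ["Manchester United F.C.", "Manchester Airport"]

# A strict generator: exact title match only.
strict = [{"Manchester"}, {"Manchester"}]

# A loose generator: any title containing the mention.
loose = [
    {"Manchester", "Manchester United F.C.", "Manchester Airport",
     "University of Manchester"},
] * 2

print(candidate_recall(strict, gold), mean_candidates(strict))
print(candidate_recall(loose, gold), mean_candidates(loose))
```

On this toy data the strict generator misses both correct entities, while the loose one finds both at the cost of four candidates per mention for the ranking phase to sort out.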
Strategies for Generating Candidates
Mention is exact match with title of Wikipedia page
David Beckham
Mention is proper substring of title of Wikipedia page or vice-versa
Beckham
Mention is an acronym of page title
UoS
Mention is a similar string to page title
— use a string similarity measure, e.g. Levenshtein Distance
Mention is a known alias for page title
— can extract from Wikipedia redirects and disambiguation pages
UK, Becks
Remember, this is just generating candidates!
— typically makes limited use of the context
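The strategies for generating candidates could be combined roughly as follows. This is a sketch under assumptions: the function names, the `max_dist` threshold of 2, the case-insensitive matching and the toy title and alias tables are all illustrative choices, and a real system would draw titles and aliases from Wikipedia itself.

```python
from typing import Dict, List, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def acronym(title: str) -> str:
    """Initial letters of the words in a title."""
    return "".join(word[0] for word in title.split())


def candidates(mention: str, titles: List[str],
               aliases: Dict[str, Set[str]], max_dist: int = 2) -> Set[str]:
    """Union of the candidate-generation strategies."""
    found = set(aliases.get(mention, set()))             # known alias
    m = mention.lower()
    for title in titles:
        t = title.lower()
        if (m == t                                       # exact match
                or m in t or t in m                      # substring either way
                or m == acronym(title).lower()           # acronym
                or levenshtein(m, t) <= max_dist):       # similar string
            found.add(title)
    return found


# Toy page titles and alias table (illustrative only).
titles = ["David Beckham", "Victoria Beckham",
          "University of Sussex", "United Kingdom"]
aliases = {"Becks": {"David Beckham"}, "UK": {"United Kingdom"}}

for mention in ["David Beckham", "Beckham", "UoS", "Becks", "David Beckam"]:
    print(mention, "->", sorted(candidates(mention, titles, aliases)))
```

Note that `Beckham` yields two candidates here. Candidate generation is meant to leave that choice to the ranking phase.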
Information for Ranking Candidates
Entity mention occurs within a context
Co-occurrence of entity mentions
— other named entities in same document
Local context of an entity mention
— neighbouring words
Global context of an entity mention
— document within which entity mention occurs
— bag-of-words for document captures topic
Strategies for Ranking Candidates
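As one illustration of how the global context might be used to rank candidates, the sketch below compares the bag-of-words of the document containing the mention with that of each candidate's KB page. Cosine similarity, whitespace tokenisation, the function names and the toy page texts are all assumptions made for the example, since no particular similarity measure is prescribed here.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lower-cased token counts; whitespace tokenisation is a simplification."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def rank(document: str, kb_pages: Dict[str, str]) -> List[Tuple[str, float]]:
    """Candidates sorted by similarity of their KB page to the document."""
    doc = bag_of_words(document)
    scored = [(title, cosine(doc, bag_of_words(page)))
              for title, page in kb_pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy candidate pages for the mention "Manchester" (page text is invented).
kb_pages = {
    "Manchester": "city in england with a large population",
    "Manchester United F.C.": "football club that plays league matches",
    "Manchester Airport": "airport with flights and terminals",
}
document = "Manchester won the football match and stay top of the league"

for title, score in rank(document, kb_pages):
    print(round(score, 3), title)
```

The football club page shares topical words with the document and so ranks first. A practical system would also remove stop words so that function words such as "and" do not contribute to the score.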
Entity relatedness
Do co-occurring entities also co-occur with same types in KB pages?
Query relevance
Does a candidate KB page contain tokens in local context of entity mention?
Document similarity
Does a candidate KB page have a high bag-of-words similarity to document?
Relation Extraction
Discovering relationships between entities
In their largest acquisition to date, Google has acquired YouTube for $1.65 billion in an all stock transaction.
<entity>  <relationship>  <entity>
Google    acquire         YouTube
COMPANY   acquire         COMPANY
Binary Relations
Relation extraction is typically concerned with binary relationships
Named entity recognition is the unary variant of this task
— entities belong to a specified class
Binary relations are fundamental to meaning
Relation Granularity
Recall that named entities are classified into classes
— PERSON, PLACE, COMPANY, etc.
Relation types can also be organised into classes
X acquired Y ⇒ Y PART-OF X
X married to Y ⇒ X AFFILIATED-WITH Y
Supervised Approaches
Two phases:
Phase 1: Extract a pair of entities that are related in some way
Phase 2: Categorise the relationship that holds between the entities
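The two phases might be wired together as in the sketch below. The `extract_relations` function and its signature are assumptions for illustration. The two `toy_` functions are hand-written keyword rules standing in for trained classifiers, so that the example runs without any training data.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Entity = Tuple[str, str]          # (mention text, entity class)
Relation = Tuple[str, str, str]   # (argument 1, relation class, argument 2)


def extract_relations(
    sentence: str,
    entities: List[Entity],
    is_related: Callable[[str, Entity, Entity], bool],
    categorise: Callable[[str, Entity, Entity], Relation],
) -> List[Relation]:
    """Phase 1 filters entity pairs; phase 2 labels the surviving pairs."""
    relations = []
    for e1, e2 in combinations(entities, 2):
        if is_related(sentence, e1, e2):                     # phase 1
            relations.append(categorise(sentence, e1, e2))   # phase 2
    return relations


# Hand-written stand-ins for the two trained classifiers (illustration only).
def toy_is_related(sentence: str, e1: Entity, e2: Entity) -> bool:
    return "acquired" in sentence or "married to" in sentence


def toy_categorise(sentence: str, e1: Entity, e2: Entity) -> Relation:
    if "acquired" in sentence:
        return (e2[0], "PART-OF", e1[0])       # X acquired Y, so Y PART-OF X
    return (e1[0], "AFFILIATED-WITH", e2[0])   # X married to Y


sentence = "Google has acquired YouTube for $1.65 billion."
entities = [("Google", "COMPANY"), ("YouTube", "COMPANY")]
print(extract_relations(sentence, entities, toy_is_related, toy_categorise))
```

Keeping the phases as separate functions mirrors the split between a binary "related or not" decision and a multiclass decision about the relation class.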
Extracting Related Entity Pairs
Needs a binary classifier
Are entities e1 and e2 related in this text?
Classifier trained on positive and negative examples
Positive examples given in labelled training data
Negative examples are entities found in training data that are not labelled as being related
Classifying Relations
Needs a multiclass classifier e.g. Naïve Bayes
Features used for classification:
Class of each of the two target named entities
Tokens appearing in named entity mentions
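A possible way to build training examples for the binary classifier, using the entity-class and mention-token features. The feature naming scheme (`class1=`, `token2=` and so on), the function names and the toy annotated sentence are assumptions for illustration. The resulting feature dictionaries could then be passed to a classifier of your choice.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Entity = Tuple[str, str]  # (mention text, entity class)


def pair_features(e1: Entity, e2: Entity) -> Dict[str, int]:
    """Binary features: class of each entity and tokens of each mention."""
    features = {"class1=" + e1[1]: 1, "class2=" + e2[1]: 1}
    for token in e1[0].split():
        features["token1=" + token.lower()] = 1
    for token in e2[0].split():
        features["token2=" + token.lower()] = 1
    return features


def training_examples(
    entities: List[Entity], labelled: Set[Tuple[str, str]]
) -> List[Tuple[Dict[str, int], bool]]:
    """Label every entity pair: positive if annotated as related, else negative."""
    examples = []
    for e1, e2 in combinations(entities, 2):
        related = (e1[0], e2[0]) in labelled or (e2[0], e1[0]) in labelled
        examples.append((pair_features(e1, e2), related))
    return examples


# Toy annotated text: only Google and YouTube are labelled as related.
entities = [("Google", "COMPANY"), ("YouTube", "COMPANY"),
            ("California", "LOCATION")]
labelled = {("Google", "YouTube")}

for feats, is_positive in training_examples(entities, labelled):
    print(is_positive, sorted(feats))
```

Every pair that is not annotated becomes a negative example, so negatives usually outnumber positives by a wide margin.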
Classifying Relations
More features used for classification:
Bag-of-words between entity mentions
Distance between entity mentions
Number of other named entity mentions between target named entities
Syntactic Paths
Using features for relation identification
Using syntactic paths as features
… YouTube, a subsidiary of Google …
How are YouTube and Google related in the syntax?
Can be captured with a syntactic path
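One way such a path could be computed is sketched below, assuming a constituency parse of the phrase in which "a subsidiary of Google" is a noun phrase attached alongside "YouTube". The `Node` class, the helper names and the decision to skip part-of-speech tags are assumptions for illustration. The path climbs from the first word to the lowest common ancestor, then descends to the second word.

```python
from typing import List, Optional


class Node:
    """A constituency tree node; leaves are words with no children."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.children = children or []
        self.parent: Optional["Node"] = None
        for child in self.children:
            child.parent = self


def word(tag: str, token: str) -> Node:
    """A part-of-speech tag node dominating a single word."""
    return Node(tag, [Node(token)])


def find_leaf(root: Node, token: str) -> Node:
    """Depth-first search for the first leaf whose label is `token`."""
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children and node.label == token:
            return node
        stack.extend(reversed(node.children))
    raise ValueError(token + " not in tree")


def phrase_ancestors(leaf: Node) -> List[Node]:
    """Nodes above the leaf's part-of-speech tag, nearest first."""
    nodes, node = [], leaf.parent.parent  # skip the POS tag (e.g. NNP)
    while node is not None:
        nodes.append(node)
        node = node.parent
    return nodes


def syntactic_path(root: Node, word1: str, word2: str) -> List[str]:
    """Labels climbed from word1 (↑), then descended to word2 (↓)."""
    up = phrase_ancestors(find_leaf(root, word1))
    down = phrase_ancestors(find_leaf(root, word2))
    down_ids = [id(n) for n in down]
    i = next(k for k, n in enumerate(up) if id(n) in down_ids)
    j = down_ids.index(id(up[i]))
    climb = up[:i]                # nodes strictly below the common ancestor
    descend = down[:j + 1][::-1]  # common ancestor first, then downwards
    return [n.label + "↑" for n in climb] + [n.label + "↓" for n in descend]


# Assumed parse of "YouTube, a subsidiary of Google".
tree = Node("NP", [
    Node("NP", [word("NNP", "YouTube")]),
    word("PUNC", ","),
    Node("NP", [
        Node("NP", [word("DT", "a"), word("NN", "subsidiary")]),
        Node("PP", [word("IN", "of"), Node("NP", [word("NNP", "Google")])]),
    ]),
])
print(syntactic_path(tree, "YouTube", "Google"))
```

The resulting list of labels can be joined into a single string and used as one categorical feature for the relation classifier.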
Features of the syntactic structure
Syntactic Paths: Example
NP
  NP
    NNP YouTube
  PUNC ,
  NP
    NP
      DT a
      NN subsidiary
    PP
      IN of
      NP
        NNP Google
Path from YouTube to Google: (NP ↑, NP ↓, NP ↓, PP ↓, NP ↓)
Next Topic: Information Retrieval
What is information retrieval?
Boolean retrieval
Indexing documents
Retrieval with an inverted index