LECTURE 12
Information Extraction and Named Entity Recognition
Arkaitz Zubiaga, 19th February, 2018
LECTURE 12: CONTENTS
What is Information Extraction?
Named Entity Recognition (NER).
Relation Extraction (RE).
Other Information Extraction tasks.
INFORMATION EXTRACTION (IE)
Information extraction: automatically extracting structured information from unstructured texts.

Subject: meeting
Date: 8th January, 2018
To: Arkaitz Zubiaga
Hi Arkaitz, we have finally scheduled the meeting.
It will be in the Ada Lovelace room, next Monday 10am-11am.
-Mike

Create new Calendar entry
Event: Meeting w/ Mike
Date: 15 Jan, 2018
Start: 10:00am
End: 11:00am
Where: A. Lovelace
ON WIKIPEDIA… THERE ARE INFOBOXES…
Structured information: an infobox's attribute-value pairs.
…AND SENTENCES
Unstructured text.
USING IE TO POPULATE WIKIPEDIA INFOBOXES
We can use information extraction to populate the infobox automatically.
INFORMATION EXTRACTION
IE is the process of extracting some limited amount of semantic content from text.
It can be viewed as the task of automatically filling a questionnaire or template (e.g. a database table).
It has been an important NLP application since the 1990s.
INFORMATION EXTRACTION: SUBTASKS
IE consists of the following subtasks:
Named Entity Recognition.
Relation Extraction.
Coreference resolution.
Event extraction.
Temporal expression extraction.
Slot filling.
Entity linking.
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
Named Entity (NE): anything that can be referred to with a proper name.
Usually 3 categories: person, location and organisation.
Can be extended to numeric expressions (price, date, time).
NAMED ENTITIES
Example of an extended set of named entities.
NAMED ENTITY RECOGNITION
The Named Entity Recognition (NER) task: (1) identify the spans of text that constitute proper names, (2) categorise each by entity type.
The first step in IE; also useful for:
Reducing sparseness in text classification.
Identifying the target in sentiment analysis.
Question answering.
CHALLENGES IN NAMED ENTITY RECOGNITION
Challenges in NER:
Ambiguity of segmentation (is it an entity? where are its boundaries?):
I’m on the Birmingham New Street-London Euston train. (two locations)
Ambiguity of entity type:
Downing St.: location (street) or organisation (government)?
Warwick: location (town) or organisation (university)?
Georgia: person (name) or location (country)?
NER AS A SEQUENCE LABELLING TASK
The standard NER algorithm treats NER as a word-by-word sequence labelling task, where the labels capture both entity boundaries and entity type.
NER AS A SEQUENCE LABELLING TASK
Convert the labels to BIO format:
(B): beginning of an entity, (I): inside an entity, (O): outside any entity.
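The BIO conversion can be sketched in a few lines of Python (a toy illustration; the function name and the span encoding are ours, not from any NER library):

```python
def to_bio(tokens, entities):
    """Convert entity spans into one BIO label per token.

    entities: list of (start, end, type) token-index spans,
    with `end` exclusive, e.g. (0, 2, "PER") covers tokens[0:2].
    """
    labels = ["O"] * len(tokens)           # (O): token is outside any entity
    for start, end, etype in entities:
        labels[start] = "B-" + etype       # (B): first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype       # (I): continuation of the entity
    return labels

tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "said", "."]
print(to_bio(tokens, [(0, 2, "PER"), (3, 5, "ORG")]))
# → ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O', 'O']
```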
NER AS A SEQUENCE LABELLING TASK
State-of-the-art approach:
An MEMM or CRF is trained to label each token in a text as being part of a named entity or not.
Gazetteers are used to assign labels to those tokens.
Gazetteers: lists of place names, first names, surnames, organisations, products, etc.
GeoNames: http://www.geonames.org/
National Street Gazetteer (UK): https://data.gov.uk/dataset/national-street-gazetteer
More gazetteers: http://library.stanford.edu/guides/gazetteers
EVALUATION OF NER
Standard measures: Precision, Recall, F-measure.
The entity, rather than the word, is the unit of response.
The segmentation component makes evaluation more challenging:
using words for training but entities as the unit of response means that if Leamington Spa appears in the corpus and our system identifies only Leamington, that counts as a mistake.
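Entity-level scoring can be sketched as follows (a minimal illustration, assuming gold and predicted entities are given as (start, end, type) spans):

```python
def entity_prf(gold, predicted):
    """Entity-level Precision/Recall/F1: a prediction counts as correct
    only if both its span and its type exactly match a gold entity."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Gold contains "Leamington Spa" (tokens 3-5); the system found only
# "Leamington" (tokens 3-4), so the partial match counts as both a
# false positive and a false negative:
gold = {(0, 1, "PER"), (3, 5, "LOC")}
pred = {(0, 1, "PER"), (3, 4, "LOC")}
print(entity_prf(gold, pred))  # → (0.5, 0.5, 0.5)
```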
COMMERCIAL NER SYSTEMS
Statistical sequence models are the norm in academic research.
Commercial NER systems are often hybrids of rules and machine learning:
1. Use high-precision rules to tag unambiguous entities.
2. Search for substring matches of those unambiguous entities.
3. Use application-specific name lists to identify further entities.
4. Use probabilistic sequence labelling to tag what remains.
RELATION EXTRACTION
RELATION EXTRACTION
Relation extraction: the process of identifying relations between named entities.
The spokesman of the UK’s [location] Downing Street [organisation], James Slack [person], …
Relations:
Downing Street is in the UK (ORG-LOC).
James Slack is spokesperson of Downing Street (PER-ORG).
James Slack works in the UK (PER-LOC).
RELATION TYPES
Examples of types of relations.
There are databases that define these relation types.
RELATION TYPES
For example, the UMLS (Unified Medical Language System) is a network that describes 134 broad subject categories and entity types, and 54 relations between entities.
Relations usually describe properties of the entities.
RELATION EXTRACTION
Wikipedia infoboxes can be used to create a corpus.
Convert the infobox into relations as RDF triples:
Univ. of Warwick – located_in – Coventry
Univ. of Warwick – established_in – 1965
…
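Flattening an infobox into triples is straightforward; a minimal sketch, with the infobox given as a Python dict:

```python
def infobox_to_triples(entity, infobox):
    """Turn an infobox's attribute-value pairs into
    (subject, relation, object) triples."""
    return [(entity, relation, value) for relation, value in infobox.items()]

triples = infobox_to_triples(
    "Univ. of Warwick",
    {"located_in": "Coventry", "established_in": "1965"})
print(triples)
# → [('Univ. of Warwick', 'located_in', 'Coventry'),
#    ('Univ. of Warwick', 'established_in', '1965')]
```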
RELATION EXTRACTION USING PATTERNS
The first approach, used in the 1990s, relied on lexical/syntactic patterns.
-: Such patterns are high precision but typically low recall.
-: Difficult to generalise to new domains and new IE tasks.
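A classic lexical pattern is "X such as Y", which signals an is-a relation. A toy regex version illustrates the trade-off: high precision on the text it matches, but it misses every relation phrased any other way (the low-recall problem above):

```python
import re

# One lexico-syntactic pattern: "NP such as NP" signals an is-a relation.
PATTERN = re.compile(r"(\w[\w ]*?) such as (\w[\w ]*?)(?:,|\.| and )")

def extract_isa(text):
    return [(hyponym.strip(), "is-a", hypernym.strip())
            for hypernym, hyponym in PATTERN.findall(text)]

print(extract_isa("Airlines such as Ryanair, have cut fares."))
# → [('Ryanair', 'is-a', 'Airlines')]
```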
RE USING SUPERVISED MACHINE LEARNING
Supervised machine learning approaches typically proceed as follows:
1. A fixed set of relations and entities is chosen.
2. A corpus is manually annotated with the entities and relations to create training data.
3. A classifier (e.g. SVM, Logistic Regression, Naive Bayes, Perceptron) is trained on the corpus and tested on unseen text to classify potential relations between entity pairs.
RE USING SUPERVISED MACHINE LEARNING
It is important to choose good features, e.g. dependencies, POS tags.
Example: American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
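For the sentence above, a feature vector for the candidate pair (American Airlines, AMR) might include features like these (a sketch; the feature names and the span encoding are ours):

```python
def pair_features(tokens, e1, e2):
    """Feature dict for a candidate entity pair; each entity is a
    (start, end, type) token-index span with `end` exclusive."""
    (s1, t1, type1), (s2, t2, type2) = e1, e2
    between = tokens[t1:s2]                     # words between the entities
    return {
        "e1_type": type1,
        "e2_type": type2,
        "type_pair": type1 + "-" + type2,       # e.g. ORG-ORG
        "words_between": " ".join(between),
        "num_words_between": len(between),
        "e1_head": tokens[t1 - 1],              # last token of each entity
        "e2_head": tokens[t2 - 1],
    }

tokens = "American Airlines , a unit of AMR , immediately matched the move".split()
feats = pair_features(tokens, (0, 2, "ORG"), (6, 7, "ORG"))
print(feats["type_pair"], "|", feats["words_between"])
# → ORG-ORG | , a unit of
```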
RE USING SEMI-SUPERVISED LEARNING
Bootstrapping can be used to expand our training data with instances for which our classifier’s predictions are very confident.
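The bootstrapping loop can be sketched generically (everything here is a stand-in: `train` is any learner that returns a classify function yielding (label, confidence) pairs, and `toy_train` below is an invented example):

```python
def bootstrap(labelled, unlabelled, train, rounds=3, threshold=0.9):
    """Repeatedly retrain and absorb the most confident predictions."""
    labelled = list(labelled)
    for _ in range(rounds):
        classify = train(labelled)
        confident, rest = [], []
        for x in unlabelled:
            label, confidence = classify(x)
            (confident if confidence >= threshold else rest).append((x, label))
        if not confident:                 # nothing confident enough: stop early
            break
        labelled += confident             # expand the training data
        unlabelled = [x for x, _ in rest]
    return labelled

# Toy stand-in learner: instances are numbers, confidence is high for |x| > 5.
def toy_train(labelled):
    return lambda x: ("pos" if x > 0 else "neg", 1.0 if abs(x) > 5 else 0.3)

print(bootstrap([(10, "pos")], [7, -8, 1], toy_train, rounds=1))
# the confident instances 7 and -8 are absorbed; 1 stays unlabelled
```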
RE USING DISTANT SUPERVISION
Distant supervision combines the best of both worlds: bootstrapping + supervised machine learning.
Use the Web (e.g. Wikipedia) to build large sets of relations.
We can then train a more reliable supervised classifier.
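Distant supervision generates training data by assuming that any sentence mentioning both arguments of a known triple expresses that relation. A sketch with invented data; note the second sentence is a noisy positive, which is the known weakness of the assumption:

```python
def distant_supervision(kb_triples, sentences):
    """Label a sentence with relation r whenever it mentions both
    entities of a knowledge-base triple (e1, r, e2)."""
    training = []
    for sentence in sentences:
        for e1, rel, e2 in kb_triples:
            if e1 in sentence and e2 in sentence:
                training.append((sentence, e1, e2, rel))
    return training

kb = [("Barack Obama", "born_in", "Hawaii")]
sents = ["Barack Obama was born in Hawaii.",
         "Barack Obama visited Hawaii last year."]   # noisy positive!
data = distant_supervision(kb, sents)
print(len(data))  # → 2 (both sentences get labelled born_in)
```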
OPEN IE: UNSUPERVISED RE
The main difference is that the schema for these relations does not need to be specified in advance.
The relation name is just the text linking two arguments, e.g. “Barack Obama was born in Hawaii” would create:
Barack Obama; was born in; Hawaii
The challenge lies in identifying that two entities connected by a verb constitute an actual relation, rather than a false positive.
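A drastically simplified Open IE extractor (a toy regex, nothing like a real system): take two capitalised noun chunks linked by a run of lowercase words, and use the linking text as the relation name:

```python
import re

# Toy pattern: capitalised chunk + lowercase linking words + capitalised chunk.
TRIPLE = re.compile(
    r"([A-Z]\w+(?: [A-Z]\w+)*) ((?:[a-z]+ )+?)([A-Z]\w+(?: [A-Z]\w+)*)")

def open_ie(sentence):
    return [(subj, rel.strip(), obj) for subj, rel, obj in TRIPLE.findall(sentence)]

print(open_ie("Barack Obama was born in Hawaii"))
# → [('Barack Obama', 'was born in', 'Hawaii')]
```

Every capitalised pair joined by a verb-like span becomes a triple, so false positives are frequent, which is exactly the challenge noted above.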
STANFORD OPENIE
Stanford’s state-of-the-art OpenIE system:
1. Using dependencies, it splits the sentence into clauses.
2. Each clause is maximally shortened into sentence fragments.
3. These fragments are output as OpenIE triples.
https://nlp.stanford.edu/software/openie.html
EVALUATION OF RELATION EXTRACTION
Precision, Recall and F-measure for supervised methods.
For unsupervised methods, evaluation is much harder; we can estimate Precision using a small labelled sample:
precision at different levels of recall, e.g. over the top 1,000 new relations, the top 10,000 new relations, etc.
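Precision at k over a confidence-ranked list of extracted relations, judged against a small hand-labelled sample, can be computed as follows (a sketch; the data is invented):

```python
def precision_at_k(ranked, judgements, k):
    """Fraction of the top-k extracted relations judged correct."""
    top = ranked[:k]
    return sum(judgements[r] for r in top) / len(top)

ranked = ["r1", "r2", "r3", "r4"]          # sorted by extractor confidence
judged = {"r1": True, "r2": True, "r3": False, "r4": False}
print(precision_at_k(ranked, judged, 2))   # → 1.0
print(precision_at_k(ranked, judged, 4))   # → 0.5
```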
OTHER INFORMATION EXTRACTION TASKS
COREFERENCE RESOLUTION
Coreference resolution: finding all expressions that refer to the same entity in a text.
COREFERENCE RESOLUTION
Neuralcoref: a Python package for coreference resolution.
https://github.com/huggingface/neuralcoref
You can also try it online:
https://huggingface.co/coref/
NEURAL COREF: EXAMPLE
EVENT EXTRACTION
Event extraction: extracting structured information about events from text, often combined with temporal expression extraction. Very useful for analysing news stories.
Steve Jobs died in 2011 → Steve Jobs; death in; 2011
New Android phone was announced by Google → New Android phone; announcement by; Google
TEMPORAL EXPRESSION EXTRACTION
Temporal expression extraction: structuring time mentions in text, often associated with event extraction.
Event 1: mid 80’s → released a pass.
Event 2: ~5 seconds later → slammed to the ground.
TEMPLATE/SLOT FILLING
Slot filling (aka Knowledge Base Population): the task of taking an incomplete knowledge base (e.g., Wikidata) and a large corpus of text (e.g., Wikipedia), and filling in the missing elements of the knowledge base.
ENTITY LINKING
Entity linking: the task of taking ambiguous entity mentions and linking them with concrete entries in a knowledge base.
Alex Jones has a new film. → which of the knowledge-base entries named Alex Jones is meant?
Can be defined as:
● a classification task;
● a similarity task;
● …
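Linking as a similarity task can be sketched with bag-of-words cosine similarity between the mention's context and each candidate's description (all candidate entries below are invented stand-ins for knowledge-base records):

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link(mention_context, candidates):
    """Pick the KB entry whose description best matches the context."""
    ctx = Counter(mention_context.lower().split())
    return max(candidates,
               key=lambda name: cosine(ctx, Counter(candidates[name].lower().split())))

candidates = {   # toy stand-ins for knowledge-base entries
    "Alex Jones (radio host)": "american radio host and conspiracy theorist",
    "Alex Jones (filmmaker)": "welsh film director known for documentary film work",
}
print(link("Alex Jones has a new film", candidates))
# → Alex Jones (filmmaker)   (the context word "film" tips the similarity)
```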
INFORMATION EXTRACTION: SUMMARY
IE involves a broad range of tasks with a common goal:
unstructured text → structured data
Sequence classification is often useful.
As with many other tasks, availability of data is crucial:
Labelled corpora.
Gazetteers.
Thesauri.
RESOURCES
Stanford CRF NER:
https://nlp.stanford.edu/software/CRF-NER.shtml
Stanford CoreNLP:
https://stanfordnlp.github.io/CoreNLP/
iepy:
https://pypi.python.org/pypi/iepy
Stanford-OpenIE-Python:
https://github.com/philipperemy/Stanford-OpenIE-Python
REFERENCES
Chang, C. H., Kayed, M., Girgis, M. R., & Shaalan, K. F. (2006). A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1411-1428.
Eikvil, L. (1999). Information extraction from the World Wide Web: a survey.
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition. Chapter 20.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 2009. Chapter 7.