LECTURE 12
Information Extraction and Named Entity Recognition
Arkaitz Zubiaga, 19th February, 2018
What is Information Extraction?
Named Entity Recognition (NER).
Relation Extraction (RE).
Other Information Extraction tasks.
LECTURE 12: CONTENTS
Information extraction: automatically extracting structured
information from unstructured texts.
INFORMATION EXTRACTION (IE)
Subject: meeting
Date: 8th January, 2018
To: Arkaitz Zubiaga
Hi Arkaitz, we have finally scheduled the meeting.
It will be in the Ada Lovelace room, next Monday 10am-11am.
-Mike
Create new Calendar entry
Event: Meeting w/ Mike
Date: 15 Jan, 2018
Start: 10:00am
End: 11:00am
Where: A. Lovelace
ON WIKIPEDIA… THERE ARE INFOBOXES…
Structured information:
…AND SENTENCES
Unstructured text:
We can use information extraction to populate the infobox
automatically.
USING IE TO POPULATE WIKIPEDIA INFOBOXES
IE is the process of extracting some limited amount of semantic
content from text.
It can be viewed as the task of automatically filling in a
questionnaire or template (e.g. a database table).
An important NLP application since the 1990s.
INFORMATION EXTRACTION
It consists of the following subtasks:
Named Entity Recognition.
Relation Extraction.
Coreference resolution.
Event extraction.
Temporal expression extraction.
Slot filling.
Entity linking.
INFORMATION EXTRACTION: SUBTASKS
NAMED ENTITY RECOGNITION
Named Entity (NE): anything that can be referred to with a
proper name.
Usually 3 categories: person, location and organisation.
Can be extended to numeric expressions (price, date, time).
NAMED ENTITY RECOGNITION
Example of an extended set of named entities.
NAMED ENTITIES
Named Entity Recognition (NER) task: 1) identify spans of text
that constitute proper names, 2) categorise them by entity type.
The first step in IE, but also useful for:
Reducing sparseness in text classification.
Identifying the target in sentiment analysis.
Question answering.
NAMED ENTITY RECOGNITION
Challenges in NER:
Ambiguity of segmentation (is it an entity? where are its boundaries?):
I’m on the Birmingham New Street-London Euston train. (two locations)
Ambiguity of entity type:
Downing St.: location (street) or organisation (government)?
Warwick: location (town) or organisation (university)?
Georgia: person (name) or location (country)?
CHALLENGES IN NAMED ENTITY RECOGNITION
The standard NER algorithm is a word-by-word sequence labelling
task, where labels capture both boundaries and entity type, e.g.:
NER AS A SEQUENCE LABELLING TASK
Convert it to BIO format:
(B): beginning, (I): inside, (O): outside.
NER AS A SEQUENCE LABELLING TASK
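The BIO conversion can be sketched in a few lines of Python; the sentence and its entity spans below are hypothetical annotations, not taken from the lecture.

```python
# Hypothetical example sentence with (start, end, type) entity spans
# over token indices (end exclusive).
tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "said", "."]
entities = [(0, 2, "PER"), (3, 5, "ORG")]

def to_bio(tokens, entities):
    """Label each token B-TYPE (beginning), I-TYPE (inside) or O (outside)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

print(to_bio(tokens, entities))
# → ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O', 'O']
```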
State-of-the-art approach:
An MEMM or CRF is trained to label each token in a text as
being part of a named entity or not.
Gazetteers are used to assign labels to those tokens.
Gazetteers: lists of place names, first names, surnames,
organisations, products, etc., e.g.:
GeoNames.
National Street Gazetteer (UK).
More gazetteers.
NER AS A SEQUENCE LABELLING TASK
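A minimal sketch of how a gazetteer can feed features to a sequence labeller such as a CRF; the three hand-written place names stand in for a real resource like GeoNames.

```python
# Toy gazetteer; real systems load large lists (GeoNames, the National
# Street Gazetteer, name lists) instead of this hand-written set.
GAZETTEER = {"london", "coventry", "birmingham"}

def gazetteer_features(tokens):
    """One boolean feature per token: does it appear in the gazetteer?"""
    return [{"in_gazetteer": tok.lower() in GAZETTEER} for tok in tokens]

feats = gazetteer_features(["Arrived", "in", "Coventry", "today"])
print(feats)
```

In a full system this dictionary lookup would be one feature among many (POS tags, word shape, context words) passed to the CRF.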
Standard measures: Precision, Recall, F-measure.
The entity, rather than the word, is the unit of response, i.e.:
The segmentation component makes evaluation more challenging:
Using words for training but entities as the unit of response
means that if we have Leamington Spa in the corpus, and our
system identifies only Leamington, that’s a mistake.
EVALUATION OF NER
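Entity-level scoring can be sketched as exact span matching: a partial match such as finding Leamington where the gold entity is Leamington Spa earns no credit.

```python
def entity_prf(gold, predicted):
    """Precision/Recall/F1 over (start, end, type) spans; exact match only."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 2, "LOC")]   # "Leamington Spa"
pred = [(0, 1, "LOC")]   # system found "Leamington" only: no credit
print(entity_prf(gold, pred))
# → (0.0, 0.0, 0.0)
```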
Statistical sequence models are the norm in academic research.
Commercial NER systems are often hybrids of rules & machine learning:
1. Use high-precision rules to tag unambiguous entities.
2. Search for substring matches of those unambiguous entities.
3. Use application-specific name lists to identify other entities.
4. Use probabilistic sequence labelling to complete the tagging.
COMMERCIAL NER SYSTEMS
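Step 1 of the hybrid pipeline might look like the following sketch: one hypothetical high-precision rule that tags honorific + surname patterns as person mentions.

```python
import re

# Hypothetical high-precision rule: an honorific followed by a
# capitalised surname is almost always a person mention.
PERSON_RULE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+[A-Z][a-z]+")

def tag_unambiguous_persons(text):
    """Return the spans matched by the high-precision person rule."""
    return [m.group() for m in PERSON_RULE.finditer(text)]

print(tag_unambiguous_persons("Dr. Smith met Mr. Jones at noon."))
# → ['Dr. Smith', 'Mr. Jones']
```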
RELATION EXTRACTION
Relation extraction: the process of identifying relations between
named entities.
The spokesman of the UK’s Downing Street, James Slack,…
(UK: location; Downing Street: organisation; James Slack: person)
Relations:
Downing Street is in the UK (ORG-LOC).
James Slack is spokesperson of Downing Street (PER-ORG).
James Slack works in the UK (PER-LOC).
RELATION EXTRACTION
Examples of types of relations:
There are databases that define these relation types.
RELATION TYPES
For example, there is the UMLS (Unified Medical Language
System), a network that describes 134 broad subject categories,
entity types and 44 relations between entities, e.g.:
Relations usually describe properties of the entities.
RELATION TYPES
Wikipedia infoboxes can be used to create a corpus.
Convert them into relations as RDF triples:
Univ. of Warwick – located_in – Coventry
Univ. of Warwick – established_in – 1965
…
RELATION EXTRACTION
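Turning a parsed infobox into RDF-style triples can be sketched as below; the infobox dict is a hypothetical, already-parsed stand-in for real wikitext, and the field names are illustrative.

```python
# Hypothetical parsed infobox: field names are made up for illustration.
infobox = {
    "subject": "Univ. of Warwick",
    "located_in": "Coventry",
    "country": "UK",
}

def infobox_to_triples(box):
    """Emit (subject, predicate, object) triples from an infobox dict."""
    subject = box["subject"]
    return [(subject, pred, obj) for pred, obj in box.items() if pred != "subject"]

print(infobox_to_triples(infobox))
```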
The first approaches, used in the 90s, relied on lexical/syntactic patterns.
Cons: such patterns are high precision but typically low recall.
Cons: difficult to generalise to new domains and new IE tasks.
RELATION EXTRACTION USING PATTERNS
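A classic lexical pattern of this kind is the "X such as Y" hyponym pattern. The sketch below is illustrative only and, as the slide notes, high precision but low recall; it keeps the hypernym as the plural form it matched.

```python
import re

# "NPs such as A and B" → (A, is_a, NPs); naive and deliberately narrow.
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w+(?:, | and )?)+)")

def extract_is_a(text):
    triples = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)          # kept as matched, e.g. "cities"
        for name in re.split(r", | and ", m.group(2)):
            if name:
                triples.append((name, "is_a", hypernym))
    return triples

print(extract_is_a("He visited cities such as London and Paris."))
# → [('London', 'is_a', 'cities'), ('Paris', 'is_a', 'cities')]
```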
Supervised machine learning approaches typically work as follows:
1. A fixed set of relations and entities is chosen.
2. A corpus is manually annotated with the entities and
relations to create training data.
3. A classifier (e.g. SVM, Logistic Regression, Naive Bayes,
Perceptron) is trained on the corpus and tested on unseen
text to classify potential relations between entity pairs.
RE USING SUPERVISED MACHINE LEARNING
It is important to choose good features, e.g. dependencies, POS tags.
Example: American Airlines, a unit of AMR, immediately matched the move,
spokesman Tim Wagner said.
RE USING SUPERVISED MACHINE LEARNING
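A sketch of simple surface features for a candidate entity pair, in the spirit of the American Airlines / AMR example; the feature names are made up for illustration, and a real system would add POS tags and dependency paths.

```python
def pair_features(tokens, e1_span, e2_span):
    """Headwords of both entities plus the words between them."""
    between = tokens[e1_span[1]:e2_span[0]]
    return {
        "e1_head": tokens[e1_span[1] - 1],
        "e2_head": tokens[e2_span[1] - 1],
        "words_between": " ".join(between),
        "num_words_between": len(between),
    }

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR"]
print(pair_features(tokens, (0, 2), (6, 7)))
# → {'e1_head': 'Airlines', 'e2_head': 'AMR',
#    'words_between': ', a unit of', 'num_words_between': 4}
```

These feature dicts would then be vectorised and fed to a classifier such as an SVM or Logistic Regression.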
Bootstrapping can be used to expand our training data
with instances for which our classifier’s predictions are very
confident.
RE USING SEMI-SUPERVISED LEARNING
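One round of bootstrapping can be sketched as below: a seed pattern (hand-written here; in a real system it would be induced from confident predictions) harvests new relation instances from unlabelled text. All names and sentences are made up.

```python
def bootstrap_round(corpus, pattern=", the spokesperson of "):
    """Harvest (person, spokesperson_of, org) triples matching one pattern."""
    found = set()
    for sentence in corpus:
        if pattern in sentence:
            person, org = sentence.split(pattern, 1)
            found.add((person, "spokesperson_of", org.rstrip(".")))
    return found

corpus = [
    "James Slack, the spokesperson of Downing Street.",
    "Jane Doe, the spokesperson of Acme Corp.",
    "Nothing relevant here.",
]
print(bootstrap_round(corpus))
# the newly found instances would be added to the training data
```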
Combine the best of both worlds: bootstrapping + supervised
machine learning.
Use the Web (e.g. Wikipedia) to build large sets of relations.
We can then train a more reliable supervised classifier.
RE USING DISTANT SUPERVISION
The main difference is that the schema for these relations does
not need to be specified in advance.
The relation name is just the text linking the two arguments, e.g.:
“Barack Obama was born in Hawaii” would create
(Barack Obama; was born in; Hawaii)
The challenge lies in identifying whether two entities connected by
a verb constitute an actual relation, rather than a false positive.
OPEN IE: UNSUPERVISED RE
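A toy Open IE extractor in this spirit: the verb phrase between two capitalised spans becomes the relation name. Real systems such as Stanford OpenIE use dependency parses; this regex only handles one narrow sentence shape.

```python
import re

# Capitalised span + verb phrase ending in a preposition + capitalised span.
TRIPLE = re.compile(
    r"([A-Z]\w+(?: [A-Z]\w+)*) "
    r"((?:was |is |were )?\w+(?: \w+)*? (?:in|by|of)) "
    r"([A-Z]\w+(?: [A-Z]\w+)*)"
)

def open_ie(sentence):
    """Return one (arg1, relation, arg2) triple, or None if no match."""
    m = TRIPLE.search(sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

print(open_ie("Barack Obama was born in Hawaii"))
# → ('Barack Obama', 'was born in', 'Hawaii')
```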
Stanford’s state-of-the-art OpenIE system:
1. Using dependencies, it splits each sentence into clauses.
2. Each clause is maximally shortened into sentence fragments.
3. These fragments are output as OpenIE triples.
STANFORD OPENIE
Precision, Recall, F-measure for supervised methods.
For unsupervised methods, evaluation is much harder.
We can estimate Precision using a small labelled sample:
Precision at different levels of recall, e.g. 1,000 new relations,
10,000 new relations, etc.
EVALUATION OF RELATION EXTRACTION
OTHER INFORMATION EXTRACTION TASKS
Coreference resolution: finding all expressions that refer to the
same entity in a text.
COREFERENCE RESOLUTION
Neural coref: a Python package for coreference resolution.
https://github.com/huggingface/neuralcoref
You can also try it online:
https://huggingface.co/coref/
COREFERENCE RESOLUTION
NEURAL COREF: EXAMPLE
Event extraction: extracting structured information about events
from text. Often <entity> - <verb> - <entity>, as in relation
extraction. Very useful for analysing news stories.
Steve Jobs died in 2011 → (Steve Jobs; death in; 2011)
New Android phone was announced by Google →
(New Android phone; announcement by; Google)
EVENT EXTRACTION
Temporal expression extraction: structuring time mentions in
text, often associated with event extraction.
Event 1: mid 80’s → released a pass.
Event 2: ~4 seconds later → slammed to the ground.
TEMPORAL EXPRESSION EXTRACTION
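Rule-based temporal expression extraction can be sketched with a few regexes; real systems (e.g. Stanford's SUTime) use much richer grammars, and the patterns below are illustrative only.

```python
import re

TIME_PATTERNS = [
    r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b",                       # 10am, 10:30 pm
    r"\b\d{1,2}(?:st|nd|rd|th)?\s+(?:January|February|March|April|May|"
    r"June|July|August|September|October|November|December)\b",  # 8th January
    r"\b(?:next|last)\s+(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
]

def extract_times(text):
    """Return all substrings matching any temporal pattern."""
    found = []
    for pattern in TIME_PATTERNS:
        found += re.findall(pattern, text, flags=re.IGNORECASE)
    return found

print(extract_times("It will be next Monday 10am-11am, on 8th January."))
```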
TEMPLATE/SLOT FILLING
Slot filling (aka Knowledge Base Population): task of taking an
incomplete knowledge base (e.g., Wikidata), and a large corpus of
text (e.g., Wikipedia), and completing the incomplete elements of the
knowledge base.
ENTITY LINKING
Entity linking: task of taking ambiguous entity mentions and linking
them to concrete entries in a knowledge base.
Alex Jones has a new film.
→ but which Alex Jones?
Can be defined as:
● Classification task.
● Similarity task.
● …
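Entity linking as a similarity task can be sketched by scoring word overlap between the mention's context and each candidate's KB description; the mini knowledge base below is invented for illustration.

```python
# Hypothetical mini knowledge base: entry → short description.
KB = {
    "Alex Jones (radio host)": "American radio show host and commentator",
    "Alex Jones (filmmaker)": "Welsh documentary film director",
}

def link(mention_context, kb):
    """Pick the entry whose description shares most words with the context."""
    context = set(mention_context.lower().split())
    return max(kb, key=lambda e: len(context & set(kb[e].lower().split())))

print(link("Alex Jones has a new film", KB))
# → 'Alex Jones (filmmaker)'
```

A real linker would replace raw word overlap with learned similarity (or treat linking as classification over candidates), but the overall shape is the same.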
INFORMATION EXTRACTION: SUMMARY
Involves a broad range of tasks, with a common goal:
unstructured text → structured data
Sequence classification is often useful.
As with many other tasks, availability of data is crucial:
Labelled corpora.
Gazetteers.
Thesauri.
RESOURCES
Stanford CRF NER:
https://nlp.stanford.edu/software/CRF-NER.shtml
Stanford CoreNLP:
https://stanfordnlp.github.io/CoreNLP/
iepy:
https://pypi.python.org/pypi/iepy
Stanford-OpenIE-Python:
https://github.com/philipperemy/Stanford-OpenIE-Python
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. 3rd edition. Chapter 20.
Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 7.