Information Extraction 1: Chunking & Named Entities
Data Science Group (Informatics), NLE/ANLP, Autumn 2015

This time:
Information Extraction
  What is Information Extraction?
  Tasks in IE
Chunking
  What is chunking?
  Chunking vs. parsing
  IOB labelling
  Chunking as sequence labelling
Named Entity Recognition
  What is a named entity?
  Challenges in NER
  IOB tags again
  NER as classification

Information Extraction
Suppose that we want to keep an up-to-date record of who currently holds the key executive positions at major companies
This sort of information is embedded within news articles
  Google is to change its name to Alphabet, alongside a major restructuring. Sundar Pichai, who before ran most of Google’s most important products, will become its CEO. (Pichai had been rumoured to be in the running for the vacant position of Twitter CEO.) – Independent, 11.08.2015
Information Extraction (IE) is the language engineering application that addresses the problem of extracting information from unstructured text
Not looking for just anything that happens to be there
We know the types of things we want to find out
Goal is to complete entries in a database
Values for particular fields are needed

  Company    Role   Person
  Alphabet   CEO    Larry Page
  Google     CEO    Sundar Pichai
  Apple      CEO    Tim Cook

Tasks in Information Extraction
Text Chunking
  Simplified form of parsing that groups words in a sentence into short phrases
Named Entity Recognition (NER)
  Finds and classifies strings of tokens that mention named entities
  Entity classes: people, places, organisations, etc
Coreference Resolution
  Links named entity mentions in a document that refer to the same entity

Tasks in Information Extraction (cont.)
Entity Linking
  Associates named entity mentions with concepts in a knowledge base
Relation Recognition
  Finds and classifies relationships between entities
Event Recognition
  Finds and classifies events and the entities in roles associated with the event

Chunking for Information Extraction
[Garcia Alvarado], 56, [was killed] when [a bomb] placed by [urban guerrillas] [on his vehicle] exploded as it came to a halt at an intersection [in downtown San Salvador].

   0. Message: ID                   TST2-MUC4-0048
   3. Incident: Location            El Salvador: San Salvador (City)
   4. Incident: Type                Bombing
   6. Incident: Instrument ID       “bomb”
   9. Perp: Individual ID           “urban guerrillas”
  12. Phys Tgt: ID                  “vehicle”
  18. Hum Tgt: Name                 “Garcia Alvarado”
  23. Hum Tgt: Effect of Incident   Death: “Garcia Alvarado”

What is Chunking?
Grouping words into syntactically correlated phrases (chunks)
  [NP The current account deficit] [VP will narrow] to [NP only $ 1.8 billion]

Chunking: Simplified Syntactic Analysis
Noun chunks
  cats
  the car
  the cheap plastic garden seat
  many of my favourite 1980’s rock songs
  Apple Computers Inc
  The highly successful University of Sussex
  Due South, Brighton, East Sussex

Chunking: Simplified Syntactic Analysis
Verb chunks: verbs, phrasal verbs, etc
  snores
  take
  used to be
  begins at
  carry out
  born into
  has been successfully sold by
  was happily taken back by
  is looking to raise

Syntactic Analysis of Text
Traditional view of syntactic analysis:
  Determining the complete syntactic structure of a sentence
  Often used as a step in processing pipeline
  Hierarchical analysis of how phrases combine to make other phrases
  Useful for some tasks

Parsing
The man hit the boy with the stick
  [S [NP [Det The] [N man]] [VP [V hit] [NP [NP [Det the] [N boy]] [PP [Prep with] [NP [Det the] [N stick]]]]]]

Parsing
Problems:
  Efficiency: complexity usually worse than linear in string length
  Coverage: will never have complete coverage of a language
  Ambiguity: will often produce incorrect syntactic structure

Why Chunking?
  More efficient
  More robust to unexpected input
  Often gives detailed enough structure

Technology for Chunking
Chunking analysis is not hierarchical
Finite-state technology applicable
  Regular expressions
  IOB labelling
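
Because chunks are flat rather than nested, a single regular expression over the PoS tag sequence is enough to find them. The following is a minimal sketch of that idea, not the course's own implementation: the function name `np_chunks`, the noun-chunk pattern, and the use of Penn Treebank-style tags (with NN for a common noun) are all assumptions made for illustration.

```python
import re

# Assumed noun-chunk pattern over Penn Treebank-style tags (not from the slides):
# a pronoun, or an optional determiner, any adjectives, then one or more nouns.
# (?<!\S) forces a match to begin at the start of a tag, never inside one.
NP_PATTERN = re.compile(r"(?<!\S)(?:PRP |(?:DT )?(?:JJ )*(?:NN[SP]* )+)")


def np_chunks(tagged):
    """Return noun chunks as lists of tokens, given (token, tag) pairs."""
    # Write the tag sequence as one string, each tag followed by a space,
    # so that a match over characters always covers whole tags.
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tags):
        # The number of spaces before an offset equals the number of tags before it.
        start = tags[:match.start()].count(" ")
        end = tags[:match.end()].count(" ")
        chunks.append([token for token, _ in tagged[start:end]])
    return chunks


sentence = [("We", "PRP"), ("saw", "VBD"), ("the", "DT"), ("brown", "JJ"), ("dog", "NN")]
print(np_chunks(sentence))
```

On this sentence the sketch groups "We" and "the brown dog" as noun chunks and leaves "saw" outside any chunk. A different chunk type would need its own pattern.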

Chunking as IOB Labelling
IOB labels used to indicate chunk boundaries
B and I labels have versions for each type of chunk
  B-X   start of new X chunk
  I-X   continuation of an X chunk
  O     not part of a chunk
X can be NP or VP

Chunking as Sequence Labelling
Example of IOB labelling:
  Token       We     saw    the    brown   dog
  PoS tag     PRP    VBD    DT     JJ      N
  IOB label   B-NP   O      B-NP   I-NP    I-NP
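
Reading chunks back out of an IOB-labelled sequence takes one left-to-right pass. The sketch below assumes labels of the form B-X, I-X and O as defined above; the function name `iob_to_chunks` and the decision to treat a stray I-X (one with no preceding chunk of type X) as the start of a new chunk are illustrative choices, not something the slides specify.

```python
def iob_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, [tokens]) pairs according to IOB labels."""
    chunks = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, chunk_type = label.partition("-")
        # The open chunk continues only on an I-X label of the same type X.
        continues = prefix == "I" and chunk_type == current_type
        if not continues and current_tokens:
            # O, B-X, or I-Y with a different type all close the open chunk.
            chunks.append((current_type, current_tokens))
            current_type, current_tokens = None, []
        if continues:
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # Assumption: a stray I-X is treated like B-X.
            current_type, current_tokens = chunk_type, [token]
    if current_tokens:
        chunks.append((current_type, current_tokens))
    return chunks


tokens = ["We", "saw", "the", "brown", "dog"]
labels = ["B-NP", "O", "B-NP", "I-NP", "I-NP"]
print(iob_to_chunks(tokens, labels))
```

For the example sequence this yields two NP chunks, "We" and "the brown dog". The B label is what allows two adjacent chunks of the same type to be kept apart.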

IOB Labelling is like PoS Tagging
The generic sequence labelling problem
  Part-of-speech tagging involves labelling a sequence of tokens with part-of-speech tags
  IOB chunking involves labelling a sequence of part-of-speech tags with IOB labels
The same technology can be used for both, e.g. Hidden Markov Models

Chunking as a Sequence Labelling Task
  raw text -> tokeniser -> token sequence -> HMM PoS labeller -> PoS sequence -> HMM IOB labeller -> IOB sequence

Named Entity Recognition
The task of identifying and classifying chunks that refer to named entities

Named Entities
What is a Named Entity?
An entity of a particular class, usually referred to with a proper name (often capitalised)
  Sir Basil Spence, University of Sussex, Brighton

Named Entity Recognition
Two aspects to the task:
  Identify contiguous sequences of tokens that name an entity (identification of chunks)
    After the news broke, a representative of the University of the Highlands and Islands of Scotland spoke …
  Classify chunks according to the class of entity

Challenges in NER
Ambiguity can be a problem
  May be more than one potential referent of same class
    Bush could be one of two former US presidents
    Bach could be any one of a number of composers (not significant for NER, but matters for entity linking)
  Class of entity can depend on context
    Wednesday could be the day of the week or a football team
    Java could be the place or the programming language

Using IOB Tags
Can be viewed as sequence labelling task
Use IOB tags
Associate IOB tags with tokens (not PoS tags)
  B-X   token at start of chunk for named entity in class X
  I-X   token continues chunk for named entity in class X
  O     token not part of a named entity
Entity classes:
  person, place, company, address, date, time, ...
  can be fine-grained and domain-specific, e.g. gene, protein, ...

IOB Tagging as a Multiway Classification Task
Choice of IOB tag for a token is a classification task
Use features within window around token to be classified
What features should be used for the classification?
  token
  PoS
  IOB tag from phrasal chunker
  Information about shape of token
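
Treating the choice of IOB tag as a classification decision means the tagger can run left to right, choosing one tag per token and letting each decision see the tags already chosen. The sketch below shows only that control flow. `toy_classifier` is a hypothetical hand-written stand-in for a trained classifier, and the single entity class `ENT` is an assumption made to keep the example small.

```python
def toy_classifier(tokens, previous_tags, i):
    """Hypothetical stand-in for a trained classifier: capitalised means entity."""
    if not tokens[i][:1].isupper():
        return "O"
    # A capitalised token directly after an entity token continues that entity.
    if previous_tags and previous_tags[-1] != "O":
        return "I-ENT"
    return "B-ENT"


def greedy_iob_tag(tokens, classify):
    """Assign one IOB tag per token, left to right, one classification at a time."""
    tags = []
    for i in range(len(tokens)):
        # The classifier sees the tokens and the tags chosen so far.
        tags.append(classify(tokens, tags, i))
    return tags


tokens = ["Sundar", "Pichai", "will", "run", "Google"]
print(greedy_iob_tag(tokens, toy_classifier))
```

A capitalisation rule like this one would also tag sentence-initial words and titles such as CEO as entities, which is one reason a real classifier draws on several features at once.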

Shape Feature
  Shape                                  Example
  Lower                                  bush
  Capitalised                            Sussex
  All capitalised                        USSU
  Mixed case                             iPhone
  Capitalised character with full stop   D.
  Ends in digit                          U2
  Contains hyphen                        X-Files

Named Entity Recognition as Classification
  NE-IOB          e1 … ei-k … ei-1   ?
  Shape feature   c1 … ci-k … ci-1   ci   ci+1 … ci+k … cn
  Chunk IOB       p1 … pi-k … pi-1   pi   pi+1 … pi+k … pn
  PoS tags        t1 … ti-k … ti-1   ti   ti+1 … ti+k … tn
  Tokens          w1 … wi-k … wi-1   wi   wi+1 … wi+k … wn
Context Window: positions i-k to i+k
Classifier: predicts the NE-IOB tag (?) at position i from the features in the context window

Features for NER
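
The rows of the classification diagram (tokens, PoS tags, chunk IOB tags, shape, and the NE tags already assigned to the left) can be gathered into one feature dictionary per position. This is a minimal sketch under stated assumptions: the function names, the feature key format, the `<PAD>` marker, the example tags, and the order in which overlapping shape categories are tested are all illustrative choices rather than anything fixed by the slides.

```python
import re


def shape(token):
    """Map a token to one coarse shape category (test order is an assumption)."""
    if re.fullmatch(r"[A-Z]\.", token):
        return "cap-period"        # D.
    if "-" in token:
        return "contains-hyphen"   # X-Files
    if token[-1:].isdigit():
        return "ends-in-digit"     # U2
    if token.islower():
        return "lower"             # bush
    if token.isupper():
        return "all-caps"          # USSU
    if token[:1].isupper() and token[1:].islower():
        return "capitalised"       # Sussex
    return "mixed"                 # iPhone


def window_features(tokens, pos_tags, chunk_tags, previous_ne_tags, i, k=2):
    """Collect features for position i from a window of k positions either side."""
    features = {}
    for offset in range(-k, k + 1):
        j = i + offset
        if not 0 <= j < len(tokens):
            features[f"word[{offset}]"] = "<PAD>"  # window runs off the sentence
            continue
        features[f"word[{offset}]"] = tokens[j]
        features[f"pos[{offset}]"] = pos_tags[j]
        features[f"chunk[{offset}]"] = chunk_tags[j]
        features[f"shape[{offset}]"] = shape(tokens[j])
        if offset < 0:
            # NE tags are only available for positions already classified.
            features[f"ne[{offset}]"] = previous_ne_tags[j]
    return features


tokens = ["Sundar", "Pichai", "will", "run", "Google"]
pos_tags = ["NNP", "NNP", "MD", "VB", "NNP"]            # assumed tags
chunk_tags = ["B-NP", "I-NP", "B-VP", "I-VP", "B-NP"]   # assumed chunk labels
features = window_features(tokens, pos_tags, chunk_tags, ["B-PER"], i=1, k=1)
print(features)
```

A dictionary of this kind is the usual input to a multiway classifier, which would then choose among the B-X, I-X and O tags for position i.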
Predictive power of features varies depending on context
  Word shape is less useful for transcribed speech or microblog text

Entity Lists: People Lists and Gazetteers
Use lists of known entities:
  gazetteers: lists of places
  lists of people, companies, etc
Check for presence of candidate named entities in list
Accurate lists are very effective for NER
Hard to maintain accurate lists
Lists can be automatically generated from Wikipedia
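
Checking candidate named entities against a list can be sketched as a longest-match lookup over token spans. The function name `gazetteer_matches`, the `max_len` cap on entry length, and the tiny example list are assumptions for illustration. A real gazetteer would be far larger and would need a policy for case and spelling variants.

```python
def gazetteer_matches(tokens, gazetteer, max_len=5):
    """Return (start, end, entry) for the longest list entry found at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest span first so "University of Sussex" beats "Sussex".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                found = (i, i + n, candidate)
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched span
        else:
            i += 1
    return matches


gazetteer = {"University of Sussex", "Sussex", "Brighton"}
tokens = ["the", "University", "of", "Sussex", "in", "Brighton"]
print(gazetteer_matches(tokens, gazetteer))
```

The matched spans can be used directly as candidate entities, or turned into one more feature for the classifier (for example, whether a token lies inside a list match).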

Next Topic: NE Linking and Relation Extraction
Linking named entities to a knowledge base
Extracting relations between entities