Information Extraction 1: Chunking & Named Entities
This time:
What is Information Extraction?
Tasks in IE
Chunking
What is chunking?
Chunking vs. parsing
IOB labelling
Chunking as sequence labelling
Named Entity Recognition
What is a named entity?
Challenges in NER
IOB tags again
NER as classification
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
1 / 29
Information Extraction
Suppose that we want to keep an up-to-date record of who currently holds the key executive positions at major companies
This sort of information is embedded within news articles
Google is to change its name to Alphabet, alongside a major restructuring. Sundar Pichai, who before ran most of Google’s most important products, will become its CEO. (Pichai had been rumoured to be in the running for the vacant position of Twitter CEO.) – Independent, 11.08.2015
Information Extraction (IE) is the language engineering application that addresses the problem of extracting structured information from unstructured text
Information Extraction
Not looking for just anything that happens to be there
We know the types of things we want to find out
Goal is to complete entries in a database
Values for particular fields are needed
Company    Role   Person
Alphabet   CEO    Larry Page
Google     CEO    Sundar Pichai
Apple      CEO    Tim Cook
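The database-filling goal can be pictured as a minimal sketch in Python. The `ExecutiveRecord` type and the `holder` helper are hypothetical names introduced only for illustration; the rows are the ones in the table on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutiveRecord:
    """One database entry: the fields an IE system has to fill in."""
    company: str
    role: str
    person: str


# The entries from the slide, as they would look once extracted.
records = [
    ExecutiveRecord("Alphabet", "CEO", "Larry Page"),
    ExecutiveRecord("Google", "CEO", "Sundar Pichai"),
    ExecutiveRecord("Apple", "CEO", "Tim Cook"),
]


def holder(records, company, role):
    """Return the person filling `role` at `company`, or None if unknown."""
    for record in records:
        if record.company == company and record.role == role:
            return record.person
    return None


print(holder(records, "Google", "CEO"))  # Sundar Pichai
```

The point of the sketch is that the field names are fixed in advance; extraction only supplies the values.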
Tasks in Information Extraction
Text Chunking
Simplified form of parsing that groups words in a sentence into short phrases
Named Entity Recognition (NER)
Finds and classifies strings of tokens that mention named entities
Entity classes: people, places, organisations, etc
Coreference Resolution
Links named entity mentions in a document that refer to the same entity
Tasks in Information Extraction (cont.)
Entity Linking
Associates named entity mentions with concepts in a knowledge base
Relation Recognition
Finds and classifies relationships between entities
Event Recognition
Finds and classifies events and the entities in roles associated with the event
Chunking for Information Extraction
0.   Message: ID                   TST2-MUC4-0048
3.   Incident: Location            El Salvador: San Salvador (City)
4.   Incident: Type                Bombing
6.   Incident: Instrument ID       “bomb”
9.   Perp: Individual ID           “urban guerrillas”
12.  Phys Tgt: ID                  “vehicle”
18.  Hum Tgt: Name                 “Garcia Alvarado”
23.  Hum Tgt: Effect of Incident   Death: “Garcia Alvarado”

Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador.
What is Chunking?
Grouping words into syntactically correlated phrases (chunks)
[NP The current account deficit] [VP will narrow] to [NP only $ 1.8 billion]
Chunking: Simplified Syntactic Analysis
Noun chunks
cats
the car
the cheap plastic garden seat
many of my favourite 1980’s rock songs
Apple Computers Inc
The highly successful University of Sussex
Due South, Brighton, East Sussex
Chunking: Simplified Syntactic Analysis
Verb chunks: verbs, phrasal verbs, etc
snores
take
used to be
begins at
carry out
born into
has been successfully sold by
was happily taken back by
is looking to raise
Syntactic Analysis of Text
Traditional view of syntactic analysis:
Determining the complete syntactic structure of a sentence
Often used as a step in processing pipeline
Hierarchical analysis of how phrases combine to make other phrases
Useful for some tasks
Parsing
[S
  [NP [Det The] [N man]]
  [VP [V hit]
      [NP [NP [Det the] [N boy]]
          [PP [Prep with]
              [NP [Det the] [N stick]]]]]]
Parsing
Problems:
Efficiency: complexity is usually worse than linear in string length
Coverage: will never have complete coverage of a language
Ambiguity: will often produce an incorrect syntactic structure
Why Chunking?
More efficient
More robust to unexpected input
Often gives detailed enough structure
Technology for Chunking
Chunking analysis is not hierarchical
Finite-state technology applicable
Regular expressions
IOB labelling
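As an illustration of the regular-expression approach, here is a minimal sketch of a noun chunker that matches a pattern over PoS tags rather than over words. The pattern (optional determiner, any adjectives, one or more nouns), the Penn-style tag names and the function name `np_chunks` are assumptions made for this example, not a grammar taken from the course.

```python
import re

# Hypothetical noun-chunk pattern over PoS tags: an optional determiner,
# any number of adjectives, then one or more nouns (NN, NNS, NNP, NNPS).
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*(<NN[A-Z]*>)+")


def np_chunks(tagged):
    """Return noun chunks as lists of words, given (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each "<" before the match corresponds to one earlier token.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group(0).count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


sentence = [("the", "DT"), ("cheap", "JJ"), ("plastic", "NN"),
            ("garden", "NN"), ("seat", "NN"), ("broke", "VBD")]
print(np_chunks(sentence))  # [['the', 'cheap', 'plastic', 'garden', 'seat']]
```

Because the pattern is a plain regular expression with no recursion, the analysis is flat: chunks never nest, which is what makes finite-state methods sufficient.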
Chunking as IOB Labelling
IOB labels used to indicate chunk boundaries
B and I labels have versions for each type of chunk
X can be NP or VP
B-X   start of new X chunk
I-X   continuation of an X chunk
O     not part of a chunk
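The labelling scheme can be made concrete with a minimal sketch that turns chunk spans into IOB labels. The function name `spans_to_iob` and the span format (start, end, type) with an exclusive end index are assumptions for this example.

```python
def spans_to_iob(n_tokens, spans):
    """Convert non-overlapping (start, end, chunk_type) spans to IOB labels.

    `end` is exclusive; tokens outside every span are labelled "O".
    """
    labels = ["O"] * n_tokens
    for start, end, chunk_type in spans:
        labels[start] = f"B-{chunk_type}"      # first token of the chunk
        for i in range(start + 1, end):
            labels[i] = f"I-{chunk_type}"      # remaining tokens of the chunk
    return labels


tokens = "The current account deficit will narrow to only $ 1.8 billion".split()
spans = [(0, 4, "NP"), (4, 6, "VP"), (7, 11, "NP")]
for token, label in zip(tokens, spans_to_iob(len(tokens), spans)):
    print(token, label)
```

Here "to" falls outside every span and so receives the label O, while each chunk contributes exactly one B label followed by I labels.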
Chunking as Sequence Labelling
Example of IOB labelling:
We     saw    the    brown   dog
PRP    VBD    DT     JJ      N
B-NP   O      B-NP   I-NP    I-NP
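Reading chunks back off an IOB sequence is a simple left-to-right scan. The sketch below is one way to do it; the function name `iob_to_chunks` is hypothetical, and treating an I-X label with no open X chunk as if it were O is a design choice made here, not something the slides prescribe.

```python
def iob_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, text) pairs according to IOB labels."""
    chunks = []
    current = None  # (chunk_type, [words]) for the chunk being built
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                chunks.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            # "O", or an I- label that does not continue the open chunk:
            # close any open chunk and leave this token outside all chunks.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(chunk_type, " ".join(words)) for chunk_type, words in chunks]


tokens = ["We", "saw", "the", "brown", "dog"]
labels = ["B-NP", "O", "B-NP", "I-NP", "I-NP"]
print(iob_to_chunks(tokens, labels))  # [('NP', 'We'), ('NP', 'the brown dog')]
```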
IOB Labelling is like PoS Tagging
The generic sequence labelling problem
Part-of-speech tagging involves labelling a sequence of tokens with part-of-speech tags
IOB chunking involves labelling a sequence of part-of-speech tags with IOB labels
The same technology can be used for both, e.g. Hidden Markov Models
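To show how an HMM can label a PoS sequence with IOB tags, here is a minimal sketch of Viterbi decoding. All probabilities below are invented toy values chosen for illustration, not estimates from any corpus, and the function and variable names are assumptions of this example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state sequence for a non-empty observation sequence."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(observations[0], 0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r].get(s, 0))) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (made-up numbers): hidden states are IOB labels, observations are PoS tags.
states = ["B-NP", "I-NP", "O"]
start_p = {"B-NP": 0.6, "I-NP": 0.0, "O": 0.4}
trans_p = {
    "B-NP": {"B-NP": 0.1, "I-NP": 0.5, "O": 0.4},
    "I-NP": {"B-NP": 0.1, "I-NP": 0.5, "O": 0.4},
    "O": {"B-NP": 0.6, "I-NP": 0.0, "O": 0.4},
}
emit_p = {
    "B-NP": {"PRP": 0.4, "DT": 0.5, "JJ": 0.05, "N": 0.05},
    "I-NP": {"JJ": 0.4, "N": 0.6},
    "O": {"VBD": 0.9, "PRP": 0.05, "N": 0.05},
}

pos_tags = ["PRP", "VBD", "DT", "JJ", "N"]  # We saw the brown dog
print(viterbi(pos_tags, states, start_p, trans_p, emit_p))
```

With these toy values the decoder returns B-NP, O, B-NP, I-NP, I-NP. Setting the probability of moving from O to I-NP to zero encodes the constraint that a chunk cannot be continued before it has been started.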
Chunking as a Sequence Labelling Task
raw text -> tokeniser -> token sequence -> HMM PoS labeller -> PoS sequence -> HMM IOB labeller -> IOB sequence
Named Entity Recognition
The task of identifying and classifying chunks that refer to named entities
Named Entities
What is a Named Entity?
An entity of a particular class, usually referred to with a proper name (often capitalised)
Sir Basil Spence, University of Sussex, Brighton
Named Entity Recognition
Two aspects to the task:
Identify contiguous sequences of tokens that name an entity
- identification of chunks
After the news broke, a representative of the [University of the Highlands and Islands of Scotland] spoke …
Classify chunks according to the class of entity
- person, place, company, address, date, time, ...
- can be fine-grained and domain-specific, e.g. gene, protein, ...
Challenges in NER
Ambiguity can be a problem
May be more than one potential referent of the same class
Bush could be one of two former US presidents
Bach could be any one of a number of composers
(not significant for NER, but matters for entity linking)
Class of entity can depend on context
Wednesday could be the day of the week or a football team
Java could be the place or the programming language
Using IOB Tags
Can be viewed as a sequence labelling task
Use IOB tags
Associate IOB tags with tokens (not PoS tags)
B-X   token at start of chunk for named entity in class X
I-X   token continues chunk for named entity in class X
O     token not part of a named entity
IOB Tagging as a Multiway Classification Task
Choice of IOB tag for a token is a classification task
Use features within window around token to be classified
What features should be used for the classification?
- token
- PoS
- IOB tag from phrasal chunker
- information about shape of token
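The feature list above can be sketched as a function that collects features from a window around the token being classified. The feature names, the `<PAD>` marker for positions outside the sentence, and the single capitalisation flag standing in for shape information are all assumptions of this example.

```python
def window_features(tokens, pos_tags, chunk_tags, i, k=2):
    """Features for token i drawn from positions i-k .. i+k."""
    features = {}
    for offset in range(-k, k + 1):
        j = i + offset
        inside = 0 <= j < len(tokens)  # False when the window leaves the sentence
        features[f"token[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
        features[f"pos[{offset:+d}]"] = pos_tags[j] if inside else "<PAD>"
        features[f"chunk[{offset:+d}]"] = chunk_tags[j] if inside else "<PAD>"
        # Crude shape information: does the token start with a capital letter?
        features[f"cap[{offset:+d}]"] = tokens[j][:1].isupper() if inside else False
    return features


tokens = ["Sundar", "Pichai", "will", "become", "CEO"]
pos_tags = ["NNP", "NNP", "MD", "VB", "NN"]
chunk_tags = ["B-NP", "I-NP", "B-VP", "I-VP", "B-NP"]
features = window_features(tokens, pos_tags, chunk_tags, 0, k=1)
print(features["token[+0]"], features["pos[+1]"], features["token[-1]"])
```

Keying each feature by its offset lets a classifier learn, for instance, that a capitalised token at offset +1 is evidence about the current token. The resulting dictionary can be passed to any classifier that accepts sparse named features.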
Shape Feature
Shape                                   Example
Lower                                   bush
Capitalised                             Sussex
All capitalised                         USSU
Mixed case                              iPhone
Capitalised character with full stop    D.
Ends in digit                           U2
Contains hyphen                         X-Files
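A shape feature of this kind can be computed with a few string tests. The sketch below is one possible implementation: the category names follow the table, but the order in which the tests are applied (hyphen and digit first, so that U2 is not classed as all capitalised) and the fallback category "other" are choices made for this example.

```python
def shape(token):
    """Map a token to one of the shape categories in the table."""
    if "-" in token:
        return "contains hyphen"
    if token[-1:].isdigit():
        return "ends in digit"
    if len(token) == 2 and token[0].isupper() and token[1] == ".":
        return "capitalised character with full stop"
    if token.islower():
        return "lower"
    if token.isupper():
        return "all capitalised"
    if token[:1].isupper() and token[1:].islower():
        return "capitalised"
    if token.isalpha():
        return "mixed case"
    return "other"  # fallback not in the table, e.g. punctuation


for example in ["bush", "Sussex", "USSU", "iPhone", "D.", "U2", "X-Files"]:
    print(example, "->", shape(example))
```

Some tokens fit more than one category, so the order of the tests is itself a modelling decision.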
Named Entity Recognition as Classification
[Figure: the NE-IOB classifier predicts the tag e_i (shown as "?") for token i. Its inputs are the features inside a context window running from position i-k to position i+k: shape features c, chunk IOB tags p, PoS tags t and tokens w, together with the NE tags e_(i-k) ... e_(i-1) already assigned to the preceding tokens.]
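The left-to-right use of a classifier, where earlier decisions become features for later ones, can be sketched as follows. The function `greedy_ne_tag`, the toy rule-based `toy_classifier` and the single entity class ENT are hypothetical stand-ins for a trained multiway classifier; for brevity only tokens and previous tags are passed in, not PoS, chunk or shape features.

```python
def greedy_ne_tag(tokens, classify, k=2):
    """Tag tokens left to right, feeding the last k predicted tags back in."""
    tags = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - k): i + k + 1]   # tokens i-k .. i+k
        previous = tags[max(0, i - k):]             # tags for i-k .. i-1
        tags.append(classify(tokens[i], window, previous))
    return tags


def toy_classifier(token, window, previous_tags):
    """Stand-in for a trained classifier: looks at capitalisation only."""
    if not token[:1].isupper():
        return "O"
    if previous_tags and previous_tags[-1] != "O":
        return "I-ENT"   # continue the entity started just before
    return "B-ENT"


tokens = "a representative of Sussex University spoke".split()
print(greedy_ne_tag(tokens, toy_classifier))
# ['O', 'O', 'O', 'B-ENT', 'I-ENT', 'O']
```

The toy rule would wrongly tag any capitalised sentence-initial word, which is one reason a real system combines several features rather than relying on shape alone.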
Entity Lists: People Lists and Gazetteers
Use lists of known entities:
— gazetteers: lists of places
— lists of people, companies, etc
Check for presence of candidate named entities in list
Accurate lists are very effective for NER
Hard to maintain accurate lists
Lists can be automatically generated from Wikipedia
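Checking candidates against a list can be sketched as a longest-match lookup over token sequences. The function name `gazetteer_matches`, the `max_len` limit and the tiny example list are assumptions of this illustration; a real gazetteer would be far larger.

```python
def gazetteer_matches(tokens, gazetteer, max_len=5):
    """Return (start, end, entry) for list entries found in the tokens.

    At each position the longest matching entry wins; `end` is exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                matches.append((i, i + n, candidate))
                i += n          # skip past the matched entry
                break
        else:
            i += 1              # no entry starts at this token
    return matches


places = {"Brighton", "East Sussex", "San Salvador"}
tokens = "a bomb exploded in downtown San Salvador".split()
print(gazetteer_matches(tokens, places))  # [(5, 7, 'San Salvador')]
```

An exact lookup like this depends on the list being accurate and complete, which reflects the maintenance problem noted above.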
Features for NER
Predictive power of features varies depending on context
Word shape is less useful for transcribed speech or microblog text
Next Topic: NE Linking and Relation Extraction
Linking named entities to a knowledge base
Extracting relations between entities