COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 18
Semester 1 2021 Week 9
Jey Han Lau
Information Extraction
Information Extraction
• Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)
• Main goal: turn text into structured data
Applications
• Stock analysis
‣ Gather information from news and social media
‣ Summarise texts into a structured format
‣ Decide whether to buy/sell at current stock price
• Medical research
‣ Obtain information from articles about diseases
and treatments
‣ Decide which treatment to apply for a new patient
How?
• Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)
• Two steps:
‣ Named Entity Recognition (NER): identify entities
such as “Brasilia” and “1960”
‣ Relation Extraction: use context to find the relation
between “Brasilia” and “1960” (“founded”)
Machine Learning in IE
• Named Entity Recognition (NER): sequence
models such as RNNs, HMMs or CRFs.
• Relation Extraction: mostly classifiers, either
binary or multi-class.
• This lecture: how to frame these two tasks in order
to apply sequence labellers and classifiers.
Outline
• Named Entity Recognition
• Relation Extraction
• Other IE Tasks
Named Entity Recognition
Named Entity Recognition
Citing high fuel prices, United Airlines said Friday it
has increased fares by $6 per round trip on flights to
some cities also served by lower-cost carriers.
American Airlines, a unit of AMR Corp., immediately
matched the move, spokesman Tim Wagner said.
United, a unit of UAL Corp., said the increase took
effect Thursday and applies to most routes where it
competes against discount carriers, such as Chicago
to Dallas and Denver to San Francisco.
JM3, Ch 17
Named Entity Recognition
Citing high fuel prices, [ORG United Airlines] said
[TIME Friday] it has increased fares by [MONEY
$6] per round trip on flights to some cities also
served by lower-cost carriers. [ORG American
Airlines], a unit of [ORG AMR Corp.], immediately
matched the move, spokesman [PER Tim Wagner]
said. [ORG United], a unit of [ORG UAL Corp.],
said the increase took effect [TIME Thursday] and
applies to most routes where it competes against
discount carriers, such as [GPE Chicago] to [GPE
Dallas] and [GPE Denver] to [GPE San Francisco].
Typical Entity Tags
• PER: people, characters
• ORG: companies, sports teams
• LOC: regions, mountains, seas
• GPE: countries, states, provinces
(in some tagsets this is labelled as LOC)
• FAC: bridges, buildings, airports
• VEH: planes, trains, cars
• The tag set is application-dependent: some domains deal with specific entities, e.g. proteins and genes
NER as Sequence Labelling
• NE tags can be ambiguous:
‣ “Washington” can be a person, location or
political entity
• Similar problem when doing POS tagging
‣ Incorporate context
• Can we use a sequence tagger for this (e.g. HMM)?
‣ Not directly, as entities can span multiple tokens
‣ Solution: modify the tag set
IO tagging
• [ORG American Airlines], a unit of
[ORG AMR Corp.], immediately
matched the move, spokesman [PER
Tim Wagner] said.
• ‘I-ORG’ represents a token that is
inside an entity (ORG in this case).
• All tokens which are not part of an entity get the ‘O’ tag (for outside).
• Cannot differentiate between:
‣ a single entity with multiple tokens
‣ multiple entities with single tokens
IOB tagging
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• B-ORG represents the beginning of an ORG entity.
• If the entity has more than one token, subsequent tags are represented as I-ORG.
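To make the scheme concrete, here is a small illustrative Python sketch (not from the slides): the example sentence hand-tokenised with IOB tags, plus a helper that recovers entity spans from the tags.

# Illustrative only: the slide's example sentence, tokenised by hand
# and annotated with IOB tags.
tagged = [
    ("American", "B-ORG"), ("Airlines", "I-ORG"), (",", "O"),
    ("a", "O"), ("unit", "O"), ("of", "O"),
    ("AMR", "B-ORG"), ("Corp.", "I-ORG"), (",", "O"),
    ("immediately", "O"), ("matched", "O"), ("the", "O"), ("move", "O"),
    (",", "O"), ("spokesman", "O"),
    ("Tim", "B-PER"), ("Wagner", "I-PER"), ("said", "O"), (".", "O"),
]

def extract_entities(tagged):
    """Recover (entity text, type) spans from a list of (token, IOB tag) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in tagged:
        if tag.startswith("B-"):                # a new entity starts here
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the current entity
            current.append(token)
        else:                                   # outside any entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

print(extract_entities(tagged))
# [('American Airlines', 'ORG'), ('AMR Corp.', 'ORG'), ('Tim Wagner', 'PER')]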
Annotate the following sentence with NER tags (IOB)
Steve Jobs founded Apple Inc. in 1976
Tagset: PER, ORG, LOC, TIME
PollEv.com/jeyhanlau569
NER as Sequence Labelling
• Given such a tagging scheme, we can train any
sequence labelling model
• In theory, HMMs can be used but discriminative
models such as CRFs are preferred
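A minimal sketch of how such a sequence labeller could be trained, assuming the third-party sklearn-crfsuite package; the feature function and the one-sentence training set are toy placeholders, not the lecture's setup.

# Sketch: training a CRF sequence labeller on IOB-tagged sentences.
import sklearn_crfsuite

def word2features(sent, i):
    """Very small feature set for token i of a sentence (list of tokens)."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data: one sentence with its IOB tags (real corpora have thousands).
sentences = [["United", "Airlines", "said", "Friday", "."]]
labels    = [["B-ORG", "I-ORG", "O", "B-TIME", "O"]]

X = [[word2features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))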
NER: Features
• Example: L’Occitane
• Prefix/suffix:
‣ L / L’ / L’O / L’Oc / …
‣ e / ne / ane / tane / …
• Word shape:
‣ X’Xxxxxxxx / X’Xx
‣ XXXX-XX-XX (date!)
• POS tags / syntactic chunks: many entities are nouns
or noun phrases.
• Presence in a gazetteer: lists of entities, such as place
names, people’s names and surnames, etc.
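An illustrative sketch of how the prefix/suffix, word-shape and gazetteer features above could be computed; the gazetteer and helper names are invented for illustration.

import re

# A toy gazetteer; real systems use large lists of place/person names.
GAZETTEER = {"chicago", "dallas", "denver", "san francisco"}

def word_shape(word, short=False):
    """Map characters to X/x/d classes, e.g. "L'Occitane" -> "X'Xxxxxxxx"."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    if short:
        # Collapse runs of the same class: "X'Xxxxxxxx" -> "X'Xx"
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape

def affix_features(word, n=4):
    """Prefixes and suffixes up to length n."""
    feats = {}
    for k in range(1, min(n, len(word)) + 1):
        feats[f"prefix{k}"] = word[:k]
        feats[f"suffix{k}"] = word[-k:]
    return feats

word = "L'Occitane"
feats = affix_features(word)
feats["shape"] = word_shape(word)                     # X'Xxxxxxxx
feats["shape_short"] = word_shape(word, short=True)   # X'Xx
feats["in_gazetteer"] = word.lower() in GAZETTEER
print(feats)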
NER: Classifier
Deep Learning for NER
• A state-of-the-art approach uses LSTMs with character and word embeddings (Lample et al., 2016)
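A compressed PyTorch sketch in the spirit of Lample et al. (2016): a character-level BiLSTM representation is concatenated with a word embedding and fed into a word-level BiLSTM. The original model also adds a CRF output layer and pretrained embeddings, both omitted here; sizes and names are illustrative.

import torch
import torch.nn as nn

class CharWordBiLSTMTagger(nn.Module):
    """Sketch of a Lample-style tagger (without the CRF output layer)."""

    def __init__(self, n_chars, n_words, n_tags, char_dim=25, word_dim=100, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # char_ids: (n_words_in_sent, max_word_len); one row of character ids per word
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        char_repr = torch.cat([h[0], h[1]], dim=-1)         # (n_words, 2*char_dim)
        words = self.word_emb(word_ids)                      # (n_words, word_dim)
        x = torch.cat([words, char_repr], dim=-1).unsqueeze(0)
        h_words, _ = self.word_lstm(x)                       # (1, n_words, 2*hidden)
        return self.out(h_words).squeeze(0)                  # (n_words, n_tags) tag scores

# Dummy usage: a 3-word sentence with 3 characters per word.
model = CharWordBiLSTMTagger(n_chars=30, n_words=50, n_tags=9)
scores = model(torch.tensor([1, 2, 3]), torch.tensor([[4, 5, 0], [6, 7, 8], [9, 1, 2]]))
print(scores.shape)   # torch.Size([3, 9])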
Relation Extraction
Relation Extraction
• [ORG American Airlines], a unit of [ORG AMR
Corp.], immediately matched the move, spokesman
[PER Tim Wagner] said.
• Traditionally framed as triple extraction:
‣ unit(American Airlines, AMR Corp.)
‣ spokesman(Tim Wagner, American Airlines)
• Key question: do we know all the possible
relations?
Relation Extraction
‣ unit(American Airlines, AMR Corp.) → subsidiary
‣ spokesman(Tim Wagner, American Airlines) → employment
Methods
• If we have access to a fixed relation database:
‣ Rule-based
‣ Supervised
‣ Semi-supervised
‣ Distant supervision
• If no restrictions on relations:
‣ Unsupervised
‣ Sometimes referred to as “OpenIE”
Rule-Based Relation Extraction
• “Agar is a substance prepared from a mixture of
red algae such as Gelidium, for laboratory or
industrial use.”
• [NP red algae] such as [NP Gelidium]
• NP0 such as NP1 → hyponym(NP1, NP0)
• hyponym(Gelidium, red algae)
• Lexico-syntactic patterns: high precision, low
recall, manual effort required
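A deliberately naive sketch of the pattern above as a regular expression; a real system would match over syntactic NP chunks rather than a crude word window before “such as”.

import re

# Toy version of one Hearst-style pattern:
#   "NP0 such as NP1"  ->  hyponym(NP1, NP0)
# NP0 is approximated by the one or two words immediately before "such as".
PATTERN = re.compile(r"((?:\w+\s)?\w+)\s+such as\s+([A-Z]\w+)")

text = ("Agar is a substance prepared from a mixture of red algae "
        "such as Gelidium, for laboratory or industrial use.")

for m in PATTERN.finditer(text):
    np0, np1 = m.group(1), m.group(2)
    print(f"hyponym({np1}, {np0})")   # hyponym(Gelidium, red algae)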
More Rules
Supervised Relation Extraction
• Assume a corpus with annotated relations
• Two steps (a code sketch follows below). First, determine whether an entity pair is related (binary classification)
‣ For each sentence, gather all possible entity pairs
‣ Annotated pairs are considered positive examples
‣ Non-annotated pairs are taken as negative
examples
• Second, for pairs predicted as positive, use a multi-
class classifier (e.g. SVM) to obtain the relation
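A toy scikit-learn sketch of the two-step pipeline; the feature dicts, labels and sentences are invented placeholders, not the lecture's data.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training examples: one feature dict per candidate entity pair.
pairs = [
    {"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"},
    {"e1_type": "ORG", "e2_type": "PER", "between": "spokesman"},
    {"e1_type": "ORG", "e2_type": "PER", "between": "matched the move"},
]
related = ["yes", "yes", "no"]              # step 1 labels (related or not)
relation = ["subsidiary", "employment"]     # step 2 labels (positive pairs only)

vec = DictVectorizer()
X = vec.fit_transform(pairs)

# Step 1: binary "related or not" classifier over all candidate pairs.
step1 = LinearSVC().fit(X, related)

# Step 2: multi-class relation classifier, trained only on the related pairs.
pos_idx = [i for i, y in enumerate(related) if y == "yes"]
step2 = LinearSVC().fit(X[pos_idx], relation)

# At prediction time: keep pairs step 1 calls "yes", then let step 2 name the relation.
new_pair = vec.transform([{"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"}])
if step1.predict(new_pair)[0] == "yes":
    print(step2.predict(new_pair)[0])        # e.g. "subsidiary"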
Supervised Relation Extraction
• [ORG American Airlines], a unit of [ORG AMR
Corp.], immediately matched the move,
spokesman [PER Tim Wagner] said.
• First:
‣ (American Airlines, AMR Corp.) → positive
‣ (American Airlines, Tim Wagner) → positive
‣ (AMR Corp., Tim Wagner) → negative
• Second:
‣ (American Airlines, AMR Corp.) → subsidiary
‣ (American Airlines, Tim Wagner) → employment
Features
• [ORG American Airlines], a unit of [ORG AMR
Corp.], immediately matched the move, spokesman
[PER Tim Wagner] said.
• (American Airlines, Tim Wagner) → employment
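The kinds of features typically used for such a pair (entity types, headwords, the words between the mentions) can be sketched as a dict; the exact templates below are illustrative, not the lecture's feature set.

# Illustrative features for the pair (American Airlines, Tim Wagner).
features = {
    "e1_headword": "Airlines",
    "e2_headword": "Wagner",
    "e1_type": "ORG",
    "e2_type": "PER",
    "type_pair": "ORG-PER",
    "words_between": [",", "a", "unit", "of", "AMR", "Corp.", ",",
                      "immediately", "matched", "the", "move", ",", "spokesman"],
    "first_word_after_e1": "a",
    "last_word_before_e2": "spokesman",
}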
Semi-supervised Relation Extraction
• Annotated corpora are very expensive to create
• Use seed tuples to bootstrap a classifier
Semi-supervised Relation Extraction
1. Given seed tuple: hub(Ryanair, Charleroi)
2. Find sentences containing terms in seed tuples
• Budget airline Ryanair, which uses Charleroi as a
hub, scrapped all weekend flights out of the airport.
3. Extract general patterns
• [ORG], which uses [LOC] as a hub
4. Find new tuples with these patterns
• hub(Jetstar, Avalon)
5. Add these new tuples to the existing set and repeat from step 2 (a toy sketch follows below)
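A toy, self-contained sketch of this loop on a two-sentence corpus; the pattern-generalisation step is crudely approximated by taking the literal text between the two entity mentions.

import re

corpus = [
    "Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights.",
    "Jetstar, which uses Avalon as a hub, added new routes last month.",
]
tuples = {("Ryanair", "Charleroi")}     # step 1: the seed tuple hub(Ryanair, Charleroi)
patterns = set()

for _ in range(3):                       # repeat steps 2-5 a few times
    # Steps 2-3: sentences mentioning a known pair yield a surface pattern
    # built from the text between the two entity mentions.
    for arg1, arg2 in list(tuples):
        for sent in corpus:
            if arg1 in sent and arg2 in sent:
                between = sent[sent.index(arg1) + len(arg1):sent.index(arg2)]
                patterns.add(r"(\w+)" + re.escape(between) + r"(\w+)")
    # Step 4: apply every pattern to the corpus to harvest new tuples.
    for pattern in patterns:
        for sent in corpus:
            for match in re.finditer(pattern, sent):
                tuples.add((match.group(1), match.group(2)))   # step 5

print(tuples)   # {('Ryanair', 'Charleroi'), ('Jetstar', 'Avalon')} (order may vary)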
What are some issues with such a semi-supervised relation extraction method?
PollEv.com/jeyhanlau569
• Difficult to create seed tuples
• Extracted tuples deviate from the original relation over time
• Difficult to evaluate
• Tend not to find many novel tuples beyond the seed tuples
• Extracted general patterns tend to be very noisy
Semantic Drift
• Pattern: [NP] has a {NP}* hub at [LOC]
• Sydney has a ferry hub at Circular Quay
‣ hub(Sydney, Circular Quay)
• More erroneous patterns are then extracted from this tuple…
• Should only accept patterns with high confidence
Distant Supervision
• Semi-supervised methods assume the existence
of seed tuples to mine new tuples
• Can we mine new tuples directly?
• Distant supervision obtains new tuples from a range of sources:
‣ DBpedia
‣ Freebase
• Generates massive training sets, enabling the use of richer features, with no risk of semantic drift
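A minimal sketch of the idea: tuples taken from a knowledge base (hard-coded here in place of DBpedia/Freebase) automatically label any sentence mentioning both arguments, producing “silver” training data for a supervised classifier.

# Relation tuples from a knowledge base (hard-coded toy stand-in).
kb = {
    ("American Airlines", "AMR Corp."): "subsidiary",
    ("Tim Wagner", "American Airlines"): "employment",
}

corpus = [
    "American Airlines, a unit of AMR Corp., immediately matched the move.",
    "Spokesman Tim Wagner of American Airlines announced the fare change.",
]

# Any sentence containing both arguments is treated as a training example.
training_data = []
for (arg1, arg2), relation in kb.items():
    for sentence in corpus:
        if arg1 in sentence and arg2 in sentence:
            training_data.append((sentence, arg1, arg2, relation))

for example in training_data:
    print(example)
# These automatically labelled examples then feed a supervised relation classifier.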
Unsupervised Relation Extraction
(“OpenIE”)
• No fixed or closed set of relations
• Relations are sub-sentences; they usually contain a verb
• “United has a hub in Chicago, which is the
headquarters of United Continental Holdings.”
‣ “has a hub in”(United, Chicago)
‣ “is the headquarters of”(Chicago, United
Continental Holdings)
• Main problem: mapping relations into canonical forms
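A toy sketch of this idea: the relation phrase is simply the token span between two entity mentions, kept only if it contains a verb; the NER spans and POS tags are hard-coded here rather than predicted.

tokens = ["United", "has", "a", "hub", "in", "Chicago", ",", "which", "is",
          "the", "headquarters", "of", "United", "Continental", "Holdings"]
pos    = ["NNP", "VBZ", "DT", "NN", "IN", "NNP", ",", "WDT", "VBZ",
          "DT", "NN", "IN", "NNP", "NNP", "NNP"]
entities = [(0, 1, "United"), (5, 6, "Chicago"), (12, 15, "United Continental Holdings")]

# Take the tokens between consecutive entity mentions as the relation phrase.
for (_, end1, ent1), (start2, _, ent2) in zip(entities, entities[1:]):
    span = list(range(end1, start2))
    if any(pos[i].startswith("VB") for i in span):   # a relation needs a verb
        phrase = " ".join(tokens[i] for i in span)
        print(f'"{phrase}"({ent1}, {ent2})')
# "has a hub in"(United, Chicago)
# ", which is the headquarters of"(Chicago, United Continental Holdings)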
Evaluation
• NER: F1-measure at the entity level.
• Relation Extraction with known relation set: F1-
measure
• Relation Extraction with unknown relations: much
harder to evaluate
‣ Usually need some human evaluation
‣ Massive datasets used in these settings are
impractical to evaluate manually (use samples)
‣ Can only obtain (approximate) precision, not recall.
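A small sketch of entity-level F1: gold and predicted entities are compared as whole (start, end, type) spans, so a partially correct span counts as an error.

def entity_f1(gold, pred):
    """Entity-level F1 over exact (start, end, type) span matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")]
pred = [(0, 2, "ORG"), (6, 7, "ORG"), (15, 17, "PER")]   # second span is truncated
print(round(entity_f1(gold, pred), 3))   # 0.667: only 2 of 3 spans exactly match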
Other IE Tasks
Temporal Expression Extraction
“[TIME July 2, 2007]: A fare increase initiated [TIME last
week] by UAL Corp’s United Airlines was matched by
competitors over [TIME the weekend], marking the
second successful fare increase in [TIME two weeks].”
• Anchoring: when is “last week”?
‣ “last week” → 2007-W26
• Normalisation: mapping expressions to canonical
forms.
‣ July 2, 2007 → 2007-07-02
• Mostly rule-based approaches
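A toy rule-based sketch of normalisation and anchoring for the two examples above, assuming the document date is known; real temporal taggers use far larger rule sets.

import re
from datetime import date, timedelta

DOC_DATE = date(2007, 7, 2)   # the document date, used for anchoring
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
          "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
          "december": 12}

def normalise(expr, doc_date=DOC_DATE):
    expr = expr.lower()
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})", expr)        # e.g. "July 2, 2007"
    if m and m.group(1) in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2))).isoformat()
    if expr == "last week":                                 # anchor to the document date
        last_week = doc_date - timedelta(weeks=1)
        return f"{last_week.isocalendar()[0]}-W{last_week.isocalendar()[1]:02d}"
    return None

print(normalise("July 2, 2007"))   # 2007-07-02
print(normalise("last week"))      # 2007-W26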
Event Extraction
• “American Airlines, a unit of AMR Corp.,
immediately [EVENT matched] [EVENT the
move], spokesman Tim Wagner [EVENT said].”
• Very similar to NER, including annotation and
learning methods.
• Event ordering: detect how a set of events
happened in a timeline.
‣ Involves both event extraction and temporal
expression extraction.
A Final Word
• Information Extraction is a vast field with many
different tasks and applications
‣ Named Entity Recognition
‣ Relation Extraction
‣ Event Extraction
• Machine learning methods involve classifiers and
sequence labelling models.
Reading
• JM3 Ch. 8.3, 17-17.2
• References:
‣ Lample et al., Neural Architectures for Named Entity
Recognition, NAACL 2016
https://github.com/glample/tagger