Information Extraction
COMP90042
Natural Language Processing
Lecture 18
Semester 1 2021 Week 9 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Information Extraction

• Given this:
  ‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
  ‣ capital(Brazil, Brasilia)
  ‣ founded(Brasilia, 1960)
• Main goal: turn text into structured data
Applications

• Stock analysis
  ‣ Gather information from news and social media
  ‣ Summarise texts into a structured format
  ‣ Decide whether to buy/sell at current stock price
• Medical research
  ‣ Obtain information from articles about diseases and treatments
  ‣ Decide which treatment to apply for a new patient
How?

• Given this:
  ‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
  ‣ capital(Brazil, Brasilia)
  ‣ founded(Brasilia, 1960)
• Two steps:
  ‣ Named Entity Recognition (NER): find entities such as “Brasilia” and “1960”
  ‣ Relation Extraction: use context to find the relation between “Brasilia” and “1960” (“founded”)
Machine learning in IE

• Named Entity Recognition (NER): sequence models such as RNNs, HMMs or CRFs.
• Relation Extraction: mostly classifiers, either binary or multi-class.
• This lecture: how to frame these two tasks in order to apply sequence labellers and classifiers.
Outline

• Named Entity Recognition
• Relation Extraction
• Other IE Tasks
Named Entity Recognition
Named Entity Recognition
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
JM3, Ch 17
Named Entity Recognition
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [GPE Chicago] to [GPE Dallas] and [GPE Denver] to [GPE San Francisco]
Typical Entity Tags
• PER: people, characters
• ORG: companies, sports teams
• LOC: regions, mountains, seas
• GPE: countries, states, provinces (in some tagsets this is labelled as LOC)
• FAC: bridges, buildings, airports
• VEH: planes, trains, cars
• Tag set is application-dependent: some domains deal with specific entities, e.g. proteins and genes
NER as Sequence Labelling

• NE tags can be ambiguous:
  ‣ “Washington” can be a person, location or political entity
• Similar problem when doing POS tagging
  ‣ Incorporate context
• Can we use a sequence tagger for this (e.g. HMM)?
  ‣ No, as entities can span multiple tokens
  ‣ Solution: modify the tag set
IO tagging

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• ‘I-ORG’ represents a token that is inside an entity (ORG in this case).
• All tokens which are not entities get the ‘O’ tag (for outside).
• Cannot differentiate between:
  ‣ a single entity with multiple tokens
  ‣ multiple entities with single tokens
IOB tagging

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• B-ORG represents the beginning of an ORG entity.
• If the entity has more than one token, subsequent tags are represented as I-ORG.
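For concreteness, a minimal sketch (not from the slides) of the IOB encoding of this example, written as token/tag pairs with a simplified tokenisation:

```python
# Illustrative IOB encoding of the example sentence above (simplified
# tokenisation; not from the slides).
tagged = [
    ("American", "B-ORG"), ("Airlines", "I-ORG"), (",", "O"),
    ("a", "O"), ("unit", "O"), ("of", "O"),
    ("AMR", "B-ORG"), ("Corp.", "I-ORG"), (",", "O"),
    ("immediately", "O"), ("matched", "O"), ("the", "O"), ("move", "O"),
    (",", "O"), ("spokesman", "O"),
    ("Tim", "B-PER"), ("Wagner", "I-PER"), ("said", "O"), (".", "O"),
]
```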
Annotate the following sentence with NER tags (IOB)
Steve Jobs founded Apple Inc. in 1976
Tagset: PER, ORG, LOC, TIME
PollEv.com/jeyhanlau569
NER as Sequence Labelling

• Given such a tagging scheme, we can train any sequence labelling model
• In theory, HMMs can be used, but discriminative models such as CRFs are preferred
NER: Features
• Example: L’Occitane
• Prefix/suffix:
  ‣ L / L’ / L’O / L’Oc / …
  ‣ e / ne / ane / tane / …
• Word shape:
  ‣ X’Xxxxxxxx / X’Xx
  ‣ XXXX-XX-XX (date!)
• POS tags / syntactic chunks: many entities are nouns or noun phrases.
• Presence in a gazetteer: lists of entities, such as place names, people’s names and surnames, etc.
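As a rough illustration of how features of these kinds feed a CRF tagger, here is a minimal sketch assuming the sklearn-crfsuite package is installed; the feature names and the toy gazetteer are illustrative, not from the lecture.

```python
# Minimal sketch: token features of the kinds listed above, fed to a CRF.
# Assumes sklearn-crfsuite is installed; names and gazetteer are illustrative.
import re
import sklearn_crfsuite

GAZETTEER = {"Chicago", "Dallas", "Denver", "San Francisco"}  # toy gazetteer

def word_shape(word):
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)   # e.g. "2007-07-02" -> "dddd-dd-dd"

def token_features(sent, i):
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "shape": word_shape(word),
        "in_gazetteer": word in GAZETTEER,
        "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def train_crf(train_sents, train_tags):
    """train_sents: lists of tokens; train_tags: matching lists of IOB tags."""
    X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, train_tags)
    return crf
```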
NER: Classifier
Deep Learning for NER

• State-of-the-art approach uses LSTMs with character and word embeddings (Lample et al. 2016)
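To give a feel for the architecture, a minimal word-level BiLSTM tagger sketch in PyTorch (assumed available); Lample et al. (2016) additionally use character-level embeddings and a CRF output layer, which are omitted here.

```python
# Minimal word-level BiLSTM tagger sketch (PyTorch assumed available).
# Lample et al. (2016) also add character embeddings and a CRF layer.
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                            batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)   # scores over IOB tags

    def forward(self, token_ids):           # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)         # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(emb)          # (batch, seq_len, 2 * hidden_dim)
        return self.out(hidden)             # per-token tag scores
```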
Relation Extraction
Relation Extraction

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• Traditionally framed as triple extraction:
  ‣ unit(American Airlines, AMR Corp.)
  ‣ spokesman(Tim Wagner, American Airlines)
• Key question: do we know all the possible relations?
Relation Extraction
‣ unit(American Airlines, AMR Corp.) → subsidiary
‣ spokesman(Tim Wagner, American Airlines) → employment
Methods
• If we have access to a fixed relation database:
  ‣ Rule-based
  ‣ Supervised
  ‣ Semi-supervised
  ‣ Distant supervision
• If no restrictions on relations:
  ‣ Unsupervised
  ‣ Sometimes referred to as “OpenIE”
Rule-Based Relation Extraction

• “Agar is a substance prepared from a mixture of red algae such as Gelidium, for laboratory or industrial use.”
• [NP red algae] such as [NP Gelidium]
• NP0 such as NP1 → hyponym(NP1, NP0)
• hyponym(Gelidium, red algae)
• Lexico-syntactic patterns: high precision, low recall, manual effort required
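A toy sketch of the “NP0 such as NP1” pattern as a raw-string regex; real systems match over chunked noun phrases, and the helper name here is illustrative.

```python
# Toy regex version of the "NP0 such as NP1" pattern above. Real systems
# match over parsed noun phrases; this raw-string version is illustrative.
import re

PATTERN = re.compile(r"(\w+(?: \w+)*) such as (\w+)")

def extract_hyponyms(text):
    """Return hyponym(NP1, NP0) tuples for every 'X such as Y' match."""
    return [("hyponym", np1, np0) for np0, np1 in PATTERN.findall(text)]

print(extract_hyponyms("a mixture of red algae such as Gelidium"))
# [('hyponym', 'Gelidium', 'a mixture of red algae')]
# The over-long NP0 shows why proper NP boundaries (chunking) are needed.
```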
More Rules
Supervised Relation Extraction

• Assume a corpus with annotated relations
• Two steps. First, find if an entity pair is related or not (binary classification)
  ‣ For each sentence, gather all possible entity pairs
  ‣ Annotated pairs are considered positive examples
  ‣ Non-annotated pairs are taken as negative examples
• Second, for pairs predicted as positive, use a multi-class classifier (e.g. SVM) to obtain the relation
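A minimal sketch of this two-step setup using scikit-learn (assumed available); feature extraction is left abstract here, with each candidate pair represented as a feature dictionary.

```python
# Sketch of the two-step setup described above, using scikit-learn (assumed
# available). Each candidate pair is represented as a feature dict.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_relation_extractor(pair_feats, is_related, relation_types):
    """pair_feats: feature dict per candidate pair;
    is_related: 1 if the pair is annotated with a relation, else 0;
    relation_types: relation label per pair (ignored for negatives)."""
    vec = DictVectorizer()
    X = vec.fit_transform(pair_feats)

    # Step 1: binary classifier -- is this pair related at all?
    related_clf = LinearSVC().fit(X, is_related)

    # Step 2: multi-class classifier trained on the positive pairs only
    pos = [i for i, y in enumerate(is_related) if y == 1]
    relation_clf = LinearSVC().fit(X[pos], [relation_types[i] for i in pos])
    return vec, related_clf, relation_clf
```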
Supervised Relation Extraction

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• First:
  ‣ (American Airlines, AMR Corp.) → positive
  ‣ (American Airlines, Tim Wagner) → positive
  ‣ (AMR Corp., Tim Wagner) → negative
• Second:
  ‣ (American Airlines, AMR Corp.) → subsidiary
  ‣ (American Airlines, Tim Wagner) → employment
Features
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• (American Airlines, Tim Wagner) → employment
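A hypothetical pair-feature helper (not from the slides) showing the usual feature types for this example: the entity types plus the words between and around the two mentions.

```python
# Hypothetical pair_features() helper (not from the slides), showing the
# usual feature types: entity types plus words between/around the mentions.
def pair_features(tokens, e1_span, e2_span, e1_type, e2_type):
    """tokens: sentence tokens; e*_span: (start, end) token indices, end exclusive."""
    between = tokens[e1_span[1]:e2_span[0]]
    feats = {
        "e1_type": e1_type,                       # e.g. ORG
        "e2_type": e2_type,                       # e.g. PER
        "type_pair": e1_type + "-" + e2_type,     # e.g. ORG-PER
        "num_words_between": len(between),
        "word_before_e1": tokens[e1_span[0] - 1] if e1_span[0] > 0 else "<S>",
        "word_after_e2": tokens[e2_span[1]] if e2_span[1] < len(tokens) else "</S>",
    }
    for w in between:                             # bag of words between entities
        feats["between=" + w.lower()] = True
    return feats
```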
Semi-supervised Relation Extraction
• Annotated corpora are very expensive to create
• Use seed tuples to bootstrap a classifier
Semi-supervised Relation Extraction
1. Given seed tuple: hub(Ryanair, Charleroi)
2. Find sentences containing terms in seed tuples
   • Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport.
3. Extract general patterns
   • [ORG], which uses [LOC] as a hub
4. Find new tuples with these patterns
   • hub(Jetstar, Avalon)
5. Add these new tuples to existing tuples and repeat step 2
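A schematic version of this loop, using the raw text between the two arguments as a (deliberately crude) pattern; all function and variable names are illustrative, not from the lecture.

```python
# Schematic bootstrapping loop for steps 1-5 above. The "pattern" here is
# just the raw text between the two arguments -- deliberately crude, and a
# hint of why noisy patterns cause semantic drift (next slides).
import re

def between_pattern(sentence, arg1, arg2):
    """Context between the two arguments, or None if they do not co-occur."""
    m = re.search(re.escape(arg1) + r"(.+?)" + re.escape(arg2), sentence)
    return m.group(1) if m else None

def bootstrap(relation, seed_pairs, corpus, n_iterations=3):
    pairs, patterns = set(seed_pairs), set()
    for _ in range(n_iterations):
        # Steps 2-3: sentences containing a known pair -> new patterns
        for sent in corpus:
            for a1, a2 in pairs:
                p = between_pattern(sent, a1, a2)
                if p:
                    patterns.add(p)
        # Step 4: apply patterns corpus-wide to propose new pairs
        for sent in corpus:
            for p in patterns:
                m = re.search(r"([\w ]+?)" + re.escape(p) + r"([\w ]+)", sent)
                if m:
                    pairs.add((m.group(1).strip(), m.group(2).strip()))
        # Step 5: the expanded pair set seeds the next iteration
    return {(relation, a1, a2) for a1, a2 in pairs}, patterns
```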
What are some issues with this semi-supervised relation extraction method?
• Difficult to create seed tuples
• Extracted tuples deviate from original relation over time
• Difficult to evaluate
• Tend not to find many novel tuples given seed tuples
• Extracted general patterns tend to be very noisy
PollEv.com/jeyhanlau569
Semantic Drift

• Pattern: [NP] has a {NP}* hub at [LOC]
• Sydney has a ferry hub at Circular Quay
  ‣ hub(Sydney, Circular Quay)
• More erroneous patterns extracted from this tuple…
• Should only accept patterns with high confidence
Distant Supervision

• Semi-supervised methods assume the existence of seed tuples to mine new tuples
• Can we mine new tuples directly?
• Distant supervision obtains new tuples from a range of sources:
  ‣ DBpedia
  ‣ Freebase
• Generates massive training sets, enabling the use of richer features, and no risk of semantic drift
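A minimal sketch of the labelling heuristic behind distant supervision: any sentence mentioning both arguments of a knowledge-base tuple is taken as a (noisy) training example for that relation. The names here are illustrative.

```python
# Sketch of the distant-supervision labelling heuristic: any sentence that
# mentions both arguments of a KB tuple becomes a (noisy) training example.
def distant_label(kb_tuples, corpus):
    """kb_tuples: iterable of (relation, arg1, arg2); corpus: list of sentences."""
    examples = []
    for sent in corpus:
        for rel, a1, a2 in kb_tuples:
            if a1 in sent and a2 in sent:
                examples.append((sent, a1, a2, rel))
    return examples
```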
Unsupervised Relation Extraction (“OpenIE”)

• No fixed or closed set of relations
• Relations are sub-sentences; they usually contain a verb
• “United has a hub in Chicago, which is the headquarters of United Continental Holdings.”
  ‣ “has a hub in”(United, Chicago)
  ‣ “is the headquarters of”(Chicago, United Continental Holdings)
• Main problem: mapping relations into canonical forms
Evaluation
• NER: F1-measure at the entity level.
• Relation Extraction with known relation set: F1-measure
• Relation Extraction with unknown relations: much harder to evaluate
  ‣ Usually need some human evaluation
  ‣ Massive datasets used in these settings are impractical to evaluate manually (use samples)
  ‣ Can only obtain (approximate) precision, not recall.
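A small sketch of entity-level F1 (not from the slides): gold and predicted entities are compared as whole (type, start, end) spans, so a wrong boundary counts as both a false positive and a false negative.

```python
# Entity-level F1 sketch: entities compared as whole (type, start, end) spans,
# so a wrong boundary is both a false positive and a false negative.
def entity_f1(gold_spans, pred_spans):
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = {("ORG", 0, 2), ("PER", 15, 17)}
pred = {("ORG", 0, 2), ("PER", 15, 16)}   # wrong span boundary -> no credit
print(entity_f1(gold, pred))              # 0.5
```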
Other IE Tasks
Temporal Expression Extraction
“[TIME July 2, 2007]: A fare increase initiated [TIME last week] by UAL Corp’s United Airlines was matched by competitors over [TIME the weekend], marking the second successful fare increase in [TIME two weeks].”
• Anchoring: when is “last week”?
  ‣ “last week” → 2007-W26
• Normalisation: mapping expressions to canonical forms
  ‣ July 2, 2007 → 2007-07-02
• Mostly rule-based approaches
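As a flavour of these rule-based approaches, a toy normalisation rule for dates of the form “July 2, 2007”; only one pattern is shown, whereas real systems use many such rules.

```python
# Toy normalisation rule for dates like "July 2, 2007" -> "2007-07-02".
# Only one pattern is shown; rule-based systems use many such rules.
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
          "June": 6, "July": 7, "August": 8, "September": 9, "October": 10,
          "November": 11, "December": 12}

def normalise_date(expr):
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})$", expr)
    if m and m.group(1) in MONTHS:
        return "%s-%02d-%02d" % (m.group(3), MONTHS[m.group(1)], int(m.group(2)))
    return None

print(normalise_date("July 2, 2007"))   # 2007-07-02
```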
Event Extraction

• “American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said].”
• Very similar to NER, including annotation and learning methods.
• Event ordering: detect how a set of events happened in a timeline.
  ‣ Involves both event extraction and temporal expression extraction.
A Final Word

• Information Extraction is a vast field with many different tasks and applications
  ‣ Named Entity Recognition
  ‣ Relation Extraction
  ‣ Event Extraction
• Machine learning methods involve classifiers and sequence labelling models.
Reading

• JM3 Ch. 8.3, 17-17.2
• References:
  ‣ Lample et al., Neural Architectures for Named Entity Recognition, NAACL 2016. https://github.com/glample/tagger