COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 18
Semester 1 2021 Week 9
Jey Han Lau
Information Extraction
Information Extraction
• Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)
• Main goal: turn text into structured data
Applications
• Stock analysis
‣ Gather information from news and social media
‣ Summarise texts into a structured format
‣ Decide whether to buy/sell at current stock price
• Medical research
‣ Obtain information from articles about diseases
and treatments
‣ Decide which treatment to apply for a new patient
How?
• Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)
• Two steps:
‣ Named Entity Recognition (NER): identify entities
such as “Brasilia” and “1960”
‣ Relation Extraction: use context to find the relation
between “Brasilia” and “1960” (“founded”)
Machine Learning in IE
• Named Entity Recognition (NER): sequence
models such as RNNs, HMMs or CRFs.
• Relation Extraction: mostly classifiers, either
binary or multi-class.
• This lecture: how to frame these two tasks in order
to apply sequence labellers and classifiers.
Outline
• Named Entity Recognition
• Relation Extraction
• Other IE Tasks
Named Entity Recognition
Named Entity Recognition
Citing high fuel prices, United Airlines said Friday it
has increased fares by $6 per round trip on flights to
some cities also served by lower-cost carriers.
American Airlines, a unit of AMR Corp., immediately
matched the move, spokesman Tim Wagner said.
United, a unit of UAL Corp., said the increase took
effect Thursday and applies to most routes where it
competes against discount carriers, such as Chicago
to Dallas and Denver to San Francisco.
JM3, Ch 17
Named Entity Recognition
Citing high fuel prices, [ORG United Airlines] said
[TIME Friday] it has increased fares by [MONEY
$6] per round trip on flights to some cities also
served by lower-cost carriers. [ORG American
Airlines], a unit of [ORG AMR Corp.], immediately
matched the move, spokesman [PER Tim Wagner]
said. [ORG United], a unit of [ORG UAL Corp.],
said the increase took effect [TIME Thursday] and
applies to most routes where it competes against
discount carriers, such as [GPE Chicago] to [GPE
Dallas] and [GPE Denver] to [GPE San Francisco].
Typical Entity Tags
• PER: people, characters
• ORG: companies, sports teams
• LOC: regions, mountains, seas
• GPE: countries, states, provinces
(in some tagsets this is labelled as LOC)
• FAC: bridges, buildings, airports
• VEH: planes, trains, cars
• The tag set is application-dependent: some domains deal with specific entities, e.g. proteins and genes
NER as Sequence Labelling
• NE tags can be ambiguous:
‣ “Washington” can be a person, location or
political entity
• Similar problem when doing POS tagging
‣ Incorporate context
• Can we use a sequence tagger for this (e.g. HMM)?
‣ Not directly, as entities can span multiple tokens
‣ Solution: modify the tag set
IO tagging
• [ORG American Airlines], a unit of
[ORG AMR Corp.], immediately
matched the move, spokesman [PER
Tim Wagner] said.
• ‘I-ORG’ represents a token that is
inside an entity (ORG in this case).
• All tokens which are not part of an entity get the ‘O’ tag (for outside).
• Cannot differentiate between:
‣ a single entity with multiple tokens
‣ multiple entities with single tokens
IOB tagging
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• B-ORG represents the beginning of an ORG entity.
• If the entity has more than one token, subsequent tags are represented as I-ORG.
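To make the scheme concrete, here is a small illustrative Python sketch (not from the slides): the example sentence hand-tokenised with IOB tags, plus a helper that recovers entity spans from the tags.

# Illustrative only: the slide's example sentence, tokenised by hand
# and annotated with IOB tags.
tagged = [
    ("American", "B-ORG"), ("Airlines", "I-ORG"), (",", "O"),
    ("a", "O"), ("unit", "O"), ("of", "O"),
    ("AMR", "B-ORG"), ("Corp.", "I-ORG"), (",", "O"),
    ("immediately", "O"), ("matched", "O"), ("the", "O"), ("move", "O"),
    (",", "O"), ("spokesman", "O"),
    ("Tim", "B-PER"), ("Wagner", "I-PER"), ("said", "O"), (".", "O"),
]

def extract_entities(tagged):
    """Recover (entity text, type) spans from a list of (token, IOB tag) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in tagged:
        if tag.startswith("B-"):                # a new entity starts here
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the current entity
            current.append(token)
        else:                                   # outside any entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

print(extract_entities(tagged))
# [('American Airlines', 'ORG'), ('AMR Corp.', 'ORG'), ('Tim Wagner', 'PER')]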
Annotate the following sentence with NER tags (IOB)
Steve Jobs founded Apple Inc. in 1976
Tagset: PER, ORG, LOC, TIME
PollEv.com/jeyhanlau569
NER as Sequence Labelling
• Given such a tagging scheme, we can train any
sequence labelling model
• In theory, HMMs can be used but discriminative
models such as CRFs are preferred
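A minimal sketch of how such a sequence labeller could be trained, assuming the third-party sklearn-crfsuite package; the feature function and the one-sentence training set are toy placeholders, not the lecture's setup.

# Sketch: training a CRF sequence labeller on IOB-tagged sentences.
import sklearn_crfsuite

def word2features(sent, i):
    """Very small feature set for token i of a sentence (list of tokens)."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data: one sentence with its IOB tags (real corpora have thousands).
sentences = [["United", "Airlines", "said", "Friday", "."]]
labels    = [["B-ORG", "I-ORG", "O", "B-TIME", "O"]]

X = [[word2features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))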
NER: Features
• Example: L’Occitane
• Prefix/suffix:
‣ L / L’ / L’O / L’Oc / …
‣ e / ne / ane / tane / …
• Word shape:
‣ X’Xxxxxxxx / X’Xx
‣ XXXX-XX-XX (date!)
• POS tags / syntactic chunks: many entities are nouns
or noun phrases.
• Presence in a gazetteer: lists of entities, such as place
names, people’s names and surnames, etc.
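An illustrative sketch of how the prefix/suffix, word-shape and gazetteer features above could be computed; the gazetteer and helper names are invented for illustration.

import re

# A toy gazetteer; real systems use large lists of place/person names.
GAZETTEER = {"chicago", "dallas", "denver", "san francisco"}

def word_shape(word, short=False):
    """Map characters to X/x/d classes, e.g. "L'Occitane" -> "X'Xxxxxxxx"."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    if short:
        # Collapse runs of the same class: "X'Xxxxxxxx" -> "X'Xx"
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape

def affix_features(word, n=4):
    """Prefixes and suffixes up to length n."""
    feats = {}
    for k in range(1, min(n, len(word)) + 1):
        feats[f"prefix{k}"] = word[:k]
        feats[f"suffix{k}"] = word[-k:]
    return feats

word = "L'Occitane"
feats = affix_features(word)
feats["shape"] = word_shape(word)                     # X'Xxxxxxxx
feats["shape_short"] = word_shape(word, short=True)   # X'Xx
feats["in_gazetteer"] = word.lower() in GAZETTEER
print(feats)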
NER: Classifier
Deep Learning for NER
• A state-of-the-art approach uses LSTMs with character and word embeddings (Lample et al., 2016)
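A compressed PyTorch sketch in the spirit of Lample et al. (2016): a character-level BiLSTM representation is concatenated with a word embedding and fed into a word-level BiLSTM. The original model also adds a CRF output layer and pretrained embeddings, both omitted here; sizes and names are illustrative.

import torch
import torch.nn as nn

class CharWordBiLSTMTagger(nn.Module):
    """Sketch of a Lample-style tagger (without the CRF output layer)."""

    def __init__(self, n_chars, n_words, n_tags, char_dim=25, word_dim=100, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # char_ids: (n_words_in_sent, max_word_len); one row of character ids per word
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        char_repr = torch.cat([h[0], h[1]], dim=-1)         # (n_words, 2*char_dim)
        words = self.word_emb(word_ids)                      # (n_words, word_dim)
        x = torch.cat([words, char_repr], dim=-1).unsqueeze(0)
        h_words, _ = self.word_lstm(x)                       # (1, n_words, 2*hidden)
        return self.out(h_words).squeeze(0)                  # (n_words, n_tags) tag scores

# Dummy usage: a 3-word sentence with 3 characters per word.
model = CharWordBiLSTMTagger(n_chars=30, n_words=50, n_tags=9)
scores = model(torch.tensor([1, 2, 3]), torch.tensor([[4, 5, 0], [6, 7, 8], [9, 1, 2]]))
print(scores.shape)   # torch.Size([3, 9])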
Relation Extraction
Relation Extraction
• [ORG American Airlines], a unit of [ORG AMR
Corp.], immediately matched the move, spokesman
[PER Tim Wagner] said.
• Traditionally framed as triple extraction:
‣ unit(American Airlines, AMR Corp.)
‣ spokesman(Tim Wagner, American Airlines)
• Key question: do we know all the possible
relations?
Relation Extraction
‣ unit(American Airlines, AMR Corp.) → subsidiary
‣ spokesman(Tim Wagner, American Airlines) → employment
Methods
• If we have access to a fixed relation database:
‣ Rule-based
‣ Supervised
‣ Semi-supervised
‣ Distant supervision
• If no restrictions on relations:
‣ Unsupervised
‣ Sometimes referred to as “OpenIE”
Rule-Based Relation Extraction
• “Agar is a substance prepared from a mixture of
red algae such as Gelidium, for laboratory or
industrial use.”
• [NP red algae] such as [NP Gelidium]
• NP0 such as NP1 → hyponym(NP1, NP0)
• hyponym(Gelidium, red algae)
• Lexico-syntactic patterns: high precision, low
recall, manual effort required
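A deliberately naive sketch of the pattern above as a regular expression; a real system would match over syntactic NP chunks rather than a crude word window before “such as”.

import re

# Toy version of one Hearst-style pattern:
#   "NP0 such as NP1"  ->  hyponym(NP1, NP0)
# NP0 is approximated by the one or two words immediately before "such as".
PATTERN = re.compile(r"((?:\w+\s)?\w+)\s+such as\s+([A-Z]\w+)")

text = ("Agar is a substance prepared from a mixture of red algae "
        "such as Gelidium, for laboratory or industrial use.")

for m in PATTERN.finditer(text):
    np0, np1 = m.group(1), m.group(2)
    print(f"hyponym({np1}, {np0})")   # hyponym(Gelidium, red algae)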
More Rules
Supervised Relation Extraction
• Assume a corpus with annotated relations
• Two steps (a code sketch follows below). First, determine whether an entity pair is related (binary classification)
‣ For each sentence, gather all possible entity pairs
‣ Annotated pairs are considered positive examples
‣ Non-annotated pairs are taken as negative
examples
• Second, for pairs predicted as positive, use a multi-
class classifier (e.g. SVM) to obtain the relation
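A toy scikit-learn sketch of the two-step pipeline; the feature dicts, labels and sentences are invented placeholders, not the lecture's data.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training examples: one feature dict per candidate entity pair.
pairs = [
    {"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"},
    {"e1_type": "ORG", "e2_type": "PER", "between": "spokesman"},
    {"e1_type": "ORG", "e2_type": "PER", "between": "matched the move"},
]
related = ["yes", "yes", "no"]              # step 1 labels (related or not)
relation = ["subsidiary", "employment"]     # step 2 labels (positive pairs only)

vec = DictVectorizer()
X = vec.fit_transform(pairs)

# Step 1: binary "related or not" classifier over all candidate pairs.
step1 = LinearSVC().fit(X, related)

# Step 2: multi-class relation classifier, trained only on the related pairs.
pos_idx = [i for i, y in enumerate(related) if y == "yes"]
step2 = LinearSVC().fit(X[pos_idx], relation)

# At prediction time: keep pairs step 1 calls "yes", then let step 2 name the relation.
new_pair = vec.transform([{"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"}])
if step1.predict(new_pair)[0] == "yes":
    print(step2.predict(new_pair)[0])        # e.g. "subsidiary"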
Supervised Relation Extraction
• [ORG American Airlines], a unit of [ORG AMR
Corp.], immediately matched the move,
spokesman [PER Tim Wagner] said.
• First:
‣ (American Airlines, AMR Corp.) → positive
‣ (American Airlines, Tim Wagner) → positive
‣ (AMR Corp., Tim Wagner) → negative
• Second:
‣ (American Airlines, AMR Corp.) → subsidiary
‣ (American Airlines, Tim Wagner) → employment
Features
• [ORG American Airlines], a unit of [ORG AMR
Corp.], immediately matched the move, spokesman
[PER Tim Wagner] said.
• (American Airlines, Tim Wagner) → employment
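The kinds of features typically used for such a pair (entity types, headwords, the words between the mentions) can be sketched as a dict; the exact templates below are illustrative, not the lecture's feature set.

# Illustrative features for the pair (American Airlines, Tim Wagner).
features = {
    "e1_headword": "Airlines",
    "e2_headword": "Wagner",
    "e1_type": "ORG",
    "e2_type": "PER",
    "type_pair": "ORG-PER",
    "words_between": [",", "a", "unit", "of", "AMR", "Corp.", ",",
                      "immediately", "matched", "the", "move", ",", "spokesman"],
    "first_word_after_e1": "a",
    "last_word_before_e2": "spokesman",
}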
Semi-supervised Relation Extraction
• Annotated corpora are very expensive to create
• Use seed tuples to bootstrap a classifier
Semi-supervised Relation Extraction
1. Given seed tuple: hub(Ryanair, Charleroi)
2. Find sentences containing terms in seed tuples
• Budget airline Ryanair, which uses Charleroi as a
hub, scrapped all weekend flights out of the airport.
3. Extract general patterns
• [ORG], which uses [LOC] as a hub
4. Find new tuples with these patterns
• hub(Jetstar, Avalon)
5. Add these new tuples to the existing set and repeat from step 2 (a toy sketch follows below)
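A toy, self-contained sketch of this loop on a two-sentence corpus; the pattern-generalisation step is crudely approximated by taking the literal text between the two entity mentions.

import re

corpus = [
    "Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights.",
    "Jetstar, which uses Avalon as a hub, added new routes last month.",
]
tuples = {("Ryanair", "Charleroi")}     # step 1: the seed tuple hub(Ryanair, Charleroi)
patterns = set()

for _ in range(3):                       # repeat steps 2-5 a few times
    # Steps 2-3: sentences mentioning a known pair yield a surface pattern
    # built from the text between the two entity mentions.
    for arg1, arg2 in list(tuples):
        for sent in corpus:
            if arg1 in sent and arg2 in sent:
                between = sent[sent.index(arg1) + len(arg1):sent.index(arg2)]
                patterns.add(r"(\w+)" + re.escape(between) + r"(\w+)")
    # Step 4: apply every pattern to the corpus to harvest new tuples.
    for pattern in patterns:
        for sent in corpus:
            for match in re.finditer(pattern, sent):
                tuples.add((match.group(1), match.group(2)))   # step 5

print(tuples)   # {('Ryanair', 'Charleroi'), ('Jetstar', 'Avalon')} (order may vary)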
What are some issues with such a semi-supervised relation extraction method?
PollEv.com/jeyhanlau569
• Difficult to create seed tuples
• Extracted tuples deviate from the original relation over time
• Difficult to evaluate
• Tend not to find many novel tuples beyond the seed tuples
• Extracted general patterns tend to be very noisy
Semantic Drift
• Pattern: [NP] has a {NP}* hub at [LOC]
• Sydney has a ferry hub at Circular Quay
‣ hub(Sydney, Circular Quay)
• More erroneous patterns are then extracted from this tuple…
• Should only accept patterns with high confidence
Distant Supervision
• Semi-supervised methods assume the existence
of seed tuples to mine new tuples
• Can we mine new tuples directly?
• Distant supervision obtains new tuples from a range of sources:
‣ DBpedia
‣ Freebase
• Generates massive training sets, enabling the use of richer features, with no risk of semantic drift
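A minimal sketch of the idea: tuples taken from a knowledge base (hard-coded here in place of DBpedia/Freebase) automatically label any sentence mentioning both arguments, producing “silver” training data for a supervised classifier.

# Relation tuples from a knowledge base (hard-coded toy stand-in).
kb = {
    ("American Airlines", "AMR Corp."): "subsidiary",
    ("Tim Wagner", "American Airlines"): "employment",
}

corpus = [
    "American Airlines, a unit of AMR Corp., immediately matched the move.",
    "Spokesman Tim Wagner of American Airlines announced the fare change.",
]

# Any sentence containing both arguments is treated as a training example.
training_data = []
for (arg1, arg2), relation in kb.items():
    for sentence in corpus:
        if arg1 in sentence and arg2 in sentence:
            training_data.append((sentence, arg1, arg2, relation))

for example in training_data:
    print(example)
# These automatically labelled examples then feed a supervised relation classifier.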
Unsupervised Relation Extraction
(“OpenIE”)
• No fixed or closed set of relations
• Relations are sub-sentences; they usually contain a verb
• “United has a hub in Chicago, which is the
headquarters of United Continental Holdings.”
‣ “has a hub in”(United, Chicago)
‣ “is the headquarters of”(Chicago, United
Continental Holdings)
• Main problem: mapping relations into canonical forms
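A toy sketch of this idea: the relation phrase is simply the token span between two entity mentions, kept only if it contains a verb; the NER spans and POS tags are hard-coded here rather than predicted.

tokens = ["United", "has", "a", "hub", "in", "Chicago", ",", "which", "is",
          "the", "headquarters", "of", "United", "Continental", "Holdings"]
pos    = ["NNP", "VBZ", "DT", "NN", "IN", "NNP", ",", "WDT", "VBZ",
          "DT", "NN", "IN", "NNP", "NNP", "NNP"]
entities = [(0, 1, "United"), (5, 6, "Chicago"), (12, 15, "United Continental Holdings")]

# Take the tokens between consecutive entity mentions as the relation phrase.
for (_, end1, ent1), (start2, _, ent2) in zip(entities, entities[1:]):
    span = list(range(end1, start2))
    if any(pos[i].startswith("VB") for i in span):   # a relation needs a verb
        phrase = " ".join(tokens[i] for i in span)
        print(f'"{phrase}"({ent1}, {ent2})')
# "has a hub in"(United, Chicago)
# ", which is the headquarters of"(Chicago, United Continental Holdings)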
Evaluation
• NER: F1-measure at the entity level.
• Relation Extraction with known relation set: F1-
measure
• Relation Extraction with unknown relations: much
harder to evaluate
‣ Usually need some human evaluation
‣ Massive datasets used in these settings are
impractical to evaluate manually (use samples)
‣ Can only obtain (approximate) precision, not recall.
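A small sketch of entity-level F1: gold and predicted entities are compared as whole (start, end, type) spans, so a partially correct span counts as an error.

def entity_f1(gold, pred):
    """Entity-level F1 over exact (start, end, type) span matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")]
pred = [(0, 2, "ORG"), (6, 7, "ORG"), (15, 17, "PER")]   # second span is truncated
print(round(entity_f1(gold, pred), 3))   # 0.667: only 2 of 3 spans exactly match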
Other IE Tasks
Temporal Expression Extraction
“[TIME July 2, 2007]: A fare increase initiated [TIME last
week] by UAL Corp’s United Airlines was matched by
competitors over [TIME the weekend], marking the
second successful fare increase in [TIME two weeks].”
• Anchoring: when is “last week”?
‣ “last week” → 2007-W26
• Normalisation: mapping expressions to canonical
forms.
‣ July 2, 2007 → 2007-07-02
• Mostly rule-based approaches
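A toy rule-based sketch of normalisation and anchoring for the two examples above, assuming the document date is known; real temporal taggers use far larger rule sets.

import re
from datetime import date, timedelta

DOC_DATE = date(2007, 7, 2)   # the document date, used for anchoring
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
          "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
          "december": 12}

def normalise(expr, doc_date=DOC_DATE):
    expr = expr.lower()
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})", expr)        # e.g. "July 2, 2007"
    if m and m.group(1) in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2))).isoformat()
    if expr == "last week":                                 # anchor to the document date
        last_week = doc_date - timedelta(weeks=1)
        return f"{last_week.isocalendar()[0]}-W{last_week.isocalendar()[1]:02d}"
    return None

print(normalise("July 2, 2007"))   # 2007-07-02
print(normalise("last week"))      # 2007-W26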
Event Extraction
• “American Airlines, a unit of AMR Corp.,
immediately [EVENT matched] [EVENT the
move], spokesman Tim Wagner [EVENT said].”
• Very similar to NER, including annotation and
learning methods.
• Event ordering: detect how a set of events
happened in a timeline.
‣ Involves both event extraction and temporal
expression extraction.
A Final Word
• Information Extraction is a vast field with many
different tasks and applications
‣ Named Entity Recognition
‣ Relation Extraction
‣ Event Extraction
• Machine learning methods involve classifiers and
sequence labelling models.
Reading
• JM3 Ch. 8.3, 17-17.2
• References:
‣ Lample et al., Neural Architectures for Named Entity
Recognition, NAACL 2016
https://github.com/glample/tagger