
Information Extraction
COMP90042
Natural Language Processing Lecture 18
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
Information Extraction

Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”

Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)

Main goal: turn text into structured data, e.g. records in a database.
This structured data helps decision makers in downstream applications.

Examples

• Stock analysis
‣ Gather information from news and social media → summarise into a structured format → decide whether to buy/sell at current stock price
• Medical and biological research
‣ Obtain information from articles about diseases and treatments → decide which treatment to apply to a new patient
• Rumour detection
‣ Detect events in social media → decide where, when and how to act



How?

Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”

Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)

Two steps:
‣ Named Entity Recognition (NER): identify entities such as “Brasilia” and “1960”
‣ Relation Extraction: use context to find the relation between “Brasilia” and “1960” (“founded”)




Machine learning in IE

• Named Entity Recognition (NER): sequence models such as RNNs, HMMs or CRFs.
• Relation Extraction: mostly classifiers, either binary or multi-class.
• This lecture: how to frame these two tasks in order to apply classifiers and sequence labellers.
• The choice of machine learning method is up to the user (yes, deep learning methods can be applied).
Named Entity Recognition
Named Entity Recognition
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

JM3, Ch 17
Named Entity Recognition
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco]
Typical Entity Tags
• PER: people, characters
• ORG: companies, sports teams
• LOC: regions, mountains, seas
• GPE: countries, states, provinces (sometimes conflated with LOC)
• FAC: bridges, buildings, airports
• VEH: planes, trains, cars
• Tag-set is application-dependent: some domains deal with specific entities e.g. proteins, genes or works of art.
NER as Sequence Labelling

• NE tags can be ambiguous:
‣ “Washington” can be either a person, a location or a political entity.
• We faced a similar problem when doing POS tagging.
‣ Solution: incorporate context by treating NER as sequence labelling.
• Can we use an out-of-the-box sequence tagger for this (e.g., HMM)?
‣ Not really: entities can span multiple tokens.
‣ Solution: adapt the tag set.
IO tagging
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• ‘I-ORG’ represents a token that is inside an entity (ORG in this case).
• All tokens which are not part of an entity get the ‘O’ tag (for outside).
• Cannot differentiate between a single entity with multiple tokens and multiple entities with single tokens.

IOB tagging

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• B-ORG represents the beginning of an ORG entity.
• If the entity has more than one token, subsequent tags are represented as I-ORG.
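To make the tag-set adaptation concrete, here is a minimal Python sketch (the tokenisation, the span format and the helper name are illustrative, not from the lecture) that maps the same annotated spans onto IO and IOB tag sequences:

```python
# Hypothetical helper: convert entity spans over tokens into IO / IOB tags.
tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp.", ","]
entities = [(0, 2, "ORG"), (6, 8, "ORG")]  # (start, end-exclusive, type)

def tag_sequence(tokens, entities, scheme="IOB"):
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        for i in range(start, end):
            if scheme == "IOB" and i == start:
                tags[i] = "B-" + etype   # beginning of an entity
            else:
                tags[i] = "I-" + etype   # inside an entity
    return tags

print(tag_sequence(tokens, entities, scheme="IO"))
# ['I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'I-ORG', 'I-ORG', 'O']
print(tag_sequence(tokens, entities, scheme="IOB"))
# ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O']
```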
NER as Sequence Labelling

• Given a tagging scheme and an annotated corpus, one can train any sequence labelling model.
• In theory, HMMs can be used, but discriminative models such as MEMMs and CRFs are preferred, as they can incorporate arbitrary features:
‣ Character-level features (is the first letter uppercase?)
‣ Extra resources, e.g., lists of names
‣ POS tags
NER: Features
• Character and word shape features (e.g., “L’Occitane”)
• Prefix/suffix:
‣ L / L’ / L’O / L’Oc / …
‣ e / ne / ane / tane / …
• Word shape:
‣ X’Xxxxxxxx / X’Xx
‣ XXXX-XX-XX (date!)
• POS tags / syntactic chunks: many entities are nouns or noun phrases.
• Presence in a gazetteer: lists of entities, such as place names, people’s names and surnames, etc.
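As an illustration of how such features might be computed per token, here is a rough Python sketch (the feature names, the tiny gazetteer and the helper functions are assumptions for illustration only):

```python
import re

GAZETTEER = {"melbourne", "brasilia", "chicago"}  # illustrative place-name list

def word_shape(word):
    # Map characters to X/x/d, e.g. "L'Occitane" -> "X'Xxxxxxxx"
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def token_features(tokens, pos_tags, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "shape": word_shape(word),
        "is_upper_initial": word[0].isupper(),
        "pos": pos_tags[i],
        "in_gazetteer": word.lower() in GAZETTEER,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
```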
NER: Classifier

Deep Learning for NER

• State-of-the-art approach uses LSTMs with character and word embeddings (Lample et al. 2016).
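Below is a much-simplified PyTorch sketch of this kind of architecture; it only illustrates the word + character embedding idea and omits the CRF output layer and other details of the actual Lample et al. (2016) model. All module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharWordBiLSTMTagger(nn.Module):
    """Simplified BiLSTM tagger: a character-level BiLSTM summary is
    concatenated with the word embedding, fed to a word-level BiLSTM,
    and scored per token (no CRF layer, unlike Lample et al. 2016)."""

    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=25, char_hidden=25, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 batch_first=True, bidirectional=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (sent_len,), char_ids: (sent_len, max_word_len)
        words = self.word_emb(word_ids)                 # (sent_len, word_dim)
        chars = self.char_emb(char_ids)                 # (sent_len, word_len, char_dim)
        _, (h, _) = self.char_lstm(chars)               # h: (2, sent_len, char_hidden)
        char_repr = torch.cat([h[0], h[1]], dim=-1)     # (sent_len, 2*char_hidden)
        x = torch.cat([words, char_repr], dim=-1).unsqueeze(0)
        lstm_out, _ = self.word_lstm(x)                 # (1, sent_len, 2*hidden)
        return self.out(lstm_out).squeeze(0)            # (sent_len, n_tags) tag scores

# Example usage with random ids: a 5-token sentence, max 12 chars per word.
model = CharWordBiLSTMTagger(n_words=1000, n_chars=100, n_tags=9)
scores = model(torch.randint(0, 1000, (5,)), torch.randint(0, 100, (5, 12)))
print(scores.shape)  # torch.Size([5, 9])
```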
Relation Extraction



Relation Extraction

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• Traditionally framed as triple extraction:
‣ unit(American Airlines, AMR Corp.)
‣ spokesman(Tim Wagner, American Airlines)
• Key question: do we have access to a set of possible relations?
‣ Answer depends on the application
Relation Extraction

• With a fixed relation set, extracted triples map onto canonical relation types:
‣ unit(American Airlines, AMR Corp.) → subsidiary
‣ spokesman(Tim Wagner, American Airlines) → employment

Methods

• If we have access to a fixed relation database:
‣ Rule-based
‣ Supervised
‣ Semi-supervised
‣ Distant supervision
• If there are no restrictions on relations:
‣ Unsupervised
‣ Sometimes referred to as “OpenIE”

Rule-Based Relation Extraction

• “Agar is a substance prepared from a mixture of red algae such as Gelidium, for laboratory or industrial use.”
• [NP red algae] such as [NP Gelidium]
• NP0 such as NP1 → hyponym(NP1, NP0)
• hyponym(Gelidium, red algae)
• Lexico-syntactic patterns: high precision, low recall, manual effort required
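A small Python sketch of applying such a pattern over text whose noun phrases have already been bracketed by a chunker (the bracketing convention and function name are assumptions for illustration):

```python
import re

# Assume noun-phrase chunks are already bracketed, e.g. by a chunker.
text = "Agar is a substance prepared from a mixture of [NP red algae] such as [NP Gelidium] ."

# Hearst-style pattern: NP0 such as NP1 -> hyponym(NP1, NP0)
PATTERN = re.compile(r"\[NP ([^\]]+)\] such as \[NP ([^\]]+)\]")

def extract_hyponyms(chunked_text):
    return [("hyponym", np1, np0) for np0, np1 in PATTERN.findall(chunked_text)]

print(extract_hyponyms(text))
# [('hyponym', 'Gelidium', 'red algae')]
```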
More Rules
Supervised Relation Extraction

• Assume a corpus with annotated relations.
• Two steps. First, decide whether an entity pair is related or not (binary classification):
‣ For each sentence, gather all possible entity pairs
‣ Annotated pairs are considered positive examples
‣ Non-annotated pairs are taken as negative examples
• Second, for pairs predicted as positive, use a multi-class classifier (e.g. SVM) to obtain the relation.


Supervised Relation Extraction

• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• First:
‣ (American Airlines, AMR Corp.) → positive
‣ (American Airlines, Tim Wagner) → positive
‣ (AMR Corp., Tim Wagner) → negative
• Second:
‣ (American Airlines, AMR Corp.) → subsidiary
‣ (American Airlines, Tim Wagner) → employment
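A minimal scikit-learn sketch of the two-step setup; the feature dictionaries and the tiny training set below are illustrative stand-ins, not the lecture's actual features or data:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy feature dicts for entity pairs (in practice: entity types, words
# between the entities, syntactic path, etc.).
pairs = [
    {"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"},
    {"e1_type": "ORG", "e2_type": "PER", "between": "spokesman"},
    {"e1_type": "ORG", "e2_type": "PER", "between": "the move , spokesman"},
]
related = [1, 1, 0]                       # step 1 labels: related or not
relations = ["subsidiary", "employment"]  # step 2 labels for the related pairs

# Step 1: binary classifier deciding whether a pair is related at all.
step1 = make_pipeline(DictVectorizer(), LinearSVC())
step1.fit(pairs, related)

# Step 2: multi-class classifier over pairs predicted as related.
positive_pairs = [p for p, r in zip(pairs, related) if r == 1]
step2 = make_pipeline(DictVectorizer(), LinearSVC())
step2.fit(positive_pairs, relations)

def extract_relation(pair_features):
    if step1.predict([pair_features])[0] == 1:
        return step2.predict([pair_features])[0]
    return None
```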
Features
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
• (American Airlines, Tim Wagner) → employment
Semi-supervised Relation Extraction
• Annotated corpora are very expensive to create.
• Use seed tuples to bootstrap a classifier:
1. Given a set of seed tuples
2. Find sentences containing these seed tuples
3. Extract general patterns from these sentences
4. Use these patterns to find new tuples
5. Repeat from step 2
Semi-supervised Relation Extraction
1. Given seed tuple: hub(Ryanair, Charleroi)
2. Find sentences containing terms in seed tuples
• “Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport.”
3. Extract general patterns
• [ORG], which uses [LOC] as a hub
4. Find new tuples with these patterns
• hub(Jetstar, Avalon)
5. Add these new tuples to existing tuples and repeat step 2
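A rough Python sketch of this bootstrapping loop over a toy corpus (the corpus, the crude string-based pattern generalisation and the function name are all illustrative assumptions):

```python
import re

corpus = [
    "Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights.",
    "Jetstar, which uses Avalon as a hub, added new routes.",
]

def bootstrap(seed_tuples, corpus, iterations=2):
    tuples, patterns = set(seed_tuples), set()
    for _ in range(iterations):
        # Steps 2-3: find sentences containing a known tuple and keep the
        # context between its two arguments as a (very crude) pattern.
        for e1, e2 in tuples:
            for sent in corpus:
                if e1 in sent and e2 in sent and sent.index(e1) < sent.index(e2):
                    patterns.add(sent[sent.index(e1) + len(e1):sent.index(e2)])
        # Step 4: apply the patterns to find new tuples.
        for pat in patterns:
            for sent in corpus:
                m = re.search(r"(\w+)" + re.escape(pat) + r"(\w+)", sent)
                if m:
                    tuples.add((m.group(1), m.group(2)))
    return tuples

print(bootstrap({("Ryanair", "Charleroi")}, corpus))
# Also picks up ('Jetstar', 'Avalon') via the shared ", which uses " context.
```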

Issue: Semantic Drift

• Pattern: [NP] has a {NP}* hub at [LOC]
• “Sydney has a ferry hub at Circular Quay”
‣ hub(Sydney, Circular Quay)
• More erroneous patterns are then extracted from this tuple…
• Should only accept patterns with high confidence.
Distant Supervision
• Semi-supervised methods assume the existence of seed tuples to mine new tuples.
• Can we mine new tuples directly?
• Distant supervision obtains new tuples from a range of sources:
‣ DBpedia
‣ Freebase
• Generates massive training sets, enabling the use of richer features, with no risk of semantic drift.
• Still relies on a fixed set of relations.
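A minimal sketch of how distant supervision might generate (noisy) training examples by matching knowledge-base triples against a corpus; the toy knowledge base and corpus below are purely illustrative:

```python
# Toy knowledge base of (entity1, relation, entity2) triples.
kb = [("American Airlines", "subsidiary_of", "AMR Corp."),
      ("Tim Wagner", "employed_by", "American Airlines")]

corpus = [
    "American Airlines, a unit of AMR Corp., immediately matched the move.",
    "Spokesman Tim Wagner of American Airlines announced the change.",
]

def distant_supervision_examples(kb, corpus):
    """Label every sentence mentioning both arguments of a KB triple
    as a (noisy) positive training example for that relation."""
    examples = []
    for e1, rel, e2 in kb:
        for sent in corpus:
            if e1 in sent and e2 in sent:
                examples.append((sent, e1, e2, rel))
    return examples

for example in distant_supervision_examples(kb, corpus):
    print(example)
```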
ReVERB: Unsupervised Relation Extraction
• If there is no relation database or the goal is to find new relations, unsupervised approaches must be used.
• Relations become substrings, usually containing a verb
• “United has a hub in Chicago, which is the headquarters of United Continental Holdings.”
‣ “has a hub in”(United, Chicago)
‣ “is the headquarters of”(Chicago, United Continental Holdings)
• Main problem: mapping the substring relations into canonical forms
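A very rough sketch of extracting a verb-centred relation substring from a POS-tagged sentence; this is only a toy illustration of the idea, not the actual ReVERB algorithm (which uses carefully designed syntactic and lexical constraints):

```python
# Toy POS-tagged sentence (token, tag); in practice a tagger provides these.
tagged = [("United", "NNP"), ("has", "VBZ"), ("a", "DT"), ("hub", "NN"),
          ("in", "IN"), ("Chicago", "NNP")]

def openie_style_extract(tagged):
    """Crude OpenIE-style sketch: take the span from the first verb to the
    last preposition as the relation phrase, and the adjacent tokens as
    its arguments."""
    tokens, tags = zip(*tagged)
    verb_idx = next(i for i, t in enumerate(tags) if t.startswith("VB"))
    prep_idx = max(i for i, t in enumerate(tags) if t == "IN")
    relation = " ".join(tokens[verb_idx:prep_idx + 1])
    return relation, tokens[verb_idx - 1], tokens[prep_idx + 1]

print(openie_style_extract(tagged))
# ('has a hub in', 'United', 'Chicago')
```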
Evaluation

• NER: F1-measure at the entity level.
• Relation Extraction with known relation set: F1-measure.
• Relation Extraction with unknown relations: much harder to evaluate
‣ Usually need some human evaluation
‣ Massive datasets used in these settings are impractical to evaluate manually: use a small sample
‣ Can only obtain (approximate) precision, not recall.
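A small sketch of entity-level precision/recall/F1, where an entity counts as correct only if both its span and its type match the gold annotation exactly (the span representation is an illustrative choice):

```python
def entity_f1(gold, pred):
    """Entity-level P/R/F1 over sets of (start, end, type) tuples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "ORG"), (6, 8, "ORG"), (14, 16, "PER")}
pred = {(0, 2, "ORG"), (6, 7, "ORG"), (14, 16, "PER")}  # one boundary error
print(entity_f1(gold, pred))  # (0.666..., 0.666..., 0.666...)
```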
Other IE Tasks
Temporal Expression Extraction
“[TIME July 2, 2007]: A fare increase initiated [TIME last week] by UAL Corp’s United Airlines was matched by competitors over [TIME the weekend], marking the second successful fare increase in [TIME two weeks].”
• Anchoring: when is “last week”?
‣ “last week” → 2007-W26
• Normalisation: mapping expressions to canonical forms.
‣ “July 2, 2007” → 2007-07-02
• Mostly rule-based approaches
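A minimal sketch of rule-based anchoring and normalisation for the two expression types above; the patterns and anchoring logic are illustrative toy rules, nothing like full TIMEX coverage:

```python
import datetime
import re

MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

def normalise(expr, anchor_date):
    """Map a temporal expression to an ISO form, anchored to a document date."""
    expr = expr.lower().strip()
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})", expr)  # e.g. "July 2, 2007"
    if m and m.group(1) in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}"
    if expr == "last week":
        year, week, _ = (anchor_date - datetime.timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"
    return None

anchor = datetime.date(2007, 7, 2)
print(normalise("July 2, 2007", anchor))   # 2007-07-02
print(normalise("last week", anchor))      # 2007-W26
```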



Event Extraction

• “American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said].”
• Very similar to NER, including annotation and learning methods.
• Event ordering: detect how a set of events happened in a timeline.
‣ Involves both event extraction and temporal expression extraction.

A Final Word

• Information Extraction is a vast field with many different tasks and applications
‣ Named Entity Recognition + Relation Extraction
‣ Events can be tracked by combining event and temporal expression extraction
• Machine learning methods involve classifiers and sequence labelling models.
Reading

• JM3 Ch. 18 – 18.2
• References:
‣ Lample et al., Neural Architectures for Named Entity Recognition, NAACL 2016
‣ https://github.com/glample/tagger