
LECTURE 12

Information Extraction and Named Entity Recognition

Arkaitz Zubiaga, 19th February, 2018

LECTURE 12: CONTENTS

 What is Information Extraction?

 Named Entity Recognition (NER).

 Relation Extraction (RE).

 Other Information Extraction tasks.

INFORMATION EXTRACTION (IE)

 Information extraction: automatically extracting structured
information from unstructured texts.

Subject: meeting

Date: 8th January, 2018

To: Arkaitz Zubiaga

Hi Arkaitz, we have finally scheduled the meeting.

It will be in the Ada Lovelace room, next Monday 10am-11am.

-Mike
Create new Calendar entry

Event: Meeting w/ Mike
Date: 15 Jan, 2018
Start: 10:00am
End: 11:00am
Where: A. Lovelace
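
As a toy illustration (added for this handout, not from the original slides), a couple of regular expressions can already recover some of these fields from the message; the rest of the lecture covers the techniques a real IE system would use:

import re

email = ("Hi Arkaitz, we have finally scheduled the meeting. "
         "It will be in the Ada Lovelace room, next Monday 10am-11am.")

# Hypothetical high-precision patterns for the room and the time span.
room = re.search(r"in the ([A-Z][\w ]+) room", email)
times = re.search(r"(\d{1,2}(?::\d{2})?[ap]m)-(\d{1,2}(?::\d{2})?[ap]m)", email)

print(room.group(1))                   # Ada Lovelace
print(times.group(1), times.group(2))  # 10am 11am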


ON WIKIPEDIA… THERE ARE INFOBOXES…

 Structured information:


…AND SENTENCES

 Unstructured text:

USING IE TO POPULATE WIKIPEDIA INFOBOXES

 We can use information extraction to populate the infobox
automatically.

INFORMATION EXTRACTION

 IE is the process of extracting some limited amount of semantic
content from text.

 Can view it as the task of automatically filling a questionnaire or
template (e.g. a database table).

 An important NLP application since the 1990s.

INFORMATION EXTRACTION: SUBTASKS

 Consists of the following subtasks:

 Named Entity Recognition.

 Relation Extraction.

 Coreference resolution.

 Event extraction.

 Temporal expression extraction.

 Slot filling.

 Entity linking.

NAMED ENTITY RECOGNITION

NAMED ENTITY RECOGNITION

 Named Entity (NE): anything that can be referred to with a
proper name.

 Usually 3 categories: person, location and organisation.

 Can be extended to numeric expressions (price, date, time).

NAMED ENTITIES

 Example of an extended set of named entities.

NAMED ENTITY RECOGNITION

 Named Entity Recognition (NER) task: 1) identify spans of text
that constitute proper names, 2) categorise them by entity type.

 First step in IE, also useful for:

 Reducing sparseness in text classification.

 Identifying the target in sentiment analysis.

 Question answering.

CHALLENGES IN NAMED ENTITY RECOGNITION

 Challenges in NER:

 Ambiguity of segmentation (is it an entity? where are the
boundaries?):
I’m on the Birmingham New Street-London Euston train.
(Birmingham New Street and London Euston are both locations.)

 Ambiguity of entity type:
Downing St.: location (street) or organisation (govt)?
Warwick: location (town) or organisation (uni)?
Georgia: person (name) or location (country)?

NER AS A SEQUENCE LABELLING TASK

 Standard NER algorithm: a word-by-word sequence labelling
task, where labels capture both boundaries and entity type, e.g.:

NER AS A SEQUENCE LABELLING TASK

 Convert it to BIO format:
(B): beginning, (I): inside, (O): outside.
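
For instance, a minimal illustration (the sentence and its spans are invented here, not from the slides):

# "University of Warwick" is a single ORG entity spanning three tokens.
tokens = ["Arkaitz", "works", "at", "the", "University", "of", "Warwick"]
tags   = ["B-PER",   "O",     "O",  "O",   "B-ORG",      "I-ORG", "I-ORG"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")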

NER AS A SEQUENCE LABELLING TASK

 State-of-the-art approach:

 An MEMM or CRF is trained to label each token in a text as
being part of a named entity or not.

 Use gazetteers to assign labels to those tokens.

 Gazetteers: lists of place names, first names, surnames,
organisations, products, etc.

 GeoNames.

 National Street Gazetteer (UK).

 More gazetteers.
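
As a rough sketch of the token-level features such a sequence labeller might use (the feature set and the toy gazetteer below are illustrative only; the resulting dicts could then be fed to a CRF implementation such as the sklearn-crfsuite package):

GAZETTEER = {"london", "birmingham", "coventry"}   # toy place-name gazetteer

def token_features(tokens, i):
    """Features for token i, to be consumed by a CRF/MEMM sequence labeller."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),            # capitalisation is a strong NE cue
        "word.isdigit": word.isdigit(),
        "word.in_gazetteer": word.lower() in GAZETTEER,
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<s>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }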

EVALUATION OF NER

 Standard measures: Precision, Recall, F-measure.

 The entity rather than the word is the unit of response, i.e.:

 The segmentation component makes evaluation more challenging:

 Using words for training but entities as units of response
means that if we have Leamington Spa in the corpus, and our
system identifies Leamington only, that’s a mistake.
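
A minimal sketch of entity-level scoring (written for this handout, not taken from the lecture): extract (type, start, end) spans from the BIO sequences and compare them as sets, so a partial match like Leamington vs. Leamington Spa counts as an error:

def bio_to_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):         # sentinel flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((tags[start][2:], start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def entity_f1(gold_tags, pred_tags):
    """Entity-level F-measure: a span is correct only if type and boundaries match."""
    gold, pred = set(bio_to_spans(gold_tags)), set(bio_to_spans(pred_tags))
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0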

COMMERCIAL NER SYSTEMS

 Statistical sequence models are the norm in academic research.

 Commercial NER systems are often hybrids of rules & machine
learning (the first two stages are sketched below):

1. Use high-precision rules to tag unambiguous entities.

2. Search for substring matches of those unambiguous entities.

3. Use application-specific name lists to identify other entities.

4. Use probabilistic sequence labelling to label what remains.
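
A toy sketch of stages 1 and 2 (the rule and the names here are invented for illustration):

import re

def rule_based_pass(text):
    """Stage 1: high-precision rules; stage 2: reuse the matches as a name list."""
    entities = set()
    # Rule: "Mr./Ms./Dr. + capitalised words" is almost always a person.
    for m in re.finditer(r"\b(?:Mr|Ms|Dr)\. ((?:[A-Z]\w+ ?)+)", text):
        entities.add(m.group(1).strip())
    # Second pass: tag every later occurrence of the names found above,
    # even without the title ("Tim Wagner declined" after "Mr. Tim Wagner").
    found = [(m.start(), name) for name in entities
             for m in re.finditer(re.escape(name), text)]
    return sorted(found)

print(rule_based_pass("Mr. Tim Wagner spoke. Later, Tim Wagner declined."))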

RELATION EXTRACTION

RELATION EXTRACTION

 Relation extraction: the process of identifying relations between
named entities.

The spokesman of the UK’s Downing Street, James Slack,…
(UK: location; Downing Street: organisation; James Slack: person)

 Relations:
 Downing Street is in the UK (ORG-LOC).

 James Slack is spokesperson of Downing Street (PER-ORG).

 James Slack works in the UK (PER-LOC).

RELATION TYPES

 Examples of types of relations:

 There are databases that define these relation types.

RELATION TYPES

 For example, there is the UMLS (Unified Medical Language
System), a network that describes 134 broad subject categories
and entity types, and 54 relations between entities, e.g.

 Relations usually describe properties of the entities.

RELATION EXTRACTION

 Wikipedia infoboxes can be used to create a corpus.

 Convert it into relations as RDF triples:
Univ. of Warwick – located_in – Coventry
Univ. of Warwick – established_in – 1965
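
A minimal sketch (invented here) of how an infobox, read as a Python dict, maps onto such subject–predicate–object triples:

infobox = {"located_in": "Coventry", "established_in": 1965}   # parsed infobox

# Each (subject, predicate, object) triple becomes one training example
# for relation extraction over the article's sentences.
triples = [("Univ. of Warwick", pred, obj) for pred, obj in infobox.items()]
print(triples)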

RELATION EXTRACTION USING PATTERNS

 The first approach, used in the 90s, relied on lexical/syntactic patterns.

 -: Such patterns give high precision but typically low recall.

 -: Difficult to generalise to new domains and new IE tasks.
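
As an illustration (this particular pattern is invented here), a single lexico-syntactic pattern extracts one relation type with high precision, but misses every other way of phrasing it, which is where the low recall comes from:

import re

# "X, a unit of Y" => part_of(X, Y); misses "X belongs to Y", "Y's X", etc.
PATTERN = re.compile(r"([A-Z][\w ]+?), a unit of ([A-Z]\w+)")

sent = "American Airlines, a unit of AMR, immediately matched the move."
for x, y in PATTERN.findall(sent):
    print(("part_of", x, y))   # ('part_of', 'American Airlines', 'AMR')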

RE USING SUPERVISED MACHINE LEARNING

 Supervised machine learning approaches typically do the following:

1. A fixed set of relations and entities is chosen.

2. A corpus is manually annotated with the entities and
relations to create training data.

3. A classifier, e.g. an SVM, Logistic Regression, Naive Bayes or
Perceptron, is trained on the corpus and tested on unseen
text to classify potential relations between entity pairs.

RE USING SUPERVISED MACHINE LEARNING

 Important to choose good features, e.g. dependencies, POS tags.
 Example: American Airlines, a unit of AMR, immediately matched
the move, spokesman Tim Wagner said.
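
A minimal sketch of such a classifier (the feature names and the tiny training set are invented for illustration), using scikit-learn:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One feature dict per candidate entity pair in a sentence.
train_feats = [
    {"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"},
    {"e1_type": "PER", "e2_type": "ORG", "between": "spokesman of"},
]
train_labels = ["part_of", "employed_by"]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_feats), train_labels)

test = {"e1_type": "ORG", "e2_type": "ORG", "between": "a unit of"}
print(clf.predict(vec.transform([test]))[0])   # expected: part_of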

RE USING SEMI-SUPERVISED LEARNING

 Bootstrapping can be used to expand our training data
with instances where our classifier’s predictions are very
confident.
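
Schematically (a sketch written for this handout; the classifier interface with train() and predict_with_confidence() is hypothetical):

def bootstrap(classifier, labelled, unlabelled, threshold=0.95, rounds=5):
    """Grow the training set with the classifier's most confident predictions."""
    for _ in range(rounds):
        classifier.train(labelled)
        confident = []
        for x in unlabelled:
            label, prob = classifier.predict_with_confidence(x)
            if prob >= threshold:
                confident.append((x, label))
        if not confident:
            break                          # nothing left that we trust enough
        labelled = labelled + confident
        added = {x for x, _ in confident}
        unlabelled = [x for x in unlabelled if x not in added]
    return classifier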

RE USING DISTANT SUPERVISION

 Combines the best of both worlds: bootstrapping + supervised
machine learning.

 Use the Web (e.g. Wikipedia) to build large sets of relations.

 We can then train a more reliable supervised classifier.
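
The core trick, sketched minimally (the data here is invented): every sentence that mentions both arguments of a known knowledge-base triple is treated as a (noisy) training example for that relation:

kb = {("Barack Obama", "Hawaii"): "born_in"}   # known triples, e.g. from a KB

sentences = [
    "Barack Obama was born in Hawaii.",
    "Barack Obama visited Hawaii in 2008.",    # noisy: matched, but not born_in
]

training_data = [(s, rel) for (e1, e2), rel in kb.items()
                 for s in sentences if e1 in s and e2 in s]
for example in training_data:
    print(example)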

OPEN IE: UNSUPERVISED RE

 The main difference is that the schema for these relations does
not need to be specified in advance.

 The relation name is just the text linking the two arguments, e.g.:
“Barack Obama was born in Hawaii” would create
Barack Obama; was born in; Hawaii

 The challenge lies in identifying that two entities connected by
a verb do constitute an actual relation, and not a false positive.

STANFORD OPENIE

 Stanford’s state-of-the-art OpenIE system:

1. Using dependencies, the sentence is split into clauses.

2. Each clause is maximally shortened into sentence fragments.

3. These fragments are output as OpenIE triples.
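
A minimal way to try it (assuming a CoreNLP server with the openie annotator is already running locally on port 9000; the Python wrapper listed on the Resources slide is an alternative):

import json
import requests

props = {"annotators": "tokenize,ssplit,pos,depparse,natlog,openie",
         "outputFormat": "json"}
resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data="Barack Obama was born in Hawaii.".encode("utf-8"))

for sentence in resp.json()["sentences"]:
    for t in sentence["openie"]:
        print(t["subject"], ";", t["relation"], ";", t["object"])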

EVALUATION OF RELATION EXTRACTION

 Precision, Recall, F-measure for supervised methods.

 For unsupervised methods evaluation is much harder.
We can estimate Precision using a small labelled sample:

 Precision at different levels of recall, e.g. 1,000 new relations,
10,000 new relations, etc.

OTHER INFORMATION
EXTRACTION TASKS

COREFERENCE RESOLUTION

 Coreference resolution: finding all expressions that refer to the
same entity in a text.

COREFERENCE RESOLUTION

 Neuralcoref: a Python package for coreference resolution.
https://github.com/huggingface/neuralcoref

 And you can try it online:
https://huggingface.co/coref/


NEURAL COREF: EXAMPLE
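
A minimal sketch of running it locally (assuming spaCy and the neuralcoref package are installed; this follows the package's documented usage):

import spacy
import neuralcoref

nlp = spacy.load("en")            # or "en_core_web_sm", depending on spaCy version
neuralcoref.add_to_pipe(nlp)      # add the coreference component to the pipeline

doc = nlp("My sister has a dog. She loves him.")
print(doc._.has_coref)            # True
print(doc._.coref_clusters)       # e.g. [My sister: [My sister, She], a dog: [a dog, him]]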

EVENT EXTRACTION

 Event extraction: extracting structured information about events
from text. Often <entity> - <verb> - <entity>, as in relation
extraction. Very useful to analyse news stories.

Steve Jobs died in 2011 → Steve Jobs; death in; 2011

New Android phone was announced by Google →
New Android phone; announcement by; Google

TEMPORAL EXPRESSION EXTRACTION

 Temporal expression extraction: structuring time mentions in
text, often associated with event extraction.

 Event 1: mid 80’s → released a pass.

 Event 2: ~4 seconds later → slammed to the ground.
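
A toy sketch of pulling simple time mentions out of text (the patterns are invented here; real systems use much richer grammars, such as SUTime in CoreNLP):

import re

# A few illustrative temporal patterns; far from exhaustive.
TIME_PATTERNS = [
    r"\b\d{1,2}(?::\d{2})?\s?[ap]m\b",                     # 10am, 10:30 am
    r"\bnext (?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
    r"~?\d+ seconds? later\b",
]

text = "It will be next Monday 10am-11am, ~4 seconds later he fell."
for pat in TIME_PATTERNS:
    for m in re.finditer(pat, text):
        print(m.group())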


TEMPLATE/SLOT FILLING

 Slot filling (aka Knowledge Base Population): the task of taking an
incomplete knowledge base (e.g., Wikidata) and a large corpus of
text (e.g., Wikipedia), and filling in the missing elements of the
knowledge base.


ENTITY LINKING

 Entity linking: the task of taking ambiguous entity mentions and
linking them to concrete entries in a knowledge base.

Alex Jones has a new film.

Which of them? (Several knowledge-base entries share that name.)

 Can be defined as:
● Classification task.

● Similarity task (see the sketch below).
● …
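
A minimal sketch of the similarity framing (the candidates and descriptions are invented here): score each candidate knowledge-base entry by word overlap between the mention's sentence and the entry's description, and link to the best one:

candidates = {
    "Alex Jones (radio host)": "American radio show host",
    "Alex Jones (filmmaker)": "Welsh documentary film director",
}

def link(mention_context, candidates):
    """Pick the candidate whose description best overlaps the mention's context."""
    ctx = set(mention_context.lower().split())
    scores = {name: len(ctx & set(desc.lower().split()))
              for name, desc in candidates.items()}
    return max(scores, key=scores.get)

print(link("Alex Jones has a new film", candidates))   # Alex Jones (filmmaker)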


INFORMATION EXTRACTION: SUMMARY

 Involves a broad range of tasks, with a common goal:
unstructured text → structured data

 Sequence classification is often useful.

 As with many other tasks, availability of data is crucial:
 Labelled corpora.
 Gazetteers.
 Thesauri.


RESOURCES

 Stanford CRF NER:
https://nlp.stanford.edu/software/CRF-NER.shtml

 Stanford CoreNLP:
https://stanfordnlp.github.io/CoreNLP/

 iepy:
https://pypi.python.org/pypi/iepy

 Stanford-OpenIE-Python:
https://github.com/philipperemy/Stanford-OpenIE-Python


REFERENCES

 Chang, C. H., Kayed, M., Girgis, M. R., & Shaalan, K. F. (2006). A
survey of web information extraction systems. IEEE Transactions on
Knowledge and Data Engineering, 18(10), 1411-1428.

 Eikvil, L. (1999). Information extraction from the World Wide Web:
A survey.


ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. 3rd edition
draft. Chapter 20.

 Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 7.