
LECTURE 12

Information Extraction and Named Entity Recognition

Arkaitz Zubiaga, 19th February, 2018


 What is Information Extraction?

 Named Entity Recognition (NER).

 Relation Extraction (RE).

 Other Information Extraction tasks.

LECTURE 12: CONTENTS


 Information extraction: automatically extracting structured
information from unstructured texts.

INFORMATION EXTRACTION (IE)

Subject: meeting

Date: 8th January, 2018

To: Arkaitz Zubiaga

Hi Arkaitz, we have finally scheduled the meeting.

It will be in the Ada Lovelace room, next Monday 10am-11am.

-Mike
Create new Calendar entry

Event: Meeting w/ Mike
Date: 15 Jan, 2018
Start: 10:00am
End: 11:00am
Where: A. Lovelace


ON WIKIPEDIA… THERE ARE INFOBOXES…

 Structured information:


…AND SENTENCES

 Unstructured text:


 We can use information extraction to populate the infobox
automatically.

USING IE TO POPULATE WIKIPEDIA INFOBOXES


 IE is the process of extracting some limited amount of semantic
content from text.

 Can view it as the task of automatically filling a questionnaire or
template (e.g. database table).

 An important NLP application since the 1990s.

INFORMATION EXTRACTION


 Consists of the following subtasks:

 Named Entity Recognition.

 Relation Extraction.

 Coreference resolution.

 Event extraction.

 Temporal expression extraction.

 Slot filling.

 Entity linking.

INFORMATION EXTRACTION: SUBTASKS

NAMED ENTITY RECOGNITION


 Named Entity (NE): anything that can be referred to with a
proper name.

 Usually 3 categories: person, location and organisation.

 Can be extended to numeric expressions (price, date, time).

NAMED ENTITY RECOGNITION


 Example of an extended set of named entities.

NAMED ENTITIES


 Named Entity Recognition (NER) task: (1) identify spans of text
that constitute proper names, (2) categorise them by entity type.

 First step in IE, also useful for:

 Reducing sparseness in text classification.

 Identifying the target in sentiment analysis.

 Question answering.

NAMED ENTITY RECOGNITION


 Challenges in NER:

 Ambiguity of segmentation (is it an entity? where are its boundaries?):
I’m on the Birmingham New Street–London Euston train.
(Birmingham New Street: location; London Euston: location)

 Ambiguity of entity type:
Downing St.: location (street) or organisation (government)?
Warwick: location (town) or organisation (university)?
Georgia: person (name) or location (country)?

CHALLENGES IN NAMED ENTITY RECOGNITION


 The standard NER approach is word-by-word sequence labelling,
where the labels capture both the boundaries and the entity type, e.g.:

NER AS A SEQUENCE LABELLING TASK


 Convert it to BIO format:
(B) beginning of an entity, (I) inside an entity, (O) outside any entity; see the sketch below.
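To make the encoding concrete, here is a minimal Python sketch; the sentence, entity spans and helper function are illustrative examples rather than material from the lecture:

    # Convert token-level entity spans into BIO tags: B-TYPE starts an entity,
    # I-TYPE continues it, O marks tokens outside any entity.
    def spans_to_bio(tokens, entities):
        """entities: list of (start, end_exclusive, type) token spans."""
        tags = ["O"] * len(tokens)
        for start, end, etype in entities:
            tags[start] = "B-" + etype
            for i in range(start + 1, end):
                tags[i] = "I-" + etype
        return tags

    tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "said", "."]
    entities = [(0, 2, "PER"), (3, 5, "ORG")]
    print(list(zip(tokens, spans_to_bio(tokens, entities))))
    # [('Jane', 'B-PER'), ('Villanueva', 'I-PER'), ('of', 'O'),
    #  ('United', 'B-ORG'), ('Airlines', 'I-ORG'), ('said', 'O'), ('.', 'O')]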

NER AS A SEQUENCE LABELLING TASK


 State-of-the-art approach:

 A MEMM or CRF is trained to label each token in a text as
being part of a named entity or not.

 Gazetteers are used to help assign labels to those tokens
(see the sketch at the end of this slide).

 Gazetteers: lists of place names, first names, surnames,
organisations, products, etc.

 GeoNames.

 National Street Gazetteer (UK).

 More gazetteers.

NER AS A SEQUENCE LABELLING TASK

http://www.geonames.org/
https://data.gov.uk/dataset/national-street-gazetteer
http://library.stanford.edu/guides/gazetteers
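As a rough illustration of how a CRF tagger can combine lexical features with a gazetteer lookup, here is a minimal Python sketch. It assumes the third-party sklearn-crfsuite package; the feature set, the three-entry gazetteer and the toy training sentence are invented for the example:

    import sklearn_crfsuite

    GAZETTEER = {"london", "birmingham", "coventry"}  # e.g. loaded from GeoNames

    def token_features(tokens, i):
        word = tokens[i]
        return {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "in_gazetteer": word.lower() in GAZETTEER,
            "prev.lower": tokens[i - 1].lower() if i > 0 else "<s>",
            "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        }

    def sent_features(tokens):
        return [token_features(tokens, i) for i in range(len(tokens))]

    # X: one list of per-token feature dicts per sentence; y: the BIO tags.
    X_train = [sent_features(["I", "live", "in", "Coventry", "."])]
    y_train = [["O", "O", "O", "B-LOC", "O"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X_train, y_train)
    print(crf.predict([sent_features(["She", "works", "in", "London", "."])]))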


 Standard measures: Precision, Recall, F-measure.

 The entity, rather than the word, is the unit of response.

 The segmentation component makes evaluation more challenging:

 Using words for training but entities as units of response means
that if we have Leamington Spa in the corpus and our system
identifies Leamington only, that counts as a mistake (see the
sketch below).
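A minimal sketch of entity-level scoring, where predicted and gold entities are compared as whole (start, end, type) spans, so finding only “Leamington” when the gold entity is “Leamington Spa” earns no credit; the spans below are invented:

    def entity_prf(gold, pred):
        gold, pred = set(gold), set(pred)
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    gold = [(0, 2, "LOC")]   # "Leamington Spa"
    pred = [(0, 1, "LOC")]   # system found "Leamington" only -> counted as a miss
    print(entity_prf(gold, pred))  # (0.0, 0.0, 0.0)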

EVALUATION OF NER


 Statistical sequence models are the norm in academic research.

 Commercial NER systems are often hybrids of rules & machine learning:

1. Use high-precision rules to tag unambiguous entities.

2. Search for substring matches of those unambiguous entities.

3. Use application-specific name lists to identify further entities.

4. Use probabilistic sequence labelling to tag what remains.

COMMERCIAL NER SYSTEMS

RELATION EXTRACTION


 Relation extraction: the process of identifying relations between
named entities.

The spokesman of the UK’s Downing Street, James Slack,…
(the UK: location; Downing Street: organisation; James Slack: person)

 Relations:
 Downing Street is in the UK (ORG-LOC).

 James Slack is spokesperson of Downing Street (PER-ORG).

 James Slack works in the UK (PER-LOC).

RELATION EXTRACTION


 Examples of types of relations:

 There are databases that define these relation types.

RELATION TYPES


 For example, there is the UMLS (Unified Medical Language
System), a semantic network that describes 134 broad subject categories
(entity types) and 54 relations between entities, e.g.

 Relations usually describe properties of the entities.

RELATION TYPES


 Wikipedia infoboxes can be used to create a corpus.

 Convert them into relations as RDF triples, e.g. (see the sketch below):
Univ. of Warwick – located_in – Coventry
Univ. of Warwick – established_in – 1965
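A small sketch of storing such infobox-derived relations as RDF triples, assuming the third-party rdflib package; the namespace and URIs are invented examples:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Univ_of_Warwick, EX.located_in, EX.Coventry))
    g.add((EX.Univ_of_Warwick, EX.established_in, Literal(1965)))

    for subj, pred, obj in g:   # each entry is a (subject, predicate, object) triple
        print(subj, pred, obj)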

RELATION EXTRACTION


 The first approach, used in the 1990s, relied on lexical/syntactic patterns
(see the sketch below).

 Downside: such patterns are high precision but typically low recall.

 Downside: difficult to generalise to new domains and new IE tasks.
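A toy sketch of the pattern-based approach: a single hand-written regular expression for a “based in” relation. The pattern and the sentence are invented; real systems use many such lexical/syntactic patterns:

    import re

    PATTERN = re.compile(r"(?P<org>[A-Z][\w ]+?), based in (?P<loc>[A-Z][\w ]+)")

    text = "Acme Corp, based in Coventry, announced record profits."
    for m in PATTERN.finditer(text):
        print((m.group("org"), "based_in", m.group("loc")))
    # ('Acme Corp', 'based_in', 'Coventry')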

RELATION EXTRACTION USING PATTERNS


 Supervised machine learning approaches typically proceed as follows:

1. A fixed set of relations and entities is chosen.

2. A corpus is manually annotated with the entities and
relations to create training data.

3. A classifier (e.g. SVM, Logistic Regression, Naive Bayes, Perceptron)
is trained on the corpus and used to classify potential relations
between entity pairs in unseen text.

RE USING SUPERVISED MACHINE LEARNING


 Important to choose good features, e.g. dependencies, POS tags
(see the sketch below).

 Example: American Airlines, a unit of AMR, immediately matched the move,
spokesman Tim Wagner said.
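A rough scikit-learn sketch of the supervised setup above: each candidate entity pair is reduced to a simple feature string (here just the entity types plus the words between the entities) and a classifier predicts the relation label. The features, training examples and relation labels are invented placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    X_train = [
        "ORG ORG a unit of",   # e.g. American Airlines -- AMR
        "PER ORG spokesman",   # e.g. Tim Wagner -- American Airlines
        "PER LOC was born in",
    ]
    y_train = ["part_of", "employed_by", "born_in"]

    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(X_train, y_train)
    print(model.predict(["PER ORG spokesman of"]))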

RE USING SUPERVISED MACHINE LEARNING


 Bootstrapping can be used to expand our training data
with instances for which our classifier’s predictions are highly
confident (see the sketch below).
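A minimal self-training sketch of this idea: repeatedly add the classifier’s most confident predictions on unlabelled examples to the training set. Everything here (the feature vectors, the threshold and the number of rounds) is an invented placeholder:

    from sklearn.linear_model import LogisticRegression

    def bootstrap(X_lab, y_lab, X_unlab, rounds=3, threshold=0.9):
        clf = LogisticRegression()
        for _ in range(rounds):
            clf.fit(X_lab, y_lab)
            if not X_unlab:
                break
            probs = clf.predict_proba(X_unlab)
            confident = [i for i, p in enumerate(probs) if p.max() >= threshold]
            # move confidently labelled instances into the training data
            X_lab = X_lab + [X_unlab[i] for i in confident]
            y_lab = y_lab + [clf.classes_[probs[i].argmax()] for i in confident]
            X_unlab = [x for i, x in enumerate(X_unlab) if i not in set(confident)]
        return clf

    # toy 2-d feature vectors standing in for real relation features
    bootstrap([[0, 1], [1, 0]], ["rel_A", "rel_B"], [[0, 2], [2, 0], [1, 1]])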

RE USING SEMI-SUPERVISED LEARNING


 Combine the best of both worlds: bootstrapping + supervised
machine learning.

 Use the Web (e.g. Wikipedia), paired with an existing knowledge base,
to build large sets of relation examples automatically (see the sketch below).

 We can then train a more reliable supervised classifier.
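A schematic sketch of how distant supervision generates training data: any sentence that mentions both entities of a known knowledge-base pair is labelled with that pair’s relation. The knowledge-base facts and sentences are invented:

    KB = {("Barack Obama", "Hawaii"): "born_in",
          ("Univ. of Warwick", "Coventry"): "located_in"}

    sentences = [
        "Barack Obama was born in Hawaii in 1961.",
        "The Univ. of Warwick campus lies on the outskirts of Coventry.",
        "Barack Obama visited Coventry last year.",  # no matching KB pair -> skipped
    ]

    training_data = []
    for sentence in sentences:
        for (e1, e2), relation in KB.items():
            if e1 in sentence and e2 in sentence:
                training_data.append((sentence, e1, e2, relation))

    for example in training_data:
        print(example)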

RE USING DISTANT SUPERVISION


 The main difference is that the schema for these relations does
not need to be specified in advance.

 The relation name is just the text linking the two arguments, e.g.:
“Barack Obama was born in Hawaii” would create
Barack Obama; was born in; Hawaii

 The challenge lies in deciding whether two entities connected by
a verb constitute an actual relation, rather than a false positive.

OPEN IE: UNSUPERVISED RE


 Stanford’s state-of-the-art OpenIE system (see the sketch below):

1. Using dependencies, splits each sentence into clauses.

2. Each clause is maximally shortened into sentence fragments.

3. These fragments are output as OpenIE triples.

STANFORD OPENIE

https://nlp.stanford.edu/software/openie.html
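One way to try this from Python is to query a locally running CoreNLP server (started from the distribution linked above) with the openie annotator over HTTP. This is a hedged sketch based on my understanding of the server’s JSON output; check the documentation at the URL above for the exact fields:

    import json
    import requests

    text = "Barack Obama was born in Hawaii."
    props = {"annotators": "openie", "outputFormat": "json"}
    resp = requests.post("http://localhost:9000/",          # assumes a server on port 9000
                         params={"properties": json.dumps(props)},
                         data=text.encode("utf-8"))
    for sentence in resp.json().get("sentences", []):
        for triple in sentence.get("openie", []):
            print(triple["subject"], "|", triple["relation"], "|", triple["object"])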


 Precision, Recall and F-measure for supervised methods.

 For unsupervised methods evaluation is much harder.
We can estimate Precision using a small labelled sample:

 Precision at different levels of recall, e.g. after extracting 1,000 new
relations, 10,000 new relations, etc.

EVALUATION OF RELATION EXTRACTION

OTHER INFORMATION EXTRACTION TASKS


 Coreference resolution: finding all expressions that refer to the
same entity in a text.

COREFERENCE RESOLUTION


 Neuralcoref: a Python package for coreference resolution
(a short usage sketch follows the example slide below).
https://github.com/huggingface/neuralcoref

 And you can try it online:
https://huggingface.co/coref/

COREFERENCE RESOLUTION



NEURAL COREF: EXAMPLE
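The original slide showed a screenshot of the online demo; a minimal usage sketch of the package looks roughly like this (it assumes spaCy 2.x with an English model installed alongside neuralcoref, and the example sentence is invented):

    import spacy
    import neuralcoref

    nlp = spacy.load("en_core_web_sm")    # spaCy 2.x English model
    neuralcoref.add_to_pipe(nlp)          # register the coreference component

    doc = nlp("My sister has a dog. She loves him.")
    print(doc._.has_coref)        # True if any coreference chain was found
    print(doc._.coref_clusters)   # e.g. [My sister: [My sister, She], a dog: [a dog, him]]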


 Event extraction: extracting structured information about events
from text. Often expressed as triples, as in relation
extraction. Very useful for analysing news stories.

Steve Jobs died in 2011 → Steve Jobs; death in; 2011

New Android phone was announced by Google →
New Android phone; announcement by; Google

EVENT EXTRACTION


 Temporal expression extraction: structuring time mentions in
text, often associated with event extraction.

 Event 1: mid 80’s → released a pass.

 Event 2: ~5 seconds later → slammed to the ground.

TEMPORAL EXPRESSION EXTRACTION


TEMPLATE/SLOT FILLING

 Slot filling (aka Knowledge Base Population): the task of taking an
incomplete knowledge base (e.g. Wikidata) and a large corpus of
text (e.g. Wikipedia), and filling in the missing elements of the
knowledge base.


ENTITY LINKING

 Entity linking: the task of taking ambiguous entity mentions and linking
them to concrete entries in a knowledge base.

Alex Jones has a new film.

Which of them? →

 Can be defined as (see the sketch after this list):

● A classification task.

● A similarity task.

● …
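A toy sketch of the similarity view: score each knowledge-base candidate by word overlap between the mention’s context and the candidate’s description, and link to the best-scoring entry. The candidate entries and descriptions are illustrative only:

    def overlap_score(context, description):
        c, d = set(context.lower().split()), set(description.lower().split())
        return len(c & d)

    candidates = {
        "Alex Jones (radio host)": "American radio host and conspiracy theorist",
        "Alex Jones (presenter)": "Welsh television presenter of The One Show",
    }

    mention_context = "Alex Jones presented the film on BBC television last night."
    best = max(candidates, key=lambda name: overlap_score(mention_context, candidates[name]))
    print(best)   # picks the candidate whose description best matches the context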


INFORMATION EXTRACTION: SUMMARY

 Involves a broad range of tasks, with a common goal:
unstructured text → structured data

 Sequence classification is often useful.

 As with many other tasks, availability of data is crucial:
 Labelled corpora.
 Gazetteers.
 Thesauri.


RESOURCES

 Stanford CRF NER:
https://nlp.stanford.edu/software/CRF-NER.shtml

 Stanford CoreNLP:
https://stanfordnlp.github.io/CoreNLP/

 iepy:
https://pypi.python.org/pypi/iepy

 Stanford-OpenIE-Python:
https://github.com/philipperemy/Stanford-OpenIE-Python



REFERENCES

 Chang, C. H., Kayed, M., Girgis, M. R., & Shaalan, K. F. (2006). A
survey of web information extraction systems. IEEE Transactions on
Knowledge and Data Engineering, 18(10), 1411-1428.

 Eikvil, L. (1999). Information extraction from the World Wide Web: A survey.


ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition. Chapter 20.

 Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 7.
