LECTURE 12
Information Extraction and Named Entity Recognition
Arkaitz Zubiaga, 19th February, 2018
What is Information Extraction?
Named Entity Recognition (NER).
Relation Extraction (RE).
Other Information Extraction tasks.
LECTURE 12: CONTENTS
Information extraction: automatically extracting structured
information from unstructured texts.
INFORMATION EXTRACTION (IE)
Subject: meeting
Date: 8th January, 2018
To: Arkaitz Zubiaga
Hi Arkaitz, we have finally scheduled the meeting.
It will be in the Ada Lovelace room, next Monday 10am-11am.
-Mike
Create new Calendar entry
Event: Meeting w/ Mike
Date: 15 Jan, 2018
Start: 10:00am
End: 11:00am
Where: A. Lovelace
ON WIKIPEDIA… THERE ARE INFOBOXES…
Structured information:
…AND SENTENCES
Unstructured text:
We can use information extraction to populate the infobox
automatically.
USING IE TO POPULATE WIKIPEDIA INFOBOXES
IE is the process of extracting some limited amount of semantic
content from text.
It can be viewed as the task of automatically filling in a
questionnaire or template (e.g. a database table).
An important NLP application since the 1990s.
INFORMATION EXTRACTION
It consists of the following subtasks:
Named Entity Recognition.
Relation Extraction.
Coreference resolution.
Event extraction.
Temporal expression extraction.
Slot filling.
Entity linking.
INFORMATION EXTRACTION: SUBTASKS
NAMED ENTITY RECOGNITION
Named Entity (NE): anything that can be referred to with a
proper name.
Usually 3 categories: person, location and organisation.
Can be extended to numeric expressions (price, date, time).
NAMED ENTITY RECOGNITION
Example of an extended set of named entities.
NAMED ENTITIES
Named Entity Recognition (NER) task: 1) identify spans of text
that constitute proper names, 2) categorise them by entity type.
The first step in IE, but also useful for:
Reducing sparseness in text classification.
Identifying the target in sentiment analysis.
Question answering.
NAMED ENTITY RECOGNITION
Challenges in NER:
Ambiguity of segmentation (is it an entity? where are its boundaries?):
I’m on the Birmingham New Street-London Euston train. (two locations)
Ambiguity of entity type:
Downing St.: location (street) or organisation (government)?
Warwick: location (town) or organisation (university)?
Georgia: person (name) or location (country)?
CHALLENGES IN NAMED ENTITY RECOGNITION
The standard NER algorithm is a word-by-word sequence labelling
task, where labels capture both boundaries and entity type, e.g.:
NER AS A SEQUENCE LABELLING TASK
Convert it to BIO format:
(B): beginning, (I): inside, (O): outside.
NER AS A SEQUENCE LABELLING TASK
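The BIO conversion can be sketched in a few lines of Python; the sentence and its entity spans below are hypothetical annotations, not taken from the lecture.

```python
# Hypothetical example sentence with (start, end, type) entity spans
# over token indices (end exclusive).
tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "said", "."]
entities = [(0, 2, "PER"), (3, 5, "ORG")]

def to_bio(tokens, entities):
    """Label each token B-TYPE (beginning), I-TYPE (inside) or O (outside)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

print(to_bio(tokens, entities))
# → ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O', 'O']
```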
State-of-the-art approach:
An MEMM or CRF is trained to label each token in a text as
being part of a named entity or not.
Gazetteers are used to assign labels to those tokens.
Gazetteers: lists of place names, first names, surnames,
organisations, products, etc., e.g.:
GeoNames.
National Street Gazetteer (UK).
More gazetteers.
NER AS A SEQUENCE LABELLING TASK
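A minimal sketch of how a gazetteer can feed features to a sequence labeller such as a CRF; the three hand-written place names stand in for a real resource like GeoNames.

```python
# Toy gazetteer; real systems load large lists (GeoNames, the National
# Street Gazetteer, name lists) instead of this hand-written set.
GAZETTEER = {"london", "coventry", "birmingham"}

def gazetteer_features(tokens):
    """One boolean feature per token: does it appear in the gazetteer?"""
    return [{"in_gazetteer": tok.lower() in GAZETTEER} for tok in tokens]

feats = gazetteer_features(["Arrived", "in", "Coventry", "today"])
print(feats)
```

In a full system this dictionary lookup would be one feature among many (POS tags, word shape, context words) passed to the CRF.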
Standard measures: Precision, Recall, F-measure.
The entity, rather than the word, is the unit of response, i.e.:
The segmentation component makes evaluation more challenging:
Using words for training but entities as the unit of response
means that if we have Leamington Spa in the corpus, and our
system identifies only Leamington, that’s a mistake.
EVALUATION OF NER
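Entity-level scoring can be sketched as exact span matching: a partial match such as finding Leamington where the gold entity is Leamington Spa earns no credit.

```python
def entity_prf(gold, predicted):
    """Precision/Recall/F1 over (start, end, type) spans; exact match only."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 2, "LOC")]   # "Leamington Spa"
pred = [(0, 1, "LOC")]   # system found "Leamington" only: no credit
print(entity_prf(gold, pred))
# → (0.0, 0.0, 0.0)
```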
Statistical sequence models are the norm in academic research.
Commercial NER systems are often hybrids of rules & machine learning:
1. Use high-precision rules to tag unambiguous entities.
2. Search for substring matches of those unambiguous entities.
3. Use application-specific name lists to identify other entities.
4. Use probabilistic sequence labelling to complete the tagging.
COMMERCIAL NER SYSTEMS
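Step 1 of the hybrid pipeline might look like the following sketch: one hypothetical high-precision rule that tags honorific + surname patterns as person mentions.

```python
import re

# Hypothetical high-precision rule: an honorific followed by a
# capitalised surname is almost always a person mention.
PERSON_RULE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+[A-Z][a-z]+")

def tag_unambiguous_persons(text):
    """Return the spans matched by the high-precision person rule."""
    return [m.group() for m in PERSON_RULE.finditer(text)]

print(tag_unambiguous_persons("Dr. Smith met Mr. Jones at noon."))
# → ['Dr. Smith', 'Mr. Jones']
```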
RELATION EXTRACTION
Relation extraction: the process of identifying relations between
named entities.
The spokesman of the UK’s Downing Street, James Slack,…
(UK: location; Downing Street: organisation; James Slack: person)
Relations:
Downing Street is in the UK (ORG-LOC).
James Slack is spokesperson of Downing Street (PER-ORG).
James Slack works in the UK (PER-LOC).
RELATION EXTRACTION
Examples of types of relations:
There are databases that define these relation types.
RELATION TYPES
For example, there is the UMLS (Unified Medical Language
System), a network that describes 134 broad subject categories,
entity types and 44 relations between entities, e.g.:
Relations usually describe properties of the entities.
RELATION TYPES
Wikipedia infoboxes can be used to create a corpus.
Convert them into relations as RDF triples:
Univ. of Warwick – located_in – Coventry
Univ. of Warwick – established_in – 1965
…
RELATION EXTRACTION
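Turning a parsed infobox into RDF-style triples can be sketched as below; the infobox dict is a hypothetical, already-parsed stand-in for real wikitext, and the field names are illustrative.

```python
# Hypothetical parsed infobox: field names are made up for illustration.
infobox = {
    "subject": "Univ. of Warwick",
    "located_in": "Coventry",
    "country": "UK",
}

def infobox_to_triples(box):
    """Emit (subject, predicate, object) triples from an infobox dict."""
    subject = box["subject"]
    return [(subject, pred, obj) for pred, obj in box.items() if pred != "subject"]

print(infobox_to_triples(infobox))
```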
The first approaches, used in the 90s, relied on lexical/syntactic patterns.
Cons: such patterns are high precision but typically low recall.
Cons: difficult to generalise to new domains and new IE tasks.
RELATION EXTRACTION USING PATTERNS
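A classic lexical pattern of this kind is the "X such as Y" hyponym pattern. The sketch below is illustrative only and, as the slide notes, high precision but low recall; it keeps the hypernym as the plural form it matched.

```python
import re

# "NPs such as A and B" → (A, is_a, NPs); naive and deliberately narrow.
PATTERN = re.compile(r"(\w+) such as ((?:[A-Z]\w+(?:, | and )?)+)")

def extract_is_a(text):
    triples = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)          # kept as matched, e.g. "cities"
        for name in re.split(r", | and ", m.group(2)):
            if name:
                triples.append((name, "is_a", hypernym))
    return triples

print(extract_is_a("He visited cities such as London and Paris."))
# → [('London', 'is_a', 'cities'), ('Paris', 'is_a', 'cities')]
```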
Supervised machine learning approaches typically work as follows:
1. A fixed set of relations and entities is chosen.
2. A corpus is manually annotated with the entities and
relations to create training data.
3. A classifier (e.g. SVM, Logistic Regression, Naive Bayes,
Perceptron) is trained on the corpus and tested on unseen
text to classify potential relations between entity pairs.
RE USING SUPERVISED MACHINE LEARNING
It is important to choose good features, e.g. dependencies, POS tags.
Example: American Airlines, a unit of AMR, immediately matched the move,
spokesman Tim Wagner said.
RE USING SUPERVISED MACHINE LEARNING
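A sketch of simple surface features for a candidate entity pair, in the spirit of the American Airlines / AMR example; the feature names are made up for illustration, and a real system would add POS tags and dependency paths.

```python
def pair_features(tokens, e1_span, e2_span):
    """Headwords of both entities plus the words between them."""
    between = tokens[e1_span[1]:e2_span[0]]
    return {
        "e1_head": tokens[e1_span[1] - 1],
        "e2_head": tokens[e2_span[1] - 1],
        "words_between": " ".join(between),
        "num_words_between": len(between),
    }

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR"]
print(pair_features(tokens, (0, 2), (6, 7)))
# → {'e1_head': 'Airlines', 'e2_head': 'AMR',
#    'words_between': ', a unit of', 'num_words_between': 4}
```

These feature dicts would then be vectorised and fed to a classifier such as an SVM or Logistic Regression.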
Bootstrapping can be used to expand our training data
with instances for which our classifier’s predictions are very
confident.
RE USING SEMI-SUPERVISED LEARNING
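One round of bootstrapping can be sketched as below: a seed pattern (hand-written here; in a real system it would be induced from confident predictions) harvests new relation instances from unlabelled text. All names and sentences are made up.

```python
def bootstrap_round(corpus, pattern=", the spokesperson of "):
    """Harvest (person, spokesperson_of, org) triples matching one pattern."""
    found = set()
    for sentence in corpus:
        if pattern in sentence:
            person, org = sentence.split(pattern, 1)
            found.add((person, "spokesperson_of", org.rstrip(".")))
    return found

corpus = [
    "James Slack, the spokesperson of Downing Street.",
    "Jane Doe, the spokesperson of Acme Corp.",
    "Nothing relevant here.",
]
print(bootstrap_round(corpus))
# the newly found instances would be added to the training data
```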
Combine the best of both worlds: bootstrapping + supervised
machine learning.
Use the Web (e.g. Wikipedia) to build large sets of relations.
We can then train a more reliable supervised classifier.
RE USING DISTANT SUPERVISION
The main difference is that the schema for these relations does
not need to be specified in advance.
The relation name is just the text linking the two arguments, e.g.:
“Barack Obama was born in Hawaii” would create
(Barack Obama; was born in; Hawaii)
The challenge lies in identifying whether two entities connected by
a verb constitute an actual relation, rather than a false positive.
OPEN IE: UNSUPERVISED RE
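A toy Open IE extractor in this spirit: the verb phrase between two capitalised spans becomes the relation name. Real systems such as Stanford OpenIE use dependency parses; this regex only handles one narrow sentence shape.

```python
import re

# Capitalised span + verb phrase ending in a preposition + capitalised span.
TRIPLE = re.compile(
    r"([A-Z]\w+(?: [A-Z]\w+)*) "
    r"((?:was |is |were )?\w+(?: \w+)*? (?:in|by|of)) "
    r"([A-Z]\w+(?: [A-Z]\w+)*)"
)

def open_ie(sentence):
    """Return one (arg1, relation, arg2) triple, or None if no match."""
    m = TRIPLE.search(sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

print(open_ie("Barack Obama was born in Hawaii"))
# → ('Barack Obama', 'was born in', 'Hawaii')
```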
Stanford’s state-of-the-art OpenIE system:
1. Using dependencies, it splits each sentence into clauses.
2. Each clause is maximally shortened into sentence fragments.
3. These fragments are output as OpenIE triples.
STANFORD OPENIE
Precision, Recall, F-measure for supervised methods.
For unsupervised methods, evaluation is much harder.
We can estimate Precision using a small labelled sample:
Precision at different levels of recall, e.g. 1,000 new relations,
10,000 new relations, etc.
EVALUATION OF RELATION EXTRACTION
OTHER INFORMATION EXTRACTION TASKS
Coreference resolution: finding all expressions that refer to the
same entity in a text.
COREFERENCE RESOLUTION
Neural coref: a Python package for coreference resolution.
https://github.com/huggingface/neuralcoref
You can also try it online:
https://huggingface.co/coref/
COREFERENCE RESOLUTION
NEURAL COREF: EXAMPLE
Event extraction: extracting structured information about events
from text. Often <entity> - <verb> - <entity>, as in relation
extraction. Very useful for analysing news stories.
Steve Jobs died in 2011 → (Steve Jobs; death in; 2011)
New Android phone was announced by Google →
(New Android phone; announcement by; Google)
EVENT EXTRACTION
Temporal expression extraction: structuring time mentions in
text, often associated with event extraction.
Event 1: mid 80’s → released a pass.
Event 2: ~4 seconds later → slammed to the ground.
TEMPORAL EXPRESSION EXTRACTION
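Rule-based temporal expression extraction can be sketched with a few regexes; real systems (e.g. Stanford's SUTime) use much richer grammars, and the patterns below are illustrative only.

```python
import re

TIME_PATTERNS = [
    r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b",                       # 10am, 10:30 pm
    r"\b\d{1,2}(?:st|nd|rd|th)?\s+(?:January|February|March|April|May|"
    r"June|July|August|September|October|November|December)\b",  # 8th January
    r"\b(?:next|last)\s+(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
]

def extract_times(text):
    """Return all substrings matching any temporal pattern."""
    found = []
    for pattern in TIME_PATTERNS:
        found += re.findall(pattern, text, flags=re.IGNORECASE)
    return found

print(extract_times("It will be next Monday 10am-11am, on 8th January."))
```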
TEMPLATE/SLOT FILLING
Slot filling (aka Knowledge Base Population): task of taking an
incomplete knowledge base (e.g., Wikidata), and a large corpus of
text (e.g., Wikipedia), and completing the incomplete elements of the
knowledge base.
ENTITY LINKING
Entity linking: task of taking ambiguous entity mentions and linking
them to concrete entries in a knowledge base.
Alex Jones has a new film.
→ but which Alex Jones?
Can be defined as:
● Classification task.
● Similarity task.
● …
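Entity linking as a similarity task can be sketched by scoring word overlap between the mention's context and each candidate's KB description; the mini knowledge base below is invented for illustration.

```python
# Hypothetical mini knowledge base: entry → short description.
KB = {
    "Alex Jones (radio host)": "American radio show host and commentator",
    "Alex Jones (filmmaker)": "Welsh documentary film director",
}

def link(mention_context, kb):
    """Pick the entry whose description shares most words with the context."""
    context = set(mention_context.lower().split())
    return max(kb, key=lambda e: len(context & set(kb[e].lower().split())))

print(link("Alex Jones has a new film", KB))
# → 'Alex Jones (filmmaker)'
```

A real linker would replace raw word overlap with learned similarity (or treat linking as classification over candidates), but the overall shape is the same.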
INFORMATION EXTRACTION: SUMMARY
Involves a broad range of tasks, with a common goal:
unstructured text → structured data
Sequence classification is often useful.
As with many other tasks, availability of data is crucial:
Labelled corpora.
Gazetteers.
Thesauri.
RESOURCES
Stanford CRF NER:
https://nlp.stanford.edu/software/CRF-NER.shtml
Stanford CoreNLP:
https://stanfordnlp.github.io/CoreNLP/
iepy:
https://pypi.python.org/pypi/iepy
Stanford-OpenIE-Python:
https://github.com/philipperemy/Stanford-OpenIE-Python
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. 3rd edition. Chapter 20.
Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 7.