1
LECTURE 1
Introduction to NLP and Regular Expressions
Arkaitz Zubiaga, 8th January, 2018
2
Lectures: Mon (4pm, LIB2) & Wed (10am, L5)
Seminars: Mon (3pm, OC1.01) (week 2 onwards). The seminars
will cover supplementary material and provide technical detail.
Labs: Thu 2-4pm in CS0.01 (weeks 2, 3, 5, 7 and 9).
ABOUT THE MODULE: CS918
3
Assessment: 70% exam in May/June, 30% from 2 assignments:
Assign. 1: released week 2, deadline week 8.
Assign. 2: released week 4, deadline week 11.
ABOUT THE MODULE: CS918
4
Give a fundamental understanding of NLP methods for processing
linguistic data in textual form.
Familiarisation with different applications of NLP.
Give students the skills to apply state-of-the-art NLP methods on
different types of text (newswire, web, social media, scientific
articles).
AIMS OF THE MODULE: CS918
5
Essential:
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction
to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd and
3rd editions.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly
Media, Inc., 2009.
Recommended:
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA, USA.
Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science
and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Marie-Francine Moens, and Juanzi Li. “Mining User Generated Content and Its Applications.” In
Mining User Generated Content, 3–17. Social Media and Social Computing. Chapman and
Hall/CRC, 2014.
BOOKS FOR THE MODULE
6
What is Natural Language Processing (NLP)?
What are NLP areas and applications?
Why is NLP challenging?
Basic text processing with Regular Expressions.
LECTURE 1: CONTENTS
7
NLP is the field that studies computational methods for
automatically identifying structure in human language data (e.g.
English or Chinese, written or spoken).
NLP is also concerned with the insights that such computational
work gives us into human processing of language.
In this module, we will focus on textual rather than spoken
language.
WHAT IS NATURAL LANGUAGE PROCESSING?
8
A lot of today’s knowledge is written in texts, even more so on
the Internet, social media, emails.
We need automated means to process all that content!
Communication with chatbots and across languages needs
understanding of language.
WHY IS NLP IMPORTANT?
9
NLP is increasingly being used by companies, e.g.:
WHY IS NLP IMPORTANT?
10
1940s: Used mainly for machine translation.
1980s: Gained momentum with a focus on computational
grammars for the representation of meaning. Small corpora,
mostly rule-based.
1990s: Rapid expansion, large collections, Internet.
2000s: Shift from computational grammars to statistical (machine
learning) methods.
2013 onwards: Largely influenced by deep learning.
BRIEF HISTORY OF NLP
11
NLP APPLICATIONS: QUESTION ANSWERING
15
NLP APPLICATIONS: INFORMATION EXTRACTION
Subject: meeting
Date: 8th January, 2018
To: Arkaitz Zubiaga
Hi Arkaitz, we have finally scheduled the meeting.
It will be in the Ada Lovelace room, next Monday 10am-11am.
-Mike
Create new Calendar entry
Event: Meeting w/ Mike
Date: 15 Jan, 2018
Start: 10:00am
End: 11:00am
Where: A. Lovelace
16
NLP APPLICATIONS: SENTIMENT ANALYSIS
18
NLP APPLICATIONS: MACHINE TRANSLATION
19
NLP APPLICATIONS
mostly solved:
  Spam detection: “Let’s go to Agra!” ✓ / “Buy V1AGRA …” ✗
  Part-of-speech (POS) tagging: “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV
  Named entity recognition (NER): “Einstein met with UN officials in Princeton” → PERSON (Einstein), ORG (UN), LOC (Princeton)
making good progress:
  Sentiment analysis: “Best roast chicken in San Francisco!” / “The waiter ignored us for 20 minutes.”
  Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
  Word sense disambiguation (WSD): “I need new batteries for my mouse.”
  Parsing: “I can see Alcatraz from the window!”
  Machine translation (MT): “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”
  Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → add calendar entry: Party, May 27
still really hard:
  Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
  Paraphrase: “XYZ acquired ABC yesterday” / “ABC has been taken over by XYZ”
  Summarization: “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”
  Dialog: “Where is Citizen Kane playing in SF?” / “Castro Theatre at 7:30. Do you want a ticket?”
20
WHY IS NLP CHALLENGING?
Language is ambiguous, e.g. “Flying planes can be dangerous”
What is actually meant?
It can be dangerous for a person to fly planes.
Planes that are flying in the air can be dangerous.
21
WHY ELSE IS NLP CHALLENGING?
non-standard English:
  We had a double room, but was to cold when we complaint
segmentation issues:
  the London Euston-Birmingham New Street train
  is Euston-Birmingham a word?
idioms:
  a pain in the neck
  throw in the towel
neologisms:
  unfriend
  Retweet
  selfie
tricky entity names:
  Let It Be is a good song…
  They were listening to One Direction…
world knowledge:
  Mary and Sue are sisters.
  Mary and Sue are mothers.
PART 2: BASIC TEXT PROCESSING
WITH REGULAR EXPRESSIONS
23
REGULAR EXPRESSIONS
A formal language for specifying text patterns.
For searching or replacing text.
How do we search for any of the following in a text?
woodchuck
woodchucks
Woodchuck
Woodchucks
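As a minimal sketch, assuming Python's standard `re` module, a single pattern can cover all four forms. The sample text and variable names are illustrative and not taken from the slides.

```python
import re

# [wW] accepts either case for the first letter; s? makes the plural "s" optional.
WOODCHUCK = re.compile(r"[wW]oodchucks?")

# Illustrative text (an assumption, not from the lecture).
text = "Woodchucks are rodents. A woodchuck is also called a groundhog."

matches = WOODCHUCK.findall(text)
print(matches)  # ['Woodchucks', 'woodchuck']
```

The slides that follow introduce the two pieces used here: square-bracket disjunction and the `?` operator.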
24
APPLICATION OF REGULAR EXPRESSIONS
When the user says “You are X”, ELIZA responds with “What
makes you think I am X?”, for any X.
X can be obtained with regular expressions.
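One way this rule might look in Python is sketched below. The function name `eliza_reply` and the handling of case and final punctuation are assumptions made for illustration; this is not ELIZA's actual implementation.

```python
import re

# Capture everything after "You are" as X, ignoring case and trailing punctuation.
YOU_ARE = re.compile(r"^you are (.+?)[.!?]*$", re.IGNORECASE)


def eliza_reply(utterance):
    """Return the reply for a "You are X" utterance, or None if the rule does not apply."""
    match = YOU_ARE.match(utterance.strip())
    if match is None:
        return None
    return f"What makes you think I am {match.group(1)}?"


reply = eliza_reply("You are very helpful.")
print(reply)  # What makes you think I am very helpful?
```

The parentheses form a capture group, which is what makes X available for reuse in the response.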
25
REGULAR EXPRESSIONS: DISJUNCTIONS
• Letters inside square brackets []
• Ranges [A-Z]

Pattern        Matches
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit

Pattern   Matches                Example
[A-Z]     An upper case letter   Drenched Blossoms
[a-z]     A lower case letter    my beans were impatient
[0-9]     A single digit         Chapter 1: Down the Rabbit Hole
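The range patterns above can be checked with a short Python sketch, assuming the standard `re` module. `re.search` returns the first match, scanning from the left of each example string.

```python
import re

# (pattern, example text) pairs taken from the table above.
examples = [
    (r"[A-Z]", "Drenched Blossoms"),
    (r"[a-z]", "my beans were impatient"),
    (r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
]

# For each pair, keep the first character that the pattern matches.
first_matches = [re.search(pattern, text).group() for pattern, text in examples]
print(first_matches)  # ['D', 'm', '1']
```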
26
REGULAR EXPRESSIONS: NEGATIONS
• Negations [^Ss]
• Caret (^) means negation only when first in []

Pattern   Matches                   Example
[^A-Z]    Not an upper case letter  Oyfn pripetchik
[^Ss]     Neither ‘S’ nor ‘s’       reason
[^e^]     Neither e nor ^           Look here
a^b       The pattern a caret b     Look up a^b now
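The negated classes can be tried out with the sketch below, assuming Python's `re` module. One Python-specific caveat, added here and not stated on the slide: outside square brackets Python always treats `^` as a start-of-string anchor, so the literal pattern "a caret b" has to be written with an escaped caret, `a\^b`. The helper name `first_match` is hypothetical.

```python
import re


def first_match(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    found = re.search(pattern, text)
    return found.group() if found else None


not_upper = first_match(r"[^A-Z]", "Oyfn pripetchik")   # first char that is not upper case
not_s = first_match(r"[^Ss]", "reason")                 # first char that is neither S nor s
not_e_or_caret = first_match(r"[^e^]", "Look here")     # the second ^ is a literal caret

# In Python the caret must be escaped to be matched literally outside brackets.
escaped = first_match(r"a\^b", "Look up a^b now")
unescaped = first_match(r"a^b", "Look up a^b now")      # ^ acts as an anchor here, so no match

print(not_upper, not_s, not_e_or_caret, escaped, unescaped)
```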
27
REGULAR EXPRESSIONS: MORE DISJUNCTION
Pattern                     Matches
groundhog|woodchuck         groundhog, woodchuck
yours|mine                  yours, mine
a|b|c                       = [abc]
[gG]roundhog|[Ww]oodchuck   Groundhog, groundhog, Woodchuck, woodchuck
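A brief sketch of the pipe operator in Python, assuming the standard `re` module. The sample sentence is made up for illustration.

```python
import re

# Illustrative text (an assumption, not from the lecture).
text = "Groundhog Day stars a groundhog; a Woodchuck is the same animal as a woodchuck."

# The pipe separates two alternatives, each with its own bracketed first letter.
animal = re.compile(r"[gG]roundhog|[Ww]oodchuck")
found = animal.findall(text)
print(found)  # ['Groundhog', 'groundhog', 'Woodchuck', 'woodchuck']

# For single characters, a|b|c and [abc] match the same things.
with_pipe = re.findall(r"a|b|c", "cabbage")
with_brackets = re.findall(r"[abc]", "cabbage")
print(with_pipe == with_brackets)  # True
```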
28
REGULAR EXPRESSIONS: ? * + .
Pattern   Meaning                       Matches
colou?r   Optional previous char        color colour
oo*h!     0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char    oh! ooh! oooh! ooooh!
baa+      1 or more of previous char    baa baaa baaaa baaaaa
beg.n     Any single char               begin begun began beg3n
ba{5}     5 times previous char         baaaaa
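The operators in the table can be verified with the sketch below, assuming Python's `re` module. The helper `matching` is a hypothetical name; it uses `re.fullmatch` so that a candidate counts only if the whole string fits the pattern. The non-matching candidates are added for contrast and are not from the slide.

```python
import re


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


optional = matching(r"colou?r", ["color", "colour", "colouur"])
star = matching(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
plus = matching(r"o+h!", ["h!", "oh!", "ooh!", "oooh!"])
baa = matching(r"baa+", ["ba", "baa", "baaa"])
dot = matching(r"beg.n", ["begin", "begun", "beg3n", "begn"])
exactly_five = matching(r"ba{5}", ["baaaa", "baaaaa", "baaaaaa"])

print(optional)      # ['color', 'colour']
print(star == plus)  # True: oo* and o+ describe the same strings
print(baa)           # ['baa', 'baaa']
print(dot)           # ['begin', 'begun', 'beg3n']
print(exactly_five)  # ['baaaaa']
```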
29
REGULAR EXPRESSIONS: ANCHORS ^ $
Pattern      Matches
^[A-Z]       Coventry
^[^A-Za-z]   1, “Hello”
\.$          The end.
.$           The end? The end!
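A small sketch of the two anchors, assuming Python's `re` module. The helper name `first_match` is hypothetical, and straight quotes stand in for the curly quotes on the slide.

```python
import re


def first_match(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    found = re.search(pattern, text)
    return found.group() if found else None


starts_upper = first_match(r"^[A-Z]", "Coventry")           # upper case letter at the start
starts_digit = first_match(r"^[^A-Za-z]", "1")              # non-letter at the start
starts_quote = first_match(r"^[^A-Za-z]", '"Hello"')        # the opening quote is a non-letter
literal_dot = first_match(r"\.$", "The end.")               # escaped dot: a real full stop
no_literal_dot = first_match(r"\.$", "The end?")            # no full stop at the end
any_last_char = first_match(r".$", "The end?")              # unescaped dot: any final character

print(starts_upper, starts_digit, starts_quote, literal_dot, no_literal_dot, any_last_char)
```

The contrast between `\.$` and `.$` shows why the dot has to be escaped when a literal full stop is meant.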
30
REGULAR EXPRESSIONS: ERRORS
False positives (Type I errors)
False negatives (Type II errors)
31
FALSE POSITIVES: TYPE I ERRORS
Instances that should not be output.
For instance, if we search for “[Tt]he”:
There are 10 people in the room, they all have a laptop with
them.
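The false positives in this sentence can be made visible with a short Python sketch, assuming the standard `re` module. The stricter pattern with word boundaries (`\b`) is one possible fix, added for illustration and not taken from the slide.

```python
import re

sentence = "There are 10 people in the room, they all have a laptop with them."

# The loose pattern also fires inside "There", "they" and "them".
loose = re.findall(r"[Tt]he", sentence)
print(loose)   # ['The', 'the', 'the', 'the']

# \b requires a word boundary on both sides, so only the standalone word remains.
strict = re.findall(r"\b[Tt]he\b", sentence)
print(strict)  # ['the']
```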
32
FALSE NEGATIVES: TYPE II ERRORS
Instances that have been missed.
For instance, if we search for “the”:
The laptop is in the kitchen.
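A matching sketch for the false negative, again assuming Python's `re` module. Widening the pattern to accept a capital T is one possible remedy, shown for illustration only.

```python
import re

sentence = "The laptop is in the kitchen."

# The lower-case pattern misses the sentence-initial "The".
narrow = re.findall(r"the", sentence)
print(narrow)  # ['the']

# Accepting either case for the first letter recovers it.
wider = re.findall(r"\b[Tt]he\b", sentence)
print(wider)   # ['The', 'the']
```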
33
EVALUATION
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two
antagonistic efforts:
Increasing precision (minimising false positives)
Increasing coverage or recall (minimising false negatives).
34
EVALUATION: PRECISION AND RECALL
There are 10 people in the room, they all have a laptop with them.
Precision (ratio of correct items among those output):
1/4 = 0.25
Recall (ratio of reference items that have been output):
1/1 = 1
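The two figures can be reproduced with the sketch below, assuming Python's `re` module and assuming the search pattern is `[Tt]he`, as in the false-positive example. The reference set is built here with a word-boundary pattern purely for convenience; in practice it would come from human annotation.

```python
import re

sentence = "There are 10 people in the room, they all have a laptop with them."

# Character spans returned by the pattern under evaluation.
output = {m.span() for m in re.finditer(r"[Tt]he", sentence)}

# Assumed reference: spans of the standalone word "the".
reference = {m.span() for m in re.finditer(r"\b[Tt]he\b", sentence)}

true_positives = len(output & reference)
precision = true_positives / len(output)     # correct items among those output
recall = true_positives / len(reference)     # reference items that were output

print(precision, recall)  # 0.25 1.0
```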
35
EVALUATION: F1 SCORE
We want to optimise for both precision and recall:
(harmonic mean of precision and recall)
Equation as follows, however generally β = 1:
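The equation itself did not survive extraction from the slide. The standard F-measure definition is given here as a reconstruction, with P for precision and R for recall (the symbol names are assumptions):

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}
\qquad\text{and, for } \beta = 1,\qquad
F_1 = \frac{2\, P\, R}{P + R}
```

For the example above, with P = 0.25 and R = 1, this gives F1 = (2 × 0.25 × 1) / (0.25 + 1) = 0.4.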
36
SUMMARY
Regular expressions play a surprisingly large role:
Sophisticated sequences of regular expressions are often the
first model for any text processing task.
For many hard tasks, we use machine learning classifiers.
But regular expressions are used as features in the classifiers
or to preprocess the text.
Can be very useful in capturing generalisations.
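As a rough illustration of regular expressions used as classifier features, the sketch below turns a text into a dictionary of boolean features. The feature names, patterns and sample texts are hypothetical, chosen only for this example, and no particular classifier is assumed.

```python
import re

# Hypothetical feature patterns; a real system would pick these for its task.
FEATURE_PATTERNS = {
    "has_url": re.compile(r"https?://\S+"),
    "has_digit": re.compile(r"[0-9]"),
    "has_shouting": re.compile(r"\b[A-Z]{3,}\b"),  # a word of 3+ capital letters
}


def regex_features(text):
    """Map each feature name to True if its pattern occurs anywhere in the text."""
    return {name: bool(pattern.search(text)) for name, pattern in FEATURE_PATTERNS.items()}


spam_like = regex_features("FREE tickets: call 0800 now, http://example.com")
ordinary = regex_features("Let's go to Agra!")
print(spam_like)
print(ordinary)
```

A dictionary like this could then be passed to a machine learning classifier alongside other features.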
37
REGULAR EXPRESSIONS: REFERENCES
Regular expressions with Python:
https://docs.python.org/3.7/howto/regex.html
Testing regular expressions online:
https://regex101.com/
38
RESOURCES
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapters 1-2.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapters 1-3.