LECTURE 9
Sequence Classification and Part-of-Speech Tagging
Arkaitz Zubiaga, 5th February, 2018
LECTURE 9: CONTENTS
Sequence Classification
Sequence Classifiers:
Hidden Markov Models (HMM).
Maximum Entropy Markov Models (MEMM).
Conditional Random Fields (CRF).
Using Sequence Classifiers for Part-of-Speech (POS) Tagging.
SEQUENCE CLASSIFICATION
Sometimes, classification of items in a sequence is dependent on other sequence items:
Part-of-Speech (POS) tagging:
Assigning categories to words, e.g. adjective, noun or verb.
Example:
The man saw a cat
DET NN VB DET NN
DET: determiner
NN: noun
VB: verb
Why is classification dependent on other sequence items?
In our example, “The man saw a cat”:
‘saw’ can be:
a noun (if it refers to the tool for cutting)
a verb (if it’s the simple past of ‘see’)
we can’t classify whether ‘saw’ is a verb or a noun by looking at the word alone → we need to look at the context, the sequence.
The man saw a cat
??? ??? ??? ??? ???
The first item, ‘the’, is easy to classify: it’s always a determiner:
The man saw a cat
DET ??? ??? ??? ???
The man saw a cat
DET ??? ??? ??? ???
Now ‘man’ is ambiguous: (1) a noun, i.e. a male person, or (2) a verb, i.e. ‘take charge of’? We can look at:
P(‘manNN’) vs P(‘manVB’) → probability of the word as a noun or a verb
P(NN | DET) vs P(VB | DET) → probability of a noun or a verb being preceded by a determiner
Two probabilities to determine the POS of a word:
[1] P(‘manNN’) vs P(‘manVB’)
& [2] P(‘manNN’ | DET) vs P(‘manVB’ | DET)
Classifiers we have seen so far (NB, MaxEnt, SVM) can handle [1].
But they can’t handle [2].
Using training data, we can learn the probability of a word category POSi being preceded by POSj.
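As a sketch of how this learning could look (toy corpus and counts, purely illustrative, not code from the lecture):

```python
from collections import Counter, defaultdict

# Toy POS-tagged corpus (hypothetical training data).
tagged_sents = [
    [("the", "DET"), ("man", "NN"), ("saw", "VB"), ("a", "DET"), ("cat", "NN")],
    [("a", "DET"), ("dog", "NN"), ("saw", "VB"), ("the", "DET"), ("man", "NN")],
]

# Count how often tag j directly follows tag i.
transitions = defaultdict(Counter)
for sent in tagged_sents:
    tags = [tag for _, tag in sent]
    for prev, curr in zip(tags, tags[1:]):
        transitions[prev][curr] += 1

# Normalise the counts into probabilities P(POSj | POSi).
trans_prob = {
    prev: {curr: n / sum(counter.values()) for curr, n in counter.items()}
    for prev, counter in transitions.items()
}

print(trans_prob["DET"])  # in this toy corpus, DET is always followed by NN
```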
Prepositions can be followed by
nouns (.7) or determiners (.3),
never by verbs.
Nouns can be followed by verbs,
prepositions, or another noun.
Determiners are ALWAYS
followed by a noun.
[1] P(‘manNN’) vs P(‘manVB’)
[2] P(NN | DET) vs P(VB | DET)
The man saw a cat
DET ??? ??? ??? ???
No matter the outcome of [1], we know it’s NN
thanks to [2].
OK, and how do we achieve this?
Hidden Markov Models (HMM)
Maximum Entropy Markov Models (MEMM)
Conditional Random Fields (CRF)
Deep Learning:
Recurrent Neural Networks (RNN).
Long Short-Term Memory Networks (LSTM).
Convolutional Neural Networks (CNN).
HIDDEN MARKOV MODELS
WHAT IS A MARKOV CHAIN?
A Markov chain: a set of nodes connected with probabilities, where the weights on all edges leaving a node sum to 1.
MARKOV CHAINS
In a first-order Markov chain, the probability of a particular state depends ONLY on the previous state.
Remember: Lecture 3 on language models, the Markov assumption:
P(library | I found two pounds in the) ≈ P(library | the)
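The Markov assumption can be sketched with bigram counts over a toy corpus (illustrative only, not the Lecture 3 code):

```python
from collections import Counter, defaultdict

# Under the Markov assumption we approximate P(word | full history)
# with P(word | previous word), estimated from bigram counts.
corpus = "i found two pounds in the library . i sat in the library .".split()

bigrams = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigrams[prev][curr] += 1

def p(curr, prev):
    # Maximum-likelihood estimate of P(curr | prev).
    total = sum(bigrams[prev].values())
    return bigrams[prev][curr] / total if total else 0.0

print(p("library", "the"))  # stands in for P(library | i found two pounds in the)
```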
WHAT IS A HIDDEN MARKOV MODEL (HMM)?
HMM: TWO KINDS OF PROBABILITIES
Transition probabilities: pairwise probabilities of tags being preceded/followed by another tag,
e.g. a determiner is likely followed by a noun or an adjective.
Emission probabilities: the probability for a particular tag to be a particular word, e.g. a verb is very likely to be “is”.
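To make the two kinds of probabilities concrete, here is a minimal Viterbi decoder over hand-made transition and emission tables (all numbers are invented for illustration, not learnt from data):

```python
import math

tags = ["DET", "NN", "VB"]
# Toy start, transition and emission probabilities (rows sum to 1).
start = {"DET": 0.6, "NN": 0.3, "VB": 0.1}
trans = {
    "DET": {"DET": 0.01, "NN": 0.89, "VB": 0.1},
    "NN":  {"DET": 0.1,  "NN": 0.3,  "VB": 0.6},
    "VB":  {"DET": 0.5,  "NN": 0.4,  "VB": 0.1},
}
emit = {
    "DET": {"the": 0.7, "a": 0.3},
    "NN":  {"man": 0.4, "cat": 0.4, "saw": 0.2},
    "VB":  {"saw": 0.8, "man": 0.2},
}

def viterbi(words):
    # table[i][tag] = (best log-prob of a path ending in tag at step i, backpointer)
    table = [{t: (math.log(start[t] * emit[t].get(words[0], 1e-10)), None)
              for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: table[-1][p][0] + math.log(trans[p][t]))
            score = (table[-1][best_prev][0] + math.log(trans[best_prev][t])
                     + math.log(emit[t].get(w, 1e-10)))
            row[t] = (score, best_prev)
        table.append(row)
    # Follow backpointers from the best final tag.
    best = max(tags, key=lambda t: table[-1][t][0])
    path = [best]
    for row in reversed(table[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "man", "saw", "a", "cat"]))
```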
TWO IMPORTANT ASSUMPTIONS IN HMM
Again, the Markov assumption, i.e. dependence only on the previous state.
There must be a dependency between sequence items, e.g.:
Weather (dependency): this morning’s weather may determine the probability of this afternoon’s weather.
Independent events: my laptop having been broken doesn’t determine the probability of my next laptop breaking.
LIMITATIONS OF HMM
Models dependencies between each state and only its corresponding observation.
Learns the joint distribution of states and observations P(Y, X), but not the conditional probability P(Y|X).
HMM IMPLEMENTATIONS
NLTK HMM:
http://www.nltk.org/_modules/nltk/tag/hmm.html
hmmlearn:
https://github.com/hmmlearn/hmmlearn
Scikit HMM:
http://scikit-learn.sourceforge.net/stable/modules/hmm.html
MEMM: ALTERNATIVE TO HMM
Maximum Entropy Markov Models (MEMM): model dependencies between each state and the full observation sequence.
The learning objective function is consistent with the predictive function: P(Y|X).
MEMM: THE LABEL BIAS PROBLEM
In principle, locally, state 1 prefers state 2 as the next state.
But if we look at the entire path (which MEMM does):
P(1→1→1→1) = 0.4 * 0.45 * 0.5 = 0.09
P(1→2→2→2) = 0.6 * 0.3 * 0.3 = 0.054
Despite 1→2 being more likely locally, the entire path probability will favour 1→1.
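The arithmetic on this slide can be checked directly (step probabilities copied from the example):

```python
from math import prod

# Locally normalised step probabilities from the example.
p_through_1 = prod([0.4, 0.45, 0.5])  # path 1→1→1→1
p_through_2 = prod([0.6, 0.3, 0.3])   # path 1→2→2→2

# The path staying in state 1 wins (~0.09 vs ~0.054), even though the
# first step 1→2 (0.6) is locally preferred over 1→1 (0.4).
print(p_through_1 > p_through_2)
```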
SOLUTION TO LABEL BIAS
Solution: don’t normalise probabilities locally; use local potentials.
CONDITIONAL RANDOM FIELDS
CRF:
A global normaliser Z(x) overcomes the label bias issue of MEMM.
Models the dependency between each state and the entire observation sequence (like MEMM).
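As a sketch in standard linear-chain CRF notation (not shown on the slides), the conditional probability with its global normaliser Z(x) is:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Because Z(x) sums over all label sequences y′ rather than normalising at each step, no state can hoard probability mass locally, which is what removes the label bias.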
Widely used for a range of sequence classification tasks:
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers
IMPLEMENTATIONS OF CRF
Python-crfsuite:
http://python-crfsuite.readthedocs.io/en/latest/
PyStruct:
https://pystruct.github.io/
BEYOND LINEAR SEQUENCE CLASSIFICATION
We’ve talked about classifying linear sequences.
But sometimes we can have more complex structures:
Graph- or tree-structured sequences.
WHERE DO WE FIND TREE SEQUENCES IN NLP?
Post classification in forums, e.g. binary classification: does a post answer the question of the 1st post?
Post 1
Post 1.1
Post 1.1.1
Post 1.1.2
Post 1.2
Post 1.2.1
Post 1.2.1.1
Post 1.2.2
HMM VS CRF
USING SEQUENCE CLASSIFIERS FOR PART-OF-SPEECH (POS) TAGGING
PART-OF-SPEECH TAGGING
As we were saying, sequence can play an important role in
determining POS tags in a sentence:
The man saw a cat
DET NN VB DET NN
“man” can’t be a verb if it’s preceded by a determiner.
With HMM, only looking at the previous label, we can end up predicting a sequence of 5 nouns (NN, NN, NN, NN, NN).
Looking at the entire sequence (MEMM, CRF), we avoid this: we never have a sentence made only of 5 nouns.
The list of POS tag types is actually quite large!
Open class (lexical) words:
Nouns: proper (IBM, Italy) and common (cat/cats, snow)
Verbs: main (see, registered) and modals (can, had)
Adjectives (old, older, oldest)
Adverbs (slowly)
Numbers (122,312, one)
… more
Closed class (functional) words:
Prepositions (to, with)
Particles (off, up)
Determiners (the, some)
Conjunctions (and, or)
Pronouns (he, its)
… more
Interjections (Ow, Eh)
OPEN VS CLOSED CLASSES
Open vs. Closed classes:
Closed:
determiners: a, an, the,…
pronouns: she, he, I,…
prepositions: on, under, over, near, by,…
Open:
Nouns, Verbs, Adjectives, Adverbs.
CHALLENGES IN POS TAGGING: AMBIGUITY
Words often have more than one possible POS, e.g. back:
The back door = JJ (adjective)
On my back = NN (noun)
Win the voters back = RB (adverb)
Promised to back the bill = VB (verb)
See list of POS tags:
http://rednoise.org/rita/reference/PennTags.html
POS tagging example:
Input: Plays well with others
Candidate labels per word:
Plays: NNS, VBZ
well: UH, JJ, NN, RB
with: IN
others: NNS
Output: Plays/VBZ well/RB with/IN others/NNS
1. List candidate labels for each word.
2. Based on probabilities learnt from training data, the classifier predicts the most likely POS sequence.
NB: the fact that there are 2 unambiguous words (with & others) is useful: label them first, then predict the other 2.
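Step 2 can be sketched as a brute-force search over the candidate labels, scored with toy emission and transition probabilities (all numbers invented for illustration; a real tagger learns them from training data):

```python
from itertools import product
from math import prod

# Candidate tags per word, as listed on the slide.
candidates = {
    "Plays": ["NNS", "VBZ"],
    "well": ["UH", "JJ", "NN", "RB"],
    "with": ["IN"],
    "others": ["NNS"],
}
words = ["Plays", "well", "with", "others"]

# Toy scores; a real tagger would estimate these from a corpus.
emit = {("Plays", "VBZ"): 0.6, ("Plays", "NNS"): 0.4,
        ("well", "RB"): 0.5, ("well", "JJ"): 0.3,
        ("well", "UH"): 0.1, ("well", "NN"): 0.1,
        ("with", "IN"): 1.0, ("others", "NNS"): 1.0}
trans = {("<s>", "VBZ"): 0.4, ("<s>", "NNS"): 0.4,
         ("VBZ", "RB"): 0.5, ("RB", "IN"): 0.6, ("IN", "NNS"): 0.7}

def score(tags):
    # Product of emission scores and transition scores; transitions we
    # haven't listed get a small default instead of zero.
    e = prod(emit[(w, t)] for w, t in zip(words, tags))
    tr = prod(trans.get(p, 0.01) for p in zip(("<s>",) + tags, tags))
    return e * tr

best = max(product(*(candidates[w] for w in words)), key=score)
print(best)  # the highest-scoring tag sequence under these toy numbers
```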
Uses of POS tagging in NLP:
Text-to-speech.
Phrase search through regular expressions, e.g. (Det) Adj* N+.
As input to, or to speed up, a full linguistic parser (later lectures).
If you know the tag, you can back off to it, e.g. in lemmatisation: saw → see, or saw → saw?
In other fields:
Computational biology.
Prediction of series of data, e.g. weather forecasting.
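The phrase-search idea above can be sketched by encoding the tag sequence as a string and matching a regular expression over it (toy sentence and one-character tag codes, purely illustrative):

```python
import re

# Tagged sentence as (word, tag) pairs.
tagged = [("the", "DET"), ("old", "ADJ"), ("grey", "ADJ"), ("cat", "N"),
          ("slept", "V"), ("on", "P"), ("a", "DET"), ("mat", "N")]

# One character per tag so regex match positions line up with token indices.
code = {"DET": "D", "ADJ": "A", "N": "N", "V": "V", "P": "P"}
tag_string = "".join(code[tag] for _, tag in tagged)  # "DAANVPDN"

# (Det)? Adj* N+ expressed as a regex over the one-character tag codes.
phrases = [
    " ".join(w for w, _ in tagged[m.start():m.end()])
    for m in re.finditer(r"D?A*N+", tag_string)
]
print(phrases)  # noun phrases found in the sentence
```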
POS TAGGING PERFORMANCE
Current POS tagging systems are very accurate:
Baseline system that predicts the most common POS for a word
already gets 90% accuracy → owing to many words not being
ambiguous.
Standard classifiers (no sequence) can achieve 93% accuracy.
Sequence classifiers can achieve 97% accuracy.
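The most-common-tag baseline mentioned above can be sketched as follows (toy data; the 90% figure comes from real corpora, not from this example):

```python
from collections import Counter, defaultdict

# Toy training data: observed (word, tag) pairs.
train = [("the", "DET"), ("man", "NN"), ("saw", "VB"), ("a", "DET"),
         ("cat", "NN"), ("the", "DET"), ("saw", "VB"), ("saw", "NN")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word):
    # Predict each word's single most frequent training tag;
    # unseen words fall back to NN here (an arbitrary choice).
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

# The baseline is right whenever a word takes its majority tag, and wrong
# on ambiguous uses: here the second 'saw' is a noun, but the baseline
# still predicts VB.
test = [("the", "DET"), ("man", "NN"), ("saw", "VB"), ("a", "DET"), ("saw", "NN")]
accuracy = sum(baseline_tag(w) == t for w, t in test) / len(test)
print(accuracy)
```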
REFERENCES
List of software for POS tagging:
https://aclweb.org/aclwiki/Part-of-speech_tagging
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapters 9-10.
Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6
Section 1.6.