LECTURE 9
Sequence Classification and Part-of-Speech Tagging
Arkaitz Zubiaga, 5th February, 2018
2
Sequence Classification
Sequence Classifiers:
Hidden Markov Models (HMM).
Maximum Entropy Markov Models (MEMM).
Conditional Random Fields (CRF).
Using Sequence Classifiers for Part-of-Speech (POS) Tagging.
LECTURE 9: CONTENTS
3
Sometimes, classification of items in a sequence is dependent
on other sequence items:
Part-of-Speech (POS) tagging:
Assigning categories to words, e.g. adjective, noun or verb
Example:
The man saw a cat
DET NN VB DET NN
SEQUENCE CLASSIFICATION
DET: determiner
NN: noun
VB: verb
4
Why is classification dependent on other sequence items?
In our example, “The man saw a cat”:
‘saw’ can be:
a noun (if it refers to the tool for cutting)
a verb (if it’s simple past of ‘see’)
we can’t tell whether ‘saw’ is a verb or a noun by looking at the
word alone → we need to look at the context, i.e. the sequence
SEQUENCE CLASSIFICATION
5
The man saw a cat
??? ??? ??? ??? ???
The first item, ‘the’, is easy to classify: it’s always a determiner:
The man saw a cat
DET ??? ??? ??? ???
SEQUENCE CLASSIFICATION
6
The man saw a cat
DET ??? ??? ??? ???
Now ‘man’ is ambiguous: (1) noun, i.e. male person, or (2) verb,
i.e. ‘take charge of’? We can look at:
P(‘manNN’) vs P(‘manVB’) → probability of the word as noun or verb
P(NN | DET) vs P(VB | DET) → probability of a noun or a verb
being preceded by a determiner
SEQUENCE CLASSIFICATION
7
Two probabilities to determine the POS of a word:
[1] P(‘manNN’) vs P(‘manVB’)
& [2] P(‘manNN’ | DET) vs P(‘manVB’ | DET)
Classifiers we have seen so far (NB, MaxEnt, SVM) can handle [1].
But they can’t handle [2].
SEQUENCE CLASSIFICATION
8
Using training data, we can learn the probability of word
category POS_i being preceded by POS_j.
SEQUENCE CLASSIFICATION
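A minimal sketch of how these probabilities could be counted from a tagged corpus (the two training sentences below are invented for illustration):

```python
from collections import Counter, defaultdict

# Invented tagged training data: one list of (word, tag) pairs per sentence.
train = [
    [("the", "DET"), ("man", "NN"), ("saw", "VB"), ("a", "DET"), ("cat", "NN")],
    [("a", "DET"), ("dog", "NN"), ("barked", "VB")],
]

# Count how often tag j immediately follows tag i.
bigrams = defaultdict(Counter)
for sentence in train:
    tags = [tag for _, tag in sentence]
    for prev, curr in zip(tags, tags[1:]):
        bigrams[prev][curr] += 1

# Normalise the counts per preceding tag to get P(POS_j | POS_i).
transition = {
    prev: {curr: n / sum(counts.values()) for curr, n in counts.items()}
    for prev, counts in bigrams.items()
}
print(transition["DET"])  # {'NN': 1.0}: in this toy data, DET is always followed by NN
```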
9
SEQUENCE CLASSIFICATION
Prepositions can be followed by
nouns (.7) or determiners (.3),
never by verbs.
10
SEQUENCE CLASSIFICATION
Nouns can be followed by verbs,
prepositions, or another noun.
11
SEQUENCE CLASSIFICATION
Determiners are ALWAYS
followed by a noun.
12
[1] P(‘manNN’) vs P(‘manVB’)
[2] P(NN | DET) vs P(VB | DET)
The man saw a cat
DET ??? ??? ??? ???
SEQUENCE CLASSIFICATION
P(VB | DET) = 0, so no matter the outcome of [1], we know
it’s NN thanks to [2].
13
OK, and how do we achieve this?
Hidden Markov Models (HMM)
Maximum Entropy Markov Models (MEMM)
Conditional Random Fields (CRF)
Deep Learning:
Recurrent Neural Networks (RNN).
Long Short-Term Memory Networks (LSTM).
Convolutional Neural Networks (CNN).
SEQUENCE CLASSIFICATION
HIDDEN MARKOV MODELS
16
A Markov chain:
Set of nodes connected with probabilities, where weights on
all edges leaving a node sum to 1.
WHAT IS A MARKOV CHAIN?
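As a sketch, a Markov chain is just a table of out-edge weights per node, which we can sample from (the states and weights below are invented, chosen so each row sums to 1):

```python
import random

# Invented two-state chain; each node's outgoing weights sum to 1.
chain = {
    "HOT":  {"HOT": 0.6, "COLD": 0.4},
    "COLD": {"HOT": 0.3, "COLD": 0.7},
}

def walk(chain, state, steps):
    """Sample a path by repeatedly picking the next node with the edge weights."""
    path = [state]
    for _ in range(steps):
        state = random.choices(list(chain[state]), weights=list(chain[state].values()))[0]
        path.append(state)
    return path

print(walk(chain, "HOT", 5))  # e.g. ['HOT', 'HOT', 'COLD', 'COLD', 'COLD', 'HOT']
```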
17
MARKOV CHAINS
18
In a first-order Markov chain, the probability of a particular state
depends ONLY on the previous state.
Remember: Lecture 3 on language models, Markov assumption:
P(library | I found two pounds in the) ≈ P(library | the)
MARKOV CHAINS
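In symbols, the first-order assumption approximates the full history by the previous item only:
P(w_i | w_1, …, w_(i-1)) ≈ P(w_i | w_(i-1))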
19
WHAT IS A HIDDEN MARKOV MODEL (HMM)?
20
HMM: TWO KINDS OF PROBABILITIES
Transition probabilities: pairwise probabilities of tags being
preceded/followed by another tag,
e.g. determiner likely followed by noun or adjective.
Emission probabilities: probability for a particular tag to be a
particular word, e.g. verb is very likely to be “is”.
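A toy sketch of what these two tables might look like, and how an HMM combines them to score a tagged sentence (all numbers below are invented for illustration):

```python
# Transition probabilities P(tag_i | tag_{i-1}), with a start symbol <s>.
trans = {
    ("<s>", "DET"): 0.8, ("<s>", "NN"): 0.2,
    ("DET", "NN"): 1.0,
    ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "IN"): 0.2,
    ("VB", "DET"): 0.6, ("VB", "NN"): 0.4,
}
# Emission probabilities P(word | tag).
emit = {
    ("DET", "the"): 0.5, ("DET", "a"): 0.4,
    ("NN", "man"): 0.1, ("NN", "cat"): 0.1, ("NN", "saw"): 0.01,
    ("VB", "saw"): 0.2,
}

def joint_prob(words, tags):
    """P(Y, X) = product over i of P(tag_i | tag_{i-1}) * P(word_i | tag_i)."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

words = ["the", "man", "saw", "a", "cat"]
print(joint_prob(words, ["DET", "NN", "VB", "DET", "NN"]))  # verb reading of 'saw'
print(joint_prob(words, ["DET", "NN", "NN", "DET", "NN"]))  # noun reading, lower
```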
21
Again, the Markov assumption, i.e. dependence only on the
previous state.
There must be a dependency between sequence items, e.g.:
Weather (dependency): this morning’s weather may
determine the probability of this afternoon’s weather.
Independent events: my laptop having been broken doesn’t
determine the probability of my next laptop breaking.
TWO IMPORTANT ASSUMPTIONS IN HMM
22
Limitations of HMM:
Models dependencies between each state and only its
corresponding observation.
Learns the joint distribution of states and observations P(Y, X),
but not the conditional probability P(Y|X).
LIMITATIONS OF HMM
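In symbols, what the HMM learns is the joint
P(Y, X) = Π_i P(y_i | y_(i-1)) · P(x_i | y_i)
i.e. a transition times an emission at each position, rather than the conditional P(Y|X) directly.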
23
NLTK HMM:
http://www.nltk.org/_modules/nltk/tag/hmm.html
hmmlearn:
https://github.com/hmmlearn/hmmlearn
Scikit HMM:
http://scikit-learn.sourceforge.net/stable/modules/hmm.html
HMM IMPLEMENTATIONS
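A minimal supervised example with NLTK's HMM trainer (the two training sentences are invented; a real tagger needs far more data):

```python
from nltk.tag import hmm

# Tiny invented training set: lists of (word, tag) pairs.
train = [
    [("the", "DET"), ("man", "NN"), ("saw", "VB"), ("a", "DET"), ("cat", "NN")],
    [("a", "DET"), ("cat", "NN"), ("saw", "VB"), ("the", "DET"), ("man", "NN")],
]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train)
print(tagger.tag(["the", "cat", "saw", "a", "man"]))
```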
24
Maximum Entropy Markov Models (MEMM): Models
dependencies between each state and the full observation
sequence.
Learning objective function consistent with predictive function:
P(Y|X).
MEMM: ALTERNATIVE TO HMM
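In symbols, an MEMM models the conditional directly as
P(Y | X) = Π_i P(y_i | y_(i-1), X)
where each local factor is a maximum-entropy (logistic) distribution normalised per state; this per-state normalisation is what causes the label bias problem on the next slides.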
25
MEMM: THE LABEL BIAS PROBLEM
26
In principle, locally, state 1 prefers state 2 as the next state.
MEMM: THE LABEL BIAS PROBLEM
27
But if we look at the entire path (which MEMM does):
P(1→1→1→1) = 0.4*0.45*0.5 = 0.09
P(1→2→2→2) = 0.6*0.3*0.3 = 0.054
MEMM: THE LABEL BIAS PROBLEM
Despite 1→2 being more
likely locally,
the entire path probability
will favour 1→1
28
Solution: don’t normalise probabilities locally; use local
potentials.
SOLUTION TO LABEL BIAS
29
CRF:
Global normaliser Z(x) overcomes the label bias issue of MEMM.
Models the dependency between each state and the entire
observation sequence (like MEMM).
CONDITIONAL RANDOM FIELDS
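In symbols, a linear-chain CRF is
P(Y | X) = (1/Z(X)) · Π_i exp( Σ_k λ_k f_k(y_(i-1), y_i, X, i) )
where the global normaliser Z(X) sums the same product over all possible label sequences, rather than normalising at each state.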
30
Widely used for a range of sequence classification tasks:
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional
random fields: Probabilistic models for segmenting and labeling
sequence data.
https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers
CONDITIONAL RANDOM FIELDS
31
Python-crfsuite:
http://python-crfsuite.readthedocs.io/en/latest/
PyStruct:
https://pystruct.github.io/
IMPLEMENTATIONS OF CRF
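A minimal python-crfsuite sketch for POS tagging (the features and training data are invented; real systems use richer features such as suffixes and capitalisation):

```python
import pycrfsuite

def features(sent, i):
    """Simple per-token features: the word itself plus its neighbours."""
    w = sent[i]
    feats = ["word=" + w, "lower=" + w.lower()]
    feats.append("prev=" + (sent[i - 1] if i > 0 else "<s>"))
    feats.append("next=" + (sent[i + 1] if i < len(sent) - 1 else "</s>"))
    return feats

sents = [["The", "man", "saw", "a", "cat"], ["A", "cat", "saw", "the", "man"]]
tags = [["DET", "NN", "VB", "DET", "NN"], ["DET", "NN", "VB", "DET", "NN"]]

trainer = pycrfsuite.Trainer(verbose=False)
for s, t in zip(sents, tags):
    trainer.append([features(s, i) for i in range(len(s))], t)
trainer.train("pos.crfsuite")  # writes the trained model to disk

tagger = pycrfsuite.Tagger()
tagger.open("pos.crfsuite")
print(tagger.tag([features(sents[0], i) for i in range(len(sents[0]))]))
```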
32
We’ve talked about classifying linear sequences.
But sometimes we can have more complex sequences:
Graph- or tree-structured sequences.
BEYOND CLASSIFYING LINEAR SEQUENCES
33
Post classification in forums, e.g. binary classification: does a post
answer the question of the 1st post?
Post 1
  Post 1.1
    Post 1.1.1
    Post 1.1.2
  Post 1.2
    Post 1.2.1
      Post 1.2.1.1
    Post 1.2.2
WHERE DO WE FIND TREE SEQUENCES IN NLP?
34
HMM VS CRF
USING SEQUENCE CLASSIFIERS
FOR PART-OF-SPEECH (POS)
TAGGING
36
PART-OF-SPEECH TAGGING
As we were saying, sequence can play an important role in
determining POS tags in a sentence:
The man saw a cat
DET NN VB DET NN
“man” can’t be a verb if it’s preceded by a determiner.
37
PART-OF-SPEECH TAGGING
As we were saying, sequence can play an important role in
determining POS tags in a sentence:
The man saw a cat
DET NN VB DET NN
With an HMM, looking only at the previous label, we can end up
predicting a sequence of 5 nouns (NN, NN, NN, NN, NN).
Looking at the entire sequence (MEMM, CRF), we avoid this: we
never see a sentence made up of 5 nouns only.
38
PART-OF-SPEECH TAGGING
The list of POS tag types is actually quite large!
Open class (lexical) words:
Nouns: Proper (IBM, Italy), Common (cat/cats, snow)
Verbs: Main (see, registered)
Adjectives (old, older, oldest)
Adverbs (slowly)
Numbers (122,312, one)
Interjections (Ow, Eh)
… more
Closed class (functional) words:
Modals (can, had)
Prepositions (to, with)
Particles (off, up)
Determiners (the, some)
Conjunctions (and, or)
Pronouns (he, its)
… more
39
OPEN VS CLOSED CLASSES
Open vs. Closed classes:
Closed:
determiners: a, an, the,…
pronouns: she, he, I,…
prepositions: on, under, over, near, by,…
Open:
Nouns, Verbs, Adjectives, Adverbs.
40
CHALLENGES IN POS TAGGING: AMBIGUITY
Words often have more than one possible POS, e.g. back:
The back door = JJ (adjective)
On my back = NN (noun)
Win the voters back = RB (adverb)
Promised to back the bill = VB (verb)
See list of POS tags:
http://rednoise.org/rita/reference/PennTags.html
41
PART-OF-SPEECH TAGGING
POS tagging example:
Input: Plays well with others
Candidate tags:
Plays → NNS, VBZ
well → UH, JJ, NN, RB
with → IN
others → NNS
Output: Plays/VBZ well/RB with/IN others/NNS
1. List candidate labels for each word.
2. Based on probabilities learnt from training data, the classifier
predicts the most likely POS sequence.
NB: the fact that there are 2 unambiguous words (with & others)
is useful: we can label them first, and then predict the other 2.
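Step 2 is typically done with the Viterbi algorithm. A compact sketch, using invented transition and emission tables in the spirit of the earlier running example:

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Find the most likely tag sequence under transition and emission tables."""
    # best[t] = (probability, path) of the best sequence ending in tag t
    best = {t: (trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((p * trans.get((prev, t), 0.0) * emit.get((t, w), 0.0), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])

# Invented tables for the running example.
trans = {("<s>", "DET"): 0.9, ("DET", "NN"): 1.0, ("NN", "VB"): 0.5,
         ("NN", "NN"): 0.3, ("VB", "DET"): 0.7}
emit = {("DET", "the"): 0.5, ("DET", "a"): 0.4, ("NN", "man"): 0.1,
        ("NN", "cat"): 0.1, ("NN", "saw"): 0.01, ("VB", "saw"): 0.2}

prob, path = viterbi(["the", "man", "saw", "a", "cat"], ["DET", "NN", "VB"], trans, emit)
print(path, prob)  # ['DET', 'NN', 'VB', 'DET', 'NN']
```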
42
PART-OF-SPEECH TAGGING
Uses of POS tagging in NLP:
Text-to-speech.
Phrase search through regular expressions, e.g. (Det) Adj* N+
As input to, or to speed up, a full linguistic parser (later lectures).
If you know the tag, you can back off to it, e.g. in lemmatisation:
saw → see (verb), or saw → saw (noun)?
In other fields:
Computational biology.
Prediction of series of data, e.g. weather forecasting.
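For quick use in practice, NLTK ships a pretrained tagger; a minimal example (the punkt and averaged_perceptron_tagger resources must be downloaded once):

```python
import nltk

# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run only
tokens = nltk.word_tokenize("The man saw a cat")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('man', 'NN'), ('saw', 'VBD'), ('a', 'DT'), ('cat', 'NN')]
```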
43
POS TAGGING PERFORMANCE
Current POS tagging systems are very accurate:
A baseline system that predicts the most common POS for a word
already gets 90% accuracy → owing to many words not being
ambiguous.
Standard classifiers (no sequence) can achieve 93% accuracy.
Sequence classifiers can achieve 97% accuracy.
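A sketch of that 90% baseline: tag every word with the tag it most often had in training, with an assumed default (NN here) for unknown words:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Map each word to the tag it appears with most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, table, default="NN"):
    # Unknown words fall back to an assumed overall most common tag.
    return [(w, table.get(w.lower(), default)) for w in words]

train = [[("the", "DET"), ("man", "NN"), ("saw", "VB"), ("a", "DET"), ("cat", "NN")]]
table = train_baseline(train)
print(tag_baseline(["The", "cat", "saw", "a", "man"], table))
```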
44
REFERENCES
List of software for POS tagging:
https://aclweb.org/aclwiki/Part-of-speech_tagging
45
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. 3rd edition
draft. Chapters 9–10.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6,
Section 1.6.