Sequence Labelling 1: Part-of-Speech Tagging
This time:
Parts of Speech
What are they useful for?
Open and closed PoS classes
PoS Tagsets
The Penn Treebank Tagset
PoS Tagging
Sources of information for tagging
A simple unigram tagger
Evaluating taggers
Parts of Speech
Words can be categorised according to how they behave grammatically
Traditionally, linguists distinguish about nine lexical categories (parts of speech):
noun, verb, adjective, adverb, pronoun, determiner, preposition, conjunction, interjection
NLP often employs a larger set of categories
Are Parts of Speech Useful?
Identifying parts of speech is an important pre-processing step:
Can help to disambiguate words:
— information retrieval
— text-to-speech systems (pronunciation)
Tells us what sorts of words are likely to occur nearby:
— adjectives often followed by nouns: happy student
— personal pronouns often followed by verbs: you laugh
Important for identifying larger grammatical structures
— grammatically plausible sequences (e.g. phrasal terms)
— parsing (will look at this in weeks 6 and 7)
Open and Closed Classes
Parts of speech can be divided into open and closed categories.
Open classes: so-called because they are not fixed
— New words may be added fairly often
— Other words may go out of the language
Closed classes: these classes are fixed
— Words are ‘functional’ rather than ‘content-bearing’
Open Classes
There are four major open PoS classes:
— noun
— verb
— adjective
— adverb
In practice, may distinguish between different sorts of noun, different sorts of verb, etc.
Open Classes: Nouns
Proper nouns
— England, Kim, Microsoft, …
Common nouns
— count nouns: window, tyre, idea
— mass nouns: snow, rice, courage, …
e.g.: Kim wiped some snow off the windows
Open Classes: Verbs
Actions and processes
— run, chase, say, believe,…
N.B.: auxiliary verbs are a fixed set
— be, have, may, should, etc.: not an open class
e.g.: It is said that he chased the thief
Open Classes: Adjectives & Adverbs
Adjectives: properties and qualities
— modify nouns: green, small, clever, mythical,…
Adverbs: usually modify verbs or verbal phrases
— slowly, now, unfortunately, possibly, …
e.g.: Unfortunately, the clever thief quickly stole the green jewel
Closed Classes
Fixed set of words in class
Usually function words: grammatically important
— not meaning-bearing or ‘contentful’
— frequently occurring and often short, e.g. the, it, in, and
Closed Classes in English
article          the, a, an, some
preposition      on, under, over, to, with, by
pronoun          she, you, I, who
conjunction      and, but, or, as, when, if
auxiliary verb   can, may, are
particle         up, down, at, by
numeral          one, two, first, second
Part-of-Speech Tagsets
A tagset provides a set of labels for marking PoS classes.
Different tagsets have been derived from work on text corpora:
Brown corpus                    ≈ 80 tags
Penn Treebank                   ≈ 45 tags
Susanne corpus                  ≈ 350 tags
British National Corpus (BNC)   ≈ 60 tags
The Penn Treebank tagset (1)
CC    Coordinating conjunction    and, but, or
CD    Cardinal number             one, two
DT    Determiner                  the, some
EX    Existential there           there
FW    Foreign word                hoc
IN    Preposition                 of, in, by
JJ    Adjective                   big
JJR   Adjective, comparative      bigger
JJS   Adjective, superlative      biggest
LS    List item marker            1, One
MD    Modal                       can, should
The Penn Treebank tagset (2)
NN    Noun, singular or mass      dog
NNS   Noun, plural                dogs
NNP   Proper noun, singular       Edinburgh
NNPS  Proper noun, plural         Orkneys
PDT   Predeterminer               all, both
POS   Possessive ending           ’s
PP    Personal pronoun            I, you, she
PP$   Possessive pronoun          my, theirs
RB    Adverb                      quickly
RBR   Adverb, comparative         faster
RBS   Adverb, superlative         fastest
The Penn Treebank tagset (3)
RP    Particle                    up, off
SYM   Symbol                      +, %, &
TO    The word “to”               to
UH    Interjection                oh, oops
VB    Verb, base form             eat
VBD   Verb, past tense            ate
VBG   Verb, gerund                eating
VBN   Verb, past participle       eaten
VBP   Verb, non-3sg present       eat
VBZ   Verb, 3sg present           eats
WDT   Wh-determiner               which, that
WP    Wh-pronoun                  what, who
The Penn Treebank tagset (4)
WP$   Possessive wh-              whose
WRB   Wh-adverb                   how, where
$     Dollar sign                 $
#     Pound sign                  #
“     Left quote                  ‘ or “
”     Right quote                 ’ or ”
(     Left parenthesis            (
)     Right parenthesis           )
,     Comma                       ,
.     Sentence-final punctuation  . ! ?
:     Mid-sentence punctuation    : ; — …
Part-of-Speech Tagging
PoS tagging is the process of assigning a single part-of-speech tag to each word (and punctuation mark) in some text
“/“ The/DT guys/NNS that/WDT make/VBP traditional/JJ hardware/NN are/VBP really/RB being/VBG obsoleted/VBN by/IN microprocessor-based/JJ machines/NNS ,/, ”/” said/VBD Mr./NNP Benton/NNP ./.
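Output like this can be produced with NLTK's off-the-shelf tagger. A minimal sketch, assuming the required NLTK data packages ('punkt' and 'averaged_perceptron_tagger') have been downloaded via nltk.download():

    import nltk

    # Tokenise, then tag with NLTK's default (Penn Treebank) tagger
    tokens = nltk.word_tokenize("The guys that make traditional hardware "
                                "are really being obsoleted.")
    print(nltk.pos_tag(tokens))  # a list of (word, tag) pairs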
Performing Part-of-Speech Tagging
It is a non-trivial task
Must resolve ambiguities
— the same word can have different tags in different contexts
Brown corpus
— 11.5% of word types and 40% of word tokens are ambiguous
Often one tag is much more likely for a given word than any other
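The Brown corpus figures above can be checked directly against the tagged corpus. A minimal sketch, assuming nltk.download('brown') has been run (exact percentages depend on case-folding and the tagset used):

    from collections import defaultdict
    from nltk.corpus import brown

    # Collect the set of tags observed for each word type
    tags_for = defaultdict(set)
    for word, tag in brown.tagged_words():
        tags_for[word.lower()].add(tag)

    # A word type is ambiguous if it was seen with more than one tag
    ambiguous = {w for w, tags in tags_for.items() if len(tags) > 1}
    print(len(ambiguous) / len(tags_for))  # fraction of ambiguous types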
Performing Part-of-Speech Tagging
A comparatively shallow form of processing
Assigning one tag for each word
No larger structures created
— e.g. no identification of phrasal units
A task of modest but genuine value that can usually be done with high accuracy
Generally considered useful for downstream processing steps
Information Sources for PoS Tagging
What information can be used to determine the most likely PoS tag for a word?
Word identity (what the word is)
Adjacent PoS tags
Information Sources: Word Identity
In isolation, a word is often ambiguous as to PoS: e.g. light:
Fading light/NN saves Pakistan from England defeat
The Dell XPS 13 is an astonishingly thin and light/JJ laptop
Come on baby, light/VB my fire
Often one tag is more likely than the other possibilities
Tagging each word with its most common tag results in a tagger with about 90% accuracy
Information Sources: Adjacent PoS Tags
PoS tags of adjacent words may be uncertain
But some tag sequences are more likely than others:
Fading/VBG light/NN saves/VBZ
— JJ and VB tags for light very unlikely in this context
Could use this information to select best possible sequence
Using Good Tag Sequences
Using only information about the most likely PoS tag sequence does not result in an accurate tagger
Yields a PoS tagger that is about 77% correct
NB: this is the percentage of tokens assigned the correct PoS tag
— that is about one error in every four words
Very few sentences would be labelled completely accurately
Best methods combine information about most probable tag for each word and most likely PoS sequences
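One simple way of combining the two sources is NLTK's backoff chain. A minimal sketch, assuming nltk.download('treebank') has been run (the 3,000-sentence training slice is an arbitrary choice):

    import nltk
    from nltk.corpus import treebank

    train = treebank.tagged_sents()[:3000]

    # The bigram tagger conditions on the previous tag; for contexts
    # unseen in training it backs off to the per-word unigram tagger
    unigram = nltk.tag.UnigramTagger(train)
    bigram = nltk.tag.BigramTagger(train, backoff=unigram)
    print(bigram.tag(['Fading', 'light', 'saves']))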
A Unigram PoS Tagger
The NLTK provides a unigram tagger: nltk.tag.UnigramTagger
Implements a tagging algorithm based on a table of unigram probabilities:
tag(w) = argmax_t P(t | w)
Choose most likely tag for word w
Always chooses same tag for a word independent of context
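A minimal sketch of training and using it, assuming nltk.download('treebank') has been run (the training slice is an arbitrary choice):

    import nltk
    from nltk.corpus import treebank

    train = treebank.tagged_sents()[:3000]
    tagger = nltk.tag.UnigramTagger(train)

    # The same word gets the same tag regardless of context
    print(tagger.tag(['a', 'light', 'laptop']))
    print(tagger.tag(['light', 'my', 'fire']))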
Estimating Parameters
tag(w) = argmax_t P(t | w)
We need a way of estimating P(t | w)
Use training data to estimate these probabilities:
P(t | w) = (number of occurrences of w tagged as t) / (total number of occurrences of w)
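A minimal sketch of this estimate, counting (word, tag) pairs in training data (assumes nltk.download('treebank') has been run):

    from collections import Counter, defaultdict
    from nltk.corpus import treebank

    # Count how often each word occurs with each tag
    counts = defaultdict(Counter)
    for sent in treebank.tagged_sents()[:3000]:
        for word, tag in sent:
            counts[word][tag] += 1

    def tag_word(w):
        # argmax over t of P(t | w): the most frequent tag seen for w
        return counts[w].most_common(1)[0][0] if w in counts else None

    print(tag_word('light'))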
Evaluating taggers
Compare output of a tagger with a human-labelled gold standard
On edited text, best methods have accuracy of 96–97%
— when using the (small) Penn Treebank tagset
— an average of one error every couple of sentences
Inter-annotator agreement is also around 97%
Recall that a unigram tagger is about 90% accurate
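A minimal sketch of such an evaluation, holding out unseen sentences as the gold standard (assumes nltk.download('treebank'); note that evaluate() is renamed accuracy() in NLTK >= 3.6):

    import nltk
    from nltk.corpus import treebank

    train = treebank.tagged_sents()[:3000]
    test = treebank.tagged_sents()[3000:]

    tagger = nltk.tag.UnigramTagger(train)
    # Fraction of test tokens given the gold-standard tag;
    # in the region of 0.9 here (unknown words receive None)
    print(tagger.evaluate(test))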
Alternative Approaches
Hidden Markov Model (HMM) taggers
e.g. nltk.tag.hmm.HiddenMarkovModelTagger
Brill Tagger (Transformation-based error-driven learning)
nltk.tag.brill.BrillTagger
Maximum Entropy Taggers
e.g. nltk.tag.stanford.StanfordTagger
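For example, NLTK's HMM tagger can be trained directly on tagged sentences. A minimal sketch (assumes nltk.download('treebank'); HMMs themselves are the subject of the next lecture):

    from nltk.corpus import treebank
    from nltk.tag.hmm import HiddenMarkovModelTagger

    train = treebank.tagged_sents()[:3000]
    hmm_tagger = HiddenMarkovModelTagger.train(train)
    print(hmm_tagger.tag(['Fading', 'light', 'saves']))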
Next Topic: Hidden Markov Models
HMMs, sequence labelling and tagging
Finding the most probable tag sequence
The Viterbi algorithm
Learning HMMs from data