Part of Speech Tagging
Several corpora with manual part-of-speech (POS) tagging are included in NLTK. For this exercise, we’ll use a sample of the Penn Treebank corpus, a collection of Wall Street Journal articles. We can access the part-of-speech information for either the Penn Treebank or the Brown corpus as follows. We use tagged sentences here because that is the preferred representation for doing POS tagging.
In [2]:
import nltk
from nltk.corpus import treebank, brown
nltk.download('treebank')
nltk.download('brown')
print(treebank.tagged_sents()[0])
print(brown.tagged_sents()[0])
[nltk_data] Downloading package treebank to
[nltk_data] C:\Users\Comp30019\AppData\Roaming\nltk_data…
[nltk_data] Package treebank is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data] C:\Users\Comp30019\AppData\Roaming\nltk_data…
[nltk_data] Unzipping corpora\brown.zip.
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]
In NLTK, word/tag pairs are stored as tuples; the transformation from the plain-text “word/tag” representation to Python data types is done by the corpus reader.
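If you ever need to perform that conversion yourself on raw “word/tag” text, NLTK exposes a helper for it, nltk.tag.str2tuple; a minimal sketch:
In [ ]:
from nltk.tag import str2tuple
# Split a plain-text "word/tag" token into a (word, tag) tuple.
print(str2tuple('board/NN'))  # ('board', 'NN')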
The two corpora do not use the same tagset; the Brown corpus was tagged with a more fine-grained tagset: for instance, instead of “DT” (determiner) as in the Penn Treebank, the word “the” is tagged “AT” (article, which is a kind of determiner). We can convert both to the Universal tagset.
In [3]:
import nltk
nltk.download('universal_tagset')
print(treebank.tagged_sents(tagset="universal")[0])
print(brown.tagged_sents(tagset="universal")[0])
[nltk_data] Downloading package universal_tagset to
[nltk_data] C:\Users\Comp30019\AppData\Roaming\nltk_data…
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]
[nltk_data] Unzipping taggers\universal_tagset.zip.
Now, let’s create a basic unigram POS tagger. First, we need to collect POS distributions for each word. We’ll do this (somewhat inefficiently) using a dictionary of dictionaries. Note that we are using the PTB tag set from here on.
In [4]:
from collections import defaultdict

# Count how often each (word, POS) combination occurs.
POS_dict = defaultdict(dict)
for word_pos_pair in treebank.tagged_words():
    word = word_pos_pair[0].lower()
    POS = word_pos_pair[1]
    POS_dict[word][POS] = POS_dict[word].get(POS, 0) + 1
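As an aside, the same counts can be collected a little more idiomatically with collections.Counter; a sketch of an equivalent version (POS_counts is just an illustrative name):
In [ ]:
from collections import Counter, defaultdict
# Counter handles the missing-key initialisation that .get(POS, 0) did above.
POS_counts = defaultdict(Counter)
for word, POS in treebank.tagged_words():
    POS_counts[word.lower()][POS] += 1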
Let’s look at some words which appear with multiple POS, and their POS counts:
In [5]:
for word in list(POS_dict.keys())[:100]:
    if len(POS_dict[word]) > 1:
        print(word)
        print(POS_dict[word])
old
{'JJ': 24, 'NNP': 8}
will
{'MD': 280, 'NN': 1}
the
{'DT': 4753, 'NNP': 5, 'JJ': 5, 'CD': 1}
board
{'NN': 30, 'NNP': 43}
as
{'IN': 362, 'RB': 53}
a
{'DT': 1979, 'NNP': 2, 'JJ': 2, 'IN': 1, 'LS': 1, 'NN': 3}
nov.
{'NNP': 23, 'NN': 1}
chairman
{'NN': 45, 'NNP': 12}
dutch
{'NNP': 1, 'JJ': 2}
publishing
{'VBG': 4, 'NN': 10, 'NNP': 4}
group
{'NN': 43, 'NNP': 26}
and
{'CC': 1549, 'JJ': 3, 'IN': 1, 'NN': 2, 'NNP': 1}
gold
{'NNP': 2, 'NN': 9}
fields
{'NNP': 2, 'NNS': 11}
named
{'VBN': 21, 'VBD': 2}
british
{'JJ': 7, 'NNP': 4}
industrial
{'JJ': 19, 'NNP': 6}
form
{'NN': 17, 'VB': 1}
once
{'RB': 16, 'IN': 2}
used
{'VBN': 27, 'JJ': 2, 'VBD': 4}
to
{'TO': 2179, 'IN': 2, 'JJ': 1}
make
{'VB': 64, 'VBP': 10}
has
{'VBZ': 338, 'VBP': 1}
caused
{'VBN': 9, 'VBD': 4}
high
{'JJ': 27, 'NN': 3, 'NNP': 20, 'RB': 2}
cancer
{'NN': 7, 'NNP': 2}
more
{'RBR': 88, 'JJ': 1, 'JJR': 115}
ago
{'IN': 22, 'RB': 16}
reported
{'VBD': 25, 'VBN': 12}
even
{'RB': 72, 'JJ': 3, 'VB': 1}
brief
{'JJ': 4, 'VB': 1}
that
{'WDT': 217, 'IN': 514, 'DT': 114, 'RB': 3}
show
{'VBP': 4, 'NN': 7, 'NNP': 1, 'VB': 2}
up
{'RP': 77, 'RB': 49, 'IN': 27}
later
{'JJ': 12, 'RBR': 2}
said
{'VBD': 614, 'VBN': 14}
lorillard
{'NNP': 4, 'NN': 1}
new
{'JJ': 168, 'NNP': 160}
york-based
{'JJ': 4, 'NNP': 1}
stopped
{'VBD': 4, 'VBN': 3}
Common ambiguities that we see here are between nouns and verbs (publishing, form, show), and, among verbs, between past tense and past participles (named, reported, stopped).
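To see how far the ambiguity goes, we can also rank words by how many distinct tags they received; a quick sketch using the POS_dict we just built:
In [ ]:
# Sort words by the number of distinct PTB tags they appear with.
most_ambiguous = sorted(POS_dict, key=lambda w: len(POS_dict[w]), reverse=True)
for word in most_ambiguous[:5]:
    print(word, POS_dict[word])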
To create an actual tagger, we just need to pick the most common tag for each word.
In [6]:
tagger_dict = {}
for word in POS_dict:
    tagger_dict[word] = max(POS_dict[word], key=lambda x: POS_dict[word][x])

def tag(sentence):
    return [(word, tagger_dict.get(word, "NN")) for word in sentence]

example_sentence = """You better start swimming or sink like a stone , cause the times they are a - changing .""".split()
print(tag(example_sentence))
[('You', 'NN'), ('better', 'JJR'), ('start', 'VB'), ('swimming', 'NN'), ('or', 'CC'), ('sink', 'VB'), ('like', 'IN'), ('a', 'DT'), ('stone', 'NN'), (',', ','), ('cause', 'NN'), ('the', 'DT'), ('times', 'NNS'), ('they', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('-', ':'), ('changing', 'VBG'), ('.', '.')]
We’d probably want better handling of capitalized words (backing off to NNP, or reusing the statistics for the lower-cased token), and there are a few other obvious errors, but generally it’s not too bad.
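One way to add that handling, sketched below: fall back to the statistics of the lower-cased token, and default unknown capitalized tokens to NNP rather than NN (tag_with_case is a hypothetical name, not part of NLTK):
In [ ]:
def tag_with_case(sentence):
    # Try the token as-is, then its lower-cased form; guess NNP for
    # unknown capitalized tokens and NN for everything else.
    tags = []
    for word in sentence:
        if word in tagger_dict:
            tags.append((word, tagger_dict[word]))
        elif word.lower() in tagger_dict:
            tags.append((word, tagger_dict[word.lower()]))
        elif word[0].isupper():
            tags.append((word, "NNP"))
        else:
            tags.append((word, "NN"))
    return tags

print(tag_with_case(example_sentence))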
NLTK has built-in support for n-gram taggers; let’s build unigram and bigram taggers and test their performance. First we need to split our corpus into training and test sets.
In [7]:
size = int(len(treebank.tagged_sents()) * 0.9)
train_sents = treebank.tagged_sents()[:size]
test_sents = treebank.tagged_sents()[size:]
Let’s first compare a unigram and a bigram tagger. All NLTK taggers have an evaluate method, which returns the accuracy on a test set.
In [8]:
from nltk import UnigramTagger, BigramTagger
unigram_tagger = UnigramTagger(train_sents)
bigram_tagger = BigramTagger(train_sents)
print(unigram_tagger.evaluate(test_sents))
print(unigram_tagger.tag(example_sentence))
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(example_sentence))
0.8627989821882952
[('You', 'PRP'), ('better', 'JJR'), ('start', 'VB'), ('swimming', None), ('or', 'CC'), ('sink', 'VB'), ('like', 'IN'), ('a', 'DT'), ('stone', 'NN'), (',', ','), ('cause', 'NN'), ('the', 'DT'), ('times', 'NNS'), ('they', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('-', ':'), ('changing', 'VBG'), ('.', '.')]
0.13455470737913486
[('You', 'PRP'), ('better', None), ('start', None), ('swimming', None), ('or', None), ('sink', None), ('like', None), ('a', None), ('stone', None), (',', None), ('cause', None), ('the', None), ('times', None), ('they', None), ('are', None), ('a', None), ('-', None), ('changing', None), ('.', None)]
The unigram tagger does much better. The reason is sparsity: the bigram tagger doesn’t have counts for many of the word/tag-context pairs, and, worse, once it fails to tag one token it fails for the rest of the sentence, because it has no counts at all for contexts containing the missing tag. We can fix this by adding backoff taggers, including a default tagger which just tags everything as NN.
In [9]:
from nltk import DefaultTagger
default_tagger = DefaultTagger("NN")
unigram_tagger = UnigramTagger(train_sents,backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents,backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(example_sentence))
0.8905852417302799
[('You', 'PRP'), ('better', 'JJR'), ('start', 'VB'), ('swimming', 'NN'), ('or', 'CC'), ('sink', 'VB'), ('like', 'IN'), ('a', 'DT'), ('stone', 'NN'), (',', ','), ('cause', 'VB'), ('the', 'DT'), ('times', 'NNS'), ('they', 'PRP'), ('are', 'VBP'), ('a', 'DT'), ('-', ':'), ('changing', 'VBG'), ('.', '.')]
We see roughly a 3% absolute increase in accuracy from adding the bigram information on top of the unigram information.
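The same backoff chain extends naturally to longer contexts with NLTK’s TrigramTagger; whether the extra context helps on a corpus this small is an empirical question, but the pattern is the same:
In [ ]:
from nltk import TrigramTagger
# Stack one more level of context: trigram -> bigram -> unigram -> default.
trigram_tagger = TrigramTagger(train_sents, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_sents))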
NLTK has interfaces to the Brill tagger (nltk.tag.brill) and also to pre-built, state-of-the-art sequential POS tagging models, for instance the Stanford POS tagger (StanfordPOSTagger), which is what you should use if you actually need high-quality POS tagging for some application. If you are working on a computer with the Stanford CoreNLP tools installed and NLTK set up to use them (this is the case for the lab computers where workshops are held), the code below should work. If not, see the documentation here under “Stanford Tagger, NER, Tokenizer and Parser” for installation instructions.
In [3]:
from nltk.tag import StanfordPOSTagger
stanford_tagger = StanfordPOSTagger('english-bidirectional-distsim.tagger')
print(stanford_tagger.tag(brown.sents()[1]))
D:\anaconda3\lib\site-packages\nltk\tag\stanford.py:149: DeprecationWarning:
The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordPOSTagger, self).__init__(*args, **kwargs)
---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
      1 from nltk.tag import StanfordPOSTagger
      2
----> 3 stanford_tagger = StanfordPOSTagger('english-bidirectional-distsim.tagger')
      4 print(stanford_tagger.tag(brown.sents()[1]))

D:\anaconda3\lib\site-packages\nltk\tag\stanford.py in __init__(self, *args, **kwargs)
    147
    148     def __init__(self, *args, **kwargs):
--> 149         super(StanfordPOSTagger, self).__init__(*args, **kwargs)
    150
    151     @property

D:\anaconda3\lib\site-packages\nltk\tag\stanford.py in __init__(self, model_filename, path_to_jar, encoding, verbose, java_options)
     64             self._JAR, path_to_jar,
     65             searchpath=(), url=_stanford_url,
---> 66             verbose=verbose)
     67
     68         self._stanford_model = find_file(model_filename,

D:\anaconda3\lib\site-packages\nltk\__init__.py in find_jar(name_pattern, path_to_jar, env_vars, searchpath, url, verbose, is_regex)
    719                 searchpath=(), url=None, verbose=False, is_regex=False):
    720     return next(find_jar_iter(name_pattern, path_to_jar, env_vars,
--> 721                               searchpath, url, verbose, is_regex))
    722
    723

D:\anaconda3\lib\site-packages\nltk\__init__.py in find_jar_iter(name_pattern, path_to_jar, env_vars, searchpath, url, verbose, is_regex)
    714                             (name_pattern, url))
    715         div = '='*75
--> 716         raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
    717
    718 def find_jar(name_pattern, path_to_jar=None, env_vars=(),

LookupError:
===========================================================================
NLTK was unable to find stanford-postagger.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-postagger.jar, see:
===========================================================================
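As the deprecation warning above indicates, newer NLTK versions access the Stanford tools through a CoreNLP server rather than the old jar-based interface. A sketch of that route, assuming a CoreNLP server has already been started separately and is listening on localhost port 9000:
In [ ]:
from nltk.parse.corenlp import CoreNLPParser
# Assumes a Stanford CoreNLP server is running at this URL;
# tagtype='pos' makes .tag() return POS tags rather than NER labels.
pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
print(pos_tagger.tag(brown.sents()[1]))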