Session 6: Part-of-speech (PoS) Tagging¶
Preliminaries¶
Things for you to do
The first thing you need to do is run the following cell. This will give you access to the Sussex NLTK package.
In [ ]:
import sys
sys.path.append(r'T:\Departments\Informatics\LanguageEngineering')
This session concerns the task of part-of-speech tagging. It is loosely divided into 2 parts: the first part deals with the notion of PoS ambiguity of a vocabulary type; and the second part compares the performance of two taggers on various corpora.
We will be making an important distinction between tokens and types. A sentence in a document is made up of a sequence of tokens. For example, ["the", "cat", "sat", "on", "the", "mat", "."] contains 7 tokens, but only 6 distinct strings: there are two occurrences of "the". The way we say this is that there are 6 types in the sentence, but 7 tokens. Tokens are occurrences of types. In this session we will be looking at the ambiguity of types, not tokens.
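You can check the distinction quickly in Python (a small illustrative snippet):
In [ ]:
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]
types = set(tokens)   # the distinct strings occurring in the sentence
print len(tokens)     # 7 tokens
print len(types)      # 6 types ("the" occurs twice but is a single type)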
NOTE
You will need to place the following line as the first line of your script in order to avoid only doing integer division:
from __future__ import division
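For example, without the import Python 2 truncates division between two integers; with it you get the true result:
In [ ]:
from __future__ import division
print 3 / 4  # prints 0.75; without the import, Python 2 would print 0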
Average PoS tag ambiguity¶
The Part-of-Speech (PoS) tag ambiguity of a type is a measure of how varied the PoS tags are for that type. Some types are always (or almost always) labelled with the same PoS tag, so exhibit no (or very little) ambiguity. It is easy to predict the correct PoS tag for such words. On the other hand, a type that is commonly labelled by a variety of different PoS tags exhibits a high level of ambiguity, and is more challenging to deal with.
In this session we are going to be considering two measures of a type’s ambiguity. In this section we consider a simple measure that just counts the number of different tags that label the type. In the next section we will look at a more complex information-theoretic measure based on entropy.
Things for you to do
Create a function simple_pos_ambiguity. See below for guidance.
Function: simple_pos_ambiguity
Arguments
none
Returns
A dictionary (hashmap) mapping each type to its degree of ambiguity (the number of distinct PoS tags that the type is labelled with in the Wall Street Journal Corpus).
Create simple_pos_ambiguity as follows:
Create a Wall Street Journal corpus reader
Use the corpus reader's method tagged_words to get a list of all tokens in the corpus tagged with their PoS (e.g. if your corpus reader is called wsj_reader, then you'd call wsj_reader.tagged_words()). This method is available because the Wall Street Journal corpus has been hand-annotated with PoS tags.
For each type, build a set containing all of the different PoS tags that are assigned to that type. So if in the Wall Street Journal corpus "red" occurred only as a noun and adjective, then this would be a two-element set containing just those two part-of-speech tags. The size (cardinality) of the set is the ambiguity of that type. See below for details.
Return a Python dictionary (hashmap) mapping each type to its ambiguity.
The code below gives hints on how you might keep track of the number of PoS tags a single type occurs with. You will have to generalise to all types.
In [ ]:
example_dict = {}  # create an empty dictionary

# Only need to do this if we haven't already seen 'blue'
if 'blue' not in example_dict:
    example_dict["blue"] = set()  # map "blue" to an empty set

example_dict["blue"].add("JJ")  # add "JJ" to "blue"'s set
example_dict["blue"].add("NN")  # add "NN" to "blue"'s set
# If you call the above line twice, only one "NN" will be in the set,
# because sets don't contain duplicate elements.

print len(example_dict["blue"])  # print the size of "blue"'s tag set, which is its simple ambiguity

ambiguities = {}  # new dictionary that's going to hold the ambiguities of types
for word_type, tagset in example_dict.iteritems():  # iterate over all key-value pairs in our dict
    ambiguities[word_type] = len(tagset)

print ambiguities

# once you have written simple_pos_ambiguity you can do the following...
ambiguities = simple_pos_ambiguity()
print ambiguities["blue"]
Since simple_pos_ambiguity returns a dictionary mapping from types to their ambiguities, you can get the average ambiguity by simply using the NumPy average function (from numpy import average) on the values of the dictionary: average(ambiguities.values())
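For reference, here is a minimal sketch of how the pieces might fit together. It is just one possible implementation, and it assumes the WSJCorpusReader introduced later in this session:
In [ ]:
from __future__ import division
from numpy import average
from sussex_nltk.corpus_readers import WSJCorpusReader

def simple_pos_ambiguity():
    tagsets = {}  # maps each type to the set of PoS tags it is labelled with in the WSJ
    for token, tag in WSJCorpusReader().tagged_words():
        if token not in tagsets:
            tagsets[token] = set()
        tagsets[token].add(tag)
    # Replace each tag set with its size, i.e. the type's simple ambiguity
    return dict((word_type, len(tags)) for word_type, tags in tagsets.iteritems())

ambiguities = simple_pos_ambiguity()
print ambiguities["blue"]            # should be 2 ("NN" and "JJ")
print average(ambiguities.values())  # average ambiguity over all types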
Things for you to do
Check that the ambiguity of “blue” is 2 in the Wall Street Journal corpus. It occurs as a noun (“NN”) and adjective (“JJ”) only.
Use simple_pos_ambiguity to calculate the average PoS tag ambiguity in the Wall Street Journal corpus.
Entropy as a measure of ambiguity¶
In this activity, you are given a function that calculates PoS ambiguity in a different way, using the notion of entropy. Call the function get_entropy_ambiguity to get the PoS ambiguity of a word in the WSJ.
As with the previous measure, lower values correspond to less ambiguity: an ambiguity of 0 means the type is completely unambiguous, while values around 1 or higher indicate high ambiguity.
In order to keep the code simple, it only computes the ambiguity of one word. So you’ll have to call the function once per word you’re interested in.
NOTE
The code below uses try-except statements. The code under the try statement is executed, and if an exception is raised (e.g. a KeyError, meaning that the thing you’re looking for isn’t in the hashmap in this case), then the code under the except statement is executed. This is a very Pythonic way of coding. The alternative would be e.g. to check at every single iteration whether the item is in the hashmap before trying to update it.
In [ ]:
from math import log
from sussex_nltk.corpus_readers import WSJCorpusReader

def get_entropy_ambiguity(word):
    # Get the PoS ambiguity of *word* according to its occurrences in the WSJ
    pos_counts = {}  # keep track of the number of times *word*
                     # appears with each PoS tag
    for token, tag in WSJCorpusReader().tagged_words():  # for each token and tag in the WSJ
        if token == word:  # if this token is the word we're interested in
            try:
                pos_counts[tag] += 1  # if we've seen the tag before, increment the
                                      # counter keeping track of occurrences
            except KeyError:          # otherwise a KeyError will be thrown; catch it
                pos_counts[tag] = 1   # and start the counter at 1 for that tag
    return entropy(pos_counts.values())  # return the entropy of the counts

def entropy(counts):  # counts = list of counts of occurrences of tags
    total = sum(counts)      # get total number of occurrences
    if not total: return 0   # if zero occurrences in total, then 0 entropy
    entropy = 0
    for i in counts:  # for each tag count
        p = i / float(total)  # probability that the token occurs with this tag
        try:
            entropy += p * log(p, 2)  # add to entropy
        except ValueError: pass       # if p == 0, then ignore this p
    return -entropy if entropy else entropy  # only negate if nonzero, otherwise
                                             # floats can return -0.0, which is weird

# Usage:
print 'Ambiguity of "blue": %s' % get_entropy_ambiguity("blue")
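To see how this differs from the simple count-based measure, try the entropy function on some made-up tag counts (the numbers below are illustrative, not taken from the WSJ):
In [ ]:
# Assumes the entropy function from the cell above has been defined.
# Both hypothetical types below occur with exactly two distinct tags,
# so their simple ambiguity would be 2 in each case.
print entropy([50, 50])  # 1.0   -- the two tags are equally likely: highly ambiguous
print entropy([99, 1])   # ~0.08 -- one tag dominates: barely ambiguous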
NOTE
This code uses the built-in math function log
Things for you to do
Study the code for this measure; find out what information is used to calculate the PoS ambiguity.
Once you know what information it uses, use print statements to print out that information in order to investigate how and why this measure differs from the previous one.
Use your simple measure of PoS ambiguity (from the previous section) to calculate the PoS ambiguity of the words “either” and “value”. Now do the same with the entropy-based ambiguity measure. How do the measures differ? Why? Which measure produces a more representative figure for how ambiguous the PoS of a type is? Why?
Experiment with PoS taggers¶
In this section you will have a chance to use two different Part-of-Speech taggers: the NLTK Maximum Entropy PoS tagger; and the Twitter-specific PoS tagger from Gimpel et al.
The following code shows you how to use these taggers:
In [ ]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from sussex_nltk.tag import twitter_tag_batch
from nltk import pos_tag
from nltk.tokenize import word_tokenize

number_of_sentences = 10  # number of sentences to sample and display

rcr = ReutersCorpusReader()  # create a corpus reader
sentences = rcr.sample_raw_sents(number_of_sentences)  # sample some sentences

# Tag with the Twitter-specific tagger
# (it also tokenises for you, in a Twitter-specific way)
twitter_tagged = twitter_tag_batch(sentences)

# Tag with NLTK's maximum entropy tagger
nltk_tagged = [pos_tag(word_tokenize(sentence)) for sentence in sentences]

# Print each sentence
for raw, twitter_sentence, nltk_sentence in zip(sentences, twitter_tagged, nltk_tagged):
    print "-----------------Sentence----------------"
    print "Raw:\n %s " % raw
    print "Twitter tagged:"
    for token, tag in twitter_sentence:
        print " %s\t%s" % (token, tag)
    print "NLTK tagged:"
    for token, tag in nltk_sentence:
        print " %s\t%s" % (token, tag)
The above example code uses the Python zip function, which allows you to iterate over multiple iterables at once.
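For example (a trivial illustration, unrelated to the corpora above):
In [ ]:
for letter, number in zip(["a", "b", "c"], [1, 2, 3]):
    print letter, number  # prints "a 1", then "b 2", then "c 3"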
Things for you to do
Try each tagger on a sample of sentences from the Reuters, Medline and Twitter corpora.
Try to identify the strengths and limitations of each tagger on each corpus.
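A possible starting point is sketched below. It assumes that sussex_nltk provides MedlineCorpusReader and TwitterCorpusReader with the same sample_raw_sents method as ReutersCorpusReader; check the package's corpus_readers module if the names differ:
In [ ]:
from sussex_nltk.corpus_readers import MedlineCorpusReader, TwitterCorpusReader  # assumed reader names
from sussex_nltk.tag import twitter_tag_batch
from nltk import pos_tag
from nltk.tokenize import word_tokenize

for reader in [MedlineCorpusReader(), TwitterCorpusReader()]:
    sentences = reader.sample_raw_sents(5)  # sample a few raw sentences from this corpus
    twitter_tagged = twitter_tag_batch(sentences)                  # Twitter-specific tagger
    nltk_tagged = [pos_tag(word_tokenize(s)) for s in sentences]   # NLTK maximum entropy tagger
    # ...then print and compare the two taggings, as in the example cell above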