Session 7: Dependency Parsing¶
Preliminaries¶
Things for you to do
The first thing you need to do is run the following cell. This will give you access to the Sussex NLTK package.
In [ ]:
import sys
sys.path.append(r'T:\Departments\Informatics\LanguageEngineering')
This session concerns the task of Dependency Parsing. You will be using our Python implementation of arc-eager transition-based dependency parsing. The aim of this session is to get you comfortable with reading dependency trees, and learning about new dependency relations.
The focus will initially be on the direct object relation “dobj”. You will use it to find what people are wanting, loving, and buying. Your aim by the end of the lab session should be to understand this relation and be able to spot when the parser gets it wrong or right.
You will be given a dependency parser that has been pre-trained on the Wall Street Journal (which uses [Stanford dependency relations](http://nlp.stanford.edu/software/dependencies_manual.pdf) and the same PoS tags as the NLTK PoS tagger).
Dependency parsing produces a dependency tree view of a sentence, allowing us to see how the words function with each other, rather than just viewing the sentence as an unordered bag of words.
Notice that not only must sentences be tokenised before being passed to the parser, they must also be Part-of-Speech (PoS) tagged. The parser relies heavily on PoS information to learn and make decisions.
Parsing a sentence¶
We begin by looking at how to run the dependency parser on a list of sentences that you create.
Things for you to do
Make a short list of example sentences to experiment with.
For example
sentences = ["This is a great product!", "I really wish I hadn't bought this."]
Use the NLTK tokeniser, word_tokenize, to tokenise your sentences and produce a list of tokenised sentences.
Call this list tokenised_sentences.
You will need to import word_tokenize from nltk.tokenize.
Use the nltk Part-of-Speech tagger, pos_tag, to PoS tag your tokenised sentences to produce a list of PoS tagged sentences.
Call this list tagged_sentences.
You will need to import pos_tag from nltk.
Use the dependency parser, dep_parse_sentences_arceager, to parse your PoS tagged sentences to produce a list of parsed sentences.
Call this list parsed_sentences.
You will need to import dep_parse_sentences_arceager from sussex_nltk.parse.
Iterate over the parsed sentences, printing each one. (A sketch covering all of these steps is given below.)
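A minimal sketch of these steps is shown below. The example sentences are placeholders; substitute your own.
In [ ]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sussex_nltk.parse import dep_parse_sentences_arceager

# Example sentences: replace these with your own
sentences = ["This is a great product!", "I really wish I hadn't bought this."]

# Tokenise each sentence, producing a list of tokenised sentences
tokenised_sentences = [word_tokenize(sentence) for sentence in sentences]

# PoS tag each tokenised sentence, producing a list of tagged sentences
tagged_sentences = [pos_tag(sentence) for sentence in tokenised_sentences]

# Dependency parse the tagged sentences
parsed_sentences = dep_parse_sentences_arceager(tagged_sentences)

# Print each parsed sentence
for sentence in parsed_sentences:
    print sentence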
We now explain how to interpret a dependency parsed sentence.
Once the tokens in a sentence have been parsed they will have five attributes:
ID: This is a unique ID assigned to the token (unique within each sentence).
FORM: This is the actual text of a token.
POS: This is the PoS tag assigned to the token.
HEAD: In this attribute, you will find the ID of the token on which the current token depends (the head of the current token).
DEPREL: This is the relation that holds between the current token and its head.
An example sentence is shown below. Notice the following properties of the sentence “the cat sat on the mat.”:
“cat” is a noun. Its head is “sat”, and it’s the subject of “sat” (nsubj relation).
“sat” has 3 dependents: “cat” (its subject), “on” (a preposition), and “.” (punctuation).
There are two “the” tokens, one of which is a dependent of “cat” (with a determiner relation “det”), and the other is a dependent of “mat” (with a determiner relation “det”).
In [ ]:
ID FORM POS HEAD DEPREL
1 the DT 2 det
2 cat NN 3 nsubj
3 sat VBD 0 root
4 on IN 3 prep
5 the DT 6 det
6 mat NN 4 pobj
7 . . 3 punct
The dependency graph for this sentence is shown below.
NOTE
The arrows go FROM the head TO the dependent.
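If it helps, the short sketch below (assuming a ParsedSentence called sentence obtained from the parser, with the token attributes described above) prints each arc explicitly, with the head on the left and the dependent on the right.
In [ ]:
# Assumes *sentence* is a ParsedSentence produced by the parser.
# Build a lookup from token ID to token form so that heads can be named.
id_to_form = dict((token.id, token.form) for token in sentence)

# Print one arc per token: head --deprel--> dependent
for token in sentence:
    head_form = id_to_form.get(token.head, "ROOT")  # a head of 0 means the token is the root
    print "%s --%s--> %s" % (head_form, token.deprel, token.form)
For the example sentence above, this would print lines such as “cat --det--> the” and “sat --nsubj--> cat”.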
Things for you to do
For each of your example sentences, draw the dependency tree that was produced by the parser. You should do this on a piece of paper.
Dependency tree visualisation tool¶
On the teaching drive under Departments/Informatics/LanguageEngineering/, there is a file called “RunParserInLab4.bat”. Double-click this file and it will run an interactive dependency parser.
It performs two tasks, only one of which is relevant to you. In the pane labelled “Plain” you can copy-paste any ParsedSentence print-out, then press SHIFT+ENTER. The dependency tree will then be visualised.
This may help you to understand the trees.
If you would like to use the tool at home, you should instead use a copy of the InteractiveParser.jar file from the same directory. Ensure that your home computer uses Java 7 (or later) at the terminal by default. Then at the command prompt type:
java -Xmx2g -jar /path/to/InteractiveParser.jar
Things for you to do
Run the dependency tree visualisation tool on the same sentences that you experimented with in the previous section.
Parsing the Amazon reviews¶
This section shows how to get the parser to parse a selection of Amazon review sentences. The code can be used to do the following:
Filter a category of the Amazon review corpus for only those sentences containing particular tokens
PoS tag the filtered sentences
Dependency parse and print the filtered sentences
In [ ]:
from sussex_nltk.parse import dep_parse_sentences_arceager  # Import the function which parses an iterable of sentences
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader  # Import the corpus reader
from nltk import pos_tag  # Import the PoS tagging function

# Create a list of reviews that contain the verb "to buy",
# by filtering for several of its conjugations
sentences = []
verb_variants = set(["buy", "buys", "bought"])

# You can use any product category (or even all product categories),
# but DVD is small and takes less time to process
for sentence in AmazonReviewCorpusReader().category("dvd").sents():
    # Check for several variations of the verb. "&" finds the intersection
    # of 2 sets (what they have in common). So the statement below says
    # "if the set of sentence tokens has any token in common with our verbs,
    # then keep the sentence"
    if verb_variants & set(sentence):
        sentences.append(sentence)  # Populate our list of sentences with
                                    # this sentence if it contains what
                                    # we're after.

# Optionally limit the number of reviews to *n* for ease of processing
# and observation. Or perform some kind of sampling (not shown)
n = 10
sentences = sentences[:n]  # Slice notation

# Create an iterable over dependency parsed sentences.
# Round brackets create a generator instead of a list, see the link below
# the code for an explanation of generator expressions. This means that the
# sentences will only be PoS-tagged as and when the parser needs them, one
# at a time.
tagged_sents = (pos_tag(sentence) for sentence in sentences)
parsed_sents = dep_parse_sentences_arceager(tagged_sents)

# Now you can iterate over *parsed_sents* performing computations or printing
for sentence in parsed_sents:
    print "------ Sentence ------"
    print sentence
    # This sentence is a ParsedSentence object, which is explained further in
    # the next section.
Things for you to do
Look through the code snippet above and make sure you understand how it works.
Run the above code with different choices of review categories and different verb variants.
Processing the output of the parser¶
The function dep_parse_sentences_arceager returns a list of sentences, where each sentence is in the form of a ParsedSentence instance (a class defined in the sussex_nltk codebase).
Notice that you can iterate over a ParsedSentence instance, getting its tokens. Each token is a BasicToken instance (another class defined in the sussex_nltk codebase), which keeps track of the attributes of a token. For example, if you have a BasicToken instance called token, you can call token.deprel to get its dependency relation. The code snippet below illustrates the other properties a token has.
In [ ]:
# This code assumes you have the *parsed_sents* and *verb_variants* variables
# from the previous section. *parsed_sents* is a list of ParsedSentence objects

# Print to screen the parsed sentences
for sentence in parsed_sents:  # *parsed_sents* acquired from the previous section
    print "-----"  # Just a separator
    print sentence

# Each sentence is made up of a list of BasicToken objects
# Each token has several attributes: id, form (the actual word), pos,
# head, deprel (dependency relation)
for token in parsed_sents[0]:  # for each token in the first sentence
    print token.id, token.form, token.pos, token.head, token.deprel
The snippet below shows you how to iterate through the sentences printing only the dependents of certain tokens, and only if a certain relation holds (here “nsubj”, but this could be any relation, e.g. “dobj”).
Note that those tokens which have “dobj” (direct object) relations with “love”, “buy” and “want” will be the things that are loved, bought and wanted respectively. For example, in “I love the cat, but will buy the fish”, “cat” is the direct object of “love”, and “fish” is the direct object of “buy”.
The find_all_dependants function is a method on the ParsedSentence class, which takes as input an iterable over words. It will check the sentence for those words, and return all the dependents of all of the matches. The dependents will be BasicToken instances. So if a sentence contains word A, then all tokens which list A as their head will be returned.
In [ ]:
# For each sentence, print all of the dependent tokens that are subjects
# (nsubj relation) of your verb variants
relation = 'nsubj'
for sentence in parsed_sents:
    print "------ Sentence ------"
    print sentence
    # You could instead 'print sentence.raw()', which gives only the original
    # text and takes less room on the screen, but you won't be able to see
    # parsing errors.
    print 'Dependents with "%s" relation:' % relation, \
        [token.form for token in sentence.find_all_dependants(verb_variants) if token.deprel == relation]
    # This prints all of the dependents of *verb_variants* that have the
    # dependency relation "nsubj". It uses the "find_all_dependants"
    # method of the ParsedSentence object. Given an iterable of words,
    # the method will return all of the dependents of those words.
Direct object examples¶
The direct object of a verb is the recipient of the action. So in “I bought Shrek”, “Shrek” is the direct object of a buying action. When we’re looking for the direct objects of “want”, “buy” and “love”, we’re looking for the words which are wanted, bought and loved. If a sentence says that “Shrek” was loved, but the parser doesn’t mark “Shrek” as the direct object of “love”, then it has probably assigned the wrong relation.
The following is a screenshot of a dependency tree in which the “dobj” relation is correctly attached. “DVD” is the word on the receiving end of the buying action, and it is marked as such. See that in the tree, an arrow goes from “buy” to “DVD” with the label “dobj”. This is reflected in the text below the tree, where the token “DVD” says its head is token 20 (“buy”) with the relation “dobj”.
Below is an example of when “dobj” has been attached incorrectly. Notice the phrase “I bought this”; “this” is the thing being bought, so it should be marked as the “dobj” of “bought”.
Instead, it is marked as the subject of “seems” (the thing actually doing the “seeming”). This means the parser has interpreted it as “…this seems more like…”, rather than “…I bought this…”. It’s much more likely that the author was saying that James Garner seems like a bystander (so Garner should be the subject of “seems”).
Things for you to do
Follow the 3 steps below to acquire information about certain relations. Could this information be useful in subsequent tasks (e.g. some form of information extraction)? What problems may arise when trying to use this information? Why? Is there anything that could be done to make it more useful?
Find reviews in the Amazon review corpus which contain the following verbs: “love”, “buy”, and “want”.
PoS tag them with the NLTK PoS tagger, and then dependency parse them.
Write code to extract all of the tokens that appear as direct objects of those verbs (the relevant dependency relation is “dobj”). These should be the things that are being loved/bought/wanted. A sketch covering these three steps is given below.
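One possible sketch of these three steps, reusing the functions from the earlier snippets, is shown below. The exact verb conjugations, the “dvd” category, and the limit of 20 sentences are just illustrative choices.
In [ ]:
from sussex_nltk.parse import dep_parse_sentences_arceager
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from nltk import pos_tag

# Verbs whose direct objects we want to find (the conjugations listed are examples)
verb_variants = set(["love", "loves", "loved",
                     "buy", "buys", "bought",
                     "want", "wants", "wanted"])

# Step 1: find review sentences containing any of the verbs
sentences = []
for sentence in AmazonReviewCorpusReader().category("dvd").sents():
    if verb_variants & set(sentence):
        sentences.append(sentence)
sentences = sentences[:20]  # limit the number of sentences for speed

# Step 2: PoS tag and dependency parse the sentences
tagged_sents = (pos_tag(sentence) for sentence in sentences)
parsed_sents = dep_parse_sentences_arceager(tagged_sents)

# Step 3: extract the tokens that are direct objects of the chosen verbs
for sentence in parsed_sents:
    direct_objects = [token.form
                      for token in sentence.find_all_dependants(verb_variants)
                      if token.deprel == "dobj"]
    if direct_objects:
        print sentence.raw()
        print "Direct objects:", direct_objects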
Parsing tweets¶
This section reminds you how to sample some random tweet sentences and tokenise and PoS tag them with the NLTK tools. It then shows you how to dependency parse the sentences.
This will allow you to investigate the performance of the parser on tweets.
In [ ]:
from sussex_nltk.corpus_readers import TwitterCorpusReader
from sussex_nltk.parse import dep_parse_sentences_arceager
from nltk.tokenize import word_tokenize
from nltk import pos_tag

tcr = TwitterCorpusReader()

# Get some (here 30) un-tokenised sentences from tweets
sents = tcr.sample_raw_sents(30)

# Tokenise and PoS tag the sentences.
# Notice the round brackets instead of square brackets. This is a generator
# expression. It acts quite like a list, but instead of computing all list
# elements and storing all in memory, it only does one at a time.
# Therefore "tagged_sents" is a generator, not a list
tagged_sents = (pos_tag(word_tokenize(sentence)) for sentence in sents)

# Dependency parse the sentences
parsed_sents = dep_parse_sentences_arceager(tagged_sents)

# Now you can inspect the results by printing the sentences as in the
# previous section
for sentence in parsed_sents:
    print "-----"
    print sentence
A short explanation of generator expressions
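If generator expressions are still unfamiliar, the minimal illustration below (plain Python, unrelated to the parser) shows how they differ from list comprehensions.
In [ ]:
# A list comprehension computes every element immediately and stores them all
squares_list = [n * n for n in range(5)]   # [0, 1, 4, 9, 16]

# A generator expression (round brackets) computes elements one at a time,
# only when something asks for them
squares_gen = (n * n for n in range(5))

for value in squares_gen:  # each square is computed as the loop requests it
    print value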
Things for you to do
Assess the performance of the parser on a sample of tweets, and compare this with its performance on a sample of Amazon reviews. In particular, pick some common relations like “dobj” (direct object), “nsubj” (nominal subject), “amod” (adjectival modifier), and see how well they are assigned in the parsed tweets. Information on the different dependency relations can be found [here](http://nlp.stanford.edu/software/dependencies_manual.pdf).
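As a starting point, the sketch below prints each dependent carrying one of the chosen relations together with the ID of its head, so you can judge by eye whether the attachment is sensible. It assumes *parsed_sents* still holds the parsed tweets from the cell above, and the list of relations is only a suggestion; run the same code on parsed Amazon review sentences for comparison.
In [ ]:
# Assumes *parsed_sents* is the list of ParsedSentence objects from the previous cell
relations_of_interest = set(["dobj", "nsubj", "amod"])  # relations to inspect

for sentence in parsed_sents:
    print "------ Sentence ------"
    print sentence.raw()  # original text only; print the full parse if you need more detail
    for token in sentence:
        if token.deprel in relations_of_interest:
            # dependent word, its relation, and the ID of its head token
            print "%s (%s, head ID %s)" % (token.form, token.deprel, token.head)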
Parsing tweets with Twitter-specific PoS tagger¶
This section reminds you how to sample some random tweet sentences and tokenise and PoS tag them with the Twitter-specific tools. It then shows you how to dependency parse the sentences.
This will allow you to investigate why the parser performs so poorly with the tags that should be more accurate on tweets.
In [ ]:
from sussex_nltk.tag import twitter_tag_batch
from sussex_nltk.corpus_readers import TwitterCorpusReader
from sussex_nltk.parse import dep_parse_sentences_arceager
tcr = TwitterCorpusReader()
# Get some (here 30) un-tokenised sentences from tweets
sents = tcr.sample_raw_sents(30)
# PoS tag the sentences (remember the twitter tagger
# also tokenises for you)
tagged_sents = twitter_tag_batch(sents)
# Dependency parse the sentences
parsed_sents = dep_parse_sentences_arceager(tagged_sents)
# Again, you have parsed sentences
for sentence in parsed_sents:
    print "------"
    print sentence
Things for you to do
PoS tag some tweets using the Twitter-specific PoS tagger. Why does the parser perform so poorly despite the fact that the PoS tagging is more accurate?
Further Reading¶
“Incrementality in Deterministic Dependency Parsing”, a paper on transition-based parsing.
Lecture series on data-driven dependency parsing in general.
The Dependency Parsing book (available for download by visiting the link from a Sussex machine) covers grammar-driven and data-driven dependency parsing, transition-based and graph-based approaches, and many other related issues.