Sessions 8 and 9: Opinion Extraction¶
Things for you to do
The first thing you need to do is run the following cell. This will give you access to the Sussex NLTK package.
In [ ]:
import sys
sys.path.append(r'T:\Departments\Informatics\LanguageEngineering')
In labs 8 and 9 you will be looking at ways to extract opinion-bearing words from Amazon DVD reviews. The goal is to find words that describe particular aspects of the film being reviewed. The specific aspects of films that we will be considering are: the plot, the characters, the cinematography and the dialogue. We are, in other words, interested in finding all of those words in a review that express the reviewer's opinion about one of these aspects of the film. The idea is that this will provide a fine-grained characterisation of the opinion being expressed by the author of the review. We will refer to the words we are looking for as opinion words, and to the words used for particular aspects of the review as aspect words.
Following on from last week’s session on dependency parsing, you will use the output of a dependency parser as the basis for identifying opinion words. This is based on the assumption that the opinion words we are looking for are words that occur in a sentence in the review in a particular (dependency) relationship to one of our aspect words (plot, characters, cinematography and dialogue).
For example, the opinion word “amazing” might be found because it is used in a sentence where it is an adjective modifying the aspect word “plot”, as in the sentence “I thought it had an amazing plot.”.
Acquiring parsed sentences¶
As you may have noticed from the previous session, loading a dependency parser into memory is quite a slow task on the lab machines over the network. So we have pre-parsed a collection of relevant DVD sentences for you. The code snippet below shows you how to get access to the pre-parsed sentences.
In [ ]:
from sussex_nltk.parse import load_parsed_dvd_sentences
aspect = "dialogue"  # Our aspect word
parsed_sentences = load_parsed_dvd_sentences(aspect)

# To inspect the sentences, you could print them straight out
for parsed_sentence in parsed_sentences:
    print "--- Sentence ---"
    print parsed_sentence

# parsed_sentences is a list of ParsedSentence objects, where each sentence
# contains the word "dialogue" and was found in a DVD review.
Note that it is not possible to give an arbitrary aspect word as input to the load_parsed_dvd_sentences function that we have provided. Since it takes considerable computational resource to produce these data sets, we have pre-assembled a limited amount of suitable data for you to use for your experimentation.
Below is a full list of the aspect words that you can pass to the load_parsed_dvd_sentences function. The first four are the ones that you should definitely explore during these lab sessions (the others are for those interested in further exploration).
plot
characters
cinematography
dialogue
effects
acting
choreography
Things for you to do
For each of the aspect words “plot”, “characters”, “cinematography” and “dialogue”, use the function load_parsed_dvd_sentences to retrieve the parses for that aspect and find out how many parsed sentences are retrieved for that aspect.
Extracting content from ParsedSentence¶
When you call load_parsed_dvd_sentences(“plot”), you know that each sentence in the list that is returned contains at least one occurrence of the word “plot”. However, you do not know the position(s) in the sentence where “plot” occurs. For that you need a reference to the BasicToken objects for the occurrences of “plot”. A BasicToken object provides access to the token’s head and dependants.
The ParsedSentence object has a function called get_query_tokens which returns, as a list of BasicToken objects, all occurrences in that sentence of the string given as its argument. The example below shows you how to get all the aspect tokens in a sentence and print them out.
In [ ]:
aspect = "dialogue"

# If you have a ParsedSentence object, you can get all the tokens whose form matches the aspect as shown below.
# So instead of just printing the parsed_sentence as in the previous section, get its aspect tokens and print them.
aspect_tokens = parsed_sentence.get_query_tokens(aspect)

# You could iterate over them and print them for inspection
for aspect_token in aspect_tokens:
    print aspect_token

# Remember that each token in a ParsedSentence object is a BasicToken object
The ParsedSentence object has a function for getting the dependants of a token. The code below shows how to use it, and how to print the result.
In [ ]:
# Given a ParsedSentence object, and an aspect token acquired from it (as in the previous section)
# Get all of the dependants of that aspect token.
dependants = parsed_sentence.get_dependants(aspect_token)
# You could print them out for inspection
for dependant in dependants:
    print dependant
The ParsedSentence object has a function for getting the head of a token. The code below shows how to use it, and how to print the result.
In [ ]:
# Given a ParsedSentence object, and an aspect token acquired from it (as in the previous section)
# Get the head of the aspect token
head_token = parsed_sentence.get_head(aspect_token)
print head_token
Things for you to do
Write a function that takes an aspect word ("plot", "characters", "cinematography", or "dialogue") and returns a list of all of the dependants of that aspect in the parsed DVD sentences.
Write a function that takes an aspect word ("plot", "characters", "cinematography", or "dialogue") and returns a list of all of the heads of that aspect in the parsed DVD sentences.
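One possible shape for these functions is sketched below. Since sussex_nltk is only available on the lab machines, this sketch takes the list of parsed sentences as an argument and demonstrates the idea with a minimal, hypothetical stand-in for ParsedSentence; in the lab you would pass load_parsed_dvd_sentences(aspect) instead.

```python
class FakeSentence(object):
    """Minimal hypothetical stand-in for ParsedSentence (illustration only)."""
    def __init__(self, tokens, deps, heads):
        self._tokens = tokens  # list of token forms
        self._deps = deps      # form -> list of dependant forms
        self._heads = heads    # form -> head form

    def get_query_tokens(self, form):
        return [t for t in self._tokens if t == form]

    def get_dependants(self, token):
        return self._deps.get(token, [])

    def get_head(self, token):
        return self._heads.get(token)

def all_dependants(parsed_sentences, aspect):
    """Collect the dependants of every occurrence of the aspect word."""
    dependants = []
    for parsed_sentence in parsed_sentences:
        for aspect_token in parsed_sentence.get_query_tokens(aspect):
            dependants += parsed_sentence.get_dependants(aspect_token)
    return dependants

def all_heads(parsed_sentences, aspect):
    """Collect the head of every occurrence of the aspect word."""
    heads = []
    for parsed_sentence in parsed_sentences:
        for aspect_token in parsed_sentence.get_query_tokens(aspect):
            heads.append(parsed_sentence.get_head(aspect_token))
    return heads

# "I thought it had an amazing plot."
s = FakeSentence(["plot"], {"plot": ["an", "amazing"]}, {"plot": "had"})
print(all_dependants([s], "plot"))  # ['an', 'amazing']
print(all_heads([s], "plot"))       # ['had']
```

With the real library, the two loops are unchanged; only the source of the sentences differs.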
Opinion extractor¶
Over the next two weeks you will be creating your own opinion extractor. In the code snippet below you will find a simple one to get you started.
Function: opinion_extractor
Arguments
aspect_token is the BasicToken instance (for our aspect token) from the ParsedSentence that we’re interested in.
parsed_sentence is the ParsedSentence instance containing the dependency tree information of our sentence of interest.
Returns
A list of the extracted opinions. opinion_extractor should always return a list (even if it’s empty).
In [ ]:
def opinion_extractor(aspect_token, parsed_sentence):
    # Your function will have 3 steps:
    # i. Initialise a list of opinions
    opinions = []
    # ii. Find opinions (as an example we get all the dependants of the aspect token that have the relation "det")
    opinions += [dependant.form for dependant in parsed_sentence.get_dependants(aspect_token) if dependant.deprel == "det"]
    # You can continue to add to "opinions". Remember you can get the head of a token, and filter by PoS tag or deprel too!
    # iii. Return the (possibly empty) list of opinions
    return opinions
In the sections below, we describe a variety of ways in which you are asked to refine the simple opinion_extractor function shown above. When you are investigating how well the opinion_extractor you have built is working, you will want to view its output for a substantial number of sentences, so it would be wise to print your output to a file. You are shown how to do this below.
Note that you will need to replace “/path/to/file.txt” with a suitable path.
In [ ]:
from sussex_nltk.parse import load_parsed_dvd_sentences, load_parsed_example_sentences
aspect = "plot"  # Set this to the aspect token you're interested in
save_file_path = r"/path/to/savefile.txt"  # Set this to the location of the file you wish to create/overwrite with the saved output

# Tracking these numbers will allow us to see what proportion of sentences we discovered features in
sentences_with_discovered_features = 0  # Number of sentences we discovered features in
total_sentences = 0  # Total number of sentences

# This is a "with statement"; it invokes a context manager, which handles the opening and closing of resources (like files)
with open(save_file_path, "w") as save_file:  # The 'w' says that we want to write to the file
    # Iterate over all the parsed sentences
    for parsed_sentence in load_parsed_dvd_sentences(aspect):
        total_sentences += 1  # We've seen another sentence
        opinions = []  # Make a list for holding any opinions we extract in this sentence
        # Iterate over each of the aspect tokens in the sentence (in case there is more than one)
        for aspect_token in parsed_sentence.get_query_tokens(aspect):
            # Call your opinion extractor
            opinions += opinion_extractor(aspect_token, parsed_sentence)
        # If we found any opinions, write to the output file what we know.
        # Currently, the sentence will only be printed if opinions were found. But if you want to know
        # what you're missing, you could move the sentence printing outside the if-statement.
        if opinions:
            # Print a separator and the raw unparsed sentence
            save_file.write("--- Sentence: %s ---\n" % parsed_sentence.raw())  # "\n" starts a new line
            # Print the parsed sentence
            save_file.write("%s\n" % parsed_sentence)
            # Print the opinions extracted
            save_file.write("Opinions: %s\n" % opinions)
            sentences_with_discovered_features += 1  # We've found features in another sentence

print "%s sentences out of %s contained features" % (sentences_with_discovered_features, total_sentences)
Things for you to do
Using the basic opinion extractor we provide and the above code for storing opinion words in files, find the opinion words for each of the aspect tokens under consideration.
Adapting the opinion extractor¶
In the sections below we will be asking you to adapt the opinion extractor in various ways. First, however, we give you examples of Python code that will help you devise your adapted opinion extractor. You should refer to these examples when you are working on your own opinion extractors.
The BasicToken object has a number of useful properties, including the following:
form: the actual form of the token, e.g. "plot"
pos: the part-of-speech of the token, e.g. "JJ" for adjective
deprel: the dependency relation that the BasicToken has with its head, e.g. "det" for determiner
As the code snippet shown below illustrates, we can do different things depending on these properties.
In [ ]:
# Say for example we acquire a list of BasicToken objects by getting all the dependants of a token:
dependants = parsed_sentence.get_dependants(aspect_token)

# We could filter that list, keeping only those tokens whose dependency relation with the aspect token is "dobj":
dependants = [token for token in dependants if token.deprel == "dobj"]

# Or we could filter that list, keeping only those tokens whose PoS tag is "RB" (adverb):
dependants = [token for token in dependants if token.pos == "RB"]

# Or we could filter that list, keeping only those tokens whose form is NOT "main" or "special":
dependants = [token for token in dependants if token.form != "main" and token.form != "special"]

# Or, given a single token, we could choose to add it to a list or not based on its properties
opinions = []
if token.pos.startswith("JJ"):  # If the token is an adjective, then append its form to our list of opinions
    opinions.append(token.form)

# Or we could search the tokens for a property we wish to know is present
found_det = False
for dependant in dependants:
    if dependant.deprel == "det":
        found_det = True
# Now subsequent code can use "found_det" to perform different tasks depending on
# whether or not there was a determiner relation in the dependants.
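These filters can be exercised end-to-end on a toy token list. The Token namedtuple below is just a hypothetical stand-in for BasicToken (it mimics the form, pos and deprel properties described above), so the snippet runs without the Sussex NLTK library:

```python
from collections import namedtuple

# Hypothetical stand-in for BasicToken (the real class lives in sussex_nltk)
Token = namedtuple("Token", ["form", "pos", "deprel"])

# A toy list of tokens to filter (not a real parse)
dependants = [Token("an", "DT", "det"),
              Token("excessively", "RB", "advmod"),
              Token("dull", "JJ", "amod"),
              Token("main", "JJ", "amod")]

# Keep only the adverbs
adverbs = [t.form for t in dependants if t.pos == "RB"]

# Keep only the "amod" dependants whose form is not "main"
amods = [t.form for t in dependants if t.deprel == "amod" and t.form != "main"]

print(adverbs)  # ['excessively']
print(amods)    # ['dull']
```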
Extending the opinion extractor’s functionality¶
For the assessed coursework you are asked to develop and assess several extensions to the opinion extractor given above. For full details of what is required for the coursework see the coursework specification document.
All of the extensions below can be completed by adapting the examples shown in the previous sections. Look out for situations where you need to find the dependants or heads of tokens, or when you need to check the PoS or dependency type of a token.
You will benefit greatly from reading the section Tips for de-bugging and exploration (see below). Note that this section describes how to use a tool for visualising dependency trees.
Examples Test Set¶
In order to check that the opinion extractor you are developing is correctly defined, we have provided easy access to all of the example sentences used in this document. This will be referred to as the examples test set. All of the sentences in the test data have been parsed.
In order to access the examples test set you should do the following:
replace load_parsed_dvd_sentences(aspect) with load_parsed_example_sentences(), and
ensure that you’re importing load_parsed_example_sentences from sussex_nltk.parse
In [0]:
from sussex_nltk.parse import load_parsed_example_sentences
parsed_example_sentences = load_parsed_example_sentences()
# To inspect the sentences, you could print them straight out
for parsed_sentence in parsed_example_sentences:
    print "--- Sentence ---"
    print parsed_sentence
Note that in your coursework you should discuss the effectiveness of your opinion extractor on the full set of parsed DVD reviews.
Extension 1: Adjectival modification¶
In this section, we are interested in adjectival modification. This is when we have a noun like “dog” or “plot”, and there are one or more adjectives which are specifying the characteristics of that noun. E.g. “big brown dog” or “exciting fresh plot” (“big” and “brown” are both adjectivally modifying “dog”).
The dependency relation we use to show this relationship is “amod”.
Write a version of the opinion extraction function which, when given sentences such as the example below containing an aspect token (e.g. "plot"), uses the "amod" relations to extract a list of the adjectival modifiers of the aspect token (e.g. the two words "exciting" and "fresh" in this case).
“It has an exciting fresh plot.” produces “fresh”, “exciting”
NOTE
You may notice that certain aspect tokens are often described by non-opinion words. For example the phrase “main plot” is often used; “main” adjectivally modifies “plot”, so your opinion extractor will find “main” as an opinion. In the code snippets above we show you how to filter tokens based on their form; you could filter out specific words like “main”.
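Putting the "amod" filter and the stop-word idea together, the core logic might be sketched as follows. Token and NON_OPINION_WORDS are illustrative stand-ins (not part of Sussex NLTK), and the toy token list simulates what get_dependants would return:

```python
from collections import namedtuple

Token = namedtuple("Token", ["form", "pos", "deprel"])  # hypothetical stand-in for BasicToken

NON_OPINION_WORDS = set(["main", "special"])  # illustrative stop list; extend it as you find more

def amod_opinions(dependants):
    """Collect adjectival modifiers of the aspect token, skipping non-opinion words."""
    return [t.form for t in dependants
            if t.deprel == "amod" and t.form not in NON_OPINION_WORDS]

# Simulated dependants of "plot" in "It has an exciting fresh main plot."
dependants = [Token("an", "DT", "det"),
              Token("exciting", "JJ", "amod"),
              Token("fresh", "JJ", "amod"),
              Token("main", "JJ", "amod")]
print(amod_opinions(dependants))  # ['exciting', 'fresh']
```

In your real extractor, the list comprehension would run over parsed_sentence.get_dependants(aspect_token).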
Things for you to do
Extend the opinion extractor as described above and apply it to the examples test set in order to check that your function is working as required.
Investigate the extent to which your opinion extractor produces appropriate opinion bearing words by applying it to the full set of parsed DVD reviews. Consider all four aspects: “plot”, “characters”, “cinematography”, and “dialogue”.
Extension 2: Adjectives linked by copulae¶
In this section, we are interested in adjectives (PoS tag "JJ") which are linked to our aspect term via a copula (conjugations of "to be": "is", "was", "will be", etc.). Notice that if we were only looking for "amod" relations, we would completely miss the word "dull" in the example below.
Notice that when linked via a copula to an adjective, the noun is always in an “nsubj” relation with the adjective itself.
Your opinion extraction function, when given a sentence like the example below containing the aspect token "plot", should use appropriate dependency relations to output the opinion word "dull".
“The plot was dull.” produces “dull”
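A sketch of this logic, again using a hypothetical Token stand-in for BasicToken: if the aspect token's own deprel is "nsubj" and its head is an adjective, the head's form is the opinion. (Check the actual trees with the visualisation tool, since the exact attachment can vary.)

```python
from collections import namedtuple

Token = namedtuple("Token", ["form", "pos", "deprel"])  # hypothetical stand-in for BasicToken

def copula_opinions(aspect_token, head_token):
    """If the aspect noun is the nsubj of an adjective, that adjective is the opinion."""
    if (aspect_token.deprel == "nsubj"
            and head_token is not None
            and head_token.pos.startswith("JJ")):
        return [head_token.form]
    return []

# "The plot was dull." -- "plot" stands in an nsubj relation with "dull"
aspect_token = Token("plot", "NN", "nsubj")
head_token = Token("dull", "JJ", "root")
print(copula_opinions(aspect_token, head_token))  # ['dull']
```

In your real extractor, head_token would come from parsed_sentence.get_head(aspect_token).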
Things for you to do
Extend the opinion extractor as described above and apply it to the examples test set in order to check that your function is working as required.
Investigate the extent to which your opinion extractor produces appropriate opinion bearing words by applying it to the full set of parsed DVD reviews. Consider all four aspects: “plot”, “characters”, “cinematography”, and “dialogue”.
Extension 3: Adverbial modifiers¶
If you used the extractor you have built so far on the example sentences below, it would only find the opinion "dull". It would not recover any indication of the strength of the opinion. Adverbs like "excessively" elaborate on the adjectives that they modify via adverbial modification relations.
The relevant dependency relation we use to show this relationship is “advmod”.
Your opinion extraction function, when given a sentence like those below containing the aspect token "plot", should use the "advmod" relation to output features like "excessively-dull" (if you have an adjective token in a variable adj_token and an adverb in a variable adv_token, then you could create this feature with: adv_token.form + "-" + adj_token.form).
“It has an excessively dull plot.” produces “excessively-dull”
“The plot was excessively dull.” produces “excessively-dull”
NOTE
If you have a list of strings, you can use Python's join method to concatenate them into a single string. The following joins the strings together, placing a "-" between each:
joined_string = "-".join(listofstrings)
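The feature-building step might be sketched like this, with a hypothetical Token stand-in: collect the "advmod" dependants of the adjective and join them onto its form.

```python
from collections import namedtuple

Token = namedtuple("Token", ["form", "pos", "deprel"])  # hypothetical stand-in for BasicToken

def with_advmods(adj_token, adj_dependants):
    """Join an adjective with any adverbial modifiers it has, e.g. 'excessively-dull'."""
    adverbs = [t.form for t in adj_dependants if t.deprel == "advmod"]
    return "-".join(adverbs + [adj_token.form])

# "The plot was excessively dull." -- "excessively" is an advmod dependant of "dull"
adj_token = Token("dull", "JJ", "root")
adj_dependants = [Token("excessively", "RB", "advmod")]
print(with_advmods(adj_token, adj_dependants))  # prints: excessively-dull
```

In your real extractor, adj_dependants would come from parsed_sentence.get_dependants(adj_token).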
Things for you to do
Extend the opinion extractor as described above and apply it to the examples test set in order to check that your function is working as required.
Investigate the extent to which your opinion extractor produces appropriate opinion bearing words by applying it to the full set of parsed DVD reviews. Consider all four aspects: “plot”, “characters”, “cinematography”, and “dialogue”.
Extension 4: Negation¶
Consider an adjective linked by a copula, as in the examples below. Your existing opinion extractor would extract "dull". However, notice that the example is saying that the plot was not dull! This is an example of the use of negation.
The dependency relation we use to show this relationship is “neg”.
Your opinion extraction function, when given sentences like those below containing the aspect token "plot", should use the "neg" relation to output features like "not-dull". If you have an adjective token called token, then you could create this feature with: "not-" + token.form.
“The plot wasn’t dull.” produces “not-dull”
“It wasn’t an exciting fresh plot.” produces “not-exciting”, “not-fresh”
“The plot wasn’t excessively dull.” produces “not-excessively-dull”
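One way to sketch the negation step, using the same hypothetical Token stand-in: check the dependants of the adjective for a "neg" relation, and if one is present, prefix every opinion already extracted for it. (Inspect real parses with the visualisation tool; depending on the sentence, the "neg" relation may attach to the verb rather than the adjective.)

```python
from collections import namedtuple

Token = namedtuple("Token", ["form", "pos", "deprel"])  # hypothetical stand-in for BasicToken

def apply_negation(opinions, adj_dependants):
    """Prefix every extracted opinion with 'not-' if a 'neg' dependant is present."""
    if any(t.deprel == "neg" for t in adj_dependants):
        return ["not-" + opinion for opinion in opinions]
    return opinions

# "The plot wasn't excessively dull." -- "n't" appears in a neg relation
adj_dependants = [Token("n't", "RB", "neg"),
                  Token("excessively", "RB", "advmod")]
print(apply_negation(["excessively-dull"], adj_dependants))  # ['not-excessively-dull']
```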
Things for you to do
Extend the opinion extractor as described above and apply it to the examples test set in order to check that your function is working as required.
Investigate the extent to which your opinion extractor produces appropriate opinion bearing words by applying it to the full set of parsed DVD reviews. Consider all four aspects: “plot”, “characters”, “cinematography”, and “dialogue”.
Extension 5: Conjunction¶
If you used your existing extractor on the first example below, it would only extract "cheesy". However, "fun" and "inspiring" are both conjoined with "cheesy"; this means that they all apply to the subject ("plot").
This conjunction relation is shown via the “conj” dependency. Note that words other than adjectives can be the conjuncts. You could investigate whether this is a problem.
Your opinion extraction function, when given sentences like these containing the aspect token "plot", should use the "conj" relation to extract all of the relevant features: "cheesy", "fun", "inspiring".
“The plot was cheesy, but fun and inspiring.” produces “cheesy”, “fun”, “inspiring”
“The plot was really cheesy and not particularly special.” produces “really-cheesy”, “not-particularly-special”
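The conjunction step might be sketched as follows, with the usual hypothetical Token stand-in: start from one adjective you have already extracted, and add any adjectives that stand in a "conj" relation with it (the PoS check addresses the note above about non-adjective conjuncts).

```python
from collections import namedtuple

Token = namedtuple("Token", ["form", "pos", "deprel"])  # hypothetical stand-in for BasicToken

def with_conjuncts(adj_token, adj_dependants):
    """Start from one extracted adjective and add any adjectives conjoined with it."""
    conjuncts = [t.form for t in adj_dependants
                 if t.deprel == "conj" and t.pos.startswith("JJ")]
    return [adj_token.form] + conjuncts

# "The plot was cheesy, but fun and inspiring."
# Here "fun" and "inspiring" are conj dependants of "cheesy"
adj_token = Token("cheesy", "JJ", "root")
adj_dependants = [Token("fun", "JJ", "conj"),
                  Token("inspiring", "JJ", "conj")]
print(with_conjuncts(adj_token, adj_dependants))  # ['cheesy', 'fun', 'inspiring']
```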
Things for you to do
Extend the opinion extractor as described above and apply it to the examples test set in order to check that your function is working as required.
Investigate the extent to which your opinion extractor produces appropriate opinion bearing words by applying it to the full set of parsed DVD reviews. Consider all four aspects: “plot”, “characters”, “cinematography”, and “dialogue”.
Additional extensions¶
This section presents some examples on which your current opinion extractor will fail. In all of the examples below, “plot” is the aspect token.
“The script and plot are utterly excellent.” produces “utterly-excellent”
“The script and plot were unoriginal and boring.” produces “unoriginal”, “boring”
“The plot wasn’t lacking.” produces “not-lacking”
“The plot is full of holes.” produces “full-of-holes”
“There was no logical plot to this story.” produces “no-logical”
“I loved the plot.” produces “loved”
“I didn’t mind the plot.” produces “not-mind”
Things for you to do
Extend your extractor so that its output matches the expected output. Ensure that you make use of the dependencies relating the aspect token to the rest of the sentence. For example, do not just retrieve all of the adjectives in the sentence since this does not generalise well to more complex sentences.
Tips for de-bugging and exploration¶
Common sense¶
You will often need to assess whether your opinion extractor has analysed a given sentence correctly. Before you look at what the dependency parser says, read the sentence carefully and determine for yourself the scope of the opinion words. Consider the following sentence.
“This film has excellent characters and an intriguing and engaging plot.”
It should be obvious to you that here the plot is described as both "intriguing" and "engaging", whereas "excellent" is only used to describe the characters.
If the parser suggests a structure which implies that the plot is also described by "excellent" (for example), something has gone wrong.
Dependency tree visualisation tool¶
On the teaching drive under Departments/Informatics/LanguageEngineering/, there is a file called “RunParserInLab4.bat”. Double-click this file in Lab4 and it will run an interactive dependency parser.
It performs two tasks, only one of which is relevant to you: in the pane labelled “Plain” you can copy-paste any ParsedSentence print-out (the token-per-line format), then press SHIFT+ENTER, and the dependency tree will be visualised.
This may help you to understand the trees.
You should probably avoid the text field at the bottom of the application. It uses a slightly different parser than the one in Sussex NLTK, so their answers will sometimes differ.
If you would like to use the tool at home, you should instead use a copy of the InteractiveParser.jar file from the same directory. Ensure that Java 7 is the default Java on your machine's command line. Then at the command prompt type:
java -Xmx2g -jar /path/to/InteractiveParser.jar
For reference, the following are links to the documents describing the dependency relations and parts-of-speech tags we are using.
Printing only documents relevant to the current task¶
You will find that your output is dominated by examples of adjectival modification and adjectives linked via the copula. This means that when you add a new function (extensions 3-5) it will be difficult to determine the impact of that new functionality.
One way to solve this problem is to (temporarily) output only those features produced by the new functionality.
For example, imagine you have just completed extensions 1 and 2. Next, you write code that adds the adverbial features (extension 3). When assessing how well your code is working, let your extractor only extract the “new” adverb features.
There are two easy ways to achieve this:
Comment out any extractor code that produces features that you are not currently interested in. Or
Introduce a boolean variable which you set to True only when you have extracted the feature that you are interested in. Then output an empty list if the variable is False, and the full opinion list otherwise.
Running the parser on your own example sentences (Beware, this can take ~40secs!)¶
If you want to test an idea out, you can have the parser attempt to parse your own example sentences. Take a look at the code below.
In [ ]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sussex_nltk.parse import dep_parse_sentences_arceager

sentences = ["This is the first example sentence",
             "This is the second example sentence",
             "This is the third example sentence"]

parsed_sents = dep_parse_sentences_arceager(pos_tag(word_tokenize(sentence)) for sentence in sentences)

for parsed_sentence in parsed_sents:
    print "--- Sentence ---"
    print parsed_sentence