
Opinion Extraction Report¶

length: about 2300 words

Introduction¶

In this project, I will implement an opinion extractor that extracts opinions from Amazon DVD reviews about several aspects such as plot, characters, cinematography and dialogue. For example, if the review is “The dialogue is cringeworthy .”, then the opinion about dialogue should be “cringeworthy”.
The opinion extractor makes use of the dependency structure and part-of-speech tags of the review sentences, so in this section I briefly introduce these two techniques and the basic idea of my opinion extractor.

Part of Speech tagging¶

Part-of-speech tagging assigns a class tag with grammatical meaning to every word in a sentence. Common class tags include noun, verb, adjective, preposition, conjunction, interjection, adverb and pronoun. NN is usually used for nouns, VB for verbs, RB for adverbs and JJ for adjectives; for the other tags please refer to http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

Part of Speech Example¶

The dialogue is just cringeworthy .
The           DT    Determiner
dialogue      NN    Noun
is            VBZ   Verb
just          RB    Adverb
cringeworthy  JJ    Adjective

Dependency Parsing¶

Dependency parsing tries to recover a sentence’s syntactic structure by determining the relationships between the words in the sentence. Common relationships include adverbial modifier (advmod), adjectival modifier (amod), conjunct (conj), copula (cop) and nominal subject (nsubj). Please refer to http://nlp.stanford.edu/software/dependencies_manual.pdf for the full set of relationships and their notation.

The dependency structure can be represented as a list of triples (head, dependant, relationship). Each triple can be drawn as a labelled edge from the head to the dependant, with the relationship as the label, so the whole dependency structure can be visualised as a tree like the following one.

Dependency Tree Example¶

We can also represent the dependency structure as a list of lines, ordered by word position in the sentence. The fields of each line are [word-position word part-of-speech-tag head-position relationship].

Dependency Parsing Example¶

The dialogue is just cringeworthy .
1 The DT 2 det
2 dialogue NN 5 nsubj
3 is VBZ 5 cop
4 just RB 5 advmod
5 cringeworthy JJ 0 root
6 . . 5 punct

In memory the dependency structure is represented in much the same way. In this project the class ParsedSentence holds the dependency tree information of a sentence; it can be viewed as a list of BasicToken objects. A BasicToken has the fields id, form, pos, head and deprel, which correspond to the five line fields above.
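To make this concrete, here is a minimal, self-contained sketch of the token and sentence structures described above, using the example sentence from the previous section. The real BasicToken and ParsedSentence classes are provided by sussex_nltk; the toy versions below only assume the fields (id, form, pos, head, deprel) and the get_query_tokens/get_dependants lookups that the extractor relies on later in this report, so they are an illustration rather than the actual library implementation.

from collections import namedtuple

# Toy stand-ins for the sussex_nltk classes, mirroring only the fields and
# lookups used by the opinion extractor in this report.
BasicToken = namedtuple("BasicToken", ["id", "form", "pos", "head", "deprel"])

class ToyParsedSentence(list):
    def get_query_tokens(self, aspect_word):
        # every token whose surface form matches the aspect word
        return [tok for tok in self if tok.form == aspect_word]

    def get_dependants(self, head_token):
        # every token whose head field points at head_token's position
        return [tok for tok in self if tok.head == head_token.id]

# "The dialogue is just cringeworthy ." in the line format shown above
sentence = ToyParsedSentence([
    BasicToken(1, "The",          "DT",  2, "det"),
    BasicToken(2, "dialogue",     "NN",  5, "nsubj"),
    BasicToken(3, "is",           "VBZ", 5, "cop"),
    BasicToken(4, "just",         "RB",  5, "advmod"),
    BasicToken(5, "cringeworthy", "JJ",  0, "root"),
    BasicToken(6, ".",            ".",   5, "punct"),
])

aspect = sentence.get_query_tokens("dialogue")[0]
head = sentence[aspect.head - 1]   # the head of "dialogue" is "cringeworthy"
print(head.form, head.pos)                              # cringeworthy JJ
print([t.form for t in sentence.get_dependants(head)])  # ['dialogue', 'is', 'just', '.']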

Opinion Extraction Basic Idea¶

Suppose we want to find the opinions in the DVD reviews about the dialogue aspect of a film. The reviews that talk about dialogue will most likely contain the word “dialogue”, so we can first find the sentences that contain the aspect word, then run part-of-speech tagging and dependency parsing on them to obtain ParsedSentence instances holding all the POS and dependency information. We can then use this information to extract the opinions. For example, if a word is an adjectival modifier of the aspect word, it can be taken as an opinion about that aspect: the sentence “It has an interesting plot” has this structure, so we extract “interesting” as an opinion about the plot. We can also use the part-of-speech tags to filter the opinions, for example by requiring that the opinion word is an adjective in some situations, and we can use the word text itself, for example by filtering out “main” and “basic” as opinion words. The detailed algorithm is described in the following sections.

Opinion Extractor¶

I have successfully implemented all 5 extensions as required.

Algorithm Summary¶

Extension 1: Adjectival modification
Algorithm: For every dependant of the aspect token, if its deprel field is “amod”, take it as an opinion. I also filter out the adjectives “main”, “special”, “basic” and “other”, because phrases such as “main plot”, “basic plot” and “main characters” do not express real opinions.

Extension 2: Adjectives linked by copulae
Algorithm: If the aspect token’s deprel is “nsubj”, its head’s pos is “JJ” and the head has a “cop” dependant, then the head is taken as an opinion.

Extension 3: Adverbial modifiers
Algorithm: For each opinion token found by Extension 1 or Extension 2, iterate over its dependants; if a dependant’s deprel is “advmod”, prepend it to the existing opinion.

Extension 4: Negation
Algorithm: For Extension 1, if the aspect token has an odd number of “neg” dependants, “not” is prepended to the opinions found by Extension 1. For Extension 2, if the opinion token has a “neg” dependant, “not” is prepended to that opinion (a worked sketch of how Extensions 2–4 combine is given after this summary).

Extension 5: Conjunction
Algorithm: For each opinion token found by Extension 1 or Extension 2, iterate over its dependants; if a dependant’s deprel is “conj”, an opinion is also extracted from that dependant.
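To illustrate how Extensions 2–4 interact, the following small sketch walks through the sentence “The plot was n't excessively dull .” from the example test set, for which the expected opinion is 'not-excessively-dull'. The dependency parse written below is a plausible hand-made Stanford-style analysis used only for illustration; it is not taken from the parsed data, and the code uses plain tuples instead of the real ParsedSentence class.

# Each entry: (id, form, pos, head, deprel) -- an assumed, hand-written parse
tokens = [
    (1, "The",         "DT",  2, "det"),
    (2, "plot",        "NN",  6, "nsubj"),
    (3, "was",         "VBD", 6, "cop"),
    (4, "n't",         "RB",  6, "neg"),
    (5, "excessively", "RB",  6, "advmod"),
    (6, "dull",        "JJ",  0, "root"),
    (7, ".",           ".",   6, "punct"),
]

def dependants(head_id):
    # all tokens whose head field points at head_id
    return [t for t in tokens if t[3] == head_id]

aspect = next(t for t in tokens if t[1] == "plot")

# Extension 2: the aspect is an "nsubj" whose head is an adjective with a "cop" dependant
head = tokens[aspect[3] - 1]
if aspect[4] == "nsubj" and head[2] == "JJ" and any(d[4] == "cop" for d in dependants(head[0])):
    parts = []
    for d in dependants(head[0]):
        if d[4] == "advmod":     # Extension 3: prepend adverbial modifiers
            parts.append(d[1])
        elif d[4] == "neg":      # Extension 4: prepend "not" for negation
            parts.append("not")
    parts.append(head[1])
    print("-".join(parts))       # prints: not-excessively-dull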

Implementation¶
In [2]:
import sys
sys.path.append(r'T:\Departments\Informatics\LanguageEngineering')

from sussex_nltk.parse import load_parsed_dvd_sentences, load_parsed_example_sentences

Sussex NLTK root directory is /Users/liumeng/Documents/task/python-3600/LanguageEngineering/LanguageEngineering

Function: append_negation_and_adverbial
Arguments
can is the opinion token
parsed_sentence is the ParsedSentence instance containing the dependency tree information of our sentence of interest.
Returns
A tuple: (the opinion string built for can, [tokens that have a conjunction relationship with can])
In [3]:
def append_negation_and_adverbial(can, parsed_sentence):
    """
    This function adds adverbial modifiers and negation to the opinion
    token and finds the tokens that have a conjunction relationship with it.
    :param can: the opinion token
    :param parsed_sentence: the parsed sentence
    :return: (opinion string, [tokens that have a conjunction relationship with can])
    """

    # adverbial modifier / negation word list
    advs = []

    # conjunction tokens list
    conjTokens = []

    for token in parsed_sentence.get_dependants(can):

        if token.deprel == "advmod":
            # it is an adverbial modifier
            advs.append(token.form)

        elif token.deprel == "neg":
            # it is a negation word
            advs.append("not")

        elif token.deprel == "conj":
            # it is a conjunction token
            conjTokens.append(token)

    advs.append(can.form)

    op = '-'.join(advs)

    return op, conjTokens

Function: get_all_op
This function computes the opinion strings for the opinion token and its conjunction tokens.
Arguments
can is the opinion token
parsed_sentence is the ParsedSentence instance containing the dependency tree information of our sentence of interest.
Returns
list of opinion strings
In [4]:
def get_all_op(can, parsed_sentence):
    """
    This function computes the opinion strings for the opinion token
    and its conjunction tokens.
    :param can: the opinion token
    :param parsed_sentence: the parsed sentence
    :return: list of opinion strings
    """

    # compute the opinion string of the opinion token and collect its conjunction tokens
    op, conjTokens = append_negation_and_adverbial(can, parsed_sentence)
    ops = [op]

    # for every conjunction token, compute its opinion string
    for tok in conjTokens:
        op, t = append_negation_and_adverbial(tok, parsed_sentence)
        ops.append(op)

    return ops

Function: opinion_extractor
Arguments
aspect_token is the BasicToken instance (for our aspect token) from the ParsedSentence that we’re interested in.
parsed_sentence is the ParsedSentence instance containing the dependency tree information of our sentence of interest.
Returns
A list of the extracted opinions. opinion_extractor should always return a list (even if it’s empty).
In [5]:
def opinion_extractor(aspect_token, parsed_sentence):
    """
    :param aspect_token: the aspect token
    :param parsed_sentence: ParsedSentence instance containing the dependency tree information
    :return: a list of the opinion strings
    """

    # a list of the opinion strings
    opinions = []

    dependants = parsed_sentence.get_dependants(aspect_token)

    # whether the aspect token is negated (for adjectival modification)
    hasNeg = False
    ops = []

    # find the adjectival modifications
    for dependant in dependants:
        if dependant.deprel == "amod" and dependant.form not in ["main", "special", "basic", "other"]:
            # adjectival modification

            # get the opinion strings of the dependant and its conjunction tokens
            ops.extend(get_all_op(dependant, parsed_sentence))

        elif dependant.deprel == "neg":
            # the aspect token has a negation dependant

            # a negation of a negation is positive
            hasNeg = not hasNeg

    # add the adjectival modification opinions
    for op in ops:
        if hasNeg:
            # prepend "not" for the negation
            opinions.append("not-" + op)
        else:
            opinions.append(op)

    # add opinions obtained from adjectives linked by copulae
    if aspect_token.deprel == "nsubj":

        # the possible opinion token, which is the head of the aspect token
        can = parsed_sentence[aspect_token.head - 1]
        if can.pos == "JJ":
            # the head token should be an adjective

            # whether it has a copula dependant
            hasCop = False
            for token in parsed_sentence.get_dependants(can):
                if token.deprel == "cop":
                    # it is a copula token
                    hasCop = True
                    break

            # only add opinions when there is a copula dependant
            if hasCop:
                # get the opinion strings of the token can and its conjunction tokens
                opinions.extend(get_all_op(can, parsed_sentence))

    return opinions

Examples Test Set Validation¶

The following code extracts opinions from the example test set and prints each sentence together with the extracted opinions. From the output we can see that my extractor works well on all 5 extension examples.
In [7]:
def experimentExampleTestSet():

    aspect = 'plot'
    parsed_sentences = load_parsed_example_sentences()

    # the sentence index at which each extension's examples begin
    extensionBeginNumber = {0: 1, 1: 2, 2: 3, 4: 4, 7: 5, 9: 6}

    for i, parsed_sentence in enumerate(parsed_sentences):
        if i in extensionBeginNumber:
            print("---------------------")
            if extensionBeginNumber[i] < 6:
                print("Extension %d Example\n" % extensionBeginNumber[i])
            else:
                print("Additional Extension Example\n")

        opinions = []  # Make a list for holding any opinions we extract in this sentence

        for aspect_token in parsed_sentence.get_query_tokens(aspect):
            # Call your opinion extractor
            opinions.extend(opinion_extractor(aspect_token, parsed_sentence))

        print("Sentence: %s" % parsed_sentence.raw())
        print("Opinions: %s\n" % opinions)

experimentExampleTestSet()

---------------------
Extension 1 Example

Sentence: It has an exciting fresh plot .
Opinions: ['exciting', 'fresh']

---------------------
Extension 2 Example

Sentence: The plot was dull .
Opinions: ['dull']

---------------------
Extension 3 Example

Sentence: It has an excessively dull plot .
Opinions: ['excessively-dull']

Sentence: The plot was excessively dull .
Opinions: ['excessively-dull']

---------------------
Extension 4 Example

Sentence: The plot was n't dull .
Opinions: ['not-dull']

Sentence: It was n't an exciting fresh plot .
Opinions: ['not-exciting', 'not-fresh']

Sentence: The plot was n't excessively dull .
Opinions: ['not-excessively-dull']

---------------------
Extension 5 Example

Sentence: The plot was cheesy , but fun and inspiring .
Opinions: ['cheesy', 'fun', 'inspiring']

Sentence: The plot was really cheesy and not particularly special .
Opinions: ['really-cheesy', 'not-particularly-special']

---------------------
Additional Extension Example

Sentence: The script and plot are utterly excellent .
Opinions: []

Sentence: The script and plot were unoriginal and boring .
Opinions: []

Sentence: The plot was n't lacking .
Opinions: []

Sentence: The plot is full of holes .
Opinions: ['full']

Sentence: There was no logical plot to this story .
Opinions: ['logical']

Sentence: I loved the plot .
Opinions: []

Sentence: I did n't mind the plot .
Opinions: []

Performance On Amazon DVD Review Sentences¶

The following code extracts opinions from the DVD review sentences for the four aspects plot, characters, cinematography and dialogue, and saves the results to the files plot_full.txt, characters_full.txt, cinematography_full.txt and dialogue_full.txt respectively.
In [10]:
def experiment(parsed_sentences, save_file_path, aspect):
    """
    Extract opinions about aspect from parsed_sentences and save the results to save_file_path.
    :param parsed_sentences: the parsed sentences
    :param save_file_path: output file path
    :param aspect: aspect word
    :return:
    """

    total_sentences = 0
    sentences_with_discovered_features = 0

    with open(save_file_path, "w") as save_file:  # The 'w' says that we want to write to the file

        # Iterate over all the parsed sentences
        for parsed_sentence in parsed_sentences:

            total_sentences += 1  # We've seen another sentence

            opinions = []  # Make a list for holding any opinions we extract in this sentence

            # Iterate over each of the aspect tokens in the sentence (in case there is more than one)
            for aspect_token in parsed_sentence.get_query_tokens(aspect):

                # Call your opinion extractor
                opinions += opinion_extractor(aspect_token, parsed_sentence)

            # If we found any opinions, write to the output file what we know.
            if opinions:
                # Currently, the sentence will only be printed if opinions were found.
                # But if you want to know what you're missing, you could move
                # the sentence printing outside the if-statement.

                # Print a separator and the raw unparsed sentence
                save_file.write("--- Sentence: %s ---\n" % parsed_sentence.raw())  # "\n" starts a new line
                save_file.write("%s\n" % parsed_sentence)
                save_file.write("Opinions: %s\n\n" % opinions)

                sentences_with_discovered_features += 1  # We've found features in another sentence

    print("%s features out of %s sentences" % (sentences_with_discovered_features, total_sentences))


def experimentDVD():
    """
    Store the opinion results for all aspects.
    :return:
    """
    aspects = ["plot", "characters", "cinematography", "dialogue"]
    for aspect in aspects:
        parsed_sentences = load_parsed_dvd_sentences(aspect)
        experiment(parsed_sentences, aspect + "_full.txt", aspect)

experimentDVD()

725 features out of 3119 sentences
1337 features out of 4584 sentences
271 features out of 677 sentences
356 features out of 1176 sentences

Performance Analysis¶

From the 4 output files above, I randomly chose 10 sentences from each file to inspect. Of these 40 sentences, I found 4 incorrect extractions. Case 1 is due to both POS tagging and dependency parsing errors, case 2 is due to a deficiency of the extractor, case 3 is due to both a parsing error and a deficiency of the extractor, and case 4 is due to a parsing error.

The concrete analysis of the 4 cases is as follows.

Case 1¶

Sentence: Both films have the same borring plot .
Opinions: ['same']
Expected: ['same-borring']

Case 1 Dependency Tree¶

Reason: In this case the POS tag of “borring” is incorrect: it should be “JJ” instead of “NN”, and “same” should be “RB” instead of “JJ”. The wrong tags may have led to the wrong parse: the relationship between “borring” and “plot” should be “amod” instead of “nn”, and the relationship of “same” should be “advmod” with “borring” as its head.

Also note that “borring” is a misspelling of “boring”, which may be one reason for the wrong POS tag. This suggests it would be better to add spell-checking capability to the opinion extractor. My opinion extractor should also filter out words like “same”.

Case 2¶

Sentence: It truly never gets old with us as the characters are authentic and full of good humor .
Opinions: ['authentic', 'full']
Expected: ['authentic', 'full-of-good-humor']

Case 2 Dependency Tree¶

Reason: A defect of my algorithm. My extractor cannot handle the situation where a phrase, rather than a simple adjective, follows “be”, “is” or “are”.

Case 3¶

Sentence: The cinematography , score , storyline , and characters are well-thought out and well chosen .
Opinions: ['out-well-thought']
Expected: ['well-thought', 'well-chosen']

Case 3 Dependency Tree¶

Reason: The dependency parse is incorrect: the relationship from “out” to “well-thought” is not “advmod”, and the parse of “and well chosen” is also incorrect. Also note that my extractor cannot handle structures like “The characters are well chosen”.

Case 4¶

Sentence: The soundtrack is awesome , the cinematography is great , the dialogue is realistic , and for once , a movie set in a foreign land that actually uses actors from that area speaking the language of that area !
Opinions: ['for-realistic']
Expected: ['realistic']

Case 4 Dependency Tree¶

Reason: The parse is incorrect: the relationship from “for” to “realistic” is not “advmod”.

Proposal for DVD summary website¶

Motivation¶

E-commerce is increasingly popular, and one of the appealing aspects of shopping online is that customers can assess a product’s quality from other people’s reviews and thus make an informed decision. But reading other people’s reviews is tedious and slow. For a DVD that has hundreds or thousands of reviews, we can only read twenty or at most thirty of them, and the small portion a customer reads cannot really give him or her the overall opinion of the other customers. The reviews are often contradictory: one review may say the plot is interesting while another says the same DVD’s plot is boring. We could therefore provide a website that summarises the review opinions on different aspects of each DVD and also gives the number of reviews supporting each opinion. For example, we can show how many reviews say the plot is interesting and how many say it is boring. I believe such a website would let customers learn about a DVD more quickly and accurately, and film companies would also surely appreciate a website that automatically summarises customers’ opinions of their products.

Proposal¶

In the backend, we should have an opinion extractor that extracts opinions as the review stream comes in. The challenge is to make the extraction algorithm stable and efficient: it must run continuously and process new review data in time, so it is better to have a parallel, distributed opinion extraction algorithm. We can use a distributed framework such as MapReduce to improve the algorithm’s ability to process massive amounts of data and its processing speed. Fortunately, this project’s extraction algorithm is easy to adapt to the map-reduce framework, because every review can be processed independently: the map step extracts the opinions of every review, and the reduce step sums the number of reviews for each kind of opinion. Another challenge is to internationalise the website to support different languages. As we all know, different languages have different syntax and structure, and it is not easy to extend our part-of-speech tagging, dependency parsing and opinion extraction algorithms to other languages. But if we make our code more modular and our algorithm more flexible, we are likely to be able to support at least the 5 main languages.

Assessment¶

A DVD has many more aspects than the 4 mentioned in this project, so it would be much better if the website could also automatically find the aspects that customers care about. We can use machine learning techniques to enhance the opinion extractor’s ability. For reviews that explicitly express an opinion, the accuracy will likely be above 90%; for reviews that are less straightforward it is difficult to extract the opinion, so I think the overall accuracy will likely be about 85%. I think customers will like this website, especially film fans: it can give them a quick and relatively accurate summary of the film they are interested in, and it will be more neutral and objective than one written by a human.

Reference¶

1. http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
2. http://nlp.stanford.edu/software/dependencies_manual.pdf