
Chapter II: Exploring the corpus¶
Reading the corpus¶
Now that we can create a corpus of word-tokenized sentences, let’s look into ways to extract information from it. The two code cells below load in the Yelp corpus consisting of 15,000 reviews. The second cell will give you error messages, however, as there are several Python modules missing. Complete the code by importing the modules you need to run it. The error messages should guide you. You’ll need to write three lines in total! Once it works, it’ll take about 20 seconds to load.

In [ ]:

def read_word_tokenized_corpus(lines_input):
    all_word_tokenized_reviews = []
    for line in lines_input:
        json_fields = json.loads(line)
        review_string = json_fields['text']
        sentences = nltk.sent_tokenize(review_string)
        word_tokenized_review = [nltk.word_tokenize(sentence) for sentence in sentences]
        all_word_tokenized_reviews.append(word_tokenized_review)
    return all_word_tokenized_reviews

In [ ]:

# YOUR CODE HERE
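
For reference, a minimal completion could look like the sketch below. It assumes the reviews are stored one JSON object per line in a text file; the file name used here is only a placeholder for whatever path your assignment provides.

import json
import nltk

corpus = read_word_tokenized_corpus(open('yelp_reviews.json', encoding='utf-8'))  # placeholder file name

Depending on your setup, you may also need to run nltk.download('punkt') once so that the sentence and word tokenizers are available.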

Extracting sentences containing a word¶
One thing we frequently want to do with text corpora is extract or print all sentences that match some pattern (e.g., that contain a word, or have a determiner followed by three adjectives, or are at most 5 words long). One way to do so is to iterate over your data, and print a sentence if it matches the pattern you're interested in. The code below prints all sentences that contain the word 'kimchi' (but, notably, not 'KimChi', or 'kim' and 'chi' as two words). Run it to see!

In [ ]:

for review in corpus:
    for sentence in review:
        if 'kimchi' in sentence:
            print(' '.join(sentence))

Sentence extraction functions¶
It would be convenient if we could print the sentences containing whichever word we want from whatever corpus we have. Changing the code itself every time will (1) make it more likely for errors to creep in, and (2) create a wall of code. This can be avoided by encapsulating the code to print sentences containing a word in a function that takes a word and a corpus as its inputs and prints the matching sentences.

In [ ]:

# YOUR CODE HERE
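
As a hint of the shape such a function could take, here is a minimal sketch (the name print_sentences_with_word is just illustrative):

def print_sentences_with_word(word, corpus):
    # Print every sentence in the corpus that contains the given word token
    for review in corpus:
        for sentence in review:
            if word in sentence:
                print(' '.join(sentence))

You would then call it as, for example, print_sentences_with_word('kimchi', corpus).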

Now try calling your function on a few words of your choice in the three code blocks below!

In [ ]:

# YOUR CODE HERE

In [ ]:

# YOUR CODE HERE

In [ ]:

# YOUR CODE HERE

Extracting words in corpora¶
Sometimes, we might want more information than just the sentence context. Suppose we want to print the whole review as the context if it contains our target word. One way to do this is to make a single long list of all the individual words in the review (called review_words), and to print that list (joined by spaces) if our target word occurs in it.

The function below does most of that. One part is missing, however: the part where you take the list of lists of words that review holds at each step of the for review in corpus: loop and turn it into a single long list of words. So, for instance, you transform the list of lists:

[['Sansotei', 'serves', 'some', 'top', 'notch', 'ramen', '.'], ['They', 'take', 'no', 'reservations', ',', 'so', 'my', 'company', 'of', 'five', 'had', 'to', 'wait', 'outside', 'for', 'about', 'half', 'an', 'hour', '.'], ['I', 'guess', 'it', "'s", 'normal', 'for', 'Saturday', 'night', '.'], ['Unlike', 'my', 'favorite', 'ramen', 'place', 'in', 'NYC', 'you', 'can', 'only', 'order', 'what', "'s", 'on', 'a', 'menu', '.'], ['No', 'deviations', 'or', 'improvisations', 'whatsoever', '.'], ['Our', 'waitress', 'did', "n't", 'speak', 'much', 'English', ',', 'and', 'even', 'after', 'writing', 'everything', 'down', 'still', 'managed', 'to', 'make', 'a', 'mistake', 'or', 'two', '.'], ['But', ',', 'my', 'spicy', 'sesame', 'ramen', 'called', 'Tan', 'Tan', 'was', 'exceptional', '.'], ['I', 'can', 'tell', 'you', 'that', '.'], ['Just', 'the', 'right', 'thickness', 'of', 'broth', ',', 'and', 'taste', '-', 'omg', '.']]

into:

['Sansotei', 'serves', 'some', 'top', 'notch', 'ramen', '.', 'They', 'take', 'no', 'reservations', ',', 'so', 'my', 'company', 'of', 'five', 'had', 'to', 'wait', 'outside', 'for', 'about', 'half', 'an', 'hour', '.', 'I', 'guess', 'it', "'s", 'normal', 'for', 'Saturday', 'night', '.', 'Unlike', 'my', 'favorite', 'ramen', 'place', 'in', 'NYC', 'you', 'can', 'only', 'order', 'what', "'s", 'on', 'a', 'menu', '.', 'No', 'deviations', 'or', 'improvisations', 'whatsoever', '.', 'Our', 'waitress', 'did', "n't", 'speak', 'much', 'English', ',', 'and', 'even', 'after', 'writing', 'everything', 'down', 'still', 'managed', 'to', 'make', 'a', 'mistake', 'or', 'two', '.', 'But', ',', 'my', 'spicy', 'sesame', 'ramen', 'called', 'Tan', 'Tan', 'was', 'exceptional', '.', 'I', 'can', 'tell', 'you', 'that', '.', 'Just', 'the', 'right', 'thickness', 'of', 'broth', ',', 'and', 'taste', '-', 'omg', '.']

In [ ]:

# YOUR CODE HERE
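
Judging by the call in the next cell, the finished function might look roughly like the sketch below, where the flattening step (the part you are asked to add) builds review_words by extending it with each sentence:

def extract_key_word_in_review_context(key_word, corpus):
    for review in corpus:
        # Flatten the list of sentences (each a list of words) into one long list of words
        review_words = []
        for sentence in review:
            review_words.extend(sentence)
        # Print the whole review if the target word occurs anywhere in it
        if key_word in review_words:
            print(' '.join(review_words))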

In [ ]:

extract_key_word_in_review_context('oxtail', corpus)

Extracting sets of words¶
Sometimes we’re not too interested in the contexts, but we would just like to know what words there are in a corpus that match a certain pattern. To extract all words ending in ‘esque’ (where X-esque means ‘to be or act like X’), for instance, we can run the code below:

In [ ]:

hits = set()
for review in corpus:
    for sentence in review:
        for word in sentence:
            if word.endswith('esque'):
                hits.add(word)
print(hits)

Now try it yourself: write code that extracts all words that match the regular expression '[mM]{3,}' into a set called hits. (What does this regular expression do? If you want to refresh your memory of regular expressions, check chapter 3, sec. 3.4.) Next, print the set hits and inspect what's in it.

In [ ]:

# YOUR CODE HERE
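
One possible solution uses re.search to test each word against the pattern, which matches any word containing a run of three or more m's or M's:

import re

hits = set()
for review in corpus:
    for sentence in review:
        for word in sentence:
            if re.search('[mM]{3,}', word):
                hits.add(word)
print(hits)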

As with the sentence-printing functionality, we may want to encapsulate this code in a function. Create a function that takes a regular expression and a corpus as its input (the variables r and c in the function below) and returns a set of all words in the corpus matching that regular expression. Run the function on the regular expressions 'g{3,}', '^[A-Z]+$', and '^[0-9]+'.

In [ ]:

# YOUR CODE HERE
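
A minimal sketch of such a function (the name extract_words_matching is illustrative):

import re

def extract_words_matching(r, c):
    # Return the set of all words in corpus c that match regular expression r
    hits = set()
    for review in c:
        for sentence in review:
            for word in sentence:
                if re.search(r, word):
                    hits.add(word)
    return hits

The three cells below would then contain calls such as print(extract_words_matching('^[A-Z]+$', corpus)).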

In [ ]:

# YOUR CODE HERE

In [ ]:

# YOUR CODE HERE

In [ ]:

# YOUR CODE HERE

Stylometry¶
Next, let's look at some stylometry. At the beginning of this course we already looked into some stylometric (= measuring style) concepts, like lexical diversity and sentence and word length. Let's try one of those out on the corpus that we just created.

Here, we look at average sentence length. The average sentence length of a review is the average (or: mean) of the lengths of its individual sentences. Start by working this out for a single review: finish the function in the following code cell and run it on the first review (corpus[0]) in the code cell below it. The first review is:

[['Sansotei', 'serves', 'some', 'top', 'notch', 'ramen', '.'], ['They', 'take', 'no', 'reservations', ',', 'so', 'my', 'company', 'of', 'five', 'had', 'to', 'wait', 'outside', 'for', 'about', 'half', 'an', 'hour', '.'], ['I', 'guess', 'it', "'s", 'normal', 'for', 'Saturday', 'night', '.'], ['Unlike', 'my', 'favorite', 'ramen', 'place', 'in', 'NYC', 'you', 'can', 'only', 'order', 'what', "'s", 'on', 'a', 'menu', '.'], ['No', 'deviations', 'or', 'improvisations', 'whatsoever', '.'], ['Our', 'waitress', 'did', "n't", 'speak', 'much', 'English', ',', 'and', 'even', 'after', 'writing', 'everything', 'down', 'still', 'managed', 'to', 'make', 'a', 'mistake', 'or', 'two', '.'], ['But', ',', 'my', 'spicy', 'sesame', 'ramen', 'called', 'Tan', 'Tan', 'was', 'exceptional', '.'], ['I', 'can', 'tell', 'you', 'that', '.'], ['Just', 'the', 'right', 'thickness', 'of', 'broth', ',', 'and', 'taste', '-', 'omg', '.']]

So you can verify manually if your code gives the correct average sentence length.

In [ ]:

# YOUR CODE HERE
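
A possible version of the function (the name average_sentence_length is illustrative):

def average_sentence_length(review):
    # A review is a list of sentences, each of which is a list of word tokens;
    # the average sentence length is the total number of tokens divided by the number of sentences
    sentence_lengths = [len(sentence) for sentence in review]
    return sum(sentence_lengths) / len(sentence_lengths)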

In [ ]:

# YOUR CODE HERE

Next, iterate over all the reviews in the corpus, and print the review if the average sentence length is less than 5 words.

In [ ]:

# YOUR CODE HERE
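
Assuming your function from the previous exercise is called average_sentence_length, this loop could look like:

for review in corpus:
    if average_sentence_length(review) < 5:
        for sentence in review:
            print(' '.join(sentence))
        print()  # blank line between printed reviews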

Can you figure out what the maximum average sentence length of any review in the corpus is? And can you then print that review? Work this problem out in the cell below:

In [ ]:

# YOUR CODE HERE
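
One way to approach this (again assuming the helper from the earlier exercise is called average_sentence_length) is to use max with a key function, as in the sketch below:

# Review with the highest average sentence length
longest_review = max(corpus, key=average_sentence_length)
print(average_sentence_length(longest_review))
for sentence in longest_review:
    print(' '.join(sentence))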