
Exercise one¶

Part A: Text preprocessing¶
Load the json data¶
I use the json library to convert each line into a Python object,
and then extract the content field and convert it to lowercase.

In [1]:

import json
news = []
with open("signal-news1/signal-news1.jsonl") as newsfile:
    for line in newsfile:
        news.append(json.loads(line))

In [2]:

# extract the content and convert to lowercase
contents = [l["content"].lower() for l in news]

Remove URLS¶
My URL regular expression is "(http[s]?://\S+)|(www\.\S+)|(\S+\.(com|org|edu))".
It matches three kinds of URLs: (http[s]?://\S+) matches strings such as "http://domain" or "https://www.*", (www\.\S+) matches strings such as "www.*", and (\S+\.(com|org|edu)) matches strings such as "domain.com".

Some example URLs matched in the news stories:

http://bit.ly/1jd3nq4
https://youtu.be/khlhd2ex8wk
amazon.com
analystratingsnetwork.com
www.twitter.com/russostrib
oma-online.org

Use re.sub to replace the matched URLs with an empty string.

In [3]:

import re

urlReg = r"(http[s]?://\S+)|(www\.\S+)|(\S+\.(com|org|edu))"

i = 0
# show some of the matched URLS to ensure it is correct
for n in contents:
    url = re.search(urlReg, n)
    if url:
        if i % 100 == 0:
            print(url.group(0))
        i += 1

savingcountrymusic.com
usmagazine.com
mlssoccer.com
impactalpha.com
tvguide.com
zacks.com
geeky-gadgets.com
cleveland.com
zacks.com
http://twitter.com/share?text=
http://www.americangeosciences.org/sites/default/files/education-reports-secondaryes_report.pdf
nhl.com
zacks.com
indiespeedrun.com
radaronline.com
usmagazine.com
marketbeat.com
whoscored.com

In [4]:

removedUrl = [re.sub(urlReg, "", n) for n in contents]

Remove all non-alphanumeric characters except spaces¶
The non-alphanumeric pattern is "[^a-zA-Z0-9\s]", which matches any character that is not in a-z, A-Z, 0-9 or a whitespace character.
Use re.sub to remove the matched characters.
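As a quick sanity check (on a made-up string, not the data), the pattern keeps only letters, digits and whitespace:

import re

print(re.sub(r"[^a-zA-Z0-9\s]", "", "it's a test-case: 100% clean!"))
# prints: its a testcase 100 clean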

In [5]:

removedNonalphanum = [re.sub(r"[^a-zA-Z0-9\s]", "", n) for n in removedUrl]

Split to words¶
Use split method to get word tokens.

In [6]:

words = []
for l in removedNonalphanum:
    words.append(l.split())

Remove words with 3 characters or fewer¶
This is simple: just use a list comprehension to keep only the words longer than 3 characters.

In [7]:

removeShortWords = [ [word for word in l if len(word) > 3] for l in words]

Remove numbers that are fully made of digits¶
The regular expression for numbers is "\d+".
Use fullmatch to test whether the whole token matches, i.e. whether it is fully made of digits.
Again, a list comprehension is used to keep the wanted words. List comprehensions are generally
faster than explicit loops, so I use them wherever possible.
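To illustrate why fullmatch rather than match is needed (made-up tokens), fullmatch only accepts tokens made entirely of digits, while match would also succeed on a digit prefix:

import re

digitP = re.compile(r"\d+")
print(digitP.fullmatch("2016"))   # a match object: the token is all digits, so it gets removed
print(digitP.fullmatch("2016s"))  # None: trailing letter, so the token is kept
print(digitP.match("2016s"))      # match would still succeed on the "2016" prefix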

In [8]:

digitP = re.compile(r"\d+")
removeNumbers = [[word for word in l if digitP.fullmatch(word) is None ] for l in removeShortWords]

Use an English lemmatiser to process all the words¶
I use the WordNetLemmatizer from the nltk library to process the words.

In [9]:

from nltk.stem import WordNetLemmatizer

In [10]:

wl = WordNetLemmatizer()

In [11]:

lemmedWords = [[wl.lemmatize(word) for word in l] for l in removeNumbers]
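For reference, a few illustrative calls (the outputs assume the standard WordNet data and the default noun part of speech): plural nouns are reduced to their singular form, while words the lemmatiser cannot resolve are returned unchanged.

print(wl.lemmatize("stories"))  # story
print(wl.lemmatize("state"))    # state (already a lemma)
print(wl.lemmatize("said"))     # said (not a noun form, so left as is)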

Part B: N-grams¶

Compute N (number of tokens) and V (vocabulary size)¶
Use a set to compute the vocabulary size. The number of tokens is the sum of the token counts of the individual stories.

In [12]:

voca = set()
N = 0
for l in lemmedWords:
    N += len(l)
    for w in l:
        voca.add(w)

In [13]:

print("Number of tokens: %d" % N)
print("Vocabulary size: %d" % len(voca))

Number of tokens: 3718089
Vocabulary size: 118701

List the top 25 bigrams¶
Use bigrams to get the bigrams and a defaultdict to make computing the bigram counts easier.
After computing the count of each bigram, sort by count and print the top 25 bigrams.
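As a reminder of what nltk.util.bigrams yields (on a made-up token list), it produces consecutive pairs of tokens:

from nltk.util import bigrams

print(list(bigrams(["more", "than", "year"])))
# [('more', 'than'), ('than', 'year')]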

In [14]:

from nltk.util import bigrams

In [15]:

from collections import defaultdict

# compute bigram counts
bicounts = defaultdict(int)
for l in lemmedWords:
    for bi in bigrams(l):
        bicounts[bi] += 1

In [16]:

sortedBigrams = sorted(bicounts.items(), key=lambda x: -x[1])

In [17]:

print("top 25 bigrams based on the number of occurrences")
for item in sortedBigrams[:25]:
    print(item)

top 25 bigrams based on the number of occurrences
(('more', 'than'), 3460)
(('have', 'been'), 3316)
(('hold', 'rating'), 2278)
(('last', 'year'), 2266)
(('this', 'year'), 2102)
(('moving', 'average'), 2013)
(('price', 'target'), 1924)
(('average', 'price'), 1808)
(('research', 'report'), 1671)
(('target', 'price'), 1536)
(('research', 'note'), 1390)
(('nokia', 'nokia'), 1352)
(('united', 'state'), 1311)
(('this', 'week'), 1263)
(('said', 'that'), 1206)
(('price', 'objective'), 1201)
(('they', 'have'), 1199)
(('earnings', 'share'), 1186)
(('premier', 'league'), 1167)
(('that', 'will'), 1130)
(('cell', 'phone'), 1116)
(('last', 'week'), 1111)
(('phone', 'plan'), 1076)
(('plan', 'detail'), 1071)
(('they', 'were'), 1046)

Compute the number of positive and negative word counts¶
Read the positive words and negative words from the lexicon files and convert them to sets.
Then, for each story, test whether each word is in the positive word set or the negative word set.

In [18]:

# function to read words from file
def readDict(fn):
    f = open(fn, encoding="ISO-8859-1")
    lines = f.readlines()
    f.close()

    # skip the header lines at the top of the lexicon file
    return [line.strip() for line in lines[35:]]

# compute counts
def computeCounts(l, s):
    return sum([x in s for x in l])
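A small usage check with made-up words (not from the lexicon): computeCounts simply counts how many tokens of a story fall into the given set.

print(computeCounts(["good", "awful", "good", "news"], {"good", "great"}))
# prints: 2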

In [19]:

negWords = readDict("signal-news1/opinion-lexicon-English/negative-words.txt")

In [20]:

posWords = readDict("signal-news1/opinion-lexicon-English/positive-words.txt")

In [21]:

negSet = set(negWords)
posSet = set(posWords)

In [22]:

posNegCounts = [ (computeCounts(doc, posSet), computeCounts(doc, negSet)) for doc in lemmedWords]

In [23]:

# show the results for the first 10 rows
for i in range(10):
    pc, nc = posNegCounts[i]
    print("Id %s: positive count: %d negative count: %d" % (news[i]["id"], pc, nc))

Id 09344ca2-fdcc-4de3-9424-1b201e3da9ea: positive count: 30 negative count: 16
Id 87bc19b8-57df-4111-bf63-a39fcb8ad1a1: positive count: 28 negative count: 30
Id bd45307a-2589-410e-a327-ee7fd2144668: positive count: 0 negative count: 5
Id cb85f346-a961-4e3d-8942-3d2a0e436a85: positive count: 9 negative count: 0
Id f8a65c25-08bb-4b1a-b0a3-9f7994eb33a3: positive count: 14 negative count: 11
Id ecc20144-2d65-43cb-adab-9a9f6559f4f6: positive count: 2 negative count: 0
Id 3c8cc0df-fb5c-433f-b30b-0577e8b29b83: positive count: 17 negative count: 20
Id d60de2e1-3198-476a-85ec-2bd20b380a9a: positive count: 0 negative count: 0
Id 514dca6d-7ca1-4b54-bf27-4c5d8b11adb6: positive count: 13 negative count: 0
Id 5fc5f675-72e6-4cb9-a2a7-8e6dbaeb4ee5: positive count: 8 negative count: 5

Compute the number of news stories with more positive than negative words, as well as the number of news stories with more negative than positive words.¶
Use list comprehension to easily compute the counts.

In [24]:

morePosCounts = sum([x > y for (x, y) in posNegCounts])
moreNegCounts = sum([x < y for (x, y) in posNegCounts])

In [25]:

print("number of news stories with more positive than negative words: %d" % morePosCounts)
print("number of news stories with more negative than positive words: %d" % moreNegCounts)

number of news stories with more positive than negative words: 10548
number of news stories with more negative than positive words: 6597

Part C: Language models¶

Compute language models for bigrams¶
First, split the data into a training part and a test part, then compute the unigram counts and bigram counts on the training data. As before, I use defaultdict to facilitate the counting.

In [26]:

train = lemmedWords[:16000]
test = lemmedWords[16000:]

Compute unigram counts¶

In [27]:

uniCounts = defaultdict(int)
for l in train:
    for word in l:
        uniCounts[word] += 1

Compute bigram counts¶

In [28]:

biCounts = defaultdict(int)
for l in train:
    for bi in bigrams(l):
        biCounts[bi] += 1

Compute the probability P(current|pre) with Laplace smoothing¶
I use Laplace smoothing when computing the conditional probability P(current | pre). The formula is $P(current \mid pre) = \frac{bigramCounts(pre, current) + 1}{unigramCounts(pre) + vocabularySize}$. In this way, even when a bigram count is 0, we still assign a non-zero probability. To make the computation numerically stable, I use the log probability.

In [29]:

import math

prob = defaultdict(dict)
vocaSize = len(uniCounts)

# Laplace smoothing
for bi, count in biCounts.items():
    pre, aft = bi
    prob[pre][aft] = math.log((count + 1) / (uniCounts[pre] + vocaSize))

Produce a sentence of 10 words by appending the most likely next word each time¶
At each step, for the last word $x$ generated, find the word $y$ that has the maximum value of $p(y|x)$.

In [30]:

sent = ["they"]
while len(sent) != 10:
    h = prob[sent[-1]]
    best = None
    for k in h:
        if best is None or h[k] > h[best]:
            best = k

    sent.append(best)

In [31]:

print(sent)

['they', 'have', 'been', 'given', 'rating', 'hold', 'rating', 'hold', 'rating', 'hold']

Compute the perplexity on test data¶
Multiplying many small floating point numbers may underflow, so I use log probabilities in the computation:
add up the log probabilities of all bigrams in the test data, compute the average log probability, and finally convert it to perplexity. Because of the Laplace smoothing, a bigram that appears in the test data but not in the training data still gets a non-zero probability.
$perplexity = e^{-averageLogProb}$
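Written out over the test bigrams (my notation for the same quantity), this is

$$perplexity = \exp\Big(-\frac{1}{N_{bigrams}} \sum_{(pre,\ aft)\ \in\ test} \log P(aft \mid pre)\Big)$$

where $N_{bigrams}$ is the number of bigrams in the test data (the variable count in the code below).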

In [32]:

logProb = 0
count = 0
for l in test:
    for pre, aft in bigrams(l):
        if aft not in prob[pre]:
            # the bigram doesn't appear in the training data: apply Laplace smoothing
            logProb += math.log(1 / (uniCounts[pre] + vocaSize))
        else:
            logProb += prob[pre][aft]
        count += 1

avProb = logProb / count

perplexity = math.e ** (-avProb)

In [33]:

print("perplexity: %f" % perplexity)

perplexity: 25685.738581
