wk3_solutions
Week 3 Workshop Solutions
© Professor Yuefeng Li
Task 1: Write a program that loads (reads) an XML document and prints out the itemid and the number of words in the <text> element.
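import string

# Assumes the file layout: a <newsitem itemid="..." ...> line, followed by
# one <p>...</p> paragraph per line between <text> and </text>.
myfile = open('741299newsML.xml', 'r')
start_end = False  # becomes True once the <text> line has been seen
file_ = myfile.readlines()
myfile.close()
docid = ''
word_count = 0  # wk3
for line in file_:
    line = line.strip()
    if not start_end:
        if line.startswith("<newsitem "):
            # pull the quoted value out of itemid="..."
            for part in line.split():
                if part.startswith("itemid="):
                    docid = part.split("=")[1].split("\"")[1]
                    break
        if line.startswith("<text>"):
            start_end = True
    elif line.startswith("</text>"):
        break
    else:
        line = line.replace("<p>", "").replace("</p>", "")
        print(line)
        # delete digits, then replace each punctuation character with a space
        line = line.translate(str.maketrans('', '', string.digits)).translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
        for term in line.split():
            word_count += 1  # wk3
        #print(line)
print('Document itemid: ' + docid + ' contains: ' + str(word_count) + ' words')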
J.J. Lehto of Finland and of Britain drove their ailing McLaren to victory in the fifth round of the world GT championship on Sunday, beating the Mercedes of Schneider and Austrian Alexander Wurz by 15 seconds.
Their victory enabled them to open up a 16-point lead in the overall standings over Schneider, who mounted a strong challenge on the struggling leaders in the final minutes of the four-hour race.
But Soper, struggling with the car’s handling caused by a broken undertray, just managed to hold on for the win.
Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains.
"I thought that everyone else was driving on dry-weather tyres," he joked afterwards.
"We swapped to rain tyres at exactly the right time and I was able to push hard and open up a big lead."
Third to finish was the Porsche of France’s and and Belgian Thierry Boutsen.
The Belgian, a former Formula One driver, switched from the car he normally shares with Stuck following a power-steering failure on his own car.
Document itemid: 741299 contains: 199 words
# This program first opens the XML file and represents it as a list of lines (strings).
# For each line, it first gets the 'itemid' by recognizing the <newsitem> tag.
# It uses the Boolean variable 'start_end' to control the processing of <text> ... </text>:
## for each line in <text>, it removes <p> and </p> by using the .replace method, and then uses the .maketrans and .translate
## methods to remove digits and punctuation.
## It also counts terms (words) in the line by using word_count.
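As a cross-check on the line-by-line approach described above, the same two values can also be obtained with the standard-library xml.etree.ElementTree parser. This is only a sketch of an alternative, not the workshop's method. The sample document is a made-up, heavily shortened stand-in that assumes a root newsitem element carrying an itemid attribute and a text element holding p paragraphs; count_words and SAMPLE_XML are hypothetical names.

```python
import string
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for a news file (assumed layout).
SAMPLE_XML = """<newsitem itemid="0001" id="sample" date="1997-01-01">
<title>Sample</title>
<text>
<p>Lehto had opened up a lead of over 90 seconds.</p>
<p>"We swapped to rain tyres at exactly the right time."</p>
</text>
</newsitem>"""


def count_words(xml_text):
    """Return (itemid, number of words inside <text>) for one document."""
    root = ET.fromstring(xml_text)
    docid = root.get("itemid")

    # Same cleaning as in the solution: delete digits, then turn
    # every punctuation character into a space.
    no_digits = str.maketrans('', '', string.digits)
    punct_to_space = str.maketrans(string.punctuation,
                                   ' ' * len(string.punctuation))

    word_count = 0
    text_elem = root.find("text")
    if text_elem is not None:
        for p in text_elem.iter("p"):
            line = "".join(p.itertext())
            line = line.translate(no_digits).translate(punct_to_space)
            word_count += len(line.split())
    return docid, word_count


docid, word_count = count_words(SAMPLE_XML)
print('Document itemid: ' + docid + ' contains: ' + str(word_count) + ' words')
```

One difference to keep in mind: an XML parser decodes entities such as &quot; into the characters they stand for, whereas the line-based version leaves the entity name behind as a word, so the two approaches can report different counts on the same file.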
Exercises for using the .maketrans and .translate methods
You may need to review the Week 2 lecture notes on string methods.
line_s1 = "Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains."
line_s2 = line_s1.translate(str.maketrans('', '', string.digits)).translate(str.maketrans(string.punctuation, \
    ' ' * len(string.punctuation)))
print(line_s2)
print(line_s1)
mapping_tbl_digits = line_s1.maketrans('', '', string.digits)  # Remove digits
line_s3 = line_s1.translate(mapping_tbl_digits)
print(mapping_tbl_digits)
print(line_s3)
mapping_tbl_punc = line_s1.maketrans(string.punctuation, ' ' * len(string.punctuation))  # Replace punctuation with ' '
print(mapping_tbl_punc)
line_s4 = line_s3.translate(mapping_tbl_punc)
print(line_s4)
Lehto had opened up a lead of over seconds during a mid race downpour in the Ardennes mountains
Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains.
{48: None, 49: None, 50: None, 51: None, 52: None, 53: None, 54: None, 55: None, 56: None, 57: None}
Lehto had opened up a lead of over seconds during a mid-race downpour in the Ardennes mountains.
{33: 32, 34: 32, 35: 32, 36: 32, 37: 32, 38: 32, 39: 32, 40: 32, 41: 32, 42: 32, 43: 32, 44: 32, 45: 32, 46: 32, 47: 32, 58: 32, 59: 32, 60: 32, 61: 32, 62: 32, 63: 32, 64: 32, 91: 32, 92: 32, 93: 32, 94: 32, 95: 32, 96: 32, 123: 32, 124: 32, 125: 32, 126: 32}
Lehto had opened up a lead of over seconds during a mid race downpour in the Ardennes mountains
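The two tables printed above can also be merged into one, because str.maketrans accepts an optional third argument listing characters to delete. The sketch below builds such a combined table and checks that it gives the same result as the two-step version. The names two_step, combined_tbl and one_step are introduced here for illustration only.

```python
import string

line_s1 = "Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains."

# Two-step version, as in the exercise: delete digits, then map punctuation to spaces.
two_step = (line_s1
            .translate(str.maketrans('', '', string.digits))
            .translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))))

# One table: the first two arguments map each punctuation character to a space,
# the third argument lists the characters to delete (they are mapped to None).
combined_tbl = str.maketrans(string.punctuation,
                             ' ' * len(string.punctuation),
                             string.digits)
one_step = line_s1.translate(combined_tbl)

print(one_step == two_step)
print(one_step.split())
```

The single table works here because the digit and punctuation character sets do not overlap, so the order of the two operations does not matter.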
Task 2: Design a parsing function (parse_doc(input, stops)) to read a file and represent the file as a tuple
(word_count, {docid:curr_doc}), where curr_doc is a dictionary of term:frequency pairs.
Task 3: Design the main function to read an XML file and common-english-words.txt (the list of stop words), call function parse_doc(input, stops), and print the itemid (docid), the number of words (word_count) and the number of terms (len(curr_doc)).
import os
import string

def parse_doc(inputpath, stop_ws):
    # Assumes the same file layout as in Task 1: a <newsitem itemid="..." ...> line,
    # followed by one <p>...</p> paragraph per line between <text> and </text>.
    myfile = open(os.path.join(inputpath, '6146.xml'))
    curr_doc = {}
    docid = ''
    start_end = False
    file_ = myfile.readlines()
    myfile.close()
    word_count = 0  # wk3
    for line in file_:
        line = line.strip()
        #print(line)
        if not start_end:
            if line.startswith("<newsitem "):
                for part in line.split():
                    if part.startswith("itemid="):
                        docid = part.split("=")[1].split("\"")[1]
                        break
            if line.startswith("<text>"):
                start_end = True
        elif line.startswith("</text>"):
            break
        else:
            line = line.replace("<p>", "").replace("</p>", "")
            line = line.translate(str.maketrans('', '', string.digits)).translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
            # split() with no argument already collapses runs of whitespace
            for term in line.split():
                word_count += 1  # wk3
                term = term.lower()  # wk3
                if len(term) > 2 and term not in stop_ws:  # wk3
                    try:
                        curr_doc[term] += 1
                    except KeyError:
                        curr_doc[term] = 1
    # Return a tuple: the first element is the number of words in <text>;
    # the second is a dictionary that holds a single pair of docid and a dictionary of term:frequency pairs.
    return (word_count, {docid: curr_doc})
stopwords_f = open('common-english-words.txt', 'r')  # wk3
stop_words = stopwords_f.read().split(',')
stopwords_f.close()
x = parse_doc("", stop_words)
print('--- Task2: The return value of function parse_doc ---')
print(x)
print('--- The outcomes of Task3 ---')
for doc in x[1].items():
    print('Document itemid: ' + doc[0] + ' contains: ' + str(x[0]) + ' words and ' + str(len(doc[1])) + ' terms')
--- Task2: The return value of function parse_doc ---
(133, {'6146': {'argentine': 1, 'bonds': 1, 'slightly': 1, 'higher': 1, 'small': 1, 'technical': 2, 'bounce': 2, 'wednesday': 1, 'amid': 1, 'low': 1, 'volume': 1, 'trader': 2, 'large': 1, 'foreign': 1, 'bank': 1, 'slight': 1, 'opening': 1, 'expect': 1, 'prices': 1, 'change': 1, 'much': 1, 'during': 1, 'session': 1, 'market': 2, 'moving': 1, 'news': 1, 'expected': 2, 'percent': 1, 'dollar': 1, 'denominated': 1, 'bocon': 1, 'previsional': 1, 'due': 2, 'rose': 2, 'argentina': 2, 'frb': 1, 'quot': 2, 'general': 1, 'uncertainty': 1, 'pointing': 1, 'events': 1, 'waiting': 1, 'including': 1, 'passage': 1, 'government': 1, 'new': 1, 'economic': 1, 'measures': 1, 'through': 1, 'congress': 1, 'now': 1, 'until': 1, 'early': 1, 'october': 1, 'addition': 1, 'traders': 1, 'awaiting': 1, 'meeting': 1, 'friday': 1, 'between': 1, 'economy': 1, 'minister': 1, 'roque': 1, 'fernandez': 1, 'international': 1, 'monetary': 1, 'fund': 1, 'delegation': 1, 'fiscal': 1, 'deficit': 1, 'axel': 1, 'bugge': 1, 'buenos': 1, 'aires': 1, 'newsroom': 1}})
--- The outcomes of Task3 ---
Document itemid: 6146 contains: 133 words and 75 terms
Note that the solutions for Task 2 and Task 3 are slightly different from the .py solution: we do not need to define an explicit main function here. Also, the input XML file is in the current folder, so inputpath is passed as an empty string.
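The term-counting step of parse_doc can be tried without any files. The sketch below applies the same rules (delete digits, replace punctuation with spaces, lowercase, drop terms of one or two characters and stop words) to an in-memory string, using collections.Counter in place of the try/except pattern. The function name term_frequencies, the tiny stop list and the sample sentence are all made up for illustration.

```python
import string
from collections import Counter


def term_frequencies(text, stop_ws):
    """Return (word_count, {term: frequency}) for one piece of text."""
    # One translation table: punctuation -> space, digits deleted
    # (the third argument of str.maketrans lists characters to remove).
    table = str.maketrans(string.punctuation,
                          ' ' * len(string.punctuation),
                          string.digits)
    words = text.translate(table).lower().split()
    # Keep terms longer than two characters that are not stop words.
    terms = Counter(w for w in words if len(w) > 2 and w not in stop_ws)
    return len(words), dict(terms)


stops = {"the", "and", "was"}  # hypothetical stop list, stored as a set
sample = "The market was quiet, and the market bounce was small."
word_count, tf = term_frequencies(sample, stops)
print(word_count, len(tf))
print(tf)
```

Storing the stop words in a set rather than a list makes each membership test a hash lookup, which may matter once many documents are parsed.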