IFN647 Workshop (Week 3)
Pre-processing Textual Data
Objectives:
• Understand how to use Python to do basic data pre-processing for text documents.
Task 1: Write a program that loads (reads) an XML document, and prints out the itemid and the number of words in
The following is the format of the document:
J.J. Lehto of Finland and of Britain drove their ailing McLaren to victory in the fifth round of the world GT championship on Sunday, beating the Mercedes of Schneider and Austrian Alexander Wurz by 15 seconds.
Their victory enabled them to open up a 16-point lead in the overall standings over Schneider, who mounted a strong challenge on the struggling leaders in the final minutes of the four-hour race.
But Soper, struggling with the car’s handling caused by a broken undertray, just managed to hold on for the win.
Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains.
“I thought that everyone else was driving on dry-weather tyres,” he joked afterwards.
“We swapped to rain tyres at exactly the right time and I was able to push hard and open up a big lead.”
Third to finish was the Porsche of France’s and and Belgian Thierry Boutsen.
The Belgian, a former Formula One driver, switched from the car he normally shares with Stuck following a power-steering failure on his own car.
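One possible starting point for Task 1 is sketched below. The markup of the example document above did not survive, so the element and attribute names used here (a root element carrying an itemid attribute, and a text element containing p paragraphs) are assumptions, as is the short SAMPLE_XML string; adjust them to match the real files. Words are counted by simple whitespace splitting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The tag names (newsitem, text, p) and the
# itemid attribute are assumptions, not taken from the handout.
SAMPLE_XML = """<newsitem itemid="6146">
<title>Sample title</title>
<text>
<p>Argentine bonds were slightly higher.</p>
<p>A small technical bounce.</p>
</text>
</newsitem>"""


def count_words(xml_string):
    """Return (itemid, number of whitespace-separated words inside <text>)."""
    root = ET.fromstring(xml_string)
    itemid = root.get("itemid")
    text_elem = root.find("text")
    word_count = 0
    if text_elem is not None:
        # itertext() yields every piece of text under <text>,
        # including the text of its <p> children.
        for chunk in text_elem.itertext():
            word_count += len(chunk.split())
    return itemid, word_count


itemid, word_count = count_words(SAMPLE_XML)
print(f"Document itemid: {itemid} contains: {word_count} words")
```

To process a file instead of a string, ET.parse(path).getroot() can replace ET.fromstring(xml_string).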
Task 2: Design a parsing function (parse_doc(input, stops)) to read a file and represent the file as a tuple
(word_count, {docid:curr_doc})
• word_count is the number of words in
You only need to tokenize the ‘
o Download the stop-word list (common-english-words.txt) from Blackboard, and use it for this task.
You can initialize the dictionary curr_doc = {} and then add terms to it. As you go through the terms, check whether each new term already exists in curr_doc and update its frequency accordingly.
The following is an example of the return value of parse_doc() for file “6146.xml” (see the file in the data folder on Blackboard):
(133, {'6146': {'argentine': 1, 'bonds': 1, 'slightly': 1, 'higher': 1, 'small': 1, 'technical': 2, 'bounce': 2, … , 'newsroom': 1}})
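A minimal sketch of parse_doc() under stated assumptions follows. It assumes the document id is the itemid attribute of the root element and the body is in a text element (neither name is confirmed by the handout), counts every whitespace-separated token as a word, and turns a token into a term by stripping surrounding punctuation and lower-casing it. The exact tokenization rules expected for the workshop may differ, so the counts it produces will not necessarily match the 6146.xml example above.

```python
import io
import string
import xml.etree.ElementTree as ET


def parse_doc(input, stops):
    """Return (word_count, {docid: curr_doc}) for one XML document.

    input: a file path or an open file-like object (the parameter name
           follows the task statement, although it shadows a built-in).
    stops: a collection of stop words to leave out of curr_doc.
    """
    root = ET.parse(input).getroot()
    docid = root.get("itemid")      # assumed attribute name
    text_elem = root.find("text")   # assumed element name
    curr_doc = {}
    word_count = 0
    if text_elem is not None:
        for chunk in text_elem.itertext():
            for token in chunk.split():
                word_count += 1
                term = token.strip(string.punctuation).lower()
                if term and term not in stops:
                    # Add the term if it is new, otherwise bump its frequency.
                    curr_doc[term] = curr_doc.get(term, 0) + 1
    return word_count, {docid: curr_doc}


# Hypothetical in-memory document standing in for a real file.
demo_xml = (
    '<newsitem itemid="6146"><text>'
    "<p>Argentine bonds were slightly higher in a small technical bounce.</p>"
    "<p>The bounce was technical.</p>"
    "</text></newsitem>"
)
demo_stops = {"were", "in", "a", "the", "was"}
print(parse_doc(io.StringIO(demo_xml), demo_stops))
```

The curr_doc.get(term, 0) + 1 idiom covers both cases described above: a missing term starts at 0 and becomes 1, and an existing term has its frequency incremented.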
Task 3: Design a main function to read an XML file and common-english-words.txt (the list of stop words), call function parse_doc(input, stops), and print the itemid (docid), the number of words (word_count) and the number of terms (len(curr_doc)).
The following is an example of the outputs:
Document itemid: 6146 contains: 133 words and 75 terms
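The main function could be organized roughly as below. The helper names load_stops and format_report are hypothetical, and the sketch assumes the stop-word file separates words with commas and/or whitespace, which should be checked against the actual common-english-words.txt. So that the block stands alone, parse_doc is passed in as an argument; in a real script it would simply be the function written for Task 2.

```python
import re


def load_stops(path):
    """Read stop words into a set.

    Assumes the words are separated by commas and/or whitespace.
    """
    with open(path, encoding="utf-8") as f:
        return {w for w in re.split(r"[,\s]+", f.read().lower()) if w}


def format_report(word_count, docs):
    """Build one output line per document from parse_doc's return value."""
    lines = []
    for docid, curr_doc in docs.items():
        lines.append(
            f"Document itemid: {docid} contains: "
            f"{word_count} words and {len(curr_doc)} terms"
        )
    return "\n".join(lines)


def main(xml_path, stops_path, parse_doc):
    """Load the stop words, parse the document, and print the summary."""
    stops = load_stops(stops_path)
    word_count, docs = parse_doc(xml_path, stops)
    print(format_report(word_count, docs))
```

A call such as main("6146.xml", "common-english-words.txt", parse_doc) would then print a line in the format shown above, provided both files are in the working directory.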