IFN647 Workshop (Week 3)

Pre-processing Textual Data

Objectives:
• Understand how to use Python to do basic data pre-processing for text documents.


Task 1: Write a program that loads (reads) an XML document and prints out the itemid and the number of words in the <text> part of the document.
The following is the format of the document:
BELGIUM: MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.
MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.
SPA FRANCORCHAMPS, Belgium

J.J. Lehto of Finland and of Britain drove their ailing McLaren to victory in the fifth round of the world GT championship on Sunday, beating the Mercedes of Schneider and Austrian Alexander Wurz by 15 seconds.

Their victory enabled them to open up a 16-point lead in the overall standings over Schneider, who mounted a strong challenge on the struggling leaders in the final minutes of the four-hour race.

But Soper, struggling with the car’s handling caused by a broken undertray, just managed to hold on for the win.

Lehto had opened up a lead of over 90 seconds during a mid-race downpour in the Ardennes mountains.

“I thought that everyone else was driving on dry-weather tyres,” he joked afterwards.

“We swapped to rain tyres at exactly the right time and I was able to push hard and open up a big lead.”

Third to finish was the Porsche of France’s and and Belgian Thierry Boutsen.

The Belgian, a former Formula One driver, switched from the car he normally shares with Stuck following a power-steering failure on his own car.

(c) Reuters Limited 1997
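
One possible way to approach Task 1 is sketched below using Python's standard xml.etree.ElementTree module. It is only a sketch under stated assumptions: that the itemid is an attribute on the root element, that the body sits in a single <text> element, and that a "word" is any whitespace-separated token. The function name summarise_xml, the SAMPLE string, and the root tag name in it are made up for illustration and are not part of the workshop material.

```python
import xml.etree.ElementTree as ET


def summarise_xml(xml_string):
    """Return (itemid, number of whitespace-separated words inside <text>)."""
    root = ET.fromstring(xml_string)
    # Assumption: the document id is an 'itemid' attribute on the root element.
    itemid = root.get("itemid")
    # Assumption: the body is a single <text> element somewhere in the tree.
    text_element = next(root.iter("text"), None)
    if text_element is None:
        return itemid, 0
    # itertext() yields only character data, so nested tags are skipped.
    words = " ".join(text_element.itertext()).split()
    return itemid, len(words)


# Hypothetical miniature document, not one of the workshop files.
SAMPLE = """<newsitem itemid="1">
<title>EXAMPLE: A SHORT TITLE.</title>
<text>
<p>Lehto had opened up a lead.</p>
<p>Soper managed to hold on.</p>
</text>
</newsitem>"""

if __name__ == "__main__":
    itemid, n_words = summarise_xml(SAMPLE)
    print(f"Document itemid: {itemid} contains: {n_words} words")
```

To work from a file instead of a string, ET.parse(path).getroot() can take the place of ET.fromstring(xml_string).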

Task 2: Design a parsing function (parse_doc(input, stops)) to read a file and represent the file as a tuple
(word_count, {docid:curr_doc})
where
• word_count is the number of words in <text>…</text>.
• docid is simply assigned from the ‘itemid’ of the document.
• curr_doc is a dictionary of term_frequency pairs.
You only need to tokenize the ‘<text>’ part of the document into words, exclude all tags, and discard punctuation and numbers. Then remove stop words and, finally, collect all terms used in the ‘<text>’ part.
o Download the stop-word list (common-english-words.txt) from Blackboard and use it for this task.
You can initialize the dictionary with curr_doc = {} and then add terms to it. As you go through the terms, check whether each term already exists in curr_doc: if it does, increment its frequency; otherwise, add it with a frequency of 1.
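
The check-then-update pattern just described might look like the following minimal sketch. The helper name count_terms is hypothetical and chosen only for illustration.

```python
def count_terms(terms):
    """Build a term -> frequency dictionary from a list of terms."""
    curr_doc = {}
    for term in terms:
        if term in curr_doc:
            # Term seen before: increment its frequency.
            curr_doc[term] += 1
        else:
            # First occurrence: start counting at 1.
            curr_doc[term] = 1
    return curr_doc


if __name__ == "__main__":
    print(count_terms(["technical", "bounce", "technical"]))
```

The if/else can also be written in one line as curr_doc[term] = curr_doc.get(term, 0) + 1, which behaves the same way.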
The following is an example of the return value of parse_doc() for the file “6146.xml” (see the file in the data folder on Blackboard):
(133, {'6146': {'argentine': 1, 'bonds': 1, 'slightly': 1, 'higher': 1, 'small': 1, 'technical': 2, 'bounce': 2, … , 'newsroom': 1}})
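
A possible shape for parse_doc() is sketched below. It rests on several assumptions that the task description leaves open: input is a file path (or an open file object), the itemid is an attribute on the root element, words are counted as whitespace-separated tokens before any filtering, and terms are lowercased runs of ASCII letters. A different tokenization would give different counts, so this sketch is not guaranteed to reproduce the numbers shown for 6146.xml.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_doc(input, stops):
    """Parse one XML file into (word_count, {docid: curr_doc})."""
    # ET.parse accepts either a file path or a file object.
    root = ET.parse(input).getroot()
    # Assumption: the document id is an 'itemid' attribute on the root element.
    docid = root.get("itemid")
    text_element = next(root.iter("text"), None)
    # itertext() returns character data only, which excludes all tags.
    raw = " ".join(text_element.itertext()) if text_element is not None else ""

    # word_count: whitespace-separated tokens inside <text>, before filtering.
    word_count = len(raw.split())

    curr_doc = {}
    # Keeping only runs of letters discards punctuation and numbers.
    for term in re.findall(r"[a-z]+", raw.lower()):
        if term in stops:
            continue  # skip stop words
        curr_doc[term] = curr_doc.get(term, 0) + 1
    return word_count, {docid: curr_doc}


if __name__ == "__main__":
    # Hypothetical in-memory document, used here instead of a real file.
    demo = io.StringIO(
        '<newsitem itemid="7"><text>'
        "<p>The bonds bounce, the bonds rise 3 points.</p>"
        "</text></newsitem>"
    )
    print(parse_doc(demo, {"the"}))
```

The parameter name input follows the signature given in the task, although it shadows Python's built-in function of the same name inside parse_doc().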

Task 3: Design a main function that reads an XML file and common-english-words.txt (the list of stop words), calls the function parse_doc(input, stops), and prints the itemid (docid), the number of words (word_count), and the number of terms (len(curr_doc)).
The following is an example of the output:
Document itemid: 6146 contains: 133 words and 75 terms
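
The main function could be organized along the lines of the sketch below. The helper names load_stop_words and report are hypothetical, and the stop-word file is assumed to separate words with commas and/or whitespace; check the actual file before relying on that. parse_doc is passed in as an argument purely so that this sketch stays self-contained; in your own program, main would simply call your parse_doc directly.

```python
import re


def load_stop_words(path):
    """Read the stop-word file into a set for fast membership tests."""
    with open(path, encoding="utf-8") as f:
        # Assumption: words are separated by commas and/or whitespace.
        return {w for w in re.split(r"[,\s]+", f.read().lower()) if w}


def report(parsed):
    """Format the tuple returned by parse_doc as a one-line summary."""
    word_count, docs = parsed
    # The outer dictionary holds a single docid -> curr_doc entry.
    docid, curr_doc = next(iter(docs.items()))
    return (f"Document itemid: {docid} contains: "
            f"{word_count} words and {len(curr_doc)} terms")


def main(xml_path, stops_path, parse_doc):
    """Load the stop words, parse the document, and print the summary."""
    stops = load_stop_words(stops_path)
    print(report(parse_doc(xml_path, stops)))
```

With a parse_doc of your own in scope, a call such as main("6146.xml", "common-english-words.txt", parse_doc) would print the summary line for that file.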
