IFN647 Text, Web and Media Analytics
Assignment 1 (Sem 1, 2022)
Required to be submitted:
1. Please put your outputs into the text files specified in the question descriptions, and put all .py files, data and a “readme.txt” in a folder (e.g., Your Surname_code), where “readme.txt” contains a short user manual to help your tutor run your Python code. Then zip all .txt files and the folder into a zip file named “your student ID_Surname_Asm1.zip”.
2. Submit your zip file for this assignment in BB before 11.59pm on 6 May 2022.
3. Answer all three questions (9 tasks).
4. See the marking guide for more details on the distribution of marks and marking criteria.
Individual work: You should work on this assignment individually.
Due date: Friday week 8 (6 May 2022)
Weighting: 20% of the assessment for IFN647.
Dataset (Rnews_v1 document collection)
• You will be working with a sample dataset, a small subset of the TREC RCV1 XML document collection, which has been pre-tokenized (for convenience, and for copyright reasons). The dataset can be downloaded from Blackboard.
You are asked to design Python code for three questions (9 tasks). You can add new variables, functions, methods, or update function parameters. However, you should provide comments to clearly describe why you are doing this.
Question 1. Document & query parsing
The motivation for Question 1 is to design your own document and query parsers, so please don’t use Python packages that we didn’t use in the workshop.
Task 1.1: Define a document parsing function parse_rcv_coll(inputpath, stop_words) to parse a data collection (e.g., Rnews_v1 dataset), where parameter inputpath is the folder that stores a set of XML files, and parameter stop_words is a list of common English words (you may use the
file ‘common-english-words.txt’ to find all stop words). The following are the major steps in the document parsing function:
Step 1) The function reads XML files from inputpath (e.g., Rnews_v1). For each file, it finds the docID (document ID) and the index terms, and then represents the file as a BowDoc object.
You need to define a BowDoc class by using Bag-of-Words to represent a document:
• BowDoc needs a docID variable, which is simply assigned the value of ‘itemid’ in the <newsitem> tag of the XML file.
• In this task, BowDoc can be initialised with a docID attribute; an empty dictionary (the variable name is terms) of (String term: int frequency) key-value pairs; and a doc_len (document length) attribute.
• You may define your own methods, e.g., getDocId() to get the document ID.
Step 2) It then builds up a collection of BowDoc objects for the given dataset. This collection can be a dictionary structure (as we used in the workshop), a linked list, or a class BowColl for storing a collection of BowDoc objects. Please note that the remaining descriptions assume the dictionary structure, with docID as key and BowDoc object as value.
Step 3) Finally, it returns the collection of BowDoc objects.
You also need to follow the following requirements to define this parsing function:
Please use the basic text pre-processing steps, namely tokenizing, stop-word removal and stemming of terms.
Tokenizing – (please provide a definition of a word, and describe it in a Python comment)
• You need to tokenize at least the content of the ‘<text>’ tags, and discard punctuation and/or numbers based on your definition of a word.
• Define a method addTerm() for class BowDoc to add a new term or increase the term frequency when a term occurs again.
Stop-word removal and stemming of terms –
• Use the given stop-word list (“common-english-words.txt”) to ignore/remove all stop words. Open and read the given file of stop words and store them in a list stopwordList. When adding a term, check whether the term exists in stopwordList, and ignore it if it does.
• Please use the Porter2 stemming algorithm to update BowDoc’s terms.
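To make the steps above concrete, the following is a minimal sketch of the BowDoc class and parse_rcv_coll(). The regular-expression parsing, the word definition (runs of alphabetic characters) and the pass-through stem() placeholder are all assumptions made for illustration; in your submission you should use the Porter2 stemmer and document your own word definition in a comment.

```python
import glob
import os
import re

class BowDoc:
    """Bag-of-words representation of one RCV1 document."""
    def __init__(self, docID):
        self.docID = docID
        self.terms = {}    # {term: frequency}
        self.doc_len = 0   # total number of word occurrences in the document

    def getDocId(self):
        return self.docID

    def addTerm(self, term):
        # Add a new term, or increase its frequency if it occurs again.
        self.terms[term] = self.terms.get(term, 0) + 1

def stem(term):
    # The assignment requires the Porter2 stemmer (e.g. stemming.porter2.stem);
    # a pass-through is used here so the sketch has no external dependencies.
    return term

def parse_rcv_coll(inputpath, stop_words):
    """Parse every XML file in inputpath into a {docID: BowDoc} dictionary."""
    coll = {}
    for path in glob.glob(os.path.join(inputpath, "*.xml")):
        with open(path, encoding="utf-8") as f:
            xml = f.read()
        docID = re.search(r'itemid="(\d+)"', xml).group(1)
        doc = BowDoc(docID)
        # Assumption: index the content of the <text> tags only.
        body = " ".join(re.findall(r"<text>(.*?)</text>", xml, re.S))
        body = re.sub(r"<[^>]+>", " ", body)   # drop residual tags such as <p>
        # A word is defined here as a run of alphabetic characters.
        for token in re.findall(r"[A-Za-z]+", body):
            doc.doc_len += 1                   # every token counts toward doc_len
            term = stem(token.lower())
            if term not in stop_words:
                doc.addTerm(term)
        coll[docID] = doc
    return coll
```

Note the design choice here of counting every token toward doc_len, while only non-stop-words enter the terms dictionary; this matches the sample output where the number of distinct index terms is much smaller than the total word count.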
Task 1.2: Define a query parsing function parse_query(query0, stop_words), where we assume the original query is a simple sentence or a title in a String format (query0), and stop_words is a list of stop words that you can get from ‘common-english-words.txt’.
For example, let query0 = ‘CANADA: Sherritt to buy Dynatec, spin off unit, canada.’; the function will return the dictionary
{‘canada’: 2, ‘sherritt’: 1, ‘buy’: 1, ‘dynatec’: 1, ‘spin’: 1, ‘unit’: 1}
Please note you should use the same text transformation technique as for documents, i.e., the tokenizing steps for queries must be identical to the steps for documents.
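A minimal sketch of parse_query() is shown below, reusing the same word definition assumed in the document-parser sketch (runs of alphabetic characters, lowercased). Stemming is omitted here to keep the sketch dependency-free; in your submission, apply the same Porter2 stemming as in parse_rcv_coll().

```python
import re

def parse_query(query0, stop_words):
    """Parse an original query string the same way documents are parsed."""
    q = {}
    # Assumed word definition: runs of alphabetic characters, lowercased.
    for token in re.findall(r"[A-Za-z]+", query0):
        term = token.lower()   # the Porter2 stemmer would be applied here as well
        if term not in stop_words:
            q[term] = q.get(term, 0) + 1
    return q
```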
Task 1.3: Define a main function to test the function parse_rcv_coll(). The main function uses the provided dataset and calls parse_rcv_coll() to get a collection of BowDoc objects. For each document in the collection, first print out its docID, the number of index terms and the total number of words in the document (doc_len). Then sort the index terms by frequency and print out a term:freq list. Finally, save the output into a text file (file name “your full name_Q1.txt”).
Example of output for file “807606newsML.xml”:
Document 807606 contains 60 terms and has a total of 187 words
bid : 7
bank : 6
insur : 5
great : 5
west : 5
royal : 5
london : 4
quot : 4
trilon : 3
tender : 3
stake : 3
offer : 3
tuesday : 2
percent : 2
billion : 2
canada : 2
match : 2
myhal : 2
june : 2
sharehold : 2
per : 2
share : 2
financi : 1
corp : 1
group : 1
lifeco : 1
inc : 1
doe : 1
posit : 1
wait : 1
see : 1
fail : 1
Question 2. Tf*idf based IR model
Tf*idf is a popular term-weighting method, which uses Eq. (1) to calculate a weight for term k in document i, where the base of the log is 10. You may review the lecture notes for the meaning of each variable in the equation.
Task 2.1: Define a function calc_df(coll) to calculate the document frequency (df) for a given BowDoc collection coll and return a {term: df, …} dictionary.
Example of output for this task:
There are 10 documents in this dataset
The following are the terms’ document-frequency:
share: 5
market: 4
compani: 4
three: 4
royal: 4
public: 3
strong: 3
busi: 3
hold: 3
sector: 2
higher: 2
follow: 2
signific: 1
jihad: 1
katyusha: 1
morel: 1
westwood: 1
settlement: 1
hole: 1
privat: 1
andrea: 1
depend: 1
aug: 1
articl: 1
deviat: 1
swap: 1
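Document frequency counts the number of documents a term appears in, not its total frequency, so each term contributes once per document. A minimal sketch of calc_df() under the assumption that each BowDoc object exposes the terms dictionary from Task 1.1:

```python
def calc_df(coll):
    """Return a {term: document-frequency} dictionary for a {docID: BowDoc} collection.

    Assumes each BowDoc object exposes a .terms dictionary as in Task 1.1.
    """
    df = {}
    for doc in coll.values():
        for term in doc.terms:          # each term counts once per document
            df[term] = df.get(term, 0) + 1
    return df
```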
Task 2.2: Use Eq. (1) to define a function tfidf(doc, df, ndocs) to calculate the tf*idf value (weight) of every term in a BowDoc object, where doc is a BowDoc object or a dictionary of {term: freq, …}, df is a {term: df, …} dictionary, and ndocs is the number of documents in the given BowDoc collection. The function returns a {term: tfidf_weight, …} dictionary for the given document doc.
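A sketch of tfidf() follows. Since Eq. (1) is not reproduced here, this assumes it is the length-normalised tf*idf weight d_ik = f_ik · log10(N/n_k) / sqrt(Σ_j (f_ij · log10(N/n_j))²); verify the exact form against the lecture notes before using it.

```python
import math

def tfidf(doc, df, ndocs):
    """Return a {term: tf*idf weight} dictionary for one document.

    Assumption: Eq. (1) is the cosine-length-normalised tf*idf weight;
    check this against the lecture notes.
    doc may be a BowDoc object or a plain {term: freq} dictionary.
    """
    terms = doc.terms if hasattr(doc, "terms") else doc
    # Unnormalised weights: term frequency times log10(N / df).
    raw = {t: f * math.log10(ndocs / df[t]) for t, f in terms.items()}
    # Cosine normalisation by the document's weight-vector length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}
```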
Task 2.3: Define a main function to print out the top 12 terms (with their tf*idf weights) for each document in Rnews_v1 that has more than 12 terms, and save the output into a text file (file name “your full name_Q2.txt”).
You also need to implement a tf*idf based IR model. You can assume the titles of the XML documents (the contents of the <title> tags) are queries, and rank the documents in the collection for a given query using their tf*idf weights.
At last, append the output (in descending order) into the text file (“your full name_Q2.txt”).
Example of output for this task
Document 807606 contains 60 terms
bid : 0.27477268692397266
insur : 0.2433890479622151
great : 0.2433890479622151
west : 0.2433890479622151
myhal : 0.2259384978055295
per : 0.2259384978055295
trilon : 0.19574301597552804
stake : 0.19574301597552804
offer : 0.19574301597552804
billion : 0.1579242327908045
match : 0.1579242327908045
bank : 0.14824878002268424
tender : 0.14642954913050904
royal : 0.13856709051308835
doe : 0.13344291648101658
wait : 0.13344291648101658
fail : 0.13344291648101658
georg : 0.13344291648101658
…
The Ranking Result for query: BELGIUM: MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.
741299 : 0.7258206779073599
809481 : 0.06517635855336815
807600 : 0.038674645645810815
780723 : 0
741309 : 0
780718 : 0
783803 : 0
809495 : 0
783802 : 0
807606 : 0
…
Question 3. BM25-based IR model
The BM25 IR model is a popular and effective ranking algorithm, which uses Eq. (3) to calculate a document score (ranking) for a given query Q and document D, where the base of the log is 2. You may review the lecture notes for the meaning of each variable in the equation.
You can use the BowDoc collection to work out some variables, such as N and ni (you may
assume R = ri = 0).
Task 3.1: Define a Python function avg_doc_len(coll) to calculate and return the average document length of all documents in the collection coll.
• In the BowDoc class, for the variable doc_len (the document length), add accessor (get) and mutator (set) methods for it.
• You may modify your code defined in Question 1 by calling the mutator method of doc_len to save the document length in a BowDoc object when creating it. At the same time, sum up every BowDoc’s doc_len as totalDocLength; then, at the end, calculate and return the average document length.
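The steps above can be sketched as the following function, which assumes the dictionary collection structure and that each BowDoc object already carries a doc_len attribute (alternatively, accumulate totalDocLength while parsing, as the bullet suggests):

```python
def avg_doc_len(coll):
    """Return the average document length over all BowDoc objects in coll."""
    if not coll:
        return 0.0
    total = sum(doc.doc_len for doc in coll.values())  # totalDocLength
    return total / len(coll)
```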
Task 3.2: Use Eq (3) to define a python function bm25(coll, q, df) to calculate documents’ BM25 score for a given original query q, where df is a {term:df, …} dictionary. Please note you should parse query using the same method as parsing documents (you can call function parse_query() that you defined for Question 1). For the given query q, the function returns a dictionary of {docID: bm25_score, … } for all documents in collection coll.
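A sketch of bm25() is given below. Since Eq. (3) is not reproduced here, this assumes the standard BM25 formulation with the R = ri = 0 simplification, so the idf component reduces to log2((N − ni + 0.5) / (ni + 0.5)); the parameter values k1 = 1.2, k2 = 100 and b = 0.75 are common defaults and should be checked against the lecture notes. In your submission, the inlined query parsing should be replaced by a call to the parse_query() you defined in Task 1.2 (the extra stop_words parameter here is only for that inlined parse).

```python
import math
import re

def bm25(coll, q, df, stop_words=(), k1=1.2, k2=100, b=0.75):
    """Return {docID: BM25 score} for original query q, assuming R = ri = 0."""
    N = len(coll)
    avdl = sum(d.doc_len for d in coll.values()) / N   # average document length
    # Minimal stand-in for parse_query() from Task 1.2 (no stemming).
    qf = {}
    for tok in re.findall(r"[A-Za-z]+", q):
        t = tok.lower()
        if t not in stop_words:
            qf[t] = qf.get(t, 0) + 1
    scores = {}
    for docID, doc in coll.items():
        K = k1 * ((1 - b) + b * doc.doc_len / avdl)    # length normalisation
        s = 0.0
        for t, qfi in qf.items():
            fi = doc.terms.get(t, 0)
            if fi == 0:
                continue
            ni = df.get(t, 0)
            # idf with the R = ri = 0 simplification; can go negative
            # when ni is close to N, as the assignment notes.
            idf = math.log2((N - ni + 0.5) / (ni + 0.5))
            s += idf * ((k1 + 1) * fi / (K + fi)) * ((k2 + 1) * qfi / (k2 + qfi))
        scores[docID] = s
    return scores
```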
Task 3.3: Define a main function to implement a BM25-based IR model to rank documents in the given document collection Rnews_v1 using your functions.
• You are required to test all the following queries:
o This British fashion
o All fashion awards
o The stock markets
o The British-Fashion Awards
• The BM25-based IR model needs to print out the ranking result (in descending order) of top-5 possible relevant documents for a given query and append outputs into the text file (“your full name_Q3.txt”).
Example of output for this question (Note that you may get negative BM25 scores because N is not large enough and ni can be close to N. You can fix this by increasing N)
Average document length for this collection is: 272.3
The query is: This British fashion
The following are the BM25 score for each document:
Document ID: 741299, Doc Length: 199 — BM25 Score: 0.0
Document ID: 780723, Doc Length: 124 — BM25 Score: 0.0
Document ID: 741309, Doc Length: 104 — BM25 Score: 2.938396824948826
Document ID: 780718, Doc Length: 107 — BM25 Score: 0.0
Document ID: 783803, Doc Length: 490 — BM25 Score: 0.0
Document ID: 809481, Doc Length: 151 — BM25 Score: 0.0
Document ID: 809495, Doc Length: 703 — BM25 Score: 0.0
Document ID: 783802, Doc Length: 120 — BM25 Score: 8.551296313403592
Document ID: 807600, Doc Length: 538 — BM25 Score: 0.0
Document ID: 807606, Doc Length: 187 — BM25 Score: 0.0
The following are possibly relevant documents retrieved –
783802 8.551296313403592
741309 2.938396824948826
741299 0.0
780723 0.0
780718 0.0
…
Please Note
• You can add more methods, variables or functions. For any new one, you should provide comments explaining its definition and usage.
• Your programs should be well laid out, easy to read and well commented.
• All items submitted should be clearly labelled with your name or student number.
• Marks will be awarded for programs (correctness, programming style, elegance,
commenting) and outputs, according to the marking guide.
• You will lose marks for inaccurate outputs, code problems or errors, or missing required
files or comments.
END OF ASSIGNMENT 1