INFS7410 Project – Part 1
version 1.0
Preamble
The due date for this assignment is 25 September 2020 23:59 Eastern Australia Standard Time.
This part of the project is worth 15% of the overall mark for INFS7410 (part 1 + part 2 = 30%). A detailed marking sheet for this assignment is provided at the end of this document.
The project is to be completed within your group, but marked individually. Every student in a group should aim to make an equal contribution to each part of the project. You will receive a single mark for your group submission. You will also need to complete a peer assessment component for this project. The group mark will be weighted according to the rounded median of the peer assessments you receive from your team mates.
We recommend that you make an early start on this assignment and proceed in steps. There are a number of activities you can already tackle, including setting up the pipeline, manipulating the queries, implementing some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you should already have encountered in the computer practicals; however, there are some hidden challenges here and there that may require some time to solve.
Aim
Project aim: The aim of this project is to implement a number of information retrieval methods, evaluate them and compare them in the context of a real use-case.
Project Part 1 aim
The aim of Part 1 is to:
Set up your infrastructure to index the collection and evaluate queries.
Implement common information retrieval baselines.
Tune your retrieval implementations to improve their effectiveness.
Implement rank fusion methods.
The Information Retrieval Task: Web Passage Ranking
In this project we will consider the problem of open-domain passage ranking in answer to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (maximum 1000 passages to be retrieved).
The provided queries are real queries submitted to the Microsoft Bing search engine. In the collection, there are approximately 8.8 million passages and the goal is to rank them based on their relevance to the queries.
What we provide you with
A collection of 8.8 million text passages extracted from web pages
A list of queries to be used for training. Training is divided further into a small batch (train_set.small.txt) and a large batch (train_set.txt). You should use the small batch of training queries for developing your initial implementations and the large batch for tuning your implementations. If you have problems running the larger set because it takes too long (unlikely for Part 1), then use the small training set for tuning too.
A list of queries to be used for development (dev_set.txt). You should perform your evaluation and analysis of your implementations on this dataset.
Relevance assessments (qrels) for both the training (train_set.qrels, train_set.small.qrels) and development (dev_set.qrels) portions of the queries.
A Java Maven project that contains the Terrier dependencies and a skeleton code to give you a start.
A template for your project report.
Please find the document collection of passages, the queries and relevance assessments, and the terrier.properties file to perform indexing at the following URL: http://ielab-data.uqcloud.net/dataset/infs7410-project-data
What you need to produce
You need to produce:
Correct implementations of the methods required by this project specification.
Correct evaluation, analysis and comparison of the evaluated methods, written up into a report following the provided template.
A project report that, following the provided template, details: an explanation of the retrieval methods used, including the formulas that represent the models you implemented and a snapshot of the code that implements each formula; an explanation of the evaluation settings followed; the evaluation results (as described above), inclusive of analysis; and a discussion of the findings.
Required methods to implement
In Part 1 of the project you are required to implement the following retrieval methods. All implementations should be based on your own code. An illustrative sketch of how index statistics can be accessed through the Terrier API is provided after the list below.
1. TF-IDF: create your own implementation using the Terrier API to extract index statistics. See the videos in Week 4 for background information.
2. BM25: create your own implementation using the Terrier API to extract index statistics. See the videos in Week 4 for background information.
3. Pseudo-relevance feedback using BM25 (by using the full version of the RSJ weight): create your own implementation using the Terrier API to extract index statistics. See the videos in Week 4 for background information.
4. Jelinek-Mercer Language Model: create your own implementation using the Terrier API to extract index statistics. See the videos in Week 4 for background information.
5. Dirichlet Language Model: create your own implementation using the Terrier API to extract index statistics. See the videos in Week 4 for background information.
6. The rank fusion method Borda; you need to create your own implementation of this. See the videos in Week 5 for background information.
7. The rank fusion method CombSUM; you need to create your own implementation of this. See the videos in Week 5 for background information.
8. The rank fusion method CombMNZ; you need to create your own implementation of this. See the videos in Week 5 for background information.
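To illustrate how the Terrier API can be used to extract the index statistics these methods rely on, below is a minimal BM25 scoring sketch. It assumes an index built with the provided terrier.properties; the class name, method signature and the way query terms are supplied are illustrative assumptions, not part of the provided skeleton code.

    import org.terrier.structures.CollectionStatistics;
    import org.terrier.structures.Index;
    import org.terrier.structures.LexiconEntry;
    import org.terrier.structures.postings.IterablePosting;

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative BM25 scorer: names and structure are assumptions, not the provided skeleton.
    public class Bm25Scorer {

        // queryTerms are assumed to be tokenised and stemmed consistently with the index.
        public static Map<Integer, Double> score(Index index, String[] queryTerms,
                                                 double k1, double b) throws Exception {
            CollectionStatistics stats = index.getCollectionStatistics();
            double numDocs = stats.getNumberOfDocuments();
            double avgDocLen = stats.getAverageDocumentLength();

            Map<Integer, Double> scores = new HashMap<>();
            for (String term : queryTerms) {
                LexiconEntry le = index.getLexicon().getLexiconEntry(term);
                if (le == null) continue;                      // term not in the index
                double docFreq = le.getDocumentFrequency();
                double idf = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));

                IterablePosting postings = index.getInvertedIndex().getPostings(le);
                while (postings.next() != IterablePosting.EOL) {
                    double tf = postings.getFrequency();
                    double docLen = postings.getDocumentLength();
                    double tfPart = (tf * (k1 + 1))
                            / (tf + k1 * (1 - b + b * docLen / avgDocLen));
                    scores.merge(postings.getId(), idf * tfPart, Double::sum);
                }
                postings.close();
            }
            return scores;   // sort by descending score and keep the top 1000 passages
        }
    }

The same pattern (collection statistics, lexicon entry, inverted-index postings) underpins TF-IDF, the language models and the RSJ weighting; only the scoring formula changes (and, for pseudo-relevance feedback, the top-ranked passages are additionally used as the pseudo-relevant set).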
For each of the BM25, Jelinek-Mercer Language Model, and Dirichlet Language Model implementations, you are also required to tune the parameters of these methods. You must perform a parameter search over 5 sensibly chosen parameter values depending on the method (10 when the method has two parameters). For Pseudo-relevance feedback using BM25, set the best parameter values obtained with BM25.
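As an example of the tuning regime (the candidate values are examples, not prescribed choices), the sketch below searches over five values of the Dirichlet smoothing parameter mu on the large training set. runAndEvaluate is a hypothetical helper that writes a run for the training queries with the given parameter and returns its MRR@100 against train_set.qrels.

    // Hypothetical grid search for the Dirichlet LM smoothing parameter mu.
    // The candidate values below are examples only; choose your own sensible values.
    double[] muValues = {500, 1000, 1500, 2000, 2500};
    double bestMrr = -1.0;
    double bestMu = Double.NaN;
    for (double mu : muValues) {
        double mrrAt100 = runAndEvaluate(mu);   // hypothetical helper: produce run + evaluate on train qrels
        if (mrrAt100 > bestMrr) {
            bestMrr = mrrAt100;
            bestMu = mu;
        }
    }
    // Produce the final Dirichlet LM run on dev_set.txt using bestMu, and report that run.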
For the rank fusion methods, consider fusing the highest performing tuned run from each of the TF-IDF, BM25, Pseudo-relevance feedback using BM25, Jelinek-Mercer Language Model, and Dirichlet Language Model implementations.
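As an illustration of the fusion step, the sketch below implements CombSUM and CombMNZ over runs represented as maps from passage id to score; the class and method names are assumptions, and score normalisation (should you choose to apply any) is omitted. Borda follows the same structure but accumulates rank-based points rather than retrieval scores.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative CombSUM / CombMNZ fusion; names and structure are assumptions.
    public class Fusion {

        // CombSUM: fused score = sum of a passage's scores across the runs.
        // CombMNZ: CombSUM multiplied by the number of runs that retrieved the passage.
        public static Map<Integer, Double> combine(List<Map<Integer, Double>> runs, boolean mnz) {
            Map<Integer, Double> summed = new HashMap<>();
            Map<Integer, Integer> hits = new HashMap<>();
            for (Map<Integer, Double> run : runs) {
                for (Map.Entry<Integer, Double> e : run.entrySet()) {
                    summed.merge(e.getKey(), e.getValue(), Double::sum);
                    hits.merge(e.getKey(), 1, Integer::sum);
                }
            }
            if (mnz) {
                summed.replaceAll((docid, score) -> score * hits.get(docid));
            }
            return summed;   // sort descending and keep the top 1000 for the fused run
        }
    }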
We strongly recommend you use the provided Maven project to implement your project. You should have already attempted many of the implementations above as part of the computer prac exercises.
In the report, detail how the methods were implemented and which parameters you chose for tuning. Report only the results obtained after tuning.
Required evaluation to perform
In Part 1 of the project you are required to perform the following evaluation:
1. For all methods, train on the large training set of queries (train here means you use this data to tune any parameter of a retrieval model, e.g., k1 and b for BM25, etc.) and test on the development set of queries (using the parameter values you selected from the training set).
2. Report the results of every method on the training set (only for the run from which you selected the tuned parameters) and on the development set, separately, in tables. Perform statistical significance analysis across the results of the methods and report it in the tables.
3. Produce a gain-loss plot that compares BM25 vs. Pseudo-relevance feedback using BM25; and plots that compare BM25 vs. each rank fusion method.
4. Comment on trends and differences observed when comparing your findings. Is there a method that consistently outperforms the others?
5. Provide insights into whether rank fusion works and, if it does not, why, e.g., with respect to the runs considered in the fusion process, the queries, etc.
In terms of evaluation measures, evaluate the retrieval methods with respect to mean reciprocal rank at 100 passages retrieved (MRR@100, recip_rank in trec_eval) using trec_eval. Remember to set the cut-off value (-M, i.e., the maximum number of documents per topic to use in evaluation) to 100. You should use this measure as the target measure for tuning. Using trec_eval, also compute nDCG at 3 and MRR at 10.
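For reference, the evaluation above can be run with commands along the following lines (the qrels and run file names are placeholders for your own files):

    trec_eval -M 100 -m recip_rank dev_set.qrels bm25.run      # MRR@100 (target measure)
    trec_eval -M 10 -m recip_rank dev_set.qrels bm25.run       # MRR@10
    trec_eval -m ndcg_cut.3 dev_set.qrels bm25.run             # nDCG@3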
For all gain-loss plots, produce them with respect to MRR@100.
For all statistical significance analysis, use a paired t-test; distinguish between p < 0.05 and p < 0.01.
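For the significance analysis, one option is the paired t-test implementation in Apache Commons Math (commons-math3), which you would need to add to the Maven project if it is not already a dependency. The sketch below assumes per-query MRR@100 values for two methods, aligned by query (per-query values can be obtained with trec_eval's -q option); loadPerQueryMrr is a hypothetical helper.

    import org.apache.commons.math3.stat.inference.TTest;

    // Hypothetical: per-query MRR@100 values for two methods, aligned by query id,
    // e.g. parsed from trec_eval -q output; loadPerQueryMrr is an assumed helper.
    double[] bm25Scores = loadPerQueryMrr("bm25.per_query");
    double[] fusionScores = loadPerQueryMrr("combsum.per_query");

    // pairedTTest returns the two-tailed p-value of the paired t-test.
    double pValue = new TTest().pairedTTest(bm25Scores, fusionScores);
    if (pValue < 0.01) {
        System.out.println("significant at p < 0.01");
    } else if (pValue < 0.05) {
        System.out.println("significant at p < 0.05");
    } else {
        System.out.println("not statistically significant");
    }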
How to submit
You will have to submit 3 files:
1. The report, formatted according to the provided template, saved as PDF document.
2. A zip file containing a folder called runs-part1, which contains the eight runs (result files) you have created for the implemented methods on the development set.
3. A zip file containing a folder called code-part1, which contains all the code needed to re-run your experiments. You do not need to include your compiled code or your index in this zip file. You may need to include additional files, e.g., if you manually process the topic files into an intermediate format (rather than processing them automatically from the files we provide), so that we can re-run your experiments to confirm your results and implementation.
All items need to be submitted via the relevant Turnitin link in the INFS7410 Blackboard site by 25 September 2020, 23:59 Eastern Australia Standard Time, unless you have been granted an extension (according to UQ policy) before the due date of the assignment. Peer assessment is to be completed by 27 September 2020, 23:59 Eastern Australia Standard Time.
INFS 7410 Project Part 1 – Marking Sheet
Criterion: IMPLEMENTATION (weight: 7)
The ability to:
• understand, implement and execute common IR baselines and relevance feedback
• understand, implement and execute rank fusion methods
• perform text processing

7 (100%):
• Correctly implements the specified baselines, relevance feedback and the rank fusion methods

4 (50%):
• Correctly implements the specified baselines and relevance feedback

1 – Fail (0%):
• No implementation, or implements only the baselines

Criterion: EVALUATION (weight: 7)
The ability to:
• empirically evaluate and compare IR methods
• analyse the results of empirical IR evaluation
• analyse the statistical significance of differences between IR methods' effectiveness

7 (100%):
• Correct empirical evaluation has been performed
• Uses all required evaluation measures
• Correct handling of the tuning regime (train/test)
• Reports all results for the provided query sets in appropriate tables
• Provides graphical analysis of results on a query-by-query basis using appropriate gain-loss plots
• Provides correct statistical significance analysis within the result tables, and correctly describes the statistical analysis performed
• Provides a written understanding and discussion of the results with respect to the methods
• Provides examples of where fusion works and where it does not, and why, e.g., discussion with respect to queries and runs

4 (50%):
• Correct empirical evaluation has been performed
• Uses all required evaluation measures
• Correct handling of the tuning regime (train/test)
• Reports all results for the provided query sets in appropriate tables
• Provides graphical analysis of results on a query-by-query basis using appropriate gain-loss plots
• Does not perform statistical significance analysis, or errors are present in the analysis

1 – Fail (0%):
• No or only partial empirical evaluation has been conducted, e.g., only on one set, or on a subset of topics
• Reports only a partial set of evaluation measures
• Fails to correctly handle training and testing partitions, e.g., trains on the test set, or reports only overall results

Criterion: WRITE UP (binary score: 0/1; weight: 1)
The ability to:
• use fluent language with correct grammar, spelling and punctuation
• use appropriate paragraph and sentence structure
• use an appropriate style and tone of writing
• produce a professionally presented document, according to the provided template

1 (100%):
• Structure of the document is appropriate and meets expectations
• Clarity promoted by consistent use of standard grammar, spelling and punctuation
• Sentences are coherent
• Paragraph structure effectively developed
• Fluent, professional style and tone of writing
• No proofreading errors
• Polished, professional appearance

0 (Fail):
• Written expression and presentation are incoherent, with little or no structure, well below the required standard
• Structure of the document is not appropriate and does not meet expectations
• Meaning unclear, as grammar and/or spelling contain frequent errors
• Disorganised or incoherent writing