INFS7410 Project – Part 2
version 1.0
Preamble
The due date for this assignment is 28 October 2021 16:00 Eastern Australia Standard Time.
This part of the project is worth 20% of the overall mark for INFS7410 (part 1 + part 2 = 40%). A detailed marking sheet for this assignment is provided alongside this notebook. The project is to be completed individually.
We recommend that you make an early start on this assignment and proceed in steps. There are several activities you may have already tackled, including setting up the pipeline, manipulating the queries, implementing some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you should already have encountered in the computer practicals; however, there are some hidden challenges here and there that may take you some time to solve.
Aim
Project aim: The aim of this project is for you to implement several neural information retrieval methods, evaluate them and compare them in the context of a multi-stage ranking pipeline.
The specific objectives of Part 2 are to:
Set up your infrastructure to index the collection and evaluate queries.
Implement neural information retrieval models (only inference).
Implement multi-stage ranking pipelines, i.e., BM25 + neural rankers.
The Information Retrieval Task: Web Passage Ranking
As in part 1 of the project, in part 2 we will consider the problem of open-domain passage ranking in answer to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (maximum 1000 passages to be retrieved).
The provided queries are actual queries submitted to the Microsoft Bing search engine. There are approximately 8.8 million passages in the collection, and the goal is to rank them based on their relevance to the queries.
What we provide you with:
Files from practical
A collection of 8.8 million text passages extracted from web pages (collection.tsv, provided in Week 1).
A query file that contains 43 queries for you to perform retrieval experiments (queries.tsv, provided in Week 2).
A qrel file containing relevance judgements to tune your methods (qrels.txt, provided in Week 2).
PyTorch model files for ANCE.
Extra files for this project
A leaderboard system for you to evaluate how well your system performs.
A test query file that contains 54 queries for you to generate run files to submit to the leaderboard (test_queries.tsv).
This Jupyter notebook, in which you will include your implementation and report.
An hdf5 file that contains TILDEv2 pre-computed term weights for the collection. Download from this link
Put this notebook and the provided files in the same directory.
What you need to produce
You need to produce:
Correct implementations of the methods required by this project specification.
An explanation of the retrieval methods used, including the formulas that represent the models you implemented and the code that implements those formulas, an explanation of the evaluation settings followed, and a discussion of the findings. Please refer to the marking sheet to understand how each of these requirements is graded.
You are required to produce both of these within this Jupyter notebook.
Required methods to implement
In Part 2 of the project, you are required to implement the following retrieval methods. All implementations should be based on your code (except for BM25, where you can use the Pyserini built-in SimpleSearcher).
Dense Retriever (ANCE): Use ANCE to re-rank BM25 top-k documents. See the practical in Week 10 for background information.
TILDEv2: Use TILDEv2 to re-rank BM25 top-k documents. See the practical in Week 10 for background information.
Three-stage ranking pipeline: Use TILDEv2 to re-rank BM25 top-k documents, then use monoBERT to re-rank TILDEv2 top-k documents. See the practical in Week 9 and Week 10 for background information.
You can choose any value for the cut-off k, but be aware that these neural models are slow to perform inference on the CPU, so a large k might be infeasible. You are free to use Colab, but make sure you copy your code into this notebook.
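The role of the cut-off k in a multi-stage pipeline can be sketched as follows. This is a minimal illustration under assumptions, not the required implementation: `rerank`, `three_stage` and the scorer callables are hypothetical names. In practice the first-stage ranking would come from BM25, and the scorers would wrap TILDEv2 and monoBERT and look up the passage text for each docid.

```python
from typing import Callable, List, Tuple

Ranking = List[Tuple[str, float]]  # (docid, score) pairs, best first
Scorer = Callable[[str, str], float]  # hypothetical: (query, docid) -> score


def rerank(query: str, ranking: Ranking, scorer: Scorer, k: int) -> Ranking:
    """Re-score the top-k entries with `scorer` and keep the rest in order."""
    head, tail = ranking[:k], ranking[k:]
    rescored = sorted(
        ((docid, scorer(query, docid)) for docid, _ in head),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Scores from different stages are not comparable, so the untouched tail
    # is given decreasing scores below the re-ranked head. This matters
    # because tools such as trec_eval order documents by score.
    floor = rescored[-1][1] if rescored else 0.0
    shifted_tail = [(docid, floor - 1.0 - i) for i, (docid, _) in enumerate(tail)]
    return rescored + shifted_tail


def three_stage(query: str, first_stage: Ranking, stage2: Scorer,
                stage3: Scorer, k2: int, k3: int) -> Ranking:
    """First stage -> re-rank top k2 with stage2 -> re-rank top k3 with stage3."""
    ranking = rerank(query, first_stage, stage2, k2)
    return rerank(query, ranking, stage3, k3)
```

Choosing k3 smaller than k2 keeps the slowest model on the shortest list, which is one way to keep CPU inference time manageable.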
For TILDEv2, unlike what you did in the practical, we provide pre-computed term weights for the whole collection (for more details, see the Initial packages and functions cell). This means TILDEv2 re-ranking can be fast. Use this advantage to trade off effectiveness and efficiency in your three-stage ranking pipeline implementation.
You should have already attempted many of the implementations above as part of the computer practical exercises.
Required evaluation to perform
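To quantify the efficiency side of this trade-off, average query latency can be measured with a small timing helper. The sketch below is an assumption about how one might do it: `average_latency` and `run_query` are hypothetical names, and `run_query` stands for whatever function runs your full pipeline for one query.

```python
import time
from typing import Callable, Dict, List, Tuple


def average_latency(run_query: Callable[[str], object],
                    queries: Dict[str, str]) -> Tuple[float, List[float]]:
    """Return (mean seconds per query, list of per-query seconds)."""
    if not queries:
        raise ValueError("queries must not be empty")
    timings = []
    for _qid, text in queries.items():
        start = time.perf_counter()  # monotonic, high-resolution clock
        run_query(text)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings), timings
```

Timing each query separately also gives per-query values, which can help when checking whether a few long queries dominate the average.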
In Part 2 of the project, you are required to perform the following evaluation:
For all methods, report effectiveness using queries.tsv and qrels.txt, and submit your runs on test_queries.tsv to the leaderboard system, using the parameter values you selected on queries.tsv.
Report every method’s effectiveness and efficiency (average query latency) on queries.tsv, together with the corresponding cut-off k, in a table. Perform statistical significance analysis across the results of the methods and report the outcomes in the table.
Produce a gain-loss plot that compares the most and least effective of the three required methods above in terms of nDCG@10 on queries.tsv.
Comment on trends and differences observed when comparing your findings. Is there a method that consistently outperforms the others on the queries.tsv and the test_queries.tsv?
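The gain-loss plot mentioned in the list above is built from per-query differences in nDCG@10 between two methods. A minimal sketch follows, assuming you already have per-query nDCG@10 values as dictionaries keyed by query id. The function names, the output file name and the colour choices are illustrative assumptions, and the plotting helper needs matplotlib.

```python
from typing import Dict, List, Tuple


def gain_loss(best: Dict[str, float],
              worst: Dict[str, float]) -> List[Tuple[str, float]]:
    """Per-query difference (best - worst), sorted from largest gain to largest loss."""
    shared = sorted(set(best) & set(worst))
    deltas = [(qid, best[qid] - worst[qid]) for qid in shared]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


def plot_gain_loss(deltas: List[Tuple[str, float]],
                   path: str = "gain_loss.png") -> None:
    """Draw one bar per query: above zero is a gain, below zero is a loss."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    qids = [qid for qid, _ in deltas]
    values = [delta for _, delta in deltas]
    colors = ["tab:green" if v >= 0 else "tab:red" for v in values]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(range(len(values)), values, color=colors)
    ax.axhline(0, color="black", linewidth=0.8)
    ax.set_xticks(range(len(values)))
    ax.set_xticklabels(qids, rotation=90)
    ax.set_xlabel("Query id")
    ax.set_ylabel("nDCG@10 difference")
    fig.tight_layout()
    fig.savefig(path)
```

Sorting the differences makes it easy to see how many queries improve, how many degrade, and by how much.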
Regarding evaluation measures, evaluate the retrieval methods with respect to nDCG at 10 (ndcg_cut_10). You should use this measure as the target measure for tuning. Also compute reciprocal rank at 1000 (recip_rank), MAP (map) and Recall at 1000 (recall_1000).
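To make the target measure concrete, the sketch below computes nDCG@k and reciprocal rank for a single query. It is an illustration under assumptions, not a replacement for the evaluation tool used in the practicals: the function names are hypothetical, the gain is taken to be the raw relevance label with a log2(rank + 1) discount, and `min_rel` (the label counted as relevant for reciprocal rank) is a parameter you would need to set to match your evaluation settings.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_docids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; unjudged documents contribute zero gain."""
    gains = [max(qrels.get(docid, 0), 0) for docid in ranked_docids[:k]]
    ideal = sorted((rel for rel in qrels.values() if rel > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_docids: List[str], qrels: Dict[str, int],
                    min_rel: int = 1) -> float:
    """1 / rank of the first document with label >= min_rel, else 0."""
    for rank, docid in enumerate(ranked_docids, start=1):
        if qrels.get(docid, 0) >= min_rel:
            return 1.0 / rank
    return 0.0
```

Averaging these per-query values over all queries gives the figure reported for a run, and the per-query values are what the significance test and gain-loss plot operate on.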
For all statistical significance analyses, use a paired t-test and distinguish between p<0.05 and p<0.01.
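A paired t-test compares two methods on the same queries, so it needs per-query scores aligned by query id. The sketch below assumes SciPy is installed and uses `scipy.stats.ttest_rel`; the helper names and the `*` / `**` markers are illustrative choices, not a required notation.

```python
from typing import Dict, Tuple

from scipy import stats


def significance_marker(p_value: float) -> str:
    """'**' for p < 0.01, '*' for p < 0.05, empty string otherwise."""
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


def paired_ttest(scores_a: Dict[str, float],
                 scores_b: Dict[str, float]) -> Tuple[float, float, str]:
    """Two-sided paired t-test over the queries present in both runs."""
    qids = sorted(set(scores_a) & set(scores_b))
    a = [scores_a[q] for q in qids]
    b = [scores_b[q] for q in qids]
    t_stat, p_value = stats.ttest_rel(a, b)
    return float(t_stat), float(p_value), significance_marker(float(p_value))
```

The returned marker can be appended to the corresponding cell of the results table, with a caption stating which baseline each method is compared against.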
How to submit
You will have to submit one file:
A zip file containing this notebook (.ipynb) and this notebook as a PDF document. We should be able to execute the code. Remember to include all your discussion and analysis in this notebook as well, not in a separate file.
It must be submitted via the relevant Turnitin link on the INFS7410 Blackboard site by 28 October 2021, 16:00 Eastern Australia Standard Time, unless you have been granted an extension (according to UQ policy) before the due date of the assignment.
Initial packages and functions
Unlike the Week 10 practical, in which we computed contextualized term weights with TILDEv2 on the fly, in this project we provide an hdf5 file that contains pre-computed term weights for all the passages in the collection.
First, pip install the h5py library:
In [2]:
!pip install h5py
Collecting h5py
Downloading h5py-3.4.0-cp37-cp37m-macosx_10_9_x86_64.whl (2.9 MB)
|████████████████████████████████| 2.9 MB 10.4 MB/s eta 0:00:01
Collecting cached-property
Using cached cached_property-1.5.2-py2.py3-none-any.whl (7.6 kB)
Requirement already satisfied: numpy>=1.14.5 in /Users/s4416495/anaconda3/envs/infs7410/lib/python3.7/site-packages (from h5py) (1.21.1)
Installing collected packages: cached-property, h5py
Successfully installed cached-property-1.5.2 h5py-3.4.0
The following cell gives you an example of how to use the file to access token weights and their corresponding token ids given a document id.
In [18]:
import h5py
from transformers import BertTokenizer

# Open the pre-computed TILDEv2 term weights in read-only mode.
f = h5py.File("tildev2_weights.hdf5", "r")
weights_file = f["documents"][:]  # load the hdf5 file into memory

docid = 0
token_weights, token_ids = weights_file[docid]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
for token_id, weight in zip(token_ids.tolist(), token_weights):
    print(f"{tokenizer.decode([token_id])}: {weight}")
presence: 3.62109375
communication: 7.53515625
amid: 5.79296875
scientific: 6.140625
minds: 6.53515625
equally: 3.400390625
important: 6.296875
success: 7.19140625
manhattan: 9.015625
project: 5.45703125
scientific: 5.1640625
intellect: 7.328125
cloud: 6.1171875
hanging: 3.318359375
impressive: 6.5234375
achievement: 6.48828125
atomic: 8.421875
researchers: 4.9375
engineers: 6.203125
what: -1.1708984375
success: 6.421875
truly: 3.67578125
meant: 4.25
hundreds: 3.19140625
thousands: 2.98828125
innocent: 5.12890625
lives: 3.029296875
ob: 2.35546875
##lite: 1.427734375
##rated: 2.828125
importance: 7.96484375
purpose: 4.69140625
quiz: 3.28515625
scientists: 5.0390625
bomb: 3.7109375
genius: 3.8828125
development: 2.55859375
solving: 3.224609375
significance: 3.90625
successful: 5.0703125
intelligence: 5.35546875
solve: 2.751953125
effect: 1.2392578125
objective: 2.2265625
research: 1.953125
_: -2.36328125
accomplish: 2.759765625
brains: 4.046875
progress: 1.6943359375
scientist: 3.0234375
Note that these token_ids include stopword ids; remember to remove stopword ids from your query tokens.
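Given the token ids and weights above, a TILDEv2-style re-ranking score can be sketched as an exact-match lookup: for each query token that is not a stopword, take the highest weight that token has in the passage, then sum these values. This is a minimal sketch under assumptions. The name `tildev2_score` and the `stopword_ids` set are hypothetical, and taking the maximum over repeated tokens (the example output lists `scientific` and `success` twice with different weights) is our reading of the method, so check it against the Week 10 practical.

```python
from typing import Dict, Iterable, Sequence, Set


def tildev2_score(query_token_ids: Iterable[int],
                  doc_token_ids: Sequence[int],
                  doc_token_weights: Sequence[float],
                  stopword_ids: Set[int]) -> float:
    """Sum, over non-stopword query tokens, of the token's best weight in the passage."""
    # Keep the highest weight for each distinct token id in the passage.
    best: Dict[int, float] = {}
    for token_id, weight in zip(doc_token_ids, doc_token_weights):
        token_id, weight = int(token_id), float(weight)
        if token_id not in best or weight > best[token_id]:
            best[token_id] = weight

    score = 0.0
    for token_id in query_token_ids:
        if token_id in stopword_ids:
            continue  # stopwords in the query are ignored
        score += best.get(token_id, 0.0)  # tokens absent from the passage add nothing
    return score
```

With the cell above, `doc_token_ids` and `doc_token_weights` would be `token_ids` and `token_weights` for one passage, and the query ids could come from `tokenizer.encode(query, add_special_tokens=False)`.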
In [ ]:
# Import all your python libraries and put setup code here.
Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., ANCE
In [ ]:
# Put your implementation of methods here.
When you have described and provided implementations for each method, include a table with statistical analysis here.
For convenience, you can use a tool such as https://www.tablesgenerator.com/markdown_tables or, if you are using pandas, convert dataframes to markdown: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html