Objective
The key objective of this assignment is to learn how to train and evaluate a non-trivial machine learning model. More specifically, the task is called “learning-to-rank”. You will be given a set of training features with relevance labels 0 (not relevant), 1 (partially relevant), and 2 (relevant) for a large set of query-document pairs. Your task is to research this problem, find a suitable solution, train a model, and produce a result file from the test set that will be scored using standard evaluation measures in Information Retrieval.
If you are unfamiliar with IR, you might want to look through the book Learning to Rank for
Information Retrieval by Tie-Yan Liu.
Provided files
The following files are provided:
• A2.pdf : This specification file.
• train.tsv : A large file of labelled query-document pairs suitable for training.
• test.tsv : The holdout set that you will use to create a runfile.
• documents.tsv: A 3-field file containing the document id, the original HTML, and a clean text parse of each document.
• query.tsv: A 2-field file containing the query id and the query text for each query.
The A2.pdf file is on Canvas. The train and test files are in a zip file called A2data.zip that you can download using the URL below, and the documents.tsv and query.tsv files can be found in the optional download below called extradata.zip.
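If you want a quick sanity check of the files, a minimal loading sketch with pandas might look like the one below. The column names, and the assumption that documents.tsv and query.tsv have no header row, are my guesses; inspect the files first and adjust as needed.

import pandas as pd

# Labelled query-document pairs with features (exact columns depend on the file).
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

# Optional raw data, only needed if you plan to engineer your own features.
documents = pd.read_csv("documents.tsv", sep="\t", header=None,
                        names=["doc_id", "html", "clean_text"])
queries = pd.read_csv("query.tsv", sep="\t", header=None,
                      names=["query_id", "query_text"])

print(train.shape, test.shape, documents.shape, queries.shape)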
The Features Provided
Rules of the game
You are allowed to use any Python library you like to solve the problem. There is a wealth of tools to choose from, including pandas, numpy, and scikit-learn for the basic processing, and multiple libraries designed specifically for learning to rank. I will let you find these on your own. It should be easy to find several that will work, and you can try a few to determine which works best. We do need you to ensure your environment is reproducible, though. The correct way to do this is to create an Anaconda environment for a specific version of Python (I strongly suggest 3.8), install any packages you need using pip (not conda), and then generate a requirements.txt file to include with your submission. So, something like:
conda create -n SXXXXXX python=3.8
conda activate SXXXXXX
pip install pandas numpy scikit-learn
pip freeze > requirements.txt
This will create a new environment that you can start in Anaconda using “conda activate SXXXXXX” and exit using “conda deactivate”.
Rubric
The marking will be defined as follows:
• Making your environment reproducible (5/35 marks). So heed my suggestion above.
If you do not submit a correct requirements file and we cannot reproduce your results,
you will lose marks.
• A short write-up of your methodology and model decisions. This should be no more than 4 pages, single column, 12pt font. We will provide an exemplar template to help you know what to cover. (10/35 marks)
• The remaining marks (20/35) are based on effectiveness. We will define 4 cutoff scores. If you achieve only the lowest, it is +5, the second +10, the third +15, and the fourth +20. The cutoffs are based on model complexity, so a very simple model would achieve the bottom quartile score, and a state-of-the-art one should achieve the top quartile score. The score boundaries will be released next week when we have had a chance to finalise all the possibilities. Note that you can also achieve the top quartile even with a simple model if you decide to try your hand at feature engineering. I implemented a pretty simple one myself already and it had a significant impact on performance. So, we are providing the raw document and query data, but you do not have to use it if you do not want to do any feature engineering of your own. You will be able to achieve the top quartile if you use the right model and tune it properly.
Once you have your model working, you should generate a single file called A2.run. There should be 3 tab-separated columns of data in the file: query id, document id, and score.
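As a rough illustration, producing this file might look something like the sketch below, assuming test is a DataFrame of the holdout query-document pairs and model is a fitted ranker with a predict() method; the column names query_id and doc_id are placeholders for whatever your data actually uses.

import pandas as pd

# Score every query-document pair in the holdout set.
feature_cols = [c for c in test.columns if c not in ("query_id", "doc_id")]
scores = model.predict(test[feature_cols])

run = pd.DataFrame({"query_id": test["query_id"],
                    "doc_id": test["doc_id"],
                    "score": scores})

# Three tab-separated columns, no header, as described above.
run.to_csv("A2.run", sep="\t", header=False, index=False)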
Test Collection
You can download the two data files from:
http://wight.seg.rmit.edu.au/jsc/A2/A2data.zip
http://wight.seg.rmit.edu.au/jsc/A2/extradata.zip
The extra data file contains raw data from the documents and queries and is optional. You
only need to get it if you intend to try to create your own features. Be warned, this file is
reasonably large, even though it is compressed. It is around 985MB of raw data.
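If you do decide to build your own features, a very simple example is query-document term overlap computed from the clean text. The sketch below is only an illustration of the kind of feature you could add, not a prescription; the column names follow the loading sketch earlier.

def overlap_features(query_text, doc_text):
    # Count how many distinct query terms appear in the document text.
    q_terms = set(str(query_text).lower().split())
    d_terms = set(str(doc_text).lower().split())
    common = q_terms & d_terms
    return {"n_query_terms": len(q_terms),
            "n_overlap": len(common),
            "overlap_ratio": len(common) / len(q_terms) if q_terms else 0.0}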
Hints
The best solutions are very likely to be pairwise or listwise regression algorithms. Also, you
will want to group by Query ID for reasons I hope are obvious. In other words, if you are
going to create a pair of positive and negative instances, you would want them to be from the
same query so that your model gets better at discriminating between a relevant and non-
relevant document. You want to treat it as regression because you are predicting the most likely score for a document, but all evaluation in web search is based on the idea of returning a “ranked list” (think 10 blue links in Google). These are just documents sorted from most
likely to least likely to be relevant for a particular query. We will cover this more in the
lectorial this week and next week.
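As one concrete (but by no means required) possibility, LightGBM's LGBMRanker implements the pairwise/listwise LambdaRank objective and takes per-query group sizes, which handles the grouping described above. The sketch assumes train has a label column, a query_id column, and numeric feature columns; adjust the names to match the data.

import lightgbm as lgb

# Rows must be contiguous per query so the group sizes line up with the data.
train = train.sort_values("query_id")
feature_cols = [c for c in train.columns if c not in ("query_id", "doc_id", "label")]
group_sizes = train.groupby("query_id", sort=False).size().to_list()

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=500, learning_rate=0.05)
ranker.fit(train[feature_cols], train["label"], group=group_sizes)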