Introduction
Information Retrieval H/M
Exercise 1 January 2020
The general objective of this exercise is to deploy an IR system and evaluate it on a medium-sized Web dataset. Students will use the Terrier Information Retrieval platform (http://terrier.org) to conduct their experiments. Terrier is a modular information retrieval platform, written in Java, which allows experimentation with various test collections and retrieval techniques. This first exercise involves installing and configuring Terrier to evaluate a number of retrieval models and approaches. You will familiarise yourself with Terrier by deploying various retrieval approaches and evaluating their impact on retrieval performance, as well as learning how to conduct an IR experiment and how to analyse its results.
You will need to download the latest version of Terrier from http://terrier.org. We will provide a sample of Web documents (a signed user agreement is required to access and use this sample), on which you will conduct your experiments. We will have a lab dedicated to Terrier, and you can also use the Terrier GitHub forum or the Moodle class forum to ask questions.
Your work will be submitted through the Exercise 1 Quiz Instance available on Moodle. The Quiz asks you various questions, which you should answer based on the experiments you have conducted.
Collection:
You will use a sample of a TREC Web test collection, of approx. 800k documents, with corresponding topics & relevance assessments. Only those who have signed the agreement will have access to this collection (see Moodle). You can find the document corpus and other resources in the Windows directory \\file-alpha\cs-data (you should map this as a Network Drive, e.g. Y:; see https://support.microsoft.com/en-gb/help/4026635/windows-map-a-network-drive).
If you prefer Unix and have a school Unix account, you can find the same document corpus and other resources in the Unix directory /users/level4/software/IR/. In both locations, the directory contains:
• Dotgov_50pc/ (approx. 2.8GB) – the collection to index.
• TopicsQrels/ – topics & qrels for three topic sets from TREC 2004: homepage, named-page, and topic-distillation.
Exercise Specification
There is some required programming in this exercise, but there are also numerous experiments to conduct. In particular, you will carry out three tasks:
1. Index the provided Web collection using Terrier’s default indexing setup.
2. Implement two weighting models (Simple TF*IDF and Vector Space TF*IDF), which you will have to add to Terrier.
3. Evaluate and analyse the resulting system by performing the following experiments:
– Vary the weighting model: Simple TF*IDF vs. Vector Space TF*IDF vs. Terrier’s implemented TF*IDF vs. BM25 vs. PL2.
– Apply a Query Expansion mechanism: retrieval with Query Expansion vs. retrieval without Query Expansion.
There are too many experimental parameters to address all at once, so you must follow the prescribed activities given below. Once you have conducted an activity, answer the corresponding questions on the Exercise 1 Quiz instance. Ensure that you click the “Next Page” button to save your answers on the Quiz instance.
Q1. Start by using Terrier’s default indexing setup: Porter Stemming applied & Stopwords removed. You will need to index the collection, following the instructions in Terrier’s documentation. In addition, we would like you to configure Terrier with the following additional property during indexing:
indexer.meta.reverse.keys=docno
Once you have indexed the collection, answer the Quiz questions asking you to enter the main indexing statistics you obtained (number of tokens, size of files, time to index, etc.).
[1 mark]
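For reference, a minimal sketch of this setup is shown below. It assumes a Terrier 5.x installation, uses an illustrative path for the corpus, and relies on bin/trec_setup.sh to generate etc/collection.spec; if your Terrier version differs, follow the indexing instructions in its own documentation instead.

    # In etc/terrier.properties (Porter stemming + stopword removal is already Terrier's default):
    termpipelines=Stopwords,PorterStemmer
    indexer.meta.reverse.keys=docno

    # From the Terrier installation directory (the corpus path is illustrative):
    bin/trec_setup.sh /path/to/Dotgov_50pc      # writes the list of files to etc/collection.spec
    bin/terrier batchindexing                   # builds the index (by default under var/index/)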
Q2. Implement, test and evaluate two new weighting models in Terrier. Use the template class provided in the IRcourseHM project, available from the course Github repo (https://github.com/cmacdonald/IRcourseHM).
(a) Simple TF*IDF: The Simple TF*IDF weighting model you are required to implement is highlighted in purple in Lecture 3, slide 40. Use base-10 logarithms in your implementation (a skeleton sketch is given after this question).
(b) A simple vector-space implementation, as per Lecture 4, slide 21. Your implementation should use document vectors that contain Simple TF*IDF scores, and query vectors that contain TFs. Marks will be awarded for correctness of implementation – you may use document length as a simpler approximation for document magnitude (and receive partial marks), or you may, for full marks, calculate the exact document magnitude.
See https://github.com/terrier-org/terrier-core/blob/5.x/doc/extend_retrieval.md#using-terrier-indices-in-your-own-code for useful snippets of code.
(c) Upload your two Java source code files when prompted by the Quiz instance. Then, answer the corresponding questions by inspecting the retrieved results for the mentioned weighting models.
[10 marks]
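To illustrate part (a), here is a minimal sketch of a Simple TF*IDF weighting model. It assumes the common lecture formula score(t, d) = qtf x tf x log10(N / df); the package name is hypothetical, and in practice you should start from the template class provided in the IRcourseHM project and verify the exact formula against Lecture 3, slide 40.

    package org.myweighting;   // hypothetical package; use the IRcourseHM template instead

    import org.terrier.matching.models.WeightingModel;

    /** A sketch of a Simple TF*IDF weighting model for Terrier 5.x. */
    public class SimpleTFIDF extends WeightingModel {

        private static final long serialVersionUID = 1L;

        @Override
        public String getInfo() {
            return "SimpleTFIDF";
        }

        @Override
        public double score(double tf, double docLength) {
            // numberOfDocuments (N) and documentFrequency (df) are collection and
            // term statistics that Terrier populates before scoring a query term.
            double idf = Math.log10(numberOfDocuments / documentFrequency);
            // keyFrequency is the frequency of the term in the query.
            return keyFrequency * tf * idf;
        }
    }

Once compiled onto Terrier’s classpath, such a model can typically be selected at retrieval time by giving its fully-qualified class name as the weighting model.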
Q3. Now you will experiment with all five weighting models (Simple TF*IDF, Vector Space TF*IDF, Terrier TF*IDF, BM25 and PL2) and analyse their results on 3 different topic sets, representing different Web retrieval tasks: homepage finding (HP04), named page finding (NP04), and topic distillation (TD04). A description of these topic sets and the underlying search tasks is provided on Moodle.
Provide the required MAP performance of each weighting model over the 3 topic sets, reported to 4 decimal places. Also provide the average MAP performance of each weighting model across the three topic sets, when prompted by the Quiz instance.
[14 marks]
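Since Q3 involves fifteen retrieval runs (5 models x 3 topic sets), a small script helps keep them organised. The sketch below assumes Terrier 5.x, that trec.topics and trec.qrels in etc/terrier.properties already point at the HP04 files from TopicsQrels/, and that the -w option of batchretrieve selects the weighting model (your own models may need their fully-qualified class names). SimpleTFIDF and VectorSpaceTFIDF are placeholders for your Q2 implementations; check option and property names against your Terrier version’s documentation.

    # One topic set, all five weighting models (repeat for NP04 and TD04):
    for model in SimpleTFIDF VectorSpaceTFIDF TF_IDF BM25 PL2; do
        bin/terrier batchretrieve -w "$model"    # result files are written to var/results/
    done
    bin/terrier batchevaluate                    # evaluates the .res files against trec.qrels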
Next, for each topic set (HP04, NP04, TD04), draw a single Recall-Precision graph showing the performances for each of the 5 alternative weighting models (three Recall-Precision graphs in total). Upload the resulting graphs into the Moodle instance when prompted. Then, answer the corresponding question(s) on the Quiz instance.
[6 marks]
Finally, indicate on the Quiz the most effective weighting model (in terms of Mean Average Precision), which you will use for the rest of Exercise 1. To identify this model, simply select the weighting model with the highest average MAP across the 3 topic sets.
[1 mark]
Q4. You will now conduct the Query Expansion experiments using (a) the weighting model that produces the highest average Mean Average Precision (MAP) across the 3 topic sets in Q3, (b) Simple TF*IDF, and (c) Vector Space TF*IDF.
Query expansion has a few parameters, e.g. query expansion model, number of documents to analyse, number of expansion terms – you should simply use the default query expansion settings of Terrier: Bo1, 3 documents, 10 expansion terms.
Run the three different models with Query Expansion on the homepage finding (HP04) and topic distillation (TD04) topic sets. Report the obtained MAP performances in the Quiz instance, to 4 decimal places.
[6 marks]
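As a sketch (again assuming Terrier 5.x), a query-expansion run typically only requires switching query expansion on for an otherwise identical batchretrieve invocation. Here -q is assumed to be the switch that enables Terrier’s default query expansion (Bo1, 3 documents, 10 terms), and PL2 merely stands in for whichever model was best in Q3; check bin/terrier help batchretrieve and the query expansion documentation if this differs in your version.

    # Retrieval with query expansion for one model on the currently configured topic set:
    bin/terrier batchretrieve -w PL2 -q          # PL2 is illustrative; use your best Q3 model
    bin/terrier batchevaluate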
Now, you will delve into the performance of the best retrieval model identified in Question Q3, using only the homepage finding (HP04) and topic distillation (TD04) topic sets. For each of the two topic sets, draw a separate query-by-query histogram comparing the MAP performance of your system with and without query expansion (two histograms in total). Each histogram should show two bars for each query of the topic set: one bar for the MAP performance (i.e. the Average Precision) of the system on that query with query expansion, and one bar for its performance on that query without query expansion. Using these histograms and their underlying data, you should be able to answer the corresponding questions of the Quiz instance.
[6 marks]
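The per-query values behind these histograms are per-query Average Precision scores (MAP computed over a single query is simply that query’s AP). If you have the standard trec_eval tool available, its -q switch prints these directly; the file names below are illustrative, and Terrier’s own evaluation output can be used instead.

    # Per-query AP for a run without and with query expansion (file names illustrative):
    trec_eval -q TopicsQrels/qrels.HP04.txt var/results/best_noQE.res | grep "^map"
    trec_eval -q TopicsQrels/qrels.HP04.txt var/results/best_QE.res   | grep "^map"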
Finally, answer the final analysis questions and complete your Quiz submission.
[6 marks]
Hand-in Instructions: All your answers to Exercise 1 must be submitted on the Exercise 1 Quiz instance, which will be available on Moodle. This exercise is worth 50 marks and 8% of the final course grade.
NB 1: You can (and should) complete your answers to the quiz over several sessions. However, please ensure that you save your intermediate work on the Quiz instance by clicking the “Next Page” button every time you make a change on a given page of the quiz that you want to be saved.
NB 2: To save yourself a lot of time, you are encouraged to write scripts to collect and manage your experimental data (and to ensure that you do not mix up your results), as well as to produce your graphs using any of the many existing plotting tools.