Lab: Using Terrier
Information Retrieval H/M Craig Macdonald January 2020
January 2020 © Terrier Team | University of Glasgow 1
Aims
• Understand the format of TREC test collection documents
• Index a TREC-formatted collection using Terrier
• Understand how to run and evaluate a batch retrieval experiment
• Understand how to extend Terrier with your own code
The IR Problem
• Given an information need representation, retrieve relevant information
How can we methodically investigate new IR approaches?
Main Existing Open-Source IR Platforms
• Non-academic
• Lucene / Nutch / Solr / Elastic (Apache/Elastic)
• Minion (Oracle Labs)
• Xapian (Cambridge)
• Sphinx (Sphinx Inc.)
• Academic
• Terrier (Glasgow)
• Lemur / Indri / Galago (CMU / UMass)
• Zettair (RMIT)
• MG4J (Milano)
Scaling Up IR Experimentation
• A modern IR platform should facilitate…
• … handling large corpora
• … integrating new advanced techniques
• … tackling complex search tasks
• … evaluating multiple experimental scenarios
We hope to show you that Terrier meets all these requirements!
The Terrier Project
• Research project (2001-)
• Researchers, students, interns, and visitors
• Open-source project (2004-)
• Latest release version 5.2 (1/2020)
• Integrates many of Glasgow’s research outcomes into the core project
• Several success cases
• Top performances over the years at TREC Web, Robust, Terabyte, RF, MQ, Enterprise, Entity, Blog, Microblog, Crowdsourcing, Medical and Fair IR tracks
• Top performances at CLEF Adhoc and Web tracks
• Top performances at NTCIR Intent task
• Several knowledge exchange/deployments in industry
• One of the most used platforms in the research community.
Part 0 – Setup
Downloading Terrier
• Please download and decompress the latest version of Terrier from http://terrier.org/
• You will want the binary (“bin”) version, which is precompiled
• Choose the zip or tar.gz as appropriate for Windows vs. Mac/Linux
• Today’s lab is about using Terrier, including indexing and retrieving from a very small corpus of scientific abstracts called Vaswani
Windows vs. Linux/Mac
• Terrier works on Windows, Linux and Mac:
• On Linux/Mac, start a Terminal
• On Windows, start a Command Prompt
• Terrier has commands for Windows and Linux/Mac
• bin/terrier (Linux & Mac) and bin\terrier.bat (Windows) are equivalent
• Back-slash vs forward-slash: These slides were originally written for Linux & Mac.
• We use / to indicate directories
• On the Windows Command Prompt, you should use \
• In Java and configuration files, / should be used
Part 1: Conducting a Batch Retrieval Experiment using Terrier
This gives more context to the Terrier experimentation quickstart document, found at:
https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
Recall: Nomenclature
• Corpus/Collection: a set of documents in files to be indexed
• Topics: a set of queries/information needs in a file
• Qrels/relevance assessments: known relevant documents for each query
Together, the corpus, topics and qrels form a test collection
TREC format Document Collection
• TREC-formatted collection files contain many documents, delimited by <DOC> tags
• SGML: It’s not XML, it’s not HTML
Corpus file from Vaswani: share/vaswani_npl/corpus/doc-text.trec inside the terrier directory
Multiple documents are placed in the same file:
compact memories have flexible capacities a digital data storage system with capacity up to bits and random and or sequential access is described
an electronic analogue computer for solving systems of linear equations mathematical derivation of the operating principle and stability conditions for a computer consisting of amplifiers
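To see how such a file breaks down into documents, here is a minimal Python sketch. It assumes the usual TREC layout, where each document sits between <DOC> and </DOC> and carries its identifier in a <DOCNO> element. The function name and the identifiers d1 and d2 are hypothetical, chosen for illustration; this is not how Terrier itself parses collections.

```python
import re

# Assumed layout: <DOC> <DOCNO>id</DOCNO> text </DOC>, repeated within one file.
DOC_PATTERN = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
DOCNO_PATTERN = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL | re.IGNORECASE)


def parse_trec_documents(text):
    """Return a list of (docno, body) pairs found in TREC-formatted text."""
    documents = []
    for match in DOC_PATTERN.finditer(text):
        block = match.group(1)
        docno_match = DOCNO_PATTERN.search(block)
        if docno_match is None:
            continue  # skip blocks that carry no identifier
        docno = docno_match.group(1).strip()
        # The body is what remains once the DOCNO element is removed.
        body = DOCNO_PATTERN.sub("", block, count=1)
        documents.append((docno, " ".join(body.split())))
    return documents


# Hypothetical identifiers (d1, d2) wrapped around text from the slide.
sample = """<DOC>
<DOCNO>d1</DOCNO>
compact memories have flexible capacities
</DOC>
<DOC>
<DOCNO>d2</DOCNO>
an electronic analogue computer for solving systems of linear equations
</DOC>
"""

docs = parse_trec_documents(sample)
for docno, body in docs:
    print(docno, "->", body)
```

The regular expressions are enough for a small, well-formed file such as this one; a real collection with nested or malformed markup would need a more forgiving parser.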
Configuring Terrier for Indexing
• Set up Terrier for using a TREC test collection by calling
• bin/trec_setup.sh
• This creates default configuration files:
• etc/terrier.properties – records Terrier’s configuration
• etc/collection.spec – list of files to be indexed
On Windows, bin\trec_setup.bat provides the same functionality as bin/trec_setup.sh
• We can change the indexing configuration by specifying properties in etc/terrier.properties, e.g. we might want to change the stemmer
• termpipelines=Stopwords,PorterStemmer
• We would like you to set the following property for indexing:
• indexer.meta.reverse.keys=docno
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
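Editing etc/terrier.properties by hand is usually all that is needed. If you want to script the change, the following Python sketch shows one way to set a key=value line in properties-style text. The helper name set_property and the starting file contents are hypothetical, and the sketch only handles simple one-line key=value entries.

```python
def set_property(properties_text, key, value):
    """Return properties_text with key set to value, replacing any existing entry."""
    new_line = f"{key}={value}"
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # leave comments and blank lines untouched
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)  # key was absent, so add it at the end
    return "\n".join(lines) + "\n"


# Hypothetical starting contents for illustration.
original = "# hypothetical starting contents\nterrier.index.path=var/index\n"
updated = set_property(original, "indexer.meta.reverse.keys", "docno")
moved = set_property(updated, "terrier.index.path", "/path/to/elsewhere")
print(moved)
```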
Using Terrier
• The bin/terrier script is our way of running lots of Terrier commands
$ bin/terrier
Terrier version 5.2
No command specified. You must specify a command. Popular commands:
batchevaluate evaluate all run result files in the results directory
batchindexing allows a static collection of documents to be indexed
batchretrieval performs a batch retrieval “run” over a set of queries
help provides a list of available commands
interactive runs an interactive querying session on the commandline
• You can use bin/terrier help to list all possible commands
On Windows, bin\terrier.bat provides the same functionality as bin/terrier
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
Performing Indexing
• To create an index ready for querying, we use
• bin/terrier batchindexing
• By default, the index is created in var/index/
• You can change the location of the index using the property:
• terrier.index.path=/path/to/elsewhere
• Q0. Now run trec_setup and batchindexing to index the Vaswani corpus
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
Viewing the Index (1)
• batchindexing will create an index in var/index/ consisting (mainly) of:
File | Purpose
data.lexicon.fsomapfile | The term vocabulary
data.document.fsarrayfile | The document index (document lengths)
data.inverted.bf | The inverted index posting lists
data.meta.zdata | Document metadata, e.g. DOCNO, URL
data.properties | Index configuration, e.g. number of tokens, number of documents
data.direct.bf | The direct index (opposite of inverted)
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
Viewing the Index (2)
• See the statistics of the created index
• bin/terrier indexstats
• This will show:
Collection statistics:
number of indexed documents: 10000
size of vocabulary: 34749
number of tokens: 760213
number of pointers: 499379
Terrier will warn during indexing if it finds a document with no words. This is common in some corpora.
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
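These four numbers are enough to derive some familiar collection statistics. The Python sketch below parses the output shown on the slide and computes the average document length (tokens per document) and the average number of unique terms per document (pointers per document, since each pointer is one term-document posting). The function name is hypothetical, and the parsing assumes one "name: number" pair per line.

```python
def parse_index_stats(output):
    """Collect 'name: integer' lines from indexstats-style output into a dict."""
    stats = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


# Output copied from the slide.
stats_output = """Collection statistics:
number of indexed documents: 10000
size of vocabulary: 34749
number of tokens: 760213
number of pointers: 499379
"""

stats = parse_index_stats(stats_output)
num_docs = stats["number of indexed documents"]

# Average document length: total token occurrences divided by documents.
avg_doc_length = stats["number of tokens"] / num_docs
# Average unique terms per document: postings divided by documents.
avg_unique_terms = stats["number of pointers"] / num_docs

print(f"average document length: {avg_doc_length:.2f}")
print(f"average unique terms per document: {avg_unique_terms:.2f}")
```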
Viewing the Index (3)
• Terrier also has commands to view the lexicon and posting list
• bin/terrier indexutil -printlex
aa,term318 Nt=5 TF=5 @{0 0 0}
abac,term3257 Nt=3 TF=5 @{0 37 4}
abandon,term6096 Nt=1 TF=1 @{0 46 4}
abbrevi,term6509 Nt=3 TF=3 @{0 49 6}
…
• bin/terrier indexutil -printpostingfile
0 ID(27) TF(1) ID(2624) TF(1) ID(2769) TF(1) ID(2976) TF(1) ID(3057) TF(1) …
• Think about the IR lectures. What do these numbers mean?
A (lexicon): Nt is the document frequency – needed to calculate IDF; @{x y z} is the pointer into the posting file
A (posting file): These are postings – i.e. document id and frequency pairs where the given term appears
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
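To make the link between Nt and IDF concrete, here is a small Python sketch that reads a lexicon line in the format printed above and computes a textbook IDF, log2(N / Nt). The helper names are hypothetical, the value N = 10000 is taken from the indexstats slide, and this plain IDF is only one common variant, not necessarily the formula a given Terrier weighting model uses.

```python
import math
import re

# Matches lines such as: abac,term3257 Nt=3 TF=5 @{0 37 4}
LEXICON_LINE = re.compile(r"^(?P<term>[^,]+),term(?P<termid>\d+) Nt=(?P<nt>\d+) TF=(?P<tf>\d+)")


def parse_lexicon_line(line):
    """Return (term, document frequency Nt, collection frequency TF)."""
    match = LEXICON_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised lexicon line: {line!r}")
    return match.group("term"), int(match.group("nt")), int(match.group("tf"))


def idf(num_docs, doc_freq):
    """Textbook inverse document frequency: log2(N / Nt)."""
    return math.log2(num_docs / doc_freq)


NUM_DOCS = 10000  # number of indexed documents, from the indexstats slide

lines = [
    "aa,term318 Nt=5 TF=5 @{0 0 0}",
    "abac,term3257 Nt=3 TF=5 @{0 37 4}",
    "abandon,term6096 Nt=1 TF=1 @{0 46 4}",
]
for line in lines:
    term, nt, tf = parse_lexicon_line(line)
    print(f"{term}: Nt={nt} TF={tf} idf={idf(NUM_DOCS, nt):.3f}")
```

Terms that occur in fewer documents receive a higher IDF, which is why "abandon" (Nt=1) scores above "aa" (Nt=5) here.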
Viewing the Index (4)
• Terrier also has a command to view the contents of a document:
• bin/terrier showdocument 4
Document Length: 8
Document Unique Terms: 8
Contents:
comput:1 held:1 confer:1 june:1 societi:1 report:1 british:1 cambridg:1
These are the stemmed terms in document 4
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
Interactive: Getting results for a query
• bin/terrier interactive allows you to enter a query and see what results are returned, from the command line
• bin/terrier interactive
INFO o.t.structures.CompressingMetaIndex – Structure meta reading lookup file into memory
INFO o.t.structures.CompressingMetaIndex – Structure meta loading data file into memory
INFO o.t.applications.InteractiveQuerying – time to intialise index : 0.082
Please enter your query: satellite
INFO o.t.matching.PostingListManager – Query 1 with 1 terms has 1 posting lists
Displaying 1-438 results
0 8764 4.534789609018122
1 745 4.19620394754292
2 2262 4.1542030676443
3 4156 4.115461822146293
Format: rank docno score
Look back into the corpus. Do the top-ranked documents make sense for this query?
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
Batch Retrieval
• Topic files: statements of information needs
• Similar format to collection
• Tell Terrier where the topics file is located using the property
• trec.topics=/path/to/my/topics
• Or the -t option to batchretrieve
• Use bin/terrier batchretrieve to execute each query in the topics
• In TREC/IR parlance, this is called a “run”
Topics file is at share/vaswani_npl/query-text.trec
MEASUREMENT OF DIELECTRIC CONSTANT OF LIQUIDS BY THE USE OF MICROWAVE TECHNIQUES
MATHEMATICAL ANALYSIS AND DESIGN DETAILS OF WAVEGUIDE FED MICROWAVE RADIATIONS
Batch Retrieval Output
• Each run creates a new file in var/results/
• An example of the format is shown here. The columns are:
• queryid
• “Q0”
• docno
• rank
• score
• weighting model name
1 Q0 8172 0 12.190824371396216 DPH
1 Q0 9881 1 10.863125606091643 DPH
1 Q0 5502 2 10.823913858744959 DPH
1 Q0 1502 3 9.606050890574766 DPH
1 Q0 9859 4 9.325724637695417 DPH
Ref: https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
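If you want to post-process a run yourself, the six columns are easy to read back in. The Python sketch below parses the five lines shown on the slide into a per-query ranking. The function name parse_run is hypothetical, and the sketch assumes whitespace-separated fields with exactly six columns per line.

```python
from collections import defaultdict


def parse_run(text):
    """Map each query id to a list of (rank, docno, score, model), ordered by rank."""
    run = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or malformed lines
        qid, _q0, docno, rank, score, model = fields
        run[qid].append((int(rank), docno, float(score), model))
    for entries in run.values():
        entries.sort()  # tuples sort by rank first
    return dict(run)


# Lines copied from the slide.
run_text = """1 Q0 8172 0 12.190824371396216 DPH
1 Q0 9881 1 10.863125606091643 DPH
1 Q0 5502 2 10.823913858744959 DPH
1 Q0 1502 3 9.606050890574766 DPH
1 Q0 9859 4 9.325724637695417 DPH
"""

run = parse_run(run_text)
top_docnos = [docno for _rank, docno, _score, _model in run["1"]]
print(top_docnos)
```

Note that docnos are kept as strings: in general they are identifiers, not numbers.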
Evaluating a Run
• We use a qrels file to tell us what the known relevant documents are for each query
• Terrier can use the qrels file to evaluate runs
• Set trec.qrels property to point to the qrels file
• Use bin/terrier batchevaluate to evaluate the runs in the var/results folder
Example Qrels File:
1 0 9219 1
1 0 9859 1
1 0 9988 1
1 0 10081 1
1 0 10588 1
2 0 414 1
2 0 1894 1
2 0 3785 1
2 0 4720 1
2 0 5894 1
Evaluation Output
• bin/terrier batchevaluate calls the standard trec_eval program internally to evaluate the runs using the specified qrels
• The resulting trec_eval output is stored at var/results/run.eval
• Format is: measure query_or_all value
• You should be familiar with measures such as P, map and iprec_at_recall
You can use trec_eval directly, by running
bin/terrier trec_eval /path/to/qrels /path/to/run.res
num_q all 93
num_ret all 91930
num_rel all 2083
num_rel_ret all 1941
map all 0.2948
gm_map all 0.1980
Rprec all 0.2998
bpref all 0.9359
recip_rank all 0.7134
iprec_at_recall_0.00 all 0.7376
iprec_at_recall_0.10 all 0.6431
iprec_at_recall_0.20 all 0.5232
iprec_at_recall_0.30 all 0.4079
iprec_at_recall_0.40 all 0.3424
iprec_at_recall_0.50 all 0.2710
iprec_at_recall_0.60 all 0.2028
iprec_at_recall_0.70 all 0.1555
iprec_at_recall_0.80 all 0.1075
iprec_at_recall_0.90 all 0.0607
iprec_at_recall_1.00 all 0.0245
P_5 all 0.4667
P_10 all 0.3677
P_15 all 0.3097
P_20 all 0.2747
P_30 all 0.2394
P_100 all 0.1285
P_200 all 0.0789
P_500 all 0.0382
P_1000 all 0.0209
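As a reminder of what P and map measure, here is a minimal Python sketch of precision at k, average precision and MAP. It follows the textbook definitions and is not the trec_eval implementation; the function names are hypothetical. The example uses the five ranked documents and the five relevant documents for query 1 that appear on the earlier slides, so the ranking is truncated and the values are far lower than a full run would give.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / k


def average_precision(ranking, relevant):
    """Sum of precision at each relevant document retrieved, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of AP over the queries that have at least one relevant document."""
    aps = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items() if rel]
    return sum(aps) / len(aps) if aps else 0.0


# Query 1: top five docnos from the run slide, relevant docnos from the qrels slide.
run = {"1": ["8172", "9881", "5502", "1502", "9859"]}
qrels = {"1": {"9219", "9859", "9988", "10081", "10588"}}

p5 = precision_at_k(run["1"], qrels["1"], 5)
ap = average_precision(run["1"], qrels["1"])
map_score = mean_average_precision(run, qrels)
print(f"P@5={p5:.4f} AP={ap:.4f} MAP={map_score:.4f}")
```

Only one of the five retrieved documents (9859, at the fifth position) is relevant, which is what drives both numbers.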
Analysing Evaluation Results
• The evaluation lecture introduced different ways of analysing query results, e.g.
• Interpolated Recall-precision curves
• Query histograms
• Use your favourite graphing tool to create the graphs
• Think carefully about what you are comparing and what makes sense to show
You can get per-query evaluation results:
-p option of batchevaluate OR
bin/terrier trec_eval -q qrelsFile runFile
[Figure: Interpolated R-P. Precision against Recall, for InL2 and DPH]
[Figure: Per-Query Histogram. AP against Topic Number, for InL2 and DPH]
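The points on an interpolated recall-precision curve can be computed directly from a ranking. The Python sketch below uses the standard rule that interpolated precision at recall level r is the highest precision observed at any recall of at least r, evaluated at the 11 levels 0.0 to 1.0. The function name and the toy ranking (documents a to d) are hypothetical; plotting is left to your graphing tool.

```python
def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at each recall level (default: 0.0, 0.1, ..., 1.0)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    curve = []
    for level in levels:
        # Highest precision at any recall >= level, or 0.0 if that recall is never reached.
        candidates = [precision for recall, precision in points if recall >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Hypothetical toy query: relevant documents are found at ranks 1 and 3.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c"}

curve = interpolated_precision(ranking, relevant)
for level, precision in zip([i / 10 for i in range(11)], curve):
    print(f"{level:.1f} {precision:.4f}")
```

To compare two systems such as InL2 and DPH, compute one curve per system over the same topics and plot them on the same axes.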
Changing Batch Retrieval & Evaluation: Crib Sheet
Task | Configuration Property | Command line
Change the weighting model | trec.model | -w of the batchretrieve command
Use query expansion | | -q of the batchretrieve command
Change the location of the index | terrier.index.path |
Specify the topics file | trec.topics | -t of the batchretrieve command
Specify the qrels file | trec.qrels | -q of the batchevaluate command
Change any property without editing the terrier.properties file | | -Dproperty.name=PropertyValue for any command
View the contents of an index | | indexstats or indexutil or showdocument
See per-query evaluation results | | -p of the batchevaluate command OR trec_eval -q qrelsFile runFile (see the Evaluation Output slide)
Key Points Thus Far (on a Windows Lab machine)
• Download and extract the Terrier zip file (download the zip file for Windows)
• Open a Command Prompt and cd to the Terrier directory
M:
cd terrier-project-5.2-bin
• bin\trec_setup.bat share\vaswani_npl\corpus (NB: no \ at end)
• (edit etc\terrier.properties for indexer.meta.reverse.keys=docno)
• bin\terrier.bat batchindexing (you must use .bat)
• bin\terrier.bat batchretrieve -t share\vaswani_npl\query-text.trec
• bin\terrier.bat batchevaluate -q share\vaswani_npl\qrels
Lab Exercise
• Q1 Use batchretrieve to conduct retrieval experiments
• (a) Evaluate the BM25 & DPH weighting models in terms of MAP
• (b) Evaluate the BM25 & DPH weighting models with query expansion in terms of MAP
Part 2: Extending Terrier
Introduction
• Terrier is a framework – it is easy to plug in alternative models
• Introducing a new weighting model is one of the aims of Exercise 1 of the coursework. In Exercise 2, you will develop, among others, a proximity search model, to be integrated into an ML-based approach.
• For both Exercise 1 & 2, we have provided templates for the classes you must work with
• https://github.com/cmacdonald/IRcourseHM
• You do NOT need to recompile Terrier!
Lab Exercise: Extending Terrier – Analyse an index
• Let’s work through an example, starting from the Github template from https://github.com/cmacdonald/IRcourseHM
• You want to find which document has the highest term frequency for a given term.
• Download and checkout the IRcourseHM project (e.g. using Eclipse).
• Q2. Edit the src/main/java/uk/ac/gla/dcs/applications/HighestTF.java source file as appropriate. Be inspired by the “Extending Retrieval” page of the Terrier documentation
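Before writing the Java, it can help to pin down the logic. The Python sketch below scans a posting list in the ID(…) TF(…) format printed by indexutil and keeps the posting with the largest term frequency. It illustrates the algorithm only and does not use the Terrier API; the function name and the second example list are hypothetical. In HighestTF.java, the same loop would run over the term's postings obtained from the index.

```python
import re

# Matches one posting as printed on the earlier slide, e.g. "ID(27) TF(1)".
POSTING = re.compile(r"ID\((\d+)\) TF\((\d+)\)")


def highest_tf(posting_text):
    """Return (docid, tf) of the posting with the largest tf, or None if there are none.

    Ties are broken in favour of the first posting encountered.
    """
    best = None
    for docid, tf in POSTING.findall(posting_text):
        docid, tf = int(docid), int(tf)
        if best is None or tf > best[1]:
            best = (docid, tf)
    return best


# Postings copied from the slide: every document has tf 1, so the first one wins.
slide_postings = "ID(27) TF(1) ID(2624) TF(1) ID(2769) TF(1) ID(2976) TF(1) ID(3057) TF(1)"
# Hypothetical posting list with differing frequencies.
toy_postings = "ID(3) TF(2) ID(9) TF(5) ID(12) TF(5)"

print(highest_tf(slide_postings))
print(highest_tf(toy_postings))
```

A single pass with a running maximum is sufficient; there is no need to sort the postings.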
Compiling and Running your Highest-tf Tool
• Use Eclipse/Maven/IntelliJ to compile your model into a jar file. Afterwards, let’s help Terrier use it:
• (1) Name that jar in the CLASSPATH environment variable (see the README.md for how to set the environment variable in different command shells)
• Or
• (1a) Refer to the package that has been “installed” using Maven, by setting the terrier.mvn.coords property as:
terrier.mvn.coords=uk.ac.gla.dcs:ircourse:1.0-SNAPSHOT
– NB: this MUST be in the terrier.properties file
• THEN
• (2) bin/terrier highest-tf satellite
Ref: https://github.com/cmacdonald/IRcourseHM/blob/master/README.md
Part 3: Further Reading
You now have enough background to conduct Exercise 1 of the IR course
Useful Links
• https://github.com/terrier-org/terrier-core/blob/5.x/doc/quickstart_experiments.md
• https://github.com/terrier-org/terrier-core/blob/5.x/doc/configure_retrieval.md
• https://github.com/terrier-org/terrier-core/blob/5.x/doc/evaluation.md
• Terrier Website
• http://terrier.org
• Terrier Documentation
• http://terrier.org/docs/current
• See Terrier on Github
• https://github.com/terrier-org/terrier-core/
• Report issues about Terrier
• https://github.com/terrier-org/terrier-core/issues
• (FAQ document on Moodle, if needed)
• https://github.com/cmacdonald/IRcourseHM/blob/master/README.md