
Information Retrieval H/M
Exercise 2 – February 2019

Introduction
Learning-to-rank is a recent paradigm used by commercial search engines to improve retrieval effectiveness by combining different sources of evidence (aka features). The key point of learning-to-rank is that it is easy to incorporate new features and to leverage the amount of potential training data available to Web search engines. In this exercise, you will be trying learning-to-rank using a number of standard or provided features, and implementing two additional features of your own. In particular, you will be implementing, testing and evaluating two proximity features, which aim to boost the scores of documents where the query terms appear in close proximity. Similar to Exercise 1, you will then be analysing the obtained results and commenting on the effectiveness of the IR system, including the effectiveness of your new features, but only using the homepage finding topic set (HP04). Effectiveness will be measured using two metrics: Mean Average Precision (MAP) and Precision at Rank 5 (P@5).
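As a reminder of how these two metrics are computed, the short sketch below is a minimal, illustrative Java example (with hypothetical relevance judgments, not part of the exercise code) showing Average Precision and P@5 for a single query. MAP is simply the mean of the per-query AP values, and for a query with a single relevant homepage AP reduces to the reciprocal of the rank at which that homepage is retrieved.

// Illustrative sketch only: AP and P@5 for one query, given a hypothetical
// binary relevance vector over the ranked results of that query.
public class MetricsSketch {

    // Average Precision for one query: mean of the precision values at the
    // ranks where relevant documents are retrieved, divided by the total
    // number of relevant documents for the query.
    static double averagePrecision(boolean[] relevantAtRank, int totalRelevant) {
        double sum = 0.0;
        int seen = 0;
        for (int r = 0; r < relevantAtRank.length; r++) {
            if (relevantAtRank[r]) {
                seen++;
                sum += (double) seen / (r + 1);
            }
        }
        return totalRelevant == 0 ? 0.0 : sum / totalRelevant;
    }

    // Precision at rank 5 for one query.
    static double precisionAt5(boolean[] relevantAtRank) {
        int hits = 0;
        for (int r = 0; r < Math.min(5, relevantAtRank.length); r++)
            if (relevantAtRank[r]) hits++;
        return hits / 5.0;
    }

    public static void main(String[] args) {
        // Hypothetical ranking: relevant documents at ranks 1 and 4.
        boolean[] run = {true, false, false, true, false, false};
        System.out.println(averagePrecision(run, 2)); // (1/1 + 2/4) / 2 = 0.75
        System.out.println(precisionAt5(run));        // 2/5 = 0.4
    }
}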
Exercise Specification
For retrieval, you will use Terrier’s support for learning-to-rank (LTR) and the included state-of-the-art Jforests LambdaMART LTR technique. In particular, the online documentation about LTR available at http://terrier.org/docs/current/learning.html will be an essential guide on how to conduct your experiments. However, you will only be using a limited set of 3 features as listed below.
For these experiments, you should use the provided index, which contains additional evidence (including fields and “block” positions); hence, for Exercise 2 you do not need to perform indexing. The provided index, named blocks_fields_stemming, can be found at:
/users/level4/software/IR/Resources/indices
In conducting your experiments, you must only use the homepage finding topic set (HP04) for evaluation, but to ensure a fair experimental setting, you’ll need to use separate “training” and “validation” topic sets, which can be found at:
/users/level4/software/IR/TopicsQrels/training and validation
Your first task is to deploy and evaluate a baseline LTR approach using the provided 3 features, following the Terrier LTR instructions mentioned above. For generating the required LTR sample, you need to use the PL2 weighting model. The sample is then re-ranked using the 3 provided features:

WMODEL:SingleFieldModel(BM25,0)
QI:StaticFeature(OIS,/users/level4/software/IR/Resources/features/pagerank.oos.gz)
DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier

Q1. First, report the effectiveness of the two system configurations: LTR (using PL2 to generate the sample) vs. PL2 alone. In your report, show the results in a table as follows (report all your MAP and P@5 performances to 4 decimal places):

Table 1: Performances of LTR vs PL2 on HP04 Topic Set

Configuration        MAP      P@5
PL2
LTR (PL2 sample)

Use the t-test statistical significance test to conclude whether the performance of LTR is significantly better than that of PL2 in terms of MAP and P@5. For example, to conduct a t-test, you can use the online t-test tool available at:
http://www.socscistatistics.com/tests/studentttest/
and select a two-tailed test with a significance level of 0.05.
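Alternatively, the comparison can be computed locally from the per-query scores of the two systems. The sketch below is a minimal, illustrative example and not required for the exercise: it assumes Apache Commons Math 3 is on the classpath and that hypothetical per-query AP values for both systems are available in the same query order, and it performs a paired two-tailed t-test over those values.

import org.apache.commons.math3.stat.inference.TTest;

// Illustrative sketch only: paired two-tailed t-test over per-query scores.
public class SignificanceSketch {
    public static void main(String[] args) {
        // Hypothetical per-query AP values (replace with your real HP04 results,
        // one value per query, in the same order for both systems).
        double[] pl2 = {0.25, 0.50, 1.00, 0.10, 0.33};
        double[] ltr = {0.50, 0.50, 1.00, 0.20, 0.50};

        TTest tTest = new TTest();
        double t = tTest.pairedT(ltr, pl2);       // t statistic
        double p = tTest.pairedTTest(ltr, pl2);   // two-tailed p-value
        System.out.printf("t = %.4f, p = %.4f, significant at 0.05: %b%n",
                t, p, p < 0.05);
    }
}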
For your two statistical tests, report their outcomes using the following format:
LTR vs PL2 on Metric M (M being either MAP or P@5): The t-value is XXXX. The p-value is XXXX. The result is [significant / not significant] at p < .05.

Now, using the outcome of your t-test tool, state if LTR (PL2 sample) is better by a statistically significant margin than PL2 on either or both of the used effectiveness metrics. [7]

Q2. Now, you should implement two additional proximity search features – a proximity search feature allows documents where the query terms occur closely together to be boosted. We require that you implement two of the functions numbered 1-5 in the following paper:

Ronan Cummins and Colm O'Riordan. 2009. Learning in a pairwise term-term proximity framework for information retrieval. In Proceedings of ACM SIGIR 2009. http://ir.dcs.gla.ac.uk/~ronanc/papers/cumminsSIGIR09.pdf

NB: You should calculate your feature by aggregating the function score (mean or min or max, as appropriate) over all pairs of query terms.

You will implement your new features as two DocumentScoreModifier (DSM) classes, using the example DSM code provided in the Github repository (https://github.com/cmacdonald/IRcourseHM). You can add a DSM as an additional feature by appending its full name to the feature file, e.g.:

DSM:org.myclass.MyProx1DSM

Q2a. Name the two proximity features you have chosen to implement and provide a brief rationale for your choice of these two particular features, especially in terms of how they might affect the performance of the deployed LTR baseline approach of Q1. [3]

Q2b. Along with the submission of your source code, discuss briefly your implementation of the two features, highlighting in particular any assumptions or design choices you made, any difficulties that you had to overcome to implement the two features, and how these difficulties were reflected in the unit testing you conducted. [8]

Q2c. Along with the submission of the source code of your unit tests, describe the unit tests that you have conducted to check that your implemented features behaved as expected. In particular, highlight any specific cases you tested your code for, and whether you identified and corrected any errors in your code. [7]
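Before wiring your chosen functions into Terrier DSMs, it can help to prototype the pair-wise aggregation on its own. The sketch below is a minimal, illustrative Java example; the class, method names and position values are hypothetical, and it does not use the Terrier DSM API (in a real DSM the positions would come from the block index and the resulting score would modify the ResultSet). It computes, for each pair of query terms, the smallest distance between any of their occurrences, and then aggregates the pair scores by taking their mean.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: one possible proximity value (minimum pairwise
// distance, aggregated by the mean over all query-term pairs) computed from
// per-term position lists for a single document.
public class ProximitySketch {

    // Smallest distance between any occurrence of term a and any occurrence of term b.
    static int minPairDistance(int[] posA, int[] posB) {
        int best = Integer.MAX_VALUE;
        for (int i : posA)
            for (int j : posB)
                best = Math.min(best, Math.abs(i - j));
        return best;
    }

    // Aggregate the pair scores over all pairs of query terms (here: the mean).
    static double meanMinDistance(List<int[]> termPositions) {
        List<Double> pairScores = new ArrayList<>();
        for (int a = 0; a < termPositions.size(); a++)
            for (int b = a + 1; b < termPositions.size(); b++)
                pairScores.add((double) minPairDistance(termPositions.get(a), termPositions.get(b)));
        if (pairScores.isEmpty())
            return 0.0; // single-term queries: no pairs to score
        return pairScores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical block positions of two query terms in one document.
        List<int[]> positions = new ArrayList<>();
        positions.add(new int[]{3, 17, 42});
        positions.add(new int[]{5, 80});
        System.out.println(meanMinDistance(positions)); // prints 2.0
    }
}

Note that single-term queries and documents missing one of the query terms are exactly the kinds of boundary cases your unit tests (Q2c) should cover.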
Q3. Once you have created and tested your two DSMs, you should experiment with LTR including your new features, comparing to your LTR baseline with the 3 initially provided features. As per best evaluation practices, when determining the benefits of your added features, you should add them separately to the list of provided features, and then in combination, to see if they provide different sources of evidence to the learner. This gives you 4 settings to compare and discuss (LTR baseline, LTR baseline + DSM1, LTR baseline + DSM2, LTR baseline + DSM1 + DSM2). Report the obtained performances of your 4 LTR system variants in a table as follows (report all your performances to 4 decimal places):

Table 2: Performances of DSM1, DSM2 with the LTR baseline, individually and in combination, on HP04 Topic Set

Configuration            MAP      P@5
LTR (baseline)
LTR + DSM1
LTR + DSM2
LTR + DSM1 + DSM2

Using the t-test statistical test, state if the introduction of either of your proximity features (or both) has led to a statistically significant enhancement of your LTR (baseline) performance in terms of MAP (i.e. you will conduct 3 tests in total). Report the details of your significance tests as in Q1. [8]

Q4. Using MAP as the main evaluation metric, provide a concise, yet informative, discussion of the performance of the learned model with and without your additional features. In particular, comment on why your features did/did not help (individually or in combination), and which queries benefitted (their numbers, their nature and characteristics, etc.). As part of your discussion, you will need to provide and use the following:

a) A recall-precision graph summarising the results of your 4 LTR system variants.
b) A histogram with a query-by-query performance analysis of the 4 system variants used.
c) A suitable table summarising the number of queries that have been improved/degraded/unaffected by the introduction of either (or both) of your proximity features with respect to the LTR baseline (an illustrative sketch of such a per-query comparison is given at the end of this sheet).
d) A suitable table showing examples of queries that have been particularly improved or harmed by the introduction of your proximity search features.

[10]

Q5. An overall reflection about what you have learnt from conducting the experiments and what your main conclusions are from this exercise. [5]

Hand-in Instructions:
(a) Submit a PDF report (5 pages MAX) with your answers to Q1-Q5.
(b) Submit the source code of your implemented features and their unit tests.
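Illustrative note for Q4(c): the sketch below shows one minimal way to derive the improved/degraded/unaffected counts from per-query AP values of the LTR baseline and an LTR + DSM variant. The query identifiers and values are hypothetical; replace them with your own per-query results.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: count queries improved, degraded or unaffected
// by a new feature, given per-query AP values keyed by query id.
public class QueryDeltaSketch {
    public static void main(String[] args) {
        Map<String, Double> baseline = new LinkedHashMap<>();
        Map<String, Double> withDsm  = new LinkedHashMap<>();
        // Hypothetical per-query AP values for the same query set.
        baseline.put("HP04-1", 0.50);  withDsm.put("HP04-1", 1.00);
        baseline.put("HP04-2", 1.00);  withDsm.put("HP04-2", 1.00);
        baseline.put("HP04-3", 0.33);  withDsm.put("HP04-3", 0.25);

        int improved = 0, degraded = 0, unaffected = 0;
        for (String qid : baseline.keySet()) {
            double delta = withDsm.get(qid) - baseline.get(qid);
            if (delta > 0) improved++;
            else if (delta < 0) degraded++;
            else unaffected++;
        }
        System.out.printf("improved=%d degraded=%d unaffected=%d%n",
                improved, degraded, unaffected);
    }
}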