IR H/M Course
16/02/2022
Generative vs. Discriminative
• A discriminative model estimates the probability of belonging to a class directly from the observed features of the document, based on the training data
• A generative model instead models how the observed features are generated for each class (i.e. the joint distribution of features and classes)
• Generative models perform well with low numbers of training examples
• Discriminative models usually perform better given enough training data
−Can also easily incorporate many features
Discriminative Models
Discriminative models typically rely on “machine learning”. There has been considerable interaction between the IR and machine learning fields:
• 60s: Rocchio algorithm (c.f. Lecture 6) is a simple learning approach, and later in 80s & 90s other learned ranking algorithms based on user feedback were proposed
• 2000s: text classification
All were limited by low amounts of training data. Web query logs have since generated a wave of research
• From 2008 e.g., “Learning to Rank”
Neural nets have given rise to more efforts in text representation:
• C.f. word2vec, GloVe, BERT
Motivations for Learning
How to choose term weighting models?
− Term weighting models have different assumptions about how relevant documents should be retrieved (c.f. Lecture 8)
− Field-based models: term occurrences in different fields matter differently
− Proximity-models: close co-occurrences matter more
− Priors: documents with particular lengths or URL/inlink
distributions matter more
− Query Features: Long queries, difficult queries, query type
How to combine all these easily and appropriately?
FEATURES FOR LEARNING
Ranking Cascades
Typically, in web-scale search, the ranking process can be seen as a series of cascades [1]:
• Rank some documents
• Pass the top-ranked documents onto the next cascade for refined re-ranking
For example:
• Boolean matching (1,000,000,000s of documents): do the query terms occur, e.g. AND/OR?
• Simple ranking (1000s of documents): identify a set most likely to contain relevant documents
• Re-ranking: try really hard to get the top of the ranking correct, using many signals (features)
A toy sketch of such a cascade follows below.
[1] J. Pedersen. Query understanding at Bing. SIGIR 2010 Industry Day.
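A minimal sketch of the cascade idea (illustrative only; all function names and data structures here are hypothetical, not from the lecture):

    # Hypothetical ranking cascade: boolean filter -> cheap ranker -> expensive re-ranker
    def boolean_filter(query_terms, inverted_index):
        """Keep only documents containing at least one query term (billions -> fewer)."""
        candidates = set()
        for term in query_terms:
            candidates |= set(inverted_index.get(term, []))
        return candidates

    def simple_rank(query_terms, candidates, score_fn, k=1000):
        """Cheap scoring (e.g. a single weighting model); keep only the top k documents."""
        scored = [(doc, score_fn(query_terms, doc)) for doc in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

    def rerank(query, top_docs, feature_fn, learned_model):
        """Expensive final stage: compute many features, apply a learned model."""
        rescored = [(doc, learned_model(feature_fn(query, doc))) for doc, _ in top_docs]
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)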
LEARNING TO RANK
Learning to Rank
Application of tailored machine learning techniques to automatically (select and) weight retrieval features
−Based on training data with relevance assessments
Learning to rank has been popularised by commercial search engines (e.g. Bing, Baidu, Yandex)
−They require large training datasets, possibly instantiated from click-through data
−Click-through data has facilitated the deployment of learning approaches
T.-Y. Liu. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.
Types of Features
Typically, commercial search engines use hundreds of features for ranking documents, usually categorised as follows (varying depending on what they are computed from):
Query Dependent Features (vary with both query and document)
• Weighting models, e.g. BM25, PL2
• Proximity models, e.g. Markov Random Fields
• Field-based weighting models, e.g. PL2F
• Deep-learned semantic matching
Query Independent Features (vary with the document only)
• PageRank, number of inlinks
• Spamminess
Query Features (vary with the query only)
• Query length
• Presence of entities
NB: Unlike text classification, features in LTR do NOT include the presence of a specific word in the document
Learning to Rank
1. Sample identification
− Apply BM25 or similar (e.g. DFR PL2) to rank documents with respect to the query (the list of candidate documents)
− Hope that the sample contains enough relevant documents
2. Compute more features
− Query dependent
− Query independent
− Query features
3A. Learn the ranking model
− Based on training data
3B. Apply the learned model
− Re-rank the sample documents
A rough end-to-end sketch of these stages is given below.
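A minimal sketch of stages 2–3, assuming a sample has already been retrieved; the feature extractors are hypothetical, and scikit-learn's GradientBoostingRegressor is only an illustrative (pointwise) stand-in for the learned model:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Stage 2: build a feature matrix for the sampled documents (extractors are hypothetical callables)
    def compute_features(query, sample_docids, extractors):
        return np.array([[f(query, docid) for f in extractors] for docid in sample_docids])

    # Stage 3A: learn a model from training feature matrices and relevance labels
    def learn_model(X_train, y_train):
        model = GradientBoostingRegressor(n_estimators=100)
        model.fit(X_train, y_train)
        return model

    # Stage 3B: re-rank the sample for an unseen query by descending predicted score
    def rerank_sample(model, sample_docids, X):
        order = np.argsort(-model.predict(X))
        return [sample_docids[i] for i in order]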
LTR, Schematically
Sample ranking: BM25 retrieval of K documents produces the document sample.
For each sampled document, a feature vector is computed, e.g.:
          f1    f2    f3    f4
    d1   0.2   0.1   0.5   0.8
    d2   0.5   0.7   0.0   0.2
    d3   0.3   0.2   0.5   0.0
Three types of features:
• Query dependent, e.g. BM25, PL2F
− Weighting models, … including fields
− Proximity/n-gram models
• Query independent, e.g. PageRank
− Link analysis, e.g. PageRank score
− URL length
− Spam score
− Document type (PDF, Wikipedia)
• Query features
− Query length
− Presence of entities
− Predicted query performance
The Learning to Rank technique generates the learned model using the same previous two steps (sample retrieval + features), plus relevance assessments.
Question: What are the main stages of learning to rank?
Question: What are the three types of features in learning to rank?
N. Tonellotto, C. Macdonald, I. Ounis. (2013). Efficient and effective retrieval using selective pruning. WSDM’13.
Feature Support in Terrier
Terrier supports ranking using features, as specified in a configuration file:
    WMODEL:BM25                                                 (weighting model)
    WMODEL:PL2                                                  (weighting model)
    WMODEL:DirichletLM                                          (weighting model)
    WMODEL:PL2F                                                 (field-based model)
    DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier    (proximity model)
    QI:StaticFeature(OIS,data.inlinks.oos.gz)                   (query independent feature)
    QI:StaticFeature(OIS,data.urllength.oos.gz)                 (query independent feature)
Exercise 3 has a standardised feature list, to which you will add new simple URL length and proximity features
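If you are working from PyTerrier rather than a raw Terrier configuration file, a similar feature set can be expressed roughly as follows; this is a hedged sketch based on the PyTerrier documentation, and the exact class and argument names should be checked against your installed version:

    import pyterrier as pt
    if not pt.started():
        pt.init()

    # Small example collection shipped with PyTerrier (an assumption for illustration)
    index_ref = pt.get_dataset("vaswani").get_index()

    # Sample with BM25; compute additional weighting-model scores as features per sampled document
    fbr = pt.FeaturesBatchRetrieve(index_ref, wmodel="BM25",
                                   features=["WMODEL:PL2", "WMODEL:DirichletLM"])
    res = fbr.search("information retrieval")   # the result DataFrame includes a 'features' column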
Query Dependent Feature Extraction
We might typically deploy a number of query dependent features
− Such as additional weighting models, e.g. fields/proximity, that are calculated based on information in the inverted index
We want the result set passed to the final cascade to have all query dependent features computed
− Once first-cascade retrieval has ended, it is too late to compute features without re-traversing the inverted index posting lists
− It would take too long to compute all features for all scored documents
Solution: cache the postings for documents that might make the sample
− Postings contain frequencies, positions, fields, etc.
This gives a “fat” pipeline: initial sample retrieval → “fat” sample ResultSet (documents plus their cached postings) → calculate features; apply LTR → final ResultSet
C. Macdonald et al. (2013). About Learning Models with Multiple Query Dependent Features. TOIS.
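A toy sketch of the idea (purely illustrative; this is not the Terrier implementation): the sample carries cached postings, so query dependent features can be computed later without touching the inverted index again.

    from dataclasses import dataclass, field

    @dataclass
    class FatResultSet:
        """Sample documents plus the postings needed to compute their features later."""
        docids: list
        scores: list
        postings: dict = field(default_factory=dict)  # docid -> {term: (freq, positions, fields)}

    def compute_feature_from_cache(fat, docid, feature_fn):
        # No access to the inverted index here: only the cached postings are used
        return feature_fn(fat.postings[docid])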
Types and Examples of Learning to Rank Algorithms
LEARNING PROCESS
Learning to Rank Process
For each training query, the document sample is retrieved and its features computed, giving a feature matrix with a relevance label (✓/✗) per document, e.g.:
          f1    f2    f3    f4
    d1   0.2   0.1   0.5   0.8
    d2   0.5   0.7   0.0   0.2
    d3   0.3   0.2   0.5   0.0
These training examples are used to produce a learned model, which is then used to re-rank the document sample for an unseen query.
Training & Evaluating
We need lots of queries to learn an effective, robust model
We cannot measure the success of a learning-to-rank technique on one of the queries it has been trained upon
• Such training performance is likely to overestimate its effectiveness compared to unseen queries
• (More on validation shortly)
Instead, we need to separate the queries in our test collection into disjoint sets: for training the model; for tuning the learning parameters (the validation query set); and for testing how effective the model is for unseen queries
• We use the same processing (sample retrieval and features) for queries in each set
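As a concrete, illustrative example of such a split (assuming topics are held in a pandas DataFrame with one row per query, as in PyTerrier; the 60/20/20 proportions are an arbitrary choice):

    import pandas as pd

    def split_topics(topics: pd.DataFrame, train_frac=0.6, valid_frac=0.2, seed=42):
        """Return disjoint training, validation and test query sets."""
        shuffled = topics.sample(frac=1.0, random_state=seed)
        n_train = int(train_frac * len(shuffled))
        n_valid = int(valid_frac * len(shuffled))
        train = shuffled.iloc[:n_train]
        valid = shuffled.iloc[n_train:n_train + n_valid]
        test = shuffled.iloc[n_train + n_valid:]
        return train, valid, test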
Types of Learning
Ranking is a form of supervised machine learning. It is slightly different from classification, and more closely related to regression:
• Classification predicts a discrete class label; regression predicts a numerical value
• For each instance, ranking aims to predict its ground-truth numerical label
Types of Learning for Ranking Tasks (1)
A ‘loss function’ is used to estimate how good a particular model is (based on the training data)
• The learner will try many different models (to find the best one)
There are three types of learning appropriate for ranking tasks: pointwise, pairwise & listwise
• Each differs in the way that its loss function is computed
Pointwise approaches are closest to classification/regression
• They aim to predict the correct relevance label
− Classification: how many documents did the learner correctly predict as relevant/non-relevant?
− Regression: how different is the predicted score from the correct relevance label for each document?
A small pointwise sketch follows below.
OPRF: N. Fuhr. (1989). Optimal Polynomial Retrieval Functions Based on the Probability Ranking Principle. TOIS 7(3).
GBRT: S. Tyree, K.Q. Weinberger, K. Agrawal, J. Paykin. Parallel boosted regression trees for web search ranking. WWW 2011.
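A minimal pointwise sketch (illustrative): a squared-error regression loss against graded relevance labels.

    import numpy as np

    def pointwise_regression_loss(predicted_scores, relevance_labels):
        """Mean squared error between predicted scores and graded relevance labels."""
        predicted_scores = np.asarray(predicted_scores, dtype=float)
        relevance_labels = np.asarray(relevance_labels, dtype=float)
        return float(np.mean((predicted_scores - relevance_labels) ** 2))

    # e.g. three sampled documents with relevance labels 1, 2 and 0
    print(pointwise_regression_loss([0.8, 1.5, 0.1], [1, 2, 0]))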
Types of Learning for Ranking Tasks (2)
Pointwise techniques suffer from a limitation for ranking tasks, in that they consider each document independently, instead of trying to get the relevant documents ranked ABOVE the non-relevant ones
Pairwise
• Take two documents with some preference
− E.g. different relevance labels, 0 and 1, or fewer vs. more clicks
• Try to minimise a loss function based on the ranking error between pairs of documents
• Problem? Doesn’t always approximate a real evaluation measure
• Examples: RankSVM, RankNet
Listwise
• Consider the entire ranking of documents
• Encapsulate a standard IR evaluation measure within the loss function
− Not all measures are appropriate for listwise learning
• Generally more effective than pointwise or pairwise approaches
• Examples: AFS, LambdaMART
A toy pairwise loss is sketched below.
RankSVM: T. Joachims. Optimizing Search Engines using Clickthrough Data. KDD’02.
RankNet: C. Burges et al. Learning to Rank using Gradient Descent. ICML’05.
AFS: D. Metzler. Automatic Feature Selection for the Markov Random Field Model for IR. CIKM’07.
LambdaMART: C.J. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82, Microsoft Research.
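A toy pairwise loss in the spirit of RankNet (a simplified sketch, not the full published formulation): for a pair where document i is preferred over document j, penalise the model when score_i does not exceed score_j.

    import numpy as np

    def pairwise_logistic_loss(score_i, score_j):
        """Loss for a pair where document i is preferred over document j (RankNet-style)."""
        # log(1 + exp(-(s_i - s_j))): small when s_i >> s_j, large when the pair is misordered
        return float(np.log1p(np.exp(-(score_i - score_j))))

    def total_pairwise_loss(scores, labels):
        """Sum the pairwise loss over all pairs with different relevance labels."""
        loss = 0.0
        for i in range(len(scores)):
            for j in range(len(scores)):
                if labels[i] > labels[j]:
                    loss += pairwise_logistic_loss(scores[i], scores[j])
        return loss

    print(total_pairwise_loss([0.9, 0.2, 0.4], [1, 0, 0]))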
LINEAR MODELS FOR LTR
Types of Learned Models (1): Linear Models
Linear models (the most intuitive to comprehend)
− Many learning to rank techniques generate a linear combination of feature values:
    score(d, Q) = Σ_f w_f · value_f(d)
− Linear models make some assumptions:
• Feature usage: they assume that the same features are needed by all queries
• Model form: the model is only a linear combination of feature values
− Contrast this with genetic algorithms, which can learn functional combinations of features by randomly introducing operators (e.g. try dividing feature a by feature b), but are impractical to learn
− Question: How do we set the w_f values to maximise the performance of an IR evaluation metric?
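A minimal sketch of scoring with such a linear model (illustrative only; the example weights are arbitrary):

    import numpy as np

    def linear_score(feature_values, weights):
        """score(d, Q) = sum_f w_f * value_f(d), with features and weights as aligned vectors."""
        return float(np.dot(np.asarray(weights), np.asarray(feature_values)))

    # e.g. document d1 with features [0.2, 0.1, 0.5, 0.8] and weights [1.0, 1.3, 0.0, 0.0]
    print(linear_score([0.2, 0.1, 0.5, 0.8], [1.0, 1.3, 0.0, 0.0]))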
Difficulty in Training
IR evaluation measures are not continuous, and hence are not differentiable with respect to a feature weight
• i.e. they are not smooth as we vary a feature weight w_f
Metrics do not respond to changes in the scores of documents, but only when a swap between the ordering of two documents takes place
• We therefore need advanced gradient descent / hill-climbing / simulated annealing approaches
• Many of the advances in learning to rank try to address this
Figures (omitted): (a) Craswell et al., SIGIR 2005; (b) Hongning Wang
Q: Why are evaluation measures non-smooth with respect to feature weights?
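A toy illustration of why (not from the slides): as a feature weight varies, a measure such as precision@1 only changes at the point where two documents swap positions.

    import numpy as np

    # Two documents, two features; d1 is relevant, d2 is not (toy data)
    features = {"d1": np.array([0.2, 0.9]), "d2": np.array([0.6, 0.1])}
    relevant = {"d1"}

    def precision_at_1(w):
        """Score each document with w*f1 + (1-w)*f2 and check whether the top document is relevant."""
        scores = {d: w * f[0] + (1 - w) * f[1] for d, f in features.items()}
        top = max(scores, key=scores.get)
        return 1.0 if top in relevant else 0.0

    # The metric is a step function of w: it only changes where d1 and d2 swap ranks (around w ≈ 0.67)
    for w in np.linspace(0, 1, 11):
        print(round(w, 1), precision_at_1(w))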
(Greedy) Automatic Feature Selection (AFS)
Consider a linear model:
    score(d, Q) = Σ_f w_f · value_f(d)
AFS: select a w_f > 0 for the feature f that best improves an evaluation measure M.
Worked example, over a document sample with four features:
          f1    f2    f3    f4
    d1   0.2   0.1   0.5   0.8
    d2   0.5   0.7   0.0   0.2
    d3   0.3   0.2   0.5   0.0
Step 1: Select the most effective single feature according to M (current best M = 0)
    f1: 0.30    f2: 0.35    f3: 0.05    f4: 0.10
    → f2 is selected; current best M = 0.35
Step 2: Select the feature and corresponding w_f that most improves M when added to the current model (e.g. using simulated annealing to find the weight)
    f2 + 1.3 × f1: 0.36
    f2 + 0.3 × f3: 0.40
    f2 + 0.9 × f4: 0.34
    → f3 is added with weight 0.3; current best M = 0.40
Step 3: Go to Step 2, until we observe no more improvements in M, or no more features remain
D. Metzler. Automatic Feature Selection for the Markov Random Field Model for IR. CIKM’07.
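A greatly simplified sketch of this greedy loop (illustrative only: candidate weights are searched over a coarse grid rather than by simulated annealing, a feature may be revisited, and the `evaluate` callable stands in for computing M over the training queries' rankings):

    import numpy as np

    def afs(X, evaluate, candidate_weights=np.linspace(0.1, 2.0, 20)):
        """Greedy AFS over an (n_docs x n_features) matrix X, per the lecture's example.
        evaluate(scores) must return the value of the measure M for the induced ranking."""
        n_features = X.shape[1]
        weights = np.zeros(n_features)
        best_m = evaluate(X @ weights)
        improved = True
        while improved:
            improved = False
            best_update = None
            for f in range(n_features):
                for w in candidate_weights:       # stand-in for the simulated annealing search
                    trial = weights.copy()
                    trial[f] += w
                    m = evaluate(X @ trial)
                    if m > best_m:
                        best_m, best_update, improved = m, trial, True
            if improved:
                weights = best_update
        return weights, best_m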
Training Performance, by Iteration
(Figure omitted.) Source: own experiments, AFS, MSLR-Web10k, 1334 queries, 136 features
Performance on Unseen Data
(Figure omitted; annotations: local maxima, global maxima.)
Overfitting (to the training data): a model that does not generalise to unseen queries
(Figure omitted.) Source: own experiments, AFS, MSLR-Web10k, 666 queries, 136 features
Solution: Use Validation Data
Validation data can be used to prevent overfitting, e.g. by setting “hyper-parameters” of the learning process (in this case, the number of iterations) on unseen data
(Figure omitted: the final chosen model has 14 features.)
Summary: Evaluating a Learning to Rank Technique
Evaluation of learning to rank (L2R) uses a test collection
• Documents & queries with known relevant documents
We split queries into three subsets:
• Training – for learning the model
• Validation – for setting hyper-parameters, e.g. number of iterations. Unseen to the learner; prevents overfitting to the training data
• Testing – for determining how effective the learner is on unseen queries
In Exercise 3, you will evaluate on the HP04 (test) topics;
we have provided separate training and validation topic sets.
c.f. PyTerrier’s L2R documentation on how to learn using training & validation topics (a hedged sketch follows below)
What is the role of validation data during learning?
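A hedged sketch of learning with training and validation topics in PyTerrier; `features_pipeline` and the topics/qrels DataFrames are assumed to exist (e.g. a FeaturesBatchRetrieve as sketched earlier, and a pt.get_dataset(...) test collection), and the exact API names (pt.ltr.apply_learned_model, form="ltr") are taken from the PyTerrier documentation and may differ by version:

    import pyterrier as pt
    import lightgbm as lgb

    # Assumptions: features_pipeline computes the sample + feature vectors;
    # train/valid topics & qrels DataFrames are available from the test collection.
    lmart = lgb.LGBMRanker(objective="lambdarank", metric="ndcg", n_estimators=100)
    ltr_pipeline = features_pipeline >> pt.ltr.apply_learned_model(lmart, form="ltr")
    ltr_pipeline.fit(train_topics, train_qrels, valid_topics, valid_qrels)
    # Evaluate afterwards on the held-out test topics, e.g. with
    # pt.Experiment([ltr_pipeline], test_topics, test_qrels, ["ndcg"])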
REGRESSION TREES FOR LTR
Types of Learned Models (2): Regression Trees
Tree models
− A regression tree is a series of decisions, leading to a partial score output
− The outcome of the learner is a “forest” of many such trees (many of them!), used to calculate the final score of a document for a query: for each document in the sample, the partial scores of all trees are summed
− Their ability to customise branches makes them more effective than linear models
− Regression trees are pointwise in nature, but several major search engines have created adapted regression tree techniques that are listwise
• E.g. Microsoft’s LambdaMART is at the heart of the Bing search engine
S. Tyree, K. Weinberger, K. Agrawal, J. Paykin. (2011). Parallel Boosted Regression Trees for Web Search Ranking. WWW’11.
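A toy sketch of scoring with a forest of regression trees (illustrative only; real learners such as LambdaMART grow the trees during training, and the feature names and thresholds here are invented):

    # Each tree maps a document's feature vector to a partial score;
    # the document's final score is the sum of the partial scores.
    def tree1(features):
        return 0.5 if features["bm25"] > 10.0 else 0.1

    def tree2(features):
        return 0.3 if features["pagerank"] > 0.01 else -0.2

    def forest_score(features, forest):
        return sum(tree(features) for tree in forest)

    doc_features = {"bm25": 12.3, "pagerank": 0.002}
    print(forest_score(doc_features, [tree1, tree2]))  # 0.5 + (-0.2) = 0.3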
Growing a Tree
Training data (features plus relevance label L):
          f1    f2    f3    f4    L
    d1   0.2   0.1   0.5   0.8   1
    d2   0.5   0.7   0.0   0.2   2
    d3   0.3   0.2   0.5   0.0   0
Adding a decision point (split) reduces the loss, e.g. from Loss = 1.4 to Loss = 1.2, giving Gain = 0.2
Q: Can we add another decision point?
Feature Importance
Regression trees have an in-built measure of feature importance:
• By recording the “gain” at each node during training time (e.g. Gain = 0.4, Gain = 0.2, …)
• Summing the gains for each feature gives a normalised total gain per feature (e.g. for URL length)
NB: sklearn’s RandomForest, as well as XGBoost and LightGBM, all provide a feature_importances_ variable
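For instance, a hedged sketch using LightGBM's LGBMRanker on synthetic data (the data, group sizes and parameters are arbitrary choices for illustration only):

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    n_queries, docs_per_query, n_features = 20, 50, 4
    X = rng.random((n_queries * docs_per_query, n_features))
    y = (X[:, 0] > 0.7).astype(int)          # toy labels loosely tied to feature 0
    group = [docs_per_query] * n_queries     # number of documents per query

    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=20)
    ranker.fit(X, y, group=group)
    print(ranker.feature_importances_)       # per-feature importance (split counts by default)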
Summary of LTR Types
Some “take-aways”:
• Listwise techniques are typically more effective; linear models are easy to understand; regression trees are very adaptable
− LambdaMART enhances pointwise regression trees with a listwise measure
− Currently one of the benchmarks to beat!
• Validation data is equally important for all types of supervised learning, including LTR, to prevent overfitting
− Always use a validation set!
In Exercise 3 of your coursework, you will be experimenting with the state-of-the-art LambdaMART technique, as provided by LightGBM
Example Performances
Example performances on 8000 queries from a Microsoft learning to rank dataset
• Differences might be small in magnitude, but are statistically significant over hundreds of queries
    Learning to Rank Technique       Type
    AFS (Simulated Annealing)        Listwise: Linear
    Boosted Regression Trees         Pointwise: Tree
    LambdaMART                       Listwise: Tree
The Importance of Sample Size
What sample size is necessary for effective learning?
− Small samples: faster learning, faster retrieval
− Large samples: higher recall
Findings across different sample sizes (figure omitted):
• Small sample sizes degrade performance
• Performance increases as the sample size rises
• There is no benefit for sample sizes larger than 1000
• Pairwise techniques struggle with large samples
C. Macdonald, R. Santos and I. Ounis. (2012). The Whens and Hows of Learning to Rank. Information Retrieval Journal. DOI: 10.1007/s10791-012-9209-9
Learning to Rank Best Practices (Basics)
Q: What is the role of the sample in learning-to-rank?
About the sample – it must be:
• Small enough that calculating features doesn’t take too long
• Large enough to have sufficient recall of relevant documents
− The previous slide suggests 1000 documents are needed for TREC Web corpora
• Simple early cascade: there is only time for a single weighting model to create the sample
− E.g. PL2 < BM25 with no features; but when features are added, there is no real difference
Multiple weighting models as features:
− Using more weighting models does improve the learned model
Field models as features:
− Adding field-based models as features always improves effectiveness
In general, query independent features (e.g. PageRank, inlinks, URL length) improve effectiveness
C. Macdonald, R. Santos, I. Ounis, B. He. About Learning Models with Multiple Query Dependent Features. TOIS 31(3), 2013.
C. Macdonald, R. Santos and I. Ounis. The Whens & Hows of Learning to Rank. Information Retrieval Journal 16(5), 2012.
Learning to Rank Best Practices (Query Features)
Tree-based learners can customise their branches based on query features (c.f. their adaptability, noted above), and so are better placed than linear models to exploit such features