
Search Engines

Text, Web And Media Analytics
Information Retrieval (IR)


1. Overview of IR Models
2. Older IR Models
Boolean Retrieval
Vector Space Model
3. Probabilistic Models
Language models
4. Relevance models
Pseudo-Relevance Feedback
KL-Divergence
Rocchio algorithm

1. Overview of IR Models
Information retrieval (IR) models provide a mathematical framework for defining the search process
include an explanation of assumptions
are the basis of many ranking algorithms
can be implicit
For a given query Q, an IR model finds relevant documents to answer Q.
Progress in IR models has corresponded with improvements in effectiveness.
Theories about relevance

Complex concept that has been studied for some time
Many factors to consider
People often disagree when making relevance judgments
IR models make various assumptions about relevance to simplify problem
e.g., topical vs. user relevance
A document is topically relevant to a query if it is judged to be on the same topic, i.e., the query and the document are about the same thing.
User relevance considers all the other factors that go into a user’s judgment of relevance.
e.g., binary vs. multi-valued relevance


2. Older IR Models
Boolean Retrieval
Two possible outcomes for query processing
TRUE and FALSE
“exact-match” retrieval
simplest form of ranking
Query usually specified using Boolean operators
AND, OR, NOT;
Proximity operators (define constraints on the distance between words; e.g., requiring words to co-occur within a certain “window” (length) of text); and
Wildcard characters (define the minimum string match required for a word) are also commonly used in Boolean queries.

Searching by Numbers
It is the process of developing queries with a focus on the size of the retrieved set.
It is a consequence of the limitations of the Boolean retrieval model.
For example, a sequence of queries driven by the number of retrieved documents:
e.g. “lincoln” search of news articles
president AND lincoln
president AND lincoln AND NOT (automobile OR car)
president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
The last of these queries will retrieve any document containing the words “president” and “lincoln”, along with any one of the words “biography”, “life”, “birthplace”, or “gettysburg” (and not mentioning “automobile” or “car”).
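A minimal sketch of this kind of Boolean evaluation over an inverted index, using set operations. The four toy documents are illustrative, not part of any example collection in these slides:

```python
# Toy collection (hypothetical documents).
docs = {
    1: "president lincoln biography",
    2: "lincoln automobile dealership",
    3: "president lincoln gettysburg address",
    4: "the president spoke about the economy",
}

# Build the inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

# president AND lincoln AND (biography OR life OR birthplace OR gettysburg)
# AND NOT (automobile OR car)
result = (postings("president") & postings("lincoln")
          & (postings("biography") | postings("life")
             | postings("birthplace") | postings("gettysburg"))
          - (postings("automobile") | postings("car")))
print(sorted(result))  # [1, 3]
```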

Boolean Retrieval cont.
Advantages
Results are predictable, relatively easy to explain to users
Many different features can be incorporated (e.g., document date or type)
Efficient processing since many documents can be eliminated from search
Disadvantages
Effectiveness depends entirely on the user (simple queries will not work well).
Complex queries are difficult to construct and require considerable experience.

Vector Space Model
Documents and query represented by a vector of term weights

A collection of documents can be represented by a matrix of term weights

Example (as an inverted index)
3-D pictures are useful, but can be misleading for high-dimensional spaces

The right-side terms*documents matrix (the inverted representation) is just the transpose of the left-side documents*terms matrix.
The transpose of a matrix is an operation that flips the matrix over its diagonal: the first column becomes the first row, and so on.
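For example, a 3-documents-by-2-terms matrix and its transpose, where w_ij is the weight of term j in document i:

$$\begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{pmatrix}^{\mathsf{T}} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \\ w_{12} & w_{22} & w_{32} \end{pmatrix}$$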

Vector Space Model – similarity measure
Documents ranked by distance between points representing query and documents
Similarity measure more common than a distance or dissimilarity measure
e.g., Cosine correlation
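In the standard form, for document Di and query Q over t terms:

$$\mathrm{Cosine}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2} \cdot \sqrt{\sum_{j=1}^{t} q_j^2}}$$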

Example of Similarity Calculation
Consider two documents D1, D2 and a query Q
D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
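A minimal sketch that computes these similarities (values rounded to two decimals):

```python
import math

def cosine(x, y):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

D1, D2, Q = (0.5, 0.8, 0.3), (0.9, 0.4, 0.2), (1.5, 1.0, 0)
print(round(cosine(D1, Q), 2))  # 0.87
print(round(cosine(D2, Q), 2))  # 0.97 -> D2 is ranked higher than D1
```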

Document Representation – Term Weights
Tf*idf weight
Term frequency weight measures importance of term k in document (Di):

Inverse document frequency measures importance in collection:
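A standard formulation (f_ik is the number of occurrences of term k in document Di, n_k is the number of documents containing term k, and N is the collection size; the tf*idf weight is the product of the two):

$$tf_{ik} = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}}, \qquad idf_k = \log \frac{N}{n_k}$$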

Some heuristic modifications

(Note: the corresponding equation in the textbook is incorrect.)

Vector Space Model cont.
Advantages
Simple computational framework for ranking
Any similarity measure or term weighting scheme could be used
Disadvantages
Assumption of term independence
No predictions about techniques for effective ranking. There is an implicit assumption that relevance is related to the similarity of query and document vectors.

3. Probabilistic Models
It is hard to prove that an IR model will achieve better effectiveness than any other model since we are trying to formalize a complex human activity.
The validity of a retrieval model generally has to be validated empirically, rather than theoretically.
The Probability Ranking Principle is an early theoretical statement about effectiveness (Robertson, 1977). It was originally described as follows:
“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request,
where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose,
the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

IR as Classification – Bayes Classifier
Bayes Decision Rule
A document D is relevant if P(R|D) > P(NR|D)
Estimating probabilities
use Bayes Rule
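In the standard form:

$$P(R|D) = \frac{P(D|R)\,P(R)}{P(D)}$$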

P(R) is the prior probability of relevance
classify a document as relevant if
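That is, classify D as relevant if:

$$\frac{P(D|R)}{P(D|NR)} > \frac{P(NR)}{P(R)}$$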

The left-hand side (lhs) is the likelihood ratio.

P(R|D) is a conditional probability representing the probability of relevance for the given document D, and
P(NR|D) is the conditional probability of non-relevance.

It’s not clear how we would go about calculating P(R|D), but given information about the relevant set, we should be able to calculate P(D|R).

P(D|R) – is a conditional probability representing the probability of the appearance of document D in the relevant set.
P(D|NR) – is a conditional probability representing the probability of the appearance of document D in the non-relevant set.

Examples of Relevant and Non-relevant documents
Let Q = {US, ECONOM, ESPIONAG} be a query
C = {D1, D2, D3, D4, D5, D6, D7} be a collection of documents, where
D1 = {GERMAN, VW}
D2 = {US, US, ECONOM, SPY}
D3 = {US, BILL, ECONOM, ESPIONAG}
D4 = {US, ECONOM, ESPIONAG, BILL}
D5 = {GERMAN, MAN, VW, ESPIONAG}
D6 = {GERMAN, GERMAN, MAN, VW, SPY}
D7 = {US, MAN, VW}
What are P(R|D5) and P(R|D), where D = {US, VW, ESPIONAG} = {d1, d2, d3}?
What are P(D5|R) and P(D|R)?

Document ID   Terms (dij)                     Relevance to Q
D1            GERMAN, VW                      0 (no)
D2            US, US, ECONOM, SPY             1 (yes)
D3            US, BILL, ECONOM, ESPIONAG      1 (yes)
D4            US, ECONOM, ESPIONAG, BILL      1 (yes)
D5            GERMAN, MAN, VW, ESPIONAG       0 (no)
D6            GERMAN, GERMAN, MAN, VW, SPY    0 (no)
D7            US, MAN, VW                     0 (no)

Estimating P(D|R)
Assume independence
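With term independence, P(D|R) factors over the document's term features:

$$P(D|R) = \prod_{i=1}^{t} P(d_i|R)$$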

Binary independence model
document represented by a vector of binary features indicating term occurrence (or non-occurrence). E.g.,
T = {US, BILL, ECONOM, ESPIONAG, GERMAN, MAN, VW, SPY }
D = {US, VW, ESPIONAG}
The binary vector of D = (1, 0, 0, 1, 0, 0, 1, 0)
pi is the probability that term i occurs (i.e., has value 1) in a relevant document; si is the probability that it occurs in a non-relevant document

Binary Independence Model
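In the standard form, the likelihood ratio factors over present and absent terms:

$$\frac{P(D|R)}{P(D|NR)} = \prod_{i:\,d_i = 1} \frac{p_i}{s_i} \cdot \prod_{i:\,d_i = 0} \frac{1 - p_i}{1 - s_i}$$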

Where i:di = 1 means terms that have the value 1 (presence) in the document;
i:di = 0 means terms that have the value 0 (absence) in the document.

Binary Independence Model
Scoring function is (if we ignore the common 2nd product)
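In the standard form:

$$\sum_{i:\,d_i = q_i = 1} \log \frac{p_i\,(1 - s_i)}{s_i\,(1 - p_i)}$$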

Query provides information about relevant documents
If we assume pi is constant and approximate si using the entire collection, we get an idf-like weight:
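With $p_i = 0.5$ and $s_i \approx n_i/N$:

$$\log \frac{p_i\,(1 - s_i)}{s_i\,(1 - p_i)} = \log \frac{0.5\,(1 - n_i/N)}{(n_i/N)\,(1 - 0.5)} = \log \frac{N - n_i}{n_i}$$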

The assumption for pi is acceptable if all terms are equally important.
The assumption for si is reasonable if the number of relevant documents is very small.

Contingency Table

Putting these estimates into the scoring function:
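Following the textbook, with the usual 0.5 smoothing added to each count:

$$\sum_{i:\,d_i = q_i = 1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}$$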

          Relevant        Non-relevant                               Total
di = 1    ri = 3          ni - ri = 1                                ni = 4
di = 0    R - ri = 0      (N - R) - (ni - ri) = N - ni - R + ri = 3  N - ni = 3
Total     R = 3           N - R = 4                                  N = 7

Example: term 1 = ‘US’
ri is the number of relevant documents containing term i
ni is the number of documents containing term i
N is the total number of documents in the collection
R is the number of relevant documents for this query
di = 1 if term i is present in the document, and 0 otherwise
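For instance, plugging the table's values for “US” into the scoring function above (using the natural logarithm):

$$\log \frac{(3 + 0.5)/(3 - 3 + 0.5)}{(4 - 3 + 0.5)/(7 - 4 - 3 + 3 + 0.5)} = \log \frac{7}{1.5/3.5} \approx \log 16.33 \approx 2.79$$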

BM25
Popular and effective ranking algorithm based on the binary independence model
adds document and query term weights:
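In the standard form (the sum is over the terms in the query Q):

$$\sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1)\,f_i}{K + f_i} \cdot \frac{(k_2 + 1)\,qf_i}{k_2 + qf_i}$$

where $K = k_1\left((1 - b) + b \cdot \frac{dl}{avdl}\right)$.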

fi is the frequency of term i in the document;
qfi is the frequency of term i in the query;
k1, k2 and K are parameters whose values are set empirically;
dl is the document length and avdl is the average document length in the collection.
Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75

BM25 Example
Query with two terms, “president lincoln”, (qf = 1)
No relevance information (r and R are zero)
N = 500,000 documents
“president” occurs in 40,000 documents (n1 = 40, 000)
“lincoln” occurs in 300 documents (n2 = 300)
“president” occurs 15 times in doc (f1 = 15)
“lincoln” occurs 25 times (f2 = 25)
document length is 90% of the average length (dl/avdl = .9)
k1 = 1.2, b = 0.75, and k2 = 100
K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11

BM25 Example
Effect of term frequencies
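A minimal sketch that reproduces this example and shows the effect of varying the term frequencies. It assumes natural logarithms and the standard BM25 form above; the helper name bm25_term is ours:

```python
import math

def bm25_term(n, f, qf, r=0, R=0, N=500_000,
              k1=1.2, k2=100, b=0.75, dl_avdl=0.9):
    """BM25 weight of one query term (no relevance information by default)."""
    K = k1 * ((1 - b) + b * dl_avdl)
    idf_part = math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))
    return idf_part * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))

# "president": n1 = 40,000, f1 = 15; "lincoln": n2 = 300, f2 = 25
print(round(bm25_term(40_000, 15, 1) + bm25_term(300, 25, 1), 2))  # ~20.6

# Effect of term frequencies: the rarer term "lincoln" dominates the score
for f1, f2 in [(15, 25), (15, 1), (15, 0), (1, 25), (0, 25)]:
    print(f1, f2, round(bm25_term(40_000, f1, 1) + bm25_term(300, f2, 1), 2))
```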

Language Model
Unigram language model (a simple language model)
probability distribution over the words in a language.
For example, if the documents in a collection contained just five different words (w1, w2, …, w5), a possible language model for that collection might be
(0.2, 0.1, 0.35, 0.25, 0.1)
where each number is the probability of a word occurring.
N-gram language model
predicts a word based on longer preceding sequences: an n-gram model predicts a word based on the previous n − 1 words.
The most common n-gram models are bigram (predicting based on the previous word) and trigram (predicting based on the previous two words) models.

Language models are used to represent text in a variety of language technologies, such as speech recognition, machine translation, and handwriting recognition. The simplest form of language model, known as a unigram language model, is a probability distribution over the words in the language.

Language Model
We can use language models to represent the topical content of a document.
A topic is defined as a probability distribution over words – a language model.
A language model can be used to “generate” new text by sampling words according to the probability distribution (see the sketch after this list).
A topic in a document or query can be represented as a language model
i.e., words that tend to occur often when discussing a topic will have high probabilities in the corresponding language model
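A minimal sketch of such generation, reusing the five-word unigram distribution from the earlier example (the vocabulary names w1..w5 are placeholders):

```python
import random

vocab = ["w1", "w2", "w3", "w4", "w5"]
probs = [0.2, 0.1, 0.35, 0.25, 0.1]

# "Generate" text by sampling each word independently from the distribution.
text = random.choices(vocab, weights=probs, k=10)
print(" ".join(text))
```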

LMs for Retrieval
3 possibilities:
probability of generating the query text from a document language model
probability of generating the document text from a query language model
comparing the language models representing the query and document topics
Models of topical relevance
The probability of query generation is the measure of how likely it is that a document is about the same topic as the query.

Query-Likelihood Model
Rank documents by the probability that the query could be generated by the document model (i.e. same topic)
Given a query, start with P(D|Q)
Using Bayes’ Rule

Assuming the prior P(D) is uniform, with a unigram model:
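Following the textbook's formulation (n is the number of words in the query):

$$P(D|Q) \propto P(Q|D)\,P(D), \qquad P(Q|D) = \prod_{i=1}^{n} P(q_i|D)$$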

Estimating Probabilities
Obvious estimate for unigram probabilities is
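That is, the maximum likelihood estimate, where f_{qi,D} is the number of times qi occurs in D and |D| is the number of words in D:

$$P(q_i|D) = \frac{f_{q_i, D}}{|D|}$$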

If query words are missing from document, score will be zero
Missing 1 out of 4 query words gives the same zero score as missing 3 out of 4

Document texts are a sample from the language model
Missing words should not have zero probability of occurring
Smoothing is a technique for estimating probabilities for missing (or unseen) words
lower (or discount) the probability estimates for words that are seen in the document text
assign that “left-over” probability to the estimates for the words that are not seen in the text

Estimating Probabilities
Estimate for unseen words is αDP(qi|C)
P(qi|C) is the probability for query word i in the collection language model for collection C (background probability)
αD is a parameter
Estimate for words that occur is
(1 − αD) P(qi|D) + αD P(qi|C)
Different forms of estimation come from different αD

αD is a constant, λ (Jelinek-Mercer smoothing)
Gives estimate of
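That is, where c_{qi} is the number of times qi occurs in the collection and |C| is the total number of word occurrences in the collection:

$$P(q_i|D) = (1 - \lambda)\,\frac{f_{q_i, D}}{|D|} + \lambda\,\frac{c_{q_i}}{|C|}$$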

Ranking score
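The query likelihood score is then:

$$P(Q|D) = \prod_{i=1}^{n} \left( (1 - \lambda)\,\frac{f_{q_i, D}}{|D|} + \lambda\,\frac{c_{q_i}}{|C|} \right)$$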

Use logs for convenience
accuracy problems multiplying small numbers
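Taking logs, the ranking score becomes:

$$\log P(Q|D) = \sum_{i=1}^{n} \log \left( (1 - \lambda)\,\frac{f_{q_i, D}}{|D|} + \lambda\,\frac{c_{q_i}}{|C|} \right)$$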

4. Relevance Models
A relevance model represents the topics covered by relevant documents.
E.g., language model can be used to represent information need
query and relevant documents are samples of text generated from this model
Document likelihood model
P(D|R) – is interpreted as the probability of generating the text in a document given a relevance model R.
less effective than query likelihood because it is difficult to compare scores across documents of different lengths (documents are often long).
Note that a document with a model that is very similar to the relevance model is likely to be on the same topic.
how to compare two language models?

Pseudo-Relevance Feedback
Relevance feedback acquires user feedback on the outputs initially returned for a given query Q.
The feedback describes the user's information needs: users typically label relevant outputs for Q (unlabelled outputs can be treated as non-relevant).
In practical applications, there are three types of feedback: explicit feedback, implicit feedback, and “pseudo” feedback.
Pseudo relevance feedback (blind feedback) is a method of finding an initial set of most likely relevant documents. Normally, we assume that the top “k” ranked documents are relevant to Q.
Pseudo relevance feedback can be used to estimate relevance model from query Q and top-k ranked documents.
Then rank new documents by similarity of document model to this relevance model
Kullback-Leibler divergence (KL-divergence) is a well-known measure of the difference between two probability distributions. It can be used to measure the similarity between document model and relevance model.

KL-Divergence
Given the true probability distribution P and another distribution Q that is an approximation to P, the KL-divergence is defined as:
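$$KL(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$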

Use negative KL-divergence for ranking:
assume the relevance model R for the query is the true distribution (KL-divergence is not symmetric), and the document language model (D) is the approximation:
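$$-KL(R \,\|\, D) = \sum_{w} P(w|R) \log P(w|D) \;-\; \sum_{w} P(w|R) \log P(w|R)$$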

Please note the second term of this equation does not depend on the document, and can be ignored for the purpose of ranking.

KL-Divergence cont.
Given a simple maximum likelihood estimate for P(w|R), based on the frequency in the query text, the ranking score for a document is:
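$$\sum_{w} \frac{f_{w,Q}}{|Q|} \,\log P(w|D)$$

where f_{w,Q} is the number of times w occurs in query Q and |Q| is the number of words in Q.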

rank-equivalent to query likelihood score
Query likelihood model is a special case of retrieval based on relevance model.

Estimating the Relevance Model
Probability of pulling a word w out of the “bucket” representing the relevance model depends on the n query words we have just pulled out

We view the probability of w as the conditional probability of observing w given that we have just observed the query words q1 . . . qn
By definition
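$$P(w|R) \approx P(w \mid q_1 \ldots q_n) = \frac{P(w, q_1 \ldots q_n)}{P(q_1 \ldots q_n)}$$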

Estimating the Relevance Model cont.
Joint probability is
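$$P(w, q_1 \ldots q_n) = \sum_{D \in C} P(D)\, P(w|D) \prod_{i=1}^{n} P(q_i|D)$$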

where P(D) is usually assumed to be uniform;
P(w, q1 . . . qn) is simply a weighted average of the language model probabilities for w in a set of documents, where the weights are the query likelihood scores for those documents
Formal model for pseudo-relevance feedback
query expansion technique

Pseudo-Feedback Algorithm
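A minimal sketch of the algorithm, assuming two hypothetical helpers: retrieve(query) returns (doc_id, score) pairs ranked by query likelihood, with scores as probabilities (not log-probabilities), and doc_model(doc_id) returns a smoothed unigram model P(w|D) as a dict:

```python
import math

def estimate_relevance_model(query, retrieve, doc_model, k=10):
    """Estimate P(w|R) from the top-k documents of the initial ranking."""
    top_k = retrieve(query)[:k]                  # assume top-k are relevant
    total = sum(score for _, score in top_k)
    p_w_r = {}
    for doc_id, score in top_k:
        weight = score / total                   # query-likelihood weight for D
        for w, p in doc_model(doc_id).items():   # P(w|D) for each word in D
            p_w_r[w] = p_w_r.get(w, 0.0) + weight * p
    return p_w_r

def kl_rank_score(p_w_r, doc_lm):
    """Negative KL-divergence, dropping the document-independent term."""
    return sum(p * math.log(doc_lm[w]) for w, p in p_w_r.items() if w in doc_lm)
```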

Example from Top 10 Docs

Example from Top 50 Docs

Rocchio Algorithm
It is developed based on relevance feedback (or pseudo-feedback).
It is a technique for query modification.
Rocchio algorithm
Maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents.
Modifies query Q to Q’ (optimal query) according to
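In the standard form:

$$q'_j = \alpha\, q_j + \frac{\beta}{|Rel|} \sum_{D_i \in Rel} d_{ij} \;-\; \frac{\gamma}{|Nonrel|} \sum_{D_i \in Nonrel} d_{ij}$$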

α, β, and γ are parameters
Typical values 8, 16, 4
qj is the initial weight of query term j,
dij is the weight (tf*idf) of the jth term in document i,
Rel is the set of identified relevant documents,
Nonrel is the set of non-relevant documents,
| . | gives the size of a set.
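A minimal sketch of the Rocchio update, assuming dense tf*idf vectors stored as plain Python lists; clipping negative weights to zero is a common implementation choice, not part of the formula itself:

```python
def rocchio(query, rel, nonrel, alpha=8.0, beta=16.0, gamma=4.0):
    """Return the modified query Q' using the typical parameter values."""
    q_new = []
    for j in range(len(query)):
        rel_avg = sum(d[j] for d in rel) / len(rel) if rel else 0.0
        nonrel_avg = sum(d[j] for d in nonrel) / len(nonrel) if nonrel else 0.0
        # Negative weights are usually clipped to zero.
        q_new.append(max(0.0, alpha * query[j]
                              + beta * rel_avg - gamma * nonrel_avg))
    return q_new
```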

Chapter 7 (sections 7.1, 7.2 and 7.3) in the textbook: W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010.
Albishre, K., Li, Y., and Xu, Y. (2018). Query-Based Automatic Training Set Selection for Microblog Retrieval. In: Phung, D., Tseng, V. S., Webb, G. I., Ho, B., Ganji, M., and Rashidi, L. (eds), Advances in Knowledge Discovery and Data Mining (PAKDD 2018), Lecture Notes in Computer Science, vol. 10938. DOI: https://doi.org/10.1007/978-3-319-93037-4_26
Albishre, K., Li, Y., et al. (2020). Query-based unsupervised learning for improving social media search. World Wide Web 23, 1791–1809. https://doi.org/10.1007/s11280-019-00747-0

