COMP6714: Informa2on Retrieval & Web Search
Introduc*on to
Informa(on Retrieval
Lecture 9: Probabilis*c Model & Language Model
1
COMP6714: Informa2on Retrieval & Web Search
Recap of the last lecture
§ Improving search results
§ Especially for high recall. E.g., searching for aircra? so it
matches with plane; thermodynamic with heat
§ Op*ons for improving results…
§ Global methods
§ Query expansion
§ Thesauri
§ Automa*c thesaurus genera*on
§ Global indirect relevance feedback
§ Local methods
§ Relevance feedback
§ Pseudo relevance feedback
2
COMP6714: Informa2on Retrieval & Web Search
Probabilis*c relevance feedback
§ Rather than reweigh*ng in a vector space…
§ If user has told us some relevant and some irrelevant
documents, then we can proceed to build a
probabilis*c classifier, such as a Naive Bayes model:
§ P(tk|R) = |Drk| / |Dr|
§ P(tk|NR) = |Dnrk| / |Dnr|
§ tk is a term; Dr is the set of known relevant documents; Drk is the
subset that contain tk; Dnr is the set of known irrelevant
documents; Dnrk is the subset that contain tk.
3
COMP6714: Informa2on Retrieval & Web Search
Why probabili*es in IR?
User
Information Need
Documents
Document
Representation
Query
Representation
How to match?
In traditional IR systems, matching between each document and
query is attempted in a semantically imprecise space of index terms.
Probabilities provide a principled foundation for uncertain reasoning
.
Can we use probabilities to quantify our uncertainties?
Uncertain guess of
whether document
has relevant content
Understanding
of user need is
uncertain
4
COMP6714: Informa2on Retrieval & Web Search
Probabilis*c IR topics
§ Classical probabilis*c retrieval model
§ Probability ranking principle, etc.
§ (Naïve) Bayesian Text Categoriza*on
§ Bayesian networks for text retrieval
§ Language model approach to IR
§ An important emphasis in recent work
§ Probabilis2c methods are one of the oldest but also
one of the currently hoGest topics in IR.
§ Tradi2onally: neat ideas, but they’ve never won on
performance. It may be different now.
5
COMP6714: Informa2on Retrieval & Web Search
The document ranking problem
n We have a collec*on of documents
n User issues a query
n A list of documents needs to be returned
n Ranking method is core of an IR system:
n In what order do we present documents to the user?
n We want the “best” document to be first, second best
second, etc….
n Idea: Rank by probability of relevance of the
document w.r.t. informa(on need
n P(relevant|documenti, query)
6
COMP6714: Informa2on Retrieval & Web Search
Recall a few probability basics
§ For events a and b:
§ Bayes’ Rule
§ Odds:
∑ =
==
=
==∩=
aax
xpxbp
apabp
bp
apabp
bap
apabpbpbap
apabpbpbapbapbap
,
)()|(
)()|(
)(
)()|(
)|(
)()|()()|(
)()|()()|()(),(
)(1
)(
)(
)(
)(
ap
ap
ap
ap
aO
−
==
Posterior
Prior
7
COMP6714: Informa2on Retrieval & Web Search
The Probability Ranking Principle
“If a reference retrieval system’s response to each request is a
ranking of the documents in the collec*on in order of decreasing
probability of relevance to the user who submiged the request,
where the probabili*es are es*mated as accurately as possible on
the basis of whatever data have been made available to the system
for this purpose, the overall effec2veness of the system to its user
will be the best that is obtainable on the basis of those data.”
§ [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron;
van Rijsbergen (1979:113); Manning & Schütze (1999:538)
8
COMP6714: Informa2on Retrieval & Web Search
Probability Ranking Principle
Let x be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed)
query and let NR represent non-relevance.
)(
)()|(
)|(
)(
)()|(
)|(
xp
NRpNRxp
xNRp
xp
RpRxp
xRp
=
=
p(x|R), p(x|NR) – probability that if a relevant (non-relevant)
document is retrieved, it is x.
Need to find p(R|x) – probability that a document x is relevant.
p(R),p(NR) – prior probability
of retrieving a (non) relevant
document
1)|()|( =+ xNRpxRp
R={0,1} vs. NR/R
9
COMP6714: Informa2on Retrieval & Web Search
Probability Ranking Principle (PRP)
§ Simple case: no selec*on costs or other u*lity
concerns that would differen*ally weight errors
§ Bayes’ Op*mal Decision Rule
§ x is relevant iff p(R|x) > p(NR|x)
§ PRP in ac*on: Rank all documents by p(R|x)
§ Theorem:
§ Using the PRP is op*mal, in that it minimizes the loss
(Bayes risk) under 1/0 loss
§ Provable if all probabili*es correct, etc. [e.g., Ripley 1996]
10
COMP6714: Informa2on Retrieval & Web Search
Probability Ranking Principle
§ More complex case: retrieval costs.
§ Let d be a document
§ C – cost of retrieval of relevant document
§ C’ – cost of retrieval of non-relevant document
§ Probability Ranking Principle: if
for all d’ not yet retrieved, then d is the next
document to be retrieved
§ We won’t further consider loss/u(lity from now
on
�
C ⋅ p(R | d) + ′ C ⋅ (1− p(R | d)) ≤ C ⋅ p(R | ′ d ) + ′ C ⋅ (1− p(R | ′ d ))
11
COMP6714: Informa2on Retrieval & Web Search
Probability Ranking Principle
§ How do we compute all those probabili*es?
§ Do not know exact probabili*es, have to use es*mates
§ Binary Independence Retrieval (BIR) – which we
discuss later today – is the simplest model
§ Ques*onable assump*ons
§ “Relevance” of each document is independent of
relevance of other documents.
§ Really, it’s bad to keep on returning duplicates
§ Boolean model of relevance
§ That one has a single step informa*on need
§ Seeing a range of results might let user refine query
12
COMP6714: Informa2on Retrieval & Web Search
Probabilis*c Retrieval Strategy
§ Es*mate how terms contribute to relevance
§ How do things like u, df, and length influence your
judgments about document relevance?
§ One answer is the Okapi formulae (S. Robertson)
§ Combine to find document relevance probability
§ Order documents by decreasing probability
13
COMP6714: Informa2on Retrieval & Web Search
Probabilis*c Ranking
Basic concept:
“For a given query, if we know some documents that are
relevant, terms that occur in those documents should be
given greater weighting in searching for other relevant
documents.
By making assumptions about the distribution of terms
and applying Bayes Theorem, it is possible to derive
weights theoretically.”
Van Rijsbergen
14
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
§ Tradi*onally used in conjunc*on with PRP
§ “Binary” = Boolean: documents are represented as binary
incidence vectors of terms (cf. lecture 1):
§
§ iff term i is present in document x.
§ “Independence”: terms occur in documents independently
§ Different documents can be modeled as same vector
§ Bernoulli Naive Bayes model (cf. text categoriza*on!)
),,( 1 nxxx …
!
=
1=ix
15
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
§ Queries: binary term incidence vectors
§ Given query q,
§ for each document d need to compute p(R|q,d).
§ replace with compu*ng p(R|q,x) where x is binary term
incidence vector represen*ng d Interested only in ranking
§ Will use odds and Bayes’ Rule:
)|(
),|()|(
)|(
),|()|(
),|(
),|(
),|(
qxp
qNRxpqNRp
qxp
qRxpqRp
xqNRp
xqRp
xqRO
!
!
!
!
!
!
!
==
16
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
• Using Independence Assumption:
∏
=
=
n
i i
i
qNRxp
qRxp
qNRxp
qRxp
1 ),|(
),|(
),|(
),|(
!
!
),|(
),|(
)|(
)|(
),|(
),|(
),|(
qNRxp
qRxp
qNRp
qRp
xqNRp
xqRp
xqRO !
!
!
!
!
⋅==
Constant for a
given query Needs estimation
∏
=
⋅=
n
i i
i
qNRxp
qRxp
qROdqRO
1 ),|(
),|(
)|(),|(• So :
17
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
∏
=
⋅=
n
i i
i
qNRxp
qRxp
qROdqRO
1 ),|(
),|(
)|(),|(
• Since xi is either 0 or 1:
∏∏
== =
=
⋅
=
=
⋅=
01 ),|0(
),|0(
),|1(
),|1(
)|(),|(
ii x i
i
x i
i
qNRxp
qRxp
qNRxp
qRxp
qROdqRO
• Let );,|1( qRxpp ii == );,|1( qNRxpr ii ==
• Assume, for all terms not occurring in the query (qi=0) ii rp =
Then…
This can be
changed (e.g., in
relevance feedback)
18
COMP6714: Informa2on Retrieval & Web Search
All matching terms Non-matching
query terms
Binary Independence Model
All matching terms
All query terms
∏∏
∏∏
===
=
===
−
−
⋅
−
−
⋅=
−
−
⋅⋅=
11
1
01
1
1
)1(
)1(
)|(
1
1
)|(),|(
iii
i
iii
q i
i
qx ii
ii
q
x i
i
qx i
i
r
p
pr
rp
qRO
r
p
r
p
qROxqRO
!
19
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
Constant for
each query
Only quantity to be estimated
for rankings
∏∏
=== −
−
⋅
−
−
⋅=
11 1
1
)1(
)1(
)|(),|(
iii q i
i
qx ii
ii
r
p
pr
rp
qROxqRO
!
• Retrieval Status Value:
�
RSV = log
pi(1− ri)
ri(1− pi)xi = qi =1
∏ = log pi(1− ri)
ri(1− pi)xi = qi =1
∑
20
�
= log(odds(pi))− log(odds(ri))( )
xi = qi =1
∑ = logit(pi) − logit(ri)( )
xi = qi =1
∑
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
• All boils down to computing RSV.
∑∏
==== −
−
=
−
−
=
11 )1(
)1(
log
)1(
)1(
log
iiii qx ii
ii
qx ii
ii
pr
rp
pr
rp
RSV
∑
==
=
1
;
ii qx
icRSV
�
ci = log
pi(1− ri)
ri(1− pi)
= logit(pi) − logit(ri)
So, how do we compute ci’s from our data ?
21
COMP6714: Informa2on Retrieval & Web Search
Binary Independence Model
• Estimating RSV coefficients.
• For each term i look at this table of document counts:
S
s
pi ≈ )(
)(
SN
sn
ri
−
−
≈
)()(
)(
log),,,(
sSnNsn
sSs
sSnNKci
+−−−
−
=≈
• Estimates:
22
Documents Relevant Non-Relevant Total
Xi = 1
Xi = 0
Total N S
n s n-s
S-s N-n-S+s N-n
N-S
However, these
estimates could
be 0.
COMP6714: Informa2on Retrieval & Web Search
Add ½ Smoothing
• Add ½ to each of the center four cells.
�
ci ≈ K(N,n,S,s) = log
(s+1 2) (S − s+1 2)
(n − s+1 2) (N − n − S + s+1 2)
23
Documents Relevant Non-Relevant Total
Xi = 1
Xi = 0
Total N+2 S+1
n+1 s+½ n-s+½
S-s+½ N-n-S+s+½ N-n+1
N-S+1
COMP6714: Informa2on Retrieval & Web Search
Example /1
24
§ Query = {x1, x2}
§ O(R=1|D3, q)
Doc Judgment x1 x2 x3
D1 R 1 1 1
D2 R 0 0 1
D3 R 1 0 0
D4 NR 1 0 1
D5 NR 0 1 1
COMP6714: Informa2on Retrieval & Web Search
Example /2
25
§ Es*mate pi and ri
Doc Judgment x1 x2 x3
D1 R 1 1 1
D2 R 0 0 1
D3 R 1 0 0
D4 NR 1 0 1
D5 NR 0 1 1
COMP6714: Informa2on Retrieval & Web Search
Es*ma*on – key challenge
§ If non-relevant documents are approximated by
the whole collec*on, then ri (prob. of occurrence
in non-relevant documents for query) is n/N and
§ log (1– ri)/ri = log (N– n)/n ≈ log N/n = IDF!
§ pi (probability of occurrence in relevant
documents) can be es*mated in various ways:
§ from relevant documents if know some
§ Relevance weigh*ng can be used in feedback loop
§ constant (Cro} and Harper combina*on match – 0.5) –
then just get idf weigh*ng of terms
§ propor*onal to prob. of occurrence in collec*on
§ more accurately, to log of this (Greiff, SIGIR 1998)
26
in fact, u-idf can be deemed as the cross-entropy
COMP6714: Informa2on Retrieval & Web Search
27
Itera*vely es*ma*ng pi
1. Assume that pi constant over all xi in query
§ pi = 0.5 (even odds) for any given doc
2. Determine guess of relevant document set:
§ V is fixed size set of highest ranked documents on this
model (note: now a bit like u.idf!)
3. We need to improve our guesses for pi and ri, so
§ Use distribu*on of xi in docs in V. Let Vi be set of
documents containing xi
§ pi = |Vi| / |V|
§ Assume if not retrieved then not relevant
§ ri = (ni – |Vi|) / (N – |V|)
4. Go to 2. un*l converges then return ranking
op*onal
COMP6714: Informa2on Retrieval & Web Search
Probabilis*c Relevance Feedback
1. Guess a preliminary probabilis*c descrip*on of R
and use it to retrieve a first set of documents V, as
above.
2. Interact with the user to refine the descrip*on:
learn some definite members of R and NR
3. Rees*mate pi and ri on the basis of these
§ Or can combine new informa*on with original guess (use
Bayesian prior):
4. Repeat, thus genera*ng a succession of
approxima*ons to R.
κ
κ
+
+
=
||
|| )1()2(
V
pV
p iii
κ is
prior
weight
28
op*onal
COMP6714: Informa2on Retrieval & Web Search
PRP and BIR
§ Ge�ng reasonable approxima*ons of probabili*es
is possible.
§ Requires restric*ve assump*ons:
§ term independence
§ terms not in query don’t affect the outcome
§ boolean representa*on of documents/queries/
relevance
§ document relevance values are independent
§ Some of these assump*ons can be removed
§ Problem: either require par*al relevance informa*on or only can
derive somewhat inferior term weights
29
COMP6714: Informa2on Retrieval & Web Search
Okapi BM25
§ Heuris*cally extend the BIR to include informa*on
of term frequencies, document length, etc.
§ Typically,
30
�
RSVd = log
N
dft
⎛
⎝
⎜
⎞
⎠
⎟
t∈q
∑ ⋅ (k1 +1)tf t ,d
k1 (1− b) + b ⋅
Ld
Lave
⎛
⎝
⎜
⎞
⎠
⎟ + tf t ,d
⋅
(k3 +1)tf t ,q
k3 + tf t,q
idf Normalized term
freq (doc)
Normalized term
freq (query)
�
k1,k3 ∈ [1.2,2.0],b = 0.75
caps the contribu*on of R
COMP6714: Informa2on Retrieval & Web Search
Good and Bad News
§ Standard Vector Space Model
§ Empirical for the most part; success measured by results
§ Few proper*es provable
§ Probabilis*c Model Advantages
§ Based on a firm theore*cal founda*on
§ Theore*cally jus*fied op*mal ranking scheme
§ Disadvantages
§ Making the ini*al guess to get V
§ Binary word-in-doc weights (not using term frequencies)
§ Independence of terms (can be alleviated)
§ Amount of computa*on
§ Has never worked convincingly beger in prac*ce
31
COMP6714: Informa2on Retrieval & Web Search
Resources
S. E. Robertson and K. Spärck Jones. 1976. Relevance Weigh*ng of Search
Terms. Journal of the American Society for Informa2on Sciences 27(3):
129–146.
C. J. van Rijsbergen. 1979. Informa2on Retrieval. 2nd ed. London:
Bugerworths, chapter 6. [Most details of math] hgp://
www.dcs.gla.ac.uk/Keith/Preface.html
N. Fuhr. 1992. Probabilis*c Models in Informa*on Retrieval. The Computer
Journal, 35(3),243–255. [Easiest read, with BNs]
F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This
Document Relevant? … Probably: A Survey of Probabilis*c Models in
Informa*on Retrieval. ACM Compu2ng Surveys 30(4): 528–552.
hgp://www.acm.org/pubs/cita*ons/journals/surveys/1998-30-4/p528-crestani/
[Adds very ligle material that isn’t in van Rijsbergen or Fuhr ]
32
COMP6714: Informa2on Retrieval & Web Search
Resources
H.R. Turtle and W.B. Cro}. 1990. Inference Networks for Document Retrieval.
Proc. ACM SIGIR: 1-24.
E. Charniak. Bayesian nets without tears. AI Magazine 12(4): 50-63 (1991). hGp://
www.aaai.org/Library/Magazine/Vol12/12-04/vol12-04.html
D. Heckerman. 1995. A Tutorial on Learning with Bayesian Networks. Microso}
Technical Report MSR-TR-95-06
hGp://www.research.microso?.com/~heckerman/
N. Fuhr. 2000. Probabilis*c Datalog: Implemen*ng Logical Informa*on Retrieval
for Advanced Applica*ons. Journal of the American Society for Informa2on
Science 51(2): 95–110.
R. K. Belew. 2001. Finding Out About: A Cogni2ve Perspec2ve on Search Engine
Technology and the WWW. Cambridge UP 2001.
MIR 2.5.4, 2.8
33
COMP6714: Informa2on Retrieval & Web Search
LANGUAGE MODEL
34
COMP6714: Informa2on Retrieval & Web Search
Today
§ The Language Model Approach to IR
§ Basic query genera*on model
§ Alterna*ve models
35
COMP6714: Informa2on Retrieval & Web Search
Standard Probabilis*c IR
query
d1
d2
dn
…
Information
need
document collection
matching
),|( dQRP
36
COMP6714: Informa2on Retrieval & Web Search
IR based on Language Model (LM)
query
d1
d2
dn
…
Information
need
document collection
generation
)|( dMQP 1d
M
2d
M
…
nd
M§ A common search heuris*c is to use words that you expect to find in matching documents as your
query – why, I saw Sergey Brin advoca*ng that
strategy on late night TV one night in my hotel
room, so it must be good!
§ The LM approach directly exploits that idea!
§ See later slides for a more formal jus*fica*on 37
COMP6714: Informa2on Retrieval & Web Search
Formal Language (Model)
§ Tradi*onal genera*ve model: generates strings
§ Finite state machines or regular grammars, etc.
§ Example:
I wish
I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish
…
38
COMP6714: Informa2on Retrieval & Web Search
Stochas*c Language Models
§ Models probability of genera*ng strings in the
language (commonly all strings over alphabet ∑)
0.2 the
0.1 a
0.01 man
0.01 woman
0.03 said
0.02 likes
… …
the man likes the woman
0.2 0.01 0.02 0.2 0.01
multiply
Model M
P(s | M) = 0.00000008
s:
39
COMP6714: Informa2on Retrieval & Web Search
Stochas*c Language Models
§ Model probability of genera*ng any string
0.2 the
0.01 class
0.0001 sayst
0.0001 pleaseth
0.0001 yon
0.0005 maiden
0.01 woman
Model M1 Model M2
maiden class pleaseth yon the
0.0005 0.01 0.0001 0.0001 0.2
0.01 0.0001 0.02 0.1 0.2
P(s|M2) > P(s|M1)
0.2 the
0.0001 class
0.03 sayst
0.02 pleaseth
0.1 yon
0.01 maiden
0.0001 woman
40
COMP6714: Informa2on Retrieval & Web Search
Stochas*c Language Models
§ A sta*s*cal model for genera*ng text
§ Probability distribu*on over strings in a given language
M
P ( | M ) = P ( | M )
P ( | M, )
P ( | M, )
P ( | M, )
41
COMP6714: Informa2on Retrieval & Web Search
Unigram and higher-order models
§ Unigram Language Models
§ Bigram (generally, n-gram) Language Models
§ Other Language Models
§ Grammar-based models (PCFGs), etc.
§ Probably not the first thing to try in IR
= P ( ) P ( | ) P ( | ) P ( | )
P ( ) P ( ) P ( ) P ( )
P ( )
P ( ) P ( | ) P ( | ) P ( | )
Easy.
Effective!
42
COMP6714: Informa2on Retrieval & Web Search
Using Language Models in IR
§ Treat each document as the basis for a model (e.g.,
unigram sufficient sta*s*cs)
§ Rank document d based on P(d | q)
§ P(d | q) = P(q | d) x P(d) / P(q)
§ P(q) is the same for all documents, so ignore
§ P(d) [the prior] is o}en treated as the same for all d
§ But we could use criteria like authority, length, genre
§ P(q | d) is the probability of q given d’s model
§ Very general formal approach
43
COMP6714: Informa2on Retrieval & Web Search
The fundamental problem of LMs
§ Usually we don’t know the model M
§ But have a sample of text representa*ve of that model
§ Es*mate a language model from a sample
§ Then compute the observa*on probability
P ( | M ( ) )
M
doc query
44
COMP6714: Informa2on Retrieval & Web Search
Language Models for IR
§ Language Modeling Approaches
§ Agempt to model query genera(on process
§ Documents are ranked by the probability that a query
would be observed as a random sample from the
respec(ve document model
§ Mul*nomial approach
45
COMP6714: Informa2on Retrieval & Web Search
Retrieval based on probabilis*c LM
§ Treat the genera*on of queries as a random process.
§ Approach
§ Infer a language model for each document.
§ Es*mate the probability of genera*ng the query according
to each of these models.
§ Rank the documents according to these probabili*es.
§ Usually a unigram es*mate of words is used
§ Some work on bigrams, paralleling van Rijsbergen
46
COMP6714: Informa2on Retrieval & Web Search
Retrieval based on probabilis*c LM
§ Intui*on
§ Users …
§ Have a reasonable idea of terms that are likely to occur in
documents of interest.
§ They will choose query terms that dis*nguish these documents
from others in the collec*on.
§ Collec*on sta*s*cs …
§ Are integral parts of the language model.
§ Are not used heuris*cally as in many other approaches.
§ In theory. In prac*ce, there’s usually some wiggle room for
empirically set parameters
47
COMP6714: Informa2on Retrieval & Web Search
Query genera*on probability (1)
§ Ranking formula
§ The probability of producing the query given the language model of
document d using MLE is:
∏
∏
∈
∈
=
=
Qt d
dt
Qt
dmld
dl
tf
MtpMQp
),(
)|(ˆ)|(ˆ
Unigram assumption:
Given a particular language model, the
query terms occur independently
),( dttf
ddl
: language model of document d
: raw tf of term t in document d
: total number of tokens in document d
dM
)|()(
)|()(),(
dMQpdp
dQpdpdQp
≈
=
48
COMP6714: Informa2on Retrieval & Web Search
Insufficient data
§ Zero probability
§ May not wish to assign a probability of zero to a document
that is missing one or more of the query terms [gives
conjunc*on seman*cs]
§ General approach
§ A non-occurring term is possible, but no more likely than
would be expected by chance in the collec*on.
§ If ,
0)|( =dMtp
0),( =dttf
cs
cs
cf
Mtp td =)|(
tcf : raw count of term t in the collection
: raw collection size(total number of tokens in the collection)
49
COMP6714: Informa2on Retrieval & Web Search
Insufficient data
§ Zero probabili*es spell disaster
§ We need to smooth probabili*es
§ Discount nonzero probabili*es
§ Give some probability mass to unseen things
§ There’s a wide space of approaches to smoothing
probability distribu*ons to deal with this problem,
such as adding 1, ½ or ℇ to counts, Dirichlet priors,
discoun*ng, and interpola*on
§ [See FSNLP ch. 6 or CS224N if you want more]
§ A simple idea that works well in prac*ce is to use a
mixture between the document mul*nomial and the
collec*on mul*nomial distribu*on
50
COMP6714: Informa2on Retrieval & Web Search
Mixture model
§ Jelinek-Mercer method
§ P(w|d) = λPmle(w|Md) + (1 – λ)Pmle(w|Mc)
§ Mixes the probability from the document with the
general collec*on frequency of the word.
§ Correctly se�ng λ is very important
§ A high value of lambda makes the search
“conjunc*ve-like” – suitable for short queries
§ A low value is more suitable for long queries
§ Can tune λ to op*mize performance
§ Perhaps make it dependent on document size (cf. Dirichlet
prior or Wigen-Bell smoothing) 51
COMP6714: Informa2on Retrieval & Web Search
Basic mixture model summary
§ General formula*on of the LM for IR
§ The user has a document in mind, and generates the query
from this document.
§ The equa*on represents the probability that the
document that the user had in mind was in fact this one.
∏
∈
+−=
Qt
dMtptpdpdQp ))|()()1(()(),( λλ
collection/background language model
individual-document model
52
COMP6714: Informa2on Retrieval & Web Search
Rela*onship to idf
fqi,D=0 è query word that
does not occur in the doc
Add contribu*ons
from i:fqi,D>0
Becomes a
constant | Q, C
propor*onal to the u, inversely
propor*onal to the cf
Note here (i.e., [CMS09]) λ is
mul*plied to the background
model.
53
COMP6714: Informa2on Retrieval & Web Search
Example
§ Document collec*on (2 documents)
§ d1: Xerox reports a profit but revenue is down
§ d2: Lucent narrows quarter loss but revenue decreases
further
§ Model: MLE unigram from documents; λ = ½
§ Query: revenue down
§ P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2]
= 1/8 x 3/32 = 3/256
§ P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2]
= 1/8 x 1/32 = 1/256
§ Ranking: d1 > d2 54
COMP6714: Informa2on Retrieval & Web Search
Ponte and Cro} Experiments
§ Data
§ TREC topics 202-250 on TREC disks 2 and 3
§ Natural language queries consis*ng of one sentence each
§ TREC topics 51-100 on TREC disk 3 using the concept fields
§ Lists of good terms
…
1. Contract, agreement
2. Launch vehicle, rocket, payload, satellite
3. Launch services, … 55
COMP6714: Informa2on Retrieval & Web Search
Precision/recall results 202-250
56
COMP6714: Informa2on Retrieval & Web Search
Precision/recall results 51-100
57
COMP6714: Informa2on Retrieval & Web Search
Language models: pro & con
§ Novel way of looking at the problem of text retrieval
based on probabilis*c language modeling
§ Conceptually simple and explanatory
§ Formal mathema*cal model
§ Natural use of collec*on sta*s*cs, not heuris*cs (almost…)
§ LMs provide effec*ve retrieval and can be improved to the
extent that the following condi*ons can be met
§ Our language models are accurate representa*ons of the data.
§ Users have some sense of term distribu*on.*
§ *Or we get more sophis*cated with transla*on model
58
COMP6714: Informa2on Retrieval & Web Search
Comparison With Vector Space
§ There’s some rela*on to tradi*onal u.idf models:
§ (unscaled) term frequency is directly in model
§ the probabili*es do length normaliza*on of term
frequencies
§ the effect of doing a mixture with overall collec*on
frequencies is a ligle like idf: terms rare in the general
collec*on but common in some documents will have a
greater influence on the ranking
59
COMP6714: Informa2on Retrieval & Web Search
Comparison With Vector Space
§ Similar in some ways
§ Term weights based on frequency
§ Terms o}en used as if they were independent
§ Inverse document/collec*on frequency used
§ Some form of length normaliza*on useful
§ Different in others
§ Based on probability rather than similarity
§ Intui*ons are probabilis*c rather than geometric
§ Details of use of document length and term, document,
and collec*on frequency differ
60
COMP6714: Informa2on Retrieval & Web Search
Resources
J.M. Ponte and W.B. Cro}. 1998. A language modelling approach to informa*on
retrieval. In SIGIR 21.
D. Hiemstra. 1998. A linguis*cally mo*vated probabilis*c model of informa*on
retrieval. ECDL 2, pp. 569–584.
A. Berger and J. Lafferty. 1999. Informa*on retrieval as sta*s*cal transla*on. SIGIR 22,
pp. 222–229.
D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model informa*on
retrieval system. SIGIR 22, pp. 214–221.
[Several relevant newer papers at SIGIR 23–25, 2000–2002.]
Workshop on Language Modeling and Informa*on Retrieval, CMU 2001. hgp://
la.l*.cs.cmu.edu/callan/Workshops/lmir01/ .
The Lemur Toolkit for Language Modeling and Informa*on Retrieval. hgp://
www-2.cs.cmu.edu/~lemur/ . CMU/Umass LM and IR system in C(++), currently
ac*vely developed.
61