COMP6714: Information Retrieval & Web Search
Introduction to
Information Retrieval
Lecture 7: Scoring and results assembly
1
Recap: tf-idf weighting
§ The tf-idf weight of a term is the product of its tf weight and its idf weight.
§ Best known weighting scheme in information retrieval
§ Increases with the number of occurrences within a document
§ Increases with the rarity of the term in the collection
$w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$
Ch. 6
2
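A minimal sketch of this weighting in Python (the counts in the example call are made up):

```python
from math import log10

def tfidf_weight(tf, df, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); zero if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + log10(tf)) * log10(N / df)

# a term occurring 3 times in a doc, appearing in 1,000 of 1,000,000 docs:
print(tfidf_weight(3, 1_000, 1_000_000))   # (1 + log10 3) * 3 ≈ 4.43
```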
Recap: Queries as vectors
§ Key idea 1: Do the same for queries: represent them
as vectors in the space
§ Key idea 2: Rank documents according to their
proximity to the query in this space
§ proximity = similarity of vectors
Ch. 6
3
Recap: cosine(query,document)
$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$
(the numerator is the dot product; dividing by the lengths turns q and d into unit vectors)
cos(q,d) is the cosine similarity of q and d … or,
equivalently, the cosine of the angle between q and d.
Ch. 6
4
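For illustration, a cosine over sparse term → weight dictionaries (toy vectors; the names are only for illustration):

```python
from math import sqrt

def cosine(q, d):
    """cos(q, d) = q·d / (|q| |d|) for sparse term -> weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = sqrt(sum(w * w for w in q.values()))
    norm_d = sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q = {"best": 1.0, "car": 1.0, "insurance": 1.0}
d = {"car": 0.8, "insurance": 0.6, "auto": 0.3}
print(cosine(q, d))   # ≈ 0.77
```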
This lecture
§ Speeding up vector space ranking
§ Putting together a complete search system
§ Will require learning about a number of miscellaneous topics and heuristics
Ch. 7
5
Question: Why don't we just use the query processing methods for Boolean queries?
Computing cosine scores
Sec. 6.3.3
6
[Figure: the term-at-a-time CosineScore algorithm]
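A minimal term-at-a-time sketch in the spirit of the CosineScore algorithm of Sec. 6.3.3; the index layout (term → list of (docID, w_t,d) pairs) and the helper names are assumptions for illustration:

```python
import heapq

def cosine_score(index, doc_norms, query_weights, K):
    """Term-at-a-time scoring: accumulate w_tq * w_td per doc, normalize, return top K.
    index: term -> [(docID, w_td), ...]; doc_norms: docID -> |d|; query_weights: term -> w_tq."""
    scores = {}                                   # accumulators
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_norms[doc_id]       # length-normalize by |d|
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
```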
Efficient cosine ranking
§ Find the K docs in the collection “nearest” to the query ⇒ K largest query-doc cosines.
§ Efficient ranking:
§ Computing a single cosine efficiently.
§ Choosing the K largest cosine values efficiently.
§ Can we do this without computing all N cosines?
Sec. 7.1
7
Efficient cosine ranking
§ What we’re doing in effect: solving the K-nearest
neighbor problem for a query vector
§ In general, we do not know how to do this efficiently
for high-dimensional spaces
§ But it is solvable for short queries, and standard
indexes support this well
Sec. 7.1
8
Curse of dimensionality
Special case – unweighted queries
§ No weighting on query terms
§ Assume each query term occurs only once
§ Then for ranking, don’t need to normalize query
vector
§ Slight simplification of the algorithm from Lecture 6
Sec. 7.1
9
Faster cosine: unweighted query
Sec. 7.1
10
Computing the K largest cosines: selection vs. sorting
§ Typically we want to retrieve the top K docs (in the cosine ranking for the query)
§ not to totally order all docs in the collection
§ Can we pick off the docs with the K highest cosines?
§ Let n = the number of docs with nonzero cosines
§ We seek the K best of these n
Sec. 7.1
11
Use heap for selecting top K
§ Binary tree in which each node's value > the values of its children
§ Takes 2n operations to construct, then each of the K "winners" is read off in 2 log n steps.
§ Total time is O(n + K*log(n))
§ For n = 1M, K = 100, this is about 10% of the cost of sorting.
[Figure: a binary max-heap with root 1, children .9 and .3, and leaves .8, .3, .1, .1]
Sec. 7.1
12
http://en.wikipedia.org/wiki/Binary_heap
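A quick sketch of this heap-based selection using Python's heapq (toy scores; heapq is a min-heap, so scores are negated to simulate a max-heap):

```python
import heapq

scores = {1: 0.9, 2: 0.3, 3: 0.8, 4: 0.3, 5: 0.1, 6: 0.1}   # docID -> cosine (toy values)
K = 3

heap = [(-s, doc) for doc, s in scores.items()]
heapq.heapify(heap)                                # O(n) construction
winners = [heapq.heappop(heap) for _ in range(K)]  # K pops, O(K log n)
top_k = [(doc, -neg) for neg, doc in winners]
print(top_k)   # [(1, 0.9), (3, 0.8), (2, 0.3)]  (docs 2 and 4 tie at 0.3)
```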
Quick Select
§ QuickSelect is a QuickSort-like algorithm for finding the top-K elements of an array
§ Takes O(n) expected time
§ Sorting the top-K items takes O(K*log(K)) time
§ Total time is O(n + K*log(K))
13
http://en.wikipedia.org/wiki/Quickselect
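As an illustration, numpy's partition (an introselect/quickselect variant) gives the same O(n) + O(K log K) behaviour on a made-up score array:

```python
import numpy as np

scores = np.array([0.9, 0.3, 0.8, 0.3, 0.1, 0.1])
K = 3

top_unordered = np.partition(scores, scores.size - K)[-K:]   # O(n): K largest, in no particular order
top_sorted = np.sort(top_unordered)[::-1]                    # O(K log K): sort only the K winners
print(top_sorted)                                            # [0.9 0.8 0.3]
```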
Query Processing
§ Document-at-a-time
§ Calculates complete scores for documents by processing all term lists, one document at a time
§ Term-at-a-time
§ Accumulates scores for documents by processing term lists one at a time
§ Both approaches have optimization techniques that significantly reduce the time required to generate scores
§ Distinguish between safe and heuristic optimizations
[CMS09].begin
Document-At-A-Time
[Figure: document-at-a-time query evaluation example and pseudocode from [CMS09]]
Term-At-A-Time
[Figure: term-at-a-time query evaluation example and pseudocode from [CMS09]; the accumulators A_d hold partial document scores]
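A minimal Python sketch of the two strategies, assuming an index laid out as term → docID-sorted list of (docID, weight) pairs (the function names are just for illustration):

```python
import heapq

def score_taat(index, query_terms):
    """Term-at-a-time: walk one postings list at a time, keeping partial scores in accumulators A_d."""
    A = {}                                            # accumulators: docID -> partial score
    for t in query_terms:
        for doc_id, w in index.get(t, []):
            A[doc_id] = A.get(doc_id, 0.0) + w
    return A

def score_daat(index, query_terms):
    """Document-at-a-time: advance all postings lists together, completing one doc's score at a time."""
    iters = {t: iter(index.get(t, [])) for t in query_terms}
    frontier = []                                     # heap of (docID, term, weight) list heads
    for t, it in iters.items():
        head = next(it, None)
        if head is not None:
            heapq.heappush(frontier, (head[0], t, head[1]))
    scores = {}
    while frontier:
        doc_id = frontier[0][0]
        total = 0.0
        while frontier and frontier[0][0] == doc_id:  # consume every list positioned on this doc
            _, t, w = heapq.heappop(frontier)
            total += w
            nxt = next(iters[t], None)
            if nxt is not None:
                heapq.heappush(frontier, (nxt[0], t, nxt[1]))
        scores[doc_id] = total
    return scores
```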
Optimization Techniques
§ Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently
§ Two classes of optimization
§ Read less data from inverted lists
§ e.g., skip lists
§ better for simple feature functions
§ Calculate scores for fewer documents
§ e.g., conjunctive processing
§ better for complex feature functions
Conjunctive Processing
§ Requires that result documents contain all the query terms (i.e., conjunctive Boolean queries)
§ More efficient
§ Can also be more effective for short queries
§ Default for many search engines
§ Can be combined with both DAAT and TAAT
20
Conjunctive Term-at-a-Time
[Figure: conjunctive term-at-a-time pseudocode from [CMS09]]
Conjunctive Document-at-a-Time
[Figure: conjunctive document-at-a-time pseudocode from [CMS09]]
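A small sketch of conjunctive document-at-a-time evaluation under the same assumed index layout; only docs containing every query term are scored:

```python
def score_conjunctive_daat(index, query_terms):
    """Intersect the docID-sorted postings lists of all query terms, summing weights as we go.
    index: term -> [(docID, weight), ...] sorted by docID."""
    lists = [index.get(t, []) for t in query_terms]
    if not lists or any(not pl for pl in lists):
        return {}
    pos = [0] * len(lists)
    scores = {}
    while all(p < len(pl) for p, pl in zip(pos, lists)):
        heads = [lists[i][pos[i]][0] for i in range(len(lists))]
        target = max(heads)
        if all(d == target for d in heads):            # every list is positioned on the same doc
            scores[target] = sum(lists[i][pos[i]][1] for i in range(len(lists)))
            pos = [p + 1 for p in pos]
        else:                                          # skip lagging lists forward to the target
            for i in range(len(lists)):
                while pos[i] < len(lists[i]) and lists[i][pos[i]][0] < target:
                    pos[i] += 1
    return scores
```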
Threshold Methods
§ Threshold methods use the number of top-ranked documents needed (k) to optimize query processing
§ for most applications, k is small
§ For any query, there is a minimum score that each document needs to reach before it can be shown to the user
§ score of the kth-highest scoring document
§ gives the threshold τ
§ optimization methods estimate τʹ in order to ignore documents
Threshold Methods
§ For document-at-a-time processing, use the score of the lowest-ranked document so far for τʹ
§ for term-at-a-time, have to use the kth-largest score in the accumulator table
§ The MaxScore method compares the maximum score that remaining documents could have to τʹ
§ a safe optimization, in that the ranking will be the same as without the optimization
MaxScore Example
§ Compute the max term score, μt, of each list and sort the lists in decreasing order of μt (this order is fixed during query processing)
§ Assume k = 3; τʹ is the lowest score after the first three docs
§ If μtree < τʹ ⇒ any doc that scores higher than τʹ must contain at least one of the first two keywords (aka the required term set)
§ Use the postings lists of the required term set to "drive" the query processing
§ Only some of the white postings in the list of "tree" are checked to compute scores ⇒ at least all the gray postings are skipped.
Better than the example in the textbook. See my Note 2 too.
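A simplified, hypothetical sketch of the required-term-set idea (not the full MaxScore algorithm): given the per-term maxima μt sorted in decreasing order and the current τʹ, a doc containing only terms from the dropped suffix can score at most the suffix's sum of μt, so those lists need not drive the processing:

```python
def required_terms(max_scores, tau_prime):
    """max_scores: [(term, mu_t), ...] sorted by decreasing mu_t.
    Return the shortest prefix of terms whose postings must drive the traversal:
    a doc containing none of them scores at most the suffix sum, which is <= tau_prime."""
    suffix_sum = 0.0
    cut = len(max_scores)
    for i in range(len(max_scores) - 1, -1, -1):
        if suffix_sum + max_scores[i][1] <= tau_prime:
            suffix_sum += max_scores[i][1]
            cut = i
        else:
            break
    return [t for t, _ in max_scores[:cut]]

# e.g., if mu_tree <= tau', "tree" is dropped and the first two lists drive processing
# (the terms and scores below are invented):
print(required_terms([("t1", 3.0), ("t2", 2.5), ("tree", 0.9)], 1.0))   # ['t1', 't2']
```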
Other Approaches
§ Early termination of query processing
§ ignore high-frequency word lists in term-at-a-time
§ ignore documents at the end of lists in doc-at-a-time
§ an unsafe optimization
§ List ordering
§ order inverted lists by a quality metric (e.g., PageRank) or by partial score
§ makes unsafe (and fast) optimizations more likely to produce good documents
[CMS09].end
Bottlenecks
§ Primary computational bottleneck in scoring: cosine computation
§ Can we avoid all this computation?
§ Yes, but may sometimes get it wrong
§ a doc not in the top K may creep into the list of K output docs
§ Is this such a bad thing?
Sec. 7.1.1
27
Cosine similarity is only a proxy
§ Justifications
§ User has a task and a query formulation
§ Cosine matches docs to query
§ Thus cosine is anyway a proxy for user happiness
§ Approximate query processing
§ If we get a list of K docs “close” to the top K by cosine
measure, should be ok
Sec. 7.1.1
28
Generic approach
§ Find a set A of contenders, with K < |A| << N
§ A does not necessarily contain the top K, but has
many docs from among the top K
§ Return the top K docs in A
§ Think of A as pruning non-contenders
§ The same approach is also used for other (non-
cosine) scoring functions
§ Will look at several schemes following this approach
Sec. 7.1.1
29
Index elimination
§ Basic algorithm FastCosineScore of Fig 7.1 only
considers docs containing at least one query term
§ Take this further:
§ Only consider high-idf query terms
§ Only consider docs containing many query terms
Sec. 7.1.2
30
High-idf query terms only
§ For a query such as catcher in the rye
§ Only accumulate scores from catcher and rye
§ Intuition: in and the contribute little to the scores and so don't alter the rank-ordering much
§ Benefit:
§ Postings of low-idf terms have many docs → these (many) docs get eliminated from the set A of contenders
Sec. 7.1.2
31
Docs containing many query terms
§ Any doc with at least one query term is a candidate
for the top K output list
§ For multi-term queries, only compute scores for docs containing several of the query terms
§ Say, at least 3 out of 4
§ Imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
§ Easy to implement in postings traversal
Sec. 7.1.2
32
3 of 4 query terms
Antony:    3 4 8 16 32 64 128
Brutus:    2 4 8 16 32 64 128
Caesar:    1 2 3 5 8 13 21 34
Calpurnia: 13 16 32
Scores only computed for docs 8, 16 and 32.
Sec. 7.1.2
33
Can generalize to WAND method (safe)
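A small sketch of this soft-conjunction filter over toy docID-only postings (matching the example above):

```python
from collections import Counter

def candidates_with_min_overlap(index, query_terms, min_terms=3):
    """Return docs containing at least min_terms of the query terms; only these get scored."""
    hits = Counter()
    for t in query_terms:
        for doc_id in index.get(t, []):
            hits[doc_id] += 1
    return {d for d, c in hits.items() if c >= min_terms}

index = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}
print(sorted(candidates_with_min_overlap(index, list(index), 3)))   # [8, 16, 32]
```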
Champion lists
§ Precompute, for each dictionary term t, the r docs of highest weight in t's postings
§ Call this the champion list for t
§ (aka fancy list or top docs for t)
§ Note that r has to be chosen at index build time
§ Thus, it's possible that r < K
§ At query time, only compute scores for docs in A = ∪t∈Q ChampionList(t)
§ Pick the K top-scoring docs from amongst these
Sec. 7.1.3
34
Inspired by the "fancy lists" of Google:
http://infolab.stanford.edu/~backrub/google.html
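A minimal sketch of building and using champion lists, assuming the same term → (docID, weight) index layout as earlier; r is fixed at index build time:

```python
def build_champion_lists(index, r):
    """For each term, keep only the r postings of highest weight (chosen at index build time).
    index: term -> [(docID, weight), ...]."""
    return {t: sorted(pl, key=lambda p: p[1], reverse=True)[:r]
            for t, pl in index.items()}

def champion_candidates(champions, query_terms):
    """A = union of the query terms' champion lists; cosine scores are computed only for these docs."""
    return {doc_id for t in query_terms for doc_id, _ in champions.get(t, [])}
```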
Exercises
§ How do Champion Lists relate to Index Elimination? Can they be used together?
§ How can Champion Lists be implemented in an
inverted index?
§ Note that the champion list has nothing to do with small
docIDs
Sec. 7.1.3
35
Quantitative
Static quality scores
§ We want top-ranking documents to be both relevant and authoritative
§ Relevance is being modeled by cosine scores
§ Authority is typically a query-independent property of a document
§ Examples of authority signals
§ Wikipedia among websites
§ Articles in certain newspapers
§ A paper with many citations
§ Many diggs, Y!buzzes or del.icio.us marks
§ (Pagerank)
Sec. 7.1.4
36
Modeling authority
§ Assign to each document d a query-independent quality score in [0,1]
§ Denote this by g(d)
§ Thus, a quantity like the number of citations is scaled into [0,1]
§ Exercise: suggest a formula for this.
Sec. 7.1.4
37
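One possible, purely illustrative answer to the exercise: a saturating map such as g(d) = c / (c + C), where c is the citation count and C is a tuning constant (any monotone map into [0,1] would do):

```python
def quality_score(citations, C=10):
    """Hypothetical scaling of a citation count into [0, 1): equals 0.5 when citations == C."""
    return citations / (citations + C)

print(quality_score(0), quality_score(10), quality_score(1000))   # 0.0 0.5 ~0.99
```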
Net score
§ Consider a simple total score combining cosine
relevance and authority
§ net-score(q,d) = g(d) + cosine(q,d)
§ Can use some other linear combination than an equal weighting
§ Indeed, any function of the two "signals" of user happiness – more later
§ Now we seek the top K docs by net score
Sec. 7.1.4
38
Top K by net score – fast methods
§ First idea: Order all postings by g(d)
§ Key: this is a common ordering for all postings
§ Thus, can concurrently traverse query terms' postings for
§ Postings intersection
§ Cosine score computation
§ Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)
Sec. 7.1.4
39
Why order postings by g(d)?
§ Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
§ In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
§ Short of computing scores for all docs in postings
Sec. 7.1.4
40
Champion lists in g(d)-ordering
§ Can combine champion lists with g(d)-ordering
§ Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
§ Seek top-K results from only the docs in these champion lists
Sec. 7.1.4
41
High and low lists
§ For each term, we maintain two postings lists called high and low
§ Think of high as the champion list
§ When traversing postings for a query, only traverse the high lists first
§ If we get more than K docs, select the top K and stop
§ Only union the high lists
§ Else proceed to get docs from the low lists
§ Can be used even for simple cosine scores, without a global quality g(d)
§ A means for segmenting the index into two tiers
Sec. 7.1.4
42
Impact-ordered postings
§ We only want to compute scores for docs whose wf_{t,d} is high enough
§ We sort each postings list by wf_{t,d}
§ Now: not all postings are in a common order!
§ How do we compute scores in order to pick off the top K?
§ Two ideas follow
Sec. 7.1.5
43
1. Early termination
§ When traversing t's postings, stop early after either
§ a fixed number r of docs
§ wf_{t,d} drops below some threshold
§ Take the union of the resulting sets of docs
§ One from the postings of each query term
§ Compute only the scores for docs in this union
Sec. 7.1.5
44
2. idf-ordered terms
§ When considering the postings of query terms
§ Look at them in order of decreasing idf
§ High-idf terms likely to contribute most to the score
§ As we update the score contribution from each query term
§ Stop if doc scores are relatively unchanged
§ Can apply to cosine or some other net scores
Sec. 7.1.5
45
Cluster pruning: preprocessing
§ Pick √N docs at random: call these leaders
§ For every other doc, pre-compute its nearest leader
§ Docs attached to a leader: its followers
§ Likely: each leader has ~ √N followers.
Sec. 7.1.6
46
Why √N leaders?
Cluster pruning: query processing
§ Process a query as follows:
§ Given query Q, find its nearest leader L.
§ Seek K nearest docs from among L’s
followers.
Sec. 7.1.6
47
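A compact sketch of both phases, reusing a cosine function like the earlier one; doc_vectors (docID → sparse vector) and the function names are assumptions for illustration:

```python
import random
from math import isqrt

def build_clusters(doc_vectors, cosine):
    """Preprocessing: pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
    doc_ids = list(doc_vectors)
    leaders = random.sample(doc_ids, max(1, isqrt(len(doc_ids))))
    followers = {l: [] for l in leaders}
    for d in doc_ids:
        nearest = max(leaders, key=lambda l: cosine(doc_vectors[d], doc_vectors[l]))
        followers[nearest].append(d)
    return leaders, followers

def cluster_pruned_search(query_vec, doc_vectors, leaders, followers, cosine, K):
    """Query time: find the nearest leader L, then the K best docs among L's followers only."""
    L = max(leaders, key=lambda l: cosine(query_vec, doc_vectors[l]))
    ranked = sorted(followers[L], key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
    return ranked[:K]
```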
Visualization
[Figure: a query point, the leaders, and each leader's followers in the vector space]
Sec. 7.1.6
48
Why use random sampling
§ Fast
§ Leaders reflect the data distribution
Sec. 7.1.6
49
General variants
§ Have each follower attached to b1=3 (say) nearest leaders.
§ From the query, find b2=4 (say) nearest leaders and their followers.
§ Can recur on the leader/follower construction.
Sec. 7.1.6
50
Exercises
§ To find the nearest leader in step 1, how many cosine computations do we do?
§ Why did we have √N in the first place?
§ Hint: write down the algorithm, model its cost, and
minimize the cost.
§ What is the effect of the constants b1, b2 on the
previous slide?
§ Devise an example where this is likely to fail – i.e., we
miss one of the K nearest docs.
§ Likely under random sampling.
Sec. 7.1.6
51
Parametric and zone indexes
§ Thus far, a doc has been a sequence of terms
§ In fact documents have multiple parts, some with special semantics:
§ Author
§ Title
§ Date of publica*on
§ Language
§ Format
§ etc.
§ These constitute the metadata about a document
Sec. 6.1
52
Fields
§ We sometimes wish to search by these metadata
§ E.g., find docs authored by William Shakespeare in the
year 1601, containing alas poor Yorick
§ Year = 1601 is an example of a field
§ Also, author last name = shakespeare, etc
§ Field or parametric index: postings for each field value
§ Sometimes build range trees (e.g., for dates)
§ Field query typically treated as conjunction
§ (doc must be authored by shakespeare)
Sec. 6.1
53
Zone
§ A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
§ Title
§ Abstract
§ References …
§ Build inverted indexes on zones as well to permit
querying
§ E.g., "find docs with merchant in the title zone and matching the query gentle rain"
Sec. 6.1
54
Example zone indexes
Encode zones in dictionary vs. postings.
Sec. 6.1
55
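A tiny, hypothetical illustration of the two encodings for one term (the doc IDs are invented):

```python
# Option 1: zones encoded in the dictionary -- one postings list per (term, zone) pair
zones_in_dictionary = {
    "william.author":   [2, 4, 8],
    "william.title":    [2, 5],
    "william.abstract": [11, 121],
}

# Option 2: zones encoded in the postings -- one list per term, each posting carries its zones
zones_in_postings = {
    "william": [(2, ["author", "title"]), (4, ["author"]), (5, ["title"]),
                (8, ["author"]), (11, ["abstract"]), (121, ["abstract"])],
}
```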
Tiered indexes
§ Break postings up into a hierarchy of lists
§ Most important
§ …
§ Least important
§ Can be done by g(d) or another measure
§ Inverted index thus broken up into tiers of decreasing importance
§ At query time use the top tier unless it fails to yield K docs
§ If so, drop to lower tiers
Sec. 7.2.1
56
Example tiered index
Sec. 7.2.1
57
Query term proximity
§ Free text queries: just a set of terms typed into the
query box – common on the web
§ Users prefer docs in which query terms occur within
close proximity of each other
§ Let w be the smallest window in a doc containing all
query terms, e.g.,
§ For the query strained mercy the smallest window in
the doc The quality of mercy is not strained is 4
(words)
§ Would like the scoring function to take this into account – how?
Sec. 7.2.2
58
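A brute-force sketch of computing w, the smallest window (in words) containing all query terms:

```python
def smallest_window(doc_tokens, query_terms):
    """Length in words of the smallest window of doc_tokens containing every query term,
    or None if some term never occurs."""
    need = set(query_terms)
    best = None
    for i, tok in enumerate(doc_tokens):
        if tok not in need:
            continue
        seen = set()
        for j in range(i, len(doc_tokens)):
            if doc_tokens[j] in need:
                seen.add(doc_tokens[j])
            if seen == need:
                width = j - i + 1
                best = width if best is None else min(best, width)
                break
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))   # 4
```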
Query parsers
§ Free text query from user may in fact spawn one or
more queries to the indexes, e.g. query rising
interest rates
§ Run the query as a phrase query
§ If