COMP6714: Information Retrieval & Web Search
Introduction to
Information Retrieval
Lecture 7: Scoring and results assembly
1
Ch. 6
Recap: tf-idf weighting
§ The tf-idf weight of a term is the product of its tf weight and its idf weight.
w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)
§ Best known weighting scheme in information retrieval
§ Increases with the number of occurrences within a document
§ Increases with the rarity of the term in the collection
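A minimal sketch of this weighting, assuming raw tf and df counts as inputs and log base 10 (function and variable names are illustrative, not from the lecture):

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = (1 + log10(tf)) * log10(N / df).

    tf: raw count of term t in document d
    df: number of documents containing t
    N:  total number of documents in the collection
    """
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# e.g. a term occurring 3 times in a doc and in 100 of 1,000,000 docs:
# tfidf_weight(3, 100, 1_000_000) ≈ 1.477 * 4 ≈ 5.91
```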
2
Ch. 6
Recap: Queries as vectors
§ Key idea 1: Do the same for queries: represent them as vectors in the space
§ Key idea 2: Rank documents according to their proximity to the query in this space
§ proximity = similarity of vectors
3
Ch. 6
Recap: cosine(query,document)
The numerator is the dot product of q and d; dividing by the vector lengths turns q and d into unit vectors:
\cos(q,d) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|}\cdot\frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
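A small illustrative implementation over dense term-weight vectors (an assumption made for clarity; a real system computes this from sparse postings):

```python
import math

def cosine(q, d):
    """Cosine similarity of two equal-length weight vectors q and d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)
```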
4
Ch. 7
This lecture
§ Speeding up vector space ranking
§ Putting together a complete search system
§ Will require learning about a number of miscellaneous topics and heuristics
Question: Why don't we just use the query processing methods for Boolean queries?
5
Sec. 6.3.3
Computing cosine scores
Term-at-a-time
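This slide shows the textbook's term-at-a-time CosineScore algorithm as a figure. A hedged Python sketch of the idea, assuming an inverted index of (docID, tf) postings, precomputed document lengths, and idf-only query weights (all names are illustrative):

```python
import heapq, math

def cosine_score(query_terms, index, lengths, N, k):
    """Term-at-a-time cosine scoring (in the spirit of the textbook algorithm).

    index:   dict term -> list of (doc_id, tf) postings
    lengths: dict doc_id -> document vector length (for normalization)
    N:       number of documents in the collection; k: results to return
    """
    scores = {}                                  # accumulators
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue
        idf = math.log10(N / len(postings))
        w_tq = idf                               # query term weight (idf only)
        for doc_id, tf in postings:
            w_td = (1 + math.log10(tf)) * idf
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= lengths[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])
```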
6
Sec. 7.1
Efficient cosine ranking
§ Find the K docs in the collection "nearest" to the query ⇒ K largest query-doc cosines.
§ Efficient ranking:
§ Computing a single cosine efficiently.
§ Choosing the K largest cosine values efficiently.
§ Can we do this without computing all N cosines?
7
Sec. 7.1
Efficient cosine ranking
§ What we’re doing in effect: solving the K-nearest neighbor problem for a query vector
§ In general, we do not know how to do this efficiently for high-dimensional spaces
§ But it is solvable for short queries, and standard indexes support this well
Curse of dimensionality
8
Sec. 7.1
Special case – unweighted queries
§ No weighting on query terms
§ Assume each query term occurs only once
§ Then for ranking, don't need to normalize the query vector
§ Slight simplification of the algorithm from Lecture 6
9
Sec. 7.1
Faster cosine: unweighted query
10
Sec. 7.1
Computing the K largest cosines: selection vs. sorting
§ Typically we want to retrieve the top K docs (in the cosine ranking for the query)
§ not to totally order all docs in the collection
§ Can we pick off docs with the K highest cosines?
§ Let n = number of docs with nonzero cosines
§ We seek the K best of these n
11
Sec. 7.1
Use heap for selecting top K
§ Binary tree in which each node's value > the values of its children
§ Takes 2n operations to construct, then each of K "winners" read off in 2 log n steps.
§ Total time is O(n + K·log(n))
§ For n = 1M, K = 100, this is about 10% of the cost of sorting.
(Figure: example binary max-heap over the values 1, .9, .3, .3, .8, .1, .1)
12
http://en.wikipedia.org/wiki/Binary_heap
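A minimal sketch of heap-based top-K selection using Python's heapq; this variant keeps a min-heap of the K best scores seen so far rather than building one big max-heap, a common practical equivalent:

```python
import heapq

def top_k(scored_docs, k):
    """Return the k largest (score, doc_id) pairs from an iterable of
    (doc_id, score) pairs, in decreasing score order.
    Uses a size-k min-heap: O(n log k) time, O(k) extra space."""
    heap = []                                   # min-heap of (score, doc_id)
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:                # better than current k-th best
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```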
Quick Select
§ QuickSelect is a QuickSort-like partitioning algorithm for finding the top-K elements of an array
§ Takes O(n) expected time
§ Sorting the top-K items takes O(K·log(K)) time
§ Total time is O(n + K·log(K))
13
http://en.wikipedia.org/wiki/Quickselect
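A practical selection-then-sort sketch using NumPy's argpartition, which performs an introselect-style partial sort (this is an illustration, not the lecture's code):

```python
import numpy as np

def top_k_partition(scores, k):
    """scores: 1-D NumPy array of document scores.
    Returns the indices of the k highest-scoring documents, best first."""
    kth = len(scores) - k
    # O(n) expected: partition so the k largest scores occupy the last k slots
    top = np.argpartition(scores, kth)[kth:]
    # O(k log k): sort just those k indices by decreasing score
    return top[np.argsort(-scores[top])]

# scores = np.array([0.1, 0.9, 0.3, 0.8, 0.1])
# top_k_partition(scores, 2)  ->  array([1, 3])
```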
Query Processing
§ Document-at-a-time
§ Calculates complete scores for documents by processing all term lists, one document at a time
§ Term-at-a-time
§ Accumulates scores for documents by processing term lists one at a time
§ Both approaches have optimization techniques that significantly reduce the time required to generate scores
§ Distinguish between safe and heuristic optimizations
[CMS09].begin
Document-At-A-Time
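The slide above is a [CMS09] figure slide illustrating document-at-a-time scoring. A hedged Python sketch of the idea, assuming each term's postings are sorted by docID (helper names such as score_fn are illustrative, not from the lecture):

```python
import heapq

def daat_score(query_terms, index, score_fn, k):
    """Document-at-a-time scoring.

    index:    dict term -> list of (doc_id, tf) postings, sorted by doc_id
    score_fn: callable(term, tf, doc_id) -> partial score contribution
    Returns the top-k (score, doc_id) pairs."""
    lists = [index.get(t, []) for t in query_terms]
    pos = [0] * len(lists)                    # cursor into each postings list
    top = []                                  # min-heap of the current top-k
    while True:
        # next candidate: smallest docID under any cursor
        heads = [lst[p][0] for lst, p in zip(lists, pos) if p < len(lst)]
        if not heads:
            break
        d = min(heads)
        score = 0.0
        for i, (t, lst) in enumerate(zip(query_terms, lists)):
            if pos[i] < len(lst) and lst[pos[i]][0] == d:
                score += score_fn(t, lst[pos[i]][1], d)
                pos[i] += 1                   # advance this term's cursor
        if len(top) < k:
            heapq.heappush(top, (score, d))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, d))
    return sorted(top, reverse=True)
```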
Term-At-A-Time
(Algorithm figures from [CMS09]: term-at-a-time scoring; accumulators A_d hold each document's partial score)
Optimization Techniques
§ Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently
§ Two classes of optimization
§ Read less data from inverted lists
§ e.g., skip lists
§ better for simple feature functions
§ Calculate scores for fewer documents
§ e.g., conjunctive processing
§ better for complex feature functions
Conjunctive Processing
§ Requires result documents to contain all the query terms (i.e., conjunctive Boolean queries)
§ More efficient
§ Can also be more effective for short queries
§ Default for many search engines
§ Can be combined with both DAAT and TAAT
20
Conjunctive Term-at-a-Time
Conjunctive Document-at-a-Time
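These two slides are [CMS09] algorithm figures. A hedged sketch of conjunctive document-at-a-time processing, in which only documents containing every query term are scored (postings format and helper names are illustrative):

```python
def conjunctive_daat(query_terms, index, score_fn):
    """Score only documents that appear in every term's postings list.

    index:    dict term -> list of (doc_id, tf) postings
    score_fn: callable(term, tf, doc_id) -> partial score
    Returns dict doc_id -> score for the conjunctive result set."""
    lists = [dict(index.get(t, [])) for t in query_terms]   # doc_id -> tf
    if not lists:
        return {}
    # intersect docIDs, starting from the rarest (shortest) list
    common = set(min(lists, key=len))
    for lst in lists:
        common &= set(lst)
    scores = {}
    for d in common:
        scores[d] = sum(score_fn(t, lst[d], d)
                        for t, lst in zip(query_terms, lists))
    return scores
```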
Threshold Methods
§ Threshold methods use the number of top-ranked documents needed (k) to optimize query processing
§ for most applications, k is small
§ For any query, there is a minimum score that each document needs to reach before it can be shown to the user
§ score of the kth-highest scoring document
§ gives threshold τ
§ optimization methods estimate τʹ to ignore documents
Threshold Methods
§ For document-at-a-time processing, use the score of the lowest-ranked document so far for τʹ
§ for term-at-a-time, have to use the kth-largest score in the accumulator table
§ The MaxScore method compares the maximum score that remaining documents could have against τʹ
§ a safe optimization, in that the ranking will be the same as without the optimization
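A small sketch of how τʹ can be maintained during document-at-a-time processing, using a size-k min-heap of the best scores seen so far (an illustration, not the lecture's code):

```python
import heapq

class Threshold:
    """Tracks tau': the k-th highest score seen so far (0 until k docs are scored)."""
    def __init__(self, k):
        self.k = k
        self.heap = []                    # min-heap of the current top-k scores

    def update(self, score):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, score)
        elif score > self.heap[0]:
            heapq.heapreplace(self.heap, score)

    @property
    def tau(self):
        return self.heap[0] if len(self.heap) == self.k else 0.0

# During document-at-a-time traversal, a document can be skipped whenever an
# upper bound on its possible score is <= tau (a safe optimization).
```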
MaxScore Example
(Figure: postings lists for the query terms; gray postings can be skipped)
§ Compute the max term score μt of each list and sort the lists in decreasing order of μt (fixed during query processing)
§ Assume k = 3; τʹ is the lowest score after the first three docs
§ If μtree < τʹ → any doc that scores higher than τʹ must contain at least one of the first two keywords (aka the required term set)
§ Use the postings lists of the required term set to "drive" the query processing
§ Will only check some of the white postings in the list of "tree" to compute scores → at least all the gray postings are skipped.
Better than the example in the textbook. See my Note 2 too.
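A hedged sketch of the key step above: deriving the required term set from the per-term maximum scores μt and the current threshold τʹ (function names and the example scores are made up for illustration):

```python
def required_term_set(max_scores, tau_prime):
    """max_scores: dict term -> mu_t, the maximum partial score any document
    can receive from that term. Returns the required terms: a document
    containing none of them cannot beat tau_prime, so it can be skipped."""
    terms = sorted(max_scores, key=max_scores.get, reverse=True)
    required = []
    for i, t in enumerate(terms):
        # If all remaining terms together cannot exceed tau_prime,
        # they are non-required and we can stop.
        if sum(max_scores[s] for s in terms[i:]) <= tau_prime:
            break
        required.append(t)
    return required

# e.g. required_term_set({"x": 2.0, "y": 1.5, "tree": 0.4}, 0.9) -> ["x", "y"]
```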
Other Approaches
§ Early termination of query processing
§ ignore high-frequency word lists in term-at-a-time
§ ignore documents at the end of lists in doc-at-a-time
§ unsafe optimization
§ List ordering
§ order inverted lists by quality metric (e.g., PageRank) or by partial score
§ makes unsafe (and fast) optimizations more likely to produce good documents
[CMS09].end
Sec. 7.1.1
Bottlenecks
§ Primary computational bottleneck in scoring: cosine computation
§ Can we avoid all this computation?
§ Yes, but may sometimes get it wrong
§ a doc not in the top K may creep into the list of K output docs
§ Is this such a bad thing?
27
Sec. 7.1.1
Cosine similarity is only a proxy
§ Justifications
§ User has a task and a query formulation
§ Cosine matches docs to query
§ Thus cosine is anyway a proxy for user happiness
§ Approximate query processing
§ If we get a list of K docs "close" to the top K by the cosine measure, that should be OK
28
Sec. 7.1.1
Generic approach
§ Find a set A of contenders, with K < |A| << N
§ A does not necessarily contain the top K, but has many docs from among the top K
§ Return the top K docs in A
§ Think of A as pruning non-contenders
§ The same approach is also used for other (non-cosine) scoring functions
§ Will look at several schemes following this approach
29
Sec. 7.1.2
Index elimination
§ Basic algorithm FastCosineScore of Fig 7.1 only considers docs containing at least one query term
§ Take this further:
§ Only consider high-idf query terms
§ Only consider docs containing many query terms
30
Sec. 7.1.2
High-idf query terms only
§ For a query such as catcher in the rye
§ Only accumulate scores from catcher and rye
§ Intuition: in and the contribute little to the scores and so don't alter the rank-ordering much
§ Benefit:
§ Postings of low-idf terms have many docs → these (many) docs get eliminated from set A of contenders
31
Sec. 7.1.2
Docs containing many query terms
§ Any doc with at least one query term is a candidate for the top K output list
§ For multi-term queries, only compute scores for docs containing several of the query terms
§ Say, at least 3 out of 4
§ Imposes a "soft conjunction" on queries seen on web search engines (early Google)
§ Easy to implement in postings traversal
32
3 of 4 query terms
Can generalize to WAND method (safe)
Sec. 7.1.2
Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32
Scores only computed for docs 8, 16 and 32.
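A minimal sketch of this pruning step: count how many query terms each document contains and only score documents meeting the minimum (using the postings above as toy data):

```python
from collections import Counter

def candidate_docs(query_terms, index, min_terms):
    """index: dict term -> sorted list of doc_ids (postings).
    Returns doc_ids containing at least min_terms of the query terms."""
    counts = Counter()
    for t in query_terms:
        counts.update(index.get(t, []))
    return sorted(d for d, c in counts.items() if c >= min_terms)

index = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}
# candidate_docs(list(index), index, 3) -> [8, 16, 32]
```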
33
Sec. 7.1.3
Champion lists
§ Precompute, for each dictionary term t, the r docs of highest weight in t's postings
§ Call this the champion list for t
§ (aka fancy list or top docs for t)
§ Note that r has to be chosen at index build time
§ Thus, it's possible that r < K
§ At query time, only compute scores for docs in A = ∪_{t∈Q} ChampionList(t)
§ Pick the K top-scoring docs from amongst these
Inspired by the "fancy lists" of Google: http://infolab.stanford.edu/~backrub/google.html
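A short sketch of building champion lists at index build time, assuming per-(term, doc) weights are already available (names are illustrative):

```python
import heapq

def build_champion_lists(index, weight, r):
    """index:  dict term -> list of doc_ids
    weight: callable(term, doc_id) -> precomputed weight (e.g. tf-idf)
    Returns dict term -> the r highest-weight doc_ids for that term."""
    champions = {}
    for term, postings in index.items():
        champions[term] = heapq.nlargest(r, postings,
                                         key=lambda d: weight(term, d))
    return champions

# At query time, score only docs in the union of the query terms'
# champion lists, then return the K best (note r may be < K).
```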
34
Sec. 7.1.3
Exercises
§ How do Champion Lists relate to Index Elimination? Can they be used together?
§ How can Champion Lists be implemented in an inverted index?
§ Note that the champion list has nothing to do with small docIDs
35
Sec. 7.1.4
Static quality scores
§ We want top-ranking documents to be both relevant and authoritative
§ Relevance is being modeled by cosine scores
§ Authority is typically a query-independent property of a document
§ Examples of authority signals
§ Wikipedia among websites
§ Articles in certain newspapers
§ A paper with many citations
§ Many diggs, Y! buzzes or del.icio.us marks
§ (PageRank)
Quantitative
36
Sec. 7.1.4
Modeling authority
§ Assign to each document d a query-independent quality score in [0,1]
§ Denote this by g(d)
§ Thus, a quantity like the number of citations is scaled into [0,1]
§ Exercise: suggest a formula for this.
37
Sec. 7.1.4
Net score
§ Consider a simple total score combining cosine relevance and authority
§ net-score(q,d) = g(d) + cosine(q,d)
§ Can use some other linear combination than an equal weighting
§ Indeed, any function of the two "signals" of user happiness – more later
§ Now we seek the top K docs by net score
38
Sec. 7.1.4
Top K by net score – fast methods
§ First idea: Order all postings by g(d)
§ Key: this is a common ordering for all postings
§ Thus, can concurrently traverse query terms' postings for
§ Postings intersection
§ Cosine score computation
§ Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)
39
Sec. 7.1.4
Why order postings by g(d)?
§ Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal
§ In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop the postings traversal early
§ Short of computing scores for all docs in the postings
40
Sec. 7.1.4
Champion lists in g(d)-ordering
§ Can combine champion lists with g(d)-ordering
§ Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
§ Seek top-K results from only the docs in these champion lists
41
Sec. 7.1.4
High and low lists
§ For each term, we maintain two postings lists called high and low
§ Think of high as the champion list
§ When traversing postings for a query, only traverse the high lists first
§ If we get more than K docs, select the top K and stop
§ Only union the high lists
§ Else proceed to get docs from the low lists
§ Can be used even for simple cosine scores, without global quality g(d)
§ A means for segmenting the index into two tiers
42
Sec. 7.1.5
Impact-ordered postings
§ We only want to compute scores for docs for which wf_{t,d} is high enough
§ We sort each postings list by wf_{t,d}
§ Now: not all postings are in a common order!
§ How do we compute scores in order to pick off top K?
§ Two ideas follow
43
Sec. 7.1.5
1. Early termination
§ When traversing t's postings, stop early after either
§ a fixed number of r docs
§ wf_{t,d} drops below some threshold
§ Take the union of the resulting sets of docs
§ One from the postings of each query term
§ Compute only the scores for docs in this union
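A hedged sketch of this early-termination step over impact-ordered postings: each list is assumed pre-sorted by decreasing wf_{t,d}, and we stop after r postings or once wf drops below a threshold (names are illustrative):

```python
def early_termination_candidates(query_terms, impact_index, r, wf_min):
    """impact_index: dict term -> list of (doc_id, wf) sorted by decreasing wf.
    Traverse each list until r docs are taken or wf drops below wf_min,
    and return the union of the truncated lists as the candidate set."""
    candidates = set()
    for t in query_terms:
        for i, (doc_id, wf) in enumerate(impact_index.get(t, [])):
            if i >= r or wf < wf_min:
                break
            candidates.add(doc_id)
    return candidates   # compute (cosine) scores only for these docs
```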
44
Sec. 7.1.5
2. idf-ordered terms
§ When considering the postings of query terms
§ Look at them in order of decreasing idf
§ High idf terms likely to contribute most to score
§ As we update the score contribution from each query term
§ Stop if doc scores are relatively unchanged
§ Can apply to cosine or some other net scores
45
Sec. 7.1.6
Why √N leaders?
Cluster pruning: preprocessing
§ Pick √N docs at random: call these leaders
§ For every other doc, pre-compute its nearest leader
§ Docs attached to a leader: its followers
§ Likely: each leader has ~ √N followers.
46
Sec. 7.1.6
Cluster pruning: query processing
§ Process a query as follows:
§ Given query Q, find its nearest leader L.
§ Seek K nearest docs from among L’s followers.
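A hedged sketch of both phases; it reuses the cosine() helper sketched earlier, picks leaders by random sampling, and finds the nearest leader by brute force (all names are illustrative):

```python
import heapq, random

def cluster_prune_preprocess(docs):
    """docs: dict doc_id -> weight vector. Picks ~sqrt(N) random leaders and
    attaches every other doc to its nearest leader (by cosine)."""
    n = len(docs)
    leaders = random.sample(list(docs), max(1, int(n ** 0.5)))
    followers = {l: [] for l in leaders}
    for d, vec in docs.items():
        if d in followers:
            continue                      # leaders are not followers
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        followers[nearest].append(d)
    return leaders, followers

def cluster_prune_query(q, docs, leaders, followers, k):
    """Find the nearest leader L, then the k nearest docs among L's followers."""
    L = max(leaders, key=lambda l: cosine(q, docs[l]))
    candidates = [L] + followers[L]
    return heapq.nlargest(k, candidates, key=lambda d: cosine(q, docs[d]))
```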
47
Sec. 7.1.6
Visualization
(Figure: a query point with leaders and their followers in the vector space)
48
Sec. 7.1.6
Why use random sampling
§ Fast
§ Leaders reflect data distribution
49
Sec. 7.1.6
General variants
§ Have each follower attached to b1 = 3 (say) nearest leaders.
§ From the query, find b2 = 4 (say) nearest leaders and their followers.
§ Can recur on the leader/follower construction.
50
Sec. 7.1.6
Exercises
§ To find the nearest leader in step 1, how many cosine computations do we do?
§ Why did we have √N in the first place?
§ Hint: write down the algorithm, model its cost, and minimize the cost.
§ What is the effect of the constants b1, b2 on the previous slide?
§ Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.
§ Likely under random sampling.
51
Sec. 6.1
Parametric and zone indexes
§ Thus far, a doc has been a sequence of terms
§ In fact documents have multiple parts, some with special semantics:
§ Author
§ Title
§ Date of publication
§ Language
§ Format
§ etc.
§ These constitute the metadata about a document
52
Sec. 6.1
Fields
§ We sometimes wish to search by these metadata
§ E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
§ Year = 1601 is an example of a field
§ Also, author last name = shakespeare, etc.
§ Field or parametric index: postings for each field value
§ Sometimes build range trees (e.g., for dates)
§ Field query typically treated as a conjunction
§ (doc must be authored by shakespeare)
53
Sec. 6.1
Zone
§ A zone is a region of the doc that can contain an arbitrary amount of text e.g.,
§ Title
§ Abstract
§ References ...
§ Build inverted indexes on zones as well to permit querying
§ E.g., "find docs with merchant in the title zone and matching the query gentle rain"
54
Sec. 6.1
Example zone indexes
Encode zones in dictionary vs. postings.
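A tiny sketch contrasting the two encodings named on this slide, using made-up terms and doc IDs purely for illustration:

```python
# Encode zones in the dictionary: one postings list per (term, zone) pair
zone_in_dictionary = {
    ("merchant", "title"):  [2, 4],
    ("merchant", "author"): [3],
    ("merchant", "body"):   [2, 5, 9],
}

# Encode zones in the postings: one list per term, each posting
# carrying the zones in which the term occurs in that document
zone_in_postings = {
    "merchant": [(2, {"title", "body"}), (3, {"author"}),
                 (4, {"title"}), (5, {"body"}), (9, {"body"})],
}
```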
55
Sec. 7.2.1
Tiered indexes
§ Break postings up into a hierarchy of lists
§ Most important
§ ...
§ Least important
§ Can be done by g(d) or another measure
§ Inverted index thus broken up into tiers of decreasing importance
§ At query time use top tier unless it fails to yield K docs
§ If so, drop to lower tiers
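A small sketch of the query-time fallback: use the top tier unless it fails to yield K docs (the tier layout and the score_docs helper are assumptions for illustration):

```python
def tiered_search(query_terms, tiers, score_docs, k):
    """tiers: list of indexes ordered from most to least important, each a
    dict term -> postings. score_docs(query_terms, index, k) returns a
    ranked list of (score, doc_id) pairs for that tier."""
    results = []
    seen = set()
    for tier in tiers:
        for score, doc in score_docs(query_terms, tier, k):
            if doc not in seen:
                results.append((score, doc))
                seen.add(doc)
        if len(results) >= k:        # this tier yielded enough docs: stop
            break
    return results[:k]
```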
56
Sec. 7.2.1
Example tiered index
57
Sec. 7.2.2
Query term proximity
§ Free text queries: just a set of terms typed into the query box – common on the web
§ Users prefer docs in which query terms occur within close proximity of each other
§ Let w be the smallest window in a doc containing all query terms, e.g.,
§ For the query strained mercy the smallest window in the doc The quality of mercy is not strained is 4 (words)
§ Would like the scoring function to take this into account – how?
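A hedged sketch of computing this smallest window w (in words) over a tokenized document, via a single left-to-right pass that tracks each query term's most recent position (names are illustrative):

```python
def smallest_window(doc_tokens, query_terms):
    """Return the length (in words) of the smallest window of doc_tokens
    containing every term in query_terms, or None if some term is missing."""
    need = set(query_terms)
    best = None
    last_pos = {}                          # term -> most recent position
    for i, tok in enumerate(doc_tokens):
        if tok in need:
            last_pos[tok] = i
            if len(last_pos) == len(need):
                window = i - min(last_pos.values()) + 1
                if best is None or window < best:
                    best = window
    return best

# smallest_window("the quality of mercy is not strained".split(),
#                 ["strained", "mercy"])  ->  4
```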
58
Sec. 7.2.3
Query parsers
§ Free text query from user may in fact spawn one or more queries to the indexes, e.g. query rising interest rates
§ Run the query as a phrase query
§ If