COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval

Lecture 6: Scoring, Term Weighting and the Vector Space Model


Recap of lecture 5

- Collection and vocabulary statistics: Heaps' and Zipf's laws
- Dictionary compression for Boolean indexes
  - Dictionary string, blocks, front coding
- Postings compression: gap encoding, prefix-unique codes
  - Variable-Byte, Gamma codes, Golomb/Rice codes

Sizes (MB):

collection (text, xml markup etc)        3,600.0
collection (text)                          960.0
term-doc incidence matrix               40,000.0
postings, uncompressed (32-bit words)      400.0
postings, uncompressed (20 bits)           250.0
postings, variable byte encoded            116.0
postings, γ-encoded                        101.0


This lecture; IIR Sections 6.2-6.4.3

- Ranked retrieval
- Scoring documents
- Term frequency
- Collection statistics
- Weighting schemes
- Vector space scoring


Ranked retrieval

- Thus far, our queries have all been Boolean.
  - Documents either match or don't.
- Good for expert users with a precise understanding of their needs and the collection.
- Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  - Most users don't want to wade through 1000s of results.
  - This is particularly true of web search.

Ch. 6


Problem with Boolean search: feast or famine

- Boolean queries often result in either too few (=0) or too many (1000s) results.
  - Query 1: "standard user dlink 650" → 200,000 hits
  - Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many

Ch. 6


Ranked retrieval models

- Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
- Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
- In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries and vice versa


Feast or famine: not a problem in ranked retrieval

- When a system produces a ranked result set, large result sets are not an issue
  - Indeed, the size of the result set is not an issue
  - We just show the top k (≈ 10) results
  - We don't overwhelm the user
- Premise: the ranking algorithm works

Ch. 6


Scoring as the basis of ranked retrieval

- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score – say in [0, 1] – to each document
  - This score measures how well document and query "match".

Ch. 6


Query-document matching scores

- We need a way of assigning a score to a query/document pair
- Let's start with a one-term query
  - If the query term does not occur in the document: score should be 0
  - The more frequent the query term in the document, the higher the score (should be)
- We will look at a number of alternatives for this.

Ch. 6


Take 1: Jaccard coefficient

- Recall from Lecture 3: a commonly used measure of overlap of two sets A and B
  - jaccard(A, B) = |A ∩ B| / |A ∪ B|
  - jaccard(A, A) = 1
  - jaccard(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.

Ch. 6


Jaccard coefficient: scoring example

- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
  - Query: ides of march
  - Document 1: caesar died in march
  - Document 2: the long march

Ch. 6

(The term Ides of March is best known as the date on which Julius Caesar was killed, in 709 AUC or 44 B.C.)
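As a quick check, here is a minimal Python sketch (mine, not the lecture's) that computes these Jaccard scores over bag-of-words term sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ≈ 0.167 (only "march" is shared)
print(jaccard(query, doc2))  # 1/5 = 0.2
```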


Issues with Jaccard for scoring

1. It doesn't consider term frequency (how many times a term occurs in a document)
2. Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information.
- We need a more sophisticated way of normalizing for length

Ch. 6


Recall (Lecture 1): binary term-document incidence matrix

Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2


Term-document count matrices

- Consider the number of occurrences of a term in a document:
  - Each document is a count vector in ℕ^|V|: a column of the term-document count matrix

Sec. 6.2


Bag of words model

- The vector representation doesn't consider the ordering of words in a document
- John is quicker than Mary and Mary is quicker than John have the same vectors
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later in this course.
- For now: bag of words model (see the sketch below)
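To make the order-invariance concrete, a tiny Python sketch (not from the lecture) showing that the two sentences above get the same count vector:

```python
from collections import Counter

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

# Bag of words: term -> count, with word order discarded
print(Counter(d1.lower().split()) == Counter(d2.lower().split()))  # True
```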



Term frequency tf

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR


Log-frequency weighting

- The log frequency weight of term t in d is

      w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
              = 0                     otherwise

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:

      score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

- The score is 0 if none of the query terms is present in the document.

Sec. 6.2
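A small Python sketch of this weighting (helper names are mine, not the lecture's):

```python
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query: str, doc: str) -> float:
    """Sum of log-tf weights over terms occurring in both query and document."""
    tf = Counter(doc.split())  # Counter returns 0 for absent terms
    return sum(log_tf_weight(tf[t]) for t in set(query.split()))

# 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
for x in (0, 1, 2, 10, 1000):
    print(x, round(log_tf_weight(x), 2))
```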



Document frequency

- Rare terms are more informative than frequent terms
  - Recall stop words
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1


Document frequency, continued

- Frequent terms are less informative than rare terms
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- A document containing such a term is more likely to be relevant than a document that doesn't
  - But it's not a sure indicator of relevance.
- → For frequent terms, we want high positive weights for words like high, increase, and line
  - But lower weights than for rare terms.
- We will use document frequency (df) to capture this.

Sec. 6.2.1


idf weight

- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - df_t ≤ N
- We define the idf (inverse document frequency) of t by

      idf_t = log10(N / df_t)

  - We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

Will turn out the base of the log is immaterial.

Sec. 6.2.1



idf example, suppose N = 1 million

term           df_t        idf_t
calpurnia              1       6
animal               100       4
sunday             1,000       3
fly               10,000       2
under            100,000       1
the            1,000,000       0

There is one idf value for each term t in a collection.

Sec. 6.2.1
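These values follow directly from idf_t = log10(N/df_t); a short sketch to reproduce the table (df values as on the slide):

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df_t in df.items():
    print(f"{term:<10} df={df_t:>9,}  idf={math.log10(N / df_t):.0f}")
```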



Effect of idf on ranking

- Does idf have an effect on ranking for one-term queries, like
  - iPhone
- idf has no effect on the ranking of one-term queries
  - idf affects the ranking of documents for queries with at least two terms
  - For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.


Collection vs. document frequency

- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example:

  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760

- Which word is a better search term (and should get a higher weight)?

Sec. 6.2.1


tf-idf weighting

- The tf-idf weight of a term is the product of its tf weight and its idf weight:

      w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

- Best known weighting scheme in information retrieval
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf × idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

Sec. 6.2.2
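Putting the two pieces together, a minimal sketch (the toy corpus and helper names are mine) that computes tf-idf weights:

```python
import math
from collections import Counter

docs = {
    "d1": "car insurance auto insurance".split(),
    "d2": "best car repair".split(),
}
N = len(docs)

# document frequency: in how many docs does each term appear?
df = Counter(t for terms in docs.values() for t in set(terms))

def tf_idf(term: str, doc_terms: list) -> float:
    """(1 + log10 tf) * log10(N / df); 0 if the term is absent from the doc."""
    tf = doc_terms.count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

print(tf_idf("insurance", docs["d1"]))  # rare term -> positive weight (≈ 0.39)
print(tf_idf("car", docs["d1"]))        # occurs in every doc -> idf = 0 -> weight 0
```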



Final ranking of documents for a query

      Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}

Sec. 6.2.2

Binary → count → weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Sec. 6.3


Documents as vectors

- So we have a |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors – most entries are zero.

Sec. 6.3


Queries as vectors

- Key idea 1: do the same for queries: represent them as vectors in the space
- Key idea 2: rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents

Sec. 6.3


Formalizing vector space proximity

- First cut: distance between two points
  - ( = distance between the end points of the two vectors)
- Euclidean distance?
- Euclidean distance is a bad idea . . .
- . . . because Euclidean distance is large for vectors of different lengths.

Sec. 6.3


Why distance is a bad idea

[Figure omitted] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3


Use angle instead of distance

- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to angle with query.

Sec. 6.3


From angles to cosines

- The following two notions are equivalent.
  - Rank documents in decreasing order of the angle between query and document
  - Rank documents in increasing order of cosine(query, document)
- Cosine is a monotonically decreasing function on the interval [0°, 180°]

Sec. 6.3


From angles to cosines

- But how – and why – should we be computing cosines?

Sec. 6.3


Length normalization

- A vector can be (length-)normalized by dividing each of its components by its length – for this we use the L2 norm:

      ||x||_2 = sqrt( Σ_i x_i² )

- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
  - Long and short documents now have comparable weights

Sec. 6.3
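A tiny sketch of L2 normalization (plain Python; raw tf counts assumed, so appending d to itself doubles every component):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm, giving a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d = [3.0, 4.0]                 # raw tf counts of a toy document
d_twice = [2 * x for x in d]   # "d appended to itself": every count doubles
print(l2_normalize(d))         # [0.6, 0.8]
print(l2_normalize(d_twice))   # [0.6, 0.8] -- identical after normalization
```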



cosine(query, document)

      cos(q, d) = (q · d) / (|q| |d|)
                = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) × sqrt(Σ_{i=1}^{|V|} d_i²) )

      (the numerator is the dot product; dividing by the lengths makes q and d unit vectors)

- q_i is the tf-idf weight of term i in the query
- d_i is the tf-idf weight of term i in the document
- cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Sec. 6.3
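A direct transcription of the formula into Python (a sketch; dense vectors over the full vocabulary assumed for clarity):

```python
import math

def cosine(q, d):
    """cos(q, d) = q·d / (|q| |d|); 0 for zero-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

print(cosine([1.0, 1.0, 0.0], [2.0, 2.0, 0.0]))  # 1.0: same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0: orthogonal
```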



Cosine for length-normalized vectors

- For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

      cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

  for q, d length-normalized.

Cosine similarity illustrated

[Figure omitted: query and document vectors in a two-dimensional term space; the cosine of the angle between each document vector and the query vector measures their similarity.]


Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: To simplify this example, we don't do idf weighting.

Sec. 6.3


3 documents example contd.

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?

Sec. 6.3
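The whole pipeline for this example in a few lines of Python (a sketch that reproduces the slide's numbers; counts copied from the table above):

```python
import math

counts = {  # raw term frequencies from the slide
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalized(doc):
    """Log-frequency weighting followed by L2 length normalization."""
    w = {t: log_tf(tf) for t, tf in doc.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

vecs = {name: normalized(doc) for name, doc in counts.items()}

def sim(a, b):  # dot product of length-normalized vectors = cosine
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(sim("SaS", "PaP"), 2))  # 0.94
print(round(sim("SaS", "WH"), 2))   # 0.79
print(round(sim("PaP", "WH"), 2))   # 0.69
```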



Computing cosine scores

[Pseudocode figure omitted; a Python transcription follows below.]

Sec. 6.3
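A sketch of the term-at-a-time CosineScore algorithm of IIR Section 6.3 (my transcription; an inverted index mapping term → list of (doc_id, weight) postings and a table of document lengths are assumed):

```python
import heapq

def cosine_score(query_weights, index, doc_length, k=10):
    """Term-at-a-time scoring: accumulate w_{t,q} * w_{t,d} per document,
    normalize by document length, and return the top-k (doc_id, score) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# toy usage with made-up weights
index = {"car": [(0, 1.0), (1, 0.5)], "insurance": [(0, 1.3)]}
doc_length = {0: 1.92, 1: 1.1}
print(cosine_score({"car": 2.0, "insurance": 3.0}, index, doc_length))
```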



tf-idf weighting has many variants

[Table omitted: SMART notation for tf, df, and normalization variants; the columns headed 'n' are acronyms for weight schemes.]

Why is the base of the log in idf immaterial?

Sec. 6.4


Weighting may differ in queries vs documents

- Many search engines allow for different weightings for queries vs. documents
- SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
- A very standard weighting scheme is: lnc.ltc
  - Document: logarithmic tf (l as first character), no idf, and cosine normalization
    - No idf for documents: a bad idea?
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …

Sec. 6.4



tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

           Query                                     Document                     Prod
Term       tf-raw  tf-wt  df      idf  wt    n'lize  tf-raw  tf-wt  wt    n'lize
auto       0       0      5000    2.3  0     0       1       1      1     0.52   0
best       1       1      50000   1.3  1.3   0.34    0       0      0     0      0
car        1       1      10000   2.0  2.0   0.52    1       1      1     0.52   0.27
insurance  1       1      1000    3.0  3.0   0.78    2       1.3    1.3   0.68   0.53

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92

Score = 0 + 0 + 0.27 + 0.53 = 0.8

Sec. 6.4
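A sketch that recomputes this lnc.ltc example; N is not given on the slide, but idf = log10(N/df) with df = 5000 and idf = 2.3 is consistent with N = 1,000,000 (assumed below, and the answer to the exercise):

```python
import math

N = 1_000_000  # assumed: reproduces the idf column via log10(N/df)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "car": 1, "insurance": 2}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def l2_normalize(w):
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

# ltc query: log tf times idf, cosine-normalized
q = l2_normalize({t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()})
# lnc document: log tf, no idf, cosine-normalized
d = l2_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

score = sum(w * d.get(t, 0.0) for t, w in q.items())
print(round(score, 2))  # 0.8
```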



Summary – vector space ranking

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user


Resources for today's lecture

- IIR 6.2 – 6.4.3
- http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
  - Term weighting and cosine similarity tutorial for SEO folk!

Ch. 6