COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval

Lecture 6: Scoring, Term Weighting and the Vector Space Model
Recap of lecture 5

- Collection and vocabulary statistics: Heaps' and Zipf's laws
- Dictionary compression for Boolean indexes
  - Dictionary string, blocks, front coding
- Postings compression: gap encoding, prefix-unique codes
  - Variable-Byte, Gamma codes, Golomb/Rice codes

Sizes in MB:

  collection (text, xml markup etc)        3,600.0
  collection (text)                          960.0
  term-doc incidence matrix               40,000.0
  postings, uncompressed (32-bit words)      400.0
  postings, uncompressed (20 bits)           250.0
  postings, variable byte encoded            116.0
  postings, γ-encoded                        101.0
This lecture; IIR Sections 6.2-6.4.3

- Ranked retrieval
- Scoring documents
- Term frequency
- Collection statistics
- Weighting schemes
- Vector space scoring
Ranked retrieval (Ch. 6)

- Thus far, our queries have all been Boolean: documents either match or don't.
- Good for expert users with a precise understanding of their needs and of the collection.
- Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users:
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  - Most users don't want to wade through 1000s of results.
  - This is particularly true of web search.
Problem with Boolean search: feast or famine (Ch. 6)

- Boolean queries often result in either too few (=0) or too many (1000s) results.
  - Query 1: "standard user dlink 650" → 200,000 hits
  - Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many.
Ranked retrieval models

- Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.
- Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
- In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa.
Feast or famine: not a problem in ranked retrieval (Ch. 6)

- When a system produces a ranked result set, large result sets are not an issue.
  - Indeed, the size of the result set is not an issue.
  - We just show the top k (≈ 10) results.
  - We don't overwhelm the user.
- Premise: the ranking algorithm works.
Scoring as the basis of ranked retrieval (Ch. 6)

- We wish to return, in order, the documents most likely to be useful to the searcher.
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score, say in [0, 1], to each document.
- This score measures how well document and query "match".
Query-document matching scores (Ch. 6)

- We need a way of assigning a score to a query/document pair.
- Let's start with a one-term query.
  - If the query term does not occur in the document: the score should be 0.
  - The more frequent the query term in the document, the higher the score (should be).
- We will look at a number of alternatives for this.
Take 1: Jaccard coefficient (Ch. 6)

- Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B:

  jaccard(A, B) = |A ∩ B| / |A ∪ B|

- jaccard(A, A) = 1
- jaccard(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.
Jaccard coefficient: Scoring example (Ch. 6)

- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
- Query: ides of march
- Document 1: caesar died in march
- Document 2: the long march

(The term "Ides of March" is best known as the date on which Julius Caesar was killed, in 709 AUC, i.e. 44 BC.)
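The example works out as follows; a minimal Python sketch (mine, not from the slides), treating the query and each document as a set of terms:

```python
# Jaccard scoring of a free-text query against documents,
# treating each as a set of terms.

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ≈ 0.167
print(jaccard(query, doc2))  # 1/5 = 0.2
```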
Issues with Jaccard for scoring (Ch. 6)

1. It doesn't consider term frequency (how many times a term occurs in a document).
2. Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.

- We need a more sophisticated way of normalizing for length.
Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

Each document is represented by a binary vector ∈ {0,1}^|V|.
Term-document count matrices (Sec. 6.2)

- Consider the number of occurrences of a term in a document:
  - Each document is a count vector in ℕ^|V|: a column of the term-document count matrix.
Bag of words model

- The vector representation doesn't consider the ordering of words in a document.
  - "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later in this course.
- For now: bag of words model.
Term frequency tf

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR
Log-frequency weighting (Sec. 6.2)

- The log-frequency weight of term t in d is

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise.

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:

  score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

- The score is 0 if none of the query terms is present in the document.
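A minimal sketch (an illustration of mine, not lecture code) of log-frequency weighting and the resulting overlap score:

```python
# Log-frequency weighting and the overlap score for a query-document pair.
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """w = 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query: str, doc: str) -> float:
    """Sum log-tf weights over terms occurring in both query and document."""
    tf = Counter(doc.split())
    return sum(log_tf_weight(tf[t]) for t in set(query.split()))

print(log_tf_weight(2))     # ≈ 1.30
print(log_tf_weight(1000))  # 4.0
print(score("ides of march", "caesar died in march"))  # 1.0 (only 'march' matches)
```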
Document frequency (Sec. 6.2.1)

- Rare terms are more informative than frequent terms.
  - Recall stop words.
- Consider a term in the query that is rare in the collection (e.g., arachnocentric).
- A document containing this term is very likely to be relevant to the query arachnocentric.
- → We want a high weight for rare terms like arachnocentric.
Document frequency, continued (Sec. 6.2.1)

- Frequent terms are less informative than rare terms.
- Consider a query term that is frequent in the collection (e.g., high, increase, line).
- A document containing such a term is more likely to be relevant than a document that doesn't.
  - But it's not a sure indicator of relevance.
- → For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
- We will use document frequency (df) to capture this.
idf weight (Sec. 6.2.1)

- df_t is the document frequency of t: the number of documents that contain t.
  - df_t is an inverse measure of the informativeness of t.
  - df_t ≤ N.
- We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

- We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
- It will turn out that the base of the log is immaterial.
idf example, suppose N = 1 million (Sec. 6.2.1)

  term            df_t    idf_t
  calpurnia          1      6
  animal           100      4
  sunday         1,000      3
  fly           10,000      2
  under        100,000      1
  the        1,000,000      0

There is one idf value for each term t in a collection.
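The idf column above can be recomputed directly; a small sketch (mine, not from the slides):

```python
# idf_t = log10(N / df_t) for the document frequencies on this slide.
import math

N = 1_000_000
dfs = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
       "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df in dfs.items():
    print(f"{term:10s} df={df:>9,d}  idf={math.log10(N / df):.0f}")
```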
Effect of idf on ranking

- Does idf have an effect on ranking for one-term queries, like "iPhone"?
  - idf has no effect on the ranking of one-term queries.
- idf affects the ranking of documents for queries with at least two terms.
  - For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
Collection vs. document frequency (Sec. 6.2.1)

- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example: which word is a better search term (and should get a higher weight)?

  Word        Collection frequency   Document frequency
  insurance                 10,440                3,997
  try                       10,422                8,760
tf-idf weighting (Sec. 6.2.2)

- The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

- Best known weighting scheme in information retrieval.
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf × idf.
- Increases with the number of occurrences within a document.
- Increases with the rarity of the term in the collection.
Final ranking of documents for a query (Sec. 6.2.2)

  Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
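A minimal sketch (illustrative; the toy corpus is my own) of this tf-idf score over a tiny collection:

```python
# tf-idf scoring of a query over a tiny in-memory corpus,
# using w = (1 + log10 tf) * log10(N / df).
import math
from collections import Counter

docs = ["caesar died in march",
        "the long march",
        "the ides of march"]

N = len(docs)
tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in tfs for t in tf)  # number of docs containing each term

def tf_idf(t: str, d: int) -> float:
    tf = tfs[d][t]
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[t])

def score(query: str, d: int) -> float:
    return sum(tf_idf(t, d) for t in set(query.split()))

for d in range(N):
    print(d, round(score("ides of march", d), 3))
```

Note how "march", which occurs in every document, has idf 0 and contributes nothing to any score.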
Binary → count → weight matrix (Sec. 6.3)

Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|.
Documents as vectors (Sec. 6.3)

- So we have a |V|-dimensional vector space.
- Terms are the axes of the space.
- Documents are points or vectors in this space.
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
- These are very sparse vectors: most entries are zero.
Queries as vectors (Sec. 6.3)

- Key idea 1: do the same for queries: represent them as vectors in the space.
- Key idea 2: rank documents according to their proximity to the query in this space.
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents.
Formalizing vector space proximity (Sec. 6.3)

- First cut: distance between two points (= distance between the end points of the two vectors).
- Euclidean distance?
- Euclidean distance is a bad idea...
- ...because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea (Sec. 6.3)

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance (Sec. 6.3)

- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content.
- The Euclidean distance between the two documents can be quite large.
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query.
From angles to cosines (Sec. 6.3)

- The following two notions are equivalent:
  - Rank documents in increasing order of the angle between query and document.
  - Rank documents in decreasing order of cosine(query, document).
- Cosine is a monotonically decreasing function on the interval [0°, 180°].
From angles to cosines (Sec. 6.3)

But how, and why, should we be computing cosines?
Length normalization (Sec. 6.3)

- A vector x can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

  ||x||_2 = sqrt( Σ_i x_i^2 )

- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
  - Long and short documents now have comparable weights.
cosine(query, document) (Sec. 6.3)

  cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_i q_i^2) · sqrt(Σ_i d_i^2) )

- The numerator is the dot product of q and d; dividing by the vector lengths turns q and d into unit vectors.
- q_i is the tf-idf weight of term i in the query.
- d_i is the tf-idf weight of term i in the document.
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
Cosine for length-normalized vectors (Sec. 6.3)

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.
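A minimal sketch (mine, not from the slides) showing that after L2 normalization cosine similarity is a plain dot product, and that d and d′ (d appended to itself) become identical:

```python
# Cosine similarity via L2 normalization followed by a dot product.
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(u, v):
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

d = [3.0, 1.0, 0.0]
d_doubled = [6.0, 2.0, 0.0]   # d appended to itself
print(cosine(d, d_doubled))    # 1.0: identical after normalization
```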
Cosine similarity illustrated (Sec. 6.3)

[Figure omitted.]
Cosine similarity amongst 3 documents (Sec. 6.3)

How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

  term        SaS   PaP   WH
  affection   115    58   20
  jealous      10     7   11
  gossip        2     0    6
  wuthering     0     0   38

Note: to simplify this example, we don't do idf weighting.
3 documents example contd. (Sec. 6.3)

Log frequency weighting:

  term        SaS   PaP    WH
  affection  3.06  2.76  2.30
  jealous    2.00  1.85  2.04
  gossip     1.30     0  1.78
  wuthering     0     0  2.58

After length normalization:

  term         SaS    PaP     WH
  affection  0.789  0.832  0.524
  jealous    0.515  0.555  0.465
  gossip     0.335      0  0.405
  wuthering      0      0  0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
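The numbers above can be reproduced in a few lines; a sketch of mine, using the counts from the previous slide:

```python
# Reproducing the SaS/PaP/WH example: log-tf weighting,
# L2 length normalization, then pairwise cosines.
import math

counts = {              # term frequencies from the previous slide
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

vecs = {doc: normalize({t: log_tf(tf) for t, tf in tfs.items()})
        for doc, tfs in counts.items()}

def cos(a, b):
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))  # ≈ 0.94
print(round(cos("SaS", "WH"), 2))   # ≈ 0.79
print(round(cos("PaP", "WH"), 2))   # ≈ 0.69
```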
Computing cosine scores (Sec. 6.3)

[CosineScore pseudocode omitted; a sketch follows below.]
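A term-at-a-time sketch in the spirit of IIR's CosineScore algorithm; the toy index layout, weights, and helper names here are illustrative assumptions of mine, not the lecture's pseudocode:

```python
# Term-at-a-time cosine scoring over a toy inverted index.
import heapq

# postings[term] -> list of (doc_id, w_{t,d}) pairs; lengths[d] -> |d|
postings = {"car": [(0, 1.0), (2, 2.0)], "insurance": [(0, 1.3), (1, 1.0)]}
lengths = {0: 1.9, 1: 1.0, 2: 2.2}   # precomputed document vector lengths

def cosine_score(query_weights: dict, k: int = 10):
    scores = {}
    for term, w_tq in query_weights.items():            # term at a time
        for doc, w_td in postings.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + w_tq * w_td
    for doc in scores:                                   # length-normalize
        scores[doc] /= lengths[doc]
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])

print(cosine_score({"car": 1.0, "insurance": 1.5}, k=3))
```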
tf-idf weighting has many variants (Sec. 6.4)

[SMART weighting-scheme table omitted; columns headed 'n' are acronyms for weight schemes.]

Why is the base of the log in idf immaterial?
Weighting may differ in queries vs documents (Sec. 6.4)

- Many search engines allow different weightings for queries vs. documents.
- SMART notation denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
- A very standard weighting scheme is: lnc.ltc
  - Document: logarithmic tf (l as first character), no idf, and cosine normalization
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization ...
- A bad idea?
tf-idf example: lnc.ltc (Sec. 6.4)

Query: best car insurance
Document: car insurance auto insurance

  Term       |          Query                        |     Document              | Prod
             | tf-raw tf-wt df     idf  wt    n'lize | tf-raw tf-wt wt    n'lize |
  auto       | 0      0     5000   2.3  0     0      | 1      1     1     0.52   | 0
  best       | 1      1     50000  1.3  1.3   0.34   | 0      0     0     0      | 0
  car        | 1      1     10000  2.0  2.0   0.52   | 1      1     1     0.52   | 0.27
  insurance  | 1      1     1000   3.0  3.0   0.78   | 2      1.3   1.3   0.68   | 0.53

Exercise: what is N, the number of docs?

Doc length = sqrt(1^2 + 0^2 + 1^2 + 1.3^2) ≈ 1.92

Score = 0 + 0 + 0.27 + 0.53 = 0.8
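A sketch of mine reproducing the lnc.ltc computation above; the idf values are taken from the table rather than computed from a collection:

```python
# lnc.ltc weighting for the example above.
# Document side: log tf, no idf, cosine normalization (lnc).
# Query side:    log tf, idf,    cosine normalization (ltc).
import math

idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "car": 1, "insurance": 2}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()}

q = normalize({t: log_tf(tf) * idf[t] for t, tf in query_tf.items()})  # ltc
d = normalize({t: log_tf(tf) for t, tf in doc_tf.items()})             # lnc

score = sum(q[t] * d.get(t, 0.0) for t in q)
print(round(score, 2))  # ≈ 0.8
```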
Summary: vector space ranking

- Represent the query as a weighted tf-idf vector.
- Represent each document as a weighted tf-idf vector.
- Compute the cosine similarity score for the query vector and each document vector.
- Rank documents with respect to the query by score.
- Return the top K (e.g., K = 10) to the user.
Resources for today's lecture (Ch. 6)

- IIR Sections 6.2-6.4.3
- http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
  - Term weighting and cosine similarity tutorial for SEO folk!