COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval
Lecture 6: Scoring, Term Weighting and the Vector Space Model
Recap of lecture 5

- Collection and vocabulary statistics: Heaps' and Zipf's laws
- Dictionary compression for Boolean indexes
  - Dictionary string, blocks, front coding
- Postings compression: gap encoding, prefix-unique codes
  - Variable-Byte, Gamma codes, Golomb/Rice codes

  Data structure                            Size (MB)
  collection (text, XML markup, etc.)         3,600.0
  collection (text)                             960.0
  term-document incidence matrix             40,000.0
  postings, uncompressed (32-bit words)         400.0
  postings, uncompressed (20 bits)              250.0
  postings, variable byte encoded               116.0
  postings, γ-encoded                           101.0
This lecture; IIR Sections 6.2-6.4.3

- Ranked retrieval
- Scoring documents
- Term frequency
- Collection statistics
- Weighting schemes
- Vector space scoring
Ranked retrieval (Ch. 6)

- Thus far, our queries have all been Boolean.
  - Documents either match or don't.
- Good for expert users with a precise understanding of their needs and the collection.
  - Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  - Most users don't want to wade through 1000s of results.
    - This is particularly true of web search.
Problem with Boolean search: feast or famine (Ch. 6)

- Boolean queries often result in either too few (= 0) or too many (1000s) results.
- Query 1: "standard user dlink 650" → 200,000 hits
- Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many
Ranked retrieval models

- Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
- Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
- In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa
Feast or famine: not a problem in ranked retrieval (Ch. 6)

- When a system produces a ranked result set, large result sets are not an issue
  - Indeed, the size of the result set is not an issue
  - We just show the top k (≈ 10) results
  - We don't overwhelm the user
- Premise: the ranking algorithm works
Scoring as the basis of ranked retrieval (Ch. 6)

- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score, say in [0, 1], to each document
- This score measures how well document and query "match".
Query-document matching scores (Ch. 6)

- We need a way of assigning a score to a query/document pair
- Let's start with a one-term query
- If the query term does not occur in the document: score should be 0
- The more frequent the query term in the document, the higher the score (should be)
- We will look at a number of alternatives for this.
Take 1: Jaccard coefficient (Ch. 6)

- Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B
- jaccard(A, B) = |A ∩ B| / |A ∪ B|
- jaccard(A, A) = 1
- jaccard(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.
Jaccard coefficient: scoring example (Ch. 6)

- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
- Query: ides of march
- Document 1: caesar died in march
- Document 2: the long march
- (The term "Ides of March" is best known as the date on which Julius Caesar was killed, in 709 AUC, or 44 B.C.)
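As a quick sanity check (my own sketch, not part of the original slides), the Jaccard scores for this example can be computed directly, treating each text as a set of whitespace-separated terms:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of two texts viewed as sets of terms."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

query = "ides of march"
print(jaccard(query, "caesar died in march"))  # 1/6 ≈ 0.17 (only "march" is shared)
print(jaccard(query, "the long march"))        # 1/5 = 0.20
```

Note that the shorter Document 2 scores higher even though both documents share only "march" with the query; this previews the length-normalization issue raised on the next slide.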
Issues with Jaccard for scoring (Ch. 6)

1. It doesn't consider term frequency (how many times a term occurs in a document)
2. Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information
- We need a more sophisticated way of normalizing for length
Recall (Lecture 1): binary term-document incidence matrix (Sec. 6.2)

Each document is represented by a binary vector ∈ {0,1}^|V|
Term-document count matrices (Sec. 6.2)

- Consider the number of occurrences of a term in a document:
  - Each document is a count vector in ℕ^|V|: a column of the term-document count matrix
Bag of words model

- The vector representation doesn't consider the ordering of words in a document
- "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later in this course.
- For now: bag of words model
Term frequency tf

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR
Log-frequency weighting (Sec. 6.2)

- The log frequency weight of term t in d is

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                     otherwise

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

- The score is 0 if none of the query terms is present in the document.
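A minimal sketch of this overlap score (my own illustration, not from the slides), using base-10 logs as above:

```python
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query: str, doc: str) -> float:
    """Sum of log-tf weights over terms occurring in both query and document."""
    tf = Counter(doc.split())
    return sum(log_tf_weight(tf[t]) for t in set(query.split()) if t in tf)

print(log_tf_weight(2))     # ≈ 1.30
print(log_tf_weight(1000))  # 4.0
```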
Document frequency (Sec. 6.2.1)

- Rare terms are more informative than frequent terms
  - Recall stop words
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.
Document frequency, continued (Sec. 6.2.1)

- Frequent terms are less informative than rare terms
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- A document containing such a term is more likely to be relevant than a document that doesn't
  - But it's not a sure indicator of relevance.
- → For frequent terms, we want high positive weights for words like high, increase, and line
  - But lower weights than for rare terms.
- We will use document frequency (df) to capture this.
idf weight (Sec. 6.2.1)

- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - df_t ≤ N
- We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

- We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
- It will turn out that the base of the log is immaterial.
idf example, suppose N = 1 million (Sec. 6.2.1)

  term        df_t         idf_t
  calpurnia           1        6
  animal            100        4
  sunday          1,000        3
  fly            10,000        2
  under         100,000        1
  the         1,000,000        0

There is one idf value for each term t in a collection.
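The table values follow directly from the definition; a small sketch (my own, not from the slides) reproducing them:

```python
import math

N = 1_000_000  # documents in the collection

def idf(df: int) -> float:
    return math.log10(N / df)

for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:>9,} idf={idf(df):.0f}")
```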
Effect of idf on ranking

- Does idf have an effect on ranking for one-term queries, like "iPhone"?
  - idf has no effect on the ranking of one-term queries
  - idf affects the ranking of documents for queries with at least two terms
- For the query "capricious person", idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
Collection vs. document frequency (Sec. 6.2.1)

- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example:

  Word        Collection frequency    Document frequency
  insurance                 10,440                 3,997
  try                       10,422                 8,760

- Which word is a better search term (and should get a higher weight)?
tf-idf weighting (Sec. 6.2.2)

- The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

- Best known weighting scheme in information retrieval
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf × idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
Final ranking of documents for a query (Sec. 6.2.2)

    Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
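Putting the pieces together, a minimal sketch (my own illustration; the toy corpus and whitespace tokenization are placeholders, not from the slides) of tf-idf scoring over a small collection:

```python
import math
from collections import Counter

docs = ["caesar died in march", "the long march", "march of the ides"]  # toy corpus
N = len(docs)
tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in tfs for t in tf)  # number of docs containing each term

def tf_idf(t: str, d: int) -> float:
    """(1 + log10 tf) * log10(N / df); 0 if the term is absent from the document."""
    if tfs[d][t] == 0:
        return 0.0
    return (1 + math.log10(tfs[d][t])) * math.log10(N / df[t])

def score(query: str, d: int) -> float:
    return sum(tf_idf(t, d) for t in set(query.split()))

for d in range(N):
    print(d, round(score("ides of march", d), 3))
```

Note that "march" occurs in every toy document, so its idf, and hence its contribution to every score, is 0; only the rarer query terms separate the documents.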
Binary → count → weight matrix (Sec. 6.3)

Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|
Documents as vectors (Sec. 6.3)

- So we have a |V|-dimensional vector space
- Terms are the axes of the space
- Documents are points or vectors in this space
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors: most entries are zero.
Queries as vectors (Sec. 6.3)

- Key idea 1: do the same for queries: represent them as vectors in the space
- Key idea 2: rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents
Formalizing vector space proximity (Sec. 6.3)

- First cut: distance between two points (= distance between the end points of the two vectors)
- Euclidean distance?
- Euclidean distance is a bad idea...
- ... because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea (Sec. 6.3)

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance (Sec. 6.3)

- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query.
From angles to cosines (Sec. 6.3)

- The following two notions are equivalent:
  - Rank documents in decreasing order of the angle between query and document
  - Rank documents in increasing order of cosine(query, document)
- Cosine is a monotonically decreasing function on the interval [0°, 180°]
From angles to cosines (Sec. 6.3)

- But how, and why, should we be computing cosines?
Length normalization (Sec. 6.3)

- A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

    ‖x‖₂ = √(Σᵢ xᵢ²)

- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
  - Long and short documents now have comparable weights
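A small sketch (my own, not from the slides) showing that d and d′ collapse to the same unit vector:

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d = [3.0, 1.0, 0.0]           # raw term counts for a tiny document
d_prime = [2 * x for x in d]  # appending d to itself doubles every raw count
print(l2_normalize(d))        # [0.949, 0.316, 0.0]
print(l2_normalize(d_prime))  # identical after normalization
```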
cosine(query, document) (Sec. 6.3)

    cos(q, d) = (q · d) / (‖q‖ ‖d‖) = Σ_{i=1}^{|V|} qᵢdᵢ / ( √(Σ_{i=1}^{|V|} qᵢ²) · √(Σ_{i=1}^{|V|} dᵢ²) )

- The numerator is the dot product of q and d; dividing by the lengths is equivalent to taking the dot product of the unit vectors q/‖q‖ and d/‖d‖.
- qᵢ is the tf-idf weight of term i in the query
- dᵢ is the tf-idf weight of term i in the document
- cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
Cosine for length-normalized vectors

- For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

    cos(q, d) = q · d = Σ_{i=1}^{|V|} qᵢdᵢ

  for q, d length-normalized.
Cosine similarity illustrated

[Figure omitted: a 2-D term space showing query and document vectors and the angles between them.]
Cosine similarity amongst 3 documents (Sec. 6.3)

How similar are the novels
- SaS: Sense and Sensibility
- PaP: Pride and Prejudice, and
- WH: Wuthering Heights?

Term frequencies (counts)

  term        SaS    PaP    WH
  affection   115     58    20
  jealous      10      7    11
  gossip        2      0     6
  wuthering     0      0    38

Note: to simplify this example, we don't do idf weighting.
3 documents example contd. (Sec. 6.3)

Log frequency weighting

  term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

After length normalization

  term        SaS    PaP    WH
  affection   0.789  0.832  0.524
  jealous     0.515  0.555  0.465
  gossip      0.335  0      0.405
  wuthering   0      0      0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?
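A short sketch (my own, not from the slides) reproducing the computation end to end from the raw counts:

```python
import math

counts = {  # term frequencies: affection, jealous, gossip, wuthering
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_weight(tf):
    return [1 + math.log10(x) if x > 0 else 0.0 for x in tf]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

vecs = {name: normalize(log_weight(tf)) for name, tf in counts.items()}

def cos(a, b):  # dot product of unit vectors
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(cos("SaS", "PaP"), 2))  # ≈ 0.94
print(round(cos("SaS", "WH"), 2))   # ≈ 0.79
print(round(cos("PaP", "WH"), 2))   # ≈ 0.69
```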
Computing cosine scores (Sec. 6.3)

[Algorithm figure omitted: the CosineScore pseudocode from IIR, which accumulates per-document scores term-at-a-time over the postings lists, normalizes by document vector length, and returns the top K.]
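A sketch of that term-at-a-time algorithm in Python (my rendering of the textbook pseudocode; the postings and length structures are assumed inputs, not a real index API):

```python
import heapq

def cosine_score(query_terms, postings, doc_length, K=10):
    """Term-at-a-time cosine scoring, in the shape of IIR's CosineScore.

    postings[t]   -> list of (doc_id, wf_td) pairs for term t   (assumed input)
    doc_length[d] -> Euclidean length of document d's vector    (assumed input)
    """
    scores = {}  # one accumulator per document touched by some query term
    for t in query_terms:
        w_tq = 1.0  # simple query weight; a real engine might use ltc weighting here
        for doc_id, wf_td in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + wf_td * w_tq
    # normalize by document vector length, then take the K largest scores
    results = ((s / doc_length[d], d) for d, s in scores.items())
    return heapq.nlargest(K, results)
```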
tf-idf weighting has many variants (Sec. 6.4)

SMART notation: each column gives single-letter acronyms for weight schemes (e.g., columns headed 'n').

  Term frequency:
    n (natural)      tf_{t,d}
    l (logarithm)    1 + log(tf_{t,d})
    a (augmented)    0.5 + 0.5 · tf_{t,d} / max_t(tf_{t,d})
    b (boolean)      1 if tf_{t,d} > 0, else 0
  Document frequency:
    n (no)           1
    t (idf)          log(N / df_t)
    p (prob idf)     max{0, log((N − df_t) / df_t)}
  Normalization:
    n (none)         1
    c (cosine)       1 / √(w₁² + w₂² + … + w_M²)
    u (pivoted unique), b (byte size)

Why is the base of the log in idf immaterial?
Weighting may differ in queries vs documents (Sec. 6.4)

- Many search engines allow for different weightings for queries vs. documents
- SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
- A very standard weighting scheme is: lnc.ltc
- Document: logarithmic tf (l as first character), no idf, and cosine normalization
  - No idf: a bad idea?
- Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization ...
tf-idf example: lnc.ltc (Sec. 6.4)

Document: car insurance auto insurance
Query: best car insurance

             Query                                    Document                   Prod
  Term       tf-raw tf-wt  df     idf  wt    n'lize   tf-raw tf-wt  wt    n'lize
  auto       0      0      5000   2.3  0     0        1      1      1     0.52   0
  best       1      1      50000  1.3  1.3   0.34     0      0      0     0      0
  car        1      1      10000  2.0  2.0   0.52     1      1      1     0.52   0.27
  insurance  1      1      1000   3.0  3.0   0.78     2      1.3    1.3   0.68   0.53

Exercise: what is N, the number of docs?

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
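A sketch (my own, not from the slides) reproducing the lnc.ltc score for this pair; N = 1,000,000 is an assumption consistent with the idf column, not given directly by the slide:

```python
import math
from collections import Counter

N = 1_000_000  # assumed collection size (consistent with the idf values above)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

def ltf(tf):  # logarithmic tf ('l')
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):  # cosine normalization ('c')
    n = math.sqrt(sum(x * x for x in v.values()))
    return {t: x / n for t, x in v.items()}

terms = list(df)
q_tf = Counter("best car insurance".split())
d_tf = Counter("car insurance auto insurance".split())

# ltc for the query: log tf × idf, then cosine-normalize
q = normalize({t: ltf(q_tf[t]) * math.log10(N / df[t]) for t in terms})
# lnc for the document: log tf, no idf ('n'), cosine-normalize
d = normalize({t: ltf(d_tf[t]) for t in terms})

print(round(sum(q[t] * d[t] for t in terms), 1))  # 0.8
```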
Summary: vector space ranking

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user
Resources for today's lecture (Ch. 6)

- IIR 6.2 - 6.4.3
- http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
  - Term weighting and cosine similarity tutorial for SEO folk!