COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval

Lecture 6: Scoring, Term Weighting and the Vector Space Model


Recap of lecture 5

- Collection and vocabulary statistics: Heaps' and Zipf's laws
- Dictionary compression for Boolean indexes
  - Dictionary string, blocks, front coding
- Postings compression: gap encoding, prefix-unique codes
  - Variable-Byte, Gamma codes, Golomb/Rice codes

Sizes (MB):

collection (text, xml markup etc)        3,600.0
collection (text)                          960.0
term-doc incidence matrix               40,000.0
postings, uncompressed (32-bit words)      400.0
postings, uncompressed (20 bits)           250.0
postings, variable byte encoded            116.0
postings, γ-encoded                        101.0


This lecture; IIR Sections 6.2-6.4.3

- Ranked retrieval
- Scoring documents
- Term frequency
- Collection statistics
- Weighting schemes
- Vector space scoring


Ranked retrieval

- Thus far, our queries have all been Boolean.
  - Documents either match or don't.
- Good for expert users with a precise understanding of their needs and the collection.
- Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users.
  - Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
  - Most users don't want to wade through 1000s of results.
  - This is particularly true of web search.

Ch. 6


Problem with Boolean search: feast or famine

- Boolean queries often result in either too few (=0) or too many (1000s) results.
  - Query 1: "standard user dlink 650" → 200,000 hits
  - Query 2: "standard user dlink 650 no card found" → 0 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many

Ch. 6


Ranked retrieval models

- Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
- Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
- In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries and vice versa


Feast or famine: not a problem in ranked retrieval

- When a system produces a ranked result set, large result sets are not an issue
  - Indeed, the size of the result set is not an issue
  - We just show the top k (≈ 10) results
  - We don't overwhelm the user
- Premise: the ranking algorithm works

Ch. 6


Scoring as the basis of ranked retrieval

- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the documents in the collection with respect to a query?
- Assign a score – say in [0, 1] – to each document
  - This score measures how well document and query "match".

Ch. 6


Query-document matching scores

- We need a way of assigning a score to a query/document pair
- Let's start with a one-term query
  - If the query term does not occur in the document: score should be 0
  - The more frequent the query term in the document, the higher the score (should be)
- We will look at a number of alternatives for this.

Ch. 6


Take 1: Jaccard coefficient

- Recall from Lecture 3: a commonly used measure of overlap of two sets A and B
  - jaccard(A, B) = |A ∩ B| / |A ∪ B|
  - jaccard(A, A) = 1
  - jaccard(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.

Ch. 6


Jaccard coefficient: scoring example

- What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
  - Query: ides of march
  - Document 1: caesar died in march
  - Document 2: the long march

Ch. 6

(The term Ides of March is best known as the date on which Julius Caesar was killed, in 709 AUC or 44 B.C.)
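As a quick check, here is a minimal Python sketch (mine, not the lecture's) that computes these Jaccard scores over bag-of-words term sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())

print(jaccard(query, doc1))  # 1/6 ≈ 0.167 (only "march" is shared)
print(jaccard(query, doc2))  # 1/5 = 0.2
```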


Issues with Jaccard for scoring

1. It doesn't consider term frequency (how many times a term occurs in a document)
2. Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information.
- We need a more sophisticated way of normalizing for length

Ch. 6


Recall (Lecture 1): binary term-document incidence matrix

Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2


Term-document count matrices

- Consider the number of occurrences of a term in a document:
  - Each document is a count vector in ℕ^|V|: a column of the term-document count matrix

Sec. 6.2


Bag of words model

- The vector representation doesn't consider the ordering of words in a document
- John is quicker than Mary and Mary is quicker than John have the same vectors
- This is called the bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later in this course.
- For now: bag of words model (see the sketch below)
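To make the order-invariance concrete, a tiny Python sketch (not from the lecture) showing that the two sentences above get the same count vector:

```python
from collections import Counter

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

# Bag of words: term -> count, with word order discarded
print(Counter(d1.lower().split()) == Counter(d2.lower().split()))  # True
```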



Term frequency tf

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR


Log-frequency weighting

- The log frequency weight of term t in d is

      w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
              = 0                     otherwise

- 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:

      score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

- The score is 0 if none of the query terms is present in the document.

Sec. 6.2
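A small Python sketch of this weighting (helper names are mine, not the lecture's):

```python
import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query: str, doc: str) -> float:
    """Sum of log-tf weights over terms occurring in both query and document."""
    tf = Counter(doc.split())  # Counter returns 0 for absent terms
    return sum(log_tf_weight(tf[t]) for t in set(query.split()))

# 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
for x in (0, 1, 2, 10, 1000):
    print(x, round(log_tf_weight(x), 2))
```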



Document frequency

- Rare terms are more informative than frequent terms
  - Recall stop words
- Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
- → We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1


Document frequency, continued

- Frequent terms are less informative than rare terms
- Consider a query term that is frequent in the collection (e.g., high, increase, line)
- A document containing such a term is more likely to be relevant than a document that doesn't
  - But it's not a sure indicator of relevance.
- → For frequent terms, we want high positive weights for words like high, increase, and line
  - But lower weights than for rare terms.
- We will use document frequency (df) to capture this.

Sec. 6.2.1


idf weight

- df_t is the document frequency of t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - df_t ≤ N
- We define the idf (inverse document frequency) of t by

      idf_t = log10(N / df_t)

  - We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

Will turn out the base of the log is immaterial.

Sec. 6.2.1



idf example, suppose N = 1 million

term           df_t        idf_t
calpurnia              1       6
animal               100       4
sunday             1,000       3
fly               10,000       2
under            100,000       1
the            1,000,000       0

There is one idf value for each term t in a collection.

Sec. 6.2.1
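These values follow directly from idf_t = log10(N/df_t); a short sketch to reproduce the table (df values as on the slide):

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df_t in df.items():
    print(f"{term:<10} df={df_t:>9,}  idf={math.log10(N / df_t):.0f}")
```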



Effect of idf on ranking

- Does idf have an effect on ranking for one-term queries, like
  - iPhone
- idf has no effect on the ranking of one-term queries
  - idf affects the ranking of documents for queries with at least two terms
  - For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.


Collection vs. document frequency

- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
- Example:

  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760

- Which word is a better search term (and should get a higher weight)?

Sec. 6.2.1


tf-idf weighting

- The tf-idf weight of a term is the product of its tf weight and its idf weight:

      w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

- Best known weighting scheme in information retrieval
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf × idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

Sec. 6.2.2
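Putting the two pieces together, a minimal sketch (the toy corpus and helper names are mine) that computes tf-idf weights:

```python
import math
from collections import Counter

docs = {
    "d1": "car insurance auto insurance".split(),
    "d2": "best car repair".split(),
}
N = len(docs)

# document frequency: in how many docs does each term appear?
df = Counter(t for terms in docs.values() for t in set(terms))

def tf_idf(term: str, doc_terms: list) -> float:
    """(1 + log10 tf) * log10(N / df); 0 if the term is absent from the doc."""
    tf = doc_terms.count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

print(tf_idf("insurance", docs["d1"]))  # rare term -> positive weight (≈ 0.39)
print(tf_idf("car", docs["d1"]))        # occurs in every doc -> idf = 0 -> weight 0
```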



Final ranking of documents for a query

      Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}

Sec. 6.2.2

Binary → count → weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Sec. 6.3


Documents as vectors

- So we have a |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors – most entries are zero.

Sec. 6.3


Queries as vectors

- Key idea 1: do the same for queries: represent them as vectors in the space
- Key idea 2: rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents

Sec. 6.3


Formalizing vector space proximity

- First cut: distance between two points
  - ( = distance between the end points of the two vectors)
- Euclidean distance?
- Euclidean distance is a bad idea . . .
- . . . because Euclidean distance is large for vectors of different lengths.

Sec. 6.3


Why distance is a bad idea

[Figure omitted] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3


Use angle instead of distance

- Thought experiment: take a document d and append it to itself. Call this document d′.
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to angle with query.

Sec. 6.3


From angles to cosines

- The following two notions are equivalent.
  - Rank documents in decreasing order of the angle between query and document
  - Rank documents in increasing order of cosine(query, document)
- Cosine is a monotonically decreasing function on the interval [0°, 180°]

Sec. 6.3


From angles to cosines

- But how – and why – should we be computing cosines?

Sec. 6.3


Length normalization

- A vector can be (length-)normalized by dividing each of its components by its length – for this we use the L2 norm:

      ||x||_2 = sqrt( Σ_i x_i² )

- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
  - Long and short documents now have comparable weights

Sec. 6.3
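A tiny sketch of L2 normalization (plain Python; raw tf counts assumed, so appending d to itself doubles every component):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm, giving a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d = [3.0, 4.0]                 # raw tf counts of a toy document
d_twice = [2 * x for x in d]   # "d appended to itself": every count doubles
print(l2_normalize(d))         # [0.6, 0.8]
print(l2_normalize(d_twice))   # [0.6, 0.8] -- identical after normalization
```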



cosine(query, document)

      cos(q, d) = (q · d) / (|q| |d|)
                = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) × sqrt(Σ_{i=1}^{|V|} d_i²) )

      (the numerator is the dot product; dividing by the lengths makes q and d unit vectors)

- q_i is the tf-idf weight of term i in the query
- d_i is the tf-idf weight of term i in the document
- cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Sec. 6.3
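A direct transcription of the formula into Python (a sketch; dense vectors over the full vocabulary assumed for clarity):

```python
import math

def cosine(q, d):
    """cos(q, d) = q·d / (|q| |d|); 0 for zero-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

print(cosine([1.0, 1.0, 0.0], [2.0, 2.0, 0.0]))  # 1.0: same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0: orthogonal
```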



Cosine for length-normalized vectors

- For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

      cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

  for q, d length-normalized.

Cosine similarity illustrated

[Figure omitted: query and document vectors in a two-dimensional term space; the cosine of the angle between each document vector and the query vector measures their similarity.]


Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: To simplify this example, we don't do idf weighting.

Sec. 6.3


3 documents example contd.

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)?

Sec. 6.3
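The whole pipeline for this example in a few lines of Python (a sketch that reproduces the slide's numbers; counts copied from the table above):

```python
import math

counts = {  # raw term frequencies from the slide
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalized(doc):
    """Log-frequency weighting followed by L2 length normalization."""
    w = {t: log_tf(tf) for t, tf in doc.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

vecs = {name: normalized(doc) for name, doc in counts.items()}

def sim(a, b):  # dot product of length-normalized vectors = cosine
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(sim("SaS", "PaP"), 2))  # 0.94
print(round(sim("SaS", "WH"), 2))   # 0.79
print(round(sim("PaP", "WH"), 2))   # 0.69
```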



Computing cosine scores

[Pseudocode figure omitted; a Python transcription follows below.]

Sec. 6.3
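A sketch of the term-at-a-time CosineScore algorithm of IIR Section 6.3 (my transcription; an inverted index mapping term → list of (doc_id, weight) postings and a table of document lengths are assumed):

```python
import heapq

def cosine_score(query_weights, index, doc_length, k=10):
    """Term-at-a-time scoring: accumulate w_{t,q} * w_{t,d} per document,
    normalize by document length, and return the top-k (doc_id, score) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# toy usage with made-up weights
index = {"car": [(0, 1.0), (1, 0.5)], "insurance": [(0, 1.3)]}
doc_length = {0: 1.92, 1: 1.1}
print(cosine_score({"car": 2.0, "insurance": 3.0}, index, doc_length))
```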



tf-idf weighting has many variants

[Table omitted: SMART notation for tf, df, and normalization variants; the columns headed 'n' are acronyms for weight schemes.]

Why is the base of the log in idf immaterial?

Sec. 6.4


Weighting may differ in queries vs documents

- Many search engines allow for different weightings for queries vs. documents
- SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
- A very standard weighting scheme is: lnc.ltc
  - Document: logarithmic tf (l as first character), no idf, and cosine normalization
    - No idf for documents: a bad idea?
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …

Sec. 6.4



tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

           Query                                     Document                     Prod
Term       tf-raw  tf-wt  df      idf  wt    n'lize  tf-raw  tf-wt  wt    n'lize
auto       0       0      5000    2.3  0     0       1       1      1     0.52   0
best       1       1      50000   1.3  1.3   0.34    0       0      0     0      0
car        1       1      10000   2.0  2.0   0.52    1       1      1     0.52   0.27
insurance  1       1      1000    3.0  3.0   0.78    2       1.3    1.3   0.68   0.53

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92

Score = 0 + 0 + 0.27 + 0.53 = 0.8

Sec. 6.4
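A sketch that recomputes this lnc.ltc example; N is not given on the slide, but idf = log10(N/df) with df = 5000 and idf = 2.3 is consistent with N = 1,000,000 (assumed below, and the answer to the exercise):

```python
import math

N = 1_000_000  # assumed: reproduces the idf column via log10(N/df)
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "car": 1, "insurance": 2}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def l2_normalize(w):
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

# ltc query: log tf times idf, cosine-normalized
q = l2_normalize({t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()})
# lnc document: log tf, no idf, cosine-normalized
d = l2_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

score = sum(w * d.get(t, 0.0) for t, w in q.items())
print(round(score, 2))  # 0.8
```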



Summary – vector space ranking

- Represent the query as a weighted tf-idf vector
- Represent each document as a weighted tf-idf vector
- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K = 10) to the user


Resources for today's lecture

- IIR 6.2 – 6.4.3
- http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
  - Term weighting and cosine similarity tutorial for SEO folk!

Ch. 6