COMP6714: Information Retrieval & Web Search
Introduction to Information Retrieval
Lecture 5: Index Compression
Inverted index
▪ For each term t, we must store a list of all documents that contain t.
▪ Identify each doc by a docID, a document serial number.
▪ Can we use fixed-size arrays for this?
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132
Calpurnia → 2  31  54  101
What happens if the word Caesar
is added to document 14?
Inverted index
▪ We need variable-size postings lists
▪ On disk, a continuous run of postings is normal and
best
▪ In memory, can use linked lists or variable length
arrays
▪ Some tradeoffs in size/ease of insertion
Dictionary                Postings
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132
Calpurnia → 2  31  54  101
Each entry in a postings list is called a posting. Sorted by docID (more later on why).
Inverted index construction
Documents to be indexed:  Friends, Romans, countrymen.
      ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
      ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
      ↓ Indexer
Inverted index:           friend     → 2  4
                          roman      → 1  2
                          countryman → 13  16
Where do we pay in storage?
[Figure: the dictionary holds the terms and their counts, the postings file holds the lists of docIDs, and pointers link each dictionary entry to its postings list.]
Today
▪ Why compression?
▪ Collection statistics in more detail (with RCV1)
▪ How big will the dictionary and postings be?
▪ Dictionary compression
▪ Postings compression
Ch. 5
Why compression (in general)?
▪ Use less disk space
▪ Saves a little money
▪ Keep more stuff in memory
▪ Increases speed
▪ Increase speed of data transfer from disk to memory
▪ [read compressed data | decompress] is faster than
[read uncompressed data]
▪ Premise: Decompression algorithms are fast
▪ True of the decompression algorithms we use
Ch. 5
Why compression for inverted indexes?
▪ Dictionary
▪ Make it small enough to keep in main memory
▪ Make it so small that you can keep some postings lists in
main memory too
▪ Postings file(s)
▪ Reduce disk space needed
▪ Decrease time needed to read postings lists from disk
▪ Large search engines keep a significant part of the postings
in memory.
▪ Compression lets you keep more in memory
▪ We will devise various IR-specific compression schemes
Ch. 5
RCV1: Our collection for this lecture
▪ Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
▪ The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
▪ As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
▪ This is one year of Reuters newswire (part of 1996 and 1997).
Reuters RCV1 statistics
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per doc                            200
M        terms (= word types)                             ~400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
Sec. 5.1
Index parameters vs. what we index
(details IIR Table 5.1, p.80)
                 word types (terms)         non-positional postings     positional postings
                 (dictionary)               (non-positional index)      (positional index)
                 size (K)   ∆%   cumul %    size (K)   ∆%   cumul %     size (K)   ∆%   cumul %
Unfiltered       484                        109,971                     197,879
No numbers       474       -2    -2         100,680    -8   -8          179,158    -9    -9
Case folding     392       -17   -19         96,969    -3   -12         179,158     0    -9
30 stopwords     391       -0    -19         83,390    -14  -24         121,858    -31   -38
150 stopwords    391       -0    -19         67,002    -30  -39          94,517    -47   -52
stemming         322       -17   -33         63,812    -4   -42          94,517     0    -52
Exercise: give intuitions for all the ‘0’ entries. Why do some
zero entries correspond to big deltas in other columns?
Sec. 5.1
Lossless vs. lossy compression
▪ Lossless compression: All information is preserved.
▪ What we mostly do in IR.
▪ Lossy compression: Discard some information
▪ Several of the preprocessing steps can be viewed as
lossy compression: case folding, stop words,
stemming, number elimination.
▪ Chap/Lecture 7: Prune postings entries that are
unlikely to turn up in the top k list for any query.
▪ Almost no loss of quality for the top k list.
Sec. 5.1
Vocabulary vs. collection size
▪ How big is the term vocabulary?
▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
▪ Not really: at least 70^20 ≈ 10^37 different words of length 20
▪ In practice, the vocabulary will keep growing with
the collection size
▪ Especially with Unicode ☺
Sec. 5.1
Vocabulary vs. collection size
▪ Heaps’ law: M = kT^b
▪ M is the size of the vocabulary, T is the number of
tokens in the collection
▪ Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
▪ In a log-log plot of vocabulary size M vs. T, Heaps’
law predicts a line with slope about ½
▪ It is the simplest possible relationship between the two in
log-log space
▪ An empirical finding (“empirical law”)
Sec. 5.1
Heaps’ Law
For RCV1, the dashed line log10 M = 0.49·log10 T + 1.64 is the best least-squares fit.
Thus M = 10^1.64 · T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
Fig 5.1 p81
Sec. 5.1
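A quick way to sanity-check the fit above is to plug the quoted constants straight into Heaps’ law; a minimal sketch in Python (the function name is ours, nothing library-specific):

```python
# Heaps' law: M = k * T**b, with the RCV1 fit k = 10**1.64 (about 44) and b = 0.49.
def heaps_vocabulary_size(num_tokens: float, k: float = 10**1.64, b: float = 0.49) -> float:
    """Predicted number of distinct terms M for a collection with num_tokens tokens."""
    return k * num_tokens ** b

# Prints roughly 38,000 for the first 1,000,020 tokens; the 38,323 quoted above
# comes from the unrounded regression coefficients.
print(round(heaps_vocabulary_size(1_000_020)))
```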
Exercises
▪ What is the effect of including spelling errors, vs.
automatically correcting spelling errors on Heaps’
law?
▪ Compute the vocabulary size M for this scenario:
▪ Looking at a collection of web pages, you find that there
are 3000 different terms in the first 10,000 tokens and
30,000 different terms in the first 1,000,000 tokens.
▪ Assume a search engine indexes a total of 20,000,000,000
(2 × 1010) pages, containing 200 tokens on average
▪ What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law? (One way to set up the computation is sketched below.)
Sec. 5.1
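A sketch of one way to set up the vocabulary-size exercise above (plain Python; the variable names are ours): the two observations pin down k and b, which then give the prediction for the full collection.

```python
import math

# Two observations from the crawl: (tokens seen, distinct terms seen).
t1, m1 = 10_000, 3_000
t2, m2 = 1_000_000, 30_000

# Heaps' law M = k * T**b gives two equations; solve for b and k in log space.
b = math.log(m2 / m1) / math.log(t2 / t1)   # log(10) / log(100) = 0.5
k = m1 / t1 ** b                            # 3000 / 100 = 30

# Whole indexed collection: 2e10 pages * 200 tokens/page.
T = 20_000_000_000 * 200
print(b, k, k * T ** b)                     # 0.5  30.0  60000000.0  (6 x 10**7 terms)
```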
Zipf’s law
▪ Heaps’ law gives the vocabulary size in collections.
▪ We also study the relative frequencies of terms.
▪ In natural language, there are a few very frequent
terms and very many very rare terms.
▪ Zipf’s law: The ith most frequent term has frequency
proportional to 1/i .
▪ cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant
▪ cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.
Sec. 5.1
Zipf consequences
▪ If the most frequent term (the) occurs cf_1 times
▪ then the second most frequent term (of) occurs cf_1/2 times
▪ the third most frequent term (and) occurs cf_1/3 times …
▪ Equivalent: cf_i = K/i where K is a normalizing factor, so
▪ log cf_i = log K – log i
▪ Linear relationship between log cf_i and log i
▪ Another power law relationship
Sec. 5.1
Zipf’s law for Reuters RCV1
Sec. 5.1
Compression
▪ Now, we will consider compressing the space
for the dictionary and postings
▪ Basic Boolean index only
▪ No study of positional indexes, etc.
▪ We will consider compression schemes
Ch. 5
DICTIONARY COMPRESSION
Sec. 5.2
Why compress the dictionary?
▪ Search begins with the dictionary
▪ We want to keep it in memory
▪ Memory footprint competition with other
applications
▪ Embedded/mobile devices may have very little
memory
▪ Even if the dictionary isn’t in memory, we want it to
be small for a fast search startup time
▪ So, compressing the dictionary is important
Sec. 5.2
Dictionary storage – first cut
▪ Array of fixed-width entries
▪ ~400,000 terms; 28 bytes/term = 11.2 MB.
[Table: each fixed-width entry stores the term (20 bytes), its document frequency (4 bytes), and a pointer to its postings list (4 bytes); a dictionary search structure sits on top.]
Sec. 5.2
Fixed-width terms are wasteful
▪ Most of the bytes in the Term column are wasted –
we allot 20 bytes for 1 letter terms.
▪ And we still can’t handle supercalifragilisticexpialidocious or
hydrochlorofluorocarbons.
▪ Written English averages ~4.5 characters/word.
▪ Exercise: Why is/isn’t this the number to use for
estimating the dictionary size?
▪ Ave. dictionary word in English: ~8 characters
▪ How do we use ~8 characters per dictionary term?
▪ Short words dominate token counts but not type
average.
Sec. 5.2
Compressing the term list:
Dictionary-as-a-String
■ Store dictionary as a (long) string of characters:
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
■ Pointer to next word shows end of current word
■ Hope to save up to 60% of dictionary space.
Total string length = 400K × 8 B = 3.2 MB.
Pointers resolve 3.2M positions: log2 3.2M ≈ 22 bits = 3 bytes per pointer.
Sec. 5.2
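A minimal sketch of the dictionary-as-a-string idea in Python (illustrative helper names, not from any particular system): the sorted terms are concatenated into one long string, each entry keeps only a start offset, and a term ends where the next entry begins.

```python
def build_dictionary(sorted_terms):
    """Concatenate sorted terms into one string; keep a start offset per term."""
    offsets, pieces, pos = [], [], 0
    for term in sorted_terms:
        offsets.append(pos)
        pieces.append(term)
        pos += len(term)
    return "".join(pieces), offsets

def term_at(big_string, offsets, i):
    """Recover the i-th term: it ends where the (i+1)-th term begins."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
    return big_string[offsets[i]:end]

def lookup(big_string, offsets, query):
    """Binary search over the (implicitly sorted) terms; return index or -1."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(big_string, offsets, mid)
        if t == query:
            return mid
        if t < query:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

s, offs = build_dictionary(["syzygetic", "syzygial", "syzygy", "szaibelyite"])
print(lookup(s, offs, "syzygy"))   # 2
```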
Space for dictionary as a string
▪ 4 bytes per term for Freq.
▪ 4 bytes per term for pointer to Postings.
▪ 3 bytes per term pointer
▪ Avg. 8 bytes per term in term string
▪ 400K terms × 19 bytes ➔ 7.6 MB (against 11.2 MB for fixed width)
Now avg. 19 bytes/term, not 28.
Sec. 5.2
Blocking
▪ Store pointers to every kth term string.
▪ Example below: k=4.
▪ Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Save 9 bytes on 3 pointers. Lose 4 bytes on term lengths.
Sec. 5.2
Net
▪ Example for block size k = 4
▪ Where we used 3 bytes/pointer without blocking (3 × 4 = 12 bytes for a block of 4 terms), we now use 3 + 4 = 7 bytes.
Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.
Why not go with larger k?
Sec. 5.2
Exercise
▪ Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16. (One way to set up the computation is sketched below.)
Sec. 5.2
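A sketch of the space accounting from the previous slides, applied to the block sizes in the exercise; it assumes the same per-term costs used above (4-byte frequency, 4-byte postings pointer, 3-byte term pointer, 8 bytes of string, and 1 length byte per term under blocking).

```python
NUM_TERMS = 400_000
FREQ, POSTINGS_PTR, TERM_PTR, AVG_TERM_LEN, LEN_BYTE = 4, 4, 3, 8, 1

def dictionary_size_mb(k: int) -> float:
    """Bytes/term: freq + postings ptr + string, plus one term pointer per block
    of k terms and one length byte per term (k = 1 means no blocking, no length bytes)."""
    per_term = FREQ + POSTINGS_PTR + AVG_TERM_LEN
    per_term += TERM_PTR if k == 1 else TERM_PTR / k + LEN_BYTE
    return NUM_TERMS * per_term / 1e6

for k in (1, 4, 8, 16):
    print(k, round(dictionary_size_mb(k), 2))
# k=1 -> 7.6 MB, k=4 -> 7.1 MB, k=8 -> ~6.95 MB, k=16 -> ~6.88 MB
```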
Dictionary search without blocking
▪ Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6
Sec. 5.2
Exercise: what if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
Dictionary search with blocking
▪ Binary search down to 4-term block;
▪ Then linear search through terms in block.
▪ Blocks of 4 (binary tree), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares
Sec. 5.2
Exercise
▪ Estimate the impact on search performance (and
slowdown compared to k=1) with blocking, for block
sizes of k = 4, 8 and 16.
Sec. 5.2
Front coding
▪ Front-coding:
▪ Sorted words commonly have long common prefix – store
differences only
▪ (for last k-1 in a block of k)
One block of the sorted term list: 8automata8automate9automatic10automation
With front coding:                  8automat*a1♢e2♢ic3♢ion
(“automat*” encodes the shared prefix automat; each ♢ entry stores only the extra length and characters beyond automat.)
Begins to resemble general string compression.
Sec. 5.2
Front Encoding [Witten, Moffat, Bell]
▪ Sorted words commonly have long common prefix
▪ store differences only
▪ Complete front encoding
▪ (prefix-len, suffix-len, suffix)
▪ Partial 3-in-4 front encoding
▪ No encoding/compression for the first string in a block
▪ Enables binary search
Assume the previous string is “auto”.

String           Complete Front Encoding   Partial 3-in-4 Front Encoding
8, automata      4, 4, mata                , 8, automata
8, automate      7, 1, e                   7, 1, e
9, automatic     7, 2, ic                  7, 2, ic
10, automation   8, 2, on                  8, , on
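A sketch of complete front coding for one block of the sorted term list (illustrative Python, storing each entry as a (shared-prefix length, suffix length, suffix) tuple rather than packing bytes):

```python
def front_code_block(terms):
    """Complete front coding: the first term of the block is stored in full,
    later terms store (shared-prefix length, suffix length, suffix)."""
    encoded, prev = [], ""
    for term in terms:
        common = 0
        while common < min(len(prev), len(term)) and prev[common] == term[common]:
            common += 1
        encoded.append((common, len(term) - common, term[common:]))
        prev = term
    return encoded

def front_decode_block(encoded):
    terms, prev = [], ""
    for prefix_len, _suffix_len, suffix in encoded:
        term = prev[:prefix_len] + suffix
        terms.append(term)
        prev = term
    return terms

block = ["automata", "automate", "automatic", "automation"]
print(front_code_block(block))
# [(0, 8, 'automata'), (7, 1, 'e'), (7, 2, 'ic'), (8, 2, 'on')]
print(front_decode_block(front_code_block(block)) == block)   # True
```

Leaving the first entry of each block uncompressed, as in the partial 3-in-4 variant above, is what keeps blocks independently decodable and enables binary search.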
RCV1 dictionary compression summary
Technique Size in MB
Fixed width 11.2
Dictionary-as-String with pointers to every term 7.6
Also, blocking k = 4 7.1
Also, Blocking + front coding 5.9
Sec. 5.2
POSTINGS COMPRESSION
Sec. 5.3
Postings compression
▪ The postings file is much larger than the dictionary,
factor of at least 10.
▪ Key desideratum: store each posting compactly.
▪ A posting for our purposes is a docID.
▪ For Reuters (800,000 documents), we would use 32
bits per docID when using 4-byte integers.
▪ Alternatively, we can use log2 800,000 ≈ 20 bits per
docID.
▪ Our goal: use a lot less than 20 bits per docID.
Sec. 5.3
Postings: two conflicting forces
▪ A term like arachnocentric occurs in maybe one doc
out of a million – we would like to store this posting
using log2 1M ~ 20 bits.
▪ A term like the occurs in virtually every doc, so 20
bits/posting is too expensive.
▪ Prefer 0/1 bitmap vector in this case
Sec. 5.3
Postings file entry
▪ We store the list of docs containing a term in
increasing order of docID.
▪ computer: 33,47,154,159,202 …
▪ Consequence: it suffices to store gaps.
▪ 33,14,107,5,43 …
▪ Hope: most gaps can be encoded/stored with far
fewer than 20 bits.
Sec. 5.3
Three postings entries
Sec. 5.3
Variable length encoding
▪ Aim:
▪ For arachnocentric, we will use ~20 bits/gap entry.
▪ For the, we will use ~1 bit/gap entry.
▪ If the average gap for a term is G, we want to use
~log2G bits/gap entry.
▪ Key challenge: encode every integer (gap) with about
as few bits as needed for that integer.
▪ This requires a variable length encoding
▪ Variable length codes achieve this by using short
codes for small numbers
Sec. 5.3
Variable Byte (VB) codes
▪ For a gap value G, we want to use close to the fewest
bytes needed to hold log2 G bits
▪ Begin with one byte to store G and dedicate 1 bit in
it to be a continuation bit c
▪ If G ≤127, binary-encode it in the 7 available bits and
set c =1
▪ Else encode G’s lower-order 7 bits and then use
additional bytes to encode the higher order bits
using the same algorithm
▪ At the end set the continuation bit of the last byte to
1 (c =1) – and for the other bytes c = 0.
Sec. 5.3
Example
docIDs    824                  829        215406
gaps                           5          214577
VB code   00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation
000001101011100010000101000011010000110010110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
Sec. 5.3
Hex(824) = 0x0338;  Hex(214577) = 0x00034631
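A sketch of the VB scheme as described on the previous slide, using the convention here that the continuation bit is set to 1 only on the last byte of each value; it reproduces the worked example above.

```python
def vb_encode_number(n: int) -> bytes:
    """Split n into 7-bit groups; set the high bit (c = 1) on the last byte only."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # continuation bit on the final byte
    return bytes(out)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data: bytes):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

# First docID stored as-is, then gaps, as in the example above.
encoded = vb_encode([824, 5, 214577])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(encoded))   # [824, 5, 214577]
```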
Other variable unit codes
▪ Instead of bytes, we can also use a different “unit of
alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
▪ Variable byte alignment wastes space if you have
many small gaps – nibbles do better in such cases.
▪ Variable byte codes:
▪ Used by many commercial/research systems
▪ Good low-tech blend of variable-length coding and
sensitivity to computer memory alignment matches (vs.
bit-level codes, which we look at next).
▪ There is also recent work on word-aligned codes that
pack a variable number of gaps into one word (e.g.,
simple9)
Sec. 5.3
Simple9
▪ Encodes as many gaps as possible in one DWORD
▪ 4 bit selector + 28 bit data bits
▪ Encodes 9 possible ways to “use” the data bits
Sec. 5.3
Selector   # of gaps encoded   Len of each gap encoded   Wasted bits
0000       28                  1                         0
0001       14                  2                         0
0010        9                  3                         1
0011        7                  4                         0
0100        5                  5                         3
0101        4                  7                         0
0110        3                  9                         1
0111        2                  14                        0
1000        1                  28                        0
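A sketch of packing one 32-bit word under Simple9 following the table above (selector in the top 4 bits, gaps in the low 28 bits); illustrative code, not taken from a specific system.

```python
# (count of gaps, bits per gap) for selectors 0..8, as in the table above.
SIMPLE9_LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_pack_one(gaps, start):
    """Pack as many gaps as possible, starting at `start`, into one 32-bit word.
    Returns (word, number of gaps consumed). Assumes 1 <= gap < 2**28."""
    for selector, (count, bits) in enumerate(SIMPLE9_LAYOUTS):
        chunk = gaps[start:start + count]
        if len(chunk) == count and all(g < (1 << bits) for g in chunk):
            word = selector << 28
            for i, g in enumerate(chunk):
                word |= g << (i * bits)
            return word, count
    raise ValueError("gap does not fit in 28 bits")

def simple9_unpack_one(word):
    count, bits = SIMPLE9_LAYOUTS[word >> 28]
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

gaps = [3, 1, 7, 2, 2, 5, 1, 1, 1]          # nine gaps, each < 8, so selector 0010 fits
word, used = simple9_pack_one(gaps, 0)
print(used, simple9_unpack_one(word))        # 9 [3, 1, 7, 2, 2, 5, 1, 1, 1]
```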
Unary code
▪ Represent n as n 1s with a final 0.
▪ Unary code for 3 is 1110.
▪ Unary code for 40 is 11111111111111111111111111111111111111110 (forty 1s followed by a 0).
▪ Unary code for 80 is 111111111111111111111111111111111111111111111111111111111111111111111111111111110 (eighty 1s followed by a 0).
▪ This doesn’t look promising, but….
Bit-Aligned Codes
▪ Breaks between encoded numbers can occur after
any bit position
▪ Unary code
▪ Encode k by k 1s followed by 0
▪ 0 at end makes code unambiguous
Unary and Binary Codes
▪ Unary is very efficient for small numbers such as 0
and 1, but quickly becomes very expensive
▪ 1023 can be represented in 10 binary bits, but requires
1024 bits in unary
▪ Binary is more efficient for large numbers, but it may
be ambiguous
Elias-γ Code
▪ To encode a number k > 0, compute k_d = ⌊log2 k⌋ and k_r = k − 2^⌊log2 k⌋
▪ k_d is (one less than) the number of binary digits of k, encoded in unary
▪ k_r is encoded in binary, using k_d bits
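A sketch of the γ-code as just defined, returning a bit string for readability (k_d in unary, then k_r in k_d binary bits):

```python
def unary(n: int) -> str:
    return "1" * n + "0"

def elias_gamma(k: int) -> str:
    """k > 0: k_d = floor(log2 k) in unary, then k_r = k - 2**k_d in k_d binary bits."""
    kd = k.bit_length() - 1              # floor(log2 k)
    kr = k - (1 << kd)
    return unary(kd) + (format(kr, f"0{kd}b") if kd else "")

for k in (1, 2, 6, 15, 1023):
    print(k, elias_gamma(k))
# 1 -> 0, 2 -> 100, 6 -> 11010, 15 -> 1110111, 1023 -> 19 bits (nine 1s, a 0, then nine 1s)
```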
Elias-δ Code
▪ Elias-γ code uses no more bits than unary, many fewer for k > 2
▪ 1023 takes 19 bits instead of 1024 bits using unary
▪ In general, it takes 2⌊log2 k⌋ + 1 bits
▪ To improve the coding of large numbers, use the Elias-δ code
▪ Instead of encoding k_d in unary, we encode k_d + 1 using Elias-γ
▪ Takes approximately 2 log2 log2 k + log2 k bits
Elias-δ Code
▪ Split (k_d + 1) into k_dd = ⌊log2(k_d + 1)⌋ and k_dr = (k_d + 1) − 2^k_dd
▪ Encode k_dd in unary, k_dr in binary (k_dd bits), and k_r in binary (k_d bits)
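A sketch of the δ-code built directly on the γ-code from the previous slide (self-contained, with its own small γ helper):

```python
def gamma(k: int) -> str:
    """Elias-γ for k > 0 (previous slide)."""
    kd = k.bit_length() - 1
    return "1" * kd + "0" + (format(k - (1 << kd), f"0{kd}b") if kd else "")

def elias_delta(k: int) -> str:
    """Elias-δ for k > 0: γ-code of (k_d + 1), then k_r in k_d binary bits."""
    kd = k.bit_length() - 1
    kr = k - (1 << kd)
    return gamma(kd + 1) + (format(kr, f"0{kd}b") if kd else "")

for k in (1, 2, 9, 1023):
    print(k, elias_delta(k))
# 1 -> 0, 2 -> 1000, 9 -> 11000001, 1023 -> 1110010111111111 (16 bits vs. 19 for γ)
```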
Gamma code properties
▪ G is encoded using 2⌊log2 G⌋ + 1 bits
▪ Length of offset is ⌊log2 G⌋ bits
▪ Length of length is ⌊log2 G⌋ + 1 bits
▪ All gamma codes have an odd number of bits
▪ Almost within a factor of 2 of the best possible, log2 G
▪ Gamma code is uniquely prefix-decodable, like VB
▪ Gamma code can be used for any distribution
▪ Gamma code is parameter-free
Sec. 5.3
Gamma seldom used in practice
▪ Machines have word boundaries – 8, 16, 32, 64 bits
▪ Operations that cross word boundaries are slower
▪ Compressing and manipulating at the granularity of
bits can be slow
▪ Variable byte encoding is aligned and thus
potentially more efficient
▪ Regardless of efficiency, variable byte is conceptually
simpler at little additional space cost
Sec. 5.3
Shannon Limit
▪ Is it possible to derive codes that are optimal (under
certain assumptions)?
▪ What is the optimal average code length for a code
that encodes each integer (gap lengths)
independently?
▪ Lower bounds on average code length: Shannon
entropy
▪ H(X) = − Σ_{x=1..n} Pr[X=x] · log Pr[X=x]
▪ Asymptotically optimal codes (finite alphabets):
arithmetic coding, Huffman codes
Sec. 5.3
Global Bernoulli Model
▪ Assumption: term occurrences are Bernoulli events
▪ Notation:
▪ n: # of documents, m: # of terms in the vocabulary
▪ N: total # of (unique) term–document occurrences (i.e., postings)
▪ Probability of a term t_j occurring in document d_i: p = N/(nm)
▪ Each term–document occurrence is an independent event
▪ Probability of a gap of length x is given by the geometric distribution: Pr[gap = x] = (1 − p)^(x−1) · p
Sec. 5.3
How to design an optimal code for the geometric distribution?
Golomb Code
▪ Golomb Code (Golomb 1966): a highly efficient way to design an optimal Huffman-style code for the geometric distribution
▪ Parameter b
▪ For a given x ≥ 1, compute the integer quotient q = ⌊(x − 1)/b⌋ and remainder r = (x − 1) − q·b
▪ Assume b = 2^k
▪ Encode q in unary, followed by r coded in binary
▪ A bit more complicated if b ≠ 2^k; see Wikipedia.
▪ First part: q + 1 bits
▪ Second part: log2 b bits
Sec. 5.3
It can also be deemed as a generalization of the unary code.
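A sketch of the b = 2^k (Rice) special case described above: q = ⌊(x − 1)/b⌋ in unary, then the k-bit remainder in binary (illustrative code, not from a specific library).

```python
def rice_encode(x: int, k: int) -> str:
    """Rice code for x >= 1 with parameter b = 2**k."""
    b = 1 << k
    q, r = (x - 1) // b, (x - 1) % b
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")

def rice_decode(bits: str, k: int) -> int:
    q = bits.index("0")                          # length of the unary part
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return q * (1 << k) + r + 1

for x in (1, 5, 10):
    code = rice_encode(x, 2)                     # b = 4
    print(x, code, rice_decode(code, 2))         # 1 -> 000, 5 -> 1000, 10 -> 11001
```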
Golomb Code & Rice Code
▪ How to determine the optimal b*?
▪ Select the minimal b such that (1 − p)^b + (1 − p)^(b+1) ≤ 1
▪ Result due to Gallager and Van Voorhis 1975: this choice generates an optimal prefix code for the geometric distribution
▪ Small-p approximation: b* ≈ ln 2 / p ≈ 0.69/p
▪ Rice code: only allow b = 2^k
Sec. 5.3
Local Bernoulli Model
▪ If the lengths of the postings lists are known, then a Bernoulli model for each individual inverted list can be used
▪ Frequent words are coded with a smaller b, infrequent words with a larger b
▪ The term frequency needs to be encoded as well (use the γ-code)
▪ The local Bernoulli model outperforms the global Bernoulli model in practice (it is the method of choice in practice!)
Sec. 5.3
RCV1 compression
Data structure Size in MB
dictionary, fixed-width 11.2
dictionary, term pointers into string 7.6
with blocking, k = 4 7.1
with blocking & front coding 5.9
collection (text, xml markup etc) 3,600.0
collection (text) 960.0
Term-doc incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, γ−encoded 101.0
Sec. 5.3
Google’s Indexing Choice
▪ Index shards partitioned by doc, with multiple replicas
▪ Disk-resident index
▪ Use the outer parts of the disk
▪ Use different compression methods for different fields: Rice coding (a special kind of Golomb code) for gaps, and Gamma for positions.
▪ In-memory index
▪ All positions; no docIDs
▪ Keep track of document boundaries
▪ Group Varint encoding
▪ Fast to decode
Sec. 5.3
Source: Jeff Dean’s WSDM 2009 Keynote
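A sketch of the group-varint idea referred to above, as described in the keynote: four values share one tag byte whose 2-bit fields give each value's byte length, so the decoder avoids the per-byte branching of VB. Illustrative code only, not Google's implementation.

```python
def group_varint_encode(values):
    """Encode values in groups of 4: a tag byte (2 bits per value = byte length - 1),
    then each value in little-endian using that many bytes. Pads the last group with 0s."""
    out = bytearray()
    for i in range(0, len(values), 4):
        group = (values[i:i + 4] + [0, 0, 0])[:4]
        lengths = [max(1, (v.bit_length() + 7) // 8) for v in group]
        tag = 0
        for j, length in enumerate(lengths):
            tag |= (length - 1) << (2 * j)
        out.append(tag)
        for v, length in zip(group, lengths):
            out += v.to_bytes(length, "little")
    return bytes(out)

def group_varint_decode(data, count):
    values, pos = [], 0
    while len(values) < count:
        tag = data[pos]; pos += 1
        for j in range(4):
            length = ((tag >> (2 * j)) & 3) + 1
            values.append(int.from_bytes(data[pos:pos + length], "little"))
            pos += length
    return values[:count]

docid_gaps = [5, 300, 70000, 2, 9]
enc = group_varint_encode(docid_gaps)
print(len(enc), group_varint_decode(enc, len(docid_gaps)))   # 13 [5, 300, 70000, 2, 9]
```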
Other details
▪ Gap = docid_n − docid_{n−1} − 1
▪ Freq = freq − 1
▪ Pos_Gap = pos_n − pos_{n−1} − 1
▪ C.f., Jiangong Zhang, Xiaohui Long and Torsten Suel:
Performance of Compressed Inverted List Caching in
Search Engines. WWW 2008.
Sec. 5.3
Index compression summary
▪ We can now create an index for highly efficient
Boolean retrieval that is very space efficient
▪ Only 4% of the total size of the collection
▪ Only 10-15% of the total size of the text in the
collection
▪ However, we’ve ignored positional information
▪ Hence, space savings are less for indexes used in
practice
▪ But techniques substantially the same.
Sec. 5.3
Resources for today’s lecture
▪ IIR 5
▪ MG 3.3, 3.4.
▪ F. Scholer, H.E. Williams and J. Zobel. 2002.
Compression of Inverted Indexes For Fast Query
Evaluation. Proc. ACM-SIGIR 2002.
▪ Variable byte codes
▪ V. N. Anh and A. Moffat. 2005. Inverted Index
Compression Using Word-Aligned Binary Codes.
Information Retrieval 8: 151–166.
▪ Word aligned codes
Ch. 5