COMP6714: Information Retrieval & Web Search
Introduction to Information Retrieval
Lecture 5: Index Compression
Inverted index
▪ For each term t, we must store a list of all documents that contain t.
▪ Identify each doc by a docID, a document serial number.
▪ Can we use fixed-size arrays for this?
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132
Calpurnia → 2  31  54  101
What happens if the word Caesar
is added to document 14?
Inverted index
▪ We need variable-size postings lists
▪ On disk, a continuous run of postings is normal and
best
▪ In memory, can use linked lists or variable length
arrays
▪ Some tradeoffs in size/ease of insertion
Dictionary                Postings
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5  6  16  57  132
Calpurnia → 2  31  54  101
Each entry in a postings list is called a posting. Sorted by docID (more later on why).
Inverted index construction
Documents to be indexed:  Friends, Romans, countrymen.
      ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
      ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
      ↓ Indexer
Inverted index:           friend     → 2  4
                          roman      → 1  2
                          countryman → 13  16
Where do we pay in storage?
[Figure: the dictionary holds the terms and their counts, the postings file holds the lists of docIDs, and pointers link each dictionary entry to its postings list.]
Today
▪ Why compression?
▪ Collection statistics in more detail (with RCV1)
▪ How big will the dictionary and postings be?
▪ Dictionary compression
▪ Postings compression
Ch. 5
Why compression (in general)?
▪ Use less disk space
▪ Saves a little money
▪ Keep more stuff in memory
▪ Increases speed
▪ Increase speed of data transfer from disk to memory
▪ [read compressed data | decompress] is faster than
[read uncompressed data]
▪ Premise: Decompression algorithms are fast
▪ True of the decompression algorithms we use
Ch. 5
Why compression for inverted indexes?
▪ Dictionary
▪ Make it small enough to keep in main memory
▪ Make it so small that you can keep some postings lists in
main memory too
▪ Postings file(s)
▪ Reduce disk space needed
▪ Decrease time needed to read postings lists from disk
▪ Large search engines keep a significant part of the postings
in memory.
▪ Compression lets you keep more in memory
▪ We will devise various IR-specific compression schemes
Ch. 5
RCV1: Our collection for this lecture
▪ Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
▪ The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
▪ As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
▪ This is one year of Reuters newswire (part of 1996 and 1997).
Reuters RCV1 statistics
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per doc                            200
M        terms (= word types)                             ~400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
Sec. 5.1
Index parameters vs. what we index
(details IIR Table 5.1, p.80)
                 word types (terms)         non-positional postings     positional postings
                 (dictionary)               (non-positional index)      (positional index)
                 size (K)   ∆%   cumul %    size (K)   ∆%   cumul %     size (K)   ∆%   cumul %
Unfiltered       484                        109,971                     197,879
No numbers       474       -2    -2         100,680    -8   -8          179,158    -9    -9
Case folding     392       -17   -19         96,969    -3   -12         179,158     0    -9
30 stopwords     391       -0    -19         83,390    -14  -24         121,858    -31   -38
150 stopwords    391       -0    -19         67,002    -30  -39          94,517    -47   -52
stemming         322       -17   -33         63,812    -4   -42          94,517     0    -52
Exercise: give intuitions for all the ‘0’ entries. Why do some
zero entries correspond to big deltas in other columns?
Sec. 5.1
Lossless vs. lossy compression
▪ Lossless compression: All information is preserved.
▪ What we mostly do in IR.
▪ Lossy compression: Discard some information
▪ Several of the preprocessing steps can be viewed as
lossy compression: case folding, stop words,
stemming, number elimination.
▪ Chap/Lecture 7: Prune postings entries that are
unlikely to turn up in the top k list for any query.
▪ Almost no loss of quality for the top k list.
Sec. 5.1
Vocabulary vs. collection size
▪ How big is the term vocabulary?
▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
▪ Not really: at least 70^20 ≈ 10^37 different words of length 20
▪ In practice, the vocabulary will keep growing with
the collection size
▪ Especially with Unicode ☺
Sec. 5.1
Vocabulary vs. collection size
▪ Heaps’ law: M = kT^b
▪ M is the size of the vocabulary, T is the number of
tokens in the collection
▪ Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
▪ In a log-log plot of vocabulary size M vs. T, Heaps’
law predicts a line with slope about ½
▪ It is the simplest possible relationship between the two in
log-log space
▪ An empirical finding (“empirical law”)
Sec. 5.1
Heaps’ Law
For RCV1, the dashed line log10 M = 0.49·log10 T + 1.64 is the best least-squares fit.
Thus M = 10^1.64 · T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
Fig 5.1 p81
Sec. 5.1
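A quick way to sanity-check the fit above is to plug the quoted constants straight into Heaps’ law; a minimal sketch in Python (the function name is ours, nothing library-specific):

```python
# Heaps' law: M = k * T**b, with the RCV1 fit k = 10**1.64 (about 44) and b = 0.49.
def heaps_vocabulary_size(num_tokens: float, k: float = 10**1.64, b: float = 0.49) -> float:
    """Predicted number of distinct terms M for a collection with num_tokens tokens."""
    return k * num_tokens ** b

# Prints roughly 38,000 for the first 1,000,020 tokens; the 38,323 quoted above
# comes from the unrounded regression coefficients.
print(round(heaps_vocabulary_size(1_000_020)))
```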
Exercises
▪ What is the effect of including spelling errors, vs.
automatically correcting spelling errors on Heaps’
law?
▪ Compute the vocabulary size M for this scenario:
▪ Looking at a collection of web pages, you find that there
are 3000 different terms in the first 10,000 tokens and
30,000 different terms in the first 1,000,000 tokens.
▪ Assume a search engine indexes a total of 20,000,000,000
(2 × 1010) pages, containing 200 tokens on average
▪ What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law? (One way to set up the computation is sketched below.)
Sec. 5.1
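A sketch of one way to set up the vocabulary-size exercise above (plain Python; the variable names are ours): the two observations pin down k and b, which then give the prediction for the full collection.

```python
import math

# Two observations from the crawl: (tokens seen, distinct terms seen).
t1, m1 = 10_000, 3_000
t2, m2 = 1_000_000, 30_000

# Heaps' law M = k * T**b gives two equations; solve for b and k in log space.
b = math.log(m2 / m1) / math.log(t2 / t1)   # log(10) / log(100) = 0.5
k = m1 / t1 ** b                            # 3000 / 100 = 30

# Whole indexed collection: 2e10 pages * 200 tokens/page.
T = 20_000_000_000 * 200
print(b, k, k * T ** b)                     # 0.5  30.0  60000000.0  (6 x 10**7 terms)
```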
Zipf’s law
▪ Heaps’ law gives the vocabulary size in collections.
▪ We also study the relative frequencies of terms.
▪ In natural language, there are a few very frequent
terms and very many very rare terms.
▪ Zipf’s law: The ith most frequent term has frequency
proportional to 1/i .
▪ cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant
▪ cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.
Sec. 5.1
Zipf consequences
▪ If the most frequent term (the) occurs cf_1 times
▪ then the second most frequent term (of) occurs cf_1/2 times
▪ the third most frequent term (and) occurs cf_1/3 times …
▪ Equivalent: cf_i = K/i where K is a normalizing factor, so
▪ log cf_i = log K – log i
▪ Linear relationship between log cf_i and log i
▪ Another power law relationship
Sec. 5.1
Zipf’s law for Reuters RCV1
Sec. 5.1
Compression
▪ Now, we will consider compressing the space
for the dictionary and postings
▪ Basic Boolean index only
▪ No study of positional indexes, etc.
▪ We will consider compression schemes
Ch. 5
DICTIONARY COMPRESSION
Sec. 5.2
Why compress the dictionary?
▪ Search begins with the dictionary
▪ We want to keep it in memory
▪ Memory footprint competition with other
applications
▪ Embedded/mobile devices may have very little
memory
▪ Even if the dictionary isn’t in memory, we want it to
be small for a fast search startup time
▪ So, compressing the dictionary is important
Sec. 5.2
Dictionary storage – first cut
▪ Array of fixed-width entries
▪ ~400,000 terms; 28 bytes/term = 11.2 MB.
[Table: each fixed-width entry stores the term (20 bytes), its document frequency (4 bytes), and a pointer to its postings list (4 bytes); a dictionary search structure sits on top.]
Sec. 5.2
Fixed-width terms are wasteful
▪ Most of the bytes in the Term column are wasted –
we allot 20 bytes for 1 letter terms.
▪ And we still can’t handle supercalifragilisticexpialidocious or
hydrochlorofluorocarbons.
▪ Written English averages ~4.5 characters/word.
▪ Exercise: Why is/isn’t this the number to use for
estimating the dictionary size?
▪ Ave. dictionary word in English: ~8 characters
▪ How do we use ~8 characters per dictionary term?
▪ Short words dominate token counts but not type
average.
Sec. 5.2
Compressing the term list:
Dictionary-as-a-String
■ Store dictionary as a (long) string of characters:
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
■ Pointer to next word shows end of current word
■ Hope to save up to 60% of dictionary space.
Total string length = 400K × 8 B = 3.2 MB.
Pointers resolve 3.2M positions: log2 3.2M ≈ 22 bits = 3 bytes per pointer.
Sec. 5.2
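A minimal sketch of the dictionary-as-a-string idea in Python (illustrative helper names, not from any particular system): the sorted terms are concatenated into one long string, each entry keeps only a start offset, and a term ends where the next entry begins.

```python
def build_dictionary(sorted_terms):
    """Concatenate sorted terms into one string; keep a start offset per term."""
    offsets, pieces, pos = [], [], 0
    for term in sorted_terms:
        offsets.append(pos)
        pieces.append(term)
        pos += len(term)
    return "".join(pieces), offsets

def term_at(big_string, offsets, i):
    """Recover the i-th term: it ends where the (i+1)-th term begins."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
    return big_string[offsets[i]:end]

def lookup(big_string, offsets, query):
    """Binary search over the (implicitly sorted) terms; return index or -1."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(big_string, offsets, mid)
        if t == query:
            return mid
        if t < query:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

s, offs = build_dictionary(["syzygetic", "syzygial", "syzygy", "szaibelyite"])
print(lookup(s, offs, "syzygy"))   # 2
```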
Space for dictionary as a string
▪ 4 bytes per term for Freq.
▪ 4 bytes per term for pointer to Postings.
▪ 3 bytes per term pointer
▪ Avg. 8 bytes per term in term string
▪ 400K terms × 19 bytes ➔ 7.6 MB (against 11.2 MB for fixed width)
Now avg. 19 bytes/term, not 28.
Sec. 5.2
Blocking
▪ Store pointers to every kth term string.
▪ Example below: k=4.
▪ Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Save 9 bytes on 3 pointers. Lose 4 bytes on term lengths.
Sec. 5.2
Net
▪ Example for block size k = 4
▪ Where we used 3 bytes/pointer without blocking (3 × 4 = 12 bytes for a block of 4 terms), we now use 3 + 4 = 7 bytes.
Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.
Why not go with larger k?
Sec. 5.2
Exercise
▪ Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16. (One way to set up the computation is sketched below.)
Sec. 5.2
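A sketch of the space accounting from the previous slides, applied to the block sizes in the exercise; it assumes the same per-term costs used above (4-byte frequency, 4-byte postings pointer, 3-byte term pointer, 8 bytes of string, and 1 length byte per term under blocking).

```python
NUM_TERMS = 400_000
FREQ, POSTINGS_PTR, TERM_PTR, AVG_TERM_LEN, LEN_BYTE = 4, 4, 3, 8, 1

def dictionary_size_mb(k: int) -> float:
    """Bytes/term: freq + postings ptr + string, plus one term pointer per block
    of k terms and one length byte per term (k = 1 means no blocking, no length bytes)."""
    per_term = FREQ + POSTINGS_PTR + AVG_TERM_LEN
    per_term += TERM_PTR if k == 1 else TERM_PTR / k + LEN_BYTE
    return NUM_TERMS * per_term / 1e6

for k in (1, 4, 8, 16):
    print(k, round(dictionary_size_mb(k), 2))
# k=1 -> 7.6 MB, k=4 -> 7.1 MB, k=8 -> ~6.95 MB, k=16 -> ~6.88 MB
```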
Dictionary search without blocking
▪ Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6
Sec. 5.2
Exercise: what if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
Dictionary search with blocking
▪ Binary search down to 4-term block;
▪ Then linear search through terms in block.
▪ Blocks of 4 (binary tree), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares
Sec. 5.2
Exercise
▪ Estimate the impact on search performance (and
slowdown compared to k=1) with blocking, for block
sizes of k = 4, 8 and 16.
Sec. 5.2
Front coding
▪ Front-coding:
▪ Sorted words commonly have long common prefix – store
differences only
▪ (for last k-1 in a block of k)
One block of the sorted term list: 8automata8automate9automatic10automation
With front coding:                  8automat*a1♢e2♢ic3♢ion
(“automat*” encodes the shared prefix automat; each ♢ entry stores only the extra length and characters beyond automat.)
Begins to resemble general string compression.
Sec. 5.2
Front Encoding [Witten, Moffat, Bell]
▪ Sorted words commonly have long common prefix
▪ store differences only
▪ Complete front encoding
▪ (prefix-len, suffix-len, suffix)
▪ Partial 3-in-4 front encoding
▪ No encoding/compression for the first string in a block
▪ Enables binary search
Assume the previous string is “auto”.

String           Complete Front Encoding   Partial 3-in-4 Front Encoding
8, automata      4, 4, mata                , 8, automata
8, automate      7, 1, e                   7, 1, e
9, automatic     7, 2, ic                  7, 2, ic
10, automation   8, 2, on                  8, , on
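A sketch of complete front coding for one block of the sorted term list (illustrative Python, storing each entry as a (shared-prefix length, suffix length, suffix) tuple rather than packing bytes):

```python
def front_code_block(terms):
    """Complete front coding: the first term of the block is stored in full,
    later terms store (shared-prefix length, suffix length, suffix)."""
    encoded, prev = [], ""
    for term in terms:
        common = 0
        while common < min(len(prev), len(term)) and prev[common] == term[common]:
            common += 1
        encoded.append((common, len(term) - common, term[common:]))
        prev = term
    return encoded

def front_decode_block(encoded):
    terms, prev = [], ""
    for prefix_len, _suffix_len, suffix in encoded:
        term = prev[:prefix_len] + suffix
        terms.append(term)
        prev = term
    return terms

block = ["automata", "automate", "automatic", "automation"]
print(front_code_block(block))
# [(0, 8, 'automata'), (7, 1, 'e'), (7, 2, 'ic'), (8, 2, 'on')]
print(front_decode_block(front_code_block(block)) == block)   # True
```

Leaving the first entry of each block uncompressed, as in the partial 3-in-4 variant above, is what keeps blocks independently decodable and enables binary search.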
RCV1 dictionary compression summary
Technique Size in MB
Fixed width 11.2
Dictionary-as-String with pointers to every term 7.6
Also, blocking k = 4 7.1
Also, Blocking + front coding 5.9
Sec. 5.2
POSTINGS COMPRESSION
Sec. 5.3
Postings compression
▪ The postings file is much larger than the dictionary,
factor of at least 10.
▪ Key desideratum: store each posting compactly.
▪ A posting for our purposes is a docID.
▪ For Reuters (800,000 documents), we would use 32
bits per docID when using 4-byte integers.
▪ Alternatively, we can use log2 800,000 ≈ 20 bits per
docID.
▪ Our goal: use a lot less than 20 bits per docID.
Sec. 5.3
Postings: two conflicting forces
▪ A term like arachnocentric occurs in maybe one doc
out of a million – we would like to store this posting
using log2 1M ~ 20 bits.
▪ A term like the occurs in virtually every doc, so 20
bits/posting is too expensive.
▪ Prefer 0/1 bitmap vector in this case
Sec. 5.3
Postings file entry
▪ We store the list of docs containing a term in
increasing order of docID.
▪ computer: 33,47,154,159,202 …
▪ Consequence: it suffices to store gaps.
▪ 33,14,107,5,43 …
▪ Hope: most gaps can be encoded/stored with far
fewer than 20 bits.
Sec. 5.3
Three postings entries
Sec. 5.3
Variable length encoding
▪ Aim:
▪ For arachnocentric, we will use ~20 bits/gap entry.
▪ For the, we will use ~1 bit/gap entry.
▪ If the average gap for a term is G, we want to use
~log2G bits/gap entry.
▪ Key challenge: encode every integer (gap) with about
as few bits as needed for that integer.
▪ This requires a variable length encoding
▪ Variable length codes achieve this by using short
codes for small numbers
Sec. 5.3
Variable Byte (VB) codes
▪ For a gap value G, we want to use close to the fewest
bytes needed to hold log2 G bits
▪ Begin with one byte to store G and dedicate 1 bit in
it to be a continuation bit c
▪ If G ≤127, binary-encode it in the 7 available bits and
set c =1
▪ Else encode G’s lower-order 7 bits and then use
additional bytes to encode the higher order bits
using the same algorithm
▪ At the end set the continuation bit of the last byte to
1 (c =1) – and for the other bytes c = 0.
Sec. 5.3
Example
docIDs    824                  829        215406
gaps                           5          214577
VB code   00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation
000001101011100010000101000011010000110010110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
Sec. 5.3
Hex(824) = 0x0338;  Hex(214577) = 0x00034631
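A sketch of the VB scheme as described on the previous slide, using the convention here that the continuation bit is set to 1 only on the last byte of each value; it reproduces the worked example above.

```python
def vb_encode_number(n: int) -> bytes:
    """Split n into 7-bit groups; set the high bit (c = 1) on the last byte only."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # continuation bit on the final byte
    return bytes(out)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data: bytes):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

# First docID stored as-is, then gaps, as in the example above.
encoded = vb_encode([824, 5, 214577])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(encoded))   # [824, 5, 214577]
```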
Other variable unit codes
▪ Instead of bytes, we can also use a different “unit of
alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
▪ Variable byte alignment wastes space if you have
many small gaps – nibbles do better in such cases.
▪ Variable byte codes:
▪ Used by many commercial/research systems
▪ Good low-tech blend of variable-length coding and
sensitivity to computer memory alignment matches (vs.
bit-level codes, which we look at next).
▪ There is also recent work on word-aligned codes that
pack a variable number of gaps into one word (e.g.,
simple9)
Sec. 5.3
Simple9
▪ Encodes as many gaps as possible in one DWORD
▪ 4 bit selector + 28 bit data bits
▪ Encodes 9 possible ways to “use” the data bits
Sec. 5.3
Selector   # of gaps encoded   Len of each gap encoded   Wasted bits
0000       28                  1                         0
0001       14                  2                         0
0010        9                  3                         1
0011        7                  4                         0
0100        5                  5                         3
0101        4                  7                         0
0110        3                  9                         1
0111        2                  14                        0
1000        1                  28                        0
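A sketch of packing one 32-bit word under Simple9 following the table above (selector in the top 4 bits, gaps in the low 28 bits); illustrative code, not taken from a specific system.

```python
# (count of gaps, bits per gap) for selectors 0..8, as in the table above.
SIMPLE9_LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_pack_one(gaps, start):
    """Pack as many gaps as possible, starting at `start`, into one 32-bit word.
    Returns (word, number of gaps consumed). Assumes 1 <= gap < 2**28."""
    for selector, (count, bits) in enumerate(SIMPLE9_LAYOUTS):
        chunk = gaps[start:start + count]
        if len(chunk) == count and all(g < (1 << bits) for g in chunk):
            word = selector << 28
            for i, g in enumerate(chunk):
                word |= g << (i * bits)
            return word, count
    raise ValueError("gap does not fit in 28 bits")

def simple9_unpack_one(word):
    count, bits = SIMPLE9_LAYOUTS[word >> 28]
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

gaps = [3, 1, 7, 2, 2, 5, 1, 1, 1]          # nine gaps, each < 8, so selector 0010 fits
word, used = simple9_pack_one(gaps, 0)
print(used, simple9_unpack_one(word))        # 9 [3, 1, 7, 2, 2, 5, 1, 1, 1]
```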
Unary code
▪ Represent n as n 1s with a final 0.
▪ Unary code for 3 is 1110.
▪ Unary code for 40 is 11111111111111111111111111111111111111110 (forty 1s followed by a 0).
▪ Unary code for 80 is 111111111111111111111111111111111111111111111111111111111111111111111111111111110 (eighty 1s followed by a 0).
▪ This doesn’t look promising, but….
Bit-Aligned Codes
▪ Breaks between encoded numbers can occur after
any bit position
▪ Unary code
▪ Encode k by k 1s followed by 0
▪ 0 at end makes code unambiguous
Unary and Binary Codes
▪ Unary is very efficient for small numbers such as 0
and 1, but quickly becomes very expensive
▪ 1023 can be represented in 10 binary bits, but requires
1024 bits in unary
▪ Binary is more efficient for large numbers, but it may
be ambiguous
Elias-γ Code
▪ To encode a number k > 0, compute k_d = ⌊log2 k⌋ and k_r = k − 2^⌊log2 k⌋
▪ k_d is (one less than) the number of binary digits of k, encoded in unary
▪ k_r is encoded in binary, using k_d bits
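A sketch of the γ-code as just defined, returning a bit string for readability (k_d in unary, then k_r in k_d binary bits):

```python
def unary(n: int) -> str:
    return "1" * n + "0"

def elias_gamma(k: int) -> str:
    """k > 0: k_d = floor(log2 k) in unary, then k_r = k - 2**k_d in k_d binary bits."""
    kd = k.bit_length() - 1              # floor(log2 k)
    kr = k - (1 << kd)
    return unary(kd) + (format(kr, f"0{kd}b") if kd else "")

for k in (1, 2, 6, 15, 1023):
    print(k, elias_gamma(k))
# 1 -> 0, 2 -> 100, 6 -> 11010, 15 -> 1110111, 1023 -> 19 bits (nine 1s, a 0, then nine 1s)
```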
Elias-δ Code
▪ Elias-γ code uses no more bits than unary, many fewer for k > 2
▪ 1023 takes 19 bits instead of 1024 bits using unary
▪ In general, it takes 2⌊log2 k⌋ + 1 bits
▪ To improve the coding of large numbers, use the Elias-δ code
▪ Instead of encoding k_d in unary, we encode k_d + 1 using Elias-γ
▪ Takes approximately 2 log2 log2 k + log2 k bits
Elias-δ Code
▪ Split (k_d + 1) into k_dd = ⌊log2(k_d + 1)⌋ and k_dr = (k_d + 1) − 2^k_dd
▪ Encode k_dd in unary, k_dr in binary (k_dd bits), and k_r in binary (k_d bits)
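A sketch of the δ-code built directly on the γ-code from the previous slide (self-contained, with its own small γ helper):

```python
def gamma(k: int) -> str:
    """Elias-γ for k > 0 (previous slide)."""
    kd = k.bit_length() - 1
    return "1" * kd + "0" + (format(k - (1 << kd), f"0{kd}b") if kd else "")

def elias_delta(k: int) -> str:
    """Elias-δ for k > 0: γ-code of (k_d + 1), then k_r in k_d binary bits."""
    kd = k.bit_length() - 1
    kr = k - (1 << kd)
    return gamma(kd + 1) + (format(kr, f"0{kd}b") if kd else "")

for k in (1, 2, 9, 1023):
    print(k, elias_delta(k))
# 1 -> 0, 2 -> 1000, 9 -> 11000001, 1023 -> 1110010111111111 (16 bits vs. 19 for γ)
```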
Gamma code properties
▪ G is encoded using 2⌊log2 G⌋ + 1 bits
▪ Length of offset is ⌊log2 G⌋ bits
▪ Length of length is ⌊log2 G⌋ + 1 bits
▪ All gamma codes have an odd number of bits
▪ Almost within a factor of 2 of the best possible, log2 G
▪ Gamma code is uniquely prefix-decodable, like VB
▪ Gamma code can be used for any distribution
▪ Gamma code is parameter-free
Sec. 5.3
Gamma seldom used in practice
▪ Machines have word boundaries – 8, 16, 32, 64 bits
▪ Operations that cross word boundaries are slower
▪ Compressing and manipulating at the granularity of
bits can be slow
▪ Variable byte encoding is aligned and thus
potentially more efficient
▪ Regardless of efficiency, variable byte is conceptually
simpler at little additional space cost
Sec. 5.3
Shannon Limit
▪ Is it possible to derive codes that are optimal (under
certain assumptions)?
▪ What is the optimal average code length for a code
that encodes each integer (gap lengths)
independently?
▪ Lower bounds on average code length: Shannon
entropy
▪ H(X) = − Σ_{x=1..n} Pr[X=x] · log Pr[X=x]
▪ Asymptotically optimal codes (finite alphabets):
arithmetic coding, Huffman codes
Sec. 5.3
Global Bernoulli Model
▪ Assumption: term occurrences are Bernoulli events
▪ Notation:
▪ n: # of documents, m: # of terms in the vocabulary
▪ N: total # of (unique) term–document occurrences (i.e., postings)
▪ Probability of a term t_j occurring in document d_i: p = N/(nm)
▪ Each term–document occurrence is an independent event
▪ Probability of a gap of length x is given by the geometric distribution: Pr[gap = x] = (1 − p)^(x−1) · p
Sec. 5.3
How to design an optimal code for the geometric distribution?
Golomb Code
▪ Golomb Code (Golomb 1966): a highly efficient way to design an optimal Huffman-style code for the geometric distribution
▪ Parameter b
▪ For a given x ≥ 1, compute the integer quotient q = ⌊(x − 1)/b⌋ and remainder r = (x − 1) − q·b
▪ Assume b = 2^k
▪ Encode q in unary, followed by r coded in binary
▪ A bit more complicated if b ≠ 2^k; see Wikipedia.
▪ First part: q + 1 bits
▪ Second part: log2 b bits
Sec. 5.3
It can also be deemed as a generalization of the unary code.
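A sketch of the b = 2^k (Rice) special case described above: q = ⌊(x − 1)/b⌋ in unary, then the k-bit remainder in binary (illustrative code, not from a specific library).

```python
def rice_encode(x: int, k: int) -> str:
    """Rice code for x >= 1 with parameter b = 2**k."""
    b = 1 << k
    q, r = (x - 1) // b, (x - 1) % b
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")

def rice_decode(bits: str, k: int) -> int:
    q = bits.index("0")                          # length of the unary part
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return q * (1 << k) + r + 1

for x in (1, 5, 10):
    code = rice_encode(x, 2)                     # b = 4
    print(x, code, rice_decode(code, 2))         # 1 -> 000, 5 -> 1000, 10 -> 11001
```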
Golomb Code & Rice Code
▪ How to determine the optimal b*?
▪ Select the minimal b such that (1 − p)^b + (1 − p)^(b+1) ≤ 1
▪ Result due to Gallager and Van Voorhis 1975: this choice generates an optimal prefix code for the geometric distribution
▪ Small-p approximation: b* ≈ ln 2 / p ≈ 0.69/p
▪ Rice code: only allow b = 2^k
Sec. 5.3
Local Bernoulli Model
▪ If the lengths of the postings lists are known, then a Bernoulli model for each individual inverted list can be used
▪ Frequent words are coded with a smaller b, infrequent words with a larger b
▪ The term frequency needs to be encoded as well (use the γ-code)
▪ The local Bernoulli model outperforms the global Bernoulli model in practice (it is the method of choice in practice!)
Sec. 5.3
RCV1 compression
Data structure Size in MB
dictionary, fixed-width 11.2
dictionary, term pointers into string 7.6
with blocking, k = 4 7.1
with blocking & front coding 5.9
collection (text, xml markup etc) 3,600.0
collection (text) 960.0
Term-doc incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, γ−encoded 101.0
Sec. 5.3
Google’s Indexing Choice
▪ Index shards partitioned by doc, with multiple replicas
▪ Disk-resident index
▪ Use the outer parts of the disk
▪ Use different compression methods for different fields: Rice coding (a special kind of Golomb code) for gaps, and Gamma for positions.
▪ In-memory index
▪ All positions; no docIDs
▪ Keep track of document boundaries
▪ Group Varint encoding
▪ Fast to decode
Sec. 5.3
Source: Jeff Dean’s WSDM 2009 Keynote
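A sketch of the group-varint idea referred to above, as described in the keynote: four values share one tag byte whose 2-bit fields give each value's byte length, so the decoder avoids the per-byte branching of VB. Illustrative code only, not Google's implementation.

```python
def group_varint_encode(values):
    """Encode values in groups of 4: a tag byte (2 bits per value = byte length - 1),
    then each value in little-endian using that many bytes. Pads the last group with 0s."""
    out = bytearray()
    for i in range(0, len(values), 4):
        group = (values[i:i + 4] + [0, 0, 0])[:4]
        lengths = [max(1, (v.bit_length() + 7) // 8) for v in group]
        tag = 0
        for j, length in enumerate(lengths):
            tag |= (length - 1) << (2 * j)
        out.append(tag)
        for v, length in zip(group, lengths):
            out += v.to_bytes(length, "little")
    return bytes(out)

def group_varint_decode(data, count):
    values, pos = [], 0
    while len(values) < count:
        tag = data[pos]; pos += 1
        for j in range(4):
            length = ((tag >> (2 * j)) & 3) + 1
            values.append(int.from_bytes(data[pos:pos + length], "little"))
            pos += length
    return values[:count]

docid_gaps = [5, 300, 70000, 2, 9]
enc = group_varint_encode(docid_gaps)
print(len(enc), group_varint_decode(enc, len(docid_gaps)))   # 13 [5, 300, 70000, 2, 9]
```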
Other details
▪ Gap = docid_n − docid_{n−1} − 1
▪ Freq = freq − 1
▪ Pos_Gap = pos_n − pos_{n−1} − 1
▪ C.f., Jiangong Zhang, Xiaohui Long and Torsten Suel:
Performance of Compressed Inverted List Caching in
Search Engines. WWW 2008.
Sec. 5.3
Index compression summary
▪ We can now create an index for highly efficient
Boolean retrieval that is very space efficient
▪ Only 4% of the total size of the collection
▪ Only 10-15% of the total size of the text in the
collection
▪ However, we’ve ignored positional information
▪ Hence, space savings are less for indexes used in
practice
▪ But techniques substantially the same.
Sec. 5.3
Resources for today’s lecture
▪ IIR 5
▪ MG 3.3, 3.4.
▪ F. Scholer, H.E. Williams and J. Zobel. 2002.
Compression of Inverted Indexes For Fast Query
Evaluation. Proc. ACM-SIGIR 2002.
▪ Variable byte codes
▪ V. N. Anh and A. Moffat. 2005. Inverted Index
Compression Using Word-Aligned Binary Codes.
Information Retrieval 8: 151–166.
▪ Word aligned codes
Ch. 5