
COMP6714: Information Retrieval & Web Search
Introduction to
Information Retrieval
Lecture 5: Index Compression
1

COMP6714: Information Retrieval & Web Search
Inverted index
▪ For each term t, we must store a list of all documents that contain t.
  ▪ Identify each doc by a docID, a document serial number
▪ Can we use fixed-size arrays for this?

Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5   6   16  57   132
Calpurnia → 2  31 54 101
What happens if the word Caesar is added to document 14?
2

COMP6714: Information Retrieval & Web Search
Inverted index
▪ We need variable-size postings lists
  ▪ On disk, a continuous run of postings is normal and best
  ▪ In memory, can use linked lists or variable length arrays
  ▪ Some tradeoffs in size/ease of insertion

Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5   6   16  57   132
Calpurnia → 2  31 54 101

(Dictionary on the left; postings lists on the right — each docID entry is a posting.)
Sorted by docID (more later on why).
3

COMP6714: Information Retrieval & Web Search
Inverted index construction
Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:
  friend     → 2  4
  roman      → 1  2
  countryman → 13 16
4

COMP6714: Information Retrieval & Web Search
Where do we pay in storage?
▪ Terms and counts
▪ Lists of docIDs
▪ Pointers
5

COMP6714: Information Retrieval & Web Search
Ch. 5
Today
▪ Why compression?
▪ Collection statistics in more detail (with RCV1)
▪ How big will the dictionary and postings be?
▪ Dictionary compression
▪ Postings compression
6

COMP6714: Information Retrieval & Web Search
Ch. 5
Why compression (in general)?
▪ Use less disk space
  ▪ Saves a little money
▪ Keep more stuff in memory
  ▪ Increases speed
▪ Increase speed of data transfer from disk to memory
  ▪ [read compressed data | decompress] is faster than [read uncompressed data]
  ▪ Premise: decompression algorithms are fast
  ▪ True of the decompression algorithms we use
7

COMP6714: Information Retrieval & Web Search
Ch. 5
Why compression for inverted indexes?
▪ Dictionary
  ▪ Make it small enough to keep in main memory
  ▪ Make it so small that you can keep some postings lists in main memory too
▪ Postings file(s)
  ▪ Reduce disk space needed
  ▪ Decrease time needed to read postings lists from disk
  ▪ Large search engines keep a significant part of the postings in memory
  ▪ Compression lets you keep more in memory
▪ We will devise various IR-specific compression schemes
8

COMP6714: Information Retrieval & Web Search
RCV1: Our collection for this lecture
▪ Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course.
▪ The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example.
▪ As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
▪ This is one year of Reuters newswire (part of 1996 and 1997)
9

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Reuters RCV1 statistics
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per doc                            200
M        terms (= word types)                             ~400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
10

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Index parameters vs. what we index (details IIR Table 5.1, p.80)

                dictionary                 non-positional index        positional index
                (# word types)             (# postings)                (# positional postings)
                Size (K)  ∆%   cumul%      Size (K)   ∆%   cumul%      Size (K)   ∆%   cumul%
unfiltered      484                        109,971                     197,879
no numbers      474       -2   -2          100,680    -8   -8          179,158    -9   -9
case folding    392      -17   -19          96,969    -3   -12         179,158    -0   -9
30 stopwords    391       -0   -19          83,390   -14   -24         121,858   -31   -38
150 stopwords   391       -0   -19          67,002   -30   -39          94,517   -47   -52
stemming        322      -17   -33          63,812    -4   -42          94,517    -0   -52

Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?
11

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Lossless vs. lossy compression
▪ Lossless compression: all information is preserved.
  ▪ What we mostly do in IR.
▪ Lossy compression: discard some information
  ▪ Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
  ▪ Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top k list for any query.
  ▪ Almost no loss of quality in the top k list.
12

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Vocabulary vs. collection size
▪ How big is the term vocabulary?
▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
  ▪ Not really: at least 70^20 ≈ 10^37 different words of length 20
▪ In practice, the vocabulary will keep growing with the collection size
  ▪ Especially with Unicode ☺
13

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Vocabulary vs. collection size
▪ Heaps’ law: M = k·T^b
  ▪ M is the size of the vocabulary, T is the number of tokens in the collection
  ▪ Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
▪ In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½
  ▪ It is the simplest possible relationship between the two in log-log space
  ▪ An empirical finding (“empirical law”)
14

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Heaps’ Law
For RCV1, the dashed line log10 M = 0.49·log10 T + 1.64 is the best least squares fit.
Thus, M = 10^1.64 · T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
(Fig 5.1, p. 81)
15

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Exercises
▪ What is the effect of including spelling errors, vs. automatically correcting spelling errors, on Heaps’ law?
▪ Compute the vocabulary size M for this scenario:
  ▪ Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
  ▪ Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average
  ▪ What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law? (A worked sketch follows below.)
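A worked sketch of the second exercise in Python (the arithmetic and variable names are ours; the data points come from the exercise itself):

```python
import math

# Two observations: (T tokens, M distinct terms)
t1, m1 = 10_000, 3_000
t2, m2 = 1_000_000, 30_000

# Heaps' law M = k * T^b, so b = log(M2/M1) / log(T2/T1)
b = math.log(m2 / m1) / math.log(t2 / t1)   # = 0.5
k = m1 / t1 ** b                            # = 30.0

# Indexed collection: 2e10 pages x 200 tokens/page = 4e12 tokens
T = 2e10 * 200
M = k * T ** b                              # 30 * 2e6 = 6e7
print(f"b={b}, k={k}, M={M:.3g}")           # predicts ~60 million terms
```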
16

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Zipf’s law
▪ Heaps’ law gives the vocabulary size in collections.
▪ We also study the relative frequencies of terms.
▪ In natural language, there are a few very frequent terms and very many very rare terms.
▪ Zipf’s law: the i-th most frequent term has frequency proportional to 1/i.
  ▪ cf_i ∝ 1/i, i.e., cf_i = K/i where K is a normalizing constant
  ▪ cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
17

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Zipf consequences
▪ If the most frequent term (the) occurs cf_1 times
  ▪ then the second most frequent term (of) occurs cf_1/2 times
  ▪ the third most frequent term (and) occurs cf_1/3 times …
▪ Equivalently: cf_i = K/i where K is a normalizing factor, so
  ▪ log cf_i = log K − log i
  ▪ Linear relationship between log cf_i and log i
▪ Another power law relationship
18

COMP6714: Information Retrieval & Web Search
Sec. 5.1
Zipf’s law for Reuters RCV1
19

COMP6714: Information Retrieval & Web Search
Ch. 5
Compression
▪ Now, we will consider compressing the space for the dictionary and postings
▪ Basic Boolean index only
▪ No study of positional indexes, etc.
▪ We will consider compression schemes
20

COMP6714: Information Retrieval & Web Search
Sec. 5.2
DICTIONARY COMPRESSION
21

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Why compress the dictionary?
▪ Search begins with the dictionary
▪ We want to keep it in memory
▪ Memory footprint competition with other applications
▪ Embedded/mobile devices may have very little memory
▪ Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time
▪ So, compressing the dictionary is important
22

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Dictionary storage – first cut
▪ Array of fixed-width entries
  ▪ ~400,000 terms; 28 bytes/term = 11.2 MB.
  ▪ (Term: 20 bytes; Freq. and Postings ptr.: 4 bytes each)
▪ Dictionary search structure
23

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Fixed-width terms are wasteful
▪ Most of the bytes in the Term column are wasted – we allot 20 bytes even for 1-letter terms.
  ▪ And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
▪ Written English averages ~4.5 characters/word.
  ▪ Exercise: why is/isn’t this the number to use for estimating the dictionary size?
▪ Avg. dictionary word in English: ~8 characters
  ▪ How do we use ~8 characters per dictionary term?
▪ Short words dominate token counts but not the type average.
24

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Compressing the term list: Dictionary-as-a-String
▪ Store dictionary as a (long) string of characters:
  ▪ Pointer to next word shows end of current word
  ▪ Hope to save up to 60% of dictionary space.
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Total string length = 400K × 8 B = 3.2 MB
Pointers resolve 3.2M positions: log2 3.2M ≈ 22 bits, rounded up to 3 bytes. (A small layout sketch follows below.)
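A minimal sketch of the layout with a toy 4-term block (names ours; a real implementation would pack 3-byte offsets rather than use a Python list):

```python
# Dictionary stored as one long string plus term-start offsets.
terms = ["systile", "syzygetic", "syzygial", "syzygy"]
blob = "".join(terms)                  # the (long) string of characters
offsets, pos = [], 0
for t in terms:
    offsets.append(pos)                # would be a 3-byte pointer on disk
    pos += len(t)
offsets.append(pos)                    # sentinel: end of the last term

def term_at(i: int) -> str:
    # The pointer to the next word marks the end of the current word.
    return blob[offsets[i]:offsets[i + 1]]

assert term_at(1) == "syzygetic"
```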
25

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Space for dictionary as a string
▪ 4 bytes per term for Freq.
▪ 4 bytes per term for pointer to Postings.
▪ 3 bytes per term pointer (into the string)
▪ Avg. 8 bytes per term in the term string
▪ Now avg. 19 bytes/term, not 28.
▪ 400K terms × 19 B ➔ 7.6 MB (against 11.2 MB for fixed width)
26

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Blocking
▪ Store pointers to every k-th term string.
  ▪ Example below: k = 4.
▪ Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
27

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Net
▪ Example for block size k = 4
▪ Where we used 3 bytes/pointer without blocking (3 × 4 = 12 bytes), we now use 3 + 4 = 7 bytes.
▪ Shaved another ~0.5 MB; this reduces the size of the dictionary from 7.6 MB to 7.1 MB.
▪ We can save more with larger k.
▪ Why not go with larger k?
28

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Exercise
▪ Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16. (A worked sketch follows below.)
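A worked sketch of this exercise (our arithmetic, using figures from the preceding slides: 7.6 MB baseline, 400K terms, 3-byte pointers, 1-byte term lengths):

```python
# Per block of k terms: drop k-1 pointers (3 bytes each), add k length bytes.
for k in (4, 8, 16):
    saved_per_block = 3 * (k - 1) - k
    total_saved_mb = 400_000 / k * saved_per_block / 1e6
    print(f"k={k}: {7.6 - total_saved_mb:.2f} MB")
# k=4: 7.10 MB, k=8: 6.95 MB, k=16: 6.88 MB
```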
29

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Dictionary search without blocking
▪ Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6
Exercise: if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
30

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Dictionary search with blocking
▪ Binary search down to the 4-term block;
  ▪ then linear search through the terms in the block.
▪ Blocks of 4 (binary tree), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares
31

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Exercise
▪ Estimate the impact on search performance (and slowdown compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.
32

COMP6714: Information Retrieval & Web Search
Sec. 5.2
Front coding
▪ Front coding:
  ▪ Sorted words commonly have a long common prefix – store differences only
  ▪ (for the last k − 1 strings in a block of k)
8automata8automate9automatic10automation
→ 8automat*a1♢e2♢ic3♢ion
(“automat” is encoded once; the digits give the extra length beyond “automat”.)
Begins to resemble general string compression.
33

COMP6714: Information Retrieval & Web Search
Front Encoding [Witten, Moffat, Bell]
▪ Sorted words commonly have a long common prefix
  ▪ store differences only
▪ Complete front encoding
  ▪ (prefix-len, suffix-len, suffix)
▪ Partial 3-in-4 front encoding
  ▪ No encoding/compression for the first string in a block
  ▪ Enables binary search
Assume the previous string is “auto”:

String           Complete Front Encoding   Partial 3-in-4 Front Encoding
8, automata      4, 4, mata                , 8, automata
8, automate      7, 1, e                   7, 1, e
9, automatic     7, 2, ic                  7, 2, ic
10, automation   8, 2, on                  8, , on

(An encoding sketch follows after the table.)
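A small sketch of complete front coding over a sorted block (helper name ours, following the (prefix-len, suffix-len, suffix) scheme above):

```python
def front_code(terms: list[str]) -> list[tuple[int, int, str]]:
    """Complete front coding: (prefix-len, suffix-len, suffix) per term,
    each prefix measured against the previous term in the sorted block."""
    out, prev = [], ""
    for t in terms:
        p = 0
        while p < min(len(prev), len(t)) and prev[p] == t[p]:
            p += 1
        out.append((p, len(t) - p, t[p:]))
        prev = t
    return out

print(front_code(["auto", "automata", "automate", "automatic", "automation"]))
# -> [(0, 4, 'auto'), (4, 4, 'mata'), (7, 1, 'e'), (7, 2, 'ic'), (8, 2, 'on')]
```

Starting from “auto”, this reproduces the Complete Front Encoding column of the table.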
34

COMP6714: Information Retrieval & Web Search
Sec. 5.2
RCV1 dictionary compression summary
Technique                                           Size in MB
Fixed width                                         11.2
Dictionary-as-String with pointers to every term    7.6
Also, blocking k = 4                                7.1
Also, blocking + front coding                       5.9
35

COMP6714: Information Retrieval & Web Search
Sec. 5.3
POSTINGS COMPRESSION
36

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Postings compression
▪ The postings file is much larger than the dictionary, factor of at least 10.
▪ Key desideratum: store each posting compactly.
▪ A posting for our purposes is a docID.
▪ For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
▪ Alternatively, we can use log2 800,000 ≈ 20 bits per docID.
▪ Our goal: use a lot less than 20 bits per docID.
37

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Postings: two conflicting forces
▪ A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ≈ 20 bits.
▪ A term like the occurs in virtually every doc, so 20 bits/posting is too expensive.
  ▪ Prefer a 0/1 bitmap vector in this case
38

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Postings file entry
▪ We store the list of docs containing a term in increasing order of docID.
  ▪ computer: 33, 47, 154, 159, 202 …
▪ Consequence: it suffices to store gaps.
  ▪ 33, 14, 107, 5, 43 …
▪ Hope: most gaps can be encoded/stored with far fewer than 20 bits. (A two-line sketch of the gap transform follows below.)
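The gap transform in two lines of Python, using the slide’s own numbers:

```python
docids = [33, 47, 154, 159, 202]
gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
assert gaps == [33, 14, 107, 5, 43]
```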
39

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Three postings entries
40

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Variable length encoding
▪ Aim:
  ▪ For arachnocentric, we will use ~20 bits/gap entry.
  ▪ For the, we will use ~1 bit/gap entry.
▪ If the average gap for a term is G, we want to use ~log2 G bits/gap entry.
▪ Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
▪ This requires a variable length encoding
  ▪ Variable length codes achieve this by using short codes for small numbers
41

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Variable Byte (VB) codes
▪ For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits
▪ Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c
▪ If G ≤ 127, binary-encode it in the 7 available bits and set c = 1
▪ Else encode G’s lower-order 7 bits and then use additional bytes to encode the higher-order bits using the same algorithm
▪ At the end, set the continuation bit of the last byte to 1 (c = 1) – and for the other bytes c = 0. (A short encode/decode sketch follows below.)
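A minimal encode/decode sketch of VB as described above (assuming this slide’s convention, where the continuation bit marks the last byte of each gap):

```python
def vb_encode(g: int) -> bytes:
    # Emit 7 bits per byte, high-order groups first; the continuation
    # bit (0x80) is set only on the final byte, per the slides.
    out = [g % 128]
    g //= 128
    while g:
        out.append(g % 128)
        g //= 128
    out.reverse()
    out[-1] |= 0x80              # mark the last byte of this gap
    return bytes(out)

def vb_decode(data: bytes) -> list[int]:
    gaps, n = [], 0
    for byte in data:
        n = n * 128 + (byte & 0x7F)
        if byte & 0x80:          # last byte of this gap
            gaps.append(n)
            n = 0
    return gaps

# Matches the worked example on the next slide: 824, then gaps 5 and 214577
stream = b"".join(vb_encode(g) for g in (824, 5, 214577))
assert vb_decode(stream) == [824, 5, 214577]
```

Decoding needs only a shift and a mask per byte, which is one reason VB decompression is fast.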
42

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Example
docIDs     824                   829        215406
gaps                             5          214577
VB code    00000110 10111000     10000101   00001101 00001100 10110001

Postings stored as the byte concatenation:
000001101011100010000101000011010000110010110001

Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
(Hex(824) = 0x338; Hex(214577) = 0x34631)
43

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Other variable unit codes
▪ Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
▪ Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.
▪ Variable byte codes:
  ▪ Used by many commercial/research systems
  ▪ Good low-tech blend of variable-length coding and sensitivity to computer memory alignment (vs. bit-level codes, which we look at next).
▪ There is also recent work on word-aligned codes that pack a variable number of gaps into one word (e.g., Simple9)
44

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Simple9
▪ Encodes as many gaps as possible in one DWORD (32-bit word)
  ▪ 4 selector bits + 28 data bits
  ▪ The selector encodes 9 possible ways to “use” the data bits

Selector   # of gaps encoded   Len of each gap   Wasted bits
0000       28                  1                 0
0001       14                  2                 0
0010       9                   3                 1
0011       7                   4                 0
0100       5                   5                 3
0101       4                   7                 0
0110       3                   9                 1
0111       2                   14                0
1000       1                   28                0

(A packing sketch follows below.)
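A greedy packing sketch (our own; the placement of gaps within the 28 data bits, low-to-high here, is a convention we chose for illustration):

```python
# Simple9 layouts from the table above: (# of gaps, bits per gap).
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_pack(gaps):
    """Pack a list of non-negative ints (< 2**28) into 32-bit words."""
    words, i = [], 0
    while i < len(gaps):
        for sel, (n, w) in enumerate(LAYOUTS):
            chunk = gaps[i:i + n]
            if len(chunk) == n and all(g < (1 << w) for g in chunk):
                word = sel << 28          # 4-bit selector in the high bits
                for j, g in enumerate(chunk):
                    word |= g << (j * w)  # data bits, low-to-high (our choice)
                words.append(word)
                i += n
                break
        else:
            raise ValueError("gap does not fit in 28 bits")
    return words
```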
45

COMP6714: Information Retrieval & Web Search
Unary code
▪ Represent n as n 1s with a final 0.
▪ Unary code for 3 is 1110.
▪ Unary code for 40 is 11111111111111111111111111111111111111110.
▪ Unary code for 80 is 80 1s followed by a 0 (81 bits).
▪ This doesn’t look promising, but….
46

COMP6714: Information Retrieval & Web Search
Bit-Aligned Codes
▪ Breaks between encoded numbers can occur after any bit position
▪ Unary code
  ▪ Encode k by k 1s followed by a 0
  ▪ The 0 at the end makes the code unambiguous
47

COMP6714: Information Retrieval & Web Search
Unary and Binary Codes
▪ Unary is very efficient for small numbers such as 0 and 1, but quickly becomes very expensive
  ▪ 1023 can be represented in 10 binary bits, but requires 1024 bits in unary
▪ Binary is more efficient for large numbers, but it may be ambiguous
48

COMP6714: Information Retrieval & Web Search
Elias-γ Code
▪ To encode a number k ≥ 1, split it as k = 2^kd + kr:
  ▪ kd = ⌊log2 k⌋ (one less than the number of binary digits of k), encoded in unary
  ▪ kr = k − 2^kd, encoded in binary using kd bits
▪ The code is the unary part followed by the binary part. (A bit-string sketch follows below.)
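A bit-string sketch of the γ encoder (helper name ours):

```python
def elias_gamma(k: int) -> str:
    """Elias-γ code for k >= 1, as a bit string."""
    kd = k.bit_length() - 1                          # floor(log2 k)
    unary = "1" * kd + "0"                           # kd encoded in unary
    binary = format(k - (1 << kd), f"0{kd}b") if kd else ""
    return unary + binary

assert elias_gamma(1) == "0"
assert elias_gamma(9) == "1110001"   # unary(3) + offset 001: 2*3 + 1 = 7 bits
```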
49

COMP6714: Information Retrieval & Web Search
Elias-δ Code
▪ Elias-γ code uses no more bits than unary, and many fewer for k > 2
  ▪ 1023 takes 19 bits instead of 1024 bits using unary
  ▪ In general, γ takes 2⌊log2 k⌋ + 1 bits
▪ To improve the coding of large numbers, use the Elias-δ code
  ▪ Instead of encoding kd in unary, we encode kd + 1 using Elias-γ
  ▪ Takes approximately 2·log2 log2 k + log2 k bits
50

COMP6714: Information Retrieval & Web Search
Elias-δ Code
▪ Split (kd + 1) into (kdd, kdr), exactly as γ splits k: kdd = ⌊log2(kd + 1)⌋, kdr = (kd + 1) − 2^kdd
▪ Encode kdd in unary, kdr in binary, and kr in binary. (A sketch follows below.)
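The δ encoder, reusing elias_gamma from the previous sketch (again an illustrative sketch of the split described above):

```python
def elias_delta(k: int) -> str:
    """Elias-δ code for k >= 1: γ-encode (kd + 1), then append the offset."""
    kd = k.bit_length() - 1                            # floor(log2 k)
    offset = format(k - (1 << kd), f"0{kd}b") if kd else ""
    return elias_gamma(kd + 1) + offset                # gamma sketch from above

assert elias_delta(1) == "0"
assert elias_delta(9) == "11000" + "001"   # gamma(4) = 11000, offset 001
```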
51

COMP6714: Information Retrieval & Web Search
52

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Gamma code properties
▪ G is encoded using 2⌊log2 G⌋ + 1 bits
  ▪ Length of the offset is ⌊log2 G⌋ bits
  ▪ Length of the length part is ⌊log2 G⌋ + 1 bits
▪ All gamma codes have an odd number of bits
▪ Almost within a factor of 2 of the best possible, log2 G
▪ Gamma code is uniquely prefix-decodable, like VB
▪ Gamma code can be used for any distribution
▪ Gamma code is parameter-free
53

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Gamma seldom used in practice
▪ Machines have word boundaries – 8, 16, 32, 64 bits
  ▪ Operations that cross word boundaries are slower
▪ Compressing and manipulating at the granularity of bits can be slow
▪ Variable byte encoding is aligned and thus potentially more efficient
▪ Regardless of efficiency, variable byte is conceptually simpler at little additional space cost
54

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Shannon Limit
▪ Is it possible to derive codes that are optimal (under certain assumptions)?
▪ What is the optimal average code length for a code that encodes each integer (gap length) independently?
▪ Lower bound on the average code length: the Shannon entropy
  ▪ H(X) = −Σx=1..n Pr[X = x] · log Pr[X = x]
▪ Asymptotically optimal codes (for finite alphabets): arithmetic coding, Huffman codes
55

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Global Bernoulli Model
▪ Assumption: term occurrences are Bernoulli events
▪ Notation:
  ▪ n: # of documents, m: # of terms in the vocabulary
  ▪ N: total # of (unique) term-document occurrences
▪ Probability of a term tj occurring in document di: p = N/(nm)
▪ Each term-document occurrence is an independent event
▪ Probability of a gap of length x is then given by the geometric distribution: Pr[gap = x] = (1 − p)^(x−1) · p
How to design an optimal code for the geometric distribution? (A small sketch of the model quantities follows below.)
56
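A small sketch of the model’s quantities (the concrete numbers are rough RCV1-scale placeholders of ours, not from the slides):

```python
# n docs, m vocabulary terms, N unique term-doc occurrences (illustrative).
n, m, N = 800_000, 400_000, 100_000_000
p = N / (n * m)                     # P(term t_j occurs in document d_i)

def gap_prob(x: int) -> float:
    # Geometric distribution over gap lengths x = 1, 2, ...
    return (1 - p) ** (x - 1) * p
```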

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Golomb Code
▪ Golomb code (Golomb 1966): a highly efficient way to build an optimal Huffman-style code for the geometric distribution
  ▪ Parameter b
  ▪ For a given x ≥ 1, compute the integer quotient q = ⌊(x − 1)/b⌋ and remainder r = (x − 1) − q·b
▪ Assume b = 2^k
  ▪ Encode q in unary, followed by r coded in binary
  ▪ A bit more complicated if b ≠ 2^k; see Wikipedia.
▪ First step: (q + 1) bits; second step: log2(b) bits
▪ It can also be seen as a generalization of the unary code. (A Rice-coding sketch follows below.)
57
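A sketch of the b = 2^k case, i.e., Rice coding (helper name ours):

```python
def rice_encode(x: int, k: int) -> str:
    """Rice code (Golomb with b = 2**k) for a gap x >= 1, as a bit string."""
    b = 1 << k
    q, r = (x - 1) // b, (x - 1) % b
    rem = format(r, f"0{k}b") if k else ""   # remainder in k binary bits
    return "1" * q + "0" + rem               # quotient in unary, then remainder

# b = 4 (k = 2): x = 1..4 -> 000, 001, 010, 011 ; x = 5 -> 1000
assert rice_encode(5, 2) == "1000"
```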

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Golomb Code & Rice Code
▪ How to determine the optimal b*?
  ▪ Select the minimal b such that (1 − p)^b + (1 − p)^(b+1) ≤ 1
  ▪ Result due to Gallager and Van Voorhis 1975: this generates an optimal prefix code for the geometric distribution
▪ Small-p approximation: b* ≈ (ln 2)/p ≈ 0.69/p
▪ Rice code: only allow b = 2^k
58

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Local Bernoulli Model
▪ If the length of a postings list is known, then a Bernoulli model on each individual inverted list can be used
▪ Frequent words are coded with a smaller b, infrequent words with a larger b
▪ Term frequencies need to be encoded too (use the γ-code)
▪ The local Bernoulli model outperforms the global Bernoulli model in practice (and is the method used in practice!)
59

COMP6714: Information Retrieval & Web Search
Sec. 5.3
RCV1 compression
Data structure                              Size in MB
dictionary, fixed-width                     11.2
dictionary, term pointers into string       7.6
  with blocking, k = 4                      7.1
  with blocking & front coding              5.9
collection (text, xml markup etc)           3,600.0
collection (text)                           960.0
term-doc incidence matrix                   40,000.0
postings, uncompressed (32-bit words)       400.0
postings, uncompressed (20 bits)            250.0
postings, variable byte encoded             116.0
postings, γ-encoded                         101.0
60

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Google’s Indexing Choice
▪ Index shards: partitioned by doc, with multiple replicas
▪ Disk-resident index
  ▪ Use the outer parts of the disk
  ▪ Use different compression methods for different fields: Rice_k (a special kind of Golomb code) for gaps, and γ for positions.
▪ In-memory index
  ▪ All positions; no docids
  ▪ Keep track of document boundaries
  ▪ Group Varint encoding
    ▪ Fast to decode
Source: Jeff Dean’s WSDM 2009 Keynote
61

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Other details
▪ Gap = docid_n − docid_{n−1} − 1
▪ Freq = freq − 1
▪ Pos_Gap = pos_n − pos_{n−1} − 1
▪ Cf. Jiangong Zhang, Xiaohui Long and Torsten Suel: Performance of Compressed Inverted List Caching in Search Engines. WWW 2008.
62

COMP6714: Information Retrieval & Web Search
Sec. 5.3
Index compression summary
▪ We can now create an index for highly efficient Boolean retrieval that is very space efficient
  ▪ Only 4% of the total size of the collection
  ▪ Only 10–15% of the total size of the text in the collection
▪ However, we’ve ignored positional information
  ▪ Hence, space savings are less for indexes used in practice
  ▪ But the techniques are substantially the same.
63

COMP6714: Information Retrieval & Web Search
Ch. 5
Resources for today’s lecture
▪ IIR, Chapter 5
▪ MG 3.3, 3.4
▪ F. Scholer, H.E. Williams and J. Zobel. 2002. Compression of Inverted Indexes For Fast Query Evaluation. Proc. ACM SIGIR 2002. (variable byte codes)
▪ V. N. Anh and A. Moffat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval 8: 151–166. (word-aligned codes)
64