2021/4/28 SIMC Indexing
SIMC Indexing
Signature-based indexing Superimposed Codewords (SIMC)
SIMC Example
SIMC Queries
Example SIMC Query
SIMC Parameters
Query Cost for SIMC
Page-level SIMC
Bit-sliced SIMC
Comparison of Approaches Signature-based Indexing in PostgreSQL
>>
COMP9315 21T1 ♢ SIMC Indexing ♢ [0/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
1/18
2021/4/28 SIMC Indexing
∧ >>
❖ Signature-based indexing
Reminder: le organisation for signature indexing (two
les)
One signature slot per tuple slot; unused signature slots are zeroed.
COMP9315 21T1 ♢ SIMC Indexing ♢ [1/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
2/18
2021/4/28 SIMC Indexing
<< ∧ >> ❖ Superimposed Codewords (SIMC)
In a superimposed codewords (simc) indexing scheme
a tuple descriptor is formed by overlaying attribute codewords
each codeword is m bits long and has k bits set to 1
A tuple descriptor desc(t) is
a bit-string, m bits long, where j ≤ nk bits are set to 1 desc(t) = cw(A1) OR cw(A2) OR … OR cw(An)
Method (assuming all n attributes are used in descriptor):
bits desc = 0
for (i = 1; i <= n; i++) {
bits cw = codeword(A[i],m,k)
desc = desc | cw
}
COMP9315 21T1 ♢ SIMC Indexing ♢ [2/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
3/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ SIMC Example
Consider the following tuple (from bank deposit database)
Ithasthefollowingcodewords/descriptor(form=12, k =2)
Ai cw(Ai)
Perryridge 010000000001 102 000000000011 Hayes 000001000100 400 000010000100 desc(t) 010011000111
COMP9315 21T1 ♢ SIMC Indexing ♢ [3/16]
Branch
AcctNo
Name
Amount
Perryridge
102
Hayes
400
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
4/18
2021/4/28 SIMC Indexing
❖ SIMC Queries
To answer query q in SIMC
rst generate desc(q) by OR-ing codewords for known attributes
then attempt to match desc(q) against all signatures in sig le
<< ∧ >>
E.g.considerthequery(Perryridge, ?, Ai cw(Ai)
Perryridge 010000000001
? 000000000000
? 000000000000
? 000000000000
desc(q) 010000000001
COMP9315 21T1 ♢ SIMC Indexing ♢ [4/16]
?, ?).
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
5/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ SIMC Queries (cont)
Once we have a query descriptor, we search the
signature le:
pagesToCheck = {}
// scan r signatures
for each descriptor D[i] in signature file {
if (matches(D[i],desc(q))) {
pid = pageOf(tupleID(i))
pagesToCheck = pagesToCheck ∪ pid }
}
// then scan bsq = bq + δ pages to check for matches
Matching can be implemented ef ciently …
#define matches(sig,qdesc) ((sig & qdesc) == qdesc)
COMP9315 21T1 ♢ SIMC Indexing ♢ [5/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
6/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Example SIMC Query
Consider the query and the example database:
Signature
010000000001 100101001001 010011000111 101001001001 101100000011 010101010101 100101010011
Deposit Record
(Perryridge,?,?,?)
(Brighton,217,Green,750)
(Perryridge,102,Hayes,400)
(Downtown,101,Johnshon,512) (Mianus,215,Smith,700) (Clearview,117,Throggs,295) (Redwood,222,Lindsay,695)
Gives two matches: one true match, one false match. COMP9315 21T1 ♢ SIMC Indexing ♢ [6/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
7/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ SIMC Parameters
False match probablity pF = likelihood of a false match
How to reduce likelihood of false matches?
usedifferenthashfunctionforeachattribute (hifor Ai)
increase descriptor size (m)
choose k so that ≅ half of bits are set
Larger m means larger signature le ⇒ read more signature data.
Having k too high ⇒ increased overlapping. Having k too low ⇒ increased hash collisions.
COMP9315 21T1 ♢ SIMC Indexing ♢ [7/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
8/18
2021/4/28 SIMC Indexing
❖ SIMC Parameters (cont) How to determine “optimal” m and k?
1. start by choosing acceptable pF
(e.g. pF ≤ 10-4 i.e. one false match in 10,000)
2. then choose m and k to achieve no more than this pF. Formulae to derive m and k given pF and n:
k = 1/loge2 . loge ( 1/pF )
m = (1/loge2)2.n.loge(1/pF)
Formula from Bloom (1970), “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Communications of the ACM, 13 (7): 422–426
COMP9315 21T1 ♢ SIMC Indexing ♢ [8/16]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
9/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Query Cost for SIMC
Cost to answer pmr query: Costpmr = bD + bsq
read r descriptors on bD descriptor pages
then read bsq data pages and check for matches
bD = ceil( r/cD ) and cD = oor(B/ceil(m/8))
E.g. m=64, B=8192, r=104 ⇒ cD = 1024, bD=10
bsqincludespageswithrq matchingtuplesandrF false matches
Expected false matches = rF = (r – rq).pF ≅ r.pF if rq ≪ r E.g. Worst bsq = rq+rF, Best bsq = 1, Avg bsq =
ceil(b(rq+rF)/r)
COMP9315 21T1 ♢ SIMC Indexing ♢ [9/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
10/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Page-level SIMC
SIMC has one descriptor per tuple … potentially
inef cient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor (PD) (clearly larger than tuple descriptor):
use above formulae but with c.n “attributes”
E.g.n=4,c=64,pF=10-3 ⇒ mp≅3680bits≅ 460bytes
Typically, pages are 1..8KB ⇒ 8..64 PD/page (cPD). E.g.mp≅460, B=8192, cPD≅17
COMP9315 21T1 ♢ SIMC Indexing ♢ [10/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
11/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Page-level SIMC (cont)
File organisation for page-level superimposed codeword
index
COMP9315 21T1 ♢ SIMC Indexing ♢ [11/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
12/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Page-level SIMC (cont)
Algorithm for evaluating pmr query using page
descriptors
pagesToCheck = {}
// scan b mp-bit page descriptors
for each descriptor D[i] in signature file {
if (matches(D[i],desc(q))) {
pid = i
pagesToCheck = pagesToCheck ∪ pid }
}
// read and scan bsq data pages
for each pid in pagesToCheck {
Buf = getPage(dataFile,pid)
check tuples in Buf for answers
}
COMP9315 21T1 ♢ SIMC Indexing ♢ [12/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
13/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Bit-sliced SIMC
Improvement: store b m-bit page descriptors as m b-bit
“bit-slices”
COMP9315 21T1 ♢ SIMC Indexing ♢ [13/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
14/18
2021/4/28 SIMC Indexing
<< ∧ >>
❖ Bit-sliced SIMC (cont)
Algorithm for evaluating pmr query using bit-sliced
descriptors
matches = ~0 //all ones
// scan m r-bit slices
for each bit i set to 1 in desc(q) {
slice = fetch bit-slice i
matches = matches & slice
}
for each bit i set to 1 in matches {
fetch page i
scan page for matching records
}
Effective because desc(q) typically has less than half bits setto1
COMP9315 21T1 ♢ SIMC Indexing ♢ [14/16]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
15/18
2021/4/28 SIMC Indexing
❖ Comparison of Approaches Tuple-based
r signatures, m-bit signatures, k bits/attribute
read all pages of signature le in ltering for a query
Page-based
b signatures, mp-bit signatures, k bits/attribute
read all pages of signature le in ltering for a query
Bit-sliced
m signatures, b-bit slices, k bits/attribute
read less than half of the signature le in ltering for a query
All signature les are roughly the same size, for a given pF
COMP9315 21T1 ♢ SIMC Indexing ♢ [15/16]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
16/18
2021/4/28 SIMC Indexing
❖ Signature-based Indexing in PostgreSQL
PostgreSQL supports signature based indexing via the bloom module
(Signature-based indexes like this are often called Bloom lters) Creating a Bloom index
create index Idx on R using bloom (a1,a2,a3)
with (length=64, col1=3, col2=4, col3=2);
Example: 10000 tuples, query = select * from R where a1=55 and a2=42, no matching tuples, random numeric values for attributes
No indexes … execution time 15ms
B-tree index on all attributes … execution time 12ms
Bloom index … execution time 0.4ms
For more details, see PostgreSQL doc, Appendix F.5
COMP9315 21T1 ♢ SIMC Indexing ♢ [16/16]
<< ∧
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
17/18
2021/4/28 SIMC Indexing
Produced: 6 Apr 2021
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-simc/slides.html
18/18