2021/4/28 CATC Indexing
CATC Indexing
Signature-based indexing Concatenated Codewords (CATC) CATC Example
CATC Queries
Example SIMC Query
CATC Parameters
Query Cost for CATC
Variations on CATC
Comparison with SIMC
>>
COMP9315 21T1 ♢ CATC Indexing ♢ [0/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
1/14
2021/4/28 CATC Indexing
∧ >>
❖ Signature-based indexing
Reminder: le organisation for signature indexing
(two les)
One signature slot per tuple slot; unused signature slots are zeroed.
We use the terms “signature” and “descriptor” interchangeably
COMP9315 21T1 ♢ CATC Indexing ♢ [1/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
2/14
2021/4/28 CATC Indexing
<< ∧ >> ❖ Concatenated Codewords (CATC)
In a concatenated codewords (catc) indexing scheme a tuple signature is formed by concatenating
attribute codewords
the signature is m bits long, with ≅m/2 bits set to1
codewordforattri isui bitslongandhas≅ui/2 bits set to 1
each codeword could be different length, but always ∑1..n ui = m
A tuple descriptor (signature) desc(t) is
desc(t) = cw(An) + cw(An-1) … + cw(A+) + cw(A1)
where “+” represents bit-string concatentation
The order that the concenated codewords appears doesn’t matter, as long as it’s done consistently
COMP9315 21T1 ♢ CATC Indexing ♢ [2/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
3/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ CATC Example
Consider the following tuple (from bank deposit database)
It has the following codewords/descriptor (for m = 16, ui=4)
Ai cw(Ai)
Perryridge 0101
102 1001
Hayes 1010
400 1100
desc(t) 1100101010010101
COMP9315 21T1 ♢ CATC Indexing ♢ [3/12]
Branch
AcctNo
Name
Amount
Perryridge
102
Hayes
400
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
4/14
2021/4/28 CATC Indexing
❖ CATC Queries
To answer query q in CATC
rst generate desc(q) by combining codewords for all attributes
forknownAi usecw(Ai);for”unknown”Ai,use cw(Ai) = 0
E.g.considerthequery(Perryridge, ?, Hayes, ?).
Ai cw(Ai)
Perryridge 0101
? 0000
Hayes 1010
? 0000
desc(q) 0000101000000101
COMP9315 21T1 ♢ CATC Indexing ♢ [4/12]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
5/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ CATC Queries (cont)
Once we have a query descriptor, we search the
signature le:
pagesToCheck = {}
// scan r signatures
for each descriptor D[i] in signature file {
if (matches(D[i],desc(q))) {
pid = pageOf(tupleID(i))
pagesToCheck = pagesToCheck ∪ pid }
}
// then scan bsq = bq + δ pages to check for matches
Matching can be implemented ef ciently …
#define matches(sig,qdesc) ((sig & qdesc) == qdesc)
COMP9315 21T1 ♢ CATC Indexing ♢ [5/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
6/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ Example SIMC Query
Consider the query and the example database:
Signature
0000101000000101 1010100101101001 1100101010010101 1010011010010110 0110101001010011 1010101011000101 1001010100111001
Deposit Record
(Perryridge,?,Hayes,?)
(Brighton,217,Green,750)
(Perryridge,102,Hayes,400)
(Downtown,101,Johnshon,512) (Mianus,215,Smith,700) (Clearview,117,Throggs,295) (Redwood,222,Lindsay,695)
Gives two matches: one true match, one false match.
COMP9315 21T1 ♢ CATC Indexing ♢ [6/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
7/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ CATC Parameters
False match probablity pF = likelihood of a false
match
How to reduce likelihood of false matches?
usedifferenthashfunctionforeachattribute (hi for Ai)
increase descriptor size (m)
Larger m means larger signature le ⇒ read more signature data.
Since ui ‘s are relatively small, hash collisions may be a serious issue
But making ui’s means larger signatures ⇒ optimisation problem
COMP9315 21T1 ♢ CATC Indexing ♢ [7/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
8/14
2021/4/28 CATC Indexing
❖ CATC Parameters (cont) How to determine “optimal” m and u ?
1. start by choosing acceptable pF
(e.g. pF ≤ 10-4 i.e. one false match in 10,000)
2. then choose m to achieve no more than this pF.
Formulae to derive “good” m: m = ( 1/loge 2 )2 . n . loge ( 1/pF )
Choice of ui values
eachAi hassameui, or
allocateui basedonsizeofattributedomains
COMP9315 21T1 ♢ CATC Indexing ♢ [8/12]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
9/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ Query Cost for CATC
Cost to answer pmr query: Costpmr = bD + bsq
read r descriptors on bD descriptor pages thenreadbsq datapagesandcheckformatches
bD = ceil( r/cD ) and cD = oor(B/ceil(m/8))
E.g. m=64, B=8192, r=104 ⇒ cD = 1024, bD=10
bsqincludespageswithrq matchingtuplesandrF false matches
Expected false matches = rF = (r – rq).pF ≅ r.pF if rq ≪r
E.g. Worst bsq = rq+rF, Best bsq = 1, Avg bsq = ceil(b(rq+rF)/r)
COMP9315 21T1 ♢ CATC Indexing ♢ [9/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
10/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ Variations on CATC
CATC has one descriptor per tuple … potentially
inef cient.
Alternative approach: one descriptor for each data page.
Every attribute of every tuple in page contributes to descriptor.
Size of page descriptor mp = ( 1/loge 2 )2 . c.n . loge ( 1/pF )
Sizeofcodewordsisproportionallylarger (unless attribute domain small)
E.g.n=4,c=64,pF=10-3 ⇒ mp≅3680bits≅ 460bytes
Typically, pages are 1..8KB ⇒ 8..64 PD/page (cPD). E.g.mp≅460, B=8192, cPD≅17
COMP9315 21T1 ♢ CATC Indexing ♢ [10/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
11/14
2021/4/28 CATC Indexing
<< ∧ >>
❖ Variations on CATC (cont) Improvement: store b × mp-bit page descriptors as
mp × b-bit “bit-slices”
Ifb=2x thenusessamestorageaspagedescriptors
Query cost: scan ui / 2 bit-slices for each known attribute
If k is set of known attribute values, #slices = ∑i∈k ui /2
E.g.b=128,m=256,n=4,ui=16
(a,?,c,?) requires scan of 2 × 8 128-bit (16-byte) slices
compared to scan of 128 page descriptors, where each PD is 64-bits (8-bytes)
COMP9315 21T1 ♢ CATC Indexing ♢ [11/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
12/14
2021/4/28 CATC Indexing
❖ Comparison with SIMC Assume same m , pF , n for each method …
CATC has ui-bit codewords, each has ≅ui / 2 bits set to1
SIMC has m-bit codewords, each has k bits set to 1
Signatures for both have m bits, with ≅m / 2 bits set to1
CATC has exibility in ui, but small(er) codewords so more hash collisions
SIMC has less hash collisions, but has errors from “unfortunate” overlays
COMP9315 21T1 ♢ CATC Indexing ♢ [12/12]
<< ∧
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
13/14
2021/4/28 CATC Indexing
Produced: 28 Mar 2021
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/sig-catc/slides.html
14/14