CS计算机代考程序代写 scheme AI 2021/4/28 Signature-based Indexing

2021/4/28 Signature-based Indexing
Signature-based Indexing
Indexing with Signatures Signatures
Generating Codewords Superimposed Codewords (SIMC) Concatenated Codewords (CATC) Queries using Signatures
False Matches SIMC vs CATC
>>
COMP9315 21T1 ♢ Signature-based Indexing ♢ [0/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 1/14

2021/4/28 Signature-based Indexing
❖ Indexing with Signatures Signature-based indexing:
designedforpmrqueries (conjunctionof equalities)
does not try to achieve better than O(n) performance
attempts to provide an “efcient” linear scan Each tuple is associated with a signature
a compact (lossy) descriptor for the tuple
formed by combining information from multiple attributes
stored in a signature le, parallel to data le
Instead of scanning/testing tuples, do pre- ltering via signatures.
COMP9315 21T1 ♢ Signature-based Indexing ♢ [1/12]
∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 2/14

2021/4/28 Signature-based Indexing
<< ∧ >> ❖ Indexing with Signatures (cont)
File organisation for signature indexing (two les)
One signature slot per tuple slot; unused signature slots are zeroed.
Signatures do not determine record placement ⇒ can use with other indexing.
COMP9315 21T1 ♢ Signature-based Indexing ♢ [2/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 3/14

2021/4/28 Signature-based Indexing
<< ∧ >>
❖ Signatures
A signature “summarises” the data from one tuple
A tuple consists of n attribute values A1 .. An A codeword cw(Ai) is
a bit-string, m bits long, where k bits are set to 1 (k ≪ m)
derived from the value of a single attribute Ai
A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
aim to have roughly half of the bits set to 1
Two strategies for building signatures: overlay, concatenate
COMP9315 21T1 ♢ Signature-based Indexing ♢ [3/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 4/14

2021/4/28 Signature-based Indexing
<< ∧ >>
❖ Generating Codewords Generating a k-in-m codeword for attribute Ai
bits codeword(char *attr_value, int m, int k)
{
int nbits = 0; // count of set bits bits cword = 0; // assuming m <= 32 bits srandom(hash(attr_value)); while (nbits < k) { int i = random() % m; if (((1 << i) & cword) == 0) { cword |= (1 << i); nbits++; } } return cword; // m-bits with k 1-bits and m-k 0-bits } COMP9315 21T1 ♢ Signature-based Indexing ♢ [4/12] https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 5/14 2021/4/28 Signature-based Indexing << ∧ >> ❖ Superimposed Codewords
(SIMC)
In a superimposed codewords (simc) indexing scheme
tuple descriptor formed by overlaying attribute codewords (bitwise-OR)
COMP9315 21T1 ♢ Signature-based Indexing ♢ [5/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 6/14

2021/4/28 Signature-based Indexing
<< ∧ >> ❖ Superimposed Codewords
(SIMC) (cont)
A SIMC tuple descriptor desc(t) is
a bit-string, m bits long, where j ≤ nk bits are setto1
desc(t) = cw(A1) OR cw(A2) OR … OR cw(An) Method (assuming all n attributes are used in descriptor):
Bits desc = 0
for (i = 1; i <= n; i++) { bits cw = codeword(A[i],m,k) desc = desc | cw } COMP9315 21T1 ♢ Signature-based Indexing ♢ [6/12] https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 7/14 2021/4/28 Signature-based Indexing << ∧ >> ❖ Concatenated Codewords
(CATC)
In a concatenated codewords (catc) indexing schema
tuple descriptor formed by concatenating attribute codewords
COMP9315 21T1 ♢ Signature-based Indexing ♢ [7/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 8/14

2021/4/28 Signature-based Indexing
<< ∧ >> ❖ Concatenated Codewords
(CATC) (cont)
A CATC tuple descriptor desc(t) is
a bit-string, m bits long, where j = nk bits are setto1
desc(t) = cw(A1) + cw(A2) + … + cw(An) (+ is concatenation)
Each codeword is p = m/n bits long, with k = p/2 bits set to 1
Method (assuming all n attributes are used in descriptor):
Bitsdesc=0; intp=m/n for (i = 1; i <= n; i++) { bits cw = codeword(A[i],p,k) desc = desc | (cw << p*(n-i)) } COMP9315 21T1 ♢ Signature-based Indexing ♢ [8/12] https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 9/14 2021/4/28 Signature-based Indexing << ∧ >> ❖ Queries using Signatures
To answer query q with a signature-based index
rst generate a query descriptor desc(q) then scan the signature le using the query
descriptor
if sigi matches desc(q), then tuple i may be a match
desc(q) is formed from codewords of known attributes.
Effectively,anyunknownattributeAi hascw(Ai) =0
E.g. for SIMC (a,?,c,?,e) = cw(a) OR 0 OR cw(c) OR 0 OR cw(e)
COMP9315 21T1 ♢ Signature-based Indexing ♢ [9/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 10/14

2021/4/28 Signature-based Indexing
<< ∧ >> ❖ Queries using Signatures (cont)
Once we have a query descriptor, we search the signature le:
pagesToCheck = {}
// scan r descriptors
for each descriptor D[i] in signature file {
if (matches(D[i],desc(q))) {
pid = pageOf(tupleID(i))
pagesToCheck = pagesToCheck ∪ pid }
}
// scan bq + δ data pages
for each pid in pagesToCheck {
Buf = getPage(dataFile,pid)
check tuples in Buf for answers
}
COMP9315 21T1 ♢ Signature-based Indexing ♢ [10/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 11/14

2021/4/28 Signature-based Indexing
<< ∧ >>
❖ False Matches
Both SIMC and CATC can produce false matches
matches(D[i],desc(q)) is true, but Tup[i] is not a solution for q
Why does this happen?
signatures are based on hashing, and it is possible that
hash(key1) == hash(key2) even though key1 != key2
for SIMC, overlaying could also produce “unfortunate” bit-combinations
To mitigate this, need to choose “good” m and k
COMP9315 21T1 ♢ Signature-based Indexing ♢ [11/12]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 12/14

2021/4/28 Signature-based Indexing
<< ∧ ❖ SIMC vs CATC Both build m-bit wide signatures, with ~1/2 bits setto1 Both have codewords with ~m/2n bits set to 1 CATC: codewords are m/n = p-bits wide shorter codewords → more hash collisions SIMC: codewords are also m-bits wide longer codewords ⇒ less hash collisions, but also has overlay collisions CATC has option of different length codeword pi for each Ai (∑pi = m) COMP9315 21T1 ♢ Signature-based Indexing ♢ [12/12] https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 13/14 2021/4/28 Signature-based Indexing Produced: 20 Mar 2021 https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/signatures/slides.html 14/14