2021/4/28 Hashed Files
Hashed Files
Hashing
Hashing Performance Selection with Hashing Insertion with Hashing Deletion with Hashing Problem with Hashing… Flexible Hashing
>>
COMP9315 21T1 ♢ Hashed Files ♢ [0/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
1/17
2021/4/28 Hashed Files
∧ >>
❖ Hashing
Basic idea: use key value to compute page address of
tuple.
e.g. tuple with key = v is stored in page i
Requires: hash function h(v) that maps KeyDomain → [0..b-1].
hashing converts key value (any type) into integer value
integer value is then mapped to page index note: can view integer value as a bit-string
COMP9315 21T1 ♢ Hashed Files ♢ [1/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
2/17
2021/4/28 Hashed Files
❖ Hashing (cont)
PostgreSQL hash function (simpli ed):
Datum hash_any(unsigned char *k, int keylen) {
uint32 a, b, c, len, *ka = (uint32 *)k;
/* Set up the internal state */
len = keylen;
a = b = c = 0x9e3779b9+len+3923095;
/* handle most of the key */
while (len >= 12) {
a += ka[0]; b += ka[1]; c += ka[2]; mix(a, b, c);
ka+=3; len-=12;
}
… collect data from remaining bytes into a,b,c …
mix(a, b, c);
return UInt32GetDatum(c);
}
See backend/access/hash/hashfunc.c for details (incl mix())
COMP9315 21T1 ♢ Hashed Files ♢ [2/15]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
3/17
2021/4/28 Hashed Files
<< ∧ >>
❖ Hashing (cont)
hash_any() gives hash value as 32-bit quantity
(uint32).
Two ways to map raw hash value into a page address:
if b = 2k, bitwise AND with k low-order bits set to one
uint32 hashToPageNum(uint32 hval) {
uint32 mask = 0xFFFFFFFF;
return (hval & (mask >> (32-k)));
}
otherwise, use mod to produce value in range 0..b- 1
uint32 hashToPageNum(uint32 hval) {
return (hval % b);
}
COMP9315 21T1 ♢ Hashed Files ♢ [3/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
4/17
2021/4/28 Hashed Files
❖ Hashing Performance Aims:
distribute tuples evenly amongst buckets
havemostbucketsnearlyfull (attempttominimise wasted space)
Note: if data distribution not uniform, address distribution can’t be uniform.
Best case: every bucket contains same number of tuples.
Worst case: every tuple hashes to same bucket. Average case: some buckets have more tuples than
others.
Use over ow pages to handle “overfull” buckets (cf. sorted les)
All tuples in each bucket must have same hash value.
COMP9315 21T1 ♢ Hashed Files ♢ [4/15]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
5/17
2021/4/28 Hashed Files
❖ Hashing Performance (cont) Two important measures for hash les:
loadfactor: L = r/bc
average over ow chain length: Ov = bov / b
Three cases for distribution of tuples in a hashed le:
<< ∧ >>
Case
L
Ov
Best
≅1
0
Worst
>>1
**
Average
<1
0
❖ Selection with Hashing Select via hashing on unique key k (one)
// select * from R where k = val
getPageViaHash(R, val, P)
for each tuple t in page P {
if (t.k == val) return t
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.k == val) return t
}}
Costone: Best=1, Avg=1+Ov/2 Worst= 1+max(OvLen)
COMP9315 21T1 ♢ Hashed Files ♢ [6/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
7/17
2021/4/28 Hashed Files
<< ∧ >> ❖ Selection with Hashing (cont)
Working out which page, given a key …
getPageViaHash(Reln R, Value key, Page p)
{
uint32 h = hash_any(key, len(key));
PageID pid = h % nPages(R);
get_page(R, pid, buf);
}
COMP9315 21T1 ♢ Hashed Files ♢ [7/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
8/17
2021/4/28 Hashed Files
<< ∧ >> ❖ Selection with Hashing (cont)
Select via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val
getPageViaHash(R, val, P)
for each tuple t in page P {
if (t.nk == val) add t to results
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.nk == val) add t to results }}
return results
Costpmr = 1 + Ov
If Ov is small (e.g. 0 or 1), very good retrieval cost
COMP9315 21T1 ♢ Hashed Files ♢ [8/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
9/17
2021/4/28 Hashed Files
<< ∧ >> ❖ Selection with Hashing (cont)
Hashing does not help with range queries** … Costrange = b + bov
Selection on attribute j which is not hash key … Costone, Costrange, Costpmr = b + bov
** unless the hash function is order-preserving (and most aren’t)
COMP9315 21T1 ♢ Hashed Files ♢ [9/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
10/17
2021/4/28 Hashed Files
<< ∧ >>
❖ Insertion with Hashing
Insertion uses similar process to one queries.
// insert tuple t with key=val into rel R
getPageViaHash(R, val, P)
if room in page P {
insert t into P; return
}
for each overflow page Q of P {
if room in page Q {
insert t into Q; return }}
add new overflow page Q
link Q to previous page
insert t into Q
Costinsert : Best: 1r + 1w Worst: 1+max(OvLen))r + 2w
COMP9315 21T1 ♢ Hashed Files ♢ [10/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
11/17
2021/4/28 Hashed Files
<< ∧ >>
❖ Deletion with Hashing
Similar performance to select on non-unique key:
// delete from R where k = val
// f = data file … ovf = ovflow file getPageViaHash(R, val, P)
ndel = delTuples(P,k,val)
if (ndel > 0) put_page(dataFile(R),P.pid,P) for each overflow page Q of P {
ndel = delTuples(Q,k,val)
if (ndel > 0) put_page(ovFile(R),Q.pid,Q)
}
Extra cost over select is cost of writing back modi ed pages.
Method works for both unique and non-unique hash keys.
COMP9315 21T1 ♢ Hashed Files ♢ [11/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
12/17
2021/4/28 Hashed Files
<< ∧ >>
❖ Problem with Hashing…
So far, discussion of hashing has assumed a xed le
size (b).
What size le to use?
thesizeweneedrightnow (performance degrades as le over ows)
themaximumsizewemighteverneed (signifcant waste of space)
Problem: change le size ⇒ change hash function ⇒ rebuild le
Methods for hashing with les whose size changes:
extendiblehashing,dynamichashing (needa directory, no over ows)
linearhashing (expands le”sytematically”,no directory, has over ows)
COMP9315 21T1 ♢ Hashed Files ♢ [12/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
13/17
2021/4/28 Hashed Files
❖ Flexible Hashing
All exible hashing methods …
treat hash as 32-bit bit-string
adjust hashing by using more/less bits
Start with hash function to convert value to bit- string:
uint32 hash(unsigned char *val)
Require a function to extract d bits from bit-string:
unit32 bits(int d, uint32 val) Use result of bits() as page address.
COMP9315 21T1 ♢ Hashed Files ♢ [13/15]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
14/17
2021/4/28 Hashed Files
<< ∧ >>
❖ Flexible Hashing (cont)
Important concept for exible hashing: splitting
consider one page (all tuples have same hash value)
recompute page numbers by considering one extra bit
if current page is 101, new pages have hashes 0101 and 1101
some tuples stay in page 0101 (was 101) some tuples move to page 1101 (new page)
also, rehash any tuples in over ow pages of page
101
Result: expandable data le, never requiring a complete le rebuild
COMP9315 21T1 ♢ Hashed Files ♢ [14/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
15/17
2021/4/28 Hashed Files
❖ Flexible Hashing (cont) Example of splitting:
<< ∧
Tuples only show key value; assume h(val) = val
COMP9315 21T1 ♢ Hashed Files ♢ [15/15]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
16/17
2021/4/28 Hashed Files
Produced: 7 Mar 2021
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/file-hashed/slides.html
17/17