CS3402
1
CS3402 : Chapter 8 Files and Hashed Files
Semester B, 2020/2021
Storage Medium for Databases
Memory hierarchy
Storage medium: CPU register -> CPU cache -> RAM ->
CS3402
2
magnetic disks/optical disks -> offline storage
Slower in access delay but larger in memory size (less expensive)
Disk Storage Devices
Preferred secondary storage device for high storage capacity and low cost
Data are stored as magnetized areas on magnetic disk surfaces
A disk pack contains several magnetic disks connected to a rotating spindle
CS3402 3
Disk Storage Devices
Disks are divided into concentric circular tracks on each disk surface
Track capacities vary typically from 4 to 50 Kbytes or more A track is divided into fixed size sectors and then into blocks
CS3402
4
Typical block sizes range from B=512 bytes to B=4096 bytes
Whole blocks are transferred between disk and main memory for processing
Files of Records
A record is a collection of data values (fields).
A file (e.g., table) is a sequence of records (e.g., tuples). A file can have fixed-length or variable length records.
CS3402 5
Records
Fixed length records:
Every record in the file has exactly the same size (in bytes)
Variable length records:
CS3402
6
If different records in the file have different sizes
Separator characters or length fields are needed so that the record can be “parsed”
Record Blocking
The records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory.
When block size > record size, each block will contain numerous records.
Blocking: storing a number of records into one block on the disk. Blocking factor (bfr) = number of records per block
Suppose that the block size is B bytes. For a file of fixed-length records of size R bytes, with B ≥ R, we can fit bfr=B/R records per block, where the x (floor function) rounds down the number x to an integer.
Unused space in each block = B – (bfr *R) bytes.
CS3402 7
Spanned vs Unspanned Records
File records can be unspanned or spanned
Spanned: a record can be stored in more than one block
Unspanned: no record can span two blocks
In a file of fixed-length records, unspanned blocking is usually used.
For variable-length records, either a spanned or an unspanned organization can be used.
It is advantageous to use spanning to reduce the lost space in each block. CS3402 8
Methods for Organizing Records on Disks
1. Unordered records (Heap files)
2. Ordered records (Sorted files)
3. Hashed records
CS3402 9
Unordered Files
Also called a heap file (records are unordered)
Insertion: New records are inserted at the end of the file
Arranged in their insertion sequence
Record insertion is efficient (add to the end)
Searching: A linear search through the file records is necessary to
search for a record
This requires reading and searching half the file blocks on
average, and is hence quite expensive (n/2) Worst case, all records (n)
Reading the records in order of a particular field requires sorting the file records
CS3402
10
9 16 50 2 10 4 8 12 60 100
Ordered Files
Also called a sequential file (records are ordered)
File records are kept sorted by the values of an ordering field
Insertion is expensive: records must be inserted in the correct order
It is common to keep a separate unordered overflow file for new records to improve insertion efficiency; this is periodically merged with the main ordered file
Reading the records in order of the ordering field is quite efficient 2 4 8 9 1012165060100
CS3402
11
Binary Search for Ordered Files
• A binary search can be used to search for a record on its ordering field value
• This requires reading and searching log2n of the file blocks on the average, an improvement over linear search
CS3402 12
Ordered File Example
Some blocks of an ordered file of EMPLOYEE records with Name as the ordering key field.
CS3402 13
Average Access Times
The following table shows the average access time to access a specific record for a given type of file
CS3402 14
Hashed Files
Hashing for disk files is called External Hashing (files on disk)
The file blocks are divided into M equal-sized buckets, numbered
bucket0, bucket1, …, bucketM-1
One of the file fields is designated to be the hash key of the file
If a record has search key K, we store the record by linking it to the bucket list for the bucket numbered h(K)
CS3402
15
h(K) is a hash function that transforms the hash field value into an integer between 0 and M-1.
Hashed Files
CS3402
16
4, 8
5, 13
2, 6, 10, 14
Bucket 0
Bucket 1
Bucket 2
Bucket 3
Key value (k)
4
h(k) = k mod 4
External Hashed Files
CS3402 17
Example Hashed Tables
We assume a block can hold two records and M = 4, i.e., the hash function h returns values from 0 to 3
Example: We add to the hash table a record with key g and h(g) = 1. Then we add the new record to the bucket numbered 1
Collisions occur when a new record hashes to a bucket that is already full -> Collision resolution is needed.
0
4
0
4
CS3402
18
51 1
5 1
9
1
22
33 33
7
7
22
Collision Resolution by Open Addressing
Open addressing: proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
CS3402
19
Linear Probing
If collide, try Bucket_id+1, Bucket_id+2, …, Bucket_id+n
Quadratic Probing
If collide, try Bucket_id+1, Bucket_id+4,…, Bucket_id+n2
Collision Resolution by Chaining
Chaining:
For this method, various overflow locations are kept, usually by
CS3402
20
extending the array with a number of overflow positions In addition, a pointer field is added to each bucket
A collision is resolved by:
placing the new record in an unused overflow bucket and
setting the pointer of the occupied hash address bucket to the address of that overflow bucket
Collision Resolution by Chaining
CS3402 21
Hashed Files
To reduce the number of overflow, a hash file is typically kept 70- 80% full
The hash function h should distribute the records uniformly among the buckets
Otherwise, search time will be increased because many overflow records will exist
Searching overflow records are more expensive Main disadvantages of static external hashing:
Fixed number of buckets M -> Difficult to expand and shrink the file dynamically
-> Need hashing techniques that allow dynamic file expansion CS3402 22
Extendible and Dynamic Hashing
Dynamic Hashing and Extendible Hashing:
Hashing techniques are extended to allow dynamic growth and
shrinking of the number of file records
Both dynamic and extendible hashing use the binary representation
(e.g., 1100…) of the hash value h(K) in order to access a directory In extendible hashing, the directory is an array of size 2d, where
d is called the global depth
In dynamic hashing, the directory is a binary tree
CS3402 23
Extendible and Dynamic Hashing
The directories can be stored on disk, and they expand or shrink dynamically
Directory entries point to the disk blocks that contain the stored records
An insertion in a disk block that is full causes the block to split into two blocks and the records are redistributed among the two blocks
The directory is updated appropriately
CS3402 24
Extendible Hashing
A directory consisting of an array of 2d bucket addresses is maintained
d is called the global depth of the directory
First (high-order) d bits of a hash value is used as an index to
CS3402
25
the array to determine a directory entry
The address in that entry determines the bucket storing the records
Local depth d’ specifies the number of bits on which the bucket contents are based
The value of d’ can be increased or decreased by one at a time to handle overflow or underflow respectively
Extendible Hashing
CS3402 26
Extendible Hashing Example
Suppose the hash function produces a sequence of four bits
At the beginning, only one bit is used as illustrated by d =1
The directory therefore has only two entries and points to two blocks. Assume that each block can hold at most two records.
CS3402
1001 1100
27
The first block holds all the current records whose search keys hash to a bit sequence beginning with 0
The second block holds all those whose search keys hash to a
sequence beginning with 1
d’=1 0001
0 1
d’=1
d=1
0 1
d’=1
00 01 10 11
1010 1100
Extendible Hashing Example
Suppose we insert a record whose key hash to the sequence 1010
Since the first bit is 1, it belongs in the second block.
However, the second block is full. It needs to be split into two buckets 10 and 11. The records are redistributed in these two buckets.
The local depth and global depth are increased.
0001 d=1
d=2
1001
CS3402
28
1001 1100
1100
d’=2
d’=1
0001
d’=1
d’=2
Extendible Hashing Example
Alternative notation: Take last d bits (instead of first d bits) of the hash value.
CS3402 29
Dynamic Hashing
Dynamic hashing maintains tree-structured directory with two types of nodes:
CS3402
30
Internal nodes that have two pointers: the left pointer corresponding to the 0 bit (in the hash address) and a right pointer corresponding to the 1 bit
Leaf nodes: these hold a pointer to the actual bucket with records
Dynamic Hashing
CS3402 31
Extendible vs Dynamic Hashing
CS3402 32
Reference
7e
Ch. 16
CS3402 33