CS计算机代考程序代写 database chain cache CS3402

CS3402
1
CS3402 : Chapter 8 Files and Hashed Files
Semester B, 2020/2021

Storage Medium for Databases
 Memory hierarchy
Storage medium: CPU register -> CPU cache -> RAM ->
CS3402
2
magnetic disks/optical disks -> offline storage
Slower in access delay but larger in memory size (less expensive)

Disk Storage Devices
 Preferred secondary storage device for high storage capacity and low cost
 Data are stored as magnetized areas on magnetic disk surfaces
 A disk pack contains several magnetic disks connected to a rotating spindle
CS3402 3

Disk Storage Devices
 Disks are divided into concentric circular tracks on each disk surface
Track capacities vary typically from 4 to 50 Kbytes or more  A track is divided into fixed size sectors and then into blocks
CS3402
4
Typical block sizes range from B=512 bytes to B=4096 bytes
Whole blocks are transferred between disk and main memory for processing

Files of Records
 A record is a collection of data values (fields).
 A file (e.g., table) is a sequence of records (e.g., tuples).  A file can have fixed-length or variable length records.
CS3402 5

Records
 Fixed length records:
Every record in the file has exactly the same size (in bytes)
 Variable length records:
CS3402
6
If different records in the file have different sizes
Separator characters or length fields are needed so that the record can be “parsed”

Record Blocking
 The records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory.
 When block size > record size, each block will contain numerous records.
Blocking: storing a number of records into one block on the disk.  Blocking factor (bfr) = number of records per block
Suppose that the block size is B bytes. For a file of fixed-length records of size R bytes, with B ≥ R, we can fit bfr=B/R records per block, where the  x (floor function) rounds down the number x to an integer.
Unused space in each block = B – (bfr *R) bytes.
CS3402 7

Spanned vs Unspanned Records
 File records can be unspanned or spanned
Spanned: a record can be stored in more than one block
Unspanned: no record can span two blocks
 In a file of fixed-length records, unspanned blocking is usually used.
 For variable-length records, either a spanned or an unspanned organization can be used.
 It is advantageous to use spanning to reduce the lost space in each block. CS3402 8

Methods for Organizing Records on Disks
1. Unordered records (Heap files)
2. Ordered records (Sorted files)
3. Hashed records
CS3402 9

Unordered Files
 Also called a heap file (records are unordered)
 Insertion: New records are inserted at the end of the file
Arranged in their insertion sequence
Record insertion is efficient (add to the end)
 Searching: A linear search through the file records is necessary to
search for a record
This requires reading and searching half the file blocks on
average, and is hence quite expensive (n/2) Worst case, all records (n)
 Reading the records in order of a particular field requires sorting the file records
CS3402
10
9 16 50 2 10 4 8 12 60 100

Ordered Files
 Also called a sequential file (records are ordered)
 File records are kept sorted by the values of an ordering field
 Insertion is expensive: records must be inserted in the correct order
It is common to keep a separate unordered overflow file for new records to improve insertion efficiency; this is periodically merged with the main ordered file
 Reading the records in order of the ordering field is quite efficient 2 4 8 9 1012165060100
CS3402
11

Binary Search for Ordered Files
• A binary search can be used to search for a record on its ordering field value
• This requires reading and searching log2n of the file blocks on the average, an improvement over linear search
CS3402 12

Ordered File Example
Some blocks of an ordered file of EMPLOYEE records with Name as the ordering key field.
CS3402 13

Average Access Times
 The following table shows the average access time to access a specific record for a given type of file
CS3402 14

Hashed Files
 Hashing for disk files is called External Hashing (files on disk)
 The file blocks are divided into M equal-sized buckets, numbered
bucket0, bucket1, …, bucketM-1
 One of the file fields is designated to be the hash key of the file
 If a record has search key K, we store the record by linking it to the bucket list for the bucket numbered h(K)
CS3402
15
h(K) is a hash function that transforms the hash field value into an integer between 0 and M-1.

Hashed Files
CS3402
16
4, 8
5, 13
2, 6, 10, 14
Bucket 0
Bucket 1
Bucket 2
Bucket 3
Key value (k)
4
h(k) = k mod 4

External Hashed Files
CS3402 17

Example Hashed Tables
 We assume a block can hold two records and M = 4, i.e., the hash function h returns values from 0 to 3
 Example: We add to the hash table a record with key g and h(g) = 1. Then we add the new record to the bucket numbered 1
 Collisions occur when a new record hashes to a bucket that is already full -> Collision resolution is needed.
0
4
0
4
CS3402
18
51 1
5 1
9
1
22
33 33
7
7
22

Collision Resolution by Open Addressing
Open addressing: proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
CS3402
19
Linear Probing
If collide, try Bucket_id+1, Bucket_id+2, …, Bucket_id+n
Quadratic Probing
If collide, try Bucket_id+1, Bucket_id+4,…, Bucket_id+n2

Collision Resolution by Chaining
 Chaining:
For this method, various overflow locations are kept, usually by
CS3402
20
extending the array with a number of overflow positions In addition, a pointer field is added to each bucket
A collision is resolved by:
placing the new record in an unused overflow bucket and
setting the pointer of the occupied hash address bucket to the address of that overflow bucket

Collision Resolution by Chaining
CS3402 21

Hashed Files
 To reduce the number of overflow, a hash file is typically kept 70- 80% full
 The hash function h should distribute the records uniformly among the buckets
Otherwise, search time will be increased because many overflow records will exist
Searching overflow records are more expensive  Main disadvantages of static external hashing:
Fixed number of buckets M -> Difficult to expand and shrink the file dynamically
-> Need hashing techniques that allow dynamic file expansion CS3402 22

Extendible and Dynamic Hashing
 Dynamic Hashing and Extendible Hashing:
Hashing techniques are extended to allow dynamic growth and
shrinking of the number of file records
 Both dynamic and extendible hashing use the binary representation
(e.g., 1100…) of the hash value h(K) in order to access a directory In extendible hashing, the directory is an array of size 2d, where
d is called the global depth
In dynamic hashing, the directory is a binary tree
CS3402 23

Extendible and Dynamic Hashing
 The directories can be stored on disk, and they expand or shrink dynamically
Directory entries point to the disk blocks that contain the stored records
 An insertion in a disk block that is full causes the block to split into two blocks and the records are redistributed among the two blocks
The directory is updated appropriately
CS3402 24

Extendible Hashing
 A directory consisting of an array of 2d bucket addresses is maintained
 d is called the global depth of the directory
First (high-order) d bits of a hash value is used as an index to
CS3402
25
the array to determine a directory entry
The address in that entry determines the bucket storing the records
 Local depth d’ specifies the number of bits on which the bucket contents are based
The value of d’ can be increased or decreased by one at a time to handle overflow or underflow respectively

Extendible Hashing
CS3402 26

Extendible Hashing Example
 Suppose the hash function produces a sequence of four bits
 At the beginning, only one bit is used as illustrated by d =1
 The directory therefore has only two entries and points to two blocks. Assume that each block can hold at most two records.
CS3402
1001 1100
27
The first block holds all the current records whose search keys hash to a bit sequence beginning with 0
The second block holds all those whose search keys hash to a
sequence beginning with 1
d’=1 0001
0 1
d’=1
d=1

0 1
d’=1
00 01 10 11
1010 1100
Extendible Hashing Example
 Suppose we insert a record whose key hash to the sequence 1010
 Since the first bit is 1, it belongs in the second block.
 However, the second block is full. It needs to be split into two buckets 10 and 11. The records are redistributed in these two buckets.
 The local depth and global depth are increased.
0001 d=1
d=2
1001
CS3402
28
1001 1100
1100
d’=2
d’=1
0001
d’=1
d’=2

Extendible Hashing Example
 Alternative notation: Take last d bits (instead of first d bits) of the hash value.
CS3402 29

Dynamic Hashing
 Dynamic hashing maintains tree-structured directory with two types of nodes:
CS3402
30
Internal nodes that have two pointers: the left pointer corresponding to the 0 bit (in the hash address) and a right pointer corresponding to the 1 bit
Leaf nodes: these hold a pointer to the actual bucket with records

Dynamic Hashing
CS3402 31

Extendible vs Dynamic Hashing
CS3402 32

Reference
 7e
Ch. 16
CS3402 33