CMPT 454
A traditional database requires both persistent and working (transient) memory
▪ The data is stored in non-volatile secondary storage to reduce the risk of data loss
▪ Hard Disk Drive (HDD) or Solid State Drive (SSD)
▪ Data is transferred from disk to main memory
▪ Where operations (read, update, write, …) are performed on the data
New technologies provide alternatives to HDDs ▪ However they are likely to be the primary storage
medium for many databases for some time
John Edgar 2
disk head
for 1st surface
surface
platters rotate 10,ooo rpm
disk head array
moves in and out
platters
each has 2 surfaces
John Edgar
3
Areas on a disk are magnetized to store bit values
▪ Grouped into bytes
▪ The capacity of disks for personal computers ranges from hundreds of gigabytes to a few terabytes
▪ Server and mainframe disks are often 10TB or more Disks are made magnetic material
▪ Referred to as platters
▪ Single or double sided
▪ Multiple platters may be grouped together in a disk to increase capacity
John Edgar 4
surfaces
platter
cylinder
John Edgar
5
track
set of tracks with same diameter on all surfaces
surfaces are made up of concentric rings called tracks
tracks are divided into arcs called sectors which
… contain a contiguous sequence of bytes
blocks are the units which are read from or written to
when the disk is formatted the block size is set to a small number of sectors, 4 to 16 kb
Block sizes vary
▪ Typically ranges from 512 to 8,192 bytes
▪ Blocks are separated by fixed sized gaps ▪ Which contain control information
Blocks can be addressed by cylinder number, surface number and block number
▪ In many modern disk drives a single Logical Block Address (LBA) identifies a block
▪ Which are numbered from 0 to block capacity – 1 Blocks map to pages
▪ A higher level abstraction
John Edgar 6
A sector is an arc of a track
▪ Tracks closer to the centre of the
disk have smaller arc lengths There are a number of
different sector organizations ▪ Sectors subtending a fixed angle
▪ Sectors on different tracks have different recording densities
▪ Uniform recording density
▪ The same arc on different tracks
holds different numbers of sectors ▪ Combination of the two
John Edgar 7
Query processor requests a record
▪ Request handled by the buffer manager
▪ Data from a disk is read or written in units of a block
▪ Blocks typically contain multiple records
▪ The desired block is read from disk into main memory
main memory
disk
blocks record
John Edgar
8
Most writes require an initial read of a disk block
▪ Even blind writes
Consider inserting a new
record into a disk block
▪ Which contains existing records
The block must be read first
▪ To preserve the existing data
▪ Referred to as the read-modify- write cycle
new record
disk
key existing records
main memory
main memory
write
disk
existing data overwritten
no read-modify-write cycle
main memory
read write disk
read-modify-write cycle
John Edgar 9
There is one disk head for each surface
▪ Moved together as a unit called a disk head array
▪ All disk heads are in identical positions with respect to their surfaces
▪ To read or write a block a disk head must be positioned over it
Only one disk head can read or write at a time
John Edgar 10
To access a block on a disk
▪ The disk head pivots over the desired track
▪ Seek time – average 4 to 10 ms
▪ Wait for leading edge of block to reach the disk head
▪ Rotational delay – derived from rpm
▪ The desired block is read as it passes underneath the disk head
▪ Transfer time – derived from rpm
Drive controlled by a processor
▪ Called the disk controller John Edgar
The disk spins at a constant speed
responsible for controlling the actuator, determining when the disc has rotated to a sector, transferring data
11
The seek time and rotational delay depend on ▪ Where the disk head is before the request
▪ Which track is being requested
▪ How far the disk has to rotate
▪ Average rotational delay = 1⁄2 * max rotational delay The transfer time depends on the request size
▪ The transfer time (in ms) for one block equals ▪ (60,000 ÷ disk rpm) ÷ blocks per track
▪ The transfer time (in ms) for an entire track equals
▪ (60,000 ÷ disk rpm)
▪ For a disk with 10,000 rpm = 6ms
60,000?
John Edgar 12
1,000(ms) * 60(s)
Minimum seek time
▪ 0 – the disk head is on the
desired track
Maximum seek time
▪ Time to move from the innermost to outermost track
Average seek time
▪ 1/3 maximum seek time ▪ Not 1⁄2 maximum seek time
In practice the disk head does not move at constant speed
▪ It must accelerate / decelerate John Edgar
disk centre
disk head
disk head
disk edge
13
Transferring data between main memory and register is fast
▪ DRAM 50 ns ▪ Cache 5 ns
▪ Depending on which cache ▪ Register 1 ns
HDD access is very slow in comparison
▪ Can be broken into components
▪ Seek time + rotational delay + transfer time
Main memory vs. disk ▪ 15 ms vs. 0.000,060 ms
▪ 250,000timesfaster John Edgar
7,200 rpm
4.16 ms average latency
vs.
15 ms – estimate of reading one record
comparison from:
https://scoutapm.com/blog/understandi ng-disk-i-o-when-should-you-be- worried
14
Disk access is slow
▪ The largest components are seek time then rotational
Assume average seek time is 10ms
delay
Access two records in adjacent blocks on a track
▪ Seek the track, rotate to first block, and transfer two
and average rotational delay is 4ms
blocks = 10 + 4 + 2*1 = 16ms
Accessing two records on different tracks
▪ Seek the desired track, rotate to the block, and transfer the block, then repeat = (10 + 4 + 1)*2 = 30ms
Solution: store related data in close proximity
John Edgar 15
assume transfer time of 1ms
Approximate order of proximity ▪ Same block
▪ Adjacent blocks on same track
▪ Same track
▪ Same track, different cylinder
▪ Adjacent cylinder
▪ ….
In practice
▪ Fill tracks in same cylinder with related data
▪ Usually from a single table
▪ Then fill tracks in adjacent cylinders
cylinder
John Edgar
16
The disk head array does not have to be moved to read data from the same track on different platters
The minimum unit of transfer of is a block
▪ Multiple contiguous blocks may be transferred as a unit
▪ To a correspondingly sized main memory buffer
▪ This is much faster than reading blocks one at a time ▪ Since seek time and rotational delay are only incurred once
Buffer management is important in reducing access time and includes
▪ Prefetching
▪ Double buffering
… discussed later …
John Edgar
17
Requests to read a block (or blocks) are processed in some order based on the disk scheduling algorithm
There are a variety of such algorithms
▪ First-come-first served (FIFO)
▪ Elevator and its variants
▪ SCAN,LOOK,C-SCAN,C-LOOK,…
▪ Shortest-seek Goals
▪ Reduce overall access time ▪ Avoid starvation
John Edgar 18
2,0001
4,0004
6,0002
10,0006
14,0003
16,0005
A fair algorithm would take a first-come, first-serve approach
▪ Insert requests in a queue and process them in the order in which they are received
Cylinder
Received
Complete
Moved
Total
2,000
0
5
2,000
2,000
6,000
0
14
4,000
6,000
14,000
0
27
8,000
14,000
4,000
10
43
10,000
24,000
16,000
20
60
12,000
36,000
10,000
30
72
6,000
42,000
John Edgar 19
2,0001
4,0004
6,0002
10,0006
14,0003
16,0005
The elevator algorithm generally performs better than FIFO
▪ Requests are buffered and the disk head moves in one direction, processing requests
▪ The arm then reverses direction
Cylinder
Received
Complete
Moved
Total
2,000
0
5
2,000
2,000
6,000
0
14
4,000
6,000
14,000
0
27
8,000
14,000
16,000
20
35
2,000
16,000
10,000
30
46
6,000
22,000
4,000
30
58
6,000
28,000
John Edgar 20
Hard drives fail
The failure probability
follows a bathtub curve ▪ High at the start
▪ Lemons
▪ High at the end
Assume a lifespan of around 3 to 5 years
▪ Back up your data John Edgar
HDD after a head crash
21
Intermittent failure
▪ Multiple attempts are required to read or write a sector ▪ Use checksums to check that incurred data has not been read
Media decay
▪ A bit or a number of bits are permanently corrupted and it is
impossible to read a sector Write failure
▪ A sector cannot be written to or retrieved ▪ Often caused by a power failure during a write
Disk crash
▪ The entire disk becomes unreadable
John Edgar 22
Each sector contains additional bits whose values are
▪ Odd (data) bit sum: checksum bit = 1
▪ Even (data) bit sum: checksum bit = 0 data bit sum is odd
Using a single checksum bit allows errors of only one
bit to be detected reliably
Several checksum bits can be maintained to reduce
the chance of failing to notice an error
▪ e.g. 8 checksum bits, one for each bit position in a byte
based on the data bits in the sector
▪ Simple single-bit checksum maintains an even parity
e.g. 7 data bits and 1 checksum bit data: 0111011 1: checksum bit
John Edgar 23
Compared to main memory hard drives have two major problems
▪ They are painfully slow
▪ They are sadly unreliable
Both these issues are, to some extent,
addressed by using RAID ▪ Or by using SSDs
▪ Or an in-memory database
… there is a certain amount of exaggeration going on here …
John Edgar 24