Persistence: RAID
Andrea Arpaci-Dusseau CS 537, Fall 2019
ADMINISTRIVIA
Project 5: Due yesterday
– Working on grading…
Project 6: MapReduce with locking (not xv6) – Avail Thu or Fri
Midterm 2: Nov 11/6 (Wed) from 7:30-9:30pm
– Practice exams available
– Discussion section: Midterm Review
– Rooms:
– Mostly Concurrency
+ Some Virtualization (usually repeated from Midterm1)
– No Persistence
– Lab Hours -> Office Hours in CS 1207
Learning Outcomes
CS 537 Andrea C. Arpaci-Dusseau Introduction to Operating Systems Remzi H. Arpaci-Dusseau
Why use more than one disk?
What are the different RAID levels? (striping, mirroring, parity) Which RAID levels are best for reliability? for capacity?
Which are best for performance? (sequential vs. random reads and writes)
Persistence
How to ensure data is available across reboots
– even after power outages, hardware failure, system crashes?
Topics:
– Persistent storage devices (HDDs, R AID, SSDs)
– FileAPIforprocesses
– FSimplementation(meta-datastructures,allocationpolicies)
– Crashrecovery(journaling)
– Advanced Topics: Distributed systems?
Hierarchical buses
Hardware support for I/O
read/write head
Disk Terminology
spindle
platter surface
sector
track
cylinder
(stack of tracks across all surfaces)
Review points
Disks: Linear array of sectors (512 bytes) IO Time = Seek + Rotation + Transfer Sequential bandwidth >> Random Low-level details of disks:
– Track skew
– Zones
– Buffering/caching on disk
Disk Scheduling
– FCFS
– SS TF (shortest seek time first)
– SPTF (shortest positioning time first – only disk knows)
– Dealing with starvation
• Elevator (Scan) or Circular Scan (C-Scan)
• Bounded Window
– Greedy vs. Truly Optimal
I/O Schedulers
OS
I/O Scheduler
Where should the I/O scheduler go?
Scheduler
Disk
What happens at OS Level?
Assume 2 processes A and B each call read() with C-SCAN
void reader(int fd) {
char buf[1024];
int rv;
while((rv = read(buf)) != 0) {
assert(rv);
// takes short time, e.g., < 1ms
process(buf, rv);
} }
23 16 15 8
How will processes be scheduled?
Stream of requests seen by disk: ABABABA
21 12 11 20 19
2214
13
9 17
10 18
spin
Work Conservation
Work conserving schedulers always do work if work exists – Principle applies to any type of scheduler (CPU too)
Could be better to wait if can anticipate another request will arrive
Such non-work-conserving schedulers are called anticipatory schedulers
– Keeps resource idle while waiting for future requests
Better stream of requests for OS to give disk: AAAAAABBBBBBAAAAAA
I/O Device Summary
Overlap I/O and CPU whenever possible! – Use interrupts, DMA
Storage devices provide common block interface On a disk: Never do random I/O unless you must!
– Quicksort is a terrible algorithm on disk
Spend time to schedule on slow, stateful devices Next: Other storage devices (RAIDs and SSDs/flash)
Today: RAID
Only One Disk?
Sometimes we want many disks — why? • Capacity
• Reliability
• Performance
Challenge:
Most file systems work on only one disk (assume linear array of blocks)
Solution 1: JBOD
JBOD: Just a Bunch Of Disks
Application stores different files on different file systems
Application
FS FS FS FS
Disadvantages:
Application must manage multiple devices Not portable
Solution 2: RAID
RAID: Redundant Array of Inexpensive Disks
RAID is:
- -
Application
FS
Fake Logical Disk
Logical disk gives
- - -
transparent deployable
capacity performance reliability
Build logical disk from many physical disks
Why Inexpensive Disks?
Alternative to RAID: buy an expensive, high-end disk RAID Approach
– Economies of scale! Commodity disks cost less
– Can buy many commodity H/W components for same price as
few high-end components
– Write software to build high-quality logical devices from many cheap devices
General Strategy: MAPPING
Build fast, large disk from smaller disks 0 100 200
RAID
Disk
Disk
0 1000 100
RAID Mapping
How should RAID map logical block addresses to physical block addresses? – Some similarity to virtual memory
1) Dynamic mapping (logical x sometimes maps to physical y and sometimes z):
– Use data structure (array, hash table, tree)
– Page tables
2) Static mapping (logical x always maps to physical y): – Use simple math
– RAID
General Strategy: REDUNDANCY
Add even more disks for reliability
0
100
RAID
200
Disk Disk
100 0 100
Disk
Disk
0 100 0
100 0
Redundancy
How many physical copies should RAID keep for every logical block? Increase number of copies:
– improves reliability (and maybe performance)
Decrease number of copies
– improves space efficiency
1) RAID Level:
Reasoning About RAID
system for mapping logical to physical blocks
2) Workload:
types of reads/writes issued by applications (sequential vs. random)
3) Metric:
capacity, reliability, performance
1) RAID Decisions
Which logical blocks map to which physical blocks on disks?
How to use extra physical blocks (if any)?
Different RAID levels make different trade-offs RAID 0: Striping
RAID 1: Mirroring RAID 4: Parity
RAID 5: Rotated Parity
2) Workloads
Reads
One operation (for latency)
Steady-state I/O (for throughput or bandwidth)
Sequential Random
Writes
One operation (for latency)
Steady-state I/O (for throughput or bandwidth)
Sequential Random
3) Metrics
Capacity: how much space is available to higher levels?
Reliability: how many disks can RAID safely lose? (assume fail stop!) Performance: how long does each workload take?
Normalize each to characteristics of one disk
N := number of disks
C := capacity of 1 disk (500 GB?)
S := sequential throughput of 1 disk (100 MB/s?) R := random throughput of 1 disk (5 MB/s?)
D := latency of one small I/O operation
RAID-0: Striping
Optimize for capacity. No redundancy Logical Blocks:
0
1
2
3
4
5
6
7
0
1
2
3
0
1
2
3
Disk 0 Disk 1 01 23 45 67
Disk 0
Disk 1
RAID-0: 4 disks
Disk 0 Disk 1 Disk 2 Disk 3 0123 4567
8 9 10 11 12 13 14 15
RAID-0: 4 disks
Disk 0 Disk 1 Disk 2 Disk 3 0123
stripe:
4567
8 9 10 11 12 13 14 15
Given logical address A, find: Disk = ...
Offset = ...
Given logical address A, find: Disk = A % disk_count Offset = A / disk_count
Chunk size = 1
Real Systems: Chunk Size
Disk 0 Disk 1 Disk 2 Disk 3 0123 4567
8 9 10 11
12 13 14 15
Simplification: assume chunk size of 1
Disk 0 Disk 1 Disk 2 Disk 3 0246 1357
stripe:
Chunk size = 2
8 10 12 14 9 11 13 15
RAID-0 Visualization
1. . /example-stripe-one-request. csh
2. . /example-stripe-random-reads. py
3. . /example-stripe-random-writes. py
RAID-0: Analysis
What is capacity?
How many disks can fail (no loss)? Latency?
Throughput (sequential, random)?
Buying more disks improves throughput, but not latency!
N*C
0 D
N*S , N*R
N := number of disks
C := capacity of 1 disk
S := sequential throughput of 1 disk
R := random throughput of 1 disk
D := latency of one small I/O operation
Disk 0 Disk 1 Disk 2 Disk3 0123 4567
8 9 10 11
12 13 14 15
RAID-1: Mirroring
Logical Blocks:
Disk 0 Disk 1 Keep two copies of all data
0
1
2
3
0
1
2
3
0
1
2
3
RAID-10: Mirroring + Striping
2 disks
Disk 0 Disk 1 00 11 22 33
Disk 0 0011 2233 4455 6677
Disk 1 Disk 2 Disk 3
4 disks
Raid-1: Mirroring
Disk 0 Disk 1 Disk 2 Disk 3 0011 2233 4455 6677
How many disks can fail without losing any data?
Assume disks are fail-stop
- each disk works or it doesn’t - system knows when disk fails
Tougher Errors:
- latent sector errors
- silent data corruption
Always handle 1 disk failure
May handle N/2 if to different replicas
RAID-1 Visualization
4) . /example-mirror-random-reads. py
5) . /example-mirror-random-writes. py
6) . /example-mirror-seq-reads. csh
7) . /example-mirror-seq-reads-slomo. csh
RAID-1: Analysis
N/2 * C
1 (or maybe N / 2) D
What is capacity?
How many disks can fail? Latency (read, write)?
N := number of disks
C := capacity of 1 disk
S := sequential throughput of 1 disk
R := random throughput of 1 disk
D := latency of one small I/O operation
Disk 0
0 2233 4455 6677
Disk 1 0
Disk 2 1
Disk 3 1
RAID-1: Throughput
What is steady-state throughput for - random reads?
- random writes?
- sequential writes?
- sequential reads?
Disk 0 Disk 1 Disk 2 Disk 3 0011 2233 4455 6677
N*R N/2 * R
N/2 * S
Book: N/2 * S (other models: N * S)
Side Issue: Crashes
Disk0 Disk1 0
1 2 3
write(A) to 2
A
B
C
D
A
B
C
D
Side Issue: Crashes
Disk0 Disk1 0
1 2 3
write(A) to 2
A
B
A
D
A
B
C
D
Side Issue: Crashes
Disk0 Disk1 0
1 2 3
write(A) to 2
A
B
A
D
A
B
A
D
Side Issue: Crashes
Disk0 Disk1 0
1 2 3
write(T) to 3
A
B
A
T
A
B
A
D
Side Issue: Crashes
Disk0 Disk1 0
1 2 3
CR ASH!
A
B
A
T
A
B
A
D
Side Issue: Crashes
Disk0 Disk1 0
1 2 3
after reboot, how to tell which data is right?
A
A
B
B
A
A
T
D
Crashes: H/W Solution
Problem: Consistent-Update Problem
Use non-volatile RAM in RAID controller
Can replay to ensure all copies are updated
Software RAID controllers (e.g., Linux md) don’t have this option
R AID-1
R AID-4
R AID-0 Capacity
Reliability
Raid-4 Strategy
Use one disk for parity
In algebra:
Equation with N variables and N-1 are known, can often solve for unknown
Treat sectors across disks in a stripe as equation Data on bad disk is the unknown in equation
Disk 0 0
3
6
9
Disk 1 1
4
7
10
Disk 2 2
5
8
11
Disk 3 P0
P1
P2
P3
RAID-4 with Parity
Data blocks on disks 0 – 2
Disk 3 for parity
Parity calculated over data blocks in stripe P0 = B0 XOR B1 XOR B2
Parity Example: 1
Disk0 Disk1 Disk2 Disk3 Disk4 Stripe:
010 011 101 111 011
Calculate parity from data blocks
Parity = D0 XOR D1 XOR D2 XOR D3
(parity)
Parity Example: 1
Disk0 Disk1 Disk2 Disk3 Disk4 Stripe:
(parity)
Can reconstruct blocks of lost disk by taking XOR D1 = D0 XOR D2 XOR D3
010
011
101 111 011
Updating Parity: XOR
If write “0110” to block 0, how should parity be updated?
One approach: read all other N-2 blocks in stripe and calculate new parity Second approach: Read old value at block 0
1100
Read old value for parity
0101
Calculate new parity
1111
Write out new parity
à 2 reads and 2 writes (1 read and 1 write to parity block)
RAID-4 Visualization
8) . /example-raid4-random-reads. csh
9) . /example-raid4-full-stripe-writes. csh
10) ./example-raid4-random-writes.csh
RAID-4: Analysis
(N-1) * C 1
D, 2*D (read and write parity disk)
What is capacity?
How many disks can fail? Latency (read, write)?
N := number of disks
C := capacity of 1 disk
S := sequential throughput of 1 disk
R := random throughput of 1 disk
D := latency of one small I/O operation
Disk 0 0
3
6
9
Disk 1 1
4
7
10
Disk 2 2
5
8
11
Disk 3 P0
P1
P2
P3
What is steady-state throughput for - sequential reads?
- sequential writes?
- random reads?
- random writes?
N := number of disks
C := capacity of 1 disk
S := sequential throughput of 1 disk
R := random throughput of 1 disk
D := latency of one small I/O operation
how to avoid parity bottleneck?
RAID-4: Throughput
(N-1) * S
(N-1) * S (parity calculated for full stripe) (N-1) * R
R/2 (read and write parity disk)
Disk 0 0
3
6
9
Disk 1 1
4
7
10
Disk 2 2
5
8
11
Disk 3 P0
P1
P2
P3
RAID-5
Disk0 Disk1 Disk2 Disk3 Disk4
----P
---P-
--P--
...
Rotate parity across different disks
Where exactly do individual data blocks go?
Left-symmetric RAID-5
D0 D1 D2 D3 D4
0 1 2 3 P0 5 6 7 P1 4 1011P28 9 15 P3 12 13 14 P4 16 17 18 19
Pattern repeats...
RAID-5 Visualization
11) ./example-raid5-simple.csh
12) ./example-raid5-random-reads.csh 13) ./example-raid5-random-writes.csh
RAID-5: Analysis
(N-1) * C 1
D, 2*D (read and write parity disk)
What is capacity?
How many disks can fail? Latency (read, write)?
These metrics same as RAID-4...
N := number of disks C := capacity of 1 disk
S := sequential throughput of 1 disk
R := random throughput of 1 disk
D := latency of one small I/O operation
Disk0Disk1Disk2Disk3Disk4
...
----P
---P-
--P--
Steady-state throughput for RAID-4 - sequential reads?
- sequential writes?
- random reads?
- random writes?
(N-1) * S
(N-1) * S (parity calculated for full stripe) (N-1) * R
R/2 (read and write parity disk)
What is steady-state throughput for RAID-5? - sequential reads?
- sequential writes?
- random reads?
- random writes?
Disk0Disk1Disk2Disk3Disk4
...
RAID-5: Throughput
(N-1) * S (N-1) * S
(N) * R N * R/4
----P
---P-
--P--
Higher RAID Levels
RAID-6 can handle more than 1 disk failure Use multiple parity blocks
(Not covered in this course)
RAID Level Comparisons
Reliability Capacity
RAID-0 0 R AID-1 1 RAID-4 1 RAID-5 1
C*N C*N/2 (N-1) * C (N-1) * C
RAID LEVEL Comparisons
ReadLatency WriteLatency RAID-0 D D RAID-1 D D
RAID-4 D 2D RAID-5 D 2D
RAID-0 RAID-1 R AID-4 RAID-5
Seq Read N*S N/2 * S (N-1)*S (N-1)*S
Seq Write N*S
N/2 * S (N-1)*S (N-1)*S
Rand Read N*R
N * R (N-1)*R
N * R
Rand Write N*R
N/2 * R R/2
N/4 * R
RAID Level Comparisons
RAID-5 is strictly better than RAID-4
Seq Read RAID-0 N*S
RAID-1 N/2*S RAID-5 (N-1)*S
Seq Write N*S N/2*S (N-1)*S
Rand Read N*R N*R
N * R
Rand Write N*R N/2*R N/4 * R
RAID Level Comparisons
RAID-0 is always fastest and has best capacity (but at cost of reliability) RAID-1 better than RAID-5 for random workloads
RAID-5 better than RAID-1 for sequential workloads
RAID Summary
Block-based interface:
Very deployable and popular storage solution due to transparency
Many engineering tradeoffs with RAID
Capacity, reliability, performance for different workloads
Can build RAID over any other block-based storage device SSDs instead of HDDs!