
Persistence: RAID
Andrea Arpaci-Dusseau CS 537, Fall 2019

ADMINISTRIVIA
Project 5: Due yesterday
– Working on grading…
Project 6: MapReduce with locking (not xv6) – Avail Thu or Fri
Midterm 2: Nov 6 (Wed) from 7:30-9:30pm
– Practice exams available
– Discussion section: Midterm Review
– Rooms:
– Mostly Concurrency
+ Some Virtualization (usually repeated from Midterm 1)
– No Persistence
– Lab Hours -> Office Hours in CS 1207

Learning Outcomes
Why use more than one disk?
What are the different RAID levels? (striping, mirroring, parity)
Which RAID levels are best for reliability? For capacity?
Which are best for performance? (sequential vs. random reads and writes)

Persistence
How to ensure data is available across reboots
– even after power outages, hardware failures, or system crashes?
Topics:
– Persistent storage devices (HDDs, RAID, SSDs)
– File API for processes
– FS implementation (meta-data structures, allocation policies)
– Crash recovery (journaling)
– Advanced Topics: Distributed systems?

Hardware support for I/O
– Hierarchical buses

Disk Terminology
[diagram labeling the parts of a disk]
– read/write head
– spindle
– platter surface
– sector
– track
– cylinder (stack of tracks across all surfaces)
Review points
Disks: linear array of sectors (512 bytes)
I/O Time = Seek + Rotation + Transfer
Sequential bandwidth >> Random
Low-level details of disks:
– Track skew
– Zones
– Buffering/caching on disk
Disk Scheduling
– FCFS
– SSTF (shortest seek time first; see the sketch after this list)
– SPTF (shortest positioning time first – only disk knows)
– Dealing with starvation
• Elevator (Scan) or Circular Scan (C-Scan)
• Bounded Window
– Greedy vs. Truly Optimal
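
To make the SSTF policy above concrete, here is a minimal sketch in C (not from the lecture; pick_sstf, struct request, and the field names are illustrative) of how a scheduler might choose the next request from a pending queue:

    #include <stdlib.h>

    // Hypothetical pending request: just the track it targets.
    struct request {
        int track;
    };

    // SSTF: scan the pending queue and pick the request whose track is
    // closest to the current head position.  FCFS would simply return
    // index 0 instead.  Assumes n >= 1.  (Illustrative sketch only.)
    int pick_sstf(struct request *pending, int n, int head_track) {
        int best = 0;
        int best_dist = abs(pending[0].track - head_track);
        for (int i = 1; i < n; i++) {
            int dist = abs(pending[i].track - head_track);
            if (dist < best_dist) {
                best_dist = dist;
                best = i;
            }
        }
        return best;   // index of the request to service next
    }

A greedy choice like this can starve a far-away request when nearby requests keep arriving, which is exactly why SCAN/C-SCAN or a bounded window is used to deal with starvation.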

I/O Schedulers
Where should the I/O scheduler go?
[diagram: an I/O scheduler inside the OS, and another scheduler inside the disk itself]

What happens at OS Level?
Assume 2 processes A and B each call read() with C-SCAN
void reader(int fd) {
    char buf[1024];
    int rv;
    while ((rv = read(fd, buf, sizeof(buf))) != 0) {
        assert(rv > 0);
        // takes a short time, e.g., < 1 ms
        process(buf, rv);
    }
}

[diagram: spinning disk with the pending requests from A and B spread across its tracks]
How will the processes be scheduled?
Stream of requests seen by the disk: ABABABA

Work Conservation
Work-conserving schedulers always do work if work exists
– The principle applies to any type of scheduler (CPU too)
It can be better to wait if we can anticipate that another request will arrive
Such non-work-conserving schedulers are called anticipatory schedulers
– Keep the resource idle while waiting for future requests
Better stream of requests for the OS to give the disk: AAAAAABBBBBBAAAAAA

I/O Device Summary
Overlap I/O and CPU whenever possible!
– Use interrupts, DMA
Storage devices provide a common block interface
On a disk: never do random I/O unless you must!
– Quicksort is a terrible algorithm on disk
Spend time to schedule on slow, stateful devices
Next: other storage devices (RAIDs and SSDs/flash)
Today: RAID

Only One Disk?
Sometimes we want many disks – why?
• Capacity
• Reliability
• Performance
Challenge: most file systems work on only one disk (they assume a linear array of blocks)

Solution 1: JBOD
JBOD: Just a Bunch Of Disks
The application stores different files on different file systems
[diagram: one application on top of several independent file systems, one per disk]
Disadvantages:
– Application must manage multiple devices
– Not portable

Solution 2: RAID
RAID: Redundant Array of Inexpensive Disks
Build a logical disk from many physical disks
[diagram: application and file system run on top of a "fake" logical disk backed by many physical disks]
RAID is:
– transparent
– deployable
The logical disk gives:
– capacity
– performance
– reliability

Why Inexpensive Disks?
Alternative to RAID: buy an expensive, high-end disk
RAID approach – economies of scale!
– Commodity disks cost less
– Can buy many commodity H/W components for the same price as a few high-end components
– Write software to build high-quality logical devices from many cheap devices

General Strategy: MAPPING
Build a fast, large disk from smaller disks
[diagram: a 200-block logical RAID address space mapped onto two 100-block physical disks]

RAID Mapping
How should RAID map logical block addresses to physical block addresses?
– Some similarity to virtual memory
1) Dynamic mapping (logical x sometimes maps to physical y and sometimes to z):
– Use a data structure (array, hash table, tree)
– Example: page tables
2) Static mapping (logical x always maps to physical y):
– Use simple math
– Example: RAID

General Strategy: REDUNDANCY
Add even more disks for reliability
[diagram: the same 200-block logical disk, now with each block stored on more than one of four physical disks]

Redundancy
How many physical copies should RAID keep of every logical block?
Increasing the number of copies:
– improves reliability (and maybe performance)
Decreasing the number of copies:
– improves space efficiency

Reasoning About RAID
1) RAID level: the system for mapping logical to physical blocks
2) Workload: the types of reads/writes issued by applications (sequential vs. random)
3) Metric: capacity, reliability, performance

1) RAID Decisions
Which logical blocks map to which physical blocks on which disks?
How to use extra physical blocks (if any)?
Different RAID levels make different trade-offs:
– RAID-0: Striping
– RAID-1: Mirroring
– RAID-4: Parity
– RAID-5: Rotated Parity

2) Workloads
Reads
– One operation (for latency)
– Steady-state I/O (for throughput or bandwidth): sequential or random
Writes
– One operation (for latency)
– Steady-state I/O (for throughput or bandwidth): sequential or random

3) Metrics
Capacity: how much space is available to higher levels?
Reliability: how many disks can the RAID safely lose? (assume fail-stop!)
Performance: how long does each workload take?
Normalize each to the characteristics of one disk:
N := number of disks
C := capacity of 1 disk (500 GB?)
S := sequential throughput of 1 disk (100 MB/s?)
R := random throughput of 1 disk (5 MB/s?)
D := latency of one small I/O operation

RAID-0: Striping
Optimize for capacity; no redundancy
Logical blocks: 0 1 2 3 4 5 6 7
2 disks:
Disk 0  Disk 1
   0       1
   2       3
   4       5
   6       7

RAID-0: 4 disks
Disk 0  Disk 1  Disk 2  Disk 3
   0       1       2       3     <- one stripe
   4       5       6       7
   8       9      10      11
  12      13      14      15
Given logical address A, find:
Disk = A % disk_count
Offset = A / disk_count
(chunk size = 1)

Real Systems: Chunk Size
Simplification so far: assume a chunk size of 1 block
With a chunk size of 2:
Disk 0  Disk 1  Disk 2  Disk 3
   0       2       4       6     <- one stripe
   1       3       5       7
   8      10      12      14
   9      11      13      15
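
The chunk-based striping arithmetic can be written down directly. Below is a minimal sketch (not from the lecture; raid0_map, ndisks, and chunk_size are illustrative names) that maps a logical block address A to a physical disk and offset; with chunk_size = 1 it reduces to the Disk = A % disk_count, Offset = A / disk_count formulas above.

    // Map a logical block address to a physical (disk, offset) location
    // in a RAID-0 array with a given chunk size.
    struct location {
        int disk;     // which physical disk
        int offset;   // block offset within that disk
    };

    struct location raid0_map(int A, int ndisks, int chunk_size) {
        struct location loc;
        int chunk    = A / chunk_size;   // which chunk the block lives in
        int in_chunk = A % chunk_size;   // position inside that chunk
        loc.disk   = chunk % ndisks;     // chunks rotate across the disks
        loc.offset = (chunk / ndisks) * chunk_size + in_chunk;
        return loc;
    }

For example, with 4 disks and a chunk size of 2, logical block 10 maps to disk 1 at offset 2, matching the table above.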
RAID-0 Visualization
1) ./example-stripe-one-request.csh
2) ./example-stripe-random-reads.py
3) ./example-stripe-random-writes.py

RAID-0: Analysis
What is capacity? N * C
How many disks can fail (no data loss)? 0
Latency? D
Throughput (sequential, random)? N * S, N * R
Buying more disks improves throughput, but not latency!

RAID-1: Mirroring
Keep two copies of all data
Logical blocks: 0 1 2 3
2 disks:
Disk 0  Disk 1
   0       0
   1       1
   2       2
   3       3

RAID-10: Mirroring + Striping
4 disks:
Disk 0  Disk 1  Disk 2  Disk 3
   0       0       1       1
   2       2       3       3
   4       4       5       5
   6       6       7       7

RAID-1: Mirroring
How many disks can fail without losing any data?
Assume disks are fail-stop:
– each disk either works or it doesn't
– the system knows when a disk fails
Tougher errors (beyond fail-stop):
– latent sector errors
– silent data corruption
Can always handle 1 disk failure
May handle N/2 failures if they hit different replicas

RAID-1 Visualization
4) ./example-mirror-random-reads.py
5) ./example-mirror-random-writes.py
6) ./example-mirror-seq-reads.csh
7) ./example-mirror-seq-reads-slomo.csh

RAID-1: Analysis
What is capacity? N/2 * C
How many disks can fail? 1 (or maybe N/2)
Latency (read, write)? D

RAID-1: Throughput
What is the steady-state throughput for:
– random reads? N * R
– random writes? N/2 * R
– sequential writes? N/2 * S
– sequential reads? Book: N/2 * S (other models: N * S)

Side Issue: Crashes (the consistent-update problem)
Both mirrors start identical; write(A) to block 2 arrives:
Disk 0: A B C D    Disk 1: A B C D
Disk 0 is updated first:
Disk 0: A B A D    Disk 1: A B C D
...then Disk 1:
Disk 0: A B A D    Disk 1: A B A D
Next, write(T) to block 3; Disk 0 is updated:
Disk 0: A B A T    Disk 1: A B A D
CRASH! (before Disk 1 is updated)
After reboot, how to tell which data is right?
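
The full-stripe and subtractive parity calculations above fit in a few lines of C. This is a minimal sketch (not the lecture's code; block contents are modeled as unsigned bit patterns and all names are illustrative):

    // Full-stripe (additive) parity: XOR together all data blocks.
    unsigned parity_full(const unsigned *data, int ndata) {
        unsigned p = 0;
        for (int i = 0; i < ndata; i++)
            p ^= data[i];
        return p;
    }

    // Small-write (subtractive) parity update: 2 reads + 2 writes.
    // new_parity = old_data XOR new_data XOR old_parity
    unsigned parity_small_write(unsigned old_data, unsigned new_data,
                                unsigned old_parity) {
        return old_data ^ new_data ^ old_parity;
    }

Plugging in the slide's numbers: old data 1100, new data 0110, old parity 0101 gives 1100 XOR 0110 XOR 0101 = 1111, matching the new parity shown above.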
Disk 0: A B A T    Disk 1: A B A D

Crashes: H/W Solution
Problem: the consistent-update problem
Use non-volatile RAM in the RAID controller
Can replay the write to ensure all copies are updated
Software RAID controllers (e.g., Linux md) don't have this option

RAID-0 vs. RAID-1 vs. RAID-4: capacity and reliability?

RAID-4 Strategy
Use one disk for parity
In algebra: an equation with N variables where N-1 are known can often be solved for the unknown
Treat the sectors across disks in a stripe as an equation
Data on the bad disk is the unknown in the equation

RAID-4 with Parity
Data blocks on disks 0-2, disk 3 for parity
Parity calculated over the data blocks in each stripe: P0 = B0 XOR B1 XOR B2
Disk 0  Disk 1  Disk 2  Disk 3
   0       1       2      P0
   3       4       5      P1
   6       7       8      P2
   9      10      11      P3

Parity Example
Calculate parity from the data blocks: Parity = D0 XOR D1 XOR D2 XOR D3
Stripe:  Disk0  Disk1  Disk2  Disk3  Disk4 (parity)
          010    011    101    111    011
Can reconstruct the blocks of a lost disk by XORing the rest:
D1 = D0 XOR D2 XOR D3 XOR Parity

Updating Parity: XOR
If we write "0110" to block 0, how should parity be updated?
One approach: read all other N-2 blocks in the stripe and calculate the new parity
Second approach:
– Read the old value at block 0:  1100
– Read the old value of parity:   0101
– Calculate the new parity:       1111
– Write out the new data and new parity
-> 2 reads and 2 writes (1 read and 1 write go to the parity disk)

RAID-4 Visualization
8) ./example-raid4-random-reads.csh
9) ./example-raid4-full-stripe-writes.csh
10) ./example-raid4-random-writes.csh

RAID-4: Analysis
What is capacity? (N-1) * C
How many disks can fail? 1
Latency (read, write)? D, 2*D (read and write the parity disk)

RAID-4: Throughput
What is the steady-state throughput for:
– sequential reads? (N-1) * S
– sequential writes? (N-1) * S (parity calculated once for a full stripe)
– random reads? (N-1) * R
– random writes? R/2 (every write must read and write the parity disk)
How to avoid the parity bottleneck?

RAID-5
Rotate parity across the different disks:
Disk0  Disk1  Disk2  Disk3  Disk4
  -      -      -      -      P
  -      -      -      P      -
  -      -      P      -      -
  ...
Where exactly do the individual data blocks go?

Left-symmetric RAID-5
 D0   D1   D2   D3   D4
  0    1    2    3   P0
  5    6    7   P1    4
 10   11   P2    8    9
 15   P3   12   13   14
 P4   16   17   18   19
Pattern repeats...
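
One way to compute the left-symmetric layout above programmatically is sketched below (raid5_map and the struct are illustrative names, not the lecture's code), assuming N-1 data blocks per stripe with parity rotating from the rightmost disk leftward:

    // Left-symmetric RAID-5 over N disks: in stripe s, the parity block
    // sits on disk (N-1) - (s % N), and the data blocks fill the other
    // disks starting just to the right of parity, wrapping around.
    struct location { int disk; int offset; };

    struct location raid5_map(int A, int ndisks) {
        struct location loc;
        int stripe      = A / (ndisks - 1);   // N-1 data blocks per stripe
        int in_stripe   = A % (ndisks - 1);
        int parity_disk = (ndisks - 1) - (stripe % ndisks);
        loc.disk   = (parity_disk + 1 + in_stripe) % ndisks;
        loc.offset = stripe;
        return loc;
    }

With 5 disks this reproduces the table: block 4 lands on disk 4 at offset 1, and block 8 lands on disk 3 at offset 2.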
RAID-5 Visualization
11) ./example-raid5-simple.csh
12) ./example-raid5-random-reads.csh
13) ./example-raid5-random-writes.csh

RAID-5: Analysis
What is capacity? (N-1) * C
How many disks can fail? 1
Latency (read, write)? D, 2*D (read and write the parity block)
These metrics are the same as RAID-4...

RAID-5: Throughput
Recall RAID-4:
– sequential reads: (N-1) * S
– sequential writes: (N-1) * S (parity calculated for a full stripe)
– random reads: (N-1) * R
– random writes: R/2 (read and write the parity disk)
RAID-5:
– sequential reads: (N-1) * S
– sequential writes: (N-1) * S
– random reads: N * R (every disk holds data, so all can serve reads)
– random writes: N/4 * R (each small write is 4 I/Os, spread across all disks)

Higher RAID Levels
RAID-6 can handle more than 1 disk failure
Uses multiple parity blocks (not covered in this course)

RAID Level Comparisons
         Reliability   Capacity
RAID-0        0          N * C
RAID-1        1          N/2 * C
RAID-4        1          (N-1) * C
RAID-5        1          (N-1) * C

         Read latency   Write latency
RAID-0        D               D
RAID-1        D               D
RAID-4        D              2D
RAID-5        D              2D

         Seq Read   Seq Write   Rand Read   Rand Write
RAID-0     N*S        N*S         N*R         N*R
RAID-1     N/2*S      N/2*S       N*R         N/2*R
RAID-4     (N-1)*S    (N-1)*S     (N-1)*R     R/2
RAID-5     (N-1)*S    (N-1)*S     N*R         N/4*R

RAID-5 is strictly better than RAID-4:
         Seq Read   Seq Write   Rand Read   Rand Write
RAID-0     N*S        N*S         N*R         N*R
RAID-1     N/2*S      N/2*S       N*R         N/2*R
RAID-5     (N-1)*S    (N-1)*S     N*R         N/4*R

RAID-0 is always fastest and has the best capacity (but at the cost of reliability)
RAID-1 is better than RAID-5 for random workloads
RAID-5 is better than RAID-1 for sequential workloads

RAID Summary
Block-based interface: very deployable and popular storage solution due to transparency
Many engineering trade-offs with RAID:
– capacity, reliability, and performance for different workloads
Can build RAID over any other block-based storage device
– SSDs instead of HDDs!
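
To make the comparison table concrete, here is a small sketch (illustrative, not from the lecture) that plugs the example per-disk figures used earlier (100 MB/s sequential, 5 MB/s random) into the throughput formulas:

    #include <stdio.h>

    // Print steady-state throughput (MB/s) for each RAID level using the
    // formulas from the comparison table.  N, S, and R are the per-disk
    // parameters defined earlier in the lecture.
    int main(void) {
        double N = 4, S = 100, R = 5;   // example: 4 disks

        printf("         seq-rd  seq-wr  rnd-rd  rnd-wr\n");
        printf("RAID-0   %6.0f  %6.0f  %6.1f  %6.1f\n", N*S, N*S, N*R, N*R);
        printf("RAID-1   %6.0f  %6.0f  %6.1f  %6.1f\n", N/2*S, N/2*S, N*R, N/2*R);
        printf("RAID-4   %6.0f  %6.0f  %6.1f  %6.1f\n", (N-1)*S, (N-1)*S, (N-1)*R, R/2);
        printf("RAID-5   %6.0f  %6.0f  %6.1f  %6.1f\n", (N-1)*S, (N-1)*S, N*R, N/4*R);
        return 0;
    }

With N = 4 this prints, for example, 400 MB/s sequential for RAID-0 versus 300 MB/s for RAID-4/5, and shows how RAID-4's R/2 random-write throughput fails to scale compared to RAID-5's N/4 * R.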