PERSISTENCE: Crash Consistency
Andrea Arpaci-Dusseau CS 537, Fall 2019
ADMINISTRIVIA
Make-up Points for Project 4
Due 1 week from when made available
File System Structures Quiz (lecture content): due today
Project 6: Due Friday (5pm or midnight); discussion tomorrow for more questions
Project 7: Available immediately after Thanksgiving; xv6 file systems: improvements + checker? Will have Specification Quiz
AGENDA / LEARNING OUTCOMES
How to maintain consistency with power failures / crashes?
• What can go wrong if disk blocks are not updated consistently?
• How can file system be checked and fixed after crash?
• How can journaling be used to obtain atomic updates?
• How can the performance of journaling be improved?
Data Redundancy
Definition:
if A and B are two pieces of data, and knowing A eliminates some or all
values B could be, there is redundancy between A and B
RAID examples:
– mirrored disk (complete redundancy)
– parity blocks (partial redundancy)
File system examples:
– Superblock: field contains total blocks in FS
– Inodes: field contains pointer to data block
– Is there redundancy across these two fields? Why or why not?
File System CONSISTENCY Example
Superblock: field contains total number of blocks in FS value = N
Inode: field contains pointer to data block; possible values in {0, 1, 2, …, N – 1}
Pointers to block N or after are invalid!
Total-blocks field has redundancy with inode pointers
Why is consistency challenging?
File system may perform several disk writes to redundant blocks
If file system is interrupted between writes, may leave data in inconsistent state
Only single-sector writes are guaranteed to be atomic by the disk
What can interrupt write operations? – power loss
– kernel panic
– reboot
File system must update multiple structures: append to /foo/bar
[Figure: reads of the root, foo, and bar inodes and directory data blocks, followed by writes to the data bitmap, bar’s inode, and the new data block]
FILE APPEND EXAMPLE
Written to disk    Result
Db                 Lost data (nothing bad)
I[v2]              Point to garbage; another file could use data block
B[v2]              Space leak
I[v2] + B[v2]      Point to garbage data
I[v2] + Db         Another file could use same data block
B[v2] + Db         Space leak
How can file system fix inconsistencies?
Solution #1:
FSCK = file system checker
Strategy:
After crash, scan whole disk for contradictions and “fix” if needed
Keep file system off-line until FSCK completes
For example, how to tell if data bitmap block is consistent?
Is the corresponding data block allocated or free?
If any pointer to data block, the corresponding bit should be 1; else bit is 0
Read every valid inode + indirect block
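The bitmap check above can be sketched as a small simulation (in-memory structures are hypothetical; a real fsck reads every valid inode and indirect block off the raw disk):

```python
def check_data_bitmap(inode_pointers, bitmap, num_blocks):
    """Cross-check the data bitmap against every inode pointer.

    inode_pointers: block numbers referenced by some inode
                    (direct + indirect); bitmap: list of 0/1 per block.
    Returns the blocks whose bitmap bit contradicts the inodes.
    """
    referenced = set(inode_pointers)
    bad = []
    for b in range(num_blocks):
        expected = 1 if b in referenced else 0  # pointed-to => bit is 1
        if bitmap[b] != expected:
            bad.append(b)
    return bad

# Block 3 is pointed to by an inode but marked free: inconsistent.
print(check_data_bitmap([2, 3], [0, 0, 1, 0, 0], 5))  # -> [3]
```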
Fsck Checks
Do superblocks match?
Do directories contain “.” and “..”?
Do size and numblocks in inodes match?
Is the list of free blocks correct?
Does number of dir entries equal inode link counts?
Do different inodes ever point to same block?
Are there any bad block pointers?
…
FREE BLOCKS EXAMPLE
inode (link_count = 1) points to block (number 123)
data bitmap for block 123: 0011001100 (bit is 0: marked free)
How to fix to have consistent file system?
Fix: trust the inode pointer; set the bit for block 123
data bitmap for block 123: 0011001101
LINK COUNT EXAMPLE
Two Dir Entries point to an inode with link_count = 1
How to fix to have consistent file system?
Fix: trust the directory entries; set link_count = 2
LINK COUNT (EXAMPLE 2)
An inode has link_count = 1, but no Dir Entry points to it
How to fix???
Fix: create a Dir Entry for the inode in lost+found
ls -l /
total 150
drwxr-xr-x  401 18432 Dec 31  1969 afs/
drwxr-xr-x.   2  4096 Nov  3 09:42 bin/
drwxr-xr-x.   5  4096 Aug  1 14:21 boot/
dr-xr-xr-x.  13  4096 Nov  3 09:41 lib/
dr-xr-xr-x.  10 12288 Nov  3 09:41 lib64/
drwx——.       2 16384 Aug  1 10:57 lost+found/
…
Orphaned inodes (link_count = 1, no dir entry) end up in lost+found
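The link-count repair can be sketched as follows (hypothetical in-memory structures; a real fsck walks the on-disk directory tree):

```python
from collections import Counter

def fix_link_counts(dir_entries, inodes):
    """fsck-style pass: recompute each inode's link count from the
    directory entries that reference it; orphans go to lost+found.

    dir_entries: list of (name, inum); inodes: dict inum -> link_count.
    Returns (fixed link counts, inums moved to lost+found).
    """
    refs = Counter(inum for _, inum in dir_entries)
    orphans = []
    fixed = {}
    for inum in inodes:
        if refs[inum] == 0:
            orphans.append(inum)      # re-link under lost+found
            fixed[inum] = 1
        else:
            fixed[inum] = refs[inum]  # trust the directory entries
    return fixed, orphans

# Two entries point at inode 7 (count says 1); inode 9 has no entry.
print(fix_link_counts([("a", 7), ("b", 7)], {7: 1, 9: 1}))
# -> ({7: 2, 9: 1}, [9])
```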
DUPLICATE POINTERS
Two inodes, each with link_count = 1, point to the same block (number 123)
How to fix to have consistent file system?
Fix: copy block 123 to a free block (number 729); each inode now points to its own copy
But is this correct? (File system is consistent, but one file may have the wrong data)
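The copy-based fix can be sketched like this (hypothetical in-memory structures; the free block chosen here mirrors the slide’s example):

```python
def fix_duplicate_pointers(inode_ptrs, free_blocks, disk):
    """If two inodes point at the same block, give each its own copy.

    inode_ptrs: dict inum -> block number; free_blocks: unused block
    numbers; disk: dict block number -> contents.
    Mutates all three so every inode ends up with a private block.
    """
    seen = set()
    for inum, blk in sorted(inode_ptrs.items()):
        if blk in seen:
            new_blk = free_blocks.pop()  # allocate a fresh block
            disk[new_blk] = disk[blk]    # copy the shared contents
            inode_ptrs[inum] = new_blk
        else:
            seen.add(blk)
    return inode_ptrs

disk = {123: b"hello"}
print(fix_duplicate_pointers({1: 123, 2: 123}, [729], disk))
# -> {1: 123, 2: 729}; disk[729] now holds a copy of block 123
```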
BAD POINTER
inode (link_count = 1) contains pointer to block 9999
super block: tot-blocks = 8000, so the pointer is invalid
How to fix to have consistent file system?
Fix: clear the bad pointer (any data in that block is lost)
Problems with fsck
Problem 1:
– Not always obvious how to fix file system image
– Don’t know “correct” state, just consistent one
– Easy way to get consistency: reformat disk!
Problem 2: fsck is very slow
Checking a 600GB disk takes ~70 minutes
ffsck: The Fast File System Checker
Ao Ma, Chris Dragga, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
Consistency Solution #2: Journaling
Goals
– Ok to do some recovery work after crash, but not to read entire disk
– Don’t move file system to just any consistent state, get correct state
Atomicity
– Definition of atomicity for concurrency: operations in critical sections are not interrupted by operations on related critical sections
– Definition of atomicity for persistence: collections of writes are not interrupted by crashes; either (all new) or (all old) data is visible
Consistency vs ATOMICITY
Say a set of writes moves the disk from state A to B
[Figure: of all possible disk states, only a subset is consistent; A and B are both consistent states]
fsck gives consistency
Atomicity gives A or B
Journaling: General Strategy
Never delete ANY old data, until ALL new data is safely on disk
Ironically, add redundancy to fix the problem caused by redundancy
1. Make a note of what needs to be written
2. After note is completely written, can update metadata and data
If crash and recover:
1. If see note is not completely written, ignore (old data still good)
2. If see note is completely written, replay to get new data
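The recovery rule (steps 1 and 2 above) can be sketched as a minimal simulation, where a "TxE" marker stands in for the commit block:

```python
def recover(journal, disk):
    """Replay rule from the slide: if the note (journal) is complete,
    replay it; otherwise ignore it and keep the old data.

    journal: list of (block, value) pairs, ending with "TxE" only if
    the note was fully written; disk: dict block -> in-place value.
    """
    if not journal or journal[-1] != "TxE":
        return disk               # incomplete note: old data still good
    for blk, val in journal[:-1]:
        disk[blk] = val           # complete note: replay to get new data
    return disk

# Crash before the note is complete: nothing replayed.
print(recover([(5, "A")], {5: "old"}))         # -> {5: 'old'}
# Crash after the note is complete: replay yields new data.
print(recover([(5, "A"), "TxE"], {5: "old"}))  # -> {5: 'A'}
```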
Journaling Terminology
Extra blocks in note are called a “journal”
Writes to journal are “journal transaction”
Last valid bit written is “journal commit block”
Once it is written, the transaction is committed
Checkpoint: Writing to in-place metadata and data after commit
JOURNAL LAYOUT
[Figure: the journal holds a sequence of transactions]
JOURNAL WRITE AND CHECKPOINTS
transaction: write A to block 5; write B to block 2
in-place: block 2 = B, block 5 = A; journal (blocks 9-12): 5,2 | A | B | TxE
Journal descriptor block (5,2): What is in this transaction?
TxE: Journal commit
Checkpoint: Writing new data to in-place locations (blocks 5 and 2)
JOURNAL REUSE AND CHECKPOINTS
After first transaction checkpoints, can begin next one
Next transaction: write C to block 4; write T to block 6
in-place: block 2 = B, block 4 = C, block 5 = A, block 6 = T
journal (blocks 9-12): 4,6 | C | T | TxE
ORDERING FOR CONSISTENCY
transaction: write C to block 4; write T to block 6
writes: 9, 10, 11, 12, 4, 6, 12
What operations can proceed in parallel and which must be strictly ordered?
Strict ordering is expensive: must flush from memory to disk
– tell disk not to reorder
– tell disk it can’t cache, must persist to final media
ORDERING FOR CONSISTENCY: BARRIERS
transaction: write C to block 4; write T to block 6
write order: 9,10,11 | 12 | 4,6 | 12
1) Before journal commit, ensure journal transaction entries complete
2) Before checkpoint, ensure journal commit complete
3) Before freeing journal, ensure in-place updates complete
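The three barriers can be sketched with ordinary files, using fsync() as a stand-in for a disk-cache flush (paths and the function name are hypothetical):

```python
import os
import tempfile

def journaled_write(journal_path, data_path, payload):
    """Ordering from the slide, enforced with fsync() barriers:
    1) journal entries, barrier; 2) commit record, barrier;
    3) in-place checkpoint write, barrier; then free the journal.
    """
    with open(journal_path, "wb") as j:
        j.write(payload)
        j.flush(); os.fsync(j.fileno())  # barrier 1: entries durable
        j.write(b"TxE")
        j.flush(); os.fsync(j.fileno())  # barrier 2: commit durable
    with open(data_path, "wb") as d:
        d.write(payload)                 # checkpoint in place
        d.flush(); os.fsync(d.fileno())  # barrier 3: before journal free
    os.truncate(journal_path, 0)         # now safe to reuse journal space

tmp = tempfile.mkdtemp()
jp, dp = os.path.join(tmp, "journal"), os.path.join(tmp, "data")
journaled_write(jp, dp, b"new-data")
print(open(dp, "rb").read())  # -> b'new-data'
```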
CHECKSUM OPTIMIZATION
Can we get rid of the barrier between (9, 10, 11) and 12?
write order: 9,10,11,12 | 4,6 | 12
In last transaction block (TxChk), store checksum of rest of transaction (calculated over blocks 9, 10, 11)
How does recovery change?
During recovery: if checksum does not match, treat transaction as not valid
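A sketch of checksum-based commit and recovery, using CRC32 as an assumed checksum function:

```python
import zlib

def commit_with_checksum(blocks):
    """Append a commit block carrying a checksum over the rest of the
    transaction; no barrier is needed before it, since a torn write
    will fail the checksum."""
    body = b"".join(blocks)
    return blocks + [zlib.crc32(body).to_bytes(4, "big")]

def recover_tx(tx):
    """Recovery: recompute the checksum; on mismatch, treat the
    transaction as never committed."""
    *blocks, chk = tx
    if zlib.crc32(b"".join(blocks)).to_bytes(4, "big") != chk:
        return None               # torn write: discard transaction
    return blocks

tx = commit_with_checksum([b"5,2", b"A", b"B"])
print(recover_tx(tx))             # -> [b'5,2', b'A', b'B']
tx[1] = b"X"                      # simulate a torn/partial write
print(recover_tx(tx))             # -> None
```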
WRITE BUFFERING OPTIMIZATIONS
Batched updates
– If two files are created, inode bitmap, inode, etc. are written twice
– Mark as dirty in-memory and batch many updates into one transaction
Delayed checkpoints
– Note: after journal write, there is no rush to checkpoint
– If system crashes, still have persistent copy of written data!
– Journaling is sequential (fast!), checkpointing is random (slow!)
– Solution? Delay checkpointing for some time
Circular Log
Difficulty: need to reuse journal space
Solution: keep many transactions for un-checkpointed data
Journal (0 to 128 MB): | T1 | T2 | T3 | T4 | free … |
Must wait until transaction T1 is checkpointed before its space can be freed and reused for the next transaction, T5
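A minimal sketch of the circular-log bookkeeping (class and method names are hypothetical):

```python
class CircularJournal:
    """Circular-log sketch: transactions occupy journal space until
    checkpointed; the oldest transaction must checkpoint before its
    space can be reused."""
    def __init__(self, size):
        self.size = size
        self.pending = []  # (tx_id, length), oldest first
        self.used = 0

    def append(self, tx_id, length):
        while self.used + length > self.size:
            self.checkpoint_oldest()  # make room: free oldest tx first
        self.pending.append((tx_id, length))
        self.used += length

    def checkpoint_oldest(self):
        tx_id, length = self.pending.pop(0)  # write in place, then free
        self.used -= length
        return tx_id

j = CircularJournal(size=4)
for t in ["T1", "T2", "T3", "T4"]:
    j.append(t, 1)
j.append("T5", 1)                 # forces T1 to checkpoint first
print([t for t, _ in j.pending])  # -> ['T2', 'T3', 'T4', 'T5']
```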
How to avoid writing all disk blocks twice?
Observation:
Most writes are user data (esp. sequential writes)
If user data isn’t consistent, file system still operates correctly
Strategy: Journal all metadata, including superblock, bitmaps, inodes, indirects, directories
Guarantees metadata is consistent if crash occurs
Won’t leak blocks or allocate a block to multiple files
For regular user data, write it back whenever convenient
Implication: Files may contain garbage (partial old, partial new) if crash and recover
Application needs to deal with this if it cares
METADATA JOURNALING
Writeback mode: do not control ordering of data + metadata
transaction: append to inode I
journal: TxB | B’ | I’ | TxE; checkpoint later writes B’ and I’ in place
Ensures B’ and I’ are updated atomically
Crash !?!
I’ may point to garbage data, but the data block won’t be allocated to another file
Ordered Journaling
Still only journal metadata
But write data before the transaction!
Step 1: write D to its in-place data block (B and I not yet updated)
What happens if crash now? (before updating B and I)
B indicates D currently free, I does not point to D; lose D, but that might be acceptable
ORDERED JOURNALING
transaction: append to inode I
Step 2: after D is on disk, write journal: TxB | B’ | I’ | TxE
Crash !?!
Can replay journal; in-place updates of B’ and I’ will point to D, as desired
Step 3: checkpoint B’ and I’ to their in-place locations
Ensures B’ and I’ are updated atomically AFTER D is on disk
Crash !?!
Everything is all good; if TxE wasn’t cleared, will replay transaction: extra work, but no problems
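The crash cases above can be summarized in a small simulation (hypothetical structures; B, I, and D as on the slides):

```python
def recover_ordered(tx_committed, disk):
    """What ordered-mode recovery sees after a crash.

    tx_committed: the metadata transaction (B', I', TxE) is complete.
    disk: in-place blocks {'B': ..., 'I': ..., 'D': ...}; in ordered
    mode, D was written in place before the transaction.
    """
    if tx_committed:
        disk["B"], disk["I"] = "B'", "I'"  # replay metadata journal
        return disk                        # I' points at D, as desired
    # Not committed: B says D is free, I doesn't point at D.
    # D may be lost, but no metadata points at garbage.
    return disk

# Crash after data write, before commit: lose D, metadata consistent.
print(recover_ordered(False, {"B": "B", "I": "I", "D": "D"}))
# -> {'B': 'B', 'I': 'I', 'D': 'D'}
# Crash after commit: replay journal; metadata now points at D.
print(recover_ordered(True, {"B": "B", "I": "I", "D": "D"}))
# -> {'B': "B'", 'I': "I'", 'D': 'D'}
```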
Other journaling approaches…
PHYSICAL JOURNAL
TxB: length=3, blks=4, 6, 1
– full bitmap block (all zeroes except one bit: …0000100000…)
– full inode block (… addr[?]=521)
– full data block
TxE (checksum)
Write out lots of information, but how much was really needed?
Actual changed data is much smaller!
LOGICAL JOURNAL
TxB: length=1
– list of changes
TxE (checksum)
Logical journals record changes to bytes, not contents of new blocks
On recovery: need to read existing contents of in-place data and (re-)apply changes
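Logical-journal recovery can be sketched as follows (the record format, (block, offset, new bytes), is an assumption for illustration):

```python
def apply_logical_journal(disk, changes):
    """Logical-journal recovery sketch: the journal records byte-level
    changes rather than whole block images, so recovery reads the
    existing block and (re-)applies each change. Changes must be
    idempotent: replaying twice gives the same result.
    """
    for blk, offset, new_bytes in changes:
        block = bytearray(disk[blk])
        block[offset:offset + len(new_bytes)] = new_bytes
        disk[blk] = bytes(block)
    return disk

disk = {4: b"\x00" * 8}
changes = [(4, 2, b"\x01")]           # set one byte inside block 4
apply_logical_journal(disk, changes)
apply_logical_journal(disk, changes)  # replaying again is safe
print(disk[4])  # -> b'\x00\x00\x01\x00\x00\x00\x00\x00'
```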
SUMMARY
Crash consistency: Important problem in file system design!
Two main approaches:
FSCK
Fix file system image after crash happens
Slow and only ensures consistency
Still useful to fix problems (bit flips, FS bugs)
Journaling
Write a transaction before in-place updates
Checksum, batching, and ordered-journal optimizations
Used by most modern file systems
Ordered mode for metadata is very popular (default mode)
Some file systems don’t use journals, but still write new data before deleting old
Copy-on-write file systems next