CSC 369 Operating Systems
Redundancy, Write Optimizations, SSDs, Modern File Systems
University of Toronto Mississauga, Department of Mathematical and Computational Sciences
• Last time:
  • Optimizations: caching, read-ahead
  • Disk characteristics, optimizations
  • Disk-aware allocation, I/O scheduling
  • Crash consistency and recovery
• This time:
  • Recovering from disk failures, Redundancy
  • Optimizing for writes, Log-structured file systems
  • Solid State Drives (SSDs)
  • Example File Systems, modern features
Redundancy
(another approach to data persistence)
Redundancy
• We saw how the journal helps recover from a crash during a write, by replaying the log entries of the write operations
• This assumes that the disk is still usable after rebooting
• What if we have disk failures?
• What can we do to prevent data loss?
• Have more than one copy of the data: Redundancy!
• Redundant Array of Inexpensive/Independent Disks (RAID)
• Concepts:
  • Data duplicated – mirror images, redundant full copy => one disk fails, we have the mirror
  • Data spread out across multiple disks with redundancy => can recover from a disk failure, by reconstructing the data
• Reliability strategies:
  • Redundancy via Mirroring: keep multiple copies of the same block on different drives, just in case a drive fails
  • Parity information: XOR each bit from 2 drives, store checksum on a 3rd drive (see the sketch below)
• Multiple RAID levels – we’ll only look at a few
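A minimal C sketch of the XOR-parity idea above, assuming byte-array “drives” and a single 8-byte stripe (real arrays operate on whole sectors/blocks):

```c
/* XOR parity sketch: parity = d0 ^ d1; any one lost drive can be
 * rebuilt by XORing the survivors. Sizes here are illustrative. */
#include <stdio.h>
#include <string.h>

#define STRIPE 8

int main(void) {
    unsigned char d0[STRIPE] = "CSC369 A";   /* data "drive" 0 */
    unsigned char d1[STRIPE] = "CSC369 B";   /* data "drive" 1 */
    unsigned char parity[STRIPE];            /* parity "drive" */

    for (int i = 0; i < STRIPE; i++)         /* compute parity */
        parity[i] = d0[i] ^ d1[i];

    unsigned char rebuilt[STRIPE];           /* pretend d1 failed */
    for (int i = 0; i < STRIPE; i++)
        rebuilt[i] = d0[i] ^ parity[i];      /* XOR the survivors */

    printf("d1 recovered? %s\n",
           memcmp(rebuilt, d1, STRIPE) == 0 ? "yes" : "no");
    return 0;
}
```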
Some standard RAID levels
[Figure: layouts of some standard RAID levels. Image source: Wikipedia]
Write Optimizations
How do we optimize writes?
• Read performance
  • Place related blocks in same cylinder group, place inodes close to data blocks, etc.
  • Files laid out with spatial locality in mind:

  A1 A2 B1 B2 C1 C2 D1 D2 E1 E2

  • Changes in place (to mitigate seeks) => e.g., A1’, B2’, C2’:

  A1’ A2 B1 B2’ C1 C2’ D1 D2 E1 E2

  • Avoids fragmenting files, keeps locality => Reads perform well.
• What about write performance?
  • Writes are not well-clustered
Read Performance or Write Performance?
• Performance of FFS: optimized for disk block clustering, using properties of the disk to inform file system layout
• Observation: Memory is now large enough
  • Most reads that go to the disk are the first read of a file. Subsequent reads are satisfied in memory by the file buffer cache.
  • => there is no performance problem with reads. But write calls could be made faster.
Log-Structured File System (LFS)
• Different approach:
  • Memory is increasing => don’t care about reads, most will hit in mem.
  • Assume writes will pose the bigger I/O penalty
=> Treat storage as a circular log (Log Structured File System)
• Positive side-effects?
• Write throughput improved (batched into large sequential chunks)
• Crash recovery – simpler
• Disadvantages?
• Initial assumption may not hold => reads much slower on HDDs. Why?
Log-Structured File System (LFS)
• Ousterhout 1989
• Write all file system data in a continuous log.
• Uses inodes and directories from FFS

[Diagram: the disk as a log – superblock, then segments holding data blocks, inodes, etc.]

• Segments: each has a summary block (a struct sketch follows below)
  • Summary block contains pointer to next one
  • Need a fresh segment => first clean an existing partially-used segment (garbage collection)
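An illustrative C layout for a segment summary block; field names and sizes are assumptions for this sketch, not the real LFS on-disk format:

```c
/* Illustrative segment summary; the cleaner uses per-block records
 * to decide which blocks in the segment are still live. */
#include <stdint.h>

struct seg_summary {
    uint64_t next_segment;  /* disk address of the next segment in the log */
    uint32_t n_blocks;      /* number of blocks written in this segment    */
    /* per-block records (owning inode #, file offset) would follow */
};
```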
LFS: Locating inodes
• So, how do we find the inodes though?
• In typical UNIX FSs, it’s easy – array on disk at fixed locations
  • Superblock => Inode Table addr; then add Inode# * InodeSize
• LFS: not so easy. Why?
  • Updates are sequential => inodes scattered all over the disk
  • Also, inodes not static, keep moving:
chmod u+x doh
chown homer:simpsonfamily doh
LFS: Locating inodes
• LFS: Needs an inode map (imap) to find the inodes
• An inode number is no longer a simple index from the start of an inode table:

  I[f1] … I[f2] … I[f3]

• Inode map must be kept persistent, to know where inodes are
  • So it has to be on disk as well..
• So, .. where exactly is the inode map stored on disk?
LFS: Locating inodes
• Put it on a fixed part of the disk
  • Inode map gets updated frequently though => Seeks ↑ Performance ↓
Aww man, What now?
• Instead, place chunks of inode map next to new information:

  D I[k] imap

• Hold on, but then how do we now find the imap?
LFS: Locating inodes
• Ok fine! The file system must have some fixed and known location on disk to begin a file lookup
• Checkpoint region (CR)
  • Pointers to the latest pieces of the inode map
  • So, find imap pieces by reading the CR! (a lookup sketch follows below)

  CR … D I[k] imap
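Putting the pieces together, a hypothetical C sketch of the lookup chain: checkpoint region (fixed address) -> imap piece -> inode’s current disk address. The structures and the simulated disk are illustrative, not the real LFS layout:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t disk_addr_t;
enum { CR_ADDR = 0, INODES_PER_PIECE = 1024 };

struct checkpoint_region { disk_addr_t imap_piece[64]; };
struct imap_piece        { disk_addr_t inode_addr[INODES_PER_PIECE]; };

static unsigned char disk[1 << 20];              /* simulated disk */

static void disk_read(disk_addr_t a, void *buf, size_t n) {
    memcpy(buf, disk + a, n);                    /* stand-in for real I/O */
}

disk_addr_t lfs_locate_inode(uint32_t ino) {
    struct checkpoint_region cr;                 /* 1: read the CR */
    disk_read(CR_ADDR, &cr, sizeof cr);

    struct imap_piece piece;                     /* 2: imap piece for ino */
    disk_read(cr.imap_piece[ino / INODES_PER_PIECE], &piece, sizeof piece);

    return piece.inode_addr[ino % INODES_PER_PIECE];  /* 3: inode's address */
}
```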
Crash Recovery
• What if the system crashes while LFS is writing to disk?
  • LFS normally buffers writes in a segment
  • When full (or at periodic time intervals), writes the segment to disk
• LFS writes segments and periodically also updates the CR
• Crashes can happen at any point
  • How do we handle crashes during these writes?
• Solution:
  1. Uncommitted segments – easy: reconstruct from log after reboot
  2. CR: Keep two CRs, at either end of the disk; alternate writes
     • Update protocol (header, body, last block) – see the sketch below
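A sketch of how recovery could pick a valid CR. The head/tail timestamp layout is an assumption modeling the “header, body, last block” protocol: a CR is consistent only if both timestamps match.

```c
#include <stdint.h>

struct cr {
    uint64_t ts_head;        /* timestamp written first (header)  */
    /* ... pointers to the latest imap pieces (body) ...          */
    uint64_t ts_tail;        /* same timestamp written last       */
};

static int cr_consistent(const struct cr *c) {
    /* A crash mid-write leaves mismatched timestamps (torn CR). */
    return c->ts_head == c->ts_tail;
}

/* Pick the newest consistent CR of the two kept at either end of disk. */
const struct cr *choose_cr(const struct cr *a, const struct cr *b) {
    int va = cr_consistent(a), vb = cr_consistent(b);
    if (va && vb) return a->ts_head >= b->ts_head ? a : b;
    if (va) return a;
    if (vb) return b;
    return 0;  /* both torn: cannot happen if writes alternate */
}
```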
Garbage Collection – Complicated!
• LFS repeatedly writes latest version of a file to new locations on disk
• Older versions of files (garbage) are scattered throughout the disk
• Must periodically find these obsolete versions of file data and clean up => free blocks for subsequent writes
• Cleaning done on a segment-by-segment basis
  • Since the segments are large chunks, it avoids the situation of having small “holes” of free space
• Garbage collection in LFS – interesting research problem!
  • A series of papers were published looking at its performance
  • Depending on when cleaning happens, there may be significant performance loss.
LFS Summary
• A new approach that prioritizes update performance
• Gist: Instead of overwriting in place, append to the log and reclaim (garbage collect) obsolete data later.
• Advantages:
  • Very efficient writes
• Disadvantages:
  • Less efficient reads (more indirection does not solve everything)
    • But assumes most reads hit in memory anyway.
  • Garbage collection is a tricky problem
• LFS inspired the logging features of journaling file systems in use today, e.g., Ext3/Ext4
VFS (Virtual File System) Concept
• Provides an abstract file system interface
• Separates the abstraction of files and collections of files from specific implementations
• System calls such as open, read, write, etc. can be implemented in terms of operations on the abstract file system (a dispatch sketch follows below)
  • vfs_open, vfs_close
• Abstraction layer is for the OS itself
• user-level programmer interacts with the file systems through the system calls
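A hypothetical C sketch of the dispatch idea: each concrete file system fills in a table of function pointers, and the generic layer calls through it. The struct and names are illustrative, not the real Linux VFS definitions.

```c
#include <stdio.h>

struct vfs_ops {
    int (*open)(const char *path);
    int (*lookup)(const char *name);
};

/* One concrete implementation ("ext3" here, purely for illustration). */
static int ext3_open(const char *path)   { printf("ext3_open(%s)\n", path); return 0; }
static int ext3_lookup(const char *name) { printf("ext3_lookup(%s)\n", name); return 0; }

static const struct vfs_ops ext3_ops = { ext3_open, ext3_lookup };

/* The generic layer knows only the abstract interface. */
static int vfs_open(const struct vfs_ops *ops, const char *path) {
    return ops->open(path);
}

int main(void) {
    vfs_open(&ext3_ops, "/home/homer/doh");  /* dispatches into "ext3" */
    return 0;
}
```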
Schematic View of VFS

User-level File System Interface (C-library system calls)
  ↓
Kernel File System Interface (system call handlers): sys_open()
  ↓
VFS interface: vfs_open()
  ↓
ext3_lookup()/ext3_open()   ntfs_lookup()/ntfs_open()   nfs_lookup()/nfs_open()   dev_lookup()/dev_open()
  ↓
Local file system (ext3)   Local file system (NTFS)   Network file system (NFS)   Raw devices (console, swap)
Solid State Disks (SSDs)
Solid State Disks (SSD)
• Replace rotating mechanical disks with non-volatile memory
  • Battery-backed RAM
  • NAND flash
• Advantages:
  • Faster than HDDs
• Disadvantages:
  • Expensive (cost per bit)
  • Wear-out (flash-based)
• NAND flash storage technology
  • Read/write/erase! operations
SSD Characteristics
• Data cannot be modified “in place”
  • No overwrite without erase
• Terminology:
  • Page (unit of read/write), block (unit of erase operation)

[Diagram: data written in 4KB pages; data erased in blocks of typically >= 128 pages]

• Uniform random access performance!
• Disks typically have multiple channels so data can be split (striped) across blocks, speeding access time
• Writing a new file system block => can just write to a free page
• Consider now updating a file system block (e.g., a bitmap allocation block in the ext2 file system) – the steps are sketched in code below:
  • Find the block containing the target page
  • Read all active pages in the block into controller memory
  • Update target page with new data in controller memory
  • Erase the block (high voltage to set all bits to 1)
  • Write entire block to drive
• Some FS blocks are frequently updated
  • And SSD blocks wear out (limited erase cycles)
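A toy C sketch of that naive read-modify-erase-write cycle; the in-memory arrays stand in for one flash block and the controller’s RAM (sizes assumed):

```c
#include <string.h>

#define PAGE  4096
#define PAGES 128                       /* pages per erase block */

typedef unsigned char page_t[PAGE];
static page_t block[PAGES];             /* simulated flash block */

void update_page_in_place(int target, const page_t newdata) {
    static page_t shadow[PAGES];        /* controller memory */

    memcpy(shadow, block, sizeof block);    /* 1: read active pages     */
    memcpy(shadow[target], newdata, PAGE);  /* 2: update the target     */
    memset(block, 0xFF, sizeof block);      /* 3: erase (all bits -> 1) */
    memcpy(block, shadow, sizeof block);    /* 4: program whole block   */
}
```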
SSD Algorithms
• Wear levelling
  • Always write to new location
  • Keep a map from logical FS block number to current SSD block and page location
  • Old versions of logically overwritten pages are “stale”
• Garbage collection
  • Reclaiming stale pages and creating empty erased blocks
• RAID5-like (with parity checking) striping across I/O channels to multiple NAND chips
Example: Writing on an SSD

For simplicity of illustration, assume:
– 4 pages per block, and
– files are only 2 pages each.

On the SSD: striped file data, with parity (one block per row; the last page of each row is free):

  File0 P0 | File1 P0 | File2 Pp | (free)
  File0 P1 | File1 Pp | File2 P0 | (free)
  File0 Pp | File1 P1 | File2 P1 | (free)

Write request for File3, needs 2 pages of space => File3 written in empty pages:

  File0 P0 | File1 P0 | File2 Pp | File3 P0
  File0 P1 | File1 Pp | File2 P0 | File3 P1
  File0 Pp | File1 P1 | File2 P1 | File3 Pp

(Pp denotes a parity page.)
Writing on an SSD

For simplicity of illustration, assume:
– 4 pages per block, and
– on this illustration, we only look at 1 block, containing pages from multiple files.

On the SSD, the block holds: File0 Pi | File1 Pj | File2 Pk | (free)

• Receiving update to File1’s Pj contents:
  • Write updated content Pj’ to the new (free) page
  • Page with old contents (Pj) becomes stale
• File3 is written now and needs a page on this block:
  • Read pages in memory
  • In memory: reorder pages, remove the stale page and copy data for File3
  • Erase block
  • Write pages from memory back to the erased block
Flash Translation Layer (FTL)
• Goal 1: Translate reads/writes to logical blocks into reads/writes/erases on physical pages/blocks
  • Allows SSDs to export the simple “block interface” that hard disks have traditionally exported
  • Hides write-induced copying and garbage collection from applications
• Goal 2: Reduce write amplification (the amount of extra copying needed to deal with block-level erases)
• Goal 3: Implement wear leveling (distributing writes equally to all blocks, to avoid fast failure of a “hot” block)
• FTL is typically implemented in hardware in the SSD, but is implemented in software for some SSDs (a minimal mapping sketch follows below)
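A minimal sketch of goal 1, assuming a page-level mapping table in RAM with a naive bump allocator (no garbage collection or wear statistics shown):

```c
#include <stdint.h>

#define N_LOGICAL 4096u
#define N_PHYS    8192u
#define INVALID   UINT32_MAX

static uint32_t l2p[N_LOGICAL];   /* logical page -> physical page   */
static uint8_t  stale[N_PHYS];    /* stale flags, for the GC to scan */
static uint32_t next_free;        /* naive allocator of fresh pages  */

void ftl_init(void) {
    for (uint32_t i = 0; i < N_LOGICAL; i++)
        l2p[i] = INVALID;
}

/* A logical overwrite never writes in place: it claims a fresh
 * physical page and marks the old one stale (spreading out wear). */
uint32_t ftl_write(uint32_t logical) {
    uint32_t old = l2p[logical];
    if (old != INVALID)
        stale[old] = 1;
    l2p[logical] = next_free++;   /* caller programs this page */
    return l2p[logical];
}

uint32_t ftl_read(uint32_t logical) {
    return l2p[logical];          /* physical page to read */
}
```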
File Systems and SSDs
• Typically, same FSs as for hard disk drives
  • ext4, Btrfs, XFS, JFS, etc., all support SSDs
• No need for the FS itself to take care of wear-leveling
  • Done internally by the SSD/FTL
• Some flash file systems (F2FS, JFFS2) help reduce write amplification (esp. for small updates – e.g., FS metadata)
• Other typical HDD features – do we want these?
  • Defragmentation
  • Disk scheduling algorithms
SSD reliability
• FAST 2016 paper “Flash reliability in production”, a Google study on:
  • Millions of drive days over 6 years
  • 10 different drive models
  • 3 different flash types: MLC, eMLC and SLC
  • Enterprise and consumer drives
• Full paper: https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder
• Key points: http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/
Summary: Basic File System Goals
• Efficiently translate file name into file number using a directory
• Sequential access performance
• Efficient random access to any file block
• Efficient support for small files (overhead in terms of space and access time)
• Support large files
• Efficient metadata storage and lookup
• Crash recovery
Summary: File System Components
• Index structure to locate each block of a file
• Free space management
• Locality heuristics
• Crash recovery
Next up …
• FS comparison
• Modern file systems, ZFS, Btrfs
Example File Systems
Modern FS features
FS Comparison

                              FAT              FFS              NTFS
Index structure               Linked list      Tree (inodes)    Tree (extents)
Index structure granularity   Block            Block            Extent
Free space management         FAT array        Bitmap           Bitmap
Locality heuristics           Defragmentation  Block groups,    Best fit,
                                               reserve space    defragmentation

Table source: Operating Systems, Anderson & Dahlin, p. 554
Ext2, Ext3, Ext4
• Linux file system evolution
• Ext2 originally borrowed heavily from FFS
  • Recall: Reduce seeks for faster reads, etc.
• Next: more advanced file systems
Modern file systems
• Many file systems out there
  • This is not intended as a comprehensive “laundry list”
  • Focus on new features/design decisions
• Highlight features of some modern file systems
  • Btrfs
  • ZFS
ZFS
• Advanced file system, originally developed by Sun
  • In FreeBSD, but Linux port too (licensing issue)
  • In Linux, use FUSE or via kernel module (e.g., see zfsutils-linux on Ubuntu)
• Combines the role of LVM and FS
• Modern features: huge storage, data integrity (check and repair), integrated RAID (RAID-Z), copy-on-write and snapshots, data compression, data deduplication, etc.
• Design goals:
  • Pooled storage
  • Data integrity
  • Performance
Pooled storage
• Traditional LVM model vs ZFS storage pool model

[Diagram: traditional model – each FS sits on its own LVM volume; ZFS model – many ZFS file systems share one common storage pool]

• Storage pool => Seamless built-in RAID called RAID-Z.
ZFS Storage Pool
• Recall the goal of RAID: Avoid data loss from disk failure
  • Supported via the pool design, no need for extra H/W or S/W
• RAID-Z1: at least 2 disks for storage and 1 for parity
  • Similar to RAID-5, but overcomes the RAID-5 write hole error
  • Dynamic stripe width – every block is its own RAID stripe, every RAID-Z write is a full-stripe write

Fixed stripe width (RAID-5 style), Disk1 Disk2 Disk3 Disk4:
  A1 A2 A3 Ap
  B1 B2 Bp B3
  C1 Cp C2 C3
  Dp D1 D2 D3

Dynamic stripe width (RAID-Z), Disk1 Disk2 Disk3 Disk4:
  A1 A2 Ap B1
  B2 B3 Bp C1
  Cp D1 D2 D3
  Dp D4 D5 Dp’
  (Note: why is D striped like this?)

• Other levels to survive >1 disk failure, e.g., RAID-Z2, RAID-Z3, etc.
Data layout
• dnode: similar to inode, data structure representing an object (file, dir, volume, etc. – see dn_type field)
• An object is stored as a tree of block pointers (direct/indirect)
  • Root: object’s metadata (type, # of levels in the tree, its space used, etc.), plus 3 block pointers which reference disk blocks (actual contents of object)
• ZFS can store up to 3 copies of data pointed to by a block pointer (each has its own DVA)

[Diagram: dnode_phys_t – uint8_t dn_type; uint8_t dn_indblkshift; uint8_t dn_nlevels=3; uint8_t dn_nblkptr=3; uint8_t checksum; uint8_t compress; uint8_t dn_blkptr[3]; each entry in dn_blkptr[] holds up to three DVAs (dva1, dva2, dva3)]
Data integrity
• Data Integrity: Hierarchical checksums for all data and metadata
  • For error detection and correction
• Write the checksum along with the data
  • Checksums are stored in a block’s parent, not with the block itself – Why? (see the sketch below)
• Can verify data correctness on use, by recalculating the checksum
• ZFS will attempt to automatically correct errors using redundant copies
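A toy C sketch of parent-held checksums; FNV-1a stands in here for ZFS’s real fletcher/SHA-256 checksums:

```c
#include <stddef.h>
#include <stdint.h>

struct blkptr {
    uint64_t addr;       /* where the child block lives            */
    uint64_t checksum;   /* checksum of the child block's contents */
};

static uint64_t toy_checksum(const void *p, size_t n) {
    const unsigned char *b = p;
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    while (n--) { h ^= *b++; h *= 1099511628211ULL; }
    return h;
}

/* On every read: recompute and compare with the value stored in the
 * parent. A mismatch means corruption; ZFS would then try one of the
 * redundant copies (other DVAs) and rewrite the bad block. */
int verify_child(const struct blkptr *bp, const void *data, size_t n) {
    return toy_checksum(data, n) == bp->checksum;
}
```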
Copy-on-Write
• Recall CoW from memory management – same principle! (a toy sketch follows below)
• Blocks with active data are never overwritten, but reallocated instead in free space
• Metadata blocks referencing data blocks are also read and reallocated
• Atomicity guaranteed in a transactional manner, like in databases
• Key benefit:
  • Allows snapshots!
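A toy C sketch of a CoW block update over an assumed append-only block store; in ZFS the parent metadata blocks referencing this block are rewritten the same way, up to the root, committed as one transaction:

```c
#include <string.h>

#define BLK 4096
static unsigned char store[1024][BLK];   /* simulated block store */
static int next_blk;                     /* bump allocator        */

/* Never overwrite: copy the old block, apply the change, and return
 * the new block's address; the old version can back a snapshot. */
int cow_update(int old_blk, const void *newdata, int off, int len) {
    int nb = next_blk++;
    memcpy(store[nb], store[old_blk], BLK);
    memcpy(store[nb] + off, newdata, len);
    return nb;
}
```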
Snapshots
• Enabled by the CoW transactional model (stale data can be retained)
• An “image” of the entire data as it existed at a single point in time
  • Consistent: reflects the entire state of the FS at a given point in time
• Can be used to restore the previous state of the FS (or files, dirs, etc.)
• Good performance: the stale data is already stored
• Space-efficient: all unchanged data is shared among the FS and its snapshots
• Snapshots are read-only, can be used for backups
• Writable snapshots (aka Clones) can also be created => two independent FSs that share blocks. CoW still applies to both!
ZFS Transactions and Intent Log
• Each write is performed as a transaction
• Transactions are grouped into batches called “transaction groups” (TXGs)
  • Buffered for a short interval (a few sec)
  • Transaction groups are ideally pipelined => good write bandwidth
• What about recoverability? Remember the buffering-writes tradeoff!
  • TXGs are written to a ZIL (ZFS Intent Log) on a SLOG device
  • Synchronous writes are sent to the ZIL, then flushed to the long-term storage
  • What would be a good SLOG device?
• Note: Under normal operation the ZIL is only written, never read; it is read back only during crash recovery. (A toy batching sketch follows below.)
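A toy C sketch of transaction-group batching; the structure and the idea of a fixed-size buffer are illustrative assumptions (ZFS’s actual grouping and flush policy are adaptive):

```c
#include <stdint.h>

#define MAX_TXG_WRITES 1024

struct txg {
    uint64_t id;                              /* transaction group id   */
    int      n;                               /* writes buffered so far */
    struct { uint64_t blk; const void *data; } w[MAX_TXG_WRITES];
};

static struct txg open_txg = { .id = 1 };

/* Writes only join the in-memory group; a background thread would
 * periodically sync the whole group to disk as one atomic unit
 * (synchronous writes would additionally be logged in the ZIL). */
void txg_add_write(uint64_t blk, const void *data) {
    if (open_txg.n == MAX_TXG_WRITES)
        return;                               /* sketch: force a flush here */
    open_txg.w[open_txg.n].blk  = blk;
    open_txg.w[open_txg.n].data = data;
    open_txg.n++;
}
```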