CSC 369 Operating Systems
Disk I/O and File System Optimization
University of Toronto Mississauga, Department of Mathematical and Computational Sciences
• File operations (open()/read()/write()) incur a good amount of disk I/Os
• Then how can the file system perform reasonably well?
• Caching!
File Buffer Cache
• What do we want to cache?
• Key observation: Applications exhibit significant locality for reading and writing files. How? (recall VM)
• Idea: Cache file blocks in memory to capture locality
• This is called the file buffer cache
• Cache is system wide, used and shared by all processes
• Reading from the cache makes a disk perform like memory
• Significant reuse: spatial and temporal locality
• Even a 4 MB cache can be very effective
• Inodes, directory entries, disk blocks for “hot” files, even whole files if small.
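To make the idea concrete, here is a minimal sketch of a 4 MB block cache with LRU replacement. The disk_read()/disk_write() helpers, the sizes, and the structure layout are illustrative assumptions, not a real kernel's API:

    /* Minimal sketch of a file buffer cache with LRU replacement. */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_SIZE   4096
    #define CACHE_BLOCKS 1024          /* 1024 x 4 KB = 4 MB, as on the slide */

    struct cache_entry {
        uint64_t blkno;                /* which disk block is cached here */
        bool     valid, dirty;
        uint64_t last_used;            /* logical timestamp for LRU */
        char     data[BLOCK_SIZE];
    };

    static struct cache_entry cache[CACHE_BLOCKS];
    static uint64_t tick;

    extern void disk_read(uint64_t blkno, void *buf);         /* assumed driver hooks */
    extern void disk_write(uint64_t blkno, const void *buf);

    /* Look the block up in the cache; on a miss, evict the LRU entry
       (writing it back if dirty) and fetch the block from disk. */
    char *cache_get(uint64_t blkno)
    {
        struct cache_entry *victim = &cache[0];
        for (int i = 0; i < CACHE_BLOCKS; i++) {
            if (cache[i].valid && cache[i].blkno == blkno) {
                cache[i].last_used = ++tick;
                return cache[i].data;             /* hit: memory speed */
            }
            if (!cache[i].valid)
                victim = &cache[i];               /* prefer a free slot */
            else if (victim->valid && cache[i].last_used < victim->last_used)
                victim = &cache[i];               /* otherwise track LRU */
        }
        if (victim->valid && victim->dirty)
            disk_write(victim->blkno, victim->data);   /* write back on evict */
        disk_read(blkno, victim->data);
        victim->blkno = blkno;
        victim->valid = true;
        victim->dirty = false;   /* a write path would set dirty after modifying data */
        victim->last_used = ++tick;
        return victim->data;
    }

All processes share this one array, which is what "system wide" means above.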
Caching and Buffering
• So, we use memory to cache important blocks
• Static partitioning: at boot time, allocate a fixed-size cache in memory (typically 10% of total memory); used by early file systems
• Dynamic partitioning: integrate virtual memory pages and file system pages into a unified page cache, so pages of memory can be flexibly allocated for either virtual memory or the file system; used by modern systems
• Replacement policy: typically LRU
• Tradeoff between static vs dynamic partitioning
• Applicable to any resource allocation problem in general!
Caching Writes
• Caching works really well for reads!
• For writes, not as much!
• Writes still have to go to disk anyway to become persistent.
• Once a block is modified in memory, the write back to disk does not have to be immediate (synchronous).
Caching Writes
• Use a buffer to buffer writes. Buffering a batch of disk writes is helpful because it can:
• Combine multiple writes into one write
• e.g., updating multiple bits of the inode bitmap
• Improve performance by scheduling the buffered writes (lazy updates)
• e.g., can schedule buffered writes so that they happen sequentially on disk
• Avoid some writes entirely
• e.g., one write changes a bit from 0 to 1, then another write changes it from 1 to 0; if buffered, then there is no need to write anything to disk (see the sketch below)
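A minimal sketch of the combine/cancel idea, assuming a hypothetical disk_write() hook and an in-memory copy of one bitmap block; any number of bit flips accumulate in memory and cost at most one disk write:

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    static uint8_t bitmap_shadow[BLOCK_SIZE];   /* version last written to disk */
    static uint8_t bitmap_buf[BLOCK_SIZE];      /* buffered in-memory version */

    extern void disk_write(uint64_t blkno, const void *buf);   /* assumed hook */

    static void set_bit(uint8_t *b, int n)   { b[n / 8] |=  (uint8_t)(1u << (n % 8)); }
    static void clear_bit(uint8_t *b, int n) { b[n / 8] &= (uint8_t)~(1u << (n % 8)); }

    /* Many bit updates accumulate in bitmap_buf; flushing issues at most
       one disk write, and none at all if the updates cancelled out. */
    void flush_bitmap(uint64_t blkno)
    {
        if (memcmp(bitmap_buf, bitmap_shadow, BLOCK_SIZE) != 0) {
            disk_write(blkno, bitmap_buf);              /* one combined write */
            memcpy(bitmap_shadow, bitmap_buf, BLOCK_SIZE);
        }                                               /* else: write avoided */
    }

For instance, set_bit(bitmap_buf, 5) followed by clear_bit(bitmap_buf, 5) and then flush_bitmap(...) results in no disk I/O at all.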
Tradeoff: speed vs durability
• Caching and buffering improve the speed of file system reads and writes
• However, they sacrifice the durability of data.
• Crash occurs => buffered writes not yet written to disk are lost
• Better durability => sync to disk more frequently => worse speed
• Should I favour speed or durability?
• It depends
• On what? It depends on the application, e.g.,
• web browser cache
• bank database
Approaches?
• Several ways to address these concerns
• Delayed writes: hold dirty data in memory, write it to disk only after a specific amount of time
• How long do we hold dirty data in memory?
• Asynchronous writes (“write-behind”)
• Maintain a queue of uncommitted blocks
• Periodically flush the queue to disk (see the flusher sketch below)
• Unreliable
• Battery backed-up RAM (NVRAM)
• As with write-behind, but maintain queue in NVRAM
• Expensive
• Log-structured file system
• Always write contiguously at end of previous write
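A sketch of write-behind using the same assumed disk_write() hook: writes return as soon as the block is queued, and a background thread commits the queue periodically. The queue size and interval are made-up numbers:

    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define QMAX 128
    #define FLUSH_INTERVAL 5                /* seconds dirty blocks may linger */

    struct wreq { uint64_t blkno; char data[512]; };

    static struct wreq queue[QMAX];
    static int qlen;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    extern void disk_write(uint64_t blkno, const void *buf);   /* assumed hook */

    /* Returns immediately after queueing; durability is deferred. */
    void write_behind(uint64_t blkno, const void *buf)
    {
        pthread_mutex_lock(&qlock);
        if (qlen < QMAX) {
            queue[qlen].blkno = blkno;
            memcpy(queue[qlen].data, buf, 512);
            qlen++;
        }
        pthread_mutex_unlock(&qlock);
    }

    /* Background flusher: every FLUSH_INTERVAL seconds, commit the queue.
       A crash before a flush loses everything still queued. */
    void *flusher(void *arg)
    {
        (void)arg;
        for (;;) {
            sleep(FLUSH_INTERVAL);
            pthread_mutex_lock(&qlock);
            for (int i = 0; i < qlen; i++)
                disk_write(queue[i].blkno, queue[i].data);  /* could sort by blkno first */
            qlen = 0;
            pthread_mutex_unlock(&qlock);
        }
        return NULL;
    }

Started once with pthread_create(&tid, NULL, flusher, NULL). The NVRAM variant above keeps the same queue in battery-backed memory so it survives a crash.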
Read Ahead
• Many file systems implement “read ahead”
• FS predicts that the process will request the next block
• FS goes ahead and requests it from the disk
• This can happen while the process is computing on the previous block
• Overlap I/O with execution
• When the process requests the block, it will be in cache
• Complements the on-disk cache, which also is doing read ahead
• For sequentially accessed files, this can be a big win (a user-level variant is sketched below)
• Unless blocks for the file are scattered across the disk
• File systems try to prevent that, though (during allocation)
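The same overlap can be requested from user space with posix_fadvise(); this is an application-level analogue of FS read-ahead, not the kernel's internal implementation. The file name and block size are illustrative:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    #define BLK 4096

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        char buf[BLK];
        for (off_t off = 0; pread(fd, buf, BLK, off) == BLK; off += BLK) {
            /* hint the kernel to prefetch the next block while we
               compute on the current one */
            posix_fadvise(fd, off + BLK, BLK, POSIX_FADV_WILLNEED);
            /* ... process buf ... */
        }
        close(fd);
        return 0;
    }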
A closer look at secondary storage, where files live…
• We looked at performance with caching, buffering, read-ahead
• What about the actual disk’s physical characteristics?
I/O Diagram
[Figure: system interconnect – the memory bus and I/O bridge connecting main memory with the disk, display, and Ethernet controllers on the I/O bus]
Secondary Storage Devices
• Ancient history
• Magnetic disks
• Fixed (hard) disks
• Removable (floppy) disks
• Optical disks
• Write-once, read-many (CD-R, DVD-R)
• Write-many, read-many (CD-RW)
• We’re going to focus on the use of fixed (hard) magnetic disks for implementing secondary storage
Disk Components
[Figure: disk geometry – platters with upper and lower surfaces, tracks divided into sectors, and a read/write head per surface]
Disk Performance
• Disk request performance depends on a number of steps
• Seek – moving the disk arm to the correct cylinder
• Depends on how fast the disk arm can move
• Typical times: 1-15 ms, depending on distance (avg 5-6 ms)
• Improving very slowly (7-10% per year)
• Rotation – waiting for the sector to rotate under the head
• Depends on rotation rate of disk (7200 RPM SATA, 15K RPM SCSI)
• Average latency of 1/2 rotation (~4 ms for a 7200 RPM disk)
• Has not changed in recent years
• Transfer – transferring data from the surface into disk controller electronics, sending it back to the host
• Depends on density (increasing quickly)
• ~100 MB/s, average sector transfer time of ~5 µs
• Improving rapidly (~40% per year)
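A back-of-the-envelope service time using the averages above (representative numbers, not any specific drive): a random 512-byte read on a 7200 RPM disk costs roughly

  5 ms (avg seek) + 4.17 ms (half of a 60/7200 s = 8.33 ms rotation) + 0.005 ms (transfer) ≈ 9.2 ms

so mechanical positioning dominates and the actual data transfer is nearly free. This is why the caching, buffering, and layout techniques earlier in this lecture pay off so much.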
Traditional service time components
[Figure: request timeline – after BLUE read, seek for RED, rotational latency, after RED read]
• When the OS uses the disk, it tries to minimize the cost of all of these steps
• Particularly seeks and rotation
Some hardware optimizations
• Track skew
• Zones
Track skew
[Figure: sector numbers on adjacent tracks, offset (skewed) relative to each other]
• If the arm moves to the outer track too slowly, it may miss sector 36 and have to wait for a whole rotation.
• Instead, skew the track locations, so that we have enough time to position.
Zones
• Each sector is 512 bytes. Notice anything though?
• Outer tracks are larger by geometry, so they should hold more sectors.
Cache, aka Track Buffer
• A small memory chip, part of the hard drive
• Usually 8-16 MB
• Different from cache that OS has
• Unlike the OS cache, it is aware of the disk geometry
• When reading a sector, may cache the whole track to speed up future reads on the same track
Disks and the OS
• Disks are messy physical devices:
• Errors, bad blocks, missed seeks, etc.
• The job of the OS is to hide this mess from higher level software
• Low-level device control (initiate a disk read, etc.)
• Higher-level abstractions (files, databases, etc.)
• The OS may provide different levels of disk access to different clients
• Physical disk (surface, cylinder, sector)
• Logical disk (disk block #)
• Logical file (file block, record, or byte #)
Software Interface Layers
• File System (Database)
• Device Driver
• I/O Controller
• Disk Media
• Each layer abstracts details below it for layers above it
Disk Interaction
• Specifying disk requests requires a lot of info:
• Cylinder #, surface #, track #, sector #, transfer size…
• Older disks required the OS to specify all of this
• The OS needed to know all disk parameters
• Modern disks are more complicated
• Not all sectors are the same size, sectors are remapped, etc.
• Current disks provide a higher-level interface (SCSI)
• The disk exports its data as a logical array of blocks [0 … N]
• Disk maps logical blocks to cylinder/surface/track/sector
• Only need to specify the logical block # to read/write
• But now the disk parameters are hidden from the OS
The common storage device interface
[Figure: OS’s view of the storage device – a linear array of numbered blocks (… 5 6 7 … 12 … 23 …)]
• Storage exposed as linear array of blocks
• Common block size: 512 bytes
• Number of blocks: device capacity / block size
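From software, the device really is just this array. A minimal user-level sketch of reading logical block N from a Linux block device (the device path /dev/sda and 512-byte blocks are assumptions for illustration):

    #define _XOPEN_SOURCE 600
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdint.h>

    #define SECTOR 512

    /* The disk, not the OS, maps lbn -> cylinder/surface/sector. */
    int read_lbn(int fd, uint64_t lbn, void *buf)
    {
        ssize_t n = pread(fd, buf, SECTOR, (off_t)(lbn * SECTOR));
        return n == SECTOR ? 0 : -1;
    }

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);   /* needs root; illustrative */
        if (fd < 0) { perror("open"); return 1; }
        char buf[SECTOR];
        if (read_lbn(fd, 100, buf) == 0)
            printf("read logical block 100\n");
        close(fd);
        return 0;
    }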
Back to File Systems…
• Key idea: File systems need to be aware of disk characteristics for performance
• Allocation algorithms to enhance performance
• Request scheduling to reduce seek time
Enhancing achieved disk performance
• High-level disk characteristics yield two goals:
• Closeness
• reduce seek times by putting related things close to each other
• generally, benefits can be in the factor of 2 range
• Amortization
• amortize each positioning delay by grabbing lots of useful data
• generally, benefits can reach into the factor of 10 range
Allocation strategies
• Disks perform best if seeks are reduced and large transfers are used
• Scheduling requests is one way to achieve this
• Allocating related data “close together” on the disk is even more important
FFS: a disk-aware file system
Original Unix File System
• Recall FS sees storage as linear array of blocks
• Each block has a logical block number (LBN)
[Layout: Superblock | Bitmap | Inodes | Data Blocks – default usage of LBN space]
• Simple, straightforward implementation
• Easy to implement and understand
• But very poor utilization of disk bandwidth. Why?
Data and Inode Placement – problem #1
• On a new FS, blocks are allocated sequentially, close to each other.
A1 A2 B1 B2 C1 C2 D1 D2 E1 E2
• As the FS gets older, files are deleted, creating random gaps:
A1 A2 B1 B2 C1 C2 D1 D2
• In aging file systems, data blocks end up allocated far from each other:
A1 A2 F1 F2 C1 C2 D1 D2 F3
• Data blocks for new files end up scattered across the disk!
• Fragmentation of an aging file system causes more seeking!
Data and Inode Placement – problem #2
[Layout: Superblock | Bitmap | Inodes | Data Blocks – default usage of LBN space]
• Inodes allocated far from data blocks
• All inodes at beginning of disk, far from data
• Recall that when we traverse a file path, at each level we inspect the inode first, then access the data block.
• Traversing file name paths and manipulating files or directories requires going back and forth from inodes to data blocks
• => Again, lots of seeks!
• BSD Unix folks did a redesign (early-mid 80s) that they called the Fast File System (FFS)
• Improved disk utilization, decreased response time
• McKusick, Joy, Leffler, and Fabry, ACM TOCS, Aug. 1984
• A major breakthrough in the history of File Systems
• All modern FSs draw from the lessons learned from FFS
• Good example of being device-aware for performance!
Cylinder Groups
• BSD FFS addressed placement problems using the notion of a cylinder group (aka allocation groups in lots of modern FS’s)
• Disk partitioned into groups of cylinders
• Data blocks in same file allocated in same cylinder group
• Files in same directory allocated in same cylinder group
• Inodes for files are allocated in same cylinder group as file data blocks
[Figure: cylinder group organization – the disk divided into cylinder groups, with a superblock at the start of a group]
Cylinder Groups (continued)
• Allocation in cylinder groups provides closeness
• Reduces number of long seeks
• Free space requirement
• To be able to allocate according to cylinder groups, the disk must have free space scattered across cylinders
• 10% of the disk is reserved just for keeping the disk partially free all the time
• When allocating large files, break them into large chunks and allocate from different cylinder groups, so one file does not fill up a single cylinder group
• If the preferred cylinder group is full, allocate from a “nearby” group (see the sketch below)
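A sketch of the group-selection policy just described; the structures, sizes, and thresholds are invented for illustration and are not the actual BSD code:

    #include <stdint.h>

    #define NGROUPS 64
    #define CHUNK_BLOCKS 1024   /* large files allocate one chunk at a time,
                                   calling pick_group() again per chunk */

    struct cg { uint32_t free_blocks; };
    static struct cg groups[NGROUPS];

    /* Prefer the group holding the file's inode; if it is too full,
       spiral outward to the nearest group with enough space. */
    int pick_group(int inode_group, uint32_t need)
    {
        if (groups[inode_group].free_blocks >= need)
            return inode_group;
        for (int d = 1; d < NGROUPS; d++) {
            int lo = inode_group - d, hi = inode_group + d;
            if (lo >= 0 && groups[lo].free_blocks >= need) return lo;
            if (hi < NGROUPS && groups[hi].free_blocks >= need) return hi;
        }
        return -1;   /* disk effectively full */
    }

The spiral search is what makes the fallback group “nearby”: seek distance grows with the group distance d, so we take the smallest d that has room.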
Disk Scheduling Algorithms
• Because seeks are so expensive (milliseconds!), OS tries to schedule disk requests that are queued waiting for the disk
• Goal: Minimize seeks!
• Policies:
• FCFS (do nothing)
• Reasonable when load is low
• Long waiting times for long request queues
• SSTF (shortest seek time first)
• Minimize arm movement (seek time), maximize request rate
• Favors middle blocks
• SCAN (elevator)
• Service requests in one direction until done, then reverse
• C-SCAN
• Like SCAN, but only go in one direction (typewriter)
Disk Scheduling Algorithms (2)
• LOOK / C-LOOK
• Like SCAN/C-SCAN, but only go as far as the last request in each direction (not the full width of the disk)
• In general, unless there are request queues, disk scheduling does not have much impact
• Important for servers, less so for PCs
• Modern disks often do the disk scheduling themselves
• Disks know their layout better than the OS, can optimize better
• If so, the disk ignores/undoes any scheduling done by the OS
Example: FCFS
Request queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53
Total head movement: 45 + 85 + 146 + 85 + 108 + 110 + 59 + 2 = 640

Example: SSTF
Request queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53
Total head movement: 12 + 2 + 30 + 23 + 84 + 24 + 2 + 59 = 236

Example: SCAN
Request queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53, moving toward 0
Total head movement: 16 + 23 + 14 + 65 + 2 + 31 + 24 + 2 + 59 = 236

Example: LOOK
Request queue: 98, 183, 37, 122, 14, 124, 65, 67; head starts at 53, moving toward 0
Total head movement: 16 + 23 + 51 + 2 + 31 + 24 + 2 + 59 = 208
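These totals are easy to verify mechanically. A small self-contained C sketch simulating FCFS and SSTF over the same queue (SCAN/LOOK follow the same pattern):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    #define N 8
    static const int q[N] = {98, 183, 37, 122, 14, 124, 65, 67};

    /* FCFS: serve requests in arrival order, summing head movement. */
    int fcfs(int head)
    {
        int total = 0;
        for (int i = 0; i < N; i++) { total += abs(q[i] - head); head = q[i]; }
        return total;
    }

    /* SSTF: repeatedly serve the pending request nearest the head. */
    int sstf(int head)
    {
        bool done[N] = {false};
        int total = 0;
        for (int served = 0; served < N; served++) {
            int best = -1;
            for (int i = 0; i < N; i++)
                if (!done[i] && (best < 0 || abs(q[i] - head) < abs(q[best] - head)))
                    best = i;
            total += abs(q[best] - head);
            head = q[best];
            done[best] = true;
        }
        return total;
    }

    int main(void)
    {
        printf("FCFS: %d\n", fcfs(53));   /* prints 640, matching the slide */
        printf("SSTF: %d\n", sstf(53));   /* prints 236, matching the slide */
        return 0;
    }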
• Awareness of storage media - Disks
• Physical structure
• Placement problems and strategies, FFS
• Disk scheduling algorithms
• FS performance enhancements: caching, read-ahead
Next up…
• Reliability
• Error detection and recovery
• Journaling (ext3, ext4, JFS, …)
Reliability
• Key question: How do we guarantee consistency of on-disk storage?
• How do we handle incomplete operations due to power outage or OS crash?
FFS: Consistency Issues – Overview
• Inodes: fixed size structures stored in cylinder groups
[Figure: inodes laid out within a cylinder group]
• Metadata updates must be synchronous operations. Why?
• File system operations affect multiple metadata blocks, and the update order matters (see the sketch after this list):
• Write a newly allocated inode to disk before its name is entered in a directory.
• Remove a directory name before the inode is deallocated.
• Deallocate an inode (mark as free in bitmap) before that file’s data blocks are placed into the cylinder group free list.
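A sketch of the first ordering rule; write_block_sync() is a hypothetical helper standing in for a synchronous buffer-cache write, and the block layout is illustrative:

    #include <stdint.h>

    extern void write_block_sync(uint64_t blkno, const void *buf);  /* assumed helper */

    void create_file(uint64_t inode_blk, const void *inode_buf,
                     uint64_t dir_blk,   const void *dir_buf)
    {
        /* 1. Persist the initialized inode first. */
        write_block_sync(inode_blk, inode_buf);
        /* 2. Only then persist the directory entry that names it.
           A crash between the two leaves an unreferenced inode
           (recoverable by fsck), never a name pointing at garbage. */
        write_block_sync(dir_blk, dir_buf);
    }

Reversing the two writes would risk a directory entry referring to an uninitialized inode after a crash, which is the worse failure.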
FFS Observation 1: Crash recovery
• If the OS crashes in between any of these synchronous operations, then the file system is in an inconsistent state.
• Solutions (overview):
• fsck – post-crash recovery process to scan file system structure and restore consistency
• All data blocks pointed to by inodes (and indirect blocks) must be marked allocated in the data bitmap
• All allocated inodes must be in some dir entries
• Inode link count must match
• Log updates to enable roll-back or roll-forward
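A sketch of the first check (data bitmap vs blocks reachable from inodes), with the on-disk structures simplified to direct pointers only; this illustrates the idea, not real fsck code:

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>
    #include <stdio.h>

    #define NBLOCKS 1024
    #define NINODES 128
    #define NDIRECT 12

    struct inode { bool in_use; uint32_t direct[NDIRECT]; uint32_t nblocks; };

    /* Recompute which blocks are reachable from in-use inodes, then
       make the on-disk data bitmap agree with that. */
    void check_bitmap(const struct inode *inodes, bool *bitmap /* NBLOCKS */)
    {
        bool reachable[NBLOCKS];
        memset(reachable, 0, sizeof reachable);
        for (int i = 0; i < NINODES; i++)
            if (inodes[i].in_use)
                for (uint32_t j = 0; j < inodes[i].nblocks && j < NDIRECT; j++)
                    reachable[inodes[i].direct[j]] = true;
        for (int b = 0; b < NBLOCKS; b++) {
            if (reachable[b] && !bitmap[b]) {
                printf("block %d in use but marked free: fixing\n", b);
                bitmap[b] = true;
            } else if (!reachable[b] && bitmap[b]) {
                printf("block %d marked used but unreachable: freeing\n", b);
                bitmap[b] = false;
            }
        }
    }

The directory-entry and link-count checks listed above work the same way: recompute the truth from one structure and repair the other to match.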
Example: update
• Consider a simple update: append 1 data block to a file
• Assume a similar FS structure as seen before:
  Inode bitmap: 01000000 | Data bitmap: 00001000 | Inodes: I[v1] | Data blocks: …
• Add a data block Db. What changes?
  Inode bitmap: 01000000 | Data bitmap: 00001100 | Inodes: I[v2] | Data blocks: … Db
• Three writes: I[v2], B[v2], Db
Crash consistency
• What if only one write succeeds before a crash?
• 1. Just the Db write succeeds.
• No inode, no bitmap => as if the write did not occur
• FS not inconsistent, but data is lost!
• 2. Just the I[v2] write succeeds.
• No data block => will read garbage data from disk.
• No bitmap entry, but inode has a pointer to Db => FS inconsistency!
• 3. Just the B[v2] write succeeds.
• Bitmap says Db is allocated, inode has no pointer to it => again, FS inconsistent + Db can never be used again
Other crash scenarios
• What if only two writes succeed before a crash?
• 1. Only the I[v2] and B[v2] writes succeed.
• Inode and bitmap agree => FS metadata is consistent
• However, Db contains garbage.
• 2. Only the I[v2] and Db writes succeed.
• Inode points to correct data, but clashes with B[v1] => FS inconsistency, must fix this!
• 3. Only B[v2] and Db writes succeed.
• Again, inode and bitmap info does not match
• Even though Db was written, no inode points to it (which file is it part of? No idea!)
Crash consistency problem
Solution #1: fsck
Similar tools exist on various systems
Solution: fsck
• fsck: UNIX tool for finding inconsistencies and repairing them
• Cannot fix all problems!
• When Db is garbage –