File systems over the network
Recap: How your application reaches the storage device
[Figure: the local I/O stack. Applications issue fread/fwrite on files such as input.bin/output.bin through buffered I/O libraries; the file system turns them into read/write requests on block addresses (0, 512, 4096, …); the device-independent I/O interface (e.g. ioctl) hands them to the device driver, which moves data to and from the device controller in hardware.]
Recap: File systems on a computer
Unix File System
• Hierarchical directory structure
• File = metadata (inode) + data
• Everything is a file
BSD Fast File System — optimized for reads
• Cylinder groups — lay out data carefully with device characteristics, replicated metadata
• Larger block size & fragments to fix the drawback
• A few other new features
Sprite Log-structured File System — optimized for small random writes
• Computers cache a lot — reads are no longer the dominant traffic
• Aggregates small writes into large sequential writes to the disk
• Invalidates older copies to support recovery
Recap: Extent file systems — ext2, ext3, ext4
Basically optimizations over FFS + extents + journaling (write-ahead logs)
Extent — consecutive disk blocks
• A file in ext file systems is a list of extents
• Write-ahead logs — perform writes as in LFS
• Apply the log to the target location when appropriate
Block group
• Modern HDDs do not have the concept of “cylinders”
• They label neighboring sectors with consecutive block addresses
• Does not work for SSDs given the internal log-structured management of block addresses
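To make the extent idea concrete, here is a minimal C sketch (the struct names and the linear search are illustrative assumptions, not ext4's on-disk format) that maps a file block number to a disk block number through a list of extents:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* A hypothetical in-memory extent: a run of consecutive disk blocks. */
struct extent {
    uint64_t logical_block;   /* first file block this extent covers */
    uint64_t physical_block;  /* first disk block backing it         */
    uint32_t length;          /* number of consecutive blocks        */
};

/* Map a file block number to a disk block number by scanning the
 * file's extent list. Returns 0 on success, -1 if the block is a hole. */
int map_block(const struct extent *extents, size_t n,
              uint64_t file_block, uint64_t *disk_block)
{
    for (size_t i = 0; i < n; i++) {
        const struct extent *e = &extents[i];
        if (file_block >= e->logical_block &&
            file_block <  e->logical_block + e->length) {
            *disk_block = e->physical_block + (file_block - e->logical_block);
            return 0;
        }
    }
    return -1;  /* not mapped */
}

int main(void)
{
    /* A 3-block file stored as one extent starting at disk block 1000. */
    struct extent file_extents[] = { { 0, 1000, 3 } };
    uint64_t disk_block;
    if (map_block(file_extents, 1, 2, &disk_block) == 0)
        printf("file block 2 -> disk block %llu\n",
               (unsigned long long)disk_block);
    return 0;
}
```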
Recap: flash SSDs, NVM-based SSDs
Asymmetric read/write behavior/performance
Wears out faster than traditional magnetic disks
Another layer of indirection is introduced
• Intensifies the log-on-log issue
• We need to revise the file system design
The introduction of virtual file system interface
[Figure: the VFS layer. Applications and user-space libraries call open, close, read, write, …; the Virtual File System forwards the same calls to a concrete file system (File system #1, e.g. ext4, or File system #2, e.g. f2fs); each file system issues read/write on block addresses (0, 512, 4096, …) through the device-independent I/O interface (e.g. ioctl) to its device driver and device controller.]
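A minimal sketch of the idea behind the VFS layer, assuming a made-up toyfs backend and a much-simplified dispatch table (the real Linux VFS uses struct file_operations with many more hooks and different signatures): the VFS forwards the same generic call to whichever file system backs the mount point.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, simplified dispatch table. */
struct vfs_ops {
    ssize_t (*read)(int fd, void *buf, size_t len);
};

/* A toy "file system" backend standing in for ext4, f2fs, NFS, ... */
static ssize_t toyfs_read(int fd, void *buf, size_t len)
{
    (void)fd;
    const char *data = "hello from toyfs\n";
    size_t n = strlen(data) < len ? strlen(data) : len;
    memcpy(buf, data, n);
    return (ssize_t)n;
}

static const struct vfs_ops toyfs_ops = { .read = toyfs_read };

/* Each mount point records which file system (ops table) backs it. */
struct mount {
    const char           *mount_point;   /* e.g. "/", "/mnt/nfs" */
    const struct vfs_ops *ops;
};

/* The VFS exposes one generic call and dispatches to the right backend. */
static ssize_t vfs_read(const struct mount *m, int fd, void *buf, size_t len)
{
    return m->ops->read(fd, buf, len);
}

int main(void)
{
    struct mount m = { "/mnt/toy", &toyfs_ops };
    char buf[64];
    ssize_t n = vfs_read(&m, 3, buf, sizeof(buf) - 1);
    if (n >= 0) { buf[n] = '\0'; printf("%s", buf); }
    return 0;
}
```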
Google file system
Network File System
The introduction of virtual file system interface
[Figure: the same VFS stack extended with File system #3 — NFS. ext4 and f2fs still issue read/write on block addresses through the device-independent I/O interface, device drivers, and device controllers, while NFS forwards open, close, read, write, … through the network stack and network device driver to a remote server.]
NFS Client/Server
[Figure: on the client, applications and user-space libraries issue open, close, read, write, … to the Virtual File System, which routes them to the NFS client and out through the network stack and network device driver; on the server, the requests arrive through the network stack, pass through the NFS server and the server's Virtual File System to a local disk file system, which performs read/write on block addresses through the I/O interface, device driver, and device controller.]
How does NFS handle a file?
The client gives its file system a tuple to describe data:
• Volume: identifies which server contains the file — represented by the mount point in UNIX
• inode: where in the server
• Generation number: version number of the file
The local file system forwards the requests to the server
The server responds to the client with file system attributes, as local disks do
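As a sketch, the tuple can be pictured as a small struct; the field names and sizes below are illustrative assumptions (real NFS handles are opaque bytes that only the server interprets):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative NFS file handle: enough to name a file without keeping any
 * per-client state on the server. */
struct nfs_fhandle {
    uint32_t volume_id;    /* which exported file system (the mount point) */
    uint32_t inode_no;     /* where in that file system                    */
    uint32_t generation;   /* inode version, so a recycled inode number
                              does not silently resolve to a new file      */
};

int main(void)
{
    struct nfs_fhandle fh = { .volume_id = 1, .inode_no = 4711, .generation = 2 };
    printf("volume %u, inode %u, generation %u\n",
           fh.volume_id, fh.inode_no, fh.generation);
    return 0;
}
```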
Number of network operations
For a file /mnt/nfs/home/hungwei/foo.c , how many network sends/receives in total does NFS need to perform to fetch the actual file content in the worst case? (assume the file system is mounted to /mnt/nfs)
A. 8 B. 9 C. 10 D. 11 E. 12
https://www.pollev.com/hungweitseng
How open works with NFS
Client: open("/mnt/nfs/home/hungwei/foo.c", O_RDONLY);
• Client → Server: lookup for home; Server → Client: return the inode of home
• Client → Server: read for home; Server → Client: return the data of home
• Client → Server: lookup for hungwei; Server → Client: return the inode of hungwei
• Client → Server: read for hungwei; Server → Client: return the data of hungwei
• Client → Server: lookup for foo.c; Server → Client: return the inode of foo.c
• Client → Server: read for foo.c; Server → Client: return the data of foo.c
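A hedged sketch of the worst case above: every path component below the mount point costs a lookup and a read, each being one request and one reply. The rpc() helper below is a made-up stand-in that only counts messages, not a real NFS client:

```c
#include <stdio.h>

static int messages = 0;

/* Each "RPC" is one request plus one reply over the network. */
static void rpc(const char *op, const char *arg)
{
    messages += 2;
    printf("%-6s %-8s (running total: %d messages)\n", op, arg, messages);
}

int main(void)
{
    /* open("/mnt/nfs/home/hungwei/foo.c") with the mount point /mnt/nfs:
     * in the worst case the client looks up and reads every component. */
    rpc("LOOKUP", "home");    rpc("READ", "home");
    rpc("LOOKUP", "hungwei"); rpc("READ", "hungwei");
    rpc("LOOKUP", "foo.c");   rpc("READ", "foo.c");   /* the file data */
    printf("total network sends/receives: %d\n", messages);
    return 0;
}
```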
Number of network operations
For a file /mnt/nfs/home/hungwei/foo.c , how many network sends/receives in total does NFS need to perform to fetch the actual file content in the worst case? (assume the file system is mounted to /mnt/nfs)
A. 8 B. 9 C. 10 D. 11 E. 12
NFS operations are expensive
• Lots of network round-trips
• NFS server is a user-space daemon
With caching on the clients:
• Only the first reference needs network communication
• Later requests can be satisfied in local memory
Stateless NFS
How many of the following statements fit the reason why NFS uses a stateless protocol, in which the protocol doesn't track any client state?
① Simplify the system design for recovery after server crashes
② Simplify the client design for recovery after client crashes
③ Easier to guarantee file consistency
④ Improve the network latency
A. 0 B. 1 C. 2 D. 3 E. 4
https://www.pollev.com/hungweitseng
Stateless NFS
How many of the following statements fit the reason why NFS uses a stateless protocol, in which the protocol doesn't track any client state?
① Simplify the system design for recovery after server crashes
  (If using a stateful protocol, the FDs on all clients are lost when the server crashes)
② Simplify the client design for recovery after client crashes
  (If using a stateful protocol, the server doesn't know a client has crashed and still considers the file open)
③ Easier to guarantee file consistency
  (The server has no knowledge about who has the file)
④ Improve the network latency
  (Nothing to do with NFS)
A. 0 B. 1 C. 2 D. 3 E. 4
Idempotent operations
Given the same input, an idempotent operation always gives the same output, regardless of how many times the operation is applied
You only need to retry the same operation if it failed
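The same idea can be seen locally with POSIX pread(), which takes an explicit offset and is therefore safe to reissue; the retry limit below is an arbitrary assumption, and the file name is hypothetical:

```c
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

/* Retry an idempotent read: because the offset is explicit, issuing the
 * same request again (e.g. after a lost reply) returns the same bytes. */
ssize_t read_retry(int fd, void *buf, size_t len, off_t offset)
{
    for (int attempt = 0; attempt < 3; attempt++) {   /* arbitrary limit */
        ssize_t n = pread(fd, buf, len, offset);      /* no server-side seek state */
        if (n >= 0 || errno != EINTR)
            return n;                                 /* success or a real error */
    }
    return -1;
}

int main(void)
{
    char buf[512];
    int fd = open("foo.c", O_RDONLY);                 /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = read_retry(fd, buf, sizeof(buf), 0);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```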
Think about this
[Figure: multiple NFS clients (including A and C) and a file server, each client with its own application, file system, and network stack. Client A updates foo.txt in its cache; Client C won't be aware of the change in Client A.]
Flush-on-close: flush all write buffer contents when closing the file
• Later open operations will get the latest content
Force-getattr:
• Opening a file requires a getattr from the server to check timestamps
• An attribute cache remedies the performance overhead
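A hedged sketch of the two policies, using made-up helper names rather than the real NFS client code: close() pushes buffered writes to the server, and open() issues a getattr and drops cached pages if the server's timestamp changed.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical client-side cache state for one file. */
struct cached_file {
    time_t cached_mtime;     /* mtime when the data was cached        */
    int    has_dirty_data;   /* buffered writes not yet on the server */
};

/* Stub helpers standing in for the real NFS client machinery. */
static void flush_dirty_pages(struct cached_file *f)
{ printf("WRITE RPCs: flushing dirty pages\n"); f->has_dirty_data = 0; }

static time_t rpc_getattr_mtime(struct cached_file *f)
{ (void)f; printf("GETATTR RPC\n"); return time(NULL); }

static void invalidate_cache(struct cached_file *f)
{ (void)f; printf("dropping cached pages\n"); }

/* Flush-on-close: a later open() on any client sees the latest content. */
void nfs_client_close(struct cached_file *f)
{
    if (f->has_dirty_data)
        flush_dirty_pages(f);
}

/* Force-getattr: check server timestamps on open; an attribute cache
 * would skip the RPC while a recent answer is still considered valid. */
void nfs_client_open(struct cached_file *f)
{
    time_t mtime = rpc_getattr_mtime(f);
    if (mtime != f->cached_mtime) {
        invalidate_cache(f);
        f->cached_mtime = mtime;
    }
}

int main(void)
{
    struct cached_file f = { .cached_mtime = 0, .has_dirty_data = 1 };
    nfs_client_open(&f);    /* GETATTR, cache invalidated on first open */
    nfs_client_close(&f);   /* flush-on-close pushes buffered writes    */
    return 0;
}
```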
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google), SOSP 2003
How many of the following fit the optimization goals for GFS?
① Optimize for storing small files
② Optimize for fast, modern storage devices
③ Optimize for random writes
④ Optimize for access latencies
A. 0 B. 1 C. 2 D. 3 E. 4
https://www.pollev.com/hungweitseng
How many of the following fit the optimization goals for GFS?
① Optimize for storing small files
② Optimize for fast, modern storage devices
③ Optimize for random writes
④ Optimize for access latencies
A. 0 B. 1 C. 2 D. 3 E. 4
Why we care about GFS
Conventional file systems do not fit the demand of data centers
Workloads in data centers are different from conventional workloads:
• Storage based on inexpensive disks that fail frequently
• Many large files, in contrast to small files for personal data
• Primarily reading streams of data
• Sequential writes appending to the end of existing files
• Must support multiple concurrent operations
• Bandwidth is more critical than latency
Data-center workloads for GFS
MapReduce (MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004)
• Large-scale machine learning problems
• Extraction of user data for popular queries
• Extraction of properties of web pages for new experiments and products
• Large-scale graph computations
BigTable (Bigtable: A Distributed Storage System for Structured Data, OSDI 2006)
• Google Analytics
• Google Earth
• Personalized search
Google Search (Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, vol. 23, 2003)
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, Google
[Figure: MapReduce dataflow. The user application assigns map tasks and reduce tasks to workers; map workers read input splits (Split 0 … Split 4) and produce intermediate data, which reduce workers consume to write Output File #0 and Output File #1. Sharing among workers? No. Overwrite input? No.]
Why we care about GFS
Conventional file systems do not fit the demand of data centers
Workloads in data centers are different from conventional workloads:
• Storage based on inexpensive disks that fail frequently — MapReduce is fault tolerant
• Many large files, in contrast to small files for personal data — MapReduce aims at processing a large amount of data once
• Primarily reading streams of data — MapReduce reads chunks of large files
• Sequential writes appending to the end of existing files — output files keep growing as workers keep writing
• Must support multiple concurrent operations — MapReduce has thousands of workers running simultaneously
• Bandwidth is more critical than latency — MapReduce only wants to finish tasks within a “reasonable” amount of time
What does GFS propose?
Maintaining the same interface
• The same function calls
• The same hierarchical directory/files
Hierarchical namespace implemented with a flat structure
Master / chunkservers / clients
Files are decomposed into large chunks (e.g. 64MB) with unique chunk handles, replicated on multiple chunkservers
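As a sketch of how a client uses large chunks (assuming the 64MB chunk size above; the path and offset are hypothetical), a byte offset is first turned into a chunk index, and only then does the client talk to the master:

```c
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (64ULL * 1024 * 1024)   /* 64 MB GFS chunk size */

int main(void)
{
    /* Hypothetical file and offset, just to show the translation. */
    const char *path = "/data/webcrawl/part-00042";
    uint64_t offset  = 3ULL * 1024 * 1024 * 1024 + 5ULL * 1024 * 1024;

    uint64_t chunk_index  = offset / CHUNK_SIZE;   /* which chunk to ask for  */
    uint64_t chunk_offset = offset % CHUNK_SIZE;   /* where to read inside it */

    printf("%s: offset %llu -> chunk index %llu, offset %llu within the chunk\n",
           path,
           (unsigned long long)offset,
           (unsigned long long)chunk_index,
           (unsigned long long)chunk_offset);

    /* A real client would send (path, chunk_index) to the master, get back
     * a chunk handle plus chunkserver locations, and then read the data
     * directly from one of the chunkservers. */
    return 0;
}
```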
Large Chunks
How many of the following datacenter characteristics can large chunks help
! Storagebasedoninexpensivedisksthatfailfrequently ” Manylargefilesincontrasttosmallfilesforpersonaldata # Primarilyreadingstreamsofdata
$ Sequentialwritesappendingtotheendofexistingfiles & Mustsupportmultipleconcurrentoperations
‘ Bandwidthismorecriticalthanlatency
https://www.pollev.com/hungweitseng
Large Chunks
How many of the following datacenter characteristics can large chunks help with?
① Storage based on inexpensive disks that fail frequently
② Many large files in contrast to small files for personal data
③ Primarily reading streams of data
④ Sequential writes appending to the end of existing files
⑤ Must support multiple concurrent operations
⑥ Bandwidth is more critical than latency
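A back-of-the-envelope calculation of why large chunks help the master: assuming a 1TB file and roughly 64 bytes of master metadata per chunk (the GFS paper reports less than 64 bytes per 64MB chunk), tracking 64MB chunks needs about 1MB of metadata, while tracking 4KB blocks at the same cost per entry would need gigabytes:

```c
#include <stdio.h>

int main(void)
{
    unsigned long long file_size  = 1ULL << 40;   /* a 1 TB file (assumption) */
    unsigned long long chunk_size = 64ULL << 20;  /* 64 MB GFS chunk          */
    unsigned long long block_size = 4ULL  << 10;  /* 4 KB conventional block  */
    unsigned long long meta_per   = 64;           /* ~64 B of master metadata */

    unsigned long long chunks = file_size / chunk_size;   /* 16,384 entries      */
    unsigned long long blocks = file_size / block_size;   /* 268,435,456 entries */

    printf("64MB chunks to track: %llu (~%llu KB of metadata)\n",
           chunks, chunks * meta_per / 1024);
    printf("4KB blocks to track: %llu (~%llu MB of metadata)\n",
           blocks, blocks * meta_per / (1024 * 1024));
    /* Fewer chunks also means fewer client-to-master lookups for a
     * sequential scan of the same file. */
    return 0;
}
```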
Latency Numbers Every Programmer Should Know
Operations                             Latency (ns)      Latency (us)   Latency (ms)   Notes
L1 cache reference                     0.5 ns                                          ~1 CPU cycle
Branch mispredict                      5 ns
L2 cache reference                     7 ns                                            14x L1 cache
Mutex lock/unlock                      25 ns
Main memory reference                  100 ns                                          20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy           3,000 ns          3 us
Send 1K bytes over 1 Gbps network      10,000 ns         10 us
Read 4K randomly from SSD*             150,000 ns        150 us                        ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000 ns        250 us
Round trip within same datacenter      500,000 ns        500 us
Read 1 MB sequentially from SSD*       1,000,000 ns      1,000 us       1 ms           ~1GB/sec SSD, 4X memory
Read 512B from disk                    10,000,000 ns     10,000 us      10 ms          20x datacenter roundtrip
Read 1 MB sequentially from disk       20,000,000 ns     20,000 us      20 ms          80x memory, 20X SSD
Send packet CA-Netherlands-CA          150,000,000 ns    150,000 us     150 ms
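Using the numbers above, a quick sanity check of why bandwidth matters more than latency for 64MB chunks: at 1 Gbps the transfer itself takes about half a second, so the 0.5 ms datacenter round trip is noise. The figures are rough assumptions, not measurements:

```c
#include <stdio.h>

int main(void)
{
    double chunk_bytes = 64.0 * 1024 * 1024;  /* one 64 MB chunk           */
    double link_bps    = 1e9;                 /* 1 Gbps network            */
    double rtt_ms      = 0.5;                 /* round trip within same DC */

    double transfer_ms = chunk_bytes * 8 / link_bps * 1000;   /* ~537 ms */
    printf("transfer time: %.0f ms, round trip: %.1f ms (%.2f%% of the total)\n",
           transfer_ms, rtt_ms, 100.0 * rtt_ms / (transfer_ms + rtt_ms));
    return 0;
}
```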
Announcement
Second-to-last reading quiz due next Tuesday
• No late submission is allowed — to make time for grading and potential revisions
Office hours
• M/Tu 11a-12p, W 2p-3p & F 11a-12p
• Use the office hour Zoom link, not the lecture one
Revision policy
• Fix your bugs and schedule a meeting with the TA within a week after grading
• You have to answer several design questions
• You can get 70% of the remaining credit if you pass