CS计算机代考程序代写 cache FTP concurrency data structure algorithm mips scheme database chain file system Chapter 12: File System Implementation

Chapter 12: File System Implementation

Chapter 12:
File System Implementation

Chapter 12: File System Implementation
File-System Structure
File-System Implementation
Directory Implementation
Allocation Methods
Free-Space Management
Efficiency and Performance
Recovery
NFS
Example: WAFL File System

Objectives

To describe the details of implementing local file systems and directory structures

To discuss block allocation and free-block algorithms and trade-offs

File-System Structure
File is a Logical storage unit And/or a Collection of related information
File system resides on secondary storage (disks)
Disk provides in-place rewrite and random access
I/O transfers performed in blocks of sectors (usually 512 bytes)
Two different design challenges:
How does the file look to the user?
Defining a file and its attributes
Operations allowed on the file
Directory structure for organizing files
Provides user interface to storage, mapping logical to physical
Create algorithms and data structures to map the logical file system onto the physical secondary storage devices

Layered File System

“read drive1, cylinder 72, track 2, sector 10, into memory location 1060”
low-level hardware specific commands to hardware controller
“retrieve physical block 123”
“read drive1, cylinder 72, track 2, sector 10, into memory location 1060”
“retrieve physical block 123”
“retrieve logical block 63”
“retrieve logical block 63”
“read Customer file”
“read Customer file”

File-System Structure

File control block – storage structure consisting of information about a file
Device driver controls the physical device
File system organized into layers

File System Layers

Device drivers manage I/O devices at the I/O control layer
Given commands like “read drive1, cylinder 72, track 2, sector 10, into memory location 1060” outputs low-level hardware specific commands to hardware controller
Basic file system given command like “retrieve block 123” translates to device driver
manages memory buffers and caches (allocation, freeing, replacement)
Buffers hold data in transit
Caches hold frequently used data
File organization module understands files, logical address, and physical blocks
Translates logical block # to physical block #
Manages free space, disk allocation

File System Layers (Cont.)

Logical file system manages metadata information
Metadata – Data about Data. Anything but the Data
Translates file name into file number, file handle, and location by maintaining File Control Blocks (inodes in UNIX). They contain
File ownership
Permissions
Location of File contents
Directory management – for use by File Organization module
Protection

Layering useful for reducing complexity and redundancy, but adds overhead and can decrease performance
Logical layers can be implemented by any coding method according to OS designer

File System Layers (Cont.)

File-system research is an active area within OS design.

Many file systems are in use today, sometimes many within an operating system each with its own format

CD-ROM is ISO 9660

Unix has UFS based on Berkeley FFS

Windows has FAT, FAT32, NTFS as well as floppy, CD, DVD Blu-ray (NTFS replaced FAT as storage sizes grew)

Linux has more than 40 types, with extended file system ext2 and ext3 leading; plus distributed file systems, etc.

New ones still arriving –

GoogleFS – high-performance access from many clients across a very large number of disks

FUSE – file systems implemented at user level, not kernel level.

Google File System

File-System Implementation
On-disk and in-memory structures are used to implement a file system.
On Disk
Boot control block contains info needed by system to boot OS from that volume
Needed if volume contains OS, usually first block of volume
Volume control block (superblock, master file table) contains volume details
Total # of blocks, # of free blocks, block size, free block pointers or array
Free-FCB count and FCB pointers
Directory structure organizes the files
File Names and inode numbers.
In NTFS it is stored in master file table
Per-file FCB containing many details about the file
Unique ID to allow association with a directory entry.

File-System Implementation
In-memory structures
Used for both file-system management and performance improvement via caching
Data are loaded at mount time
Data are updated during file-system operations
Data are discarded at dismount
Data Structures include:
In-memory mount table contains information about each mounted table
In-memory Directory Structure Cache holds the directory information of recently accessed directories.
The System-wide-Open-File Table contains a copy of the FCB of each open file, as well as other information.
The Per-Process-Open-File Table contains a pointer to the entry in the System-wide-Open-File Table.
Buffers hold file-system blocks when they are being read to or from disk.

File-System Implementation

Per-file File Control Block (FCB) contains many details about the file

To Create a New File

An application program calls the logical file system.
The logical file system knows the format of the directory structures.
It allocates a new FCB or gets one from the free FCB pool
It reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to disk.

To Use a File

The file must be opened before use for I/O.
The open() call passes a file name to the logical file system.
The open() system call first searches the System-wide-Open-File Table to see if the file is already in use by another process.
If it is, a Per-Process-Open-File Table entry is created pointing to the existing System-wide-Open-File table.
If the file is not already open, the directory structure is searched for the provided file name. Parts of the directory structure are cached in memory to speed this search operation.
When the file is found, the FCB is copied into the System-wide-Open-File Table in memory
An entry is then made in the Per-Process-Open-File Table with a pointer to the System-wide-Open-File table and some other fields:
A pointer to the current location in the file (for the next read() and write() operations)
The access mode in which the file is open.

To Use a File (cont.)

The open() call returns a pointer to the appropriate entry in the Per-Process-Open-File Table
All file operations are then used using this pointer.
The file name may not be part of the Open-File table as the system has no use for it once the FCB is located on disk.
It could be cached to save time on subsequent opens of the same file.
The file entry is known as
A File Descriptor in UNIX
A file Handle in Windows

When a Process Closes a File
The Per-Process-Open-File Table entry is removed.
The System-wide-Open-File table entry’s open count is decremented.
When the last user of the file closes it, any updated metadata is copied back to the disk-based directory structure and the System-wide-Open-File table entry is removed.
It could be cached to save time on subsequent opens of the same file.

In-Memory
File System Structures

Partitions and Mounting
The layout of a disk can have many variations depending on the OS.
A disk can be sliced into multiple partitions or a volume can span multiple partitions on multiple disks.
Partition can be a volume containing a file system (“cooked”) or raw – just a sequence of blocks with no file system
Raw disk is used where no file system is appropriate
UNIX swap space
Some databases
Boot Loader set of blocks that contain enough code to know how to find and load the kernel from the file system
Boot Loader can also support dual-booting.

Dual Booting

Dual booting (aka multi booting) is the act of installing multiple operating systems on a computer and being able to choose which one to boot when switching on the computer.
A common reason for dual booting is for compatibility. Not everything for Windows works on Mac and vice versa.
Another reason is for testing purposes. The ability to quickly switch between operating systems can help in the development of software.
The first step is to partition your hard drive
Install your other operating system on the new partition
Restart your computer and boot up either operating system

Partitions and Mounting

Root partition contains the OS kernel
Other partitions (containing other OSs, other file systems, or be raw) can mount automatically or manually
At mount time, file system consistency checked
Is all metadata correct?
If not, fix it, try again
If yes, add to mount table, allow access

Virtual File Systems
Virtual File Systems (VFS) on Unix provide an object-oriented way of implementing file systems
VFS allows the same system call interface (the API) to be used for different types of file systems
Separates file-system generic operations from implementation details
Implementation can be one of many file systems types, or network file system
Implements vnodes which hold inodes or network file details
Then dispatches operation to appropriate file system implementation routines

Virtual File Systems (Cont.)

The API is to the VFS interface, rather than any specific type of file system

Virtual File System Implementation
For example, Linux has four object types:
inode, file, superblock, dentry
VFS defines set of operations on the objects that must be implemented
Every object has a pointer to a function table
Function table has addresses of routines to implement that function on that object
For example:
• int open(. . .)—Open a file
• int close(. . .)—Close an already-open file
• ssize t read(. . .)—Read from a file
• ssize t write(. . .)—Write to a file
• int mmap(. . .)—Memory-map a file

Directory Implementation
Linear list of file names with pointer to the data blocks
Simple to program and Time-consuming to execute
To create a new file we must search the directory to be sure that no existing file has the same name.
The delete a file, we search the directory for the named file and then release the space allocated to it.
To reuse the directory entry we can:
Mark the entry unused (special name, dirty bit)
Attach it to a list of free directory entries
Copy the last entry in the directory into the freed location and decrease the length of the directory.
A linked list can decrease time to delete a file. WHY?

The selection of Directory allocation and Directory Management algorithms significantly affects the efficiency, performance, and reliability of the file system.

Directory Implementation

The selection of Directory allocation and Directory Management algorithms significantly affects the efficiency, performance, and reliability of the file system.
Linear list disadvantage is that finding a file requires a linear search
Directory information is used frequently and users will notices if access to it is slow.
Many OSs implement a software cache to store the most recently used directory information. A cache hit avoids the need to constantly reread the information from disk.
A sorted list will support a binary search and decreased average search time.
Maintaining the sorted list makes creating an deleting files more complicated since many entries may have to be moved to maintain the sorted list
An advantage of the sorted list is that a sorted directory listing can be produced without a separate sort step.

Directory Implementation
Hash Table – linear list with hash data structure
Decreases directory search time
Collisions – situations where two file names hash to the same location
Only good if entries are fixed size, or use chained-overflow method
The selection of Directory allocation and Directory Management algorithms significantly affects the efficiency, performance, and reliability of the file system.

Allocation Methods – Contiguous
An allocation method refers to how disk blocks are allocated for files:
Contiguous allocation – each file occupies set of contiguous blocks
Best performance in most cases
Simple – only starting location (block #) and length (number of blocks) are required
Problems include finding space for file, knowing file size (particularly, output files), external fragmentation, need for compaction off-line (downtime) or on-line (with performance degradation)

Contiguous Allocation
Mapping from logical to physical

Extent-Based Systems
Many newer file systems (i.e., Veritas File System) use a modified contiguous allocation scheme

Extent-based file systems allocate disk blocks in extents

An extent is a contiguous block of disks
Extents are allocated for file allocation
A file consists of one or more extents

Allocation Methods – Linked
Linked allocation – each file is a linked list of disk blocks that may be scattered anywhere on the disk.
Directory contains a pointer to the first and last blocks of the file.
To create a new file, we simply create an new entry in the directory
Each directory entry has a pointer to the first disk block of the file
Pointer is initialized to null to signify an empty file.
Size field is set to zero.
A write to the file calls the free-space management module to find a free block. The new block is written to, and is linked to the end of the file.
To read the file, we read blocks by following the pointers from block to block.
No compaction, external fragmentation

Allocation Methods – Linked
Disadvantages:
It can only be used effectively with sequential-access files.
To find the ith block of a file, we must start at the beginning of that file and follow the pointers to the ith block.
Each access to a pointer requires a disk read and some require a disk seek
Hence, it is inefficient for direct access capability
Pointers also take up space on the disk.
Improve efficiency by clustering blocks into groups called clusters
With four blocks per cluster, the relative size of pointer is smaller.
but increases internal fragmentation
Reliability can be a problem if pointer gets lost of damaged.
Doubly linked lists can increase reliability at higher overhead cost.

Linked Allocation

Allocation Methods – Linked (Cont.)
FAT (File Allocation Table) variation
Beginning of volume has table, indexed by block number
Much like a linked list, but faster on disk and cacheable
Can result in significant number of disk head seeks since the FAT is at the beginning of the volume and then it needs to access the block itself which may be anywhere.
New block allocation simple

File-Allocation Table

Linked Allocation
Each file is a linked list of disk blocks: blocks may be scattered anywhere on the disk
pointer

block =

Mapping
Block to be accessed is the Qth block in the linked chain of blocks representing the file.

Displacement into block = R + 1
LA/511
Q
R

Allocation Methods – Indexed
Indexed allocation
Each file has its own index block(s) of pointers to its data blocks

Logical view

Example of Indexed Allocation

Indexed Allocation (Cont.)
Need index table

Random access is supported

Dynamic access without external fragmentation, but have overhead of large index block for even small files.

Which leads to the question..
How large should the index block be?
Since every file must have one we want it to be as small as possible.
But what happens if it is too small? We will not have enough pointers for a large file?

Indexed Allocation – Mapping (Cont.)

Linked Scheme
Index Blocks are initially one disk block
These can be read and written to directly by itself.
For larger file we can link Index Blocks as follows:
An index block contains a small header that contains
The name of the file
A set of the first 100 disk-block addresses.
The ‘next’ address (last word in the index block) is null for a small file, or contains a pointer to another index block for a large file.

Indexed Allocation – Mapping (Cont.)

Multilevel Index:
A variant of linked representation
Uses a first-level index block to point to a set of second-level index blocks which point to file blocks
To access a block the OS uses the first-level index to find the second level index block and then uses that block to find the desired data block.
This approach could be continued to a third or fourth level depending on the file size.
Two-level index (4K blocks could store 1,024 four-byte pointers in outer index -> 1,048,567 data blocks and file size of up to 4GB)

Indexed Allocation – Mapping (Cont.)

Indexed Allocation – Mapping (Cont.)
Combined Scheme:
Used in UNIX-based file systems
Keep first 15 pointers in file’s inode.
First 12 point to direct blocks (the contain addresses that point to blocks that contain file data)
For small files (up to 48K) no additional index block is required.
The next three pointers point to indirect blocks
The first points to a single indirect block – containing addresses of blocks containing data
The second points to a double indirect block – containing addresses of blocks that contain addresses of blocks containing data
The third points to a triple indirect block

With 64-bit pointers as in UNIX and LINUX this allows file systems to be several exibytes in size.

Combined Scheme: UNIX UFS

More index blocks than can be addressed with 32-bit file pointer
4K bytes per block, 32-bit addresses

Performance
Allocation methods vary in:
Storage efficiency
Data-block access time
Best method depends on file access type
Contiguous great for sequential and random
Linked good for sequential, not random
Declare access type at creation -> select either contiguous or linked
Indexed more complex
Single block access could require 2 index block reads then data block read
Clustering can help improve throughput, reduce CPU overhead

Performance (Cont.)
Adding instructions to the execution path to save one disk I/O is reasonable
Intel Core i7 Extreme Edition 990x (2011) at 3.46Ghz = 159,000 MIPS
http://en.wikipedia.org/wiki/Instructions_per_second
Typical disk drive at 250 I/Os per second
159,000 MIPS / 250 = 630 million instructions during one disk I/O
Fast SSD drives provide 60,000 IOPS
159,000 MIPS / 60,000 = 2.65 million instructions during one disk I/O

Free-Space Management
Disk space is limited
We need to reuse the space from deleted files for new files wherever possible.
To keep track of free disk space the systems maintains a free-space list
The free-space list records all free disk blocks –not assigned to a file or directory.
To create a file, we search the free-space list for the required amount of space and allocate that space to the new file.
The space is removed from the free-space list.
When a file is deleted, its disk space is added to the free-space list
The free-space “list” is often not implemented as a “list”!!

Free-Space Management
File system maintains free-space list to track available blocks/clusters
(Using term “block” for simplicity)
Bit vector or bit map (n blocks)

…

0
1
2
n-1
bit[i] =

1  block[i] free
0  block[i] occupied
Block number calculation
(number of bits per word) *
(number of 0-value words) +
offset of first 1 bit
CPUs have instructions to return offset within word of first “1” bit

Free-Space Management (Cont.)
Bit map requires extra space
Example:
block size = 4KB = 212 bytes
disk size = 240 bytes (1 terabyte)
n = 240/212 = 228 bits (or 32MB)
if clusters of 4 blocks -> 8MB of memory

Easy to get contiguous files

Linked Free Space List on Disk

Linked list (free list)
Pointer to first free block is kept at special location on the disk and cached into main memory
Cannot get contiguous space easily
No waste of space
No need to traverse the entire list (if # free blocks recorded)

Free-Space Management (Cont.)
Grouping
Modify linked list to store address of next n-1 free blocks in first free block, plus a pointer to next block that contains free-block-pointers (like this one)
This makes it easy and quick to find addresses of a large number of free blocks.

Counting
Because space is frequently contiguously used and freed, with contiguous-allocation allocation, extents, or clustering
Keep address of first free block and with it, a count of following free blocks
Free space list then has entries containing addresses and counts
Balanced tree would be used rather linked list for efficient lookup, insertion, and deletion.

Free-Space Management (Cont.)
Space Maps
Used in ZFS
Consider meta-data I/O on very large file systems
Full data structures like bit maps couldn’t fit in memory -> thousands of I/Os
Divides device space into metaslab units and manages metaslabs
Given volume can contain hundreds of metaslabs
Each metaslab has associated space map
Uses counting algorithm
But records to log file rather than file system
Log of all block activity, in time order, in counting format
Metaslab activity -> load space map into memory in balanced-tree structure, indexed by offset
Replay log into that structure
Combine contiguous free blocks into single entry

Efficiency and Performance
Efficiency dependent on:
Disk allocation and directory algorithms
Types of data kept in file’s directory entry
Pre-allocation or as-needed allocation of metadata structures
Fixed-size or varying-size data structures

Efficiency and Performance (Cont.)

Performance
Keeping data and metadata close together
Buffer cache – separate section of main memory for frequently used blocks
Synchronous writes sometimes requested by apps or needed by OS
No buffering / caching – writes must hit disk before acknowledgement
Asynchronous writes more common, buffer-able, faster
Free-behind and read-ahead – techniques to optimize sequential access
Reads frequently slower than writes

Page Cache
A page cache caches pages rather than disk blocks using virtual memory techniques and addresses

Memory-mapped I/O uses a page cache

Routine I/O through the file system uses the buffer (disk) cache

This leads to the following figure

I/O Without a Unified Buffer Cache

Unified Buffer Cache
A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary file system I/O to avoid double caching

But which caches get priority, and what replacement algorithms to use?

I/O Using a Unified Buffer Cache

Recovery
Consistency checking – compares data in directory structure with data blocks on disk, and tries to fix inconsistencies
Can be slow and sometimes fails

Use system programs to back up data from disk to another storage device (magnetic tape, other magnetic disk, optical)

Recover lost file or disk by restoring data from backup

Log Structured File Systems
Log structured (or journaling) file systems record each metadata update to the file system as a transaction
All transactions are written to a log
A transaction is considered committed once it is written to the log (sequentially)
Sometimes to a separate device or section of disk
However, the file system may not yet be updated
The transactions in the log are asynchronously written to the file system structures
When the file system structures are modified, the transaction is removed from the log
If the file system crashes, all remaining transactions in the log must still be performed
Faster recovery from crash, removes chance of inconsistency of metadata

The Sun Network File System (NFS)
An implementation and a specification of a software system for accessing remote files across LANs (or WANs)

The implementation is part of the Solaris and SunOS operating systems running on Sun workstations using an unreliable datagram protocol (UDP/IP protocol and Ethernet

NFS (Cont.)
Interconnected workstations viewed as a set of independent machines with independent file systems, which allows sharing among these file systems in a transparent manner
A remote directory is mounted over a local file system directory
The mounted directory looks like an integral subtree of the local file system, replacing the subtree descending from the local directory
Specification of the remote directory for the mount operation is nontransparent; the host name of the remote directory has to be provided
Files in the remote directory can then be accessed in a transparent manner
Subject to access-rights accreditation, potentially any file system (or directory within a file system), can be mounted remotely on top of any local directory

NFS (Cont.)
NFS is designed to operate in a heterogeneous environment of different machines, operating systems, and network architectures; the NFS specifications independent of these media

This independence is achieved through the use of RPC primitives built on top of an External Data Representation (XDR) protocol used between two implementation-independent interfaces

The NFS specification distinguishes between the services provided by a mount mechanism and the actual remote-file-access services

Three Independent File Systems

Mounting in NFS
Mounts
Cascading mounts

NFS Mount Protocol
Establishes initial logical connection between server and client
Mount operation includes name of remote directory to be mounted and name of server machine storing it
Mount request is mapped to corresponding RPC and forwarded to mount server running on server machine
Export list – specifies local file systems that server exports for mounting, along with names of machines that are permitted to mount them
Following a mount request that conforms to its export list, the server returns a file handle—a key for further accesses
File handle – a file-system identifier, and an inode number to identify the mounted directory within the exported file system
The mount operation changes only the user’s view and does not affect the server side

NFS Protocol
Provides a set of remote procedure calls for remote file operations. The procedures support the following operations:
searching for a file within a directory
reading a set of directory entries
manipulating links and directories
accessing file attributes
reading and writing files
NFS servers are stateless; each request has to provide a full set of arguments (NFS V4 is just coming available – very different, stateful)
Modified data must be committed to the server’s disk before results are returned to the client (lose advantages of caching)
The NFS protocol does not provide concurrency-control mechanisms

Three Major Layers of NFS Architecture
UNIX file-system interface (based on the open, read, write, and close calls, and file descriptors)

Virtual File System (VFS) layer – distinguishes local files from remote ones, and local files are further distinguished according to their file-system types
The VFS activates file-system-specific operations to handle local requests according to their file-system types
Calls the NFS protocol procedures for remote requests

NFS service layer – bottom layer of the architecture
Implements the NFS protocol

Schematic View of NFS Architecture

NFS Path-Name Translation
Performed by breaking the path into component names and performing a separate NFS lookup call for every pair of component name and directory vnode

To make lookup faster, a directory name lookup cache on the client’s side holds the vnodes for remote directory names

NFS Remote Operations
Nearly one-to-one correspondence between regular UNIX system calls and the NFS protocol RPCs (except opening and closing files)
NFS adheres to the remote-service paradigm, but employs buffering and caching techniques for the sake of performance
File-blocks cache – when a file is opened, the kernel checks with the remote server whether to fetch or revalidate the cached attributes
Cached file blocks are used only if the corresponding cached attributes are up to date
File-attribute cache – the attribute cache is updated whenever new attributes arrive from the server
Clients do not free delayed-write blocks until the server confirms that the data have been written to disk

Example: WAFL File System
Used on Network Appliance “Filers” – distributed file system appliances
“Write-anywhere file layout”
Serves up NFS, CIFS, http, ftp
Random I/O optimized, write optimized
NVRAM for write caching
Similar to Berkeley Fast File System, with extensive modifications

The WAFL File Layout

Snapshots in WAFL

End of Chapter 12

.MsftOfcThm_Accent1_Fill {
fill:#4472C4;
}
.MsftOfcThm_Accent1_Stroke {
stroke:#4472C4;
}

.MsftOfcThm_Accent5_Fill {
fill:#D36F68;
}

.MsftOfcThm_Accent2_Fill {
fill:#656A59;
}

.MsftOfcThm_Accent3_Fill {
fill:#46B2B5;
}

.MsftOfcThm_Accent4_Fill {
fill:#8CAA7E;
}

.MsftOfcThm_Accent1_Fill {
fill:#4472C4;
}
.MsftOfcThm_Accent1_Stroke {
stroke:#4472C4;
}

.MsftOfcThm_Background1_Fill {
fill:#FFFFFF;
}

i
n
d
e
x

t
a
b
l
e

.MsftOfcThm_Accent1_Fill {
fill:#4472C4;
}
.MsftOfcThm_Accent1_Stroke {
stroke:#4472C4;
}

Related Posts