
A Fast File System for UNIX
MARSHALL K. MCKUSICK, WILLIAM N. JOY, SAMUEL J. LEFFLER, and ROBERT S. FABRY
Computer Systems Research Group
A reimplementation of the UNIXTM file system is described. The reimplementation provides substantially higher throughput rates by using more flexible allocation policies that allow better locality of reference and can be adapted to a wide range of peripheral and processor characteristics. The new file system clusters data that is sequentially accessed and provides two block sizes to allow fast access to large files while not wasting large amounts of space for small files. File access rates of up to ten times faster than the traditional UNIX file system are experienced. Long-needed enhancements to the programmers’ interface are discussed. These include a mechanism to place advisory locks on files, extensions of the name space across file systems, the ability to use long file names, and provisions for administrative control of resource usage.


Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management–file organization; directory structures; access methods; D.4.2 [Operating Systems]: Storage Management–allocation/deallocation strategies; secondary storage devices; D.4.8 [Operating Systems]: Performance–measurements; operational analysis; H.3.2 [Information Systems]: Information Storage–file organization
General Terms: Measurement, Performance
Additional Keywords and Phrases: UNIX, file system organization, file system performance, file system design, application program interface
1. INTRODUCTION
This paper describes the changes from the original 512-byte UNIX¹ file system to the new file system released with the 4.2 Berkeley Software Distribution. It presents the motivations for the changes, the methods used to effect these changes, the rationale behind the design decisions, and a description of the new implementation. This discussion is followed by a summary of the results that have been obtained, directions for future work, and the additions and changes that have been made to the facilities that are available to programmers.
1UNIX is a trademark of AT&T Bell Laboratories.
This work was done under grants from the National Science Foundation under grant MCS80-05144, and the Defense Advanced Research Projects Agency (DoD) under ARPA Order No. 4031 monitored by Naval Electronic Systems Command under contract N00039-82-C-0235.
Authors’ present addresses: M. K. McKusick and R. S. Fabry, Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720; W. N. Joy, Sun Microsystems, Inc., 2550 Garcia Ave., Mountain View, CA 94043; S. J. Leffler, Lucasfilm Ltd., P.O. Box 2009, San Rafael, CA 94912.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1984 ACM 0734-2071/84/0181-0197 $00.75
ACM Transactions on Computer Systems, Vol. 2, No. 3, August 1984, Pages 181-197.

182 • M.K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry
The original UNIX system that runs on the PDP-11² has simple and elegant file system facilities. File system I/O is buffered by the kernel; there are no alignment constraints on data transfers and all operations are made to appear synchronous. All transfers to the disk are in 512-byte blocks, which can be placed arbitrarily within the data area of the file system. Virtually no constraints other than available disk space are placed on file growth [14, 18].³
When used on the VAX-11 together with other UNIX enhancements, the original 512-byte UNIX file system is incapable of providing the data throughput rates that many applications require. For example, applications such as very-large-scale integration (VLSI) design and image processing involve a small amount of processing on large quantities of data and need to have a high throughput from the file system. High throughput rates are also needed by programs that map files from the file system into large virtual address spaces. Paging data in and out of the file system is likely to occur frequently [5]. This requires a file system providing higher bandwidth than the original 512-byte UNIX one that provides only about two percent of the maximum disk bandwidth or about 20 kilobytes per second per arm [21, 16].
Modifications have been made to the original UNIX file system to improve its performance. Since the UNIX file system interface is well understood and not inherently slow, this development retained the abstraction and simply changed the underlying implementation to increase its throughput. Consequently, users of the system have not been faced with massive software conversion.
Problems with file system performance have been dealt with extensively in the literature; see [15] for a survey. Previous work to improve the UNIX file system performance has been done by Ferrin [4]. The UNIX operating system drew many of its ideas from Multics, a large, high performance operating system [3]. Other work includes Hydra [1], Spice [19], and a file system for a LISP environment [17]. A good introduction to the physical latencies of disks is described in [11].
2. OLD FILE SYSTEM
In the file system developed at Bell Laboratories (the “traditional” file system), each disk drive is divided into one or more partitions. Each of these disk partitions may contain one file system. A file system never spans multiple partitions.⁴ A file system is described by its superblock, which contains the basic parameters of the file system. These include the number of data blocks in the file system, a count of the maximum number of files, and a pointer to the free list, a linked list of all the free blocks in the file system.
Within the file system are files. Certain files are distinguished as directories and contain pointers to files that may themselves be directories. Every file has a
2DEC, PDP, VAX, MASSBUS, and UNIBUS are trademarks of Digital Equipment Corporation.
3In practice, a file’s size is constrained to be less than about one gigabyte.
4By “partition” here we refer to the subdivision of physical space on a disk drive. In the traditional file system, as in the new file system, file systems are really located in logical disk partitions that may overlap. This overlapping is made available, for example, to allow programs to copy entire disk drives containing multiple file systems.

descriptor associated with it called an inode. An inode contains information describing ownership of the file, time stamps marking last modification and access times for the file, and an array of indices that point to the data blocks for the file. For the purposes of this section, we assume that the first 8 blocks of the file are directly referenced by values stored in an inode itself.⁵ An inode may also contain references to indirect blocks containing further data block indices. In a file system with a 512-byte block size, a singly indirect block contains 128 further block addresses, a doubly indirect block contains 128 addresses of further singly indirect blocks, and a triply indirect block contains 128 addresses of further doubly indirect blocks.
A 150-megabyte traditional UNIX file system consists of 4 megabytes of inodes followed by 146 megabytes of data. This organization segregates the inode information from the data; thus accessing a file normally incurs a long seek from the file’s inode to its data. Files in a single directory are typically not allocated consecutive slots in the 4 megabytes of inodes, causing many nonconsecutive blocks of inodes to be accessed when executing operations on the inodes of several files in a directory.
The allocation of data blocks to files is also suboptimum. The traditional file system never transfers more than 512 bytes per disk transaction and often finds that the next sequential data block is not on the same cylinder, forcing seeks between 512-byte transfers. The combination of the small block size, limited read-ahead in the system, and many seeks severely limits file system throughput.
The first work at Berkeley on the UNIX file system attempted to improve both reliability and throughput. The reliability was improved by staging modifications to critical file system information so that they could either be completed or repaired cleanly by a program after a crash [6]. The file system performance was improved by a factor of more than two by changing the basic block size from 512 to 1024 bytes. There were two reasons for the increase: each disk transfer accessed twice as much data, and most files could be described without need to access indirect blocks since the direct blocks contained twice as much data. The file system with these changes will henceforth be referred to as the old file system.
This performance improvement gave a strong indication that increasing the block size was a good method for improving throughput. Although the throughput had doubled, the old file system was still using only about four percent of the disk bandwidth. The main problem was that although the free list was initially ordered for optimal access, it quickly became scrambled as files were created and removed. Eventually the free list became entirely random, causing files to have their blocks allocated randomly over the disk. This forced a seek before every block access. Although old file systems provided transfer rates of up to 175 kilobytes per second when they were first created, this rate deteriorated to 30 kilobytes per second after a few weeks of moderate use because of this randomization of data block placement. There was no way of restoring the performance of an old file system except to dump, rebuild, and restore the file system. Another possibility, as suggested by Maruyama [9], would be to have a process that periodically reorganized the data on the disk to restore locality.
5The actual number may vary from system to system, but is usually in the range 5-13.

3. NEW FILE SYSTEM ORGANIZATION
In the new file system organization (as in the old file system organization), each disk drive contains one or more file systems. A file system is described by its superblock, located at the beginning of the file system’s disk partition. Because the superblock contains critical data, it is replicated to protect against catastrophic loss. This is done when the file system is created; since the superblock data does not change, the copies need not be referenced unless a head crash or other hard disk error causes the default superblock to be unusable.
To insure that it is possible to create files as large as 2^32 bytes with only two levels of indirection, the minimum size of a file system block is 4096 bytes. The size of file system blocks can be any power of two greater than or equal to 4096. The block size of a file system is recorded in the file system’s superblock so it is possible for file systems with different block sizes to be simultaneously accessible on the same system. The block size must be decided at the time that the file system is created; it cannot be subsequently changed without rebuilding the file system.
The new file system organization divides a disk partition into one or more areas called cylinder groups. A cylinder group is comprised of one or more consecutive cylinders on a disk. Associated with each cylinder group is some bookkeeping information that includes a redundant copy of the superblock, space for inodes, a bit map describing available blocks in the cylinder group, and summary information describing the usage of data blocks within the cylinder group. The bit map of available blocks in the cylinder group replaces the traditional file system’s free list. For each cylinder group a static number of inodes is allocated at file system creation time. The default policy is to allocate one inode for each 2048 bytes of space in the cylinder group, expecting this to be far more than will ever be needed.
All the cylinder group bookkeeping information could be placed at the beginning of each cylinder group. However if this approach were used, all the redundant information would be on the top platter. A single hardware failure that destroyed the top platter could cause the loss of all redundant copies of the superblock. Thus the cylinder group bookkeeping information begins at a varying offset from the beginning of the cylinder group. The offset for each successive cylinder group is calculated to be about one track further from the beginning of the cylinder group than the preceding cylinder group. In this way the redundant information spirals down into the pack so that any single track, cylinder, or platter can be lost without losing all copies of the superblock. Except for the first cylinder group, the space between the beginning of the cylinder group and the beginning of the cylinder group information is used for data blocks.⁶
6While it appears that the first cylinder group could be laid out with its superblock at the “known” location, this would not work for file systems with block sizes of 16 kilobytes or greater. This is because of a requirement that the first 8 kilobytes of the disk be reserved for a bootstrap program and a separate requirement that the cylinder group information begin on a file system block boundary. To start the cylinder group on a file system block boundary, file systems with block sizes larger than 8 kilobytes would have to leave an empty space between the end of the boot block and the beginning of the cylinder group. Without knowing the size of the file system blocks, the system would not know what roundup function to use to find the beginning of the first cylinder group.

Table I. Amount of Wasted Space as a Function of Block Size

Space used   Percent
(Mbytes)     waste    Organization
775.2         0.0     Data only, no separation between files
807.8         4.2     Data only, each file starts on 512-byte boundary
828.7         6.9     Data + inodes, 512-byte block UNIX file system
866.5        11.8     Data + inodes, 1024-byte block UNIX file system
948.5        22.4     Data + inodes, 2048-byte block UNIX file system
1128.3       45.6     Data + inodes, 4096-byte block UNIX file system
3.1. Optimizing Storage Utilization
Data are laid out so that larger blocks can be transferred in a single disk transaction, greatly increasing file system throughput. As an example, consider a file in the new file system composed of 4096-byte data blocks. In the old file system this file would be composed of 1024-byte blocks. By increasing the block size, disk accesses in the new file system may transfer up to four times as much information per disk transaction. In large files, several 4096-byte blocks may be allocated from the same cylinder so that even larger data transfers are possible before requiring a seek.
The main problem with larger blocks is that most UNIX file systems are composed of many small files. A uniformly large block size wastes space. Table I shows the effect of file system block size on the amount of wasted space in the file system. The files measured to obtain these figures reside on one of our time-sharing systems that has roughly 1.2 gigabytes of on-line storage. The measurements are based on the active user file systems containing about 920 megabytes of formatted space. The space wasted is calculated to be the percentage of space on the disk not containing user data. As the block size on the disk increases, the waste rises quickly, to an intolerable 45.6 percent waste with 4096-byte file system blocks.
To be able to use large blocks without undue waste, small files must be stored in a more efficient way. The new file system accomplishes this goal by allowing the division of a single file system block into one or more fragments. The file system fragment size is specified at the time that the file system is created; each file system block can be broken optionally into 2, 4, or 8 fragments, each of which is addressable. The lower bound on the size of these fragments is constrained by the disk sector size, typically 512 bytes. The block map associated with each cylinder group records the space available in a cylinder group at the fragment level; to determine if a block is available, aligned fragments are examined. Figure 1 shows a piece of a map from a 4096/1024 file system.
Each bit in the map records the status of a fragment; an “X” shows that the fragment is in use, while a “O” shows that the fragment is available for allocation. In this example, fragments 0-5, 10, and 11 are in use, while fragments 6-9, and 12-15 are free. Fragments of adjoining blocks cannot be used as a full block, even if they are large enough. In this example, fragments 6-9 cannot be allocated as a full block; only fragments 12-15 can be coalesced into a full block.
On a file system with a block size of 4096 bytes and a fragment size of 1024 bytes, a file is represented by zero or more 4096-byte blocks of data, and possibly

Bits in map         XXXX  XXOO  OOXX  OOOO
Fragment numbers    0-3   4-7   8-11  12-15
Block numbers        0     1     2     3

Fig. 1. Example layout of blocks and fragments in a 4096/1024 file system.
a single fragmented block. If a file system block must be fragmented to obtain space for a small amount of data, the remaining fragments of the block are made available for allocation to other files. As an example, consider an 11,000-byte file stored on a 4096/1024 byte file system. This file would use two full size blocks and one three-fragment portion of another block. If no block with three aligned fragments is available at the time the file is created, a full size block is split yielding the necessary fragments and a single unused fragment. This remaining fragment can be allocated to another file as needed.
Space is allocated to a file when a program does a write system call. Each time data is written to a file, the system checks to see if the size of the file has increased.7 If the file needs to be expanded to hold the new data, one of three conditions exists:
(1) There is enough space left in an already allocated block or fragment to hold the new data. The new data are written into the available space.
(2) The file contains no fragmented blocks (and the last block in the file contains insufficient space to hold the new data). If space exists in a block already allocated, the space is filled with new data. If the remainder of the new data contains more than a full block of data, a full block is allocated and the first full block of new data are written there. This process is repeated until less than a full block of new data remains. If the remaining new data to be written will fit in less than a full block, a block with the necessary fragments is located; otherwise a full block is located. The remaining new data are written into the located space.
(3) The file contains one or more fragments (and the fragments contain insufficient space to hold the new data). If the size of the new data plus the size of the data already in the fragments exceeds the size of a full block, a new block is allocated. The contents of the fragments are copied to the beginning of the block and the remainder of the block is filled with new data. The process then continues as in (2) above. Otherwise, if the new data to be written will fit in less than a full block, a block with the necessary fragments is located; otherwise a full block is located. The remaining new data are written into the located space.
