The Google File System
, , and Shun- Google∗
We have designed and implemented the Google File Sys- tem, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous dis- tributed file systems, our design has been driven by obser- vations of our application workloads and technological envi- ronment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore rad- ically different design points.
Copyright By PowCoder代写 加微信 powcoder
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our ser- vice as well as research and development efforts that require large data sets. The largest cluster to date provides hun- dreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Categories and Subject Descriptors
D [4]: 3—Distributed file systems General Terms
Design, reliability, performance, measurement
Fault tolerance, scalability, data storage, clustered storage ∗The authors can be reached at the following addresses:
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP’03, October 19–22, 2003, , , USA. Copyright 2003 ACM 1-58113-757-5/03/0010 …$5.00.
1. INTRODUCTION
We have designed and implemented the Google File Sys- tem (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application work- loads and technological environment, both current and an- ticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined tradi- tional choices and explored radically different points in the design space.
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive com- modity parts and is accessed by a comparable number of client machines. The quantity and quality of the compo- nents virtually guarantee that some are not functional at any given time and some will not recover from their cur- rent failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power sup- plies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many applica- tion objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of ap- proximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running ap- plications. Some may be archival data. Some may be in- termediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the fo- cus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility.
For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more details later in the paper.
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
2. DESIGN OVERVIEW 2.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and op- portunities. We alluded to some key observations earlier and now lay out our assumptions in more details.
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small ran- dom read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are sel- dom modified again. Small writes at arbitrary posi- tions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined se- mantics for multiple clients that concurrently append to the same file. Our files are often used as producer- consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concur- rently append to a file. Atomicity with minimal syn- chronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
• High sustained bandwidth is more important than low latency. Most of our target applications place a pre- mium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
2.2 Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by path- names. We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append opera- tions. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to ap- pend data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is use- ful for implementing multi-way merge results and producer- consumer queues that many clients can simultaneously ap- pend to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed fur- ther in Sections 3.4 and 3.3 respectively.
2.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multi- ple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
The master maintains all file system metadata. This in- cludes the namespace, access control information, the map- ping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master peri- odically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the applica- tion. Clients interact with the master for metadata opera- tions, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
2.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement
Application
GFS master /foo/bar File namespace
chunk 2ef0
GFS client
(chunk handle, chunk locations)
(chunk handle, byte range)
GFS chunkserver
Linux file system
GFS chunkserver
Linux file system
(file name, chunk index)
Instructions to chunkserver Chunkserver state
Figure 1: GFS Architecture
Data messages Control messages
chunk data
and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with refer- ence to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the ap- plication into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the informa- tion for chunks immediately following those requested. This extra information sidesteps several future client-master in- teractions at practically no extra cost.
2.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file sys- tem block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location informa- tion. The reduction is especially significant for our work- loads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persis-
tent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of ma- chines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous re- quests. We fixed this problem by storing such executables with a higher replication factor and by making the batch- queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
2.6 Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (names- paces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the mas- ter’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location informa- tion persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver fail- ures, and chunk migration to balance load and disk space
Record Append
Serial success
interspersed with
inconsistent
Concurrent successes
consistent but undefined
inconsistent
usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.
One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The mas- ter maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Sim- ilarly, the file namespace data typically requires less then 64 bytes per file because it stores file names compactly us- ing prefix compression.
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
2.6.2 Chunk Locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with reg- ular HeartBeat messages.
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
Another way to understand this design decision is to real- ize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
2.6.3 Operation Log
The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent op- erations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
Since the operation log is critical, we must store it reli- ably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client opera- tion only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the
Table 1: File Region State After Mutation
limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lo
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com