
The Hadoop Distributed File System
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
Yahoo!
Sunnyvale, California USA
{Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com


Abstract—The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
Keywords: Hadoop, HDFS, distributed file system
I. INTRODUCTION AND RELATED WORK
Hadoop [1][16][19] provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce [3] paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts, and executing application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 25,000 servers, and store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.
Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce). HBase was originally developed at Powerset, now a department at Microsoft. Hive [15] was originated and developed at Facebook. Pig [4], ZooKeeper [6], and Chukwa were originated and developed at Yahoo! Avro was originated at Yahoo! and is being co-developed with Cloudera.

HDFS        Distributed file system. Subject of this paper!
MapReduce   Distributed computation framework
HBase       Column-oriented table service
Pig         Dataflow language and parallel execution framework
Hive        Data warehouse infrastructure
ZooKeeper   Distributed coordination service
Chukwa      System for collecting management data
Avro        Data serialization system

Table 1. Hadoop project components
HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.
HDFS stores file system metadata and application data separately. As in other distributed file systems, like PVFS [2][14], Lustre [7] and GFS [5][8], HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
Unlike Lustre and PVFS, the DataNodes in HDFS do not use data protection mechanisms such as RAID to make the data durable. Instead, like GFS, the file content is replicated on multiple DataNodes for reliability. While ensuring data durability, this strategy has the added advantage that data transfer bandwidth is multiplied, and there are more opportunities for locating computation near the needed data.
Several distributed file systems have or are exploring truly distributed implementations of the namespace. Ceph [17] has a cluster of namespace servers (MDS) and uses a dynamic subtree partitioning algorithm in order to map the namespace tree to MDSs evenly. GFS is also evolving into a distributed namespace implementation [8]. The new GFS will have hundreds of namespace servers (masters) with 100 million files per master. Lustre [7] has an implementation of clustered namespace on its roadmap for the Lustre 2.2 release. The intent is to stripe a directory over multiple metadata servers (MDS), each of which contains a disjoint portion of the namespace. A file is assigned to a particular MDS using a hash function on the file name.
II. ARCHITECTURE
A. NameNode
The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data).
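As an illustration of these per-file settings, the sketch below uses the public Hadoop FileSystem API (org.apache.hadoop.fs) to create a file with an explicit replication factor and block size; the path and the values chosen are hypothetical, not prescriptive.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // HDFS client entry point
        Path file = new Path("/user/demo/data.bin");  // hypothetical path
        short replication = 3;                        // replicas per block (user selectable)
        long blockSize = 128L * 1024 * 1024;          // 128 MB blocks (user selectable)
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello HDFS");               // bytes are pipelined to DataNodes
        }
    }
}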
An HDFS client wanting to read a file first contacts the NameNode for the locations of data blocks comprising the file and then reads block contents from the DataNode closest to the client. When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas. The client then writes data to the DataNodes in a pipeline fashion. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.
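A minimal read from the client's point of view might look like the following sketch; the path is hypothetical, and the client library performs the NameNode lookup and DataNode selection internally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.bin");  // hypothetical path
        long total = 0;
        byte[] buf = new byte[8192];
        try (FSDataInputStream in = fs.open(file)) {  // NameNode lookup happens here
            int n;
            while ((n = in.read(buf)) > 0) {          // bytes come from a nearby DataNode
                total += n;
            }
        }
        System.out.println("read " + total + " bytes");
    }
}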
HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint. The NameNode also stores the modification log of the image called the journal in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers. During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.
B. DataNodes
Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself and the second file records the block's metadata, including checksums for the block data and the block's generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.
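The split between a data file and a checksum-bearing metadata file can be illustrated with the simplified sketch below; the chunk size and file layout are stand-ins and do not reproduce the DataNode's actual on-disk format.

import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class ReplicaMetaSketch {
    static final int CHUNK = 512;   // bytes of data covered by each checksum (illustrative)

    // Writes one CRC32 per chunk of the data file into a companion metadata file.
    static void writeMeta(String dataFile, String metaFile) throws IOException {
        try (FileInputStream in = new FileInputStream(dataFile);
             DataOutputStream meta = new DataOutputStream(new FileOutputStream(metaFile))) {
            byte[] chunk = new byte[CHUNK];
            int n;
            while ((n = in.read(chunk)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, n);              // the last chunk may be short; no padding
                meta.writeLong(crc.getValue());
            }
        }
    }
}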
During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode the DataNode automatically shuts down.
The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system.
The consistency of software versions is important because an incompatible version may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to the software upgrade or were not available during the upgrade.
A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster’s namespace ID.
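The handshake decision described in the preceding paragraphs can be summarized by the sketch below; the version string, method name and use of exceptions are illustrative rather than the actual HDFS implementation.

public class HandshakeSketch {
    static final String SOFTWARE_VERSION = "0.21.0";  // hypothetical build version

    // Returns the namespace ID the DataNode should run with, or throws to force a shutdown.
    static String handshake(String localNamespaceId, String nnNamespaceId,
                            String nnSoftwareVersion) {
        if (!SOFTWARE_VERSION.equals(nnSoftwareVersion)) {
            throw new IllegalStateException("software version mismatch: shutting down");
        }
        if (localNamespaceId == null || localNamespaceId.isEmpty()) {
            return nnNamespaceId;                     // freshly formatted node adopts the cluster's ID
        }
        if (!localNamespaceId.equals(nnNamespaceId)) {
            throw new IllegalStateException("namespace ID mismatch: shutting down");
        }
        return localNamespaceId;
    }
}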
After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block id, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.
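A block report can be thought of as a list of per-replica records like the illustrative class below; the real wire format differs.

public class BlockReportEntry {
    final long blockId;           // identifies the block
    final long generationStamp;   // version stamp of the block's contents
    final long numBytes;          // actual length of this replica

    BlockReportEntry(long blockId, long generationStamp, long numBytes) {
        this.blockId = blockId;
        this.generationStamp = generationStamp;
        this.numBytes = numBytes;
    }
}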
During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's space allocation and load balancing decisions.
The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:
• replicate blocks to other nodes;
• remove local block replicas;
• re-register or to shut down the node;
• send an immediate block report.
These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
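Conceptually, each heartbeat pairs a small statistics payload with an optional list of commands in the reply, along the lines of the sketch below; all type, field and command names are illustrative.

import java.util.List;

public class HeartbeatSketch {
    enum Command { REPLICATE_BLOCKS, REMOVE_REPLICAS, REREGISTER, SHUTDOWN, SEND_BLOCK_REPORT }

    static class Request {        // sent by the DataNode every few seconds
        long capacityBytes;       // total storage capacity
        long usedBytes;           // storage in use
        int activeTransfers;      // data transfers currently in progress
    }

    static class Reply {          // returned by the NameNode
        List<Command> commands;   // instructions piggybacked on the heartbeat, possibly empty
    }
}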
C. HDFS Client
User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface.
Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.
When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Each choice of DataNodes is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1.

Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes, which eventually confirm the creation of the block replicas to the NameNode.
Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves their tolerance against faults and increases their read bandwidth.
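Both facilities are exposed through the public FileSystem API, roughly as in the sketch below; the path and the replication factor of five are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.bin");  // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {              // one entry per block of the file
            System.out.println(b.getOffset() + " -> " + String.join(",", b.getHosts()));
        }

        fs.setReplication(file, (short) 5);           // request more replicas of a hot file
    }
}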
D. Image and Journal
The namespace image is the file system metadata that describes the organization of application data as directories and files. A persistent record of the image written to disk is called a checkpoint. The journal is a write-ahead commit log for changes to the file system that must be persistent. For each client-initiated transaction, the change is recorded in the journal, and the journal file is flushed and synched before the change is committed to the HDFS client. The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode described in the next section. During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients.
If either the checkpoint or the journal is missing, or becomes corrupt, the namespace information will be lost partly or entirely. In order to preserve this critical information HDFS can be configured to store the checkpoint and journal in multiple storage directories. Recommended practice is to place the directories on different volumes, and for one storage directory to be on a remote NFS server. The first choice prevents loss from single volume failures, and the second choice protects against failure of the entire node. If the NameNode encounters an error writing the journal to one of the storage directories it automatically excludes that directory from the list of storage directories. The NameNode automatically shuts itself down if no storage directory is available.
The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this process the NameNode batches multiple transactions initiated by different clients. When one of the NameNode's threads initiates a flush-and-sync operation, all transactions batched at that time are committed together. Remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation.
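The batching scheme can be sketched as a simple group-commit pattern; the class below is an illustration of the idea, not the NameNode's actual journal code, and the disk write itself is elided.

import java.io.IOException;

public class JournalSketch {
    private final StringBuilder buffer = new StringBuilder();  // pending journal records
    private long lastWrittenTxId = 0;   // highest transaction appended to the buffer
    private long lastSyncedTxId = 0;    // highest transaction known to be on disk

    // Append an edit to the in-memory buffer and return its transaction id.
    synchronized long logEdit(String record) {
        buffer.append(record).append('\n');
        return ++lastWrittenTxId;
    }

    // Make the given transaction durable; a flush by another thread may already cover it.
    synchronized void logSync(long txId) throws IOException {
        if (txId <= lastSyncedTxId) {
            return;                     // committed together with an earlier batch
        }
        long syncUpTo = lastWrittenTxId;
        // ... write the buffer to the journal file and fsync it (elided in this sketch) ...
        buffer.setLength(0);
        lastSyncedTxId = syncUpTo;      // every transaction batched so far is now committed
    }
}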
E. CheckpointNode
The NameNode in HDFS, in addition to its primary role serving client requests, can alternatively execute one of two other roles: a CheckpointNode or a BackupNode. The role is specified at node startup.
The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.
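The merge itself amounts to loading the old checkpoint, replaying the journal on top of it, and writing the result out, as in the simplified sketch below; the tab-separated file formats and operation names are invented for illustration and are not HDFS's actual formats.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

public class CheckpointMergeSketch {
    // The namespace image is modeled here as path -> attributes, purely for illustration.
    static Map<String, String> loadCheckpoint(String file) throws IOException {
        Map<String, String> image = new TreeMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split("\t", 2);    // path <TAB> attributes
                if (kv.length == 2) image.put(kv[0], kv[1]);
            }
        }
        return image;
    }

    static void replayJournal(Map<String, String> image, String journal) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(journal))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] op = line.split("\t", 3);    // e.g. CREATE <TAB> path <TAB> attributes
                if (op[0].equals("DELETE")) {
                    image.remove(op[1]);
                } else if (op.length == 3) {
                    image.put(op[1], op[2]);
                }
            }
        }
    }

    static void saveCheckpoint(Map<String, String> image, String file) throws IOException {
        try (PrintWriter w = new PrintWriter(file)) {
            image.forEach((k, v) -> w.println(k + "\t" + v));
        }
    }
}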

Creating periodic checkpoints is one way to protect the file system metadata. The system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal are unavailable.
Creating a checkpoint lets the NameNode truncate the tail of the journal when the new checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time without restarts, during which the journal constantly grows. If the journal grows very large, the probability of loss or corruption of the journal file increases. Also, a very large journal extends the time required to restart the NameNode. For a large cluster, it takes an hour to process a week-long journal. Good practice is to create a daily checkpoint.
F. BackupNode
A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them to its own storage directories, and applies these transactions to its own namespace image in memory. The NameNode treats the BackupNode as a journal store the same as it treats journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk are a record of the latest namespace state.
The BackupNode can create a checkpoint without downloading checkpoint and journal files from the active NameNode, since it already has an up-to-date namespace image in its memory. This makes the checkpoint process on the BackupNode more efficient as it only needs to save the namespace into its local storage directories.
The BackupNode can be viewed as a read-only NameNode. It contains all file system metadata information except for block locations. It can perform all operations of the regular NameNode that do not involve modification of the namespace or knowledge of block locations. Use of a BackupNode provides the option of running the NameNode without persistent storage, delegating responsibility for persisting the namespace state to the BackupNode.
G. Upgrades, File System Snapshots
During software upgrades the possibility of corrupting the system due to software bugs or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades.
The snapshot mechanism lets administrators persistently save the current state of the file system, so that if the upgrade results in data loss or corruption it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.
The snapshot (only one can exist) is created at the cluster administrator’s option whenever the system is started. If a snapshot is requested, the NameNode first reads the checkpoint
and journal files and merges them in memory. Then it writes the new checkpoint and the empty journal to a new location, so that the old checkpoint and journal remain unchanged.
During handshake the NameNode instructs DataNodes whether to create a local snapshot. The local snapshot on the DataNode cannot be created by replicating the data file directories, as this would require doubling the storage capacity of every DataNode on the cluster. Instead each DataNode creates a copy of the storage directory and hard links existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique. Thus old block replicas remain untouched in their old directories.
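The hard-link technique can be illustrated with standard Java NIO calls, as in the sketch below; the directory paths are hypothetical and this is not the DataNode's actual snapshot code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class HardLinkSnapshot {
    static void snapshot(Path current, Path snapshotDir) throws IOException {
        try (Stream<Path> files = Files.walk(current)) {   // directories are visited before their contents
            for (Path src : (Iterable<Path>) files::iterator) {
                Path dst = snapshotDir.resolve(current.relativize(src));
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);          // recreate the directory structure
                } else {
                    Files.createLink(dst, src);            // hard link: no block data is copied
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        snapshot(Paths.get("/data/dfs/current"),           // hypothetical block directory
                 Paths.get("/data/dfs/snapshot"));
    }
}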
The cluster administrator can choose to roll back HDFS to the snapshot state when restarting the system. The NameNode recovers the checkpoint saved when the snapshot was created. DataNodes restore the previously renamed directories and initiate a background process to delete block replicas created after the snapshot was made. Having chosen to roll back, there is no provision to roll forward. The cluster administrator can recover the storage occupied by the snapshot by commanding the system to abandon the snapshot, thus finalizing the software upgrade.
System evolution may lead to a change in the format of the NameNode's checkpoint and journal files, or in the data representation of block replica files on DataNodes. The layout version identifies the data representation formats, and is persistently stored in the NameNode's and the DataNodes' storage directories. During startup each node compares the layout version of the current software with the version stored in its storage directories and automatically converts data from older formats to the newer ones. The conversion requires the mandatory creation of a snapshot when the system restarts with the new software layout version.
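At startup the comparison reduces to a check like the following sketch; the layout-version constant and helper methods are placeholders rather than HDFS's actual values or code.

public class LayoutVersionCheck {
    static final int SOFTWARE_LAYOUT_VERSION = -24;    // hypothetical current format id

    static void checkAndUpgrade(int storedLayoutVersion) {
        if (storedLayoutVersion == SOFTWARE_LAYOUT_VERSION) {
            return;                                    // formats match, nothing to do
        }
        // The on-disk format differs from the running software's format: a snapshot
        // is mandatory before converting, so the upgrade can be rolled back if needed.
        createSnapshot();
        convertStorage(storedLayoutVersion, SOFTWARE_LAYOUT_VERSION);
    }

    static void createSnapshot() { /* omitted in this sketch */ }
    static void convertStorage(int from, int to) { /* omitted in this sketch */ }
}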
HDFS does not separate layout versions for the NameNode and DataNodes because snapshot creation must be an all-cluster effort rather than a node-selective event.
