Apache Hadoop YARN: Yet Another Resource Negotiator
V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, E. Baldeschwieler
h: hortonworks.com, m: microsoft.com, i: inmobi.com, y: yahoo-inc.com, f: facebook.com
Abstract
The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agora—the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs’ control flow, which resulted in endless scalability concerns for the scheduler.
In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop’s compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN on production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several
programming frameworks onto YARN viz. Dryad, Giraph, Hoya, Hadoop MapReduce, REEF, Spark, Storm, Tez.
1 Introduction
Apache Hadoop began as one of many open-source implementations of MapReduce [12], focused on tackling the unprecedented scale required to index web crawls. Its execution architecture was tuned for this use case, focusing on strong fault tolerance for massive, data-intensive computations. In many large web companies and startups, Hadoop clusters are the common place where operational data are stored and processed.
More importantly, it became the place within an organization where engineers and researchers have instantaneous and almost unrestricted access to vast amounts of computational resources and troves of company data. This is both a cause of Hadoop’s success and also its biggest curse, as the community of developers extended the MapReduce programming model beyond the capabilities of the cluster management substrate. A common pattern submits “map-only” jobs to spawn arbitrary processes in the cluster. Examples of (ab)uses include forking web servers and gang-scheduled computation of iterative workloads. Developers, in order to leverage the physical resources, often resorted to clever workarounds to sidestep the limits of the MapReduce API.
These limitations and misuses motivated an entire class of papers using Hadoop as a baseline for unrelated environments. While many papers exposed substantial issues with the Hadoop architecture or implementation, some simply denounced (more or less ingeniously) some of the side-effects of these misuses. The limitations of the original Hadoop architecture are, by now, well understood by both the academic and open-source communities.
In this paper, we present a community-driven effort to
move Hadoop past its original incarnation. We present the next generation of Hadoop compute platform known as YARN, which departs from its familiar, monolithic architecture. By separating resource management functions from the programming model, YARN delegates many scheduling-related functions to per-job components. In this new context, MapReduce is just one of the applications running on top of YARN. This separation provides a great deal of flexibility in the choice of programming framework. Examples of alternative programming models that are becoming available on YARN are: Dryad [18], Giraph, Hoya, REEF [10], Spark [32], Storm [4] and Tez [2]. Programming frameworks running on YARN coordinate intra-application communication, execution flow, and dynamic optimizations as they see fit, unlocking dramatic performance improvements. We describe YARN’s inception, design, open-source development, and deployment from our perspective as early architects and implementors.
2 History and rationale
In this section, we provide the historical account of how YARN’s requirements emerged from practical needs. The reader not interested in the requirements’ origin is invited to skim over this section (the requirements are highlighted for convenience), and proceed to Section 3 where we provide a terse description of YARN’s architecture.
Yahoo! adopted Apache Hadoop in 2006 to replace the infrastructure driving its WebMap application [11], the technology that builds a graph of the known web to power its search engine. At the time, the web graph contained more than 100 billion nodes and 1 trillion edges. The previous infrastructure, named “Dreadnaught,” [25] had reached the limits of its scalability at 800 machines and a significant shift in its architecture was required to match the pace of the web. Dreadnaught already executed distributed applications that resembled MapReduce [12] programs, so by adopting a more scalable MapReduce framework, significant parts of the search pipeline could be migrated easily. This highlights the first requirement that will survive throughout early versions of Hadoop, all the way to YARN—[R1:] Scalability.
In addition to extremely large-scale pipelines for Yahoo! Search, scientists optimizing advertising analytics, spam filtering, and content optimization drove many of its early requirements. As the Apache Hadoop community scaled the platform for ever-larger MapReduce jobs, requirements around [R2:] Multi-tenancy started to take shape. The engineering priorities and intermediate stages of the compute platform are best understood in
this context. YARN’s architecture addresses many long-standing requirements, based on experience evolving the MapReduce platform. In the rest of the paper, we will assume general understanding of classic Hadoop architecture, a brief summary of which is provided in Appendix A.
2.1 The era of ad-hoc clusters
Some of Hadoop’s earliest users would bring up a cluster on a handful of nodes, load their data into the Hadoop Distributed File System (HDFS) [27], obtain the result they were interested in by writing MapReduce jobs, then tear it down [15]. As Hadoop’s fault tolerance improved, persistent HDFS clusters became the norm. At Yahoo!, operators would load “interesting” datasets into a shared cluster, attracting scientists interested in deriving insights from them. While large-scale computation was still a primary driver of development, HDFS also acquired a permission model, quotas, and other features to improve its multi-tenant operation.
To address some of its multi-tenancy issues, Yahoo! developed and deployed Hadoop on Demand (HoD), which used Torque[7] and Maui[20] to allocate Hadoop clusters on a shared pool of hardware. Users would submit their job with a description of an appropriately sized compute cluster to Torque, which would enqueue the job until enough nodes became available. Once nodes became available, Torque would start HoD’s ‘leader’ process on the head node, which would then interact with Torque/Maui to start HoD’s slave processes that subsequently spawned a JobTracker and TaskTrackers for that user, which then accepted a sequence of jobs. When the user released the cluster, the system would automatically collect the user’s logs and return the nodes to the shared pool. Because HoD set up a new cluster for every job, users could run (slightly) older versions of Hadoop while developers could test new features easily. Since Hadoop released a major revision every three months,1 the flexibility of HoD was critical to maintaining that cadence—we refer to this decoupling of upgrade dependencies as [R3:] Serviceability.
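The sequence of steps in an HoD allocation can be sketched as follows. This is a purely schematic illustration: every type and method in it (TorqueQueue, HodLeader, and so on) is hypothetical and merely stands in for the real Torque/Maui and HoD machinery, which this paper does not reproduce.

import java.util.Arrays;
import java.util.List;

// Schematic sketch of the Hadoop on Demand (HoD) allocation flow described
// above. All types here are hypothetical stubs used only to show the order
// of steps; they are not HoD's actual code or Torque's API.
public final class HodFlowSketch {

    static final class TorqueQueue {
        // Enqueue a request and block until enough nodes are free (stubbed:
        // we simply pretend three nodes became available).
        List<String> waitForNodes(int requestedNodes) {
            return Arrays.asList("node-1", "node-2", "node-3");
        }
        void release(List<String> nodes) { /* return nodes to the shared pool */ }
    }

    static final class HodLeader {
        HodLeader(String headNode)      { System.out.println("leader on " + headNode); }
        void startJobTracker()          { System.out.println("spawned JobTracker"); }
        void startTaskTracker(String n) { System.out.println("spawned TaskTracker on " + n); }
        void collectLogsAndTearDown()   { System.out.println("collected logs, tore down cluster"); }
    }

    public static void main(String[] args) {
        TorqueQueue torque = new TorqueQueue();

        // 1. The user submits a job plus a description of the cluster size it
        //    needs; Torque queues the request until that many nodes are free.
        List<String> nodes = torque.waitForNodes(3);

        // 2. Torque starts HoD's 'leader' on the head node, which spawns a
        //    private JobTracker and TaskTrackers for this user.
        HodLeader leader = new HodLeader(nodes.get(0));
        leader.startJobTracker();
        for (String node : nodes.subList(1, nodes.size())) {
            leader.startTaskTracker(node);
        }

        // 3. The user runs a sequence of MapReduce jobs against the private
        //    JobTracker (elided here), then releases the cluster.
        leader.collectLogsAndTearDown();
        torque.release(nodes);
    }
}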
While HoD could also deploy HDFS clusters, most users deployed the compute nodes across a shared HDFS instance. As HDFS scaled, more compute clusters could be allocated on top of it, creating a virtuous cycle of increased user density over more datasets, leading to new insights. Since most Hadoop clusters were smaller than the largest HoD jobs at Yahoo!, the JobTracker was rarely a bottleneck.
HoD proved itself as a versatile platform, anticipating some qualities of Mesos[17], which would extend the framework-master model to support dynamic resource allocation among concurrent, diverse programming models. HoD can also be considered a private-cloud precursor of EC2 Elastic MapReduce and Azure HDInsight offerings—without any of the isolation and security aspects.
1 Between 0.1 and 0.12, Hadoop released a major version every month. It maintained a three-month cycle from 0.12 through 0.19.
2.2 Hadoop on Demand shortcomings
Yahoo! ultimately retired HoD in favor of shared MapReduce clusters due to middling resource utilization. During the map phase, the JobTracker makes every effort to place tasks close to their input data in HDFS, ideally on a node storing a replica of that data. Because Torque allocates nodes without accounting for locality,2 the subset of nodes granted to a user’s JobTracker would likely only contain a handful of relevant replicas. Given the large number of small jobs, most reads were from remote hosts. Efforts to combat these artifacts achieved mixed results; while spreading TaskTrackers across racks made intra-rack reads of shared datasets more likely, the shuffle of records between map and reduce tasks would necessarily cross racks, and subsequent jobs in the DAG would have fewer opportunities to account for skew in their ancestors. This aspect of [R4:] Locality awareness is a key requirement for YARN.
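To see why a randomly granted allocation contains so few relevant replicas, consider a back-of-envelope estimate. The sketch below is not from the paper and uses assumed numbers (a 3,000-node shared pool, HDFS’s default replication factor of 3, and uniformly placed replicas); it simply computes the probability that a block has any replica inside an allocation chosen without regard to placement.

// Back-of-envelope estimate (assumed numbers, not measurements from the
// paper): the chance that a placement-oblivious allocation of n nodes out of
// N contains at least one of a block's r replicas.
public final class LocalityEstimate {

    // log C(n, k), computed in log space so the ratio below stays stable.
    static double logChoose(int n, int k) {
        double s = 0.0;
        for (int i = 0; i < k; i++) {
            s += Math.log(n - i) - Math.log(i + 1);
        }
        return s;
    }

    // P(at least one replica node falls inside the allocation)
    // = 1 - C(N - n, r) / C(N, r), assuming replicas land on r distinct,
    // uniformly chosen nodes.
    static double pLocalReplica(int totalNodes, int allocationNodes, int replicas) {
        double logP = logChoose(totalNodes - allocationNodes, replicas)
                    - logChoose(totalNodes, replicas);
        return 1.0 - Math.exp(logP);
    }

    public static void main(String[] args) {
        int totalNodes = 3000;   // hypothetical shared pool size
        int replicas = 3;        // HDFS default replication factor
        for (int n : new int[] {10, 50, 200}) {
            System.out.printf("allocation of %4d nodes: P(block has a local replica) = %.3f%n",
                    n, pLocalReplica(totalNodes, n, replicas));
        }
    }
}

Under these assumed numbers, a 10-node allocation sees a local replica for only about 1% of blocks, consistent with the observation that most reads were from remote hosts.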
High-level frameworks like Pig[24] and Hive[30] often compose a workflow of MapReduce jobs in a DAG, each filtering, aggregating, and projecting data at every stage in the computation. Because clusters were not resized between jobs when using HoD, much of the capacity in a cluster lay fallow while subsequent, slimmer stages completed. In an extreme but very common scenario, a single reduce task running on one node could prevent a cluster from being reclaimed. Some jobs held hundreds of nodes idle in this state.
Finally, job latency was dominated by the time spent allocating the cluster. Users could rely on few heuristics when estimating how many nodes their jobs required, and would often ask for whatever multiple of 10 matched their intuition. Cluster allocation latency was so high, users would often share long-awaited clusters with colleagues, holding on to nodes for longer than anticipated, raising latencies still further. While users were fond of many features in HoD, the economics of cluster utilization forced Yahoo! to pack its users into shared clusters. [R5:] High Cluster Utilization is a top priority for YARN.
2 Efforts to modify Torque to be “locality-aware” mitigated this effect somewhat, but the proportion of remote reads was still much higher than what a shared cluster could achieve.
2.3 Shared clusters
Ultimately, HoD had too little information to make intelligent decisions about its allocations, its resource granularity was too coarse, and its API forced users to provide misleading constraints to the resource layer.
However, moving to shared clusters was non-trivial. While HDFS had scaled gradually over years, the JobTracker had been insulated from those forces by HoD. When that guard was removed, MapReduce clusters suddenly became significantly larger, job throughput increased dramatically, and many of the features innocently added to the JobTracker became sources of critical bugs. Still worse, instead of losing a single workflow, a JobTracker failure caused an outage that would lose all the running jobs in a cluster and require users to manually recover their workflows.
Downtime created a backlog in processing pipelines that, when restarted, would put significant strain on the JobTracker. Restarts often involved manually killing users’ jobs until the cluster recovered. Due to the complex state stored for each job, an implementation preserving jobs across restarts was never completely debugged.
Operating a large, multi-tenant Hadoop cluster is hard. While fault tolerance is a core design principle, the surface exposed to user applications is vast. Given various availability issues exposed by the single point of failure, it is critical to continuously monitor workloads in the cluster for dysfunctional jobs. More subtly, because the JobTracker needs to allocate tracking structures for every job it initializes, its admission control logic includes safeguards to protect its own availability; it may delay allocating fallow cluster resources to jobs because the overhead of tracking them could overwhelm the JobTracker process. All these concerns may be grouped under the need for [R6:] Reliability/Availability.
As Hadoop managed more tenants, diverse use cases, and raw data, its requirements for isolation became more stringent, but the authorization model lacked strong, scalable authentication—a critical feature for multi-tenant clusters. This was added and backported to multiple versions. [R7:] Secure and auditable operation must be preserved in YARN. Developers gradually hardened the system to accommodate diverse needs for resources, which were at odds with the slot-oriented view of resources.
While MapReduce supports a wide range of use cases, it is not the ideal model for all large-scale computation. For example, many machine learning programs require multiple iterations over a dataset to converge to a result. If one composes this flow as a sequence of MapReduce jobs, the scheduling overhead will significantly delay the result [32]. Similarly, many graph
algorithms are better expressed using a bulk-synchronous parallel model (BSP) using message passing to communicate between vertices, rather than the heavy, all-to-all communication barrier in a fault-tolerant, large-scale MapReduce job [22]. This mismatch became an impediment to users’ productivity, but the MapReduce-centric resource model in Hadoop admitted no competing application model. Hadoop’s wide deployment inside Yahoo! and the gravity of its data pipelines made these tensions irreconcilable. Undeterred, users would write “MapReduce” programs that would spawn alternative frameworks. To the scheduler they appeared as map-only jobs with radically different resource curves, thwarting the assumptions built into the platform and causing poor utilization, potential deadlocks, and instability. YARN must declare a truce with its users, and provide explicit [R8:] Support for Programming Model Diversity.
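The cost of forcing iteration through the MapReduce job lifecycle can be illustrated with simple arithmetic. The numbers below are assumptions chosen for illustration, not measurements: they only show how a fixed per-job scheduling and launch overhead, paid once per iteration, comes to dominate total runtime relative to a long-running, iteration-aware framework.

// Illustrative arithmetic only (all numbers are assumptions): why expressing
// an iterative algorithm as a chain of MapReduce jobs lets per-job scheduling
// and startup overhead dominate, motivating requirement R8.
public final class IterationOverheadSketch {
    public static void main(String[] args) {
        int iterations = 100;            // e.g., passes of an ML training loop (assumed)
        double computePerIterSec = 20;   // useful work per iteration (assumed)
        double jobOverheadSec = 25;      // scheduling + task launch per MR job (assumed)

        // One MapReduce job per iteration: the overhead is paid every iteration.
        double chainedMapReduce = iterations * (computePerIterSec + jobOverheadSec);

        // A long-running, iteration-aware framework pays its startup cost once
        // and can keep state resident between iterations.
        double frameworkStartupSec = 60;
        double longRunningFramework = frameworkStartupSec + iterations * computePerIterSec;

        System.out.printf("chained MapReduce jobs : %6.0f s%n", chainedMapReduce);
        System.out.printf("long-running framework : %6.0f s%n", longRunningFramework);
    }
}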
Beyond their mismatch with emerging framework requirements, typed slots also harm utilization. While the separation between map and reduce capacity prevents deadlocks, it can also bottleneck resources. In Hadoop, the overlap between the two stages is configured by the user for each submitted job; starting reduce tasks later increases cluster throughput, while starting them early in a job’s execution reduces its latency.3 The numbers of map and reduce slots are fixed by the cluster operator, so fallow map capacity can’t be used to spawn reduce tasks and vice versa.4 Because the two task types complete at different rates, no configuration will be perfectly balanced; when either slot type becomes saturated, the JobTracker may be required to apply backpressure to job initialization, creating a classic pipeline bubble. Fungible resources complicate scheduling, but they also empower the allocator to pack the cluster more tightly. This highlights the need for a [R9:] Flexible Resource Model.
3 This oversimplifies significantly, particularly in clusters of unreliable nodes, but it is generally true.
4 Some users even optimized their jobs to favor either map or reduce tasks based on shifting demand in the cluster [28].
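Concretely, the typed-slot model is visible in classic (pre-YARN) Hadoop configuration. The snippet below is a sketch assuming the Hadoop 1.x property names and the standard Configuration API: slot counts are fixed per TaskTracker by the operator, while the map/reduce overlap (reduce “slowstart”) is tuned per job by the user.

import org.apache.hadoop.conf.Configuration;

// Sketch of the knobs behind the typed-slot model discussed above, assuming
// the classic (Hadoop 1.x) property names. Slot counts are fixed per
// TaskTracker by the operator; the map/reduce overlap is chosen per job.
public final class TypedSlotConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Operator-side (TaskTracker mapred-site.xml): a fixed split of each
        // node into map slots and reduce slots. Idle map slots cannot run
        // reduce tasks, and vice versa.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);

        // User-side (per job): fraction of map tasks that must complete before
        // reduce tasks are scheduled. Lower values reduce job latency but tie
        // up reduce slots earlier; higher values improve cluster throughput.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.8f);

        System.out.println("map slots/node    = " + conf.get("mapred.tasktracker.map.tasks.maximum"));
        System.out.println("reduce slots/node = " + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
        System.out.println("reduce slowstart  = " + conf.get("mapred.reduce.slowstart.completed.maps"));
    }
}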
While the move to shared clusters improved utilization and locality compared to HoD, it also brought concerns for serviceability and availability into sharp relief. Deploying a new version of Apache Hadoop in a shared cluster was a carefully choreographed, and regrettably common, event. To fix a bug in the MapReduce implementation, operators would necessarily schedule downtime, shut down the cluster, deploy the new bits, validate the upgrade, then admit new jobs. By conflating the platform responsible for arbitrating resource usage with the framework expressing that program, one is forced to evolve them simultaneously; when operators improve the allocation efficiency of the platform, users must necessarily incorporate framework changes. Thus, upgrading a cluster requires users to halt, validate, and restore their pipelines for orthogonal changes. While updates typically required no more than recompilation, users’ assumptions about internal framework details—or developers’ assumptions about users’ programs—occasionally created blocking incompatibilities on pipelines running on the grid.
Figure 1: YARN Architecture (in blue the system components, and in yellow and pink two applications running).
Building on lessons learned by evolving Apache Hadoop MapReduce, YARN was designed to address requirements (R1-R9). However, the massive install base of MapReduce applications, the ecosystem of related projects, well-worn deployment practice, and a tight schedule would not tolerate a radical redesign. To avoid the trap of a “second system syndrome” [6], the new architecture reused as much code from the existing framework as possible, behaved in familiar patterns, and left many speculative features on the drawing board. This led to the final requirement for the YARN redesign: [R10:] Backward compatibility.
In the remainder of this paper, we provide a description of YARN’s architecture (Sec. 3), report on real-world adoption of YARN (Sec. 4), provide experimental evidence validating some of the key architectural choices (Sec. 5), and conclude by comparing YARN with some related work (Sec. 6).
3 Architecture
To address the requirements we discussed in Section 2, YARN lifts some functions into a platform layer responsible for resource management, leaving coordination of logical execution plans to a host of framework implementations. Specifically, a per-cluster ResourceManager (RM) tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants. By separating these duties of the JobTracker, the central allocator can reason about tenants’ resource requirements abstractly, while remaining agnostic to the semantics of each application’s execution plan.
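As a concrete illustration of the client-facing side of this split, the sketch below submits a minimal application to the ResourceManager using the YARN client library. It is a simplified sketch, not a complete framework: the ApplicationMaster command is a placeholder, and error handling, local resources, and security tokens are omitted.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

// Minimal sketch of the client -> ResourceManager submission path: the client
// asks the RM for an application id, then submits a context describing the
// container in which its ApplicationMaster should be launched. The AM command
// below is a placeholder; a real framework would ship its own AM.
public final class SubmitToYarnSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM to create a new application (returns an id and cluster info).
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("example-framework");

        // Describe the ApplicationMaster container: the command to run and the
        // resources the RM's scheduler should set aside for it.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "/path/to/my_app_master --launch"));   // placeholder command
        context.setAMContainerSpec(amContainer);
        context.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));

        // Hand the request to the ResourceManager; scheduling, AM launch, and
        // all further task-level coordination happen beyond this call.
        ApplicationId appId = yarnClient.submitApplication(context);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}

Everything after submitApplication (negotiating containers, handling task failures, orchestrating the execution plan) is the ApplicationMaster’s responsibility, which is exactly the coordination the platform delegates to per-application components.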