Slicer: Auto-Sharding for Datacenter Applications
Atul Adya, Pan Gu, and others, Google; Kfir Lev-Ari, Technion—Israel Institute of Technology
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/adya
This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16).
November 2–4, 2016 • Savannah, GA, USA
ISBN 978-1-931971-33-1
Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation
is sponsored by USENIX.
Slicer: Auto-Sharding for Datacenter Applications
Atul Adya, Pan Gu, and others, Google; and Kfir Lev-Ari, Technion—Israel Institute of Technology
Abstract

Sharding is a fundamental building block of large-scale applications, but most have their own custom, ad-hoc implementations. Our goal is to make sharding as easily reusable as a filesystem or lock manager. Slicer is Google’s general-purpose sharding service. It monitors signals such as load hotspots and server health to dynamically shard work over a set of servers. Its goals are to maintain high availability and reduce load imbalance while minimizing churn from moved work.
In this paper, we describe Slicer’s design and implementation. Slicer has the consistency and global optimization of a centralized sharder while approaching the high availability, scalability, and low latency of systems that make local decisions. It achieves this by separating concerns: a reliable data plane forwards requests, and a smart control plane makes load-balancing decisions off the critical path. Slicer’s small but powerful API has proven useful and easy to adopt in dozens of Google applications. It is used to allocate resources for web service front-ends, coalesce writes to increase storage bandwidth, and increase the efficiency of a web cache. It currently handles 2-7M req/s of production traffic. The median production Slicer-managed workload uses 63% fewer resources than it would with static sharding.
1 Introduction
Many applications require the resources of more than one computer, especially at Google’s typical scale. An application that distributes its work across multiple computers requires some scheme for splitting it up. Often, work is simply split randomly. This is ubiquitous in web services, where the dominant architecture puts a round-robin load-balancer in front of a fleet of interchangeable application processes (“tasks”).
However, in many applications, it is hard to ensure that every task can service any request. For example,
Google’s speech recognizer (§3.2.1) uses a different machine learning model for each spoken language. Loading a model is too slow for interactive use: a language must be resident before a request arrives. One task cannot fit every model, making random request balancing untenable. Instead, each task loads only a subset of languages, and incoming requests are routed to a prepared task.
In the past, Google applications like the speech recognizer had their own one-off sharders. Experience taught us that sharding is hard to get right: the plumbing is tedious, and it can take years to tune and cover corner cases. Rebuilding a sharder for every application wastes engineering effort and often produces brittle results.
In practice, custom sharders typically make do with simplistic static sharding that is unresponsive to changes in workload distribution and task availability. Simple schemes utilize resources poorly. In the speech recognizer, resources required per language peak at different times as speakers wake and sleep. When tasks fail, requests must be redistributed among the healthy tasks. When a datacenter fails, a great wave of traffic sloshes over to the remaining datacenters, dramatically altering the request mix. Before Slicer, the speech team handled variation with overprovisioning and manual intervention.
Slicer refactors sharding into a reusable and easily adopted building block akin to a filesystem or lock manager.
Figure 1: Over five-minute intervals in a recent week, Slicer directed a median of 2 Mreq/s of production traffic with peaks exceeding 7 Mreq/s.
Slicer is a general-purpose infrastructure service that partitions work across tasks in applications that benefit from affinity. Slicer is minimally invasive to applications: they need only associate incoming requests with a key of their choice that is used to rendezvous requests with tasks. In the speech recognizer, the slice key is the language. Other applications use fine-grained slice keys, such as usernames or URLs. Slicer assigns part of the key space to each task and routes incoming requests to them via integration with Google’s front-end load balancers and RPC system.
Slicer addresses these needs by sharding dynamically. It monitors the request load to detect hotspots. It monitors task availability changes due to service provisioning, system updates, and hardware failures. It rebalances the key mapping to maintain availability of all keys and reduce load imbalance among tasks while minimizing key churn.
Slicer can trade off consistency with availability, offering either strongly or eventually consistent assignments. In consistent assignment mode, no task ever believes a key is assigned to it if the Assigner does not agree. The simplest application of this property ensures that at most one task is authoritative for a key, reducing availability but making it easy to write a correct application that mutates state. Alternatively, Slicer can distribute overlapping eventually consistent assignments, eliminating periods of unavailability and reacting rapidly to load shifts.
Slicer’s design differs significantly from past sharding systems, driven by its use in dozens of large-scale systems at Google. Slicer provides global optimization and consistency guarantees possible with a centralized load-balancer, but it achieves nearly the same resilience to failures and low latency as systems that make purely local decisions, such as distributed hash tables.
In a production environment, customers cannot tolerate flag days (synchronized restarts). By separating the forwarding data plane from the policy control plane, Slicer simplifies customer-linked libraries and keeps complexity in a central service where the team can more easily coordinate changes.
This functionality is all exposed through a narrow, readily adopted API that has proven useful in Google applications with a variety of needs:
Avoiding storage overhead. A stateless front-end that
accesses underlying durable storage on every request is conceptually simple but pays a high performance cost over keeping state in RAM. In some applications, including our speech recognizer, this overhead dwarfs all other time spent serving a user request. For example, a Google pub-sub service [9] processes 600 Kreq/s, most of which do one hash and one comparison to a hash in memory. Fetching the hash via a storage RPC would be correct but incur far more overhead and latency.
Automatic scaling. Many cluster management systems can automatically expand the number of tasks assigned to a job based on load, but these are typically coarse-grained decisions with heavyweight configuration. Our speech recognizer handles dozens of languages, and Slicer’s key redundancy provides a single-configuration mechanism to independently scale those many fine-grained resources.
Write aggregation. Several event processors at Google (§3.3.1) ingest huge numbers of small events and summarize them by key (such as data source) into a database. Aggregating writes from stateless front ends is possible, but aggregating like keys on the same task can be more efficient; Data Analysis Pipeline sees 80% fewer storage requests. Affinity provides similar benefits for other expensive, immobile resources like network sockets: Slicer routes requests for an external host to one task with the socket already open.
Sharding state is well-studied; see Section 6. Slicer draws on storage sharding [2, 14, 15] but applies to more classes of application. Compared to other general-purpose sharding systems [5, 10, 8, 13], Slicer offers more features (better load balancing, optional assignment consistency, and key replication) and an architecture focused on high availability.
This paper makes the following contributions:
• An architecture that separates the assignment generation “control plane” from the request forwarding “data plane”, which provides algorithmic versatility, high performance, resilience to failure, and exploits existing lease managers and storage systems as robust building blocks.
• An effective load-balancing algorithm that minimizes key churn and has proven effective in a variety of applications.
• An evaluation on production deployments of several large applications that shows the benefits and availability of the Slicer architecture.
2 Slicer Overview and API

Slicer is a general-purpose sharding service that splits an application’s work across a set of tasks that form a job within a datacenter, balancing load across the tasks. A “task” is an application process running on a multi-tenant host machine alongside tasks from other applications. The unit of sharding in Slicer is a key, chosen by the application. Slicer integrates with Google’s Stubby RPC system to easily route RPCs originating in other services and with Google’s frontend HTTP load balancers to
route HTTP requests from external browsers and REST clients.
Slicer has the following components: a centralized Slicer Service; the Clerk, a library linked into application clients; and the Slicelet, a library linked into application server tasks (Figure 2). The Service is written in Java; the libraries are available in C++, Java, and Go. The Slicer Service generates an assignment mapping key ranges (“slices”) to tasks and distributes it to the Clerks and Slicelets, together called the subscribers. The Clerk directs client requests for a key to the assigned task. The Slicelet enables a task to learn when it is assigned or relieved of a slice. The Slicer Service monitors load and task availability to generate new assignments to maintain availability of all keys. Application code interacts only indirectly with the Slicer Service via the Clerk and Slicelet libraries.
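To make the data flow concrete, the sketch below shows one plausible in-memory representation of an assignment, as an ordered map from hashed slice ranges (described in §2.1) to task addresses. The class layout and the use of plain string addresses are assumptions for illustration, not Slicer’s actual structures.

import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of an assignment: each slice's start key maps to its
// exclusive end and the tasks currently assigned that slice.
final class Assignment {
  static final class Slice {
    final long limit;                 // exclusive end of the hashed-key range
    final Set<String> taskAddresses;  // one or more tasks (key redundancy)
    Slice(long limit, Set<String> taskAddresses) {
      this.limit = limit;
      this.taskAddresses = taskAddresses;
    }
  }

  private final NavigableMap<Long, Slice> slices = new TreeMap<>();

  // Tasks assigned to a hashed slice key; empty if the key is currently unassigned.
  Set<String> tasksFor(long hashedKey) {
    Map.Entry<Long, Slice> e = slices.floorEntry(hashedKey);
    return (e == null || hashedKey >= e.getValue().limit)
        ? Collections.emptySet()
        : e.getValue().taskAddresses;
  }
}

Both the Clerk and the Slicelet can answer their respective queries from a structure like this without contacting the Slicer Service on the request path.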
2.1 Sharding Model
Application keys may be fine-grained, such as user IDs, or coarse-grained, such as the languages in the speech recognizer described in Section 3.2.1. Keys are an atomic unit of work placement: all state associated with a single key will be collocated on those task replicas to which the key is assigned, but different keys may be assigned to different tasks. Slicer does not observe application state; it merely notifies the task of the keys the task should serve.
Slicer hashes each application key into a 63-bit slice key; each slice in an assignment is a range in this hashed keyspace. Manipulating key ranges makes Slicer’s workload independent of whether an application has ten keys or a billion and means that an application can create new keys without Slicer on the critical path. As a result, there is no limit on the number of keys nor must they be enumerated.
Hashing keys simplifies the load balancing algorithm because clusters of hot keys in the application’s keyspace are likely uniformly distributed in the hashed keyspace.
The cost is lost locality: contiguous application keys are scattered. Many Google applications are already structured around single-key operations rather than scans, encouraged by the behavior of existing storage systems. For others, Section 2.2 offers a mitigation.
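As a concrete illustration, the sketch below maps an application key to a 63-bit slice key and tests membership in a slice range. The paper does not name Slicer’s hash function; SHA-256 truncated to 63 bits is an assumption used here purely for illustration.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class SliceKeys {
  // Map an application key to a 63-bit slice key (hash choice is assumed).
  static long sliceKey(String applicationKey) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-256")
        .digest(applicationKey.getBytes(StandardCharsets.UTF_8));
    long h = 0;
    for (int i = 0; i < 8; i++) {
      h = (h << 8) | (digest[i] & 0xFF);
    }
    return h & Long.MAX_VALUE;  // clear the sign bit, leaving 63 usable bits
  }

  // A slice is a half-open range [start, limit) in the hashed keyspace.
  static boolean sliceContains(long start, long limit, long hashedKey) {
    return hashedKey >= start && hashedKey < limit;
  }
}

Because the hash spreads adjacent application keys across the keyspace, a contiguous slice range tends to receive a statistically even share of keys, which is what makes range-based balancing tractable.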
Some applications require all requests for the same key to be served by the same task, for example, to maintain a write-through cache. For these, Slicer offers a consistency guarantee on what assignments a Slicelet can observe (§4.5). For many other applications, weaker semantics are correct even when requests for the same key are served by different tasks. For example, such systems serve read-only data (such as Google Fonts), or provide weak consistency to their users (such as Cloud DNS), or have an underlying storage system that provides strong consistency (such as event aggregation systems).
Such applications can configure Slicer with key redundancy, allowing assignment of each slice to multiple tasks. Slicer honors a minimum redundancy to protect availability and automatically increases replication for hot slices, which we call asymmetric key redundancy.
2.2 Slicelet Interface
The application server task interacts with Slicer via the “Slicelet” API (Figure 3). A simple application, like the Flywheel URL status cache (§3.1.1), is free to ignore this API entirely and answer whatever requests arrive; Slicer transparently improves performance. An application may register a SliceletListener to learn when slices arrive and depart, so it can prefetch and garbage-collect state (such as the speech models in Section 3.2.1).
A few affinity-mode applications use isAffinitizedKey to discover misrouted requests, such as when retrying a request from the client is cheaper than processing it at the wrong server (§3.3).
interface Slicelet {
  boolean isAffinitizedKey(String key);
  Opaque getSliceKeyHandle(String key);
  boolean isAssignedContinuously(Opaque handle);
}

interface SliceletListener {
  void onChangedSlices(List<Slice> assigned,
                       List<Slice> unassigned);
}
Figure 3: Slicer Server API
To support applications that require exclusive key ownership to maintain consistent in-memory state, the Slicelet provides an API inspired by Centrifuge [10]. The task calls getSliceKeyHandle when a request arrives, and passes the handle back to isAssignedContinuously before externalizing the result. Note that checking assignment at beginning and end is insufficient, since the slice may have been unassigned and reassigned in the meantime.
Figure 2: Abstract Slicer architecture. Clients (via the Clerk library) and server tasks (via the Slicelet library) receive assignments distributed in the background by the Slicer Service.
A task may also cache a handle across multiple requests, for example to cache a user’s inbox during a session.
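The fragment below sketches how a request handler might use this pattern. Only the Slicelet calls come from Figure 3; Request, Response, computeLocally, and RetryableException are hypothetical application details.

// Sketch of the getSliceKeyHandle / isAssignedContinuously pattern.
// Request, Response, computeLocally(), and RetryableException are
// hypothetical application code, not part of Slicer.
Response serve(Slicelet slicelet, Request request) {
  Opaque handle = slicelet.getSliceKeyHandle(request.key());
  Response response = computeLocally(request);  // mutate in-memory state for the key
  if (!slicelet.isAssignedContinuously(handle)) {
    // The slice may have been unassigned and reassigned while we worked;
    // do not externalize the result.
    throw new RetryableException("lost slice assignment; client should retry");
  }
  return response;
}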
To scan its store to preload state, an application may need to map from hashed slice keys back to original application keys. Applications with few keys (such as language names in the speech recognizer) can precompute an index at each task. Applications with many keys typically adjust their storage schema, either by prefixing the primary key with the hashed slice key or by adding a secondary index. In future work, Slicer will support unhashed application-defined keys and implement range sharding to preserve locality among adjacent application-defined keys.
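A minimal sketch of the first option, assuming a string row-key schema; the layout and helper below are illustrative assumptions, not Slicer APIs.

// Hypothetical row-key layout: prefixing the primary key with the hashed
// slice key groups a task's assigned keys into contiguous storage ranges,
// so the task can scan and preload exactly its slices.
static String rowKey(String applicationKey) throws Exception {
  long sliceKey = SliceKeys.sliceKey(applicationKey);  // 63-bit hash, as sketched earlier
  return String.format("%016x:%s", sliceKey, applicationKey);
}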
By default, Slicer load balances on request rate (req/s). The Slicelet integrates with Stubby to transparently monitor request rate per slice. Some applications have highly variable cost per request, or want to balance a different metric like task CPU utilization. An extension to the API of Figure 3 lets tasks report a custom load metric.
2.3 Clerk Interface
The Clerk provides a single function which maps a key to the addresses of its assigned tasks (Figure 4). Most applications ignore this API and simply enable transparent integration with Google’s RPC system Stubby or Google’s HTTP proxy GFE (Google Front End).
interface Clerk {
  Set<Addr> getAssignedTasks(String key);
}
Figure 4: Slicer Client API
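For the minority of applications that do call the Clerk directly, usage might look like the sketch below; Addr comes from Figure 4, while Request and sendRpc are hypothetical stand-ins for the application’s RPC plumbing.

// Sketch of direct Clerk use, bypassing the transparent Stubby/GFE integration.
void sendToAssignedTask(Clerk clerk, String userId, Request request) {
  Set<Addr> tasks = clerk.getAssignedTasks(userId);
  // With key redundancy, several tasks may hold the slice; any of them will do.
  Addr target = tasks.iterator().next();
  sendRpc(target, request);  // hypothetical RPC send to the chosen task
}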
Stubby typically directs RPCs round-robin from each client to a subset of tasks in a job. We extended Stubby to accept an additional slice key argument with each RPC, causing the task to be selected using Slicer’s assignment. Stubby also has support for Google’s global load balancer, which selects the network-closest datacenter for each RPC. With both enabled, the global load balancer picks a datacenter, and Slicer picks the task from the job in that datacenter.
The GFE is an HTTP proxy that accepts requests from the Internet and routes each to an internal task. The GFE offers a declarative language for selecting routing features from a request’s URL, parameters, cookies, headers and more. Slicer integration interprets any such feature as a slice key.
3 Slicer Uses in Production Systems
Slicer is used by more than 20 client services at Google, and it balances 2-7M requests per second with more than 100,000 application client processes and server tasks connected to it (Figure 1). Prospective customers evaluate their systems against a test instance of Slicer that routes another 2 Mreq/s.
This section illustrates some of Slicer’s use cases. Current uses of Slicer fit three categories: in-memory cache, in-memory store, and aggregation.
3.1 In-memory Cache Applications
Slicer is most commonly used for in-memory dynamic caches over storage state.
3.1.1 Flywheel
Flywheel is an optimizing HTTP proxy for mobile devices [11]. Flywheel tracks which websites have recently been unreachable, enabling an immediate response to a client that averts a timeout. Flywheel uses a set of “tracker” tasks as a repository of website reachability. In the original design, updates and requests were sent to a random tracker task. Because the semantics are forgiving, this worked but converged slowly. To hasten unreachability detection, Flywheel now uses Slicer with website server name as the key, so that updates and requests converge on a single task.
3.1.2 Other cache uses
Many other services use Slicer to manage caches.
1. Meeting scheduler: manages meetings and provides calendar functions. Includes a per-user cache for faster responses.
2. Crawl manager: crawls pages and extracts metadata. Retains last crawl time per URL to provide crawl rate-limiting.
3. Fonts service: serves fonts to various web and mobile applications. Caches font files and subsets of font files.
4. Configuration sync service: periodically checks end-to-end configurations for entities from multiple sources. Entity affinitization allows comparisons of configurations from multiple sources.
5. Data analysis pipeline: analyzes stored data and serves summary results. Caches query results per source.
6. Job profiling: caches metadata used for job profiling by job name.
7. User Contacts Cache: caches a user’s contacts information when fetched by the user’s mobile or web application.
8. User Metadata Cache: caches a user’s metadata and preferences in a video display application.
9. Service Control: caches aggregated metrics and logs for public APIs.
3.2 In-memory Store Applications
The in-memory caches in the previous section handle shard reassignment by discarding state, causing future requests to the moved keys to see a cache miss. In contrast, the tasks of an in-memory store load any missing data from an underlying store, and thus resharding events only affect latency; the stored data remains available.

3.2.1 Speech Recognition
As mentioned in Section 1, a speech recognition system uses Slicer to assign languages to tasks and route incoming requests to a task with the required model loaded. The speech team originally manually p