IX: A Protected Dataplane Operating System for High Throughput and Low Latency
Adam Belay, Stanford University; George Prekas, École Polytechnique Fédérale de Lausanne (EPFL); Ana Klimovic, Samuel Grossman, and Christos Kozyrakis, Stanford University; Edouard Bugnion, École Polytechnique Fédérale de Lausanne (EPFL)
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/belay
This paper is included in the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), October 6–8, 2014, Broomfield, CO. ISBN 978-1-931971-16-4
Open access to the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.
Abstract

The conventional wisdom is that aggressive networking requirements, such as high packet rates for small messages and microsecond-scale tail latency, are best addressed outside the kernel, in a user-level networking stack. We present IX, a dataplane operating system that provides high I/O performance while maintaining the key advantage of strong protection offered by existing kernels. IX uses hardware virtualization to separate the management and scheduling functions of the kernel (control plane) from network processing (dataplane). The dataplane architecture builds upon a native, zero-copy API and optimizes for both bandwidth and latency by dedicating hardware threads and networking queues to dataplane instances, processing bounded batches of packets to completion, and eliminating coherence traffic and multi-core synchronization. We demonstrate that IX significantly outperforms Linux and state-of-the-art, user-space network stacks in both throughput and end-to-end latency. Moreover, IX improves the throughput of a widely deployed key-value store by up to 3.6× and reduces tail latency by more than 2×.

1 Introduction

Datacenter applications such as search, social networking, and e-commerce platforms are redefining the requirements for systems software. A single application can consist of hundreds of software services, deployed on thousands of servers, creating a need for networking stacks that provide more than high streaming performance. The new requirements include high packet rates for short messages, microsecond-level responses to remote requests with tight tail latency guarantees, and support for high connection counts and churn [2, 14, 46]. It is also important to have a strong protection model and to be elastic in resource usage, allowing other applications to use any idling resources in a shared cluster.

The conventional wisdom is that there is a basic mismatch between these requirements and existing networking stacks in commodity operating systems. Consequently, some systems bypass the kernel and implement the networking stack in user-space [29, 32, 40, 59, 61]. While kernel bypass eliminates context switch overheads, on its own it does not eliminate the difficult tradeoffs between high packet rates and low latency (see §5.2). Moreover, user-level networking suffers from a lack of protection. Application bugs and crashes can corrupt the networking stack and impact other workloads. Other systems go a step further by also replacing TCP/IP with RDMA in order to offload network processing to specialized adapters [17, 31, 44, 47]. However, such adapters must be present at both ends of the connection and can only be used within the datacenter.

We propose IX, an operating system designed to break the 4-way tradeoff between high throughput, low latency, strong protection, and resource efficiency. Its architecture builds upon the lessons from high-performance middleboxes, such as firewalls, load-balancers, and software routers [16, 34]. IX separates the control plane, which is responsible for system configuration and coarse-grain resource provisioning between applications, from the dataplanes, which run the networking stack and application logic. IX leverages Dune and virtualization hardware to run the dataplane kernel and the application at distinct protection levels and to isolate the control plane from the dataplane [7]. In our implementation, the control plane is the full Linux kernel and the dataplanes run as protected, library-based operating systems on dedicated hardware threads.

The IX dataplane allows for networking stacks that optimize for both bandwidth and latency. It is designed around a native, zero-copy API that supports processing of bounded batches of packets to completion. Each dataplane executes all network processing stages for a batch of packets in the dataplane kernel, followed by the associated application processing in user mode. This approach amortizes API overheads and improves both instruction and data locality. We set the batch size adaptively based on load. The IX dataplane also optimizes for multi-core scalability. The network adapters (NICs) perform flow-consistent hashing of incoming traffic to distinct queues. Each dataplane instance exclusively controls a set of these queues and runs the networking stack and a single application without the need for synchronization or coherence traffic during common-case operation. The IX API departs from the POSIX API, and its design is guided by the commutativity rule [12]. However, the libix user-level library includes an event-based API similar to the popular libevent library [51], providing compatibility with a wide range of existing applications.
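To make the event-driven model concrete, the following sketch is a minimal echo server written against libevent 2.x, the library whose programming style libix mirrors. It is illustrative only: it uses the stock libevent API rather than the libix API itself, the port number is arbitrary, and error handling is omitted.

    /* Minimal libevent 2.x echo server, shown only to illustrate the
     * event-driven model that libix follows; this is the stock libevent
     * API, not the libix API. */
    #include <event2/event.h>
    #include <event2/listener.h>
    #include <event2/bufferevent.h>
    #include <event2/buffer.h>
    #include <netinet/in.h>
    #include <string.h>

    static void read_cb(struct bufferevent *bev, void *ctx)
    {
        /* Move the received bytes straight from the input buffer to the
         * output buffer (an echo), with no intermediate application copy. */
        (void)ctx;
        evbuffer_add_buffer(bufferevent_get_output(bev),
                            bufferevent_get_input(bev));
    }

    static void event_cb(struct bufferevent *bev, short events, void *ctx)
    {
        (void)ctx;
        if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
            bufferevent_free(bev);
    }

    static void accept_cb(struct evconnlistener *listener, evutil_socket_t fd,
                          struct sockaddr *addr, int socklen, void *ctx)
    {
        (void)listener; (void)addr; (void)socklen;
        struct event_base *base = ctx;
        struct bufferevent *bev =
            bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
        bufferevent_setcb(bev, read_cb, NULL, event_cb, NULL);
        bufferevent_enable(bev, EV_READ | EV_WRITE);
    }

    int main(void)
    {
        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(9000);      /* arbitrary port for illustration */

        struct event_base *base = event_base_new();
        struct evconnlistener *listener = evconnlistener_new_bind(
            base, accept_cb, base, LEV_OPT_REUSEABLE | LEV_OPT_CLOSE_ON_FREE,
            -1, (struct sockaddr *)&sin, sizeof(sin));
        (void)listener;
        return event_base_dispatch(base);   /* run the event loop */
    }

Under IX, analogous callbacks would be driven by batched events delivered by the dataplane rather than by per-socket kernel notifications, but the application keeps the same shape: an event loop with per-connection handlers.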
We compare IX with a TCP/IP dataplane against Linux 3.16.1 and mTCP, a state-of-the-art user-level TCP/IP stack [29]. On a 10GbE experiment using short messages, IX outperforms Linux and mTCP by up to 10× and 1.9×, respectively, in throughput. IX further scales to a 4x10GbE configuration using a single multi-core socket. The unloaded uni-directional latency for two IX servers is 5.7μs, which is 4× better than between standard Linux kernels and an order of magnitude better than mTCP, as both trade off latency for throughput. Our evaluation with memcached, a widely deployed key-value store, shows that IX improves upon Linux by up to 3.6× in terms of throughput at a given 99th-percentile latency bound, as it reduces kernel time, due essentially to network processing, from ∼75% with Linux to <10% with IX.
IX demonstrates that, by revisiting networking APIs and taking advantage of modern NICs and multi-core chips, we can design systems that achieve high throughput, low latency, robust protection, and resource efficiency. It also shows that, by separating the small subset of performance-critical I/O functions from the rest of the kernel, we can architect radically different I/O systems and achieve large performance gains, while retaining compatibility with the huge set of APIs and services provided by a modern OS like Linux.
The rest of the paper is organized as follows. §2 motivates the need for a new OS architecture. §3 and §4 present the design principles and implementation of IX. §5 presents the quantitative evaluation. §6 and §7 discuss open issues and related work.
2 Background and Motivation
Our work focuses on improving operating systems for applications with aggressive networking requirements running on multi-core servers.
2.1 Challenges for Datacenter Applications
Large-scale datacenter applications pose unique challenges to system software and their networking stacks:
Microsecond tail latency: To enable rich interactions between a large number of services without impacting the overall latency experienced by the user, it is essential to reduce the latency for some service requests to a few tens of μs [3, 54]. Because each user request often involves hundreds of servers, we must also consider the long tail of the latency distributions of RPC requests across the datacenter [14]. Although tail-tolerance is actually an end-to-end challenge, the system software stack plays a significant role in exacerbating the problem [36]. Overall, each service node must ideally provide tight bounds on the 99th-percentile request latency.
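As a concrete illustration of the metric, the sketch below estimates the 99th-percentile latency of a set of per-request samples using the nearest-rank method; the sample values and the choice of nanoseconds as the unit are made up for illustration.

    /* Sketch: nearest-rank estimate of a tail-latency percentile from
     * per-request samples. The sample values are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b)
    {
        unsigned long long x = *(const unsigned long long *)a;
        unsigned long long y = *(const unsigned long long *)b;
        return (x > y) - (x < y);
    }

    /* Returns the p-th percentile (0 < p <= 100) of n latency samples. */
    static unsigned long long percentile(unsigned long long *lat, size_t n, double p)
    {
        qsort(lat, n, sizeof(lat[0]), cmp_u64);
        size_t idx = (size_t)(p / 100.0 * n);
        if (idx >= n)
            idx = n - 1;
        return lat[idx];
    }

    int main(void)
    {
        unsigned long long samples_ns[] = { 12000, 15000, 9000, 480000, 11000,
                                            14000, 10000, 13000, 16000, 12500 };
        size_t n = sizeof(samples_ns) / sizeof(samples_ns[0]);
        printf("p99 latency: %llu ns\n", percentile(samples_ns, n, 99.0));
        return 0;
    }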
High packet rates: The requests and, oftentimes, the replies between the various services that comprise a datacenter application are quite small. In Facebook's memcached service, for example, the vast majority of requests use keys shorter than 50 bytes and involve values shorter than 500 bytes [2], and each node can scale to serve millions of requests per second [46].

The high packet rate must also be sustainable under a large number of concurrent connections and high connection churn [23]. If the system software cannot handle large connection counts, there can be significant implications for applications. The large connection count between application and memcached servers at Facebook made it impractical to use TCP sockets between these two tiers, resulting in deployments that use UDP datagrams for get operations and an aggregation proxy for put operations [46].

Protection: Since multiple services commonly share servers in both public and private datacenters [14, 25, 56], there is a need for isolation between applications. The use of kernel-based or hypervisor-based networking stacks largely addresses the problem. A trusted network stack can firewall applications, enforce access control lists (ACLs), and implement limiters and other policies based on bandwidth metering.

Resource efficiency: The load of datacenter applications varies significantly due to diurnal patterns and spikes in user traffic. Ideally, each service node will use the fewest resources (cores, memory, or IOPS) needed to satisfy packet rate and tail latency requirements at any point. The remaining server resources can be allocated to other applications [15, 25] or placed into low-power mode for energy efficiency [4]. Existing operating systems can support such resource usage policies [36, 38].
2.2 The Hardware – OS Mismatch
The wealth of hardware resources in modern servers should allow for low latency and high packet rates for datacenter applications. A typical server includes one or two processor sockets, each with eight or more multithreaded cores and multiple, high-speed channels to DRAM and PCIe devices. Solid-state drives and PCIe-based Flash storage are also increasingly popular. For networking, 10 GbE NICs and switches are widely deployed in datacenters, with 40 GbE and 100 GbE technologies right around the corner. The combination of tens of hardware threads and 10 GbE NICs should allow for rates of 15M packets/sec with minimum-sized packets. We should also achieve 10–20μs round-trip latencies given 3μs latency across a pair of 10 GbE NICs, one to five switch crossings with cut-through latencies of a few hundred ns each, and propagation delays of 500ns for 100 meters of distance within a datacenter.
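Both figures follow from simple arithmetic, reproduced in the sketch below. The 300ns per-hop switch latency is an assumption standing in for the "few hundred ns" above; the printed round trips land at the lower end of the quoted 10–20μs range, leaving the remainder of the budget for NIC drivers and host software.

    /* Back-of-the-envelope check of the figures quoted above. The per-hop
     * switch latency is an assumption (300 ns). */
    #include <stdio.h>

    int main(void)
    {
        /* Peak packet rate at 10 Gb/s with minimum-sized Ethernet frames:
         * 64 B frame + 8 B preamble/SFD + 12 B inter-frame gap = 84 B on the wire. */
        const double link_bps = 10e9;
        const double wire_bytes = 64 + 8 + 12;
        printf("max rate: %.2f Mpps\n", link_bps / (wire_bytes * 8) / 1e6);

        /* One-way latency budget: NIC pair + switch crossings + propagation. */
        const double nic_pair_us = 3.0;   /* latency across a pair of 10 GbE NICs */
        const double switch_us   = 0.3;   /* cut-through crossing, assumed ~300 ns */
        const double prop_us     = 0.5;   /* ~100 m of cabling inside the datacenter */
        for (int hops = 1; hops <= 5; hops++) {
            double one_way = nic_pair_us + hops * switch_us + prop_us;
            printf("%d switch hops: one-way %.1f us, round trip %.1f us\n",
                   hops, one_way, 2 * one_way);
        }
        return 0;
    }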
Unfortunately, commodity operating systems have been designed under very different hardware assumptions. Kernel schedulers, networking APIs, and network stacks have been designed under the assumptions of multiple applications sharing a single processing core and packet inter-arrival times being many times higher than the latency of interrupts and system calls. As a result, such operating systems trade off both latency and throughput in favor of fine-grain resource scheduling. Interrupt coalescing (used to reduce processing overheads), queuing latency due to device driver processing intervals, the use of intermediate buffering, and CPU scheduling delays frequently add up to several hundred μs of latency to remote requests. The overheads of buffering and synchronization needed to support flexible, fine-grain scheduling of applications to cores increase CPU and memory system overheads, which limits throughput. As requests between service tiers of datacenter applications often consist of small packets, common NIC hardware optimizations, such as TCP segmentation and receive side coalescing, have a marginal impact on packet rate.
2.3 Alternative Approaches
Since the network stacks within commodity kernels cannot take advantage of the abundance of hardware resources, a number of alternative approaches have been suggested. Each alternative addresses a subset, but not all, of the requirements for datacenter applications.
User-space networking stacks: Systems such as OpenOnload [59], mTCP [29], and Sandstorm [40] run the entire networking stack in user-space in order to eliminate kernel crossing overheads and optimize packet processing without incurring the complexity of kernel modifications. However, there are still tradeoffs between packet rate and latency. For instance, mTCP uses dedicated threads for the TCP stack, which communicate at relatively coarse granularity with application threads. This aggressive batching amortizes switching overheads at the expense of higher latency (see §5). It also complicates resource sharing, as the network stack must use a large number of hardware threads regardless of the actual load. More importantly, security tradeoffs emerge when networking is lifted into user-space and application bugs can corrupt the networking stack. For example, an attacker may be able to transmit raw packets (a capability that normally requires root privileges) to exploit weaknesses in network protocols and impact other services [8]. It is difficult to enforce any security or metering policies beyond what is directly supported by the NIC hardware.
Alternatives to TCP: In addition to kernel bypass, some low-latency object stores rely on RDMA to offload protocol processing onto dedicated InfiniBand host channel adapters [17, 31, 44, 47]. RDMA can reduce latency, but requires that specialized adapters be present at both ends of the connection. Using commodity Ethernet networking, Facebook's memcached deployment uses UDP to avoid connection scalability limitations [46]. Even though UDP runs in the kernel, reliable communication and congestion management are entrusted to applications.
Alternatives to POSIX API: MegaPipe replaces the POSIX API with lightweight sockets implemented with in-memory command rings [24]. This reduces some software overheads and increases packet rates, but retains all other challenges of using an existing, kernel-based networking stack.
OS enhancements: Tuning kernel-based stacks provides incremental benefits with superior ease of deployment. Linux SO_REUSEPORT allows multi-threaded applications to accept incoming connections in parallel. Affinity-accept reduces overheads by ensuring that all processing for a network flow is affinitized to the same core [49]. Recent Linux kernels support a busy-polling driver mode that trades increased CPU utilization for reduced latency [27], but it is not yet compatible with epoll. When microsecond latencies are irrelevant, properly tuned stacks can maintain millions of open connections [66].
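For reference, the sketch below shows how an application opts into the two kernel mechanisms just mentioned. SO_REUSEPORT and SO_BUSY_POLL are real Linux socket options; the port number, the busy-poll interval, and the omission of error handling are illustrative choices.

    /* Sketch of the Linux tuning knobs mentioned above: SO_REUSEPORT lets
     * several threads or processes bind the same listening port, and
     * SO_BUSY_POLL asks the kernel to busy-poll the device queue for the
     * given number of microseconds. Error handling is omitted. */
    #include <sys/socket.h>
    #include <netinet/in.h>

    int make_listener(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    #ifdef SO_BUSY_POLL
        int busy_poll_us = 50;   /* trade CPU for latency; value is illustrative */
        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_us, sizeof(busy_poll_us));
    #endif

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);        /* arbitrary port for illustration */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);
        return fd;   /* each worker thread can open its own listener on this port */
    }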
3 IX Design Approach
The first two requirements in §2.1 — microsecond latency and high packet rates — are not unique to datacenter applications. These requirements have been addressed in the design of middleboxes such as firewalls, load-balancers, and software routers [16, 34] by integrating the networking stack and the application into a single dataplane. The two remaining requirements — protection and resource efficiency — are not addressed in middleboxes because they are single-purpose systems, not exposed directly to users.
Many middlebox dataplanes adopt design principles that differ from traditional OSes. First, they run each packet to completion. All network protocol and application processing for a packet is done before moving on to the next packet, and application logic is typically intermingled with the networking stack without any isolation. By contrast, a commodity OS decouples protocol processing from the application itself in order to provide scheduling and flow control flexibility. For example, the kernel relies on device and soft interrupts to context switch from applications to protocol processing. Similarly, the kernel's network stack will generate TCP ACKs and slide its receive window even when the application is not consuming data, up to an extent. Second, middlebox dataplanes optimize for synchronization-free operation in order to scale well on many cores. Network flows are distributed into distinct queues via flow-consistent hashing, and common-case packet processing requires no synchronization or coherence traffic between cores. By contrast, commodity OSes tend to rely heavily on coherence traffic and are structured to make frequent use of locks and other forms of synchronization.
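To make flow-consistent hashing concrete, the sketch below maps a connection's addressing tuple to a fixed receive queue. It is a simplified software stand-in: real NICs compute a Toeplitz-style RSS hash in hardware, and the FNV-1a hash, the queue count, and the example addresses are illustrative assumptions.

    /* Sketch of flow-consistent hashing: hash the connection tuple so that
     * every packet of a flow lands in the same RX queue and is handled by
     * the same core. NICs do this in hardware (e.g., Toeplitz-based RSS);
     * this software version only illustrates the idea. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_QUEUES 8   /* one RX queue per dataplane core (assumption) */

    /* FNV-1a, a simple well-known hash; not the hash a NIC actually uses. */
    static uint32_t fnv1a(const uint8_t *data, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= data[i];
            h *= 16777619u;
        }
        return h;
    }

    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    static unsigned rx_queue_for(const struct flow_key *k)
    {
        return fnv1a((const uint8_t *)k, sizeof(*k)) % NUM_QUEUES;
    }

    int main(void)
    {
        struct flow_key k = { 0x0a000001, 0x0a000002, 40000, 80 };
        /* Every packet of this flow maps to the same queue, so the owning
         * core can process it without locks or cross-core coherence traffic. */
        printf("flow -> queue %u\n", rx_queue_for(&k));
        return 0;
    }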
IX extends the dataplane architecture to support untrusted, general-purpose applications and satisfy all requirements in §2.1. Its design is based on the following key principles:
Separation and protection of control and data plane:
IX separates the control function of the kernel, responsible for resource configuration, provisioning, scheduling, and monitoring, from the dataplane, which runs the networking stack and application logic. Like a conventional OS, the control plane multiplexes and schedules resources among dataplanes, but in a coarse-grained manner in space and time. Entire cores are dedicated to dataplanes, memory is allocated at large page granularity, and NIC queues are assigned to dataplane cores. The control plane is also responsible for elastically adjusting the allocation of resources between dataplanes.
The separation of control and data plane also allows us to consider radically different I/O APIs, while permitting other OS functionality, such as file system support, to be passed through to the control plane for compatibility. Similar to the Exokernel [19], each dataplane runs a single application in a single address space. However, we use modern virtualization hardware to provide three-way isolation between the control plane, the dataplane, and untrusted user code [7]. Dataplanes have capabilities similar to guest OSes in virtualized systems. They manage their own address translations, on top of the address space provided by the control plane, and can protect the networking stack from untrusted application logic through the use of privilege rings. Moreover, dataplanes are given direct pass-through access to NIC queues through memory-mapped I/O.
Run to completion with adaptive batching: IX dataplanes run to completion all stages needed to receive and transmit a packet, interleaving protocol processing (kernel mode) and application logic (user mode) at well-defined transition points. Hence, there is no need for intermediate buffering between protocol stages or between application logic and the networking stack. Unlike previous work that applied a similar approach to eliminate receive livelocks during congestion periods [45], IX uses run to completion during all load conditions. Thus, we are able to use polling and avoid interrupt overhead in the common case by dedicating cores to the dataplane. We still rely on interrupts as a mechanism to regain control, for example, if application logic is slow to respond. Run to completion improves both message throughput and latency because successive stages tend to access many of the same data, leading to better data cache locality.
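A minimal sketch of this structure, assuming each core exclusively owns one receive queue, is shown below. The function names, the two explicit processing stages, and the doubling/halving heuristic for adapting the batch bound are illustrative placeholders rather than the IX implementation.

    /* Sketch of a per-core run-to-completion loop with adaptive batching.
     * poll_rx/net_rx/app_handle/net_tx_flush are placeholders standing in
     * for NIC polling, protocol processing (kernel mode), application
     * logic (user mode), and transmit flushing, respectively. */
    #include <stddef.h>

    #define MAX_BATCH 64

    struct pkt;                                      /* opaque packet descriptor */
    size_t poll_rx(struct pkt **pkts, size_t max);   /* dequeue up to max packets */
    void   net_rx(struct pkt *p);                    /* protocol processing       */
    void   app_handle(struct pkt *p);                /* application processing    */
    void   net_tx_flush(void);                       /* push responses to the NIC */

    void dataplane_loop(void)
    {
        size_t batch = 1;                /* start small when lightly loaded */
        struct pkt *pkts[MAX_BATCH];

        for (;;) {
            size_t n = poll_rx(pkts, batch);   /* bounded batch, polled, no interrupts */

            /* Stage 1: protocol processing for the whole batch ... */
            for (size_t i = 0; i < n; i++)
                net_rx(pkts[i]);

            /* Stage 2: ... then application processing for the same batch,
             * so each stage runs over warm instruction and data caches. */
            for (size_t i = 0; i < n; i++)
                app_handle(pkts[i]);

            net_tx_flush();

            /* Adapt the batch bound to load: grow while the queue keeps
             * filling the batch, shrink back toward 1 when it does not. */
            if (n == batch && batch < MAX_BATCH)
                batch *= 2;
            else if (n < batch / 2 && batch > 1)
                batch /= 2;
        }
    }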
The IX dataplane also makes extensive use of batching. Previous systems applied batching at the system call boundary [24, 58] and at the network API and