
The Linux Scheduler: a Decade of Wasted Cores
Jean-Pierre Lozi, Université Nice Sophia Antipolis
Baptiste Lepers, EPFL
Justin Funston, University of British Columbia
Fabien Gaud, Coho Data
Vivien Quéma, Grenoble INP / ENSIMAG
Alexandra Fedorova, University of British Columbia

Abstract

As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database. The main contribution of this work is the discovery and analysis of these bugs and providing the fixes. Conventional testing techniques and debugging tools are ineffective at confirming or understanding this kind of bug, because its symptoms are often evasive. To drive our investigation, we built new tools that check for violation of the invariant online and visualize scheduling activity. They are simple, easily portable across kernel versions, and run with negligible overhead. We believe that making these tools part of the kernel developers’ tool belt can help keep this type of bug at bay.


1. Introduction
“And you have to realize that there are not very many things that have aged as well as the scheduler. Which is just another proof that scheduling is easy.”
Linus Torvalds, 2001 [43]
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, contact the Owner/Author(s). Request permissions from permissions@acm.org or Publications Dept., ACM, Inc., fax +1 (212) 869-0481.
EuroSys ’16, April 18 – 21, 2016, London, United Kingdom
Copyright © 2016 held by owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-4240-7/16/04…$15.00
DOI: http://dx.doi.org/10.1145/2901318.2901326


Classical scheduling problems revolve around setting the length of the scheduling quantum to provide interactive responsiveness while minimizing the context switch overhead, simultaneously catering to batch and interactive workloads in a single system, and efficiently managing scheduler run queues. By and large, by the year 2000, operating systems designers considered scheduling to be a solved problem; the Linus Torvalds quote is an accurate reflection of the general opinion at that time.
The year 2004 brought an end to Dennard scaling, ushered in the multicore era and made energy efficiency a top concern in the design of computer systems. These events once again made schedulers interesting, but at the same time increasingly more complicated and often broken.
Our recent experience with the Linux scheduler revealed that the pressure to work around the challenging properties of modern hardware, such as non-uniform memory access latencies (NUMA), high costs of cache coherency and synchronization, and diverging CPU and memory latencies, resulted in a scheduler with an incredibly complex implementation. As a result, the very basic function of the scheduler, which is to make sure that runnable threads use idle cores, fell through the cracks.
The main contribution of this work is the discovery and study of four performance bugs in the Linux scheduler. These bugs cause the scheduler to leave cores idle while runnable threads are waiting for their turn to run.1 Resulting performance degradations are in the range 13-24% for typical Linux workloads, and reach 138× in some corner cases. Energy waste is proportional. Since these bugs undermine a crucial kernel sub-system, cause substantial, sometimes massive, degradation of performance, and evade conventional testing and debugging techniques, understanding their nature and provenance is important.
1 This occurs even though the scheduler is not explicitly configured to save power by purposefully leaving cores unused so they can be brought into a low-power state.

These bugs have different root causes, but a common symptom. The scheduler unintentionally and for a long time leaves cores idle while there are runnable threads waiting in runqueues. Short-term occurrences of this condition are acceptable: the system may temporarily enter such a state when, for instance, a thread exits or blocks or when a thread is created or becomes unblocked. Long-term occurrences are not an expected behavior. The Linux scheduler is work-conserving, meaning that it should never leave cores idle if there is work to do. Long-term presence of this symptom is, therefore, unintentional: it is due to bugs and it hurts performance.
We provide fixes to these bugs, and observe substantial performance improvements. Synchronization-intensive applications experienced many-fold speedups; one barrier-heavy scientific application ended up running 138 times faster.2 Kernel make and a TPC-H workload on a widely used commercial DBMS improved performance by 13% and 14% respectively. The TPC-H query most affected by the bug sped up by 23%.
Detecting these bugs is difficult. They do not cause the system to crash or hang, but eat away at performance, often in ways that are difficult to notice with standard performance monitoring tools. With the TPC-H workload, for example, the symptom occurred many times throughout the execution, but each time it lasted only a few hundred milliseconds – too short to detect with tools like htop, sar or perf. Yet, collectively these occurrences did enough damage to slow down the most affected query by 23%. Even in cases where the symptom was present for a much longer duration, the root cause was difficult to discover, because it was a result of many asynchronous events in the scheduler.
We initially suspected scheduler bugs when we observed unexplained performance degradation in the TPC-H database workload, which we were evaluating for a different project. Conventional tools were unhelpful to us in either confirming the bugs or understanding their root causes. To that end, we designed two new tools. The first tool, which we call a sanity checker, periodically checks for the violation of the aforementioned invariant, catches the bugs on a live system and collects a trace with relevant information for offline analysis. The second tool visualizes traces of scheduling activity to expedite debugging. These tools were easy to port between kernel versions (from Linux 3.17 through 4.3), ran with negligible overhead and consistently detected invariant violations. Keeping them in the standard tool belt can help reduce future occurrence of this class of bugs.
The rest of the paper is organized as follows. Section 2 describes the architecture of the Linux scheduler. Section 3 introduces the bugs we discovered, analyzes their root causes and reports their effect on performance. Section 4 presents the tools. In Section 5 we reflect on the lessons learned as a result of this study and identify open research problems. Section 6 discusses related work and Section 7 summarizes our findings.

2 As we explain later in the paper, scheduling shortcomings exacerbated lock contention.
2. The Linux Scheduler
We first describe how Linux’s Completely Fair Scheduling (CFS) algorithm works on a single-core single-user system (Section 2.1). From this perspective, the algorithm is quite simple. Then, in Section 2.2, we explain how limitations of modern multicore systems force developers to work around potential performance bottlenecks, which results in a substantially more complex and bug-prone implementation.
2.1 On a single-CPU system, CFS is very simple
Linux’s CFS is an implementation of the weighted fair queueing (WFQ) scheduling algorithm, wherein the available CPU cycles are divided among threads in proportion to their weights. To support this abstraction, CFS (like most other CPU schedulers) time-slices the CPU among the running threads. The key decisions made in the scheduler are: how to determine a thread’s timeslice, and how to pick the next thread to run.
The scheduler defines a fixed time interval during which each thread in the system must run at least once. The interval is divided among threads proportionally to their weights. The resulting interval (after division) is what we call the timeslice. A thread’s weight is essentially its priority, or niceness in UNIX parlance. Threads with lower niceness have higher weights and vice versa.
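To make the arithmetic concrete, the following minimal sketch (an illustration with assumed values, not the kernel's actual code or constants) divides a hypothetical 48 ms interval among three threads in proportion to their weights:

#include <stdio.h>

/* Hedged sketch: divide a fixed scheduling interval among threads in
 * proportion to their weights. The 48 ms interval and the weights below
 * are illustrative values, not the kernel's actual constants. */
int main(void) {
    const double sched_interval_ms = 48.0;      /* assumed target interval */
    const double weights[] = {1024, 1024, 335}; /* two "nice 0" threads, one lower-priority thread */
    const int n = sizeof(weights) / sizeof(weights[0]);

    double total = 0;
    for (int i = 0; i < n; i++)
        total += weights[i];

    for (int i = 0; i < n; i++) {
        double slice = sched_interval_ms * weights[i] / total;
        printf("thread %d: weight %.0f -> timeslice %.1f ms\n",
               i, weights[i], slice);
    }
    return 0;
}

A thread with a lower weight simply receives a proportionally smaller slice of the same interval.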
When a thread runs, it accumulates vruntime (runtime of the thread divided by its weight). Once a thread’s vruntime exceeds its assigned timeslice, the thread is pre-empted from the CPU if there are other runnable threads available. A thread might also get pre-empted if another thread with a smaller vruntime is awoken.
Threads are organized in a runqueue, implemented as a red-black tree, in which the threads are sorted in the increasing order of their vruntime. When a CPU looks for a new thread to run, it picks the leftmost node in the red-black tree, which contains the thread with the smallest vruntime.
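The following self-contained sketch (again an illustration, not kernel code) shows the two mechanisms just described: vruntime advances in inverse proportion to a thread's weight, and the thread with the smallest vruntime runs next. A linear scan stands in for the kernel's red-black tree, and the weights are assumed values.

#include <stdio.h>

/* Hedged sketch of CFS bookkeeping (not kernel code): vruntime advances
 * in inverse proportion to a thread's weight, and the scheduler picks the
 * thread with the smallest vruntime. The kernel keeps threads in a
 * red-black tree sorted by vruntime; a linear scan stands in for it here. */
struct thread {
    const char *name;
    double weight;    /* higher weight = higher priority */
    double vruntime;  /* weighted runtime, in ms */
};

static void account_runtime(struct thread *t, double ran_ms) {
    const double nice0_weight = 1024.0;        /* assumed reference weight */
    t->vruntime += ran_ms * nice0_weight / t->weight;
}

static struct thread *pick_next(struct thread *rq, int n) {
    struct thread *next = &rq[0];              /* "leftmost" = smallest vruntime */
    for (int i = 1; i < n; i++)
        if (rq[i].vruntime < next->vruntime)
            next = &rq[i];
    return next;
}

int main(void) {
    struct thread rq[] = {
        { "high-prio", 2048.0, 0.0 },
        { "low-prio",  1024.0, 0.0 },
    };
    for (int step = 0; step < 6; step++) {
        struct thread *t = pick_next(rq, 2);
        account_runtime(t, 10.0);              /* run it for 10 ms */
        printf("step %d: ran %s (vruntime now %.1f)\n",
               step, t->name, t->vruntime);
    }
    return 0;
}

Running this loop picks the double-weight thread twice as often as the other, which is exactly the weighted fairness CFS aims for.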
2.2 On multi-core systems, CFS becomes quite complex
In multicore environments the implementation of the scheduler becomes substantially more complex. Scalability concerns dictate using per-core runqueues. The motivation for per-core runqueues is that upon a context switch the core would access only its local runqueue when it looks for a thread to run. Context switches are on a critical path, so they must be fast. Accessing only a core-local queue prevents the scheduler from making potentially expensive synchronized accesses, which would be required if it accessed a globally shared runqueue.
However, in order for the scheduling algorithm to still work correctly and efficiently in the presence of per-core runqueues, the runqueues must be kept balanced. Consider a dual-core system with two runqueues that are not balanced. Suppose that one queue has one low-priority thread and another has ten high-priority threads. If each core looked for work only in its local runqueue, then high-priority threads would get a lot less CPU time than the low-priority thread, which is not what we want. We could have each core check not only its runqueue but also the queues of other cores, but this would defeat the purpose of per-core runqueues. Therefore, what Linux and most other schedulers do is periodically run a load-balancing algorithm that will keep the queues roughly balanced.
“I suspect that making the scheduler use per-CPU queues together with some inter-CPU load balancing logic is probably trivial. Patches already exist, and I don’t feel that people can screw up the few hundred lines too badly.”
Linus Torvalds, 2001 [43]
Conceptually, load balancing is simple. In 2001, CPUs were mostly single-core and commodity server systems typically had only a handful of processors. It was, therefore, difficult to foresee that on modern multicore systems load balancing would become challenging. Load balancing is an expensive procedure on today’s systems, both computation-wise, because it requires iterating over dozens of runqueues, and communication-wise, because it involves modifying remotely cached data structures, causing extremely expensive cache misses and synchronization. As a result, the scheduler goes to great lengths to avoid executing the load-balancing procedure often. At the same time, not executing it often enough may leave runqueues unbalanced. When that happens, cores might become idle when there is work to do, which hurts performance. So in addition to periodic load-balancing, the scheduler also invokes “emergency” load balancing when a core becomes idle, and implements some load-balancing logic upon placement of newly created or newly awoken threads. These mechanisms should, in theory, ensure that the cores are kept busy if there is work to do.
We next describe how load balancing works, first explaining the algorithm and then the optimizations that the scheduler employs to maintain low overhead and to save power. Later we show that some of these optimizations make the code more complex and cause bugs.
2.2.1 The load balancing algorithm
Crucial for understanding the load balancing algorithm is the metric that the CFS scheduler uses to track load. We begin by explaining the metric and then describe the actual algorithm.
The load tracking metric. A strawman load-balancing algorithm would simply ensure that each runqueue has roughly the same number of threads. However, this is not necessarily what we want. Consider a scenario with two runqueues, where one queue has some number of high-priority threads and another queue has the same number of low-priority threads. Then high-priority threads would get the same amount of CPU time as low-priority threads. That is not what we want. One idea, then, is to balance the queues based on threads’ weights, not their number.
Unfortunately, balancing the load based solely on thread weights is not sufficient either. Consider a scenario with ten threads in two runqueues: one thread is of high priority and nine threads are of low priority. Let us assume that the weight of the high-priority thread is nine times higher than those of the low-priority threads. With the load balanced according to threads’ weights, one runqueue would contain the high-priority thread, while the other would contain the nine low-priority threads. The high-priority thread would get nine times more CPU than the low-priority threads, which appears to be what we want. However, suppose that the high-priority thread often sleeps for short periods of time, so the first core often goes idle. This core would have to frequently steal work from the other core’s runqueue to keep itself busy. However, we do not want work stealing to become the common case, because this defeats the purpose of per-core runqueues. What we really want is to balance the runqueues in a smarter way, accounting for the fact that the high-priority thread does not need a whole core.
To achieve this goal, CFS balances runqueues not just based on weights, but based on a metric called load, which is the combination of the thread’s weight and its average CPU utilization. If a thread does not use much of a CPU, its load will be decreased accordingly.
Additionally, the load-tracking metric accounts for varying levels of multithreading in different processes. Consider a scenario where we have one process with lots of threads, and another process with few threads. Then the threads of the first process combined will receive a lot more CPU time than the threads of the second process. As a result, the first process would use most of the CPU cycles and starve the other process. This would be unfair. So as of version 2.6.38 Linux added a group scheduling feature to bring fairness between groups of threads (cgroup feature). When a thread belongs to a cgroup, its load is further divided by the total number of threads in its cgroup. This feature was later extended to automatically assign processes that belong to different ttys to different cgroups (autogroup feature).
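As a rough illustration of this metric (a simplified sketch with assumed constants and formula shape, not the kernel's actual implementation), a thread's load can be thought of as its weight scaled by its average CPU utilization and divided by its cgroup's thread count:

#include <stdio.h>

/* Hedged sketch of the load metric described in the text (not the kernel's
 * actual code): a thread's load combines its weight with its average CPU
 * utilization, and is further divided by the number of threads in its
 * cgroup so that heavily threaded processes do not dominate. */
struct thread_info {
    double weight;          /* priority-derived weight */
    double avg_cpu_util;    /* fraction of time spent running, 0..1 */
    int    cgroup_threads;  /* number of threads in the same cgroup */
};

static double thread_load(const struct thread_info *t) {
    return t->weight * t->avg_cpu_util / t->cgroup_threads;
}

int main(void) {
    /* A mostly-sleeping high-priority thread vs. an always-busy "nice 0" thread
     * (weights are made-up illustration values). */
    struct thread_info high = { 9216.0, 0.10, 1 };
    struct thread_info busy = { 1024.0, 1.00, 1 };
    printf("high-priority, mostly idle: load = %.0f\n", thread_load(&high));
    printf("low-priority, always busy:  load = %.0f\n", thread_load(&busy));
    return 0;
}

With such a metric, the mostly-sleeping high-priority thread from the example above contributes less load than a fully busy thread, so the balancer is not forced to dedicate a whole core to it.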
The load-balancing algorithm. A basic load balancing algorithm would compare the load of all cores and then transfer tasks from the most loaded core to the least loaded core. Unfortunately this would result in threads being migrated across the machine without considering cache locality or NUMA. Instead, the load balancer uses a hierarchical strategy.
The cores are logically organized in a hierarchy, at the bottom of which is a single core. How the cores are grouped in the next levels of the hierarchy depends on how they share the machine’s physical resources. In the example provided here we describe the hierarchy on our experimental machine (see Table 5), where pairs of cores share functional units (e.g., FPU), and groups of eight cores share a last-level cache (LLC). A group of eight cores sharing an LLC forms a NUMA node. Different NUMA nodes have varying connectivity as explained below and as shown in Figure 4. Consequently, on our target system, at the second level of the hierarchy we have pairs of cores, and at the next level we have groups of eight cores, each sharing an LLC (i.e., a NUMA node). NUMA nodes are further grouped according to their level of connectivity [23]. Nodes that are one hop apart from each other will be at the next level, and so on. An example of such a hierarchy is shown in Figure 1. Each level of the hierarchy is called a scheduling domain.

Figure 1: A machine with 32 cores, four nodes (eight cores per node) and SMT-level sharing among pairs of cores. The four grey areas represent the scheduling domains relative to the first core of the machine. Note that at the second level of the hierarchy we have a group of three nodes. That is because these three nodes are reachable from the first core in one hop. At the 4th level, we have all nodes of the machine because all nodes are reachable in 2 hops. Figure 4 shows the connectivity between the nodes on our system.
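The following sketch (an illustration of the concept, not kernel code; the exact CPU counts are assumptions based on the example machine) models the scheduling domains containing one core as nested groups linked bottom-up:

#include <stdio.h>

/* Hedged sketch (not kernel code): scheduling domains as nested groups of
 * CPUs, each level pointing to the larger domain above it. The levels
 * follow the example machine in the text (SMT pairs, 8-core NUMA nodes,
 * groups of nearby nodes, whole machine); the CPU counts are illustrative
 * assumptions. */
struct sched_domain {
    const char *level_name;
    int nr_cpus;                     /* CPUs spanned at this level */
    struct sched_domain *parent;     /* next (larger) scheduling domain */
};

int main(void) {
    struct sched_domain machine = { "whole machine",          32, NULL };
    struct sched_domain near    = { "nodes one hop away",     24, &machine };
    struct sched_domain node    = { "NUMA node (shared LLC)",  8, &near };
    struct sched_domain pair    = { "SMT pair (shared FPU)",   2, &node };

    /* Walk the hierarchy bottom-up, as the load balancer does. */
    for (struct sched_domain *sd = &pair; sd != NULL; sd = sd->parent)
        printf("domain '%s' spans %d CPUs\n", sd->level_name, sd->nr_cpus);
    return 0;
}

The load balancer walks exactly this kind of chain, as described next.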
The load balancing algorithm is summarized in Algorithm 1. Load balancing is run for each scheduling domain, starting from the bottom to the top. At each level, one core of each domain is responsible for balancing the load. This core is either the first idle core of the scheduling domain, if the domain has idle cores whose free CPU cycles can be used for load balancing, or the first core of the scheduling domain otherwise (Lines 2–9). Following this, the average load is computed for each scheduling group of the scheduling domain (Lines 10–12), and the busiest group is picked, based on heuristics that favor overloaded and imbalanced groups (Line 13). If the busiest group’s load is lower than the local group’s load, the load is considered balanced at this level (Line 16). Otherwise, the load is balanced between the local CPU and the busiest CPU of the group, with a tweak to ensure that load balancing works even in the presence of tasksets (Lines 18–23).
Assume, for the time being, that this algorithm is run by all cores in every load-balancing period; in the next section we will explain that, as an optimization, not all cores actually do. A core executing the algorithm begins at the second-to-lowest level of the hierarchy and balances the load one level below. For example, on the system in Figure 1 the core will begin at the pair-of-cores level and will balance the load between the two individual cores.

Algorithm 1 Simplified load balancing algorithm.
{Function running on each cpu cur_cpu:}
 1: for all sd in sched_domains of cur_cpu do
 2:   if sd has idle cores then
 3:     first_cpu = 1st idle CPU of sd
 4:   else
 5:     first_cpu = 1st CPU of sd
 6:   end if
 7:   if cur_cpu ≠ first_cpu then
 8:     continue
 9:   end if
10:   for all sched_group sg in sd do
11:     sg.load = average loads of CPUs in sg
12:   end for
13:   busiest = overloaded sg with the highest load
        (or, if nonexistent) imbalanced sg with highest load
        (or, if nonexistent) sg with highest load
14:   local = sg containing cur_cpu
15:   if busiest.load ≤ local.load then
16:     continue
17:   end if
18:   busiest_cpu = pick busiest cpu of sg
19:   try to balance load between busiest_cpu and cur_cpu
20:   if load cannot be balanced due to tasksets then
21:     exclude busiest_cpu, goto line 18
22:   end if
23: end for
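As a rough C rendition of the control flow in Algorithm 1 (a sketch under simplified assumptions; the data structures and helper functions are illustrative, not the kernel's), the per-domain responsibility check looks like this:

#include <stdbool.h>
#include <stdio.h>

/* Hedged sketch of the skeleton of Algorithm 1 (not kernel code): every
 * core walks its scheduling domains bottom-up, but only the first idle
 * core of a domain (or, failing that, its first core) actually balances
 * that domain. */
#define MAX_CPUS 32

struct domain {
    int cpus[MAX_CPUS];   /* CPUs covered by this domain, in order */
    int nr_cpus;
};

static bool cpu_is_idle(int cpu) {
    (void)cpu;
    return false;         /* stub: pretend every core is busy */
}

/* Lines 2-6 of Algorithm 1: pick the core responsible for this domain. */
static int designated_cpu(const struct domain *sd) {
    for (int i = 0; i < sd->nr_cpus; i++)
        if (cpu_is_idle(sd->cpus[i]))
            return sd->cpus[i];          /* first idle core, if any */
    return sd->cpus[0];                  /* otherwise the first core */
}

/* Placeholder for Lines 10-22: group loads, busiest group, migration. */
static void balance_domain(const struct domain *sd, int cur_cpu) {
    printf("cpu %d balances a %d-CPU domain\n", cur_cpu, sd->nr_cpus);
}

static void load_balance(const struct domain *domains, int nr_domains,
                         int cur_cpu) {
    for (int d = 0; d < nr_domains; d++) {            /* Line 1: bottom-up walk */
        if (designated_cpu(&domains[d]) != cur_cpu)   /* Lines 7-9 */
            continue;
        balance_domain(&domains[d], cur_cpu);
    }
}

int main(void) {
    /* Domains containing CPU 0: its SMT pair, its 8-core node, and the
     * whole 32-core machine (the intermediate one-hop level is omitted). */
    struct domain doms[3] = {
        { .cpus = { 0, 1 },                   .nr_cpus = 2 },
        { .cpus = { 0, 1, 2, 3, 4, 5, 6, 7 }, .nr_cpus = 8 },
        { .nr_cpus = 32 },
    };
    for (int c = 0; c < 32; c++)
        doms[2].cpus[c] = c;
    load_balance(doms, 3, 0);
    return 0;
}

In the real scheduler the domain list, the idleness test and the balancing step are far more involved; the point here is only the bottom-up walk and the "first idle core, otherwise first core" rule described above.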
