
[Recreated electronic version, 1998]
Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism
THOMAS E. ANDERSON, BRIAN N. BERSHAD, EDWARD D. LAZOWSKA, and HENRY M. LEVY
University of Washington


Threads are the vehicle for concurrency in many approaches to parallel programming. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory.
This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the problems encountered in integrating user-level threads with other system services are a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; kernel threads are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.
Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management; D.4.4 [Operating Systems]: Communications Management—input/output; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance
General Terms: Design, Measurement, Performance
Additional Key Words and Phrases: Multiprocessor, thread
1. INTRODUCTION
The effectiveness of parallel computing depends to a great extent on the performance of the primitives that are used to express and control the parallelism within programs.
This work was supported in part by the National Science Foundation (grants CCR-8619663, CCR-8703049, and CCR-8907666), the Washington Technology Center, and Digital Equipment Corporation (the Systems Research Center and the External Research Program). Anderson was supported by an IBM Graduate Fellowship and Bershad by an AT&T Ph.D. Scholarship. Anderson is now with the Computer Science Division, University of California at Berkeley; Bershad is now with the School of Computer Science, Carnegie Mellon University.
Authors’ address: Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0734-2071/92/0200-0053 $01.50
ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, Pages 53–79.

Even a coarse-grained parallel program can exhibit poor performance if the cost of creating and managing parallelism is high. Even a fine-grained program can achieve good performance if the cost of creating and managing parallelism is low.
One way to construct a parallel program is to share memory between a collection of traditional UNIX-like processes, each consisting of a single address space and a single sequential execution stream within that address space. Unfortunately, because such processes were designed for multiprogramming in a uniprocessor environment, they are simply too inefficient for general-purpose parallel programming; they handle only coarse-grained parallelism well.
The shortcomings of traditional processes for general-purpose parallel programming have led to the use of threads. Threads separate the notion of a sequential execution stream from the other aspects of traditional processes such as address spaces and I/O descriptors. This separation of concerns yields a significant performance advantage relative to traditional processes.
1.1 The Problem
Threads can be supported either at user level or in the kernel. Neither approach has been fully satisfactory.
User-level threads are managed by runtime library routines linked into each application so that thread management operations require no kernel intervention. The result can be excellent performance: in systems such as PCR [25] and FastThreads [2], the cost of user-level thread operations is within an order of magnitude of the cost of a procedure call. User-level threads are also flexible; they can be customized to the needs of the language or user without kernel modification.
User-level threads execute within the context of traditional processes; indeed, user-level thread systems are typically built without any modifications to the underlying operating system kernel. The thread package views each process as a “virtual processor,” and treats it as a physical processor executing under its control; each virtual processor runs user-level code that pulls threads off the ready list and runs them. In reality, though, these virtual processors are being multiplexed across real, physical processors by the underlying kernel. “Real world” operating system activity, such as multiprogramming, I/O, and page faults, distorts the equivalence between virtual and physical processors; in the presence of these factors, user-level threads built on top of traditional processes can exhibit poor performance or even incorrect behavior.
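To make the virtual-processor structure concrete, the following minimal sketch shows the dispatch loop such a thread package might run inside each traditional process. It is purely illustrative; all names (thread_t, ready_list, virtual_processor_loop) are hypothetical rather than taken from PCR or FastThreads, and a real package would context-switch among partially executed threads and lock the ready list, whereas here each thread simply runs to completion.

    /* Illustrative sketch of a user-level thread package's dispatch loop.
     * One copy runs inside each "virtual processor" (a traditional process
     * or kernel thread); the kernel multiplexes these onto real processors. */
    #include <stddef.h>

    typedef struct thread {
        void (*entry)(void *);     /* body of the user-level thread */
        void *arg;
        struct thread *next;       /* link in the ready list */
    } thread_t;

    static thread_t *ready_list;   /* runnable user-level threads; a real
                                      system would guard this with a lock */

    static thread_t *dequeue(void) {
        thread_t *t = ready_list;
        if (t != NULL)
            ready_list = t->next;
        return t;
    }

    /* Entry point of each virtual processor: pull threads off the ready
     * list and run them, as described in the text above. */
    void virtual_processor_loop(void) {
        for (;;) {
            thread_t *t = dequeue();
            if (t != NULL)
                t->entry(t->arg);  /* run one user-level thread */
            /* else: no ready threads; spin or block until one appears */
        }
    }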
Multiprocessor operating systems such as Mach [2], Topaz [22], and V [7] provide direct kernel support for multiple threads per address space. Programming with kernel threads avoids the system integration problems exhibited by user-level threads, because the kernel directly schedules each application’s threads onto physical processors. Unfortunately, kernel threads, just like traditional UNIX processes, are too heavyweight for use in many parallel programs. The performance of kernel threads, although typically an order of magnitude better than that of traditional processes, has been typically an order of magnitude worse than the best-case performance of user-level threads (e.g., in the absence of multiprogramming and I/O). As a result, user-level threads have ultimately been implemented on top of the kernel threads of both Mach (C Threads [8]) and Topaz (WorkCrews [24]). User-level threads are built on top of kernel threads exactly as they are built on top of traditional processes; they have exactly the same performance, and they suffer exactly the same problems.
The parallel programmer, then, has been faced with a difficult dilemma: employ user-level threads, which have good performance and correct behavior provided the application is uniprogrammed and does no I/O, or employ kernel threads, which have worse performance but are not as restricted.
1.2 The Goals of this Work
In this paper we address this dilemma. We describe a kernel interface and a user-level thread package that together combine the functionality of kernel threads with the performance and flexibility of user-level threads. Specifically,
— In the common case when thread operations do not need kernel intervention, our performance is essentially the same as that achieved by the best existing user-level thread management systems (which suffer from poor system integration).
— In the infrequent case when the kernel must be involved, such as on processor reallocation or I/O, our system can mimic the behavior of a kernel thread management system:
— No processor idles in the presence of ready threads.
— No high-priority thread waits for a processor while a low-priority thread runs.
— When a thread traps to the kernel to block (for example, because of a page fault), the processor on which the thread was running can be used to run another thread from the same or from a different address space.
— The user-level part of our system is structured to simplify application-specific customization. It is easy to change the policy for scheduling an application’s threads, or even to provide a different concurrency model such as workers [16], Actors [1], or Futures [10].
The difficulty in achieving these goals in a multiprogrammed multiprocessor is that the necessary control and scheduling information is distributed between the kernel and each application’s address space. To be able to allocate processors among applications, the kernel needs access to user-level scheduling information (e.g., how much parallelism there is in each address space). To be able to manage the application’s parallelism, the user-level support software needs to be aware of kernel events (e.g., processor reallocations and I/O requests/completions) that are normally hidden from the application.
1.3 The Approach
Our approach provides each application with a virtual multiprocessor, an abstraction of a dedicated physical machine. Each application knows exactly how many (and which) processors have been allocated to it and has complete control over which of its threads are running on those processors. The operating system kernel has complete control over the allocation of processors among address spaces, including the ability to change the number of processors assigned to an application during its execution.
To achieve this, the kernel notifies the address space thread scheduler of every kernel event affecting the address space, allowing the application to have complete knowledge of its scheduling state. The thread system in each address space notifies the kernel of the subset of user-level thread operations that can affect processor allocation decisions, preserving good performance for the majority of operations that do not need to be reflected to the kernel.
The kernel mechanism that we use to realize these ideas is called scheduler activations. A scheduler activation vectors control from the kernel to the address space thread scheduler on a kernel event; the thread scheduler can use the activation to modify user-level thread data structures, to execute user-level threads, and to make requests of the kernel.
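As a rough illustration of this vectoring, the sketch below shows what the user-level end of such an interface could look like. The event names, signatures, and stub bodies are assumptions made here for exposition; the actual interface is defined later in the paper.

    /* Illustrative sketch of the user-level side of a scheduler-activation
     * interface.  The kernel creates a fresh activation and starts it in
     * sa_upcall() whenever an event affects this address space. */

    enum sa_event {
        SA_PROCESSOR_ADDED,        /* a processor was allocated to us */
        SA_PROCESSOR_PREEMPTED,    /* a processor was taken away */
        SA_ACTIVATION_BLOCKED,     /* an activation blocked in the kernel */
        SA_ACTIVATION_UNBLOCKED    /* a blocked activation may continue */
    };

    /* Stubs standing in for the thread scheduler's internals. */
    static void record_event(enum sa_event e, void *data) { (void)e; (void)data; }
    static void run_next_ready_thread(void) { /* pick a thread; switch to it */ }

    /* Upcall entry point: update user-level thread data structures, then
     * use this processor to run the next ready user-level thread. */
    void sa_upcall(enum sa_event event, void *event_data) {
        record_event(event, event_data);  /* e.g., requeue a preempted thread */
        run_next_ready_thread();          /* the activation does not return */
    }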
We have implemented a prototype of our design on the DEC SRC Firefly multiprocessor workstation [22]. While the differences between scheduler activations and kernel threads are crucial, the similarities are great enough that the kernel portion of our implementation required only relatively straightforward modifications to the kernel threads of Topaz, the native operating system on the Firefly. Similarly, the user-level portion of our implementation involved relatively straightforward modifications to FastThreads, a user-level thread system originally designed to run on top of Topaz kernel threads.
Since our goal is to demonstrate that the exact functionality of kernel threads can be provided at the user level, the presentation in this paper assumes that user-level threads are the concurrency model used by the programmer or compiler. We emphasize, however, that other concurrency models, when implemented at user level on top of kernel threads or processes, suffer from the same problems as user-level threads— problems that are solved by implementing them on top of scheduler activations.
2. USER-LEVEL THREADS: PERFORMANCE ADVANTAGES AND FUNCTIONALITY LIMITATIONS
In this section we motivate our work by describing the advantages that user-level threads offer relative to kernel threads, and the difficulties that arise when user-level threads are built on top of the interface provided by kernel threads or processes. We argue that the performance of user-level threads is inherently better than that of kernel threads, rather than this being an artifact of existing implementations. User-level threads have an additional advantage of flexibility with respect to programming models and environments. Further, we argue that the lack of system integration exhibited by user-level threads is not inherent in user-level threads themselves, but is a consequence of inadequate kernel support.
2.1 The Case for User-Level Thread Management
It is natural to believe that the performance optimizations found in user-level thread systems could be applied within the kernel, yielding kernel threads that are as efficient as user-level threads without the compromises in functionality. Unfortunately, there are significant inherent costs to managing threads in the kernel:
— The cost of accessing thread management operations: With kernel threads, the program must cross an extra protection boundary on every thread operation, even when the processor is being switched between threads in the same address space. This involves not only an extra kernel trap, but the kernel must also copy and check parameters in order to protect itself against buggy or malicious programs. By contrast, invoking user-level thread operations can be quite inexpensive, particularly when compiler techniques are used to expand code inline and perform sophisticated register allocation. Further, safety is not compromised: address space boundaries isolate misuse of a user-level thread system to the program in which it occurs.
— The cost of generality: With kernel thread management, a single underlying implementation is used by all applications. To be general-purpose, a kernel thread system must provide any feature needed by any reasonable application; this imposes overhead on those applications that do not use a particular feature. In contrast, the facilities provided by a user-level thread system can be closely matched to the specific needs of the applications that use it, since different applications can be linked with different user-level thread libraries. As an example, most kernel thread systems implement preemptive priority scheduling, even though many parallel applications can use a simpler policy such as first-in-first-out [24].
These factors would not be important if thread management operations were inherently expensive. Kernel trap overhead and priority scheduling, for instance, are not major contributors to the high cost of UNIX-like processes. However, the cost of thread operations can be within an order of magnitude of a procedure call. This implies that any overhead added by a kernel implementation, however small, will be significant, and a well-written user-level thread system will have significantly better performance than a well-written kernel-level thread system.
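The gap described here is easy to reproduce in spirit on current hardware. The following sketch (GCC/Clang on Linux; syscall(SYS_getpid) is used to force a genuine trap, since the C library may cache getpid()) times a bare procedure call against a minimal kernel trap. The Firefly figures quoted below were of course obtained on different hardware with a different methodology.

    /* Illustrative timing of a procedure call vs. a minimal kernel trap,
     * to make the protection-boundary cost concrete.  Linux-specific. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define N 1000000L

    __attribute__((noinline)) static long null_procedure(long x) { return x; }

    static double elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e9 + (double)(b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        struct timespec t0, t1;
        volatile long sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            sink += null_procedure(i);       /* stays in user space */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("procedure call: %.1f ns\n", elapsed_ns(t0, t1) / N);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            sink += syscall(SYS_getpid);     /* crosses into the kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("kernel trap:    %.1f ns\n", elapsed_ns(t0, t1) / N);

        return (int)(sink & 1);              /* keep the loops from being elided */
    }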
To illustrate this quantitatively, Table I shows the performance of example implementations of user-level threads, kernel threads, and UNIX- like processes, all running on similar hardware, a CVAX processor. FastThreads and Topaz kernel threads were measured on a CVAX Firefly; Ultrix (DEC’s derivative of UNIX) was measured on a CVAX uniprocessor workstation. (Each of these implementations, while good, is not “optimal.” Thus, our measurements are illustrative and not definitive).
The two benchmarks are Null Fork, the time to create, schedule, execute and complete a process/thread that invokes the null procedure (in other words, the overhead of forking a thread), and Signal-Wait, the time for a process/thread to signal a waiting process/thread, and then wait on a condition (the overhead of synchronizing two threads together). Each benchmark was executed on a single processor, and the results were averaged across multiple repetitions. For comparison, a procedure call takes about 7 μsec. on the Firefly, while a kernel trap takes about 19 μsec.

Table I: Thread Operation Latencies (μsec.)

    Operation        FastThreads    Topaz threads    Ultrix processes
    Null Fork             34             948              11300
    Signal-Wait           37             441               1840
Table I shows that while there is an order of magnitude difference in cost between Ultrix process management and Topaz kernel thread management, there is yet another order of magnitude difference between Topaz threads and FastThreads. This is despite the fact that the Topaz thread code is highly tuned with much of the critical path written in assembler.
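For a feel of what these benchmarks measure, the sketch below restates Null Fork using POSIX threads. This is only a stand-in: pthreads did not exist on the systems measured, and a modern kernel thread corresponds to a Topaz thread rather than to a FastThread, so absolute numbers will differ widely from Table I.

    /* A pthreads stand-in for the Null Fork benchmark: create, schedule,
     * execute, and complete a thread running the null procedure, averaged
     * over many repetitions.  Compile with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 10000

    static void *null_procedure(void *arg) { return arg; }

    int main(void) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERATIONS; i++) {
            pthread_t t;
            pthread_create(&t, NULL, null_procedure, NULL);  /* fork */
            pthread_join(t, NULL);                           /* complete */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("Null Fork: %.2f usec per iteration\n", ns / 1000.0 / ITERATIONS);
        return 0;
    }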
Commonly, a tradeoff arises between performance and flexibility in choosing where to implement system services [26]. User-level threads, however, avoid this tradeoff: they simultaneously improve both performance and flexibility. Flexibility is particularly important in thread systems since there are many parallel programming models, each of which may require specialized support within the thread system. With kernel threads, supporting multiple parallel programming models may require modifying the kernel, which increases complexity, overhead, and the likelihood of errors in the kernel.
2.2 Sources of Poor Integration in User-Level Threads Built on the Traditional Kernel Interface
Unfortunately, it has proven difficult to implement user-level threads that have the same level of integration with system services as is available with kernel threads. This is not inherent in managing parallelism at the user level, but rather is a consequence of the lack of kernel support in existing systems. Kernel threads are the wrong abstraction for supporting user-level thread management. There are two related characteristics of kernel threads that cause difficulty:
— Kernel threads block, resume, and are preempted without notification to the user level.
— Kernel threads are scheduled obliviously with respect to the user-level thread state.
These can cause problems even on a uniprogrammed system. A user-level thread system will often create as many kernel threads to serve as “virtual processors” as there are physical processors in the system; each will be used to run user-level threads. When a user-level thread makes a blocking I/O request or takes a page fault, though, the kernel thread serving as its virtual processor also blocks. As a result, the physical processor is lost to the address space while the I/O is pending, because there is no kernel thread to run other user-level threads on the just-idled processor.
A plausible solution to this might be to create more kernel threads than physical processors; when one kernel thread blocks because its user-level thread blocks in the kernel, another kernel thread is available to run user-level threads on that processor. However, a difficulty occurs when the I/O completes or the page fault returns: there will be more runnable kernel threads than processors, each kernel thread in the middle of running a user-level thread. In deciding which kernel threads are to be assigned processors, the operating system will implicitly choose which user-level threads are assigned processors.
In a traditional system, when there are more runnable threads than processors, the operating system could employ some kind of time-slicing to ensure each thread makes progress. When user-level threads are running on top of kernel threads, however, time-slicing can lead to problems. For example, a kernel thread could be preempted while its user-level thread is holding a spin-lock; any user-level threads accessing the lock will then spin-wait until the lock holder is rescheduled. Zahorjan et al. [28] have shown that time-slicing in the presence of spin-locks can result in poor performance. As another example, a kernel thread running a user-level thread could be preempted to allow another kernel thread to run.
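The spin-lock pathology is easy to see in code. The sketch below is a minimal test-and-set lock in C11 atomics, not code from any of the cited systems; the spin loop marks the point at which an untimely preemption of the lock holder wastes every other processor that touches the lock.

    /* Minimal illustrative spin lock.  If the kernel preempts the holder's
     * kernel thread between spin_lock() and spin_unlock(), every other
     * user-level thread that calls spin_lock() burns its processor until
     * the holder is rescheduled -- the effect analyzed in [28]. */
    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;
    } spinlock_t;

    void spin_init(spinlock_t *l) {
        atomic_flag_clear(&l->held);       /* start unlocked */
    }

    void spin_lock(spinlock_t *l) {
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;                              /* spin-wait */
    }

    void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }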
