Xen and the Art of Virtualization
Paul Barham∗, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer†, Ian Pratt, Andrew Warfield
University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge, UK, CB3 0FD
ABSTRACT
Numerous systems have been designed which use virtualization to subdivide the ample resources of a modern computer. Some require specialized hardware, or cannot support commodity operating systems. Some target 100% binary compatibility at the expense of performance. Others sacrifice security or functionality for speed. Few offer resource isolation or performance guarantees; most provide only best-effort provisioning, risking denial of service.
This paper presents Xen, an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality. This is achieved by providing an idealized virtual machine abstraction to which operating systems such as Linux, BSD and Windows XP can be ported with minimal effort.
Our design is targeted at hosting up to 100 virtual machine instances simultaneously on a modern server. The virtualization approach taken by Xen is extremely efficient: we allow operating systems such as Linux and Windows XP to be hosted simultaneously for a negligible performance overhead — at most a few percent compared with the unvirtualized case. We considerably outperform competing commercial and freely available solutions in a range of microbenchmarks and system-wide tests.
Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management; D.4.2 [Operating Systems]: Storage Management; D.4.8 [Operating Systems]: Performance
General Terms
Design, Measurement, Performance
Keywords
Virtual Machine Monitors, Hypervisors, Paravirtualization
∗Microsoft Research Cambridge, UK
†Intel Research Cambridge, UK
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA. Copyright 2003 ACM 1-58113-757-5/03/0010 …$5.00.
1. INTRODUCTION
Modern computers are sufficiently powerful to use virtualization to present the illusion of many smaller virtual machines (VMs), each running a separate operating system instance. This has led to a resurgence of interest in VM technology. In this paper we present Xen, a high performance resource-managed virtual machine monitor (VMM) which enables applications such as server consolidation [42, 8], co-located hosting facilities [14], distributed web services [43], secure computing platforms [12, 16] and application mobility [26, 37].
Successful partitioning of a machine to support the concurrent execution of multiple operating systems poses several challenges. Firstly, virtual machines must be isolated from one another: it is not acceptable for the execution of one to adversely affect the performance of another. This is particularly true when virtual machines are owned by mutually untrusting users. Secondly, it is necessary to support a variety of different operating systems to accommodate the heterogeneity of popular applications. Thirdly, the performance overhead introduced by virtualization should be small.
Xen hosts commodity operating systems, albeit with some source modifications. The prototype described and evaluated in this paper can support multiple concurrent instances of our XenoLinux guest operating system; each instance exports an application binary interface identical to a non-virtualized Linux 2.4. Our port of Windows XP to Xen is not yet complete but is capable of running simple user-space processes. Work is also progressing in porting NetBSD.
Xen enables users to dynamically instantiate an operating system to execute whatever they desire. In the XenoServer project [15, 35] we are deploying Xen on standard server hardware at economically strategic locations within ISPs or at Internet exchanges. We perform admission control when starting new virtual machines and expect each VM to pay in some fashion for the resources it requires. We discuss our ideas and approach in this direction elsewhere [21]; this paper focuses on the VMM.
There are a number of ways to build a system to host multiple applications and servers on a shared machine. Perhaps the simplest is to deploy one or more hosts running a standard operating system such as Linux or Windows, and then to allow users to install files and start processes — protection between applications being provided by conventional OS techniques. Experience shows that system administration can quickly become a time-consuming task due to complex configuration interactions between supposedly disjoint applications.
More importantly, such systems do not adequately support performance isolation; the scheduling priority, memory demand, network traffic and disk accesses of one process impact the performance of others. This may be acceptable when there is adequate provisioning and a closed user group (such as in the case of computational grids, or the experimental PlanetLab platform [33]), but not when resources are oversubscribed, or users uncooperative.
One way to address this problem is to retrofit support for performance isolation to the operating system. This has been demonstrated to a greater or lesser degree with resource containers [3], Linux/RK [32], QLinux [40] and SILK [4]. One difficulty with such approaches is ensuring that all resource usage is accounted to the correct process — consider, for example, the complex interactions between applications due to buffer cache or page replacement algorithms. This is effectively the problem of “QoS crosstalk” [41] within the operating system. Performing multiplexing at a low level can mitigate this problem, as demonstrated by the Exokernel [23] and Nemesis [27] operating systems. Unintentional or undesired interactions between tasks are minimized.
We use this same basic approach to build Xen, which multiplexes physical resources at the granularity of an entire operating system and is able to provide performance isolation between them. In contrast to process-level multiplexing this also allows a range of guest operating systems to gracefully coexist rather than mandating a specific application binary interface. There is a price to pay for this flexibility — running a full OS is more heavyweight than running a process, both in terms of initialization (e.g. booting or resuming versus fork and exec), and in terms of resource consumption.
For our target of up to 100 hosted OS instances, we believe this price is worth paying; it allows individual users to run unmodified binaries, or collections of binaries, in a resource controlled fashion (for instance an Apache server along with a PostgreSQL backend). Furthermore it provides an extremely high level of flexibility since the user can dynamically create the precise execution environment their software requires. Unfortunate configuration interactions between various services and applications are avoided (for example, each Windows instance maintains its own registry).
The remainder of this paper is structured as follows: in Section 2 we explain our approach towards virtualization and outline how Xen works. Section 3 describes key aspects of our design and implementation. Section 4 uses industry standard benchmarks to evaluate the performance of XenoLinux running above Xen in comparison with stand-alone Linux, VMware Workstation and User-mode Linux (UML). Section 5 reviews related work, and finally Section 6 discusses future work and concludes.
2. XEN: APPROACH & OVERVIEW
In a traditional VMM the virtual hardware exposed is functionally identical to the underlying machine [38]. Although full virtualization has the obvious benefit of allowing unmodified operating systems to be hosted, it also has a number of drawbacks. This is particularly true for the prevalent IA-32, or x86, architecture.
Support for full virtualization was never part of the x86 architectural design. Certain supervisor instructions must be handled by the VMM for correct virtualization, but executing these with insufficient privilege fails silently rather than causing a convenient trap [36]. Efficiently virtualizing the x86 MMU is also difficult. These problems can be solved, but only at the cost of increased complexity and reduced performance. VMware’s ESX Server [10] dynamically rewrites portions of the hosted machine code to insert traps wherever VMM intervention might be required. This translation is applied to the entire guest OS kernel (with associated translation, execution, and caching costs) since all non-trapping privileged instructions must be caught and handled. ESX Server implements shadow versions of system structures such as page tables and maintains consistency with the virtual tables by trapping every update attempt — this approach has a high cost for update-intensive operations such as creating a new application process.
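To make the silent-failure problem concrete, consider the following user-mode fragment. This is a minimal sketch, not Xen code, and it assumes a 32-bit build on an older x86 CPU without the UMIP feature: the SGDT instruction is not privileged, so it executes at user level without trapping, and thus reveals the real descriptor-table base to a guest even when a VMM is in control.

    /* Minimal illustration of a sensitive-but-unprivileged x86
     * instruction (assumes 32-bit x86 without UMIP; not Xen code).
     * SGDT succeeds at user privilege without faulting, so a
     * trap-and-emulate VMM never gets the chance to intervene. */
    #include <stdint.h>
    #include <stdio.h>

    struct __attribute__((packed)) gdt_register {
        uint16_t limit;
        uint32_t base;
    };

    int main(void)
    {
        struct gdt_register gdtr;
        /* Executes silently at CPL 3: no #GP fault, hence no VMM trap. */
        __asm__ volatile ("sgdt %0" : "=m" (gdtr));
        printf("GDT base=0x%08x limit=0x%04x\n", gdtr.base, gdtr.limit);
        return 0;
    }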
Notwithstanding the intricacies of the x86, there are other arguments against full virtualization. In particular, there are situations in which it is desirable for the hosted operating systems to see real as well as virtual resources: providing both real and virtual time allows a guest OS to better support time-sensitive tasks, and to correctly handle TCP timeouts and RTT estimates, while exposing real machine addresses allows a guest OS to improve performance by using superpages [30] or page coloring [24].
We avoid the drawbacks of full virtualization by presenting a virtual machine abstraction that is similar but not identical to the underlying hardware — an approach which has been dubbed paravirtualization [43]. This promises improved performance, although it does require modifications to the guest operating system. It is important to note, however, that we do not require changes to the application binary interface (ABI), and hence no modifications are required to guest applications.
We distill the discussion so far into a set of design principles:
1. Support for unmodified application binaries is essential, or users will not transition to Xen. Hence we must virtualize all architectural features required by existing standard ABIs.
2. Supporting full multi-application operating systems is important, as this allows complex server configurations to be virtualized within a single guest OS instance.
3. Paravirtualization is necessary to obtain high performance and strong resource isolation on uncooperative machine architectures such as x86.
4. Even on cooperative machine architectures, completely hiding the effects of resource virtualization from guest OSes risks both correctness and performance.
Note that our paravirtualized x86 abstraction is quite different from that proposed by the recent Denali project [44]. Denali is designed to support thousands of virtual machines running network services, the vast majority of which are small-scale and unpopular. In contrast, Xen is intended to scale to approximately 100 virtual machines running industry standard applications and services. Given these very different goals, it is instructive to contrast Denali’s design choices with our own principles.
Firstly, Denali does not target existing ABIs, and so can elide certain architectural features from their VM interface. For example, Denali does not fully support x86 segmentation although it is exported (and widely used1) in the ABIs of NetBSD, Linux, and Windows XP.
Secondly, the Denali implementation does not address the problem of supporting application multiplexing, nor multiple address spaces, within a single guest OS. Rather, applications are linked explicitly against an instance of the Ilwaco guest OS in a manner rather reminiscent of a libOS in the Exokernel [23]. Hence each virtual machine essentially hosts a single-user single-application unprotected “operating system”. In Xen, by contrast, a single virtual machine hosts a real operating system which may itself securely multiplex thousands of unmodified user-level processes. Although a prototype virtual MMU has been developed which may help Denali in this area [44], we are unaware of any published technical details or evaluation.
Thirdly, in the Denali architecture the VMM performs all paging to and from disk. This is perhaps related to the lack of memory-management support at the virtualization layer. Paging within the VMM is contrary to our goal of performance isolation: malicious virtual machines can encourage thrashing behaviour, unfairly depriving others of CPU time and disk bandwidth. In Xen we expect each guest OS to perform its own paging using its own guaranteed memory reservation and disk allocation (an idea previously exploited by self-paging [20]).
Finally, Denali virtualizes the ‘namespaces’ of all machine resources, taking the view that no VM can access the resource allocations of another VM if it cannot name them (for example, VMs have no knowledge of hardware addresses, only the virtual addresses created for them by Denali). In contrast, we believe that secure access control within the hypervisor is sufficient to ensure protection; furthermore, as discussed previously, there are strong correctness and performance arguments for making physical resources directly visible to guest OSes.
1For example, segments are frequently used by thread libraries to address thread-local data.
Memory Management
  Segmentation         Cannot install fully-privileged segment descriptors and cannot overlap with the top end of the linear address space.
  Paging               Guest OS has direct read access to hardware page tables, but updates are batched and validated by the hypervisor. A domain may be allocated discontiguous machine pages.
CPU
  Protection           Guest OS must run at a lower privilege level than Xen.
  Exceptions           Guest OS must register a descriptor table for exception handlers with Xen. Aside from page faults, the handlers remain the same.
  System Calls         Guest OS may install a ‘fast’ handler for system calls, allowing direct calls from an application into its guest OS and avoiding indirecting through Xen on every call.
  Interrupts           Hardware interrupts are replaced with a lightweight event system.
  Time                 Each guest OS has a timer interface and is aware of both ‘real’ and ‘virtual’ time.
Device I/O
  Network, Disk, etc.  Virtual devices are elegant and simple to access. Data is transferred using asynchronous I/O rings. An event mechanism replaces hardware interrupts for notifications.
Table 1: The paravirtualized x86 interface.
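Table 1 mentions asynchronous I/O rings for device data transfer. The following is a schematic sketch of that idea under our own naming, not Xen's exact shared-ring layout: the guest advances a producer index to publish requests, a backend driver consumes them and posts responses, and an event channel (not shown) replaces a hardware interrupt for notification.

    /* Schematic producer/consumer I/O ring (illustrative layout). */
    #include <stdint.h>

    #define RING_SIZE 64            /* power of two: index wraps cheaply */

    struct request  { uint64_t id; uint64_t buffer_ma; uint32_t op; };
    struct response { uint64_t id; int32_t status; };

    struct io_ring {
        volatile uint32_t req_prod;  /* advanced by guest   */
        volatile uint32_t req_cons;  /* advanced by backend */
        volatile uint32_t rsp_prod;  /* advanced by backend */
        volatile uint32_t rsp_cons;  /* advanced by guest   */
        struct request  req[RING_SIZE];
        struct response rsp[RING_SIZE];
    };

    /* Guest side: enqueue a request if there is space. */
    static int ring_put_request(struct io_ring *r, const struct request *q)
    {
        if (r->req_prod - r->req_cons == RING_SIZE)
            return -1;                          /* ring full */
        r->req[r->req_prod % RING_SIZE] = *q;
        __sync_synchronize();                   /* publish data before index bump */
        r->req_prod++;
        return 0;
    }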
In the following section we describe the virtual machine abstraction exported by Xen and discuss how a guest OS must be modified to conform to this. Note that in this paper we reserve the term guest operating system to refer to one of the OSes that Xen can host and we use the term domain to refer to a running virtual machine within which a guest OS executes; the distinction is analogous to that between a program and a process in a conventional system. We call Xen itself the hypervisor since it operates at a higher privilege level than the supervisor code of the guest operating systems that it hosts.
2.1 The Virtual Machine Interface
Table 1 presents an overview of the paravirtualized x86 interface, factored into three broad aspects of the system: memory management, the CPU, and device I/O. In the following we address each machine subsystem in turn, and discuss how each is presented in our paravirtualized architecture. Note that although certain parts of our implementation, such as memory management, are specific to the x86, many aspects (such as our virtual CPU and I/O devices) can be readily applied to other machine architectures. Furthermore, x86 represents a worst case in the areas where it differs significantly from RISC-style processors — for example, efficiently virtualizing hardware page tables is more difficult than virtualizing a software-managed TLB.
2.1.1 Memory management
Virtualizing memory is undoubtedly the most difficult part of paravirtualizing an architecture, both in terms of the mechanisms required in the hypervisor and modifications required to port each guest OS. The task is easier if the architecture provides a software-managed TLB as these can be efficiently virtualized in a simple manner [13]. A tagged TLB is another useful feature supported by most server-class RISC architectures, including Alpha, MIPS and SPARC. Associating an address-space identifier tag with each TLB entry allows the hypervisor and each guest OS to efficiently coexist in separate address spaces because there is no need to flush the entire TLB when transferring execution.
Unfortunately, x86 does not have a software-managed TLB; instead TLB misses are serviced automatically by the processor by walking the page table structure in hardware. Thus to achieve the best possible performance, all valid page translations for the current address space should be present in the hardware-accessible page table. Moreover, because the TLB is not tagged, address space switches typically require a complete TLB flush. Given these limitations, we made two decisions: (i) guest OSes are responsible for allocating and managing the hardware page tables, with minimal involvement from Xen to ensure safety and isolation; and (ii) Xen exists in a 64MB section at the top of every address space, thus avoiding a TLB flush when entering and leaving the hypervisor.
Each time a guest OS requires a new page table, perhaps because a new process is being created, it allocates and initializes a page from its own memory reservation and registers it with Xen. At this point the OS must relinquish direct write privileges to the page-table memory: all subsequent updates must be validated by Xen. This restricts updates in a number of ways, including only allowing an OS to map pages that it owns, and disallowing writable mappings of page tables. Guest OSes may batch update requests to amortize the overhead of entering the hypervisor. The top 64MB region of each address space, which is reserved for Xen, is not accessible or remappable by guest OSes. This address region is not used by any of the common x86 ABIs however, so this restriction does not break application compatibility.
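The following sketch shows how such batching might look from the guest's side. The request layout and hypercall name are modeled on Xen's public mmu_update interface, but the code is illustrative (the hypercall is stubbed out here), not a definitive reference for the ABI.

    /* Sketch of batched page-table updates: the guest queues PTE
     * writes and hands them to the hypervisor for validation in one
     * batch, amortizing the cost of entering Xen. */
    #include <stdint.h>
    #include <stdio.h>

    struct mmu_update {
        uint64_t ptr;   /* machine address of the PTE to modify */
        uint64_t val;   /* new PTE contents                      */
    };

    /* Stand-in for the real hypercall: Xen would validate each
     * request (page ownership, no writable mappings of page tables)
     * before applying it. */
    static int HYPERVISOR_mmu_update(struct mmu_update *req, unsigned int count)
    {
        printf("hypervisor: validating and applying %u updates\n", count);
        return 0;
    }

    #define QUEUE_LEN 128
    static struct mmu_update queue[QUEUE_LEN];
    static unsigned int queued;

    /* Flush any pending updates to the hypervisor in one trap. */
    static void flush_page_updates(void)
    {
        if (queued) {
            HYPERVISOR_mmu_update(queue, queued);
            queued = 0;
        }
    }

    /* Queue one PTE write instead of storing to the page table
     * directly (the guest no longer has write access to it). */
    static void queue_pte_update(uint64_t pte_machine_addr, uint64_t new_val)
    {
        queue[queued].ptr = pte_machine_addr;
        queue[queued].val = new_val;
        if (++queued == QUEUE_LEN)
            flush_page_updates();
    }

    int main(void)
    {
        /* Hypothetical addresses, for illustration only. */
        queue_pte_update(0x00345000, 0x00567027);
        flush_page_updates();
        return 0;
    }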
Segmentation is virtualized in a similar way, by validating updates to hardware segment descriptor tables. The only restrictions on x86 segment descriptors are: (i) they must have lower privilege than Xen, and (ii) they may not allow any access to the Xen-reserved portion of the address space.
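The style of check implied by these two restrictions can be sketched as follows. The constants and helper are hypothetical (they ignore x86 descriptor encoding details such as limit granularity), shown only to make the validation rule concrete.

    /* Sketch: validating a guest-supplied segment descriptor.
     * It must be less privileged than Xen (DPL > 0) and must not
     * reach into the reserved top 64MB of the address space. */
    #include <stdint.h>
    #include <stdbool.h>

    #define XEN_RESERVED_BASE 0xFC000000u   /* 4GB - 64MB */

    static bool descriptor_is_safe(uint32_t base, uint32_t limit_bytes,
                                   unsigned int dpl)
    {
        uint64_t end = (uint64_t)base + limit_bytes;  /* avoid 32-bit wrap */
        if (dpl == 0)                 /* ring 0 is reserved for Xen */
            return false;
        return end <= XEN_RESERVED_BASE;  /* no access to Xen's region */
    }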
2.1.2 CPU
Virtualizing the CPU has several implications for guest OSes. Principally, the insertion of a hypervisor below the operating system violates the usual assumption that the OS is the most privileged entity in the system. In order to protect the hypervisor from OS misbehavior (and domains from one another) guest OSes must be modified to run at a lower privilege level.
Many processor architectures only provide two privilege levels. In these cases the guest OS would share the lower privilege level with applications. The guest OS would then protect itself by running in a separate address space from its applications, and indirectly pass control to and from applications via the hypervisor to set the virtual privilege level and change the current address space. Again, if the processor’s TLB supports address-space tags then expensive TLB flushes can be avoided.
Efficient virtualization of privilege levels is possible on x86 because the architecture provides four distinct privilege levels in hardware, generally described as rings and numbered from zero (most privileged) to three (least privileged). OS code typically executes in ring 0, the only ring that can execute privileged instructions, while application code executes in ring 3; rings 1 and 2 are otherwise unused, so a guest OS can be ported to run in ring 1, where it remains protected from its applications but can no longer execute privileged instructions directly.
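As a small illustration (a sketch, not part of Xen): the current privilege level is held in the low two bits of the %cs selector, so code can observe which ring it is executing in. A guest OS kernel ported to Xen would read 1 here rather than the 0 it would see on bare hardware.

    /* Illustrative helper: read the current x86 privilege level. */
    #include <stdint.h>
    #include <stdio.h>

    static inline unsigned int current_cpl(void)
    {
        uint16_t cs;
        __asm__ ("mov %%cs, %0" : "=r" (cs));
        return cs & 3;   /* CPL is the low two bits of %cs */
    }

    int main(void)
    {
        /* A user-space process prints 3 regardless of virtualization. */
        printf("running at ring %u\n", current_cpl());
        return 0;
    }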