
Implementing Global Memory Management in a Workstation Cluster
Michael J. Feeley, William E. Morgan†, Frederic H. Pighin, Anna R. Karlin, and Henry M. Levy
Department of Computer Science and Engineering
University of Washington
and Chandramohan A. Thekkath


DEC Systems Research Center
Abstract

Advances in network and processor technology have greatly changed the communication and computational power of local-area workstation clusters. However, operating systems still treat workstation clusters as a collection of loosely-connected processors, where each workstation acts as an autonomous and independent agent. This operating system structure makes it difficult to exploit the characteristics of current clusters, such as low-latency communication, huge primary memories, and high-speed processors, in order to improve the performance of cluster applications.
This paper describes the design and implementation of global memory management in a workstation cluster. Our objective is to use a single, unified, but distributed memory management algorithm at the lowest level of the operating system. By managing memory globally at this level, all system- and higher-level software, including VM, file systems, transaction systems, and user applications, can benefit from available cluster memory. We have implemented our algorithm in the OSF/1 operating system running on an ATM-connected cluster of DEC Alpha workstations. Our measurements show that on a suite of memory-intensive programs, our system improves performance by a factor of 1.5 to 3.5. We also show that our algorithm has a performance advantage over others that have been proposed in the past.
1 Introduction
This paper examines global memory management in a workstation cluster. By a cluster, we mean a high-speed local-area network with 100 or so high-performance machines operating within a single administrative domain. Our premise is that a single, unified memory management algorithm can be used at a low level of the operating system to manage memory cluster-wide. In contrast, each operating system in today's clusters acts as an autonomous agent, exporting services to other nodes, but not acting in a coordinated way. Such autonomy has advantages, but results in an underutilization of resources that could be used to improve performance. For example, global memory management allows the operating system to use cluster-wide memory to avoid many disk accesses; this becomes more important with the widely growing disparity between processor speed and disk speed. We believe that as processor performance increases and communication latency decreases, workstation or personal computer clusters should be managed more as a multicomputer than as a collection of independent machines.

†Author's current address: DECwest Engineering, Bellevue, WA.

This work was supported in part by the National Science Foundation (Grants no. CDA-9123308, CCR-9200832, and GER-9450075), ARPA Carnegie Mellon University Subcontract #38 1375-50196, the Washington Technology Center, and Digital Equipment Corporation. M. Feeley was supported in part by a fellowship from Intel Corporation. W. Morgan was supported in part by Digital Equipment Corporation.

Permission to make digital or hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

SIGOPS '95 12/95 CO, USA
© 1995 ACM 0-89791-715-4/95/0012...$3.50
We have defined a global memory management algorithm and implemented it in the OSF/1 operating system, running on a collection of DEC Alpha workstations connected by a DEC AN2 ATM network [1]. By inserting a global memory management algorithm at the lowest OS level, our system integrates, in a natural way, all cluster memory for use by all higher-level functions, including VM paging, mapped files, and file system buffering. Our system can automatically reconfigure to allow machines to join and, in most cases, to depart the cluster at any time. In particular, with our algorithm and implementation, no globally-managed data is lost when a cluster node crashes.
Using our system, which we call GMS (for Global Memory Service), we have conducted experiments on clusters of up to 20 machines using a suite of real-world application programs. Our results show that the basic costs for global memory management operations are modest and that application performance improvement can be significant. For example, we show a 1.5- to 3.5-fold speedup for a collection of memory-intensive applications running with GMS; these speedups are close to optimal for these applications, given the relative speeds of remote memory and disk.
The paper is organized as follows. Section 2 compares our work to earlier systems. In Section 3 we describe our algorithm for global memory management. Section 4 details our OSF/1 implementation. We present performance measurements of the implementation in Section 5. Section 6 discusses limitations of our algorithm and implementation, and possible solutions to those limitations. Finally, we summarize and conclude in Section 7.
2 Comparison With Previous Work
Several previous studies have examined various ways of using remote memory. Strictly theoretical results related to this problem include [2, 3, 7, 24]. Leach et al. describe remote paging in the context of the Apollo DOMAIN System [15]. Each machine in the network has a paging server that accepts paging requests from remote nodes. This system allowed local users to statically restrict the amount of physical memory available to the paging server.
Comer and Griffioen described a remote memory model in which the cluster contains workstations, disk servers, and remote memory servers [8]. The remote memory servers were dedicated machines whose large primary memories could be allocated by workstations with heavy paging activity. No client-to-client resource sharing occurred, except through the servers. Felten and Zahorjan generalized this idea to use memory on idle client machines as paging backing store [12]. When a machine becomes idle, its kernel activates an otherwise dormant memory server, which registers itself for remote use. Whenever a kernel replaces a VM page, it queries a central registry to locate active memory servers, picking one at random to receive the replacement victim. Felten and Zahorjan used a simple queueing model to predict performance.
In a different environment, Schilit and Duchamp have used remote paging to enhance the performance of mobile computers [18]. Their goal is to permit small memory-starved portable computers to page to the memories of larger servers nearby; pages could migrate from server to server as the portables migrate.
Franklin et al. examine the use of remote memory in a client-server DBMS system [13]. Their system assumes a centralized database server that contains the disks for stable store plus a large memory cache. Clients interact with each other via a central server. On a page read request, if the page is not cached in the server's memory, the server checks whether another client has that page cached; if so, the server asks that client to forward its copy to the workstation requesting the read. Franklin et al. evaluate several variants of this algorithm using a synthetic database workload.
Dahlin et al. evaluate several algorithms for utilizing remote memory, the best of which is called N-chance forwarding [10]. Using N-chance forwarding, when a node is about to replace a page, it checks whether that page is the last copy in the cluster (a "singlet"); if so, the node forwards that page to a randomly-picked node; otherwise it discards the page. Each page sent to remote memory has a circulation count, N, and the page is discarded after it has been forwarded to N nodes. When a node receives a remote page, that page is made the youngest on its LRU list, possibly displacing another page on that node; if possible, a duplicate page or recirculating page is chosen for replacement. Dahlin et al. compare algorithms using a simulator running one two-day trace of a Sprite workload; their analysis examines file system data pages only (i.e., no VM paging activity and no program executables).
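The N-chance eviction decision just summarized can be sketched in a few lines. This is our own reconstruction of the scheme as described here, not code from Dahlin et al.; the function name, parameters, and default circulation count are invented for illustration:

```python
import random

def evict_n_chance(page, is_singlet, recirculation_count, peers, n=2):
    """Sketch of an N-chance forwarding eviction decision (illustrative only).
    Returns ("forward", target_node) or ("discard", None)."""
    if not is_singlet:
        return ("discard", None)          # duplicates elsewhere: safe to drop
    if recirculation_count >= n:
        return ("discard", None)          # singlet has used up its N hops
    return ("forward", random.choice(peers))  # push singlet to a random peer
```

A forwarded page would carry `recirculation_count + 1`, so a singlet circulates at most `n` times before being dropped.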
Our work is related to these previous studies, but also differs in significant ways. First, our algorithm is integrated with the lowest level of the system and encompasses all memory activity: VM paging, mapped files, and explicit file access. Second, in previous systems, even where client-to-client sharing occurs, each node acts as an autonomous agent. In contrast, we manage memory globally, attempting to make good choices both for the faulting node and the cluster as a whole (we provide a more detailed comparison of the global vs. autonomous scheme following the presentation of our algorithm in the next section). Third, our system can gracefully handle addition and deletion of nodes in the cluster without user intervention. Finally, we have an implementation that is well integrated into a production operating system: OSF/1.
Several other efforts, while not dealing directly with remote paging, relate to our work. Most fundamental is the work of Li and Hudak, who describe a number of alternative strategies for managing pages in a distributed shared virtual memory system [16].
Similar management issues exist at the software level in single address space systems such as Opal [6], and at the hardware level in NUMA and COMA architectures [9, 21]. Eager et al. [11] describe strategies for choosing target nodes on which to offload tasks in a distributed load sharing environment.
3 Algorithm
This section describes the basic algorithm used by GMS. The description is divided into two parts. First, we present a high-level description of the global replacement algorithm. Second, we describe the probabilistic process by which page information is maintained and exchanged in the cluster.
3.1 The Basic Algorithm
As previously stated, our goal is to globally coordinate memory management. We assume that nodes trust each other but may crash at any time. All nodes run the same algorithm and attempt to make choices that are good in a global cluster sense, as well as for the local node. We classify pages on a node P as being either local pages, which have been recently accessed on P, or global pages, which are stored in P's memory on behalf of other nodes. Pages may also be private or shared; shared pages occur because two or more nodes might access a common file exported by a file server. Thus, a shared page may be found in the active local memories of multiple nodes; however, a page in global memory is always private.
In general, the algorithm changes the local/global memory balance as the result of faults caused by an access to a nonresident page. Node P, on a fault, performs the following global replacement algorithm, which we describe in terms of four possible cases:
Case 1: The faulted page is in the global memory of another node, Q. We swap the desired page in Q's global memory with any global page in P's global memory. Once brought into P's memory, the faulted page becomes a local page, increasing the size of P's local memory by 1. Q's local/global balance is unchanged. This is depicted in Figure 1.
Case 2: The faulted page is in the global memory of node Q, but P’s memory contains only local pages. Exchange the LRU local page on P with the faulted page on Q. The size of the global memory on Q and the local memory on P are unchanged.
Case 3: The page is on disk. Read the faulted page into node P’s memory, where it becomes a local page. Choose the oldest page in the cluster (say, on node Q) for replacement and write it to disk if necessary. Send a global page on node P to node Q where it continues as a global page. If P has no global pages, choose P’s LRU local page instead. This is shown in Figure 2.
Case 4: The faulted page is a shared page in the local memory of another node, Q. Copy that page into a frame on node P, leaving the original in local memory on Q. Choose the oldest page in the cluster (say, on node R) for replacement and write it to disk if necessary. Send a global page on node P to node R where it becomes a global page (if P has no global pages, choose P's LRU local page).
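To make the case analysis concrete, the dispatch among these four cases can be sketched as a single lookup routine. This is our own illustrative model, not code from the GMS implementation: the cluster representation and function name are invented, and we assume the faulted page is not already resident on the faulting node (otherwise there would be no fault):

```python
def classify_fault(page, faulting_node, nodes):
    """Return which of the four global-replacement cases applies.
    `nodes` maps node name -> {"local": {page: age}, "global": {page: age}}.
    Illustrative sketch only; a page in global memory is always private,
    so any remote local-memory hit is treated as a shared page (Case 4)."""
    me = nodes[faulting_node]
    for name, mem in nodes.items():
        if name == faulting_node:
            continue
        if page in mem["global"]:
            # Hit in another node's global memory: Case 1 if the faulting
            # node has global pages to trade, Case 2 if it has only local pages.
            return "case1" if me["global"] else "case2"
        if page in mem["local"]:
            return "case4"    # shared page in another node's local memory
    return "case3"            # not resident anywhere in the cluster: disk read
```

In Cases 1 and 2 the cluster's local/global balance shifts only on the faulting node; in Cases 3 and 4 the oldest page in the cluster is pushed toward disk, which is where the age information of Section 3.2 comes in.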
The behavior of this algorithm is fairly straightforward. Over time, nodes that are actively computing and using memory will fill their memories with local pages and will begin using remote memory in the cluster; nodes that have been idle for some time and whose pages are old will begin to fill their memories with global pages. The balance between local and global storage on a node is thus dynamic and depends on its workload and the workload in the cluster. The basic issue is when to change the amount of global and local storage, both on a node and in the cluster overall. In general, on a fault requiring a disk read, the (active) faulting node grows its local memory, while the cluster node with the oldest page (an "idle" node) loses a page to disk. Global memory grows when the faulting node has no global pages and the oldest page in the network is a local page (i.e., the oldest local page on the faulting node becomes a global page, replacing the oldest cluster page).

Ultimately, our goal is to minimize the total cost of all memory references within the cluster. The cost of a memory reference depends on the state of the referenced page: in local memory, in global memory on another node, or on disk. A local hit is over three orders of magnitude faster than a global memory or disk access, while a global memory hit is only two to ten times faster than a disk access. Therefore, in making replacement decisions, we might choose to replace a global page before a local page of the same age, because the cost of mistakenly replacing a local page is substantially higher. Which decision is better depends on future behavior. To predict future behavior, a cost function is associated with each page. This cost function is related to LRU, but is based on both the age of the page and its state. Our current implementation boosts the ages of global pages to favor their replacement over local pages of approximately the same age.

Figure 1: Global replacement with hit in the global cache.

Figure 2: Global replacement showing miss in the global cache. The faulted page is read from disk, and the oldest page in the network is either discarded (if clean) or written back to disk.

3.2 Managing Global Age Information

When a faulted page is read from disk (Cases 3 and 4), our algorithm discards the oldest page in the cluster. As described so far, we assume full global information about the state of nodes and their pages in order to locate this oldest page. However, it is obviously impossible to maintain complete global age information at every instant; therefore, we use a variant in which each node has only approximate information about global pages. The objective of our algorithm is to provide a reasonable tradeoff between the accuracy of information that is available to nodes and the efficiency of distributing that information. The key issue is guaranteeing the validity of the age information and deciding when it must be updated.

Our algorithm divides time into epochs. Each epoch has a maximum duration, T, and a maximum number of cluster replacements, M, that will be allowed in that epoch. The values of T and M vary from epoch to epoch, depending on the state of global memory and the workload. A new epoch is triggered when either (1) the duration of the epoch, T, has elapsed, (2) M global pages have been replaced, or (3) the age information is detected to be inaccurate. Currently, each epoch is on the order of 5–10 seconds.

Our system maintains age information on every node for both local and global pages. At the start of each epoch, every node sends a summary of the ages of its local and global pages to a designated initiator node. Using this information, the initiator computes a weight, w_i, for each node i, such that out of the M oldest pages in the network, w_i reside in node i's memory at the beginning of the epoch. The initiator also determines the minimum age, MinAge, that will be replaced from the cluster (i.e., sent to disk or discarded) in the new epoch. The initiator sends the weights w_i and the value MinAge to all nodes in the cluster. In addition, the initiator selects the node with the most idle pages (the largest w_i) to be the initiator for the following epoch.

During an epoch, when a node P must evict a page from its memory to fault in a page from disk (Cases 3 and 4), it first checks if the age of the evicted page is older than MinAge. If so, it simply discards the page (since this page is expected to be discarded sometime during this epoch). If not, P sends the page to node i, where the probability of choosing node i is proportional to w_i. In this case, the page discarded from P becomes a global page on node i, and the oldest page on i is discarded.

Our algorithm is probabilistic: on average, during an epoch, the ith node receives w_i/M of the evictions in that epoch, replacing its oldest page for each one. This yields two useful properties. First, our algorithm approximates LRU in the sense that if M pages are discarded by global replacement during the epoch, they are the globally oldest M pages in the cluster. Second, it yields a simple way to determine statistically when M pages have been replaced; i.e., when the node with the largest w_i receives w_i pages, it declares an end to the epoch.

To reduce the divergence from strict LRU, it is thus important to keep the duration of the epoch T and the value of M appropriate for the current behavior of the system. The decision procedure for choosing these values considers (1) the distribution of global page ages, (2) the expected rate at which pages will be discarded from the cluster, and (3) the rate at which the distributed age information is expected to become inaccurate.† The latter two rates are estimated from their values in preceding epochs. Roughly speaking, the more old pages there are in the network, the longer T should be (and the larger M and MinAge are); similarly, if the expected discard rate is low, T can be larger as well. When the number of old pages in the network is too small, indicating that all nodes are actively using their memory, MinAge is set to 0, so that pages are always discarded or written to disk rather than forwarded.

†The age distribution on a node changes when its global pages are consumed due to an increase in its local cache size.
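The per-epoch mechanism described in this section can be sketched as follows. This is a simplified reconstruction, not the OSF/1 implementation: age summaries are modeled as plain lists of integer ages (larger meaning older), and the function names are our own. The first function plays the initiator's role (computing the weights w_i and MinAge from the summaries); the second models a node's eviction-time decision:

```python
import random

def start_epoch(age_summaries, m):
    """Initiator's computation at the start of an epoch (sketch).
    `age_summaries` maps node -> list of page ages (larger = older).
    Returns (weights, min_age): weights[i] counts how many of the
    cluster's M oldest pages node i holds; min_age is the youngest
    age among those M pages, i.e. the MinAge threshold for the epoch."""
    all_ages = sorted(
        ((age, node) for node, ages in age_summaries.items() for age in ages),
        reverse=True)                       # oldest pages first
    oldest = all_ages[:m]
    min_age = oldest[-1][0] if oldest else 0
    weights = {node: 0 for node in age_summaries}
    for _, node in oldest:
        weights[node] += 1
    return weights, min_age

def eviction_target(page_age, weights, min_age):
    """A node's decision when evicting a page mid-epoch (sketch).
    Pages at least as old as MinAge are discarded (or written to disk);
    younger pages are forwarded to node i with probability ~ w_i."""
    if page_age >= min_age:
        return None                         # expected to die this epoch anyway
    candidates = [n for n, w in weights.items() if w > 0]
    if not candidates:
        return None                         # no idle memory in the cluster
    return random.choices(candidates,
                          weights=[weights[n] for n in candidates])[0]
```

Because forwarding targets are drawn with probability proportional to w_i, each node receives on average w_i/M of the epoch's evictions, which is what lets the node with the largest weight declare the end of the epoch once it has absorbed w_i pages.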