Chapter 7:
Replication Management using the
State Machine Approach
Fred B. Schneider*
Department of Computer Science Cornell University
Ithaca, New York 14853 U.S.A.
This chapter reprints my paper “Implementing Fault-tolerant Services using the State Machine Approach: A Tutorial”, which originally appeared in ACM Computing Surveys 22 (Dec. 1990). The paper has been reformatted, but otherwise remains unchanged.
Most distributed systems employ replicated services in one form or another. By replicating a service, we can support fault-tolerance as well as improve overall throughput by placing server replicas at sites where the service is needed. Protocols for replication management can be divided into two general classes. The first, called “the state machine approach” or “active replication”, has no centralized control. This class is the subject of this chapter. The second class of protocols is called the “primary-backup approach”, and it is discussed in Chapter 8 (???FS et al-primary-backup).
The state machine approach ties together a number of the fundamental problems that have been discussed in previous chapters. Chapter 2 (??FS-models) should be consulted to put in perspective the distributed systems models of this chapter. Chapter 4 (???Ozalp+Keith) discusses the logical clocks used here to order client requests. And, Chapter 5 (???Toueg+Hadzilacos) discusses semantics for various communications primitives that support the state machine approach.
*This material is based on work supported in part by the Office of Naval Research under contract N00014-91-J-1219, the National Science Foundation under Grant No. CCR-8701103, and DARPA/NSF Grant No. CCR-9014363. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.
1. Introduction
Distributed software is often structured in terms of clients and services. Each service comprises one or more servers and exports operations which clients invoke by making requests. Although using a single, centralized, server is the simplest way to implement a service, the resulting service can only be as fault-tolerant as the processor executing that server. If this level of fault tolerance is unacceptable, then multiple servers that fail independently must be employed. Usually, replicas of a single server are executed on separate processors of a distributed system, and protocols are employed to coordinate client interactions with these replicas. The physical and electrical isolation of processors in a distributed system ensures that server failures are independent, as required.
The state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas.1 The approach also provides a framework for understanding and designing replication management protocols. Many protocols that involve replication of data or software—be it for masking failures or simply to facilitate cooperation without centralized control—can be derived using the state machine approach. Although few of these protocols actually were obtained in this manner, viewing them in terms of state machines helps in understanding how and why they work.
This paper is a tutorial on the state machine approach. It describes the approach and its implementation for two representative environments. Small examples suffice to illustrate the points. However, the approach has been successfully applied to larger examples; some of these are mentioned in §9. Section 2 describes how a system can be viewed in terms of a state machine, clients, and output devices. Coping with failures is the subject of §3 through §6. An important class of optimizations—based on the use of time—is discussed in §7. Section 8 describes dynamic reconfiguration. The history of the approach and related work is discussed in §9.
2. State Machines
Services, servers, and most programming language structures for supporting modularity define state machines. A state machine consists of state variables, which encode its state, and commands, which transform its state. Each command is implemented by a deterministic program; execution of the command is atomic with respect to other commands and modifies the state variables and/or produces some output. A client of the state machine makes a request to execute a command. The request names a state machine, names the command to be performed, and contains any information needed by the command. Output from request processing can be to an actuator (e.g. in a process-control system), to some other peripheral device (e.g. a disk or terminal), or to clients awaiting responses from prior requests.
1The term “state machine” is a poor one but, nevertheless, is the one used in the literature.
In this tutorial, we will describe a state machine simply by listing its state variables and commands. As an example, state machine memory of Figure 2.1 implements a time-varying mapping from locations to values. A read command permits a client to determine the value currently associated with a location, and a write command associates a new value with a location.
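To make the notation concrete, the memory state machine can also be rendered as an executable sketch. The Python class below is only illustrative: the paper leaves the communication mechanism abstract, so returning the read value stands in for "send store[loc] to client", and floats stand in for words.

    class Memory:
        """State machine of Figure 2.1: state variable 'store',
        commands 'read' and 'write'."""

        def __init__(self, n):
            # var store : array [0..n] of word
            self.store = [0] * (n + 1)

        def read(self, loc):
            # read: send store[loc] to client (modeled as a return value)
            return self.store[loc]

        def write(self, loc, value):
            # write: store[loc] := value
            self.store[loc] = value

    # Example client interaction: set location 100 to 16.2, read it back.
    memory = Memory(n=1000)
    memory.write(100, 16.2)
    v = memory.read(100)   # v == 16.2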
For generality, our descriptions of state machines deliberately do not specify how command invocation is implemented. Commands might be implemented

• using a collection of procedures that share data and are invoked by a call, as in a monitor,

• using a single process that awaits messages containing requests and performs the actions they specify, as in a server, or

• using a collection of interrupt handlers, in which case a request is made by causing an interrupt, as in an operating system kernel. (Disabling interrupts permits each command to be executed to completion before the next is started.)
For example, the state machine of Figure 2.2 implements commands to ensure that at all times at most one client has been granted access to some resource. In it, x ∘ y denotes the result of appending y to the end of list x, head(x) denotes the first element of list x, and tail(x) denotes the list obtained by deleting the first element of list x. This state machine would probably be implemented as part of the supervisor-call handler of an operating system kernel.
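As a further illustration, here is a Python sketch of the resource allocator of Figure 2.2. Returning the id of the client being granted access models "send OK" purely so the sketch is testable; that choice is an assumption of the sketch, not part of the specification.

    from collections import deque

    class Mutex:
        def __init__(self):
            self.user = None          # user : client_id init Φ
            self.waiting = deque()    # waiting : list of client_id init Φ

        def acquire(self, client):
            if self.user is None:               # user = Φ
                self.user = client
                return client                   # "send OK to client"
            else:                               # user ≠ Φ
                self.waiting.append(client)     # waiting := waiting ∘ client
                return None

        def release(self):
            if not self.waiting:                # waiting = Φ
                self.user = None
                return None
            else:                               # waiting ≠ Φ
                self.user = self.waiting.popleft()  # head and tail in one step
                return self.user                # "send OK to head(waiting)"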
Requests are processed by a state machine one at a time, in an order that is consistent with potential causality. Therefore, clients of a state machine can make the following assumptions about the order in which requests are processed:
memory: state_machine
  var store : array [0..n] of word

  read: command(loc : 0..n)
    send store[loc] to client
  end read;

  write: command(loc : 0..n, value : word)
    store[loc] := value
  end write
end memory
Figure 2.1. A memory
mutex: state_machine
  var user : client_id init Φ;
      waiting : list of client_id init Φ

  acquire: command
    if user = Φ → send OK to client;
                  user := client
    [] user ≠ Φ → waiting := waiting ∘ client
    fi
  end acquire

  release: command
    if waiting = Φ → user := Φ
    [] waiting ≠ Φ → send OK to head(waiting);
                     user := head(waiting);
                     waiting := tail(waiting)
    fi
  end release
end mutex
Figure 2.2. A resource allocator
O1: Requests issued by a single client to a given state machine sm are processed by sm in the order they were issued.
O2: If the fact that request r was made to a state machine sm by client c could have caused a request r′ to be made by a client c′ to sm, then sm processes r before r′.
Note that due to communications network delays, O1 and O2 do not imply that a state machine will process requests in the order made or in the order received.
To keep our presentation independent of the interprocess communication mechanism used to transmit requests to state machines, we will program client requests as tuples of the form

⟨state_machine.command, arguments⟩
and postulate that any results from processing a request are returned using messages. For example, a client might execute
⟨memory.write, 100, 16.2⟩;
⟨memory.read, 100⟩;
receive v from memory
to set the value of location 100 to 16.2, request the value of location 100, and await that value, setting v to it upon receipt.
The defining characteristic of a state machine is not its syntax, but that it specifies a deterministic computation that reads a stream of requests and processes each, occasionally producing output:
Semantic Characterization of a State Machine. Outputs of a state machine are completely determined by the sequence of requests it processes, independent of time and any other activity in a system.
Not all collections of commands necessarily satisfy this characterization. Consider the following program to solve a simple process-control problem in which an actuator is adjusted repeatedly based on the value of a sensor. Periodically, a client reads a sensor, communicates the value read to state machine pc, and delays approximately D seconds:
monitor: process
  do true → val := sensor;
            ⟨pc.adjust, val⟩;
            delay D
  od
end monitor
State machine pc adjusts an actuator based on past adjustments saved in state variable q, the sensor reading, and a control function F.
pc: state_machine
  var q : real;

  adjust: command(sensor_val : real)
    q := F(q, sensor_val);
    send q to actuator
  end adjust
end pc
Although it is tempting to structure pc as a single command that loops—reading from the sensor, evaluating F, and writing to actuator—if the value of the sensor is time-varying, then the result would not satisfy the semantic characterization given above and therefore would not be a state machine. This is because values sent to actuator (the output of the state machine) would not depend solely on the requests made to the state machine but would, in addition, depend on the execution speed of the loop. In the structure used above, this problem has been avoided by moving the loop into monitor.
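The split can be made concrete with a small Python sketch; the control law F, the sensor stub, and the actuator stub are placeholders assumed here. Because pc never reads the time-varying sensor itself, replaying the same request sequence to any replica of pc reproduces exactly the same outputs.

    import time

    def F(q, sensor_val):
        # Placeholder control law, assumed only for illustration.
        return 0.9 * q + 0.1 * sensor_val

    class PC:
        """State machine pc: deterministic in its request stream."""
        def __init__(self):
            self.q = 0.0                       # var q : real

        def adjust(self, sensor_val):
            # adjust: q := F(q, sensor_val); send q to actuator
            self.q = F(self.q, sensor_val)
            return self.q

    def monitor(pc, read_sensor, send_to_actuator, D):
        # The CLIENT samples the time-varying sensor and issues requests;
        # the state machine never reads the sensor itself.
        while True:
            val = read_sensor()
            send_to_actuator(pc.adjust(val))   # request <pc.adjust, val>
            time.sleep(D)                      # delay D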
In practice, having to structure a system in terms of state machines and clients does not constitute a real restriction. Anything that can be structured in terms of procedures and procedure calls can also be structured using state machines and clients—a state machine implements the procedure, and requests implement the procedure calls. In fact, state machines permit more flexibility in system structure than is usually available with procedure calls. With state machines, a client making a request is not delayed until that request is processed, and the output of a request can be sent someplace other than to the client making the request. We have not yet encountered an application that could not be programmed cleanly in terms of state machines and clients.
3. Fault Tolerance
Before turning to the implementation of fault-tolerant state machines, we must introduce some terminology concerning failures. A component is considered faulty once its behavior is no longer consistent with its specification. In this paper, we consider two representative classes of faulty behavior:
Byzantine Failures. The component can exhibit arbitrary and malicious behavior, perhaps involving collusion with other faulty components [Lamport et al 82].
Fail-stop Failures. In response to a failure, the component changes to a state that permits other components to detect that a failure has occurred and then stops [Schneider 84].
Byzantine failures can be the most disruptive, and there is anecdotal evidence that such failures do occur in practice. Allowing Byzantine failures is the weakest possible assumption that could be made about the effects of a failure. Since a design based on assumptions about the behavior of faulty components runs the risk of failing if these assumptions are not satisfied, it is prudent that life-critical systems tolerate Byzantine failures. However, for most applications, it suffices to assume fail-stop failures.
A system consisting of a set of distinct components is t fault-tolerant if it satisfies its specification provided that no more than t of those components become faulty during some interval of interest.2 Fault-tolerance traditionally has been specified in terms of MTBF (mean-time-between-failures), probability of failure over a given interval, and other statistical measures [Siewiorek & Swarz 82]. While it is clear that such characterizations are important to the users of a system, there are advantages in describing fault tolerance of a system in terms of the maximum number of component failures that can be tolerated over some interval of interest. Asserting that a system is t fault-tolerant makes explicit the assumptions required for correct operation; MTBF and other statistical measures do not. Moreover, t fault-tolerance is unrelated to the reliability of the components that make up the system and therefore is a measure of the fault tolerance supported by the system architecture, in contrast to fault tolerance achieved simply by using reliable components. MTBF and other statistical reliability measures of a t fault-tolerant system can be derived from statistical reliability
2A t fault-tolerant system might continue to operate correctly if more than t failures occur, but correct operation cannot be guaranteed.
measures for the components used in constructing that system—in particular, the probability that there will be t or more failures during the operating interval of interest. Thus, t is typically chosen based on statistical measures of component reliability.
4. Fault-tolerant State Machines
A t fault-tolerant version of a state machine can be implemented by replicating that state machine and running a replica on each of the processors in a distributed system. Provided each replica being run by a non-faulty processor starts in the same initial state and executes the same requests in the same order, then each will do the same thing and produce the same output. Thus, if we assume that each failure can affect at most one processor, hence one state machine replica, then by combining the output of the state machine replicas of this ensemble, we can obtain the output for the t fault-tolerant state machine.
When processors can experience Byzantine failures, an ensemble implementing a t fault-tolerant state machine must have at least 2t + 1 replicas, and the output of the ensemble is the output produced by the majority of the replicas. This is because with 2t + 1 replicas, the majority of the outputs remain correct even after as many as t failures. If processors experience only fail-stop failures, then an ensemble containing t + 1 replicas suffices, and the output of the ensemble can be the output produced by any of its members. This is because only correct outputs are produced by fail-stop processors, and after t failures one non-faulty replica will remain among the t + 1 replicas.
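These two combining rules translate directly into code. The Python sketch below assumes each replica's output arrives as one element of a list; none of the names are from the paper.

    from collections import Counter

    def combine_byzantine(outputs, t):
        # With 2t+1 replicas, at most t outputs are wrong, so the correct
        # value appears at least t+1 times: take the majority value.
        value, count = Counter(outputs).most_common(1)[0]
        if count < t + 1:
            raise RuntimeError("no majority; more than t replicas faulty?")
        return value

    def combine_fail_stop(outputs):
        # Fail-stop replicas produce only correct outputs, so with t+1
        # replicas any output that was produced at all can be used.
        return outputs[0]

    # e.g. combine_byzantine(["a", "a", "b"], t=1) == "a"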
The key, then, for implementing a t fault-tolerant state machine is to ensure
Replica Coordination. All replicas receive and process the same sequence of requests.
This can be decomposed into two requirements concerning dissemination of requests to replicas in an ensemble.
Agreement. Every non-faulty state machine replica receives every request.
Order. Every non-faulty state machine replica processes the requests it receives in the same relative order.
Notice that Agreement governs the behavior of a client in interacting with state machine replicas and that Order governs the behavior of a state machine replica with respect to requests from various clients. Thus, while Replica Coordination could be partitioned in other ways, the Agreement-Order partitioning is a natural choice because it corresponds to the existing separation of the client from the state machine replicas.
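To see what Replica Coordination asks of a single replica, consider the following minimal Python sketch. It assumes each request already carries a totally ordered unique identifier and that some dissemination protocol tells the replica which identifiers are stable; generating identifiers and testing stability are exactly what the implementations referred to below must provide.

    import heapq

    class Replica:
        def __init__(self, apply_request):
            self.apply_request = apply_request  # the state machine's dispatch
            self.pending = []                   # min-heap of (uid, request)

        def deliver(self, uid, request):
            # Agreement: the dissemination protocol must call this on every
            # non-faulty replica for every request. uids are assumed unique.
            heapq.heappush(self.pending, (uid, request))

        def process_stable(self, stable_uid):
            # Order: process, in uid order, every pending request whose uid
            # is stable (no request with a smaller uid can still arrive).
            while self.pending and self.pending[0][0] <= stable_uid:
                uid, request = heapq.heappop(self.pending)
                self.apply_request(request)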
Implementations of Agreement and Order are discussed in §4.1 and §4.2. These implementations make no assumptions about clients or commands. While this generality is useful, knowledge of commands allows Replica Coordination, hence Agreement and Order, to be weakened, and thus allows cheaper protocols to be employed for managing the replicas in an ensemble. Examples of two common weakenings follow.
First, Agreement can be relaxed for read-only requests when fail-stop processors are being assumed. When processors are fail-stop, a request r whose processing does not modify state variables need only be sent to a single non-faulty state machine replica. This is because the response from this replica is—by definition—guaranteed to be correct and, because r changes no state variables, the state of the replica that processes r will remain identical to the states of replicas that do not.
Second, Order can be relaxed for requests that commute. Two requests r and r′ commute in a state machine sm if the sequence of outputs and final state of sm that would result from processing r followed by r′ is the same as would result from processing r′ followed by r. An example of a state machine where Order can be relaxed appears in Figure 4.1. State machine tally determines the first from among a set of alternatives to receive at least MAJ votes and sends this choice to SYSTEM. If clients cannot vote more than once and the number of clients Cno satisfies 2MAJ > Cno, then every request commutes with every other. Thus, implementing Order would be unnecessary—different replicas of the state machine will produce the same outputs even if they process requests in different orders. On the other hand, if clients can vote more than once or 2MAJ ≤ Cno, then reordering requests might change the outcome of the election.
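The definition of commuting requests translates into a direct check, sketched here in Python. The factory make_sm, the apply method, and the state method are assumptions of the sketch; apply is taken to return the command's output.

    def commute(make_sm, r, r_prime):
        # r and r_prime commute if both processing orders yield the same
        # output sequence and the same final state.
        sm_a, sm_b = make_sm(), make_sm()
        out_a = [sm_a.apply(r), sm_a.apply(r_prime)]    # r then r'
        out_b = [sm_b.apply(r_prime), sm_b.apply(r)]    # r' then r
        return out_a == out_b and sm_a.state() == sm_b.state()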
Theories for constructing state machine ensembles that do not satisfy Replica Coordination are proposed in [Aizikowitz 89] and [Mancini & Pappalardo 88]. Both theories are based on proving that an ensemble of state machines implements the same specification as a single replica does. The approach taken in [Aizikowitz 89] uses temporal logic descriptions of state sequences, while the approach in [Mancini & Pappalardo 88] uses an algebra of action sequences. A detailed description of this work is beyond the scope of this tutorial.
tally: state_machine
  var votes : array[candidate] of integer init 0

  cast_vote: command(choice : candidate)
    votes[choice] := votes[choice] + 1;
    if votes[choice] ≥ MAJ → send choice to SYSTEM; halt
    [] votes[choice] < MAJ → skip
    fi
  end cast_vote
end tally

Figure 4.1. An election

8. Reconfiguration

Let P(τ) be the total number of processors at time τ that are executing replicas of the state machine of interest, and let F(τ) be the number of them that are faulty. For the ensemble to produce the correct output, we must have

Combining Condition: P(τ) − F(τ) > Enuf for all 0 ≤ τ,

where

Enuf =  P(τ)/2  if Byzantine failures are possible;
        0       if only fail-stop failures are possible.
A processor failure may cause the Combining Condition to be violated by increasing F(τ), thereby decreasing P(τ) − F(τ). When Byzantine failures are possible, if a faulty processor can be identified, then removing it from the ensemble decreases Enuf without further decreasing P(τ) − F(τ); this can keep the Combining Condition from being violated. When only fail-stop failures are possible, increasing the number of non-faulty processors—by adding one that has been repaired—is the only way to keep the Combining Condition from being violated, because increasing P(τ) is the only way to ensure that P(τ) − F(τ) > 0 holds. Therefore, provided the following conditions hold, it may be possible to maintain the Combining Condition forever and thus tolerate an unbounded total number of faults over the life of the system.
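The Combining Condition itself is a one-line predicate; the following Python transcription is offered only to make the arithmetic concrete.

    def combining_condition(P, F, byzantine):
        # P: processors executing replicas at some instant; F: faulty ones.
        enuf = P / 2 if byzantine else 0
        return P - F > enuf

    # 2t+1 = 3 replicas mask t = 1 Byzantine failure: 3 - 1 > 3/2.
    assert combining_condition(3, 1, byzantine=True)
    # t+1 = 2 replicas survive t = 1 fail-stop failure: 2 - 1 > 0.
    assert combining_condition(2, 1, byzantine=False)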
9Observe that if Byzantine failures are possible, then a faulty client can be elected. Such problems are always possible when voters do not have detailed knowledge about the candidates in an election.
F1: If Byzantine failures are possible, then state machine replicas being executed by faulty processors are identified and removed from the ensemble before the Combining Condition is violated by subsequent processor failures.
F2: State machine replicas running on repaired processors are added to the ensemble before the Combining Condition is violated by subsequent processor failures.
F1 and F2 constrain the rates at which failures and repairs occur.
Removing faulty processors from an ensemble of state machines can also improve system performance. This is because the number of messages that must be sent to achieve agreement is usually proportional to the number of state machine replicas that must agree on the contents of a request. In addition, some protocols to implement agreement execute in time proportional to the number of processors that are faulty. Removing faulty processors clearly reduces both the message complexity and time complexity of such protocols.
Adding or removing a client from the system is simply a matter of changing the state machine so that henceforth it responds to or ignores requests from that client. Adding an output device is also straightforward—the state machine starts sending output to that device. Removing an output device from a system is achieved by disabling the device. This is done by putting the device in a state that prevents it from affecting the environment. For example, a CRT terminal can be disabled by turning off the brightness so that the screen can no longer be read; a hydraulic actuator controlling the flap on an airplane wing can be disabled by opening a cutoff valve so that the actuator exerts no pressure on that control surface. However, as suggested by these examples, it is not always possible to disable a faulty output device: turning off the brightness might have no effect on the screen and the cutoff valve might not work. Thus, there are systems in which no more than a total of t actuator faults can be tolerated because faulty actuators cannot be disabled.
The configuration of a system structured in terms of a state machine and clients can be described using three sets: the clients C, the state machine replicas S, and the output devices O. S is used by the agreement protocol and therefore must be known to clients and state machine replicas. It can also be used by an output device to determine which send operations made by state machine replicas should be ignored. C and O are used by state machine replicas to determine from which clients requests should be processed and to which devices output should be sent. Therefore, C and O must be available to all state machine replicas.
Two problems must be solved to support changing the system configuration. First, the values of C, S, and O must be available when required. Second, whenever a client, state machine replica, or output device is added to the configuration, the state of that element must be updated to reflect the current state of the system. These problems are considered in the following two subsections.
8.1. Managing the Configuration
The configuration of a system can be managed using the state machine in that system. Sets C, S, and O are stored in state variables and changed by commands. Each configuration is valid for a collection of requests—those requests r such that uid(r) is in the range defined by two successive configuration-change requests. Thus, whenever a client, state machine replica, or output device performs an action connected with processing r, it uses the configuration that is valid for r. This means that a configuration-change request must schedule the new configuration for some point far enough in the future so that clients, state machine replicas, and output devices all find out about the new configuration before it actually comes into effect.
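One way to realize this versioning, sketched below in Python under assumed names, is to record each configuration together with the identifier at which it takes effect and to look up the configuration valid for uid(r) when processing r.

    import bisect

    class ConfigHistory:
        """Configurations (C, S, O) keyed by the uid at which each takes effect."""

        def __init__(self, initial_config):
            self.uids = [0]                  # effective uids, ascending
            self.configs = [initial_config]

        def schedule(self, effective_uid, config):
            # A configuration-change request must take effect far enough in
            # the future that everyone learns of it before it becomes valid.
            assert effective_uid > self.uids[-1]
            self.uids.append(effective_uid)
            self.configs.append(config)

        def config_for(self, uid):
            # The configuration valid for request r is the last change whose
            # effective uid does not exceed uid(r).
            i = bisect.bisect_right(self.uids, uid) - 1
            return self.configs[i]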
There are various ways to make configuration information available to the clients and output devices of a system. (The information is already available to the state machine.) One is for clients and output devices to query the state machine periodically for information about relevant pending configuration changes. Obviously, communication costs for this scheme are reduced if clients and output devices share processors with state machine replicas. Another way to make configuration information available is for the state machine to include information about configuration changes in messages it sends to clients and output devices in the course of normal processing. Doing this requires periodic communication between the state machine and clients and between the state machine and output devices.
Requests to change the configuration of the system are made by a failure/recovery detection mechanism. It is convenient to think of this mechanism as a collection of clients, one for each element of C, S, or O. Each of these configurators is responsible for detecting the failure or repair of the single object it manages and, when such an event is detected, for making a request to alter the configuration. A configurator is likely to be part of an existing client or state machine replica and might be implemented in a variety of ways.
When elements are fail-stop, a configurator need only check the failure-detection mechanism of that element. When elements can exhibit Byzantine failures, detecting failures is not always possible. When it is possible, a higher degree of fault tolerance can be achieved by reconfiguration. A non-faulty configurator satisfies two safety properties.
C1: Only a faulty element is removed from the configuration.
C2: Only a non-faulty element is added to the configuration.
However, a configurator that does nothing satisfies C1 and C2. Changing the configuration enhances fault-tolerance only if F1 and F2 also hold. For F1 and F2 to hold, a configurator must also (1) detect faults and cause elements to be removed and (2) detect repairs and cause elements to be added. Thus, the degree to which a configurator enhances fault tolerance is directly related to the degree to which (1) and (2) are achieved. Here, the semantics of the application can be helpful. For example, to infer that a client is faulty, a state machine can compare requests made by different clients or by the same client over a period of time. To determine that a processor executing a state machine replica is faulty, the state machine can monitor messages sent by other state machine replicas during execution of an agreement protocol. And, by monitoring aspects of the environment being controlled by actuators, a state machine replica might be able to determine that an output device is faulty. Some elements, such as processors, have internal failure-detection circuitry that can be read to determine whether that element is faulty or has been repaired and restarted. A configurator for such an element can be implemented by having the state machine periodically poll this circuitry.
In order to analyze the fault-tolerance of a system that uses configurators, failure of a configurator can be considered equivalent to the failure of the element that the configurator manages. This is because with respect to the Combining Condition, removal of a non-faulty element from the system or addition of a faulty one is the same as that element failing. Thus, in a t fault-tolerant system, the sum of the number of faulty configurators that manage non-faulty elements and the number of faulty components with non-faulty configurators must be bounded by t.
8.2. Integrating a Repaired Object
Not only must an element being added to a configuration be non-faulty, it also must have the correct state so that its actions will be consistent with those of the rest of the system. Define e[ri] to be the state that a non-faulty system element e should be in after processing requests r0 through ri. An element e joining the configuration immediately after request rjoin must be in state e[rjoin] before it can participate in the running system.
An element is self-stabilizing [Dijkstra 74] if its current state is completely defined by the previous k inputs it has processed, for some fixed k. Obviously, running such an element long enough to ensure that it has processed k inputs is all that is required to put it in state e[rjoin]. Unfortunately, the design of self-stabilizing state machines is not always possible.
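As an example of self-stabilization in this sense (an illustration assumed here, not taken from the paper), consider an element whose entire state is a window of its last k inputs: after processing any k requests its state is e[rjoin], no matter what state it started in.

    from collections import deque

    class WindowAverage:
        """State = the last k inputs, so any k inputs fully determine it."""

        def __init__(self, k):
            self.window = deque(maxlen=k)

        def update(self, x):
            self.window.append(x)   # older inputs fall out of the state
            return sum(self.window) / len(self.window)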
When elements are not self-stabilizing, processors are fail-stop, and logical clocks are implemented, cooperation of a single state machine replica smi is sufficient to integrate a new element e into the system. This is because state information obtained from any state machine replica smi must be correct. In order to integrate e at request rjoin, replica smi must have access to enough state information so that e[rjoin] can be assembled and forwarded to e.
When e is an output device, e[rjoin] is likely to be only a small amount of device-specific set-up information—information that changes infrequently and can be stored in state variables of smi.
When e is a client, the information needed for e[rjoin] is frequently based on recent sensor values read and can therefore be determined by using information provided to smi by other clients.
And, when e is a state machine replica, the information needed for e[rjoin] is stored in the state variables and pending requests at smi.
The protocol for integrating a client or output device e is simple—e[rjoin] is sent to e before the output produced by processing any request with a unique identifier larger than uid(rjoin). The protocol for integrating a state machine replica smnew is a bit more complex. It is not sufficient for replica smi simply to send the values of all its state variables and copies of any pending requests to smnew. This is because some client request might be received by smi after sending e[rjoin] but delivered to smnew before its repair. Such a request would neither be reflected in the state information forwarded by smi to smnew nor received by smnew directly. Thus, smi must, for a time, relay to smnew requests received from clients.10 Since requests from a given client are received by smnew in the order sent and in ascending order by request identifier, once smnew has received a request directly (i.e. not relayed) from a client c, there is no need for requests from c with larger identifiers to be relayed to smnew. If smnew informs smi of the identifier on a request received directly from each client c, then smi can know when to stop relaying to smnew requests from c.
The complete integration protocol is summarized in the following.
Integration with Fail-stop Processors and Logical Clocks. A state machine replica smi can integrate an element e at request rjoin into a running system as follows.
If e is a client or output device, smi sends the relevant portions of its state variables to e and does so before sending any output produced by requests with unique identifiers larger than the one on rjoin.
If e is a state machine replica smnew, then smi
(1) sends the values of its state variables and copies of any pending requests to smnew,
(2) sends to smnew every subsequent request r received from each client c such that uid(r) < uid(rc), where rc is the first request smnew received directly from c after being restarted.

10Duplicate copies of some requests might be received by smnew.

The existence of synchronized real-time clocks permits this protocol to be simplified because smi can determine when to stop relaying messages based on the passage of time. Suppose, as in §4, there exists a constant Δ such that a request r with unique identifier uid(r) will be received by every (correct) state machine replica no later than time uid(r) + Δ according to the local clock at the receiving processor. Let smnew join the configuration at time τjoin. By definition, smnew is guaranteed to receive every request that was made after time τjoin on the requesting client’s clock. Since unique identifiers are obtained from the real-time clock of the client making the request, smnew is guaranteed to receive every request r such that uid(r) ≥ τjoin. The first such request r must be received by smi by time τjoin + Δ according to its clock. Therefore, every request received by smi after τjoin + Δ must also be received directly by smnew. Clearly, smi need not relay such requests, and we have the following protocol.

Integration with Fail-stop Processors and Real-time Clocks. A state machine replica smi can integrate an element e at request rjoin into a running system as follows.

If e is a client or output device, then smi sends the relevant portions of its state variables to e and does so before sending any output produced by requests with unique identifiers larger than the one on rjoin.

If e is a state machine replica smnew, then smi

(1) sends the values of its state variables and copies of any pending requests to smnew,

(2) sends to smnew every request received during the next interval of duration Δ.

When processors can exhibit Byzantine failures, a single state machine replica smi is not sufficient for integrating a new element into the system. This is because state information furnished by smi might not be correct—smi might be executing on a faulty processor. To tolerate t failures in a system with 2t + 1 state machine replicas, t + 1 identical copies of the state information and t + 1 identical copies of relayed messages must be obtained. Otherwise, the protocol is as described above for real-time clocks.

Stability Revisited

The stability tests of §4 do not work when requests made by a client can be received from two sources—the client and via a relay. During the interval that messages are being relayed, smnew, the state machine replica being integrated, might receive a request r directly from c but later receive r′, another request from c, with uid(r) > uid(r′), because r′ was relayed by smi. The solution to this problem is for smnew to consider requests received directly from c stable only after no relayed requests from c can arrive. Thus, the stability test must be changed:
Stability Test During Restart. A request r received directly from a client c by a restarting state machine replica smnew is stable only after the last request from c relayed by another processor has been received by smnew.
An obvious way to implement this is for a message to be sent to smnew when no further requests from c will be relayed.
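That implementation can be sketched as follows in Python, assuming an explicit per-client end-of-relay marker from the relaying replica; the names are hypothetical.

    class RestartStability:
        """Stability test used by a replica (smnew) while it is integrated."""

        def __init__(self):
            self.relay_ended = set()   # clients whose end-of-relay marker arrived

        def end_of_relay(self, client):
            # smi promises no further requests from this client will be relayed.
            self.relay_ended.add(client)

        def is_stable(self, client):
            # A request received directly from client c is stable only after
            # no relayed request from c can still arrive.
            return client in self.relay_ended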
9. Related Work
The state machine approach was first described in [Lamport 78a] for environments in which failures could not occur. It was generalized to handle fail-stop failures in [Schneider 82], a class of failures between fail-stop and Byzantine failures in [Lamport 78b], and full Byzantine failures in [Lamport 84]. These various state machine implementations were first characterized using the Agreement and Order requirements and a stability test in [Schneider 85].
The state machine approach has been used in the design of significant fault-tolerant process control applications [Wensley et al 78]. It has also been used in the design of distributed synchronization—including read/write locks and distributed semaphores [Schneider 80], input/output guards for CSP and conditional Ada SELECT statements [Schneider 82]—and in the design of a fail-stop processor approximation using processors that can exhibit arbitrary behavior in response to a failure [Schlichting & Schneider 83] [Schneider 84]. A stable storage implementation described in [Bernstein 85] exploits properties of a synchronous broadcast network to avoid explicit protocols for Agreement and Order and employs Transmitting a Default Vote (as described in §7). The notion of Δ-common storage, suggested in [Cristian et al 85], is a state machine implementation of memory that uses the Real-time Clock Stability Test. The decentralized commit protocol of [Skeen 82] can be viewed as a straightforward application of the state machine approach, while the two-phase commit protocol described in [Gray 78] can be obtained from decentralized commit simply by making restrictive assumptions about failures and performing optimizations based on these assumptions. The Paxon Synod commit protocol [Lamport 89] also can be understood in terms of the state machine approach. It is similar to, but cheaper to execute than, the standard three-phase commit protocol. Finally, the method of implementing highly available distributed services in [Liskov & Ladin 86] uses the state machine approach, with clever optimizations of the stability test and agreement protocol that are possible due to the semantics of the application and the use of fail-stop processors.
A critique of the state machine approach for transaction management in database systems appears in [Garcia-Molina et al 84]. Experiments evaluating the performance of various of the stability tests in a network of SUN Workstations are reported in [Pittelli & Garcia-Molina 89]. That study also reports on the performance of request batching, which is possible when requests describe database transactions, and the use of null requests in the Logical Clock Stability Test Tolerating Fail-stop Failures of §4.
Primitives to support the Agreement and Order requirements for Replica Coordination have been included in two operating systems toolkits. The ISIS Toolkit [Birman 85] provides ABCAST and CBCAST for allowing an applications programmer to control the delivery order of messages to the members of a process group (i.e. collection of state machine replicas). ABCAST ensures that all state machine replicas process requests in the same order; CBCAST allows more flexibility in message ordering and ensures that causally related requests are delivered in the correct relative order. ISIS has been used to implement a number of prototype applications. One example is the RNFS (replicated NFS) file system, a network file system that is tolerant of fail-stop failures and runs on top of NFS; it was designed using the state machine approach [Marzullo & Schmuck 88].
The Psync primitive [Peterson et al 89], which has been implemented in the x-kernel [Hutchinson & Peterson 88], is similar to the CBCAST of ISIS. Psync, however, makes available to the programmer the graph of the message “potential causality” relation, while CBCAST does not. Psync is intended to be a low-level protocol that can be used to implement protocols like ABCAST and CBCAST; the ISIS primitives are intended for use by applications programmers and, therefore, hide the “potential causality” relation while at the same time including support for group management and failure reporting.
Acknowledgments
Discussions with O. Babaoglu, K. Birman, and L. Lamport over the past 5 years have helped me to formulate these ideas. Useful comments on drafts of this paper were provided by J. Aizikowitz, O. Babaoglu, A. Bernstein, K. Birman, R. Brown, D. Gries, K. Marzullo, and B. Simons. I am also very grateful to Sal March, managing editor of ACM Computing Surveys, for his thorough reading of this paper and many helpful comments.
References
[Aizikowitz 89] Aizikowitz, J. Designing Distributed Services Using Refinement Mappings. Ph.D. Dissertation, Computer Science Department, Cornell University, Ithaca, New York, August 1989. Also available as technical report TR 89-1040.
[Bernstein 85] Bernstein, A.J. A loosely coupled system for reliably storing data. IEEE Trans. on Software Engineering SE-11, 5 (May 1985), 446-454.
[Birman 85] Birman, K.P. Replication and fault tolerance in the ISIS system. Proc. Tenth ACM Symposium on Operating Systems Principles. (Orcas Island, Washington, Dec. 1985), ACM, 79-86.
[Birman & Joseph 87] Birman, K.P. and T. Joseph. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76.
[Cristian et al 85] Cristian, F., H. Aghili, H.R. Strong, and D. Dolev. Atomic Broadcast: From simple message diffusion to Byzantine agreement. Proc. Fifteenth International Conference on Fault-tolerant Computing, (Ann Arbor, Mich., June 1985), IEEE Computer Society.
[Dijkstra 74] Dijkstra, E.W. Self Stabilization in Spite of Distributed Control. CACM 17, 11 (Nov. 1974), 643-644.
[Fischer et al 85] Fischer, M., N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. JACM 32, 2 (April 1985), 374-382.
[Garcia-Molina et al 84] Garcia-Molina, H., F. Pittelli, and S. Davidson. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (March 1986), 27-47.
[Gray 78] Gray, J. Notes on Data Base Operating Systems. Operating Systems: An Advanced Course, Lecture Notes in Computer Science, Vol. 60, Springer-Verlag, New York, 1978, 393-481.
[Halpern et al 84] Halpern, J., B. Simons, R. Strong, and D. Dolev. Fault-tolerant clock synchronization. Proc. of the Third ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, (Vancouver, Canada, August 1984), 89-102.
[Hutchinson & Peterson 88] Hutchinson, N. and L. Peterson. Design of the x-kernel. Proc. of SIGCOMM ’88—Symposium on Communication Architectures and Protocols (Stanford, CA, August 1988), 65-75.
[Lamport 78a] Lamport, L. Time, clocks and the ordering of events in a distributed system. CACM 21, 7 (July 1978), 558-565.
[Lamport 78b] Lamport, L. The implementation of reliable distributed multiprocess systems. Computer Networks 2 (1978), 95-114.
[Lamport 84] Lamport, L. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (April 1984), 254-280.
[Lamport 89] Lamport, L. The part-time parliament. Technical report 49, Digital Equipment Corporation Systems Research Center, Palo Alto, CA, Sept. 1989.
[Lamport & Melliar-Smith 84] Lamport, L. and P.M. Melliar-Smith. Byzantine clock synchronization. Proc. of the Third ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, (Vancouver, Canada, August 1984), 68-74.
[Lamport et al 82] Lamport, L., R. Shostak, and M. Pease. The Byzantine generals problem. ACM TOPLAS 4, 3 (July 1982), 382-401.
[Liskov & Ladin 86] Liskov, B. and R. Ladin. Highly-available distributed services and fault-tolerant distributed garbage collection. Proc. of the Fifth ACM Symposium on Principles of Distributed Computing, (Calgary, Alberta, Canada, August 1986), ACM, 29-39.
[Mancini & Pappalardo 88] Mancini, L. and G. Pappalardo. Towards a theory of replicated processing. Formal Techniques in Real-time and Fault-tolerant Systems, Lecture Notes in Computer Science, Vol. 331, Springer-Verlag, New York, 1988, 175-192.
[Marzullo 89] Marzullo, K. Implementing fault-tolerant sensors. Technical Report TR 89-997, Computer Science Department, Cornell University, Ithaca, New York, May 1989.
[Marzullo & Schmuck 88] Marzullo, K. and F. Schmuck. Supplying high availability with a standard network file system. Proceedings of the Eighth International Conference on Distributed Computing Systems, (San Jose, CA, June 1988), IEEE Computer Society, 447-455.
[Peterson et al 89] Peterson, L.L., N.C. Buchholz, and R.D. Schlichting. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (August 1989), 217-246.
[Pittelli & Garcia-Molina 89] Pittelli, F.M. and H. Garcia-Molina. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb. 1989) 25-60.
[Powell & Presotto 83] Powell, M. and D. Presotto. PUBLISHING: A reliable broadcast communication mechanism. Proc. of Ninth ACM Symposium on Operating Systems Principles, (Bretton Woods, New Hampshire, October 1983), ACM, 100-109.
[Schlichting & Schneider 83] Schlichting, R.D. and F.B. Schneider. Fail-Stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS 1, 3 (August 1983), 222-238.
[Schneider 80] Schneider, F.B. Ensuring Consistency on a Distributed Database System by Use of Distributed Semaphores. Proc. International Symposium on Distributed Data Bases (Paris, France, March 1980), INRIA, 183-189.
[Schneider 82] Schneider, F.B. Synchronization in distributed programs. ACM TOPLAS 4, 2 (April 1982), 179-195.
[Schneider 84] Schneider, F.B. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May 1984), 145-154.
[Schneider 85] Schneider, F.B. Paradigms for distributed programs. Distributed Systems—Methods and Tools for Specification, Lecture Notes in Computer Science, Vol. 190, Springer-Verlag, New York, N.Y. 1985, 343-430.
[Schneider 86] Schneider, F.B. A paradigm for reliable clock synchronization. Proc. Advanced Seminar on Real-Time Local Area Networks (Bandol, France, April 1986), INRIA, 85-104.
[Schneider et al 84] Schneider, F.B., D. Gries, and R.D. Schlichting. Fault-Tolerant Broadcasts. Science of Computer Programming 4 (1984), 1-15.
[Siewiorek & Swarz 82] Siewiorek, D.P. and R.S. Swarz. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass, 1982.
[Skeen 82] Skeen, D. Crash Recovery in a Distributed Database System. Ph.D. Thesis, University of California at Berkeley, May 1982.
[Strong & Dolev 83] Strong, H.R. and D. Dolev. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers, (Compcon 83, IEEE Computer Society, March 1983), IEEE Computer Society, 77-82.
[Wensley et al 78] Wensley, J., et al. SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control. Proc. IEEE 66, 10 (Oct. 1978), 1240-1255.