Distributed Systems Foundations
GLOBAL STATES AND
CHECKPOINTS
CS 271 1
Motivation
CS 271 2
Detecting global properties
Want to discover if a property holds in a distributed system
Three examples:
Distributed garbage collection: if there are no longer any
reference to objects, the memory taken up by the objects
should be reclaimed.
Distributed deadlock detection: when each of a collection
of processes waits for another process to send it a
message, and where there is a cycle in the graph of this
“wait-for” relationship.
Distributed termination detection: detect if a distributed
algorithm has terminated. Need to test if each process
has halted and no more messages in the network.
CS 271 3
CS 271 4
Distributed Checkpoints and Rollback
Recovery
• Fault tolerance is achieved by periodically using
stable storage to save the processes’ states during
the failure-free execution.
• Upon a failure, a failed process rolls back from one of
its saved states, thereby reducing the amount of lost
computation.
• Each of the saved states is called a checkpoint
CS 271 5
Checkpoint based Recovery
• Uncoordinated checkpointing: Each
process takes its checkpoints
independently
• Coordinated checkpointing: Processes
coordinate their checkpoints in order to
save a system-wide consistent state.
CS 271 6
Domino effect: uncoordinated
example
P0
P1
P2
m0
m1
m2 m3
m4
m5
m7
m6
Recovery
Line
Domino Effect: Cascaded rollback which causes the
system to roll back too far in the computation (even to the
beginning), in spite of all the checkpoints
Coordinated Non-blocking
• Processes could coordinate, but …
• Do we really need to block …?
!
!
K. Mani Chandy Leslie Lamport
CS 271 7
Global State
Chandy and Lamport—TOCS 1985
• Global state of a distributed system
– Local state of each process
– Messages sent but not received
• Many applications need the state of the system
– Failure recovery, distributed deadlock detection
– Detect stable properties.
• Problem: how can you figure out the state of a
distributed system?
– Each process is independent
– Network does not have any processing power.
• Distributed snapshot: a consistent global state
CS 271 8
Distributed System Model
• Assume each process communicates with
another process using unidirectional FIFO
point-to-point channels (e.g, TCP connections)
CS 271 9
p q
r
c1
c2
c3c4
CS 271 10
A Simple Example
A Variant of producer-consumer example
• Producer code:
while (1)
{
produce m;
send m;
wait for ack;
}
• Consumer code:
while (1)
{
recv m;
consume m;
send ack;
}
CS 271 11
Example: Initial State
m
Producer Consumer
CS 271 12
Example
m
Producer Consumer
CS 271 13
Example
m
Producer Consumer
CS 271 14
Example
a
Producer Consumer
CS 271 15
Example
a
Producer Consumer
CS 271 16
Example
a
Producer Consumer
CS 271 17
A naïve snapshot algorithm
• Processes record their state at any arbitrary
point
• A designated process collects these states
+ So simple!!
– Correct??
CS 271 18
Example
Producer Consumer problem
Producer records its state
m
Producer Consumer
CS 271 19
Example
m
Producer Consumer
CS 271 20
Example
Consumer records its state
m
Producer Consumer
CS 271 21
Example
The recorded state
m m
Producer Consumer
!!!!!!!!!!!!!!!!!!
CS 271 22
Where did we err?
• What did we do wrong?
producer
consumer
m
CS 271 23
Error!!
• The sender has no record of the sending
• The receiver has the record of the receipt
• Result
– Global state has record of the receive event but no
send event violating the happened before
concept!!
CS 271 24
The Notion of Consistency
• A global state is consistent if it could have
been observed by an external observer
• If a b then it is never the case that b is
observed by an external observer and not a
• All feasible states are consistent
CS 271 25
An Example
p
q
p q
Sp
0 Sp
1 Sp
2 Sp
3
Sq
0 Sq
1 Sq
2 Sq
3
m1
m2
m3
CS 271 26
A Consistent State?
p
q
p q
Sp
0 Sp
1 Sp
2 Sp
3
Sq
0 Sq
1 Sq
2 Sq
3
m1
m2
m3
Sp
1 Sq
1
CS 271 27
Yes
p
q
p q
Sp
0 Sp
1 Sp
2 Sp
3
Sq
0 Sq
1 Sq
2 Sq
3
m1
m2
m3
Sp
1 Sq
1
CS 271 28
A Consistent State?
p
q
p q
Sp
0 Sp
1 Sp
2 Sp
3
Sq
0 Sq
1 Sq
2 Sq
3
m1
m2
m3
Sp
2 Sq
3
m3
CS 271 29
Yes
p
q
p q
Sp
0 Sp
1 Sp
2 Sp
3
Sq
0 Sq
1 Sq
2 Sq
3
m1
m2 m3
Sp
2 Sq
3
m3
CS 271 30
What about……
p
q
p q
Sp
0 Sp
1 Sp
2 Sp
3
Sq
0 Sq
1 Sq
2 Sq
3
m1
m2
m3
Sp
1 Sq
3
Consistent State
a) A consistent cut
b) An inconsistent cut
CS 271 31
Distributed Snapshot Algorithm
• Any process can initiate the algorithm
– Save local state
– Send MARKERs on every outgoing channel
• On receiving a first marker on a channel c:
– Process saves local state and state of c is empty
– Send MARKERs on all outgoing channels, and save
messages on all other incoming channels (not c).
• On receiving subsequent marker on a channel:
– stop saving messages for that channel
– Saved messages are the state of the channel
CS 271 32
Distributed Snapshot
• A process finishes when
– It receives a marker on each incoming channel and
processes them all
– State: local state plus state of all channels
– Send state to initiator
• Any process can initiate snapshot
– Multiple snapshots may be in progress
• Each is distinguished by tagging the marker with the
initiator ID (and sequence number)
CS 271 33
Example
p q
r
c1
c2
c3c4
initiator
p
q
r
marker
checkpoint
x x
x
x x
CS 271 34
CS 271 35
Execution Example
p
q
Sq
0 Sq
1 Sq
2 Sq
3
Sp
0 Sp
1 Sp
2 Sp
3
m1 m2 m3
p q
c1
c2
initiator
CS 271 36
Execution Example
p
q
Sq
0 Sq
1 Sq
2 Sq
3
Sp
0 Sp
1 Sp
2 Sp
3
m1 m2 m3
q records state as Sq
1 , sends marker to p
p q
c1
c2
initiator
CS 271 37
Execution Example
p
q
Sq
0 Sq
1 Sq
2 Sq
3
Sp
0 Sp
1 Sp
2 Sp
3
m1 m2 m3
p records state as Sp
2, channel state as empty
p q
c1
c2
initiator
CS 271 38
Execution
Example
p
q
Sq
0 Sq
1 Sq
2 Sq
3
Sp
0 Sp
1 Sp
2 Sp
3
m1 m2 m3
q records channel state as m3
p q
c1
c2
initiator
CS 271 39
Execution
Example
p
q
Sq
0 Sq
1 Sq
2 Sq
3
Sp
0 Sp
1 Sp
2 Sp
3
m1 m2 m3
Recorded Global State = ((Sp
2, Sq
1), (0,m3) )
p q
c1
c2
initiator
Two processes trading in “widgets”
p1 p2
c2
c1
account widgets
$1000 (none)
account widgets
$50 2000
• Process p1 sends orders for widgets over c2 to p2,
enclosing payment at the rate of $10 per widget.
• In exchange, process p2 sends widgets along
channel c1 to p1.
CS 271 40
Execution Example
CS 271 41
p
1
p
2
(empty)<$1000, 0> <$50, 2000>
(empty)
c
2
c1e0: p1 sends (10, $100).
p
1
p
2
(Order 10, $100)<$900, 0> <$50, 2000>c2
1. p1 records its state and
sends a marker over c2.
(empty)c1
e1: p2 sends 5 free widgets to p1
p
1
p
2
(Order 10, $100)<$900, 0> <$50, 1995>
(five widgets)
c
2
c
1
Send 5 freebie widgets!
2. p2 receives marker & records
its state & channel c2 as empty.
Then sends marker over c1
p
1
p
2
(Order 10, $100)<$900, 5> <$50, 1995>c2
c
1
e2: p1 receives five widgets
3. p1 receives the marker, &
records state of c1 as 5 widget
p
1
p
2
(Order 10, $100)<$900, 5> <$50, 1995>c2
c
1
The Retrieved State
What is this State? Does it correspond to
any of the actual global states the system
went through?
p1 p2
(empty)
<$1000, 0> <$50, 1995>
<5 widgets>
C2
c1
CS 271 42
The Execution of the Algorithm
e0: p1 sends (10, $100).
e1: p2 sends 5 widges to p1
e2: p1 receives five widgets
p1 p2
(empty)
<$1000, 0> <$50, 1995>
<5 widgets>
C2
c1
CS 271 43
The Execution of the Algorithm
e0: p1 sends (10, $100).
e1: p2 sends 5 widges to p1
e2: p1 receives five widgets
e1: p2 sends 5 widges to p1
e0: p1 sends (10, $100)
e2: p1 receives five widgets
p1 p2
(empty)
<$1000, 0> <$50, 1995>
<5 widgets>
C2
c1
CS 271 44
The Execution of the Algorithm
e0: p1 sends (10, $100).
e1: p2 sends 5 widges to p1
e2: p1 receives five widgets
e1: p2 sends 5 widges to p1
e0: p1 sends (10, $100)
e2: p1 receives five widgets
p1 p2
(empty)
<$1000, 0> <$50, 1995>
<5 widgets>
C2
c1
TAKE SNAPSHOT
CS 271 45
Model
• Finite set of processes. Finite set of directed
channels. Modeled as a Directed Graph.
• Channels are FIFO, error free.
• A process is a finite set of states: an initial
state and a set of events.
• State of channel is msgs sent but not received.
CS 271 46
Correctness Proof
• Global State: set of process and channel states.
• Let seq = e1, e2, ……., en be a dist comp.
• Let Si be the state before event ei.
• Let Sinit be the state in which the algorithm
started and Sfinal the state when it terminated.
• Ssnap be the recorded state.
• Show that Ssnap is reachable from Sinit and Sfinal
is reachable from Ssnap
CS 271 47
Correctness Proof
• There is a sequence seq’ such that:
1. seq’ is a permutation of seq
2. Sinit = Ssnap or Sinit occurs earlier than Ssnap
3. Sfinal = Ssnap or Ssnap occurs earlier than Sfinal
• e is a pre-recording event iff e is in p and p
records state after e in seq.
• E is post-recording event iff e is in p and p
records state before e in seq.
CS 271 48
Correctness Proof
1. All events ei where i < init are pre-recording. 2. All events ei where i > final are post-recording.
3. There can be some post-recording events before
some pre-recording events.
• Possible on same process? NO
• What about on different processes?
• A pair of events a, b can be scheduled in any
order if there is no causal order between
them, so (a; b) is equivalent to (b; a)
CS 271 49
Checkpoint Proof: Different
Processes Case
• Violates FIFO property
P
Q
m2
m3
M
m1
Post-Rec
Pre-Rec
Reachability between states in the
snapshot algorithm
Sinit Sfinal
Ssnap
actual execution e0,e1,…
recording recording
begins ends
pre-snap: e’0,e ‘1,…e
‘
R-1 post-snap: e
‘
R,e ‘R+1,…
‘
CS 271 51
Why does it work?
Let an observer observe the following actions:
pre[i] pre[k] post[k] pre[j] post[i] pre[l] post[j] post[l] …
pre[i] pre[k] pre[j] post[k] post[i] pre[l] post[j] post[l] …
pre[i] pre[k] pre[j] post[k] pre[l] post[i] post[j] post[l] …
pre[i] pre[k] pre[j] pre[l] post[k] post[i] post[j] post[l] …
Ssnap
52
Ssnap
Easy conceptualization of the snapshot state
All pre All post
Returns a correct global state
• Obtain seq by reordering events of seq
between first snap and last snap, putting all pre-
recording events before all post-recording
events, preserving causality.
• Returned state is exactly the global state of seq
between the pre-recording and post-recording
events.
CS 271 53
Correctness
• Termination:
–Communication graph is strongly connected
–All snap eventually, because of either snap
input or marker message.
– If there is a communication path from pi to
pk, then pk will record its state a finite period
of time after pi
–Markers eventually sent and received on all
channels.
CS 271 54
What about Channel States?
• Recorded state of channel C from p to q:
– Sequence of msgs received by q before Marker received
Minus
– Sequence of msgs received by q before state recording
CS 271 55
What about Channel States?
• What we want:
– Sequence of msgs sent by p before state recording
Minus
– Sequence of msgs received by q before state recording
CS 271 56
What about Channel States?
• Recorded state of channel C from p to q:
– Sequence of msgs received by q before Marker received
Minus
– Sequence of msgs received by q before state recording
• What we want:
– Sequence of msgs sent by p before state recording
Minus
– Sequence of msgs received by q before state recording
CS 271 57
What about Channel States?
• Recorded state of channel C from p to q:
– Sequence of msgs received by q before Marker received
Minus
– Sequence of msgs received by q before state recording
• What we want:
– Sequence of msgs sent by p before state recording
Minus
– Sequence of msgs received by q before state recording
CS 271 58
What about Channel States?
• Recorded state of channel C from p to q:
– Sequence of msgs received by q before Marker received
Minus
– Sequence of msgs received by q before state recording
• What we want:
– Sequence of msgs sent by p before state recording
Minus
– Sequence of msgs received by q before state recording
• They are equal!
CS 271 59
Detecting Stable Properties
• A predicate y (S) is a stable property if once y is
true for state S it remains true (unless you
interfere with system) for all subsequent states.
• Run protocol and record state S*.
– If y(Sinit) is true then y(S*) is true
– If y(S*) is true then y (Sfinal) is true
• y(S*) true implies property holds
• y(S*) false does NOT implies property does not
hold.
CS 271 60