Systems, Networks & Concurrency 2019
Safety & Livene7ss Uwe R. Zimmer – The Australian National University
[Ben2006]
[Chandy1983]
Ben-Ari, M
Chandy, K, Misra, Jayadev & Haas, Laura
Principles of Concurrent and Dis- tributed Programming
second edition, Prentice-Hall 2006
Distributed deadlock detection
© 2019 Uwe R. Zimmer, The Australian National University
page 460 of 757 (chapter 7: “Safety & Liveness” up to page 462)
Safety & Liveness
References for this chapter
Transactions on Computer Sys- tems (TOCS) 1983 vol. 1 (2)
[Silberschatz2001]
Silberschatz, Abraham, Gal- vin, Peter & Gagne, Greg Operating System Concepts John Wiley & Sons, Inc., 2001
Safety properties: Liveness properties:
Safety & Liveness
Repetition
Correctness concepts in concurrent systems
Extended concepts of correctness in concurrent systems: ¬ Terminationisoftennotintendedorevenconsideredafailure
(P (I) /Processes (I, S)) & XQ (I, S)
where XQ means that Q does always hold
(P (I) /Processes (I, S)) & oQ (I, S)
where oQ means that Q does eventually hold (and will then stay true)
and S is the current state of the concurrent system
© 2019 Uwe R. Zimmer, The Australian National University page 461 of 757 (chapter 7: “Safety & Liveness” up to page 462)
Examples:
Safety & Liveness
Repetition
Correctness concepts in concurrent systems
Liveness properties:
(P (I) /Processes (I, S)) & oQ (I, S)
where oQ means that Q does eventually hold (and will then stay true)
• Requests need to complete eventually.
• The state of the system needs to be displayed eventually. • No part of the system is to be delayed forever (fairness).
Interesting liveness properties can become very hard to proof
© 2019 Uwe R. Zimmer, The Australian National University page 462 of 757 (chapter 7: “Safety & Liveness” up to page 462)
Liveness properties:
(P (I) /Processes (I, S)) & oQ (I, S)
Safety & Liveness
• Weak fairness: ?4R & ?G … eventually, if a process requests continually.
Liveness
Fairness
where oQ means that Q does eventually hold (and will then stay true) Fairness (as a means to avoid starvation): Resources will be granted …
• Strong fairness: 4?R & ?G … eventually, if a process requests infinitely often.
• Linear waiting: ?R & ?G … before any other process had the same resource granted more than once (common fairness in distributed systems).
• First-in, first-out: ?R & ?G … before any other process which applied for the same resource at a later point in time (common fairness in single-node systems).
© 2019 Uwe R. Zimmer, The Australian National University page 463 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Revisiting
Correctness concepts in concurrent systems
Safety properties: Examples:
(P (I) /Processes (I, S)) & XQ (I, S)
where XQ means that Q does always hold
• Mutual exclusion (no resource collisions) has been addressed • Absence of deadlocks to be addressed now
(and other forms of ‘silent death’ and ‘freeze’ conditions)
• Specified responsiveness or free capabilities Real-time systems (typical in real-time / embedded systems or server applications)
© 2019 Uwe R. Zimmer, The Australian National University page 464 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Most forms of synchronization may lead to
Deadlocks
(Avoidance / prevention of deadlocks is one central safety property)
How to predict them?
How to find them?
How to resolve them?
… or are there structurally dead-lock free forms of synchronization?
© 2019 Uwe R. Zimmer, The Australian National University page 465 of 757 (chapter 7: “Safety & Liveness” up to page 512)
process P1;
statement X;
var reserve_1, reserve_2 : semaphore := 1; process P2;
wait (reserve_1);
wait (reserve_2);
wait (reserve_2);
wait (reserve_1);
statement Y; — employ all resources
signal (reserve_2);
signal (reserve_1);
statement B; — employ all resources
signal (reserve_1);
signal (reserve_2);
statement Z; end P1;
statement C; end P2;
Sequence of operations: A
© 2019 Uwe R. Zimmer, The Australian National University
C; X
Z; 6X,Z ; A,B,C@; 6A,C ; X,Y,Z@; J6B ; Y@
page 466 of 757 (chapter 7: “Safety & Liveness” up to page 512)
< B <
or: 6A ; X@ followed by a deadlock situation.
Safety & Liveness
Towards synchronization
Reserving resources in reverse order
< Y <
statement A;
process P1;
statement X;
process P2;
statement A;
process P3;
statement K;
wait (reserve_1);
wait (reserve_2);
wait (reserve_2);
wait (reserve_3);
wait (reserve_3);
wait (reserve_1);
statement Y;
statement B;
statement L;
signal (reserve_2);
signal (reserve_1);
signal (reserve_3);
signal (reserve_2);
signal (reserve_1);
signal (reserve_3);
statement Z; end P1;
statement C; end P2;
statement M; end P3;
Sequence of operations: A
< B
< C; X < Y < Z; K < L <
Safety & Liveness
Towards synchronization
Circular dependencies
var reserve_1, reserve_2, reserve_3 : semaphore := 1;
M;
6X,Z ; A,B,C ; K,M@; 6A,C ; X,Y,Z ; K,M@; 6A,C ; K,L,M ; X,Z@; J6B ; Y ; L@
or: 6A ; X ; K@ followed by a deadlock situation.
© 2019 Uwe R. Zimmer, The Australian National University page 467 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Necessary deadlock conditions:
1. Mutual exclusion:
resources cannot be used simultaneously.
© 2019 Uwe R. Zimmer, The Australian National University page 468 of 757 (chapter 7: “Safety & Liveness” up to page 512)
1. Mutual exclusion:
resources cannot be used simultaneously.
Safety & Liveness
Deadlocks
Necessary deadlock conditions:
2. Hold and wait:
a process applies for a resource, while it is holding another resource (sequential requests).
© 2019 Uwe R. Zimmer, The Australian National University page 469 of 757 (chapter 7: “Safety & Liveness” up to page 512)
1. Mutual exclusion:
resources cannot be used simultaneously.
Safety & Liveness
Deadlocks
Necessary deadlock conditions:
2. Hold and wait:
a process applies for a resource, while it is holding another resource (sequential requests).
3. No pre-emption:
resources cannot be pre-empted; only the process itself can release resources.
© 2019 Uwe R. Zimmer, The Australian National University page 470 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Necessary deadlock conditions:
1. Mutual exclusion:
resources cannot be used simultaneously.
2. Hold and wait:
a process applies for a resource, while it is holding another resource (sequential requests).
3. No pre-emption:
resources cannot be pre-empted; only the process itself can release resources.
4. Circular wait: a ring list of processes exists,
where every process waits for release of a resource by the next one.
© 2019 Uwe R. Zimmer, The Australian National University page 471 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Necessary deadlock conditions:
1. Mutual exclusion:
resources cannot be used simultaneously.
2. Hold and wait:
a process applies for a resource, while it is holding another resource (sequential requests).
3. No pre-emption:
resources cannot be pre-empted; only the process itself can release resources.
4. Circular wait: a ring list of processes exists,
where every process waits for release of a resource by the next one.
A system may become deadlocked, if all these conditions apply!
© 2019 Uwe R. Zimmer, The Australian National University page 472 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
• Deadlock detection & recovery
find deadlocked processes and recover the system in a coordinated way
Deadlocks
Deadlock strategies:
• Ignorance & restart
Kill or restart unresponsive processes, power-cycle the computer, ...
• Deadlock avoidance
the resulting system state is checked before any resources are actually assigned
• Deadlock prevention
the system prevents deadlocks by its structure
© 2019 Uwe R. Zimmer, The Australian National University page 473 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Deadlock prevention
(Remove one of the four necessary deadlock conditions)
1. Break Mutual exclusion:
Mutual exclusion Hold and wait No pre-emption Circular wait
© 2019 Uwe R. Zimmer, The Australian National University
page 474 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Deadlock prevention
(Remove one of the four necessary deadlock conditions)
1. Break Mutual exclusion:
By replicating critical resources, mutual exclusion becomes un- necessary (only applicable in very specific cases).
Mutual exclusion Hold and wait No pre-emption Circular wait
2. Break Hold and wait:
© 2019 Uwe R. Zimmer, The Australian National University
page 475 of 757 (chapter 7: “Safety & Liveness” up to page 512)
1. Break Mutual exclusion:
By replicating critical resources, mutual exclusion becomes un- necessary (only applicable in very specific cases).
Mutual exclusion Hold and wait No pre-emption Circular wait
2. Break Hold and wait:
Allocation of all required resources in one request.
Processes can either hold none or all of their required resources.
3. Introduce Pre-emption: :
Safety & Liveness
Deadlocks
Deadlock prevention
(Remove one of the four necessary deadlock conditions)
© 2019 Uwe R. Zimmer, The Australian National University page 476 of 757 (chapter 7: “Safety & Liveness” up to page 512)
1. Break Mutual exclusion:
By replicating critical resources, mutual exclusion becomes un- necessary (only applicable in very specific cases).
Mutual exclusion Hold and wait No pre-emption Circular wait
2. Break Hold and wait:
Allocation of all required resources in one request.
Processes can either hold none or all of their required resources.
Safety & Liveness
Deadlocks
Deadlock prevention
(Remove one of the four necessary deadlock conditions)
3. Introduce Pre-emption:
Provide the additional infrastructure to allow for pre-emption of resources. Mind that re- sources cannot be pre-empted, if their states cannot be fully stored and recovered.
4. Break Circular waits:
© 2019 Uwe R. Zimmer, The Australian National University page 477 of 757 (chapter 7: “Safety & Liveness” up to page 512)
1. Break Mutual exclusion:
By replicating critical resources, mutual exclusion becomes un- necessary (only applicable in very specific cases).
Mutual exclusion Hold and wait No pre-emption Circular wait
2. Break Hold and wait:
Allocation of all required resources in one request.
Processes can either hold none or all of their required resources.
Safety & Liveness
Deadlocks
Deadlock prevention
(Remove one of the four necessary deadlock conditions)
3. Introduce Pre-emption:
Provide the additional infrastructure to allow for pre-emption of resources. Mind that re- sources cannot be pre-empted, if their states cannot be fully stored and recovered.
4. Break Circular waits:
E.g. order all resources globally and restrict processes to request resources in that order only.
© 2019 Uwe R. Zimmer, The Australian National University page 478 of 757 (chapter 7: “Safety & Liveness” up to page 512)
RAG = "V,E,; Resource allocation graphs consist of vertices V and edges E. V = P jR; Vertices V can be processes P or Resource types R.
Pi
with processes P = "P1,f,Pn,
and resources types R = "R1,fRk,
requests
Safety & Liveness
Deadlocks
Pi
Resource Allocation Graphs
holds
(Silberschatz, Galvin & Gagne)
E = Ec jEr jEa; Edges E can be “claims” Ec, “requests” Er or “assignments” Ea with claims Ec = $Pi " Rj,f.
requests Er = $Pi " Rj,f.
and assignments Ea = $Rj " Pi,f.
Rj Pi
Note: any resource type Rj can have more than one instance of a resource.
© 2019 Uwe R. Zimmer, The Australian National University page 479 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Rj
claims
Rj
Resource Allocation Graphs
© 2019 Uwe R. Zimmer, The Australian National University
page 480 of 757 (chapter 7: “Safety & Liveness” up to page 512)
(Silberschatz, Galvin & Gagne)
R1
Safety & Liveness
Deadlocks
P1
P2 R2
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1
Two process, reverse allocation deadlock:
P1
P2 R2
Safety & Liveness
Deadlocks
© 2019 Uwe R. Zimmer, The Australian National University page 481 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Resource Allocation Graphs
© 2019 Uwe R. Zimmer, The Australian National University
page 482 of 757 (chapter 7: “Safety & Liveness” up to page 512)
(Silberschatz, Galvin & Gagne)
R1 R3
Safety & Liveness
Deadlocks
P1 P2 P3 R2
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1 R3
No circular dependency no deadlock:
P1 P2 P3 R2
Safety & Liveness
Deadlocks
© 2019 Uwe R. Zimmer, The Australian National University page 483 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Resource Allocation Graphs
© 2019 Uwe R. Zimmer, The Australian National University
page 484 of 757 (chapter 7: “Safety & Liveness” up to page 512)
(Silberschatz, Galvin & Gagne)
R1 R3
Safety & Liveness
Deadlocks
P1 P2 P3 R2
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1 R3
Two circular dependencies deadlock:
P1 "R1 "P2 "R3 "P3 "R2 "P1 as well as: P2 " R3 " P3 " R2 " P2
P1 P2 P3 R2
Derived rule:
If some processes are deadlocked then there are cycles in the resource allocation graph.
Safety & Liveness
Deadlocks
© 2019 Uwe R. Zimmer, The Australian National University page 485 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Edge Chasing
(for the distributed version see Chandy, Misra & Haas) 6blocking processes:
R1 R3
Send a probe to all requested yet unassigned resources con- taining ids of: [the blocked, the sending, the targeted node].
P1 P2 P3 R2
6nodes on probe reception:
Propagate the probe to all processes holding the critical resources or to all requested yet unassigned resources – while updating the second and third entry in the probe.
7a process receiving its own probe: (blocked-id = targeted-id)
Circular dependency detected. © 2019 Uwe R. Zimmer, The Australian National University
page 486 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1 R3
P1 P2 P3
Knowledge of claims:
Claims are potential future requests which have no blocking ef-
Safety & Liveness
Deadlocks
fect on the claiming process – while actual requests are blocking. R2
© 2019 Uwe R. Zimmer, The Australian National University page 487 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1
R3
Assignment of resources such that circular dependencies are avoided:
P1
P2 R2
P3
© 2019 Uwe R. Zimmer, The Australian National University
page 488 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne) Earlier derived rule:
R1 R3
If some processes are deadlocked
then there are cycles in the resource allocation graph.
P1 P2 P3 R2
Reverse rule for multiple instances:
If there are cycles in the resource allocation graph and there are multiple instances per resource
then the involved processes are potentially deadlocked.
Reverse rule for single instances:
If there are cycles in the resource allocation graph and there is exactly one instance per resource then the involved processes are deadlocked.
© 2019 Uwe R. Zimmer, The Australian National University
page 489 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1
Reverse rule for single instances:
P1
P2 R2
If there are cycles in the resource allocation graph and there is exactly one instance per resource then the involved processes are deadlocked.
Actual deadlock identified
© 2019 Uwe R. Zimmer, The Australian National University
page 490 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1 R3
Reverse rule for multiple instances:
P1 P2 P3 R2
If there are cycles in the resource allocation graph and there are multiple instances per resource
then the involved processes are potentially deadlocked.
Potential deadlock identified
© 2019 Uwe R. Zimmer, The Australian National University
page 491 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1 R3
Reverse rule for multiple instances:
P1 P2 P3 R2
If there are cycles in the resource allocation graph and there are multiple instances per resource
then the involved processes are potentially deadlocked.
Potential deadlock identified
– yet clearly not an actual deadlock here
© 2019 Uwe R. Zimmer, The Australian National University
page 492 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
P4
Resource Allocation Graphs
(Silberschatz, Galvin & Gagne)
R1
R3
How to detect actual deadlocks in the general case?
(multiple instances per resource)
P1
P2 R2
P3
© 2019 Uwe R. Zimmer, The Australian National University
page 493 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
• Claimed [i, j]
• Requested [i, j]
• Completed [i]
• Simulated_Free [j]
Safety & Liveness
Deadlocks
Banker’s Algorithm
There are processes Pi ! {P1,f,Pn} and resource types Rj ! {R1,f,Rm} and data structures: • Allocated [i, j]
• Free [j]
the number of resources of type j currently allocated to process i. the number of currently available resources of type j. the number of resources of type j required by process i eventually. the number of currently requested resources of type j by process i. boolean vector indicating processes which may complete.
Number of available resources assuming that complete processes deallocate their resources. © 2019 Uwe R. Zimmer, The Australian National University page 494 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Banker’s Algorithm
1. Simulated_Free % Free; 6i: Completed [i] % False; 2.While 7i: JCompleted [i]
and 6j: Requested [i, j] < Simulated_Free [j] do:
6j: Simulated_Free [j] % Simulated_Free [j] + Allocated [i, j];
Completed [i] % True;
3. If 6i: Completed [i] then the system is currently deadlock-free!
else all processes i with JCompleted [i] are involved in a deadlock!.
© 2019 Uwe R. Zimmer, The Australian National University page 495 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Deadlocks
Banker’s Algorithm
1. Simulated_Free % Free; 6i: Completed [i] % False; 2.While 7i: JCompleted [i]
and 6j: Claimed [i, j] < Simulated_Free [j] do:
6j: Simulated_Free [j] % Simulated_Free [j] + Allocated [i, j];
Completed [i] % True;
3. If 6i: Completed [i] then the system is safe!
A safe system is a system in which future deadlocks can be avoided assuming the current set of available resources.
© 2019 Uwe R. Zimmer, The Australian National University page 496 of 757 (chapter 7: “Safety & Liveness” up to page 512)
if (Request < Claimed) and (Request < Free) then Free := Free - Request;
Claimed := Claimed - Request;
Allocated := Allocated + Request;
if System_is_safe (checked by e.g. Banker’s algorithm) then Grant request
else
Safety & Liveness
Deadlocks
Banker’s Algorithm
Check potential future system safety by simulating a granted request: (Deadlock avoidance)
Restore former system state: (Free, Claimed, Allocated) end if;
end if;
© 2019 Uwe R. Zimmer, The Australian National University page 497 of 757 (chapter 7: “Safety & Liveness” up to page 512)
(keeping most processes close to the resources they require).
Organize the nodes in an adequate topology (e.g. a tree).
Safety & Liveness
Check for deadlock inside nodes
with blocked resource requests and detect/avoid local deadlock immediately.
Exchange resource status information
between nodes occasionally and detect global deadlocks eventually.
Deadlocks
Distributed deadlock detection
Observation: Deadlock detection methods like Banker’s Algorithm are too communication intensive to be commonly applied in full and at high frequency in a distributed system.
Therefore a distributed version needs to: Split the system into nodes of reasonable locality
© 2019 Uwe R. Zimmer, The Australian National University page 498 of 757 (chapter 7: “Safety & Liveness” up to page 512)
A deadlock has been detected now what? Breaking the circular dependencies can be done by:
Either pre-empt an assigned resource which is part of the deadlock. or stop a process which is part of the deadlock.
Safety & Liveness
Deadlocks
Deadlock recovery
Usually neither choice can be implemented ‘gracefully’ and deals only with the symptoms.
Deadlock recovery does not address the reason for the problem! (i.e. the deadlock situation can re-occur again immediately)
© 2019 Uwe R. Zimmer, The Australian National University page 499 of 757 (chapter 7: “Safety & Liveness” up to page 512)
• Deadlock avoidance
System state is checked with every resource assignment.
Safety & Liveness
Deadlocks
Deadlock strategies:
• Deadlock prevention
System prevents deadlocks by its structure or by full verification
More generally applicable, yet computationally very expensive. • Deadlock detection & recovery
Detect deadlocks and break them in a ‘coordinated’ way.
Less computationally expensive (as lower frequent), yet usually ‘messy’. • Ignorance & random kill
Kill or restart unresponsive processes, power-cycle the computer, ...
More of a panic reaction than a method. © 2019 Uwe R. Zimmer, The Australian National University page 500 of 757 (chapter 7: “Safety & Liveness” up to page 512)
The best approach if applicable.
Definitions of atomicity:
An operation is atomic if the processes performing it ...
• (by communication) ... do not communicate with other processes while the atomic operation is performed.
• (by means of states) ... cannot detect any outside state change and do not reveal their own state changes until the atomic operation is complete.
Short:
Safety & Liveness
Atomic & idempotent operations
Atomic operations
• (by ‘awareness’) ... are not aware of the existence of any other active
process, and no other active process is aware of the activity of the
processes during the time the processes are performing the atomic operation.
An atomic operation can be considered to be indivisible and instantaneous.
© 2019 Uwe R. Zimmer, The Australian National University page 501 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Indivisible phases
Commitment times
Atomic Operations
Flow of tasks
Safety & Liveness
Atomic & idempotent operations
Atomic operations
0 5 10 15 20 25 30 35 40 45 time
© 2019 Uwe R. Zimmer, The Australian National University page 502 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Atomic & idempotent operations
Atomic operations
Important implications:
1. An atomic operation is either performed in full or not at all.
2. A failed atomic operation cannot have any impact on its surroundings (must keep or re-instantiate the full initial state).
3. If any part of an atomic operation fails,
then the whole atomic operation is declared failed.
4. All parts of an atomic operations (including already completed parts) must be prepared to declare failure
until the final global commitment.
© 2019 Uwe R. Zimmer, The Australian National University page 503 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Definition of idempotent operations:
An operation is idempotent if the observable effect of the oper- ation are identical for the cases of executing the operation:
• once,
• multipletimes, • infinitelyoften.
Observations:
• Idempotent operations are often atomic, but do not need to be.
• Atomic operations do not need to be idempotent.
• Idempotent operations can ease the requirements for synchronization.
Safety & Liveness
Atomic & idempotent operations
Idempotent operations
© 2019 Uwe R. Zimmer, The Australian National University page 504 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Reliability
::= measure of success
with which a system conforms to its specification.
Failure Error Fault
::= a deviation of a system from its specification. ::= the system state which leads to a failure.
::= the reason for an error.
::= low failure rate.
Safety & Liveness
Reliability, failure & tolerance
‘Terminology of failure’ or ‘Failing terminology’?
© 2019 Uwe R. Zimmer, The Australian National University page 505 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Reliability, failure & tolerance
Faults during different phases of design
• Inconsistent or inadequate specifications
frequent source for disastrous faults
• Software design errors
• Component & communication system failures
© 2019 Uwe R. Zimmer, The Australian National University page 506 of 757 (chapter 7: “Safety & Liveness” up to page 512)
frequent source for disastrous faults rare and mostly predictable
Safety & Liveness
Reliability, failure & tolerance
Faults in the logic domain
• Non-termination / -completion
Systems ‘frozen’ in a deadlock state, blocked for missing input, or in an infinite loop
Watchdog timers required to handle the failure • Range violations and other inconsistent states
Run-time environment level exception handling required to handle the failure • Value violations and other wrong results
User-level exception handling required to handle the failure
© 2019 Uwe R. Zimmer, The Australian National University page 507 of 757 (chapter 7: “Safety & Liveness” up to page 512)
• Transient faults
• Intermittent faults • Permanent faults
Single ‘glitches’, interference, ... very hard to handle Faults of a certain regularity ... require careful analysis Faults which stay ... the easiest to find
© 2019 Uwe R. Zimmer, The Australian National University
page 508 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Reliability, failure & tolerance
Faults in the time domain
fail never
fail uncontrolled
© 2019 Uwe R. Zimmer, The Australian National University
page 509 of 757 (chapter 7: “Safety & Liveness” up to page 512)
too early
too late
never (omission)
Constraint Value
Time domain
Value domain
fail silent
fail stop
fail controlled
Safety & Liveness
Reliability, failure & tolerance
Observable failure modes
Failure modes
error
error
Safety & Liveness
Reliability, failure & tolerance
Fault prevention, avoidance, removal, ...
and / or
Fault tolerance
© 2019 Uwe R. Zimmer, The Australian National University page 510 of 757 (chapter 7: “Safety & Liveness” up to page 512)
• Full fault tolerance
Safety & Liveness
Reliability, failure & tolerance
Fault tolerance
the system continues to operate in the presence of ‘foreseeable’ error conditions , without any significant loss of functionality or performance — even though this might reduce the achievable total operation time.
• Graceful degradation (fail soft)
the system continues to operate in the presence of ‘foreseeable’ error conditions,
• Fail safe
Full fault tolerance is not maintainable for an infinite operation time!
while accepting a partial loss of functionality or performance. the system halts and maintains its integrity.
Graceful degradation might have multiple levels of reduced functionality.
© 2019 Uwe R. Zimmer, The Australian National University page 511 of 757 (chapter 7: “Safety & Liveness” up to page 512)
• Liveness • Fairness
• Safety
• Deadlockdetection • Deadlockavoidance • Deadlockprevention
• Atomic & Idempotent operations • Definitions & implications
• Failure modes
• Definitions, fault sources and basic fault tolerance
© 2019 Uwe R. Zimmer, The Australian National University
page 512 of 757 (chapter 7: “Safety & Liveness” up to page 512)
Safety & Liveness
Summary
Safety & Liveness