Assumptions so far
l E.g., assume no message is lost
l E.g., assume each peer is honest
l Those assumptions might not hold in practice
l This leads to the field of dependable computing
Fault Model
l Crash Failure
l Permanently dead vs. recoverable after a finite period of time
n Under asynchronous model
l A crash can never be detected with certainty
l Since there is no lower bound on things like processor speed or channel speed, a slow process cannot be told apart from a crashed one
n Under synchronous model
l Can be detected using timeout
l No matter whether it is recoverable or not, on failure we assume it fails like this:
n 1) the program crashes
l i.e., it fails like segmentation fault
l Rule out the case of ‘exception caught’ in Java
n 2) others can detect its failure
n 3) the internal state is lost (i.e., state in RAM lost)
n This kind of crash failure model is called “fail-stop”
n Most protocols only deal with fail-stop crash failures
l A protocol that can’t tolerate fail-stop failures won’t be able to tolerate arbitrary crash failures
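The contrast between the two timing models above can be sketched in a few lines. This is only an illustration under stated assumptions: the bound `MAX_ROUND_TRIP`, the helper `suspects_crash`, and the sample delays are hypothetical and do not come from the slides.

```python
from typing import Optional

# Hypothetical bound: in a synchronous model, a reply from a live process
# always arrives within MAX_ROUND_TRIP seconds.
MAX_ROUND_TRIP = 2.0


def suspects_crash(reply_delay: Optional[float], timeout: float) -> bool:
    """Return True if the observer declares the peer crashed.

    reply_delay is how long the reply took, or None if the peer has
    crashed and never replies.
    """
    return reply_delay is None or reply_delay > timeout


# Synchronous model: every live peer replies within the bound, so a
# timeout equal to the bound never misclassifies a live peer.
live_delays_sync = [0.1, 1.5, 2.0]
sync_verdicts = [suspects_crash(d, MAX_ROUND_TRIP) for d in live_delays_sync]

# Asynchronous model: no bound exists, so whatever timeout we pick, a live
# peer may simply be slower than that timeout.
chosen_timeout = 2.0
slow_but_alive = chosen_timeout * 10
async_false_alarm = suspects_crash(slow_but_alive, chosen_timeout)

# A crashed peer never replies and is suspected in both models.
crashed_verdict = suspects_crash(None, chosen_timeout)
```

Raising `chosen_timeout` does not help in the asynchronous case: a live peer can always be slower than any fixed value, which is why the crash cannot be detected with certainty there.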
Fault Model
l Crash Failure
l Omission Failure
l Messages are lost due to a poor network, router problems, etc.
l In distributed systems, we assume this is masked by the networking protocol (e.g., TCP) and won’t happen
l Byzantine Failure
l Embraces every possible failure
l When I send a value x=5 to M
n M might forward x as 6, x as 5, x as 7, ...
n M might not forward x at all
l Due to various reasons such as software bugs, malicious behavior, etc.
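To make the three fault classes concrete, here is a minimal sketch of a relay M under each of them. All names (`correct_relay`, `omission_relay`, `make_byzantine_relay`) are hypothetical, and the Byzantine relay covers only three behaviours (forward, alter, drop), whereas the real model allows any behaviour at all.

```python
import random
from typing import Callable, Optional

# A relay takes the value it received and returns what it forwards,
# or None if it forwards nothing.
Relay = Callable[[int], Optional[int]]


def correct_relay(x: int) -> Optional[int]:
    """Fault-free M: forwards exactly what it received."""
    return x


def omission_relay(x: int) -> Optional[int]:
    """Omission failure: the message is lost and nothing is forwarded."""
    return None


def make_byzantine_relay(rng: random.Random) -> Relay:
    """Byzantine M: on each message it may forward, alter, or drop the value."""

    def relay(x: int) -> Optional[int]:
        behaviour = rng.choice(["forward", "alter", "drop"])
        if behaviour == "forward":
            return x
        if behaviour == "alter":
            return x + rng.randint(1, 3)  # a wrong value, e.g. 6 or 7 for x=5
        return None

    return relay


byzantine = make_byzantine_relay(random.Random(0))
observed = [byzantine(5) for _ in range(20)]
```

A receiver looking at `observed` cannot tell from a single message whether M forwarded the right value, which is what makes this fault class harder to handle than a crash.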
Fault-Tolerant Systems
l P: the states that satisfy the safety properties
l Given a set of fault actions F
l The fault span Q is the maximal set of configurations/states that the system can get into when the actions in F happen
l A system is F-tolerant when, after all F-actions are gone, the set of system states goes back to P
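The definitions of P, F and Q can be tried out on a toy system. The sketch below is an assumption made for illustration only: the state is a single integer, the safe set is `P = {0, 1, 2}`, the only fault action adds 3 to the state, and the program includes a recovery action that subtracts 3 again.

```python
from typing import Iterable, Set

P = {0, 1, 2}  # hypothetical safe states


def program_step(x: int) -> int:
    """Program action: cycle inside P; outside P, take a recovery step."""
    if x in P:
        return (x + 1) % 3
    return x - 3


def fault_steps(x: int) -> Set[int]:
    """Fault actions F: corrupt a safe state by adding 3."""
    return {x + 3} if x in P else set()


def fault_span(start: Iterable[int]) -> Set[int]:
    """All states reachable from `start` via program and fault actions."""
    seen = set(start)
    frontier = list(seen)
    while frontier:
        x = frontier.pop()
        for y in {program_step(x)} | fault_steps(x):
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen


def returns_to_p(x: int, max_steps: int = 10) -> bool:
    """With faults stopped, does the program bring state x back into P?"""
    for _ in range(max_steps):
        if x in P:
            return True
        x = program_step(x)
    return x in P


Q = fault_span(P)
is_f_tolerant = all(returns_to_p(x) for x in Q)
```

Here Q is a superset of P, and the toy system counts as F-tolerant because every state in Q is led back into P once the fault actions stop.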
Types of F-tolerant system
l Tolerance masking system
l When a fault has occurred,
n It is masked, meaning the impact of its occurrence is taken away, i.e., the state space is still P
l Masking means both liveness and safety of your system are preserved
l Tolerance non-masking system
l Faults may temporarily affect the system and violate safety, i.e., the state space is Q
l Liveness is not compromised and the state space eventually goes back to P
n E.g., temporary lag when watching a movie online
n “Backward recovery” (offline recovery)
l E.g., a database crashes and recovers from a checkpoint and an undo/redo log
n “Forward recovery” (self-stabilizing system)
l Goes on live with a short period of inconsistent state but eventually gets back to P
l Fail-safe system
l When a fault has occurred,
n Safety is preserved
n But there is no guarantee that liveness is preserved
l Degraded system
l E.g., the response time is just slower
l E.g., files are read-only (safe mode) after a fault has occurred
P: states that are safe
Q: fault span states
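The backward-recovery example (a database recovering from a checkpoint and a log) can be sketched as below. `TinyStore` and its methods are hypothetical names, and the sketch is deliberately simplified: it keeps only a redo log and has no transactions, so there is nothing to undo.

```python
from typing import Dict, List, Tuple


class TinyStore:
    """Hypothetical key-value store with a checkpoint and a redo log."""

    def __init__(self) -> None:
        self.memory: Dict[str, int] = {}            # volatile state, lost on crash
        self.checkpoint: Dict[str, int] = {}        # assumed to be on stable storage
        self.redo_log: List[Tuple[str, int]] = []   # assumed to be on stable storage

    def put(self, key: str, value: int) -> None:
        self.redo_log.append((key, value))  # log the update before applying it
        self.memory[key] = value

    def take_checkpoint(self) -> None:
        self.checkpoint = dict(self.memory)
        self.redo_log.clear()  # updates up to here are covered by the checkpoint

    def crash(self) -> None:
        self.memory = {}  # fail-stop: the state in RAM is lost

    def recover(self) -> None:
        # Backward recovery: return to the last checkpoint, then redo the log.
        self.memory = dict(self.checkpoint)
        for key, value in self.redo_log:
            self.memory[key] = value


store = TinyStore()
store.put("a", 1)
store.take_checkpoint()
store.put("b", 2)
store.put("a", 3)
store.crash()
state_after_crash = dict(store.memory)
store.recover()
state_after_recovery = dict(store.memory)
```

Between `crash()` and `recover()` the store is unavailable, which is why this style of recovery is described as offline.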
How to detect crash-fault?
l Due to FLP, in an asynchronous system, fault detection can’t be both complete and accurate
l Complete: you don’t miss any faulty process
l Accurate: you won’t get any false alarm (i.e., a process is healthy but you regard it as failed because it is simply too slow)
l To a distributed protocol, which one is the lesser evil?
l We prefer complete over accurate
l Then how to detect it? Fault-detection “algorithm”
n Ping-ack (source-driven; demand-driven)
n Heart-beating (target-driven)
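A heartbeat detector can be sketched as follows. The class `HeartbeatDetector`, the process names and the timestamps are all hypothetical, and time is passed in explicitly rather than read from a clock so that the run is reproducible.

```python
from typing import Dict, Set


class HeartbeatDetector:
    """Suspects any process whose last heartbeat is older than `timeout`."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, process: str, now: float) -> None:
        """Called when a heartbeat from `process` arrives at time `now`."""
        self.last_seen[process] = now

    def suspected(self, now: float) -> Set[str]:
        """Processes currently regarded as failed."""
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3.0)
for name in ("A", "B", "C"):
    detector.heartbeat(name, now=0.0)

# A is healthy and sends a heartbeat every second.
# B crashes right after time 0. C is alive but very slow.
for t in (1.0, 2.0, 3.0, 4.0):
    detector.heartbeat("A", now=t)
suspects_at_4 = detector.suspected(now=4.0)

# C's late heartbeat finally arrives, together with A's regular one.
detector.heartbeat("C", now=5.0)
detector.heartbeat("A", now=5.0)
suspects_at_5 = detector.suspected(now=5.0)
```

At time 4 the detector suspects both B and C: it is complete (the crashed B is caught) but not accurate (the slow C is a false alarm). Once C's heartbeat arrives, the false alarm is withdrawn and only B remains suspected.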