Distributed Computing, COMP 4001
Fault Tolerant Broadcasting
Evangelos Kranakis, Carleton University, SCS (November 7, 2020)
Distributed Computing, COMP 4001 2
Outline
• Broadcasting
• Fault-free Broadcasting
• Fault-Tolerant Broadcasting
• Multiple Initiators
Broadcasting (1/2)
• Broadcasting refers to sending a message simultaneously to many users.
• It is usually initiated by a user in a network.
• We are interested in efficient broadcasting, as measured by
– the number of messages, and
– the time required
to complete the broadcast successfully.
• Broadcasting uses available communication channels.
– We must specify the communication channels to be used and in what order.
Broadcasting (2/2)
• Broadcasting is a preferred routing method because it is flat.
• Broadcasting is used in
– Ethernet,
– wireless,
and other networks.
• Efficiency of broadcasting depends on (and is also constrained by) the underlying graph G = (V, E).
• Broadcasting in general graphs is multihop.
– Typically, message transmission for broadcasting is based on building a BFS tree from the “broadcast initiating” node.
General Setting
• Consider a (strongly) connected network of N processes 0…N − 1.
– This may be a multi-hop graph.
• Each process i has a “stable” (unchanging) value s(i) associated with it.
• The goal is to devise an algorithm by which every process i can broadcast its value s(i) to every other process in the system.
– This may require multiple hops.
• At the end, each process i will have the set of all possible values Vi = {s(k) : 0 ≤ k ≤ N − 1}.
• Generally, the problem is solved with a so-called “heartbeat” algorithm.
Heartbeat Algorithms
• Initially Vi = {s(i)}.
• To complete the broadcast, every process i will periodically, in rounds (each round involves Send; Receive; Process):
1. send its current Vi along each of its outgoing channels,
2. receive whatever values arrive along its incoming channels, and
3. update Vi.
• The operation resembles the pumping of blood in the heart, so
these types of algorithms are called heartbeat algorithms.
• Two important issues need attention:
– The termination of the algorithm
– The message complexity
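• A minimal sketch of these rounds as a synchronous Python simulation (the adjacency list adj, the values s, and the global change test are illustrative simplifications, not part of the lecture's model):

# Heartbeat broadcast simulated in synchronous rounds.
# adj[i]: outgoing neighbours of process i; s[i]: its stable value s(i).
def heartbeat_broadcast(adj, s):
    n = len(adj)
    V = [{s[i]} for i in range(n)]               # initially Vi = {s(i)}
    while True:
        inbox = [set() for _ in range(n)]
        for i in range(n):                        # Send: Vi on every outgoing channel
            for j in adj[i]:
                inbox[j] |= V[i]
        changed = False
        for i in range(n):                        # Receive + Process: update Vi
            if not inbox[i] <= V[i]:
                V[i] |= inbox[i]
                changed = True
        if not changed:                           # no process learned anything new
            return V

# Directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0: every Vi becomes {"a", "b", "c", "d"}.
print(heartbeat_broadcast([[1], [2], [3], [0]], ["a", "b", "c", "d"]))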
Heartbeat Algorithm (1/2)
• There is no need to send Vi if it has not changed since the last send operation: it suffices to send only the incremental change.
• Each process i is associated with two sets of values:
– Vi denotes the current set of values collected so far,
– Wi represents the last value of Vi sent along the outgoing channels.
• Let (i, j) represent the channel i → j from i to j.
• The algorithm terminates when no process receives any new value, and every channel is empty.
Heartbeat Algorithm (2/2)
• The program for process i is given below.
• Correctness is proved in two steps.
– 1st step: show that when empty(i,k) holds, Wi ⊆ Vk.
– 2nd step: show that, at the end, every process has received the value s(i) of every other process i.
Fault-Tolerant Broadcasting
Main Question
• Is broadcasting possible in a network when some nodes (may) fail to transmit messages?
• Two main issues:
– In which networks?
– What does it mean for nodes to fail?
Broadcasting
• In the sequel we look at broadcasting and its efficiency when the underlying network is Kn (the complete graph on n nodes).
– In this setting, broadcasting is an instance of flat routing.
• Further, the communication model does not allow for “multicasting”, whereby a given node can communicate with several specific nodes at the same time.
Deterministic Strategies
Calling
• Alice wants to organize a party for 121 students (including herself).
• She does not know their email addresses, but
– she has a list of all 120 other students with their phone numbers, which every student recently gave to her.
Calling and Broadcasting
• Broadcasting depends on the medium (i.e., channels) used.
– If Alice could shout simultaneously to all of them, then it would take only one step (Ethernet uses this idea!).
• However, this may cause collisions, and in any case it is not an option in our current study.
• Here the communication medium allows only phone calls.
– This is the so-called point-to-point model: only one
processor can talk to another processor at a time.
– We will approach the problem by imposing “message scheduling and processor coordination”.
– We also call this the phone call model.
Broadcasting
• In the phone call model, broadcasting itself would require 120 separate channels to make the phone calls:
– which may not be available.
• Alice could try to do 120 phone calls herself,
– which would consume a lot of time.
New Idea: Assisted Broadcasting
• Other nodes can assist in the broadcast.
• Not only Alice can make calls: certain users can call other users as well.
– Therefore the design requires some form of coordination of who can send to whom and when.
Alternative Approach (1/2)
• The first strategy that comes to mind is a chain of calls, resembling the “silent post” whispering game:
– Alice just calls the first person on the list and asks him/her to call the next one on the list,
– who will then call the next one and ask him/her to do the same,
– and so on,
– until everybody on the list has been reached.
• The advantage of this strategy is that every student only has to make one call.
• To make this work, an order of users must have been agreed on.
Alternative Approach (2/2)
• Since the calls have to be performed one after the other, a very long time can go by until all students have been reached.
– The algorithm imposes an underlying “line graph” topology.
– For n students this takes time n − 1.
• Some issues:
– If just 10% of the students (i.e., 12 students) do not reach the next one on the list within the same day they were called, it takes at least 12 days until everybody has been informed.
– Even worse: if someone does not bother to call the next one on the list, the whole system will break down!
New Idea: Binary Search
• A master uses two helpers to cut a sorting problem into two smaller sorting problems,
– who themselves use two helpers each to cut their sorting problems into even smaller problems,
– and so on, …
– … until just one element is left.
• The idea resembles binary search.
• Couldn’t something similar to this also work for the distribution of calls?
Partitioning Strategy
• Alice could divide the phone list into two halves and call the first person on each of the two halves.
• Each of them will then be asked to cut their list into two further halves and call the first person on these halves.
• This is continued until everybody has been called, i.e., we reach a level in which people are called who just have to take care of an empty list.
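• A rough sketch of this halving strategy in Python (the recursive formulation and names are mine, not from the lecture):

# Each caller phones the first person of each half of the sublist handed to
# them; that first person then takes over the rest of that half.
def partition_calls(caller, sublist, calls):
    mid = len(sublist) // 2
    for half in (sublist[:mid], sublist[mid:]):
        if half:
            first, rest = half[0], half[1:]
            calls.append((caller, first))
            partition_calls(first, rest, calls)

calls = []
partition_calls("Alice", list(range(1, 121)), calls)
print(len(calls))      # 120 calls in total, one per student; the recursion depth
                       # (and hence the number of calling rounds) is O(log n)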
Partitioning
• In this way, the students can be reached much more quickly.
– For n students this takes time O(log n).
• Alice determines that just seven rounds of calls suffice to reach all 120 fellow students (if in each round every informed person makes one call, the number of informed people doubles: 1, 2, 4, …, 128 ≥ 121).
• This is much better than 120 rounds of calls!
• However, the strategy sounds very confusing/technical,
– it’s questionable whether the other students can be made to adhere to the rules without errors.
• Thus, Alice thinks about an alternative strategy.
Natural Calling Rounds
• Alice calls the first two people on the list, 1 and 2, and asks 1 to call the students at positions 3 and 4 while 2 is asked to call the students at positions 5 and 6, and so on.
– User i calls users 2i+1 and 2i+2.
• General rule: everybody at position i in the list will call the students at positions 2i + 1 and 2i + 2 (if they exist).
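• A small Python check of this rule (taking Alice as position 0), confirming that every other position receives exactly one call:

# Rule i -> 2i+1, 2i+2: record who calls whom among positions 0..n-1.
def called_by(n):
    cb = {k: [] for k in range(1, n)}
    for i in range(n):
        for j in (2 * i + 1, 2 * i + 2):
            if j < n:
                cb[j].append(i)
    return cb

cb = called_by(31)                                  # the n = 31 example below
assert all(len(cb[k]) == 1 for k in range(1, 31))   # everyone is called exactly once
print(cb[29], cb[30])                               # [14] [14], as in the example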
Example
• Assume n = 31 (Alice included):
• Rule: i → 2i + 1, 2i + 2
• Start: Alice → 1, 2
– 1 → 3, 4 and 2 → 5, 6
– 3 → 7, 8 and 4 → 9, 10 and 5 → 11, 12 and 6 → 13, 14
– 7 → 15, 16 and 8 → 17, 18 and 9 → 19, 20 and 10 → 21, 22 and 11 → 23, 24 and 12 → 25, 26 and 13 → 27, 28 and 14 → 29, 30
• Information spreads as fast as with the previous strategy,
– the calling rule seems much more natural, and
– it is easier to understand.
Not Robust Enough
• Alice is not quite happy with this calling strategy.
• What if one of her fellow students does not count right and
calls a wrong pair of students on the list?
• Moreover, there can still be a couple of students who just forget or do not bother calling their pair on the list.
• In this case, some students would not be informed.
• Therefore, Alice thinks about a more robust strategy:
New Idea: Allow Overlap
• Allow overlapping calls
– to ensure fault tolerance.
• Allow calls from multiple initiators!
– to ensure fault tolerance.
More Robust (1/2)
• One possibility would be that for each i, the person in “list” position i would call the four students at positions
2i + 1, 2i + 2, 2i + 3, 2i + 4.
• Thus, for each i,
i → 2i + 1, 2i + 2, 2i + 3, 2i + 4.
More Robust (2/2)
• In summary:
i → 2i + 1, 2i + 2, 2i + 3, 2i + 4
i + 1 → 2i + 3, 2i + 4, 2i + 5, 2i + 6
i + 2 → 2i + 5, 2i + 6, 2i + 7, 2i + 8
…
• Notice the overlap between consecutive transmissions!
Calling Patterns
• Who is going to call user k?
• Assume k = 2l is even.
– k is called by users l−1 and l−2.
l − 2 → 2(l − 2) + 1, 2(l − 2) + 2, 2(l − 2) + 3, 2(l − 2) + 4
l − 1 → 2(l − 1) + 1, 2(l − 1) + 2, 2(l − 1) + 3, 2(l − 1) + 4
• Assume k = 2l + 1 is odd.
– k is called by users l−1 and l.
l − 1 → 2(l − 1) + 1, 2(l − 1) + 2, 2(l − 1) + 3, 2(l − 1) + 4
l → 2l + 1, 2l + 2, 2l + 3, 2l + 4
• All students (except for the first four on the list who will directly be called by Alice) will be called by exactly two students in the ideal case.
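• A small Python check of this pattern (Alice at position 0), verifying the callers derived above:

# Rule i -> 2i+1, 2i+2, 2i+3, 2i+4: positions 1..4 are called only by Alice (0);
# every k >= 5 is called by exactly the two users computed above.
def called_by(n, width=4):
    cb = {k: [] for k in range(1, n)}
    for i in range(n):
        for j in range(2 * i + 1, 2 * i + width + 1):
            if j < n:
                cb[j].append(i)
    return cb

cb = called_by(400)
assert all(cb[k] == [0] for k in range(1, 5))
for k in range(5, 400):
    l = k // 2
    assert cb[k] == ([l - 2, l - 1] if k % 2 == 0 else [l - 1, l])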
Example (1/2)
• Assume n = 17 (including Alice).
• Start: Alice → 1, 2, 3, 4
• Rule: i → 2i + 1, 2i + 2, 2i + 3, 2i + 4
– 1 → 3, 4, 5, 6 and 2 → 5, 6, 7, 8
– 3 → 7, 8, 9, 10 and 4 → 9, 10, 11, 12
– 5 → 11, 12, 13, 14 and 6 → 13, 14, 15, 16
– 7 → 15, 16
Note: user 7 sends to only two users instead of four. To overcome this problem, the algorithm can wrap around to the beginning nodes 1, 2, …; we will not discuss this issue in detail.
Example (2/2)
• Information spreads as fast as with the previous strategy,
– the calling rule is now much more precise, and
– the system is now fault tolerant.
More Robust
• Thus, as long as for each such pair at most one of the students is unreliable (by not being reachable or forgetting to make the call), all of the reliable students will still be informed.
• Intuitively, this can be argued as follows:
– If one can select a caller for each student who works reliably, then everybody who is reliable has a reliable call chain from himself or herself back to Alice.
• This strategy can be made even more robust, so that she can be really sure to reach everybody who is reachable:
Even More Robust (1/2)
• For some fixed r (the parameter r is related to the desired fault tolerance):
If every student at position i calls the students at positions
2i + 1, 2i + 2, …, 2i + 2r,
then every student (except for the first 2r, who are called directly by Alice) will be called by exactly r students in the ideal case.
Even More Robust (2/2)
• In summary:
i → 2i + 1, 2i + 2, …, 2i + 2r
i + 1 → 2i + 3, 2i + 4, …, 2i + 2r + 2
i + 2 → 2i + 5, 2i + 6, …, 2i + 2r + 4
…
i + r − 1 → 2(i + r) − 1, 2(i + r), …, 2(i + r) + 2r − 2
…
• Notice the overlap of consecutive transmissions!
Calling Patterns (1/2)
• How many users will call user k?
• Use Euclidean division to divide k by 2r: let q ≥ 1 be the quotient and j < 2r the remainder, so that k = 2qr + j.
• Observe that
k = 2qr + j = 2(qr + s) + j − 2s, for all s (positive or negative).
• Recall the calling rule for i = qr + s:
i → 2i + 1, 2i + 2, . . . , 2i + 2r
• Hence user k will be called by the users i such that i = qr + s, provided that 0 < j − 2s ≤ 2r.
Calling Patterns (2/2)
• Therefore, if the parameter r was used in the design of the broadcast, each user is called by r alternate users.
• Hence, as long as at most r − 1 of these are faulty (e.g., they do not call), all reliable students will still be reached.
• Thus, the algorithm tolerates up to r − 1 faults among the callers of any single student!
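• A hedged Python simulation of this claim for the general rule i → 2i + 1, …, 2i + 2r (Alice at position 0; fault placement and parameters are illustrative):

import random

# A reliable, informed student at position i calls 2i+1, ..., 2i+2r.
# Every caller has a smaller index than the callee, so one forward pass
# over the positions reproduces the whole broadcast.
def broadcast_with_faults(n, r, faulty):
    informed = {0}                                   # Alice initiates
    for i in range(n):
        if i in informed and i not in faulty:
            for j in range(2 * i + 1, 2 * i + 2 * r + 1):
                if j < n:
                    informed.add(j)
    return informed

n, r = 500, 3
faulty = set(random.sample(range(1, n), 20))
informed = broadcast_with_faults(n, r, faulty)
missed = [k for k in range(n) if k not in faulty and k not in informed]
print(missed)   # empty whenever no student has all r of her callers faulty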
Multiple Initiators
Question: What If Alice Fails?
• A weakness of the algorithm is that it depends on a single initiator, namely Alice.
• Consider the following scenario:
– k initiator nodes wake up at the same time.
– A given number among them, say f < k, may fail.
• Can we design a fault tolerant algorithm?
Exercises (not to submit)
1. Prove the correctness of the heartbeat algorithm by proving:
(a) when empty(i, k) holds, Wi ⊆ Vk;
(b) at the end, every process has received the value s(i) of every other process i.
2. List some of the issues that may arise in the complete network Kn from the simultaneous transmission of messages by many nodes.
3. Verify the calling patterns discussed in the lecture, when every student at position i calls the students at positions
2i + 1, 2i + 2, …, 2i + 2r.
4. Show that for n students the partitioning algorithm takes time O(log n).
5. (⋆) Design a broadcast algorithm that is fault tolerant with k initiators.
6. (⋆) Extend the previous broadcast algorithm so that it is fault tolerant with fewer than r faulty participants.
7. (⋆⋆) An interesting analysis for broadcasting is the average case. For a given number x (e.g., 10) of unreliable students, who are assumed to be randomly distributed over the list, we want to determine the minimum value of r for which the probability that all reliable students are reached is still above, say, 90%. In other words,
Given n participants and a parameter 0 < p < 1, what is the minimum value of r such that
Pr[all reliable students are reached] ≥ p
Give a broadcasting algorithm and analyze its complexity. As a
hint, this can be based on an array A that is defined as follows:
(a) N is the total number of students.
(b) A: array [1 … N] of integers; A[i] counts, for a reliable student at position i, the number of calls that student would get from other reliable students.
(c) For every reliable student, A[i] is initially set to 0.
(d) For all unreliable students at position i, A[i] will initially be set to −r (so that even after r calls there will not be a positive value in A[i]).
In order to determine this r, one can use the algorithm presented below (in which Steffi plays the role of Alice).
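A possible reading of this array-based simulation in Python (the estimation loop and the parameter choices are mine, not from the original algorithm):

import random

# One trial: place x unreliable students at random, run the rule
# i -> 2i+1, ..., 2i+2r, and report whether every reliable student is reached.
def trial(N, r, x):
    faulty = set(random.sample(range(1, N + 1), x))
    A = [0] * (N + 1)                          # A[1..N]; index 0 (Alice) is unused
    for i in faulty:
        A[i] = -r                              # unreliable: A[i] never becomes positive
    for j in range(1, min(2 * r, N) + 1):      # Alice calls the first 2r students
        A[j] += 1
    for i in range(1, N + 1):                  # callers have smaller indices,
        if A[i] > 0:                           # so one increasing pass suffices
            for j in range(2 * i + 1, min(2 * i + 2 * r, N) + 1):
                A[j] += 1
    return all(A[i] > 0 for i in range(1, N + 1) if i not in faulty)

# Estimate the smallest r whose success probability is at least p.
def min_r(N, x, p, trials=1000):
    r = 1
    while sum(trial(N, r, x) for _ in range(trials)) / trials < p:
        r += 1
    return r

print(min_r(120, 10, 0.9))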
Simulate this algorithm and test its performance.
8. Design fault tolerant broadcasting algorithms assuming multiple initiators.
9. Design a probe-echo algorithm to compute the topology of a network whose topology is a strongly connected graph. When the algorithm terminates, the initiator of the algorithm should have knowledge about all the nodes and the links in the network.
10. Design an algorithm to count the total number of processes in a unidirectional ring of unknown size. Note that any process in the ring can initiate this computation, and more than one process may run the algorithm concurrently. Feel free to use process ids.
References
• Christian Scheideler. Broadcasting – How Can I Quickly Disseminate Information? In Algorithms Unplugged, B. Vöcking, H. Alt, M. Dietzfelbinger, R. Reischuk, C. Scheideler, H. Vollmer, and D. Wagner (editors), 2011.
• R. Karp, S. Shenker, C. Schindelhauer, and B. Vöcking. Randomized Rumor Spreading. In: IEEE Symposium on Foundations of Computer Science (FOCS), pp. 565–574, 2000.