Distributed Systems Foundations
REPLICATED DICTIONARY
AND LOG
CS 171 1
Replicated dictionary problem
• Efficient solutions to the replicated log and
dictionary problems. Wuu and Bernstein
PODC 84.
• Replication is a fundamental method for fault-
tolerance.
• Replication is also used for performance.
• Dictionary is a common data structure which
is often fully replicated on all sites.
• (Blockchains are also fully replicated)
CS 171 2
Replication motivation
• What if data is stored in ONE place?
CS 171 3
DATA
Replication motivation
• What if data is stored in ONE place?
– Failures cause data to be inaccessible (single point
of failure)
CS 171 4
DATA
CRASH!!
Replication motivation
• What if data is stored in ONE place?
– Failures cause data to be inaccessible (single point
of failure)
– Users far away experience large latency to access
data
CS 171 5
DATA
Replication
• Replication requires maintaining data in
multiple locations
– Identical copies?
CS 171 6
Replica A Replica B
Basic assumptions
• Very general: Sites may crash, links may fail,
partitioning.
• Asynchronous System
• Each site maintains a local clock (a counter).
• Use Lamport’s event execution model and
happens-before relation.
CS 171 7
Motivating example for causality
Micro-blogging application
Causality in micro-blogging
• A micro-blogging website has users and a
micro-blog
• One global stream of users blog
• In what order should the stream display
blogs?
• In a non-replicated deployment, order by time
of arrival.
CS 171 9
Causality between blogs
• A user A blogs blog a1 as a consequence of an
earlier blog b2 (or blogs) by another user
(users) B.
CS 171 10
Blog stream
User B: in which room is
CS171?
User A: I cannot believe
students are still asking
about class locations!
Causality between blogs
• This causality is trivially captured in non-
replicated systems. User A made a comment
after seeing what caused it (blog by User B) in
the stream, which means it already exists.
CS 171 11
SERVER
Causality between blogs
• For three replicas, A, B, and C. If blogs are just
replicated asynchronously it will not
guarantee causality.
• Naïve-replication:
– When a blog is posted send to all replicas.
– When a replica receives a blog, append to log.
CS 171 12
Causality between blogs
• Blog a is caused by blog b2,
• at replica C blog a is ordered before blog b2;
• problem!
CS 171 13
Replica A
Replica B
Replica C
Blog b2
Blog a
Blog b1
Log stores
• An append operation inserts a record at the
head of the log
• Changing the content of a record: delete
record and then insert
• Reading log => all log content.
insert(a) Insert(b) Insert(c) Delete(b) Insert(d)
Head of log
CS 171 14
The Log- Example
Site 1
Log
The Log- Example
• I(x)
Site 1
I(x)
Log
The Log- Example
• I(y)
Site 1
I(x) I(y)
Log
The Log- Example
• D(y)
Site 1
I(x) I(y) D(y)
Log
The Log- Example
• D(x)
Site 1
I(x) I(y) D(y) D(x)
Log
The log problem
• Each site maintains a copy of the log.
• The log contains local events, i.e.,
– insert
– delete
• The goal of the algorithm is to keep all copies
of the log up to date.
• Li is the copy of the log at site i.
• L(e) is the contents of log Lnode(e) immediately
after event e is executed.
CS 171 20
An example of a log
• Logs are transmitted between replicas
• What to include in the transmitted log?
• Lb = {insert(x,20)}
• La = {insert(y,50),insert(z,100)}
CS 171 21
Replica A
Replica B
insert(z,100)
Lb
insert(y,50)
insert(x,20)
La
The log problem
• Log Problem: find an algorithm that maintains
the log such that given an execution
events e,f: if f e then f is in L(e)
• General approach:
– For each local event, insert a record in the local log.
– Exchange logs to update other sites
• Main question: when to exchange logs? With
application communication to capture the
happens before relation.
CS 171 22
Attempt at Solution 1
• A solution:
– Site i sends to site j all records in the log that were inserted
since i last sent a message to j.
– WHY INCORRECT?
– Log message lost – log messages reordered
23
Replica A
Replica B
T1=insert(x,20)
insert(x,20) insert(w,60)
T2=insert(w,60)
CS 171
Attempt at Solution 2
• Another solution:
– each site i includes its log Li with each message.
– On receiving a message, a site j incorporates all new
event records.
–BAD:
• Entire log sent with each message
• Entire log kept at each node.
CS 171 24
Replica A
Replica B
T1=Lb
insert(x,20) insert(w,60)
T2=Lb
Efficient solution for log problem
• Observation 1: Once i “knows” that j
“knows” of an event e (which may have
occurred on site k), then i does not need to
include event e in messages to j.
• Observation 2: Once i “knows” that all sites
“know” about an event e, then i does not
need to keep a record of e in its local log.
CS 171 25
2 Dimensional Time-Table
• TTi[n,n]:
• if TTi[j,k] = t,
then site i knows that site j has learned of all
events that occurred at site k up to time t.
tj
k
CS 171 26
The 2 dimensional timetable
• site j might actually know about more
events, but site i may not be aware of it.
• TTi[i,i] is the value of clock at site i.
• TTi[i,k] is the value of clock at site k of the
most recent event at site k that site i is
aware of.
CS 171 27
Two dimensional timetable
• Let hasrec(TTi, e, k) be true iff
TTi[k,node(e)] >= time(e)
• The algorithm must guarantee that if
hasrec(TTi, e, k) is true, then site k has
learned of event e.
• Note: site i need not send a record of
event e to site k if hasrec(TTi, e, k) is true.
CS 171 28
The Replicated Log- Example
Site 1
Log
0 0 0
0 0 0
0 0 0
TimeTable
Site 1
Log
0 0 0
0 0 0
0 0 0
TimeTable
Site 2
Log
0 0 0
0 0 0
0 0 0
TimeTable
Site 3
Log
0 0 0
0 0 0
0 0 0
TimeTable
Site 1
I(X)Log
1 0 0
0 0 0
0 0 0
TimeTable
Site 2
Log
0 0 0
0 0 0
0 0 0
TimeTable
Site 3
Log
0 0 0
0 0 0
0 0 0
TimeTable
I(X)
Site 1
I(X) I(Y)Log
2 0 0
0 0 0
0 0 0
TimeTable
Site 2
Log
0 0 0
0 0 0
0 0 0
TimeTable
Site 3
Log
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
Site 1
I(X) I(Y)Log
2 0 0
0 0 0
0 0 0
TimeTable
Site 2
I(Z)Log
0 0 0
0 1 0
0 0 0
TimeTable
Site 3
Log
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
I(Z)
Site 1
I(X) I(Y) D(Y)Log
3 0 0
0 0 0
0 0 0
TimeTable
Site 2
I(Z)Log
0 0 0
0 1 0
0 0 0
TimeTable
Site 3
Log
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
I(Z)
D(Y)
Site 1
I(X) I(Y) D(Y)Log
3 0 0
0 0 0
0 0 0
TimeTable
Site 2
I(Z)Log
0 0 0
0 1 0
0 0 0
TimeTable
Site 3
Log
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
I(Z)
D(Y)S(2)
Site 1 knows that site 2
knows till clock 0 about
its local events
Site one send all its log
local events with clock > 0
Site 1 sends [(I(X),1),
(I(Y),2), (D(Y),3)] [] [] and
its timetable
How to update the timetable?
3 0 0
0 0 0
0 0 0
+
0 0 0
0 1 0
0 0 0
=
0 0 0
0 1 0
0 0 0
•Max of all elements
3
•Max of times in local i’th row and remote k’th row.
3
Log maintenance
• Initialize all entries in TT to 0.
• For each local operation, insert a copy in the local log.
• With each send operation from site i to site k: piggyback
TT+ the following subset of the local log Li: all records e
such that hasrec(TTi, e, k) is not true.
• On receipt of a message from site k by site i:
– incorporate all new events into local log
– update TT:
• Max of all elements
• Max of times in local i’th row and remote k’th row.
CS 171 37
Dictionary problem
• Assume we want to maintain a replicated
dictionary with insert, delete and lookup.
• On receipt of a message with partial log & TT:
– Update local copy of the dictionary
– Update local copy of TT as before
– Garbage collect local log from any records that
correspond to events e such that
⌐ site j such that hasrec(TTi, e, j) is not true
CS 171 38
Replicated Dictionary- Example
Site 1
Log Dictionary
Replicated Dictionary- Example
• I(x)
Site 1
I(x)
x …
Log
Dictionary
Replicated Dictionary- Example
• I(y)
Site 1
I(x) I(y)
X …
y …
Log
Dictionary
Replicated Dictionary- Example
• D(y)
Site 1
I(x) I(y) D(y)
X …
y …
Log
Dictionary
Replicated Dictionary- Example
• D(x)
Site 1
I(x) I(y) D(y) D(x)
X …
Log
Dictionary
Site 1
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
Site 2
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
Site 3
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
Site 1
I(X) X …Log
Dictionary
1 0 0
0 0 0
0 0 0
TimeTable
Site 2
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
Site 3
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
I(X)
Site 1
I(X) I(Y) X …
Y …
Log
Dictionary
2 0 0
0 0 0
0 0 0
TimeTable
Site 2
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
Site 3
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
Site 1
I(X) I(Y) X …
Y …
Log
Dictionary
2 0 0
0 0 0
0 0 0
TimeTable
Site 2
I(Z) Z …Log
Dictionary
0 0 0
0 1 0
0 0 0
TimeTable
Site 3
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
I(Z)
Site 1
I(X) I(Y) D(Y) X …
Y …
Log
Dictionary
3 0 0
0 0 0
0 0 0
TimeTable
Site 2
I(Z) Z …Log
Dictionary
0 0 0
0 1 0
0 0 0
TimeTable
Site 3
Log
Dictionary
0 0 0
0 0 0
0 0 0
TimeTable
I(X) I(Y)
I(Z)
D(Y)