COMP5349 – Cloud Computing
Week 11: Consistency in Cloud Storage and DB Services
Dr. Ying Zhou
School of Computer Science
Outline
n Consistency in Cloud Storage and DB Services
Individual customized solutions
n Paxos Algorithm
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-2
Cloud Storage and DB Services
n Run on a large fleet of nodes
n Data are partitioned and replicated
In GFS, a large file is split into many fixed-size chunks; each chunk is replicated on three different nodes (by default)
In Bigtable, a large table is split into many tablets, the actual tablet files (logs and SSTables) are replicated by GFS
Windows Azure Storage adopts a similar layered design, with storage optimized using erasure coding
In Dynamo, there is no table concept; the whole database is a collection of key/value pairs. Each key/value pair is stored on a specific node based on its key’s hash value and is replicated on the two subsequent nodes (by default), as illustrated in the sketch after this list.
In Aurora, the database layer consists of MySQL or Postgres instances organized in a master/slave replication setup; the actual data is partitioned into 10 GB segments, each replicated six times by the underlying storage service
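The Dynamo-style placement can be illustrated with a small consistent-hashing sketch. This is not Dynamo's actual implementation (which also uses virtual nodes and per-key preference lists); the node names and helper functions below are made up for illustration.

```python
# Illustrative sketch of Dynamo-style placement: hash the key onto a ring of nodes,
# store it on the first node at or after that position, and replicate it on the next
# two nodes clockwise (three copies in total). Names below are hypothetical.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical node names

def ring_position(name: str, ring_size: int = 2**32) -> int:
    # Map a key or node name to a position on the hash ring.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % ring_size

def replica_nodes(key: str, copies: int = 3):
    ring = sorted(NODES, key=ring_position)
    key_pos = ring_position(key)
    # First node whose position is >= the key's position; wrap around if there is none.
    start = next((i for i, n in enumerate(ring) if ring_position(n) >= key_pos), 0)
    return [ring[(start + k) % len(ring)] for k in range(copies)]

print(replica_nodes("user:42"))  # coordinator node plus the next two nodes on the ring
```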
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-3
How to keep replicas consistent
n There are two ways of looking at consistency:
From the client's point of view
From the server's point of view
n From the client's point of view
There are many processes performing read/write operations on a distributed storage/DB system
Strong consistency: after a process completes a write operation, subsequent reads should return the updated value
Weak consistency: after a process completes a write operation, some reads may return an old value
Eventual consistency: if no new writes are made to the object, eventually all reads will return the latest updated value
https://www.allthingsdistributed.com/2008/12/eventually_consistent.html
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-4
How to keep replicas consistent
n The consistency level observed by the client depends on how the system (server side) handles writes and reads
To ensure strong consistency, do we need to keep all replicas the same after a write request is complete?
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-5
Revisit: Write and Read in GFS
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-6
1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas.
Grant a new lease if no one holds one
2. The master replies with the identity of the primary and the locations of the other (secondary) replicas
Cached in the client
3. The client pushes the data to all the replicas
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization.
5. The Primary forwards the write request to all secondary replicas.
6. Secondaries signal completion.
7. Primary replies to the client. Errors are handled by retrying (a minimal sketch of this flow follows below).
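A minimal sketch of the write path above, assuming hypothetical helper names (master_lookup, push_data, assign_serial and apply_mutation are illustrative, not the real GFS API):

```python
# Simplified sketch of the GFS write path described above (helper names are made up).
def gfs_write(client, chunk_id, data):
    # Steps 1-2: ask the master for the primary (lease holder) and the secondaries;
    # the client caches this information for future mutations.
    primary, secondaries = client.master_lookup(chunk_id)
    replicas = [primary] + secondaries

    # Step 3: push the data to all replicas (data flow is decoupled from control flow).
    for replica in replicas:
        client.push_data(replica, chunk_id, data)

    # Step 4: once every replica has acknowledged the data, send the write request to
    # the primary, which assigns a serial number to order mutations from all clients.
    serial = primary.assign_serial(chunk_id)

    # Steps 5-6: the primary forwards the request; secondaries apply it in serial order
    # and signal completion.
    acks = [s.apply_mutation(chunk_id, serial, data) for s in secondaries]

    # Step 7: the primary replies to the client; on failure the client simply retries,
    # which can leave replicas inconsistent or with duplicate records (weak consistency).
    return all(acks)
```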
Revisit: GFS write and read
1. The client translates the file name and byte offset into a chunk index and sends the request to the master.
2. The master replies with the chunk handle and the locations of the replicas; the client caches this information.
3. The client sends a request to a close replica, specifying the chunk handle and byte range.
4. The chunk server sends the chunk data to the client.
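A corresponding minimal sketch of the read path, again with hypothetical helper names (master_lookup, cache, closest and read are assumptions, not the real GFS client API):

```python
# Simplified sketch of the GFS read path described above (helper names are made up).
CHUNK_SIZE = 64 * 2**20  # GFS chunks are 64 MB by default

def gfs_read(client, filename, byte_offset, length):
    # Step 1: translate (file name, byte offset) into a chunk index and ask the master.
    chunk_index = byte_offset // CHUNK_SIZE
    handle, replicas = client.master_lookup(filename, chunk_index)

    # Step 2: cache the chunk handle and replica locations on the client.
    client.cache[(filename, chunk_index)] = (handle, replicas)

    # Steps 3-4: contact a single close replica with the chunk handle and byte range;
    # the chunk server returns the data directly, without involving the master again.
    replica = client.closest(replicas)
    return replica.read(handle, byte_offset % CHUNK_SIZE, length)
```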
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-7
GFS: Read and Write
n Read request only contacts a single replica
n Write request contacts all replicas and the acknowledgement is sent after all replicas have received the data and acted on it
Eager propagation
n GFS only achieves weak consistency
It does not have a rollback mechanism: a replica may fail to write the data to disk after receiving it, leaving the data in an inconsistent state among the replicas
Simple retrying may create duplicate records in some replicas
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-8
Revisit: Write and Read in Bigtable
n Each tablet is served by a single tablet server
Both reads and writes are managed by this server
n Write path
The latest write content is kept in the memory of the tablet server
The write request is also persisted in a commit log (GFS files)
n Bigtable has strong consistency
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-9
Revisit: Dynamo/Aurora Read and Write
n Dynamo and Aurora use a quorum-based technique
Obtain a minimum number of votes before carrying out an operation
In the context of a replicated storage/database system, a minimum number of replicas must respond before the write or read result is returned to the client
With N replicas, a write quorum W and a read quorum R, the rules for strong consistency are (see the sketch below):
¡ W > N/2
¡ W+R > N
¡ Typical value combinations:
• N=3, W=2, R=2
• N=6, W=4, R=3 (Amazon Aurora)
In a typical setting, quorum members are replicas; Dynamo uses a sloppy quorum, which allows nodes without a data copy to participate in read/write operations
¡ Dynamo achieves eventual consistency
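The two quorum rules above can be written down directly; this tiny check is illustrative only:

```python
# Sketch: checking the strong-consistency quorum rules for N replicas.
def strongly_consistent(N: int, W: int, R: int) -> bool:
    # W > N/2: two concurrent writes cannot both obtain a write quorum.
    # W + R > N: every read quorum overlaps the latest write quorum, so at least
    # one replica in the read quorum holds the most recently written value.
    return W > N / 2 and W + R > N

assert strongly_consistent(N=3, W=2, R=2)
assert strongly_consistent(N=6, W=4, R=3)        # Amazon Aurora's setting
assert not strongly_consistent(N=3, W=1, R=1)    # reads may return stale values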
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-10
In summary
n Read/Write operations in replicated database systems can be implemented in various ways to achieve different levels of consistency
n How many replicas to contact for read and write operations?
n How many responses to wait for before responding to the client?
n E.g. (see the sketch below)
Contact N replicas, wait for N responses
¡ GFS write
Contact 1 replica, wait for 1 response
¡ GFS read
Contact N replicas, wait for W or R responses (W < N, R < N)
¡ Dynamo/Aurora read and write
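The "contact N, wait for only W (or R)" pattern from the summary can be sketched with a thread pool; replica.write below is a hypothetical call standing in for any system-specific RPC:

```python
# Sketch: send a write to all N replicas but acknowledge the client after W responses.
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, key, value, W):
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.write, key, value) for r in replicas]  # contact N replicas
    acks = 0
    for future in as_completed(futures):
        if future.result():            # one replica acknowledged the write
            acks += 1
        if acks >= W:                  # enough responses: reply to the client now
            pool.shutdown(wait=False)  # stragglers keep running in the background
            return True
    pool.shutdown(wait=False)
    return False                       # fewer than W acknowledgements: the write fails
```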
n It is easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which is ever chosen
Proposer p completes phase 1 for a proposal number n1; another proposer q then completes phase 1 for a proposal number n2 > n1
Proposer p's phase 2 accept requests for the proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2
Proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored, and so on
n To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals.
Like the coordinator in a replicated system
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-35
Non-progress situation
Proposer A sends prepare(2) to acceptors X, Y and Z and receives promise(2) from all of them.
Before A's accept(n=2, v=5) requests arrive, proposer B sends prepare(4) and collects promise(4) from the acceptors, so A's accept requests are ignored.
A then sends prepare(5) and collects promise(5), which in turn causes B's accept(n=4, v=10) requests to be ignored; the pattern can repeat indefinitely.
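A minimal acceptor model reproduces this trace; the Acceptor class below is a simplified sketch, not a full Paxos implementation:

```python
# Sketch of the livelock above with a minimal Paxos acceptor.
class Acceptor:
    def __init__(self):
        self.promised_n = 0    # highest proposal number this acceptor has promised
        self.accepted = None   # (n, value) of the highest-numbered accepted proposal

    def prepare(self, n):
        # Promise to ignore proposals numbered below n, if n is the highest seen so far.
        if n > self.promised_n:
            self.promised_n = n
            return "promise"
        return "ignore"

    def accept(self, n, value):
        # Accept only if no higher-numbered prepare has been promised in the meantime.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "ignore"

X, Y, Z = Acceptor(), Acceptor(), Acceptor()
acceptors = [X, Y, Z]

print([a.prepare(2) for a in acceptors])     # A: all acceptors promise(2)
print([a.prepare(4) for a in acceptors])     # B arrives first: all promise(4)
print([a.accept(2, 5) for a in acceptors])   # A's accept(n=2, v=5): all 'ignore'
print([a.prepare(5) for a in acceptors])     # A retries: all promise(5)
print([a.accept(4, 10) for a in acceptors])  # B's accept(n=4, v=10): all 'ignore'
```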
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-36
Outline
n Consistency in Cloud Storage and DB Services
n Paxos Algorithm
Basic Algorithm
Running Multiple Paxos in Replicated System
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-37
Implementing a State Machine
n A single instance of the Paxos algorithm only allows one value to be chosen
n Any real system involves a lot of data, replicated on a number of servers, and the data can be updated by multiple clients
n Each replica server can be viewed as a deterministic state machine that performs client commands in some sequence
n The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state.
n For strong consistency, the system needs to ensure that
All (or a majority of) the replica servers execute the client commands
They execute the commands in the same order (see the sketch below)
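A toy deterministic state machine shows why the two conditions matter; the key/value commands below are made up for illustration:

```python
# Sketch: replicas that apply the same commands in the same order reach the same state.
class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Each step takes a command, updates the state and produces an output.
        op, key, value = command
        if op == "set":
            self.state[key] = value
            return "ok"

log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]   # agreed command sequence
r1, r2 = KVStateMachine(), KVStateMachine()
for cmd in log:
    r1.apply(cmd)
    r2.apply(cmd)
assert r1.state == r2.state == {"x": 3, "y": 2}

# Applying the same commands in a *different* order may produce a different state.
r3 = KVStateMachine()
for cmd in reversed(log):
    r3.apply(cmd)
assert r3.state != r1.state    # x ends up as 1 instead of 3
```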
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-38
Replicated System
[Diagram: clients C1 and C2 send queries qa and qb to replicas R1, R2 and R3; without coordination, the replicas may receive and execute the two queries in different orders.]
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-39
Multiple Paxos
n This is implemented as an infinite number of Paxos instances, corresponding to the number of client commands a distributed system can receive
n Assuming the set of servers is fixed, each server should play all the roles (proposer, acceptor and learner)
All instances would have the same set of agents.
n A single server is elected as the distinguished proposer
n Each Paxos instance is numbered and represents the ith command the system should run
The ith instance runs the Paxos algorithm to choose the actual command to run in the ith step; the value of a proposal is an actual client request, e.g. update row 1 of table 2 by setting column 3's value to 4
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-40
Multiple Paxos in replicated system
[Diagram: one of the replicas R1, R2, R3 acts as the distinguished proposer; clients C1 and C2 send their queries qa and qb to it.]
All clients send commands/queries to the distinguished proposer
It might run the 1st Paxos instance to reach agreement with all replicas that the first command to run is qa
At the same time, it can run the 2nd Paxos instance to make sure that all replicas know that the second command to run is qb
Any replica that knows the chosen values of the 1st and 2nd Paxos instances can go ahead and execute the queries
A replica may learn the value of the 2nd instance before the 1st; in that case, it must wait until it learns the value of the 1st instance (see the sketch below)
The distinguished proposer runs an infinite number of Paxos instances; the ith instance is numbered and represents the ith command/query the system should run
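The waiting behaviour described above can be sketched as a simplified learner (not a full replica):

```python
# Sketch: a replica may learn chosen values out of order, but executes them in order.
class Learner:
    def __init__(self):
        self.chosen = {}       # Paxos instance number -> chosen command
        self.next_to_run = 1   # lowest instance whose command has not been executed
        self.executed = []

    def learn(self, instance, command):
        self.chosen[instance] = command
        # Execute the longest contiguous prefix of instances whose values are known.
        while self.next_to_run in self.chosen:
            self.executed.append(self.chosen[self.next_to_run])
            self.next_to_run += 1

r = Learner()
r.learn(2, "qb")                  # value of the 2nd instance arrives first: must wait
assert r.executed == []
r.learn(1, "qa")                  # now both commands can run, in instance order
assert r.executed == ["qa", "qb"]
```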
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-41
Multiple Paxos at the same time
[Diagram: the distinguished proposer runs the 1st and 2nd Paxos instances concurrently, sending 1st prepare(1) and 2nd prepare(1) to the replicas and then 1st accept(n=1, v=qa) and 2nd accept(n=1, v=qb).]
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-42
Leadership Changeover
n Suppose the previous leader has just failed and a new leader has been elected.
The new leader is one of the replicas; it should know most of the commands that have been chosen, but its knowledge may differ slightly from that of the other replicas
Suppose “it knows commands 1–134, 138, and 139—that is, the values chosen in instances 1–134, 138, and 139 of the consensus algorithm”
The new leader has a knowledge gap for instances 135-137, which could be caused by:
¡ The previous leader had not executed phase 2 of these instances
¡ The previous leader had executed phase 2 but only some replicas received the message
The new leader also does not know whether any value has been proposed or chosen by the old leader in any Paxos instance after 139
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-43
New leader catching up
n Executes phase 1 of instances 135–137 and of all instances greater than 139
If another replica received a phase 2 request from the old leader for any of these instances, it responds to the new leader with the accepted value
Otherwise, it responds with just a promise
n Suppose that in instances 135 and 140 some replicas have accepted values proposed by the old leader
They would respond with their accepted values
n The new leader then executes the 2nd phase of these two instances to confirm the chosen values
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-44
New leader catching up (illustration)
Before failing, the old leader sent 135th accept(n=1, v=qx) and 140th accept(n=1, v=qy), which only some of the replicas R1-R5 received.
The new leader sends 135th prepare(n=2) to all replicas.
Two replicas reply with 135th promise(n=2) together with accepted (n=1, v=qx); another replies with 135th promise(n=2) only.
The new leader then sends 135th accept(n=2, v=qx) to confirm qx as the chosen value of instance 135.
The 140th instance is handled in a slightly different way.
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-45
New leader catching up (cont’d)
n Now the system has chosen values for instances 1-135 and 138-140
All replicas can execute commands 1-135
They cannot execute commands 138-140 because there is a gap: no value has been chosen for 136 and 137
n The gap could be caused by
The old leader only executed the first phase of these two instances, or
None of the replicas received the second phase message
n The new leader has two options to fill in the gap
Assign the next two commands issued by clients as the values of 136 and 137
¡ But they were likely issued after the 138-140 commands
Assign a special no-op value to both (see the sketch below)
n The new leader has already executed the first phase of those instances
It executes the second phase with the special value no-op
Now all replicas can go ahead and execute commands up to 140
n The new leader then goes ahead, assigning newly received commands to instance 141 and later instances
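A small sketch of the no-op option; the command names are made up and the instance numbers follow the lecture's scenario:

```python
# Sketch: fill the gaps left by the old leader with no-op commands so that execution
# can proceed past them.
chosen = {i: f"cmd{i}" for i in list(range(1, 136)) + [138, 139, 140]}  # known values

gaps = [i for i in range(1, max(chosen) + 1) if i not in chosen]
assert gaps == [136, 137]

for i in gaps:
    chosen[i] = "no-op"   # proposed and chosen through phase 2 of instances 136 and 137

# Instances 1..140 now form a contiguous prefix, so every replica can execute them all.
assert all(i in chosen for i in range(1, 141))
```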
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-46
The need for running infinite instances
n A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm.
E.g. In previous scenario, a new leader needs to “Executes phase 1 of instances 135–137 and of all instances greater than 139”
n Any leader can send proposals for various commands concurrently
n It does not need to wait for a value to be chosen for the 140th instance before sending out the request for the 141st instance
n The new leader does not know how far ahead the old leader had progressed
E.g. in the scenario, the new leader only has data up to the 139th instance, but some replica has accepted the 140th instance's value
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-47
Run infinite instances efficiently
n “Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instance 140.) Thus, a server (acting as acceptor) can respond for all instances with a single reasonably short message.”
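The quoted optimisation can be sketched from the acceptor's side; the function and message shape below are assumptions for illustration, not an API from the paper:

```python
# Sketch: answer one prepare message covering all instances above a given number with
# a single short reply: a blanket promise plus accepted values only where they exist.
def handle_blanket_prepare(n, min_instance, accepted_by_instance):
    # accepted_by_instance: {instance: (proposal_number, value)} stored at this acceptor
    relevant = {i: acc for i, acc in accepted_by_instance.items() if i > min_instance}
    return {"promise": n, "for_instances_above": min_instance, "accepted": relevant}

# In the lecture's scenario one replica had accepted (n=1, v=qy) for instance 140 only,
# so its reply carries that single accepted value and nothing else.
reply = handle_blanket_prepare(2, 139, {135: (1, "qx"), 140: (1, "qy")})
assert reply["accepted"] == {140: (1, "qy")}
```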
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-48
Example of handling infinite instances
Old leader
new leader
C1 R1 R2 R3 R4 R5
135th accept(n=1, v= qx)
140th accept(n=1, v= qy)
promise (n=2 for instance >139)
prepare(n=2 for instance >139)
140th accept(n=2, v=qy)
141st accept(n=2, v=qz)
promise(n=2 for instance >139) accepted (n=1, v= qy) for instance 140)
promise(n=2 for instance >139)
qz
COMP5349 “Cloud Computing” – 2020 (Y. Zhou )
11-49
Any Practical Usage of Paxos?
Yes, for example:
n Google Chubby (OSDI2006)
Distributed lock service and meta-data repository
n Hadoop ZooKeeper (DISC 2009)
Open-source implementation of Chubby; uses its own ZAB protocol that is inspired by/based on Paxos
n Google Spanner (OSDI2012)
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-50
References
n Leslie Lamport, “Paxos Made Simple”, ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001)
COMP5349 “Cloud Computing” – 2020 (Y. Zhou ) 11-51