Operating Systems
Distributed Systems
4160 – Distributed and Parallel Computing
Goal: Wide-Area Transactions
• Distributed transactions with strong consistency guarantee
– BigTable: high performance; no transactions
– Megastore: supports SQL, but low write throughput
• One of the first systems that really combines 2PC (sharding; scalability) with Paxos (replication; availability)
Linearizable (replication) + Serializable (db)
External Consistency
• Linearizability (Shared Memory, OS):
– Multiple threads/processes invoke operations on a single object (e.g., linked list)
• But it must behave like operations happen one-by-one (serial) following real-time order
• Linearizability (Replicated State Machine, Distributed):
– Multiple threads/processes invoke operations on a “single” machine
• But in fact, the single object has multiple physical replicas
• So, it must behave like operations happen one-by-one (serial) following real-time order
• Serializability (Shared Memory, Single-Node DB):
– Multiple threads (transactions) operate on multiple items
• But it must behave like some serial order
– So: T1 T2 T3 or T2 T1 T3 is also fine
• Serializability (Distributed Transaction; sharding)
• Distributed txn on replicated machines
– One-copy Serializability
– External Consistency
= a serial order of transactions that respects real-time order
= if T1 commits before T2 commits, only the serial order T1 T2 is accepted
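To make that last rule concrete, here is a minimal Python sketch (the function and inputs are illustrative, not part of Spanner) that checks whether a proposed serial order respects external consistency given real-time commit instants:

def externally_consistent(serial_order, commit_time):
    """serial_order: txn ids in the proposed serial order.
    commit_time: txn id -> real-time commit instant.
    Rule from above: if T1 commits before T2 commits, T1 must precede T2."""
    position = {t: i for i, t in enumerate(serial_order)}
    for t1 in serial_order:
        for t2 in serial_order:
            if commit_time[t1] < commit_time[t2] and position[t1] > position[t2]:
                return False
    return True

# T1 commits at time 10, T2 at time 20: only the order [T1, T2] is acceptable.
print(externally_consistent(["T1", "T2"], {"T1": 10, "T2": 20}))   # True
print(externally_consistent(["T2", "T1"], {"T1": 10, "T2": 20}))   # False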
Setting: Sharding + Replication
[Diagram: three replicas, each holding Shard 1 (A-M) and Shard 2 (N-Z); the replicas of each shard form a Paxos group]
• Explicitly differentiate read-only transaction from general transaction
– the application developers have to specify (tell) Spanner this is a read-only transaction
– Spanner handles r/o and r/w txns differently
– Otherwise, Spanner can't know T is r/o until it finishes T
• Well... most use cases issue SQL from within a hosted program or stored procedure, so static analysis could actually determine this upfront...
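As an illustration only, a sketch of what "telling" the system the transaction type might look like; this is a hypothetical client API in Python, not the real Cloud Spanner interface:

class ReadOnlyTxn:
    """Served from a local replica at a snapshot timestamp; takes no locks."""
    def read(self, table, key):
        print(f"snapshot read of {table}[{key}] from a local replica")

class ReadWriteTxn:
    """Routed through 2PL + 2PC at the Paxos leaders of the involved shards."""
    def write(self, table, key, value):
        print(f"buffer write {table}[{key}] = {value}")
    def commit(self):
        print("2PC prepare + commit across shard leaders, then reply 'ok'")

# The developer must pick the right kind of transaction up front;
# Spanner cannot infer it before the transaction finishes.
ro = ReadOnlyTxn()
ro.read("accounts", "alice")

rw = ReadWriteTxn()
rw.write("accounts", "alice", 41)
rw.commit()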
Read-only transactions
• Spanner optimizes for r/o txns
– Read from a local replica
• The challenge is how to ensure consistency
• since the local replica might not be in the Paxos majority (it may lag behind the latest committed writes)
Serializability challenge
• Even reading the latest Raft-committed value still can't uphold transaction-level serializability
[Timeline diagram: T1 commit; T2 start; T3]
Ref. MIT 6.824
External Consistency in Spanner
• The serializable part
– r/o transactions:
• Use the snapshot read part of Snapshot Isolation
– Snapshot read by multi-version timestamp ordering (MVTO) – r/w transactions:
• Use 2PL + 2PC
• The linearizable part
– Despite asynchrony and FLP, it is about real-time ordering, right?
– So it works, as long as the timestamps of distributed servers perfectly synchronize across the continent.....
General transactions (rw-txn)
• { a = a - 1 in shard 1; y = y + 1 in shard 2 }
[Diagram: the transaction coordinator TC (itself replicated by Paxos) drives the txn; Shard 1's Paxos-group leader acquires lock(a) and updates a, Shard 2's leader acquires lock(y) and updates y, then TC runs the 2PC commit]
2PC + 2PL + Paxos
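A minimal Python sketch of the flow above, with Paxos replication of each step abstracted away; all class and function names are assumed for illustration:

import threading

class ShardLeader:
    """Paxos-group leader of one shard; Paxos replication of each step is omitted."""
    def __init__(self, data):
        self.data = data
        self.locks = {}                          # key -> lock, held until 2PC finishes (2PL)
    def lock_and_apply(self, key, delta):
        lock = self.locks.setdefault(key, threading.Lock())
        lock.acquire()                           # growing phase of 2PL
        self.data[key] += delta                  # apply the update under the lock
    def prepare(self):
        return True                              # would Paxos-log a 2PC prepare record
    def commit_and_unlock(self, key):
        self.locks[key].release()                # shrinking phase: release after commit

def run_txn(shard1, shard2):
    # { a = a - 1 in shard 1;  y = y + 1 in shard 2 }, driven by the coordinator TC
    shard1.lock_and_apply("a", -1)
    shard2.lock_and_apply("y", +1)
    if shard1.prepare() and shard2.prepare():    # 2PC phase 1: prepare both shards
        shard1.commit_and_unlock("a")            # 2PC phase 2: commit and release locks
        shard2.commit_and_unlock("y")

s1, s2 = ShardLeader({"a": 5}), ShardLeader({"y": 7})
run_txn(s1, s2)
print(s1.data, s2.data)                          # {'a': 4} {'y': 8}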
Snapshot Isolation history
• A very practical isolation level in single-node DB
• Snapshot read: all reads of a transaction see the same, latest committed DB state (as of the transaction's start)
– Hence, a simple implementation: MV + T/O
» Every committed version of item x has a commit timestamp ts
» Every txn T has a begin timestamp TS
• T reads the latest version of x, i.e., the one with the maximum ts such that ts ≤ TS (see the sketch below)
A replica can serve a read at timestamp TS once:
1. it has seen all writes with ts ≤ TS
– since the Paxos leader sends out the writes in timestamp order, the highest write timestamp it has applied tells it this
2. transactions that have prepared (1st phase in 2PC) but not yet stepped into the 2nd phase have finished
– because whether they will commit or abort is still unknown
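A small Python sketch of the snapshot-read rule above (MV + T/O); the class and method names are assumptions for illustration:

class VersionedItem:
    """Multi-versioned item: keeps every committed write as (commit_ts, value)."""
    def __init__(self):
        self.versions = []
    def write(self, commit_ts, value):
        self.versions.append((commit_ts, value))
    def snapshot_read(self, read_ts):
        # latest committed version whose commit_ts <= the txn's begin timestamp
        visible = [(ts, v) for ts, v in self.versions if ts <= read_ts]
        return max(visible)[1] if visible else None

x = VersionedItem()
x.write(10, "T1's write")
x.write(20, "T2's write")
print(x.snapshot_read(15))   # -> "T1's write"  (latest version with ts <= 15)
print(x.snapshot_read(25))   # -> "T2's write"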
There was an assumption – that the timestamps across nodes are in sync
• T1’s ts=10; T3’s ts=15; T2’s ts=20;
• What would happen if the machine clocks are not well synchronized, especially in a WAN?
– What if T3 ts=18? //too large
• OK – then T3 just has to wait longer, until the replica receives a write with ts > 18
– What if T3 ts=9? //too small
• WRONG RESULT! (as it should read T1’s write)
[Timeline: T1 commit (ts=10); T2 commit (ts=20); T3 reads x]
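A tiny sketch of this anomaly, reusing the snapshot-read rule from the earlier sketch (values and timestamps taken from the example above):

# T1 committed x at ts=10, but a badly synchronized clock gives T3 the read
# timestamp 9, so the read misses T1's write even though T1 committed first.
versions = [(10, "T1's write")]              # committed versions of x: (ts, value)

def snapshot_read(read_ts):
    visible = [v for ts, v in versions if ts <= read_ts]
    return visible[-1] if visible else "nothing -- WRONG RESULT"

print(snapshot_read(15))                     # ts from a well-synced clock -> "T1's write"
print(snapshot_read(9))                      # ts too small -> misses T1's write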
Use a centralized time server to generate timestamps?
• It works – as a single source of time
• But it adds one more round trip to that central server
Spanner sol: “TrueTime”
• TrueTime TT: [earliest, latest]
– GPS + atomic clocks guarantee that the true time must lie within the interval
• TrueTime-based timestamp assignment rules:
– R/O txn: because it is supposed to read something that has really committed, to be safe
• its timestamp is when it starts: TT.latest
– R/W txn T: its timestamp is when it commits:
• i.e., T.ts = TT.now.latest()
• But this .latest() might still be in the future when T is ending
• Hence “Commit wait”:
– Delay its commit until the true real time TT.now.earliest() > T.ts
– Commit = can tell the client 'ok'
• Commit wait:
– for R/W txn T,
• Delay its commit until TT.now.earliest() > T.ts
T1: Wx=1, commit, T1.TS=1
T2: Wx=2; at commit begin, TT.now()=[1,9], so T2.TS=9; delay (commit wait) till TT.now()=[>9, …], then commit end
T3: Rx; TT.now()=[10,12], so T3.TS=12 (≥ 9, so T3 sees T2's write)
True-time timestamp assignment rules:
R/O txn: its timestamp is when it starts: TT.latest
R/W txn: its timestamp is when it commits: TT.latest
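A Python sketch of these rules plus commit wait; TrueTime is simulated here with a fixed uncertainty bound around the local clock, which is an assumption for illustration (real TrueTime derives the interval from GPS/atomic clocks):

import time

EPS = 0.005                                  # assumed clock uncertainty: +/- 5 ms

class TT:
    """Simulated TrueTime: now() returns the interval [earliest, latest]."""
    @staticmethod
    def now():
        t = time.time()
        return (t - EPS, t + EPS)

def start_ro_txn():
    return TT.now()[1]                       # r/o txn: ts = TT.now().latest at start

def commit_rw_txn(apply_writes):
    ts = TT.now()[1]                         # r/w txn: ts = TT.now().latest at commit
    apply_writes(ts)                         # Paxos-replicate the writes at ts (abstracted away)
    while TT.now()[0] <= ts:                 # commit wait: until TT.now().earliest() > ts
        time.sleep(EPS / 2)
    return ts                                # only now may we tell the client "ok"

commit_ts = commit_rw_txn(lambda ts: print(f"write x at ts={ts:.6f}"))
print(f"commit acknowledged; a later r/o txn with ts={start_ro_txn():.6f} will see the write")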
CC Summary
Jinliang Wei (CMU CSD)
Strict serializability vs External consistency
• External consistency > strict serializability > serializability
• Serializable:
– Theoretically, A completed in 1945 while B was submitted in 2022*
• It is still correct as long as the resulting state looks like B -> A
• Strict serializable:
– If A commits before B starts, the serializable schedule must be A before B, but still no ordering restriction once they overlap
• E.g., RHS, schedule B -> A: still strict serializable
• External consistency:
– has an ordering restriction even on overlapping transactions
– If A commits before B commits, the only acceptable serial order is A before B
– So, RHS, schedule B -> A: not ok
*https://fauna.com/blog/serializability-vs-strict-serializability-the-dirty-secret-of-database-isolation-levels
How can Spanner combat CAP?
• Use its own private network
• “Nevertheless, it is worth noting that should a network partition occur, Spanner chooses consistency over availability, which means it has more in common with traditional databases such as Oracle than with next-generation databases such as Dynamo.”
https://cloud.google.com/blog/products/databases/inside-cloud-spanner-and-the-cap-theorem
• 2PC + RSM (a first); before Spanner:
• Very few cross-continent-scale systems
– Distributed DB: 2PC with offline recovery (no HA)
– RSM: no sharded transactions
• FLP? Still there, of course. But a node failing during Paxos? Just restart it
• CAP? No P: Private Network (new); Favors C over A (old)
• Linearizable? Use true + safe time (new)
• Serializable by Snapshot Read + 2PL Write* + 2PC (mix of old)
• Optimization for r/o txn:
– Read local-replica (old) by
• Reading the right snapshot (not new in DB)
* well.. the 2PL on writes has made SI serializable already…
• A minority replica delays returning the read request until it feels safe (similar to Zookeeper, right?)
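A Python sketch of that "delay until safe" behavior, combining the two safe-time conditions discussed earlier; all names are illustrative:

import time

class Replica:
    """A (possibly minority) replica serving snapshot reads."""
    def __init__(self):
        self.applied_ts = 0.0            # highest Paxos write timestamp applied locally
        self.prepared_ts = []            # timestamps of prepared-but-undecided 2PC txns

    def is_safe(self, read_ts):
        # safe once all writes up to read_ts are applied and no prepared txn
        # with ts <= read_ts might still commit below the read timestamp
        return (self.applied_ts >= read_ts and
                all(p > read_ts for p in self.prepared_ts))

    def snapshot_read(self, read_ts, read_fn, poll=0.01):
        while not self.is_safe(read_ts): # not yet safe: delay the reply
            time.sleep(poll)
        return read_fn(read_ts)

r = Replica()
r.applied_ts = 20.0                      # pretend the replica has applied writes up to ts=20
print(r.snapshot_read(15.0, lambda ts: f"value of x as of ts={ts}"))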
Linearizable transactions? Necessary?
• External consistency is good. But serializable has been very okay in DB, so why bother?
– My guess is:
• propagating the write-sets from the leader to the followers (to achieve one-copy serializability, 1SR) incurs high network traffic
• You can't just propagate the transaction commands instead
– as the replicas would end up with different serializable schedules
– their solution is to independently find a common order for every replica to follow
• The most natural order among distributed servers is the real-time order
– Hence, GPS/atomic clock/TrueTime to get the common order in a distributed sense.
• In other words, linearizability may NOT be a requirement; it is more a by-product of their solution
• Recent solution: Deterministic Concurrency Control