
Operating Systems
Distributed Systems
4160 – Distributed and Parallel Computing

Goal: Wide-Area Transactions
• Distributed transactions with strong consistency guarantee
– BigTable: high performance; no transactions
– Megastore: supports SQL, but low write throughput
• One of the first systems that really combines 2PC (sharding; scalability) with Paxos (replication; availability)
Linearizable (replication) + Serializable (db)

External Consistency
• Linearizability (Shared Memory, OS):
– Multiple threads/processes invoke operations on a single object (e.g., a linked list)
• But it must behave like operations happen one-by-one (serially), following real-time order
• Linearizability (Replicated State Machine, Distributed):
– Multiple threads/processes invoke operations on a “single” machine
• But in fact, the single object has multiple physical replicas
• So, it must behave like operations happen one-by-one (serial) following real-time order
• Serializability (Shared Memory, Single-Node DB):
– Multiple threads (transactions) operate on multiple items
• But it must behave like some serial order (any one is fine)
– So: T1 T2 T3 or T2 T1 T3 is also fine
• Serializability (Distributed Transaction; sharding)
• Distributed txn on replicated machines
– One-copy Serializability
– External Consistency
= a serial order of transactions that respects real-time
= if T1 commits before T2 commits, only accept the serial order T1 T2

Setting: Sharding + Replication
[Diagram: two shards (Shard 1: A-M, Shard 2: N-Z), each replicated three times; each shard's replicas form a Paxos group]
• Explicitly differentiate read-only transactions from general transactions
– the application developers have to specify (tell) Spanner that this is a read-only transaction
– Spanner handles r/o and r/w txns differently
– Otherwise, Spanner can't know T is r/o until it finishes T
• Well... most use cases issue SQL within a hosted program or stored procedure, so Spanner can actually do static analysis upfront...

Read-only transactions
• Spanner optimizes for r/o txns
– Read from a local replica
• The challenge is how to ensure consistency?
• Since the local replica alone might not be a Paxos majority (it may lag behind)

Serializability challenge
• Even reading the latest Raft-committed value still can't uphold transaction-level serializability
[Timeline: T1 commits; T2 starts; T3 reads] Ref. MIT 6.824

External Consistency in Spanner
• The serializable part
– r/o transactions:
• Use the snapshot read part of Snapshot Isolation
– Snapshot read by multi-version timestamp ordering (MVTO)
– r/w transactions:
• Use 2PL + 2PC
• The linearizable part
– Despite async and FLP, it is related to real-time ordering, right?
– So, as long as the timestamps of distributed servers perfectly synchronize across the continent.....

General transactions (r/w txn)
• { a=a-1; in shard 1; y=y+1; in shard 2 }
[Diagram: the client sends a=a-1 and y=y+1 to the transaction coordinator TC (also in a Paxos group); Shard 1 (leader of its Paxos group) acquires lock(a) and updates a; Shard 2 (leader of its Paxos group) acquires lock(y) and updates y; then the 2PC commit runs]
• 2PC + 2PL + Paxos

Snapshot Isolation history
• A very practical isolation level in single-node DB
• Snapshot read: all reads of a transaction see the same and latest committed DB state
– Hence, simple implementation: MV + T/O
» Every item x has a committed timestamp ts
» Every txn T has a begin timestamp TS
• T reads the latest version of x whose ts is at most T's TS

When can a replica serve a r/o snapshot read at, say, ts=15? Only after:
1. The replica has received a write with ts > 15
– Since Paxos leader sends out the writes in timestamp order
2. Transactions that have prepared (1st phase of 2PC) but have not yet stepped into the 2nd phase have finished
– because whether they will commit or abort is still unknown
[Timeline: T1 commits; T2 starts; T3 reads]
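
A minimal sketch in Python of these two safe-time rules plus the MVTO snapshot read; the class, the field names (versions, applied_ts, prepared_ts) and the busy-wait are illustrative assumptions, not Spanner's actual interfaces:

import time

class Replica:
    """Illustrative replica state for serving read-only snapshot reads."""

    def __init__(self):
        self.versions = {}        # key -> list of (commit_ts, value)
        self.applied_ts = 0       # highest Paxos write timestamp applied so far
        self.prepared_ts = set()  # prepare timestamps of 2PC txns not yet decided

    def safe_ts(self):
        # Rule 1: the Paxos leader sends writes in timestamp order, so every
        #         write with ts <= applied_ts has already reached this replica.
        bound = self.applied_ts
        # Rule 2: a prepared-but-undecided 2PC txn might still commit with a
        #         timestamp >= its prepare ts, so reads must not pass it.
        if self.prepared_ts:
            bound = min(bound, min(self.prepared_ts) - 1)
        return bound

    def snapshot_read(self, key, read_ts):
        """Delay the read until read_ts is safe, then do an MVTO snapshot read."""
        while self.safe_ts() < read_ts:
            time.sleep(0.001)     # in practice: block until more writes arrive
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return max(visible, key=lambda p: p[0])[1] if visible else None

The delay loop is the "replica delays returning the read request until it feels safe" behavior mentioned again in the summary later.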

There was an assumption – the timestamps across nodes are in sync
• T1’s ts=10; T3’s ts=15; T2’s ts=20;
• What would happen if the machine time is not well synchronized, especially in WAN?
– What if T3 ts=18? //too large
• Ok, just that T3 has to wait longer, until the replica receives a write with ts>18
– What if T3 ts=9? //too small
• WRONG RESULT! (as it should read T1’s write)
[Timeline: T1 commits (ts=10); T2 commits (ts=20); T3: Rx]
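
To make the wrong-result case concrete, here is a tiny self-contained illustration (the values and timestamps are made up) of the snapshot-read rule returning the pre-T1 version when T3's timestamp is assigned too small:

# Committed versions of x at the replica: (commit_ts, value)
versions_of_x = [(5, "old value"), (10, "value written by T1")]   # T1 committed at ts=10

def mvto_read(versions, read_ts):
    """Snapshot read: latest version whose commit ts <= read_ts."""
    visible = [(ts, v) for ts, v in versions if ts <= read_ts]
    return max(visible, key=lambda p: p[0])[1] if visible else None

print(mvto_read(versions_of_x, 18))  # ts=18 (too large): still reads T1's write;
                                     # the replica just waits longer before replying
print(mvto_read(versions_of_x, 9))   # ts=9 (too small): reads "old value" and misses
                                     # T1's write -- the WRONG RESULT above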

Use a centralized time server to generate timestamps?
• It works – as a single source of time
• But it adds one more round-trip to that central server

Spanner sol: “TrueTime”
• TrueTime TT: [earliest, latest]
– GPS + atomic clocks guarantee that the true time must lie within the interval
• TrueTime-based timestamp assignment rules:
– R/O txn: because it is supposed to read something that is really committed, to be safe
• its timestamp is when it starts: TT.latest
– R/W txn T: its timestamp is when it commits:
• i.e., T.ts = TT.now.latest()
• But this .latest() might still be in the future when T is ending
• Hence “Commit wait”:
– Delay its commit until the true real time TT.now.earliest() > T.ts
– Commit = can tell the client 'ok'

• Commit wait:
– for R/W txn T,
• Delay its commit until TT.now.earliest() > T.ts
[Timeline:
T1: Wx=1, commit, T1.TS=1
T2: Wx=2, commit begins at TT.now()=[1,9], T2.TS=9 … delay (commit wait) till TT.now()=[>9, ], then commit ends
T3: Rx at TT.now()=[10,12], T3.TS=12]
True-time timestamp assignment rules:
R/O txn: its timestamp is when it starts: TT.latest
R/W txn: its timestamp is when it commits: TT.latest
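
A sketch of the two assignment rules plus commit wait, using a stand-in tt_now() that returns an (earliest, latest) interval; the uncertainty model below is invented for illustration, whereas real TrueTime gets its bound from GPS and atomic clocks:

import random
import time

def tt_now():
    """Stand-in for TrueTime: (earliest, latest) bounds on the true time."""
    t = time.time()
    eps = 0.004 * random.random()    # pretend clock uncertainty of a few ms
    return (t - eps, t + eps)

def start_readonly_txn():
    # R/O txn: timestamp is TT.now().latest at the moment it starts.
    return tt_now()[1]

def commit_readwrite_txn():
    # R/W txn: timestamp is TT.now().latest at the moment it commits...
    commit_ts = tt_now()[1]
    # ...then commit wait: only acknowledge the commit to the client once the
    # true time has definitely passed commit_ts, i.e. TT.now().earliest > commit_ts.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.0001)
    return commit_ts   # now it is safe to tell the client 'ok'

With this rule, if T2 finishes its commit wait before T3 starts (as in the timeline above), the true time already exceeds T2.TS, so T3's start timestamp (TT.now().latest) must also exceed T2.TS and its snapshot read will see T2's write.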

CC Summary
Jinliang Wei (CMU CSD)

Strict serializability vs External consistency
• External consistency > strict serializability > serializability
• Serializable:
– Theoretically, A could have completed in 1945 while B was submitted in 2022*
• It is still correct as long as the resulting state looks like B -> A
• Strict serializable:
– If A commits before B starts, the serializable schedule must be A before B, but still no ordering restriction once they overlap
• E.g., on the RHS, schedule B -> A: still strict serializable
• External consistency:
– has an ordering restriction on overlapping transactions
– If A commits before B commits, the serial order must be A before B
– So, on the RHS, schedule B -> A is not OK
*https://fauna.com/blog/serializability-vs-strict-serializability-the-dirty-secret-of-database-isolation-levels

How can Spanner combat CAP?
• Use its own private network
• “Nevertheless, it is worth noting that should a network partition occur, Spanner chooses consistency over availability, which means it has more in common with traditional databases such as Oracle than with next-generation databases such as Dynamo.”

How can Spanner combat CAP?
https://cloud.google.com/blog/products/databases/inside-cloud-spanner-and-the-cap-theorem

• 2PC + RSM (the first to combine them); Before:
• Very few cross-continent-scale systems
– Distributed DB: 2PC with offline recovery (no HA)
– RSM: no sharded transactions
• FLP? Still there, of course. But a node fails during Paxos? Restart it
• CAP? No P: Private Network (new); Favors C over A (old)
• Linearizable? Use TrueTime + safe time (new)
• Serializable by Snapshot Read + 2PL Write* + 2PC (mix of old)
• Optimization for r/o txn:
– Read local-replica (old) by
• Reading the right snapshot (not new in DB)
* well... the 2PL on writes has made SI serializable already…
• A minority replica delays returning the read request until it feels safe (similar to Zookeeper, right?)

Linearizable transactions? Necessary?
• External consistency is good. But serializable has been very okay in DB, so why bother?
– My guess is:
• propagating the write-sets from the leader to the followers (to achieve one-copy serializability, 1SR) incurs high network traffic
• You can't just propagate the transaction commands instead
– as the replicas would end up with different serializable schedules
– their solution is to independently find a common order for every replica to follow
• The most natural order among distributed servers is the real-time order
– Hence, GPS/atomic clock/TrueTime to get the common order in a distributed sense.
• In other words, linearizable may NOT be a requirement, it is more a by-product of their solution
• Recent solution: Deterministic Concurrency Control
