DISTRIBUTED SYSTEMS
2PC, 3PC, PAXOS [SKYPE]
Last week(s) …
Introduction to fault tolerance
Adaptive timeouts & probes
Semantic models
at-most-once, at-least-once, zero-or-once, …
System model: Everything can fail!
We need to establish the system model in order to choose an appropriate algorithm
The system model covers failures of:
Communication
Node reliability
Node behaviour
Communication failures
Message loss
Message reordering
Message duplication (e.g., due to resend on message loss)
Message corruption (accidental or deliberate)
Message transmission delays
Network partitioning
Node reliability
Nodes can crash (how many?) … and then restart
“Correct” and “incorrect” nodes
Byzantine Failure Model
Any behaviour is expected/permitted of incorrect nodes:
refuse to follow the protocol
forge/fake messages from other nodes
collude to deceive correct nodes
interfere with communication
System models
The weaker the system model (i.e., the more failures it admits), the stronger the algorithm needs to be
Lamport’s Mutex:
service: distributed mutual exclusion
system model:
reliable message delivery
in-order delivery
reliable nodes
Impossibility Theorems
CAP – Consistency, Availability, Partition tolerance: we cannot have all three at the same time
Fischer, Lynch, Paterson (FLP)
no consensus algorithm is guaranteed to terminate in an asynchronous network with even one unreliable node
we can’t tell the difference between a message delay and a failure
Bounds
Consensus: N >= 2f + 1 (to tolerate f crash failures)
Byzantine consensus: N >= 3f + 1 (to tolerate f Byzantine failures)
e.g., for f = 1: three nodes suffice against crashes, but four are needed against Byzantine nodes
Example
Suppose we have an application requirement that an employee’s manager is always paid more than the employee:
emp.salary < emp.manager.salary
This is referred to as a CONSISTENCY CONSTRAINT:
– We need to be able to write applications that can manipulate data whilst maintaining such requirements.
– In a client/server application, we need to be able to support manipulation of such data by several remote operations executing concurrently.
– Atomic transactions provide one way to write such applications.
Example - Incomplete Executions
Suppose we have the code:
emp.salary = emp.salary * 1.1
if ( emp.salary >= emp.manager.salary ) {
    emp.manager.salary = emp.manager.salary * 1.1
}
Now, if the server crashes after executing the first assignment, but not the if statement, it is possible that the consistency constraint will be violated.
This situation, where only part of an execution is completed, is termed a failure to achieve ATOMICITY.
Example – Incomplete Executions
Suppose we have the code:
emp.salary = emp.salary * 1.1
Clearly, since this does not check the manager’s salary at all, this code can trivially lead to a violation of the consistency constraint.
This code can fail to preserve CONSISTENCY, even when it executes to completion.
Interference
Suppose we have two concurrent executions of the code:
emp.salary = emp.salary * 1.1
if ( emp.salary >= emp.manager.salary ) {
    emp.manager.salary = emp.manager.salary * 1.1
}
Now, consider the following:
1. Execution A executes the first line.
2. Execution B executes the first line. At this point, the salary is 1.21 times its original value.
3. A executes the if statement, and calculates, but does not assign, the new manager salary (1.1 times the original).
4. B executes the if statement, multiplying the manager’s salary by 1.1.
5. A completes, assigning 1.1 times the original manager’s salary.
The employee’s salary has increased by 21%, the manager’s by 10%, so the consistency constraint may be violated.
This can be attributed to a lack of ISOLATION of concurrent executions:
An alternative (non-transactional) analysis identifies a failure of cooperation.
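To make the schedule concrete, here is a minimal Python sketch of the interleaving above; the Employee class and the starting salaries (100 for the employee, 105 for the manager) are illustrative assumptions, not part of the original example.

class Employee:
    def __init__(self, salary, manager=None):
        self.salary = salary
        self.manager = manager

boss = Employee(salary=105.0)
emp = Employee(salary=100.0, manager=boss)

emp.salary *= 1.1                      # A executes the first line -> 110.0
emp.salary *= 1.1                      # B executes the first line -> 121.0

a_new_mgr_salary = boss.salary * 1.1   # A computes, but does not assign: 115.5

if emp.salary >= boss.salary:          # B executes the if statement
    boss.salary *= 1.1                 # B assigns: 115.5

boss.salary = a_new_mgr_salary         # A completes with its stale value: 115.5

# Employee up 21%, manager up only 10%: the constraint
# emp.salary < emp.manager.salary is now violated.
assert emp.salary >= boss.salary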
Durability
Finally, we can’t reliably do anything unless we can be sure that at some point an execution has had a permanent effect.
If we can’t guarantee this, then we can’t guarantee that our system will ever get anything done in any lasting sense.
A failure to provide this guarantee is said to be a failure to provide DURABILITY.
DURABILITY guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure.
Transaction Properties
ATOMICITY – transactions are indivisible with respect to crashes. That is, transactions either COMMIT and have full effect or ABORT and have no effect at all; it is not possible for a crash to lead to only part of the effect of a transaction being recorded.
CONSISTENCY – when considered by itself, each transaction moves the application from one consistent state to another.
ISOLATION – transactions are indivisible with respect to each other. That is, it is not possible for a transaction to see a temporary state during the execution of a concurrent transaction.
A related idea is SERIALISABILITY: this describes the situation where, when two transactions execute concurrently, the effect is the same as if one executed after the other.
DURABILITY – the effects of those transactions that commit will survive subsequent failures:
Not all failures, but certainly server crashes, and possibly others.
Transaction systems
The hard part about building a transaction processing system is dealing with the following:
CONCURRENCY CONTROL – ensuring that concurrent transactions execute in isolation and achieve serialisability.
RECOVERY – ensuring both that the effects of committed transactions survive crashes (durability) and that only the effects of committed transactions survive (atomicity).
Concurrency control and recovery are not independent, so a transaction system must co-ordinate its concurrency control and recovery mechanisms.
Transaction Commit
A transaction is deemed to commit if and only if the commit record for that transaction reaches the disk:
The commit record is preceded by enough information to ensure the transaction’s full effect is durable – either the modified data is forced to disk (very inefficient), or updates are recorded in the log (redo records).
Writing the commit record is an all-or-nothing event.
However, this only works for transactions confined to a single server:
The existence or absence of a commit record in a single, sequential, log determines a transaction’s fate.
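A minimal sketch of this single-server rule, assuming a JSON-lines log file and hypothetical record shapes:

import json, os

LOG = "txn.log"

def log_append(record):
    # Append a record and force it to disk before returning.
    with open(LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())   # the "force": without it the OS may buffer

def commit(txn_id, updates):
    # Redo records first, so the transaction's effects can be replayed.
    for update in updates:
        log_append({"txn": txn_id, "type": "redo", "update": update})
    # The commit record: writing this single record is the all-or-nothing
    # event that decides the transaction's fate.
    log_append({"txn": txn_id, "type": "commit"})

def committed_after_crash(txn_id):
    # During recovery: committed iff the commit record survived.
    with open(LOG) as f:
        return any(json.loads(line) == {"txn": txn_id, "type": "commit"}
                   for line in f)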
In a DISTRIBUTED TRANSACTION, there will be log records in the logs of each server involved in the transaction:
We need an end-of-transaction record in each log, to determine whether the complete effect has been recorded.
But, each end-of-transaction record is written independently – so it’s possible that some will be written and not others – unlike writing a commit record, writing the end-of-transaction records is not an all-or-nothing event.
Distributed Transactions
The ATOMICITY OF TRANSACTIONS requires that when a distributed transaction comes to an end, either all of its operations are carried out or none of them.
In a DISTRIBUTED TRANSACTION, there will be log records in the logs of each server involved in the transaction:
We need an end-of-transaction record in each log, to determine whether the complete effect has been recorded.
But, each end-of-transaction record is written independently – so it’s possible that some will be written and not others – unlike writing a commit record, writing the end-of-transactions records is not an all or nothing event.
One Phase Commit
Simple ONE-PHASE ATOMIC COMMIT PROTOCOL
The client sends the commit or abort request to all of the participants in the transaction.
It keeps repeating the request until all of them have acknowledged that they carried it out.
This approach is simple but will not work:
When the client requests a commit, it does not allow a server to make a unilateral decision to abort.
Distributed Transactions
Efficient DISTRIBUTED TRANSACTION PROCESSING requires more sophisticated commit and recovery techniques:
We shall examine the simplest efficient distributed commit algorithm, namely DISTRIBUTED TWO-PHASE COMMIT.
Note that this has nothing to do with two-phase locking.
The distributed two-phase commit algorithm for a given transaction includes all processes involved in that transaction:
As a preliminary, one of these processes is elected or appointed as the COORDINATOR, and the remaining processes are termed the PARTICIPANTS. In a client/server system, the client that started the transaction is typically the coordinator.
Since it is a distributed algorithm, it needs to cope with these processes failing and restarting during a commit operation.
Distributed 2PC
The algorithm is divided into two phases, a PREPARE or DECISION phase and a COMMIT PROPAGATION phase.
The prepare phase:
The coordinator sends a prepare message to each participant.
If the participant determines that it cannot commit, it sends an abort response to the coordinator and can abort the transaction locally.
If the participant determines that it can commit, it writes a prepare record to the tail of the log, forces the log, and then sends a prepare message to the coordinator – the participant may not commit the transaction at this point.
Prepare Phase
1. Coordinator sends prepare message to all participants.
[Diagram: Coordinator sends canCommit? to participants Part1, Part2, Part3]
Prepare Phase
1. Coordinator sends prepare message to all participants.
2. All participants that are prepared to commit write a prepare record, then send a prepare message back to the coordinator.
[Diagram: Part1, Part2, Part3 each write a prepare record and reply Yes to Coord]
Prepare Phase
1. Some participants are prepared to commit, while some decide to abort.
[Diagram: Part1 and Part2 write prepare records and reply Yes; Part3 replies No and aborts locally]
Distributed 2PC
The commit propagation phase:
If the coordinator receives prepare responses from all participants, it decides to commit the transaction. Otherwise, the coordinator decides to abort the transaction.
If the coordinator decides to commit, then:
It writes a commit decision record to the log, and forces the log.
It sends a commit message to each participant, with an at-least-once protocol. When each participant receives a commit message, it writes a commit record to its log, and forces the log.
Each participant sends an acknowledgement to the coordinator after forcing the commit record.
If the coordinator decides to abort, it must inform (at least) those participants that sent prepare message replies in the first phase:
This enables those participants to abort the transaction locally.
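Putting both phases together, a sketch of the coordinator’s logic; send() and log are hypothetical abstractions standing in for a real RPC and storage layer, not part of any actual system:

def two_phase_commit(participants, log, send):
    # Phase 1: prepare. Collect a vote from every participant.
    votes = [send(p, "canCommit?") for p in participants]   # "yes" / "no"

    if all(v == "yes" for v in votes):
        # Decision point: the forced commit decision record makes the
        # outcome durable before anyone is told to commit.
        log.force_append("commit-decision")
        for p in participants:
            # At-least-once delivery: retry until the ack arrives.
            while send(p, "doCommit") != "ack":
                pass
        return "committed"
    else:
        # Inform (at least) the participants that voted yes so they can
        # abort locally; the others may already have aborted unilaterally.
        for p, v in zip(participants, votes):
            if v == "yes":
                send(p, "doAbort")
        return "aborted"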
Propagation Phase
• Coordinator writes commit decision record to its log.
[Diagram: Coord writes the commit decision record and commits locally; Part1, Part2, Part3 wait]
3. Coordinator writes commit decision record to its log.
4. Coordinator sends commit message to all participants.
[Diagram: Coord sends doCommit to Part1, Part2, Part3]
3. Coordinator writes commit decision record to its log.
4. Coordinator sends commit message to all participants.
5. All participants write commit record to log, then send acknowledge message to coordinator.
[Diagram: Part1, Part2, Part3 each write a commit record and send Acknowledge to Coord]
Decision to commit
How do participants decide whether to commit?
If a participant has crashed between the first time the transaction reaches that participant and when the participant receives the prepare message, then:
It may have lost some updates, either in the form of redo records not reaching the disk, or data updates not reaching the disk.
In such cases, it must respond with an abort message.
This can be implemented as follows:
Each process records a crash count in stable storage, which is incremented each time the computer reboots.
When a given transaction reaches a process, write a start-transaction record, including the current crash count.
When a prepare message reaches a participant, it responds with prepare iff the crash count in the transaction’s start-transaction record is the same as the current crash count.
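A sketch of this scheme; the stable-storage dictionary and the record shapes are assumptions for illustration:

class Participant:
    def __init__(self, stable_storage):
        self.stable = stable_storage
        # Incremented once on every reboot, then persisted.
        self.stable["crash_count"] = self.stable.get("crash_count", 0) + 1

    def on_transaction_start(self, txn_id, log):
        # Record the crash count in force when the transaction first
        # reached this node.
        log.append({"txn": txn_id, "type": "start",
                    "crash_count": self.stable["crash_count"]})

    def on_prepare(self, txn_id, log):
        start = next((r for r in log
                      if r["txn"] == txn_id and r["type"] == "start"), None)
        # No start record, or a reboot since it was written: updates may
        # have been lost, so the only safe vote is "no".
        if start and start["crash_count"] == self.stable["crash_count"]:
            return "yes"   # in full 2PC: write and force the prepare record first
        return "no"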
3PC Motivation
2PC is most awesome
However, it can block on failures. How? Why?
The distributed two phase commit protocol is subject to blocking:
At some points, processes involved in a given transaction must wait for messages, and can neither abort nor commit the transaction until they receive the required messages.
Participants that respond with prepare messages must block, waiting for the result (commit or abort) from the coordinator:
If the coordinator crashes, the participant could be waiting for a long time.
Similarly, if the network partitions, a participant in a different partition to the coordinator will have to wait for the network to recover.
More sophisticated protocols, such as three phase commit protocols, can avoid blocking in the event of coordinator crash.
Blocking is unavoidable in the event of network partitions.
3PC
Proposed 10 years after 2PC
Goal: Turn 2PC into a live (non-blocking) protocol
3PC should never block on node failures as 2PC did
Insight: 2PC suffers from allowing nodes to irreversibly commit an outcome before ensuring that the others know the outcome, too
Idea in 3PC: split the “commit/abort” phase into two phases
First communicate the outcome to everyone
Let them commit only after everyone knows the outcome
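A sketch of the resulting message flow from the coordinator’s side; send() abstracts delivery, and timeout and recovery handling are omitted:

def three_phase_commit(participants, send):
    # Phase 1: voting, as in 2PC's prepare phase.
    if not all(send(p, "canCommit?") == "yes" for p in participants):
        for p in participants:
            send(p, "abort")
        return "aborted"

    # Phase 2 (the new one): first communicate the outcome to everyone.
    # Participants move to a prepared-to-commit state but do not commit yet.
    for p in participants:
        send(p, "preCommit")   # wait for acks before moving on

    # Phase 3: everyone now knows the outcome, so committing is safe; a
    # participant that times out holding a preCommit can also commit.
    for p in participants:
        send(p, "doCommit")
    return "committed"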
3PC
http://en.wikipedia.org/wiki/Three-phase_commit_protocol
Crash Scenario
Coord sends doCommit to A (and to all participants)
A receives it and commits
Then A and the coordinator both crash
[Diagram: Coord and nodes A, B, C, D]
What would happen in 2PC?
Everybody else waits for the coordinator to come back online
3PC properties
Liveness: Yay
does not block, always makes progress
Correctness: Nay
what about network partitions?
A or the coordinator is not crashed; they are just inaccessible
Partition Scenario
Coord sends prepareCommit to A
A becomes partitioned from B, C, D
The coordinator crashes
A has received prepareCommit, so it will commit upon timeout
B, C, D have not received it and thus will abort
[Diagram: A partitioned from B, C, D]
Question
Can we design a protocol that is both correct (safe) and live?
Paxos
The only known completely-safe and largely-live agreement protocol
Lets all nodes agree on the same value despite node failures, network failures, and delays
Only blocks in exceptional circumstances that are vanishingly rare in practice
Extremely useful, e.g.:
nodes agree that client X gets a lock
nodes agree that Y is the primary
nodes agree that Z should be the next operation to be executed
Paxos
Widely used in industry and academia
Google: Chubby (Paxos-based distributed lock service)
Bigtable and other Google services use Chubby
Yahoo: ZooKeeper (distributed coordination/lock service, built on the Paxos-like ZAB protocol)
Open source:
libpaxos (Paxos-based atomic broadcast)
Zookeeper is open-source and integrates with Hadoop
Strong recommendation: Paxos is hard to get right, so use an existing implementation!
Paxos
Correctness (safety)
If agreement is reached, everyone agrees on the same value
The value agreed upon was proposed by some node
Fault tolerance (i.e., as-good-as-it-gets liveness)
If less than half the nodes fail, the rest of the nodes reach agreement eventually
No guaranteed termination (i.e., imperfect liveness)
Paxos may not always converge on a value, but only in very degenerate cases that are improbable in the real world
The Basics
[Diagram: a proposer sends a value v to the acceptors]
Paxos is similar to 2PC, but with some twists
One (or more) node decides to be coordinator (proposer)
Proposer proposes a value and solicits acceptance from others (acceptors)
Proposer announces the chosen value or tries again if it’s failed to converge on a value
Paxos – the egalitarian
Any node can propose/accept; no one has special powers
More than one node may propose at one time
Just like a group of friends organising a party: anyone (including you) can make a proposal
[Diagram: two proposers concurrently send values v and v’ to the acceptors]
Questionz!
What if multiple nodes become proposers simultaneously?
What if the new proposer proposes different values than an already decided value?
What if there is a network partition?
What if a proposer crashes in the middle of solicitation?
What if a proposer crashes after deciding but before announcing results?
Scenario: groups of friends organize a party.
Important mechanisms
Proposal ordering
Lets nodes decide which of several concurrent proposals to accept and which to reject
Majority voting
2PC needs all nodes to vote Yes before committing => may block
Paxos requires only a majority of the acceptors (half+1) to accept a proposal => will work if nearly half the nodes fail to reply
no two majorities can exist simultaneously -> network partitions are not a problem
Proposal 1
Each proposer proposes to all acceptors
Each acceptor accepts the first proposal it receives and rejects the rest
If the proposer receives positive replies from a majority of acceptors, it chooses its own value
There is at most 1 majority, hence at most a single value is chosen, even if there are partitions
Proposer sends chosen value to everyone
Problems?
What if multiple proposers propose at the same time? What if the proposer dies?
Proposal 2
Enforce a global ordering of all proposals
Let acceptors recant their older proposals and accept newer ones
This will allow consistent progress for both simultaneous proposers and dead proposers
Problems?
What if old proposer isn’t dead, but rather just sloooow?
It may think that its proposed value has won, whereas a newer proposer’s value has actually won; going back on your word creates problems:
printed invitations have already been sent for the party
Paxos Solutions
Each acceptor must be able to accept multiple proposals
Order proposals globally by a proposal number, which acceptors use to select which proposals to accept/reject
“node-address:node-local-sequence-number”, e.g., N1:7
When an acceptor receives a new proposal with number n, it looks at the highest-numbered proposal it has already accepted, m, and
accepts the new proposal only if n > m
rejects it otherwise
If the acceptor decides to accept the new proposal n, it will ask the new proposer to use the same value as proposal m’s
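A sketch of this acceptance rule; representing proposal numbers as (sequence, node) tuples is an assumption, chosen so that Python’s tuple comparison yields the global order (e.g., N1:7 becomes (7, "N1")):

class Acceptor:
    def __init__(self):
        self.highest_n = None       # highest proposal number accepted so far
        self.accepted_value = None  # value of that proposal

    def on_proposal(self, n, value):
        if self.highest_n is not None and n <= self.highest_n:
            return ("reject",)
        self.highest_n = n
        if self.accepted_value is None:
            self.accepted_value = value
            return ("accept", value)
        # Otherwise, ask the new proposer to use the same value as before.
        return ("accept", self.accepted_value)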
Paxos Phase 1
a) Prepare
A proposer (the leader) creates a proposal with number n, where n is greater than any proposal number previously used by this proposer. The proposer sends the proposal with number n to a majority of acceptors.
b) Promise
If n > any previous proposal number received from any proposer by the acceptor
Acceptor must return a promise to ignore all future proposals having a number < n
If the acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the proposer
and sends prepare-ok
Otherwise, it ignores the proposal or sends a NACK
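A sketch of the acceptor’s side of Phase 1, under the same tuple-numbering assumption as before (messaging abstracted away):

class PaxosAcceptor:
    def __init__(self):
        self.promised_n = None     # highest n this acceptor has promised to
        self.accepted_n = None     # number of the last proposal accepted
        self.accepted_value = None

    def on_prepare(self, n):
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            # The promise, plus any previously accepted proposal and value.
            return ("prepare-ok", self.accepted_n, self.accepted_value)
        return ("nack",)   # or simply ignore the message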
Paxos Phase 2
a) Accept Request
If a proposer receives prepare-ok from a majority, it needs to set a value for its proposal
If any acceptors had sent a value and proposal number to the proposer, then proposer sets the value of its proposal to the value associated with the highest proposal number reported by the acceptors
If none of the acceptors had accepted a proposal up to this point, then the proposer may choose any value for its proposal
The Proposer sends an Accept Request message to a majority with the chosen value for its proposal.
b) Accepted
If an acceptor receives an Accept Request message for proposal n, it must accept it (reply accept-ok) iff it has not already promised to only consider proposals having an identifier greater than n; accepting also implies the acceptor considers the proposer the LEADER
If it has made such a promise, it responds with accept-reject
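Continuing the sketch with Phase 2: the proposer’s value-selection rule and the acceptor’s accept logic. This is illustrative, not a complete implementation:

def choose_value(prepare_ok_replies, my_value):
    # Each reply is an (accepted_n, accepted_value) pair from an acceptor.
    reported = [(n, v) for (n, v) in prepare_ok_replies if n is not None]
    if reported:
        # Must adopt the value of the highest-numbered accepted proposal.
        return max(reported, key=lambda nv: nv[0])[1]
    return my_value   # free choice: nothing has been accepted yet

class PaxosAcceptorPhase2(PaxosAcceptor):
    def on_accept_request(self, n, value):
        # Accept iff no promise to a strictly higher number has been made.
        if self.promised_n is None or n >= self.promised_n:
            self.accepted_n, self.accepted_value = n, value
            return "accept-ok"
        return "accept-reject"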
Paxos Phase 3
Decide
If a leader gets accept-ok from a majority, sends decide to all nodes
If it fails to get accept-ok from a majority, it restarts with a higher proposal number
When does a value v get chosen?
When leader receives a majority prepare-ok and proposes V
When a majority of nodes accept V
When the leader receives a majority accept-ok for value V
Majorities
Paxos requires half+1 to agree on a value before it can be chosen
Sneaky properties:
no two separate majorities can exist simultaneously (not even if we have partitions)
If two majorities successively agree on two distinct proposals, #n and #m, respectively, then there is at least one node that was in both majorities and has seen both proposals
If another majority then agrees on a third proposal, #p, then there is a set of nodes that collectively will have seen all three proposals, #m, #n, #p
Thus, if a proposal with value v is chosen, all higher proposals will preserve value v (so nodes need not recant)
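A brute-force illustration of the overlap property for five nodes; this is just a check, not part of the protocol itself:

from itertools import combinations

nodes = {"A", "B", "C", "D", "E"}
majority = len(nodes) // 2 + 1   # half + 1 = 3

majorities = [set(c) for c in combinations(nodes, majority)]
# Every pair of majorities shares at least one node (3 + 3 > 5).
assert all(m1 & m2 for m1 in majorities for m2 in majorities)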
Example – 5 nodes
half + 1 = 3
Cannot have two majorities simultaneously
Example – 5 nodes
If two majorities successively agree on two distinct proposals, #n and #m, respectively, then there is at least one node that was in both majorities and has seen both proposals
[Diagram: a majority agrees on orange at time T=1, then a majority agrees on red at time T=2]
Example – 5 nodes
If another majority then agrees on a third proposal, #p, then there is a set of nodes that collectively will have seen all three proposals, #m, #n, #p
[Diagram: majorities agree on orange at T=1, on red at T=2, and on blue at T=3]
Summary
2PC was proposed in the 1970s. Problem: it blocks on the failure of a single node, even in synchronous networks.
The need for an extra, third phase to avoid blocking was shown in the early 1980s. 3PC is an obvious corollary, but it is not provably correct, and in fact suffers from incorrectness under certain network conditions.
Paxos was proposed in the 1990s by Leslie Lamport. It is provably correct in asynchronous networks that eventually become synchronous, does not block if a majority of participants are available (so it withstands faults), and has provably minimal message delays in the best case.
Skype
M$
Skype
Voice over IP peer-to-peer application
Built by the creators of KaZaA in 2003
Overlay network
Advanced functionality provided in an application-specific manner without modification of the core architecture of the Internet
Skype
Jan 2011: 27 million simultaneous online users
March 2012: 35 million simultaneous online users
June 2012: 70 million downloads on Android devices
July 2012: over the quarter, Skype users have logged 115 billion minutes of calls
218,798 years (218.8 millennia)
Architecture
P2P infrastructure
Ordinary hosts
Super nodes, selected on demand based on:
CPU
Bandwidth
Reachability
Availability
User connection
Users are authenticated via a well-known login server
Users then contact a well-known supernode
The first time, around seven supernodes are known
“Host cache” continues to grow to hundreds of supernodes: IP and port
Initial versions: the users’ buddy list & conversations were stored locally
Multiple logins permitted: messages and calls routed to all user locations
Supernodes
Main functionality: efficient search of the global index of users
Distributed among the supernodes
Client contacts supernode -> supernode contacts on average 8 other supernodes: hashing + periodic flooding
User search takes 3-4 seconds to complete for global IP addresses
Intermediary nodes seem to cache results
Voice connection
Once user is discovered:
TCP for call requests and terminations
TCP or UDP for audio streaming (UDP preferred)
TCP + intermediary node to circumvent firewalls
Encoding and decoding: tailored to operate in Internet environments