The University of Sydney Page 1
COMP3221: Distributed
Systems
Communication (TCP)
Dr Nguyen Tran
School of Computer Science
The University of Sydney Page 2
Previously…
– Previous lecture:
– Message-based communication is complex (e.g., routing towards
destination, subject to message losses)
– Today’s lecture:
– How to avoid message losses?
– How to give the impression that everything happens locally (not
remotely)?
– One to many communication
The University of Sydney Page 3
Outline
– The Problem of Message Loss
– The TCP/IP Solution
– Multicast communication
The University of Sydney Page 4
The Problem of Message Loss
The University of Sydney Page 5
Message loss
– Networks are in general unreliable
– Messages can be lost (never been delivered even if sent)
– Examples:
– A server receives too many requests simultaneously so it cannot treat all
– A router drops the message because its queue is full
Message losses may impact the computation of a distributed system
Cause
The University of Sydney Page 6
Message loss
– Constraints of the problem
– Two armies, each led by a general on separate mountains surrounding a
battlefield (distributed system)
– Can only communicate via messengers (message passing)
– Messengers can be killed before reaching destination (message losses)
– Goal: they want to coordinate an attack
– If they attack at different times, they both die
– If they attack at the same time, they win
Coordinated Attack Problem
The University of Sydney Page 7
Message loss
– There is no protocols to make sure they will win!
Coordinated Attack Problem (con’t)
time
12h!
The University of Sydney Page 8
Message loss
– There is no protocols to make sure they will win!
Coordinated Attack Problem (con’t)
time
Attack at noon, ok?
Ok, it works
for me
Copy that
12h, ok?
Ok
Did you get my ok?
So what?
Good!
…
The University of Sydney Page 9
Message loss
– Constraints of the problem
– Two remote entities of a distributed system
– Can only communicate through messages
– The network is unreliable: messages can be dropped
– Goal: they want to make sure to do something simultaneously
This is impossible, even if all messages go through
Analogy in networking
The University of Sydney Page 10
TCP/IP
Communication 2/2
Week 4, COMP3221
The University of Sydney Page 11
TCP: Overview RFCs: 793,1122,1323, 2018, 2581
– Point-to-point:
– one sender, one receiver
– Connection-oriented, reliable, in-order byte steam:
– Full duplex data:
– bi-directional data flow in same connection
– Flow controlled:
– sender will not overwhelm receiver
– Congestion controlled:
– TCP congestion and flow control set window size
The University of Sydney Page 12
TCP seq. numbers, ACKs
sequence numbers:
–byte stream “number” of
first byte in segment’s
data
acknowledgements:
–seq # of next byte
expected from other side
–cumulative ACK
Q: how receiver handles out-
of-order segments
–A: TCP spec doesn’t say, –
up to implementor
source port # dest port #
sequence number
acknowledgement number
checksum
rwnd
urg pointer
incoming segment to sender
A
sent
ACKed
sent, not-
yet ACKed
(“in-
flight”)
usable
but not
yet sent
not
usable
window size
N
sender sequence number space
source port # dest port #
sequence number
acknowledgement number
checksum
rwnd
urg pointer
outgoing segment from sender
The University of Sydney Page 13
TCP seq. numbers, ACKs
User
types
‘C’
host ACKs
receipt
of echoed
‘C’
host ACKs
receipt of
‘C’, echoes
back ‘C’
simple telnet scenario
Host BHost A
Seq=42, ACK=79, data = ‘C’
Seq=79, ACK=43, data = ‘C’
Seq=43, ACK=80
The University of Sydney Page 14
TCP sender events:
data rcvd from app:
§ create segment with seq #
§ seq # is byte-stream
number of first data byte
in segment
§ start timer if not already
running
• think of timer as for
oldest unacked
segment
• expiration interval:
TimeOutInterval
timeout:
§ retransmit segment that
caused timeout
§ restart timer
ack rcvd:
§ if ack acknowledges
previously unacked
segments
• update what is known
to be ACKed
• start timer if there are
still unacked segments
The University of Sydney Page 15
TCP round trip time, timeout
Q: how to set TCP
timeout value?
§ longer than RTT
• but RTT varies
§ too short: premature
timeout, unnecessary
retransmissions
§ too long: slow reaction to
segment loss
Q: how to estimate RTT?
– SampleRTT: measured
time from segment
transmission until ACK receipt
– ignore retransmissions
– SampleRTT will vary, want
estimated RTT “smoother”
– average several recent
measurements, not just
current SampleRTT
The University of Sydney Page 16
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
100
150
200
250
300
350
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
RT
T
(m
ill
is
ec
on
ds
)
SampleRTT Estimated RTT
EstimatedRTT = (1- a)*EstimatedRTT + a*SampleRTT
§ exponential weighted moving average
§ influence of past sample decreases exponentially fast
§ typical value: a = 0.125
TCP round trip time, timeout
RT
T
(m
ill
is
ec
on
ds
)
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
sampleRTT
EstimatedRTT
time (seconds)
The University of Sydney Page 17
– timeout interval: EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT -> larger safety margin
– estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1-b)*DevRTT +
b*|SampleRTT-EstimatedRTT|
TCP round trip time, timeout
(typically, b = 0.25)
TimeoutInterval = EstimatedRTT + 4*DevRTT
estimated RTT “safety margin”
The University of Sydney Page 18
TCP: retransmission scenarios
lost ACK scenario
Host BHost A
Seq=92, 8 bytes of data
ACK=100
Seq=92, 8 bytes of data
Xtim
eo
ut
ACK=100
premature timeout
Host BHost A
Seq=92, 8 bytes of data
ACK=100
Seq=92, 8
bytes of data
tim
eo
ut
ACK=120
Seq=100, 20 bytes of data
ACK=120
SendBase=100
SendBase=120
SendBase=120
SendBase=92
The University of Sydney Page 19
TCP: retransmission scenarios
X
cumulative ACK
Host BHost A
Seq=92, 8 bytes of data
ACK=100
Seq=120, 15 bytes of data
tim
eo
ut
Seq=100, 20 bytes of data
ACK=120
The University of Sydney Page 20
TCP fast retransmit
§ time-out period often
relatively long:
• long delay before
resending lost packet
§ detect lost segments via
duplicate ACKs.
• sender often sends
many segments back-
to-back
• if segment is lost, there
will likely be many
duplicate ACKs.
if sender receives 3
ACKs for same data
(“triple duplicate ACKs”),
resend unacked
segment with smallest
seq #
§ likely that unacked
segment lost, so don’t
wait for timeout
TCP fast retransmit
(“triple duplicate ACKs”),
The University of Sydney Page 21
X
fast retransmit after sender
receipt of triple duplicate ACK
Host BHost A
Seq=92, 8 bytes of data
ACK=100
tim
eo
ut ACK=100
ACK=100
ACK=100
TCP fast retransmit
Seq=100, 20 bytes of data
Seq=100, 20 bytes of data
The University of Sydney Page 22
TCP flow control
application
process
TCP socket
receiver buffers
TCP
code
IP
code
application
OS
receiver protocol stack
application may
remove data from
TCP socket buffers ….
… slower than TCP
receiver is delivering
(sender is sending)
from sender
receiver controls sender, so
sender won’t overflow
receiver’s buffer by transmitting
too much, too fast
flow control
The University of Sydney Page 23
TCP flow control
buffered data
free buffer spacerwnd
RcvBuffer
TCP segment payloads
to application process
– receiver “advertises” free
buffer space by including
rwnd value in TCP header of
receiver-to-sender segments
– RcvBuffer size set via
socket options (typical default is
4096 bytes)
– many operating systems auto
adjust RcvBuffer
– sender limits amount of
unacked (“in-flight”) data to
receiver’s rwnd value
– guarantees receive buffer will
not overflow
receiver-side buffering
source port # dest port #
sequence number
acknowledgement number
checksum
rwnd
urg pointer
The University of Sydney Page 24
congestion:
– informally: “too many sources sending too much data too
fast for network to handle”
– different from flow control!
– manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)
– a top-10 problem!
Principles of congestion control
The University of Sydney Page 25
TCP congestion control
§ approach: sender increases transmission rate (window
size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS (Maximum
Segment Size) every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
c
w
n
d
:
TC
P
s
en
de
r
co
ng
es
tio
n
w
in
do
w
s
iz
e
AIMD saw tooth
behavior: probing
for bandwidth
additively increase window size …
…. until loss occurs (then cut window in half)
time
The University of Sydney Page 26
TCP Congestion Control: details
§ sender limits transmission:
§ Congestion window (cwnd) is
dynamic function of perceived
network congestion
TCP sending rate:
§ roughly: send cwnd bytes,
wait RTT for ACKS, then
send more byteslast byte
ACKed sent, not-
yet ACKed
(“in-
flight”)
last byte
sent
cwnd
LastByteSent-
LastByteAcked
< cwnd sender sequence number space rate ~ cwnd RTT bytes/sec The University of Sydney Page 27 TCP Slow Start § when connection begins, increase rate exponentially until first loss event: • initially cwnd = 1 MSS (Maximum Segment Size) • double cwnd every RTT • done by incrementing cwnd for every ACK received § summary: initial rate is slow but ramps up exponentially fast Host A one segment R TT Host B time two segments four segments The University of Sydney Page 28 TCP: detecting, reacting to loss § Loss indicated by timeout: • cwnd set to 1 MSS; • window then grows exponentially (as in slow start) to threshold, then grows linearly § Loss indicated by 3 duplicate ACKs: TCP RENO • dup ACKs indicate network capable of delivering some segments • cwnd is cut in half window then grows linearly § TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks) The University of Sydney Page 29 Q: when should the exponential increase switch to linear? A: when cwnd gets to 1/2 of its value before timeout. Implementation: § variable ssthresh § on loss event, ssthresh is set to 1/2 of cwnd just before loss event TCP: switching from slow start to Congestion Avoidance The University of Sydney Page 30 Connection Management before exchanging data, sender/receiver “handshake”: – agree to establish connection (each knowing the other willing to establish connection) – agree on connection parameters, e.g. MSS, rwnd, etc. – connection-oriented connection state: ESTAB connection variables: seq # client-to-server server-to-client rcvBuffer size at server,client application network connection state: ESTAB connection Variables: seq # client-to-server server-to-client rcvBuffer size at server,client application network Socket clientSocket = newSocket("hostname","port number"); Socket connectionSocket = welcomeSocket.accept(); The University of Sydney Page 31 TCP 3-way handshake SYNbit=1, Seq=x choose init seq num, x send TCP SYN msg ESTAB SYNbit=1, Seq=y ACKbit=1; ACKnum=x+1 choose init seq num, y send TCP SYNACK msg, acking SYN ACKbit=1, ACKnum=y+1 received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data received ACK(y) indicates client is live SYNSENT ESTAB SYN RCVD client state LISTEN server state LISTEN The University of Sydney Page 32 FIN_WAIT_2 CLOSE_WAIT FINbit=1, seq=y ACKbit=1; ACKnum=y+1 ACKbit=1; ACKnum=x+1 wait for server close can still send data can no longer send data LAST_ACK CLOSED TIMED_WAIT timed wait for 2*max segment lifetime CLOSED TCP: closing a connection FIN_WAIT_1 FINbit=1, seq=xcan no longer send but can receive data clientSocket.close() client state server state ESTABESTAB The University of Sydney Page 33 HTTP Overview HTTP: hypertext transfer protocol – Web’s application layer protocol – client/server model – client: browser that requests, receives, (using HTTP protocol) and “displays”Web objects – server:Web server sends (using HTTP protocol) objects in response to requests PC running Firefox browser server running Apache Web server iPhone running Safari browser HTTP request HTTP response HTT P r equ est HTT P r esp ons e The University of Sydney Page 34 HTTP overview Uses TCP: § client initiates TCP connection (creates socket) to server, port 80 § server accepts TCP connection from client § HTTP messages (application- layer protocol messages) exchanged between browser (HTTP client) and Web server (HTTP server) § TCP connection closed HTTP is “stateless” – server maintains no information about past client requests What is HTTPS ? – HTTP over TLS (Transport Layer Security), e.g. over SSL The University of Sydney Page 35 Related terminology – Broadcast: one-to-all communication – Action of sending a message to all nodes of the system – Typically used for relatively small systems, like IP broadcast as part of Ethernet in LANs – Unicast: one-to-one communication – In contrast with broadcast, describe a special form of distribution to a single receiver – Used generally in the context of multimedia or streaming applications – Anycast: one-to-random-one communication – Send a message to an IP address range from to obtain a response from any potential receiver, whose IP address belongs to this range – Used with UDP and TCP – Multicast: one-to-many communication – Action of sending a message to multiple nodes of the system (not necessarily all nodes) – This term is used in many contexts (network, algorithm) The University of Sydney Page 36 Conclusion (Transportation layer) – Message losses can have dramatic consequences – TCP/IP protocol suite hides these losses from the application level – Socket, RPC, use TCP/IP – Sockets are complex – RPC is more transparent for the client – Multicast communication scales well for large distributed systems