COMP3221: Distributed Systems
Communication (TCP)
Dr Nguyen Tran
School of Computer Science
The University of Sydney
Page 1
Previously…
– Previous lecture:
– Message-based communication is complex (e.g., routing towards
destination, subject to message losses)
– Today’s lecture:
– How to avoid message losses?
– How to give the impression that everything happens locally (not remotely)?
– One to many communication
The University of Sydney
Page 2
Outline
– The Problem of Message Loss – The TCP/IP Solution
– Multicast communication
The University of Sydney Page 3
The Problem of Message Loss
The University of Sydney Page 4
Message loss
Cause
– Networks are in general unreliable
– Messages can be lost (never been delivered even if sent)
– Examples:
– A server receives too many requests simultaneously so it cannot treat all – A router drops the message because its queue is full
Message losses may impact the computation of a distributed system
The University of Sydney Page 5
Message loss
Coordinated Attack Problem
– Constraints of the problem
– Two armies, each led by a general on separate mountains surrounding a battlefield (distributed system)
– Can only communicate via messengers (message passing)
– Messengers can be killed before reaching destination (message losses)
– Goal: they want to coordinate an attack – If they attack at different times, they both die – If they attack at the same time, they win
The University of Sydney Page 6
Message loss
Coordinated Attack Problem (con’t)
– There is no protocols to make sure they will win! time
12h!
The University of Sydney
Page 7
Message loss
Coordinated Attack Problem (con’t)
– There is no protocols to make sure they will win! time
12h, ok?
Good! So what?
Ok
Did you get my ok?
The University of Sydney
Page 8
…
Attack at noon, ok?
Ok, it works for me
Copy that
Message loss
Analogy in networking
– Constraints of the problem
– Two remote entities of a distributed system
– Can only communicate through messages
– The network is unreliable: messages can be dropped
– Goal: they want to make sure to do something simultaneously This is impossible, even if all messages go through
The University of Sydney
Page 9
TCP/IP
Communication 2/2 Week 4, COMP3221
The University of Sydney Page 10
TCP: Overview RFCs: 793,1122,1323, 2018, 2581
– Point-to-point:
– one sender, one receiver
– Connection-oriented, reliable, in-order byte steam: – Full duplex data:
– bi-directional data flow in same connection – Flow controlled:
– sender will not overwhelm receiver – Congestion controlled:
– TCP congestion and flow control set window size The University of Sydney
Page 11
TCP seq. numbers, ACKs
sequence numbers:
–byte stream “number” of first byte in segment’s data
acknowledgements:
–seq # of next byte expected from other side
–cumulative ACK
Q: how receiver handles out-
of-order segments
–A: TCP spec doesn’t say, – up to implementor
outgoing segment from sender
source port # dest port #
sequence number
acknowledgement number
rwnd
checksum
urg pointer
window size N
sender sequence number space
sent ACKed
sent, not- yet ACKed (“in- flight”)
usable not but not usable yet sent
incoming segment to sender
source port # dest port #
sequence number
acknowledgement number
A
rwnd
checksum
urg pointer
The University of Sydney
Page 12
TCP seq. numbers, ACKs
Host A Host B
User types ‘C’
Seq=79, ACK=43, data = ‘C’ host ACKs
The University of Sydney
Page 13
receipt of echoed ‘C’
Seq=43, ACK=80
simple telnet scenario
Seq=42, ACK=79, data = ‘C’
host ACKs
receipt of ‘C’, echoes back ‘C’
TCP sender events:
data rcvd from app: timeout:
§ create segment with seq # § retransmit segment that
caused timeout numberoffirstdatabyte §restarttimer
§ seq # is byte-stream in segment
§ start timer if not already running
• think of timer as for oldest unacked segment
• expiration interval: TimeOutInterval
ack rcvd:
§ if ack acknowledges previously unacked segments
• update what is known to be ACKed
• start timer if there are still unacked segments
The University of Sydney
Page 14
TCP round trip time, timeout
Q: how to set TCP
timeout value? § longer than RTT
• but RTT varies
§ too short: premature timeout, unnecessary retransmissions
§ too long: slow reaction to segment loss
Q: how to estimate RTT?
– SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
– SampleRTT will vary, want
estimated RTT “smoother”
– average several recent measurements, not just current SampleRTT
The University of Sydney
Page 15
TCP round trip time, timeout
EstimatedRTT = (1- a)*EstimatedRTT + a*SampleRTT
§ exponential weighted moving average
§ influence of past sample decreases exponentially fast
§ typical value: a = 0.125
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
350
300
250
200
150
100
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
sampleRTT
EstimatedRTT
The University of Sydney
time (seconds)
Page 16
time (seconnds)
SampleRTT Estimated RTT
RTT (milliseconds)
RTT (milliseconds)
TCP round trip time, timeout
– timeout interval: EstimatedRTT plus “safety margin” – largevariationinEstimatedRTT->largersafetymargin
– estimate SampleRTT deviation from EstimatedRTT: DevRTT = (1-b)*DevRTT +
b*|SampleRTT-EstimatedRTT| (typically, b = 0.25)
TimeoutInterval = EstimatedRTT + 4*DevRTT
The University of Sydney
estimated RTT
“safety margin”
Page 17
TCP: retransmission scenarios
Host A Host B
Seq=92, 8 bytes of data
Host A Host B SendBase=92
ACK=100
X
Seq=92, 8 bytes of data
ACK=100
lost ACK scenario
SendBase=100 SendBase=120
SendBase=120
Seq=92, 8 bytes of data Seq=100, 20 bytes of data
ACK=100 ACK=120
Seq=92, 8 bytes of data
ACK=120
premature timeout
The University of Sydney
Page 18
timeout
timeout
TCP: retransmission scenarios
Host A
Host B
Seq=92, 8 bytes of data
Seq=100, 20 bytes of data
X ACK=100 ACK=120
Seq=120, 15 bytes of data
cumulative ACK
The University of Sydney
Page 19
timeout
TCP fast retransmit
§ time-out period often relatively long:
• long delay before resending lost packet
§ detect lost segments via duplicate ACKs.
• sender often sends many segments back- to-back
• if segment is lost, there will likely be many duplicate ACKs.
TCP fast retransmit
if sender receives 3 ACKs for same data
(“triple duplicate ACKs”), (“triple duplicate ACKs”),
resend unacked segment with smallest seq #
§ likely that unacked segment lost, so don’t wait for timeout
The University of Sydney
Page 20
TCP fast retransmit
Host A Host B
Seq=92, 8 bytes of data
Seq=100, 20 bytes of data
X
ACK=100
ACK=100
ACK=100 ACK=100
Seq=100, 20 bytes of data
The University of Sydney
fast retransmit after sender receipt of triple duplicate ACK
Page 21
timeout
TCP flow control
application process
TCP code
TCP socket receiver buffers
IP code
application may remove data from TCP socket buffers ….
… slower than TCP receiver is delivering (sender is sending)
application OS
flow control
receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast
from sender
The University of Sydney
Page 22
receiver protocol stack
TCP flow control
– receiver“advertises”free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– manyoperatingsystemsauto adjust RcvBuffer
– senderlimitsamountof unacked (“in-flight”) data to receiver’s rwnd value
– guaranteesreceivebufferwill not overflow
to application process
buffered data
free buffer space
RcvBuffer
rwnd
TCP segment payloads receiver-side buffering
source port # dest port #
sequence number
acknowledgement number
rwnd
checksum
urg pointer
The University of Sydney
Page 23
Principles of congestion control
congestion:
– informally: “too many sources sending too much data too
fast for network to handle”
– different from flow control!
– manifestations:
– lost packets (buffer overflow at routers) – long delays (queueing in router buffers)
– a top-10 problem! The University of Sydney
Page 24
TCP congestion control
§ approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS (Maximum Segment Size) every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss additively increase window size …
…. until loss occurs (then cut window in half)
AIMD saw tooth behavior: probing for bandwidth
The University of Sydney
Page 25
time
cwnd: TCP sender congestion window size
TCP Congestion Control: details
sender sequence number space
TCP sending rate:
§ roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
cwnd
last byte ACKed
sent, not- yet ACKed (“in- flight”)
last byte sent
§ sender limits transmission:
§ Congestion window (cwnd) is dynamic function of perceived network congestion
The University of Sydney
Page 26
rate
~
cwnd RTT
bytes/sec
LastByteSent- < LastByteAcked
cwnd
TCP Slow Start
§ when connection begins, increase rate exponentially until first loss event:
• initiallycwnd=1MSS (Maximum Segment Size)
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
§ summary: initial rate is slow but ramps up exponentially fast
Host A
Host B
The University of Sydney
Page 27
time
RTT
one segment
two segments
four segments
TCP: detecting, reacting to loss
§ Loss indicated by timeout:
• cwnd set to 1 MSS;
• windowthengrowsexponentially(asinslowstart)tothreshold,then grows linearly
§ Loss indicated by 3 duplicate ACKs: TCP RENO
• dupACKsindicatenetworkcapableofdeliveringsomesegments • cwnd is cut in half window then grows linearly
§ TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks) The University of Sydney Page 28
TCP: switching from slow start to Congestion Avoidance
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation:
§ variable ssthresh
§ on loss event, ssthresh is set to 1/2 of cwnd just before loss event
The University of Sydney
Page 29
Connection Management
before exchanging data, sender/receiver “handshake”: – agree to establish connection (each knowing the other willing to
establish connection)
– agree on connection parameters, e.g. MSS, rwnd, etc. – connection-oriented
application
connection state: ESTAB connection variables:
seq # client-to-server server-to-client
rcvBuffer size at server,client
network
application
connection state: ESTAB connection Variables:
seq # client-to-server server-to-client
rcvBuffer size at server,client
network
The University of Sydney
Page 30
Socket clientSocket = newSocket("hostname","port
number");
Socket connectionSocket = welcomeSocket.accept();
TCP 3-way handshake
client state
LISTEN
choose init seq num, x
server state
LISTEN
SYN RCVD
SYNSENT
SYNbit=1, Seq=x
SYNbit=1, Seq=y ACKbit=1; ACKnum=x+1
ACKbit=1, ACKnum=y+1
choose init seq num, y send TCP SYNACK msg, acking SYN
send TCP SYN msg
received SYNACK(x) indicates server is live; send ACK for SYNACK;
ESTAB
this segment may contain client-to-server data
received ACK(y) indicates client is live
The University of Sydney
Page 31
ESTAB
TCP: closing a connection
client state
server state
ESTAB
CLOSE_WAIT
LAST_ACK
CLOSED
ESTAB
clientSocket.close()
FIN_WAIT_1
FIN_WAIT_2
TIMED_WAIT
can no longer send but can receive data
wait for server close
FINbit=1, seq=x
ACKbit=1; ACKnum=x+1
FINbit=1, seq=y ACKbit=1; ACKnum=y+1
can still send data
can no longer send data
CLOSED
The University of Sydney
Page 32
timed wait for 2*max
segment lifetime
HTTP Overview
HTTP: hypertext transfer protocol
– Web’s application layer protocol
– client/server model
– client: browser that requests, receives, (using HTTP protocol) and “displays” Web objects
– server: Web server sends (using HTTP protocol) objects in response to requests
PC running Firefox browser
server running
Apache Web server
The University of Sydney
Page 33
iPhone running Safari browser
HTTP request
HTTP response
HTTP request
HTTP response
HTTP overview
Uses TCP:
§ client initiates TCP connection (creates socket) to server, port 80
§ server accepts TCP connection from client
§ HTTP messages (application- layer protocol messages) exchanged between browser (HTTP client) and Web server (HTTP server)
§ TCP connection closed The University of Sydney
HTTP is “stateless”
– server maintains no information about past client requests
What is HTTPS ?
– HTTP over TLS (Transport Layer Security), e.g. over SSL
Page 34
Related terminology
– Broadcast:one-to-allcommunication
– Actionofsendingamessagetoallnodesofthesystem
– Typicallyusedforrelativelysmallsystems,likeIPbroadcastaspartof Ethernet in LANs
– Unicast:one-to-onecommunication
– Incontrastwithbroadcast,describeaspecialformofdistributiontoasingle
receiver
– Usedgenerallyinthecontextofmultimediaorstreamingapplications
– Anycast:one-to-random-onecommunication
– SendamessagetoanIPaddressrangefromtoobtainaresponsefrom
any potential receiver, whose IP address belongs to this range
– UsedwithUDPandTCP
– Multicast:one-to-manycommunication
– Actionofsendingamessagetomultiplenodesofthesystem(not
necessarily all nodes)
– Thistermisusedinmanycontexts(network,algorithm)
The University of Sydney
Page 35
Conclusion (Transportation layer)
– Message losses can have dramatic consequences
– TCP/IP protocol suite hides these losses from the application level
– Socket, RPC, use TCP/IP
– Sockets are complex
– RPC is more transparent for the client
– Multicast communication scales well for large distributed systems The University of Sydney Page 36