Chapter 3 Transport Layer
Chapter 3: Transport Layer
All material copyright 1996-2016
J.F Kurose and K.W. Ross, All Rights Reserved
Chapter 3 outline
Transport services and protocols
3.1 transport-layer services
3.5 connection-oriented transport: TCP
provide logical communication between app processes running on different hosts
transport
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
transport protocols run in end systems
3.3 connectionless transport: UDP
3.6 principles of congestion control
• send side: breaks app messages into segments, passes to network layer
3.4 principles of reliable data transfer
• rcv side: reassembles segments into messages, passes to app layer
application
3.7 TCP congestion control
transport
Transport Layer 2-1
Transport Layer 3-2
Transport Layer 3-3
Transport Layer 3-4
more than one transport protocol available to apps
network data link physical
our goals:
understandprinciples learnaboutInternet
behind transport layer services:
transport layer protocols:
• multiplexing, demultiplexing
• UDP: connectionless transport
• reliable data transfer
• flow control
• congestion control
• TCP: connection-oriented reliable transport
• Internet: TCP and UDP
• TCP congestion control
application
network data link physical
Transport vs. network layer
Internet transport-layer protocols
network layer: logical communication between hosts
reliable, in-order delivery (TCP)
application
transport layer: logical
12 kids in Ann’s house sending letters to 12 kids in Bill’s house:
• congestion control • flow control
• connection setup
network data link physical
network data link physical
communication between processes
hosts = houses
network data link physical
• relies on, enhances, network layer services
app messages = letters in envelopes
• no-frills extension of “best-effort” IP
network data link physical
network data link physical
Chapter 3 outline
Multiplexing/demultiplexing
3.1 transport-layer services
3.5 connection-oriented transport: TCP
multiplexing at sender:
demultiplexing at receiver:
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
handle data from multiple sockets, add transport header (later used for demultiplexing)
use header info to deliver received segments to correct socket
3.3 connectionless transport: UDP
application
3.4 principles of reliable data transfer
3.6 principles of congestion control
P3
transport network
P4
process
household analogy:
network data link physical
network data link physical
processes = kids
unreliable, unordered delivery: UDP
transport protocol = Ann and Bill who demux to in- house siblings
application
network-layer protocol = postal service
services not available: • delay guarantees
• bandwidth guarantees
network data link physical
3.7 TCP congestion control
transport network
link physical
transport network
Transport Layer 3-5
Transport Layer 3-6
Transport Layer 3-7
Transport Layer 3-8
application
P1 P2
application
socket
link physical
link physical
transport
network data link physical
transport
How demultiplexing works
Connectionless demultiplexing
host receives IP datagrams • each datagram has source IP
32 bits
recall: created socket has
recall: when creating datagram to send into UDP socket, must specify
address, destination IP address
DatagramSocket mySocket1
= new DatagramSocket(12534);
• destination IP address • destination port #
• each datagram carries one transport-layer segment
other header fields
• each segment has source, destination port number
application data (payload)
when host receives UDP segment:
IP datagrams with same dest. port #, but different source IP addresses and/or source port numbers will be directed to same socket at dest
host uses IP addresses & port numbers to direct segment to appropriate socket
• checks destination port # in segment
Connectionless demux: example
Connection-oriented demux
DatagramSocket
mySocket2 = new
DatagramSocket
DatagramSocket
serverSocket = new
DatagramSocket
DatagramSocket
mySocket1 = new
DatagramSocket
(5775);
TCP socket identified by 4-tuple:
server host may support many simultaneous TCP sockets:
(9157); application
(6428); application
• source IP address
• source port number • dest IP address
• dest port number
• each socket identified by its own 4-tuple
P3
P4
transport network
transport network
transport network
web servers have different sockets for each connecting client
link physical
link physical
link physical
demux: receiver uses all four values to direct segment to appropriate socket
source port: 9157 dest port: 6428
source port: ? dest port: ?
source port: 6428 dest port: 9157
source port: ? dest port: ?
• non-persistent HTTP will have different socket for each request
P1
application
source port #
dest port #
host-local port #:
TCP/UDP segment format
• directs UDP segment to socket with that port #
Transport Layer 3-9
Transport Layer 3-10
Transport Layer 3-11
Transport Layer 3-12
Connection-oriented demux: example
Connection-oriented demux: example
host: IP address A
source IP,port: B,80 dest IP,port: A,9157
source IP,port: C,5775 dest IP,port: B,80
host: IP address C
host: IP address A
source IP,port: B,80 dest IP,port: A,9157
source IP,port: C,5775 dest IP,port: B,80
host: IP address C
application
P4
P5 P6
application
application
P4
application
P3
P2 P3
P3
P2 P3
transport network
transport network link
transport network
transport network
transport network link
transport network
link physical
physical
link physical
link physical
physical
link physical
source IP,port: A,9157 dest IP, port: B,80
source IP,port: C,9157 dest IP,port: B,80
source IP,port: A,9157 dest IP, port: B,80
source IP,port: C,9157 dest IP,port: B,80
three segments, all destined to IP address: B, dest port: 80 are demultiplexed to different sockets
Transport Layer 3-13
Transport Layer 3-14
Chapter 3 outline
UDP: User Datagram Protocol [RFC 768]
3.1 transport-layer services
3.5 connection-oriented transport: TCP
“no frills,” “bare bones” Internet transport protocol
UDP use:
streaming multimedia
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
“best effort” service, UDP segments may be:
apps (loss tolerant, rate sensitive)
3.3 connectionless transport: UDP
to app
SNMP
reliable transfer over
3.4 principles of reliable data transfer
3.6 principles of congestion control
connectionless:
• no handshaking
UDP:
application
application
server: IP address B
server: IP address B
3.7 TCP congestion control
between UDP sender, receiver
add reliability at application layer
Transport Layer 3-15
Transport Layer 3-16
• lost
• delivered out-of-order
DNS
• each UDP segment handled independently of others
application-specific error recovery!
threaded server
UDP: segment header
UDP checksum
source port # length
dest port # checksum
why is there a UDP?
sender:
receiver:
32 bits
length, in bytes of UDP segment, including header
Goal: detect “errors” (e.g., flipped bits) in transmitted segment
application data (payload)
no connection establishment (which can add delay)
treat segment contents, including header fields, as sequence of 16-bit integers
compute checksum of received segment
UDP segment format
small header size
no congestion control:
sender puts checksum value into UDP checksum field
• YES – no error detected. But maybe errors nonetheless? More later ….
Internet checksum: example
Chapter 3 outline
example: add two 16-bit integers
3.1 transport-layer services
3.5 connection-oriented transport: TCP
simple: no connection state at sender, receiver
checksum: addition (one’s complement sum) of segment contents
check if computed checksum equals checksum field value:
11110011001100110 11101010101010101
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer •flowcontrol •connectionmanagement
wraparound 11011101110111011 sum 11011101110111100
3.3connectionless transport:UDP
3.6principlesofcongestion control
checksum 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 Note: when adding numbers, a carryout from the most
3.4principlesofreliable data transfer
3.7 TCP congestion control
significant bit needs to be added to the result
UDP can blast away as fast as desired
Transport Layer 3-17
Transport Layer 3-18
Transport Layer 3-19
Transport Layer 3-20
• NO – error detected
Principles of reliable data transfer
Principles of reliable data transfer
important in application, transport, link layers • top-10 list of important networking topics!
important in application, transport, link layers • top-10 list of important networking topics!
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Principles of reliable data transfer
Reliable data transfer: getting started
important in application, transport, link layers • top-10 list of important networking topics!
rdt_send(): called from above, (e.g., by app.). Passed data to deliver to receiver upper layer
deliver_data(): called by rdt to deliver data to upper
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
rdt_rcv(): called when packet arrives on rcv-side of channel
Transport Layer 3-21
Transport Layer 3-22
Transport Layer 3-23
Transport Layer 3-24
send side
receive side
Reliable data transfer: getting started
rdt1.0: reliable transfer over a reliable channel
we’ll:
incrementally develop sender, receiver sides of
underlying channel perfectly reliable • no bit errors
• no loss of packets
reliable data transfer protocol (rdt)
consider only unidirectional data transfer • but control info will flow on both directions!
separate FSMs for sender, receiver:
• sender sends data into underlying channel
• receiver reads data from underlying channel
use finite state machines (FSM) to specify sender,
receiver
event causing state transition actions taken on state transition
state: when in this “state” next state uniquely determined by next event
state 1
state 2
packet = make_pkt(data) udt_send(packet)
extract (packet,data) deliver_data(data)
rdt2.0: channel with bit errors
rdt2.0: channel with bit errors
underlying channel may flip bits in packet • checksum to detect bit errors
underlying channel may flip bits in packet • checksum to detect bit errors
the question: how to recover from errors:
• acknowledgements (ACKs): receiver explicitly tells sender
the question: how to recover from errors:
• acknowledgements (ACKs): receiver explicitly tells sender
that pkt received OK
that pkt received OK
event actions
sender
receiver
• negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
• negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
• sender retransmits pkt on receipt of NAK
How do humans recover from “errors”
• sender retransmits pkt on receipt of NAK
new mechanisms in rdt2.0 (beyond rdt1.0):
new mechanisms in rdt2.0 (beyond rdt1.0): during conversation?
• error detection
• error detection
• receiver feedback: control msgs (ACK,NAK) rcvr- >sender
• feedback: control msgs (ACK,NAK) from receiver to sender
Transport Layer 3-25
Transport Layer 3-26
Transport Layer 3-27
Transport Layer 3-28
Wait for call from above
rdt_send(data)
Wait for call from below
rdt_rcv(packet)
rdt2.0: FSM specification
rdt2.0: operation with no errors
rdt_send(data)
sndpkt = make_pkt(data, checksum)
receiver
rdt_send(data)
snkpkt = make_pkt(data, checksum)
udt_send(sndpkt)
rdt_rcv(rcvpkt) && isNAK(rcvpkt)
udt_send(sndpkt)
rdt_rcv(rcvpkt) && isNAK(rcvpkt)
Wait for call from above
Wait for ACK or NAK
rdt_rcv(rcvpkt) && corrupt(rcvpkt)
Wait for call from above
Wait for ACK or NAK
rdt_rcv(rcvpkt) && corrupt(rcvpkt)
udt_send(sndpkt) rdt_rcv(rcvpkt) && isACK(rcvpkt)
udt_send(NAK)
udt_send(sndpkt) rdt_rcv(rcvpkt) && isACK(rcvpkt)
udt_send(NAK)
sender
Wait for call from below
Wait for call from below
rdt2.0: error scenario
rdt2.0 has a fatal flaw!
rdt_send(data)
snkpkt = make_pkt(data, checksum)
what happens if ACK/NAK corrupted?
handling duplicates: sender retransmits
udt_send(sndpkt)
rdt_rcv(rcvpkt) && isNAK(rcvpkt)
Wait for call from above
Wait for ACK or NAK
rdt_rcv(rcvpkt) && corrupt(rcvpkt)
sender doesn’t know what happened at receiver!
current pkt if ACK/NAK corrupted
udt_send(sndpkt) rdt_rcv(rcvpkt) && isACK(rcvpkt)
udt_send(NAK)
sender adds sequence number to each pkt
Wait for call from below
can’t just retransmit: possible duplicate
receiver discards (doesn’t deliver up) duplicate pkt
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) extract(rcvpkt,data)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) extract(rcvpkt,data)
deliver_data(data) udt_send(ACK)
deliver_data(data) udt_send(ACK)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) extract(rcvpkt,data)
sender sends one packet, then waits for receiver response
deliver_data(data) udt_send(ACK)
Transport Layer 3-29
Transport Layer 3-30
Transport Layer 3-31
Transport Layer 3-32
stop and wait
rdt2.1: sender, handles garbled ACK/NAKs
rdt2.1: receiver, handles garbled ACK/NAKs
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt) && isACK(rcvpkt)
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt) && isACK(rcvpkt)
sndpkt = make_pkt(NAK, chksum) udt_send(sndpkt)
sndpkt = make_pkt(NAK, chksum) udt_send(sndpkt)
rdt_rcv(rcvpkt) &&
not corrupt(rcvpkt) && has_seq1(rcvpkt)
Wait for 0 from below
Wait for 1 from below
rdt_rcv(rcvpkt) &&
not corrupt(rcvpkt) && has_seq0(rcvpkt)
rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
Wait for ACK or NAK 1
Wait for call 1 from above
sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt)
udt_send(sndpkt)
sndpkt = make_pkt(1, data, checksum) udt_send(sndpkt)
extract(rcvpkt,data) deliver_data(data)
sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt)
rdt2.1: discussion
rdt2.2: a NAK-free protocol
sender:
receiver:
same functionality as rdt2.1, using ACKs only
seq # added to pkt
two seq. #’s (0,1) will
must check if received packet is duplicate
instead of NAK, receiver sends ACK for last pkt received OK
suffice. Why?
• state indicates whether 0 or 1 is expected pkt seq #
• receiver must explicitly include seq # of pkt being ACKed duplicate ACK at sender results in same action as
must check if received ACK/NAK corrupted
note: receiver can not know if its last ACK/NAK received OK at sender
NAK: retransmit current pkt
twice as many states
• state must “remember” whether “expected” pkt should have seq # of 0 or 1
rdt_send(data)
sndpkt = make_pkt(0, data, checksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
udt_send(sndpkt)
rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
extract(rcvpkt,data) deliver_data(data)
sndpkt = make_pkt(ACK, chksum) udt_send(sndpkt)
Wait for call 0 from above
Wait for ACK or NAK 0
udt_send(sndpkt)
rdt_rcv(rcvpkt) && (corrupt(rcvpkt)
rdt_rcv(rcvpkt) && (corrupt(rcvpkt)
rdt_send(data)
Transport Layer 3-33
Transport Layer 3-34
Transport Layer 3-35
Transport Layer 3-36
rdt2.2: sender, receiver fragments
rdt3.0: channels with errors and loss
rdt_rcv(rcvpkt) && (corrupt(rcvpkt) ||
rdt_rcv(rcvpkt) &¬corrupt(rcvpkt) && isACK(rcvpkt,0)
• checksum, seq. #, ACKs,retransmissions will be of help … but not enough
ifpkt(orACK)justdelayed (not lost):
has_seq1(rcvpkt)) udt_send(sndpkt)
Wait for 0 from below
receiver FSM fragment
• retransmission will be duplicate, but seq. #’s already handles this
rdt3.0 sender
rdt3.0 in action
rdt_rcv(rcvpkt)
send pkt0
pkt0
send pkt0 pkt0
timeout udt_send(sndpkt) start_timer
Wait for ACK1
Wait for call 1 from above
ack0
rcv pkt0 send ack0
resend pkt1 pkt1
ack1 send ack1
Wait for call 0from above
Wait for ACK0
timeout udt_send(sndpkt) start_timer
ack0
rcv pkt0 send ack0
rcv pkt0 ack0 send ack0
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt) && isACK(rcvpkt,1)
rcv ack0 send pkt1
pkt1
rcv pkt1 send ack1
rcv ack0
send pkt1 pkt1X
stop_timer
rcv ack1 send pkt0
pkt0
timeout
rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) )
rdt_send(data)
(b) packet loss
Transport Layer 3-40
rdt_send(data)
sndpkt = make_pkt(0, data, checksum)
new assumption:
approach: sender waits “reasonable” amount of time for ACK
udt_send(sndpkt)
rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) ||
underlying channel can also lose packets (data, ACKs)
Wait for call 0 from above
sender FSM fragment
Wait for ACK 0
isACK(rcvpkt,1) ) udt_send(sndpkt)
retransmits if no ACK received in this time
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
• receiver must specify seq # of pkt being ACKed
extract(rcvpkt,data) deliver_data(data)
sndpkt = make_pkt(ACK1, chksum) udt_send(sndpkt)
Transport Layer 3-37
requires countdown timer Transport Layer 3-38
rdt_send(data)
rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) )
sndpkt = make_pkt(0, data, checksum) udt_send(sndpkt)
start_timer
sender
receiver
sender
receiver
sndpkt = make_pkt(1, data, checksum) udt_send(sndpkt)
start_timer
(a) no loss
ack0 send ack0
rdt_rcv(rcvpkt)
&& notcorrupt(rcvpkt) && isACK(rcvpkt,0) stop_timer
ack1
loss
rdt_rcv(rcvpkt)
rcv ack1 pkt0 send pkt0
rcv pkt1 rcv pkt0
Transport Layer 3-39
rdt3.0 in action
sender
receiver
Performance of rdt3.0
sender
receiver
send pkt0
pkt0
rdt3.0 is correct, but performance stinks
e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
send pkt0
pkt0 rcv pkt0 ack0 send ack0
ack0
rcv pkt0 send ack0
rcv ack0 send pkt1
rcv ack0 send pkt1
pkt1
rcv pkt1 send ack1
timeout
rcv ack1 send pkt0
rcv pkt1 (detect duplicate)
resend pkt1
pkt1 rcv pkt1 (detect duplicate)
send ack1
rcv ack1 send pkt0
ack1 send ack1
rcv ack1 send pkt0
ack1 ack0
rcv pkt0 send ack0
sender = = 30.008 = 0.00027 RTT + L / R
pkt1 rcv pkt1 ack1 send ack1
ack1
D = RL = 8000 bits = 8 microsecs trans 109 bits/sec
X
timeout
loss
resend pkt1
pkt1 pkt0
U sender: utilization – fraction of time sender busy sending U L/R .008
(c) ACK loss
send ack0 (d) premature timeout/ delayed ACK
network protocol limits use of physical resources! Transport Layer 3-42
pkt0
pkt0
rcv pkt0 (detect duplicate)
rcv pkt0 ack0 send ack0
ack0
if RTT=30 msec, 1KB pkt every 30 msec: 33kB/sec thruput over 1 Gbps link
rdt3.0: stop-and-wait operation
Pipelined protocols
sender first packet bit transmitted, t = 0
receiver
pipelining: sender allows multiple, “in-flight”, yet- to-be-acknowledged pkts
last packet bit transmitted, t = L / R RTT
first packet bit arrives
last packet bit arrives, send ACK
• range of sequence numbers must be increased • buffering at sender and/or receiver
ACK arrives, send next packet, t = RTT + L / R
Usender= L/R RTT + L / R
= .008 30.008
= 0.00027
two generic forms of pipelined protocols: go-Back-N, selective repeat
Transport Layer 3-41
Transport Layer 3-43
Transport Layer 3-44
Pipelining: increased utilization
Pipelined protocols: overview
sender last bit transmitted, t = L / R
receiver
Go-back-N:
Selective Repeat:
first packet bit transmitted, t = 0
sender can have up to N unacked packets in pipeline
sender can have up to N unack’ed packets in pipeline
ACK(n): ACKs all pkts up to, including seq # n – “cumulative ACK”
Wait
• may receive duplicate ACKs (see receiver) timer for oldest in-flight pkt
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
ACK arrives, send next packet, t = RTT + L / R
• doesn’t ack packet if there’s a gap
U sender =
3L / R RTT + L / R
.0024
= 30.008 =
0.00081
• when timer expires, retransmit all unacked packets
• when timer expires, retransmit only that unacked packet
RTT
first packet bit arrives
last packet bit arrives, send ACK
last bit of 2nd packet arrives, send ACK last bit of 3rd packet arrives, send ACK
receiver only sends cumulative ack
rcvr sends individual ack for each packet
Go-Back-N: sender
GBN: sender extended FSM
k-bit seq # in pkt header
“window” of up to N, consecutive unack’ed pkts allowed
rdt_send(data)
timeout(n): retransmit packet n and all higher seq # pkts in window
base = getacknum(rcvpkt)+1 If (base == nextseqnum)
3-packet pipelining increases utilization by a factor of 3!
sender has timer for oldest unacked packet
sender maintains timer for each unacked packet
Transport Layer 3-45
Transport Layer 3-46
Transport Layer 3-47
start_timer
Transport Layer 3-48
else refuse_data(data)
base=1 nextseqnum=1
timeout
start_timer udt_send(sndpkt[base]) udt_send(sndpkt[base+1])
… udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt)
&& corrupt(rcvpkt)
if (nextseqnum < base+N) {
sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum) udt_send(sndpkt[nextseqnum])
if (base == nextseqnum)
start_timer nextseqnum++ }
stop_timer else
GBN: receiver extended FSM
GBN in action
&& hasseqnum(rcvpkt,expectedseqnum)
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
Xloss
expectedseqnum=1 Wait sndpkt =
extract(rcvpkt,data)
deliver_data(data)
sndpkt = make_pkt(expectedseqnum,ACK,chksum) udt_send(sndpkt)
expectedseqnum++
(wait)
receive pkt3, discard, (re)send ack1
make_pkt(0,ACK,chksum)
default udt_send(sndpkt)
sender window (N=4)
sender
receiver
rdt_rcv(rcvpkt) 012345678 &¬currupt(rcvpkt) 012345678
send pkt0 send pkt1 send pkt2 send pkt3
receive pkt0, send ack0 receive pkt1, send ack1
ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
ignore duplicate ACK
• may generate duplicate ACKs
• need only remember expectedseqnum
pkt 2 timeout
out-of-orderpkt: 012345678 012345678
• discard (don’t buffer): no receiver buffering! • re-ACK pkt with highest in-order seq #
Selective repeat
Selective repeat: sender, receiver windows
receiver individually acknowledges all correctly received pkts
• buffers pkts, as needed, for eventual in-order delivery to upper layer
sender only resends pkts for which ACK not received
• sender timer for each unACKed pkt sender window
• N consecutive seq #’s
• limits seq #s of sent, unACKed pkts
Transport Layer 3-49
Transport Layer 3-50
Transport Layer 3-51
Transport Layer 3-52
012345678 012345678
rcv ack0, send pkt4 rcv ack1, send pkt5
receive pkt4, discard, (re)send ack1 receive pkt5, discard, (re)send ack1
012345678 0 1 2 3 4 5 6 7 8
send pkt2 send pkt3 send pkt4 send pkt5
rcv pkt2, deliver, send ack2 rcv pkt3, deliver, send ack3 rcv pkt4, deliver, send ack4 rcv pkt5, deliver, send ack5
Selective repeat
Selective repeat in action
sender
receiver
sender window (N=4)
sender
receiver
data from above:
pkt n in [rcvbase, rcvbase+N-1]
012345678 012345678 012345678 012345678
send pkt0 send pkt1 send pkt2 send pkt3
receive pkt0, send ack0 receive pkt1, send ack1
if next available seq # in window, send pkt
send ACK(n)
Xloss
timeout(n):
out-of-order: buffer
(wait)
resend pkt n, restart timer ACK(n) in [sendbase,sendbase+N]:
in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
012345678 012345678
rcv ack0, send pkt4 rcv ack1, send pkt5
receive pkt3, buffer, send ack3
mark pkt n as received
if n smallest unACKed pkt,
record ack3 arrived
receive pkt4, buffer, send ack4
advance window base to next unACKed seq #
ACK(n)
otherwise: 012345678
send pkt2 record ack4 arrived record ack5 arrived
Selective repeat: dilemma
sender window (after receipt)
receiver window (after receipt)
Chapter 3 outline
example:
0123012 pkt0 0123012 pkt1 0123012 pkt2
0123012 0123012 0123012
3.1 transport-layer services
3.5 connection-oriented transport: TCP
seq #’s: 0, 1, 2, 3
0 1 2 3 0 1 2 pkt3 X 0 1 2 3 0 1 2 pkt0
window size=3
will accept packet with seq number 0
3. 2 3. 3 3. 4
multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
receiver sees no difference in two scenarios!
(a) no problem
duplicate data accepted as new in (b)
receiver can’t see sender side. receiver behavior identical in both cases! something’s (very) wrong!
connectionless transport: UDP
Q: what relationship between seq # size and window size to avoid problem in (b)?
0123012 X 0123012
3.7 TCP congestion control
pkt n in [rcvbase-N,rcvbase-1]
pkt 2 timeout
receive pkt5, buffer, send ack5
ignore
012345678 012345678
rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
0 1 2 3 0 1 2 pkt0 0123012 pkt1 0123012 pkt2
0123012
principles of reliable data transfer
3.6 principles of congestion control
timeout X retransmit pkt0
pkt0 (b) oops!
0 1 2 3 0 1 2
will accept packet with seq number 0
Transport Layer 3-53
Transport Layer 3-54
Transport Layer 3-55
Transport Layer 3-56
0 1 2 3 4 5 6 7 8
Q: what happens when ack2 arrives?
TCP: Overview RFCs: 793,1122,1323, 2018, 2581
TCP segment structure
point-to-point:
• one sender, one receiver
full duplex data:
• bi-directional data flow
URG: urgent data (generally not used)
source port # dest port # sequence number
counting
by bytes
of data
(not segments!)
reliable, in-order byte stream:
in same connection
ACK: ACK # valid
acknowledgement number
• no “message boundaries”
connection-oriented: • handshaking (exchange
RST, SYN, FIN: connection estab (setup, teardown
checksum
options (variable length)
pipelined:
• TCP congestion and
of control msgs) inits sender, receiver state before data exchange
commands)
flow control set window size
flow controlled: • sender will not
Internet checksum (as in UDP)
application data (variable length)
TCP seq. numbers, ACKs
TCP seq. numbers, ACKs
sequence numbers:
outgoing segment from sender
• byte stream “number” of first byte in segment’s data
sequence number
acknowledgements:
window size N
Seq=42, ACK=79, data = ‘C’
• seq # of next byte expected from other side
sender sequence number space
host ACKs receipt of ‘C’, echoes back ‘C’
• cumulative ACK
Q: how receiver handles
host ACKs receipt
Seq=79, ACK=43, data = ‘C’
out-of-order segments
sent ACKed
sent, not- yet ACKed (“in- flight”)
usable but not yet sent
not usable
of echoed ‘C’
• A: TCP spec doesn’t say, - up to implementor
incoming segment to sender
• MSS: maximum segment size
PSH: push data now (generally not used)
head not UAPRSF len used
receive window Urg data pointer
# bytes rcvr willing to accept
source port # dest port #
Host A
Host B
acknowledgement number rwnd
checksum
urg pointer
User types ‘C’
overwhelm receiver
source port # dest port # sequence number
simple telnet scenario
acknowledgement number
A rwnd checksum urg pointer
Transport Layer 3-57
Transport Layer 3-58
Transport Layer 3-59
Transport Layer 3-60
32 bits
Seq=43, ACK=80
RTT (milliseconds)
RTT (milliseconds)
TCP round trip time, timeout
TCP round trip time, timeout
Q: how to set TCP timeout value?
Q: how to estimate RTT?
EstimatedRTT = (1- )*EstimatedRTT + *SampleRTT exponential weighted moving average
longer than RTT • but RTT varies
SampleRTT: measured time from segment transmission until ACK receipt
influence of past sample decreases exponentially fast typical value: = 0.125
too short: premature timeout, unnecessary retransmissions
• ignore retransmissions
SampleRTT will vary, want
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
too long: slow reaction to segment loss
• average several recent measurements, not just current SampleRTT
200
TCP round trip time, timeout
Chapter 3 outline
timeout interval: EstimatedRTT plus “safety margin” • large variation in EstimatedRTT -> larger safety margin
3.1 transport-layer services
3.5 connection-oriented transport: TCP
estimate SampleRTT deviation from EstimatedRTT: DevRTT = (1-)*DevRTT +
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
estimated RTT “smoother”
250
*|SampleRTT-EstimatedRTT| (typically, = 0.25)
3.3 connectionless transport: UDP
3.6 principles of congestion control
TimeoutInterval = EstimatedRTT + 4*DevRTT
3.4 principles of reliable data transfer
3.7 TCP congestion control
estimated RTT “safety margin” Transport Layer 3-63
Transport Layer 3-64
Transport Layer 3-61
time (seconds) Transport Layer 3-62 SampleRTT Estimated RTT
350
300
150
sampleRTT EstimatedRTT
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
100
1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106
time (seconnds)
timeout
timeout
TCP reliable data transfer
TCP sender events:
TCP creates rdt service on top of IP’s unreliable service
data rcvd from app:
timeout:
• pipelined segments
let’s initially consider simplified TCP sender:
seq # is byte-stream number of first data bytein segment
restart timer ack rcvd:
• cumulative acks
ifackacknowledges previously unacked segments
• single retransmission timer
• ignore duplicate acks • ignore flow control,
start timer if not already running
retransmissions triggered by:
congestion control
• think of timer as for oldest unacked segment
• update what is known to be ACKed
• timeout events • duplicate acks
• expiration interval: TimeOutInterval
• start timer if there are still unacked segments
TCP sender (simplified)
TCP: retransmission scenarios
pass segment to IP (i.e., “send”) NextSeqNum = NextSeqNum + length(data) if (timer currently not running)
Seq=92, 8 bytes of data
Seq=92, 8 bytes of data Seq=100, 20 bytes of data
NextSeqNum = InitialSeqNum SendBase = InitialSeqNum
wait for event
start timer
ACK received, with ACK field value y
SendBase=100 SendBase=120
if (y > SendBase) {
SendBase = y
/* SendBase–1: last cumulatively ACKed byte */ if (there are currently not-yet-acked segments)
ACK=100
ACK=120
}
Transport Layer 3-67
start timer else stop timer
lost ACK scenario
premature timeout
Transport Layer 3-68
data received from application above create segment, seq. #: NextSeqNum
Host A Host B
Host A SendBase=92
Host B
timeout
ACK=100 ACK=120
retransmit not-yet-acked segment with smallest seq. #
start timer
Seq=92, 8 bytes of data
Seq=92, 8 bytes of data
Transport Layer 3-65
Transport Layer 3-66
create segment with seq #
retransmit segment that caused timeout
X
ACK=100
SendBase=120
timeout
timeout
TCP: retransmission scenarios
TCP ACK generation [RFC 1122, RFC 2581]
Host A Host B
Seq=92, 8 bytes of data Seq=100, 20 bytes of data
arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
delayed ACK. Wait up to 500ms
for next segment. If no next segment, send ACK
X ACK=100 ACK=120
arrival of in-order segment with expected seq #. One other segment has ACK pending
immediately send single cumulative ACK, ACKing both in-order segments
Seq=120, 15 bytes of data
arrival of out-of-order segment higher-than-expect seq. # . Gap detected
immediately send duplicate ACK, indicating seq. # of next expected byte
cumulative ACK
arrival of segment that partially or completely fills gap
immediate send ACK, provided that segment starts at lower end of gap
TCP fast retransmit
TCP fast retransmit
time-outperiod often relatively long:
Host A
Host B
• long delay before resending lost packet
if sender receives 3 ACKs for same data
Seq=92, 8 bytes of data Seq=100, 20 bytes of data
detect lost segments via duplicate ACKs.
(“triple duplicate ACKs”), (“triple duplicate ACKs”),
X
• sender often sends many segments back- to-back
resend unacked segment with smallest seq #
ACK=100 ACK=100 ACK=100 ACK=100
• if segment is lost, there will likely be many duplicate ACKs.
likely that unacked segment lost, so don’t wait for timeout
Seq=100, 20 bytes of data
TCP fast retransmit
Transport Layer 3-69
Transport Layer 3-70
Transport Layer 3-71
fast retransmit after sender receipt of triple duplicate ACK
Transport Layer 3-72
event at receiver
TCP receiver action
Chapter 3 outline
TCP flow control
3.1 transport-layer services
3.5 connection-oriented transport: TCP
TCP socket receiver buffers
application OS
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
… slower than TCP receiver is delivering (sender is sending)
TCP code
3.3 connectionless transport: UDP
3.6 principles of congestion control
3.4 principles of reliable data transfer
flow control
IP code
TCP flow control
Chapter 3 outline
receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
to application process
3.1 transport-layer services
3.5 connection-oriented transport: TCP
• RcvBuffer size set via socket options (typical default is 4096 bytes)
RcvBuffer
rwnd
buffered data free buffer space
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
• many operating systems autoadjust RcvBuffer
TCP segment payloads receiver-side buffering
3.4 principles of reliable data transfer
sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
3.7 TCP congestion control
guarantees receive buffer will not overflow
3.7 TCP congestion control
receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast
Transport Layer 3-73
Transport Layer 3-74
Transport Layer 3-75
Transport Layer 3-76
3.3 connectionless transport: UDP
3.6 principles of congestion control
application may remove data from TCP socket buffers ….
application process
from sender
receiver protocol stack
Connection Management
Agreeing to establish a connection
before exchanging data, sender/receiver “handshake”: agree to establish connection (each knowing the other willing
2-way handshake: Let’s talk
Q: will 2-way handshake always work in network?
variable delays retransmitted messages (e.g. req_conn(x)) due to message loss
to establish connection)
agree on connection parameters
application
application
OK
ESTAB
connection state: ESTAB connection variables:
connection state: ESTAB connection Variables:
seq # client-to-server server-to-client
seq # client-to-server server-to-client
message reordering can’t “see” other side
rcvBuffer size at server,client
rcvBuffer size at server,client
choose x
req_conn(x) acc_conn(x)
network
network
ESTAB
Socket clientSocket = newSocket(“hostname”,”port
Socket connectionSocket = welcomeSocket.accept();
number”);
Agreeing to establish a connection
TCP 3-way handshake
2-way handshake failure scenarios:
choose x
choose x
LISTEN SYNSENT
choose init seq num, x send TCP SYN msg
LISTEN
retransmit req_conn(x)
retransmit req_conn(x)
choose init seq num, y send TCP SYNACK msg, acking SYN
SYN RCVD
ESTAB
ESTAB
data(x+1)
accept data(x+1)
SYNbit=1, Seq=y ACKbit=1; ACKnum=x+1
client terminates
server forgets x
client terminates
server req_conn(x) forgets x
this segment may contain client-to-server data
ACKbit=1, ACKnum=y+1
req_conn(x) acc_conn(x)
ESTAB
req_conn(x) acc_conn(x)
ESTAB
SYNbit=1, Seq=x
req_conn(x)
retransmit data(x+1)
received SYNACK(x) indicates server is live; send ACK for SYNACK;
connection x completes
connection x completes
ESTAB
half open connection! (no client!)
accept data(x+1)
ESTAB
data(x+1)
ESTAB
Transport Layer 3-77
Transport Layer 3-78
Transport Layer 3-79
Transport Layer 3-80
ESTAB
ESTAB
client state
server state
received ACK(y) indicates client is live
ESTAB
TCP 3-way handshake: FSM
TCP: closing a connection
client state
server state
3.1 transport-layer services
3.5 connection-oriented transport: TCP
ESTAB FIN_WAIT_1
ESTAB
FIN_WAIT_2
wait for server close
can still
connectionless transport: UDP
TIMED_WAIT
can no longer send data
CLOSED
Socket connectionSocket = welcomeSocket.accept();
respond to received FIN with ACK
• on receiving FIN, ACK can be combined with own FIN
SYN(x)
number”);
simultaneous FIN exchanges can be handled
SYNACK(seq=y,ACKnum=x+1) create new socket for communication back to client
listen
SYN(seq=x)
ACK(ACKnum=y+1)
ESTAB
SYNACK(seq=y,ACKnum=x+1) ACK(ACKnum=y+1)
SYN rcvd
SYN sent
TCP: closing a connection
Chapter 3 outline
clientSocket.close()
can no longer send but can receive data
FINbit=1, seq=x ACKbit=1; ACKnum=x+1
CLOSE_WAIT send data
3. 2 3. 3 3. 4
multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
timed wait for 2*max
CLOSED
3.7 TCP congestion control
segment lifetime
Socket clientSocket = newSocket(“hostname”,”port
closed
client, server each close their side of connection • send TCP segment with FIN bit = 1
FINbit=1, seq=y ACKbit=1; ACKnum=y+1
LAST_ACK
principles of reliable data transfer
3. 6
principles of congestion control
Transport Layer 3-81
Transport Layer 3-82
Transport Layer 3-83
Transport Layer 3-84
out
out
delay
Principles of congestion control
Causes/costs of congestion: scenario 1
congestion:
informally: “too many sources sending too much
two senders, two original data: in
throughput: out unlimited shared
data too fast for network to handle”
different from flow control!
manifestations:
output link buffers
• lost packets (buffer overflow at routers)
R/2
• long delays (queueing in router buffers) a top-10 problem!
Causes/costs of congestion: scenario 2
Causes/costs of congestion: scenario 2
one router, finite buffers
sender retransmission of timed-out packet
idealization: perfect knowledge
R/2
• application-layer input = application-layer output: in =
sender sends only when router buffers available
out
• transport-layer input includes retransmissions :
in R/2
Host B
finite shared output link buffers
Host B
finite shared output link buffers
Host A
A
free buffer space!
in : original data
’in: original data, plus
out
copy
in : original data
’in: original data, plus
out
retransmitted data
retransmitted data
‘
in
in
Transport Layer 3-85
Transport Layer 3-86
Transport Layer 3-87
Transport Layer 3-88
receivers
one router, infinite buffers
output link capacity: R
no retransmission
Host A
Host B
in R/2
maximum per-connection large delays as arrival rate, in,
in R/2 throughput: R/2 approaches capacity
out
out out
Causes/costs of congestion: scenario 2
Causes/costs of congestion: scenario 2
Idealization: known loss packets can be lost, dropped at router due to full buffers
Idealization: known loss packets can be lost, dropped at router due to full buffers
R/2
sender only resends if packet known to be lost
sender only resends if packet known to be lost
when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
Host B
Host B
copy
in : original data
’in: original data, plus
out
in : original data
’in: original data, plus
out
A
no buffer space!
A
free buffer space!
retransmitted data
retransmitted data
Causes/costs of congestion: scenario 2
Causes/costs of congestion: scenario 2
Realistic: duplicates
packets can be lost, dropped at
R/2
Realistic: duplicates
packets can be lost, dropped at
R/2
router due to full buffers
when sending at R/2, some packets are retransmissions including duplicated that are delivered!
router due to full buffers
when sending at R/2, some packets are retransmissions including duplicated that are delivered!
sender times out prematurely, sending two copies, both of which are delivered
in
R/2
sender times out prematurely, sending two copies, both of which are delivered
in R/2
Host B
timecooutpy
in ’in
out
“costs” of congestion:
more work (retrans) for given “goodput”
unneeded retransmissions: link carries multiple copies of pkt
A
free buffer space!
Transport Layer 3-89
Transport Layer 3-90
Transport Layer 3-91
Transport Layer 3-92
• decreasing goodput
in
R/2
cwnd: TCP sender congestion window size
out
Causes/costs of congestion: scenario 3
Causes/costs of congestion: scenario 3
four senders
multihop paths
timeout/retransmit
Q: what happens as in and in’ increase ?
C/2
Host D
when packet dropped, any “upstream transmission capacity used for that packet was wasted!
Chapter 3 outline
TCP congestion control:
additive increase multiplicative decrease
3.1 transport-layer services
3.5 connection-oriented transport: TCP
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
3.2 multiplexing and demultiplexing
• segment structure
• reliable data transfer
• flow control
• connection management
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
3.3 connectionless transport: UDP
• multiplicative decrease: cut cwnd in half after loss additively increase window size …
3.4 principles of reliable data transfer
3.6 principles of congestion control
…. until loss occurs (then cut window in half)
Host A
in : original data
’in: original data, plus
out Host B
A: as red in’ increases, all arriving blue pkts at upper queue are dropped, blue throughput 0
retransmitted data
’ in
C/2
finite shared output link buffers
Host C
3.7 TCP congestion control
AIMD saw tooth behavior: probing for bandwidth
Transport Layer 3-93
Transport Layer 3-94
Transport Layer 3-95
Transport Layer 3-96
another “cost” of congestion:
time
RTT
TCP Congestion Control: details
TCP Slow Start
sender sequence number space
when connection begins, increase rate exponentially until first loss event:
Host A
Host B
last byte ACKed
sent, not- yet ACKed (“in- flight”)
last byte sent
roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
• initially cwnd = 1 MSS
cwnd
TCP sending rate:
sender limits transmission:
• done by incrementing cwnd for every ACK received
LastByteSent-
LastByteAcked
< cwnd
cwnd is dynamic, function
summary: initial rate is slow but ramps up exponentially fast
of perceived network congestion
time
TCP: detecting, reacting to loss
TCP: switching from slow start to CA
loss indicated by timeout:
Q: when should the exponential increase switch to linear?
• cwnd set to 1 MSS;
• window then grows exponentially (as in slow start) to threshold, then grows linearly
A: when cwnd gets to 1/2 of its value before timeout.
loss indicated by 3 duplicate ACKs: TCP RENO • dup ACKs indicate network capable of delivering
Implementation:
some segments
• cwnd is cut in half window then grows linearly
variable ssthresh
TCP T ahoe always sets cwnd to 1 (timeout or 3 duplicate acks)
on loss event, ssthresh is set to 1/2 of cwnd just before loss event
rate ~ cwnd bytes/sec RTT
• double cwnd every RTT
Transport Layer 3-97
Transport Layer 3-98
Transport Layer 3-99
Transport Layer 3-100
Summary: TCP Congestion Control
TCP throughput
W: window size (measured in bytes) where loss occurs • avg. window size (# in-flight bytes) is 3⁄4 W
• avg. thruput is 3/4W per RTT
cwnd = 1 MSS ssthresh = 64 KB dupACKcount = 0
slow start
cwnd > ssthresh
congestion avoidance
timeout ssthresh = cwnd/2
timeout
ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0 retransmit missing segment
duplicate ACK dupACKcount++
3 W
4 RTT bytes/sec
cwnd = 1 MSS dupACKcount = 0 retransmit missing segment
avg TCP thruput =
dupACKcount == 3
cwnd = ssthresh dupACKcount = 0
dupACKcount == 3
ssthresh= cwnd/2 cwnd = ssthresh + 3 retransmit missing segment
fast recovery
ssthresh= cwnd/2
cwnd = ssthresh + 3 retransmit missing segment
duplicate ACK dupACKcount++
New A C K !
n e w A C K. cwnd = cwnd + MSS
New A C K !
avg. TCP thruput as function of window size, R T T? • ignore slow start, assume always data to send
timeout
ssthresh = cwnd/2
New ACK!
new ACK
cwnd = cwnd+MSS
dupACKcount = 0
transmit new segment(s), as allowed
(MSS/cwnd) dupACKcount = 0
cwnd = 1
dupACKcount = 0 retransmit missing segment
New ACK
W W/2
TCP Futures: TCP over “long, fat pipes” example: 1500 byte segments, 100ms RTT, want
TCP Fairness
10 Gbps throughput
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
requires W = 83,333 in-flight segments
throughput in terms of segment loss probability, L [Mathis 1997]:
TCP connection 1
TCP throughput = 1.22 . MSS RTT L
➜ to achieve 10 Gbps throughput, need a loss rate of L = 2·10-10 – a very small loss rate!
TCP connection 2
bottleneck router capacity R
new versions of TCP for high-speed
transmit new segment(s), as allowed
duplicate ACK
cwnd = cwnd + MSS
transmit new segment(s), as allowed
Transport Layer 3-101
Transport Layer 3-102
Transport Layer 3-103
Transport Layer 3-104
Why is TCP fair?
Fairness (more)
two competing sessions:
additive increase gives slope of 1, as throughout increases
multiplicative decrease decreases throughput proportionally
Fairness and UDP
Fairness, parallel TCP connections
R
equal bandwidth share
• do not want rate throttled by congestion control
Connection 1 throughput R
• new app asks for 11 TCPs, gets R/2
loss: decrease window by factor of 2 congestion avoidance: additive increase
instead use UDP:
• send audio/video at
web browsers do this
e.g., link of rate R with 9
Explicit Congestion Notification (ECN)
Chapter 3: summary
network-assisted congestion control:
principles behind transport layer services:
two bits in IP header (ToS field) marked by network router to indicate congestion
• multiplexing, demultiplexing
next:
congestion indication carried to receiving host
leaving the network “edge” (application, transport layers)
receiver (seeing congestion indication in IP datagram) ) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion
• reliable data transfer
• flow control
• congestion control
into the network “core”
source
TCP ACK segment
destination
application transport network link physical
ECE=1
application transport network link physical
instantiation, implementation in the Internet
two network layer chapters:
IP datagram
Transport Layer 3-107
Transport Layer 3-108
ECN=00
ECN=11
loss: decrease window by factor of 2 congestion avoidance: additive increase
constant rate, tolerate packet loss
existing connections:
Transport Layer 3-105
Transport Layer 3-106
multimedia apps often do not use TCP
application can open multiple parallel connections between two hosts
• UDP • TCP
• data plane
• control plane
• new app asks for 1 TCP, gets rate R/10