HPC ARCHITECTURES
Interconnects and Networks
Introduction
• Interconnects and Networks move data from one place to
another.
• Vital part of all computer systems, and especially
important in parallel systems.
• Interconnects occur at many scales with a variety of
differing requirements:
• Connecting components within a processor.
• Connecting processors to external memory. (Memory Interconnect)
• Connecting parallel compute elements. (Processor Interconnect)
• Connecting processors to disks (e.g. SANs, Storage Area Networks).
• LANs (Local Area Networks).
• WANs (Wide Area Networks).
• The internet (Global Network).
HPC Architectures 2
OSI Model
• This is a theoretical model of networking developed by the
ISO standards organisation.
• Divides networking into 7 layers:
1. Physical Layer
2. Data Link Layer
3. Network Layer
4. Transport Layer
5. Session Layer
6. Presentation Layer
7. Application Layer
• This division into layers makes it easier to define standards.
• A Network Layer standard can be independent of the underlying
Data Link Layer it is implemented on.
OSI and HPC
• Higher levels of the OSI model may not make sense in
some contexts:
• E.g. Memory interconnects
• Lower OSI levels still useful.
• Some layers may be merged/omitted for performance
reasons.
• Especially in Proprietary HPC interconnects where performance is
critical and inter-operation largely irrelevant.
• Message passing libraries such as MPI may provide functions from
both level-5 and level-6, though for performance reasons their
implementations may use levels as low as level-2.
Physical layer
• Level-1 of the OSI model is the physical layer.
• There are many different ways of moving bits of
information around:
• Voltages applied to wires
• Optical pulses/waves travelling along fibre-optic cable.
• Electromagnetic waves travelling along an electrical transmission
line.
• Radio waves propagating in air.
• All of these can be used to construct computer
interconnects.
Electrical Signalling lines
• Simplest form of signalling is to apply a voltage to a wire.
• Essentially the same approach as used in the Victorian telegraph.
• Underlying physics described by the “Telegrapher’s Equations”, developed
by Heaviside from the 1870s onwards, extending Kelvin’s 1855 cable model.
• At low frequencies wave nature can be neglected, wire acts
like a simple capacitor
• At high-frequencies (where wavelength approaches wire
length) signalling lines need to be constructed as
“transmission lines”
Point-to-point/Multi-station
• Signalling lines can be used as a broadcast interconnect
(many recipients can read the same signal)
• Original Ethernet was multi-station broadcast network
• Maximum segment length needed to be restricted so collisions
could be detected reliably.
• Fast networks easier to build out of routers connected by
point-to-point links.
• Full-duplex (simultaneous communications in both directions).
• Half-duplex (only one direction at a time).
Serial vs Parallel
• Data rates can be increased by either
• Increasing the signalling frequency
• Using multiple wires in parallel
• Parallel connections
• Generally more expensive as more wires and more off-chip pin
connections.
• Number of available chip pins proportional to chip perimeter so scarce
resource on large modern chips.
• Operating frequency limited by clock-skew between wires
• Serial connections
• Can operate at higher frequencies
• Throughput can be increased by using multiple (independent/non-
synchronised) serial connections.
• Most modern networks based on High-Speed serial connections.
High speed serial connections
• Many modern technologies use high speed serial
connections.
• Complex encoding of binary data onto analogue signals.
• Currently running at GHz signal frequencies (cm wavelengths).
• Used in a variety of different types of interconnect
• USB, Infiniband, SATA, PCIe, HDMI
• Underlying SerDes (Serializer/Deserializer) technology essentially the
same.
• Electrical signalling consumes large amounts of power
• Increases with distance and frequency
• Power is a major limiting factor in modern HPC system design.
• General trend towards optical signalling
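A rough worked example of the "GHz signal frequencies (cm wavelengths)" point above, using wavelength = velocity / frequency. The velocity factor of 0.66 is an assumed, typical value for a signal in a cable or circuit board trace, not a figure from these slides, and the function name is illustrative.

```python
# Sketch: wavelength of a signal on a wire, lambda = v / f.
C = 299_792_458.0  # speed of light in vacuum, m/s


def wavelength_cm(freq_hz, velocity_factor=0.66):
    """Wavelength in cm. velocity_factor is an assumed typical value."""
    return 100.0 * C * velocity_factor / freq_hz


for f_ghz in (1, 5, 10, 25):
    print(f"{f_ghz:>2} GHz -> {wavelength_cm(f_ghz * 1e9):.1f} cm")
```

Under these assumptions a 10 GHz signal has a wavelength of about 2 cm, comparable to the length of a board trace, which is why such links have to be built as transmission lines.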
Optical networks
• Signalling with flashing lights is as old as fire.
• Modern optical networks use lasers confined within optical
fibre.
• Still using electromagnetic waves but light frequency is much higher
than signalling frequency and can be treated as binary pulses.
• Energy loss in optical fibre is very low (energy costs are essentially
independent of distance at scales less than many kilometres) but optical
transceivers take a lot of power.
• Optical fibres are point to point links
• Signals need to be converted back to electrical to perform routing.
• Recent advances in silicon-photonics allow all necessary
components for optical networking to be built on-chip.
• Chip to chip optical links now a possibility.
Packet switching
• Network bits are usually organised into packets.
• Currently HPC networks are all packet-switched networks.
• As well as the embedded payload, a packet usually carries a header
containing additional information e.g.
• Source address
• Destination address
• Checksums for error detection
• Packet size.
• Routers read the header to determine the next destination.
• Routers are always electronic, so signals need to be converted from the
optical domain to be routed.
• Packet overhead reduces effective bandwidth (especially for
small messages).
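The effect of packet overhead on small messages can be sketched with a simple model: the useful fraction of the link is payload / (payload + header). The 100 Gb/s link speed and 64-byte overhead are assumed values chosen for illustration only.

```python
def effective_bandwidth(raw_bw, payload_bytes, header_bytes):
    """Bandwidth left for payload once per-packet overhead is paid."""
    return raw_bw * payload_bytes / (payload_bytes + header_bytes)


RAW_GBPS = 100.0    # assumed raw link speed
HEADER_BYTES = 64   # assumed per-packet overhead

for payload in (8, 64, 512, 4096):
    bw = effective_bandwidth(RAW_GBPS, payload, HEADER_BYTES)
    print(f"{payload:>5} B payload -> {bw:6.1f} Gb/s effective")
```

With these numbers a 64-byte payload gets only half of the raw bandwidth, while a 4096-byte payload gets almost all of it.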
Optical switching
• In the future we expect silicon photonics to make optical
switching more important.
• Optical switches are circuit-switches not packet-switches.
• However each fibre can carry multiple wavelengths which
can be switched independently.
• Can embed complex circuit topology in simpler physical one.
28/10/2019 HPC Architectures 12
Networks
• More complex networks are built by linking network segments
using routers.
• Topology of the network can affect cost/performance
– Topology described as a graph
– Routers as graph nodes
– Links as graph edges
– Properties of the graph related to properties of the network
Diameter
• The distance between 2 graph nodes is the length of the
shortest path connecting them.
• This is proportional to the network latency between the routers in
the corresponding network. (Assuming the network latency of each
link is the same)
• The diameter of a graph is the largest distance between any two
nodes contained in the graph.
• This is proportional to the worst case network latency of the
corresponding network.
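The definitions above can be sketched directly: a breadth-first search gives the distance from one node to every other, and the diameter is the largest such distance. The graph representation (a dict of neighbour lists) and the helper names are illustrative choices, and the graph is assumed to be connected.

```python
from collections import deque


def distances_from(graph, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(graph):
    """Largest shortest-path distance between any two nodes."""
    return max(max(distances_from(graph, n).values()) for n in graph)


def linear_array(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


print(diameter(linear_array(8)))  # 7
print(diameter(ring(8)))          # 4
```

Adding the single wrap-around link roughly halves the worst case hop count.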
Degree
• The degree of a graph node is the number of edges
attached to that node.
• This corresponds to the number of ports needed on the router in
the corresponding network.
• In practice many high port count routers are implemented internally
as networks of smaller routers.
Bisection
• A bisection of a graph divides it into two equal parts; the
number of links connecting the two parts is the width of
that bisection.
• The minimum bisection width of a graph is the smallest
value from all possible bisections of the graph.
• This is proportional to the worst case bandwidth between two
halves of the corresponding network (If all links can support the
same bandwidth).
• This is usually referred to as the bisection bandwidth of the
network.
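For very small graphs the minimum bisection width can be found by brute force, trying every equal split and counting the links that cross it. This is only a sketch to make the definition concrete: the cost grows exponentially with the number of nodes, an even node count is assumed, and the function name is illustrative.

```python
from itertools import combinations


def min_bisection_width(nodes, edges):
    """Try every equal split and return the fewest links crossing any of them."""
    nodes = list(nodes)
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        # A link crosses the cut when exactly one endpoint is in this half.
        width = sum((a in half) != (b in half) for a, b in edges)
        if best is None or width < best:
            best = width
    return best


n = 8
array_edges = [(i, i + 1) for i in range(n - 1)]
ring_edges = array_edges + [(n - 1, 0)]

print(min_bisection_width(range(n), array_edges))  # 1
print(min_bisection_width(range(n), ring_edges))   # 2
```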
Example Topologies
• Simple topologies
[Figure: 1-D array; ring]
More example topologies
[Figure: 2-D mesh; 2-D torus; 4-cube]
• Natural extensions to higher dimensions
Multi stage networks
• Common approach to building scalable networks
• Routers arranged in layers
• Only the outermost layers have connections to network end-points
• Inner layers connect router to router.
• Larger networks need more intermediate layers
• Many different variants
• Cost (number of routers and links) can scale as N log N for N end-points
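A sketch of where N log N scaling can come from. The slides do not name a specific network, so this assumes a butterfly-style multi-stage network built from radix x radix routers, where N end-points need log N stages of N / radix routers each. The function name is illustrative.

```python
def multistage_cost(endpoints, radix=2):
    """Return (stages, routers) for an assumed butterfly-style network.

    endpoints must be an exact power of radix.
    """
    stages, reach = 0, 1
    while reach < endpoints:
        reach *= radix
        stages += 1
    if reach != endpoints:
        raise ValueError("endpoints must be a power of radix")
    return stages, stages * (endpoints // radix)


for n in (8, 64, 1024):
    stages, routers = multistage_cost(n)
    print(f"N={n:>5}: {stages:>2} stages, {routers:>5} 2x2 routers")
```

Higher radix routers reduce both the number of stages and the router count, which is one reason high port count routers are attractive.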
Tree topologies
• Outside of HPC many computer networks are built using a
hierarchy of routers in a tree topology.
• Makes sense for client/server and task-farms where most
communication is between leaves and root of the tree.
• For general communication patterns large volumes of traffic will
need to pass through links/routers near the root of the tree.
• To maintain bisection bandwidth need higher performing
routers/links near root of tree.
Fat tree
• Uses multiple root nodes to maintain bisection bandwidth
Recursive networks
• Many multi-stage networks can be built recursively
• E.g. the Benes network.
Dragonfly
• Hierarchy of “groups”
• Within a group all nodes are fully connected
• All groups are fully connected in the next layer
• Next layer connection usually “Fatter” to maintain
bisection bandwidth.
Routing and Addressing
• How do we get a message through a network?
• choice of multiple paths
• Message can specify route, or just the destination
• in latter case routers have to decide the route
• Can be deterministic or adaptive
• deterministic: every message between two given nodes always takes
the same path
• Packet order preserved (no re-assembly required)
• May not use all available bandwidth
• adaptive: path can vary according to network conditions/randomly
• Easier to implement fault tolerance. Better use of available paths.
• Can be minimal or non-minimal
• minimal routing always takes shortest path
• not necessarily unique path
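To illustrate why the shortest path is not necessarily unique: in a 2-D mesh (assumed here, without wrap-around links) a minimal route makes dx steps in X and dy steps in Y in any order, so the number of minimal paths is the binomial coefficient C(dx + dy, dx). The function name is illustrative.

```python
from math import comb


def minimal_path_count(src, dst):
    """Number of distinct shortest paths between two routers in a 2-D mesh."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return comb(dx + dy, dx)


print(minimal_path_count((0, 0), (4, 0)))  # 1
print(minimal_path_count((0, 0), (3, 2)))  # 10
```

A deterministic router always picks one of these paths; an adaptive minimal router may choose among them according to network conditions.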
Routing latency
• Routing overheads add to the overall network latency.
• Store and forward routers
• Read entire packet into internal buffer before forwarding down next link.
• Full packet needs to be stored (always)
• Can add significantly to message latency if packets are large.
• Cut-through router
• Calculates next link from message header.
• Starts forwarding packet as soon as destination known. (May be before
the full packet has been received.)
• May still have to buffer packet if output link is busy.
• Either way routing algorithm needs to be fast. E.g.
• Simple algorithm based on destination address.
• Deterministic algorithm with cached results.
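The difference between the two router types can be sketched with a simplified latency model that ignores wire delay, routing decision time and contention. A store and forward path pays the full packet transfer time on every link; a cut-through path pays it once, plus a header time at each intermediate router. All sizes and the bandwidth below are assumed values.

```python
def store_and_forward_latency(packet_bytes, links, bandwidth):
    """Whole packet is received at each router before being sent on."""
    return links * packet_bytes / bandwidth


def cut_through_latency(packet_bytes, header_bytes, links, bandwidth):
    """Each intermediate router only waits for the header before forwarding."""
    return (links - 1) * header_bytes / bandwidth + packet_bytes / bandwidth


BANDWIDTH = 10e9  # bytes per second, assumed
PACKET = 4096     # bytes, assumed
HEADER = 64       # bytes, assumed

for links in (1, 3, 5):
    sf = store_and_forward_latency(PACKET, links, BANDWIDTH)
    ct = cut_through_latency(PACKET, HEADER, links, BANDWIDTH)
    print(f"{links} links: store-and-forward {sf * 1e6:.3f} us, "
          f"cut-through {ct * 1e6:.3f} us")
```

In this model the two are equal for a single link, and the gap grows with both the path length and the packet size.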
Dimension ordered routing
• Example: Dimension ordered routing in 2-D grid
• Simple algorithm
• go in X direction until X co-ordinate is correct
• then go in Y direction.
• Deterministic, minimal
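The algorithm above can be sketched as follows, assuming a 2-D grid without wrap-around links and routers identified by (x, y) co-ordinates. The function name is illustrative.

```python
def dimension_ordered_route(src, dst):
    """List of routers visited from src to dst: X direction first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # go in X until the X co-ordinate is correct
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then go in Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(dimension_ordered_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

The route depends only on the source and destination, so it is deterministic, and it never moves away from the destination, so it is minimal.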
Error correction and Re-transmission
• Hardware is not perfect
• Packets may be lost (Need to implement packet acknowledge/resend)
• Packets may be corrupted (Checksums, ECC codes, resend)
• Guaranteed delivery network
• Packet verification and resend implemented on each network hop.
• Minimises performance impact.
• More buffer space needed
• Have to make sure buffer space not exhausted (e.g. link flow control)
• Non guaranteed delivery
• Verification and resend implemented end-to-end at higher OSI layer
• Packets dropped when error detected (or insufficient buffer space)
• Larger performance impact from errors.
• Can recover from complete loss of router node
• Model used by Internet, TCP/IP, Ethernet.
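A minimal sketch of checksum-based error detection. The slides do not name a particular checksum; CRC-32 from Python's standard zlib module is used here as an assumed example, and the packet layout (payload followed by a 4-byte checksum) is hypothetical.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload if the checksum matches, else None (needs a resend)."""
    payload, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


good = make_packet(b"hello HPC")
corrupted = bytearray(good)
corrupted[0] ^= 0x01  # flip a single bit in transit

print(check_packet(good))              # b'hello HPC'
print(check_packet(bytes(corrupted)))  # None
```

Whether the check and resend happen on every hop or only end-to-end is the difference between the two delivery models above.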
OSI Layers again
• TCP is built on top of IP (Internet protocol) which is carried
by some data-link protocol. Data is sent as packets.
• Think Russian dolls.
• Protocol nesting adds overhead
• Even when HPC interconnects support IP, it can be faster to target
lower levels.
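The nesting overhead can be sketched by adding one header per layer. The sizes below are assumptions for illustration: minimum TCP and IPv4 headers of 20 bytes each (no options) and Ethernet as the data-link layer with a 14-byte header plus a 4-byte frame check sequence.

```python
# Assumed per-layer overhead in bytes, innermost layer first.
LAYERS = [("TCP", 20), ("IP", 20), ("Ethernet", 18)]


def encapsulate(payload_bytes):
    """Return the packet size after each successive layer of wrapping."""
    sizes, size = [], payload_bytes
    for name, header in LAYERS:
        size += header
        sizes.append((name, size))
    return sizes


for payload in (8, 1024):
    sizes = encapsulate(payload)
    total = sizes[-1][1]
    print(payload, sizes, f"payload fraction {payload / total:.0%}")
```

For an 8-byte message the three nested headers make up most of what is sent, which is why targeting a lower level directly can pay off for small messages.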
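[Figure: protocol nesting. A header (HDR) is added to the data to form a TCP packet; a header is added to the TCP packet to form an IP packet; a header is added to the IP packet to form a data-link layer packet.]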