Parallel Architectures
HPC Architectures
Overview
• Background
• Flynn’s Taxonomy
• SIMD
• MIMD
• Classification via Memory
• Distributed Memory
• Shared Memory
• Clusters
• Summary
Background
• Beginnings of Parallel Computing
• the idea of performing calculations in parallel was first suggested by Charles Babbage in the 19th century
• … but was not technically possible at the time
• Modern HPC
• 1940s: Colossus used parallel computations during the Second World War
• 1950s: IBM and US universities working on designs for parallel machines
• 1960s: solid-state components make computers cheaper and more reliable
• 1970s: vector machines generate demand for low-cost supercomputing (Cray)
• 1980s: widely available parallel machines (DAP, Cosmic Cube)
• 1990s: fast RISC chips available (SPARC, MIPS, PA-RISC, …)
Serial vs. Parallel Computers
• Serial computers are easier to program than parallel computers
• … but there are limits on single-processor performance
• physical: speed of light, uncertainty principle
• practical: design, manufacture
• Parallel computers dominate HPC because
• they allow the highest performance
• they are more cost-effective
• Achieving good performance requires
• high-quality algorithms, decomposition and programming
Flynn’s Taxonomy
• Classification of architectures by instruction stream and data stream
• Flynn, M. J. (1972), “Some Computer Organizations and Their Effectiveness”, IEEE Trans. Comput. C-21(9): 948–960
• SISD: Single Instruction Single Data
• serial machines
• MISD: Multiple Instructions Single Data
• (probably) no real examples
• SIMD: Single Instruction Multiple Data
• MIMD: Multiple Instructions Multiple Data
SIMD Architecture
• Single Instruction Multiple Data
• A single processor instruction executes simultaneously on
multiple pieces of different data
• Instructions issued by front-end controller/processor
Vector processors
• Vector processors were an early example of SIMD architectures
• Processors had vector registers that contained multiple words of data (64 words typical on Cray architectures)
• A single machine instruction could combine two vectors to produce a vector result (see the loop sketch below)
• Most implementations utilised pipeline parallelism rather than replicating floating-point units for each element of the vector
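As a sketch (not from the original slides) of what a single vector instruction replaces, consider the elementwise loop below: on a Cray-style machine, each 64-element strip of this loop would issue as one vector load/add/store sequence, with the pipelined FPU streaming out one result per cycle. The function name and strip length are illustrative.

```c
#include <stddef.h>

/* Elementwise add: on a vector machine, each 64-element strip of
 * this loop maps to a single vector load/add/store sequence, with
 * results streaming out of the pipelined floating-point unit. */
void vector_add(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* strip-mined into chunks of */
        c[i] = a[i] + b[i];          /* the vector-register length */
}
```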
Large Scale SIMD
• In the 1970s and 1980s various systems were built using SIMD on a large scale
• Usually thousands of simple processors operating in lock-step, controlled by a front-end
• Each processor has its own memory where it keeps its data
• Data is distributed across the processors!
• Processors can communicate with each other
• Usually only connected to neighbouring processors
• Long-distance communication by “shifting” data multiple times
• Examples:
• DAP, MasPar, CM200
SIMD Architecture
[Diagram: grid of processors (P), each with its own memory (M), connected by a network; instructions issued by a front-end, with peripherals attached]
MicroProcessor SIMD
• Many modern microprocessors have some SIMD instructions
• e.g. SSE instructions in x86 processors (sketched below)
• Like early vector machines these operate on registers rather than distributed data
• Vector length typically shorter
• Usually use replicated FPUs rather than pipelining
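A minimal sketch of register-level SIMD using the SSE intrinsics mentioned above; the function name and the assumption that n is a multiple of 4 are illustrative.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two float arrays four elements at a time using the 128-bit
 * SSE registers; n is assumed to be a multiple of 4. */
void sse_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, 4 adds */
        _mm_storeu_ps(&c[i], vc);         /* store 4 results */
    }
}
```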
MIMD Architecture
• Multiple Instructions Multiple Data
• Several independent processors capable of executing separate programs
• Subdivision by relationship between processors and memory
Distributed Memory
• MIMD-DM
• each processor has its own local memory
• Processors connected by some interconnect mechanism
• Processors communicate via explicit message passing (see the MPI sketch after this list)
• effectively like sending emails to each other
• Highly scalable architecture
• allows Massively Parallel Processing (MPP)
• Examples
• nothing current – no single-core processors remain!
• Cray T3D/T3E
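A minimal message-passing sketch in MPI, the standard programming model for distributed-memory machines; the two-rank exchange and the value sent are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Two-process ping: rank 0 sends a value to rank 1, which prints it.
 * Run with e.g. "mpirun -np 2 ./ping". */
int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```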
Distributed Memory
[Diagram: processors (P), each with its own private memory (M), connected by an interconnect]
Distributed Memory
• Processors behave like distinct workstations
• each runs its own copy of the operating system
• no interaction except via the interconnect
• Pros
• adding processors increases memory bandwidth
• can grow to almost any size
• Cons
• scalability relies on good interconnect
• jobs are placed by the user and remain on the same processors
• potential for high system management overhead
Shared Memory
• MIMD-SM
• each processor has access to a global memory store
• Communication via writes/reads to memory
• caches are automatically kept up-to-date, or coherent
• Simple to program – no explicit communications (see the OpenMP sketch below)
• Scaling is difficult because of the memory-access bottleneck
• Usually modest numbers of processors
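A minimal OpenMP sketch of the implicit communication described above: threads cooperate purely through reads and writes to shared data, with no explicit messages. OpenMP is one common shared-memory model; the array size and reduction are illustrative.

```c
#include <stdio.h>

/* Threads communicate implicitly through the shared array "a" and
 * the reduction variable "sum"; compile with -fopenmp. */
int main(void)
{
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++)
        a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];     /* reads/writes go to shared memory */

    printf("sum = %f\n", sum);
    return 0;
}
```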
Symmetric MultiProcessing
• Each processor in an SMP has equal access to all parts of memory
• same latency and bandwidth
• Examples
– any multicore laptop/PC/server
[Diagram: processors (P) connected by a bus/interconnect to a single shared memory]
NUMA
• Each processor has some fast local memory
• Direct access to slower remote memory via a global address space
• Hardware includes support circuitry to deal with remote accesses, allowing very fast communications
• Result is Non-Uniform Memory Access (NUMA)
• Most modern multi-core processors are NUMA to some degree (see the first-touch sketch below)
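One practical consequence, assuming the common first-touch page-placement policy (an assumption for illustration, not stated in the slides): data should be initialised by the threads that will later use it, so that pages land in each thread's local memory.

```c
#include <stdlib.h>

#define N 10000000L

/* Under a first-touch page-placement policy (common on Linux),
 * each page is placed in memory local to the processor that first
 * writes it.  Initialising in parallel, with the same schedule as
 * the later compute loops, keeps each thread's pages on its own
 * NUMA node.  Compile with -fopenmp. */
double *numa_friendly_alloc(void)
{
    double *a = malloc(N * sizeof *a);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;   /* first touch happens here, not at malloc */

    return a;
}
```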
Schematic CC-NUMA machine
[Diagram: nodes each containing processors (P), a bus (B), and local memory (M), connected to one another by inter-node buses]
Shared Memory
• Looks like a single machine to the user
• a single operating system covers all the processors
• the OS automatically moves jobs around the CPUs
• Pros
• simple to use and maintain
• CC-NUMA architectures allow scaling to hundreds of CPUs
• Cons
• potential problems with simultaneous access to memory
• sophisticated hardware required to maintain cache coherency
• scalability ultimately limited by this
• Examples
• Any multisocket PC/server
• HPE Integrity MC990 X Server (up to 32 sockets)
Shared Memory Cluster
[Diagram: several shared-memory nodes, each with multiple processors (P) sharing one memory (M), connected by an interconnect]
Shared Memory Clusters
• Technology pyramid…
• … encouraged clustering of SMP nodes
• i.e. top-end nodes are the mid-range systems
• Recent trend towards multicore processors
• Low-end clusters and custom HPC systems have SMP nodes
[Diagram: technology pyramid spanning workstation clusters, SMP servers, and HPC systems]
Shared Memory Clusters
• Combine features of two architectures
• shared-memory within a node
• distributed memory between nodes
• Pros
• constructed as a standard distributed memory machine
• but with more powerful nodes
• Cons
• may be hard to take advantage of the mixed architecture (see the hybrid sketch after this list)
• more complicated to understand performance
• combination of interconnect and memory system behaviour
• Examples
• All current large HPC systems
• Cray XC30/XC40, IBM Blue Gene/Q, etc.
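A minimal hybrid sketch of the mixed architecture: MPI between nodes (distributed memory), OpenMP threads within a node (shared memory). The launch configuration (one rank per node) and the requested thread-support level are assumptions for illustration.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid model for an SMP cluster: one MPI rank per node, OpenMP
 * threads within each node.  Compile with e.g. "mpicc -fopenmp". */
int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```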
ARCHER
• Built from nodes containing 2 × 12-core Xeons
– each node a mini 24-way NUMA system
• A bespoke Cray interconnect
– the network is a dragonfly topology
• essentially a high-end SMP cluster
Cray XC30 Compute Node
• The XC30 compute node features:
• 1 × Aries NIC
• Connects to shared Aries router and wider network
• PCIe 3.0
• Optimized short-transfer mechanism (FMA)
• Provides global access to memory, used by MPI and PGAS
• High issue rate for small transfers: 8–64 byte put/get and AMO in particular
• HPC-optimized network
• Small packet size (64 bytes)
• Router bandwidth >> injection bandwidth
• Adaptive routing & Dragonfly topology
• Fault-tolerant design
• Link-level retry on error
• Adaptive routing around failed links
• Network reconfigures automatically (and quickly) if a component fails
• End-to-end CRC check with automatic software retry in MPI
[Diagram: XC30 compute blade/node with two Intel® Xeon® 12-core dies (NUMA nodes 0 and 1), each with 32 GB of DDR3 memory, linked by QPI; the Aries NIC connects via PCIe 3.0 to the Aries router and the Aries network]
Cray XC30 Rank-1 Network
o Chassis with 16 compute blades
o 128 sockets
o Inter-Aries communication over backplane
o Per-packet adaptive routing
[Diagram: 16 Aries connected by the chassis backplane]
Cray XC30 Rank-2 Copper Network
o 4 nodes connect to a single Aries
o 6 backplanes connected with copper cables in a 2-cabinet group
o Active optical cables interconnect groups
[Diagram: a 2-cabinet group of 768 sockets]
Cray XC30 Routing
o With adaptive routing we select between minimal and non-minimal paths based on load
o Minimal routes between any two nodes in a group are just two hops
o Non-minimal routes require up to four hops
o The Cray XC30 Class-2 group has sufficient bandwidth to support full injection rate for all 384 nodes with non-minimal routing
[Diagram: source (S) to destination (D) routes within a group, both minimal and non-minimal, via intermediate routers]
Cray XC30 Network Overview – Rank-3 Network
o An all-to-all pattern is wired between the groups using optical cables (the blue network)
o Up to 240 ports are available per 2-cabinet group
o The global bandwidth can be tuned by varying the number of optical cables in the group-to-group connections
o Example: a 4-group system is interconnected with 6 optical “bundles”; the “bundles” can be configured between 20 and 80 cables wide
[Diagram: Groups 0–3 connected all-to-all by optical cables]
Adaptive Routing over optical network
• An all-to-all pattern is wired between the groups
• Assume the minimal path from Group 0 to Group 3 becomes congested
• Traffic can “bounce off” any other intermediate group
• This doubles the load on the network but more effectively utilizes the full system bandwidth
[Diagram: Groups 0–4 connected all-to-all; a congested minimal path is re-routed via an intermediate group]
Summary
• Flynn’s taxonomy looks somewhat dated
• SIMD within processors and accelerators
• Large scale HPC based on MIMD
• Wide variety of memory architectures for MIMD
• need to sub-classify by memory
• Current parallel systems are based on commodity microprocessors or clusters of SMPs
• … leveraging commercial products
• Parallel architectures appear to be the present and future
of HPC