
L13-HPC-Architectures

Parallel Architectures

HPC Architectures

Overview
• Background

• Flynn’s Taxonomy
• SIMD
• MIMD

• Classification via Memory
• Distributed Memory
• Shared Memory
• Clusters

• Summary

Background
• Beginnings of Parallel Computing
• the idea of performing calculations in parallel was first suggested by Charles Babbage in the 19th century
• … but was not technically possible at the time

• Modern HPC
• 1940s: Colossus used parallel computations during the Second World War
• 1950s: IBM and US universities working on designs for parallel machines
• 1960s: solid-state components make computers cheaper and more reliable
• 1970s: vector machines generate demand for low-cost supercomputing (Cray)
• 1980s: widely available parallel machines (DAP, Cosmic Cube)
• 1990s: fast RISC chips available (SPARC, MIPS, PA-RISC, …)

Serial v Parallel Computers

• Serial computers are easier to program than parallel computers

• …but there are limits on single processor performance
• physical: speed of light, uncertainty principle
• practical: design, manufacture

• Parallel computers dominate HPC because
• they allow highest performance
• they are more cost effective

• Achieving good performance requires
• high quality algorithms, decomposition and programming

Flynn’s Taxonomy

• Classification of architectures by instruction stream and data stream
• M. Flynn, “Some Computer Organizations and Their Effectiveness”, IEEE Trans. Comput. C-21 (1972): 948

• SISD: Single Instruction Single Data
• serial machines

• MISD: Multiple Instructions Single Data
• (probably) no real examples

• SIMD: Single Instruction Multiple Data
• MIMD: Multiple Instructions Multiple Data

SIMD Architecture

• Single Instruction Multiple Data
• A single processor instruction executes simultaneously on multiple pieces of different data
• Instructions issued by front-end controller/processor

Vector processors
• Vector processors were an early example of SIMD architectures.
• Processors had vector registers that contained multiple words of data (64 words typical on Cray architectures).
• A single machine instruction could combine two vectors to produce a vector result.
• Most implementations utilised pipeline parallelism rather than replicating Floating Point Units for each element of the vector.

Large Scale SIMD
• In the 1970s and 1980s various systems were built using SIMD on a large scale.
• Usually thousands of simple processors operating in lock-step controlled by a front-end
• Each processor has its own memory where it keeps its data
• Data is distributed across the processors!

• Processors can communicate with each other
• Usually only connected to neighbour processors
• Long distance communication by “shifting” data multiple times.

• Examples:
• DAP, MasPar, CM200

SIMD Architecture

[Diagram: SIMD architecture — an array of processing elements, each with its own memory (P/M pairs), connected by a network and driven in lock-step by a front-end with attached peripherals.]

MicroProcessor SIMD
• Many modern microprocessors have some SIMD instructions
• e.g. SSE instructions in x86 processors (see the sketch after this list)

• Like early vector machines these operate on registers rather than distributed data.
• Vector length typically shorter
• Usually use replicated FPUs rather than pipelining.
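Below is a minimal sketch, not taken from the lecture, of what register-based SIMD looks like at the source level: x86 SSE intrinsics adding two float arrays four elements at a time. The function name and the assumption that n is a multiple of 4 are illustrative only.

```c
/* Sketch: one SSE instruction performs four single-precision additions.
 * Assumes n is a multiple of 4 for brevity. */
#include <xmmintrin.h>          /* SSE intrinsics */

void vec_add_sse(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* 4 adds in one instruction */
    }
}
```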

MIMD Architecture

• Multiple Instructions Multiple Data
• Several independent processors capable of executing separate programs
• Subdivision by relationship between processors and memory

Distributed Memory

• MIMD-DM
• each processor has its own local memory

• Processors connected by some interconnect mechanism
• Processors communicate via explicit message passing (a minimal MPI sketch follows this list)
• effectively sending emails to each other

• Highly scalable architecture
• allows Massively Parallel Processing (MPP)

• Examples
• Nothing current – no single-core processors!
• Cray T3D/T3E
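A minimal message-passing sketch, assuming MPI and at least two processes; it is illustrative rather than part of the lecture material. Rank 0 sends a single integer to rank 1 — nothing is shared, everything travels through explicit sends and receives.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* from rank 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```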

Distributed Memory

[Diagram: distributed-memory architecture — processor/memory (P/M) pairs, each with private memory, connected by an interconnect.]

Distributed Memory
• Processors behave like distinct workstations

• each runs its own copy of the operating system
• no interaction except via the interconnect

• Pros
• adding processors increases memory bandwidth
• can grow to almost any size

• Cons
• scalability relies on good interconnect
• jobs are placed by user and remain on the same processors
• potential for high system management overhead

Shared Memory
• MIMD-SM
• each processor has access to a global memory store

• Communication via writes/reads to memory
• caches are automatically kept up-to-date, or coherent

• Simple to program – no explicit communications (see the OpenMP sketch after this list)
• Scaling is difficult because of the memory access bottleneck
• Usually modest numbers of processors
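For contrast, a minimal shared-memory sketch (illustrative, using OpenMP): threads communicate simply by reading and writing the same array in the global address space, with no explicit messages.

```c
#include <omp.h>

/* Scale a shared array in place; every thread reads and writes x directly
 * through ordinary loads and stores - no explicit communication. */
void scale(double *x, int n, double alpha)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= alpha;
}
```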

Symmetric MultiProcessing
• Each processor in an SMP has equal access to all parts of memory

• same latency and bandwidth

• Examples
– any multicore laptop/PC/server

[Diagram: SMP — several processors (P) connected to a single shared memory via a bus/interconnect.]

NUMA

• Each processor has some fast local memory
• Direct access to slower remote memory via a global address space
• Hardware includes support circuitry to deal with remote accesses, allowing very fast communications
• The result is Non-Uniform Memory Access (NUMA)
• Most modern multi-core processors are NUMA to some degree (see the first-touch sketch after this list).
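A common NUMA-aware idiom is sketched below; it assumes a first-touch page placement policy (the default on most Linux systems) and OpenMP, neither of which is specified in the slides. Initialising the array with the same thread layout that will later compute on it places each page in the memory local to the core that uses it.

```c
#include <stdlib.h>
#include <omp.h>

/* Allocate and initialise an array so that, under first-touch placement,
 * each page lands in the memory local to the thread that will use it. */
double *alloc_numa_friendly(int n)
{
    double *x = malloc((size_t)n * sizeof(double));

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        x[i] = 0.0;   /* first touch: the page is placed near this thread */

    return x;
}
```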

Schematic CC-NUMA machine

[Diagram: schematic CC-NUMA machine — multiple nodes, each with two processors (P), a bus/bridge (B) and local memory (M); the buses are linked together so every processor can reach every memory.]

Shared Memory
• Looks like a single machine to the user

• a single operating system covers all the processors
• the OS automatically moves jobs around the CPUs

• Pros
• simple to use and maintain
• CC-NUMA architectures allow scaling to 100s of CPUs

• Cons
• potential problems with simultaneous access to memory
• sophisticated hardware required to maintain cache coherency
• scalability ultimately limited by this

• Examples
• Any multisocket PC/server
• HPE Integrity MC990 X Server up to 32 sockets

Shared Memory Cluster

[Diagram: shared-memory cluster — several SMP nodes, each containing multiple processors (P) sharing a memory (M), connected by an interconnect.]

Shared Memory Clusters
• Technology pyramid…
• …encouraged clustering of SMP nodes
• i.e. top-end nodes are the mid-range systems

• Recent trend towards multicore processors
• Low-end clusters and custom HPC systems have SMP nodes.

[Diagram: technology pyramid — SMP servers, workstation clusters and HPC systems.]

Shared Memory Clusters

• Combine features of two architectures (a hybrid MPI + OpenMP sketch follows this list)
• shared-memory within a node
• distributed memory between nodes

• Pros
• constructed as a standard distributed memory machine
• but with more powerful nodes

• Cons
• may be hard to take advantage of mixed architecture
• more complicated to understand performance
• combination of interconnect and memory system behaviour

• Examples
• All current large HPC systems
• Cray XC30/XC40, IBM Blue Gene/Q, etc.
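A minimal hybrid sketch, assuming MPI and OpenMP (one common way to program such machines, not the only one): distributed-memory message passing between nodes, shared-memory threading within each node.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Distributed memory between ranks, shared memory inside each rank. */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```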

ARCHER
• Built from nodes containing two 12-core Xeons
– each node is a mini 24-way NUMA system

• A bespoke Cray interconnect
– the network is a dragonfly

• Essentially a high-end SMP cluster

Cray XC30 Compute Node

• The XC30 compute node features:

• 1 x Aries NIC
• Connects to shared Aries router and wider network
• PCI-e 3.0

• Optimized short transfer mechanism (FMA)
• Provides global access to memory, used by MPI and PGAS (a generic MPI sketch follows the node diagram)
• High issue rate for small transfers: 8-64 byte put/get and amo in particular

• HPC optimized network
• Small packet size, 64 bytes
• Router bandwidth >> injection bandwidth
• Adaptive routing & dragonfly topology

• Fault tolerant design
• Link level retry on error
• Adaptive routing around failed links
• Network reconfigures automatically (and quickly) if a component fails
• End to end CRC check with automatic software retry in MPI

[Diagram: XC30 blade/compute node — two NUMA nodes, each an Intel® Xeon® 12-core die with 32GB of DDR3 memory, linked by QPI; an Aries NIC on PCIe 3.0 connects the node to the Aries router and the Aries network.]
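As an illustration of the small remote put/get traffic that FMA is designed to accelerate, here is a generic MPI one-sided sketch (standard MPI-3 calls, not Cray-specific code; it assumes at least two ranks): rank 0 writes a 4-byte value directly into rank 1's exposed memory.

```c
#include <mpi.h>

/* Each rank exposes one int in an RMA window; rank 0 puts a value
 * straight into rank 1's memory without rank 1 posting a receive. */
void put_example(MPI_Comm comm)
{
    int rank, value = 42, *base;
    MPI_Win win;

    MPI_Comm_rank(comm, &rank);
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     comm, &base, &win);
    *base = -1;

    MPI_Win_fence(0, win);                                    /* open epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);   /* 4-byte put */
    MPI_Win_fence(0, win);                                    /* complete put */

    MPI_Win_free(&win);
}
```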

Cray XC30 Rank-1 Network

o Chassis with 16 compute blades
o 128 sockets
o Inter-Aries communication over the backplane
o Per-packet adaptive routing

[Diagram: 16 Aries connected by the chassis backplane.]

Cray XC30 Rank-2 Copper Network

• 4 nodes connect to a single Aries
• 6 backplanes connected with copper cables in a 2-cabinet group
• Active optical cables interconnect groups

[Diagram: a 2-cabinet group, 768 sockets.]

Cray XC30 Routing
• With adaptive routing we select between minimal and non-minimal paths based on load
• The Cray XC30 Class-2 Group has sufficient bandwidth to support the full injection rate for all 384 nodes with non-minimal routing
• Minimal routes between any two nodes in a group are just two hops
• A non-minimal route requires up to four hops

[Diagram: minimal and non-minimal routes between a source (S) and destination (D) within a group.]

Cray XC30 Network Overview – Rank-3 Network

• An all-to-all pattern is wired between the groups using optical cables (the blue network)
• Up to 240 ports are available per 2-cabinet group
• The global bandwidth can be tuned by varying the number of optical cables in the group-to-group connections

• Example: a 4-group system is interconnected with 6 optical “bundles”; the “bundles” can be configured between 20 and 80 cables wide

[Diagram: optical all-to-all connections between Group 0, Group 1, Group 2 and Group 3.]

Adaptive Routing over optical network
• An all-to-all pattern is wired between the groups
• Assume the minimal path from Group 0 to Group 3 becomes congested
• Traffic can “bounce off” any other intermediate group
• This doubles the load on the network but more effectively utilizes the full system bandwidth

[Diagram: adaptive routing across the optical network between Groups 0-4.]

Summary

• Flynn’s taxonomy looks somewhat dated
• SIMD within processors and accelerators
• Large scale HPC based on MIMD

• Wide variety of memory architectures for MIMD
• need to sub-classify by memory

• Current parallel systems based on commodity microprocessors or clusters of SMPs
• … providing leverage with commercial products

• Parallel architectures appear to be the present and future of HPC