Distributed-memory Algorithm Design and Analysis slides
CMPSC 450
Distributed memory platforms
[Diagram: compute nodes, each with a processor (C), local memory, and a network interface (NI), connected by a communication network]
Distributed memory platforms
• Explicit communication and synchronization between processors
• Message passing
• Process/task: software abstraction
• No shared resources, need to be explicit about where data resides
• How fast can processors communicate and synchronize?
• Depends on the communication network topology
Why do we need models (for distributed computations)?
• To characterize efficiency and optimality for distributed memory algorithms
• To compare different algorithms
• To bridge gap between hardware and programming model
Network topology
• Network: collection of switches connected by communication channels
• Interconnection pattern is called the network topology
• Bus, Ring, Mesh, Hypercube, Torus, Tree, Fat-tree, …
Simplest distributed computational model
• P processors, local memory per processor
• Assume a network topology
• e.g., ring, hypercube, tree
• Explicit synchronous processor communication
• Send(X, i): Send a copy of X to processor Pi
• Receive(Y, j): Receive a copy of Y from processor Pj
• Routing: delivering message from source to destination
Parameters characterizing topology
• Diameter: maximum distance between any pair of nodes (lower is better)
• Degree: number of connected neighbors for a node
• Maximum degree of any node (lower is better)
• Average degree (lower is better)
• Average path length between any pair of nodes (lower is better)
• Bisection (band)width: minimum number of links that must be cut to separate the network into two halves (higher is better)
Linear array topology
• Diameter: n-1
• Total number of links: n-1
• Average degree: ~ 2
• Max degree: 2
• Bisection width: 1
• Average shortest path length: ?
Ring topology
• Diameter: n/2
• Total number of links: n
• Average degree: 2
• Bisection width: 2
• Average shortest path length: ?
Example: Broadcast on a ring
• Input: Processor number i. p processors. A value v on processor 1.
• Output: All processors have a local copy of value v.
• Assumptions: only one send and one receive per unit time; each send and receive completes its task in unit time. (We don't care about the underlying implementation of Send and Receive or the underlying hardware and network characteristics.)
begin
1. if i = 1 then set y := v else set y := 0
2. for j = 1 to p-1 do
     if (j = i) then send(y, (i+1) % p)
     if (j = (i+1) % p) then receive(y, i)
end
Wait! There is an error in this code that results in deadlock!
The second conditional should be: if (i = (j+1) % p) then receive(y, j), so that in step j processor j sends and processor j+1 receives.
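For reference, a minimal C/MPI sketch of the corrected chain broadcast (the function name chain_broadcast is illustrative; MPI_Send/MPI_Recv stand in for the slide's send/receive, and ranks run from 0 to p-1 with the value starting on rank 0):

#include <mpi.h>

/* Chain broadcast of one int v along ranks 0 -> 1 -> ... -> p-1.
   In step j, rank j sends to rank j+1 and rank j+1 receives from rank j,
   so every send has a matching receive and there is no deadlock. */
void chain_broadcast(int *v, MPI_Comm comm) {
    int i, p;
    MPI_Comm_rank(comm, &i);
    MPI_Comm_size(comm, &p);
    for (int j = 0; j < p - 1; j++) {
        if (i == j)                       /* I am this step's sender   */
            MPI_Send(v, 1, MPI_INT, i + 1, 0, comm);
        else if (i == j + 1)              /* I am this step's receiver */
            MPI_Recv(v, 1, MPI_INT, j, 0, comm, MPI_STATUS_IGNORE);
    }
}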
[Diagram: value v broadcast around a 4-processor ring, one hop per time step]
Parallel time
• T = Tcomp + Tcomm
• Tcomm is p-1 units of time (one ring hop per step) in the previous broadcast example
• For a ring topology, any non-trivial communication requires O(p) parallel steps
2D mesh
• 2D version of a linear array
• p = m^2 processors, m x m processor grid
• Diameter: 2√p - 2
• Average degree: ~ 4
• Intel Paragon
2D torus
• 2D version of a ring
• p = m^2 processors, m x m processor grid
• Diameter: ?
• Bisection bandwidth: ?
• Degree: 4
3D torus
• 3D version of a ring
• p = m^3 processors, m x m x m processor grid
• Diameter: ?
• Bisection: ?
• Degree: 6
Hypercube
• n = 2^d processors, d-dimensional cube
• Recursive structure
• Diameter: d
• Bisection width: n/2
• Regular structure
Fat tree
• Hierarchical network design
• Fat links and skinny links: variable link bandwidths
Fully connected network
• Diameter: 1
• Number of links: p(p-1)/2
• Degree: p-1
• Bisection width: ?
Other topology considerations
• Fault tolerance
• Scalability
• Partitionability
• Cost
• Symmetry
• Fujitsu K computer: 6D mesh/torus
Algorithms for a network topology
• Very popular in the past
• Matrix multiplication on a 2D mesh
• Hypercube algorithms
• Summation on 3D torus
• Systolic paradigm
• Still important for collective communication and frequently-used subroutines (barriers, broadcasts, reductions, …)
• Can be very challenging to design from scratch
• Resulting algorithms are not general-purpose (tied to a specific topology)
Alternative to modeling topology
• Model communication using latency and bandwidth
• Model congestion
• Latency: time taken by a small message to traverse the network
• Depends on hardware (number of hops) + software (number of buffers/copies)
• Bandwidth: rate/throughput at which data can be transferred between switches
• Injection bandwidth, link bandwidth, bisection bandwidth
Modeling communication cost
• Parallel time T(n, p) = Tcomp + Tcomm
• Tcomm: model using system point-to-point bandwidth and latency
estimates
• α-β model, or Hockney model
• Tcomm = α * (# messages) + β * (# bytes transferred)
• α : latency per message (in seconds)
• β : inverse of bandwidth, transfer time per byte (seconds/byte)
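As a worked example (not from the slides), the estimate can be evaluated directly; the α and β values below are assumptions in the ballpark of the numbers quoted later in these slides (a couple of microseconds of latency, several GB/s of bandwidth):

#include <stdio.h>

/* Hockney / alpha-beta estimate: T = alpha * (# messages) + beta * (# bytes). */
double t_comm(double alpha, double beta, double n_msgs, double n_bytes) {
    return alpha * n_msgs + beta * n_bytes;
}

int main(void) {
    double alpha = 2e-6;        /* assumed latency: 2 microseconds per message  */
    double beta  = 1.0 / 10e9;  /* assumed 10 GB/s bandwidth: 1e-10 s per byte  */
    /* Example: a linear broadcast of a 1 MB message to P = 64 processors
       uses P-1 messages and moves (P-1)*s bytes in total. */
    printf("linear broadcast estimate: %g s\n", t_comm(alpha, beta, 63, 63.0 * 1e6));
    return 0;
}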
α-β model
• Does not take congestion into account
• α and β are assumed to be constant, but in reality they depend on
• message size
• position of communicating nodes in the network
• contention
• software implementation
• Bidirectional communication links
• Network interface is single-ported (at most one message can be sent, and one message received simultaneously)
Measuring point-to-point latency and bandwidth
• Careful microbenchmarking
• PingPong benchmark
• e.g., Ohio State Univ (OSU) MPI Microbenchmarks
• e.g., Intel MPI Benchmarks
• It’s important to have a sense of typical latency and bandwidth numbers
• Figure 3.1 of the textbook (page 64)
• Network latency on current systems: ~1-2 microseconds
• Point-to-point bandwidth: ~6-10 GB/s
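A bare-bones ping-pong loop in the spirit of these benchmarks might look as follows (a sketch only: one fixed message size, no warm-up iterations, and exactly two ranks assumed):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1 << 20, ITERS = 100 };   /* 1 MB messages, 100 round trips */
    static char buf[N];
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {                 /* rank 0: send, then wait for the echo */
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {          /* rank 1: receive, then echo back      */
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * ITERS);   /* one-way time per message */
    if (rank == 0)
        printf("size %d bytes: %.2f us, %.2f GB/s\n", N, t * 1e6, N / t / 1e9);
    MPI_Finalize();
    return 0;
}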
OSU Microbenchmarks output
[Screenshots of OSU microbenchmark latency and bandwidth output]
Distributed memory algorithms
• We are free to assume a virtual process topology that’s suitable for our problem
• Virtual topology can be a ring, k-ary tree, 2D/3D mesh or torus, hypercube, etc.
• The virtual process topology may or may not be the same as the interconnect topology
• Message-passing software (usually MPI) will handle mapping of virtual processes or tasks to physical cores/nodes
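For example, MPI can impose a virtual 2D torus on the processes independently of the physical interconnect; the 4 x 4 grid below is just an illustrative size:

#include <mpi.h>

/* Attach a virtual 4 x 4 periodic grid (a 2D torus) to the processes.
   MPI may reorder ranks to better match the physical network. */
MPI_Comm make_virtual_torus(void) {
    int dims[2]    = {4, 4};
    int periods[2] = {1, 1};   /* wrap around in both dimensions */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &torus);
    return torus;
}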
Basic communication operations
• Well-defined communication patterns
• We will reuse these primitives in higher-level algorithms
• Efficient implementations can be developed by software vendors for underlying interconnect
Basic communication operations
• (One-to-all) broadcast
• Broadcast values from root to all other tasks
• (All-to-one) reduction
• Reduces values from all tasks to single task
• Scatter (One-to-all)
• Distribute distinct pieces of data from one task to all other tasks
• Gather (All-to-one)
• Gather values from a group of tasks
• (All-to-all) personalized exchange
• Send values from all tasks to all other tasks
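In MPI these primitives correspond directly to collective calls; the sketch below (illustrative buffer sizes, assuming at most 64 tasks) shows the mapping:

#include <mpi.h>

/* Sketch: how the five primitives map onto MPI collectives. */
void collectives_demo(MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    if (p > 64) return;                    /* buffers below assume at most 64 tasks */
    double x = rank, sum, piece, send[64] = {0}, recv[64];

    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, comm);                            /* one-to-all broadcast  */
    MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);            /* all-to-one reduction  */
    MPI_Scatter(send, 1, MPI_DOUBLE, &piece, 1, MPI_DOUBLE, 0, comm); /* one-to-all scatter    */
    MPI_Gather(&x, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, 0, comm);      /* all-to-one gather     */
    MPI_Alltoall(send, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, comm);     /* personalized exchange */
}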
Simple Broadcast algorithm
• Linear broadcast: assume a ring topology; at each time step, one process sends the data to the process on its left/right
• For a message of s bytes, Tcomm in the α-β model is (P-1)(α + sβ) = O(Pα + Psβ)
• Clearly inefficient
• Only one point-to-point link is in use at any time step
• Tcomm scales linearly with P!
• 'Latency-bound' for small messages (the Pα term dominates), 'bandwidth-bound' for large messages (the Psβ term dominates)
Balanced binary-tree based broadcast approach
• Assume a tree-based virtual topology
• Source/root process is the root of the binary tree; 2^t messages are sent in time step t
• Tcomm is (log P)(α + sβ) = O(α log P + sβ log P)
• What fraction of links are now being used?
• Can we do better?
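A common way to realize this is a binomial-tree broadcast, in which the set of ranks holding the data doubles every step; the sketch below (function name and rank arithmetic are illustrative) follows that pattern:

#include <mpi.h>

/* Tree broadcast from rank 0: in step t, every rank that already holds the
   data sends it a distance 2^t away, so the number of ranks holding the data
   doubles each step (log P steps in total). Ranks >= p are simply skipped. */
void tree_broadcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (int dist = 1; dist < p; dist *= 2) {
        if (rank < dist && rank + dist < p)
            MPI_Send(buf, count, type, rank + dist, 0, comm);
        else if (rank >= dist && rank < 2 * dist)
            MPI_Recv(buf, count, type, rank - dist, 0, comm, MPI_STATUS_IGNORE);
    }
}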
Broadcasting very large messages (s >> P)
• Pipelined strategy
• Split message into segments of size z
• In each time step, send segment from task i to i+1
• Tcomm is (P + s/z - 2)(α + zβ) = O((s/z)α + sβ) (assuming P = O(s/z))
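A sketch of the pipelined chain version (blocking sends and receives for simplicity, so adjacent segments do not fully overlap at a rank; nonblocking calls would be used in practice; assumes z divides s):

#include <mpi.h>

/* Pipelined chain broadcast of s bytes from rank 0 in segments of z bytes.
   Rank i receives each segment from rank i-1 and forwards it to rank i+1,
   so different segments travel through the chain at the same time. */
void pipelined_broadcast(char *buf, int s, int z, MPI_Comm comm) {
    int i, p;
    MPI_Comm_rank(comm, &i);
    MPI_Comm_size(comm, &p);
    for (int off = 0; off < s; off += z) {
        if (i > 0)          /* receive this segment from my left neighbor  */
            MPI_Recv(buf + off, z, MPI_CHAR, i - 1, 0, comm, MPI_STATUS_IGNORE);
        if (i < p - 1)      /* forward it to my right neighbor             */
            MPI_Send(buf + off, z, MPI_CHAR, i + 1, 0, comm);
    }
}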
Lower bounds, Optimality
• What is the lower bound on the number of messages?
• log P messages
• What is a lower bound on data bytes transferred?
• s bytes
• Binary tree-based approach has a log P multiplicative factor in bandwidth term
• Pipelined approach has a P/log P multiplicative factor in latency term
Pipelined, Binary tree-based algorithm
• Tcomm is O(α log P + sβ): within constant factors of both the latency and the bandwidth lower bounds
• How? Pseudo-code?
(All-to-one) Reduction
• Flip the balanced binary tree
• Same bounds as broadcast
Gather/Scatter
• Gather values from / scatter values to a group of tasks
• A tree-based approach is suitable for small messages
All-to-all broadcast (Allgather)
• Input: each task has n bytes of data
• Output: each task has n(P-1) bytes of additional data gathered from all other tasks
Allgather operation
[Diagram: input and output data layouts for an allgather over 4 tasks]
Allgather algorithms
• Gather + Broadcast?
• Ring-based algorithm
• Recursive doubling
Allgather: ring algorithm
[Diagram: input/output of a ring allgather over 4 tasks, one block forwarded per step]
Tcomm = (P-1)α + n(P-1)β
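A sketch of this ring algorithm (buf is assumed to hold P blocks of n bytes each, with rank i's own contribution already stored in block i; the function name is illustrative):

#include <mpi.h>

/* Ring allgather: in step k, rank i forwards the block it received in step
   k-1 to its right neighbor and receives a new block from its left neighbor;
   after P-1 steps every rank holds all P blocks. */
void ring_allgather(char *buf, int n, MPI_Comm comm) {
    int i, p;
    MPI_Comm_rank(comm, &i);
    MPI_Comm_size(comm, &p);
    int right = (i + 1) % p, left = (i - 1 + p) % p;
    for (int k = 0; k < p - 1; k++) {
        int send_block = (i - k + p) % p;        /* block to forward this step    */
        int recv_block = (i - k - 1 + p) % p;    /* block arriving from the left  */
        MPI_Sendrecv(buf + send_block * n, n, MPI_CHAR, right, 0,
                     buf + recv_block * n, n, MPI_CHAR, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}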
Recursive-doubling algorithm
[Diagram: input/output of a recursive-doubling allgather over 4 tasks; partners at distance 1, then 2, exchange all data accumulated so far]
Tcomm = (lg P)α + n(P-1)β
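A sketch of the recursive-doubling exchange (assumes P is a power of two; buf holds P blocks of n bytes, with rank i's contribution already stored in block i; the function name is illustrative):

#include <mpi.h>

/* Recursive-doubling allgather: in the step with distance 2^t, partners whose
   ranks differ in bit t exchange everything they have accumulated so far
   (2^t contiguous blocks), so the data held per rank doubles each step. */
void rd_allgather(char *buf, int n, MPI_Comm comm) {
    int i, p;
    MPI_Comm_rank(comm, &i);
    MPI_Comm_size(comm, &p);
    for (int dist = 1; dist < p; dist *= 2) {
        int partner = i ^ dist;                 /* flip one bit of my rank          */
        int my_base = (i / dist) * dist;        /* first block I currently hold     */
        int pr_base = (partner / dist) * dist;  /* first block my partner holds     */
        MPI_Sendrecv(buf + my_base * n, dist * n, MPI_CHAR, partner, 0,
                     buf + pr_base * n, dist * n, MPI_CHAR, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}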
All-to-all reduce (Allreduce)
• P Gathers + P Broadcasts?
• P Broadcasts + local reduce?
• Ring-based for large messages
• Recursive doubling for small messages
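In practice one calls the library collective, which typically switches between such algorithms based on message size; a minimal usage sketch:

#include <mpi.h>

/* Sum one double contributed by every task; every task gets the result. */
double allreduce_sum(double local, MPI_Comm comm) {
    double global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}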
Barrier synchronization
• Similar cost to broadcasting/reducing a very small message
• Tcomm = (log P)α