
COMP5426 Distributed Computing
Programming Distributed Memory Platforms (Design (1))


Distributed Memory Platforms
In a distributed memory platform, each process has its own local addressing space
• Local variables cannot be accessed by other processes (no shared variables between processes)
• To exchange data, processes must communicate with each other using (variants of) send and receive message passing mechanisms (see the sketch after this list)
It is common for a computer cluster to have multiple computing nodes, each with a multicore processor
• Then we need to use shared-memory programming within a multicore node, and message passing across nodes
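As a minimal sketch of such a send/receive pair (the buffer size, the two ranks involved and the tag are arbitrary choices for illustration, not from the lecture):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, n = 4;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* process 0 sends its local array to process 1 */
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* process 1 receives into its own local buffer */
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f ... %f\n", buf[0], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }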

Distributed Memory Platforms
In a distributed memory machine data need to be distributed across the processors
• How data is partitioned and distributed must be considered carefully (a block-partition helper is sketched after this list)
• Of course, we also need to seriously consider how tasks (or work) are assigned to data
• Data partitioning should enhance locality to minimize communication overheads (also to reduce per-message overheads – send a large message, rather than many small ones)
• Task assignment should balance the workload, which is also related to communication overheads
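A minimal sketch of a 1D block partition of n rows over p processes (block_range is a hypothetical helper introduced here for illustration; it also covers the case where n is not divisible by p by giving the first n%p processes one extra row):

    /* Rows [lo, hi) owned by process `rank` under a 1D block partition. */
    void block_range(int n, int p, int rank, int *lo, int *hi) {
        int base = n / p, rem = n % p;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }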

Matrix Multiplication

Key issue: how to associate data and tasks and how to assign them
In the algorithm design for the shared memory platform, we discussed how to partition task matrix C and then associate data with computation
The partitioning techniques, 1D blocking or 2D blocking, can be used here, but for distributed memory machines we also need to partition the matrices
• In 1D blocking a block row of C (task) is associated with a block row of A and the whole B
• In 2D blocking a block of C (task) is associated with a block row of A and a block column of B

1D block row-wise distribution:
• Using 1D partitioning and assuming n is divisible by p, where n is the matrix size and p is the number of processes
• A(i), B(i) and C(i) are n/p by n block rows
• C(i) refers to the n/p by n block row that process i owns (similarly for A(i) and B(i))
• We have the formula C(i) = C(i) + A(i)*B
• Further partition A(i): A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p
• Then C(i) = C(i) + Σj A(i,j)*B(j) (e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2))

1D block row-wise distribution:
• For each C(i) we have C(i) = C(i) + Σj A(i,j)*B(j) (e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2))
• To compute its C(i), each process needs the whole B, i.e., B(0), B(1), B(2), B(3)
• However, they are stored in different processors
• We need explicit communication – how? (one option is sketched below)
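One straightforward answer is to gather every block row of B onto every process with MPI_Allgather and then multiply locally. A sketch under the assumptions that n is divisible by p and matrices are stored row-major (mm_1d_allgather, Aloc, Bloc, Cloc and Bfull are illustrative names):

    #include <mpi.h>
    #include <stdlib.h>

    /* Each process owns Aloc, Bloc, Cloc: (n/p) x n block rows (row-major).
       Gather the whole B, then compute Cloc += Aloc * B locally. */
    void mm_1d_allgather(double *Aloc, double *Bloc, double *Cloc,
                         int n, int p, MPI_Comm comm) {
        int nb = n / p;                          /* rows per process */
        double *Bfull = malloc((size_t)n * n * sizeof(double));

        /* every process receives B(0), B(1), ..., B(p-1) in rank order */
        MPI_Allgather(Bloc, nb * n, MPI_DOUBLE,
                      Bfull, nb * n, MPI_DOUBLE, comm);

        for (int i = 0; i < nb; i++)             /* Cloc = Cloc + Aloc * Bfull */
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    Cloc[i * n + j] += Aloc[i * n + k] * Bfull[k * n + j];

        free(Bfull);
    }

Note that this replicates the whole matrix B on every process, which costs extra memory.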

[Figure: initial situation – process i owns block row A(i,0) … A(i,3) of A and block row B(i) of B]
Broadcast: each process pi broadcasts its B(i) to all other processes (as in the allgather sketch above)
Can we do better?
[Figure: initial situation, as above]
Shift: processes shift their B(i)s to a neighbouring process, one step at a time (sketched below)
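The shift variant avoids replicating the whole B: each process computes with the B block it currently holds and then passes it around a ring, p-1 times in total. A sketch under the same layout assumptions as above (mm_1d_shift is an illustrative name):

    #include <mpi.h>

    /* 1D block-row matrix multiply using a ring shift of the B block rows.
       Aloc, Bloc, Cloc are (n/p) x n, row-major; Bloc is overwritten. */
    void mm_1d_shift(double *Aloc, double *Bloc, double *Cloc,
                     int n, int p, int rank, MPI_Comm comm) {
        int nb = n / p;
        int left  = (rank - 1 + p) % p;          /* neighbours on the ring */
        int right = (rank + 1) % p;

        for (int s = 0; s < p; s++) {
            int j = (rank + s) % p;              /* index of the B block held now */

            /* Cloc += A(rank, j) * B(j); A(rank,j) is columns j*nb .. (j+1)*nb-1 of Aloc */
            for (int i = 0; i < nb; i++)
                for (int k = 0; k < nb; k++)
                    for (int c = 0; c < n; c++)
                        Cloc[i * n + c] +=
                            Aloc[i * n + j * nb + k] * Bloc[k * n + c];

            /* pass our B block to the left neighbour, receive from the right */
            if (s < p - 1)
                MPI_Sendrecv_replace(Bloc, nb * n, MPI_DOUBLE,
                                     left, 0, right, 0, comm, MPI_STATUS_IGNORE);
        }
    }

Each process now stores only one block row of B at a time, at the cost of p-1 communication steps.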

Mesh System
With 2D blocking the processes are organized as a mesh, each holding one block of A, B and C
Example: C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)
Initialization: A(i,j) shifts left i steps and B(i,j) shifts up j steps (a sketch of this alignment follows below)
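A sketch of this initial alignment on a q x q process grid, assuming the communicator grid was created with MPI_Cart_create and periodic boundaries in both dimensions (mesh_align, Ablk and Bblk are illustrative names):

    #include <mpi.h>

    /* Initial alignment on a q x q process mesh:
       every process in grid row i shifts its A block left by i steps,
       every process in grid column j shifts its B block up by j steps.
       Assumes `grid` is a periodic 2D Cartesian communicator. */
    void mesh_align(double *Ablk, double *Bblk, int nb, MPI_Comm grid) {
        int rank, coords[2], src, dst;
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);   /* coords[0] = row i, coords[1] = column j */

        if (coords[0] != 0) {                     /* shift A left by i along dimension 1 */
            MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
            MPI_Sendrecv_replace(Ablk, nb * nb, MPI_DOUBLE,
                                 dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        }
        if (coords[1] != 0) {                     /* shift B up by j along dimension 0 */
            MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
            MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE,
                                 dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        }
    }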

Parallel Scan
The algorithm for shared memory machines consists of three parallel steps
• The threads must synchronize between the steps
• The second step is purely sequential (only one thread is active), but the overhead is not heavy as in SM this step is usually small (a sketch of the three-step scheme follows below)
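A sketch of the three-step scheme in C with OpenMP for an inclusive prefix sum (scan_shared and the array names are illustrative; blocksum needs one slot per thread):

    #include <omp.h>

    /* In-place inclusive prefix sum of x[0..n-1] using the three-step scheme. */
    void scan_shared(double *x, long n, double *blocksum) {
        #pragma omp parallel
        {
            int t = omp_get_num_threads(), id = omp_get_thread_num();
            long lo = n * id / t, hi = n * (id + 1) / t;

            /* Step 1: each thread scans its own contiguous block. */
            for (long i = lo + 1; i < hi; i++)
                x[i] += x[i - 1];
            blocksum[id] = (hi > lo) ? x[hi - 1] : 0.0;

            #pragma omp barrier            /* first synchronization point */

            /* Step 2: purely sequential - one thread scans the t block sums. */
            #pragma omp single
            for (int k = 1; k < t; k++)
                blocksum[k] += blocksum[k - 1];
            /* implicit barrier at the end of single = second synchronization point */

            /* Step 3: add the sum of all preceding blocks to the local block. */
            if (id > 0)
                for (long i = lo; i < hi; i++)
                    x[i] += blocksum[id - 1];
        }
    }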

Parallel Scan
The algorithm could also be used for distributed memory machines
• Use message passing (gather and scatter) – see the sketch after this list
• The communication acts as synchronization points
• How about a large p (e.g., many processes)? – the 2nd step also needs to be parallelized, e.g., by applying a parallel scan to the block sums
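A sketch of this gather/scatter scheme for an inclusive prefix sum over a block-distributed array (scan_distributed and the variable names are illustrative; rank 0 plays the role of the sequential 2nd step):

    #include <mpi.h>
    #include <stdlib.h>

    /* Inclusive prefix sum of the distributed array xloc[0..nloc-1]
       (each process owns one contiguous block, in rank order). */
    void scan_distributed(double *xloc, int nloc, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* Step 1: local scan of the block this process owns. */
        for (int i = 1; i < nloc; i++)
            xloc[i] += xloc[i - 1];
        double mysum = (nloc > 0) ? xloc[nloc - 1] : 0.0;

        /* Step 2: gather the block sums, scan them sequentially on rank 0,
           turning them into per-process offsets (sum of all earlier blocks). */
        double *sums = NULL, offset;
        if (rank == 0) sums = malloc(p * sizeof(double));
        MPI_Gather(&mysum, 1, MPI_DOUBLE, sums, 1, MPI_DOUBLE, 0, comm);
        if (rank == 0) {
            double run = 0.0;
            for (int k = 0; k < p; k++) {        /* exclusive scan of the block sums */
                double s = sums[k];
                sums[k] = run;
                run += s;
            }
        }
        MPI_Scatter(sums, 1, MPI_DOUBLE, &offset, 1, MPI_DOUBLE, 0, comm);
        if (rank == 0) free(sums);

        /* Step 3: every process adds its offset to its local block. */
        for (int i = 0; i < nloc; i++)
            xloc[i] += offset;
    }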


Parallel Scan
• There are also several other parallel scan algorithms
• Scans are fine-grained parallel algorithms
• In practice MPI provides optimized MPI_Scan functions (a usage sketch follows below)
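A minimal usage sketch of MPI_Scan: each process contributes one integer and receives the inclusive prefix sum over ranks 0..i.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, pre;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int val = rank + 1;                      /* each process contributes one value */
        /* Inclusive scan: process i receives val(0) + val(1) + ... + val(i). */
        MPI_Scan(&val, &pre, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix sum = %d\n", rank, pre);

        MPI_Finalize();
        return 0;
    }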

In a distributed memory platform, each process has its own local addressing space
• Local variables cannot be accessed by other processes (no shared variables between processes)
• To exchange data, processes must explicitly communicate with each other using (variants of) send and receive message passing mechanisms
In the algorithm design for DM machines, we need to seriously consider both data partitioning and task assignment (i.e., to move the right data to the right place at the right time)
• We also need to seriously consider how communication overheads can be reduced
Since data are distributed across processes, the algorithm design in general is more complicated than that for shared memory
• In the MM example additional communications are needed to move data to the desired places, and organizing processes (e.g., 1D or 2D) may significantly affect performance
• In the Scan operation example we see that the parallel algorithm is not simply a natural extension of the sequential one
