COMP5426 Distributed Computing
Programming Distributed Memory Platforms (Design (1))
Distributed Memory Platforms

On a distributed memory platform, each process has its own local addressing space
Local variables cannot be accessed by other processes (no shared variables between processes)
To exchange data, processes must explicitly communicate with each other using (variants of) send and receive message passing mechanisms
It is common for a computer cluster to have multiple computing nodes, each with a multicore processor
Then we need to use shared-memory programming within a multicore node, and message passing across nodes
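Below is a minimal sketch of this explicit communication in C with MPI; the ranks, the tag, and the variable name data are illustrative, not from the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = 0;            /* local variable: private to each process */
    if (rank == 0) {
        data = 42;
        /* process 1 cannot read process 0's 'data'; it must be sent */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}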
Distributed Memory Platforms

On a distributed memory machine data are distributed across the processors, so data need to be moved to the processors that compute on them
In algorithm design we need to seriously consider how data is partitioned and distributed:
to balance the workload. Of course, tasks (or work) are associated with data, so data partitioning also determines task assignment
to minimize communication overheads, which is also affected by the partitioning; exploit locality (also to reduce per-message overheads – send a large message, rather than many small ones)
Matrix Multiplication

Key issue: how to associate data and tasks and how to assign them
In the algorithm design for shared memory platforms we discussed how to partition tasks (matrix C) and then associate data with computation
The partitioning techniques, 1D blocking or 2D blocking, can be used here, but for distributed memory machines we also need to partition and distribute matrices A and B
In 1D blocking a block row of C (task) is associated with a block row of A and the whole of B
In 2D blocking a block of C (task) is associated with a block row of A and a block column of B
1D block row-wise distribution:

Using 1D partitioning and assuming n is divisible by p, where n is the matrix size and p is the number of processes
A(i), B(i) and C(i) are n/p by n block rows
C(i) refers to the n/p by n block row that process i owns (similarly for A(i) and B(i))
We have the formula C(i) = C(i) + A(i)*B
Further partition A(i): A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
For each C(i) we then have
C(i) = C(i) + Σj A(i,j)*B(j)
(e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2) + ...)
To compute its C(i), each process needs the whole B, i.e., B(0), B(1), ..., B(p-1)
However, these block rows are in different processors, so we need explicit communication – how?
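One straightforward answer is a collective gather so that every process obtains the whole B. Below is a minimal sketch in C with MPI, assuming row-major storage and n divisible by p; the function and buffer names (matmul_1d_allgather, Ai, Bi, Ci) are illustrative:

#include <mpi.h>
#include <stdlib.h>

/* C(i) = C(i) + A(i)*B, where each process owns the n/p-by-n
   block rows A(i), B(i) and C(i), stored row-major. */
void matmul_1d_allgather(const double *Ai, const double *Bi,
                         double *Ci, int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int rows = n / p;                 /* height of one block row */

    /* Collect B(0)..B(p-1) so every process holds the whole B. */
    double *B = malloc((size_t)n * n * sizeof(double));
    MPI_Allgather(Bi, rows * n, MPI_DOUBLE,
                  B,  rows * n, MPI_DOUBLE, comm);

    /* Local computation: C(i) += A(i) * B. */
    for (int r = 0; r < rows; r++)
        for (int k = 0; k < n; k++)
            for (int c = 0; c < n; c++)
                Ci[r * n + c] += Ai[r * n + k] * B[k * n + c];

    free(B);
}

This is simple, but every process must store the whole B (n*n elements); the broadcast and shift schemes on the following slides move only one block row B(j) at a time.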
[Figure: initial situation – process i owns the block row A(i) = (A(i,0) A(i,1) A(i,2) A(i,3)) together with B(i) and C(i)]
Broadcast: process i broadcasts its B(i) to all other processes, one block row per step
Can we do better?
[Figure: initial situation, as above]
Shift: processes circularly shift the B(i)s so that each block row of B visits every process in turn
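A minimal sketch of the shift scheme in C with MPI, under the same row-major layout as above; the function name matmul_1d_shift and the ring direction are illustrative:

#include <mpi.h>

/* Shift scheme: the block rows of B circulate around a ring of p
   processes; in each of p steps a process multiplies A(i,j) by the
   B(j) it currently holds, then passes that block on. Bblk starts
   as B(me) and, after p shifts, returns to B(me). */
void matmul_1d_shift(const double *Ai, double *Bblk,
                     double *Ci, int n, MPI_Comm comm)
{
    int p, me;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &me);
    int rows = n / p;
    int left  = (me - 1 + p) % p;     /* ring neighbours */
    int right = (me + 1) % p;

    for (int s = 0; s < p; s++) {
        int j = (me + s) % p;         /* Bblk currently holds B(j) */

        /* C(i) += A(i,j) * B(j); A(i,j) starts at column j*rows. */
        for (int r = 0; r < rows; r++)
            for (int k = 0; k < rows; k++)
                for (int c = 0; c < n; c++)
                    Ci[r * n + c] +=
                        Ai[r * n + j * rows + k] * Bblk[k * n + c];

        /* Send our B block to the left neighbour, receive the next
           one from the right; this also acts as a synchronization. */
        MPI_Sendrecv_replace(Bblk, rows * n, MPI_DOUBLE,
                             left, 0, right, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}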
Mesh System

Initialization: A(i,j) shifts left i steps and B(i,j) shifts up j steps
e.g., C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)
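This mesh scheme is Cannon's algorithm. Below is a sketch in C with MPI on a periodic q-by-q process grid (so the communicator must have q*q processes); each process owns one b-by-b block of A, B and C, and the function name cannon and the message tags are illustrative:

#include <mpi.h>

void cannon(double *A, double *B, double *C, int b, int q, MPI_Comm comm)
{
    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);

    int rank;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);   /* coords = (i, j) */

    int src, dst, left, right, up, down;

    /* Initial skew: A(i,j) shifts left i steps, B(i,j) shifts up j steps. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, b * b, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, b * b, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    MPI_Cart_shift(grid, 1, -1, &src, &dst);  /* A moves left each step */
    left = dst; right = src;
    MPI_Cart_shift(grid, 0, -1, &src, &dst);  /* B moves up each step   */
    up = dst; down = src;

    for (int s = 0; s < q; s++) {
        /* C += A * B on the local b-by-b blocks. */
        for (int r = 0; r < b; r++)
            for (int k = 0; k < b; k++)
                for (int c = 0; c < b; c++)
                    C[r * b + c] += A[r * b + k] * B[k * b + c];

        MPI_Sendrecv_replace(A, b * b, MPI_DOUBLE, left, 1, right, 1,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, b * b, MPI_DOUBLE, up, 2, down, 2,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}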
Parallel Scan

Recall that the algorithm for shared memory machines consists of three parallel steps, with a synchronization between consecutive steps
The 2nd step is mostly sequential (only one thread is active), but its overhead is not heavy, as on a shared memory machine the number of per-thread partial sums is usually small
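A minimal sketch of these three steps in C with OpenMP (an inclusive sum scan; the function name scan3 and the chunking scheme are illustrative):

#include <omp.h>
#include <stdlib.h>

/* Three-step inclusive prefix sum of x[0..n-1], in place. */
void scan3(double *x, long n)
{
    int t = omp_get_max_threads();
    double *sums = malloc(t * sizeof(double));

    #pragma omp parallel num_threads(t)
    {
        int id = omp_get_thread_num();
        long lo = n * id / t, hi = n * (id + 1) / t;

        /* Step 1: each thread scans its own chunk in parallel. */
        for (long i = lo + 1; i < hi; i++)
            x[i] += x[i - 1];
        sums[id] = (hi > lo) ? x[hi - 1] : 0.0;
        #pragma omp barrier               /* synchronize */

        /* Step 2: one thread scans the t chunk totals
           (mostly sequential, but t is usually small). */
        #pragma omp single
        {
            for (int k = 1; k < t; k++)
                sums[k] += sums[k - 1];
        }                                 /* implicit barrier: synchronize */

        /* Step 3: each thread adds the total of all chunks before it. */
        if (id > 0)
            for (long i = lo; i < hi; i++)
                x[i] += sums[id - 1];
    }
    free(sums);
}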
Parallel Scan

The algorithm could also be used for distributed memory machines
Use message passing (gather and scatter); the communication also acts as synchronization points
How about large systems (e.g., many processes)? Then the 2nd step also needs to be parallelized
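A sketch of the same three steps with message passing in C with MPI, using MPI_Gather and MPI_Scatter; the function name dm_scan and the use of rank 0 as the root are illustrative:

#include <mpi.h>
#include <stdlib.h>

/* Three-step distributed inclusive scan:
   each process holds a chunk x[0..n-1] of the global array. */
void dm_scan(double *x, int n, MPI_Comm comm)
{
    int p, me;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &me);

    /* Step 1: local scan of the chunk each process owns. */
    for (int i = 1; i < n; i++)
        x[i] += x[i - 1];
    double total = (n > 0) ? x[n - 1] : 0.0;

    /* Step 2: gather the chunk totals on rank 0, scan them there,
       and scatter back each process's offset; the gather/scatter
       also act as the synchronization points. */
    double *sums = NULL, offset = 0.0;
    if (me == 0) sums = malloc(p * sizeof(double));
    MPI_Gather(&total, 1, MPI_DOUBLE, sums, 1, MPI_DOUBLE, 0, comm);
    if (me == 0) {
        double run = 0.0;
        for (int k = 0; k < p; k++) {     /* exclusive scan of totals */
            double t = sums[k];
            sums[k] = run;
            run += t;
        }
    }
    MPI_Scatter(sums, 1, MPI_DOUBLE, &offset, 1, MPI_DOUBLE, 0, comm);
    if (me == 0) free(sums);

    /* Step 3: add the total of all preceding chunks. */
    for (int i = 0; i < n; i++)
        x[i] += offset;
}

For large p, the sequential scan of the p totals on rank 0 becomes the bottleneck, which is why the 2nd step itself needs to be parallelized.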
Parallel Scan

There are also several other parallel scan algorithms
Scans are fine-grained parallel algorithms
In practice MPI provides optimized MPI_Scan functions
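For reference, a minimal use of MPI_Scan, which computes an inclusive prefix sum across ranks (MPI_Exscan is the exclusive variant); the contributed values are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one value; MPI_Scan returns the
       inclusive prefix sum over ranks 0..rank. */
    int mine = rank + 1, prefix = 0;
    MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}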
Summary

On a distributed memory platform, each process has its own local addressing space
Local variables cannot be accessed by other processes (no shared variables between processes)
To exchange data, processes must explicitly communicate with each other using (variants of) send and receive message passing mechanisms
In the algorithm design for DM machines we need to seriously consider both data partitioning and task assignment (i.e., to move the right data to the right place at the right time)
We also need to seriously consider communication overheads, as they can significantly affect performance
Since data are distributed across processes, the algorithm design in general is more complicated than that for shared memory
In the MM example additional communications are needed to move data to the desired places, and organizing processes differently (e.g., 1D or 2D) may lead to different algorithms
In the Scan operation example we see that the parallel algorithm is not simply a natural extension of the sequential one