COMP5426 Distributed Computing
Design (1)
Logical organization (or parallel computational model) – the programmer's view of the
organization of parallel programs (parallel computing paradigms):
- Distributed-memory (e.g., computer clusters)
- Shared-memory (e.g., multicore machines)
- SIMD (data parallel) and multithreading (e.g., GPUs)
Physical organization – the actual hardware (e.g., a computer cluster)
Shared-Memory Platform
- A global memory is accessible to all processors
- Uniform memory access (UMA) platform – the time taken by a processor to access any memory word is identical
- Non-uniform memory access (NUMA) platform – the access time depends on which memory word is accessed (local vs. remote)
- Shared-memory programming paradigms: e.g., multithreading libraries (Pthreads) or directives (e.g., OpenMP)
- Key issue: synchronization
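A minimal C sketch of the directive-based paradigm, assuming OpenMP (compile with -fopenmp; the directive is not named explicitly above, so treat this as one possible instance):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;
    /* the directive splits the loop iterations among threads; the
       reduction clause synchronizes the updates to sum */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000; i++)
        sum += 1.0 / i;
    printf("sum = %f\n", sum);
    return 0;
}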
Example: assume x is a global shared variable
- Global variables are accessible to all processors/threads
- Implicit thread communication is performed by reading and writing shared variables
- Explicit synchronization is required for operations on shared addresses

Thread 1: x = a;        Thread 2: b = x;

Question: How can we guarantee that the value of b in thread 2 will be equal to that of a?
Answer: synchronization!
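A minimal Pthreads sketch of such synchronization (compile with -lpthread; variable names and the value of a are illustrative): thread 2 blocks on a condition variable until thread 1 has written x, which guarantees b == a.

#include <pthread.h>
#include <stdio.h>

int x;                  /* the global shared variable */
int ready = 0;          /* set once x holds a valid value */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

void *thread1(void *arg) {
    int a = 42;                      /* illustrative value */
    pthread_mutex_lock(&m);
    x = a;                           /* write the shared variable */
    ready = 1;
    pthread_cond_signal(&c);         /* wake the reader */
    pthread_mutex_unlock(&m);
    return NULL;
}

void *thread2(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                   /* block until thread 1 has written x */
        pthread_cond_wait(&c, &m);
    int b = x;                       /* guaranteed: b == a */
    pthread_mutex_unlock(&m);
    printf("b = %d\n", b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}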
Parallel programming on a shared-memory platform seems simply a natural extension of sequential programming – we'll see that the parallel program structures may be very similar to sequential ones
However, we might write a simple parallel program which can well solve the given problem, but not perform well
Distributed-Memory Platform
- These platforms comprise a set of processors, each with its own local memory
- Processes communicate using send and receive primitives
- Popular libraries such as MPI and PVM provide such primitives for process communication
- In parallel programming for a distributed-memory platform, a program is a composition of processes that communicate via send and receive
- For a blocking send/receive, a process is blocked till the data are received; a send/receive pair thus serves as a synchronization point, so synchronization is implicit
- In programming for a distributed-memory platform, the assignment of data determines the assignment of work, as data are partitioned, distributed and kept local to each process
- A novice might find it harder to write a parallel program for a distributed-memory platform than for a shared-memory platform, e.g., even a toy experimental one (see the MPI sketch below)
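As a concrete illustration of these primitives, a minimal MPI sketch (ranks, tag and the value sent are illustrative; run with e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 42;
        /* blocking send: returns once the buffer may be reused */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: returns only after the data has arrived,
           so this send/receive pair is an implicit synchronization point */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received x = %d\n", x);
    }
    MPI_Finalize();
    return 0;
}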
Hybrid Platform
- Nowadays it is very common for a computer cluster to have multiple computing nodes, each being a multicore processor
- We need to use shared-memory programming within each multicore node and message passing between the nodes
SIMD Platform
- A single instruction stream is executed by all processing elements – a large number of processors operate on different data in shared memory
- Shared memory synchronization is required
- Parallel programming on new types of SIMD devices (e.g., GPUs) uses frameworks such as CUDA and OpenCL
Memory Hierarchy
- Most unoptimized parallel programs run at less than 10% of the machine's "peak" performance
- Much of the performance loss occurs within single processors, most of it in the memory system
- Caches, registers and ILP are managed by the hardware and compiler; sometimes they do the right thing, other times they don't
- We need to write programs to make things more obvious to the hardware and compiler, and organize our codes to achieve high performance
Memory Hierarchy
Think of a parallel machine as a deep memory hierarchy (communication between levels costs time):
  registers <- L1 cache <- L2 cache <- L3 cache <- local memory (LM) <- remote memory (RM)
- registers hold words from L1; L1 holds cache lines from L2; L2 holds cache lines from L3; L3 holds cache lines from LM; LM holds data blocks from RM
- moving up the hierarchy: lower latency, higher bandwidth, smaller capacity
- moving down the hierarchy: higher latency, lower bandwidth, larger capacity
Memory Hierarchy
(same hierarchy as above: lower latency, higher bandwidth, smaller capacity at the top; higher latency, lower bandwidth, larger capacity at the bottom)
Most programs exhibit a high degree of locality in their accesses:
- spatial locality: accessing things nearby previous accesses
- temporal locality: reusing an item that was previously accessed
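A minimal C sketch of spatial locality (the array size is illustrative): summing a row-major 2D array row-by-row touches consecutive addresses, while summing it column-by-column jumps a full row between accesses.

#include <stdio.h>

#define DIM 1024
static double A[DIM][DIM];          /* C stores 2D arrays row-major */

int main(void) {
    double s1 = 0.0, s2 = 0.0;

    /* row-wise: consecutive addresses, good spatial locality */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            s1 += A[i][j];

    /* column-wise: stride of DIM words between accesses,
       poor spatial locality (a new cache line almost every access) */
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            s2 += A[i][j];

    printf("%f %f\n", s1, s2);
    return 0;
}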
Memory Hierarchy
Take advantage of the memory hierarchy for performance:
- keep values in small fast memory (cache, register) and reuse them – temporal locality
- get a chunk of contiguous data into cache (or a vector register) and use the whole chunk – spatial locality
- allow the processor to issue multiple independent reads or writes with a single instruction – data parallel and superscalar operations (VLIW, vector instructions)
A Simple Performance Model
Assume just 2 levels in the hierarchy, fast and slow memory, with all data initially in slow memory
- m  = number of memory elements (words) moved between fast and slow memory
- tm = time per slow memory operation
- f  = number of arithmetic operations
- tf = time per arithmetic operation, tf << tm
- q  = f / m = average number of flops per slow memory access – the computational intensity (key to algorithm efficiency)
Minimum possible time = f * tf, when all data are in fast memory
Actual time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
- larger q means time closer to the minimum f * tf
- q ≥ tm/tf is needed to get at least half of peak speed
- tm/tf is the machine balance (key to machine efficiency)
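For example, with an illustrative machine balance tm/tf = 10, an algorithm with q = 2 takes f * tf * (1 + 10/2) = 6 * f * tf, i.e., it runs at only about 1/6 of peak; reaching half of peak would require q ≥ tm/tf = 10.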
Matrix-Vector Multiplication (y = y + A*x)
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

m = number of slow memory refs = 3n + n^2
f = number of arithmetic operations = 2n^2
q = f / m ≈ 2
Matrix-vector multiplication is limited by slow memory speed
Matrix Multiplication (C = C + A*B)
for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)

With the memory traffic made explicit:
for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Here column j of B is re-read for every i, so m ≈ n^3 + 3n^2 and q = 2n^3 / m ≈ 2 – no better than matrix-vector multiplication
Blocked Matrix Multiplication
Problem: fast memory is too small to hold whole rows and columns
Solution: blocking (done by you/the compiler)
Consider A, B, C (of size n-by-n) to be N-by-N matrices of b-by-b subblocks, where b = n / N is the block size
for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}
Number of slow memory references:
m = N*n^2   (read each block of B N^3 times: N^3 * b^2 = N^3 * (n/N)^2 = N*n^2)
  + N*n^2   (read each block of A N^3 times)
  + 2n^2    (read and write each block of C once)
The computational intensity q = f / m = 2n^3 / ((2N + 2) * n^2) ≈ n/N = b for large n
Performance can be improved by increasing the block size b >> 2 (as long as 3b^2 < fast memory size)
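A C sketch of the blocked algorithm above (our illustration, assuming row-major storage and a block size b that divides n):

#include <stddef.h>

/* C = C + A*B on n-by-n row-major matrices, processed in b-by-b blocks */
void blocked_matmul(size_t n, size_t b,
                    const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += b)             /* block row index i */
        for (size_t jj = 0; jj < n; jj += b)         /* block col index j */
            for (size_t kk = 0; kk < n; kk += b)     /* block index k     */
                /* {do a matrix multiply on blocks}: C(i,j) += A(i,k)*B(k,j) */
                for (size_t i = ii; i < ii + b; i++)
                    for (size_t j = jj; j < jj + b; j++) {
                        double sum = C[i*n + j];
                        for (size_t k = kk; k < kk + b; k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}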
Preloading
It should be noted that there are other things that can affect performance:
- Data are transferred between cache and main memory in cache lines, each consisting of multiple words
- Knowing how a 2D matrix is stored in memory, and thus keeping reads to a set of contiguous data, may significantly improve the performance of your programs – e.g., change the order of the for loops in your matrix multiplication

Loop Unrolling
Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program:
- Reduce the loop-control penalty, i.e., reduce the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration
- Enhance register utilization by pre-loading data items into multiple registers, i.e., load the data items into registers and then reuse them many times, reducing the bandwidth needed between registers and memory
- Enhance the computational intensity and the utilization of multiple computing units
Loop Unrolling Example

Initial code:
for (i = 0; i <= N-L; i++) {
    s[i] = 0;                        /* initialize s[i] */
    for (j = 0; j < L; j++)
        s[i] += h[j] * a[i+j];
}

Change the loop order (h[j] is reused across i; a is read contiguously):
Initialize s[i] = 0 for all i;
for (j = 0; j < L; j++)
    for (i = 0; i <= N-L; i++)
        s[i] += h[j] * a[i+j];

Unrolling the j loop, factor = 4 (assume 4 divides N & L):
Initialize s[i] = 0 for all i;
for (j = 0; j < L; j += 4) {
    float h0 = h[j], h1 = h[j+1], h2 = h[j+2], h3 = h[j+3];
    for (i = 0; i <= N-L; i++)
        s[i] += (h0 * a[i+j] + h1 * a[i+j+1] + h2 * a[i+j+2] + h3 * a[i+j+3]);
}

Unrolling both loops:
Initialize s[i] = 0 for all i;
int k = N - L;
for (int j = 0; j < L; j += 4) {
    float h0 = h[j], h1 = h[j+1], h2 = h[j+2], h3 = h[j+3];
    for (int i = 0; i < N-L; i += 4) {
        float a0 = a[i+j], a1 = a[i+j+1], a2 = a[i+j+2], a3 = a[i+j+3];
        s[i]   += (h0 * a0 + h1 * a1 + h2 * a2 + h3 * a3);
        a0 = a[i+j+4];
        s[i+1] += (h0 * a1 + h1 * a2 + h2 * a3 + h3 * a0);
        a1 = a[i+j+5];
        s[i+2] += (h0 * a2 + h1 * a3 + h2 * a0 + h3 * a1);
        a2 = a[i+j+6];
        s[i+3] += (h0 * a3 + h1 * a0 + h2 * a1 + h3 * a2);
    }
    s[N-L] += (h0 * a[k] + h1 * a[k+1] + h2 * a[k+2] + h3 * a[k+3]);
    k += 4;
}

Will this version perform better? Not really – a data dependency is introduced: each reload (e.g., a0 = a[i+j+4]) must complete before the statements that use it, so the multiply-adds within an iteration cannot all proceed independently
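For reference, a small self-contained harness (our code; sizes and values are illustrative, and small integer values keep the float sums exact so == is a fair test) checking that the loop-order change preserves the result:

#include <stdio.h>

#define N 64
#define L 16

int main(void) {
    float a[N], h[L], s1[N - L + 1], s2[N - L + 1];
    for (int i = 0; i < N; i++) a[i] = (float)(i % 7);
    for (int j = 0; j < L; j++) h[j] = (float)(j % 5);

    /* initial version: i outer, j inner */
    for (int i = 0; i <= N - L; i++) {
        s1[i] = 0.0f;
        for (int j = 0; j < L; j++)
            s1[i] += h[j] * a[i + j];
    }

    /* loop order changed: j outer, i inner – h[j] stays in a register
       and a[] is traversed contiguously (spatial locality) */
    for (int i = 0; i <= N - L; i++) s2[i] = 0.0f;
    for (int j = 0; j < L; j++)
        for (int i = 0; i <= N - L; i++)
            s2[i] += h[j] * a[i + j];

    for (int i = 0; i <= N - L; i++)
        if (s1[i] != s2[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("results match\n");
    return 0;
}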