
COMP5426 Distributed Computing
Design (1)

Logical organization (or parallel computation models)


Logical organization of parallel programs (parallel computation models):
 Distributed-memory (e.g., message passing)
 Shared-memory (e.g., multithreading)
 SIMD (data parallel) and multithreading
– The logical organization need not match the actual hardware, e.g., a computer cluster

Shared-Memory Platforms
 A global memory is accessible to all processors
 Uniform memory access (UMA) platform – if the time taken by a processor to access any memory word is identical
 Non-uniform memory access (NUMA) platform – otherwise, i.e., access times differ across memory banks
 Shared-memory programming paradigms, e.g., multithreads (Pthreads) or directives (OpenMP)
 Threads need synchronization

Thread Communication
 Assume x is a global shared variable; global variables are accessible to all processors (threads)
 Threads communicate implicitly by performing operations on shared addresses, so explicit synchronization is needed
 Implicit thread communication through a global shared variable: thread 1 executes x = a; and thread 2 then reads x
 Question: How can we guarantee that the value read in thread 2 will be equal to that of a? Synchronization!

Shared-Memory Programming
 Parallel programming for shared-memory platforms seems simply a natural extension of sequential programming
 Parallel program structures may be very similar to sequential program structures
 However, we might write a simple parallel program which can solve the given problem, but not perform well
 We'll see examples of that in parallel programming for shared-memory platforms

Distributed-Memory Platforms
 These platforms comprise a set of processors, each with its own memory
 Processes communicate by sending and receiving messages using communication primitives
 Popular libraries such as MPI and PVM provide such primitives for process communication

Distributed-Memory Programming
 In parallel programming for distributed-memory platforms, communication is a composition of send/receive pairs (processes communicate through matched sends and receives)
 For a blocking send/receive, the sender/receiver is blocked till the data are received – the send/receive pair serves as a synchronization point, thus synchronization is implicit
 Data determines the assignment of work, as data are partitioned, distributed and kept local to each process
 A novice might find it harder to write a parallel program for distributed-memory than for shared-memory platforms
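Since the data distribution itself assigns the work, a typical first step is computing which range of elements each process owns. A hypothetical helper (the name `block_range` and its signature are mine, not from the lecture) for block-partitioning n elements over p processes:

```c
/* Block-partition n data elements over p processes. Each process keeps
 * its half-open range [lo, hi) local, so the data distribution itself
 * determines the assignment of work. Illustrative sketch. */
void block_range(int n, int p, int rank, int *lo, int *hi) {
    int base = n / p;            /* minimum elements per process */
    int rem  = n % p;            /* first rem processes get one extra */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}
```

In an MPI program each rank would call this with its own rank number and then operate only on its local block.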

Hybrid Platforms
 It is very common for a computer cluster to have multiple computing nodes, each being a multicore processor – e.g., even a toy experimental cluster
 We need to use shared-memory programming within a multicore node and message passing between the nodes

SIMD Platform
 A single instruction stream is executed by all processing elements
 A large number of processors operate on different data in shared memory
 Shared-memory synchronization
 New types of parallel programming languages and frameworks: CUDA and OpenCL

Memory Hierarchy
 Most unoptimized parallel programs run at less than 10% of the machine's "peak" performance
 Much of the performance is lost within individual processors, in the memory system
 Most of that is outside our control: registers and ILP are managed by hardware and compiler – sometimes they do the right thing, other times they don't
 But for some of it we can help: we need to write our codes to make things easier for them, to achieve high performance

Memory Hierarchy
 Think of a parallel machine as a memory hierarchy; communication moves data between its levels
 Toward the processor: lower latency, higher bandwidth, smaller capacity
 Away from the processor: higher latency, lower bandwidth, larger capacity
[Figure: registers hold words from L1; L1 holds cache lines from L2; L2 holds cache lines from L3; L3 holds cache lines from local memory; local memory holds data blocks from remote memory]

Memory Hierarchy
 Programs exhibit a high degree of locality:
– spatial locality: accessing an item near previous accesses
– temporal locality: reusing an item that was previously accessed
[Figure: the same hierarchy – lower latency, higher bandwidth and smaller capacity near the processor; higher latency, lower bandwidth and larger capacity toward local and remote memory]
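As an illustration of spatial locality (my own example, not from the slides; function names and the size 64 are arbitrary), the two loops below compute the same sum, but the row-major traversal is unit-stride, so each access lands in the cache line just fetched, while the column-major one jumps a full row of doubles per access:

```c
#define N 64

/* Sum an N-by-N row-major matrix two ways. Both return the same value;
 * on real hardware the row-major loop makes far better use of cache
 * lines fetched from memory. */
double sum_row_major(double m[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];        /* unit-stride: good spatial locality */
    return s;
}

double sum_col_major(double m[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];        /* stride-N: poor cache-line reuse */
    return s;
}
```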

Memory Hierarchy
Take advantage of locality for performance:
 Keep values in small fast memory (registers) and reuse them – temporal locality
 Get a chunk of contiguous data into cache (or a vector register) and use the whole chunk – spatial locality
 Allow the processor to issue multiple independent operations (reads or writes) with a single instruction – data parallel, superscalar, VLIW operations

Simple Model of Memory
 Assume just 2 levels in the hierarchy, fast and slow memory; all data initially in slow memory
 m = number of memory elements (words) moved between fast and slow memory
 tm = time per slow memory operation
 f = number of arithmetic operations
 tf = time per arithmetic operation (tf << tm)

 q = f / m: average number of flops per slow memory access – computational intensity (key to algorithm efficiency)
 Actual time = f*tf + m*tm = f*tf * (1 + (tm/tf) * (1/q))
 Minimum possible time = f*tf, when all data are in fast memory
 tm/tf – machine balance (key to machine efficiency)
 Larger q means time closer to the minimum f*tf; q ≥ tm/tf is needed to get at least half of peak speed
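The model translates directly into code. A small sketch (function names are mine) computing the two algebraically equal forms of the running time, so the role of the computational intensity q is explicit:

```c
/* Two-level memory model from the slide:
 * time = f*tf + m*tm = f*tf * (1 + (tm/tf) * (1/q)), with q = f/m. */
double model_time(double f, double m, double tf, double tm) {
    return f * tf + m * tm;
}

double model_time_intensity(double f, double m, double tf, double tm) {
    double q = f / m;   /* computational intensity: flops per slow access */
    return f * tf * (1.0 + (tm / tf) * (1.0 / q));
}
```

For example, with f = 100 flops, m = 50 slow accesses, tf = 1 and tm = 10, both forms give 100 + 500 = 600 time units; only 100 of those are arithmetic, the rest is memory traffic (q = 2, far below the machine balance tm/tf = 10).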

Matrix-Vector Multiply
y = y + A*x:
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
    {read row i of A into fast memory}
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

 m = number of slow memory refs = 3n + n^2
 f = number of arithmetic operations = 2n^2
 q = f / m ≈ 2
 Matrix-vector multiplication is limited by slow memory speed
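The pseudocode above can be sketched in C (my translation; A is stored row-major in a flat array). It performs f = 2n^2 flops while touching roughly n^2 words of A, matching the q ≈ 2 count:

```c
/* y = y + A*x for an n-by-n row-major matrix A. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* y(i) = y(i) + A(i,j)*x(j) */
}
```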

Matrix Multiplication
C = C + A*B:
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)

With the slow memory references made explicit:
for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
        {read C(i,j) into fast memory}
        {read column j of B into fast memory}
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        {write C(i,j) back to slow memory}

 m = n^3 (column j of B is re-read for every i) + n^2 (each row of A) + 2n^2 (read and write each C(i,j)) ≈ n^3, so q = 2n^3 / n^3 ≈ 2 – no better than matrix-vector multiply
 Problem: fast memory is too small to keep the data around for reuse
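The triply nested loop in C (my translation; all matrices n-by-n, row-major, C accumulated in place):

```c
/* C = C + A*B for n-by-n row-major matrices. */
void matmul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```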

Blocked Matrix Multiplication
Consider A, B, C (of size n-by-n) to be N-by-N matrices of b-by-b blocks, where b = n / N is the block size:
for i = 1 to N
    for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
            {read block A(i,k) into fast memory}
            {read block B(k,j) into fast memory}
            C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}
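A sketch of the blocked loop nest in C (mine; row-major storage, and b must divide n). The three innermost loops are the "matrix multiply on blocks"; the b-by-b tiles of A, B and C are what would live in fast memory:

```c
/* Blocked C = C + A*B for n-by-n row-major matrices; b divides n. */
void matmul_blocked(int n, int b,
                    const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i += b)
        for (int j = 0; j < n; j += b)
            for (int k = 0; k < n; k += b)
                /* b-by-b multiply: block A(i,k) times block B(k,j)
                 * accumulated into block C(i,j) */
                for (int ii = i; ii < i + b; ii++)
                    for (int jj = j; jj < j + b; jj++)
                        for (int kk = k; kk < k + b; kk++)
                            C[ii * n + jj] += A[ii * n + kk] * B[kk * n + jj];
}
```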

 Number of slow memory references m = N*n^2 (blocks of B are read N^3 times in total: N^3 * b^2 = N^3 * (n/N)^2 = N*n^2 words) + N*n^2 (blocks of A are likewise read N^3 times) + 2n^2 (read and write each block of C once)
 The computational intensity q = f / m = 2n^3 / ((2N + 2) * n^2) ≈ n / N = b for large n
 Performance can be improved by increasing the block size b >> 2 (as long as 3b^2 < fast memory size, so that one block each of A, B and C fits in fast memory)

Other Considerations
It should be noted that there are other things to consider for matrix multiplication, e.g., changing the order of the for loops in your programs:
 A 2D matrix is stored linearly (e.g., row by row)
 Data are transferred between cache and main memory in cache lines, each consisting of multiple words, thus keeping reads to a set of contiguous data items may significantly improve the performance

Preloading
 Preloading, i.e., loading data items into registers and then reusing them many times, enhances the utilization of registers and reduces the demand on memory bandwidth

Loop Unrolling
 Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program
 It enhances the computational intensity and the utilization of multiple computing units
 It reduces the loop overhead penalty, i.e., the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration

Example: compute s[i] as the sum of h[j] * a[i+j] for j = 0 to L-1

Initial version (for each i = 0, ..., N-L):
    Initialize s[i] = 0;
    for (j = 0; j < L; j++)
        s[i] += h[j] * a[i+j];

Change the loop order (initialize all s[i] = 0 first):
    for (j = 0; j < L; j++)
        for (i = 0; i <= N-L; i++)
            s[i] += h[j] * a[i+j];

Unrolling the j loop (factor = 4; assume 4 divides N & L), preloading h into registers:
    for (j = 0; j < L; j += 4) {
        float h0 = h[j], h1 = h[j+1], h2 = h[j+2], h3 = h[j+3];
        for (i = 0; i <= N-L; i++)
            s[i] += (h0 * a[i+j] + h1 * a[i+j+1] + h2 * a[i+j+2] + h3 * a[i+j+3]);
    }

Unrolling both loops:
    int k = N - L;
    for (int j = 0; j < L; j += 4) {
        float h0 = h[j], h1 = h[j+1], h2 = h[j+2], h3 = h[j+3];
        for (int i = 0; i < N-L; i += 4) {
            float a0 = a[i+j], a1 = a[i+j+1], a2 = a[i+j+2], a3 = a[i+j+3];
            s[i]   += (h0 * a0 + h1 * a1 + h2 * a2 + h3 * a3);
            a0 = a[i+j+4];
            s[i+1] += (h0 * a1 + h1 * a2 + h2 * a3 + h3 * a0);
            a1 = a[i+j+5];
            s[i+2] += (h0 * a2 + h1 * a3 + h2 * a0 + h3 * a1);
            a2 = a[i+j+6];
            s[i+3] += (h0 * a3 + h1 * a0 + h2 * a1 + h3 * a2);
        }
        s[N-L] += (h0 * a[k] + h1 * a[k+1] + h2 * a[k+2] + h3 * a[k+3]);
        k += 4;
    }

Will this version perform better? Not really – a data dependency is introduced: each s[i+t] update reuses an a register that the previous statement just overwrote, serializing the statements
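To sanity-check the transformation (my own harness, not part of the lecture), the functions below compare the straightforward kernel against a j-loop unrolled-by-4 variant with the h values preloaded into scalars; both must produce identical results:

```c
/* Baseline: s[i] = sum over j < L of h[j]*a[i+j]. */
void fir_basic(int N, int L, const float *h, const float *a, float *s) {
    for (int i = 0; i <= N - L; i++) {
        s[i] = 0.0f;
        for (int j = 0; j < L; j++)
            s[i] += h[j] * a[i + j];
    }
}

/* Unrolled by 4 in j with h preloaded; assumes 4 divides L, as in the
 * slide. Same results as fir_basic, fewer loop-control instructions. */
void fir_unrolled(int N, int L, const float *h, const float *a, float *s) {
    for (int i = 0; i <= N - L; i++)
        s[i] = 0.0f;
    for (int j = 0; j < L; j += 4) {
        float h0 = h[j], h1 = h[j+1], h2 = h[j+2], h3 = h[j+3];
        for (int i = 0; i <= N - L; i++)
            s[i] += h0 * a[i+j]   + h1 * a[i+j+1]
                  + h2 * a[i+j+2] + h3 * a[i+j+3];
    }
}
```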