Logical organization (or parallel programming model) – the programmer's view of the organization of parallel programs
Physical organization – the actual hardware organization of parallel computers
- Shared-memory (e.g., multicore processors)
- Distributed-memory (e.g., computer clusters)
- SIMD (data parallel) and multithreading (e.g., GPUs)
Shared-memory platforms:
- A global memory is accessible to all processors.
- Uniform memory access (UMA) platform – the time taken by a processor to access any memory word is identical.
- Non-uniform memory access (NUMA) platform – the access time depends on which part of memory is accessed (e.g., local vs. remote).
- Shared-memory programming paradigms, e.g., multithreads (Pthreads) or directives (OpenMP).
- Synchronization is needed when multiple threads access shared data.
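A minimal OpenMP sketch of the directive-based paradigm (my example, not from the slides; the function name and the axpy operation are illustrative):

    #include <omp.h>

    /* All threads see the same arrays in the global shared memory;
       the directive splits the loop iterations among them. */
    void axpy(int n, double alpha, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];   /* disjoint writes: no extra synchronization needed */
    }

Compile with an OpenMP-enabled compiler, e.g., gcc -fopenmp.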
Think of a pyramid-shaped hierarchy; communication (data movement) happens between adjacent levels.
[Figure: memory hierarchy pyramid. Registers hold words from the L1 cache; L1 holds cache lines from L2; L2 holds cache lines from L3; L3 holds cache lines from local memory (LM); local memory holds data blocks from remote memory (R). Toward the top: lower latency, higher bandwidth, smaller capacity. Toward the bottom: higher latency, lower bandwidth, larger capacity.]
Memory Hierarchy and Locality
Most programs exhibit a high degree of locality in their memory accesses:
- spatial locality: accessing an item that is near previous accesses
- temporal locality: reusing an item that was previously accessed
Take advantage of locality for performance:
- Keep values in small fast memory (registers) and reuse them – temporal locality.
- Get a chunk of contiguous data into cache (or a vector register) and use the whole chunk – spatial locality.
- Allow the processor to issue multiple independent reads or writes with a single instruction – data parallel operations, superscalar operations (VLIW).
A simple model of memory:
- Assume just 2 levels in the hierarchy, fast and slow memory.
- All data is initially in slow memory.
  m  = number of memory elements (words) moved between fast and slow memory
  tm = time per slow memory operation
  f  = number of arithmetic operations
  tf = time per arithmetic operation (tf << tm)
  q  = f / m, the average number of flops per slow memory access – computational intensity (key to algorithm efficiency)
- Minimum possible time = f * tf, when all data is in fast memory.
- Actual time = f * tf + m * tm = f * tf * (1 + tm/tf * 1/q)
- Larger q means the time is closer to the minimum f * tf:
  q ≥ tm/tf is needed to get at least half of peak speed
  tm/tf – machine balance (key to machine efficiency)
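A small worked example (the numbers are illustrative, not from the slides): if the machine balance is tm/tf = 10, an algorithm with q = 2 runs in f * tf * (1 + 10/2) = 6 * f * tf, i.e., at about 1/6 of peak, while getting half of peak would require q ≥ 10.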
Matrix-Vector Multiply

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
  {read row i of A into fast memory}
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

m = number of slow memory refs = 3n + n²
f = number of arithmetic operations = 2n²
q = f / m ≈ 2
Matrix-vector multiply is limited by slow memory speed.
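The same computation as a C sketch (the function name and the flat row-major array layout are my assumptions, not from the slides):

    /* y = y + A*x for an n-by-n matrix A stored row-major.
       Each element of A is read exactly once, so q stays near 2. */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i*n + j] * x[j];
    }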
Naive Matrix Multiplication

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

With the data movement written out:

for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

Number of slow memory references:
m = n³ (read each column of B n times)
  + n² (read each row of A once)
  + 2n² (read and write each element of C once)
  = n³ + 3n²
f = number of arithmetic operations = 2n³
So q = f / m = 2n³ / (n³ + 3n²) ≈ 2 for large n – no improvement over matrix-vector multiply!
q should be as large as possible: the whole problem touches only about 4*n² words of data, so ideally q could approach 2n³ / (4n²) = n/2 = O(n).
Problem: fast memory is too small to hold all the data.
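The naive algorithm as a C sketch (identifiers and row-major layout are my assumptions):

    /* C = C + A*B for n-by-n row-major matrices.
       In the i-j-k order the inner loop strides down a column of B,
       which is what drives the n^3 slow-memory references counted above. */
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }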
Blocking for Matrix Multiplication
Consider A, B, C (of size n-by-n) to be N-by-N matrices of b-by-b blocks, where b = n / N is the block size (chosen by you/the compiler).

for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

Number of slow memory references:
m = N*n² (read each block of B N³ times: N³ * b² = N³ * (n/N)² = N*n²)
  + N*n² (read each block of A N³ times)
  + 2n² (read and write each block of C once)
So the computational intensity q = f / m = 2n³ / ((2N + 2) * n²) ≈ n/N = b for large n.
Performance can be improved by increasing the block size b (as long as the three blocks fit in fast memory, i.e., 3b² < fast memory size).
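A C sketch of the blocked algorithm (names are illustrative; the block size b is assumed to divide n evenly):

    /* Blocked C = C + A*B, row-major, with n % b == 0.
       The three b-by-b working sets together need 3*b*b words of fast
       memory, which is what lets q grow to roughly b. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += b)
            for (int jj = 0; jj < n; jj += b)
                for (int kk = 0; kk < n; kk += b)
                    /* multiply one pair of blocks: C(ii,jj) += A(ii,kk) * B(kk,jj) */
                    for (int i = ii; i < ii + b; i++)
                        for (int j = jj; j < jj + b; j++)
                            for (int k = kk; k < kk + b; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }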
Preloading and other considerations
It should be noted that there are other things that affect performance:
- A 2D matrix is stored in row-major order in C programs.
- Data transferred between cache and main memory moves in cache lines, each consisting of multiple words, and the processor can preload (prefetch) them.
- Thus keeping reads to a set of contiguous data may significantly improve the performance, e.g., change the order of the for loops in your matrix multiplication (see the sketch below).
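One way the loop-order change can look in C (this reordering is my illustration; with row-major storage, the i-k-j order makes the inner loop walk along rows of B and C):

    /* i-k-j ordering: the inner j loop reads row k of B and updates row i of C
       contiguously, so whole cache lines are used instead of one word per line. */
    void matmul_ikj(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double a = A[i*n + k];          /* reused across the whole j loop */
                for (int j = 0; j < n; j++)
                    C[i*n + j] += a * B[k*n + j];
            }
    }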
Loop Unrolling
Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program.
- Load data items into registers, i.e., pre-load the items into multiple registers and then reuse them many times.
- Enhance the computational intensity, the use of memory bandwidth, and the utilization of multiple computing units.
- Reduce the loop penalty, i.e., reduce the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration.

Initial code (a simple filter: for i = 0, s[0] = h0*a0 + h1*a1 + ...):

for (j = 0; j < L; j++)
    s[i] += h[j] * a[i+j];
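A minimal sketch of one unrolled form of this loop (the unroll factor of 4 and the assumption that L is a multiple of 4 are mine, not from the slides):

    /* Unrolled by 4: four independent partial sums live in registers,
       loop-control overhead is paid once per four iterations, and the
       independent multiply-adds can overlap in the pipeline. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (j = 0; j < L; j += 4) {
        s0 += h[j]     * a[i + j];
        s1 += h[j + 1] * a[i + j + 1];
        s2 += h[j + 2] * a[i + j + 2];
        s3 += h[j + 3] * a[i + j + 3];
    }
    s[i] += s0 + s1 + s2 + s3;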
Summary
- Parallel computing is about performance. To develop an efficient parallel program, the serial computation on each processor must also be made efficient.
- To achieve good performance on a single processor we need to understand the memory system (memory hierarchy) and ILP (e.g., pipelining, superscalar execution).
- Though registers, caches and ILP are handled by hardware and compilers, programmers need to get involved to enhance the performance in many cases.
- Blocking is a good technique to increase computational intensity.
- Loop unrolling was originally used to reduce the loop penalty; it also helps by making use of more registers. It has disadvantages for more complicated loops, but the advantages are obvious for simple loops: if data loaded into registers can be used many times, there is a great increase in computational intensity.
- Other things may also significantly affect the performance (e.g., cache lines, reading in row-major order for C programs, preloading by the processor).
Reference: Lectures 2 & 3 slides in CS267 Applications of Parallel Computers at UC Berkeley, https://sites.google.com/lbl.gov/cs267