CS267: Applications of Parallel Computers at UC Berkeley

Logical organization (of the parallel computing platform) – the programmer's view of the organization of parallel programs
 Shared-memory (e.g., multicore processors)


 Distributed-memory (e.g., clusters)
 SIMD (data parallel) and multithreading (e.g., GPUs)
Physical organization – the actual hardware (e.g., a computer cluster)

Shared-memory platforms
 A global memory is accessible to all processors
 Uniform memory access (UMA) platform – the time taken by a processor to access any word in memory is identical
 Non-uniform memory access (NUMA) platform – accessing local memory is faster than accessing remote memory
 Shared-memory programming paradigms, e.g., multithreading (Pthreads) or compiler directives (OpenMP); accesses to shared data must be coordinated with synchronization
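As a minimal illustration of the shared-memory paradigm (this sketch is not from the notes; the thread count and the shared counter are made up), several Pthreads update shared data under a mutex:

/* Four threads increment a shared counter; the mutex provides the
 * synchronization mentioned above. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                  /* shared data in the global memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);        /* synchronize access to shared data */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 4 * 100000 with correct synchronization */
    return 0;
}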

Think of the machine's memory as a hierarchy:
 registers – hold words from the L1 cache
 L1 cache – holds cache lines from the L2 cache
 L2 cache – holds cache lines from the L3 cache
 L3 cache – holds cache lines from local memory (LM)
 local memory – holds data blocks from remote memory (RM), which is reached via communication
Moving up the hierarchy gives lower latency, higher bandwidth and smaller capacity; moving down gives higher latency, lower bandwidth and larger capacity.

Memory hierarchy and locality
 Most programs exhibit a high degree of locality in their accesses:
– spatial locality: accessing items near previous accesses
– temporal locality: reusing an item that was previously accessed
 The same hierarchy as above applies: each level (registers, L1, L2, L3, local memory) holds data from the level below it, with lower latency, higher bandwidth and smaller capacity toward the top, and higher latency, lower bandwidth and larger capacity toward the bottom.

Take advantage of locality for performance:
 Keep frequently used values in small, fast storage (e.g., registers) and reuse them – temporal locality
 Get a chunk of contiguous data into cache (or a vector register) and use the whole chunk – spatial locality
 Allow the processor to issue multiple independent reads or writes with a single instruction – data parallel operations, superscalar operations (VLIW)
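For example, the following sketch (illustrative; the function names and array layout are mine, not the notes') contrasts a traversal with good spatial locality against one with poor spatial locality for a row-major matrix, while the running sum stays in a register (temporal locality):

/* Traversal order determines spatial locality.
 * A is an n-by-n matrix stored in row-major order, as in C. */
#include <stddef.h>

double sum_row_major(const double *A, size_t n)
{
    double s = 0.0;                      /* kept in a register: temporal locality */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j];           /* contiguous accesses: good spatial locality */
    return s;
}

double sum_col_major(const double *A, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += A[i * n + j];           /* stride-n accesses: poor spatial locality */
    return s;
}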

A simple performance model:
 Assume just 2 levels in the memory hierarchy, fast and slow, with all data initially in slow memory
– m = number of memory elements (words) moved between fast and slow memory
– tm = time per slow memory operation
– f = number of arithmetic operations
– tf = time per arithmetic operation (tf << tm)
– q = f / m = average number of flops per slow memory access – the computational intensity (key to algorithm efficiency)
– tm/tf = machine balance (key to machine efficiency)
 Minimum possible time = f * tf (when all data are in fast memory)
 Actual time = f * tf + m * tm = f * tf * (1 + tm/tf * 1/q)
 A larger q means a time closer to the minimum f * tf; q ≥ tm/tf is needed to get at least half of peak speed
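A quick worked example (the numbers are illustrative, not from the notes): suppose the machine balance is tm/tf = 10 and an algorithm has computational intensity q = 2. Then the actual time is f * tf * (1 + 10 * 1/2) = 6 * f * tf, so the code runs at about 1/6 of peak speed; getting at least half of peak would require q ≥ tm/tf = 10.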

Matrix-Vector Multiplication, y = y + A*x

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
  {read row i of A into fast memory}
  for j = 1:n
    y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

m = number of slow memory refs = 3n + n²
f = number of arithmetic operations = 2n²
q = f / m ≈ 2
Matrix-vector multiplication is limited by slow memory speed.
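A direct C translation (a sketch; the row-major layout and the function name are mine) makes the memory traffic explicit:

/* y = y + A*x with A stored row-major, as in C.
 * x (n words) and y (n words) can stay in fast memory, but every element
 * of A (n^2 words) must be fetched from slow memory, so q = 2n^2/(n^2 + 3n) ≈ 2. */
#include <stddef.h>

void matvec(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double yi = y[i];                 /* keep y(i) in a register */
        for (size_t j = 0; j < n; j++)
            yi += A[i * n + j] * x[j];    /* row i of A streamed once */
        y[i] = yi;
    }
}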

Matrix Multiplication (naive version), C = C + A*B

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

With the data movement made explicit:

for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

Number of slow memory references:
m = n³ (read each column of B n times)
  + n² (read each row of A once)
  + 2n² (read and write each element of C once)
  = n³ + 3n²
f = number of arithmetic operations = 2n³
So q = f / m = 2n³ / (n³ + 3n²) ≈ 2 for large n – no improvement over matrix-vector multiply!
q should be as large as possible; the problem is that fast memory is too small to hold A, B and C at once, so data must be re-read from slow memory.
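In C (a sketch with row-major storage; the names are illustrative), the naive algorithm is:

/* Naive C = C + A*B on n-by-n row-major matrices.
 * When fast memory cannot hold whole matrices, roughly n^3 + 3n^2 words
 * move between slow and fast memory, so q ≈ 2. */
#include <stddef.h>

void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                cij += A[i * n + k] * B[k * n + j];  /* column of B: stride-n accesses */
            C[i * n + j] = cij;
        }
}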

Blocked (Tiled) Matrix Multiplication
Consider A, B, C (of size n-by-n) to be N-by-N matrices of b-by-b blocks, where b = n/N is the block size (chosen by you/the compiler).

for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

Number of slow memory references:
m = N*n² (read each block of B N³ times: N³ * b² = N³ * (n/N)² = N*n²)
  + N*n² (read each block of A N³ times)
  + 2n² (read and write each block of C once)
  = (2N + 2) * n²
So q = f / m = 2n³ / ((2N + 2) * n²) ≈ n/N = b for large n.
Performance can be improved by increasing the block size b, as long as all three blocks fit in fast memory (3b² < fast memory size).
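
A C sketch of the blocked algorithm (illustrative, not from the notes; it assumes the block size b divides n and row-major storage):

/* Blocked (tiled) C = C + A*B on n-by-n row-major matrices.
 * Each b-by-b block multiply reuses its three blocks from fast memory,
 * raising the computational intensity to roughly q ≈ b.
 * Pick b so that 3*b*b words fit in fast memory. */
#include <stddef.h>

void matmul_blocked(size_t n, size_t b, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t jj = 0; jj < n; jj += b)
            for (size_t kk = 0; kk < n; kk += b)
                /* multiply block A(ii,kk) by block B(kk,jj) into block C(ii,jj) */
                for (size_t i = ii; i < ii + b; i++)
                    for (size_t j = jj; j < jj + b; j++) {
                        double cij = C[i * n + j];
                        for (size_t k = kk; k < kk + b; k++)
                            cij += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = cij;
                    }
}

In practice b is tuned so the three blocks fit comfortably in the targeted cache level.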

Other performance considerations
It should be noted that there are other things besides blocking that matter for matrix multiplication:
 Preloading (prefetching) of data by the processor
 A 2D matrix is stored in row-major order in C programs
 Data are transferred between cache and main memory in cache lines, which consist of multiple words; keeping reads to a set of contiguous data may significantly improve performance, e.g., by changing the order of the for loops in your matrix multiplication

Loop unrolling
 Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program by pre-loading multiple items into registers, i.e., loading the items into registers and then reusing them
 It enhances the computational intensity and the utilization of multiple computing units
 It reduces the loop penalty, i.e., the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration

Initial loop, computing s[i] = h[0]*a[i] + h[1]*a[i+1] + ... + h[L-1]*a[i+L-1]:

for (j = 0; j < L; j++)
    s[i] += h[j] * a[i+j];

Summary
 Parallel computing is about performance; to develop an efficient parallel program we also need efficient (serial) computation on each processor
 To achieve good performance on a single processor we need to understand
– the memory system (registers, caches)
– instruction-level parallelism, ILP (e.g., pipelining, superscalar execution)
 Though registers, caches and ILP are handled by hardware and compilers, programmers need to get involved to enhance the performance of many (serial) programs
 Blocking is a good technique to increase computational intensity
 Loop unrolling was originally used to reduce loop-control overhead; it also increases computational intensity by making use of more registers
– The advantages are obvious for simple loops if data loaded into registers can be used many times – a great increase in computational intensity
– It has disadvantages for more complicated loops
 Other things may also significantly affect performance (e.g., cache lines, reading in row-major order for C programs, preloading by the processor)

Reference:
Lectures 2 & 3 slides in CS267: Applications of Parallel Computers at UC Berkeley, https://sites.google.com/lbl.gov/cs267