COMP5426 Parallel and Distributed Computing
Programming Shared Memory Platforms
- Multiple threads are created and running concurrently, each having their own local (private) variables
- A set of global variables is shared by all threads
  - Threads communicate implicitly using shared variables
  - Threads coordinate explicitly by synchronization on shared variables (see the sketch below)
- Shared memory machines are usually small (~100 cores for a large one)
  - Fine-grained parallelism
- Key issues: data locality and explicit synchronization
- Data partitioning unnecessary? Not true!
  - Yes, data do not need to be moved around, but we must still consider ILP and data locality – very important
- Also need to consider synchronization in algorithm design
  - Minimize synchronization to improve efficiency
- We will use several examples for comparison
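To make this model concrete, below is a minimal sketch using POSIX threads (the choice of Pthreads is an assumption; the lecture does not prescribe a particular threading API). Each thread computes with its own private locals, all threads communicate through one shared global variable, and a mutex provides the explicit synchronization.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Global variable shared by all threads (implicit communication). */
static long shared_sum = 0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;          /* local (private) variables on this thread's stack */
    long local_result = id * id;  /* stand-in for some independent computation */

    /* Explicit coordination: synchronize before updating the shared variable. */
    pthread_mutex_lock(&sum_lock);
    shared_sum += local_result;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);
    return 0;
}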
Matrix Multiplication
- For coarse-grained parallelism, partition matrix C and assign the partitions to the threads – relatively easy
- No data dependency, thus no synchronization is needed – embarrassingly parallel
- Note the regular data structure

Partitioning schemes (1D cyclic, 1D block, 2D block, 2D cyclic block):
- In 1D cyclic partitioning, rows are assigned to the threads one by one in a cyclic manner until all the rows are assigned (see the code sketch below)
- In 1D block partitioning, rows are first grouped into blocks and then each block is assigned to one thread
- In 2D block partitioning, both rows and columns are first grouped into blocks and then each block (a submatrix) is assigned to one thread
- For shared memory machines, we may assign each thread a row of small 2D blocks
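As an illustration of how the row-based schemes translate into thread code, here is a sketch in C for row-major square matrices (the function and parameter names are invented for the example). Each of the nthreads workers calls it with its own thread id:

/* Each thread computes its own rows of C = A * B under 1D cyclic partitioning:
 * thread tid owns rows tid, tid + nthreads, tid + 2*nthreads, ...
 * No synchronization is needed because no two threads write the same row of C. */
void matmul_1d_cyclic(const double *A, const double *B, double *C,
                      int n, int tid, int nthreads)
{
    for (int i = tid; i < n; i += nthreads)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

For 1D block partitioning the only change is the row range: thread tid would loop over the contiguous rows from tid*n/nthreads to (tid+1)*n/nthreads - 1.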
Which one is better?
- Locality is very important (memory hierarchy)
  - Increase opportunities for ILP
  - Increase computational intensity
- 1D cyclic: computing one row of C needs one row of A and the whole of B
  - If B is larger than the cache size – many cache misses
- 1D block: better computational intensity, but each thread still needs the whole of B
  - Same problem if B is large
  - To improve further – partition the row blocks – higher computational intensity
- 2D block: only needs one block row of A and one block column of B – high computational intensity
  - The blocks fit into the cache to minimize cache misses (see the blocked multiply sketch below)
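A sketch of the blocking idea applied to the multiply itself (assuming n is a multiple of the tile size BS; both names are invented for the example): three BS x BS tiles are reused from cache many times before being evicted, which is what raises the computational intensity.

#define BS 64   /* tile size; assumes three BS x BS tiles fit in cache together */

/* Cache-blocked (tiled) multiply, C += A * B, for row-major n x n matrices. */
void matmul_blocked(const double *A, const double *B, double *C, int n)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* Multiply one tile of A by one tile of B into one tile of C. */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}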
Matrix Multiplication
- There is no data dependency – embarrassingly parallel
- Linear speedup is achievable
- The focus should be on the sequential computation on each core to achieve high efficiency
  - e.g., cache effects, memory hierarchy, loop optimizations (see the loop-order sketch below)
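One concrete single-core loop optimization (an added sketch, not taken from the slides) is simply choosing a cache-friendly loop order: for row-major data the i-k-j order below walks B and C with unit stride and keeps one element of A in a register, unlike the textbook i-j-k order.

/* C += A * B with i-k-j loop order; A, B, C are row-major n x n matrices. */
void matmul_ikj(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];              /* reused across the inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j]; /* unit-stride accesses to B and C */
        }
}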
Parallel Scan

Gaussian Elimination with Partial Pivoting (GEPP)
- GEPP should use BLAS 3 whenever possible
  - BLAS 3 (matrix multiplication) is much faster (higher computational intensity) than BLAS 1 & 2
  - How? Blocking, i.e., blocked GEPP (next slide)
for ib = 1 to n-1 step b                ... process matrix b columns at a time
    end = ib + b - 1                    ... point to end of block of b columns
    apply BLAS2 version of GEPP to get A(ib:n, ib:end) = L' * U'
    ... let LL denote the strict lower triangular part of A(ib:end, ib:end) + I
    A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)
                                        ... update next b rows of U
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                        ... apply delayed updates with single matrix-multiply
                                        ... with inner dimension b
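To see why the last step is BLAS 3 work, here is a hedged C sketch of the delayed update A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n), written with 0-based indices for an n x n row-major array; in practice this single update would go to a tuned BLAS 3 routine (dgemm) rather than a hand-written loop.

/* Trailing-submatrix update of blocked GEPP; b = end - ib + 1 is the inner dimension. */
void trailing_update(double *A, int n, int ib, int end)
{
    int b = end - ib + 1;
    for (int i = end + 1; i < n; i++)
        for (int j = end + 1; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < b; k++)   /* inner dimension b: O(b) flops per element touched */
                s += A[i * n + (ib + k)] * A[(ib + k) * n + j];
            A[i * n + j] -= s;
        }
}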
Parallel Algorithm for GEPP
- In GEPP the active submatrix shrinks after every iteration, so load balancing needs to be considered seriously
- 1D block partitioning: assign one block of contiguous rows to each thread
  - As the active submatrix shrinks, more and more threads become idle
  - Load not balanced – not efficient
- 2D cyclic block partitioning: partition the matrix into small 2D blocks and assign the blocks to the threads in a cyclic manner
  - Even as the submatrix shrinks (up to the last iteration, i = n-1), the load is still balanced – most threads keep doing useful work (see the ownership sketch below)
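A tiny sketch of what cyclic ownership of the 2D blocks means (the p_r x p_c thread grid and the helper name are assumptions for illustration): because ownership wraps around in both dimensions, the shrinking trailing submatrix still touches nearly every thread.

/* Owner of the block at block coordinates (bi, bj) on a p_r x p_c thread grid. */
typedef struct { int row, col; } GridPos;

GridPos block_owner(int bi, int bj, int p_r, int p_c)
{
    GridPos owner = { bi % p_r, bj % p_c };   /* cyclic in both dimensions */
    return owner;
}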
- There are two types of load balancing
  - Static – when the load is predictable, the work can be partitioned and assigned in advance, and the assignment will not change during the computation
  - Dynamic – when the amount of work changes and is unpredictable, we have to dynamically balance the load during the computation
- Ideally, we want all the processors, or cores, to be busy doing useful work all the time and to complete their work at the same time
- Recall Amdahl's law – even a small amount of load imbalance can significantly affect the performance
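As a rough worked illustration (an added example, not a result from the lecture): if the most heavily loaded thread ends up with a fraction f of the total work, the parallel time is at least f times the sequential time, so the speedup is at most 1/f. With 16 threads, a busiest thread holding 1/8 of the work caps the speedup at 8, no matter how well the other 15 cores are utilized.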
- Dynamic load balancing is harder than static partitioning, but it is relatively easier on shared memory machines than on distributed memory machines – global data in shared memory are accessible by all threads
- On a shared memory machine, for example, we can maintain a global task pool accessible by all threads; threads retrieve tasks from the pool for execution (a sketch follows below)
- However, we must consider the cost of achieving good load balancing
  - Minimize the additional computational cost
  - Minimize synchronization on global data on a shared memory machine
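A minimal sketch of such a task pool, assuming Pthreads and tasks identified by an index (all names are invented for the example): the shared counter is the only global state, so the synchronization cost is one short critical section per task.

#include <pthread.h>

typedef struct {
    int next;               /* index of the next unclaimed task (shared) */
    int ntasks;             /* total number of tasks */
    pthread_mutex_t lock;   /* protects next */
} TaskPool;

/* Returns the next task index, or -1 when the pool is empty. */
int pool_get(TaskPool *pool)
{
    int t = -1;
    pthread_mutex_lock(&pool->lock);    /* synchronization on shared data */
    if (pool->next < pool->ntasks)
        t = pool->next++;
    pthread_mutex_unlock(&pool->lock);
    return t;
}

Each worker thread then loops while ((t = pool_get(&pool)) != -1) { do_task(t); }, with do_task standing in for the real work and the pool fields initialized before the threads start; an atomic fetch-and-add could replace the mutex to cut the synchronization cost further.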
Summary
- Task/data partitioning & assignment
  - Static: the workload is predictable; partitioning and assignment are done in advance and do not change during the computation – today's examples
  - Dynamic: the workload changes during the computation and is unpredictable; dynamic partitioning algorithms are harder
- Thread coordination is explicitly done by synchronization on shared variables
- Load balancing – very important
  - Amdahl's law: even a small amount of load imbalance may significantly affect the overall performance
- Data locality – very important
  - Need to seriously consider how to optimize the performance on a single processor, or core
  - Increase opportunities for ILP
  - Increase computational intensity (memory hierarchy)
- Three examples
  - Matrix multiplication – embarrassingly parallel
  - Scan (or prefix) operation – increased workload for parallel computation
  - Gaussian elimination with partial pivoting – harder to balance the load
- Basic Linear Algebra Subroutines (BLAS)
  - BLAS 1: vector operations
  - BLAS 2: matrix-vector operations
  - BLAS 3: matrix multiplication
  - BLAS 3 is faster because of its high computational intensity and should be used whenever possible
- Through these examples we also introduced data partitioning strategies widely used with regular data structures: 1D cyclic, 1D block, 2D block and 2D cyclic block
  - The blocking technique achieves better locality