COMP5426 Parallel and Distributed Computing
Programming Shared Memory Platforms
- Multiple threads are created and running concurrently, each having their own local (private) variables
- A set of global variables is shared by all threads
  - Threads communicate implicitly using shared variables
  - Threads coordinate explicitly by synchronization on shared variables (see the sketch below)
- Shared memory machines are usually small (~100 cores for a large one)
  - Fine-grained parallelism
- Key issues: data locality and explicit synchronization
- Data partitioning unnecessary? Not true!
  - Yes, data do not need to be moved around, but we must still consider ILP and data locality – very important
- Also need to consider synchronization in algorithm design
  - Minimize synchronization to improve efficiency
- We will use several examples for comparison
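To make this model concrete, below is a minimal sketch using POSIX threads (the choice of Pthreads is an assumption; the lecture does not prescribe a particular threading API). Each thread computes with its own private locals, all threads communicate through one shared global variable, and a mutex provides the explicit synchronization.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Global variable shared by all threads (implicit communication). */
static long shared_sum = 0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;          /* local (private) variables on this thread's stack */
    long local_result = id * id;  /* stand-in for some independent computation */

    /* Explicit coordination: synchronize before updating the shared variable. */
    pthread_mutex_lock(&sum_lock);
    shared_sum += local_result;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);
    return 0;
}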
Matrix Multiplication
- For coarse-grained parallelism, partition matrix C and assign the partitions to the threads – relatively easy
- No data dependency, thus no synchronization is needed – embarrassingly parallel
- Note the regular data structure

Partitioning schemes (1D cyclic, 1D block, 2D block, 2D cyclic block):
- In 1D cyclic partitioning, rows are assigned to the threads one by one in a cyclic manner until all the rows are assigned (see the code sketch below)
- In 1D block partitioning, rows are first grouped into blocks and then each block is assigned to one thread
- In 2D block partitioning, both rows and columns are first grouped into blocks and then each block (a submatrix) is assigned to one thread
- For shared memory machines, we may assign each thread a row of small 2D blocks
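As an illustration of how the row-based schemes translate into thread code, here is a sketch in C for row-major square matrices (the function and parameter names are invented for the example). Each of the nthreads workers calls it with its own thread id:

/* Each thread computes its own rows of C = A * B under 1D cyclic partitioning:
 * thread tid owns rows tid, tid + nthreads, tid + 2*nthreads, ...
 * No synchronization is needed because no two threads write the same row of C. */
void matmul_1d_cyclic(const double *A, const double *B, double *C,
                      int n, int tid, int nthreads)
{
    for (int i = tid; i < n; i += nthreads)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

For 1D block partitioning the only change is the row range: thread tid would loop over the contiguous rows from tid*n/nthreads to (tid+1)*n/nthreads - 1.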
Which one is better?
- Locality is very important (memory hierarchy)
  - Increase opportunities for ILP
  - Increase computational intensity
- 1D cyclic: computing one row of C needs one row of A and the whole of B
  - If B is larger than the cache size – many cache misses
- 1D block: better computational intensity, but each thread still needs the whole of B
  - Same problem if B is large
  - To improve further – partition the row blocks – higher computational intensity
- 2D block: only needs one block row of A and one block column of B – high computational intensity
  - The blocks fit into the cache to minimize cache misses (see the blocked multiply sketch below)
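A sketch of the blocking idea applied to the multiply itself (assuming n is a multiple of the tile size BS; both names are invented for the example): three BS x BS tiles are reused from cache many times before being evicted, which is what raises the computational intensity.

#define BS 64   /* tile size; assumes three BS x BS tiles fit in cache together */

/* Cache-blocked (tiled) multiply, C += A * B, for row-major n x n matrices. */
void matmul_blocked(const double *A, const double *B, double *C, int n)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* Multiply one tile of A by one tile of B into one tile of C. */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}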
Matrix Multiplication
- There is no data dependency – embarrassingly parallel
- Linear speedup is achievable
- The focus should be on the sequential computation on each core to achieve high efficiency
  - e.g., cache effects, memory hierarchy, loop optimizations (see the loop-order sketch below)
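One concrete single-core loop optimization (an added sketch, not taken from the slides) is simply choosing a cache-friendly loop order: for row-major data the i-k-j order below walks B and C with unit stride and keeps one element of A in a register, unlike the textbook i-j-k order.

/* C += A * B with i-k-j loop order; A, B, C are row-major n x n matrices. */
void matmul_ikj(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];              /* reused across the inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j]; /* unit-stride accesses to B and C */
        }
}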
Parallel Scan

Gaussian Elimination with Partial Pivoting (GEPP)
- GEPP should use BLAS 3 whenever possible
  - BLAS 3 (matrix multiplication) is much faster (higher computational intensity) than BLAS 1 & 2
  - How? Blocking, i.e., blocked GEPP (next slide)
for ib = 1 to n-1 step b                ... process matrix b columns at a time
    end = ib + b - 1                    ... point to end of block of b columns
    apply BLAS2 version of GEPP to get A(ib:n, ib:end) = L' * U'
    ... let LL denote the strict lower triangular part of A(ib:end, ib:end) + I
    A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)
                                        ... update next b rows of U
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                        ... apply delayed updates with single matrix-multiply
                                        ... with inner dimension b
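To see why the last step is BLAS 3 work, here is a hedged C sketch of the delayed update A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n), written with 0-based indices for an n x n row-major array; in practice this single update would go to a tuned BLAS 3 routine (dgemm) rather than a hand-written loop.

/* Trailing-submatrix update of blocked GEPP; b = end - ib + 1 is the inner dimension. */
void trailing_update(double *A, int n, int ib, int end)
{
    int b = end - ib + 1;
    for (int i = end + 1; i < n; i++)
        for (int j = end + 1; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < b; k++)   /* inner dimension b: O(b) flops per element touched */
                s += A[i * n + (ib + k)] * A[(ib + k) * n + j];
            A[i * n + j] -= s;
        }
}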
Parallel Algorithm for GEPP
- In GEPP the active submatrix shrinks after every iteration, so load balancing needs to be considered seriously
- 1D block partitioning: assign one block of contiguous rows to each thread
  - As the active submatrix shrinks, more and more threads become idle
  - Load not balanced – not efficient
- 2D cyclic block partitioning: partition the matrix into small 2D blocks and assign the blocks to the threads in a cyclic manner
  - Even as the submatrix shrinks (up to the last iteration, i = n-1), the load is still balanced – most threads keep doing useful work (see the ownership sketch below)
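A tiny sketch of what cyclic ownership of the 2D blocks means (the p_r x p_c thread grid and the helper name are assumptions for illustration): because ownership wraps around in both dimensions, the shrinking trailing submatrix still touches nearly every thread.

/* Owner of the block at block coordinates (bi, bj) on a p_r x p_c thread grid. */
typedef struct { int row, col; } GridPos;

GridPos block_owner(int bi, int bj, int p_r, int p_c)
{
    GridPos owner = { bi % p_r, bj % p_c };   /* cyclic in both dimensions */
    return owner;
}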
- There are two types of load balancing
  - Static – when the load is predictable, the work can be partitioned and assigned in advance, and the assignment will not change during the computation
  - Dynamic – when the amount of work changes and is unpredictable, we have to dynamically balance the load during the computation
- Ideally, we want all the processors, or cores, to be busy doing useful work all the time and to complete their work at the same time
- Recall Amdahl's law – even a small amount of load imbalance can significantly affect the performance
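As a rough worked illustration (an added example, not a result from the lecture): if the most heavily loaded thread ends up with a fraction f of the total work, the parallel time is at least f times the sequential time, so the speedup is at most 1/f. With 16 threads, a busiest thread holding 1/8 of the work caps the speedup at 8, no matter how well the other 15 cores are utilized.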
- Dynamic load balancing is harder than static partitioning, but it is relatively easier on shared memory machines than on distributed memory machines – global data in shared memory are accessible by all threads
- On a shared memory machine, for example, we can maintain a global task pool accessible by all threads; threads retrieve tasks from the pool for execution (a sketch follows below)
- However, we must consider the cost of achieving good load balancing
  - Minimize the additional computational cost
  - Minimize synchronization on global data on a shared memory machine
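A minimal sketch of such a task pool, assuming Pthreads and tasks identified by an index (all names are invented for the example): the shared counter is the only global state, so the synchronization cost is one short critical section per task.

#include <pthread.h>

typedef struct {
    int next;               /* index of the next unclaimed task (shared) */
    int ntasks;             /* total number of tasks */
    pthread_mutex_t lock;   /* protects next */
} TaskPool;

/* Returns the next task index, or -1 when the pool is empty. */
int pool_get(TaskPool *pool)
{
    int t = -1;
    pthread_mutex_lock(&pool->lock);    /* synchronization on shared data */
    if (pool->next < pool->ntasks)
        t = pool->next++;
    pthread_mutex_unlock(&pool->lock);
    return t;
}

Each worker thread then loops while ((t = pool_get(&pool)) != -1) { do_task(t); }, with do_task standing in for the real work and the pool fields initialized before the threads start; an atomic fetch-and-add could replace the mutex to cut the synchronization cost further.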
Summary
- Task/data partitioning & assignment
  - Static: the workload is predictable; partitioning and assignment are done in advance and do not change during the computation – today's examples
  - Dynamic: the workload changes during the computation and is unpredictable; dynamic partitioning algorithms are harder
- Thread coordination is explicitly done by synchronization on shared variables
- Load balancing – very important
  - Amdahl's law: even a small amount of load imbalance may significantly affect the overall performance
- Data locality – very important
  - Need to seriously consider how to optimize the performance on a single processor, or core
  - Increase opportunities for ILP
  - Increase computational intensity (memory hierarchy)
- Three examples
  - Matrix multiplication – embarrassingly parallel
  - Scan (or prefix) operation – increased workload for parallel computation
  - Gaussian elimination with partial pivoting – harder to balance the load
- Basic Linear Algebra Subroutines (BLAS)
  - BLAS 1: vector operations
  - BLAS 2: matrix-vector operations
  - BLAS 3: matrix multiplication
  - BLAS 3 is faster because of its high computational intensity and should be used whenever possible
- Through these examples we also introduced data partitioning strategies widely used with regular data structures: 1D cyclic, 1D block, 2D block and 2D cyclic block
  - The blocking technique achieves better locality