PowerPoint Presentation
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk
Room SLB1004, SLB
Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk
Room INB2221, INB
Week W/C Lecture Workshop
1 23/01 Introduction –
2 30/01 Architectures
Tutorial-1
3 06/02 Patterns 1
4 13/02 Patterns 2
Tutorial-2
5 20/02 Patterns 3
6 27/02 Patterns 4
Tutorial-3
7 06/03 Communication & Synchronisation
8 13/03 Algorithms 1
Assessment
support
9 20/03 Algorithms 2
10 27/03 Algorithms 3
11 03/04 Performance & Optimisation
Tutorial-4 &
Demonstration
12 24/04 Parallel Libraries
13 01/05 –
Parallel Scan
Work Efficient – Hills Steele
Span Efficient – Blelloch
Transactions (credit/debit)
Input
Balance
Output
20 20
Transactions (credit/debit)
Input
Balance
Output
20 20
5 25
Transactions (credit/debit)
Input
Balance
Output
20 20
5 25
-3 22
7 29
———————————-
A = {1,2,5,4,9,7,0,1}
for i=1:N-1
A[i] += A[i-1]
———————————-
Add all elements and keep the partial results
1 4 9 7 0 12 5
1
3
3 8
8
12
12
31
31
30
30
30
30
23
23
Scan computes all partial reductions of a collection;
every element of output array is a reduction of all the
elements of input array up to position of that output
element
Combiner takes as input
the output of the previous
loop iteration as well as
a new input value
Work – O(n) operations,
Span – O(n) steps
Serial Implementation
y1 = x1
y2 = f(y1,x2)
y3 = f(y2,x3)
…
yN = f(yN-1,xN)
Inclusive
includes current element
in partial
reduction
Exclusive
excludes current element
in partial
reduction
1 2 3 4 5 6 7 8
1 3 6 10 15 21 28 36
1:2 2:3 3:4 4:5 5:6 6:7 7:8
1:3 1:4 2:5 3:6 4:7 5:8
1:5 1:6 1:7 1:8
addition
operation
Parallel implementation: form a reduction tree for every output
element, and then merge the redundant elements of all those trees
Example: cumulative sum
for d = 0 to log2(N) – 1 do
s = 2d
for all i in parallel do
if i ≥ s then
x[i] += x[i–s]
step
stride
d = 0
d = 1
d = 2
2d = 1
2d = 2
2d = 4
step stride
The combiner operation is called O(n log2 n) times and O(log2 n)
steps are required to complete
Requires more work than serial version
Span-efficient (low latency)
Performs badly on large arrays due to its work-inefficiency
If combiner function is associative f(a,f(b,c)) = f(f(a,b),c), we can
reorder operations to reduce span
Two-phase reordering performing an “up-sweep” and a “down-
sweep”
Up-sweep: Reduce on input (reduce phase)
Down-sweep: produces the intermediate results
up-sweep – compute reduction
down-sweep – compute intermediate values
Up-sweep phase
Down-sweep phase
Reduction steps: keep all
partial results
Traverse the tree from
leaves to root
Reduce and move
procedure
traverse the tree back
from root to leaves
reduce value a stride away
move the original value to
an element stride away
for d = 0 to log2(N)–1 do
s = 2d
for all i = N-1 down to 0 by s*2
in parallel do
x[i] += x[i-s]
x[N–1] 0 //exclusive scan
for d = log2(N)–1 down to 0 do
s = 2d
for all i = N-1 down to 0 by s*2
in parallel do
t = x[i]
x[i] += x[i-s]
x[i-s] = t
Up-sweep phase
Down-sweep phase
step
stride
reduce
move
Requires 2 times more steps than HS scan
Requires roughly 3 times more operations than the serial scan
(~2N reductions + N moves)
Span of O(2log2n) steps and Work of O(n) operations
1 2 7 2 4 34 0
1 2 7 2 4 34 0
f=max(a,b)
1 2 7 2 4 34 0
1 2 7 2 4 34 0
1 1
4
4
4
44
4
4
4
44
4
2
4
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
4
777
f=max(a,b)
more Work
than Processors
more Processors
than Work
use Work
efficient
use Span
efficient
Serial
Work: O(n)
Span: O(n)
Hillis/Steele
Work: O(n log2n)
Span: O(log2n)
Blelloch
Work: O(n)
Span: O(2 log2n)
512 elements &
512 processors
1M elements &
512 processors
128K elements &
1 processor
The presented algorithms only work for a single block/work group
Two stage procedure to accommodate large arrays (Blelloch, 1991)
Assign Boolean keys to all elements in the collection; false/0 if an
element needs to be filtered out
Perform the exclusive scan on keys: result is an array with output
indices – very efficient scatter!
0 1 1 0 0 1 1 1
scan 0 0 1 2 2 2 3 4 output indices
key mask
Structured parallel programming:
patterns for efficient computation
• Chapter 5
Papers
• W. D. Hillis and G. L. Steele, Jr.. Data
parallel algorithms. Commun. ACM 29,
12, 1986
• G.E. Blelloch. Prefix Sums and Their
Applications. In “Synthesis of Parallel
Algorithms”, Morgan Kaufmann, 1991
http://uenics.evansville.edu/~mr56/ece757/DataParallelAlgorithms.pdf
http://uenics.evansville.edu/~mr56/ece757/DataParallelAlgorithms.pdf
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf