
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk

Room SLB1004, SLB

Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk

Room INB2221, INB

Week  W/C    Lecture                            Workshop
1     23/01  Introduction                       –
2     30/01  Architectures                      Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                         Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                         Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                       Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation         Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –

Parallel Scan

Span Efficient – Hillis/Steele

Work Efficient – Blelloch

Example: keeping a running account balance. Each transaction (credit/debit) is an input element; the balance after that transaction is the corresponding output element.

Transactions (input)   Balance (output)
20                     20
 5                     25
-3                     22
 7                     29

----------------------------------
A = {1, 2, 5, 4, 9, 7, 0, 1}

for i = 1 to N-1
    A[i] += A[i-1]
----------------------------------
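The same loop as compilable C++ (a minimal sketch; the array contents and the loop mirror the pseudocode above):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Same input as the pseudocode: A = {1,2,5,4,9,7,0,1}
    std::vector<int> A = {1, 2, 5, 4, 9, 7, 0, 1};

    // In-place inclusive scan: A[i] becomes the sum of A[0..i]
    for (std::size_t i = 1; i < A.size(); ++i)
        A[i] += A[i - 1];

    for (int v : A) std::cout << v << ' ';   // 1 3 8 12 21 28 28 29
    std::cout << '\n';
}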

Add all the elements and keep the partial results.

[Figure: animation building up the array of partial sums one element at a time]

Scan computes all partial reductions of a collection: every element of the output array is a reduction of all the elements of the input array up to the position of that output element.

The combiner takes as input the output of the previous loop iteration as well as a new input value.

Work – O(n) operations, Span – O(n) steps

Serial Implementation

y1 = x1
y2 = f(y1, x2)
y3 = f(y2, x3)
...
yN = f(yN-1, xN)

Inclusive – includes the current element in the partial reduction

Exclusive – excludes the current element from the partial reduction
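As an illustration (not part of the original slides), C++17 already ships both flavours in <numeric>; a minimal sketch:

#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> x = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<int> inc(x.size()), exc(x.size());

    // Inclusive: element i is the reduction of x[0..i]
    std::inclusive_scan(x.begin(), x.end(), inc.begin());      // 1 3 6 10 15 21 28 36

    // Exclusive: element i is the reduction of x[0..i-1]; needs an identity value
    std::exclusive_scan(x.begin(), x.end(), exc.begin(), 0);   // 0 1 3 6 10 15 21 28

    for (int v : inc) std::cout << v << ' ';
    std::cout << '\n';
    for (int v : exc) std::cout << v << ' ';
    std::cout << '\n';
}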

Input:           1    2    3    4    5    6    7    8

After stride 1:  1   1:2  2:3  3:4  4:5  5:6  6:7  7:8
After stride 2:  1   1:2  1:3  1:4  2:5  3:6  4:7  5:8
After stride 4:  1   1:2  1:3  1:4  1:5  1:6  1:7  1:8

Output:          1    3    6   10   15   21   28   36

(addition as the combiner; i:j denotes the reduction of input elements i..j)

Parallel implementation: form a reduction tree for every output
element, and then merge the redundant elements of all those trees

Example: cumulative sum

for d = 0 to log2(N) - 1 do        // step
    s = 2^d                        // stride
    for all i in parallel do
        if i ≥ s then
            x[i] += x[i-s]
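A serial C++ sketch that simulates the parallel steps of the Hillis-Steele scan (the per-step copy stands in for all work items reading before any of them writes; names are illustrative):

#include <cstddef>
#include <iostream>
#include <vector>

// Serial simulation of the Hillis-Steele (span-efficient) inclusive scan.
// Each iteration of the outer loop corresponds to one parallel step.
std::vector<int> hillis_steele_scan(std::vector<int> x) {
    for (std::size_t s = 1; s < x.size(); s *= 2) {   // stride s = 2^d
        std::vector<int> prev = x;                    // snapshot of the previous step
        for (std::size_t i = s; i < x.size(); ++i)    // "for all i in parallel"
            x[i] = prev[i] + prev[i - s];
    }
    return x;
}

int main() {
    std::vector<int> in = {1, 2, 3, 4, 5, 6, 7, 8};
    for (int v : hillis_steele_scan(in)) std::cout << v << ' ';  // 1 3 6 10 15 21 28 36
    std::cout << '\n';
}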

step d = 0 → stride 2^0 = 1
step d = 1 → stride 2^1 = 2
step d = 2 → stride 2^2 = 4

The combiner operation is called O(n log2 n) times and O(log2 n) steps are required to complete.

Requires more work than the serial version

Span-efficient (low latency)

Performs badly on large arrays due to its work-inefficiency

If the combiner function is associative, f(a, f(b,c)) = f(f(a,b), c), we can reorder the operations to reduce the span

Two-phase reordering performing an “up-sweep” and a “down-
sweep”

Up-sweep: Reduce on input (reduce phase)

Down-sweep: produces the intermediate results

up-sweep – compute reduction

down-sweep – compute intermediate values

Up-sweep phase

Reduction steps: keep all the partial results

Traverse the tree from leaves to root

Down-sweep phase

Reduce-and-move procedure:

traverse the tree back from root to leaves

reduce the value a stride away

move the original value to the element a stride away

Up-sweep:

for d = 0 to log2(N)-1 do
    s = 2^d
    for all i = N-1 down to 0 by s*2 in parallel do
        x[i] += x[i-s]

Down-sweep:

x[N-1] ← 0                         // exclusive scan
for d = log2(N)-1 down to 0 do
    s = 2^d
    for all i = N-1 down to 0 by s*2 in parallel do
        t = x[i]
        x[i] += x[i-s]
        x[i-s] = t
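A serial C++ sketch of the same up-sweep/down-sweep exclusive scan (assumes the input length is a power of two, as on the slides; names are illustrative):

#include <cstddef>
#include <iostream>
#include <vector>

// Serial simulation of the Blelloch (work-efficient) exclusive scan.
std::vector<int> blelloch_scan(std::vector<int> x) {
    const std::size_t n = x.size();

    // Up-sweep (reduce) phase: build partial sums along a balanced tree.
    for (std::size_t s = 1; s < n; s *= 2)
        for (std::size_t i = 2 * s - 1; i < n; i += 2 * s)
            x[i] += x[i - s];

    // Clear the root for an exclusive scan.
    x[n - 1] = 0;

    // Down-sweep phase: push partial results back down, reducing and moving.
    for (std::size_t s = n / 2; s >= 1; s /= 2)
        for (std::size_t i = 2 * s - 1; i < n; i += 2 * s) {
            int t = x[i];        // keep the original value
            x[i] += x[i - s];    // reduce the value a stride away
            x[i - s] = t;        // move the original value a stride away
        }
    return x;
}

int main() {
    std::vector<int> in = {1, 2, 3, 4, 5, 6, 7, 8};
    for (int v : blelloch_scan(in)) std::cout << v << ' ';  // 0 1 3 6 10 15 21 28
    std::cout << '\n';
}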

[Figure: up-sweep and down-sweep phases on the tree, annotated with step, stride, and the reduce and move operations]

Requires 2 times more steps than the Hillis-Steele scan

Requires roughly 3 times more operations than the serial scan (~2N reductions + N moves)

Span of O(2 log2 n) steps and Work of O(n) operations

[Figure: scan with the combiner f = max(a, b) on the input 1 2 7 2 4 3 4 0; every output element is the running maximum of the input, so the result is 1 2 7 7 7 7 7 7]
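A sketch of the same idea with the standard library (the input values are taken from the figure above; any associative combiner works, not just addition):

#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Scan with f = max(a, b): the output is the running maximum of the input.
    std::vector<int> x = {1, 2, 7, 2, 4, 3, 4, 0};
    std::vector<int> out(x.size());

    std::inclusive_scan(x.begin(), x.end(), out.begin(),
                        [](int a, int b) { return std::max(a, b); });

    for (int v : out) std::cout << v << ' ';   // 1 2 7 7 7 7 7 7
    std::cout << '\n';
}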

More Work than Processors → use the Work-efficient scan

More Processors than Work → use the Span-efficient scan

Algorithm        Work           Span
Serial           O(n)           O(n)
Hillis/Steele    O(n log2 n)    O(log2 n)
Blelloch         O(n)           O(2 log2 n)

512 elements & 512 processors (more processors than work → span-efficient)

1M elements & 512 processors (more work than processors → work-efficient)

128K elements & 1 processor (serial)

The presented algorithms only work within a single block/work group

A two-stage procedure accommodates larger arrays (Blelloch, 1991)
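A hedged sketch of the two-stage idea in serial C++ (each block stands in for one work group; the block size and helper name are assumptions for illustration):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

// Two-stage inclusive scan for arrays larger than one block/work group:
//   1) scan every block independently (one work group each on a GPU),
//   2) exclusive-scan the per-block totals to get each block's offset,
//   3) add the offset back into every element of the block.
std::vector<int> two_stage_scan(const std::vector<int>& x, std::size_t block) {
    std::vector<int> out(x.size());
    std::vector<int> block_sums;

    // Stage 1: independent inclusive scan of every block.
    for (std::size_t start = 0; start < x.size(); start += block) {
        std::size_t end = std::min(start + block, x.size());
        int running = 0;
        for (std::size_t i = start; i < end; ++i)
            out[i] = running += x[i];
        block_sums.push_back(running);
    }

    // Stage 2: exclusive scan of the block totals.
    std::vector<int> offsets(block_sums.size());
    std::exclusive_scan(block_sums.begin(), block_sums.end(), offsets.begin(), 0);

    // Stage 3: add each block's offset to its elements.
    for (std::size_t start = 0, b = 0; start < x.size(); start += block, ++b)
        for (std::size_t i = start; i < std::min(start + block, x.size()); ++i)
            out[i] += offsets[b];

    return out;
}

int main() {
    std::vector<int> in = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    for (int v : two_stage_scan(in, 4)) std::cout << v << ' ';  // 1 3 6 10 15 21 28 36 45 55
    std::cout << '\n';
}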

Assign Boolean keys to all the elements in the collection: false/0 if an element needs to be filtered out

Perform an exclusive scan on the keys: the result is an array of output indices – a very efficient scatter!

key mask:        0 1 1 0 0 1 1 1
exclusive scan:  0 0 1 2 2 2 3 4   (output indices)
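A C++ sketch of stream compaction with this scan (the data values are made up for illustration; the key mask matches the example above):

#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Stream compaction with an exclusive scan: keep only the flagged elements.
    std::vector<int> data = {10, 11, 12, 13, 14, 15, 16, 17};
    std::vector<int> keep = { 0,  1,  1,  0,  0,  1,  1,  1};  // key mask from the slide

    // Exclusive scan of the mask gives each kept element its output index.
    std::vector<int> idx(keep.size());
    std::exclusive_scan(keep.begin(), keep.end(), idx.begin(), 0);  // 0 0 1 2 2 2 3 4

    // Total number of kept elements = last index + last key.
    std::vector<int> out(idx.back() + keep.back());

    // Scatter: every work item could perform its write independently.
    for (std::size_t i = 0; i < data.size(); ++i)
        if (keep[i]) out[idx[i]] = data[i];

    for (int v : out) std::cout << v << ' ';   // 11 12 15 16 17
    std::cout << '\n';
}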

Structured Parallel Programming: Patterns for Efficient Computation
• Chapter 5

Papers
• W. D. Hillis and G. L. Steele, Jr. Data Parallel Algorithms. Commun. ACM 29(12), 1986.
  http://uenics.evansville.edu/~mr56/ece757/DataParallelAlgorithms.pdf
• G. E. Blelloch. Prefix Sums and Their Applications. In "Synthesis of Parallel Algorithms", Morgan Kaufmann, 1991.
  https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf