PowerPoint Presentation
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk
Room SLB1004, SLB
Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk
Room INB2221, INB
Week W/C Lecture Workshop
1 23/01 Introduction –
2 30/01 Architectures
Tutorial-1
3 06/02 Patterns 1
4 13/02 Patterns 2
Tutorial-2
5 20/02 Patterns 3
6 27/02 Patterns 4
Tutorial-3
7 06/03 Communication & Synchronisation
8 13/03 Algorithms 1
Assessment
support
9 20/03 Algorithms 2
10 27/03 Algorithms 3
11 03/04 Performance & Optimisation
Tutorial-4 &
Demonstration
12 24/04 Parallel Libraries
13 01/05 –
Parallel Reduction
Implementation
Optimisation and Scalability
Calculate the total sum of all vector elements
The sum value at i depends on
the sum value at i-1!
Data dependency – how to parallelise?
———————————-
vector
int sum = 0;
for (uint i = 0; i < A.size(); i++)
sum += A[i];
----------------------------------
Calculate partial pairwise sums – independent operations!
Sum partial sums from step 1
Sum partial sums from step 2
1+2 5+4 9+7 0+1
3 9 16 1
3+9 16+1
12 17
12+17
29
binary tree
re-ordering
Each of these loops/steps can be parallelised (separately)
Step i depends on the results from step i-1: steps have to be
executed sequentially
----------------------------------
vector
for (uint i = 0; i < A.size(); i+=2) A[i] += A[i+1]; for (uint i = 0; i < A.size(); i+=4) A[i] += A[i+2]; for (uint i = 0; i < A.size(); i+=8) A[i] += A[i+4]; ---------------------------------- 1 2 5 4 9 7 0 1 3 2 9 4 16 7 1 1 12 2 9 4 17 7 1 1 29 2 9 4 17 7 1 1 A st e p 1 st e p 2 st e p 3 stride stride stride stride Reduce is a combiner function f(a,b) used to combine all the elements of a collection pairwise and create a summary value Pairs of elements can be combined in any order - combiner function should be associative to be parallelisable Example combiner functions: addition, multiplication, max/min Combiner Function f(a,b) = a b all-to-one mapping a b c d (a+b) + (c+d) a b c d ((a+b) + c) + d Order in which operations are performed does not matter if sequence of operands is unchanged associative saturating addition on signed integers (range -128 and 127) S1 = ( ( (120 77) -90 ) -50 ) ( ( 127 -90 ) -50 ) ( 370 -50 ) -13 S2 = ( (120 77) (-90 -50) ) (127 -128 ) -1 not associative A B C (A + B) - C A + (B - C) 8-bit Unsigned Integer , Range: 0-255 Complexity measures (for N elements) • Work – a total number of operations • Span – a total number of sequential steps Serial implementation • Work requires N-1 operations: O(n) • Span requires N-1 steps: O(n) Parallel implementation • Work requires N-1 operations: O(n) • Each additional step allows for 2steps-1 operations. N elements require log2(N) steps. • Span complexity: O(log n) Use work items to perform partial sums. In each step, skip ids depending on the stride size Naïve implementation ---------------------------------------- __kernel reduce_add_1(__global int* A) { int id = get_global_id(0); if ((id % 2) == 0) //step 1 A[id] += A[id+1]; if ((id % 4) == 0) //step 2 A[id] += A[id+2]; if ((id % 8) == 0) //step 3 A[id] += A[id+4]; if ((id % 16) == 0) //step 4 A[id] += A[id+8]; } ---------------------------------------- Individual work items take different time to execute • e.g. in step 1: all threads check the if condition, even threads need to read/write to memory, odd threads just carry on • there are also small differences in time execution for even threads • potential for race conditions! if ((id % 2) == 0) //step 1 A[id] += A[id+1]; if ((id % 4) == 0) //step 2 A[id] += A[id+2]; 0 1 2 3id thread 0 finished step 1 earlier than thread 2 and continues directly to step 2 tim e ---------------------------------------- __kernel reduce_add_1(__global int* A) { int id = get_global_id(0); if ((id % 2) == 0) A[id] += A[id+1]; //wait for all threads to finish step 1 barrier(CLK_GLOBAL_MEM_FENCE); if ((id % 4) == 0) A[id] += A[id+2]; //wait for all threads to finish step 2 barrier(CLK_GLOBAL_MEM_FENCE); if ((id % 8) == 0) A[id] += A[id+4]; Synchronisation! (more in the following lectures) Automatic step calculation ---------------------------------------- __kernel reduce_add_2(__global int* A) { int id = get_global_id(0); int N = get_global_size(0); for (int i = 1; i < N; i*=2) { //i is the stride if ((id % (i*2) == 0) A[id] += A[id+i]; barrier(CLK_GLOBAL_MEM_FENCE); } } ---------------------------------------- • The hardware resources (simultaneous work items, local memory) are limited • Large problems need to be broken down into smaller chunks • Tiling - breaking into chunks/tiles/blocks, operate on each tile separately, and then combine results … … nr_groups = ? nr_passes = ? • N: number of elements • M: group size (elements per group) How to synchronise groups after each reduction step? • separate kernel launches, looping through steps implemented in the host code • Tiling requires different work item indexing as the number of elements to be reduced within a group is M (group size) • Local id is a unique index within each group (0 to M-1) 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 local id global id 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 get_global_id() get_local_id() Dot Product 2 vectors of same length Map (*) to multiply the components Reduce with (+) to get final answer Combining Map and Reduce results in straightforward and efficient implementation Map for filtering and sorting, Reduce for counting Used by Google for webpage indexing (not any more) Interleaved addressing Sequential addressing • Modern hardware optimises common accesses to memory (global/local) • Sequential addressing groups memory accesses leading to better performance • Use of local (fast) memory • more in future lectures Reduce using f=max(a,b) Structured parallel programming: patterns for efficient computation • Section 5.1-5.3, Reduction with Applications M. Harris, Optimizing Parallel Reduction in CUDA http://developer.download.nvidia.com/compute/cuda/1.1- Beta/x86_website/projects/reduction/doc/reduction.pdf http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf