XJCO3221 Parallel Computation
University of Leeds

Lecture 17: Synchronisation

Lecture outline:
Overview
Reduction within a single work group
Full reduction
Summary and next lecture

Previous lectures
Many of the previous lectures have mentioned parallel synchronisation in some form. However, there are many ways to synchronise:

Locks in shared memory systems [Lectures 6 and 7].
A synchronisation barrier at each level of a binary tree reduction [Lecture 11].
Blocking communication, which affords a form of synchronisation, in distributed memory systems [Lecture 12].
...

Also recall that GPUs have multiple memory types, some of which can be viewed as shared (__global), and some of which can be viewed as distributed (__local) [Lecture 16].

Today’s lecture
Today's lecture will look at synchronisation on a GPU:

How to synchronise within a work group.
How to synchronise between work groups.

We will also see how the SIMD cores can potentially reduce or improve performance:

Threads within a subgroup are automatically synchronised.
Threads performing different calculations can lead to divergence and reduced performance.

Reminder: Scalar product
As an example, we will use the scalar (dot) product of two n-vectors a and b, as in Lecture 11.
Written mathematically as (indexing starting from 1):

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$
In serial CPU code (indexing starting from 0):
float dot = 0.0f;
for( i=0; i<n; i++ ) dot += a[i]*b[i];

Note that the kernels below halve an integer stride each iteration, written with the bitwise right-shift operator ‘>>=’.
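Since the kernels that follow implement this loop as a binary tree reduction, it may help to see the same calculation written serially in that form. This is a minimal sketch, not the lecture's code, and it assumes n is a power of two:

// Serial sketch of the dot product as a binary tree reduction (assumes n is a power of two).
float *scratch = (float*) malloc( n*sizeof(float) );
for( int i=0; i<n; i++ ) scratch[i] = a[i]*b[i];        // Pairwise products.

for( int stride=n/2; stride>0; stride>>=1 )             // Halve the stride each pass with '>>='.
    for( int i=0; i<stride; i++ )
        scratch[i] += scratch[i+stride];                // Sum partial results.

float dot = scratch[0];                                 // Final result.
free( scratch );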

Kernel code
Code on Minerva: workGroupReduction.c, workGroupReduction.cl, helper.h
__kernel void reduceNoSync( __global float *device_a, __global float *device_b,
                            __global float *dot, __local float *scratch )
{
    int stride,
        id        = get_local_id (0),
        groupSize = get_local_size(0);        // = work group size.

    scratch[id] = device_a[id] * device_b[id];

    for( stride=groupSize/2; stride>0; stride>>=1 )
        if( id < stride ) scratch[id] += scratch[id+stride];

    if( id==0 ) *dot = scratch[0];
}

Calling C-code

// float array of size 1 on device.
cl_mem device_dot = clCreateBuffer(...);
...
// Set kernel arguments 0, 1 and 2.
clSetKernelArg( kernel, 3, N*sizeof(float), NULL );   // NULL => __local memory of given size.

// Add to the command queue.
size_t indexSpaceSize[1] = {N}, workGroupSize[1] = {N};
clEnqueueNDRangeKernel( queue, kernel, 1, NULL,
                        indexSpaceSize, workGroupSize, 0, NULL, NULL );

// Get the result back to host float 'dot'.
float dot;
clEnqueueReadBuffer( queue, device_dot, CL_TRUE, 0, sizeof(float),
                     &dot, 0, NULL, NULL );
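The slide elides the creation of the input buffers and the first three clSetKernelArg calls. A minimal sketch of what they might look like is below; it is not the Minerva code, and the names context, host_a and host_b are assumed to exist in the host program:

// Illustrative sketch only: create the input buffers and bind kernel
// arguments 0, 1 and 2 of reduceNoSync.
cl_int status;
cl_mem device_a = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  N*sizeof(float), host_a, &status );
cl_mem device_b = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  N*sizeof(float), host_b, &status );

clSetKernelArg( kernel, 0, sizeof(cl_mem), &device_a   );   // __global device_a
clSetKernelArg( kernel, 1, sizeof(cl_mem), &device_b   );   // __global device_b
clSetKernelArg( kernel, 2, sizeof(cl_mem), &device_dot );   // __global dot (created above)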

Barriers within work groups
Without synchronisation, this reduction is not guaranteed to work on all systems.
Recall that barriers are points in code that no processing unit can pass until all units have reached it [cf. Lecture 11]:

#pragma omp barrier in OpenMP.
MPI_Barrier() in MPI.

In OpenCL (1), a barrier within a work group is implemented as:

barrier(CLK_LOCAL_MEM_FENCE);

(1) In CUDA, __syncthreads() synchronises within a thread block (= work group).
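As a reminder of the OpenMP analogue mentioned above, the following minimal, self-contained program (not lecture code; compile with e.g. gcc -fopenmp) shows the same guarantee a work-group barrier gives: no thread proceeds past the barrier until all have reached it.

#include <stdio.h>
#include <omp.h>

int main( void )
{
    #pragma omp parallel
    {
        printf( "Thread %d: before the barrier.\n", omp_get_thread_num() );
        #pragma omp barrier
        printf( "Thread %d: after the barrier.\n",  omp_get_thread_num() );
    }
    return 0;
}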

Reduction with synchronisation
__kernel void reduceWithSync( ... )          // Same arguments as reduceNoSync.
{
    int id=..., groupSize=..., stride;       // As before.

    scratch[id] = device_a[id] * device_b[id];
    barrier(CLK_LOCAL_MEM_FENCE);            // Sync.

    for( stride=groupSize/2; stride>0; stride>>=1 )
    {
        if( id < stride ) scratch[id] += scratch[id+stride];
        barrier(CLK_LOCAL_MEM_FENCE);        // Sync.
    }

    if( id==0 ) *dot = scratch[0];
}

Reduction with barrier(CLK_LOCAL_MEM_FENCE)

[Figure: binary tree reduction of a0..a7 with b0..b7; a barrier(CLK_LOCAL_MEM_FENCE) follows the initial multiplication and each of the three summation levels.]

Full reduction

Problems larger than a single work group?

If we could synchronise between work groups, we could use the same method as before:

1. Make the device vectors and scratch __global.
2. Replace local barriers with global barriers.

However, no such global barrier exists (1). GPUs cannot synchronise between work groups/thread blocks (2).

(1) barrier(CLK_GLOBAL_MEM_FENCE) does exist, but refers to accesses to global memory; it still only synchronises within a work group.
(2) Some modern GPUs support cooperative groups that allow synchronisation across multiple thread blocks, e.g. CUDA 9.0 and later.

You might see claims that it is possible to synchronise globally on any GPU by constantly polling a global memory location, i.e. work items constantly read/write to synchronise.

This may work, but only for small problems. If there are too many work groups for the device, it queues them:

[Figure: work groups still queued versus those already completed on the device.]

If they are not all on the device at the same time, it is impossible to synchronise within one kernel using this method.

Solution: Multiple kernels

The solution is to break the kernel at the barrier point into multiple kernels called consecutively:

Original kernel:
    // Code before the barrier
    ...
    barrier(...)
    // Code after the barrier
    ...

Kernel 1:
    // Code before the barrier
    ...

Kernel 2:
    // Code after the barrier
    ...

This way kernel 1 completes before kernel 2 starts.

Reduction across work groups

It is possible to use this method for reduction (1):

1. Repeatedly call a kernel that reduces an array of partial sums, until the array is smaller than the maximum work group size.
2. Make a final kernel call to reduce these remaining partial sums.

It is simpler (although less efficient) to use the CPU:

1. Each work group inserts its partial sum into a global array.
2. The final summation is performed on the host.

This is conceptually similar to an MPI program performing final calculations on rank 0.

(1) Wilt, The CUDA Handbook (Addison-Wesley, 2013).
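As an illustration of the simpler CPU-based finish, the host side might look like the sketch below. It is not the Minerva code: the kernel is assumed to write one partial sum per work group into a __global array device_partial, and the names LOCAL_SIZE, numGroups and host_partial are invented for the example.

// Enqueue the reduction kernel: one partial sum per work group.
size_t globalSize[1] = { N }, localSize[1] = { LOCAL_SIZE };
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, globalSize, localSize, 0, NULL, NULL );

// Blocking read: by the time this returns, the kernel has completed.
int    numGroups    = N / LOCAL_SIZE;                  // Number of work groups.
float *host_partial = (float*) malloc( numGroups*sizeof(float) );
clEnqueueReadBuffer( queue, device_partial, CL_TRUE, 0,
                     numGroups*sizeof(float), host_partial, 0, NULL, NULL );

// Final summation on the host, analogous to rank 0 in an MPI reduction.
float dot = 0.0f;
for( int g=0; g<numGroups; g++ ) dot += host_partial[g];
free( host_partial );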
Subgroups (warp, wavefront, etc.)

Recall that GPUs are based on SIMD cores: each core contains multiple hardware threads that perform the same operation.

In OpenCL, the number of work items (i.e. threads) executing simultaneously on a single SIMD core is known as a subgroup. It is smaller than a work group, and the actual size is vendor specific. For example:

Nvidia call them warps, each of which has 32 threads.
AMD have 64-thread wavefronts.

The SIMD core applies the same operation to all items in the subgroup simultaneously. We say it advances in lockstep.

void kernel( ... )
{
    int id = get_global_id(0);
    float a, b, c;
    a = 4*array[id];
    c = b + a;
}

[Figure: the SIMD core issues each of these instructions to the whole subgroup at once.]

Reduction with a subgroup

For reduction, this means that once the problem has been reduced to the size of a subgroup, there is no longer any need for explicit synchronisation:

void reduce( ... )
{
    ...                                      // Start as before.

    // Split the loop into two.
    for( stride=group/2; stride>subgroup; stride>>=1 )
    {
        if( id<stride ) scratch[id] += scratch[id+stride];
        barrier(CLK_LOCAL_MEM_FENCE);        // Sync.
    }

    // Once within a single subgroup, lockstep execution means no barrier is needed.
    for( stride=subgroup; stride>0; stride>>=1 )
        if( id<stride ) scratch[id] += scratch[id+stride];

    ...                                      // Finish as before.
}