Performing Reductions in OpenCL
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Computer Graphics
opencl.reduction.pptx
mjb – June 14, 2021
Recall the OpenCL Memory Model

[Figure: the OpenCL memory hierarchy — Global Memory and Constant Memory are visible to all work-groups; each work-group has its own Local Memory, shared by the work-items in that work-group; each work-item has its own Private Memory.]
Here’s the Problem We are Trying to Solve
Like the first.cpp demo program, we are element-by-element multiplying two arrays. Unlike the first demo program, we then want to add up all the products and return the sum.

numItems = 8;

A * B → prods        Σ prods → C

After the array multiplication, we want each work-group to sum the products within that work-group, then return those partial sums to the host in an array for final summing.

To do this, we will not put the products into a large global device array, but into a prods[ ] array that is local to (shared within) its work-group.
numItems = 8;

Reduction Takes Place in a Single Work-Group

If we had 8 work-items in a work-group, we would like the threads in each work-group to execute the following instructions . . .

Step 1:
Thread #0:  prods[ 0 ] += prods[ 1 ];
Thread #2:  prods[ 2 ] += prods[ 3 ];
Thread #4:  prods[ 4 ] += prods[ 5 ];
Thread #6:  prods[ 6 ] += prods[ 7 ];

Step 2:
Thread #0:  prods[ 0 ] += prods[ 2 ];
Thread #4:  prods[ 4 ] += prods[ 6 ];

Step 3:
Thread #0:  prods[ 0 ] += prods[ 4 ];

. . . but in a more general way than writing them all out by hand.
Here’s What You Would Change in your Host Program

float *  hA;
float *  hB;
float *  hC;
size_t   abSize;
size_t   cSize;

numWorkGroups = NUM_ELEMENTS / LOCAL_SIZE;

hA = new float [ NUM_ELEMENTS ];
hB = new float [ NUM_ELEMENTS ];
hC = new float [ numWorkGroups ];

abSize = NUM_ELEMENTS  * sizeof(float);
cSize  = numWorkGroups * sizeof(float);

cl_mem dA = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dB = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dC = clCreateBuffer( context, CL_MEM_WRITE_ONLY, cSize,  NULL, &status );

status = clEnqueueWriteBuffer( cmdQueue, dA, CL_FALSE, 0, abSize, hA, 0, NULL, NULL );
status = clEnqueueWriteBuffer( cmdQueue, dB, CL_FALSE, 0, abSize, hB, 0, NULL, NULL );

cl_kernel kernel = clCreateKernel( program, "ArrayMultReduce", &status );

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );   // local "prods" array is dimensioned the size of each work-group
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

A * B → prods        Σ prods → C

The NULL in argument #2 is how you tell OpenCL that this is a local (shared) array, not a global array.
The Arguments to the Kernel

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );   // local "prods" array – one element per work-item
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

kernel void
ArrayMultReduce( global const float *dA, global const float *dB, local float *prods, global float *dC )
{
	int gid      = get_global_id( 0 );    // 0 .. total_array_size-1
	int numItems = get_local_size( 0 );   // # work-items per work-group
	int tnum     = get_local_id( 0 );     // thread (i.e., work-item) number in this work-group: 0 .. numItems-1
	int wgNum    = get_group_id( 0 );     // which work-group number this is

	prods[ tnum ] = dA[ gid ] * dB[ gid ];   // multiply the two arrays together

	// now add them up – come up with one sum per work-group
	// it is a big performance benefit to do it here while "prods" is still available – and is local
	// it would be a performance hit to pass "prods" back to the host, then bring it back to the device for reduction

A * B → prods
Reduction Takes Place Within a Single Work-Group
Each work-item is run by a single thread

offset = 1; mask = 1:
Thread #0:  prods[ 0 ] += prods[ 1 ];
Thread #2:  prods[ 2 ] += prods[ 3 ];
Thread #4:  prods[ 4 ] += prods[ 5 ];
Thread #6:  prods[ 6 ] += prods[ 7 ];

offset = 2; mask = 3:
Thread #0:  prods[ 0 ] += prods[ 2 ];
Thread #4:  prods[ 4 ] += prods[ 6 ];

offset = 4; mask = 7:
Thread #0:  prods[ 0 ] += prods[ 4 ];

A work-group consisting of numItems work-items can be reduced to a sum in log2(numItems) steps. In this example, numItems = 8.

The reduction begins with the individual products in prods[0] .. prods[7].

The final sum will end up in prods[0], which will then be copied into dC[wgNum].
Remember Truth Tables?
A Review of Bitmasks

[Figure: the AND truth table — first with logical values, then with single bits, then with multiple bits. And’ing a value against a mask keeps only the bits where the mask is 1; a result of 0 means none of the masked bits were set.]

numItems = 8;
Reduction Takes Place in a Single Work-Group
Each work-item is run by a single thread

kernel void
ArrayMultReduce( … )
{
	int gid      = get_global_id( 0 );
	int numItems = get_local_size( 0 );
	int tnum     = get_local_id( 0 );    // thread number
	int wgNum    = get_group_id( 0 );    // work-group number

	prods[ tnum ] = dA[ gid ] * dB[ gid ];

	// all threads execute this code simultaneously:
	for( int offset = 1; offset < numItems; offset *= 2 )
	{
		int mask = 2*offset - 1;
		barrier( CLK_LOCAL_MEM_FENCE );   // wait for all threads to get here
		if( ( tnum & mask ) == 0 )        // bit-by-bit and’ing tells us which threads need to do work now
		{
			prods[ tnum ] += prods[ tnum + offset ];
		}
	}

	barrier( CLK_LOCAL_MEM_FENCE );
	if( tnum == 0 )
		dC[ wgNum ] = prods[ 0 ];
}
Σ prods → C
And, Finally, in your Host Program

Wait( cmdQueue );
double time0 = omp_get_wtime( );

status = clEnqueueNDRangeKernel( cmdQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL );
PrintCLError( status, "clEnqueueNDRangeKernel failed: " );

Wait( cmdQueue );
double time1 = omp_get_wtime( );

status = clEnqueueReadBuffer( cmdQueue, dC, CL_TRUE, 0, numWorkGroups*sizeof(float), hC, 0, NULL, NULL );
PrintCLError( status, "clEnqueueReadBuffer failed: " );
Wait( cmdQueue );

float sum = 0.;
for( int i = 0; i < numWorkGroups; i++ )
{
	sum += hC[ i ];
}
Reduction Performance

[Figure: performance graph — GigaNumbers multiplied and reduced per second vs. array size in MegaNumbers (0 to 300), with work-group size = 32.]