This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Computer Graphics
Performing Reductions in OpenCL
Mike Bailey
opencl.reduction.pptx
mjb – June 14, 2021
Recall the OpenCL Memory Model

[Figure: the OpenCL memory hierarchy. Global Memory and Constant Memory are visible to all work-groups; each work-group has its own Local (shared) Memory; each work-item within a work-group has its own Private Memory.]
Here's the Problem We are Trying to Solve

Like the first.cpp demo program, we are piecewise multiplying two arrays. Unlike the first demo program, we want to then add up all the products and return the sum.

After the array multiplication, we want each work-group to sum the products within that work-group, then return those per-work-group sums to the host in an array for final summing.

To do this, we will not put the products into a large global device array, but into a prods[ ] array that is shared within its work-group.

A * B → prods        Σ prods → C

(In the pictured example, numItems = 8.)
Reduction Takes Place in a Single Work-Group

If we had 8 work-items in a work-group, we would like the threads in each work-group to execute the following instructions . . .
First pass:
	Thread #0:   prods[ 0 ] += prods[ 1 ];
	Thread #2:   prods[ 2 ] += prods[ 3 ];
	Thread #4:   prods[ 4 ] += prods[ 5 ];
	Thread #6:   prods[ 6 ] += prods[ 7 ];

Second pass:
	Thread #0:   prods[ 0 ] += prods[ 2 ];
	Thread #4:   prods[ 4 ] += prods[ 6 ];

Third pass:
	Thread #0:   prods[ 0 ] += prods[ 4 ];
. . . but in a more general way than writing them all out by hand.
Here's What You Would Change in your Host Program

float * hA;
float * hB;
float * hC;
size_t abSize;
size_t cSize;

numWorkGroups = NUM_ELEMENTS / LOCAL_SIZE;

hA = new float [ NUM_ELEMENTS ];
hB = new float [ NUM_ELEMENTS ];
hC = new float [ numWorkGroups ];

abSize = NUM_ELEMENTS * sizeof(float);
cSize  = numWorkGroups * sizeof(float);

cl_mem dA = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dB = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dC = clCreateBuffer( context, CL_MEM_WRITE_ONLY, cSize,  NULL, &status );

status = clEnqueueWriteBuffer( cmdQueue, dA, CL_FALSE, 0, abSize, hA, 0, NULL, NULL );
status = clEnqueueWriteBuffer( cmdQueue, dB, CL_FALSE, 0, abSize, hB, 0, NULL, NULL );

cl_kernel kernel = clCreateKernel( program, "ArrayMultReduce", &status );

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );	// local "prods" array is dimensioned the size of each work-group
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

A * B → prods        Σ prods → C

This NULL is how you tell OpenCL that this is a local (shared) array, not a global array.
The Arguments to the Kernel

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );	// local "prods" array – one per work-item
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

kernel void
ArrayMultReduce( global const float *dA, global const float *dB, local float *prods, global float *dC )
{
	int gid      = get_global_id( 0 );	// 0 .. total_array_size-1
	int numItems = get_local_size( 0 );	// # work-items per work-group
	int tnum     = get_local_id( 0 );	// thread (i.e., work-item) number in this work-group: 0 .. numItems-1
	int wgNum    = get_group_id( 0 );	// which work-group number this is

	prods[ tnum ] = dA[ gid ] * dB[ gid ];	// multiply the two arrays together

	// now add them up – come up with one sum per work-group
	// it is a big performance benefit to do it here while "prods" is still available – and is local
	// it would be a performance hit to pass "prods" back to the host and then bring it back to the device for reduction
	. . .
}

A * B → prods
Reduction Takes Place Within a Single Work-Group
Each work-item is run by a single thread.

offset = 1; mask = 1:
	Thread #0:   prods[ 0 ] += prods[ 1 ];
	Thread #2:   prods[ 2 ] += prods[ 3 ];
	Thread #4:   prods[ 4 ] += prods[ 5 ];
	Thread #6:   prods[ 6 ] += prods[ 7 ];

offset = 2; mask = 3:
	Thread #0:   prods[ 0 ] += prods[ 2 ];
	Thread #4:   prods[ 4 ] += prods[ 6 ];

offset = 4; mask = 7:
	Thread #0:   prods[ 0 ] += prods[ 4 ];

A work-group consisting of numItems work-items can be reduced to a sum in log2(numItems) steps. In this example, numItems = 8.
The reduction begins with the individual products in prods[0] .. prods[7].
The final sum will end up in prods[0], which will then be copied into dC[wgNum].
A Review of Bitmasks

Remember Truth Tables?
	F & F = F
	F & T = F
	T & F = F
	T & T = T

Or, with Bits:
	0 & 0 = 0
	0 & 1 = 0
	1 & 0 = 0
	1 & 1 = 1

Or, with Multiple Bits:
	000 & 011 = 000
	001 & 011 = 001
	010 & 011 = 010
	011 & 011 = 011
	100 & 011 = 000
	101 & 011 = 001
Reduction Takes Place in a Single Work-Group
Each work-item is run by a single thread. Here numItems = 8; the and'ing of bits decides which threads work on each pass:

offset = 1; mask = 1:   Threads #0, #2, #4, #6 each add in their right-hand neighbor.
offset = 2; mask = 3:   Threads #0, #4 each add in the element 2 away.
offset = 4; mask = 7:   Thread #0 adds in the element 4 away.

kernel void
ArrayMultReduce( … )
{
	int gid      = get_global_id( 0 );
	int numItems = get_local_size( 0 );
	int tnum     = get_local_id( 0 );	// thread number
	int wgNum    = get_group_id( 0 );	// work-group number

	prods[ tnum ] = dA[ gid ] * dB[ gid ];

	// all threads execute this code simultaneously:
	for( int offset = 1; offset < numItems; offset *= 2 )
	{
		int mask = 2*offset - 1;
		barrier( CLK_LOCAL_MEM_FENCE );		// wait for all threads to get here
		if( ( tnum & mask ) == 0 )		// bit-by-bit and'ing tells us which threads need to do work now
			prods[ tnum ] += prods[ tnum + offset ];
	}

	barrier( CLK_LOCAL_MEM_FENCE );
	if( tnum == 0 )
		dC[ wgNum ] = prods[ 0 ];
}

Σ prods → C
And, Finally, in your Host Program

Wait( cmdQueue );
double time0 = omp_get_wtime( );

status = clEnqueueNDRangeKernel( cmdQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL );
PrintCLError( status, "clEnqueueNDRangeKernel failed: " );

Wait( cmdQueue );
double time1 = omp_get_wtime( );

status = clEnqueueReadBuffer( cmdQueue, dC, CL_TRUE, 0, numWorkGroups*sizeof(float), hC, 0, NULL, NULL );
PrintCLError( status, "clEnqueueReadBuffer failed: " );
Wait( cmdQueue );

float sum = 0.;
for( int i = 0; i < numWorkGroups; i++ )
{
	sum += hC[ i ];
}
Reduction Performance

[Figure: GigaNumbers multiplied and reduced per second vs. array size in MegaNumbers (0–300), work-group size = 32.]