This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Recall the OpenCL Memory Model
[Figure: the OpenCL memory hierarchy – Global Memory and Constant Memory are visible to all work-groups; each work-group has its own Local (shared) Memory; each work-item within a work-group has its own Private Memory.]
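As a refresher, here is a minimal, hypothetical OpenCL C kernel (the kernel name and arguments are made up for illustration, not part of the course code) showing how each of those memory regions appears in kernel code through its address-space qualifier:

constant float Scale = 2.f;                        // constant memory: read-only, visible to every work-item

kernel void MemorySpacesDemo( global const float *dIn,     // global memory: visible to the whole device
                              local float *locArray,       // local memory: shared by the work-items in one work-group
                              global float *dOut )
{
    int gid  = get_global_id( 0 );
    int tnum = get_local_id( 0 );

    float tmp = Scale * dIn[ gid ];                // tmp lives in private memory: one copy per work-item

    locArray[ tnum ] = tmp;                        // visible to the other work-items in this work-group
    barrier( CLK_LOCAL_MEM_FENCE );

    dOut[ gid ] = locArray[ tnum ];                // back out to global memory
}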
Performing Reductions in OpenCL
Mike Bailey
opencl.reduction.pptx
Here’s the Problem We are Trying to Solve
Like the first.cpp demo program, we are piecewise multiplying two arrays. Unlike the first demo program, we want to then add up all the products and return the sum.
numItems = 8;
A * B → prods;   Σ prods → C
After the array multiplication, we want each work-group to sum the products within that work-group, then return those sums to the host in an array for final summing.
To do this, we will not put the products into a large global device array, but into a prods[ ] array that is shared within its work-group.
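For reference, the same computation in plain C on the host is just a multiply followed by a running sum (a sketch; hA and hB are the host arrays allocated later in these notes):

float sum = 0.;
for( int i = 0; i < NUM_ELEMENTS; i++ )
{
    float prod = hA[ i ] * hB[ i ];   // A * B → prods
    sum += prod;                      // Σ prods
}
// sum is what the OpenCL multiply-and-reduce should also produce (up to floating-point rounding order)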
Reduction Takes Place in a Single Work-Group
If we had 8 work-items in a work-group, we would like the threads in each work-group to execute the following instructions . . .
. . . but in a more general way than writing them all out by hand.
Pass 1:
    Thread #0:  prods[ 0 ] += prods[ 1 ];
    Thread #2:  prods[ 2 ] += prods[ 3 ];
    Thread #4:  prods[ 4 ] += prods[ 5 ];
    Thread #6:  prods[ 6 ] += prods[ 7 ];

Pass 2:
    Thread #0:  prods[ 0 ] += prods[ 2 ];
    Thread #4:  prods[ 4 ] += prods[ 6 ];

Pass 3:
    Thread #0:  prods[ 0 ] += prods[ 4 ];
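The pattern behind those passes is an offset that doubles each time. As a sketch (not taken from the slides), the same tree can be written serially in plain C:

// Serial emulation of the reduction tree (numItems = 8 here):
for( int offset = 1; offset < numItems; offset *= 2 )
{
    for( int i = 0; i < numItems; i += 2*offset )
    {
        prods[ i ] += prods[ i + offset ];    // offset = 1: prods[0] += prods[1], prods[2] += prods[3], . . .
    }
}
// the total ends up in prods[ 0 ]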
Here’s What You Would Change in your Host Program

A * B → prods;   Σ prods → C

size_t numWorkGroups = NUM_ELEMENTS / LOCAL_SIZE;

float * hA = new float [ NUM_ELEMENTS ];
float * hB = new float [ NUM_ELEMENTS ];
float * hC = new float [ numWorkGroups ];

size_t abSize = NUM_ELEMENTS * sizeof(float);
size_t cSize  = numWorkGroups * sizeof(float);

cl_mem dA = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dB = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dC = clCreateBuffer( context, CL_MEM_WRITE_ONLY, cSize,  NULL, &status );

status = clEnqueueWriteBuffer( cmdQueue, dA, CL_FALSE, 0, abSize, hA, 0, NULL, NULL );
status = clEnqueueWriteBuffer( cmdQueue, dB, CL_FALSE, 0, abSize, hB, 0, NULL, NULL );

cl_kernel kernel = clCreateKernel( program, "ArrayMultReduce", &status );

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );   // local "prods" array is dimensioned the size of each work-group
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

The NULL in the clSetKernelArg( ) call for argument #2 is how you tell OpenCL that this is a local (shared) array, not a global array.
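The kernel launch shown later uses globalWorkSize and localWorkSize. A minimal sketch of how they could be set up, assuming NUM_ELEMENTS is an exact multiple of LOCAL_SIZE (as it is here, since numWorkGroups = NUM_ELEMENTS / LOCAL_SIZE):

// One work-item per array element, LOCAL_SIZE work-items per work-group:
size_t globalWorkSize[ 1 ] = { NUM_ELEMENTS };
size_t localWorkSize[ 1 ]  = { LOCAL_SIZE };
// NUM_ELEMENTS / LOCAL_SIZE work-groups get launched, which is why hC and dC
// are dimensioned with one float per work-group.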
The Arguments to the Kernel

On the host side:

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );   // local "prods" array – one per work-item
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

On the kernel side, those arguments arrive as:

kernel void ArrayMultReduce( global const float *dA, global const float *dB, local float *prods, global float *dC )
{
    int gid      = get_global_id( 0 );      // 0 .. total_array_size-1
    int numItems = get_local_size( 0 );     // # work-items per work-group
    int tnum     = get_local_id( 0 );       // thread (i.e., work-item) number in this work-group: 0 .. numItems-1
    int wgNum    = get_group_id( 0 );       // which work-group number this is in

    prods[ tnum ] = dA[ gid ] * dB[ gid ];  // multiply the two arrays together:  A * B → prods

    // Now add them up – come up with one sum per work-group.
    // It is a big performance benefit to do it here while "prods" is still available – and is local.
    // It would be a performance hit to pass "prods" back to the host and then bring it back to the device for the reduction.
    . . .
}
A work-group consisting of numItems work-items can be reduced to a sum in log2(numItems) steps (the passes shown above). In this example, numItems = 8, so the reduction takes 3 steps.
The reduction begins with the individual products in prods[0] .. prods[7].
The final sum will end up in prods[0], which will then be copied into dC[wgNum].
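For a power-of-two work-group size, a quick way to see the step count (a sketch, using the numItems variable from the kernel):

// numItems = 8 halves to 1 in three steps: 8 → 4 → 2 → 1
int numPasses = 0;
for( int n = numItems; n > 1; n /= 2 )
    numPasses++;                 // numPasses ends up as log2( numItems ) = 3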
A Review of Bitmasks

Remember truth tables?

F & F = F      F & T = F      T & F = F      T & T = T

Or, with bits:

0 & 0 = 0      0 & 1 = 0      1 & 0 = 0      1 & 1 = 1

Or, with multiple bits:

000 & 011 = 000
001 & 011 = 001
010 & 011 = 010
011 & 011 = 011
100 & 011 = 000
101 & 011 = 001
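To connect the bitmasks to the reduction, here is a small stand-alone C sketch (illustration only, not part of the course code) that prints which thread numbers pass the ( tnum & mask ) == 0 test at each offset:

#include <stdio.h>

int main( )
{
    int numItems = 8;
    for( int offset = 1; offset < numItems; offset *= 2 )
    {
        int mask = 2*offset - 1;           // offset=1 -> mask=1, offset=2 -> mask=3, offset=4 -> mask=7
        printf( "offset = %d; mask = %d; active threads:", offset, mask );
        for( int tnum = 0; tnum < numItems; tnum++ )
        {
            if( ( tnum & mask ) == 0 )     // the low bits of tnum are all zero
                printf( " %d", tnum );
        }
        printf( "\n" );
    }
    return 0;
}

Its output matches the diagram that follows: offset 1 activates threads 0, 2, 4, 6; offset 2 activates 0 and 4; offset 4 activates only thread 0.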
Reduction Takes Place Within a Single Work-Group

Each work-item is run by a single thread. With numItems = 8, and'ing the thread number with the mask picks out which threads do work on each pass:

offset = 1; mask = 1:
    Thread #0:  prods[ 0 ] += prods[ 1 ];
    Thread #2:  prods[ 2 ] += prods[ 3 ];
    Thread #4:  prods[ 4 ] += prods[ 5 ];
    Thread #6:  prods[ 6 ] += prods[ 7 ];

offset = 2; mask = 3:
    Thread #0:  prods[ 0 ] += prods[ 2 ];
    Thread #4:  prods[ 4 ] += prods[ 6 ];

offset = 4; mask = 7:
    Thread #0:  prods[ 0 ] += prods[ 4 ];
kernel void ArrayMultReduce( … )
{
    int gid      = get_global_id( 0 );
    int numItems = get_local_size( 0 );
    int tnum     = get_local_id( 0 );     // thread number
    int wgNum    = get_group_id( 0 );     // work-group number

    prods[ tnum ] = dA[ gid ] * dB[ gid ];

    // all threads execute this code simultaneously:
    for( int offset = 1; offset < numItems; offset *= 2 )
    {
        int mask = 2*offset - 1;

        barrier( CLK_LOCAL_MEM_FENCE );               // wait for all threads to get here
        if( ( tnum & mask ) == 0 )                    // bit-by-bit and'ing tells us which threads need to do work now
        {
            prods[ tnum ] += prods[ tnum + offset ];
        }
    }

    barrier( CLK_LOCAL_MEM_FENCE );
    if( tnum == 0 )
        dC[ wgNum ] = prods[ 0 ];                     // Σ prods → C
}
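As an aside, a common alternative way to index this kind of local reduction (shown here only as a sketch for comparison; it is not the method used in these notes) halves an offset instead of and'ing with a mask, which keeps the active threads contiguous:

// Alternative reduction indexing (contiguous active threads):
for( int offset = numItems / 2; offset > 0; offset /= 2 )
{
    barrier( CLK_LOCAL_MEM_FENCE );
    if( tnum < offset )
        prods[ tnum ] += prods[ tnum + offset ];
}
barrier( CLK_LOCAL_MEM_FENCE );
if( tnum == 0 )
    dC[ wgNum ] = prods[ 0 ];

Both versions take log2(numItems) passes and leave the work-group's sum in prods[ 0 ].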
And, Finally, in your Host Program
Wait( cmdQueue );
double time0 = omp_get_wtime( );

status = clEnqueueNDRangeKernel( cmdQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL );
PrintCLError( status, "clEnqueueNDRangeKernel failed: " );

Wait( cmdQueue );
double time1 = omp_get_wtime( );

status = clEnqueueReadBuffer( cmdQueue, dC, CL_TRUE, 0, numWorkGroups*sizeof(float), hC, 0, NULL, NULL );
PrintCLError( status, "clEnqueueReadBuffer failed: " );
Wait( cmdQueue );

float sum = 0.;
for( int i = 0; i < numWorkGroups; i++ )
{
    sum += hC[ i ];
}
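The performance chart below reports results in GigaNumbers multiplied and reduced per second; one way that rate could be computed from the timings above (a sketch; the variable name gigaNumbers is just illustrative):

double seconds     = time1 - time0;
double gigaNumbers = (double)NUM_ELEMENTS / seconds / 1000000000.;   // billions of elements multiplied-and-reduced per second
fprintf( stderr, "%8.3lf GigaNumbers multiplied and reduced per second\n", gigaNumbers );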
Reduction Performance
[Graph: GigaNumbers multiplied and reduced per second vs. array size in MegaNumbers (0–300), for a work-group size of 32.]