This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Performing Reductions in OpenCL
Mike Bailey
opencl.reduction.pptx
mjb – June 14, 2021

Recall the OpenCL Memory Model

[Figure: Global Memory and Constant Memory are visible to all work-groups. Each work-group has its own Local (shared) Memory, and each work-item within a work-group has its own Private Memory.]
Here’s the Problem We are Trying to Solve
Like the first.cpp demo program, we are piecewise multiplying two arrays. Unlike the first demo program, we want to then add up all the products and return the sum.
numItems = 8;

A * B → prods,   Σ prods → C

After the array multiplication, we want each work-group to sum the products within that work-group, then return those per-work-group sums to the host in an array for final summing.

To do this, we will not put the products into a large global device array, but into a prods[ ] array that is shared within its work-group.
Reduction Takes Place in a Single Work-Group

If we had 8 work-items in a work-group, we would like the threads in each work-group to execute the following instructions . . .

Step 1:
	Thread #0:  prods[ 0 ] += prods[ 1 ];
	Thread #2:  prods[ 2 ] += prods[ 3 ];
	Thread #4:  prods[ 4 ] += prods[ 5 ];
	Thread #6:  prods[ 6 ] += prods[ 7 ];

Step 2:
	Thread #0:  prods[ 0 ] += prods[ 2 ];
	Thread #4:  prods[ 4 ] += prods[ 6 ];

Step 3:
	Thread #0:  prods[ 0 ] += prods[ 4 ];

. . . but in a more general way than writing them all out by hand.

Here's What You Would Change in your Host Program

int numWorkGroups = NUM_ELEMENTS / LOCAL_SIZE;

float * hA = new float [ NUM_ELEMENTS ];
float * hB = new float [ NUM_ELEMENTS ];
float * hC = new float [ numWorkGroups ];

size_t abSize = NUM_ELEMENTS  * sizeof(float);
size_t cSize  = numWorkGroups * sizeof(float);

cl_mem dA = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dB = clCreateBuffer( context, CL_MEM_READ_ONLY,  abSize, NULL, &status );
cl_mem dC = clCreateBuffer( context, CL_MEM_WRITE_ONLY, cSize,  NULL, &status );

status = clEnqueueWriteBuffer( cmdQueue, dA, CL_FALSE, 0, abSize, hA, 0, NULL, NULL );
status = clEnqueueWriteBuffer( cmdQueue, dB, CL_FALSE, 0, abSize, hB, 0, NULL, NULL );

cl_kernel kernel = clCreateKernel( program, "ArrayMultReduce", &status );

The Arguments to the Kernel

status = clSetKernelArg( kernel, 0, sizeof(cl_mem), &dA );
status = clSetKernelArg( kernel, 1, sizeof(cl_mem), &dB );
status = clSetKernelArg( kernel, 2, LOCAL_SIZE * sizeof(float), NULL );	// local "prods" array, dimensioned the size of each work-group
status = clSetKernelArg( kernel, 3, sizeof(cl_mem), &dC );

Passing NULL as the argument value is how you tell OpenCL that argument #2 is a local (shared) array, not a global array.
Here is the top of the kernel:

kernel void
ArrayMultReduce( global const float *dA, global const float *dB, local float *prods, global float *dC )
{
	int gid      = get_global_id( 0 );	// 0 .. total_array_size-1
	int numItems = get_local_size( 0 );	// # work-items per work-group
	int tnum     = get_local_id( 0 );	// thread (i.e., work-item) number in this work-group: 0 .. numItems-1
	int wgNum    = get_group_id( 0 );	// which work-group number this is

	prods[ tnum ] = dA[ gid ] * dB[ gid ];	// multiply the two arrays together

	// now add them up, producing one sum per work-group
	// it is a big performance benefit to do it here while "prods" is still available, and is local
	// it would be a performance hit to pass "prods" back to the host and then bring it back to the device for reduction
A * B → prods

As the reduction proceeds, the loop variables take on: offset = 1, mask = 1; then offset = 2, mask = 3; then offset = 4, mask = 7.
Reduction Takes Place Within a Single Work-Group
Each work-item is run by a single thread.
A work-group consisting of numItems work-items can be reduced to a sum in log2(numItems) steps. In this example, numItems=8.
The reduction begins with the individual products in prods[0] .. prods[7].
The final sum will end up in prods[0], which will then be copied into dC[wgNum].
A Review of Bitmasks

Remember Truth Tables?

	F & F = F
	F & T = F
	T & F = F
	T & T = T

Or, with Bits:

	0 & 0 = 0
	0 & 1 = 0
	1 & 0 = 0
	1 & 1 = 1

Or, with Multiple Bits:

	000 & 011 = 000
	001 & 011 = 001
	010 & 011 = 010
	011 & 011 = 011
	100 & 011 = 000
	101 & 011 = 001

Here is the complete kernel:

kernel void
ArrayMultReduce( global const float *dA, global const float *dB, local float *prods, global float *dC )
{
	int gid      = get_global_id( 0 );
	int numItems = get_local_size( 0 );
	int tnum     = get_local_id( 0 );	// thread number
	int wgNum    = get_group_id( 0 );	// work-group number

	prods[ tnum ] = dA[ gid ] * dB[ gid ];

	// all threads execute this code simultaneously:
	for( int offset = 1; offset < numItems; offset *= 2 )
	{
		int mask = 2*offset - 1;
		barrier( CLK_LOCAL_MEM_FENCE );		// wait for all threads to get here
		if( ( tnum & mask ) == 0 )		// bit-by-bit and'ing tells us which threads need to do work now
		{
			prods[ tnum ] += prods[ tnum + offset ];
		}
	}

	barrier( CLK_LOCAL_MEM_FENCE );
	if( tnum == 0 )
		dC[ wgNum ] = prods[ 0 ];		// Σ prods → C
}

And, Finally, in your Host Program

Wait( cmdQueue );
double time0 = omp_get_wtime( );

status = clEnqueueNDRangeKernel( cmdQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL );
PrintCLError( status, "clEnqueueNDRangeKernel failed: " );

Wait( cmdQueue );
double time1 = omp_get_wtime( );

status = clEnqueueReadBuffer( cmdQueue, dC, CL_TRUE, 0, numWorkGroups*sizeof(float), hC, 0, NULL, NULL );
PrintCLError( status, "clEnqueueReadBuffer failed: " );
Wait( cmdQueue );

float sum = 0.;
for( int i = 0; i < numWorkGroups; i++ )
{
	sum += hC[ i ];
}

Reduction Performance

[Figure: GigaNumbers multiplied and reduced per second vs. array size in MegaNumbers (0 to 300), work-group size = 32.]