XJCO3221 Parallel Computation
University of Leeds
Lecture 15: GPU threads and kernels
Previous lecture
In the last lecture we started looking at General Purpose GPU programming, or GPGPU:
Device contains a number of SIMD processors, each containing some number of cores.
Thread scheduling is performed in hardware.
Programmable using OpenCL (this course), CUDA, and others.
Device discovery performed at run time (cf. the displayDevices.c example).
Today's lecture
Today we will see how to perform vector addition on a GPU:

Communicating data between the device (GPU) and the host (CPU) using the command queue.
Compiling and executing kernels on the device.
Work items are the basic unit of concurrency.
Arranged into work groups for scalability.
How to set the work group size.
Vector addition
Code on Minerva: vectorAddition.c, vectorAddition.cl and helper.h
Once again we use vector addition as our first worked example: c = a + b, or c_i = a_i + b_i for i = 1, ..., N.
In serial code:

for( i=0; i<N; i++ )
    c[i] = a[i] + b[i];
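On the device, the loop disappears: each work item computes one element, selected by its global index. A minimal sketch of such a kernel follows; the actual vectorAddition.cl on Minerva may differ in details such as the kernel name and any bounds checking.

__kernel void vectorAdd( __global const float *a,
                         __global const float *b,
                         __global       float *c )
{
    int i = get_global_id(0);   // This work item's global index.
    c[i] = a[i] + b[i];         // One element per work item.
}

This sketch assumes exactly N work items are launched, one per element.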
Copying data between device and host
To get the result (device_c) back to the host (host_c), enqueue a read buffer command:
status = clEnqueueReadBuffer(
    queue,               // The command queue.
    device_c,            // Device memory.
    CL_TRUE,             // Blocking.
    0,                   // Offset; must be zero.
    N*sizeof(float),     // Data size.
    host_c,              // Host memory.
    0, NULL, NULL );     // Events; ignore for now.
Note this is a blocking communication call: it will not return until the copy has finished, like MPI_Send()/MPI_Recv().
In CUDA the equivalent is cudaMemcpy(…,cudaMemcpyDeviceToHost).
Copying data from host to device
If we had not used CL_MEM_COPY_HOST_PTR earlier, we would need two calls to clEnqueueWriteBuffer():
status = clEnqueueWriteBuffer( queue, device_a, CL_FALSE,
    0, N*sizeof(float), host_a, 0, NULL, NULL );
status = clEnqueueWriteBuffer( queue, device_b, CL_FALSE,
    0, N*sizeof(float), host_b, 0, NULL, NULL );
Copies from host to device.
CL_FALSE requests non-blocking communication, so host_a and host_b must not be modified until the copies have completed.
The device memory always comes before host memory in the argument list.
In CUDA the equivalent is cudaMemcpy(…,cudaMemcpyHostToDevice).
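For comparison, a minimal sketch of the CL_MEM_COPY_HOST_PTR route referred to above, assuming a cl_context named context has already been created (the flags and variable names here are illustrative, not taken from the course code):

cl_int status;
cl_mem device_a = clCreateBuffer( context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,   // Copy host_a at creation.
        N*sizeof(float), host_a, &status );
cl_mem device_b = clCreateBuffer( context,
        CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,   // Copy host_b at creation.
        N*sizeof(float), host_b, &status );

With this approach the data is copied to the device when the buffer is created, so no explicit clEnqueueWriteBuffer() calls are needed.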
Work items
Definition
The work item is the unit of concurrent execution. It usually maps onto a single hardware thread.
As thread scheduling on a GPU is implemented in hardware, there is (essentially) no overhead in launching/destroying threads.
No problem oversubscribing, i.e. issuing more threads than there are physical cores.
Normally issue as many threads as the problem requires.
Work item hierarchy
To remain scalable, the hardware does not allow communication (including synchronisation) between all threads at once.
Instead employs a hierarchy:
Work items belong to work groups (threads and thread blocks in CUDA).
Communication (including synchronisation) only possible within a work group.
The full range of all threads is called the NDRange in OpenCL, for n-dimensional range (the grid in CUDA).
Hierarchy of work items: 2D example

[Figure: a 2D NDRange divided into work groups, with axes labelled get_global_id(1) and get_local_id(0), and a single work group highlighted.]
Specifying the n-dimensional range (NDRange)
The NDRange must have 1, 2 or 3 dimensions. A 2-dimensional example:
size_t globalSize[2]    = { X, Y };
size_t workGroupSize[2] = { 8, 16 };

status = clEnqueueNDRangeKernel( queue, kernel, 2, 0,
    globalSize, workGroupSize, 0, NULL, NULL );
This launches X*Y instances of the kernel in total (one per work item), arranged in work groups of 8*16.
OpenCL 2.0 allows X and Y to be arbitrary, but in earlier versions they must be multiples of the work group size (8 and 16 here).
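When the problem size is not naturally a multiple of the work group size, a common idiom, shown here as a sketch rather than as part of the course code, is to round the global size up and have the kernel guard against out-of-range indices:

size_t groupSize  = 128;   // Chosen work group size.
size_t globalSize = ((N + groupSize - 1) / groupSize) * groupSize;  // Next multiple of groupSize.
// The kernel must then test get_global_id(0) < N before writing.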
Once in a kernel, we can get the global indices using get_global_id(). For this 2D example:

get_global_id(0);   // Varies from 0 to X-1 inclusive.
get_global_id(1);   // Varies from 0 to Y-1 inclusive.

Similarly, we can get the indices within the work group using get_local_id():

get_local_id(0);    // Varies from 0 to 7 inclusive.
get_local_id(1);    // Varies from 0 to 15 inclusive.

We can also get the number of work items in a group or in the NDRange using get_local_size() and get_global_size():

get_local_size(1);   // Returns 16.
get_global_size(0);  // Returns X.
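Putting these together, a hypothetical 2D kernel (the name indexDemo and the output array are illustrative, not from the course code) that flattens its global position into a row-major 1D index:

__kernel void indexDemo( __global int *out )
{
    int gx = get_global_id(0);     // Column: 0 to X-1.
    int gy = get_global_id(1);     // Row:    0 to Y-1.
    int X  = get_global_size(0);   // Total number of columns.

    out[ gy*X + gx ] = get_local_id(0);   // Row-major flattening.
}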
What group size to use?
Devices have a maximum work group size they can support. This can be determined at run time as follows:
size_t maxWorkItems;
clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                 sizeof(size_t), &maxWorkItems, NULL );
Note this limit applies to the total number of items in a group (e.g. 8*16=128 in the earlier example).
Other factors may suggest using work group sizes smaller than this maximum; we will look at one of these next time.
Passing NULL as the work group size argument lets OpenCL try to determine a suitable size automatically, as in the sketch below.
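For instance, a minimal 1D launch that defers the choice to the runtime might look as follows (queue and kernel as before):

size_t globalSize[1] = { N };
status = clEnqueueNDRangeKernel( queue, kernel, 1, NULL,
        globalSize, NULL,        // NULL: the runtime picks the group size.
        0, NULL, NULL );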
Summary and next lecture
Today we have looked at a complete GPGPU solution:
Communication between host and device.
Kernels that execute on the device.
Basic unit of concurrency is the work item.
Grouped into work groups, within which communication is possible.
Next time we will look at the different memory types on a GPU.