Introduction to OpenCL
Cliff Woolley, NVIDIA Developer Technology Group
Welcome to the OpenCL Tutorial!
OpenCL Platform Model
OpenCL Execution Model
Mapping the Execution Model onto the Platform Model Introduction to OpenCL Programming
Additional Information and Resources
OpenCL is a trademark of Apple, Inc.
Design Goals of OpenCL
Use all computational resources in the system — CPUs, GPUs and other processors as peers
Efficient parallel programming model — Based on C99
— Data- and task- parallel computational model
— Abstract the specifics of underlying hardware
— Specify accuracy of floating-point computations
Desktop and Handheld Profiles
© Copyright Khronos Group, 2010
OPENCL PLATFORM MODEL
It’s a Heterogeneous World
A modern platform includes:
– One or more CPUs
– One or more GPUs
– Optional accelerators (e.g., DSPs)
CPU
CPU
GPU
GMCH ICH
DRAM
GMCH = graphics memory control hub ICH = Input/output control hub
© Copyright Khronos Group, 2010
OpenCL Platform Model
Computational Resources
Host
… …
…
… …
… …
…
… …
… …
Compute Device
Processing Element
Compute Unit
OpenCL Platform Model
Computational Resources
Host
… …
…
… …
… …
…
… …
… …
Compute Device
Processing Element
Compute Unit
OpenCL Platform Model
on CUDA Compute Architecture
CPU
CUDA Streaming Processor
Host
… …
…
… …
… …
…
… …
… …
Compute Device
Processing Element
Compute Unit
CUDA-Enabled GPU
CUDA Streaming Multiprocessor
Anatomy of an OpenCL Application
OpenCL Application
Host Code Device Code
• Written in C/C++ • Written in OpenCL C
• Executes on the host • Executes on the device
Host code sends commands to the Devices:
… to transfer data between host memory and device memories … to execute device code
Host
Compute Devices
… …
…
… …
… …
…
… …
… …
Anatomy of an OpenCL Application
Serial code executes in a Host (CPU) thread
Parallel code executes in many Device (GPU) threads
across multiple processing elements
OpenCL Application
Serial code
Parallel code
Serial code
Parallel code
Host = CPU
Device = GPU
Host = CPU
Device = GPU
…
…
OPENCL EXECUTION MODEL
Decompose task into work-items Define N-dimensional computation domain
Execute a kernel at each point in computation domain
Traditional loop as a function in C
OpenCL C kernel
void trad_mul(int n,
const float *a,
const float *b,
float *c)
{
int i;
for (i=0; i
Private Memory
Work-Item
Local Memory Workgroup
Private Memory
Work-Item
Local Memory Workgroup
Private Memory
Private Memory
Work-Item
Work-Item
Compute Device
Global/Constant Memory
Host Memory
Host
© Copyright Khronos Group, 2010
INTRODUCTION TO OPENCL PROGRAMMING
OpenCL Framework
Platform layer
— Platform query and context creation
Compiler for OpenCL C
Runtime
— Memory management and command execution within a context
OpenCL Framework
Context
CPU
GPU
Programs Kernels
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
OpenCL Framework: Platform Layer
Programs
CPU GPU
Context
Kernels
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
OpenCL Framework: Platform Layer
Query platform information
— clGetPlatformInfo(): profile, version, vendor, extensions
— clGetDeviceIDs(): list of devices
— clGetDeviceInfo(): type, capabilities
Create an OpenCL context for one or more devices One or more devices
Context =
cl_context
cl_device_id
Memory and device code shared by these devices
cl_mem cl_program
Command queues to send commands to these devices
cl_command_queue
Platform Layer:
Context Creation (simplified)
// Get the platform ID
cl_platform_id platform; clGetPlatformIDs(1, &platform, NULL);
Number returned
// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
// Create an OpenCL context for the GPU device
cl_context context;
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
Context Error User Error properties callback data code
Platform Layer:
Error Handling, Resource Deallocation
Error handling:
— All host functions return an error code — Context error callback
Resource deallocation
— Reference counting API: clRetain*(), clRelease*()
Both are removed from code samples for clarity — Please see SDK samples for complete code
OpenCL Framework: OpenCL C
Programs
CPU GPU
Context
Kernels
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
OpenCL C
Derived from ISO C99 (with some restrictions)
Language Features Added — Work-items and work-groups — Vector types
— Synchronization
— Address space qualifiers
Also includes a large set of built-in functions — Image manipulation
— Work-item manipulation
— Math functions
© Copyright Khronos Group, 2010
OpenCL C Language Restrictions
Pointers to functions are not allowed
Pointers to pointers allowed within a kernel, but not as an argument Bit-fields are not supported
Variable-length arrays and structures are not supported
Recursion is not supported
Writes to a pointer to a type less than 32 bits are not supported* Double types are not supported, but reserved
3D Image writes are not supported
Some restrictions are addressed through extensions
OpenCL C Optional Extensions
Extensions are optional features exposed through OpenCL The OpenCL working group has already approved many
extensions to the OpenCL specification:
— Double precision floating-point types (Section 9.3)
— Built-in functions to support doubles
— Atomic functions (Section 9.5, 9.6, 9.7)
— Byte-addressable stores (write to pointers to types < 32-bits) (Section 9.9) — 3D Image writes (Section 9.8)
— Built-in functions to support half types (Section 9.10)
Now core features in OpenCL 1.1
Work-items and work-groups
A kernel is a function executed for each work-item
__kernel void square(__global float* input, __global float* output) {
int i = get_global_id(0);
output[i] = input[i] * input[i]; }
Built-in function
Address space qualifier
get_global_id(0) = 7
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
input
output
Function qualifier
6
1
1
0
9
2
4
1
1
9
7
6
8
2
2
5
36
1
1
0
81
4
16
1
1
81
49
36
64
4
4
25
© Copyright Khronos Group, 2010
Work-items and work-groups
6
1
1
0
9
2
4
1
1
9
7
6
8
2
2
5
work-group work-item
get_group_id(0) = 0 get_local_size(0) = 8
get_group_id(0) = 1 get_local_size(0) = 8
get_local_id(0) = 3 get_global_id(0) = 11
input
get_work_dim() = 1 get_global_size(0) = 16
get_num_groups(0) = 2
© Copyright Khronos Group, 2010
OpenCL C Data Types
Scalar data types
— char, uchar, short, ushort, int, uint, long, ulong, float
— bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
Image types
— image2d_t, image3d_t, sampler_t
Vector data types
— Vector lengths 2, 3, 4, 8, 16 (char2, ushort4, int8, float16, double2, ...)
— Endian safe
— Aligned at vector length — Vector operations
3-vectors new in OpenCL 1.1
double is an optional type in OpenCL
© Copyright Khronos Group, 2010
OpenCL C Synchronization Primitives
Built-in functions to order memory operations and synchronize execution:
— mem_fence(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE) waits until all reads/writes to local and/or global memory made by the calling work-
item prior to mem_fence() are visible to all threads in the work-group
— barrier(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE)
waits until all work-items in the work-group have reached this point and calls mem_fence(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE)
Used to coordinate accesses to local or global memory shared among work-items
OpenCL C Kernel Example
__kernel void dp_mul(__global const float *a, __global const float *b,
__global float *c, int N)
{
int id = get_global_id (0); if (id < N)
}
c[id] = a[id] * b[id];
OpenCL Framework: Runtime
Programs
CPU GPU
Context
Kernels
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
OpenCL Framework: Runtime
Command queues creation and management
Device memory allocation and management
Device code compilation and execution
Event creation and management (synchronization, profiling)
OpenCL Runtime: Kernel Compilation
Context
CPU GPU
Programs Kernels
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
Kernel Compilation
A cl_program object encapsulates some source code (with potentially several kernel functions) and its last successful build
— clCreateProgramWithSource() // Create program from source — clBuildProgram() // Compile program
A cl_kernel object encapsulates the values of the kernel’s arguments used when the kernel is executed
— clCreateKernel() // Create kernel from successfully compiled program — clSetKernelArg() // Set values of kernel’s arguments
Kernel Compilation
// Build program object and set up kernel arguments
const char* source = "__kernel void dp_mul(__global const float *a, \n"
"
"
"
"{ \n" "
"
"
"} \n";
__global const float *b, \n" __global float *c, \n"
int N) \n"
int id = get_global_id (0); \n" if (id < N) \n"
c[id] = a[id] * b[id]; \n"
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL); clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, ―dp_mul", NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_buffer);
clSetKernelArg(kernel, 1, sizeof(int), (void*)&N);
OpenCL Runtime: Memory Objects
Programs
CPU GPU
Context
Kernels
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
Memory Objects
Two types of memory objects (cl_mem): — Buffer objects
— Image objects
Memory objects can be copied to host memory, from host memory, or to other memory objects
Regions of a memory object can be accessed from host by mapping them into the host address space
Buffer Object
One-dimensional array
Elements are scalars, vectors, or any user-defined structures Accessed within device code through pointers
Image Object
Two- or three-dimensional array
Elements are 4-component vectors from a list of predefined
formats
Accessed within device code via built-in functions (storage format not exposed to application)
— Sampler objects are used to configure how built-in functions sample images (addressing modes, filtering modes)
Can be created from OpenGL texture or renderbuffer
OpenCL Runtime: Command Queues
Programs Kernels
CPU GPU
Context
Memory Objects
Command Queues
dp_mul
arg [0]
arg [0]
value argv[0a]lvualeue
arg [1]
arg [1]
value argv[1a]lvualeue
arg [2]
arg [2]
value argv[2a]lvualeue
In
In
Order
Order
Queue
Queue
GPU
Compute Device
Out of
Out of
Order
Order
Queue
Queue
__kernel void
dp_mul(global const float *a,
global const float *b,
global float *c) {
int id = get_global_id(0);
c[id] = a[id] * b[id]; }
dp_mul
CPU program binary
dp_mul
GPU program binary
Third party names are the property of their owners.
© Copyright Khronos Group, 2010
Images
Buffers
Commands
Memory copy or mapping Device code execution
Synchronization point
Command Queue
Sequence of commands scheduled for execution on a specific device — Enqueuing functions: clEnqueue*()
— Multiple queues can execute on the same device
Two modes of execution:
— In-order: Each command in the queue executes only when the preceding
command has completed (including memory writes)
— Out-of-order: No guaranteed order of completion for commands
// Create a command-queue for a specific device
Error code
cl_command_queue cmd_queue = clCreateCommandQueue(context, device_id, 0, NULL); Properties
Data Transfer between Host and Device
// Create buffers on host and device
size_t size = 100000 * sizeof(int);
int* h_buffer = (int*)malloc(size);
cl_mem d_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, NULL);
...
// Write to buffer object from host memory
clEnqueueWriteBuffer(cmd_queue, d_buffer, CL_FALSE, 0, size, h_buffer, 0, NULL, NULL); ...
// Read from buffer object to host memory
clEnqueueReadBuffer(cmd_queue, d_buffer, CL_TRUE, 0, size, h_buffer, 0, NULL, NULL);
Blocking? Offset Event synch
Kernel Execution: NDRange
Host code invokes a kernel over an index space called an NDRange — NDRange = ―N-Dimensional Range‖ of work-items
— NDRange can be a 1-, 2-, or 3-dimensional space
— Work-group dimensionality matches work-item dimensionality
Kernel Invocation
// Set number of work-items in a work-group
size_t localWorkSize = 256;
int numWorkGroups = (N + localWorkSize – 1) / localWorkSize; // round up
size_t globalWorkSize = numWorkGroups * localWorkSize; // must be evenly divisible by localWorkSize clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
NDRange
Command Synchronization
Queue barrier command: clEnqueueBarrier()
— Commands after the barrier start executing only after all commands before
the barrier have completed
Events: a cl_event object can be associated with each command
— Commands return events and obey event waitlists
clEnqueue*(..., num_events_in_waitlist, *event_waitlist, *event);
— Any commands (or clWaitForEvents()) can wait on events before executing
— Event object can be queried to track execution status of associated command and get profiling information
Some clEnqueue*() calls can be optionally blocking — clEnqueueReadBuffer(..., CL_TRUE, ...);
Synchronization: Queues & Events
You must explicitly synchronize between queues — Multiple devices each have their own queue
— Possibly multiple queues per device
— Use events to synchronize
© Copyright Khronos Group, 2010
ADDITIONAL INFORMATION AND RESOURCES
Next Steps
Begin hands-on development with our publicly available OpenCL driver and GPU Computing SDK
Read the OpenCL Specification and the extensive documentation provided with the SDK
Read and contribute to OpenCL forums at Khronos and NVIDIA
NVIDIA OpenCL Resources
NVIDIA OpenCL Web Page:
— http://www.nvidia.com/object/cuda_opencl.html
NVIDIA OpenCL Forum:
— http://forums.nvidia.com/index.php?showforum=134
NVIDIA driver, profiler, code samples for Windows and Linux: — http://developer.nvidia.com/object/opencl.html
Khronos OpenCL Resources
OpenCL Specification
— http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
OpenCL Registry
— http://www.khronos.org/registry/cl/
OpenCL Developer Forums
— http://www.khronos.org/message_boards/
OpenCL Quick Reference Card
— http://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
OpenCL Online Man pages
— http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
© Copyright Khronos Group, 2010
OpenCL Books
• The OpenCL Programming Book – Available now: search for OpenCL on
Amazon
• OpenCL Programming Guide - The ―Red Book‖ of OpenCL
– Coming in July 2011; rough cut available on Safaribooks
– http://my.safaribooksonline.com/9780132488006
Questions?