COMP 8551
Advanced Games Programming Techniques
OpenCL
Borna Noureddin, Ph.D.
British Columbia Institute of Technology
OpenCL Overview
• Platform model: a high-level description of the
heterogeneous system
• Execution model: an abstract representation of how
streams of instructions execute on the heterogeneous
platform
• Memory model: the collection of memory regions within
OpenCL and how they interact during an OpenCL
computation
• Programming models: the high-level abstractions a
programmer uses when designing algorithms to
implement an application
Platform model
• Host interacts with environment external to OpenCL
program (I/O, user interaction, etc.)
• Host connected to 1+ OpenCL devices
• Device: where streams of instructions (or kernels)
execute (aka “compute device”)
• can be CPU, GPU, DSP, or any other processor
• further divided into compute units
• compute units divided into one or more processing elements
(PEs)
• computations occur within PEs
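A minimal sketch of how a host program might enumerate this hierarchy using the standard OpenCL API (first platform and first device only; error handling omitted):

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_uint num_units;
    char name[256];

    // First platform, and first device of any type on it
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    // Query the device name and its number of compute units
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_units), &num_units, NULL);
    printf("%s: %u compute units\n", name, num_units);
    return 0;
}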
Execution model
• OpenCL application consists of:
• host program
• collection of one or more kernels
• Host program runs on host
• OpenCL does not define details of how host program works, only
how it interacts with objects defined within OpenCL
• Kernels execute on OpenCL devices
• Do real work of application
• Typically simple functions that transform input memory objects
into output memory objects
• OpenCL kernels: functions written in OpenCL C
• Native kernels: functions created outside OpenCL (function
pointer) [OPTIONAL]
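For illustration, a minimal OpenCL C kernel that transforms two input memory objects into an output memory object (the name vadd and its arguments are assumptions, not part of the slides):

// Each work-item handles one element
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *result)
{
    int i = get_global_id(0); // this work-item's position in the index space
    result[i] = a[i] + b[i];
}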
Execution model
• The OpenCL execution model defines how kernels
execute
• How do individual kernels run on a device?
• How does the host define the context for kernel
execution?
• How are the kernels enqueued for execution?
Kernel execution
• Host program issues command that submits kernel for
execution on device
• Runtime system creates an integer index space
• Instance of kernel executes for each point in this index space
• Each instance of an executing kernel: work-item
• identified by coordinates in index space
• coordinates are global ID for work-item
• Command creates collection of work-items, each of
which uses same sequence of instructions defined by
single kernel
• Sequence of instructions same, but behavior of each
work-item can vary (branch statements or data selected
through global ID)
Kernel execution
• Work-items organized into work-groups
• Provide coarse-grained decomposition of index space
• Exactly span global index space
• Work-groups same size in corresponding dimensions, and
this size evenly divides global size in each dimension
• Work-groups assigned unique ID with same
dimensionality as index space of work-items
• Work-items assigned unique local ID within work-group:
can be uniquely identified by its global ID or by a
combination of its local ID and work-group ID
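Inside a kernel, these IDs are available through OpenCL C built-in functions; a sketch for a 1D index space:

__kernel void show_ids(__global int *out)
{
    size_t g = get_global_id(0);  // global ID
    size_t l = get_local_id(0);   // local ID within the work-group
    size_t w = get_group_id(0);   // work-group ID
    size_t n = get_local_size(0); // work-group size

    // The two identifications agree: g == w * n + l
    out[g] = (int)(w * n + l);
}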
Kernel execution
• Work-items in given work-group execute concurrently on
PEs of single compute unit
• Implementation may serialize execution of kernels (may
even serialize execution of work-groups in single kernel
invocation)
• OpenCL only assures that work-items within a work-group
execute concurrently
• You can never assume that work-groups or kernel
invocations execute concurrently
Kernel execution
• Index space spans an N-dimensional range of values
(NDRange)
• N can be 1, 2, or 3
• Integer array of length N specifying size of index space in
each dimension
• Work-item’s global and local ID is an N-dimensional tuple
• Work-groups assigned IDs using a similar approach to
that used for work-items
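A host-side sketch of defining a 2D NDRange when submitting a kernel (sizes are illustrative; queue and kernel are assumed to exist already):

size_t global[2] = {1024, 1024}; // size of index space per dimension
size_t local[2]  = {16, 16};     // work-group size; divides global evenly

// One work-item per point in the 1024x1024 index space,
// organized into 64x64 work-groups of 16x16 work-items
clEnqueueNDRangeKernel(queue, kernel, 2,
                       NULL,          // global offset (NULL = origin)
                       global, local,
                       0, NULL, NULL);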
Kernel execution
For a 2D NDRange of global size (Gx, Gy) divided into (Wx, Wy) work-groups:
• Work-group (local) size: Lx = Gx / Wx, Ly = Gy / Wy
• Global ID from work-group ID (wx, wy) and local ID (lx, ly):
gx = wx * Lx + lx
gy = wy * Ly + ly
• Work-group ID from global ID: wx = gx / Lx, wy = gy / Ly
• Local ID from global ID: lx = gx % Lx, ly = gy % Ly
• With a global offset (ox, oy):
gx = wx * Lx + lx + ox
gy = wy * Ly + ly + oy
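These relations map directly onto the OpenCL C built-in functions; a sketch for dimension 0 (get_global_offset requires OpenCL 1.1 or later):

size_t g = get_global_id(0);     // gx
size_t w = get_group_id(0);      // wx
size_t l = get_local_id(0);      // lx
size_t n = get_local_size(0);    // Lx
size_t o = get_global_offset(0); // ox

// Identity from above: gx = wx * Lx + lx + ox,
// so for every work-item, g == w * n + l + o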
Context
• In OpenCL, computation takes place on device
• But host:
• Defines and establishes the context for the kernels
• Defines NDRanges
• Defines queues that control details of how/when kernels execute
• Context defines environment within which kernels are
defined and execute:
• Devices: collection of OpenCL devices to be used by host
• Kernels: OpenCL functions that run on devices
• Program objects: program source code and executables that
implement kernels
• Memory objects: set of objects in memory that are visible to
OpenCL devices and contain values that can be operated on by
instances of a kernel
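A sketch of creating a context over one device (error handling elided; CL_DEVICE_TYPE_GPU is an illustrative choice):

cl_platform_id platform;
cl_device_id device;
cl_int err;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// The context ties together the devices and, later, the program,
// kernel, and memory objects created against it
cl_context context = clCreateContext(NULL, 1, &device,
                                     NULL, NULL, &err);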
Context
• Created and manipulated by host using the OpenCL API
• Context also contains one or more “program objects”
• think of these as a dynamic library from which the functions used
by the kernels are pulled
• Host program defines devices within context: only at that
point is it possible to know how to compile the program
source code to create the code for the kernels
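A sketch of building a program object from source once the devices are known, then pulling a kernel out of it (the vadd source is the assumed example from earlier):

const char *source =
    "__kernel void vadd(__global const float *a,"
    "                   __global const float *b,"
    "                   __global float *result) {"
    "    int i = get_global_id(0);"
    "    result[i] = a[i] + b[i];"
    "}";

cl_program program = clCreateProgramWithSource(context, 1,
                                               &source, NULL, &err);

// Compilation happens here, for the devices in the context
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

// Extract a kernel from the program object, like a function
// from a dynamic library
cl_kernel kernel = clCreateKernel(program, "vadd", &err);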
Context
• Program object built at runtime within host program (like
shader program)
• Context also defines how kernels interact with memory
• On heterogeneous platform, often multiple address
spaces to manage
• Devices may have range of different memory
architectures
• OpenCL introduces idea of “memory objects”
• explicitly defined on host
• explicitly moved between host and OpenCL devices
• extra burden on programmer, but allows support for much wider
range of platforms
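A sketch of defining a memory object and explicitly moving data between host and device (N, queue, and the host arrays are assumptions; CL_TRUE makes the transfers blocking):

// Explicitly define a memory object within the context
cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                              N * sizeof(float), NULL, &err);

// Explicitly move data: host -> device ...
clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0,
                     N * sizeof(float), a_host, 0, NULL, NULL);

// ... and later, device -> host
clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0,
                    N * sizeof(float), result_host, 0, NULL, NULL);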
Command-Queues
• Interaction between host and devices occurs through
commands posted by host to command-queue
• Created by host and attached to single device after
context has been defined
• Host places commands into command-queue, and
commands are then scheduled for execution on the
associated device
• OpenCL supports three types of commands:
• Kernel execution commands execute kernel on PEs of device
• Memory commands transfer data between host and different
memory objects, move data between memory objects, or map
and unmap memory objects from host address space
• Synchronization commands put constraints on order of execution
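A sketch of creating a command-queue attached to a single device after the context exists (clCreateCommandQueue is the OpenCL 1.x call appropriate to this material):

cl_command_queue queue = clCreateCommandQueue(context, device,
                                              0, // default: in-order
                                              &err);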
Command-Queues
Typical host program
• Define context and command-queues
• Define memory and program objects
• Build data structures needed on host
• Use command-queue to move memory objects from the
host to devices
• Attach kernel arguments to memory objects
• Submit kernels to command-queue for execution
• When kernels complete, copy memory objects back to
host
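Putting the steps together, a condensed host-program skeleton under the same assumptions as the earlier sketches (one device, the vadd kernel, N elements, error handling elided):

// Define context and command-queue
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

// Define memory and program objects
cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, N * sizeof(float), NULL, &err);
cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, N * sizeof(float), NULL, &err);
cl_mem c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);
cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel k = clCreateKernel(prog, "vadd", &err);

// Move memory objects from host to device
clEnqueueWriteBuffer(q, a, CL_TRUE, 0, N * sizeof(float), a_host, 0, NULL, NULL);
clEnqueueWriteBuffer(q, b, CL_TRUE, 0, N * sizeof(float), b_host, 0, NULL, NULL);

// Attach kernel arguments to memory objects
clSetKernelArg(k, 0, sizeof(cl_mem), &a);
clSetKernelArg(k, 1, sizeof(cl_mem), &b);
clSetKernelArg(k, 2, sizeof(cl_mem), &c);

// Submit kernel to command-queue for execution
size_t global = N;
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

// Copy memory objects back to host when the kernel completes
clEnqueueReadBuffer(q, c, CL_TRUE, 0, N * sizeof(float), c_host, 0, NULL, NULL);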
Command-Queues
• When multiple kernels are submitted, may need to
interact
• E.g., one set of kernels may generate memory objects that a
following set of kernels needs to manipulate
• Synchronization commands can be used to force first set to
complete before following set begins
• Many additional subtleties associated with how the
commands work in OpenCL
• Commands always execute asynchronously to host
• Host submits commands to command-queue and
continues without waiting for them to finish
• If necessary for host to wait on command, can be
explicitly established with synchronization command
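A sketch of the host explicitly waiting on commands (the event here is an assumption for illustration):

cl_event done;

// Enqueue returns immediately; the host keeps running
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, &done);

// ... host can do unrelated work here ...

clWaitForEvents(1, &done); // block until that one command finishes
clFinish(q);               // or: block until the whole queue drains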
Command-Queues
Commands within single queue execute relative to each
other in one of two modes:
• In-order execution: Commands launched in order in which they
appear in command-queue and complete in order (serializes
execution order of commands)
• Out-of-order execution: Commands issued in order but do not
wait to complete before the following commands execute (order
constraints enforced by programmer through explicit
synchronization mechanisms) [OPTIONAL]
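Out-of-order mode is requested (where supported) with a queue property at creation; a sketch:

// Commands may now complete in any order; ordering constraints
// must be expressed explicitly through events
cl_command_queue ooo_q = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
    &err);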
Command-Queues
• Why out-of-order? Remember load balancing?
• Application is not done until all of its kernels complete
• Efficient program minimizes runtime: want all compute
units to be fully engaged and run for approximately same
amount of time
• You can often do this by carefully thinking about order in
which you submit commands to queues so that the in-
order execution achieves a well-balanced load
• But what about when a set of commands takes different
amounts of time to execute? Load balancing can be very
hard! An out-of-order queue can take care of this for you
Command-Queues
• Automatic load balancing: Commands can execute in
any order, so if compute unit finishes its work early, it can
immediately fetch a new command and start executing
new kernel
• Commands generate event objects
• Command can be told to wait until certain conditions on event
objects exist
• Events can also be used to coordinate execution between host
and devices
• Also possible to associate multiple queues with single
context for any devices within that context
• Run concurrently and independently with no explicit mechanisms
within OpenCL to synchronize between them
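A sketch of ordering two kernels through an event, as needed in an out-of-order queue (k1 and k2 are assumed kernels, k1 producing a memory object that k2 consumes):

cl_event k1_done;

clEnqueueNDRangeKernel(ooo_q, k1, 1, NULL, &global, NULL,
                       0, NULL, &k1_done);

// k2 waits until k1's event signals completion
clEnqueueNDRangeKernel(ooo_q, k2, 1, NULL, &global, NULL,
                       1, &k1_done, NULL);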
Memory Model
Two types of memory objects
• Buffer object:
• contiguous block of memory made available to kernels
• programmer can map data structures onto this buffer and access
buffer through pointers
• flexibility to define just about any data structure
• Image object:
• restricted to holding images
• storage format may be optimized to needs of specific device
• important to give an implementation freedom to customize
image format
• opaque object
• OpenCL provides functions to manipulate images, but other than
these specific functions, the contents of image object are hidden
from kernel program
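A sketch of creating each kind of memory object (clCreateImage2D is the OpenCL 1.0/1.1-era call; later versions use clCreateImage):

// Buffer: contiguous block; map any data structure onto it
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            1024 * sizeof(float), NULL, &err);

// Image: opaque object; implementation chooses the storage layout
cl_image_format fmt = { CL_RGBA, CL_UNORM_INT8 };
cl_mem img = clCreateImage2D(context, CL_MEM_READ_ONLY, &fmt,
                             512, 512, // width, height
                             0,        // row pitch (0 = computed by OpenCL)
                             NULL, &err);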
Memory Model
OpenCL memory model defines five distinct memory regions:
• Host memory: visible only to host
• Global memory: permits read/write access to all work-items
in all work-groups
• Constant memory: region of global memory that remains
constant during kernel execution
• host allocates and initializes
• work-items have read-only access
• Local memory: local to work-group
• can be used to allocate variables shared by all work-items
• may be implemented as dedicated regions of memory on device
or mapped onto sections of global memory
• Private memory: private to work-item
• not visible to other work-items
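On the device side these regions correspond to OpenCL C address-space qualifiers; a sketch (host memory has no qualifier, since kernels cannot see it):

__kernel void regions(__global float *data,     // global memory
                      __constant float *coeffs, // constant memory
                      __local float *scratch)   // local: shared per work-group
{
    // x lives in private memory, visible only to this work-item
    float x = data[get_global_id(0)];

    scratch[get_local_id(0)] = x * coeffs[0];
}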
Memory Model
(Figure: diagram of the OpenCL memory regions)
Programming Model
• How to map parallel algorithms onto OpenCL
• Programming models intimately connected to how
programmers reason about their algorithms
• OpenCL defined with two different programming models in
mind: task parallelism and data parallelism
• Also possible to think in terms of a hybrid model: tasks that
contain data parallelism
Programming Model
Data-Parallel Programming Model
• Problems well suited are organized around data structures,
the elements of which can be updated concurrently
• Single logical sequence of instructions applied concurrently to
elements of data structure
• Structure of algorithm is designed as sequence of concurrent
updates to data structures within problem
• Natural fit with OpenCL’s execution model
• Key is the NDRange defined when kernel is launched
• Algorithm designer aligns data structures with NDRange index
space and maps them onto OpenCL memory objects
• Kernel defines sequence of instructions to be applied
concurrently as work-items
Programming Model
Data-Parallel Programming Model
• Work-items in single work-group may need to share data (local
memory region)
• Regardless of order in which work-items complete, same
results should be produced
• Work-items in same work-group can participate in a work-
group barrier (all must execute before any continuing)
• NB: no mechanism for synchronization between work-items
from different work-groups
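A sketch of local-memory sharing with a work-group barrier (each work-group reverses its own chunk of the data; illustrative only):

__kernel void wg_reverse(__global float *data, __local float *tmp)
{
    size_t l = get_local_id(0);
    size_t n = get_local_size(0);
    size_t g = get_global_id(0);

    tmp[l] = data[g];

    // All work-items in the group must reach this point before any
    // continue; afterwards every write to tmp is visible to the group
    barrier(CLK_LOCAL_MEM_FENCE);

    data[g] = tmp[n - 1 - l];
}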
Programming Model
Data-Parallel Programming Model
• Single Instruction Multiple Data or SIMD: no branch
statements in kernel, each work-item will execute identical
operations but on subset of data items selected by its global
ID
• Single Program Multiple Data or SPMD: branch statements
within a kernel leading each work-item to possibly execute
very different operations
• On platforms with restricted bandwidth to instruction
memory or if PEs map onto vector unit, SIMD model can be
dramatically more efficient
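A sketch of the SPMD case, where a branch on the global ID sends work-items down different paths:

__kernel void spmd_example(__global float *data)
{
    int i = get_global_id(0);

    // Same instruction sequence, but behavior diverges per work-item
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f; // even work-items
    else
        data[i] = data[i] + 1.0f; // odd work-items
}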
Programming Model
Data-Parallel Programming Model
• Vector instructions are strictly SIMD
• E.g., numerical integration program (4.0/(1 + x²))

float8 x, psum_vec;
float8 ramp = (float8)(0.5, 1.5, 2.5, 3.5,
                       4.5, 5.5, 6.5, 7.5);
float8 four = (float8)(4.0); // fill with 8 4's
float8 one  = (float8)(1.0); // fill with 8 1's
float step_number; // step number from loop index
float step_size;   // input integration step size
. . . and later inside a loop body . . .
x = ((float8)step_number + ramp) * step_size;
psum_vec += four / (one + x*x);
Programming Model
Task-Parallel Programming Model
• Task = kernel that executes as a single work-item regardless of
NDRange used by other kernels in application
• Concurrency is internal to the task (e.g., vector operations on
vector types)
• Kernels submitted as tasks that execute at the same time with
an out-of-order queue
• Tasks connected into task graph using OpenCL’s event model
• Commands submitted to a command-queue may optionally
generate events
• Subsequent commands can wait for these events before
executing
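A sketch of the task model using the OpenCL 1.x clEnqueueTask call, with an event forming a two-node task graph (task_a and task_b are assumed kernels):

cl_event a_done;

// Each task executes as a single work-item
clEnqueueTask(ooo_q, task_a, 0, NULL, &a_done);

// task_b is downstream of task_a in the task graph
clEnqueueTask(ooo_q, task_b, 1, &a_done, NULL);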