Part 5: OpenCLTM C Language Features
Built-in Functions
37
OpenCLTM C Language
Derived from ISO C99
No standard C99 headers, function pointers, recursion, variable length arrays, and bit fields
Additions to the language for parallelism Work-items and workgroups
Vector types
Synchronization
Address space qualifiers Optimized image access Built-in functions
38
Address space
•__global – memory allocated from global address space, images are global by default
•__constant – is like global, but read only •__local – memory shared by work-group •__private – private per work-item memory •__read_only – only for images •__write_only – only for images
Kernel args have to be global, constant or local. Can’t assign to different pointer type.
Workgroups
•uint get_work_dim () (1 to 3)
•size_t get_global_size (uint dimindx) •size_t get_global_id (uint dimindx)
•size_t get_local_size (uint dimindx)
•size_t get_local_id (uint dimindx)
•size_t get_num_groups (uint dimindx) •size_t get_group_id (uint dimindx) num_groups * local_size = global_size local_id + group_id * local_size = global_id global_size % local_size = 0
Synchronization
barrier() function. All work-items must reach the barrier before they execute further. It must be encountered by all
work-items in work-group.
Flags: LOCAL_MEM_FENCE, GLOBAL_MEM_FENCE – flush and ensure ordering for local or global memory.
mem_fence(), read_mem_fence(), write_mem_fence() – ensure memory loads and stores ordering within work-item.
Kernel
kernel void square(__global float* input, __global float* output)
{
output[i] = input[i] * input[i]; }
int i = get_global_id(0);
get_global_id(0)
i==11
6
1
1
0
9
2
4
1
1
9
7
6
1
2
2
1
9
8
4
1
9
2
0
0
7
8
input output
36
1
1
0
81
4
16
1
1
81
49
36
1
4
4
1
81
64
16
1
81
4
0
0
49
64
42
Work-Items and Workgroup Functions
get_global_size 26 input
get_local_size 13
get_work_dim
1
6
1
1
0
9
2
4
1
1
9
7
6
1
2
2
1
9
8
4
1
9
2
0
0
7
8
workgroups
get_num_groups
2
get_group_id 0
get_local_id 8 get_global_id 21
43
Data Types
Scalar data types
char , uchar, short, ushort, int, uint, long, ulong bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
Image types
image2d_t, image3d_t, sampler_t
Vector data types
44
Data Types
Portable
Vector length of 2, 4, 8, and 16
char2, ushort4, int8, float16, double2, … Endian safe
Aligned at vector length
Vector operations and built-in functions
45
Vector types
•Vector size N = 2,4,8,16 •charN – cl_charN •ucharN – cl_ucharN •shortN – cl_ushortN •ushortN – cl_ushortN •intN – cl_intN
•uintN – cl_uintN •longN – cl_longN •ulongN – cl_ulongN •halfN – cl_halfN •floatN – cl_floatN
Access using .xyzw for vec2 and vec4, can swizzle, like “a.xy”
Or s0 to sF (0,1,2,3,4,5,6,7,8,9,A,B,C, D,E,F, not case sensitive) like “a.s01”
Or .lo (.odd) and .hi (.even), can chain them like a.lo.lo
7 | OpenCL Tutorial – Language | September 13th, 2009
Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
-7
-7
-7
-7
0
1
2
3
46
Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
• Vector components
vi0.lo = vi1.hi;
-7
-7
-7
-7
0
1
2
3
2
3
-7
-7
47
Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
• Vector components
vi0.lo = vi1.hi;
int8 v8 = (int8)(vi0.s0123, vi1.odd);
-7
-7
-7
-7
0
1
2
3
2
3
-7
-7
2
3
-7
-7
0
1
1
3
48
Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
• Vector components
vi0.lo = vi1.hi;
int8 v8 = (int8)(vi0.s0123, vi1.odd); • Vector ops
vi0 += vi1;
vi0 = abs(vi0);
+
-7
-7
-7
-7
0
1
2
3
2
3
-7
-7
2
3
-7
-7
0
1
1
3
2
3
-7
-7
0
1
2
3
2
4
-5
-4
49
Address Spaces
• Kernel pointer arguments must use global, local, or constant kernel void distance(global float8* stars, local float8* local_stars)
kernel void sum(private int* p) // Illegal because is uses private
• Default address space for arguments and local variables is
private
kernel void smooth(global float* io) { float temp;
…
• image2d_t and image3d_t are always in global address space
kernel void average(read_only global image_t in, write_only image2d_t out)
50
Address Spaces
• Program (global) variables must be in constant address space
constant float bigG = 6.67428E-11;
global float time; // Illegal non constant kernel void force(global float4 mass) { time = 1.7643E18f; }
• Casting between different address spaces is undefined
kernel void calcEMF(global float4* particles) {
float* private_ptr = (float*) particles; // Undefined behavior –
float particle = * private_ptr; // different address }
global float* particle_ptr = (global float*) particles;
51
Conversions
Scalar and pointer conversions follow C99 rules • No implicit conversions for vector types
float4 f4 = int4_vec; // Illegal implicit conversion
• No casts for vector types (different semantics for vectors)
float4 f4 = (float4) int4_vec; // Illegal cast • Casts have other problems
float x;
int i = (int)(x + 0.5f); // Round float to nearest integer
Wrong for:
0.5f – 1 ulp (rounds up not down) negative numbers (wrong answer)
• There is hardware to do it on nearly every machine
52
Conversions
Explict conversions: convert_destType<_saturate><_roundingMode>
– Scalar and vector types – No ambiguity
uchar4 c4 = convert_uchar4_sat_rte(f4);
f4 c4
-5.0f
254.5f
254.6
1.2E9f
0
254
255
255
53
Reinterpret Data: as_typen Reinterpret the bits to another type
Types must be the same size
// f[i] = f[i] < g[i] ? f[i] : 0.0f
int4 is_less = f < g;
f = as_float4(as_int4(f) & is_less);
is_less
ffffffff
& f
OpenCLTM provides a select built-in 54
float4 f, g;
-5.0f
254.5f
254.6f
1.2E9f
f g
254.6f
254.6f
254.6f
254.6f
ffffffff
00000000
00000000
as_int
c0a00000
42fe0000
437e8000
4e8f0d18
c0a00000
42fe0000
00000000
00000000
-5.0f
254.5f
0.0f
0.0f
Built-in Math Functions
IEEE 754 compatible rounding behavior for single precision floating-point
IEEE 754 compliant behavior for double precision floating-point
Defines maximum error of math functions as ULP values
Handle ambiguous C99 library edge cases
Commonly used single precision math functions come in three flavors
eg. log(x)
– Full precision <= 3ulps
– Half precision/faster. half_log—minimum 11 bits of accuracy, <= 8192 ulps – Native precision/fastest. native_log: accuracy is implementation defined
Choose between accuracy and performance 55
Built-in Work-group Functions
kernel read(global int* g, local int* shared) {
if (get_global_id(0) < 5) barrier(CLK_GLOBAL_MEM_FENCE);
else
k = array[0]; }
work-item 0 work-item 6
Illegal since not all work-items encounter barrier
56
Built-in Functions
Integer functions
abs, abs_diff, add_sat, hadd, rhadd, clz, mad_hi, mad_sat, max, min, mul_hi, rotate, sub_sat, upsample
Image functions
read_image[f | i | ui]
write_image[f | i | ui]
get_image_[width | height | depth]
Common, Geometric and Relational Functions Vector Data Load and Store Functions
eg. vload_half, vstore_half, vload_halfn, vstore_halfn, ... 58
Extensions
Atomic functions to global and local memory
add, sub, xchg, inc, dec, cmp_xchg, min, max, and, or, xor
32-bit/64-bit integers
Select rounding mode for a group of instructions at compile
time
For instructions that operate on floating-point or produce floating-point values
#pragma OpenCL_select_rounding_mode rounding_mode
All 4 rounding modes supported
Extension: Check clGetDeviceInfo with CL_DEVICE_EXTENSIONS
59