程序代写代做代考 C kernel chain Part 5: OpenCLTM C Language Features

Part 5: OpenCLTM C Language Features
Built-in Functions
37

OpenCLTM C Language
Derived from ISO C99
 No standard C99 headers, function pointers, recursion, variable length arrays, and bit fields
Additions to the language for parallelism  Work-items and workgroups
 Vector types
 Synchronization
Address space qualifiers Optimized image access Built-in functions
38

Address space
•__global – memory allocated from global address space, images are global by default
•__constant – is like global, but read only •__local – memory shared by work-group •__private – private per work-item memory •__read_only – only for images •__write_only – only for images
Kernel args have to be global, constant or local. Can’t assign to different pointer type.

Workgroups
•uint get_work_dim () (1 to 3)
•size_t get_global_size (uint dimindx) •size_t get_global_id (uint dimindx)
•size_t get_local_size (uint dimindx)
•size_t get_local_id (uint dimindx)
•size_t get_num_groups (uint dimindx) •size_t get_group_id (uint dimindx) num_groups * local_size = global_size local_id + group_id * local_size = global_id global_size % local_size = 0

Synchronization
barrier() function. All work-items must reach the barrier before they execute further. It must be encountered by all
work-items in work-group.
Flags: LOCAL_MEM_FENCE, GLOBAL_MEM_FENCE – flush and ensure ordering for local or global memory.
mem_fence(), read_mem_fence(), write_mem_fence() – ensure memory loads and stores ordering within work-item.

Kernel
kernel void square(__global float* input, __global float* output)
{
output[i] = input[i] * input[i]; }
int i = get_global_id(0);
get_global_id(0)
i==11
6
1
1
0
9
2
4
1
1
9
7
6
1
2
2
1
9
8
4
1
9
2
0
0
7
8
input output
36
1
1
0
81
4
16
1
1
81
49
36
1
4
4
1
81
64
16
1
81
4
0
0
49
64
42

Work-Items and Workgroup Functions
get_global_size 26 input
get_local_size 13
get_work_dim
1
6
1
1
0
9
2
4
1
1
9
7
6
1
2
2
1
9
8
4
1
9
2
0
0
7
8
workgroups
get_num_groups
2
get_group_id 0
get_local_id 8 get_global_id 21
43

Data Types
Scalar data types
char , uchar, short, ushort, int, uint, long, ulong bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
Image types
image2d_t, image3d_t, sampler_t
Vector data types
44

Data Types
Portable
Vector length of 2, 4, 8, and 16
char2, ushort4, int8, float16, double2, … Endian safe
Aligned at vector length
Vector operations and built-in functions
45

Vector types
•Vector size N = 2,4,8,16 •charN – cl_charN •ucharN – cl_ucharN •shortN – cl_ushortN •ushortN – cl_ushortN •intN – cl_intN
•uintN – cl_uintN •longN – cl_longN •ulongN – cl_ulongN •halfN – cl_halfN •floatN – cl_floatN
Access using .xyzw for vec2 and vec4, can swizzle, like “a.xy”
Or s0 to sF (0,1,2,3,4,5,6,7,8,9,A,B,C, D,E,F, not case sensitive) like “a.s01”
Or .lo (.odd) and .hi (.even), can chain them like a.lo.lo
7 | OpenCL Tutorial – Language | September 13th, 2009

Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
-7
-7
-7
-7
0
1
2
3
46

Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
• Vector components
vi0.lo = vi1.hi;
-7
-7
-7
-7
0
1
2
3
2
3
-7
-7
47

Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
• Vector components
vi0.lo = vi1.hi;
int8 v8 = (int8)(vi0.s0123, vi1.odd);
-7
-7
-7
-7
0
1
2
3
2
3
-7
-7
2
3
-7
-7
0
1
1
3
48

Vector Operations
• Vector literal
int4 vi0 = (int4) -7;
int4 vi1 = (int4)(0, 1, 2, 3);
• Vector components
vi0.lo = vi1.hi;
int8 v8 = (int8)(vi0.s0123, vi1.odd); • Vector ops
vi0 += vi1;
vi0 = abs(vi0);
+
-7
-7
-7
-7
0
1
2
3
2
3
-7
-7
2
3
-7
-7
0
1
1
3
2
3
-7
-7
0
1
2
3
2
4
-5
-4
49

Address Spaces
• Kernel pointer arguments must use global, local, or constant kernel void distance(global float8* stars, local float8* local_stars)
kernel void sum(private int* p) // Illegal because is uses private
• Default address space for arguments and local variables is
private
kernel void smooth(global float* io) { float temp;

• image2d_t and image3d_t are always in global address space
kernel void average(read_only global image_t in, write_only image2d_t out)
50

Address Spaces
• Program (global) variables must be in constant address space
constant float bigG = 6.67428E-11;
global float time; // Illegal non constant kernel void force(global float4 mass) { time = 1.7643E18f; }
• Casting between different address spaces is undefined
kernel void calcEMF(global float4* particles) {
float* private_ptr = (float*) particles; // Undefined behavior –
float particle = * private_ptr; // different address }
global float* particle_ptr = (global float*) particles;
51

Conversions
Scalar and pointer conversions follow C99 rules • No implicit conversions for vector types
float4 f4 = int4_vec; // Illegal implicit conversion
• No casts for vector types (different semantics for vectors)
float4 f4 = (float4) int4_vec; // Illegal cast • Casts have other problems
float x;
int i = (int)(x + 0.5f); // Round float to nearest integer
Wrong for:
0.5f – 1 ulp (rounds up not down) negative numbers (wrong answer)
• There is hardware to do it on nearly every machine
52

Conversions
Explict conversions: convert_destType<_saturate><_roundingMode>
– Scalar and vector types – No ambiguity
uchar4 c4 = convert_uchar4_sat_rte(f4);
f4 c4
-5.0f
254.5f
254.6
1.2E9f
0
254
255
255
53

Reinterpret Data: as_typen Reinterpret the bits to another type
Types must be the same size
// f[i] = f[i] < g[i] ? f[i] : 0.0f int4 is_less = f < g; f = as_float4(as_int4(f) & is_less); is_less ffffffff & f OpenCLTM provides a select built-in 54 float4 f, g; -5.0f 254.5f 254.6f 1.2E9f f g 254.6f 254.6f 254.6f 254.6f ffffffff 00000000 00000000 as_int c0a00000 42fe0000 437e8000 4e8f0d18 c0a00000 42fe0000 00000000 00000000 -5.0f 254.5f 0.0f 0.0f Built-in Math Functions IEEE 754 compatible rounding behavior for single precision floating-point IEEE 754 compliant behavior for double precision floating-point Defines maximum error of math functions as ULP values Handle ambiguous C99 library edge cases Commonly used single precision math functions come in three flavors  eg. log(x) – Full precision <= 3ulps – Half precision/faster. half_log—minimum 11 bits of accuracy, <= 8192 ulps – Native precision/fastest. native_log: accuracy is implementation defined  Choose between accuracy and performance 55 Built-in Work-group Functions kernel read(global int* g, local int* shared) { if (get_global_id(0) < 5) barrier(CLK_GLOBAL_MEM_FENCE); else k = array[0]; } work-item 0 work-item 6 Illegal since not all work-items encounter barrier 56 Built-in Functions Integer functions  abs, abs_diff, add_sat, hadd, rhadd, clz, mad_hi, mad_sat, max, min, mul_hi, rotate, sub_sat, upsample Image functions  read_image[f | i | ui]  write_image[f | i | ui]  get_image_[width | height | depth] Common, Geometric and Relational Functions Vector Data Load and Store Functions  eg. vload_half, vstore_half, vload_halfn, vstore_halfn, ... 58 Extensions Atomic functions to global and local memory add, sub, xchg, inc, dec, cmp_xchg, min, max, and, or, xor 32-bit/64-bit integers Select rounding mode for a group of instructions at compile time  For instructions that operate on floating-point or produce floating-point values #pragma OpenCL_select_rounding_mode rounding_mode All 4 rounding modes supported Extension: Check clGetDeviceInfo with CL_DEVICE_EXTENSIONS 59