Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk
Room SLB1004, SLB
Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk
Room INB2221, INB
Week  W/C    Lecture                           Workshop
1     23/01  Introduction                      –
2     30/01  Architectures                     Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                        Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                        Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                      Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation        Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –
Work items/threads communicate through memory
• see stencil, reduce or histogram examples
In the ideal case, we would like a single type of memory which is large, cheap and fast.
In reality:
• large = slow/expensive
• cheap = small/slow
• fast = small/expensive
Hence the need for different types of memory arranged in a hierarchy, from slow/large to fast/small.
Different access speeds and storage capacities: the higher a level sits in the hierarchy, the faster, smaller and more expensive it is; the lower, the cheaper, bigger and slower.
(diagram: hierarchy levels matched to hardware – Processing Element (PE), Compute Unit (CU), Device)
Private memory (very fast) is only accessible by a single work item
• registers, ~10–100 B
Local memory (fast) is accessible by all work items within a single work group
• user-accessible cache, ~KB–MB
Global memory (slow) is accessible by threads from all work groups
• DRAM, ~GB
Constant memory (fast) is also accessible by all threads
• part of global memory
• not writable, relatively small, ~KB
Minimise time spent on memory operations
• Move frequently accessed data to a faster memory: host >> global >> local >> private
• Single or sporadic accesses to memory do not benefit from this approach
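As a host-side illustration of moving frequently accessed data up the hierarchy (a minimal C sketch, not from the slides; the function names are invented): hoisting a repeatedly read value into a local variable plays the same role as copying from slow global memory into fast private/register memory.

```c
/* Illustrative only: 'coeff[0]' is read once into a local variable and
   kept in a register, instead of being fetched from (slow) memory on
   every loop iteration. The same idea motivates copying hot data from
   global to private memory in a kernel. */
long scale_sum(const int *data, const int *coeff, int n) {
    int c = coeff[0];                 /* one slow read, then fast reuse */
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long)data[i] * c;     /* register access inside the loop */
    return sum;
}
```

A single or sporadic access, by contrast, would gain nothing from the copy.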
• Once data is transferred from host to device, it is stored in global device memory
• Any data transferred in the opposite direction is also stored in global memory
• The keyword __global (two underscores!) indicates that data is stored in global memory
__kernel void foo( __global float *A )
OpenCL memory model
Memory Type  OpenCL Keyword  Scope
Global       __global        Kernel-wide
Constant     __constant      Kernel-wide
Local        __local         Work-group-wide
Private      __private       Work-item-wide
1D averaging filter (range = 1)
• each input value is read 3 times from global
memory!
__kernel void avg_filter(__global const int* A, __global int* B) {
    int id = get_global_id(0);
    B[id] = (A[id - 1] + A[id] + A[id + 1]) / 3;
}
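A quick host-side reference in plain C (not from the slides) makes the kernel's behaviour easy to check: it computes the same average per element and, like the kernel, handles interior elements only.

```c
/* Host-side reference for the 1D averaging filter (range = 1).
   Mirrors what each work item computes: B[id] = (A[id-1]+A[id]+A[id+1])/3.
   Note that, summed over all ids, every input value is read 3 times. */
void avg_filter_ref(const int *A, int *B, int n) {
    for (int id = 1; id < n - 1; id++)        /* interior elements only */
        B[id] = (A[id - 1] + A[id] + A[id + 1]) / 3;
}
```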
(diagram: three adjacent elements of A in global memory feed each averaging operation; each result is written to B in global memory)
Make a local copy of the input to make accesses faster
__kernel void avg_filter(__global const int* A, __global int* B,
                         __local int* scratch) {
    int id = get_global_id(0);
    int lid = get_local_id(0);
    scratch[lid] = A[id];
    B[id] = (scratch[lid-1] + scratch[lid] + scratch[lid+1]) / 3;
}
(diagram: each work item first copies one element of A from global memory into local scratch; the averaging operation then reads three adjacent scratch elements and writes the result to B in global memory)
The copy operation has to
finish before the next one
can commence – need
synchronisation!
• Race conditions occur when 2+ work items attempt to access
the same memory location concurrently and at least one access
is a write
• Race conditions may produce unexpected, seemingly
arbitrary results
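To make the race concrete, here is a host-side C sketch (assuming POSIX threads and C11 atomics; not part of the slides): several threads increment both a plain counter and an atomic counter. The plain read-modify-write can lose updates, while the atomic one reliably reaches the expected total.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Race-condition demo: NTHREADS threads each do ITERS increments.
   'plain_counter++' is an unsynchronised read-modify-write, so
   concurrent updates can be lost; atomic_fetch_add serialises each
   increment (the host-side analogue of OpenCL's atomic_inc). */
enum { NTHREADS = 4, ITERS = 100000 };

static int plain_counter;          /* racy - final value is unpredictable */
static atomic_int atomic_counter;  /* each increment is indivisible       */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        plain_counter++;                      /* read-modify-write race */
        atomic_fetch_add(&atomic_counter, 1); /* race-free              */
    }
    return 0;
}

int run_counters(void) {           /* returns the atomic total */
    pthread_t t[NTHREADS];
    plain_counter = 0;
    atomic_store(&atomic_counter, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], 0, worker, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], 0);
    return atomic_load(&atomic_counter);
}
```

Only the atomic total is deterministic; the plain counter may come out lower on any given run, which is exactly the "seemingly arbitrary results" described above.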
Accesses to shared locations need to be correctly
synchronised/coordinated to avoid race conditions
Different types of synchronisation mechanisms:
• Barriers/memory fences
• Atomic operations
• Separate kernel launches
They all impact speed, so it is always most efficient to design algorithms that avoid synchronisation whenever possible.
Barriers ensure that all work items within the same
work group reach the same point
• intra-group synchronisation
• barriers block the execution so affect the speed!
Global and local memory barriers – local barriers have a lower overhead
barrier(CLK_LOCAL_MEM_FENCE);
barrier(CLK_GLOBAL_MEM_FENCE);
__kernel void avg_filter(__global const int* A, __global int* B,
                         __local int* scratch) {
    int id = get_global_id(0);
    int lid = get_local_id(0);
    scratch[lid] = A[id];
    barrier(CLK_LOCAL_MEM_FENCE);
    B[id] = (scratch[lid-1] + scratch[lid] + scratch[lid+1]) / 3;
}
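A host-side analogue of this barrier pattern can be sketched with POSIX threads (hypothetical, not from the slides; unlike the kernel, the edges are clamped so the sketch is self-contained). Each "work item" copies its input into shared scratch, all threads wait at a barrier, and only then does each read its neighbours, so no thread can see a slot that has not been written yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/* Host-side analogue of the local-memory barrier in avg_filter:
   copy phase -> barrier -> read phase, within one "work group". */
enum { GROUP = 4 };

static int A_in[GROUP] = {3, 6, 9, 12};
static int scratch[GROUP];
static int B_out[GROUP];
static pthread_barrier_t bar;

static void *work_item(void *arg) {
    int lid = (int)(long)arg;
    scratch[lid] = A_in[lid];                  /* copy phase           */
    pthread_barrier_wait(&bar);                /* all copies done here */
    int l = scratch[lid > 0 ? lid - 1 : lid];  /* clamp at the edges   */
    int r = scratch[lid < GROUP - 1 ? lid + 1 : lid];
    B_out[lid] = (l + scratch[lid] + r) / 3;   /* read phase           */
    return 0;
}

void run_group(void) {
    pthread_t t[GROUP];
    pthread_barrier_init(&bar, 0, GROUP);      /* all GROUP must arrive */
    for (long i = 0; i < GROUP; i++)
        pthread_create(&t[i], 0, work_item, (void *)i);
    for (int i = 0; i < GROUP; i++)
        pthread_join(t[i], 0);
    pthread_barrier_destroy(&bar);
}
```

Remove the `pthread_barrier_wait` call and a thread may read its neighbour's scratch slot before it has been written, which is the race the kernel's `barrier` prevents.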
(diagram: as before, but with a barrier between the copy operation into local scratch and the averaging operation, so no work item reads scratch before all copies have completed)
Avoid barriers in conditional statements (IF) – a barrier should always apply to all work items in the group, otherwise deadlock!
__kernel void reduce_add_2(__global int* A) {
    int id = get_global_id(0);
    int N = get_global_size(0);
    for (int i = 1; i < N; i *= 2) {
        if (id % (i*2) == 0)
            A[id] += A[id+i];
        barrier(CLK_GLOBAL_MEM_FENCE);  // outside the IF: every work item reaches it
    }
}

With modern GPU/CPU hardware it is impossible to synchronise different work groups. The only way to do that is by writing and launching separate kernels.

Atomic functions resolve the conflict by serialising accesses to a variable:
• they provide a mechanism for atomic (i.e. without interruption) memory operations
• they guarantee race-free execution
• all atomic updates are performed serially, so there is a performance penalty
• the order is unspecified – they can be used with associative and commutative operators only

Implementation using atomics:
__kernel void hist_atomic(__global const int* A, __global int* H) {
    int id = get_global_id(0);
    atomic_inc(&H[bin_index(A[id])]);
}

Implementation using barriers (very inefficient!):
__kernel void hist_barriers(__global const int* A, __global int* H) {
    int id = get_global_id(0);
    int N = get_global_size(0);
    for (int i = 0; i < N; i++) {
        if (id == i)
            H[bin_index(A[id])]++;
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Different types of atomic operations:
• arithmetic: atomic_add, atomic_sub, atomic_inc, atomic_dec
• bitwise operations: atomic_and, atomic_or, atomic_xor
• min/max operators: atomic_min, atomic_max
• swap operators: atomic_xchg, atomic_cmpxchg
OpenCL 2.0 introduces many improvements and new atomic functions: see https://software.intel.com/en-us/articles/using-opencl-20-atomics

The fewer memory accesses, the better.

Global memory
• Whenever a memory fetch in __global space is requested from a kernel, the GPU reads a minimum number of elements: when reading a single value we also get a whole block of data (e.g. 4 int values) – spatial locality
• Coalesced memory access is data access in a sequential pattern.
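This block-fetch behaviour can be modelled with a toy host-side C sketch (illustrative only, not from the slides): assume memory is fetched in 4-element blocks and count how many distinct blocks a set of in-order accesses touches for a given stride.

```c
/* Toy coalescing model: global memory is fetched in BLOCK-element
   chunks, so we count the distinct blocks touched by n_items in-order
   accesses at a given stride. Stride 1 packs BLOCK items per fetch;
   larger strides need more fetches for the same number of items. */
enum { BLOCK = 4 };

int blocks_touched(int n_items, int stride) {
    int count = 0, last_block = -1;
    for (int i = 0; i < n_items; i++) {
        int block = (i * stride) / BLOCK;  /* block hit by this access   */
        if (block != last_block) {         /* accesses are in order, so  */
            count++;                       /* a new block = one more     */
            last_block = block;            /* memory transaction         */
        }
    }
    return count;
}
```

For 16 items, stride 1 needs 4 fetches, stride 2 needs 8, and any stride of BLOCK or more needs one fetch per item, so the benefit of blocking is gone entirely.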
This means that work items should access adjacent memory locations.
• When another value is requested from the same block, no additional memory access is required.

__kernel void coalesced(__global float* v) {
    int i = get_global_id(0);
    float val = v[i];
}

__kernel void non_coalesced(__global float* v) {
    int i = get_global_id(0);
    float val = v[4*i];
}

(diagram: a block of 4 values accessed with stride 1 serves 4 work items; with stride 2 the same block serves only 2)
• the "stride" affects the access pattern
• if the stride is larger than the block size, the benefits of blocking are gone

Two-dimensional data is allocated in global memory in row-major order: work items traversing a row access contiguous locations (coalesced), while work items traversing a column access strided locations (non-coalesced).

Quiz: which statements have a coalesced access pattern?

__kernel void foo(__global float *g) {
    float a = 3.14;
    int i = get_global_id(0);
    g[i] = a;                         // coalesced: each work item writes the location given by its own
                                      //   index, so the group touches adjacent, contiguous locations
    g[i*2] = a;                       // NOT coalesced: index times 2 – strided access
    a = g[i];                         // coalesced: same as the first statement, but a read instead of a write
    a = g[BLOCK_WIDTH/2 + i];         // coalesced: same as the read above, but starting at an offset
    g[i] = a * g[BLOCK_WIDTH/2 + i];  // coalesced read followed by a coalesced write
    g[BLOCK_WIDTH - 1 + i] = a;       // coalesced: still accessing a contiguous region of memory
}
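Finally, looking back at the synchronisation section, the tree-reduction pattern of reduce_add_2 can be reproduced serially on the host (a sketch assuming a power-of-two input size, as the kernel implicitly requires; the serial per-stride loop stands in for the global barrier between steps).

```c
/* Host-side sketch of the reduce_add_2 access pattern: at each stride i,
   every "work item" whose id is a multiple of 2*i adds the element i
   positions away. After log2(n) doubling steps the total is in A[0].
   Running each stride's inner loop to completion before the next stride
   plays the role of the barrier between steps. */
void reduce_add_ref(int *A, int n) {
    for (int i = 1; i < n; i *= 2)        /* doubling stride          */
        for (int id = 0; id < n; id++)    /* "all work items"         */
            if (id % (i * 2) == 0)        /* active items this step   */
                A[id] += A[id + i];
}
```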