
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk

Room SLB1004, SLB

Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk

Room INB2221, INB

Week  W/C    Lecture                            Workshop
1     23/01  Introduction                       –
2     30/01  Architectures                      Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                         Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                         Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                       Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation         Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –

Work items/threads communicate through memory
• see stencil, reduce or histogram examples

In an ideal case, we would like a single type of
memory which is large, cheap and fast

In reality
• large = slow/expensive
• cheap = small/slow
• fast = small/expensive

We therefore need different types of memory arranged in a
hierarchy: from slow/large to fast/small

Different access speeds and storage capacities:

• up the hierarchy: faster, smaller, more expensive

• down the hierarchy: cheaper, bigger, slower

The memory levels correspond to the OpenCL device hierarchy:

• Processing Element (PE) – private memory

• Compute Unit (CU) – local memory

• Device – global/constant memory

Private memory (very fast) is only accessible by a single work
item

• registers, ~10–100 B

Local memory (fast) is accessible by
all work items within a single work group

• user-accessible cache, ~KB–MB

Global memory (slow) is accessible by
work items from all work groups

• DRAM, ~GB

Constant memory (fast) is also accessible
by all work items

• part of global memory
• not writable from kernels, relatively small, ~KB
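
As a hedged illustration of these address spaces (the kernel name and arguments below are hypothetical, not taken from the slides), a single kernel can use all four qualifiers:

__kernel void spaces_demo(__global const float* A,
                          __global float* B,
                          __constant float* scale,
                          __local float* scratch) {
	// A/B live in global memory, scale in constant memory,
	// scratch in local memory, and tmp in private memory (registers).
	int id  = get_global_id(0);
	int lid = get_local_id(0);

	float tmp = A[id] * scale[0];   // __private by default
	scratch[lid] = tmp;             // visible to the whole work group
	B[id] = scratch[lid];           // written back to global memory
}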

Minimise time spent on memory operations

Move frequently accessed data to
a faster memory

host >> global >> local >> private (slowest to fastest)

Single or sporadic accesses to
memory do not benefit from
this approach

• Once data is transferred from the host
to the device, it is stored in global
device memory

• Any data transferred in the opposite
direction is also stored in global memory

• The keyword __global (two
underscores!) indicates that the
data is stored in global memory

__kernel void foo( __global float *A )
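
A minimal host-side sketch (buffer and variable names are hypothetical, and the context, queue and kernel objects are assumed to exist) of how data ends up in global memory before such a kernel runs, using the standard OpenCL C API:

// Sketch: allocate a device buffer (global memory) and copy host data into it.
float host_A[1024];
cl_int err;

cl_mem dev_A = clCreateBuffer(context, CL_MEM_READ_ONLY,
                              sizeof(host_A), NULL, &err);

err = clEnqueueWriteBuffer(queue, dev_A, CL_TRUE, 0,
                           sizeof(host_A), host_A, 0, NULL, NULL);

// The buffer is then passed to the kernel's __global parameter.
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_A);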

OpenCL memory model

Memory Type   OpenCL Keyword   Scope
Global        __global         Kernel-wide
Constant      __constant       Kernel-wide
Local         __local          Work-group-wide
Private       __private        Work-item-wide

1D averaging filter (range = 1)
• each input value is read 3 times from global memory!

__kernel void avg_filter(__global const int* A, __global int* B) {
	int id = get_global_id(0);

	// boundary handling omitted for clarity
	B[id] = (A[id - 1] + A[id] + A[id + 1]) / 3;
}

[Diagram: each work item (T) reads three adjacent values (D) from A in global memory, performs the average operation, and writes one value to B in global memory]

Make a local copy of the input to make accesses faster

__kernel void avg_filter(__global const int* A, __global int* B,
                         __local int* scratch) {
	int id = get_global_id(0);
	int lid = get_local_id(0);

	scratch[lid] = A[id];

	B[id] = (scratch[lid - 1]
	       + scratch[lid]
	       + scratch[lid + 1]) / 3;
}
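
On the host side, the __local buffer is not created with clCreateBuffer; it is reserved by passing a size and a NULL pointer for the corresponding kernel argument. A sketch, assuming the argument order of the listing above and pre-built kernel and queue objects (sizes are illustrative):

// Sketch: reserve local memory for the scratch argument (index 2).
// Passing NULL with a size allocates uninitialised local memory per work group.
size_t local_size = 256;
cl_int err = clSetKernelArg(kernel, 2, local_size * sizeof(cl_int), NULL);

// Launch with a matching work-group size.
size_t global_size = 1024;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);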

[Diagram: each work item (T) first performs a copy operation of one value (D) from A in global memory into scratch in local memory, then reads three adjacent scratch values to perform the average operation and writes the result to B in global memory]

The copy operation has to
finish before the next one
can commence – need
synchronisation!

• Race conditions occur when 2+ work items attempt to access
the same memory location concurrently and at least one access
is a write

• Race conditions may produce unexpected, seemingly
arbitrary results – see the sketch below
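
A hedged sketch of such a race (the kernel name is hypothetical): every work item increments the same counter without any synchronisation, so updates overwrite each other and the final value is usually far below the number of work items.

__kernel void racy_count(__global int* counter) {
	// Unsynchronised read-modify-write: two work items may read the same
	// old value and both write back old+1, losing one increment.
	counter[0] = counter[0] + 1;
}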

Accesses to shared locations need to be correctly
synchronised/coordinated to avoid race conditions

Different types of synchronisation mechanisms:
• Barriers/memory fences

• Atomic operations

• Separate kernel launches (see the sketch after this list)
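
A hedged host-side sketch of the last mechanism (kernel and queue names hypothetical): two kernels enqueued back-to-back on an in-order command queue form a device-wide synchronisation point, because the second kernel only starts after the first has completed.

// Sketch: on an in-order queue, kernel_b sees all global-memory writes
// made by kernel_a - effectively a global synchronisation point.
size_t global_size = 1024;
cl_int err;

err = clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL,
                             &global_size, NULL, 0, NULL, NULL);
err = clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL,
                             &global_size, NULL, 0, NULL, NULL);

clFinish(queue);   // wait on the host for both kernels to complete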

They all impact speed, so it is always most efficient to
design algorithms that avoid synchronisation
whenever possible

Barriers ensure that all work items within the same
work group reach the same point
• intra-group synchronisation

• barriers block execution, so they affect speed!

There are global and local memory barriers – local barriers have
lower overhead

barrier(CLK_LOCAL_MEM_FENCE);

barrier(CLK_GLOBAL_MEM_FENCE);

__kernel void avg_filter(__global const int* A, __global int* B,
                         __local int* scratch) {
	int id = get_global_id(0);
	int lid = get_local_id(0);

	scratch[lid] = A[id];

	barrier(CLK_LOCAL_MEM_FENCE);

	B[id] = (scratch[lid - 1]
	       + scratch[lid]
	       + scratch[lid + 1]) / 3;
}

[Diagram: as before, each work item (T) performs a copy operation of one value (D) from A in global memory into scratch in local memory; a barrier separates the copy operation from the average operation, whose result is written to B in global memory]

Avoid barriers inside conditional statements (IF) – a barrier
must always be reached by all work items in the group,
otherwise deadlock!

__kernel void reduce_add_2(__global int* A) {
	int id = get_global_id(0);
	int N = get_global_size(0);

	for (int i = 1; i < N; i *= 2) {
		if (id % (i*2) == 0) {
			A[id] += A[id + i];
			// barrier(CLK_GLOBAL_MEM_FENCE);  // WRONG: inside the IF, only
			//                                 // some work items reach it – deadlock!
		}
		barrier(CLK_GLOBAL_MEM_FENCE);  // correct: reached by all work items
	}
}

With modern GPU/CPU hardware it is impossible to synchronise
different work groups within a kernel

The only way to do that is by writing and launching
separate kernels

Atomic functions resolve such conflicts by serialising
accesses to a variable

Atomic functions provide a mechanism for atomic
(i.e. without interruption) memory operations

• guarantee race-free execution
• all atomic updates are performed serially, so there is a
  performance penalty
• order is unspecified – can be used with associative and
  commutative operators only

Implementation using atomics

__kernel void hist_atomic(__global const int* A, __global int* H) {
	int id = get_global_id(0);
	atomic_inc(&H[bin_index(A[id])]);
}

Implementation using barriers – very inefficient!

__kernel void hist_barriers(__global const int* A, __global int* H) {
	int id = get_global_id(0);
	int N = get_global_size(0);

	for (int i = 0; i < N; i++) {
		if (id == i)
			H[bin_index(A[id])]++;
		barrier(CLK_GLOBAL_MEM_FENCE);
	}
}

Different types of atomic operations:
• arithmetic: atomic_add, atomic_sub, atomic_inc, atomic_dec
• bitwise operations: atomic_and, atomic_or, atomic_xor
• min/max operators: atomic_min, atomic_max
• swap operators: atomic_xchg, atomic_cmpxchg

OpenCL 2.0 introduces many improvements and new atomic functions:
see https://software.intel.com/en-us/articles/using-opencl-20-atomics

The fewer memory accesses, the better

Global memory
• Whenever a memory fetch in __global space is requested from a
  kernel, the GPU reads a minimum number of elements – when
  reading a single value we also get a block of data
  (e.g. 4 int values): spatial locality
• Coalesced memory access means accessing data in a sequential
  pattern: work items should access adjacent memory locations
• When another value is requested from the same block, no
  additional memory access is required

__kernel void coalesced(__global float* v) {
	int i = get_global_id(0);
	float val = v[i];      // stride 1: adjacent work items read adjacent elements
}

__kernel void non_coalesced(__global float* v) {
	int i = get_global_id(0);
	float val = v[4*i];    // stride 4: adjacent work items read distant elements
}

[Diagram: work items (T) reading blocks of 4 values (D) – stride 1 vs stride 2]
• the stride affects the access pattern
• if the stride is larger than the block size, the benefits of
  blocking are gone

[Diagram: two-dimensional array allocated in global memory, elements
stored in row-major order – one traversal order is non-coalesced,
the other is coalesced]
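A hedged sketch of the row-major case (the kernel, array and parameter names are hypothetical): making consecutive work items vary the column index gives coalesced access, while varying the row index instead would stride through memory.

__kernel void copy_row_major(__global const float* in,
                             __global float* out,
                             int width) {
	// 2D array stored in row-major order: element (row, col) sits at
	// index row * width + col.
	int col = get_global_id(0);   // fastest-varying index across work items
	int row = get_global_id(1);

	// Coalesced: adjacent work items (in dimension 0) touch adjacent addresses.
	out[row * width + col] = in[row * width + col];

	// A non-coalesced alternative would swap the roles of row and col,
	// e.g. in[col * width + row], striding by 'width' elements per work item.
}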
Quiz: which statements have a coalesced access pattern?

__kernel void foo(__global float* g) {
	float a = 3.14;
	int i = get_global_id(0);

	g[i] = a;                          // (1)
	g[i*2] = a;                        // (2)
	a = g[i];                          // (3)
	a = g[BLOCK_WIDTH/2 + i];          // (4)
	g[i] = a * g[BLOCK_WIDTH/2 + i];   // (5)
	g[BLOCK_WIDTH - 1 + i] = a;        // (6)
}

(1) coalesced – every work item accesses a location defined by its
    index: a set of work items accesses adjacent, contiguous locations
(2) not coalesced – every work item accesses a location defined by its
    index times 2: strided access
(3) coalesced – same as (1), but a read instead of a write
(4) coalesced – same as (3), but the locations start at an offset
(5) coalesced – a coalesced read followed by a coalesced write
(6) coalesced – still accessing a contiguous region of memory