Overview GPU atomics Atomic (compare and) exchange Summary and next lecture
XJCO3221 Parallel Computation
University of Leeds
Lecture 18: Atomic operations
Previous lectures
Whenever multiple processing units have read-write access to the same memory location, there is the potential for data races:
If at least one unit writes to the memory [Lecture 5].
Can be solved using critical regions guarded by locks/mutexes [Lectures 6 and 7].
Single instructions can also be performed atomically [Lecture 6].
For instance, in OpenMP an atomic instruction looks like:
#pragma omp atomic
count++;
Today's lecture
GPUs also have memory accessible by multiple work items / threads:
Global memory, accessible to all work groups.
Local memory, accessible only within a work group.
There are therefore potential data races and a need for synchronisation.
Today we will see how GPUs support atomic operations in a
similar way to a shared memory CPU.
We will also consider atomic compare and swap.
Atomic operations in general
Worked example: Constructing a histogram
OpenCL support for atomic instructions
Optimisation using local memory
Atomic operations
Definition
An atomic operation is one that is completed without interruption by any other processing unit.
Usually restricted to simple arithmetic operations (addition, subtraction, etc.).
Implemented by a combination of compiler and hardware.
Typically incurs a (much) smaller performance penalty than using locks/mutexes etc.
Load, compute, and store
Consider the following line:
x -= 2;
Even this single instruction performs three sub-operations:
1 Loads the value of x.
2 Performs the computation (i.e. subtracts 2).
3 Stores the updated value.
Two or more processing units might interfere with each other, producing a different result to the serial equivalent.
This could not happen if the operation were atomic.
Suppose x=10 initially, and two processing units A and B both subtract 2 from x. Depending on the scheduler, this may happen:
1 A loads the value of x as 10.
2 B loads the value of x as 10.
3 A performs its computation: 10 - 2 = 8.
4 A stores 8 to memory.
5 B performs its computation: 10 - 2 = 8.
6 B stores 8 to memory.
The result is x = 8, rather than x = 6 as expected.
Constructing a histogram on a GPU
Code on Minerva: histogram.c, histogram.cl, helper.h
We have an array of integers in the range 0 to maxValue-1 inclusive, and want the histogram showing the frequency of each value.
1 Memory allocated on the host and on the device, for both the data and the histogram.
data, hist on the host.
device_data, device_hist on the device.
2 Both initialised on the host and copied to the device.
3 Build, initialise and enqueue a kernel to construct the device histogram.
One work item per data element, e.g. data[i].
4 Copy the histogram back to the host using clEnqueueReadBuffer().
Kernel 1: Direct to global; no atomics
__kernel void histogramNoAtomic( __global int *device_hist,
                                 __global int *device_data,
                                 int maxValue )
{
    int gid = get_global_id(0);

    // Data value.
    int val = device_data[gid];

    // Check range before updating.
    if( val>=0 && val<maxValue )
        device_hist[val]++;        // Data race between work items!
}
// Add node 'a' to the head of a lock-free linked list.
node *a = ... ;                  // Some data structure.

while( true )
{
    node *b = *head;             // Head of list.
    a->next = b;                 // 'a' becomes head.

    // Attempt to atomically update the head pointer.
    // Only update head if not just changed by another
    // work item/thread; else try again.
    if( atomic_cmpxchg(head,b,a)==b ) break;
}
Definition of (compare and) exchange
Implementing a spinlock
Lock-free data structures
if( atomic_cmpxchg(head,b,a)==b ) break;
If only a single thread were involved:
1 atomic_cmpxchg first reads old=*head, i.e. old==b, the first item in the list.
2 Compare-exchange: *head==b, so it changes *head to a.
3 atomic_cmpxchg returns b (the old value), so we break from the while loop.
This is the expected behaviour.
However, in a multi-threaded context:
1 Another thread may change *head from b before the compare-exchange executes.
2 Since now *head!=b, atomic_cmpxchg will not change it.
3 It will return some value !=b, so the loop will try again.
Summary and next lecture
This lecture we have revisited atomic operations with an emphasis on GPUs:
Atomics used to ensure correct updates of memory accessible by multiple work items.
Atomic compare and exchange can be used to implement a spinlock, lock-free data structures, etc.
The next lecture is the last on GPU programming, in which we will look at events and task parallelism.