PowerPoint Presentation
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk
Room SLB1004, SLB
Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk
Room INB2221, INB
Week W/C Lecture Workshop
1 23/01 Introduction –
2 30/01 Architectures
Tutorial-1
3 06/02 Patterns 1
4 13/02 Patterns 2
Tutorial-2
5 20/02 Patterns 3
6 27/02 Patterns 4
Tutorial-3
7 06/03 Communication & Synchronisation
8 13/03 Algorithms 1
Assessment
support
9 20/03 Algorithms 2
10 27/03 Algorithms 3
11 03/04 Performance & Optimisation
Tutorial-4 &
Demonstration
12 24/04 Parallel Libraries
13 01/05 –
Parallel Software Patterns
Map – THE Parallel Pattern
Stencil – Local Reductions
Commonly recurring strategies to deal with problems
• general, so not dependent on particular platform
• still fairly low-level (not full algorithms)
Are used to implement algorithms
• sort, search
We will look at both semantic (how they work) and
specific implementation (e.g. exploiting the use of
shared memory)
Data Task
Dependency
Map applies a function (Elemental Function) to every
element of data in parallel
• resulting in a new collection of the same shape as the
input
Result does not depend on the order in which
various instances of elemental function are
executed
• no need to synchronize between separate elements of
map
Foundation of many important applications
Input
Elemental Function
Output
Task: C = A + B
A, B, C – vectors, N – number of elements
for (int id = 0; id < N; id++)
C[id] = A[id] + B[id];
par_for (int id = 0; id < N; id++)
C[id] = A[id] + B[id];
serial parallel
No dependencies so each
element can be processed
separately
Complexity measures (for N elements)
• work – a total number of operations
• span – a total number of sequential steps
serial parallel
Work – requires N operations, O(n)
Span – requires N steps, O(n)
Work – requires N operations, O(n)
Span – requires 1 step assuming
infinite computational resources!
• for p parallel processors it
is O(n/p)
Problems solved by the Map pattern are called
embarrassingly parallel problems
• straightforward to implement due to lack of dependency
between different tasks
• the best type of parallel problems you can think of!
Many important applications
• e.g. brightness/contrast adjustment in image processing
Unary Map: 1 input, 1 output
N-array Map: n input, 1 output
2*x
x + y
3 7 0 1 4 0 0 4 5 3 1 0
6 14 0 2 8 0 0 8 10 6 2 0
3 7 0 1 4 0 0 4 5 3 1 0
5 11 2 2 12 3 9 9 10 4 3 1
2 4 2 1 8 3 9 5 5 1 2 1
• often several map operations
of same shape occur in
sequence
• vector operations are maps
using very simple operations
like addition and
multiplication
• a naïve implementation may
write each intermediate
result to memory
Tutorial 1: C = A * B + B
C = mult(A, B): C = A * B
C = add(A, B): C = A + B
• convert a sequence of maps
into a map of sequences
• fuse together the operations
to perform them at once
inside a single map
• adds arithmetic intensity,
reduces memory
requirements and bandwidth
Tutorial 1: C = A * B + B
C = mult_add(A, B): C = A * B + B
SAXPY from BLAS library: Single-Precision A·X Plus Y
• a combination of scalar multiplication and vector
addition
• frequently used in linear algebra operations
void saxpy_serial(float a, const float* x, float* y, int N) {
for (int id = 0; id < N; id++)
y[id] = a*x[id] + y[id];
}
serial implementation in C
SAXPY from BLAS library: Single-Precision A·X Plus Y
• a combination of scalar multiplication and vector
addition
• frequently used in linear algebra operations
__kernel void saxpy(__global float a, __global const float* x,
__global float* y) {
int id = get_global_id(0);
y[id] = a*x[id] + y[id];
}
parallel implementation in OpenCL
a 4 4 4 4 4 4 4 4 4 4 4 4
11 23 8 5 36 12 36 49 50 7 9 4y[id]
y[id] 3 7 0 1 4 0 0 4 5 3 1 0
x[id] 2 4 2 1 8 3 9 5 5 1 2 1
*
+
0 1 2 3 4 5 6 7 8 9 10 11Global ID
a 4 4 4 4 4 4 4 4 4 4 4 4
11 23 8 5 36 12 36 49 50 7 9 4y[id]
y[id] 3 7 0 1 4 0 0 4 5 3 1 0
x[id] 2 4 2 1 8 3 9 5 5 1 2 1
*
+
4 1 3 5 8 2 6 10 11 9 0 7Global ID
The order in which each thread/work item is executed is not guaranteed!
Not important for the map pattern as there are no dependencies.
What if N > p (number of processors)?
• need to partition the input data into smaller parts
• workgroups in OpenCL, thread blocks in CUDA
Each partition is then processed in a sequence
In reality workgroups are much smaller than a
number of available processors
• good for synchronisation and communication
• in OpenCL workgroup is executed on a Compute Unit
(CU)
• if a device has many CUs, many workgroups can be
executed at the same time
Stencil is a generalization of Map in which an
elemental function can access not only a single
element in an input collection but also a set of
“neighbours”
• output depends on a “neighbourhood” of inputs
specified using a set of fixed offsets relative to the
output position
Common in image and signal processing
(convolution), PDE solvers over regular grids
Input
Elemental Function
Output
Neighbourhood
This stencil has 3
elements in the
neighbourhood:
id-1, id, id+1
idid-1 id+1
Applies some
function to them…
And outputs to
the ith position of
the output array
Neighbourhood
boundary
conditions
1D averaging filter (radius 1 = 3 neighbours)
• result is an average of values at the central position and
the immediate neighbours to the left and right.
• a simple smoothing filter for noise filtering
void avg_filter_serial(const float* a, float* b, int N) {
//handle boundary conditions:
//id = 0 and id = N-1
for (int id = 1; id < N-1; id++) b[id] = (a[id-1] + a[id] + a[id+1])/3; } serial implementation in C 1D averaging filter (radius 1 = 3 neighbours) • result is an average of values at the central position and the immediate neighbours to the left and right. • a simple smoothing filter for noise filtering __kernel void avg_filter(__global const float* a, __global float* b) { int id = get_global_id(0); //handle boundary conditions: //id = 0 and id = N-1 b[id] = (a[id-1] + a[id] + a[id+1])/3; } parallel implementation in OpenCL b a 2 7 3 2 10 0 Stencils can be one dimensional or multidimensional 1D Stencil nD Stencil 5-point stencil 9-point stencil4-point stencil center cell (P) is used as well center cell (C) is used as well center cell is not used shape of the neighbourhood used depends on the application itself 6-point stencil (7-point stencil) 24-point stencil (25-point stencil) • A – input array • R – output array • Apply a stencil operation to the inner square of the form: R(i,j) = avg(A(i,j), A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)) 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 1) Average all blue squares 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 1) Average all blue squares 2) Store result in R 0 0 0 0 0 0 0 0 0 0 00 0 4.4 0 0 R 1) Average all blue squares 2) Store result in R 3) Repeat 1 and 2 for all green squares 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 0 0 0 0 0 0 0 0 0 0 00 0 4.4 0 4.0 RA 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 0 0 0 0 0 0 0 0 0 0 00 3.8 4.4 0 4.0 R 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 0 0 0 0 0 0 0 0 0 0 00 3.8 4.4 3.4 4.0 A R Original Filtered Original Filtered (larger stencil) Flexible stencil (type and size and shape) • provided externally by the host as an input argument Each input element is read multiple times • use temporary storage (local memory/cache) to store neighbourhood, which is much faster to access than the normal (global) memory Boundary handling • many different ways: ignore/expand borders, etc. • keep the main loop without boundary checks avg 0 0.2 0 0.2 0.2 0.2 0 0.2 0 convolution mask Structured parallel programming: patterns for efficient computation • Chapter 4, Map Heterogeneous computing with OpenCL • Chapter 3 and 4, Hardware and OpenCL Examples