
Dr Massoud Zolgharni
mzolgharni@lincoln.ac.uk

Room SLB1004, SLB

Dr Grzegorz Cielniak
gcielniak@lincoln.ac.uk

Room INB2221, INB

Week  W/C    Lecture                           Workshop

1     23/01  Introduction                      –
2     30/01  Architectures                     Tutorial-1
3     06/02  Patterns 1
4     13/02  Patterns 2                        Tutorial-2
5     20/02  Patterns 3
6     27/02  Patterns 4                        Tutorial-3
7     06/03  Communication & Synchronisation
8     13/03  Algorithms 1                      Assessment support
9     20/03  Algorithms 2
10    27/03  Algorithms 3
11    03/04  Performance & Optimisation        Tutorial-4 & Demonstration
12    24/04  Parallel Libraries
13    01/05  –

Parallel Software Patterns

Map – THE Parallel Pattern

Stencil – Local Reductions

Patterns are commonly recurring strategies for dealing with problems
• general, so not dependent on a particular platform

• still fairly low-level (not full algorithms)

Patterns are used to implement algorithms
• e.g. sort, search

We will look at both the semantics (how patterns work) and
specific implementations (e.g. exploiting the use of
shared memory)

[Diagram: data elements, tasks, and the dependencies between them]

Map applies a function (Elemental Function) to every
element of data in parallel

• resulting in a new collection of the same shape as the
input

The result does not depend on the order in which the
instances of the elemental function are executed

• no need to synchronise between separate elements of
the map

Foundation of many important applications

[Diagram: input collection → elemental function → output collection]

Task: C = A + B

A, B, C – vectors, N – number of elements

serial implementation in C:

    for (int id = 0; id < N; id++)
        C[id] = A[id] + B[id];

parallel implementation (pseudocode):

    par_for (int id = 0; id < N; id++)
        C[id] = A[id] + B[id];

No dependencies, so each element can be processed separately.

Complexity measures (for N elements)
• work – the total number of operations
• span – the total number of sequential steps

serial
• work – requires N operations, O(n)
• span – requires N steps, O(n)

parallel
• work – requires N operations, O(n)
• span – requires 1 step, assuming infinite computational resources!
• for p parallel processors it is O(n/p)

Problems solved by the Map pattern are called embarrassingly
parallel problems
• straightforward to implement due to the lack of dependencies
between different tasks
• the best type of parallel problem you can think of!

Many important applications
• e.g. brightness/contrast adjustment in image processing

Unary Map: 1 input, 1 output (e.g. 2*x)
N-ary Map: n inputs, 1 output (e.g. x + y)

[Diagram: 2*x maps 3 7 0 1 4 0 0 4 5 3 1 0 to 6 14 0 2 8 0 0 8 10 6 2 0;
x + y maps 3 7 0 1 4 0 0 4 5 3 1 0 and 2 4 2 1 8 3 9 5 5 1 2 1
to 5 11 2 2 12 3 9 9 10 4 3 1]

Map fusion
• often several map operations of the same shape occur in sequence
• vector operations are maps using very simple operations like
addition and multiplication
• a naïve implementation may write each intermediate result to memory

Tutorial 1: C = A * B + B

    C = mult(A, B): C = A * B
    C = add(A, B):  C = A + B

• convert a sequence of maps into a map of sequences
• fuse the operations together to perform them at once inside a single map
• adds arithmetic intensity, reduces memory requirements and bandwidth

Tutorial 1: C = A * B + B

    C = mult_add(A, B): C = A * B + B

SAXPY from the BLAS library: Single-Precision A·X Plus Y
• a combination of scalar multiplication and vector addition
• frequently used in linear algebra operations

serial implementation in C:

    void saxpy_serial(float a, const float* x, float* y, int N) {
        for (int id = 0; id < N; id++)
            y[id] = a*x[id] + y[id];
    }

parallel implementation in OpenCL (the scalar a is passed by value,
so it takes no __global qualifier):

    __kernel void saxpy(float a, __global const float* x, __global float* y) {
        int id = get_global_id(0);
        y[id] = a*x[id] + y[id];
    }

[Diagram: SAXPY executed elementwise with a = 4; each work item has a
Global ID, and two runs with different ID orderings produce the same result]

The order in which each thread/work item is executed is not guaranteed!
• not important for the map pattern, as there are no dependencies

What if N > p (the number of processors)?
• need to partition the input data into smaller parts

• workgroups in OpenCL, thread blocks in CUDA

Each partition is then processed in a sequence

In reality, workgroups are much smaller than the
number of available processors

• good for synchronisation and communication

• in OpenCL, a workgroup is executed on a Compute Unit
(CU)

• if a device has many CUs, many workgroups can be
executed at the same time

Stencil is a generalization of Map in which an
elemental function can access not only a single
element in an input collection but also a set of
“neighbours”

• output depends on a “neighbourhood” of inputs
specified using a set of fixed offsets relative to the
output position

Common in image and signal processing
(convolution), PDE solvers over regular grids

[Diagram: input collection with a highlighted neighbourhood →
elemental function → output collection]

This stencil has 3 elements in the neighbourhood:
id-1, id, id+1

• applies some function to them
• and writes the result to position id of the output array

[Diagram: the neighbourhood id-1, id, id+1 in the input, the
corresponding output position, and the boundary conditions at
the ends of the array]

1D averaging filter (radius 1 = 3 neighbours)
• result is an average of values at the central position and

the immediate neighbours to the left and right.

• a simple smoothing filter for noise filtering

void avg_filter_serial(const float* a, float* b, int N) {

//handle boundary conditions:
//id = 0 and id = N-1

    for (int id = 1; id < N-1; id++)
        b[id] = (a[id-1] + a[id] + a[id+1])/3;
}

serial implementation in C

parallel implementation in OpenCL:

__kernel void avg_filter(__global const float* a, __global float* b) {
    int id = get_global_id(0);

    //handle boundary conditions:
    //id = 0 and id = N-1

    b[id] = (a[id-1] + a[id] + a[id+1])/3;
}

[Diagram: the filter applied to an example input array a,
producing the output array b]

Stencils can be one-dimensional or multi-dimensional
• the shape of the neighbourhood used depends on the application itself

[Diagram: stencil shapes – 1D 3-point stencil; 2D 4-point stencil
(centre cell not used), 5-point stencil (centre cell P is used),
9-point stencil (centre cell C is used); 3D 6-point (7-point) and
24-point (25-point) stencils]

2D example
• A – input array
• R – output array
• apply a stencil operation to the inner square of the form:
  R(i,j) = avg(A(i,j), A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1))

1) average all blue squares (the 5-point neighbourhood)
2) store the result in R
3) repeat steps 1 and 2 for every inner position

[Diagram: a zero-padded input A with values 6, 9, 4, 7; the inner
cells of R are filled step by step with 4.4, 4.0, 3.8 and 3.4]

[Images: original vs filtered; original vs filtered with a larger stencil]

Flexible stencil (type, size and shape)
• provided externally by the host as an input argument

Each input element is read multiple times
• use temporary storage (local memory/cache) for the neighbourhood,
which is much faster to access than normal (global) memory

Boundary handling
• many different ways: ignore/expand borders, etc.
• keep the main loop without boundary checks

Example: 3×3 averaging convolution mask

    0    0.2  0
    0.2  0.2  0.2
    0    0.2  0

Reading
Structured Parallel Programming: Patterns for Efficient Computation
• Chapter 4, Map
Heterogeneous Computing with OpenCL
• Chapters 3 and 4, Hardware and OpenCL Examples