
Parallel Computing with GPUs: Parallel Patterns
Dr Paul Richmond

http://paulrichmond.shef.ac.uk/teaching/COM4521/

Parallel Patterns Overview

Reduction

Scan

What are Parallel Patterns?

Parallel patterns are high-level building blocks that can be used to create algorithms

Implementation is abstracted to give a higher-level view

Patterns describe techniques suited to parallelism
Allows algorithms to be built with parallelism from the ground up
A top-down approach might not parallelise very easily…

Consider the simplest parallel pattern: Map
Takes an input list i
Applies a function f
Writes the result list o by applying f to all members of i
Equivalent to a CUDA kernel where i and o are memory locations determined by threadIdx etc.
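As an illustration, a minimal map sketch might look like the following (the element type, the function f and the kernel name are placeholders for illustration, not part of the slides):

// Map sketch: every thread applies f to one element of i and writes one element of o
__device__ float f(float x){
    return x * 2.0f;                              // placeholder function
}

__global__ void map_kernel(const float *i, float *o, unsigned int n){
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                                  // guard the final partial block
        o[idx] = f(i[idx]);
}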

Gather

Multiple inputs and single coalesced output

Might have sequential loading or random access
This affects memory performance

Differs from map due to multiple inputs

[Figure: gather operation. Each thread (ThreadIdx.x) reads from a number of memory values/locations; one of the two examples shows a random access load.]
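A gather might be sketched as below (the index arrays and names are illustrative assumptions); the point is that each thread performs several reads but a single, coalesced write:

// Gather sketch: each thread reads two (possibly non-contiguous) inputs and
// writes one coalesced output element
__global__ void gather_kernel(const float *in, const unsigned int *idx_a,
                              const unsigned int *idx_b, float *out, unsigned int n){
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        // the two loads may be sequential or random access; this affects memory performance
        out[i] = in[idx_a[i]] + in[idx_b[i]];
    }
}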

Scatter

Reads from a single input and writes to one or many

Can be implemented in CUDA using atomics

Write pattern will determine performance

[Figure: scatter operation. Each thread (ThreadIdx.x) writes to a number of memory values/locations; one example shows random access writes, the other shows a collision where two threads write to the same location.]
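A scatter might be sketched as below (again the index array is an illustrative assumption); atomicAdd is one way to make colliding writes safe, at a performance cost:

// Scatter sketch: each thread reads one input element and writes it to a
// location chosen by an index array; atomicAdd resolves write collisions
__global__ void scatter_kernel(const float *in, const unsigned int *idx,
                               float *out, unsigned int n){
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        atomicAdd(&out[idx[i]], in[i]);           // random access (possibly colliding) write
    }
}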

Other Parallel Patterns

Stencil
Gather using a fixed pattern, usually based on locality

See the 2D shared memory examples

Reduce (this lecture)
Reduce a set of values to a single value or a set of key-value pairs

Combined with Map to form Map Reduce (often with an intermediate shuffle or sort)

Scan (this lecture)
Compute the sum of the previous values in a set

Sort (later)
Sort values or pairs

[Figure: stencil gather pattern]

Parallel Patterns Overview

Reduction

Scan

Reduction

A reduction is where all elements of a set have a common binary associative operator (⊕) applied to them to “reduce” the set to a single value
Binary associative = the order in which the operations are performed on the set does not matter

 E.g. (1 + 2) + 3 + 4 == 1 + (2 + 3) + 4 == 10

Example operators
Most obvious example is addition (Summation)
Other examples: maximum, minimum, product

Serial example is trivial but how does this work in parallel?

int reduce(int r, int i){
    return r + i;
}

// forward loop
int data[N];
int r = 0;
for (int i = 0; i < N; i++){
    r = reduce(r, data[i]);
}

OR

// backward loop (the operator is associative, so the result is the same)
int data[N];
int r = 0;
for (int i = N-1; i >= 0; i--){
    r = reduce(r, data[i]);
}

Parallel Reduction

Order of operations does not matter so we don’t have to think serially.

A tree based approach can be used
At each step data is reduced by a factor of 2

[Figure: tree-based reduction of N = 8 elements. (9 5 2 1 2 3 8 1) is reduced pairwise with ⊕ to (14 3 5 9), then to (17 14), then to 31: N elements reduced in log2(N) steps.]

Parallel Reduction in CUDA

No global synchronisation so how do multiple blocks perform
reduction?

Split the execution into multiple stages
Recursive method

[Figure: the same tree-based reduction split across kernel launches. Kernel launch 1 reduces (9 5 2 1 2 3 8 1) to (14 3 5 9), kernel launch 2 reduces that to (17 14), and kernel launch 3 produces the final value 31.]

Recursive Reduction Problems

What might be some problems with the following?

__global__ void sum_reduction(float *input, float *results){

    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];

    __syncthreads();

    if (i % 2 == 0){
        results[i / 2] = sdata[threadIdx.x] + sdata[threadIdx.x + 1];
    }
}

Block Level Reduction

Lower launch overhead (reduction within a block)

Much better use of shared memory

__global__ void sum_reduction(float *input, float *block_results){

    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];

    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
        unsigned int strided_i = threadIdx.x * 2 * stride;
        if (strided_i < blockDim.x){
            sdata[strided_i] += sdata[strided_i + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        block_results[blockIdx.x] = sdata[0];
}

Block Level Recursive Reduction

for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
    unsigned int strided_i = threadIdx.x * 2 * stride;
    if (strided_i < blockDim.x){
        sdata[strided_i] += sdata[strided_i + stride];
    }
    __syncthreads();
}

[Figure: shared memory trace for the strided loop with initial values (9 5 2 1 2 3 8 1). Loop 1, stride = 1: (14 5 3 1 5 3 9 1). Loop 2, stride = 2: (17 5 3 1 14 3 9 1). Loop 3, stride = 4: (31 5 3 1 14 3 9 1).]

Block Level Reduction

Is this shared memory access pattern bank conflict free?

for (unsigned int stride = 1; stride < blockDim.x; stride *= 2){
    unsigned int strided_i = threadIdx.x * 2 * stride;
    if (strided_i < blockDim.x){
        sdata[strided_i] += sdata[strided_i + stride];
    }
    __syncthreads();
}

Block Level Reduction (Sequential Addressing)

for (unsigned int stride = blockDim.x/2; stride > 0; stride >>= 1){
    if (threadIdx.x < stride){
        sdata[threadIdx.x] += sdata[threadIdx.x + stride];
    }
    __syncthreads();
}

[Figure: shared memory trace for the sequential addressing loop with initial values (9 5 2 1 2 3 8 1). Loop 1, stride = 4: (11 8 10 2 2 3 8 1). Loop 2, stride = 2: (21 10 10 2 2 3 8 1). Loop 3, stride = 1: (31 10 10 2 2 3 8 1).]

(In the loop above, stride >>= 1 halves the stride each iteration, i.e. stride /= 2.)

Now conflict free regardless of the reduction loop stride
The stride between shared memory variable accesses for threads is always sequential

Careful: two types of stride are discussed
1. The loop stride (of the algorithm)
2. The SM variable stride (in 4-byte words)

[Table: with an SM variable stride of 1 and a loop stride of 1, thread t accesses shared memory index t + 1, which falls in bank (t + 1) mod 32 (thread 0 to bank 1, …, thread 31 to bank 0). Banks used: 32. Max conflicts: 1.]

Global Reduction Approach

Use the recursive method
Our block level reduction can be applied to the result
At some stage it may be more effective to simply sum the final block on the CPU
Or use atomics on block results

[Figure: block-level results being reduced recursively, annotated with the thread block width]

Global Reduction Atomics

__global__ void sum_reduction(float *input, float *result){

    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = input[i];

    __syncthreads();

    for (unsigned int stride = blockDim.x/2; stride > 0; stride >>= 1){
        if (threadIdx.x < stride){
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
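For context, a minimal host-side launch sketch for the atomic version (assumptions: N is a multiple of BLOCK_SIZE, and d_input/d_result are existing device allocations):

#define BLOCK_SIZE 256

// Hypothetical host code: zero the result, then launch with one float of
// dynamic shared memory per thread
void launch_sum_reduction(float *d_input, float *d_result, unsigned int N){
    cudaMemset(d_result, 0, sizeof(float));              // atomicAdd accumulates into this
    unsigned int blocks = N / BLOCK_SIZE;                 // assumes N % BLOCK_SIZE == 0
    unsigned int sm_bytes = BLOCK_SIZE * sizeof(float);   // backs extern __shared__ float sdata[]
    sum_reduction<<<blocks, BLOCK_SIZE, sm_bytes>>>(d_input, d_result);
    cudaDeviceSynchronize();
}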

Further Optimisation?

Can we improve our technique further?
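One commonly used improvement is sketched below (an assumption: the grid is launched with half as many blocks, so each block covers 2 * blockDim.x input elements). Each thread adds two global elements while loading into shared memory, so no thread is idle on the first reduction step.

// Sketch of one possible optimisation: each thread sums two global elements on load
__global__ void sum_reduction_v2(float *input, float *result){

    extern __shared__ float sdata[];

    // each block now covers 2 * blockDim.x elements
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    sdata[threadIdx.x] = input[i] + input[i + blockDim.x];

    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1){
        if (threadIdx.x < stride){
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}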

Parallel Patterns Overview

Reduction

Scan

What is scan?

Consider the following …

Remove empty elements from an array (compact)

Split the elements of an array based on a condition (split)

Output a variable number of values per thread

[Figure: examples of each. Compact: an array of letters (A to I) containing empty elements is packed into a contiguous array. Split: (1 2 5 1 3 4 3 8) becomes (1 5 1 3 3 2 4 8). Variable output: per-thread output counts (2 0 1 2 1 3 0 0). The compaction of the sparse numeric array (2 0 5 6 3 0 1 0) to (2 5 6 3 1) is also shown.]

Each has the same problem
Not even considered for sequential programs!
Where to write output in parallel?

Parallel Prefix Sum (scan)

Where to write output in parallel?
Each thread needs to know the output location(s) it can write to avoid conflicts.

[Figure: compaction example. Sparse data (2 0 5 6 3 0 1 0) at thread/read indices 0 to 7; the output/write index is the running sum of the binary output state, (0 0 1 2 3 3 4 4); the compacted data is (2 5 6 3 1).]

The solution is a parallel prefix sum (or scan)
Given the inputs A = [a_0, a_1, …, a_(n-1)] and a binary associative operator ⊕
Scan(A) = [0, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_(n-2))]

Serial Parallel Prefix Sum Example

E.g. Given the input and the addition operator
A = [2, 6, 2, 4, 7, 2, 1, 5]
Scan(A) = [0, 2, 2+6, 2+6+2, 2+6+2+4, …]
Scan(A) = [0, 2, 8, 10, 14, 21, 23, 24]

More generally a serial implementation of an additive (exclusive) scan using a running sum looks like…

int A[8] = { 2, 6, 2, 4, 7, 2, 1, 5 };
int scan_A[8];
int running_sum = 0;

for (int i = 0; i < 8; ++i) {
    scan_A[i] = running_sum;
    running_sum += A[i];
}

Serial Scan for Compaction

int Input[8] = { 2, 0, 5, 6, 3, 0, 1, 0 };
int A[8];
int scan_A[8];
int output[5];
int running_sum = 0;

// generate scan input (binary flags)
// A = {1, 0, 1, 1, 1, 0, 1, 0}
for (int i = 0; i < 8; ++i){
    A[i] = Input[i] > 0;
}

// scan
// scan_A = {0, 1, 1, 2, 3, 4, 4, 5}
for (int i = 0; i < 8; ++i){
    scan_A[i] = running_sum;
    running_sum += A[i];
}

// scattered write
// output = {2, 5, 6, 3, 1}
for (int i = 0; i < 8; ++i){
    int input = Input[i];
    if (input > 0){
        int idx = scan_A[i];
        output[idx] = input;
    }
}
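On the GPU, once the exclusive scan of the binary flags is available in device memory, the scattered write itself parallelises directly; a minimal sketch (the kernel name and parameters are illustrative):

// Hypothetical compaction kernel: one thread per input element; valid elements
// are written to the output index given by the exclusive scan of the flags
__global__ void compact_kernel(const int *input, const int *scan_A, int *output, unsigned int n){
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && input[i] > 0){
        output[scan_A[i]] = input[i];
    }
}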

Parallel Local (Shared Memory) Scan

[Figure: inclusive scan trace in shared memory for the input (2 6 2 4 7 2 1 5).
Loop 1, stride = 1: (2 8 8 6 11 9 3 6).
Loop 2, stride = 2: (2 8 10 14 19 15 14 15).
Loop 3, stride = 4: (2 8 10 14 21 23 24 29).
N elements scanned in log2(N) steps.]

After log2(N) loops each element holds its own value plus the preceding 2^n - 1 values

Inclusive Scan

Parallel Local Scan

[Figure: the same inclusive scan trace for the input (2 6 2 4 7 2 1 5), producing (2 8 10 14 21 23 24 29) in log2(N) steps. Shifting the result right by one element and inserting a leading 0 gives the exclusive scan (0 2 8 10 14 21 23 24); the final inclusive element, 29, is the reduction of the whole input.]

Inclusive scan

Exclusive scan + reduction
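To make the relationship concrete, a small serial sketch (array contents taken from the example above; names are illustrative):

// inclusive[] is an inclusive scan; shifting it right with a leading 0 gives the
// exclusive scan, and its last element is the reduction of the whole input
int inclusive[8] = { 2, 8, 10, 14, 21, 23, 24, 29 };
int exclusive[8];
int reduction = inclusive[7];            // 29

exclusive[0] = 0;
for (int i = 1; i < 8; ++i){
    exclusive[i] = inclusive[i - 1];     // {0, 2, 8, 10, 14, 21, 23, 24}
}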

Implementing Local Scan with Shared Memory

No bank conflicts (stride of 1 between threads)

Synchronisation required between read and write

__global__ void scan(float *input) {

    extern __shared__ float s_data[];

    s_data[threadIdx.x] = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int stride = 1; stride < blockDim.x; stride *= 2) {

        // wait for the initial load / previous iteration's writes to complete
        __syncthreads();

        float s_value = (threadIdx.x >= stride) ? s_data[threadIdx.x - stride] : 0;

        // the read and the write must be separated by a barrier
        __syncthreads();

        s_data[threadIdx.x] += s_value;
    }

    //something with global results?
}

Implementing Local Scan (at warp level)

Exactly the same as the block level technique but at warp level

The warp prefix sum (the warp's total) ends up in the lane where threadIdx.x % 32 == 31

Either use shared memory to reduce between warps
Or consider the following global scan approaches.

__global__ void scan(float *input) {

    __shared__ float s_data[32];

    float val1, val2;

    val1 = input[threadIdx.x + blockIdx.x*blockDim.x];

    for (int s = 1; s < 32; s <<= 1) {
        val2 = __shfl_up(val1, s);          // note: CUDA 9+ uses __shfl_up_sync(mask, val, delta)
        if (threadIdx.x % 32 >= s)
            val1 += val2;
    }

    //store warp level results
}
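As a sketch of the first option (combining per-warp results through shared memory into a block-wide inclusive scan): the staging layout, the output parameter and the variable names are assumptions, and blockDim.x is assumed to be a multiple of 32 and at most 1024 (so there are at most 32 warps).

// Sketch: block-level inclusive scan built from warp-level shuffle scans
__global__ void block_scan(float *input, float *output) {

    __shared__ float warp_sums[32];                    // one running total per warp

    unsigned int lane = threadIdx.x % 32;
    unsigned int warp = threadIdx.x / 32;

    float val = input[threadIdx.x + blockIdx.x*blockDim.x];

    // 1) inclusive scan within each warp (as in the kernel above)
    for (int s = 1; s < 32; s <<= 1) {
        float other = __shfl_up(val, s);               // __shfl_up_sync on CUDA 9+
        if (lane >= s)
            val += other;
    }

    // zero the staging array so unused slots are well defined
    if (threadIdx.x < 32)
        warp_sums[threadIdx.x] = 0;
    __syncthreads();

    // 2) the last lane of each warp holds the warp total; stage it in shared memory
    if (lane == 31)
        warp_sums[warp] = val;
    __syncthreads();

    // 3) warp 0 scans the warp totals and converts them to exclusive offsets
    if (warp == 0) {
        float original = warp_sums[lane];
        float w = original;
        for (int s = 1; s < 32; s <<= 1) {
            float other = __shfl_up(w, s);
            if (lane >= s)
                w += other;
        }
        warp_sums[lane] = w - original;                // sum of the preceding warps' totals
    }
    __syncthreads();

    // 4) add the offset of all preceding warps to every element
    output[threadIdx.x + blockIdx.x*blockDim.x] = val + warp_sums[warp];
}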

Implementing scan at Grid Level

[Figure: local scans in two thread blocks. Thread block 1 scans (2 6 2 4 7 2 1 5) to the local result (2 8 10 14 21 23 24 29). Thread block 2 scans (4 2 2 4 1 0 1 2) to the local result (4 6 8 12 13 13 14 16). Adding thread block 1's total (29) to thread block 2's local result gives the global scan result (2 8 10 14 21 23 24 29 33 35 37 41 42 42 43 45).]

Implementing scan at Grid Level

Same problem as reduction when scaling to grid level
Each block is required to add the reduction value from the preceding blocks

A global scan therefore requires either:
1. A recursive scan kernel on the results of the local scan

 Plus an additional kernel to add the sums of the preceding blocks

2. Atomic increments (next slides)
 Increment a counter for block level results

 Plus an additional kernel to add the sums of the preceding blocks to each value

Global Level Scan (Atomics Part 1)

__device__ float block_sums[BLOCK_DIM];

__global__ void scan(float *input, float *local_result) {

extern __shared__ float s_data[];

s_data[threadIdx.x] = input[threadIdx.x + blockIdx.x*blockDim.x];

for (int stride = 1; stride < blockDim.x; stride *= 2) {

__syncthreads();

float s_value = (threadIdx.x >= stride) ? s_data[threadIdx.x - stride] : 0;

__syncthreads();

s_data[threadIdx.x] += s_value;

}

//store local scan result to each thread

local_result[threadIdx.x + blockIdx.x*blockDim.x] = s_data[threadIdx.x];

//atomic store to all proceeding block totals

if (threadIdx.x == 0){

for (int i=0; i