
Parallel Computing with GPUs: Shared Memory
Dr Paul Richmond

http://paulrichmond.shef.ac.uk/teaching/COM4521/

Average Mark: 71%

Grids, Blocks, Warps & Threads

[Diagram: hierarchy of grid → block → thread mapped onto the GPU. Each SM has its own shared memory / local cache and 32-CUDA-core partitions that execute warps; all SMs share device memory.]

Grids, Blocks, Warps & Threads

Blocks map to SMs
SMs may have more than one block

Blocks are split into warps by the hardware (always pick a block size that is a multiple of 32)

Blocks do not migrate between SMs

No guarantee of order of block execution

No communication or synchronisation between blocks

Threads map to CUDA cores
Executed in partitions of 32, called warps

Lots of warps means lots of opportunity to hide memory movement

Review of last week

We have seen the importance of different types of memory
And observed the performance improvement from read-only and constant cache usage

So far we have seen how CUDA can be used for performing thread-local computations; e.g.
Load data from memory to registers

Perform thread-local computations

Store results back to global memory

We will now consider another important type of memory
Shared memory

Shared Memory

Shared Memory Bank Conflicts

2D Shared Memory Bank Conflicts

Boundary Conditions for Shared Memory Loading

Host-side Configurations for Shared Memory

Shared Memory

Architecture Details
In Kepler, 64KB of on-chip memory per SM is split between shared memory and L1 cache
The ratio of shared memory to L1 can be configured

In Maxwell, a dedicated 64KB of shared memory is provided

It's just another cache, right?
User configurable

Requires manually loading and synchronising data

[Diagram: per-block memory hierarchy, Kepler vs Maxwell. In both, each thread has its own registers and each block shares constant cache and local cache. Kepler: shared memory and L1 are combined, with a separate read-only cache. Maxwell: shared memory is dedicated, and L1 is combined with the read-only cache.]

Shared Memory

Performance
Shared memory is very fast

Bandwidth > 1 TB/s

Block level computation
Challenges the thread level view…

Allows data to be shared between threads in the same block

User configurable cache at the thread block level

Still no broader synchronisation beyond the level of thread blocks

Block Local Computation

Partition data into groups that fit into shared memory

Load subset of data into shared memory

Perform computation on the subset

Copy subset back to global memory

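The partition/load/compute/store steps above can be sketched on the host, with a small local buffer standing in for shared memory. This is an illustrative sketch only (the `TILE` size, the tripling operation, and the function name are assumptions, not part of the original kernel):

```c
#include <assert.h>

#define TILE 4  /* stand-in for the shared-memory tile size */

/* Triple every element, processing the array one tile at a time:
   load a subset into a local buffer, compute on it, copy it back. */
void triple_tiled(int *data, int n)
{
    int tile[TILE];  /* stand-in for __shared__ storage */
    for (int base = 0; base < n; base += TILE) {
        int count = (n - base < TILE) ? n - base : TILE;
        for (int i = 0; i < count; i++)  /* move: global -> "shared" */
            tile[i] = data[base + i];
        for (int i = 0; i < count; i++)  /* execute on the subset */
            tile[i] *= 3;
        for (int i = 0; i < count; i++)  /* move: "shared" -> global */
            data[base + i] = tile[i];
    }
}
```

On the GPU the inner loops would instead be one statement per thread of the block, with a `__syncthreads()` between the load and compute phases.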

Move, execute, move

From Host view

Move: Data to GPU memory

Execute: Kernel

Move: Data back to host

From Device view

Move: Data from device memory to registers

Execute: instructions

Move: Data back to device memory

From Host view
Move: Data to GPU memory
Execute: Kernel
Move: Data back to host

From Device view
Move: Data from device memory to local cache
Execute: subset of kernel (reusing cached values)
Move: Data back to device memory

From Block view
Move: Data from local cache
Execute: instructions
Move: Data back to local cache (or device memory)

Thread level parallelism vs block level parallelism

A Case for Shared Memory

__global__ void sum3_kernel(int *c, int *a)
{
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	int left, right;

	//load value at i-1
	left = 0;
	if (i > 0)
		left = a[i - 1];

	//load value at i+1
	right = 0;
	if (i < (N - 1))
		right = a[i + 1];

	c[i] = left + a[i] + right; //sum three values
}

Do we have a candidate for block level parallelism using shared memory?

A Case for Shared Memory

Currently: thread-local computation
Bandwidth limited
Requires three loads per thread (at index i-1, i, and i+1)

Block level solution: load each value only once!

CUDA Shared Memory

Shared memory between threads in the same block can be defined using __shared__
Shared variables are only accessible from within device functions
Not addressable in host code

Must be careful to avoid race conditions
Multiple threads writing to the same shared memory variable results in undefined behaviour
Typically write to shared memory using threadIdx

Thread level synchronisation is available through __syncthreads()
Synchronises threads in the block

__shared__ int s_data[BLOCK_SIZE];

Example

Allocate a shared array
One integer element per thread

Each thread loads a single item to shared memory

Call __syncthreads() to ensure shared memory data is populated by all threads

Load all elements through shared memory

What is wrong with this code?

__global__ void sum3_kernel(int *c, int *a)
{
	__shared__ int s_data[BLOCK_SIZE];
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	int left, right;

	s_data[threadIdx.x] = a[i];
	__syncthreads();

	//load value at i-1
	left = 0;
	if (i > 0){
		left = s_data[threadIdx.x - 1];
	}

	//load value at i+1
	right = 0;
	if (i < (N - 1)){
		right = s_data[threadIdx.x + 1];
	}

	c[i] = left + s_data[threadIdx.x] + right; //sum
}

Example

Additional step required! Check boundary conditions for the edge of the block

__global__ void sum3_kernel(int *c, int *a)
{
	__shared__ int s_data[BLOCK_SIZE];
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	int left, right;

	s_data[threadIdx.x] = a[i];
	__syncthreads();

	//load value at i-1
	left = 0;
	if (i > 0){
		if (threadIdx.x > 0)
			left = s_data[threadIdx.x - 1];
		else
			left = a[i - 1];
	}

	//load value at i+1
	right = 0;
	if (i < (N - 1)){
		if (threadIdx.x < (BLOCK_SIZE - 1))
			right = s_data[threadIdx.x + 1];
		else
			right = a[i + 1];
	}

	c[i] = left + s_data[threadIdx.x] + right; //sum
}

Problems with Shared Memory

In the example we saw the introduction of boundary conditions
Global loads are still present at the boundaries
We have introduced divergence in the code (remember the SIMD model)
This is even more prevalent in 2D examples where we tile data into shared memory

//boundary condition
left = 0;
if (i > 0){
	if (threadIdx.x > 0)
		left = s_data[threadIdx.x - 1];
	else
		left = a[i - 1];
}
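What sum3_kernel computes can be pinned down with a plain C reference: each output is the sum of a three-element neighbourhood, treating elements beyond the ends as zero. This is a host-side sketch for checking results, not the CUDA code itself (the function name is illustrative):

```c
#include <assert.h>

/* Host-side reference for the sum3 kernel:
   c[i] = a[i-1] + a[i] + a[i+1], with zeros outside the array. */
void sum3_reference(int *c, const int *a, int n)
{
    for (int i = 0; i < n; i++) {
        int left  = (i > 0)     ? a[i - 1] : 0;
        int right = (i < n - 1) ? a[i + 1] : 0;
        c[i] = left + a[i] + right;
    }
}
```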

Shared Memory

Shared Memory Bank Conflicts

2D Shared Memory Bank Conflicts

Boundary Conditions for Shared Memory Loading

Host-side Configurations for Shared Memory

Shared Memory Bank Conflicts

Shared memory is arranged into 4-byte (32-bit) banks
A load or store of N addresses spanning N distinct banks can be serviced simultaneously

Overall bandwidth is N× that of a single bank

Kepler+ can also serve broadcast accesses simultaneously

A bank conflict occurs when two threads request different addresses from the same bank

Results in serialisation of the access

Bank conflicts only occur between threads in a warp

There are 32 banks and 32 threads per warp

If two threads in a warp access the same bank this is said to be a 2-way bank conflict

Think about your block-sized array of floats:
bank = (index * stride) % 32
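The bank formula above can be explored on the host; this sketch counts the worst-case conflict for one warp at a given stride (the function name is illustrative, not from the slides):

```c
#include <assert.h>

/* Worst-case bank conflict for a warp of 32 threads accessing 4-byte
   words at a given stride: bank = (tid * stride) % 32. Returns the
   largest number of threads landing on any one bank. */
int max_conflict(int stride)
{
    int counts[32] = {0};
    int worst = 0;
    for (int tid = 0; tid < 32; tid++) {
        int bank = (tid * stride) % 32;
        if (++counts[bank] > worst)
            worst = counts[bank];
    }
    return worst;
}
```

Stride 1 gives no conflicts, stride 2 gives 2-way conflicts, and any odd stride is conflict free (because odd numbers are coprime with 32).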

Access Strides

Stride refers to the size (in increments of the bank size) between each thread's memory accesses

If threads access consecutive 4-byte values (e.g. int or float) then the stride is 1
No conflicts

If threads access consecutive 8-byte values (e.g. double) then the stride is 2
2-way conflicts

In general, odd strides result in no conflicts

[Diagram: bank access patterns for stride=1, stride=2 and stride=3.]

Stride (4 byte): 1
TPB: 128

threadIdx.x | index | bank
0           | 0     | 0
1           | 1     | 1
2           | 2     | 2
3           | 3     | 3
4           | 4     | 4
5           | 5     | 5
6           | 6     | 6
7           | 7     | 7
8           | 8     | 8
9           | 9     | 9
10          | 10    | 10
...         | ...   | ...
31          | 31    | 31

Banks used: 32
Max conflicts: 1

bank = (index*stride) % 32

More on SM banks

Irregular access is fine as long as there are no bank conflicts

Multiple threads can access the same bank conflict free if they access the same address (a broadcast)

A broadcast can serve any number of threads in a warp

__shared__ float s_data[??];

//read from shared memory using broadcast

some_thread_value = s_data[0] ;

Strided access example

What is the stride?

What is the level of conflict?

How can this be improved?

__shared__ char s_data[BLOCK_SIZE];

//load or calculate some_thread_value

s_data[threadIdx.x] = some_thread_value;

__syncthreads();

[Diagram: threads 0–31 each writing one char to s_data; groups of four consecutive threads map to the same bank.]

Strided access example

What is the stride? Less than 1 (0.25)

What is the level of conflict? 4 way

How can this be improved? Increase the stride

__shared__ char s_data[BLOCK_SIZE];

//load or calculate some_thread_value

s_data[threadIdx.x] = some_thread_value;

__syncthreads();

[Diagram: threads 0–31 each writing one char; every group of four consecutive threads shares one bank, giving 4-way conflicts.]
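For byte-sized elements, four consecutive indices fall inside the same 4-byte bank word, so bank = (index / 4) % 32. Following the slide's conflict model, a host-side sketch (function name illustrative) confirms the 4-way conflict when thread tid writes s_data[tid]:

```c
#include <assert.h>

/* Worst-case conflict for a warp writing one char each to s_data[tid]:
   four consecutive byte indices share one 4-byte bank, so
   bank = (tid / 4) % 32. */
int char_conflict(void)
{
    int counts[32] = {0};
    int worst = 0;
    for (int tid = 0; tid < 32; tid++) {
        int bank = (tid / 4) % 32;
        if (++counts[bank] > worst)
            worst = counts[bank];
    }
    return worst;
}
```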

Increase the stride (OK solution)

What is the stride? 1

What is the level of conflict? 1 way (no conflict)

How can this be improved? Use less memory!

__shared__ char s_data[BLOCK_SIZE*4];

//load or calculate some_thread_value

s_data[threadIdx.x*4] = some_thread_value;

__syncthreads();

[Diagram: with a stride of 4 chars between threads, each of the 32 threads maps to a distinct bank.]

Increase the stride (good solution)

What is the stride? 1

What is the level of conflict? 1 way (no conflict)

How much shared memory is required? BLOCK_SIZE+1

__shared__ char s_data[BLOCK_SIZE+1];

//load or calculate some_thread_value

s_data[CONFLICT_FREE(threadIdx.x)] = some_thread_value;

__syncthreads();

[Diagram: the conflict-free mapping spreads threads 0–31 across all 32 banks.]

#define CHAR_MULTIPLIER 4
#define CONFLICT_FREE(x) (x*CHAR_MULTIPLIER % (BLOCK_SIZE+1))

where stride = 0.25, multiplier = 4, block_size = 128:

threadIdx.x | adjusted index | bank
0           | 0              | 0
1           | 4              | 1
2           | 8              | 2
3           | 12             | 3
4           | 16             | 4
5           | 20             | 5
6           | 24             | 6
7           | 28             | 7
8           | 32             | 8
9           | 36             | 9
10          | 40             | 10
…           | …              | …
127         | 121            | 30

Banks used: 32
Max conflicts: 1

adjusted index = (tid*multiplier) % (block_size+1)
bank = (index*stride) % 32

Increase the stride (good solution)

BLOCK_SIZE+1 unique indices
Much better than the BLOCK_SIZE*4 unique indices of the previous solution

Shared Memory

Shared Memory Bank Conflicts

2D Shared Memory Bank Conflicts

Boundary Conditions for Shared Memory Loading

Host-side Configurations for Shared Memory

Bank conflicts with 2D tiles

[Diagram: a 32×32 shared memory tile; each row spans banks 0–31, so every column of the tile falls entirely within a single bank.]

Example where each thread (of a 2D block) operates on a row
Loads values by column

__global__ void image_kernel(float *image)
{
	__shared__ float s_data[BLOCK_DIM][BLOCK_DIM];

	for (int i = 0; i < BLOCK_DIM; i++){
		some_thread_value += f(s_data[threadIdx.x][i]);
	}
	...
}

bank = threadIdx.x * stride % 32

Bank conflicts with 2D tiles

With BLOCK_DIM=32 and i=0, every thread in the warp reads from the same bank
Stride = 32
32-way bank conflicts! Very bad

How to fix
Memory padding
Transpose the matrix
Or operate on columns (loading by row) if possible

Bank conflicts with 2D tiles: Memory Padding Solution

[Diagram: a 32×33 tile; padding each row by one element shifts successive rows by one bank, so a column now spans all 32 banks.]

__global__ void image_kernel(float *image)
{
	__shared__ float s_data[BLOCK_DIM][BLOCK_DIM+1];

	for (int i = 0; i < BLOCK_DIM; i++){
		some_thread_value += f(s_data[threadIdx.x][i]);
	}
	...
}

With BLOCK_DIM+1=33, every thread in the warp reads from a different bank
Alternative: the transpose solution is left to you!

Shared Memory

Shared Memory Bank Conflicts

2D Shared Memory Bank Conflicts

Boundary Conditions for Shared Memory Loading

Host-side Configurations for Shared Memory

Boundary Conditions & Shared Memory Tiling

Consider a 2D problem where data is gathered from neighbouring cells
Each cell reads 8 values (gather pattern)
Sounds like a good candidate for shared memory: we can tile the data into shared memory
Thread block size is 8x8

[Diagram: an 8x8 tile of cells held in shared memory, surrounded by a one-cell boundary that is not tiled; each cell gathers from its 8 neighbours.]

Problem with our tiling approach

Memory access pattern is good for values inside the boundary
448 cached reads, 64 loads
Memory outside of the boundary is loaded multiple times
92 un-cached reads, 92 loads

[Diagram: a 10x10 grid showing how many times each cell is read; interior cells are read 8 times, edge cells 2–6 times, and corner cells once.]

Boundary Condition Improvements

Launch more threads
Launch a thread block of (DIM+2) × (DIM+2)
Allocate one element of space per thread in shared memory
Every thread loads one value
Only threads in the inner DIM × DIM compute values
Causes under-utilisation

Use more shared memory per thread
Launch the same DIM × DIM threads
Allocate (DIM+2) × (DIM+2) elements of space in shared memory
Threads on the boundary load multiple elements
Causes unbalanced loads
All threads compute values

Utilisation = DIM² / (DIM+2)²

DIM | Utilisation
8   | 64%
12  | 73%
16  | 79%
20  | 83%
24  | 85%
28  | 87%
32  | 89%
36  | 90%
40  | 91%
44  | 91%
48  | 92%

Shared Memory

Shared Memory Bank Conflicts

2D Shared Memory Bank Conflicts

Boundary Conditions for Shared Memory Loading

Host-side Configurations for Shared Memory

Dynamically Assigned Shared Memory

It is possible to dynamically assign shared memory at runtime
Requires both a host and a device modification to the code
Device: must declare the shared memory as extern
Host: must declare the shared memory size in the kernel launch parameters

unsigned int sm_size = sizeof(float)*DIM*DIM;
image_kernel<<<blocksPerGrid, threadsPerBlock, sm_size>>>(d_image);

__global__ void image_kernel(float *image)
{
	extern __shared__ float s_data[];
	...
}

Is equivalent to

image_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_image);

__global__ void image_kernel(float *image)
{
	__shared__ float s_data[DIM][DIM];
	...
}
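Returning to the 2D tile example above: the effect of padding can be checked with plain bank arithmetic on the host. In a row-major [32][width] float tile, element (row, col) sits in bank (row*width + col) % 32, so width 32 puts a whole column in one bank while width 33 spreads it across all 32 (the function name is illustrative):

```c
#include <assert.h>

/* Worst-case conflict when a warp of 32 threads reads column `col`
   of a row-major [32][width] float tile:
   element (row, col) sits in bank (row * width + col) % 32. */
int column_conflict(int width, int col)
{
    int counts[32] = {0};
    int worst = 0;
    for (int row = 0; row < 32; row++) {
        int bank = (row * width + col) % 32;
        if (++counts[bank] > worst)
            worst = counts[bank];
    }
    return worst;
}
```

With width 32 the column is a 32-way conflict; padding to width 33 makes every column conflict free, at the cost of one unused element per row.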

Summary

Shared Memory introduces the idea of block level computation
rather than just thread level computation

Shared Memory is a limited resource but can be very useful for
reducing global memory bandwidth
Where data is reused

Shared Memory requires user synchronisation unlike other general
purpose caches (i.e. L1, texture)

For optimal performance memory banks must be considered and
boundary conditions must be handled

There are hardware specific options for configuring how Shared
Memory is used

Acknowledgements and Further Reading


http://cuda-programming.blogspot.co.uk/2013/02/bank-conflicts-in-shared-memory-in-cuda.html
http://acceleware.com/blog/maximizing-shared-memory-bandwidth-nvidia-kepler-gpus

Shared Memory Preferences

On Compute 2.x (Fermi) and Compute 3.x (Kepler) devices it is possible to configure the ratio of shared memory to L1 with the host functions:

cudaDeviceSetCacheConfig(enum cudaFuncCache)

does this for all kernels

cudaFuncSetCacheConfig(const void *func, enum cudaFuncCache)

for a single kernel

Possible values are:
cudaFuncCachePreferNone: default cache configuration

cudaFuncCachePreferShared: 48KB shared memory and 16KB L1

cudaFuncCachePreferL1: 16KB shared memory and 48KB L1

cudaFuncCachePreferEqual: 32KB shared memory and 32KB L1 (only available on Kepler)

Not required on Maxwell (shared memory is dedicated)