COMP5426 Distributed
Introduction
References
– NVIDIA GPU Educators Program – https://developer.nvidia.com/educators
– NVIDIA's Academic Programs – https://developer.nvidia.com/academia
– The contents of these slides are mainly drawn from the following book and its accompanying teaching materials:
David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd edition, Morgan Kaufmann, 2013
Example – Matrix Multiplication
For a given problem with matrices M, N and P, we first need to decide how many threads to create
• At least one thread per element of P (why?) – one way to size the launch is sketched below
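A minimal host-side sketch (not from the original slides) of one way to create at least one thread per P element; the 16x16 block shape and the ceiling division are assumptions:

// One thread per element of the Width x Width matrix P; 16x16 blocks assumed
dim3 dimBlock(16, 16);
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,   // ceil over columns
             (Width + dimBlock.y - 1) / dimBlock.y);  // ceil over rows
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);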
A Basic Matrix Multiplication
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // Each thread computes one element of matrix P
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point operation
– Assume a GPU with
– Peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4 * 1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 50 GFLOPS
– This limits the execution rate to 3.33% (50/1,500) of the peak floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the 1,500 GFLOPS
– Make effective use of shared memory
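Stated as a formula (my restatement of the bullets above, not from the slides), the attainable rate is the arithmetic intensity times the memory bandwidth:

rate <= (1 FLOP / 4 bytes) * 200 GB/s = 50 GFLOPS

Raising the FLOP-per-byte ratio is therefore the only way to approach the 1,500 GFLOPS peak.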
Matrix Multiplication
– Data access pattern
– Each thread – a row of M and a column of N
– Each thread block – a strip of M and a strip of N
Tiled Matrix Multiplication
– Break up the execution of each thread into phases
– so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
– The tile is TILE_WIDTH elements in each dimension
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in tiled code
A Toy Example: Thread to P Data Mapping
BLOCK_WIDTH = 2
[Figure: a 4x4 P matrix divided into Block(0,0), Block(0,1), Block(1,0) and Block(1,1); within each block, Thread(0,0), Thread(0,1), Thread(1,0) and Thread(1,1) each compute one P element]
Calculation of P0,0 and P0,1
Calculation of P1,0 and P1,1
Phase 0 Load for Block (0,0)
Phase 0 Use for Block (0,0) (iteration 0)
Phase 0 Use for Block (0,0) (iteration 1)
Phase 1 Load for Block (0,0)
Phase 1 Use for Block (0,0) (iteration 0)
Phase 1 Use for Block (0,0) (iteration 1)
[Figures: in each load step, a 2x2 tile of M and a 2x2 tile of N are copied into shared memory; in each use step, every thread consumes one shared M element and one shared N element]
Barrier Synchronization
– Synchronize all threads in a block – __syncthreads()
– All threads in the same block must reach the __syncthreads() before any of them can move on
– Best used to coordinate the phased execution of tiled algorithms
– To ensure that all elements of a tile are loaded at the beginning of a phase
– To ensure that all elements of a tile are consumed at the end of a phase
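A minimal illustration (not from the slides) of why the barrier matters when staging data through shared memory; the kernel name and buffer size are hypothetical:

// Reverses one block's worth of data via shared memory
__global__ void reverseBlock(float* d) {   // launch with one block, blockDim.x <= 256
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = d[t];                           // each thread deposits one element
    __syncthreads();                       // whole tile present before any cross-thread read
    d[t] = s[blockDim.x - 1 - t];          // safe: reads another thread's element
}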
Loading Input Tile 0 of M and N (Phase 0)
– Have each thread load an M element and an N element at the same relative position as its P element
int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;
2D indexing for accessing Tile 0:
M[Row][tx]
N[ty][Col]
Loading Input Tile 1 of M and N (Phase 1)
2D indexing for accessing Tile 1:
M[Row][1*TILE_WIDTH + tx]
N[1*TILE_WIDTH + ty][Col]
M and N are dynamically allocated – use 1D indexing
M[Row][p*TILE_WIDTH+tx] becomes M[Row*Width + p*TILE_WIDTH + tx]
N[p*TILE_WIDTH+ty][Col] becomes N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase
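A small device-side helper (hypothetical, not from the slides; the name idx is an assumption) that makes the row-major flattening explicit:

// Row-major flattening for a Width x Width matrix
__device__ __forceinline__ int idx(int row, int col, int width) {
    return row * width + col;
}
// e.g. M[Row][p*TILE_WIDTH + tx] becomes M[idx(Row, p*TILE_WIDTH + tx, Width)]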
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width + Col] = Pvalue;
}
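A hedged host-side sketch (not in the slides) of launching the tiled kernel; the essential constraint is that the block shape must equal the tile shape, since each thread loads exactly one element of each tile. This version assumes Width is a multiple of TILE_WIDTH (the restriction is lifted later in this section):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                 // block shape = tile shape
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per P tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);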
Tile (Thread Block) Size Considerations
– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads
– For 16, in each phase, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations (16 floating-point operations per memory load)
– For 32, in each phase, each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations (32 floating-point operations per memory load)
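Connecting these ratios back to the earlier bandwidth bound (my arithmetic, not in the slides): 200 GB/s supports 50 G loads/s of 4-byte floats, so 16 operations per load allows 16 * 50 = 800 GFLOPS, and 32 operations per load allows 32 * 50 = 1,600 GFLOPS, which is enough in principle to reach the 1,500 GFLOPS peak.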
Shared Memory and Threading
– For an SM with 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
– With 16KB shared memory, one can potentially have up to 8 thread blocks executing
– This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
– The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB shared memory usage per thread block, allowing 2 thread blocks to be active at the same time
– However, the thread count limitation of 1,536 threads per SM for Fermi will reduce the number of blocks per SM to one!
Block Granularity Considerations
– For Matrix Multiplication using multiple blocks, should I use 8X8, 16X16 or 32X32 blocks for Fermi?
– For 8X8, we have 64 threads per Block. Since each SM can take up to 1536 threads, that translates to 24 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM!
– For 16X16, we have 256 threads per Block. Since each SM can take up to 1536 threads, it can take up to 6 Blocks and achieve full capacity, unless other resource considerations overrule.
– For 32X32, we would have 1024 threads per Block. Only one block can fit into an SM for Fermi, using only 2/3 of the thread capacity of an SM.
Handling Matrix of Arbitrary Size
• The tiled matrix multiplication kernel we presented so far can handle only square matrices whose dimensions (Width) are multiples of the tile width (TILE_WIDTH)
• However, real applications need to handle arbitrary sized matrices.
• One could pad (add elements to) the rows and columns into multiples of the tile size, but this would incur significant space and data-transfer overhead.
• There is a better approach.
Phase 1 Loads for Block (0,0) for a 3x3 Example
Threads (1,0) and (1,1) need special treatment in loading N tile
Threads (0,1) and (1,1) need special treatment in loading M tile
Phase 1 Use for Block (0,0) (iteration 0)
Phase 1 Use for Block (0,0) (iteration 1)
All threads need special treatment. None of them should introduce invalid contributions to their P elements.
Phase 0 Loads for Block (1,1) for a 3x3 Example
Threads (2,3) and (3,3) need special treatment in loading N tile
Threads (3,2) and (3,3) need special treatment in loading M tile
Major Cases in Toy Example
– Threads that calculate valid P elements may attempt to load non-existing input elements when loading input tiles
– Phase 1 of Block(0,0), Thread(1,0): assigned to calculate the valid P[1,0] but attempts to load the non-existing N[3,0]
– Threads that do not calculate valid P elements but still need to participate in loading the input tiles
– Phase 0 of Block(1,1), Thread(3,2): assigned to calculate the non-existent P[3,2] but needs to participate in loading the valid tile element N[1,2]
A “Simple” Solution
– When a thread is to load any input element, test if it is in the valid index range
– If valid, proceed to load
– Else, do not load, just write a 0
– Rationale: a 0 value will ensure that the multiply-add step does not affect the final value of the output element
– The condition tested for loading input elements is different from the test for calculating the output P element
– A thread that does not calculate a valid P element can still participate in loading input tile elements
Boundary Condition for Input M Tile
– Each thread loads
– M[Row][p*TILE_WIDTH+tx]
– M[Row*Width + p*TILE_WIDTH+tx]
– Need to test
– (Row < Width) && (p*TILE_WIDTH+tx < Width)
– If true, load M element
– Else, load 0
Boundary Condition for Input N Tile
– Each thread loads
– N[p*TILE_WIDTH+ty][Col]
– N[(p*TILE_WIDTH+ty)*Width + Col]
– Need to test
– (p*TILE_WIDTH+ty < Width) && (Col < Width)
– If true, load N element
– Else, load 0
Loading Elements – with boundary check
for (int p = 0; p < (Width-1)/TILE_WIDTH + 1; ++p) {
    if (Row < Width && p*TILE_WIDTH + tx < Width) {
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    } else {
        ds_M[ty][tx] = 0.0;
    }
    if (p*TILE_WIDTH + ty < Width && Col < Width) {
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    } else {
        ds_N[ty][tx] = 0.0;
    }
    __syncthreads();
Inner Product – Before and After
– The inner-product loop is unchanged:
for (int i = 0; i < TILE_WIDTH; ++i)
    Pvalue += ds_M[ty][i] * ds_N[i][tx];
– Added (++): guard the final write so only threads with a valid P element store a result:
if (Row < Width && Col < Width)
    P[Row*Width + Col] = Pvalue;
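Putting the pieces together, a complete boundary-safe kernel might look as follows. This assembly is a sketch based on the fragments above, not a verbatim slide; the TILE_WIDTH value is an assumption.

#define TILE_WIDTH 16   // assumed tile size; block shape must match

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0;
    // Ceiling division: enough phases to cover a Width that is not a tile multiple
    for (int p = 0; p < (Width - 1)/TILE_WIDTH + 1; ++p) {
        // Guarded loads: out-of-range elements become 0, so they
        // cannot affect the multiply-add below
        if (Row < Width && p*TILE_WIDTH + tx < Width)
            ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        else
            ds_M[ty][tx] = 0.0f;
        if (p*TILE_WIDTH + ty < Width && Col < Width)
            ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        else
            ds_N[ty][tx] = 0.0f;
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    // Only threads mapped to a valid P element write a result
    if (Row < Width && Col < Width)
        P[Row*Width + Col] = Pvalue;
}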