COMP5426 Distributed
Introduction

References

– NVIDIA GPU Educators Program – https://developer.nvidia.com/educators
– NVIDIA’s Academic Programs
– https://developer.nvidia.com/academia
– The contents of the ppt slides are mainly copied from the following book and its accompanying teaching materials:
– D. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd edition, Morgan Kaufmann, 2013

Example – Matrix Multiplication
For a given problem, with input matrices M and N and output matrix P, we first need to know how many threads to create.
• At least one thread per element of P (why?)
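As an illustration, a minimal host-side launch sketch (not from the original slides; the 16x16 block shape and the d_M/d_N/d_P device-pointer names are assumptions) that creates at least one thread per element of P:

    // Assumes square Width x Width matrices and the MatrixMulKernel shown below.
    dim3 dimBlock(16, 16);                                // 256 threads per block
    dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,   // ceiling division so the
                 (Width + dimBlock.y - 1) / dimBlock.y);  // grid covers all of P
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);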

A Basic Matrix Multiplication
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y+threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x+threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of matrix P
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k]*N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point operation
– Assume a GPU with a peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s would be required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 50 GFLOPS
– This limits the execution rate to 3.33% (50/1500) of the peak floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the 1,500 GFLOPS
– Make effective use of shared memory
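The 50 GFLOPS figure follows directly from the kernel's arithmetic intensity. A back-of-the-envelope sketch of the calculation (the 1,500 GFLOPS and 200 GB/s figures are the assumed GPU specs from the slide above):

    #include <cstdio>

    int main() {
        const double peak_gflops    = 1500.0;    // assumed peak compute rate
        const double bandwidth_gbs  = 200.0;     // assumed DRAM bandwidth
        const double flops_per_byte = 1.0 / 4.0; // 1 FLOP per 4-byte global load

        double attainable = bandwidth_gbs * flops_per_byte;    // = 50 GFLOPS
        printf("Attainable: %.0f GFLOPS (%.2f%% of peak)\n",
               attainable, 100.0 * attainable / peak_gflops);  // = 3.33% of peak
        return 0;
    }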
Matrix Multiplication – Data Access Pattern
– Each thread – a row of M and a column of N
– Each thread block – a strip of M and a strip of N

Tiled Matrix Multiplication
– Break up the execution of each thread into phases
– so that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
– The tile is TILE_WIDTH elements in each dimension

Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in the tiled code

A Toy Example: Thread to P Data Mapping
– BLOCK_WIDTH = 2: P is divided among Block(0,0), Block(0,1), Block(1,0) and Block(1,1); within each block, Threads (0,0), (0,1), (1,0) and (1,1) each compute one P element
– (Figure slides: Phase 0 load for Block (0,0); Phase 0 use, iterations 0 and 1; Phase 1 load; Phase 1 use, iterations 0 and 1, each showing the shared-memory contents at that step)

Barrier Synchronization
– Synchronize all threads in a block – __syncthreads()
– All threads in the same block must reach the __syncthreads() before any of them can move on
– Best used to coordinate the phased execution of tiled algorithms
– To ensure that all elements of a tile are loaded at the beginning of a phase
– To ensure that all elements of a tile are consumed at the end of a phase

Loading Input Tile 0 of M and N (Phase 0)
– Have each thread load an M element and an N element at the same relative position as its P element:
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
– 2D indexing for accessing Tile 0:
    M[Row][tx]
    N[ty][Col]

Loading Input Tile 1 of M and N (Phase 1)
– 2D indexing for accessing Tile 1:
    M[Row][1*TILE_WIDTH + tx]
    N[1*TILE_WIDTH + ty][Col]

M and N are dynamically allocated – use 1D indexing:
    M[Row][p*TILE_WIDTH+tx]   becomes   M[Row*Width + p*TILE_WIDTH + tx]
    N[p*TILE_WIDTH+ty][Col]   becomes   N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase

Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;

    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
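For context, a hedged host-side harness that could drive this kernel (not part of the original slides; the MatrixMulOnDevice name, the TILE_WIDTH value and the omission of error checking are assumptions for illustration):

    #define TILE_WIDTH 16   // assumption; must match the tile size used by the kernel

    void MatrixMulOnDevice(float* h_M, float* h_N, float* h_P, int Width) {
        size_t bytes = (size_t)Width * Width * sizeof(float);
        float *d_M, *d_N, *d_P;

        // Allocate device memory and copy the input matrices over
        cudaMalloc(&d_M, bytes);  cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);
        cudaMalloc(&d_N, bytes);  cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);
        cudaMalloc(&d_P, bytes);

        // One thread per P element, TILE_WIDTH x TILE_WIDTH threads per block
        dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
        dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
                     (Width + TILE_WIDTH - 1) / TILE_WIDTH);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

        // Copy the result back and release device memory
        cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_M);  cudaFree(d_N);  cudaFree(d_P);
    }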
Tile (Thread Block) Size Considerations
– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads
– For 16, in each phase each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations (16 floating-point operations per memory load)
– For 32, in each phase each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations (32 floating-point operations per memory load)

Shared Memory and Threading
– For an SM with 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
– With 16KB of shared memory, one can potentially have up to 8 thread blocks executing
– This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
– The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing 2 thread blocks to be active at the same time
– However, the thread count limitation of 1536 threads per SM on Fermi will reduce the number of blocks per SM to one!

Block Granularity Considerations
– For matrix multiplication using multiple blocks, should we use 8x8, 16x16 or 32x32 blocks on Fermi?
– For 8x8, we have 64 threads per block. Since each SM can take up to 1536 threads, that translates to 24 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
– For 16x16, we have 256 threads per block. Since each SM can take up to 1536 threads, it can take up to 6 blocks and reach full capacity, unless other resource considerations overrule.
– For 32x32, we would have 1024 threads per block. Only one block can fit into an SM on Fermi, using only 2/3 of the thread capacity of an SM.
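Because these limits vary by device ("shared memory size is implementation dependent!"), they can be queried at run time. A small sketch using the standard CUDA device-properties API:

    #include <cstdio>

    // Print the per-SM limits that the occupancy arithmetic above depends on.
    void PrintSMLimits(int device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    }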
Handling Matrices of Arbitrary Size
• The tiled matrix multiplication kernel we have presented so far can handle only square matrices whose dimension (Width) is a multiple of the tile width (TILE_WIDTH)
• However, real applications need to handle arbitrarily sized matrices
• One could pad the rows and columns (add elements) up to multiples of the tile size, but that would incur significant space and data-transfer time overhead
• There is a better approach

Phase 1 Loads for Block (0,0) for a 3x3 Example
– Threads (1,0) and (1,1) need special treatment in loading the N tile
– Threads (0,1) and (1,1) need special treatment in loading the M tile
– (Figure slides: Phase 1 use for Block (0,0), iterations 0 and 1, showing the shared-memory contents at each step)
– All threads need special treatment: none of them should introduce invalid contributions to their P elements

Phase 0 Loads for Block (1,1) for a 3x3 Example
– Threads (2,3) and (3,3) need special treatment in loading the N tile
– Threads (3,2) and (3,3) need special treatment in loading the M tile

Major Cases in the Toy Example
– Threads that calculate valid P elements may attempt to load non-existing input elements when loading input tiles
– Phase 0 of Block (0,0), Thread (1,0): assigned to calculate the valid P[1,0] but attempts to load the non-existing N[3,0]
– Threads that do not calculate valid P elements but still need to participate in loading the input tiles
– Phase 0 of Block (1,1), Thread (3,2): assigned to calculate the non-existent P[3,2] but needs to participate in loading tile element M[1,2]

A “Simple” Solution
– When a thread is to load any input element, test whether it is in the valid index range
– If valid, proceed to load
– Else, do not load; just write a 0
– Rationale: a 0 value ensures that the multiply-add step does not affect the final value of the output element
– The condition tested for loading input elements is different from the test for calculating the output P element
– A thread that does not calculate a valid P element can still participate in loading input tile elements

Boundary Condition for Input M Tile
– Each thread loads
– M[Row][p*TILE_WIDTH+tx]
– M[Row*Width + p*TILE_WIDTH + tx]
– Need to test
– (Row < Width) && (p*TILE_WIDTH+tx < Width)
– If true, load the M element
– Else, load 0

Boundary Condition for Input N Tile
– Each thread loads
– N[p*TILE_WIDTH+ty][Col]
– N[(p*TILE_WIDTH+ty)*Width + Col]
– Need to test
– (p*TILE_WIDTH+ty < Width) && (Col < Width)
– If true, load the N element
– Else, load 0

Loading Elements – with boundary check
    for (int p = 0; p < (Width-1)/TILE_WIDTH + 1; ++p) {
        if (Row < Width && p*TILE_WIDTH+tx < Width) {
            ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        } else {
            ds_M[ty][tx] = 0.0;
        }
        if (p*TILE_WIDTH+ty < Width && Col < Width) {
            ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
        } else {
            ds_N[ty][tx] = 0.0;
        }
        __syncthreads();

Inner Product – Before and After
– Guard the inner product with the output-range test:
    if (Row < Width && Col < Width) {
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
    }
    __syncthreads();
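Putting the fragments above together, a sketch of the complete boundary-checked tiled kernel (assembled from the pieces on these slides rather than copied verbatim; assumes TILE_WIDTH is defined as before, and the final guarded write of P is inferred from the basic kernel's boundary test):

    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
        __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
        __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x;  int ty = threadIdx.y;
        int Row = blockIdx.y * blockDim.y + ty;
        int Col = blockIdx.x * blockDim.x + tx;
        float Pvalue = 0;

        // Ceiling division: enough phases to cover a Width that is not a
        // multiple of TILE_WIDTH
        for (int p = 0; p < (Width-1)/TILE_WIDTH + 1; ++p) {
            // Load the tiles, substituting 0 for out-of-range elements
            if (Row < Width && p*TILE_WIDTH+tx < Width) {
                ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
            } else {
                ds_M[ty][tx] = 0.0;
            }
            if (p*TILE_WIDTH+ty < Width && Col < Width) {
                ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
            } else {
                ds_N[ty][tx] = 0.0;
            }
            __syncthreads();

            // Threads mapped outside P skip the (meaningless) inner product
            if (Row < Width && Col < Width) {
                for (int i = 0; i < TILE_WIDTH; ++i)
                    Pvalue += ds_M[ty][i] * ds_N[i][tx];
            }
            __syncthreads();
        }
        // Only threads mapped to a real P element write a result
        if (Row < Width && Col < Width)
            P[Row*Width + Col] = Pvalue;
    }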