Application Case Study – Deep Learning
Parallel Implementation of Convolutional Neural Network (CNN)
(Part 2)

Example of the Forward Path of a Convolution Layer
[Figure: worked example of the forward path, in which each output pixel is the sum of element-wise products between a filter and the corresponding window of the input feature maps, e.g. 1*0 + 1*2 + 1*1 + 1*2 + 1*0 + 1*3, 2*1 + 2*1, 1*0 + 1*2 + 1*0 + 1*3, …]

Sequential Code for the Forward Path of a Convolution Layer

void convLayer_forward(int M, int C, int H, int W, int K, float* X, float* W, float* Y)
{
   // Note: following the slides' notation, multidimensional bracket indexing such as
   // Y[m, h, w] is shorthand for a linearized offset, and the name W is used both for
   // the input width and for the filter bank.
   int H_out = H - K + 1;
   int W_out = W - K + 1;
   for (int m = 0; m < M; m++)                   // for each output feature map
      for (int h = 0; h < H_out; h++)            // for each output element
         for (int w = 0; w < W_out; w++) {
            Y[m, h, w] = 0;
            for (int c = 0; c < C; c++)          // sum over all input feature maps
               for (int p = 0; p < K; p++)       // KxK filter
                  for (int q = 0; q < K; q++)
                     Y[m, h, w] += X[c, h + p, w + q] * W[m, c, p, q];
         }
}

(A runnable variant with explicit linearized indexing is sketched after the host code below.)

Variables
• C: the number of input feature maps
• M: the number of output feature maps
• H: the height of each input map image
• W: the width of each input map image
• K: the height (and width) of each filter
• X[c, h, w]: input feature map
• Y[m, h, w]: output feature map
• Filter bank: W[m, c, _, _]. There are M x C K-by-K filter banks; filter bank W[m, c, _, _] is used with input image map X[c, _, _] to calculate the output image map Y[m, _, _]
• int H_out = H - K + 1;   // slide window, 32 - 5 + 1 = 28
• int W_out = W - K + 1;

Parallelism in a Convolution Layer
• All output feature maps can be calculated in parallel
  – A small number in general, not sufficient to fully utilize a GPU
• All output feature map pixels can be calculated in parallel
  – All rows can be done in parallel
  – All pixels in each row can be done in parallel
• All input feature maps can be processed in parallel

Design of a Basic Kernel
• Each block computes a tile of output pixels
  – TILE_WIDTH pixels in each dimension
• The first (x) dimension in the grid maps to the M output feature maps
• The second (y) dimension in the grid maps to the tiles in the output feature maps

dim3 gridDim(M, Y, 1);

Host Code for the Basic Kernel
• Defining the grid configuration
  – W_out and H_out are the output feature map width and height

#define TILE_WIDTH 4                          // We will use 4 for small examples.
W_grid = W_out / TILE_WIDTH;                  // number of horizontal tiles per output map
H_grid = H_out / TILE_WIDTH;                  // number of vertical tiles per output map
Y = H_grid * W_grid;                          // total number of tiles per output map
dim3 blockDim(TILE_WIDTH, TILE_WIDTH, 1);
dim3 gridDim(M, Y, 1);
ConvLayerForward_Kernel<<<gridDim, blockDim>>>(…);
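The sequential code above uses multidimensional bracket notation such as Y[m, h, w] as shorthand. Below is a minimal runnable C sketch of the same loop nest, assuming the arrays are stored contiguously in row-major order; the filter argument is renamed W_filter here (our rename, to avoid the clash with the width parameter W, not part of the slides).

/* Sequential forward path of a convolution layer with linearized indexing.
   X: C x H x W input feature maps, W_filter: M x C x K x K filter banks,
   Y: M x H_out x W_out output feature maps (all row-major, caller-allocated). */
void convLayer_forward(int M, int C, int H, int W, int K,
                       const float* X, const float* W_filter, float* Y)
{
   int H_out = H - K + 1;
   int W_out = W - K + 1;
   for (int m = 0; m < M; m++)                  /* for each output feature map */
      for (int h = 0; h < H_out; h++)           /* for each output row */
         for (int w = 0; w < W_out; w++) {      /* for each output column */
            float acc = 0.0f;
            for (int c = 0; c < C; c++)         /* sum over all input feature maps */
               for (int p = 0; p < K; p++)      /* KxK filter */
                  for (int q = 0; q < K; q++)
                     acc += X[(c * H + (h + p)) * W + (w + q)] *
                            W_filter[((m * C + c) * K + p) * K + q];
            Y[(m * H_out + h) * W_out + w] = acc;
         }
}

With the sizes mentioned in the Variables list (H = W = 32, K = 5), this produces 28 x 28 output maps, matching the H_out = 32 - 5 + 1 = 28 note above.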
A Small Example
• Assume that we will produce 4 output feature maps
  – Each output feature map is an 8×8 image
  – We have 4 blocks in the x dimension
• If we use tiles of 4 pixels on each side (TILE_WIDTH = 4)
  – We have 4 blocks in the y dimension
• The top two blocks in each column calculate the top row of tiles in the corresponding output feature map
• The bottom two blocks in each column calculate the bottom row of tiles in the corresponding output feature map
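Working through the numbers in this example with the host code above: W_out = H_out = 8 and TILE_WIDTH = 4, so W_grid = H_grid = 8 / 4 = 2 and Y = 2 × 2 = 4 tiles per output map. The launch configuration is therefore gridDim(4, 4, 1) with blockDim(4, 4, 1): 4 × 4 = 16 blocks of 16 threads each, i.e. 256 threads in total, exactly one thread for each of the 4 × 8 × 8 = 256 output pixels.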

Mapping Threads to Output Feature Maps
[Figure: grid perspective for the first output feature map – the blocks along the grid's y dimension cover the rows of tiles of the output map (first row of tiles, second row of tiles, and so on).]
A Basic Conv. Layer Forward Kernel (Code is incomplete!)
__global__ void ConvLayerForward_Basic_Kernel(int C, int W_grid, int K, float* X, float* W, float* Y)
{
   int m = blockIdx.x;                                         // output feature map
   int h = (blockIdx.y / W_grid) * TILE_WIDTH + threadIdx.y;   // output row
   int w = (blockIdx.y % W_grid) * TILE_WIDTH + threadIdx.x;   // output column
   float acc = 0.;
   for (int c = 0; c < C; c++) {                               // sum over all input channels
      for (int p = 0; p < K; p++)                              // loop over KxK filter
         for (int q = 0; q < K; q++)
            acc += X[c, h + p, w + q] * W[m, c, p, q];
   }
   Y[m, h, w] = acc;
}
(A completed, runnable variant of this kernel is sketched after the Summary.)

Some Observations
• The amount of parallelism is quite high as long as the total number of pixels across all output feature maps is large
  – This matches the CNN architecture well

Implementing a Convolution Layer with Matrix Multiplication
[Figure: the convolution filters W are rearranged into a filter matrix W′ with one row per output feature map, and the input feature maps X are unrolled into a matrix X_unrolled with one column per output pixel; the product W′ × X_unrolled gives the output features Y, whose pixel values in this small example are 14 20 15 24 and 12 24 17 26.]
(A sequential unrolling sketch is given after the Summary.)

Simple Matrix Multiplication
• Each product matrix element is an output feature map pixel.
• This inner product generates element 0 of output feature map 0.
[Figure: one row of the filter matrix multiplied by one column of the unrolled input feature maps, highlighting the inner product that produces a single output pixel (14 in the example).]

Convolution Layer – Back Propagation of dE/dY

Summary
• Deep Learning CNN and applications
• Serial algorithm
• Parallel algorithm design
  – Output data decomposition
• CUDA programming
  – Thread organization
  – Layer forward kernel
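As noted on its slide, the basic kernel above is incomplete: it uses the multidimensional bracket shorthand, assumes the output size is an exact multiple of TILE_WIDTH, and the host launch elides its argument list. Below is a minimal runnable CUDA sketch of the same design with linearized row-major indexing, a boundary check, and a hypothetical argument list; the parameters H and W, the names W_filter, d_X, d_W, d_Y, and the wrapper convLayer_forward_gpu are our additions, not from the slides.

#define TILE_WIDTH 4

// One thread per output pixel: blockIdx.x selects the output feature map,
// blockIdx.y selects the tile within that map (W_grid tiles per row of tiles).
__global__ void ConvLayerForward_Basic_Kernel(int C, int H, int W, int K, int W_grid,
                                              const float* X, const float* W_filter, float* Y)
{
   int H_out = H - K + 1;
   int W_out = W - K + 1;
   int m = blockIdx.x;                                         // output feature map
   int h = (blockIdx.y / W_grid) * TILE_WIDTH + threadIdx.y;   // output row
   int w = (blockIdx.y % W_grid) * TILE_WIDTH + threadIdx.x;   // output column
   if (h < H_out && w < W_out) {                               // boundary check
      float acc = 0.0f;
      for (int c = 0; c < C; c++)                // sum over all input channels
         for (int p = 0; p < K; p++)             // loop over the KxK filter
            for (int q = 0; q < K; q++)
               acc += X[(c * H + (h + p)) * W + (w + q)] *
                      W_filter[((m * C + c) * K + p) * K + q];
      Y[(m * H_out + h) * W_out + w] = acc;
   }
}

// Hypothetical host-side launch matching the grid layout on the "Host Code" slide.
// d_X, d_W and d_Y are assumed to be device pointers already populated with cudaMemcpy.
// The tile counts are rounded up, which is why the kernel needs the boundary check;
// the slides assume the output size divides evenly by TILE_WIDTH.
void convLayer_forward_gpu(int M, int C, int H, int W, int K,
                           const float* d_X, const float* d_W, float* d_Y)
{
   int W_out = W - K + 1, H_out = H - K + 1;
   int W_grid = (W_out + TILE_WIDTH - 1) / TILE_WIDTH;   // horizontal tiles per map
   int H_grid = (H_out + TILE_WIDTH - 1) / TILE_WIDTH;   // vertical tiles per map
   dim3 blockDim(TILE_WIDTH, TILE_WIDTH, 1);
   dim3 gridDim(M, H_grid * W_grid, 1);
   ConvLayerForward_Basic_Kernel<<<gridDim, blockDim>>>(C, H, W, K, W_grid, d_X, d_W, d_Y);
}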
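The matrix-multiplication slides rearrange the filters into a matrix W′ (one row per output feature map, C·K·K columns) and unroll the input feature maps into X_unrolled (C·K·K rows, one column per output pixel), so that Y = W′ × X_unrolled. Below is a minimal sequential C sketch of that unrolling step under the same row-major layout assumptions as above; the function name unroll_input is ours, not from the slides.

/* Unroll the C x H x W input feature maps into a (C*K*K) x (H_out*W_out) matrix:
   column (h, w) holds every input element that the KxK filter touches when it
   produces output pixel (h, w), so one matrix multiply computes all output maps. */
void unroll_input(int C, int H, int W, int K, const float* X, float* X_unrolled)
{
   int H_out = H - K + 1;
   int W_out = W - K + 1;
   int out_pixels = H_out * W_out;                 /* number of columns */
   for (int c = 0; c < C; c++)
      for (int p = 0; p < K; p++)
         for (int q = 0; q < K; q++) {
            int row = (c * K + p) * K + q;         /* row index in X_unrolled */
            for (int h = 0; h < H_out; h++)
               for (int w = 0; w < W_out; w++)
                  X_unrolled[row * out_pixels + (h * W_out + w)] =
                     X[(c * H + (h + p)) * W + (w + q)];
         }
}

After this step the whole forward path is a single M x (C·K·K) by (C·K·K) x (H_out·W_out) matrix multiplication, so a tiled GEMM kernel can be reused to compute all output feature maps at once.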