Application Case Study – Deep Learning
Parallel Implementation of Convolutional Neural Network (CNN)
(Part 2)

Example of the Forward Path of a Convolution Layer
[Figure: a worked example of the forward path; each output pixel is computed as an inner product of the K×K filter with the corresponding input window (e.g. 1*0 + 1*2 + 1*1 + 1*2 + 1*0 + 1*3 + ...), summed over all input feature maps.]
Sequential Code for the Forward Path of a Convolution Layer
void convLayer_forward(int M, int C, int H, int W, int K,
                       const float* X, const float* F, float* Y)
{
  int H_out = H - K + 1;
  int W_out = W - K + 1;
  for (int m = 0; m < M; m++)                   // for each output feature map
    for (int h = 0; h < H_out; h++)             // for each output element
      for (int w = 0; w < W_out; w++) {
        Y[(m * H_out + h) * W_out + w] = 0.0f;
        for (int c = 0; c < C; c++)             // sum over all input feature maps
          for (int p = 0; p < K; p++)           // KxK filter
            for (int q = 0; q < K; q++)
              Y[(m * H_out + h) * W_out + w] +=
                  X[(c * H + h + p) * W + w + q] *    // X[c, h + p, w + q]
                  F[((m * C + c) * K + p) * K + q];   // filter bank W[m, c, p, q]
      }
}
Variables
• C: the number of input feature maps
• M: the number of output feature maps
• H: the height of each input map image
• W: the width of each input map image
• K: the height (and width) of each filter
• X[c, h, w]: input feature maps
• Y[m, h, w]: output feature maps
• W[m, c, p, q]: filter banks. There are M x C filter banks; filter bank W[m, c, _, _] is applied to input feature map X[c, _, _] when computing output feature map Y[m, _, _]
• int H_out = H - K + 1; // sliding window, e.g. 32 - 5 + 1 = 28
• int W_out = W - K + 1;
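
To make the sizes concrete, a worked example using the 32×32 input and 5×5 filter from the comment above (C = 3 and M = 16 are illustrative values, not from the slides):

// Illustrative sizes only; C = 3 and M = 16 are assumed.
int H = 32, W = 32, K = 5, C = 3, M = 16;
int H_out = H - K + 1;            // 28
int W_out = W - K + 1;            // 28
int x_size = C * H * W;           // 3 * 32 * 32   = 3072 input elements
int y_size = M * H_out * W_out;   // 16 * 28 * 28  = 12544 output elements
int f_size = M * C * K * K;       // 16 * 3 * 5 * 5 = 1200 filter weights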
Parallelism in a Convolution Layer
• All output feature maps can be calculated in parallel
  – A small number in general, not sufficient to fully utilize a GPU
• All output feature map pixels can be calculated in parallel
  – All rows can be done in parallel
  – All pixels in each row can be done in parallel
• All input feature maps can be processed in parallel
  – Requires a reduction across input feature maps, since their contributions are summed into the same output pixel
Design of a Basic Kernel
• Each block computes a tile of output pixels
  – TILE_WIDTH pixels in each dimension
• The first (x) dimension in the grid maps to the M output feature maps
• The second (y) dimension in the grid maps to the tiles in the output feature maps

dim3 gridDim(M, Y, 1);   // Y = number of tiles per output feature map
Host Code for the Basic Kernel
• Defining the grid configuration
– W_out and H_out are the output feature map width and height
#define TILE_WIDTH 4           // We will use 4 for small examples.
W_grid = W_out / TILE_WIDTH;   // number of horizontal tiles per output map
H_grid = H_out / TILE_WIDTH;   // number of vertical tiles per output map
Y = H_grid * W_grid;           // total number of tiles per output map
dim3 blockDim(TILE_WIDTH, TILE_WIDTH, 1);
dim3 gridDim(M, Y, 1);
ConvLayerForward_Kernel<<< gridDim, blockDim>>>(…);
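
The integer divisions above drop any partial tile at the right/bottom edges. A common variant (a sketch, not from the slides) rounds up instead:

// Round up so partial tiles at the edges are still covered.
int W_grid = (W_out + TILE_WIDTH - 1) / TILE_WIDTH;   // ceil(W_out / TILE_WIDTH)
int H_grid = (H_out + TILE_WIDTH - 1) / TILE_WIDTH;   // ceil(H_out / TILE_WIDTH)
int Y = H_grid * W_grid;
dim3 blockDim(TILE_WIDTH, TILE_WIDTH, 1);
dim3 gridDim(M, Y, 1);
// Threads that fall outside the output map must then be guarded inside the kernel.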
A Small Example
• Assume that we will produce 4 output feature maps
  – Each output feature map is an 8×8 image
  – We have 4 blocks in the x dimension
• If we use tiles of 4 pixels on each side (TILE_WIDTH = 4)
  – We have 4 blocks in the y dimension (2×2 tiles per map)
• The top two blocks in each column calculate the top row of tiles in the corresponding output feature map
• The bottom two blocks in each column calculate the bottom row of tiles in the corresponding output feature map
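
Plugging the example's numbers into the host code above:

W_grid = 8 / 4 = 2;   H_grid = 8 / 4 = 2;   Y = 2 * 2 = 4
gridDim  = (4, 4, 1)  // 4 output maps x 4 tiles per map = 16 blocks
blockDim = (4, 4, 1)  // 16 threads per block; 16 x 16 = 256 threads,
                      // one per output pixel (4 maps x 8 x 8 = 256 pixels)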
Mapping Threads to Output Feature Maps – Grid Perspective, First Output Feature Map

[Figure: the first output feature map divided into a first and a second row of tiles; blockIdx.x selects the feature map and blockIdx.y selects the tile.]
A Basic Conv. Layer Forward Kernel (Code is incomplete!)
__global__ void ConvLayerForward_Basic_Kernel(int C, int W_grid, int K,
                                              int H, int W,
                                              float* X, float* F, float* Y)
{
  int H_out = H - K + 1;
  int W_out = W - K + 1;
  int m = blockIdx.x;                                        // output feature map
  int h = (blockIdx.y / W_grid) * TILE_WIDTH + threadIdx.y;  // output row
  int w = (blockIdx.y % W_grid) * TILE_WIDTH + threadIdx.x;  // output column
  float acc = 0.0f;
  for (int c = 0; c < C; c++)                 // sum over all input channels
    for (int p = 0; p < K; p++)               // loop over the KxK filter
      for (int q = 0; q < K; q++)
        acc += X[(c * H + h + p) * W + w + q] *    // X[c, h + p, w + q]
               F[((m * C + c) * K + p) * K + q];   // filter bank W[m, c, p, q]
  Y[(m * H_out + h) * W_out + w] = acc;
}
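
Part of what makes the kernel incomplete is the missing boundary check: with the rounded-up grid sketched earlier, threads can fall outside the output map whenever H_out or W_out is not a multiple of TILE_WIDTH. A minimal guard around the same body:

if (h < H_out && w < W_out) {   // skip threads outside the output map
  float acc = 0.0f;
  for (int c = 0; c < C; c++)
    for (int p = 0; p < K; p++)
      for (int q = 0; q < K; q++)
        acc += X[(c * H + h + p) * W + w + q] *
               F[((m * C + c) * K + p) * K + q];
  Y[(m * H_out + h) * W_out + w] = acc;
}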
Some Observations
• The amount of parallelism is quite high as long as the total number of pixels across all output feature maps is large
– This matches the CNN architecture well
Implementing a Convolution Layer with Matrix Multiplication
[Figure: the input feature maps X are unrolled into a matrix X_unrolled, and the convolution filters W are rearranged into a filter matrix W'. The matrix product W' * X_unrolled yields the output feature maps Y, here
  14 20 15 24
  12 24 17 26
with each row holding one flattened output feature map.]
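
The unrolling step is simple to express sequentially. A minimal sketch, assuming the row-major layouts used earlier (the name unroll_input is ours, not from the slides): every K×K input window becomes one column of X_unrolled, and every (channel, filter row, filter column) position becomes one row.

// Sketch: expand X (C x H x W) into X_unroll ((C*K*K) x (H_out*W_out)).
void unroll_input(int C, int H, int W, int K, const float* X, float* X_unroll)
{
  int H_out = H - K + 1;
  int W_out = W - K + 1;
  for (int c = 0; c < C; c++)
    for (int p = 0; p < K; p++)
      for (int q = 0; q < K; q++) {
        int row = (c * K + p) * K + q;           // one row per filter-element position
        for (int h = 0; h < H_out; h++)
          for (int w = 0; w < W_out; w++)        // one column per output pixel
            X_unroll[row * (H_out * W_out) + h * W_out + w] =
                X[(c * H + h + p) * W + w + q];  // X[c, h + p, w + q]
      }
}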
Simple Matrix Multiplication
[Figure: W' * X_unrolled evaluated as a standard matrix multiplication; the product matrix is
  14 20 15 24
  12 24 17 26
with one row-times-column inner product highlighted.]
Each product matrix element is an output feature map pixel.
This inner product generates element 0 of output feature map 0.
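
Combining the pieces, a hedged sketch of the whole forward path as a matrix multiplication, using unroll_input from above (convLayer_forward_gemm and W_prime are our names; a plain triple loop stands in for an optimized GEMM):

// W_prime: M x (C*K*K) filter matrix; Y: M x (H_out*W_out) output matrix.
void convLayer_forward_gemm(int M, int C, int H, int W, int K,
                            const float* X, const float* W_prime,
                            float* X_unroll, float* Y)
{
  int H_out = H - K + 1, W_out = W - K + 1;
  int rows = C * K * K;                 // inner dimension of the product
  int cols = H_out * W_out;             // pixels per output feature map
  unroll_input(C, H, W, K, X, X_unroll);
  for (int m = 0; m < M; m++)           // Y = W_prime * X_unroll
    for (int j = 0; j < cols; j++) {
      float acc = 0.0f;
      for (int k = 0; k < rows; k++)
        acc += W_prime[m * rows + k] * X_unroll[k * cols + j];
      Y[m * cols + j] = acc;
    }
}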
[Figure: Convolution Filters 0 paired with the unrolled input feature maps; the inner product of the first row of W' with the first column of X_unrolled evaluates to 14, element 0 of output feature map 0.]
Convolution Layer – Back Propagation of dE/dY
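
The slide's details did not survive extraction. As a sketch of what this step involves (our code, the standard chain-rule result for the stride-1 forward path above): each input pixel accumulates gradient from every output pixel whose K×K window covered it.

// Hedged sketch: back-propagate dE/dY to dE/dX for the stride-1 convolution.
void convLayer_backward_x(int M, int C, int H, int W, int K,
                          const float* dE_dY, const float* F, float* dE_dX)
{
  int H_out = H - K + 1, W_out = W - K + 1;
  for (int c = 0; c < C; c++)
    for (int h = 0; h < H; h++)
      for (int w = 0; w < W; w++) {
        float acc = 0.0f;
        for (int m = 0; m < M; m++)
          for (int p = 0; p < K; p++)
            for (int q = 0; q < K; q++) {
              int h_out = h - p, w_out = w - q;  // output position that read X[c, h, w]
              if (h_out >= 0 && h_out < H_out && w_out >= 0 && w_out < W_out)
                acc += dE_dY[(m * H_out + h_out) * W_out + w_out] *
                       F[((m * C + c) * K + p) * K + q];
            }
        dE_dX[(c * H + h) * W + w] = acc;
      }
}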
Summary
• Deep Learning CNN and applications
• Serial algorithm
• Parallel algorithm design
  – Output data decomposition
• CUDA programming
  – Thread organization
  – Layer forward kernel