
Com 4521 Parallel Computing with GPUs: Lab 06

Spring Semester 2018

Dr Paul Richmond

Lab Assistants: John Charlton and Robert Chisholm

Department of Computer Science, University of Sheffield

Learning Outcomes

 Understand how to use shared memory to improve performance of a memory-bound algorithm (matrix multiply)

 Understand how to use tiling in shared memory for matrix multiply

 Understand how to manually calculate occupancy of a kernel at runtime

 Demonstrate how to automatically discover the thread block size which achieves highest occupancy

Lab Register

The lab register must be completed by every student following the completion of the exercises. You should complete this once you have finished the lab, including reviewing the solutions. You are not expected to complete it during the lab class, but you should complete it by the end of the teaching week.

Lab Register Link: https://goo.gl/0r73gD

Prerequisites

Download the CUDA occupancy calculator spreadsheet from the NVIDIA website:

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls


Exercise 01

Figure 1 – Blocked matrix multiply (see text for description). Note: indices of sub-blocks are given as (y, x).

For exercise one we are going to modify an implementation of matrix multiply (provided in exercise01_rows.cu). The implementation provided multiplies a matrix A by a matrix B to produce a matrix C. The widths and heights of the matrices can be modified but, for simplicity, must be a multiple of BLOCK_SIZE (which has been predefined as a macro in the code). The implementation is currently very inefficient as it performs A_WIDTH + B_HEIGHT memory loads to compute each product (value in matrix C). To improve this we will implement a blocked matrix multiply which uses CUDA's shared memory to reduce the number of memory reads by a factor of BLOCK_SIZE. First note the performance of the original version, then modify the existing code to perform a blocked matrix multiply by implementing the changes described below.
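For reference, the structure of the provided (inefficient) kernel is roughly as sketched below. The names d_A, d_B, d_C and the exact indexing are assumptions and exercise01_rows.cu may differ in detail, but the key point is that each thread reads a full row of A and a full column of B directly from global memory.

__global__ void matrixMulCUDA(float *d_A, float *d_B, float *d_C)
{
    // x indexes the column of C, y indexes the row of C
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    float sum = 0.0f;
    // A_WIDTH (== B_HEIGHT) iterations, each performing two global memory loads
    for (int k = 0; k < A_WIDTH; k++)
        sum += d_A[y * A_WIDTH + k] * d_B[k * B_WIDTH + x];

    d_C[y * B_WIDTH + x] = sum;
}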

To implement a blocked matrix multiply we will need to load NUM_SUBS square sub-matrices of matrix A and matrix B into shared memory so that we can compute the intermediate results of the sub-matrix products. In the example in Figure 1 (above), the sub-matrix C(1,1) can be calculated by a square thread block of BLOCK_SIZE by BLOCK_SIZE, where each thread (with location tx, ty in the thread block) performs the following steps, which require two stages of loading matrix tiles into shared memory. A sketch of the resulting general kernel is given after the numbered steps.

1. Load the two sub-matrices A(1,0) and B(0,1) into shared memory. For these sub-matrices each thread should:

a. Load an element of the sub-matrix A(1,0) into shared memory from matrix A at position (ty+BLOCK_SIZE, tx).

b. Load an element of the sub-matrix B(0,1) into shared memory from matrix B at position (ty, tx+BLOCK_SIZE).

c. Synchronise to ensure all threads have completed loading sub-matrix values to shared memory.

2. Compute the dot product of each row in sub-matrix A(1,0) with each column in sub-matrix B(0,1), storing the result in a local variable. This is achieved through the following steps:

a. Iterate from 0 to BLOCK_SIZE to multiply row ty of A(1,0) (from shared memory) by column tx of B(0,1) (from shared memory) to calculate the sub-matrix product value.

b. Store the sub-matrix product value in a thread-local variable.

c. Synchronise to ensure that all threads have finished reading from shared memory.

3. Repeat steps 1 & 2 for the next sub-matrix (or matrices in the general case), adding each

new dot product result to the previous result. For the example in the figure, there is only

one more iteration required to load the final 2 sub matrices. E.g.

a. Load an element of the sub matrix A(1,1) into shared memory from Matrix A at

position (ty+BLOCK_SIZE, tx+BLOCK_SIZE)

b. Load an element of the sub matrix B(1,1) into shared memory from matrix B at

position (ty+BLOCK_SIZE tx+BLOCK_SIZE,)

c. Synchronise to ensure all threads have completed loading sub matrix values to

shared memory

d. Iterate from 0 to BLOCK_SIZE to multiply row tx of A(1,1) (from shared memory)

by column ty of B(1,1) (from shared memory) to calculate the sub matrix product

value.

e. Add this sub matrix product value to the one calculated for the previous sub

matrices.

4. Store the sum of the sub-matrix dot products into global memory at location (x, y) of matrix C.
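The sketch below shows one way the general tiled kernel could look once the TODOs are completed. It is a guide rather than the definitive solution: the array names (d_A, d_B, d_C), the macros A_WIDTH, B_WIDTH and NUM_SUBS, and the row-major indexing are assumptions based on the description above.

__global__ void matrixMulCUDASharedMemory(float *d_A, float *d_B, float *d_C)
{
    // Shared memory tiles for one sub-matrix of A and one of B
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = blockIdx.x * BLOCK_SIZE + tx;   // column of C
    int y = blockIdx.y * BLOCK_SIZE + ty;   // row of C

    float sum = 0.0f;                       // running dot product (steps 2b and 3e)

    // NUM_SUBS = A_WIDTH / BLOCK_SIZE iterations over the tiles (steps 1 and 3)
    for (int i = 0; i < NUM_SUBS; i++) {
        // Step 1: each thread loads one element of the current A and B tiles
        int a_x = i * BLOCK_SIZE + tx;      // column within A
        int b_y = i * BLOCK_SIZE + ty;      // row within B
        As[ty][tx] = d_A[y * A_WIDTH + a_x];
        Bs[ty][tx] = d_B[b_y * B_WIDTH + x];
        __syncthreads();                    // step 1c

        // Step 2: row ty of the A tile multiplied by column tx of the B tile
        for (int k = 0; k < BLOCK_SIZE; k++)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();                    // step 2c
    }

    // Step 4: write the accumulated result to global memory at (x, y) of C
    d_C[y * B_WIDTH + x] = sum;
}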

Following the approach described for the example in Figure 1, modify the code where it is marked TODO to support the general case (any sizes of A and B which are a multiple of BLOCK_SIZE). Test and benchmark your code against the original version. Do not worry about shared memory bank conflicts until you have implemented your code (it is unlikely that you have introduced any). Once you have a working solution, work out what the stride of your shared memory accesses is in order to understand why your memory access pattern is conflict free. Any shared memory bank conflicts can be resolved by either inverting the storage of your shared memory tiles or by padding your shared memory storage.
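As an illustration of the padding approach, adding one unused column to each tile is usually enough to move consecutive rows onto different banks. This is a sketch for a 2D tile rather than a change you are expected to need:

// One extra column per row changes the row stride from BLOCK_SIZE to BLOCK_SIZE + 1,
// so column accesses no longer map to the same shared memory bank
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE + 1];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE + 1];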

Note: when building in release mode the CUDA compiler often fuses a multiply and add instruction, which improves performance but causes a small loss of accuracy (see NVIDIA's floating point whitepaper: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf). To ensure that your test passes you should either:

1. Implement a small epsilon value and compare the difference between the host and device versions against it to ensure that it is within an acceptable range.

2. Modify the nvcc flags to avoid the use of fused multiply and add with -fmad=false.
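A minimal sketch of option 1 is shown below. The host reference array h_C_ref, the device result array h_C, the C dimensions (A_HEIGHT x B_WIDTH) and the epsilon value are all assumptions and will need adapting to the names used in the exercise code.

#define EPSILON 0.001f

// Compare the device result against the host reference, allowing a small tolerance
int errors = 0;
for (int i = 0; i < A_HEIGHT * B_WIDTH; i++) {
    if (fabsf(h_C_ref[i] - h_C[i]) > EPSILON)
        errors++;
}
printf("Test %s (%d mismatches)\n", errors == 0 ? "PASSED" : "FAILED", errors);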

Exercise 01 Part 2

We would like to know the occupancy of the matrix multiply example. We can calculate the theoretical occupancy through the following formula:


Occupancy = (maxActiveBlocks × TPB) / (maxThreadsPerMultiProcessor × multiProcessorCount)

Where TPB is threads per block; maxThreadsPerMultiProcessor and multiProcessorCount can be queried from the device properties, and maxActiveBlocks is the maximum number of blocks that can be resident across the whole device at once.

2.1 Calculate the occupancy of the kernel and print it to the console.

2.2 Modify the build arguments to output the register and shared memory usage. Confirm that you are calculating the occupancy correctly by comparing it to the value in the CUDA_Occupancy_Calculator.xls spreadsheet.
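As a guide for task 2.1, the runtime calculation might look something like the sketch below; the kernel name and the use of BLOCK_SIZE * BLOCK_SIZE threads per block are assumptions taken from this exercise.

// Maximum number of blocks that can be resident on one SM for this kernel
int maxActiveBlocksPerSM;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocksPerSM,
    matrixMulCUDASharedMemory, BLOCK_SIZE * BLOCK_SIZE, 0);

// Device limits needed by the formula above
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);

int maxActiveBlocks = maxActiveBlocksPerSM * props.multiProcessorCount;
int TPB = BLOCK_SIZE * BLOCK_SIZE;
float occupancy = (float)(maxActiveBlocks * TPB) /
                  (float)(props.maxThreadsPerMultiProcessor * props.multiProcessorCount);
printf("Theoretical occupancy: %.2f\n", occupancy);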

Exercise 02

It would benefit performance to achieve optimal occupancy in our matrix multiplication code. We can do this using the CUDA Occupancy API; however, our code has a statically defined block size. Make a copy of your code from the previous exercise and make the following changes.

2.1 Implement a function (requiredSM) to dynamically calculate shared memory usage based on a given block size. For now, use the existing BLOCK_SIZE macro.
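A minimal sketch of requiredSM is given below, assuming the kernel needs two square tiles of floats (one for A and one for B) per thread block. The argument is the total number of threads per block, which is the signature later expected by the occupancy API.

// Shared memory needed for a block of 'blockSize' threads: one float per thread
// for the A tile and one for the B tile
size_t requiredSM(int blockSize)
{
    return 2 * blockSize * sizeof(float);
}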

2.2 Modify the kernel launch to dynamically allocate shared memory for the kernel. Within your kernel you will need to create pointers to the A and B sub-matrices, as the kernel launch will have allocated a single linear block of shared memory. You will also need to update the indexing of the sub-matrices, as your shared memory sub-matrices are now 1D arrays. Test your code using the existing BLOCK_SIZE to ensure you have not introduced any errors.
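One possible shape for these changes is sketched below; the launch configuration and variable names are illustrative rather than the required solution.

// Host side: request the dynamic shared memory at launch
matrixMulCUDASharedMemory<<<grid, block, requiredSM(BLOCK_SIZE * BLOCK_SIZE)>>>(d_A, d_B, d_C);

// Device side: the launch provides one linear block of shared memory,
// which is split into pointers for the A and B tiles
extern __shared__ float sm[];
float *As = sm;                               // first BLOCK_SIZE * BLOCK_SIZE floats
float *Bs = &sm[BLOCK_SIZE * BLOCK_SIZE];     // B tile follows immediately after

// The 2D tile indexing becomes 1D, e.g.
As[ty * BLOCK_SIZE + tx] = d_A[y * A_WIDTH + a_x];
Bs[ty * BLOCK_SIZE + tx] = d_B[b_y * B_WIDTH + x];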

2.3 Modify the macro BLOCK_SIZE so that it is instead a variable in CUDA constant memory. Calculate the BLOCK_SIZE which gives the best theoretical occupancy by implementing the following steps (a sketch covering steps 2.3.1 to 2.4 is given after step 2.4):

2.3.1 Use the cudaOccupancyMaxPotentialBlockSizeVariableSMem function to find the thread block size with the best occupancy. See the API documentation for more details.

2.3.2 Apply the following formula to ensure that the square root of the thread block size is an integer (i.e. the root has no decimal part) and also a power of two:

ThreadsPerBlock = 4^floor(log_e(ThreadsPerBlock) / log_e(4))

2.3.3 Calculate the block size using the sqrt function.

2.3.4 Copy the block size to constant memory.

2.4 Modify your code to print the block size with the highest occupancy to the console.
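Steps 2.3.1 to 2.4 could be implemented along the following lines; d_BLOCK_SIZE is an assumed name for the constant memory variable, and requiredSM is the function from 2.1.

__constant__ int d_BLOCK_SIZE;   // replaces the BLOCK_SIZE macro (step 2.3)

// 2.3.1: query the threads-per-block value with the best theoretical occupancy
int minGridSize, threadsPerBlock;
cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &threadsPerBlock,
    matrixMulCUDASharedMemory, requiredSM);

// 2.3.2: round down to a power of 4 so that sqrt() gives an integer power of 2
threadsPerBlock = (int)pow(4.0, floor(log((double)threadsPerBlock) / log(4.0)));

// 2.3.3: width/height of the square thread block
int blockSize = (int)sqrt((double)threadsPerBlock);

// 2.3.4: copy the chosen block size to constant memory
cudaMemcpyToSymbol(d_BLOCK_SIZE, &blockSize, sizeof(int));

// 2.4: report the block size with the highest occupancy
printf("Best block size for occupancy: %d x %d (%d threads per block)\n",
       blockSize, blockSize, threadsPerBlock);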

2.5 Test your code, then build and execute in release mode to compare with the original version with lower occupancy.

CUDA occupancy API documentation: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g77b3bfb154b86e215a5bc01509ce8ea6