Program Assignment #2

Due date: Nov. 16, 2021

Problem 1: Matrix-Matrix Multiplication

This first hands-on lab introduces a famous and widely used example application in the parallel programming field: matrix-matrix multiplication. You will complete key portions of a CUDA program that computes this widely applicable kernel.

In this lab you will learn:

‧ How to allocate and free memory on GPU.

‧ How to copy data from CPU to GPU.

‧ How to copy data from GPU to CPU.

‧ How to measure the execution times for memory access and computation

respectively.

‧ How to invoke GPU kernels. (A host-side sketch covering these steps follows this list.)
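The sketch below walks through these steps end to end with a naive kernel. It is a minimal illustration under stated assumptions, not the lab's starter code: the names matMulKernel, runMatMul, and BLOCK_SIZE are invented for this example; only the CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaEventRecord, cudaEventElapsedTime, cudaFree) are the standard API.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define BLOCK_SIZE 16  // illustrative block width; your starter code may differ

    // Naive kernel: each thread computes one element of P = M * N (width x width).
    __global__ void matMulKernel(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }

    void runMatMul(const float *hM, const float *hN, float *hP, int width)
    {
        size_t bytes = (size_t)width * width * sizeof(float);
        float *dM, *dN, *dP;
        float memTimeMs = 0.0f, kernelTimeMs = 0.0f, t;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Allocate device memory for the inputs and the result.
        cudaMalloc((void **)&dM, bytes);
        cudaMalloc((void **)&dN, bytes);
        cudaMalloc((void **)&dP, bytes);

        // Copy host data to the device; the transfer counts as memory access time.
        cudaEventRecord(start);
        cudaMemcpy(dM, hM, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dN, hN, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&t, start, stop);
        memTimeMs += t;

        // Set up kernel execution parameters and launch, timing the computation.
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid((width + BLOCK_SIZE - 1) / BLOCK_SIZE,
                  (width + BLOCK_SIZE - 1) / BLOCK_SIZE);
        cudaEventRecord(start);
        matMulKernel<<<grid, block>>>(dM, dN, dP, width);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&kernelTimeMs, start, stop);

        // Copy the result back to the host, also counted as memory access time.
        cudaEventRecord(start);
        cudaMemcpy(hP, dP, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&t, start, stop);
        memTimeMs += t;

        printf("GPU memory access time: %f ms\n", memTimeMs);
        printf("GPU computation time  : %f ms\n", kernelTimeMs);
        printf("GPU processing time   : %f ms\n", memTimeMs + kernelTimeMs);

        // Free device memory and the timing events.
        cudaFree(dM);
        cudaFree(dN);
        cudaFree(dP);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }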

Your output should look like this:

Input matrix file name:

Setup host side environment and launch kernel:

Allocate host memory for matrices M and N.

M:

N:

Allocate memory for the result on host side.

Initialize the input matrices.

Allocate device memory.

Copy host memory data to device.

Allocate device memory for results.

Setup kernel execution parameters.

# of threads in a block:

# of blocks in a grid :

Executing the kernel…

Copy result from device to host.

GPU memory access time:

GPU computation time :

GPU processing time :

Check results with those computed by CPU.

Computing reference solution.

CPU Processing time :

CPU checksum:

GPU checksum:

Record your runtime with respect to different input matrix sizes as follows:

Matrix Size | GPU Memory Access Time (ms) | GPU Computation Time (ms) | GPU Processing Time (ms) | Ratio of Computation Time as compared with 128 x 128
8 x 8       |                             |                           |                          |
128 x 128   |                             |                           |                          | 1
512 x 512   |                             |                           |                          |
3072 x 3072 |                             |                           |                          |
4096 x 4096 |                             |                           |                          |

What do you see from these numbers?

Problem 2: Matrix-Matrix Multiplication with Tiling and Shared Memory

This lab is an enhanced matrix-matrix multiplication that uses shared memory and synchronization between the threads in a block. Shared memory on the device is allocated to hold the sub-matrix (tile) data for the calculation, so that threads in a block reuse data on chip instead of overtaxing the global-memory bandwidth, as happened in the previous matrix-matrix multiplication lab.

In this lab you will learn:

‧ How to apply tiling on matrix-matrix multiplication.

‧ How to use shared memory on the GPU.

‧ How to apply thread synchronization in a block. (A sketch of such a tiled kernel follows this list.)
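Below is a minimal sketch of a tiled kernel. The tile width TILE_WIDTH and the assumption that the matrix width is an exact multiple of it are simplifications made for this illustration; the lab's actual kernel may need boundary checks for sizes such as 8 x 8.

    #define TILE_WIDTH 16  // illustrative tile size; assumed to divide the matrix width evenly

    // Tiled kernel: each block cooperatively loads one tile of M and one tile of N
    // into shared memory, synchronizes, and accumulates the partial products, so
    // each global-memory element is read width/TILE_WIDTH times instead of width times.
    __global__ void matMulTiled(const float *M, const float *N, float *P, int width)
    {
        __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / TILE_WIDTH; ++t) {
            // Each thread loads one element of the M tile and one of the N tile.
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
            __syncthreads();  // wait until the whole tile is in shared memory

            for (int k = 0; k < TILE_WIDTH; ++k)
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();  // wait before the tiles are overwritten next iteration
        }
        P[row * width + col] = sum;
    }

The host-side setup is the same as in Problem 1, with the block dimensions set to TILE_WIDTH x TILE_WIDTH so that each thread block matches one tile.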

Your output should look like this:

Input matrix file name:

Setup host side environment and launch kernel:

Allocate host memory for matrices M and N.

M:

N:

Allocate memory for the result on host side.

Initialize the input matrices.

Allocate device memory.

Copy host memory data to device.

Allocate device memory for results.

Setup kernel execution parameters.

# of threads in a block:

# of blocks in a grid :

Executing the kernel…

Copy result from device to host.

GPU memory access time:

GPU computation time :

GPU processing time :

Check results with those computed by CPU.

Computing reference solution.

CPU Processing time :

CPU checksum:

GPU checksum:

Record your runtime with respect to different input matrix sizes as follows:

Matrix Size | GPU Memory Access Time (ms) | GPU Computation Time (ms) | GPU Processing Time (ms) | Ratio of Computation Time as compared with 128 x 128
8 x 8       |                             |                           |                          |
128 x 128   |                             |                           |                          | 1
512 x 512   |                             |                           |                          |
3072 x 3072 |                             |                           |                          |
4096 x 4096 |                             |                           |                          |

What do you see from these numbers? Have they improved significantly compared with the previous matrix-matrix multiplication implementation?

Problem 3: Matrix-Matrix Multiplication with Tiling and Constant Memory

This lab is an enhanced matrix-matrix multiplication that uses constant memory and synchronization between the threads in a block. Allocate constant memory for matrices M and N, as in the sketch below.
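The following is a minimal sketch of the constant-memory part only; the tiling logic is the same as in Problem 2 and is omitted here. Constant memory is limited, typically 64 KB per device, so whole input matrices only fit for small sizes. CONST_WIDTH, Mc, Nc, matMulConst, and uploadInputs are illustrative names for this sketch, not starter-code identifiers.

    #include <cuda_runtime.h>

    // Illustrative maximum: two 64 x 64 float matrices take 32 KB, within the
    // usual 64 KB constant-memory limit; sizes such as 4096 x 4096 do not fit.
    #define CONST_WIDTH 64

    // Input matrices in constant memory: reads are served by the constant cache
    // and broadcast when the threads of a warp access the same element.
    __constant__ float Mc[CONST_WIDTH * CONST_WIDTH];
    __constant__ float Nc[CONST_WIDTH * CONST_WIDTH];

    __global__ void matMulConst(float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += Mc[row * width + k] * Nc[k * width + col];
            P[row * width + col] = sum;
        }
    }

    // Host side: copy the inputs to the constant symbols instead of cudaMalloc'd
    // global-memory buffers; only the result P still needs a device allocation.
    void uploadInputs(const float *hM, const float *hN, int width)
    {
        size_t bytes = (size_t)width * width * sizeof(float);
        cudaMemcpyToSymbol(Mc, hM, bytes);
        cudaMemcpyToSymbol(Nc, hN, bytes);
    }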

Record your runtime with respect to different input matrix sizes as follows:

Matrix Size | GPU Memory Access Time (ms) | GPU Computation Time (ms) | GPU Processing Time (ms) | Ratio of Computation Time as compared with 128 x 128
8 x 8       |                             |                           |                          |
128 x 128   |                             |                           |                          | 1
512 x 512   |                             |                           |                          |
3072 x 3072 |                             |                           |                          |
4096 x 4096 |                             |                           |                          |

What do you see from these numbers? Have they improved significantly compared with the previous matrix-matrix multiplication implementation?