Program Assignment #2

Due date: Nov. 16, 2021

Problem 1: Matrix-Matrix Multiplication

This first hands-on lab introduces a famous and widely used example application in the parallel programming field: matrix-matrix multiplication. You will complete key portions of a CUDA program that computes this widely applicable kernel.

In this lab you will learn:

‧ How to allocate and free memory on GPU.

‧ How to copy data from CPU to GPU.

‧ How to copy data from GPU to CPU.

‧ How to measure the execution times for memory access and computation

respectively.

‧ How to invoke GPU kernels. (A host-side sketch covering these steps follows this list.)
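The sketch below walks through these steps end to end with a naive kernel. It is a minimal illustration under stated assumptions, not the lab's starter code: the names matMulKernel, runMatMul, and BLOCK_SIZE are invented for this example; only the CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaEventRecord, cudaEventElapsedTime, cudaFree) are the standard API.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define BLOCK_SIZE 16  // illustrative block width; your starter code may differ

    // Naive kernel: each thread computes one element of P = M * N (width x width).
    __global__ void matMulKernel(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }

    void runMatMul(const float *hM, const float *hN, float *hP, int width)
    {
        size_t bytes = (size_t)width * width * sizeof(float);
        float *dM, *dN, *dP;
        float memTimeMs = 0.0f, kernelTimeMs = 0.0f, t;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Allocate device memory for the inputs and the result.
        cudaMalloc((void **)&dM, bytes);
        cudaMalloc((void **)&dN, bytes);
        cudaMalloc((void **)&dP, bytes);

        // Copy host data to the device; the transfer counts as memory access time.
        cudaEventRecord(start);
        cudaMemcpy(dM, hM, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dN, hN, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&t, start, stop);
        memTimeMs += t;

        // Set up kernel execution parameters and launch, timing the computation.
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid((width + BLOCK_SIZE - 1) / BLOCK_SIZE,
                  (width + BLOCK_SIZE - 1) / BLOCK_SIZE);
        cudaEventRecord(start);
        matMulKernel<<<grid, block>>>(dM, dN, dP, width);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&kernelTimeMs, start, stop);

        // Copy the result back to the host, also counted as memory access time.
        cudaEventRecord(start);
        cudaMemcpy(hP, dP, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&t, start, stop);
        memTimeMs += t;

        printf("GPU memory access time: %f ms\n", memTimeMs);
        printf("GPU computation time  : %f ms\n", kernelTimeMs);
        printf("GPU processing time   : %f ms\n", memTimeMs + kernelTimeMs);

        // Free device memory and the timing events.
        cudaFree(dM);
        cudaFree(dN);
        cudaFree(dP);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }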

Your output should look like this:

Input matrix file name:

Setup host side environment and launch kernel:

Allocate host memory for matrices M and N.

M:

N:

Allocate memory for the result on host side.

Initialize the input matrices.

Allocate device memory.

Copy host memory data to device.

Allocate device memory for results.

Setup kernel execution parameters.

# of threads in a block:

# of blocks in a grid :

Executing the kernel…

Copy result from device to host.

GPU memory access time:

GPU computation time :

GPU processing time :

Check results with those computed by CPU.

Computing reference solution.

CPU Processing time :

CPU checksum:

GPU checksum:

Record your runtime with respect to different input matrix sizes as follows:

Matrix Size | GPU Memory Access Time (ms) | GPU Computation Time (ms) | GPU Processing Time (ms) | Ratio of Computation Time as compared with 128 x 128
8 x 8       |                             |                           |                          |
128 x 128   |                             |                           |                          | 1
512 x 512   |                             |                           |                          |
3072 x 3072 |                             |                           |                          |
4096 x 4096 |                             |                           |                          |

What do you see from these numbers?

Problem 2: Matrix-Matrix Multiplication with Tiling and Shared Memory

This lab is an enhanced matrix-matrix multiplication that uses shared memory and synchronization between the threads in a block. Shared memory on the device is allocated to hold the sub-matrix (tile) data for the calculation, so that threads in a block reuse data on chip instead of overtaxing the global-memory bandwidth, as happened in the previous matrix-matrix multiplication lab.

In this lab you will learn:

‧ How to apply tiling on matrix-matrix multiplication.

‧ How to use shared memory on the GPU.

‧ How to apply thread synchronization in a block. (A sketch of such a tiled kernel follows this list.)
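Below is a minimal sketch of a tiled kernel. The tile width TILE_WIDTH and the assumption that the matrix width is an exact multiple of it are simplifications made for this illustration; the lab's actual kernel may need boundary checks for sizes such as 8 x 8.

    #define TILE_WIDTH 16  // illustrative tile size; assumed to divide the matrix width evenly

    // Tiled kernel: each block cooperatively loads one tile of M and one tile of N
    // into shared memory, synchronizes, and accumulates the partial products, so
    // each global-memory element is read width/TILE_WIDTH times instead of width times.
    __global__ void matMulTiled(const float *M, const float *N, float *P, int width)
    {
        __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / TILE_WIDTH; ++t) {
            // Each thread loads one element of the M tile and one of the N tile.
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
            __syncthreads();  // wait until the whole tile is in shared memory

            for (int k = 0; k < TILE_WIDTH; ++k)
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();  // wait before the tiles are overwritten next iteration
        }
        P[row * width + col] = sum;
    }

The host-side setup is the same as in Problem 1, with the block dimensions set to TILE_WIDTH x TILE_WIDTH so that each thread block matches one tile.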

Your output should look like this:

Input matrix file name:

Setup host side environment and launch kernel:

Allocate host memory for matrices M and N.

M:

N:

Allocate memory for the result on host side.

Initialize the input matrices.

Allocate device memory.

Copy host memory data to device.

Allocate device memory for results.

Setup kernel execution parameters.

# of threads in a block:

# of blocks in a grid :

Executing the kernel…

Copy result from device to host.

GPU memory access time:

GPU computation time :

GPU processing time :

Check results with those computed by CPU.

Computing reference solution.

CPU Processing time :

CPU checksum:

GPU checksum:

Record your runtime with respect to different input matrix sizes as follows:

Matrix Size | GPU Memory Access Time (ms) | GPU Computation Time (ms) | GPU Processing Time (ms) | Ratio of Computation Time as compared with 128 x 128
8 x 8       |                             |                           |                          |
128 x 128   |                             |                           |                          | 1
512 x 512   |                             |                           |                          |
3072 x 3072 |                             |                           |                          |
4096 x 4096 |                             |                           |                          |

What do you see from these numbers? Have they improved significantly compared with the previous matrix-matrix multiplication implementation?

Problem 3: Matrix-Matrix Multiplication with Tiling and Constant Memory

This lab is an enhanced matrix-matrix multiplication that uses constant memory and synchronization between the threads in a block. Allocate constant memory for matrices M and N, as in the sketch below.
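The following is a minimal sketch of the constant-memory part only; the tiling logic is the same as in Problem 2 and is omitted here. Constant memory is limited, typically 64 KB per device, so whole input matrices only fit for small sizes. CONST_WIDTH, Mc, Nc, matMulConst, and uploadInputs are illustrative names for this sketch, not starter-code identifiers.

    #include <cuda_runtime.h>

    // Illustrative maximum: two 64 x 64 float matrices take 32 KB, within the
    // usual 64 KB constant-memory limit; sizes such as 4096 x 4096 do not fit.
    #define CONST_WIDTH 64

    // Input matrices in constant memory: reads are served by the constant cache
    // and broadcast when the threads of a warp access the same element.
    __constant__ float Mc[CONST_WIDTH * CONST_WIDTH];
    __constant__ float Nc[CONST_WIDTH * CONST_WIDTH];

    __global__ void matMulConst(float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += Mc[row * width + k] * Nc[k * width + col];
            P[row * width + col] = sum;
        }
    }

    // Host side: copy the inputs to the constant symbols instead of cudaMalloc'd
    // global-memory buffers; only the result P still needs a device allocation.
    void uploadInputs(const float *hM, const float *hN, int width)
    {
        size_t bytes = (size_t)width * width * sizeof(float);
        cudaMemcpyToSymbol(Mc, hM, bytes);
        cudaMemcpyToSymbol(Nc, hN, bytes);
    }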

Record your runtime with respect to different input matrix sizes as follows:

Matrix Size | GPU Memory Access Time (ms) | GPU Computation Time (ms) | GPU Processing Time (ms) | Ratio of Computation Time as compared with 128 x 128
8 x 8       |                             |                           |                          |
128 x 128   |                             |                           |                          | 1
512 x 512   |                             |                           |                          |
3072 x 3072 |                             |                           |                          |
4096 x 4096 |                             |                           |                          |

What do you see from these numbers? Have they improved significantly compared with the previous matrix-matrix multiplication implementation?