Department of Electrical and Computer Engineering, Queen's University
ELEC-374, Digital Systems Engineering
Machine Problems 1-4
For this and other machine problems, you may consult the Lecture Slides on Heterogeneous Computing – GPU Architectures and Computing and the GPU CUDA Environment Tutorial on the course website. You may also consult the NVIDIA CUDA C Programming Guide:
http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
Submission Instructions:
The machine problems are to be done individually. Students must work through the assignment on their own, include their own solutions and analyses, and turn in their own reports.
Please submit each problem separately through onQ (machine problem assignments) by the due date, March 25th. Attach a single zip file, "MP#_YourLastName.zip", containing your code (e.g., mp1.cu, with your name and student ID at the top of the file) and a report in PDF format that presents your work: your CUDA code, Visual Studio output screenshots, analyses of your results, and discussion of any outstanding issues.
GPU Servers
Four GPU (Tesla C2075) servers are available for the machine problems. The servers run the Windows Server operating system and provide the Visual Studio 2015 IDE for building CUDA projects. For remote connection to the GPU servers and information on CUDA and its environment, please consult the GPU CUDA Environment Tutorial under week 7. A load balancer distributes the workload across the GPU servers, so you may log in to a different server each time. It is therefore very important to back up your files and save your project files on the Z drive.
Machine Problem #1: Device Query
The objective of this machine problem is to understand the capabilities of the NVIDIA GPU. You will also test your environment and make sure you can build and run your CUDA programs.
Write a program that identifies the number and type of CUDA devices on the GPU servers, along with each device's clock rate, number of streaming multiprocessors (SMs), number of cores, warp size, amount of global memory, amount of constant memory, amount of shared memory per block, number of registers available per block, maximum number of threads per block, maximum size of each dimension of a block, and maximum size of each dimension of a grid.
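A minimal sketch of such a query, using cudaGetDeviceCount and cudaGetDeviceProperties from the CUDA runtime API. Note that the core count is not a field of cudaDeviceProp and must be derived from the compute capability; the assumption below of 32 cores per SM holds for compute capability 2.0 devices such as the Tesla C2075, but not for other architectures.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Number of CUDA devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: %s\n", i, p.name);
        printf("  Clock rate: %d kHz\n", p.clockRate);
        printf("  Streaming multiprocessors: %d\n", p.multiProcessorCount);
        // Cores per SM depends on compute capability: 32 on compute 2.0 (Fermi).
        printf("  Cores (assuming 32 per SM): %d\n", 32 * p.multiProcessorCount);
        printf("  Warp size: %d\n", p.warpSize);
        printf("  Global memory: %zu bytes\n", p.totalGlobalMem);
        printf("  Constant memory: %zu bytes\n", p.totalConstMem);
        printf("  Shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("  Registers per block: %d\n", p.regsPerBlock);
        printf("  Max threads per block: %d\n", p.maxThreadsPerBlock);
        printf("  Max block dimensions: %d x %d x %d\n",
               p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
        printf("  Max grid dimensions: %d x %d x %d\n",
               p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
    }
    return 0;
}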
Machine Problem #2: Matrix Addition
The objective of this machine problem is to implement matrix addition routines where each thread produces one or more output matrix elements. It will also familiarize you with the CUDA API and the associated setup code.
A matrix addition takes two input matrices A and B and produces an output matrix C. Each element of the output matrix C is the sum of the corresponding elements of the input matrices A and B. For simplicity, we will only handle square matrices whose elements are integers. Write a matrix addition kernel and a host function that can be called with four parameters: a pointer to the output matrix C, a pointer to the first input matrix A, a pointer to the second input matrix B, and the number of elements in each dimension. After the device matrix addition is invoked, the host function will compute the correct output matrix using the CPU and compare that solution with the device-computed solution. If they match, it will display “Test PASSED” on the screen before exiting.
Follow the instructions below:
1. Write the host code by allocating memory for the input and output matrices, transferring input data to the device, launching the kernel, transferring the output data to host, and freeing the device memory for the input and output data. Leave the execution configuration parameters open for this step.
2. Write a kernel in which each thread produces one output matrix element. Fill in the execution configuration parameters for this design using 16×16 thread blocks (a sketch of the host function and all three kernel variants appears after this list).
3. Write a kernel in which each thread produces one output matrix row. Fill in the execution configuration parameters for this design using 16 threads per block.
4. Write a kernel in which each thread produces one output matrix column. Fill in the execution configuration parameters for this design using 16 threads per block.
5. Analyze the pros and cons of each kernel design above.
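A minimal sketch of the host function (step 1) and the three kernel variants (steps 2-4), assuming square n x n integer matrices in row-major order; the names matrixAdd, addElement, addRow, and addCol are illustrative, not required.

// Each thread computes one element C[row][col] (step 2).
__global__ void addElement(int *C, const int *A, const int *B, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];
}

// Each thread computes one full row of C (step 3); 1-D configuration,
// e.g. addRow<<<(n + 15) / 16, 16>>>(dC, dA, dB, n);
__global__ void addRow(int *C, const int *A, const int *B, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n)
        for (int col = 0; col < n; ++col)
            C[row * n + col] = A[row * n + col] + B[row * n + col];
}

// Each thread computes one full column of C (step 4); launched like addRow.
__global__ void addCol(int *C, const int *A, const int *B, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n)
        for (int row = 0; row < n; ++row)
            C[row * n + col] = A[row * n + col] + B[row * n + col];
}

// Host-side skeleton for step 1 (error checking omitted for brevity).
void matrixAdd(int *C, const int *A, const int *B, int n) {
    size_t bytes = (size_t)n * n * sizeof(int);
    int *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    dim3 block2D(16, 16);                              // 16x16 thread blocks
    dim3 grid2D((n + 15) / 16, (n + 15) / 16);         // ceiling division
    addElement<<<grid2D, block2D>>>(dC, dA, dB, n);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}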
Create randomly initialized matrices A and B, and experiment with different matrix sizes (16 x 16, 256 x 256, and 4096 x 4096). In each case, measure the kernel execution time, and use a graph or table to compare GPU performance against CPU performance. Analyze your results.
Note that in this machine problem we are not concerned with the data transfer cost, which is in fact a real concern in GPU computing and one we will consider in Machine Problem 3. Also, do not include device memory allocation/free time in your timing. Remember to free all host and device allocations at the end of the program.
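One common way to time just the kernel is with CUDA events; a sketch, reusing addElement, grid2D, and block2D from the sketch above (the CPU reference can be timed with clock() or std::chrono).

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                  // record before the launch
addElement<<<grid2D, block2D>>>(dC, dA, dB, n);
cudaEventRecord(stop);                   // record after the launch
cudaEventSynchronize(stop);              // wait for the kernel to finish
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);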
Machine Problem #3: Matrix Multiplication
The objective of this machine problem is to implement a dense matrix multiplication routine with different numbers of blocks and threads per block. It will also show the impact of data transfer time on performance.
A matrix multiplication takes two input matrices M and N and produces an output matrix P. For simplicity, we will only handle square matrices whose elements are integers. Write a matrix multiplication kernel and a host function that can be called with four parameters: a pointer to the output matrix P, a pointer to the first input matrix M, a pointer to the second input matrix N, and the number of elements in each dimension. After the device matrix multiplication is invoked, the host function will compute the correct output matrix using the CPU and compare that solution with the device-computed solution. If they match, it will display “Test PASSED” on the screen before exiting.
Create randomly initialized matrices M and N. Write the host code: allocate memory for the input and output matrices, transfer the input data to the device, launch the kernel, transfer the output data back to the host, and free the device memory. Write a kernel in which each thread produces one output matrix element, and set the execution configuration parameters in your host code accordingly.
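A minimal sketch of the one-element-per-thread kernel, assuming square width x width integer matrices in row-major order; the name matMul is illustrative.

__global__ void matMul(int *P, const int *M, const int *N, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < width && col < width) {
        int sum = 0;
        for (int k = 0; k < width; ++k)            // dot product of row of M
            sum += M[row * width + k] * N[k * width + col]; // and column of N
        P[row * width + col] = sum;
    }
}

// For a thread-block width b (1, 4, 16, or 32 in Part 2 below):
// dim3 block(b, b);
// dim3 grid((width + b - 1) / b, (width + b - 1) / b);
// matMul<<<grid, block>>>(dP, dM, dN, width);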
In this machine problem, we also want to understand the impact of data transfers from the host to the device and from the device to the host on overall performance, and whether offloading a kernel to the GPU is beneficial in all cases.
Follow the instructions below:
1. Measure the time it takes to transfer the two input matrices from the host to the device. Experiment with different matrix sizes (16 x 16, 256 x 256, and 4096 x 4096), and plot the data transfer time vs. matrix size (a timing sketch appears after this list).
2. Using the same matrix sizes as in Part 1, compare the matrix multiplication time on the GPU with that on the CPU. For the GPU computation, use thread-block widths of 1, 4, 16, and 32, and compute the number of required blocks from the number of elements in the input matrices. Exclude data transfer time and device memory allocation/free time; measure only the matrix multiplication time on the GPU and CPU. Now, if you take the total data transfer time into account (both host to device and device to host), is it always beneficial to offload the matrix multiplication to the device? Plot the application performance including data transfers. Also, explain the effect of changing the number of threads per block. Plot the kernel execution time vs. number of blocks/block width and discuss your findings. Moreover, answer the following questions:
a) How many times is each element of each input matrix loaded during the execution of the kernel?
b) What is the ratio of integer computations to memory accesses in each thread? Consider multiply and addition as separate operations and ignore the global memory store at the end. Only count global memory loads towards your off-chip bandwidth.
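For Part 1, the host-to-device copies can be timed with the same CUDA-event pattern used for kernels; a sketch, assuming host arrays hM and hN, device buffers dM and dN, and a transfer size of bytes each.

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(dM, hM, bytes, cudaMemcpyHostToDevice);  // first input matrix
cudaMemcpy(dN, hN, bytes, cudaMemcpyHostToDevice);  // second input matrix
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);  // host-to-device transfer time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);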
Machine Problem #4: Tiled Matrix Multiplication
The objective of this machine problem is to familiarize you with using shared memory to write optimized kernels by implementing a "tiled" version of matrix multiplication.
A matrix multiplication takes two input matrices M and N and produces an output matrix P. For simplicity, we will only handle square matrices whose elements are integers. Write a shared-memory-based (tiled) matrix multiplication kernel and a host function that can be called with four parameters: a pointer to the output matrix P, a pointer to the first input matrix M, a pointer to the second input matrix N, and the number of elements in each dimension. After the device matrix multiplication is invoked, the host function will compute the correct output matrix using the CPU and compare that solution with the device-computed solution. If they match (within a certain tolerance), it will display “Test PASSED” on the screen before exiting.
Create randomly initialized matrices M and N, and experiment with different matrix sizes (16 x 16, 256 x 256, and 4096 x 4096). Write the host code: allocate memory for the input and output matrices, transfer the input data to the device, launch the kernel, transfer the output data back to the host, and free the device memory. Write a kernel in which each thread produces one output matrix element. Set the execution configuration parameters in your host code for TILE_WIDTH values of 4, 16, and 256. Do not include data transfer time or device memory allocation/free time in your timing. Plot the results and discuss your findings. Are the results faster than the baseline matrix multiplication in Machine Problem 3, and why?
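A minimal sketch of the tiled kernel, assuming the matrix width is a multiple of TILE_WIDTH; for TILE_WIDTH values where this does not hold, or where TILE_WIDTH x TILE_WIDTH exceeds the per-block thread limit you queried in Machine Problem 1, the launch configuration and boundary checks need extra care.

#define TILE_WIDTH 16  // set to 4, 16, or 256 for the experiments

__global__ void matMulTiled(int *P, const int *M, const int *N, int width) {
    __shared__ int sM[TILE_WIDTH][TILE_WIDTH];
    __shared__ int sN[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int sum = 0;

    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // Each thread stages one element of each input tile into shared memory.
        sM[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
        sN[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += sM[threadIdx.y][k] * sN[k][threadIdx.x];
        __syncthreads();  // wait before overwriting the tiles
    }
    P[row * width + col] = sum;
}

// Launch configuration, for width a multiple of TILE_WIDTH:
// dim3 block(TILE_WIDTH, TILE_WIDTH);
// dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
// matMulTiled<<<grid, block>>>(dP, dM, dN, width);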
Answer the following questions:
a) In your kernel implementation, how many threads can be simultaneously scheduled on your CUDA device, which contains 14 streaming multiprocessors?
b) Can you determine the resource usage of your kernel, including the number of registers, the shared memory size, the number of blocks per streaming multiprocessor, and the maximum total number of threads simultaneously scheduled/executing?
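As a suggested (not required) approach for part b, nvcc can print each kernel's register and shared-memory usage at compile time via the verbose ptxas flag, which can also be added to the CUDA C/C++ command-line options of your Visual Studio project:

nvcc --ptxas-options=-v mp4.cu

Combining this output with the per-block and per-SM limits queried in Machine Problem 1 lets you estimate the number of blocks per SM and the total number of simultaneously resident threads.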