The George Washington University
ECE6105 – Introduction to High Performance Computing
Homework 7
Due December 7th, end of the day
In this homework, you will implement untiled and tiled matrix transposition and measure their performance on Intel KNL (Knights Landing).
Loop Tiling
Loop tiling was explained in the last lecture. Here is a practical overview of the optimization you are supposed to implement. The naïve implementation does the transposition in the following order:
[Figure: untiled access order, shown for the output matrix and the input matrix]
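Although this order of operations is intuitive and is how you’d transpose a matrix by hand, it causes poor cache utilization when accessing the input matrix: while the output is written along a row, the input is read down a column, so consecutive reads are a full row apart in memory. To improve cache utilization, one can “tile” the loop structure to have the following access order: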
[Figure: tiled access order, shown for the output matrix and the input matrix]
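The two access orders can be sketched in C as follows. This is a minimal serial sketch, not the skeleton's code: the function names, the `double` element type, the flat row-major layout and the `tile` parameter are all assumptions made for illustration, and threading is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Untiled order (hypothetical helper): walk the output row by row.
   Writes to out are contiguous; reads from in jump n elements each step. */
void transpose_naive(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)            /* row of the output    */
        for (size_t j = 0; j < n; j++)        /* column of the output */
            out[i * n + j] = in[j * n + i];   /* strided read         */
}

/* Tiled order (hypothetical helper): the two outer loops pick a tile,
   the two inner loops visit the elements inside that tile. */
void transpose_tiled(const double *in, double *out, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the tile edges so n need not be a multiple of tile. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    out[i * n + j] = in[j * n + i];
        }
    }
}
```

Both functions compute the same result; only the order in which elements are visited differs. Within one tile the strided reads stay inside a small block of the input, which is what gives the cache a chance to reuse the lines it has already loaded.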
Requirements and Submission
1. Find the skeleton code on Blackboard.
a. The tarball includes all the source files and a Makefile to build the benchmark on your own machine and on the KNL machines.
b. Inspect the code to familiarize yourself with the variables and general structure.
c. If you run make, it should produce 4 executables: transpose, transpose_debug, transpose_tiled, transpose_tiled_debug.
d. There are two spots in the source file that you should fill in: one for untiled transposition, one for tiled.
e. In the Makefile, you can choose to use gcc for compilation if you are developing on your own machine, but if you are using the KNL machines (see the next item) you need to use icc (Intel C Compiler) to get a more optimized executable. To do that, set the compiler variable accordingly:
i. On your own machine: CC=gcc
ii. On KNL machines: CC=icc
2. Make sure you have access to KNL machines.
a. We will use three KNL machines: knl1, knl2 and knl4.
b. All of them should be accessible to you from server2 via commands like ssh knl1 etc.
i. Let me know (engin@gwu.edu) if you cannot access any of them.
c. These machines are identical and using them DOES NOT require Slurm. Therefore, you can run your executable as if you are using your own machine.
d. The only caveat is that knl1 is shared across all of you, so you should not use it for performance measurement: your classmates’ work can interfere with your performance results. knl2 and knl4 are exclusive-access machines that are available on a first-come, first-served basis. You should use these for performance measurements, but you shouldn’t use them for long periods of time, to allow fair usage for all classmates. Ideally you should develop and test for correctness on knl1, then use the others for performance tests and log out as soon as you are done. You don’t need to move files between these machines; you have the same home folder on all of them (including pyramid and server2).
3. Implement and measure the naïve version of matrix transposition.
a. This implementation will be straightforward and have two tightly nested loops.
b. The skeleton implementation already has controls for changing the number of threads, data sizes etc. Check the beginning of the main function to see the command line arguments and their meanings.
c. Do a strong scaling study with a 4096×4096 matrix, using 1 to 256 threads (use only power-of-two numbers of threads).
i. Make sure you use icc (Intel C Compiler) for the performance study.
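For the strong scaling study, the problem size stays fixed while the thread count grows, and the usual quantities to report are speedup, T(1)/T(p), and parallel efficiency, speedup divided by p. The helpers below are a sketch for post-processing your own timings; their names and signatures are hypothetical and they are not part of the skeleton.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup on p threads: single-thread time divided by p-thread time. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency: speedup per thread; 1.0 would be ideal scaling. */
double efficiency(double t1, double tp, unsigned threads)
{
    assert(threads > 0);
    return speedup(t1, tp) / (double)threads;
}

/* Fill counts[] with 1, 2, 4, ... up to max_threads (at most cap entries).
   Returns how many thread counts were written. */
size_t power_of_two_threads(unsigned max_threads, unsigned *counts, size_t cap)
{
    size_t k = 0;
    if (max_threads == 0)
        return 0;
    for (unsigned p = 1; k < cap; p *= 2) {
        counts[k++] = p;
        if (p > max_threads / 2)   /* the next doubling would exceed the limit */
            break;
    }
    return k;
}
```

With `max_threads` set to 256, `power_of_two_threads` yields the nine thread counts from 1 to 256 that the study calls for. Plotting speedup or efficiency against these counts, rather than raw time alone, makes it easier to see where scaling flattens.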
4. Implement and measure the tiled version of matrix transposition.
a. This will look like two additional outer loops that iterate over tiles rather than over elements of the matrix.
i. You can choose (and ideally play around with) different tile sizes. The most reasonable range for tile size is from 8 to 64 (stick with powers of two).
b. Do an identical strong scaling study.
i. Make sure you use icc (Intel C Compiler) for the performance study.
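When choosing a tile size in step 4, some rough arithmetic helps narrow the search. Processing one tile touches a tile-by-tile block of the input and one of the output, so its footprint is about 2 × tile × tile × element size bytes. The sketch below computes that footprint and picks the largest power-of-two tile that fits in a given cache capacity. The helper names are hypothetical, and the element size and cache capacity are inputs you must supply yourself: check the skeleton for the element type and the hardware documentation for the cache size.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate bytes touched while processing one tile:
   one tile x tile block of the input plus one of the output. */
size_t tile_footprint_bytes(size_t tile, size_t elem_size)
{
    return 2 * tile * tile * elem_size;
}

/* Largest tile in min_tile, 2*min_tile, 4*min_tile, ... <= max_tile whose
   footprint fits in cache_bytes. Returns 0 if none fits.
   min_tile is assumed to be a power of two. */
size_t largest_fitting_tile(size_t cache_bytes, size_t elem_size,
                            size_t min_tile, size_t max_tile)
{
    size_t best = 0;
    assert(min_tile > 0);
    for (size_t t = min_tile; t <= max_tile; t *= 2) {
        if (tile_footprint_bytes(t, elem_size) <= cache_bytes)
            best = t;
    }
    return best;
}
```

For example, assuming 8-byte elements and a 32 KiB cache (both assumptions), `largest_fitting_tile(32 * 1024, 8, 8, 64)` returns 32, since a 64×64 tile pair would need 64 KiB. Treat this only as a starting point: cache-line effects, associativity and threads sharing a cache all shift the best value, so the measured timings should decide.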
5. Prepare plots and a short report describing the effect of the optimizations.
a. Include a plot comparing the two versions.
b. Write a paragraph describing this optimization in your own words and its impact on performance.
c. Make sure you specify the tile size.
6. Submit all the material on Blackboard.
a. Create a tarball and include all the code, the Makefile and the report.