Compiler Optimisations Performance Programming Exercise

Getting started
1. Download the file CompilerOpt.tar from the course web page.
2. Unpack it with the command tar xvf CompilerOpt.tar

Exercise 1: Basic optimisation
The code for this exercise is in /CompilerOpt/*/Opt1. The main computation is in the loop in routine fred.
Note that for this and the following two exercises, you are asked to modify the code several times: make sure you keep a copy of each (correct) version, including the original one. Subroutine in-lining should be disabled to ensure the timing calls continue to work correctly.
1. Compile the code with no optimisation (use -O0 -fno-inline-functions), and record the performance.
2. Now optimise the code by hand. Ensure that every transformation you make is one that could be done by a compiler: record each stage with the optimisation technique used. How much performance gain can you achieve?
3. Finally, compile both the original and your optimised version with -O3 -fno-inline-functions.
How does the performance of each compare with that of your hand-optimised version compiled without optimisation, and why? Use the -S option to generate the assembly code for the various versions.
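As a rough illustration of the kind of hand transformation step 2 asks for, the sketch below applies three standard compiler techniques to a made-up loop. The routine names scale_naive and scale_hand and the loop body are hypothetical; this is not the course's fred routine, whose contents are not shown here.

```c
#include <stddef.h>

/* Hypothetical starting point (NOT the course's fred routine). */
void scale_naive(double *a, const double *b, size_t n, double x, double y)
{
    for (size_t i = 0; i < n; i++) {
        /* x * y and b[i] * (x * y) are each evaluated twice per iteration */
        a[i] = b[i] * (x * y) + b[i] * (x * y) / 2.0;
    }
}

/* The same computation after three compiler-style transformations. */
void scale_hand(double *a, const double *b, size_t n, double x, double y)
{
    const double xy = x * y;          /* loop-invariant code motion */
    for (size_t i = 0; i < n; i++) {
        const double t = b[i] * xy;   /* common subexpression elimination */
        a[i] = t + t * 0.5;           /* strength reduction: / 2.0 becomes * 0.5 */
    }
}
```

Each step keeps the order of floating-point operations, so it is the kind of change a compiler is allowed to make without relaxed floating-point flags. Replacing the division by a multiplication is exact here only because 2.0 is a power of two.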
Exercise 2: Loop unrolling
The code for this exercise is in /CompilerOpt/*/Opt2. The main computation is in the loop contained in routine sum.
1. Compile the code with -O3 -unroll=0 -no-vec, which does not invoke the compiler's loop unrolling but allows other optimisations, and record the performance. We disable vectorisation because this also effectively unrolls the loop.
2. Unroll the loop by hand by a factor of 2, remembering to add a clean-up loop.
3. Record the performance. Now generate versions with larger unroll factors: what is the optimum factor?
4. Finally, recompile the original code with -fast -no-vec to observe the compiler's own optimisation. Use -S to generate assembly code and find out the unroll factor used by the compiler.
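A minimal sketch of step 2, unrolling by a factor of 2 with a clean-up loop, is given below. It assumes the loop is a simple reduction over an array of doubles. The names sum_rolled and sum_unroll2 are hypothetical, and the course's sum routine may differ in detail.

```c
#include <stddef.h>

/* Hypothetical rolled reduction, used as the reference version. */
double sum_rolled(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same reduction unrolled by a factor of 2. */
double sum_unroll2(const double *a, size_t n)
{
    double s = 0.0;
    size_t i;

    /* Main loop: two elements per iteration, stopping before a lone tail. */
    for (i = 0; i + 1 < n; i += 2) {
        s += a[i];
        s += a[i + 1];
    }

    /* Clean-up loop: handles the last element when n is odd. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

A single accumulator keeps the additions in their original order, so the result matches the rolled loop exactly. Using one accumulator per unrolled statement can expose more instruction-level parallelism, but it reorders the floating-point additions and may change the rounding.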
Exercise 3: Cache optimisation
The code for this exercise is in /CompilerOpt/*/Opt3. The main computation is in the loops contained in routine matmul, which forms the product of two matrices.
1. Compile the code with -O3 and record the performance. For Fortran you will also need -qno-opt-matmul.
2. Use loop interchange/permutation to improve the cache behaviour. Which loop ordering gives the best performance?

3. Now try tiling all three loops, using the same blocksize for each loop. Experiment to find the optimal blocksize.
4. What happens if you use -fast instead of -O3?
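The interchange and tiling steps above might look roughly like the following in C. This is a sketch under stated assumptions: square n x n matrices stored row-major in flat arrays, a result matrix c that the caller has already zeroed, and a blocksize bs of at least 1. The function names matmul_ikj, matmul_tiled and min_sz are hypothetical and the course's matmul routine may be laid out differently. In Fortran, arrays are column-major, so the favourable loop order is not the same.

```c
#include <stddef.h>

/* c += a * b with the loops in i-k-j order: the inner loop walks
   b and c with unit stride in row-major storage. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];   /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* The same product with all three loops tiled using one blocksize bs.
   min_sz() clips the last tile when n is not a multiple of bs. */
void matmul_tiled(size_t n, size_t bs,
                  const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                for (size_t i = ii; i < min_sz(ii + bs, n); i++)
                    for (size_t k = kk; k < min_sz(kk + bs, n); k++) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + bs, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Passing bs as a run-time argument makes it easy to time several blocksizes without recompiling, although a compile-time constant may let the compiler optimise the inner loops further.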
