PowerPoint Presentation
Parallel Computing
with GPUs:
Optimisation
Dr Paul Richmond
http://paulrichmond.shef.ac.uk/teaching/COM4521/
Last Lecture
All about memory, pointers and storage
We have seen that C is a low level language
Now we would like to consider what makes a program fast.
This Lecture
Optimisation Overview
Compute Bound
Memory Bound
When to Optimise
Is your program complete?
If not then don’t start optimising
If you haven’t started coding then don’t try to perform advanced
optimisations until its complete
This might be counter intuitive
Is it worth it?
Is your code already fast enough?
Are you going to optimise the right bit?
What are the likely benefits? Is it cost effective?
(number of runs × number of users × time savings × user’s salary)
‐ (time spent optimizing × programmer’s salary)
“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical
parts of their programs, and these attempts at efficiency actually have a strong negative impact when
debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of
the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that
critical 3%.” Donald Knuth, Computer Programming as an Art (1974)
First step: Profiling
Which part of the program is the bottleneck
This may be obvious if you have a large loop
May be less obvious in a complicated program or procedure
Manually profiling using time() function
We can time critical aspects of the program using the time command
This gives us insight into how long it takes to execute.
Profiling using a profiler
Unix: gprof
VS2017: Built in profiler
Profiling with clock() – Windows only
#include time.h
The clock() function returns a clock_t value the number of
clock ticks elapsed since the program was launched
To calculate the time in seconds divide by CLOCK_PER_SEC
clock_t begin, end;
float seconds;
begin = clock();
func();
end = clock();
seconds = (end – begin) / (float)CLOCKS_PER_SEC;
VS2017 Profiling Example
Debug->Performance and Diagnostics
Start
Select CPU Sampling, Finish (or next and select project)
No Data? Your program might not run for long enough to sample
VS2017 Profiling Example
Samples
The profiler interrupts at given time intervals to collect information on the
stack
Default sampling is 10,000,000 clock cycles
Inclusive Samples
Time samples including any sub call
Exclusive Samples
Time samples excluding any sub calls
Hot Path
Slowest path of execution through the program
Best candidate for optimisation
Select the function for a line-by-line breakdown of sampling percentage
Compute vs Memory Bound
Compute bound
Performance is limited by the speed of the CPU
CPU usage is high: typically 100% for extended periods of time
Memory Bound
Performance is limited by the memory access speed
CPU usage might be lower
Typically the cache usage will be poor
poor hit rate if fragmented or random accesses
Optimisation Overview
Compute Bound
Memory Bound
Compute Bound: Optimisation
Approach 1: Compile with full optimisation
msvc compiler is very good at optimising code for efficiency
Many of the techniques we will examine can be applied automatically by a
compiler.
Optimisation: Compiler /O Optimisation property
Help the compiler
Refactor code to make it clear (clear to users is clear to a compiler)
Avoid complicated control flow
Optimisation Level Description
/O1 Optimises code for minimum size
/O2 Optimises code for maximum speed
/Od Disables optimisation for debugging
/Oi Generates intrinsic functions for appropriate calls
/Og Enables global optimisations
Compute Bound: Optimisation
Approach 2: Redesign the program
Compilers cant do this and it is most likely to have the biggest impact
If you have a loop that is executed 1000’s of times then find a way to do it
without the loop.
Be familiar with algorithms
Understand big O(n) notation
E.g. Sequential search has many faster replacements
http://bigocheatsheet.com/
Compute Bound: Optimisations
Approach 3: Understand operation performance
Cost of going to disk is massive
Loop Invariant Computations: move operations out of loops where possible
Strength reduction: replace expression with cheaper ones
Core i7 Instruction Cycle Latency
Integer ADD SUB (x32 and x64) 1
Integer MUL (x32 and x64) 3
Integer DIV (x32) 17-28
Integer DIV (x64) 28-90
Floating Point ADD SUB (x32) 3
Floating Point MUL (x32) 5
Floating Point DIV (x32) 7-27 http://www.agner.org/optimize/instruction_tables.pdf
Compute Bound: Optimisations
Approach 4: function in-lining
In-lining increases code size but reduces function calls.
Make your simple function a macro
Use the _inline operator
Be sensible: Not everything should be in-lined
float vec2f_len(vec2f a, vec2f b)
{
vec2f r;
r.x = a.x – b.x;
r.y = a.y b.y;
return (float)sqrt(r.x*r.x + r.y*r.y); //requires #include
}
_inline float vec2f_len(vec2f a, vec2f b)
{
return (float)sqrt((a.x-b.x)*(a.x-b.x) – (a.y-b.y)*(a.y-b.y));
}
#define vec2f_len(a, b) ((float)sqrt((a.x-b.x)*(a.x-b.x) – (a.y-
b.y)*(a.y-b.y)))
Compute Bound: Optimisations
Approach 5: Loop unrolling
msvc can do this automatically
Reduces the number of branch executions
for (int i=0; i<100; i++){
some_function(i);
}
for (int i=0; i<100;){
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
some_function(i); i++;
}
Compute Bound: Optimisations
Approach 6: Loop jamming
Combine adjacent loops to minimise branching (for ranges over the same
variable)
E.g. Reduction of iterating and testing value i
for (i=0; i