
Parallel Computing with GPUs: Optimisation
Dr Paul Richmond

http://paulrichmond.shef.ac.uk/teaching/COM4521/

Last Lecture

All about memory, pointers and storage

We have seen that C is a low-level language

Now we would like to consider what makes a program fast.

This Lecture

Optimisation Overview

Compute Bound

Memory Bound

When to Optimise

Is your program complete?
If not, then don't start optimising
If you haven't finished coding, then don't try to perform advanced optimisations until it is complete
This might be counter-intuitive

Is it worth it?
Is your code already fast enough?
Are you going to optimise the right bit?
What are the likely benefits? Is it cost effective?

(number of runs × number of users × time savings × user's salary)
− (time spent optimising × programmer's salary)

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” Donald Knuth, Computer Programming as an Art (1974)

First step: Profiling

Which part of the program is the bottleneck?
This may be obvious if you have a large loop

It may be less obvious in a complicated program or procedure

Manual profiling using the time() function
We can time critical sections of the program using the time() function (a short sketch follows)

This gives us insight into how long they take to execute.
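For example, here is a minimal sketch of manual timing with the standard time() and difftime() functions (expensive_work() is a hypothetical stand-in for the code being measured, not from the lecture); note the resolution is whole seconds, so clock() or a profiler is better suited to short-running code:

#include <stdio.h>
#include <time.h>

/* hypothetical stand-in for the code section being measured */
static void expensive_work(void)
{
    volatile double sum = 0.0;
    for (long i = 0; i < 200000000L; i++)
        sum += (double)i * 0.5;
}

int main(void)
{
    time_t start = time(NULL);    /* wall-clock time before the section */
    expensive_work();
    time_t finish = time(NULL);   /* wall-clock time after the section */

    /* difftime() returns the elapsed time in seconds (1 second resolution) */
    printf("Elapsed: %.0f seconds\n", difftime(finish, start));
    return 0;
}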

Profiling using a profiler
Unix: gprof

VS2017: Built-in profiler

Profiling with clock() – Windows only

#include <time.h>

The clock() function returns a clock_t value: the number of
clock ticks elapsed since the program was launched

To calculate the time in seconds, divide by CLOCKS_PER_SEC

clock_t begin, end;
float seconds;

begin = clock();   /* tick count before the call */
func();            /* the code being timed */
end = clock();     /* tick count after the call */

/* convert elapsed clock ticks to seconds */
seconds = (end - begin) / (float)CLOCKS_PER_SEC;

VS2017 Profiling Example

Debug->Performance and Diagnostics
Start

Select CPU Sampling, then Finish (or Next and select the project)

No data? Your program might not run for long enough to be sampled

VS2017 Profiling Example

Samples
The profiler interrupts at given time intervals to collect information on the stack
The default sampling interval is 10,000,000 clock cycles

Inclusive Samples
Time samples including any sub-calls

Exclusive Samples
Time samples excluding any sub-calls

Hot Path
The most expensive (most heavily sampled) path of execution through the program

Best candidate for optimisation

Select the function for a line-by-line breakdown of sampling percentage

Compute vs Memory Bound

Compute bound
Performance is limited by the speed of the CPU

CPU usage is high: typically 100% for extended periods of time

Memory Bound
Performance is limited by the memory access speed

CPU usage might be lower

Typically the cache usage will be poor
A poor hit rate results from fragmented or random memory accesses (see the sketch below)
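As a rough illustration (the matrix size N and the function names are illustrative, not from the lecture), the two functions below perform identical arithmetic and differ only in access pattern; the strided version typically becomes memory bound because of its poor cache hit rate:

#define N 4096

/* Sum an N x N matrix stored in row-major order.
   Traversing row by row touches consecutive addresses and makes good
   use of each cache line; traversing column by column strides through
   memory and causes far more cache misses. */
float sum_row_major(const float *m)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i * N + j];   /* consecutive accesses: cache friendly */
    return sum;
}

float sum_column_major(const float *m)
{
    float sum = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i * N + j];   /* strided accesses: poor cache hit rate */
    return sum;
}

Timing both versions (e.g. with clock() as above) will typically show a large difference even though the amount of arithmetic is identical.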

Optimisation Overview

Compute Bound

Memory Bound

Compute Bound: Optimisation

Approach 1: Compile with full optimisation
The MSVC compiler is very good at optimising code for efficiency

Many of the techniques we will examine can be applied automatically by a
compiler.

Optimisation: Compiler /O Optimisation property

Help the compiler
Refactor code to make it clear (clear to users is clear to a compiler)

Avoid complicated control flow (a sketch is given after the table below)

Optimisation Level: Description

/O1: Optimises code for minimum size

/O2: Optimises code for maximum speed

/Od: Disables optimisation (for debugging)

/Oi: Generates intrinsic functions for appropriate calls

/Og: Enables global optimisations
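As a minimal sketch of what "avoid complicated control flow" can mean in practice (the functions apply() and apply_clear() and this particular transformation are illustrative, not taken from the lecture), a loop-invariant branch can be hoisted out of a loop so that each loop body stays simple for the optimiser:

/* A loop-invariant branch inside the loop: the control flow makes the
   loop body harder for the compiler to optimise */
void apply(float *data, int n, int use_offset, float offset)
{
    for (int i = 0; i < n; i++) {
        if (use_offset)                     /* tested on every iteration */
            data[i] = data[i] + offset;
        else
            data[i] = data[i] * 2.0f;
    }
}

/* Hoisting the branch out of the loop leaves two simple loops that the
   compiler can optimise (e.g. vectorise) much more easily */
void apply_clear(float *data, int n, int use_offset, float offset)
{
    if (use_offset) {
        for (int i = 0; i < n; i++)
            data[i] = data[i] + offset;
    } else {
        for (int i = 0; i < n; i++)
            data[i] = data[i] * 2.0f;
    }
}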

Compute Bound: Optimisation

Approach 2: Redesign the program
Compilers can't do this, and it is most likely to have the biggest impact
If you have a loop that is executed thousands of times, then find a way to do it without the loop
Be familiar with algorithms

Understand big-O notation
E.g. sequential search has many faster replacements (a sketch follows the link below)

http://bigocheatsheet.com/
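For example, a minimal sketch (illustrative function names, assuming the data can be kept sorted) of replacing an O(n) sequential search with an O(log n) binary search:

#include <stddef.h>

/* Sequential (linear) search: O(n) comparisons in the worst case */
int linear_search(const int *a, size_t n, int key)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return (int)i;
    return -1;
}

/* Binary search on sorted data: O(log n) comparisons */
int binary_search(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;   /* avoids overflow of (lo + hi) */
        if (a[mid] == key)
            return (int)mid;
        else if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}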

Compute Bound: Optimisations

Approach 3: Understand operation performance
Cost of going to disk is massive

Loop Invariant Computations: move operations out of loops where possible

Strength reduction: replace expressions with cheaper ones (a sketch of both follows the latency table below)

Core i7 Instruction Cycle Latencies (from http://www.agner.org/optimize/instruction_tables.pdf)

Integer ADD/SUB (x32 and x64): 1
Integer MUL (x32 and x64): 3
Integer DIV (x32): 17-28
Integer DIV (x64): 28-90
Floating Point ADD/SUB (x32): 3
Floating Point MUL (x32): 5
Floating Point DIV (x32): 7-27
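A minimal sketch of both techniques (the functions scale_naive()/scale_optimised() and their parameters are illustrative): the loop-invariant expression is computed once outside the loop, and the per-iteration division is strength-reduced to a multiplication by the reciprocal, trading a floating point DIV (7-27 cycles above) for a MUL (5 cycles). Note the floating point result may differ by a rounding error, which is why compilers typically only apply this automatically under relaxed floating point settings.

/* Before: the divisor is recomputed and a costly FP divide is executed
   on every iteration */
void scale_naive(float *data, int n, float total)
{
    for (int i = 0; i < n; i++)
        data[i] = data[i] / (total * 0.01f);   /* loop-invariant divisor */
}

/* After: the loop-invariant expression is hoisted out of the loop and the
   division is strength-reduced to a multiplication by the reciprocal */
void scale_optimised(float *data, int n, float total)
{
    float inv = 1.0f / (total * 0.01f);        /* computed once */
    for (int i = 0; i < n; i++)
        data[i] = data[i] * inv;               /* MUL instead of DIV */
}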

Compute Bound: Optimisations

Approach 4: function in-lining
In-lining increases code size but reduces function calls.

Make your simple function a macro

Use the _inline keyword

Be sensible: Not everything should be in-lined

float vec2f_len(vec2f a, vec2f b)
{
    vec2f r;
    r.x = a.x - b.x;
    r.y = a.y - b.y;
    return (float)sqrt(r.x*r.x + r.y*r.y); //requires #include <math.h>
}

_inline float vec2f_len(vec2f a, vec2f b)
{
    return (float)sqrt((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y));
}

#define vec2f_len(a, b) ((float)sqrt((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y)))

Compute Bound: Optimisations

Approach 5: Loop unrolling
MSVC can do this automatically

Reduces the number of branch executions

for (int i=0; i<100; i++){
    some_function(i);
}

for (int i=0; i<100;){
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
    some_function(i); i++;
}

Compute Bound: Optimisations

Approach 6: Loop jamming
Combine adjacent loops to minimise branching (for loops ranging over the same variable)
E.g. reduces the iteration and testing of the loop variable i (the example code is truncated in the source; a sketch follows)
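The loop jamming example is cut off in the source; the following is a minimal sketch of the idea (the arrays a, b, c and the bound n are illustrative), assuming both loops range over the same variable i:

int i;

/* Two adjacent loops over the same range: the loop variable i is
   initialised, tested and incremented separately for each loop */
for (i = 0; i < n; i++)
    a[i] = b[i] * 2.0f;
for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* Jammed into a single loop: the test, increment and branch on i
   happen only n times instead of 2n times */
for (i = 0; i < n; i++) {
    a[i] = b[i] * 2.0f;
    c[i] = a[i] + b[i];
}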