
PERFORMANCE PROGRAMMING
Introduction to Performance Optimisation II Adrian Jackson


• “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%”
• “In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering”
• Donald Knuth, “Structured Programming with go to Statements” (1974)

Architectural trends
• Optimisation is the process of tuning a code to run faster in a particular hardware and software environment.
• The hardware environment consists of many different resources
• CPU
• Memory
• Accelerator
• Network
• Any of these resources could be the limiting factor for code performance
• Which one depends on the application

Moore’s Law
• Observation that the number of transistors in an integrated circuit that can be manufactured economically doubles roughly every 2 years.
• Increased transistor count allows more powerful CPUs.
• Historically smaller transistors also allowed faster clock speeds
but power constraints now limit this type of scaling.
• Only limited benefit available from “smarter” processors.
• Most CPU designs now use the increased transistor count to increase parallelism
• More cores
• Longer SIMD instructions
• Etc.

CPU resources
• In the early days of computing memory accesses were essentially free.
• Optimisation consisted of reducing the instruction count.
• This is no longer the case, and is getting worse
• CPU performance historically increased at approx. 80% per year (though recently this has been due to increasing core count rather than clock speed)
• memory speed increases at approx. 7% per year (though capacity increases at a much higher rate)
• Most HPC codes/systems are memory bandwidth limited.
• Unless major data structures can live in cache.

CPU bottlenecks
• Note, the following slides are primarily x86 focussed
• Pipeline width
• Number of operations/instructions that can be issued per cycle
• 5 for modern AMD and Intel processors
• Reported under uops_issued.any counter for Intel
• Can address by loop unrolling, vectorisation, or algorithmic change
• Available functional units
• Also known as port or execution width
• Varies by functional unit
• For floating point this may be 1 or 2
• Reported under uops_dispatched_port counter for Intel
• Can address in the same way as the pipeline width problem, or aim to switch to "wider" instructions
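A hedged sketch (not from the slides) of attacking an issue-width limit by loop unrolling: the function names and the unroll factor of four are illustrative, and in practice an optimising compiler will often perform this transformation itself.

#include <stddef.h>

/* One multiply-add per iteration plus loop overhead: may not give the
   front end enough independent work to fill the issue width. */
void scale_add(double *restrict y, const double *restrict x,
               double a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Unrolled by four: more independent operations are visible in each
   iteration, and loop overhead is amortised over more useful work. */
void scale_add_unrolled(double *restrict y, const double *restrict x,
                        double a, size_t n)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; ++i)          /* remainder iterations */
        y[i] += a * x[i];
}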

CPU bottlenecks
• Load limits
• How many address load instructions can be issued per cycle
(from L1 cache)
• Port width limit, 2 for current Intel and AMD
• Not generally a problem for most code
• Store limits
• How many data items can be stored per cycle
• 1 for Intel and AMD
• Chain dependencies
• Instruction pipeline performance depends on the slowest
instruction
• Pipeline latency of instructions varies
• Can remedy by breaking up pipeline chains or using other instructions
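A hedged sketch of breaking a dependency chain in a reduction by keeping several independent partial sums (the names and the factor of four are illustrative; note that reordering floating-point additions can change the rounded result, so compilers only do this when explicitly allowed).

#include <stddef.h>

/* Every addition depends on the previous one, so the loop runs at the
   latency of a floating-point add rather than at its throughput. */
double sum_chained(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four independent accumulators break the chain, so several adds can
   be in flight in the pipeline at once. */
double sum_split(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)          /* remainder iterations */
        s += x[i];
    return s;
}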

CPU bottlenecks
• Out-of-order Buffers
• Out-of-order scheduling of instructions requires storage
(buffers) to keep track of in flight instructions
• Re-order Buffer (ROB), Load Buffer, Store Buffer
• Can remedy by restructuring code so dependent statements are closer together, although this may impact overlap potential
• Register file size
• ROB entries need associated registers, which are stored in register files (typically one for scalar and one for vector operations)
• This can fill up before the ROB is full
• Can remedy by having mixed Vector and Scalar instructions or by reordering code as with the Out-of-order buffer issues

Memory resources
• Old HPC codes were often written for vector processors.
• Vector processing typically requires a high memory bandwidth.
• Vector systems are typically built with large numbers of memory banks giving high peak memory bandwidth.
• Vector instructions have a very regular access pattern but tend to have low data re-use. Caches were rare on vector systems.
• Old vector codes were often written with stride-1 data access patterns to maximise memory bank use.
• These codes often performed badly on cache based microprocessors as stride-1 access was obtained at the cost of temporal locality.
• Modern SIMD instructions are effectively vector instructions
• SIMD load/stores also need stride-1 access but also need to co-exist with caches.
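A hedged sketch of stride-1 versus strided access for a 2D array in C (row-major storage; the array size and names are illustrative).

#define N 1024

double a[N][N];     /* row-major: a[i][j] and a[i][j+1] are adjacent in memory */

/* Stride-1 traversal: consecutive iterations touch consecutive memory
   locations, which suits both caches and SIMD load/store instructions. */
double sum_rowwise(void)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

/* Stride-N traversal: consecutive iterations jump N doubles apart, so
   each access may touch a new cache line and vectorisation is harder. */
double sum_columnwise(void)
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];
    return s;
}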

Caches
• Modern microprocessors rely heavily on caches
• Fast on-chip memory that stores copies of a sub-set of the data
from main memory
• Reduces the demands on external memory provided data access patterns exhibit locality
• Particularly beneficial if important data structures can exist entirely within cache.
• Current HPC codes are written to increase data locality (and therefore cache benefit) as much as possible
• Cache sizes have been growing with Moore’s law, which has helped offset the slow growth in DRAM speed.
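A hedged sketch of improving data locality by loop blocking (tiling), using a matrix transpose; the sizes are illustrative, the block size would normally be tuned to the cache, and N is assumed to be a multiple of BS.

#define N  1024
#define BS 64                 /* illustrative block size */

double in[N][N], out[N][N];

/* Naive transpose: one of the two arrays is always accessed with a
   large stride, so cache lines are evicted before they are fully used. */
void transpose_naive(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[j][i] = in[i][j];
}

/* Blocked transpose: work on BS x BS tiles small enough to sit in
   cache, so each cache line brought in is fully used before eviction. */
void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; ++i)
                for (int j = jj; j < jj + BS; ++j)
                    out[j][i] = in[i][j];
}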

Commercial factors
• Computer design is all about balancing performance and cost.
• We know how to build faster computers than we can afford to build
• And faster than we can afford to run.
• Cheaper technologies used where “good enough”
• HPC computing is a very small market compared to desktop/business/mobile systems.
• HPC systems are often built out of components designed for a different purpose.
• Economies of scale make mainstream technology cheaper

Manufacturing trade-off example
• Processors need complex manufacturing processes
• E.g. many metal layers.
• Expensive to manufacture
• DRAM can be manufactured on a much simpler process.
• Fewer metal layers
• Much less expensive
• DRAM and processors are manufactured as separate components for cost reasons.
• The simple DRAM process forces the interface to be simple too.
• Separation of RAM and processor impacts performance

Chip stacking
• One promising trend in CPU design is chip stacking
• Traditionally DRAM and CPU were distinct components
mounted on a common circuit board
• Stacks of very thin DRAM chips can be mounted directly on top of the CPU
• Alternatively both are mounted on a common carrier chip
• Through Silicon Vias (TSVs) (electrical connections that penetrate through the thin chips) allow 3D connectivity within the stack.
• Increased connectivity has potential to allow memory performance to grow at similar rate to CPU performance.

Comparing codes
• HPC codes
• Large use of floating point
• Programs are quite small
• Data is quite large
• Data in simple arrays
• Regular access patterns
• Limited cache utilisation unless array fits in cache
• Simple control flow
• Most time spent in small number of loops.
• Desktop applications
• Low use of floating point
• Large complicated programs
• Data relatively small
• Data in complex structures
• Random access pattern
• Typically good reuse of data once accessed
• Complex control flow
• Branches and pointer chasing.

Types of optimisation
[Diagram: types of optimisation — compiler flags, library calls, code structure, data structures, algorithm, parallelism — arranged along axes of impact and difficulty]

Compiler flags
• Easiest thing to change are the compiler flags
• Most compilers will have good optimisation by default
• Some compiler optimisations are not always beneficial and need to be requested explicitly.
• Some need additional guidance from user (e.g. inter-file in-lining)
• Some break language standards and need to be requested explicitly
• E.g. a/2 -> a*0.5 is contrary to Fortran standard but is usually safe.
• Usually worthwhile to read compiler manual pages before optimising.
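As a hedged illustration of the kind of trade-off involved (GCC-style flags; the file name and function are made up, and other compilers spell the options differently):

/* Possible builds:
 *   gcc -O2 example.c              standard-conforming optimisation
 *   gcc -O3 -ffast-math example.c  additionally allows value-changing
 *                                  transformations, e.g. turning the
 *                                  division below into a multiply by a
 *                                  reciprocal and reassociating the sum
 * Because -ffast-math relaxes the language/IEEE rules, results may
 * change slightly, which is why it has to be requested explicitly.
 */
double scaled_sum(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i] / 3.0;   /* candidate for x[i] * (1.0/3.0) under fast-math */
    return s;
}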

Debugging flags
• Most compilers have flags to enable debugging
• Traditionally debugging flags disable some
optimisations
• this is to make it easier to associate machine instructions with lines of source code.
• Many compilers now support optimisation and debugging together
• May have alternative flags to allow some debugging with optimisation enabled.

Library calls
• The easiest way to make a big impact on code performance is to re-use existing optimised code.
• Libraries represent large amount of development effort
• Somebody else has put in the effort so you don’t have to.
• Code libraries exist for commonly used
functionality (e.g. linear algebra, FFTs etc.).
• Often possible to access highly optimised versions of these libraries.
• Even if the application code does not use a standard library it is often easy to re-write to use the standard calls.
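A hedged sketch of replacing hand-written loops with a library call, here through the CBLAS interface to BLAS (assumes a CBLAS implementation such as OpenBLAS or a vendor BLAS is installed; the matrix size and link line, e.g. -lopenblas or -lcblas, vary by system):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    enum { N = 512 };
    static double A[N * N], B[N * N], C[N * N];

    for (int i = 0; i < N * N; ++i) {   /* simple test data */
        A[i] = 1.0;
        B[i] = 2.0;
    }

    /* C = 1.0*A*B + 0.0*C using the optimised library routine rather
       than a hand-written triple loop. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,
                B, N,
                0.0, C, N);

    printf("C[0] = %f\n", C[0]);        /* expect 2.0 * N = 1024.0 */
    return 0;
}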

Using libraries
• Don’t just use libraries blindly.
• Read the documentation.
• Learn what the library is capable of.
• Libraries often have multiple ways of doing the same thing. You need to understand which is best for your code.
• Many modern libraries support auto-tuning.
• Can run internal experiments to automatically optimise
the implementation for a given problem.
• Application programmer needs to cooperate on this.
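A hedged sketch of cooperating with an auto-tuning library, using FFTW's planner as the example (assumes FFTW 3 is installed, typically linked with -lfftw3; the sizes and names are illustrative):

#include <fftw3.h>

/* FFTW_MEASURE makes the library time several candidate algorithms and
   pick the fastest for this size on this machine; FFTW_ESTIMATE skips
   the experiments.  Planning with FFTW_MEASURE may overwrite the
   arrays, so the plan is created before the data is filled in - the
   kind of cooperation from the application programmer noted above. */
void transform_many(fftw_complex *in, fftw_complex *out,
                    int n, int repeats)
{
    fftw_plan plan = fftw_plan_dft_1d(n, in, out,
                                      FFTW_FORWARD, FFTW_MEASURE);

    /* ... initialise 'in' here, then reuse the tuned plan many times ... */
    for (int r = 0; r < repeats; ++r)
        fftw_execute(plan);

    fftw_destroy_plan(plan);
}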

Algorithms
• The biggest performance increases typically require a change to the underlying algorithm.
• Consider changing an O(N^2) sort algorithm to an O(N log(N)) algorithm.
• This is a lot of work as the relevant code section usually needs a complete re-write.
• A warning
• The complexity of an algorithm O(N), O(log(N)), O(N log(N)) etc. is related to number of operations and is not always a reliable indication of performance.
• Pre-factor may make a “worse” algorithm perform better for the value of N of interest.
• The “worse” algorithms may have much better cache re-use or exhibit more available parallelism

• The performance of many scientific algorithms can be improved by mathematical approaches
• e.g. pre-conditioning matrices
• Often increases runtime per iteration
• But reduces the number of iterations needed to converge etc.
• e.g. strength reduction, replacing expensive operations with cheaper equivalent ones:
FOR i := 1 TO 10000 DO
BEGIN
  a := b * c
  ...
END

becomes

t := b * c
FOR i := 1 TO 10000 DO
BEGIN
  a := t
  ...
END

FOR i := 1 TO 10000 DO
BEGIN
  d := i * 3
END

becomes

d := 0
FOR i := 1 TO 10000 DO
BEGIN
  d := d + 3
END

int i, sum = 0;
for (i = 1; i <= N; ++i) {
  sum += i;
}
printf("sum: %d\n", sum);

becomes

int sum = N * (1 + N) / 2;
printf("sum: %d\n", sum);

Data structures
• Changing the program’s data structures can often give good performance improvements
• These are often global changes to the program and therefore expensive.
• Code re-writing tools can help with this.
• Easier if data structures are reasonably opaque, declared once
• objects, structs, F90 types, included common blocks.
• As memory access is often the major performance bottleneck the benefits can be great.
• Improve cache/register utilisation.
• Avoid pointer chasing
• May be able to avoid memory access problems by changing code structure in key areas instead.

Code structure
• Most optimisations involve changes to code structure
• Loop unrolling
• Loop fusion
• Routine in-lining.
• Often overlap with optimisations attempted by the compiler.
• Often better to help the compiler to do this than perform the change by hand.
• Easier to implement than data changes as more localised.
• Performance impact is often also smaller unless the code fragment is a major time consumer.
• Performance improvement often comes at the expense of code maintainability.
• Try to keep the unoptimised version up to date as well.

Parallelism
• Parallel computing is usually an optimisation technique
• Multiple processors are used to improve run-time
• Large numbers of processors can result in very significant performance improvements.
• Software effort to develop parallel code can be very high.
• Code runs faster in real time but may consume more CPU cycles in total.

Instruction level parallelism
• Most modern CPUs include some kind of parallel instructions
• SIMD instructions where one instruction performs multiple operations.
• Pipelined or superscalar architectures where multiple (independent) instructions are executing at the same time.
• Largely the compiler finds and exploits this kind of parallelism automatically
• Therefore generally easy to exploit
• May still have to change the source code to:
• Increase the parallelism available to the compiler
• Make it easier for the compiler to recognise the parallelism

Shared memory parallelism
• Multiple processors share a single memory system.
• Data structures can be shared and need not be changed.
• Only the work needs to be distributed (typically using threads).
• The compiler does most of the work but usually needs some additional directives.
• Typically the most flexible form of multi-processor parallelism
• Parallelism can be introduced incrementally, only critical routines need to be parallel.
• Often can add parallel directives to a serial code, automatically keeping parallel and serial versions in sync.
• Makes greater demands on hardware designers
• Easy for parallel advantage to be lost as processors compete for access to memory.

Distributed memory parallelism
• Processors have private memory systems and perform explicit communications
• Message passing, remote memory access calls etc.
• Usually a large software development task.
• The whole code has to run in parallel.
• Very little if any support from the compiler.
• Data structures usually need changing. Exceptions:
• Regular domain decomposition: serial code data structures are often ok, just have to be smaller. Boundary condition code needs to be changed.
• Replicated data: each node keeps a private copy of the global data structures. Easy to implement but often does not scale well.
• Distributed memory codes can easily run on shared memory systems, but not vice-versa

Accelerators
• Special purpose hardware
• Very good at some types of operations
• Can’t run the entire code.
• GPUs are a common example, but accelerators can also be built out of FPGAs
• Main CPU and accelerator handle different sections of the code.
• Usually some overhead in switching
• Often data has to be copied between devices
• Different memory layouts might be appropriate for each device
• Trend towards greater compiler support for acceleration
• Programmer still needs to understand both architectures to get the best performance.
• Successful accelerator architectures influence new processor designs, so the boundary blurs over time.

Networks
• Data transfer over networks costs performance
• Memory copy costs
• Network transfer costs
• On-node communication costs can be optimised
• Shared-memory segments for local data transfers
• Off-node communication costs can be optimised
• Reduced copy communications, RDMA (remote direct memory access)
• Reduced CPU requirements
• Network communication maps/patterns
• Reduce message numbers
• Compression
• Reduce data volumes at the expense of CPU time

Knowing when to stop
• Hardware metrics
• Estimating program requirements
• Roofline model
• There are many ways to optimise a program
• need to select the most appropriate for the situation
• doing nothing is an option
• Must always weigh the benefit (faster execution) against costs:
• programmer effort
• loss of maintainability
• loss of portability
• Performance measurement, analysis and testing are key parts of the process.