
Hardware and Software Optimisation

Syllabus
This module will cover the following:
• Choices to make when optimising
• Hardware chip type specialisation
• Compiler optimisations
• Floating point to fixed point conversion

Optimisation
Modifying a system to make it run more efficiently, or utilise fewer resources.
Optimising hardware:
● use less energy, dissipate less power, reduce gate count, package size…
Optimising software:
● run faster, use less data or program memory, reduce power or energy
Q. Knuth famously wrote in 1974 about programming: “premature optimization is the root of all evil”. Do you think this applies to IoT design in the 2020s? How?

Choices to make when optimising
Optimise for speed?
Optimise for size?
Optimise for power?
Optimise for energy?
Some combination (with trade-off) of all of these?
Do we need to react to events quickly?
Are we memory/space constrained?
Is there limited scope for power dissipation?
Do we need to conserve as much energy as possible?

Hardware Optimisation

Additional Processors
DSPs implement specialised routines for a specific application.
FPGAs are a popular accelerator in embedded systems, as they provide ultimate re-configurability.
FPGAs are also invaluable during hardware prototyping.
ASICs are the logical next step for a highly customised application, if gate-level re-configurability is not a requirement.

Field Programmable Gate Arrays (FPGAs)
FPGAs implement arbitrary combinational or sequential circuits, and are configured by loading a local memory that determines the interconnections among logic blocks.
Can be reconfigured an unlimited number of times – and “in the field”!
Very useful for hardware prototyping. Experimental data can feed into ASIC design.

Application-specific Integrated Circuits (ASICs)
● Designed for a fixed application, e.g., bitcoin mining.
● Designed to accelerate the heaviest and most-used functions.
● Designed to implement the instruction set with minimum hardware cost.
● Goals of ASIC design:
○ Best performance per unit of silicon and per unit of power consumption, at lowest cost
● Involves:
○ ASIC design flow, source-code profiling, architecture exploration, instruction set design, assembly language design, tool chain production, firmware design, benchmarking, microarchitecture design.
● Can specialise every architectural aspect of the CPU

Digital Signal Processors (DSPs)
Designed for a specific application, typically arithmetic-heavy. Offers lots of arithmetic instructions/parallelism.
● Radio baseband hardware (4G)
● Image processing/filtering
● Audio processing/filtering
● Video encoding/decoding
● Vision applications
Choose to implement a DSP that performs processing faster, and uses less memory and power, than if the algorithm were implemented on a general-purpose processor.

Heterogeneous Multicores
A heterogeneous multicore is a processor that contains multiple cores that implement the same underlying architecture, but have different power and performance profiles.
An example is ARM big.LITTLE, one available configuration of which comprises four Cortex-A7 and four Cortex-A15 cores (eight in total).
This enables threads to run on the low-energy cores until they need a speed boost, or lets multithreaded applications whose threads have different performance requirements run those threads on different cores.

Run-state Migration
Clustered Switching: Either all fast cores or all slow cores are used to run threads.
CPU Migration: Pairs of fast and slow cores are used for scheduling, and each pair can be configured to execute the thread on either the fast core or the slow core.
Global Task Scheduling: Each core is seen separately, and threads are scheduled on cores with the appropriate performance profile for the workload.

Software Optimisation

Optimisation Targets
Optimise for speed: generate code that executes quickly, at the expense of size.
Optimise for size: generate the least amount of code, maybe at the expense of speed.
Optimise for power/energy: a combination of optimisation strategies.
Use analysis, debugging, simulation, prototyping, monitoring, etc. to feedback into the optimisation strategy.

Compiler Optimisation Choices – Speed/Size
Optimise for Speed (-O3)
● May aggressively inline
● Can generate MASSIVE amounts of code for seemingly simple operations
○ e.g. separate code to deal with aligned vs. unaligned array references
● Re-orders instructions
● Selects complex instruction encodings that can be quite large
Q. There is actually a huge number of possible optimization choices (try gcc --help)! How could you possibly choose what’s best?
Optimise for Size (-Os)
● Will penalise inlining decisions
● Generates shorter instruction encodings
● Affects instruction scheduling
● Fewer branches eliminated (e.g. by loop unrolling)
Custom Optimisation (-On)
● You might be interested in very specific optimisation passes
○ LLVM is best for this
● You may want to insert assembly language templates.

Code Size
Does it matter?
128-Mbit Flash = 27.3 mm² @ 0.13 μm
ARM Cortex-M3 = 0.43 mm² @ 0.13 μm
RISC architectures sacrifice code density in order to simplify implementation circuitry and decrease die area.

Code Size
Possible solution: Dual Instruction Sets
Provide a 32-bit and a 16-bit instruction set:
● Thumb/Thumb-2
● ARCompact
● microMIPS
…but 16-bit instructions come with constraints!
● Only a subset of the registers available
● Must explicitly change modes
● Range of immediate operands reduced

Code Size
Possible solution: CISC Instruction Set
Provide complex instruction encodings that can do more work:
● x86
● System/360
● PDP-11
…but CISC instruction sets are by definition complex, and require more complex hardware.
Often the support for generating more exotic instructions doesn’t exist in the compiler, negating some benefit.

Instruction Level Parallelism
Normally considered a way to increase performance.
● Dynamic Hardware-based Parallelism
○ Hardware decides at runtime which instructions to execute in parallel.
○ Strategies: pipelining, out-of-order execution, register renaming, speculative execution, branch prediction.
● Static Software-defined Parallelism
○ Compiler decides at compile time which instructions should be executed in parallel.

Vectorisation
for i = 0 while i < 16 step 1
    c[i] = a[i] + b[i]

Choose a vector width, and divide the total count to vectorise the loop (i.e. increase the step count):

for i = 0 while i < 16 step 4
    c[i:i+3] = a[i:i+3] + b[i:i+3]

Be careful if the vector width doesn’t divide the iteration count exactly! Extra code is needed to complete the operation.

Vectorisation: Scalar Operations
[Figure: sixteen independent scalar additions, c0 = a0 + b0 through c15 = a15 + b15, performed one at a time.]

Vectorisation: Vector Operations
[Figure: the same sixteen additions performed as four vector additions, with vector width 4.]

Vectorisation
Exploit parallel data operations in instructions: c[i] = a[i] + b[i]; // 0 ≤ i < 16

no_vectorisation:
        xor %eax, %eax
1:      mov (%rdx, %rax, 1), %ecx
        add (%rsi, %rax, 1), %ecx
        mov %ecx, (%rdi, %rax, 1)
        add $0x4, %rax
        cmp $0x40, %rax
        jne 1b
        retq

with_vectorisation:
        movdqa (%rsi), %xmm0
        paddd (%rdx), %xmm0
        movaps %xmm0, (%rdi)
        movdqa 0x10(%rsi), %xmm0
        paddd 0x10(%rdx), %xmm0
        movaps %xmm0, 0x10(%rdi)
        movdqa 0x20(%rsi), %xmm0
        paddd 0x20(%rdx), %xmm0
        movaps %xmm0, 0x20(%rdi)
        movdqa 0x30(%rsi), %xmm0
        paddd 0x30(%rdx), %xmm0
        movaps %xmm0, 0x30(%rdi)
        retq

with_vectorisation_no_unroll:
        xor %eax, %eax
1:      movdqu (%rsi, %rax, 4), %xmm0
        movdqu (%rdx, %rax, 4), %xmm1
        paddd %xmm0, %xmm1
        movdqu %xmm1, (%rdi, %rax, 4)
        add $0x4, %rax
        cmp $0x10, %rax
        jne 1b
        retq

avx512:
        vmovdqu32 (%rdx), %zmm0
        vpaddd (%rsi), %zmm0, %zmm0
        vmovdqu32 %zmm0, (%rdi)
        vzeroupper
        retq

Function Inlining
Advantages
● Removes call overhead, branch delay
Limitations
● Not all functions can be inlined
● Code size explosion
● May require manual intervention with, e.g., the inline qualifier/function attributes.
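The inlining trade-off can also be seen from C. A minimal sketch using C11 atomics, mirroring the slides’ test_and_set spinlock: marking test_and_set as static inline invites the compiler to absorb its body into acquire, removing the call/return overhead.

```c
#include <stdatomic.h>

/* "static inline" lets the compiler substitute the body of
 * test_and_set directly into acquire(), eliminating the call. */
static inline int test_and_set(atomic_int *lock) {
    return atomic_exchange(lock, 1);  /* returns the previous value */
}

void acquire(atomic_int *lock) {
    while (test_and_set(lock))
        ;  /* spin until the previous value was 0, i.e. we took the lock */
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);
}
```

Compiled with optimisation, acquire typically contains the xchg loop directly, with no call to test_and_set remaining.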
test_and_set:
        mov $0x1, %eax
        xchg %eax, (%rdi)
        retq

acquire:
        mov %rdi, %rdx
1:      mov %rdx, %rdi
        callq test_and_set
        test %eax, %eax
        jne 1b
        retq

After inlining test_and_set:

acquire:
        mov $0x1, %edx
1:      mov %edx, %eax
        xchg %eax, (%rdi)
        test %eax, %eax
        jne 1b
        retq

Avoiding Branch Delay
Use predicated instructions:

test:
        cmp r0, #0
        addne r2, r1, r2
        subeq r2, r1, r2
        addne r1, r1, r3
        subeq r1, r1, r3
        bx lr

The equivalent with branches:

test:
        cmp r0, #0
        bne 2f
        sub r2, r1, r2
        sub r1, r1, r3
1:      bx lr
2:      add r2, r1, r2
        add r1, r1, r3
        b 1b

Floating Point to Fixed Point Conversion
Algorithms are developed in floating point format, using tools such as Matlab.
Floating point processors and hardware are expensive!
Fixed point processors and hardware are often used in embedded systems.
After algorithms are designed and tested, they are converted into a fixed point implementation, and ported onto a fixed point processor or ASIC.

Qn.m Format
Qn.m is a fixed positional number system for representing fixed point numbers.
A Qn.m format binary number assumes n bits to the left (including the sign bit), and m bits to the right of the decimal point.

Sign bit: -2^(n-1)
Integer bits: 2^(n-2) … 2^2 2^1 2^0
Fractional bits: 2^-1 2^-2 2^-3 … 2^-m

e.g. Q2.10: 2 bits are for the 2’s complement integer part, 10 bits are for the fractional part.

Conversion to Qn.m
1. Define the total number of bits to represent a Qn.m number, e.g. 9 bits:
   b8 b7 b6 b5 b4 b3 b2 b1 b0
2. Fix the location of the decimal point, based on the value of the number, e.g. assume 5 bits for the integer portion:
   -b8·2^4  b7·2^3  b6·2^2  b5·2^1  b4·2^0  .  b3·2^-1  b2·2^-2  b1·2^-3  b0·2^-4

Example – Q5.4
Weights: -2^4  2^3  2^2  2^1  2^0  .  2^-1  2^-2  2^-3  2^-4
Bits:      0    1    0    1    0   .   1     0     1     1
Value:   2^3 + 2^1 + 2^-1 + 2^-3 + 2^-4 = 8 + 2 + 0.5 + 0.125 + 0.0625 = 10.6875
Hexadecimal representation: 0xAB

A 9-bit Q5.4 fixed point number covers -16 to +15.9375.
Increasing the fractional bits increases the precision.

Range Determination for Qn.m Format
Run simulations for all input sets.
Observe the ranges of values taken by all variables.
Note the minimum and maximum value each variable sees, for Qn.m range determination.

Q. What do you think about this strategy to determine ranges? Would you have any concerns or alternative suggestions?

Energy Saving: Opportunistic Sleeping
To conserve energy, applications should try to move the processor and peripherals into the lowest usable power mode as soon as possible. (A bit like engine stop/start in cars.)
Interrupts for events wake the processor to perform more processing.
Need to balance energy savings vs. reaction time, depending on the application.
(Picture for Arm Cortex-M0+, from Microchip Developer Help)

Summary
Hardware Optimisation
● Focus on energy and power, physical size/complexity and cost
● May be done at design time, or during hardware prototyping
● Choose or design specialised compute or peripheral units
● Examples: CPU, GPU, DSP, FPGA, ASICs
Software Optimisation
● Focus on speed, memory usage, also energy and power, security
● Can be done during software prototyping, with simulation or profiling
● Big impact from compiler code transformations and CPU instruction choices
● Also use hardware support for low-energy: sleep states.
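The Qn.m conversion described earlier can be sketched in C. A minimal, illustrative Q5.4 implementation (the helper names are my own; a production version would also saturate on overflow):

```c
#include <stdint.h>

#define QM 4  /* number of fractional bits in Q5.4 */

/* Convert float -> Q5.4: scale by 2^m and round to nearest. */
static int16_t float_to_q5_4(float x) {
    return (int16_t)(x * (1 << QM) + (x >= 0 ? 0.5f : -0.5f));
}

/* Convert Q5.4 -> float: divide by 2^m. */
static float q5_4_to_float(int16_t q) {
    return (float)q / (1 << QM);
}

/* Multiply two Q5.4 numbers: the raw product is Q10.8, so shift
 * right by m bits to renormalise back to Q5.4. */
static int16_t q5_4_mul(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> QM);
}
```

For the worked example above, float_to_q5_4(10.6875f) gives 0xAB (171), since 10.6875 × 16 = 171.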