Hardware and Software Optimisation
Syllabus
This module will cover the following:
• Choices to make when optimising
• Hardware chip type specialisation
• Compiler optimisations
• Floating point to fixed point conversion
Optimisation
Modifying a system to make it run more efficiently, or utilise fewer resources.
Optimising hardware:
● use less energy, dissipate less power, reduce gate count, package size…
Optimising software:
● run faster, use less data or program memory, reduce power or energy
Q. Knuth famously wrote in 1974 about programming: “premature optimization is the root of all evil”. Do you think this applies to IoT design in the 2020s? How?
Choices to make when optimising
Optimise for speed?
Optimise for size?
Optimise for power?
Optimise for energy?
Some combination (with trade-off) of all of these?
Do we need to react to events quickly?
Are we memory/space constrained?
Is there limited scope for power dissipation?
Do we need to conserve as much energy as possible?
Hardware Optimisation
Additional Processors
DSPs implement specialised routines for a specific application.
FPGAs are a popular accelerator in embedded systems, as they provide ultimate re-configurability.
FPGAs are also invaluable during hardware prototyping.
ASICs are the logical next step for a highly customised application, if gate-level re-configurability is not a requirement.
Field Programmable Gate Arrays (FPGAs)
FPGAs implement arbitrary combinational or sequential circuits, and are configured by loading a local memory that determines the interconnections among logic blocks.
They can be reconfigured an unlimited number of times – and “in the field”!
Very useful for hardware prototyping. Experimental data can feed into ASIC design.
Application-specific Integrated Circuits (ASICs)
● Designed for a fixed application, e.g., bitcoin mining.
● Designed to accelerate the heaviest and most frequently used functions.
● Designed to implement the instruction set at minimum hardware cost.
● Goals of ASIC design:
○ Best performance per unit of silicon and per unit of power consumption, at the lowest cost
● Involves:
○ ASIC design flow, source-code profiling, architecture exploration, instruction set design, assembly language design, tool chain production, firmware design, benchmarking, microarchitecture design.
● Can specialise every architectural aspect of the CPU
Digital Signal Processors (DSPs)
Designed for a specific application, typically arithmetic-heavy. Offers lots of arithmetic instructions and parallelism.
● Radio baseband hardware (4G)
● Image processing/filtering
● Audio processing/filtering
● Video encoding/decoding
● Vision applications
Choose to implement a DSP when it performs the processing faster, and uses less memory and power, than implementing the algorithm on a general-purpose processor.
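For example, the core of an FIR filter – exactly the kind of kernel a DSP is designed for – is a multiply-accumulate loop. A minimal C sketch (the function and parameter names are illustrative, and Q15 coefficients are assumed):

#include <stddef.h>
#include <stdint.h>

/* Minimal FIR filter: out[i] = sum over k of coeff[k] * in[i - k].
   A DSP executes the inner multiply-accumulate as a single MAC
   instruction, often under a zero-overhead loop. */
void fir(const int16_t *in, int16_t *out, size_t n,
         const int16_t *coeff, size_t taps)
{
    for (size_t i = taps - 1; i < n; i++) {
        int32_t acc = 0;                    /* wide accumulator */
        for (size_t k = 0; k < taps; k++)
            acc += (int32_t)coeff[k] * in[i - k];
        out[i] = (int16_t)(acc >> 15);      /* rescale the Q15 product */
    }
}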
Heterogeneous Multicores
A heterogeneous multicore is a processor that contains multiple cores that implement the same underlying architecture, but have different power and performance profiles.
An example is ARM big.LITTLE, of which a configuration is available comprising four Cortex-A7 and four Cortex-A15 (for a total of eight) cores.
This enables threads to run on the low-energy cores until they need a speed boost, and lets multithreaded applications run threads with different performance requirements on different cores.
Run-state Migration
Clustered Switching: Either all fast cores or all slow cores are used to run threads.
CPU Migration: Fast and slow cores are paired for scheduling, and each pair can be configured to execute its thread on either the fast core or the slow core.
Global Task Scheduling: Each core is seen separately, and threads are scheduled on cores with the appropriate performance profile for the workload.
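Under Global Task Scheduling the OS sees every core individually, so threads can be steered towards cores with the right profile. A minimal Linux sketch, assuming a hypothetical big.LITTLE system where cores 0–3 form the LITTLE cluster and cores 4–7 the big cluster (core numbering is platform specific):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the (assumed) big cluster, cores 4-7. */
static int move_to_big_cores(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}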
Software Optimisation
Optimisation Targets
Optimise for speed: generate code that executes quickly, at the expense of size.
Optimise for size: generate the least amount of code, maybe at the expense of speed.
Optimise for power/energy: a combination of optimisation strategies.
Use analysis, debugging, simulation, prototyping, monitoring, etc. to feed back into the optimisation strategy.
Compiler Optimisation Choices – Speed/Size
Optimise for Speed (-O3)
● May aggressively inline
● Can generate MASSIVE amounts of code for seemingly simple operations
  ○ e.g. separate code to deal with aligned vs. unaligned array references
● Re-orders instructions
● Selects complex instruction encodings that can be quite large
Q. There is actually a huge number of possible optimisation choices (try gcc --help)! How could you possibly choose what’s best?
Optimise for Size (-Os)
● Will penalise inlining decisions
● Generates shorter instruction encodings
● Affects instruction scheduling
● Fewer branches eliminated (e.g. by loop unrolling)
Custom Optimisation (-On)
● You might be interested in very specific optimisation passes
  ○ LLVM is best for this
● You may want to insert assembly language templates.
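As a concrete illustration, GCC also allows the strategy to be mixed within a single translation unit via per-function attributes, so a hot loop can be tuned for speed while the rest of the firmware is built with -Os (a sketch; the function names are illustrative):

/* Hot path: ask GCC to apply -O3-style optimisation here only. */
__attribute__((optimize("O3")))
int checksum_hot(const unsigned char *buf, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Rarely-executed error path: keep it small. */
__attribute__((optimize("Os")))
void report_error(int code)
{
    (void)code; /* ... */
}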
Code Size
Does it matter?
128-Mbit Flash = 27.3 mm² @ 0.13 μm
ARM Cortex-M3 = 0.43 mm² @ 0.13 μm
RISC architectures sacrifice code density in order to simplify implementation circuitry and decrease die area.
Code Size
Possible solution: Dual Instruction Sets
Provide a 32-bit and a 16-bit instruction set:
● Thumb/Thumb-2
● ARCompact
● microMIPS
…but 16-bit instructions come with constraints!
● Only a subset of the registers available
● Must explicitly change modes
● Range of immediate operands reduced
Code Size
Possible solution: CISC Instruction Set
Provide complex instruction encodings that can do more work:
● x86
● System/360
● PDP-11
…but CISC instruction sets are by definition complex, and require more complex hardware.
Often, support for generating the more exotic instructions doesn’t exist in the compiler, negating some of the benefit.
Instruction Level Parallelism
Normally considered a way to increase performance.
● Dynamic Hardware-based Parallelism
○ Hardware decides at runtime which instructions to execute in parallel.
○ Strategies: pipelining, out-of-order execution, register renaming, speculative execution, branch prediction.
● Static Software-defined Parallelism
○ Compiler decides at compile time which instructions should be executed in parallel.
Vectorisation
for i = 0 while i < 16 step 1
    c[i] = a[i] + b[i]
Choose a vector width, and divide total count to vectorise the loop (or increase step count)
for i = 0 while i < 16 step 4
    c[i:i+3] = a[i:i+3] + b[i:i+3]
Be careful if the vector width doesn’t divide the iteration count exactly! Extra code is needed to complete the operation – see the sketch below.
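A minimal C sketch of the same strip-mining idea for an arbitrary element count n: the main loop works in blocks of the vector width, and a scalar remainder loop finishes off whatever is left (with -O3 a compiler would typically map the block onto vector instructions):

void vec_add(int *c, const int *a, const int *b, int n)
{
    int i = 0;

    /* Main loop: blocks of 4 elements, a candidate for 128-bit vector ops. */
    for (; i + 4 <= n; i += 4) {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop: handles n not divisible by the vector width. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}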
Vectorisation: Scalar Operations
[Figure: sixteen scalar additions performed one at a time – c0 = a0 + b0, c1 = a1 + b1, …, c15 = a15 + b15 – one add per element.]
Vectorisation: Vector Operations
[Figure: the same sixteen additions performed as four vector additions – c[0:3] = a[0:3] + b[0:3], …, c[12:15] = a[12:15] + b[12:15]. Vector width: 4.]
Vectorisation
Exploit parallel data operations in instructions:
c[i] = a[i] + b[i]; // 0 ≤ i < 16

no_vectorisation:              # scalar: one element per iteration
    xor  %eax, %eax
1:
    mov  (%rdx, %rax, 1), %ecx
    add  (%rsi, %rax, 1), %ecx
    mov  %ecx, (%rdi, %rax, 1)
    add  $0x4, %rax
    cmp  $0x40, %rax
    jne  1b
    retq

with_vectorisation:            # SSE, fully unrolled: 4 elements per instruction
    movdqa (%rsi), %xmm0
    paddd  (%rdx), %xmm0
    movaps %xmm0, (%rdi)
    movdqa 0x10(%rsi), %xmm0
    paddd  0x10(%rdx), %xmm0
    movaps %xmm0, 0x10(%rdi)
    movdqa 0x20(%rsi), %xmm0
    paddd  0x20(%rdx), %xmm0
    movaps %xmm0, 0x20(%rdi)
    movdqa 0x30(%rsi), %xmm0
    paddd  0x30(%rdx), %xmm0
    movaps %xmm0, 0x30(%rdi)
    retq

with_vectorisation_no_unroll:  # SSE, loop over 4-element chunks
    xor    %eax, %eax
1:
    movdqu (%rsi, %rax, 4), %xmm0
    movdqu (%rdx, %rax, 4), %xmm1
    paddd  %xmm0, %xmm1
    movdqu %xmm1, (%rdi, %rax, 4)
    add    $0x4, %rax
    cmp    $0x10, %rax
    jne    1b
    retq

avx512:                        # AVX-512: all 16 elements in one instruction
    vmovdqu32 (%rdx), %zmm0
    vpaddd    (%rsi), %zmm0, %zmm0
    vmovdqu32 %zmm0, (%rdi)
    vzeroupper
    retq
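For reference, C source that compilers turn into listings like those above can be as simple as the sketch below (the function name is illustrative; GCC or Clang at -O3 typically emit the SSE variants, and targeting AVX-512 yields the zmm variant):

/* Element-wise addition of 16 ints; restrict tells the compiler the
   arrays don't alias, which enables auto-vectorisation. */
void add16(int *restrict c, const int *restrict a, const int *restrict b)
{
    for (int i = 0; i < 16; i++)
        c[i] = a[i] + b[i];
}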
Function Inlining
Advantages
● Removes call overhead, branch delay
Limitations
● Not all functions can be inlined
● Code size explosion
● May require manual intervention with, e.g., the inline qualifier or function attributes.
test_and_set:
    mov   $0x1, %eax
    xchg  %eax, (%rdi)        # atomically swap 1 into the lock word
    retq

acquire:                      # without inlining: call in a loop
    mov   %rdi, %rdx
1:
    mov   %rdx, %rdi
    callq test_and_set
    test  %eax, %eax
    jne   1b                  # spin while the lock was already held
    retq

acquire:                      # with test_and_set inlined: no call overhead
    mov   $0x1, %edx
1:
    mov   %edx, %eax
    xchg  %eax, (%rdi)
    test  %eax, %eax
    jne   1b
    retq
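In C, the inlined second form can be requested as sketched below, using the GCC-style always_inline attribute and the __atomic_exchange_n builtin (which compiles to xchg on x86):

static inline __attribute__((always_inline))
int test_and_set(volatile int *lock)
{
    /* Atomically store 1 and return the previous value. */
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE);
}

void acquire(volatile int *lock)
{
    /* Spin until the previous value was 0, i.e. the lock was free. */
    while (test_and_set(lock))
        ;
}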
Avoiding Branch Delay
Use predicated instructions.

With predication (ARM): no branches at all
test:
    cmp   r0, #0
    addne r2, r1, r2    @ executes only if r0 != 0
    subeq r2, r1, r2    @ executes only if r0 == 0
    addne r1, r1, r3
    subeq r1, r1, r3
    bx    lr

Without predication: branch around each alternative
test:
    cmp r0, #0
    bne 2f
    sub r2, r1, r2
    sub r1, r1, r3
1:
    bx  lr
2:
    add r2, r1, r2
    add r1, r1, r3
    b   1b
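The C source behind such a listing can be a simple conditional expression; on ARM the compiler may emit the predicated form instead of branches (a sketch with illustrative names):

/* Selects between a + b and a - b. With predication the compiler can
   use addne/subeq (or csel on AArch64) and avoid branching entirely. */
int test(int cond, int a, int b)
{
    return cond ? a + b : a - b;
}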
Floating Point to Fixed Point Conversion
Algorithms are developed in floating point format, using tools such as Matlab
Floating point processors and hardware are expensive!
Fixed point processors and hardware are often used in embedded systems
After algorithms are designed and tested, they are converted into a fixed point implementation.
Algorithms are ported onto a fixed point processor, or ASIC
Qn.m Format
Qn.m is a fixed positional number system for representing fixed point numbers
A Qn.m format binary number has n bits to the left of the binary point (including the sign bit), and m bits to the right of it.
Sign Bit   Integer Bits                      Fractional Bits
-2^(n-1)   2^(n-2)  ...  2^2  2^1  2^0   .   2^-1  2^-2  2^-3  ...  2^-m
Qn.m Format
Q2.10
2 bits are for the 2’s complement integer part. 10 bits are for the fractional part.
Conversion to Qn.m
1. Define the total number of bits used to represent a Qn.m number, e.g. 9 bits:
   b8  b7  b6  b5  b4  b3  b2  b1  b0
2. Fix the location of the binary point, based on the range of values the number takes, e.g. assume 5 bits for the integer portion:
   -b8·2^4   b7·2^3   b6·2^2   b5·2^1   b4·2^0   .   b3·2^-1   b2·2^-2   b1·2^-3   b0·2^-4
Example - Q5.4
Weight:  -2^4   2^3   2^2   2^1   2^0   .   2^-1   2^-2   2^-3    2^-4
Bit:      0     1     0     1     0     .   1      0      1       1
Value:    0     8     0     2     0     .   0.5    0      0.125   0.0625

2^3 + 2^1 + 2^-1 + 2^-3 + 2^-4 = 10.6875
Hexadecimal representation: 0x0AB
A 9-bit Q5.4 fixed point number covers -16 to +15.9375.
Increasing the number of fractional bits increases the precision.
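A minimal C sketch of the conversion and of Q-format multiplication, assuming the Q5.4 value is held in a 16-bit integer (the names and helper functions are illustrative):

#include <stdint.h>

#define Q_FRAC_BITS 4   /* Q5.4: 4 fractional bits */

/* Float -> Q5.4: scale by 2^4 and round to nearest. */
static int16_t to_q5_4(float x)
{
    return (int16_t)(x * (1 << Q_FRAC_BITS) + (x >= 0 ? 0.5f : -0.5f));
}

/* Q5.4 -> float: divide by 2^4. */
static float from_q5_4(int16_t q)
{
    return (float)q / (1 << Q_FRAC_BITS);
}

/* Q5.4 * Q5.4 yields 8 fractional bits; shift back by 4 to return to
   Q5.4. Use a wider intermediate type to avoid overflow. */
static int16_t mul_q5_4(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b) >> Q_FRAC_BITS);
}

For instance, to_q5_4(10.6875f) yields 171 = 0x0AB, matching the worked example above.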
Range Determination for Qn.m Format
Run simulations for all input sets.
Observe the ranges of values taken by all variables.
Note the minimum and maximum value each variable sees, and use these to determine the Qn.m range for that variable.
Q. What do you think about this strategy to determine ranges? Would you have any concerns or alternative suggestions?
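One way to implement this strategy is to instrument the floating point prototype so each variable of interest records the extremes it takes across all simulation runs; a minimal sketch (the tracker type and names are illustrative):

#include <float.h>
#include <stdio.h>

/* Records the observed range of one variable across simulation runs. */
typedef struct {
    const char *name;
    float min, max;
} range_t;

static void range_update(range_t *r, float v)
{
    if (v < r->min) r->min = v;
    if (v > r->max) r->max = v;
}

int main(void)
{
    range_t acc = { "acc", FLT_MAX, -FLT_MAX };
    for (int i = 0; i < 1000; i++) {
        float v = (float)i * 0.01f;   /* stand-in for an algorithm value */
        range_update(&acc, v);
    }
    printf("%s: [%f, %f]\n", acc.name, acc.min, acc.max);
    return 0;
}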
Energy saving: Opportunistic Sleeping
To conserve energy, applications should try to move the processor and peripherals into the lowest usable power mode as soon as possible. (Bit like engine stop/start in cars.)
Interrupts for events wake the processor to perform more processing. Energy savings need to be balanced against reaction time, depending on the application.
[Picture: low-power modes of the Arm Cortex-M0+, from Microchip Developer Help]
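On a Cortex-M part this pattern might be sketched as follows, assuming a CMSIS environment where __WFI() wraps the wait-for-interrupt instruction (the header name and the work_pending flag are illustrative, and production code must also guard against the race between checking the flag and sleeping):

#include "device.h"          /* placeholder for the vendor CMSIS header */

volatile int work_pending;   /* set from an interrupt handler */

void main_loop(void)
{
    for (;;) {
        while (!work_pending)
            __WFI();         /* enter low-power state until an interrupt */
        work_pending = 0;
        /* ...handle the event... */
    }
}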
Summary
Hardware Optimisation
● Focus on energy and power, physical size/complexity and cost
● May be done at design time, or during hardware prototyping
● Choose or design specialised compute or peripheral units
● Examples: CPU, GPU, DSP, FPGA, ASICs
Software Optimisation
● Focus on speed, memory usage, also energy and power, security
● Can be done during software prototyping, with simulation or profiling
● Big impact from compiler code transformations and CPU instruction choices
● Also use hardware support for low-energy: sleep states.