Power and energy efficiency on HPC systems
Dr Michèle Weiland
m. .ac.uk
Lecture content
• What are power and energy efficiency?
• Why is it important for HPC and why do we care?
• How do we measure power/energy?
• How do we influence power and energy usage?
The difference between power and energy
• Power is measured at an instant in time
• Energy is measured over a period of time
• Energy = Power × time
• Joules = Watts × seconds
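As a quick worked example (the figures are illustrative, not from the lecture): a compute node drawing 500 Watts continuously for one hour consumes
Energy = 500 W × 3,600 s = 1,800,000 J = 1.8 MJ (equivalent to 0.5 kWh)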
How do we define “efficiency”?
• Perform computation using resources optimally for performance
• Draw as little power as possible for the computation
• Minimise the energy consumed for the computation
Why do we care about energy efficiency?
• HPC systems consume a lot of power
• Both for operation and for cooling
• The fastest system in the world, Fugaku [1] in Japan, consumes 30MW
• This is potentially problematic for a number of reasons
• Cost: operating a large system becomes very (possibly
prohibitively) expensive
• Environment: power usage at that scale has environmental
implications
• Infrastructure: hosting a system of the scale of Fugaku is
limited to only a small number of sites worldwide
[1] https://www.r-ccs.riken.jp/en/fugaku
Energy efficiency – where are we today?
• Can be measured as the amount of work done per Joule (or equivalently, per watt-second)
• For HPC, this is often floating-point operations per second
per Watt (flop/s/Watt)
• Green500 list shows most energy efficient systems in the world
• The top system today achieves 29.7GFlop/s/Watt – that is 29.7 billion floating-point operations per second for every Watt of power drawn (see the worked example below)!
• Huge strides have been made in recent years – hardware is
becoming more and more efficient
https://www.top500.org/lists/green500/
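As a rough illustrative calculation (a hypothetical machine, not an actual Green500 entry): a system with a 30 MW power budget that sustained 29.7 GFlop/s/Watt would deliver
29.7 × 10^9 flop/s/W × 30 × 10^6 W ≈ 8.9 × 10^17 flop/s, i.e. roughly 0.9 EFlop/s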
Measuring power draw (i)
In order to understand and improve energy efficiency, we must
be able to measure power draw
• Power measurements ideally
• are high resolution and high accuracy;
• include all system components; and
• do not introduce any overheads
Measuring power draw (ii)
Resolution
• Power draw is continuous
• Must decide how often to read the power draw values – this is the resolution of the measurement; the sketch below shows how such samples are turned into an energy estimate
• Read too infrequently and you will likely miss sudden spikes & dips
• Read too often and you are overwhelmed by the volume of data

Accuracy
• Power can be read with differing levels of accuracy, from micro- to mega-Watts
• The more refined the measurement, the more information it carries
• Too little accuracy in the reading means subtle changes might be obscured
• Too much accuracy, however, might result in noisy data
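One reason resolution matters is that energy is normally estimated by integrating discrete power samples over time, so anything happening between samples is invisible. A minimal sketch in Python (the sample data and function name are hypothetical; assume the readings come from whatever monitoring interface is available):

def energy_joules(samples):
    """Trapezoidal integration of (time_s, power_W) samples into Joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)  # average power x interval
    return total

# Hypothetical 1-minute-resolution readings: any spike shorter than a
# minute between these points is simply missed.
samples = [(0, 215.0), (60, 540.0), (120, 555.0), (180, 230.0)]
print(f"Estimated energy: {energy_joules(samples):.0f} J")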
Measuring power draw (iii) – components
System components to measure:
• Individual components: CPU, memory and accelerators
• Whole compute node, including components such as network
interface cards
• Whole system, including network and storage subsystems
Measuring power draw (iv) – overheads
• Measuring power should ideally not consume power itself – or at least the overhead should be quantifiable
• In-band measurements (i.e. software based) have overheads that increase with resolution
• Out-of-band measurements (i.e. hardware based) do not introduce overheads, but often lack resolution and fine-grained control
Examples of methods for measuring power
In-band
• Intel RAPL (Running Average Power Limit) – see the sketch below
• SLURM's sacct/sstat
• likwid-powermeter
• tx2mon

Out-of-band
• Wall-socket power meter
• IPMI (Intelligent Platform Management Interface)
• Intelligent PDU (Power Distribution Unit)
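As a concrete example of an in-band reading, many Linux systems expose the RAPL energy counters through the powercap sysfs interface. A minimal sketch, assuming an Intel CPU with the intel_rapl driver loaded and sufficient permissions (paths can differ between systems):

import time

# RAPL package-0 energy counter, in microjoules; it wraps around at the
# value in max_energy_range_uj, which this simple sketch ignores.
ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    with open(ENERGY_FILE) as f:
        return int(f.read())

e0 = read_energy_uj()
time.sleep(1.0)                      # measurement interval: 1 second
e1 = read_energy_uj()
print(f"Average package power: {(e1 - e0) / 1e6:.1f} W")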
Example 1: Node-level measurement (i)
• Application running on its own on entire system (34 nodes)
• Application is run twice, once without I/O and once doing
frequent writing of multi-GB files on all processes
• Power is measured using IPMI [2], on each compute node
• Measurements are taken at a frequency of one per minute (a polling sketch follows below)
[2] IPMI is a standard interface that vendors use to implement system-level monitoring software
https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface
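A minimal sketch of how such once-per-minute node-level readings could be collected, assuming the ipmitool utility and a BMC that supports DCMI power readings are available (the lecture does not specify the exact tooling, so treat this as one possible set-up):

import datetime
import subprocess
import time

# Log the node's IPMI DCMI power reading once per minute.
# Typically needs root (or access to /dev/ipmi0) to talk to the BMC.
while True:
    result = subprocess.run(["ipmitool", "dcmi", "power", "reading"],
                            capture_output=True, text=True)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open("node_power.log", "a") as log:
        log.write(f"--- {stamp} ---\n{result.stdout}\n")
    time.sleep(60)  # 1 sample per minute, as in the example above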
Example 1: Node-level measurement (ii)
• Good for a high-level overview of system power usage behaviour
• Measurement is out-of-band, therefore no overheads are
introduced
• Shows clear difference between a busy node/system (550W)
and an idle one (215W)
• Also shows that operations such as frequent I/O have a
significant impact on node-level power draw
• Does not show what is happening at the lower (component)
level
Example 2: CPU-level measurement (i)
• likwid-powermeter [3] is a tool that can read RAPL counters
• RAPL = Running Average Power Limit – a power model rather than a direct measurement
• Provides power usage of CPU and memory on Intel CPUs
• Example code runs on 4 MPI processes and takes 32s
[3] Part of the likwid tool suite
https://hpc.fau.de/research/tools/likwid
Example 2: CPU-level measurement (ii)
• likwid-powermeter also gives an indication of the power usage range
• CPU: 85W to 165W
• Memory: 6.375W to 38.23W
• Actual power draw for the example on the previous slide
• CPU: 95W
• Memory: 21W
Example 2: CPU-level measurement (iii)
• Comparing 4 vs 8 MPI processes computing the same problem
• CPU power increases from 95W to 117W
• Memory power increases from 21W to 25W
Super-linear scaling means more than halved energy-to-solution
(3,076J to 1,417J) despite power draw going up!
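These numbers are broadly consistent with Energy = Power × time: if the 3,076J figure refers to CPU (package) energy, then 95 W × 32 s ≈ 3,040 J, which matches closely; and 1,417 J at 117 W implies a runtime of roughly 12 s on 8 processes – well under half of 32 s, which is exactly the super-linear speed-up that lets the energy fall even though the power draw rises.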
How do we influence energy use? (i)
Remember: Energy = Power × time
In order to reduce energy consumption you can:
1. Reduce power draw;
2. Speed up time to solution; or
3. Both!
How do we influence energy use? (ii)
However, even when sitting idle, a system still draws power and
“idle power” cannot be influenced by the user.
In reality therefore we have:
Energy_idle = Power_idle × time (1)
Energy_compute = Power_compute × time (2)
Energy_total = Energy_idle + Energy_compute (3)
We know how to reduce time to solution (optimise!) but how do
we reduce power?
Option 1: Choice of hardware
• Use hardware that is appropriate for your software
• Modern hardware is much more energy efficient than old
hardware
• A good understanding of your software helps you make the correct choice
• Do not (for example) use a CPU+GPU system if your code is
CPU-only
Option 2: Choice of programming model
• Programming languages and parallel programming models all
have different power draw profiles
• distributed vs shared memory (e.g. MPI vs OpenMP)
• compiled vs interpreted (e.g. C vs Python)
• Will depend on implementations, compilers and underlying
hardware
• Need to measure to know what is “best”
• What is “best” depends on what you are optimising for!
Option 2: Choice of programming model (ii)
• NAS Parallel Benchmarks, CG (Conjugate Gradient) test
• Provides separate MPI and OpenMP implementations, both
compiled with Intel v2021 with -O3
• Power and energy measured using likwid-powermeter
• Here, OpenMP is more power-efficient, but MPI is more
energy-efficient
https://www.nas.nasa.gov/software/npb.html
Option 3: Compiler options
• Code compiled with optimisation is more efficient than code
compiled without optimisation
• But it can be more power hungry
• Default behaviour of compilers varies
• GNU default is -O0
• Intel default is -O3
• Always specify compiler optimisation flags explicitly – do not rely on defaults
Option 3: Compiler options (ii)
• OpenMP version of NAS Parallel Benchmarks CG test
• Compiled with Intel v2021, measured using likwid-powermeter
• CPU power: reduces for -O3 vs -O0 with increasing cores
• Memory power: increases for -O3 vs -O0 with increasing cores
• Much faster runtime with -O3, therefore reduced energy use
• on 16 cores, -O0 takes 34.15s, -O3 takes 18.18s
Option 4: CPU frequency scaling (i)
• Modern multi-core CPUs can run at a range of clock
speeds/frequencies
• They have a base clock, as well as “turbo boost” states and
power saving states
• The difference between the highest and the lowest frequencies
for a given CPU can be 3GHz (e.g. 1GHz to 4GHz) or more
• The CPU uses Dynamic Voltage and Frequency Scaling (DVFS) to adjust the clock frequency in response to its workload
• CPU power draw rises with clock frequency – the higher the frequency, the higher the power draw
• The primary motivation for DVFS is that a CPU can
dynamically respond to overheating (i.e. using too much
power) by reducing its clock frequency
• Similarly, the CPU can clock up if it is underused
Option 4: CPU frequency scaling (ii)
• Ideally, we want
• compute intensive applications to run at a high CPU
frequency
• memory intensive applications to run at a low CPU frequency
• However, because DVFS responds to CPU workload, in reality
the clock frequencies can end up the other way round
• A compute intensive application means the CPU is hot, and
DVFS will clock it down
• A memory intensive application means the CPU is cooler, so
DVFS will clock it up
• Memory intensive applications are where we can make the most difference, and they represent the majority of HPC workloads
Option 4: CPU frequency scaling (iii)
• DVFS scales CPU frequency according to a “governor”
• Examples are performance, powersave and userspace
• What governors are available depends on the CPU
• The cpupower utility is used to set/query the frequency (a read-only sysfs sketch follows at the end of this slide)
• Governors like performance and powersave will automatically try to optimise the clock frequency
• The userspace governor, on the other hand, lets the user set a specific clock frequency directly
• Note: if the user sets a frequency that becomes unsafe for the
CPU, DVFS will take over
• The CPU will not be allowed to overheat!
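A minimal sketch of inspecting the frequency-scaling set-up through the Linux cpufreq sysfs interface – a read-only alternative to cpupower (the paths below are standard on Linux but may be absent on some systems; changing the governor, e.g. with cpupower frequency-set -g performance, needs elevated privileges):

# Query the cpufreq governor and frequency limits for CPU 0 via sysfs.
BASE = "/sys/devices/system/cpu/cpu0/cpufreq"

def read(name):
    with open(f"{BASE}/{name}") as f:
        return f.read().strip()

print("Available governors:", read("scaling_available_governors"))
print("Current governor:   ", read("scaling_governor"))
print("Frequency range (kHz):", read("scaling_min_freq"), "-", read("scaling_max_freq"))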
Option 4: STREAM on Fulhame
STREAM
• Measures sustainable memory bandwidth with 4 simple kernels (a simplified triad sketch follows below)
• copy: a(i) = b(i)
• scale: a(i) = n × b(i)
• add: a(i) = b(i) + c(i)
• triad: a(i) = b(i) + n × c(i)

Fulhame
• 4096-core cluster with Marvell ThunderX2 Arm-based CPUs
• tx2mon kernel module
• Power measurement utility specifically for the ThunderX2 processor
https://github.com/Marvell-SPBU/tx2mon
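For illustration only, a much-simplified triad kernel and its bandwidth accounting in Python/NumPy – the real STREAM benchmark is written in C/OpenMP, repeats each kernel many times and avoids the extra memory traffic that NumPy temporaries introduce, so this sketch will understate the achievable bandwidth:

import time
import numpy as np

# Simplified STREAM triad: a(i) = b(i) + n * c(i)
# STREAM counts 24 bytes of traffic per element (read b, read c, write a).
N = 10_000_000
n = 3.0
b = np.full(N, 2.0)
c = np.full(N, 1.0)
a = np.empty(N)

start = time.perf_counter()
a[:] = b + n * c              # NumPy creates temporaries, unlike real STREAM
elapsed = time.perf_counter() - start

bandwidth = 3 * 8 * N / elapsed / 1e9
print(f"Triad: {elapsed:.3f} s, ~{bandwidth:.1f} GB/s (by STREAM accounting)")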
Option 4: STREAM on Fulhame (ii)
• Different CPU frequency governors
• Best bandwidth from performance, ondemand and schedutil
governors, but also highest power draw
Important observations
• Accurate power measurements are often only possible through tools that require elevated privileges and access to system registers
• On HPC systems, the power consumption of shared resources (interconnect, parallel file system) is almost impossible to attribute to individual jobs
• Adding more parallel resources to your job can increase the
energy to solution if parallel efficiency is poor
Important observations (ii)
HPC systems and supercomputers are scientific tools
Their aim is to perform as many scientific simulations as
possible
Reducing science throughput (making your code slower!)
in order to reduce power consumption is a false economy
Summary
• Reducing energy consumption means reducing time to
solution and/or reducing power draw
• Measuring power consumption is important to understand
opportunities for energy/power efficiency improvements
• Dependent on system and access privileges
• Many different ways to influence power usage
• Levels of effectiveness are problem dependent
• Always remember that power and energy efficiency should
never come at the expense of scientific throughput
Questions?