Power and energy efficiency on HPC systems

Dr Michèle Weiland
m. .ac.uk

Lecture content

• What are power and energy efficiency?
• Why is it important for HPC and why do we care?
• How do we measure power/energy?
• How do we influence power and energy usage?

The difference between power and energy

• Power is measured at an instant in time
• Energy is measured over a period of time

• Energy = Power × time
• Joules = Watts × seconds
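
As a quick worked example of this relationship, a minimal C sketch (the 550W figure matches the busy-node reading that appears later in the lecture):

    #include <stdio.h>

    /* Illustrative only: a node drawing a constant 550 W for one hour.
     * Energy [J] = Power [W] x time [s]; 1 kWh = 3.6e6 J. */
    int main(void)
    {
        double power_w  = 550.0;   /* instantaneous power draw in Watts */
        double time_s   = 3600.0;  /* duration in seconds (1 hour) */
        double energy_j = power_w * time_s;

        printf("Energy = %.0f J (= %.2f kWh)\n", energy_j, energy_j / 3.6e6);
        return 0;
    }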

How do we define “efficiency”?

• Perform computation using resources optimally for performance
• Draw as little power as possible for the computation
• Minimise the energy consumed for the computation

Why do we care about energy efficiency?

• HPC systems consume a lot of power
  • Both for operation and for cooling

• The fastest system in the world, Fugaku¹ in Japan, consumes 30MW

• This is potentially problematic for a number of reasons
  • Cost: operating a large system becomes very (possibly prohibitively) expensive
  • Environment: power usage at that scale has environmental implications
  • Infrastructure: hosting a system of the scale of Fugaku is limited to only a small number of sites worldwide

¹ https://www.r-ccs.riken.jp/en/fugaku

[Image: the Fugaku supercomputer]

Energy efficiency – where are we today?

• Can be measured as the amount of work done per Joule (or per Watt-second)
• For HPC, this is often floating-point operations per second per Watt (flop/s/Watt)
• The Green500 list shows the most energy efficient systems in the world
  • The top system today achieves 29.7GFlop/s/Watt, or 29.7 billion floating-point operations per second per Watt!

• Huge strides have been made in recent years – hardware is becoming more and more efficient

https://www.top500.org/lists/green500/
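
To make the metric concrete, a minimal sketch that converts a system's sustained performance and power draw into GFlop/s/Watt; the Fugaku figures below are approximate, taken from public Top500 data:

    #include <stdio.h>

    /* Energy efficiency = sustained performance / power draw.
     * Figures are approximate, from public Top500/Green500 data. */
    int main(void)
    {
        double perf_flops = 442e15;   /* Fugaku HPL: ~442 PFlop/s */
        double power_w    = 30e6;     /* ~30 MW */

        double gflops_per_watt = (perf_flops / power_w) / 1e9;
        printf("Efficiency: %.1f GFlop/s/Watt\n", gflops_per_watt);
        return 0;
    }

At roughly 14.7 GFlop/s/Watt, the fastest system in the world sits well below the Green500 leader's 29.7 GFlop/s/Watt: raw speed and energy efficiency are different optimisation targets.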

Measuring power draw (i)

In order to understand and improve energy efficiency, we must be able to measure power draw.

• Power measurements ideally
  • are high resolution and high accuracy;
  • include all system components; and
  • do not introduce any overheads

Measuring power draw (ii)

Resolution

• Power draw is continuous
• We must decide how often to read the power draw values – this is the resolution of the measurement
• Read too rarely and you will likely miss sudden spikes & dips
• Read too often and you are overwhelmed by the volume of data

Accuracy

• Power can be read with differing levels of accuracy, from micro- to mega-Watts
• The more refined the measurement, the more information it carries
• Too little accuracy in a reading means subtle changes might be obscured
• Too much accuracy however might result in noisy data
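
One way to see why resolution matters: energy is the integral of power over time, so in practice we approximate it from discrete samples. A minimal sketch with synthetic data and the trapezoidal rule – the coarser the sampling interval, the more a short spike is smeared out or missed entirely:

    #include <stdio.h>

    /* Approximate energy from discrete power samples with the trapezoidal
     * rule: E ~= sum over i of (P[i] + P[i+1]) / 2 * dt. */
    static double energy_from_samples(const double *p_w, int n, double dt_s)
    {
        double e_j = 0.0;
        for (int i = 0; i + 1 < n; i++)
            e_j += 0.5 * (p_w[i] + p_w[i + 1]) * dt_s;
        return e_j;
    }

    int main(void)
    {
        /* Synthetic trace: steady 200 W with a brief 500 W spike. */
        double fine[] = {200, 200, 500, 500, 200, 200, 200, 200, 200, 200};

        /* Sampling every second captures the spike: 2400 J. */
        printf("1 s resolution: %.0f J\n", energy_from_samples(fine, 10, 1.0));

        /* Sampling only at t=0 and t=9 misses the spike entirely and
         * under-estimates the energy: 1800 J. */
        double coarse[] = {fine[0], fine[9]};
        printf("9 s resolution: %.0f J\n", energy_from_samples(coarse, 2, 9.0));
        return 0;
    }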

Measuring power draw (iii) – components

System components to measure:

• Individual components: CPU, memory and accelerators
• Whole compute node, including components such as network interface cards
• Whole system, including network and storage subsystems

Measuring power draw (iv) – overheads

• Measuring power should ideally not draw power itself, or at least the overhead should be quantifiable
• In-band measurements (i.e. software based) have overheads that increase with resolution
• Out-of-band measurements (i.e. hardware based) do not introduce overheads, but often lack resolution and fine-grained control

Examples of methods for measuring power

In-band

• Intel RAPL (Running Average Power Limit)
• SLURM's sacct/sstat
• likwid-powermeter
• tx2mon

Out-of-band

• Wall-socket power meter
• IPMI (Intelligent Platform Management Interface)
• Intelligent PDU (Power Distribution Unit)
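
As a concrete example of an in-band reading, a minimal sketch that samples the RAPL energy counter through the Linux powercap sysfs interface, assuming an Intel CPU and read permission on the file (the counter is cumulative, in microjoules, and wraps periodically; wrap handling is omitted here):

    #include <stdio.h>
    #include <unistd.h>

    /* Cumulative energy counter (in microjoules) exposed by the Linux
     * powercap framework for RAPL package 0; path may differ per system. */
    #define RAPL_FILE "/sys/class/powercap/intel-rapl:0/energy_uj"

    static long long read_energy_uj(void)
    {
        long long uj = -1;
        FILE *f = fopen(RAPL_FILE, "r");
        if (!f) return -1;
        if (fscanf(f, "%lld", &uj) != 1) uj = -1;
        fclose(f);
        return uj;
    }

    int main(void)
    {
        double interval_s = 1.0;          /* sampling interval = resolution */
        long long e0 = read_energy_uj();
        sleep((unsigned)interval_s);
        long long e1 = read_energy_uj();

        if (e0 < 0 || e1 < e0) {          /* unreadable, or counter wrapped */
            fprintf(stderr, "RAPL counter not readable\n");
            return 1;
        }
        printf("Average package power: %.2f W\n", (e1 - e0) / 1e6 / interval_s);
        return 0;
    }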

Example 1: Node-level measurement (i)

• Application running on its own on the entire system (34 nodes)
• Application is run twice, once without I/O and once doing frequent writing of multi-GB files on all processes
• Power is measured using IPMI², on each compute node
• Measurements are taken at a frequency of 1 per minute

² IPMI is a standard interface that vendors use to implement system-level monitoring software

https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface
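
A minimal sketch of how such a per-node reading might be collected programmatically, assuming the ipmitool utility is installed and the node's BMC supports the DCMI power reading command (output formats vary by vendor, so the parsing is illustrative only):

    #include <stdio.h>
    #include <string.h>

    /* Poll node power once via ipmitool's DCMI power reading. In a
     * monitoring loop this would run once per minute, matching the
     * sampling frequency used in the example above. */
    int main(void)
    {
        FILE *p = popen("ipmitool dcmi power reading", "r");
        if (!p) { perror("popen"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, p)) {
            /* Look for the instantaneous reading, e.g.
             * "Instantaneous power reading:   550 Watts" (format varies). */
            if (strstr(line, "Instantaneous power reading"))
                fputs(line, stdout);
        }
        pclose(p);
        return 0;
    }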

Example 1: Node-level measurement (ii)

• Good for a high-level overview of system power usage behaviour
• Measurement is out-of-band, therefore no overheads are introduced
• Shows a clear difference between a busy node/system (550W) and an idle one (215W)
• Also shows that operations such as frequent I/O have a significant impact on node-level power draw
• Does not show what is happening at the lower (component) level

Example 2: CPU-level measurement (i)

• likwid-powermeter³ is a tool that can read RAPL counters
• RAPL = Running Average Power Limit – a power model rather than a direct measurement
• Provides power usage of CPU and memory on Intel CPUs
• Example code runs on 4 MPI processes and takes 32s

³ Part of the likwid tool suite

https://hpc.fau.de/research/tools/likwid

Example 2: CPU-level measurement (ii)

• likwid-powermeter also gives an indication of the power usage range
  • CPU: 85W to 165W
  • Memory: 6.375W to 38.23W

• Actual power draw for the example on the previous slide
  • CPU: 95W
  • Memory: 21W

Example 2: CPU-level measurement (iii)

• Comparing 4 vs 8 MPI processes computing the same problem
• CPU power increases from 95W to 117W
• Memory power increases from 21W to 25W

Super-linear scaling means the energy-to-solution is more than halved (3,076J to 1,417J), despite the power draw going up!

How do we influence energy use? (i)

Remember: Energy = Power × time

In order to reduce energy consumption you can either:

1. Reduce power draw; or
2. Speed up time to solution;
3. Or both!

How do we influence energy use? (ii)

However, even when sitting idle, a system still draws power, and this "idle power" cannot be influenced by the user.

In reality, therefore, we have:

    Energy_idle    = Power_idle × time              (1)
    Energy_compute = Power_compute × time           (2)
    Energy_total   = Energy_idle + Energy_compute   (3)

We know how to reduce time to solution (optimise!) but how do we reduce power?
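
A small worked sketch of equations (1)–(3), reusing the node-level readings from Example 1 (215W idle, 550W busy) purely as illustration – note how the idle baseline is paid for the entire runtime, so reducing time to solution shrinks both terms:

    #include <stdio.h>

    /* Worked example of equations (1)-(3), using the idle/busy node
     * readings from Example 1 (215 W idle, 550 W busy) as illustration.
     * Power_compute is taken as the draw above the idle baseline. */
    int main(void)
    {
        double p_idle    = 215.0;
        double p_compute = 550.0 - 215.0;   /* 335 W above idle */

        /* Halving the time to solution halves both energy terms. */
        for (double t = 3600.0; t >= 900.0; t /= 2.0) {
            double e_idle    = p_idle * t;
            double e_compute = p_compute * t;
            printf("t = %5.0f s: E_idle = %8.0f J, E_compute = %8.0f J, "
                   "E_total = %8.0f J\n",
                   t, e_idle, e_compute, e_idle + e_compute);
        }
        return 0;
    }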

Option 1: Choice of hardware

• Use hardware that is appropriate for your software
• Modern hardware is much more energy efficient than old hardware
• A good understanding of your software helps you make the correct choice
• Do not (for example) use a CPU+GPU system if your code is CPU-only

Option 2: Choice of programming model

• Programming languages and parallel programming models all have different power draw profiles
  • distributed vs shared memory (e.g. MPI vs OpenMP)
  • compiled vs interpreted (e.g. C vs Python)

• Results will depend on implementations, compilers and the underlying hardware
• You need to measure to know what is "best"
• What is "best" depends on what you are optimising for!

Option 2: Choice of programming model (ii)

• NAS Parallel Benchmarks, CG (Conjugate Gradient) test
• Provides separate MPI and OpenMP implementations, both compiled with Intel v2021 with -O3
• Power and energy measured using likwid-powermeter
• Here, OpenMP is more power-efficient, but MPI is more energy-efficient

https://www.nas.nasa.gov/software/npb.html

Option 3: Compiler options

• Code compiled with optimisation is more efficient than code compiled without optimisation
• But it can be more power hungry
• Default behaviour of compilers varies
  • GNU default is -O0
  • Intel default is -O3

• Always specify compiler optimisation flags explicitly – do not rely on defaults

Option 3: Compiler options (ii)

• OpenMP version of the NAS Parallel Benchmarks CG test
• Compiled with Intel v2021, measured using likwid-powermeter
• CPU power: reduces for -O3 vs -O0 with increasing cores
• Memory power: increases for -O3 vs -O0 with increasing cores
• Much faster runtime with -O3, therefore reduced energy use
  • on 16 cores, -O0 takes 34.15s, -O3 takes 18.18s

Option 4: CPU frequency scaling (i)

• Modern multi-core CPUs can run at a range of clock speeds/frequencies
• They have a base clock, as well as "turbo boost" states and power saving states
• The difference between the highest and the lowest frequencies for a given CPU can be 3GHz (e.g. 1GHz to 4GHz) or more
• The CPU uses Dynamic Voltage and Frequency Scaling (DVFS) to adjust the clock frequency in response to its workload
• CPU power draw rises with clock frequency (and with the voltage needed to sustain it) – the higher the frequency, the higher the power draw
• The primary motivation for DVFS is that a CPU can dynamically respond to overheating (i.e. using too much power) by reducing its clock frequency
• Similarly, the CPU can clock up if it is underused
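
DVFS can be observed in action on Linux through the cpufreq sysfs interface. A minimal sketch that reads the current frequency of core 0 (the path and its availability depend on the kernel and the CPU's frequency driver):

    #include <stdio.h>

    /* Current clock frequency of CPU 0, in kHz, as reported by the
     * Linux cpufreq subsystem (availability depends on the driver). */
    #define FREQ_FILE "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"

    int main(void)
    {
        long khz = 0;
        FILE *f = fopen(FREQ_FILE, "r");
        if (!f) { perror(FREQ_FILE); return 1; }
        if (fscanf(f, "%ld", &khz) != 1) { fclose(f); return 1; }
        fclose(f);
        printf("CPU0 frequency: %.2f GHz\n", khz / 1e6);
        return 0;
    }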

Option 4: CPU frequency scaling (ii)

• Ideally, we want
  • compute intensive applications to run at a high CPU frequency
  • memory intensive applications to run at a low CPU frequency

• However, because DVFS responds to CPU workload, in reality the clock frequencies can end up the other way round
  • A compute intensive application means the CPU is hot, and DVFS will clock it down
  • A memory intensive application means the CPU is cooler, so DVFS will clock it up

• Memory intensive applications are where we can make the most difference, and they represent the majority of HPC workloads

Option 4: CPU frequency scaling (iii)

• DVFS scales CPU frequency according to a "governor"
  • Examples are performance, powersave and userspace
  • Which governors are available depends on the CPU

• The cpupower utility is used to set/query the frequency
• Governors like performance and powersave will automatically try to optimise the clock frequency
• The userspace governor on the other hand lets the user set a specific clock frequency directly
• Note: if the user sets a frequency that becomes unsafe for the CPU, DVFS will take over
  • The CPU will not be allowed to overheat!
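
A minimal sketch that queries the active governor through the same sysfs interface that cpupower uses underneath (changing the governor, e.g. with `cpupower frequency-set -g userspace`, requires root privileges, which is why this is often not possible for users on shared HPC systems):

    #include <stdio.h>

    /* The cpufreq governor currently controlling CPU 0. Reading this file
     * is what `cpupower frequency-info` does; writing a governor name to
     * it (as root) is what `cpupower frequency-set -g <governor>` does. */
    #define GOV_FILE "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

    int main(void)
    {
        char governor[64];
        FILE *f = fopen(GOV_FILE, "r");
        if (!f) { perror(GOV_FILE); return 1; }
        if (!fgets(governor, sizeof governor, f)) { fclose(f); return 1; }
        fclose(f);
        printf("CPU0 governor: %s", governor);   /* value ends in newline */
        return 0;
    }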

Option 4: STREAM on Fulhame

STREAM

• Measures sustainable memory bandwidth with 4 simple kernels:
  • copy: a(i) = b(i)
  • scale: a(i) = n × b(i)
  • add: a(i) = b(i) + c(i)
  • triad: a(i) = b(i) + n × c(i)

Fulhame

• A 4096-core cluster with Marvell ThunderX2 Arm-based CPUs
• tx2mon kernel module: a power measurement utility specifically for the ThunderX2 processor

https://github.com/Marvell-SPBU/tx2mon
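
For reference, a stripped-down OpenMP version of the triad kernel in the style of STREAM (the real benchmark adds timing, repetitions and validation; compile with OpenMP enabled, e.g. -fopenmp, and keep the arrays much larger than the caches):

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 25)   /* ~33M doubles per array, far larger than cache */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        double scalar = 3.0;

        /* Initialise (also first-touches pages on NUMA systems). */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        /* triad: a(i) = b(i) + n x c(i) -- two loads and one store per
         * element, and very little arithmetic: a memory-bound kernel. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];

        printf("a[0] = %f\n", a[0]);   /* stop the compiler eliding the loop */
        free(a); free(b); free(c);
        return 0;
    }

Because the kernel moves many bytes per floating-point operation, its performance is limited by memory bandwidth rather than clock speed – which is exactly why frequency scaling can cut power draw here without costing much bandwidth.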

Option 4: STREAM on Fulhame (ii)

• Comparison of different CPU frequency governors
• Best bandwidth from the performance, ondemand and schedutil governors, but also the highest power draw

Important observations

• Accurate power measurements are often only available through tools that require elevated privileges and access to system registers
• On HPC systems, the power consumption of shared resources (interconnect, parallel file system) is almost impossible to attribute to individual jobs
• Adding more parallel resources to your job can increase the energy to solution if parallel efficiency is poor

Important observations (ii)

HPC systems and supercomputers are scientific tools.

Their aim is to perform as many scientific simulations as possible.

Reducing science throughput (making your code slower!) in order to reduce power consumption is a false economy.

Summary

• Reducing energy consumption means reducing time to solution and/or reducing power draw
• Measuring power consumption is important to understand opportunities for energy/power efficiency improvements
  • It is dependent on the system and on access privileges

• There are many different ways to influence power usage
  • Levels of effectiveness are problem dependent

• Always remember that power and energy efficiency should never come at the expense of scientific throughput

Questions?
