COMP528: Multi-core and
Multi-Processor Programming
3 – HAL: terminology & Top500
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
Contact
• MS Teams
• channels for:
• general announcements & items of general interest
• lab sessions
• you can also chat direct to me
• Email: m.k. .uk
Aims
• To provide students with a deep, critical and systematic understanding of key
issues and effective solutions for parallel programming for systems with multi-
core processors and parallel architectures
• To develop students' appreciation of a variety of approaches to parallel
programming, including using MPI and OpenMP
• To develop students' skills in parallel programming, in particular using MPI
and OpenMP
• To develop students' skills in the parallelization of ideas, algorithms and
existing serial code.
Recap Slide
• Strategy:
– Limit CPU speed and sophistication
– Put multiple CPUs (“cores”) on a single chip in a socket
– Several (2 or 4) sockets on a node (cf motherboard)
– Connect 10s or 100s or 1000s of nodes…
• Potential performance of a cluster:
CPU freq * #ops per cycle * #cores per CPU * #CPUs per node * #nodes
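As a concrete illustration, this formula can be evaluated directly. A minimal C sketch, where every hardware number is an assumption chosen purely for illustration (not the spec of any particular machine):

    #include <stdio.h>

    int main(void) {
        /* all figures below are illustrative assumptions, not a real machine */
        double freq_ghz      = 2.5;   /* clock frequency in GHz               */
        int    ops_per_cycle = 16;    /* e.g. a wide FMA vector unit          */
        int    cores_per_cpu = 20;
        int    cpus_per_node = 2;
        int    nodes         = 100;

        /* GHz x ops/cycle x cores x sockets x nodes => peak in GFLOP/s */
        double peak_gflops = freq_ghz * ops_per_cycle * cores_per_cpu
                             * cpus_per_node * nodes;

        printf("theoretical peak: %.0f GFLOP/s = %.1f TFLOP/s\n",
               peak_gflops, peak_gflops / 1e3);
        return 0;
    }

Note this is a theoretical peak; real codes fall short of it due to memory access, communication and serial sections.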
Methodology
• Study problem, sequential program, or code segment
• Look for opportunities for parallelism
• Usually best to start by thinking abstractly about the problem to be solved,
not about any current program implementation of a given solution
• Try to keep all cores of all processors busy doing useful elements of the
required work
• Processors (cores) may be placed locally (multicore processors) or connected
by local/global networks => a huge variety of approaches/methods
(after Intel Software College)
Terminology
• Processor
– a die on a socket
– may comprise several cores (where instructions are executed)
• Node
– same as "motherboard"
– comprises one or more sockets, i.e. one or more processors
• Cluster
– comprises several nodes
– nodes connected by an interconnect
• CPU?
– sometimes used by itself to mean 'core'
– sometimes people say "CPU core" (meaning core)
– generally (if used by itself) used to mean 'processor'
A Core of a Processor can be very complex
• decades of development
• vendor competition (Intel, AMD, IBM and now ARM)
• a modern, general purpose core:
– supports “out of order” execution & “speculative execution”
– has a vector unit
– works with many levels of memory (L1, L2, L3 cache)
– may support “hyper threading”
Hyper Threading on a Core
• Many modern architectures allow "hyper-threading" via hardware threads
– for each physical core, there is a number of "hyper-threads" (hardware threads)
• typically 2 HT per core for Intel and for AMD
• e.g. 4 or 8 HT per core for IBM Power9
• The operating system sees a larger number of "virtual CPUs"
– can be useful for throughput on some workloads
(as long as the physical resources are not over-stretched)
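A hedged sketch of how a program can see these "virtual CPUs". Both calls below are standard (POSIX sysconf and the OpenMP runtime); on an HT-enabled node they typically report physical cores × hardware threads per core:

    #include <stdio.h>
    #include <unistd.h>   /* sysconf(_SC_NPROCESSORS_ONLN), POSIX */
    #include <omp.h>      /* omp_get_num_procs() */

    int main(void) {
        /* logical ("virtual") CPUs the OS presents, including hyper-threads */
        printf("logical CPUs (sysconf): %ld\n",
               sysconf(_SC_NPROCESSORS_ONLN));
        printf("logical CPUs (OpenMP):  %d\n", omp_get_num_procs());
        return 0;
    }

(Compile with e.g. gcc -fopenmp.)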
[Diagram: a single-core processor (one core plus cache) versus a dual-core processor (two cores sharing a cache), each connected to the rest of the system.]
Putting multiple cores on a chip is the simplest way of enabling a chip to run multiple threads.
Single-core processor with two hardware threads
[Diagram: the core holds the state of two hardware threads (Thread 1, Thread 2), which share the core's cache and its connection to the rest of the system.]
• Variations on the design:
– the core could execute instructions from one thread for n cycles and then
switch to another thread for the next n cycles ("coarse-grained" multithreading)
– the core could alternate every cycle between fetching an instruction from
one thread and from another thread ("fine-grained" multithreading)
– the core could simultaneously fetch an instruction from each of multiple
threads every cycle (simultaneous multithreading, SMT)
– the switching can depend on long-latency events, e.g. a cache miss
("switch-on-event" multithreading)
• LIMITATIONS?
– if there is only one physical ALU in the core, will 2 threads saturate it?
– if the code is memory-intensive, will 2 threads saturate the cache?
Multi-core processors
• It is common to turn HT off for HPC
– e.g. for Intel Xeon chips (as per UoL "Barkla") it is advantageous to turn
off hyper-threading
• but for Intel Xeon Phi chips
– different overall architecture design
– best to try 1, 2, 3 or 4 software threads per physical core
• and IBM chips are also usually okay with some HT
(Aside: whilst we have some Xeon Phi nodes at Liverpool, Intel have since
discontinued this product line.)
For this course…
• We presume NO hyper-threading
• We will use Intel Skylake processors, with HT turned off in the BIOS
• Clock speed is not fixed
• Aside: to balance power consumption with speed, the clock speed slowly
drops when the chip is idle; when there is work to do, it quickly ramps
back up.
• You should therefore take several timings (of a non-trivial run [>5 secs])
and use the best; see the sketch below.
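A minimal sketch of this advice, using omp_get_wtime() as a portable wall-clock timer; do_work() is a hypothetical stand-in for whatever is actually being measured:

    #include <stdio.h>
    #include <omp.h>   /* omp_get_wtime(): portable wall-clock timer */

    /* hypothetical stand-in for the code being measured; a real
       measurement should run for more than ~5 seconds */
    void do_work(void) {
        volatile double s = 0.0;
        for (long i = 0; i < 500000000L; i++) s += 1.0e-9;
    }

    int main(void) {
        double best = 1.0e30;
        for (int rep = 0; rep < 5; rep++) {    /* several timings...  */
            double t0 = omp_get_wtime();
            do_work();
            double t = omp_get_wtime() - t0;
            printf("run %d: %.3f s\n", rep, t);
            if (t < best) best = t;            /* ...keep the best    */
        }
        printf("best of 5: %.3f s\n", best);
        return 0;
    }

The first run is often the slowest, while the clock ramps up and the caches warm; hence best-of-several rather than a single measurement.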
More on threads…
• A system is usually capable of running many more software threads than
there are hardware (hyper-)threads
– many of the threads will be inactive
– a context switch moves a thread on or off a hardware thread
• In HPC systems, a reduced kernel frequently runs on the compute nodes
– removes any un-required functionality
=> fewer threads
=> less context switching away from real work to OS bureaucracy
Memory – very important wrt performance
• Memory is hierarchical:
• registers
• L1 cache
• L2, L3 cache
• LLC
• RAM
• swapped to disk
• LLC = last-level cache before the RAM
• will be L3 if it exists
• Implementation options
• L1 per core
• RAM is shared by the node
• Options…
• share L2 per processor, but with L1 private per core?
• share L2 per pair of cores
(e.g. a 6-core chip has 3 sets of L2 caches)
• Is it good to have lots of threads accessing the same memory?
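One way to see why the hierarchy matters for performance: the same amount of arithmetic runs at very different speeds depending on how memory is accessed. An illustrative sketch (not from the slides), comparing stride-1 access with a strided pattern that uses only one double per 64-byte cache line on each pass:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>         /* omp_get_wtime() */

    #define N (1L << 24)     /* 16M doubles = 128 MB, larger than the LLC */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        double t0 = omp_get_wtime();
        for (long i = 0; i < N; i++)          /* stride-1: every byte of  */
            sum += a[i];                      /* each cache line is used  */
        double t1 = omp_get_wtime();
        for (long j = 0; j < 8; j++)          /* stride-8: each pass uses */
            for (long i = j; i < N; i += 8)   /* one double per line, so  */
                sum += a[i];                  /* the array is re-fetched  */
        double t2 = omp_get_wtime();          /* 8 times from RAM         */

        printf("stride-1: %.3f s   stride-8 (8 passes): %.3f s  (sum=%g)\n",
               t1 - t0, t2 - t1, sum);
        free(a);
        return 0;
    }

Both versions perform the same number of additions, but the strided version moves roughly 8x more data from RAM and is correspondingly slower.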
HPC Machines
Potential performance:
core freq * #ops per cycle * #cores per processor * #processors per node * #nodes
Machine | Cost | Memory | Power requirements | Performance (FLOP/s)
1948 "Baby" computer, Manchester | | | | 1.1 K
1985 Cray 2 | $16M | | | 2 G
2013 ARCHER (Cray XC30), 118K cores (#41 in Top500) | £43M | 64 GB/node | ~2 MW; 641 MFLOPS/W | 1.6 P
2015 iPhone 6S, ARM / Apple A9, 2 cores | £500 | 2 GB | | 4.9 G
2015 Raspberry Pi 2B, ARMv7, 4 cores | £30 | 1 GB | | 50 M per core; 200 M per RPi
2013-2015 Tianhe-2 (#1 of Top500), 3.1M cores | | 1 PB | 17.8 MW | 33.86 P
2015 Shoubu, RIKEN (#1 of Green500 in 2015), 1.2M cores | | 82 TB | 50.32 kW; 6.7 GFLOPS/W (Green500) | 606 T
2016 Sunway TaihuLight, 10.6M cores (new Chinese chip/interconnect etc) | $270M (inc R&D to design chips etc) | 1.3 PB | 15.4 MW; 6 GFLOPS/W (Green500) | 93 P (LINPACK, Top500) out of peak 125 P
2018 Summit, IBM Power9 CPUs + NVIDIA V100 GPUs, 2.3M cores | | 250 PB | 9.8 MW; 14.7 GFLOPS/W (#3 in Green500) | 143 P (LINPACK, Top500) out of peak 201 P
2018 Shoubu "B", RIKEN, 0.9M cores | | | 60 kW; 17.6 GFLOPS/W (#1 in Green500) | 0.9 P

Images: cs.man.ac.uk, CW, appleapple.top, top500/JD, RIKEN
Performance Metric/s
• FLOP: floating point operation
• Y = M*X + C → 2 FLOPs
K = A + B + C → 2 FLOPs
ALPHA = 1.0 / BETA → 1 FLOP
• The cost to execute a FLOP is not fixed (it varies with the CPU architecture)
• a division requires more instructions than a multiplication
• exponentials, logs and trigonometric functions cost more still
• Performance: FLOPs per second
GigaFLOP/s = 10^9 FLOP/s
TeraFLOP/s = 10^12 FLOP/s
PetaFLOP/s = 10^15 FLOP/s
ExaFLOP/s = 10^18 FLOP/s
• "race to exascale"
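These definitions can be checked empirically: count the FLOPs a loop performs and divide by the elapsed time. A sketch for the Y = M*X + C example above (array size chosen arbitrarily):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>              /* omp_get_wtime() */

    #define N 20000000L           /* 2e7 elements, 2 FLOPs each */

    int main(void) {
        double *x = malloc(N * sizeof(double));
        double *y = malloc(N * sizeof(double));
        double m = 2.0, c = 1.0;
        for (long i = 0; i < N; i++) x[i] = (double)i;

        double t0 = omp_get_wtime();
        for (long i = 0; i < N; i++)
            y[i] = m * x[i] + c;  /* 1 multiply + 1 add = 2 FLOPs */
        double t = omp_get_wtime() - t0;

        printf("%.4f s -> %.2f GFLOP/s (y[1]=%g)\n",
               t, (2.0 * N) / t / 1e9, y[1]);
        free(x); free(y);
        return 0;
    }

A loop like this is memory-bound, so the achieved GFLOP/s will sit far below the processor's theoretical peak; that gap is one reason benchmark choice (e.g. LINPACK, discussed later) matters.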
High Performance
• millions of cores
• ~20 cores on a processor
• a couple of processors on a node
• and LOTS of nodes in a “supercomputer”
• #1 supercomputer (Summit): peak ~200 PFLOP/s & 250 PByte of storage
• 1 PetaFLOP = the number of floating point operations (maths) that would
take the whole population of Manchester (roughly half a million people),
each doing one calculation per second, about 62 years to complete
(10^15 ops ÷ ~5×10^5 people ÷ ~3.2×10^7 seconds/year ≈ 62 years)
• a 1 PFLOP/s machine does all that maths in a single second!
• 1 PetaByte of RAM: a stack of DVDs about 250m high
based upon https://kb.iu.edu/d/apeq
• Needs as much electrical power as a small town
KEY QUESTION: how would you programme it efficiently?
• Top500
• https://www.top500.org
• twice-yearly (June & November) list of the world's 500 fastest supercomputers
• June2018 list
• 3 of top 10 have NVIDIA “Volta” GPUs
• Accelerators present in most of the rest
• UK entries: …
• "race to exascale"?
HPC Performance over the Years
• “Moore’s Law”: doubling every
two years of
#transistors | clock speeds |
processor performance
• #500 on this year's list would have
been #1 roughly 10 years ago
• The 2015 iPhone 6 would have
made the 1997 Top500 list
[Chart: Top500 performance development over time, with trend lines for SUM (all 500 systems), #1 and #500.]
“Top500”
THE GOOD
• Lots of data
• Can use web page to plot historical
trends
• Rise of accelerators
• Which countries?
• etc
• Easy to compare
• Drives competition => drives
innovation
THE NOT SO GOOD
• What machines does it cover?
• (or not cover…)
• How is the performance measured?
• LINPACK
• What is LINPACK?
• Is it “typical”?
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane