
COMP528 HAL03: Terminology & Top500

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

3 – HAL

Contact

• MS Teams
• channels for:

• general announcements & items of general interest
• lab sessions

• you can also chat directly with me

• Email: m.k. .uk

COMP328/COMP528 (c) mkbane, University of Liverpool

Aims

• To provide students with a deep, critical and systematic understanding of key
issues and effective solutions for parallel programming for systems with multi-
core processors and parallel architectures

• To develop students’ appreciation of a variety of approaches to parallel
programming, including using MPI and OpenMP

• To develop the students’ skills in parallel programming in particular using MPI
and OpenMP

• To develop the students’ skills in parallelization of ideas, algorithms and of
existing serial code.

Recap Slide
• Strategy:

– Limit CPU speed and sophistication
– Put multiple CPUs (“cores”) on a single chip in a socket
– Several (2 or 4) sockets on a node (cf motherboard)

– Connect 10s or 100s or 1000s of nodes…

• Potential performance of a cluster:
CPU freq * #ops per cycle * #cores per CPU * #CPUs per node * #nodes
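To make the formula concrete, here is a minimal sketch in C that evaluates it; every figure below is an illustrative assumption, not the specification of any particular machine:

/* Minimal sketch: evaluating the peak-performance formula above.
 * All figures are illustrative assumptions, not any real machine's spec. */
#include <stdio.h>

int main(void)
{
    double freq_hz       = 2.5e9;   /* assumed core clock frequency (2.5 GHz) */
    double ops_per_cycle = 16.0;    /* assumed FLOPs per cycle (wide vector + FMA) */
    int cores_per_cpu    = 20;      /* assumed cores per processor */
    int cpus_per_node    = 2;       /* assumed sockets (processors) per node */
    int nodes            = 100;     /* assumed number of nodes in the cluster */

    double peak = freq_hz * ops_per_cycle * cores_per_cpu * cpus_per_node * nodes;
    printf("theoretical peak = %.1f TFLOPS/sec\n", peak / 1e12);   /* 160.0 with these figures */
    return 0;
}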

Methodology

• Study problem, sequential program, or code segment
• Look for opportunities for parallelism

• Usually best to start by thinking abstractly about the problem to be solved,
rather than from any existing program implementation of a given solution

• Try to keep all cores of all processors busy
doing useful elements of the required work

• Processors (cores) could be either placed locally (multicore
processors), or connected by local/global networks => huge
variety of approaches/methods

(after Intel Software College)

Terminology

• Processor
– die on a socket

– may comprise
several cores
(where instructions
are executed)

• Node
– same as “motherboard”

– comprises one or more sockets i.e. one or more processors

• Cluster
– comprises several nodes

– nodes connected by an interconnect

• CPU?
– sometimes used by itself to mean ‘core’

– sometimes people say “CPU core” (meaning core)

– generally (if used by itself) used to mean ‘processor’

A Core of a Processor can be very complex

• decades of development

• vendor competition (Intel, AMD, IBM and now ARM)

• a modern, general purpose core:
– supports “out of order” execution & “speculative execution”

– has a vector unit (illustrated in the sketch after this list)

– works with many levels of memory (L1, L2, L3 cache)

– may support “hyper threading”
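A minimal sketch of the kind of loop the vector unit accelerates: with optimisation enabled (e.g. -O3), a compiler will typically turn this into SIMD instructions that operate on several doubles at once; the array size is an arbitrary choice.

/* Minimal sketch: a loop a vectorising compiler can map onto the core's
 * vector unit, processing several elements per instruction. */
#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Independent iterations: a good candidate for SIMD vectorisation. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[10] = %.1f\n", a[10]);    /* stop the compiler discarding the work */
    return 0;
}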

Hyper Threading on a Core

• Many modern architectures allow “hyper-threading” via
hardware threads
– For each physical core, there are a number of “hyper threads”

• hardware threads

• typically 2 HT for Intel and for AMD

• e.g. 4 or 8 HT for IBM Power9

• The operating system sees a larger number of “virtual CPUs“
– can be useful for improving throughput of some workloads

(as long as the physical resources are not over-stretched)
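A minimal sketch (Linux; compile with the compiler's OpenMP flag, e.g. -fopenmp) of how a program can ask how many of these “virtual CPUs” the OS exposes:

/* Minimal sketch: counting the logical CPUs ("virtual CPUs") the OS exposes.
 * With hyper-threading enabled this is typically 2x the physical core count
 * on Intel/AMD; with HT off it equals the physical core count. */
#include <stdio.h>
#include <unistd.h>     /* sysconf() */
#include <omp.h>        /* omp_get_num_procs() */

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);        /* logical CPUs currently online */
    printf("sysconf: %ld logical CPUs online\n", online);
    printf("OpenMP:  %d processors available\n", omp_get_num_procs());
    return 0;
}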

[Diagram: a single-core processor (one core with its cache) connected to the rest of the system]

[Diagram: a dual-core processor (two cores sharing a cache) connected to the rest of the system.
This is the simplest way of enabling a chip to run multiple threads]

[Diagram: a single-core processor with two hardware threads. The core presents Thread 1 and
Thread 2 to the rest of the system, both sharing the core and its cache]

• Variations on the design

– The core could execute instructions
from one thread for n cycles and then
switch to another thread for the next
n cycles

– The core could alternate every cycle
between fetching an instruction from
one thread and from another thread

– The core could simultaneously fetch
an instruction from each of multiple
threads every cycle

– The switching can depend on long
latency events (e.g. cache miss)


• LIMITATIONS?
– If there is only one physical ALU in the core, will 2 threads super-saturate it?

– For memory-intensive code, will 2 threads super-saturate the memory cache?

Multi-core processors

• Common to turn HT off for HPC
– e.g. for Intel Xeon chips (as per UoL «Barkla»)

it is advantageous to turn off hyper-threading

• but for Intel XeonPhi chips
– different overall architecture design

– best to try 1, 2, 3 or 4 software threads per physical core

• and IBM chips are also usually okay with some HT

Whilst we have some XeonPhi
nodes at Liverpool, Intel have
since terminated this product line

For this course…

• We presume NO hyper-threading

• We will use Intel Skylake processors
• HT turned off in the BIOS
• Clock speed is not fixed

• Aside: to balance power consumption with speed, the clock speed slowly drops
when the chip is idle; when there is work to do, it quickly ramps up again.

• You should therefore take several timings (of non-trivial runs, >5 secs each)
and use the best, to get a fair measure of performance
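A minimal sketch of this timing methodology: run a stand-in kernel several times and keep the best time, so the first (slower, clock still ramping up) run does not distort the result. The kernel, problem size and repeat count are placeholder choices.

/* Minimal sketch: repeat a non-trivial run several times and report the best,
 * so results are not skewed by the clock still ramping up on the first run.
 * The kernel and sizes are placeholders for real work lasting a few seconds. */
#include <stdio.h>
#include <omp.h>                        /* omp_get_wtime(): wall-clock timer */

#define N        500000000L             /* sized so one run takes a few seconds */
#define REPEATS  5

static double work(void)
{
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += (double)i * 1.0e-9;      /* stand-in for the real computation */
    return sum;
}

int main(void)
{
    double best = 1.0e30;
    for (int r = 0; r < REPEATS; r++) {
        double t0 = omp_get_wtime();
        volatile double s = work();     /* volatile: stop the compiler removing the loop */
        double t = omp_get_wtime() - t0;
        (void)s;
        printf("run %d: %.3f s\n", r, t);
        if (t < best) best = t;
    }
    printf("best of %d runs: %.3f s\n", REPEATS, best);
    return 0;
}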

More on threads…

• A system is usually capable of running many more software
threads than there are hardware (hyper) threads;
– Many of the threads will be inactive;

– A context switch moves a thread on or off a hardware thread

• In HPC systems, a reduced kernel frequently runs on the
compute nodes
– Remove any un-required functionality

=> fewer threads
=> less context switching away from real work to OS bureaucracy

Memory – very important wrt performance

• Memory is hierarchical
• Register
• L1 cache
• L2, L3 cache

• LLC cache
• RAM
• Swapped to disk

• LLC = last level cache before the RAM
• Will be L3 if it exists

• Implementation options
• L1 per core
• RAM is shared by node

• Options…
• Share L2 per processor, but with L1 private per core?
• Share L2 per pair of cores
(e.g. a 6-core chip has 3 sets of L2 caches)

Is it good to have lots of threads accessing same
memory?
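Before answering that, a minimal sketch of why the hierarchy matters at all (sizes and stride are illustrative choices): the same number of additions runs noticeably slower when the access pattern wastes most of each cache line fetched from RAM.

/* Minimal sketch: identical arithmetic, very different memory behaviour.
 * The array is made much larger than the LLC, so the strided pass keeps
 * re-fetching cache lines from RAM and runs several times slower. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>                            /* omp_get_wtime() */

#define N (1L << 26)                        /* 64M doubles = 512 MB (assumed >> LLC) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;

    /* Contiguous (stride-1) pass: every byte of each fetched cache line is used. */
    double t0 = omp_get_wtime();
    double sum1 = 0.0;
    for (long i = 0; i < N; i++) sum1 += a[i];
    double t_contig = omp_get_wtime() - t0;

    /* Strided pass: 8 doubles = one typical 64-byte cache line, so each inner
     * loop uses only 1/8 of every line it pulls in; same number of additions. */
    t0 = omp_get_wtime();
    double sum2 = 0.0;
    for (long s = 0; s < 8; s++)
        for (long i = s; i < N; i += 8) sum2 += a[i];
    double t_strided = omp_get_wtime() - t0;

    printf("contiguous %.3f s, strided %.3f s (sums %.0f %.0f)\n",
           t_contig, t_strided, sum1, sum2);
    free(a);
    return 0;
}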


HPC Machines
Potential performance:

core freq * #ops per cycle * #cores per processor * #processors per node * #nodes


Machine | Cost | Memory | Power requirements | FLOPS per second
1948 “Baby” computer, Manchester | - | - | - | 1.1 K
1985 Cray 2 | $16M | - | - | 2 G
2013 ARCHER (Cray XC30), 118K cores (#41 in Top500) | £43M | 64 GB/node | ~2 MW, 641 MFLOPS/W | 1.6 P
2015 iPhone 6S, ARM / Apple A9, 2 cores | £500 | 2 GB | - | 4.9 G
2015 Raspberry Pi 2B, ARMv7, 4 cores | £30 | 1 GB | - | 50 M per core, 200 M per RPi
2013-2015 Tianhe-2 (#1 of Top500), 3.1M cores | - | 1 PB | 17.8 MW | 33.86 P
2015 Shoubu, RIKEN (#1 of Green500), 1.2M cores | - | 82 TB | 50.32 kW, 7 GFLOPS/W | 606 T
2016 Sunway TaihuLight, 10.6M cores (new Chinese chip/interconnect etc) | $270M (inc R&D to design chips etc) | 1.3 PB | 15.4 MW, 6 GFLOPS/W | 93 P

Images: cs.man.ac.uk, CW, appleapple.top, top500/JD, RIKEN

Performance Metric/s
• FLOP: floating point operation

• Y = M*X + C 2 FLOPS
K = A + B + C 2 FLOPS
ALPHA = 1.0 / BETA 1 FLOP

• The cost to execute a FLOP is not fixed (it varies with the CPU architecture)
• Division requires more instructions than a multiplication

• Exponentials, logs and trigonometric functions cost even more

• Performance: FLOPS per second
GigaFLOPS/sec  10^9  FLOPS/sec
TeraFLOPS/sec  10^12 FLOPS/sec
PetaFLOPS/sec  10^15 FLOPS/sec
ExaFLOPS/sec   10^18 FLOPS/sec

• “race to exascale”
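A minimal sketch that puts these ideas together: count the FLOPs in the Y = M*X + C example applied across an array, time it, and report the achieved rate. The array size is arbitrary, and a simple loop like this is memory-bound, so the figure will be far below the processor's theoretical peak.

/* Minimal sketch: FLOP counting and achieved FLOP rate for y = m*x + c.
 * 1 multiply + 1 add = 2 FLOPs per element; rate = total FLOPs / time. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>                    /* omp_get_wtime() */

#define N 50000000L                 /* 50 million elements (arbitrary choice) */

int main(void)
{
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    if (!x || !y) return 1;
    for (long i = 0; i < N; i++) x[i] = (double)i;

    double m = 2.0, c = 1.0;
    double t0 = omp_get_wtime();
    for (long i = 0; i < N; i++)
        y[i] = m * x[i] + c;        /* 2 FLOPs per iteration */
    double t = omp_get_wtime() - t0;

    double flops = 2.0 * (double)N; /* total floating point operations */
    printf("y[1] = %.1f, %.4f s, %.2f GFLOPS/sec\n", y[1], t, flops / t / 1e9);

    free(x); free(y);
    return 0;
}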


Machine | Cost | Memory | Power requirements | FLOPS per second
1948 “Baby” computer, Manchester | - | - | - | 1.1 K
1985 Cray 2 | $16M | - | - | 2 G
2013 ARCHER (Cray XC30), 118K cores (#41 in Top500) | £43M | 64 GB/node | ~2 MW, 641 MFLOPS/W | 1.6 P
2015 iPhone 6S, ARM / Apple A9, 2 cores | £500 | 2 GB | - | 4.9 G
2015 Raspberry Pi 2B, ARMv7, 4 cores | £30 | 1 GB | - | 50 M per core, 200 M per RPi
2013-2015 Tianhe-2 (#1 of Top500), 3.1M cores | - | 1 PB | 17.8 MW | 33.86 P
2015 Shoubu, RIKEN (#1 of Green500 in 2015), 1.2M cores | - | 82 TB | 50.32 kW, 6.7 GFLOPS/W (Green500) | 606 T
2016 Sunway TaihuLight, 10.6M cores (new Chinese chip/interconnect etc) | $270M (inc R&D to design chips etc) | 1.3 PB | 15.4 MW, 6 GFLOPS/W (Green500) | 93 P (LINPACK, Top500) out of peak 125 P
2018 Summit, IBM Power9 CPUs + NVIDIA V100 GPUs, 2.3M cores | - | 250 PB | 9.8 MW, 14.7 GFLOPS/W (#3 in Green500) | 143 P (LINPACK, Top500) out of peak 201 P
2018 Shoubu “B”, RIKEN (#1 in Green500), 0.9M cores | - | - | 60 kW, 17.6 GFLOPS/W (#1 in Green500) | 0.9 P

Images: cs.man.ac.uk, CW, appleapple.top, top500/JD, RIKEN

High Performance
• millions of cores

• ~20 cores on a processor
• couple of processors on a node
• and LOTS of nodes in a “supercomputer”

• #1 supercomputer: peak 200 PFLOPS/sec & 250 PByte of RAM
• 1 PetaFLOP = the number of floating point operations (maths) that would take
everyone in Manchester, doing one calculation per second, a total of 62 years
to complete
(roughly: 10^15 ops / (about 5x10^5 people * ~3.15x10^7 seconds per year) ≈ 62 years)

• 1 PF/sec is doing all that maths BUT in a single second!
• 1 PetaByte of RAM memory: a stack of DVDs about 250m high

based upon https://kb.iu.edu/d/apeq

• Needs as much electrical power as a small town
KEY QUESTION: how would you programme it efficiently?

Units of Measurement

• Top500
• https://www.top500.org

• (bi)annual list of…

• June 2018 list
• 3 of top 10 have NVIDIA “Volta” GPUs

• Accelerators present in most of the rest

• UK entries:

“race to exascale”?

HPC Performance over the Years

• “Moore’s Law”: doubling every
two years of
#transistors | clock speeds |
processor performance

• The #500 machine on this year’s list would have
been #1 roughly 10 years ago

• The 2015 iPhone 6 would have
made the 1997 Top500 list

[Plot: “Top500” performance development over the years, showing the SUM, #1 and #500 trend lines]

THE GOOD

• Lots of data

• Can use web page to plot historical
trends
• Rise of accelerators

• Which countries?

• etc

• Easy to compare

• Drives competition => drives
innovation

THE NOT SO GOOD

• What machines does it cover?
• (or not cover…)

• How is the performance measured?
• LINPACK

• What is LINPACK?

• Is it “typical”?


Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane