COMP528: Multi-core and
Multi-Processor Programming
3 – HAL: terminology & Top500
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
Contact
• MS Teams
• channels for:
• general announcements & items of general interest
• lab sessions
• you can also chat direct to me
• Email: m.k. .uk
Aims
• To provide students with a deep, critical and systematic understanding of key
issues and effective solutions for parallel programming for systems with multi-
core processors and parallel architectures
• To develop students' appreciation of a variety of approaches to parallel
programming, including using MPI and OpenMP
• To develop students' skills in parallel programming, in particular using MPI
and OpenMP
• To develop students' skills in the parallelization of ideas, algorithms and
existing serial code.
Recap Slide
• Strategy:
– Limit CPU speed and sophistication
– Put multiple CPUs (“cores”) on a single chip in a socket
– Several (2 or 4) sockets on a node (cf motherboard)
– Connect 10s or 100s or 1000s of nodes…
• Potential performance of a cluster:
CPU freq * #ops per cycle * #cores per CPU * #CPUs per node * #nodes
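As a concrete illustration, this formula can be evaluated directly. A minimal C sketch, where every hardware number is an assumption chosen purely for illustration (not the spec of any particular machine):

    #include <stdio.h>

    int main(void) {
        /* all figures below are illustrative assumptions, not a real machine */
        double freq_ghz      = 2.5;   /* clock frequency in GHz               */
        int    ops_per_cycle = 16;    /* e.g. a wide FMA vector unit          */
        int    cores_per_cpu = 20;
        int    cpus_per_node = 2;
        int    nodes         = 100;

        /* GHz x ops/cycle x cores x sockets x nodes => peak in GFLOP/s */
        double peak_gflops = freq_ghz * ops_per_cycle * cores_per_cpu
                             * cpus_per_node * nodes;

        printf("theoretical peak: %.0f GFLOP/s = %.1f TFLOP/s\n",
               peak_gflops, peak_gflops / 1e3);
        return 0;
    }

Note this is a theoretical peak; real codes fall short of it due to memory access, communication and serial sections.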
Methodology
• Study problem, sequential program, or code segment
• Look for opportunities for parallelism
• Usually best to start by thinking abstractly about the problem to be solved,
not about any current program implementation of a given solution
• Try to keep all cores of all processors busy doing useful elements of the
required work
• Processors (cores) may be placed locally (multicore processors) or connected
by local/global networks => a huge variety of approaches/methods
(after Intel Software College)
Terminology
• Processor
– a die on a socket
– may comprise several cores (where instructions are executed)
• Node
– same as "motherboard"
– comprises one or more sockets, i.e. one or more processors
• Cluster
– comprises several nodes
– nodes connected by an interconnect
• CPU?
– sometimes used by itself to mean 'core'
– sometimes people say "CPU core" (meaning core)
– generally (if used by itself) used to mean 'processor'
A Core of a Processor can be very complex
• decades of development
• vendor competition (Intel, AMD, IBM and now ARM)
• a modern, general purpose core:
– supports “out of order” execution & “speculative execution”
– has a vector unit
– works with many levels of memory (L1, L2, L3 cache)
– may support “hyper threading”
Hyper Threading on a Core
• Many modern architectures allow "hyper-threading" via hardware threads
– for each physical core, there is a number of "hyper-threads" (hardware threads)
• typically 2 HT per core for Intel and for AMD
• e.g. 4 or 8 HT per core for IBM Power9
• The operating system sees a larger number of "virtual CPUs"
– can be useful for throughput on some workloads
(as long as the physical resources are not over-stretched)
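A hedged sketch of how a program can see these "virtual CPUs". Both calls below are standard (POSIX sysconf and the OpenMP runtime); on an HT-enabled node they typically report physical cores × hardware threads per core:

    #include <stdio.h>
    #include <unistd.h>   /* sysconf(_SC_NPROCESSORS_ONLN), POSIX */
    #include <omp.h>      /* omp_get_num_procs() */

    int main(void) {
        /* logical ("virtual") CPUs the OS presents, including hyper-threads */
        printf("logical CPUs (sysconf): %ld\n",
               sysconf(_SC_NPROCESSORS_ONLN));
        printf("logical CPUs (OpenMP):  %d\n", omp_get_num_procs());
        return 0;
    }

(Compile with e.g. gcc -fopenmp.)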
[Diagram: a single-core processor (one core plus cache) versus a dual-core processor (two cores sharing a cache), each connected to the rest of the system.]
Putting multiple cores on a chip is the simplest way of enabling a chip to run multiple threads.
Single-core processor with two hardware threads
[Diagram: the core holds the state of two hardware threads (Thread 1, Thread 2), which share the core's cache and its connection to the rest of the system.]
• Variations on the design:
– the core could execute instructions from one thread for n cycles and then
switch to another thread for the next n cycles ("coarse-grained" multithreading)
– the core could alternate every cycle between fetching an instruction from
one thread and from another thread ("fine-grained" multithreading)
– the core could simultaneously fetch an instruction from each of multiple
threads every cycle (simultaneous multithreading, SMT)
– the switching can depend on long-latency events, e.g. a cache miss
("switch-on-event" multithreading)
• LIMITATIONS?
– if there is only one physical ALU in the core, will 2 threads saturate it?
– if the code is memory-intensive, will 2 threads saturate the cache?
Multi-core processors
• It is common to turn HT off for HPC
– e.g. for Intel Xeon chips (as per UoL "Barkla") it is advantageous to turn
off hyper-threading
• but for Intel Xeon Phi chips
– different overall architecture design
– best to try 1, 2, 3 or 4 software threads per physical core
• and IBM chips are also usually okay with some HT
(Aside: whilst we have some Xeon Phi nodes at Liverpool, Intel have since
discontinued this product line.)
For this course…
• We presume NO hyper-threading
• We will use Intel Skylake processors, with HT turned off in the BIOS
• Clock speed is not fixed
• Aside: to balance power consumption with speed, the clock speed slowly
drops when the chip is idle; when there is work to do, it quickly ramps
back up.
• You should therefore take several timings (of a non-trivial run [>5 secs])
and use the best; see the sketch below.
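A minimal sketch of this advice, using omp_get_wtime() as a portable wall-clock timer; do_work() is a hypothetical stand-in for whatever is actually being measured:

    #include <stdio.h>
    #include <omp.h>   /* omp_get_wtime(): portable wall-clock timer */

    /* hypothetical stand-in for the code being measured; a real
       measurement should run for more than ~5 seconds */
    void do_work(void) {
        volatile double s = 0.0;
        for (long i = 0; i < 500000000L; i++) s += 1.0e-9;
    }

    int main(void) {
        double best = 1.0e30;
        for (int rep = 0; rep < 5; rep++) {    /* several timings...  */
            double t0 = omp_get_wtime();
            do_work();
            double t = omp_get_wtime() - t0;
            printf("run %d: %.3f s\n", rep, t);
            if (t < best) best = t;            /* ...keep the best    */
        }
        printf("best of 5: %.3f s\n", best);
        return 0;
    }

The first run is often the slowest, while the clock ramps up and the caches warm; hence best-of-several rather than a single measurement.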
More on threads…
• A system is usually capable of running many more software threads than
there are hardware (hyper-)threads
– many of the threads will be inactive
– a context switch moves a thread on or off a hardware thread
• In HPC systems, a reduced kernel frequently runs on the compute nodes
– removes any un-required functionality
=> fewer threads
=> less context switching away from real work to OS bureaucracy
Memory – very important wrt performance
• Memory is hierarchical:
• registers
• L1 cache
• L2, L3 cache
• LLC
• RAM
• swapped to disk
• LLC = last-level cache before the RAM
• will be L3 if it exists
• Implementation options
• L1 per core
• RAM is shared by the node
• Options…
• share L2 per processor, but with L1 private per core?
• share L2 per pair of cores
(e.g. a 6-core chip has 3 sets of L2 caches)
• Is it good to have lots of threads accessing the same memory?
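One way to see why the hierarchy matters for performance: the same amount of arithmetic runs at very different speeds depending on how memory is accessed. An illustrative sketch (not from the slides), comparing stride-1 access with a strided pattern that uses only one double per 64-byte cache line on each pass:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>         /* omp_get_wtime() */

    #define N (1L << 24)     /* 16M doubles = 128 MB, larger than the LLC */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        double t0 = omp_get_wtime();
        for (long i = 0; i < N; i++)          /* stride-1: every byte of  */
            sum += a[i];                      /* each cache line is used  */
        double t1 = omp_get_wtime();
        for (long j = 0; j < 8; j++)          /* stride-8: each pass uses */
            for (long i = j; i < N; i += 8)   /* one double per line, so  */
                sum += a[i];                  /* the array is re-fetched  */
        double t2 = omp_get_wtime();          /* 8 times from RAM         */

        printf("stride-1: %.3f s   stride-8 (8 passes): %.3f s  (sum=%g)\n",
               t1 - t0, t2 - t1, sum);
        free(a);
        return 0;
    }

Both versions perform the same number of additions, but the strided version moves roughly 8x more data from RAM and is correspondingly slower.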
HPC Machines
Potential performance:
core freq * #ops per cycle * #cores per processor * #processors per node * #nodes
Machine | Cost | Memory | Power requirements | Performance (FLOP/s)
1948 "Baby" computer, Manchester | | | | 1.1 K
1985 Cray 2 | $16M | | | 2 G
2013 ARCHER (Cray XC30), 118K cores (#41 in Top500) | £43M | 64 GB/node | ~2 MW; 641 MFLOPS/W | 1.6 P
2015 iPhone 6S, ARM / Apple A9, 2 cores | £500 | 2 GB | | 4.9 G
2015 Raspberry Pi 2B, ARMv7, 4 cores | £30 | 1 GB | | 50 M per core; 200 M per RPi
2013-2015 Tianhe-2 (#1 of Top500), 3.1M cores | | 1 PB | 17.8 MW | 33.86 P
2015 Shoubu, RIKEN (#1 of Green500 in 2015), 1.2M cores | | 82 TB | 50.32 kW; 6.7 GFLOPS/W (Green500) | 606 T
2016 Sunway TaihuLight, 10.6M cores (new Chinese chip/interconnect etc) | $270M (inc R&D to design chips etc) | 1.3 PB | 15.4 MW; 6 GFLOPS/W (Green500) | 93 P (LINPACK, Top500) out of peak 125 P
2018 Summit, IBM Power9 CPUs + NVIDIA V100 GPUs, 2.3M cores | | 250 PB | 9.8 MW; 14.7 GFLOPS/W (#3 in Green500) | 143 P (LINPACK, Top500) out of peak 201 P
2018 Shoubu "B", RIKEN, 0.9M cores | | | 60 kW; 17.6 GFLOPS/W (#1 in Green500) | 0.9 P

Images: cs.man.ac.uk, CW, appleapple.top, top500/JD, RIKEN
Performance Metric/s
• FLOP: floating point operation
• Y = M*X + C → 2 FLOPs
K = A + B + C → 2 FLOPs
ALPHA = 1.0 / BETA → 1 FLOP
• The cost to execute a FLOP is not fixed (it varies with the CPU architecture)
• a division requires more instructions than a multiplication
• exponentials, logs and trigonometric functions cost more still
• Performance: FLOPs per second
GigaFLOP/s = 10^9 FLOP/s
TeraFLOP/s = 10^12 FLOP/s
PetaFLOP/s = 10^15 FLOP/s
ExaFLOP/s = 10^18 FLOP/s
• "race to exascale"
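These definitions can be checked empirically: count the FLOPs a loop performs and divide by the elapsed time. A sketch for the Y = M*X + C example above (array size chosen arbitrarily):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>              /* omp_get_wtime() */

    #define N 20000000L           /* 2e7 elements, 2 FLOPs each */

    int main(void) {
        double *x = malloc(N * sizeof(double));
        double *y = malloc(N * sizeof(double));
        double m = 2.0, c = 1.0;
        for (long i = 0; i < N; i++) x[i] = (double)i;

        double t0 = omp_get_wtime();
        for (long i = 0; i < N; i++)
            y[i] = m * x[i] + c;  /* 1 multiply + 1 add = 2 FLOPs */
        double t = omp_get_wtime() - t0;

        printf("%.4f s -> %.2f GFLOP/s (y[1]=%g)\n",
               t, (2.0 * N) / t / 1e9, y[1]);
        free(x); free(y);
        return 0;
    }

A loop like this is memory-bound, so the achieved GFLOP/s will sit far below the processor's theoretical peak; that gap is one reason benchmark choice (e.g. LINPACK, discussed later) matters.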
High Performance
• millions of cores
• ~20 cores on a processor
• a couple of processors on a node
• and LOTS of nodes in a “supercomputer”
• #1 supercomputer (Summit): peak ~200 PFLOP/s & 250 PByte of storage
• 1 PetaFLOP = the number of floating point operations (maths) that would
take the whole population of Manchester (roughly half a million people),
each doing one calculation per second, about 62 years to complete
(10^15 ops ÷ ~5×10^5 people ÷ ~3.2×10^7 seconds/year ≈ 62 years)
• a 1 PFLOP/s machine does all that maths in a single second!
• 1 PetaByte of RAM: a stack of DVDs about 250m high
based upon https://kb.iu.edu/d/apeq
• Needs as much electrical power as a small town
KEY QUESTION: how would you programme it efficiently?
• Top500
• https://www.top500.org
• twice-yearly (June & November) list of the world's 500 fastest supercomputers
• June2018 list
• 3 of top 10 have NVIDIA “Volta” GPUs
• Accelerators present in most of the rest
• UK entries: …
• "race to exascale"?
HPC Performance over the Years
• “Moore’s Law”: doubling every
two years of
#transistors | clock speeds |
processor performance
• #500 on this year's list would have
been #1 roughly 10 years ago
• The 2015 iPhone 6 would have
made the 1997 Top500 list
[Chart: Top500 performance development over time, with trend lines for SUM (all 500 systems), #1 and #500.]
“Top500”
THE GOOD
• Lots of data
• Can use web page to plot historical
trends
• Rise of accelerators
• Which countries?
• etc
• Easy to compare
• Drives competition => drives
innovation
THE NOT SO GOOD
• What machines does it cover?
• (or not cover…)
• How is the performance measured?
• LINPACK
• What is LINPACK?
• Is it “typical”?
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane