
Multicore and Multithreaded Processors

HPC ARCHITECTURES
Advanced CPUs
Multicore and multithreaded processors
a. .ac.uk
@adrianjhpc

Memory access

2

Latency hiding with multiple threads

• A processor may frequently stall while a memory
access is in progress

• Better use of the processor may be made by
running another thread in the “gap”

• latency hiding

• Cannot be done with standard multitasking

• cost for a context switch by the OS is ~1000s of cycles

• longer than a main memory access

3

Conventional multithreading
• With hardware support, a thread switch can be done in a single clock cycle

• may need to have multiple register files, one for each thread

• Can simply round-robin threads on consecutive cycles, or switch when a
thread stalls on a load.

• Extreme example is the Cray XMT
• 128 threads per processor

• no data caches

• typical applications require 10-20 threads per processor to hide memory latencies

• Also used in UltraSparc T3
• 16 cores, 8 threads per core

• BG/Q A2 processor
• 16 cores, 4 threads per core

• Xeon Phi Knights Landing
• 64 cores, 4 threads per core

• Marvell ThunderX2
• 32 cores, 4 threads per core

4

Empty instruction slots

• Most modern processors are superscalar
• can issue several instructions in every clock cycle
• selection and scheduling of instructions is done on-the-fly, in hardware

• A typical processor can issue 4 or 5 instructions per clock,
going to different functional units
• obviously, there must be no dependencies between instructions issued in the same cycle

• However, typical applications don’t have this much
instruction level parallelism (ILP)
• 1.5 or 2 is normal
• more than half the available instruction slots are empty
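The effect of a dependence chain on ILP can be seen in a simple reduction: with one accumulator, every add must wait for the previous one, while splitting the sum across several accumulators gives the superscalar scheduler independent instructions it can issue in the same cycle. A sketch (four accumulators is an illustrative choice, not from the slides):

```c
#include <stddef.h>

/* Single accumulator: every add depends on the previous one, so the
 * loop runs at roughly one add per floating-point add *latency*. */
double sum_serial(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: the adds within one iteration have no
 * dependence on each other, so an out-of-order superscalar core can
 * keep several of them in flight at once. */
double sum_ilp4(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* scalar remainder */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that floating-point addition is not associative, so the two versions may differ in the last bits for general data; compilers only apply this transformation under flags like -ffast-math.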

5

SMT

• Simultaneous multithreading (SMT) (a.k.a.
Hyperthreading) tries to fill these spare slots by mixing
instructions from more than one thread in the same
clock cycle.

• Requires some replication of hardware
• instruction pointer, instruction TLB, register rename logic, etc.
• Intel Xeon only requires about 5% extra chip area to support SMT

• …but everything else is shared between threads
• functional units, register file, memory system (including caches)
• sharing of caches means there is no coherency problem

• For most architectures, two or four threads is all that
makes sense

6

SMT example

[Diagram: instruction issue slots over time, comparing two threads on two separate CPUs with two threads sharing one SMT CPU]

7

More on SMT

• How successful is SMT?
• depends on the application, and how the 2 threads contend for the shared resources.

• In practice, gains seem to be limited to around 1.2 to 1.3 times speedup over a single thread.
• benefits will be limited if both threads are using the same functional units (e.g. FPUs) intensively.

• For memory intensive code, SMT can cause a slowdown
• caches are not thread-aware
• when two threads share the same caches, each will cause evictions of data belonging to the other thread.

8

Hyper-threading example performance

• XC30

• Sandy Bridge (8 cores)

Effects of Hyper-Threading on the NERSC workload on Edison

http://www.nersc.gov/assets/CUG13HTpaper.pdf

[Charts: hyper-threading performance results for the NERSC applications NAMD, VASP, GTC, NWChem and Quantum Espresso]

9

10

Current hyperthreading

• Xeon Phi processors have 4-way hyperthreading

• Knights Corner (KNC) required 2-way hyperthreading to get full instruction issue performance

• Knights Landing (KNL) can get full performance on a single thread, but can run up to 4

• running 3 threads damages performance on KNL

• Marvell ThunderX2 has 4-way hyperthreading

• 2-way threading is generally the most performant (if threading helps at all)

https://www.anandtech.com/show/12694/assessing-cavium-thunderx2-arm-server-reality/8

11

Thread priorities

• The PowerPC SMT implementation includes the notion
of hardware scheduling priorities.

• alters the proportion of instructions scheduled from the two
threads, so that one thread is favoured over the other.

• no longer get a 50-50 mix.

• Can be done automatically by the OS, e.g. when one
thread is spinning on a lock.

• lower the priority of the spinning thread.

• reduces the severity of priority inversion.

• Not accessible at user level

• not affected by software scheduling priority levels.

12

SIMD: Single instruction multiple data

• Same operation on multiple data items

• Wide registers

• SIMD needed to approach FLOP peak performance, but
your code must be capable of vectorisation

[Diagram: a serial instruction adds one pair of 64-bit operands per instruction; a single 256-bit SIMD instruction performs four 64-bit additions at once, replacing four iterations of a vectorisable for loop]