Multicore and Multithreaded Processors
HPC ARCHITECTURES
Advanced CPUs
Multicore and multithreaded processors
a. .ac.uk
@adrianjhpc
Memory access
2
Latency hiding with multiple threads
• A processor may frequently stall while a memory
access is in progress
• Better use of the processor may be made by
running another thread in the “gap”
• latency hiding
• Cannot be done with standard OS multitasking
• the cost of an OS context switch is ~1000s of cycles
• far longer than a main memory access
3
Conventional multithreading
• With hardware support, a thread switch can be done in a single clock cycle
• may need to have multiple register files, one for each thread
• Can simply round-robin threads on consecutive cycles, or switch when a
thread stalls on a load.
• Extreme example is the Cray XMT
• 128 threads per processor
• no data caches
• typical applications require 10-20 threads per processor to hide memory latencies
• Also used in UltraSPARC T3
• 16 cores, 8 threads per core
• BG/Q A2 processor
• 16 cores, 4 threads per core
• Xeon Phi Knights Landing
• 64 cores, 4 threads per core
• Marvell ThunderX2
• 32 cores, 4 threads per core
4
Empty instruction slots
• Most modern processors are superscalar
• can issue several instructions in every clock cycle
• selection and scheduling of instructions is done on-the-fly, in
hardware
• A typical processor can issue 4 or 5 instructions per clock,
going to different functional units
• obviously, there must be no dependencies between instructions issued in the same cycle
• However, typical applications don’t have this much
instruction level parallelism (ILP)
• 1.5 or 2 is normal
• more than half the available instruction slots are empty
5
SMT
• Simultaneous multithreading (SMT) (a.k.a.
Hyperthreading) tries to fill these spare slots by mixing
instructions from more than one thread in the same
clock cycle.
• Requires some replication of hardware
• instruction pointer, instruction TLB, register rename logic, etc.
• Intel Xeon only requires about 5% extra chip area to support
SMT
• …but everything else is shared between threads
• functional units, register file, memory system (including caches)
• sharing of caches means there is no coherency problem
• For most architectures, two or four threads is all that
makes sense
6
SMT example
[Figure: instruction issue slots over time, comparing two threads running on two separate CPUs with the same two threads interleaved on one SMT CPU]
7
More on SMT
• How successful is SMT?
• depends on the application, and how the 2 threads contend for the
shared resources.
• In practice, gains seem to be limited to around 1.2 to 1.3
times speedup over a single thread.
• benefits will be limited if both threads are using the same functional
units (e.g. FPUs) intensively.
• For memory-intensive code, SMT can cause a slowdown
• caches are not thread-aware
• when two threads share the same caches, each causes evictions of data belonging to the other
8
Hyper-threading example performance
• XC30
• Sandy Bridge (8 cores)
[Figure: Hyper-Threading speedup for NAMD, VASP, GTC, NWChem and Quantum Espresso]
Effects of Hyper-Threading on the NERSC workload on Edison
http://www.nersc.gov/assets/CUG13HTpaper.pdf
9
10
Current hyperthreading
• Xeon Phi processors have 4-way hyperthreading
• Knights Corner (KNC) required 2 threads per core to achieve full instruction issue rate
• Knights Landing (KNL) can achieve full issue rate from a single thread, but can run up to 4
• running 3 threads damages performance on KNL
• Marvell ThunderX2 has 4-way hyperthreading
• 2-way threading is generally the most performant (if threading helps at all)
https://www.anandtech.com/show/12694/assessing-cavium-thunderx2-arm-server-reality/8
11
Thread priorities
• The PowerPC SMT implementation includes the notion
of hardware scheduling priorities.
• alters the proportion of instructions scheduled from the two
threads, so that one thread is favoured over the other.
• no longer get a 50-50 mix.
• Can be done automatically by the OS, e.g. when one
thread is spinning on a lock.
• lower the priority of the spinning thread.
• reduces the severity of priority inversion.
• Not accessible at user level
• not affected by software scheduling priority levels.
12
SIMD: Single instruction multiple data
• Same operation on multiple data items
• Wide registers
• SIMD is needed to approach peak FLOP performance, but your code must be capable of vectorisation
[Figure: a serial add instruction operates on one pair of 64-bit operands per loop iteration, while a single 256-bit SIMD instruction performs four 64-bit adds at once]