Multicore and Multithreaded Processors
HPC ARCHITECTURES
Advanced CPUs
Multicore and multithreaded processors
a. .ac.uk
@adrianjhpc
Memory access
2
Latency hiding with multiple threads
• A processor may frequently stall while a memory
access is in progress
• Better use of the processor may be made by
running another thread in the “gap”
• latency hiding
• Cannot be done with standard OS multitasking
• the cost of an OS context switch is ~1000s of cycles
• far longer than a main memory access
3
Conventional multithreading
• With hardware support, a thread switch can be done in a single clock cycle
• may need to have multiple register files, one for each thread
• Can simply round-robin threads on consecutive cycles, or switch when a
thread stalls on a load.
• Extreme example is the Cray XMT
• 128 threads per processor
• no data caches
• typical applications require 10-20 threads per processor to hide memory latencies
• Also used in UltraSPARC T3
• 16 cores, 8 threads per core
• BG/Q A2 processor
• 16 cores, 4 threads per core
• Xeon Phi Knights Landing
• 64 cores, 4 threads per core
• Marvell ThunderX2
• 32 cores, 4 threads per core
4
Empty instruction slots
• Most modern processors are superscalar
• can issue several instructions in every clock cycle
• selection and scheduling of instructions is done on-the-fly, in
hardware
• A typical processor can issue 4 or 5 instructions per clock,
going to different functional units
• obviously, there must be no dependencies between instructions issued in the same cycle
• However, typical applications don’t have this much
instruction level parallelism (ILP)
• 1.5 or 2 is normal
• more than half the available instruction slots are empty
5
SMT
• Simultaneous multithreading (SMT) (a.k.a.
Hyperthreading) tries to fill these spare slots by mixing
instructions from more than one thread in the same
clock cycle.
• Requires some replication of hardware
• instruction pointer, instruction TLB, register rename logic, etc.
• Intel Xeon only requires about 5% extra chip area to support
SMT
• …but everything else is shared between threads
• functional units, register file, memory system (including caches)
• sharing of caches means there is no coherency problem
• For most architectures, two or four threads is all that
makes sense
6
SMT example
[Figure: instruction issue slots over time, comparing two threads running on two separate CPUs with the same two threads interleaved on one SMT CPU]
7
More on SMT
• How successful is SMT?
• depends on the application, and how the 2 threads contend for the
shared resources.
• In practice, gains seem to be limited to around 1.2 to 1.3
times speedup over a single thread.
• benefits will be limited if both threads are using the same functional
units (e.g. FPUs) intensively.
• For memory-intensive code, SMT can cause a slowdown
• caches are not thread-aware
• when two threads share the same caches, each causes evictions of data belonging to the other
8
Hyper-threading example performance
• XC30
• Sandy Bridge (8 cores)
[Figure: Hyper-Threading speedup for NAMD, VASP, GTC, NWChem and Quantum Espresso]
Effects of Hyper-Threading on the NERSC workload on Edison
http://www.nersc.gov/assets/CUG13HTpaper.pdf
9
10
Current hyperthreading
• Xeon Phi processors have 4-way hyperthreading
• Knights Corner (KNC) required 2 threads per core to achieve full instruction issue rate
• Knights Landing (KNL) can achieve full issue rate from a single thread, but can run up to 4
• running 3 threads damages performance on KNL
• Marvell ThunderX2 has 4-way hyperthreading
• 2-way threading is generally the most performant (if threading helps at all)
https://www.anandtech.com/show/12694/assessing-cavium-thunderx2-arm-server-reality/8
11
Thread priorities
• The PowerPC SMT implementation includes the notion
of hardware scheduling priorities.
• alters the proportion of instructions scheduled from the two
threads, so that one thread is favoured over the other.
• no longer get a 50-50 mix.
• Can be done automatically by the OS, e.g. when one
thread is spinning on a lock.
• lower the priority of the spinning thread.
• reduces the severity of priority inversion.
• Not accessible at user level
• not affected by software scheduling priority levels.
12
SIMD: Single instruction multiple data
• Same operation on multiple data items
• Wide registers
• SIMD is needed to approach peak FLOP performance, but your code must be capable of vectorisation
[Figure: a serial add instruction operates on one pair of 64-bit operands per loop iteration, while a single 256-bit SIMD instruction performs four 64-bit adds at once]