KIT308/408 (Advanced) Multicore Architecture and Programming
Advanced CPU Architecture
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1
Introduce the concept of pipelining
Understand why it is a primary feature of high-performance architecture design
Introduce basic concepts of superscalar architectures
See drawbacks of these approaches and techniques to reduce their effects
These drawbacks are part of the reason that multicores are now standard
Introduce multicore architectures
And briefly look at some examples
2
Purpose of Lecture
1. https://i.kinja-img.com/gawker-media/image/upload/s–fApvDkM—/c_scale,fl_progressive,q_80,w_800/jilsosqjfmhfztga3yyo.jpg
Pipelined CPU Architecture
3
CPUs as we’ve discussed them so far process instructions one at a time
Do all the steps needed for one instruction before doing anything for the next
This isn’t optimal, and pipelined architectures instead try to process many instructions at once
The idea is similar to some real-world process improvements, e.g.
Assembly lines
Each step of the assembly line does one part of the construction of the car
Lots of cars in lots of different states of completion
McDonald’s drive-through
1. https://static.businessinsider.com/image/5285499beab8ea0a44f92334/image.jpg?_ga=2.88151323.1693129343.1520503568-611935390.1520503568
4
Pipelining
The full instruction cycle is complex
Lends itself to production of a pipeline with many stages (e.g. a six-stage pipeline)
Fetch instruction (FI)
Read next instruction into buffer
Decode instruction (DI)
Determine opcode and operands
Calculate operands (CO)
Work out operand addresses
Fetch operands (FO)
Fetch all operands from memory
Execute instruction (EI)
Perform operation and store result
Write operand (WO)
Store the result in memory
5
Splitting the Instruction Cycle into Stages
Each instruction progresses through the pipeline
Every instruction moves through all the stages
This six-stage pipeline processes six instructions simultaneously
Best case view
As we shall see in a moment, pipelines don’t run quite this smoothly
6
Pipeline Operation
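To make the best case concrete, here is a timing diagram for the six-stage pipeline above: each instruction advances one stage per cycle, and once the pipeline is full one instruction completes every cycle.

Cycle:  1   2   3   4   5   6   7   8   9
I1:     FI  DI  CO  FO  EI  WO
I2:         FI  DI  CO  FO  EI  WO
I3:             FI  DI  CO  FO  EI  WO
I4:                 FI  DI  CO  FO  EI  WO

Executed strictly one at a time, these four instructions would take 24 cycles; pipelined, they finish in 9.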
There are many things that can cause delays in our pipeline
Pipeline stalls
Pipeline wastage
Branches
Memory conflicts
Dependencies
These problems cause stages in the pipeline to perform no work for some number of cycles
Generally referred to as bubbles in the pipeline
1. https://kingmanlennox.com/wp-content/uploads/2016/02/Pipeline_Knotted_2.jpg
7
Problems With Pipelining
Stalls
Some instructions may spend longer in a stage than others
Particularly in the operand fetch stage
This causes delays for the instructions behind
Instructions in front can continue as normal
Wastage
Many instructions simply have to wait through stages that they do not require
Waste of resources
8
Pipelining Problems: Stalls and Wastage
Branches are the biggest problem for pipelined architectures
Here, instruction 3 is a conditional branch
That turned out to be taken
ie. whatever condition is needed to branch was satisfied
Not possible to work this out until the last pipeline stage
Instructions 4..7 have all entered the pipeline behind it
These instructions shouldn’t actually be executed
They are flushed from the pipeline
Wasted processing time
9
Pipelining Problems: Branches
Lots of historical approaches have tried to deal with this problem, but only two are worth discussing
Both perform speculative execution of instructions
(Educated) guesses are made on what is the correct set of instructions to execute
But can be wrong
Facilities needed to undo wrong guesses
Execute both branch possibilities
Requires duplicate pipeline stages for basically every step
We’ll come back to this in superscalar
Branch prediction
Takes an educated guess as to which way the branch will go, and executes that stream
1. https://images.pexels.com/photos/421876/pexels-photo-421876.jpeg?w=940&h=650&auto=compress&cs=tinysrgb
10
Dealing With Branches
Static prediction
Branch never (predict never taken)
Branch always (predict always taken)
Opcode based
e.g. perhaps BNE is predicted always taken, BEQ never taken
Compiler hints
Instruction format has a flag saying whether or not the branch should be taken (more information is known at compile-time than run-time)
Relatively simple to implement
Up to 75% success rate with this strategy
Dynamic prediction
Single bit
Taken / not taken switch
Simply remembers last way the branch went
Two mis-predictions for a simple loop, one on entering and one on exit
Two bits
More sophisticated approach
Only one mis-prediction (on exit) for a simple loop
More bits
More complicated
Extra information needs to be stored for each branch instruction
Can provide the highest level of success (90%+)
11
Branch Prediction
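As a rough illustration of the two-bit scheme, here is a minimal C++ sketch (the table size and simple modulo indexing are illustrative assumptions, not how any particular CPU does it):

#include <array>
#include <cstdint>

// Each entry is a saturating counter: 0-1 predict "not taken", 2-3 predict "taken",
// so two consecutive mispredictions are needed before the prediction flips.
struct TwoBitPredictor {
    std::array<uint8_t, 16384> table{};   // 16K entries, initially "strongly not taken"

    bool predict(uint32_t pc) const {
        return table[pc % table.size()] >= 2;
    }

    void update(uint32_t pc, bool taken) {
        uint8_t& counter = table[pc % table.size()];
        if (taken && counter < 3) ++counter;     // saturate at "strongly taken"
        if (!taken && counter > 0) --counter;    // saturate at "strongly not taken"
    }
};

For a simple loop branch the counter quickly saturates at “strongly taken”, so only the final (exit) iteration mispredicts.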
These occur when two stages both need to access memory simultaneously
ie. it is not possible for the fetch instruction, fetch operands, and write out stages all to access main store at the same time
Most modern processors have both an instruction cache and a data cache to reduce these conflicts
If operands are in registers and all “main store” operands are actually in the two caches, then the pipeline doesn’t stall
12
Pipelining Problems: Resource Conflicts
[Figure: block diagram of a pipelined CPU — the control unit is split into fetch, decode, address calculation, operand fetch, execution, and operand write stages, with a separate L1 instruction cache and L1 data cache, plus registers, status registers, the ALU, MMU, TLB, and L2 cache]
A (true data) dependency arises when an instruction relies on the result of a previous instruction
This is a very common occurrence
Operand fetching cannot take place until the first instruction has fully completed
Dependencies mean that the latter instruction is unable to execute
For example:
R1 ← R2 + R3
R3 ← R1
For example:
R1 ← mem[1234]
R3 ← R1
A clever compiler (or programmer) can reorder (other) instructions so that dependencies do not affect the pipeline
R1 ← mem[1234]
R2 ← R1
R5 ← R3 + R4
R3 ← R5
Reordered
R1 ← mem[1234]
R5 ← R3 + R4
R2 ← R1
R3 ← R5
13
Pipelining Problems: Dependencies
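The same reordering idea at the source level, as a minimal C++ sketch (the names and the slow_load helper are hypothetical stand-ins for the register example above):

// slow_load stands in for a load from main memory, e.g. mem[1234]
extern int slow_load(int address);

int naive(int r3, int r4) {
    int r1 = slow_load(1234);   // slow memory access
    int r2 = r1;                // true dependency: stalls until the load completes
    int r5 = r3 + r4;           // independent work, stuck behind the stall
    return r2 + r5;
}

int reordered(int r3, int r4) {
    int r1 = slow_load(1234);   // the slow load starts...
    int r5 = r3 + r4;           // ...and independent work overlaps it
    int r2 = r1;                // the dependent use is now further from the load
    return r2 + r5;
}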
The value in R1 can’t be stored in R3 until the first operation is complete
As memory accesses are very slow, R1 will take even longer to get its value
In each case the dependent instruction sits waiting for the previous instruction to finish
Reordering creates a bigger gap between dependent instructions, so less time is wasted waiting
14
How Long Should a Pipeline Be?
Direct result of CPU speed optimisations
A complex interaction between
Branch prediction (speculative execution)
Cache
(and Meltdown needs MMU features too)
Basically, when you know what is in the cache, you can set up code to be speculatively run (but ultimately discarded) that reads from memory it isn’t supposed to access
Then you need to do something with that memory that can be detected (by you) before the CPU discards the result
Use the speculative value to possibly read from non-cached memory and see how long the code takes to run
x = some calculation                  // we know this will be false, but the CPU doesn't
if (x is true)                        // trick the CPU into speculatively executing this body
    y = mem[somewhere]                // read from memory we aren't supposed to be able to access
    mask = 1 << (bit we care about)   // make a mask of the bit of the value we want to read
    if ((y & mask) is true)           // if that bit is a 1...
        z = mem[somewhere else]       // ...read from somewhere else in memory (that isn't L1 cached)
Do the above a bunch of times, and time how long it takes
It’ll take slightly longer if the bit we care about is a 1
1. https://en.wikipedia.org/wiki/File:Spectre_with_text.svg
15
Aside: Spectre
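The cache-timing step can be sketched in C++ as below (x86 with GCC/Clang; illustrative only, and deliberately omitting the speculative out-of-bounds read that a real attack needs):

#include <cstdint>
#include <x86intrin.h>   // __rdtsc

// Time a single read: fast if *p is already cached, noticeably slower if not.
uint64_t time_read(const volatile uint8_t* p) {
    uint64_t start = __rdtsc();   // read the CPU's timestamp counter
    (void)*p;                     // the access being timed
    return __rdtsc() - start;
}

Repeating this over many trials and comparing the timings reveals whether the discarded speculative code touched (and so cached) the probe location, leaking one bit at a time.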
Superscalar CPU Architecture
16
So what can be done when it’s not viable to lengthen the pipeline any further?
No more sensible ways you can break up the steps
Too many branches enter the pipeline
Too many dependencies / resource conflicts in the code
The superscalar machine has multiple pipelines
These allow the simultaneous execution of multiple independent instructions
This requires multiple ALUs
IEUs, FPUs, etc.
17
Superscalar Architectures
CPUs contain specialised ALUs optimised to work on differing kinds of instructions
Integer execution unit (IEU)
Floating-point Unit (FPU)
SIMD / Vector Unit
Such specialised ALUs were added originally to simple or pipelined architectures
Multiple different ALUs is a crucial element in the design of superscalar architectures
18
Specialised ALUs
[Figure: block diagram of a superscalar CPU — the same pipelined control unit and split L1 instruction/data caches as before, but with multiple specialised arithmetic logic units: an IEU, an FPU, and a SIMD unit]
Originally floating-point (and commonly fixed-point) calculations were done in software
ie. created with a series of integer instructions
This was comparatively slow
The FPU is a specialised ALU specifically for handling floating point values
Originally an entirely extra chip
386 had optional FP coprocessor (the 80387)
486DX and 486SX were the same chip but with the FPU enabled or disabled
Always integrated since then
1. https://gifer.com/en/Pjxl
19
Specialised ALUs: FPU
An ALU that performs the same operation on a (small fixed-length) list of values simultaneously
SIMD means Single-Instruction Multiple-Data
Originally added to home processors with MMX (Multimedia Extensions) on the Pentium
These were integer only instructions
Floating-point versions later offered
Intel: SSE series
AMD: 3DNow! series
PowerPC: AltiVec series
We’ll examine this concept further when we look at optimising our multithreaded code, and later when we try to optimise our GPU code
1. http://archive.arstechnica.com/cpu/1q00/simd/m-simd-1.html
20
Specialised ALUs: SIMD / Vector Unit
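As a minimal sketch in C++ (function and parameter names are illustrative), one SSE add performs four float additions simultaneously:

#include <xmmintrin.h>   // SSE intrinsics

void add4(const float* a, const float* b, float* result) {
    __m128 va = _mm_loadu_ps(a);                // load four floats
    __m128 vb = _mm_loadu_ps(b);                // load four more
    _mm_storeu_ps(result, _mm_add_ps(va, vb));  // one instruction, four additions
}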
A superscalar architecture needs to be able to identify multiple things to be done at once
Otherwise it wouldn’t use all the execution units
Needs facilities
To look at multiple instructions and work out which ones can be executed at the same time
To actually perform the operations
Instruction-level parallelism exists when instructions are independent
Determined by the frequency of data dependencies and procedural dependencies
Machine parallelism is a measure of the ability of the processor to take advantage of instruction-level parallelism
The number of instructions that can be fetched and executed at the same time
The speed and sophistication of the mechanisms used to determine independent instructions
21
Superscalar Approach
In essence, the processor is trying to look ahead of the current point of execution for instructions to add to the pipeline
Three types of orderings are important:
The order in which instructions are fetched
The order in which instructions are executed
The order in which instructions update the contents of registers and memory locations
The more sophisticated the processor, the less it is bound by a strict relationship between these
Changes in execution order increase efficiency
Only real constraint is that the result must be correct
To allow out-of-order issue, it is necessary to decouple the decode and execute stages
Often referred to as the processor’s front-end and back-end
After the processor is finished decoding an instruction, it is placed in an instruction buffer/window ready for execution
As long as the buffer is not full, the processor can continue fetching and decoding instructions
This organisation gives the processor a look-ahead capacity
Allows identification of independent instructions
22
Out-of-order Issue and Completion
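A toy C++ sketch of the instruction window (the data structures are illustrative; real hardware does this with dedicated wakeup/select logic): decoded instructions wait in the window and issue as soon as their sources are ready, regardless of program order.

#include <bitset>
#include <vector>

struct Instr {
    int dest, src1, src2;   // register numbers
};

// Issue every windowed instruction whose source registers are ready.
std::vector<Instr> issue_ready(std::vector<Instr>& window, std::bitset<32>& ready) {
    std::vector<Instr> issued;
    for (auto it = window.begin(); it != window.end();) {
        if (ready[it->src1] && ready[it->src2]) {
            ready[it->dest] = false;   // result is pending until write-back
            issued.push_back(*it);
            it = window.erase(it);     // later, stalled instructions stay behind
        } else {
            ++it;
        }
    }
    return issued;
}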
The program itself is a sequence of linear instructions
The instruction fetch process forms a dynamic stream of instructions
Instructions are dispatched into a window of execution
Instructions no longer form a sequential stream but are structured according to their dependencies
Execution is performed according to these dependencies and available resources
Finally, instructions are put back into sequential order and their results are recorded
Because of the use of parallel pipelines, instructions may complete in an order different from the original program flow
The use of branch prediction and speculative execution means that some results may need to be abandoned
Therefore permanent storage and visible registers cannot be updated immediately
23
Superscalar Execution
The superscalar approach relies on the ability to execute multiple instructions in parallel
A combination of compiler-based and hardware techniques are used in order to maximise instruction-level parallelism
There are five basic limitations:
True data dependency (need the result of a previous instruction)
Procedural dependency (incorrectly predicted branches)
Resource conflicts (competition for functional units, eg. memory)
Output dependency
Antidependency
The first three have already become apparent in looking at pipelines
They are essentially the same problems, but with multiple pipelines they are much more likely to occur
Output dependencies occur when two instructions update the same memory location or register
The second instruction must not update the location/register until the first has executed
Antidependencies occur when an instruction updates a location used in previous instructions
The instruction must not update the location/register until all previous instructions that use it have executed
24
Superscalar Limitations
The following example code has many dependencies:
1: R3 ← R3 * R5
2: R4 ← R3 + 1
True dependency: must wait until instruction 1 updates R3
3: R3 ← R5 + 1
Output dependency: must wait until instruction 1 updates R3
Both instructions are updating the same register
Antidependency: must wait until instruction 2 has used R3
This instruction will modify a value the other needs
4: R7 ← R3 / R4
True dependency: must wait until instruction 3 updates R3
Output- and anti-dependencies arise because the values in registers may no longer reflect the sequence of values dictated by the program flow
Multiple instructions compete for the use of the same register location
A solution to this is to provide register renaming
In essence registers are allocated dynamically
Renamed:
1: R3b ← R3a * R5a
2: R4b ← R3b + 1
3: R3c ← R5a + 1
4: R7b ← R3c / R4b
25
Dependencies Removed Via Renaming
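A toy C++ sketch of the renaming idea (illustrative; real hardware uses a register alias table plus a free list of physical registers): every write allocates a fresh physical register, so only the true data dependencies remain.

#include <array>

struct Renamer {
    std::array<int, 8> map{{0, 1, 2, 3, 4, 5, 6, 7}};   // architectural -> physical
    int next_free = 8;                                   // next unused physical register

    int read(int arch) { return map[arch]; }                  // e.g. R3 currently means R3b
    int write(int arch) { return map[arch] = next_free++; }   // e.g. R3 now means R3c
};

Calling write() on each destination register reproduces the renamed sequence above: instruction 3’s new R3 no longer clashes with instruction 1’s.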
The G5 is capable of having over 100 operations within the execution core
So it needs a complicated branch prediction scheme, with three tables each storing 16K entries
Branch history table: a 1-bit predictor is associated with every branch
Global predictor table: an 11-bit vector records the actual execution path leading up to a branch
Selector table: records the success of each scheme for each branch and keeps track of which is best
26
PowerPC 970
The original Pentium 4 is a superscalar processor with a 20-stage pipeline
This pipeline enabled the Pentium 4 to be clocked very fast
Intel wanted this both for performance and marketing
Modern Pentiums still hover around this number of stages
Fetching instructions from memory is considered outside this pipeline
So it’s actually even longer
The L1 instruction cache actually contains μ-ops (pre-decoded instructions)
27
Pentium 4 (Early Models)
Often it can be difficult to find enough work to fill all of the superscalar execution units
Many modern processors have the capacity to run two different programs (threads technically) at once
We’ll cover the concept of multithreading next week
1. http://yourhigherlevellife.com/wp-content/uploads/2013/02/TangledMess.png
28
Simultaneous Multithreading (SMT)
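C++ can report the total number of hardware threads (cores × SMT ways) the machine exposes; a quick way to check your own CPU:

#include <iostream>
#include <thread>

int main() {
    // On an SMT machine this is typically twice the physical core count.
    std::cout << std::thread::hardware_concurrency() << " hardware threads\n";
}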
Multicore CPU Architecture
29
There is a limit to the speed increases possible with pipelined and superscalar processors
You can’t just keep adding more cache, longer pipelines, and more functional units (wider superscalar design)
You get diminishing returns by adding more
Execution units
As there is a limit to instruction-level parallelism
Pipeline stages
Bigger penalties for branch misses
Need better/bigger branch prediction
Instruction lookahead
Bigger penalties for branch misses again
More complicated scheduling logic
Cache
Cache hit frequency drops
1. http://i0.kym-cdn.com/photos/images/facebook/000/013/792/limitations.jpg
30
Superscalar Limitations
Superscalar uni-core architectures reached near the limit of their performance
Memory access requirements (memory wall)
Limit of instruction-level parallelism (ILP wall)
Increasing power requirements (power wall)
Wire delays
The recent trend is therefore towards placing multiple (typically identical) CPUs (cores) on a single chip
Often the cores have reduced complexity themselves
eg. complicated out-of-order execution logic is removed
Cores may share some resources
eg. L2/L3 cache
At each successive processor revision the basic design of each core can remain the same
No extra cost/risk with a new core design
Cache coherency (this is covered later) is faster when performed on-chip
31
Why Multicore?
The most naïve way to create a multi-core CPU is simply to place two entire (identical) CPU designs on a single chip
Each has all normal facilities
Standard internal architecture, L1/L2 cache, etc.
Each requires a connection to memory
This design has no shared resources (except for memory)
Twice as many transistors needed for two cores
No advantage to placing the cores on a single chip
In fact, just creates a harder wiring problem off-chip
No facility for inter-core communication
32
Simplest Multi-core Architecture
[Figure: two complete cores side by side, each with its own memory interface]
Multi-core processors typically share at least some resources
The biggest cache (eg. L2 or L3)
The memory interface
Cores are often a simpler design (than the complex designs we’ve worked up to so far)
A complex multithreaded superscalar machine requires a lot of transistors/space
If simpler cores are used, more cores can be placed on a single chip
33
Multicore Architecture
[Figure: several simple cores sharing an L2/L3 cache and a single memory interface]
Early multicore
You can see the two cores are copy/pasted (but mirrored)
Shared L2 cache
34
Example: Core 2 Duo
6 cores
Each core has its own L1 and L2 cache
Shared L3 cache
35
Example: Core i7 980X
12 cores
96 threads
Each core has its own L1 and L2 cache
L3 cache shared between all cores
With (off-chip) L4 cache shared between all cores
36
Example: Power8
72 cores
Each core has its own L1 and L2 cache
37
Example: Xeon Phi
64 cores
But not all on a single chip
Built from multiple Core Chiplet Die (CCD) modules
Which each have 8 cores
L3 cache shared between groups of 4 cores (Core CompleX / CCX)
The big section in the centre of the package is the memory interconnect (the I/O die)
https://www.quora.com/Why-are-the-top-end-AMD-Ryzen-mobile-CPUs-stuck-with-4-cores-while-Intel-already-offers-6-8-cores-for-their-lineup
38
Example: Threadripper 3990X
The Cell processor is radically different from the homogeneous multicore processors we’ve seen up to now
Was designed to simultaneously attack
Power wall: specialised SIMD cores
Memory wall: no cache, just DMA and fast local memory
Frequency wall: large register files and no branch prediction
With the idea that this would provide greater efficiency and performance
Particularly in performance/watt measurements
39
Aside: CELL Processor
[Figure: the Cell processor — one general-purpose PPE core and eight specialised SPE cores, sharing a Level 2 cache and a memory and I/O interface]
Less cache: room for more cores
No branch prediction: room for more cores, but a greater compiler/programmer burden
SIMD cores (SPUs) have no main memory load/store
Fast local store with predictable memory access speed
No possibility of cache or virtual memory latency
DMA for all main memory access: a greater compiler/programmer burden
Heterogeneous cores
Different cores for different tasks
Highly specialised and powerful SIMD cores: a greater compiler/programmer burden
40
Aside: CELL Processor
Unlike with pipelining and superscalar architectures, multicore code requires heavy programmer intervention
Compilers (generally) can’t just take advantage of it magically
At least not currently
Great deal of ongoing research being performed in this area
Multicore programming is hard
Synchronisation issues
Coherency of data issues
Efficiency issues
1. https://www.redbubble.com/people/profharshman/works/13968926-mind-blown?cat_context=u-stationery&grid_pos=3&p=sticker&rbs=86f1765e-bfb8-4716-835d-6bfea73235ad&ref=shop_grid
41
Programming Multicore Architectures
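As a minimal C++ sketch of the synchronisation issue (names illustrative): two threads incrementing a shared counter race with each other unless the update is protected.

#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;
std::mutex counter_mutex;

void work() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);   // remove this and updates are lost
        ++counter;
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << "\n";   // 200000 with the lock; unpredictable without
}

A std::atomic<int> would avoid the lock here; the point is that the programmer, not the hardware, has to spot and manage the sharing.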
Stallings Section 12.4
Stallings Chapter 13
Stallings Section 14.1, 14.2
Pipelining and Superscalar
arstechnica.com/paedia/c/cpu/part-2/cpu2-1.html
www.extremetech.com/article2/0,3973,17414,00.asp
www.hardwarecentral.com/hardwarecentral/tutorials/2427/1/
Dependencies
www.cs.berkeley.edu/~pattrsn/252S98/Lec03-ilp.pdf
RISC & CISC
www.inf.fu-berlin.de/lehre/WS94/RA/RISC-9.html
Embedded processors (a different introduction to RISC)
www.extremetech.com/print_article/0,3428,a%253D21424,00.asp
42
Further Reading
Stallings Section 12.4
Stallings Section 13.5
Stallings Section 14.1, 14.2
Pipelining and Superscalar
arstechnica.com/paedia/c/cpu/part-2/cpu2-1.html
www.extremetech.com/article2/0,3973,17414,00.asp
www.hardwarecentral.com/hardwarecentral/tutorials/2427/1/
Dependencies
www.cs.berkeley.edu/~pattrsn/252S98/Lec03-ilp.pdf
POSIX Threads
computing.llnl.gov/tutorials/pthreads/
pubs.opengroup.org/onlinepubs/007908799/xsh/threads.html
43
Further Reading
Cell Processor
Chapter 1, Programming the Cell Processor: For Games Graphics and Computation, Matthew Scarpino, Prentice Hall, 2008
www.blachford.info/computer/Cell/Cell0_v2.html
neugens.wordpress.com/2009/02/01/introduction-to-cell-part/
Cache Coherence
en.wikipedia.org/wiki/Cache_coherence
Core 2 Duo
www.intel.com/design/core2duo/documentation.htm
www.intel.com/products/processor/core2duo/index.htm
www.behardware.com/art/imprimer/623/
Core i7 980X
www.tomshardware.co.uk/Intel-i7-nehalem,review-31375.html
www.guru3d.com/article/core-i7-980x-review/1
44
Further Reading
Instruction Pre-fetch
Fetching the next instruction to be executed while executing the current one
Pipeline/Pipelining
An optimisation technique that performs different stages of the instruction cycle on different instructions simultaneously
Pipeline stalls
A pipeline is said to stall when instructions cannot move on to the next pipeline stage
Pipeline wastage
A pipeline has wastage when one or more stages aren’t performing useful work
Branches
Conditional or unconditional instructions that change the value of the program counter
45
Glossary
Memory conflicts
Occur when multiple CPU units (eg. pipeline stages) wish to access the same resource simultaneously (eg. memory)
Dependencies
When one instruction requires the result of a previous instruction
Bubbles
Pipeline stages that currently have no instruction in them
Flush
A flush of the pipeline occurs to remove instructions that should never have been added to the pipeline
Branch Prediction
A technique to guess the result of a branch
Instruction Cache
A cache that only buffers instructions
Data Cache
A cache that only buffers data
46
Glossary
47
Glossary
FPU: (Floating-Point Unit) a specialised ALU for performing floating-point (non-integer) operations
MMX / SSE / SSE2 / 3DNow!: instruction set extensions (with associated specialised ALUs) for performing optimised multimedia operations
SIMD: (Single-Instruction Multiple-Data) a type of instruction that performs the same operation on multiple pieces of data simultaneously
Glossary
Superscalar
A superscalar machine attempts to execute many instructions simultaneously with little regard to instruction order
Instruction-level parallelism
A measure of the number of independent instructions in a program
Machine parallelism
The ability of a CPU to take advantage of instruction-level parallelism
Instruction issue
The process of initiating instruction execution in the processor’s functional units
Glossary
Antidependency
When the result of an instruction should not be written to a location used in a previous instruction (because it may not have been executed yet)
Register renaming
A process of dynamically allocating registers to remove all but true data dependencies
Committing
Permanently storing a speculatively calculated result
Window of execution
Buffer containing pre-fetched instructions that the CPU examines to find independent instructions to be executed