
KIT308/408 (Advanced) Multicore Architecture and Programming

Advanced CPU Architecture
Dr. Ian Lewis
Discipline of ICT, School of TED
University of Tasmania, Australia
1

Introduce the concept of pipelining
Understand why it is a primary feature of high-performance architecture design
Introduce basic concepts of superscalar architectures
See drawbacks of these approaches and techniques to reduce their effects
These drawbacks are part of the reason that multicores are now standard
Introduce multicore architectures
And briefly look at some examples

2
Purpose of Lecture

Pipelined CPU Architecture

3

CPUs as we’ve discussed them so far process instructions one at a time
Do all the steps needed for one instruction before doing anything for the next
This isn’t optimal, and pipelined architectures instead try to process lots of instructions at once
The idea is similar to some real world process improvements, e.g.
Assembly lines
Each step of the assembly line does one part of the construction of the car
Lots of cars in lots of different states of completion
McDonalds drive-through
4
Pipelining

The full instruction cycle is complex
Lends itself to being implemented as a pipeline with many stages (e.g. a six-stage pipeline)
Fetch instruction (FI)
Read next instruction into buffer
Decode instruction (DI)
Determine opcode and operands
Calculate operands (CO)
Work out operand addresses
Fetch operands (FO)
Fetch all operands from memory
Execute instruction (EI)
Perform operation and store result
Write operand (WO)
Store the result in memory
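To make the overlap concrete, here is a minimal sketch (not from the unit materials) in C that models an ideal six-stage pipeline, printing which instruction occupies each stage on each clock cycle:

#include <stdio.h>

#define STAGES 6
#define NUM_INSTRUCTIONS 9

int main(void) {
    const char *stage_names[STAGES] = { "FI", "DI", "CO", "FO", "EI", "WO" };

    // In an ideal pipeline, instruction i enters stage s at cycle i + s
    for (int cycle = 0; cycle < NUM_INSTRUCTIONS + STAGES - 1; cycle++) {
        printf("cycle %2d:", cycle + 1);
        for (int s = 0; s < STAGES; s++) {
            int i = cycle - s;                  // instruction currently in stage s
            if (i >= 0 && i < NUM_INSTRUCTIONS)
                printf("  %s=i%d", stage_names[s], i + 1);
        }
        printf("\n");
    }
    return 0;
}

Once the pipeline is full (from cycle 6 onwards) every stage is busy and one instruction completes per cycle: the best-case view shown on the next slide.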

5
Splitting the Instruction Cycle into Stages

Each instruction progresses through the pipeline
Every instruction moves through all the stages
This six-stage pipeline processes six instructions simultaneously
Best case view
As we shall see in a moment, pipelines don’t run quite this smoothly

6
Pipeline Operation

There are many things that can cause delays in our pipeline
Pipeline stalls
Pipeline wastage
Branches
Memory conflicts
Dependencies
These problems cause stages in the pipeline to perform no work for some number of cycles
Generally referred to as bubbles in the pipeline
7
Problems With Pipelining

Stalls
Some instructions may spend longer in a stage than others
Particularly in the operand fetch stage
This causes delays for the instructions behind
Instructions in front can continue as normal
Wastage
Many instructions simply have to wait through stages that they do not require
Waste of resources

8
Pipelining Problems: Stalls and Wastage

Branches are the biggest problem for pipelined architectures
Here, instruction 3 is a conditional branch
That turned out to be taken
ie. whatever condition is needed to branch was satisfied
Not possible to work this out until the last pipeline stage
Instructions 4..7 have all entered the pipeline behind it
These instructions shouldn’t actually be executed
Flushed from the pipeline
Wasted processing time
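A quick way to feel this cost on real hardware is the classic sorted-vs-unsorted benchmark; the following is a rough sketch (array size and threshold are arbitrary choices, not from the slides):

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

static int compare_ints(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    // With the sort, the branch below is almost perfectly predictable.
    // Comment this line out and (on real hardware) the loop runs
    // dramatically slower, because the branch is taken at random.
    qsort(data, N, sizeof data[0], compare_ints);

    long sum = 0;
    for (int i = 0; i < N; i++)
        if (data[i] >= 128)   // ~50% mispredicted on random data
            sum += data[i];

    printf("%ld\n", sum);
    return 0;
}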

9
Pipelining Problems: Branches

Lots of historical approaches have tried to deal with this problem, but only two are worth discussing
Both perform speculative execution of instructions
(Educated) guesses are made on what is the correct set of instructions to execute
But can be wrong
Facilities needed to undo wrong guesses
Execute both branch possibilities
Requires duplicate pipeline stages for basically every step
We’ll come back to this in superscalar
Branch prediction
Takes an educated guess as to which way the branch will go, and executes that stream
10
Dealing With Branches

Static prediction
Branch never (predict never taken)
Branch always (predict always taken)
Opcode based
e.g. perhaps, BNE always branch, BEQ never branch
Compiler hints
Instruction format has a flag saying whether or not the branch should be taken (more information is known at compile-time than run-time)
Relatively simple to implement
Up to 75% success rate with this strategy
Dynamic prediction
Single bit
Taken / not taken switch
Simply remembers last way the branch went
Two mis-predictions for a simple loop, one on entering and one on exit
Two bits
More sophisticated approach
Only one mis-prediction (on exit) for a simple loop
More bits
More complicated
Extra information needs to be stored for each branch instruction
Can provide the highest level of success (90%+)
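A minimal sketch of the two-bit scheme in C (the state encoding here is one common choice, not the only one):

#include <stdbool.h>

// 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
// Two consecutive mispredictions are needed to flip the prediction, which is
// why a simple loop mispredicts only once (on exit).
typedef struct { unsigned state; } TwoBitPredictor;   // state in 0..3

bool predict_taken(const TwoBitPredictor *p) {
    return p->state >= 2;
}

void train(TwoBitPredictor *p, bool actually_taken) {
    if (actually_taken) {
        if (p->state < 3) p->state++;   // saturate at "strongly taken"
    } else {
        if (p->state > 0) p->state--;   // saturate at "strongly not-taken"
    }
}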

11
Branch Prediction

These occur when two stages both need to access memory simultaneously
ie. it is not possible for the fetch instruction, fetch operands, and write out stages all to access main store at the same time
Most modern processors have both an instruction cache and a data cache to reduce these conflicts
If operands are in registers and all “main store” operands are actually in the two caches, then the pipeline doesn’t stall

12
Pipelining Problems: Resource Conflicts
[Block diagram: CPU with a pipelined control unit (Fetch, Decode, Address Calc, Operand Fetch, Execution, Operand Write), separate L1 instruction and data caches, a shared L2 cache, an MMU with TLB, the ALU, status registers, and general registers]

A (true data) dependency arises when an instruction relies on the result of a previous instruction
This is a very common occurrence
Operand fetching cannot take place until the first instruction has fully completed
Dependencies mean that the latter instruction is unable to execute
For example:
R1 ← R2 + R3
R3 ← R1
For example:
R1 ← mem[1234]
R3 ← R1
A clever compiler (or programmer) can reorder (other) instructions so that dependencies do not affect the pipeline
R1 ← mem[1234]
R2 ← R1
R5 ← R3 + R4
R3 ← R5
Reordered
R1 ← mem[1234]
R5 ← R3 + R4
R2 ← R1
R3 ← R5
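In C-flavoured terms (hypothetical variables; the comments map each line back to the register form above), the same reordering looks like this:

// Before: the add must stall behind the slow load
int demo_before(const int *p, int x, int y) {
    int a = *p;        // R1 <- mem[1234]  (slow memory load)
    int b = a + 1;     // R2 <- R1  (true dependency: waits for the load)
    int c = x + y;     // R5 <- R3 + R4  (independent work, queued behind the stall)
    int d = c;         // R3 <- R5
    return b + d;
}

// After: the independent work is moved into the load's shadow,
// giving the memory access time to complete
int demo_after(const int *p, int x, int y) {
    int a = *p;        // R1 <- mem[1234]
    int c = x + y;     // R5 <- R3 + R4  (executes while the load is in flight)
    int b = a + 1;     // R2 <- R1  (bigger gap between dependent instructions)
    int d = c;         // R3 <- R5
    return b + d;
}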

13
Pipelining Problems: Dependencies
the value in R1 can’t be stored in R3 until the first operation is complete

as memory accesses are very slow, R1 will take even longer to get its value

waiting for previous instruction

waiting for previous instruction

bigger gap between dependent instructions

14
How Long Should a Pipeline Be?

Direct result of CPU speed optimisations
A complex interaction between
Branch prediction (speculative execution)
Cache
(and Meltdown needs MMU features too)
Basically when you know what is in the cache, you can set up code to be speculatively run (but that will be ultimately discarded) that reads from memory it isn’t supposed to access
Then you need to do something with that memory that can be detected (by you) before the CPU discards the result
Use the speculative value to possibly read from non-cached memory and see how long the code takes to run
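The slide’s pseudocode below walks through the whole gadget; as a rough C sketch of just the final timing step (names are hypothetical, and real exploits need far more care), on x86 it might look like:

#include <stdint.h>
#include <x86intrin.h>   // __rdtscp (x86 timestamp counter)

// Time a single memory access: a cache hit takes tens of cycles, a miss
// hundreds. Whether this address was cached during speculation leaks one bit.
uint64_t time_access(const volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);   // read timestamp before the load
    (void)*addr;                       // the load being measured
    uint64_t end = __rdtscp(&aux);     // read timestamp after the load
    return end - start;
}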
x = some calculation
We know this calculation will be false, but the CPU doesn’t
if (x is true)
Trick the CPU into executing this
y = mem[somewhere]
Read from memory we aren’t supposed to be able to access
mask = 1 << (bit we care about)
Make a mask of the bit of the memory location we want to read
if ((y & mask) is true)
If that bit is a 1, read from somewhere else in memory (that isn’t L1 cached)
z = mem[somewhere else]
Do the above a bunch of times, and time how long it takes
It’ll take slightly longer if the bit we care about is a 1

15
Aside: Spectre

Superscalar CPU Architecture

16

So what can be done when it’s not viable to lengthen the pipeline any further?
No more sensible ways you can break up the steps
Too many branches enter the pipeline
Too many dependencies / resource conflicts in the code
The superscalar machine has multiple pipelines
These allow the simultaneous execution of multiple independent instructions
This requires multiple ALUs
IEUs, FPUs, etc.

17
Superscalar Architectures

CPUs contain specialised ALUs optimised to work on differing kinds of instructions
Integer execution unit (IEU)
Floating-point unit (FPU)
SIMD / vector unit
Such specialised ALUs were originally added to simple or pipelined architectures
Multiple different ALUs are a crucial element in the design of superscalar architectures

18
Specialised ALUs
[Block diagram: the pipelined CPU from before, now with multiple ALUs (IEU, FPU, SIMD) alongside the control unit stages, split L1 caches, L2 cache, MMU/TLB, status registers, and registers]

Originally floating-point (and commonly fixed-point) calculations were done in software
ie. created with a series of integer instructions
This was comparatively slow
The FPU is a specialised ALU specifically for handling floating-point values
Originally an entirely extra chip
386 had an optional FP coprocessor (the 80387)
486DX and 486SX were the same chip but with the FPU enabled or disabled
Always integrated since then

19
Specialised ALUs: FPU

An ALU that performs the same operation on a (small fixed-length) list of values simultaneously
SIMD means Single-Instruction Multiple-Data
Originally added to home processors with MMX (Multimedia Extensions) on the Pentium
These were integer-only instructions
Floating-point versions later offered
Intel: SSE series
AMD: 3DNow! series
PowerPC: AltiVec series
We’ll examine this concept further when we look at optimizing our multithreaded code and later when we try to optimize our GPU code
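As a minimal illustration of the idea using Intel’s SSE intrinsics (one of the instruction sets named above; names are placeholders, and each pointer must refer to at least four floats):

#include <xmmintrin.h>   // SSE intrinsics

// One SIMD instruction performs four float additions at once
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);              // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);              // load 4 floats from b
    _mm_storeu_ps(out, _mm_add_ps(va, vb));   // out[i] = a[i] + b[i], i = 0..3
}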
20
Specialised ALUs: SIMD / Vector Unit

A superscalar architecture needs to be able to identify multiple things to be done at once
Otherwise it wouldn’t use all the execution units
Needs facilities
To look at multiple instructions and work out which ones can be executed at the same time
To actually perform the operations
Instruction-level parallelism exists when instructions are independent
Determined by the frequency of data dependencies and procedural dependencies
Machine parallelism is a measure of the ability of the processor to take advantage of instruction-level parallelism
The number of instructions that can be fetched and executed at the same time
The speed and sophistication of the mechanisms used to determine independent instructions

21
Superscalar Approach

In essence, the processor is trying to look ahead of the current point of execution for instructions to add to the pipeline
Three types of orderings are important:
The order in which instructions are fetched
The order in which instructions are executed
The order in which instructions update the contents of registers and memory locations
The more sophisticated the processor, the less it is bound by a strict relationship between these
Changes in execution order increase efficiency
The only real constraint is that the result must be correct
To allow out-of-order issue, it is necessary to decouple the decode and execute stages
Often referred to as the processor’s front-end and back-end
After the processor has finished decoding an instruction, it is placed in an instruction buffer/window ready for execution
As long as the buffer is not full, the processor can continue fetching and decoding instructions
This organisation gives the processor a look-ahead capacity
Allows identification of independent instructions

22
Out-of-order Issue and Completion

The program itself is a sequence of linear instructions
The instruction fetch process forms a dynamic stream of instructions
Instructions are dispatched into a window of execution
Instructions no longer form a sequential stream but are structured according to their dependencies
Execution is performed according to these dependencies and available resources
Finally, instructions are put back into sequential order and their results are recorded
Because of the use of parallel pipelines, instructions may complete in orders different to the original program flow
The use of branch prediction and speculative execution means that some results may need to be abandoned
Therefore permanent storage and visible registers cannot be updated immediately

23
Superscalar Execution

The superscalar approach relies on the ability to execute multiple instructions in parallel
A combination of compiler-based and hardware techniques is used to maximise instruction-level parallelism
There are five basic limitations:
True data dependency (need the result of a previous instruction)
Procedural dependency (incorrectly predicted branches)
Resource conflicts (competition for functional units, eg. memory)
Output dependency
Antidependency
The first three have already become apparent in looking at pipelines
They are essentially the same problems, but with multiple pipelines they are much more likely to occur
Output dependencies occur when two instructions update the same memory location or register
The second instruction must not update the location/register until the first has executed
Antidependencies occur when an instruction updates a location used in previous instructions
The instruction must not update the location/register until all previous instructions that use it have executed

24
Superscalar Limitations

The following example code has many dependencies:
1: R3 ← R3 * R5
2: R4 ← R3 + 1
True dependency: must wait until instruction 1 updates R3
3: R3 ← R5 + 1
Output dependency: must wait until instruction 1 updates R3
Both instructions are updating the same register
Antidependency: must wait until instruction 2 has used R3
This instruction will modify a value the other needs
4: R7 ← R3 / R4
True dependency: must wait until instruction 3 updates R3
Output- and anti-dependencies arise because the values in registers may no longer reflect the sequence of values dictated by the program flow
Multiple instructions compete for the use of the same register location
A solution to this is to provide register renaming
In essence, registers are allocated dynamically
Renamed:
1: R3b ← R3a * R5a
2: R4b ← R3b + 1
3: R3c ← R5a + 1
4: R7b ← R3c / R4b

25
Dependencies Removed Via Renaming

The G5 is capable of having over 100 operations within the execution core
So it needs a complicated branch prediction scheme, with three tables each storing 16K entries
Branch history table: a 1-bit predictor is associated with every branch
Global predictor table: an 11-bit vector records the actual execution path leading up to a branch
Selector table: records the success of each scheme for each branch and keeps track of which is best

26
PowerPC 970

The original Pentium 4 is a superscalar processor with a 20-stage pipeline
This pipeline enabled the Pentium 4 to be clocked very fast
Intel wanted this both for performance and marketing
Modern Pentiums still hover around this number of stages
Fetching instructions from memory is considered outside this pipeline
So it’s actually even longer
The L1 instruction cache actually contains μ-ops (pre-decoded instructions)

27
Pentium 4 (Early Models)

Often it can be difficult to find enough work to fill all of the superscalar execution units
Many modern processors have the capacity to run two different programs (technically, threads) at once
We’ll cover the concept of multithreading next week

28
Simultaneous Multithreading (SMT)

Multicore CPU Architecture

29

There is a limit to the speed increases possible with pipelined and superscalar processors
You can’t just keep adding more cache, longer pipelines, and more functional units (wider superscalar design)
You get diminishing returns by adding more
Execution units
As there is a limit to instruction-level parallelism
Pipeline stages
Bigger penalties for branch misses
Need better/bigger branch prediction
Instruction lookahead
Bigger penalties for branch misses
Need better/bigger branch prediction
More complicated scheduling logic
Cache
Cache hit frequency drops
30
Superscalar Limitations

Superscalar uni-core architectures reached near the limit of their performance
Memory access requirements (memory wall)
Limit of instruction-level parallelism (ILP wall)
Increasing power requirements (power wall)
Wire delays
At each successive processor revision the basic design of each core can remain the same
No extra cost/risk with a new core design
Cache coherency (this is covered later) is faster when performed on-chip
Recent trend is towards placing multiple (typically identical) CPUs (cores) on a single chip
Due to superscalar architectures approaching their performance limit
Often the cores have reduced complexity themselves
eg. complicated out-of-order execution logic is removed
Cores may share some resources
eg. L2/L3 cache

31
Why Multicore?

The most naïve way to create a multi-core CPU is simply to place two entire (identical) CPU designs on a single chip
Each has all normal facilities
Standard internal architecture, L1/L2 cache, etc.
Each requires a connection to memory
This design has no shared resources (except for memory)
Twice as many transistors needed for two cores
No advantage to placing the cores on a single chip
In fact, it just creates a harder wiring problem off-chip
No facility for inter-core communication

32
Simplest Multi-core Architecture
[Diagram: two identical cores, each with its own memory interface]

Multi-core processors typically share at least some resources
The biggest cache (eg. L2 or L3)
The memory interface
Cores are often a simpler design (than the complex designs we’ve worked up to so far)
A complex multithreaded superscalar machine requires a lot of transistors/space
If simpler cores are used, more cores can be placed on a single chip

33
Multicore Architecture
[Diagram: several simple cores sharing an L2/L3 cache and a single memory interface]

Early multicore
Can see the cores are copy/pasted (but mirrored)
Shared L2 cache

34
Example: Core 2 Duo

6 cores
Each core has its own L1 and L2 cache
Shared L3 cache

35
Example: Core i7 980X

12 cores, 96 threads
Each core has its own L1 and L2 cache
L3 cache shared between all cores
With (off-chip) L4 cache shared between all cores

36
Example: Power8

72 cores
Each core has its own L1 and L2 cache

37
Example: Xeon Phi

64 cores
But not all on a single chip
Built from multiple Core Chiplet Die (CCD) modules
Which each have 8 cores
L3 cache shared between groups of 4 cores (Core CompleX / CCX)
Big section in the centre of the package is a memory interconnect

38
Example: Threadripper 3990X

The Cell processor is radically different from the homogeneous multicore processors we’ve seen up to now
Was designed to simultaneously attack
Power wall: specialised SIMD cores
Memory wall: no cache, just DMA and fast local memory
Frequency wall: large register files and no branch prediction
With the idea that this would provide greater efficiency and performance
Particularly in performance/watt measurements

39
Aside: CELL Processor
[Diagram: one PPE core and eight SPE cores around a Level 2 cache and a memory and I/O interface]

Less cache
More cores
No branch prediction
More cores
Greater compiler/programmer burden
SIMD cores (SPUs) have no main memory load/store
Fast local store
Predictable memory access speed
No possibility of cache or virtual memory latency
DMA for all main memory access
Greater compiler/programmer burden
Heterogeneous cores
Different cores for different tasks
Highly specialised and powerful SIMD cores
Greater compiler/programmer burden

40
Aside: CELL Processor

Unlike pipelining and superscalar architectures, for multicore code heavy programmer intervention is required
Compilers (generally) can’t just take advantage of it magically
At least not currently
Great deal of ongoing research being performed in this area
Multicore programming is hard
Synchronisation issues
Coherency of data issues
Efficiency issues

41
Programming Multicore Architectures

Stallings Section 12.4
Stallings Chapter 13
Stallings Section 14.1, 14.2
Pipelining and Superscalar
arstechnica.com/paedia/c/cpu/part-2/cpu2-1.html
www.extremetech.com/article2/0,3973,17414,00.asp
www.hardwarecentral.com/hardwarecentral/tutorials/2427/1/
Dependencies
www.cs.berkeley.edu/~pattrsn/252S98/Lec03-ilp.pdf
RISC & CISC
www.inf.fu-berlin.de/lehre/WS94/RA/RISC-9.html
Embedded processors (a different introduction to RISC)
www.extremetech.com/print_article/0,3428,a%253D21424,00.asp

42
Further Reading

Stallings Section 12.4
Stallings Section 13.5
Stallings Section 14.1, 14.2
Pipelining and Superscalar
arstechnica.com/paedia/c/cpu/part-2/cpu2-1.html
www.extremetech.com/article2/0,3973,17414,00.asp
www.hardwarecentral.com/hardwarecentral/tutorials/2427/1/
Dependencies
www.cs.berkeley.edu/~pattrsn/252S98/Lec03-ilp.pdf
POSIX Threads
computing.llnl.gov/tutorials/pthreads/
pubs.opengroup.org/onlinepubs/007908799/xsh/threads.html

43
Further Reading

Cell Processor
Chapter 1, Programming the Cell Processor: For Games, Graphics and Computation, Matthew Scarpino, Prentice Hall, 2008
www.blachford.info/computer/Cell/Cell0_v2.html
neugens.wordpress.com/2009/02/01/introduction-to-cell-part/
Cache Coherence
en.wikipedia.org/wiki/Cache_coherence
Core 2 Duo
www.intel.com/design/core2duo/documentation.htm
www.intel.com/products/processor/core2duo/index.htm
www.behardware.com/art/imprimer/623/
Core i7 980X
www.tomshardware.co.uk/Intel-i7-nehalem,review-31375.html
www.guru3d.com/article/core-i7-980x-review/1

44
Further Reading

Instruction Pre-fetch
Fetching the next instruction to be executed while executing the current one
Pipeline/Pipelining
An optimisation technique that performs different stages of the instruction cycle on different instructions simultaneously
Pipeline stalls
A pipeline is said to stall when instructions cannot move on to the next pipeline stage
Pipeline wastage
A pipeline has wastage when one or more stages aren’t performing useful work
Branches
Conditional or unconditional instructions that change the value of the program counter

45
Glossary

Memory conflicts
Occur when multiple CPU units (eg. pipeline stages) wish to access the same resource simultaneously (eg. memory)
Dependencies
When one instruction requires the use of the result of a previous instruction
Bubbles
Pipeline stages that currently have no instruction in them
Flush
A flush of the pipeline occurs to remove instructions that should never have been added to the pipeline
Branch Prediction
A technique to guess the result of a branch
Instruction Cache
A cache that only buffers instructions
Data Cache
A cache that only buffers data

46
Glossary

FPU: (Floating-Point Unit) a specialised ALU for performing floating-point (non-integer) operations
MMX / SSE / SSE2 / 3DNow!: a specialised ALU for performing optimised multimedia operations
SIMD: (Single-Instruction Multiple-Data) a type of instruction that performs the same operation on multiple pieces of data simultaneously

47
Glossary

Superscalar
A superscalar machine attempts to execute many instructions simultaneously with little regard to instruction order
Instruction-level parallelism
A measure of the amount of independent instructions in a program
Machine parallelism
The ability of a CPU to take advantage of instruction-level parallelism
Instruction issue
The process of initiating instruction execution in the processor’s functional units

Glossary

Antidependency
When the result of an instruction should not be written to a location used in a previous instruction (because it may not have been executed yet)
Register renaming
A process of dynamically allocating registers to remove all but true data dependencies
Committing
Permanently storing a speculatively calculated result
Window of execution
Buffer containing pre-fetched instructions that the CPU examines to find independent instructions to be executed

Glossary