
Compilers and computer architecture: Caches and caching
Martin Berger 1 December 2019
(1) Email: M.F.Berger@sussex.ac.uk. Office hours: Wed 12-13 in Chi-2R312
1/1

Recall the function of compilers
2/1

Caches in modern CPUs
Today we will learn about caches in modern CPUs. They are crucial for high-performance programs and high-performance compilation.
Today’s material can safely be ignored by ’normal’ programmers who don’t care about performance.
3/1

Caches in modern CPUs
Let’s look at a modern CPU. Here is a November 2018 Intel Ivy Bridge Xeon CPU. Much of the silicon is for the caches and the cache controllers.
4/1

Caches in modern CPUs
Why is much of the chip area dedicated to caches?
5/1

Simplified computer layout
[Diagram: CPU (ALU, registers): fast; volatile memory (RAM), reached over the bus: medium; disks/flash: slow.]
The faster the memory, the faster the computer.
6/1

Available memory
             Capacity           Latency           Cost
Register     1000s of bytes     1 cycle           £££££
SRAM         1s of MBytes       several cycles    ££££
DRAM         10s of GBytes      20-100 cycles     ££
Flash        100s of GBytes                       £
Hard disk    10 TBytes          0.5-5 M cycles    cheap
Ideal        1000s of GBytes    1 cycle           cheap
- RAM = Random Access Memory.
- SRAM = static RAM, fast but uses 6 transistors per bit.
- DRAM = dynamic RAM, slow but uses 1 transistor per bit.
- Flash = non-volatile, slow, loses capacity over time.
- Latency is the time between issuing the read/write access and the completion of this access.
It seems memory is either small, fast and expensive, or cheap, big and slow.
7/1

Tricks for making the computer run faster
Key ideas:
- Use a hierarchy of memory technology.
- Keep the most often-used data in a small, fast memory.
- Use slow main memory only rarely.

But how? Does that mean the programmer has to worry about this? No: let the CPU worry about it, using a mechanism called a cache.
Why on earth should this work? Exploit locality!
8/1

Locality
In practice, most programs, most of the time, do the following:
If the program accesses memory location loc, then most of the next few memory accesses are very likely at addresses close to loc.
I.e. we use access to a memory location as a heuristic prediction for memory access in the (near) future. This is called (data) locality. Locality means that memory access often (but not necessarily always) follows this fairly predictable pattern.
Modern CPUs exploit this predictability for speed with caches.
Note: it is possible to write programs that don’t exhibit locality. They will typically run very slowly.
9/1
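
As a quick illustration of the note above, here is a minimal sketch, not from the slides, of a loop whose accesses jump around a large array; the names, sizes and the index-scrambling constant are invented for the example. On most machines the scattered walk runs many times slower than the sequential one over the same data, purely because almost every access misses the cache.

#define N (1 << 26)            /* 64M ints (about 256 MB), far bigger than any cache */

/* Sequential walk: excellent locality, almost all accesses hit the cache. */
long sum_sequential(const int *a) {
    long s = 0;
    for (long i = 0; i < N; i++) s += a[i];
    return s;
}

/* Pseudo-random walk over the same data: hardly any locality,
   so on typical hardware most accesses miss the cache.          */
long sum_scattered(const int *a) {
    long s = 0;
    unsigned long idx = 1;
    for (long i = 0; i < N; i++) {
        idx = (idx * 6364136223846793005UL + 1) % N;   /* jump around the array */
        s += a[idx];
    }
    return s;
}

Both functions read exactly N array elements and compute a sum; only the order of the accesses differs.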

Why would most programs exhibit locality?
10/1

Data locality
int [] a = new int [1000000];
for ( int i = 0; i < 1000000; i++ ) {
  a[i] = i+1;
}

[Plot: memory address of each access over time.]
11/1

Data locality
int [] a = new int [1000000];
...
for ( int i = 2; i < 1000000; i++ ) {
  a[i] = a[i-1] * a[i-2];
}

[Plot: memory address of each access over time.]
12/1

Data locality
int [] a = new int [1000000];
int [] b = new int [1000000];
...
for ( int i = 0; i < 1000000; i++ ) {
  a[i] = b[i] + 1;
}

[Plot: memory address of each access over time.]
13/1

Code locality
Program execution (reading instructions via the PC) is local too, with occasional jumps.

[MIPS code fragment (stack pushes and pops, loads, stores, arithmetic, a jal and a branch) next to a plot of instruction addresses over time, with three points marked "Jump".]
14/1

Data locality
Another cause of data locality is the stack and how we compile procedure invocations into activation records. This is because within a procedure activation we typically spend a lot of time accessing the procedure arguments and local variables. In addition, in recursive procedure invocations, related activation records are nearby on the stack.

[Diagram: the stack of activation records for main calling f(3), which recursively calls f(2), f(1) and f(0); each activation record holds a result slot, the argument, the control link and the return address.]
15/1

Data locality
Stop & Copy garbage collectors improve locality because they compact the heap.

[Diagram: objects A-F in the old space, with A, C and F reachable from the root; after collection A, C and F sit next to each other in the new space.]
16/1

Locality
In practice we have data access and instruction access together, so the access patterns look more like this:

[Plot: memory address over time; instruction accesses and data accesses form two distinct bands.]

Still a lot of predictability in memory access patterns, but over (at least) two distinct regions of memory.
17/1

Data locality of OO programming
Accessing objects, and especially invoking methods, often has bad locality because of pointer chasing. Object pointers can point anywhere inside the heap, losing locality. Partly to ameliorate this shortcoming, JIT compilers have been developed.

[Diagram: an instance of A holds a pointer (dptr) to the class description of A, plus its members; the class description holds the class name "A" and method descriptions, which point to the bodies of f_A, g_A, ...]
18/1

How to exploit locality and the memory hierarchy?
Two approaches:
- Expose the hierarchy (proposed by S. Cray): let programmers access registers, fast SRAM, slow DRAM and the disk 'by hand'. Tell them "use the hierarchy cleverly". This is not done in 2019.
- Hide the memory hierarchy. Programming model: there is a single kind of memory and a single address space (excluding registers). The hardware automatically assigns locations to fast or slow memory, depending on usage patterns. This is what caches do in CPUs.
19/1

Cache
The key element is the cache, which is integrated into modern CPUs.

[Diagram: CPU (ALU, registers) with an SRAM cache: fast; DRAM main memory: medium; disks and swap space: slow.]
20/1

Cache
A CPU cache is used by the CPU of a computer to reduce the average time to access memory. The cache is a smaller, faster and more expensive memory inside the CPU which stores copies of the data from the most frequently used main memory locations, for fast access.

When the CPU reads from or writes to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.

As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

Recall that latency (of memory access) is the time between issuing the read/write access and the completion of this access. A cache entry is called a cache line.
21/1

Cache (highly simplified)

[Diagram: the CPU, a cache holding the tag/value pairs (77, 6), (119, 2), (15, 99) and (1112, 321), and main memory.]

The cache contains temporary copies of selected main memory locations, e.g. Mem[119] = 2. The cache holds pairs of a main memory address (called the tag) and a value.
22/1

Cache (highly simplified)
Goal of a cache: to reduce the average access time to memory by exploiting locality.

Caches are made from expensive but fast SRAM, with much less capacity than main memory. So not all memory entries can be in the cache.
23/1

Cache reading (highly simplified)
Cache 'algorithm': if the CPU wants to read memory location loc:
- Look for tag loc in the cache.
- If a cache line (loc, val) is found (called a cache hit), then return val.
- If no cache line contains tag loc (called a cache miss), then select some cache line k for replacement, read location loc from main memory obtaining value val', and replace k with (loc, val').
24/1

Cache writing (highly simplified)
Cache 'algorithm': if the CPU wants to write the value val to memory location loc:
- Write val to main memory location loc.
- Look for tag loc in the cache.
- If a cache line (loc, val') is found (cache hit), then replace val' with val in the cache line.
- If no cache line contains tag loc (cache miss), then select some cache line k for replacement and replace k with (loc, val).
25/1
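
The read and write 'algorithms' above are simple enough to express in a few lines of C. The following is a minimal sketch, not taken from the slides: a tiny fully-associative cache with write-through and naive round-robin replacement. All names (CacheLine, cache_read, CACHE_LINES, ...) are invented for the illustration; real hardware does all of this in silicon.

#include <stdint.h>
#include <stdbool.h>

#define CACHE_LINES 4          /* tiny cache, just for illustration          */

typedef struct {
    bool     valid;            /* does this line hold anything?              */
    uint32_t tag;              /* main-memory address (the tag)              */
    uint32_t val;              /* cached copy of the value                   */
} CacheLine;

static CacheLine cache[CACHE_LINES];
static uint32_t  main_memory[1024];      /* stand-in for slow DRAM; loc < 1024 */
static unsigned  next_victim = 0;        /* naive round-robin eviction         */

/* Read: on a hit return the cached value; on a miss fetch from main
   memory and replace some line with the new (tag, value) pair.       */
uint32_t cache_read(uint32_t loc) {
    for (int i = 0; i < CACHE_LINES; i++)
        if (cache[i].valid && cache[i].tag == loc)
            return cache[i].val;                       /* cache hit           */

    uint32_t val = main_memory[loc];                   /* cache miss: slow    */
    cache[next_victim] = (CacheLine){ true, loc, val };
    next_victim = (next_victim + 1) % CACHE_LINES;
    return val;
}

/* Write-through: always update main memory; update the cache line on a
   hit, otherwise evict some line and install (loc, val).               */
void cache_write(uint32_t loc, uint32_t val) {
    main_memory[loc] = val;                            /* write through       */
    for (int i = 0; i < CACHE_LINES; i++)
        if (cache[i].valid && cache[i].tag == loc) {
            cache[i].val = val;                        /* cache hit           */
            return;
        }
    cache[next_victim] = (CacheLine){ true, loc, val };   /* cache miss       */
    next_victim = (next_victim + 1) % CACHE_LINES;
}

Real caches differ in detail (they cache whole lines of adjacent bytes and use cleverer replacement policies), but the hit/miss structure is exactly the one described on the two slides above.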
Note
All these things (writing to main memory, looking for a tag, replacing a cache line, evicting, etc.) happen automatically, behind the programmer’s back. It’s all implemented in silicon, so cache management is very fast.

Unless interested in peak performance, the programmer can program under the illusion of memory uniformity.
26/1

Successful cache read (highly simplified)

[Diagram: the CPU reads location 119; the cache holds the line (119, 2), so this is a hit and the value is returned from the cache.]
27/1

Cache failure on read (highly simplified)

[Diagram: the CPU reads location 222; no cache line has tag 222, so this is a miss; the value 12 is fetched from main memory location 222 and replaces one of the existing cache lines.]
28/1

Successful cache write (highly simplified)

[Diagram: the CPU stores 3 at location 77; the cache holds a line with tag 77, so both the cache line and main memory are updated.]
29/1

Write with cache failure (highly simplified)

[Diagram: the CPU stores 3 at location 85; no cache line has tag 85, so main memory is updated and a cache line is evicted to make room for (85, 3).]
30/1

Prefetching
What about locality? The explanation so far has made little use of locality. Caches typically do something more.

By locality, the act of accessing loc predicts likely usage of ... loc-1, loc, loc+1, loc+2, loc+3 ... in the near future. So far we are only 'reusing' loc. So let's get e.g. loc ... loc+n from memory together (rather than separately). This is called prefetching. It relies on the fact that, with current CPUs and memory systems, a burst of reads of n adjacent addresses from main memory is faster than fetching those n cells separately.
31/1

Prefetching
So on a cache fault on loc, the CPU prefetches e.g. loc ... loc+n in one big burst. The concept of preloading webpages is similar. (Cf. Instagram)
32/1
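
Prefetching can also be requested in software. As a small, hedged illustration (the function, array name and prefetch distance below are invented for the example): GCC and Clang provide a __builtin_prefetch intrinsic that asks the CPU to start loading an address it will need soon.

/* Sketch only: sums an array while prefetching a fixed distance ahead.
   PREFETCH_DIST is a guess; real code would tune it per machine.       */
#define PREFETCH_DIST 16

long sum(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST]);  /* hint: fetch this soon */
        s += a[i];
    }
    return s;
}

Hardware prefetchers usually recognise a simple sequential walk like this by themselves, so explicit prefetch hints mostly pay off for less regular access patterns.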
Counterintuitive behaviour of CPUs with caches
Consider the following simple program.

x := x + 1;
x := x + 1

Clearly both assignments translate to the same machine code. But if the CPU has caches (and all modern CPUs have), then the first execution of x := x + 1 might take much longer than the second, despite the machine code being identical.

Why? Because during the first execution of x := x + 1, the cache may not contain x. In this case, a slow request will be issued to main memory to fetch x. However, if for some reason x is already in the cache, then x will quickly be fetched from the cache. In both cases the second execution will run quickly, because by then the cache will contain x.
33/1

Miss rate

Miss rate = (number of cache misses) / (number of cache accesses)

The performance of a program can be significantly improved if the miss rate is low. The miss rate is influenced by many factors:
- The selection policy for cache eviction.
- The compiler improving data locality (e.g. in garbage collection).
- The programmer ensuring data locality.
- Cache size (bigger = better). Modern computers have multiple caches (see later).
34/1

Miss penalty
The miss penalty of a cache is the time it takes to read from/write to main memory minus the time it takes to read from/write to the cache.

The miss penalty has been increasing dramatically, because CPUs get faster more quickly than memory does. So good cache management is becoming increasingly important.
35/1

Important questions about caches
- How does the search in the cache for a tag proceed?
- Which cache line is replaced after a cache miss?
- When should main memory be updated after a cache write?
- What (how much) data should be read from main memory after a cache miss?

The answers to these questions are vitally important for performance. Different CPUs give different answers. Modern CPUs are very sophisticated in these matters: good answers have a dramatic impact on performance.
36/1

Important questions about caches
Example: which cache line is replaced after a cache miss? Different policies:
- Random choice.
- Least recently used (LRU).
- Most recently used.
- FIFO.
- LIFO.
- ...

See e.g. https://en.wikipedia.org/wiki/Cache_replacement_policies for more.
37/1
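
To make one of these policies concrete, here is a minimal sketch of least-recently-used replacement, in the style of the earlier toy cache. It is not from the slides and all names (lru_victim, mem_read, ...) are invented: every hit refreshes a logical timestamp, and on a miss the line with the oldest timestamp is evicted.

#include <stdint.h>
#include <stdbool.h>

#define CACHE_LINES 4

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t val;
    uint64_t last_used;        /* "time" of the most recent access            */
} CacheLine;

static CacheLine cache[CACHE_LINES];
static uint64_t  now = 0;      /* logical clock, incremented on every access  */

extern uint32_t mem_read(uint32_t loc);   /* stand-in for slow main memory    */

/* Pick the victim line for replacement: an invalid line if there is one,
   otherwise the line whose last use lies furthest in the past.            */
static int lru_victim(void) {
    int victim = 0;
    for (int i = 0; i < CACHE_LINES; i++) {
        if (!cache[i].valid) return i;
        if (cache[i].last_used < cache[victim].last_used) victim = i;
    }
    return victim;
}

uint32_t cache_read_lru(uint32_t loc) {
    now++;
    for (int i = 0; i < CACHE_LINES; i++)
        if (cache[i].valid && cache[i].tag == loc) {
            cache[i].last_used = now;                /* hit refreshes the line */
            return cache[i].val;
        }
    int k = lru_victim();                            /* miss: evict LRU line   */
    cache[k] = (CacheLine){ true, loc, mem_read(loc), now };
    return cache[k].val;
}

Real hardware cannot afford exact timestamps and a full search at cache speed, so it typically uses cheap approximations of LRU.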
Important questions about caches
Example: when should main memory be updated after a cache write? Different policies:
- Immediately on write. This is simple to implement, but can potentially stall the CPU while the write completes.
- Later, e.g. when there is no or little contention for the memory bus. This leads to strange effects from the programmer's point of view, where instructions seemingly get reordered: the programmer says "write this, then read that", but the CPU executes "read that, then write this". The rules governing this reordering are called the CPU's memory model. Modern CPUs have complicated memory models. Compiler writers and assembly programmers need to be aware of this, otherwise their programs will be buggy. 'Normal' programmers can ignore this, since the compiler will sort it out.
38/1

Important questions about caches
Example: what (how much) data should be read from main memory after a cache miss? Different CPUs give different answers. Due to locality, it makes sense to load loc+0, ..., loc+n on a cache fault for location loc. Modern CPUs typically determine n dynamically.
39/1

Important questions about caches
What do real processors use? That's quite hard to say, but it looks complicated. CPU producers seem to keep this secret. If you know more, let me know ...

Why can they keep this secret? Because caching does not affect the results programs compute, only their speed. This gives manufacturers the freedom to change the CPU architecture without telling anyone.
40/1

An artificial example

for ( j <- 0 to 20-1 ) {
  for ( i <- 0 to 100000000-1 ) {
    a[i] = a[i]*b[i] } }

When we run this we execute

a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
a[2] = a[2] * b[2]
a[3] = a[3] * b[3]
a[4] = a[4] * b[4]
...
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
...

Depending on details, every read is a cache miss. So the program runs at the speed of main memory, i.e. slowly.
41/1

An artificial example
What happens if we exchange the loops?

for ( i <- 0 to 100000000-1 ) {
  for ( j <- 0 to 20-1 ) {
    a[i] = a[i]*b[i] } }

When we run this we execute

a[0] = a[0] * b[0]
a[0] = a[0] * b[0]
...
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
...
a[1] = a[1] * b[1]
...

Result:
- The program computes exactly the same results.
- We have a lot fewer cache misses.
42/1

An artificial example
Going from

for ( j <- 0 to 20-1 ) {
  for ( i <- 0 to 100000000-1 ) {
    a[i] = a[i]*b[i] } }

to

for ( i <- 0 to 100000000-1 ) {
  for ( j <- 0 to 20-1 ) {
    a[i] = a[i]*b[i] } }

is called loop interchange. It can speed up code by up to a factor of 10, and some advanced compilers can do it automatically.

See prog/proc.c
43/1
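
The course distributes this experiment as prog/proc.c. The version below is a hedged reconstruction rather than the original file: the array type, sizes, timing advice and all names are my own, but it lets you compare the two loop orders on your machine.

/* Sketch of the loop-interchange experiment (not the original prog/proc.c).
   Compile and time, e.g.:  cc -O0 interchange.c -o interchange
                            time ./interchange slow
                            time ./interchange fast                          */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 100000000L   /* assumed array length, as on the slide */
#define REPS 20

int main(int argc, char **argv) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) { fprintf(stderr, "out of memory\n"); return 1; }
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    if (argc > 1 && strcmp(argv[1], "fast") == 0) {
        /* interchanged loops: each a[i], b[i] stays in the cache
           for all 20 of its updates                               */
        for (long i = 0; i < N; i++)
            for (int j = 0; j < REPS; j++)
                a[i] = a[i] * b[i];
    } else {
        /* original order: by the time we return to a[0] it has long been
           evicted, so (depending on cache size) every access misses      */
        for (int j = 0; j < REPS; j++)
            for (long i = 0; i < N; i++)
                a[i] = a[i] * b[i];
    }
    printf("a[0] = %f\n", a[0]);   /* keep the compiler from removing the loops */
    return 0;
}

With the j-outer order, both arrays are streamed through the cache in full before any element is revisited; with the i-outer order, each element is loaded once and then reused for all 20 of its updates.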
Problem
Modern compilers are very good at managing registers, much better than almost all programmers could be. Compilers are not good at managing caches; this problem is left to the programmer. It is an open research question how to make compilers better at managing caches.

One problem: the OS switches between processes, and a context switch typically 'trashes' the cache, i.e. the cache holds values that are good for the outgoing process, but holds no values of interest to the incoming process.
44/1

Multi-level caches
The picture painted of caches so far is too simplistic in that modern CPUs have not one but (usually) three caches:
- L1: small, very fast.
- L2: medium size, medium speed.
- L3: large, slow (but still much faster than main memory).

(L stands for Level.) Intel has just started adding L4 caches (Haswell).

Processor          L1               L2                L3
ARM Cortex A9      32 KB            128 KB - 8 MB     -
ARM Cortex A15     64 KB            4 MB              -
Intel Ivy Bridge   64 KB per core   256 KB per core   8 MB shared

Why?
45/1

Multi-level caches
There is a fundamental and hard trade-off in caches:
- Smaller caches are faster, i.e. have lower latency.
- Bigger caches have a better (lower) miss rate.

It seems we can't have both: fast caches and a low miss rate.
46/1

Multi-level caches
To deal with the miss-rate/latency trade-off, modern CPUs create a hierarchy of caches: the small but fast L1 cache doesn't read directly from memory but from a bigger but slower L2 cache. In turn, the L2 cache often reads from an even larger and even slower L3 cache. The L3 cache reads from main memory.

[Diagram: CPU (ALU, registers), then L1, L2 and L3 caches, then main memory.]
47/1

Instruction caches
So far we've mostly assumed that our caches can not only be read from, but also written to. This is vital for data. But in most modern computers, instructions can only be read. Caches that are read-only are technically much simpler and hence faster, and take up less chip space. Consequently, modern CPUs often have a separate, fast instruction cache that exploits instruction locality.

[Diagram: the CPU reads data through the L1, L2 and L3 caches and instructions through a separate instruction cache, all backed by main memory.]
48/1

Interesting issues not mentioned
- Modern CPUs have multiple cores, i.e. mini-CPUs. How do cores and caches interact?
- Caches can be used to steal data from other processes (e.g. private keys). How can that be avoided? See e.g. the paper "Cache Missing for Fun and Profit" by Colin Percival. Short summary: this is a serious problem, and Intel has changed its CPU architectures so that caches can be 'switched off' when dealing with secret data.
49/1

Conclusion
Caches in modern CPUs are a form of fast memory that automatically exploits memory locality to speed up execution, often quite considerably. Making good use of the cache typically has a drastic influence on execution speed.

It is currently difficult for compilers to increase data locality (although Stop & Copy garbage collectors help). For most normal programming activities, it's probably not a good idea to worry much about data locality. It's probably more economical to think about better high-level algorithms (or use a faster computer) when you encounter performance problems.

For really high-performance computing, cache-aware programming is vital, and it has a substantial (negative) influence on program structure and portability.
50/1