Compilers and computer architecture: Caches and caching
Martin Berger
December 2015
Recall the function of compilers
Caches in modern CPUs
Let’s look at a modern CPU. Here is the IBM Z196 CPU with 4 cores. Most of the silicon is for the caches and cache controllers.
Caches in modern CPUs
Why is most of the chip area dedicated to caches?
Caches in modern CPUs
Today we will learn about caches in modern CPUs. They are crucial for high-performance programs and high-performance compilation.
Today’s material can safely be ignored by ’normal’ programmers who don’t care about performance.
Simplified computer layout
[Diagram: CPU (ALU and registers, fast) connected via a bus to volatile memory (RAM, medium speed) and to disks/flash (slow).]
The faster the memory, the faster the computer.
Available memory

              Capacity           Latency             Cost
Register      1000s of bytes     1 cycle             ££££
SRAM          1s of MBytes       several cycles      £££
DRAM          10s of GBytes      20 – 100 cycles     £
Flash         100s of GBytes     –                   ££
Hard disk     10 TBytes          0.5 – 5 M cycles    cheap
Ideal         1000s of MBytes    1 cycle             cheap
RAM = Random Access Memory.
SRAM = static RAM, fast but uses 6 transistors per bit.
DRAM = dynamic RAM, slow but uses 1 transistor per bit.
Flash = non-volatile, slow, loses capacity over time.
Latency is the time between issuing the read/write access and the completion of this access.
It seems memory is either small, fast and expensive, or cheap, big and slow.
Tricks for making the computer run faster
Key ideas:
Use a hierarchy of memory technology.
Keep the most often-used data in a small, fast memory. But how? Does that mean the programmer has to worry about this? No: let the CPU worry about this, using a mechanism called a cache.
Use slow main memory only rarely.
Why on earth should this work?
Data locality!
Data locality
In practice, most programs, most of the time, do the following:
If the program accesses memory location loc at time t, then the next few memory accesses are going to be at addresses close to loc. (Moreover, memory locations that have been used are likely to be used again.)
In other words, we use access to a memory location as a heuristic prediction of memory accesses in the (near) future. This is called (data) locality. Data locality means that memory access often (but not always) follows a fairly predictable pattern. Modern CPUs exploit this predictability for speed.
Note: it is possible to write programs that don’t exhibit data locality. They will typically run very slowly.
But why would most programs exhibit locality?
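To make this concrete before the next examples, here is a small illustrative sketch of my own (not from the lecture; names and sizes are arbitrary). The first loop sweeps an array sequentially and has good locality; the second updates pseudo-random indices and defeats locality, so it will typically run noticeably slower on a large array, though the exact ratio depends on the machine.

```java
import java.util.Random;

// Two loops doing the same number of array updates.
public class LocalityDemo {
    public static void main(String[] args) {
        int n = 50_000_000;
        int[] a = new int[n];
        Random rnd = new Random(42);

        // Good locality: consecutive iterations touch neighbouring addresses.
        for (int i = 0; i < n; i++)
            a[i] = i + 1;

        // Poor locality: each iteration jumps to an unrelated address,
        // so past accesses predict nothing about future ones.
        for (int i = 0; i < n; i++)
            a[rnd.nextInt(n)] = i + 1;
    }
}
```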
Data locality
int[] a = new int[1000000];
for (int i = 0; i < 1000000; i++) {
    a[i] = i + 1;
}
[Plot: memory address vs. time — the accesses form a steadily rising line.]
Data locality
int[] a = new int[1000000];
...
for (int i = 2; i < 1000000; i++) {
    a[i] = a[i-1] * a[i-2];
}
[Plot: memory address vs. time — again a rising line, with each step also reading the two neighbouring elements just written.]
Data locality
int[] a = new int[1000000];
int[] b = new int[1000000];
...
for (int i = 0; i < 1000000; i++) {
    a[i] = b[i] + 1;
}
[Plot: memory address vs. time — two parallel rising bands, one for the accesses to a and one for the accesses to b.]
Code locality
Program execution (reading via PC) is local too, with occasional jumps.
addiu $sp $sp
li $a0 1
lw $t1 4($sp)
sub $a0 $t1 $a0
addiu $sp $sp
sw $a0 0($sp)
addiu $sp $sp
jal next
lw $t1 4($sp)
add $a0 $t1 $a0
addiu $sp $sp
b exit2
[Plot: instruction (memory) address vs. time — long straight runs of consecutive addresses, interrupted by occasional jumps.]
Data locality
Another cause for data locality is the stack and how we compile procedure invocations into activation records.
This is because within a procedure activation we typically spend a lot of time accessing the procedure arguments and local variables.
In addition, in recursive procedure invocations, related activation records are nearby on the stack.
[Diagram: the stack after main calls f(3), which recursively calls f(2), f(1) and f(0). Below main’s activation record sit the records for f(3), f(2), f(1) and f(0), each holding a result slot, the argument, a control link and the return address (“in main” for the first, “recursive” for the rest).]
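As a tiny, hypothetical example (the function below is my own illustration, chosen to match the diagram): calling f(3) builds activation records for f(3), f(2), f(1) and f(0) next to each other on the stack, and while f runs, most memory accesses go to the argument and saved state in these neighbouring records.

```java
// Hypothetical recursive function matching the activation-record diagram.
public class Recursion {
    static int f(int n) {
        if (n == 0) return 1;      // deepest activation record: f(0)
        return n * f(n - 1);       // f(n)'s record sits right next to f(n-1)'s
    }
    public static void main(String[] args) {
        System.out.println(f(3));  // creates the four records shown in the diagram
    }
}
```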
Data locality
In practice we have data access and instruction access together, so the access patterns look more like this:
[Plot: memory address vs. time, with two kinds of accesses interleaved — instruction accesses in one region of memory and data accesses in another.]
Still a lot of predictability in memory access patterns, but over (at least) two distinct regions of memory.
Data locality
Accessing objects, especially for method invocation, often has bad data locality because of pointer chasing. Object pointers can point anywhere inside the heap, losing locality.
Garbage collectors like Stop & Copy improve locality because they compact the heap.
[Diagram: an instance of class A holds a pointer dptr to the class description of A, which holds the class name “A” and method descriptions pointing to the bodies of f_A, g_A, ...]
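As an illustration (my own sketch, not from the lecture): summing a linked list follows a pointer to wherever each node happens to live in the heap, whereas summing an array walks consecutive addresses. How large the resulting performance gap is depends on the allocator, the garbage collector and the machine.

```java
// Array traversal (sequential addresses) vs. linked-list traversal
// (pointer chasing through the heap).
public class PointerChasing {
    static final class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] array = new int[n];
        Node list = null;
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list = new Node(i, list);   // nodes live wherever the allocator puts them
        }

        long arraySum = 0;
        for (int i = 0; i < n; i++)     // good locality: consecutive elements
            arraySum += array[i];

        long listSum = 0;
        for (Node p = list; p != null; p = p.next)   // pointer chasing
            listSum += p.value;

        System.out.println(arraySum + " " + listSum);
    }
}
```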
How to exploit locality and the memory hierarchy?
Two approaches
Expose the hierarchy (proposed by S. Cray): let programmers access registers, fast SRAM, slow DRAM and the disk ’by hand’. Tell them “Use the hierarchy cleverly”.
Hide the memory hierarchy. Programming model: there is a single kind of memory and a single address space (excluding registers). The CPU automatically assigns locations to fast or slow memory, depending on usage patterns.
[Diagram: CPU (ALU and registers) with an on-chip SRAM cache (fast), DRAM main memory (medium), and disks/swap space (slow).]
Cache
The key element is the cache, which is integrated into modern CPUs.
Cache
A CPU cache is used by the CPU of a computer to reduce the average time to access memory. The cache is a smaller, faster and more expensive memory which stores copies of the data from the most frequently used main memory locations for fast access. When the CPU reads from or writes to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory. The cache is typically inside the CPU.
Recall that latency (of memory access) is the time between issuing the read/write access and the completion of this access.
Cache (highly simplified)
[Diagram: CPU, a small cache, and main memory. The cache holds address/value pairs copied from main memory, e.g. address 77 storing 6, address 119 storing 2, address 15 storing 99 and address 1112 storing 321.]
The cache contains temporary copies of selected main memory locations, e.g. Mem[119] = 2.
The cache holds pairs of a main memory address (called the tag) and a value.
A cache entry is called a cache line.
Goal of a cache: to reduce the average access time to memory by exploiting locality.
Caches are made from expensive but fast SRAM, with much less capacity than main memory. So not all memory entries can be in cache.
Cache reading (highly simplified)
Cache ’algorithm’: if the CPU wants to read memory location loc:
Look for tag loc in the cache.
If a cache line (loc, val) is found (called a cache hit), then return val.
If no cache line contains tag loc (called a cache miss), then select some cache line k for replacement, read location loc from main memory getting value val, and replace k with (loc, val).
Cache writing (highly simplified)
Cache ’algorithm’: if the CPU wants to write the value val to memory location loc:
Write val to main memory location loc.
Look for tag loc in the cache.
If a cache line (loc, val’) is found (cache hit), then replace val’ with val in the cache line.
If no cache line contains tag loc (cache miss), then select some cache line k for replacement and replace k with (loc, val).
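To make the two ’algorithms’ above concrete, here is a toy software model of my own (not how real hardware is built, and all names are invented): a fully associative cache with random eviction and the write-to-memory-on-every-write policy described above, with an array standing in for main memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Toy model of the read/write behaviour described above. Real caches are
// fixed-size hardware structures, not hash maps; this only mirrors the
// bookkeeping: look up the tag, hit or miss, evict, install.
class ToyCache {
    private final int capacity;                                   // number of cache lines
    private final Map<Integer, Integer> lines = new HashMap<>();  // tag (address) -> value
    private final int[] mainMemory;                               // stands in for DRAM
    private final Random random = new Random();
    int hits = 0, misses = 0;

    ToyCache(int capacity, int memorySize) {
        this.capacity = capacity;
        this.mainMemory = new int[memorySize];
    }

    int read(int loc) {
        Integer val = lines.get(loc);              // look for tag loc in the cache
        if (val != null) { hits++; return val; }   // cache hit: return val
        misses++;                                  // cache miss:
        int fromMemory = mainMemory[loc];          //   read loc from main memory
        install(loc, fromMemory);                  //   evict some line, store (loc, val)
        return fromMemory;
    }

    void write(int loc, int val) {
        mainMemory[loc] = val;                     // write val to main memory location loc
        if (lines.containsKey(loc)) {              // hit: replace val' with val in the line
            hits++;
            lines.put(loc, val);
        } else {                                   // miss: evict some line, store (loc, val)
            misses++;
            install(loc, val);
        }
    }

    private void install(int loc, int val) {
        if (lines.size() >= capacity) {            // select some cache line for replacement
            Integer[] tags = lines.keySet().toArray(new Integer[0]);
            lines.remove(tags[random.nextInt(tags.length)]);
        }
        lines.put(loc, val);
    }
}
```

For example, new ToyCache(4, 1024) followed by repeated reads of the same four addresses would report almost only hits, while a sweep over all 1024 addresses would report mostly misses.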
Note
All these things (writing to main memory, looking for tag, replacing cache line, evicting etc) happen automatically, behind the programmer’s back. It’s all implemented in silicon, so cache management is very fast.
Unless interested in peak performance, the programmer can program under the illusion of memory uniformity.
Important questions about Cache
How does the search in the cache for a tag proceed?
Which cache line is replaced after a cache miss?
When to update main memory after cache write?
What (how much) data to read from main memory after a cache miss?
The answers to these questions are vitally important for performance. Different CPUs give different answers.
Let’s look at some example cache actions.
Successful cache read (highly simplified)
[Diagram sequence: the CPU issues “read 119”; the tag 119 is found in the cache, so the cached value 2 is returned to the CPU without accessing main memory.]
Cache failure on read (highly simplified)
[Diagram sequence: the CPU issues “read 222”; the tag 222 is not in the cache (cache fault), so one cache line is selected and evicted, the value 12 is read from main memory location 222, the new line (222, 12) is installed in the cache, and 12 is returned to the CPU.]
Successful cache write (highly simplified)
[Diagram sequence: the CPU issues “store 3 at 77”; the tag 77 is found in the cache (hit), so both the cache line and main memory location 77 are updated to hold 3.]
Write with cache failure (highly simplified)
[Diagram sequence: the CPU issues “store 3 at 85”; the tag 85 is not in the cache (miss), so a cache line is selected and evicted, 3 is written to main memory location 85, and the new line (85, 3) is installed in the cache.]
Miss rate
Miss rate = (number of cache misses) / (number of cache accesses).
The performance of a program can be significantly improved if the miss rate is low.
Miss rate is influenced by many factors:
Selection policy for cache eviction.
Compiler improving data locality (e.g. in garbage collection).
Programmer ensuring data locality.
Cache size (bigger = better). Modern computers have multiple caches (see later).
Miss penalty
The miss penalty of a cache is the time it takes to read from/write to main memory minus the time it takes to read from/write to the cache.
The miss penalty has been increasing dramatically, because CPUs get faster more quickly than memory. So good cache management is becoming increasingly important.
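A purely illustrative calculation (the numbers are invented, in the spirit of the latency table earlier): with a 1-cycle cache hit, a 100-cycle main-memory access and a miss rate of 5%, the average memory access costs roughly 0.95 × 1 + 0.05 × 100 ≈ 6 cycles; at a 1% miss rate it drops to about 2 cycles. Small changes in miss rate can therefore change the average memory latency by a large factor.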
Important questions about Cache
How does the search in the cache for a tag proceed?
Direct mapped: each memory address can only be at one specific cache location. Simple, hence fast, but inflexible, with a potentially bad miss rate.
Partially/fully associative caches: a memory address can be at most/all slots in the cache. This is flexible and has a lower miss rate than direct mapped, but is very complicated to implement.
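To make “each memory address can only be at one specific cache location” concrete, here is a sketch of my own (the sizes are assumptions, not from the lecture) of how a direct-mapped cache with 256 slots of 64 bytes typically splits an address into a slot index and a tag:

```java
// Address breakdown for a hypothetical direct-mapped cache:
// 256 slots of 64 bytes each (16 KB in total). The low bits pick the byte
// within the line, the next bits pick the slot, and the rest form the tag.
public class DirectMappedIndex {
    static final int LINE_SIZE = 64;   // bytes per cache line
    static final int NUM_SLOTS = 256;  // number of cache lines

    static int slot(long address) {
        return (int) ((address / LINE_SIZE) % NUM_SLOTS);
    }

    static long tag(long address) {
        return address / ((long) LINE_SIZE * NUM_SLOTS);
    }

    public static void main(String[] args) {
        // Two addresses exactly 16 KB apart map to the same slot but carry
        // different tags, so in a direct-mapped cache they evict each other.
        long a = 0x12340L, b = a + (long) LINE_SIZE * NUM_SLOTS;
        System.out.println(slot(a) + " " + slot(b));  // same slot
        System.out.println(tag(a) + " " + tag(b));    // different tags
    }
}
```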
Important questions about Cache
Which cache line is replaced after a cache miss?
Different policies:
Random choice.
Least recently used.
...
Important questions about Cache
When to update main memory after cache write? Different policies:
Immediately on write. This is simple to implement, but can potentially stall the CPU while the write completes.
Later, e.g. when there is no/little contention for the memory bus. This leads to strange effects from the programmer’s point of view, where instructions seem to be reordered: the programmer says “write this, then read that”, but the CPU executes “read that, then write this”. The rules governing such reordering are called the CPU’s memory model. Modern CPUs have complicated memory models. Assembler programmers need to be aware of this, otherwise their programs will be buggy.
Important questions about Cache
What (how much) data to read from main memory after a cache miss?
Different CPUs give different answers. Due to locality, it makes sense to also load locations loc+1, ..., loc+n on a cache fault for location loc. Modern CPUs typically have cache lines of 64–128 bytes, which are loaded as a whole on a cache fault.
That’s why they are called cache lines, and not cache entries.
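A small illustration of my own (64 bytes is a typical but not universal line size): with 4-byte ints, 16 consecutive array elements share one cache line, so a miss on a[0] also brings a[1], ..., a[15] into the cache.

```java
// Which cache line does a[i] fall into, assuming 64-byte lines and 4-byte ints?
public class CacheLines {
    public static void main(String[] args) {
        final int LINE_BYTES = 64, INT_BYTES = 4;
        for (int i = 0; i < 20; i++) {
            int lineStart = (i * INT_BYTES / LINE_BYTES) * LINE_BYTES;  // start of a[i]'s line
            System.out.println("a[" + i + "] lies in the line starting at byte " + lineStart);
        }
    }
}
```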
A practical example
for ( j <- 0 to 20-1 ) {
for ( i <- 0 to 100000000-1 ) {
a[i] = a[i]*b[i] } }
When we run this we execute
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
a[2] = a[2] * b[2]
a[3] = a[3] * b[3]
a[4] = a[4] * b[4]
...
Depending on details, every read is a cache miss. So the program runs at the speed of main memory, i.e. slowly.
A practical example
What happens if we exchange loops?
for ( i <- 0 to 100000000-1 ) {
for ( j <- 0 to 20-1 ) {
a[i] = a[i]*b[i] } }
When we run this we execute
a[0] = a[0] * b[0]
a[0] = a[0] * b[0]
...
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
...
a[1] = a[1] * b[1]
...
Result:
The program computes exactly the same results.
We have a lot fewer cache misses.
A practical example
Going from
for ( j <- 0 to 20-1 ) {
for ( i <- 0 to 100000000-1 ) {
a[i] = a[i]*b[i] } }
to
for ( i <- 0 to 100000000-1 ) {
for ( j <- 0 to 20-1 ) {
a[i] = a[i]*b[i] } }
is called loop interchange. It can speed up code by up to a factor of 10, and some advanced compilers can do it automatically.
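For readers who want to try the effect themselves, here is a self-contained Java version of my own (array size chosen so it finishes quickly; the measured speedup depends on the machine and the JIT, and is not guaranteed to match the factor quoted above):

```java
// Rough comparison of the two loop orders. Timings are indicative only.
public class LoopInterchange {
    static final int N = 10_000_000, REPS = 20;

    static void repeatOutside(int[] a, int[] b) {   // sweeps both arrays REPS times
        for (int j = 0; j < REPS; j++)
            for (int i = 0; i < N; i++)
                a[i] = a[i] * b[i];
    }

    static void repeatInside(int[] a, int[] b) {    // reuses a[i], b[i] while they are cached
        for (int i = 0; i < N; i++)
            for (int j = 0; j < REPS; j++)
                a[i] = a[i] * b[i];
    }

    public static void main(String[] args) {
        int[] a = new int[N], b = new int[N];
        java.util.Arrays.fill(a, 1);
        java.util.Arrays.fill(b, 1);

        long t0 = System.nanoTime();
        repeatOutside(a, b);
        long t1 = System.nanoTime();
        repeatInside(a, b);
        long t2 = System.nanoTime();

        System.out.printf("j-outer: %d ms, i-outer: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```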
A practical example
Modern compilers are very good at managing registers, much better than almost all programmers could.
Compilers are not good at managing caches. This problem is left to the programmer. It is an open research question how to make compilers better at managing caches.
Multi-level caches
The picture painted about caches so far is too simplistic in that modern CPUs have not one but (usually) three caches:
L1. Small, very fast.
L2. Medium size, medium speed.
L3. Large size, slow.
(L stands for Level.)
Processor           L1                L2                 L3
ARM Cortex A9       32 KB             128 KB – 8 MB      –
ARM Cortex A15      64 KB             4 MB               –
Intel Ivy Bridge    64 KB per core    256 KB per core    8 MB shared
Why?
Multi-level caches
There is a fundamental and hard trade-off in caches:
Smaller caches are faster, i.e. have lower latency.
Bigger caches have a better miss rate.
It seems we can’t have both: fast caches and a low miss rate.
Multi-level caches
To deal with the miss-rate/latency trade-off, modern CPUs create a hierarchy of caches: the small but fast L1 cache doesn’t read directly from memory but from a bigger but slower L2 cache. (In turn, the L2 cache often reads from an even larger and even slower L3 cache.) The L3 cache reads from main memory.
[Diagram: CPU (ALU and registers) backed by the L1, L2 and L3 caches in turn, with the L3 cache backed by main memory.]
Instruction caches
So far we’ve mostly assumed that our caches can not only be read from, but also written to. This is vital for data.
But in most modern computers, instructions can only be read. Caches that are read-only are technically much simpler and hence faster, and take up less chip space.
Consequently, modern CPUs often have a separate and fast instruction cache that exploits instruction locality.
[Diagram: CPU (ALU and registers) with a separate instruction cache in addition to the L1, L2 and L3 caches, all backed by main memory.]
Conclusion
Caches in modern CPUs are a form of fast memory that automatically exploits memory locality to speed up execution, often quite considerably.
Making good use of the cache typically has drastic influences on execution speed.
It is currently difficult for compilers to increase data locality (although Stop & Copy garbage collectors help).
For most normal programming activities, it’s probably not a good idea to worry much about data locality. It’s probably more economical to buy a faster computer if you encounter performance problems.
For really high-performance computing, cache-aware programming is vital, and it has a substantial (negative) influence on program structure and portability.