Compilers and computer architecture: Caches and caching
Martin Berger
December 2015
Recall the function of compilers
Caches in modern CPUs
Let’s look at a modern CPU. Here is the IBM Z196 CPU with 4 cores. Most of the silicon is for the caches and cache controllers.
Caches in modern CPUs
Why is most of the chip area dedicated to caches?
Caches in modern CPUs
Today we will learn about caches in modern CPUs. They are crucial for high-performance programs and high-performance compilation.
Today’s material can safely be ignored by ’normal’ programmers who don’t care about performance.
Simplified computer layout
[Diagram: CPU (ALU and registers, fast) connected via a bus to volatile memory (RAM, medium speed) and to disks/flash (slow).]
The faster the memory, the faster the computer.
Available memory

              Capacity           Latency             Cost
Register      1000s of bytes     1 cycle             ££££
SRAM          1s of MBytes       several cycles      £££
DRAM          10s of GBytes      20 – 100 cycles     £
Flash         100s of GBytes     –                   ££
Hard disk     10 TBytes          0.5 – 5 M cycles    cheap
Ideal         1000s of MBytes    1 cycle             cheap
RAM = Random Access Memory.
SRAM = static RAM, fast but uses 6 transistors per bit.
DRAM = dynamic RAM, slow but uses 1 transistor per bit.
Flash = non-volatile, slow, loses capacity over time.
Latency is the time between issuing the read/write access and the completion of this access.
It seems memory is either small, fast and expensive, or cheap, big and slow.
Tricks for making the computer run faster
Key ideas:
Use a hierarchy of memory technology.
Keep the most often-used data in a small, fast memory. But how? Does that mean the programmer has to worry about this? No: let the CPU worry about this, using a mechanism called a cache.
Use slow main memory only rarely.
Why on earth should this work?
Data locality!
Data locality
In practice, most programs, most of the time, do the following:
If the program accesses memory location loc at time t, then the next few memory accesses are going to be at addresses close to loc. (Moreover, memory locations that have been used are likely to be used again.)
In other words, we use access to a memory location as a heuristic prediction of memory accesses in the (near) future. This is called (data) locality. Data locality means that memory access often (but not always) follows a fairly predictable pattern. Modern CPUs exploit this predictability for speed.
Note: it is possible to write programs that don’t exhibit data locality. They will typically run very slowly.
But why would most programs exhibit locality?
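To make this concrete before the next examples, here is a small illustrative sketch of my own (not from the lecture; names and sizes are arbitrary). The first loop sweeps an array sequentially and has good locality; the second updates pseudo-random indices and defeats locality, so it will typically run noticeably slower on a large array, though the exact ratio depends on the machine.

```java
import java.util.Random;

// Two loops doing the same number of array updates.
public class LocalityDemo {
    public static void main(String[] args) {
        int n = 50_000_000;
        int[] a = new int[n];
        Random rnd = new Random(42);

        // Good locality: consecutive iterations touch neighbouring addresses.
        for (int i = 0; i < n; i++)
            a[i] = i + 1;

        // Poor locality: each iteration jumps to an unrelated address,
        // so past accesses predict nothing about future ones.
        for (int i = 0; i < n; i++)
            a[rnd.nextInt(n)] = i + 1;
    }
}
```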
Data locality
int[] a = new int[1000000];
for (int i = 0; i < 1000000; i++) {
    a[i] = i + 1;
}
[Plot: memory address vs. time — the accesses form a steadily rising line.]
Data locality
int[] a = new int[1000000];
...
for (int i = 2; i < 1000000; i++) {
    a[i] = a[i-1] * a[i-2];
}
[Plot: memory address vs. time — again a rising line, with each step also reading the two neighbouring elements just written.]
Data locality
int[] a = new int[1000000];
int[] b = new int[1000000];
...
for (int i = 0; i < 1000000; i++) {
    a[i] = b[i] + 1;
}
[Plot: memory address vs. time — two parallel rising bands, one for the accesses to a and one for the accesses to b.]
Code locality
Program execution (reading via PC) is local too, with occasional jumps.
addiu $sp $sp
li $a0 1
lw $t1 4($sp)
sub $a0 $t1 $a0
addiu $sp $sp
sw $a0 0($sp)
addiu $sp $sp
jal next
lw $t1 4($sp)
add $a0 $t1 $a0
addiu $sp $sp
b exit2
[Plot: instruction (memory) address vs. time — long straight runs of consecutive addresses, interrupted by occasional jumps.]
Data locality
Another cause for data locality is the stack and how we compile procedure invocations into activation records.
This is because within a procedure activation we typically spend a lot of time accessing the procedure arguments and local variables.
In addition, in recursive procedure invocations, related activation records are nearby on the stack.
[Diagram: the stack after main calls f(3), which recursively calls f(2), f(1) and f(0). Below main’s activation record sit the records for f(3), f(2), f(1) and f(0), each holding a result slot, the argument, a control link and the return address (“in main” for the first, “recursive” for the rest).]
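As a tiny, hypothetical example (the function below is my own illustration, chosen to match the diagram): calling f(3) builds activation records for f(3), f(2), f(1) and f(0) next to each other on the stack, and while f runs, most memory accesses go to the argument and saved state in these neighbouring records.

```java
// Hypothetical recursive function matching the activation-record diagram.
public class Recursion {
    static int f(int n) {
        if (n == 0) return 1;      // deepest activation record: f(0)
        return n * f(n - 1);       // f(n)'s record sits right next to f(n-1)'s
    }
    public static void main(String[] args) {
        System.out.println(f(3));  // creates the four records shown in the diagram
    }
}
```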
Data locality
In practice we have data access and instruction access together, so the access patterns look more like this:
[Plot: memory address vs. time, with two kinds of accesses interleaved — instruction accesses in one region of memory and data accesses in another.]
Still a lot of predictability in memory access patterns, but over (at least) two distinct regions of memory.
Data locality
Accessing objects, especially for method invocation, often has bad data locality because of pointer chasing. Object pointers can point anywhere inside the heap, losing locality.
Garbage collectors like Stop & Copy improve locality because they compact the heap.
[Diagram: an instance of class A holds a pointer dptr to the class description of A, which holds the class name “A” and method descriptions pointing to the bodies of f_A, g_A, ...]
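As an illustration (my own sketch, not from the lecture): summing a linked list follows a pointer to wherever each node happens to live in the heap, whereas summing an array walks consecutive addresses. How large the resulting performance gap is depends on the allocator, the garbage collector and the machine.

```java
// Array traversal (sequential addresses) vs. linked-list traversal
// (pointer chasing through the heap).
public class PointerChasing {
    static final class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] array = new int[n];
        Node list = null;
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list = new Node(i, list);   // nodes live wherever the allocator puts them
        }

        long arraySum = 0;
        for (int i = 0; i < n; i++)     // good locality: consecutive elements
            arraySum += array[i];

        long listSum = 0;
        for (Node p = list; p != null; p = p.next)   // pointer chasing
            listSum += p.value;

        System.out.println(arraySum + " " + listSum);
    }
}
```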
How to exploit locality and the memory hierarchy?
Two approaches
Expose the hierarchy (proposed by S. Cray): let programmers access registers, fast SRAM, slow DRAM and the disk ’by hand’. Tell them “Use the hierarchy cleverly”.
Hide the memory hierarchy. Programming model: there is a single kind of memory and a single address space (excluding registers). The CPU automatically assigns locations to fast or slow memory, depending on usage patterns.
[Diagram: CPU (ALU and registers) with an on-chip SRAM cache (fast), DRAM main memory (medium), and disks/swap space (slow).]
Cache
The key element is the cache, which is integrated into modern CPUs.
Cache
A CPU cache is used by the CPU of a computer to reduce the average time to access memory. The cache is a smaller, faster and more expensive memory which stores copies of the data from the most frequently used main memory locations for fast access. When the CPU reads from or writes to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory. The cache is typically inside the CPU.
Recall that latency (of memory access) is the time between issuing the read/write access and the completion of this access.
Cache (highly simplified)
[Diagram: CPU, a small cache, and main memory. The cache holds address/value pairs copied from main memory, e.g. address 77 storing 6, address 119 storing 2, address 15 storing 99 and address 1112 storing 321.]
The cache contains temporary copies of selected main memory locations, e.g. Mem[119] = 2.
The cache holds pairs of a main memory address (called the tag) and a value.
A cache entry is called a cache line.
Goal of a cache: to reduce the average access time to memory by exploiting locality.
Caches are made from expensive but fast SRAM, with much less capacity than main memory. So not all memory entries can be in cache.
Cache reading (highly simplified)
Cache ’algorithm’: if the CPU wants to read memory location loc:
Look for tag loc in the cache.
If a cache line (loc, val) is found (called a cache hit), then return val.
If no cache line contains tag loc (called a cache miss), then select some cache line k for replacement, read location loc from main memory getting value val, and replace k with (loc, val).
Cache writing (highly simplified)
Cache ’algorithm’: if the CPU wants to write the value val to memory location loc:
Write val to main memory location loc.
Look for tag loc in the cache.
If a cache line (loc, val’) is found (cache hit), then replace val’ with val in the cache line.
If no cache line contains tag loc (cache miss), then select some cache line k for replacement and replace k with (loc, val).
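To make the two ’algorithms’ above concrete, here is a toy software model of my own (not how real hardware is built, and all names are invented): a fully associative cache with random eviction and the write-to-memory-on-every-write policy described above, with an array standing in for main memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Toy model of the read/write behaviour described above. Real caches are
// fixed-size hardware structures, not hash maps; this only mirrors the
// bookkeeping: look up the tag, hit or miss, evict, install.
class ToyCache {
    private final int capacity;                                   // number of cache lines
    private final Map<Integer, Integer> lines = new HashMap<>();  // tag (address) -> value
    private final int[] mainMemory;                               // stands in for DRAM
    private final Random random = new Random();
    int hits = 0, misses = 0;

    ToyCache(int capacity, int memorySize) {
        this.capacity = capacity;
        this.mainMemory = new int[memorySize];
    }

    int read(int loc) {
        Integer val = lines.get(loc);              // look for tag loc in the cache
        if (val != null) { hits++; return val; }   // cache hit: return val
        misses++;                                  // cache miss:
        int fromMemory = mainMemory[loc];          //   read loc from main memory
        install(loc, fromMemory);                  //   evict some line, store (loc, val)
        return fromMemory;
    }

    void write(int loc, int val) {
        mainMemory[loc] = val;                     // write val to main memory location loc
        if (lines.containsKey(loc)) {              // hit: replace val' with val in the line
            hits++;
            lines.put(loc, val);
        } else {                                   // miss: evict some line, store (loc, val)
            misses++;
            install(loc, val);
        }
    }

    private void install(int loc, int val) {
        if (lines.size() >= capacity) {            // select some cache line for replacement
            Integer[] tags = lines.keySet().toArray(new Integer[0]);
            lines.remove(tags[random.nextInt(tags.length)]);
        }
        lines.put(loc, val);
    }
}
```

For example, new ToyCache(4, 1024) followed by repeated reads of the same four addresses would report almost only hits, while a sweep over all 1024 addresses would report mostly misses.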
Note
All these things (writing to main memory, looking for tag, replacing cache line, evicting etc) happen automatically, behind the programmer’s back. It’s all implemented in silicon, so cache management is very fast.
Unless interested in peak performance, the programmer can program under the illusion of memory uniformity.
Important questions about Cache
How does the search in the cache for a tag proceed?
Which cache line is replaced after a cache miss?
When to update main memory after cache write?
What (how much) data to read from main memory after a cache miss?
The answers to these questions are vitally important for performance. Different CPUs give different answers.
Let’s look at some example cache actions.
Successful cache read (highly simplified)
[Diagram sequence: the CPU issues “read 119”; the tag 119 is found in the cache, so the cached value 2 is returned to the CPU without accessing main memory.]
Cache failure on read (highly simplified)
[Diagram sequence: the CPU issues “read 222”; the tag 222 is not in the cache (cache fault), so one cache line is selected and evicted, the value 12 is read from main memory location 222, the new line (222, 12) is installed in the cache, and 12 is returned to the CPU.]
Successful cache write (highly simplified)
[Diagram sequence: the CPU issues “store 3 at 77”; the tag 77 is found in the cache (hit), so both the cache line and main memory location 77 are updated to hold 3.]
Write with cache failure (highly simplified)
[Diagram sequence: the CPU issues “store 3 at 85”; the tag 85 is not in the cache (miss), so a cache line is selected and evicted, 3 is written to main memory location 85, and the new line (85, 3) is installed in the cache.]
Miss rate
Miss rate = (number of cache misses) / (number of cache accesses).
The performance of a program can be significantly improved if the miss rate is low.
Miss rate is influenced by many factors:
Selection policy for cache eviction.
Compiler improving data locality (e.g. in garbage collection).
Programmer ensuring data locality.
Cache size (bigger = better). Modern computers have multiple caches (see later).
Miss penalty
The miss penalty of a cache is the time it takes to read from/write to main memory minus the time it takes to read from/write to the cache.
The miss penalty has been increasing dramatically, because CPUs get faster more quickly than memory. So good cache management is becoming increasingly important.
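A purely illustrative calculation (the numbers are invented, in the spirit of the latency table earlier): with a 1-cycle cache hit, a 100-cycle main-memory access and a miss rate of 5%, the average memory access costs roughly 0.95 × 1 + 0.05 × 100 ≈ 6 cycles; at a 1% miss rate it drops to about 2 cycles. Small changes in miss rate can therefore change the average memory latency by a large factor.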
Important questions about Cache
How does the search in the cache for a tag proceed?
Direct mapped: each memory address can only be at one specific cache location. Simple, hence fast, but inflexible, with a potentially bad miss rate.
Partially/fully associative caches: a memory address can be at most/all slots in the cache. This is flexible and has a lower miss rate than direct mapped, but is very complicated to implement.
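To make “each memory address can only be at one specific cache location” concrete, here is a sketch of my own (the sizes are assumptions, not from the lecture) of how a direct-mapped cache with 256 slots of 64 bytes typically splits an address into a slot index and a tag:

```java
// Address breakdown for a hypothetical direct-mapped cache:
// 256 slots of 64 bytes each (16 KB in total). The low bits pick the byte
// within the line, the next bits pick the slot, and the rest form the tag.
public class DirectMappedIndex {
    static final int LINE_SIZE = 64;   // bytes per cache line
    static final int NUM_SLOTS = 256;  // number of cache lines

    static int slot(long address) {
        return (int) ((address / LINE_SIZE) % NUM_SLOTS);
    }

    static long tag(long address) {
        return address / ((long) LINE_SIZE * NUM_SLOTS);
    }

    public static void main(String[] args) {
        // Two addresses exactly 16 KB apart map to the same slot but carry
        // different tags, so in a direct-mapped cache they evict each other.
        long a = 0x12340L, b = a + (long) LINE_SIZE * NUM_SLOTS;
        System.out.println(slot(a) + " " + slot(b));  // same slot
        System.out.println(tag(a) + " " + tag(b));    // different tags
    }
}
```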
Important questions about Cache
Which cache line is replaced after a cache miss?
Different policies:
Random choice.
Least recently used.
...
Important questions about Cache
When to update main memory after cache write? Different policies:
Immediately on write. This is simple to implement, but can potentially stall the CPU while the write completes.
Later, e.g. when there is no/little contention for the memory bus. This leads to strange effects from the programmer’s point of view, where instructions seem to be reordered: the programmer says “write this, then read that”, but the CPU executes “read that, then write this”. The rules governing such reordering are called the CPU’s memory model. Modern CPUs have complicated memory models. Assembler programmers need to be aware of this, otherwise their programs will be buggy.
Important questions about Cache
What (how much) data to read from main memory after a cache miss?
Different CPUs give different answers. Due to locality, it makes sense to also load locations loc+1, ..., loc+n on a cache fault for location loc. Modern CPUs typically have cache lines of 64–128 bytes, which are loaded as a whole on a cache fault.
That’s why they are called cache lines, and not cache entries.
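A small illustration of my own (64 bytes is a typical but not universal line size): with 4-byte ints, 16 consecutive array elements share one cache line, so a miss on a[0] also brings a[1], ..., a[15] into the cache.

```java
// Which cache line does a[i] fall into, assuming 64-byte lines and 4-byte ints?
public class CacheLines {
    public static void main(String[] args) {
        final int LINE_BYTES = 64, INT_BYTES = 4;
        for (int i = 0; i < 20; i++) {
            int lineStart = (i * INT_BYTES / LINE_BYTES) * LINE_BYTES;  // start of a[i]'s line
            System.out.println("a[" + i + "] lies in the line starting at byte " + lineStart);
        }
    }
}
```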
A practical example
for ( j <- 0 to 20-1 ) {
for ( i <- 0 to 100000000-1 ) {
a[i] = a[i]*b[i] } }
When we run this we execute
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
a[2] = a[2] * b[2]
a[3] = a[3] * b[3]
a[4] = a[4] * b[4]
...
Depending on details, every read is a cache miss. So the program runs at the speed of main memory, i.e. slowly.
A practical example
What happens if we exchange loops?
for ( i <- 0 to 100000000-1 ) {
for ( j <- 0 to 20-1 ) {
a[i] = a[i]*b[i] } }
When we run this we execute
a[0] = a[0] * b[0]
a[0] = a[0] * b[0]
...
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
...
a[1] = a[1] * b[1]
...
Result:
The program computes exactly the same results.
We have a lot fewer cache misses.
A practical example
Going from
for ( j <- 0 to 20-1 ) {
for ( i <- 0 to 100000000-1 ) {
a[i] = a[i]*b[i] } }
to
for ( i <- 0 to 100000000-1 ) {
for ( j <- 0 to 20-1 ) {
a[i] = a[i]*b[i] } }
is called loop interchange. It can speed up code by up to a factor of 10, and some advanced compilers can do it automatically.
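For readers who want to try the effect themselves, here is a self-contained Java version of my own (array size chosen so it finishes quickly; the measured speedup depends on the machine and the JIT, and is not guaranteed to match the factor quoted above):

```java
// Rough comparison of the two loop orders. Timings are indicative only.
public class LoopInterchange {
    static final int N = 10_000_000, REPS = 20;

    static void repeatOutside(int[] a, int[] b) {   // sweeps both arrays REPS times
        for (int j = 0; j < REPS; j++)
            for (int i = 0; i < N; i++)
                a[i] = a[i] * b[i];
    }

    static void repeatInside(int[] a, int[] b) {    // reuses a[i], b[i] while they are cached
        for (int i = 0; i < N; i++)
            for (int j = 0; j < REPS; j++)
                a[i] = a[i] * b[i];
    }

    public static void main(String[] args) {
        int[] a = new int[N], b = new int[N];
        java.util.Arrays.fill(a, 1);
        java.util.Arrays.fill(b, 1);

        long t0 = System.nanoTime();
        repeatOutside(a, b);
        long t1 = System.nanoTime();
        repeatInside(a, b);
        long t2 = System.nanoTime();

        System.out.printf("j-outer: %d ms, i-outer: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```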
A practical example
Modern compilers are very good at managing registers, much better than almost all programmers could.
Compilers are not good at managing caches. This problem is left to the programmer. It is an open research question how to make compilers better at managing caches.
Multi-level caches
The picture painted about caches so far is too simplistic in that modern CPUs have not one but (usually) three caches:
L1. Small, very fast.
L2. Medium size, medium speed.
L3. Large size, slow.
(L stands for Level.)
Processor           L1                L2                 L3
ARM Cortex A9       32 KB             128 KB – 8 MB      –
ARM Cortex A15      64 KB             4 MB               –
Intel Ivy Bridge    64 KB per core    256 KB per core    8 MB shared
Why?
Multi-level caches
There is a fundamental and hard trade-off in caches:
Smaller caches are faster, i.e. have lower latency.
Bigger caches have a better miss rate.
It seems we can’t have both: fast caches and a low miss rate.
Multi-level caches
To deal with the miss-rate/latency trade-off, modern CPUs create a hierarchy of caches: the small but fast L1 cache doesn’t read directly from memory but from a bigger but slower L2 cache. (In turn, the L2 cache often reads from an even larger and even slower L3 cache.) The L3 cache reads from main memory.
[Diagram: CPU (ALU and registers) backed by the L1, L2 and L3 caches in turn, with the L3 cache backed by main memory.]
Instruction caches
So far we’ve mostly assumed that our caches can not only be read from, but also written to. This is vital for data.
But in most modern computers, instructions can only be read. Caches that are read-only are technically much simpler and hence faster, and take up less chip space.
Consequently, modern CPUs often have a separate and fast instruction cache that exploits instruction locality.
[Diagram: CPU (ALU and registers) with a separate instruction cache in addition to the L1, L2 and L3 caches, all backed by main memory.]
Conclusion
Caches in modern CPUs are a form of fast memory that automatically exploits memory locality to speed up execution, often quite considerably.
Making good use of the cache typically has drastic influences on execution speed.
It is currently difficult for compilers to increase data locality (although Stop & Copy garbage collectors help).
For most normal programming activities, it’s probably not a good idea to worry much about data locality. It’s probably more economical to buy a faster computer if you encounter performance problems.
For really high-performance computing, cache-aware programming is vital, and it has a substantial (negative) influence on program structure and portability.