The Memory Hierarchy
Unit 3
• Learning Outcomes – Feb 12
• Define memory hierarchy.
• Evaluate the performance differences found at the different layers of the hardware memory hierarchy.
• Explain the different kinds of caching that processors and hardware systems perform to mitigate the performance differences between the levels of the memory hierarchy.
CPSC 313 – 2019W2 Memory Hierarchy © 2020 Donald Acton et al
Admin
• Online Quiz 3 becomes available on Friday but it isn’t due until Feb 24th (kind of like reading week never happened)
• A2 due Feb 28
• Final Exam Thursday April 23rd 12:00 PM
System Architecture

[Figure: the CPU — register file, ALU, and bus interface — connects via the system bus to the I/O bridge, which joins the memory bus (to main memory) and the I/O bus. On the I/O bus sit the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]

Pentium Motherboard
Memory system
(The part we are going to look at)

[Figure: the CPU chip (register file, ALU, bus interface) connected via the system bus, I/O bridge, and memory bus to main memory.]
The Memory Hierarchy
Faster (top) to Bigger (bottom):
Registers
L1 Cache
L2 Cache
L3 Cache
L4 Cache
Main Memory
Flash Drive
Disk Drive
The Memory Hierarchy — Size
Intel Broadwell processor released in 2015, 4 cores. Faster (top) to Bigger (bottom):
Registers: ~1 KB (~128 B/core)
L1 Cache: 256 KB (64 KB/core)
L2 Cache: 1 MB (256 KB/core)
L3 Cache: 8 MB
L4 Cache: 128 MB
Main Memory: 32 GB *
Solid State (flash) Drive: ~200 GB – 1 TB
Disk Drive: 2–5 TB
The Memory Hierarchy — Speed
Registers: ~1 KB (~128 B/core), 0.3 ns
L1 Cache: 256 KB (64 KB/core), 1.1 ns
L2 Cache: 1 MB (256 KB/core), 3.3 ns
L3 Cache: 8 MB, 12.8 ns
L4 Cache: 128 MB, 42.4 ns
Main Memory: 32 GB *, 62.9 ns
Solid State (flash) Drive: ~200 GB – 1 TB, ~0.1 ms
Disk Drive: 2–5 TB, ~3 ms
The Memory Hierarchy — Price
Registers: ~1 KB (~128 B/core), 0.3 ns
L1 Cache: 256 KB (64 KB/core), 1.1 ns
L2 Cache: 1 MB (256 KB/core), 3.3 ns
L3 Cache: 8 MB, 12.8 ns
L4 Cache: 128 MB, 42.4 ns
Main Memory: 32 GB *, 62.9 ns, $5–10/GB
Solid State (flash) Drive: ~200 GB – 1 TB, ~0.1 ms, $0.67/GB
Disk Drive: 2–5 TB, ~3 ms, $50/TB
Examples of Memory Technology
• Vacuum Tubes
Vacuum Tube Memory
(retrieved from http://cdn.wccftech.com/images/articles/History-of-RAM/Vacuum-Tube.jpg)
Examples of Memory Technology
• Vacuum Tubes
• Core memory
– Popular from 1955-1975
– 32 kilobits per cubic metre by the late 1960s
– Semiconductor memory started to take over the market in the early 1970s.
https://commons.wikimedia.org/wiki/File:Magnetic-core_memory.JPG Benoitb
64 bits vs 8 GBytes
https://commons.wikimedia.org/wiki/File:8_bytes_vs._8Gbytes.jpg Daniel Sancho
Consider this, or why the hierarchy exists
Picture three adjacent levels of the hierarchy: K-1, K, and K+1.
• If K+1 is the same size as K, then we don't need K+1: everything in K+1 can fit in K.
• If K isn't faster than K+1, then we don't need K.
Implications
• We saw many factors of 10 or 100 in:
– Size
– Performance
– Price
• “When you see a factor of 100, it’s going to affect how you program.” – E. Kohler
• As the ratios between different parts of the system change, so do our priorities.
– 1956:
• $/MB (mem) : $/MB (disk) => $411M : $9200 => 44,673X
– 2015:
• $/MB (mem) : $/MB (disk) => $0.005 : $0.0000317 => 158X
– Cost of memory in 1956 versus 2015:
• 1956 $/MB : 2015 $/MB => $411M / $0.005 => 82 billion…
Caching
• Definition:
– Colloquially: store away in hiding or for future use
– Applied to computation:
• Placing data somewhere it can be accessed more quickly.
• Examples:
– In assembly language: we move data from memory to registers.
– In the hardware: we move data from main memory into memory banks that live on the processor (more on this in a moment).
– In software: we read things into our program’s local buffers and manipulate them there.
Some Rules of Thumb
• How big things are
– Bigger means slower, typically also cheaper
– If A is bigger and faster than B (for same cost), there’s no point in using B
• Where things are
– On chip is much faster
– Off chip is slower but typically much bigger
Today
• Learning Objectives
– Identify location of common caches in the IO path.
– Define
• Cache block
• Cache slot
• Cache hit/miss rate
• Replacement policy
• Direct-mapped cache
• Fully Associative Cache
• N-way Set Associative Cache
– Calculate hit/miss rates
– Map hit/miss rates into performance
Admin
• Midterm 1 handed back Wednesday
– Question 6c/d has been remarked
• Online Quiz 3 becomes available today – due Feb 24th
• Office hours during break
– TAs – check TA office hour link on Canvas – currently looks like some on Tuesday, Thursday, Friday
– My scheduled office hours next week are cancelled
– Will probably have some on Friday and maybe one other day (waiting for some meetings to be scheduled.)
From last class we learned:
• Caching is ubiquitous throughout our computing systems:
– In the processor
– In the operating system
– In databases
– In middleware
– In applications
• Writing efficient and correct software requires a deep understanding of caching and its implications.
An Abstract Cache

[Figure: the cache sits between the Application and the Data Source; the application’s reads and writes go through the cache.]
A Read Cache Hit
1. Application: Please read this item!
2. Cache: Hmmm, do I have that item? Yes!
3. Cache: Here you go!
A Read Cache Miss
1. Application: Please read this item!
2. Cache: Hmmm, do I have that item?
3. No.
4. Cache (to data source): Please get me my item.
5. Data Source: Here you go!
6. Cache (to application): Here you go!
A Read Cache Miss – Decisions!!
The miss path forces three decisions on the cache:
• How much data should I return?
• How do I arrange my data?
• What do I do when I run out of space?
How much data should I return?
• Most (persistent) storage devices have a native data access and/or transmission size, e.g., disk block (4 KB).
• Caches also have a native size.
– E.g., Intel Broadwell has 64-byte cache lines.
– Implication: Data moves in and out of caches in 64-byte chunks.
• Block size: the unit in which data is stored in a cache.
– Broadwell caches: 64 bytes (block size is called cache line size in HW caches)
– File system caches: 4 KB
– Object caches: size of the object
What do I do when I fill up?
• A cache has a limited capacity.
• At some point, the application will fill the cache and request another item.
• Caching the new item requires evicting some other item.
• What item do I evict?
– We need an eviction policy.
– The available decisions here vary between hardware and software.
Eviction in Software
• In a perfect world, we’d like to evict the item that is least valuable.
• In the real world, we don’t know what that item is.
• Practically all software caches try to approximate this ideal.
– LRU: Least-recently-used – find the item that has been unused the longest and get rid of that.
– LFU: Least-frequently-used – find the item that has been used least frequently and get rid of that.
– Clock: Used in virtual memory systems to approximate LRU (you’ll learn more about that either later this semester or in CPSC 415).
– Something tuned to known access patterns.
Eviction: Hardware
• Software algorithms are lovely, but are not nearly fast enough for most hardware, e.g., processor caches.
• A cache is composed of some number of slots.
• For any particular item, limit the number of possible slots in which it can be placed. Call the number of such slots A. Let n be the total number of slots in the cache.
– A = 1: Direct mapped: each object can live in exactly one slot in the cache, so you have no choice but to evict the item in that slot.
– A > 1, A << n: A-way set associative: an object can live in one of A slots; A is typically 2, 4, or 8. On eviction, choose randomly from among the A slots.
– A = n: Fully associative: an object can live in any slot (like a software cache).
• This answers the “How do I organize my cache?” question.
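The three arrangements can be captured in a few lines. A minimal sketch (the function name and interface are illustrative, not from any real hardware) of which slots a block with a given block number may occupy:

```python
def candidate_slots(block_number, n_slots, associativity):
    """Return the slot indices a block may occupy.

    A = 1            -> direct mapped: exactly one slot
    1 < A < n_slots  -> A-way set associative: one set of A slots
    A = n_slots      -> fully associative: any slot
    """
    n_sets = n_slots // associativity
    set_index = block_number % n_sets        # which set the block maps to
    base = set_index * associativity         # first slot of that set
    return list(range(base, base + associativity))

# Direct mapped, 8 slots: block 10 can only go in slot 10 % 8 = 2
print(candidate_slots(10, 8, 1))   # [2]
# 2-way set associative, 8 slots (4 sets): block 10 maps to set 2, slots 4-5
print(candidate_slots(10, 8, 2))   # [4, 5]
# Fully associative, 8 slots: any slot
print(candidate_slots(10, 8, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Note how the set count shrinks as associativity grows: at A = n there is a single set containing every slot, which is exactly the fully associative case.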
Cache Arrangement: Direct Mapped

[Figure: a column of slots; a slot holds a cache line. Each cache line can go in one and only one slot!]
Cache Arrangement: Fully Associative

[Figure: a column of slots; a slot holds a cache line. A cache line can go in ANY slot!]
Cache Arrangement: 2-way Set Associative

[Figure: slots grouped in pairs; a slot holds a cache line. A cache line can go in one of two slots!]
Cache Arrangement: 8-way Set Associative

[Figure: slots grouped in eights; a slot holds a cache line. A cache line can go in one of 8 slots!]
Types of Cache Misses
• Compulsory:
– On first access to an object, you take a miss; there is little you can do about it.
• Capacity:
– You are touching more data than can fit in the cache.
– If your cache were larger, you would have fewer misses.
• Conflict:
– You have something other than a fully associative cache and are taking misses because there is another item occupying the slots you need.
Evaluating a Cache
• Hits are much better than misses!
• We measure the efficiency of a cache in terms of its cache hit rate:
– # cache hits / # cache accesses
– # cache hits / (# cache hits + # cache misses)
• Example:
– I access my cache 1000 times and have 400 hits.
• My cache hit rate is 400/1000 = 40%
• Good performance requires high cache hit rates.
Computing Average Access Time
• Let’s say that it takes 1 time unit to access your cache and 100 time units to access the data source (i.e., the miss cost).
– 10% hit rate: 0.90 * 100 + 0.10 * 1 = 90.1 time units/access
– 50% hit rate: 0.50 * 100 + 0.50 * 1 = 50.5 time units/access
– 90% hit rate: 0.10 * 100 + 0.90 * 1 = 10.9 time units/access
– 99% hit rate: 0.01 * 100 + 0.99 * 1 = 1.99 time units/access
• The ratio between access time for the cache and the data source dictates just how good your hit rate needs to be to obtain good performance from a cache.
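The arithmetic on this slide generalizes to a one-line formula. A small sketch (the function name is illustrative) that reproduces the slide's numbers:

```python
def average_access_time(hit_rate, hit_time, miss_time):
    # average access time = hit_rate * hit_time + miss_rate * miss_time
    return hit_rate * hit_time + (1 - hit_rate) * miss_time

# Hit costs 1 time unit, miss costs 100 (the slide's numbers):
# the results are 90.1, 50.5, 10.9, and 1.99 time units/access.
for rate in (0.10, 0.50, 0.90, 0.99):
    print(f"{rate:.0%} hit rate: {average_access_time(rate, 1, 100):.2f}")
```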
Now you practice
• Let’s say my cache time is 1 unit and my data source access time is 50 time units.
• Given a hit rate of 80%, what is my average access time?
Now you practice
• Let’s say my cache time is 1 unit and my data source access time is 50 time units.
• Given a hit rate of 80%, what is my average access time?
– Hits: .8 * 1 = .8
– Miss: .2 * 50 = 10
– Average access time is 10.8
• That’s 10x slower than the cache!
CPSC 313: Computer Hardware and Operating Systems
Unit 3: Caching – Write Caching
Today
• Learning Objectives
– Get back up to speed on caching – quick overview
– Explain how cache line sizes and stride interact
– Distinguish write back and write through caches
• Reminders
– Quiz 3 due today
– Assignment due on Friday.
Terminology: What do these terms mean?
• cache block (or cache line) (the book calls these blocks)
• the book uses the term line to mean the space in the cache to hold a block and all its associated metadata
• Associativity: (Book terminology)
– A direct mapped cache has one line per set
– An N-way associative cache has N lines per set
• eviction policy
• hit rate – This is how we measure a cache’s efficiency
• miss rate
• average access time
Cache Arrangement

[Figure: the CPU (register file, ALU, bus interface) with its cache, connected to the Data Source (memory or another cache level). A cache line/block/slot holds multiple bytes. Two arrangements are shown: direct mapped and fully associative. None of this is to scale.]
Cache Arrangement

[Figure: the same CPU, cache, and Data Source. In a set associative cache, each grouping of cache lines is a set. A cache line/block/slot holds multiple bytes. None of this is to scale.]
Terminology: What do these terms mean?
• cache block (or cache line) (the book calls these blocks)
• the book uses the term line to mean the space in the cache to hold a block and all its associated metadata
• Associativity: (Book terminology)
– A direct mapped cache has one line per set
– An N-way associative cache has N lines per set
• eviction policy
• hit rate – This is how we measure a cache’s efficiency
• miss rate
• average access time: (Ht * Hr) + (Mt * Mr)
– Ht: access time on hit
– Hr: hit rate
– Mt: access time on miss
– Mr: miss rate (1 – hit rate)
Remember ...
• Caches have some number of slots (places you can put stuff in the cache).
• A cache block (or cache line) occupies a slot.
• The associativity of a cache tells you in which slot(s) a block can reside.
• The eviction policy determines which block/line gets evicted when all slots in which a cache line can reside already have valid data.
• We measure cache efficiency in terms of hit rate.
• The miss rate is just 1 – hit rate.
• Computing average access time:
– Let Ht be the time for a hit
– Let Hr be the hit rate
– Let Mt be the time for a miss
– Let Mr be the miss rate
– Average access time = (Ht * Hr) + (Mt * Mr)
More than one way to get a hit ...
• If you touch the same item more than once, you get a hit, but there is another way to get a hit.
• Think about the fact that your cache is organized in blocks ...
• Consider this:
– Let’s say you are accessing an array of 4-byte integers one after the other.
– A cache line is 64 bytes.
• Here is the question:
– Let’s say that you have 160 items in the array and you’ve never accessed it before; how many cache misses will you take?
How many cache misses?
• The answer is about 10!
• Why?
– A cache line is 64 bytes => that holds 16 4-byte quantities.
– When you miss on the first array element, the cache will bring in an entire block (called a cache line in hardware).
– And since arrays (usually) are laid out contiguously, the rest of the values in the cache line just happen to be the next elements you are going to access!
– So, you miss on element 0, but then hit on elements 1-15.
– You miss on element 16 and hit on elements 17-31.
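A quick sanity check of that arithmetic. This sketch (the function name is illustrative) counts one miss per cache line touched during a cold sequential scan of a contiguous array:

```python
import math

def cold_misses(n_elements, element_size, line_size):
    """Misses for one sequential pass over a contiguous, never-before-accessed
    array: one compulsory miss per cache line touched."""
    total_bytes = n_elements * element_size
    return math.ceil(total_bytes / line_size)

# The slide's question: 160 4-byte integers, 64-byte cache lines
print(cold_misses(160, 4, 64))  # 10
```

The answer is exactly 10 here because 160 * 4 bytes is a whole number of 64-byte lines; the slide says "about 10" because a misaligned array could touch one extra line.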
Access pattern interaction
• Consider the following array; assume each row is a cache line.
Access pattern interaction
• If we access EVERY element of the array, what will our hit rate be?
7 out of 8, or .875, or 87.5%
Access pattern interaction
• If we access EVERY OTHER element of the array, what will our hit rate be?
3 out of 4, or .75, or 75%
Access pattern interaction
• If we access EVERY FOURTH element of the array, what will our hit rate be?
1 out of 2, or .5, or 50%
Access pattern interaction
• If we access EVERY EIGHTH element of the array, what will our hit rate be?
0%
Those were all examples of different strides
• Stride = spacing between accesses.
• If your stride is N, you access 1 data element per N bytes. (A data element could be a byte or a quad word.)

[Plot: time to read data (in ns) vs. size of the array from which we are reading (in MB), with one curve per stride: 64, 128, 256, 512, and 1024 bytes.]
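For strides that evenly divide a cache line, the cold-scan hit rate of the preceding slides follows directly from the block size. A sketch (illustrative names, assuming the slides' 8-elements-per-line rows):

```python
def sequential_hit_rate(elements_per_line, stride):
    """Hit rate for a cold sequential scan that touches every stride-th
    element, assuming stride divides elements_per_line. Each touched line
    costs one miss; the remaining accesses to that line are hits."""
    accesses_per_line = elements_per_line // stride
    if accesses_per_line == 0:
        return 0.0          # stride skips whole lines: every access misses
    return (accesses_per_line - 1) / accesses_per_line

# The slides' array: 8 elements per cache line
print(sequential_hit_rate(8, 1))  # 0.875 (every element)
print(sequential_hit_rate(8, 2))  # 0.75  (every other element)
print(sequential_hit_rate(8, 4))  # 0.5   (every fourth element)
print(sequential_hit_rate(8, 8))  # 0.0   (every eighth element)
```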
A Write Cache Hit
1. Application: Please write this item!
2. Cache: Hmmm, do I have that item?
3. Yes: HERE it is.
4. The new value is stored in the cache.
A Write Cache Hit: Policy Decisions
The cache now holds dirty data (the new value, not yet at the data source).
When do we write new data back to the source?
A Write Cache Hit: Writeback Caching
The new value stays in the cache as dirty data; it is written back to the data source later.
A Write Cache Hit: Write Through
1. Application: Please write this item!
2. Cache: Hmmm, do I have that item?
3. Yes: HERE it is.
4. The new value is stored in the cache.
5. Cache (to data source): Please write this.
6. The new value is written at the data source.
Today
• Learning Objectives
– Complete cache write policies
– Simulate different replacement policies:
• LRU
• LFU
• Belady’s Algorithm
– Map realistic replacement policies to different layers of the hardware/software stack
– Given a workload, cache configuration, and replacement policy, calculate hit/miss rates
Write Policy Trade-offs
• Advantages of writeback cache (= disadvantage of write through)
– Faster (writes complete at cache speed, not original source speed)
– Write coalescing: update the same item many times before writing it back to the source.
• Advantages of writethrough cache (= disadvantage of write back)
– No window of vulnerability (with write back, new data sits in the cache but not the source until it is written back)
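The write-coalescing advantage can be made concrete. A toy sketch (illustrative class names, not a real cache interface) counting how many writes actually reach the data source when the same item is updated 100 times:

```python
class WriteThroughCache:
    def __init__(self):
        self.data, self.source_writes = {}, 0

    def write(self, key, value):
        self.data[key] = value
        self.source_writes += 1  # every write also goes to the source

class WriteBackCache:
    def __init__(self):
        self.data, self.dirty, self.source_writes = {}, set(), 0

    def write(self, key, value):
        self.data[key] = value   # write hits stay in the cache ...
        self.dirty.add(key)      # ... and the block is marked dirty

    def flush(self):             # e.g., on eviction or shutdown
        self.source_writes += len(self.dirty)  # one coalesced write per dirty block
        self.dirty.clear()

wt, wb = WriteThroughCache(), WriteBackCache()
for v in range(100):             # update the same item 100 times
    wt.write("x", v)
    wb.write("x", v)
wb.flush()
print(wt.source_writes, wb.source_writes)  # 100 1
```

The flip side is visible too: until `flush` runs, the write-back cache's new value exists only in the cache, which is exactly the window of vulnerability.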
A Write Cache Miss
1. Application: Please write this item!
2. Cache: Hmmm, do I have that item?
3. No.
A Write Cache Miss – Write Allocate
1. Application: Please write this item!
2. Cache: Hmmm, do I have that item?
3. No.
4. Cache (to data source): Please get me my item.
5. Data Source: Here you go!
6. The new data is stored in the newly allocated cache line.
A Write Cache Miss – No Write Allocate
1. Application: Please write this item!
2. Cache: Hmmm, do I have that item?
3. No.
4. Cache (to data source): Please write this item.
Write Allocation Tradeoffs
• Advantages of write-allocate (= disadvantage of no-write-allocate)
– Faster when you are modifying multiple values in a cache line.
– Matches read behavior.
• Advantages of no-write-allocate (= disadvantage of write-allocate)
– Avoids cluttering the cache with write-only data.
– Avoids consuming a lot of cache space for random writes.
Wrapping Up
• If writes go all the way through to the data source, we call that a write-through cache.
• If writes can stay in the cache, we call the written cache blocks dirty and the cache is write-back.
• If write misses result in a cache line being allocated in the cache, we call that write-allocate.
• If we bypass the cache for write misses, we call that no-write-allocate.
• Rule of thumb: write-back and write-allocate typically go together, as do write-through and no-write-allocate.
On To – Cache Replacement Policy
• When the cache is full (e.g., every slot is occupied by data thought to be useful), determines which block you kick out (evict) to make room for a new request.
• In HW: these are very simple – quite often random (because you have very few choices about where a cache block can be placed).
• In SW: typically fully associative, so you have a lot of choices about which cache block to replace.
Goals of a Good Replacement Policy
1. Some assumptions we make:
– Workloads exhibit locality.
– Spatial locality: you are likely to access data near data you just accessed.
– Temporal locality: you are likely to access data you accessed recently.
2. Evicts a block you won’t need for a long time.
3. Best case: evicts exactly that block that will next be accessed farther in the future than any other block.
– This is called Belady’s algorithm (or the optimal algorithm).
– Unfortunately, it requires that you predict the future.
4. All other algorithms are attempts at approximating Belady’s algorithm.
A cache eviction policy to exploit temporal locality
• LRU (least-recently-used):
– Assumes that data accessed recently is likely to be accessed again.
– Assumes that the least valuable data is that data that has remained unaccessed for the longest period of time.
– Is probably the most commonly used replacement policy.
LRU Example
• Example:
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, B, A, B, C, C, C, A, B
Kinds of Misses
• Compulsory: No matter what your cache looks like, you are going to take a miss.
• Capacity: If your cache were larger, you might not have to take this miss.
• Conflict: If your cache had higher associativity, you might not have to take this miss.
LRU Example
• Example:
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
M, H, M, H, H, M, H, H, M, M
LRU Example
• Example:
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
M, H, M, H, H, M, H, H, M, M
The first miss on each of A, C, and B is compulsory; the final misses on A and C are capacity misses.
LRU Example Compared to Direct Mapped
• Example:
– Assume we have 2 slots in our cache.
– Assume direct mapped: assign a number to each letter (A=1, B=2, C=3); even letters go in one slot, odd in the other.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
Direct Mapped Example
• Example:
• Assume we have 2 slots in our cache.
• Assume direct mapped (A and C share the odd slot; B has the even slot).
• We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
M, H, M, M, M, M, H, H, M, M
The extra misses where A and C evict each other from the odd slot are conflict misses.
Belady Example
• What is the best possible replacement algorithm we could do? Always evict the item that we use farthest in the future.
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
Belady Example
• What is the best possible replacement algorithm we could do? Always evict the item that we use farthest in the future.
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
M, H, M, H, H, M, H, H, H, M
Last time we evicted A because it was least recently used; this time, we evict C because we are going to use A before we use C. That replacement turns the later access to A into a HIT instead of a MISS.
Today
• Learning Objectives
– Simulate different replacement policies:
• LRU
• LFU
• Belady’s Algorithm
– Given a workload, cache configuration, and replacement policy, calculate hit/miss rates
• A2 due tonight
LFU Example
• LFU = Least frequently used
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
LFU Example
• LFU = Least frequently used
– Assume we have 2 slots in our cache.
– Assume fully associative.
– We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
M, H, M, H, H, M, H, H, H, M
When B misses, we have used A 3 times and C twice, so we evict C. Since we evicted C, the later access to A becomes a hit!
Example – LRU, LFU, Belady’s
• Example:
• Assume we have 2 slots in our cache.
• Assume fully associative.
• We access cache blocks in the following order:
A, A, C, A, C, B, B, B, A, C
M, H, M, H, H, M, H, H, M, M – LRU
M, H, M, H, H, M, H, H, H, M – LFU
M, H, M, H, H, M, H, H, H, M – Belady’s
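These hand traces are easy to mechanize. A small simulator sketch (the function and its policy names are illustrative, not from the course materials) that reproduces all three hit/miss strings for this trace:

```python
def simulate(trace, n_slots, policy):
    """Tiny fully associative cache simulator. policy is "LRU", "LFU",
    or "BELADY". Returns the hit/miss string, e.g. "MHMH..."."""
    cache, last_use, counts, result = set(), {}, {}, []
    for i, block in enumerate(trace):
        counts[block] = counts.get(block, 0) + 1   # uses so far (for LFU)
        if block in cache:
            result.append("H")
        else:
            result.append("M")
            if len(cache) == n_slots:              # cache full: pick a victim
                if policy == "LRU":
                    victim = min(cache, key=lambda b: last_use[b])
                elif policy == "LFU":
                    victim = min(cache, key=lambda b: counts[b])
                else:  # BELADY: evict the block whose next use is farthest away
                    rest = trace[i + 1:]
                    victim = max(cache,
                                 key=lambda b: rest.index(b) if b in rest else len(rest))
                cache.remove(victim)
            cache.add(block)
        last_use[block] = i                        # recency (for LRU)
    return "".join(result)

trace = list("AACACBBBAC")
print(simulate(trace, 2, "LRU"))     # MHMHHMHHMM
print(simulate(trace, 2, "LFU"))     # MHMHHMHHHM
print(simulate(trace, 2, "BELADY"))  # MHMHHMHHHM
```

The same function also confirms the adversarial workload: `simulate(list("ABCABCABCABC"), 2, "LRU")` is all misses.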
There is no Perfect Policy
• LRU is frequently good
• LFU is sometimes better
• For anything but Belady, we can construct an adversarial workload that can be pretty devastating.
• Assume a 2 slot, fully associative cache with LRU replacement
– Using only 3 different data items, produce an ordering whose hit rate is 0!
When LRU is Terrible
• Assume a 2 slot, fully associative cache with LRU replacement.
– Using only 3 different data items, produce an ordering whose hit rate is 0!
A B C A B C A B C A B C
When LRU is Terrible
• Assume a 2 slot, fully associative cache with LRU replacement.
– Using only 3 different data items, produce an ordering whose hit rate is 0!
A B C A B C A B C A B C
M M M M M M M M M M M M
What Would Belady’s Algorithm Do?
• Assume the same 2 slot, fully associative cache, now with Belady’s replacement.
– Same ordering that gave LRU a hit rate of 0:
A B C A B C A B C A B C
What Would Belady’s Algorithm Do?
• Assume the same 2 slot, fully associative cache, now with Belady’s replacement.
A B C A B C A B C A B C
M M M H M H M H M H M H
(Evict B, then A, then C, then B, then A, ...)
That’s a 40% hit rate! Ignoring compulsory misses, this produces a 50% hit rate in the steady state. This algorithm is MRU!
MRU (Most Recently Used)
• Food for thought:
– Given a fully associative cache with N slots
– Given data that occupies N+1 slots
– In the steady state (ignoring compulsory misses), what hit rate will MRU produce on a workload that iterates over the entire data set repeatedly?
– What hit rate will LRU produce on that workload?
Evaluating Caches
• By now you should be pretty good at calculating hit/miss rates.
• Sometimes we also want to compare different cache architectures or determine how a change might influence performance.
• We often use speedup as a way to describe how much a change improves (or harms) performance.
Amdahl’s Law
• Who is Amdahl and what is his law?
– Gene Amdahl: American computer architect who worked at IBM for many years and then started Amdahl Corporation.
• Amdahl’s Law:
– Originally designed to quantify the opportunities and benefits of parallelism.
– If you have a program that has a sequential part and a parallelizable part, when you parallelize it, you will only improve the performance of the latter part. So, it’s good to know just how much parallelizing the application will help.
Let’s Apply Amdahl’s Law to Caching
• Amdahl’sLawStates:Overallspeedupisafunctionofboththe amount you improve some part of the system and the fraction of time you spend in that part of the system.
• Given:
– Latency of the original system is Torig
– You can speed up part of a system by k
AKA – execution time, elapsed time
– The program spends α (< 1) of its time in the part of the system you’re
going to speed up.
• Latencyaftertheimprovement:
–Tnew =(1-α)Told +(αTold)/k=Told((1–α)+α/k) – Speedup = Told / Tnew = 1/((1 – α) + α/k)
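The two formulas above can be written as small helpers (Python, for illustration; function names are my own):

```python
def amdahl_speedup(alpha: float, k: float) -> float:
    """Overall speedup when a fraction alpha of execution time
    is sped up by a factor of k (Amdahl's Law)."""
    return 1.0 / ((1.0 - alpha) + alpha / k)

def new_latency(t_old: float, alpha: float, k: float) -> float:
    """Latency after the improvement: Told * ((1 - alpha) + alpha/k)."""
    return t_old * ((1.0 - alpha) + alpha / k)
```

For example, `amdahl_speedup(0.25, 2)` gives the ~1.14x speedup worked out on the next slide.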
Amdahl’s Law Visually
[Figure: a bar of length Told split into an unaffected portion (1 − α) and an improved portion α; after the improvement the bar has length Tnew = Told((1 − α) + α/k), with the α portion shrunk to α/k.]
Speedup = 1 / ((1 − α) + α/k)
Amdahl’s Law Example
Speedup = 1 / ((1 − α) + α/k)
• Suppose that a program that takes 100 seconds to complete spends 25% of its time in memory instructions. If we make memory instructions twice as fast, how much faster will the program run?
– α =
– k =
– Told =
– Speedup =
– Latency =
Amdahl’s Law Example
Speedup = 1 / ((1 − α) + α/k)
• Suppose that a program that takes 100 seconds to complete spends 25% of its time in memory instructions. If we make memory instructions twice as fast, how much faster will the program run?
– α = .25
– k = 2
– Told = 100
– Speedup = 1 / ((1 − .25) + .25/2) = 1 / .875 = 1.14
– Latency = 100 * .875 = 87.5 seconds
Now You Practice
• Suppose that a program that takes 100 seconds to complete spends 50% of its time in memory instructions. If we make memory instructions 67% faster, how much faster will the program run?
• Careful! Something that is X% faster speeds up by a factor of 1.X.
– α =
– k =
– Told =
– Speedup =
– Latency =
Now You Practice
• Suppose that a program that takes 100 seconds to complete spends 50% of its time in memory instructions. If we make memory instructions 67% faster, how much faster will the program run?
• Careful! Something that is X% faster speeds up by a factor of 1.X.
– α = 0.5
– k = 1.67
– Told = 100
– Speedup = 1 / ((1 − .5) + .5/1.67) = 1 / 0.8 = 1.25
– Latency = 100 * .8 = 80 seconds
Plot of Amdahl's Law
[Figure: overall speedup (0–10) vs. speedup of the optimized part of the program (1–10), one curve per value of α (0.99, 0.8, 0.6, 0.4, 0.2, 0.001): unless α is large, even a big speedup of the optimized part yields little overall speedup.]
Today
• Upcoming:
– Quiz 4 will be out on Friday
– A3 is being finalized and will be out shortly
• Learning Objectives
– Identify the meta-data needed to manage a cache
– Figure out how to extract cache tags from addresses
What is this meta-data of which we speak?
• At the very beginning of this unit, and then in some definitions from the book, we talked about metadata that we need to manage the cache. It’s time to talk about that metadata.
– Metadata: data that describes and provides information about other data.
– Example: book metadata: the author, the publisher, the date of publication, the printing, its persistent identifier, ...
– Example: file metadata: the size of the file, when it was last written, where it resides on persistent store, who owns the file, who can access the file, ...
• Let’s think about what we have to do with caches and reverse engineer the kind of metadata that we need to keep.
Metadata for Replacement
• What would we need to record to implement the following policies?
• LRU
• LFU
Metadata for Replacement
• What would we need to record to implement the following policies?
• LRU
– A way to know the last time a block was accessed
• A time stamp
• A linked list
• LFU
– A count of the number of times something was accessed
– Maybe an index to find the minimum quickly (but you need to be able to update that index quickly as well)
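As a sketch (Python, illustrative; the class and method names are my own), the two kinds of bookkeeping look like:

```python
from collections import OrderedDict, defaultdict

class LRUTracker:
    """Tracks recency with an ordered structure (the linked-list idea):
    each access moves the block to the most-recently-used end."""
    def __init__(self):
        self.order = OrderedDict()
    def access(self, block):
        self.order.pop(block, None)
        self.order[block] = True          # re-insert at the MRU end
    def victim(self):
        return next(iter(self.order))     # least recently used block

class LFUTracker:
    """Tracks access counts; the victim is the block with the smallest count."""
    def __init__(self):
        self.counts = defaultdict(int)
    def access(self, block):
        self.counts[block] += 1
    def victim(self):
        return min(self.counts, key=self.counts.get)
```

Real hardware approximates these structures with a few bits per line rather than full lists or counters.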
Let’s talk about hits and misses
• Assume a direct mapped cache
• What information do you need to figure out if an access is a hit or a miss?
Let’s talk about hits and misses
• Assume a direct mapped cache
• What information do you need to figure out if an access is a hit or a miss?
– Where would my data go?
– Is my data in that slot?
Mapping Data to a Cache Slot: Conceptually
• We need some kind of map data structure that tells you, for any possible data item, where to look for it in the cache.
Mapping Data to a Cache Slot: Software
• How might you implement this map?
– You can implement this literally as a map.
– You might implement this as a hash table.
– You might implement this as a custom data structure (i.e., if you know something about your data, you might be able to do something clever)
– If your cache is small, you might exhaustively search.
Mapping Data to a Cache Slot: Hardware
• How might you implement this map? It has to be fast!
• What do you know about the data and how you will look it up?
– The cache will either contain a complete cache line or no cache line.
– A single data access references only part of a cache line (probably).
– You look up entries in the cache by address.
Bits and Tags
Assume that each box here represents one byte. How large is a cache line?
[Figure: a cache line drawn as a row of one-byte boxes]
Bits and Tags
Assume that each box here represents one byte. How large is a cache line? 16 bytes
[Figure: a cache line drawn as a row of 16 one-byte boxes]
Bits and Tags and Addresses
Let’s say the address of this item is 0x1000; what are the addresses of the remaining items in the line?
[Figure: a 16-byte cache line with the first byte labeled 0x1000]
Bits and Tags and Addresses
Assume that each box here represents one byte.
Let’s say the address of this item is 0x1000; what are the addresses of the remaining items in the line?
0x1001, 0x1002, 0x1003, ..., 0x100F
Bits and Tags and Addresses
Address of the first byte of each cache line (offsets in the line run 0–F; the second hex digit is shown in binary):
0x1000 (0000), 0x1010 (0001), 0xB0A0 (1010), 0x2030 (0011), 0xC0C0 (1100), 0xBAD0 (1101), 0xD0E0 (1110), 0xEFF0 (1111)
How do we use these observations?
• The low-order 4 bits of the address (the low-order nibble or the low-order hex digit) tells you which item (in this case, byte) in the cache line you are accessing.
• The next 3 bits of the address select the place in the cache that a cache line goes.
• Provide two addresses that will map to the same cache line:
– 0x1030 (3 = 0011)
– 0x10B0 (B = 1011)
• Also: 0xFF30, 0xFFB0, 0xAB30, 0xDCB0, etc. ...
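For the running example (16-byte lines, 8 lines, so 4 offset bits and 3 index bits), the decomposition can be sketched as (Python, illustrative; the function name is my own):

```python
# Decompose an address for the running example:
# 16-byte lines -> 4 offset bits; 8 lines -> 3 index bits; tag = the rest.
OFFSET_BITS = 4
INDEX_BITS = 3

def split_address(addr: int):
    offset = addr & 0xF                        # low 4 bits: byte in the line
    index = (addr >> OFFSET_BITS) & 0x7        # next 3 bits: cache slot
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # remaining high bits
    return tag, index, offset
```

For instance, `split_address(0x1030)` and `split_address(0x10B0)` both give index 3, confirming the collision above.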
Where do we place items in the cache?
• We use some bits in the address to identify the slot(s) to which a line maps.
• In this case, we used the lowest bits that were not part of the offset into the cache line.
• Would it make sense to use high order bits?
– No!
– Data is frequently accessed sequentially; if we chose high order bits, then lots of sequential cache lines would map to the same place.
– Consider taking the top 3 bits: all cache lines between 0xC000 and 0xE000 would map to the same place!
Where do we place items in the cache?
• We use some bits in the address to identify the slot(s) to which a line maps.
• In this case, we used the lowest bits that were not part of the offset into the cache line.
• Can you think of any reason NOT to just use the lowest bits that are not part of the offset?
– Yes! What if we are doing a strided access?
– Let’s say that we want to read 1 byte out of every 1024 bytes; the addresses would be:
• 0x0000, 0x0400, 0x0800, 0x0C00, 0x1000, 0x1400, 0x1800, 0x1C00, ...
Which bits should we pick?
• There is no uniformly good answer:
– The lowest available bits (i.e., excluding the offset bits) can cause bad behavior for strides.
– High bits cause bad behavior for sequential access.
– If you have to pick one, low bits are better as sequential is a more common pattern.
– Alternately, perhaps you pick some bits from the lowest available and a few from places that would cause common stride patterns (e.g., 1KB, 4KB).
What if we have a set associative cache?
Consider a 2-way set associative cache.
[Figure: cache lines grouped into cache sets — each cache index has two locations, so an address with index 3 can map to either slot in its set.]
Stride and Associativity
• Let’s say that:
– We use the same cache organization that we started with (direct mapped; 16-byte cache lines).
– We use bits 4-6 as the cache index.
– We have two 1 KB arrays (e.g., char a1[1024], a2[1024]); assume a1[0] and a2[0] have the same index bits.
• If we copy bytes one at a time from a1 to a2, what will the hit rate be?
• If we make the cache twice as big, but it remains direct mapped, what will the hit rate be?
• If we make the cache twice as big by making it 2-way set associative, what will the hit rate be?
Stride and Associativity
• Let’s say that:
– We use the same cache organization that we started with (direct mapped; 16-byte cache lines).
– We use bits 4-6 as the cache index.
– We have two 1 KB arrays (e.g., char a1[1024], a2[1024]); assume a1[0] and a2[0] have the same index bits.
• If I copy bytes one at a time from a1 to a2, what will the hit rate be? 0
• If I make the cache twice as big, but it remains direct mapped, what will the hit rate be? 0
• If I make the cache twice as big by making it 2-way set associative, what will the hit rate be? 15/16!
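These three answers can be checked with a tiny simulator (Python, illustrative; LRU replacement within each set, and base addresses chosen so a1[0] and a2[0] share index bits):

```python
# Tiny set-associative cache model to check the hit rates above.
# Each set is a list of tags in LRU order (least recent first).

def run_copy(num_sets, ways, line_size=16, a1=0x0000, a2=0x1000, n=1024):
    sets = [[] for _ in range(num_sets)]
    hits = accesses = 0
    def access(addr):
        nonlocal hits, accesses
        accesses += 1
        index = (addr // line_size) % num_sets
        tag = addr // (line_size * num_sets)
        s = sets[index]
        if tag in s:
            hits += 1
            s.remove(tag)
        elif len(s) == ways:
            s.pop(0)            # evict the least recently used line
        s.append(tag)           # most recently used tag goes to the end
    for i in range(n):
        access(a1 + i)   # read a1[i]
        access(a2 + i)   # write a2[i]
    return hits / accesses
```

With 8 sets and 1 way the two arrays evict each other on every access; with 8 sets and 2 ways both lines fit, and only the two cold misses per 16-byte chunk remain.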
Just one more thing...
• How do we distinguish a hit from a miss?
• Assume our original cache configuration:
– Let’s say that I access 0x1000
– Then I request the item at address 0x0104
• Where does each of those map in the cache? Cache index 0!
• But the second one should be a miss: how do I know that?
• We need to store metadata with each cache entry. We need at least:
– A valid bit indicating if there is any valid data in the cache at a given entry
– A tag that identifies exactly which data is in the cache.
Today
• Learning Outcomes for Today
– Tags & misses
– Explain the challenges that arise in hardware caches in the presence of multiple cores and/or processors.
– Define each of the states in the MESI protocol
– Relate multi-core cache behavior to concurrency control in multi-threaded programs.
• Assignment 2 is out, due March 20th – good practice for MT 2
• Quiz 4 available Friday – due Monday 23:59
The parts of an address
Let’s assume the original cache:
• 1-byte elements
• 16 per cache line
• 8 cache lines
• 16-bit addresses
Address layout: Tag (9 bits) | Index (3 bits) | Offset (4 bits)
Cache with metadata
Each of the 8 cache slots holds a valid bit, a 9-bit cache tag, and a 16-byte line (the offset in the cache line is the low 4 bits of the address, 0–F). The 3-bit index identifies the cache slot:
Index:  0      1      2      3      4      5      6      7
Tag:    0x020  0x020  0x161  0x040  0x181  0x175  0x1A1  0x1DF
Where did those tags come from??
Take the address of the first byte in each line and split it: the low 4 bits are the offset, the next 3 bits are the cache index, and the top 9 bits are the cache tag.
Example: 0x1000 = 0001 0000 0000 0000 → offset 0000, cache index 000, cache tag 0 0100 0000 = 0x020.
Where did those tags come from??
Address (hex)   Address (binary)       9-bit cache tag
0x1000          0001 0000 0000 0000    0x020
0x1010          0001 0000 0001 0000    0x020
0xB0A0          1011 0000 1010 0000    0x161
0x2030          0010 0000 0011 0000    0x040
0xC0C0          1100 0000 1100 0000    0x181
0xBAD0          1011 1010 1101 0000    0x175
0xD0E0          1101 0000 1110 0000    0x1A1
0xEFF0          1110 1111 1111 0000    0x1DF
Check your Understanding
• Let’s say that we have a cache with the following configuration:
– 64-byte cache lines
– 512 cache lines
– 32-bit addresses
• How many bits are in the tag?
– Offset in the cache line: 64 bytes => 6 bits
– Index into the cache: 512 lines => 9 bits
– So, 32 – (6 + 9) = 32 – 15 = 17-bit tags
• Would this number change if the cache were 4-way set associative (and we still stored a total of 512 cache lines)?
– Yes! Each index would refer to a set of 4 lines, so we would only have 128 indexes.
– The cache index would only be 7 bits, so the tag must be 19 bits.
• What if we stored 512 sets, each of 4 lines?
– The tag structure would stay the same, but we would have made the cache 4 times larger!
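The arithmetic above generalizes to a small helper (Python, illustrative; assumes power-of-two sizes, and the function name is my own):

```python
import math

def tag_bits(addr_bits, line_size, total_lines, ways=1):
    """Tag bits = address bits - offset bits - index bits,
    where the index selects one of total_lines/ways sets."""
    offset_bits = int(math.log2(line_size))
    index_bits = int(math.log2(total_lines // ways))
    return addr_bits - offset_bits - index_bits
```

Note that storing 512 sets of 4 lines (2048 lines total) gives the same 17-bit tag as the original direct-mapped configuration, just as the slide argues.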
Remember …
[Figure: multicore processor organization — each execution core (running one or more hyperthreads) has a private per-core L1 cache; each pair of cores shares an L2 cache; the L2 caches connect to memory.]
Let’s assume all caches are write-back
• What problems do you see that we have to solve when we have caches in the presence of multiple processors (or cores)?
– Two cores/processors might access the same data – that could result in two copies of the same cache line present in two different caches.
– Core 1 might wish to access data that is dirty in Core 2’s cache.
– Processor 2 might wish to access data that is dirty in some cache on Processor 1.
• Consequently, if the hardware doesn’t do anything to avoid it, cores/processors can see incorrect data.
Core 1 reads X; Core 1 writes X
[Figure: Core 1 reads X (0x1234) from memory; after Core 1 writes X, its L1 cache holds the new value 0xCAFE while the shared L2 cache and memory still hold 0x1234.]
Core 2 reads X
[Figure: Core 2 reads X and gets the stale value 0x1234 from the shared L2 cache — Core 1’s 0xCAFE is visible only in Core 1’s L1.]
Processor 1/Core 1 reads X
[Figure: X (0x1234) is fetched from memory into Processor 1’s L2 cache and that core’s L1 cache.]
Processor 2/Core 1 reads X
[Figure: X (0x1234) is now cached by both processors — it appears in an L1 and an L2 on each, and in memory.]
Processor 1/Core 1 writes X — Stale Data
[Figure: Processor 1/Core 1 writes a new value to X in its own caches; Processor 2’s caches still hold the stale 0x1234.]
Cache Coherence
• Typically, it is the hardware’s job to make sure that programs cannot see such inconsistencies.
• Two parts to a solution:
– Use hardware to ensure that loads from all cores will return the value of the latest store to that memory location.
– Use cache metadata to track the state of cached data.
• There are two major approaches to addressing this problem:
– Snoopy caches
– Directory-based coherence
Let’s Simplify Our Problem
[Figure: a bus-based “snooping” multicore processor — two execution cores, each with a private L1 cache, connected by an intercore bus, with memory below.]
Invalidation Protocol with Snooping
• Invalidation:
– If a core writes to a data item, all other copies of this data item in other caches are invalidated.
• Snooping:
– All cores continuously snoop (monitor) the bus connecting the cores.
Let’s have Core 1 read X; Core 2 reads X
[Figure: both cores read X (0x1234) from memory; each core’s L1 cache now holds a copy.]
What happens this time if Core 1 writes X?
[Figure: Core 1 writes X = 0xCAFE in its L1 cache and puts an invalidation msg on the bus; Core 2, snooping the bus, invalidates its copy of X (0x1234). Memory still holds 0x1234.]
Enhancing Cache Metadata
• When we have caches accessible from multiple cores and we have a bus-based coherence protocol, we need to extend our valid bit to a set of states.
• Cache operations can:
– Change state
– Send invalidate requests
– Request cache lines in a particular state (fetch)
• A minimal set of states (MSI Model):
– Assumes a writeback cache
– M: Cache line is modified (i.e., dirty)
– S: Cache line is shared; appears in at least one cache
– I: Cache line is invalid (i.e., contains no valid data)
Enhancing Cache Metadata — the MSI transitions
– M: local read/write → M; remote read/write → S (write data back)
– S: local read → S; local write → M (sends invalidate); remote read → S; remote write → I
– I: local read → S (fetch shared); local write → M (fetch modified)
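The diagram’s transitions can be sketched as a lookup table (Python, illustrative; the event names are my own, and the table follows this slide’s simplified diagram rather than a full hardware implementation):

```python
# MSI transitions keyed by (current_state, event).
# Events: 'local_read', 'local_write', 'remote_read', 'remote_write'.
MSI = {
    ('I', 'local_read'):   'S',  # fetch shared
    ('I', 'local_write'):  'M',  # fetch modified
    ('S', 'local_read'):   'S',
    ('S', 'local_write'):  'M',  # sends invalidate
    ('S', 'remote_read'):  'S',
    ('S', 'remote_write'): 'I',
    ('M', 'local_read'):   'M',
    ('M', 'local_write'):  'M',
    ('M', 'remote_read'):  'S',  # write data back
    ('M', 'remote_write'): 'S',  # write data back (per this diagram)
}

def step(state, event):
    # Events not in the table (e.g., remote traffic on an I line)
    # leave the state unchanged.
    return MSI.get((state, event), state)
```

Driving `step` with a sequence of local and remote events reproduces the two-core walkthrough on the following slides.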
Today
• Learning Outcomes for Today
– Define each of the states in the MESI protocol
– Relate multi-core cache behavior to concurrency control in multi-threaded programs.
– Introduction to disk drives and file system API
• Assignment 2 is out, due March 20th – good practice for MT 2
• Quiz 4 available Friday – due Monday 23:59
Enhancing Cache Metadata — walking through MSI
[Slides: the two-core MSI state diagram is stepped through for the sequence — initial state, Core 1 reads, Core 2 reads, Core 1 writes, Core 2 writes, Core 1 reads — with each core’s copy of the line moving among the M, S, and I states.]
But wait, …
• Doesn’t this get expensive if, for example, I am the only one who has something cached?
• Yes!
– If that data is not shared, you are still executing a large number of costly cache invalidations.
– So ...
Let’s add another state! MESI
• E: Exclusive — the data does not live in any other cache.
• Transitions (from the state diagram):
– I: exclusive read miss → E; shared read miss → S; write miss → M
– E: local read → E; local write → M (no invalidate needed); remote read → S; remote write request → I
– S: read (local or remote) → S; local write (sends invalidate) → M; remote write request → I
– M: local read/write → M; remote read (write data back) → S; remote write request (write data back) → I
The Alternative: Directory Based Coherence
• A directory stores all the coherence information
• Processors consult the directory before sending coherency messages
– Information in the directory is similar to the cache states:
• Shared: Lives in more than one cache
• Uncached: Lives in 0 caches
• Exclusive: Lives in exactly 1 cache
– Must also track which processor(s) have the data
• Rather than broadcasting on the bus, a core consults the directory and sends messages only to affected cores.
End of Caching
• Use Assignment 3 to help you study for the midterm. – It’s all about caching!
– Due date is March 20
– Do it now!
• And don’t forget the Quiz due Monday!