
The Memory Hierarchy
Unit 3
• Learning Outcomes – Feb 12
• Define memory hierarchy.
• Evaluate the performance differences found at the different layers of the hardware memory hierarchy.
• Explain the different kinds of caching that processors and hardware systems perform to mitigate the performance differences between the levels of the memory hierarchy.
CPSC 313 – 2019W2 Memory Hierarchy © 2020 Donald Acton et al

Admin
• Online Quiz 3 becomes available on Friday but it isn’t due until Feb 24th (kind of like reading week never happened)
• A2 due Feb 28
• Final Exam Thursday April 23rd 12:00 PM

System Architecture

CPU (register file, ALU) <-> bus interface <-> system bus <-> I/O bridge <-> memory bus <-> main memory

The I/O bridge also connects to the I/O bus, which hosts the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.

Pentium Motherboard

Memory system
(The part we are going to look at)
CPU chip (register file, ALU) <-> bus interface <-> system bus <-> I/O bridge <-> memory bus <-> main memory

The Memory Hierarchy

Registers -> L1 Cache -> L2 Cache -> L3 Cache -> L4 Cache -> Main Memory -> Flash Drive -> Disk Drive

(Levels get bigger going down the hierarchy and faster going up.)

The Memory Hierarchy — Size
(Intel Broadwell processor, released in 2015, 4 cores)

Registers                   ~1 KB (~128 B/core)
L1 Cache                    256 KB (64 KB/core)
L2 Cache                    1 MB (256 KB/core)
L3 Cache                    8 MB
L4 Cache                    128 MB
Main Memory                 32 GB *
Solid State (flash) Drive   ~200 GB – 1 TB
Disk Drive                  2–5 TB

(Bigger going down; faster going up.)

The Memory Hierarchy — Speed

Registers                   ~1 KB (~128 B/core)     0.3 ns
L1 Cache                    256 KB (64 KB/core)     1.1 ns
L2 Cache                    1 MB (256 KB/core)      3.3 ns
L3 Cache                    8 MB                    12.8 ns
L4 Cache                    128 MB                  42.4 ns
Main Memory                 32 GB *                 62.9 ns
Solid State (flash) Drive   ~200 GB – 1 TB          ~0.1 ms
Disk Drive                  2–5 TB                  ~3 ms

The Memory Hierarchy — Price

Registers                   ~1 KB           0.3 ns
L1 Cache                    256 KB          1.1 ns
L2 Cache                    1 MB            3.3 ns
L3 Cache                    8 MB            12.8 ns
L4 Cache                    128 MB          42.4 ns
Main Memory                 32 GB *         62.9 ns    $5–10/GB
Solid State (flash) Drive   ~200 GB – 1 TB  ~0.1 ms    $0.67/GB
Disk Drive                  2–5 TB          ~3 ms      $50/TB

Examples of Memory Technology
• Vacuum Tubes

Vacuum Tube Memory
(retrieved from http://cdn.wccftech.com/images/articles/History-of-RAM/Vacuum-Tube.jpg)

Examples of Memory Technology
• Vacuum Tubes
• Core memory
– Popular from 1955-1975
– 32 kilobits per cubic metre by the late 1960s
– Semiconductor memory started to take over the market in the early 1970s.

https://commons.wikimedia.org/wiki/File:Magnetic-core_memory.JPG Benoitb

64 bits vs 8 GBytes
https://commons.wikimedia.org/wiki/File:8_bytes_vs._8Gbytes.jpg Daniel Sancho


Consider this (or: why the hierarchy exists)

Three adjacent levels of the hierarchy: K-1, K, K+1.

If K+1 is the same size as K, then we don't need K+1: everything in K+1 can fit in K.

If K isn't faster than K+1, then we don't need K.

Implications
• We saw many factors of 10 or 100 in:
  – Size
  – Performance
  – Price
• “When you see a factor of 100, it’s going to affect how you program.” – E. Kohler
• As the ratios between different parts of the system change, so do our priorities.
  – 1956: $/MB (mem) : $/MB (disk) => $411M : $9200 => ~44,673X
  – 2015: $/MB (mem) : $/MB (disk) => $0.005 : $0.0000317 => ~158X
  – Cost of memory in 1956 versus 2015: 1956 $/MB : 2015 $/MB => $411M / $0.005 => ~82 billion X

Caching
• Definition:
  – Colloquially: store away in hiding or for future use
  – Applied to computation: placing data somewhere it can be accessed more quickly.
• Examples:
  – In assembly language: we move data from memory to registers.
  – In hardware: we move data from main memory into memory banks that live on the processor (more on this in a moment).
  – In software: we read things into our program’s local buffers and manipulate them there.
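The software flavor of caching can be made concrete with a minimal memoizing wrapper. This is an illustrative sketch, not anything from the slides: `slow_fetch` stands in for any expensive data source (disk, network, database), and all names are made up.

```python
# A minimal software cache: remember results of a slow lookup so
# repeated requests are served from memory instead of the source.

def make_cached(fetch):
    cache = {}                       # key -> previously fetched value
    def cached_fetch(key):
        if key not in cache:         # miss: go to the data source
            cache[key] = fetch(key)
        return cache[key]            # hit: served from the cache
    return cached_fetch

calls = []                           # track how often we hit the source
def slow_fetch(key):
    calls.append(key)
    return key * 2

fetch = make_cached(slow_fetch)
print(fetch(21), fetch(21), len(calls))   # 42 42 1  (second access is a hit)
```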

Some Rules of Thumb
• How big things are
– Bigger means slower, typically also cheaper
– If A is bigger and faster than B (for same cost), there’s no point in using B
• Where things are
– On chip is much faster
– Off chip is slower but typically much bigger

CPSC 313
Today
• Learning Objectives
– Identify location of common caches in the IO path.
– Define
• Cache block
• Cache slot
• Cache hit/miss rate
• Replacement policy
• Direct-mapped cache
• Fully Associative Cache
• N-way Set Associative Cache
– Calculate hit/miss rates
– Map hit/miss rates into performance

Admin
• Midterm 1 handed back Wednesday – Question 6c/d has been remarked
• Online Quiz 3 becomes available today – due Feb 24th
• Office hours during break
  – TAs: check the TA office hour link on Canvas – currently looks like some on Tuesday, Thursday, Friday
  – My scheduled office hours next week are cancelled
  – Will probably have some on Friday and maybe one other day (waiting for some meetings to be scheduled)

From last class we learned:
• Caching is ubiquitous throughout our computing systems:
  – In the processor
  – In the operating system
  – In databases
  – In middleware
  – In applications
• Writing efficient and correct software requires a deep understanding of caching and its implications.


An Abstract Cache

Application <-> Cache <-> Data Source
(Both reads and writes flow through the cache.)

A Read Cache Hit
1. Application: “Please read this item!”
2. Cache: “Hmmm, do I have that item? Yes!”
3. Cache -> Application: “Here you go!”

A Read Cache Miss
1. Application: “Please read this item!”
2. Cache: “Hmmm, do I have that item?”
3. “No.”
4. Cache -> Data Source: “Please get me my item.”
5. Data Source -> Cache: “Here you go!”
6. Cache -> Application: “Here you go!”

A Read Cache Miss – Decisions!!
Same flow as above, but the cache faces several questions:
• How do I arrange my data?
• How much data should I return?
• What do I do when I run out of space?

How much data should I return?
• Most (persistent) storage devices have a native data access and/or transmission size, e.g., disk block (4 KB).
• Caches also have a native size.
  – E.g., Intel Broadwell has 64-byte cache lines.
  – Implication: data moves in and out of caches in 64-byte chunks.
• Block size: the unit in which data is stored in a cache.
  – Broadwell caches: 64 bytes (block size is called cache line size in HW caches)
  – File system caches: 4 KB
  – Object caches: size of the object

What do I do when I fill up?
• A cache has a limited capacity.
• At some point, the application will fill the cache and request
another item.
• Caching the new item requires evicting some other item.
• What item do I evict?
– We need an eviction policy
– The available decisions here vary between hardware and software.

Eviction in Software
• In a perfect world, we’d like to evict the item that is least valuable.
• In the real world, we don’t know what that item is.
• Practically all software caches try to approximate this ideal.
  – LRU: least-recently-used – find the item that has been unused the longest and get rid of that.
  – LFU: least-frequently-used – find the item that has been used least frequently and get rid of that.
  – Clock: used in virtual memory systems to approximate LRU (you’ll learn more about that either later this semester or in CPSC 415).
  – Something tuned to known access patterns.

Eviction: Hardware
• Software algorithms are lovely, but are not nearly fast enough for most hardware, e.g., processor caches.
• A cache is comprised of some number of slots.
• For any particular item, limit the number of possible slots in which it can be placed. Call the number of such slots A. Let n be the total number of slots in the cache.
  – A = 1: direct mapped: each object can live in exactly one slot in the cache, so you have no choice but to evict the item in that slot.
  – A > 1, A << n: A-way set associative: an object can live in one of A slots; A is typically 2, 4, or 8. On eviction, choose randomly from among the A slots.
  – A = n: fully associative: an object can live in any slot (like a software cache).
• This answers the “How do I organize my cache?” question.

Cache Arrangement: Direct Mapped
• Each cache line can go in one and only one slot. (A slot holds a cache line.)

Cache Arrangement: Fully Associative
• A cache line can go in ANY slot.

Cache Arrangement: 2-way Set Associative
• A cache line can go in one of two slots.

Cache Arrangement: 8-way Set Associative
• A cache line can go in one of 8 slots.

Types of Cache Misses
• Compulsory: on first access to an object, you take a miss; there is little you can do about it.
• Capacity: you are touching more data than can fit in the cache. If your cache were larger, you would have fewer misses.
• Conflict: you have something other than a fully associative cache and are taking misses because there is another item occupying the slots you need.

Evaluating a Cache
• Hits are much better than misses!
• We measure the efficiency of a cache in terms of its cache hit rate:
  – # cache hits / # cache accesses
  – # cache hits / (# cache hits + # cache misses)
• Example: I access my cache 1000 times and have 400 hits. My cache hit rate is 400/1000 = 40%.
• Good performance requires high cache hit rates.
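The mapping from address to slot can be made concrete. The following is a sketch with made-up parameters (64-byte blocks, 8 sets); real caches vary, but the arithmetic is the same: strip the within-block offset, then use the low bits of the block number to pick the set.

```python
# Which set may hold a given address? Sketch with assumed sizes:
BLOCK_SIZE = 64   # bytes per cache line
NUM_SETS = 8      # direct mapped: 1 line/set; N-way: N lines/set

def candidate_set(address):
    block_number = address // BLOCK_SIZE   # drop the offset within the block
    return block_number % NUM_SETS         # low bits of block number pick the set

# Addresses NUM_SETS * BLOCK_SIZE (= 512) bytes apart collide in the same set:
print(candidate_set(0), candidate_set(512), candidate_set(64))   # 0 0 1
```

In a direct-mapped cache that set holds one line, so addresses 0 and 512 evict each other; in a 2-way cache they can coexist.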
Computing Average Access Time
• Let’s say that it takes 1 time unit to access your cache and 100 time units to access the data source (i.e., the miss cost).
  – 10% hit rate: 0.90 * 100 + 0.10 * 1 = 90.1 time units/access
  – 50% hit rate: 0.50 * 100 + 0.50 * 1 = 50.5 time units/access
  – 90% hit rate: 0.10 * 100 + 0.90 * 1 = 10.9 time units/access
  – 99% hit rate: 0.01 * 100 + 0.99 * 1 = 1.99 time units/access
• The ratio between access time for the cache and the data source dictates just how good your hit rate needs to be to obtain good performance from a cache.

Now you practice
• Let’s say my cache time is 1 unit and my data source access time is 50 time units.
• Given a hit rate of 80%, what is my average access time?
  – Hits: .8 * 1 = .8
  – Misses: .2 * 50 = 10
  – Average access time is 10.8
• That’s 10x slower than the cache!

CPSC 313: Computer Hardware and Operating Systems
Unit 3: Caching — Write Caching

Today
• Learning Objectives
  – Get back up to speed on caching – quick overview
  – Explain how cache line sizes and stride interact
  – Distinguish write-back and write-through caches
• Reminders
  – Quiz 3 due today
  – Assignment due on Friday.
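The average-access-time arithmetic above reduces to one formula, sketched here with the slide’s numbers (function name is illustrative):

```python
# Average access time = hit_rate * hit_time + miss_rate * miss_time.

def avg_access_time(hit_rate, hit_time=1, miss_time=100):
    return hit_rate * hit_time + (1 - hit_rate) * miss_time

for rate in (0.10, 0.50, 0.90, 0.99):
    print(rate, round(avg_access_time(rate), 3))   # 90.1, 50.5, 10.9, 1.99

# The practice problem: 1-unit cache, 50-unit source, 80% hit rate:
print(round(avg_access_time(0.80, miss_time=50), 3))   # 10.8
```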
Terminology: What do these terms mean?
• cache block (or cache line) – the book calls these blocks; the book uses the term “line” to mean the space in the cache that holds a block and all its associated metadata.
• Associativity (book terminology):
  – A direct-mapped cache has one line per set
  – An N-way associative cache has N lines per set
• eviction policy
• hit rate – this is how we measure a cache’s efficiency
• miss rate
• average access time = (Ht * Hr) + (Mt * Mr), where Ht = access time on hit, Hr = hit rate, Mt = access time on miss, Mr = miss rate (1 – hit rate)

Cache Arrangement
• CPU (register file, ALU, bus interface) -> Cache -> Data Source (memory or another cache level). None of this is to scale.
• A cache line/block/slot holds multiple bytes.
• In a set-associative cache, each grouping of cache lines is a set.

Remember ...
• Caches have some number of slots (places you can put stuff in the cache).
• A cache block (or cache line) occupies a slot.
• The associativity of a cache tells you something about which slot(s) a block can reside in.
• The eviction policy determines which block/line gets evicted when all slots in which a cache line can reside already have valid data.
• We measure cache efficiency in terms of hit rate.
• The miss rate is just 1 – hit rate.
• Computing average access time:
  – Let Ht be the time for a hit
  – Let Hr be the hit rate
  – Let Mt be the time for a miss
  – Let Mr be the miss rate
  – Average access time = (Ht * Hr) + (Mt * Mr)

More than one way to get a hit ...
• If you touch the same item more than once, you get a hit, but there is another way to get a hit.
• Think about the fact that your cache is organized in blocks ...
• Consider this:
  – Let’s say you are accessing an array of 4-byte integers one after the other
  – A cache line is 64 bytes.
• Here is the question:
  – Let’s say that you have 160 items in the array and you’ve never accessed it before; how many cache misses will you take?

How many cache misses?
• The answer is about 10!
• Why?
  – A cache line is 64 bytes => that holds 16 4-byte quantities
– When you miss on the first array element, the cache will bring in an entire block (called a cache line in hardware).
– And since arrays (usually) are laid out contiguously, the rest of the values in the cache line just happen to be the next elements you are going to access!
– So, you miss on element 0, but then hit on elements 1-15.
– You miss on element 16 and hit on elements 17-31.
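The counting argument above can be checked directly: a cold sequential scan takes one miss per distinct cache line touched.

```python
# Misses for a cold sequential scan of 160 4-byte ints, 64-byte lines:
import math

ELEM_SIZE = 4
LINE_SIZE = 64
N = 160

elems_per_line = LINE_SIZE // ELEM_SIZE          # 16 ints per line
misses = math.ceil(N / elems_per_line)           # one miss per line touched
print(misses)                                     # 10
```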

Access pattern interaction
• Considerthefollowingarray;assumeeachrowisacacheline.

Access pattern interaction
• If we access EVERY element of the array, what will our hit rate be?
  – 7 out of 8, or .875, or 87.5%

Access pattern interaction
• If we access EVERY OTHER element of the array, what will our hit rate be?
  (diagram: every other element of the 8-element-per-line array marked with an X)
  – 3 out of 4, or .75, or 75%

Access pattern interaction
• If we access EVERY FOURTH element of the array, what will our hit rate be?
  (diagram: every fourth element marked with an X)
  – 1 out of 2, or .5, or 50%

Access pattern interaction
• If we access EVERY EIGHTH element of the array, what will our hit rate be?
  (diagram: one element per cache line marked with an X)
  – 0 out of anything, or 0%

Those were all examples of different strides
• Stride = spacing between accesses.
• If your stride is N, you access 1 data element per N bytes. (A data element could be a byte or a quad word.)
  (plot: time to read data, in ns, versus the size of the array from which we are reading, in MB, for several strides)
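The stride examples above can be reproduced with a short calculation. This sketch assumes, as in the diagrams, 8 elements per cache line and a cold cache; names are illustrative.

```python
# Hit rate when scanning a cold array with a given stride,
# assuming 8 elements fit in one cache line.

ELEMS_PER_LINE = 8

def scan_hit_rate(stride):
    lines_seen = set()
    accesses = misses = 0
    for i in range(0, ELEMS_PER_LINE * 100, stride):   # a long scan
        accesses += 1
        line = i // ELEMS_PER_LINE                     # which cache line is touched
        if line not in lines_seen:                     # first touch of a line: miss
            misses += 1
            lines_seen.add(line)
    return 1 - misses / accesses

for s in (1, 2, 4, 8):
    print(s, scan_hit_rate(s))   # 0.875, 0.75, 0.5, 0.0
```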

A Write Cache Hit
1. Application: “Please write this item!”
2. Cache: “Hmmm, do I have that item?”
3. “Yes! Here it is.”
4. The new value is stored in the cache.

A Write Cache Hit: Policy Decisions
• The written block in the cache now holds dirty data.
• When do we write the new value back to the source?

A Write Cache Hit: Writeback Caching
• The new value stays in the cache as dirty data; it is written back to the source later.

A Write Cache Hit: Write Through
1. Application: “Please write this item!”
2. Cache: “Hmmm, do I have that item?”
3. “Yes! Here it is.”
4. The new value is stored in the cache.
5. Cache -> Data Source: “Please write this.”
6. The new value is stored at the source.

Today
• Learning Objectives
  – Complete cache write policies
  – Simulate different replacement policies:
    • LRU
    • LFU
    • Belady’s Algorithm
  – Map realistic replacement policies to different layers of the hardware/software stack
  – Given a workload, cache configuration, and replacement policy, calculate hit/miss rates


Write Policy Trade-offs
• Advantages of write-back (= disadvantages of write-through)
  – Faster: writes complete at cache speed, not data source speed.
  – Write coalescing: update the same item many times before writing it back to the source.
• Advantages of write-through (= disadvantages of write-back)
  – No window of vulnerability (with write-back, new data is in the cache but not yet in the source).

A Write Cache Miss
1. Application: “Please write this item!”
2. Cache: “Hmmm, do I have that item?”
3. “No.”

A Write Cache Miss – Write Allocate
1. Application: “Please write this item!”
2. Cache: “Hmmm, do I have that item?”
3. “No.”
4. Cache -> Data Source: “Please get me my item.”
5. Data Source -> Cache: “Here you go!”
6. The new data is written into the cache.

A Write Cache Miss – No Write Allocate
1. Application: “Please write this item!”
2. Cache: “Hmmm, do I have that item?”
3. “No.”
4. Cache -> Data Source: “Please write this item.”

Write Allocation Tradeoffs
• Advantages of write-allocate (= disadvantages of no-write-allocate)
  – Faster when you are modifying multiple values in a cache line.
  – Matches read behavior.
• Advantages of no-write-allocate (= disadvantages of write-allocate)
  – Avoids cluttering the cache with write-only data.
  – Avoids consuming a lot of cache space for random writes.

Wrapping Up
• If writes go all the way through to the data source, we call that a write-through cache.
• If writes can stay in the cache, we call the written cache blocks dirty and the cache is write-back.
• If write misses result in a cache line being allocated in the cache, we call that write-allocate.
• If we bypass the cache for write misses, we call that no-write-allocate.
• Rule of thumb: write-back and write-allocate typically go together as do write-through and no write allocate.
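The write-back plus write-allocate pairing can be sketched in a few lines. This is a toy illustration (all names are mine, and a backing dict stands in for the source), not how hardware implements it.

```python
# A tiny write-back, write-allocate cache over a backing dict.

class WriteBackCache:
    def __init__(self, source):
        self.source = source
        self.data = {}       # key -> value currently cached
        self.dirty = set()   # written in cache but not yet in source

    def write(self, key, value):
        self.data[key] = value       # write-allocate: a miss still fills the cache
        self.dirty.add(key)          # write-back: don't touch the source yet

    def flush(self):
        for key in self.dirty:       # write dirty blocks back to the source
            self.source[key] = self.data[key]
        self.dirty.clear()

source = {}
cache = WriteBackCache(source)
cache.write("x", 1)
cache.write("x", 2)     # write coalescing: the source will see only one write
print(source)           # {}  -- the window of vulnerability
cache.flush()
print(source)           # {'x': 2}
```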

On To – Cache Replacement Policy
• When the cache is full (e.g., every slot is occupied by data thought to be useful), determines which block you kick out (evict) to make room for a new request.
• In HW: these are very simple – quite often random (because you have very few choices about where a cache block can be placed).
• In SW: typically fully associative, so you have a lot of choices about which cache block to replace.

Goals of a Good Replacement Policy
1. Some assumptions we make:
   – Workloads exhibit locality.
   – Spatial locality: you are likely to access data near data you just accessed.
   – Temporal locality: you are likely to access data you accessed recently.
2. Evicts a block you won’t need for a long time.
3. Best case: evicts exactly that block that will next be accessed farther in the future than any other block.
   – This is called Belady’s algorithm (or the optimal algorithm).
   – Unfortunately, it requires that you predict the future.
4. All other algorithms are attempts at approximating Belady’s algorithm.

A cache eviction policy to exploit temporal locality
• LRU (least-recently-used):
– Assumes that data accessed recently is likely to be accessed again.
– Assumes that the least valuable data is that data that has remained unaccessed for the longest period of time.
– Is probably the most commonly used replacement policy.

LRU Example
• Example:
– Assume we have 2 slots in our cache.
– Assume fully associative
– We access cache blocks in the following order:
A, A, B, A, B, C, C, C, A, B,

Kinds of Misses
• Compulsory: No matter what your cache looks like, you are going to take a miss.
• Capacity: If your cache were larger, you might not have to take this miss.
• Conflict: If your cache had higher associativity, you might not have to take this miss.

LRU Example
• Example:
  – Assume we have 2 slots in our cache.
  – Assume fully associative.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C
    M, H, M, H, H, M, H, H, M, M

LRU Example
• Example:
  – Assume we have 2 slots in our cache.
  – Assume fully associative.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C
    M, H, M, H, H, M, H, H, M, M
  – The first miss on each of A, C, and B is compulsory; the later misses on A and C are capacity misses.

LRU Example Compared to Direct Mapped
• Example:
  – Assume we have 2 slots in our cache.
  – Assume direct mapped: assign a number to each letter (A=1, B=2, C=3); even letters go in one slot, odd in the other.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C

Direct-Mapped Example (continued)
• Assume we have 2 slots in our cache, direct mapped (odd letters in one slot, even in the other).
• We access cache blocks in the following order:
  A, A, C, A, C, B, B, B, A, C
  M, H, M, M, M, M, H, H, M, M
• The extra misses are conflict misses: A and C fight over the odd slot.

Belady Example
• What is the best possible replacement algorithm we could do? Always evict the item that will be used farthest in the future.
  – Assume we have 2 slots in our cache.
  – Assume fully associative.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C

Belady Example
• Always evict the item that will be used farthest in the future.
  – Assume we have 2 slots in our cache, fully associative.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C
    M, H, M, H, H, M, H, H, H, M
  – Last time (LRU) we evicted A because it was least recently used; this time we evict C, because we are going to use A before we use C. That turns the second-to-last access to A into a HIT instead of a MISS.

Today
• Learning Objectives
  – Simulate different replacement policies: LRU, LFU, Belady’s Algorithm
  – Given a workload, cache configuration, and replacement policy, calculate hit/miss rates
• A2 due tonight



LFU Example
• LFU = least frequently used
  – Assume we have 2 slots in our cache.
  – Assume fully associative.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C

LFU Example
• LFU = least frequently used
  – Assume we have 2 slots in our cache, fully associative.
  – We access cache blocks in the following order:
    A, A, C, A, C, B, B, B, A, C
    M, H, M, H, H, M, H, H, H, M
  – When B arrives, we have used A 3 times and C twice, so we evict C. Since we evicted C (not A), the later access to A becomes a hit!

Example – LRU, LFU, Belady’s
• Assume we have 2 slots in our cache, fully associative.
• We access cache blocks in the following order:
  A, A, C, A, C, B, B, B, A, C
  M, H, M, H, H, M, H, H, M, M  – LRU
  M, H, M, H, H, M, H, H, H, M  – LFU
  M, H, M, H, H, M, H, H, H, M  – Belady’s
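The result rows above can be reproduced with a small simulator. This is a sketch: the policy implementations and tie-breaking are illustrative, and all names are mine.

```python
# Simulate a small fully associative cache under LRU, LFU, or Belady.

def simulate(accesses, slots, policy):
    cache, result = [], []
    for i, item in enumerate(accesses):
        if item in cache:
            result.append("H")
            continue
        result.append("M")
        if len(cache) == slots:                  # full: pick a victim
            if policy == "LRU":                  # least recently used
                victim = min(cache, key=lambda x: max(
                    j for j in range(i) if accesses[j] == x))
            elif policy == "LFU":                # least frequently used so far
                victim = min(cache, key=lambda x: accesses[:i].count(x))
            else:                                # Belady: used farthest in future
                future = accesses[i + 1:]
                victim = max(cache, key=lambda x:
                    future.index(x) if x in future else len(future) + 1)
            cache.remove(victim)
        cache.append(item)
    return "".join(result)

seq = list("AACACBBBAC")
print(simulate(seq, 2, "LRU"))      # MHMHHMHHMM
print(simulate(seq, 2, "LFU"))      # MHMHHMHHHM
print(simulate(seq, 2, "Belady"))   # MHMHHMHHHM
```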

There is no Perfect Policy
• LRU is frequently good
• LFU is sometimes better
• For anything but Belady, we can construct an adversarial workload that can be pretty devastating.
• Assume a 2 slot, fully associative cache with LRU replacement
– Using only 3 different data items, produce an ordering whose hit rate is 0!

When LRU is Terrible
• Assume a 2-slot, fully associative cache with LRU replacement.
  – Using only 3 different data items, produce an ordering whose hit rate is 0!
    A B C A B C A B C A B C

When LRU is Terrible
• A B C A B C A B C A B C
  M M M M M M M M M M M M – every access misses under LRU.

What Would Belady’s Algorithm Do?
• Same 2-slot, fully associative cache, same ordering:
  A B C A B C A B C A B C

What Would Belady’s Algorithm Do?
• A B C A B C A B C A B C
  M M M H M H M H M H M H
  (evict B, then A, then C, then B, then A, ...)
• That’s a 40% hit rate! Ignoring compulsory misses, this produces a 50% hit rate in the steady state.
• This algorithm is MRU!

MRU (Most Recently Used)
• Food for thought:
– Given a fully associative cache with N slots
– Given data that occupies N+1 slots
– In the steady state (ignoring compulsory misses), what hit rate will MRU produce on a workload that iterates over the entire data set repeatedly?
– What hit rate will LRU produce on that workload?

Evaluating Caches
• By now you should be pretty good at calculating hit/miss rates.
• Sometimes we also want to compare different cache architectures or determine how a change might influence performance.
• We often use speedup as a way to describe how much a change improves (or harms) performance.

Amdahl’s Law
• Who is Amdahl and what is his law?
  – Gene Amdahl: American computer architect who worked at IBM for many years and then started Amdahl Corporation.
• Amdahl’s Law:
  – Originally designed to quantify the opportunities and benefits of parallelism.
  – If you have a program that has a sequential part and a parallelizable part, when you parallelize it, you will only improve the performance of the latter part. So, it’s good to know just how much parallelizing the application will help.

Let’s Apply Amdahl’s Law to Caching
• Amdahl’s Law states: overall speedup is a function of both the amount you improve some part of the system and the fraction of time you spend in that part of the system.
• Given:
  – Latency of the original system is Told (AKA execution time, elapsed time)
  – You can speed up part of the system by a factor of k
  – The program spends fraction α (< 1) of its time in the part of the system you’re going to speed up.
• Latency after the improvement:
  – Tnew = (1 − α)Told + (α Told)/k = Told((1 − α) + α/k)
  – Speedup = Told / Tnew = 1 / ((1 − α) + α/k)

Amdahl’s Law Visually
• The (1 − α) fraction of Told is unchanged; the α fraction shrinks to α/k:
  Tnew = Told((1 − α) + α/k), so Speedup = 1 / ((1 − α) + α/k).

Amdahl’s Law Example
• Suppose that a program that takes 100 seconds to complete spends 25% of its time in memory instructions. If we make memory instructions twice as fast, how much faster will the program run?
  – α = .25, k = 2, Told = 100
  – Speedup = 1 / ((1 − .25) + .25/2) = 1/.875 = 1.14
  – Latency = 100 * .875 = 87.5 seconds

Now You Practice
• Suppose that a program that takes 100 seconds to complete spends 50% of its time in memory instructions. If we make memory instructions 67% faster, how much faster will the program run?
• Careful! Something that is X% faster speeds up by 1.X.
  – α = 0.5
  – k = 1.67
  – Told = 100
  – Speedup = 1 / ((1 − .5) + .5/1.67) = 1/0.8 = 1.25
  – Latency = 100 * .8 = 80 seconds

Plot of Amdahl’s Law
[Figure: overall speedup vs. speedup of the optimized part of the program (1–10), one curve per fraction α (0.001, 0.2, 0.4, 0.6, 0.8, 0.99); every curve flattens out, since overall speedup can never exceed 1/(1 − α).]

Today
• Upcoming:
  – Quiz 4 will be out on Friday
  – A3 is being finalized and will be out shortly
• Learning Objectives
  – Identify the metadata needed to manage a cache
  – Figure out how to extract cache tags from addresses

What is this metadata of which we speak?
• At the very beginning of this unit, and then in some definitions from the book, we talked about metadata that we need to manage the cache. It’s time to talk about that metadata.
  – Metadata: data that describes and provides information about other data.
  – Example: book metadata: the author, the publisher, the date of publication, the printing, its persistent identifier, ...
  – Example: file metadata: the size of the file, when it was last written, where it resides on persistent store, who owns the file, who can access the file, ...
• Let’s think about what we have to do with caches and reverse engineer the kind of metadata that we need to keep.

Metadata for Replacement
• What would we need to record to implement LRU and LFU?
• LRU – A way to know the last time a block was accessed
  • A time stamp
  • A linked list
• LFU – A count of the number of times something was accessed
  – Maybe an index to find the minimum quickly (but you need to be able to update that index quickly as well)

Let’s talk about hits and misses
• Assume a direct-mapped cache.
• What information do you need to figure out if an access is a hit or a miss?
  – Where would my data go?
  – Is my data in that slot?

Mapping Data to a Cache Slot: Conceptually
• We need some kind of map data structure that tells you, for any possible data item, where you should look for it in the cache.

Mapping Data to a Cache Slot: Software
• How might you implement this map?
  – You can implement this literally as a map.
  – You might implement this as a hash table.
  – You might implement this as a custom data structure (i.e., if you know something about your data, you might be able to do something clever).
  – If your cache is small, you might exhaustively search.

Mapping Data to a Cache Slot: Hardware
• How might you implement this map? It has to be fast!
• What do you know about the data and how you will look it up?
  – The cache will either contain a complete cache line or no cache line.
  – A single data access references only part of a cache line (probably).
  – You look up entries in the cache by address.

Bits and Tags
• Assume that each box in the diagram represents one byte. How large is a cache line?
• The cache line shown holds 16 bytes.

Bits and Tags and Addresses
• Let’s say the address of the first item in the line is 0x1000; the addresses of the remaining items are 0x1001, 0x1002, 0x1003, ..., 0x100F.
• Offset in line: 0 1 2 3 4 5 6 7 8 9 A B C D E F
• Address of the first byte of each cache line (second hex digit shown in binary):
  – 0x1000 (0000), 0x1010 (0001), 0xB0A0 (1010), 0x2030 (0011),
    0xC0C0 (1100), 0xBAD0 (1101), 0xD0E0 (1110), 0xEFF0 (1111)

How do we use these observations?
• The low-order 4 bits of the address (the low-order nibble, or low-order hex digit) tell you which item (in this case, byte) in the cache line you are accessing.
• The next 3 bits of the address select the place in the cache where a cache line goes.
• Provide two addresses that will map to the same cache slot:
  – 0x1030 (3 = 0011) and 0x10B0 (B = 1011) share the low 3 index bits (011).
  – Also: 0xFF30, 0xFFB0, 0xAB30, 0xDCB0, etc.
Where do we place items in the cache?
• We use some bits in the address to identify the slot(s) to which a line maps.
• In this case, we used the lowest bits that were not part of the offset into the cache line.
• Would it make sense to use high-order bits?
  – No! Data is frequently accessed sequentially; if we chose high-order bits, lots of sequential cache lines would map to the same place.
  – Consider taking the top 3 bits: all cache lines between 0xC000 and 0xD000 would map to the same place!
• Can you think of any reason NOT to just use the lowest bits that are not part of the offset?
  – Yes! What if we are doing a strided access?
  – Say we want to read 1 byte out of every 1024 bytes; the addresses would be 0x0000, 0x0400, 0x0800, 0x0C00, 0x1000, 0x1400, 0x1800, 0x1C00, ...

Which bits should we pick?
• There is no uniformly good answer:
  – The lowest available bits (i.e., excluding the offset bits) can cause bad behavior for strides.
  – High bits cause bad behavior for sequential access.
  – If you have to pick one, low bits are better, as sequential access is a more common pattern.
  – Alternately, perhaps you pick some bits from the lowest available and a few from places that would cause common stride patterns (e.g., 1 KB, 4 KB).

What if we have a set associative cache?
• Consider 2-way set associative.
• Each cache index now names a set with two locations; an address with index 3 can map to either slot in set 3.

Stride and Associativity
• Let’s say that:
  – We use the same cache organization we started with (direct mapped; 16-byte cache lines).
  – We use bits 4–6 as the cache index.
  – We have two 1 KB arrays (e.g., char a1[1024], a2[1024]), and a1[0] and a2[0] have the same index bits.
• If we copy bytes one at a time from a1 to a2, what will the hit rate be? 0 – each access to one array evicts the other array’s line.
• If we make the cache twice as big, but it remains direct mapped, what will the hit rate be? Still 0 – the two arrays’ lines still map to the same slots.
• If we make the cache twice as big by making it 2-way set associative, what will the hit rate be? 15/16! Both lines fit in the set, so only the first access to each 16-byte line misses.

Just one more thing...
• How do we distinguish a hit from a miss? Assume our original cache configuration.
  – Let’s say that I access 0x1000, then request the item at address 0x0104.
  – Where do each of those map in the cache? Cache index 0!
  – But the second one should be a miss: how do I know that?
• We need to store metadata with each cache entry. We need at least:
  – A valid bit indicating if there is any valid data in the cache at a given entry
  – A tag that identifies exactly which data is in the cache.
Today
• Learning Outcomes for Today
  – Tags & misses
  – Explain the challenges that arise in hardware caches in the presence of multiple cores and/or processors.
  – Define each of the states in the MESI protocol.
  – Relate multi-core cache behavior to concurrency control in multi-threaded programs.
• Assignment 2 is out, due March 20th – good practice for MT 2
• Quiz 4 available Friday – due Monday 23:59

Just one more thing... (reminder: direct mapped, 8-line cache, 16 bytes/block)
• As above: a valid bit plus a tag let us distinguish a hit from a miss.

The parts of an address
• Let’s assume the original cache: 1-byte elements, 16 per cache line, 8 cache lines, 16-bit addresses.
• Address = [ Tag (9 bits) | Index (3 bits) | Offset (4 bits) ]

Cache with metadata
• Each of the 8 slots (3-bit index) stores a valid bit and a 9-bit cache tag alongside the 16 data bytes (offset = low 4 bits of the address):
  – Index 0: tag 0x020 (line at 0x1000); 1: 0x020 (0x1010); 2: 0x161 (0xB0A0); 3: 0x040 (0x2030); 4: 0x181 (0xC0C0); 5: 0x175 (0xBAD0); 6: 0x1A1 (0xD0E0); 7: 0x1BF (0xEFF0)

Where did those tags come from??
• Address of the first byte in each line (hex, binary) → 9-bit cache tag:
  – 0x1000 = 0001 0000 0000 0000 → 0x020
  – 0x1010 = 0001 0000 0001 0000 → 0x020
  – 0xB0A0 = 1011 0000 1010 0000 → 0x161
  – 0x2030 = 0011 0000 0011 0000 → 0x040
  – 0xC0C0 = 1100 0000 1100 0000 → 0x181
  – 0xBAD0 = 1011 1010 1101 0000 → 0x175
  – 0xD0E0 = 1101 0000 1110 0000 → 0x1A1
  – 0xEFF0 = 1110 1111 1111 0000 → 0x1BF
• The tag is simply the top 9 bits of the address – everything above the 3 index bits and 4 offset bits.

Check your Understanding
• Let’s say that we have a cache with the following configuration:
  – 64-byte cache lines, 512 cache lines, 32-bit addresses
• How many bits are in the tag?
  – Offset in the cache line: 64 bytes => 6 bits
  – Index into the cache: 512 lines => 9 bits
  – So, 32 – (6 + 9) = 32 – 15 = 17-bit tags
• Would this number change if the cache were 4-way set associative (and we still stored a total of 512 cache lines)?
  – Yes! Each index would now refer to a set of 4 lines, so we would only have 128 indexes.
  – The cache index would only be 7 bits, so the tag must be 19 bits.
• What if we stored 512 sets, each of 4 lines?
  – The tag structure would stay the same, but we would have made the cache 4 times larger!
Remember …
[Diagram: one multicore chip – (hyper)threads run on each core, each core has a per-core L1 cache, the cores share an L2 cache, and the chip connects to memory.]
Remember …
[Diagram: two processors, each with two execution cores, per-core L1 caches, and a shared per-processor L2 cache; both processors connect to memory.]
Let’s assume all caches are write-back
• What problems do you see that we have to solve when we have caches in the presence of multiple processors (or cores)?
– Two cores/processors might access the same data – that could result in two copies of the same cache line present in two different caches.
– Core 1 might wish to access data that is dirty in Core 2’s cache.
– Processor 2 might wish to access data that is dirty in some cache on Processor 1.
• Consequently, if the hardware doesn’t do anything to avoid it, cores/processors can see incorrect data.
Core 1 reads X; Core 1 writes X
[Diagram: core 1’s L1 cache now holds X: 0xCAFE (dirty); the shared L2 cache and memory still hold X: 0x1234.]
Core 2 reads X
[Diagram: core 1’s L1 still holds the dirty X: 0xCAFE; the shared L2 and memory hold the stale X: 0x1234.]
Processor 1/Core 1 reads X
[Diagram: X: 0x1234 is fetched from memory into processor 1’s L2 and L1 caches.]
Processor 2/Core 1 reads X
[Diagram: X: 0x1234 is now cached on both processors (L1 and L2) and in memory.]
Processor 1/Core 1 writes X ... Stale Data
[Diagram: processor 1’s caches now hold the newly written value of X; processor 2’s caches still hold the stale X: 0x1234.]
Cache Coherence
• Typically, it is the hardware’s job to make sure that programs cannot see such inconsistencies.
• Two parts to a solution:
– Use hardware to ensure that loads from all cores will return the
value of the latest store to that memory location.
– Use cache metadata to track the state of cached data.
• There are two major approaches to addressing this problem:
  – Snoopy caches
  – Directory-based coherence
Let’s Simplify Our Problem
[Diagram: a bus-based “snooping” multicore processor – two execution cores, each with an L1 cache, connected by an intercore bus, with memory attached to the bus.]
Invalidation Protocol with Snooping
• Invalidation:
  – If a core writes to a data item, all other copies of this data item in other caches are invalidated.
• Snooping:
  – All cores continuously snoop (monitor) the bus connecting the cores.
Let’s have Core 1 read X; then Core 2 reads X
[Diagram: both cores’ L1 caches now hold X: 0x1234, matching memory.]
What happens this time if Core 1 writes X?
[Diagram: core 1’s L1 now holds X: 0xCAFE and puts an invalidation message on the bus; core 2 invalidates its copy of X: 0x1234; memory still holds X: 0x1234.]
Enhancing Cache Metadata
• When we have caches accessible from multiple cores and a bus-based coherence protocol, we need to extend our valid bit to a set of states.
• Cache operations can:
  – Change state
  – Send invalidate requests
  – Request cache lines in a particular state (fetch)
• A minimal set of states (the MSI model); assumes a writeback cache:
  – M: Cache line is modified (i.e., dirty)
  – S: Cache line is shared; appears in at least one cache
  – I: Cache line is invalid (i.e., contains no valid data)
• Transitions (from the state diagram):
  – I → S: local read (fetch shared); I → M: local write (fetch modified)
  – S → M: local write (sends invalidate); S → I: remote write; S → S: local read
  – M → S: remote read (write data back); M → I: remote write (write data back); M → M: local read/write
Today
• Learning Outcomes for Today
– Define each of the states in the MESI protocols
– Relate multi-core cache behavior to concurrency control in multi- threaded programs.
– Introduction to disk drives and file system API
• Assignment 2 is out, due March 20th – good practice for MT 2
• Quiz 4 available Friday – due Monday 23:59
Enhancing Cache Metadata – Walking Through MSI
The deck steps two cores through the MSI state machine for one shared line, one transition per slide:
• Initial state: the line is Invalid (I) in both caches.
• Read Core 1: core 1 misses and fetches the line shared (I → S).
• Read Core 2: core 2 also fetches shared (I → S); both caches now hold the line in S.
• Write Core 1: core 1 upgrades S → M and sends an invalidate; core 2 goes S → I.
• Write Core 2: core 2 fetches the line modified (I → M); core 1 sees the remote write, writes its data back, and goes M → I.
• Read Core 1: core 1 fetches shared (I → S); core 2 sees the remote read, writes its data back, and goes M → S.
But wait, …
• Doesn’t this get expensive if, for example, I am the only one who has something cached?
• Yes!
  – If that data is not shared, you are still executing a large number of costly cache invalidations.
  – So ...
Let’s add another state! MESI
• E (Exclusive): the data does not live in any other cache.
• Transitions (from the state diagram):
  – I → E: exclusive read miss; I → S: shared read miss; I → M: write miss
  – E → M: local write (silent – no invalidate needed, since no other cache has the line)
  – E → S: remote read; E → E: local read
  – S → M: local write (sends invalidate); S → I: remote write request; S → S: read (local or remote)
  – M → S: remote read (write data back); M → I: remote write request (write data back); M → M: local read/write
The Alternative: Directory-Based Coherence
• A directory stores all the coherence information.
• Processors consult the directory before sending coherency messages.
  – The information in the directory is similar to the cache states:
    • Shared: lives in more than one cache
    • Uncached: lives in 0 caches
    • Exclusive: lives in exactly 1 cache
  – It must also track which processor(s) have the data.
• Rather than broadcasting on the bus, a core consults the directory and sends messages only to the affected cores.
End of Caching
• Use Assignment 3 to help you study for the midterm.
  – It’s all about caching!
  – Due date is March 20
  – Do it now!
• And don’t forget the Quiz due Monday!