
18. Block Size and Writes
EECS 370 – Introduction to Computer Organization – Fall 2020
Satish Narayanasamy
EECS Department
University of Michigan, Ann Arbor, USA
© Narayanasamy 2020
The material in this presentation cannot be copied in any form without written permission

Announcements
Upcoming deadlines:
HW4 due Nov. 10th
Project 3 due Nov. 12th
Midterm regrade issues are being worked out. Should be resolved this week.
2

Recap: Memory Hierarchy and Cache: The Basics
3

Memory Technology Design Space
[Figure: access time (s) vs. die cost ($/GB) for memory technologies: Storage (FLASH), Memory (DRAM and SRAM – the focus of 370), new emerging memory technologies, and the Ideal point. Derived from keynote by Martin Fink, NVMW 2017.]
4
Recap

Memory hierarchy: Leveraging locality of reference
[Figure: the memory hierarchy. Cache (SRAM): fast, expensive, small – temporarily holds what you are using. Main memory (DRAM). Disk (magnetic or floating gate): slow, cheap, big – holds a copy of everything.]
For a program with good locality of reference, memory appears as fast as the cache and as big as the disk.
5
Recap

Memory in a Modern System
ISA memory abstraction and management:
Register file: 32 words, sub-nsec (manual/compiler register spilling)
L1 cache: ~32 KB, ~nsec (automatic HW cache management)
L2 cache: 512 KB – 1 MB, many nsec
L3 cache, .....
Main memory (DRAM): GBs, ~100 nsec
Swap disk: ~100 GB, ~10 msec (~10,000,000 nsec) (automatic demand paging)
6
Recap

Intel Optane
Intel Optane web page
7

Cache: Importance
Caches consume most of a processor's die area.
[Die photo: four cores (CORE 0–3), per-core L2 caches, a shared L3 cache, the DRAM memory controller, DRAM interface, and DRAM banks.]
8
Recap

Review: A simple cache architecture
[Figure: a tag array (valid bit + tag, e.g., Addr-3, Addr-6, Addr-7) paired with a data array holding the corresponding cache blocks (data-3, data-6, data-7); each tag/block pair is a cache line.]
A cache memory consists of multiple tag/block pairs (called cache lines).
The address is searched across all tags in parallel; at most one tag will match.
If there is a tag match, it is a cache HIT.
If there is no tag match, it is a cache MISS.
Cache data that is likely to be used in the future – exploit temporal locality.
9
Recap
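To make the lookup concrete, here is a minimal C sketch (mine, not from the slides) of a fully associative lookup; hardware compares all tags in parallel, while this software model simply loops over them. Names and sizes are illustrative.

```c
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  4      /* illustrative number of cache lines */
#define BLOCK_SIZE 1      /* 1-byte blocks for now; larger blocks are covered below */

struct cache_line {
    bool     valid;
    uint32_t tag;                  /* with 1-byte blocks, the tag is the whole address */
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_line cache[NUM_LINES];

/* Return the matching line on a HIT, or NULL on a MISS. */
static struct cache_line *lookup(uint32_t addr)
{
    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == addr)
            return &cache[i];      /* at most one tag can match */
    }
    return NULL;
}
```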

Temporal Locality
The principle of temporal locality in program references says that if you access a memory location (e.g., 0x1000) you will be more likely to re-access that location (e.g., 0x1000) than you will be to reference some other random location
Temporal locality says that a location that misses should be placed into the cache, since it is the most recently referenced location.
Temporal locality also says that the data in the least recently referenced (least recently used – LRU) cache line should be evicted to make room for the new line.
Because the re-access probability falls the longer a cache line goes unreferenced, the LRU line is the least likely to be re-referenced.
10
Recap

Tracking LRU
Naïve method
Maintain LRU_rank per cache line
Set the LRU_rank of the newly accessed block to 0 and increment the others by 1. Replace the cache line with the highest LRU_rank.
Area overhead: log2(number of cache lines) bits per cache line
LRU with least area overhead
The number of possible orderings (permutations) of N cache lines is N!, so only log2(N!) bits are needed for the whole cache.
11
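A small C sketch of the naive LRU_rank bookkeeping described above (array and function names are my own):

```c
#define NUM_LINES 4

static unsigned lru_rank[NUM_LINES];   /* 0 = most recently used */

/* On every access to line `used`: it becomes rank 0, all other lines age by one. */
static void touch(int used)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (i != used)
            lru_rank[i]++;
    lru_rank[used] = 0;
}

/* Replacement victim: the line with the highest rank (least recently used). */
static int find_victim(void)
{
    int victim = 0;
    for (int i = 1; i < NUM_LINES; i++)
        if (lru_rank[i] > lru_rank[victim])
            victim = i;
    return victim;
}
```

With ranks capped at NUM_LINES - 1, each line needs log2(NUM_LINES) bits of rank, matching the per-line overhead above; the log2(N!) encoding is denser but more complex to update.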

Problem — AMAT
Assume main memory access time does not include cache access time.
Suppose that accessing a cache takes 10ns while accessing main memory in case of cache-miss takes 100ns. What is the average memory access time if the cache hit rate is 97%?
To improve performance, the cache size is increased. It is determined that this will increase the hit rate by 1%, but it will also increase the time for accessing the cache by 2ns. Will this improve the overall average memory access time?
12

Problem —AMAT
Suppose that accessing a cache takes 10ns while accessing main memory in case of cache-miss takes 100ns. What is the average memory access time if the cache hit rate is 97%?
AMAT = 10 + (1 - 0.97) * 100 = 13 ns
To improve performance, the cache size is increased. It is determined that this will increase the hit rate by 1%, but it will also increase the time for accessing the cache by 2ns. Will this improve the overall average memory access time?
AMAT = 12 + (1 - 0.98) * 100 = 14 ns, so no: the average memory access time gets worse (14 ns vs. 13 ns).
13
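A quick check of this arithmetic as a C sketch (the miss penalty is the full main-memory time because, per the problem, it does not include the cache access):

```c
#include <stdio.h>

/* AMAT = cache access time + miss rate * main memory access time */
static double amat(double cache_ns, double hit_rate, double mem_ns)
{
    return cache_ns + (1.0 - hit_rate) * mem_ns;
}

int main(void)
{
    printf("original cache: %.1f ns\n", amat(10.0, 0.97, 100.0));  /* 13.0 ns */
    printf("bigger cache:   %.1f ns\n", amat(12.0, 0.98, 100.0));  /* 14.0 ns */
    return 0;
}
```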

This lecture
Cache blocks
Spatial locality
Problems
Stores: Write-through vs Write-back
14

How can we reduce the overhead for each cache line?
Cache bigger units than bytes.
Each block has a single tag, and blocks can be whatever size we choose.
[Figure: cache lines with an lru field, a valid bit, a tag, and a multi-byte data block (e.g., tag 1 holding 110, ..., tag 7 holding 170, ...).]
15

Increasing cache block size reduces (tag and other) overhead
[Figure: a 16-byte memory (addresses 0–15 holding 74, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250), cached with two different block sizes.]
Case 1: Block size: 1 byte
Each cache line holds a valid bit, a tag, and one byte of data (e.g., one line caches 74 with tag 0, another caches 160 with tag 6).
Bits per tag = log2(number of blocks in memory) = log2(16) = 4 bits
Case 2: Block size: 2 bytes
Each cache line holds a valid bit, a tag, and two bytes of data (e.g., one line caches {74, 110} from addresses 0–1, another caches {160, 170} from addresses 6–7).
Bits per tag = log2(number of blocks in memory) = log2(8) = 3 bits
16

Block size: Idea
If you increase cache block size
=> Increases data per cache line
=> Increases data per tag + reduces tag size
17

Block size for caches
Processor reference stream: Ld R1←M[ 1 ], Ld R2←M[ 5 ], Ld R3←M[ 1 ], Ld R3←M[ 4 ], Ld R2←M[ 0 ]
Cache: 2 cache lines, 3-bit tag field, 2-byte block (each line: lru, V, tag, data).
Memory (addresses 0–15): 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250.
Walking through the references:
Ld R1←M[ 1 ]: miss. The block holding addresses 0–1 ({100, 110}, tag 0) is brought in; R1 = 110. (Misses: 1, Hits: 0)
Ld R2←M[ 5 ]: miss. The block holding addresses 4–5 ({140, 150}, tag 2) is brought in; R2 = 150. (Misses: 2, Hits: 0)
Ld R3←M[ 1 ]: hit. Address 1 is still cached; R3 = 110. (Misses: 2, Hits: 1)
Ld R3←M[ 4 ]: hit. Address 4 sits in the same block as address 5; R3 = 140. (Misses: 2, Hits: 2)
Ld R2←M[ 0 ]: hit. Address 0 was brought in together with address 1; R2 = 100. (Misses: 2, Hits: 3)
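The walkthrough above can be replayed with a small C sketch of a 2-line, 2-byte-block, LRU cache (my own toy model, not the course simulator):

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_LINES  2
#define BLOCK_SIZE 2   /* bytes per block */

static int memory[16] = {100, 110, 120, 130, 140, 150, 160, 170,
                         180, 190, 200, 210, 220, 230, 240, 250};

struct line { bool valid; int tag; int data[BLOCK_SIZE]; int lru; };
static struct line cache[NUM_LINES];

/* Load one byte through the cache; sets *hit and returns the value. */
static int load(int addr, int *hit)
{
    int tag = addr / BLOCK_SIZE, offset = addr % BLOCK_SIZE, victim = 0;

    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {            /* hit */
            *hit = 1;
            cache[i].lru = 0;
            for (int j = 0; j < NUM_LINES; j++) if (j != i) cache[j].lru++;
            return cache[i].data[offset];
        }
        if (cache[i].lru >= cache[victim].lru) victim = i;      /* remember LRU line */
    }

    *hit = 0;                                                   /* miss: fill the victim line */
    cache[victim].valid = true;
    cache[victim].tag = tag;
    for (int j = 0; j < BLOCK_SIZE; j++)
        cache[victim].data[j] = memory[tag * BLOCK_SIZE + j];
    cache[victim].lru = 0;
    for (int j = 0; j < NUM_LINES; j++) if (j != victim) cache[j].lru++;
    return cache[victim].data[offset];
}

int main(void)
{
    int refs[] = {1, 5, 1, 4, 0}, misses = 0, hits = 0;
    for (int i = 0; i < 5; i++) {
        int hit, val = load(refs[i], &hit);
        if (hit) hits++; else misses++;
        printf("M[%d] = %d (%s)\n", refs[i], val, hit ? "hit" : "miss");
    }
    printf("Misses: %d  Hits: %d\n", misses, hits);             /* 2 and 3 */
    return 0;
}
```

Running it reproduces the counts from the walkthrough: 2 misses and 3 hits.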

Spatial Locality
When we accessed address 1, we also brought in address 0.
This turned out to be a good thing since we later referenced address 0 and found it in the cache.
Spatial locality in a program says that if we reference a memory location (e.g., 1000), we are more likely to reference a location near it (e.g. 1001, 999) than some random location.
29

Storing arrays in memory: Row major vs Column major
Row major is a common choice.
30
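Loop order interacts with row-major storage. The C sketch below (mine, not the course demo) sums the same array twice, once walking consecutive addresses row by row and once striding a whole row between accesses; only the first ordering makes full use of each cache block, and on large arrays it is typically several times faster.

```c
#define N 1024
static double a[N][N];

/* Row-major traversal (cache-friendly): the inner loop walks consecutive bytes,
 * so each fetched cache block is fully used before it is evicted. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same row-major array: consecutive accesses are
 * N * sizeof(double) bytes apart, so nearly every access touches a new block. */
double sum_col_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```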

Spatial Locality
Observation: Applications access data near to what they just accessed
31

Importance of Cache on Performance:
Cache-aware code can be several times faster than non-aware code
Live demo:
See L1_3_370_Course_Overview Video at minute 18:00
34

Basic Cache organization
Decide on the block size
How? Simulate lots of different block sizes and see which one gives the best performance
Common cache block sizes: 32, 64, or 128 bytes
Larger block sizes reduce cache area overhead by:
reducing the number of cache lines (and therefore the number of tags and other meta-data)
reducing the size of each tag
Address = [ Tag | Block offset ]
Block size = 2 ^ (number of block offset bits), i.e., sizeof(block_offset) = log2(block_size)
Tag size = address_size – block_offset_size
35
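A sketch of the address-splitting arithmetic (assuming the block size is a power of two; there is no index field yet because the cache is fully associative):

```c
#include <stdio.h>
#include <stdint.h>

/* log2 of a power-of-two value */
static unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    unsigned address_bits = 32, block_size = 64;            /* example parameters */
    unsigned offset_bits  = log2u(block_size);              /* 6  */
    unsigned tag_bits     = address_bits - offset_bits;     /* 26 (fully associative) */

    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & (block_size - 1);              /* block offset field */
    uint32_t tag    = addr >> offset_bits;                  /* tag field */

    printf("offset bits = %u, tag bits = %u, tag = 0x%x, offset = 0x%x\n",
           offset_bits, tag_bits, (unsigned)tag, (unsigned)offset);
    return 0;
}
```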

Practice Problem – Compute the register and cache state after executing the instruction sequence
Processor reference stream: Ld R0←M[ 3 ], Ld R2←M[ 12 ], Ld R3←M[ 15 ], Ld R1←M[ 4 ], Ld R2←M[ 9 ]
Cache: 2 cache lines, 2-bit tag field, 4-byte block (each line: V, tag, data).
Memory (addresses 0–15): 78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225.
36

Solution to Practice Problem
Ld R0←M[ 3 ]: miss. The block with tag 0 (78, 29, 120, 123) is loaded; R0 = 123.
Ld R2←M[ 12 ]: miss. The block with tag 3 (19, 200, 210, 225) is loaded; R2 = 19.
Ld R3←M[ 15 ]: hit. Tag 3 is cached; R3 = 225.
Ld R1←M[ 4 ]: miss. The block with tag 1 (71, 150, 162, 173) replaces the LRU block (tag 0); R1 = 71.
Ld R2←M[ 9 ]: miss. The block with tag 2 (18, 21, 33, 28) replaces the LRU block (tag 3); R2 = 21.
Final cache state: line 0 = {V=1, tag 1, data 71, 150, 162, 173}; line 1 = {V=1, tag 2, data 18, 21, 33, 28}.
Final registers: R0 = 123, R1 = 71, R2 = 21, R3 = 225.
37
EECS 370: Introduction to Computer Organization

Class Problem 1
Given a cache with the following configuration: cache size is 8 bytes, block size is 2 bytes, fully associative, LRU replacement. The memory address size is 16 bits and is byte addressable.
1. How many bits are needed for each tag? How many blocks are in the cache?
2. For the following reference stream, indicate whether each reference is a hit or miss: 0, 1, 3, 5, 12, 1, 2, 9, 4
3. What is the hit rate?
4. How many bits are needed for storage overhead for each block?
39

Class Problem 1
Given a cache with the following configuration: cache size is 8 bytes, block size is 2 bytes, fully associative, LRU replacement. The memory address size is 16 bits and is byte addressable. Assume 2 bits per cache line to implement LRU.
1. How many bits are needed for each tag? How many blocks are in the cache?
Tag = 16 – 1 (block offset) = 15 bits;
2-byte blocks, 8 bytes total = 4 blocks.
2. For the following reference stream, indicate whether each reference is a hit or miss: 0, 1, 3, 5, 12, 1, 2, 9, 4
M, H, M, M, M, H, H, M, M
3. What is the hit rate? 3/9 = 33%
4. How many bits are needed for storage overhead for each block?
Overhead = 15 (Tag) + 1 (V) + 2 (LRU) = 18 bits
We are assuming 2 bits per cache line to store the LRU rank.
More efficient encodings for LRU exist: log2(#cache_lines!) = log2(4!) = 5 bits for the whole cache.
40
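A quick arithmetic check of the overhead numbers above, as a C sketch (compile with -lm; lgamma is used only to evaluate log2(4!)):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    int addr_bits = 16, block_size = 2, cache_size = 8;
    int lines       = cache_size / block_size;          /* 4 blocks        */
    int offset_bits = (int)log2(block_size);            /* 1 bit           */
    int tag_bits    = addr_bits - offset_bits;          /* 15 bits         */
    int lru_bits    = (int)ceil(log2(lines));           /* 2 bits per line */

    printf("per-line overhead = %d bits\n", tag_bits + 1 /*V*/ + lru_bits);   /* 18 */

    /* Compact alternative: one log2(lines!) field for the whole cache. */
    double log2_fact = lgamma(lines + 1.0) / log(2.0);  /* log2(4!) = log2(24) */
    printf("log2(%d!) = %.2f -> %d bits total\n", lines, log2_fact, (int)ceil(log2_fact));  /* 5 */
    return 0;
}
```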

Class Problem 2—Storage overhead
Consider the following cache:
32-bit byte addressable ISA
Cache size: 64 KB (kilobytes) = 64 * 8 kilobits = 512 kilobits
Cache block size: 64 bytes
Write-allocate, write-back, fully associative
Recall: 1 kilobyte = 1024 bytes (NOT 1000 bytes!)
What is the cache area overhead (tags, valid, dirty, LRU)?
41

Class Problem 2—Storage overhead
Consider the following cache:
32-bit byte addressable ISA
Cache size: 64 KB (kilobytes) = 64 * 8 kilobits = 512 kilobits
Cache block size: 64 bytes
Write-allocate, write-back, fully associative
Recall: 1 kilobyte = 1024 bytes (NOT 1000 bytes!)
What is the cache area overhead (tags, valid, dirty, LRU)? Assume log2(#blocks) bits of LRU per block.
Tag = 32 (address) – 6 (block offset) = 26 bits
#blocks = 64K / 64 = 1024
LRU bits = log2(#blocks) = 10 bits
Overhead per block = 26 (Tag) + 1 (V) + 1 (D) + 10 (LRU) = 38 bits
Total overhead for cache = 38 bits * #blocks = 38 * 1024 = 38912 bits
42
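The same kind of arithmetic check for this problem, as a sketch with the problem's constants hard-coded:

```c
#include <stdio.h>

int main(void)
{
    unsigned address_bits = 32;
    unsigned cache_bytes  = 64 * 1024;                     /* 64 KB    */
    unsigned block_bytes  = 64;

    unsigned blocks      = cache_bytes / block_bytes;      /* 1024     */
    unsigned offset_bits = 6;                              /* log2(64) */
    unsigned tag_bits    = address_bits - offset_bits;     /* 26       */
    unsigned lru_bits    = 10;                             /* log2(1024), per block as assumed */
    unsigned per_block   = tag_bits + 1 /*V*/ + 1 /*D*/ + lru_bits;   /* 38 */

    printf("per block: %u bits, total: %u bits\n", per_block, per_block * blocks);  /* 38 and 38912 */
    return 0;
}
```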

Handling Stores:
write-through vs write-back caches
43

Memory
[Figure: a 16-byte memory (addresses 0–15 holding 74, 110, 120, ..., 250) shown next to an empty tag array (V) and data array.]
44

What about stores?
Where should you write the result of a store to an address X?
If address X is in the cache:
Write to the cache.
Should we also write to memory?
(yes – write-through policy)
(no – write-back policy – write only when the modified cache block is evicted)
If address X is not in the cache:
Allocate address X in the cache?
(yes – allocate-on-write policy)
(no – directly write to memory, no allocate-on-write policy)
45
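The policy choices above, sketched in C as a toy software model (all helper names and the byte-granularity bookkeeping are my own simplifications; a real cache does this in hardware at block granularity):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for the real cache/memory machinery. */
static uint8_t memory[16];
static uint8_t cache_data[16];                   /* pretend any byte can be cached */
static bool    cached[16], dirty[16];

static bool in_cache(uint32_t a)                { return cached[a]; }
static void allocate_block(uint32_t a)          { cached[a] = true; cache_data[a] = memory[a]; }
static void write_cache(uint32_t a, uint8_t v)  { cache_data[a] = v; }
static void write_memory(uint32_t a, uint8_t v) { memory[a] = v; }
static void mark_dirty(uint32_t a)              { dirty[a] = true; }

static void store(uint32_t addr, uint8_t value, bool write_through, bool allocate_on_write)
{
    if (!in_cache(addr)) {
        if (!allocate_on_write) {                /* no allocate-on-write: bypass the cache */
            write_memory(addr, value);
            return;
        }
        allocate_block(addr);                    /* allocate-on-write: bring the block in first */
    }
    write_cache(addr, value);
    if (write_through)
        write_memory(addr, value);               /* write-through: memory updated immediately */
    else
        mark_dirty(addr);                        /* write-back: memory updated only on eviction */
}

int main(void)
{
    store(5, 42, true,  true);   /* write-through, allocate-on-write */
    store(9,  7, false, true);   /* write-back,    allocate-on-write */
    printf("mem[5]=%d mem[9]=%d dirty[9]=%d\n", memory[5], memory[9], dirty[9]);
    /* prints mem[5]=42 mem[9]=0 dirty[9]=1 */
    return 0;
}
```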

Handling stores (write-through)
Processor reference stream: Ld R1←M[ 1 ], Ld R2←M[ 7 ], St R2→M[ 0 ], St R1→M[ 5 ], Ld R2←M[ 10 ]
Cache: 2 cache lines, 2-byte blocks, LRU, allocate-on-write, write-through.
Memory (addresses 0–15): 78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225.
Walking through the references:
Ld R1←M[ 1 ]: miss. Block {78, 29} (addresses 0–1) is brought in; R1 = 29. (Misses: 1, Hits: 0)
Ld R2←M[ 7 ]: miss. Block {162, 173} (addresses 6–7) is brought in; R2 = 173. (Misses: 2, Hits: 0)
St R2→M[ 0 ]: hit. The cached block becomes {173, 29}, and the store is also written through to memory, so M[0] = 173. (Misses: 2, Hits: 1)
St R1→M[ 5 ]: miss. Block {71, 150} (addresses 4–5) replaces the LRU block {162, 173}; the store updates the cached block to {71, 29} and is written through, so M[5] = 29. (Misses: 3, Hits: 1)
Ld R2←M[ 10 ]: miss. Block {33, 28} (addresses 10–11) replaces the LRU block {173, 29}; R2 = 33. (Misses: 4, Hits: 1)

How many reads/writes to main memory in a write-through cache?
Each cache miss reads a block (2 bytes in our example) from main memory. Total reads from main memory: 8 bytes.
Each store writes a byte to main memory. Total writes to main memory: 2 bytes.
But cache miss rates are typically low (< 20%), so in realistic programs the total main memory reads due to cache misses are far fewer than the main memory writes in a write-through cache.
57

Write-through vs. write-back
Can we also design the cache to NOT write all stores to memory immediately?
We can keep the most recent copy in the cache and update memory only when that data is evicted from the cache (a write-back policy).
Do we need to write back all evicted cache blocks?
No, only blocks that have been stored into.
Keep a "dirty bit":
reset when the line is allocated
set when the block is stored into
If a block is "dirty" when evicted, write its data back into memory.
58

Handling stores (write-back)
Same cache and reference stream as the write-through example (2 cache lines, 2-byte blocks, LRU, allocate-on-write), but now each line also has a dirty (d) bit.
Ld R1←M[ 1 ]: miss. Block {78, 29} is brought in, clean; R1 = 29. (Misses: 1, Hits: 0)
Ld R2←M[ 7 ]: miss. Block {162, 173} is brought in, clean; R2 = 173. (Misses: 2, Hits: 0)
St R2→M[ 0 ]: hit. The cached block becomes {173, 29} and is marked dirty; memory is NOT updated. (Misses: 2, Hits: 1)
St R1→M[ 5 ]: miss. Block {71, 150} replaces the LRU block (tag 3, clean, so nothing is written back); the store updates the cached block to {71, 29} and marks it dirty. (Misses: 3, Hits: 1)
Ld R2←M[ 10 ]: miss. Block {33, 28} replaces the LRU block (tag 0), which is dirty, so {173, 29} is first written back to memory; R2 = 33. (Misses: 4, Hits: 1)
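A sketch of the dirty-bit bookkeeping just described, mirroring the St R2→M[ 0 ] step of the example (structure and function names are my own):

```c
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 2

/* Backing memory from the example (addresses 0-15). */
static unsigned char memory[16] = {78, 29, 120, 123, 71, 150, 162, 173,
                                   18, 21,  33,  28, 19, 200, 210, 225};

struct line {
    bool valid, dirty;
    unsigned tag;
    unsigned char data[BLOCK_SIZE];
};

/* Fill a line on a miss: the dirty bit is reset when the line is allocated. */
static void fill(struct line *l, unsigned tag)
{
    memcpy(l->data, &memory[tag * BLOCK_SIZE], BLOCK_SIZE);
    l->tag = tag;
    l->valid = true;
    l->dirty = false;
}

/* A store hit under write-back: update the cached copy and set the dirty bit;
 * memory is not touched yet. */
static void store_hit(struct line *l, unsigned offset, unsigned char value)
{
    l->data[offset] = value;
    l->dirty = true;
}

/* Eviction under write-back: only dirty blocks are written back to memory. */
static void evict(struct line *l)
{
    if (l->valid && l->dirty)
        memcpy(&memory[l->tag * BLOCK_SIZE], l->data, BLOCK_SIZE);
    l->valid = false;
}

int main(void)
{
    struct line l;
    fill(&l, 0);                 /* Ld from address 0 or 1: block {78, 29} comes in clean */
    store_hit(&l, 0, 173);       /* St R2->M[0]: cache now holds 173, memory still 78 */
    printf("before eviction: mem[0]=%d\n", memory[0]);   /* 78  */
    evict(&l);                   /* dirty block is written back on eviction */
    printf("after eviction:  mem[0]=%d\n", memory[0]);   /* 173 */
    return 0;
}
```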
write-back (REF 5): after the final reference, the cache holds the clean block {33, 28} (tag 5) and the dirty block {71, 29} (tag 2), and memory address 0 now holds 173 from the write-back. (Misses: 4, Hits: 1)
70

How many reads/writes to main memory in a write-back cache?
Each cache miss reads a block (2 bytes in our example) from main memory. Total reads from main memory: 8 bytes.
Each dirty block writes a block (2 bytes) back to main memory when it is evicted. Total writes to main memory: 4 bytes (the two stores dirtied two blocks).
For this example, would you choose write-back or write-through?
71

Summary: Questions to ask about a cache
Cache design (so far):
How many bytes of data storage?
What is the block size? (spatial locality)
How many lines? (= cache size / cache block size)
What is the replacement policy? (LRU? LFU? FIFO? Random?) (temporal locality)
Performance
What is the hit rate?
What is the latency of an access?
Area overhead
How much overhead storage? (tags, valid, dirty, LRU)
72

Interactive tools
73

Interactive tool: caches
URL: Cache interactive simulator
74

Interactive tool: pipeline
URL: Pipeline interactive simulator
75