Computer Architecture ELEC3441
Lecture 9 – Cache (2)
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
Causes of Cache Misses: The 3 C’s
Compulsory: first reference to a line (a.k.a. cold start misses)
– misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed by the program
– misses that would occur even under perfect replacement policy
Conflict: misses that occur because of collisions due to line-placement strategy
– misses that would not occur with ideal full associativity
AMAT
n Average Memory Access Time:
AMAT = Hit Time + Miss Rate × Miss Penalty
Example
n Processor runs at 2 GHz with CPI = 1. The miss penalty to memory is 50 clock cycles. The L1 cache returns data in 1 cycle on a cache hit. On a particular program, the instruction miss rate is 1%. Loads/stores make up 30% of dynamic instructions and have a miss rate of 5%. Assume read/write penalties are the same and ignore other stalls.
n What is AMAT for instruction/data?
n What is the average CPI given the above memory access times?
Example: AMAT
Instruction Cache:
AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 1% × 50 = 1.5 cycles
Data Cache:
AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 5% × 50 = 3.5 cycles
Average CPI (with Memory)
# of instr. = i
# of instruction memory miss cycles = i × 1% × 50 = 0.5i
# of data memory miss cycles = i × 30% × 5% × 50 = 0.75i
Total # of memory stall cycles = 0.5i + 0.75i = 1.25i
Average CPI = (i + 1.25i) / i = 2.25
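The same arithmetic, as a minimal C sketch (the rates and penalties below are the values assumed in this example):

#include <stdio.h>

int main(void) {
    /* Parameters from the example */
    double hit_time     = 1.0;    /* cycles */
    double miss_penalty = 50.0;   /* cycles */
    double i_miss_rate  = 0.01;   /* instruction cache miss rate */
    double d_miss_rate  = 0.05;   /* data cache miss rate */
    double ls_fraction  = 0.30;   /* loads/stores as fraction of instructions */
    double base_cpi     = 1.0;

    /* AMAT = Hit Time + Miss Rate x Miss Penalty */
    double amat_i = hit_time + i_miss_rate * miss_penalty;     /* 1.5 cycles */
    double amat_d = hit_time + d_miss_rate * miss_penalty;     /* 3.5 cycles */

    /* Memory stall cycles per instruction: 0.5 + 0.75 = 1.25 */
    double stalls = i_miss_rate * miss_penalty
                  + ls_fraction * d_miss_rate * miss_penalty;

    printf("AMAT (I-cache) = %.2f cycles\n", amat_i);
    printf("AMAT (D-cache) = %.2f cycles\n", amat_d);
    printf("Average CPI    = %.2f\n", base_cpi + stalls);      /* 2.25 */
    return 0;
}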
CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline datapath with the primary instruction cache in the fetch stage and the primary data cache in the memory stage; cache refill data comes from lower levels of the memory hierarchy]
Stall entire CPU on data cache miss
Improving Cache Performance
Average memory access time (AMAT) =
Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
What is best cache design for 5-stage pipeline?
Biggest cache that doesn’t increase hit time past 1 cycle
(approx 8-32KB in modern technology)
[design issues more complex with deeper pipelines and/or out-of-order superscalar processors]
Effect of Cache Parameters on Performance
§Larger cache size
+ reduces capacity and conflict misses
– hit time will increase
§Higher associativity
+ reduces conflict misses
– may increase hit time
§Larger line size
+ reduces compulsory and capacity (reload) misses
– increases conflict misses and miss penalty
Performance vs. Associativity
[Figure: miss rate per type (capacity vs. conflict) versus cache size from 4 KiB to 1024 KiB, and miss rate versus associativity (one-way to eight-way) for cache sizes from 1 KiB to 128 KiB]
n 1-way → 2-way: significant drop in miss rate
n 2-way → 4-way: less significant drop
n Effect of associativity is most significant in small caches
Write Policy Choices
§Cache hit:
– write through: write both cache & memory
• Generally higher traffic but simpler pipeline & cache design
– write back: write cache only, memory is written only when the entry is evicted
• A dirty bit per line further reduces write-back traffic
• Must handle 0, 1, or 2 accesses to memory for each load/store
§Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
§Common combinations:
– write through and no write allocate
– write back with write allocate (both combinations are sketched below)
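A minimal C sketch of the two common combinations above, using a hypothetical one-line cache model (the backing-store array and line size are illustrative, not from the slides):

#include <string.h>

#define LINE_BYTES 64
static unsigned char memory[1 << 20];               /* toy backing store */

typedef struct {
    int valid, dirty;                                /* dirty used only by write-back */
    unsigned base;                                   /* line-aligned address of the cached line */
    unsigned char data[LINE_BYTES];
} cache_line_t;

/* Write back + write allocate */
void write_wb(cache_line_t *c, unsigned addr, unsigned char byte) {
    unsigned base = addr & ~(unsigned)(LINE_BYTES - 1);
    if (!c->valid || c->base != base) {              /* write miss */
        if (c->valid && c->dirty)                    /* write dirty victim back to memory */
            memcpy(&memory[c->base], c->data, LINE_BYTES);
        memcpy(c->data, &memory[base], LINE_BYTES);  /* write allocate: fetch full line */
        c->base = base; c->valid = 1;
    }
    c->data[addr - base] = byte;                     /* write cache only */
    c->dirty = 1;                                    /* memory updated at eviction */
}

/* Write through + no write allocate */
void write_wt(cache_line_t *c, unsigned addr, unsigned char byte) {
    unsigned base = addr & ~(unsigned)(LINE_BYTES - 1);
    if (c->valid && c->base == base)                 /* write hit: update the cache ... */
        c->data[addr - base] = byte;
    memory[addr] = byte;                             /* ... and always write memory */
}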
Write Hit (Cache Writing)
[Figure: write-hit datapath; the address is split into tag, index (k bits), and offset (b bits), the tag compare produces HIT, and HIT gates the write enable (WE) of the selected data line]
Write back: done
Write through: write also to memory
Write Through via Write Buffer
[Figure: the processor writes to both the cache ($) and a write buffer; the write buffer drains to DRAM]
n Processor writes to both $ and write buffer
• Memory write completes, from the processor's point of view, as soon as the data is in the write buffer
n Memory controller completes the write to DRAM in the background
n Writing too fast may saturate write buffer
Read Miss with Write Buffer
n On Read Miss, need to read memory to fill cache
n But data may still be in write buffer pending write to DRAM
n 2 Solutions:
• Flush write buffer before read
• Check all pending writes in write buffer and return latest write data if address matches
Q: Would there be data in write buffer that needs to be forwarded on a read hit?
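A minimal C sketch of the second option, assuming a small write buffer of address/data entries where higher indices hold more recent writes (the structure and names are illustrative):

#define WB_ENTRIES 8

struct wb_entry { int valid; unsigned addr; unsigned data; };
static struct wb_entry write_buf[WB_ENTRIES];        /* index 0 = oldest pending write */

/* On a read miss, scan the pending writes from newest to oldest; if the
 * address matches, forward the buffered data so the refill does not use
 * stale DRAM contents. */
int read_miss_check_wb(unsigned addr, unsigned *data_out) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--) {
        if (write_buf[i].valid && write_buf[i].addr == addr) {
            *data_out = write_buf[i].data;            /* forward latest buffered write */
            return 1;                                 /* hit in write buffer */
        }
    }
    return 0;                                         /* not pending: refill from DRAM */
}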
Write Miss
n Write miss happens when write location not in cache
n Write Allocate:
• At the end of the write, cache contains a full line of data
• Need to read from memory
• Write back: must have write allocate
• Write through: may or may not
n No write allocate:
• Data goes straight to memory
Multilevel Caches
Problem: A memory cannot be both large and fast
Solution: Increasing sizes of cache at each level
[Figure: CPU → L1$ → L2$ → DRAM memory hierarchy]
Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
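A small numeric sketch of these definitions in C (the counts below are made-up values, just to show that the global L2 miss rate equals the L1 miss rate times the local L2 miss rate):

#include <stdio.h>

int main(void) {
    /* Hypothetical counts, for illustration only */
    double cpu_accesses = 1000.0;   /* CPU memory accesses */
    double l1_misses    = 40.0;     /* these become the accesses seen by L2 */
    double l2_misses    = 10.0;

    double l1_miss_rate        = l1_misses / cpu_accesses;   /* 4%: local = global for L1 */
    double l2_local_miss_rate  = l2_misses / l1_misses;      /* 25% of the accesses L2 sees */
    double l2_global_miss_rate = l2_misses / cpu_accesses;   /* 1% of all CPU accesses */

    /* global L2 miss rate == L1 miss rate * local L2 miss rate */
    printf("%.3f == %.3f\n", l2_global_miss_rate, l1_miss_rate * l2_local_miss_rate);
    return 0;
}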
Presence of L2 influences L1 design
§Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time
– Backup L2 reduces L1 miss penalty
– Reduces average access energy
§Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
Inclusion Policy
§ Inclusive multilevel cache:
– Inner cache can only hold lines also present in the outer cache
– External coherence snoop accesses need only check the outer cache
§ Exclusive multilevel caches:
– Inner cache may hold lines not in outer cache
– Swap lines between inner/outer caches on miss
– Used in AMD Athlon with 64KB primary and 256KB secondary cache
Why choose one type or the other?
L1 vs L2 Miss Rate
[Figure: L1 and L2 data cache miss rates for the ARM Cortex-A8 running Minnespec, per benchmark]
n Miss rate on L2$ is usually much lower than on L1$
n L2 usually has:
• Higher capacity
• Higher associativity
n Only accesses that miss in L1 arrive at L2
Itanium-2 On-Chip Caches (Intel/HP, 2002)
Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency
Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency
Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency
Power 7 On-Chip Caches [IBM 2009]
32KB L1 I$/core + 32KB L1 D$/core, 3-cycle latency
256KB Unified L2$/core, 8-cycle latency
32MB Unified Shared L3$, embedded DRAM (eDRAM), 25-cycle latency to local slice
IBM z196 Mainframe Caches 2010
§96 cores (4 cores/chip, 24 chips/system)
– Out-of-order, 3-way superscalar @ 5.2GHz
§L1: 64KB I-$/core + 128KB D-$/core
§L2: 1.5MB private/core (144MB total)
§L3: 24MB shared/chip (eDRAM) (576MB total)
§L4: 768MB shared/system (eDRAM)
Prefetching
§Speculate on future instruction and data accesses and fetch them into cache(s)
– Instruction accesses easier to predict than data accesses
§Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
§What types of misses does prefetching affect?
Issues in Prefetching
§ Usefulness – should produce hits
§ Timeliness – not too late and not too early
§ Cache and bandwidth pollution
[Figure: prefetched data is brought from the unified L2 cache into the L1 instruction and L1 data caches ahead of the CPU's demand accesses]
Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064
– Fetch two lines on a miss; the requested line (i) and the next consecutive line (i+1)
– Requested line placed in cache, and next line in instruction stream buffer
– If miss in cache but hit in stream buffer, move stream buffer line into cache and prefetch next line (i+2)
[Figure: on a miss, the requested line is returned to the L1 instruction cache while the prefetched line goes into a stream buffer sitting between the L1 and the unified L2 cache]
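A minimal C sketch of this policy with a one-entry stream buffer (the cache/L2 helper functions are illustrative stubs, not the actual Alpha implementation):

/* Illustrative stubs for the cache- and memory-side operations */
static void icache_fill(unsigned line)   { (void)line; /* install line in L1 I-cache */ }
static void fetch_from_l2(unsigned line) { (void)line; /* start a refill from L2 */ }

static unsigned sbuf_line;   /* line number held in the one-entry stream buffer */
static int      sbuf_valid;

void icache_miss(unsigned line) {
    if (sbuf_valid && sbuf_line == line) {
        icache_fill(line);           /* hit in stream buffer: move its line into the cache */
        sbuf_line = line + 1;        /* ... and prefetch the next line into the buffer */
        fetch_from_l2(sbuf_line);
    } else {
        icache_fill(line);           /* requested line i goes into the cache */
        fetch_from_l2(line);
        sbuf_line  = line + 1;       /* next consecutive line i+1 goes into the stream buffer */
        fetch_from_l2(sbuf_line);
        sbuf_valid = 1;
    }
}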
Hardware Data Prefetching
§ Prefetch-on-miss:
– Prefetch b+1 upon miss on b
§ One-Block Lookahead (OBL) scheme
– Initiate prefetch for block b+1 when block b is accessed
– Why is this different from doubling block size?
– Can extend to N-block lookahead
§ Strided prefetch
– If we observe a sequence of accesses to lines b, b+N, b+2N, then prefetch b+3N, etc. (sketched below)
§ Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access
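A minimal C sketch of one stride-detection entry (real prefetchers such as the Power 5 track many such streams in parallel; the names here are illustrative):

/* One stride-prefetch entry: remember the last address and last stride;
 * when the same non-zero stride is seen twice in a row, prefetch one stride ahead. */
struct stride_entry { unsigned last_addr; int last_stride; };

static void issue_prefetch(unsigned line_addr) { (void)line_addr; /* start a refill */ }

void observe_access(struct stride_entry *e, unsigned addr) {
    int stride = (int)(addr - e->last_addr);
    if (stride != 0 && stride == e->last_stride)
        issue_prefetch(addr + stride);    /* saw b, b+N, b+2N -> prefetch b+3N */
    e->last_stride = stride;              /* remember the most recent stride */
    e->last_addr   = addr;
}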
Software Prefetching
for(i=0; i < N; i++) {
prefetch( &a[i + 1] );
prefetch( &b[i + 1] );
SUM = SUM + a[i] * b[i];
}
Software Prefetching Issues
§ Timing is the biggest issue, not predictability
– If you prefetch very close to when the data is required, you might be too late
– Prefetch too early, and you cause pollution
– Estimate how long it will take for the data to come into L1, so we can set P appropriately
– Why is this hard to do?
for(i=0; i < N; i++) {
  prefetch( &a[i + P] );
  prefetch( &b[i + P] );
  SUM = SUM + a[i] * b[i];
}
Must consider cost of prefetch instructions
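For reference, a compilable sketch of this loop, assuming GCC/Clang's __builtin_prefetch intrinsic and an illustrative prefetch distance P (the right P depends on the machine's miss latency):

#define N 4096
#define P 16                                    /* prefetch distance, machine dependent */

double dot_prefetch(const double *a, const double *b) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + P < N) {                        /* stay inside the arrays */
            __builtin_prefetch(&a[i + P]);      /* hint: bring a[i+P] toward the cache */
            __builtin_prefetch(&b[i + P]);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

Each prefetch is still an instruction, so it adds fetch/issue overhead even when the data is already cached.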
Compiler Optimizations
§ Restructuring code affects the data access sequence
– Group data accesses together to improve spatial locality
– Re-order data accesses to improve temporal locality
§ Prevent data from entering the cache
– Useful for variables that will only be accessed once before being replaced
– Needs mechanism for software to tell hardware not to cache data (“no-allocate” instruction hints or page table bits)
§ Kill data that will never be used again
– Streaming data exploits spatial locality but not temporal locality
– Replace into dead cache locations
Loop Interchange
for(j=0; j < N; j++) {
for(i=0; i < M; i++) {
x[i][j] = 2 * x[i][j];
}
}
for(i=0; i < M; i++) {
for(j=0; j < N; j++) {
x[i][j] = 2 * x[i][j];
}
}
What type of locality does this improve?
Loop Fusion
for(i=0; i < N; i++)
a[i] = b[i] * c[i];
for(i=0; i < N; i++)
d[i] = a[i] * c[i];
for(i=0; i < N; i++)
{
a[i] = b[i] * c[i];
d[i] = a[i] * c[i];
}
What type of locality does this improve?
Matrix Multiply, Naïve Code
for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }
[Figure: access patterns of x, y, and z; the legend distinguishes not-touched, old-access, and new-access elements]
Matrix Multiply with Cache Tiling
for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }
[Figure: blocked access patterns of x, y, and z with tile size B]
What type of locality does this improve?
Acknowledgements
n These slides contain material developed and copyright by:
• Arvind (MIT)
• Krste Asanovic (MIT/UCB)
• Joel Emer (Intel/MIT)
• James Hoe (CMU)
• John Kubiatowicz (UCB)
• David Patterson (UCB)
• John Lazzaro (UCB)
n MIT material derived from course 6.823
n UCB material derived from course CS152,
CS252