
Computer Architecture ELEC3441
Lecture 9 – Cache (2)
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
Causes of Cache Misses: The 3 C’s
Compulsory: first reference to a line (a.k.a. cold-start misses)
– misses that would occur even with an infinite cache
Capacity: cache is too small to hold all data needed by the program
– misses that would occur even under a perfect replacement policy
Conflict: misses that occur because of collisions due to the line-placement strategy
– misses that would not occur with ideal full associativity
AMAT
n Average Memory Access Time:
AMAT = Hit Time + Miss Rate × Miss Penalty
Example
n Processor runs at 2 GHz with CPI=1. The miss penalty to memory is 50 clock cycles. The L1 cache returns data in 1 cycle on a cache hit. On a particular program, the instruction miss rate is 1%. Loads/stores make up 30% of dynamic instructions and have a miss rate of 5%. Assume read/write penalties are the same and ignore other stalls.
n What is the AMAT for instructions/data?
n What is the average CPI given the above memory access time?

Example: AMAT
Instruction Cache:
AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 1% × 50 = 1.5 cycles
Data Cache:
AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 5% × 50 = 3.5 cycles

Average CPI (with Memory)
# of instructions = i
# of instruction memory miss cycles = i × 1% × 50 = 0.5i
# of data memory miss cycles = i × 30% × 5% × 50 = 0.75i
Total # of memory stall cycles = 0.5i + 0.75i = 1.25i
Average CPI = (i + 1.25i) / i = 2.25
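To double-check the arithmetic above, here is a minimal C sketch (not part of the original slides) that plugs in the same parameters; the variable names are illustrative.

#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* L1 hit latency in cycles */
    double miss_penalty = 50.0;   /* cycles to refill from memory */
    double i_miss_rate  = 0.01;   /* instruction-cache miss rate */
    double d_miss_rate  = 0.05;   /* data-cache miss rate */
    double ls_fraction  = 0.30;   /* fraction of instructions that are loads/stores */
    double base_cpi     = 1.0;

    double amat_i = hit_time + i_miss_rate * miss_penalty;    /* 1.5 cycles */
    double amat_d = hit_time + d_miss_rate * miss_penalty;    /* 3.5 cycles */
    double stalls = i_miss_rate * miss_penalty                 /* 0.50 cycles/instr */
                  + ls_fraction * d_miss_rate * miss_penalty;  /* 0.75 cycles/instr */
    double cpi    = base_cpi + stalls;                         /* 2.25 */

    printf("AMAT(I)=%.2f  AMAT(D)=%.2f  CPI=%.2f\n", amat_i, amat_d, cpi);
    return 0;
}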
CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline datapath — the PC feeds the primary instruction cache (addr → inst, hit?); decode/register fetch, the ALU, and the primary data cache (addr, wdata → rdata, hit?) follow; cache refill data comes from the lower levels of the memory hierarchy.]
Stall entire CPU on data cache miss
Improving Cache Performance
Average memory access time (AMAT) =
Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
What is best cache design for 5-stage pipeline?
Biggest cache that doesn’t increase hit time past 1 cycle
(approx 8-32KB in modern technology)
[design issues are more complex with deeper pipelines and/or out-of-order superscalar processors]

Effect of Cache Parameters on Performance
§Larger cache size
+ reduces capacity and conflict misses
– hit time will increase
§Higher associativity
+ reduces conflict misses
– may increase hit time
§Larger line size
+ reduces compulsory and capacity (reload) misses
– increases conflict misses and miss penalty
Performance vs. Associativity
[Figure: miss rate vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KiB to 128 KiB, and miss rate per type (capacity vs. total) vs. cache size from 4 KiB to 1024 KiB.]
n 1-way → 2-way: significant drop in miss rate
n 2-way → 4-way: less significant
n Effect of associativity is significant in small caches
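As an aside (not part of the original slides), the sketch below shows how cache size, associativity, and line size determine the offset/index/tag split of an address; the parameter values are illustrative and assume powers of two with 32-bit addresses.

#include <stdint.h>
#include <stdio.h>

static unsigned log2u(uint32_t x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    uint32_t size_bytes = 32 * 1024;   /* total capacity */
    uint32_t line_bytes = 64;          /* line (block) size */
    uint32_t ways       = 4;           /* associativity */

    uint32_t sets        = size_bytes / (line_bytes * ways);
    unsigned offset_bits = log2u(line_bytes);
    unsigned index_bits  = log2u(sets);
    unsigned tag_bits    = 32 - index_bits - offset_bits;
    printf("sets=%u offset=%u index=%u tag=%u\n", sets, offset_bits, index_bits, tag_bits);

    uint32_t addr  = 0x1234ABCDu;
    uint32_t index = (addr >> offset_bits) & (sets - 1);
    uint32_t tag   = addr >> (offset_bits + index_bits);
    printf("addr=0x%08X -> set %u, tag 0x%X\n", addr, index, tag);
    return 0;
}

Doubling the associativity at the same capacity halves the number of sets (one fewer index bit, one more tag bit), which is why conflict misses drop while the tag comparison hardware grows.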
Write Policy Choices
§Cache hit:
– write through: write both cache & memory
• Generally higher traffic but simpler pipeline & cache design
– write back: write cache only, memory is written only when the
entry is evicted
• A dirty bit per line further reduces write-back traffic
• Must handle 0, 1, or 2 accesses to memory for each load/store
§Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch into cache
§Common combinations:
– write through and no write allocate
– write back with write allocate (sketched below)
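A minimal sketch (not part of the original slides) of the common write-back + write-allocate combination for a direct-mapped cache; the structure and the helper names (mem_read_line, mem_write_line) are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define NSETS      256
#define LINE_BYTES 64

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} Line;

static Line cache[NSETS];

/* Hypothetical memory-side helpers, assumed to be defined elsewhere. */
void mem_read_line(uint32_t line_addr, uint8_t *buf);
void mem_write_line(uint32_t line_addr, const uint8_t *buf);

void cache_write_byte(uint32_t addr, uint8_t value) {
    uint32_t offset = addr % LINE_BYTES;
    uint32_t index  = (addr / LINE_BYTES) % NSETS;
    uint32_t tag    = addr / (LINE_BYTES * NSETS);
    Line *l = &cache[index];

    if (!(l->valid && l->tag == tag)) {            /* write miss */
        if (l->valid && l->dirty)                  /* write back the dirty victim first */
            mem_write_line((l->tag * NSETS + index) * LINE_BYTES, l->data);
        mem_read_line((addr / LINE_BYTES) * LINE_BYTES, l->data);  /* write allocate: fetch line */
        l->valid = true;
        l->tag   = tag;
    }
    l->data[offset] = value;                       /* write hit path: update cache only */
    l->dirty = true;                               /* memory is updated only on eviction */
}

With write-through + no-write-allocate, the miss path above would instead send the store straight to memory and leave the cache untouched.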
Write Hit (Cache Writing)
[Figure: write-hit datapath — the address splits into tag (t bits), index (k bits), and offset (b bits); the index selects one of 2^k lines, the stored tag and valid bit are compared with the address tag, and on a HIT the write enable (WE) updates the selected data word or byte.]
Write back: done
Write through: write also to memory

Write Through via Write Buffer
[Figure: the processor writes to both the cache ($) and a write buffer; the memory controller drains the write buffer to DRAM.]
n Processor writes to both $ and write buffer
• Memory write completes as soon as the data is in the write buffer
n Memory controller completes the write to DRAM in the background
n Writing too fast may saturate the write buffer
Read Miss with Write Buffer
n On Read Miss, need to read memory to fill cache
n But data may still be in write buffer pending write to DRAM
n 2 Solutions:
• Flush write buffer before the read
• Check all pending writes in the write buffer and return the latest write data if the address matches (sketched below)
Q: Would there be data in the write buffer that needs to be forwarded on a read hit?
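A minimal sketch (not part of the original slides) of the second solution: on a read miss, scan the write buffer for a pending store to the same address before going to DRAM. The buffer layout and the dram_read helper are illustrative assumptions; entries are assumed to be filled in program order, so a higher index is a newer store.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 8

typedef struct { bool valid; uint32_t addr; uint32_t data; } WBEntry;
static WBEntry wb[WB_ENTRIES];

/* Hypothetical DRAM-side helper, assumed to be defined elsewhere. */
uint32_t dram_read(uint32_t addr);

uint32_t read_miss(uint32_t addr) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--)      /* newest matching entry wins */
        if (wb[i].valid && wb[i].addr == addr)
            return wb[i].data;                     /* forward pending write data */
    return dram_read(addr);                        /* no match: fetch from DRAM */
}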
Write Miss
n Write miss happens when write location not in cache
n Write Allocate:
• At the end of the write, the cache contains the full line of data
• Need to read the line from memory
• Write back: must have write allocate
• Write through: may or may not
n No write allocate:
• Data go straight to memory
Multilevel Caches
Problem: A memory cannot be both large and fast
Solution: Increase the size of the cache at each level

CPU → L1$ → L2$ → DRAM

Local miss rate = misses in this cache / accesses to this cache
Global miss rate = misses in this cache / CPU memory accesses
Misses per instruction = misses in this cache / number of instructions
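A small worked sketch (not part of the original slides) relating local and global miss rates in a two-level hierarchy; all the numbers are illustrative assumptions, not course data.

#include <stdio.h>

int main(void) {
    double l1_miss  = 0.05;    /* L1 local (= global) miss rate */
    double l2_local = 0.20;    /* L2 misses / L2 accesses */
    double l1_hit   = 1.0;     /* cycles */
    double l2_hit   = 10.0;    /* cycles */
    double mem_pen  = 100.0;   /* cycles */

    double l2_global = l1_miss * l2_local;   /* L2 misses / CPU memory accesses = 0.01 */
    double amat = l1_hit + l1_miss * (l2_hit + l2_local * mem_pen);   /* = 2.5 cycles */

    printf("L2 global miss rate = %.3f, AMAT = %.2f cycles\n", l2_global, amat);
    return 0;
}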

Presence of L2 influences L1 design
§Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time
– Backup L2 reduces L1 miss penalty
– Reduces average access energy
§Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
Inclusion Policy
§ Inclusive multilevel cache:
– Inner cache can only hold lines also present in outer cache
– External coherence snoop access need only check outer cache
§ Exclusive multilevel caches:
– Inner cache may hold lines not in outer cache
– Swap lines between inner/outer caches on miss
– Used in AMD Athlon with 64KB primary and 256KB secondary cache
Why choose one type or the other?
L1 vs L2 Miss Rate
[Figure: L1 vs. L2 data-cache miss rates (0–25%) for the ARM Cortex-A8 running Minnespec benchmarks — twolf, bzip2, gzip, parser, gap, perlbmk, gcc, crafty, vpr, vortex, eon, mcf.]
n Miss rate of L2$ is usually much lower than that of L1$
n L2 usually has:
• Higher capacity
• Higher associativity
n Only accesses that miss in L1 reach L2
Itanium-2 On-Chip Caches (Intel/HP, 2002)
Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency
Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency
Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency

Power 7 On-Chip Caches [IBM 2009]
32KB L1 I$/core + 32KB L1 D$/core, 3-cycle latency
256KB unified L2$/core, 8-cycle latency
32MB unified shared L3$ in embedded DRAM (eDRAM), 25-cycle latency to local slice
IBM z196 Mainframe Caches 2010
§96 cores (4 cores/chip, 24 chips/system)
– Out-of-order, 3-way superscalar @ 5.2GHz
§L1: 64KB I-$/core + 128KB D-$/core
§L2: 1.5MB private/core (144MB total)
§L3: 24MB shared/chip (eDRAM) (576MB total)
§L4: 768MB shared/system (eDRAM)
Prefetching
§Speculate on future instruction and data accesses and fetch them into cache(s)
– Instruction accesses easier to predict than data accesses
§Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
§What types of misses does prefetching affect?
Issues in Prefetching
§ Usefulness – should produce hits
§ Timeliness – not late and not too early
§ Cache and bandwidth pollution
[Figure: prefetched data flows from the unified L2 cache into the L1 instruction and L1 data caches ahead of the CPU's demand accesses.]

Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064
– Fetch two lines on a miss; the requested line (i) and the next consecutive line (i+1)
– Requested line placed in cache, and next line in instruction stream buffer
– If miss in cache but hit in stream buffer, move stream buffer line into cache and prefetch next line (i+2)
[Figure: on a miss, the requested line is placed in the L1 instruction cache and the prefetched next line is placed in the stream buffer, both supplied by the unified L2 cache.]
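A minimal sketch (not part of the original slides) of the one-line stream buffer behaviour described above, modeled in C; the helper functions (icache_lookup, icache_fill, l2_read_line) are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32

typedef struct {
    bool     valid;
    uint64_t line_addr;            /* address of the prefetched line */
    uint8_t  data[LINE_BYTES];
} StreamBuffer;

/* Hypothetical cache/L2 helpers, assumed to be defined elsewhere. */
bool icache_lookup(uint64_t line_addr, uint8_t *out);
void icache_fill(uint64_t line_addr, const uint8_t *data);
void l2_read_line(uint64_t line_addr, uint8_t *out);

void fetch_line(uint64_t line_addr, StreamBuffer *sb, uint8_t *out) {
    if (icache_lookup(line_addr, out))
        return;                                        /* I-cache hit: nothing else to do */

    if (sb->valid && sb->line_addr == line_addr) {     /* miss in cache, hit in stream buffer */
        icache_fill(line_addr, sb->data);              /* move the buffered line into the cache */
        for (int i = 0; i < LINE_BYTES; i++) out[i] = sb->data[i];
    } else {                                           /* miss in both: fetch the requested line */
        l2_read_line(line_addr, out);
        icache_fill(line_addr, out);
    }
    sb->line_addr = line_addr + LINE_BYTES;            /* prefetch the next sequential line */
    l2_read_line(sb->line_addr, sb->data);             /* into the stream buffer */
    sb->valid = true;
}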
Hardware Data Prefetching
§ Prefetch-on-miss:
– Prefetch b+1 upon a miss on b
§ One-Block Lookahead (OBL) scheme
– Initiate prefetch for block b+1 when block b is accessed
– Why is this different from doubling the block size?
– Can extend to N-block lookahead
§ Strided prefetch
– If a sequence of accesses to lines b, b+N, b+2N is observed, then prefetch b+3N, etc. (sketched below)
§ Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
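A minimal sketch (not part of the original slides) of a strided prefetcher modeled in software: track the last address and stride per stream, and once the same stride repeats, prefetch one stride ahead. The table entry layout and the issue_prefetch hook are illustrative assumptions.

#include <stdint.h>

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;     /* consecutive times the same stride was seen */
} StreamEntry;

/* Hypothetical hook that would start a cache-line fetch in the background. */
void issue_prefetch(uint64_t addr);

void on_access(StreamEntry *s, uint64_t addr) {
    int64_t stride = (int64_t)(addr - s->last_addr);
    if (stride != 0 && stride == s->stride)
        s->confidence++;
    else
        s->confidence = 0;
    s->stride    = stride;
    s->last_addr = addr;

    if (s->confidence >= 1)                            /* stride confirmed: b, b+N, b+2N seen */
        issue_prefetch(addr + (uint64_t)stride);       /* prefetch b+3N ahead of demand */
}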
Software Prefetching

for(i=0; i < N; i++) {
  prefetch( &a[i + 1] );
  prefetch( &b[i + 1] );
  SUM = SUM + a[i] * b[i];
}

Software Prefetching Issues
§ Timing is the biggest issue, not predictability
– If you prefetch very close to when the data is required, you might be too late
– Prefetch too early, cause pollution
– Estimate how long it will take for the data to come into L1, so we can set P appropriately
– Why is this hard to do?

for(i=0; i < N; i++) {
  prefetch( &a[i + P] );
  prefetch( &b[i + P] );
  SUM = SUM + a[i] * b[i];
}

Must consider cost of prefetch instructions

Compiler Optimizations
§ Restructuring code affects the data access sequence
– Group data accesses together to improve spatial locality
– Re-order data accesses to improve temporal locality
§ Prevent data from entering the cache
– Useful for variables that will only be accessed once before being replaced
– Needs mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
§ Kill data that will never be used again
– Streaming data exploits spatial locality but not temporal locality
– Replace into dead cache locations

Loop Interchange

for(j=0; j < N; j++) {
  for(i=0; i < M; i++) {
    x[i][j] = 2 * x[i][j];
  }
}

for(i=0; i < M; i++) {
  for(j=0; j < N; j++) {
    x[i][j] = 2 * x[i][j];
  }
}

What type of locality does this improve?

Loop Fusion

for(i=0; i < N; i++)
  a[i] = b[i] * c[i];
for(i=0; i < N; i++)
  d[i] = a[i] * c[i];

for(i=0; i < N; i++) {
  a[i] = b[i] * c[i];
  d[i] = a[i] * c[i];
}

What type of locality does this improve?

Matrix Multiply, Naïve Code

for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

[Figure: access patterns of x, y, and z (legend: not touched, old access, new access).]

Matrix Multiply with Cache Tiling

for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Figure: access patterns of x, y, and z under tiling.]

What type of locality does this improve?

Acknowledgements
n These slides contain material developed and copyright by:
• Arvind (MIT)
• Krste Asanovic (MIT/UCB)
• Joel Emer (Intel/MIT)
• James Hoe (CMU)
• John Kubiatowicz (UCB)
• David Patterson (UCB)
• John Lazzaro (UCB)
n MIT material derived from course 6.823
n UCB material derived from course CS152, CS252
for(i=0; i < N; i++) { prefetch( &a[i + 1] ); prefetch( &b[i + 1] ); SUM = SUM + a[i] * b[i]; } 27 Software Prefetching Issues § Timing is the biggest issue, not predictability – If you prefetch very close to when the data is required, you might be too late – Prefetch too early, cause pollution – Estimate how long it will take for the data to come into L1, so we can set P appropriately – Why is this hard to do? for(i=0; i < N; i++) { prefetch( &a[i + P] ); prefetch( &b[i + P] ); SUM = SUM + a[i] * b[i]; } Must consider cost of prefetch instructions 28 Compiler Optimizations § Restructuring code affects the data access sequence – Group data accesses together to improve spatial locality – Re-order data accesses to improve temporal locality § Prevent data from entering the cache – Useful for variables that will only be accessed once before being replaced – Needs mechanism for software to tell hardware not to cache data (“no- allocate” instruction hints or page table bits) § Kill data that will never be used again – Streaming data exploits spatial locality but not temporal locality – Replace into dead cache locations 29 Loop Interchange for(j=0; j < N; j++) { for(i=0; i < M; i++) { x[i][j] = 2 * x[i][j]; } } for(i=0; i < M; i++) { for(j=0; j < N; j++) { x[i][j] = 2 * x[i][j]; } } What type of locality does this improve? 30 Loop Fusion for(i=0; i < N; i++) a[i] = b[i] * c[i]; for(i=0; i < N; i++) d[i] = a[i] * c[i]; for(i=0; i < N; i++) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; } What type of locality does this improve? 31 Matrix Multiply, Naïve Code for(i=0; i < N; i++) for(j=0; j < N; j++) { r = 0; for(k=0; k < N; k++) zj k ykxj ii r = r + y[i][k] * z[k][j]; x[i][j] = r; } Not touched Old access New access 32 Matrix Multiply with Cache Tiling for(jj=0; jj < N; jj=jj+B) for(kk=0; kk < N; kk=kk+B) for(i=0; i < N; i++) for(j=jj; j < min(jj+B,N); j++) { zj } r = 0; for(k=kk; k < min(kk+B,N); k++) k r = r + y[i][k] * z[k][j]; x[i][j] = x[i][j] + r; ykxj ii What type of locality does this improve? 33 Acknowledgements n These slides contain material developed and copyright by: • Arvind (MIT) • Krste Asanovic (MIT/UCB) • Joel Emer (Intel/MIT) • James Hoe (CMU) • John Kubiatowicz (UCB) • David Patterson (UCB) • John Lazzaro (UCB) n MIT material derived from course 6.823 n UCB material derived from course CS152, CS252 HKUEEE ENGG3441 - HS 34