
Digital System Design 4: Memory & Caches 2
Dr Stewart Smith

Last Section
• Memory Access Locality
• Memory Hierarchy
• Caches
• DRAM technology

This Section
• Cache Performance
• Associative Caches
• Multilevel Caches
• Real examples

Measuring Cache Performance




• Components of CPU time
  ‣ Program execution cycles (includes cache hit time)
  ‣ Memory stall cycles (mainly from cache misses)
• With simplifying assumptions:

  Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                      = (Instructions / Program) × (Misses / Instruction) × Miss penalty
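A minimal Python sketch of these relations (the function and parameter names are illustrative, not from the lecture):

```python
def memory_stall_cycles_per_instruction(accesses_per_instruction, miss_rate, miss_penalty):
    """Average stall cycles each instruction spends waiting on cache misses."""
    return accesses_per_instruction * miss_rate * miss_penalty

def effective_cpi(base_cpi, *stall_components):
    """Base CPI (which already covers cache hit time) plus all stall terms."""
    return base_cpi + sum(stall_components)
```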

Cache Performance Example



• Given
  ‣ Instruction-cache miss rate = 2%
  ‣ Data-cache miss rate = 4%
  ‣ Miss penalty = 100 cycles
  ‣ Base CPI (ideal cache) = 2
  ‣ Loads & stores are 36% of instructions
• Miss cycles per instruction
  ‣ Instruction-cache: 0.02 × 100 = 2
  ‣ Data-cache: 0.36 × 0.04 × 100 = 1.44
• Actual CPI = 2 + 2 + 1.44 = 5.44
  ‣ Ideal CPU is 5.44/2 = 2.72 times faster
• Amount of time spent on memory stalls: 3.44/5.44 = 63%

Cache Performance Example
• What happens with a faster processor?
  ‣ If the base CPI is 1 rather than 2
• Miss cycles per instruction (unchanged)
  ‣ Instruction-cache: 0.02 × 100 = 2
  ‣ Data-cache: 0.36 × 0.04 × 100 = 1.44
• Actual CPI = 1 + 2 + 1.44 = 4.44
  ‣ Ideal CPU is 4.44/1 = 4.44 times faster
• Amount of time spent on memory stalls: 3.44/4.44 = 77%
  ‣ Increased from 63%
• Demonstration of Amdahl's Law: more on that in the previous lecture on performance
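A quick numeric check of both cases, using the figures from these slides (the helper name is my own):

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_ops_per_instr, miss_penalty):
    # Instruction fetch: one instruction-cache access per instruction.
    i_stalls = i_miss_rate * miss_penalty                      # 0.02 * 100 = 2.00
    # Data accesses: only loads/stores touch the data cache.
    d_stalls = mem_ops_per_instr * d_miss_rate * miss_penalty  # 0.36 * 0.04 * 100 = 1.44
    return base_cpi + i_stalls + d_stalls

slow = cpi_with_stalls(2, 0.02, 0.04, 0.36, 100)
fast = cpi_with_stalls(1, 0.02, 0.04, 0.36, 100)
print(slow, (slow - 2) / slow)   # 5.44, stall fraction ~0.63
print(fast, (fast - 1) / fast)   # 4.44, stall fraction ~0.77
```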

Average Access Time
• Hit time is also important for performance
• Average memory access time (AMAT)

  AMAT = Hit time + Miss rate × Miss penalty

• Example
  ‣ CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  ‣ AMAT = 1 + 0.05 × 20 = 2 cycles = 2 ns per instruction
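The same calculation as a short sketch (illustrative names, units in cycles):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, all arguments in the same units."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))  # 2.0 cycles -> 2 ns with a 1 ns clock
```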

Associative Caches

• Fully associative
  ‣ Allow a given block to go in any cache entry
  ‣ Requires all entries to be searched at once
  ‣ Comparator per entry (expensive)
• n-way set associative
  ‣ Each set contains n entries
  ‣ Block number determines which set: (Block number) modulo (#Sets in cache)
  ‣ Search all entries in a given set at once
  ‣ n comparators (less expensive)
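A minimal sketch of the set-index mapping; the 4-block geometry is an assumed example matching the slides that follow:

```python
def set_index(block_number, num_sets):
    """Which set a block maps to: (block number) modulo (#sets in cache)."""
    return block_number % num_sets

# 4-block cache: direct mapped has 4 sets of 1 way; 2-way has 2 sets of 2 ways.
for block in (0, 8, 0, 6, 8):
    print(block, "-> set", set_index(block, 4), "(direct mapped),",
          "set", set_index(block, 2), "(2-way)")
```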

Associative Cache Example

Spectrum of Associativity

Associativity Example


• Compare 4-block caches
  ‣ Direct mapped, 2-way set associative, fully associative
  ‣ Block address access sequence: 0, 8, 0, 6, 8
• Direct mapped
  ‣ Index = (Block address) modulo (number of entries)

Block address | Cache index | Hit or miss? | Cache content after access
              |             |              | Block 0 | Block 1 | Block 2 | Block 3
0             | 0           | miss         | Mem[0]  |         |         |
8             | 0           | miss         | Mem[8]  |         |         |
0             | 0           | miss         | Mem[0]  |         |         |
6             | 2           | miss         | Mem[0]  |         | Mem[6]  |
8             | 0           | miss         | Mem[8]  |         | Mem[6]  |

Associativity Example

• 2-way set associative
  ‣ Each set can contain two blocks
  ‣ Same address sequence: 0, 8, 0, 6, 8
  ‣ Set index = (Block address) modulo (number of sets)

Block address | Set index | Hit or miss? | Set 0 content after access
0             | 0         | miss         | Mem[0]
8             | 0         | miss         | Mem[0], Mem[8]
0             | 0         | hit          | Mem[0], Mem[8]
6             | 0         | miss         | Mem[0], Mem[6]  (Mem[8] replaced: least recently used)
8             | 0         | miss         | Mem[8], Mem[6]  (Mem[0] replaced)

(Set 1 is never used by this sequence.)

Associativity Example

• Fully associative
  ‣ Blocks can go anywhere
  ‣ Same address sequence: 0, 8, 0, 6, 8

Block address | Hit or miss? | Cache content after access
0             | miss         | Mem[0]
8             | miss         | Mem[0], Mem[8]
0             | hit          | Mem[0], Mem[8]
6             | miss         | Mem[0], Mem[8], Mem[6]
8             | hit          | Mem[0], Mem[8], Mem[6]
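The three tables above can be reproduced with a small simulator sketch (function and variable names are my own; LRU replacement within a set is assumed, as on the Replacement Policy slide):

```python
def simulate(addresses, num_blocks, ways):
    """Count hits for an LRU cache with the given geometry.
    ways = 1 -> direct mapped; ways = num_blocks -> fully associative."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set: blocks ordered LRU first
    hits = 0
    for block in addresses:
        s = sets[block % num_sets]
        if block in s:
            hits += 1
            s.remove(block)                # refresh: it becomes most recently used
        elif len(s) == ways:
            s.pop(0)                       # evict the least recently used block
        s.append(block)
    return hits

seq = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):                     # direct mapped, 2-way, fully associative
    print(f"{ways}-way: {simulate(seq, num_blocks=4, ways=ways)} hit(s)")
```

Running it gives 0, 1 and 2 hits, matching the three tables.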

How Much Associativity



• Increased associativity decreases miss rate
  ‣ But with diminishing returns
• Simulation of a system with a 64 KiB data cache, 16-word blocks, SPEC2000 benchmarks:
  ‣ 1-way: 10.3%
  ‣ 2-way: 8.6%
  ‣ 4-way: 8.3%
  ‣ 8-way: 8.1%

Dividing Up the Address
Memory address fields:  | Tag | Index | Block offset |
  ‣ Tag: everything else (the remaining upper bits)
  ‣ Index: decreases with greater associativity
  ‣ Block offset: increases with block size

Quick Quiz 1

If we have a direct-mapped cache of size 1 KiB with a block size of 4 words, how many bits are required for the index?
A. 2 bits from the address
B. 4 bits
C. 6 bits
D. 8 bits

Quick Quiz 2

In the same direct-mapped cache of size 1 KiB with a block size of 4 words, what is the size of the block offset?
A. Lowest 4 bits (including word/byte offset)
B. 5 bits
C. 6 bits
D. 8 bits

Quick Quiz 3

In the same direct-mapped cache of size 1 KiB with a block size of 4 words, what is the tag size if the address is 32 bits wide?
A. Upper 16 bits of the address
B. 18 bits
C. 20 bits
D. 22 bits

Quick Quiz 4

If we change the cache architecture to be 4-way set associative, what happens to the index if the other parameters stay the same?
A. Increases by 1 bit
B. Stays the same size
C. Decreases by 2 bits
D. Decreases by 4 bits
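As a check on the four quizzes, a sketch that derives the field widths from the cache geometry (a byte-addressed cache with 4-byte words is assumed; the function name is my own):

```python
from math import log2

def address_fields(cache_bytes, block_words, ways, address_bits=32, word_bytes=4):
    """Return (tag, index, offset) widths in bits for a set-associative cache."""
    block_bytes = block_words * word_bytes
    num_sets = cache_bytes // (block_bytes * ways)
    offset = int(log2(block_bytes))          # byte-within-block, incl. word offset
    index = int(log2(num_sets))
    tag = address_bits - index - offset
    return tag, index, offset

# 1 KiB direct-mapped cache with 4-word blocks (as in the quizzes):
print(address_fields(1024, 4, ways=1))   # (22, 6, 4)
# The same cache made 4-way set associative: the index shrinks by 2 bits.
print(address_fields(1024, 4, ways=4))   # (24, 4, 4)
```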

Set Associative Cache Organisation

Replacement Policy
• Direct mapped: no choice
• Set associative
  ‣ Prefer a non-valid entry, if there is one
  ‣ Otherwise, choose among entries in the set
• Least-recently used (LRU)
  ‣ Choose the one unused for the longest time
  ‣ Simple for 2-way, manageable for 4-way, too hard beyond that – though there are methods…
• Random
  ‣ Gives approximately the same performance as LRU for high associativity
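A sketch of how the two policies pick a victim within a full set (the data structure and names here are illustrative only):

```python
import random

def choose_victim(set_entries, policy="lru"):
    """Pick which way to evict from a full set.
    set_entries: list of (block, last_used_time) pairs, one per way."""
    if policy == "lru":
        # Evict the way whose entry has gone unused for the longest time.
        return min(range(len(set_entries)), key=lambda w: set_entries[w][1])
    # Random replacement: every way is equally likely.
    return random.randrange(len(set_entries))

ways = [("Mem[0]", 3), ("Mem[8]", 1)]    # (block, last time used)
print(choose_victim(ways, "lru"))        # 1: Mem[8] was used longest ago
```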

Multilevel Caches




• Primary cache attached to CPU
  ‣ Small, but fast
• Level-2 cache services misses from the primary cache
  ‣ Larger, slower, but still faster than main memory
• Main memory services L-2 cache misses
• L-3 cache is now pretty standard on modern multicore processors

Multilevel Cache Example

• Given
  ‣ CPU base CPI = 1, clock rate = 4 GHz
  ‣ Miss rate/instruction = 2%
  ‣ Main memory access time = 100 ns
• With just the primary cache
  ‣ Miss penalty = 100 ns / 0.25 ns = 400 cycles
  ‣ Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)

• Now add an L-2 cache
  ‣ Access time = 5 ns
  ‣ Global miss rate to main memory = 0.5%
• Primary miss with L-2 hit
  ‣ Penalty = 5 ns / 0.25 ns = 20 cycles
• Primary miss with L-2 miss
  ‣ Extra penalty = 400 cycles
• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9/3.4 = 2.6
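A quick numeric check of both configurations, using the slide's figures (variable names are my own):

```python
clock_ns = 0.25                        # 4 GHz clock
mem_penalty = 100 / clock_ns           # 400 cycles to main memory
l2_penalty = 5 / clock_ns              # 20 cycles for an L-2 hit

cpi_l1_only = 1 + 0.02 * mem_penalty                       # 9.0
cpi_with_l2 = 1 + 0.02 * l2_penalty + 0.005 * mem_penalty  # 3.4
print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2) # 9.0 3.4 ~2.6
```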

Multilevel Cache Considerations



• Primary cache
  ‣ Focus on minimal hit time
• L-2 cache
  ‣ Focus on low miss rate to avoid main memory access
  ‣ Hit time has less overall impact
• Results
  ‣ L-1 cache usually smaller than a single cache
  ‣ L-1 block size may be smaller than L-2 block size

Interactions with Advanced CPUs

• Out-of-order CPUs can execute instructions during a cache miss
  ‣ Pending store stays in the load/store unit
  ‣ Dependent instructions wait in reservation stations
  ‣ Independent instructions continue
• Effect of a miss depends on program data flow
  ‣ Much harder to analyse
  ‣ Use system simulation

Interactions with Software
• Misses depend on memory access patterns
  ‣ Algorithm behavior
  ‣ Compiler optimization for memory access
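As an illustration, traversal order alone changes the miss count even though the same data is touched. A sketch under assumed parameters (a 64×64 matrix stored row by row, 4-word blocks, a 256-block direct-mapped cache):

```python
def misses(block_addresses, num_blocks):
    """Miss count for a direct-mapped cache of whole blocks (illustrative)."""
    cache = [None] * num_blocks
    count = 0
    for b in block_addresses:
        if cache[b % num_blocks] != b:
            cache[b % num_blocks] = b
            count += 1
    return count

N, WORDS_PER_BLOCK, CACHE_BLOCKS = 64, 4, 256
row_major = [(i * N + j) // WORDS_PER_BLOCK for i in range(N) for j in range(N)]
col_major = [(i * N + j) // WORDS_PER_BLOCK for j in range(N) for i in range(N)]
print("row order:   ", misses(row_major, CACHE_BLOCKS))  # 1024: each block loaded once, then reused
print("column order:", misses(col_major, CACHE_BLOCKS))  # 4096: no reuse before eviction
```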

Multilevel Cache Example
Intel Nehalem processor with labelled sections
(Die photograph: four SMT CPU cores, each with 32 KiB L1 instruction and data caches, write buffers and a private 256 KiB L2 cache, sharing an 8 MiB L3 cache (2 MiB per core) and the north bridge / communication switch.)

Multilevel Cache Example
ARM Cortex-A8 uni-core processor

Multilevel Cache Organization
• L1 caches (per core)
  ‣ ARM Cortex-A8: split instruction and data caches, 32 KiB each, 64-byte blocks, 4-way set associative, random replacement, write-back with write allocate, hit time: 1 clock cycle
  ‣ Intel Nehalem: split instruction and data caches, 32 KiB each, 64-byte blocks, 4-way (instruction) / 8-way (data), approximated LRU, write-back with no write allocate, hit time: 4 clock cycles (pipelined)
• L2 unified cache (per core)
  ‣ ARM Cortex-A8: 128 KiB to 1 MiB, 64-byte blocks, 8-way set associative, random replacement, write-back, hit time: 11 clock cycles
  ‣ Intel Nehalem: 256 KiB, 64-byte blocks, 8-way, approximated LRU, write-back with write allocate, hit time: 10 clock cycles
• L3 unified cache (shared)
  ‣ ARM Cortex-A8: none
  ‣ Intel Nehalem: 8 MiB, 64-byte blocks, 16-way associativity, approximated LRU, write-back with write allocate, hit time: 35 clock cycles

ARM A8 MinneSPEC Benchmarks


Core i7 920 SPEC2006 Benchmarks

Summary 1






• When CPU performance increases
  ‣ Miss penalty becomes more significant
• Decreasing base CPI
  ‣ Greater proportion of time spent on memory stalls
• Increasing clock rate
  ‣ Memory stalls account for more CPU cycles

Summary 2
• Don't neglect cache behaviour when evaluating performance
• Higher cache associativity can reduce miss rate
  ‣ Each block can go in more than one place in the cache
  ‣ More complicated: longer hit access times? Requires a replacement policy: LRU/random?
• Multilevel caches to reduce miss rates

Next Section
• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls