CS计算机代考程序代写 compiler cache arm algorithm Lecture 14 – Memory and Cache 2

Lecture 14 – Memory and Cache 2

Stewart Smith Digital Systems Design 4

Digital System Design 4
Memory & Caches 2

Dr Stewart Smith

Stewart Smith Digital Systems Design 4

Last Section

•Memory Access Locality
•Memory Hierarchy
•Caches
•DRAM technology

Stewart Smith Digital Systems Design 4

This Section

•Cache Performance
•Associative Caches
•Multilevel Caches
•Real examples

Stewart Smith Digital Systems Design 4

Measuring Cache Performance
• Components of CPU time
‣ Program execution cycles

– Includes cache hit time
‣ Memory stall cycles

– Mainly from cache misses
• With simplifying assumptions:

penalty Miss
nInstructio

Misses
Program

nsInstructio

penalty Missrate Miss
Program

accessesMemory

cycles stallMemory

××=

××=

Stewart Smith Digital Systems Design 4

Cache Performance Example
• Given
‣ Instruction-cache miss rate = 2%
‣ Data-cache miss rate = 4%
‣ Miss penalty = 100 cycles
‣ Base CPI (ideal cache) = 2
‣ Load & stores are 36% of instructions

• Miss cycles per instruction
‣ Instruction-cache: 0.02 × 100 = 2
‣ Data-cache: 0.36 × 0.04 × 100 = 1.44

• Actual CPI = 2 + 2 + 1.44 = 5.44
‣ Ideal CPU is 5.44/2 =2.72 times faster

• Amount of time spent on memory stalls:
‣ 3.44/5.44 = 63%

Stewart Smith Digital Systems Design 4

Cache Performance Example
• What happens with a faster processor?
• If the CPI is 1 rather than 2?
• Miss cycles per instruction
‣ Instruction-cache: 0.02 × 100 = 2
‣ Data-cache: 0.36 × 0.04 × 100 = 1.44

• Actual CPI = 1 + 2 + 1.44 = 4.44
‣ Ideal CPU is 4.44/1 =4.44 times faster

• Amount of time spent on memory stalls:
‣ 3.44/4.44 = 77%
‣ Increased from 63%

• Demonstration of Amdahl’s Law
‣ More on that in previous lecture on performance

Stewart Smith Digital Systems Design 4

Average Access Time
• Hit time is also important for performance
• Average memory access time (AMAT)
‣ AMAT = Hit time + Miss rate × Miss penalty

• Example
‣ CPU with 1 ns clock,
‣ hit time = 1 cycle,
‣ miss penalty = 20 cycles,
‣ I-cache miss rate = 5%
‣ AMAT = 1 + 0.05 × 20 = 2ns

– 2 cycles per instruction

Stewart Smith Digital Systems Design 4

Associative Caches
• Fully associative
‣ Allow a given block to go in any cache entry
‣ Requires all entries to be searched at once
‣ Comparator per entry (expensive)

• n-way set associative
‣ Each set contains n entries
‣ Block number determines which set

– (Block number) modulo (#Sets in cache)
‣ Search all entries in a given set at once
‣ n comparators (less expensive)

Stewart Smith Digital Systems Design 4

Associative Cache Example

Stewart Smith Digital Systems Design 4

Spectrum of Associativity

Stewart Smith Digital Systems Design 4

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0
8 0
0 0
6 2
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0
0 0
6 2
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 Mem[0]
0 0
6 2
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0
6 2
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 Mem[8]
6 2
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 Mem[0]
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 Mem[0] Mem[6]

Block
address

Cache
Index

Hit or
Miss?

Cache content after access
0 1 2 3

0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]

Associativity Example
• Compare 4-block caches
‣ Direct mapped, 2-way set associative,

fully associative
‣ Block address access sequence: 0, 8, 0, 6, 8

• Direct mapped
‣ Index = (Block address) modulo (number of entries)

Stewart Smith Digital Systems Design 4

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0
8 0
0 0
6 0
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0
0 0
6 0
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 Mem[0]
0 0
6 0
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0
6 0
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 Mem[0] Mem[8]
6 0
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 Mem[0] Mem[8]
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 Mem[0] Mem[6]

Block
address

Set
Index

Hit or
Miss?

Cache content after access
Set 0 Set 1

0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]

Associativity Example

• 2-way set associative
‣ Each set can contain two blocks
‣ Same address Sequence: 0, 8, 0, 6, 8
‣ Set index = (Block)Mod(number of sets)

Stewart Smith Digital Systems Design 4

Block
address

Hit or
Miss? Cache content after access

0
8
0
6
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8
0
6
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 Mem[0]
0
6
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0
6
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 Mem[0] Mem[8]
6
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 Mem[0] Mem[8]
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 Mem[0] Mem[8] Mem[6]

Associativity Example

• Fully associative
‣ Blocks can go anywhere
‣ Same address sequence: 0, 8, 0, 6, 8

Block
address

Hit or
Miss? Cache content after access

0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 hit Mem[0] Mem[8] Mem[6]

Stewart Smith Digital Systems Design 4

How Much Associativity
• Increased associativity decreases miss rate
‣ But with diminishing returns

• Simulation of a system with 64KiB data-
cache, 16-word blocks, SPEC2000
benchmarks
‣ 1-way: 10.3%
‣ 2-way: 8.6%
‣ 4-way: 8.3%
‣ 8-way: 8.1%

Stewart Smith Digital Systems Design 4

Dividing Up the Address

Memory Address

Increases with
block size

Decreases with
greater associativity

Everything else…

Tag Index Block Offset

Stewart Smith Digital Systems Design 4

Quick Quiz 1

• If we have a directly mapped cache of size
1 KiB with block size of 4 words, how
many bits are required for the index?

A. 2 bits from the address

B. 4 bits

C. 6 bits

D. 8 bits

!

Stewart Smith Digital Systems Design 4

Quick Quiz 2

• In the same directly mapped cache of size
1 KiB with block size of 4 words, what is
the size of the block offset?

A. Lowest 4 bits (including word/byte offset)

B. 5 bits

C. 6 bits

D. 8 bits

!

Stewart Smith Digital Systems Design 4

Quick Quiz 3

• In the same directly mapped cache of size
1 KiB with block size of 4 words, what is
the tag size if the address is 32 bits wide?

A. Upper 16 bits of the address

B. 18 bits

C. 20 bits

D. 22 bits

!

Stewart Smith Digital Systems Design 4

Quick Quiz 4
• If we change the cache architecture to be

4-way set associative, what happens to
the index if other parameters stay the
same?

A. Increases by 1 bit

B. Stays the same size

C. Decreases by 2 bits

D. Decreases by 4 bits

!

Stewart Smith Digital Systems Design 4

Set Associative Cache Organisation

Stewart Smith Digital Systems Design 4

Replacement Policy
• Direct mapped: no choice
• Set associative
‣ Prefer non-valid entry, if there is one
‣ Otherwise, choose among entries in the set

• Least-recently used (LRU)
‣ Choose the one unused for the longest time

– Simple for 2-way, manageable for 4-way, too hard
beyond that – though there are methods…

• Random
‣ Gives approximately the same performance as LRU

for high associativity

Stewart Smith Digital Systems Design 4

Multilevel Caches
• Primary cache attached to CPU
‣ Small, but fast

• Level-2 cache services misses from primary
cache
‣ Larger, slower, but still faster than main memory

• Main memory services L-2 cache misses
• L-3 cache is now pretty standard on modern

multicore processors.

Stewart Smith Digital Systems Design 4

Multilevel Cache Example

• Given
‣ CPU base CPI = 1, clock rate = 4 GHz
‣ Miss rate/instruction = 2 %
‣ Main memory access time = 100 ns

• With just primary cache
‣ Miss penalty = 100 ns/0.25 ns = 400 cycles
‣ Effective CPI = 1 + 0.02 × 400 = 9

Stewart Smith Digital Systems Design 4

Example (cont.)
• Now add L-2 cache
‣ Access time = 5 ns
‣ Global miss rate to main memory = 0.5 %

• Primary miss with L-2 hit
‣ Penalty = 5 ns/0.25 ns = 20 cycles

• Primary miss with L-2 miss
‣ Extra penalty = 400 cycles

• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9/3.4 = 2.6

Stewart Smith Digital Systems Design 4

Multilevel Cache Considerations

• Primary cache
‣ Focus on minimal hit time

• L-2 cache
‣ Focus on low miss rate to avoid main memory

access
‣ Hit time has less overall impact

• Results
‣ L-1 cache usually smaller than a single cache
‣ L-1 block size may be smaller than L-2 block size

Stewart Smith Digital Systems Design 4

Interactions with Advanced CPUs

• Out-of-order CPUs can execute instructions
during cache miss
‣ Pending store stays in load/store unit
‣ Dependent instructions wait in reservation

stations
– Independent instructions continue

• Effect of miss depends on program data flow
‣ Much harder to analyse
‣ Use system simulation

Stewart Smith Digital Systems Design 4

Interactions with Software

• Misses depend on
memory access patterns

‣ Algorithm behavior
‣ Compiler optimization for

memory access

Stewart Smith Digital Systems Design 4

Interactions with Software

• Misses depend on
memory access patterns

‣ Algorithm behavior
‣ Compiler optimization for

memory access

Stewart Smith Digital Systems Design 4

Interactions with Software

• Misses depend on
memory access patterns

‣ Algorithm behavior
‣ Compiler optimization for

memory access

Stewart Smith Digital Systems Design 4

Multilevel Cache Example

Intel Nehalem processor with labelled sections

SMT
CPU

Core 0

SMT
CPU

Core 1

SMT
CPU

Core 2

SMT
CPU

Core 332kiB
L1D

32kiB
L1I

256 kiB
L2

256 kiB
L2

256 kiB
L2

256 kiB
L2

2 MiB of
8 MiB L3

Cache

2 MiB of
8 MiB L3

Cache

2 MiB of
8 MiB L3

Cache

2 MiB of
8 MiB L3

Cache

W
rite B

uffers

W
rite B

uffers

W
rite B

uffers

W
rite B

uffers

N
orth B

ridge and C
om

m
unication Sw

itch

Stewart Smith Digital Systems Design 4

Multilevel Cache Example

ARM Cortex-A8 uni-core processor

Stewart Smith Digital Systems Design 4

Multilevel Cache Organization
ARM Cortex-A8 Intel Nehalem

L1 caches
(per core)

Split instruction and data caches,
32 KiB each, 64-byte blocks,

4 way set associative,
Random replacement,

Write-back with write allocate,
Hit time: 1 clock cycle

Split instruction and data caches,
32 KiB each, 64-byte blocks,

4 way (Instruction) 8 way (data),
Approximated LRU,

Write-back no write allocate,
Hit time: 4 clock cycles (pipelined)

L2 unified
cache

(per core)

128 KiB to 1 MiB, 64-byte blocks,
8-way set associative, Random

replacement, Write back, Hit time:
11 clock cycles

256KiB, 64-byte blocks, 8-way,
Approximated LRU,

Write-back with write allocate,
Hit time: 10 clock cycles

L3 unified
cache

(shared)
None

8MiB, 64-byte blocks, 16-way
associativity, Approximated LRU,
Write-back with write allocate,

Hit time: 35 clock cycles

Stewart Smith Digital Systems Design 4

ARM A8 Minnespec Benchmarks

Stewart Smith Digital Systems Design 4

ARM A8 Minnespec Benchmarks

Stewart Smith Digital Systems Design 4

Core i7 920 SPEC2006 Benchmarks

Stewart Smith Digital Systems Design 4

Summary 1

• When CPU performance increased
‣ Miss penalty becomes more significant

• Decreasing base CPI
‣ Greater proportion of time spent on memory stalls

• Increasing clock rate
‣ Memory stalls account for more CPU cycles

Stewart Smith Digital Systems Design 4

Summary 2
• Don’t neglect cache behaviour when

evaluating performance

• Higher cache associativity can reduce miss
rate
‣ Each block can go in more than one place in

cache
‣ More complicated: longer hit access times?
‣ Requires a replacement policy: LRU/Random?

• Multilevel caches to reduce miss rates

Stewart Smith Digital Systems Design 4

Next Section

• Virtual Memory
• Common principles of memory
• Cache control systems
• Pitfalls