
Chapter 5
Memory Hierarchy
part 2
cache – Fully/Set Associative
1

Reducing Cache Misses by More Flexible Placement of Blocks
• Fully Associative cache: A cache structure in which a block can be placed in any location in the cache. To find a given block, all the entries in the cache must be searched, because the block could be anywhere.
• Set Associative cache: A cache that has a fixed number of locations (at least two) where each block can be placed. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any location of that set.
n-way set associative cache
– a set-associative cache with n elements (ways) in each set
The set containing a memory block
= (Block number) modulo (Number of sets in the cache)
• Direct-mapped: there is a direct mapping from any block address in memory to a single location in the upper level of the hierarchy.
The position of a memory block
= (Block number) modulo (Number of blocks in the cache)
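The two mapping rules can be sketched in Python (a minimal illustration; the function names and example sizes are my own, not from the text):

```python
# Where may a memory block be placed, per the modulo rules above?
# Illustrative sketch only.

def direct_mapped_slot(block_number: int, num_blocks: int) -> int:
    """Direct mapped: exactly one candidate cache block."""
    return block_number % num_blocks

def set_associative_set(block_number: int, num_sets: int) -> int:
    """Set associative: one candidate set; any way within it may be used."""
    return block_number % num_sets

# Example: a cache with 8 blocks organized as 2-way set associative has 4 sets.
print(direct_mapped_slot(12, 8))   # block 12 maps to cache block 4
print(set_associative_set(12, 4))  # block 12 maps to set 0
```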
2

• Therefore:
• A direct-mapped cache is a one-way set-associative cache
– Each cache entry holds one block and each set has one element.
• A fully associative cache with m entries is an m-way set-associative cache
– It has one set with m blocks, and a block can reside in any entry within that set.
3

Set-Associative Cache
• Decreases the miss rate.
• The set containing a memory block: (Block #) modulo (# of sets)
• Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.
4

Choosing Which Block to Replace
• LRU (Least Recently Used) scheme: a replacement scheme in which the block replaced is the one that has been unused for the longest time.
• For a two-way set-associative cache, tracking when the two elements were used can be implemented by keeping a single bit in each set and setting the bit to indicate an element whenever that element is referenced.
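The single-bit scheme for a two-way set can be sketched as follows (an illustrative model only; a real cache tracks tags and valid bits in hardware, and the class name is mine):

```python
# One LRU bit per two-way set: the bit records which way was used most
# recently, so the *other* way is the victim on a miss. Illustrative sketch.

class TwoWaySet:
    def __init__(self):
        self.ways = [None, None]  # cached block numbers (None = empty)
        self.mru = 0              # the single bit: most recently used way

    def access(self, block: int) -> bool:
        """Return True on a hit, False on a miss (with LRU replacement)."""
        if block in self.ways:
            self.mru = self.ways.index(block)  # set the bit on a reference
            return True
        victim = 1 - self.mru     # evict the way NOT marked most recently used
        self.ways[victim] = block
        self.mru = victim
        return False
```

Replaying the sequence 0, 8, 0, 6, 8 (all mapping to one set) with this class reproduces the hit/miss pattern of the two-way example later in this chapter.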
5

6

Example (direct-mapped, set-associative, fully-associative)
• Consider a cache consisting of four one-word blocks.
• Find the number of misses for the following sequence of block addresses: 0, 8, 0, 6, 8
1. Direct mapped
2. Two-way set associative
3. Fully associative
7

Block address   Cache block
0               (0 modulo 4) = 0
6               (6 modulo 4) = 2
8               (8 modulo 4) = 0

Direct mapped:

Address of memory   Hit      Contents of cache blocks after reference
block accessed      or miss  Block 0     Block 1    Block 2     Block 3
0                   Miss     Memory[0]
8                   Miss     Memory[8]
0                   Miss     Memory[0]
6                   Miss     Memory[0]              Memory[6]
8                   Miss     Memory[8]              Memory[6]
8

Block address   Cache set
0               (0 modulo 2) = 0
6               (6 modulo 2) = 0
8               (8 modulo 2) = 0

2-Way Set-Associative Cache (LRU is used):

Address of memory   Hit      Contents of cache blocks after reference
block accessed      or miss  Set 0       Set 0      Set 1   Set 1
0                   Miss     Memory[0]
8                   Miss     Memory[0]   Memory[8]
0                   Hit      Memory[0]   Memory[8]
6                   Miss     Memory[0]   Memory[6]
8                   Miss     Memory[8]   Memory[6]

When block 6 is referenced, it replaces block 8,
since block 8 has been less recently referenced than block 0.
9

Fully Associative Cache (LRU is used):

Address of memory   Hit      Contents of cache blocks after reference
block accessed      or miss  Block 0     Block 1    Block 2     Block 3
0                   Miss     Memory[0]
8                   Miss     Memory[0]   Memory[8]
0                   Hit      Memory[0]   Memory[8]
6                   Miss     Memory[0]   Memory[8]  Memory[6]
8                   Hit      Memory[0]   Memory[8]  Memory[6]
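A small LRU cache simulator can cross-check the miss counts for all three organizations in this example (an illustrative sketch; the function name is my own):

```python
# Simulate a cache of `num_blocks` one-word blocks with `ways` blocks per set,
# using LRU replacement, and count misses for a sequence of block addresses.
# Illustrative sketch, not from the text.

def count_misses(addresses, num_blocks, ways):
    num_sets = num_blocks // ways
    # Each set is a list ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for block in addresses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)     # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == ways:  # set is full: evict the least recently used
                s.pop(0)
        s.append(block)
    return misses

seq = [0, 8, 0, 6, 8]
print(count_misses(seq, 4, 1))  # direct mapped: 5 misses
print(count_misses(seq, 4, 2))  # two-way set associative: 4 misses
print(count_misses(seq, 4, 4))  # fully associative: 3 misses
```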
10

Implementation of 4-way set associative cache
11

Data Cache Miss Rates for the Intrinsity FastMATH for SPEC2000 Benchmarks

Associativity   Data miss rate
1-way           10.3%
2-way           8.6%
4-way           8.3%
8-way           8.1%
12

Performance
[Figure: Data miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB.]
13

Locating a Block in the Cache — # of bits used for tags
• Assume a cache that can hold 4096 (= 4K) blocks, where each block is four words, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way set associative, and fully associative.
Each 32-bit address can be divided as follows:
tag bits | index bits | block offset bits | byte offset bits
Since each block consists of 4 words, each block holds 16 (= 2^4) bytes.
# of block offset bits = log2 (# of words per block) = log2 4 = 2 bits
# of byte offset bits = log2 (# of bytes per word) = log2 4 = 2 bits
A 32-bit address therefore leaves 32 – 2 – 2 = 28 bits for the index and tag.
14

# of bits used for tags – cont.
Direct-Mapped Cache:
The direct-mapped cache has the same number of sets as blocks, thus the number of sets = 4096 (= 4K), and the cache has 4096 indexes.
# of index bits = log2 (# of sets) = log2 (4096) = 12. (2^12 = 4096)
Thus the number of tag bits within a 32-bit address is:
32 – # of index bits – # of block offset bits – # of byte offset bits = 32 – 12 – 2 – 2 = 16.
The total # of bits used for all tags is:
# of sets x # of tag bits x # of blocks in each set = 4096 x 16 x 1 = 64K tag bits.
Each doubling of the associativity halves the number of sets, which decreases the number of bits used to index the cache by 1 and increases the number of bits in the tag by 1.
15

# of bits used for tags – cont.
Two-Way Set Associative Cache:
Since there are 4096 blocks in total and each set has 2 blocks, there are 2048 (= 4096/2) sets.
# of index bits = log2 (# of sets) = log2 (2048) = 11. (2^11 = 2048)
Thus the number of tag bits within a 32-bit address is:
32 – # of index bits – # of block offset bits – # of byte offset bits = 32 – 11 – 2 – 2 = 17.
The total # of bits used for all tags is:
# of sets x # of tag bits x # of blocks in each set = 2048 x 17 x 2 = 68K tag bits.
16

# of bits used for tags – cont.
Four-Way Set Associative Cache:
Since there are 4096 blocks in total and each set has 4 blocks, there are 1024 (= 4096/4) sets.
# of index bits = log2 (# of sets) = log2 (1024) = 10. (2^10 = 1024)
Thus the number of tag bits within a 32-bit address is:
32 – # of index bits – # of block offset bits – # of byte offset bits = 32 – 10 – 2 – 2 = 18.
The total # of bits used for all tags is:
# of sets x # of tag bits x # of blocks in each set = 1024 x 18 x 4 = 72K tag bits.
17

# of bits used for tags – cont.
Fully Associative Cache:
Since there are 4096 blocks in total and there is only one set, that set contains all 4096 blocks (each block four words in size).
# of index bits = log2 (# of sets) = log2 (1) = 0. There are no index bits.
Thus the number of tag bits within a 32-bit address is:
32 – # of index bits – # of block offset bits – # of byte offset bits = 32 – 0 – 2 – 2 = 28.
The total # of bits used for all tags is:
# of sets x # of tag bits x # of blocks in each set = 1 x 28 x 4096 = 112K tag bits.
So the more blocks in each set, the more bits we need for tags.
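All four cases follow one formula, which can be cross-checked in Python (an illustrative sketch under this example's assumptions: 32-bit addresses, four-word blocks, 4-byte words; the function name is mine):

```python
from math import log2

def total_tag_bits(num_blocks, ways, addr_bits=32,
                   block_offset_bits=2, byte_offset_bits=2):
    """Total tag storage (in bits) for a cache of `num_blocks` blocks
    with `ways` blocks per set. Illustrative sketch, not from the text."""
    num_sets = num_blocks // ways
    index_bits = int(log2(num_sets))
    tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits
    return num_sets * ways * tag_bits

# The four organizations worked out above, for a 4096-block cache:
for ways in (1, 2, 4, 4096):
    print(f"{ways}-way: {total_tag_bits(4096, ways) // 1024}K tag bits")
```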
18

Multi-Level Caches
• Recall:
19

Reducing the Miss Penalty Using Multilevel Caches
• Add a second-level cache:
– the primary cache is often on the same chip as the processor
(on the same die as the microprocessor that forms the processor)
– use SRAMs to add another cache above main memory (DRAM)
– the miss penalty goes down if the data is in the 2nd-level cache
• Many microprocessors support an additional level of caching (on the same chip or off-chip in a separate set of SRAMs) to close the gap between the fast clock rates of modern processors and the relatively long time required to access DRAMs.
• Using multilevel caches:
– try to optimize the hit time on the 1st-level cache
– try to optimize the miss rate on the 2nd-level cache
20

Performance of Multilevel Caches – example
• Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 4 GHz.
• Assume a main memory access time of 100 ns, including all the miss handling.
• Suppose the miss rate per instruction at the primary cache is 2%.
• How much faster will the processor be if we add a secondary cache that has a 5ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?
21

Performance of Multilevel Caches – example
What is the effective CPI with the single-level cache?
# of cycles for a miss penalty to main memory
= main memory access time (sec) x clock rate (cycles/sec)
= 100 x 10^-9 sec x 4 x 10^9 cycles/sec
= 400 clock cycles.
The total # of miss penalty cycles for a program (where I is the # of instructions)
= 0.02 x I x 400 = 8 x I
The total # of cycles = CPI_base x I + memory-stall cycles = 1 x I + 8 x I = 9 x I
The effective CPI with one level of caching
= total # of cycles / I = CPI_base + memory-stall cycles per instruction = 9.
For the processor with one level of caching: effective CPI = 9 cycles per instruction.
22

Example, cont.
Two-Level Caching
A secondary cache is added. It has a 5 ns access time for either a hit or a miss, and is large enough to reduce the miss rate to main memory to 0.5%.
What is the effective CPI with the two-level cache?
# of cycles for a miss penalty of an access to the secondary cache
= secondary cache access time (sec) x clock rate (cycles/sec)
= 5 x 10^-9 sec x 4 x 10^9 cycles/sec
= 20 clock cycles.
With two levels of caching, a miss in the primary (first-level) cache can be satisfied either by the secondary cache or by main memory.
If the miss is satisfied in the secondary cache, then that is the entire miss penalty. If the miss needs to go to main memory, then the total miss penalty is the sum of the secondary cache access time and the main memory access time.
The total # of miss penalty cycles for a program (where I is the # of instructions)
= primary-cache miss stalls + main-memory miss stalls
= 0.02 x I x 20 + 0.005 x I x 400 = 0.4 x I + 2 x I = 2.4 x I
23

Example, cont. Two-Level Caching – cont.
The total # of cycles = CPI_base x I + memory-stall cycles = 1 x I + 2.4 x I = 3.4 x I
For a two-level cache, the total CPI is the sum of the base CPI and the stall cycles from both levels of cache:
Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction
The effective CPI with two-level caching
= total # of cycles / I = CPI_base + memory-stall cycles per instruction = 3.4 cycles per instruction
—————–
Thus the processor with the secondary cache is faster by a factor of about 2.6:
ratio of CPU execution times
= CPU time for single-level cache / CPU time for two-level caching
= (I x CPI_single) / (I x CPI_two_level)
= CPI_single / CPI_two_level = 9.0 / 3.4 = 2.6470588 ≈ 2.6
24

Example, cont.
• Alternatively, we could have computed the stall cycles by summing the stall cycles of those references that go to main memory, which must include the cost to access the secondary cache as well as the main memory access time.
Stall cycles for misses satisfied in the secondary cache: (2% – 0.5%) x 20 = 0.3
Stall cycles for misses that go to main memory: 0.5% x (20 + 400) = 2.1
The CPI is 1.0 + 0.3 + 2.1 = 3.4
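The whole comparison can be reproduced numerically (a sketch under the example's stated assumptions; variable names are mine):

```python
# Effective CPI with and without a secondary cache, per the example above.
# Illustrative sketch, not from the text.
base_cpi = 1.0
clock_rate = 4e9                           # 4 GHz
mem_penalty = int(100e-9 * clock_rate)     # 100 ns -> 400 cycles
l2_penalty = int(5e-9 * clock_rate)        # 5 ns  -> 20 cycles

# Single-level: 2% of instructions pay the full main-memory penalty.
cpi_single = base_cpi + 0.02 * mem_penalty                      # 1 + 8 = 9.0

# Two-level: 2% pay the L2 access; 0.5% also pay the main-memory penalty.
cpi_two = base_cpi + 0.02 * l2_penalty + 0.005 * mem_penalty    # 1 + 0.4 + 2 = 3.4

print(cpi_single, cpi_two)
print(round(cpi_single / cpi_two, 1))      # speedup of about 2.6
```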
25

AMAT: Average Memory Access Time
AMAT (Average Memory Access Time): the average time to access memory considering both hits and misses and the frequency of different accesses:
AMAT = Time for a hit + Miss rate x Miss penalty
Example:
Suppose that we have a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
Compute its AMAT.
26

AMAT Example
The average memory access time per instruction is:
AMAT (in clock cycles) = cycles for a hit + Miss rate x Miss penalty in cycles
= 1 + 0.05 x 20
= 2 clock cycles
AMAT (in time) = 2 clock cycles x 1 ns/clock cycle = 2 ns.
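In code, the AMAT formula is a one-liner (a minimal sketch; the function name is mine):

```python
def amat_cycles(hit_time, miss_rate, miss_penalty):
    """AMAT = time for a hit + miss rate x miss penalty (all in clock cycles)."""
    return hit_time + miss_rate * miss_penalty

cycles = amat_cycles(hit_time=1, miss_rate=0.05, miss_penalty=20)
print(cycles)            # 2.0 clock cycles
print(cycles * 1, "ns")  # with a 1 ns clock cycle: 2.0 ns
```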
27

• Global Miss Rate: the fraction of references that miss in all levels of a multilevel cache.
• Local Miss Rate: the fraction of references to one level of a cache that miss; used in multilevel hierarchies.
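With the numbers from the earlier two-level example (2% of references miss in the primary cache, 0.5% miss in all levels), the local miss rate of the secondary cache follows by division (a sketch; the variable names are mine):

```python
# Local miss rate of L2 = (references missing everywhere) / (references reaching L2).
# Illustrative sketch using the earlier example's numbers.
l1_miss_rate = 0.02       # fraction of references that miss in the primary cache
global_miss_rate = 0.005  # fraction of references that miss in all levels

l2_local_miss_rate = global_miss_rate / l1_miss_rate
print(round(l2_local_miss_rate, 4))  # 0.25: a quarter of L2 accesses miss
```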
28