CO101
Principle of Computer
Organization
Lecture 16: Memory 2
Liang Yanyan
澳門科技大學
Macau University of Science and Technology
Computer Organization
• The CPU clock rate is much higher than the memory speed.
• Slow memory can dramatically reduce performance.
2
[Figure: the processor (control + datapath) connects to memory through a cache; the hierarchy continues to main memory and secondary memory (disk), with input and output devices attached.]
A Typical Memory Hierarchy
• Take advantage of the principle of locality to present the
user with as much memory as is available in the
cheapest technology at the speed offered by the fastest
technology
3
[Figure: memory hierarchy. On-chip components (RegFile, instruction cache, data cache, ITLB, DTLB) → second-level cache (SRAM) → main memory (DRAM) → secondary memory (disk).]
Speed (% cycles): ½'s   1's   10's   100's   10,000's
Size (bytes):     100's   10K's   M's   G's   T's
Cost:             highest ........................ lowest
Important issues
• Where can data be placed in the cache when we copy it from main memory?
• We move data from memory to cache to exploit temporal and spatial locality.
• How do we locate data in the cache?
• This is related to the first issue.
• If the cache is full, which existing block should be replaced?
4
Processor reads data from memory
• Assume the memory address is M bits wide.
• Search the cache:
• Use the lower K bits of the memory address as the index to locate the corresponding cache entry.
• Check whether the tag equals the upper (M-K) bits of the memory address.
• Check whether the valid bit is 1.
• If all of these are true → cache hit; return the data from the cache to the processor.
5
[Figure: direct-mapped cache lookup for a 32-bit address. The CPU presents an address; the lower K = 10 bits index one of 1,024 entries (0-1023); the stored tag is compared with the upper (M-K) = 22 bits; if they match and the valid bit is set, Hit is asserted and the data is returned to the CPU.]
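A minimal C sketch of this lookup, assuming a 32-bit address and 1,024 one-word entries (so K = 10 and the tag is the upper 22 bits); the struct and function names are illustrative, not from the slides.

#include <stdint.h>
#include <stdbool.h>

#define NUM_ENTRIES 1024u             /* 2^K entries, K = 10           */
#define INDEX_BITS  10

typedef struct {
    bool     valid;
    uint32_t tag;                     /* upper (M - K) address bits    */
    uint32_t data;                    /* one word per entry            */
} CacheEntry;

static CacheEntry cache[NUM_ENTRIES];

/* Returns true on a hit and writes the cached word to *out. */
static bool cache_read(uint32_t addr, uint32_t *out) {
    uint32_t index = addr & (NUM_ENTRIES - 1);   /* lower K bits       */
    uint32_t tag   = addr >> INDEX_BITS;         /* upper (M - K) bits */
    if (cache[index].valid && cache[index].tag == tag) {
        *out = cache[index].data;                /* cache hit          */
        return true;
    }
    return false;                                /* cache miss         */
}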
Cache Example
• 8 blocks, 1 word/block, direct mapped
• Initial state
6
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 N
111 N
Cache Example
7
Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110
Index V Tag Data
000 N
001 N
010 N
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Cache Example
8
Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010
Cache Example
9
Index V Tag Data
000 N
001 N
010 Y 11 Mem[11010]
011 N
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
22 10 110 Hit 110
26 11 010 Hit 010
Cache Example
10
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 11 Mem[11010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000
3 00 011 Miss 011
16 10 000 Hit 000
Cache Example
11
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010
Cache Example
12
Index V Tag Data
000 Y 10 Mem[10000]
001 N
010 Y 10 Mem[10010]
011 Y 00 Mem[00011]
100 N
101 N
110 Y 10 Mem[10110]
111 N
Word addr Binary addr Hit/miss Cache block
16 10 000 Hit 000
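A small C sketch that replays the word-address sequence from this example (22, 26, 22, 26, 16, 3, 16, 18, 16) on the 8-block, 1-word-per-block direct-mapped cache: the index is the lower 3 bits of the word address and the tag is the remaining upper bits. The program structure is illustrative; its output should match the hit/miss results on the preceding slides.

#include <stdio.h>
#include <stdbool.h>

#define BLOCKS 8   /* 8 blocks, 1 word per block, direct mapped */

int main(void) {
    bool valid[BLOCKS] = { false };
    int  tag[BLOCKS];
    int  seq[] = { 22, 26, 22, 26, 16, 3, 16, 18, 16 };
    for (int i = 0; i < 9; i++) {
        int addr  = seq[i];
        int index = addr % BLOCKS;   /* lower 3 bits  */
        int t     = addr / BLOCKS;   /* upper bits    */
        if (valid[index] && tag[index] == t) {
            printf("addr %2d -> index %d: hit\n", addr, index);
        } else {
            printf("addr %2d -> index %d: miss\n", addr, index);
            valid[index] = true;     /* fill or replace the block */
            tag[index]   = t;
        }
    }
    return 0;
}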
Example:
• 4-bit memory address
13
[Figure: a 4-entry direct-mapped cache with 2-byte blocks. The 4-bit address is split into a 1-bit tag, a 2-bit index, and a 1-bit block offset; the index selects an entry, the tag comparison produces Hit, and the block offset drives a mux that picks one of the two 8-bit bytes as Data.]
Example:
14
[Figure: the same 4-entry, 2-byte-block cache, now filled with valid bits, tags, and the data bytes 0xCA 0xFE, 0xDE 0xAD, 0xBE 0xEF, and 0xFE 0xED.]
For the addresses below, what byte is read from the cache (or is there a miss)?
• 1010
• 1110
• 0001
• 1101
A larger case
• Here is a cache with 1,024
blocks of 4 bytes each, and
32-bit memory addresses.
15
• Tag = 20 bits, Index = 10 bits, Block offset = 2 bits
[Figure: a 1,024-entry direct-mapped cache with 4-byte blocks and a 32-bit address. The 10-bit index selects one of entries 0-1023, the 20-bit tag comparison produces Hit, and the 2-bit block offset drives a mux that selects one of the four bytes as Data.]
Cache Field Sizes
• The number of bits in a cache includes both the storage for data and for the tags.
• 32-bit byte address
• For a direct-mapped cache with 2^n blocks, n bits are used for the index.
• For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word.
• What is the size of the tag field?
• 32 - (n + m + 2)
• The total number of bits in a direct-mapped cache is then
• 2^n x (block size + tag field size + valid field size)
16
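A hedged worked example of the formula above in C, assuming a direct-mapped cache with 2^10 blocks of 4 words each and 32-bit byte addresses (so n = 10 and m = 2); these particular numbers are an assumed configuration, not one given on the slide.

#include <stdio.h>

int main(void) {
    int addr_bits  = 32;
    int n = 10;                          /* 2^n = 1024 blocks              */
    int m = 2;                           /* 2^m = 4 words per block        */
    int block_bits = (1 << m) * 32;      /* data bits per block = 128      */
    int tag_bits   = addr_bits - (n + m + 2);              /* = 18         */
    long total     = (1L << n) * (block_bits + tag_bits + 1); /* +1 valid  */
    printf("tag = %d bits, total = %ld bits (%.1f KiB)\n",
           tag_bits, total, total / 8.0 / 1024.0);         /* ~18.4 KiB    */
    return 0;
}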
MIPS Direct Mapped Cache Example
• One word blocks, cache size = 1K words (or 4KB)
17
[Figure: MIPS direct-mapped cache with 1,024 one-word blocks. Address bits [31:12] form the 20-bit tag, bits [11:2] form the 10-bit index (entries 0-1023), and bits [1:0] are the byte offset; a tag match with the valid bit set asserts Hit and returns the 32-bit data word.]
Multiword Block Direct Mapped Cache
• Four words/block, cache size = 1K words
18
[Figure: direct-mapped cache with four-word blocks and 1K words of data (256 entries). Address bits [31:12] form the 20-bit tag, bits [11:4] form the 8-bit index (entries 0-255), bits [3:2] are the block offset that selects one of the four words, and bits [1:0] are the byte offset; a tag match with the valid bit set asserts Hit and returns the 32-bit data word.]
Miss Rate vs Block Size vs Cache Size
• Miss rate goes up if the block size becomes a significant
fraction of the cache size because the number of blocks
that can be held in the same size cache is smaller
(increasing capacity misses)
19
Disadvantages of direct-mapped cache
• The direct-mapped cache is easy: indices and offsets
can be computed with bit operators or simple arithmetic,
because each memory address belongs to exactly one
block.
• However, this isn't very flexible. Consider a program that keeps reading two addresses A1, A2, A1, A2, A1, A2, … where A1 and A2 have the same cache index and block offset. What will happen?
20
Direct-mapped cache: keep reading 0000, 1000
21
[Figure: a direct-mapped cache with four 2-byte blocks (indices 00-11) and a 16-entry memory holding bytes A, B at addresses 0000-0001 and C, D at 1000-1001.]
• Read 0000: miss. Load A, B from memory and store a copy into index 00 with tag 0.
• Read 1000: miss. It maps to the same index 00 but with tag 1, so it evicts A, B and loads C, D.
• Read 0000: miss again. It evicts C, D and reloads A, B.
• Read 1000: miss again. The two addresses keep evicting each other, so every access is a miss (conflict misses).
A fully associative cache
• A fully associative cache permits data to be stored in any cache block, instead of forcing each memory address into one particular block.
• When data is brought into the cache, it can be placed in any unused block; there is no cache index field to locate the entry.
• In this way we reduce conflicts between two or more memory addresses that map to the same cache block.
• In the previous example, we might put memory address A1 in
cache block 1, and address A2 in block 2. Then subsequent
accesses to A1 and A2 would all be hits instead of misses.
• If all the blocks are already in use, it is common to replace the least recently used (LRU) block, i.e., the one that has gone the longest without being accessed. Another replacement strategy is most recently used (MRU), which replaces the block that was accessed most recently.
22
Fully associative cache: keep reading 0000, 1000
23
[Figure: the same 16-entry memory (A, B at 0000-0001; C, D at 1000-1001) with a fully associative cache whose entries hold a tag and two data bytes.]
• Read 0000: miss. Load A, B from memory into any free block, with tag 000.
• Read 1000: miss. Load C, D from memory into another free block, with tag 100.
• Read 0000: hit. Served from the cache.
• Read 1000: hit. Served from the cache.
Data can be stored in any location in the cache, so there is no cache index field.
Disadvantages of fully associative cache
• Since data can be placed in any location, there is no cache index field.
• Assume the address is m bits and each block contains 2^k bytes of data.
• The tag is (m-k) bits long.
• To read data, we must check every block in the cache and compare its tag with the address being read. → This requires one comparator per block, i.e., a large amount of hardware.
24
[Figure: fully associative cache. The m-bit address is split into a cache tag (m-k bits) and a byte select (k bits); every entry's valid bit and (m-k)-bit tag are compared with the address tag in parallel, one comparator per block, and the byte select picks Byte 0 … Byte 2^k-1 from the matching block.]
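A minimal C sketch of a fully associative lookup, assuming 8 blocks of 2^k = 4 bytes each; in hardware every block needs its own tag comparator working in parallel, which is exactly the cost noted above, while this software sketch has to scan the blocks in a loop. All names and sizes are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS  8
#define BLOCK_BYTES 4                 /* 2^k bytes per block, k = 2 */

typedef struct {
    bool     valid;
    uint32_t tag;                     /* upper (m - k) address bits */
    uint8_t  data[BLOCK_BYTES];
} Block;

static Block cache[NUM_BLOCKS];

/* Every block is checked; hardware does all the comparisons at once. */
static bool fa_read(uint32_t addr, uint8_t *out) {
    uint32_t tag    = addr / BLOCK_BYTES;
    uint32_t offset = addr % BLOCK_BYTES;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            *out = cache[i].data[offset];
            return true;              /* hit  */
        }
    }
    return false;                     /* miss */
}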
Set-associative cache
• Set-associative cache provides an intermediate solution.
• Cache blocks are divided into groups, each group is called a set.
• Each memory address is mapped to exactly one set in the cache, but
data can be stored in arbitrary block within the set.
• If each set has N blocks, the cache is called an N-way set-associative cache.
• Here are several possible organizations of an eight-block cache.
25
[Figure: three organizations of an eight-block cache: 1-way associative (8 sets, 1 block each), 2-way associative (4 sets, 2 blocks each), and 4-way associative (2 sets, 4 blocks each).]
Locating a set associative block
• We can determine where a memory address belongs in a set-associative cache in a similar way as before.
• If a cache has 2^s sets and each block has 2^n bytes, an m-bit memory address is partitioned as follows.
26
Address (m bits): Tag (m-s-n bits) | Index (s bits) | Block offset (n bits)
• Tag: used to check whether the current block is the one we are looking for.
• Index: used to select a set, instead of an individual data block as before.
• Block offset: used to select a byte within the block.
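A hedged C sketch of this address split, assuming 2^s = 4 sets and 2^n = 16 bytes per block with a 32-bit address; the constants and helper names are illustrative, not part of the slides.

#include <stdint.h>

#define SET_BITS    2                 /* s: 2^s = 4 sets              */
#define OFFSET_BITS 4                 /* n: 2^n = 16 bytes per block  */

static inline uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);              /* lowest n bits   */
}
static inline uint32_t set_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1); /* next s bits    */
}
static inline uint32_t addr_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + SET_BITS);              /* remaining m-s-n */
}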
2-way set associative cache implementation
• How does an implementation
of a 2-way set associative
cache compare with that of a
fully-associative cache?
• Only two comparators are
needed.
• The cache tags are a little
shorter too.
• Some bits are used for
index.
27
[Figure: 2-way set-associative cache. The m-bit address is split into a tag (m-k-n bits), a k-bit index, and an n-bit block offset. The index selects one set of two blocks, each with a valid bit, a tag, and a 2^n-byte data block; two comparators check both tags in parallel, and a 2-to-1 mux selects the hitting way's data block.]
Example: a 2-way set associative cache
28
[Figure: a 2-way set-associative cache with four sets (00-11); each set has Block 0 and Block 1, and each block holds a tag and two data bytes. Memory holds A, B at addresses 0000-0001 and C, D at 1000-1001.]
The middle two address bits are used as the cache index to locate a set; the first address bit is used as the tag.
• Read 0000: miss. Load it from memory and store a copy into set 00; as block 0 is free, it goes into block 0 (tag 0, data A B).
• Read 1000: miss. Load it from memory and store a copy into set 00; as block 1 is free, it goes into block 1 (tag 1, data C D).
Example
• Where would data from memory byte address 6195 be placed in the cache,
assuming the eight-block cache designs below, with 16 bytes per block?
• 6195 in binary is 00…0 110000 011 0011.
• Each block has 16 bytes, so the lowest 4 bits are the block offset.
For the 1-way cache, the next three bits (011) are the set index.
For the 2-way cache, the next two bits (11) are the set index.
For the 4-way cache, the next one bit (1) is the set index.
• The data may go to any block within the selected set (highlighted in the figure).
29
[Figure: the same three eight-block organizations: 1-way (8 sets, 1 block each; set 011 selected), 2-way (4 sets, 2 blocks each; set 11 selected), and 4-way (2 sets, 4 blocks each; set 1 selected).]
Set associative
• By now you may have noticed that a 1-way set-associative cache is the same as a direct-mapped cache.
• Similarly, if a cache has N blocks, an N-way set-associative cache is the same as a fully associative cache.
30
[Figure: an eight-block cache organized as 1-way (8 sets, 1 block each = direct mapped), 2-way (4 sets, 2 blocks each), 4-way (2 sets, 4 blocks each), and 8-way (1 set, 8 blocks = fully associative).]
Range of Set Associative Caches
• For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
31
[Figure: address fields Tag | Index | Block offset | Byte offset. The index selects the set, the tag is used for the compare, and the block offset selects the word within the block. Increasing associativity shrinks the index and grows the tag, up to a fully associative cache (only one set) where the tag is all the bits except the block and byte offsets; decreasing associativity grows the index down to direct mapped (only one way), which has the smallest tags and needs only a single comparator.]
Spectrum of Associativity
• For a cache with 8 entries
32
Associativity Example
• Compare 4-block caches
• Direct mapped, 2-way set associative,
fully associative
• Block access sequence: 0, 8, 0, 6, 8
• Direct mapped
33
Block
address
Cache
index
Hit/miss Cache content after access
0 1 2 3
0 0 miss Mem[0]
8 0 miss Mem[8]
0 0 miss Mem[0]
6 2 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
Associativity Example
• 2-way set associative
• Fully associative
34
Block
address
Cache
index
Hit/miss Cache content after access
Set 0 Set 1
0 0 miss Mem[0]
8 0 miss Mem[0] Mem[8]
0 0 hit Mem[0] Mem[8]
6 0 miss Mem[0] Mem[6]
8 0 miss Mem[8] Mem[6]
Block
address
Hit/miss Cache content after access
0 miss Mem[0]
8 miss Mem[0] Mem[8]
0 hit Mem[0] Mem[8]
6 miss Mem[0] Mem[8] Mem[6]
8 hit Mem[0] Mem[8] Mem[6]
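A small C sketch that replays the block-address sequence 0, 8, 0, 6, 8 on the 4-block, 2-way set-associative cache with LRU replacement; its output should reproduce the 2-way table above (4 misses, 1 hit, ending with Mem[8] and Mem[6] in set 0). The program structure is illustrative.

#include <stdio.h>
#include <stdbool.h>

#define SETS 2
#define WAYS 2

int main(void) {
    bool valid[SETS][WAYS] = {{false}};
    int  tag[SETS][WAYS];
    int  lru[SETS] = {0};            /* way to evict next in each set */
    int  seq[] = {0, 8, 0, 6, 8};

    for (int i = 0; i < 5; i++) {
        int blk = seq[i];
        int set = blk % SETS;        /* block address mod number of sets */
        int t   = blk / SETS;
        int hit_way = -1;
        for (int w = 0; w < WAYS; w++)
            if (valid[set][w] && tag[set][w] == t) hit_way = w;
        if (hit_way >= 0) {
            printf("block %d -> set %d: hit\n", blk, set);
            lru[set] = 1 - hit_way;          /* the other way is now LRU  */
        } else {
            int victim = lru[set];           /* evict the least recently used */
            printf("block %d -> set %d: miss (fills way %d)\n", blk, set, victim);
            valid[set][victim] = true;
            tag[set][victim]   = t;
            lru[set] = 1 - victim;
        }
    }
    return 0;
}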
How Much Associativity
• Increased associativity decreases miss rate
• But with diminishing returns
• Simulation of a system with 64KB
D-cache, 16-word blocks, SPEC2000
• 1-way: 10.3%
• 2-way: 8.6%
• 4-way: 8.3%
• 8-way: 8.1%
35
Benefits of Set Associative Caches
• The choice of direct mapped or set associative depends
on the cost of a miss versus the cost of implementation.
36
[Figure: miss rate (%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture, 2003.]
Largest gains are in going from direct mapped to 2-way (20%+
reduction in miss rate)
What if the set is full?
• Data can be placed in any block within a set.
• Direct mapped: no choice.
• Least recently used (LRU): replace the block in the set that has gone the longest without being accessed.
• Requires hardware to keep track of when each way's block was used relative to the other blocks in the set.
• For a 2-way set-associative cache, one bit per set is enough → when a way is referenced, update the bit to mark the other way as LRU.
• Most recently used (MRU): replace the block in the set that was accessed most recently.
• For random access patterns and repeated scans over large datasets (sometimes called cyclic access patterns), MRU has more hits than LRU because it tends to retain older data.
• Random
• Gives approximately the same performance as LRU for high associativity.
37
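A minimal C sketch of the one-bit-per-set LRU bookkeeping described above for a 2-way cache: on every reference the bit is pointed at the way that was not just used, so it always names the victim for the next miss in that set. The names and the set count are illustrative.

#define NUM_SETS 256
static unsigned char lru_bit[NUM_SETS];  /* one bit per set is enough for 2 ways */

/* Call on every hit or fill: the other way becomes the LRU way. */
static void touch(int set, int way_used) { lru_bit[set] = 1 - way_used; }

/* On a miss, the bit already names the least recently used way.  */
static int victim(int set) { return lru_bit[set]; }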
Write-hit policy: write through
• Write hit: the processor writes data (e.g., sw) and the address is found in the cache.
• Write through: data is written both to the block in the cache and to main memory.
• Advantage: cache and main memory are always consistent.
• Disadvantage: every write goes to memory, consuming memory bandwidth.
38
[Figure: write through. A store to address 214 = 1101 0110 hits cache index 110 (tag 11010); the value 21763 is written both to the cache block and to Mem[214] in main memory.]
Write buffer
• Write through is usually paired with a write buffer so that the processor does not need to wait for the slow main memory.
• A write buffer queues the write requests from the processor and performs the actual memory writes later.
• The processor does not wait for the write operation to complete; it just issues the write request to the buffer and continues processing other instructions.
• If the processor issues write requests faster than memory can absorb them, the extra requests are held in the buffer and written to main memory later.
39
[Figure: Processor → Cache and Write Buffer → Main Memory.]
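A hedged C sketch of a write buffer as a small FIFO queue: the processor enqueues (address, data) pairs and keeps running; the memory controller drains entries to main memory later. The 4-entry depth and all names are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 4

typedef struct { uint32_t addr; uint32_t data; } WriteReq;

static WriteReq buf[WB_DEPTH];
static int head = 0, tail = 0, count = 0;

/* Processor side: returns false (stall) only if the buffer is full. */
static bool wb_enqueue(uint32_t addr, uint32_t data) {
    if (count == WB_DEPTH) return false;
    buf[tail] = (WriteReq){ addr, data };
    tail = (tail + 1) % WB_DEPTH;
    count++;
    return true;                     /* processor proceeds immediately */
}

/* Memory side: drain one pending write to slow main memory. */
static bool wb_drain(void (*mem_write)(uint32_t, uint32_t)) {
    if (count == 0) return false;
    mem_write(buf[head].addr, buf[head].data);
    head = (head + 1) % WB_DEPTH;
    count--;
    return true;
}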
Write-hit policy: write back
• Write back: Data is written only to the block in the cache.
This modified cache block is written to main memory
only when it is replaced.
• Advantage: writing to the cache is fast, and subsequent reads are served by the cache. There is no need to access memory on every write.
• Disadvantage: the cache and main memory may temporarily hold different data.
40
Use a dirty bit to indicate data inconsistency between the cache and main memory.
[Figure: write back. The store of 21763 to address 214 = 1101 0110 updates only the cache block at index 110 (tag 11010, V = 1, dirty = 1); main memory is updated later, when the block is replaced.]
Write back: replacement on a read miss
• If there is a read miss and the mapped cache location is already occupied:
• Step 1: select the victim block (the LRU block in a set-associative cache) and check its dirty bit. Dirty bit = 1 means the data in the cache has been modified; dirty bit = 0 means it has not.
• Step 2: if the dirty bit is 1, write the victim block from the cache back to memory.
• Step 3: copy the requested data from main memory into the cache and set the dirty bit to 0.
41
[Figure: before, cache index 110 holds tag 11010, data 21763, dirty = 1 (a modified copy of Mem[1101 0110]); a read of address 1000 1110, which also maps to index 110, misses. After, the dirty value 21763 is first written back to Mem[1101 0110], then Mem[1000 1110] = 1225 is loaded into index 110 with tag 10001 and dirty = 0.]
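A hedged C sketch of the three steps above for a direct-mapped write-back cache: on a read miss, the victim block is written back only if its dirty bit is set, then the requested block is fetched and the dirty bit cleared. One word per block is assumed for brevity, and the memory helpers are assumed placeholders.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data;                      /* one word per block for brevity */
} Line;

/* Assumed placeholder memory helpers, not part of the slides. */
extern uint32_t mem_read(uint32_t addr);
extern void     mem_write(uint32_t addr, uint32_t data);

uint32_t read_with_writeback(Line *line, uint32_t addr, uint32_t index_bits) {
    uint32_t tag = addr >> index_bits;
    if (line->valid && line->tag == tag)
        return line->data;                            /* hit */

    /* Steps 1-2: if the victim is dirty, write it back to its old address. */
    if (line->valid && line->dirty) {
        uint32_t old_addr = (line->tag << index_bits) |
                            (addr & ((1u << index_bits) - 1));
        mem_write(old_addr, line->data);
    }
    /* Step 3: fetch the requested block and clear the dirty bit. */
    line->data  = mem_read(addr);
    line->tag   = tag;
    line->valid = true;
    line->dirty = false;
    return line->data;
}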
Write-miss Policy: Write Allocate versus Write Not Allocate
• When the processor writes data (e.g., sw) but the block is not found in the cache → write miss.
• Write Allocate
• Load the block from memory into the cache, then redo the write, which now results in a write hit.
• Handle it with the write-hit policy.
• Write Not Allocate
• Just update memory only.
42
[Figure: a store of 21763 to address 1101 0110 (Mem[214]) misses because cache index 110 currently holds a different tag (00010). Write allocate would first load the block into the cache; write not allocate would update only main memory.]
Possible combinations
Write hit policy Write miss policy
1. Write Through Write Allocate
2. Write Through Write Not Allocate
3. Write Back Write Allocate
4. Write Back Write Not Allocate
43
Possible combinations
• Write Through with Write Allocate:
• On hits it writes to both the cache and main memory.
• On misses it loads the block from main memory into the cache and then updates the cache.
• The point of Write Allocate is that subsequent writes to the same block will hit in the cache, so they only need to update the cache. With Write Through, however, this benefit is lost.
• Since Write Through is used, even on a cache hit every write still goes to both the cache and main memory, so memory is updated anyway and the cache cannot be used to save time on writes.
• Write Through with Write Not Allocate:
• On hits it writes to both the cache and main memory.
• On misses it updates the block in main memory only.
• As a result, on a cache miss, time is saved by not loading the block from memory into the cache.
44
Possible combinations
• Write Back with Write Allocate:
• On hits it writes to the cache only; main memory is not updated.
• On misses it loads the block from main memory into the cache and then updates the cache.
• The point of Write Allocate is that subsequent writes to the same block will hit in the cache, so they only need to update the cache. Write Back does exploit this benefit.
• Subsequent writes to the same block hit in the cache; Write Back only needs to update the cache and set the block's dirty bit. This eliminates extra memory accesses and is much more efficient than the (Write Through + Write Allocate) combination.
• Write Back with Write Not Allocate:
• On hits it writes to the cache only; main memory is not updated.
• On misses it updates the block in main memory only.
• Subsequent writes to a block that originally caused a write miss will keep missing, which makes this combination very inefficient.
45
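A hedged C sketch contrasting the two most common pairings on a store: write-back + write-allocate versus write-through + write-no-allocate. The helper functions are assumed placeholders, not a real cache API.

#include <stdint.h>
#include <stdbool.h>

/* Assumed placeholder helpers for this sketch. */
extern bool cache_lookup(uint32_t addr);             /* is the block in the cache?  */
extern void cache_fill(uint32_t addr);                /* load the block from memory  */
extern void cache_update(uint32_t addr, uint32_t v, bool set_dirty);
extern void mem_write(uint32_t addr, uint32_t v);

/* Write-back + write-allocate: memory is touched only when dirty blocks
 * are later evicted; a write miss first brings the block into the cache. */
void store_wb_wa(uint32_t addr, uint32_t v) {
    if (!cache_lookup(addr)) cache_fill(addr);        /* allocate on miss */
    cache_update(addr, v, /*set_dirty=*/true);        /* cache only       */
}

/* Write-through + write-no-allocate: every store goes to memory;
 * a write miss bypasses the cache entirely.                              */
void store_wt_wna(uint32_t addr, uint32_t v) {
    if (cache_lookup(addr)) cache_update(addr, v, /*set_dirty=*/false);
    mem_write(addr, v);                               /* always to memory */
}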
Two Machines’ Cache Parameters
46
• L1 cache organization & size. Intel Nehalem: split I$ and D$, 32KB each per core, 64B blocks. AMD Barcelona: split I$ and D$, 64KB each per core, 64B blocks.
• L1 associativity. Nehalem: 4-way (I), 8-way (D) set associative, ~LRU replacement. Barcelona: 2-way set associative, LRU replacement.
• L1 write policy. Both: write-back, write-allocate.
• L2 cache organization & size. Nehalem: unified, 256KB (0.25MB) per core, 64B blocks. Barcelona: unified, 512KB (0.5MB) per core, 64B blocks.
• L2 associativity. Nehalem: 8-way set associative, ~LRU. Barcelona: 16-way set associative, ~LRU.
• L2 write policy. Both: write-back, write-allocate.
• L3 cache organization & size. Nehalem: unified, 8192KB (8MB) shared by cores, 64B blocks. Barcelona: unified, 2048KB (2MB) shared by cores, 64B blocks.
• L3 associativity. Nehalem: 16-way set associative. Barcelona: 32-way set associative, evicts the block shared by the fewest cores.
• L3 write policy. Both: write-back, write-allocate.