
Computer Architecture ELEC3441
Lecture 8 – Cache
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
CPU-Memory Bottleneck
[Figure: CPU ↔ Memory]
Performance of high-speed computers is usually limited by memory bandwidth & latency
n Latency (time for a single access)
• Memory access time >> Processor cycle time
n Bandwidth (number of accesses per unit time)
If a fraction m of instructions access memory:
• 1+m memory references / instruction
• To achieve CPI = 1, we need 1+m memory refs / cycle (assuming RISC-V ISA)
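As a quick sanity check of this requirement (a minimal sketch; the value m = 0.3 loads/stores per instruction is an assumed illustrative figure, not a number from the slides):

```python
# Memory references per cycle needed to sustain CPI = 1.
m = 0.3                        # assumed: 30% of instructions are loads/stores
refs_per_instruction = 1 + m   # 1 instruction fetch + m data accesses
ipc = 1.0                      # CPI = 1  ->  1 instruction per cycle
print(refs_per_instruction * ipc)   # 1.3 memory references every cycle
```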
HKUEEE ENGG3441 – HS 2
Bandwidth vs Latency
n Example: DDR3 SDRAM
• Latency in the range of 30-50 ns
• Bandwidth in the range of 1.5-3 GT/s ≈ 10-20 GB/s
n Modern processors:
• In the range of 1-3 GHz clock rate
• Multiple instruction issues (2-4 memory instructions at the same time)
• Multiple cores (2-8)
n Gap:
• Memory bandwidth falls short of processor demand by roughly 2-16x
• Memory latency is more than 30x the processor cycle time
HKUEEE ENGG3441 – HS 3
Cache Memory
[Figure: CPU and Cache on the same die (small, fast SRAM), connected to Memory (large, slow DRAM)]
n High speed memory that holds temporary copy of frequently used data from main memory
n Usually on the same die as the CPU
n Low latency
• Typical: 1 – 3 processor cycles
n Limited capacity (compared to main memory)
• Typical: 1 KB to 10 MB (L1 cache)
HKUEEE ENGG3441 – HS 4

Cache Operation Overview
[Figure: CPU ↔ Cache ↔ Memory]
n To access a memory location:
• Look up memory content from cache
• If found, return
• If not found, look into memory
n Low latency access, statistically
• Need ways to make sure that data which will be needed is already in the cache
n Gives the illusion of a large and fast memory, statistically
HKUEEE ENGG3441 – HS 6
Memory Hierarchy
Capacity:     Register << SRAM << DRAM << Magnetic disk
Latency:      Register << SRAM << DRAM << Magnetic disk
Bandwidth:    on-chip >> off-chip >> I/O bus
Cost ($/bit): Register >> SRAM >> DRAM >> Magnetic disk
n The same concept of creating the illusion of fast and large memory spans from register file to hard disk
CPU (regfile) → L1$ → L2$ → L3$ → Memory → Hard Disk
HKUEEE ENGG3441 – HS 7
Real Memory Reference Patterns
[Figure: memory address (one dot per access) vs. time]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
8

Typical Memory Reference Patterns
[Figure: address vs. time. Instruction fetches trace out loop iterations; stack accesses show subroutine call, subroutine return and argument access; data accesses show vector and scalar accesses]
9
Two predictable properties of memory references:
§Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
§Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
10
Memory Reference Patterns
[Figure: the same reference trace, with regions of temporal locality and spatial locality highlighted]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
11
Caches exploit both types of predictability:
§Exploit temporal locality by remembering the contents of recently accessed locations.
§Exploit spatial locality by fetching blocks of data around recently accessed locations.
12
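As a small illustration of both properties (a Python sketch; the array and loop are invented for this example and are not from the lecture):

```python
# Hypothetical loop: summing an array touches the same accumulator repeatedly
# (temporal locality) and walks through consecutive elements (spatial locality).
data = list(range(1024))

total = 0                      # reused every iteration -> temporal locality
for i in range(len(data)):
    total += data[i]           # data[i], data[i+1], ... share cache lines -> spatial locality
print(total)
```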

Inside a Cache
[Figure: the CACHE sits between the Processor and Main Memory; the processor issues an Address and receives Data. Each cache Line holds an Address Tag (e.g. 100, 304, 6848, 416) plus a Data Block of several Data Bytes that is a copy of the corresponding main-memory locations (e.g. copies of locations 100 and 101)]
13
Cache Algorithm (Read)
Look at Processor Address, search cache tags to find match. Then either
Found in cache a.k.a. HIT
Return copy of data from cache
Not in cache a.k.a. MISS
Read block of data from Main Memory
Wait …
Return data to processor and update cache
14
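The read algorithm can be sketched in a few lines of Python (an illustration only: the cache is modelled as a tag-indexed dictionary, which hides the tag search, and read_memory_block is a hypothetical stand-in for the main-memory access):

```python
def cache_read(cache, address, offset_bits, read_memory_block):
    """Return the block holding 'address'; update the cache on a miss."""
    tag = address >> offset_bits        # drop the offset within the block
    if tag in cache:                    # HIT: a copy is already in the cache
        return cache[tag]
    block = read_memory_block(tag)      # MISS: read block from main memory (wait ...)
    cache[tag] = block                  # update cache, then return data to processor
    return block
```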
Designing Cache
Factors to consider when designing a cache:
n How big is the cache? → Capacity
n How much data to fetch from memory each time? → Line size
n Where to put data in the cache when it is fetched? → Cache organization
n How to deal with conflicts? → Replacement policy
n How to keep the cache synchronized with memory? → Read/write policies
HKUEEE ENGG3441 – HS 16

Line Size and Spatial Locality
A line is the unit of transfer between the cache and memory.

[Figure: a line of 4 words (Word0 - Word3), i.e. b = 2]

Split CPU address: | Tag (Line Address): 32-b-2 bits | Offset: b+2 bits |
Line size (a.k.a. block size) = 2^b words = 2^(b+2) bytes (1 word => 2 byte-offset bits)

Larger line size has distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses
What are the disadvantages of increasing line size?
Fewer lines => more conflicts. Can waste bandwidth.
17
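A minimal sketch of this address split, assuming 32-bit byte addresses and 4-byte words (the function name is just for illustration):

```python
def split_address(addr, b=2):
    """Split a 32-bit byte address into (line address, offset) for a 2**b-word line."""
    offset_bits = b + 2                    # b word-select bits + 2 byte-in-word bits
    offset = addr & ((1 << offset_bits) - 1)
    line_address = addr >> offset_bits     # the remaining 32 - b - 2 bits
    return line_address, offset

line, off = split_address(0x0000A004)
print(hex(line), hex(off))                 # 0xa00 0x4 -> same line as 0x0000A000
```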
3 Cache Configurations
n Fully Associative
n Direct Map
n Set Associative
HKUEEE ENGG3441 – HS 18
Fully Associative Cache
n Cache lines can be stored in any location of the cache
n Evict (overwrite) cache line only when out of space
n Work similar to an ideal cache
• except with realistic capacity limitation
Address split: | Tag | Offset |

Valid | Tag      | Content
T     | ABC00E3  | D0D0D0D0
T     | E000923D | E0E0E0E0
F     | CAA000E3 | 00000000
F     |          |
F     |          |
F     |          |
F     |          |
F     |          |
HKUEEE ENGG3441 – HS 19
Example: Fully Associative
Fully Associative with 8 entries, line size = 4 words (b=2), 32-bit address
Offset size: 4 bits; Tag size: 28 bits
Address split: | Tag | Offset |

Loc | Valid | Tag | Content
000 |       |     |
001 |       |     |
010 |       |     |
011 |       |     |
100 |       |     |
101 |       |     |
110 |       |     |
111 |       |     |
HKUEEE ENGG3441 – HS 20

Example: Fully Associative
Fully Associative with 8 entries, line size = 4 words (b=2)
Current access: 0x0000A000 = 0000 … 1010 0000 0000 | 0000 (tag | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   |
0xBEEFFFF6 | R   |
0x0288A380 | R   |
0x0000A00C | R   |

Loc       | Valid | Tag              | Content
000       | T     | 0…1010 0000 0000 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001 - 111 | F     |                  |
HKUEEE ENGG3441 – HS 21
Example: Fully Associative
Fully Associative with 8 entries, line size = 4 words (b=2)
Current access: 0x0000A004 = 0000 … 1010 0000 0000 | 0100 (tag | offset) → Hit

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   |
0x0288A380 | R   |
0x0000A00C | R   |

Loc       | Valid | Tag              | Content
000       | T     | 0…1010 0000 0000 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001 - 111 | F     |                  |
HKUEEE ENGG3441 – HS 22
Example: Fully Associative
Fully Associative with 8 entries, line size = 4 words (b=2)
Current access: 0xBEEFFFF6 = 1011 … 1111 1111 1111 | 0110 (tag | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   |
0x0000A00C | R   |

Loc       | Valid | Tag     | Content
000       | T     | 0000A00 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001       | T     | BEEFFFF | EEEE3333 EEEE2222 EEEE1111 EEEE0000
010 - 111 | F     |         |
HKUEEE ENGG3441 – HS 23
Example: Fully Associative
Fully Associative with 8 entries, line size = 4 words (b=2)
Current access: 0x0288A380 = 0000 … 1010 0011 1000 | 0000 (tag | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   | Miss
0x0000A00C | R   |

Loc       | Valid | Tag     | Content
000       | T     | 0000A00 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001       | T     | BEEFFFF | EEEE3333 EEEE2222 EEEE1111 EEEE0000
010       | T     | 0288A38 | F3030303 F2020202 F1010101 F0000000
011 - 111 | F     |         |
HKUEEE ENGG3441 – HS 24

Example: Fully Associative
Fully Associative with 8 entries, line size = 4 words (b=2)
Current access: 0x0000A00C = 0000 … 1010 0000 0000 | 1100 (tag | offset) → Hit

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   | Miss
0x0000A00C | R   | Hit

Loc       | Valid | Tag     | Content
000       | T     | 0000A00 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001       | T     | BEEFFFF | EEEE3333 EEEE2222 EEEE1111 EEEE0000
010       | T     | 0288A38 | F3030303 F2020202 F1010101 F0000000
011 - 111 | F     |         |
HKUEEE ENGG3441 – HS 25
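The walkthrough above can be reproduced with a small simulation (a sketch, not the lecture's code: memory contents are ignored and only the tags are tracked). With 8 entries and 16-byte lines it should print the same Miss, Hit, Miss, Miss, Hit sequence:

```python
def simulate_fully_associative(addresses, num_lines=8, offset_bits=4):
    tags = []                                 # tags currently held (at most num_lines)
    results = []
    for addr in addresses:
        tag = addr >> offset_bits             # 28-bit tag, 4-bit offset
        if tag in tags:
            results.append("Hit")
        else:
            results.append("Miss")
            if len(tags) == num_lines:        # only evict when out of space
                tags.pop(0)
            tags.append(tag)
    return results

accesses = [0x0000A000, 0x0000A004, 0xBEEFFFF6, 0x0288A380, 0x0000A00C]
print(simulate_fully_associative(accesses))   # ['Miss', 'Hit', 'Miss', 'Miss', 'Hit']
```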
Fully Associative Cache
[Figure: fully associative cache. The address is split into tag (t bits) and offset (b bits); the tag is compared in parallel against every stored (V, Tag, Data) entry, any match raises HIT, and the offset selects the Data Word or Byte]
26
Direct Map Cache
n Simple and realistic implementation
n Each memory address may be stored at only one possible location in the cache
• Usually by taking k bits of the address to index a 2^k-line cache
n More than one memory address may be mapped to the same cache location
• → Collision
• Evict old content before new content is stored
n Simple and fast
• Often used in L1 cache
HKUEEE ENGG3441 – HS 27
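A direct-mapped lookup therefore checks exactly one line, selected by the index bits. A minimal sketch, assuming a 2^k-line cache with 16-byte lines (the data structure below is illustrative, not the lecture's):

```python
def direct_mapped_lookup(lines, addr, k=3, offset_bits=4):
    """lines: list of 2**k entries, each None or a dict {'tag': ..., 'data': ...}."""
    index = (addr >> offset_bits) & ((1 << k) - 1)   # k index bits select one line
    tag = addr >> (offset_bits + k)                  # remaining high bits are the tag
    line = lines[index]
    if line is not None and line["tag"] == tag:
        return "Hit", index
    return "Miss", index     # on a miss the old content of lines[index] is evicted
```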
Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=2)
Index size: 3 bits; Offset size: 4 bits; Tag size: 25 bits
Address split: | Tag | Index | Offset |

Loc | Valid | Tag | Content
000 |       |     |
001 |       |     |
010 |       |     |
011 |       |     |
100 |       |     |
101 |       |     |
110 |       |     |
111 |       |     |
HKUEEE ENGG3441 – HS 28

Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=2)
Current access: 0x0000A000 = 0000 … 1010 0000 0 | 000 | 0000 (tag | idx | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   |
0xBEEFFFF6 | R   |
0x0288A380 | R   |
0x0000A00C | R   |

Loc       | Valid | Tag           | Content
000       | T     | 0…1010 0000 0 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001 - 111 | F     |               |
HKUEEE ENGG3441 – HS 29
Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=2)
Current access: 0x0000A004 = 0000 … 1010 0000 0 | 000 | 0100 (tag | idx | offset) → Hit

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   |
0x0288A380 | R   |
0x0000A00C | R   |

Loc       | Valid | Tag           | Content
000       | T     | 0…1010 0000 0 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001 - 111 | F     |               |
HKUEEE ENGG3441 – HS 30
Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=2)
Current access: 0xBEEFFFF6 = 1011 … 1111 1111 1 | 111 | 0110 (tag | idx | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   |
0x0000A00C | R   |

Loc       | Valid | Tag           | Content
000       | T     | 0…1010 0000 0 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001 - 110 | F     |               |
111       | T     | 1011…1111 1   | EEEE3333 EEEE2222 EEEE1111 EEEE0000
HKUEEE ENGG3441 – HS 31
Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=2)
Current access: 0x0288A380 = 0000 … 1010 0011 1 | 000 | 0000 (tag | idx | offset) → Miss
(Conflict: the new line also maps to Loc 000, so it evicts the 0x0000A000 line)

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   | Miss
0x0000A00C | R   |

Loc       | Valid | Tag           | Content
000       | T     | 0…1010 0011 1 | F3030303 F2020202 F1010101 F0000000
001 - 110 | F     |               |
111       | T     | 1011…1111 1   | EEEE3333 EEEE2222 EEEE1111 EEEE0000
HKUEEE ENGG3441 – HS 32

Example: Direct Map
Direct map with 8 entries, line size = 4 words (b=2)
Current access: 0x0000A00C = 0000 … 1010 0000 0 | 000 | 1100 (tag | idx | offset) → Miss
(The tag at Loc 000 no longer matches, so the 0x0000A000 line is fetched again, evicting the 0x0288A380 line)

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   | Miss
0x0000A00C | R   | Miss

Loc       | Valid | Tag           | Content
000       | T     | 0…1010 0000 0 | DDDD3333 DDDD2222 DDDD1111 DDDD0000
001 - 110 | F     |               |
111       | T     | 1011…1111 1   | EEEE3333 EEEE2222 EEEE1111 EEEE0000
HKUEEE ENGG3441 – HS 33
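Continuing the direct-mapped sketch from above, running the same five accesses through an 8-line model reproduces the extra misses caused by the index-000 conflict between 0x0000A000 and 0x0288A380 (an illustration only; it should print Miss, Hit, Miss, Miss, Miss):

```python
def simulate_direct_mapped(addresses, k=3, offset_bits=4):
    lines = [None] * (1 << k)                  # 8 lines; each entry holds just a tag here
    results = []
    for addr in addresses:
        index = (addr >> offset_bits) & ((1 << k) - 1)
        tag = addr >> (offset_bits + k)
        if lines[index] == tag:
            results.append("Hit")
        else:
            results.append("Miss")
            lines[index] = tag                 # evict whatever was there before
    return results

accesses = [0x0000A000, 0x0000A004, 0xBEEFFFF6, 0x0288A380, 0x0000A00C]
print(simulate_direct_mapped(accesses))        # ['Miss', 'Hit', 'Miss', 'Miss', 'Miss']
```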
Direct-Mapped Cache
[Figure: direct-mapped cache. The address is split into tag (t bits), index (k bits) and offset (b bits); the index selects one of the 2^k lines, the stored tag is compared with the address tag, a match raises HIT, and the offset selects the Data Word or Byte]
34
Set Associative Cache
n One way to reduce miss on a cache is to increase the number of possible locations to store a data block
n An N-way set associative cache has N locations to store each data block
• A data block can be placed in any of the N locations
• The N locations form a set
n Each set may hold data with the same index
• Allows N different data blocks with the same index to be stored in the cache
n Need a replacement policy to determine which one of the N data blocks in the set to evict when an (N+1)th block with that index is written
HKUEEE ENGG3441 – HS 35
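A set-associative lookup combines the two previous ideas: the index selects a set and the tag is compared against every way in that set. A minimal sketch, with the set layout below assumed for illustration:

```python
def set_associative_lookup(sets, addr, k=2, offset_bits=4):
    """sets: list of 2**k sets; each set is a list of N dicts {'valid', 'tag', 'data'}."""
    index = (addr >> offset_bits) & ((1 << k) - 1)   # k bits select a set
    tag = addr >> (offset_bits + k)
    for way, line in enumerate(sets[index]):          # search all N ways of the set
        if line["valid"] and line["tag"] == tag:
            return "Hit", way
    return "Miss", None    # the replacement policy decides which way to evict
```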
Example: 2-Way Set Associative
2-way Set Associative with 4 entries, line size = 4 words (b=2)
Current access: 0x0000A000 = 0000 … 1010 00 | 00 | 0000 (tag | idx | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   |
0xBEEFFFF6 | R   |
0x0288A380 | R   |
0x0000A00C | R   |

Set | Way 0: V | Tag       | Content     | Way 1: V | Tag | Content
00  | T        | 0000A0_00 | D3 D2 D1 D0 | F        |     |
01  | F        |           |             | F        |     |
10  | F        |           |             | F        |     |
11  | F        |           |             | F        |     |
HKUEEE ENGG3441 – HS 36

Example: 2-Way Set Associative
2-way Set Associative with 4 entries, line size = 4 words (b=2)
Current access: 0x0288A380 = 0000 … 1010 0011 10 | 00 | 0000 (tag | idx | offset) → Miss

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   | Miss
0x0000A00C | R   |

Set | Way 0: V | Tag       | Content     | Way 1: V | Tag       | Content
00  | T        | 0000A0_00 | D3 D2 D1 D0 | T        | 0288A3_10 | F3 F2 F1 F0
01  | F        |           |             | F        |           |
10  | F        |           |             | F        |           |
11  | T        | BEEFFF_11 | E3 E2 E1 E0 | F        |           |
HKUEEE ENGG3441 – HS 37
Example: 2-Way Set Associative
2-way Set Associative with 4 entries, line size = 4 words (b=2)
Current access: 0x0000A00C = 0000 … 1010 00 | 00 | 1100 (tag | idx | offset) → Hit

Access     | R/W | Result
0x0000A000 | R   | Miss
0x0000A004 | R   | Hit
0xBEEFFFF6 | R   | Miss
0x0288A380 | R   | Miss
0x0000A00C | R   | Hit

Set | Way 0: V | Tag       | Content     | Way 1: V | Tag       | Content
00  | T        | 0000A0_00 | D3 D2 D1 D0 | T        | 0288A3_10 | F3 F2 F1 F0
01  | F        |           |             | F        |           |
10  | F        |           |             | F        |           |
11  | T        | BEEFFF_11 | E3 E2 E1 E0 | F        |           |

What happens if the next access is 0xC000A38C?
HKUEEE ENGG3441 – HS 38
2-Way Set-Associative Cache
[Figure: 2-way set-associative cache. The address is split into tag (t bits), index (k bits) and offset (b bits); the index selects a set, the two stored tags in the set are compared in parallel with the address tag, and a match raises HIT and selects the Data Word or Byte]
39
3 Cache Organizations
Same size, different organizations
[Figure: the same total capacity organized three ways: Direct Map (8 lines, indices 0-7, one tag+data per line), 2-way Set Associative (4 sets of 2 tag+data entries), Fully Associative (8 tag+data entries, a block may go anywhere)]
HKUEEE ENGG3441 – HS 40

Replacement Policy
In an associative cache, which line from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
  • LRU cache state must be updated on every access
  • True implementation only feasible for small sets (2-way)
  • Pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
  • Used in highly associative caches
• Not-Most-Recently Used (NMRU)
  • FIFO with exception for most-recently used line or lines
41
Least Recently Used
n On replacement, select the line that was accessed the least recently (oldest line)
n Need to memorize the access time of each line
Example: 4-way set associative, line size = 1 word; arrays a[], b[], c[], d[], e[] all map to row 0

Set state (at = access time | tag | data):
way 0: at 0 | a | a0
way 1: at 1 | b | b0
way 2: at 2 | c | c0
way 3: at 3 | d | d0

time   | 0  1  2  3  4  5  6  7  8  9  10 11
access | a0 b0 c0 d0 c0 d0 e0 a0 e0 b0 c0 d0
H/M    | M  M  M  M
HKUEEE ENGG3441 – HS 42
Least Recently Used
Accesses 4 and 5 (c0, d0) hit; their access times are updated:

Set state (at | tag | data):
way 0: at 0 | a | a0
way 1: at 1 | b | b0
way 2: at 4 | c | c0
way 3: at 5 | d | d0

time   | 0  1  2  3  4  5  6  7  8  9  10 11
access | a0 b0 c0 d0 c0 d0 e0 a0 e0 b0 c0 d0
H/M    | M  M  M  M  H  H
HKUEEE ENGG3441 – HS 43
Least Recently Used
Access 6 (e0) misses; the least recently used line (a, at = 0) is evicted and replaced by e:

Set state (at | tag | data):
way 0: at 6 | e | e0
way 1: at 1 | b | b0
way 2: at 4 | c | c0
way 3: at 5 | d | d0

time   | 0  1  2  3  4  5  6  7  8  9  10 11
access | a0 b0 c0 d0 c0 d0 e0 a0 e0 b0 c0 d0
H/M    | M  M  M  M  H  H  M
HKUEEE ENGG3441 – HS 44

Least Recently Used
Access 7 (a0) misses; the least recently used line (b, at = 1) is evicted and replaced by a:

Set state (at | tag | data):
way 0: at 6 | e | e0
way 1: at 7 | a | a0
way 2: at 4 | c | c0
way 3: at 5 | d | d0

time   | 0  1  2  3  4  5  6  7  8  9  10 11
access | a0 b0 c0 d0 c0 d0 e0 a0 e0 b0 c0 d0
H/M    | M  M  M  M  H  H  M  M
HKUEEE ENGG3441 – HS 45
Least Recently Used
Access 8 (e0) hits; e's access time is updated to 8:

Set state (at | tag | data):
way 0: at 8 | e | e0
way 1: at 7 | a | a0
way 2: at 4 | c | c0
way 3: at 5 | d | d0

time   | 0  1  2  3  4  5  6  7  8  9  10 11
access | a0 b0 c0 d0 c0 d0 e0 a0 e0 b0 c0 d0
H/M    | M  M  M  M  H  H  M  M  H
HKUEEE ENGG3441 – HS 46
Least Recently Used
Accesses 9, 10 and 11 (b0, c0, d0) all miss; each evicts the then least recently used line (c, then d, then a). Final state:

Set state (at | tag | data):
way 0: at 8  | e | e0
way 1: at 11 | d | d0
way 2: at 9  | b | b0
way 3: at 10 | c | c0

time   | 0  1  2  3  4  5  6  7  8  9  10 11
access | a0 b0 c0 d0 c0 d0 e0 a0 e0 b0 c0 d0
H/M    | M  M  M  M  H  H  M  M  H  M  M  M
HKUEEE ENGG3441 – HS 47
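A compact way to check this walkthrough is a true-LRU simulation of one set (a sketch: lines are identified only by their array name and the set is modelled as a recency-ordered list). It should reproduce the pattern M M M M H H M M H M M M:

```python
def simulate_lru_set(accesses, ways=4):
    lines = []                       # least recently used first, most recent last
    outcome = []
    for tag in accesses:
        if tag in lines:
            outcome.append("H")
            lines.remove(tag)        # refresh: move to most-recently-used position
        else:
            outcome.append("M")
            if len(lines) == ways:
                lines.pop(0)         # evict the least recently used line
        lines.append(tag)
    return outcome

trace = ["a", "b", "c", "d", "c", "d", "e", "a", "e", "b", "c", "d"]
print(" ".join(simulate_lru_set(trace)))   # M M M M H H M M H M M M
```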
Pseudo LRU
n Implementation challenges for true LRU:
• Requires storage for access time on every line
• Enormous amount of storage
• counter wrap around
• Requires comparison of all access time within a set
• Comparison is slow in hardware
n Pseudo LRU relaxes the requirement to find the absolutely oldest piece of data in a set
• Randomly pick any one of the older data in the set
n One simple implementation:
• Set 1 bit for each line of cache when accessed
• To replace, evict any one of the cache lines with 0 flag
• Periodically reset all flags to 0
n Advanced version:
• When there is a tie, evict only lines that are not dirty
HKUEEE ENGG3441 – HS 48
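The simple use-bit scheme above can be sketched as follows (a hedged illustration: empty ways are filled first, the flags are reset whenever they all become 1, which stands in for the slide's periodic reset, and the dirty-bit refinement is omitted):

```python
import random

class PseudoLRUSet:
    """One cache set with a single 'used' bit per line instead of full access times."""
    def __init__(self, ways=4):
        self.tags = [None] * ways
        self.used = [0] * ways              # 1 = accessed since the last reset

    def access(self, tag):
        if tag in self.tags:
            self.mark_used(self.tags.index(tag))
            return "H"
        # fill an empty way first; otherwise evict any way whose use bit is still 0
        empty = [w for w, t in enumerate(self.tags) if t is None]
        stale = [w for w, u in enumerate(self.used) if u == 0]
        way = empty[0] if empty else random.choice(stale)
        self.tags[way] = tag
        self.mark_used(way)
        return "M"

    def mark_used(self, way):
        self.used[way] = 1
        if all(self.used):                  # all flags set: reset them all ...
            self.used = [0] * len(self.used)
            self.used[way] = 1              # ... but keep the current access marked
```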

Acknowledgements
n These slides contain material developed and copyright by:
• Arvind (MIT)
• Krste Asanovic (MIT/UCB)
• Joel Emer (Intel/MIT)
• James Hoe (CMU)
• John Kubiatowicz (UCB)
• David Patterson (UCB)
• John Lazzaro (UCB)
n MIT material derived from course 6.823
n UCB material derived from courses CS152, CS252
HKUEEE ENGG3441 – HS 49