Carnegie Mellon
Cache Memories
15-213/18-213/14-513/15-513/18-613: Introduction to Computer Systems
11th Lecture, October 6, 2020
Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, Third Edition
Announcements
Lab 3 (attacklab)
▪ Due this Thursday, Oct. 8, 11:59pm ET
Lab 4 (cachelab)
▪ Out Oct. 8, 11:59pm ET
▪ Due Oct. 20, 11:59pm ET
Written Assignment 3 peer grading
▪ Due Wed, Oct. 7, 11:59pm ET
Written Assignment 4 available on Canvas
▪ Due Wed, Oct. 7, 11:59pm ET
Today
Cache memory organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
Recall: Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality:
▪ Recently referenced items are likely to be referenced again in the near future
Spatial locality:
▪ Items with nearby addresses tend to be referenced close together in time
Recall: Memory Hierarchy
[Figure: pyramid of smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices toward the bottom]
L0: Regs. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM). Holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Holds files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers)
Recall: General Cache Concepts
[Figure: a cache holding copies of blocks 8, 9, 14, and 3, above a memory partitioned into blocks 0-15; block 4 is shown being copied into the cache]
Smaller, faster, more expensive memory caches a subset of the blocks.
Data is copied in block-sized transfer units.
Larger, slower, cheaper memory viewed as partitioned into "blocks".
General Cache Concepts: Hit
Request: 14
[Figure: cache holding blocks 8, 9, 14, and 3, above memory blocks 0-15]
Data in block b is needed.
Block b is in cache: Hit!
General Cache Concepts: Miss
Request: 12
[Figure: cache above memory blocks 0-15; block 12 is fetched from memory and replaces block 9, leaving the cache holding blocks 8, 12, 14, and 3]
Data in block b is needed. Block b is not in cache: Miss!
Block b is fetched from memory.
Block b is stored in cache:
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
Recall: General Caching Concepts: 3 Types of Cache Misses
Cold (compulsory) miss
▪ Cold misses occur because the cache starts empty and this is the first reference to the block.
Capacity miss
▪ Occurs when the set of active cache blocks (working set) is larger than the cache.
Conflict miss
▪ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
▪ E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
▪ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
▪ E.g., referencing blocks 0, 8, 0, 8, 0, 8, … would miss every time.
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware
▪ Hold frequently accessed blocks of main memory
▪ CPU looks first for data in cache
Typical system structure:
[Figure: CPU chip containing the register file, ALU, and cache memory behind a bus interface; the system bus connects through an I/O bridge to the memory bus and main memory]
What it Really Looks Like
[Figure: the schematic CPU chip (register file, ALU, cache memory, bus interface) alongside die photos of Intel Nehalem, AMD FX 8150, and Core i7-3960X processors]
What it Really Looks Like (Cont.)
[Figure: Intel Sandy Bridge Processor Die]
L1: 32KB Instruction + 32KB Data
L2: 256KB
L3: 3–20MB
General Cache Organization (S, E, B)
[Figure: cache organized as S = 2^s sets; each set holds E = 2^e lines; each line has a valid bit v, a tag, and a cache block of B = 2^b data bytes (byte positions 0, 1, 2, …, B-1)]
Cache size = S x E x B data bytes
Cache Read
[Figure: address of word divided into t tag bits, s set-index bits, and b block-offset bits; the set index selects one of the S = 2^s sets, the tag is compared against the E = 2^e lines in that set, and the data begins at the offset within the B = 2^b-byte block]
• Locate set
• Check if any line in set has matching tag
• Yes + line valid: hit
• Locate data starting at offset
Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size B=8 bytes
[Figure: S = 2^s sets, each a single line (v, tag, bytes 0-7); address of int shown as (t bits | 0…01 | 100); the set-index bits 0…01 find the set]
Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size B=8 bytes
[Figure: the selected line (v, tag, bytes 0-7) is checked against the address (t bits | 0…01 | 100)]
valid? + match: assume yes (= hit)
block offset: 100
Example: Direct Mapped Cache (E = 1)
Direct mapped: One line per set
Assume: cache block size B=8 bytes
[Figure: block offset 100 selects within bytes 0-7 of the line; the int (4 Bytes) is here]
If tag doesn't match (= miss): old line is evicted and replaced
Direct-Mapped Cache Simulation
t=1 s=2 b=1
4-bit addresses (address space size M=16 bytes)
S=4 sets, E=1 blocks/set, B=2 bytes/block
Address trace (reads, one byte per read):
  0 [0000₂], miss (cold)
  1 [0001₂], hit
  7 [0111₂], miss (cold)
  8 [1000₂], miss (cold)
  0 [0000₂], miss (conflict)
[Final cache state: Set 0: v=1, Tag=0, Block=M[0-1] (the last access to 0 evicted M[8-9], Tag=1); Set 1: empty; Set 2: empty; Set 3: v=1, Tag=0, Block=M[6-7]]
E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size B=8 bytes
[Figure: S sets, each holding two lines (v, tag, bytes 0-7); address of short int shown as (t bits | 0…01 | 100); the set-index bits 0…01 find the set]
E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size B=8 bytes
[Figure: both lines of the selected set are checked against the address (t bits | 0…01 | 100): compare both tags]
valid? + match: yes (= hit)
block offset: 100
E-way Set Associative Cache (Here: E = 2)
E = 2: Two lines per set
Assume: cache block size B=8 bytes
[Figure: compare both lines; valid? + match: yes (= hit); block offset 100; the short int (2 Bytes) is here]
No match or not valid (= miss):
• One line in set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
2-Way Set Associative Cache Simulation
t=2 s=1 b=1
4-bit addresses (M=16 bytes)
S=2 sets, E=2 blocks/set, B=2 bytes/block
Address trace (reads, one byte per read):
  0 [0000₂], miss (cold)
  1 [0001₂], hit
  7 [0111₂], miss (cold)
  8 [1000₂], miss (cold)
  0 [0000₂], hit
[Final cache state: Set 0: line 0 (v=1, Tag=00, Block=M[0-1]), line 1 (v=1, Tag=10, Block=M[8-9]); Set 1: line 0 (v=1, Tag=01, Block=M[6-7]), line 1 empty]
What about writes?
Multiple copies of data exist:
▪ L1, L2, L3, Main Memory, Disk
What to do on a write-hit?
▪ Write-through (write immediately to memory)
▪ Write-back (defer write to memory until replacement of line)
  ▪ Each cache line needs a dirty bit (set if data differs from memory)
  [Figure: cache line with valid bit v, dirty bit d, tag, and B = 2^b data bytes (0, 1, 2, …, B-1)]
What to do on a write-miss?
▪ Write-allocate (load into cache, update line in cache)
  ▪ Good if more writes to the location will follow
▪ No-write-allocate (writes straight to memory, does not load into cache)
Typical
▪ Write-through + No-write-allocate
▪ Write-back + Write-allocate
Why Index Using Middle Bits?
Direct mapped: One line per set
Assume: cache block size 8 bytes
Standard Method: Middle bit indexing
▪ Address of int: (t bits | 0…01 | 100); the middle bits find the set
Alternative Method: High bit indexing
▪ Address of int: (1…11 | t bits | 100); the high bits find the set
[Figure: S = 2^s sets of (v, tag, bytes 0-7), with arrows showing which address bits select the set under each scheme]
Illustration of Indexing Approaches
64-byte memory
▪ 6-bit addresses
16-byte, direct-mapped cache
▪ Block size = 4. (Thus, 4 sets; why?)
▪ 2 bits tag, 2 bits index, 2 bits offset
[Figure: memory blocks 0000xx through 1111xx listed beside a cache with Set 0, Set 1, Set 2, Set 3]
Middle Bit Indexing
Addresses of form TTSSBB
▪ TT: Tag bits
▪ SS: Set index bits
▪ BB: Offset bits
Makes good use of spatial locality
[Figure: consecutive blocks 0000xx, 0001xx, 0010xx, 0011xx, … map cyclically to Set 0, Set 1, Set 2, Set 3]
High Bit Indexing
Addresses of form SSTTBB
▪ SS: Set index bits
▪ TT: Tag bits
▪ BB: Offset bits
Program with high spatial locality would generate lots of conflicts
[Figure: each contiguous quarter of memory (e.g., 0000xx-0011xx) maps entirely to a single set]
Intel Core i7 Cache Hierarchy
[Figure: processor package containing Core 0 through Core 3; each core has its own Regs, L1 d-cache, L1 i-cache, and L2 unified cache; the L3 unified cache is shared by all cores and connects to main memory]
L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
L2 unified cache: 256 KB, 8-way, Access: 10 cycles
L3 unified cache: 8 MB, 16-way, Access: 40-75 cycles
Block size: 64 bytes for all caches.
Example: Core i7 L1 Data Cache
32 kB, 8-way set associative, 64 bytes/block, 47-bit address range
B = ?
S = ?, s = ?
E = ?, e = ?
C = ?
Block offset: ? bits
Set index: ? bits
Tag: ? bits
Stack Address 0x00007f7262a1e010: Block offset 0x??, Set index 0x??, Tag 0x??
Example: Core i7 L1 Data Cache
32 kB, 8-way set associative, 64 bytes/block, 47-bit address range
B = 64
S = 64, s = 6
E = 8, e = 3
C = 64 x 64 x 8 = 32,768
Block offset: 6 bits
Set index: 6 bits
Tag: 35 bits
Stack Address 0x00007f7262a1e010 (low 12 bits: 0000 0001 0000): Block offset 0x10, Set index 0x0, Tag 0x7f7262a1e
Cache Performance Metrics
Miss Rate
▪ Fraction of memory accesses not found in cache (misses / accesses) = 1 - hit rate
▪ Typical numbers (as %):
  ▪ 3-10% for L1
  ▪ can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
▪ Time to deliver a cached block to the processor
  ▪ includes time to determine whether line is in cache
▪ Typical numbers:
  ▪ 4 clock cycles for L1
  ▪ 10 clock cycles for L2
Miss Penalty
▪ Additional time required because of a miss
  ▪ typically 50-200 cycles for main memory (Trend: increasing!)
How Bad Can a Few Cache Misses Be?
Huge difference between a hit and a miss
▪ Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
▪ Consider this simplified example: cache hit time of 1 cycle, miss penalty of 100 cycles
▪ Average access time:
  ▪ 97% hits: 1 cycle + 0.03 x 100 cycles = 4 cycles
  ▪ 99% hits: 1 cycle + 0.01 x 100 cycles = 2 cycles
This is why "miss rate" is used instead of "hit rate"
Writing Cache Friendly Code
Make the common case go fast
▪ Focus on the inner loops of the core functions
Minimize the misses in the inner loops
▪ Repeated references to variables are good (temporal locality)
▪ Stride-1 reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories
Quiz Time!
Check out: https://canvas.cmu.edu/courses/17808
Today
Cache organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
The Memory Mountain
Read throughput (read bandwidth)
▪ Number of bytes read from memory per second (MB/s)
Memory mountain: Measured read throughput as a function of spatial and temporal locality.
▪ Compact way to characterize memory system performance.
Memory Mountain Test Function
Call test() with many combinations of elems and stride.
For each elems and stride:
1. Call test() once to warm up the caches.
2. Call test() again and measure the read throughput (MB/s)

long data[MAXELEMS];      /* Global array to traverse */

/* test - Iterate over first "elems" elements of
 *        array "data" with stride of "stride",
 *        using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems, limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}

mountain/mountain.c
The Memory Mountain
[Figure: 3D surface of read throughput (MB/s, 0 to 16000) versus working-set size (32k to 128m bytes) and stride (s1 to s11, x8 bytes), measured on a Core i7 Haswell, 2.1 GHz, with 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache, and 64 B block size. Labeled features: aggressive prefetching; slopes of spatial locality along the stride axis; ridges of temporal locality, i.e. plateaus where the working set fits in L1, L2, L3, or only Mem]
Today
Cache organization and operation
Performance impact of caches
▪ The memory mountain
▪ Rearranging loops to improve spatial locality
▪ Using blocking to improve temporal locality
Carnegie Mellon
Matrix Multiplication Example
Description:
▪ Multiply N x N matrices
▪ Matrix elements are doubles (8 bytes)
▪ O(N3) total operations
▪ N reads per source
element
▪ N values summed per destination
▪ but may be able to hold in register
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
42
/* ijk */
for (i=0; i
Stepping through columns in one row:
▪ for (i = 0; i < n; i++)
    sum += a[0][i];
▪ accesses successive elements
▪ if block size B > sizeof(aij), exploit spatial locality
▪ miss rate = sizeof(aij) / B
Stepping through rows in one column:
▪ for (i = 0; i < n; i++)
    sum += a[i][0];
▪ accesses distant elements
▪ no spatial locality!
▪ miss rate = 1 (i.e. 100%)
Matrix Multiplication (ijk)
Inner loop:

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}