Lecture 07
Memory Hierarchy
COMP1411: Introduction to Computer Systems
Dr. Mingsong LYU (呂鳴松)
Department of Computing,
The Hong Kong Polytechnic University
Spring 2022
These slides are intended for internal use only. Do not publish them anywhere without permission.
Acknowledgement: These slides are based on the textbook (Computer Systems: A Programmer’s Perspective) and its slides.
Computer components revisited
Bus interface
Register file
System bus
Memory bus
controller
controller
Expansion slots for
other devices such
as network adapters
I/O devices allow the computer to interact with the outside world
The CPU and memory are the two core components that enable a modern computer to compute, much like our brain
LEARNING TARGETS
Get to know different memory hardware in a computer
Memories are major performance bottlenecks
Locality – Opportunity to improve memory performance
Caching – An important solution leveraging locality
Memory hierarchy – the reality of memory system
So far, what we have
$ ./hello (step 1)
$ ./hello (step 2)
movq $1, %rax
Solid State Drive
Main memory
Access latency
movq 8(%rsi), %rax
Read the word at the memory address given by 8(%rsi) into register %rax
Think about the pipelined CPU
Both latency for one instruction and throughput for the CPU are significantly degraded
Write back
Write back
Write back
Instruction 1
Instruction 2
Instruction 3
Access latency
Time used by CPU to execute one instruction
1 cycle for most instructions (1 GHz CPU, 1 cycle = 10^-9 s)
Time used to fetch a word from main memory
10 ~ 100 cycles
Time used to fetch a block of data from disks
10,000 ~ 1,000,000 cycles
movq 8(%rsi), %rax
fread(buffer, 1024, 1, fp)
Most of the time, the CPU, the most precious resource, is idle, waiting for data
Access latency
The gap between CPU, main memory and disks
[Reconstructed chart data] Access time (ns) by year:

Year                       1985        1990        1995       2000       2003       2005       2010       2015
Disk seek time       75,000,000  28,000,000  10,000,000  8,000,000  6,000,000  5,000,000  3,000,000  3,000,000
SSD access time      (only one value survived extraction: 50,000)
DRAM access time            200         100          70         60         55         50         40         20
SRAM access time            150          35          15          3        2.5          2        1.5        1.3
CPU cycle time              166          50           6        1.6        0.3        0.5        0.4       0.33
Effective CPU cycle time    (four surviving values: 0.3, 0.25, 0.1, 0.08)
Solve the problem:
how do we stop a fast CPU from being wasted waiting for slow memory?
Locality – Opportunity
Principle of locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality: recently referenced items (data, instructions) are likely to be referenced again in the near future
Spatial locality: items (data, instructions) with nearby addresses tend to be referenced close together in time
THINK: WHY DOES LOCALITY EXIST?
Examples of spatial locality
To compute the sum of all elements in a 2-D array
GOOD LOCALITY
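The code for this example did not survive the slide extraction. Below is a plausible sketch (the array size M x N and the function name sum_array_rows are assumptions) of a row-major, stride-1 traversal with good locality:

```c
#define M 4   /* rows (illustrative size, an assumption) */
#define N 4   /* columns */

/* Row-major traversal: a[i][j] is visited in exactly the order the
   elements sit in memory, a stride-1 reference pattern (good spatial
   locality). The scalar "sum" is reused on every iteration (good
   temporal locality). */
int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```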
Examples of spatial locality
To compute the sum of all elements in a 2-D array
BAD LOCALITY
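Again the code is missing from the extraction; a plausible sketch of the bad-locality version (same assumed M x N array) traverses column by column, so consecutive accesses are N elements apart, the stride-4 pattern mentioned later in the slides:

```c
#define M 4   /* rows (illustrative size, an assumption) */
#define N 4   /* columns */

/* Column-major traversal of a row-major C array: consecutive accesses
   a[0][j], a[1][j], ... are N elements apart in memory, a stride-N
   reference pattern (poor spatial locality). */
int sum_array_cols(int a[M][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

The result is identical to the row-major version; only the memory access order, and hence the cache behavior, differs.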
Examples of temporal locality
Data references
The access to “sum” in the inner loop
Once accessed, it will be accessed again in the near future
Instruction references
The instructions to do “sum += a[][]”
Once executed, they will be executed again in the near future
To understand locality for “data” and “instructions”
They are essentially the same, as instructions are special data stored in memory
To measure locality
Stride: the distance between two consecutive data accesses in memory, measured in units of one data element
Stride-1 reference pattern: access the data one by one according to their memory addresses, such as the good locality example
Stride-k reference pattern: for example, the bad locality example generally has a stride-4 reference pattern
The smaller the stride, the better the locality
The idea of caching
A coke fanatic story
To execute an instruction
I need data A
Is A in cache?
from cache to register
Copy A from memory to register
Execute instruction
from cache to register
The coke fanatic story in a computer
Adding a small but fast memory inside the CPU
1+ cycles to fetch data
Cache miss
100+ cycles to fetch data
Caching example
When a program executes, it generates 20 memory accesses
ABCDCCCDDDEGHGHGHGHB
The unit of data loading is “one block”
One block contains two data variables
Cache size = 2 blocks
Data access time
Cache hit: 1 cycle
Cache miss: 200 cycles
Caching example
ABCDCCCDDDEGHGHGHGHB
Caching example
Performance with cache
Access time = 1 * 15 + 5 * 200 = 1015 cycles
Performance without cache
Access time = 200 * 20 = 4000 cycles
ABCDCCCDDDEGHGHGHGHB
MHMHHHHHHHMMHHHHHHHM
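The hit/miss trace above can be reproduced with a short simulation. This is a sketch: the block mapping {A,B}, {C,D}, {E,F}, {G,H} and LRU replacement are assumptions that are consistent with the slide's result.

```c
#define CACHE_BLOCKS 2
#define HIT_CYCLES   1
#define MISS_CYCLES  200

/* Simulate the access trace; writes 'H'/'M' per access into pattern
   (a caller-supplied buffer) and returns the total access cycles. */
long simulate(const char *trace, char *pattern) {
    int cache[CACHE_BLOCKS];      /* block id held in each slot, -1 = empty */
    int last_use[CACHE_BLOCKS];   /* time of last access, for LRU eviction  */
    for (int i = 0; i < CACHE_BLOCKS; i++) { cache[i] = -1; last_use[i] = -1; }

    long cycles = 0;
    int t;
    for (t = 0; trace[t]; t++) {
        int block = (trace[t] - 'A') / 2;   /* two variables per block */
        int slot = -1;
        for (int i = 0; i < CACHE_BLOCKS; i++)
            if (cache[i] == block) slot = i;
        if (slot >= 0) {                     /* cache hit */
            pattern[t] = 'H';
            cycles += HIT_CYCLES;
        } else {                             /* miss: evict the LRU slot */
            pattern[t] = 'M';
            cycles += MISS_CYCLES;
            slot = 0;
            for (int i = 1; i < CACHE_BLOCKS; i++)
                if (last_use[i] < last_use[slot]) slot = i;
            cache[slot] = block;
        }
        last_use[slot] = t;
    }
    pattern[t] = '\0';
    return cycles;
}
```

Calling simulate("ABCDCCCDDDEGHGHGHGHB", pattern) produces the pattern MHMHHHHHHHMMHHHHHHHM and 1015 cycles, matching the slide's numbers.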
Exploiting locality in your programs
To do array multiplication, C = A * B
Each input array is an n*n array
Each element in the array is double, i.e., 8 bytes
A, B, and C, each has a cache of size 32 bytes
Since n*8 is much larger than 32 bytes, the cache is far too small to hold a whole row or column of data
Let’s look at three different implementations
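The three implementations did not survive the slide extraction. Below is a sketch of the standard ijk, jki, and kij orderings commonly used for this comparison (the flat row-major array layout and the function names are assumptions). All three compute C += A*B; only the inner-loop access patterns differ:

```c
/* ijk: inner loop scans a row of A (stride-1) but a column of B
   (stride-n), so every access to B tends to miss. */
void mm_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] += sum;
        }
}

/* jki: inner loop scans columns of both A and C (stride-n), the worst
   case for spatial locality. */
void mm_jki(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double r = B[k*n + j];
            for (int i = 0; i < n; i++)
                C[i*n + j] += A[i*n + k] * r;
        }
}

/* kij: inner loop scans rows of both B and C (stride-1), the best
   case for spatial locality. */
void mm_kij(int n, const double *A, const double *B, double *C) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += B[k*n + j] * r;
        }
}
```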
Exploiting locality in your programs
i = 0, j = 0, k = 0
Exploiting locality in your programs
i = 0, j = 0, k = 1
Exploiting locality in your programs
i = 0, j = 0, k = 2
Exploiting locality in your programs
i = 0, j = 0, k = 3
Exploiting locality in your programs
i = 0, j = 0, k = 4
Exploiting locality in your programs
For accesses to elements of A, only 1 out of every 4 is a cache miss
Every access to an element of B is a cache miss
i = 0, j = 0, k = 7
Exploiting locality in your programs
i = 0, j = 0, k = 0
Exploiting locality in your programs
i = 1, j = 0, k = 0
Exploiting locality in your programs
i = 2, j = 0, k = 0
Exploiting locality in your programs
i = 7, j = 0, k = 0
Exploiting locality in your programs
i = 0, j = 0, k = 1
Every access to an element of A causes a cache miss
Every access to an element of C causes a cache miss
Exploiting locality in your programs
i = 0, j = 0, k = 0
i = 0, j = 1, k = 0
i = 0, j = 2, k = 0
i = 0, j = 3, k = 0
Exploiting locality in your programs
i = 1, j = 0, k = 0
i = 1, j = 1, k = 0
i = 1, j = 2, k = 0
i = 1, j = 3, k = 0
Exploiting locality in your programs
i = 2, j = (0, 1, 2, 3) k = 0
For accesses to elements of B, 1 out of every 4 is a cache miss
For accesses to elements of C, 1 out of every 4 is a cache miss
Exploiting locality in your programs
[Reconstructed chart data] Cycles per inner loop iteration vs. array size n, for six loop orderings:

n    | jki    kji    ijk    jik    kij   ikj
50   |  6.40   6.40   5.31   5.40  4.37  3.58
100  |  6.87   6.82   6.35   6.23  5.36  5.31
150  |  4.14   4.01   6.29   3.64  3.23  3.19
200  |  5.53   5.33   3.70   3.71  3.32  3.18
250  | 10.93  11.04   3.72   3.61  3.29  3.15
300  | 33.23  33.21   3.71   3.60  3.24  3.12
350  | 49.43  49.42   3.72   3.63  3.20  3.10
400  | 51.49  51.50   3.83   3.74  3.17  3.10
450  | 52.06  52.07   4.60   4.64  3.16  3.11
500  | 52.06  52.08   7.74   7.57  3.14  3.09
550  | 52.07  52.09  11.71  11.62  3.13  3.07
600  | 52.09  52.10  16.54  16.44  3.12  3.06
650  | 52.12  52.14  20.57  20.44  3.10  3.02
700  | 52.17  52.19  23.85  23.68  3.10  3.02
750  | 52.20  52.23  23.86  23.66  3.08  3.01

As n grows, jki/kji plateau near 52 cycles per iteration, ijk/jik near 24, while kij/ikj stay near 3 throughout.
Hints to software developers
Caching leverages locality; good locality makes good use of the cache
General principles to write programs with good locality
Focus your attention on inner loops, where the CPU spends most of the time
Try to maximize the spatial locality by reading data objects sequentially, with stride 1, in the order they are stored in memory
Try to maximize the temporal locality by using a data object as often as possible once it has been read from memory
Impact on HW design – memory hierarchy
Level         Access latency           Typical size
Registers     0 cycles                 ~100 B
L1 cache      1~4 cycles               16 KB
L2 cache      ~10 cycles               512 KB
Main memory   ~100 cycles              16 GB
SSD           ~10,000 cycles           1 TB
Hard disk     ~1,000,000 cycles        4 TB
Cloud drives  ~1,000,000,000 cycles
Memory hierarchy & caching
Main memory
SSD / Hard Disk
Cloud Drives
Size: smaller
Speed: faster
Price: higher
Conceptually, level K can be viewed as a cache of level K+1, storing a subset of data in level K+1
If caching policies are smartly designed, most of the time, cache accesses will be hit
The result behaves much like a memory system that works at the speed of the highest level but has the storage space of the lowest level, at a reasonably low price
The last piece – managing caches
Caches are small; how do we make good use of them?
The last piece – managing caches
Design considerations of caches
Block size
Bigger block size exploits spatial locality
If blocks are too big, they bring in data that will never be used, wasting both space and time
Who is in, who is out?
Replacement policy: if the cache is full and new data must come in, which data should be evicted?
Intuition: keep the data being used in the near future in cache
But, how do we know which data will be used in the near future?
The last piece – managing caches
Design considerations of caches
To efficiently find a data item in the cache
Searching for 1 data item among 1000 candidates takes too long!
Partition the cache into groups, map different data into different groups
Finding a data item within a smaller group is efficient, but cache space may not be fully utilized
A, D, G, J
B, E, H, K
C, F, I, L
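One simple way to obtain the grouping shown above is to map each item into a group by its index modulo the number of groups. This is a sketch of that idea (the mod-3 rule is an assumption that happens to reproduce the slide's A/D/G/J, B/E/H/K, C/F/I/L split):

```c
/* Map an item A..L into one of 3 groups by index mod 3, so a lookup
   only needs to search its own small group rather than the whole cache. */
int group_of(char item) {
    return (item - 'A') % 3;   /* A,D,G,J -> 0; B,E,H,K -> 1; C,F,I,L -> 2 */
}
```

Real caches use the same trick: some bits of the address select the group (set), and only that set is searched.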