
Lecture 07
Memory Hierarchy


COMP1411: Introduction to Computer Systems
Dr. Mingsong LYU (呂鳴松)
Department of Computing,
The Hong Kong Polytechnic University
Spring 2022
These slides are intended for internal use only. Do not publish them anywhere without permission.
Acknowledgement: These slides are based on the textbook (Computer Systems: A Programmer’s Perspective) and its slides.

Computer components revisited

[Figure: hardware organization of the computer: CPU (register file, bus interface), system bus, memory bus, I/O controllers, and expansion slots for other devices such as network adapters]

I/O devices allow the computer to interact with the outside world
The CPU and memory are the two core components that enable a modern computer to compute, much like our brain

LEARNING TARGETS

Get to know different memory hardware in a computer
Memory is a major performance bottleneck
Locality – Opportunity to improve memory performance
Caching – An important solution leveraging locality
Memory hierarchy – the reality of the memory system

So far, what we have
[Figure: executing ./hello: the program moves from the solid state drive to main memory (step 1), and its instructions, such as movq $1, %rax, are then fetched and executed by the CPU (step 2); each level has a different access latency]
movq 8(%rsi), %rax
Read the word at memory address %rsi + 8 into register %rax
Think about the pipelined CPU:
Both the latency of one instruction and the throughput of the CPU are significantly degraded

[Figure: pipeline diagram: instructions 1, 2, and 3 each stall while waiting for memory before reaching the write-back stage]

Access latency
Time used by CPU to execute one instruction
1 cycle for most instructions (1 GHz CPU, 1 cycle = 10⁻⁹ s)
Time used to fetch a word from main memory
10 ~ 100 cycles
Time used to fetch a block of data from disks
10,000 ~ 1,000,000 cycles

movq 8(%rsi), %rax
fread(buffer, 1024, 1, fp)
Most of the time, the CPU, the most precious resource, is waiting and doing nothing

Access latency
The gap between CPU, main memory and disks

[Chart: access time in ns vs. year (1985 to 2015) for disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time: the gap between CPU speed and memory/disk speed has kept widening]

How do we solve the problem that
a fast CPU is wasted waiting for slow memory?

Locality – Opportunity
Principle of locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

Temporal locality: recently referenced items (data, instructions) are likely to be referenced again in the near future

Spatial locality: items (data, instructions) with nearby addresses tend to be referenced close together in time

THINK: WHY DOES LOCALITY EXIST?

Examples of spatial locality
To compute the sum of all elements in a 2-D array
GOOD LOCALITY

Examples of spatial locality
To compute the sum of all elements in a 2-D array
BAD LOCALITY

Examples of temporal locality
Data references
The accesses to “sum” in the inner loop
Once accessed, it will be accessed again in the near future
Instruction references
The instructions that perform “sum += a[i][j]”
Once executed, they will be executed again in the near future

To understand locality for “data” and “instructions”
They are essentially the same, as instructions are special data stored in memory

To measure locality
Stride: the distance between two adjacent data accesses in memory, measured in units of one data element
Stride-1 reference pattern: access the data one by one according to their memory addresses, such as the good locality example

Stride-k reference pattern: adjacent accesses are k elements apart in memory; for example, the bad locality example (traversing a 4-column array column by column) generally has a stride-4 reference pattern

The smaller the stride, the better the locality

The idea of caching
A coke fanatic story

To execute an instruction that needs data A:
Is A in the cache?
Yes (hit): copy A from the cache to a register
No (miss): copy A from memory into the cache, then from the cache to a register
Execute the instruction
The coke fanatic story in a computer
Adding a small but fast memory inside the CPU

Cache hit: 1+ cycles to fetch data
Cache miss: 100+ cycles to fetch data

Caching example
When a program executes, it generates 20 memory accesses
ABCDCCCDDDEGHGHGHGHB

The unit of data loading is “one block”
One block contains two data variables
Cache size = 2 blocks
Data access time
Cache hit: 1 cycle
Cache miss: 200 cycles

Caching example
ABCDCCCDDDEGHGHGHGHB

Caching example

Performance with cache
Access time = 15 × 1 + 5 × 200 = 1015 cycles (15 hits, 5 misses)

Performance without cache
Access time = 200 * 20 = 4000 cycles
ABCDCCCDDDEGHGHGHGHB
MHMHHHHHHHMMHHHHHHHM

Exploiting locality in your programs
To do matrix multiplication, C = A * B
Each input array is an n*n array
Each element in the array is a double, i.e., 8 bytes
A, B, and C each have a cache of size 32 bytes (4 doubles)
n*8 is much larger than 32, so the cache is far from enough to hold a whole row/column of data

Let’s look at three different implementations

Exploiting locality in your programs

i = 0, j = 0, k = 0

Exploiting locality in your programs

i = 0, j = 0, k = 1

Exploiting locality in your programs

i = 0, j = 0, k = 2

Exploiting locality in your programs

i = 0, j = 0, k = 3

Exploiting locality in your programs

i = 0, j = 0, k = 4

Exploiting locality in your programs

For accesses to elements of A, only 1 out of 4 is a cache miss
Every access to an element of B is a cache miss
i = 0, j = 0, k = 7

Exploiting locality in your programs

i = 0, j = 0, k = 0

Exploiting locality in your programs

i = 1, j = 0, k = 0

Exploiting locality in your programs

i = 2, j = 0, k = 0

Exploiting locality in your programs

i = 7, j = 0, k = 0

Exploiting locality in your programs

i = 0, j = 0, k = 1
Every access to an element of A is a cache miss
Every access to an element of C is a cache miss

Exploiting locality in your programs

i = 0, j = 0, k = 0

i = 0, j = 1, k = 0
i = 0, j = 2, k = 0
i = 0, j = 3, k = 0

Exploiting locality in your programs

i = 1, j = 0, k = 0
i = 1, j = 1, k = 0
i = 1, j = 2, k = 0
i = 1, j = 3, k = 0

Exploiting locality in your programs

i = 2, j = (0, 1, 2, 3) k = 0
For accesses to elements of B, 1 out of 4 is a cache miss
For accesses to elements of C, 1 out of 4 is a cache miss

Exploiting locality in your programs

[Chart: cycles per inner-loop iteration vs. array size n (50 to 750) for the six loop orders: jki and kji degrade to about 52 cycles, ijk and jik to about 24 cycles, while kij and ikj stay around 3 cycles]

Hints to software developers
Caching leverages locality; good locality makes good use of the cache
General principles to write programs with good locality

Focus your attention on inner loops, where the CPU spends most of the time

Try to maximize the spatial locality by reading data objects sequentially, with stride 1, in the order they are stored in memory

Try to maximize the temporal locality by using a data object as often as possible once it has been read from memory

Impact on hardware design – memory hierarchy

Registers: ~0 cycles, ~100 B
L1 cache: 1~4 cycles, 16 KB
L2 cache: ~10 cycles, 512 KB
Main memory: ~100 cycles, 16 GB
SSD: ~10,000 cycles, 1 TB
Hard disk: ~1,000,000 cycles, 4 TB
Cloud drives: ~1,000,000,000 cycles

Memory hierarchy & caching

[Figure: memory hierarchy pyramid, from registers and caches down through main memory, SSD / hard disk, and cloud drives: higher levels are smaller, faster, and more expensive per byte]
Conceptually, level K can be viewed as a cache of level K+1, storing a subset of data in level K+1

If caching policies are smartly designed, most of the time, cache accesses will be hit

The result is pretty much a memory system that works at the speed of the highest level but has the storage space of the lowest level, at a reasonably low price

The last piece – managing caches
Caches are small; how do we make good use of them?

The last piece – managing caches
Design considerations of caches
Block size
A bigger block size exploits spatial locality
If too big, it brings in much data that will never be used, wasting space and time

Who is in, who is out?
Replacement policy: if the cache is full and new data must come in, which data do we evict?
Intuition: keep the data that will be used in the near future in the cache
But how do we know which data will be used in the near future?

The last piece – managing caches
Design considerations of caches
To efficiently find a data item in the cache
Searching for 1 item among 1000 candidates takes too long!
Partition the cache into groups, and map different data items to different groups
Finding a data item within a smaller group is efficient, but the space may not be fully utilized
A, D, G, J
B, E, H, K
C, F, I, L


