Lecture 07
Memory Hierarchy
COMP1411: Introduction to Computer Systems
Dr. Mingsong LYU (呂鳴松)
Department of Computing,
The Hong Kong Polytechnic University
Spring 2022
These slides are intended for internal use only. Do not publish them anywhere without permission.
Acknowledgement: These slides are based on the textbook (Computer Systems: A Programmer’s Perspective) and its slides.
Computer components revisited
Bus interface
Register file
System bus
Memory bus
controller
controller
Expansion slots for
other devices such
as network adapters
I/O devices allow the computer to interact with the outside world
The CPU and memory are the two core components that enable a modern computer to compute, much like our brain
LEARNING TARGETS
Get to know different memory hardware in a computer
Memories are major performance bottlenecks
Locality – Opportunity to improve memory performance
Caching – An important solution leveraging locality
Memory hierarchy – the reality of memory system
So far, what we have
$ ./hello (step 1)
$ ./hello (step 2)
movq $1, %rax
Solid State Drive
Main memory
Access latency
movq 8(%rsi), %rax
Read the word at the memory address given by 8(%rsi) into register %rax
Think about the pipelined CPU
Both latency for one instruction and throughput for the CPU are significantly degraded
Write back
Write back
Write back
Instruction 1
Instruction 2
Instruction 3
Access latency
Time used by CPU to execute one instruction
1 cycle for most instructions (1 GHz CPU, 1 cycle = 10^-9 s)
Time used to fetch a word from main memory
10 ~ 100 cycles
Time used to fetch a block of data from disks
10,000 ~ 1,000,000 cycles
movq 8(%rsi), %rax
fread(buffer, 1024, 1, fp)
Most of the time, the CPU, the most precious resource, is idle, waiting for data
Access latency
The gap between CPU, main memory and disks
[Reconstructed chart data] Access time (ns) by year:

Year                       1985        1990        1995       2000       2003       2005       2010       2015
Disk seek time       75,000,000  28,000,000  10,000,000  8,000,000  6,000,000  5,000,000  3,000,000  3,000,000
SSD access time      (only one value survived extraction: 50,000)
DRAM access time            200         100          70         60         55         50         40         20
SRAM access time            150          35          15          3        2.5          2        1.5        1.3
CPU cycle time              166          50           6        1.6        0.3        0.5        0.4       0.33
Effective CPU cycle time    (four surviving values: 0.3, 0.25, 0.1, 0.08)
Solve the problem:
how do we stop a fast CPU from being wasted waiting for slow memory?
Locality – Opportunity
Principle of locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality: recently referenced items (data, instructions) are likely to be referenced again in the near future
Spatial locality: items (data, instructions) with nearby addresses tend to be referenced close together in time
THINK: WHY DOES LOCALITY EXIST?
Examples of spatial locality
To compute the sum of all elements in a 2-D array
GOOD LOCALITY
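The code for this example did not survive the slide extraction. Below is a plausible sketch (the array size M x N and the function name sum_array_rows are assumptions) of a row-major, stride-1 traversal with good locality:

```c
#define M 4   /* rows (illustrative size, an assumption) */
#define N 4   /* columns */

/* Row-major traversal: a[i][j] is visited in exactly the order the
   elements sit in memory, a stride-1 reference pattern (good spatial
   locality). The scalar "sum" is reused on every iteration (good
   temporal locality). */
int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```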
Examples of spatial locality
To compute the sum of all elements in a 2-D array
BAD LOCALITY
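Again the code is missing from the extraction; a plausible sketch of the bad-locality version (same assumed M x N array) traverses column by column, so consecutive accesses are N elements apart, the stride-4 pattern mentioned later in the slides:

```c
#define M 4   /* rows (illustrative size, an assumption) */
#define N 4   /* columns */

/* Column-major traversal of a row-major C array: consecutive accesses
   a[0][j], a[1][j], ... are N elements apart in memory, a stride-N
   reference pattern (poor spatial locality). */
int sum_array_cols(int a[M][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

The result is identical to the row-major version; only the memory access order, and hence the cache behavior, differs.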
Examples of temporal locality
Data references
The access to “sum” in the inner loop
Once accessed, it will be accessed again in the near future
Instruction references
The instructions to do “sum += a[][]”
Once executed, they will be executed again in the near future
To understand locality for “data” and “instructions”
They are essentially the same, as instructions are special data stored in memory
To measure locality
Stride: the distance between two consecutive data accesses in memory, measured in units of one data element
Stride-1 reference pattern: access the data one by one according to their memory addresses, such as the good locality example
Stride-k reference pattern: for example, the bad locality example generally has a stride-4 reference pattern
The smaller the stride, the better the locality
The idea of caching
A coke fanatic story
To execute an instruction
I need data A
Is A in cache?
from cache to register
Copy A from memory to register
Execute instruction
from cache to register
The coke fanatic story in a computer
Adding a small but fast memory inside the CPU
1+ cycles to fetch data
Cache miss
100+ cycles to fetch data
Caching example
When a program executes, it generates 20 memory accesses
ABCDCCCDDDEGHGHGHGHB
The unit of data loading is “one block”
One block contains two data variables
Cache size = 2 blocks
Data access time
Cache hit: 1 cycle
Cache miss: 200 cycles
Caching example
ABCDCCCDDDEGHGHGHGHB
Caching example
Performance with cache
Access time = 1 * 15 + 5 * 200 = 1015 cycles
Performance without cache
Access time = 200 * 20 = 4000 cycles
ABCDCCCDDDEGHGHGHGHB
MHMHHHHHHHMMHHHHHHHM
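The hit/miss trace above can be reproduced with a short simulation. This is a sketch: the block mapping {A,B}, {C,D}, {E,F}, {G,H} and LRU replacement are assumptions that are consistent with the slide's result.

```c
#define CACHE_BLOCKS 2
#define HIT_CYCLES   1
#define MISS_CYCLES  200

/* Simulate the access trace; writes 'H'/'M' per access into pattern
   (a caller-supplied buffer) and returns the total access cycles. */
long simulate(const char *trace, char *pattern) {
    int cache[CACHE_BLOCKS];      /* block id held in each slot, -1 = empty */
    int last_use[CACHE_BLOCKS];   /* time of last access, for LRU eviction  */
    for (int i = 0; i < CACHE_BLOCKS; i++) { cache[i] = -1; last_use[i] = -1; }

    long cycles = 0;
    int t;
    for (t = 0; trace[t]; t++) {
        int block = (trace[t] - 'A') / 2;   /* two variables per block */
        int slot = -1;
        for (int i = 0; i < CACHE_BLOCKS; i++)
            if (cache[i] == block) slot = i;
        if (slot >= 0) {                     /* cache hit */
            pattern[t] = 'H';
            cycles += HIT_CYCLES;
        } else {                             /* miss: evict the LRU slot */
            pattern[t] = 'M';
            cycles += MISS_CYCLES;
            slot = 0;
            for (int i = 1; i < CACHE_BLOCKS; i++)
                if (last_use[i] < last_use[slot]) slot = i;
            cache[slot] = block;
        }
        last_use[slot] = t;
    }
    pattern[t] = '\0';
    return cycles;
}
```

Calling simulate("ABCDCCCDDDEGHGHGHGHB", pattern) produces the pattern MHMHHHHHHHMMHHHHHHHM and 1015 cycles, matching the slide's numbers.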
Exploiting locality in your programs
To do array multiplication, C = A * B
Each input array is an n*n array
Each element in the array is double, i.e., 8 bytes
A, B, and C, each has a cache of size 32 bytes
Since n*8 is much larger than 32 bytes, the cache is far too small to hold a whole row or column of data
Let’s look at three different implementations
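The three implementations did not survive the slide extraction. Below is a sketch of the standard ijk, jki, and kij orderings commonly used for this comparison (the flat row-major array layout and the function names are assumptions). All three compute C += A*B; only the inner-loop access patterns differ:

```c
/* ijk: inner loop scans a row of A (stride-1) but a column of B
   (stride-n), so every access to B tends to miss. */
void mm_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] += sum;
        }
}

/* jki: inner loop scans columns of both A and C (stride-n), the worst
   case for spatial locality. */
void mm_jki(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double r = B[k*n + j];
            for (int i = 0; i < n; i++)
                C[i*n + j] += A[i*n + k] * r;
        }
}

/* kij: inner loop scans rows of both B and C (stride-1), the best
   case for spatial locality. */
void mm_kij(int n, const double *A, const double *B, double *C) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += B[k*n + j] * r;
        }
}
```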
Exploiting locality in your programs
i = 0, j = 0, k = 0
Exploiting locality in your programs
i = 0, j = 0, k = 1
Exploiting locality in your programs
i = 0, j = 0, k = 2
Exploiting locality in your programs
i = 0, j = 0, k = 3
Exploiting locality in your programs
i = 0, j = 0, k = 4
Exploiting locality in your programs
For accesses to elements of A, only 1 out of every 4 is a cache miss
Every access to an element of B is a cache miss
i = 0, j = 0, k = 7
Exploiting locality in your programs
i = 0, j = 0, k = 0
Exploiting locality in your programs
i = 1, j = 0, k = 0
Exploiting locality in your programs
i = 2, j = 0, k = 0
Exploiting locality in your programs
i = 7, j = 0, k = 0
Exploiting locality in your programs
i = 0, j = 0, k = 1
Every access to an element of A causes a cache miss
Every access to an element of C causes a cache miss
Exploiting locality in your programs
i = 0, j = 0, k = 0
i = 0, j = 1, k = 0
i = 0, j = 2, k = 0
i = 0, j = 3, k = 0
Exploiting locality in your programs
i = 1, j = 0, k = 0
i = 1, j = 1, k = 0
i = 1, j = 2, k = 0
i = 1, j = 3, k = 0
Exploiting locality in your programs
i = 2, j = (0, 1, 2, 3) k = 0
For accesses to elements of B, 1 out of every 4 is a cache miss
For accesses to elements of C, 1 out of every 4 is a cache miss
Exploiting locality in your programs
[Reconstructed chart data] Cycles per inner loop iteration vs. array size n, for six loop orderings:

n    | jki    kji    ijk    jik    kij   ikj
50   |  6.40   6.40   5.31   5.40  4.37  3.58
100  |  6.87   6.82   6.35   6.23  5.36  5.31
150  |  4.14   4.01   6.29   3.64  3.23  3.19
200  |  5.53   5.33   3.70   3.71  3.32  3.18
250  | 10.93  11.04   3.72   3.61  3.29  3.15
300  | 33.23  33.21   3.71   3.60  3.24  3.12
350  | 49.43  49.42   3.72   3.63  3.20  3.10
400  | 51.49  51.50   3.83   3.74  3.17  3.10
450  | 52.06  52.07   4.60   4.64  3.16  3.11
500  | 52.06  52.08   7.74   7.57  3.14  3.09
550  | 52.07  52.09  11.71  11.62  3.13  3.07
600  | 52.09  52.10  16.54  16.44  3.12  3.06
650  | 52.12  52.14  20.57  20.44  3.10  3.02
700  | 52.17  52.19  23.85  23.68  3.10  3.02
750  | 52.20  52.23  23.86  23.66  3.08  3.01

As n grows, jki/kji plateau near 52 cycles per iteration, ijk/jik near 24, while kij/ikj stay near 3 throughout.
Hints to software developers
Caching leverages locality; good locality makes good use of the cache
General principles to write programs with good locality
Focus your attention on inner loops, where the CPU spends most of the time
Try to maximize the spatial locality by reading data objects sequentially, with stride 1, in the order they are stored in memory
Try to maximize the temporal locality by using a data object as often as possible once it has been read from memory
Impact on HW design – memory hierarchy
Level         Access latency           Typical size
Registers     0 cycles                 ~100 B
L1 cache      1~4 cycles               16 KB
L2 cache      ~10 cycles               512 KB
Main memory   ~100 cycles              16 GB
SSD           ~10,000 cycles           1 TB
Hard disk     ~1,000,000 cycles        4 TB
Cloud drives  ~1,000,000,000 cycles
Memory hierarchy & caching
Main memory
SSD / Hard Disk
Cloud Drives
Size: smaller
Speed: faster
Price: higher
Conceptually, level K can be viewed as a cache of level K+1, storing a subset of data in level K+1
If caching policies are smartly designed, most of the time, cache accesses will be hit
The result behaves much like a memory system that works at the speed of the highest level but has the storage space of the lowest level, at a reasonably low price
The last piece – managing caches
Caches are small; how do we make good use of them?
The last piece – managing caches
Design considerations of caches
Block size
Bigger block size exploits spatial locality
If blocks are too big, they bring in data that will never be used, wasting both space and time
Who is in, who is out?
Replacement policy: if the cache is full and new data must come in, which data should be evicted?
Intuition: keep the data being used in the near future in cache
But, how do we know which data will be used in the near future?
The last piece – managing caches
Design considerations of caches
To efficiently find a data item in the cache
Searching for 1 data item among 1000 candidates takes too long!
Partition the cache into groups, map different data into different groups
Finding a data item within a smaller group is efficient, but cache space may not be fully utilized
A, D, G, J
B, E, H, K
C, F, I, L
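One simple way to obtain the grouping shown above is to map each item into a group by its index modulo the number of groups. This is a sketch of that idea (the mod-3 rule is an assumption that happens to reproduce the slide's A/D/G/J, B/E/H/K, C/F/I/L split):

```c
/* Map an item A..L into one of 3 groups by index mod 3, so a lookup
   only needs to search its own small group rather than the whole cache. */
int group_of(char item) {
    return (item - 'A') % 3;   /* A,D,G,J -> 0; B,E,H,K -> 1; C,F,I,L -> 2 */
}
```

Real caches use the same trick: some bits of the address select the group (set), and only that set is searched.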