
Carnegie Mellon
The Memory Hierarchy
15-213/18-213/15-513: Introduction to Computer Systems
10th Lecture, June 9, 2020
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition

Today
 The memory abstraction
 RAM : main memory building block
 Storage technologies and trends
 Locality of reference
 The memory hierarchy

Writing & Reading Memory
 Write
▪ Transfer data from CPU to memory: movq %rax, 8(%rsp)
▪ “Store” operation
 Read
▪ Transfer data from memory to CPU: movq 8(%rsp), %rax
▪ “Load” operation
From 5th lecture
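A minimal sketch of how this looks from C (my example, not from the lecture; the commented assembly is one plausible translation, and a compiler may emit different forms):

long a[2] = {1, 2};

long swap_first(long v)
{
    long old = a[0];  /* load:  e.g., movq a(%rip), %rax */
    a[0] = v;         /* store: e.g., movq %rdi, a(%rip) */
    return old;
}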

Traditional Bus Structure Connecting CPU and Memory
 A bus is a collection of parallel wires that carry address, data, and control signals.
 Buses are typically shared by multiple devices.
[Figure: CPU chip (register file, ALU, bus interface) connected by the system bus to an I/O bridge, which connects by the memory bus to main memory]

Memory Read Transaction (1)
 CPU places address A on the memory bus.
[Figure: load operation movq A, %rax. The CPU's bus interface sends address A through the I/O bridge to main memory, which holds word x at address A]

Memory Read Transaction (2)
 Main memory reads A from the memory bus, retrieves word x, and places it on the bus.
[Figure: load operation movq A, %rax. Main memory places word x on the memory bus; it travels through the I/O bridge toward the CPU's bus interface]

Memory Read Transaction (3)
 CPU reads word x from the bus and copies it into register %rax.
[Figure: load operation movq A, %rax. Word x arrives at the bus interface and is copied into register %rax]

Memory Write Transaction (1)
 CPU places address A on bus. Main memory reads it and waits for the corresponding data word to arrive.
[Figure: store operation movq %rax, A. Register %rax holds y; address A travels through the I/O bridge to main memory]

Memory Write Transaction (2)
 CPU places data word y on the bus.
[Figure: store operation movq %rax, A. Data word y travels from the bus interface through the I/O bridge toward main memory]

Memory Write Transaction (3)
 Main memory reads data word y from the bus and stores it at address A.
[Figure: store operation movq %rax, A. Main memory stores y at address A]

Today
 The memory abstraction
 RAM : main memory building block
 Storage technologies and trends
 Locality of reference
 The memory hierarchy

Random-Access Memory (RAM)
 Key features
▪ RAM is traditionally packaged as a chip,
▪ or embedded as part of a processor chip.
▪ Basic storage unit is normally a cell (one bit per cell).
▪ Multiple RAM chips form a memory.
 RAM comes in two varieties:
▪ SRAM (Static RAM)
▪ DRAM (Dynamic RAM)

RAM Technologies
 DRAM
▪ 1 transistor + 1 capacitor per bit
▪ Capacitor oriented vertically
▪ Must refresh state periodically
 SRAM
▪ 6 transistors per bit
▪ Holds state indefinitely

SRAM vs DRAM Summary
        Trans. per bit  Access time  Needs refresh?  Needs EDC?  Cost   Applications
SRAM    6 or 8          1x           No              Maybe       100x   Cache memories
DRAM    1               10x          Yes             Yes         1x     Main memories, frame buffers

EDC: Error detection and correction
 Trends
▪ SRAM scales with semiconductor technology
▪ Reaching its limits
▪ DRAM scaling limited by need for minimum capacitance
▪ Aspect ratio limits how deep the capacitor can be made
▪ Also reaching its limits

Enhanced DRAMs
 Operation of DRAM cell has not changed since its invention
▪ Commercialized by Intel in 1970.
 DRAM cores with better interface logic and faster I/O:
▪ Synchronous DRAM (SDRAM)
▪ Uses a conventional clock signal instead of asynchronous control
▪ Double data-rate synchronous DRAM (DDR SDRAM)
▪ Double edge clocking sends two bits per cycle per pin
▪ Different types distinguished by size of small prefetch buffer:
– DDR (2 bits), DDR2 (4 bits), DDR3 (8 bits), DDR4 (16 bits)
▪ By 2010, standard for most server and desktop systems
▪ Intel Core i7 supports DDR3 and DDR4 SDRAM

Conventional DRAM Organization
 d x w DRAM:
▪ d ⋅ w total bits organized as d supercells of size w bits
[Figure: a 16 x 8 DRAM chip. The memory controller (to/from the CPU) sends 2-bit row and column addresses; the chip is organized as a 4 x 4 array of 8-bit supercells with an internal row buffer, and 8 data pins carry one supercell at a time. Supercell (2,1) is highlighted]

Reading DRAM Supercell (2,1)
Step 1(a): Row access strobe (RAS) selects row 2.
Step 1(b): Row 2 copied from DRAM array to row buffer.
[Figure: the memory controller drives RAS = 2 on the 2-bit address pins; the entire contents of row 2 are copied from the 4 x 4 supercell array into the internal row buffer]

Reading DRAM Supercell (2,1)
Step 2(a): Column access strobe (CAS) selects column 1.
Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually back to the CPU.
Step 3: All data written back to the row to provide refresh.
[Figure: the memory controller drives CAS = 1; supercell (2,1) moves from the internal row buffer onto the 8 data pins and back to the CPU]
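A toy sketch of the two-step addressing (my illustration, assuming the 4 x 4 supercell array above): the controller splits a linear supercell index into a row for RAS and a column for CAS.

#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void)
{
    int supercell = 2 * COLS + 1;  /* supercell (2,1) as a linear index */
    int ras = supercell / COLS;    /* row access strobe selects row 2 */
    int cas = supercell % COLS;    /* column access strobe selects column 1 */
    printf("RAS = %d, CAS = %d\n", ras, cas);
    return 0;
}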

Memory Modules
[Figure: a 64 MB memory module built from eight 8M x 8 DRAM chips (DRAM 0 through DRAM 7). To fetch the 64-bit word at main memory address A, the memory controller broadcasts the same (row = i, col = j) address to all eight chips; each returns its supercell (i,j), with DRAM 0 supplying bits 0-7, DRAM 1 bits 8-15, and so on up to DRAM 7 supplying bits 56-63, and the eight bytes are assembled into one 64-bit word]
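In software terms, the assembly of the word looks like this sketch (mine; the real assembly happens in the memory controller hardware):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t chip[8] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88};
    uint64_t word = 0;
    for (int k = 0; k < 8; k++)
        word |= (uint64_t)chip[k] << (8 * k);  /* DRAM k supplies bits 8k..8k+7 */
    printf("64-bit word = 0x%016llx\n", (unsigned long long)word);
    return 0;
}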

Today
 The memory abstraction
 RAM : main memory building block
 Storage technologies and trends
 Locality of reference
 The memory hierarchy

Storage Technologies
 Magnetic Disks
▪ Store on magnetic medium
▪ Electromechanical access
 Nonvolatile (Flash) Memory
▪ Store as persistent charge
▪ Implemented with 3-D structure
▪ 100+ levels of cells
▪ 3 bits data per cell

What’s Inside A Disk Drive?
[Figure: photo of an opened disk drive showing the arm, actuator, spindle, platters, SCSI connector, and electronics (including a processor and memory!)]
Image courtesy of Seagate Technology

Disk Geometry
 Disks consist of platters, each with two surfaces.
 Each surface consists of concentric rings called tracks.
 Each track consists of sectors separated by gaps.
[Figure: one surface with concentric tracks around the spindle; track k is divided into sectors separated by gaps]

Disk Capacity
 Capacity: maximum number of bits that can be stored.
▪ Vendors express capacity in units of gigabytes (GB) or terabytes (TB), where 1 GB = 10^9 bytes and 1 TB = 10^12 bytes
 Capacity is determined by these technology factors:
▪ Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
▪ Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
▪ Areal density (bits/in^2): product of recording density and track density.
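As a quick worked example (hypothetical drive geometry, numbers mine), capacity follows by multiplying the per-sector size out over the geometry:

#include <stdio.h>

int main(void)
{
    double bytes_per_sector   = 512;
    double sectors_per_track  = 400;     /* average */
    double tracks_per_surface = 100000;
    double surfaces           = 6;       /* 3 platters x 2 surfaces */
    double bytes = bytes_per_sector * sectors_per_track
                 * tracks_per_surface * surfaces;
    printf("capacity = %.1f GB (1 GB = 10^9 bytes)\n", bytes / 1e9);
    return 0;
}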

Disk Operation (Single-Platter View)
The disk surface spins at a fixed rotational rate.
The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
By moving radially, the arm can position the read/write head over any track.
[Figure: a single platter spinning about its spindle, with the arm positioning the head over a track]

Disk Operation (Multi-Platter View)
Read/write heads move in unison from cylinder to cylinder.
[Figure: several platters stacked on one spindle, with one arm per surface]

Disk Access – Service Time Components
[Figure: service-time timeline. After the BLUE read's data transfer, the head seeks for RED, waits out the rotational latency, and then performs the RED data transfer]

Disk Access Time
 Average time to access some target sector approximated by:
▪ Taccess = Tavg seek + Tavg rotation + Tavg transfer
 Seek time (Tavg seek)
▪ Time to position heads over cylinder containing target sector.
▪ Typical Tavg seek is 3-9 ms
 Rotational latency (Tavg rotation)
▪ Time waiting for first bit of target sector to pass under r/w head.
▪ Tavg rotation = 1/2 x 1/RPM x 60 secs/1 min
▪ Typical rotational rate = 7,200 RPM
 Transfer time (Tavg transfer)
▪ Time to read the bits in the target sector.
▪ Tavg transfer = 1/RPM x 1/(avg # sectors/track) x 60 secs/1 min
(1/RPM is the time for one rotation, in minutes; 1/(avg # sectors/track) is the fraction of a rotation to be read)

Disk Access Time Example
 Given:
▪ Rotational rate = 7,200 RPM
▪ Average seek time = 9 ms
▪ Avg # sectors/track = 400
 Derived:
▪ Tavg rotation = 1/2 x (60 secs/7200 RPM) x 1000 ms/sec = 4 ms
▪ Tavg transfer = 60/7200 x 1/400 x 1000 ms/sec = 0.02 ms
▪ Taccess = 9 ms + 4 ms + 0.02 ms
 Important points:
▪ Access time dominated by seek time and rotational latency.
▪ First bit in a sector is the most expensive, the rest are free.
▪ SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
▪ Disk is about 40,000 times slower than SRAM,
▪ 2,500 times slower than DRAM.
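The same arithmetic as a quick check (a sketch using the slide's numbers):

#include <stdio.h>

int main(void)
{
    double rpm = 7200, avg_seek_ms = 9, sectors_per_track = 400;
    double t_rot  = 0.5 * (60.0 / rpm) * 1000;               /* about 4 ms */
    double t_xfer = (60.0 / rpm) / sectors_per_track * 1000; /* about 0.02 ms */
    printf("Taccess = %.2f ms\n", avg_seek_ms + t_rot + t_xfer);
    return 0;
}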

I/O Bus
[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory. The bridge also drives an I/O bus shared by a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters]

Reading a Disk Sector (1)
CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with disk controller.
[Figure: the command, logical block number, and destination address travel from the CPU over the I/O bus to the disk controller]

Reading a Disk Sector (2)
Disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
[Figure: the sector travels from the disk through the disk controller and over the I/O bus directly into main memory]

Reading a Disk Sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special “interrupt” pin on the CPU).
[Figure: the interrupt signal travels from the disk controller to the CPU chip]

Nonvolatile Memories
 DRAM and SRAM are volatile memories
▪ Lose information if powered off.
 Nonvolatile memories retain value even if powered off
▪ Read-only memory (ROM): programmed during production
▪ Electrically erasable PROM (EEPROM): electronic erase capability
▪ Flash memory: EEPROMs with partial (block-level) erase capability
▪ Wears out after about 100,000 erasings
▪ 3D XPoint (Intel Optane) & emerging NVMs
▪ New materials
 Uses for Nonvolatile Memories
▪ Firmware programs stored in a ROM (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems,…)
▪ Solid state disks (replacing rotating disks)
▪ Disk caches

Solid State Disks (SSDs)
[Figure: a Solid State Disk (SSD) attached to the I/O bus. Requests to read and write logical disk blocks arrive at a flash translation layer, which sits in front of a DRAM buffer and flash memory organized as blocks 0 to B-1, each holding pages 0 to P-1]
 Pages: 512 bytes to 4 KB; blocks: 32 to 128 pages
 Data read/written in units of pages.
 Page can be written only after its block has been erased.
 A block wears out after about 100,000 repeated writes.
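A toy sketch of what a flash translation layer does (my illustration, not any vendor's implementation): it remaps each logical page to whichever physical page currently holds it, so a small rewrite can go to a fresh, already-erased page instead of erasing a whole block in place.

#include <stdio.h>

#define BLOCKS 4
#define PAGES  8  /* pages per block */

static int map[BLOCKS * PAGES];  /* logical page -> physical page; -1 = unmapped */

int main(void)
{
    for (int i = 0; i < BLOCKS * PAGES; i++)
        map[i] = -1;
    int next_free = 0;           /* next erased physical page */
    map[5] = next_free++;        /* first write of logical page 5 */
    map[5] = next_free++;        /* rewrite lands on a fresh page; the stale
                                    page is reclaimed when its block is erased */
    printf("logical 5 -> block %d, page %d\n", map[5] / PAGES, map[5] % PAGES);
    return 0;
}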

SSD Performance Characteristics
 Benchmark of Samsung 970 EVO Plus
https://ssd.userbenchmark.com/SpeedTest/711305/Samsung-SSD-970-EVO-Plus-250GB

Sequential read throughput    2,126 MB/s
Sequential write throughput   1,880 MB/s
Random read throughput          140 MB/s
Random write throughput          59 MB/s
 Sequential access faster than random access
▪ Common theme in the memory hierarchy
 Random writes are somewhat slower
▪ Erasing a block takes a long time (~1 ms).
▪ Modifying a page requires all other pages in its block to be copied to a new block.
▪ Flash translation layer allows a series of small writes to accumulate before doing a block write.

SSD Tradeoffs vs Rotating Disks
 Advantages
▪ No moving parts → faster, less power, more rugged
 Disadvantages
▪ Have the potential to wear out
▪ Mitigated by “wear leveling logic” in flash translation layer
▪ E.g., the Samsung 970 EVO Plus guarantees 600 writes per byte before wear-out
▪ Controller migrates data to minimize wear level
▪ In 2019, about 4 times more expensive per byte
▪ And, relative cost will keep dropping
 Applications
▪ MP3 players, smart phones, laptops
▪ Increasingly common in desktops and servers

Today
 The memory abstraction
 DRAM : main memory building block
 Storage technologies and trends
 Locality of reference
 The memory hierarchy

The CPU-Memory Gap
The gap between DRAM, disk, and CPU speeds.
[Figure: log-scale plot of time (ns) versus year, 1985-2015, with curves for disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time; disk and DRAM improve far more slowly than the CPU, so the gap widens]

Locality to the Rescue!
The key to bridging this CPU-Memory gap is a fundamental
property of computer programs known as locality.

Locality
 Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
 Temporal locality:
▪ Recently referenced items are likely to be referenced again in the near future
 Spatial locality:
▪ Items with nearby addresses tend to be referenced close together in time

Locality Example
 Data references
▪ Reference array elements in succession (stride-1 reference pattern). → spatial locality
▪ Reference variable sum each iteration. → temporal locality
 Instruction references
▪ Reference instructions in sequence. → spatial locality
▪ Cycle through loop repeatedly. → temporal locality
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Qualitative Estimates of Locality
 Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
 Question: Does this function have good locality with respect to array a? Hint: array layout is row-major order. Answer: yes

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Layout: a[0][0] ••• a[0][N-1], a[1][0] ••• a[1][N-1], •••, a[M-1][0] ••• a[M-1][N-1]

Locality Example
 Question: Does this function have good locality with respect to array a?

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Answer: no, unless... M is very small

Locality Example
 Question: Can you permute the loops so that the function scans the 3-d array a with a stride-1 reference pattern (and thus has good spatial locality)?

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

Answer: make j the inner loop
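To make the stride difference concrete, here is a small harness (mine, not from the lecture) that prints the byte offsets each scan touches; it assumes 4-byte ints:

#include <stdio.h>
#include <stddef.h>

#define M 3
#define N 4

int main(void)
{
    int a[M][N];

    /* Row-wise scan: offsets 0, 4, 8, ... (stride-1 in elements) */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            printf("%zu ", (size_t)((char *)&a[i][j] - (char *)a));
    printf("\n");

    /* Column-wise scan: offsets 0, 16, 32, ... (stride N) */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            printf("%zu ", (size_t)((char *)&a[i][j] - (char *)a));
    printf("\n");
    return 0;
}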
Today
 The memory abstraction
 DRAM : main memory building block
 Storage technologies and trends
 Locality of reference
 The memory hierarchy

Memory Hierarchies
 Some fundamental and enduring properties of hardware and software:
▪ Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
▪ The gap between CPU and main memory speed is widening.
▪ Well-written programs tend to exhibit good locality.
 These fundamental properties complement each other beautifully.
 They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Example Memory Hierarchy
L0: Regs. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). L2 cache holds cache lines retrieved from L3 cache.
L3: L3 cache (SRAM). L3 cache holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers)
Devices toward the top are smaller, faster, and costlier (per byte); devices toward the bottom are larger, slower, and cheaper (per byte).

Caches
 Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
 Fundamental idea of a memory hierarchy:
▪ For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
 Why do memory hierarchies work?
▪ Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
▪ Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
 Big Idea (Ideal): The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

General Cache Concepts
[Figure: a cache of four blocks (e.g., 8, 9, 14, 3) sitting above a memory partitioned into blocks 0-15; block-sized chunks move between the two]
Cache: Smaller, faster, more expensive memory; caches a subset of the blocks.
Data is copied in block-sized transfer units.
Memory: Larger, slower, cheaper memory viewed as partitioned into “blocks”.

General Cache Concepts: Hit
Request: 14. Data in block b is needed. Block b is in cache: Hit!
[Figure: the cache holds blocks 8, 9, 14, 3, so the request for block 14 is served directly from the cache]

General Cache Concepts: Miss
Request: 12. Data in block b is needed. Block b is not in cache: Miss!
Block b is fetched from memory.
Block b is stored in cache.
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
[Figure: block 12 is fetched from memory and stored in the cache, replacing block 9]

General Caching Concepts: 3 Types of Cache Misses
 Cold (compulsory) miss
▪ Cold misses occur because the cache starts empty and this is the first reference to the block.
 Capacity miss
▪ Occurs when the set of active cache blocks (working set) is larger than the cache.
 Conflict miss
▪ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
▪ E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
▪ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
▪ E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
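A sketch of the (i mod 4) example (mine, not from the lecture): simulating a 4-block direct-mapped cache on the reference string 0, 8, 0, 8, ... shows every access missing, even though only two blocks are ever used.

#include <stdio.h>

#define CACHE_BLOCKS 4

int main(void)
{
    int cache[CACHE_BLOCKS] = {-1, -1, -1, -1};  /* block held per slot; -1 = empty */
    int refs[] = {0, 8, 0, 8, 0, 8};
    int misses = 0;
    for (int i = 0; i < 6; i++) {
        int b = refs[i];
        int slot = b % CACHE_BLOCKS;  /* placement: block i goes to slot i mod 4 */
        if (cache[slot] != b) {       /* 0 and 8 collide in slot 0 */
            misses++;
            cache[slot] = b;          /* evict the current occupant */
        }
    }
    printf("%d misses out of 6 references\n", misses);  /* prints 6 */
    return 0;
}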
Examples of Caching in the Mem. Hierarchy

Cache Type            What is Cached?       Where is it Cached?   Latency (cycles)  Managed By
Registers             4-8 byte words        CPU core                           0    Compiler
TLB                   Address translations  On-Chip TLB                        0    Hardware MMU
L1 cache              64-byte blocks        On-Chip L1                         4    Hardware
L2 cache              64-byte blocks        On-Chip L2                        10    Hardware
Virtual Memory        4-KB pages            Main memory                      100    Hardware + OS
Buffer cache          Parts of files        Main memory                      100    OS
Disk cache            Disk sectors          Disk controller              100,000    Disk firmware
Network buffer cache  Parts of files        Local disk                10,000,000    NFS client
Browser cache         Web pages             Local disk                10,000,000    Web browser
Web cache             Web pages             Remote server disks    1,000,000,000    Web proxy server

Summary
 The speed gap between CPU, memory, and mass storage continues to widen.
 Well-written programs exhibit a property called locality.
 Memory hierarchies based on caching close the gap by exploiting locality.
 Flash memory progress outpacing all other memory and storage technologies (DRAM, SRAM, magnetic disk)
▪ Able to stack cells in three dimensions

Supplemental slides

Storage Trends

SRAM
Metric             1985   1990   1995   2000   2005   2010   2015   2015:1985
$/MB              2,900    320    256    100     75     60     25         116
access (ns)         150     35     15      3      2    1.5    1.3         115

DRAM
Metric             1985   1990   1995   2000   2005   2010    2015   2015:1985
$/MB                880    100     30      1    0.1   0.06    0.02      44,000
access (ns)         200    100     70     60     50     40      20          10
typical size (MB) 0.256      4     16     64  2,000  8,000  16,000      62,500

Disk
Metric              1985   1990   1995   2000   2005   2010   2015   2015:1985
$/GB             100,000  8,000    300     10      5    0.3   0.03   3,333,333
access (ms)           75     28     10      8      5      3      3          25
typical size (GB)   0.01   0.16      1     20    160  1,500  3,000     300,000

CPU Clock Rates
Inflection point in computer history when designers hit the “Power Wall”

                           1985   1990     1995   2003    2005        2010        2015   2015:1985
CPU                       80286  80386  Pentium    P-4  Core 2  Core i7(n)  Core i7(h)
Clock rate (MHz)              6     20      150  3,300   2,000       2,500       3,000         500
Cycle time (ns)             166     50        6   0.30    0.50         0.4        0.33         500
Cores                         1      1        1      1       2           4           4           4
Effective cycle time (ns)   166     50        6   0.30    0.25        0.10        0.08       2,075

(n) Nehalem processor
(h) Haswell processor