Optimizing Sequential Programs — Memory Hierarchy
Memory Hierarchy and Performance Optimizations
Presented by
Shuaiwen Leon Song
USYD Future System Architecture Lab (FSA)
https://shuaiwen-leon-song.github.io/
Slides are modified based on similar classes offered by UC Berkeley and ’s cs0330 at Brown
‹#›
Write a 2d Array
‹#›
Write a 2d array via single pointer
‹#›
Using an array of pointers
‹#›
4
Using pointer to a pointer
‹#›
5
Principle of Locality
Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
Temporal locality:
Recently referenced items are likely
to be referenced again in the near future
Spatial locality:
Items with nearby addresses tend
to be referenced close together in time
‹#›
6
Locality Example
Data references
Reference array elements in succession (stride-1 reference pattern).
Reference variable sum each iteration.
Instruction references
Reference instructions in sequence.
Cycle through loop repeatedly.
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;
Spatial locality
Temporal locality
Spatial locality
Temporal locality
‹#›
7
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
The gap between CPU and main memory speed is widening.
Well-written programs tend to exhibit good locality.
These fundamental properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
‹#›
An Example Memory Hierarchy
Registers
L1 cache
(SRAM)
Main memory
(DRAM)
Local secondary storage
(local disks)
Larger,
slower,
cheaper
per byte
Remote secondary storage
(tapes, distributed file systems, Web servers)
Local disks hold files retrieved from disks on remote network servers
Main memory holds disk blocks retrieved from local disks
L2 cache
(SRAM)
L1 cache holds cache lines retrieved from L2 cache
CPU registers hold words retrieved from L1 cache
L2 cache holds cache lines retrieved from main memory
L0:
L1:
L2:
L3:
L4:
L5:
Smaller,
faster,
more expensive
per byte
‹#›
static differentiates SRAM from DRAM (dynamic random-access memory) which must be periodically refreshed.
SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer's main memory.
In core i7 the line sizes in L1 , L2 and L3 are the same: that is 64 Bytes. I guess this simplifies maintaining the inclusive property, and coherence.
Non-volatile
SSD
nlike volatile memory, NVM does not require its memory data to be periodically refreshed. It is commonly used for secondary storage or long-term consistent storage.
Non-volatile memory typically refers to storage in semiconductor memory chips, which store data in floating-gate memory cells consisting of floating-gate MOSFETs (metal-oxide-semiconductor field-effect transistors), including flash memory storage such as NAND flash and solid-state drives (SSD), and ROM chips such as EPROM (erasable programmable ROM) and EEPROM (electrically erasable programmable ROM). It can also be classified as traditional non-volatile disk storage.
Memory Wall Problem
The "memory wall" is the growing disparity of speed between CPU and memory outside the CPU chip.
CPU speed improved at an annual rate of 55% while memory speed only improved at 10%.
CPU speed improvements slowed significantly partly due to major physical barriers and partly because current CPU designs have already hit the memory wall in some sense.
‹#›
10
An Example: Intel Nehalem Core i-7
Register access, ~1 cycle (0.376 ns)
L1 Cache access, ~4 cycles
L2 Cache access, ~10 cycles
L3 Cache access, line unshared, ~40 cycles
L3 Cache access, shared line in another core, ~65 cycles
L3 Cache access, modified in another core, ~75 cycles
DRAM access, ~150 cycles
Disk access, ~10,000 cycles
CPU is very fast: 4 or 8 operations per cycle
‹#›
DDR double data rate
QPI” intel quick path interconnect.
GT/s: giga transfers per second
Memory wall concept
An Example: Intel Nehalem Core i-7
Register access, ~1 cycle (0.376 ns)
L1 Cache access, ~4 cycles
L2 Cache access, ~10 cycles
L3 Cache access, line unshared, ~40 cycles
L3 Cache access, shared line in another core, ~65 cycles
L3 Cache access, modified in another core, ~75 cycles
DRAM access, ~150 cycles
Disk access, ~10,000 cycles
CPU is very fast: 4 or 8 operations per cycle
Based on idea of spatial and temporal data locality
Cache hit: data found in cache
Cache miss: data not found in cache and must be copied from a lower level
Compulsory miss: first reference miss
Capacity miss: cache runs out of room for new data
Conflict miss: many data items map to same location in cache
True sharing miss: Occurs when a thread in another processor wants the same data. Cure: Minimize sharing
False sharing miss: Occurs when another processor uses different data in the same cache line. Cure: Pad data.
‹#›
True sharing vs False Sharing
True sharing: true sharing, would require programmatic synchronization constructs to ensure ordered data access.
False sharing: The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.
‹#›
13
C = C + A * B
for ( i=0; i
}
Every array element a[…], b[…] is used twice within
Define 4 registers to replace a[…], 4 registers to replace b[…] within
Every array element c[…] is used n times in the k-loop
Define 4 registers to replace c[…] before the k-loop begin
c[i*n + j] = a[i*n + k]*b[k*n + j] + a[i*n + k+1]*b[(k+1)*n + j]
+ c[i*n + j]
c[(i+1)*n + j] = a[(i+1)*n + k]*b[k*n + j] + a[(i+1)*n + k+1]*b[(k+1)*n + j]
+ c[(i+1)*n + j]
c[i*n + (j+1)] = a[i*n + k]*b[k*n + (j+1)] + a[i*n + k+1]*b[(k+1)*n + (j+1)]
+ c[i*n + (j+1)]
c[(i+1)*n + (j+1)] = a[(i+1)*n + k]*b[k*n + (j+1)]
+ a[(i+1)*n + k+1]*b[(k+1)*n + (j+1)] + c[(i+1)*n + (j+1)]
‹#›
Exploit more aggressive register reuse
for(i = 0; i < n; i += 2)
for(j = 0; j < n; j += 2) {
register int t = i*n+j; register int tt = t+n;
register double c00 = c2[t]; register double c01 = c2[t+1]; register double c10 = c2[tt]; register double c11 = c2[tt+1];
for(k = 0; k < n; k += 2) {
/* 2 by 2 mini matrix multiplication using registers*/
register int ta = i*n+k; register int tta = ta+n; register int tb = k*n+j; register int ttb = tb+n;
register double a00 = a[ta]; register double a01 = a[ta+1]; register double a10 = a[tta]; register double a11 = a[tta+1];
register double b00 = b[tb]; register double b01 = b[tb+1]; register double b10 = b[ttb]; register double b11 = b[ttb+1];
c00 += a00*b00 + a01*b10;
c01 += a00*b01 + a01*b11;
c10 += a10*b00 + a11*b10;
c11 += a10*b01 + a11*b11;
}
c2[t] = c00;
c2[t+1] = c01;
c2[tt] = c10;
c2[tt+1] = c11;
}
Every array element a[…], b[…] is used twice
Define 4 registers to replace a[…], 4 registers to replace b[…] within
Every array element c[…] is used n times in the k-loop
Define 4 registers to replace c[…] before the k-loop begin
‹#›
Can We Use Less Than 12 Registers for 2X2 MM?
for(i = 0; i < n; i += 2)
for(j = 0; j < n; j += 2) {
register int t = i*n+j; register int tt = t+n;
register double c00 = c2[t]; register double c01 = c2[t+1]; register double c10 = c2[tt]; register double c11 = c2[tt+1];
for(k = 0; k < n; k += 2) {
/* 2 by 2 mini matrix multiplication using registers*/
register int ta = i*n+k; register int tta = ta+n; register int tb = k*n+j; register int ttb = tb+n;
register double a00 = a[ta]; register double a10 = a[tta]; register double b00 = b[tb]; register double b01 = b[tb+1];
c00 += a00*b00 ; c01 += a00*b01 ; c10 += a10*b00 ; c11 += a10*b01 ;
a00 = a[ta+1]; a10 = a[tta+1]; b00 = b[ttb]; b01 = b[ttb+1];
c00 += a00*b00 ; c01 += a00*b01 ; c10 += a10*b00 ; c11 += a10*b01 ;
}
c2[t] = c00;
c2[t+1] = c01;
c2[tt] = c10;
c2[tt+1] = c11;
}
‹#›
19
How to calculate time wasted in accessing operands that are not in register
First, cycle time:
1 cycle = 1/frequency
e.g., 5GHz = 5* 10^9 Hz, so 1 cycle time is 1/(5*10^9) sec.
Total time wasted in accessing operands that are not in registers?
(Total time – Total Floating point operations time) / Total Time
‹#›
20
Caches
Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work?
Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
‹#›
Cache Performance Metrics
Miss Rate
Fraction of memory references not found in cache (misses / accesses)
= 1 – hit rate
Typical numbers (in percentages):
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
Time to deliver a line in the cache to the processor
includes time to determine whether the line is in the cache
Typical numbers:
1-2 clock cycle for L1
5-20 clock cycles for L2
Miss Penalty
Additional time required because of a miss
typically 50-200 cycles for main memory (Trend: increasing!)
‹#›
Lets think about those numbers
Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
Consider:
cache hit time of 1 cycle
miss penalty of 100 cycles
Average access time:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
This is why “miss rate” is used instead of “hit rate”
effective-access-time = cache-access-time + miss-rate * miss-penalty
‹#›
Writing Cache Friendly Code
Make the common case go fast
Focus on the inner loops of the core functions
Minimize the misses in the inner loops
Repeated references to variables are good (temporal locality)
Stride-K reference patterns are good (spatial locality)
Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
‹#›
Miss Rate Analysis for Matrix Multiply
Assume:
Line size = 32Bytes (big enough for 4 64-bit double precision floating point numbers)
Matrix dimension (N) is very large
Approximate 1/N as 0.0
Cache is not even big enough to hold multiple rows
Analysis Method:
Look at access pattern of inner loop
A
k
i
B
k
j
C
i
j
‹#›
Matrix Multiplication Example
Description:
Multiply N x N matrices
O(N3) total operations
N reads per source element
N values summed per destination
but may be able to hold in register
/* ijk */
for (i=0; i
compulsory miss rate = 8 bytes / B
Stepping through rows in one column:
for (i = 0; i < n; i++)
sum += a[i][0];
accesses distant elements
no spatial locality!
compulsory miss rate = 1 (i.e. 100%)
‹#›
Matrix Multiplication (ijk)
/* ijk */
for (i=0; i