
CS3350B Computer Organization Chapter 1: CPU and Memory Part 2: The Memory Hierarchy
Alex Brandt
Department of Computer Science, University of Western Ontario, Canada
Monday January 18, 2021

Recap: CPU Time
CPU Time = Instruction_count × CPI × clock_cycle
                         Instruction_count   CPI   clock_cycle
  Algorithm                      X            X
  Programming language           X            X
  Compiler                       X            X
  ISA                            X            X         X
  Processor organization                      X         X
From a programmer's point of view, CPU performance depends on:
- CPU frequency
- The type of instructions performed
- Memory access time
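Plugging illustrative numbers into the formula above (my numbers, not from the slides): a program executing 2 × 10^9 instructions with an average CPI of 1.5 on a 2 GHz clock (clock_cycle = 0.5 ns) takes

    CPU Time = (2 × 10^9) × 1.5 × (0.5 ns) = 1.5 seconds.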

Recap: Processor-Memory Gap
The Processor-Memory Gap is a key contributor to the Memory Wall – the point where a program’s performance is totally determined by memory speed. Faster processors will not make programs run faster!

Recap: Memory Wall Example
    void copymatrix1(int** src, int** dst, int n) {
        int i, j;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                dst[i][j] = src[i][j];
    }

    void copymatrix2(int** src, int** dst, int n) {
        int i, j;
        for (j = 0; j < n; j++)
            for (i = 0; i < n; i++)
                dst[i][j] = src[i][j];
    }
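With C's row-major layout, copymatrix1 walks memory stride-1 while copymatrix2 jumps n ints between consecutive accesses, so the latter misses in cache far more often once a row no longer fits. A minimal timing harness to observe this yourself (a sketch; the 4096×4096 size, calloc allocation, and clock()-based timing are my choices, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Assumes copymatrix1 and copymatrix2 from above are pasted in here. */

    static int** alloc_matrix(int n) {
        int** m = malloc(n * sizeof *m);
        for (int i = 0; i < n; i++)
            m[i] = calloc(n, sizeof **m);   /* zero-initialized rows */
        return m;
    }

    static double time_copy(void (*copy)(int**, int**, int),
                            int** src, int** dst, int n) {
        clock_t start = clock();
        copy(src, dst, n);
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        int n = 4096;
        int** src = alloc_matrix(n);
        int** dst = alloc_matrix(n);
        printf("row-wise (copymatrix1): %.3f s\n", time_copy(copymatrix1, src, dst, n));
        printf("col-wise (copymatrix2): %.3f s\n", time_copy(copymatrix2, src, dst, n));
        return 0;
    }

On a typical machine the column-wise version is noticeably slower once n is large enough that a single row of the matrix no longer fits in cache.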
Cache misses are the reason for the performance drop.

Outline

1 History, Trends, and Basics
2 The Principle of Locality
3 The Cache Hierarchy
4 Cache and Memory Performance
5 Locality and Cache

Von Neumann Architecture

A simple shared memory for both instructions and data.
From a software's point of view this model is still very much true. Hardware takes care of all of the sophistication.

Before the Hierarchy

In the 1950s, 60s, and 70s, software development was mostly mathematical and scientific. The heyday of FORTRAN.
Computing resources were very limited.
- As late as the 1980s, memory was only 64 megabytes.
Algorithm development focused on minimizing the amount of memory a program used in order to solve larger problems.
- Sparse mathematics, linear algebra, symbolic computations.
- Minimizing storage of numerical types: char/INT1 (1 byte), short/INT2 (2 bytes), int/INT4 (4 bytes), ...

Current Memory Resources

Historically, programmers were concerned with how much memory was used.
- Speeds of processor and memory were about the same.
- Memory access time was roughly the same as arithmetic time.
Today, we are far more concerned with how memory is used.
Memory size is no longer the limiting factor:
- Some laptops => up to 64 gigabytes.
- Compute clusters => hundreds of gigabytes per processor.
The Memory Wall (memory speed) is now the limiting factor.
The Memory Hierarchy was introduced to tackle the Memory Wall and limit the effects of the Processor-Memory Gap.

The Processor-Memory Gap

[Figure: the increasing gap between processor speed and DRAM speed over time; note the log scale. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach.]

Components of a Computer: Memory

At first, cache was a single, simple entity. Around the 1990s, multi-level caches began to appear.

Memory Hierarchy

[Figure: memory hierarchy diagram.]

But Why a Hierarchy?

Enduring properties of hardware and software:
- Fast storage technologies (SRAM) cost more per byte.
- Fast storage technologies take up more physical space per byte.
- The gap between CPU and main memory (DRAM) speed is widening.
- The Principle of Locality.
Whole sections of memory are copied from lower levels of the hierarchy to higher levels. Assuming locality, this transfer is done infrequently and computations can occur continuously by referencing only data high in the hierarchy.
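For rough context (order-of-magnitude figures commonly cited, not from the slides): registers are reachable in about a cycle, L1 cache in a few cycles, L2 in roughly 10-20, L3 in roughly 30-60, and DRAM in 100-300 cycles. This is why keeping the working set high in the hierarchy matters so much.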
The Principle of Locality

"Programs tend to reuse data and instructions they have recently used."
- Hennessy and Patterson, Computer Architecture.

During a short time range, a program is likely to access only a relatively small portion of:
- the available address space,
- the available program code.

This principle is a driving factor in computer architecture and design, always in an effort to improve performance. Locality directs hardware design. If your program breaks this principle then it will not use the hardware properly.

Locality in Time and Space

Temporal Locality: Recently accessed items are likely to be accessed again in the near future.
Spatial Locality: Recently accessed items are likely to have their adjacent (or nearby) items accessed in the near future.
Note: items can mean data memory addresses or instructions.

Locality in Instructions

Appreciating locality for instructions is more difficult than for memory addresses. What programming constructs lead to locality for instructions?

Temporal?
- Loops
- Repeated calls to the same method

Spatial?
- Loops
- Repeated calls to the same method
- Sequential operation, no unconditional jumps! (Unless inlined, method calls can act like unconditional jumps.)

Note: Continually using the same type of instruction is not locality. It must be the exact same instruction after compilation (same instruction type and operands). A sketch of these constructs in C follows.
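As a small illustration (my example, not from the slides) of where instruction locality comes from in C:

    #include <stdio.h>

    /* The body of this function is one fixed block of compiled
     * instructions; calling it repeatedly re-executes those exact
     * instructions (temporal locality for instructions). */
    static int square(int x) {
        return x * x;
    }

    int main(void) {
        int total = 0;
        /* The loop body is a short run of instructions at consecutive
         * addresses (spatial locality), executed 1000 times over
         * (temporal locality). The only jumps are the loop's back-edge
         * and the call/return to square. */
        for (int i = 0; i < 1000; i++) {
            total += square(i);
        }
        printf("%d\n", total);
        return 0;
    }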
Locality in Data

Data locality is easier to understand:
- Repeated access to the same variable.
- Sequential access to an array.
- Initializing a variable just before it is used.
- Re-using variables that have finished being useful instead of initializing new ones.
  - When a new variable is created, it must be loaded into cache. Why not re-use existing variables?

Prime Example: Stride-1 access pattern

    int arr[n];
    for (int i = 0; i < n; ++i) {
        printf("%d", arr[i]);
    }

Aside: Layout of C Arrays in Memory

C arrays are allocated in row-major order.
- Each row is in contiguous memory locations.
- a[i][j] <=> a[i * ncols + j]

    [ 0 1 2 ]
    [ 3 4 5 ]   =>   0,1,2,3,4,5,6,7,8
    [ 6 7 8 ]

FORTRAN and Matlab arrays are allocated in column-major order.
- Each column is in contiguous memory locations.

    [ 1 2 3 ]
    [ 4 5 6 ]   =>   1,4,7,2,5,8,3,6,9
    [ 7 8 9 ]

Locality Exercises: Warm-up

Does this function in C have good locality? If yes, which type?

    int sumArray(int* a, int n) {
        int i, sum = 0;
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

Yes: temporal locality on sum and i (referenced on every iteration) and spatial locality on a (stride-1 access).
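To set up the comparison that follows, a sketch of the two 2D access patterns being analyzed (the N = 1024 array and variable names are my framing, not from the slides):

    #include <stdio.h>

    #define N 1024
    static int a[N][N];   /* row-major, zero-initialized */

    int main(void) {
        int i, sum_row = 0, sum_col = 0;

        /* One row: successive addresses, stride-1. Only one cold miss
         * per cache block -> good spatial locality. */
        for (i = 0; i < N; i++)
            sum_row += a[0][i];

        /* One column: consecutive accesses are N ints apart, so each
         * access lands in a different cache block -> no spatial locality. */
        for (i = 0; i < N; i++)
            sum_col += a[i][0];

        printf("%d %d\n", sum_row, sum_col);
        return 0;
    }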
Stepping through columns in one row:

    for (i = 0; i < N; i++)
        sum += a[0][i];

- Accesses successive elements.
- If the cache block size B > k bytes (k = element size), this exploits spatial locality.
- Cold miss rate = k bytes / B.
- Typically k = 4 or 8 (32/64 bits) and B = 8k or B = 16k.
Stepping through rows in one column:
    for (i = 0; i < N; i++)
        sum += a[i][0];

- Accesses distant elements (assuming the number of columns in a is large).
- No spatial locality!
- Cold miss rate = 1 (i.e. 100%).

Advanced Spatial Locality Example

Calculate the number of cache hits and cache misses given:
- a has row-major INT4 elements and is cache-aligned.
- Cache capacity of 4 kilobytes; cache block is 32 bytes.
- All local variables are stored in registers, not cache.

    int i, sum = 0, N = 16384;
    for (i = 0; i < N; i++)
        sum += a[i];

Solution:
- Stride-1 access to a.
- Since a cache block is 32 bytes, each block holds 32/4 = 8 ints.
- One cold miss for each block: (1/8) × 16384 = 2048 cache misses.
- 16384 - 2048 = 14336 cache hits.

Summary

We want a large, cheap, fast memory.
- Unreasonable in size/cost for everything to be SRAM (L1).
Solution: the Memory Hierarchy.
- Successively lower levels contain the "most used" data from the next higher level.
- Exploits temporal & spatial locality of programs.
- Great Idea #3: Do the common case fast, worry less about the exceptions (RISC design principle).
Challenge to the programmer:
- Develop cache-friendly (efficient) programs.

Extra Reading

Being able to look at code and have a qualitative sense of its locality is a key skill for programmers.

Some projects driven by data locality (and other features):
- BLAS: http://www.netlib.org/blas/
- FFTW, by Matteo Frigo and Steven G. Johnson: http://www.fftw.org/
- M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran, Cache-Oblivious Algorithms, 1999.

Wikipedia is surprisingly well written on some of these topics:
- https://en.wikipedia.org/wiki/CPU_cache
- https://en.wikipedia.org/wiki/Cache_hierarchy
- https://en.wikipedia.org/wiki/Cache-oblivious_algorithm