This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Caching Issues in Multicore Performance

Problem: The Path Between a CPU Chip and Off-chip Memory is Slow
[Figure: the CPU chip connected to off-chip Main Memory]
This path is relatively slow, forcing the CPU to wait for up to 200 clock cycles just to do a store to, or a load from, memory.
Depending on your CPU’s ability to process instructions out-of-order, it might go idle during this time.
This is a huge performance hit!
Solution: Hierarchical Memory Systems, or “Cache”

The solution is to add intermediate memory systems. The one closest to the ALU (L1) is small and fast. The memory systems get slower and larger as they get farther away from the ALU.

[Figure: ALU → L1 → L2 → off-chip Main Memory, with the levels nearest the ALU being smaller and faster. An L3 cache also exists on some high-end CPU chips.]
Cache and Memory are Named by “Distance Level” from the ALU
[Figure: registers, then L1, L2 (and, on some high-end CPU chips, L3) caches, then Main Memory, labeled by distance level from the ALU]
Storage Level Characteristics
[Table: for each Type of Storage, from registers down through the caches to main memory – Typical Size, Typical Access Time (ns), Scaled Access Time (the slower levels stretch to times such as 43 seconds and 3.3 minutes), and what each level is Managed by.]

Adapted from: John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, Morgan-Kaufmann, 2007.
Usually there are two L1 caches – one for Instructions and one for Data. You will often see this referred to in data sheets as: “L1 cache: 32KB + 32KB” or “I and D cache”
Cache Hits and Misses
When the CPU asks for a value from memory, and that value is already in the cache, it can get it quickly. This is called a cache hit.

When the CPU asks for a value from memory, and that value is not already in the cache, it will have to go off the chip to get it. This is called a cache miss.
While a cache might be multiple kilo- or megabytes in total, bytes are transferred in much smaller quantities, each called a cache line. The size of a cache line is typically just 64 bytes.
Performance programming should strive to avoid as many cache misses as possible. That’s why it is very helpful to know the cache structure of your CPU.
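If you want your program to discover the cache line size for itself, here is a minimal sketch for Linux. (This assumes glibc: sysconf with _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not portable; C++17 also offers std::hardware_destructive_interference_size as a compile-time hint.)

    #include <stdio.h>
    #include <unistd.h>

    int
    main( )
    {
        long lineSize = sysconf( _SC_LEVEL1_DCACHE_LINESIZE );    // glibc-specific; returns -1 if unknown
        printf( "L1 data cache line size = %ld bytes\n", lineSize );
        return 0;
    }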
Spatial and Temporal Coherence
Successful use of the cache depends on Spatial Coherence: “If you need one memory address’s contents now, then you will probably also need the contents of some of the memory locations around it soon.”

Successful use of the cache depends on Temporal Coherence: “If you need one memory address’s contents now, then you will probably also need its contents again soon.”
If these assumptions are true, then you will generate a lot of cache hits.
If these assumptions are not true, then you will generate a lot of cache misses, and you end up re-loading the cache a lot.
How Bad Is It? — Demonstrating the Cache-Miss Problem
C and C++ store 2D arrays a row at a time, so that in A[ i ][ j ] the j index varies fastest: consecutive j elements of a row are adjacent in memory.
For large arrays, would it be better to add the elements by row, or by column? Which will avoid the most cache misses?
    for( int i = 0; i < NUM; i++ )
    {
        for( int j = 0; j < NUM; j++ )
        {
            float f = ???;        // which one?
            sum += f;
        }
    }

    float f = Array[ i ][ j ];    // sequential memory order
    float f = Array[ j ][ i ];    // jump-around-in-memory order
Demonstrating the Cache-Miss Problem – Across Rows
    #define NUM 10000

    float Array[NUM][NUM];

    double MyTimer( );

    int
    main( int argc, char *argv[ ] )
    {
        float sum = 0.;
        double start = MyTimer( );
        for( int i = 0; i < NUM; i++ )
        {
            for( int j = 0; j < NUM; j++ )
            {
                sum += Array[ i ][ j ];    // access across a row
            }
        }
        double finish = MyTimer( );
        double row_secs = finish - start;
    }
Demonstrating the Cache-Miss Problem – Down Columns
        float sum = 0.;
        double start = MyTimer( );
        for( int i = 0; i < NUM; i++ )
        {
            for( int j = 0; j < NUM; j++ )
            {
                sum += Array[ j ][ i ];    // access down a column
            }
        }
        double finish = MyTimer( );
        double col_secs = finish - start;
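MyTimer( ) is declared above but never shown. A minimal sketch, assuming OpenMP is available (any wall-clock timer would do):

    #include <omp.h>

    double
    MyTimer( )
    {
        return omp_get_wtime( );    // wall-clock time, in seconds, since some arbitrary point in the past
    }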
Demonstrating the Cache-Miss Problem
Time, in seconds, to compute the array sums, based on by-row versus by-column order:

[Graph: time (seconds) vs. dimension NUM (total array size = NUM×NUM) for the by-row and by-column versions]
Array-of-Structures vs. Structure-of-Arrays:

Array-of-Structures:

    struct xyz
    {
        float x, y, z;
    } Array[N];

    // memory layout: X0 Y0 Z0 X1 Y1 Z1 ...

Structure-of-Arrays:

    float X[N], Y[N], Z[N];

    // memory layout:
    //     X0 X1 X2 X3 ...
    //     Y0 Y1 Y2 Y3 ...
    //     Z0 Z1 Z2 Z3 ...

1. Which is a better use of the cache if we are going to be using X-Y-Z triples a lot?
2. Which is a better use of the cache if we are going to be looking at all X’s, then all Y’s, then all Z’s?

I’ve seen some programs use a “Shadow Data Structure” to get the advantages of both AOS and SOA; see the sketch below.
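A minimal sketch of that shadow idea, under the assumption that the SoA arrays are the master copy. (The helper name RefreshShadow, and the choice of which side is the master, are hypothetical, just for illustration.)

    const int N = 1024;

    struct xyz { float x, y, z; };
    struct xyz Aos[N];            // AoS "shadow" view, rebuilt before a phase that needs triples
    float X[N], Y[N], Z[N];       // SoA master copy

    void
    RefreshShadow( )              // hypothetical helper: copy the SoA master into the AoS shadow
    {
        for( int i = 0; i < N; i++ )
        {
            Aos[i].x = X[i];
            Aos[i].y = Y[i];
            Aos[i].z = Z[i];
        }
    }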
Computer Graphics is often a Good Use for Array-of-Structures:

    struct xyz
    {
        float x, y, z;
    } Array[N];

    glBegin( GL_LINE_STRIP );
    for( int i = 0; i < N; i++ )
        glVertex3f( Array[ i ].x, Array[ i ].y, Array[ i ].z );
    glEnd( );
A Good Use for Structure-of-Arrays:

    float X[N], Y[N], Z[N];
    float Dx[N], Dy[N], Dz[N];

    // memory layout:
    //     X0 X1 X2 X3 ...
    //     Y0 Y1 Y2 Y3 ...
    //     Z0 Z1 Z2 Z3 ...
    . . .
    Dx[0:N] = X[0:N] - Xnow;
    Dy[0:N] = Y[0:N] - Ynow;
    Dz[0:N] = Z[0:N] - Znow;
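The [0:N] subscripts above appear to be array-section notation (Intel Cilk Plus style). In plain C++, the same computation is just a loop over the three arrays:

    // plain-C++ equivalent of the array-section assignments above:
    for( int i = 0; i < N; i++ )
    {
        Dx[i] = X[i] - Xnow;
        Dy[i] = Y[i] - Ynow;
        Dz[i] = Z[i] - Znow;
    }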
Good Object-Oriented Programming Style can sometimes be Inconsistent with Good Cache Use:
This is good OO style – it encapsulates and isolates the data for this class. Once you have created a linked list whose elements are all over memory, is it the best use of the cache?
    class xyz
    {
        public:
            float x, y, z;
            xyz *next;
            xyz( );
    };

    static xyz *Head = NULL;

    xyz::xyz( )
    {
        next = Head;        // each new element links itself onto the front of the list
        Head = this;
    }

    . . .
    xyz *n = new xyz;       // each element lives wherever the heap allocator put it
But, Here Is a Compromise:
It might be better to create a large array of xyz structures and then have the constructor method pull new ones from that list. That would keep many of the elements close together while preserving the flexibility of the linked list.
When you need more, allocate another large array and link to it.
    #define NUMALLOC 1024

    struct node
    {
        float data;
        bool canBeDeleted;
        struct node *next;
    };

    struct node *Head = NULL;

    struct node *
    GetNewNode( )
    {
        if( Head == NULL )
        {
            // allocate a whole array of nodes at once, then chain them into a free list:
            struct node *array = new struct node[NUMALLOC];
            Head = &array[0];
            for( int i = 0; i < NUMALLOC - 1; i++ )
            {
                array[i].canBeDeleted = false;
                array[i].next = &array[i+1];
            }
            array[NUMALLOC-1].canBeDeleted = false;
            array[NUMALLOC-1].next = NULL;
        }
        struct node *p = Head;
        Head = Head->next;
        return p;
    }

    void
    DeleteNode( struct node *n )
    {
        n->canBeDeleted = true;
    }

Remember: in this scheme, you cannot delete an individual node because it was allocated as part of an array. The best you can do is track which nodes can be deleted and then, when all of an array’s nodes are flagged, delete the whole array.
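Usage then looks much like any other allocation. A short sketch (myList is a hypothetical list head belonging to the caller, separate from the allocator’s free-list Head):

    struct node *myList = NULL;         // the caller's own list

    struct node *n = GetNewNode( );     // comes from the block allocator, not a bare new
    n->data = 1.23f;
    n->next = myList;
    myList = n;
    . . .
    DeleteNode( n );                    // only flags the node; the array is freed later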
Why Can We Get This Kind of Performance Decrease as Data Sets Get Larger?
[Graph: performance vs. array size (M)]

We are violating Temporal Coherence.

We Can Help the Temporal Problem with Pre-Fetching

We will cover this in further detail when we discuss SIMD.
An Example of Where Cache Coherence Really Matters: Matrix Multiply
The usual approach multiplies an entire row of A by an entire column of B. This is equivalent to computing a single dot product: element (i,j) of C = row i of A · column j of B.

    for( i = 0; i < SIZE; i++ )
    {
        for( j = 0; j < SIZE; j++ )
        {
            C[ i ][ j ] = 0.;
            for( k = 0; k < SIZE; k++ )
                C[ i ][ j ] += A[ i ][ k ] * B[ k ][ j ];    // sum and store
        }
    }

Problem: Column j of the B matrix is not accessed with a unit stride, so for large SIZE nearly every B reference lands in a different cache line.
An Example of Where Cache Coherence Really Matters: Matrix Multiply
Scalable Universal Matrix Multiply Algorithm (SUMMA): one element of a row of A multiplies an entire row of B. This is equivalent to computing one term in many separate dot products at once. (C must be zeroed first.)

    for( i = 0; i < SIZE; i++ )
    {
        for( k = 0; k < SIZE; k++ )
        {
            for( j = 0; j < SIZE; j++ )
                C[ i ][ j ] += A[ i ][ k ] * B[ k ][ j ];
        }
    }

Now the inner loop walks across row k of B and row i of C with a unit stride.
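A minimal sketch that times the usual (i,j,k) order against the SUMMA-style (i,k,j) order back-to-back. (SIZE, the static arrays, and the use of omp_get_wtime( ) are assumptions for illustration, not the slides’ actual benchmark.)

    #include <stdio.h>
    #include <omp.h>

    #define SIZE 1024
    float A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

    int
    main( )
    {
        double t0 = omp_get_wtime( );
        for( int i = 0; i < SIZE; i++ )
            for( int j = 0; j < SIZE; j++ )
            {
                C[i][j] = 0.;
                for( int k = 0; k < SIZE; k++ )
                    C[i][j] += A[i][k] * B[k][j];    // column of B: stride of SIZE floats
            }
        double t1 = omp_get_wtime( );

        for( int i = 0; i < SIZE; i++ )
            for( int j = 0; j < SIZE; j++ )
                C[i][j] = 0.;
        for( int i = 0; i < SIZE; i++ )
            for( int k = 0; k < SIZE; k++ )
                for( int j = 0; j < SIZE; j++ )
                    C[i][j] += A[i][k] * B[k][j];    // row of B: unit stride
        double t2 = omp_get_wtime( );

        printf( "ijk: %.3f sec, ikj: %.3f sec\n", t1 - t0, t2 - t1 );
        return 0;
    }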
[Graph: Performance vs. Matrix Size (MegaMultiplies / Sec)]

[Graph: Performance vs. Number of Threads (MegaMultiplies / Sec)]
Cache Architectures
N-way Set Associative – a cache line from a particular block of memory can appear in a limited number of places in cache. Each “limited place” is called a set of cache lines. A set contains N cache lines.
The memory block can appear in any cache line in its set.
Most caches today are N-way Set Associative. N is typically 4 for L1 and 8 or 16 for L2.

[Figure: cache line blocks in memory (the numbers) and which cache line set they map to (the colors). With 2 cache lines per set, this would be called “2-way”.]
Where to find a certain memory address in the cache:

    Cache Line Block in Memory  =  Memory address in bytes  /  (bytes per cache line)
    Cache Set #                 =  Cache Line Block in Memory  modulo  (# of cache sets)
    Offset in the Cache Line    =  Memory address in bytes  modulo  (bytes per cache line)

The cache is organized as sets of cache lines, so the total # of cache lines = (# of cache sets) × N. Which cache line within the set? Pick the Least Recently Used cache line in that cache set.
A Specific Example with Numbers
    Memory address = 1234 bytes

    Cache Line Block in Memory  =  1234 / 64     =  19    (because there are 64 bytes in a cache line)
    Cache Set #                 =  19 % 4        =  3     (because there are 4 sets to rotate through)
    Offset in the Cache Line    =  1234 - 19*64  =  18    (because there are 18 bytes left after filling 19 complete cache lines)

[Figure: memory blocks ... 16 17 18 19 ...; block 19 maps to cache set 3, so it lives in one of the 2 cache lines in that set]
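The same arithmetic in code form; a minimal sketch, assuming 64-byte cache lines and 4 cache sets as in the example above (real CPUs vary):

    #include <stdio.h>

    int
    main( )
    {
        const long BYTES_PER_CACHE_LINE = 64;
        const long NUM_CACHE_SETS       = 4;

        long address = 1234;
        long block   = address / BYTES_PER_CACHE_LINE;    // cache line block in memory = 19
        long set     = block % NUM_CACHE_SETS;            // cache set # = 3
        long offset  = address % BYTES_PER_CACHE_LINE;    // offset in the cache line = 18

        printf( "block = %ld, set = %ld, offset = %ld\n", block, set, offset );
        return 0;
    }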
How Different Cores’ Cache Lines Keep Track of Each Other
Each core has its own separate L2 cache, but a write by one can impact the state of the others.
For example, if one core writes a value into one of its own cache lines, any other core using a copy of that same cache line can no longer count on its values being up-to-date. In order to regain that confidence, the core that wrote must flush that cache line back to memory and the other core must then reload its copy of that cache line.
To maintain this organization, each cache line in each core’s L2 cache is in one of 4 states (MESI):

1. Modified – this core has changed the line; its copy is the only up-to-date one.
2. Exclusive – this core has the only cached copy, and it matches memory.
3. Shared – other cores may also hold copies of this line.
4. Invalid – this copy can no longer be trusted, and must be re-loaded before use.
A Simplified View of How MESI Works
1. Core A reads a value. The cache line containing it is brought into Core A’s cache. That cache line is now tagged Exclusive.

2. Core B reads a value from the same area of memory. That cache line is brought into Core B’s cache, and both copies are re-tagged Shared.

3. Core B writes into that value. Core B’s cache line is re-tagged Modified, and Core A’s cache line is re-tagged Invalid.

4. Core A tries to read a value from that same part of memory, but its cache line is tagged Invalid. So, Core B’s cache line is flushed back to memory and then Core A’s cache line is re-loaded from memory. Both cache lines are now tagged Shared.
[Figure: Cache Line A and Cache Line B – two cores’ copies of the same cache line]

This is a huge performance hit, and is referred to as False Sharing.

Note that False Sharing doesn’t create incorrect results – just a performance hit. If anything, False Sharing prevents getting incorrect results.
A Simplified View of How MESI Works – One Core’s State Diagram

[State diagram: one core’s cache line moves among the four MESI states on these events: this core reads a value into the cache line; this core writes a value into this cache line; this core continues to read and write values into this cache line; another core reads a value into this same cache line; another core writes a value into this cache line; this core tries reading into this cache line after it was invalidated, so the cache line is written back and reloaded.]
False Sharing – An Example Problem

    struct s
    {
        float value;
    } Array[4];

    omp_set_num_threads( 4 );

    #pragma omp parallel for
    for( int i = 0; i < 4; i++ )
    {
        for( int j = 0; j < SomeBigNumber; j++ )
        {
            Array[ i ].value = Array[ i ].value + (float)rand( );
        }
    }

rand( ) is used here as some unpredictable function so the compiler doesn’t try to optimize the j-for-loop away.

All four Array elements fit in one cache line, so every thread’s write invalidates the other cores’ copies of that line.
False Sharing – Fix #1: Adding some padding

    struct s
    {
        float value;
        int pad[NUMPAD];
    } Array[4];

    const int SomeBigNumber = 100000000;    // keep less than 2B

    omp_set_num_threads( 4 );

    #pragma omp parallel for
    for( int i = 0; i < 4; i++ )
    {
        for( int j = 0; j < SomeBigNumber; j++ )
        {
            Array[ i ].value = Array[ i ].value + (float)rand( );
        }
    }

This works because successive Array elements are forced onto different cache lines, so fewer (or no) cache line conflicts exist.

[Graph: performance vs. # of threads]
Why do these curves look this way?
[Figure: with NUMPAD = 3, each Array element fills a different cache line]
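With a C++11-or-later compiler, you can get the same padding effect without counting pad ints by hand. A sketch (not what the slides’ benchmark used):

    struct alignas(64) s    // each element is rounded up to a whole 64-byte cache line
    {
        float value;
    } Array[4];             // sizeof(struct s) == 64, so no two elements share a cache line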
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

[Graph sequence: performance vs. # of threads for NUMPAD = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, and 15]
False Sharing – Fix #2: Using local (private) variables
OK, wasting memory to put your data on different cache lines seems a little silly (even though it works). Can we do something else?
Remember our discussion in the OpenMP section about how stack space is allocated for different threads?
If we use local variables, instead of contiguous array locations, that will spread our writes out in memory, and to different cache lines.
[Figure: process memory layout – Program Executable, Common Globals, Common Heap, and a separate stack for each thread]
False Sharing – Fix #2

    struct s
    {
        float value;
    } Array[4];

    const int SomeBigNumber = 100000000;    // keep less than 2B

    omp_set_num_threads( 4 );

    #pragma omp parallel for
    for( int i = 0; i < 4; i++ )
    {
        float tmp = Array[ i ].value;    // a private variable that lives in each thread’s individual stack
        for( int j = 0; j < SomeBigNumber; j++ )
        {
            tmp = tmp + (float)rand( );
        }
        Array[ i ].value = tmp;
    }

This works because a localized temporary variable is created in each core’s stack area, so little or no cache line conflict exists.
False Sharing – Fix #2 vs. Fix #1
[Graph: performance vs. # of threads, with curves for Fix #2 using 1, 2, and 4 threads overlaid on the Fix #1 NUMPAD curves]
Note that Fix #2 with {1, 2, 4} threads gives the same performance as NUMPAD = {0, 7, 15}.
malloc’ing on a Cache Line

What if you are malloc’ing, and want to be sure your data structure starts on a cache line boundary?

Knowing that cache lines start on fixed 64-byte boundaries lets you do this. Consider a memory address. The top N-6 bits tell you what cache line number this address is a part of. The bottom 6 bits tell you what offset that address has within that cache line. So, for example, on a 32-bit memory system:

    | Cache line number (32 - 6 = 26 bits) | Offset in that cache line (6 bits: 0-63) |

So, if you see a memory address whose bottom 6 bits are 000000, then you know that that memory location begins on a cache line boundary.
malloc’ing on a Cache Line

Let’s say that you have a structure and you want to malloc an ARRAYSIZE array of them. Normally, you would do this:

    struct xyzw *p = (struct xyzw *) malloc( ARRAYSIZE * sizeof(struct xyzw) );
    struct xyzw *Array = &p[0];
    . . .
    Array[ i ].x = 10.;

If you wanted to make sure that array of structures started on a cache line boundary, you would do this:

    unsigned char *p = (unsigned char *) malloc( 64 + ARRAYSIZE * sizeof(struct xyzw) );
    int offset = (long int)p & 0x3f;                        // 0x3f = bottom 6 bits are all 1’s
    struct xyzw *Array = (struct xyzw *) &p[ 64 - offset ];
    . . .
    Array[ i ].x = 10.;

Remember that when you want to free this malloc’ed space, be sure to say:

    free( p );          // correct
not:
    free( Array );      // wrong – Array does not necessarily point to the start of the allocated block
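On C11 / C++17 systems, aligned_alloc can do this bookkeeping for you. A sketch (note that aligned_alloc requires the total size to be a multiple of the alignment, so the request is rounded up):

    #include <stdlib.h>

    struct xyzw { float x, y, z, w; };

    size_t bytes = ARRAYSIZE * sizeof(struct xyzw);
    bytes = ( bytes + 63 ) & ~(size_t)63;                               // round up to a multiple of 64
    struct xyzw *Array = (struct xyzw *) aligned_alloc( 64, bytes );
    . . .
    free( Array );      // here Array really is the pointer that was allocated, so this is safe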
Now, Consider This Type of Computation
Should you allocate the data as one large global-memory block (i.e., shared)? Or, should you allocate it as separate blocks, each local to its own core (i.e., private)?
Does it matter? Yes!
If you allocate the data as one large global-memory block, there is a risk that you will get False Sharing at the individual-block boundaries. Solution: make sure that each individual-block starts and ends on a cache boundary, even if you have to pad it. (Fix #1!)
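A minimal sketch of that padding idea for the one-big-block case (NUMCORES, PERCORE, and the float element type are assumptions for illustration):

    #define CACHE_LINE_SIZE  64
    #define NUMCORES         8
    #define PERCORE          1000                        // floats each core actually uses

    // round each core's share up to a whole number of cache lines:
    int floatsPerLine = CACHE_LINE_SIZE / sizeof(float);
    int paddedPerCore = ( ( PERCORE + floatsPerLine - 1 ) / floatsPerLine ) * floatsPerLine;

    float *block = (float *) aligned_alloc( CACHE_LINE_SIZE, NUMCORES * paddedPerCore * sizeof(float) );
    // core c works only on block[ c*paddedPerCore ] .. block[ c*paddedPerCore + PERCORE - 1 ],
    // so no two cores' sub-blocks ever share a cache line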
If you allocate the data as separate blocks, then you don’t have to worry about False Sharing (Fix #2!), but you do have to worry about the l