
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Data Decomposition

Computer Graphics


Data_decomposition.pptx
mjb – March 15, 2020

Multicore Block Data Decomposition: 1D Heat Transfer Example
You have a steel bar. Each section of the bar starts out at a different temperature. There are no incoming heat sources or outgoing heat sinks (i.e., ignore boundary conditions). Ready, go! How do the temperatures change over time?
The fundamental differential equation here is:

ρC ( ∂T / ∂t ) = k ( ∂²T / ∂x² )

where:
ρ is the density in kg/m³
C is the specific heat capacity measured in Joules / (kg ∙ °K)
k is the coefficient of thermal conductivity measured in Watts / (meter ∙ °K) = Joules / (meter ∙ sec ∙ °K)
In plain words, this all means that temperatures, left to themselves, try to even out. Hots get cooler. Cools get hotter. The greater the temperature differential, the faster the evening-out process goes.

Numerical Methods:
Changing a Derivative into Discrete Arithmetic
( T_{i-1} − 2T_i + T_{i+1} ) / (Δx)² ≈ ∂²T/∂x² : how fast the temperature is changing within the bar
ΔT_i / Δt ≈ ∂T/∂t : how much the temperature changes over time

CT k(2T) t x2
How much the temperature changes in the time step
Multicore Block Data Decomposition: 1D Heat Transfer Example
 k T 2TT 
C x2  
Physical properties of the material
t x 
i1 i i1  ( ) i C 2 
How fast the temperature is changing within the bar
As a side note: the quantity k/(ρC) has the unlikely units of m²/sec!

1D Data Decomposition: Partitioning Strategies
On a shared memory multicore system, the obvious approach is to allocate the data as one large global-memory block (i.e., shared).
You will actually need two such arrays, one to hold the current temperature values that you are reading from and one to hold the next temperature values that you are writing to.

1D Data Decomposition: Partitioning
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_TIME_STEPS	100

#ifndef NUMN
#define NUMN	16		// total number of nodes
#endif

#ifndef NUMT
#define NUMT	4		// number of threads to use
#endif

#define NUM_NODES_PER_THREAD	( NUMN / NUMT )

float	Temps[2][NUMN];

int	Now;	// which array holds the current values = 0 or 1
int	Next;	// which array is being filled = 1 or 0

void	DoAllWork( int );

Allocate as One Large Contiguous Global Array
omp_set_num_threads( NUMT );

Now  = 0;
Next = 1;

for( int i = 0; i < NUMN; i++ )
    Temps[Now][i] = 0.;
Temps[Now][NUMN/2] = 100.;

double time0 = omp_get_wtime( );

#pragma omp parallel default(none) shared(Temps,Now,Next)
{
    int me = omp_get_thread_num( );
    DoAllWork( me );		// each thread calls this
}

double time1 = omp_get_wtime( );
double usec = 1000000. * ( time1 - time0 );
double megaNodesPerSecond = (float)NUM_TIME_STEPS * (float)NUMN / usec;

DoAllWork( ), I

void
DoAllWork( int me )
{
    // what range of the global Temps array this thread is responsible for:
    int first = me * NUM_NODES_PER_THREAD;
    int last  = first + ( NUM_NODES_PER_THREAD - 1 );

    for( int step = 0; step < NUM_TIME_STEPS; step++ )
    {
        // first element on the left:
        {
            float left = 0.;
            if( me != 0 )
                left = Temps[Now][first-1];

            float dtemp = ( ( K / (RHO*C) ) *
                ( left - 2.*Temps[Now][first] + Temps[Now][first+1] ) / ( DELTA*DELTA ) ) * DT;
            Temps[Next][first] = Temps[Now][first] + dtemp;
        }

        // all the nodes in between:
        for( int i = first+1; i <= last-1; i++ )
        {
            float dtemp = ( ( K / (RHO*C) ) *
                ( Temps[Now][i-1] - 2.*Temps[Now][i] + Temps[Now][i+1] ) / ( DELTA*DELTA ) ) * DT;
            Temps[Next][i] = Temps[Now][i] + dtemp;
        }

DoAllWork( ), II

        // last element on the right:
        {
            float right = 0.;
            if( me != NUMT-1 )
                right = Temps[Now][last+1];

            float dtemp = ( ( K / (RHO*C) ) *
                ( Temps[Now][last-1] - 2.*Temps[Now][last] + right ) / ( DELTA*DELTA ) ) * DT;
            Temps[Next][last] = Temps[Now][last] + dtemp;
        }

        // all threads need to wait here so that all Temps[Next][*] values are filled:
        #pragma omp barrier

        // want just one thread swapping the definitions of Now and Next:
        #pragma omp single
        {
            Now  = Next;
            Next = 1 - Next;
        }	// implied barrier exists here

    }	// for( int step = ...
}

What happens if two cores are writing to the same cache line? False Sharing!
Because each core is working from left to right across the data, I am guessing that there is little cache-line conflict.

Allocate as Separate Thread-Local (private) Sub-arrays

We could make each sub-array a thread-local (i.e., private) variable. This would put each sub-array on each thread's individual stack. The strategy is now to read from the single large global array and compute into each thread's local array. When we are done, copy each local array back into the global array.

float nextTemps[NUM_NODES_PER_THREAD];

for( int i = 0; i < NUM_NODES_PER_THREAD; i++ )
    nextTemps[i] = Temps[first+i];

// read from Temps[ ], write into nextTemps[ ]:
for( int steps = 0; steps < NUM_TIME_STEPS; steps++ )
{
    // all the other nodes in between:
    for( int i = 1; i < NUM_NODES_PER_THREAD-1; i++ )
    {
        float dtemp = ( ( K / (RHO*C) ) *
            ( Temps[first+i-1] - 2.*Temps[first+i] + Temps[first+i+1] ) / ( DELTA*DELTA ) ) * DT;
        nextTemps[i] = Temps[first+i] + dtemp;
    }

    // don't update the global Temps[ ] until they are no longer being used:
    #pragma omp barrier

    // update the global Temps[ ]:
    for( int i = 0; i < NUM_NODES_PER_THREAD-1; i++ )
        Temps[first+i] = nextTemps[i];

    // be sure all global Temps[ ] are updated:
    #pragma omp barrier

}	// for( int steps = ...

Allocate as Separate Thread-Global-Heap Sub-arrays

We could make each sub-array a thread-heap (also private) variable. This would put each sub-array on the heap. The strategy is now to read from the single large global array and compute into each thread's heap array. When we are done, copy each heap array back into the global array.
float *nextTemps = new float [NUM_NODES_PER_THREAD];

for( int i = 0; i < NUM_NODES_PER_THREAD; i++ )
    nextTemps[i] = Temps[first+i];

// read from Temps[ ], write into nextTemps[ ]:
for( int steps = 0; steps < NUM_TIME_STEPS; steps++ )
{
    // all the other nodes in between:
    for( int i = 1; i < NUM_NODES_PER_THREAD-1; i++ )
    {
        float dtemp = ( ( K / (RHO*C) ) *
            ( Temps[first+i-1] - 2.*Temps[first+i] + Temps[first+i+1] ) / ( DELTA*DELTA ) ) * DT;
        nextTemps[i] = Temps[first+i] + dtemp;
    }

    // don't update the global Temps[ ] until they are no longer being used:
    #pragma omp barrier

    // update the global Temps[ ]:
    for( int i = 0; i < NUM_NODES_PER_THREAD-1; i++ )
        Temps[first+i] = nextTemps[i];

    // be sure all global Temps[ ] are updated:
    #pragma omp barrier

}	// for( int steps = ...

[Plot: MegaNodes Computed Per Second vs. number of nodes, by storage strategy (Local vs. Global) and number of threads]

[Plot: Performance as a Function of Number of Nodes. MegaNodes Computed Per Second vs. number of nodes to compute, one curve per number of threads]

[Plot: Performance as a Function of Number of Threads. MegaNodes Computed Per Second vs. number of threads, one curve per number of nodes]

2D Heat Transfer Equation

ρC ( ∂T / ∂t ) = k ( ∂²T/∂x² + ∂²T/∂y² )

ΔT_{i,j} = ( k / (ρC) ) ∙ ( ( T_{i-1,j} − 2T_{i,j} + T_{i+1,j} ) / (Δx)² + ( T_{i,j-1} − 2T_{i,j} + T_{i,j+1} ) / (Δy)² ) ∙ Δt

3D Heat Transfer Equation

ρC ( ∂T / ∂t ) = k ( ∂²T/∂x² + ∂²T/∂y² + ∂²T/∂z² )

ΔT_{i,j,k} = ( k / (ρC) ) ∙ ( ( T_{i-1,j,k} − 2T_{i,j,k} + T_{i+1,j,k} ) / (Δx)² + ( T_{i,j-1,k} − 2T_{i,j,k} + T_{i,j+1,k} ) / (Δy)² + ( T_{i,j,k-1} − 2T_{i,j,k} + T_{i,j,k+1} ) / (Δz)² ) ∙ Δt