This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Data Decomposition
Computer Graphics
Copyright By PowCoder代写 加微信 powcoder
Data_decomposition.pptx
mjb – March 15, 2020
Multicore Block Data Decomposition: 1D Heat Transfer Example
You have a steel bar. Each section of the bar starts out at a different temperature. There are no incoming heat sources or outgoing heat sinks (i.e., ignore boundary conditions). Ready, go! How do the temperatures change over time?
The fundamental differential equation here is: T 2T C t k(x2 )
ρ is the density in kg/m3
C is the specific heat capacity measured in Joules / (kg ∙ °K)
k is the coefficient of thermal conductivity measured in Watts / (meter ∙ °K)
= units of Joules/(meter∙ sec ∙ °K)
Computer Graphics
In plain words, this all means that temperatures, left to themselves, try to even out. Hots get cooler. Cools get hotter. The greater the temperature differential, the faster the evening-out process goes.
mjb – March 15, 2020
Numerical Methods:
Changing a Derivative into Discrete Arithmetic
How fast the temperature is changing within the bar
How much the temperature changes over time
Computer Graphics
mjb – March 15, 2020
CT k(2T) t x2
How much the temperature changes in the time step
Multicore Block Data Decomposition: 1D Heat Transfer Example
k T 2TT
C x2
Physical properties of the material
t x
i1 i i1 ( ) i C 2
How fast the temperature is changing within the bar
Computer Graphics
As a side note: the quantity k/(ρC) has the unlikely units of m2/sec! mjb – March 15, 2020
1D Data Decomposition: Partitioning Strategies
On a shared memory multicore system, the obvious approach is to allocate the data as one large global-memory block (i.e., shared).
You will actually need two such arrays, one to hold the current temperature values that you are reading from and one to hold the next temperature values that you are writing to.
Computer Graphics
mjb – March 15, 2020
1D Data Decomposition: Partitioning
#include
#include
#include
#define NUM_TIME_STEPS 100
#ifndef NUMN #define NUMN #endif
#ifndef NUMT #define NUMT #endif
16 // total number of nodes
4 // number of threads to use
#define NUM_NODES_PER_THREAD
( NUMN / NUMT )
Temps[2][NUMN];
int Now; // which array is the “current values“= 0 or 1 int Next; // which array is being filled = 1 or 0
DoAllWork( int );
Computer Graphics
mjb – March 15, 2020
Allocate as One Large Continuous Global Array
omp_set_num_threads( NUMT ); Now =0;
for( int i = 0; i < NUMN; i++ ) Temps[Now][ i ] = 0.;
Temps[Now][NUMN/2] = 100.;
double time0 = omp_get_wtime( );
#pragma omp parallel default(none) shared(Temps,Now,Next)
int me = omp_get_thread_num( ); DoAllWork( me );
double time1 = omp_get_wtime( );
double usec = 1000000. * ( time1 - time0 );
// each thread calls this
double megaNodesPerSecond = (float)NUM_TIME_STEPS * (float)NUMN / usec;
Computer Graphics
mjb – March 15, 2020
DoAllWork( ), I
DoAllWork( int me ) {
// what range of the global Temps array this thread is responsible for: int first = me * NUM_NODES_PER_THREAD;
int last = first + ( NUM_NODES_PER_THREAD - 1 );
for( int step = 0; step < NUM_TIME_STEPS; step++ )
// first element on the left: {
float left = 0.; if( me != 0 )
left = Temps[Now][first-1];
float dtemp = ( ( K / (RHO*C) ) *
( left - 2.*Temps[Now][first] + Temps[Now][first+1] ) / ( DELTA*DELTA ) ) * DT;
Temps[Next][first] = Temps[Now][first] + dtemp;
// all the nodes in between:
for( int i = first+1; i <= last-1; i++ ) {
float dtemp = ( ( K / (RHO*C) ) *
( Temps[Now][i-1] - 2.*Temps[Now][ i ] + Temps[Now][i+1] ) / ( DELTA*DELTA ) ) * DT;
Temps[Next][ i ] = Temps[Now][ i ] + dtemp;
What happens if two cores are writing to the same cache line?
False Sharing!
Computer Graphics
mjb – March 15, 2020
DoAllWork( ), II
// last element on the right: {
float right = 0.;
if( me != NUMT-1 )
right = Temps[Now][last+1]; float dtemp = ( ( K / (RHO*C) ) *
( Temps[Now][last-1] - 2.*Temps[Now][last] + right ) / ( DELTA*DELTA ) ) * DT;
Temps[Next][last] = Temps[Now][last] + dtemp;
// all threads need to wait here so that all Temps[Next][*] values are filled:
#pragma omp barrier
// want just one thread swapping the definitions of Now and Next:
#pragma omp single
Now = Next;
Next = 1 - Next;
} // implied barrier exists here:
} //for(intstep=... }
What happens if two cores are writing to the same cache line?
False Sharing!
Because each core is working from left to right across the data, I am guessing that there is little cache line conflict.
Computer Graphics
mjb – March 15, 2020
Allocate as Separate Thread-Local (private) Sub-arrays
We could make each sub-array a thread-local (i.e., private) variable. This would put each sub- array on each thread’s individual stack.
The strategy is now to read from the single large global array and compute into each thread’s local array.
When we are done, copy each local array into the global array.
Computer Graphics
mjb – March 15, 2020
Allocate as Separate Thread-Local (private) Sub-arrays
float nextTemps[NUM_NODES_PER_THREAD];
for( int i = 0; i < NUM_NODES_PER_THREAD; i++ ) nextTemps[ i ] = Temps[first+i];
// read from Temps[ ], write into nextTemps[ ]
for( int steps = 0; steps < NUM_TIME_STEPS; steps++ ) {
// all the other nodes in between:
for( int i = 1; i < NUM_NODES_PER_THREAD-1; i++ ) {
float dtemp = ( ( K / (RHO*C) ) *
( Temps[first+i-1] - 2.*Temps[first+i] + Temps[first+i+1] ) / ( DELTA*DELTA ) ) * DT;
nextTemps[ i ] = Temps[first+i] + dtemp;
// don't update the global Temps[ ] until they are no longer being used:
#pragma omp barrier
// update the global Temps[ ]:
for( int i = 0; i < NUM_NODES_PER_THREAD-1; i++ ) {
Temps[first+i] = nextTemps[ i ]; }
// be sure all global Temps[ ] are updated: #pragma omp barrier
Compu}ter//Gforarp(hincst steps = 0; ...
mjb – March 15, 2020
Allocate as Separate Thread-Global-Heap Sub-arrays
We could make each sub-array a thread-heap (also private) variable. This would put each sub- array on the heap.
The strategy is now to read from the single large global array and compute into each thread’s heap array.
When we are done, copy each heap array into the global array.
Computer Graphics
mjb – March 15, 2020
Allocate as Separate Thread-Global-Heap Sub-arrays
float *nextTemps = new float [NUM_NODES_PER_THREAD];
for( int i = 0; i < NUM_NODES_PER_THREAD; i++ ) nextTemps[ i ] = Temps[first+i];
// read from Temps[ ], write into nextTemps[ ]
for( int steps = 0; steps < NUM_TIME_STEPS; steps++ ) {
// all the other nodes in between:
for( int i = 1; i < NUM_NODES_PER_THREAD-1; i++ ) {
float dtemp = ( ( K / (RHO*C) ) *
( Temps[first+i-1] - 2.*Temps[first+i] + Temps[first+i+1] ) / ( DELTA*DELTA ) ) * DT;
nextTemps[ i ] = Temps[first+i] + dtemp;
// don't update the global Temps[ ] until they are no longer being used: #pragma omp barrier
// update the global Temps[ ]:
for( int i = 0; i < NUM_NODES_PER_THREAD-1; i++ ) {
Temps[first+i] = nextTemps[ i ]; }
// be sure all global Temps[ ] are updated: #pragma omp barrier
Compu}ter//Gforarp(hincst steps = 0; ...
mjb – March 15, 2020
Computer Graphics
Number of Nodes Threads
Local Global
Storage Strategy and # of
mjb – March 15, 2020
MegaNodes Computed Per Second
Performance as a Function of Number of Nodes
Computer Graphics
# of Nodes to Compute # of Threads
mjb – March 15, 2020
MegaNodes Computed Per Second
Performance as a Function of Number of Threads
Computer Graphics
# of Threads
# of Nodes
mjb – March 15, 2020
MegaNodes Computed Per Second
2D Heat Transfer Equation
CTk(2T2T) k T 2T T T 2T T
t x2 y2 T k 2T 2T
i,j C
T i1,j
i1,j i,j1
i,j1 t
tC(x2 y2)
Computer Graphics
mjb – March 15, 2020
T 2T 2T 2T Ctk(x2 y2 z2)
T i, j,k
T T i1,j,k i,j1,k
T T i,j1,k i,j,k1
T i,j,k1 t
3D Heat Transfer Equation
C
T k 2T 2T 2T tC(x2 y2 z2)
Computer Graphics
mjb – March 15, 2020
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com