This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Multicore Block Data Decomposition: 1D Heat Transfer Example
You have a steel bar. Each section of the bar starts out at a different temperature. There are no incoming heat sources or outgoing heat sinks (i.e., ignore boundary conditions). Ready, go! How do the temperatures change over time?
The fundamental differential equation here is:

ρC ∂T/∂t = k ∂²T/∂x²
ρ is the density, in kg/m³
C is the specific heat capacity, in Joules / (kg ∙ K)
k is the coefficient of thermal conductivity, in Watts / (meter ∙ K) = Joules / (meter ∙ sec ∙ K)
Computer Graphics
In plain words, this all means that temperatures, left to themselves, try to even out. Hots get cooler. Cools get hotter. The greater the temperature differential, the faster the evening-out process goes.
mjb – March 15, 2020
Data Decomposition
Data_decomposition.pptx
Numerical Methods:
Changing a Derivative into Discrete Arithmetic
How fast the temperature is changing within the bar:

∂²T/∂x² ≈ ( Ti-1 − 2Ti + Ti+1 ) / (∆x)²

How much the temperature changes over time:

∂T/∂t ≈ ∆Ti / ∆t
ρC ∂T/∂t = k ∂²T/∂x²

How much the temperature changes in the time step:

∆Ti = [ ( k / (ρC) ) ∙ ( Ti-1 − 2Ti + Ti+1 ) / (∆x)² ] ∙ ∆t
∂T/∂t = ( k / (ρC) ) ∙ ∂²T/∂x²

where k/(ρC) holds the physical properties of the material, and ∂²T/∂x² is how fast the temperature is changing within the bar.
As a side note: the quantity k/(ρC) has the unlikely units of m²/sec! (It is known as the thermal diffusivity.)
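To make the arithmetic concrete, here is a minimal sketch of one explicit update of a single interior node. The material constants are made-up illustrative values (roughly steel), not values taken from these slides:

```cpp
#include <cassert>

// Illustrative material constants -- hypothetical values, not from the slides:
const float KCOND = 45.f;     // thermal conductivity k, W/(m*K) (roughly steel)
const float RHO   = 7850.f;   // density rho, kg/m^3
const float CHEAT = 490.f;    // specific heat C, J/(kg*K)
const float DELTA = 0.01f;    // node spacing, m
const float DT    = 0.1f;     // time step, sec

// One explicit time step for interior node i:
// dTi = [ k/(rho*C) * ( T[i-1] - 2*T[i] + T[i+1] ) / dx^2 ] * dt
float
NodeUpdate( const float *T, int i )
{
	float d2Tdx2 = ( T[i-1] - 2.f*T[i] + T[i+1] ) / ( DELTA*DELTA );
	return T[i] + ( KCOND / (RHO*CHEAT) ) * d2Tdx2 * DT;
}
```

Note that a uniform bar does not change at all (the second difference is zero), and a hot spike relaxes toward its neighbors: exactly the evening-out behavior described above.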
1D Data Decomposition: Partitioning Strategies
[Diagram: the 1D node array ( …, Ti-1, Ti, Ti+1, … ) divided into contiguous blocks assigned to Core #1, Core #2, Core #3]
On a shared memory multicore system, the obvious approach is to allocate the data as one large global-memory block (i.e., shared).
You will actually need two such arrays, one to hold the current temperature values that you are reading from and one to hold the next temperature values that you are writing to.
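The swap between those two arrays can be sketched as a simple "ping-pong" of indices (the variable and function names here are illustrative, not taken from the code that follows):

```cpp
// Two buffers: read from index NowIdx, write into index NextIdx, then swap.
const int N = 16;
float Buf[2][N];
int NowIdx  = 0;
int NextIdx = 1;

void
Advance( )
{
	// ... compute Buf[NextIdx][*] from Buf[NowIdx][*] here ...
	NowIdx  = NextIdx;	// what we just wrote becomes the current values
	NextIdx = 1 - NextIdx;	// and the other array gets overwritten next step
}
```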
1D Data Decomposition: Partitioning
#include <stdio.h>
#include <math.h>
#include <omp.h>

#define NUM_TIME_STEPS	100

#ifndef NUMN
#define NUMN	16	// total number of nodes
#endif

#ifndef NUMT
#define NUMT	4	// number of threads to use
#endif

#define NUM_NODES_PER_THREAD	( NUMN / NUMT )

float	Temps[2][NUMN];
int	Now;	// which array holds the "current" values: 0 or 1
int	Next;	// which array is being filled: 1 or 0

void	DoAllWork( int );
Allocate as One Large Contiguous Global Array
[Diagram: the global Temps array, with …, Ti-1, Ti, Ti+1, … contiguous in shared memory]
omp_set_num_threads( NUMT );
Now  = 0;
Next = 1;

for( int i = 0; i < NUMN; i++ )
	Temps[Now][ i ] = 0.;
Temps[Now][NUMN/2] = 100.;

double time0 = omp_get_wtime( );

#pragma omp parallel default(none) shared(Temps,Now,Next)
{
	int me = omp_get_thread_num( );
	DoAllWork( me );		// each thread calls this
}

double time1 = omp_get_wtime( );
double usec = 1000000. * ( time1 - time0 );
double megaNodesPerSecond = (float)NUM_TIME_STEPS * (float)NUMN / usec;
DoAllWork( ), I
void
DoAllWork( int me )
{
	// what range of the global Temps array this thread is responsible for:
	int first = me * NUM_NODES_PER_THREAD;
	int last  = first + ( NUM_NODES_PER_THREAD - 1 );

	for( int step = 0; step < NUM_TIME_STEPS; step++ )
	{
		// first element on the left:
		{
			float left = 0.;
			if( me != 0 )
				left = Temps[Now][first-1];
			float dtemp = ( ( K / (RHO*C) ) *
				( left - 2.*Temps[Now][first] + Temps[Now][first+1] ) / ( DELTA*DELTA ) ) * DT;
			Temps[Next][first] = Temps[Now][first] + dtemp;
		}

		// all the nodes in between:
		for( int i = first+1; i <= last-1; i++ )
		{
			float dtemp = ( ( K / (RHO*C) ) *
				( Temps[Now][i-1] - 2.*Temps[Now][ i ] + Temps[Now][i+1] ) / ( DELTA*DELTA ) ) * DT;
			Temps[Next][ i ] = Temps[Now][ i ] + dtemp;
		}
What happens if two cores are writing to the same cache line? False Sharing!
DoAllWork( ), II
		// last element on the right:
		{
			float right = 0.;
			if( me != NUMT-1 )
				right = Temps[Now][last+1];
			float dtemp = ( ( K / (RHO*C) ) *
				( Temps[Now][last-1] - 2.*Temps[Now][last] + right ) / ( DELTA*DELTA ) ) * DT;
			Temps[Next][last] = Temps[Now][last] + dtemp;
		}

		// all threads need to wait here so that all Temps[Next][*] values are filled:
		#pragma omp barrier

		// we want just one thread swapping the definitions of Now and Next:
		#pragma omp single
		{
			Now  = Next;
			Next = 1 - Next;
		}	// an implied barrier exists here

	}	// for( int step = ...
}
Because each core is working from left to right across the data, I am guessing that there is little cache line conflict.
What happens if two cores are writing to the same cache line? False Sharing!
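One common mitigation for false sharing, sketched here under the assumption of 64-byte cache lines (a typical size, but not something these slides specify), is to round each thread's chunk up to a whole number of cache lines so that no two threads ever write into the same line:

```cpp
#include <cassert>

const int CACHE_LINE_BYTES = 64;	// an assumption -- check your CPU
const int FLOATS_PER_LINE  = CACHE_LINE_BYTES / (int)sizeof(float);	// = 16

// Round a per-thread chunk size up to a whole number of cache lines:
int
PaddedChunkSize( int numNodesPerThread )
{
	int lines = ( numNodesPerThread + FLOATS_PER_LINE - 1 ) / FLOATS_PER_LINE;
	return lines * FLOATS_PER_LINE;
}
```

With NUMN = 16 and NUMT = 4, each chunk of 4 floats would be padded to 16, so each thread's writes start on their own cache line, at the cost of some wasted memory.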
Allocate as Separate Thread-Local (private) Sub-arrays
[Diagram: the node array ( …, Ti-1, Ti, Ti+1, … ) split into per-thread sub-arrays on Core #1, Core #2, Core #3]
We could make each sub-array a thread-local (i.e., private) variable. This would put each sub-array on each thread's individual stack.
The strategy is now to read from the single large global array and compute into each thread’s local array.
When we are done, copy each local array into the global array.
Allocate as Separate Thread-Local (private) Sub-arrays
float nextTemps[NUM_NODES_PER_THREAD];		// lives on this thread's stack
for( int i = 0; i < NUM_NODES_PER_THREAD; i++ )
	nextTemps[ i ] = Temps[first+i];

// read from Temps[ ], write into nextTemps[ ]:
for( int steps = 0; steps < NUM_TIME_STEPS; steps++ )
{
	// all the other nodes in between (boundary nodes are handled as before):
	for( int i = 1; i < NUM_NODES_PER_THREAD-1; i++ )
	{
		float dtemp = ( ( K / (RHO*C) ) *
			( Temps[first+i-1] - 2.*Temps[first+i] + Temps[first+i+1] ) / ( DELTA*DELTA ) ) * DT;
		nextTemps[ i ] = Temps[first+i] + dtemp;
	}

	// don't update the global Temps[ ] until they are no longer being used:
	#pragma omp barrier

	// update the global Temps[ ]:
	for( int i = 0; i < NUM_NODES_PER_THREAD; i++ )
	{
		Temps[first+i] = nextTemps[ i ];
	}

	// be sure all global Temps[ ] are updated:
	#pragma omp barrier

}	// for( int steps = 0; ...
Allocate as Separate Thread-Global-Heap Sub-arrays
[Diagram: the node array ( …, Ti-1, Ti, Ti+1, … ) split into per-thread heap sub-arrays on Core #1, Core #2, Core #3]
We could make each sub-array a thread-heap (also private) variable. This would put each sub-array on the heap.
The strategy is now to read from the single large global array and compute into each thread’s heap array.
When we are done, copy each heap array into the global array.
Allocate as Separate Thread-Global-Heap Sub-arrays
float *nextTemps = new float [ NUM_NODES_PER_THREAD ];		// lives on the heap
for( int i = 0; i < NUM_NODES_PER_THREAD; i++ )
	nextTemps[ i ] = Temps[first+i];

// read from Temps[ ], write into nextTemps[ ]:
for( int steps = 0; steps < NUM_TIME_STEPS; steps++ )
{
	// all the other nodes in between (boundary nodes are handled as before):
	for( int i = 1; i < NUM_NODES_PER_THREAD-1; i++ )
	{
		float dtemp = ( ( K / (RHO*C) ) *
			( Temps[first+i-1] - 2.*Temps[first+i] + Temps[first+i+1] ) / ( DELTA*DELTA ) ) * DT;
		nextTemps[ i ] = Temps[first+i] + dtemp;
	}

	// don't update the global Temps[ ] until they are no longer being used:
	#pragma omp barrier

	// update the global Temps[ ]:
	for( int i = 0; i < NUM_NODES_PER_THREAD; i++ )
	{
		Temps[first+i] = nextTemps[ i ];
	}

	// be sure all global Temps[ ] are updated:
	#pragma omp barrier

}	// for( int steps = 0; ...

delete [ ] nextTemps;
Performance as a Function of Storage Strategy and Number of Nodes
[Plot: MegaNodes Computed Per Second vs. storage strategy (Local vs. Global) and number of nodes, for each number of threads]

Performance as a Function of Number of Nodes
[Plot: MegaNodes Computed Per Second vs. # of Nodes to Compute, one curve per # of Threads]

Performance as a Function of Number of Threads
[Plot: MegaNodes Computed Per Second vs. # of Threads, one curve per # of Nodes]
2D Heat Transfer Equation

ρC ∂T/∂t = k ( ∂²T/∂x² + ∂²T/∂y² )

∂T/∂t = ( k / (ρC) ) ∙ ( ∂²T/∂x² + ∂²T/∂y² )

∆Ti,j = ( k / (ρC) ) ∙ [ ( Ti-1,j − 2Ti,j + Ti+1,j ) / (∆x)² + ( Ti,j-1 − 2Ti,j + Ti,j+1 ) / (∆y)² ] ∙ ∆t
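A minimal serial sketch of that 2D update, using the same double-buffering idea (the grid sizes and parameter names here are made up for illustration):

```cpp
#include <cassert>

const int NX = 8, NY = 8;	// made-up grid dimensions
float Tnow2 [NY][NX];
float Tnext2[NY][NX];

// One explicit 2D time step over the interior nodes:
void
Step2D( float kOverRhoC, float dx, float dy, float dt )
{
	for( int j = 1; j < NY-1; j++ )
	for( int i = 1; i < NX-1; i++ )
	{
		float d2Tdx2 = ( Tnow2[j][i-1] - 2.f*Tnow2[j][i] + Tnow2[j][i+1] ) / (dx*dx);
		float d2Tdy2 = ( Tnow2[j-1][i] - 2.f*Tnow2[j][i] + Tnow2[j+1][i] ) / (dy*dy);
		Tnext2[j][i] = Tnow2[j][i] + kOverRhoC * ( d2Tdx2 + d2Tdy2 ) * dt;
	}
}
```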
3D Heat Transfer Equation

ρC ∂T/∂t = k ( ∂²T/∂x² + ∂²T/∂y² + ∂²T/∂z² )

∂T/∂t = ( k / (ρC) ) ∙ ( ∂²T/∂x² + ∂²T/∂y² + ∂²T/∂z² )

∆Ti,j,k = ( k / (ρC) ) ∙ [ ( Ti-1,j,k − 2Ti,j,k + Ti+1,j,k ) / (∆x)² + ( Ti,j-1,k − 2Ti,j,k + Ti,j+1,k ) / (∆y)² + ( Ti,j,k-1 − 2Ti,j,k + Ti,j,k+1 ) / (∆z)² ] ∙ ∆t
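In 3D the same block decomposition usually becomes a slab decomposition: each thread takes a contiguous range of k-planes, so its writes stay contiguous in memory just as in the 1D case. A sketch, assuming equal spacing d in all three directions and a made-up grid size:

```cpp
#include <cassert>

const int N3 = 8;		// made-up grid size: N3 x N3 x N3
float Vnow [N3][N3][N3];
float Vnext[N3][N3][N3];

// One explicit 3D time step; the outer (k) loop parallelizes into slabs of planes:
void
Step3D( float kOverRhoC, float d, float dt )
{
	#pragma omp parallel for
	for( int k = 1; k < N3-1; k++ )
		for( int j = 1; j < N3-1; j++ )
			for( int i = 1; i < N3-1; i++ )
			{
				float lap = (   Vnow[k][j][i-1] + Vnow[k][j][i+1]
					      + Vnow[k][j-1][i] + Vnow[k][j+1][i]
					      + Vnow[k-1][j][i] + Vnow[k+1][j][i]
					      - 6.f*Vnow[k][j][i] ) / (d*d);
				Vnext[k][j][i] = Vnow[k][j][i] + kOverRhoC * lap * dt;
			}
}
```

Because each thread owns whole k-planes, neighboring threads only touch each other's data at slab boundaries, keeping the false-sharing surface small.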