This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

The Message Passing Interface (MPI): Parallelism on Distributed CPUs

http://mpi-forum.org
https://www.open-mpi.org/

Why Two URLs?

http://mpi-forum.org
This is the definitive reference for the MPI standard. Go here if you want to read the official specification, which, BTW, continues to evolve.

https://www.open-mpi.org/  (The Open MPI Consortium)
This consortium formed later. This is the open-source version of MPI. If you want to start using MPI, I recommend you look here. This is the MPI that the COE systems use.

https://www.open-mpi.org/doc/v4.0/
This URL is also really good – it is a link to all of the MPI man pages.
MPI: The Basic Idea

Programs on different CPUs coordinate computations by passing messages between each other.

Note: Each CPU in the MPI "cluster" must be prepared ahead of time by having the MPI server code installed on it. Each MPI CPU must also have an integer ID assigned to it (called its rank).

This paradigm is how modern supercomputers work!

[Photo: The Texas Advanced Computing Center's Frontera supercomputer]
How to SSH to the COE MPI Cluster

flip3 151% ssh submit-c.hpc.engr.oregonstate.edu
submit-c 142% module load slurm
submit-c 143% module load openmpi/3.1

ssh over to an MPI submission machine (submit-a and submit-b will also work).
Type the two module load lines right away to set your paths correctly.

BTW, you can find out more about the COE cluster here:
https://it.engineering.oregonstate.edu/hpc

"The College of Engineering HPC cluster is a heterogeneous mix of 202 servers providing over 3600 CPU cores, over 130 GPUs, and over 31 TB total RAM. The systems are connected via gigabit ethernet, and most of the latest servers also utilize a Mellanox EDR InfiniBand network connection. The cluster also has access to 100 TB global scratch from the College of Engineering's Dell/EMC Isilon enterprise storage."
Compiling and Running from the Command Line

% mpicc -o program program.c . . .
% mpic++ -o program program.cpp . . .

% mpiexec -mca btl self,tcp -np 4 program
        (-np gives the # of processors to use)

Warning – use mpic++ and mpiexec!
Don't use g++, and don't run by just typing the name of the executable!
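If you are curious what the wrapper compiler actually invokes, Open MPI's mpicc and mpic++ accept a --showme flag that just prints the underlying compile/link line without running it (the exact output depends on your installation):

% mpic++ --showme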
All distributed processors execute the same program at the same time
Running with a bash Batch Script

submit.bash:

#!/bin/bash
#SBATCH -J Heat
#SBATCH -A cs475-575
#SBATCH -p class
#SBATCH -N 8            # number of nodes
#SBATCH -n 8            # number of tasks
#SBATCH -o heat.out
#SBATCH -e heat.err
#SBATCH --mail-type=END,FAIL
module load openmpi/3.1
mpic++ heat.cpp -o heat -lm
mpiexec -mca btl self,tcp -np 4 heat

submit-c 143% sbatch submit.bash
Submitted batch job 258759
Auto-Notifications via Email
You don’t have to ask for email notification, but if you do, please, please, please be sure you get your email address right!
The IT people are getting real tired of fielding the bounced emails when people spell their own email address wrong.
Use slurm's scancel if your Job Needs to Be Killed

submit-c 143% sbatch submit.bash
Submitted batch job 258759
submit-c 144% scancel 258759
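If you have lost track of the job number, or want to confirm that the job really is gone, Slurm's squeue command lists your jobs (the prompt number here is just for illustration):

submit-c 145% squeue -u $USER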
Setting Up and Finishing

#include <mpi.h>

int
main( int argc, char *argv[ ] )
{
    MPI_Init( &argc, &argv );

    . . .

    MPI_Finalize( );
    return 0;
}

You don't need to process the command line arguments if you don't want to. You can also call it as:

    MPI_Init( NULL, NULL );
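Every MPI call also returns an error code you can test against MPI_SUCCESS. By default MPI aborts on most errors, so checking is optional, but a minimal sketch looks like this:

    if( MPI_Init( &argc, &argv ) != MPI_SUCCESS )
    {
        fprintf( stderr, "MPI_Init failed!\n" );
        return 1;
    }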
MPI Follows a Single-Program-Multiple-Data (SPMD) Model

A communicator is a collection of CPUs that are capable of sending messages to each other.
(This requires MPI server code getting installed on all those CPUs. Only an administrator can do this.)

[Cartoon: "Oh, look, a communicator of turkeys!"]

Getting information about our place in the communicator:
•  Size, i.e., how many altogether?
•  Rank, i.e., which one am I?

    int numCPUs;        // total # of cpus involved
    int me;             // which one I am

    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

It is then each CPU's job to figure out what piece of the overall problem it is responsible for and then go do it.
A First Test of MPI

#include <stdio.h>
#include <mpi.h>

#define BOSS 0

int
main( int argc, char *argv[ ] )
{
    MPI_Init( &argc, &argv );

    int numCPUs;        // total # of cpus involved
    int me;             // which one I am
    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    if( me == BOSS )
        fprintf( stderr, "Rank %d says that we have a Communicator of size %d\n", BOSS, numCPUs );
    fprintf( stderr, "Welcome from Rank %d\n", me );

    MPI_Finalize( );
    return 0;
}
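Assuming this file is named first.cpp (the runs below launch an executable called first), it would be compiled with:

% mpic++ -o first first.cpp

and then launched with mpiexec, as the runs below show.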
submit-c 165% mpiexec -np 16 ./first
Welcome from Rank 13
Welcome from Rank 15
Welcome from Rank 3
Welcome from Rank 7
Welcome from Rank 5
Welcome from Rank 8
Welcome from Rank 9
Welcome from Rank 11
Rank 0 says that we have a Communicator of size 16
Welcome from Rank 1
Welcome from Rank 12
Welcome from Rank 14
Welcome from Rank 6
Welcome from Rank 2
Welcome from Rank 10
Welcome from Rank 4
submit-c 166% mpiexec -np 16 ./first
Welcome from Rank 1
Welcome from Rank 5
Welcome from Rank 7
Welcome from Rank 9
Welcome from Rank 11
Welcome from Rank 13
Welcome from Rank 15
Rank 0 says that we have a Communicator of size 16
Welcome from Rank 2
Welcome from Rank 3
Welcome from Rank 4
Welcome from Rank 6
Welcome from Rank 8
Welcome from Rank 12
Welcome from Rank 14
Welcome from Rank 10
submit-c 167% mpiexec -np 16 ./first
Welcome from Rank 9
Welcome from Rank 11
Welcome from Rank 13
Welcome from Rank 7
Welcome from Rank 1
Welcome from Rank 3
Welcome from Rank 10
Welcome from Rank 15
Welcome from Rank 4
Welcome from Rank 5
Rank 0 says that we have a Communicator of size 16
Welcome from Rank 2
Welcome from Rank 6
Welcome from Rank 8
Welcome from Rank 14
Welcome from Rank 12
submit-c 168% mpiexec -np 16 ./first
Welcome from Rank 13
Welcome from Rank 15
Welcome from Rank 7
Welcome from Rank 3
Welcome from Rank 5
Welcome from Rank 9
Welcome from Rank 11
Welcome from Rank 1
Welcome from Rank 12
Welcome from Rank 14
Welcome from Rank 4
Welcome from Rank 2
Rank 0 says that we have a Communicator of size 16
Welcome from Rank 8
Welcome from Rank 10
Welcome from Rank 6

Note that the output order changes from run to run: all 16 ranks execute concurrently, so there is no guaranteed ordering of their prints.
So, we have a group (a “communicator”) of distributed processors. How do they communicate about what work they are supposed to do?
[Cartoon: "Where am I? What am I supposed to be doing? Hello? Is anyone listening?"]
Example: You could coordinate the units of our DGX system using MPI
A Good Place to Start: MPI Broadcasting

    MPI_Bcast( array, count, type, src, MPI_COMM_WORLD );

array:   the address of the data to send from if you are the src node; the address of the data to receive into if you are not
count:   the # of elements
type:    MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
src:     the rank of the CPU doing the sending

Both the sender and receivers need to execute MPI_Bcast – there is no separate receive function.
MPI Broadcast Example

This is our heat transfer equation from before. Clearly, every CPU will need to know this value:

    ΔT_i = ( k / (ρ C) ) · ( T_{i-1} − 2·T_i + T_{i+1} ) / (Δx)² · Δt

    int numCPUs;
    int me;
    float k_over_rho_c;     // the BOSS node will know this value, the others won't (yet)

    #define BOSS 0

    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );      // how many are in this communicator
    MPI_Comm_rank( MPI_COMM_WORLD, &me );           // which one am I?

    if( me == BOSS )
        << read k_over_rho_c from the data file >>

    MPI_Bcast( &k_over_rho_c, 1, MPI_FLOAT, BOSS, MPI_COMM_WORLD );    // send if BOSS, and receive if not
                                                                       // (the BOSS argument is what identifies this call as the send)
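To see MPI_Bcast end-to-end, here is a minimal self-contained sketch of my own; it assumes the BOSS simply hard-codes the value (using the 5.33x10^-6 number from the heat example) instead of reading it from a data file:

#include <stdio.h>
#include <mpi.h>

#define BOSS 0

int
main( int argc, char *argv[ ] )
{
    MPI_Init( &argc, &argv );

    int numCPUs, me;
    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    float k_over_rho_c = 0.;            // every rank declares it; only the BOSS knows it so far
    if( me == BOSS )
        k_over_rho_c = 5.33e-6;         // stand-in for "read it from the data file"

    // send if BOSS, receive if not:
    MPI_Bcast( &k_over_rho_c, 1, MPI_FLOAT, BOSS, MPI_COMM_WORLD );

    fprintf( stderr, "Rank %d now has k_over_rho_c = %g\n", me, k_over_rho_c );

    MPI_Finalize( );
    return 0;
}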
Confused? Look at this Diagram

[Diagram: Node #BOSS holds the executable code and k_over_rho_c (set).
 All nodes that are not #BOSS hold the same executable code and k_over_rho_c (being set by the broadcast).]

Both the sender and receivers need to execute MPI_Bcast – there is no separate receive function.
How Does this Work? Think Star Trek Wormholes!
Sending Data from One Source CPU to One Destination CPU

    MPI_Send( array, numToSend, type, dst, tag, MPI_COMM_WORLD );

array:       the address of the data to send from
numToSend:   the # of elements (note: this is the number of elements, not the number of bytes!)
type:        MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
dst:         the rank of the CPU to send to
tag:         an integer to differentiate this transmission from any other transmission (be sure this is unique!)

•  One message from a specific src to a specific dst cannot overtake a previous message from the same src to the same dst.
•  MPI_Send( ) blocks until the transfer is far enough along that array can be destroyed or re-used.
•  There are no guarantees on order from different src's.
Receiving Data in a Destination CPU from a Source CPU

    MPI_Recv( array, maxCanReceive, type, src, tag, MPI_COMM_WORLD, &status );

array:          the address of the data to receive into
maxCanReceive:  the # of elements we can receive, at most
type:           MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
src:            the rank of the CPU we are expecting to get a transmission from
tag:            an integer to differentiate what transmission we are looking for with this call (be sure this matches what the sender is sending!). I like to use chars.
status:         of type MPI_Status

•  The receiver blocks waiting for data that matches what it declares to be looking for.
•  One message from a specific src to a specific dst cannot overtake a previous message from the same src to the same dst.
•  There are no guarantees on the order from different src's.
•  The order from different src's could be implied in the tag.
•  status is of type MPI_Status – the "&status" can be replaced with MPI_STATUS_IGNORE.
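If you do keep the status, you can ask it how many elements actually arrived; MPI_Get_count is the standard MPI way to do that. A small sketch, reusing the parameter names from the call above:

    MPI_Status status;
    MPI_Recv( array, maxCanReceive, type, src, tag, MPI_COMM_WORLD, &status );

    int numReceived;
    MPI_Get_count( &status, type, &numReceived );    // how many elements of 'type' were actually received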
Remember, this identical code runs on all CPUs:

    #define MYDATA_SIZE   128
    #define BOSS          0

    char  myData[ MYDATA_SIZE ];
    int   numCPUs;
    int   me;

    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    if( me == BOSS )                 // the primary
    {
        for( int dst = 0; dst < numCPUs; dst++ )
        {
            if( dst != BOSS )
            {
                char *InputData = "Hello, Beavers!";
                MPI_Send( InputData, strlen(InputData)+1, MPI_CHAR, dst, 'B', MPI_COMM_WORLD );    // 'B' is the tag to send
            }
        }
    }
    else                             // a secondary
    {
        MPI_Recv( myData, MYDATA_SIZE, MPI_CHAR, BOSS, 'B', MPI_COMM_WORLD, MPI_STATUS_IGNORE );   // 'B' is the tag to expect
        printf( " '%s' from rank # %d\n", myData, me );
    }

Be sure the receiving tag matches the sending tag.

You are highly discouraged from sending to yourself. Because both the send and receive are capable of blocking, the result could be deadlock.
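For reference, here is a self-contained version of the fragment above that compiles with mpic++; the includes, the main( ) wrapper, and the const on the string are my additions (a sketch, not the original listing):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MYDATA_SIZE   128
#define BOSS          0

char  myData[ MYDATA_SIZE ];

int
main( int argc, char *argv[ ] )
{
    MPI_Init( &argc, &argv );

    int numCPUs, me;
    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    if( me == BOSS )                 // the primary
    {
        for( int dst = 0; dst < numCPUs; dst++ )
        {
            if( dst != BOSS )
            {
                const char *InputData = "Hello, Beavers!";
                MPI_Send( (void *)InputData, strlen(InputData)+1, MPI_CHAR, dst, 'B', MPI_COMM_WORLD );
            }
        }
    }
    else                             // a secondary
    {
        MPI_Recv( myData, MYDATA_SIZE, MPI_CHAR, BOSS, 'B', MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        printf( " '%s' from rank # %d\n", myData, me );
    }

    MPI_Finalize( );
    return 0;
}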
Look at this Diagram

[Diagram: the BOSS's executable code and Input Data are on the left; arrows carry copies of the Input Data to the Destinations on the right, each of which runs the same executable code.]
How does MPI let the Sender perform an MPI_Send( ) even if the Receivers are not ready to MPI_Recv( )?

[Diagram: the Sender's MPI_Send( ) copies the data into an MPI Transmission Buffer; the Receiver's MPI_Recv( ) later pulls it from an MPI Transmission Buffer on its side.]

MPI_Send( ) blocks until the transfer is far enough along that the array can be destroyed or re-used.
Another Example

You typically don't send the entire workload to each dst – you just send part of it, like this:

    #define NUMELEMENTS   ?????
    #define BOSS          0

    int  numCPUs;
    int  me;

    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    int localSize = NUMELEMENTS / numCPUs;      // assuming it comes out evenly
    float *myData = new float [ localSize ];

    if( me == BOSS )                            // the sender
    {
        float *InputData = new float [ NUMELEMENTS ];
        << read the full input data into InputData from disk >>

        for( int dst = 0; dst < numCPUs; dst++ )
        {
            if( dst != BOSS )
                MPI_Send( &InputData[dst*localSize], localSize, MPI_FLOAT, dst, 0, MPI_COMM_WORLD );
        }
    }
    else                                        // a receiver
    {
        MPI_Recv( myData, localSize, MPI_FLOAT, BOSS, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        // do something with this subset of the data
    }
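One detail this fragment leaves open is what the BOSS does with its own share of the data. A common approach (my assumption here, not shown in the original) is for the BOSS to just copy its own slice locally, since sending to yourself is discouraged:

    if( me == BOSS )
    {
        // the BOSS keeps the slice it would otherwise have sent to itself -- no message needed:
        memcpy( myData, &InputData[BOSS*localSize], localSize*sizeof(float) );    // needs <cstring>
    }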
[Diagram: the BOSS's Input Data array is divided into localSize-sized pieces; each piece is sent to a different Destination, and every processor runs the same executable code on just its own piece.]
In Distributed Computing, You Often Hear About These Design Patterns
Scatter and Gather Usually Go Together
Not surprisingly, this is referred to as Scatter/Gather.
MPI Scatter

Take a data array, break it into ~equal portions, and send it to each CPU:

    MPI_Scatter( snd_array, snd_count, snd_type, rcv_array, rcv_count, rcv_type, src, MPI_COMM_WORLD );

snd_array:   the total large array to split up
snd_count:   the # of elements to send per-processor
snd_type:    MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
rcv_array:   the local array to store this processor's piece in
rcv_count:   the # of elements to receive per-processor
rcv_type:    MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
src:         this is who is doing the sending – everyone else is receiving

Both the sender and receivers need to execute MPI_Scatter. There is no separate receive function.


MPI Gather

    MPI_Gather( snd_array, snd_count, snd_type, rcv_array, rcv_count, rcv_type, dst, MPI_COMM_WORLD );

snd_array:   the local array that this processor is sending back
snd_count:   the # of elements to send back per-processor
snd_type:    MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
rcv_array:   the total large array to put the pieces back into
rcv_count:   the # of elements to return per-processor
rcv_type:    MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, . . .
dst:         this is who is doing the receiving – everyone else is sending

Both the sender and receivers need to execute MPI_Gather. There is no separate receive function.
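Here is a minimal sketch of my own that uses the two calls together; it assumes NUMELEMENTS divides evenly among the processors, just as the heat example below does:

#include <stdio.h>
#include <mpi.h>

#define BOSS          0
#define NUMELEMENTS   (8*1024)

int
main( int argc, char *argv[ ] )
{
    MPI_Init( &argc, &argv );

    int numCPUs, me;
    MPI_Comm_size( MPI_COMM_WORLD, &numCPUs );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    int localSize = NUMELEMENTS / numCPUs;      // assuming it comes out evenly
    float *myPiece = new float [ localSize ];
    float *allData = NULL;

    if( me == BOSS )
    {
        allData = new float [ NUMELEMENTS ];
        for( int i = 0; i < NUMELEMENTS; i++ )
            allData[ i ] = (float)i;
    }

    // split allData into numCPUs pieces, one per processor:
    MPI_Scatter( allData, localSize, MPI_FLOAT, myPiece, localSize, MPI_FLOAT, BOSS, MPI_COMM_WORLD );

    // every processor works on just its own piece:
    for( int i = 0; i < localSize; i++ )
        myPiece[ i ] *= 2.;

    // put the pieces back into allData on the BOSS:
    MPI_Gather( myPiece, localSize, MPI_FLOAT, allData, localSize, MPI_FLOAT, BOSS, MPI_COMM_WORLD );

    if( me == BOSS )
        fprintf( stderr, "allData[%d] = %f\n", NUMELEMENTS-1, allData[ NUMELEMENTS-1 ] );

    MPI_Finalize( );
    return 0;
}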
Remember This? It’s Baaaaaack as a complete Scatter/Gather Example
[Diagram: the 1D temperature array split across CPU #0, CPU #1, CPU #2, CPU #3]
The Compute : Communicate Ratio still applies, except that it is even more important now because there is much more overhead in the Communicate portion.
This pattern of breaking a big problem up into pieces, sending them to different CPUs, computing on the pieces, and getting the results back is very common. That’s why MPI has its own scatter and gather functions.
heat.cpp, I

#include <stdio.h>
#include <math.h>
#include <mpi.h>

const float RHO = 8050.;
const float C   = 0.466;
const float K   = 20.;

float k_over_rho_c = K / (RHO*C);       // units of m^2/sec
                                        // K / (RHO*C) = 5.33x10^-6 m^2/sec
                                        // NOTE: this cannot be a const!

const float DX = 1.0;
const float DT = 1.0;

#define BOSS            0
#define NUMELEMENTS     (8*1024*1024)
#define NUM_TIME_STEPS  4
#define DEBUG           false           // (value assumed; set to true to get the boundary-exchange trace shown later)

float * NextTemps;      // per-processor array to hold computed next-values
int     NumCpus;        // total # of cpus involved
int     PPSize;         // per-processor local array size
float * PPTemps;        // per-processor local array temperature data
float * TempData;       // the overall NUMELEMENTS-big temperature data

void    DoOneTimeStep( int );
heat.cpp, II

int
main( int argc, char *argv[ ] )
{
    MPI_Init( &argc, &argv );

    int me;                                     // which one I am

    MPI_Comm_size( MPI_COMM_WORLD, &NumCpus );
    MPI_Comm_rank( MPI_COMM_WORLD, &me );

    // decide how much data to send to each processor:

    PPSize    = NUMELEMENTS / NumCpus;          // assuming it comes out evenly
    PPTemps   = new float [PPSize];             // all processors now have this uninitialized local array
    NextTemps = new float [PPSize];             // all processors now have this uninitialized local array too

    // broadcast the constant:

    MPI_Bcast( (void *)&k_over_rho_c, 1, MPI_FLOAT, BOSS, MPI_COMM_WORLD );
heat.cpp, III

    if( me == BOSS )        // this is the data-creator
    {
        TempData = new float [NUMELEMENTS];
        for( int i = 0; i < NUMELEMENTS; i++ )
            TempData[ i ] = 0.;
        TempData[NUMELEMENTS/2] = 100.;
    }

    MPI_Scatter( TempData, PPSize, MPI_FLOAT, PPTemps, PPSize, MPI_FLOAT, BOSS, MPI_COMM_WORLD );
heat.cpp, IV

    // all the PPTemps arrays have now been filled
    // do the time steps:

    double time0 = MPI_Wtime( );

    for( int steps = 0; steps < NUM_TIME_STEPS; steps++ )
    {
        // do the computation for one time step:

        DoOneTimeStep( me );

        // ask for all the data:

#ifdef WANT_EACH_TIME_STEPS_DATA
        MPI_Gather( PPTemps, PPSize, MPI_FLOAT, TempData, PPSize, MPI_FLOAT, BOSS, MPI_COMM_WORLD );
#endif
    }

#ifndef WANT_EACH_TIME_STEPS_DATA
    MPI_Gather( PPTemps, PPSize, MPI_FLOAT, TempData, PPSize, MPI_FLOAT, BOSS, MPI_COMM_WORLD );
#endif

    double time1 = MPI_Wtime( );
heat.cpp, V

    if( me == BOSS )
    {
        double seconds = time1 - time0;
        double performance =
            (double)NUM_TIME_STEPS * (double)NUMELEMENTS / seconds / 1000000.;    // mega-elements computed per second
        fprintf( stderr, "%3d, %10d, %8.2lf\n", NumCpus, NUMELEMENTS, performance );
    }

    MPI_Finalize( );
    return 0;
}
DoOneTimeStep, I

// read from PPTemps[ ], write into NextTemps[ ]

void
DoOneTimeStep( int me )
{
    MPI_Status status;

    // send out the left and right end values:
    // (the tag is from the point of view of the sender)

    if( me != 0 )               // i.e., if I'm not the first group on the left
    {
        // send my PPTemps[0] to me-1 using tag 'L'
        MPI_Send( &PPTemps[0], 1, MPI_FLOAT, me-1, 'L', MPI_COMM_WORLD );
        if( DEBUG )     fprintf( stderr, "%3d sent 'L' to %3d\n", me, me-1 );
    }

    if( me != NumCpus-1 )       // i.e., not the last group on the right
    {
        // send my PPTemps[PPSize-1] to me+1 using tag 'R'
        MPI_Send( &PPTemps[PPSize-1], 1, MPI_FLOAT, me+1, 'R', MPI_COMM_WORLD );
        if( DEBUG )     fprintf( stderr, "%3d sent 'R' to %3d\n", me, me+1 );
    }
DoOneTimeStep, II

    float left  = 0.;
    float right = 0.;

    if( me != 0 )               // i.e., if I'm not the first group on the left
    {
        // receive my "left" from me-1 using tag 'R'
        MPI_Recv( &left, 1, MPI_FLOAT, me-1, 'R', MPI_COMM_WORLD, &status );
        if( DEBUG )     fprintf( stderr, "%3d received 'R' from %3d\n", me, me-1 );
    }

    if( me != NumCpus-1 )       // i.e., not the last group on the right
    {
        // receive my "right" from me+1 using tag 'L'
        MPI_Recv( &right, 1, MPI_FLOAT, me+1, 'L', MPI_COMM_WORLD, &status );
        if( DEBUG )     fprintf( stderr, "%3d received 'L' from %3d\n", me, me+1 );
    }
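The actual computation part of DoOneTimeStep( ) is not included in this excerpt. Here is a sketch of what it plausibly looks like, applying the heat-transfer equation from earlier to each local element and using the left and right values just received; the variable names follow the listing above, but treat the details as my reconstruction:

    // update this processor's piece of the temperature array:

    for( int i = 0; i < PPSize; i++ )
    {
        float tleft  = ( i == 0        ) ? left  : PPTemps[i-1];
        float tright = ( i == PPSize-1 ) ? right : PPTemps[i+1];
        float dtemp  = ( k_over_rho_c * ( tleft - 2.*PPTemps[i] + tright ) / ( DX*DX ) ) * DT;
        NextTemps[i] = PPTemps[i] + dtemp;
    }

    // copy the new values back so that the next time step starts from them:

    for( int i = 0; i < PPSize; i++ )
        PPTemps[i] = NextTemps[i];
}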
Sharing Values Across the Boundaries

1 sent 'L' to 0
1 sent 'R' to 2
2 sent 'L' to 1
2 sent 'R' to 3
2 received 'R' from 1
0 sent 'R' to 1
0 received 'L' from 1
1 received 'R' from 0
1 received 'L' from 2
3 sent 'L' to 2
3 received 'R' from 2
2 received 'L' from 3
1D Compute-to-Communicate Ratio

Intraprocessor computation