Comprehensive MPI
Shuaiwen Leon Song
SOFT 3410 Week 10
Much of this material is adapted from Wes Kendall’s MPI tutorial blog, which has many
practice coding examples on GitHub:
https://mpitutorial.com/tutorials/
Please take some time to fill in your survey:
https://student-surveys.sydney.edu.au/students/
Fundamental MPI Concepts
MPI is short for Message Passing Interface. It was designed by the HPC community for writing parallel code that
can be executed on distributed-memory systems (e.g., a cluster with many nodes). It is widely adopted and
complements OpenMP, which is used to parallelize code on a single shared-memory multi-core machine. MPI implements the
message-passing model of parallel programming. There are a few core concepts behind MPI.
(1) The notion of a communicator. A communicator defines a group of processes that have the ability to communicate
with one another. In this group of processes, each is assigned a unique rank, and they explicitly communicate with
one another by their ranks.
(2) The basic communication model. The core of it is built upon the send and receive operations among processes. A
process may send a message to another process by providing the rank of the process and a unique tag to identify the
message. The receiver can then post a receive for a message with a given tag (or it may not even care about the tag),
and then handle the data accordingly. Communications such as this which involve one sender and receiver are
known as point-to-point communications.
(3) Collective communication. There are many cases where processes need to communicate with everyone
else, for example when a manager process needs to broadcast information to all of its worker processes. In this
case, it would be cumbersome to write code that does all of the sends and receives. In fact, it would often not use
the network in an optimal manner. MPI can handle a wide variety of these types of collective communications that
involve all processes.
In this class, we will learn:
(A) MPI setup, compile and run
(B) Point-to-point communication: blocking vs. non-blocking send and receive
(C) Collective communication
(D) Some simple examples of using MPI
Materials for your future coding
• The code examples used in this tutorial can be found at: https://github.com/mpitutorial/mpitutorial/tree/gh-pages/tutorials
• Full MPI APIs and tutorial: https://computing.llnl.gov/tutorials/mpi/
• This material is also based on https://computing.llnl.gov/tutorials/mpi/ and www.mpitutorial.com.
Hello World!
>>> export MPIRUN=/home/kendall/bin/mpirun
>>> export MPI_HOSTS=host_file
>>> ./run.py mpi_hello_world
/home/kendall/bin/mpirun -n 4 -f host_file ./mpi_hello_world
Hello world from processor cetus2, rank 1 out of 4 processors
Hello world from processor cetus1, rank 0 out of 4 processors
Hello world from processor cetus4, rank 3 out of 4 processors
Hello world from processor cetus3, rank 2 out of 4 processors
The first two lines export environment variables for the actual mpirun command.
MPI_HOSTS typically names a host file that tells MPI how to
launch processes across nodes. For example:
>>> cat host_file
cetus1: 4
cetus2: 4
cetus3: 4
cetus4: 4
If your nodes have multiple cores, modify your host file by placing a colon and
the number of cores per processor after the host name.
Point to Point Communication
• Blocking MPI Send and Receive
MPI’s blocking send and receive calls operate in the following manner.
First, process A decides a message needs to be sent to process B. Process A then packs up all of its necessary data into a
buffer for process B. These buffers are often referred to as envelopes since the data is being packed into a single message
before transmission (similar to how letters are packed into envelopes before transmission to the post office).
After the data is packed into a buffer, the communication device (which is often a network) is responsible for routing the
message to the proper location. The location of the message is defined by the process’s rank. Even though the message is
routed to B, process B still has to acknowledge that it wants to receive A’s data.
Once it does this, the data has been transmitted. Process A receives acknowledgment that the data has been transmitted and may go
back to work.
MPI Send/Recv Sample Code
MPI_Comm_rank and MPI_Comm_size are first used to
determine the world size along with the rank of the process.
Then process zero initializes a number to the value of negative
one and sends this value to process one. As you can see in
the else if statement, process one is calling MPI_Recv to receive
the number. It also prints off the received value. Since we are
sending and receiving exactly one integer, each process requests
that one MPI_INT be sent/received. Each process also uses a tag
number of zero to identify the message. The receiving process
could also have used the predefined constant MPI_ANY_TAG for the
tag number since only one type of message was being
transmitted.
MPI Datatypes
The other elementary MPI datatypes are listed below with their equivalent C datatypes.
Blocking-based Ping Pong Program
Other examples:
Blocking vs. Non-blocking
• Blocking communication is done using MPI_Send() and MPI_Recv(), e.g., like the example we have shown above. These
functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the
buffer passed to MPI_Send() can be reused, either because MPI saved it somewhere, or because it has been received by
the destination. Similarly, MPI_Recv() returns when the receive buffer has been filled with valid data.
• In contrast, non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These functions return
immediately (i.e., they do not block) even if the communication is not finished yet. You must
call MPI_Wait() or MPI_Test() to see whether the communication has finished.
• Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is
used when necessary. For example, you may call MPI_Isend(), do some computations, then call MPI_Wait(). This allows
computation and communication to overlap, which generally leads to improved performance.
https://www.mpich.org/static/docs/v3.2/www3/MPI_Send.html
https://www.mpich.org/static/docs/v3.2/www3/MPI_Recv.html
https://www.mpich.org/static/docs/v3.2/www3/MPI_Isend.html
https://www.mpich.org/static/docs/v3.2/www3/MPI_Irecv.html
https://www.mpich.org/static/docs/v3.2/www3/MPI_Wait.html
https://www.mpich.org/static/docs/v3.2/www3/MPI_Test.html
Example 1
if (rank == 0)
{
    MPI_Send(x to process 1);
    MPI_Recv(y from process 1);
}
if (rank == 1)
{
    MPI_Send(y to process 0);
    MPI_Recv(x from process 0);
}
What happens in this case?
Process 0 sends x to process 1 and, unless MPI buffers the message, blocks until process 1 receives x.
Process 1 sends y to process 0 and likewise blocks until process 0 receives y. But process 0 is itself blocked
in MPI_Send and never reaches its MPI_Recv, so both processes wait forever until they are killed. This is a deadlock.
Example 2
For example, an MPI sender looks like this:
*buf = 3;
MPI_Isend(buf, 1, MPI_INT, …);
*buf = 4;
MPI_Wait(…);
What will the receiver receive? 3?
Answer: 3, 4, or anything else. MPI_Isend() is non-blocking, so there is a race condition over which value is sent.
How can the code be changed so that the receiver receives 3? Wait for the send to complete before modifying the buffer:
*buf = 3;
MPI_Isend(buf, 1, MPI_INT, …);
MPI_Wait(…);
*buf = 4;
Example 3: Ring Communication
Blocking MPI Send and Recv example: Ring Program
The ring program initializes a value on
process zero, and the value is passed around
to every process. The program terminates
when process zero receives the value from the
last process. As you can see from the program,
extra care is taken to ensure that it doesn’t
deadlock. In other words, process zero makes
sure that it has completed its first send before
it tries to receive the value from the last
process. All of the other processes simply
call MPI_Recv (receiving from their
neighboring lower process) and
then MPI_Send (sending the value to their
neighboring higher process) to pass the value
along the ring. MPI_Send and MPI_Recv will
block until the message has been transmitted.
Because of this, the printfs should occur in the
order in which the value is passed.
Dynamic Receiving with MPI Status
In this example, process zero sends a random number
of integers, up to MAX_NUMBERS, to process one. Process
one then calls MPI_Recv for a total
of MAX_NUMBERS integers. Although process one
passes MAX_NUMBERS as the count argument
to MPI_Recv, it will receive at most this
many numbers. In the code, process one
calls MPI_Get_count with MPI_INT as the datatype to
find out how many integers were actually received.
Along with printing off the size of the received
message, process one also prints off the source and tag
of the message by accessing
the MPI_SOURCE and MPI_TAG elements of the
status structure. As a clarification, the return value
from MPI_Get_count is relative to the datatype which
is passed. If the user were to use MPI_CHAR as the
datatype, the returned amount would be four times as
large (assuming an integer is four bytes and a char is
one byte).
An example of using MPI_Probe to find out the message size
Now that you understand how the MPI_Status object works, we
can use it to our advantage a little bit more. Instead of
posting a receive and simply providing a really large buffer to
handle all possible sizes of messages (as we did in the last
example), you can use MPI_Probe to query the message size
before actually receiving it.
MPI_Probe looks quite similar to MPI_Recv. In fact, you can
think of MPI_Probe as an MPI_Recv that does everything but
receive the message. Similar to MPI_Recv, MPI_Probe will
block for a message with a matching tag and sender. When the
message is available, it will fill the status structure with
information. The user can then use MPI_Recv to receive the
actual message.
Similar to the last example, process zero picks a random amount
of numbers to send to process one. What is different in this
example is that process one now calls MPI_Probe to find out
how many elements process zero is trying to send
(using MPI_Get_count). Process one then allocates a buffer of
the proper size and receives the numbers.
MPI Collective Communication: Broadcasting and Synchronization
Collective communication is a method of communication which involves participation of all processes in a communicator.
One of the things to remember about collective communication is that it implies a synchronization point among processes.
This means that all processes must reach a point in their code before they can all begin executing again.
Before going into detail about collective communication routines, let’s examine synchronization in more detail. As it turns
out, MPI has a special function that is dedicated to synchronizing processes:
MPI_Barrier(MPI_Comm communicator)
The name of the function is quite descriptive – the function forms a barrier, and no processes in the communicator can pass the
barrier until all of them call the function.
Broadcasting with MPI_Bcast: A broadcast is one of the standard collective communication techniques. During a broadcast, one
process sends the same data to all processes in a communicator. One of the main uses of broadcasting is to send out user input to
a parallel program, or send out configuration parameters to all processes.
MPI_Bcast
In this example, process zero is the root process, and it has the
initial copy of data. All of the other processes receive the copy
of data.
In MPI, broadcasting can be accomplished by using MPI_Bcast.
The function prototype looks like this:
MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root,
          MPI_Comm communicator)
Although the root process and receiver processes do different
jobs, they all call the same MPI_Bcast function. When the root
process (in our example, it was process zero) calls MPI_Bcast,
the data variable will be sent to all other processes. When all of
the receiver processes call MPI_Bcast, the data variable will be
filled in with the data from the root process.
MPI_Bcast Features
Two important features about MPI_Bcast:
(1) MPI_Bcast typically uses a smarter implementation than a simple loop of sends: a tree-based communication
algorithm that can use more of the available network links at once. Imagine that each process has only one
outgoing/incoming network link. A naive broadcast, in which the root sends to every other process in turn, uses only
the one network link from process zero to send all the data.
For example, with 8 processes, process zero would have to send the data 7 times, one after another, during
broadcasting. This is not efficient at all.
In the tree-based approach, process zero starts off with the data and sends it to process
one. In the second stage, process zero also sends the data to process two. The difference from the naive version is
that process one is now helping out the root process by forwarding the data to process three. During the
second stage, two network connections are being utilized at a time. The network utilization doubles at every
subsequent stage of the tree communication until all processes have received the data.
(2) Be careful: MPI_Bcast is used differently from blocking MPI send and receive!
For example, code in which only the root calls MPI_Bcast while the other processes post an MPI_Recv will hang.
Correct way to use MPI_Bcast
For MPI collective communications, everyone has to
participate; everyone has to call MPI_Bcast. (That’s why
the MPI_Bcast routine has a parameter that specifies the
“root”, or who is doing the sending; if only the sender
called MPI_Bcast, you wouldn’t need this.) Everyone calls
the broadcast, including the receivers; the receivers
don’t just post a receive.
The reason for this is that the collective operations can
involve everyone in the communication, so that you
state what you want to happen (everyone gets one
process’s data) rather than how it happens (e.g., the root
process loops over all other ranks and does a send).
This leaves scope for optimizing the
communication patterns (e.g., a tree-based hierarchical
communication that takes log(P) steps rather
than P steps for P processes).
MPI Scatter, Gather, and Allgather
MPI_Scatter is a collective routine that is very similar
to MPI_Bcast. MPI_Scatter involves a designated root process
sending data to all processes in a communicator. The primary
difference between MPI_Bcast and MPI_Scatter is small but
important. MPI_Bcast sends the same piece of data to all
processes while MPI_Scatter sends chunks of an array to
different processes. Check out the illustration below for further
clarification.
MPI_Gather
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from
one process to many processes, MPI_Gather takes elements from many
processes and gathers them to one single process. This routine is highly useful
to many parallel algorithms, such as parallel sorting and searching. Below is a
simple illustration of this algorithm.
Similar to MPI_Scatter, MPI_Gather takes elements from each process and
gathers them to the root process. The elements are ordered by the rank of the
process from which they were received. The function prototype
for MPI_Gather is identical to that of MPI_Scatter.
In MPI_Gather, only the root process needs to have a valid receive buffer. All
other calling processes can pass NULL for recv_data. Also, don’t forget that
the recv_count parameter is the count of elements received per process, not
the total summation of counts from all processes. This can often confuse
beginning MPI programmers.
MPI_Allgather
So far, we have covered two MPI routines that perform many-
to-one or one-to-many communication patterns, which simply
means that many processes send/receive to one process.
Oftentimes it is useful to be able to send many elements to
many processes (i.e., a many-to-many communication
pattern). MPI_Allgather has this characteristic.
Given a set of elements distributed across all
processes, MPI_Allgather will gather all of the elements to all
the processes. In the most basic sense, MPI_Allgather is
an MPI_Gather followed by an MPI_Bcast. The illustration
below shows how data is distributed after a call
to MPI_Allgather.
Just like MPI_Gather, the elements from each process are
gathered in order of their rank, except this time the elements are
gathered to all processes. The function declaration
for MPI_Allgather is almost identical to MPI_Gather with the
difference that there is no root process in MPI_Allgather.
Parallel Rank: https://mpitutorial.com/tutorials/performing-parallel-rank-with-mpi/
Assignment 2
Three-dimensional array: https://www.geeksforgeeks.org/multidimensional-arrays-c-cpp/
The fireplace emits 100°C of
heat (although in reality a fire is
much hotter). The walls are
considered to be 20°C. The
boundary values (the fireplace
and the walls) are considered to
be fixed temperatures. The
room temperature is initialized
to 20°C.