Comprehensive MPI
Shuaiwen SOFT 3410 Week 10
Much of this material is adapted from the MPI Tutorial blog, which has many practice coding examples on GitHub:
https://mpitutorial.com/tutorials/
Please take some time to fill in your survey
https://student-surveys.sydney.edu.au/students/
Fundamental MPI Concepts
MPI is short for Message Passing Interface. It was designed by the HPC community for writing parallel codes that run on distributed-memory systems (e.g., a cluster with many nodes). It is widely adopted and complements OpenMP, which is used to parallelize code on a single shared-memory multi-core machine. MPI enables the message passing model of parallel programming. There are a few core concepts behind MPI.
(1) The notion of a communicator. A communicator defines a group of processes that have the ability to communicate with one another. In this group of processes, each is assigned a unique rank, and they explicitly communicate with one another by their ranks.
(2) The basic communication model. The core of it is built upon the send and receive operations among processes. A process may send a message to another process by providing the rank of the process and a unique tag to identify the message. The receiver can then post a receive for a message with a given tag (or it may not even care about the tag), and then handle the data accordingly. Communications such as this which involve one sender and receiver are known as point-to-point communications.
(3) The collective communication. There are many cases where processes need to communicate with everyone else, for example when a manager process needs to broadcast information to all of its worker processes. In this case, it would be cumbersome to write code that does all of the sends and receives by hand. In fact, such code would often not use the network in an optimal manner. MPI can handle a wide variety of these collective communications that involve all processes.
In this class, we will learn:
(A) MPI setup, compile and run
(B) Point-to-point communication: blocking vs. non-blocking send and receive
(C) Collective communication
(D) Some simple examples of using MPI
Materials for you in your future coding
• The code examples used in this tutorial can be found at: https://github.com/mpitutorial/mpitutorial/tree/gh-pages/tutorials
• Full MPI APIs and tutorial: https://computing.llnl.gov/tutorials/mpi/
• This material is also based on https://computing.llnl.gov/tutorials/mpi/ and www.mpitutorial.com.
Hello World !
>>> export MPIRUN=/home/kendall/bin/mpirun
>>> export MPI_HOSTS=host_file
>>> ./run.py mpi_hello_world
/home/kendall/bin/mpirun -n 4 -f host_file ./mpi_hello_world
Hello world from processor cetus2, rank 1 out of 4 processors
Hello world from processor cetus1, rank 0 out of 4 processors
Hello world from processor cetus4, rank 3 out of 4 processors
Hello world from processor cetus3, rank 2 out of 4 processors
The first two lines export environment variables for the actual mpirun. MPI_HOSTS typically points to a host_file that tells MPI how to launch processes across nodes. For example:
>>> cat host_file
cetus1:4
cetus2:4
cetus3:4
cetus4:4
If your nodes have multiple cores, modify your host file by placing a colon and the number of cores per processor after each host name, as above. Note that the output lines can appear in any order, since the processes run independently.
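The program that produces the output above is not reproduced in these notes. A minimal sketch of what such a hello-world program could look like is given here; it uses only standard MPI calls, and the exact message text is copied from the sample output rather than from the original source file.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Start the MPI runtime; this must precede any other MPI call. */
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of this process */

    /* Name of the node this process runs on (e.g., cetus1). */
    char name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           name, world_rank, world_size);

    /* Shut the MPI runtime down; no MPI calls are allowed after this. */
    MPI_Finalize();
    return 0;
}
```

With most MPI installations this is compiled with the mpicc wrapper (e.g., mpicc -o mpi_hello_world mpi_hello_world.c) and launched through mpirun as shown above.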
Point to Point Communication
• Blocking MPI Send and Receive
MPI's blocking send and receive calls operate in the following manner.
First, process A decides a message needs to be sent to process B. Process A then packs up all of its necessary data into a buffer for process B. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before being handed to the post office).
After the data is packed into a buffer, the communication device (which is often a network) is responsible for routing the message to the proper location. The destination of the message is defined by the receiving process's rank. Even though the message is routed to B, process B still has to acknowledge that it wants to receive A's data.
Once it does this, the data has been transmitted. Process A is notified that the data has been transmitted and may go back to work.
MPI Send/RECV Sample Code
MPI_Comm_rank and MPI_Comm_size are first used to determine the world size along with the rank of the process. Then process zero initializes a number to the value of negative one and sends this value to process one. In the else-if branch, process one calls MPI_Recv to receive the number and prints the received value. Since we are sending and receiving exactly one integer, each process requests that one MPI_INT be sent/received. Each process also uses a tag number of zero to identify the message. The receiver could also have used the predefined constant MPI_ANY_TAG for the tag, since only one type of message is being transmitted (the sender must always give a concrete tag).
MPI Datatypes
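The other elementary MPI datatypes are listed below with their equivalent C datatypes.
MPI datatype              C equivalent
MPI_CHAR                  char
MPI_SHORT                 short int
MPI_INT                   int
MPI_LONG                  long int
MPI_LONG_LONG             long long int
MPI_UNSIGNED_CHAR         unsigned char
MPI_UNSIGNED_SHORT        unsigned short int
MPI_UNSIGNED              unsigned int
MPI_UNSIGNED_LONG         unsigned long int
MPI_UNSIGNED_LONG_LONG    unsigned long long int
MPI_FLOAT                 float
MPI_DOUBLE                double
MPI_LONG_DOUBLE           long double
MPI_BYTE                  (raw byte, no direct C equivalent)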
Blocking-based Ping Pong Program
Other examples:
Blocking vs. Non-blocking
• Blocking communication is done using MPI_Send() and MPI_Recv(), e.g., like the example we have shown above. These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the buffer passed to MPI_Send() can be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv() returns when the receive buffer has been filled with valid data.
• In contrast, non-blocking communication is done using MPI_Isend() and MPI_Irecv(). These functions return immediately (i.e., they do not block) even if the communication is not finished yet. You must call MPI_Wait() or MPI_Test() to see whether the communication has finished.
• Blocking communication is used when it is sufficient, since it is somewhat easier to use. Non-blocking communication is used when necessary. For example, you may call MPI_Isend(), do some computations, then call MPI_Wait(). This allows computation and communication to overlap, which generally leads to improved performance.
Example 1
if (rank == 0)
{
    MPI_Send(x to process 1);
    MPI_Recv(y from process 1);
}
if (rank == 1)
{
    MPI_Send(y to process 0);
    MPI_Recv(x from process 0);
}
What happens in this case?
Process 0 sends x to process 1 and blocks until process 1 receives x.
Process 1 sends y to process 0 and blocks until process 0 receives y. But process 0 is itself still blocked in its send, so neither process ever reaches its MPI_Recv, and both wait forever until they are killed. This is a deadlock. (If MPI happens to buffer the messages internally, the sends may return and the program may appear to work, but correct code must not rely on that.)
Example 2
For example, an MPI sender looks like this:
*buf = 3;
MPI_Isend(buf, 1, MPI_INT, …);
*buf = 4;
MPI_Wait(…);
Question: What will the receiver receive? 3?
Answer: 3, 4, or anything else. MPI_Isend() is non-blocking, so there is a race condition over which value is actually sent. How can we change the code so that the receiver is guaranteed to receive 3? Wait for the send to complete before modifying the buffer:
*buf = 3;
MPI_Isend(buf, 1, MPI_INT, …);
MPI_Wait(…);
*buf = 4;
Example 3: Ring Communication
Blocking MPI Send and Recv example: Ring Program
The ring program initializes a value on process zero, and the value is passed around every single process. The program terminates when process zero receives the value from the last process. Extra care is taken to ensure that the program does not deadlock: process zero makes sure that it has completed its first send before it tries to receive the value from the last process. All of the other processes simply call MPI_Recv (receiving from their neighboring lower process) and then MPI_Send (sending the value to their neighboring higher process) to pass the value along the ring. MPI_Send and MPI_Recv will block until the message has been transmitted. Because of this, the printfs should occur in the order in which the value is passed.
Dynamic Receiving with MPI Status
As we can see, process zero randomly sends up to MAX_NUMBERS integers to process one. Process one then calls MPI_Recv for a total of MAX_NUMBERS integers. Although process one is passing MAX_NUMBERS as the argument to MPI_Recv, process one will receive at most this amount of numbers. In the code, process one calls MPI_Get_count with MPI_INT as the datatype to find out how many integers were actually received. Along with printing off the size of the received message, process one also prints off the source and tag of the message by accessing the MPI_SOURCE and MPI_TAG elements of the status structure. As a clarification, the return value from MPI_Get_count is relative to the datatype which is passed. If the user were to use MPI_CHAR as the datatype, the returned amount would be four times as large (assuming an integer is four bytes and a char is one byte).
An example of using MPI_Probe to find out the message size
Now that you understand how the MPI_Status object works, we can now use it to our advantage a little bit more. Instead of posting a receive and simply providing a really large buffer to handle all possible sizes of messages (as we did in the last example), you can use MPI_Probe to query the message size before actually receiving it.
MPI_Probe looks quite similar to MPI_Recv. In fact, you can think of MPI_Probe as an MPI_Recv that does everything but receive the message. Similar to MPI_Recv, MPI_Probe will block for a message with a matching tag and sender. When the message is available, it will fill the status structure with information. The user can then use MPI_Recv to receive the actual message.
Similar to the last example, process zero picks a random amount of numbers to send to process one. What is different in this example is that process one now calls MPI_Probe to find out how many elements process zero is trying to send (using MPI_Get_count). Process one then allocates a buffer of the proper size and receives the numbers. Running the code will look similar to this.
MPI Collective Communication: Broadcasting and Synchronization
Collective communication is a method of communication that involves the participation of all processes in a communicator. One thing to remember about collective communication is that it implies a synchronization point among processes: all processes must reach the same point in their code before they can all continue executing.
Before going into detail about collective communication routines, let’s examine synchronization in more detail. As it turns out, MPI has a special function that is dedicated to synchronizing processes:
MPI_Barrier (MPI_Comm communicator)
The name of the function is quite descriptive – the function forms a barrier, and no processes in the communicator can pass the barrier until all of them call the function.
Broadcasting with MPI_Bcast: A broadcast is one of the standard collective communication techniques. During a broadcast, one process sends the same data to all processes in a communicator. One of the main uses of broadcasting is to send out user input to a parallel program, or send out configuration parameters to all processes.
MPI_Bcast
In this example, process zero is the root process, and it has the initial copy of data. All of the other processes receive the copy of data.
In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype looks like this:
MPI_Bcast( void* data, int count, MPI_Datatype datatype, int root, MPI_Comm communicator)
Although the root process and receiver processes do different jobs, they all call the same MPI_Bcast function. When the root process (in our example, it was process zero) calls MPI_Bcast, the data variable will be sent to all other processes. When all of the receiver processes call MPI_Bcast, the data variable will be filled in with the data from the root process.
MPI_Bcast Features
Two important features about MPI_Bcast:
(1) MPI_Bcast typically uses a smarter implementation: a tree-based communication algorithm that can use more of the available network links at once. Imagine that each process has only one outgoing/incoming network link. A naive broadcast, in which process zero sends to every other process in turn, uses only the one network link from process zero to send all the data.
For example, with 8 processes, process zero would have to send the data 7 times, one after another. This is not efficient at all.
MPI_Bcast uses a smarter way. In this illustration, process zero starts off with the data and sends it to process one. Process zero also sends the data to process two in the second stage. The difference in this example is that process one now helps the root process by forwarding the data to process three. During the second stage, two network connections are being utilized at a time. The network utilization doubles at every subsequent stage of the tree communication until all processes have received the data.
(2) Be careful: a collective is written differently from blocking MPI send and recv! For example, code in which only the root calls MPI_Bcast while every other process calls MPI_Recv to pick up the data will hang, because a receive never matches a broadcast.
Correct way to use MPI_Bcast
For MPI collective communications, everyone has to participate; everyone has to call the Bcast. (That is why the Bcast routine has a parameter that specifies the "root", i.e., who is doing the sending; if only the sender called Bcast, you would not need this.) Everyone calls the broadcast, including the receivers; the receivers do not just post a receive.
The reason for this is that the collective operations can involve everyone in the communication, so that you state what you want to happen (everyone gets one process's data) rather than how it happens (e.g., the root process loops over all other ranks and does a send). This leaves scope for optimizing the communication patterns (e.g., a tree-based hierarchical communication that takes log(P) steps rather than P steps for P processes).
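To see where the log(P) figure comes from, the plain C sketch below computes the schedule of one possible tree broadcast, in which the set of processes holding the data doubles at every stage. This is only an illustration with hypothetical function names; a real MPI library is free to choose a different algorithm inside MPI_Bcast.

```c
#include <stdio.h>

/* Number of stages needed to reach nprocs processes when the number of
 * processes holding the data doubles each stage: the smallest s with
 * 2^s >= nprocs. */
int tree_stages(int nprocs) {
    int stages = 0;
    for (int reach = 1; reach < nprocs; reach *= 2) {
        stages++;
    }
    return stages;
}

/* Rank that forwards the data to `rank` in this tree, or -1 for the root
 * (rank 0). The parent is the rank with its highest set bit cleared. */
int tree_parent(int rank) {
    if (rank <= 0) {
        return -1;
    }
    int high = 1;
    while (high * 2 <= rank) {
        high *= 2;
    }
    return rank - high;
}

/* Prints who sends to whom at each stage, with rank 0 as the root. */
void print_tree_schedule(int nprocs) {
    int stage = 1;
    for (int have = 1; have < nprocs; have *= 2, stage++) {
        /* Every rank that already has the data sends to one new rank. */
        for (int src = 0; src < have && src + have < nprocs; src++) {
            printf("stage %d: %d -> %d\n", stage, src, src + have);
        }
    }
}
```

For 8 processes this schedule takes 3 stages instead of the 7 sequential sends a loop on the root would need, and process three receives its copy from process one rather than from the root.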
MPI Scatter, Gather, and Allgather
MPI_Scatter is a collective routine that is very similar to MPI_Bcast. MPI_Scatter involves a designated root process sending data to all processes in a communicator. The primary difference between MPI_Bcast and MPI_Scatter is small but important. MPI_Bcast sends the same piece of data to all processes while MPI_Scatter sends chunks of an array to different processes. Check out the illustration below for further clarification.
MPI_Gather
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from one process to many processes, MPI_Gather takes elements from many processes and gathers them to one single process. This routine is highly useful to many parallel algorithms, such as parallel sorting and searching. Below is a simple illustration of this algorithm.
Similar to MPI_Scatter, MPI_Gather takes elements from each process and gathers them to the root process. The elements are ordered by the rank of the process from which they were received. The function prototype for MPI_Gather is identical to that of MPI_Scatter.
In MPI_Gather, only the root process needs to have a valid receive buffer. All other calling processes can pass NULL for recv_data. Also, don’t forget that the recv_count parameter is the count of elements received per process, not the total summation of counts from all processes. This can often confuse beginning MPI programmers.
MPI_Allgather
So far, we have covered two MPI routines that perform many-to-one or one-to-many communication patterns, which simply means that many processes send/receive to one process. Oftentimes it is useful to be able to send many elements to many processes (i.e., a many-to-many communication pattern). MPI_Allgather has this characteristic.
Given a set of elements distributed across all processes, MPI_Allgather will gather all of the elements to all the processes. In the most basic sense, MPI_Allgather is an MPI_Gather followed by an MPI_Bcast. The illustration below shows how data is distributed after a call to MPI_Allgather.
Just like MPI_Gather, the elements from each process are gathered in order of their rank, except this time the elements are gathered to all processes. The function declaration for MPI_Allgather is almost identical to MPI_Gather with the difference that there is no root process in MPI_Allgather.
Parallel Rank: https://mpitutorial.com/tutorials/performing-parallel-rank-with-mpi/
Assignment 2
Three-dimensional array: https://www.geeksforgeeks.org/multidimensional-arrays-c-cpp/
The fireplace is held at 100°C (although in reality a fire is much hotter). The walls are considered to be at 20°C. The boundary values (the fireplace and the walls) are fixed temperatures. The room temperature is initialized to 20°C.