CHAPTER 3 Distributed-Memory Programming with MPI
Recall that the world of parallel multiple instruction, multiple data, or MIMD, computers is, for the most part, divided into distributed-memory and shared-memory systems. From a programmer's point of view, a distributed-memory system consists of a collection of core-memory pairs connected by a network, and the memory associated with a core is directly accessible only to that core. See Figure 3.1. On the other hand, from a programmer's point of view, a shared-memory system consists of a collection of cores connected to a globally accessible memory, in which each core can have access to any memory location. See Figure 3.2. In this chapter we're going to start looking at how to program distributed-memory systems using message-passing.
FIGURE 3.1
A distributed-memory system

FIGURE 3.2
A shared-memory system

Recall that in message-passing programs, a program running on one core-memory pair is usually called a process, and two processes can communicate by calling functions: one process calls a send function and the other calls a receive function. The implementation of message-passing that we'll be using is called MPI, which is an abbreviation of Message-Passing Interface. MPI is not a new programming language. It defines a library of functions that can be called from C, C++, and Fortran programs. We'll learn about some of MPI's different send and receive functions. We'll also learn about some "global" communication functions that can involve more than two processes. These functions are called collective communications. In the process of learning about all of these MPI functions, we'll also learn about some of the fundamental issues involved in writing message-passing programs, issues such as data partitioning and I/O in distributed-memory systems. We'll also revisit the issue of parallel program performance.
3.1 GETTING STARTED
Perhaps the first program that many of us saw was some variant of the “hello, world”
program in Kernighan and Ritchie’s classic text [29]:
#include <stdio.h>

int main(void) {
   printf("hello, world\n");

   return 0;
}
Let’s write a program similar to “hello, world” that makes some use of MPI. Instead of having each process simply print a message, we’ll designate one process to do the output, and the other processes will send it messages, which it will print.
In parallel programming, it's common (one might say standard) for the processes to be identified by nonnegative integer ranks. So if there are p processes, the processes will have ranks 0, 1, 2, …, p−1. For our parallel "hello, world," let's make process 0 the designated process, and the other processes will send it messages. See Program 3.1.
3.1.1 Compilation and execution
The details of compiling and running the program depend on your system, so you may need to check with a local expert. However, recall that when we need to be explicit, we'll assume that we're using a text editor to write the program source and the command line to compile and run it.
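On many systems, an MPI program is compiled with a wrapper script, typically called mpicc (the exact name can vary between installations). Assuming our source file is mpi_hello.c, a typical command line would be

$ mpicc -g -Wall -o mpi_hello mpi_hello.c

Here -g generates debugging information, -Wall turns on warnings, and -o mpi_hello names the executable. The wrapper is usually just an ordinary C compiler invocation that adds the options needed to find the MPI header files and libraries.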
Program 3.1: MPI program that prints greetings from the processes

 1   #include <stdio.h>
 2   #include <string.h>   /* For strlen             */
 3   #include <mpi.h>      /* For MPI functions, etc */
 4
 5   const int MAX_STRING = 100;
 6
 7   int main(void) {
 8      char greeting[MAX_STRING];
 9      int  comm_sz;   /* Number of processes */
10      int  my_rank;   /* My process rank     */
11
12      MPI_Init(NULL, NULL);
13      MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
14      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
15
16      if (my_rank != 0) {
17         sprintf(greeting, "Greetings from process %d of %d!",
18               my_rank, comm_sz);
19         MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0,
20               MPI_COMM_WORLD);
21      } else {
22         printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
23         for (int q = 1; q < comm_sz; q++) {
24            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q,
25               0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
26            printf("%s\n", greeting);
27         }
28      }
29
30      MPI_Finalize();
31      return 0;
32   }  /* main */
Many systems also support starting MPI programs with a command called mpiexec. To run the program with one process, we'd type

$ mpiexec -n 1 ./mpi_hello
and to run the program with four processes, we’d type
$ mpiexec -n 4 ./mpi_hello
With one process the program’s output would be
Greetings from process 0 of 1!
and with four processes the program’s output would be
Greetings from process 0 of 4!
Greetings from process 1 of 4!
Greetings from process 2 of 4!
Greetings from process 3 of 4!
How do we get from invoking mpiexec to one or more lines of greetings? The mpiexec command tells the system to start the requested number of instances of our mpi_hello program; it may also tell the system which core should run each instance. Once the processes are running, the MPI implementation takes care of making sure that they can communicate with one another.
3.1.2 MPI programs
Let’s take a closer look at the program. The first thing to observe is that this is a C program. For example, it includes the standard C header files stdio.h and string.h. It also has a main function just like any other C program. However, there are many parts of the program which are new. Line 3 includes the mpi.h header file. This contains prototypes of MPI functions, macro definitions, type definitions, and so on; it contains all the definitions and declarations needed for compiling an MPI program.
The second thing to observe is that all of the identifiers defined by MPI start with the string MPI_. The first letter following the underscore is capitalized for function names and MPI-defined types. All of the letters in MPI-defined macros and constants are capitalized, so there's no question about what is defined by MPI and what's defined by the user program.
3.1.3 MPI_Init and MPI_Finalize
In Line 12 the call to MPI_Init tells the MPI system to do all of the necessary setup. For example, it might allocate storage for message buffers, and it might decide which process gets which rank. As a rule of thumb, no other MPI functions should be called before the program calls MPI_Init. Its syntax is
int MPI_Init(
      int*      argc_p   /* in/out */,
      char***   argv_p   /* in/out */);
The arguments, argc_p and argv_p, are pointers to the arguments to main: argc and argv. However, when our program doesn't use these arguments, we can just pass NULL for both. Like most MPI functions, MPI_Init returns an int error code, and in most cases we'll ignore these error codes.
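As an aside, here is a minimal sketch (not part of the "greetings" program) of what checking the error code might look like. MPI functions return MPI_SUCCESS when they succeed; note, however, that by default many implementations abort on error rather than returning, so a check like this is mostly useful after the error handler has been changed.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   /* MPI_Init returns an int error code */
   int err = MPI_Init(&argc, &argv);
   if (err != MPI_SUCCESS) {
      fprintf(stderr, "MPI_Init failed with error code %d\n", err);
      exit(-1);
   }
   /* ... the rest of the program ... */
   MPI_Finalize();
   return 0;
}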
In Line 30 the call to MPI_Finalize tells the MPI system that we're done using MPI, and that any resources allocated for MPI can be freed. The syntax is quite simple:
int MPI_Finalize(void);
In general, no MPI functions should be called after the call to MPI_Finalize.
Thus, a typical MPI program has the following basic outline:
...
#include <mpi.h>
...
int main(int argc, char* argv[]) {
   ...
   /* No MPI calls before this */
   MPI_Init(&argc, &argv);
   ...
   MPI_Finalize();
   /* No MPI calls after this */
   ...
   return 0;
}
However, we've already seen that it's not necessary to pass pointers to argc and argv to MPI_Init. It's also not necessary that the calls to MPI_Init and MPI_Finalize be in main.
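For example, here is a hedged sketch (not from the text) in which MPI_Init and MPI_Finalize are called from helper functions rather than from main; the names Start_mpi and Stop_mpi are made up for this illustration.

#include <stdio.h>
#include <mpi.h>

void Start_mpi(void) { MPI_Init(NULL, NULL); }   /* setup outside main    */
void Stop_mpi(void)  { MPI_Finalize(); }         /* teardown outside main */

int main(void) {
   int my_rank;

   Start_mpi();
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   printf("Process %d is running\n", my_rank);
   Stop_mpi();
   return 0;
}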
3.1.4 Communicators, MPI_Comm_size, and MPI_Comm_rank
In MPI a communicator is a collection of processes that can send messages to each other. One of the purposes of MPI_Init is to define a communicator that consists of all of the processes started by the user when she started the program. This communicator is called MPI_COMM_WORLD. The function calls in Lines 13 and 14 are getting information about MPI_COMM_WORLD. Their syntax is
int MPI_Comm_size(
      MPI_Comm   comm        /* in  */,
      int*       comm_sz_p   /* out */);

int MPI_Comm_rank(
      MPI_Comm   comm        /* in  */,
      int*       my_rank_p   /* out */);
For both functions, the first argument is a communicator and has the special type defined by MPI for communicators, MPI_Comm. MPI_Comm_size returns in its second argument the number of processes in the communicator, and MPI_Comm_rank returns in its second argument the calling process' rank in the communicator. We'll often use the variable comm_sz for the number of processes in MPI_COMM_WORLD, and the variable my_rank for the process rank.
3.1.5 SPMD programs
Notice that we compiled a single program; we didn't compile a different program for each process. And we did this in spite of the fact that process 0 is doing something fundamentally different from the other processes: it's receiving a series of messages and printing them, while each of the other processes is creating and sending a message. This is quite common in parallel programming. In fact, most MPI programs are written in this way. That is, a single program is written so that different processes carry out different actions, and this is achieved by simply having the processes branch on the basis of their process rank. Recall that this approach to parallel programming is called single program, multiple data, or SPMD. The if-else statement in Lines 16 through 28 makes our program SPMD.
Also notice that our program will, in principle, run with any number of processes. We saw a little while ago that it can be run with one process or four processes, but if our system has sufficient resources, we could also run it with 1000 or even 100,000 processes. Although MPI doesn't require that programs have this property, it's almost always the case that we try to write programs that will run with any number of processes, because we usually don't know in advance the exact resources available to us. For example, we might have a 20-core system available today, but tomorrow we might have access to a 500-core system.
3.1.6 Communication
In Lines 17 and 18, each process, other than process 0, creates a message it will send to process 0. (The function sprintf is very similar to printf, except that instead of writing to stdout, it writes to a string.) Lines 19–20 actually send the message to process 0. Process 0, on the other hand, simply prints its message using printf, and then uses a for loop to receive and print the messages sent by processes 1, 2, …, comm_sz−1. Lines 24–25 receive the message sent by process q, for q = 1, 2, …, comm_sz−1.
3.1.7 MPI_Send
The sends executed by processes 1, 2, …, comm_sz−1 are fairly complex, so let's take a closer look at them. Each of the sends is carried out by a call to MPI_Send, whose syntax is
int MPI_Send(
      void*          msg_buf_p      /* in */,
      int            msg_size       /* in */,
      MPI_Datatype   msg_type       /* in */,
      int            dest           /* in */,
      int            tag            /* in */,
      MPI_Comm       communicator   /* in */);
The first three arguments, msg_buf_p, msg_size, and msg_type, determine the contents of the message. The remaining arguments, dest, tag, and communicator, determine the destination of the message.
The first argument, msg_buf_p, is a pointer to the block of memory containing the contents of the message. In our program, this is just the string containing the message, greeting. (Remember that in C an array, such as a string, is a pointer.) The second and third arguments, msg_size and msg_type, determine the amount of data to be sent. In our program, the msg_size argument is the number of characters in the message plus one character for the '\0' character that terminates C strings. The msg_type argument is MPI_CHAR. These two arguments together tell the system that the message contains strlen(greeting)+1 chars.
Since C types (int, char, and so on) can't be passed as arguments to functions, MPI defines a special type, MPI_Datatype, that is used for the msg_type argument. MPI also defines a number of constant values for this type. The ones we'll use (and a few others) are listed in Table 3.1.
Table 3.1 Some Predefined MPI Datatypes

   MPI datatype          C datatype
   MPI_CHAR              signed char
   MPI_SHORT             signed short int
   MPI_INT               signed int
   MPI_LONG              signed long int
   MPI_LONG_LONG         signed long long int
   MPI_UNSIGNED_CHAR     unsigned char
   MPI_UNSIGNED_SHORT    unsigned short int
   MPI_UNSIGNED          unsigned int
   MPI_UNSIGNED_LONG     unsigned long int
   MPI_FLOAT             float
   MPI_DOUBLE            double
   MPI_LONG_DOUBLE       long double
   MPI_BYTE
   MPI_PACKED

Notice that the size of the string greeting is not the same as the size of the message specified by the arguments msg_size and msg_type. For example, when we run the program with four processes, the length of each of the messages is 31 characters, while we've allocated storage for 100 characters in greeting. Of course, the size of the message sent should be less than or equal to the amount of storage in the buffer: in our case, the string greeting.
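To see a datatype other than MPI_CHAR in action, here is a short, self-contained sketch (not part of Program 3.1; the names x and n are made up) in which each nonzero process sends an array of doubles to process 0 using MPI_DOUBLE as the msg_type argument.

#include <stdio.h>
#include <mpi.h>

int main(void) {
   double x[10];
   int    n = 10, comm_sz, my_rank;

   MPI_Init(NULL, NULL);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank != 0) {
      for (int i = 0; i < n; i++)
         x[i] = my_rank + i/10.0;        /* make up some data */
      MPI_Send(x, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   } else {
      for (int q = 1; q < comm_sz; q++) {
         MPI_Recv(x, n, MPI_DOUBLE, q, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
         printf("Process 0 received x[0] = %f from process %d\n",
               x[0], q);
      }
   }

   MPI_Finalize();
   return 0;
}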
The fourth argument, dest, specifies the rank of the process that should receive the message. The fifth argument, tag, is a nonnegative int. It can be used to distinguish messages that are otherwise identical. For example, suppose process 1 is sending floats to process 0. Some of the floats should be printed, while others should be used in a computation. Then the first four arguments to MPI_Send provide no information regarding which floats should be printed and which should be used in a computation. So process 1 can use, say, a tag of 0 for the messages that should be printed and a tag of 1 for the messages that should be used in a computation.
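Here is a hedged sketch of that idea (not from the text; the tag values and variable names are made up, and it should be run with at least two processes): process 1 uses tag 0 for a float that should be printed and tag 1 for a float that should be used in a computation, and process 0 distinguishes the two by the tag it passes to MPI_Recv.

#include <stdio.h>
#include <mpi.h>

#define PRINT_TAG   0
#define COMPUTE_TAG 1

int main(void) {
   float x;
   int   my_rank;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank == 1) {
      x = 1.5;
      MPI_Send(&x, 1, MPI_FLOAT, 0, PRINT_TAG, MPI_COMM_WORLD);
      x = 2.5;
      MPI_Send(&x, 1, MPI_FLOAT, 0, COMPUTE_TAG, MPI_COMM_WORLD);
   } else if (my_rank == 0) {
      MPI_Recv(&x, 1, MPI_FLOAT, 1, PRINT_TAG, MPI_COMM_WORLD,
            MPI_STATUS_IGNORE);
      printf("%f\n", x);                 /* this one gets printed       */
      MPI_Recv(&x, 1, MPI_FLOAT, 1, COMPUTE_TAG, MPI_COMM_WORLD,
            MPI_STATUS_IGNORE);
      x = 2*x;                           /* this one is used in a computation */
   }

   MPI_Finalize();
   return 0;
}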
The final argument to MPI_Send is a communicator. All MPI functions that involve communication have a communicator argument. One of the most important purposes of communicators is to specify communication universes; recall that a communicator is a collection of processes that can send messages to each other. Conversely, a message sent by a process using one communicator cannot be received by a process that's using a different communicator. Since MPI provides functions for creating new communicators, this feature can be used in complex programs to ensure that messages aren't "accidentally received" in the wrong place.
An example will clarify this. Suppose we're studying global climate change, and we've been lucky enough to find two libraries of functions, one for modeling the Earth's atmosphere and one for modeling the Earth's oceans. Of course, both libraries use MPI. These models were built independently, so they don't communicate with each other, but they do communicate internally. It's our job to write the interface code. One problem we need to solve is to ensure that the messages sent by one library won't be accidentally received by the other. We might be able to work out some scheme with tags: the atmosphere library gets tags 0, 1, …, n−1 and the ocean library gets tags n, n+1, …, n+m. Then each library can use the given range to figure out which tag it should use for which message. However, a much simpler solution is provided by communicators: we simply pass one communicator to the atmosphere library functions and a different communicator to the ocean library functions.
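As a rough, self-contained sketch of this approach (not from the text), the following program uses MPI_Comm_dup, which is not discussed in this section, to create two separate communicators from MPI_COMM_WORLD. The "library" function Model is a made-up stand-in for the atmosphere and ocean code: it sends a message from process 1 to process 0 inside whatever communicator it is given, so the two sets of messages can never be confused. Run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

/* Hypothetical stand-in for a library routine: it communicates only
   through the communicator it is passed. */
void Model(const char* name, MPI_Comm comm) {
   int my_rank, msg = 42;

   MPI_Comm_rank(comm, &my_rank);
   if (my_rank == 1) {
      MPI_Send(&msg, 1, MPI_INT, 0, 0, comm);
   } else if (my_rank == 0) {
      MPI_Recv(&msg, 1, MPI_INT, 1, 0, comm, MPI_STATUS_IGNORE);
      printf("%s model received %d\n", name, msg);
   }
}

int main(void) {
   MPI_Comm atmosphere_comm, ocean_comm;

   MPI_Init(NULL, NULL);
   /* Give each "library" its own communication universe */
   MPI_Comm_dup(MPI_COMM_WORLD, &atmosphere_comm);
   MPI_Comm_dup(MPI_COMM_WORLD, &ocean_comm);

   Model("atmosphere", atmosphere_comm);
   Model("ocean", ocean_comm);

   MPI_Comm_free(&atmosphere_comm);
   MPI_Comm_free(&ocean_comm);
   MPI_Finalize();
   return 0;
}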
3.1.8 MPI_Recv
The first six arguments to MPI_Recv correspond to the first six arguments of MPI_Send:

int MPI_Recv(
      void*          msg_buf_p      /* out */,
      int            buf_size       /* in  */,
      MPI_Datatype   buf_type       /* in  */,
      int            source         /* in  */,
      int            tag            /* in  */,
      MPI_Comm       communicator   /* in  */,
      MPI_Status*    status_p       /* out */);
Thus, the first three arguments specify the memory available for receiving the message: msg_buf_p points to the block of memory, buf_size determines the number of objects that can be stored in the block, and buf_type indicates the type of the objects. The next three arguments identify the message. The source argument specifies the process from which the message should be received. The tag argument should match the tag argument of the message being sent, and the communicator argument must match the communicator used by the sending process. We'll talk about the status_p argument shortly. In many cases it won't be used by the calling function, and, as in our "greetings" program, the special MPI constant MPI_STATUS_IGNORE can be passed.
3.1.9 Message matching

Suppose process q calls MPI_Send with
MPI_Send(send_buf_p, send_buf_sz, send_type, dest, send_tag,
      send_comm);
Also suppose that process r calls MPI_Recv with
MPI_Recv(recv_buf_p, recv_buf_sz, recv_type, src, recv_tag,
      recv_comm, &status);
Then the message sent by q with the above call to MPI_Send can be received by r with the call to MPI_Recv if

   recv_comm = send_comm,
   recv_tag = send_tag,
   dest = r, and
   src = q.

These conditions aren't quite enough for the message to be successfully received, however. The parameters specified by the first three pairs of arguments, send_buf_p/recv_buf_p, send_buf_sz/recv_buf_sz, and send_type/recv_type, must specify compatible buffers. For detailed rules, see the MPI-1 specification [39]. Most of the time, the following rule will suffice:

   If recv_type = send_type and recv_buf_sz ≥ send_buf_sz, then the message sent by q can be successfully received by r.

Of course, it can happen that one process is receiving messages from multiple processes, and the receiving process doesn't know the order in which the other processes will send the messages. For example, suppose process 0 is doling out work to processes 1, 2, …, comm_sz−1, and processes 1, 2, …, comm_sz−1 send their results back to process 0 when they finish the work. If the work assigned to each process takes an unpredictable amount of time, then 0 has no way of knowing the order in which the processes will finish. If process 0 simply receives the results in process rank order (first the results from process 1, then the results from process 2, and so on) and if, say, process comm_sz−1 finishes first, then process comm_sz−1 will have to sit and wait for the other processes to finish. To avoid this problem, MPI provides the special constant MPI_ANY_SOURCE that can be passed to
MPI_Recv. Then, if process 0 executes the following code, it can receive the results in the order in which the processes finish:
for (i = 1; i < comm_sz; i++) {
   MPI_Recv(result, result_sz, result_type, MPI_ANY_SOURCE,
      result_tag, comm, MPI_STATUS_IGNORE);
   Process_result(result);
}
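For concreteness, here is a hedged, self-contained version of this pattern (not from the text; the "work" is just squaring the rank): each nonzero process sends one int to process 0, and process 0 posts comm_sz-1 receives with MPI_ANY_SOURCE, so it never waits on any particular rank.

#include <stdio.h>
#include <mpi.h>

int main(void) {
   int comm_sz, my_rank, result;

   MPI_Init(NULL, NULL);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank != 0) {
      result = my_rank*my_rank;          /* stand-in for real work */
      MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
   } else {
      for (int i = 1; i < comm_sz; i++) {
         /* Accept results in whatever order they arrive */
         MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         printf("Process 0 received %d\n", result);
      }
   }

   MPI_Finalize();
   return 0;
}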