Tutorial on MPI: The Message-Passing Interface

William Gropp

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL
gropp@mcs.anl.gov
Course Outline
Background on Parallel Computing
Getting Started
MPI Basics
Intermediate MPI
Tools for writing libraries
Final comments
Thanks to Rusty Lusk for some of the material in this tutorial
This tutorial may be used in conjunction with the book "Using MPI", which contains detailed descriptions of the use of the MPI routines.

Material that begins with this symbol is "advanced" and may be skipped on a first reading.
Background
Parallel Computing
Communicating with other processes
Cooperative operations
One-sided operations
The MPI process
Parallel Computing
Separate workers or processes
Interact by exchanging information

Types of parallel computing (all use different data for each worker):

    Data-parallel   Same operations on different data.  Also called SIMD.
    SPMD            Same program, different data.
    MIMD            Different programs, different data.

SPMD and MIMD are essentially the same, because any MIMD can be made SPMD. SIMD is also equivalent, but in a less practical sense.

MPI is primarily for SPMD/MIMD. HPF is an example of a SIMD interface.
Communicating with other processes

Data must be exchanged with other workers:
- Cooperative: all parties agree to transfer data
- One-sided: one worker performs transfer of data
Cooperative operations

Message-passing is an approach that makes the exchange of data cooperative.

Data must both be explicitly sent and received.

An advantage is that any change in the receiver's memory is made with the receiver's participation.

    Process 0               Process 1
    SEND( data )  ------>   RECV( data )
One-sided operations

One-sided operations between parallel processes include remote memory reads and writes.

An advantage is that data can be accessed without waiting for another process.

    Process 0               Process 1
    PUT( data )   ------>   (Memory)

    Process 0               Process 1
    (Memory)      <------   GET( data )
Class Example

Take a pad of paper. Algorithm: Initialize with the number of neighbors you have.

Compute the average of your neighbors' values and subtract from your value. Make that your new value.

Repeat until done.

Questions:
- How do you get values from your neighbors?
- Which step or iteration do they correspond to? Do you know? Do you care?
- How do you decide when you are done?
Hardware models

The previous example illustrates the hardware models by how data is exchanged among workers.

Distributed memory (e.g., Paragon, IBM SPx, workstation network)

Shared memory (e.g., SGI Power Challenge, Cray T3D)

Either may be used with SIMD or MIMD software models.

All memory is distributed.
What is MPI?

A message-passing library specification
- message-passing model
- not a compiler specification
- not a specific product

For parallel computers, clusters, and heterogeneous networks

Full-featured

Designed to permit (unleash?) the development of parallel software libraries

Designed to provide access to advanced parallel hardware for
- end users
- library writers
- tool developers
Motivation for a New Design

Message passing is now mature as a programming paradigm
- well understood
- efficient match to hardware
- many applications

Vendor systems are not portable

Portable systems are mostly research projects
- incomplete
- lack vendor support
- not at most efficient level
Motivation (cont.)

Few systems offer the full range of desired features.
- modularity (for libraries)
- access to peak performance
- portability
- heterogeneity
- subgroups
- topologies
- performance measurement tools
The MPI Process

Began at Williamsburg Workshop in April 1992
Organized at Supercomputing '92 (November 1992)
Followed HPF format and process
Met every six weeks for two days
Extensive, open email discussions
Drafts, readings, votes
Pre-final draft distributed at Supercomputing '93
Two-month public comment period
Final version of draft in May 1994
Widely available now on the Web and ftp sites, and from netlib: http://www.mcs.anl.gov/mpi/index.html
Public implementations available
Vendor implementations coming soon
Who Designed MPI?

Broad participation:

Vendors: IBM, Intel, TMC, Meiko, Cray, Convex, Ncube

Library writers: PVM, p4, Zipcode, TCGMSG, Express, Linda, Chameleon

Application specialists and consultants

Companies: ARCO, Convex, Cray Res., IBM, Intel, KAI, Meiko, NAG, nCUBE, ParaSoft, Shell, TMC

Laboratories: ANL, GMD, LANL, LLNL, NOAA, NSF, ORNL, PNL, Sandia, SDSC, SRC

Universities: UC Santa Barbara, Syracuse U, Michigan State U, Oregon Grad Inst, U of New Mexico, Miss. State U, U of Southampton, U of Colorado, Yale U, U of Tennessee, U of Maryland, Western Mich U, U of Edinburgh, Cornell U, Rice U, U of San Francisco
Features of MPI

General
- Communicators combine context and group for message security
- Thread safety

Point-to-point communication
- Structured buffers and derived datatypes, heterogeneity
- Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered

Collective
- Both built-in and user-defined collective operations
- Large number of data movement routines
- Subgroups defined directly or by topology
Features of MPI (cont.)

Application-oriented process topologies
- Built-in support for grids and graphs (uses groups)

Profiling
- Hooks allow users to intercept MPI calls to install their own tools

Environmental
- inquiry
- error control
Features not in MPI

Non-message-passing concepts not included:
- process management
- remote memory transfers
- active messages
- threads
- virtual shared memory

MPI does not address these issues, but has tried to remain compatible with these ideas (e.g., thread safety as a goal, intercommunicators).
Is MPI Large or Small?

MPI is large
- MPI's extensive functionality requires many functions
- Number of functions not necessarily a measure of complexity

MPI is small (6 functions)
- Many parallel programs can be written with just 6 basic functions

MPI is just right
- One can access flexibility when it is required
- One need not master all parts of MPI to use it
Where to use MPI?

- You need a portable parallel program
- You are writing a parallel library
- You have irregular or dynamic data relationships that do not fit a data parallel model

Where not to use MPI:

- You can use HPF or a parallel Fortran
- You don't need parallelism at all
- You can use libraries (which may be written in MPI)
Why learn MPI?

Portable
Expressive
Good way to learn about subtle issues in parallel computing
Getting started
Writing MPI programs
Compiling and linking
Running MPI programs
More information
Using MPI, by William Gropp, Ewing Lusk, and Anthony Skjellum

The LAM companion to "Using MPI", by Zdzislaw Meglicki

Designing and Building Parallel Programs, by Ian Foster

A Tutorial/User's Guide for MPI, by Peter Pacheco (ftp://math.usfca.edu/pub/MPI/mpi.guide.ps)

The MPI standard and other information is available at http://www.mcs.anl.gov/mpi. Also the source for several implementations.
Writing MPI programs
    #include "mpi.h"
    #include <stdio.h>

    int main( argc, argv )
    int argc;
    char **argv;
    {
        MPI_Init( &argc, &argv );
        printf( "Hello world\n" );
        MPI_Finalize();
        return 0;
    }
Commentary
#include "mpi.h" provides basic MPI definitions and types.

MPI_Init starts MPI.

MPI_Finalize exits MPI.

Note that all non-MPI routines are local; thus the printf runs on each process.
Compiling and linking

For simple programs, special compiler commands can be used. For large projects, it is best to use a standard Makefile.

The MPICH implementation provides the commands mpicc and mpif77, as well as Makefile examples in /usr/local/mpi/examples/Makefile.in.
Special compilation commands

The commands

    mpicc -o first first.c
    mpif77 -o firstf firstf.f

may be used to build simple programs when using MPICH.

These provide special options that exploit the profiling features of MPI:

    -mpilog     Generate log files of MPI calls
    -mpitrace   Trace execution of MPI calls
    -mpianim    Real-time animation of MPI (not available on all systems)

These are specific to the MPICH implementation; other implementations may provide similar commands (e.g., mpcc and mpxlf on the IBM SP).
Using Makefiles

The file Makefile.in is a template Makefile. The program (script) mpireconfig translates this to a Makefile for a particular system. This allows you to use the same Makefile for a network of workstations and a massively parallel computer, even when they use different compilers, libraries, and linker options.

    mpireconfig Makefile

Note that you must have mpireconfig in your PATH.
Sample Makefile.in

    ##### User configurable options #####

    ARCH        = ...
    COMM        = ...
    INSTALL_DIR = ...
    CC          = ...
    F77         = ...
    CLINKER     = ...
    FLINKER     = ...
    OPTFLAGS    = ...
    #
    LIB_PATH    = -L$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
    FLIB_PATH   = $(FLIB_PATH_LEADER)$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
    LIB_LIST    = ...
    #
    INCLUDE_DIR = $(INCLUDE_PATH) -I$(INSTALL_DIR)/include

    ### End User configurable options ###

(The values of the configurable options are filled in by mpireconfig for the particular system.)
Sample Makefile.in (cont.)

    CFLAGS  = $(OPTFLAGS) -I$(INCLUDE_DIR) -DMPI_$(ARCH)
    FFLAGS  = $(OPTFLAGS) $(INCLUDE_DIR)
    LIBS    = $(LIB_PATH) $(LIB_LIST)
    FLIBS   = $(FLIB_PATH) $(LIB_LIST)
    EXECS   = hello

    default: hello

    all: $(EXECS)

    hello: hello.o $(INSTALL_DIR)/include/mpi.h
            $(CLINKER) $(OPTFLAGS) -o hello hello.o $(LIB_PATH) $(LIB_LIST) -lm

    clean:
            /bin/rm -f *.o *~ PI* $(EXECS)

    .c.o:
            $(CC) $(CFLAGS) -c $*.c
    .f.o:
            $(F77) $(FFLAGS) -c $*.f
Running MPI programs

    mpirun -np 2 hello

mpirun is not part of the standard, but some version of it is common with several MPI implementations. The version shown here is for the MPICH implementation of MPI.

Just as Fortran does not specify how Fortran programs are started, MPI does not specify how MPI programs are started.

The option -t shows the commands that mpirun would execute; you can use this to find out how mpirun starts programs on your system. The option -help shows all options to mpirun.
Finding out about the environment

Two of the first questions asked in a parallel program are: How many processes are there? and Who am I?

"How many" is answered with MPI_Comm_size and "who am I" is answered with MPI_Comm_rank.

The rank is a number between zero and size-1.
A simple program
    #include "mpi.h"
    #include <stdio.h>

    int main( argc, argv )
    int argc;
    char **argv;
    {
        int rank, size;
        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        printf( "Hello world! I'm %d of %d\n", rank, size );
        MPI_Finalize();
        return 0;
    }
Caveats
These sample programs have been kept as simple as possible by assuming that all processes can do output. Not all parallel systems provide this feature, and MPI provides a way to handle this case.
Exercise: Getting Started

Objective: Learn how to login, write, compile, and run a simple MPI program.

Run the "Hello world" programs. Try two different parallel computers. What does the output look like?
Sending and Receiving messages

    Process 0               Process 1
    A:
    Send          ------>   Recv      B:

Questions:
- To whom is data sent?
- What is sent?
- How does the receiver identify it?
Current Message-Passing

A typical blocking send looks like

    send( dest, type, address, length )

where
- dest is an integer identifier representing the process to receive the message.
- type is a nonnegative integer that the destination can use to selectively screen messages.
- (address, length) describes a contiguous area in memory containing the message to be sent.

A typical global operation looks like:

    broadcast( type, address, length )

All of these specifications are a good match to hardware, easy to understand, but too inflexible.
The Buffer

Sending and receiving only a contiguous array of bytes:
- hides the real data structure from hardware which might be able to handle it directly
- requires pre-packing dispersed data:
  - rows of a matrix stored columnwise
  - general collections of structures
- prevents communications between machines with different representations (even lengths) for same data type
Generalizing the Buffer Description

Specified in MPI by starting address, datatype, and count, where datatype is:
- elementary (all C and Fortran datatypes)
- contiguous array of datatypes
- strided blocks of datatypes
- indexed array of blocks of datatypes
- general structure

Datatypes are constructed recursively.

Specification of elementary datatypes allows heterogeneous communication.

Elimination of length in favor of count is clearer.

Specifying application-oriented layout of data allows maximal use of special hardware.
Generalizing the Type

A single type field is too constraining. Often overloaded to provide needed flexibility.

Problems:
- under user control
- wild cards allowed (MPI_ANY_TAG)
- library use conflicts with user and with other libraries
Sample Program using Library Calls

Sub1 and Sub2 are from different libraries:

    Sub1();
    Sub2();

Sub1a and Sub1b are from the same library:

    Sub1a();
    Sub2();
    Sub1b();

Thanks to Marc Snir for the following four examples.
Correct Execution of Library Calls

    [Timing diagram: Processes 0, 1, and 2. Within Sub1, the recv(any)
    calls are matched by the send(0) and send(1) calls issued inside
    Sub1; within Sub2, each send is matched by the intended receive.]
Incorrect Execution of Library Calls

    [Timing diagram: the same processes and calls, but with different
    timing, a recv(any) posted inside Sub1 is matched by a send that
    was intended for Sub2, so the messages of the two libraries
    interfere.]
Correct Execution of Library Calls with Pending Communication

    [Timing diagram: Sub1 is split into Sub1a and Sub1b, with a receive
    still pending across the intervening call to Sub2. The message sent
    in Sub1a is received by the recv(any) in Sub1b, and the sends and
    receives inside Sub2 match each other as intended.]
Incorrect Execution of Library Calls with Pending Communication

    [Timing diagram: the same split (Sub1a, Sub2, Sub1b), but the
    pending recv(any) catches a message sent inside Sub2, so the
    library's pending communication and the Sub2 traffic are confused.]
Solution to the type problem

A separate communication context for each family of messages, used for queueing and matching. (This has often been simulated in the past by overloading the tag field.)

No wild cards allowed, for security.

Allocated by the system, for security.

Types (tags) in MPI are retained for normal use (wild cards OK).
Delimiting Scope of Communication

Separate groups of processes working on subproblems
- Merging of process name space interferes with modularity
- "Local" process identifiers desirable

Parallel invocation of parallel libraries
- Messages from application must be kept separate from messages internal to library
- Knowledge of library message types interferes with modularity
- Synchronizing before and after library calls is undesirable
Generalizing the Process Identifier

Collective operations typically operated on all processes (although some systems provide subgroups).

This is too restrictive (e.g., need minimum over a column or a sum across a row of processes).

MPI provides groups of processes:
- initial "all" group
- group management routines (build, delete groups)

All communication (not just collective operations) takes place in groups.

A group and a context are combined in a communicator.

Source/destination in send/receive operations refer to rank in the group associated with a given communicator. MPI_ANY_SOURCE is permitted in a receive.
MPI Basic Send/Receive

Thus the basic (blocking) send has become:

    MPI_Send( start, count, datatype, dest, tag, comm )

and the receive:

    MPI_Recv( start, count, datatype, source, tag, comm, status )

The source, tag, and count of the message actually received can be retrieved from status.

Two simple collective operations:

    MPI_Bcast( start, count, datatype, root, comm )
    MPI_Reduce( start, result, count, datatype, operation, root, comm )
Getting information about a message

    MPI_Status status;
    MPI_Recv( ..., &status );
    ... status.MPI_TAG;
    ... status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &count );

MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive.

MPI_Get_count may be used to determine how much data of a particular type was received.
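To make these status fields concrete, here is a minimal sketch (not from the original slides; the payload and the tag value 7 are arbitrary) in which process 0 receives from any source with any tag and then inspects the status:

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int rank, count, value = 42;     /* 42 is an arbitrary payload */
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        if (rank == 0) {
            /* Accept a message from any sender with any tag */
            MPI_Recv( &value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &status );
            MPI_Get_count( &status, MPI_INT, &count );
            printf( "Received %d int(s) from %d with tag %d\n",
                    count, status.MPI_SOURCE, status.MPI_TAG );
        }
        else if (rank == 1) {
            /* Tag 7 chosen arbitrarily */
            MPI_Send( &value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD );
        }
        MPI_Finalize();
        return 0;
    }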
Simple Fortran example

    program main
    include 'mpif.h'

    integer rank, size, to, from, tag, count, i, ierr
    integer src, dest
    integer st_source, st_tag, st_count
    integer status(MPI_STATUS_SIZE)
    double precision data(100)

    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
    print *, 'Process ', rank, ' of ', size, ' is alive'

    dest = size - 1
    src  = 0
    if (rank .eq. src) then
       to    = dest
       count = 10
       tag   = 2001
       do 10 i = 1, 10
 10       data(i) = i
       call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
   +                  tag, MPI_COMM_WORLD, ierr )
    else if (rank .eq. dest) then
       tag   = MPI_ANY_TAG
       count = 10
       from  = MPI_ANY_SOURCE
       call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
   +                  tag, MPI_COMM_WORLD, status, ierr )
Simple Fortran example (cont.)

       call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
   +                       st_count, ierr )
       st_source = status(MPI_SOURCE)
       st_tag    = status(MPI_TAG)
       print *, 'Status info: source = ', st_source,
   +            ' tag = ', st_tag, ' count = ', st_count
       print *, rank, ' received', (data(i), i=1,10)
    endif

    call MPI_FINALIZE( ierr )
    end
Six Function MPI

MPI is very simple. These six functions allow you to write many programs:

    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Send
    MPI_Recv
A taste of things to come

The following examples show a C and Fortran version of the same program.

This program computes PI (with a very simple method) but does not use MPI_Send and MPI_Recv. Instead, it uses collective operations to send data to and from all of the running processes. This gives a different six-function MPI set:

    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Bcast
    MPI_Reduce
Broadcast and Reduction

The routine MPI_Bcast sends data from one process to all others.

The routine MPI_Reduce combines data from all processes (by adding them, in this case) and returns the result to a single process.
Fortran example: PI

    program main
    include 'mpif.h'

    double precision PI25DT
    parameter (PI25DT = 3.141592653589793238462643d0)
    double precision mypi, pi, h, sum, x, f, a
    integer n, myid, numprocs, i, rc
c   function to integrate
    f(a) = 4.d0 / (1.d0 + a*a)

    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10 if ( myid .eq. 0 ) then
       write(6,98)
 98    format('Enter the number of intervals: (0 quits)')
       read(5,99) n
 99    format(i10)
    endif
    call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )
Fortran example (cont.)

c   check for quit signal
    if ( n .le. 0 ) goto 30
c   calculate the interval size
    h = 1.0d0 / n
    sum = 0.0d0
    do 20 i = myid+1, n, numprocs
       x = h * (dble(i) - 0.5d0)
       sum = sum + f(x)
 20 continue
    mypi = h * sum
c   collect all the partial sums
    call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
   +                 MPI_COMM_WORLD, ierr )
c   node 0 prints the answer
    if (myid .eq. 0) then
       write(6, 97) pi, abs(pi - PI25DT)
 97    format('  pi is approximately: ', F18.16,
   +          '  Error is: ', F18.16)
    endif
    goto 10
 30 call MPI_FINALIZE(rc)
    stop
    end
C example: PI

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main( argc, argv )
    int  argc;
    char **argv;
    {
        int    done = 0, n, myid, numprocs, i, rc;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x, a;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
        MPI_Comm_rank( MPI_COMM_WORLD, &myid );
C example (cont.)

        while (!done)
        {
            if (myid == 0) {
                printf( "Enter the number of intervals: (0 quits) " );
                scanf( "%d", &n );
            }
            MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );
            if (n == 0) break;

            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;

            MPI_Reduce( &mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                        MPI_COMM_WORLD );

            if (myid == 0)
                printf( "pi is approximately %.16f, Error is %.16f\n",
                        pi, fabs(pi - PI25DT) );
        }
        MPI_Finalize();
        return 0;
    }
Exercise: PI

Objective: Experiment with send/receive.

Run either program for PI. Write new versions that replace the calls to MPI_Bcast and MPI_Reduce with MPI_Send and MPI_Recv.

The MPI broadcast and reduce operations use at most log p send and receive operations on each process, where p is the size of MPI_COMM_WORLD. How many operations do your versions use?
Exercise: Ring

Objective: Experiment with send/receive.

Write a program to send a message around a ring of processors. That is, processor 0 sends to processor 1, who sends to processor 2, etc. The last processor returns the message to processor 0.

You can use the routine MPI_Wtime to time code in MPI. The statement

    t = MPI_Wtime();

returns the time as a double (DOUBLE PRECISION in Fortran).
Topologies

MPI provides routines to provide structure to collections of processes.

This helps to answer the question: Who are my neighbors?
Cartesian Topologies

A Cartesian topology is a mesh.

Example of a Cartesian mesh with arrows pointing at the right neighbors:

    (0,2) -> (1,2) -> (2,2) -> (3,2)
    (0,1) -> (1,1) -> (2,1) -> (3,1)
    (0,0) -> (1,0) -> (2,0) -> (3,0)
Defining a Cartesian Topology

The routine MPI_Cart_create creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndim argument.

    dims(1)    = 4
    dims(2)    = 3
    periods(1) = .false.
    periods(2) = .false.
    reorder    = .true.
    ndim       = 2
    call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
   +                      periods, reorder, comm2d, ierr )
Finding neighbors

MPI_Cart_create creates a new communicator with the same processes as the input communicator, but with the specified topology.

The question "Who are my neighbors?" can now be answered with MPI_Cart_shift:

    call MPI_CART_SHIFT( comm2d, 0, 1, nbrleft, nbrright, ierr )
    call MPI_CART_SHIFT( comm2d, 1, 1, nbrbottom, nbrtop, ierr )

The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by 1 in the two dimensions.
Who am I?

Can be answered with

    integer coords(2)
    call MPI_COMM_RANK( comm2d, myrank, ierr )
    call MPI_CART_COORDS( comm2d, myrank, 2, coords, ierr )

Returns the Cartesian coordinates of the calling process in coords.
Partitioning

When creating a Cartesian topology, one question is: What is a good choice for the decomposition of the processors?

This question can be answered with MPI_Dims_create:

    integer dims(2)
    dims(1) = 0
    dims(2) = 0
    call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
    call MPI_DIMS_CREATE( size, 2, dims, ierr )
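Putting the topology routines together, the following is a minimal C sketch (not from the original slides; a 2-D non-periodic grid, with variable names chosen for illustration) that decomposes MPI_COMM_WORLD, builds a Cartesian communicator, and finds each process's coordinates and neighbors:

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int size, rank, dims[2] = {0, 0}, periods[2] = {0, 0};
        int coords[2], nbrleft, nbrright, nbrbottom, nbrtop;
        MPI_Comm comm2d;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        /* Let MPI choose a balanced 2-D decomposition of size processes */
        MPI_Dims_create( size, 2, dims );
        /* Create the Cartesian communicator (reorder = 1 allows rank reordering) */
        MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d );

        MPI_Comm_rank( comm2d, &rank );
        MPI_Cart_coords( comm2d, rank, 2, coords );
        /* Neighbors one step away in each dimension */
        MPI_Cart_shift( comm2d, 0, 1, &nbrleft, &nbrright );
        MPI_Cart_shift( comm2d, 1, 1, &nbrbottom, &nbrtop );

        printf( "rank %d at (%d,%d): left %d right %d bottom %d top %d\n",
                rank, coords[0], coords[1],
                nbrleft, nbrright, nbrbottom, nbrtop );

        MPI_Comm_free( &comm2d );
        MPI_Finalize();
        return 0;
    }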
Other Topology Routines

MPI contains routines to translate between Cartesian coordinates and ranks in a communicator, and to access the properties of a Cartesian topology.

The routine MPI_Graph_create allows the creation of a general graph topology.
Why are these routines in MPI?

In many parallel computer interconnects, some processors are closer to some than to others. These routines allow the MPI implementation to provide an ordering of processes in a topology that makes logical neighbors close in the physical interconnect.

Some parallel programmers may remember hypercubes and the effort that went into assigning nodes in a mesh to processors in a hypercube through the use of Grey codes. Many new systems have different interconnects; ones with multiple paths may have notions of near neighbors that change with time. These routines free the programmer from many of these considerations. The reorder argument is used to request the best ordering.
The periods argument

Who are my neighbors if I am at the edge of a Cartesian mesh?
Periodic Grids

Specify this in MPI_Cart_create with

    dims(1)    = 4
    dims(2)    = 3
    periods(1) = .TRUE.
    periods(2) = .TRUE.
    reorder    = .true.
    ndim       = 2
    call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
   +                      periods, reorder, comm2d, ierr )
Nonperiodic Grids

In the nonperiodic case, a neighbor may not exist. This is indicated by a rank of MPI_PROC_NULL.

This rank may be used in send and receive calls in MPI. The action in both cases is as if the call was not made.
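A small sketch of why this is convenient (an illustration, not from the original slides; it reuses the 2-D decomposition above and MPI_Sendrecv, which is introduced later): every process shifts a value to its right neighbor, and no special code is needed at the edges because transfers to or from MPI_PROC_NULL simply do nothing.

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int rank, size, dims[2] = {0, 0}, periods[2] = {0, 0};  /* non-periodic */
        int nbrleft, nbrright;
        double sendval, recvval = -1.0;
        MPI_Comm comm2d;
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        MPI_Dims_create( size, 2, dims );
        MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d );
        MPI_Comm_rank( comm2d, &rank );
        MPI_Cart_shift( comm2d, 0, 1, &nbrleft, &nbrright );

        /* Shift a value to the right neighbor.  At the edges nbrleft or
           nbrright is MPI_PROC_NULL, so that send or receive does nothing
           and no special-case code is needed. */
        sendval = (double) rank;
        MPI_Sendrecv( &sendval, 1, MPI_DOUBLE, nbrright, 0,
                      &recvval, 1, MPI_DOUBLE, nbrleft,  0,
                      comm2d, &status );
        printf( "rank %d received %f\n", rank, recvval );

        MPI_Comm_free( &comm2d );
        MPI_Finalize();
        return 0;
    }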
Collective Communications in MPI

Communication is coordinated among a group of processes.

Groups can be constructed "by hand" with MPI group-manipulation routines or by using MPI topology-definition routines.

Message tags are not used. Different communicators are used instead.

No non-blocking collective operations.

Three classes of collective operations:
- synchronization
- data movement
- collective computation
Synchronization

    MPI_Barrier( comm )

Function blocks until all processes in comm call it.
Available Collective Patterns

    [Schematic representation of collective data movement in MPI among
    processes P0-P3: Broadcast replicates A from one process to all;
    Scatter distributes the pieces A0, A1, A2, A3 from one process, one
    per process; Gather is the inverse of Scatter; All gather leaves
    every process with the full set A, B, C, D; All to All delivers to
    process i the i-th piece from every process (a transpose of the
    data layout).]
Available Collective Computation Patterns

    [Schematic representation of collective computation in MPI among
    processes P0-P3: Reduce combines the values A, B, C, D from all
    processes into a single result (ABCD) on one process; Scan leaves
    process i with the partial result over the first i+1 values
    (A, AB, ABC, ABCD).]
MPI Collective Routines

Many routines:

    Allgather       Allgatherv      Allreduce
    Alltoall        Alltoallv       Bcast
    Gather          Gatherv         Reduce
    Reduce_scatter  Scan            Scatter
    Scatterv

"All" versions deliver results to all participating processes.

"V" versions allow the chunks to have different sizes.

Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combination functions.
Built-in Collective Computation Operations

    MPI Name        Operation
    MPI_MAX         Maximum
    MPI_MIN         Minimum
    MPI_PROD        Product
    MPI_SUM         Sum
    MPI_LAND        Logical and
    MPI_LOR         Logical or
    MPI_LXOR        Logical exclusive or (xor)
    MPI_BAND        Bitwise and
    MPI_BOR         Bitwise or
    MPI_BXOR        Bitwise xor
    MPI_MAXLOC      Maximum value and location
    MPI_MINLOC      Minimum value and location
Defining Your Own Collective Operations

    MPI_Op_create( user_function, commute, op )
    MPI_Op_free( op )

    user_function( invec, inoutvec, len, datatype )

The user function should perform

    inoutvec[i] = invec[i] op inoutvec[i];

for i from 0 to len-1.

user_function can be non-commutative (e.g., matrix multiply).
Sample user function

For example, to create an operation that has the same effect as MPI_SUM on Fortran double precision values, use

    subroutine myfunc( invec, inoutvec, len, datatype )
    integer len, datatype
    double precision invec(len), inoutvec(len)
    integer i
    do 10 i=1,len
 10    inoutvec(i) = invec(i) + inoutvec(i)
    return
    end

To use, just

    integer myop
    call MPI_Op_create( myfunc, .true., myop, ierr )
    call MPI_Reduce( a, b, 1, MPI_DOUBLE_PRECISION, myop, 0, comm, ierr )

The routine MPI_Op_free destroys user functions when they are no longer needed.
Defining groups

All MPI communication is relative to a communicator, which contains a context and a group. The group is just a set of processes.
Subdividing a communicator

The easiest way to create communicators with new groups is with MPI_COMM_SPLIT.

For example, to form groups of rows of processes:

              Column
              0  1  2  3  4
    Row  0    .  .  .  .  .
         1    .  .  .  .  .
         2    .  .  .  .  .

use

    MPI_Comm_split( oldcomm, row, 0, &newcomm );

To maintain the order by rank, use

    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, row, rank, &newcomm );
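A self-contained sketch of the row split (an illustration, not from the original slides; the assumption that ranks are grouped into rows of 4 is arbitrary):

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int rank, row, rowrank;
        MPI_Comm rowcomm;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );

        /* Assume processes are arranged by rank into rows of 4:
           ranks 0-3 form row 0, ranks 4-7 form row 1, and so on. */
        row = rank / 4;

        /* All processes with the same "color" (row) end up in the same
           new communicator; the key (rank) preserves the original order. */
        MPI_Comm_split( MPI_COMM_WORLD, row, rank, &rowcomm );
        MPI_Comm_rank( rowcomm, &rowrank );

        printf( "world rank %d is rank %d in row %d\n", rank, rowrank, row );

        MPI_Comm_free( &rowcomm );
        MPI_Finalize();
        return 0;
    }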
Subdividing (cont.)

Similarly, to form groups of columns,

              Column
              0  1  2  3  4
    Row  0    .  .  .  .  .
         1    .  .  .  .  .
         2    .  .  .  .  .

use

    MPI_Comm_split( oldcomm, column, 0, &newcomm2 );

To maintain the order by rank, use

    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, column, rank, &newcomm2 );
Manipulating Groups

Another way to create a communicator with specific members is to use MPI_Comm_create:

    MPI_Comm_create( oldcomm, group, &newcomm );

The group can be created in many ways.
Creating Groups

All group creation routines create a group by specifying the members to take from an existing group.

MPI_Group_incl specifies specific members.

MPI_Group_excl excludes specific members.

MPI_Group_range_incl and MPI_Group_range_excl use ranges of members.

MPI_Group_union and MPI_Group_intersection create a new group from two existing groups.

To get an existing group, use

    MPI_Comm_group( oldcomm, &group );

Free a group with

    MPI_Group_free( &group );
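As an illustration (not from the original slides; the choice of taking the even-ranked processes is arbitrary), the following sketch extracts a subgroup and builds a communicator for it:

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    int main( int argc, char **argv )
    {
        int rank, size, i, nevens, *ranks;
        MPI_Group worldgroup, evengroup;
        MPI_Comm evencomm;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        /* Build the list of even ranks: 0, 2, 4, ... */
        nevens = (size + 1) / 2;
        ranks  = (int *) malloc( nevens * sizeof(int) );
        for (i = 0; i < nevens; i++) ranks[i] = 2 * i;

        MPI_Comm_group( MPI_COMM_WORLD, &worldgroup );
        MPI_Group_incl( worldgroup, nevens, ranks, &evengroup );
        /* Processes not in the group get MPI_COMM_NULL back */
        MPI_Comm_create( MPI_COMM_WORLD, evengroup, &evencomm );

        if (evencomm != MPI_COMM_NULL) {
            int evenrank;
            MPI_Comm_rank( evencomm, &evenrank );
            printf( "world rank %d is rank %d among the even ranks\n",
                    rank, evenrank );
            MPI_Comm_free( &evencomm );
        }
        MPI_Group_free( &evengroup );
        MPI_Group_free( &worldgroup );
        free( ranks );
        MPI_Finalize();
        return 0;
    }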
Buffering issues

Where does data go when you send it? One possibility is:

    Process 1                                            Process 2
    A: --> Local Buffer --> The Network --> Local Buffer --> B:
B:
Better buering
This is not very ecient There are three
copies in b etween
A:
addition to the exchange of data pro cesses We prefer
Process 1
Process 2
But this
not return until the data has
or that we allow a send op eration
b efore completing the transfer In this case we need to test for completion later
requires that either
that MPISend
b een
delivered to return
B:
Blocking and Non-Blocking communication

So far we have used blocking communication:
- MPI_Send does not complete until the buffer is empty (available for reuse).
- MPI_Recv does not complete until the buffer is full (available for use).

Simple, but can be "unsafe":

    Process 0        Process 1
    Send(1)          Send(0)
    Recv(1)          Recv(0)

Completion depends in general on size of message and amount of system buffering.

Send works for small enough messages but fails when messages get too large. "Too large" ranges from zero bytes to 100's of Megabytes.
Some Solutions to the Unsafe Problem

Order the operations more carefully:

    Process 0        Process 1
    Send(1)          Recv(0)
    Recv(1)          Send(0)

Supply a receive buffer at the same time as the send, with MPI_Sendrecv:

    Process 0        Process 1
    Sendrecv(1)      Sendrecv(0)

Use non-blocking operations:

    Process 0        Process 1
    Isend(1)         Isend(0)
    Irecv(1)         Irecv(0)
    Waitall          Waitall

Use MPI_Bsend.
MPI's Non-Blocking Operations

Non-blocking operations return (immediately) "request handles" that can be waited on and queried:

    MPI_Isend( start, count, datatype, dest, tag, comm, request )
    MPI_Irecv( start, count, datatype, source, tag, comm, request )
    MPI_Wait( request, status )

One can also test without waiting:

    MPI_Test( request, flag, status )
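A minimal sketch of the non-blocking exchange from the previous slide (an illustration, not from the original slides; it assumes the first two processes exchange data and the message length is arbitrary):

    #include "mpi.h"
    #include <stdio.h>

    #define N 1000   /* arbitrary message length */

    int main( int argc, char **argv )
    {
        int rank, other, i;
        double sendbuf[N], recvbuf[N];
        MPI_Request requests[2];
        MPI_Status  statuses[2];

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        if (rank < 2) {                      /* only ranks 0 and 1 take part */
            other = 1 - rank;
            for (i = 0; i < N; i++) sendbuf[i] = rank;

            /* Post both operations, then wait for both; neither process
               blocks waiting for the other to receive first. */
            MPI_Irecv( recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                       &requests[0] );
            MPI_Isend( sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                       &requests[1] );
            MPI_Waitall( 2, requests, statuses );
            printf( "rank %d received data from %d\n", rank, other );
        }
        MPI_Finalize();
        return 0;
    }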
Multiple completions

It is often desirable to wait on multiple requests. An example is a master/slave program, where the master waits for one or more slaves to send it a message.

    MPI_Waitall( count, array_of_requests, array_of_statuses )
    MPI_Waitany( count, array_of_requests, index, status )
    MPI_Waitsome( incount, array_of_requests, outcount,
                  array_of_indices, array_of_statuses )

There are corresponding versions of test for each of these.

MPI_WAITSOME and MPI_TESTSOME may be used to implement master/slave algorithms that provide fair access to the master by the slaves.
Fairness

What happens with this program:

    #include "mpi.h"
    #include <stdio.h>
    int main( argc, argv )
    int argc;
    char **argv;
    {
        int rank, size, i, buf[1];
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        if (rank == 0) {
            for (i=0; i<100*(size-1); i++) {
                MPI_Recv( buf, 1, MPI_INT, MPI_ANY_SOURCE,
                          MPI_ANY_TAG, MPI_COMM_WORLD, &status );
                printf( "Msg from %d with tag %d\n",
                        status.MPI_SOURCE, status.MPI_TAG );
            }
        } else {
            for (i=0; i<100; i++)
                MPI_Send( buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD );
        }
        MPI_Finalize();
        return 0;
    }
Fairness in message-passing

A parallel algorithm is fair if no process is effectively ignored. In the preceding program, processes with low rank (like process zero) may be the only ones whose messages are received.

MPI makes no guarantees about fairness. However, MPI makes it possible to write efficient, fair programs.
Providing Fairness

One alternative is

    #define large 128
    MPI_Request requests[large];
    MPI_Status  statuses[large];
    int         indices[large];
    int         buf[large];
    for (i=1; i<size; i++)
        MPI_Irecv( buf+i, 1, MPI_INT, i,
                   MPI_ANY_TAG, MPI_COMM_WORLD, &requests[i-1] );
    while (!done) {
        MPI_Waitsome( size-1, requests, &ndone, indices, statuses );
        for (i=0; i<ndone; i++) {
            j = indices[i];
            printf( "Msg from %d with tag %d\n",
                    statuses[i].MPI_SOURCE, statuses[i].MPI_TAG );
            MPI_Irecv( buf+j, 1, MPI_INT, j,
                       MPI_ANY_TAG, MPI_COMM_WORLD, &requests[j] );
        }
    }
Providing Fairness (Fortran)

One alternative is

       parameter( large = 128 )
       integer requests(large)
       integer statuses(MPI_STATUS_SIZE,large)
       integer indices(large)
       integer buf(large)
       logical done
       do 10 i = 1, size-1
 10       call MPI_Irecv( buf(i), 1, MPI_INTEGER, i,
   *                      MPI_ANY_TAG, MPI_COMM_WORLD, requests(i), ierr )
 20    if (.not. done) then
          call MPI_Waitsome( size-1, requests, ndone,
   *                         indices, statuses, ierr )
          do 30 i = 1, ndone
             j = indices(i)
             print *, 'Msg from ', statuses(MPI_SOURCE,i), ' with tag ',
   *                  statuses(MPI_TAG,i)
             call MPI_Irecv( buf(j), 1, MPI_INTEGER, j,
   *                         MPI_ANY_TAG, MPI_COMM_WORLD, requests(j), ierr )
 30       continue
          goto 20
       endif
Exercise: Fairness

Objective: Use non-blocking communications.

Complete the program fragment on "providing fairness". Make sure that you leave no uncompleted requests. How would you test your program?
More on nonblocking communication

In applications where the time to send data between processes is large, it is often helpful to cause communication and computation to overlap. This can easily be done with MPI's non-blocking routines.

For example, in a 2-D finite difference mesh, moving data needed for the boundaries can be done at the same time as computation on the interior.

    for each ghost edge
        MPI_Irecv( ghost edge data, ..., requests )
    for each ghost edge
        MPI_Isend( data for each ghost edge, ..., requests )
    compute on interior
    while (still some uncompleted requests) {
        MPI_Waitany( requests, ... )
        if (request is a receive)
            compute on that edge
    }

Note that we call MPI_Waitany several times. This exploits the fact that after a request is satisfied, it is set to MPI_REQUEST_NULL, and that this is a valid request object to the wait and test routines.
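A condensed C rendering of this pattern (a sketch, not from the original slides; EDGELEN, the neighbor array, and the compute_* routines are placeholders for application-specific pieces):

    #include "mpi.h"

    #define EDGELEN 100                     /* placeholder edge length */

    void compute_interior( void ) { /* ... application-specific ... */ }
    void compute_edge( int which ) { /* ... application-specific ... */ }

    /* Exchange the four ghost edges of a 2-D decomposition while computing
       on the interior.  nbr[] holds neighbor ranks (may be MPI_PROC_NULL). */
    void exchange_and_compute( MPI_Comm comm2d, int nbr[4],
                               double sendbuf[4][EDGELEN],
                               double recvbuf[4][EDGELEN] )
    {
        MPI_Request requests[8];
        MPI_Status  status;
        int i, idx;

        for (i = 0; i < 4; i++)
            MPI_Irecv( recvbuf[i], EDGELEN, MPI_DOUBLE, nbr[i], 0,
                       comm2d, &requests[i] );
        for (i = 0; i < 4; i++)
            MPI_Isend( sendbuf[i], EDGELEN, MPI_DOUBLE, nbr[i], 0,
                       comm2d, &requests[4 + i] );

        compute_interior();                 /* needs no ghost data */

        for (i = 0; i < 8; i++) {
            /* Completed requests become MPI_REQUEST_NULL, so the whole
               array can be passed to MPI_Waitany repeatedly. */
            MPI_Waitany( 8, requests, &idx, &status );
            if (idx < 4)                    /* a receive completed */
                compute_edge( idx );        /* update that edge */
        }
    }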
Communication Modes

MPI provides multiple modes for sending messages:

Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun. (Unsafe programs become incorrect and usually deadlock within an MPI_Ssend.)

Buffered mode (MPI_Bsend): the user supplies the buffer to the system for its use. (User supplies enough memory to make an unsafe program safe.)

Ready mode (MPI_Rsend): user guarantees that the matching receive has been posted.
- allows access to fast protocols
- undefined behavior if the matching receive is not posted

Non-blocking versions: MPI_Issend, MPI_Irsend, MPI_Ibsend.

Note that an MPI_Recv may receive messages sent with any send mode.
Buffered Send

MPI provides a send routine that may be used when MPI_Isend is awkward to use (e.g., lots of small messages).

MPI_Bsend makes use of a user-provided buffer to save any messages that can not be immediately sent.

    int  bufsize;
    char *buf = malloc( bufsize );
    MPI_Buffer_attach( buf, bufsize );
    ...
    MPI_Bsend( ... same as MPI_Send ... );
    ...
    MPI_Buffer_detach( &buf, &bufsize );

The MPI_Buffer_detach call does not complete until all messages are sent.

The performance of MPI_Bsend depends on the implementation of MPI and may also depend on the size of the message. For example, making a message one byte longer may cause a significant drop in performance.
Reusing the same buffer

Consider a loop

    MPI_Buffer_attach( buf, bufsize );
    while (!done) {
        ...
        MPI_Bsend( ... );
    }

where buf is large enough to hold the message in the MPI_Bsend. This code may fail because the buffer fills up with messages that have not yet been delivered. One fix is to detach and re-attach the buffer inside the loop (the detach waits until the buffered messages have been delivered):

    {
        void *buf; int bufsize;
        MPI_Buffer_detach( &buf, &bufsize );
        MPI_Buffer_attach( buf, bufsize );
    }
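A small self-contained sketch of this detach/re-attach pattern (an illustration, not from the original slides; the message count is arbitrary, and the buffer is sized with MPI_Pack_size plus MPI_BSEND_OVERHEAD, the usual way to size a Bsend buffer):

    #include "mpi.h"
    #include <stdlib.h>

    int main( int argc, char **argv )
    {
        int rank, i, packsize, bufsize, value;
        void *buf;
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );

        /* Room for one buffered message of a single int */
        MPI_Pack_size( 1, MPI_INT, MPI_COMM_WORLD, &packsize );
        bufsize = packsize + MPI_BSEND_OVERHEAD;
        buf = malloc( bufsize );
        MPI_Buffer_attach( buf, bufsize );

        for (i = 0; i < 10; i++) {          /* 10 iterations, arbitrarily */
            if (rank == 0) {
                value = i;
                MPI_Bsend( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
                /* Detach (waits until the message is delivered) and
                   re-attach so the buffer can be reused next iteration */
                MPI_Buffer_detach( &buf, &bufsize );
                MPI_Buffer_attach( buf, bufsize );
            }
            else if (rank == 1) {
                MPI_Recv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
            }
        }

        MPI_Buffer_detach( &buf, &bufsize );
        free( buf );
        MPI_Finalize();
        return 0;
    }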
Other Point-to-Point Features

MPI_SENDRECV, MPI_SENDRECV_REPLACE

MPI_CANCEL

Persistent communication requests
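Persistent requests deserve a brief illustration (a sketch, not from the original slides; it assumes the same two-process exchange as earlier): the send and receive arguments are bound once, and each iteration only starts and completes the communication.

    #include "mpi.h"

    #define N 1000                          /* arbitrary message length */

    int main( int argc, char **argv )
    {
        int rank, other, iter;
        double sendbuf[N], recvbuf[N];
        MPI_Request requests[2];
        MPI_Status  statuses[2];

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        if (rank < 2) {                     /* only ranks 0 and 1 take part */
            other = 1 - rank;

            /* Bind the arguments once */
            MPI_Recv_init( recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                           &requests[0] );
            MPI_Send_init( sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                           &requests[1] );

            for (iter = 0; iter < 100; iter++) {
                MPI_Startall( 2, requests );
                /* ... could compute here while the exchange proceeds ... */
                MPI_Waitall( 2, requests, statuses );
            }

            MPI_Request_free( &requests[0] );
            MPI_Request_free( &requests[1] );
        }
        MPI_Finalize();
        return 0;
    }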
Datatypes and Heterogeneity

MPI datatypes have two main purposes:
- Heterogeneity: parallel programs between different processors
- Noncontiguous data: structures, vectors with non-unit stride, etc.

Basic datatypes, corresponding to the underlying language, are predefined.

The user can construct new datatypes at run time; these are called derived datatypes.
Datatypes in MPI

    Elementary:  Language-defined types (e.g., MPI_INT or MPI_DOUBLE_PRECISION)
    Contiguous:  Vector with stride of one
    Vector:      Separated by constant "stride"
    Hvector:     Vector, with stride in bytes
    Indexed:     Array of indices (for scatter/gather)
    Hindexed:    Indexed, with indices in bytes
    Struct:      General mixed types (for C structs etc.)
Basic Datatypes (Fortran)

    MPI datatype            Fortran datatype
    MPI_INTEGER             INTEGER
    MPI_REAL                REAL
    MPI_DOUBLE_PRECISION    DOUBLE PRECISION
    MPI_COMPLEX             COMPLEX
    MPI_LOGICAL             LOGICAL
    MPI_CHARACTER           CHARACTER
    MPI_BYTE
    MPI_PACKED
Basic Datatypes (C)

    MPI datatype            C datatype
    MPI_CHAR                signed char
    MPI_SHORT               signed short int
    MPI_INT                 signed int
    MPI_LONG                signed long int
    MPI_UNSIGNED_CHAR       unsigned char
    MPI_UNSIGNED_SHORT      unsigned short int
    MPI_UNSIGNED            unsigned int
    MPI_UNSIGNED_LONG       unsigned long int
    MPI_FLOAT               float
    MPI_DOUBLE              double
    MPI_LONG_DOUBLE         long double
    MPI_BYTE
    MPI_PACKED
Vectors

    [Figure: a 7 x 5 array stored columnwise (elements numbered 1 to 35
    down each column); the row containing 3, 10, 17, 24, 31 is
    highlighted.  The elements of this row are separated by a constant
    stride of 7.]

To specify this row (in C order), we can use

    MPI_Type_vector( count, blocklen, stride, oldtype, &newtype );
    MPI_Type_commit( &newtype );

The exact code for this is

    MPI_Type_vector( 5, 1, 7, MPI_DOUBLE, &newtype );
    MPI_Type_commit( &newtype );
Structures

Structures are described by arrays of
- number of elements (array_of_len)
- displacement or location (array_of_displs)
- datatype (array_of_types)

    MPI_Type_struct( count, array_of_len, array_of_displs,
                     array_of_types, &newtype );
Example: Structures

    struct {
       char   display[50];    /* Name of display                */
       int    maxiter;        /* max # of iterations            */
       double xmin, ymin;     /* lower left corner of rectangle */
       double xmax, ymax;     /* upper right corner             */
       int    width;          /* of display in pixels           */
       int    height;         /* of display in pixels           */
    } cmdline;

    /* set up 4 blocks */
    int          blockcounts[4] = {50, 1, 4, 2};
    MPI_Datatype types[4];
    MPI_Aint     displs[4];
    MPI_Datatype cmdtype;

    /* initialize types and displs with addresses of items */
    MPI_Address( &cmdline.display, &displs[0] );
    MPI_Address( &cmdline.maxiter, &displs[1] );
    MPI_Address( &cmdline.xmin,    &displs[2] );
    MPI_Address( &cmdline.width,   &displs[3] );
    types[0] = MPI_CHAR;
    types[1] = MPI_INT;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_INT;
    for (i = 3; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct( 4, blockcounts, displs, types, &cmdtype );
    MPI_Type_commit( &cmdtype );
Strides

The extent of a datatype is (normally) the distance between the first and last member.

    [Figure: memory locations specified by the datatype, with LB (lower
    bound) at the first member, UB (upper bound) after the last member,
    and EXTENT the distance between them.]

You can set an artificial extent by using MPI_UB and MPI_LB in MPI_Type_struct.
Vectors revisited

This code creates a datatype for an arbitrary number of elements in a row of an array stored in Fortran order (column first).

    int          blens[2], displs[2];
    MPI_Datatype types[2], rowtype;
    blens[0]  = 1;
    blens[1]  = 1;
    displs[0] = 0;
    displs[1] = number_in_column * sizeof(double);
    types[0]  = MPI_DOUBLE;
    types[1]  = MPI_UB;
    MPI_Type_struct( 2, blens, displs, types, &rowtype );
    MPI_Type_commit( &rowtype );

To send n elements, you can use

    MPI_Send( buf, n, rowtype, ... );
Structures revisited

When sending an array of a structure, it is important to ensure that MPI and the C compiler have the same value for the size of each structure. The most portable way to do this is to add an MPI_UB to the structure definition for the end of the structure. In the previous example, this is

    /* initialize types and displs with addresses of items */
    MPI_Address( &cmdline.display, &displs[0] );
    MPI_Address( &cmdline.maxiter, &displs[1] );
    MPI_Address( &cmdline.xmin,    &displs[2] );
    MPI_Address( &cmdline.width,   &displs[3] );
    MPI_Address( &cmdline+1,       &displs[4] );
    types[0] = MPI_CHAR;
    types[1] = MPI_INT;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_INT;
    types[4] = MPI_UB;
    for (i = 4; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct( 5, blockcounts, displs, types, &cmdtype );
    MPI_Type_commit( &cmdtype );
Interleaving data

By moving the UB inside the data, you can interleave data.

Consider the matrix

     0  8 16 24 32
     1  9 17 25 33
     2 10 18 26 34
     3 11 19 27 35
     4 12 20 28 36
     5 13 21 29 37
     6 14 22 30 38
     7 15 23 31 39

We wish to send 0-3, 8-11, 16-19, and 24-27 to process 0, and 4-7, 12-15, 20-23, and 28-31 to process 1, etc. How can we do this with MPI_Scatterv?
An interleaved datatype

    MPI_Type_vector( 4, 4, 8, MPI_DOUBLE, &vec );

defines a block of this matrix.

    blens[0]  = 1;              types[0]  = vec;
    blens[1]  = 1;              types[1]  = MPI_UB;
    displs[0] = 0;              displs[1] = sizeof(double);
    MPI_Type_struct( 2, blens, displs, types, &block );

defines a block whose extent is just 1 entry.
Scattering a Matrix

We set the displacement for each block as the location of the first element in the block. This works because MPI_Scatterv uses the extents to determine the start of each piece to send.

    scdispls[0] = 0;
    scdispls[1] = 4;
    scdispls[2] = 32;
    scdispls[3] = 36;
    MPI_Scatterv( sendbuf, sendcounts, scdispls, block,
                  recvbuf, nx * ny, MPI_DOUBLE, 0, MPI_COMM_WORLD );

How would you use the topology routines to make this more general?
Exercises: datatypes

Objective: Learn about datatypes.

1. Write a program to send rows of a matrix stored in column-major form to the other processors.

   Let processor 0 have the entire matrix, which has as many rows as processors.

   Processor 0 sends row i to processor i. Processor i reads that row into a local array that holds only that row. That is, processor 0 has a matrix A(N,M) while the other processors have a row B(M).

   (a) Write the program to handle the case where the matrix is square.
   (b) Write the program to handle a number of columns read from the terminal.

   C programmers may send columns of a matrix stored in row-major form if they prefer.

   If you have time, try one of the following. If you don't have time, think about how you would program these.

2. Write a program to transpose a matrix, where each processor has a part of the matrix. Use topologies to define a 2-dimensional partitioning of the matrix across the processors, and assume that all processors have the same size submatrix.

   (a) Use MPI_Send and MPI_Recv to send the block, transpose the block.
   (b) Use MPI_Sendrecv instead.
   (c) Create a datatype that allows you to receive the block already transposed.

3. Write a program to send the "ghost points" of a 2-dimensional mesh to the neighboring processors. Assume that each processor has the same size subblock.

   (a) Use topologies to find the neighbors.
   (b) Define a datatype for the rows.
   (c) Use MPI_Sendrecv or MPI_IRecv and MPI_Send with MPI_Waitall.
   (d) Use MPI_Isend and MPI_Irecv to start the communication, do some computation on the interior, and then use MPI_Waitany to process the boundaries as they arrive.

   The same approach works for general datastructures, such as unstructured meshes.

4. Do 3, but for 3-dimensional meshes. You will need MPI_Type_Hvector.
Tools for writing libraries

MPI is specifically designed to make it easier to write message-passing libraries.

Communicators solve the tag/source wild-card problem.

Attributes provide a way to attach information to a communicator.
Private communicators

One of the first things that a library should normally do is create a private communicator. This allows the library to send and receive messages that are known only to the library.

    MPI_Comm_dup( old_comm, &new_comm );
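A minimal sketch of how a library might use this (an illustration, not from the original slides; the library name and the idea of storing the duplicated communicator in a handle are assumptions):

    #include "mpi.h"
    #include <stdlib.h>

    /* Hypothetical library handle: all of the library's communication
       uses the private communicator stored here. */
    typedef struct {
        MPI_Comm comm;
    } mylib_t;

    mylib_t *mylib_init( MPI_Comm user_comm )
    {
        mylib_t *lib = (mylib_t *) malloc( sizeof(mylib_t) );
        /* Duplicate the user's communicator: same processes, new context,
           so library messages can never match the user's messages. */
        MPI_Comm_dup( user_comm, &lib->comm );
        return lib;
    }

    void mylib_finalize( mylib_t *lib )
    {
        MPI_Comm_free( &lib->comm );
        free( lib );
    }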
Attributes

Attributes are data that can be attached to one or more communicators.

Attributes are referenced by keyval. Keyvals are created with MPI_KEYVAL_CREATE.

Attributes are attached to a communicator with MPI_Attr_put and their values accessed by MPI_Attr_get.

Operations are defined for what happens to an attribute when it is copied (by creating one communicator from another) or deleted (by deleting a communicator); these are specified when the keyval is created.
What is an attribute?

In C, an attribute is a pointer of type void *. You must allocate storage for the attribute to point to; make sure that you don't use the address of a local variable.

In Fortran, it is a single INTEGER.
Examples of using attributes
Forcing sequential operation
Managing tags
Sequential Sections

    #include "mpi.h"
    #include <stdlib.h>

    static int MPE_Seq_keyval = MPI_KEYVAL_INVALID;

    /* MPE_Seq_begin - Begins a sequential section of code.

       Input Parameters:
       comm - Communicator to sequentialize
       ng   - Number in group.  This many processes are allowed
              to execute at the same time.  Usually one.          */
    void MPE_Seq_begin( comm, ng )
    MPI_Comm comm;
    int      ng;
    {
        int        lidx, np;
        int        flag;
        MPI_Comm   local_comm;
        MPI_Status status;

        /* Get the private communicator for the sequential operations */
        if (MPE_Seq_keyval == MPI_KEYVAL_INVALID) {
            MPI_Keyval_create( MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                               &MPE_Seq_keyval, NULL );
        }
Sequential Sections II

        MPI_Attr_get( comm, MPE_Seq_keyval, (void *)&local_comm, &flag );
        if (!flag) {
            /* This expects a communicator to be a pointer */
            MPI_Comm_dup( comm, &local_comm );
            MPI_Attr_put( comm, MPE_Seq_keyval, (void *)local_comm );
        }
        MPI_Comm_rank( comm, &lidx );
        MPI_Comm_size( comm, &np );
        if (lidx != 0) {
            MPI_Recv( NULL, 0, MPI_INT, lidx-1, 0, local_comm, &status );
        }
        /* Send to the next process in the group unless we
           are the last process in the processor set */
        if ( (lidx % ng) < ng - 1 && lidx != np - 1 ) {
            MPI_Send( NULL, 0, MPI_INT, lidx + 1, 0, local_comm );
        }
    }
Sequential Sections III

    /* MPE_Seq_end - Ends a sequential section of code.

       Input Parameters:
       comm - Communicator to sequentialize
       ng   - Number in group                                     */
    void MPE_Seq_end( comm, ng )
    MPI_Comm comm;
    int      ng;
    {
        int        lidx, np, flag;
        MPI_Status status;
        MPI_Comm   local_comm;

        MPI_Comm_rank( comm, &lidx );
        MPI_Comm_size( comm, &np );
        MPI_Attr_get( comm, MPE_Seq_keyval, (void *)&local_comm, &flag );
        if (!flag)
            MPI_Abort( comm, MPI_ERR_UNKNOWN );
        /* Send to the first process in the next group OR to the
           first process in the processor set */
        if ( (lidx % ng) == ng - 1 || lidx == np - 1 ) {
            MPI_Send( NULL, 0, MPI_INT, (lidx + 1) % np, 0, local_comm );
        }
        if (lidx == 0) {
            MPI_Recv( NULL, 0, MPI_INT, np - 1, 0, local_comm, &status );
        }
    }
Comments on sequential sections

Note the use of MPI_KEYVAL_INVALID to determine whether it is necessary to create a keyval.

Note the use of the flag on MPI_Attr_get to discover that a communicator has no attribute for the keyval.
Example: Managing tags

Problem: A library contains many objects that need to communicate in ways that are not known until runtime.

Messages between objects are kept separate by using different message tags. How are these tags chosen?
- Unsafe to use compile-time values
- Must allocate tag values at runtime

Solution:

Use a private communicator and use an attribute to keep track of available tags in that communicator.
Caching tags on communicator

    #include "mpi.h"

    static int MPE_Tag_keyval = MPI_KEYVAL_INVALID;

    /* Private routine to delete internal storage when a
       communicator is freed */
    int MPE_DelTag( comm, keyval, attr_val, extra_state )
    MPI_Comm *comm;
    int      *keyval;
    void     *attr_val, *extra_state;
    {
        free( attr_val );
        return MPI_SUCCESS;
    }
Caching tags on communicator II

    /* MPE_GetTags - Returns tags that can be used in communication
       with a communicator.

       Input Parameters:
       comm_in - Input communicator
       ntags   - Number of tags

       Output Parameters:
       comm_out  - Output communicator.  May be 'comm_in'.
       first_tag - First tag available                            */
    int MPE_GetTags( comm_in, ntags, comm_out, first_tag )
    MPI_Comm comm_in, *comm_out;
    int      ntags, *first_tag;
    {
        int mpe_errno = MPI_SUCCESS;
        int tagval, *tagvalp, *maxval, flag;

        if (MPE_Tag_keyval == MPI_KEYVAL_INVALID) {
            MPI_Keyval_create( MPI_NULL_COPY_FN, MPE_DelTag,
                               &MPE_Tag_keyval, (void *)0 );
        }
Caching tags on communicator III

        if (mpe_errno = MPI_Attr_get( comm_in, MPE_Tag_keyval,
                                      &tagvalp, &flag ))
            return mpe_errno;

        if (!flag) {
            /* This communicator is not yet known to this system,
               so we dup it and setup the first value */
            MPI_Comm_dup( comm_in, comm_out );
            comm_in = *comm_out;
            MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &maxval, &flag );
            tagvalp = (int *)malloc( 2 * sizeof(int) );
            printf( "Mallocing address %x\n", tagvalp );
            if (!tagvalp) return MPI_ERR_EXHAUSTED;
            *tagvalp = *maxval;
            MPI_Attr_put( comm_in, MPE_Tag_keyval, tagvalp );
        }
        else
            *comm_out = comm_in;
Caching tags on communicator IV

        if (*tagvalp < ntags) {
            /* Error, out of tags.  Another solution would be
               to do an MPI_Comm_dup. */
            return MPI_ERR_INTERN;
        }
        *first_tag = *tagvalp - ntags;
        *tagvalp   = *first_tag;

        return MPI_SUCCESS;
    }
Caching tags on communicator V

    /* MPE_ReturnTags - Returns tags allocated with MPE_GetTags.

       Input Parameters:
       comm      - Communicator to return tags to
       first_tag - First of the tags to return
       ntags     - Number of tags to return                       */
    int MPE_ReturnTags( comm, first_tag, ntags )
    MPI_Comm comm;
    int      first_tag, ntags;
    {
        int *tagvalp, flag, mpe_errno;

        if (mpe_errno = MPI_Attr_get( comm, MPE_Tag_keyval,
                                      &tagvalp, &flag ))
            return mpe_errno;

        if (!flag) {
            /* Error, attribute does not exist in this communicator */
            return MPI_ERR_OTHER;
        }
        if (*tagvalp == first_tag)
            *tagvalp = first_tag + ntags;

        return MPI_SUCCESS;
    }
Caching tags on communicator VI

    /* MPE_TagsEnd - Returns the private keyval */
    int MPE_TagsEnd()
    {
        MPI_Keyval_free( &MPE_Tag_keyval );
        MPE_Tag_keyval = MPI_KEYVAL_INVALID;
        return 0;
    }
Commentary

Use MPI_KEYVAL_INVALID to detect when the keyval must be created.

Use the flag return from MPI_ATTR_GET to detect when a communicator needs to be initialized.
Exercise: Writing libraries

Objective: Use private communicators and attributes.

Write a routine to circulate data to the next process, using a nonblocking send and receive operation.

    void Init_pipe( comm )
    void ISend_pipe( comm, bufin, len, datatype, bufout )
    void Wait_pipe( comm )

A typical use is

    Init_pipe( MPI_COMM_WORLD )
    for (i=0; i<n; i++) {
        ISend_pipe( comm, bufin, len, datatype, bufout );
        Do_Work( bufin, len );
        Wait_pipe( comm );
        t = bufin; bufin = bufout; bufout = t;
    }

What happens if Do_Work calls MPI routines?

What do you need to do to clean up Init_pipe?

How can you use a user-defined topology to determine the next process? (Hint: see MPI_Topo_test and MPI_Cartdim_get.)
MPI Objects

MPI has a variety of objects (communicators, groups, datatypes, etc.) that can be created and destroyed. This section discusses the types of these data and how MPI manages them.

This entire chapter may be skipped by beginners.
The MPI Objects

    MPI_Request     Handle for nonblocking communication, normally
                    freed by MPI in a test or wait
    MPI_Datatype    MPI datatype.  Free with MPI_Type_free.
    MPI_Op          User-defined operation.  Free with MPI_Op_free.
    MPI_Comm        Communicator.  Free with MPI_Comm_free.
    MPI_Group       Group of processes.  Free with MPI_Group_free.
    MPI_Errhandler  MPI errorhandler.  Free with MPI_Errhandler_free.
When should objects be freed?

Consider this code

    MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
    MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1, &newx );
    MPI_Type_commit( &newx );

(This creates a datatype for one face of a 3-D decomposition.) When should newx1 be freed?
Reference counting

MPI keeps track of the use of an MPI object, and only truly destroys it when no-one is using it. newx1 is being used by the user (the MPI_Type_vector that created it) and by the MPI_Datatype newx that uses it.

If newx1 is not needed after newx is defined, it should be freed:

    MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
    MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1, &newx );
    MPI_Type_free( &newx1 );
    MPI_Type_commit( &newx );
Why reference counts

Why not just free the object? Consider this library routine:

    void MakeDatatype( nx, ny, ly, lz, MPI_Datatype *new )
    {
        MPI_Datatype newx1;
        MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
        MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1, new );
        MPI_Type_free( &newx1 );
        MPI_Type_commit( new );
    }

Without the MPI_Type_free( &newx1 ), it would be very awkward to later free newx1 when new was freed.
Tools for evaluating programs

MPI provides some tools for evaluating the performance of parallel programs.

These are:
- Timer
- Profiling interface
The MPI Timer

The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    printf( "Elapsed time is %f\n", t2 - t1 );

The value returned by a single call to MPI_Wtime has little value.

The times are local; the attribute MPI_WTIME_IS_GLOBAL may be used to determine if the times are also synchronized with each other for all processes in MPI_COMM_WORLD.
Profiling

All routines have two entry points: MPI_... and PMPI_....

This makes it easy to provide a single level of low-overhead routines to intercept MPI calls without any source code modifications.

Used to provide "automatic" generation of trace files.

    [Figure: the user program calls MPI_Send and MPI_Bcast; a profile
    library intercepts MPI_Send and forwards to PMPI_Send; the MPI
    library provides both the MPI_ and PMPI_ entry points.]

    static int nsend = 0;
    int MPI_Send( start, count, datatype, dest, tag, comm )
    {
        nsend++;
        return PMPI_Send( start, count, datatype, dest, tag, comm );
    }
Writing profiling routines

The MPICH implementation contains a program for writing wrappers.

This description will write out each MPI routine that is called:

    #ifdef MPI_BUILD_PROFILING
    #undef MPI_BUILD_PROFILING
    #endif
    #include <stdio.h>
    #include "mpi.h"

    {{fnall fn_name}}
    {{vardecl int llrank}}
        PMPI_Comm_rank( MPI_COMM_WORLD, &llrank );
        printf( "%d Starting {{fn_name}}...\n", llrank );
        fflush( stdout );
        {{callfn}}
        printf( "%d Ending {{fn_name}}\n", llrank );
        fflush( stdout );
    {{endfnall}}

The command

    wrappergen -w trace.w -o trace.c

converts this to a C program. Compile the file trace.c and insert the resulting object file into your link line:

    cc -o a.out a.o trace.o -lpmpi -lmpi
Another profiling example

This version counts all calls and the number of bytes sent with MPI_Send, MPI_Bsend, or MPI_Isend:

    #include "mpi.h"

    {{foreachfn fn_name MPI_Send MPI_Bsend MPI_Isend}}
    static long {{fn_name}}_nbytes_{{fileno}};
    {{endforeachfn}}

    {{forallfn fn_name MPI_Init MPI_Finalize MPI_Wtime}}
    int {{fn_name}}_ncalls_{{fileno}};
    {{endforallfn}}

    {{fnall this_fn_name MPI_Finalize}}
        printf( "{{this_fn_name}} is being called.\n" );
        {{callfn}}
        {{this_fn_name}}_ncalls_{{fileno}}++;
    {{endfnall}}

    {{fn fn_name MPI_Send MPI_Bsend MPI_Isend}}
    {{vardecl int typesize}}
        {{callfn}}
        MPI_Type_size( datatype, (MPI_Aint *)&typesize );
        {{fn_name}}_nbytes_{{fileno}} += typesize * count;
        {{fn_name}}_ncalls_{{fileno}}++;
    {{endfn}}
Another profiling example (cont.)

    {{fn fn_name MPI_Finalize}}
    {{forallfn dis_fn}}
        if ({{dis_fn}}_ncalls_{{fileno}}) {
            printf( "{{dis_fn}}: %d calls\n",
                    {{dis_fn}}_ncalls_{{fileno}} );
        }
    {{endforallfn}}
        if (MPI_Send_ncalls_{{fileno}}) {
            printf( "%d bytes sent in %d calls with MPI_Send\n",
                    MPI_Send_nbytes_{{fileno}},
                    MPI_Send_ncalls_{{fileno}} );
        }
        {{callfn}}
    {{endfn}}
Generating and viewing log files

Log files that contain a history of a parallel computation can be very valuable in understanding a parallel program. The upshot and nupshot programs, provided in the MPICH and MPI-F implementations, may be used to view log files.
Generating a log file

This is very easy with the MPICH implementation of MPI. Simply replace -lmpi with -llmpi -lpmpi -lm in the link line for your program, and relink your program. You do not need to recompile.

On some systems, you can get a real-time animation by using the libraries -lampi -lmpe -lm -lX11 -lpmpi.

Alternately, you can use the -mpilog or -mpianim options to the mpicc or mpif77 commands.
Connecting several programs together

MPI provides support for connecting separate message-passing programs together through the use of intercommunicators.

Sending messages between different programs:

Programs share MPI_COMM_WORLD.

Programs have separate and disjoint communicators.

    [Figure: App1 and App2 within a common MPI_COMM_WORLD; Comm1 and
    Comm2 are the disjoint local communicators, and Comm_intercomm
    connects the two applications.]
Exchanging data between programs

Form an intercommunicator (MPI_INTERCOMM_CREATE).

Send data:

    MPI_Send( ..., 0, intercomm );
    MPI_Recv( buf, ..., 0, intercomm );
    MPI_Bcast( buf, ..., localcomm );

More complex point-to-point operations can also be used.

Collective operations:

Use MPI_INTERCOMM_MERGE to create an intracommunicator.
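A minimal sketch of forming and using an intercommunicator (an illustration, not from the original slides; here the two "programs" are simulated by splitting MPI_COMM_WORLD in half, the tag 99 and the leader choices are arbitrary, and at least two processes are required):

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int rank, size, color, localrank, remote_leader, value = -1, myval;
        MPI_Comm localcomm, intercomm;
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        /* Split the processes into two disjoint "applications" */
        color = (rank < size / 2) ? 0 : 1;
        MPI_Comm_split( MPI_COMM_WORLD, color, rank, &localcomm );
        MPI_Comm_rank( localcomm, &localrank );

        /* Local leader is local rank 0; the remote leader is given as a
           rank in the peer communicator (MPI_COMM_WORLD) */
        remote_leader = (color == 0) ? size / 2 : 0;
        MPI_Intercomm_create( localcomm, 0, MPI_COMM_WORLD, remote_leader,
                              99, &intercomm );

        /* Leaders exchange a value; ranks refer to the remote group.
           The operations are ordered so the exchange is safe. */
        if (localrank == 0) {
            myval = color;
            if (color == 0) {
                MPI_Send( &myval, 1, MPI_INT, 0, 0, intercomm );
                MPI_Recv( &value, 1, MPI_INT, 0, 0, intercomm, &status );
            } else {
                MPI_Recv( &value, 1, MPI_INT, 0, 0, intercomm, &status );
                MPI_Send( &myval, 1, MPI_INT, 0, 0, intercomm );
            }
            printf( "group %d leader received %d\n", color, value );
        }

        MPI_Comm_free( &intercomm );
        MPI_Comm_free( &localcomm );
        MPI_Finalize();
        return 0;
    }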
Final Comments

Additional features of MPI not covered in this tutorial:
- Persistent Communication
- Error handling
Sharable MPI Resources

The Standard itself:
- As a Technical report: U. of Tennessee report
- As postscript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps
- As hypertext on the World Wide Web: http://www.mcs.anl.gov/mpi
- As a journal article: in the Fall 1994 issue of the Journal of Supercomputing Applications

MPI Forum discussions:
- The MPI Forum email discussions and both current and earlier versions of the Standard are available from netlib.

Books:
- Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994
- MPI Annotated Reference Manual, by Otto et al., in preparation
Sharable MPI Resources (continued)

Newsgroup:
- comp.parallel.mpi

Mailing lists:
- mpi-comm@mcs.anl.gov: the MPI Forum discussion list
- mpi-impl@mcs.anl.gov: the implementors' discussion list

Implementations available by ftp:
- MPICH is available by anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, file mpich.tar.Z.
- LAM is available by anonymous ftp from tbag.osc.edu in the directory pub/lam.
- The CHIMP version of MPI is available by anonymous ftp from ftp.epcc.ed.ac.uk in the directory pub/chimp/release.

Test code repository:
- ftp://info.mcs.anl.gov/pub/mpi/mpi-test
MPI-2

The MPI Forum (with old and new participants) has begun a follow-on series of meetings.

Goals:
- clarify existing draft
- provide features users have requested
- make extensions, not changes

Major topics being considered:
- dynamic process management
- client/server
- real-time extensions
- "one-sided" communication (put/get, active messages)
- portable access to MPI system state (for debuggers)
- language bindings for C++ and Fortran-90

Schedule:
- Dynamic processes, client/server by SC '95
- MPI-2 complete by SC '96
Summary

The parallel computing community has cooperated to develop a full-featured standard message-passing library interface.

Implementations abound.

Applications beginning to be developed or ported.

MPI-2 process beginning.

Lots of MPI material available.