
Tutorial on MPI: The Message-Passing Interface

William Gropp
Mathematics and Computer Science Division
Argonne National Laboratory, Argonne, IL 60439
gropp@mcs.anl.gov

Course Outline

• Background on Parallel Computing
• Getting Started
• MPI Basics
• Intermediate MPI
• Tools for writing libraries
• Final comments

Thanks to Rusty Lusk for some of the material in this tutorial.

This tutorial may be used in conjunction with the book "Using MPI", which contains detailed descriptions of the use of the MPI routines.

► Material that begins with this symbol is `advanced' and may be skipped on a first reading.

Background

• Parallel Computing
• Communicating with other processes
• Cooperative operations
• One-sided operations
• The MPI process

Parallel Computing

• Separate workers or processes
• Interact by exchanging information

Types of parallel computing

All use different data for each worker

Data-parallel  Same operations on different data.  Also called SIMD
SPMD           Same program, different data
MIMD           Different programs, different data

SPMD and MIMD are essentially the same because any MIMD can be made SPMD.
SIMD is also equivalent, but in a less practical sense.

MPI is primarily for SPMD/MIMD.  HPF is an example of a SIMD interface.

Communicating with other processes

Data must be exchanged with other workers

• Cooperative — all parties agree to transfer data
• One-sided — one worker performs transfer of data

Cooperative operations

Message-passing is an approach that makes the exchange of data cooperative.

Data must both be explicitly sent and received.

An advantage is that any change in the receiver's memory is made with the receiver's participation.

[Figure: Process 0 executes SEND( data ); Process 1 executes RECV( data ).]

One-sided operations

One-sided operations between parallel processes include remote memory reads and writes.

An advantage is that data can be accessed without waiting for another process.

[Figure: Process 0 executes PUT( data ) into the memory of Process 1; Process 1 executes GET( data ) from the memory of Process 0.]

Class Example

Take a pad of paper.  Algorithm: Initialize with the number of neighbors you have.

• Compute the average of your neighbors' values and subtract from your value.  Make that your new value.
• Repeat until done

Questions
1. How do you get values from your neighbors?
2. Which step or iteration do they correspond to?  Do you know?  Do you care?
3. How do you decide when you are done?

Hardware models

The previous example illustrates the hardware models by how data is exchanged among workers.

• Distributed memory (e.g., Paragon, IBM SPx, workstation network)
• Shared memory (e.g., SGI Power Challenge, Cray T3D)

Either may be used with SIMD or MIMD software models.

► All memory is distributed.

What is MPI?

• A message-passing library specification
  – message-passing model
  – not a compiler specification
  – not a specific product
• For parallel computers, clusters, and heterogeneous networks
• Full-featured
• Designed to permit (unleash?) the development of parallel software libraries
• Designed to provide access to advanced parallel hardware for
  – end users
  – library writers
  – tool developers

Motivation for a New Design

• Message Passing now mature as programming paradigm
  – well understood
  – efficient match to hardware
  – many applications
• Vendor systems not portable
• Portable systems are mostly research projects
  – incomplete
  – lack vendor support
  – not at most efficient level

Motivation (cont.)

Few systems offer the full range of desired features.

• modularity (for libraries)
• access to peak performance
• portability
• heterogeneity
• subgroups
• topologies
• performance measurement tools

The MPI Process

• Began at Williamsburg Workshop in April, 1992
• Organized at Supercomputing '92 (November)
• Followed HPF format and process
• Met every six weeks for two days
• Extensive, open email discussions
• Drafts, readings, votes
• Pre-final draft distributed at Supercomputing '93
• Two-month public comment period
• Final version of draft in May, 1994
• Widely available now on the Web, ftp sites, netlib (http://www.mcs.anl.gov/mpi/index.html)
• Public implementations available
• Vendor implementations coming soon

Who Designed MPI?

• Broad participation
• Vendors
  – IBM, Intel, TMC, Meiko, Cray, Convex, Ncube
• Library writers
  – PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda
• Application specialists and consultants

Companies: ARCO, Convex, Cray Res, IBM, Intel, KAI, Meiko, NAG, nCUBE, ParaSoft, Shell, TMC
Laboratories: ANL, GMD, LANL, LLNL, NOAA, NSF, ORNL, PNL, Sandia, SDSC, SRC
Universities: UC Santa Barbara, Syracuse U, Michigan State U, Oregon Grad Inst, U of New Mexico, Miss. State U., U of Southampton, U of Colorado, Yale U, U of Tennessee, U of Maryland, Western Mich U, U of Edinburgh, Cornell U., Rice U., U of San Francisco

Features of MPI

• General
  – Communicators combine context and group for message security
  – Thread safety
• Point-to-point communication
  – Structured buffers and derived datatypes, heterogeneity
  – Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered
• Collective
  – Both built-in and user-defined collective operations
  – Large number of data movement routines
  – Subgroups defined directly or by topology

Features of MPI (cont.)

• Application-oriented process topologies
  – Built-in support for grids and graphs (uses groups)
• Profiling
  – Hooks allow users to intercept MPI calls to install their own tools
• Environmental
  – inquiry
  – error control

Features not in MPI

• Non-message-passing concepts not included:
  – process management
  – remote memory transfers
  – active messages
  – threads
  – virtual shared memory
• MPI does not address these issues, but has tried to remain compatible with these ideas (e.g. thread safety as a goal, intercommunicators)

Is MPI Large or Small?

• MPI is large (125 functions)
  – MPI's extensive functionality requires many functions
  – Number of functions not necessarily a measure of complexity
• MPI is small (6 functions)
  – Many parallel programs can be written with just 6 basic functions.
• MPI is just right
  – One can access flexibility when it is required.
  – One need not master all parts of MPI to use it.

Where to use MPI?

• You need a portable parallel program
• You are writing a parallel library
• You have irregular or dynamic data relationships that do not fit a data parallel model

Where not to use MPI:

• You can use HPF or a parallel Fortran 90
• You don't need parallelism at all
• You can use libraries (which may be written in MPI)

Why learn MPI?

• Portable
• Expressive
• Good way to learn about subtle issues in parallel computing

Getting started

• Writing MPI programs
• Compiling and linking
• Running MPI programs
• More information
  – Using MPI by William Gropp, Ewing Lusk, and Anthony Skjellum
  – The LAM companion to "Using MPI..." by Zdzislaw Meglicki
  – Designing and Building Parallel Programs by Ian Foster
  – A Tutorial/User's Guide for MPI by Peter Pacheco (ftp://math.usfca.edu/pub/MPI/mpi.guide.ps)
  – The MPI standard and other information is available at http://www.mcs.anl.gov/mpi.  Also the source for several implementations.

Writing MPI programs

    #include "mpi.h"
    #include <stdio.h>

    int main( argc, argv )
    int argc;
    char **argv;
    {
        MPI_Init( &argc, &argv );
        printf( "Hello world\n" );
        MPI_Finalize();
        return 0;
    }

Commentary

• #include "mpi.h" provides basic MPI definitions and types
• MPI_Init starts MPI
• MPI_Finalize exits MPI
• Note that all non-MPI routines are local; thus the printf runs on each process

Compiling and linking

For simple programs, special compiler commands can be used.  For large projects, it is best to use a standard Makefile.

The MPICH implementation provides the commands mpicc and mpif77 as well as `Makefile' examples in `/usr/local/mpi/examples/Makefile.in'.

Special compilation commands

The commands
    mpicc -o first first.c
    mpif77 -o firstf firstf.f
may be used to build simple programs when using MPICH.

These provide special options that exploit the profiling features of MPI:
    -mpilog    Generate log files of MPI calls
    -mpitrace  Trace execution of MPI calls
    -mpianim   Real-time animation of MPI (not available on all systems)

These are specific to the MPICH implementation; other implementations may provide similar commands (e.g., mpcc and mpxlf on IBM SP2).

Using Makefiles

The file `Makefile.in' is a template Makefile.  The program (script) `mpireconfig' translates this to a Makefile for a particular system.  This allows you to use the same Makefile for a network of workstations and a massively parallel computer, even when they use different compilers, libraries, and linker options.

    mpireconfig Makefile

Note that you must have `mpireconfig' in your PATH.

Sample Makefile.in

    ##### User configurable options #####

    ARCH        = @ARCH@
    COMM        = @COMM@
    INSTALL_DIR = @INSTALL_DIR@
    CC          = @CC@
    F77         = @F77@
    CLINKER     = @CLINKER@
    FLINKER     = @FLINKER@
    OPTFLAGS    = @OPTFLAGS@
    #
    LIB_PATH    = -L$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
    FLIB_PATH   = @FLIB_PATH_LEADER@$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
    LIB_LIST    = @LIB_LIST@
    #
    INCLUDE_DIR = @INCLUDE_PATH@ -I$(INSTALL_DIR)/include

    ### End User configurable options ###

Sample Makefile.in (con't)

    CFLAGS  = @CFLAGS@ $(OPTFLAGS) $(INCLUDE_DIR) -DMPI_$(ARCH)
    FFLAGS  = @FFLAGS@ $(OPTFLAGS) $(INCLUDE_DIR)
    LIBS    = $(LIB_PATH) $(LIB_LIST)
    FLIBS   = $(FLIB_PATH) $(LIB_LIST)
    EXECS   = hello

    default: hello

    all: $(EXECS)

    hello: hello.o $(INSTALL_DIR)/include/mpi.h
            $(CLINKER) $(OPTFLAGS) -o hello hello.o \
              $(LIB_PATH) $(LIB_LIST) -lm

    clean:
            /bin/rm -f *.o *~ PI* $(EXECS)

    .c.o:
            $(CC) $(CFLAGS) -c $*.c
    .f.o:
            $(F77) $(FFLAGS) -c $*.f

Running MPI programs

    mpirun -np 2 hello

`mpirun' is not part of the standard, but some version of it is common with several MPI implementations.  The version shown here is for the MPICH implementation of MPI.

► Just as Fortran does not specify how Fortran programs are started, MPI does not specify how MPI programs are started.

► The option -t shows the commands that mpirun would execute; you can use this to find out how mpirun starts programs on your system.  The option -help shows all options to mpirun.

Finding out about the environment

Two of the first questions asked in a parallel program are: How many processes are there? and Who am I?

How many is answered with MPI_Comm_size and who am I is answered with MPI_Comm_rank.

The rank is a number between zero and size-1.

A simple program

    #include "mpi.h"
    #include <stdio.h>

    int main( argc, argv )
    int argc;
    char **argv;
    {
        int rank, size;
        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        printf( "Hello world! I'm %d of %d\n", rank, size );
        MPI_Finalize();
        return 0;
    }

Caveats

► These sample programs have been kept as simple as possible by assuming that all processes can do output.  Not all systems provide this feature, and MPI provides a way to handle this case.

Exercise - Getting Started

Objective: Learn how to login, write, compile, and run a simple MPI program.

Run the "Hello world" programs.  Try two different parallel computers.  What does the output look like?

Sending and Receiving messages

[Figure: Process 0 sends the contents of A; Process 1 receives them into B.]

Questions:
• To whom is data sent?
• What is sent?
• How does the receiver identify it?

Current Message-Passing

• A typical blocking send looks like
      send( dest, type, address, length )
  where
  – dest is an integer identifier representing the process to receive the message.
  – type is a nonnegative integer that the destination can use to selectively screen messages.
  – (address, length) describes a contiguous area in memory containing the message to be sent.
and
• A typical global operation looks like:
      broadcast( type, address, length )
• All of these specifications are a good match to hardware, easy to understand, but too inflexible.

The Buffer

Sending and receiving only a contiguous array of bytes:

• hides the real data structure from hardware which might be able to handle it directly
• requires pre-packing dispersed data
  – rows of a matrix stored columnwise
  – general collections of structures
• prevents communications between machines with different representations (even lengths) for same data type

Generalizing the Buffer Description

• Specified in MPI by starting address, datatype, and count, where datatype is:
  – elementary (all C and Fortran datatypes)
  – contiguous array of datatypes
  – strided blocks of datatypes
  – indexed array of blocks of datatypes
  – general structure
• Datatypes are constructed recursively.
• Specification of elementary datatypes allows heterogeneous communication.
• Elimination of length in favor of count is clearer.
• Specifying application-oriented layout of data allows maximal use of special hardware.

Generalizing the Type

• A single type field is too constraining.  Often overloaded to provide needed flexibility.
• Problems:
  – under user control
  – wild cards allowed (MPI_ANY_TAG)
  – library use conflicts with user and with other libraries

Sample Program using Library Calls

Sub1 and Sub2 are from different libraries.
    Sub1();
    Sub2();

Sub1a and Sub1b are from the same library
    Sub1a();
    Sub2();
    Sub1b();

Thanks to Marc Snir for the following four examples

Correct Execution of Library Calls

[Figure: message traffic among Process 0, Process 1, and Process 2.  The recv(any) calls posted inside Sub1 match only sends issued inside Sub1, and the send/recv pairs inside Sub2 match each other.]

Incorrect Execution of Library Calls

[Figure: the same processes and calls, but a recv(any) posted in Sub1 matches a message that was intended for Sub2, so messages from the two libraries are confused.]

Correct Execution of Library Calls with Pending Communication

[Figure: Sub1a posts sends and a recv(any) that remains pending; Sub2 then communicates among Processes 0, 1, and 2; Sub1b completes the pending recv(any) with the message sent in Sub1a.]

Incorrect Execution of Library Calls with Pending Communication

[Figure: the recv(any) left pending by Sub1a instead matches a message sent inside Sub2, so the two libraries interfere with each other.]

Solution to the type problem

• A separate communication context for each family of messages, used for queueing and matching.  (This has often been simulated in the past by overloading the tag field.)
• No wild cards allowed, for security
• Allocated by the system, for security
• Types (tags, in MPI) retained for normal use (wild cards OK)

Delimiting Scope of Communication

• Separate groups of processes working on subproblems
  – Merging of process name space interferes with modularity
  – "Local" process identifiers desirable
• Parallel invocation of parallel libraries
  – Messages from application must be kept separate from messages internal to library.
  – Knowledge of library message types interferes with modularity.
  – Synchronizing before and after library calls is undesirable.

Generalizing the Process Identifier

• Collective operations typically operated on all processes (although some systems provide subgroups).
• This is too restrictive (e.g., need minimum over a column or a sum across a row, of processes)
• MPI provides groups of processes
  – initial "all" group
  – group management routines (build, delete groups)
• All communication (not just collective operations) takes place in groups.
• A group and a context are combined in a communicator.
• Source/destination in send/receive operations refer to rank in group associated with a given communicator.  MPI_ANY_SOURCE permitted in a receive.

MPI Basic Send/Receive

Thus the basic (blocking) send has become:
    MPI_Send( start, count, datatype, dest, tag, comm )
and the receive:
    MPI_Recv( start, count, datatype, source, tag, comm, status )
The source, tag, and count of the message actually received can be retrieved from status.

Two simple collective operations:
    MPI_Bcast( start, count, datatype, root, comm )
    MPI_Reduce( start, result, count, datatype, operation, root, comm )
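As an illustration (not part of the original slides), here is a minimal C sketch of how these calls fit together; the message size, the tag value 99, and the use of MPI_COMM_WORLD are arbitrary choices for the example:

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int        rank, i, data[10];
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        if (rank == 0) {
            for (i = 0; i < 10; i++) data[i] = i;
            MPI_Send( data, 10, MPI_INT, 1, 99, MPI_COMM_WORLD );
        }
        else if (rank == 1) {
            MPI_Recv( data, 10, MPI_INT, 0, 99, MPI_COMM_WORLD, &status );
            printf( "Received message with tag %d from %d\n",
                    status.MPI_TAG, status.MPI_SOURCE );
        }
        MPI_Finalize();
        return 0;
    }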

Getting information about a message

    MPI_Status status;
    MPI_Recv( ..., &status );
    ... status.MPI_TAG;
    ... status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &count );

MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive.

MPI_Get_count may be used to determine how much data of a particular type was received.

Simple Fortran example

          program main
          include 'mpif.h'

          integer rank, size, to, from, tag, count, i, ierr
          integer src, dest
          integer st_source, st_tag, st_count
          integer status(MPI_STATUS_SIZE)
          double precision data(100)

          call MPI_INIT( ierr )
          call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
          call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
          print *, 'Process ', rank, ' of ', size, ' is alive'
          dest = size - 1
          src  = 0
    C
          if (rank .eq. src) then
             to    = dest
             count = 10
             tag   = 2001
             do 10 i=1, 10
     10         data(i) = i
             call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
         +                  tag, MPI_COMM_WORLD, ierr )
          else if (rank .eq. dest) then
             tag   = MPI_ANY_TAG
             count = 10
             from  = MPI_ANY_SOURCE
             call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
         +                  tag, MPI_COMM_WORLD, status, ierr )

Simple Fortran example (cont.)

             call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
         +                       st_count, ierr )
             st_source = status(MPI_SOURCE)
             st_tag    = status(MPI_TAG)
    C
             print *, 'Status info: source = ', st_source,
         +            ' tag = ', st_tag, ' count = ', st_count
             print *, rank, ' received', (data(i), i=1,10)
          endif

          call MPI_FINALIZE( ierr )
          end

Six Function MPI

MPI is very simple.  These six functions allow you to write many programs:

    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Send
    MPI_Recv

A taste of things to come

The following examples show a C and Fortran version of the same program.

This program computes PI (with a very simple method) but does not use MPI_Send and MPI_Recv.  Instead, it uses collective operations to send data to and from all of the running processes.  This gives a different six-function MPI set:

    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Bcast
    MPI_Reduce

Broadcast and Reduction

The routine MPI_Bcast sends data from one process to all others.

The routine MPI_Reduce combines data from all processes (by adding them in this case) and returns the result to a single process.
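As a quick illustration (not from the original slides), the C forms of these two calls look like the following; the variable names and the use of MPI_COMM_WORLD are just for the example:

    int    n;            /* value known only on rank 0 before the broadcast */
    double mypi, pi;     /* per-process contribution and combined result */

    /* root (rank 0) sends n to every process in the communicator */
    MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );
    /* every process contributes mypi; the sum arrives in pi on rank 0 */
    MPI_Reduce( &mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );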

Fortran example: PI

          program main
          include "mpif.h"

          double precision  PI25DT
          parameter        (PI25DT = 3.141592653589793238462643d0)

          double precision  mypi, pi, h, sum, x, f, a
          integer n, myid, numprocs, i, rc
    c                                 function to integrate
          f(a) = 4.d0 / (1.d0 + a*a)

          call MPI_INIT( ierr )
          call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
          call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

     10   if ( myid .eq. 0 ) then
             write(6,98)
     98      format('Enter the number of intervals: (0 quits)')
             read(5,99) n
     99      format(i10)
          endif

          call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )

Fortran example (cont.)

    c                                 check for quit signal
          if ( n .le. 0 ) goto 30
    c                                 calculate the interval size
          h = 1.0d0/n
          sum = 0.0d0
          do 20 i = myid+1, n, numprocs
             x = h * (dble(i) - 0.5d0)
             sum = sum + f(x)
     20   continue
          mypi = h * sum
    c                                 collect all the partial sums
          call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
         $                 MPI_COMM_WORLD, ierr )
    c                                 node 0 prints the answer.
          if (myid .eq. 0) then
             write(6, 97) pi, abs(pi - PI25DT)
     97      format('  pi is approximately: ', F18.16,
         $          '  Error is: ', F18.16)
          endif
          goto 10
     30   call MPI_FINALIZE( rc )
          stop
          end

C example: PI

    #include "mpi.h"
    #include <math.h>

    int main(argc,argv)
    int argc;
    char *argv[];
    {
        int    done = 0, n, myid, numprocs, i, rc;
        double PI25DT = 3.141592653589793238462643;
        double mypi, pi, h, sum, x, a;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
        MPI_Comm_rank( MPI_COMM_WORLD, &myid );

C example (cont.)

        while (!done) {
            if (myid == 0) {
                printf( "Enter the number of intervals: (0 quits) " );
                scanf( "%d", &n );
            }
            MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );
            if (n == 0) break;

            h   = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double) i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;

            MPI_Reduce( &mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                        MPI_COMM_WORLD );

            if (myid == 0)
                printf( "pi is approximately %.16f, Error is %.16f\n",
                        pi, fabs(pi - PI25DT) );
        }
        MPI_Finalize();
    }

Exercise - PI

Objective: Experiment with send/receive

Run either program for PI.  Write new versions that replace the calls to MPI_Bcast and MPI_Reduce with MPI_Send and MPI_Recv.

► The MPI broadcast and reduce operations use at most log p send and receive operations on each process, where p is the size of MPI_COMM_WORLD.  How many operations do your versions use?

Exercise - Ring

Objective: Experiment with send/receive

Write a program to send a message around a ring of processors.  That is, processor 0 sends to processor 1, who sends to processor 2, etc.  The last processor returns the message to processor 0.

► You can use the routine MPI_Wtime to time code in MPI.  The statement
    t = MPI_Wtime();
returns the time as a double (DOUBLE PRECISION in Fortran).

Topologies

MPI provides routines to provide structure to collections of processes.

This helps to answer the question: Who are my neighbors?

Cartesian Topologies

A Cartesian topology is a mesh.

Example of a 3 x 4 Cartesian mesh with arrows pointing at the right neighbors:

[Figure: a 3 x 4 mesh of processes labeled (0,0) through (3,2), each with an arrow to its right neighbor.]

Defining a Cartesian Topology

The routine MPI_Cart_create creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndim argument.

    dims(1)    = 4
    dims(2)    = 3
    periods(1) = .false.
    periods(2) = .false.
    reorder    = .true.
    ndim       = 2
    call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
   $     periods, reorder, comm2d, ierr )

Finding neighbors

MPI_Cart_create creates a new communicator with the same processes as the input communicator, but with the specified topology.

The question, Who are my neighbors, can now be answered with MPI_Cart_shift:

    call MPI_CART_SHIFT( comm2d, 0, 1,
   $     nbrleft, nbrright, ierr )
    call MPI_CART_SHIFT( comm2d, 1, 1,
   $     nbrbottom, nbrtop, ierr )

The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by ±1 in the two dimensions.

Who am I?

Can be answered with

    integer coords(2)
    call MPI_COMM_RANK( comm1d, myrank, ierr )
    call MPI_CART_COORDS( comm1d, myrank, 2,
   $     coords, ierr )

Returns the Cartesian coordinates of the calling process in coords.

Partitioning

When creating a Cartesian topology, one question is "What is a good choice for the decomposition of the processors?"

This question can be answered with MPI_Dims_create:

    integer dims(2)
    dims(1) = 0
    dims(2) = 0
    call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
    call MPI_DIMS_CREATE( size, 2, dims, ierr )

Other Topology Routines

MPI contains routines to translate between Cartesian coordinates and ranks in a communicator, and to access the properties of a Cartesian topology.

The routine MPI_Graph_create allows the creation of a general graph topology.

Why are these routines in MPI?

In many parallel computer interconnects, some processors are closer to each other than to others.  These routines allow the MPI implementation to provide an ordering of processes in a topology that makes logical neighbors close in the physical interconnect.

► Some parallel programmers may remember hypercubes and the effort that went into assigning nodes in a mesh to processors in a hypercube through the use of Gray codes.  Many new systems have different interconnects; ones with multiple paths may have notions of near neighbors that change with time.  These routines free the programmer from many of these considerations.  The reorder argument is used to request the best ordering.

The periods argument

Who are my neighbors if I am at the edge of a Cartesian mesh?

Periodic Grids

Specify this in MPI_Cart_create with

    dims(1)    = 4
    dims(2)    = 3
    periods(1) = .TRUE.
    periods(2) = .TRUE.
    reorder    = .true.
    ndim       = 2
    call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
   $     periods, reorder, comm2d, ierr )

Nonperiodic Grids

In the nonperiodic case, a neighbor may not exist.  This is indicated by a rank of MPI_PROC_NULL.

This rank may be used in send and receive calls in MPI.  The action in both cases is as if the call was not made.
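A small sketch (not in the original slides) of how this is typically used: the ranks returned by MPI_Cart_shift can be passed straight to a send/receive pair, and at a nonperiodic edge the null rank makes that half of the exchange a no-op.  The buffer names, the count N, and the tag are arbitrary assumptions here.

    int        nbrleft, nbrright;
    MPI_Status status;

    MPI_Cart_shift( comm2d, 0, 1, &nbrleft, &nbrright );
    /* At a nonperiodic edge, nbrleft or nbrright is MPI_PROC_NULL
       and that part of the exchange is simply skipped. */
    MPI_Sendrecv( sendbuf, N, MPI_DOUBLE, nbrright, 0,
                  recvbuf, N, MPI_DOUBLE, nbrleft,  0,
                  comm2d, &status );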

Collective Communications in MPI

• Communication is coordinated among a group of processes.
• Groups can be constructed "by hand" with MPI group-manipulation routines or by using MPI topology-definition routines.
• Message tags are not used.  Different communicators are used instead.
• No non-blocking collective operations.
• Three classes of collective operations:
  – synchronization
  – data movement
  – collective computation

Synchronization

• MPI_Barrier(comm)
• Function blocks until all processes in comm call it.
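A common use (an illustrative sketch, not from the original slides) is to line processes up before timing a section of code; do_work is a placeholder for whatever is being measured:

    double t0, elapsed;

    MPI_Barrier( MPI_COMM_WORLD );   /* everyone starts the timed section together */
    t0 = MPI_Wtime();
    do_work();                       /* placeholder for the code being timed */
    MPI_Barrier( MPI_COMM_WORLD );   /* wait until every process has finished */
    elapsed = MPI_Wtime() - t0;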

Available Collective Patterns

[Figure: schematic representation of collective data movement in MPI among processes P0-P3 — Broadcast (one item A copied to all), Scatter (items A, B, C, D distributed one per process), Gather (the reverse), Allgather (every process ends up with A, B, C, D), and All-to-All (items A0..D3 transposed so that process i receives Ai, Bi, Ci, Di).]

Available Collective Computation Patterns

[Figure: schematic representation of collective computation in MPI among processes P0-P3 — Reduce combines A, B, C, D into the single result ABCD; Scan produces the prefix results A, AB, ABC, ABCD on successive processes.]

MPI Collective Routines

• Many routines:
    Allgather   Allgatherv   Allreduce
    Alltoall    Alltoallv    Bcast
    Gather      Gatherv      Reduce
    ReduceScatter   Scan     Scatter   Scatterv
• All versions deliver results to all participating processes.
• V versions allow the chunks to have different sizes.
• Allreduce, Reduce, ReduceScatter, and Scan take both built-in and user-defined combination functions.

Built-in Collective Computation Operations

    MPI Name       Operation
    MPI_MAX        Maximum
    MPI_MIN        Minimum
    MPI_PROD       Product
    MPI_SUM        Sum
    MPI_LAND       Logical and
    MPI_LOR        Logical or
    MPI_LXOR       Logical exclusive or (xor)
    MPI_BAND       Bitwise and
    MPI_BOR        Bitwise or
    MPI_BXOR       Bitwise xor
    MPI_MAXLOC     Maximum value and location
    MPI_MINLOC     Minimum value and location
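For instance (an illustrative sketch, not from the original slides), MPI_MAXLOC operates on value/index pairs, which is how a program can learn both the largest value and which rank owns it; local_residual and myrank are invented names for the example:

    struct { double value; int rank; } in, out;

    in.value = local_residual;    /* assumed per-process quantity */
    in.rank  = myrank;
    /* out.value is the global maximum, out.rank the rank that holds it */
    MPI_Allreduce( &in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                   MPI_COMM_WORLD );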

Defining Your Own Collective Operations

    MPI_Op_create( user_function, commute, op )
    MPI_Op_free( op )

    user_function( invec, inoutvec, len, datatype )

The user function should perform
    inoutvec[i] = invec[i] op inoutvec[i];
for i from 0 to len-1.

user_function can be non-commutative (e.g., matrix multiply).

Sample user function

For example, to create an operation that has the same effect as MPI_SUM on Fortran double precision values, use

    subroutine myfunc( invec, inoutvec, len, datatype )
    integer len, datatype
    double precision invec(len), inoutvec(len)
    integer i
    do 10 i=1,len
 10    inoutvec(i) = invec(i) + inoutvec(i)
    return
    end

To use, just

    integer myop
    call MPI_Op_create( myfunc, .true., myop, ierr )
    call MPI_Reduce( a, b, 1, MPI_DOUBLE_PRECISION, myop, ... )

The routine MPI_Op_free destroys user functions when they are no longer needed.

Defining groups

All MPI communication is relative to a communicator, which contains a context and a group.  The group is just a set of processes.

Subdividing a communicator

The easiest way to create communicators with new groups is with MPI_COMM_SPLIT.

For example, to form groups of rows of processes

[Figure: processes arranged as rows 0-2 and columns 0-4.]

use
    MPI_Comm_split( oldcomm, row, 0, &newcomm );
To maintain the order by rank, use
    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, row, rank, &newcomm );

Subdividing (con't)

Similarly, to form groups of columns,

[Figure: processes arranged as rows 0-2 and columns 0-4.]

use
    MPI_Comm_split( oldcomm, column, 0, &newcomm2 );
To maintain the order by rank, use
    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, column, rank, &newcomm2 );

Manipulating Groups

Another way to create a communicator with specific members is to use MPI_Comm_create.

    MPI_Comm_create( oldcomm, group, &newcomm );

The group can be created in many ways:

Creating Groups

All group creation routines create a group by specifying the members to take from an existing group.

• MPI_Group_incl specifies specific members
• MPI_Group_excl excludes specific members
• MPI_Group_range_incl and MPI_Group_range_excl use ranges of members
• MPI_Group_union and MPI_Group_intersection create a new group from two existing groups.

To get an existing group, use
    MPI_Comm_group( oldcomm, &group );
Free a group with
    MPI_Group_free( &group );
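To make the flow concrete, here is a small sketch (not part of the original slides) that builds a communicator containing only the even-ranked processes of MPI_COMM_WORLD; the choice of members is arbitrary and the snippet assumes the usual mpi.h/stdlib.h includes:

    MPI_Group world_group, even_group;
    MPI_Comm  even_comm;
    int       i, n, size, *ranks;

    MPI_Comm_size( MPI_COMM_WORLD, &size );
    ranks = (int *) malloc( ((size+1)/2) * sizeof(int) );
    for (i = 0, n = 0; i < size; i += 2) ranks[n++] = i;   /* even ranks only */

    MPI_Comm_group( MPI_COMM_WORLD, &world_group );
    MPI_Group_incl( world_group, n, ranks, &even_group );
    MPI_Comm_create( MPI_COMM_WORLD, even_group, &even_comm );
    /* even_comm is MPI_COMM_NULL on the odd-ranked processes */
    MPI_Group_free( &even_group );
    MPI_Group_free( &world_group );
    free( ranks );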

Buffering issues

Where does data go when you send it?  One possibility is:

[Figure: data in A on Process 1 is copied into a local buffer, sent across the network into a local buffer on Process 2, and finally copied into B.]

Better buffering

This is not very efficient.  There are three copies in addition to the exchange of data between processes.  We prefer

[Figure: data moves directly from A on Process 1 to B on Process 2.]

But this requires that either MPI_Send not return until the data has been delivered, or that we allow a send operation to return before completing the transfer.  In this case, we need to test for completion later.

Blocking and Non-Blocking communication

• So far we have used blocking communication:
  – MPI_Send does not complete until buffer is empty (available for reuse).
  – MPI_Recv does not complete until buffer is full (available for use).
• Simple, but can be "unsafe":

    Process 0          Process 1
    Send(1)            Send(0)
    Recv(1)            Recv(0)

  Completion depends in general on size of message and amount of system buffering.

► Send works for small enough messages but fails when messages get too large.  Too large ranges from zero bytes to 100's of Megabytes.

Some Solutions to the "Unsafe" Problem

• Order the operations more carefully:

    Process 0          Process 1
    Send(1)            Recv(0)
    Recv(1)            Send(0)

• Supply receive buffer at same time as send, with MPI_Sendrecv:

    Process 0          Process 1
    Sendrecv(1)        Sendrecv(0)

• Use non-blocking operations:

    Process 0          Process 1
    Isend(1)           Isend(0)
    Irecv(1)           Irecv(0)
    Waitall            Waitall

• Use MPI_Bsend

MPI's Non-Blocking Operations

Non-blocking operations return (immediately) "request handles" that can be waited on and queried:

• MPI_Isend(start, count, datatype, dest, tag, comm, request)
• MPI_Irecv(start, count, datatype, source, tag, comm, request)
• MPI_Wait(request, status)

One can also test without waiting: MPI_Test( request, flag, status )
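As a short illustrative sketch (not from the original slides), two processes can exchange data without the ordering worries of the blocking case by posting both operations and then waiting; the buffer names, count, partner rank `other`, and tag 0 are assumptions of the example:

    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Irecv( recvbuf, count, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0] );
    MPI_Isend( sendbuf, count, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1] );
    /* ... unrelated computation can overlap with the transfer here ... */
    MPI_Waitall( 2, reqs, stats );   /* both transfers are complete after this */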

Multiple completions

It is often desirable to wait on multiple requests.  An example is a master/slave program, where the master waits for one or more slaves to send it a message.

• MPI_Waitall(count, array_of_requests, array_of_statuses)
• MPI_Waitany(count, array_of_requests, index, status)
• MPI_Waitsome(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)

There are corresponding versions of test for each of these.

► MPI_WAITSOME and MPI_TESTSOME may be used to implement master/slave algorithms that provide fair access to the master by the slaves.

Fairness

What happens with this program:

    #include "mpi.h"
    #include <stdio.h>
    int main(argc, argv)
    int argc;
    char **argv;
    {
        int rank, size, i, buf[1];
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        if (rank == 0) {
            for (i=0; i<100*(size-1); i++) {
                MPI_Recv( buf, 1, MPI_INT, MPI_ANY_SOURCE,
                          MPI_ANY_TAG, MPI_COMM_WORLD, &status );
                printf( "Msg from %d with tag %d\n",
                        status.MPI_SOURCE, status.MPI_TAG );
            }
        }
        else {
            for (i=0; i<100; i++)
                MPI_Send( buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD );
        }
        MPI_Finalize();
        return 0;
    }

Fairness in message-passing

A parallel algorithm is fair if no process is effectively ignored.  In the preceding program, processes with low rank (like process zero) may be the only ones whose messages are received.

MPI makes no guarantees about fairness.  However, MPI makes it possible to write efficient, fair programs.

Providing Fairness

One alternative is

    #define large 128
    MPI_Request requests[large];
    MPI_Status  statuses[large];
    int         indices[large];
    int         buf[large];
    for (i=1; i<size; i++)
        MPI_Irecv( buf+i, 1, MPI_INT, i,
                   MPI_ANY_TAG, MPI_COMM_WORLD, &requests[i-1] );
    while (not done) {
        MPI_Waitsome( size-1, requests, &ndone, indices, statuses );
        for (i=0; i<ndone; i++) {
            j = indices[i];                  /* request j came from rank j+1 */
            printf( "Msg from %d with tag %d\n",
                    statuses[i].MPI_SOURCE, statuses[i].MPI_TAG );
            MPI_Irecv( buf+j+1, 1, MPI_INT, j+1,
                       MPI_ANY_TAG, MPI_COMM_WORLD, &requests[j] );
        }
    }

Providing Fairness (Fortran)

One alternative is

          parameter( large = 128 )
          integer requests(large)
          integer statuses(MPI_STATUS_SIZE,large)
          integer indices(large)
          integer buf(large)
          logical done
          do 10 i = 1,size-1
     10      call MPI_Irecv( buf(i), 1, MPI_INTEGER, i,
         *                   MPI_ANY_TAG, MPI_COMM_WORLD,
         *                   requests(i), ierr )
     20   call MPI_Waitsome( size-1, requests, ndone,
         *                   indices, statuses, ierr )
          do 30 i=1, ndone
             j = indices(i)
             print *, 'Msg from ', statuses(MPI_SOURCE,i), ' with tag',
         *            statuses(MPI_TAG,i)
             call MPI_Irecv( buf(j), 1, MPI_INTEGER, j,
         *                   MPI_ANY_TAG, MPI_COMM_WORLD,
         *                   requests(j), ierr )
             done = ...
     30   continue
          if (.not. done) then
             goto 20
          endif

Exercise - Fairness

Objective: Use nonblocking communications

Complete the program fragment on "providing fairness".  Make sure that you leave no uncompleted requests.  How would you test your program?

More on nonblocking communication

In applications where the time to send data between processes is large, it is often helpful to cause communication and computation to overlap.  This can easily be done with MPI's non-blocking routines.

For example, in a 2-D finite difference mesh, moving data needed for the boundaries can be done at the same time as computation on the interior.

    MPI_Irecv( ... each ghost edge ... );
    MPI_Isend( ... data for each ghost edge ... );
    ... compute on interior
    while (still some uncompleted requests) {
        MPI_Waitany( ... requests ... )
        if (request is a receive)
            ... compute on that edge ...
    }

Note that we call MPI_Waitany several times.  This exploits the fact that after a request is satisfied, it is set to MPI_REQUEST_NULL, and that this is a valid request object to the wait and test routines.

Communication Modes

MPI provides multiple modes for sending messages:

• Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun.  (Unsafe programs become incorrect and usually deadlock within an MPI_Ssend.)
• Buffered mode (MPI_Bsend): the user supplies the buffer to the system for its use.  (User supplies enough memory to make unsafe program safe).
• Ready mode (MPI_Rsend): user guarantees that matching receive has been posted.
  – allows access to fast protocols
  – undefined behavior if the matching receive is not posted

Non-blocking versions: MPI_Issend, MPI_Irsend, MPI_Ibsend

Note that an MPI_Recv may receive messages sent with any send mode.

Buffered Send

MPI provides a send routine that may be used when MPI_Isend is awkward to use (e.g., lots of small messages).

MPI_Bsend makes use of a user-provided buffer to save any messages that can not be immediately sent.

    int bufsize;
    char *buf = malloc(bufsize);
    MPI_Buffer_attach( buf, bufsize );
    ...
    MPI_Bsend( ... same as MPI_Send ... );
    ...
    MPI_Buffer_detach( &buf, &bufsize );

The MPI_Buffer_detach call does not complete until all messages are sent.

► The performance of MPI_Bsend depends on the implementation of MPI and may also depend on the size of the message.  For example, making a message one byte longer may cause a significant drop in performance.

Reusing the same buffer

Consider a loop

    MPI_Buffer_attach( buf, bufsize );
    while (!done) {
        ...
        MPI_Bsend( ... );
    }

where the buf is large enough to hold the message in the MPI_Bsend.  This code may fail because the buffer may still hold a message that has not yet been delivered; detaching and re-attaching the buffer inside the loop forces completion:

    {
        void *buf; int bufsize;
        MPI_Buffer_detach( &buf, &bufsize );
        MPI_Buffer_attach( buf, bufsize );
    }

Other Point-to-Point Features

• MPI_SENDRECV, MPI_SENDRECV_REPLACE
• MPI_CANCEL
• Persistent communication requests

Datatypes and Heterogeneity

MPI datatypes have two main purposes

• Heterogeneity — parallel programs between different processors
• Noncontiguous data — structures, vectors with non-unit stride, etc.

Basic datatypes, corresponding to the underlying language, are predefined.

The user can construct new datatypes at run time; these are called derived datatypes.
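As a small illustration (not part of the original slides), the simplest derived datatype builds a fixed-size block out of an existing type; MPI_Type_contiguous is used here, and the element count, buffer, destination, tag, and communicator are assumptions of the example:

    MPI_Datatype triple;

    /* a unit of three consecutive doubles, e.g. an (x,y,z) coordinate */
    MPI_Type_contiguous( 3, MPI_DOUBLE, &triple );
    MPI_Type_commit( &triple );
    /* send 10 such triples (30 doubles) in one message */
    MPI_Send( coords, 10, triple, dest, tag, comm );
    MPI_Type_free( &triple );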

Datatypes in MPI

Elementary:  Language-defined types (e.g., MPI_INT or MPI_DOUBLE_PRECISION)
Contiguous:  Vector with stride of one
Vector:      Separated by constant "stride"
Hvector:     Vector, with stride in bytes
Indexed:     Array of indices (for scatter/gather)
Hindexed:    Indexed, with indices in bytes
Struct:      General mixed types (for C structs etc.)

Basic Datatypes (Fortran)

    MPI datatype              Fortran datatype
    MPI_INTEGER               INTEGER
    MPI_REAL                  REAL
    MPI_DOUBLE_PRECISION      DOUBLE PRECISION
    MPI_COMPLEX               COMPLEX
    MPI_LOGICAL               LOGICAL
    MPI_CHARACTER             CHARACTER(1)
    MPI_BYTE
    MPI_PACKED

Basic Datatypes (C)

    MPI datatype           C datatype
    MPI_CHAR               signed char
    MPI_SHORT              signed short int
    MPI_INT                signed int
    MPI_LONG               signed long int
    MPI_UNSIGNED_CHAR      unsigned char
    MPI_UNSIGNED_SHORT     unsigned short int
    MPI_UNSIGNED           unsigned int
    MPI_UNSIGNED_LONG      unsigned long int
    MPI_FLOAT              float
    MPI_DOUBLE             double
    MPI_LONG_DOUBLE        long double
    MPI_BYTE
    MPI_PACKED

Vectors

[Figure: a 5 x 7 array of consecutive elements; the highlighted entries 3, 10, 17, 24, 31 are separated by a stride of 7.]

To specify this row (in C order), we can use

    MPI_Type_vector( count, blocklen, stride, oldtype, &newtype );
    MPI_Type_commit( &newtype );

The exact code for this is

    MPI_Type_vector( 5, 1, 7, MPI_DOUBLE, &newtype );
    MPI_Type_commit( &newtype );

Structures

Structures are described by arrays of

• number of elements (array_of_len)
• displacement or location (array_of_displs)
• datatype (array_of_types)

    MPI_Type_struct( count, array_of_len, array_of_displs,
                     array_of_types, &newtype );

Example: Structures

    struct {
        char   display[50];    /* Name of display               */
        int    maxiter;        /* max # of iterations           */
        double xmin, ymin;     /* lower left corner of rectangle */
        double xmax, ymax;     /* upper right corner            */
        int    width;          /* of display in pixels          */
        int    height;         /* of display in pixels          */
    } cmdline;

    /* set up 4 blocks */
    int          blockcounts[4] = {50,1,4,2};
    MPI_Datatype types[4];
    MPI_Aint     displs[4];
    MPI_Datatype cmdtype;

    /* initialize types and displs with addresses of items */
    MPI_Address( &cmdline.display, &displs[0] );
    MPI_Address( &cmdline.maxiter, &displs[1] );
    MPI_Address( &cmdline.xmin,    &displs[2] );
    MPI_Address( &cmdline.width,   &displs[3] );
    types[0] = MPI_CHAR;
    types[1] = MPI_INT;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_INT;
    for (i = 3; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct( 4, blockcounts, displs, types, &cmdtype );
    MPI_Type_commit( &cmdtype );

Strides

The extent of a datatype is (normally) the distance between the first and last member.

[Figure: memory locations specified by a datatype, with LB, UB, and EXTENT marked.]

You can set an artificial extent by using MPI_UB and MPI_LB in MPI_Type_struct.

Vectors revisited

This code creates a datatype for an arbitrary number of elements in a row of an array stored in Fortran order (column first).

    int blens[2], displs[2];
    MPI_Datatype types[2], rowtype;
    blens[0]  = 1;
    blens[1]  = 1;
    displs[0] = 0;
    displs[1] = number_in_column * sizeof(double);
    types[0]  = MPI_DOUBLE;
    types[1]  = MPI_UB;
    MPI_Type_struct( 2, blens, displs, types, &rowtype );
    MPI_Type_commit( &rowtype );

To send n elements, you can use

    MPI_Send( buf, n, rowtype, ... );

Structures revisited

When sending an array of a structure, it is important to ensure that MPI and the C compiler have the same value for the size of each structure.  The most portable way to do this is to add an MPI_UB to the structure definition for the end of the structure.  In the previous example, this is

    /* initialize types and displs with addresses of items */
    MPI_Address( &cmdline.display, &displs[0] );
    MPI_Address( &cmdline.maxiter, &displs[1] );
    MPI_Address( &cmdline.xmin,    &displs[2] );
    MPI_Address( &cmdline.width,   &displs[3] );
    MPI_Address( &cmdline+1,       &displs[4] );
    types[0] = MPI_CHAR;
    types[1] = MPI_INT;
    types[2] = MPI_DOUBLE;
    types[3] = MPI_INT;
    types[4] = MPI_UB;
    for (i = 4; i >= 0; i--)
        displs[i] -= displs[0];
    MPI_Type_struct( 5, blockcounts, displs, types, &cmdtype );
    MPI_Type_commit( &cmdtype );

Interleaving data

By moving the UB inside the data, you can interleave data.

Consider the matrix

[Figure: a matrix of entries 0 through 39, stored with 0-7 in the first column, 8-15 in the second, and so on through 32-39.]

We wish to send 0-3, 8-11, 16-19, and 24-27 to process 0; 4-7, 12-15, 20-23, and 28-31 to process 1; etc.  How can we do this with MPI_Scatterv?

An interleaved datatype

    MPI_Type_vector( 4, 4, 8, MPI_DOUBLE, &vec );

defines a block of this matrix.

    blens[0]  = 1;              blens[1]  = 1;
    types[0]  = vec;            types[1]  = MPI_UB;
    displs[0] = 0;              displs[1] = sizeof(double);
    MPI_Type_struct( 2, blens, displs, types, &block );

defines a block whose extent is just 1 entry.

Scattering a Matrix

We set the displacements for each block as the location of the first element in the block.  This works because MPI_Scatterv uses the extents to determine the start of each piece to send.

    scdispls[0] = 0;
    scdispls[1] = 4;
    scdispls[2] = 32;
    scdispls[3] = 36;
    MPI_Scatterv( sendbuf, sendcounts, scdispls, block,
                  recvbuf, nx * ny, MPI_DOUBLE, 0,
                  MPI_COMM_WORLD );

► How would you use the topology routines to make this more general?

Exercises - datatypes

Objective: Learn about datatypes

1. Write a program to send rows of a matrix (stored in column-major form) to the other processors.
   Let processor 0 have the entire matrix, which has as many rows as processors.
   Processor 0 sends row i to processor i.  Processor i reads that row into a local array that holds only that row.  That is, processor 0 has a matrix A(N,M) while the other processors have a row B(M).
   (a) Write the program to handle the case where the matrix is square.
   (b) Write the program to handle a number of columns read from the terminal.
   C programmers may send columns of a matrix stored in row-major form if they prefer.
   If you have time, try one of the following.  If you don't have time, think about how you would program these.
2. Write a program to transpose a matrix, where each processor has a part of the matrix.  Use topologies to define a 2-Dimensional partitioning

   of the matrix across the processors, and assume that all processors have the same size submatrix.
   (a) Use MPI_Send and MPI_Recv to send the block, transpose the block.
   (b) Use MPI_Sendrecv instead.
   (c) Create a datatype that allows you to receive the block already transposed.
3. Write a program to send the "ghostpoints" of a 2-Dimensional mesh to the neighboring processors.  Assume that each processor has the same size subblock.
   (a) Use topologies to find the neighbors
   (b) Define a datatype for the "rows"
   (c) Use MPI_Sendrecv or MPI_IRecv and MPI_Send with MPI_Waitall.
   (d) Use MPI_Isend and MPI_Irecv to start the communication, do some computation on the interior, and then use MPI_Waitany to process the boundaries as they arrive
   The same approach works for general data structures, such as unstructured meshes.
4. Do 3, but for 3-Dimensional meshes.  You will need MPI_Type_Hvector.

Tools for writing libraries

MPI is specifically designed to make it easier to write message-passing libraries

• Communicators solve tag/source wild-card problem
• Attributes provide a way to attach information to a communicator

Private communicators

One of the first things that a library should normally do is create a private communicator.  This allows the library to send and receive messages that are known only to the library.

    MPI_Comm_dup( old_comm, &new_comm );
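A minimal sketch of this pattern (not from the original slides); the function names lib_init and lib_finish are invented, and the point is only that the library keeps the duplicated communicator in its own state so its messages can never match the caller's:

    static MPI_Comm lib_comm = MPI_COMM_NULL;   /* library-private communicator */

    void lib_init( MPI_Comm user_comm )
    {
        /* all later library traffic uses lib_comm, never user_comm */
        MPI_Comm_dup( user_comm, &lib_comm );
    }

    void lib_finish( void )
    {
        MPI_Comm_free( &lib_comm );
    }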

Attributes

Attributes are data that can be attached to one or more communicators.

Attributes are referenced by keyval.  Keyvals are created with MPI_KEYVAL_CREATE.

Attributes are attached to a communicator with MPI_Attr_put and their values accessed by MPI_Attr_get.

► Operations are defined for what happens to an attribute when it is copied (by creating one communicator from another) or deleted (by deleting a communicator) when the keyval is created.
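As an illustrative sketch (not part of the original slides), the basic calls fit together as follows; storing a heap-allocated int is one common convention, and comm and the value 42 are assumptions of the example:

    static int my_keyval = MPI_KEYVAL_INVALID;
    int *valp, *attrp, flag;

    if (my_keyval == MPI_KEYVAL_INVALID)     /* create the key once */
        MPI_Keyval_create( MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                           &my_keyval, (void *)0 );

    valp  = (int *) malloc( sizeof(int) );
    *valp = 42;                              /* arbitrary example value */
    MPI_Attr_put( comm, my_keyval, valp );   /* attach to the communicator */

    MPI_Attr_get( comm, my_keyval, &attrp, &flag );
    if (flag) printf( "attribute value is %d\n", *attrp );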

What is an attribute?

In C, an attribute is a pointer of type void *.  You must allocate storage for the attribute to point to (make sure that you don't use the address of a local variable).

In Fortran, it is a single INTEGER.

Examples of using attributes

• Forcing sequential operation
• Managing tags

Sequential Sections

    #include "mpi.h"
    #include <stdlib.h>

    static int MPE_Seq_keyval = MPI_KEYVAL_INVALID;

    /*@
      MPE_Seq_begin - Begins a sequential section of code.

      Input Parameters:
    . comm - Communicator to sequentialize.
    . ng   - Number in group.  This many processes are allowed to execute
             at the same time.  Usually one.
    @*/
    void MPE_Seq_begin( comm, ng )
    MPI_Comm comm;
    int      ng;
    {
        int        lidx, np;
        int        flag;
        MPI_Comm   local_comm;
        MPI_Status status;

        /* Get the private communicator for the sequential operations */
        if (MPE_Seq_keyval == MPI_KEYVAL_INVALID) {
            MPI_Keyval_create( MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                               &MPE_Seq_keyval, NULL );
        }

Sequential Sections II

        MPI_Attr_get( comm, MPE_Seq_keyval, (void *)&local_comm, &flag );
        if (!flag) {
            /* This expects a communicator to be a pointer */
            MPI_Comm_dup( comm, &local_comm );
            MPI_Attr_put( comm, MPE_Seq_keyval, (void *)local_comm );
        }
        MPI_Comm_rank( comm, &lidx );
        MPI_Comm_size( comm, &np );
        if (lidx != 0) {
            MPI_Recv( NULL, 0, MPI_INT, lidx-1, 0, local_comm, &status );
        }
        /* Send to the next process in the group unless we
           are the last process in the processor set */
        if ( (lidx % ng) < ng - 1 && lidx != np - 1) {
            MPI_Send( NULL, 0, MPI_INT, lidx + 1, 0, local_comm );
        }
    }

Sequential Sections III

    /*@
      MPE_Seq_end - Ends a sequential section of code.

      Input Parameters:
    . comm - Communicator to sequentialize.
    . ng   - Number in group.
    @*/
    void MPE_Seq_end( comm, ng )
    MPI_Comm comm;
    int      ng;
    {
        int        lidx, np, flag;
        MPI_Status status;
        MPI_Comm   local_comm;

        MPI_Comm_rank( comm, &lidx );
        MPI_Comm_size( comm, &np );
        MPI_Attr_get( comm, MPE_Seq_keyval, (void *)&local_comm, &flag );
        if (!flag)
            MPI_Abort( comm, MPI_ERR_UNKNOWN );
        /* Send to the first process in the next group OR to the first
           process in the processor set */
        if ( (lidx % ng) == ng - 1 || lidx == np - 1) {
            MPI_Send( NULL, 0, MPI_INT, (lidx + 1) % np, 0, local_comm );
        }
        if (lidx == 0) {
            MPI_Recv( NULL, 0, MPI_INT, np-1, 0, local_comm, &status );
        }
    }

Comments on sequential sections

• Note use of MPI_KEYVAL_INVALID to determine when to create a keyval
• Note use of flag on MPI_Attr_get to discover that a communicator has no attribute for the keyval

Example: Managing tags

Problem: A library contains many objects that need to communicate in ways that are not known until runtime.

Messages between objects are kept separate by using different message tags.  How are these tags chosen?

• Unsafe to use compile time values
• Must allocate tag values at runtime

Solution:

Use a private communicator and use an attribute to keep track of available tags in that communicator.

Caching tags on communicator

    #include "mpi.h"

    static int MPE_Tag_keyval = MPI_KEYVAL_INVALID;

    /*
       Private routine to delete internal storage when a
       communicator is freed.
     */
    int MPE_DelTag( comm, keyval, attr_val, extra_state )
    MPI_Comm *comm;
    int      *keyval;
    void     *attr_val, *extra_state;
    {
        free( attr_val );
        return MPI_SUCCESS;
    }

Caching tags on communicator II

    /*@
      MPE_GetTags - Returns tags that can be used in communication
                    with a communicator

      Input Parameters:
    . comm_in - Input communicator
    . ntags   - Number of tags

      Output Parameters:
    . comm_out  - Output communicator.  May be 'comm_in'.
    . first_tag - First tag available
    @*/
    int MPE_GetTags( comm_in, ntags, comm_out, first_tag )
    MPI_Comm comm_in, *comm_out;
    int      ntags, *first_tag;
    {
        int mpe_errno = MPI_SUCCESS;
        int tagval, *tagvalp, *maxval, flag;

        if (MPE_Tag_keyval == MPI_KEYVAL_INVALID) {
            MPI_Keyval_create( MPI_NULL_COPY_FN, MPE_DelTag,
                               &MPE_Tag_keyval, (void *)0 );
        }

Caching tags on communicator III

        if (mpe_errno = MPI_Attr_get( comm_in, MPE_Tag_keyval,
                                      &tagvalp, &flag ))
            return mpe_errno;

        if (!flag) {
            /* This communicator is not yet known to this system,
               so we dup it and setup the first value */
            MPI_Comm_dup( comm_in, comm_out );
            comm_in = *comm_out;
            MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &maxval, &flag );
            tagvalp = (int *)malloc( 2 * sizeof(int) );
            printf( "Mallocing address %x\n", tagvalp );
            if (!tagvalp) return MPI_ERR_EXHAUSTED;
            *tagvalp = *maxval;
            MPI_Attr_put( comm_in, MPE_Tag_keyval, tagvalp );
            return MPI_SUCCESS;
        }

Caching tags on communicator IV

        *comm_out = comm_in;
        if (*tagvalp < ntags) {
            /* Error, out of tags.  Another solution would be to do
               an MPI_Comm_dup. */
            return MPI_ERR_INTERN;
        }
        *first_tag = *tagvalp - ntags;
        *tagvalp   = *first_tag;

        return MPI_SUCCESS;
    }

Caching tags on communicator V

    /*@
      MPE_ReturnTags - Returns tags allocated with MPE_GetTags.

      Input Parameters:
    . comm      - Communicator to return tags to
    . first_tag - First of the tags to return
    . ntags     - Number of tags to return.
    @*/
    int MPE_ReturnTags( comm, first_tag, ntags )
    MPI_Comm comm;
    int      first_tag, ntags;
    {
        int *tagvalp, flag, mpe_errno;

        if (mpe_errno = MPI_Attr_get( comm, MPE_Tag_keyval,
                                      &tagvalp, &flag ))
            return mpe_errno;

        if (!flag) {
            /* Error, attribute does not exist in this communicator */
            return MPI_ERR_OTHER;
        }
        if (*tagvalp == first_tag)
            *tagvalp = first_tag + ntags;

        return MPI_SUCCESS;
    }

Caching tags on communicator VI

    /*@
      MPE_TagsEnd - Returns the private keyval.
    @*/
    int MPE_TagsEnd()
    {
        MPI_Keyval_free( &MPE_Tag_keyval );
        MPE_Tag_keyval = MPI_KEYVAL_INVALID;
    }

Commentary

• Use MPI_KEYVAL_INVALID to detect when keyval must be created
• Use flag return from MPI_ATTR_GET to detect when a communicator needs to be initialized

Exercise - Writing libraries

Objective: Use private communicators and attributes

Write a routine to circulate data to the next process, using a nonblocking send and receive operation.

    void Init_pipe( comm )
    void ISend_pipe( comm, bufin, len, datatype, bufout )
    void Wait_pipe( comm )

A typical use is

    Init_pipe( MPI_COMM_WORLD )
    for (i=0; i<n; i++) {
        ISend_pipe( comm, bufin, len, datatype, bufout );
        Do_Work( bufin, len );
        Wait_pipe( comm );
        t = bufin; bufin = bufout; bufout = t;
    }

What happens if Do_Work calls MPI routines?

► What do you need to do to clean up Init_pipe?

► How can you use a user-defined topology to determine the next process?  (Hint: see MPI_Topo_test and MPI_Cartdim_get.)

MPI Objects

► MPI has a variety of objects (communicators, groups, datatypes, etc.) that can be created and destroyed.  This section discusses the types of these data and how MPI manages them.

► This entire chapter may be skipped by beginners.

The MPI Objects

    MPI_Request     Handle for nonblocking communication, normally freed by MPI in a test or wait
    MPI_Datatype    MPI datatype.  Free with MPI_Type_free.
    MPI_Op          User-defined operation.  Free with MPI_Op_free.
    MPI_Comm        Communicator.  Free with MPI_Comm_free.
    MPI_Group       Group of processes.  Free with MPI_Group_free.
    MPI_Errhandler  MPI errorhandler.  Free with MPI_Errhandler_free.

When should objects be freed?

Consider this code

    MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
    MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1,
                      &newx );
    MPI_Type_commit( &newx );

(This creates a datatype for one face of a 3-D decomposition.)  When should newx1 be freed?

Reference counting

MPI keeps track of the use of an MPI object, and only truly destroys it when no-one is using it.  newx1 is being used by the user (the MPI_Type_vector that created it) and by the MPI_Datatype newx that uses it.

If newx1 is not needed after newx is defined, it should be freed:

    MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
    MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1,
                      &newx );
    MPI_Type_free( &newx1 );
    MPI_Type_commit( &newx );

Why reference counts

Why not just free the object?  Consider this library routine:

    void MakeDatatype( nx, ny, ly, lz, MPI_Datatype *new )
    {
        MPI_Datatype newx1;
        MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
        MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1,
                          new );
        MPI_Type_free( &newx1 );
        MPI_Type_commit( new );
    }

Without the MPI_Type_free( &newx1 ), it would be very awkward to later free newx1 when new was freed.

Tools for evaluating programs

MPI provides some tools for evaluating the performance of parallel programs.

These are

• Timer
• Profiling interface

The MPI Timer

The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    printf( "Elapsed time is %f\n", t2 - t1 );

The value returned by a single call to MPI_Wtime has little value.

► The times are local; the attribute MPI_WTIME_IS_GLOBAL may be used to determine if the times are also synchronized with each other for all processes in MPI_COMM_WORLD.

Profiling

• All routines have two entry points: MPI_... and PMPI_....
• This makes it easy to provide a single level of low-overhead routines to intercept MPI calls without any source code modifications.
• Used to provide "automatic" generation of trace files.

[Figure: the User Program calls MPI_Send and MPI_Bcast; a Profile Library intercepts MPI_Send and forwards to PMPI_Send in the MPI Library.]

    static int nsend = 0;
    int MPI_Send( start, count, datatype, dest, tag, comm )
    {
        nsend++;
        return PMPI_Send( start, count, datatype, dest, tag, comm );
    }

Writing profiling routines

The MPICH implementation contains a program for writing wrappers.

This description will write out each MPI routine that is called:

    #ifdef MPI_BUILD_PROFILING
    #undef MPI_BUILD_PROFILING
    #endif
    #include <stdio.h>
    #include "mpi.h"

    {{fnall fn_name}}
    {{vardecl int llrank}}
    PMPI_Comm_rank( MPI_COMM_WORLD, &llrank );
    printf( "[%d] Starting {{fn_name}}...\n", llrank );
    fflush( stdout );
    {{callfn}}
    printf( "[%d] Ending {{fn_name}}\n", llrank );
    fflush( stdout );
    {{endfnall}}

The command

    wrappergen -w trace.w -o trace.c

converts this to a C program.  Then compile the file `trace.c' and insert the resulting object file into your link line:

    cc -o a.out a.o ... trace.o -lpmpi -lmpi

Another profiling example

This version counts all calls and the number of bytes sent with MPI_Send, MPI_Bsend, or MPI_Isend.

    #include "mpi.h"

    {{foreachfn fn_name MPI_Send MPI_Bsend MPI_Isend}}
    static long {{fn_name}}_nbytes_{{fileno}};
    {{endforeachfn}}

    {{forallfn fn_name MPI_Init MPI_Finalize MPI_Wtime}}
    int {{fn_name}}_ncalls_{{fileno}};
    {{endforallfn}}

    {{fnall this_fn_name MPI_Finalize}}
    printf( "{{this_fn_name}} is being called.\n" );
    {{callfn}}
    {{this_fn_name}}_ncalls_{{fileno}}++;
    {{endfnall}}

    {{fn fn_name MPI_Send MPI_Bsend MPI_Isend}}
    {{vardecl int typesize}}
    {{callfn}}
    MPI_Type_size( {{datatype}}, (MPI_Aint *)&{{typesize}} );
    {{fn_name}}_nbytes_{{fileno}} += {{typesize}}*{{count}};
    {{fn_name}}_ncalls_{{fileno}}++;
    {{endfn}}

Another profiling example (con't)

    {{fn fn_name MPI_Finalize}}
    {{forallfn dis_fn}}
    if ({{dis_fn}}_ncalls_{{fileno}}) {
        printf( "{{dis_fn}}: %d calls\n",
                {{dis_fn}}_ncalls_{{fileno}} );
    }
    {{endforallfn}}
    if (MPI_Send_ncalls_{{fileno}}) {
        printf( "%d bytes sent in %d calls with MPI_Send\n",
                MPI_Send_nbytes_{{fileno}},
                MPI_Send_ncalls_{{fileno}} );
    }
    {{callfn}}
    {{endfn}}

Generating and viewing log files

Log files that contain a history of a parallel computation can be very valuable in understanding a parallel program.  The upshot and nupshot programs, provided in the MPICH and MPI-F implementations, may be used to view log files.

Generating a log file

This is very easy with the MPICH implementation of MPI.  Simply replace -lmpi with -llmpi -lpmpi -lm in the link line for your program, and relink your program.  You do not need to recompile.

On some systems, you can get a real-time animation by using the libraries -lampi -lmpe -lm -lX11 -lpmpi.

Alternately, you can use the -mpilog or -mpianim options to the mpicc or mpif77 commands.

Connecting several programs together

MPI provides support for connecting separate message-passing programs together through the use of intercommunicators.

Sending messages between different programs

Programs share MPI_COMM_WORLD.

Programs have separate and disjoint communicators.

[Figure: App1 and App2, each with its own communicator (Comm1, Comm2) inside the shared MPI_COMM_WORLD, connected by Comm_intercomm.]

Exchanging data between programs

• Form intercommunicator (MPI_INTERCOMM_CREATE)
• Send data
    MPI_Send( ..., 0, intercomm );
    MPI_Recv( buf, ..., 0, intercomm );
    MPI_Bcast( buf, ..., localcomm );

More complex point-to-point operations can also be used
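An illustrative sketch of the creation step (not from the original slides); the choice of rank 0 as local leader, MPI_COMM_WORLD as the peer communicator, the remote_leader variable, and the tag 99 are arbitrary assumptions here:

    MPI_Comm intercomm;
    /* Each group names its local leader (rank 0 here) and identifies the
       remote leader by its rank in a shared peer communicator, using an
       agreed-upon tag. */
    MPI_Intercomm_create( localcomm, 0,
                          MPI_COMM_WORLD, remote_leader,
                          99, &intercomm );
    /* ranks used with MPI_Send/MPI_Recv on intercomm refer to the remote group */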

Collective operations

Use MPI_INTERCOMM_MERGE to create an intracommunicator spanning both groups.

Final Comments

Additional features of MPI not covered in this tutorial

• Persistent Communication
• Error handling

Sharable MPI Resources

• The Standard itself:
  – As a Technical report: U. of Tennessee report
  – As postscript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps.
  – As hypertext on the World Wide Web: http://www.mcs.anl.gov/mpi
  – As a journal article: in the Fall issue of the Journal of Supercomputing Applications
• MPI Forum discussions
  – The MPI Forum email discussions and both current and earlier versions of the Standard are available from netlib.
• Books:
  – Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994
  – MPI Annotated Reference Manual, by Otto, et al., in preparation.

Sharable MPI Resources, continued

• Newsgroup:
  – comp.parallel.mpi
• Mailing lists:
  – mpi-comm@mcs.anl.gov: the MPI Forum discussion list.
  – mpi-impl@mcs.anl.gov: the implementors' discussion list.
• Implementations available by ftp:
  – MPICH is available by anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, file mpich.tar.Z.
  – LAM is available by anonymous ftp from tbag.osc.edu in the directory pub/lam.
  – The CHIMP version of MPI is available by anonymous ftp from ftp.epcc.ed.ac.uk in the directory pub/chimp/release.
• Test code repository:
  – ftp://info.mcs.anl.gov/pub/mpi/mpi-test

MPI-2

• The MPI Forum (with old and new participants) has begun a follow-on series of meetings.
• Goals
  – clarify existing draft
  – provide features users have requested
  – make extensions, not changes
• Major Topics being considered
  – dynamic process management
  – client/server
  – real-time extensions
  – "one-sided" communication (put/get, active messages)
  – portable access to MPI system state (for debuggers)
  – language bindings for C++ and Fortran-90
• Schedule
  – Dynamic processes, client/server by SC '95
  – MPI-2 complete by SC '96

Summary

• The parallel computing community has cooperated to develop a full-featured standard message-passing library interface.
• Implementations abound
• Applications beginning to be developed or ported
• MPI-2 process beginning
• Lots of MPI material available