CSC 367 Parallel Programming
The Message Passing Paradigm: MPI
University of Toronto Mississauga, Department of Mathematical and Computational Sciences
Message passing paradigm
• Key principles:
• Partitioned address space (aimed at "shared-nothing" infrastructures)
• Requires explicit parallelization (more programming effort)
• Can achieve great scalability if done right
• Logical view:
  [Diagram: a submission host (process) dispatching work to multiple execution hosts (processes)]
• What are the implications? What advantages aside from scaling?
Message passing model
• Single Program Multiple Data (SPMD)
• All processes execute the same code, communicate via messages
• Technically, it does support executing different programs on each host/process
Why message passing?
• Heterogeneous environments, computations that need massive scaling
• Build a parallel multi-computer from lots and lots of cheap nodes
  • Nodes are connected with point-to-point links
  • Partition the data across the nodes, computation in parallel
  • If local data is needed on a remote node, send it over the interconnect
  • Computation is done collectively
• Can always add more and more nodes => bottleneck is not the number of cores or memory on a single node
  • Scale out instead of scale up
• Downside: harder to program
  • Every communication has to be hand-coded by the developer
Message passing – logical & physical view
• Logical view: each process is a separate entity with its own memory
  • Each process runs a separate instance of the same (parallel) program
  • A process cannot access another process's memory except via explicit messages
• Recall the distinction between shared address space and message passing
  • Keep in mind shared address space != shared memory
  • Message passing implies a distributed address space
• Physical view: the underlying architecture could be either distributed memory or shared memory
  • Message passing on a physical shared memory architecture: messages can be simulated by copying (or mapping) data between different processes' memory space
Important observations
• Let the processes compute independently and coordinate only rarely to exchange information
  • Communication adds overheads, so it should be done rarely
• Need to structure the program differently, to incorporate the message passing operations for exchanging data
• Need to partition data such that we maximize locality and minimize transfers
  • Recall the concepts discussed a few lectures ago
Building Blocks
• Interactions are carried out via passing messages between processes
• Building blocks: send and receive primitives – general form:
• send(void *sendbuf, int nelems, int dest)
• receive(void *recvbuf, int nelems, int source)
• Complexity lies in how the operations are carried out internally
• Example (pseudocode):

  P0:  msg = 5;
       send(&msg, 1, 1);
       msg = 200;

  P1:  recv(&msg, 1, 0);
       printf("%d", msg);

• What will P1 receive?
• Send operation may be implemented to return before the receipt is confirmed
• Supporting this kind of send is not a bad idea – why?
Blocking operations
• Only return from an operation once it's safe to do so
• Not necessarily when the msg has been received, just guarantee semantics
• Two possibilities:
  • Blocking non-buffered send/receive
  • Blocking buffered send/receive
Blocking non-buffered send/recv
• Send operation does not return until the matching receive is encountered at the receiver and the communication operation is completed
• Non-buffered handshake protocol – idling overheads:
  [Handshake timing diagrams; RTS = request to send, OTS = ok to send]
  1. Sender is first: idling at the sender
  2. Both arrive at about the same time: idling is minimized
  3. Receiver is first: idling at the receiver
Deadlocks in blocking non-buffered comm
• Deadlocks can occur with certain orderings of operations, due to blocking
• Example – can you see the deadlock? How can we fix it?

  P0:  send(&m1, 1, 1);
       recv(&m2, 1, 1);

  P1:  send(&m1, 1, 0);
       recv(&m2, 1, 0);

• Solution: switch the order in one of the processes (see the sketch below)
• But, more difficult to write code this way, and could create bugs
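One possible fix, sketched with the same send/recv pseudocode primitives (this reordering is an illustration of the idea, not part of the original slide):

  P0:  send(&m1, 1, 1);
       recv(&m2, 1, 1);

  P1:  recv(&m2, 1, 0);   // receive first ...
       send(&m1, 1, 0);   // ... then send, so P0's send can complete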
Blocking buffered send/recv
• Sender copies data into a buffer and returns once the copy to the buffer is completed
• Receiver must also store the data into a buffer until it reaches the matching recv
• Buffered transfer protocol – with or without hardware support:
  [Timing diagrams; CTB = copy to buffer, CFB = copy from buffer]
  1. Use a buffer at both sender and receiver; communication handled by hardware
  2. Buffer only on one side, e.g., sender interrupts receiver and deposits the data in a buffer (or vice-versa)
• In both cases, no idling overheads! But, now there are buffer management overheads!
Problems with blocking buffered comm
• 1. Potential problems with finite buffers
• Example – see the problem?

  P0 (producer):
  for (i = 0; i < 1000000; i++) {
      create_message(&m);
      send(&m, 1, 1);
  }

  P1 (consumer):
  for (i = 0; i < 1000000; i++) {
      recv(&m, 1, 0);
      digest_message(&m);
  }
• 2. Deadlocks still possible
• Example – see the problem?

  P0:  recv(&m1, 1, 1);
       send(&m2, 1, 1);

  P1:  recv(&m1, 1, 0);
       send(&m2, 1, 0);
• Solution is similar: break circular waits
• Unlike previously, in this protocol, deadlocks can only be caused by waits on recv
Non-blocking operations
• Why? Performance!
• User is responsible to ensure that data is not changed until it's safe
• Typically a check-status operation indicates if correctness could be violated by a previous transfer which is still in flight
• Can also be buffered or non-buffered
• Can be implemented with or without hardware support
• Summary of the send/recv protocol space:
  • Blocking, buffered: sending process returns after data has been copied into the communication buffer
  • Blocking, non-buffered: sending process blocks until the matching receive operation has been encountered
  • Blocking: send and recv semantics ensured by the corresponding operation
  • Non-blocking, buffered: sending process returns after initiating the DMA transfer to the buffer; this operation may not be completed on return
  • Non-blocking: programmer must explicitly ensure semantics by polling to verify completion
[Timing diagram: when the receive is encountered, communication is initiated; the data is unsafe to update until the transfer completes]
• Advantage: very little communication overhead, or completely masked!
Example: non-blocking non-buffered
• Sender issues a request to send and returns immediately
  [Handshake timing diagrams; RTS = request to send, OTS = ok to send; the data being sent/received is unsafe to update until the transfer completes]
  1. Without hardware support: when recv is encountered, the transfer is handled by interrupting the sender
  2. With hardware support: when recv is found, the communication hardware handles the transfer and the receiver can continue doing other work
Take-aways
• Carefully consider the implementation guarantees
• Communication protocol and hardware support
• Blocking vs. non-blocking, buffered vs. non-buffered
• Tradeoffs in terms of correctness and performance
• Need automatic correctness guarantees => might not hide communication overhead that well
• Need performance => user is responsible for correctness via polling
The Message Passing Interface (MPI)
MPI standard
• Standard library for message passing
• Write portable message passing algorithms, mostly using C or Fortran
• Rich API (over 100 routines, but only a handful are fundamental)
• Must install OpenMPI or MPICH2 (labs already have them)
• Include the mpi.h header
• Example run command: mpirun -np 8 ./myapp arg1 arg2
• Basic routines:
  • MPI_Init: initialize the MPI environment
  • MPI_Finalize: terminate the MPI environment
  • MPI_Comm_size: get the number of processes
  • MPI_Comm_rank: get the process ID of the caller
  • MPI_Send: send a message
  • MPI_Recv: receive a message
MPI basics
• MPI_Init: only called once at start by one thread, to initialize the MPI environment
  • int MPI_Init(int *argc, char ***argv);
  • Extracts and removes the MPI parts of the command line (e.g., mpirun -np 8) from argv
  • Process your application's command line arguments only after the MPI_Init
  • On success => MPI_SUCCESS, otherwise error code
• MPI_Finalize: called at the end, to do cleanup and terminate the MPI environment
  • int MPI_Finalize();
  • On success => MPI_SUCCESS, otherwise error code
  • No MPI calls allowed after this, not even a new MPI_Init!
• These calls are made by all participating processes, otherwise results in undefined behaviour
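A minimal sketch of the typical program structure (error handling simplified; this skeleton is an illustration, not from the slides):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      // MPI_Init strips the mpirun-related options from argc/argv,
      // so parse your own arguments only after this call
      if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
          fprintf(stderr, "MPI_Init failed\n");
          return 1;
      }

      /* ... application code (including argument parsing) goes here ... */

      MPI_Finalize();   // no MPI calls are allowed after this point
      return 0;
  }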
MPI Communication domains
• MPI communication domain = set of processes which are allowed to communicate with each other
• Communicators (MPI_Comm variables) store info about communication domains
• Common case: all processes need to communicate with all other processes
  • Default communicator: MPI_COMM_WORLD includes all processes
• In special cases, we may want to perform tasks in separate (or overlapping) groups of processes => define custom communicators
  • No messages for a given group will be received by processes in other groups
• Communicator size and id of the current process can be retrieved with:
  • int MPI_Comm_size(MPI_Comm comm, int *size);
  • int MPI_Comm_rank(MPI_Comm comm, int *rank);
  • The process calling these routines must be in the communicator comm
Some basic examples
• Use mpicc to compile MPI programs
• Hello world (a minimal program sketch follows below):
  • mpirun -np 8 ./hello_world
• Any program can be run using mpirun, even non-MPI ones:
  • mpirun -np 8 -hostfile hostfile.txt echo "Hello world"
  • mpirun -np 8 ls
• hostfile (optional): specify one host per line, but each host may use several processors
  • In OpenMPI, e.g. local node only: localhost slots=8
  • In MPICH2, e.g. local node only: localhost:8
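A minimal sketch of what hello_world.c might look like (the file name and printed message are assumptions for illustration, not from the slides); compile with mpicc hello_world.c -o hello_world and launch with the mpirun command above:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[]) {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes started by mpirun
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id within the communicator

      printf("Hello world from process %d of %d\n", rank, size);

      MPI_Finalize();
      return 0;
  }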
Timing measurements
• Can use MPI_Wtime()
• Example:

  double t1, t2;
  t1 = MPI_Wtime();
  /* ... code to be timed ... */
  t2 = MPI_Wtime();
  printf("Elapsed time: %f\n", t2 - t1);
MPI data types
• Equivalent to built-in C types, except for MPI_BYTE and MPI_PACKED

  MPI datatype           C datatype
  MPI_CHAR               signed char
  MPI_SHORT              signed short int
  MPI_INT                signed int
  MPI_LONG               signed long int
  MPI_UNSIGNED_CHAR      unsigned char
  MPI_UNSIGNED_SHORT     unsigned short int
  MPI_UNSIGNED           unsigned int
  MPI_UNSIGNED_LONG      unsigned long int
  MPI_FLOAT              float
  MPI_DOUBLE             double
  MPI_LONG_DOUBLE        long double
  MPI_BYTE               (no direct equivalent)
  MPI_PACKED             (no direct equivalent)
Sending and receiving messages
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Status *status);
• Each message has an associated tag to distinguish it from other messages
• The source and tag can be MPI_ANY_SOURCE / MPI_ANY_TAG
• The status can be used to get info about the MPI_Recv operation:

  typedef struct MPI_Status {
      int MPI_SOURCE;   // source of the received message
      int MPI_TAG;      // tag of the received message
      int MPI_ERROR;    // a potential error code
  } MPI_Status;

• The length of the received message can be retrieved using:
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
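A hedged sketch of how these pieces fit together (the message length of 42, tag 7, and buffer sizes are arbitrary choices for illustration): rank 0 sends part of an array, and rank 1 receives with wildcards and inspects the status afterwards:

  int data[100], recvbuf[100], count, myrank;
  MPI_Status status;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  if (myrank == 0) {
      MPI_Send(data, 42, MPI_INT, 1, 7, MPI_COMM_WORLD);        // dest = 1, tag = 7
  } else if (myrank == 1) {
      MPI_Recv(recvbuf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);
      MPI_Get_count(&status, MPI_INT, &count);                  // count == 42 here
      printf("Got %d ints from rank %d with tag %d\n",
             count, status.MPI_SOURCE, status.MPI_TAG);
  }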
Blocking vs non-blocking
• MPI_Recv is blocking
  • It returns only after the message is received and copied into the buffer!
  • The buffer can be safely reused right after MPI_Recv
• MPI_Send can be implemented either blocking or non-blocking
  • Blocking: returns only after the matching MPI_Recv is executed and the message was sent
  • Non-blocking: copy the msg into a buffer and return, without waiting for MPI_Recv
  • In both, the buffer can be safely reused right after MPI_Send
• Depending on the MPI_Send implementation, the program semantics might need to be carefully analyzed for potential problems!
• For now, think of MPI_Send as blocking
Deadlock avoidance
• Restrictions on MPI_Send/MPI_Recv, in order to avoid deadlocks
• Example: behaviour is MPI_Send implementation-dependent
int a[20], b[20], myrank;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
MPI_Send(a, 20, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Send(b, 20, MPI_INT, 1, 2, MPI_COMM_WORLD); }
else if (myrank == 1) {
MPI_Recv(b, 20, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(a, 20, MPI_INT, 0, 1, MPI_COMM_WORLD, &status); }
• What's the problem in this code?
• Fix by matching the order of the send and recv operations
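A minimal sketch of one such fix (reordering rank 1's receives to match rank 0's sends; this rewrite is an illustration, not part of the original slide):

  else if (myrank == 1) {
      MPI_Recv(a, 20, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);   // matches the first send
      MPI_Recv(b, 20, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);   // matches the second send
  }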
• We want to write "safe" programs which are not implementation dependent!
Deadlock avoidance
• Another example: circular chain of send/recv operations – what can happen?
int a[20], b[20], myrank, np;
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 20, MPI_INT, (myrank+1)%np, 1, MPI_COMM_WORLD);
MPI_Recv(b, 20, MPI_INT, (myrank-1+np)%np, 1, MPI_COMM_WORLD, &status);
• Works fine if send is buffered non-blocking, but deadlocks if send is blocking
• Must rewrite the code to make it safe:
int a[20], b[20], myrank, np;
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank % 2 == 1) {
    MPI_Send(a, 20, MPI_INT, (myrank+1)%np, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 20, MPI_INT, (myrank-1+np)%np, 1, MPI_COMM_WORLD, &status);
}
else if (myrank % 2 == 0) {
    MPI_Recv(b, 20, MPI_INT, (myrank-1+np)%np, 1, MPI_COMM_WORLD, &status);
    MPI_Send(a, 20, MPI_INT, (myrank+1)%np, 1, MPI_COMM_WORLD);
}
Send/recv simultaneously
• The previous example is a common pattern
  => Combine the MPI_Send and MPI_Recv primitives into MPI_Sendrecv
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
int dest, int sendtag,
void *recvbuf, int recvcount, MPI_Datatype recvtype,
int source, int recvtag,
MPI_Comm comm, MPI_Status *status);
• The previous program becomes easier to write:

  int a[20], b[20], myrank, np;
  MPI_Status status;
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Sendrecv(a, 20, MPI_INT, (myrank+1)%np, 1,
               b, 20, MPI_INT, (myrank-1+np)%np, 1,
               MPI_COMM_WORLD, &status);
• Restriction: send and receive buffers must be disjoint, otherwise use:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
int dest, int sendtag,
int source, int recvtag,
MPI_Comm comm, MPI_Status *status);
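For instance, a hedged sketch of an in-place ring rotation (the variable name token is an assumption; np, myrank and status are set up as in the earlier examples):

  int token = myrank;   // each rank starts with its own id
  MPI_Sendrecv_replace(&token, 1, MPI_INT,
                       (myrank + 1) % np, 0,       // send to the next rank, tag 0
                       (myrank - 1 + np) % np, 0,  // receive from the previous rank, tag 0
                       MPI_COMM_WORLD, &status);
  // token now holds the id received from the previous rank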
Overlap communication with computation
• MPI_Sendrecv_replace – blocking operation!
• Could overlap some data transmission with computations that don't depend on the data being transmitted yet
• MPI provides non-blocking primitives
  • MPI_Isend starts a send operation, and returns before the data is copied out of the buffer
  • MPI_Irecv starts a recv operation, and returns before the data is received into the buffer
int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int src, int tag, MPI_Comm comm, MPI_Request *request)
• Allocate a request object and return a pointer to it in the request argument (details later)
• Non-blocking operation can be matched with a corresponding blocking operation
Overlap communication with computation
• Any problems with using these non-blocking primitives?
  • At some point, we need the data to be guaranteed to have been sent/received, otherwise correctness is violated
• Use MPI_Test and/or MPI_Wait to determine whether a non-blocking operation has finished, and/or to wait (block) until the non-blocking operation is completed
  • Use the request object provided by MPI_Isend/MPI_Irecv to test/wait on the completion

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
  • Sets flag to non-zero if completed (if so, the request is deallocated and set to MPI_REQUEST_NULL)

int MPI_Wait(MPI_Request *request, MPI_Status *status)
  • Blocks until the request completes, then deallocates the request and sets it to MPI_REQUEST_NULL
  • Status is similar to the one for the blocking Send/Recv
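A hedged sketch of the overlap pattern (buffer names, counts, src, TAG and the compute function are placeholders, not from the slides):

  MPI_Request request;
  MPI_Status status;

  // Start receiving the next chunk of data asynchronously
  MPI_Irecv(next_chunk, CHUNK_SIZE, MPI_DOUBLE, src, TAG, MPI_COMM_WORLD, &request);

  // Do useful work that does not touch next_chunk
  compute_on(current_chunk);

  // Block until the transfer has completed; only now is next_chunk safe to read
  MPI_Wait(&request, &status);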
Avoiding deadlocks
• Recall the deadlock example for blocking operations
• Implementation dependent – may cause deadlocks
int a[20], b[20], myrank;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
MPI_Send(a, 20, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Send(b, 20, MPI_INT, 1, 2, MPI_COMM_WORLD); }
else if (myrank == 1) {
MPI_Recv(b, 20, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(a, 20, MPI_INT, 0, 1, MPI_COMM_WORLD, &status); }
• If we replace either MPI_Send or MPI_Recv operations with non-blocking versions, the code will be safe, regardless of MPI implementation, e.g.:
MPI_Irecv(b, 20, MPI_INT, 0, 2, MPI_COMM_WORLD, &requests[0]);
MPI_Irecv(a, 20, MPI_INT, 0, 1, MPI_COMM_WORLD, &requests[1]);
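The non-blocking receives must still be completed before a and b are used; a sketch of how rank 1's side might look in full (the requests/statuses arrays and the MPI_Waitall call are my addition for illustration):

  MPI_Request requests[2];
  MPI_Status statuses[2];

  MPI_Irecv(b, 20, MPI_INT, 0, 2, MPI_COMM_WORLD, &requests[0]);
  MPI_Irecv(a, 20, MPI_INT, 0, 1, MPI_COMM_WORLD, &requests[1]);

  /* ... other work that does not touch a or b ... */

  MPI_Waitall(2, requests, statuses);   // a and b are safe to use only after this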
MPI collective operations
Collective communication / computation
• Common collective operations:
  • Barrier
  • Broadcast
  • Reduction
  • Prefix sum
  • Scatter / Gather
  • All-to-all
• All processes in the communicator or group have to participate!
Barrier
[Diagram: processes rank 0, rank 1, rank 2, ..., rank n in communicator comm all call MPI_Barrier(comm)]
• Blocks until all processes in the given communicator hit the barrier

int MPI_Barrier(MPI_Comm comm)

• Warning 1: Careful with potential deadlocks!

  if (my_rank % 2 == 0) {
      // do stuff
      MPI_Barrier(MPI_COMM_WORLD);   // odd ranks never call the barrier => even ranks block forever
  }

• Warning 2: Barrier does not magically wait for pending non-blocking operations!
Broadcast
• One-to-all: send buf of source to all other processes in the group (into their buf)

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm);

• Example: MPI_Bcast(foo, 20, MPI_INT, src, comm);
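A hedged usage sketch (the array contents and the choice of rank 0 as the source are assumptions): every process, including the source, calls MPI_Bcast with the same arguments:

  int foo[20], myrank, i;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  if (myrank == 0) {
      for (i = 0; i < 20; i++) foo[i] = i;   // only the source has meaningful data initially
  }
  MPI_Bcast(foo, 20, MPI_INT, 0, MPI_COMM_WORLD);
  // After the call, every rank's foo holds the values filled in by rank 0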