Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
14 – HAL
1. Last few words on MPI Collective Communications
2. MPI Collective Synchronisation
[some] Available MPI COLLECTIVE Functions
• MPI_Scatterv
• Distributes data from the root to all ranks, optionally with varying sizes of data chunks (see the sketch below)
• MPI_Scatterv(sendbuf, sendcounts[], displs[], sendtype,
recvbuf, recvcount, recvtype, root, MPI_Comm)
• cf.
MPI_Scatter(sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype, root, MPI_Comm)
• MPI_Gatherv
• Collects data from all ranks onto the root, optionally with varying sizes of data chunks (also in the sketch below)
• MPI_Gatherv(sendbuf, sendcount, sendtype,
recvbuf, recvcounts[], displs[], recvtype, root, MPI_Comm)
• cf.
MPI_Gather(sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype, root, MPI_Comm)
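To make the varying-counts idea concrete, here is a minimal sketch (not from the slides; the chunk sizes are an illustrative assumption) in which rank i receives i+1 ints from rank 0, and the chunks are later gathered back:

#include <mpi.h>
#include <stdlib.h>

int main(void) {
    int rank, nProcs;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

    /* rank i gets i+1 elements; every rank can compute the layout */
    int *counts = malloc(nProcs * sizeof(int));
    int *displs = malloc(nProcs * sizeof(int));
    int total = 0;
    for (int i = 0; i < nProcs; i++) {
        counts[i] = i + 1;       /* varying chunk sizes */
        displs[i] = total;       /* offset of each chunk in the root's buffer */
        total += counts[i];
    }

    int *buf = NULL;
    if (rank == 0) {             /* only the root needs the full buffer */
        buf = malloc(total * sizeof(int));
        for (int i = 0; i < total; i++) buf[i] = i;
    }

    int *mine = malloc(counts[rank] * sizeof(int));
    MPI_Scatterv(buf, counts, displs, MPI_INT,    /* varying send counts */
                 mine, counts[rank], MPI_INT,     /* my chunk size */
                 0, MPI_COMM_WORLD);

    /* ... operate on the local chunk ... */

    MPI_Gatherv(mine, counts[rank], MPI_INT,      /* send my chunk back */
                buf, counts, displs, MPI_INT,     /* varying recv counts */
                0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}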
MPI_Scatterv — Example c/o MPI Forum
ALL
• MPI_Gather, MPI_Gatherv & MPI_Reduce
• Each produces a result on the "root" process
• Having taken contributions from all processes
• Sometimes we want to share such a result with all processes
• Naively, one could do an MPI_Bcast immediately following one of these MPI_Gather, MPI_Gatherv or MPI_Reduce calls
• More likely there's a more efficient implementation (WHY?)
• Hence:
MPI_Allgather, MPI_Allgatherv, MPI_Allreduce
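Why more efficient? A good implementation can use a single optimised communication pattern (e.g. recursive doubling) rather than funnelling everything through the root and then broadcasting it back out. A minimal sketch of MPI_Allreduce (illustrative values, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(void) {
    int rank, globalSum;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one call replaces MPI_Reduce to a root followed by MPI_Bcast */
    MPI_Allreduce(&rank, &globalSum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees sum %d\n", rank, globalSum);   /* same sum everywhere */
    MPI_Finalize();
    return 0;
}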
ALL
• MPI_Allgather, MPI_Allgatherv, MPI_Allreduce
• Each produces a result on all processes
• The MPI standard requires an identical result on each process
• What might this mean for MPI_Allreduce? (e.g. for floating-point data, the reduction must be performed so that every process ends up with the same rounded answer)
• Syntax
• Standard via mpi-forum.org
• Manual page on given system
• Implementation (& thus performance) will vary
Further collectives
• MPI_Alltoall, MPI_Alltoallv
• “transposing” data
• MPI_Reduce_scatter
• Does an element-wise reduction into a vector whose elements are then scattered (see the sketch below)
• cf. MPI_Reduce followed by MPI_Scatter
• Collective communication (data movement)
• Collective computation
• MPI_Reduce
• Data movement with some math
• Collective synchronization
• Q: are collectives blocking or non-blocking?
• Pre v3.0 of the MPI standard, only blocking collectives were available
• Similar (but not identical) to being a synchronisation point
• As of v3.0, some non-blocking variants are available
int MPI_Get_version(int *version, int *subVersion)
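A minimal sketch of MPI_Reduce_scatter (the vector contents are an illustrative assumption): each process contributes a vector of nProcs elements; the element-wise sums are formed and element i is delivered to rank i:

#include <mpi.h>
#include <stdlib.h>

int main(void) {
    int rank, nProcs;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

    /* each process contributes a vector of nProcs elements */
    int *sendbuf = malloc(nProcs * sizeof(int));
    for (int i = 0; i < nProcs; i++) sendbuf[i] = rank + i;

    /* scatter the element-wise sums: one element per rank */
    int *recvcounts = malloc(nProcs * sizeof(int));
    for (int i = 0; i < nProcs; i++) recvcounts[i] = 1;

    int myElement;   /* rank i receives element i of the reduced vector */
    MPI_Reduce_scatter(sendbuf, &myElement, recvcounts,
                       MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}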
BARRIER
[image of a barrier, c/o tradevault.co.uk]
Barrier = Synchronisation
• Why might we need to use explicit synchronisation?
• Can you think of 4 reasons…
• Debugging, to know all processes are at a given point (& maybe output info)
• To ensure things you want to have completed actually have completed
• A barrier will do this, but there are often more elegant ways (usually with less overhead)
• Timing specific sections of running code, ensuring no possible side-effects from other elements of the run-time code
• To enforce some ordering
Barrier = Synchronisation
• Whilst
MPI_Barrier(communicator)
is sometimes very useful, it can adversely affect performance
• In fact, synchronisation (e.g. over 1000s of processes) may be costly
• More in terms of time to wait for the slowest process
• But also in checking 1000s of processes
Useful Synchronisation #1
• To time what you want to time
• (without accidental timing side-effects of something else)
• Contrived example… but in real life (=> complex coding, large numbers of processes, different logic paths) such things do happen…
Outline of example
#include <mpi.h>
#include <stdio.h>

void work(int rank);   /* defined in work.c, shown below */

int main(void) {
    double go, fin;
    int myRank, dummy[100];
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* do some work */
    work(myRank);

    /* time collective comms */
    go = MPI_Wtime();
    MPI_Gather(&myRank, 1, MPI_INT,   // gather 'myRank' from all processes
               dummy, 1, MPI_INT,     // into contiguous elements of dummy[]
               0, MPI_COMM_WORLD);    // on rank 0 as root
    fin = MPI_Wtime();
    printf("'naive' time for MPI_Gather: %f seconds\n", fin - go);

    MPI_Finalize();
    return 0;
}
Question: What are we timing?
NB:
#1 work() takes 1 sec if rank==1, else takes no time at all
#2 MPI_Gather requires all processes to participate
“naïve” demo:
~mkbane/HPC_DEMOS/sync
work()
$ cat work.c
#include <stdio.h>    /* sprintf */
#include <stdlib.h>   /* system */

void work(int rank) {
    if (rank == 1) {
        char syscall[20];
        sprintf(syscall, "sleep 1");   /* build the shell command */
        system(syscall);               /* rank 1 sleeps for 1 second */
    }
    return;
}
==> let’s try and draw this…
[diagram: timeline of the ranks running the code above; all ranks except rank 1 reach the MPI_Gather immediately, while rank 1 arrives about 1 s later, so each rank's time to participate in the Gather differs]
[diagram: on the early ranks the timed interval is dominated by waiting inside the Gather for rank 1, i.e. we are measuring the time of the load imbalance, not the Gather itself]
“Fix” by use of Barrier Synchronisation
[diagram: with a barrier before the timed region, every rank enters the Gather together, so the measurement is just each rank's time to participate in the Gather]
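A sketch of how the fix might look in the earlier example (the print label is an assumption; the actual demo is in ~mkbane/HPC_DEMOS/sync):

#include <mpi.h>
#include <stdio.h>

void work(int rank);   /* as in work.c: rank 1 sleeps ~1 s, others return at once */

int main(void) {
    double go, fin;
    int myRank, dummy[100];
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    work(myRank);

    /* hold every rank here until all have finished work(), so the load
       imbalance is excluded from the timed region */
    MPI_Barrier(MPI_COMM_WORLD);

    go = MPI_Wtime();
    MPI_Gather(&myRank, 1, MPI_INT,
               dummy, 1, MPI_INT,
               0, MPI_COMM_WORLD);
    fin = MPI_Wtime();
    printf("barrier'd time for MPI_Gather: %f seconds\n", fin - go);

    MPI_Finalize();
    return 0;
}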
Demo with barrier synchronisation
Useful Synchronisation #2
• Controlled access, e.g. to write to a file
• [demo: barrier_example.c ] (a sketch of the idea follows)
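The demo source is not reproduced in the slides; what follows is a minimal sketch of the idea (an assumption, not the actual barrier_example.c): ranks take turns, with a barrier separating the turns. Note the barrier orders the writes themselves; when printing to stdout, the MPI launcher may still interleave the forwarded output.

#include <mpi.h>
#include <stdio.h>

int main(void) {
    int rank, nProcs;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

    /* exactly one rank acts per iteration; the barrier stops
       anyone racing ahead to a later turn */
    for (int turn = 0; turn < nProcs; turn++) {
        if (rank == turn) {
            printf("rank %d writing its data\n", rank);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}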
MPI: the version matters
• For pre-v3.0 MPI
• all collectives are blocking
• i.e. each process waits until it is safe to reuse the buffers it passed to the collective
• V3.0 and thereafter…
• There are also non-blocking collectives
• (beyond the scope of this course; a version check is sketched below)
int MPI_Get_version(int *version, int *subVersion)
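A minimal sketch of using this call to check, at run time, which MPI standard version the library supports (and hence whether non-blocking collectives can be expected):

#include <mpi.h>
#include <stdio.h>

int main(void) {
    int rank, version, subVersion;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Get_version(&version, &subVersion);
    if (rank == 0) {
        printf("MPI standard version %d.%d\n", version, subVersion);
        if (version >= 3)
            printf("non-blocking collectives (e.g. MPI_Ibarrier) should be available\n");
    }

    MPI_Finalize();
    return 0;
}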
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane