
Microsoft PowerPoint – COMP528 HAL14 MPI collective synchronisation.pptx

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

14 – HAL

1. Last few words on MPI Collective Communications

2. MPI Collective Synchronisation


[some] Available MPI COLLECTIVE Functions

• MPI_Scatterv
  • Distributes data from the root to all ranks, with optionally varying sizes of data chunks
  • MPI_Scatterv(sendbuf, sendcounts[], displs[], sendtype,
                 recvbuf, recvcount, recvtype, root, MPI_Comm)
  • cf MPI_Scatter(sendbuf, sendcount, sendtype,
                   recvbuf, recvcount, recvtype, root, MPI_Comm)

• MPI_Gatherv
  • Collects together data from all ranks onto the root, with optionally varying sizes of data chunks
  • MPI_Gatherv(sendbuf, sendcount, sendtype,
                recvbuf, recvcounts[], displs[], recvtype, root, MPI_Comm)
  • cf MPI_Gather(sendbuf, sendcount, sendtype,
                  recvbuf, recvcount, recvtype, root, MPI_Comm)


MPI_Scatterv — Example c/o MPI Forum
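The Forum's worked example is not reproduced here. Below is a minimal sketch of the same idea (not the Forum code; the chunk sizes and buffer names are illustrative assumptions): rank 0 scatters variable-sized chunks of an array, with rank i receiving i+1 elements.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(void) {
  int rank, numProcs;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

  /* rank i is to receive i+1 elements (an illustrative choice) */
  int *sendcounts = malloc(numProcs * sizeof(int));
  int *displs     = malloc(numProcs * sizeof(int));
  int total = 0;
  for (int i = 0; i < numProcs; i++) {
    sendcounts[i] = i + 1;
    displs[i] = total;          /* offset of rank i's chunk within sendbuf */
    total += sendcounts[i];
  }

  /* only the root needs the full send buffer */
  int *sendbuf = NULL;
  if (rank == 0) {
    sendbuf = malloc(total * sizeof(int));
    for (int i = 0; i < total; i++) sendbuf[i] = i;
  }

  int recvcount = rank + 1;
  int *recvbuf = malloc(recvcount * sizeof(int));

  /* root 0 sends sendcounts[i] ints, starting at offset displs[i], to rank i */
  MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
               recvbuf, recvcount, MPI_INT, 0, MPI_COMM_WORLD);

  printf("rank %d received %d ints, first is %d\n", rank, recvcount, recvbuf[0]);

  free(sendcounts); free(displs); free(recvbuf);
  if (rank == 0) free(sendbuf);
  MPI_Finalize();
  return 0;
}

MPI_Gatherv is the mirror image: the counts[] and displs[] arrays then describe the receive side on the root.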

ALL

• MPI_Gather, MPI_Gatherv & MPI_Reduce
  • Each produces a result on the “root” process
  • having taken contributions from all processes

• Sometimes we want such a result on all processes
  • Naively, one could do an MPI_Bcast immediately following one of these MPI_Gather, MPI_Gatherv or MPI_Reduce calls
  • More likely there’s a more efficient implementation (WHY?)

• Hence:
  MPI_Allgather, MPI_Allgatherv, MPI_Allreduce (sketched below)
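For instance, a global sum that every rank needs could be written either way; a minimal sketch (the values being summed are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(void) {
  int rank;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double local = (double)(rank + 1);   /* each rank's contribution */
  double total;

  /* naive: reduce onto the root, then broadcast the result back out */
  MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  /* usually better: a single collective that leaves the sum on every rank */
  MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d: total = %f\n", rank, total);
  MPI_Finalize();
  return 0;
}

As to the WHY: a single MPI_Allreduce lets the library combine the reduction and the redistribution (e.g. with a tree or butterfly pattern) instead of funnelling everything through the root and back out again.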


ALL

• MPI_Allgather, MPI_Allgatherv, MPI_Allreduce
  • Each produces a result on all processes
  • The MPI standard requires an identical result on each process
    • What might this mean for MPI_Allreduce?

• Syntax
  • Standard via mpi-forum.org
  • Manual page on a given system

• Implementation (& thus performance) will vary


Further collectives

• MPI_Alltoall
  MPI_Alltoallv
  • “transposing” data

• MPI_Reduce_scatter
  • Does an element-wise reduction into a vector whose elements are then scattered (sketched below)
  • c.f. MPI_Reduce followed by MPI_Scatter
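A minimal sketch of MPI_Reduce_scatter (counts and buffer names are illustrative assumptions): each rank contributes a vector, the vectors are summed element-wise, and element i of the summed vector ends up on rank i.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(void) {
  int rank, numProcs;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

  /* each rank contributes a vector of numProcs elements */
  int *sendbuf    = malloc(numProcs * sizeof(int));
  int *recvcounts = malloc(numProcs * sizeof(int));
  for (int i = 0; i < numProcs; i++) {
    sendbuf[i] = rank + i;   /* illustrative values */
    recvcounts[i] = 1;       /* rank i receives one element of the reduced vector */
  }
  int myElement;

  /* element-wise sum across ranks; element i of the result is scattered to rank i */
  MPI_Reduce_scatter(sendbuf, &myElement, recvcounts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d holds reduced element %d\n", rank, myElement);

  free(sendbuf); free(recvcounts);
  MPI_Finalize();
  return 0;
}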

• Collective communication (data movement)

• Collective computation
  • e.g. MPI_Reduce: data movement with some math

• Collective synchronization

• Q: are collectives blocking or non-blocking?
  • Pre v3.0 of the MPI standard, only blocking collectives were available
    • Similar (but not identical) to being a synchronisation point
  • As of v3.0, some non-blocking variants are available

int MPI_Get_version(int *version, int *subVersion)
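A small sketch of checking, at run time, which version of the MPI standard the installed library supports:

#include <stdio.h>
#include <mpi.h>

int main(void) {
  int version, subVersion;
  MPI_Init(NULL, NULL);

  /* e.g. prints "3.1" for an MPI-3.1 library */
  MPI_Get_version(&version, &subVersion);
  printf("This library supports MPI %d.%d\n", version, subVersion);

  MPI_Finalize();
  return 0;
}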


B A R R I E R

Barrier = Synchronisation

• Why might we need to use explicit synchronisation?

• Can you think of 4 reasons…


Barrier = Synchronisation

• Why might we need to use explicit synchronisation?

• Debugging, to know all processes are at a given point (& maybe output info)

• To ensure things you want to have completed actually have completed
  • A barrier will do this, but there are more elegant ways (usually with less overhead)

• Timing specific sections of running code,
  ensuring no possible side effects from other parts of the run-time code

• To enforce some ordering


Barrier = Synchronisation

• Whilst
  MPI_Barrier(communicator)
  is sometimes very useful, it may adversely affect performance

• In fact, synchronisation (e.g. over 1000s of processes) may be costly
  • mostly in terms of time spent waiting for the slowest process
  • but also in the cost of checking in 1000s of processes


Useful Synchronisation #1

• To time what you want to time
  • (without accidentally timing side-effects of something else)

• A contrived example follows…

  but in real life (complex code, large numbers of processes,
  different logic paths) such things do happen…


Outline of example
#include <stdio.h>
#include <mpi.h>

void work(int rank);   /* defined in work.c */

int main(void) {
  double go, fin;
  int myRank, dummy[100];

  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

  /* do some work */
  work(myRank);

  /* time collective comms */
  go = MPI_Wtime();
  MPI_Gather(&myRank, 1, MPI_INT,   // gather 'myRank' from all processes
             dummy, 1, MPI_INT,     // into contiguous elements of dummy[]
             0, MPI_COMM_WORLD);    // on rank 0 as root
  fin = MPI_Wtime();

  printf("'naive' time for MPI_Gather: %f seconds\n", fin-go);
  MPI_Finalize();
}

Question
What are we timing?


NB
#1 work() takes 1 sec if rank==1, else takes no time at all
#2 MPI_Gather requires all processes to participate

“naïve” demo:
~mkbane/HPC_DEMOS/sync


work()
$ cat work.c
#include <stdio.h>    /* sprintf */
#include <stdlib.h>   /* system */

void work(int rank) {
  if (rank==1) {
    char syscall[20];
    sprintf(syscall, "sleep 1");
    system(syscall);   /* rank 1 sleeps for 1 second */
  }
  return;
}



==> let’s try and draw this…


[Diagram slides: the same code annotated with a per-rank timeline of the "time to participate in Gather". Rank 1 spends ~1 second in work() while the other ranks reach MPI_Gather straight away and wait there, so on those ranks the naive timing is really measuring the time of load imbalance rather than the gather itself.]

“Fix” by use of Barrier Synchronisation

[Diagram slide: the same timeline with a barrier inserted before the timed region, so the "time to participate in Gather" no longer includes the load imbalance.]
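A minimal sketch of the fix (assuming the same outline as before; the actual code in ~mkbane/HPC_DEMOS/sync may differ): the barrier ensures no rank starts its timer until every rank has finished work(), so the timed region covers only the gather.

#include <stdio.h>
#include <mpi.h>

void work(int rank);   /* as before: rank 1 takes ~1 second, the rest take no time */

int main(void) {
  double go, fin;
  int myRank, dummy[100];

  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

  /* do some work (load imbalanced) */
  work(myRank);

  /* synchronise first, so the timer does not include the load imbalance from work() */
  MPI_Barrier(MPI_COMM_WORLD);

  /* time collective comms */
  go = MPI_Wtime();
  MPI_Gather(&myRank, 1, MPI_INT,
             dummy, 1, MPI_INT,
             0, MPI_COMM_WORLD);
  fin = MPI_Wtime();

  printf("time for MPI_Gather after barrier: %f seconds\n", fin-go);
  MPI_Finalize();
}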

Demo with barrier synchronisation


Useful Synchronisation #2

• Controlled access, e.g. to write to a file

• [demo: barrier_example.c ] (a sketch of the idea follows below)
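One common pattern (a sketch only; not necessarily what barrier_example.c does): the ranks take turns in rank order, with a barrier separating the turns, so only one process writes at a time.

#include <stdio.h>
#include <mpi.h>

int main(void) {
  int myRank, numProcs;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
  MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

  /* each rank gets a turn, in rank order; the barriers separate the turns */
  for (int turn = 0; turn < numProcs; turn++) {
    if (myRank == turn) {
      printf("rank %d writing its output\n", myRank);
      fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  MPI_Finalize();
}

The same idea applies when appending to a shared file; for stdout the launcher may still interleave the lines it forwards.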



MPI: the version matters

• For pre-v3.0 MPI
  • all collectives are blocking
  • i.e. each process waits until it is safe to reuse the buffers it passed to the collective

• v3.0 and thereafter…
  • there are also non-blocking collectives
  • (beyond the scope of this course)

int MPI_Get_version(int *version, int *subVersion)


Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane