Advanced MPI
aka “This can get more complex!”
https://warwick.ac.uk/fac/sci/dcs/teaching/material/cs402/
MPI so far in CS402/922
Previously, in the “Lost in the post” section…
• Message Passing Interface (MPI) allows for parallelisation across multiple processors
• We send messages from one processor to another using MPI_Send and MPI_Recv (or a variation of these)
• Need to know how the data is structured
• Described by the type of the data (MPI_Datatype)
• Can be contiguous or non-contiguous (but may be slower if non-contiguous)
• Introduction to MPI (lecture from last week)
• MPI and Types (lecture from yesterday)
Collective Communications
Talking is always the most important step
• Just as in OpenMP, sometimes we need to sync up parts of the program together
• Sometimes, all the ranks want to send a message to the same rank
• Sometimes, a rank needs to send a message to all other ranks
• Sometimes, everyone needs to come together to share all their data with everyone else
• Note → All of these can be blocking or non-blocking in the same way as MPI_Send and MPI_Recv
• An all-to-1 mapping (this was seen in the example from last week)
• A 1-to-all mapping
• An all-to-all mapping
Barriers
• Making all the processors sync can allow for large blocks of communications to work more efficiently
• As such, we need to set up a barrier to stop all the ranks at the same time
• Only continue when all the ranks hit the barrier together
int MPI_Barrier(MPI_Comm comm)
Returns 0 if no error, otherwise error code is given
The communicator to form the barrier over (example – MPI_COMM_WORLD)
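A minimal usage sketch (not from the slides; the program and printed messages are made up for illustration): no rank passes the barrier until every rank in MPI_COMM_WORLD has reached it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Rank %d reached the barrier\n", rank);

    /* No rank continues until every rank in the communicator has called MPI_Barrier */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) printf("All ranks passed the barrier\n");

    MPI_Finalize();
    return 0;
}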
Multi-message MPI
aka “A copy for you, and a copy for you, and…”
1-to-all communications
I want to send a message to everyone!
• A rank has computed something, and wants to share it with the other ranks
• You could say we want to broadcast it…
[Image: MPI 4.0 Specification, Collective Communication diagram, page 189 – https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf]
Broadcasting in MPI
Shout it from the rooftops 📢📢
int MPI_Bcast(void *data, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Returns 0 if no error, otherwise an error code is given
data – the data block to be broadcast
count – the number of elements in the data block
datatype – the data type of the data block
root – the rank where the broadcast will/has originated from
comm – the communicator where the message will be broadcast to (example – MPI_COMM_WORLD)
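A minimal sketch of how MPI_Bcast might be used (the array contents and choice of root are made up): rank 0 fills a small integer array and broadcasts it, so every rank ends up with the same values.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[4] = {0, 0, 0, 0};
    if (rank == 0) {                           /* only the root fills the buffer */
        for (int i = 0; i < 4; i++) data[i] = i * 10;
    }

    /* Every rank calls MPI_Bcast: root 0 sends, all other ranks receive */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d now has data[3] = %d\n", rank, data[3]);

    MPI_Finalize();
    return 0;
}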
Scattering
I made something for each of you!
• A rank has been so busy, it has computed something for each individual rank, and wants to share each piece with its corresponding rank
• You could say we want to scatter the data…
[Image: MPI 4.0 Specification, Collective Communication diagram, page 189 – https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf]
Scattering in MPI
You get a bit of data, and you get a bit of data, and you get …!
int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
sendbuf – the data block to be scattered (only required by the root rank)
sendcount – the number of elements sent to each rank (not negative, only required by the root rank)
sendtype – the data type of the data to be sent (only required by the root rank)
recvbuf – pointer to where the received data should be stored
recvcount – the number of elements available in the receive buffer
recvtype – the data type of the data to be received
root – the rank where the scatter will/has originated from
comm – the communicator where the data will be scattered to (example – MPI_COMM_WORLD)
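A sketch of MPI_Scatter (assuming exactly 4 ranks; the array and chunk size are made up): the root splits a 16-element array into 4-element chunks, one per rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int sendbuf[16];
    if (rank == 0) {                           /* only the root needs the full array */
        for (int i = 0; i < 16; i++) sendbuf[i] = i;
    }

    int recvbuf[4];
    /* Each rank receives its own 4-element chunk of the root's array */
    MPI_Scatter(sendbuf, 4, MPI_INT, recvbuf, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d received %d..%d\n", rank, recvbuf[0], recvbuf[3]);

    MPI_Finalize();
    return 0;
}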
Scattering and Gathering
So many messages!
• Sometimes we want to do the opposite of scattering
• In this case, we gather the data
• MPI_Gather has the same interface as MPI_Scatter, but:
• All send parameters are required by all ranks
• All receive parameters are required only by the root rank
[Image: MPI 4.0 Specification, Collective Communication diagram, page 189 – https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf]
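A sketch of the gather direction (the values and buffer names are made up): each rank contributes one integer, and only the root needs a receive buffer, which ends up holding the contributions in rank order.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;                    /* this rank's local contribution */
    int *all = NULL;
    if (rank == 0) all = malloc(size * sizeof(int));   /* receive buffer only on the root */

    /* After the call, all[i] on the root holds the contribution of rank i */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("Rank %d sent %d\n", i, all[i]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}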
All___ communications
Everyone gets everything!
• We can combine broadcasts, scatters and gathers together if needed
• Each of these take the same parameters as MPI_Scatter and MPI_Gather, but all parameters are required by all ranks
• MPI_Allgather → a gather, then a broadcast of the result
• MPI_Alltoall → an all-gather, but where different ranks receive different data
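A sketch of MPI_Allgather (values made up): the same gather as above, but every rank, not just the root, needs a receive buffer and ends up with the full set of contributions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank + 1;                       /* local contribution */
    int *all = malloc(size * sizeof(int));     /* every rank needs the receive buffer */

    /* Gather one int from each rank and make the result available on all ranks */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("Rank %d sees contributions %d..%d\n", rank, all[0], all[size - 1]);

    free(all);
    MPI_Finalize();
    return 0;
}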
Reductions
Was that a bit too much?
• Just like OpenMP, we need to apply an operation over multiple ranks
• Therefore, we need a reduction function → MPI_Reduce
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
sendbuf – the data block to be sent by each rank
recvbuf – the data block where the final value is to be stored (only required by the root rank)
count – the number of elements in the data block sent (not negative)
datatype – the data type of the elements sent
op – the reduction operator
root – the rank of the root processor
comm – the communicator (example → MPI_COMM_WORLD)
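A sketch of a summing reduction (the local values are made up): each rank contributes a partial result and rank 0 receives the total.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;    /* e.g. a partial result computed on this rank */
    int total = 0;           /* only meaningful on the root after the call */

    /* Sum 'local' across all ranks; the result lands in 'total' on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("Sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}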
Operators
We’re doing surgery now‽
• The MPI library has a collection of different predefined operations
• These usually only work with MPI basic types → examples: MPI_INT, MPI_FLOAT etc…
• MPI_MAXLOC and MPI_MINLOC are more complex, and require a custom struct
• Can’t find an operator that works for you? You can create (and free) your own:
int MPI_Op_create(MPI_User_function *user_fn, int commute, MPI_Op *op)
int MPI_Op_free(MPI_Op *op)
MPI Operator (MPI_Op) – Explanation
MPI_MAX – Maximum value
MPI_MIN – Minimum value
MPI_SUM – Sum of all values
MPI_PROD – Product of all values
MPI_LAND – AND (Logical)
MPI_BAND – AND/& (Binary)
MPI_LOR – OR (Logical)
MPI_BOR – OR/| (Binary)
MPI_LXOR – XOR (Logical)
MPI_BXOR – XOR/^ (Binary)
MPI_MAXLOC – Max value + location
MPI_MINLOC – Min value + location
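A sketch of MPI_Op_create (the operator itself, abs_max, is invented for illustration): a commutative operator that keeps whichever value has the larger magnitude, used exactly like a predefined MPI_Op.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* User-defined combine function: keep the value with the larger magnitude */
void abs_max(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype) {
    (void)datatype;                            /* unused: this sketch only handles MPI_INT */
    int *in = (int *)invec;
    int *inout = (int *)inoutvec;
    for (int i = 0; i < *len; i++) {
        if (abs(in[i]) > abs(inout[i])) inout[i] = in[i];
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op op;
    MPI_Op_create(abs_max, 1, &op);            /* 1 => the operation is commutative */

    int local = (rank % 2 == 0) ? -10 * (rank + 1) : rank;   /* arbitrary test values */
    int result;
    MPI_Reduce(&local, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("Value with the largest magnitude: %d\n", result);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}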
Broadcasted Reductions
Ok everyone, from now on, we all use this data!
• Often, the result is required in each of the ranks
• Therefore, we can do one of the following:
1. MPI_Reduce then an MPI_Bcast
2. MPI_Allreduce (does 1. a bit more efficiently)
• MPI_Allreduce has the same interface as MPI_Reduce but all parameters required by all ranks
• MPI_Allreduce is one of the most expensive instructions in MPI
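A sketch of a typical MPI_Allreduce use (the residual values are made up): a convergence check where every rank needs the global maximum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_err = 1.0 / (rank + 1);       /* e.g. this rank's local residual */
    double global_err;

    /* Equivalent to MPI_Reduce + MPI_Bcast: every rank gets the maximum residual */
    MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    printf("Rank %d sees global error %f\n", rank, global_err);

    MPI_Finalize();
    return 0;
}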
High-Level Optimisation Techniques in MPI
aka “How much more can we do with MPI?”
Packets
… should have been called packages …
• Data needs to be sent as a collection of packets
• MPI does not specify how the data should be sent
• Different protocols may be required for different situations
• MPI does not specify how big the packets of data should be
• Dependent on the network card
• Dependent on the network interface and interconnects
• Dependent on the memory within the system
• Can be dependent on the user
Packet Sizes
Depends on the cost…
Smaller Packets
• Copying data from memory to the network card can be interspersed
• Packets can be made available more quickly
Larger Packets
• Ratio of data to metadata (the envelope) is better
• Cost of copying memory vs the injection rate of the messages to the network
Is there an optimal packet size?
Why are you asking me?
• Not really…
• Many factors affect the optimal packet size, including:
• Application algorithm
• The data location in memory
• Hardware (network cards, interconnects etc…)
• The MPI interface treats all the messages the same → in theory, it shouldn’t matter
Domain Decomposition
It’s breaking down!
• Each processor needs its own block of data to operate on
• For physics applications, we do this by splitting up the mesh → Domain Decomposition
• Each processor has a (relatively similarly sized) block of data
• Data is only sent when required → minimises the amount of data sent
• How you do this depends on the algorithm and the mesh itself…
Mesh example
Knitting it together!
• 4 processors/ranks
• Each cell needs data from direct neighbouring cells (think deqn)
• Want to separate data equally between different processors
1D Decomposition
Split it 1 way!
• 4 processors/ranks → 4 strips of data
• Edge data needs to be sent to neighbouring processors
• Ranks 0 & 3 → 8 cells
• Ranks 1 & 2 → 16 (all) cells
2D Decomposition
Split it 2 ways!
• 4 processors/ranks → 4 blocks of data
• Edge data needs to be sent to neighbouring processors
• Each rank has to send 2×4 pieces of data
Halos
It surrounds the data!
• Trying to merge data can be difficult and expensive
• Additional computation
• Synchronisation issues
• Fix → each data block has a halo
• Duplicate of data from neighbouring ranks
• Flexible sizing, depending on access patterns
Halos in Physics Applications
• The principles of halos can be applied on many decomposition types
• More complex as you decompose in more dimensions
• The halo’s size and complexity also depend on the access pattern
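A rough sketch of a 1D halo exchange (the array size, layout and neighbour logic are invented for illustration, not taken from deqn): each rank owns N cells plus one halo cell at each end, and swaps edge cells with its neighbours using MPI_Sendrecv.

#include <mpi.h>
#include <stdio.h>

#define N 16   /* interior cells per rank (made-up size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[N+1] are the halo cells; u[1..N] are owned by this rank */
    double u[N + 2];
    for (int i = 0; i < N + 2; i++) u[i] = rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my right edge to the right neighbour, receive my left halo from the left */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my left edge to the left neighbour, receive my right halo from the right */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[N + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Rank %d halos: left=%f right=%f\n", rank, u[0], u[N + 1]);

    MPI_Finalize();
    return 0;
}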
Interesting related reads
Some of this might even be fun…
• MPI message sizes
• K. B. Ferreira and S. Levy. Evaluating MPI Message Size Summary Statistics. In 27th European MPI Users’ Group Meeting, pages 61–70, Austin, TX, 2020. Association for Computing Machinery, New York, NY.
• Communication Patterns
• D. G. Chester, S. A. Wright and S. A. Jarvis. Understanding Communication Patterns in HPCG. In UKPEW 2017, the Thirty Third Annual UK Performance Engineering Workshop, Electronic Notes in Theoretical Computer Science, vol. 340, pages 55–65, Newcastle, UK, October 2018. Elsevier.
Next lecture: Programming Models