MPI
CMPSC 450
MPI Performance Tools
• Optimize serial performance first!
• How much time is spent in MPI calls?
• Simple printf benchmarking is not enough.
• Diagnostic tools / MPI profilers exist
• Linked in as wrapper libraries that generate diagnostic files during execution.
• A separate application is used to view the diagnostic files.
• Simple: time spent in MPI function calls
• Complex: event timeline depicting interaction between nodes
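As a concrete illustration of the simplest kind of measurement (total time spent in an MPI call), here is a minimal sketch in C using MPI_Wtime(); real profilers intercept calls through wrapper (PMPI) libraries rather than requiring hand-inserted timers.

/* Minimal manual timing of a single MPI call with MPI_Wtime().
 * Real MPI profilers intercept calls via wrapper (PMPI) libraries;
 * this sketch only illustrates the idea of "time spent in MPI". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf = (double)rank;
    double t0 = MPI_Wtime();
    MPI_Bcast(&buf, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* the call being timed */
    double t_mpi = MPI_Wtime() - t0;

    printf("rank %d: %.6f s spent in MPI_Bcast\n", rank, t_mpi);
    MPI_Finalize();
    return 0;
}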
Intel Trace Analyzer
Top: Timeline View
Lower Left: Load Balance Analysis
Lower Right: Communication Summary
Communication Parameters
• The simple alpha-beta (latency/bandwidth) model is too simplistic in practice (see the sketch below).
• MPI implementations can and will treat messages differently.
• Behavior can depend on the underlying messaging protocols between nodes.
• It also depends on message size.
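For reference, a minimal sketch of the alpha-beta model referred to above; the symbols are the conventional ones, not taken from the slides:

% Alpha-beta (latency-bandwidth) model for one point-to-point message:
% alpha = per-message latency, beta = time per byte (inverse bandwidth), n = message size in bytes
T_{\mathrm{msg}}(n) = \alpha + \beta \, n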
• Eager Protocol
• Short messages can be sent and buffered on the receiver side before the matching receive is called.
• If many short messages are sent, the remote receive buffer can become overloaded.
• Rendezvous Protocol
• Larger messages cannot be stored in a temporary buffer, so the transfer blocks until the receiver is ready; sender and receiver must synchronize.
• The buffer-size threshold between the two protocols can often be adjusted.
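A minimal sketch (run with at least 2 ranks) of why the protocol threshold matters: the same blocking MPI_Send can return quickly for a small, eagerly buffered message but block until the matching receive is posted for a large message on the rendezvous path. The message sizes below are illustrative assumptions; the actual threshold is implementation-specific and often adjustable.

/* Sketch: the same MPI_Send can behave very differently depending on size.
 * Small messages (below the eager threshold) are typically copied into a
 * receiver-side buffer and MPI_Send returns quickly; large messages use the
 * rendezvous protocol and MPI_Send may block until the receive is posted.
 * The 16-byte / 16-MB sizes are illustrative assumptions only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t small_n = 16;                 /* likely below the eager threshold   */
    size_t large_n = 16u * 1024 * 1024;  /* likely above it: rendezvous path   */
    char *buf = malloc(large_n);

    if (rank == 0) {
        MPI_Send(buf, (int)small_n, MPI_CHAR, 1, 0, MPI_COMM_WORLD); /* usually returns fast   */
        MPI_Send(buf, (int)large_n, MPI_CHAR, 1, 1, MPI_COMM_WORLD); /* may block until recv'd */
    } else if (rank == 1) {
        MPI_Recv(buf, (int)small_n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, (int)large_n, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}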
Ring Communication Example
(Figure slides: each process exchanges data with its ring neighbors using blocking sends and receives; this pattern can deadlock.)
• Possible Solutions:
• Change the order of sends and receives
• Use non-blocking functions
• Use blocking point-to-point functions that are guaranteed not to deadlock (MPI_Sendrecv())
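A minimal sketch of the MPI_Sendrecv() variant named above: every rank sends to its right neighbor and receives from its left neighbor in a single call, so the pattern cannot deadlock regardless of the underlying protocol.

/* Sketch of the MPI_Sendrecv() solution for ring communication:
 * every rank sends to its right neighbor and receives from its left
 * neighbor in one combined call, so no ordering of sends and receives
 * can deadlock. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* destination */
    int left  = (rank - 1 + size) % size;   /* source      */
    int send_val = rank, recv_val = -1;

    MPI_Sendrecv(&send_val, 1, MPI_INT, right, 0,
                 &recv_val, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recv_val, left);
    MPI_Finalize();
    return 0;
}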
Contention
• Contention occurs when more than one node uses the same resource
• Often related to the physical network topology
• Multiple processes on one node
• Bottlenecks in networks
• Network topology not fully non-blocking
• Possible Solutions:
• Reduce network communication overhead
• Reduce message volume
Reduce Communication Overhead
• Domain Decomposition (partitioning) matters
Mapping – default case
• Given four 8-core processors, the Cartesian mapping of ranks to cores makes a significant difference in the number of inter-processor communication links.
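One hedged way to let the library influence this mapping is a Cartesian communicator with rank reordering enabled; whether reordering actually improves locality depends on the MPI implementation. A minimal sketch:

/* Sketch: create a 2-D Cartesian communicator and let MPI reorder ranks
 * (reorder = 1) so the library may map grid neighbors onto nearby cores.
 * Whether this actually improves the mapping is implementation-dependent. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};                 /* let MPI choose the grid shape */
    MPI_Dims_create(size, 2, dims);

    int periods[2] = {0, 0};              /* non-periodic grid */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart);

    MPI_Comm_rank(cart, &rank);
    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);
    printf("cart rank %d -> grid coords (%d,%d) of %dx%d\n",
           rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}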
Mapping – best case
Mapping – worst case
Message Aggregation
• The cost of small messages is dominated by latency and per-message overhead.
• Combining several smaller messages into one larger one may improve communication efficiency.
• MPI derived datatypes allow for strided data transfer.
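A minimal sketch of strided transfer with a derived datatype: one column of a row-major matrix is described by MPI_Type_vector and sent as a single message instead of many small ones (the matrix size and tag are illustrative; run with at least 2 ranks).

/* Sketch: send one column of a row-major N x N matrix in a single message
 * using a derived datatype (MPI_Type_vector) instead of N tiny sends. */
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Datatype column;
    /* N blocks of 1 double, separated by a stride of N doubles = one column */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    double a[N][N];
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i * N + j;
        MPI_Send(&a[0][1], 1, column, 1, 0, MPI_COMM_WORLD);   /* column 1 */
    } else if (rank == 1) {
        double col[N];
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < N; i++)
            printf("col[%d] = %g\n", i, col[i]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}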
Non-blocking vs Asynchronous Communication
• There is no guarantee that non-blocking communication actually proceeds in the background and frees the calling process to do useful work; progress may only be made inside later MPI calls.
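A minimal sketch of the distinction: MPI_Isend/MPI_Irecv return immediately, but many implementations only make progress on the transfer inside subsequent MPI calls, so the loop below calls MPI_Test while doing local work. do_some_work() is a hypothetical placeholder for computation.

/* Sketch: non-blocking transfer with attempted computation/communication
 * overlap. MPI_Isend returns immediately, but many implementations only
 * progress the transfer inside MPI calls, so MPI_Test is called regularly.
 * do_some_work() is a hypothetical placeholder for local computation. */
#include <mpi.h>

static void do_some_work(void) { /* placeholder for useful local work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { COUNT = 1 << 20 };
    static double buf[COUNT];
    MPI_Request req = MPI_REQUEST_NULL;
    int done = 1;

    if (rank == 0) {
        MPI_Isend(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        done = 0;
    } else if (rank == 1) {
        MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        done = 0;
    }

    while (!done) {
        do_some_work();                            /* overlap: local computation */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* also drives MPI progress   */
    }

    MPI_Finalize();
    return 0;
}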
Collective Communication
• A one-by-one reduction scales as T(p).
• It can be reorganized in a tree-like fashion, reducing the execution time to T(log p).
• MPI_Reduce can sometimes realize this optimization.
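A minimal sketch contrasting the two: a hand-written rank-by-rank reduction that scales as T(p) versus MPI_Reduce, which the library is free to implement with a tree-based algorithm. naive_linear_sum() is a hypothetical helper written for this example.

/* Sketch: global sum two ways.
 * naive_linear_sum(): rank 0 receives from every other rank one by one, T(p).
 * MPI_Reduce(): the library may use a tree-based algorithm, roughly T(log p). */
#include <mpi.h>
#include <stdio.h>

static double naive_linear_sum(double local, int rank, int size) {
    double total = local;
    if (rank == 0) {
        for (int src = 1; src < size; src++) {
            double v;
            MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += v;
        }
    } else {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    return total;  /* only meaningful on rank 0 */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0;
    double linear = naive_linear_sum(local, rank, size);

    double reduced = 0.0;
    MPI_Reduce(&local, &reduced, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("linear sum = %g, MPI_Reduce sum = %g\n", linear, reduced);

    MPI_Finalize();
    return 0;
}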