MPI topologies
University of Toronto Mississauga, Department of Mathematical and Computational Sciences
Topologies
• MPI view: 1D topology, linear ordering to number the processes
• Other (2D, 3D) topologies may be more natural to arrange processes
• The computation may also be suitable to a certain topology
• Map MPI processes to higher-dimensional topologies
[Figure: 16 processes mapped onto a 4×4 grid in four ways: row-major, column-major, space-filling curve, and hypercube orderings]
• Mapping should be done considering the pattern of interaction among processes
• The interconnection of physical processors is important (communication costs)
• Programs should be written independently of the underlying physical interconnect (portability!)
• Let the MPI library map the virtual topology (the one dictated by the interactions of processes in our program) to the physical topology (the underlying architecture)
Cartesian topology
• MPI allows many virtual topologies; the most common is a grid, or Cartesian, topology
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims,
int *periods, int reorder, MPI_Comm *comm_cart)
• New communicator, processes laid out in a multi-dimensional (dims) grid
• ndims = number of dimensions
• dims = sizes in each dimension
• periods = wrap-around (1) or not (0), for each dimension
• reorder = whether MPI may reassign ranks (0 = processes keep the same rank as in the old communicator)
• comm_old = old communicator
• comm_cart = new communicator to use with this topology
• Note: all processes from comm_old must call this function, but some may not be part of the Cartesian topology (if the total number of processes given by dims is less than the number in comm_old) – see the manual for more info…
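For illustration only (not from the slides), a minimal sketch that builds a 4×4 periodic grid; the dims/periods values are arbitrary example choices and at least 16 processes are assumed:

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int dims[2]    = {4, 4};      /* 4x4 process grid               */
    int periods[2] = {1, 1};      /* wrap-around in both dimensions */
    MPI_Comm cart_comm;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reorder allowed */, &cart_comm);

    if (cart_comm != MPI_COMM_NULL) {
        /* ... work with the Cartesian communicator here ... */
        MPI_Comm_free(&cart_comm);
    }
    MPI_Finalize();
    return 0;
}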
Cartesian topology
• Each process in a Cartesian topology has (2D, 3D, etc.) coordinates in the grid
   0 (0,0)   1 (0,1)   2 (0,2)   3 (0,3)   4 (0,4)
   5 (1,0)   6 (1,1)   7 (1,2)   8 (1,3)   9 (1,4)
  10 (2,0)  11 (2,1)  12 (2,2)  13 (2,3)  14 (2,4)
  15 (3,0)  16 (3,1)  17 (3,2)  18 (3,3)  19 (3,4)
  20 (4,0)  21 (4,1)  22 (4,2)  23 (4,3)  24 (4,4)
  (rank and coordinate numbers are just for illustrative purposes)
• But, MPI functions operate with process ranks to identify processes
• Translate coordinates to rank and vice-versa:
int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)
int MPI_Cart_coords(MPI_Comm comm_cart, int rank,
int maxdims, int *coords)
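A small sketch of the translation (assuming the cart_comm communicator from the previous example; my_coords and same_rank are illustrative names):

int my_rank, my_coords[2], same_rank;
MPI_Comm_rank(cart_comm, &my_rank);

MPI_Cart_coords(cart_comm, my_rank, 2, my_coords);  /* rank -> (row, col)      */
MPI_Cart_rank(cart_comm, my_coords, &same_rank);    /* (row, col) -> same rank */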
Communication in Cartesian topology
• Find neighbouring processes along a dimension of the topology
int MPI_Cart_shift(MPI_Comm comm_cart, int direction, int disp,
int *rank_src, int *rank_dest)
• Calculates the ranks and returns them in rank_src and rank_dest
• If periods[dir] = 1, the shift wraps around; otherwise an out-of-range rank_src/rank_dest is MPI_PROC_NULL
• Example: MPI_Cart_shift(cart_comm, 0, 2, &source, &dest);
[Figure: 5×5 process grid labeled A1–E5 (rows A–E, columns 1–5); each cell shows its source|dest pair after the shift. Without wrap-around: row A gets -|C, row B gets -|D, row C gets A|E, row D gets B|-, row E gets C|-.
Note: ranks labeled A–E/1–5 just for visual purposes; “-” = MPI_PROC_NULL]
Communication in Cartesian topology
• Example: MPI_Cart_shift(cart_comm, 0, 2, &source, &dest);
• With wrap-around (on creation, for the “column” direction, periods[0] = 1)
[Figure: same 5×5 grid; with wrap-around every process now has both neighbours: row A gets D|C, row B gets E|D, row C gets A|E, row D gets B|A, row E gets C|B]
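A hedged sketch of how the shift is typically used (not from the slides): find the neighbours one step away along dimension 0 and exchange a buffer with them; cart_comm and the buffer sizes are illustrative:

int up, down;                 /* up = rank one row above, down = one row below */
MPI_Cart_shift(cart_comm, 0, 1, &up, &down);

double send_row[5] = {0}, recv_row[5];
MPI_Sendrecv(send_row, 5, MPI_DOUBLE, down, 0,
             recv_row, 5, MPI_DOUBLE, up,   0,
             cart_comm, MPI_STATUS_IGNORE);
/* A neighbour equal to MPI_PROC_NULL (non-periodic edge) makes the
   corresponding send/receive complete immediately with no effect. */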
Example: matrix multiplication
• Sequential single-node algorithm: Cij = Σk Aik*Bkj, with 0 ≤ k < n
[Figure: A and B partitioned into submatrix blocks (A00, A01, …; B00, B01, …), one block per process Pij]
• How do we calculate each Cij?
• We need to broadcast all B’s blocks to each C row
• Similarly, need to broadcast all A’s blocks to each C column
• Each Pij does the submatrix multiplications and additions
• Problem: excessive memory overhead!
• Can we do better?
Cannon’s algorithm
• Idea: same partitioning, but schedule computations such that at any given time, each process uses a completely different Aik block
• After each submatrix multiplication, rotate blocks to a neighbour (immediate or strided)
• Start with blocks rotated horizontally (A) and vertically (B), shifted by 0/1/2/3 on row/column
[Figure: A and B partitioned into 4×4 blocks A00–A33 and B00–B33; the initial alignment rotates row i of A left by i positions and column j of B up by j positions]
Example: matrix multiplication
• Recall that each process Pi,j computes its own C block:
  P0,0: C00 = A00*B00 + A01*B10 + A02*B20 + A03*B30
  P1,0: C10 = A10*B00 + A11*B10 + A12*B20 + A13*B30
  P2,0: C20 = A20*B00 + A21*B10 + A22*B20 + A23*B30
  P3,0: C30 = A30*B00 + A31*B10 + A32*B20 + A33*B30
  P0,1: C01 = A00*B01 + A01*B11 + A02*B21 + A03*B31
  P1,1: C11 = A10*B01 + A11*B11 + A12*B21 + A13*B31
  P2,1: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
  P3,1: C31 = A30*B01 + A31*B11 + A32*B21 + A33*B31
  P0,2: C02 = A00*B02 + A01*B12 + A02*B22 + A03*B32
  P1,2: C12 = A10*B02 + A11*B12 + A12*B22 + A13*B32
  P2,2: C22 = A20*B02 + A21*B12 + A22*B22 + A23*B32
  P3,2: C32 = A30*B02 + A31*B12 + A32*B22 + A33*B32
  (and similarly for C03, C13, C23, C33)
• Initial alignment of A and B blocks:
  A00 A01 A02 A03   B00 B11 B22 B33
  A11 A12 A13 A10   B10 B21 B32 B03
  A22 A23 A20 A21   B20 B31 B02 B13
  A33 A30 A31 A32   B30 B01 B12 B23
• With this initial (rotated) configuration, every process has a pair of data tiles needed to calculate a partial term of its assigned C tile
Cannon’s algorithm
• Now rotate A blocks horizontally and B blocks vertically in single-step shifts
• 3 rotations are necessary to complete all the necessary submatrix multiplications
• Now each process has another pair of tiles from A and B to compute another term of its assigned C tile
• Less memory overhead than the naïve algorithm!
• Implementation…
[Figure: block layouts after the 1st rotation]
  A01 A02 A03 A00   B10 B21 B32 B03
  A12 A13 A10 A11   B20 B31 B02 B13
  A23 A20 A21 A22   B30 B01 B12 B23
  A30 A31 A32 A33   B00 B11 B22 B33
[…and after the 2nd rotation]
  A02 A03 A00 A01   B20 B31 B02 B13
  A13 A10 A11 A12   B30 B01 B12 B23
  A20 A21 A22 A23   B00 B11 B22 B33
  A31 A32 A33 A30   B10 B21 B32 B03
Cannon’s algorithm – implementation
• Follow the steps described, and use the Cartesian topology shifting and sendrecv_replace-ing according to the algorithm (a rough skeleton follows below)
• Solution will be posted – peek as needed (please do!)
• Important to do this yourself though, to get “the feel for it”!
• Remember: different mindset entirely when using message passing paradigm
• Must think distributed, and have the SPMD model in mind (every process executes the same program but differentiated by rank!)
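A rough skeleton of the blocking version, for orientation only (a sketch under assumptions, not the posted solution): a periodic √p × √p grid cart_comm is assumed, each process holds n×n blocks a, b, c (c zeroed), sqrt_p is the grid side, and local_matmul() is a placeholder for the local c += a*b:

int my_rank, coords[2], left, right, up, down, src, dst;
MPI_Comm_rank(cart_comm, &my_rank);
MPI_Cart_coords(cart_comm, my_rank, 2, coords);
MPI_Cart_shift(cart_comm, 1, 1, &left, &right);   /* neighbours along the row    */
MPI_Cart_shift(cart_comm, 0, 1, &up,   &down);    /* neighbours along the column */

/* Initial alignment: row i of A left by i, column j of B up by j */
MPI_Cart_shift(cart_comm, 1, -coords[0], &src, &dst);
MPI_Sendrecv_replace(a, n*n, MPI_DOUBLE, dst, 0, src, 0, cart_comm, MPI_STATUS_IGNORE);
MPI_Cart_shift(cart_comm, 0, -coords[1], &src, &dst);
MPI_Sendrecv_replace(b, n*n, MPI_DOUBLE, dst, 0, src, 0, cart_comm, MPI_STATUS_IGNORE);

for (int step = 0; step < sqrt_p; step++) {
    local_matmul(a, b, c, n);                     /* c += a * b             */
    MPI_Sendrecv_replace(a, n*n, MPI_DOUBLE,      /* rotate A one step left */
                         left, 0, right, 0, cart_comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv_replace(b, n*n, MPI_DOUBLE,      /* rotate B one step up   */
                         up, 0, down, 0, cart_comm, MPI_STATUS_IGNORE);
}
/* (A final re-alignment to restore the original block placement is omitted.) */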
Overlap communication with computation
• In the previous example, for each iteration we compute a submatrix multiplication, then shift the blocks using MPI_Sendrecv_replace
• Blocking operation until the matrix block is sent and received
• Overlap the computation for the current iteration with the data transmission for the next iteration? (see the sketch after this list)
• Remember MPI_Isend + MPI_Irecv!
• MPI_Isend starts a send operation, and returns before the data is copied out of the buffer
• MPI_Irecv starts a recv operation, and returns before the data is received into the buffer
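A sketch of the overlap pattern for Cannon’s loop (illustrative only, not the posted solution): the next A/B blocks are received into spare buffers a_next/b_next while the current ones are multiplied; names are assumed as in the earlier skeleton, with a, b, a_next, b_next being double pointers:

MPI_Request reqs[4];
for (int step = 0; step < sqrt_p; step++) {
    /* start shipping the blocks needed for the next iteration */
    MPI_Isend(a,      n*n, MPI_DOUBLE, left,  0, cart_comm, &reqs[0]);
    MPI_Irecv(a_next, n*n, MPI_DOUBLE, right, 0, cart_comm, &reqs[1]);
    MPI_Isend(b,      n*n, MPI_DOUBLE, up,    1, cart_comm, &reqs[2]);
    MPI_Irecv(b_next, n*n, MPI_DOUBLE, down,  1, cart_comm, &reqs[3]);

    local_matmul(a, b, c, n);          /* compute while the transfers proceed */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    double *tmp;
    tmp = a; a = a_next; a_next = tmp; /* swap buffers for the next iteration */
    tmp = b; b = b_next; b_next = tmp;
}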
Cannon’s algorithm – non-blocking implementation
• Practice at home: Adjust your previous solution to use non-blocking operations
• Solution posted for this too, for reference when you practice
• Again, very important to do this yourself though, to truly “get it”!
MPI Odds and Ends
• MPI I/O operations
• Parallel I/O
• Take-aways
Parallel IO
• MPI primitives to coordinate reads and writes to a file
• Each processor can write to its own part of the IO domain (e.g., file), without overlapping each other’s writes
• int MPI_File_open(MPI_Comm comm, const char *filename, int mode, MPI_Info info, MPI_File *handle);
• comm: MPI_COMM_WORLD or custom communicator
• filename: character string representing the file name
• mode: access mode MPI_MODE_CREATE, MPI_MODE_RDONLY, MPI_MODE_RDWR, etc.
• info: extra MPI_Info information, typically use MPI_INFO_NULL
• handle: MPI_File handle
• int MPI_File_close(MPI_File *handle);
• Close the corresponding file
• Consult the documentation for more details
MPI IO operations
• Access methods:
• MPI_File_read, MPI_File_write, MPI_File_seek – similar to their UNIX counterparts
• MPI_File_read_at, MPI_File_write_at – combine the read/write with the seek, for thread safety
• MPI_File_read_shared, MPI_File_write_shared – use a shared file pointer
• Open a file and write in parallel at different offsets (no overlap)
• Each processor must position itself at the right offset before calling MPI_File_write

char *msg;
MPI_Offset offset = msg_size * rank;
MPI_File file;
MPI_Status stat;

MPI_File_open(MPI_COMM_WORLD, "somefile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
MPI_File_seek(file, offset, MPI_SEEK_SET);
MPI_File_write(file, msg, msg_size, MPI_CHAR, &stat);
MPI_File_close(&file);
Parallel IO – collective operations
• Optimized parallel IO primitives
• All processes from the communicator must call the collective IO function
• Collective operations (coordinated implicitly):
• MPI_File_read_all
• MPI_File_write_all
• MPI_File_read_at_all
• MPI_File_write_at_all
• MPI_File_read_ordered
• MPI_File_write_ordered
• More on this in the MPI documentation…
• int MPI_File_write_at_all(MPI_File handle, MPI_Offset off, const void *buffer, int count, MPI_Datatype type, MPI_Status *status)
• All processes write count elements from buffer of a given data type, to an explicit offset in the file
char *msg;
MPI_Offset offset = msg_size * rank;
MPI_File file;
MPI_Status stat;

MPI_File_open(MPI_COMM_WORLD, "somefile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
MPI_File_write_at_all(file, offset, msg, msg_size, MPI_CHAR, &stat);
MPI_File_close(&file);

• See also: file views, MPI_File_set_view()
Take-aways and further pointers
• What are the important ideas to take away here?
• Message passing paradigm enables scale out to a large number of nodes
• High scalability but harder to program
• Optimized collective operations make our life easier, but programs must be carefully designed (and be suitable for such collective computations)
• OpenMP vs MPI?
• It depends! Ideally, use both!
• These were just the basics to understand the message-passing paradigm
• Consult the MPI documentation for more details and other advanced features
• Message-passing hybrids: PGAS (Partitioned Global Address Space)
• Unified Parallel C, UPC++, Chapel, Fortress, etc.