Information Technology
FIT3143 – LECTURE WEEK 9b
PARALLEL ALGORITHM DESIGN – MATRIX MULTIPLICATION USING FOX & CANNON WITH VIRTUAL TOPOLOGIES & COLLECTIVE COMMUNICATIONS
Topic Overview
▪ Quick recap of the matrix multiplication algorithm
▪ Fox algorithm for parallel matrix multiplication
▪ Cannon algorithm for parallel matrix multiplication
Learning outcome(s) related to this topic
• Design and develop parallel algorithms for various parallel computing architectures (LO3)
These slides and enclosed sample code files were prepared by , PhD postgraduate student, Monash University.
FIT3143 Parallel Computing 2
Quick recap of the matrix multiplication algorithm
Matrix multiplication
Assume we have two n × n matrices, A and B, and we want to find C such that C = A × B
Serial Algorithm
Time complexity O(n³) – slow for large n
Parallel matrix multiplication algorithm – Fox
Parallel Matrix Multiplication Algorithm – General
▪ First, matrices A & B are partitioned so that each process works on a specific submatrix of each
▪ Each process performs a serial matrix multiplication on its local submatrices of A & B, and the results are stored in an accumulating sum
▪ Local submatrices are then communicated around to compute the final result
▪ There are different algorithms for how the submatrices are communicated, principally the Fox algorithm and the Cannon algorithm
Parallel Algorithm – Data Partitioning
▪ Parallel matrix multiplication requires data partitioning – common methods are block-row partitioning and block-column partitioning.
▪ Another method is grid (checkerboard) partitioning, which is used in the Fox and Cannon algorithms
▪ Say we have p processes and a square matrix, M, of size n × n, and let q = sqrt(p)
▪ M is partitioned into p square blocks (submatrices) Mi,j, where 0 <= i, j < q, each of size (n/q) × (n/q)
▪ Example: with p = 4 and n = 4, each submatrix has side n/q = 4/sqrt(4) = 2
Fox Algorithm
▪ Matrix A & B are grid partitioned into submatrices Ai,j and Bi,j
▪ In each step, a one-to-all broadcast of a block of matrix A is performed within each row group, together with a single-step circular upward shift of the blocks of matrix B within each column group
▪ Loop q times:
1. Broadcast the diagonal block Ai,i to all processors in the same row group
2. Multiply the received block of A with the local block Bi,j and add to Ci,j
3. Send (shift) the entire local block of B up one step within the same column group
4. Select block Ai,(j+1) mod q, where Ai,j is the block broadcast in the previous iteration, and broadcast it to all processors in the same row group
▪ Gather all Ci,j blocks onto the root process
Data Partitioning – Fox Algorithm
With p = 4 processors arranged in a 2 × 2 grid:
Row groups: [0, 1] and [2, 3]
Column groups: [0, 2] and [1, 3]
Worked example (p = 4 processors, q = 2):
Iteration 1:
1. Broadcast the diagonal blocks: A0,0 to row group [0, 1] and A1,1 to row group [2, 3]
2. Each processor multiplies the received block of A with its resident block of B and adds the result to its local C
3. Each processor sends (shifts) its resident block of B up one step within its column group
Iteration 2:
4. Broadcast the next blocks, A0,1 and A1,0, within their respective row groups
5. Each processor multiplies the received block of A with its (shifted) resident block of B and adds the result to its local C
Having looped q (= 2) times, all C blocks are gathered onto process 0.
Fox Algorithm
▪ Tends to be faster than simple parallel matrix multiplication for large values of n
▪ Implemented in MPI by creating row and column communicators, which enable easy communication of the A & B blocks
▪ Although these are conceptually 2D matrices, in the coding implementation they are represented using 1D arrays
▪ Difficult to adapt for non-square matrices, e.g. A (m×n) × B (n×m)
▪ Has high communication overhead because at each step we are sending (relatively) large blocks of local A & B.
Parallel matrix multiplication algorithm - Cannon
Cannon Algorithm
▪ Matrix A & B are grid partitioned into submatrices Ai,j and Bi,j
▪ Data is moved incrementally in q-1 phases (ring broadcast algorithm)
▪ Each process skews matrices A and B to align elements: block Ai,j is shifted left by i block positions and block Bi,j is shifted up by j block positions
▪ Multiply local block A with local block B and add to local C
▪ Loop q − 1 times:
1. Shift blocks (each row of A blocks left by one position and each column of B blocks up by one position)
2. Multiply local block A with local block B and add to local C
▪ Gather all Ci,j blocks onto the root process
Data Partitioning – Cannon Algorithm
Skew matrices A and B to align elements by shifting block Ai,j left by i positions and block Bi,j up by j positions.
Worked example (q = 2, so the shift loop runs q − 1 = 1 time):
• After the initial skew, multiply the local block of A with the local block of B and add to the accumulating local C
• Shift each row of A blocks left by one position and each column of B blocks up by one position
• Multiply the local block of A with the local block of B and add to local C
Having looped q − 1 (= 1) times, all C blocks are gathered onto process 0.
Cannon Algorithm
▪ Tends to be faster than simple parallel matrix multiplication for large values of n
▪ Implemented in MPI by creating row and column communicators which enable easy shifting of data
▪ Although dealing with 2D matrices, in coding implementation these are represented using 1D arrays
▪ Difficult to adapt for non-square matrices, e.g. A (m×n) × B (n×m)
▪ Has lower communication overhead than the Fox algorithm because each step uses single point-to-point shifts of A and B rather than broadcasts across entire row groups
Cannon Algorithm – Bigger Example
[Figure slides: a step-by-step walk-through of Cannon's algorithm on a larger example, showing the partial sums accumulated in each C block after every shift-and-multiply iteration; having looped q − 1 (= 2) times, all C blocks are gathered onto process 0.]
C code files implementing fox and cannon parallel matrix multiplication with MPI
▪ Fox – Click here
▪ Cannon – Click here