
Information Technology
FIT3143 – LECTURE WEEK 9b
PARALLEL ALGORITHM DESIGN – MATRIX MULTIPLICATION USING FOX & CANNON WITH VIRTUAL TOPOLOGIES & COLLECTIVE COMMUNICATIONS


Topic Overview
▪ Quick recap of the matrix multiplication algorithm
▪ Fox algorithm for parallel matrix multiplication
▪ Cannon algorithm for parallel matrix multiplication
Learning outcome(s) related to this topic
• Design and develop parallel algorithms for various parallel computing architectures (LO3)
These slides and enclosed sample code files were prepared by , PhD postgraduate student, Monash University.
FIT3143 Parallel Computing 2

Quick recap of the matrix multiplication algorithm

Matrix multiplication
Assume we have two n x n matrices, A and B, and we want to find C such that C = A x B, where each element Ci,j is the sum over k of Ai,k x Bk,j.

Matrix Multiplication
(Slides 5–9: worked example illustrating the element-by-element computation of C = A x B – figures omitted.)

Serial Algorithm
Time complexity: O(n^3) – slow for large n

Parallel matrix multiplication algorithm – Fox

Parallel Matrix Multiplication Algorithm – General
▪ First, matrices A & B are partitioned so that each process works on a specific submatrix of A & B.
▪ Each process does a serial matrix multiplication on its local submatrices of A & B, and the results are stored in an accumulating sum.
▪ Local submatrices are then communicated around to calculate the final result.
▪ There are different algorithms for how submatrices are communicated, mainly the Fox algorithm and the Cannon algorithm.

Parallel Algorithm – Data Partitioning
▪ Parallel matrix multiplication requires data partitioning – common methods are Block-Row Partitioning or Block-Column Partitioning.
▪ Another method is grid or checkerboard partitioning, which is used in the Fox and Cannon algorithms.
▪ Say we have p processes and a square matrix, M, of size n x n, and let q = sqrt(p).
▪ M is partitioned into p square blocks (submatrices) Mi,j, where 0 <= i, j < q, each of size (n/q) x (n/q).
▪ Say p = 4 and n = 4. Then q = sqrt(4) = 2, and each submatrix is of size n/q x n/q = 2 x 2.

Fox Algorithm
▪ Matrices A & B are grid partitioned into submatrices Ai,j and Bi,j.
▪ In each step, a one-to-all broadcast of one block of matrix A within each row group and a single-step circular upwards shift of the blocks of matrix B within each column group is done.
▪ Loop q times:
1. Broadcast diagonal block Ai,i to all processors in the same row group.
2. Multiply the received block of A with the local block Bi,j and add to Ci,j.
3. Send (shift) the entire local block B up one step within the same column group.
4. Select block Ai,(j+1) mod q (where Ai,j is the block broadcast in the previous step) and broadcast it to all processors in the same row group.
▪ Gather all results.

Data Partitioning – Fox Algorithm
Processes 0–3 are arranged in a 2 x 2 grid: row groups [0, 1] and [2, 3]; column groups [0, 2] and [1, 3].
(Slides 16–28: step-by-step walkthrough of the q = 2 iterations – broadcast of the diagonal blocks A0,0 and A1,1, local multiply-accumulate, upward shift of the B blocks, broadcast of A0,1 and A1,0, final multiply, and gather of the results onto process 0 – figures omitted.)

Fox Algorithm
▪ Tends to be faster than simple parallel matrix multiplication for large values of n.
▪ Implemented in MPI by creating row and column communicators, which enable easy communication of the A & B blocks.
▪ Although we are dealing with 2D matrices, in the coding implementation these are represented using 1D arrays.
▪ Difficult to adapt for non-square matrices, i.e. A (m x n) x B (n x m).
▪ Has high communication overhead because at each step we are sending (relatively) large blocks of local A & B.

Parallel matrix multiplication algorithm – Cannon

Cannon Algorithm
▪ Matrices A & B are grid partitioned into submatrices Ai,j and Bi,j.
▪ Data is moved incrementally in q-1 phases (ring broadcast algorithm).
▪ Each process first skews Ai,j and Bi,j to align elements, shifting block Ai,j left by i positions within its row and block Bi,j up by j positions within its column.
▪ Multiply local block A with local block B and add to local C.
▪ Loop q-1 times:
1. Shift elements (each A block left by 1 unit within its row and each B block up by 1 unit within its column).
2. Multiply local block A with local block B and add to local C.
▪ Gather all results.

Data Partitioning – Cannon Algorithm
Skew matrices A and B to align elements by shifting block Ai,j left by i positions and block Bi,j up by j positions.
(Slides 33–42: step-by-step walkthrough of the Cannon algorithm for q = 2 – initial skew of the A and B blocks, local multiply-accumulate, the q-1 shift-and-multiply iterations, and gather of the results onto process 0 – figures omitted.)

Cannon Algorithm
▪ Tends to be faster than simple parallel matrix multiplication for large values of n.
▪ Implemented in MPI by creating row and column communicators, which enable easy shifting of data.
▪ Although we are dealing with 2D matrices, in the coding implementation these are represented using 1D arrays.
▪ Difficult to adapt for non-square matrices, i.e. A (m x n) x B (n x m).
▪ Has lower communication overhead than the Fox algorithm because at each step we are sending less data than in Fox.

Cannon Algorithm – Bigger Example
(Slides 44–51: worked numeric example of Cannon's algorithm on a larger matrix, ending with the results gathered on process 0 – figures omitted.)

C code files implementing Fox and Cannon parallel matrix multiplication with MPI
▪ Fox – Click here
▪ Cannon – Click here