CMPSC 450
Concurrent Scientific Programming Dense matrix multiplication
CMPSC 450
Parallel dense matrix-matrix multiplication algorithms
• SUMMA
• Cannon’s algorithm
CMPSC 450
2
SUMMA: Scalable Universal Matrix Multiply Algorithm
• Each processor owns 2D sub-blocks of matrices A and B, and computes a submatrix of C
• Assumes a 2D logical processor topology
• Algorithm can be generalized to arbitrary (non-square) matrix
dimensions
• Main communication step is broadcast of a strip
• Only √p processors communicate in each broadcast operation
CMPSC 450
SUMMA
C = C + AB
p processors, each processor owns submatrix in A and B of
dimensions
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2;
Computation and communication performed by processor 1 are shown
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Communication step 1: Proc 0 broadcasts strip in A; Proc 1 broadcasts strip in B.
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Computation step 1:
Update C using outer products
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Communication step 2: Proc 0 broadcasts strip in A; Proc 1 broadcasts strip in B.
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Computation step 2: Update C
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Communication step 3: Proc 1 broadcasts strip in A; Proc 3 broadcasts strip in B.
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Computation step 3: Update C
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Communication step 4: Proc 1 broadcasts strip in A; Proc 3 broadcasts strip in B.
B
AC
CMPSC 450
SUMMA
Example:
P = 4, N = 8, b = 2
Computation step 4: Update C
B
AC
CMPSC 450
Analysis
• n/b communication phases
• Each communication phase: two broadcasts of bn/√p values to √p
processors
• Tcomm =
√
orTcomm = √
√
•
• b offers tradeoff between memory usage and latency cost
CMPSC 450
Cannon’s algorithm
• Skew matrices A and B
• Perform local sub-matrix
multiplies, shift matrices
CMPSC 450
Cannon’s algorithm
Skew phase
for all (i = 0 to -1)
Left circular shift row i of A by i, so that A(i,j) is overwritten by A(i,(j+i) mod )
Up circular shift column i of B by i, so that B(i,j) is overwritten by B((i+j) mod , j)
CMPSC 450
Cannon’s algorithm
Matrix multiply phase
for all (k = 0 to -1)
for all (i = 0 to -1, j = 0 to -1)
C(i,j) = C(i,j) + A(i,j) B(i,j)
Left circular shift each row of A by 1, so that A(i,j) is overwritten by A(i,(j+1) mod )
Up circular shift column of B by 1, so that B(i,j) is overwritten by B((i+1) mod , j)
CMPSC 450
B00
B01
B02
B10
B11
B12
B20
B21
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Each process initially owns 3×3 submatrices
A
B
A00
A01
A02
A10
A11
A12
A20
A21
A22
C
CMPSC 450
B00
B01
B02
B10
B11
B12
B20
B21
B22
Cannon’s algorithm
Example:
P = 9, N = 9 Skew step
A
B
A00
A01
A02
A10
A11
A12
A20
A21
A22
C
CMPSC 450
B00
B11
B02
B10
B21
B12
B20
B01
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Matrices after skewing
A
B
A00
A01
A02
A11
A12
A10
A20
A21
A22
C
CMPSC 450
B00
B11
B02
B10
B21
B12
B20
B01
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): Compute
A
B
A01B
11
A00
A01
A02
A11
A12
A10
A20
A21
A22
C
CMPSC 450
B00
B11
B02
B10
B21
B12
B20
B01
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): Shift
A
B
A01B
11
A00
A01
A02
A11
A12
A10
A20
A21
A22
C
CMPSC 450
B10
B21
B02
B20
B01
B12
B00
B11
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): After shift
A
B
A01B11
A01
A02
A00
A12
A10
A11
A21
A22
A20
C
CMPSC 450
B10
B21
B02
B20
B01
B12
B00
B11
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): Compute
A
B
A01B11
+ A02B21
A01
A02
A00
A12
A10
A11
A21
A22
A20
C
CMPSC 450
B10
B21
B02
B20
B01
B12
B00
B11
B22
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): shift
A
B
A01B11
+ A02B21
A01
A02
A00
A12
A10
A11
A21
A22
A20
C
CMPSC 450
B20
B01
B12
B00
B11
B22
B10
B21
B02
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): after shift
A
B
A01B11
+ A02B21
A02
A00
A01
A10
A11
A12
A22
A20
A21
C
CMPSC 450
B20
B01
B12
B00
B11
B22
B10
B21
B02
Cannon’s algorithm
Example:
P = 9, N = 9
Local computation + shift phase (proc 1’s updates to C shown): Compute, and done!
A
B
A01B11
+ A02B21
+ A00B01
A02
A00
A01
A10
A11
A12
A22
A20
A21
C
CMPSC 450
Analysis
• 2 √p communication phases
• √p in skew, √p for local computation
• Each communication phase: two sends of n2/p values to neighbors, two receives
√
• Memory-efficient
• Latency term is better than 1D row-based approach
• Drawback: requires process grid
• Tcomm =
(ignoring the constant factors)
CMPSC 450