MPI types, Scatter and Scatterv
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Logical and physical layout of a C/C++ array in memory.
int *A = malloc(6*6*sizeof(int));
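A minimal sketch (assuming the 6x6 size from the figure; the loop variables are illustrative) of how the logical index (i, j) maps to the physical offset i*6 + j:

#include <stdlib.h>

int main(void) {
    int n = 6;
    int *A = malloc(n * n * sizeof(int));   /* one contiguous block of 36 ints */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] = i * n + j;       /* row-major: (i, j) lives at offset i*6 + j */
    free(A);
    return 0;
}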
MPI_Scatter
int MPI_Scatter(
const void *sendbuf, // data to send
int sendcount, // sent to each process
MPI_Datatype sendtype,// type of data sent
void *recvbuf, // where received
int recvcount, // how much to receive
MPI_Datatype recvtype,// type of data received
int root, // sending process
MPI_Comm comm) // communicator
sendbuf, sendcount, and sendtype are significant only at the root (sending) process
Equal number of elements to all processors
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
MPI_Scatter(A, 9, MPI_INT, B, 9,
MPI_INT, 0,
MPI_COMM_WORLD);
A scattered in blocks of 9 consecutive elements:
P0: 0 1 2 3 4 5 6 7 8
P1: 9 10 11 12 13 14 15 16 17
P2: 18 19 20 21 22 23 24 25 26
P3: 27 28 29 30 31 32 33 34 35
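A complete, runnable sketch of this call (assuming exactly 4 processes; the buffer names follow the slide):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    int A[36], B[9];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* 4 processes assumed */

    if (rank == 0)                             /* root fills the 6x6 array with 0..35 */
        for (int i = 0; i < 36; i++) A[i] = i;

    /* each process receives 9 consecutive ints of A into B */
    MPI_Scatter(A, 9, MPI_INT, B, 9, MPI_INT, 0, MPI_COMM_WORLD);

    printf("P%d got %d..%d\n", rank, B[0], B[8]);
    MPI_Finalize();
    return 0;
}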
MPI_Scatterv
int MPI_Scatterv(
const void *sendbuf, // data to send
const int *sendcounts,// sent to each process
const int *displs, // where in sendbuf each process's data starts
MPI_Datatype sendtype,// type of data sent
void *recvbuf, // where received
int recvcount, // how much to receive
MPI_Datatype recvtype,// type of data received
int root, // sending process
MPI_Comm comm) // communicator
sendbuf, sendcounts, displs, and sendtype are significant only at the root (sending) process
Specify the number of elements sent to each processor
int counts[] = {10, 9, 8, 9};
int displ[] = {0, 10, 19, 27};
MPI_Scatterv(A, counts, displ, MPI_INT,
rb, counts[rank], MPI_INT, 0, // rank = this process's rank
MPI_COMM_WORLD);
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
rb (received by each process):
P0: 0 1 2 3 4 5 6 7 8 9
P1: 10 11 12 13 14 15 16 17 18
P2: 19 20 21 22 23 24 25 26
P3: 27 28 29 30 31 32 33 34 35
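A runnable sketch of this uneven scatter (4 processes assumed; each rank passes its own entry of counts as recvcount):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    int A[36], rb[10];                      /* rb large enough for the biggest piece */
    int counts[] = {10, 9, 8, 9};
    int displ[]  = {0, 10, 19, 27};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* 4 processes assumed */

    if (rank == 0)
        for (int i = 0; i < 36; i++) A[i] = i;

    /* rank r receives counts[r] ints starting at A[displ[r]] */
    MPI_Scatterv(A, counts, displ, MPI_INT,
                 rb, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    printf("P%d received %d ints starting with %d\n", rank, counts[rank], rb[0]);
    MPI_Finalize();
    return 0;
}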
What if we want to scatter
columns (C array layout)
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
A
P0: 0 6 12 18 24 30
P1: 1 7 13 19 25 31
P2: 2 8 14 20 26 32
P3: 3 9 15 21 27 33
P4: 4 10 16 22 28 34
P5: 5 11 17 23 29 35
MPI_Type_vector
int MPI_Type_vector(
int count, // number of blocks
int blocklength, // #elts in a block
int stride, // #elts between block starts
MPI_Datatype oldtype, // type of block elements
MPI_Datatype *newtype // handle for new type
)
Allows a type to be created that groups regularly spaced blocks of elements of a vector into a new type.
Note that a 2-D array in contiguous memory can be treated as a 1-D vector.
MPI_Type_create_resized
int MPI_Type_create_resized(
MPI_Datatype oldtype, // type being resized
MPI_Aint lb, // new lower bound
MPI_Aint extent, // new extent (“length”)
MPI_Datatype *newtype // resized type
)
Allows a new size (or extent) to be assigned to an existing type.
This lets MPI determine how far from one object O1 the next adjacent object O2 starts. As we will see, this is often needed because we treat a logically 2-D array as a 1-D vector.
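A small sketch (illustrative; MPI_Type_get_extent is used only to print the sizes) of how resizing changes the extent of the column type used in the following slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Datatype col, coltype;
    MPI_Aint lb, extent;

    MPI_Init(&argc, &argv);

    /* one column of a 6x6 int array: 6 blocks of 1 int, 6 ints apart */
    MPI_Type_vector(6, 1, 6, MPI_INT, &col);
    MPI_Type_get_extent(col, &lb, &extent);
    printf("col extent     = %ld bytes\n", (long)extent);  /* spans 31 ints */

    /* shrink the extent to one int so consecutive columns start 1 int apart */
    MPI_Type_create_resized(col, 0, sizeof(int), &coltype);
    MPI_Type_commit(&coltype);
    MPI_Type_get_extent(coltype, &lb, &extent);
    printf("coltype extent = %ld bytes\n", (long)extent);  /* 1 int */

    MPI_Type_free(&col);
    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}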
Using MPI_Type_vector
MPI_Datatype col, coltype;
MPI_Type_vector(6, 1, 6, MPI_INT,
&col);
MPI_Type_commit(&col); // not necessary
MPI_Type_create_resized(col, 0,
1*sizeof(int), &coltype);
MPI_Type_commit(&coltype); // needed
MPI_Scatter(A, 1, coltype, rb,
6, MPI_INT, 0, MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
MPI_Datatype col, coltype;
MPI_Type_vector(6, 1, 6, MPI_INT,
&col);
MPI_Type_commit(&col); // not necessary
MPI_Type_create_resized(col, 0,
1*sizeof(int), &coltype);
MPI_Type_commit(&coltype); // needed
MPI_Scatter(A, 1, coltype, rb,
6, MPI_INT, 0, MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
MPI_Type_vector: defining the type
Each of the 6 blocks consists of 1 int, and each block starts 6 positions after the start of the previous block in the linearized array.
Using MPI_Type_create_resized
MPI_Datatype col, coltype;
MPI_Type_vector(6, 1, 6, MPI_INT,
&col);
MPI_Type_commit(&col); // not necessary
MPI_Type_create_resized(col, 0,
1*sizeof(int), &coltype);
MPI_Type_commit(&coltype); // needed
MPI_Scatter(A, 1, coltype, rb,
6, MPI_INT, 0, MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
Resizing creates a new type from an existing type and changes its extent (size). This makes it easier to compute the offset from one element of the type to the next in the original data structure.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
MPI_Datatype col, coltype;
MPI_Type_vector(6, 1, 6, MPI_INT,
&col);
MPI_Type_commit(&col); // not necessary
MPI_Type_create_resized(col, 0,
1*sizeof(int), &coltype);
MPI_Type_commit(&coltype); // needed
MPI_Scatter(A, 1, coltype, rb,
6, MPI_INT, 0, MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
Resizing creates a new type from an existing type and changes its extent (size). This makes it easier to compute the offset from one element of the type to the next in the original data structure.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
One object of type col starts at element 0; the next starts at element 1, one sizeof(int) away.
The result of the
communication
MPI_Datatype col, coltype;
MPI_Type_vector(6, 1, 6, MPI_INT,
&col);
MPI_Type_commit(&col); // not necessary
MPI_Type_create_resized(col, 0,
1*sizeof(int), &coltype);
MPI_Type_commit(&coltype); // needed
MPI_Scatter(A, 1, coltype, rb,
6, MPI_INT, 0, MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
P0: 0 6 12 18 24 30
P1: 1 7 13 19 25 31
. . .
P5: 5 11 17 23 29 35
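Putting the pieces together, a runnable sketch of the column scatter (assumes exactly 6 processes; buffer names follow the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    int A[36], rb[6];
    MPI_Datatype col, coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* 6 processes assumed */

    if (rank == 0)
        for (int i = 0; i < 36; i++) A[i] = i;

    MPI_Type_vector(6, 1, 6, MPI_INT, &col);            /* one column of the 6x6 array */
    MPI_Type_create_resized(col, 0, sizeof(int), &coltype);
    MPI_Type_commit(&coltype);                          /* only the type actually used */

    /* send one coltype (a column) per rank; received as 6 plain ints */
    MPI_Scatter(A, 1, coltype, rb, 6, MPI_INT, 0, MPI_COMM_WORLD);

    printf("P%d: %d %d %d %d %d %d\n", rank,
           rb[0], rb[1], rb[2], rb[3], rb[4], rb[5]);

    MPI_Type_free(&col);
    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}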
Scattering diagonal blocks
MPI_Datatype block, blocktype;
MPI_Type_vector(2, 2, 6, MPI_INT,
&block);
MPI_Type_commit(&block); // not necessary
MPI_Type_create_resized(block, 0,
14*sizeof(int), &blocktype);
MPI_Type_commit(&blocktype); // needed
MPI_Scatter(A, 1, blocktype, B, 4,
MPI_INT, 0,
MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
Each diagonal block is 2 x 2, and the two rows of a block start 6 elements apart.
Note that 2 rows + the width of a block = 2*6 + 2 = 14, the extent used above.
Scattering the blocks
MPI_Datatype block, blocktype;
MPI_Type_vector(2, 2, 6, MPI_INT,
&block);
MPI_Type_commit(&block); // not necessary
MPI_Type_create_resized(block, 0,
14*sizeof(int), &blocktype);
MPI_Type_commit(&blocktype); // needed
MPI_Scatter(A, 1, blocktype, B, 4,
MPI_INT, 0,
MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
B (received by each process):
P0: 0 1 6 7
P1: 14 15 20 21
P2: 28 29 34 35
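A runnable sketch of this diagonal-block scatter (assumes exactly 3 processes, one 2x2 block each):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    int A[36], B[4];
    MPI_Datatype block, blocktype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* 3 processes assumed */

    if (rank == 0)
        for (int i = 0; i < 36; i++) A[i] = i;

    MPI_Type_vector(2, 2, 6, MPI_INT, &block);           /* a 2x2 block of the 6x6 array */
    MPI_Type_create_resized(block, 0, 14 * sizeof(int),  /* next diagonal block starts   */
                            &blocktype);                 /* 2 rows + 2 columns later     */
    MPI_Type_commit(&blocktype);

    MPI_Scatter(A, 1, blocktype, B, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("P%d: %d %d %d %d\n", rank, B[0], B[1], B[2], B[3]);

    MPI_Type_free(&block);
    MPI_Type_free(&blocktype);
    MPI_Finalize();
    return 0;
}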
The Type_vector statement
describing this
MPI_Datatype block, blocktype;
MPI_Type_vector(3, 3, 6, MPI_INT,
&block);
MPI_Type_commit(&block); // not necessary
MPI_Type_create_resized(block, 0,
3*sizeof(int), &blocktype);
MPI_Type_commit(&blocktype); // needed
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
Each block of A here is a 3 x 3 sub-block: 3 blocks of 3 elements, with block starts 6 elements apart.
The create_resized statement for this
MPI_Datatype block, blocktype;
MPI_Type_vector(3, 3, 6, MPI_INT,
&block);
MPI_Type_commit(&block); // not necessary
MPI_Type_create_resized(block, 0,
3*sizeof(int), &blocktype);
MPI_Type_commit(&blocktype); // needed
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
The distance between the starts of successive blocks varies (3, 15, 3 elements here), but each distance is a multiple of 3, so the type is resized to 3 ints. Use MPI_Scatterv to give each process its own displacement.
Sending the data
MPI_Datatype block, blocktype;
int displ[] = {0, 1, 6, 7};
int scount[] = {1, 1, 1, 1};
int rcount = 9; // each process receives 9 ints
MPI_Type_vector(3, 3, 6, MPI_INT,
&block);
MPI_Type_commit(&block); // not necessary
MPI_Type_create_resized(block, 0,
3*sizeof(int), &blocktype);
MPI_Type_commit(&blocktype); // needed
MPI_Scatterv(A, scount, displ,
blocktype, rb, rcount,
MPI_INT, 0,
MPI_COMM_WORLD);
A
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35
The displacements {0, 1, 6, 7} are measured in units of the resized extent (3 ints, the width of a block), so the four blocks start at elements 0, 3, 18, and 21 of A.
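A runnable sketch of this four-block distribution (assumes exactly 4 processes; note that recvcount is the scalar 9):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    int A[36], rb[9];
    int scount[] = {1, 1, 1, 1};            /* one blocktype per process                */
    int displ[]  = {0, 1, 6, 7};            /* in units of the resized extent (3 ints)  */
    MPI_Datatype block, blocktype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* 4 processes assumed */

    if (rank == 0)
        for (int i = 0; i < 36; i++) A[i] = i;

    MPI_Type_vector(3, 3, 6, MPI_INT, &block);                     /* a 3x3 block */
    MPI_Type_create_resized(block, 0, 3 * sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);

    /* block origins land at elements 0, 3, 18, and 21 of A */
    MPI_Scatterv(A, scount, displ, blocktype,
                 rb, 9, MPI_INT, 0, MPI_COMM_WORLD);

    printf("P%d: first element %d\n", rank, rb[0]);

    MPI_Type_free(&block);
    MPI_Type_free(&blocktype);
    MPI_Finalize();
    return 0;
}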
Matrix Multiply
Cannon’s Algorithm
• Useful for the small project
Parallel Algorithm 2
(Cannon’s Algorithm)
• Associate a primitive task with each matrix element
• Agglomerate tasks responsible for a square (or nearly
square) block of C (the result matrix)
• Computation-to-communication ratio rises to n / √p
(same total computation, more computation per
communication)
• 2n / p < n / √p when p > 4
A simplifying assumption
• Assume that
• A, B and (consequently) C are n x n square
matrices
• √p is an integer, and
n = k⋅√p, k an integer (i.e., n is a multiple of √p)
Elements of A and B Needed to Compute a Process's Portion of C
(figure contrasting Algorithm 1 with Cannon's Algorithm)
Blocks needed to compute a C element
A B C
Blocks needed to compute a C element
A B C
The processor that owns these blocks fully computes the value of this C block (but needs more data than this).
Blocks needed to compute a C element
A B C
Processor P3,2 needs, at some point, to simultaneously hold the green
A and B blocks, the red A and B blocks, the blue A and B blocks, and
the cayenne A and B blocks.
With the current data layout it cannot do useful work because it does not hold matching A and B blocks (it has a red A block and a blue B block).
Blocks needed to compute a C element
A B C
We need to rearrange the data so that every processor has useful work to do.
The initial data configuration does not provide for this.
Change the initial data setup
A B C
Move every element Ai,j left by i columns
Move every element Bi,j up by j rows (0-based indexing)
Change the initial data setup
A B C
Move every element Ai,j left by i columns
Move every element Bi,j up by j rows
Every processor now has useful work to
do
Note: this only shows the full data layout for one processor.
A B C
At each step in the multiplication, shift B
elements up within their column, and A
elements left within their row
A B C
First partial sum
At each step in the multiplication, shift B
elements up within their column, and A
elements left within their row
A B C
Second partial sum
At each step in the multiplication, shift B
elements up within their column, and A
elements left within their row
A B C
Third partial sum
At each step in the multiplication, shift B
elements up within their column, and A
elements left within their row
A B C
Fourth partial sum
Another way to view this
Before After
Another way to view this
Before After
This goes here
A block goes here (moved left 2 (= i) columns)
B block goes here (moved up 1 (= j) row)
Yet another way to view
this
A00/B00  A01/B01  A02/B02  A03/B03
A10/B10  A11/B11  A12/B12  A13/B13
A20/B20  A21/B21  A22/B22  A23/B23
A30/B30  A31/B31  A32/B32  A33/B33
Each triangle
represents a
matrix block
Only same-color
triangles should
be multiplied
Rearrange Blocks
A00/B00  A01/B01  A02/B02  A03/B03
A10/B10  A11/B11  A12/B12  A13/B13
A20/B20  A21/B21  A22/B22  A23/B23
A30/B30  A31/B31  A32/B32  A33/B33
Block Ai,j shifts
left i positions
Block Bi,j shifts
up j positions
Consider Process P1,2
Step 1: the row of A blocks is A10 A11 A12 A13 and the column of B blocks is B02 B12 B22 B32. After the initial skew, P1,2 holds A13 and B32 and multiplies them.
Next communication: A blocks shift left within their row, B blocks shift up within their column.
Consider Process P1,2
Step 2: after one shift, P1,2 holds A10 and B02 and multiplies them.
Next communication: A blocks shift left within their row, B blocks shift up within their column.
Consider Process P1,2
Step 3: after another shift, P1,2 holds A11 and B12 and multiplies them.
Next communication: A blocks shift left within their row, B blocks shift up within their column.
Consider Process P1,2
Step 4: P1,2 holds A12 and B22 and multiplies them. After this step all four partial products of C1,2 have been accumulated.
Next communication: A blocks shift left within their row, B blocks shift up within their column.
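A sketch of the whole algorithm in MPI (illustrative only, not the assigned solution): it assumes p is a perfect square, that each process already owns one k x k block of A, B, and C stored row-major with the C block zeroed, and it uses a periodic Cartesian grid so that MPI_Cart_shift and MPI_Sendrecv_replace implement the initial skew and the √p shift steps. block_multiply and cannon are hypothetical helper names.

#include <mpi.h>
#include <math.h>

/* C += A * B for k x k blocks stored row-major */
static void block_multiply(int k, const double *A, const double *B, double *C) {
    for (int i = 0; i < k; i++)
        for (int l = 0; l < k; l++)
            for (int j = 0; j < k; j++)
                C[i*k + j] += A[i*k + l] * B[l*k + j];
}

/* Cannon's algorithm; each process holds one k x k block of A, B and C */
void cannon(int k, double *Ablk, double *Bblk, double *Cblk, MPI_Comm comm) {
    int p, q, rank, coords[2], src, dst, left, right, up, down;
    int dims[2], periods[2] = {1, 1};
    MPI_Comm grid;

    MPI_Comm_size(comm, &p);
    q = (int)(sqrt((double)p) + 0.5);          /* q x q process grid; p assumed square */
    dims[0] = dims[1] = q;
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);    /* coords[0] = block row i, coords[1] = block column j */

    /* initial skew: shift A(i,j) left by i columns and B(i,j) up by j rows */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(Ablk, k*k, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(Bblk, k*k, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    /* neighbours for the per-step single shifts: A moves left, B moves up */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);

    for (int step = 0; step < q; step++) {
        block_multiply(k, Ablk, Bblk, Cblk);   /* accumulate one partial product */
        MPI_Sendrecv_replace(Ablk, k*k, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(Bblk, k*k, MPI_DOUBLE, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}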
Complexity Analysis
• The algorithm has √p iterations
• During each iteration a process multiplies two (n/√p) × (n/√p) matrices: Θ((n/√p)^3) = Θ(n^3/p^(3/2))
• Overall computational complexity: √p ⋅ n^3/p^(3/2) = Θ(n^3/p)
• During each of the √p iterations a process sends and receives two blocks of size (n/√p) × (n/√p)
• Overall communication complexity: Θ(n^2/√p)