lec21.key
CS 314 Principles of Programming Languages
Prof. Zheng Zhang
Rutgers University
Lecture 20: Parallelism and Dependence Analysis
November 21, 2018
Class Information
2
• Project 2 deadline is extended to 11/25 Sunday.
• Midterm grades will be released immediately after Thanksgiving break.
Review: Parallelizing Affine Loops
3
Three spaces
• Iteration space
– The set of dynamic execution instances
For instance, the set of value vectors taken by loop indices
– A k-dimensional space for a k-level loop nest
• Data space
– The set of array elements accessed
– An n-dimensional space for an n-dimensional array
• Processor space
– The set of processors in the system
– In analysis, we may pretend there are unbounded # of virtual processors
Three Spaces
4
• Iteration space, data space, and processor space
0 1 9 10 11 19 20
0 1 9
… … …
…Iteration space
0 1 9
…Processor space
float Z[100];
for (i=0; i<10; i++)
S: Z[i+10] = Z[i];
Data Space
Z[i] & Z[i+10]
S(i)
P(i) = i
Synchronization-free Parallelism
5
Parallelize an application without allowing any communication
or synchronization among (logical) processors.
j = 4
j = 3
j = 2
j = 1
1 2 3 4 5 6
do i = 1, N
do j = 1, N
S1: A[i, j] = A[i, j - 1]
Write in S1(i, j) to Read in S1(i, j+1)
Write in S1(1,1) to Read in S1(1,2)
S1Example:
i
P1 P2 P3 P4 P5 P6
Communication is limited to the
iterations within one processor.
Dependence from S1 to S1
Synchronization-free Parallelism
6
Parallelize an application without allowing any communication
or synchronization among (logical) processors.
j = 4
j = 3
j = 2
j = 1
1 2 3 4 5 6
Write in S1(i, j) to Read in S1(i, j+1)
Write in S1(1,1) to Read in S1(1,2)
S1
Which loop can be parallelized?
The “i” loop or the “j” loop?
Answer: the “i” loop
Example 1:
i
P1 P2 P3 P4 P5 P6
do i = 1, N
do j = 1, N
S1: A[i, j] = A[i, j - 1]
7
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
S1
S2
S1 S2
True, j loop, for X
True, i loop, for Y
j = 4
j = 3
j = 2
j = 1
Dependence from S1(1,1) to S2(1,2)
1 2 3 4 5 6
i
P3
Dependence from S2(1,1) to S1(2,1)
Parallelize an application without allowing any communication
or synchronization among (logical) processors.
Synchronization-free Parallelism
Example 2:
Review — Processing Space: Affine Partition Schedule
• Map an iteration to a processor using < C, d >
C is a n by m matrix
• m = d (the loop level)
• n is the dimension of the processor grid
d⃗ is a n-element constant vector
p⃗ = C x⃗ + d⃗, where x⃗ is an iteration vector
• Map an iteration to a processor using < C, d >
p⃗ = C x⃗ + d⃗, where x⃗ is an iteration vector
• Example
Review: Processing Space: Affine Partition Schedule
for (i=1; i<=N; i++)
S: Y[i] = Z[i];
C = [1], d = [0]
p⃗( S(i) ) = 1*i + 0
= i
Map iteration i to Processor i
Review: Synchronization-free Parallelism
10
One Potential Solution:
Affine schedule for S1, p(S1): [C11 C12] = [1 -1], d1 = -1
i.e. (i, j) iteration of S1 to processor p = i - j - 1;
Affine schedule for S2, p(S2): [C21 C22 ] = [1 -1], d2 = 0
i.e. (i, j) iteration of S2 to processor p = i - j.
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Example: -2
j = 4
j = 3
j = 2
j = 1
S1
S2
-1 0 1
p:
2
p:
3
p:
4
p:
5
-3 -4p:
i
1 2 3 4 5 6
C11 = C21 = -C22= -C12 = d2 -d1
Code Generation
11
•Step 1: find processor ID ranges
• S1: -4 <= p <= 4
• S2: -3 <= p <= 5
•Union: -4 <= p <= 5
•Step 2: generate code
S1(i, j): processor p = i-j-1;
S2(i, j): processor p = i-j.
for (i=1; i<=6; i++)
for (j=1; j<=4; j++){
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
forall (p=-4; p<=5; p++)
for (i=1; i<=6; i++)
for (j=1; j<=4; j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
Naive Code Generation
12
• Wider than necessary loop bounds
• Redundant tests in loop body
What are the issues with this code?
forall (p=-4; p<=5; p++)
for (i=1; i<=6; i++)
for (j=1; j<=4; j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
Remove Idle Iterations
13
For example, when p=-4, only 1 of the 24
iterations has useful operations, i = 1, j =4 .
Loop bounds are wider than they should have been
forall (p=-4; p<=5; p++)
for (i=1; i<=6; i++)
for (j=1; j<=4; j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
Make Loop Bounds Tighter
14
-4 <= p <= 5
1<= i <= 6
1<= j <= 4
i-p-1=j
j: i-p-1<= j <= i-p-1
1<=j <= 4
i: p+2<=i <= p+5
1<=i <= 6
Fourier-Motzkin Elimination
S1
Eliminate j
Fourizer Motzkin Elimination
15
Eliminating variable z in inequality systems
• Match each of the lower bounds on z with each of its upper bounds
• Equivalent to projecting a polyhedron into reduced dimension space
-4 <= p <= 5
1<= i <= 6
1<= j <= 4
i-p-1=j
Suppose we want to eliminate j:
i-p-1<= j <= i-p-1
1<= j <= 4
i-p-1 <= i-p-1
i-p-1 <= 4
1 <= i-p-1
1 <= 4
p+2 <= i <= p+5
Make Loop Bounds Tighter
16
-4 <= p <= 5
1<= i <= 6
1<= j <= 4
i-p-1=j
j: i-p-1<= j <= i-p-1
1<=j <= 4
i: p+2<=i <= p+5
1<=i <= 6
p: -4<= p<= 4
Fourier-Motzkin Elimination
S1
-4 <= p <= 5
1<= i <= 6
1<= j <= 4
i-p=j
j: i-p<=j <= i-p
1<=j <=4
i: p+1<=i <= 4+p
1<=i <=6
p: -3<= p <= 5
S2
Eliminate j
Eliminate i
Make Loop Bounds Tighter
17
j: i-p-1<= j <= i-p-1
1<=j <= 4
i: p+2<=i <= p+5
1<=i <= 6
p: -4<= p<= 4
j: i-p<=j <= i-p
1<=j <=4
i: p+1<=i <= 4+p
1<=i <=6
p: -3<= p <= 5
S1 S2
Union result:
j: i-p-1<=j <= i-p
1<=j <=4
i: p+1<=i <= 5+p
1<=i <=6
p: -4<= p <= 5
Make Loop Bounds Tighter
18
Union result:
j: i-p-1<=j <= i-p
1<=j <=4
i: p+1<=i <= 5+p
1<=i <=6
p: -4<= p <= 5
forall (p=-4; p<=5; p++)
for (i=1; i<=6; i++)
for (j=1; j<=4; j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
for (p=-4; p<=5; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
for (j=max(1,i-p-1); j<=min(4,i-p); j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
Make Loop Bounds Tighter
19
for (p=-4; p<=5; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
for (j=max(1,i-p-1); j<=min(4, i-p); j++)
{
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
S1
S2
j = 4
j = 3
j = 2
j = 1
1 2 3 4 5 6
-2
When p = -2,
i: [1, 3]
Remove Tests
20
for (p=-4; p<=5; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
for (j=max(1,i-p-1); j<=min(4,i-p); j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
for (p=-4; p<=5; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
j=i-p-1;
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
j=i-p;
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
?
S1 S2
j = 4
j = 3
j = 2
j = 1
1 2 3 4 5 6
-2
Remove Tests
21
• Reason for the tests
• The iteration spaces of statements intersect but do not completely
overlap
• Solution
• Split the iteration space at the boundaries of overlapping polyhedra.
• Generate code for each of the subspaces.
Remove Tests
22
/*space 2*/
for (p=-3; p<=4; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
for (j=max(1,i-p-1); j<=min(4,i-p); j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1
*/
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2
*/
}
/*space 1*/
p=-4; i=1; j=4;
X[i,j]=X[i,j]+Y[i-1,j]; /*S1*/
/*space 3*/
p=5; i=6; j=1;
Y[i,j] = X[i,j-1] + Y[i,j]; /*S2*/
j: i-p-1<= j <= i-p-1
1<=j <=4
i: p+2<=i <=5+p
1<=i <= 6
p: -4 <= p <= 4
j: i-p<=j <= i-p
1<=j <=4
i: p+1<=i <= 4+p
1<=i <=6
p: -3 <= p <= 5
S1:
S2:
Split on “p”:
subspace 1: p = -4;
subspace 2: -3 <= p <= 4;
subspace 3: p = 5;
Remove Tests
23
/*space 2*/
for (p=-3; p<=4; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
for (j=max(1,i-p-1); j<=min(4,i-p); j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
j: i-p-1<= j <= i-p-1
1<=j <=4
i: p+2<=i <=5+p
1<=i <= 6
p: -4 <= p <= 4
j: i-p<=j <= i-p
1<=j <=4
i: p+1<=i <= 4+p
1<=i <=6
p: -3 <= p <= 5
S1:
S2:
/*space 3*/
p=5; i=6; j=1;
Y[i,j] = X[i,j-1] + Y[i,j]; /*S2*/
/*space 1*/
p=-4; i=1; j=4;
X[i,j]=X[i,j]+Y[i-1,j]; /*S1*/
Remove Tests
24
Split on “i”:
subspace 2a: max(1, p+1)<= i < max(1, p+2); only S2;
subspace 2b: max(1, p+2) <= i <= min(6, 4+p); both S1 and S2;
subspace 2c: min(6, 4+p) < i <= min(5+p, 6); only S1;
j: i-p-1<= j <= i-p-1
1<=j <=4
i: p+2<=i <=5+p
1<=i <= 6
p: -4 <= p <= 4
j: i-p<=j <= i-p
1<=j <=4
i: p+1<=i <= 4+p
1<=i <=6
p: -3 <= p <= 5
S1: S2:
p>=-3; p<=4
Remove Tests
25
/*space 2*/
for (p=-3; p<=4; p++)
for (i=max(1,p+1); i<=min(6,5+p); i++)
for (j=max(1,i-p-1); j<=min(4,i-p); j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
/*space 2*/
for (p=-3; p<=4; p++){
/*space 2a*/
if (p>=0){
i = p+1; j = 1;
Y[i,j] = Y[i,j] + X[i, j-1];/* S2 */}
/*space 2b*/
for (i=max(1, p+2); i<=min(6, 4 +p); i++)
{
j=i-p-1;
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
j=i-p
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ }
/*space 2c*/
if (p<=1){
i=5+p; j=5;
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */}
}
Split on “i”:
subspace 2a: max(1,p+1)<= i < max(1, p+2);
subspace 2b: max(1,p+2) <= i <= min(6, 4+p);
subspace 2c: min(6, 4+p) < i <= min(5+p, 6).
Remove Tests
26
Split on “i”:
subspace 2a: max(1,p+1)<= i < max(1, p+2);
subspace 2b: max(1,p+2) <= i <= min(6, 4+p);
subspace 2c: min(6, 4+p) < i <= min(5+p, 6).
/*space 2*/
for (p=-3; p<=4; p++){
/*space 2a*/
if (p>=0){
i = p+1; j = 1;
Y[i,j] = Y[i,j] + X[i, j-1];/* S2 */}
/*space 2b*/
for (i=max(1, p+2); i<=min(6, 4 +p); i++)
{
j=i-p-1;
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
j=i-p
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ }
/*space 2c*/
if (p<=1){
i=5+p; j=5;
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */}
}
S1
S2
j = 4
j = 3
j = 2
j = 1
1 2 3 4 5 6
-2 -1 0 1
2
3
4
5
Code Generation and Optimization Summary
27
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
forall (p=-4; p<=5; p++)
for (i=1; i<=6; i++)
for (j=1; j<=4; j++){
if (p== i-j-1)
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
if (p== i-j)
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
/*space 1*/
if ( p == -4 )
X[1,4]=X[1,4]+Y[0,4]; /*S1*/
/*space 2*/
for (p=-3; p<=4; p++){
/*space 2a*/
if (p>0)
Y[p+1,1] = Y[p+1,1] + X[p+1, 0];/* S2 */
/*space 2b*/
for (i=max(1,p+2); i<=min(6,4+p); i++)
X[i,i-p-1] = X[i,i-p-1] + Y[i-1, i-p-1]; /* S1 */
Y[i,i-p] = Y[i,i-p] + X[i, i-p-1]; /* S2 */ }
/*space 2c*/
if (p<=-1)
X[5+p,5] = X[5+p,5] + Y[4+p, 5]; /* S1 */
}
/*space 3*/
if (p = =5)
Y[6,1] = X[6,0] + Y[6,1]; /*S2*/
Next Class
28
Reading
• Scott, Chapter 7.2; ALSU Chapter 6.5