lec20
CS 314 Principles of Programming Languages
Prof. Zheng Zhang
Rutgers University
Lecture 20: Parallelism and Dependence Analysis
November 16, 2018
Class Information
• Homework 7 is released.
• Project 2 deadline will be extended to 11/25 Sunday.
Review: Dependence Test
Given
do i1 = L1,U1
…
do in = Ln,Un
S1 : A[ f1( i1, …, in), …, fm(i1,…, in) ] = …
S2 : … = A[ g1(i1, …, in), …, gm(i1, …, in) ]
Let α and β be vectors of n integers within the ranges of the lower
and upper bounds of the n loops.
Does ∃ α, β in the loop iteration space, s.t.
fk(α) = gk(β) ∀ k, 1 ≤ k ≤ m?
Integer Linear Programming (ILP) for Dependence Test
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Does ∃ α, β in the loop iteration space, s.t.
fk(α) = gk(β) ∀ k, 1 ≤ k ≤ m?
Consider the two memory references:
S1(α): X[i1, j1],  S2(β): X[i2, j2-1]
with α = (i1, j1) and β = (i2, j2).

Does there exist a solution to this integer linear programming (ILP)
problem?

    i1 = i2              (access the same memory location)
    j1 = j2 - 1
    1 <= i1 <= 100       (loop bounds constraints)
    1 <= j1 <= 100
    1 <= i2 <= 100
    1 <= j2 <= 100
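For bounds this small, the feasibility question can be answered by brute force. A minimal Python sketch (the helper name `has_dependence` is mine, not from the slides):

```python
# Brute-force dependence test for the loop nest above: does any pair
# of iterations (i1, j1) and (i2, j2) satisfy i1 = i2, j1 = j2 - 1
# within the loop bounds lo..hi?
def has_dependence(lo, hi):
    for i1 in range(lo, hi + 1):
        for j1 in range(lo, hi + 1):
            i2, j2 = i1, j1 + 1        # forced by i1 = i2, j1 = j2 - 1
            if lo <= i2 <= hi and lo <= j2 <= hi:
                return (i1, j1), (i2, j2)  # a witness pair
    return None

witness = has_dependence(1, 100)
print(witness)  # ((1, 1), (1, 2)): S1(1,1) writes X[1,1], S2(1,2) reads it
```

A witness exists, so S1 and S2 do touch the same element of X and the dependence is real.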
Integer Linear Programming (ILP) for Dependence Test

The "same memory location" equations in matrix-vector notation:

Memory reference X[i1, j1] (in S1):
    F1 = [1 0; 0 1],  f1 = [0; 0]
    F1 * [i1; j1] + f1 = [i1; j1]

Memory reference X[i2, j2-1] (in S2):
    F2 = [1 0; 0 1],  f2 = [0; -1]
    F2 * [i2; j2] + f2 = [i2; j2-1]

The dependence equation
    F1 * [i1; j1] + f1 = F2 * [i2; j2] + f2
encodes
    i1 = i2
    j1 = j2 - 1
together with the loop bounds
    1 <= i1 <= 100, 1 <= j1 <= 100, 1 <= i2 <= 100, 1 <= j2 <= 100
Integer Linear Programming (ILP) for Dependence Test

The loop-bound inequalities in matrix-vector notation:

Loop bounds for (i1, j1):
    B1 = [1 0; 0 1; -1 0; 0 -1],  b1 = [-1; -1; 100; 100]
    B1 * [i1; j1] + b1 >= 0
which encodes
    1 <= i1 <= 100
    1 <= j1 <= 100

Loop bounds for (i2, j2):
    B2 = [1 0; 0 1; -1 0; 0 -1],  b2 = [-1; -1; 100; 100]
    B2 * [i2; j2] + b2 >= 0
which encodes
    1 <= i2 <= 100
    1 <= j2 <= 100
Putting Everything Together

In matrix-vector notation, a dependence exists iff the system

    F1 * [i1; j1] + f1 = F2 * [i2; j2] + f2    (same memory location)
    B1 * [i1; j1] + b1 >= 0                    (loop bounds for S1's iteration)
    B2 * [i2; j2] + b2 >= 0                    (loop bounds for S2's iteration)

has an integer solution, where
    F1 = [1 0; 0 1], f1 = [0; 0],  F2 = [1 0; 0 1], f2 = [0; -1]
and B1, b1, B2, b2 are as on the previous slide.
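The full matrix-vector system can likewise be checked by enumeration, which stands in here for a real ILP solver. A sketch (the helper names `matvec`, `vadd`, `feasible` are mine, and the bound is shrunk to a hypothetical N = 10 so the search stays tiny):

```python
# Feasibility check of F1*x + f1 == F2*y + f2, B*x + b >= 0,
# B*y + b >= 0 over a small box, by exhaustive enumeration.
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

F1, f1 = [[1, 0], [0, 1]], [0, 0]
F2, f2 = [[1, 0], [0, 1]], [0, -1]
N = 10  # assumed smaller loop bound, instead of 100
B = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b = [-1, -1, N, N]

def feasible():
    for i1 in range(1, N + 1):
        for j1 in range(1, N + 1):
            for i2 in range(1, N + 1):
                for j2 in range(1, N + 1):
                    x, y = [i1, j1], [i2, j2]
                    if (vadd(matvec(F1, x), f1) == vadd(matvec(F2, y), f2)
                            and all(c >= 0 for c in vadd(matvec(B, x), b))
                            and all(c >= 0 for c in vadd(matvec(B, y), b))):
                        return (x, y)
    return None

print(feasible())  # a solution exists, so the dependence is real
```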
Review: Parallelizing Affine Loops
Three spaces
• Iteration space
– The set of dynamic execution instances
For instance, the set of value vectors taken by loop indices
– A k-dimensional space for a k-level loop nest
• Data space
– The set of array elements accessed
– An n-dimensional space for an n-dimensional array
• Processor space
– The set of processors in the system
– In analysis, we may pretend there are unbounded # of virtual processors
Affine Half Space
Definition
An affine half-space of Z^d is defined as the set of points
    { x⃗ ∈ Z^d | a⃗ · x⃗ <= b }
The hyperplane a⃗ · x⃗ = b divides the d-dimensional space into two
half-spaces.
Iteration Space
• Bounded by multiple hyperplanes

for (i=0; i<=5; i++)
  for (j=i; j<=7; j++)
    Z[j, i] = 0;

Iteration space: { (i, j) | 0 <= i <= 5, i <= j <= 7 }
(Figure: the iteration space in the i-j plane, bounded by the
hyperplanes 0 <= i, i <= 5, and i <= j <= 7.)
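Since the iteration space is just the set of integer points satisfying these hyperplane inequalities, it can be enumerated directly; a quick sketch:

```python
# Enumerate the iteration space of the loop nest above: all (i, j)
# with 0 <= i <= 5 and i <= j <= 7, the region cut out by the
# three hyperplanes.
space = [(i, j) for i in range(0, 6) for j in range(i, 8)]
print(len(space))  # row i contributes 8 - i points: 8+7+6+5+4+3 = 33
```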
Three Spaces
• Iteration space, data space, and processor space

float Z[100];
for (i=0; i<10; i++)
  S: Z[i+10] = Z[i];

Iteration space: the instances S(i), i = 0, 1, ..., 9
Data space: the elements of Z that are touched, Z[0..19],
accessed as Z[i] and Z[i+10]
Processor space: processors 0, 1, ..., 9, with mapping P(i) = i
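For this loop the mapping P(i) = i is safe because the written and read parts of the data space never overlap, so no iteration depends on another. A small sketch of that check:

```python
# For S: Z[i+10] = Z[i], i = 0..9: the write footprint and read
# footprint of the data space are disjoint, so every S(i) can run
# on its own processor P(i) = i with no synchronization.
writes = {i + 10 for i in range(10)}  # Z[10..19]
reads = {i for i in range(10)}        # Z[0..9]
print(writes & reads)  # empty set: no cross-iteration dependence
```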
Synchronization-free Parallelism
Parallelize an application without allowing any communication
or synchronization among (logical) processors.

Example:
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i, j - 1]

Dependence from S1 to S1: the write in S1(i, j) reaches the read in
S1(i, j+1); e.g., the write in S1(1,1) reaches the read in S1(1,2).
(Figure: iteration space for i = 1..6, j = 1..4, with dependence
chains running along the j direction within each column i.)
Synchronization-free Parallelism
Parallelize an application without allowing any communication
or synchronization among (logical) processors.

Example:
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i, j - 1]

Dependence from S1 to S1: the write in S1(i, j) reaches the read in
S1(i, j+1); e.g., the write in S1(1,1) reaches the read in S1(1,2).
(Figure: each column i of the iteration space is assigned to its own
processor, P1..P6.) Communication is limited to the iterations on one
processor.
Synchronization-free Parallelism
Parallelize an application without allowing any communication
or synchronization among (logical) processors.

Example:
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i, j - 1]

Dependence from S1 to S1: the write in S1(i, j) reaches the read in
S1(i, j+1); e.g., the write in S1(1,1) reaches the read in S1(1,2).

Which loop can be parallelized, the "i" loop or the "j" loop?
Answer: the "i" loop; every dependence stays within a single value of i.
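One way to see why the "i" loop is the parallel one is to compute the dependence distance vectors; a sketch (the helper `distance_vectors` is mine):

```python
# The write A[i, j] in iteration (i, j) is read as A[i, j-1] by
# iteration (i, j+1), giving distance vector (0, 1). A zero in the
# i component means no dependence crosses i iterations, so the i
# loop is parallel; the nonzero j component keeps the j loop serial.
def distance_vectors(N):
    dist = set()
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            if j - 1 >= 1:  # A[i, j-1] was written by iteration (i, j-1)
                dist.add((i - i, j - (j - 1)))
    return dist

print(distance_vectors(6))  # {(0, 1)}
```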
Synchronization-free Parallelism
Parallelize an application without allowing any communication
or synchronization among (logical) processors.

Example 1:
doall i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i, j - 1]

Dependence from S1(i, j) to S1(i, j+1), e.g. from S1(1,1) to S1(1,2).
The dependence chain is characterized by a hyperplane; in this case it
is "i = constant". (Figure: processor Pi executes column i of the
iteration space, i = 1..6.)
Synchronization-free Parallelism
Parallelize an application without allowing any communication
or synchronization among (logical) processors.

Example 2:
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i - 1, j - 1]

Dependence from S1(i, j) to S1(i+1, j+1).
The dependence chain is characterized by a hyperplane; in this case it
is "j - i + constant = 0". (Figure: each diagonal of the iteration
space maps to one of the processors P1..P9.)
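A diagonal schedule such as p = j - i keeps every dependence chain on one processor; a quick check of that claim (the helper `schedule_ok` is mine):

```python
# For S1: A[i,j] = A[i-1,j-1], the dependence runs from (i, j) to
# (i+1, j+1). Assigning iteration (i, j) to processor p = j - i
# (one processor per diagonal) must put every dependent pair on
# the same processor.
def schedule_ok(N):
    for i in range(1, N):
        for j in range(1, N):
            src, dst = (i, j), (i + 1, j + 1)  # dependence edge
            if (src[1] - src[0]) != (dst[1] - dst[0]):
                return False
    return True

print(schedule_ok(6))  # True: diagonals can run in parallel
```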
Synchronization-free Parallelism

Parallelize an application without allowing any communication
or synchronization among (logical) processors.

Example 3:
for (i=1; i<=100; i++)
  for (j=1; j<=100; j++){
    S1: X[i,j] = X[i,j] + Y[i-1, j];
    S2: Y[i,j] = Y[i,j] + X[i, j-1];
  }

Two dependences:
• True dependence on X, carried by the j loop: from S1(1,1) to S2(1,2)
  (S2 reads the X[i, j-1] written by S1).
• True dependence on Y, carried by the i loop: from S2(1,1) to S1(2,1)
  (S1 reads the Y[i-1, j] written by S2).
(Figure: both dependence chains follow the diagonals of the iteration
space, e.g. the diagonal assigned to P3.)
Dependence and Parallelization
• Dependence chain in affine loops modeled as a hyperplane.
• Iterations along the same hyperplane must execute sequentially.
• Iterations on different hyperplanes can execute in parallel.
Example:
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i - 1, j - 1]

Dependence from S1(i, j) to S1(i+1, j+1).
The hyperplane is j - i + constant = 0: iterations on the same
diagonal run sequentially, while the diagonals P1..P9 run in parallel.
Processor Space: Affine Partition Schedule
• Map an iteration to a processor using <C, d⃗>
  – C is an n-by-m matrix
    • m is the depth of the loop nest
    • n is the dimension of the processor grid
  – d⃗ is an n-element constant vector
  – p⃗ = C x⃗ + d⃗, where x⃗ is the iteration vector
Processor Space: Affine Partition Schedule
• Example
for (i=1; i<=N; i++)
  S: Y[i] = Z[i];

C = [1], d⃗ = [0]
p( S(i) ) = 1*i + 0 = i
Map iteration i to processor i.
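Evaluating an affine partition schedule is just a matrix-vector product plus a constant; a minimal sketch for the one-dimensional case (the helper `processor` is mine):

```python
# Apply an affine partition schedule p = C*x + d to an iteration
# vector x (here C is a single row, so p is a scalar).
def processor(C, d, x):
    return sum(c * xi for c, xi in zip(C, x)) + d

# The slide's schedule: C = [1], d = 0 maps iteration i to processor i.
C, d = [1], 0
assignments = {i: processor(C, d, [i]) for i in range(1, 6)}
print(assignments)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
```

The same helper handles two-dimensional schedules, e.g. `processor([1, -1], -1, [i, j])` computes p = i - j - 1.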
Synchronization-free Parallelism
• Two memory references F1 * α + f1 and F2 * β + f2, such that
  – F1 is a matrix and f1 is a vector.
    The affine memory access index is F1 * α + f1.
  – B1 is a matrix and b1 is a vector.
    The affine loop bounds can be expressed as B1 * α + b1 >= 0.
• Let <C1, d1> and <C2, d2> be the affine partition schedules of the
  two statements. To have synchronization-free parallelism, whenever
  the two references access the same memory location, the two
  iterations must execute on the same processor (sequentially):
    C1 * α + d1 = C2 * β + d2
Synchronization-free Parallelism
• For the running example:

for (i=1; i<=100; i++)
  for (j=1; j<=100; j++){
    S1: X[i,j] = X[i,j] + Y[i-1, j];
    S2: Y[i,j] = Y[i,j] + X[i, j-1];
  }

the dependence system is
    F1 * [i1; j1] + f1 = F2 * [i2; j2] + f2
    B1 * [i1; j1] + b1 >= 0
    B2 * [i2; j2] + b2 >= 0

• We want to find processor schedules <[C11 C12], d1> for S1 and
  <[C21 C22], d2> for S2 such that
    [C11 C12] * [i1; j1] + d1 = [C21 C22] * [i2; j2] + d2
Synchronization-free Parallelism
for (i=1; i<=100; i++)
  for (j=1; j<=100; j++){
    S1: X[i,j] = X[i,j] + Y[i-1, j];
    S2: Y[i,j] = Y[i,j] + X[i, j-1];
  }

S1 to S2 dependence (on X, carried by the j loop):
    1 <= i1 <= 100, 1 <= j1 <= 100, 1 <= i2 <= 100, 1 <= j2 <= 100,
    i1 = i2, j1 = j2 - 1
    [C11 C12] * [i1; j1] + d1 = [C21 C22] * [i2; j2] + d2
Substituting i2 = i1 and j2 = j1 + 1:
    [C11 - C21, C12 - C22] * [i1; j1] + (d1 - d2 - C22) = 0
This must hold for all valid (i1, j1), so
    C11 - C21 = 0
    C12 - C22 = 0
    d1 - d2 - C22 = 0

Synchronization-free Parallelism

S2 to S1 dependence (on Y, carried by the i loop): S1(i3, j3) reads
Y[i3-1, j3], which is written by S2(i4, j4), so
    1 <= i3 <= 100, 1 <= j3 <= 100, 1 <= i4 <= 100, 1 <= j4 <= 100,
    i3 - 1 = i4, j3 = j4
    [C11 C12] * [i3; j3] + d1 = [C21 C22] * [i4; j4] + d2
Substituting i4 = i3 - 1 and j4 = j3:
    [C11 - C21, C12 - C22] * [i3; j3] + (d1 - d2 + C21) = 0
so
    C11 - C21 = 0
    C12 - C22 = 0
    d1 - d2 + C21 = 0

Synchronization-free Parallelism

Combining the constraints from both dependences:
    C11 = C21 = -C22 = -C12 = d2 - d1

Synchronization-free Parallelism

One potential solution:
Affine schedule for S1, p(S1): [C11 C12] = [1 -1], d1 = -1,
i.e. iteration (i, j) of S1 goes to processor p = i - j - 1.
Affine schedule for S2, p(S2): [C21 C22] = [1 -1], d2 = 0,
i.e. iteration (i, j) of S2 goes to processor p = i - j.
(Figure: for i = 1..6, j = 1..4, the diagonals of the iteration space
are labeled p = -4, ..., 5; S1 and S2 instances on the same diagonal
share a processor.)

Code Generation
• Step 1: find processor ID ranges (for the 6x4 iteration space)
  – S1: p = i - j - 1, so -4 <= p <= 4
  – S2: p = i - j, so -3 <= p <= 5
  – Union: -4 <= p <= 5
• Step 2: generate code
  S1(i, j) runs on processor p = i - j - 1;
  S2(i, j) runs on processor p = i - j.
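The chosen schedules can be sanity-checked on the 6x4 iteration space used in the code-generation example: dependent instances must share a processor, and the processor ID ranges should come out as stated. A sketch:

```python
# Verify p(S1) = i - j - 1 and p(S2) = i - j on the 6x4 space.
I, J = range(1, 7), range(1, 5)
p1 = lambda i, j: i - j - 1   # schedule for S1
p2 = lambda i, j: i - j       # schedule for S2

# X dependence: S1(i, j) -> S2(i, j+1); Y dependence: S2(i, j) -> S1(i+1, j).
# Both identities hold for every (i, j), so each chain stays on one processor.
same_proc = (all(p1(i, j) == p2(i, j + 1) for i in I for j in J)
             and all(p2(i, j) == p1(i + 1, j) for i in I for j in J))

r1 = {p1(i, j) for i in I for j in J}  # processor IDs used by S1
r2 = {p2(i, j) for i in I for j in J}  # processor IDs used by S2
print(same_proc, min(r1), max(r1), min(r2), max(r2))
```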
Naive Code Generation

Original code:
for (i=1; i<=6; i++)
  for (j=1; j<=4; j++){
    X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
    Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
  }

Naively generated parallel code:
forall (p=-4; p<=5; p++)
  for (i=1; i<=6; i++)
    for (j=1; j<=4; j++){
      if (p == i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
      if (p == i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
    }

What are the issues with this code?
• Idle iterations
• Lots of tests in the loop body

Remove Idle Iterations

Some iterations perform no useful work. For example, when p = -4,
only 1 of the 24 iterations has useful operations: i = 1, j = 4.

Make Loop Bounds Tighter

Use Fourier-Motzkin elimination on each statement's constraints.

S1: -4 <= p <= 5, 1 <= i <= 6, 1 <= j <= 4, j = i - p - 1
  Eliminate j: i-p-1 <= j <= i-p-1 and 1 <= j <= 4,
    so 1 <= i - p - 1 <= 4, i.e. p + 2 <= i <= p + 5
  Eliminate i: p+2 <= i <= p+5 and 1 <= i <= 6 give -4 <= p <= 4

S2: -4 <= p <= 5, 1 <= i <= 6, 1 <= j <= 4, j = i - p
  Eliminate j: i-p <= j <= i-p and 1 <= j <= 4,
    so 1 <= i - p <= 4, i.e. p + 1 <= i <= p + 4
  Eliminate i: p+1 <= i <= p+4 and 1 <= i <= 6 give -3 <= p <= 5

Make Loop Bounds Tighter

Union of the S1 and S2 bounds:
  j: i-p-1 <= j <= i-p, 1 <= j <= 4
  i: p+1 <= i <= 5+p, 1 <= i <= 6
  p: -4 <= p <= 5

Make Loop Bounds Tighter

Transformed code:
for (p=-4; p<=5; p++)
  for (i=max(1,p+1); i<=min(6,5+p); i++)
    for (j=max(1,i-p-1); j<=min(4,i-p); j++){
      if (p == i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
      if (p == i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
    }

(Figure: when p = -2, the bounds give i in [1, 3] and
j in [i+1, min(4, i+2)].)

Remove Tests

Can the tests be removed by computing j directly from p and i?
for (p=-4; p<=5; p++)
  for (i=max(1,p+1); i<=min(6,5+p); i++){
    j = i-p-1;
    X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
    j = i-p;
    Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
  }
? (Careful: at the boundary, j = i-p-1 or j = i-p can fall outside
[1, 4], so each statement still needs its own bound check.)

Next Class

Reading
• ALSU, Chapter 11.1 - 11.7
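The effect of the Fourier-Motzkin tightening can be checked by comparing the statement instances the naive guarded loop and the tightened loop enumerate; a sketch:

```python
# The naive loop scans all 10*6*4 = 240 (p, i, j) triples and guards
# each statement; the tightened loop should visit exactly the useful
# triples and produce the same instances in the same order.
naive, tight = [], []
for p in range(-4, 6):
    for i in range(1, 7):
        for j in range(1, 5):
            if p == i - j - 1: naive.append(('S1', i, j))
            if p == i - j:     naive.append(('S2', i, j))
for p in range(-4, 6):
    for i in range(max(1, p + 1), min(6, 5 + p) + 1):
        for j in range(max(1, i - p - 1), min(4, i - p) + 1):
            if p == i - j - 1: tight.append(('S1', i, j))
            if p == i - j:     tight.append(('S2', i, j))
print(naive == tight, len(tight))  # same 48 instances, no idle iterations
```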