CS 314 Principles of Programming Languages
Prof. Zheng Zhang
Rutgers University
Lecture 19: Parallelism and Dependence Analysis
November 14, 2018
Review: Dependence Definition
2
Bernstein’s Condition — There is a data dependence from
statement (instance) S1 to statement (instance) S2 if
• Both statements (instances) access the same memory location(s)
• One of them is a write
• There is a run-time execution path from S1 to S2
Example:
S1: pi = 3.14
S2: R = 5
S3: Area = pi * R^2
(Dependence graph: S1 → S3 and S2 → S3.)
Data Dependence Classifications
3
“S2 depends on S1” — (S1 δ S2)
True (flow) dependence
occurs when S1 writes a memory location that S2 later reads (RAW).
Anti dependence
occurs when S1 reads a memory location that S2 later writes (WAR).
Output dependence
occurs when S1 writes a memory location that S2 later writes (WAW).
Input dependence
occurs when S1 reads a memory location that S2 later reads (RAR).
Review: Dependence Testing
4
• Single loop nest with constant lower (LB) and upper (UB) bounds
and step 1.
• Two array references that are affine functions of the loop induction variable
Single Induction Variable (SIV) Test
for i = LB, UB, 1
…
endfor
for i = LB, UB, 1
R1: X(a*i + c1) = …
R2: … = X(a*i + c2)
endfor
Question: Is there a true dependence between R1 and R2?
Review: Dependence Testing
5
There is a dependence between R1 and R2 iff
∃ i, i′: LB ≤ i ≤ i′ ≤ UB and (a*i + c1) = (a*i′ + c2),
where i and i′ represent two iterations in the iteration space. This
means that in both iterations, the same element of array X is accessed.
So let’s just solve the equation:
(a*i + c1) = (a*i′ + c2)  ⇒  ∆d = i′ − i = (c1 − c2)/a
There is a dependence iff
• ∆d is an integer value
• 0 ≤ ∆d ≤ UB − LB
for i = LB, UB, 1
R1: X(a*i + c1) = …
R2: … = X(a*i + c2)
endfor
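A minimal sketch of this SIV test in C (my own illustration, not from the lecture; the function name siv_true_dep and its parameters are made up):

#include <stdio.h>

/* Sketch: true-dependence SIV test for the pattern
 *   for i = LB, UB, 1
 *     R1: X(a*i + c1) = ...
 *     R2: ...         = X(a*i + c2)
 * Returns 1 if a true dependence R1 -> R2 may exist. */
int siv_true_dep(int a, int c1, int c2, int LB, int UB) {
    if (a == 0)                       /* both subscripts are constants */
        return c1 == c2;
    if ((c1 - c2) % a != 0)           /* delta_d must be an integer    */
        return 0;
    int dd = (c1 - c2) / a;           /* delta_d = i' - i              */
    /* dd < 0 would mean the read happens first (an anti dependence),
       so for a true dependence we need 0 <= dd <= UB - LB.            */
    return dd >= 0 && dd <= UB - LB;
}

int main(void) {
    /* X(i) written, X(i-1) read, i = 1..100:  a=1, c1=0, c2=-1 */
    printf("%d\n", siv_true_dep(1, 0, -1, 1, 100)); /* 1: dependence    */
    /* X(2i) written, X(2i-1) read, i = 3..15:  a=2, c1=0, c2=-1 */
    printf("%d\n", siv_true_dep(2, 0, -1, 3, 15));  /* 0: no dependence */
    return 0;
}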
Simple Dependence Testing
6
• Examples:
for (i = 1; i <= 100; i++) {
S1: A[i] = ...
S2: ...= A[i - 1]
}
float Z[100];
for (i =0; i < 12; i++) {
S: Z[ i+10 ] = Z[i];
}
1. Is there dependence?
2. If so, what type of dependence?
3. From which statement (instance) to which statement (instance)?
Simple Dependence Testing
7
• Examples:
for (i = 1; i <= 100; i++) {
S1: A[i] = ...
S2: ...= A[i - 1]
}
float Z[100];
for (i =0; i < 12; i++) {
S: Z[ i+10 ] = Z[i];
}
First example (S1, S2): true dependence (read after write):
Wt: A[i] in S1 → Rd: A[i′ − 1] in S2, with i′ = i + 1, ∆d = 1.
Second example (S): true dependence (read after write):
Wt: Z[i + 10] in S → Rd: Z[i′] in S, with i′ = i + 10, ∆d = 10.
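As a sanity check (my own sketch, not part of the slides), the second example’s dependences can be found by brute force over its small iteration space:

#include <stdio.h>

/* Brute-force check for the Z example: iteration i writes Z[i+10],
 * iteration ip reads Z[ip]. Report every pair touching the same element. */
int main(void) {
    for (int i = 0; i < 12; i++)
        for (int ip = 0; ip < 12; ip++)
            if (i + 10 == ip)
                printf("write in iteration %d -> read in iteration %d\n", i, ip);
    return 0;
}
/* Prints:
 *   write in iteration 0 -> read in iteration 10
 *   write in iteration 1 -> read in iteration 11
 */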
Simple Dependence Testing
8
• More Examples:
1. Is there dependence?
2. If so, what type of dependence?
3. From which statement (instance) to which statement (instance)?
for (i = 1; i <= 100; i++) {
R1: X[i] = ...
R2: ... = X[i + 2]
}
for (i = 3; i <= 15; i++) {
S1: X[2 * i] = ...
S2: ... = X[2 * i - 1]
}
Simple Dependence Testing
9
• More Examples:
for (i = 1; i <= 100; i++) {
R1: X[i] = ...
R2: ... = X[i + 2]
}
for (i = 3; i <= 15; i++) {
S1: X[2 * i] = ...
S2: ... = X[2 * i - 1]
}
First example (R1, R2): anti dependence (write after read):
Rd: X[i + 2] in R2 → Wt: X[i′] in R1, with i′ = i + 2, ∆d = 2.
Second example (S1, S2): no dependence! 2*i and 2*i′ − 1 can never be equal,
since one is always even and the other always odd.
Review: Automatic Parallelization
10
We will use loop analysis as an example to describe automatic
dependence analysis and parallelization.
Assumptions:
1. We only have scalar and subscripted variables (no pointers and
no control dependence) for loop dependence analysis.
2. We focus on affine loops: both loop bounds and memory
references are affine functions of loop induction variables.
A function f(x1, x2, ..., xn) is affine if it can be written in the form
f = c0 + c1*x1 + c2*x2 + ... + cn*xn, where the ci are all constants.
Review: Affine Loops
11
Three spaces
• Iteration space
‣ The set of dynamic execution instances
‣ i.e. the set of value vectors taken by loop indices
‣ A k-dimensional space for a k-level loop nest
• Data space
‣ The set of array elements accessed
‣ An n-dimensional space for an n-dimensional array
• Processor space
‣ The set of processors in the system
‣ In analysis, we may pretend there is an unbounded number of virtual processors
Iteration Space
12
• Example
for (i=0; i<=5; i++)
  for (j=i; j<=7; j++)
    Z[j, i] = 0;
Iteration space: 0 <= i <= 5, i <= j <= 7.
(Figure: the iteration points plotted in the (i, j) plane, bounded by 0 <= i <= 5 and i <= j <= 7.)
Lexicographical Order
13
for (i = 1; i <= 5; i++)
  for (j = 1; j <= 6 - i; j++)
    Z[j, i] = 0;
Iteration space: 1 <= i <= 5, 1 <= j <= 6 - i.
• Order of sequential loop executions
• Sweeping through the space in an ascending lexicographic order:
(i, j) <= (i', j') iff one of the two conditions is satisfied:
1. i < i'
2. i = i' and j <= j'
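A minimal sketch of this comparison in C (my own illustration; lex_leq is a made-up name), for 2-level iteration vectors:

#include <stdio.h>

/* Lexicographic order on 2-level iteration vectors:
 * (i, j) <= (i2, j2) iff i < i2, or i == i2 and j <= j2. */
int lex_leq(int i, int j, int i2, int j2) {
    return (i < i2) || (i == i2 && j <= j2);
}

int main(void) {
    printf("%d\n", lex_leq(1, 5, 2, 1)); /* 1: (1,5) executes before (2,1) */
    printf("%d\n", lex_leq(2, 3, 2, 1)); /* 0: (2,3) executes after (2,1)  */
    return 0;
}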
Dependence Test
14
Given
do i1 = L1,U1
...
do in = Ln,Un
S1 : A[ f1( i1, …, in), …, fm(i1,…, in) ] = ...
S2 : … = A[ g1(i1, …, in), …, gm(i1, …, in) ]
A dependence between statement instances S1 and S2, denoted S1 δ S2,
indicates that the S1 instance (the source) must be executed before the S2
instance (the sink) on some iteration of the loop nest.
Let α and β each be a vector of n integers within the ranges of the lower
and upper bounds of the n loops.
Does ∃ α, β in the loop iteration space, s.t.
fk(α) = gk(β) for all k, 1 ≤ k ≤ m?
Dependence Test
15
Given
do i1 = L1,U1
...
do in = Ln,Un
S1 : A[ f1( i1, …, in), …, fm(i1,…, in) ] = ...
S2 : … = A[ g1(i1, …, in), …, gm(i1, …, in) ]
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Example: consider the two memory references X[i, j] and X[i, j-1].
For X[i, j]:   f1(i,j) = i,  f2(i,j) = j;
For X[i, j-1]: g1(i,j) = i,  g2(i,j) = j - 1.
Dependence Test as Integer Linear Programming Problem
16
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Does ∃ α, β in the loop iteration space, s.t. fk(α) = gk(β) for all k, 1 ≤ k ≤ m?
Let α = (i1, j1) and β = (i2, j2), and consider the two memory references
S1(α): X[i1, j1] and S2(β): X[i2, j2 - 1].
If there is a dependence, then
(i1, j1): 1 <= i1 <= 100, 1 <= j1 <= 100,
(i2, j2): 1 <= i2 <= 100, 1 <= j2 <= 100,
and
i1 = i2, j1 = j2 - 1.
Do such (i1, j1), (i2, j2) exist?
Dependence Test as Integer Linear Programming Problem
17
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Consider the two memory references S1(α): X[i1, j1] and S2(β): X[i2, j2 - 1],
with α = (i1, j1) and β = (i2, j2).
Does ∃ α, β in the loop iteration space, s.t. fk(α) = gk(β) for all k, 1 ≤ k ≤ m?
That is, does there exist a solution to this integer linear programming (ILP) problem?
i1 = i2, j1 = j2 - 1                (access the same memory location)
1 <= i1 <= 100, 1 <= j1 <= 100,
1 <= i2 <= 100, 1 <= j2 <= 100      (loop bounds constraints)
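Compilers solve such systems with dedicated dependence tests rather than by enumeration; purely as an illustration, and only because these bounds are tiny, the system above can also be checked by brute force (my own sketch):

#include <stdio.h>

/* Brute-force check of the ILP system for X[i1,j1] (write in S1) and
 * X[i2,j2-1] (read in S2): i1 = i2, j1 = j2 - 1, with both points inside
 * the 1..100 x 1..100 iteration space. */
int main(void) {
    int found = 0;
    for (int i1 = 1; i1 <= 100 && !found; i1++)
        for (int j1 = 1; j1 <= 100 && !found; j1++)
            for (int i2 = 1; i2 <= 100 && !found; i2++)
                for (int j2 = 1; j2 <= 100 && !found; j2++)
                    if (i1 == i2 && j1 == j2 - 1) {
                        printf("solution: (i1,j1)=(%d,%d), (i2,j2)=(%d,%d)\n",
                               i1, j1, i2, j2);
                        found = 1;   /* one witness is enough */
                    }
    if (!found) printf("no solution: the references are independent\n");
    return 0;
}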
Back to this Example
18
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i, j] = X[i,j] + Y[i-1, j];
S2: Y[i, j] = Y[i,j] + X[i, j-1];
}
Dependence graph: S1 ↔ S2.
True dependence (RAW), carried by the “j” loop:
Wt: X[i, j] in S1 → Rd: X[i', j'-1] in S2.
True dependence (RAW), carried by the “i” loop:
Wt: Y[i, j] in S2 → Rd: Y[i'-1, j'] in S1.
ILP problem for the X dependence only:
i1 = i2, j1 = j2 - 1                (access the same memory location)
1 <= i1 <= 100, 1 <= j1 <= 100,
1 <= i2 <= 100, 1 <= j2 <= 100      (loop bounds constraints)
Dependence and Parallelization
19
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
Dependences: S1 → S2 (true, j loop, for X); S2 → S1 (true, i loop, for Y).
Example: dependence from S2(1,1) to S1(2,1).
(Figure: iteration space for i = 1..6, j = 1..4 with dependence arrows between iterations.)
Dependence and Parallelization
20
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
Dependence from S2(1,1) to S1(2,1) for the Y[ , ] memory reference.
(Figure: the same iteration space, highlighting the Y dependences from S2(i, j) to S1(i+1, j).)
Dependence and Parallelization
22
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
Dependence from S1(1,1) to S2(1,2) for the X[ , ] memory reference.
(Figure: the same iteration space, highlighting the X dependences from S1(i, j) to S2(i, j+1).)
Dependence and Parallelization
24
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i, j - 1]
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
Dependence from S1(1,1) to S1(1,2): write in S1(1,1) to read in S1(1,2);
in general, write in S1(i, j) to read in S1(i, j+1).
Which loop can be parallelized? The “i” loop or the “j” loop?
(Figure: iteration space for i = 1..6, j = 1..4 with dependence arrows along the +j direction.)
Dependence and Parallelization
25
doall i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i, j - 1]
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
Dependence from S1(1,1) to S1(1,2); in general, write in S1(i, j) to read in S1(i, j+1).
Which loop can be parallelized? The “i” loop or the “j” loop?
Answer: the “i” loop. A doall loop means all iterations in the loop can run in parallel.
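The same transformation expressed in C with OpenMP (a hedged sketch of mine, not from the slides; the array size N and the function name are assumptions): only the dependence-free “i” loop is marked parallel.

#define N 100

/* The j-carried dependence (A[i][j] reads A[i][j-1]) keeps the inner
 * loop sequential; iterations of the outer "i" loop touch disjoint rows,
 * so the "i" loop can run as a doall. */
void sweep(float A[N + 1][N + 1]) {
    #pragma omp parallel for            /* doall i */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)    /* must stay sequential */
            A[i][j] = A[i][j - 1];
}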
Dependence and Parallelization
26
do i = 1, N
  do j = 1, N
    S1: A[i, j] = A[i - 1, j - 1]
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
Dependence from S1(1,1) to S1(2,2): write in S1(1,1) to read in S1(2,2);
in general, write in S1(i, j) to read in S1(i+1, j+1).
The hyperplane is j - i = “a constant”.
Can either the “i” loop or the “j” loop be parallelized?
(Assuming no synchronization is allowed.)
Dependence and Parallelization
27
do I = 1, N
  do J = 1, N
    S1: A[I, J] = A[I-1, J+1]
• Dependence in affine loops modeled as a hyperplane
• Iterations along the same hyperplane must execute sequentially
• Iterations on different hyperplanes can execute in parallel
Dependence from S1(1, 2) to S1(2, 1): write in S1(1,2) to read in S1(2,1);
in general, write in S1(i, j) to read in S1(i+1, j-1).
The hyperplane is j + i = “a constant”.
Distance Vector
28
do I = 1, N
  do J = 1, N
    S1: A(I, J) = A(I+1, J-1)
The number of iterations between two accesses to the same memory
location is usually represented as a distance vector.
Anti dependence (write after read): read in S1(1,2) to write in S1(2,1);
in general, from S1(i, j) to S1(i+1, j-1).
Distance vector from read to write: (1, -1).
Processing Space: Affine Partition Schedule
29
Notation: bold font for container (vector/matrix) variables; normal font for scalar variables.
• p = C*i + c
• C is an n-by-m matrix
• m = d (the loop level)
• n is the dimension of the processor grid
• c is an n-element constant vector
• Examples:
for (i=1; i<=N; i++)
  Y[i] = Z[i];
1-d processor grid: C = [1], c = [0], so p = i.
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    Y[i,j] = Z[i,j];
2-d processor grid: C = [1 0; 0 1] (the 2×2 identity), so p = i, q = j.
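A tiny sketch (my own, with made-up names) of evaluating p = C*i + c for an iteration vector:

#include <stdio.h>

/* Hypothetical helper: map an iteration vector iv (length m) to an
 * n-dimensional processor coordinate p = C*iv + c. */
void affine_map(int n, int m, const int C[n][m], const int c[n],
                const int iv[m], int p[n]) {
    for (int r = 0; r < n; r++) {
        p[r] = c[r];
        for (int k = 0; k < m; k++)
            p[r] += C[r][k] * iv[k];
    }
}

int main(void) {
    /* 2-level loop nest mapped to a 2-d processor grid: C = I, c = 0 */
    int C[2][2] = {{1, 0}, {0, 1}}, c[2] = {0, 0};
    int iv[2] = {3, 5}, p[2];
    affine_map(2, 2, C, c, iv, p);
    printf("iteration (3,5) -> processor (%d,%d)\n", p[0], p[1]);
    return 0;
}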
Synchronization-free Parallelism
30
• Two memory references, written as affine expressions F1*i1 + f1 and F2*i2 + f2
• Let the affine partition schedules be C1*i1 + c1 and C2*i2 + c2
• To be synchronization-free:
‣ For all i1 in Z^d1 (d1-dimensional integer vectors) and i2 in Z^d2 such
that
1. B1*i1 + b1 >= 0, and
2. B2*i2 + b2 >= 0, and
3. F1*i1 + f1 = F2*i2 + f2,
4. it must be the case that C1*i1 + c1 = C2*i2 + c2.
F1, f1 describe the memory reference, i.e., F1*x + f1.
B1, b1 describe the loop bound constraints, i.e., B1*x + b1 >= 0.
Synchronization-free Parallelism
31
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
(Figure: iteration space for i = 1..6, j = 1..4 with the S1/S2 dependence arrows.)
• To be synchronization-free:
- For all i1 in Z^d1 (d1-dimensional integer vectors) and i2 in Z^d2 such that
‣ B1*i1 + b1 >= 0, and
‣ B2*i2 + b2 >= 0, and
‣ F1*i1 + f1 = F2*i2 + f2,
‣ it must be the case that C1*i1 + c1 = C2*i2 + c2.
Synchronization-free Parallelism
32
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Dependences: S1 → S2 (true, j loop, for X); S2 → S1 (true, i loop, for Y).

S1-to-S2 dependence (on X), with (i1, j1) the S1 instance and (i2, j2) the S2 instance:
1 <= i1 <= 100, 1 <= j1 <= 100, 1 <= i2 <= 100, 1 <= j2 <= 100,
i1 = i2, j1 = j2 - 1,
[C11 C12] [i1 j1]^T + [c1] = [C21 C22] [i2 j2]^T + [c2],
which simplifies to
[C11 - C21   C12 - C22] [i1 j1]^T + [c1 - c2 - C22] = 0.

S2-to-S1 dependence (on Y), with (i3, j3) the S1 instance and (i4, j4) the S2 instance:
1 <= i3 <= 100, 1 <= j3 <= 100, 1 <= i4 <= 100, 1 <= j4 <= 100,
i3 - 1 = i4, j3 = j4,
[C11 C12] [i3 j3]^T + [c1] = [C21 C22] [i4 j4]^T + [c2],
which simplifies to
[C11 - C21   C12 - C22] [i3 j3]^T + [c1 - c2 + C21] = 0.
Synchronization-free Parallelism
33
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
S1: X[i,j] = X[i,j] + Y[i-1, j];
S2: Y[i,j] = Y[i,j] + X[i, j-1];
}
Dependences: S1 → S2 (true, j loop, for X); S2 → S1 (true, i loop, for Y).
The two constraints must hold for every (i1, j1) and every (i3, j3) in the iteration space:
[C11 - C21   C12 - C22] [i1 j1]^T + [c1 - c2 - C22] = 0
[C11 - C21   C12 - C22] [i3 j3]^T + [c1 - c2 + C21] = 0
so every coefficient must be zero:
C11 - C21 = 0, C12 - C22 = 0, c1 - c2 - C22 = 0
C11 - C21 = 0, C12 - C22 = 0, c1 - c2 + C21 = 0
which gives
C11 = C21 = -C22 = -C12 = c2 - c1
Solution
34
p(S1): < [C11 C12], [c1] >,  p(S2): < [C21 C22], [c2] >,  with C11 = C21 = -C22 = -C12 = c2 - c1.
One solution:
Affine schedule for S1, p(S1): [C11 C12] = [1 -1], c1 = -1,
i.e. the (i, j) iteration of S1 goes to processor p = i - j - 1;
Affine schedule for S2, p(S2): [C21 C22] = [1 -1], c2 = 0,
i.e. the (i, j) iteration of S2 goes to processor p = i - j.
for (i=1; i<=100; i++)
for (j=1; j<=100; j++){
X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */
Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */
}
(Figure: the iteration space with each diagonal i - j = constant labeled by its processor id p.)
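One way to realize this schedule in code (my own sketch, not from the lecture): each processor p executes the S1 instances with i - j - 1 = p and the S2 instances with i - j = p, scanning them in the original order, so every dependent pair stays on the same processor and no synchronization is needed.

#define N 100

float X[N + 1][N + 1], Y[N + 1][N + 1];

/* Processor p owns S1 instances with i - j - 1 == p and S2 instances
 * with i - j == p. Both dependences (S1(i,j) -> S2(i,j+1) on X and
 * S2(i,j) -> S1(i+1,j) on Y) stay within one processor, so processors
 * never need to synchronize with each other. */
void run_processor(int p) {
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            if (i - j - 1 == p)                     /* S1 */
                X[i][j] = X[i][j] + Y[i - 1][j];
            if (i - j == p)                         /* S2 */
                Y[i][j] = Y[i][j] + X[i][j - 1];
        }
}

int main(void) {
    #pragma omp parallel for   /* p = i - j ranges over -N .. N-1 */
    for (int p = -N; p <= N - 1; p++)
        run_processor(p);
    return 0;
}

Scanning the full i, j range on every processor is wasteful; a real code generator would iterate only along the diagonal owned by p, but the simple scan keeps the correspondence with the schedule obvious.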
More Examples
35
do I = 1, N
  do J = 1, N
    S1: A[I, J] = A[I-1, J - 1]
True dependence (read after write); the hyperplane is j - i = “a constant”.
Affine partition schedule for S1, p(S1): C = [C11 C12] = [1 -1], c = 0,
i.e. the (i, j) iteration of S1 goes to processor p = i - j.
(Figure: the iteration space with each diagonal i - j = constant labeled by its processor id p.)
More Examples
36
do I = 1, N
  do J = 1, N
    S1: A[I, J] = A[I+1, J-1]
Anti dependence (write after read): read in S1(1,2) to write in S1(2,1);
in general, from S1(i, j) to S1(i+1, j-1).
The hyperplane is j + i = “a constant”.
Affine partition schedule for S1, p(S1): C = [C11 C12] = [1 1], c = 0,
i.e. the (i, j) iteration of S1 goes to processor p = i + j.
(Figure: the iteration space with each anti-diagonal i + j = constant labeled by its processor id p.)
Next Class
37
Reading
• ALSU, Chapter 11.1 – 11.7