程序代写代做代考 lec19

lec19

CS 314 Principles of Programming Languages

Prof. Zheng Zhang
Rutgers University

Lecture 19: Parallelism and Dependence Analysis

November 14, 2018

Review: Dependence Definition

2

• Both statements (instances) access the same memory location(s)
• One of them is a write
• There is a run-time execution path from S1 to S2

Bernstein’s Condition: — There is a data dependence from
statement (instance) S1 to statement S2 (instance) if

S1: pi = 3.14
S2: R = 5
S3: Area = pi * R2

S1 S2

S3

Example:

Data Dependence Classifications

3

“S2 depends on S1” — (S1 δ S2)

True (flow) dependence
occurs when S1 writes a memory location that S2 later reads (RAW).

Anti dependence
occurs when S1 reads a memory location that S2 later writes (WAR).

Output dependence
occurs when S1 writes a memory location that S2 later writes (WAW).

Input dependence
occurs when S1 reads a memory location that S2 later reads (RAR).

Review: Dependence Testing

4

• Single loop nest with constant lower (LB) and upper (UB) bound,
and step 1.

• Two array references as affine function of loop induction variable

Single Induction Variable (SIV) Test

for i = LB, UB, 1

endfor

for i = LB, UB, 1
R1: X(a*i + c1) = …
R2: … = X(a*i + c2) 


endfor

Question: Is there a true dependence between R1 and R2?

Review: Dependence Testing

5

There is a dependence between R1 and R2 iff

∃ i, i′: LB ≤ i ≤ i′ ≤ UB and (a∗i+c1) = (a∗i′+c2)

where i and i′ represent two iterations in the iteration space. This
means that in both iterations, the same element of array X is accessed.

There is a dependence iff • ∆d is an integer value
• UB – LB ≥ ∆d ≥ 0

So let’s just solve the equation:

(a ∗ i + c1) = (a ∗ i′ + c2) (c1 – c2)/a = i′ − i = ∆d

for i = LB, UB, 1
R1: X(a*i + c1) = …
R2: … = X(a*i + c2) 


endfor

• Examples:

Simple Dependence Testing

6

for (i = 1; i <= 100; i++) { S1: A[i] = ... S2: ...= A[i - 1] } float Z[100]; for (i =0; i < 12; i++) { S: Z[ i+10 ] = Z[i]; } 1. Is there dependence? 2. If so, what type of dependence? 3. From which statement (instance) to which statement (instance)? • Examples: Simple Dependence Testing 7 for (i = 1; i <= 100; i++) { S1: A[i] = ... S2: ...= A[i - 1] } float Z[100]; for (i =0; i < 12; i++) { S: Z[ i+10 ] = Z[i]; } S1 S2 True Dependence (read after write): Wt: A[i] in S1 → Rd: A[i’-1] in S2 S True Dependence (read after write): 
 Wt: Z[i+10] in S → Rd: Z[i’] in S i’ = i +1 ∆d = 1 i’ = i +10 ∆d = 10 • More Examples: Simple Dependence Testing 8 1. Is there dependence? 2. If so, what type of dependence? 3. From which statement (instance) to which statement (instance)? for (i = 1; i <= 100; i++) { R1: X(i) = ... R2: ... = X(i + 2) 
 } for (i = 3; i <= 15, i++) { S1: X(2 * i) = ... S2: ... = X(2 * i - 1) 
 } • More Examples: Simple Dependence Testing 9 for (i = 1; i <= 100; i++) { R1: X[i] = ... R2: ... = X[i + 2] 
 } for (i = 3; i <= 15, i++) { S1: X[2 * i] = ... S2: ... = X[2 * i - 1] 
 } R1 R2 Anti Dependence (write after read): 
 Rd: X[i+2] in R2 → Wt: X[i’] in R1 S1 S2 No dependence! Review: Automatic Parallelization 10 We will use loop analysis as an example to describe automatic dependence analysis and parallelization. Assumptions: 1. We only have scalar and subscripted variables (no pointers and no control dependence) for loop dependence analysis. 2. We focus on affine loops: both loop bounds and memory references are affine functions of loop induction variables. A function f(x1, x2, ..., xn) is affine if it is in such a form: f = c0 + c1*x1 + c2*x2 + ... + cn*xn, where ci are all constants Review: Affine Loops 11 Three spaces • Iteration space ‣ The set of dynamic execution instances ‣ i.e. the set of value vectors taken by loop indices ‣ A k-dimensional space for a k-level loop nest • Data space ‣ The set of array elements accessed ‣ An n-dimensional space for an n-dimensional array • Processor space ‣ The set of processors in the system ‣ In analysis, we may pretend there are unbounded # of virtual processors Iteration Space 12 • Example i <= j <= 7 i <= 5 0 <= i i j for (i=0; i<=5; i++) for (j=i; j<=7; j++) Z[j, i] = 0; 0 <= i <= 5 i <= j <= 7 Lexicographical Order 13 j 1 <= i <= 5 1 <= j <= 6 - i i for (i = 1; i <= 5; i++) for (j = 1; j <= 6 - i; j++) Z[j, i] = 0; • Order of sequential loop executions • Sweeping through the space in an ascending lexicographic order: 
 
 (i, j) <= (i', j') iff one of the two conditions is satisfied 
 1. i <= i' 
 2. i = i' & j <= j' Dependence Test 14 Given do i1 = L1,U1 ... do in = Ln,Un
 S1 : A[ f1( i1, …, in), …, fm(i1,…, in) ] = ... S2 : … = A[ g1(i1, …, in), …, gm(i1, …, in) ] A dependence between statement (instance) S1 and S2, denoted S1 δ S2, indicates that the S1 instance, the source, must be executed before S2 instance, the sink on some iteration of the loop nest. Let α & β be a vector of n integers within the ranges of the lower and upper bounds of the n loops. Does ∃ α, β in the loop iteration space, s.t. 
 fk(α) =gk (β) ∀k,1 ≤ k ≤ m? Dependence Test 15 Given do i1 = L1,U1 ... do in = Ln,Un
 S1 : A[ f1( i1, …, in), …, fm(i1,…, in) ] = ... S2 : … = A[ g1(i1, …, in), …, gm(i1, …, in) ] For X[i,j]: f1(i,j) = i, f2(i,j) = j; For X[i,j-1]: g1(i,j) = i, g2(i,j) = j - 1; for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } Example: consider the two memory references X[i, j] and X[i, j-1] Dependence Test as Integer Linear Programming Problem 16 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } Does ∃ α, β in the loop iteration space, s.t. 
 fk(α) =gk (β) ∀k,1 ≤ k ≤ m? α: (i1, j1) β: (i2, j2) Consider the two memory references: S1(α): X[i1, j1], S2(β): X[i2, j2-1] i1 = i2 j1 = j2 -1 If there is dependence, then (i1 , j1 ): 1<= i1 <=100, 1<=j1 <= 100, (i2 , j2 ): 1<= i2 <=100, 1<=j2<=100, And Do such (i1, j1), (i2, j2) exist? Dependence Test as Integer Linear Programming Problem 17 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } Consider the two memory references: S1(α): X[i1, j1], S2(β): X[i2, j2-1] Does there exist a solution to this integer linear programming (ILP) problem? i1 = i2 j1 = j2 -1 1<= i1 <=100 1<= j1 <= 100 1<= i2 <=100 1<= j2<=100 Does ∃ α, β in the loop iteration space, s.t. 
 fk(α) =gk (β) ∀k,1 ≤ k ≤ m? α: (i1, j1) β: (i2, j2) Do such (i1, j1), (i2, j2) exist? access the same memory location loop bounds constraint Back to this Example 18 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i, j] = X[i,j] + Y[i-1, j]; S2: Y[i, j] = Y[i,j] + X[i, j-1]; } S1 S2 True Dependence 
 (RAW) 
 Wt: X[i, j] in S1 → Rd: X[i', j'-1] in S2 True Dependence 
 (RAW) 
 Wt: Y[i, j] in S2 
 → Rd: Y[i'-1, j'] in S1 Dependence in the “i” loop Dependence in the “j” loop i1 = i2 j1 = j2 -1 1<= i1 <=100 1<= j1 <= 100 1<= i2 <=100 1<= j2<=100 Access the same memory location Loop bounds constraints (Only showing the ILP problem for the dependence marked in red.) Dependence and Parallelization 19 1 2 3 4 5 6 S1 S2 i • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially Dependence from S2(1,1) to S1(2,1) for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 20 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 Dependence and Parallelization Dependence from S2(1,1) to S1(2,1) for Y[ , ] memory reference 1 2 3 4 5 6 i 21 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 Dependence and Parallelization • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially 1 2 3 4 5 6 i Dependence from S2(1,1) to S1(2,1) for Y[ , ] memory reference 22 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 Dependence and Parallelization • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially Dependence from S1(1,1) to S2(1,2) for X[ , ] memory reference 1 2 3 4 5 6 i 23 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 Dependence and Parallelization • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially 1 2 3 4 5 6 i Dependence from S1(1,1) to S2(1,2) for X[ , ] memory reference 24 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 do i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Write: S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially Dependence and Parallelization S1Dependence from S1(1,1) to S1(1,2) Which loop can be parallelized? The “i” loop or the “j” loop? i 25 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 doall i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Write: S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially Dependence and Parallelization S1Dependence from S1(1,1) to S1(1,2) Which loop can be parallelized? The “i” loop or the “j” loop? iAnswer: the “i” loop doall loop means all iterations in the loop can run in parallel 26 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 do i = 1, N do j = 1, N S1: A[i, j] = A[i - 1, j - 1] Can either the “i” loop or 
 the “j” loop be parallelized? (assuming no synchronization is allowed) Write in S1(i, j) to Read S1(i+1, j+1) Write in S1(1,1) to Read in S1(2,2) • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially Dependence and Parallelization Dependence from S1(1, 1) to S1(2, 2) The hyperplane is j - i = “a constant” 27 do I = 1, N do J = 1, N S1: A[I, J] = A[I-1, J+1] Dependence and Parallelization • Dependence in affine loops modeled as a hyperplane • Iterations along the same hyperplane must execute sequentially • Iterations on different hyperplanes can execute in parallel j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 Dependence from S1(1, 2) to S1(2, 1) Write in S1(1,2) to Read in S1(2,1) Write in S1(i, j) to Read in S1(i-1,j+1) The hyperplane is j + i = “a constant” Distance Vector 28 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 do I = 1, N do J = 1, N S1: A(I, J) = A(I+1, J-1) Read in S1(1,2) to Write in S1(2,1) S1(i, j) to S1(i+1, j-1) The number of iterations between two accesses to the same memory location, usually represented as a distance vector. Distance vector from read to write: (1, -1) Write After Read Processing Space: Affine Partition Schedule 29 Notation: bold fonts for container variables; normal fonts for scalar variables. for (i=1; i<=N; i++) Y[i] = Z[i]; C = [1], c = [0], p = i 1-d processor grid • to represent a partition
• C is a n by m matrix

• m = d (the loop level)
• n is the dimension of the processor grid

• c is a n-element constant vector
• p = C*i + c

• Examples

C =

for (i=1; i<=N; i++) for (j=1; j<=N; j++) Y[i,j] = Z[i,j]; 2-d processor grid [ ]1 00 1 p = i, q = j Synchronization-free Parallelism 30 • Two memory references as and
• Let and represent their respective processor

schedule
• To be synchronization-free
‣ For all i1 in Zd1 (d1-dimension integer vectors) and i2 in Zd2 such

that
1. B1*i1 + b1 >= 0, and
2. B2*i2 + b2 >= 0, and
3. F1*i1 + f1 = F2*i2 + f2, and
4. It must be the case that C1*i1 + c1 = C2*i2 + c2.

F1, f1 is for memory reference, i.e., F1*x + f1
B1, b1 is for loop bound constraints, i.e., B1*x + b1

Synchronization-free Parallelism

31

for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 S1 S2 i • To be synchronization-free - For all i1 in Zd1 (d1-dimension integer vectors) and i2 in Zd2 such that ‣ B1*i1 + b1 >= 0, and
‣ B2*i2 + b2 >= 0, and
‣ F1*i1 + f1 = F2*i2 + f2, and
‣ It must be the case that 


C1*i1 + c1 = C2*i2 + c2.

Synchronization-free Parallelism

32

for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 True, j loop, for X True, i loop, for Y 1<=i1 <=100, 1<=j1 <= 100, 1<=i2 <=100, 1<=j2<=100, i1 = i2, j1 = j2 -1, [C11 C12 ] i1 j1 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c1] = [C21 C22 ] i2 j2 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c2 ] [C11 − C21 C12 − C22 ] i1 j1 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c1 − c2 − C22 ] = 0 S1 to S2 dependence 1<=i3 <=100, 1<=j3 <= 100, 1<=i4 <=100, 1<=j4<=100, i3 -1 = i4, j3 = j4, [C11 C12 ] i3 j3 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c1] = [C21 C22 ] i4 j4 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c2 ] [C11 − C21 C12 − C22 ] i3 j3 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c1 − c2 + C21] = 0 S2 to S1 dependence Synchronization-free Parallelism 33 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 True, j loop, for X True, i loop, for Y [C11 − C21 C12 − C22 ] i1 j1 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c1 − c2 − C22 ] = 0 [C11 − C21 C12 − C22 ] i3 j3 ⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + [c1 − c2 + C21] = 0 C11-C21 =0, C12-C22=0, & c1-c2-C22=0 C11-C21 =0, C12-C22=0, & c1-c2+C21=0 C11 = C21 = -C22= -C12 = c2-c1 Solution 34 Affine schedule for S1, p(S1): [C11 C12] = [1 -1], c1 = -1 i.e. (i,j) iteration of S1 to processor p = i-j-1; Affine schedule for S2, p(S2) [C21 C22 ]= [1 -1], c2 =0 i.e. (i,j) iteration of S2 to processor p = i-j. for (i=1; i<=100; i++) for (j=1; j<=100; j++){ X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } p(S1): < [C11 C12], [c1]>
p(S2): < [C21 C22], [c2]>

j = 4

j = 3

j = 2

j = 1

1 2 3 4 5 6
S1
S2i

p:
2

p:
3
p:
4

p:
5

p:
-2

p:
-1

p:
0

p:
1

p:
-3

p:
-4

C11 = C21 = -C22= -C12 = c2-c1

p:
-1

More Examples

35

j = 4

j = 3

j = 2

j = 1

1 2 3 4 5 6

do I = 1, N
do J = 1, N
S1: A[I, J] = A[I-1, J – 1]

Affine partition schedule

Affine schedule for S1, p(S1): C= [C11 C12] = [1 -1], c= 0
i.e. (i, j) iteration of S1 to processor p = i-j ;

The hyperplane is j – i = “a constant”

Read After Write

p:
0

p:
1

p:
2

p:
-2

p:
-3

p:
3

p:
4

p:
5

do I = 1, N
do J = 1, N
S1: A[I, J] = A[I+1, J-1]

More Examples

36

j = 4

j = 3

j = 2

j = 1

1 2 3 4 5 6

Read in S1(1,2) to Write in S1(2,1)

S1(i, j) to S1(i+1, j-1)

Write After Read

Affine schedule for S1, p(S1): C= [C11 C12] = [1 1], c= 0
i.e. (i, j) iteration of S1 to processor p = i + j ;

Affine partition schedule

The hyperplane is j + i = “a constant”

p:
8

p:
9

p:
10

p:
11

p:
12

p:
13

Next Class

37

Reading
• ALSU, Chapter 11.1 – 11.7