程序代写代做代考 chain lec20

lec20

CS 314 Principles of Programming Languages

Prof. Zheng Zhang
Rutgers University

Lecture 20: Parallelism and Dependence Analysis

November 16, 2018

Class Information

2

• Homework 7 is released.
• Project 2 deadline will be extended to 11/25 Sunday.

Review: Dependence Test

3

Given
do i1 = L1,U1


do in = Ln,Un

S1 : A[ f1( i1, …, in), …, fm(i1,…, in) ] = …
S2 : … = A[ g1(i1, …, in), …, gm(i1, …, in) ]

Let α & β be a vector of n integers within the ranges of the lower
and upper bounds of the n loops.

Does ∃ α, β in the loop iteration space, s.t. 

fk(α) =gk (β) ∀k,1 ≤ k ≤ m?

Integer Linear Programming (ILP) for Dependence Test

4

for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } Does ∃ α, β in the loop iteration space, s.t. 
 fk(α) =gk (β) ∀k,1 ≤ k ≤ m? Consider the two memory references: S1(α): X[i1, j1], S2(β): X[i2, j2-1] α: (i1, j1) β: (i2, j2) Does there exist a solution to this integer linear programming (ILP) problem? i1 = i2 j1 = j2 -1 1<= i1 <=100 1<= j1 <= 100 1<= i2 <=100 1<= j2<=100 Access the same memory location Loop bounds constraint Integer Linear Programming (ILP) for Dependence Test 5 i1 = i2 j1 = j2 -1 1<= i1 <=100 1<= j1 <= 100 1<= i2 <=100 1<= j2<=100 f1 =[ ]1 00 1F1 = [ ]0 0 [ ]1 00 1F2 = f2 = [ ]0 -1 Memory reference X[i1, j1] Memory reference X[i2, j2-1] [ ]1 00 1 [ ]i1
j1 [ ]0 0 + [ ]i1
j1 = [ ]1 00 1 [ ]i2
j2 [ ]0 -1 + [ ]i1
j2 - 1 = If we use the matrix vector notation and for two references at two iterations α: (i1, j1) and β: (i2, j2)

[ ]i2
j2 +F2 f2[ ]i1
j1 +F1 f1 =
[ ]i1
j1 [ ]i2
j2 – 1 =

i1 = i2
j1 = j2 -1
1<= i1 <=100 1<= j1 <= 100 1<= i2 <=100 1<= j2<=100 Integer Linear Programming (ILP) for Dependence Test 6 If we use the matrix vector notation and for two references at two iterations α: (i1, j1) and β: (i2, j2)

Loop bounds for (i1, j1)

[ ]i1
j1 +B1 b1 >= 0
[ ]i2
j2 +B2 b2 >= 0

B1 = b1 =
1 0
0 1
-1 0
0 -1

-1
-1

100
100

[ ]i1
j1 + >=
1 0
0 1
-1 0
0 -1

-1
-1

100
100

0
0
0
0

Loop bounds for (i2, j2)

B2 = b2 =
1 0
0 1
-1 0
0 -1

-1
-1

100
100

i1 = i2
j1 = j2 -1
1<= i1 <=100 1<= j1 <= 100 1<= i2 <=100 1<= j2<=100 Putting Everything Together 7 If we use the matrix vector notation and for two references at two iterations α: (i1, j1) and β: (i2, j2)

[ ]1 00 1F2 = f2 = [ ]0 -1 f1 =[ ]1 00 1F1 = [ ]0 0
B1 , b1 , B2 , b2 see previous slides

[ ]i2
j2 +F2 f2[ ]i1
j1 +F1 f1 =
[ ]i1
j1 +B1 b1 >= 0
[ ]i2
j2 +B2 b2 >= 0

Review: Parallelizing Affine Loops

8

Three spaces
• Iteration space

– The set of dynamic execution instances 

For instance, the set of value vectors taken by loop indices

– A k-dimensional space for a k-level loop nest
• Data space

– The set of array elements accessed
– An n-dimensional space for an n-dimensional array

• Processor space
– The set of processors in the system
– In analysis, we may pretend there are unbounded # of virtual processors

Affine Half Space

9

Definition
An affine half-space of Zd is defined as the set of points

{ x⃗ ∈ Zd | a⃗ * x⃗ <= b } a⃗ * x⃗ = b is the hyperplane that divides the d-dimensional space into two half-spaces. Iteration Space 10 • Bounded by multiple hyperplanes i <= j <= 7 i <= 5 0 <= i i j for (i=0; i<=5; i++) for (j=i; j<=7; j++) Z[j, i] = 0; 0 <= i <= 5 i <= j <= 7 Three Spaces 11 • Iteration space, data space, and processor space 0 1 9 10 11 19 20 0 1 9 … … … …Iteration space 0 1 9 …Processor space float Z[100]; for (i=0; i<10; i++) S: Z[i+10] = Z[i]; Data Space Z[i] & Z[i+10] S(i) P(i) = i Synchronization-free Parallelism 12 Parallelize an application without allowing any communication or synchronization among (logical) processors. j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 Write in S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) S1Example: i do i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Dependence from S1 to S1 Synchronization-free Parallelism 13 Parallelize an application without allowing any communication or synchronization among (logical) processors. j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 do i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Write in S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) S1Example: i P1 P2 P3 P4 P5 P6 Communication is limited to the iterations on one processor. Dependence from S1 to S1 Synchronization-free Parallelism 14 Parallelize an application without allowing any communication or synchronization among (logical) processors. j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 Write in S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) S1 Which loop can be parallelized? The “i” loop or the “j” loop? Answer: the “i” loop Example: i P1 P2 P3 P4 P5 P6 do i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Synchronization-free Parallelism 15 Parallelize an application without allowing any communication or synchronization among (logical) processors. j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 doall i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Write in S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) S1 The dependence chain is characterized by a hyperplane. In this case it is “i = constant”. Example 1: i P1 P2 P3 P4 P5 P6 Dependence from S1(1,1) to S2(1,2) Synchronization-free Parallelism 16 Parallelize an application without allowing any communication or synchronization among (logical) processors. The dependence chain is characterized by a hyperplane. In this case it is “j - i + constant = 0”. Example 2: j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 P2 P3 P4 P5 P6 P7 P8 P9 P1 do i = 1, N do j = 1, N S1: A[i, j] = A[i - 1, j - 1] Dependence from S1(i,j) to S1(i+1,j+1) 17 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 Dependence from S1(1,1) to S2(1,2) 1 2 3 4 5 6 i P3 Dependence from S2(1,1) to S1(2,1) Parallelize an application without allowing any communication or synchronization among (logical) processors. Synchronization-free Parallelism Example 3: 18 Dependence and Parallelization • Dependence chain in affine loops modeled as a hyperplane. • Iterations along the same hyperplane must execute sequentially. • Iterations on different hyperplanes can execute in parallel. Example: j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 P2 P3 P4 P5 P6 P7 P8 P9 P1 do i = 1, N do j = 1, N S1: A[i, j] = A[i - 1, j - 1] Dependence from S1(i,j) to S1(i+1,j+1) The hyperplane is j - i + constant = 0 . Processing Space: Affine Partition Schedule • Map an iteration to a processor using < C, d >
C is a n by m matrix

• m = d (the loop level)
• n is the dimension of the processor grid

d⃗ is a n-element constant vector
p⃗ = C x⃗ + d⃗, where x⃗ is an iteration vector

• Map an iteration to a processor using < C, d >
p⃗ = C x⃗ + d⃗, where x⃗ is an iteration vector
• Example

Processing Space: Affine Partition Schedule

for (i=1; i<=N; i++) S: Y[i] = Z[i]; C = [1], d = [0] p⃗( S(i) ) = 1*i + 0 = i Map iteration i to Processor i Synchronization-free Parallelism 21 • Two memory references as and
such that at iteration α: (i1, j1) depends on 

at iteration β: (i2, j2)

– F1 is a matrix and f1 is a vector. 

The affine memory access index is F1* α + f1 .

– B1 is a matrix and b1 is a vector. 

The affine loop bounds can be expressed as B1 * α + b1 >= 0

• Let and represent the respective processor
schedule, to have synchronization-free parallelism,

C1 * α + d1 = C2* β + d2

These two memory references must execute on the same processor
(sequentially).

Synchronization-free Parallelism

22

• Two memory references as and
such that at iteration (i1, j1) depends on 

at iteration (i2, j2)

for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } [ ]i2
j2 +F2 f2[ ]i1
j1 +F1 f1 = [ ]i1
j1 +B1 b1 >= 0
[ ]i2
j2 +B2 b2 >= 0• We want to find processor

schedule and
such that

[ C11 C12 ] [ ]i1
j1 + [ d1 ] [ C21 C22 ] [ ]i2
j2 + [ d2 ] =

Synchronization-free Parallelism

23

for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 True, j loop, for X True, i loop, for Y S1 to S2 dependence 1<=i1 <=100, 1<=j1 <= 100, 1<=i2 <=100, 1<=j2<=100, i1 = i2, j1 = j2 -1, [C11 C12] [ ]i1
j1 +[d1] = [C11 C12] [ ]i2
j2 +[d2] [C11 - C21, C12 - C22] [ ]i1
j1 +[d1 - d2 - C22] = 0 C11 - C21 = 0 C12 - C22 = 0 d1 - d2 - C22 = 0 Synchronization-free Parallelism 24 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 True, j loop, for X True, i loop, for Y 1<=i3 <=100, 1<=j3 <= 100, 1<=i4 <=100, 1<=j3<=100, i3 -1 = i4, j3 = j4, S2 to S1 dependence [C11 C12] [ ]i3
j3 +[d1] = [C11 C12] [ ]i4
j4 +[d2] [C11 - C21, C12 - C22] [ ]i3
j3 +[d1 - d2 + C21] = 0 C11 - C21 = 0 C12 - C22 = 0 d1 - d2 + C21 = 0 Synchronization-free Parallelism 25 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 True, j loop, for X True, i loop, for Y C11 - C21 = 0 C12 - C22 = 0 d1 - d2 + C21 = 0 S2 to S1 dependence C11 - C21 = 0 C12 - C22 = 0 d1 - d2 - C22 = 0 S1 to S2 dependence C11 = C21 = -C22= -C12 = d2 -d1 Synchronization-free Parallelism 26 One Potential Solution: Affine schedule for S1, p(S1): [C11 C12] = [1 -1], d1 = -1 i.e. (i, j) iteration of S1 to processor p = i - j - 1; Affine schedule for S2, p(S2): [C21 C22 ] = [1 -1], d2 = 0 i.e. (i, j) iteration of S2 to processor p = i - j. for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } Example: -2 j = 4 j = 3 j = 2 j = 1 S1 S2 -1 0 1 p: 2 p: 3 p: 4 p: 5 -3 -4p: i 1 2 3 4 5 6 C11 = C21 = -C22= -C12 = d2 -d1 Code Generation 27 •Step 1: find processor ID ranges • S1: -4 <= p <= 4 • S2: -3 <= p <= 5 •Union: -4 <= p <= 5 •Step 2: generate code S1(i, j): processor p = i-j-1; S2(i, j): processor p = i-j. for (i=1; i<=6; i++) for (j=1; j<=4; j++){ X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Naive Code Generation 28 • Idle iterations • Lots of tests in loop body What are the issues with this code? forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Remove Idle Iterations 29 For example, when p=-4, only 1 of the 24 iterations has useful operations, i = 1, j =4 . Some iterations have idle operations forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Make Loop Bounds Tighter 30 -4 <= p <= 5 1<= i <= 6 1<= j <= 4 i-p-1=j j: i-p-1<= j <= i-p-1 1<=j <= 4 i: p+2<=i <= p+5 1<=i <= 6 p: -4<= p<= 4 Fourier-Motzkin Elimination S1 -4 <= p <= 5 1<= i <= 6 1<= j <= 4 i-p=j j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3<= p <= 5 S2 Eliminate j Eliminate i j = i - p - 1 1 <= i - p - 1 <= 4 p + 1 + 1 <= i <= 4 + p + 1 p + 2 <= i <= p + 5 Make Loop Bounds Tighter 31 j: i-p-1<= j <= i-p-1 1<=j <= 4 i: p+2<=i <= p+5 1<=i <= 6 p: -4<= p<= 4 j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3<= p <= 5 S1 S2 Union result: j: i-p-1<=j <= i-p 1<=j <=4 i: p+1<=i <= 5+p 1<=i <=6 p: -4<= p <= 5 Make Loop Bounds Tighter 32 Union result: j: i-p-1<=j <= i-p 1<=j <=4 i: p+1<=i <= 5+p 1<=i <=6 p: -4<= p <= 5 forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Make Loop Bounds Tighter 33 for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4, i-p); j++) { if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } S1 S2 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 -2 When p = -2, i: [1, 3] j: [i+1, min(4, i+2)] Remove Tests 34 for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) j=i-p-1; X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ j=i-p; Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } ? S1 S2 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 -2 Next Class 35 Reading • ALSU, Chapter 11.1 - 11.7