程序代写代做代考 lec21.key

lec21.key

CS 314 Principles of Programming Languages

Prof. Zheng Zhang
Rutgers University

Lecture 20: Parallelism and Dependence Analysis

November 21, 2018

Class Information

2

• Project 2 deadline is extended to 11/25 Sunday.
• Midterm grades will be released immediately after Thanksgiving break.

Review: Parallelizing Affine Loops

3

Three spaces
• Iteration space

– The set of dynamic execution instances 

For instance, the set of value vectors taken by loop indices

– A k-dimensional space for a k-level loop nest
• Data space

– The set of array elements accessed
– An n-dimensional space for an n-dimensional array

• Processor space
– The set of processors in the system
– In analysis, we may pretend there are unbounded # of virtual processors

Three Spaces

4

• Iteration space, data space, and processor space

0 1 9 10 11 19 20

0 1 9

… … …

…Iteration space

0 1 9
…Processor space

float Z[100];
for (i=0; i<10; i++) S: Z[i+10] = Z[i]; Data Space Z[i] & Z[i+10] S(i) P(i) = i Synchronization-free Parallelism 5 Parallelize an application without allowing any communication or synchronization among (logical) processors. j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 do i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] Write in S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) S1Example: i P1 P2 P3 P4 P5 P6 Communication is limited to the iterations within one processor. Dependence from S1 to S1 Synchronization-free Parallelism 6 Parallelize an application without allowing any communication or synchronization among (logical) processors. j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 Write in S1(i, j) to Read in S1(i, j+1) Write in S1(1,1) to Read in S1(1,2) S1 Which loop can be parallelized? The “i” loop or the “j” loop? Answer: the “i” loop Example 1: i P1 P2 P3 P4 P5 P6 do i = 1, N do j = 1, N S1: A[i, j] = A[i, j - 1] 7 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } S1 S2 S1 S2 True, j loop, for X True, i loop, for Y j = 4 j = 3 j = 2 j = 1 Dependence from S1(1,1) to S2(1,2) 1 2 3 4 5 6 i P3 Dependence from S2(1,1) to S1(2,1) Parallelize an application without allowing any communication or synchronization among (logical) processors. Synchronization-free Parallelism Example 2: Review — Processing Space: Affine Partition Schedule • Map an iteration to a processor using < C, d >
C is a n by m matrix

• m = d (the loop level)
• n is the dimension of the processor grid

d⃗ is a n-element constant vector
p⃗ = C x⃗ + d⃗, where x⃗ is an iteration vector

• Map an iteration to a processor using < C, d >
p⃗ = C x⃗ + d⃗, where x⃗ is an iteration vector
• Example

Review: Processing Space: Affine Partition Schedule

for (i=1; i<=N; i++) S: Y[i] = Z[i]; C = [1], d = [0] p⃗( S(i) ) = 1*i + 0 = i Map iteration i to Processor i Review: Synchronization-free Parallelism 10 One Potential Solution: Affine schedule for S1, p(S1): [C11 C12] = [1 -1], d1 = -1 i.e. (i, j) iteration of S1 to processor p = i - j - 1; Affine schedule for S2, p(S2): [C21 C22 ] = [1 -1], d2 = 0 i.e. (i, j) iteration of S2 to processor p = i - j. for (i=1; i<=100; i++) for (j=1; j<=100; j++){ S1: X[i,j] = X[i,j] + Y[i-1, j]; S2: Y[i,j] = Y[i,j] + X[i, j-1]; } Example: -2 j = 4 j = 3 j = 2 j = 1 S1 S2 -1 0 1 p: 2 p: 3 p: 4 p: 5 -3 -4p: i 1 2 3 4 5 6 C11 = C21 = -C22= -C12 = d2 -d1 Code Generation 11 •Step 1: find processor ID ranges • S1: -4 <= p <= 4 • S2: -3 <= p <= 5 •Union: -4 <= p <= 5 •Step 2: generate code S1(i, j): processor p = i-j-1; S2(i, j): processor p = i-j. for (i=1; i<=6; i++) for (j=1; j<=4; j++){ X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Naive Code Generation 12 • Wider than necessary loop bounds • Redundant tests in loop body What are the issues with this code? forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Remove Idle Iterations 13 For example, when p=-4, only 1 of the 24 iterations has useful operations, i = 1, j =4 . Loop bounds are wider than they should have been forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Make Loop Bounds Tighter 14 -4 <= p <= 5 1<= i <= 6 1<= j <= 4 i-p-1=j j: i-p-1<= j <= i-p-1 1<=j <= 4 i: p+2<=i <= p+5 1<=i <= 6 Fourier-Motzkin Elimination S1 Eliminate j Fourizer Motzkin Elimination 15 Eliminating variable z in inequality systems • Match each of the lower bounds on z with each of its upper bounds • Equivalent to projecting a polyhedron into reduced dimension space -4 <= p <= 5 1<= i <= 6 1<= j <= 4 i-p-1=j Suppose we want to eliminate j: i-p-1<= j <= i-p-1 1<= j <= 4 i-p-1 <= i-p-1 i-p-1 <= 4 1 <= i-p-1 1 <= 4 p+2 <= i <= p+5 Make Loop Bounds Tighter 16 -4 <= p <= 5 1<= i <= 6 1<= j <= 4 i-p-1=j j: i-p-1<= j <= i-p-1 1<=j <= 4 i: p+2<=i <= p+5 1<=i <= 6 p: -4<= p<= 4 Fourier-Motzkin Elimination S1 -4 <= p <= 5 1<= i <= 6 1<= j <= 4 i-p=j j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3<= p <= 5 S2 Eliminate j Eliminate i Make Loop Bounds Tighter 17 j: i-p-1<= j <= i-p-1 1<=j <= 4 i: p+2<=i <= p+5 1<=i <= 6 p: -4<= p<= 4 j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3<= p <= 5 S1 S2 Union result: j: i-p-1<=j <= i-p 1<=j <=4 i: p+1<=i <= 5+p 1<=i <=6 p: -4<= p <= 5 Make Loop Bounds Tighter 18 Union result: j: i-p-1<=j <= i-p 1<=j <=4 i: p+1<=i <= 5+p 1<=i <=6 p: -4<= p <= 5 forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } Make Loop Bounds Tighter 19 for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4, i-p); j++) { if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } S1 S2 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 -2 When p = -2, i: [1, 3] Remove Tests 20 for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } for (p=-4; p<=5; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) j=i-p-1; X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ j=i-p; Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } ? S1 S2 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 -2 Remove Tests 21 • Reason for the tests • The iteration spaces of statements intersect but do not completely overlap • Solution • Split the iteration space at the boundaries of overlapping polyhedra. • Generate code for each of the subspaces. Remove Tests 22 /*space 2*/ for (p=-3; p<=4; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } /*space 1*/ p=-4; i=1; j=4; X[i,j]=X[i,j]+Y[i-1,j]; /*S1*/ /*space 3*/ p=5; i=6; j=1; Y[i,j] = X[i,j-1] + Y[i,j]; /*S2*/ j: i-p-1<= j <= i-p-1 1<=j <=4 i: p+2<=i <=5+p 1<=i <= 6 p: -4 <= p <= 4 j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3 <= p <= 5 S1: S2: Split on “p”: subspace 1: p = -4; subspace 2: -3 <= p <= 4; subspace 3: p = 5; Remove Tests 23 /*space 2*/ for (p=-3; p<=4; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } j: i-p-1<= j <= i-p-1 1<=j <=4 i: p+2<=i <=5+p 1<=i <= 6 p: -4 <= p <= 4 j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3 <= p <= 5 S1: S2: /*space 3*/ p=5; i=6; j=1; Y[i,j] = X[i,j-1] + Y[i,j]; /*S2*/ /*space 1*/ p=-4; i=1; j=4; X[i,j]=X[i,j]+Y[i-1,j]; /*S1*/ Remove Tests 24 Split on “i”: subspace 2a: max(1, p+1)<= i < max(1, p+2); only S2; subspace 2b: max(1, p+2) <= i <= min(6, 4+p); both S1 and S2; subspace 2c: min(6, 4+p) < i <= min(5+p, 6); only S1; j: i-p-1<= j <= i-p-1 1<=j <=4 i: p+2<=i <=5+p 1<=i <= 6 p: -4 <= p <= 4 j: i-p<=j <= i-p 1<=j <=4 i: p+1<=i <= 4+p 1<=i <=6 p: -3 <= p <= 5 S1: S2: p>=-3; p<=4 Remove Tests 25 /*space 2*/ for (p=-3; p<=4; p++) for (i=max(1,p+1); i<=min(6,5+p); i++) for (j=max(1,i-p-1); j<=min(4,i-p); j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } /*space 2*/ for (p=-3; p<=4; p++){ /*space 2a*/ if (p>=0){
i = p+1; j = 1;
Y[i,j] = Y[i,j] + X[i, j-1];/* S2 */}
/*space 2b*/
for (i=max(1, p+2); i<=min(6, 4 +p); i++) { j=i-p-1; X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ j=i-p Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } /*space 2c*/ if (p<=1){ i=5+p; j=5; X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */} } Split on “i”: subspace 2a: max(1,p+1)<= i < max(1, p+2); subspace 2b: max(1,p+2) <= i <= min(6, 4+p); subspace 2c: min(6, 4+p) < i <= min(5+p, 6). Remove Tests 26 Split on “i”: subspace 2a: max(1,p+1)<= i < max(1, p+2); subspace 2b: max(1,p+2) <= i <= min(6, 4+p); subspace 2c: min(6, 4+p) < i <= min(5+p, 6). /*space 2*/ for (p=-3; p<=4; p++){ /*space 2a*/ if (p>=0){
i = p+1; j = 1;
Y[i,j] = Y[i,j] + X[i, j-1];/* S2 */}
/*space 2b*/
for (i=max(1, p+2); i<=min(6, 4 +p); i++) { j=i-p-1; X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ j=i-p Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } /*space 2c*/ if (p<=1){ i=5+p; j=5; X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */} } S1 S2 j = 4 j = 3 j = 2 j = 1 1 2 3 4 5 6 -2 -1 0 1 2 3 4 5 Code Generation and Optimization Summary 27 for (i=1; i<=100; i++) for (j=1; j<=100; j++){ X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } forall (p=-4; p<=5; p++) for (i=1; i<=6; i++) for (j=1; j<=4; j++){ if (p== i-j-1) X[i,j] = X[i,j] + Y[i-1, j]; /* S1 */ if (p== i-j) Y[i,j] = Y[i,j] + X[i, j-1]; /* S2 */ } /*space 1*/ if ( p == -4 ) X[1,4]=X[1,4]+Y[0,4]; /*S1*/ /*space 2*/ for (p=-3; p<=4; p++){ /*space 2a*/ if (p>0)
Y[p+1,1] = Y[p+1,1] + X[p+1, 0];/* S2 */
/*space 2b*/
for (i=max(1,p+2); i<=min(6,4+p); i++) X[i,i-p-1] = X[i,i-p-1] + Y[i-1, i-p-1]; /* S1 */ Y[i,i-p] = Y[i,i-p] + X[i, i-p-1]; /* S2 */ } /*space 2c*/ if (p<=-1) X[5+p,5] = X[5+p,5] + Y[4+p, 5]; /* S1 */ } /*space 3*/ if (p = =5) Y[6,1] = X[6,0] + Y[6,1]; /*S2*/ Next Class 28 Reading • Scott, Chapter 7.2; ALSU Chapter 6.5