CS 314 Principles of Programming Languages
Prof. Zheng Zhang
Rutgers University
Lecture 18: Parallelism and Dependence Analysis
November 9, 2018
Class Information
• Project 2 is released.
• Homework 7 will be released this weekend.
Programming with Concurrency
• A PROCESS or THREAD is a potentially-active execution context
• The classic von Neumann model of computing has a single thread of control; parallel programs have more than one
• A process or thread can be thought of as an abstraction of a physical PROCESSOR
• Processes/Threads can come from
‣ Multiple CPUs
‣ Kernel-level multiplexing of single physical machine
‣ Language or library level multiplexing of kernel-level abstraction
• They can run
‣ In true parallel
‣ Unpredictably interleaved
‣ Run-until-block
Dependence and Parallelization
Dependence relation: all task-to-task execution orderings that must be preserved if the meaning of the program is to remain the same.
The dependence relation can be modeled as a directed graph such that if A → B, the result of task A is required for the processing of task B.
Dependence analysis is fundamental to parallelization analysis.
Example:
S1: pi = 3.14
S2: R = 5
S3: Area = pi * R^2
[Figure: statement-level dependence graph — S1 → S3 and S2 → S3]
Dependence Graph
• Directed acyclic graph (DAG)
• A node represents a task
• A directed edge represents a precedence constraint
DAG example 1:
[Figure: a DAG with nodes start, a–i, and end; the edges are the precedence constraints]
Dependence Graph
• Directed acyclic graph (DAG)
• A node represents a task
• A directed edge represents a precedence constraint
DAG example 2: S = sum(A[1], A[2], …, A[N])
[Figure: a binary reduction tree of “+” nodes; A[1] … A[N] are added pairwise, level by level, down to the single result S]
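To make the example concrete, here is a minimal C sketch (not from the lecture; the name tree_sum is made up) that performs the additions in the same pairwise, level-by-level order as the DAG:

/* Sum A[0..n-1] in reduction-tree order, in place. Each round's
   additions are independent of one another, so each tree level
   could run in parallel. */
float tree_sum(float A[], int n) {
    for (int stride = 1; stride < n; stride *= 2)   /* one tree level per round */
        for (int i = 0; i + stride < n; i += 2 * stride)
            A[i] += A[i + stride];                  /* independent adds */
    return A[0];                                    /* the result S */
}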
Scheduling a DAG
Tp: time to perform the computation with p processors
• T1: work (the total # of operations)
• T∞: critical path, or span
[Figure: the DAG from example 1, assuming each task takes 1 unit of time]
T1 = ? (the 11 unit-time tasks give T1 = 11)
T∞ = ? (computed on the next slide: T∞ = 5)
Computing Critical Path
Compute the earliest start time of each node:
• Keep a value S(n) associated with each node n
• For each node n (taken in topological order),
  S(n) is the maximum of { S(p) + 1 }, for all p ∈ pred(n)
Assuming each task takes 1 unit of time:
[Figure: the example DAG annotated step by step — S(start) = 1; S(a) = S(c) = S(d) = 2; S(b) = S(e) = S(g) = S(h) = 3; S(f) = S(i) = 4; S(end) = 5]
The largest value, S(end) = 5, is the critical path length: T∞ = 5.
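A minimal C sketch of this computation (not the lecture's code): nodes are processed in topological order with Kahn's algorithm, and the edge set below is only an assumption consistent with the S(n) values above, since the figure itself is not reproduced here.

#include <stdio.h>

enum { START, A, B, C, D, E, F, G, H, I, END, N };  /* N = 11 nodes */

int main(void) {
    /* Assumed precedence edges for the example DAG. */
    int edges[][2] = {
        {START,A}, {START,C}, {START,D},
        {A,B}, {C,E}, {D,G}, {D,H}, {B,F}, {E,I},
        {F,END}, {G,END}, {H,END}, {I,END}
    };
    int nedges = (int)(sizeof edges / sizeof edges[0]);

    int indeg[N] = {0}, S[N] = {0};
    for (int e = 0; e < nedges; e++) indeg[edges[e][1]]++;

    /* Nodes with no predecessors start at time 1. */
    int queue[N], head = 0, tail = 0;
    for (int n = 0; n < N; n++)
        if (indeg[n] == 0) { S[n] = 1; queue[tail++] = n; }

    int span = 0;
    while (head < tail) {                       /* topological order */
        int p = queue[head++];
        if (S[p] > span) span = S[p];
        for (int e = 0; e < nedges; e++) {
            if (edges[e][0] != p) continue;
            int n = edges[e][1];
            if (S[p] + 1 > S[n]) S[n] = S[p] + 1;  /* S(n) = max{S(p)+1} */
            if (--indeg[n] == 0) queue[tail++] = n;
        }
    }
    printf("T_inf (span) = %d\n", span);        /* prints 5 for this DAG */
    return 0;
}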
List Scheduling
Schedule based on whether the dependence constraints have been resolved:
• Schedule the nodes that are ready at every time tick
• An operation completed at the end of one time step can make more operations ready at the next tick
Four threads T1–T4, assuming each task takes 1 unit of time:

tick   1      2    3    4    5
T1     start  a    b    f    end
T2            c    e    i
T3            d    g
T4                 h
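A greedy list scheduler can be sketched the same way (again an assumption, not the lecture's code, reusing the illustrative edge set from the critical-path sketch): ready tasks are selected at each tick before any completions are retired, so newly enabled tasks only become ready at the next tick.

#include <stdio.h>

enum { START, A, B, C, D, E, F, G, H, I, END, N };
#define P 4  /* four threads T1..T4 */

int main(void) {
    const char *name[N] = { "start","a","b","c","d","e",
                            "f","g","h","i","end" };
    /* Assumed precedence edges, as in the critical-path sketch. */
    int edges[][2] = {
        {START,A}, {START,C}, {START,D},
        {A,B}, {C,E}, {D,G}, {D,H}, {B,F}, {E,I},
        {F,END}, {G,END}, {H,END}, {I,END}
    };
    int nedges = (int)(sizeof edges / sizeof edges[0]);

    int indeg[N] = {0}, done[N] = {0}, remaining = N, tick = 0;
    for (int e = 0; e < nedges; e++) indeg[edges[e][1]]++;

    while (remaining > 0) {
        /* Select up to P ready tasks (all predecessors done)... */
        int batch[P], nb = 0;
        for (int n = 0; n < N && nb < P; n++)
            if (!done[n] && indeg[n] == 0) batch[nb++] = n;
        /* ...then run them; retirements take effect at the next tick. */
        printf("tick %d:", ++tick);
        for (int k = 0; k < nb; k++) {
            int n = batch[k];
            printf(" %s", name[n]);
            done[n] = 1; remaining--;
            for (int e = 0; e < nedges; e++)
                if (edges[e][0] == n) indeg[edges[e][1]]--;
        }
        printf("\n");
    }
    return 0;   /* finishes in 5 ticks, matching the table above */
}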
Automatic Parallelization
We will use loop analysis as an example to describe automatic dependence analysis and parallelization.
Assumptions:
1. We only have scalar and subscripted variables (no pointers and no control dependence) for loop dependence analysis.
2. We focus on affine loops: both loop bounds and memory references are affine functions of the loop induction variables.
A function f(x1, x2, …, xn) is affine if it can be written as
f = c0 + c1*x1 + c2*x2 + … + cn*xn, where the ci are all constants.
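As a quick illustration (the arrays and function name here are made up), these are the kinds of references the affine restriction admits and rejects:

/* Illustrative affine vs. non-affine subscripts; i and j are loop
   induction variables, the arrays are placeholders. */
void examples(int i, int j, int A[], int B[], int C[], const int X[]) {
    A[2*i + 3*j + 1] = 0;  /* affine: c0 = 1, c1 = 2, c2 = 3 */
    B[i * j] = 0;          /* NOT affine: product of induction variables */
    C[X[i]] = 0;           /* NOT affine: indirect subscript-of-subscript */
}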
Affine Loops
Three spaces
• Iteration space
‣ The set of dynamic execution instances
‣ i.e. the set of value vectors taken by loop indices
‣ A k-dimensional space for a k-level loop nest
• Data space
‣ The set of array elements accessed
‣ An n-dimensional space for an n-dimensional array
• Processor space
‣ The set of processors in the system
‣ In analysis, we may pretend there are unbounded # of virtual processors
Three Spaces
• Iteration space, data space, and processor space

float Z[100];
for (i = 0; i < 10; i++)
    Z[i+10] = Z[i];

[Figure: iteration space i = 0 … 9; data space: elements 0 … 19 of array Z[]; processor space: processors 0 … 9]
Assuming one task is one loop iteration, what is the maximum parallelism?
Maximum parallelism: T1 / T∞
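Because the reads touch Z[0..9] and the writes touch Z[10..19], all ten iterations are independent, so the maximum parallelism is T1 / T∞ = 10 / 1 = 10. A sketch of exploiting it (assuming a compiler with OpenMP enabled, e.g. -fopenmp; not from the slides):

float Z[100];

void copy_up(void) {
    /* The pragma is ignored without OpenMP and the loop runs
       sequentially; with it, each iteration can go to its own
       (virtual) processor. */
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        Z[i + 10] = Z[i];   /* reads Z[0..9], writes Z[10..19] */
}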
Dependence Definition
Bernstein's Condition — there is a data dependence from statement (instance) S1 to statement (instance) S2 if
• both statements (instances) access the same memory location,
• one of them is a write, and
• there is a run-time execution path from S1 to S2.

float Z[100];
for (i = 0; i < 10; i++)
    Z[i+10] = Z[i];

No dependence across any two loop iterations!
Data Dependence Classifications
“S2 depends on S1” — (S1 δ S2)
True (flow) dependence
occurs when S1 writes a memory location that S2 later reads (RAW).
Anti dependence
occurs when S1 reads a memory location that S2 later writes (WAR).
Output dependence
occurs when S1 writes a memory location that S2 later writes (WAW).
Input dependence
occurs when S1 reads a memory location that S2 later reads (RAR).
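One assumed statement pair per class (illustrative, not the slides' examples), with S1 first and S2 second on each line:

void classes(int a[], int x) {
    int y;
    a[0] = x;  y = a[0];    /* flow (RAW): S1 writes a[0], S2 reads it  */
    y = a[1];  a[1] = x;    /* anti (WAR): S1 reads a[1], S2 writes it  */
    a[2] = x;  a[2] = y;    /* output (WAW): both statements write a[2] */
    x = a[3];  y = a[3];    /* input (RAR): both statements read a[3]   */
    (void)y;
}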
Simple Dependence Testing
• Examples:

for (i = 1; i <= 100; i++) {
S1:  A[i] = ...
S2:  ... = A[i - 1]
}

float Z[100];
for (i = 0; i < 12; i++) {
S:   Z[i+10] = Z[i];
}

1. Is there a dependence?
2. If so, what type of dependence?
3. From which statement (instance) to which statement (instance)?
Dependence Testing
Single Induction Variable (SIV) Test
• Single loop nest with constant lower bound (LB), constant upper bound (UB), and step 1:

for i = LB, UB, 1
    ...
endfor

• Two array references that are affine functions of the loop induction variable:

for i = LB, UB, 1
R1:  X(a*i + c1) = ...      \\ write
R2:  ... = X(a*i + c2) ...  \\ read
endfor

Question: is there a true dependence between R1 and R2?
Dependence Testing

for i = LB, UB, 1
R1:  X(a*i + c1) = ...      \\ write
R2:  ... = X(a*i + c2) ...  \\ read
endfor

There is a dependence between R1 and R2 iff
∃ i, i′: LB ≤ i ≤ i′ ≤ UB and (a*i + c1) = (a*i′ + c2),
where i and i′ represent two iterations in the iteration space. This means that both iterations access the same element of array X.
So let's just solve the equation:
(a*i + c1) = (a*i′ + c2)  ⟹  (c1 − c2)/a = i′ − i = ∆d
There is a dependence iff
• ∆d is an integer value
• UB − LB ≥ ∆d ≥ 0
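A minimal C sketch of this test (an assumed helper, not the lecture's code). It also accepts ∆d < 0, which the slide's condition excludes because it orders i ≤ i′: a negative ∆d means the dependence runs in the opposite direction (anti rather than true).

#include <stdio.h>

/* delta = i' - i = (c1 - c2)/a. A dependence exists iff delta is an
   integer and |delta| <= UB - LB. delta > 0: the write happens first
   (true dependence); delta < 0: the read happens first (anti);
   delta = 0: the same element within one iteration. */
int siv_test(int a, int c1, int c2, int LB, int UB) {
    if (a == 0) return c1 == c2;        /* loop-invariant subscripts */
    if ((c1 - c2) % a != 0) return 0;   /* delta must be an integer  */
    int delta = (c1 - c2) / a;
    if (delta < 0) delta = -delta;
    return delta <= UB - LB;            /* both iterations in bounds */
}

int main(void) {
    printf("%d\n", siv_test(1, 10, 0, 0, 11));  /* Z[i+10]=Z[i]: 1, delta = 10  */
    printf("%d\n", siv_test(2, 0, -1, 3, 15));  /* X[2i], X[2i-1]: 0 (delta = 1/2) */
    return 0;
}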
Simple Dependence Testing
• Examples, answered:

for (i = 1; i <= 100; i++) {
S1:  A[i] = ...
S2:  ... = A[i - 1]
}

True dependence (read after write): Wt: A[i] in S1 → Rd: A[i′ - 1] in S2, with i′ = i + 1, i.e. ∆d = 1.

float Z[100];
for (i = 0; i < 12; i++) {
S:   Z[i+10] = Z[i];
}

True dependence (read after write): Wt: Z[i+10] in S → Rd: Z[i′] in S, with i′ = i + 10, i.e. ∆d = 10.
Simple Dependence Testing
• More examples:

for (i = 1; i <= 100; i++) {
R1:  X[i] = ...
R2:  ... = X[i + 2]
}

Anti dependence (write after read): Rd: X[i+2] in R2 → Wt: X[i′] in R1, with i′ = i + 2. (Here ∆d = (0 − 2)/1 = −2 < 0, so there is no true dependence from R1 to R2; the element read in iteration i is overwritten two iterations later.)

for (i = 3; i <= 15; i++) {
S1:  X[2 * i] = ...
S2:  ... = X[2 * i - 1]
}

No dependence! (∆d = (0 − (−1))/2 = 1/2 is not an integer.)
Next Class
Reading:
• ALSU, Chapters 11.1–11.3