Basics of Parallelization
CMPSC 450
Why parallelize?
• Not enough memory on single system.
• More computers scale capacity roughly linearly, making performance easier to predict.
• Execution time is too long with a single-core serial implementation.
• How do we know how much faster my code will run with more processors?
Amdahl’s Law
• S_latency(s) = 1 / ((1 − p) + p/s)
• S_latency is the theoretical speedup of the execution of the whole task
• s is the speedup of the part of the task that benefits from improved system resources
• p is the proportion of execution time that the part benefiting from improved resources originally occupied
Derivation
• Break an application up into 2 parts. One part can be improved, and one part cannot.
• Expressed as execution time: T = (1-p)T + pT
• The execution time of the improvable part becomes pT/s
• The new execution time becomes: T(s) = (1 − p)T + pT/s
• Speedup for a fixed workload is S_latency(s) = T / T(s) = 1 / ((1 − p) + p/s)
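The derivation above can be turned into a short calculation. A minimal C sketch (the function name `amdahl_speedup` is ours, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: overall speedup when a fraction p of the original
   execution time is sped up by a factor s and the rest is unchanged.
   S(s) = T / T(s) = 1 / ((1 - p) + p / s) */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

Note that even as s grows without bound, the speedup is capped at 1/(1 − p) by the serial fraction.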
Example 1
• If 30% of the execution time may be the subject of a speedup, p will be 0.3; if the improvement makes the affected part twice as fast, s will be 2. Amdahl’s law states that the overall speedup of applying the improvement will be:
  S_latency = 1 / ((1 − 0.3) + 0.3/2) = 1 / 0.85 ≈ 1.18
Example 2
• For example, assume that we are given a serial task which is split into four consecutive parts, whose percentages of execution time are p1 = 0.11, p2 = 0.18, p3 = 0.23, and p4 = 0.48 respectively. Then we are told that the 1st part is not sped up, so s1 = 1, while the 2nd part is sped up 5 times, so s2 = 5, the 3rd part is sped up 20 times, so s3 = 20, and the 4th part is sped up 1.6 times, so s4 = 1.6. By using Amdahl’s law, the overall speedup is
  S_latency = 1 / (0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6) = 1 / 0.4575 ≈ 2.19
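The multi-part case generalizes Amdahl's law to S = 1 / Σ_k (p_k / s_k). A hedged C sketch (`amdahl_multi` is our name):

```c
#include <assert.h>

/* Generalized Amdahl's law: a task split into n consecutive parts,
   where part k occupies fraction p[k] of the original time and is
   sped up by factor s[k].  Overall speedup = 1 / sum_k(p[k]/s[k]).
   Assumes the fractions p[k] sum to 1. */
double amdahl_multi(const double *p, const double *s, int n) {
    double total = 0.0;
    for (int k = 0; k < n; k++)
        total += p[k] / s[k];
    return 1.0 / total;
}
```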
How to parallelize (at a higher level)?
• Functional Parallelism
• Break up algorithm into subtasks.
• Still requires some level of message passing or memory sharing
[Side-by-side code example: POSIX Threads vs. C++11 Threads]
Source: https://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/
Functional Parallelism
• Manager-worker scheme
• One thread distributes the work across several worker threads
• Functional Decomposition
• Each thread executes a different job
• What about a single algorithm/problem?
Summing ‘n’ numbers
• Two serial implementations, both O(n) algorithms:
• T(n) = n + 1:

  A = 0;
  for (i = 0; i < n; i++)
      A += B[i];

• OR T(n) = n − 1:

  for (i = 1; i < n; i++)
      B[0] += B[i];
Side note: Parallel Models
• To analyze and describe parallel algorithms
• Requirements: simplicity, implementability
• PRAM, BSP, LogP, Hockney model, multicore models, ...
Random Access Machine (RAM) model
• Uniform cost criterion: the cost of an elementary operation is fixed and independent of word size
• All operations cost 1 unit.
• T(n) can be estimated based on this assumption.
[Figure: a CPU connected to Memory]
PRAM: Parallel Random Access Machine
[Figure: processors P1, P2, P3, ..., Pp connected to a shared memory]
• PRAM is a shared memory model
• Two modes of execution: synchronous and asynchronous
• Synchronous: all processors operate under the control of a common clock
• Asynchronous: each processor has a different clock
PRAM model variants
• Based on concurrent memory access rules:
• EREW: Exclusive Read Exclusive Write
  • No simultaneous access (read or write) to a single memory location
• CREW: Concurrent Read Exclusive Write
  • Concurrent read access allowed
• CRCW: Concurrent Read Concurrent Write
  • Most powerful PRAM model
• ERCW: Exclusive Read Concurrent Write
  • Doesn’t make sense
Summing ‘n’ numbers, PRAM algorithm
• Input: Array A(1), A(2), A(3), ..., A(n)
• Output: Compute S = A(1) + A(2) + A(3) + ... + A(n)
• Assume that n = 2^k
• PRAM algorithm design approach:
• assume we have an infinite number of processors
Expressing Algorithms using DAGs
• DAG: Directed Acyclic Graph
• Numerical computations without branching instructions can be expressed using DAGs
• Each input: node/vertex in graph with no incoming arcs
• Each operation: node/vertex in graph with incoming arcs from its operand nodes
• Output: node with zero out-degree
• DAG implies precedence constraints
• It is architecture-independent
[Figure: example DAG with nodes A through H]
DAG use
• Cost model
• Every processor runs at the same speed
• One operation takes one unit of time
• No edge costs
DAG example: adding N elements
A=0;
for (i = 0; i < N; i++)
A += B[i];
• The inputs can be loaded concurrently in our PRAM model
• We still have a dependency on our add operations!
• Bad for parallelism: we are not exposing concurrency
[Figure: DAG forming a sequential chain of + nodes over A(1), A(2), A(3), ...]
Parallel Summation Algorithm
• Consider pairwise sums
• Iteratively sum pairs of values or results until one result is achieved.
• In step 1, compute A(1) + A(2), A(3) + A(4), A(5) + A(6), A(7) + A(8)
• In step 2, compute [A(1) + A(2)] + [A(3) + A(4)], [A(5) + A(6)] + [A(7) + A(8)]
• In step 3, compute [A(1) + A(2) + A(3) + A(4)] + [A(5) + A(6) + A(7) + A(8)]
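The pairwise-sum steps can be simulated serially; on a PRAM, each step's additions would run concurrently. A sketch assuming n is a power of two (`tree_sum` is our name):

```c
#include <assert.h>

/* Pairwise (tree) summation: in each step, adjacent pairs are added,
   halving the number of partial results until one remains.  This is a
   serial simulation; on a PRAM, each step's pairs run concurrently.
   Overwrites b in place; assumes n is a power of two. */
double tree_sum(double *b, int n) {
    for (int step = n; step > 1; step /= 2)
        for (int i = 0; i < step / 2; i++)
            b[i] = b[2 * i] + b[2 * i + 1];
    return b[0];
}
```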
Example DAG: adding 8 elements
[Figure: 'balanced binary tree' DAG over A(1)..A(8): four + nodes feed two + nodes, which feed one + node]
• Depth of DAG = 3 here; log(n) in general
• Which approach is better?
Work and Span
• Work performed by an algorithm W(n) is the total number of operations
• Provides insight into the amount of computing power (number of processors) necessary for a particular algorithm
• For a single processor: T1(n) = W(n)
• Span is the longest path through the algorithm D(n)
• Also called the Critical Path or Depth
• Given enough processors, the fastest execution time will be T∞(n) = D(n)
Work-Time analysis of algorithms
• Work performed in the previous example: W(n) = n − 1 (seven additions for n = 8)
• Span is the depth: D(n) = log(n)
[Figure: balanced binary tree DAG over A(1)..A(8)]
Work-Time scheduling principle
• Brent’s theorem
• The theoretical execution time for a given number of processors p is bounded by:
  • at least the execution time of one processor divided by the number of processors: T_p ≥ T_1 / p
  • at most the execution time of one processor divided by the number of processors, plus the execution time given infinite processors: T_p ≤ T_1 / p + T_∞
• T_∞: parallel running time given infinite processors; also called ‘depth’
• For the previous algorithm, time T∞(n) = O(log n)
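With work W = T_1 and depth D = T_∞, the upper bound reads T_p ≤ W/p + D. A small helper for that bound (our naming, a sketch of one common statement of the theorem):

```c
#include <assert.h>

/* Upper bound from Brent's theorem: T_p <= W/p + D,
   where W = T_1 is the total work and D = T_inf is the span/depth. */
double brent_upper_bound(double work, double depth, double p) {
    return work / p + depth;
}
```

For the tree summation of n = 8 values (W = 7, D = 3), four processors give a bound of 7/4 + 3 parallel steps.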
Parallel time, speedup
• Tp(n): time on a parallel system with p processors
• T*(n): best possible sequential algorithm time
• Absolute speedup = T*(n) / Tp(n)
• Relative speedup = T1(n) / Tp(n)
• Can also compare Tp1(n) with Tp2(n)
  • e.g., what is the speedup if the number of processors is doubled?
• Efficiency = (relative speedup) / p
• Ideally, we want relative speedup to be p and efficiency to be 1.
• For the binary tree summation example: absolute speedup is N / (log N)
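These definitions are straightforward to compute from measured or modelled times. A sketch (function names are ours):

```c
#include <assert.h>

/* Relative speedup: single-processor time over p-processor time. */
double rel_speedup(double t1, double tp) { return t1 / tp; }

/* Efficiency: relative speedup divided by the processor count p. */
double efficiency(double t1, double tp, double p) { return t1 / (tp * p); }
```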
Work-Time scheduling principle
• A PRAM algorithm that runs in T∞(n) time while performing W(n) work can be adapted to run on a p-processor PRAM in at most O(W(n)/p + T∞(n)) parallel steps.
• Remember W(n) = T1(n)
• A PRAM algorithm can be simulated to run on a p-processor system in Tp(n) = O(W(n)/p + T∞(n)) time.
Optimality
• T*(n): complexity of the problem in the RAM model (running time of `best’ sequential algorithm)
• We say a parallel algorithm is work-optimal if the parallel work W(n) required by the algorithm satisfies W(n) = θ(T*(n)).
  • Regardless of parallel running time
• A work-optimal PRAM algorithm on a p-processor system would take
  Tp(n) = O(T*(n)/p + T∞(n)) time.
• (Absolute) speedup Sp(n) is given by
  Sp(n) = T*(n) / Tp(n)
Optimality
• Optimal speedup: Sp(n) = θ(p)
• This occurs when T*(n)/p dominates T∞(n), i.e. the speedup scales with p.
• The faster the parallel algorithm, the larger the range of p for which the algorithm achieves optimal speedup
• A work-optimal parallel algorithm is said to be work-time optimal if it can be shown that T(n) cannot be improved by any other work-optimal parallel algorithm.
  • Need to determine a tight lower bound
• Work-optimality and T(n) = O(log(n)) is usually pretty good.
• For the previous example: optimal speedup whenever p = O(n / log n)
The ‘pardo’ construct
• `Parallel do’
• for l ≤ i ≤ u pardo
statement(s)
• The statement(s) following pardo depend on the iteration index i.
• The statements corresponding to all values of i between l and u are executed concurrently.
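pardo resembles a data-parallel loop: every iteration must be independent of the others. A rough C analogue using an OpenMP pragma (an assumption on our part — the slides do not mention OpenMP; the pragma is simply ignored by compilers without OpenMP support, leaving a correct serial loop):

```c
#include <assert.h>

/* Analogue of "for 1 <= i <= n pardo: B(i) := A(i)": the iterations
   are independent, so they may all execute concurrently. */
void pardo_copy(const double *a, double *b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = a[i];
}
```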
PRAM algorithm for summing n elements
Input: n = 2^k numbers stored in an array A
Output: The sum S = A(1) + A(2) + ... + A(n)
begin
1. for 1 ≤ i ≤ n pardo
       Set B(i) := A(i)
2. for h = 1 to log n do
       for 1 ≤ i ≤ n/2^h pardo
           Set B(i) := B(2i-1) + B(2i)
3. Set S := B(1)
end
• Step 1 takes 1 unit of time: we can assign one array assignment to each processor.
• Step 2 takes log n units of time: each inner pardo takes 1 unit since we can assign one loop iteration to each processor.
• The final assignment takes 1 unit of time.
• The overall theoretical execution time for this loop is (1 + log n + 1), or simply O(log n).
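A serial C simulation of the algorithm above (0-based arrays, so B(2i−1) + B(2i) becomes b[2i] + b[2i+1]; `pram_sum` is our name). Each pardo level would take one time unit given enough processors:

```c
#include <assert.h>

/* Serial simulation of the PRAM summation algorithm.
   Assumes n = 2^k; uses 0-based arrays. */
double pram_sum(const double *a, double *b, int n) {
    for (int i = 0; i < n; i++)               /* step 1: B(i) := A(i) */
        b[i] = a[i];
    for (int len = n / 2; len >= 1; len /= 2) /* step 2: h = 1 .. log n */
        for (int i = 0; i < len; i++)         /* B(i) := B(2i-1)+B(2i) */
            b[i] = b[2 * i] + b[2 * i + 1];
    return b[0];                              /* step 3: S := B(1) */
}
```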
Dense matrix-vector multiplication
• A is an n×n matrix, b is an n×1 vector. Compute c = Ab.
• Serial pseudo-code:
  for 1 ≤ j ≤ n
      Set c(j) := 0
  for 1 ≤ i ≤ n
      for 1 ≤ j ≤ n
          Set c(i) := c(i) + A(i, j) * b(j)
• T*(n) = θ(n^2)
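The serial pseudo-code maps directly to C (row-major storage assumed; `matvec` is our name):

```c
#include <assert.h>

/* Serial dense matrix-vector multiply c = A*b, where A is n x n,
   stored row-major.  Theta(n^2) operations. */
void matvec(const double *A, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] += A[i * n + j] * b[j];
    }
}
```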
PRAM algorithm for dense matrix vector multiplication
• Each c(i) can be computed independently
• Each c(i) requires summation of n elements
• Use binary tree based PRAM algorithm for summation
• With n^2 processors, we have a parallel algorithm with time T(n) = O(log n).
• W(n) = O(n^2) = θ(T*(n))
• The parallel algorithm is work-optimal
• Achieves optimal speedup whenever p = O(n^2 / log n)
Dense matrix-matrix multiplication
• A is an n×n matrix, B is an n×n matrix. Compute C = AB.
• Serial pseudo-code:
  for 1 ≤ i, j ≤ n
      Set C(i, j) := 0
  for 1 ≤ i ≤ n
      for 1 ≤ j ≤ n
          for 1 ≤ k ≤ n
              Set C(i, j) := C(i, j) + A(i, k) * B(k, j)
• T*(n) = θ(n^3)
  • Wrong! This is not the fastest sequential algorithm for matrix multiply. The best known bound is O(n^2.3727).
• The floating point operation count for the above algorithm is 2n^3
• We also know T*(n) = Ω(n^2)
PRAM algorithm for (dense) matrix-matrix multiplication
• Each C(i, j) can be computed independently
• Each C(i, j) requires summation of n elements
• Use binary tree based PRAM algorithm for summation
• With n^3 processors, we have a parallel algorithm with time T(n) = O(log n).
• W(n) = O(n^3) ≠ θ(T*(n))
• The parallel algorithm is not work-optimal
• By the work-time scheduling principle, the algorithm can be simulated by p processors to run in O(n^3/p + log n) time
PRAM matrix multiplication algorithm
Input: Two n×n matrices A and B, where n = 2^k for some integer k.
Output: The product C = AB
begin
1. for 1 ≤ i, j, l ≤ n pardo
       Set C’(i, j, l) := A(i, l) * B(l, j)
2. for h = 1 to log n do
       for 1 ≤ i, j ≤ n, 1 ≤ l ≤ n/2^h pardo
           Set C’(i, j, l) := C’(i, j, 2l-1) + C’(i, j, 2l)
3. for 1 ≤ i, j ≤ n pardo
       Set C(i, j) := C’(i, j, 1)
end
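A serial C simulation of the algorithm above (0-based, row-major; a sketch assuming n = 2^k, with `pram_matmul` our name). Step 2's pairwise reduction over l mirrors the tree summation:

```c
#include <assert.h>
#include <stdlib.h>

/* Serial simulation of the PRAM matrix multiply: step 1 forms the n^3
   products C'(i,j,l) = A(i,l)*B(l,j); step 2 reduces over l by pairwise
   (tree) summation; step 3 writes C(i,j) := C'(i,j,1).  On a PRAM this
   runs in O(log n) time with n^3 processors, but W(n) = O(n^3), so the
   algorithm is not work-optimal. */
void pram_matmul(const double *A, const double *B, double *C, int n) {
    double *T = malloc((size_t)n * n * n * sizeof *T);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int l = 0; l < n; l++)            /* step 1 */
                T[(i * n + j) * n + l] = A[i * n + l] * B[l * n + j];
    for (int len = n / 2; len >= 1; len /= 2)      /* step 2 */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int l = 0; l < len; l++)
                    T[(i * n + j) * n + l] =
                        T[(i * n + j) * n + 2 * l] +
                        T[(i * n + j) * n + 2 * l + 1];
    for (int i = 0; i < n; i++)                    /* step 3 */
        for (int j = 0; j < n; j++)
            C[i * n + j] = T[(i * n + j) * n];
    free(T);
}
```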
Sources
• https://learnxinyminutes.com/docs/asymptotic-notation/
• http://webhome.phy.duke.edu/~rgb/brahma/Resources///als/als/node3.html
• Chapter 2 of class notes by Prof. Uzi Vishkin http://www.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf
• https://stanford.edu/~rezab/dao/notes/lecture01/cme323_lec1.pdf
Partitioning-based PRAM algorithm for parallel summation
• The number of processors p is explicit
• Partition the array into p parts; each processor computes a partial sum
• One processor adds the p partial sums to get the final solution
• Tp(n) = θ(n/p + p)
• W(n) = θ(n + p)
• This algorithm is work optimal if p = O(n)
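A serial sketch of the partitioning scheme (each chunk's loop would run on its own processor; `partitioned_sum` is our name):

```c
#include <assert.h>
#include <stdlib.h>

/* Partitioning-based summation with p "processors", simulated serially:
   each of the p chunks is summed independently (n/p parallel steps),
   then one processor adds the p partial sums (p more steps),
   giving T_p(n) = Theta(n/p + p). */
double partitioned_sum(const double *a, int n, int p) {
    double *partial = calloc((size_t)p, sizeof *partial);
    for (int k = 0; k < p; k++) {          /* each chunk in parallel */
        int lo = k * n / p, hi = (k + 1) * n / p;
        for (int i = lo; i < hi; i++)
            partial[k] += a[i];
    }
    double s = 0.0;                        /* one processor combines */
    for (int k = 0; k < p; k++)
        s += partial[k];
    free(partial);
    return s;
}
```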
Define: Computational Problem
• Specifies relations between a given input and a desired output
• Examples
• Given 10 integers, give their sum as output
• Given 100 integers, give the sorted list as output
• Multiply two matrices
• Given a positive integer n, find a nontrivial prime factor of n.
Define: Algorithm
• “a process or set of rules to be followed in calculations or other
problem-solving operations, especially by a computer.”
• Examples
• Merge sort algorithm for the problem of sorting integers
• Cooley-Tukey algorithm for FFT
Algorithm Analysis
• Execution time, memory requirements
• Execution time is expressed in terms of elementary operation counts:
• Number of floating point operations
• Number of loads
• Number of stores
Asymptotic Notation: O()
• n: input size
• T(n): number of basic operations
• f(n) : arbitrary time complexity
• Upper bound (worst case execution):
• T(n) = O(f(n)) if, for some positive constants c and n0,
  T(n) ≤ c · f(n) for all n ≥ n0
Asymptotic Notation: O()
Example: T(n) = 3 log n + 100, f(n) = log n
• Is T(n) = O(f(n))?
• Suppose c = 150, n0 > 4
• 3 log n + 100 ≤ 150 · log n for all n ≥ n0
• Therefore T(n) is O(f(n))
What does this mean to my code?
• 3 log n: execution time of some input-driven loop
• 100: constant execution time of the algorithm (startup time)
Asymptotic Notation: Ω() (Omega)
• n: input size
• T(n): number of basic operations
• f(n) : arbitrary time complexity
• Lower bound (best-case execution time):
• T(n) = Ω(f(n)) if there exist c and n0 such that
  T(n) ≥ c · f(n) for all n ≥ n0
Asymptotic Notation: θ() (Theta)
• Tight bound
• An algorithm’s estimated runtime is tightly bounded if the best and worst execution times are asymptotically equivalent
• T(n) = θ(f(n)) if
  • T(n) = O(f(n)) AND
  • T(n) = Ω(f(n))
Asymptotic notation example
• T(n) = 5n^2 + 8n = O(n^2)
  • Why? c = 6 and n0 = 8 in the previous definition
  • Note that T(n) = O(n^3) and T(n) = O(n^100) are also valid statements
• T(n) = 5n^2 + 8n = Ω(n^2)
  • Why? c = 1 and n0 = 1 in the previous definition
  • Note that T(n) = Ω(n) and T(n) = Ω(log n) are also valid statements
• O() ∩ Ω() = θ(). Thus, in this case, T(n) = θ(n^2).
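The witnesses above can be spot-checked mechanically. A sketch (the function names are ours):

```c
#include <assert.h>

/* Check the O() witness c = 6, n0 = 8 for T(n) = 5n^2 + 8n:
   does T(n) <= 6*n^2 hold? */
int bound_holds_O(long n) {
    return 5 * n * n + 8 * n <= 6 * n * n;
}

/* Check the Omega() witness c = 1, n0 = 1: does T(n) >= n^2 hold? */
int bound_holds_Omega(long n) {
    return 5 * n * n + 8 * n >= n * n;
}
```

At n = 8 the O() bound holds with equality (384 ≤ 384), and at n = 7 it fails (301 > 294), which is why n0 = 8 is the right cutoff for c = 6.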