
Basics of Parallelization
CMPSC 450

Why parallelize?
• Not enough memory on single system.
• Adding more computers scales resources roughly linearly, making it an easy fix to plan for.
• Execution time too long on single core serial implementation.
• How do we know how much faster my code will run with more processors?
CMPSC 450

Amdahl’s Law
• S_latency is the theoretical speedup of the execution of the whole task
• s is the speedup of the part of the task that benefits from improved system resources
• p is the proportion of execution time that the part benefiting from improved resources originally occupied.
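These three quantities combine into the formula for Amdahl's law, shown here in LaTeX since the slide's equation did not survive extraction:

```latex
S_{\text{latency}}(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}
```

As s grows without bound, the speedup approaches 1/(1 − p): the serial fraction limits the achievable speedup no matter how many processors are added.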
CMPSC 450

Derivation
• Break an application up into 2 parts. One part can be improved, and one part cannot.
• Expressed as execution time: T = (1-p)T + pT
• After the improvement, the execution time of that part becomes pT/s
• The new execution time becomes: T(s) = (1-p)T + pT/s
• Speedup for a fixed workload is represented as T/T(s).
CMPSC 450

Example 1
• If 30% of the execution time may be the subject of a speedup, p will be 0.3; if the improvement makes the affected part twice as
fast, s will be 2. Amdahl’s law states that the overall speedup of applying the improvement will be:
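Substituting p = 0.3 and s = 2 into Amdahl's law gives the overall speedup (the slide's worked result appears to have been an image):

```latex
S_{\text{latency}} = \frac{1}{(1 - 0.3) + \dfrac{0.3}{2}} = \frac{1}{0.85} \approx 1.18
```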
CMPSC 450

Example 2
• For example, assume that we are given a serial task which is split into four consecutive parts, whose percentages of execution time are p1 = 0.11, p2 = 0.18, p3 = 0.23, and p4 = 0.48 respectively. Then we are told that the 1st part is not sped up, so s1 = 1, while the 2nd part is sped up 5 times, so s2 = 5, the 3rd part is sped up 20 times, so s3 = 20, and the 4th part is sped up 1.6 times, so s4 = 1.6. By using Amdahl’s law, the overall speedup is
CMPSC 450

How to parallelize (at a higher level)?
• Functional Parallelism
• Break up algorithm into subtasks.
• Still requires some level of message passing, memory sharing
[Side-by-side code listings: POSIX Threads vs. C++11 Threads]
Source: https://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/
CMPSC 450

Functional Parallelism
• Manager-worker scheme
• One thread distributes the work across several worker threads
• Functional Decomposition
• Each thread executes a different job
• What about a single algorithm/problem?
CMPSC 450

Summing ‘n’ numbers

A = 0;
for (i = 0; i < n; i++)
    A += B[i];
• T(n) = n + 1
• O(n) algorithm

OR

for (i = 1; i < n; i++)
    B[0] += B[i];
• T(n) = n – 1
• O(n) algorithm
CMPSC 450

Side note: Parallel Models
• To analyze and describe parallel algorithms
• Requirements: simplicity, implementability
• PRAM, BSP, LogP, Hockney model, multicore models, ...
CMPSC 450

Random Access Machine (RAM) model
• Uniform cost criteria: the cost of an elementary operation is fixed and is independent of word size
• All operations cost 1 unit. T(n) can be estimated based on this assumption.
[Diagram: a single CPU connected to memory]
CMPSC 450

PRAM: Parallel Random Access Machine
[Diagram: processors P1, P2, P3, ..., Pp connected to a shared memory]
• PRAM is a shared memory model
• Two modes of execution: synchronous and asynchronous
  • Synchronous: all processors operate under the control of a common clock
  • Asynchronous: each processor has its own clock
CMPSC 450

PRAM model variants
• Based on concurrent memory access rules.
• EREW: Exclusive Read Exclusive Write
  • No simultaneous access (read or write) to a single memory location
• CREW: Concurrent Read Exclusive Write
  • Concurrent read access allowed
• CRCW: Concurrent Read Concurrent Write
  • Most powerful PRAM model
• ERCW
  • Doesn’t make sense
CMPSC 450

Summing ‘n’ numbers, PRAM algorithm
• Input: Array A(1), A(2), A(3), ..., A(n)
• Output: Compute S = A(1) + A(2) + A(3) + ... + A(n)
• Assume that n = 2^k
• PRAM algorithm design approach:
  • assume we have an infinite number of processors
CMPSC 450

Expressing Algorithms using DAGs
• DAG: Directed Acyclic Graph
• Numerical computations without branching instructions can be expressed using DAGs
• Each input: node/vertex in the graph with no incoming arcs
• Each operation: node/vertex with incoming arcs from its operands
• Output: node with zero out-degree
• A DAG implies precedence constraints
• It is architecture-independent
[Diagram: example DAG over nodes A through H]
CMPSC 450

DAG use
• Cost model
  • Every processor runs at the same speed
  • One operation takes one unit of time
  • No edge costs
CMPSC 450

DAG example: adding N elements
A = 0;
for (i = 0; i < N; i++)
    A += B[i];
• A can be loaded concurrently in our PRAM model
• We still have a dependency on our add operations: A(1) + A(2), then + A(3), then + A(4), then + A(5), ...
• Bad for parallelism; we are not exposing concurrency
CMPSC 450

Parallel Summation Algorithm
• Consider pairwise sums
• Iteratively sum pairs of values or results until one result is achieved.
• In step 1, compute A(1) + A(2), A(3) + A(4), A(5) + A(6), A(7) + A(8)
• In step 2, compute [A(1) + A(2)] + [A(3) + A(4)], [A(5) + A(6)] + [A(7) + A(8)]
• In step 3, compute [A(1) + A(2) + A(3) + A(4)] + [A(5) + A(6) + A(7) + A(8)]
CMPSC 450

Example DAG: adding 8 elements
[Diagram: balanced binary tree of + nodes over inputs A(1) ... A(8)]
• Depth of DAG = 3; log(n) in general
• ‘balanced binary tree’
• Which approach is better?
CMPSC 450

Work and Span
• Work W(n): the total number of operations performed by an algorithm
• Provides insight into the amount of computing power (number of processors) necessary for a particular algorithm
• For a single processor: T1(n) = W(n)
• Span D(n): the longest path through the algorithm
• Also called the Critical Path or Depth
• Given enough processors, the fastest execution time is T∞(n) = D(n)
CMPSC 450

Work-Time analysis of algorithms
[Diagram: the binary summation tree over A(1) ... A(8)]
• Work performed in the previous example: W(n) = Σ (h = 1 to log n) n/2^h = n – 1
• Span is the Depth: D(n) = log(n)
CMPSC 450

Work-Time scheduling principle
• Brent’s theorem
• The theoretical execution time for a given number of processors p is bounded by:
  Tp(n) ≤ T1(n)/p + T∞(n)
  (the single-processor execution time divided by p, plus the execution time given infinite processors)
• T∞: parallel running time given infinite processors
  • Also called ‘depth’
• For the previous algorithm, T∞(n) = O(log n)
CMPSC 450

Parallel time, speedup
• Tp(n): time on a parallel system with p processors
• T*(n): best possible sequential algorithm time
• Absolute speedup = T*(n) / Tp(n)
• Relative speedup = T1(n) / Tp(n)
• Can also compare Tp1(n) with Tp2(n)
  • ex: what is the speedup if the number of processors is doubled? Tp(n) / T2p(n)
• Efficiency = relative speedup / p
• Ideally, we want relative speedup to be p and efficiency to be 1.
• For the binary tree summation example:
  • Absolute speedup is N/(log N)
CMPSC 450

Work-Time scheduling principle
• A PRAM algorithm that runs in T∞(n) time while performing W(n) work can be adapted to run on a p-processor PRAM in at most W(n)/p + T∞(n) parallel steps.
• Remember W(n) = T1(n)
• A PRAM algorithm can be simulated to run on a p-processor system in Tp(n) = W(n)/p + T∞(n) time.
CMPSC 450

Optimality
• T*(n): complexity of the problem in the RAM model (running time of the `best’ sequential algorithm)
• We say a parallel algorithm is work-optimal if the parallel work W(n) required by the algorithm satisfies W(n) = θ(T*(n)).
  • Regardless of parallel running time
• A work-optimal PRAM algorithm on a p-processor system would take Tp(n) = θ(T*(n)/p) time.
• The (absolute) speedup is then Sp(n) = T*(n)/Tp(n) = θ(p).
CMPSC 450

Optimality
• Optimal speedup: Sp(n) = θ(p)
• This occurs when Tp(n) = θ(T*(n)/p), i.e. the speedup scales with p.
• The faster the parallel algorithm, the larger the range of p for which the algorithm achieves optimal speedup
• A work-optimal parallel algorithm is said to be work-time optimal if it can be shown that T(n) cannot be improved by any other work-optimal parallel algorithm.
  • Need to determine a tight lower bound
• Work-optimality and T(n) = O(log(n)) is usually pretty good.
• For the previous example: optimal speedup whenever p = O(n / log n)
CMPSC 450

The ‘pardo’ construct
• `Parallel do’
• for l ≤ i ≤ u pardo
    statement(s)
• Statement(s) following pardo depend on iteration i.
• Statements corresponding to all values of i between l and u are executed concurrently
CMPSC 450

PRAM algorithm for summing n elements
Input: n = 2^k numbers stored in an array A
Output: The sum S = A(1) + A(2) + ... + A(n)
begin
1. for 1 ≤ i ≤ n pardo
     Set B(i) := A(i)
2. for h = 1 to log n do
     for 1 ≤ i ≤ n/2^h pardo
       Set B(i) := B(2i–1) + B(2i)
3. Set S := B(1)
end
• Step 1 takes 1 unit of time: we can assign one array assignment to each processor.
• Step 2 takes log n units of time: each inner pardo loop takes 1 unit since we can assign one iteration to each processor.
• Step 3, the final assignment, takes 1 unit of time.
• The overall theoretical execution time is (1 + log n + 1), or simply O(log n).
CMPSC 450

Dense matrix-vector multiplication
• A is an n×n matrix, b is an n×1 vector. Compute c = Ab.
• Serial pseudo-code:
  for 1 ≤ j ≤ n
    Set c(j) := 0
  for 1 ≤ i ≤ n
    for 1 ≤ j ≤ n
      Set c(i) := c(i) + A(i,j) * b(j)
• T*(n) = θ(n^2)
CMPSC 450

PRAM algorithm for dense matrix-vector multiplication
• Each c(i) can be computed independently
• Each c(i) requires a summation of n elements
• Use the binary-tree-based PRAM algorithm for summation
• With n^2 processors, we have a parallel algorithm with time T(n) = O(log n).
• W(n) = O(n^2) = θ(T*(n))
• The parallel algorithm is work-optimal
• Achieves optimal speedup whenever p = O(n^2 / log n)
CMPSC 450

Dense matrix-matrix multiplication
• A is an n×n matrix, B is an n×n matrix. Compute C = AB.
• Serial pseudo-code:
  for 1 ≤ i, j ≤ n
    Set C(i, j) := 0
  for 1 ≤ i ≤ n
    for 1 ≤ j ≤ n
      for 1 ≤ k ≤ n
        Set C(i, j) := C(i, j) + A(i, k) * B(k, j)
• T*(n) = θ(n^3)
• Wrong! This is not the fastest sequential algorithm for matrix multiply. The best known bound is O(n^2.3727)
• The floating point op. count for the above algorithm is 2n^3
• We also know T*(n) = Ω(n^2)
CMPSC 450

PRAM algorithm for (dense) matrix-matrix multiplication
• Each C(i, j) can be computed independently
• Each C(i, j) requires a summation of n elements
• Use the binary-tree-based PRAM algorithm for summation
• With n^3 processors, we have a parallel algorithm with time T(n) = O(log n).
• W(n) = O(n^3) ≠ θ(T*(n))
• The parallel algorithm is not work-optimal
• By the work-time scheduling principle, the algorithm can be simulated by p processors to run in O(n^3/p) time
CMPSC 450

PRAM matrix multiplication algorithm
Input: Two n×n matrices A and B, where n = 2^k for some integer k.
Output: The product C = AB
begin
1. for 1 ≤ i, j, l ≤ n pardo
     Set C’(i, j, l) := A(i, l) * B(l, j)
2. for h = 1 to log n do
     for 1 ≤ i, j ≤ n, 1 ≤ l ≤ n/2^h pardo
       Set C’(i, j, l) := C’(i, j, 2l–1) + C’(i, j, 2l)
3. for 1 ≤ i, j ≤ n pardo
     Set C(i, j) := C’(i, j, 1)
end
CMPSC 450

Sources
• https://learnxinyminutes.com/docs/asymptotic-notation/
• http://webhome.phy.duke.edu/~rgb/brahma/Resources///als/als/node3.html
• Chapter 2 of class notes by Prof. Uzi Vishkin http://www.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf
• https://stanford.edu/~rezab/dao/notes/lecture01/cme323_lec1.pdf
CMPSC 450

Partitioning-based PRAM algorithm for parallel summation
• The number of processors p is explicit
• Partition the array into p parts; each processor computes a partial sum
• One processor adds the p partial sums to get the final solution
• Tp(n) = θ(n/p + p)
• W(n) = θ(n + p)
• This algorithm is work-optimal if p = O(n)
CMPSC 450

Define: Computational Problem
• Specifies relations between a given input and a desired output
• Examples
  • Given 10 integers, give their sum as output
  • Given 100 integers, give the sorted list as output
  • Multiply two matrices
  • Given a positive integer n, find a nontrivial prime factor of n.
CMPSC 450

Define: Algorithm
• “a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.”
• Examples
  • Merge sort algorithm for the problem of sorting integers
  • Cooley-Tukey algorithm for FFT
CMPSC 450

Algorithm Analysis
• Execution time, memory requirements
• Execution time is expressed in terms of elementary operation counts.
  • Number of floating point operations
  • Number of loads
  • Number of stores
CMPSC 450

Asymptotic Notation: O()
• n: input size
• T(n): number of basic operations
• f(n): arbitrary time complexity
• Upper bound (worst case execution):
  • T(n) = O(f(n)) if, for some positive values of c and n0, T(n) <= c * f(n) for all n >= n0
CMPSC 450

Asymptotic Notation: O()
Example:
T(n) = 3 log n + 100, f(n) = log n
• Is T(n) = O(f(n))?
• Suppose c = 150, n0 > 4
• Then 3 log n + 100 <= 150 * log n for all n > n0
• So T(n) is O( f(n) )

What does this mean to my code?
• 3 log n: execution time of some input-driven loop
• 100: constant execution time of the algorithm (startup time)
CMPSC 450

Asymptotic Notation: Ω() (Omega)
• n: input size
• T(n): number of basic operations
• f(n): arbitrary time complexity
• Lower bound (best case execution time):
  • T(n) = Ω(f(n)) if there exist c and n0 such that T(n) >= c * f(n) for all n > n0
CMPSC 450

Asymptotic Notation: θ() (Theta)
• Tight bound
• An algorithm’s estimated runtime is tightly bounded if its best-case and worst-case execution times grow at the same rate
• T(n) = θ( f(n) ) if
• T(n) = O(f(n)) AND
• T(n) = Ω(f(n))
CMPSC 450

Asymptotic notation example
• T(n) = 5n^2 + 8n = O(n^2)
• Why? c = 6 and n0 = 8 in the previous definition
• Note that T(n) = O(n^3) and T(n) = O(n^100) are also valid statements
• T(n) = 5n^2 + 8n = Ω(n^2)
• Why? c = 1 and n0 = 1 in the previous definition
• Note that T(n) = Ω(n) and T(n) = Ω(log n) are also valid statements
• O() ∩ Ω() = θ(). Thus, in this case, T(n) = θ(n^2).
CMPSC 450