CSE 440: Advanced Algorithms
Lecture 9: Parallel Algorithms (CLRS Ch. 27)
Scope
Our Assumptions (Model)
• Multiprocessor computers
• Different variants
• Chip-multiprocessor
• Multi-Socket computers
• Clusters
• Model: P processors
• Shared Memory
• As opposed to message passing
• Sequentially Consistent
• the memory behaves as if the instructions were executed sequentially according to some global linear order that preserves the individual orders in which each processor issues its own instructions.
Static Vs Dynamic Multithreading
• Static Multithreading
• Programmer assigns specific tasks to each created thread
• Dynamic Multithreading
• Programmer only provides opportunities for parallelism (i.e., logical parallelism)
• A layer of software that coordinates, schedules, and manages the parallel- computing resources.
• Example: print all prime numbers from 1 to 10¹⁰
Why Dynamic Multi-threading
• Simple
• Only three concurrency keywords: spawn, sync, parallel
  • a few more exist beyond this lecture
• The rest is the original sequential code
• Covers most common parallelism patterns
• Nested Parallelism (spawn/sync)
  • Divide-and-conquer
• Parallel loops (parallel)
  • Map/Reduce
• Allows for analyzing parallelism
  • using the notion of work/span
Nested Parallelism (Spawn/Sync)
• Spawn
• The (parent) procedure may continue to execute in parallel with the spawned (child) subroutine.
• Sync
• The procedure must wait as necessary for all its spawned children to complete
• Implicitly executed before return (in addition to explicit places)
Remember: this is all logical parallelism. Scheduler decides at runtime
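A minimal C++ sketch of how spawn/sync could map onto real threads (not the course's runtime): std::async plays the role of spawn and future::get() plays the role of sync; the cutoff constant is an illustrative choice to avoid creating too many OS threads.

    #include <future>
    #include <iostream>

    // Illustrative mapping: "spawn" -> std::async, "sync" -> future::get().
    // A real dynamic-multithreading runtime (work-stealing scheduler) decides
    // at run time whether the child actually executes in parallel.
    long pfib(int n) {
        if (n < 2) return n;
        if (n < 20) return pfib(n - 1) + pfib(n - 2);              // serial below cutoff (illustrative)
        auto child = std::async(std::launch::async, pfib, n - 1);  // "spawn" P-FIB(n-1)
        long y = pfib(n - 2);       // parent continues in parallel with the child
        return child.get() + y;     // "sync": wait for the spawned child, then combine
    }

    int main() {
        std::cout << pfib(30) << "\n";   // prints 832040
    }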
Computation DAG
• Example P-FIB(4)
• G(V,E)
• V: instructions/strands
• E: dependencies
• Directed path from u to v
• u and v are (logically) in series.
• Otherwise
• u and v are (logically) in parallel.
• Scheduler objective:
• Maximizing interleaving while preserving the partial order of the computation DAG
Computation DAG
• Edge classification
  • Continuation edge (horizontal): next strand within the same procedure
  • Spawn edge (downward): u spawns v
  • Call edge (downward): u calls v (as a normal procedure call)
  • Return edge (upward): return to the calling procedure
Computation Measures
• Work (T1)
• The total time to execute the entire computation on one processor.
• If each strand takes unit time (which we will assume), work is the number of vertices in the DAG
• Span (T∞)
• The total time if we have an unlimited number of processors
• The number of vertices on a longest (critical) path in the DAG.
• Actual time on P processors (TP)
For P-FIB(4): Work = 17, Span = 8
Computation Measures
• Lower bounds on TP provided by T1 and T∞
  • (work bound) TP ≥ T1/P
    • proof: at most P units of work are completed at each step
  • (span bound) TP ≥ T∞
    • proof: a P-processor machine cannot be faster than a machine with an unlimited number of processors
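The same two bounds, written out with a one-line justification each:

\[
T_P \;\ge\; \frac{T_1}{P}
\qquad\text{(work bound: in each of the $T_P$ steps, $P$ processors complete at most $P$ units of work, so $P\,T_P \ge T_1$)}
\]
\[
T_P \;\ge\; T_\infty
\qquad\text{(span bound: an unlimited-processor machine can do anything a $P$-processor machine does by using only $P$ of its processors, so $T_\infty \le T_P$)}
\]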
Computation Measures
• Speedup: how many times faster the computation is on P processors than on 1 processor.
• Speedup = T1/TP
• From the previous bounds: T1/TP ≤ P
• What is our target/expectation?
  • Perfect linear speedup: T1/TP = P
  • Linear speedup: T1/TP = Θ(P)
  • Practical objective: linear speedup up to a decent value of P.
Computation Measures
• Parallelism: the maximum possible speedup
  • T1/T∞
  • the average amount of work that can be done in parallel along each step of the critical path
• Parallelism is a limit on possible speedup
  • If P > T1/T∞, then T1/TP < P
• Informally: once P exceeds the parallelism, perfect linear speedup is no longer possible.
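Spelling out the chain of inequalities behind the last bullet (using the span bound TP ≥ T∞):

\[
P > \frac{T_1}{T_\infty}
\;\;\Longrightarrow\;\;
\frac{T_1}{T_P} \;\le\; \frac{T_1}{T_\infty} \;<\; P ,
\]
so once $P$ exceeds the parallelism, perfect linear speedup $T_1/T_P = P$ can no longer be achieved.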
What is the parallelism of P-FIB(4)? What does it mean?
Computation Measures
• (Parallel) Slackness: the ratio (T1/T∞)/P
  • If less than 1: we have reached the parallelism limit
  • If greater than 1: a good scheduler can achieve closer and closer to perfect linear speedup.
Scheduling
• The actual assignment of strands to processors at runtime.
• If work and span are minimized but the scheduler is inefficient, it is as if you did nothing!
• Optimal scheduler
  • Complex (NP-Complete!!)
  • beyond the scope of this lecture
• Greedy scheduler:
  • assign as many strands to processors as possible in each time step
  • Complete step (more than P strands are ready): schedule any P of them
  • Incomplete step (otherwise): schedule all of them
Scheduling
Now we are ready to analyze parallel algorithms
Analysis of Multithreaded Algorithms
• Use the same techniques we studied before to calculate T1(n) and T∞(n)
• T1(n) is straightforward
• Analysis of the ordinary serial algorithm without parallelism keywords.
• T∞(n) is trickier
• The following trick will help
Example: Analyzing P-Fib(n)
• Work: T1(n) = Θ(φⁿ)
• Span: T∞(n) = Θ(n)
• Parallelism: T1(n)/T∞(n) = Θ(φⁿ/n)
• Good or bad?
  • Great: grows dramatically as n gets large.
  • Adding more processors will always pay off with near-perfect linear speedup.
• Conclusion: P-Fib is a highly parallelizable algorithm
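A sketch of where these figures come from (the standard CLRS Ch. 27 analysis of P-FIB):

\[
T_1(n) = T_1(n-1) + T_1(n-2) + \Theta(1) = \Theta(\phi^n), \qquad \phi = \frac{1+\sqrt{5}}{2},
\]
\[
T_\infty(n) = \max\bigl(T_\infty(n-1),\, T_\infty(n-2)\bigr) + \Theta(1) = T_\infty(n-1) + \Theta(1) = \Theta(n),
\]
\[
\text{parallelism} \;=\; \frac{T_1(n)}{T_\infty(n)} \;=\; \Theta\!\left(\frac{\phi^n}{n}\right).
\]

The span takes the maximum of the two recursive calls because they are spawned and therefore run in logical parallel.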
Parallel Loops
• Programmer uses the "parallel" keyword
• Framework converts it to spawn/sync
• Example: Matrix-Vector Multiplication
• recursively spawn first half of iterations to execute in parallel with the second half of the iterations.
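A hedged C++ sketch of that conversion for matrix-vector multiplication: the iteration range is split in half recursively, with std::async standing in for spawn; the grain-size cutoff and concrete types are illustrative choices, not part of the lecture's pseudocode.

    #include <future>
    #include <vector>

    // Recursively split the row range [lo, hi): "spawn" the first half,
    // run the second half, then "sync". Iteration i writes only y[i],
    // so the logically parallel iterations do not interfere.
    void matvec_range(const std::vector<std::vector<double>>& A,
                      const std::vector<double>& x, std::vector<double>& y,
                      std::size_t lo, std::size_t hi) {
        if (hi - lo <= 64) {                       // grain size (illustrative)
            for (std::size_t i = lo; i < hi; ++i) {
                double s = 0.0;
                for (std::size_t j = 0; j < x.size(); ++j) s += A[i][j] * x[j];
                y[i] = s;
            }
            return;
        }
        std::size_t mid = lo + (hi - lo) / 2;
        auto first_half = std::async(std::launch::async, matvec_range,     // "spawn"
                                     std::cref(A), std::cref(x), std::ref(y), lo, mid);
        matvec_range(A, x, y, mid, hi);            // parent handles the second half
        first_half.get();                          // "sync"
    }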
Parallel Loops
• Analysis of Matrix-Vector Multiplication
• Work:
  • Same as the sequential cost: T1(n) = Θ(n²)
  • Note: the cost of recursive spawning is Θ(n), so it is (asymptotically) negligible.
Parallel Loops
• Analysis of Matrix-Vector Multiplication
What is the parallelism? Is it good?
• Span: T∞(n) = Θ(lg n) + Θ(n) = Θ(n)
  • The first term is the cost of recursive spawning
    • why Θ(lg n) and not Θ(n)?
  • The second term is the cost of each iteration
    • Θ(n) in this procedure
  • The second term dominates!
    • Not always the case
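The span calculation written out; the spawning term is only Θ(lg n) because the recursive halving forms a balanced binary tree over the iterations, and only one root-to-leaf path of that tree lies on the critical path:

\[
T_\infty(n) \;=\; \underbrace{\Theta(\lg n)}_{\text{recursive spawning}} \;+\; \underbrace{\max_{1 \le i \le n} \mathit{iter}_\infty(i)}_{\text{most expensive iteration}} \;=\; \Theta(\lg n) + \Theta(n) \;=\; \Theta(n),
\qquad
\text{parallelism} \;=\; \frac{\Theta(n^2)}{\Theta(n)} \;=\; \Theta(n).
\]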
Race Conditions and Synchronization
• Determinacy race:
• Two logically parallel instructions access the same memory location and at least one of the instructions performs a write.
• Example
What is the final output?
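A minimal C++ sketch of a determinacy race, assuming the example here is the usual one of two logically parallel increments of a shared variable; depending on how the strands interleave, the program may print 1 or 2 (and, strictly speaking, a data race is undefined behavior in C++).

    #include <iostream>
    #include <thread>

    int main() {
        int x = 0;
        // Two logically parallel strands both read-modify-write x.
        auto inc = [&x] { x = x + 1; };    // load x, add 1, store x (not atomic)
        std::thread child(inc);            // "spawn" one increment
        inc();                             // the parent performs the other
        child.join();                      // "sync"
        std::cout << x << "\n";            // may print 1 or 2
    }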
Another Example
• A wrong parallel Matrix-Vector Multiplication implementation with determinacy races
The inner loop cannot be parallelized
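A standalone C++ sketch (my illustration, not the slide's code) of why the inner loop races: if the loop over j is split across logically parallel strands, they all read-modify-write the same accumulator y[i], so updates can be lost.

    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1'000'000;
        std::vector<double> A_row(n, 1.0), x(n, 1.0);
        double yi = 0.0;                    // a single output element y[i]

        // WRONG: parallelizing the inner loop over j makes both strands
        // read-modify-write the same accumulator yi (a determinacy race).
        auto partial = [&](std::size_t lo, std::size_t hi) {
            for (std::size_t j = lo; j < hi; ++j)
                yi += A_row[j] * x[j];      // race on yi
        };
        std::thread strand(partial, 0, n / 2);
        partial(n / 2, n);
        strand.join();

        std::cout << yi << "\n";            // should be 1e6; lost updates are likely
    }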
Does It Always Pay Off?
• Two examples
  • Matrix Multiplication
    • Yes, definitely
    • A typical benchmark for parallelism!
  • Merge Sort
    • Not exactly
    • But we can still do something!
Parallel Matrix Multiplication
• Straightforward algorithm
• Work:
  • T1(n) = Θ(n³)
• Span:
  • T∞(n) = Θ(lg n) + Θ(lg n) + Θ(n) = Θ(n)
    • Θ(lg n): recursive spawning of the outer parallel loop
    • Θ(lg n): recursive spawning of the inner parallel loop
    • Θ(n): cost of executing the inner-most loop (within each parallel (i, j) iteration)
• Parallelism:
  • Θ(n³)/Θ(n) = Θ(n²)
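A simplified C++ sketch of the straightforward algorithm: each C[i][j] is written by exactly one (i, j) pair, so the outer loops are safe to parallelize, while the inner-most loop over k stays serial. For brevity only the row loop is distributed across a fixed, illustrative number of threads rather than spawned recursively.

    #include <algorithm>
    #include <thread>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // Compute rows [lo, hi) of C = A * B; the k-loop is the serial Θ(n) part.
    void matmul_rows(const Matrix& A, const Matrix& B, Matrix& C,
                     std::size_t lo, std::size_t hi) {
        const std::size_t n = A.size();
        for (std::size_t i = lo; i < hi; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double s = 0.0;
                for (std::size_t k = 0; k < n; ++k) s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    void matmul(const Matrix& A, const Matrix& B, Matrix& C, unsigned nthreads = 4) {
        const std::size_t n = A.size();
        const std::size_t chunk = (n + nthreads - 1) / nthreads;
        std::vector<std::thread> workers;
        for (std::size_t lo = 0; lo < n; lo += chunk)
            workers.emplace_back(matmul_rows, std::cref(A), std::cref(B), std::ref(C),
                                 lo, std::min(lo + chunk, n));
        for (auto& t : workers) t.join();   // wait for all row blocks
    }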
Parallel Matrix Multiplication
• Divide and Conquer Algorithm
• Work: T1(n) = 8T1(n/2) + Θ(n²) = Θ(n³)
• Span: T∞(n) = T∞(n/2) + Θ(lg n) = Θ(lg²n)
  • T∞(n/2): cost of a recursive call (all recursive calls are executed in parallel)
  • Θ(lg n): recursive spawning of the parallel loops
  • Θ(lg²n): using substitution
• Parallelism: Θ(n³)/Θ(lg²n) = Θ(n³/lg²n)
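One way to see the Θ(lg²n) span (the slide uses substitution; unrolling the recurrence gives the same bound):

\[
T_\infty(n) = T_\infty(n/2) + \Theta(\lg n)
\;\Longrightarrow\;
T_\infty(n) = \sum_{k=0}^{\lg n} \Theta\bigl(\lg(n/2^k)\bigr)
            = \Theta\!\left(\sum_{k=0}^{\lg n} (\lg n - k)\right)
            = \Theta(\lg^2 n).
\]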
Parallel Matrix Multiplication
• Strassen’s Method
• Summary
• Use 7 recursive calls instead of 8
• Use more matrix additions instead (same O(n²) complexity).
• Work changes to Θ(n^(lg 7))
• Span remains the same: Θ(lg²n)
• Parallelism becomes Θ(n^(lg 7)/lg²n)
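In recurrence form (same structure as before, with 7 recursive multiplications instead of 8):

\[
T_1(n) = 7\,T_1(n/2) + \Theta(n^2) = \Theta(n^{\lg 7}),
\qquad
T_\infty(n) = T_\infty(n/2) + \Theta(\lg n) = \Theta(\lg^2 n),
\qquad
\text{parallelism} = \Theta\!\left(\frac{n^{\lg 7}}{\lg^2 n}\right).
\]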
Parallel Merge Sort
• Straightforward solution: spawn one of the two recursive sorts; merge serially
• Work: T1(n) = Θ(n lg n)
• Span: T∞(n) = Θ(n)
• Parallelism: Θ(n lg n)/Θ(n) = Θ(lg n)
Not good. Why?
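A sketch of the analysis, assuming the straightforward version spawns one of the two recursive sorts and then merges serially:

\[
T_1(n) = 2\,T_1(n/2) + \Theta(n) = \Theta(n \lg n),
\qquad
T_\infty(n) = T_\infty(n/2) + \underbrace{\Theta(n)}_{\text{serial merge}} = \Theta(n),
\qquad
\text{parallelism} = \Theta(\lg n).
\]

The serial merge keeps the span linear, so the parallelism is only Θ(lg n), which stays tiny even for very large n.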
Parallel Merge Sort
• Idea: parallelize the divide-and-conquer merge
  • Find the middle element x in the first subarray
    • Θ(1) since the array is sorted
  • Use binary search to find the index in the second subarray that splits it around x
    • details in the textbook
    • Θ(lg n); we will not parallelize this
  • Calculate the index of x in the output array and place it there
    • Θ(1)
  • Recursively merge
    • the two sub-arrays to the left of x (spawn)
    • the two sub-arrays to the right of x
Parallel Merge Sort
• Analysis of P-Merge
• Work: T1(n) = Θ(n)   (why?)
• Span: T∞(n) = Θ(lg²n)   (why?)
• Parallelism: Θ(n)/Θ(lg²n) = Θ(n/lg²n)
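The recurrences behind the two "why?" callouts (the standard CLRS analysis); α denotes the fraction of the n elements handled by one recursive merge, and the larger recursive merge gets at most 3n/4 elements because x is the median of the first (larger) subarray:

\[
T_1(n) = T_1(\alpha n) + T_1\bigl((1-\alpha)n\bigr) + \Theta(\lg n) = \Theta(n),
\qquad \tfrac{1}{4} \le \alpha \le \tfrac{3}{4},
\]
\[
T_\infty(n) \;\le\; T_\infty(3n/4) + \Theta(\lg n) \;=\; \Theta(\lg^2 n),
\qquad
\text{parallelism} = \Theta\!\left(\frac{n}{\lg^2 n}\right).
\]

The work stays linear because the Θ(lg n) binary searches add only a lower-order term at each level of the recursion.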
Parallel Merge Sort
• Putting it all together
Parallel Merge Sort
• Analysis of P-Merge-Sort
• Work: T1(n) = Θ(n lg n)
• Span: T∞(n) = Θ(lg³n)
• Parallelism: Θ(n lg n)/Θ(lg³n) = Θ(n/lg²n)
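The final recurrences, plugging in the P-Merge bounds for the merge step:

\[
T_1(n) = 2\,T_1(n/2) + \Theta(n) = \Theta(n \lg n),
\qquad
T_\infty(n) = T_\infty(n/2) + \Theta(\lg^2 n) = \Theta(\lg^3 n),
\]
\[
\text{parallelism} = \Theta\!\left(\frac{n \lg n}{\lg^3 n}\right) = \Theta\!\left(\frac{n}{\lg^2 n}\right).
\]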
Thanks!